On Measures of Entropy and Information

Tech. Note 009 v0.7 http://threeplusone.com/info

Gavin E. Crooks 2018-09-22

Contents

0 Notes on notation and nomenclature
1 Entropy
    Entropy · Joint entropy · Marginal entropy · Conditional entropy
2 Mutual information
    Mutual information · Multivariate mutual information · Interaction information · Conditional mutual information · Binding information · Residual entropy · Total correlation · Lautum information · Uncertainty coefficient
3 Relative entropy
    Relative entropy · Cross entropy · Burg entropy · Relative joint entropy · Relative conditional entropy · Relative mutual information · Relative conditional mutual information · Relative relative entropy · Jeffreys entropy · Jensen-Shannon divergence · General Jensen-Shannon divergence · Jensen-Shannon entropy · Resistor-average entropy
4 Rényi information
    Rényi information · Collision entropy · Min-entropy · Hartley entropy · Tsallis information · Sharma-Mittal information
5 Csiszár f-divergences
    Csiszár f-divergence · Dual f-divergence · Symmetric f-divergences · K-divergence · Fidelity · Hellinger discrimination · Pearson divergence · Neyman divergence · LeCam discrimination · Skewed K-divergence · Alpha-Jensen-Shannon entropy
6 Chernoff divergence
    Chernoff divergence · Chernoff coefficient · Rényi divergence · Alpha-divergence · Cressie-Read divergence · Tsallis divergence · Sharma-Mittal divergence
7 Distances
    Variational distance · Total variational distance · Euclidean distance · Minkowski distance · Chebyshev distance · LeCam distance · Hellinger distance · Jensen-Shannon distance · Bhattacharyya distance
8 Specific information
    Specific entropy · Specific mutual information
Back Matter
    Acknowledgments · Version history · Bibliography

0 Notes on notation and nomenclature

Information measures An information measure is, broadly speaking, any function of one or more probability distributions. We'll specifically reserve the term entropy for measures that have units of entropy — negative logarithms of probabilities, or a linear combination of the same. The units of entropy depend on the base of the logarithm: for instance bits ("binary digits"), nats ("natural units"), or bans, for bases 2, e, or 10 respectively. Herein we'll use natural logarithms. Note that, by convention, 0 ln 0 = 0. In many cases, tradition doth dictate that an entropic measure be referred to as an information, e.g. "mutual information" rather than "mutual entropy". But we'll eschew the not uncommon practice of using the term "entropy" to refer to generic (non-entropic) information measures.

Ensembles We'll use upper case letters for ensembles, A, B, C, etc. An ensemble (or probability space) consists of a complete and mutually exclusive set of propositions, {A_a}, indexed by the sample space a ∈ Ω_A. The probability of a proposition is written as P(A_a), and the conditional probability as P(A_a | B_b). For instance, A_a could represent the propositions {A = a}, where A is a random variable. Any sum is implicitly over the entire relevant sample space. When different ensembles have different sample spaces, we'll use corresponding lower case letters for samples, a for A, b for B, and so on. But when ensembles share the same sample space, we'll use some other lower case letter, typically x, y, z. For instance, compare the definitions of mutual information (4) (different sample spaces) with relative entropy (14) (same sample spaces). We will not employ the common shorthand of labeling distributions by the sample alone, e.g. P(x) for P({A = x}), precisely because we need to deal with multiple distributions with the same sample space in the same expression.

Collections of ensembles Since we have used subscripts to index propositions, we'll use superscripts to index collections of ensembles, 𝓐 = {A¹, A², ..., A^|𝓐|}. We never need to take powers of ensembles or propositions, so there is little risk of ambiguity. Other useful notation includes P(𝓐), the power set of 𝓐 (the set of all subsets); |𝓐|, the set cardinality; ∅ for the empty set; and 𝓐 \ A, the set complement (difference) of A.

Naming and notating Information measures are given CamelCased function names, unless an unambiguous short function name is in common usage. Commas between ensembles (or propositions) denote conjugation (logical and); a colon ':' the mutual information (4) "between" ensembles; a double bar '∥' the relative entropy (14) of one ensemble "relative to" another; a semicolon ';' any other comparison of ensembles; and a bar '|' denotes conditioning ("given"). We'll use the operator precedence (high to low) ',', ':', '|', '∥', ';' to obviate excessive bracketing [1]. Sample spaces are the same on either side of double bars '∥' or semicolons ';', but different across bars '|' and colons ':'. Measures are symmetric to interchange of the ensembles across commas and colons, i.e. S(A, B, C) is the same as S(C, B, A) and I(A : B : C) is the same as I(B : C : A).

Table 1: Information symbology

    symbol   usage              commutative   precedence
    ,        conjugation        yes           1 (highest)
    :        mutual             yes           2
    |        conditional        no            3
    ∥        relative entropy   no            4
    ;        divergence         no            5 (lowest)

Dissimilarity An information-theoretic divergence is a measure of dissimilarity between a pair of ensembles that is non-negative and zero if (and only if) the distributions are identical. Since divergences are not symmetric to their arguments in general, we can also define the dual divergence d*(A; B) = d(B; A). We'll refer to a symmetric divergence as a discrimination.¹ By distance we mean a metric distance: a measure that is non-negative; symmetric; zero if (and only if) the distributions are identical (reflective); and obeys the triangle inequality, d(A; B) + d(B; C) ⩾ d(A; C).

Information diagrams Information diagrams (see Figs. 1, 2, 3 and 4) are a graphical display of multivariate Shannon information measures [2, 3]. These are not Venn diagrams per se, since the individual regions can have positive or negative weight. The regions of an information diagram corresponding to a particular information measure can be deduced by mapping joint distributions 'A, B' to the union of sets A ∪ B, mutual measures 'A : B' to the intersection of sets A ∩ B, and conditional 'A | B' to the set complement A \ B. For instance, the conditional mutual information I(A, B : C | D) corresponds to the region ((A ∪ B) ∩ C) \ D.

[Figure: four-variable information diagram for ensembles A, B, C, D, shading the region ((A ∪ B) ∩ C) \ D, the conditional mutual information I(A, B : C | D).]

¹The terms disparity, discrimination, and divergence are used essentially interchangeably as synonyms of dissimilarity. The useful distinction that a discrimination is a symmetric divergence appears to be novel, but consistent with practical usage in much of the literature.

1 Entropy

Entropy (Shannon entropy, Gibbs entropy) A measure of the inherent uncertainty or randomness of a single random variable.

    S(A) := − ∑_a P(A_a) ln P(A_a)        (1)

In the information theory literature the entropy is typically denoted by the symbol H, a notation that dates back to Boltzmann and his H-theorem [4], and adopted by Shannon [5]. The notation S is due to Clausius and the original discovery of entropy in thermodynamics [6], and adopted by Gibbs [7] for use in statistical mechanics. I tend to use S since I care about the physics of information, and the symbol H is oft needed to denote the Hamiltonian.

Entropy is occasionally referred to as the self-information, since entropy is equal to the mutual information between a distribution and itself, S(A) = I(A : A). This is distinct from the specific entropy (57), which is also sometimes referred to as the self-information.

Entropies are non-negative and bounded.

    0 ⩽ S(A) ⩽ ln |Ω_A|

Table 2: Units of entropy

    unit               value                        note
    deciban            (1/10) log₂ 10 ≈ 0.33 bits   tenth of a ban
    bit (shannon)      1 bit                        binary digit
    nat (nit, nepit)   log₂ e ≈ 1.44 bits           natural digit
    trit               log₂ 3 ≈ 1.6 bits            ternary digit
    quad               2 bits
    ban (hartley)      log₂ 10 ≈ 3.32 bits          decimal digit
    nibble (nybble)    4 bits                       half a byte
    byte               8 bits

Joint entropy Given a joint probability distribution P(A, B), the joint entropy is

    S(A, B) := − ∑_{a,b} P(A_a, B_b) ln P(A_a, B_b)        (2)

The joint entropy can be readily generalized to any number of variables.

    S(A¹, A², ..., Aⁿ) = − ∑_{a₁,a₂,...,aₙ} P(A¹_{a₁}, A²_{a₂}, ..., Aⁿ_{aₙ}) ln P(A¹_{a₁}, A²_{a₂}, ..., Aⁿ_{aₙ})

Marginal entropy The entropy of a marginal distribution. Thus S(A), S(B), S(C), S(A, B), S(B, C) and S(A, C) are all marginal entropies of the joint entropy S(A, B, C).

Conditional entropy (or equivocation) [5, 8] Measures how uncertain we are of A on the average when we know B.

    S(A | B) := − ∑_b P(B_b) ∑_a P(A_a | B_b) ln P(A_a | B_b)        (3)

The conditional entropy is non-negative, since it is the expectation of non-negative entropies.

The chain rule for entropies [5, 8] expands conditional entropy as a Shannon information measure.

    S(A, B) = S(A | B) + S(B)

This follows from the probability chain rule, P(A_a, B_b) = P(A_a | B_b) P(B_b).

Subadditivity of entropy: Since the mutual information (4) is non-negative, conditioning always reduces entropy, S(A | B) ⩽ S(A). This implies that entropy is subadditive: the joint entropy is less than the sum of the individual entropies (with equality only if A and B are independent).

    S(A, B) ⩽ S(A) + S(B)

Bayes' rule for probabilities is the relation P(A_a | B_b) = P(B_b | A_a) P(A_a) / P(B_b). In entropic terms the equivalent statement is (taking logarithms and averaging)

    S(A | B) = S(B | A) + S(A) − S(B)

2 Mutual information

Mutual information (mutual entropy, transinformation)² [5, 8]

    I(A : B) := ∑_{a,b} P(A_a, B_b) ln [ P(A_a, B_b) / ( P(A_a) P(B_b) ) ]        (4)

Mutual information is oft notated with a semicolon rather than a colon [5, 10]. Herein we reserve the semicolon for divergences.

²"I didn't like the term Information Theory. Claude didn't like it either. You see, the term 'information theory' suggests that it is a theory about information — but it's not. It's the transmission of information, not information. Lots of people just didn't understand this... I coined the term 'mutual information' to avoid such nonsense: making the point that information is always about something. It is information provided by something, about something." – Robert Fano [9]
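As a concrete companion to Eqs. (1)–(4), the following minimal Python/NumPy sketch computes the entropy, conditional entropy, and mutual information of a small discrete joint distribution. The helper names and the example distribution are ours, not part of this note; probabilities are assumed normalized and the results are in nats.

    # Minimal sketch: Shannon entropy, conditional entropy, and mutual
    # information of a discrete joint distribution pab[a, b] (in nats).
    import numpy as np

    def entropy(p):
        """S(A) = -sum_a p(a) ln p(a), with the convention 0 ln 0 = 0."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def conditional_entropy(pab):
        """S(A | B) = S(A, B) - S(B)."""
        return entropy(pab) - entropy(pab.sum(axis=0))

    def mutual_information(pab):
        """I(A : B) = S(A) + S(B) - S(A, B)."""
        pa = pab.sum(axis=1)          # marginal P(A)
        pb = pab.sum(axis=0)          # marginal P(B)
        return entropy(pa) + entropy(pb) - entropy(pab)

    pab = np.array([[0.4, 0.1],
                    [0.1, 0.4]])      # example (hypothetical) joint distribution
    print(entropy(pab), conditional_entropy(pab), mutual_information(pab))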


Mutual information is the reduction in uncertainty of A due to the knowledge of B, or vice versa.

    I(A : B) = S(A) − S(A | B)
             = S(B) − S(B | A)
             = S(A) + S(B) − S(A, B)
             = S(A, B) − S(A | B) − S(B | A)

Mutual information is non-negative.

    0 ⩽ I(A : B)
    S(A, B) ⩽ S(A) + S(B)

This also implies that entropy is subadditive (second line above): the joint entropy of two systems is less than, or equal to, the sum of the marginal entropies. The mutual information is zero if (and only if) A and B are independent (written A ⊥⊥ B), such that P(A_a, B_b) = P(A_a) P(B_b). An easy proof is to note that the mutual information can be written as a relative entropy (14).

The mutual information of an ensemble with itself is the entropy (which is why entropy is occasionally called the self-information).

    I(A : A) = S(A)

Multivariate mutual information (co-information) [10, 11, 12, 13, 14, 3]: A multivariate generalization of the mutual information. Given a collection of probability ensembles, 𝓐 = {A¹, A², ..., A^|𝓐|}, the multivariate mutual information is equal to an alternating signed sum of all the marginal entropies.

    I(A¹ : A² : ··· : A^|𝓐|) := − ∑_{𝓑 ∈ P(𝓐)} (−1)^|𝓑| S(B¹, B², ..., B^|𝓑|)        (5)

Here P(𝓐) is the power set of 𝓐 (the set of all subsets), and |𝓐| is the set cardinality. Note that there are conflicting sign conventions in the literature: the multivariate mutual information is sometimes defined with the opposite sign for odd cardinalities (see interaction information (7)). The single variable case is equal to the entropy, I(A) = S(A), the binary case is equal to the standard mutual information (4), and the ternary case is

    I(A : B : C) := ∑_{a,b,c} P(A_a, B_b, C_c) ln [ P(A_a, B_b) P(A_a, C_c) P(B_b, C_c) / ( P(A_a, B_b, C_c) P(A_a) P(B_b) P(C_c) ) ]        (6)

For three or more variables the mutual information can be positive, negative, or zero, whereas for one or two variables the mutual information is non-negative. For zero variables the mutual information is zero, I(∅) = 0.

[Figure 1: Two-variable information diagrams [2, 3], showing the regions S(∅), S(A), S(B), S(A, B), I(A : B), S(A | B), S(B | A), and the residual entropy.]
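The alternating sum in Eq. (5) is straightforward to evaluate numerically. The sketch below (our own helper, under the assumption that the joint distribution is supplied as an n-dimensional NumPy array) computes the co-information; for two variables it reduces to the ordinary mutual information (4).

    # Sketch: co-information, Eq. (5), as an alternating sum of the entropies
    # of all marginal distributions of an n-dimensional joint distribution.
    from itertools import combinations
    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def co_information(joint):
        n = joint.ndim
        total = 0.0
        for k in range(1, n + 1):                      # subsets of size k
            for subset in combinations(range(n), k):
                drop = tuple(i for i in range(n) if i not in subset)
                marginal = joint.sum(axis=drop) if drop else joint
                total -= (-1) ** k * entropy(marginal)
        return total

    # For two variables this is the ordinary mutual information.
    pab = np.array([[0.4, 0.1], [0.1, 0.4]])
    print(co_information(pab))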


The mutual information defines a partitioning of the total, multivariate joint entropy into single variable, binary, ternary, and higher order shared entropies.

    S(A) = I(A)
    S(A, B) = I(A) + I(B) − I(A : B)
    S(A, B, C) = I(A) + I(B) + I(C) − I(A : B) − I(B : C) − I(A : C) + I(A : B : C)

Or generally,

    S(A¹, A², ..., A^|𝓐|) = − ∑_{𝓑 ∈ P(𝓐)} (−1)^|𝓑| I(B¹ : ··· : B^|𝓑|)

The triplet interaction information is the information that a pair of variables provides about the third, compared to the information that each provides separately [10, 11, 15].

    I(A : B : C) = I(A : B) + I(A : C) − I(A : B, C)

The multivariate self-information is equal to the entropy for any cardinality.

    I(A : A : A : ··· : A) = S(A)

Interaction information (synergy, mutual information) [10]: An alternative sign convention for multivariate mutual information. The interaction information is equal in magnitude to the multivariate information, but has the opposite sign for an odd number of ensembles.

    Int(A¹ : A² : ··· : Aⁿ) := (−1)ⁿ I(A¹ : A² : ··· : Aⁿ)        (7)

The sign convention used above for multivariate information generally makes more sense.

Conditional mutual information [12] The average mutual information between A and B given C.

    I(A : B | C) := ∑_c P(C_c) ∑_{a,b} P(A_a, B_b | C_c) ln [ P(A_a, B_b | C_c) / ( P(A_a | C_c) P(B_b | C_c) ) ]        (8)

[Figure 2: Three-variable information diagrams [2, 3], showing the regions I(A : B | C), I(A : B, C), I(A : B : C), the binding information, and the residual entropy. Note that the central region I(A : B : C) can be positive or negative (these are not Venn diagrams) and that regions are shaded in proportion to their multiplicity, e.g. total correlation.]
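For concreteness, a small sketch of Eq. (8): rather than summing the definition directly, it uses the equivalent combination of joint entropies, I(A : B | C) = S(A, C) + S(B, C) − S(A, B, C) − S(C). The function names are ours, and the example joint distribution is hypothetical.

    # Sketch: conditional mutual information I(A : B | C) from a joint
    # distribution pabc[a, b, c], via entropies of its marginals.
    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def conditional_mutual_information(pabc):
        sac = entropy(pabc.sum(axis=1))        # S(A, C)
        sbc = entropy(pabc.sum(axis=0))        # S(B, C)
        sc = entropy(pabc.sum(axis=(0, 1)))    # S(C)
        sabc = entropy(pabc)                   # S(A, B, C)
        return sac + sbc - sabc - sc

    pabc = np.full((2, 2, 2), 1 / 8)           # independent uniform bits: I = 0
    print(conditional_mutual_information(pabc))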


The conditional mutual information can be expanded in terms of conditional entropies.

    I(A : B | C) = S(A | C) − S(A | B, C)
                 = S(B | C) − S(B | A, C)

Strong subadditivity: The conditional mutual information is non-negative. If the conditional mutual information I(A : B | C) is zero, then A and B are conditionally independent given C (written A ⊥⊥ B | C). Conversely, conditional independence implies that the conditional mutual information is zero.

    A ⊥⊥ B | C  ⟺  I(A : B | C) = 0

Conditional independence implies that P(A_a, B_b | C_c) = P(A_a | C_c) P(B_b | C_c) for all a, b, c.

The chain rule for mutual information is

    I(A : B, C) = I(A : C) + I(A : B | C)

The data processing inequality states that if A and C are conditionally independent, given B (as happens when you have a Markov chain A → B → C), then

    I(A : B) ⩾ I(A : C)   given   I(A : C | B) = 0

Proof: I(A : B) − I(A : C) = I(A : B | C) − I(A : C | B), but I(A : C | B) is zero, and I(A : B | C) is non-negative.

Binding information (dual total correlation) [16, 3, 17]

    Binding(𝓐) := S(𝓐) − ∑_{A ∈ 𝓐} S(A | 𝓐 \ A)        (9)

Here 𝓐 \ A is the set complement of A.

    Binding(A : B) = I(A : B)
    Binding(A : B : C) = I(A, B : A, C : B, C)
    Binding(A : B : C : D) = I(A, B, C : A, B, D : A, C, D : B, C, D)

Residual entropy (erasure entropy, independent information, variation of information, shared information distance) [18, 19, 20, 3]

    Residual(𝓐) := ∑_{A ∈ 𝓐} S(A | 𝓐 \ A)        (10)
                 = S(𝓐) − Binding(𝓐)

Measures the total amount of randomness localized to individual variables.

Total correlation (multi-information, multivariate constraint, redundancy) [10, 11, 21, 13]

    TotalCorr(A¹, A², ..., Aⁿ) := S(A¹) + S(A²) + ··· + S(Aⁿ) − S(A¹, A², ..., Aⁿ)        (11)

The total amount of information carried by correlations between the variables. Quantifies the total correlation or redundancy. Equal to the mutual information when n = 2. The independence bound on entropy states that the total correlation is non-negative.

Lautum information [22]:

    Lautum(A; B) := ∑_{a,b} P(A_a) P(B_b) ln [ P(A_a) P(B_b) / P(A_a, B_b) ]        (12)

Much like the mutual information, but with the roles of the joint and marginal product distributions swapped. (Lautum is mutual spelled backwards.)
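The measures of this section are all simple combinations of marginal and joint entropies, as the following hedged NumPy sketch illustrates for the total correlation (11), residual entropy (10), and binding information (9). The helper names are ours; the joint distribution is assumed to be an n-dimensional array.

    # Sketch: total correlation, residual entropy, and binding information
    # of a joint distribution supplied as an n-dimensional array.
    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def total_correlation(joint):
        marginals = [joint.sum(axis=tuple(j for j in range(joint.ndim) if j != i))
                     for i in range(joint.ndim)]
        return sum(entropy(m) for m in marginals) - entropy(joint)

    def residual_entropy(joint):
        s_joint = entropy(joint)
        # S(A_i | rest) = S(joint) - S(rest); summing over i gives Eq. (10)
        return sum(s_joint - entropy(joint.sum(axis=i)) for i in range(joint.ndim))

    def binding_information(joint):
        return entropy(joint) - residual_entropy(joint)

    pabc = np.full((2, 2, 2), 1 / 8)   # independent bits: all three are zero
    print(total_correlation(pabc), residual_entropy(pabc), binding_information(pabc))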


Uncertainty coefficient (relative mutual information) [23]³

    UncertaintyCoeff(A; B) := I(A : B) / S(A) = 1 − S(A | B) / S(A)        (13)

Given B, the fraction of the information we can predict about A.

3 Relative entropy

Relative entropy (Kullback-Leibler divergence⁴, KL-divergence, KL-distance, Kullback information, information gain, logarithmic divergence, information divergence)⁵ [24, 8]

    D(A ∥ B) := ∑_x P(A_x) ln [ P(A_x) / P(B_x) ]        (14)

Roughly speaking, the relative entropy measures the difference between two distributions, although it is not a metric since it is not symmetric [D(A ∥ B) ≠ D(B ∥ A) in general], nor does it obey the triangle inequality. Note that the two distributions must have the same sample space, and that we take as convention that 0 ln 0 = 0.

One interpretation of relative entropy is that it represents an encoding cost [8]: if we encode messages using an optimal code for a probability distribution P(B_x) of messages x, but the messages actually arrive with probabilities P(A_x), then each message requires, on average, an additional D(A ∥ B) nats to encode compared to the optimal encoding.

The mutual information (4) is the relative entropy between the joint and marginal product distributions. Let the random variables (Â, B̂) be independent, but with the same marginals as (A, B), i.e. P(Â, B̂) = P(A) P(B). Then

    I(A : B) = D(A, B ∥ Â, B̂)

Similarly, for three or more variables, the relative entropy between the joint and marginal product distributions is the total correlation (11).

    TotalCorr(A¹, A², ..., Aⁿ) = D(A¹, A², ..., Aⁿ ∥ Â¹, Â², ..., Âⁿ)

The Lautum information (12) is

    Lautum(A; B) = D(Â, B̂ ∥ A, B)

Cross entropy (inaccuracy) [25, 26]:

    CrossEnt(A; B) := − ∑_x P(A_x) ln P(B_x)        (15)
                    = S(A) + D(A ∥ B)

The cross entropy measures the average number of bits needed to identify events that occur with probability P(A_x), if a coding scheme is used that is optimal for the probability distribution P(B_x).

Burg entropy [27]:

    Burg(B) := ∑_x ln P(B_x)        (16)

Proportional to the cross entropy with a uniform source distribution.

Relative joint entropy

    D(A, B ∥ A′, B′) := ∑_{x,y} P(A_x, B_y) ln [ P(A_x, B_y) / P(A′_x, B′_y) ]        (17)

We can generalize any Shannon information measure to a relative Shannon information measure by combining appropriate linear combinations of relative joint entropies⁶.

Monotonicity of relative entropy: The relative joint entropy is always greater than or equal to the marginal relative entropy.

    D(A, B ∥ A′, B′) ⩾ D(B ∥ B′)

This follows because the relative conditional entropy D(A | B ∥ A′ | B′) = D(A, B ∥ A′, B′) − D(B ∥ B′) is non-negative.

Relative conditional entropy (conditional relative entropy, conditional Kullback-Leibler divergence) [8]

    D(A | B ∥ A′ | B′) := ∑_{x,y} P(A_x, B_y) ln [ P(A_x | B_y) / P(A′_x | B′_y) ]        (18)

The relative conditional entropy is non-negative (proof via the log-sum inequality).

The chain rule for relative entropy is

    D(A, B ∥ A′, B′) = D(A | B ∥ A′ | B′) + D(B ∥ B′)

³Despite misinformation to the contrary, this uncertainty coefficient is not related to Theil's U-statistic.
⁴Kullback-Leibler divergence is probably the most common terminology, which is often denoted D_KL and verbalized as "dee-kay-ell". I've chosen to use the more descriptive "relative entropy" partially so that we can more easily talk about the generalization of relative entropy to other relative Shannon measures.
⁵Note that our notation for relative entropy is uncommon. Following [8], many authors instead directly supply the distributions as arguments, e.g. D(p(x) ∥ q(x)).
⁶The reason for extending the standard notation to this larger class of relative Shannon information measures is that many such measures turn up naturally, for instance when considering the non-equilibrium thermodynamics of strongly coupled systems [28, 29].
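A minimal sketch of the relative entropy (14) and cross entropy (15) for two distributions on the same sample space (function names ours; zero-probability terms are handled by the convention 0 ln 0 = 0):

    # Sketch: relative entropy and cross entropy, in nats.
    import numpy as np

    def relative_entropy(p, q):
        """D(A || B) = sum_x p(x) ln(p(x)/q(x))."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                   # terms with p(x) = 0 contribute zero
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    def cross_entropy(p, q):
        """CrossEnt(A; B) = -sum_x p(x) ln q(x) = S(A) + D(A || B)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return -np.sum(p[mask] * np.log(q[mask]))

    p = np.array([0.5, 0.4, 0.1])      # example distributions (hypothetical)
    q = np.array([0.3, 0.3, 0.4])
    print(relative_entropy(p, q), cross_entropy(p, q))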


[Figure 3: Components of four-variable information diagrams, for ensembles A, B, C, D: S(∅); the conditional entropies S(A | B, C, D), S(B | A, C, D), S(C | A, B, D), S(D | A, B, C); the conditional mutual informations I(A : B | C, D), I(A : C | B, D), I(A : D | B, C), I(B : C | A, D), I(B : D | A, C), I(C : D | A, B); the conditional co-informations I(A : B : C | D), I(A : B : D | C), I(A : C : D | B), I(B : C : D | A); and I(A : B : C : D).]


[Figure 4: Information diagrams for the four-variable mutual informations of ensembles A, B, C, D: S(∅); S(A), S(B), S(C), S(D); I(A : B), I(A : C), I(A : D), I(B : C), I(B : D), I(C : D); I(A : B : C), I(A : B : D), I(A : C : D), I(B : C : D); and I(A : B : C : D).]


Relative mutual information [1]

    D(A : B ∥ A′ : B′) := ∑_{x,y} P(A_x, B_y) ln [ P(A_x, B_y) P(A′_x) P(B′_y) / ( P(A′_x, B′_y) P(A_x) P(B_y) ) ]        (19)

The relative joint entropy can be split up into various components analogously to the joint entropy.

    D(A, B ∥ A′, B′) = D(A ∥ A′) + D(B ∥ B′) + D(A : B ∥ A′ : B′)
                     = D(A | B ∥ A′ | B′) + D(B ∥ B′)
                     = D(A | B ∥ A′ | B′) + D(B | A ∥ B′ | A′) − D(A : B ∥ A′ : B′)

If the reference distributions are independent, the relative mutual information is equal to the mutual information between the principal distributions.

    D(A : B ∥ A′ : B′) = I(A : B)   if A′ ⊥⊥ B′

Note that the uncertainty coefficient (13) is also sometimes called the relative mutual information.

Relative conditional mutual information [1]

    D(A : B | C ∥ A′ : B′ | C′) := ∑_z P(C_z) ∑_{x,y} P(A_x, B_y | C_z) ln [ P(A_x, B_y | C_z) P(A′_x | C′_z) P(B′_y | C′_z) / ( P(A′_x, B′_y | C′_z) P(A_x | C_z) P(B_y | C_z) ) ]        (20)

We could continue this insanity by generalizing the conditional mutual relative entropy to many variables.

Relative relative entropy [1]

    D( (A ∥ B) ∥ (C ∥ D) ) := ∑_x P(A_x) ln [ P(A_x) P(D_x) / ( P(B_x) P(C_x) ) ]        (21)

The relative operation can be applied to relative entropy itself, leading to the recursively defined relative relative entropy. As an example, the relative mutual information can also be expressed as a relative relative entropy (just as the mutual information can be expressed as a relative entropy).

    D(A : B ∥ A′ : B′) = D( (A, B ∥ Â, B̂) ∥ (A′, B′ ∥ Â′, B̂′) )

Jeffreys entropy The Jeffreys entropy (Jeffreys divergence, J-divergence, or symmetrized Kullback-Leibler divergence) [30, 24] is a symmetrized relative entropy (14).

    Jeffreys(A; B) := ½ D(A ∥ B) + ½ D(B ∥ A)        (22)
                    = ½ ∑_x P(A_x) ln [ P(A_x) / P(B_x) ] + ½ ∑_x P(B_x) ln [ P(B_x) / P(A_x) ]
                    = ½ ∑_x ( P(A_x) − P(B_x) ) ln [ P(A_x) / P(B_x) ]

This measure is symmetric and non-negative, but not a metric since it does not obey the triangle inequality. The Jeffreys entropy is a symmetric f-divergence (33), C_f(A; B) with f(t) = ½(t − 1) ln t.

Note that the Jeffreys divergence is sometimes defined as D(A ∥ B) + D(B ∥ A), which is twice the value of the definition used here.

Jensen-Shannon divergence (Jensen-Shannon entropy, Jensen difference, information radius, capacitory discrimination⁷) is the mean relative entropy between the two distributions and the distribution mean [32, 31]

    JS(A; B) := ½ ∑_x P(A_x) ln [ P(A_x) / ( ½ P(A_x) + ½ P(B_x) ) ] + ½ ∑_x P(B_x) ln [ P(B_x) / ( ½ P(A_x) + ½ P(B_x) ) ]        (23)
              = ½ D(A ∥ M) + ½ D(B ∥ M)
              = S(M) − ½ S(A) − ½ S(B),

where

    P(M_x) = ½ P(A_x) + ½ P(B_x)

One interpretation of the Jensen-Shannon entropy is in terms of a Bayesian inference problem [33]: given a sample taken from one of two probability distributions, the Jensen-Shannon entropy is the average information the sample provides about the identity of the distribution. The divergence is equal to zero only if the two distributions are identical, and therefore indistinguishable, and reaches its maximum value of ln 2 nats (i.e. 1 bit) if the two distributions do not overlap and therefore are perfectly distinguishable from a single sample.

The Jeffreys and Jensen-Shannon entropies are related by the inequalities [32]

    0 ⩽ JS(A; B) ⩽ ½ Jeffreys(A; B).

⁷Capacitory discrimination is defined as twice the Jensen-Shannon divergence by Topsøe [31].
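As an illustration, a short NumPy sketch (names ours) of the Jeffreys entropy (22) and the Jensen-Shannon divergence (23); for non-overlapping distributions the latter evaluates to its maximum, ln 2 nats.

    # Sketch: Jeffreys entropy and Jensen-Shannon divergence.
    import numpy as np

    def relative_entropy(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    def jeffreys(p, q):
        return 0.5 * relative_entropy(p, q) + 0.5 * relative_entropy(q, p)

    def jensen_shannon(p, q):
        m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
        return 0.5 * relative_entropy(p, m) + 0.5 * relative_entropy(q, m)

    p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(jensen_shannon(p, q))                      # ln 2: perfectly distinguishable
    print(jeffreys(np.array([0.6, 0.4]), np.array([0.4, 0.6])))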


General Jensen-Shannon divergence (skewed Jensen-Shannon divergence)

    JS_α(A; B) := (1 − α) D(A ∥ M) + α D(B ∥ M),        (24)
    P(M) = (1 − α) P(A) + α P(B).

    JS_0(A; B) = D(A ∥ B)
    JS_{1/2}(A; B) = JS(A; B)
    JS_1(A; B) = D(B ∥ A)

Jensen-Shannon entropy (generalized Jensen-Shannon divergence) [34, 33]

    JS_Θ(A¹; A²; ··· ; Aⁿ) := ∑_{α=1}^{n} P(Θ_α) D(A^α ∥ M),        (25)
    P(M_x) = ∑_α P(Θ_α) P(A^α_x).

The entropy of a mixed distribution is the average entropy of the components plus the Jensen-Shannon entropy [31]:

    S(M) = JS_Θ(A¹; A²; ··· ; Aⁿ) + ∑_α P(Θ_α) S(A^α)

The multivariate Jensen-Shannon entropy is the mutual information between the mixing ensemble Θ and the mixed ensemble M.

    I(Θ : M) := ∑_{α,x} P(Θ_α, M_x) ln [ P(Θ_α, M_x) / ( P(Θ_α) P(M_x) ) ]
              = − ∑_x P(M_x) ln P(M_x) + ∑_{α,x} P(Θ_α) P(M_x | Θ_α) ln P(M_x | Θ_α)
              = S(M) − ∑_α P(Θ_α) S(A^α)
              = JS_Θ(A¹; A²; ··· ; Aⁿ)

Resistor-average entropy [35]:

    ResAvg(A; B) := 1 / ( 1/D(A ∥ B) + 1/D(B ∥ A) )        (26)

The harmonic mean of forward and reversed relative entropies.

4 Rényi information

Rényi information (Rényi entropy, alpha-order entropy) [36]: A one-parameter generalization of the Shannon entropy.

    Rényi_α(A) := 1/(1 − α) ln ∑_x P(A_x)^α        (27)

Interesting special cases of the Rényi information include the Hartley entropy (α = 0), collision entropy (α = 2), Shannon entropy (α = 1), and min-entropy (α = ∞). See also: Rényi divergence (45).

Collision entropy (Rényi information of order 2, second order entropy) [37]

    CollisionEntropy(A) := − ln ∑_x P(A_x)²        (28)
                         = Rényi_2(A)

A special case of the Rényi information. The negative log probability (57) that two independent samples from the distribution are the same.

Min-entropy [36]

    MinEntropy(A) := − ln max_x P(A_x)        (29)

Hartley entropy (Hartley function, max-entropy, Boltzmann entropy) [38]: The logarithm of the number of distinct possibilities.

    Hartley(A) := ln |Ω_A|        (30)

The maximum entropy for a given cardinality. Coincides with the entropy for a uniform distribution.

Tsallis information (Havrda-Charvát information, α order information) [39, 40, 41]

    Tsallis_α(A) := 1/(α − 1) ( 1 − ∑_x P(A_x)^α )        (31)
                  = 1/(1 − α) [ e^{(1−α) Rényi_α(A)} − 1 ]

Sharma-Mittal information [42, 43]

    SharmaMittal_{α,β}(A) := 1/(β − 1) ( 1 − ( ∑_x P(A_x)^α )^{(1−β)/(1−α)} )        (32)

Assuming suitable limits are taken, the Sharma-Mittal information contains the Shannon, Rényi and Tsallis informations as special cases.

    SharmaMittal_{1,1}(A) = S(A)
    SharmaMittal_{α,1}(A) = Rényi_α(A)
    SharmaMittal_{α,α}(A) = Tsallis_α(A)
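A brief sketch (ours) of the Rényi (27), Tsallis (31), and min- (29) entropies; the α → 1 case falls back to the Shannon entropy, and α = 2 gives the collision entropy (28).

    # Sketch: Renyi, Tsallis, and min-entropies of a discrete distribution (nats).
    import numpy as np

    def renyi_entropy(p, alpha):
        p = np.asarray(p, float)
        p = p[p > 0]
        if np.isclose(alpha, 1.0):                 # Shannon limit
            return -np.sum(p * np.log(p))
        return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

    def tsallis_entropy(p, alpha):
        p = np.asarray(p, float)
        p = p[p > 0]
        return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

    def min_entropy(p):
        return -np.log(np.max(np.asarray(p, float)))

    p = np.array([0.5, 0.25, 0.25])
    print(renyi_entropy(p, 2))                     # collision entropy
    print(tsallis_entropy(p, 2), min_entropy(p))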


5 Csiszár f-divergences

Csiszár f-divergence Many interesting divergence measures between probability distributions can be written as (or related to) an f-divergence (also known as a Csiszár⁸, Csiszár-Morimoto, Ali-Silvey, or ϕ-divergence) [44, 45, 46, 47, 48, 35].

    C_f(A; B) := ∑_x P(A_x) f( P(B_x) / P(A_x) ),        (33)

where the function f is convex ⌣ and f(1) = 0. This implies C_f(A; B) ⩾ 0 from an application of Jensen's inequality. Examples already encountered include the relative, Jeffreys, and Jensen-Shannon entropies (see Table 3). Note that the first argument to the f-divergence is the distribution to be averaged over. The opposite convention also occurs⁹.

Convex functions are closed under conical combinations: the function f(x) = c₁f₁(x) + c₂f₂(x) + ... + c_N f_N(x) is convex if each function of the mixture f_n is convex and each constant c_n is non-negative, c_n ⩾ 0. It follows that a positive linear sum of f-divergences is also an f-divergence [49].

Dual f-divergence [49] The dual of an f-divergence is defined by swapping the arguments.

    C*_f(A; B) := C_f(B; A) = C_{f*}(A; B)        (34)

Here f* is the Csiszár dual [49] of the function f, f*(x) = x f(1/x). For instance, if f(x) = − ln(x) then the f-divergence is the relative entropy, C_f(A; B) = D(A ∥ B), with dual function f*(x) = x ln(x), and dual divergence the dual (or reverse) relative entropy, C_{f*}(A; B) = D(B ∥ A).

Symmetric f-divergences [0] Many instances of the f-divergence are symmetric under interchange of the two distributions,

    C_f(A; B) = C_f(B; A).

This implies that f(x) = x f(1/x).

There are two common methods for symmetrizing an asymmetric f-divergence: the Jeffreys symmetrization, where we average over interchanged distributions,

    C_f(A; B) = ½ C_g(A; B) + ½ C_g(B; A)
    f(t) = ½ g(t) + ½ t g(t⁻¹)

and the Jensen symmetrization, where we take the average divergence of both distributions to the average distribution.

    C_f(A; B) = ½ C_h(A; M) + ½ C_h(B; M)
    P(M_x) = ½ P(A_x) + ½ P(B_x)
    f(t) = ½ h(½ + ½ t) + ½ t h(½ + ½ t⁻¹)

We include the halves in these definitions so that symmetric f-divergences are invariant under symmetrization. We call these the Jeffreys and Jensen symmetrizations respectively [1], because a Jeffreys symmetrization of the relative entropy (14) leads to the Jeffreys entropy (22), and a Jensen symmetrization gives the Jensen-Shannon entropy (23).

K-divergence [32]

    KayDiv(A; B) := ∑_x P(A_x) ln [ P(A_x) / ( ½ P(A_x) + ½ P(B_x) ) ]        (35)
                  = D(A ∥ M),   P(M_x) = ½ P(A_x) + ½ P(B_x)
                  = C_f(A; B),   f(t) = ln [ 2 / (1 + t) ].

The K-divergence is a lower bound to the relative entropy [32],

    KayDiv(A; B) ⩽ ½ D(A ∥ B)

and the Jeffreys symmetrization of the K-divergence is the Jensen-Shannon entropy.

    JS(A; B) = ½ KayDiv(A; B) + ½ KayDiv(B; A)
             = ½ D(A ∥ M) + ½ D(B ∥ M)

Fidelity (Bhattacharyya coefficient, Hellinger affinity) [50] The Bhattacharyya distance (56) and the Hellinger divergence and distance (37) are functions of the fidelity. The name derives from usage in quantum information theory [51].

    Fidelity(A; B) := ∑_x √( P(A_x) P(B_x) )        (36)

The range is [0, 1], with unity only if the two distributions are identical. Fidelity is not itself an f-divergence (the required function f(t) = √t isn't convex), but is directly related to the Hellinger divergence (37) and Bhattacharyya distance (56).

⁸Pronounced che-sar.
⁹I, myself, have used the opposite convention on other occasions, but, on reflection, this way around makes more sense.
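Many of the divergences in this section and the next are instances of Eq. (33) with a particular convex generator, so a single routine suffices. The sketch below (ours; it assumes the two distributions share support, so the p(x) = 0 terms can simply be dropped) evaluates C_f for the generators of the relative entropy, Pearson divergence, and Hellinger discrimination.

    # Sketch: a generic Csiszar f-divergence, C_f(A; B) = sum_x p(x) f(q(x)/p(x)).
    import numpy as np

    def f_divergence(p, q, f):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                   # assumes matching support
        t = q[mask] / p[mask]
        return np.sum(p[mask] * f(t))

    f_kl = lambda t: -np.log(t)                # relative entropy, f(t) = -ln t
    f_pearson = lambda t: (t - 1.0) ** 2       # Pearson divergence
    f_hellinger = lambda t: 1.0 - np.sqrt(t)   # Hellinger discrimination

    p = np.array([0.5, 0.4, 0.1])
    q = np.array([0.3, 0.3, 0.4])
    for f in (f_kl, f_pearson, f_hellinger):
        print(f_divergence(p, q, f))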


Table 3: Csiszár f-divergences (§5)

Asymmetric f-divergences           f(t)                           f*(t)

(14) relative entropy              − ln t                         t ln t
(35) K-divergence                  ln [2/(1 + t)]                 t ln [2t/(1 + t)]
(38) Pearson divergence            (t − 1)²                       (1 − t)²/t
(47) Cressie-Read divergence       (t^{−α} − 1)/(α(α + 1))        ···
(48) Tsallis divergence            (t^{1−α} − 1)/(α − 1)          ···

Symmetric f-divergences            f(t)                                        Jeffreys g(t)    Jensen h(t)

(40) LeCam discrimination          (t − 1)²/(2(t + 1))                         ···              ···
(22) Jeffreys entropy              ½(t − 1) ln t                               − ln t           ···
(23) Jensen-Shannon entropy        ½ ln [2/(1 + t)] + ½ t ln [2t/(1 + t)]      ln [2/(1 + t)]   − ln t
(50) variational distance          ½ |t − 1|                                   ···              ···
(37) Hellinger discrimination      1 − √t                                      ···              ···


Hellinger discrimination (squared Hellinger distance, infidelity) [52]

    HellingerDiv(A; B) := ½ ∑_x ( √P(A_x) − √P(B_x) )²        (37)
                        = ∑_x P(A_x) ( 1 − √( P(B_x)/P(A_x) ) )
                        = C_f(A; B),   f(t) = 1 − √t
                        = 1 − ∑_x √( P(A_x) P(B_x) )
                        = 1 − Fidelity(A; B)

Symmetric, with range [0, 1]. A common alternative normalization omits the one-half prefactor. The name originates from that of the corresponding integral in the continuous case [53].

Pearson divergence (χ²-divergence, chi square divergence, Pearson chi square divergence, Kagan divergence, quadratic divergence, least squares) [54]

    Pearson(A; B) := ∑_x ( P(B_x) − P(A_x) )² / P(A_x)        (38)
                   = ∑_x P(A_x) ( P(B_x)/P(A_x) − 1 )²
                   = C_f(A; B),   f(t) = (t − 1)²

Neyman divergence (inverse Pearson chi square divergence) [55, 56]

    Neyman(A; B) := Pearson(B; A)        (39)

The dual of the Pearson divergence (arguments switched).

LeCam discrimination (Vincze-LeCam divergence, triangular discrimination) [57, 58, 31]

    LeCam(A; B) := ½ ∑_x ( P(A_x) − P(B_x) )² / ( P(A_x) + P(B_x) )        (40)
                 = C_f(A; B),   f(t) = (t − 1)² / ( 2(t + 1) ).

The LeCam divergence is a Jensen-symmetrized Pearson divergence, averaged with respect to the mixture distribution:

    LeCam(A; B) = ½ Pearson(M; A) + ½ Pearson(M; B),   P(M_x) = ½ P(A_x) + ½ P(B_x)

Skewed K-divergence [32]

    KayDiv_α(A; B) := ∑_x P(A_x) ln [ P(A_x) / ( (1 − α) P(A_x) + α P(B_x) ) ]        (41)
                    = D(A ∥ M),   P(M) = (1 − α) P(A) + α P(B)

Alpha-Jensen-Shannon entropy [32, 59]

    AlphaJS_α(A; B) := ½ KayDiv_α(A; B) + ½ KayDiv_α(B; A)        (42)

    AlphaJS_0(A; B) = 0
    AlphaJS_{1/2}(A; B) = JS(A; B)
    AlphaJS_1(A; B) = Jeffreys(A; B)

The Jeffreys symmetrization of the skewed K-divergence.

6 Chernoff divergence

Chernoff divergence The Chernoff divergence [60, 61] of order α is defined as

    Chernoff_α(A; B) := − ln ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}        (43)
                      = − ln [ C_f(A; B) + 1 ],   f(t) = t^{1−α} − 1.

The Chernoff divergence is zero for α = 1 and α = 0, and reaches a maximum, the Chernoff information [60, 8], for some intermediate value of alpha. The Chernoff divergence is well defined for α > 1 if P(B_x) > 0 whenever P(A_x) > 0, and for α < 0 if P(A_x) > 0 whenever P(B_x) > 0, and thus defined for all α if the distributions have the same support.

The Chernoff divergence of order α is related to the Chernoff divergence of order 1 − α with the distributions interchanged [61],

    Chernoff_α(A; B) = Chernoff_{1−α}(B; A).

This relation always holds for α ∈ [0, 1], and for all α when the distributions have the same support.

Chernoff coefficient (alpha divergence) [60, 61, 62]

    ChernoffCoefficient_α(A; B) := ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}        (44)
                                 = exp( − Chernoff_α(A; B) )
                                 = C_f(A; B) + 1,   f(t) = t^{1−α} − 1

The exponential twist density [63] is a way of mixing two distributions [64] to form a third,

    P(C_x) = (1/Z_α) P(A_x)^α P(B_x)^{1−α}

Here α is a mixing parameter between 0 and 1. The normalization constant Z_α is the Chernoff coefficient, Z_α = ChernoffCoefficient_α(A; B) [65].

Rényi divergence The Rényi divergence (or relative Rényi entropy) of order α is a one-parameter generalization of the relative entropy [36],

    Rényi_α(A; B) := 1/(α − 1) ln ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}        (45)
                   = 1/(1 − α) Chernoff_α(A; B)
                   = 1/(α − 1) ln [ C_f(A; B) + 1 ],   f(t) = t^{1−α} − 1.

Higher values of α give a Rényi divergence dominated by the greatest ratio between the two distributions, whereas as α approaches zero the Rényi divergence weighs all possibilities more equally, regardless of their dissimilarities. We recover the relative entropy in the limit α → 1.

Interesting special cases of the Rényi divergence occur for α = 0, ½, 1 and ∞. As previously mentioned, α = 1 gives the relative entropy (14), and α = ½ gives the Bhattacharyya distance (56). In the limit α → 0, the Rényi divergence slides to the negative logarithm of the probability, under B, that A has non-zero probability,

    lim_{α→0} Rényi_α(A; B) = − ln lim_{α→0} ∑_x P(A_x)^α P(B_x)^{1−α}
                            = − ln ∑_x P(B_x) [P(A_x) > 0].

Here we have used the Iverson bracket, [a], which evaluates to 1 if the condition inside the bracket is true, and 0 otherwise. If the two distributions have the same support then the α → 0 Rényi divergence is zero.

Alpha-divergence [0, 0] In the most widespread parameterization,

    D_α(A; B) := 1/( α(1 − α) ) ( 1 − ∑_x P(A_x)^α P(B_x)^{1−α} )        (46)

The alpha-divergence is self-dual,

    D_α(A; B) = D_{1−α}(B; A)

Special cases include the Neyman and Pearson divergences, the Hellinger discrimination, and the relative entropy.

    D_α(A; B) = ½ Pearson(A; B)          α = −1
                D(B ∥ A)                 lim α → 0
                4 HellingerDiv(A; B)     α = +½
                D(A ∥ B)                 lim α → +1
                ½ Neyman(A; B)           α = +2

Another common parameterization is [0, 0]

    D_{α′}(A; B) := 4/(1 − α′²) ( 1 − ∑_x P(A_x)^{(1−α′)/2} P(B_x)^{(1+α′)/2} )

where α′ = 1 − 2α. This parameterization has the advantage that duality corresponds to negating the parameter.

    D_{+α′}(A; B) = D_{−α′}(B; A)

Cressie-Read divergence [56]

    CressieRead_α(A; B) := 1/( α(α + 1) ) ∑_x P(A_x) [ ( P(A_x)/P(B_x) )^α − 1 ]        (47)
                         = 1/( α(α + 1) ) [ e^{α Rényi_{α+1}(A;B)} − 1 ]
                         = 1/(α + 1) Tsallis_{α+1}(A; B)
                         = C_f(A; B),   f(t) = ( t^{−α} − 1 ) / ( α(α + 1) ).

Tsallis divergence (relative Tsallis entropy) [66] Other closely related divergences include the relative Tsallis entropy,

    Tsallis_α(A; B) := 1/(α − 1) ∑_x P(A_x) [ ( P(A_x)/P(B_x) )^{α−1} − 1 ]        (48)
                     = 1/(α − 1) [ e^{(α−1) Rényi_α(A;B)} − 1 ]
                     = C_f(A; B),   f(t) = ( t^{1−α} − 1 ) / ( α − 1 ).

Sharma-Mittal divergence [42, 43]

    SharmaMittal_{α,β}(A; B) := 1/(β − 1) [ ( ∑_x P(A_x)^α P(B_x)^{1−α} )^{(1−β)/(1−α)} − 1 ]        (49)
                              = 1/(β − 1) [ ChernoffCoefficient_α(A; B)^{(1−β)/(1−α)} − 1 ],

    α > 0,  α ≠ 1,  β ≠ 1

The Sharma-Mittal divergence encompasses the Cressie-Read divergence (β = 1 − α(α + 1)), Rényi divergence (β → 1), Tsallis divergence (β → α), and the relative entropy (α, β → 1).
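A small sketch (ours) tying together the Chernoff coefficient (44) and the Rényi divergence (45); it assumes the two distributions have the same support, and the final line checks that the order-½ Chernoff divergence reproduces the Bhattacharyya distance (56).

    # Sketch: Chernoff coefficient and Renyi divergence of order alpha.
    import numpy as np

    def chernoff_coefficient(p, q, alpha):
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sum(p ** alpha * q ** (1.0 - alpha))

    def renyi_divergence(p, q, alpha):
        if np.isclose(alpha, 1.0):                 # relative-entropy limit
            p, q = np.asarray(p, float), np.asarray(q, float)
            mask = p > 0
            return np.sum(p[mask] * np.log(p[mask] / q[mask]))
        return np.log(chernoff_coefficient(p, q, alpha)) / (alpha - 1.0)

    p = np.array([0.5, 0.4, 0.1])
    q = np.array([0.3, 0.3, 0.4])
    # Order-1/2 Renyi divergence, and the Bhattacharyya distance for comparison.
    print(renyi_divergence(p, q, 0.5), -np.log(chernoff_coefficient(p, q, 0.5)))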


7 Distances

Variational distance (L1 distance, variational divergence)

    V(A; B) = L₁(A; B) := ½ ∑_x | P(B_x) − P(A_x) |        (50)
            = C_f(A; B),   f(t) = ½ |t − 1|

The only f-divergence (33) which is also a metric [67]. Pinsker's inequality:

    D(A ∥ B) ⩾ ½ V(A; B)²

Total variational distance [0] The largest possible difference between the probabilities that the two distributions can assign to the same event. Equal to twice the variational distance.

Euclidean distance (L2 distance)

    L₂(A; B) := √( ∑_x ( P(B_x) − P(A_x) )² )        (51)

It is sometimes useful to treat a probability distribution as a vector in a Euclidean vector space, and therefore consider Euclidean distances between probability distributions.

Minkowski distance

    L_p(A; B) := [ ∑_x | P(B_x) − P(A_x) |^p ]^{1/p}        (52)

A metric distance provided that p ⩾ 1.

Chebyshev distance

    L_∞(A; B) := max_x | P(B_x) − P(A_x) |        (53)

LeCam distance The square root of the LeCam divergence (40) [68].

    LeCamDist(A; B) := √( LeCam(A; B) )        (54)

Hellinger distance The square root of the Hellinger divergence.

    Hellinger(A; B) := √( HellingerDiv(A; B) )        (55)

Jensen-Shannon distance The square root of the Jensen-Shannon divergence, and a metric between probability distributions [33, 69].

Bhattacharyya distance [70] The Chernoff divergence of order ½.

    Bhattacharyya(A; B) := − ln ∑_x √( P(A_x) P(B_x) )        (56)
                         = Chernoff_{1/2}(A; B)
                         = − ln Fidelity(A; B).

8 Specific information

A specific (or point-wise, or local) entropy is the entropy associated with a single event, as opposed to the average entropy over the entire ensemble [12].

A common convention is to use lower cased function names for specific information measures: s for S, i for I. However, since the expectation of a single proposition is equal to the value associated with the event, we can also write S(A_x) := s(A_x), and I(A_x : B_y) := i(A_x : B_y). With this notation we can express point-wise measures corresponding to all the ensemble measures defined previously, without having to create a host of new notation. We can also write multivariate specific informations averaged over only one ensemble, not both, e.g. I(A : B_x) [25].

Specific entropy (self-information, score, surprise, or surprisal) [71, 72] is the negative logarithm of a probability.

    s(A_x) := − ln P(A_x)        (57)

The expectation of the specific entropy is the Shannon entropy, the average point-wise entropy of the ensemble (1).

    S(A) = E[ s(A) ] = ∑_x P(A_x) s(A_x)

where the expectation operator E is

    E[ f(A) ] = ∑_a P(A_a) f(A_a)

Specific mutual information (point-wise mutual information) [25, 12, 73]

    i(A_x : B_y) := ln [ P(A_x, B_y) / ( P(A_x) P(B_y) ) ]        (58)

Similarly, the expectation of the specific mutual information is the mutual information of the ensemble (4).

    I(A : B) = E[ i(A : B) ]
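Finally, a hedged sketch (function names ours) of three of the metrics of Section 7 — the variational, Hellinger, and Jensen-Shannon distances:

    # Sketch: variational, Hellinger, and Jensen-Shannon distances.
    import numpy as np

    def variational_distance(p, q):
        return 0.5 * np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

    def hellinger_distance(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    def jensen_shannon_distance(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = 0.5 * (p + q)
        def d(a, b):                        # relative entropy in nats
            mask = a > 0
            return np.sum(a[mask] * np.log(a[mask] / b[mask]))
        return np.sqrt(0.5 * d(p, m) + 0.5 * d(q, m))

    p, q = np.array([0.5, 0.4, 0.1]), np.array([0.3, 0.3, 0.4])
    print(variational_distance(p, q), hellinger_distance(p, q),
          jensen_shannon_distance(p, q))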


Acknowledgments

Useful sources and pedagogical expositions include Cover and Thomas's classic Elements of Information Theory [8], the inestimable (for some value of truthiness) Wikipedia, and various other reviews [74, 3]. Notation and nomenclature was adapted from Good (1956) [25]. The title of this essay was shamelessly lifted from Rényi's paper on a similar subject [36].

This review is inevitably incomplete, inaccurate and otherwise imperfect — caveat emptor.

Version history

0.7 (2018-09-22) Added units of entropy. Corrected typos and citations. (Kudos: Glenn Davis, Susanna Still)
0.6 (2017-05-07) Added alpha-divergence (46) and Burg entropy (16). Corrected typos. (Kudos: Glenn Davis)
0.5 (2016-08-16) Extensive miscellaneous improvements.
0.4 (2016-06-22) Added Jensen-Shannon divergence; alpha Jensen-Shannon entropy; skewed K-divergence; dual divergence; and total variational distance. Reordered operator precedence. Expanded and refined collection of relative Shannon entropies. Extensive miscellaneous improvements.
0.3 (2016-04-21) Added residual entropy (10) (and combined with independent information, which is essentially another name for the same thing), and uncertainty coefficient (13). Added information diagrams. Miscellaneous minor improvements. Fixed sign error in lautum information (12). (Kudos: Ryan James)
0.2 (2015-06-26) Improved presentation and fixed miscellaneous minor errors.
0.1 (2015-01-22) Initial release: over 50 information measures described.

References

(Recursive citations mark neologisms and other innovations [1].)

[0] [citation needed].
[1] Gavin E. Crooks. On measures of entropy and information. Technical report, Tech. Note 009 v0.7 (2017). http://threeplusone.com/info.
[2] Raymond W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. Inf. Theory, 37(3):466–474 (1991). doi:10.1109/18.79902.
[3] Ryan G. James, Christopher J. Ellison, and James P. Crutchfield. Anatomy of a bit: Information in a time series observation. Chaos, 21(3):037109 (2011). doi:10.1063/1.3637494.
[4] Ludwig Boltzmann. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen (Further studies on the thermal equilibrium of gas molecules). Sitzungsberichte Akad. Wiss., Vienna, part II, 66:275–370 (1872).
[5] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 623–656 (1948). doi:10.1002/j.1538-7305.1948.tb01338.x.
[6] Rudolf Clausius. Ueber verschiedene für die Anwendung bequeme Formen der Hauptgleichungen der mechanischen Wärmetheorie. Annalen der Physik und Chemie, 201(7):353–400 (1865). doi:10.1002/andp.18652010702.
[7] J. Willard Gibbs. Elementary principles in statistical mechanics. Charles Scribner's Sons, New York (1902).
[8] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley, New York, 2nd edition (2006).
[9] O. Aftab, P. Cheung, A. Kim, S. Thakknar, and N. Yeddanapudi. Information theory and the digital revolution (2001). http://web.mit.edu/6.933/www/Fall2001/Shannon2.pdf.
[10] William J. McGill. Multivariate information transmission. Psychometrika, 19(2):97–116 (1954). doi:10.1007/BF02289159.
[11] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM J. Res. Develop., 4(1):66–82 (1960). doi:10.1147/rd.41.0066.
[12] Robert Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA (1961).
[13] Anthony J. Bell. The co-information lattice. In Proc. 4th Int. Symp. Independent Component Analysis and Blind Source Separation (2003).
[14] Yaneer Bar-Yam. Multiscale complexity/entropy. Adv. in Complex Sys., 7:47–63 (2004). doi:10.1142/S0219525904000068.
[15] Elad Schneidman, Susanne Still, Michael J. Berry, II, and William Bialek. Network information and connected correlations. Phys. Rev. Lett., 91:238701 (2003). doi:10.1103/PhysRevLett.91.238701.
[16] Te Sun Han. Nonnegative entropy measures of multivariate symmetric correlations. Inf. Control, 36:133–156 (1978). doi:10.1016/S0019-9958(78)90275-9.
[17] Samer A. Abdallah and Mark D. Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Phys. Lett. A, 376(4):275–281 (2012). doi:10.1016/j.physleta.2011.10.066.
[18] Phipps Arabie and Scott A. Boorman. Multidimensional scaling of measures of distance between partitions. J. Math. Psy., 10:148–203 (1973). doi:10.1016/0022-2496(73)90012-6.
[19] James P. Crutchfield. Information and its metric. In L. Lam and H. C. Morris, editors, Nonlinear Structures in Physical Systems – Pattern Formation, Chaos and Waves, pages 119–130. Springer-Verlag, New York (1990).
[20] Sergio Verdú and Tsachy Weissman. The information lost in erasures. IEEE Trans. Inf. Theory, 54:5030–5058 (2008). doi:10.1109/TIT.2008.929968.
[21] Te Sun Han. Multiple mutual information and multiple interactions in frequency data. Inf. Control, 46:26–45 (1980). doi:10.1016/S0019-9958(80)90478-7.


[22] Daniel P. Palomar and Sergio Verdú. Lautum information. IEEE Trans. Inf. Theory, 54(3):964–975 (2008).
[23] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York, 3rd edition (2007).
[24] Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86 (1951). doi:10.1214/aoms/1177729694.
[25] Irving J. Good. Some terminology and notation in information theory. Proc. Inst. Elec. Eng. C, 103(3):200–204 (1956). doi:10.1049/pi-c.1956.0024.
[26] David F. Kerridge. Inaccuracy and inference. J. Roy. Statist. Soc. Sec. B, 23:184–194 (1961). http://www.jstor.org/stable/2983856.
[27] John Parker Burg. The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics, 37:375–376 (1972). doi:10.1190/1.1440265.
[28] Giovanni Diana and Massimiliano Esposito. Mutual entropy production in bipartite systems. J. Stat. Mech.: Theor. Exp., (4):P04010 (2014). doi:10.1088/1742-5468/2014/04/P04010.
[29] Gavin E. Crooks and Susanne Still. Marginal and conditional second laws of thermodynamics (2016). arXiv:1611.04628.
[30] Harold Jeffreys. Theory of probability. Clarendon Press, Oxford, 2nd edition (1948).
[31] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory, 46(4):1602–1609 (2000). doi:10.1109/18.850703.
[32] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory, 37:145–151 (1991). doi:10.1109/18.61115.
[33] Dominik M. Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Trans. Inf. Theory, 49(7):1858–1860 (2003). doi:10.1109/TIT.2003.813506.
[34] R. G. Gallager. Information Theory and Reliable Communication. Wiley and Sons, New York (1968).
[35] Don H. Johnson and Sinan Sinanović. Symmetrizing the Kullback-Leibler distance. Technical report, Rice Univ., Houston, TX (2001). http://www.ece.rice.edu/~dhj/resistor.pdf.
[36] Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematics, statistics and probability, volume 1, pages 547–561 (1960). http://projecteuclid.org/euclid.bsmsp/1200512181.
[37] Charles H. Bennett, François Bessette, Gilles Brassard, Louis Salvail, and John Smolin. Experimental quantum cryptography. J. Cryptology, 5:3–28 (1992). doi:10.1007/BF00191318.
[38] Ralph V. L. Hartley. Transmission of information. Bell Syst. Tech. J., 7:535–563 (1928). doi:10.1002/j.1538-7305.1928.tb01236.x.
[39] Jan Havrda and František Charvát. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika, 3:30–35 (1967). http://dml.cz/dmlcz/125526.
[40] Zoltán Daróczy. Generalized information functions. Inf. Control, 16(1):36–51 (1970). doi:10.1016/S0019-9958(70)80040-7.
[41] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52:479–487 (1988). doi:10.1007/BF01016429.
[42] Bhu Dev Sharma and D. P. Mittal. New non-additive measures of entropy for discrete probability distributions. J. Math. Sci. (Soc. Math. Sci., Delhi, India), 10:28–40 (1975).
[43] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. J. Phys. A, 45(3):032003 (2012). doi:10.1088/1751-8113/45/3/032003.
[44] S. M. Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B, 28(1):131–142 (1966).
[45] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299–318 (1967).
[46] Tetsuzo Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jpn, 18:328–331 (1963). doi:10.1143/JPSJ.18.328.
[47] Inder J. Taneja. Refinement inequalities among symmetric divergence measures. Aust. J. Math. Anal. Appl., 2(1):8 (2005).
[48] Alexander N. Gorban, Pavel A. Gorban, and George Judge. Entropy: The Markov ordering approach. Entropy, 12:1145–1193 (2010). doi:10.3390/e12051145.
[49] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. J. Mach. Learn. Res., 12:731–817 (2011). http://jmlr.csail.mit.edu/papers/v12/reid11a.html.
[50] Anil K. Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā, 7:401–406 (1946).
[51] Richard Jozsa. Fidelity for mixed quantum states. J. Mod. Opt., 41:2315–2323 (1994). doi:10.1080/09500349414552171.
[52] Shizuo Kakutani. On equivalence of infinite product measures. Ann. Math., 49(1):214–224 (1948). https://www.jstor.org/stable/1969123.
[53] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math., 136:210–271 (1909). doi:10.1515/crll.1909.136.210.
[54] Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Ser. 5, 50(302):157–175 (1900). doi:10.1080/14786440009463897.


[55] Jerzy Neyman. Contribution to the theory of the χ² test. In Proceedings of the [first] Berkeley symposium on mathematics, statistics and probability, pages 239–273. Univ. of Calif. Press (1949). http://projecteuclid.org/euclid.bsmsp/1166219208.
[56] Noel Cressie and Timothy R. C. Read. Multinomial goodness of fit. J. Roy. Statist. Soc. B, 46:440–464 (1984). http://www.jstor.org/stable/2345686.
[57] István Vincze. On the concept and measure of information contained in an observation. In J. Gani and V. F. Rohatgi, editors, Contributions to Probability: A Collection of Papers Dedicated to Eugene Lukacs, pages 207–214. Academic Press, New York (1981).
[58] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, Berlin (1986).
[59] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. arXiv:1009.4004.
[60] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23(4):493–507 (1952). doi:10.1214/aoms/1177729330.
[61] Yoshihide Kakizawa, Robert H. Shumway, and Masanobu Taniguchi. Discrimination and clustering for multivariate time series. J. Amer. Statist. Assoc., 93(441):328–340 (1998).
[62] Shun'ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Society (2000).
[63] James A. Bucklew. Large deviation techniques in decision, simulation, and estimation. Wiley (1990).
[64] Anand G. Dabak and Don H. Johnson. Relations between Kullback-Leibler distance and Fisher information (2001). http://www.ece.rice.edu/~dhj/distance.pdf.
[65] John C. Baez. Rényi entropy and free energy. arXiv:1102.2098.
[66] Lisa Borland, Angel R. Plastino, and Constantino Tsallis. Information gain within nonextensive thermostatistics. J. Math. Phys., 39(12):6490–6501 (1998). doi:10.1063/1.532660. Erratum: 40:2196 (1999).
[67] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Trans. Fund. Elec. Comm. Comp. Sci., E90-A:1848–1853 (2007). doi:10.1093/ietfec/e90-a.9.1848.

[68] P. Kafka, Ferdinand Österreicher, and I. Vincze. On powers of f-divergences defining a distance. Studia Sci. Math. Hungar., 26:415–422 (1991).
[69] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its statistical applications. Ann. Inst. Statist. Math., 55(3):639–653 (2003). doi:10.1007/BF02517812.
[70] Anil K. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35(1):99–109 (1943).
[71] Edward W. Samson. Fundamental natural concepts of information theory. ETC: A Review of General Semantics, 10(4):283–297 (1953).
[72] Myron Tribus. Thermostatics and thermodynamics. D. Van Nostrand Company (1961).
[73] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information and lexicography. Comp. Ling., 16(1):22–29 (1990).
[74] Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions. Int. J. Math. Mod. Meth. Appl. Sci., 1(4):300–307 (2007). http://www.naun.org/main/NAUN/ijmmas/mmmas-49.pdf.

Copyright © 2015-2018 Gavin E. Crooks

http://threeplusone.com/info
typeset on 2018-09-22 with XeTeX version 0.99999
fonts: Trump Mediaeval (text), Euler (math)
