
Justifications of Shannon Entropy and Mutual Information in Statistical Inference

EE378A Lecture Notes, Spring 2017, Stanford University
April 19, 2017

Let X = {x_1, x_2, ..., x_n} be a finite set with |X| = n. Let Γ_n denote the set of probability measures on X, and let R̄ denote the extended real line. We recall the definition of the logarithmic loss as follows.

Definition 1 (Logarithmic loss). The logarithmic loss ℓ_log : X × Γ_n → R̄ is defined by

    ℓ_log(x, P) = ln (1 / P(x)),    (1)

where P(x) denotes the probability of x under measure P.

In this lecture, we go over some elegant results justifying the use of the logarithmic loss in statistical inference. Recall the following lemma, which was introduced in Lecture 2.

Lemma 1. The true distribution minimizes the expected logarithmic loss:

    P = arg min_Q E_P [ln (1 / Q(X))],    (2)

and

    H(P) = min_Q E_P [ln (1 / Q(X))].    (3)

Proof. The result follows from the observation that for any Q ∈ Γ_n,

    E_P [ln (1/Q(X))] − E_P [ln (1/P(X))] = E_P [ln (P(X)/Q(X))]    (4)
        = Σ_{x∈X} P(x) ln (P(x)/Q(x))    (5)
        = Σ_{x∈X} P(x) ( − ln (Q(x)/P(x)) )    (6)
        ≥ − ln ( Σ_{x∈X} P(x) · Q(x)/P(x) )    (7)
        = 0,    (8)

where we applied Jensen's inequality to the convex function − ln x on (0, ∞).

Lemma 1 is the foundational result behind the use of the logarithmic loss in statistical inference. A natural question arises: is the logarithmic loss the only loss function with the property that the true distribution minimizes the expected loss? Perhaps surprisingly, the answer is yes under a natural "locality" constraint, i.e., when the loss is allowed to depend on Q only through the value Q(x) it assigns to the realized symbol. The following result is also known as the statement that the logarithmic scoring rule is the unique local proper scoring rule.

Theorem 1 ([1]). Suppose n ≥ 3. Then, the only function f such that for any P,

    P ∈ arg min_Q E_P [f(Q(X))],    (9)

is of the form

    f(p) = c ln (1/p) + b   for all p ∈ (0, 1),    (10)

where b, c ≥ 0 are constants.

Theorem 1 does not hold if n = 2. We refer to [2] for a detailed discussion of the dichotomy between the binary and non-binary alphabets.
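To make Lemma 1 and Theorem 1 concrete, here is a small numerical sketch (illustrative Python, assuming numpy and scipy are available; the three-letter distribution P is an arbitrary choice, not from the notes). It minimizes E_P[f(Q(X))] over Q for two local losses: the logarithmic loss f(p) = ln(1/p) and the non-logarithmic local loss f(p) = 1 − p. Only the former is minimized by the true distribution, consistent with Theorem 1.

```python
import numpy as np
from scipy.optimize import minimize

P = np.array([0.5, 0.3, 0.2])  # true distribution on a 3-letter alphabet

def best_Q(f):
    """Minimize E_P[f(Q(X))] over the probability simplex and return the minimizer."""
    objective = lambda q: float(np.dot(P, f(np.clip(q, 1e-12, 1.0))))
    cons = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
    bounds = [(1e-9, 1.0)] * len(P)
    res = minimize(objective, x0=np.ones_like(P) / len(P), bounds=bounds, constraints=cons)
    return res.x

# Logarithmic loss: the minimizer should be P itself, and the minimum equals H(P) (Lemma 1).
Q_log = best_Q(lambda q: -np.log(q))
print("log-loss minimizer:", np.round(Q_log, 3))            # ~ [0.5, 0.3, 0.2]
print("min expected loss: ", np.dot(P, -np.log(Q_log)))     # ~ H(P)
print("H(P):              ", -np.dot(P, np.log(P)))

# A non-logarithmic local loss f(p) = 1 - p: the optimal Q puts all mass on the mode of P,
# so the true distribution is NOT the minimizer (consistent with Theorem 1).
Q_lin = best_Q(lambda q: 1.0 - q)
print("linear-loss minimizer:", np.round(Q_lin, 3))         # ~ [1, 0, 0]
```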

1 Justification of the Shannon entropy (and the logarithmic loss)

Csiszár [3] provided a survey of axiomatic characterizations of information measures up to 2008. Csiszár stated that "the intuitively most appealing axiomatic result is due to Aczél–Forte–Ng [4]", which characterized the Shannon entropy as a natural measure of uncertainty. We state their result below.

Theorem 2 ([4]). Let K(P) be a general functional of a discrete distribution P. We assume the following axioms:

1. Subadditivity:

    K(P_{XY}) ≤ K(P_X) + K(P_Y);    (11)

2. Additivity: when X and Y are independent,

    K(P_{XY}) = K(P_X) + K(P_Y);    (12)

3. Expansibility:

    K(p_1, p_2, ..., p_m, 0) = K(p_1, p_2, ..., p_m);    (13)

4. Symmetry: for all permutations π,

    K(p_1, p_2, ..., p_m) = K(p_{π(1)}, p_{π(2)}, ..., p_{π(m)});    (14)

5. Normalization:

    K(1/2, 1/2) = 1;    (15)

6. Small for small probabilities:

    lim_{q→0+} K(1 − q, q) = 0.    (16)

Then, the Shannon entropy H(P) = Σ_{x∈X} −P(x) ln P(x) is the only functional satisfying the axioms above.
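As a quick sanity check (not part of the original notes), the Python sketch below draws a random joint distribution P_{XY} and verifies that the Shannon entropy computed in nats satisfies subadditivity, additivity under independence, expansibility, symmetry, and the small-probability axiom; the normalization axiom as stated fixes the unit of measurement (it holds exactly when entropy is measured in bits, i.e., K(1/2, 1/2) = ln 2 nats).

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy in nats, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Random joint distribution P_XY on a 4 x 3 alphabet.
Pxy = rng.random((4, 3))
Pxy /= Pxy.sum()
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

# 1. Subadditivity: H(P_XY) <= H(P_X) + H(P_Y).
assert H(Pxy.ravel()) <= H(Px) + H(Py) + 1e-12

# 2. Additivity for independent X, Y: H(P_X x P_Y) = H(P_X) + H(P_Y).
assert abs(H(np.outer(Px, Py).ravel()) - (H(Px) + H(Py))) < 1e-12

# 3. Expansibility: appending a zero-probability symbol changes nothing.
assert abs(H(np.append(Px, 0.0)) - H(Px)) < 1e-12

# 4. Symmetry: permuting the probabilities changes nothing.
assert abs(H(rng.permutation(Px)) - H(Px)) < 1e-12

# 5. Normalization: K(1/2, 1/2) = 1 bit = ln 2 nats.
assert abs(H([0.5, 0.5]) - np.log(2)) < 1e-12

# 6. Small for small probabilities: H(1 - q, q) -> 0 as q -> 0+.
print([round(H([1 - q, q]), 6) for q in (1e-2, 1e-4, 1e-6)])
```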

2 Justification of mutual information (and the logarithmic loss)

For the rest of the lecture, we present a characterization of mutual information that justifies the use of the logarithmic loss from another perspective. Concretely, we ask the following question: Suppose X and Y are dependent random variables. How relevant is Y for inference on X?

Toward answering this question, let ℓ : X × X̂ → R̄ be an arbitrary loss function, where the reconstruction alphabet X̂ is arbitrary. Given (X, Y) ∼ P_{XY}, it is natural to quantify the benefit of the side information Y by computing the difference between the expected losses in estimating X ∈ X with and without side information Y, respectively. This motivates the following definition:

    C(ℓ, P_{XY}) ≜ inf_{x̂_1 ∈ X̂} E_P[ℓ(X, x̂_1)] − inf_{X̂_2(Y)} E_P[ℓ(X, X̂_2)],    (17)

where x̂_1 ∈ X̂ is deterministic, and X̂_2 = X̂_2(Y) ∈ X̂ is any measurable function of Y. In the following discussion, we require that indeterminate forms like ∞ − ∞ do not appear in the definition of C(ℓ, P_{XY}). By taking Y to be independent of X, this requirement implies that for all P ∈ Γ_n,

    inf_{x̂_1 ∈ X̂} E_P[ℓ(X, x̂_1)] < ∞.    (18)

The formulation (17) has appeared previously in the statistics literature. DeGroot [5] in 1962 defined the information contained in an experiment, which turns out to be equivalent to (17). Later, Dawid [6] defined the coherent dependence function, which is equivalent to (17), and used it to quantify the dependence between two random variables X and Y. Our framework of quantifying the predictive benefit of side information is closely connected to the notion of proper scoring rules and to the literature on probability forecasting in statistics; the survey by Gneiting and Raftery [7] provides a good overview.

Having introduced the yardstick in (17), we now reformulate the question of interest: Which loss function(s) ℓ can be used to define C(ℓ, P_{XY}) in a meaningful way? Of course, "meaningful" is open to interpretation, but it is desirable that C(ℓ, P_{XY}) be well-defined, at minimum. This motivates the following axiom:

Axiom 1 (Data Processing). For all distributions P_{XY}, the quantity C(ℓ, P_{XY}) satisfies

    C(ℓ, P_{TY}) ≤ C(ℓ, P_{XY})

whenever T(X) ∈ X is a statistically sufficient transformation of X for Y.

We remind the reader that the statement "T is a statistically sufficient transformation of X for Y" means that the following two Markov chains hold:

    T − X − Y,    X − T − Y.    (19)

That is, T(X) preserves all of the information X contains about Y. In words, the Data Processing Axiom stipulates that processing the data X → T cannot boost the predictive benefit of the side information. (In fact, the Data Processing Axiom is weaker than the general data-processing principle, since it only addresses statistically sufficient transformations of X.) To convince the reader that the Data Processing Axiom is a natural requirement, suppose instead that it did not hold. Since X and T are mutually sufficient statistics for Y, this would imply that there is no unique value quantifying the benefit of the side information Y for the inference problem of interest. Thus, the Data Processing Axiom is needed for the benefit of side information to be well-defined.

Although the Data Processing Axiom may seem to be a benign requirement, it has far-reaching implications for the form C(ℓ, P_{XY}) can take. This is captured by our first main result:

Theorem 3 ([8]). Let n ≥ 3. Under the Data Processing Axiom, the function C(ℓ, P_{XY}) is uniquely determined by the mutual information,

    C(ℓ, P_{XY}) = I(X; Y),    (20)

up to a multiplicative factor.

We prove Theorem 3 below.
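Before the proof, here is a small numerical illustration of Theorem 3 under the logarithmic loss (a standalone Python sketch, not from the notes; the joint distribution is randomly generated). By Lemma 1, the unconditional infimum in (17) equals H(X) and the conditional one equals H(X|Y), so their difference matches I(X; Y).

```python
import numpy as np

rng = np.random.default_rng(1)

def H(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Random joint distribution P_XY on a 3 x 4 alphabet (rows = x, columns = y).
Pxy = rng.random((3, 4))
Pxy /= Pxy.sum()
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

# Under the log loss the best fixed predictor is Q = P_X (Lemma 1), so the first infimum
# in (17) is H(X); the best predictor given Y = y is Q = P_{X|Y=y}, so the second is H(X|Y).
loss_without_Y = H(Px)
loss_with_Y = sum(Py[y] * H(Pxy[:, y] / Py[y]) for y in range(len(Py)))

C_log = loss_without_Y - loss_with_Y            # definition (17) with the log loss
I_XY = H(Px) + H(Py) - H(Pxy.ravel())           # mutual information I(X;Y) in nats
print(C_log, I_XY)                              # the two values agree
```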

To begin the proof, we show that the measure of relevance defined in (17) is equivalently characterized by a bounded convex function defined on the X-simplex. The following lemma achieves this goal.

Lemma 2. There exists a bounded convex function V : Γ_n → R, depending on ℓ, such that

    C(ℓ, P_{XY}) = ( Σ_y P_Y(y) V(P_{X|Y=y}) ) − V(P_X).    (21)


The proof of Lemma 2 follows from defining V(P) by

    V(P) = − inf_{x̂ ∈ X̂} E_P[ℓ(X, x̂)],    (22)

and its details are deferred to the appendix. In the statistics literature, the quantity −V(P) is usually called the generalized entropy or the Bayes envelope.
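The Bayes-envelope representation is easy to test numerically. The sketch below (illustrative Python, with a randomly generated loss matrix standing in for an arbitrary ℓ on a finite reconstruction alphabet) computes C(ℓ, P_{XY}) directly from the definition (17) and again from the right-hand side of (21) with V as in (22); the two computations agree.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random joint distribution on a 4 x 3 alphabet and a random loss matrix loss[x, xhat].
Pxy = rng.random((4, 3))
Pxy /= Pxy.sum()
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)
loss = rng.random((4, 5))          # 5 candidate reconstructions xhat

def V(P):
    """V(P) = -inf_xhat E_P[loss(X, xhat)], the negative Bayes envelope, cf. (22)."""
    return -np.min(P @ loss)

# Definition (17): best fixed reconstruction vs. best Y-measurable reconstruction.
loss_without_Y = np.min(Px @ loss)
loss_with_Y = sum(Py[y] * np.min((Pxy[:, y] / Py[y]) @ loss) for y in range(len(Py)))
C_direct = loss_without_Y - loss_with_Y

# Representation (21): sum_y P_Y(y) V(P_{X|Y=y}) - V(P_X).
C_envelope = sum(Py[y] * V(Pxy[:, y] / Py[y]) for y in range(len(Py))) - V(Px)

print(C_direct, C_envelope)        # the two values agree
```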

The next lemma asserts that we only need to consider symmetric (permutation-invariant) functions V(P).

Lemma 3. Under the Data Processing Axiom, there exists a symmetric finite convex function G : Γ_n → R such that

    C(ℓ, P_{XY}) = ( Σ_y P_Y(y) G(P_{X|Y=y}) ) − G(P_X),    (23)

and G(·) is equal to V(·) in Lemma 2 up to a linear translation:

    G(P) = V(P) + ⟨c, P⟩,    (24)

where c ∈ R^n is a constant vector.

The proof of Lemma 3 follows by applying a permutation to the space X and invoking the Data Processing Axiom; details are deferred to the appendix.

Now we are in a position to begin the proof of Theorem 3 in earnest. It suffices to consider the case when the side information Y is binary valued, i.e., Y ∈ {1, 2}; we will show that the Data Processing Axiom mandates the use of the logarithmic loss even in this restricted setting. Define α ≜ P(Y = 1). Take P_{λ_1}(t), P_{λ_2}(t) to be two distributions on X parametrized in the following way:

    P_{λ_1}(t) = (λ_1 t, λ_1 (1 − t), r − λ_1, p_4, ..., p_n),    (25)
    P_{λ_2}(t) = (λ_2 t, λ_2 (1 − t), r − λ_2, p_4, ..., p_n),    (26)

where r ≜ 1 − Σ_{i≥4} p_i, t ∈ [0, 1], and 0 ≤ λ_1 < λ_2 ≤ r. Taking P_{X|Y=1} ≜ P_{λ_1}(t) and P_{X|Y=2} ≜ P_{λ_2}(t), it follows from Lemma 2 that

    C(ℓ, P_{XY}) = α V(P_{λ_1}(t)) + (1 − α) V(P_{λ_2}(t)) − V(α P_{λ_1}(t) + (1 − α) P_{λ_2}(t)).    (27)

Note that the following transformation T(X) is a statistically sufficient transformation of X for Y:

    T(X) = x_1  if X ∈ {x_1, x_2},   T(X) = X  otherwise.    (28)

(Indeed, conditioned on X ∈ {x_1, x_2}, the distribution of X over {x_1, x_2} is (t, 1 − t) regardless of Y, so X − T − Y holds; T − X − Y holds trivially since T is a function of X.)

The Data Processing Axiom implies that for all α ∈ [0, 1], t ∈ [0, 1], and legitimate λ_2 > λ_1 ≥ 0,

    α V(P_{λ_1}(t)) + (1 − α) V(P_{λ_2}(t)) − V(α P_{λ_1}(t) + (1 − α) P_{λ_2}(t))
        = α V(P_{λ_1}(1)) + (1 − α) V(P_{λ_2}(1)) − V(α P_{λ_1}(1) + (1 − α) P_{λ_2}(1)).    (29)

We now define the function

    R(λ, t) ≜ V(P_λ(t)),    (30)

where we note that the bivariate function R(λ, t) implicitly depends on the parameters p_4, p_5, ..., p_n, which we shall fix for the rest of this proof. Thus, R(λ, t) = R(λ, t; p_4, p_5, ..., p_n). Note that by definition,

    R(α λ_1 + (1 − α) λ_2, t) = V(α P_{λ_1}(t) + (1 − α) P_{λ_2}(t)),    (31)

hence we know that

    α R(λ_1, t) + (1 − α) R(λ_2, t) − R(α λ_1 + (1 − α) λ_2, t)
        = α R(λ_1, 1) + (1 − α) R(λ_2, 1) − R(α λ_1 + (1 − α) λ_2, 1).    (32)

Take λ_1 = 0 and λ_2 = r = 1 − Σ_{i≥4} p_i. We define R̃(λ, t) ≜ R(λ, t) − λ U(t), where

    U(t) = R(r, t) / r.    (33)

It follows that

    R̃(0, t) = V(P_0(t)),   R̃(r, t) = 0,   ∀ t ∈ [0, 1],    (34)

and we note that V(P_0(t)) in fact does not depend on t. With the help of (34), we plug R(λ, t) = R̃(λ, t) + λ U(t) into (32) and obtain

    R̃((1 − α) r, t) = R̃((1 − α) r, 1),   ∀ α ∈ [0, 1], t ∈ [0, 1].    (35)

In other words, there exists a function E : [0, 1] → R such that

    R̃(λ, t) = E(λ).    (36)

Since R(λ, t) = R̃(λ, t) + λ U(t), we know that there exist real-valued functions E, U (indexed by p_4, ..., p_n) such that

    R(λ, t) = λ U(t) + E(λ).    (37)
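As a concrete sanity check of the decomposition (37) (not part of the original notes), take the log-loss Bayes envelope V(P) = Σ_i p_i ln p_i. The short Python sketch below evaluates R(λ, t) = V(P_λ(t)) on a grid, with the tail probabilities p_4, p_5 chosen arbitrarily, and confirms that R(λ, t) − λ U(t), with U(t) = R(r, t)/r as in (33), does not depend on t.

```python
import numpy as np

# Arbitrarily fixed tail probabilities p_4, ..., p_n and r = 1 - sum_{i>=4} p_i.
p_tail = np.array([0.1, 0.15])
r = 1.0 - p_tail.sum()

def V(P):
    """Log-loss Bayes envelope V(P) = sum_i p_i ln p_i (the negative Shannon entropy)."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]
    return float(np.sum(P * np.log(P)))

def R(lam, t):
    """R(lam, t) = V(P_lam(t)) with P_lam(t) = (lam*t, lam*(1-t), r-lam, p_4, ..., p_n)."""
    return V(np.concatenate(([lam * t, lam * (1 - t), r - lam], p_tail)))

U = lambda t: R(r, t) / r          # definition (33)

# R(lam, t) - lam * U(t) should depend on lam only, per (36)-(37).
for lam in (0.1, 0.3, 0.6):
    vals = [R(lam, t) - lam * U(t) for t in (0.2, 0.5, 0.9)]
    print(lam, np.round(vals, 10))  # identical values across t for each lam
```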

Expressing λ and t in terms of p_1 and p_2, we have

    λ = p_1 + p_2,   t = p_1 / (p_1 + p_2).    (38)

By definition of R(λ, t), we can rewrite (37) as

    V(p_1, p_2, p_3, p_4, ..., p_n) = (p_1 + p_2) U(p_1 / (p_1 + p_2); p_4, ..., p_n)
        + E(p_1 + p_2; p_4, ..., p_n).    (39)

By Lemma 3, we know that there exists a symmetric (permutation-invariant) finite convex function G : Γ_n → R such that

    G(P) = V(P) + ⟨c, P⟩.    (40)

In other words, we have proved that G is of the form

    G(P) = (p_1 + p_2) U(p_1 / (p_1 + p_2); p_4, ..., p_n) + E(p_1 + p_2; p_4, ..., p_n) + ⟨c, P⟩.    (41)

For notational simplicity, we define

    Y(p_1, p_2) ≜ G(P),    (42)

where we again note that Y(p_1, p_2; p_4, ..., p_n) is a bivariate function parameterized by p_4, ..., p_n. This gives

    Y(p_1, p_2) = (p_1 + p_2) U(p_1 / (p_1 + p_2)) + E(p_1 + p_2) + c_1 p_1 + c_2 p_2 + c_3 (r − p_1 − p_2).    (43)

Since G(P) is a symmetric function, we know that if we exchange p_1 and p_3 in G(P), the value of G(P) will not change. In other words, for r = p_1 + p_2 + p_3, we have

    (r − p_3) U(p_1 / (r − p_3)) + E(r − p_3) + c_1 p_1 + c_2 p_2 + c_3 p_3
        = (r − p_1) U(p_3 / (r − p_1)) + E(r − p_1) + c_1 p_3 + c_2 p_2 + c_3 p_1,    (44)

which is equivalent to

    (r − p_3) U(p_1 / (r − p_3)) + E(r − p_3) + (c_3 − c_1) p_3
        = (r − p_1) U(p_3 / (r − p_1)) + E(r − p_1) + (c_3 − c_1) p_1.    (45)

Defining Ẽ(x) ≜ E(r − x) + (c_3 − c_1) x, we have

    (r − p_3) U(p_1 / (r − p_3)) + Ẽ(p_3) = (r − p_1) U(p_3 / (r − p_1)) + Ẽ(p_1).    (46)

Interestingly, we can find the general solutions of the above functional equation, which has connections to the so-called fundamental equation of information theory:

Lemma 4 ([9], [10], [11]). The most general measurable solution of

    f(x) + (1 − x) g(y / (1 − x)) = h(y) + (1 − y) k(x / (1 − y)),    (47)

for x, y ∈ [0, 1) with x + y ∈ [0, 1], where f, h : [0, 1) → R and g, k : [0, 1] → R, has the form

    f(x) = a H_2(x) + b_1 x + d,    (48)
    g(y) = a H_2(y) + b_2 y + b_1 − b_4,    (49)
    h(x) = a H_2(x) + b_3 x + b_1 + b_2 − b_3 − b_4 + d,    (50)
    k(y) = a H_2(y) + b_4 y + b_3 − b_2,    (51)

for x ∈ [0, 1), y ∈ [0, 1], where H_2(x) = −x ln x − (1 − x) ln(1 − x) is the binary Shannon entropy and a, b_1, b_2, b_3, b_4, and d are arbitrary constants.

Remark 1. If f = g = h = k in (48)–(51), the corresponding functional equation is called the 'fundamental equation of information theory'.
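The general solution can be checked mechanically. The sketch below (illustrative Python, with randomly chosen constants) builds f, g, h, k from (48)–(51) and verifies that the functional equation (47) holds on a grid of admissible (x, y).

```python
import numpy as np

rng = np.random.default_rng(3)
a, b1, b2, b3, b4, d = rng.normal(size=6)        # arbitrary constants in (48)-(51)

def H2(x):
    """Binary Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(p * np.log(p) if p > 0 else 0.0 for p in (x, 1.0 - x))

f = lambda x: a * H2(x) + b1 * x + d
g = lambda y: a * H2(y) + b2 * y + b1 - b4
h = lambda x: a * H2(x) + b3 * x + b1 + b2 - b3 - b4 + d
k = lambda y: a * H2(y) + b4 * y + b3 - b2

# Check (47): f(x) + (1-x) g(y/(1-x)) = h(y) + (1-y) k(x/(1-y)) on admissible (x, y).
max_gap = 0.0
for x in np.linspace(0.0, 0.95, 20):
    for y in np.linspace(0.0, 0.95, 20):
        if x + y <= 1.0:
            lhs = f(x) + (1 - x) * g(y / (1 - x))
            rhs = h(y) + (1 - y) * k(x / (1 - y))
            max_gap = max(max_gap, abs(lhs - rhs))
print("max deviation over the grid:", max_gap)   # ~ 1e-15 (floating-point error only)
```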

In order to apply the above lemma to our setting, we define

    q_i = p_i / r,   i = 1, 2, 3,    (52)

and h(x) = Ẽ(rx)/r. Then we know

    (1 − q_3) U(q_1 / (1 − q_3)) + h(q_3) = (1 − q_1) U(q_3 / (1 − q_1)) + h(q_1).    (53)

Applying the general solution of (47), setting f = h and g = k = U, we have

    b_1 = b_3,   b_2 = b_4.    (54)

Thus,

    h(x) = a H_2(x) + b_1 x + d,    (55)

    U(y) = a H_2(y) + b_2 y + b_1 − b_2.    (56)

By the definition of h(x) and Ẽ(x), we have that

    E(x) = r a H_2(x/r) + (b_1 + c_1 − c_3)(r − x) + d.    (57)

Plugging the general solutions for U(x) and E(x) into (43), and redefining the constants, we have

    Y(p_1, p_2) = A ( p_1 ln p_1 + p_2 ln p_2 + (r − p_1 − p_2) ln(r − p_1 − p_2) ) + B p_1 + C p_2 + D.    (58)

Note that the constants A, B, C, D are functions of p_4, ..., p_n. Therefore, we have the following general representation of the symmetric function G(P):

    G(P) = A(p_4, ..., p_n) ( p_1 ln p_1 + p_2 ln p_2 + p_3 ln p_3 )
        + B(p_4, ..., p_n) p_1 + C(p_4, ..., p_n) p_2 + D(p_4, ..., p_n),    (59)

where we have made the dependence on p_4, ..., p_n explicit. Now we utilize the property that G(P) is invariant to permutations. Exchanging p_1 and p_2, we obtain B ≡ C. Exchanging p_1 and p_3, we obtain B ≡ C ≡ 0. Applying an arbitrary permutation to p_4, ..., p_n (recall that p_1, p_2, p_3 still enjoy two degrees of freedom), we see that A(p_4, ..., p_n) and D(p_4, ..., p_n) are symmetric functions. Exchanging p_1 and p_4 and comparing the coefficients of p_2 ln p_2, we know that

    A(p_4, p_5, ..., p_n) = A(p_1, p_5, ..., p_n);    (60)

since A is symmetric, we can conclude that A is a constant. Now exchanging p_1 and p_4 gives us

    A p_1 ln p_1 − A p_4 ln p_4 = D(p_1, p_5, ..., p_n) − D(p_4, p_5, ..., p_n).    (61)

Taking partial derivatives with respect to p_1 (we vary p_2 simultaneously to ensure P still lies on the simplex) on both sides of (61), we obtain

    A (ln p_1 + 1) = ∂D(p_1, p_5, ..., p_n) / ∂p_1.    (62)

Integrating both sides with respect to p_1, we know there exists a function f such that

    D(p_1, p_5, ..., p_n) = A p_1 ln p_1 + f(p_5, ..., p_n).    (63)

Since D is symmetric, we further know that

    D(p_4, ..., p_n) = Σ_{i≥4} A p_i ln p_i.    (64)

To sum up, we have

    G(P) = A Σ_{i=1}^{n} p_i ln p_i.    (65)

To guarantee that G(P) is convex, we need A > 0. Plugging (65) into the representation (23) of Lemma 3 yields C(ℓ, P_{XY}) = A · I(X; Y), and the proof is complete.
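To close the loop with Axiom 1, here is one final illustrative Python sketch (not from the notes; the parameter values are arbitrary choices). For conditional distributions of the form (25)–(26), the merging transformation T of (28) is statistically sufficient, so the mutual information, and hence C up to the factor A, is unchanged by T, whereas merging a different pair of symbols generally loses information.

```python
import numpy as np

def H(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def I(Pxy):
    """Mutual information I(X;Y) of a joint distribution given as a matrix (rows = x)."""
    Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)
    return H(Px) + H(Py) - H(Pxy.ravel())

# Conditionals of the form (25)-(26): P_{X|Y=y} = (lam*t, lam*(1-t), r-lam, p_4, p_5).
p_tail = np.array([0.1, 0.15])
r, t = 1.0 - p_tail.sum(), 0.3
alpha, lam1, lam2 = 0.4, 0.2, 0.6
P_cond = np.array([np.concatenate(([lam * t, lam * (1 - t), r - lam], p_tail))
                   for lam in (lam1, lam2)])               # rows indexed by y
Pxy = (np.array([alpha, 1 - alpha])[:, None] * P_cond).T   # joint, rows indexed by x

def merge(Pxy, i, j):
    """Joint distribution of (T(X), Y) where T merges symbols x_i and x_j (i < j)."""
    merged = np.delete(Pxy, j, axis=0)
    merged[i] += Pxy[j]
    return merged

print(I(Pxy), I(merge(Pxy, 0, 1)))   # merging x_1, x_2 (the sufficient T of (28)): equal
print(I(Pxy), I(merge(Pxy, 1, 2)))   # merging x_2, x_3: strictly smaller in general
```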

References

[1] P. Fischer, "On the inequality Σ p_i f(p_i) ≥ Σ p_i f(q_i)," Metrika, vol. 18, pp. 199–208, 1972.

[2] J. Jiao, T. A. Courtade, A. No, K. Venkat, and T. Weissman, "Information measures: the curious case of the binary alphabet," IEEE Transactions on Information Theory, vol. 60, no. 12, pp. 7616–7626, 2014.

[3] I. Csiszár, "Axiomatic characterizations of information measures," Entropy, vol. 10, no. 3, pp. 261–273, 2008.

[4] J. Aczél, B. Forte, and C. Ng, "Why the Shannon and Hartley entropies are 'natural'," Advances in Applied Probability, pp. 131–146, 1974.

[5] M. H. DeGroot, "Uncertainty, information, and sequential experiments," The Annals of Mathematical Statistics, pp. 404–419, 1962.

[6] A. P. Dawid, "Coherent measures of discrepancy, uncertainty and dependence, with applications to Bayesian predictive experimental design," Technical Report 139, Department of Statistical Science, University College London, http://www.ucl.ac.uk/Stats/research/abs94.html, 1998.

[7] T. Gneiting and A. E. Raftery, "Strictly proper scoring rules, prediction, and estimation," Journal of the American Statistical Association, vol. 102, no. 477, pp. 359–378, 2007.

[8] J. Jiao, T. A. Courtade, K. Venkat, and T. Weissman, "Justification of logarithmic loss via the benefit of side information," IEEE Transactions on Information Theory, vol. 61, no. 10, pp. 5357–5365, 2015.

[9] P. Kannappan and C. Ng, "Measurable solutions of functional equations related to information theory," Proceedings of the American Mathematical Society, pp. 303–310, 1973.

[10] G. Maksa, "Solution on the open triangle of the generalized fundamental equation of information with four unknown functions," Utilitas Mathematica, vol. 21, pp. 267–282, 1982.

[11] J. Aczél and C. Ng, "Determination of all semisymmetric recursive information measures of multiplicative type on n positive discrete probability distributions," Linear Algebra and its Applications, vol. 52, pp. 1–30, 1983.
