On the Entropy of Sums

Mokshay Madiman
Department of Statistics, Yale University
Email: [email protected]

Abstract— It is shown that the entropy of a sum of independent random vectors is a submodular set function, and upper bounds on the entropy of sums are obtained as a result in both discrete and continuous settings. These inequalities complement the lower bounds provided by the entropy power inequalities of Madiman and Barron (2007). As applications, new inequalities for the determinants of sums of positive-definite matrices are presented.

I. INTRODUCTION

Entropies of sums are not as well understood as joint entropies. Indeed, for the joint entropy, there is an elaborate history of entropy inequalities starting with the chain rule of Shannon, whose major developments include works of Han, Shearer, Fujishige, Yeung, Matúš, and others. The earlier part of this work, involving so-called Shannon-type inequalities that use the submodularity of the joint entropy, was synthesized and generalized by Madiman and Tetali [1], who give an array of lower as well as upper bounds for the joint entropy of a collection of random variables, generalizing inequalities of Han [2], Fujishige [3] and Shearer [4]. For a review of the later history, involving so-called non-Shannon inequalities, one may consult, for instance, Matúš [5].

When one considers the entropy of sums of independent random variables instead of joint entropies, the most important inequalities known are the entropy power inequalities, which provide lower bounds on the entropy of sums. In the setting of independent summands, the most general entropy power inequalities known to date are described by Madiman and Barron [6].

In this note, we develop a basic submodularity property of the entropy of sums of independent random variables, and formulate a "chain rule for sums". Surprisingly, these results are entirely elementary and rely on the classical properties of joint entropy. We also demonstrate, as a consequence, upper bounds on the entropy of sums that complement the entropy power inequalities.

The paper is organized as follows. In Section II, we review the notation and definitions we use. Section III presents the submodularity of the entropy of sums of independent random vectors, which is a central result. Section IV develops general upper bounds on entropies of sums, while Section V reviews known lower bounds and conjectures some additional lower bounds. In Section VI, we apply the entropy inequalities of the previous sections to give information-theoretic proofs of inequalities for the determinants of sums of positive-definite matrices.
II. PRELIMINARIES

Let X_1, X_2, ..., X_n be a collection of random variables taking values in some linear space, so that addition is a well-defined operation. We assume that the joint distribution has a density f with respect to some reference measure. This allows the definition of the entropy of various random variables depending on X_1, ..., X_n. For instance, H(X_1, X_2, ..., X_n) = −E[log f(X_1, X_2, ..., X_n)]. There are the familiar two canonical cases: (a) the random variables are real-valued and possess a probability density function, or (b) they are discrete. In the former case, H represents the differential entropy, and in the latter case, H represents the discrete entropy. We simply call H the entropy in all cases, and where the distinction matters, the relevant assumption will be made explicit.

Since we wish to consider sums of various subsets of random variables, the following notational conventions and definitions will be useful. Let [n] be the index set {1, 2, ..., n}. We are interested in a collection C of subsets of [n]. For any set s ⊂ [n], X_s stands for the random vector (X_i : i ∈ s), with the indices taken in their increasing order, while

Y_s = Σ_{i∈s} X_i.

For any index i in [n], define the degree of i in C as r(i) = |{t ∈ C : i ∈ t}|. A function α : C → R_+ is called a fractional covering if, for each i ∈ [n], we have Σ_{s∈C: i∈s} α_s ≥ 1. A function β : C → R_+ is called a fractional packing if, for each i ∈ [n], we have Σ_{s∈C: i∈s} β_s ≤ 1. If α is both a fractional packing and a fractional covering, it is called a fractional partition.
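As an illustrative aside (not part of the original paper), these combinatorial definitions are easy to check mechanically. The short Python sketch below, in which the collection C and the weights are arbitrary examples, computes the degrees r(i) and tests whether a weight function on C is a fractional covering, packing, or partition.

from fractions import Fraction

def degrees(n, C):
    """r(i) = number of sets in C that contain index i."""
    return {i: sum(1 for s in C if i in s) for i in range(1, n + 1)}

def classify(n, C, w):
    """Return (covering, packing, partition) for weights w : C -> R_+."""
    totals = {i: sum(w[s] for s in C if i in s) for i in range(1, n + 1)}
    covering = all(t >= 1 for t in totals.values())
    packing = all(t <= 1 for t in totals.values())
    return covering, packing, covering and packing

n = 3
C = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]  # leave-one-out sets
w = {s: Fraction(1, 2) for s in C}                             # weight 1/2 on each set
print(degrees(n, C))      # every index has degree 2
print(classify(n, C, w))  # (True, True, True): a fractional partition

For the leave-one-out collection used here, the uniform weights 1/2 sum to exactly 1 at every index, so they form a fractional partition.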

The mutual information between two jointly distributed random variables X and Y is

I(X; Y) = H(X) + H(Y) − H(X, Y),

and is a measure of the dependence between X and Y. In particular, I(X; Y) = 0 if and only if X and Y are independent. The following three simple properties of mutual information can be found in elementary texts on information theory such as Cover and Thomas [7]:

• When Y = X + Z, where X and Z are independent, then

I(X; Y) = H(Y) − H(Z). (1)

Indeed, I(X; X + Z) = H(X + Z) − H(X + Z | X) = H(X + Z) − H(Z | X) = H(X + Z) − H(Z).

• The mutual information cannot increase when one looks at functions of the random variables (the "data processing inequality"):

I(f(X); Y) ≤ I(X; Y).

• When Z is independent of (X, Y), then

I(X; Y) = I(X, Z; Y). (2)

Indeed,

I(X, Z; Y) − I(X; Y)
= H(X, Z) − H(X, Z | Y) − [H(X) − H(X | Y)]
= H(X, Z) − H(X) − [H(X, Z | Y) − H(X | Y)]
= H(Z | X) − H(Z | X, Y)
= 0.

III. SUBMODULARITY

Although the inequality below is a simple consequence of these well-known facts, it does not seem to have been noticed before, to the author's knowledge.

Theorem I: [SUBMODULARITY FOR SUMS] If X_1, X_2 and X_3 are independent R^d-valued random vectors, then

H(X_1 + X_2 + X_3) + H(X_2) ≤ H(X_1 + X_2) + H(X_2 + X_3).

Proof: Since X_3 is independent of (X_1, X_1 + X_2), property (2) gives I(X_1; X_1 + X_2) = I(X_1; X_1 + X_2, X_3). Since X_1 + X_2 + X_3 is a function of (X_1 + X_2, X_3), the data processing inequality gives I(X_1; X_1 + X_2, X_3) ≥ I(X_1; X_1 + X_2 + X_3). Evaluating the two ends using (1) yields H(X_1 + X_2) − H(X_2) ≥ H(X_1 + X_2 + X_3) − H(X_2 + X_3), which is the claimed inequality.

We make two further remarks. First, a consequence of the submodularity of the entropy of sums is an entropy inequality that obeys the partial order constructed using compressions, as introduced by Bollobás and Leader [8]. Since this inequality requires the development of some involved notation, we defer the details to the full version [9] of this paper. Second, an equivalent statement of Theorem I is that for independent random vectors,

I(X; X + Y) ≥ I(X; X + Y + Z), (3)

as is clear from the proof of Theorem I.
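As a quick numerical illustration (again, not part of the original text), identity (1) and the inequality of Theorem I can be checked for independent integer-valued random variables; the probability mass functions below are arbitrary random examples, and the distribution of a sum is computed by convolving the summands' mass functions.

import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Discrete (Shannon) entropy, in nats, of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def random_pmf(k):
    p = rng.random(k)
    return p / p.sum()

# Independent integer-valued X1, X2, X3
p1, p2, p3 = random_pmf(4), random_pmf(5), random_pmf(6)

# PMF of a sum of independent variables = convolution of the PMFs
p12 = np.convolve(p1, p2)
p23 = np.convolve(p2, p3)
p123 = np.convolve(p12, p3)

# Identity (1): I(X1; X1 + X2) computed from the joint PMF of (X1, X1 + X2)
# should equal H(X1 + X2) - H(X2).
joint = np.zeros((len(p1), len(p12)))
for a, pa in enumerate(p1):
    for b, pb in enumerate(p2):
        joint[a, a + b] += pa * pb
I_direct = entropy(p1) + entropy(p12) - entropy(joint.ravel())
print(np.isclose(I_direct, entropy(p12) - entropy(p2)))

# Theorem I: H(X1+X2+X3) + H(X2) <= H(X1+X2) + H(X2+X3)
lhs = entropy(p123) + entropy(p2)
rhs = entropy(p12) + entropy(p23)
print(lhs <= rhs + 1e-12)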

IV. UPPER BOUNDS

Remarkably, if we wish to use Theorem I to obtain upper bounds for the entropy of a sum, there is a dramatic difference between the discrete and continuous cases, even though this distinction does not appear in Theorem I. First consider the case where all the random variables are discrete. Shannon's chain rule for joint entropy says that

H(X_1, X_2) = H(X_1) + H(X_2 | X_1), (4)

where H(X_2 | X_1) is the conditional entropy of X_2 given X_1. This rule extends to the consideration of n variables; indeed,

H(X_1, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_{[i−1]}), (5)

where X_{[i−1]} denotes (X_1, ..., X_{i−1}) (and the i = 1 term is simply H(X_1)). There is an analogous chain rule for sums: for any subset s ⊂ [n] and any random vector X independent of (X_1, ..., X_n),

H(X + Y_s) = H(X) + Σ_{i∈s} I(X_i; X + Y_{s∩[i]}), (6)

where s∩[i] = {j ∈ s : j ≤ i} and Y_∅ = 0. Indeed, by (1) each increment H(X + Y_{s∩[i]}) − H(X + Y_{s∩[i−1]}) equals I(X_i; X + Y_{s∩[i]}), and summing over i ∈ s telescopes.
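The identity (6) can be tested numerically in the discrete case. In the following sketch (illustrative only), the auxiliary variable X, the mass functions, and the subset s = {1, 3, 4} are arbitrary choices, and each mutual information term is evaluated via identity (1).

import numpy as np
from functools import reduce

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def random_pmf(k):
    p = rng.random(k)
    return p / p.sum()

def conv_all(pmfs):
    """PMF of the sum of independent variables with the given PMFs."""
    return reduce(np.convolve, pmfs, np.array([1.0]))

# Auxiliary X and summands X_1, ..., X_4; take s = {1, 3, 4}
pX = random_pmf(3)
p = {i: random_pmf(i + 2) for i in range(1, 5)}
s = [1, 3, 4]

lhs = entropy(conv_all([pX] + [p[i] for i in s]))   # H(X + Y_s)

# Right side of (6): H(X) plus, for each i in s, I(X_i; X + Y_{s ∩ [i]}),
# where I(X_i; X_i + W) = H(X_i + W) - H(W) by identity (1)
rhs = entropy(pX)
for k, i in enumerate(s):
    W = conv_all([pX] + [p[j] for j in s[:k]])      # distribution of X + Y_{s ∩ [i-1]}
    rhs += entropy(np.convolve(p[i], W)) - entropy(W)

print(np.isclose(lhs, rhs))   # the two sides agree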
This modified chain rule leads to the following general upper bound.

Theorem III: [UPPER BOUND FOR SUMS] Let X, X_1, ..., X_n be independent R^d-valued random vectors, and let α be a fractional covering using a collection C of subsets of [n]. Then

H(X + X_1 + ... + X_n) ≤ Σ_{s∈C} α_s H(X + Y_s) − a H(X),

where a = Σ_{s∈C} α_s − 1.

Proof: The modified chain rule (6) for sums implies that H(X + Y_s) = H(X) + Σ_{i∈s} I(X_i; X + Y_{s∩[i]}) for each s ∈ C, and, taking s = [n], that H(X + X_1 + ... + X_n) = H(X) + Σ_{i∈[n]} I(X_i; X + Y_{[i]}). Since X + Y_{[i]} is obtained from X + Y_{s∩[i]} by adding the independent summand Y_{[i]\s}, inequality (3) gives I(X_i; X + Y_{s∩[i]}) ≥ I(X_i; X + Y_{[i]}) for each i ∈ s. Hence

Σ_{s∈C} α_s [H(X + Y_s) − H(X)] ≥ Σ_{i∈[n]} (Σ_{s∈C: i∈s} α_s) I(X_i; X + Y_{[i]}) ≥ Σ_{i∈[n]} I(X_i; X + Y_{[i]}) = H(X + X_1 + ... + X_n) − H(X),

using the fractional covering property and the nonnegativity of mutual information. Rearranging gives the claim.
V. LOWER BOUNDS

For an R^d-valued random vector X, write e^{2H(X)/d} for its entropy power (we drop the customary 1/(2πe) normalization, which plays no role in what follows). The entropy power inequalities provide lower bounds on the entropy of sums that complement the upper bounds of the previous section, and it is natural to conjecture the following fractional superadditivity property.

Conjecture I: If X_1, ..., X_n are independent R^d-valued random vectors, then for any fractional packing β using any collection C of subsets of [n],

e^{2H(X_1+...+X_n)/d} ≥ Σ_{s∈C} β_s e^{2H(Σ_{j∈s} X_j)/d}.

Since β_s = 1/r_+(s) defines a fractional packing, where r_+(s) = max_{i∈s} r(i) and r(i) is the degree of i, Conjecture I implies that

e^{2H(X_1+...+X_n)/d} ≥ Σ_{s∈C} (1/r_+(s)) e^{2H(Σ_{j∈s} X_j)/d}.

The following weaker statement is the main result of Madiman and Barron [6]: if r_+ is the maximum degree (over all i ∈ [n]), then

e^{2H(X_1+...+X_n)/d} ≥ (1/r_+) Σ_{s∈C} e^{2H(Σ_{j∈s} X_j)/d}. (9)

It is easy to see that this generalizes the entropy power inequalities of Shannon-Stam [10], [11] and Artstein, Ball, Barthe and Naor [12].
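Since the entropy power of N(0, K) on R^d equals 2πe |K|^{1/d}, inequality (9) can be checked numerically for Gaussian summands. The sketch below is illustrative only (the covariance matrices are random, and C is taken to be the leave-one-out collection, for which r_+ = n − 1).

import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n = 3, 4

def random_cov(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)        # positive definite

def ent_power(K):
    """Entropy power e^{2H/d} of N(0, K), with H = (1/2) log((2*pi*e)^d |K|)."""
    return 2 * np.pi * np.e * np.linalg.det(K) ** (1.0 / d)

Ks = [random_cov(d) for _ in range(n)]
C = list(combinations(range(n), n - 1))   # leave-one-out sets, r_+ = n - 1

lhs = ent_power(sum(Ks))
rhs = sum(ent_power(sum(Ks[j] for j in s)) for s in C) / (n - 1)
print(lhs >= rhs)   # inequality (9) for Gaussian summands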
Of course, one may further conjecture that the entropy power is also supermodular for sums, which would be stronger than Conjecture I. In general, it is an interesting question to characterize the class of all vectors of real numbers (indexed by the subsets of [n]) that can arise as entropy powers of sums of a collection of underlying independent random variables. Then the question of supermodularity of entropy power is just the question of whether this class, which one may call the Stam class to honour Stam's role in the study of entropy power, is contained in the class of polymatroidal vectors.

There has been much progress in recent decades on characterizing the class of all entropy inequalities for the joint distributions of a collection of (dependent) random variables. Major contributions were made by Han and Fujishige in the 1970's, and by Yeung and collaborators in the 1990's; we refer to [13] for a history of such investigations; more recent developments and applications are contained in [5] and [14]. The study of the Stam class above is a direct analogue of the study of the entropic vectors by these authors. Let us point out some particularly interesting aspects of this analogy, using X_s to denote (X_i : i ∈ s) and H to denote (with some abuse of notation) the discrete entropy, where (X_i : i ∈ [n]) is a collection of dependent random variables taking values in some discrete space. Fujishige [3] proved that h(s) = H(X_s) is a submodular set function (analogous to the conjecture on the supermodularity of entropy power), and Madiman and Tetali [1] proved that

Σ_{s∈C} β_s H(X_s | X_{s^c}) ≤ h([n]) ≤ Σ_{s∈C} α_s h(s), (10)

which is analogous to the entropy power inequalities for sums suggested in Conjecture I. However, it does not seem to be easy to derive entropy lower bounds for sums and inequalities for joint distributions using a common framework; even the recent innovative treatment of entropy power inequalities by Rioul [15], which partially addresses this issue, requires some delicate analysis in the end to justify the vanishing of higher-order terms in a Taylor expansion. On the other hand, inequalities for joint distributions tend to rely entirely on elementary facts such as chain rules. Nonetheless the formal similarity begs the question of how far this parallelism extends; in particular, are there "non-Shannon-type" inequalities for entropy powers of sums that impose restrictions beyond those imposed by the conjectured supermodularity?

VI. APPLICATION TO DETERMINANTS

It has been long known that the probabilistic representation of a matrix determinant using multivariate normal probability density functions can be used to prove useful matrix inequalities (see, for instance, Bellman's classic book [16]). A new variation on the theme of using properties of normal distributions to deduce inequalities for positive-definite matrices was begun by Cover and El Gamal [17], and continued by Dembo, Cover and Thomas [18] and Madiman and Tetali [1]. Their idea was to use the representation of the determinant in terms of the entropy of normal distributions, rather than the integral representation of the determinant in terms of the normal density. They showed that the classical Hadamard, Szasz and more general inequalities relating the determinants of square submatrices followed from properties of the joint entropy function. Our development below also uses the representation of the determinant in terms of entropy, namely, the fact that the entropy of the d-variate normal distribution with covariance matrix K, written N(0, K), is given by

H(N(0, K)) = (1/2) log[(2πe)^d |K|].

The ground is prepared to present an upper bound on the determinant of a sum of positive matrices.

Theorem IV: [UPPER BOUND ON DETERMINANT OF SUM] Let K and K_i be positive matrices of dimension d. Then the function D(s) = log |Σ_{i∈s} K_i|, defined on the power set of [n], is submodular. Furthermore, for any fractional covering α using any collection of subsets C of [n],

|K + K_1 + ... + K_n| ≤ |K|^{−a} Π_{s∈C} |K + Σ_{j∈s} K_j|^{α_s},

where a = Σ_{s∈C} α_s − 1.

Proof: Substituting normals in Theorems I and III gives the result.

As a corollary, one obtains for instance the upper bound

|K + K_1 + ... + K_n| · |K|^{n−1} ≤ Π_{i∈[n]} |K + Σ_{j≠i} K_j|.

We now present lower bounds for the determinant of a sum.

Proposition I: [GENERALIZED MINKOWSKI DETERMINANT INEQUALITY] Let K_1, ..., K_n be d × d positive matrices. Let C be a collection of subsets of [n] with maximum degree r_+. Then

|K_1 + ... + K_n|^{1/d} ≥ (1/r_+) Σ_{s∈C} |Σ_{j∈s} K_j|^{1/d}.

Equality holds if and only if all the matrices K_i are proportional to each other, and each index appears in exactly r_+ subsets.

In particular, when C is the collection of singleton sets, r_+ = 1, and one recovers the classical inequality. Specifically, if K_1 and K_2 are d × d symmetric, positive-definite matrices, then

|K_1 + K_2|^{1/d} ≥ |K_1|^{1/d} + |K_2|^{1/d}.

Many proofs of this inequality exist; see, e.g., [19]. Although the bound in Proposition I is better in general than the classical Minkowski bound of |K_1|^{1/d} + ... + |K_n|^{1/d}, mathematically the former can be seen as a consequence of the latter. Indeed, below we prove a more general version of Proposition I.
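Before doing so, we note that the determinant inequalities above are easy to probe numerically. The following sketch is illustrative only: the positive-definite matrices are random, and the pair of subsets used for the submodularity check is an arbitrary choice; it tests the submodularity of D from Theorem IV and the corollary displayed above.

import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 4

def random_cov(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)

def logdet(M):
    return np.linalg.slogdet(M)[1]

K = random_cov(d)
Ks = [random_cov(d) for _ in range(n)]

def D(s):
    """D(s) = log |sum_{i in s} K_i| (indices 0, ..., n-1 here)."""
    return logdet(sum(Ks[i] for i in s))

# Submodularity: D(s ∪ t) + D(s ∩ t) <= D(s) + D(t)
s, t = {0, 1, 2}, {1, 2, 3}
print(D(s | t) + D(s & t) <= D(s) + D(t) + 1e-9)

# Corollary of Theorem IV:
# |K + K_1 + ... + K_n| |K|^(n-1) <= prod_i |K + sum_{j != i} K_j|
lhs = logdet(K + sum(Ks)) + (n - 1) * logdet(K)
rhs = sum(logdet(K + sum(Ks[j] for j in range(n) if j != i)) for i in range(n))
print(lhs <= rhs + 1e-9)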

Theorem V: Let K_1, ..., K_n be d × d positive matrices. Then for any fractional packing β using a collection C of subsets of [n],

|K_1 + ... + K_n|^{1/d} ≥ Σ_{s∈C} β_s |Σ_{j∈s} K_j|^{1/d}.

Equality holds iff the matrices {Σ_{j∈s} K_j, s ∈ C} are proportional.
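As a purely numerical illustration (not part of the paper), Theorem V can be checked with random positive-definite matrices; here β is taken to be the uniform weight 1/(n − 1) on the leave-one-out sets, which is a fractional packing (indeed a fractional partition).

import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
d, n = 3, 4

def random_cov(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + np.eye(d)

def root_det(M, d):
    return np.linalg.det(M) ** (1.0 / d)

Ks = [random_cov(d) for _ in range(n)]

# Classical Minkowski: |K1 + K2|^(1/d) >= |K1|^(1/d) + |K2|^(1/d)
print(root_det(Ks[0] + Ks[1], d) >= root_det(Ks[0], d) + root_det(Ks[1], d))

# Theorem V with C the leave-one-out sets and beta_s = 1/(n-1)
C = list(combinations(range(n), n - 1))
lhs = root_det(sum(Ks), d)
rhs = sum(root_det(sum(Ks[j] for j in s), d) for s in C) / (n - 1)
print(lhs >= rhs)

The same computation with C replaced by the singleton sets and β_s = 1 reproduces the classical Minkowski bound checked in the first print statement.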

Proof: First note that for any fractional partition β,

K_1 + ... + K_n = Σ_{s∈C} β_s Σ_{j∈s} K_j.

Applying the usual Minkowski inequality gives

|K_1 + ... + K_n|^{1/d} ≥ Σ_{s∈C} |β_s Σ_{j∈s} K_j|^{1/d} = Σ_{s∈C} β_s |Σ_{j∈s} K_j|^{1/d}.

(When β is a fractional packing that is not a partition, Σ_{s∈C} β_s Σ_{j∈s} K_j is dominated by K_1 + ... + K_n in the positive-semidefinite order, and the determinant is monotone with respect to that order, so the same chain of inequalities applies.) For equality, we clearly need the matrices

β_s Σ_{j∈s} K_j, s ∈ C,

to be proportional.

It is not a priori clear that the bounds obtained using Proposition I (or Theorem V) are better than the direct Minkowski lower bound of

|K_1|^{1/d} + ... + |K_n|^{1/d},

but we observe that they are in general better. Consider the collection C_{n−1} of leave-one-out sets, i.e., C = {s ⊂ [n] : |s| = n − 1}. The maximum degree here is r_+ = n − 1, so

|K_1 + ... + K_n|^{1/d} ≥ (1/(n − 1)) Σ_{s∈C_{n−1}} |Σ_{j∈s} K_j|^{1/d}. (11)

Lower bounding each term on the right by iterating this inequality, and using C_k to denote the collection of all sets of size k, one obtains
|K_1 + ... + K_n|^{1/d} ≥ (2/((n − 1)(n − 2))) Σ_{s∈C_{n−2}} |Σ_{j∈s} K_j|^{1/d}.

Repeating this procedure yields the hierarchy of inequalities

|K_1 + ... + K_n|^{1/d} ≥ LB_{n−1} ≥ LB_{n−2} ≥ ... ≥ LB_1,

where LB_k is the lower bound given by Proposition I using C_k. Note that the direct Minkowski lower bound, namely LB_1, is the worst of the hierarchy.

Note that Theorem V gives evidence towards Conjecture I, since it is simply the special case of Conjecture I for multivariate normals! (Conversely, the inequality (9) may be used to give an information-theoretic proof of Proposition I, which is a special case of Theorem V.)

There are other interesting related entropy and matrix inequalities that we do not have space to cover in this note; these can be found in the full paper [9].

ACKNOWLEDGMENT

The author would like to thank Andrew Barron and Prasad Tetali, who have collaborated with him on related work.

REFERENCES

[1] M. Madiman and P. Tetali, "Information inequalities for joint distributions, with interpretations and applications," Submitted, 2007.
[2] T. S. Han, "Nonnegative entropy measures of multivariate symmetric correlations," Information and Control, vol. 36, no. 2, pp. 133–156, 1978.
[3] S. Fujishige, "Polymatroidal dependence structure of a set of random variables," Information and Control, vol. 39, pp. 55–72, 1978.
[4] F. Chung, R. Graham, P. Frankl, and J. Shearer, "Some intersection theorems for ordered sets and graphs," J. Combinatorial Theory, Ser. A, vol. 43, pp. 23–37, 1986.
[5] F. Matúš, "Two constructions on limits of entropy functions," IEEE Trans. Inform. Theory, vol. 53, no. 1, pp. 320–330, 2007.
[6] M. Madiman and A. Barron, "Generalized entropy power inequalities and monotonicity properties of information," IEEE Trans. Inform. Theory, vol. 53, no. 7, pp. 2317–2329, July 2007.
[7] T. Cover and J. Thomas, Elements of Information Theory. New York: J. Wiley, 1991.
[8] B. Bollobás and I. Leader, "Compressions and isoperimetric inequalities," J. Combinatorial Theory, Ser. A, vol. 56, no. 1, pp. 47–62, 1991.
[9] M. Madiman, "The entropy of sums of independent random elements," Preprint, 2008.
[10] C. Shannon, "A mathematical theory of communication," Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[11] A. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Information and Control, vol. 2, pp. 101–112, 1959.
[12] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, "Solution of Shannon's problem on the monotonicity of entropy," J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982 (electronic), 2004.
[13] J. Zhang and R. Yeung, "On characterization of entropy function via information inequalities," IEEE Trans. Inform. Theory, vol. 44, pp. 1440–1452, 1998.
[14] R. Dougherty, C. Freiling, and K. Zeger, "Networks, matroids and non-Shannon information inequalities," IEEE Trans. Inform. Theory, vol. 53, no. 6, pp. 1949–1969, June 2007.
[15] O. Rioul, "A simple proof of the entropy-power inequality via properties of mutual information," Proc. IEEE Intl. Symp. Inform. Theory, Nice, France, pp. 46–50, 2007.
[16] R. Bellman, Introduction to Matrix Analysis. McGraw-Hill, 1960.
[17] T. Cover and A. El Gamal, "An information theoretic proof of Hadamard's inequality," IEEE Trans. Inform. Theory, vol. 29, no. 6, pp. 930–931, 1983.
[18] A. Dembo, T. Cover, and J. Thomas, "Information-theoretic inequalities," IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1501–1518, 1991.
[19] R. Bhatia, Matrix Analysis. Springer, 1996.