arXiv:cs/0605047v2 [cs.IT] 2 Jul 2007

TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007

Generalized Entropy Power Inequalities and Monotonicity Properties of Information

Mokshay Madiman, Member, IEEE, and Andrew Barron, Senior Member, IEEE

Abstract— New families of Fisher information and entropy power inequalities for sums of independent random variables are presented. These inequalities relate the information in the sum of n independent random variables to the information contained in sums over subsets of the random variables, for an arbitrary collection of subsets. As a consequence, a simple proof of the monotonicity of information in central limit theorems is obtained, both in the setting of i.i.d. summands as well as in the more general setting of independent summands with variance-standardized sums.

Index Terms— Central limit theorem; entropy power; information inequalities.

Manuscript received April 15, 2006; revised March 16, 2007. The material in this paper was presented in part at the Inaugural Workshop of the Center for Information Theory and Applications, San Diego, CA, February 2006, and at the IEEE International Symposium on Information Theory, Seattle, WA, July 2006. Communicated by V. A. Vaishampayan, Associate Editor at Large. The authors are with the Department of Statistics, Yale University, 24 Hillhouse Avenue, New Haven, CT 06511, USA. Email: [email protected], [email protected].

I. INTRODUCTION

Let X_1, X_2, ..., X_n be independent random variables with densities and finite variances, and let H denote the (differential) entropy, i.e., if f is the probability density function of X, then H(X) = −E[log f(X)]. The classical entropy power inequality of Shannon [36] and Stam [39] states

    exp{2H(X_1 + ... + X_n)} ≥ Σ_{j=1}^{n} exp{2H(X_j)}.                                    (1)

In 2004, Artstein, Ball, Barthe and Naor [1] (hereafter denoted by ABBN [1]) proved a new entropy power inequality

    exp{2H(X_1 + ... + X_n)} ≥ (1/(n−1)) Σ_{i=1}^{n} exp{2H(Σ_{j≠i} X_j)},                   (2)

where each term involves the entropy of the sum of n−1 of the variables, excluding the i-th, and presented its implications for the monotonicity of entropy in the central limit theorem. It is not hard to see, by repeated application of (2) for a succession of values of n, that (2) in fact implies the inequality (4) below and hence the monotonicity of entropy in the central limit theorem. We will present below a generalized entropy power inequality for subset sums that subsumes both (1) and (2) and also implies several other interesting and easily interpretable inequalities. We provide simple and easily interpretable proofs of all of these inequalities. In particular, this provides a simplified understanding of the monotonicity of entropy in central limit theorems. A similar and contemporaneous development of the monotonicity of entropy is given by Tulino and Verdú [42].

Our generalized entropy power inequality for subset sums is as follows: if C is an arbitrary collection of subsets of {1, 2, ..., n}, then

    exp{2H(X_1 + ... + X_n)} ≥ (1/r) Σ_{s∈C} exp{2H(Σ_{i∈s} X_i)},                          (3)

where r = r(C) is the maximum number of sets in C in which any one index appears. In general, the inequality (3) clearly yields a whole family of entropy power inequalities, for arbitrary collections of subsets. Furthermore, equality holds in any of these inequalities if and only if the X_i are normally distributed and the collection C is "nice" in a sense that will be made precise later. In particular, we note the following examples.
1) Choosing C to be the class C_1 of all singletons yields r = 1 and hence (1).
2) Choosing C to be the class C_{n−1} of all sets of n−1 elements yields r = n−1 and hence (2).
3) Choosing C to be the class C_m of all sets of m elements yields r = \binom{n−1}{m−1} and hence the inequality

    exp{2H(X_1 + ... + X_n)} ≥ (1/\binom{n−1}{m−1}) Σ_{s∈C_m} exp{2H(Σ_{i∈s} X_i)}.          (4)

4) Choosing C to be the class of all sets of k consecutive integers yields r = min{k, n−k+1} and hence the inequality

    exp{2H(X_1 + ... + X_n)} ≥ (1/min{k, n−k+1}) Σ_{i=1}^{n−k+1} exp{2H(X_i + ... + X_{i+k−1})}.   (5)

These inequalities are relevant for the examination of monotonicity in central limit theorems. Indeed, if X_1 and X_2 are independent and identically distributed (i.i.d.), then (1) is equivalent to

    H((X_1 + X_2)/√2) ≥ H(X_1),

by using the scaling H(aX) = H(X) + log|a|. This fact implies that the entropy of the standardized sums Y_n = (1/√n) Σ_{i=1}^{n} X_i for i.i.d. X_i increases along the powers-of-2 subsequence, i.e., H(Y_{2^k}) is non-decreasing in k. Characterization of the change in information quantities on doubling of sample size was used in proofs of central limit theorems by Shimizu [37], Brown [8], Barron [4], Carlen and Soffer [10], Johnson and Barron [21], and Artstein, Ball, Barthe and Naor [2].
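As a concrete illustration of the constant r, the values claimed in the four examples above can be verified by brute force. The following Python sketch is ours and not part of the original development; the values n = 7, m = 3 and k = 4 and the helper r_of are arbitrary illustrative choices.

    from itertools import combinations
    from math import comb

    def r_of(collection, n):
        """Maximum number of sets in the collection that contain any one index."""
        return max(sum(1 for s in collection if i in s) for i in range(1, n + 1))

    n, m, k = 7, 3, 4
    singletons    = [{i} for i in range(1, n + 1)]                        # class C_1
    leave_one_out = [set(range(1, n + 1)) - {i} for i in range(1, n + 1)] # class C_{n-1}
    m_subsets     = [set(s) for s in combinations(range(1, n + 1), m)]    # class C_m
    consecutive   = [set(range(i, i + k)) for i in range(1, n - k + 2)]   # k consecutive integers

    print(r_of(singletons, n))                       # 1, recovering (1)
    print(r_of(leave_one_out, n), n - 1)             # n - 1, recovering (2)
    print(r_of(m_subsets, n), comb(n - 1, m - 1))    # binom(n-1, m-1), as in (4)
    print(r_of(consecutive, n), min(k, n - k + 1))   # min{k, n-k+1}, as in (5)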

In particular, Barron [4] showed that the sequence {H(Y_n)} converges to the entropy of the normal; this, incidentally, is equivalent to the convergence to 0 of the relative entropy (Kullback divergence) from a normal distribution. ABBN [1] showed that H(Y_n) is in fact a non-decreasing sequence for every n, solving a long-standing conjecture. In fact, (2) is equivalent in the i.i.d. case to the monotonicity property

    H((X_1 + ... + X_n)/√n) ≥ H((X_1 + ... + X_{n−1})/√(n−1)).                              (6)

Note that the presence of the factor n−1 (rather than n) in the denominator of (2) is crucial for this monotonicity.

Likewise, for sums of independent random variables, the inequality (4) is equivalent to "monotonicity on average" properties for certain standardizations; for instance,

    H((X_1 + ... + X_n)/√n) ≥ (1/\binom{n}{m}) Σ_{s∈C_m} H(Σ_{i∈s} X_i /√m).

A similar monotonicity also holds, as we shall show, for arbitrary collections C and even when the sums are standardized by their variances. Here again the factor r (rather than the cardinality |C| of the collection) in the denominator of (3) for the unstandardized version is crucial.

Outline of our development.

We find that the main inequality (3) (and hence all of the above inequalities) as well as corresponding inequalities for Fisher information can be proved by simple tools. Two of these tools, a convolution identity for score functions and the relationship between Fisher information and entropy (discussed in Section II), are familiar in past work on information inequalities. An additional trick is needed to obtain the denominator r in (3). This is a simple variance drop inequality for statistics expressible via sums of functions of subsets of a collection of variables, particular cases of which are familiar in other statistical contexts (as we shall discuss).

We recall that for a random variable X with differentiable density f, the score function is ρ(x) = ∂/∂x log f(x), the score is the random variable ρ(X), and its Fisher information is I(X) = E[ρ²(X)].

Suppose for the consideration of Fisher information that the independent random variables X_1, ..., X_n have absolutely continuous densities. To outline the heart of the matter, the first step boils down to the geometry of projections (conditional expectations). Let ρ_tot be the score of the total sum Σ_{i=1}^{n} X_i and let ρ^s be the score of the subset sum Σ_{i∈s} X_i. As we recall in Section II, ρ_tot is the conditional expectation (or L² projection) of each of these subset sum scores given the total sum. Consequently, any convex combination Σ_{s∈C} w_s ρ^s also has projection

    ρ_tot = E[ Σ_{s∈C} w_s ρ^s | Σ_{i=1}^{n} X_i ],

and the Fisher information I(X_1 + ... + X_n) = E[ρ_tot²] has the bound

    E[ρ_tot²] ≤ E[( Σ_{s∈C} w_s ρ^s )²].                                                    (7)

For non-overlapping subsets, the independence and zero mean properties of the scores provide a direct means to express the right side in terms of the Fisher informations of the subset sums (yielding the traditional Blachman [7] proof of Stam's inequality for Fisher information). In contrast, the case of overlapping subsets requires fresh consideration. Whereas a naive application of Cauchy-Schwarz would yield a loose bound of |C| Σ_{s∈C} w_s² Eρ_s², a variance drop lemma instead yields that the right side of (7) is not more than r Σ_{s∈C} w_s² Eρ_s² if each i is in at most r subsets of C. Consequently,

    I(X_1 + ... + X_n) ≤ r Σ_{s∈C} w_s² I(Σ_{i∈s} X_i)                                      (8)

for any weights {w_s} that add to 1 over all subsets s ⊂ {1, ..., n} in C. See Sections II and IV for details. Optimizing over w yields an inequality for inverse Fisher information that extends the Fisher information inequalities of Stam and ABBN:

    1/I(X_1 + ... + X_n) ≥ (1/r) Σ_{s∈C} 1/I(Σ_{i∈s} X_i).                                  (9)

Alternatively, using a scaling property of Fisher information to re-express our core inequality (8), we see that the Fisher information of the total sum is bounded by a convex combination of Fisher informations of scaled subset sums:

    I(X_1 + ... + X_n) ≤ Σ_{s∈C} w_s I( Σ_{i∈s} X_i /√(w_s r) ).                            (10)

This integrates to give an inequality for entropy that is an extension of the "linear form of the entropy power inequality" developed by Dembo et al [15]. Specifically, we obtain

    H(X_1 + ... + X_n) ≤ Σ_{s∈C} w_s H( Σ_{i∈s} X_i /√(w_s r) ).                            (11)

See Section V for details. Likewise, using the scaling property of entropy on (11) and optimizing over w yields our extension (3) of the entropy power inequality, described in Section VI.

To relate this chain of ideas to other recent work, we point out that ABBN [1] was the first to show use of a variance drop lemma for information inequality development (in the leave-one-out case of the collection C_{n−1}). For that case, what is new in our presentation is going straight to the projection inequality (7) followed by the variance drop inequality, bypassing any need for the elaborate variational characterization of Fisher information that is an essential ingredient of ABBN [1]. Moreover, whereas ABBN [1] requires that the random variables have a C² density for monotonicity of Fisher information, absolute continuity of the density (as minimally needed to define the score function) suffices in our approach. Independently and contemporaneously to our work, Tulino and Verdú [42] also found a similar simplified proof of monotonicity properties of entropy and entropy derivative with a clever interpretation (its relationship to our approach is described in Section III after we complete the presentation of our development).
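The optimization over w that takes (8) to (9) is elementary; the following is a sketch of the Cauchy-Schwarz argument, under the assumption that each I_s := I(Σ_{i∈s} X_i) is finite and strictly positive.

\[
1 = \Big(\sum_{s\in\mathcal{C}} w_s\Big)^2
  = \Big(\sum_{s\in\mathcal{C}} w_s\sqrt{I_s}\cdot\frac{1}{\sqrt{I_s}}\Big)^2
  \le \Big(\sum_{s\in\mathcal{C}} w_s^2 I_s\Big)\Big(\sum_{s\in\mathcal{C}} \frac{1}{I_s}\Big),
\]

so that \(\sum_{s} w_s^2 I_s \ge \big(\sum_{s} I_s^{-1}\big)^{-1}\), with equality when \(w_s \propto 1/I_s\). Substituting the optimal weights into (8) gives

\[
I(X_1+\dots+X_n) \;\le\; \frac{r}{\sum_{s\in\mathcal{C}} I\big(\sum_{i\in s}X_i\big)^{-1}},
\]

which is exactly (9).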

Furthermore, a recent preprint of Shlyakhtenko [38] proves an analogue of the information inequalities in the leave-one-out case for non-commutative or "free" probability theory. In that manuscript, he also gives a proof for the classical setting assuming finiteness of all moments, whereas our direct proof requires only finite variance. Our proof also reveals in a simple manner the cases of equality in (6), for which an alternative approach in the free probability setting is in Schultz [35].

While Section III gives a direct proof of the monotonicity of entropy in the central limit theorem for i.i.d. summands, Section VII applies the preceding inequalities to study sums of non-identically distributed random variables under appropriate scalings. In particular, we show that "entropy is monotone on average" in the setting of variance-standardized sums.

Our subset sum inequalities are tight (with equality in the Gaussian case) for balanced collections of subsets, as will be defined in Section II. In Section VIII, we present refined versions of our inequalities that can even be tight for certain unbalanced collections.

Section IX concludes with some discussion on potential directions of application of our results and methods. In particular, beyond the connection with central limit theorems, we also discuss potential connections of our results with distributed statistical estimation, graph theory and multi-user information theory.

Form of the inequalities.

If ψ(X) represents either the inverse Fisher information or the entropy power of X, then our inequalities above take the form

    r ψ(X_1 + ... + X_n) ≥ Σ_{s∈C} ψ( Σ_{i∈s} X_i ).                                        (12)

We motivate the form (12) using the following almost trivial fact.

Fact 1: For arbitrary numbers {a_i : i = 1, 2, ..., n},

    Σ_{s∈C} Σ_{i∈s} a_i = r Σ_{i=1}^{n} a_i,                                                (13)

if each index i appears in the same number r of sets in C. Indeed,

    Σ_{s∈C} Σ_{i∈s} a_i = Σ_{i=1}^{n} Σ_{s∈C: s∋i} a_i = Σ_{i=1}^{n} r a_i = r Σ_{i=1}^{n} a_i.

If Fact 1 is thought of as C-additivity of the sum function for sets of numbers, then (9) and (3) may be thought of as expressing C-superadditivity of the inverse Fisher information and of the entropy power, respectively, with respect to convolution of the arguments. In the case of normal random variables, the inverse Fisher information and the entropy power equal the variance. Thus in that case (9) and (3) become Fact 1 with a_i equal to the variance of X_i.

II. SCORE FUNCTIONS AND PROJECTIONS

We use ρ(x) = f′(x)/f(x) to denote the (almost everywhere defined) score function of the random variable X with absolutely continuous probability density function f. The score ρ(X) has zero mean, and its variance is just the Fisher information I(X).

The first tool we need is a projection property of score functions of sums of independent random variables, which is well-known for smooth densities (cf., Blachman [7]). For completeness, we give the proof. As shown by Johnson and Barron [21], it is sufficient that the densities are absolutely continuous; see [21, Appendix 1] for an explanation of why this is so.

Lemma 1 (CONVOLUTION IDENTITY FOR SCORES): If V_1 and V_2 are independent random variables, and V_1 has an absolutely continuous density with score ρ_1, then V_1 + V_2 has the score function

    ρ(v) = E[ρ_1(V_1) | V_1 + V_2 = v].                                                     (14)

Proof: Let f_1 and f be the densities of V_1 and V = V_1 + V_2 respectively. Then, either bringing the derivative inside the integral for the smooth case, or via the more general formalism in [21],

    f′(v) = ∂/∂v E[f_1(v − V_2)] = E[f_1′(v − V_2)] = E[f_1(v − V_2) ρ_1(v − V_2)],

so that

    ρ(v) = f′(v)/f(v) = E[ (f_1(v − V_2)/f(v)) ρ_1(v − V_2) ] = E[ρ_1(V_1) | V_1 + V_2 = v].

The second tool we need is a "variance drop lemma", the history of which we discuss in remarks after the proof below. The following conventions are useful:
• [n] is the index set {1, 2, ..., n}.
• For any s ⊂ [n], X_s stands for the collection of random variables (X_i : i ∈ s), with the indices taken in their natural (increasing) order.
• For ψ_s : R^{|s|} → R, we write ψ_s(x_s) for a function of x_s for any s ⊂ [n], so that ψ_s(x_s) ≡ ψ_s(x_{k_1}, ..., x_{k_{|s|}}), where k_1 < k_2 < ... < k_{|s|} are the ordered indices in s.
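Lemma 1 can be checked directly in the Gaussian case, where scores are linear and the conditional expectation is available in closed form. The following Monte Carlo sketch is our illustration (the standard deviations s1 and s2 are arbitrary choices); it estimates E[ρ_1(V_1) | V_1 + V_2 = v] by averaging within narrow bins of the sum and compares the result with the score of V_1 + V_2, as (14) predicts.

    import numpy as np

    rng = np.random.default_rng(0)
    s1, s2 = 1.5, 0.8                 # standard deviations of V1 and V2 (arbitrary)
    N = 1_000_000
    V1 = rng.normal(0.0, s1, N)
    V2 = rng.normal(0.0, s2, N)
    V = V1 + V2

    rho1 = -V1 / s1**2                # score of V1 in the Gaussian case

    # Approximate E[rho1(V1) | V1 + V2 = v] by the average of rho1 within bins of V.
    bins = np.linspace(-3.0, 3.0, 25)
    idx = np.digitize(V, bins)
    for b in range(5, 20, 3):
        in_bin = idx == b
        v_mid = 0.5 * (bins[b - 1] + bins[b])
        est = rho1[in_bin].mean()                  # bin average (approximate conditional mean)
        exact = -v_mid / (s1**2 + s2**2)           # score of V1 + V2 at the bin midpoint
        print(f"v ~ {v_mid:+.2f}   E[rho1 | V=v] ~ {est:+.4f}   score of sum = {exact:+.4f}")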

The following notions are not required for the inequalities by rearranging the sums and using the orthogonality of the we present, but help to clarify the cases of equality. ANOVA decomposition again. This proves the inequality. A collection of subsets of [n] is said to be discriminat- Now suppose ψs′ is not additive. This means that for some • t s ¯ ing if for anyC distinct indices i and j in [n], there is a set set ′ with two elements, Etψs′ (Xs′ ) =0. Fix this choice t ⊂ 6 in that contains i but not j. Note that all the collections of . Since is a discriminating collection, not all of the at C t introducedC in Section I were discriminating. most r subsets containing one element of can contain the A collection of subsets of [n] is said to be balanced if other. Consequently, the inner sum in the inequality (17) runs • s each index i Cin [n] appears in the same number (namely, over strictly fewer than r subsets , and the inequality (17) r) of sets in . must be strict. Thus each ψs must be an additive function if A function f C: Rd R is additive if there exist functions equality holds, i.e., it must be composed only of main effects • f : R R such that→ f(x ,...,x )= d f (x ), i.e., and no interactions. i → 1 d i=1 i i if it is 1-additive. P C Remark 1: The idea of the variance drop inequality goes back at least to Hoeffding’s seminal work [18] on U-statistics. Lemma 2 (VARIANCE DROP): Let U(X1,...,Xn) = Suppose ψ : Rm R is symmetric in its arguments, and s ψs(Xs) be a -additive function with mean zero → C Eψ(X1,...,Xm)=0. Define components,P ∈C i.e., Eψs(Xs)=0 for each s . Then ∈C 1 2 U(X1,...,Xn)= n ψ(Xs). (18) EU 2 r E ψs(Xs) , (15) s s ≤ m [nX]: =m sX   { ⊂ | | } ∈C Then Hoeffding [18] showed s where r is the maximum number of subsets in which any 2 m 2 index appears. When is a discriminating collection,∈C equality EU Eψ , (19) C ≤ n can hold only if each ψs is an additive function. which is implied by Lemma 2 under the symmetry assump- tions. In statistical language, U defined in (18) is a U-statistic t ¯ Proof: For every subset of [n], let Et be the ANOVA of degree m with symmetric, mean zero kernel ψ that is projection onto the space of functions of Xt (see the Appendix applied to data of sample size n. Thus (19) quantitatively for details). By performing the ANOVA decomposition on captures the reduction of variance of a U-statistic when sample each ψs, we have size n increases. For m = 1, this is the trivial fact that the 2 empirical variance of a function based on i.i.d. samples is 2 1 EU = E E¯tψs(Xs) the actual variance scaled by n− . For m > 1, the functions  sX tXs  ψ(Xs) are no longer independent, nevertheless the variance of ∈C ⊂ 2 m the U-statistic drops by a factor of n . Our proof is valid for the = E E¯t ψs(Xs) more general non-symmetric case, and also seems to illuminate   (16) Xt s Xt,s the underlying statistical idea (the ANOVA decomposition) as ⊃ ∈C 2 well as the underlying geometry (Hilbert space projections) = E E¯tψs(Xs) , better than Hoeffding’s original combinatorial proof. In [16],   Xt s Xt,s Efron and Stein assert in their Comment 3 that an ANOVA- ⊃ ∈C like decomposition “yields one-line proofs of Hoeffding’s using the orthogonality of the ANOVA decomposition of U in important theorems 5.1 and 5.2”; presumably our proof of the last step. Lemma 2 is a generalization of what they had in mind. As Recall the elementary fact ( m y )2 m m y2, which i=1 i ≤ i=1 i mentioned before, the application of such a variance drop follows from the Cauchy-SchwarzP inequality. 
InP order to apply lemma to information inequalities was pioneered by ABBN this observation to the preceding expression, we estimate the [1]. They proved and used it in the case = n 1 using clear number of terms in the inner sum. The outer summation over t C C − t ¯ notation that we adapt in developing our generalization above. can be restricted to non-empty sets , since Eφ has no effect in A further generalization appears when we consider refinements the summation due to s having zero mean. Thus, any given ψ of our main inequalities in Section VIII. t in the expression has at least one element, and the sets s t in the collection must contain it; so the number of sets⊃ s C The third key tool in our approach to monotonicity is over which the inner sum is taken cannot exceed r. Thus we the well-known link between Fisher information and entropy, have whose origin is the de Bruijn identity first described by Stam 2 [39]. This identity, which identifies the Fisher information as EU 2 r E E¯tψs(Xs) ≤ the rate of change of the entropy on adding a normal, provides Xt s Xt,s  ⊃ ∈C a standard way of obtaining entropy inequalities from Fisher ¯ 2 = r E Etψs(Xs) (17) information inequalities. An integral form of the de Bruijn Xs tXs  identity was proved by Barron [4]. We express that integral ∈C ⊂ 2 = r E ψs(Xs) , in a form suitable for our purpose (cf., ABBN [1] and Verd´u Xs and Guo [43]). ∈C  TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 5
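The variance drop inequality of Lemma 2 lends itself to a quick simulation check. The sketch below is our construction, not the paper's: the collection of consecutive pairs, the mean-zero kernel psi, and the uniform inputs are arbitrary choices with r = 2. It estimates both sides of (15) by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)
    N, n = 500_000, 4
    X = rng.uniform(-1.0, 1.0, size=(N, n))        # independent, mean-zero coordinates

    # Collection of subsets: the consecutive pairs {1,2}, {2,3}, {3,4}; here r = 2.
    subsets = [(0, 1), (1, 2), (2, 3)]
    r = 2

    def psi(xi, xj):
        # a mean-zero component with both main effects and an interaction term
        return xi * xj + xi + xj

    components = [psi(X[:, i], X[:, j]) for (i, j) in subsets]
    U = sum(components)                            # the C-additive statistic

    lhs = np.mean(U**2)                                   # estimate of E[U^2]
    rhs = r * sum(np.mean(c**2) for c in components)      # r * sum_s E[psi_s(X_s)^2]
    print(f"E[U^2] ~ {lhs:.3f}  <=  r * sum_s E[psi_s^2] ~ {rhs:.3f}")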

Lemma 2 yields Lemma 3: Let X be a random variable with a density and 2 (d) 1 1 2 n 1 arbitrary finite variance. Suppose X = X + √tZ, where Z E ρj (n 1) Eρj = − I(Sn 1), t n  ≤ − n2 n − jX[n] jX[n] is a standard normal independent of X. Then, ∈ ∈

1 1 ∞ 1 so that H(X)= 2 log(2πe) 2 I(Xt) dt. (20) − Z0  − 1+ t n 1 I(Sn) − I(Sn 1). ≤ n − Proof: In the case that the variances of Z and X match, 1 2 If , then ′ and ; equivalent forms of this identity are given in [3] and [4]. X′ = aX ρX (X′)= a ρX (X) a I(X′)= I(X) Applying a change of variables using t = τv to equation hence

(2.23) of [3] (which is also equivalent to equation (2.1) of I(Yn)= nI(Sn) (n 1)I(Sn 1)= I(Yn 1). [4] by another change of variables), one has that ≤ − − − The inequality implied by Lemma 2 can be tight only if each ∞ 1 H(X)= 1 log(2πev) 1 I(X ) dt, ρj , considered as a function of the random variables Xi,i = j, 2 − 2  t − v + t 6 Z0 is additive. However, we already know that ρj is a function of if X has variance v and Z has variance 1. This has the the sum of these random variables. The only functions that are advantage of positivity of the integrand but the disadvantage both additive and functions of the sum are linear functions of that it seems to depend on v. One can use the sum; hence the two sides of (22) can be finite and equal only if each of the scores is linear, i.e., if all the are ∞ 1 1 ρj Xi log v = dt normal. It is trivial to check that X normal or I(Y ) = Z 1+ t − v + t 1 n 0 imply equality. ∞ to re-express it in the form (20), which does not depend on v. The monotonicity result for entropy in the i.i.d. case now follows by combining Proposition 1 and Lemma 3. III. MONOTONICITYINTHE IID CASE For clarity of presentation of ideas, we focus first on the i.i.d. setting. For i.i.d. summands, inequalities (2) and (4) Theorem 1 (MONOTONICITY OF ENTROPY: IIDCASE): reduce to the monotonicity property H(Yn) H(Ym) for Suppose X are i.i.d. random variables with densities and ≥ { i} n>m, where finite variance. If the normalized sum Yn is defined by (21), n 1 then Yn = Xi. (21) √n H(Yn) H(Yn 1). Xi=1 ≥ − We exhibit below how our approach provides a simple proof The two sides are finite and equal iff X1 is normal. of this monotonicity property, first proved by ABBN [1] using somewhat more elaborate means. We begin by showing the After the submission of these results to ISIT 2006 [28], monotonicity of the Fisher information. we became aware of a contemporaneous and independent development of the simple proof of the monotonicity fact (Theorem 1) by Tulino and Verd´u[42]. In their work they Proposition 1 (MONOTONICITY OF FISHER INFORMATION): take nice advantage of projection properties through minimum If Xi are i.i.d. random variables, and Yn 1 has an absolutely mean squared error interpretations. It is pertinent to note that { } − continuous density, then the proofs of Theorem 1 (in [42] and in this paper) share essentials, because of the following observations. I(Yn) I(Yn 1), (22) ≤ − Consider estimation of a random variable X from an obser- with equality iff X1 is normal or I(Yn)= . vation Y = X+Z in which an independent standard normal Z ∞ has been added. Then the score function of Y is related to the Proof: We use the following notation: The (unnormal- difference between two predictors of X (maximum likelihood ized) sum is Sn = i [n] Xi, and the leave-one-out sum and Bayes), i.e., (j) ∈ leaving out Xj is S P= i=j Xi. Setting ρ(Sn) to be the 6 (j) ρ(Y )= Y E[X Y ], (23) score of Sn and ρj to beP the score of S , we have by − − | 2 Lemma 1 that ρ(Sn)= E[ρj Sn] for each j, and hence and hence the Fisher information I(Y ) = Eρ (Y ) is the | same as the mean square difference E[(Y E[X Y ])2], or 1 − | ρ(Sn)= E ρj Sn . equivalently, by the Pythagorean identity, n  jX[n] ∈ I(Y )= Var(Z) E[(X E[X Y ])2]. (24) Since the norm of the score is not less than that of its − − | projection (i.e., by the Cauchy-Schwarz inequality), Thus the Fisher information (entropy derivative) is related to the minimal mean squared error. These (and more general) 1 2 I(S )= E[ρ2(S )] E ρ . 
identities relating differences between predictors to scores and n n ≤ n j   jX[n] relating their mean squared errors to Fisher informations are ∈ TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 6 developed in statistical decision theory in the work of Stein and as desired. The application of Lemma 2 can yield equality s s Brown. These developments are described, for instance, in the only if each ρ(T ( )) is additive; since the score ρ(T ( )) is s point estimation text by Lehmann and Casella [25][Chapters already a function of the sum T ( ), it must in fact be a linear 4.3 and 5.5], in their study of Bayes risk, admissibility and function, so that each Xi must be normal. minimaxity of conditional means E[X Y ]. Tulino and Verd´u[42] emphasize the| minimal mean squared Naturally, it is of interest to minimize the upper bound of error property of the entropy derivative and associated projec- Proposition 2 over the weighting distribution w, which is easily tion properies that (along with the variance drop inequality done either by an application of Jensen’s inequality for the which they note in the leave-one-out case) also give Proposi- reciprocal function, or by the method of Lagrange multipli- tion 1 and Theorem 1. That is a nice idea. Working directly ers. Optimization of the bound implies that Proposition 2 is with the minimal mean squared error as the entropy derivative equivalent to the following Fisher information inequalities. they bypass the use of Fisher information. In the same manner Verd´uand Guo [43] give an alternative proof of the Shannon- Stam entropy power inequality. If one takes note of the above Theorem 2: Let Xi be independent random variables (s) { } identities one sees that their proofs and ours are substantially such that each T has an absolutely continuous density. Then the same, except that the same quantities are given alternative 1 1 1 . (32) interpretations in the two works, and that we give extensions (s) I(Tn) ≥ r Xs I(T ) to arbitrary collections of subsets. ∈C When is discriminating, the two sides are positive and equal iff eachC X is normal and is also balanced. IV. FISHER INFORMATION INEQUALITIES i C
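Since for normal random variables the inverse Fisher information equals the variance, Theorem 2 reduces in the Gaussian case to an arithmetic statement about subset-sum variances, and the equality condition can be seen directly. The sketch below is ours; the variances and the two collections are arbitrary illustrative choices.

    import numpy as np

    sigma2 = np.array([0.5, 1.0, 2.0, 4.0])      # variances of X_1, ..., X_4 (arbitrary)
    n = len(sigma2)

    def check(collection):
        r = max(sum(i in s for s in collection) for i in range(n))
        lhs = sigma2.sum()                                          # 1/I(T_n) for normals
        rhs = sum(sigma2[list(s)].sum() for s in collection) / r    # (1/r) sum_s 1/I(T^(s))
        print(f"r = {r}:  1/I(T_n) = {lhs:.2f}  >=  (1/r) sum_s 1/I(T^(s)) = {rhs:.2f}")

    balanced   = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]   # every index appears in r = 2 sets
    unbalanced = [{0, 1}, {1, 2}, {1, 3}]           # index 1 appears in 3 sets, the rest in 1
    check(balanced)      # equality: normal summands and a balanced collection
    check(unbalanced)    # strict inequality for the unbalanced collection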

In this section, we demonstrate our core inequality (8). Remark 2: Theorem 2 for the special case = 1 of singleton sets is sometimes known as the “StamC inequality”C and has a long history. Stam [39] was the first to prove Proposition 2: Let Xi be independent random variables Proposition 2 for , and he credited his doctoral advisor { } 1 with densities and finite variances. Define de Bruijn with noticingC the equivalence to Theorem 2 for s . Subsequently several different proofs have appeared: in T = X and T ( ) = X , (25) 1 n i i BlachmanC [7] using Lemma 1, in Carlen [9] using another iX[n] Xi s ∈ ∈ superadditivity property of the Fisher information, and in for each s , where is an arbitrary collection of subsets of Kagan [22] as a consequence of an inequality for Pitman s [n]. Let w∈Cbe any probabilityC distribution on . If each T ( ) estimators. On the other hand, the special case of the leave- C has an absolute continuous density, then one-out sets = n 1 in Theorem 2 was first proved in ABBN [1]. Zamir [46]C usedC − data processing properties of the Fisher 2 (s) I(Tn) r wsI(T ), (26) information to prove some different extensions of the 1 case, ≤ sX C ∈C including a multivariate version; see also Liu and Viswanath where ws = w( s ). When is discriminating, both sides can [27] for some related interpretations. Our result for arbitrary { } C be finite and equal only if each Xi is normal. collections of subsets is new; yet our proof of this general result is essentially no harder than the elementary proofs of s Proof: Let ρs be the score of T ( ). We proceed in accor- the original inequality by Stam and Blachman. dance with the outline in the introduction. Indeed, Lemma 1 implies that ρ(Tn) = E[ρs Tn] for each s. Taking a convex Remark 3: While inequality (29) in the proof above combinations of these identities| gives, for any ws such that uses a Pythagorean inequality, one may use the associated { } s ws =1, Pythagorean identity to characterize the difference as the mean ∈C P square of s wsρs ρ(Tn). In the i.i.d. case with n = 2m and a disjoint pair− of subsets of size m, this drop in ρ(Tn)= wsE[ρs Tn]= E wsρs Tn . (27) P |   C sX sX Fisher distance from the normal played an essential role in ∈C ∈C the previously mentioned CLT analyses of [37], [8], [4], [21]. Applying the Cauchy-Schwarz inequality, Furthermore for general , we have from the variance drop 2 analysis that the gap inC inequality (30) is characterized by ρ2(T ) E wsρs T . (28) n ≤   n the non-additive ANOVA components of the score functions. sX ∈C We point out these observations as an encouragement to Taking the expectation and then applying Lemma 2 in succes- examination of the information drop for collections such as sion, we get n 1 and n/2 in refined analysis of CLT rates under more C − C 2 general conditions. 2 E ρ (Tn) E s wsρs (29) ≤  ∈C  V. ENTROPY INEQUALITIES   P 2 r s E(wsρs) (30) The Fisher information inequality of the previous section ≤ ∈C s = r Ps ws2I(T ( )), (31) yields a corresponding entropy inequality. P ∈C TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 7
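The monotonicity of Theorem 1 can also be illustrated in closed form rather than by simulation. For i.i.d. Exponential(1) summands, the unnormalized sum S_n is Gamma(n, 1), whose differential entropy in nats is n + log Γ(n) + (1 − n)ψ(n); combining this with the scaling H(aX) = H(X) + log a gives H(Y_n) explicitly. The script below is our illustration (it assumes SciPy is available) and shows these entropies increasing towards the entropy of the standard normal.

    import numpy as np
    from scipy.special import gammaln, digamma

    # For X_i i.i.d. Exponential(1): S_n = X_1 + ... + X_n is Gamma(n, 1), with
    # H(S_n) = n + log Gamma(n) + (1 - n) * digamma(n) (in nats).
    # The standardized sum Y_n = S_n / sqrt(n) then has H(Y_n) = H(S_n) - 0.5 * log n.
    def H_Yn(n):
        return n + gammaln(n) + (1.0 - n) * digamma(n) - 0.5 * np.log(n)

    for n in range(1, 9):
        print(n, round(float(H_Yn(n)), 4))          # non-decreasing in n, as Theorem 1 asserts
    print("limit:", round(0.5 * np.log(2 * np.pi * np.e), 4))   # entropy of N(0, 1)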

sufficient for our needs), we provide such a construction of Zj Proposition 3 (ENTROPY OF SUMS): Let X be inde- and their subset sums in the case of rational weights, say ws = { i} pendent random variables with densities. Then, for any prob- W (s)/M, where the denominator M may be large. Indeed set ability distribution w on such that ws 1 for each s, Z ,...,Z independent mean-zero normals each of variance C ≤ r 1 M 1/M. For each s we construct a set s′ that has precisely rW (s) H X wsH X normals and each normal is assigned to precisely r of these  i ≥  i iX[n] sX Xi s (33) sets. This may be done systematically by considering the sets ∈ ∈C ∈ 1 1 s s s s + H(w) log r, 1, 2,... in in some order. We let 1′ be the the first rW ( 1) 2 − 2 C s indices j (not more than M by assumption), we let 2′ be the where s 1 is the discrete entropy of . H(w)= s w log ws w next rW (s2) indices (looping back to the first index once we ∈C When is discriminating,P equality can hold only if each Xi C pass M) and so on. This proves the validity of (34) for rational is normal. weights; its validity for general weights follows by continuity.
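The cyclic assignment of normals described above can be made concrete. The sketch below is our illustration with hypothetical inputs (a four-set collection on [4] with equal rational weights w_s = 1/4, r = 2 and M = 4, chosen so that r w_s ≤ 1); it builds the sets s′ and confirms that each s′ receives exactly r·W(s) of the normals Z_1, ..., Z_M while no index is used more than r times.

    from fractions import Fraction

    # Hypothetical collection C on [4] and rational weights w_s = W(s)/M with r*w_s <= 1.
    C = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4}), frozenset({4, 1})]
    r = 2
    M = 4
    w = {s: Fraction(1, 4) for s in C}
    W = {s: int(w[s] * M) for s in C}

    # Cyclic construction: set s gets the next r*W(s) indices, wrapping around modulo M.
    assigned, pos = {}, 0
    for s in C:
        block = [(pos + j) % M + 1 for j in range(r * W[s])]
        assigned[s] = block
        pos += r * W[s]

    counts = {j: sum(j in b for b in assigned.values()) for j in range(1, M + 1)}
    print(assigned)   # each set s receives exactly r*W(s) = 2 of the normals
    print(counts)     # each normal index is used exactly r = 2 times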

Proof: As pointed out in the Introduction, Proposition 2 is equivalent to Remark 4: One may re-express the inequality of Proposi- i s Yi I Yi wsI ∈ ,   ≤ P√wsr  tion 3 as a statement for the relative entropies with respect to iX[n] Xs ∈ ∈C normals of the same variance. If X has density f, we write for independent random variables Y . For application of the D(X) = E[log f(X) ] = H(Z) H(X), where Z has the i g(X) − Fisher information inequality to entropy inequalities we also Gaussian density g with the same variance as X. Then, for need for an independent standard normal Z that any probability distribution w on a balanced collection , s C T 1 I(Tn + √τZ) wsI( + √τZ), (34) D X wsD X + D(w η), ≤ √rws  i ≤  i 2 k (35) sX iX[n] sX Xi s ∈C ∈ ∈C ∈ at least for suitable values of ws. We will show below that s wheres η is the probability distribution on given by η = this holds when rws 1 for each s, and thus Var(T ( )) C ≤ , and D(w η) is the (discrete) relative entropy of w rVar(Tn) k 1 1 ∞ 1 with respect to η. When is also discriminating, equality holds H(Tn)= 2 log(2πe) 2 I(T[n] + √τZ) dt C − Z0  − 1+ τ  iff each Xi is normal. Theorem 1 of Tulino and Verd´u[42] is the special case of inequality (35) for the collection of leave- ws 1 log(2πe) 2 one-out sets. Inequality (35) can be further extended to the case ≥ Xs  − ∈C s where is not balanced, but in that case η is a subprobability C 1 ∞ T 1 distribution. The conclusions (33) and (35) are equivalent, and 2 I + √τZ dt Z0  √rws  − 1+ τ   as seen in the next section, are equivalent to our subset sum s T entropy power inequality. = wsH , s Xs √rw  ∈C Remark 5: It will become evident in the next section that using (34) for the inequality and Lemma 3 for the equalities. the condition rws 1 in Proposition 3 is not needed for the By the scaling property of entropy, this implies validity of the conclusions≤ (33) and (35) (see Remark 8). s s 1 s s 1 H(Tn) w H(T ) 2 w log w 2 log r, ≥ sX − sX − VI. ENTROPYPOWERINEQUALITIES ∈C ∈C proving the desired result. Proposition 3 is equivalent to a subset sum entropy power The inequality (34) is true though not immediately so (the e2H(X) inequality. Recall that the entropy power 2πe is the variance naive approach of adding an independent normal to each Xi of the normal with the same entropy as X. The term entropy does not work to get our desired inequalities when the subsets power is also used for the quantity have more than one element). What we need is to provide (X)= e2H(X), (36) a collection of independent normal random variables Zj for N some set of indices j (possibly many more than n of them). even when the constant factor of 2πe is excluded. For each s in we need an assignment of subset sums of Zj sC′ (called say Z ) which has variance rws, such that no j is in s more than r of the subsets ′. Then by Proposition 2 (applied Theorem 3: For independent random variables with finite s s s to the collection ′ of augmented sets ′ for each in ) variances, we have C ∪ C s s′ 1 I(T + √tZ) rws2 I(T + √tZ ) Xi Xi . (37) n N   ≥ r s N  s  ≤ sX iX[n] X Xi ∈C ∈ ∈C ∈ from which the desired inequality follows using the fact that When is discriminating, the two sides are equal iff each Xi a2I(X) = I(X/a). Assuming that rws 1 (which will be is normalC and is also balanced. ≤ C TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 8

s Proof: Let T be the subset sums as defined in (25). De- of the equivalence between the linear form of Proposition 3 s fine Z = s (T ) as the normalizing constant producing and Theorem 3 reduces to the non-negativity of the relative N the weightsP ∈C entropy. s (T ) s µ = N Z . (38) s Remark 8: To see that (33) (and hence (35)) holds without If N(T ) > Z/r for some s then we trivially have any assumption on w, simply note that when the assumption s Z ∈ C N(Tn) N(T ) > /r which is the desired result, so now is not satisfied, the entropy power inequality of Theorem 3 ≥ s assume N(T ) Z/r for all s , that is, rµs 1 for each implies trivially that s. ≤ ∈C ≤ 1 1 D(w µ) (T ) r− Z r− e− k Z, Since N n ≥ ≥ s s H(T ( ))= 1 log (T )= 1 log[µsZ], for µ defined by (38), and inverting the steps of (39) yields 2 N 2 (33). Proposition 3 implies for any weighting distribution w with rws 1 that VII. ENTROPY IS MONOTONE ON AVERAGE ≤ In this section, we consider the behavior of the entropy of 1 s sZ 1 H(Tn) 2 w log[µ ]+ 2 H(w) log r ≥ s  −  sums of independent but not necessarily identically distributed X (39) ∈C Z (i.n.i.d.) random variables under various scalings. 1 First we look at sums scaled according to the number of = 2 log D(w µ) ,  r − k  summands. Fix the collection = s [n] : s = m . For Cm { ⊂ | | } where D(w µ) is the (discrete) relative entropy. Exponentiat- i.n.i.d. random variables Xi, let ing gives k i [n] Xi s i s Xi ∈ ( ) 1 D(w µ) Yn = and Ym = ∈ (41) (T ) r− e− k Z. (40) P √n P√m N n ≥ for s be the scaled sums. Then Proposition 3 applied to It remains to optimize the right side over w, or equivalently ∈Cm to minimize D(w µ) over feasible w. Since rµs 1 by m implies k ≤ C assumption, w = µ is a feasible choice, yielding the desired s n H(Y ) wsH(Y ( )) 1 log H(w) . inequality. n ≥ m − 2  m −  sXm The necessary conditions for equality follow from that for ∈C Proposition 3, and it is easily checked using Fact 1 that this The term on the right indicates that we pay a cost for is also sufficient. The proof is complete. deviations of the weighting distribution w from the uniform. In particular, choosing w to be uniform implies that entropy is “monotone on average” with uniform weights for scaled sums Remark 6: To understand this result fully, it is useful to of i.n.i.d. random variables. Applying Theorem 3 to m yields note that if a discriminating collection is not balanced, it can a similar conclusion for entropy power. These observationsC , C always be augmented to a new collection ′ that is balanced which can also be deduced from the results of ABBN [1], are C in such a way that the inequality (37) for ′ becomes strictly collected in Corollary 1. better than that for . Indeed, if index i appearsC in r(i) sets of , one can always findC r r(i) sets of not containing i (since C − C r ), and add i to each of these sets. The inequality (37) for Corollary 1: Suppose Xi are independent random variables ≤ |C| ′ is strictly better since this collection has the same r and the with densities, and the scaled sums are defined by (41). Then C subset sum entropy powers on the right side are higher due to 1 s H(Y ) H(Y ( )) the addition of independent random variables. While equality n ≥ n m m sXm in (37) is impossible for the unbalanced collection , it holds ∈C (42) C 1 s for normals for the augmented, balanced collection ′. This and (Y ) (Y ( )). C N n ≥ n N m illuminates the conditions for equality in Theorem 3. 
m sXm  ∈C Remark 9: It is interesting to contrast Corollary 1 with the Remark 7: The traditional Shannon inequality involving the following results of Han [17] and Dembo, Cover and Thomas entropy powers of the summands [36] as well as the inequality [15]. With no assumptions on (X1,...,Xn) except that they of ABBN [1] involving the entropy powers of the “leave-one- have a joint density, the above authors show that out” normalized sums are two special cases of Theorem 3, H(X ) 1 H(Xs) corresponding to = 1 and = n 1. Proofs of the former [n] (43) subsequent to Shannon’sC C includeC thoseC − of Stam [39], Blachman n ≤ n m m sXm [7], Lieb [26] (using Young’s inequality for convolutions with  ∈C and the sharp constant), Dembo, Cover and Thomas [15] (building 1 1 1 on Costa and Cover [13]), and Verd´uand Guo [43]. Note that n m (X[n]) n (Xs) . (44) unlike the previous proofs of these special cases, our proof N ≤ s N   m Xm    ∈C TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 9

1 where H(Xs) and (Xs) denote the joint entropy and joint where ψ represents either the inverse Fisher information I− entropy power of XNs respectively. These bounds have a form or the entropy H or the entropy power . very similar to that of Corollary 1. In fact, such an analogy The conclusion of Theorem 4 is a particularN instance of between inequalities for entropy of sums and joint entropies (48). Indeed, we can express the random variables of interest as goes much deeper (so that all of the entropy power inequalities X = σ X , so that each X has variance 1. Choose a = σi , i i i′ i′ i √vn vs we present here have analogues for joint entropy). More details which is valid since a2 =1. Then a¯s2 = a2 = i [n] i i s i vn can be found in Madiman and Tetali [30]. and λs = ηs. Thus P ∈ P ∈ Next, we consider sums of independent random variables Xi Vn = = aiXi′, √vn standardized by their variances. This is motivated by the fol- iX[n] iX[n] lowing consideration. Consider a sequence of i.n.i.d. random ∈ ∈ and variables Xi : i N with zero mean and finite variances, 2 { ∈ } s X 1 σi = Var(Xi). The variance of the sum of n variables is ( ) i V = = aiX′. denoted v = σ2, and the standardized sum is √vs a¯s i n i [n] i Xi s Xi s P ∈ ∈ ∈ i [n] Xi Now an application of (48) gives the desired result, not just Vn = ∈ . 1 P v for H but also for and I− . √ n N The Lindeberg-Feller central limit theorem gives conditions under which V N(0, 1). Johnson [20] has proved an n ⇒ Remark 10: Since the collection is balanced, it follows entropic version of this classical theorem, showing (under from Fact 1 that ηs defines a probabilityC distribution on . This appropriate conditions) that H(V ) 1 log(2πe) and hence C n → 2 justifies the interpretation of Theorem 4 as displaying “mono- the relative entropy from the unit normal tends to 0. Is there tonicity on average”. The averaging distribution η is tuned to an analogue of the monotonicity of information in this setting? the random variables of interest, through their variances. We address this question in the following theorem, and give two proofs. The first proof is based on considering appropriately standardized linear combinations of independent Remark 11: Theorem 4 also follows directly from (35) random variables, and generalizes Theorem 2 of ABBN [1]. upon setting ws = ηs and noting that the definition of D(X) The second proof is outlined in Remark 11. is scale invariant (i.e., D(aX) = D(X) for any real number a).
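As a small check of Remark 10, the weights η_s = v_s/(r v_n) appearing in Theorem 4 do sum to one for a balanced collection, by Fact 1. The snippet below is ours (the variances are arbitrary choices) and verifies this for the leave-one-out collection.

    import numpy as np

    sigma2 = np.array([0.5, 1.0, 2.0, 4.0])            # Var(X_i), arbitrary choices
    n = len(sigma2)
    C = [set(range(n)) - {i} for i in range(n)]        # leave-one-out collection, balanced
    r = n - 1

    v_n = sigma2.sum()
    eta = np.array([sigma2[list(s)].sum() / (r * v_n) for s in C])   # eta_s = v_s / (r v_n)
    print(eta, eta.sum())    # a probability distribution on C: the weights sum to 1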

Theorem 4 (MONOTONICITY ON AVERAGE): Suppose Let us briefly comment on the interpretation of this result. Xi : i [n] are independent random variables with As discussed before, when the summands are i.i.d., entropic { ∈ } 2 2 densities, and Xi has finite variance σ . Set vn = σ i i [n] i convergence of Vn to the normal was shown in [4], and ABBN 2 s ∈ and vs = i s σi for sets in the balanced collectionP . [1] showed that this sequence of entropies is monotonically in- ∈ C Define theP standardized sums creasing. This completes a long-conjectured intuitive picture of i [n] Xi the central limit theorem: forming normalized sums that keep Vn = ∈ (45) the variance constant yields random variables with increasing P√vn entropy, and this sequence of entropies converges to the and maximum entropy possible, which is the entropy of the normal (s) i s Xi with that variance. In this sense, the CLT is a formulation of V = ∈ . (46) P√vs the “second law of thermodynamics” in physics. Theorem 4 Then above shows that even in the setting of variance-standardized sums of i.n.i.d. random variables, a general monotonicity on (s) H(Vn) ηsH(V ), (47) average property holds with respect to an arbitrary collection ≥ sX of normalized subset sums. This strengthens the “second law” ∈C vs where ηs = 1 . Furthermore, if is also discriminating, interpretation of central limit theorems. r vn C A similar monotonicity on average property also holds then the inequality is strict unless each Xi is normal. for appropriate notions of Fisher information in convergence of sums of discrete random variables to the Poisson and Proof: Let ai,i [n] be a collection of non-negative real 1 ∈n 2 2 2 compound Poisson distributions; details may be found in [29]. numbers such that i=1 ai = 1. Define a¯s = i s ai a¯s2 ∈ and the weights λs P= for s . Applying the inequalitiesP  r VIII. A REFINED INEQUALITY of Theorem 2, Proposition 3 and∈C Theorem 3 to independent random variables aiXi′, and utilizing the scaling properties of Various extensions of the basic inequalities presented above the relevant information quantities, one finds that are possible; we present one here. To state it, we find it convenient to recall the notion of a fractional packing from n 1 ψ a X′ λsψ a X′ , (48) discrete mathematics (see, e.g., Chung, F¨uredi, Garey and  i i ≥ a¯s i i Xi=1 sX Xi s Graham [11]). ∈C ∈ TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 10

Definition 1: Let be a collection of subsets of [n]. A the variance drop lemma. The idea of looking for such refine- collection βs : s C of non-negative real numbers is called ments with coefficients depending on s arose in conversations a fractional{ packing∈ C} for if with Tom Cover and Prasad Tetali at ISIT 2006 in Seattle, C after our basic results described in the previous sections were βs 1 (49) ≤ presented. In particular, Prasad’s joint work with one of us s Xi,s ∋ ∈C [30] influenced the development of Theorem 5. for each i in [n]. Theorem 5: Let βs : s be any fractional packing for Note that if the numbers βs are constrained to only take the . Then { ∈ C} values 0 and 1, then the condition above entails that not more C 1 1 than one set in can contain i, i.e., that the sets s are I− (X1 + . . . + Xn) βsI− Xj . (52) ≥ C ∈ C sX Xj s pairwise disjoint, and provide a packing of the set [n]. We may ∈C ∈  interpret a fractional packing as a “packing” of [n] using sets For given subset sum informations, the best such lower in , each of which contains only a fractional piece (namely, bound on the information of the total sum would involve βs)C of the elements in that set. maximizing the right side of (52) subject to the linear con- We now present a refined version of Lemma 2. straints (49). This linear programming problem, a version of which is the problem of optimal fractional packing well- Lemma 4 (VARIANCE DROP: GENERAL VERSION): studied in combinatorics (see, e.g., [11]), does not have an explicit solution in general. Suppose U(X1,...,Xn) is a -additive function with mean zero components, as in LemmaC 2. Then A natural choice of a fractional packing in Theorem 5 leads to the following corollary. 1 2 EU 2 E ψs(Xs) , (50) ≤ βs sX  Corollary 2: For any collection of subsets of [n], let r(i) ∈C denote the number of sets in thatC contain i. In the same for any fractional packing βs : s . C { ∈ C} setting as Theorem 2, we have Proof: As in the proof of Lemma 2, we have 1 1 , (53) 2 I(X1 + . . . + Xn) ≥ r(s)I s Xj 2 sX j EU = E E¯tψs(Xs) . ∈C ∈   P  Xt s Xt,s where r(s) is the maximum value of r(i) over the indices i ⊃ ∈C s We now proceed to perform a different estimation of this in . s s expression, recalling as in the proof of Lemma 2, that the We say that is quasibalanced if r(i)= r( ) for each i s C ∈ outer summation over t can be restricted to non-empty sets t. and each . If is discriminating, equality holds in (53) if the X are normalC and is quasibalanced. By the Cauchy-Schwarz inequality, i C 2 E¯tψs(Xs)   ≤ Remark 12: For any collection and any s , r(s) r s Xt,s C ∈C ≤ ⊃ ∈C by definition. Thus Theorem 5 and Corollary 2 generalize 2 1 Theorem 2. Furthermore, from the equality conditions in ( βs)2 E¯tψs(Xs) .    √βs   Corollary 2, we see that equality can hold in these more s Xt,s p s Xt,s ⊃ ∈C ⊃ ∈C general inequalities even for collections that are not bal- Since any t of interest has at least one element, the definition anced, which was not possible with the originalC formulation of a fractional packing implies that in Theorem 2. βs 1. ≤ Remark 13: One can also give an alternate proof of Theo- s Xt,s ⊃ ∈C rem 5 using Corollary 2 (which could be proved directly), so Thus that the two results are mathematically equivalent. The key to

2 1 2 doing this is the observation that nowhere in our proofs do we EU E E¯tψs(Xs) s ≤ βs actually require that the sets in are distinct. In other words, Xt s Xt,s  C ⊃ ∈C given a collection , one may look at an augmented collection 1 2 that has ks copiesC of each set s in . Then the inequality = E E¯tψs(Xs) (51) βs (53) holds for the augmented collectionC with the counts sX tXs  r(i) ∈C ⊂ and r(s) appropriately modified. By considering arbitrary aug- 1 2 = E ψs(Xs) . mentations, one can obtain Theorem 5 for fractional packings βs sX with rational coefficients. An approximation argument yields ∈C  the full version. This method of proof, although picturesque, Exactly as before, one can obtain inequalities for Fisher is somewhat less transparent in the details. information, entropy and entropy power based on this form of TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 11
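In the Gaussian case, where the inverse Fisher information equals the variance, Theorem 5 and Corollary 2 again reduce to arithmetic, and the gain over Theorem 2 for unbalanced collections is visible numerically. The sketch below is ours: the collection, the variances, and the packing β_s = 1/r(s) are arbitrary illustrative choices. It verifies the fractional packing condition (49) and compares the bound of Corollary 2 with that of Theorem 2.

    import numpy as np
    from fractions import Fraction

    sigma2 = np.array([1.0, 2.0, 0.5, 1.5])        # Var(X_i), arbitrary choices
    n = len(sigma2)
    C = [{0}, {1}, {1, 2}, {2, 3}, {3, 1}]         # an unbalanced collection

    r_i = [sum(i in s for s in C) for i in range(n)]    # r(i): sets containing index i
    r   = max(r_i)                                      # the r of Theorem 2
    r_s = [max(r_i[i] for i in s) for s in C]           # r(s) of Corollary 2
    beta = [Fraction(1, rs) for rs in r_s]              # beta_s = 1/r(s), a fractional packing

    # Packing condition (49): for every i, the beta_s over sets containing i sum to at most 1.
    assert all(sum(b for b, s in zip(beta, C) if i in s) <= 1 for i in range(n))

    v = [sigma2[list(s)].sum() for s in C]              # subset-sum variances = 1/I for normals
    print("1/I(T_n)          =", sigma2.sum())
    print("Theorem 2 bound   =", sum(v) / r)
    print("Corollary 2 bound =", sum(vs / rs for vs, rs in zip(v, r_s)))   # larger, still valid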

Remark 14: It is straightforward to extend Theorem 5 and limit theorems in i.n.i.d. settings, and perhaps rate conclusions Corollary 2 to the multivariate case, where Xi are independent under less restrictive assumptions than those imposed in [21] Rd-valued random vectors, and I(X) represents the trace of and [2]. the Fisher information matrix of X. Similarly, extending Theo- The new Fisher information inequalities we present are also rem 3, one obtains for independent Rd-valued random vectors of interest, because of the relationship of inverse Fisher infor- X1,...,Xn with densities and finite covariance matrices that mation to asymptotically efficient estimation. In this context,

2H(P s Xj ) the subset sum inequality can be interpreted as a comparison 2H(X1+...+Xn) 1 j∈ e d e d , of an asymptotic mean squared error achieved with use of ≥ r sX ∈C all X1,...,Xn, and the sum of the mean squared errors which implies the monotonicity of entropy for standardized achieved in distributed estimation by sensors that observe s s sums of d-dimensional random vectors. We leave the details (Xi,i ) for . The parameter of interest can either be a ∈ ∈C to the reader. location parameter, or (following [21]) a natural parameter of exponential families for which the minimal sufficient statistics Remark 15: It is natural to speculate whether an analogous are sums. Furthermore, a non-asymptotic generalization of the subset sum entropy power inequality holds with 1/r(s) inside new Fisher information inequalities holds (see [5] for details), the sum. For each r between the minimum and the maximum which sheds light on minimax risks for estimation of a location of the r(s), we can split into the sets = s : r(s)= parameter from sums. C Cr { ∈C r . Under an assumption that no one set s in r dominates, that Entropy inequalities involving subsets of random variables } s∗ C s is, that there is no s∗ r with N(T ) > s N(T )/r, (although traditionally not involving sums) have played an im- ∈C r we are able to create suitable normals forP perturbation∈C of portant role in understanding some problems of graph theory. the Fisher information inequality and integrate (in the same Radhakrishnan [33] provides a nice survey, and some recent manner as in the proof of Proposition 3) to obtain developments (including joint entropy inequalities analogous s to the entropy power inequalities in this paper) are discussed N(T ) N(Tn) s . (54) in [30]. The appearance of fractional packings in the refined ≥ sX r( ) inequality we present in Section VIII is particularly suggestive ∈C The quasibalanced case (in which r(i) is the same for each i of further connections to be explored. in s) is an interesting special case. Then the unions of sets in In multi-user information theory, subset sums of rates and information quantities involving subsets of random variables r are disjoint for distinct values of r. So for quasibalanced collectionsC the refined subset sum entropy power inequality are critical in characterizing rate regions of certain source (54) always holds by combining our observation above with and channel coding problems (e.g., m-user multiple access the Shannon-Stam entropy power inequality. channels). Furthermore, there is a long history of the use of the classical entropy power inequality in the study of rate regions, see, e.g., Shannon [36], Bergmans [6], Ozarow [32], Costa [12] IX. CONCLUDING REMARKS and Oohama [31]. For instance, the classical entropy power Our main contributions in this paper are rather general - inequality was a key tool in Ozarow’s solution of the Gaussian C superadditivity inequalities for Fisher information and entropy multiple description problem for two multiple descriptions, but power that hold for arbitrary collections , and specialize seems to have been inadequate for problems involving three or C to both the Shannon-Stam inequalities and the inequalities more descriptions (see Wang and Viswanath [45] for a recent of ABBN [1]. In particular, we prove all these inequalities solution of one such problem without using the entropy power transparently using only simple projection facts, a variance inequality). 
It seems natural to expand the set of tools available drop lemma and classical information-theoretic ideas. A re- for investigation in these contexts. markable feature of our proofs is that their main ingredients are rather well-known, although our generalizations of the variance drop lemma appear to be new and are perhaps of APPENDIX I independent interest. Both our results as well as the proofs THE ANALYSIS OF VARIANCE DECOMPOSITION lend themselves to intuitive statistical interpretations, several In order to prove the variance drop lemma, we use a de- of which we have pointed out in the paper. We now point to composition of functions in L2(Rn), which is nothing but the potential directions of application. Analysis of Variance (ANOVA) decomposition of a statistic. The inequalities of this paper are relevant to the study of For any j [n], Ej ψ denotes the conditional expectation of central limit theorems, especially for i.n.i.d. random variables. ∈ ψ, given all random variables other than Xj , i.e., Indeed, we demonstrated monotonicity on average properties in such settings. Moreover, most approaches to entropic central Ej ψ(x1,...,xn)= E[ψ(X1,...,Xn) Xi = xi i = j] (55) limit theorems involve a detailed quantification of the gaps | ∀ 6 associated with monotonicity properties of Fisher informa- averages out the dependence on the j-th coordinate. tion when the summands are non-normal. Since the gap in our inequality is especially accessible due to our use of a Pythagorean property of projections (see Remark 3), it could Fact 2 (ANOVA DECOMPOSITION): Suppose ψ : Rn be of interest in obtaining transparent proofs of entropic central R satisfies Eψ2(X ,...,X ) < , i.e., ψ L2, for→ 1 n ∞ ∈ TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 12 independent random variables X ,X ,...,X . For t [n], ACKNOWLEDGMENT 1 2 n ⊂ define the orthogonal linear subspaces We are indebted to Prasad Tetali for several interesting 2 conversations and for the beneficial influence of Prasad’s joint t = ψ L : Ej ψ = ψ1 j /t j [n] (56) H { ∈ { ∈ } ∀ ∈ } work with MM on entropy inequalities. We are also very of functions depending only on the variables indexed by grateful to the anonymous referees and the Associate Editor for t 2 . Then L is the orthogonal direct sum of this family of careful readings of the paper, and their insightful comments. subspaces, i.e., any ψ L2 can be written in the form ∈ ψ = E¯tψ, (57) REFERENCES tX[n] ⊂ [1] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on the monotonicity of entropy. J. Amer. Math. Soc., 17(4):975– where E¯tψ t, and the subspaces t (for t [n]) are ∈ H H ⊂ 982 (electronic), 2004. orthogonal to each other. [2] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. On the rate of convergence in the entropic central limit theorem. Probab. Theory Proof: Let Et denote the integrating out of the variables Related Fields, 129(3):381–390, 2004. t [3] A.R. Barron. Monotonic central limit theorem for densities. Technical in , so that Ej = E j . Keeping in mind that the order of Report #50, Department of Statistics, Stanford University, California, integrating out independent{ } variables does not matter (i.e., the 1984. E are commuting projection operators in L2), we can write [4] A.R. Barron. Entropy and the central limit theorem. Ann. Probab., j 14:336–342, 1986. n [5] M. Madiman, A. R. Barron, A. M. Kagan, and T. Yu. On the minimax risks for estimation of a location parameter from sums. Preprint, 2007. 
ψ = [Ej + (I Ej )]ψ − [6] P. Bergmans. A simple converse for broadcast channels with additive jY=1 white gaussian noise. IEEE Trans. Inform. Theory, 20(2):279–280, 1974. [7] N.M. Blachman. The convolution inequality for entropy powers. IEEE = Ej (I Ej )ψ − (58) Trans. Information Theory, IT-11:267–271, 1965. tX[n] jY /t jYt ⊂ ∈ ∈ [8] L.D. Brown. A proof of the central limit theorem motivated by the ¯ Cram´er-Rao inequality. In Statistics and probability: essays in honor of = Etψ, C. R. Rao, pages 141–148. North-Holland, Amsterdam, 1982. tX[n] [9] E. A. Carlen. Superadditivity of Fisher’s information and logarithmic ⊂ Sobolev inequalities. J. Funct. Anal., 101(1):194–211, 1991. where [10] E. A. Carlen and A. Soffer. Entropy production by block variable summation and central limit theorems. Comm. Math. Phys., 140, 1991. E¯tψ Etc (I Ej )ψ. (59) [11] F.R.K. Chung, Z. F¨uredi, M. R. Garey, and R.L. Graham. On the ≡ − jYt fractional covering number of hypergraphs. SIAM J. Disc. Math., ∈ 1(1):45–49, 1988. Note that if j is in t, Ej E¯tψ =0, being in the image of the [12] M.H.M. Costa. On the Gaussian interference channel. IEEE Trans. operator E (I E )=0. If j is not in t, E¯tψ is already in Inform. Theory, 31(5):607–615, 1985. j − j [13] M.H.M. Costa and T.M. Cover. On the similarity of the entropy power the image of Ej , and a further application of the projection inequality and the Brunn-Minkowski inequality. IEEE Trans. Inform. E has no effect. Thus E¯tψ is in t. Theory, 30(6):837–839, 1984. j H Finally we wish to show that the subspaces t are orthog- [14] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, t t H New York, 1991. onal. For any distinct sets 1 and 2 in [n], there exists an [15] A. Dembo, T.M. Cover, and J.A. Thomas. Information-theoretic inequal- index j which is in one (say t1), but not the other (say t2). ities. IEEE Trans. Inform. Theory, 37(6):1501–1518, 1991. [16] B. Efron and C. Stein. The jackknife estimate of variance. Ann. Stat., Then, by definition, E¯t ψ is contained in the image of Ej and ¯ 1 ¯ 9(3):586–596, 1981. Et ψ is contained in the image of (I Ej ). Hence Et ψ is [17] Te Sun Han. Nonnegative entropy measures of multivariate symmetric 2 − 1 orthogonal to E¯t ψ. correlations. Information and Control, 36(2):133–156, 1978. 2 [18] W. Hoeffding. A class of statistics with asymptotically normal distribu- tion. Ann. Math. Stat., 19(3):293–325, 1948. [19] R. L. Jacobsen. Linear Algebra and ANOVA. PhD thesis, Cornell Remark 16: In the language of ANOVA familiar to statis- University, 1968. [20] O. Johnson. Entropy inequalities and the central limit theorem. Stochas- ticians, when φ is the empty set, E¯φψ is the mean; ¯ ¯ ¯ ¯ tic Process. Appl., 88:291–304, 2000. E 1 ψ, E 2 ψ,..., E n ψ are the main effects; Etψ : t = { } { } { } { | | [21] O. Johnson and A.R. Barron. Fisher information inequalities and the 2 are the pairwise interactions, and so on. Fact 2 implies that central limit theorem. Probab. Theory Related Fields, 129(3):391–409, } s ¯ 2004. for any subset [n], the function t:t s Etψ is the best ⊂ { ⊂ } [22] A. Kagan. An inequality for the Pitman estimators related to the Stam approximation (in mean square) to ψPthat depends only on the inequality. Sankhya¯ Ser. A, 64:281–292, 2002. collection Xs of random variables. [23] S. Karlin and Y. Rinott. Applications of anova type decompositions for comparisons of conditional variance statistics including jackknife estimates. Ann. Statist., 10(2):485–501, 1982. [24] B. Kurkjian and M. Zelen. 
A calculus for factorial arrangements. Ann. Remark 17: The historical roots of this decomposition lie in Math. Statist., 33:600–619, 1962. the work of von Mises [44] and Hoeffding [18]. For various [25] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer interpretations, see Kurkjian and Zelen [24], Jacobsen [19], Texts in Statistics. Springer-Verlag, New York, second edition, 1998. [26] E. H. Lieb. Proof of an entropy conjecture of Wehrl. Comm. Math. Rubin and Vitale [34], Efron and Stein [16], Karlin and Rinott Phys., 62(1):35–41, 1978. [23], and Steele [40]; these works include applications of [27] T. Liu and P. Viswanath. Two proofs of the fisher information inequality such decompositions to experimental design, linear models, via data processing arguments. Preprint, 2005. [28] M. Madiman and A.R. Barron. The monotonicity of information in the U-statistics, and jackknife theory. Takemura [41] describes a central limit theorem and entropy power inequalities. Proc. IEEE Intl. general unifying framework for ANOVA decompositions. Symp. Inform. Th., Seattle, July 2006. TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 7, JULY 2007 13

[29] M. Madiman, O. Johnson, and I. Kontoyiannis. Fisher information, [42] A. M. Tulino and S. Verd´u. Monotonic decrease of the non-gaussianness compound Poisson approximation, and the Poisson channel. Proc. IEEE of the sum of independent random variables: A simple proof. IEEE Intl. Symp. Inform. Th., Nice, June 2007. Trans. Inform. Theory, 52(9):4295–4297, 2006. [30] M. Madiman and P. Tetali. Sandwich bounds for joint entropy. Proc. [43] S. Verd´uand D. Guo. A simple proof of the entropy-power inequality. IEEE Intl. Symp. Inform. Th., Nice, June 2007. IEEE Trans. Inform. Theory, 52(5):2165–2166, 2006. [31] Y. Oohama. The rate-distortion function for the quadratic gaussian CEO [44] R. von Mises. On the asymptotic distribution of differentiable statistical problem. IEEE Trans. Inform. Theory, 44(3):1057–1070, 1998. functions. Ann. Math. Stat., 18(3):309–348, 1947. [32] L. Ozarow. On a source coding problem with two channels and three [45] H. Wang and P. Viswanath. Vector gaussian multiple description with receivers. Bell System Tech. J., 59(10):1909–1921, 1980. individual and central receivers. To appear in IEEE Trans. Inform. [33] J. Radhakrishnan. Entropy and counting. In Computational Mathemat- Theory, 2007. ics, Modelling and Algorithms (ed. J. C. Misra), Narosa, 2003. [46] R. Zamir. A proof of the Fisher information inequality via a data [34] H. Rubin and R. A. Vitale. Asymptotic distribution of symmetric processing argument. IEEE Trans. Inform. Theory, 44(3):1246–1250, statistics. Ann. Statist., 8(1):165–170, 1980. 1998. [35] H. Schultz. Semicircularity, gaussianity and monotonicity of entropy. Preprint, 2005. [36] C.E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948. [37] R. Shimizu. On Fisher’s amount of information for location family. In G.P.Patil et al, editor, Statistical Distributions in Scientific Work, volume 3, pages 305–312. Reidel, 1975. [38] D. Shlyakhtenko. A free analogue of Shannon’s problem on monotonic- ity of entropy. Preprint, 2005. [39] A.J. Stam. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Information and Control, 2:101–112, 1959. [40] J. Michael Steele. An Efron-Stein inequality for nonsymmetric statistics. Ann. Statist., 14(2):753–758, 1986. [41] A. Takemura. Tensor analysis of ANOVA decomposition. J. Amer. Statist. Assoc., 78(384):894–900, 1983.