A Strong Entropy Power Inequality

Thomas A. Courtade, Member, IEEE

Abstract—When one of the random summands is Gaussian, we sharpen the entropy power inequality (EPI) in terms of the strong data processing function for Gaussian channels. Among other consequences, this 'strong' EPI generalizes the vector extension of Costa's EPI to non-Gaussian channels in a precise sense. This leads to a new reverse EPI and, as a corollary, sharpens Stam's uncertainty principle relating entropy power and Fisher information (or, equivalently, Gross' logarithmic Sobolev inequality). Applications to network information theory are also given, including a short self-contained proof of the rate region for the two-encoder quadratic Gaussian source coding problem and a new outer bound for the one-sided Gaussian interference channel.

Index Terms—Entropy power inequality, Costa's EPI, Stam's inequality, reverse EPI, strong data processing, Gaussian source coding.

Manuscript received June 1, 2016; revised October 2, 2017; accepted October 12, 2017. Date of publication December 14, 2017; date of current version March 15, 2018. This work was supported by the Center for Science of Information NSF under Grant CCF-1528132 and Grant CCF-0939370. This paper was presented in part at the 2016 International Symposium on Information Theory [1]. The author is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720 USA. Communicated by M. Madiman, Associate Editor for Probability and Statistics. Digital Object Identifier 10.1109/TIT.2017.2779745

I. INTRODUCTION AND MAIN RESULT

For a random vector X with density f on R^d, the entropy of X is defined by

    h(X) = -\int_{\mathbb{R}^d} f(x) \log f(x) \, dx,    (1)

where we adopt the convention that the logarithm is computed with respect to the natural base. The celebrated Entropy Power Inequality (EPI) put forth by Shannon [2] and rigorously established by Stam [3] and Blachman [4] asserts that for X, W independent and Y = X + W,

    e^{\frac{2}{d} h(Y)} \ge e^{\frac{2}{d} h(X)} + e^{\frac{2}{d} h(W)}.    (2)

Under the assumption that W is Gaussian, our main result is the following improvement of (2):

Theorem 1: Suppose X, W are independent random vectors in R^d, and moreover that W is Gaussian. Define Y = X + W. For any V satisfying X → Y → V,

    e^{\frac{2}{d}(h(Y) - I(X;V))} \ge e^{\frac{2}{d}(h(X) - I(Y;V))} + e^{\frac{2}{d} h(W)}.    (3)

The notation X → Y → V in Theorem 1 indicates that the random variables X, Y and V form a Markov chain, in that order (this and other notations are detailed in Section II). In case the integral (1) does not exist, or if X does not have a density, then we adopt the convention that h(X) = -∞. In this case, the inequality (3) is a trivial consequence of the data processing inequality. Likewise, the inequality is also degenerate when h(X) = +∞. So, as with the classical EPI, Theorem 1 is only informative when the entropy h(X) exists and is finite.

Let us briefly remark on equality cases of Theorem 1 in comparison to the classical EPI. First, we recall that the Shannon-Stam-Blachman EPI (2) attains equality precisely when X and W are Gaussian with proportional covariances [3], [5]. In contrast, by judiciously choosing the auxiliary V, equality can be achieved in (3) for any given X with finite entropy. In fact, choosing V = X + W renders both sides equal to e^{\frac{2}{d} h(W)}. Furthermore, in analogy to the classical EPI, it is a simple calculation to see that equality is also achieved in (3) whenever the random variables X, Y and V are jointly Gaussian with proportional covariances.

For X, Y as in the statement of Theorem 1, we may associate a function Φ : R_{\ge 0} → R_{\ge 0} to their joint law P_{XY} as follows:

    Φ(t) := \sup_{V : X → Y → V} \{ I(X;V) : I(Y;V) \le t \},    (4)

where the supremum ranges over all random variables V satisfying the Markov relation X → Y → V. The function Φ is generally referred to as a strong data processing function because it sharpens the classical data processing inequality for mutual information in a certain way. Indeed, the definitions ensure I(X;V) ≤ Φ(I(Y;V)) ≤ I(Y;V) for all V satisfying X → Y → V, and Φ is the (pointwise) smallest function for which the first inequality holds. With Φ defined in this way, Theorem 1 may be recast in terms of Φ as

    e^{\frac{2}{d}(h(Y) - Φ(t))} \ge e^{\frac{2}{d}(h(X) - t)} + e^{\frac{2}{d} h(W)}    for all t \ge 0.    (5)

Hence, Theorem 1 strengthens the EPI when one of the variables is Gaussian, and does so precisely in terms of the strong data processing function Φ. Accordingly, we find it fitting to refer to (3) as a 'strong' entropy power inequality, giving this paper its title.

It should be noted that Calmon, Polyanskiy and Wu [6], [7] have recently considered a somewhat related problem where they bound the best-possible data processing function defined according to

    F_I(t, γ) := \sup_{U, X : U → X → Y} \{ I(Y;U) : I(X;U) \le t \},    (6)

where Y = X + W, W ~ N(0, I) is independent of (X, U), and the supremum is over all joint distributions P_{UX} such that E[||X||^2] ≤ γ. Since the Markov assumptions in (4) and (6) differ, and Calmon et al. further optimize over the distribution P_X, the present results and those of [6], [7] are not comparable.

Various improvements of the EPI have appeared previously in the literature. However, none are highly similar to (3). For instance, strengthened EPIs for random vectors with log-concave densities have recently been proposed by Toscani [8] and Courtade, Fathi and Pananjady [9]. In another direction, Madiman and Barron have established an EPI for subsets of random variables [10], which generalizes the monotonicity of entropy along the central limit theorem [11]. The reader is referred to the recent survey [12] for a general overview of EPI-related results.

Among those inequalities appearing in the literature, we consider Costa's EPI [13] to be most comparable to Theorem 1. In the following section, we clarify this relationship by demonstrating that Theorem 1 generalizes Costa's EPI [13] and its vector extension [14], which enjoy applications ranging from interference channels to secrecy capacity (e.g., [14]–[17]). However, Theorem 1 goes considerably further than generalizing Costa's EPI. In the context of functional inequalities, we will see that Theorem 1 leads to new reverse entropy power and Fisher information inequalities, which in turn can be applied to sharpen the Gaussian logarithmic Sobolev inequality. On the other hand, in the context of coding theorems, we will see that Theorem 1 leads to a concise proof of the converse for the rate region of the quadratic Gaussian two-encoder source-coding problem [18], [19] (a result which seems beyond the reach of Costa's EPI). Applications to one-sided interference channels and strong data processing inequalities are also briefly discussed.

The restriction of W to be Gaussian in Theorem 1 should not be a severe limitation in practice. In applications of the EPI, it is often the case that one of the variables is Gaussian. As noted by Rioul [20], examples include the scalar Gaussian broadcast channel problem [21] and its generalization to the multiple-input multiple-output case [22], [23]; the secrecy capacity of the Gaussian wiretap channel [24] and its multiple access extension [25]; determination of the corner points for the scalar Gaussian interference channel problem [15], [16]; the scalar Gaussian source multiple-description problem [26]; and characterization of the rate-distortion regions for several multiterminal Gaussian source coding schemes [19], [27], [28]. Furthermore, the EPI with one Gaussian variable is equivalent to the Gaussian logarithmic Sobolev inequality [29], which has a number of important applications in analysis (e.g., [30], [31]). It is tempting to conjecture that (3) holds even when the distribution of W is non-Gaussian, and we have found no evidence to suggest the contrary. This is a very interesting question, which we leave as an open problem.

A. Organization

The remainder of this paper is organized as follows: Section II defines notation and Section III explores various applications of Theorem 1. For the reader interested only in applications of the main result, Sections IV and beyond can be safely omitted as they are dedicated to the proof of Theorem 1. As for the proof of Theorem 1, it is rather long and we know of no simpler proof at this time. Section IV is intended to familiarize the reader with the basic strategy of the proof. In particular, Section IV-A is devoted to a heuristic discussion of the crux of our argument, in which we sketch the basic ideas involved in proving a one-dimensional version of Theorem 1, but gloss over several technical issues that need to be dealt with. Section V shows how to recover Theorem 1 from its one-dimensional counterpart by leveraging some of the machinery developed in Section IV.

Section VI and the two appendices are for readers interested in the details of the proof. Specifically, Section VI revisits the heuristic discussion of Section IV-A, and makes it rigorous by treating the subtle points with more care. In order to arrive at its main conclusion, Section VI requires two key technical estimates whose proofs are essentially orthogonal to the rest of the argument. Appendices A and B are dedicated to establishing these separate results.

II. NOTATION

This section establishes basic notation, most of which is standard in the information theory literature. We write P_{XY}, P_X and P_Y to denote the joint and respective marginal probability distributions associated to a pair of random variables (X, Y). We write X ~ P_X to denote that X has distribution P_X, and similarly (X, Y) ~ P_{XY} to indicate that (X, Y) have joint law P_{XY}, and so forth. We shall use Y|{X = x} to denote the random variable Y conditional on {X = x}, and denote its law by P_{Y|{X=x}}. Note that Y|{X = x} is uniquely defined in the sense that different versions of the same are equal P_X-a.e. x; as such, we will not distinguish between these versions. It will be often convenient to factor joint distributions into products of marginals and conditionals, so that (X, Y) ~ P_X P_{Y|X} denotes that X, Y are jointly distributed such that X ~ P_X and Y|{X = x} ~ P_{Y|X=x}. This notation is particularly handy to describe conditional independence. For example, any joint distribution P_{XYV} may be factored as P_{XY} P_{V|XY}. So, if we write (X, Y, V) ~ P_{XY} P_{V|Y}, this implies that the conditional law P_{V|XY} = P_{V|Y} does not depend on X, and therefore V and X are conditionally independent given Y. As already introduced previously, conditional independence (i.e., Markov) structure can be compactly represented as X → Y → V to denote P_{XYV} = P_X P_{Y|X} P_{V|Y}. When additional conditioning is needed, we write X → Y → V | Q to indicate that P_{QXYV} = P_Q P_{X|Q} P_{Y|XQ} P_{V|YQ}. In other words, X → Y → V form a Markov chain, conditioned on Q. So, for example, if we write

    \inf_{V : X → Y → V | Q} \big[ I(Y;V|Q) - \lambda I(X;V|Q) \big],

this denotes an infimum taken over all distributions P_{V|YQ}, where the information measures in the argument of the infimum are evaluated with respect to the joint distribution P_{QXYV} = P_Q P_{X|Q} P_{Y|XQ} P_{V|YQ}. Reader familiarity with information measures (entropy, mutual information, etc.) and their calculus is assumed. For the unfamiliar reader, definitions may be found in any standard textbook, e.g., [32].

A sequence of random variables X_1, X_2, ... indexed by n ∈ N will be denoted by the shorthand {X_n}, and convergence

of Xn in distribution to a random variable X is written that A and $ commute, it holds that V X W { } ∗ = + Xn D X . in distribution so that I (X V ) h(X W) h(W). −→ ∗ Similarly, I (Y V ) h(X W; ) h=((I A)+1/2W).Now,(8)− We write W N(µ, $) to indicate that W has Gaussian ; = + − − distribution with∼ mean µ and covariance $.Wewilloften follows from Theorem 1 since be interested in the Gaussian channel parametrized by the 2 (h(X A1/2 W ) h(X W ) h(W )) e d scalar quantity ϱ>0(thesignal-to-noise ratio), in which + − + + 2 (h(Y ) I (X V )) Y ϱX Z,whereZ N(0, 1) is independent of the e d − ; √ = 2 2 1/2 = + ∼ (h(X) I (Y V )) h(A W1) input X.Forthisparticularsituation,wereservethenotation e d − ; e d ϱ ≥ + to denote the conditional law of Y given X.Hence, 2 (h(X) h(X W ) h((I A)1/2W )) 1/d 2 h(W ) GY X e d − + + − A e d | ϱ writing (Q , X , Y ) P is compact notation for = 2 +| | 2 n n n Qn Xn GYn Xn 1/d d (h(X) h(X W ) h(W )) 1/d d h(W ) ∼ | I A e − + + A e . (Qn, Xn) PQn,Xn and Yn √ϱXn Z,whereZ N(0, 1) =| − | +| | ∼ = + ∼ 2 is independent of (Q , X ). (h(X W ) h(W )) n n Multiplying both sides by e d + − completes the For a conditional law PY X and a random variable X PX , proof. | ∼ ! we sometimes employ the compact notation PY X X Y Costa’s EPI may be interpreted as a concavity property | : +→ to indicate that Y is a random variable with law obtained by enjoyed by entropy powers. The proof of Theorem 2 suggests a composing PY X with PX .Inparticular,givenasequence Xn , generalization of this property to non-Gaussian additive noise. | { } we may define a sequence Yn via PY X Xn Yn,which Indeed, we have the following general result: means that the conditional law{ of} Y given| X: is+→ equal to P n n Y X Theorem 3: Let X PX , Y PY and W N(0,$) be | ∼ ∼ ∼ for each n. independent random vectors in Rd with finite second moments. Without loss of generality, it is assumed throughout that Then all logarithms are base-e,sothatallinformationmeasures 2 (h(X W ) h(Y W )) 2 (h(X) h(Y )) 2 (h(X Y W ) h(W )) are in units of nats.Exceptionstothisconventionoccurin e d + + + e d + e d + + + . ≥ + Sections III-C and III-D, and are explicitly noted. Finally, Proof: This is an immediate consequence of Theorem 1 is used to denote Euclidean length on Rd . ∥·∥ by letting V X Y W and rearranging exponents. ! We remark= that+ the+ assumption of finite second moments III. APPLICATIONS in Theorem 3 is only needed to ensure that all entropies are Here we present several representative applications of less than (precluding the indeterminate in the Theorem 1. exponent)+ and∞ can generally be relaxed. If any of∞−∞ the entropies are equal to ,theclaimistrivial. −∞ A. Generalized Costa’s Entropy Power Inequality We also mention that Madiman observed the follow- ing related inequality on submodularity of differential Costa’s EPI [13] (see [33]–[35] for alternate proofs) states entropy [37], which can be proved via data processing: if that, for independent d-dimensional random vectors X P , X X, Y, W are independent (one-dimensional) random variables, W N(0,$) and α 1 ∼ ∼ | |≤ then 2 h(X αW ) 2 2 h(X) 2 2 h(X W ) e d + (1 α )e d α e d + . (7) 2(h(X W ) h(Y W )) 2(h(X Y W ) h(W )) ≥ − + e + + + e + + + . (9) ≥ This result was generalized to matrix weightings by Liu, When W is Gaussian, Theorem 3 sharpens inequality (9) by Liu, Poor and Shamai using perturbation and I-MMSE reducing the LHS by a factor of e2(h(X) h(Y )). arguments [14]. 
We demonstrate below that this generalization + follows as an easy corollary to Theorem 1 by taking V equal to X contaminated by additive Gaussian noise. In this sense, B. Reverse Convolution Inequalities and Refinement of Theorem 1 may be interpreted as a further generalization of Stam-Gross LSI Costa’s EPI, where the additive noise is non-Gaussian. Theorem 3 admits several interesting corollaries that are Theorem 2 [14]: Let X P and W N(0,$) be X connected to reversing the convolution inequalities for entropy independent random vectors in∼ d .Forapositivesemidefinite∼ R and Fisher information. We explore these connections briefly matrix A Ithatcommuteswith$, ≼ in this section, but remark that our treatment is far from 2 h(X A1/2 W ) 1/d 2 h(X) 1/d 2 h(X W ) 1 e d + I A e d A e d + , (8) exhaustive. To start, define the entropy power N(X) and the ≥| − | +| | Fisher information J(X) of a random vector X in Rd with where denotes determinant. density f (with respect to Lebesgue measure) as follows: Remark|·| 1: The original claim by Liu, Liu, Poor and Shamai 2 in [14] does not contain the hypothesis that A and $ commute. 1 2 h(X) f (X) N(X) e d J(X) E ∥∇ ∥ . However, the inequality can fail if this commutativity property := 2πe := f 2(X) $ % does not hold [36]. If the Fisher information does not exist (e.g., if f is not Proof: We assume $ is positive definite; the gen- sufficiently smooth), we adopt the convention J(X) . eral case follows by approximation. Let W1, W2 denote =∞ two independent copies of W,andputY X 1Amorecomprehensivetreatmentofthistopicandtheconnectionsto 1/2 1/2 = + A W1 and V Y (I A) W2.Usingthefact Costa’s EPI can be found in [35]. = + − 2176 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 4, APRIL 2018
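For a concrete check of Theorem 2, one can take X itself Gaussian and A, Σ diagonal (hence commuting), in which case every entropy power in (8) reduces to a scaled geometric mean of variances. The sketch below is an illustrative aside, not part of the proof; it verifies the inequality for random diagonal choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def epow(cov_diag):
    """e^{(2/d) h(N(0, diag(cov_diag)))} = 2*pi*e * det(cov)^{1/d} for a diagonal covariance."""
    return 2 * np.pi * np.e * np.prod(cov_diag) ** (1.0 / d)

for _ in range(5):
    sx = rng.uniform(0.2, 3.0, d)   # diagonal of Cov(X); X taken Gaussian for this check
    sw = rng.uniform(0.2, 3.0, d)   # diagonal of Sigma = Cov(W)
    a  = rng.uniform(0.0, 1.0, d)   # diagonal of A, 0 <= A <= I, commuting with Sigma
    lhs = epow(sx + a * sw)                                   # e^{(2/d) h(X + A^{1/2} W)}
    rhs = (np.prod(1 - a) ** (1 / d)) * epow(sx) \
        + (np.prod(a) ** (1 / d)) * epow(sx + sw)             # |I-A|^{1/d} e^{..} + |A|^{1/d} e^{..}
    assert lhs >= rhs - 1e-9
    print(f"lhs={lhs:.4f}  rhs={rhs:.4f}")
```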

In his classic 1959 proof of the entropy power inequal- saturate (10), then the convolution inequalities for entropy ity [3], Stam observed the following uncertainty principle for power and Fisher information will also be nearly saturated. d-dimensional X with finite second moments: Notably, the reverse EPI (13) is nontrivial whenever the Fisher informations J(X), J(Y ) exist and are finite. This should be p(X) 1 N(X)J(X) 1. (10) := d ≥ contrasted with the reverse EPI of Bobkov and Madiman [39], As noted by Costa and Cover [38], p(X) may be interpreted as which holds only for convex measures and must be stated anotionofsurfaceareaassociatedtoX;indeed,(10)isderived in terms of volume preserving maps (similar to Milman’s from the EPI in the same way that the isoperimetric inequality reverse Brunn-Minkowski inequality for convex bodies [40]). is derived from the Brunn-Minkowski inequality. Since (10) The reader is referred again to the recent paper [12] which is proved using de Bruijn’s identity and the special case of surveys known reverse EPIs, all of which apparently require Shannon’s EPI when one summand is Gaussian, Theorem 3 convexity properties of the involved measures. naturally leads to a sharpening of (10). This strengthen- We have already seen above that Theorem 3 is a natural ing takes the form of a reverse EPI, which upper bounds generalization of Costa’s EPI. However, we note that its par- N(X Y ) in terms of only marginal entropies and Fisher ticularization to (13) continues to imply Costa’s EPI. Indeed, informations.+ if Z N(0, I),thenN(√tZ) t, p(√tZ) 1and ∼ = = Theorem 4: If X and Y are independent d-dimensional de Bruijn’s identity (12) together with (13) yields random vectors with finite second moments, then d N(X √tZ) N(X) t N(X √tZ) , (15) N(X)N(Y ) ( J(X) J(Y )) dN(X Y ). (11) + ≤ + dt + t 0 + ≥ + ' & = ( & Proof: We may assume J(X)< and J(Y )< which is equivalent to concavity of t N(X &√tZ). ,elsethereisnothingtoprove.Tobegin,let∞ Z +→ + ∞ ∼ In addition to generalizing Costa’s inequality, (13) also N(0, I) be independent of X, Y and recall de Bruijn’s identity improves the uncertainty principle (10). To see this, let X, X d 1 [3]: h(X √tZ) J(X √tZ).Inparticular,we be i.i.d. random vectors on d .Then,itfollowsimmediately∗ dt + = 2 + R have from (13) that d 1 N(X √tZ) N(X)J(X). (12) 2 1 dt + t 0 = d p(X) exp h (X X ) h(X) . (16) & = ≥ d √2 + ∗ − Identifying W √tZ in Theorem& 3 and rearranging, we find $ ' ' ( (% = & The RHS of (16) is strictly greater than one by the EPI, N(X √tZ)N(Y √tZ) N(X)N(Y ) + + − unless X is Gaussian. The quantity h 1 (X X ) h(X) √2 + ∗ − t is referred to as the entropy jump associated to X,and N(X Y √tZ) N(X Y ). ) * ≥ + + ≥ + can be lower bounded by a linear function of the relative Letting t 0andapplying(12)provestheclaim. entropy D(X Z) when Cov(X) I and X satisfies regularity ! ∥ = By recalling→ the definition of p( ) in (10), we obtain the fol- conditions (i.e., X satisfies a Poincaré inequality [41]–[43] lowing corollary, which reverses the· convolution inequalities and, for d 2, has log-concave density [44]). ≥ for entropy powers and Fisher information: In closing this section, we briefly discuss how Theorem 4 Corollary 1: Let X, Ybeindependentwithfinitesecond leads to a sharpening of the Gaussian logarithmic Sobolev moments, and choose θ to satisfy θ/(1 θ) N(Y )/N(X). inequality (LSI). In 1975, Gross rediscovered (10) in a dif- Then − = ferent form by establishing the LSI for the standard Gaussian measure γ on Rd [29]. 
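When X and Y are Gaussian, N(X) = |Cov(X)|^{1/d} and J(X) = tr(Cov(X)^{-1}), so Stam's inequality (10) and the reverse EPI of Theorem 4 can be checked directly in closed form. The short sketch below (illustrative only, using diagonal covariances) does so.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def N(cov_diag):  # entropy power of N(0, diag(cov_diag))
    return np.prod(cov_diag) ** (1.0 / d)

def J(cov_diag):  # Fisher information of N(0, diag(cov_diag))
    return np.sum(1.0 / cov_diag)

for _ in range(5):
    c1 = rng.uniform(0.3, 4.0, d)   # Cov(X), diagonal
    c2 = rng.uniform(0.3, 4.0, d)   # Cov(Y), diagonal
    # Stam's uncertainty principle (10): p(X) = N(X) J(X) / d >= 1.
    pX, pY = N(c1) * J(c1) / d, N(c2) * J(c2) / d
    assert pX >= 1 - 1e-12 and pY >= 1 - 1e-12
    # Theorem 4: N(X) N(Y) (J(X) + J(Y)) >= d * N(X + Y).
    lhs = N(c1) * N(c2) * (J(c1) + J(c2))
    rhs = d * N(c1 + c2)
    assert lhs >= rhs - 1e-9
    print(f"p(X)={pX:.3f}  p(Y)={pY:.3f}  Thm 4: {lhs:.3f} >= {rhs:.3f}")
```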
In particular, he showed that for every N(X Y ) (N(X) N(Y ))(θp(X) (1 θ)p(Y )) (13) d 2 + ≤ + + − g on R with gradient in L (γ), and 2 2 1 1 1 g log g dγ p(X)p(Y ). (14) Rd J(X Y ) ≤ J(X) + J(Y ) ! + ' ( 2 g 2 dγ g2dγ log g2 dγ . (17) Proof: Inequality (13) is the same as (11), but rewritten in ≤ d ∥∇ ∥ + d d !R '!R ( '!R ( terms of p( ).Likewise,(14)followsimmediatelyfrom(11) In the same paper, Gross also proved that (17) is equiv- and (10), applied· to the sum X Y . ! alent to the hypercontractivity of the Ornstein-Uhlenbeck We remind the reader that+ the convolution inequalities semigroup [45]. Despite the fact that Stam’s inequality (10) for entropy power and Fisher information may, respectively, preceded Gross’ LSI (17) by over a decade, it wasn’t until the be written as 1990’s that Carlen [46] recognized that they were completely N(X Y ) (N(X) N(Y )) equivalent (a concise proof can be found in [47]). By view- + ≥ + ing g2 as the probability density (with respect to the standard and Gaussian measure γ )associatedtoarandomvectorX,itis 1 1 1 . well known that the LSI (17) may be equivalently written in J(X Y ) ≥ J(X) + J(Y ) information-theoretic terms as + So, we may conclude sharpness of the “reverse” estimates in 1 I (X Z) D(X Z), (18) Corollary 1. Precisely, we see that if both X and Y each nearly 2 ∥ ≥ ∥ COURTADE: STRONG ENTROPY POWER INEQUALITY 2177 where I (X Z) denotes the relative Fisher information of X built upon Oohama’s earlier solution to the one-helper prob- with respect∥ to Z N(0, I),andD(X Z) is the relative lem [19] and the independent solutions to the Gaussian CEO entropy (with units∼ of nats). There have been∥ several recent problem due to Prabhakaran, Tse and Ramachandran [28] and works that attempt to give quantitative bounds on the deficit Oohama [27] (see [52] for a self-contained treatment). Since 1 δLSI(X) 2 I (X Z) D(X Z) (e.g., [48]–[50]). By recalling Wagner et al.’s original proof of the sum-rate constraint, the expressions:= for∥ relative− entropy∥ and Fisher information in other proofs have been proposed based on estimation-theoretic terms of their non-relative counterparts arguments and semidefinite programming (e.g., [55]), however all known proofs are quite complex. Below, we show that d 1 2 d D(X Z) log(2πe) h(X) E X the converse result for the entire rate region is a direct ∥ = 2 − + 2 ∥ ∥ − 2 2 consequence of Corollary 2, thus unifying the results of [18] I (X Z) J(X) 2d E X , ∥ = − + ∥ ∥ and [19] under a common and succinct inequality. we may take logarithms in (16) and use the inequality log x Theorem 6 [18]: Let ρ 1, 1 and let X N(0, I) ∈[− ] ∼ x 1toobtainthefollowingconciseboundonsimplification:≤ be a d-dimensional standard Gaussian random vector. Define −Theorem 5: Let X, X be i.i.d. random vectors on d with Y ρX 1 ρ2 Z,whereZ N(0, I) is independent of X. R = + − ∼ ∗ Let φ d 1,...,2dRX and φ d 1,...,2dRY , finite second moments. For δLSI defined as above, X R+ Y R and define: →{ } : →{ } 1 δLSI(X) h (X X ) h(X). (19) ≥ √2 + ∗ − 1 2 ) * dX E X E X φX (X),φY (Y ) Thus, we see that δLSI(X) controls the entropy jump associ- := d ∥ − [ | ]∥ ated to X.Notethatinequality(19)doesnotimposeregularity 1 , 2 - dY E Y E Y φX (X),φY (Y ) . conditions on X (beyond finiteness of second moment). This := d ∥ − [ | ]∥ , - should be contrasted with the quantitative bound in [49], which Then assumes that X satisfies a Poincaré inequality. 
Inequality (19) 1 1 2 2 2RY also has the desirable property that it is dimension-free in RX log 1 ρ ρ 2− (21) ≥ 2 2 d − + the sense that it is additive on independent components of X. ' X ( 1 1 ) * The disadvantage of (19) is that it is presently unknown how 2 2 2RX RY log2 1 ρ ρ 2− (22) the entropy jump associated to X is quantitatively related ≥ 2 dY − + ' ) 2 *( to the distance of X from Gaussian, except under regular- 1 (1 ρ )β(dX dY ) RX RY log2 − , (23) ity assumptions [9], [41], [44]. This situation is similar to + ≥ 2 2dX dY the HSI inequality of Ledoux, Nourdin and Peccati which 2 where D 1 1 4ρ D . strengthens (18), provided the so-called Stein discrepancy β( ) (1 ρ2)2 := + + − associated to X is finite. As with entropy jumps, finiteness of For convenience,. we assume throughout the remainder of Stein discrepancy is presently only ensured under regularity this subsection that all information quantities have units of assumptions such as positive spectral gap [51]. bits (i.e., their defining logarithms are taken to be base-2). The key ingredient for establishing Theorem 6 is the following simple consequence of Corollary 2: C. Conditional Strong EPI and Converse for the Proposition 1: For X, YasaboveandanyU, Vsatisfying Two-Encoder Quadratic Gaussian Source Coding Problem U X Y V, AconditionalversionoftheEPIisoftenusefulinapplica- → → → 2 (I (X U,V ) I (Y U,V )) tions (e.g., [52]). Theorem 1 easily generalizes along these 2− d ; + ; 2 I (X,Y U,V ) 2 2 2 I (X,Y U,V ) lines. In particular, due to joint convexity of (u,v) 2− d ; 1 ρ ρ 2− d ; . (24) log(eu ev ) in (u,v),weimmediatelyobtainviaJensen’s+→ ≥ − + ) * inequality+ the following corollary: Proof: By identifying Q U,anapplicationof ← Corollary 2: Suppose X, WarerandomvectorsinRd , Corollary 2 directly yields conditionally independent given Q, and moreover that W is 2 (h(Y U) I (X V U)) conditionally Gaussian given Q. Define Y X W. For any 2 d | − ; | = + 2 (h(ρX U) I (Y V U)) 2 h(√1 ρ2 Z) VsatisfyingX Y V Q, 2 d | − ; | 2 d − → → | ≥ 2 2 (h(X U) I (Y V U))+ 2 2 (h(Y Q) I (X V Q)) 2 (h(X Q) I (Y V Q)) 2 h(W Q) ρ 2 d | − ; | (2πe)(1 ρ ). e d | − ; | e d | − ; | e d | . = + − ≥ + 2 2 d h(Y ) d h(X) 1 (20) Since 2− 2− 2πe ,multiplyingthroughby 1 2 I (X,Y U,V=) = 2 d and rearranging exponents establishes the Let’s now see an example of how this may be applied to 2πe − ; establish nontrivial converse results in network information claim. ! Proof of Theorem 6: Put U φX (X) and V φY (Y ). theory. Toward this end, let us mention that characterizing = = the rate region for the two-encoder quadratic Gaussian source The left- and right-hand sides of (24) are monotone decreasing in 1 (I (X U, V ) I (Y U, V )) and 1 I (X, Y U, V ),respec- coding problem was a longstanding open problem until its d ; + ; d ; ultimate resolution by Wagner, Tavildar and Viswanath in tively. Therefore, if their tour de force [18], which established that a separation- 1 1 1 (I (X U, V ) I (Y U, V )) log based scheme [53], [54] was optimal. Wagner et al.’s work d ; + ; ≥ 2 2 D 2178 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 4, APRIL 2018
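The outer bound of Theorem 6 is straightforward to evaluate numerically. The sketch below is an illustrative aside; the expressions are transcribed from (21)-(23), with β(D) = 1 + sqrt(1 + 4ρ²D/(1−ρ²)²), and all rates are in bits as per the convention of this subsection.

```python
import numpy as np

def beta(D, rho):
    """beta(D) appearing in the sum-rate constraint (23)."""
    return 1 + np.sqrt(1 + 4 * rho**2 * D / (1 - rho**2) ** 2)

def outer_bound(rho, dX, dY):
    """Right-hand sides of the constraints (21)-(23) on (R_X, R_Y), in bits."""
    RX_min_of_RY = lambda RY: 0.5 * np.log2((1 - rho**2 + rho**2 * 2 ** (-2 * RY)) / dX)  # (21)
    RY_min_of_RX = lambda RX: 0.5 * np.log2((1 - rho**2 + rho**2 * 2 ** (-2 * RX)) / dY)  # (22)
    sum_rate_min = 0.5 * np.log2((1 - rho**2) * beta(dX * dY, rho) / (2 * dX * dY))       # (23)
    return RX_min_of_RY, RY_min_of_RX, sum_rate_min

# Example: symmetric distortions.
rho, dX, dY = 0.8, 0.1, 0.1
RX_b, RY_b, Rsum = outer_bound(rho, dX, dY)
print(f"sum-rate lower bound: {Rsum:.3f} bits")
print(f"R_X bound when R_Y = 2 bits: {RX_b(2.0):.3f} bits")
```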

2 and i 1, 2. Here, W N(0, 1) and W2 N(0, 1 α ) are = ∼ ∼ − 1 independent of each other and of the channel inputs X1, X2. I (X, Y U, V ) R d ; ≤ We have assumed α < 1sincethesettingwhere α 1 is referred to as the| | strong interference regime,| and|≥ the for some pair (R, D),thenwehave capacity is known to coincide with the Han-Kobayashi inner 2 R 2 2 2 R D 2− 1 ρ ρ 2− , bound [15], [52], [59], [60]. Observe that we have expressed ≥ − + the one-sided Gaussian IC in degraded form,whichhas ) * 2 R which is a quadratic inequality with respect to the term 2− . capacity region identical to the corresponding non-degraded This is easily solved using the quadratic formula to obtain: version as proved by Costa [15]. Despite receiving significant 2 attention from researchers overseveraldecades,thecapacity 2R 2D 1 (1 ρ )β(D) 2− R log − . region C (α, P , P ) of the one-sided Gaussian IC remains ≤ (1 ρ2)β(D) 0⇒ ≥ 2 2 2D 1 2 − unknown in the regime of α < 1describedabove. Jensen’s inequality and the maximum-entropy property of Having already discussed| | connections between Costa’s Gaussians imply EPI (7) and Theorem 1 above, we remark that Costa’s EPI was 1 1 1 motivated by the Gaussian IC [15]. Since Theorem 1 general- I (X U, V ) log2 izes Costa’s result, the one-sided Gaussian IC presents itself d ; ≥ 2 dX as a natural application. Toward this end, we establish a new and multi-letter outer bound to give a simple demonstration of how 1 1 1 Theorem 1 might be applied to the one-sided Gaussian IC. I (Y U, V ) log2 , d ; ≥ 2 dY Theorem 7: Let all informationquantitieshaveunitsofbits. so that (R1, R2) C (α, P1, P2) only if ∈ 1 1 1 1 (I (X U, V ) I (Y U, V )) log2 , R1 log2(1 P1) (27) d ; + ; ≥ 2 dX dY ≤ 2 + establishing (23) since 1 R2 log (1 P2) (28) ≤ 2 2 + 1 1 I (X, Y U, V ) (H (U) H (V )) RX RY . and d ; ≤ d + ≤ + 2 n n n 2 n n 2R2 ϵn I (X ,X Y ) 2 2R1 I (Y V Y ) Similarly, 2− + 2− n 1 2 ; 2 sup α 2 − n 0 ; | 1 ≥ V Y n Y n V 2 : 1 → 0 → 2 RX log dX (I (X U V ) I (X U,V )) " 2 + 2 2 d ; | − ; 2 2 I (Y n V ) (1 α )2 n 1 ; , (29) ≥ 2 I (X V ) 2− d ; + − = 2 n n # 2 2 d I (Y V ) for some ϵn 0 and independent X1 , X2 satisfying the power (1 ρ ) ρ 2− ; → n 2 ≥ − + constraints X nPi ,i 1, 2. 2 2 2 RY E i (1 ρ ) ρ 2− , [∥ ∥ ]≤ = ≥ − + Proof: For consistency, we assume throughout the proof where the second inequality is a consequence of Theorem 1 that information quantities have units of bits. The only non- and the fact that h(X) h(Y ) h(Z) for the variables as trivial inequality to prove is (29). Thus, we begin by noting = = that Corollary 2 implies defined. Rearranging (and symmetry) yields (21)-(22). ! We remark that Proposition 1 (a special case of 2 (h(Y n Xn) I (Y n V Xn)) 2 n 2 | 2 − 1 ; | 2 Corollary 2) first appeared in [56] by the author and Jiao. 2 (h(αY n Xn) I (Y n V Xn)) 2 h(W n Xn) In fact, Proposition 1 yields a stronger result than the con- 2 n 1 | 2 − 2 ; | 2 2 n 2 | 2 ≥ 2 2 (h(Y n) I (Y n V Xn)) + 2 2 h(W n) verse for the two-terminal Gaussian source coding problem; α 2 n 1 − 2 ; | 2 (1 α )2 n = + − it shows that the rate regions coincide for the problems when for all V such that Y n Y n V X n.Sinceh(W n) distortion is measured under quadratic loss and logarithmic 1 2 2 h(Y n X n, X n ) h(Y n X→n), I (Y→n V |X n) I (Y n V, X n=) loss [57], [58]. 
On a different note, we remark that the 2 1 2 1 1 2 2 0 2 and I|(Y n V X n=) I (Y|n V, X n),thiscanberewrittenas; | = ; quadratic Gaussian sum-rate bound has an alternate proof that 1 ; | 2 = 1 ; 2 2 I (Xn Y n) 2 I (Xn ,Xn Y n) avoids using the EPI, favoring estimation-theoretic arguments 2− n 2 ; 2 + n 1 2 ; 2 instead [55]. 2 2 I (Xn Y n) 2 I (Y n V Y n) sup α 2 n 1 ; 1 − n 0 ; | 1 ≥ V Y n Y n V : 1 → 0 → " D. One-Sided Gaussian Interference Channel 2 2 I (Y n V ) (1 α )2 n 1 ; . The one-sided Gaussian interference channel (IC) + − # (or Z-Gaussian IC) is a discrete memoryless channel, Therefore, 2(R2 ϵn) with input-output relationship given by 2− − 2 I (Xn Y n) Y X W (25) 2− n 2 ; 2 (30) 1 1 ≥ = + 2 I (Xn ,Xn Y n) 2 2 I (Xn Y n) 2 I (Y n V Y n) Y2 αY1 X2 W2, (26) 2− n 1 2 ; 2 sup α 2 n 1 ; 1 − n 0 ; | 1 = + + ≥ V Y n Y n V : 1 → 0 → where Xi and Yi are the channel inputs and observations " 2 2 I (Y n V ) (1 α )2 n 1 ; (31) corresponding to Encoder i and Decoder i,respectively,for + − # COURTADE: STRONG ENTROPY POWER INEQUALITY 2179
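As a sanity check on the degraded form (25)-(26), note that the effective noise seen at Receiver 2 is αW + W₂, whose variance is α²·1 + (1−α²) = 1, so the model is statistically equivalent to the usual one-sided Gaussian IC with unit-variance noise. A small Monte Carlo sketch (illustrative only, with arbitrary unit-power inputs) confirms this.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n = 0.6, 200_000

X1 = rng.normal(0, 1.0, n)                       # input of Encoder 1 (unit power for the check)
X2 = rng.normal(0, 1.0, n)                       # input of Encoder 2
W  = rng.normal(0, 1.0, n)                       # W  ~ N(0, 1)
W2 = rng.normal(0, np.sqrt(1 - alpha**2), n)     # W2 ~ N(0, 1 - alpha^2)

Y1 = X1 + W                                      # (25)
Y2 = alpha * Y1 + X2 + W2                        # (26), degraded form
noise_at_rx2 = Y2 - alpha * X1 - X2              # = alpha*W + W2
print("empirical Var(alpha*W + W2):", np.var(noise_at_rx2))   # close to 1, matching the standard Z-IC
```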

2 n n n 2 n n I (X ,X Y ) 2 2(R1 ϵn) I (Y V Y ) 2− n 1 2 ; 2 sup α 2 − − n 0 ; | 1 parameterized by λ 1 and ϱ>0.Similarly,let(X, Q) ≥ V Y n Y n V ≥ ∼ 1 0 PXQ be a pair of random variables on R .For(Y, X, Q) : → → " ϱ ×Q ∼ 2 2 I (Y n V ) P ,definethefunctional(ofP ) (1 α )2 n 1 ; , (32) GY X XQ XQ + − | # sλ(X,ϱ Q) h(X Q) λh(Y Q) where (30) and (32) hold for ϵn 0duetoFano’sinequality. | := − | + | 2ϵ → Multiplying both sides by 2 n proves the claim. ! inf I (Y V Q) λI (X V Q) . The Han-Kobayashi achievable region [52], [60] evaluated + V X Y V Q ; | − ; | : → → | " # for Gaussian inputs (without power control) can be expressed Noting that the term e2 h(W ) is constant in (34), one as the set of rate pairs (R1, R2) satisfying (27), (28) and approach toward proving Theorem 8 would be to simultane- 2 2R 2 α P2 2 1 1 α ously minimize the exponent h(Y ) I (X V ),whilemaxi- 2R2 − ; 2− 2 2 − 2 . mizing the exponent h(X) I (Y V ) over all valid choices ≥ (P2 1 α )(1 α P1 P2) + P2 1 α − ; + − + + + − (33) of X, V .If,forallsuchchoices,theLHSof(34)exceeds the RHS of (34), the theorem would be proved. Modulo Interestingly, this takes a similar form to (29); however, it is rescaling of random variables, the functional sλ(X,ϱ) makes known that transmission without power control is suboptimal this intuition precise by capturing this tradeoff of exponents, for the Gaussian Z-interference channel in general [61], [62]. as parameterized by λ.Thisisapparentbyobservingthat Nevertheless, it may be possible to identify a random vari- n able V in (29), possibly depending on X ,whichultimately sλ(X, 1) inf λ(h(Y ) I (X V )) 2 = V X Y V − ; improves known bounds. We leave this for future work. : → → " (h(X) I (Y V )) , − − ; E. Relationship to Strong Data Processing # where X, Y are as in the statement of Theorem 8 for σ 2 1. Strong data processing inequalities and their connection = Since sλ(X,ϱ) has the optimization over valid choices to hypercontractivity have garnered much attention recently of V built into its definition, the stated approach of extremizing (e.g,. [6], [7], [63]–[68]). We have already seen the connection exponents in (34) reduces to characterizing the infimum of to Theorem 1 in the discussion surrounding (5). We briefly sλ(X,ϱ),takenoveralldistributionsPX with finite variance mention here that, for !(t) !(t, PXY) as defined in (4), ≡ (in fact, variance at most 1 suffices). Such an optimization inequality (5) may be rearranged to provide the following appears difficult to execute directly, so we instead turn to the explicit upper bound on the strong data processing function ! relaxation which may be of independent interest: Corollary 3: Let X P and W N(0, I) be indepen- Vλ(ϱ) inf sλ(X,ϱ Q). (35) X 2 ∼ d ∼ := PXQ X 1 | dent random vectors in R .ForY X W, : E[ ]≤ = + This is indeed a relaxation of infimizing sλ(X,ϱ) over the d 1 2 (h(X) t) !(t, PXY) I (X Y ) log 1 e d − . set of distributions P with X 2 1becauseproblem(35) ≤ ; − 2 + 2πe X E ' ( is equivalent to finding the infimum≤ of the lower convex Moreover, equality is achieved if X is Gaussian with covari- envelope of the functional PX sλ(X,ϱ),overthesetof ance proportional to the identity matrix I. 2 +→ distributions PX with EX 1. Similar relaxation strategies have been fruitfully applied≤ elsewhere in information theory IV. OUTLINE OF PROOF OF MAIN RESULTS (cf. [69]–[71]). The goal of this section is to provide an overview of the Having described the overall objective, our strategy for key ideas involved in establishing our main results. 
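Corollary 3 can be illustrated numerically: for X Gaussian with covariance proportional to I, a Gaussian test channel V = Y + N(0, τ²I) already meets the stated upper bound with equality, consistent with the corollary's equality claim. The scalar sketch below is an illustrative aside, with the test-channel choice assumed.

```python
import numpy as np

def corollary3_bound(t, gamma, d=1):
    """Upper bound on Phi(t) from Corollary 3 for X ~ N(0, gamma*I), Y = X + N(0, I)."""
    I_XY = 0.5 * d * np.log(1 + gamma)
    hX = 0.5 * d * np.log(2 * np.pi * np.e * gamma)
    return I_XY - 0.5 * d * np.log(1 + np.exp(2 * (hX - t) / d) / (2 * np.pi * np.e))

gamma = 2.0
for tau2 in [0.25, 1.0, 4.0]:
    # Gaussian test channel V = Y + N(0, tau2) gives one point (t, I(X;V)) of the trade-off.
    t    = 0.5 * np.log((gamma + 1 + tau2) / tau2)         # t = I(Y;V)
    I_XV = 0.5 * np.log((gamma + 1 + tau2) / (1 + tau2))   # I(X;V)
    print(f"t={t:.4f}:  I(X;V)={I_XV:.6f}   bound={corollary3_bound(t, gamma):.6f}")
```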
To this characterizing Vλ(ϱ) will be to show that in (35) it suffices to end, we focus attention on the scalar version of Theorem 1, minimize over Gaussian X,independentofQ. Stated precisely, stated as Theorem 8 below. The rationale for this is that we aim to show: the multidimensional result of Theorem 1 will be proved as Theorem 9: acorollaryofitsone-dimensionalcounterpartinSectionV, using some of the machinery that we develop in this section. Vλ(ϱ) inf sλ(Xγ ,ϱ), where Xγ N(0,γ). (36) = 0 γ 1 ∼ Theorem 8: Let X be random variable on R,andletW ≤ ≤ ∼ N(0,σ2) be independent of X. For Y X WandanyV With Theorem 9 in hand, it is a matter of calculus to = + satisfying X Y V, explicitly compute V (ϱ) as a function of ϱ and λ,yielding → → λ 2(h(Y ) I (X V)) 2(h(X) I (Y V )) 2h(W ) Theorem 8 as a corollary. Thus, we dedicate the following e − ; e − ; e . (34) ≥ + subsection to provide a heuristic discussion of how Theorem 9 In order to give the intuition behind our proof of (34), will be established. This is intended to orient the reader we begin by defining the following quantities whose moti- and provide context for the proof details that follow in later vation will soon be apparent: sections. The computations for deducing Theorem 8 from Definition 1: Let X PX be a random variable on R.For Theorem 9 are detailed in subsection IV-B. ϱ ∼ (Y, X) G PX ,definethefamilyoffunctionals(ofPX ) ∼ Y X Remark 2: A somewhat technical point is that the value | of the optimization problem (35) potentially depends on the sλ(X,ϱ) h(X) λh(Y ) := − + set (in which the random variable Q takes values), which inf I (Y V ) λI (X V ) Q + V X Y V ; − ; we have not explicitly specified in Definition 1. However, this : → → " # 2180 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 4, APRIL 2018

is not problematic since any non-singleton set will suf- Now, suppose there is a distribution PXQ having the Q ϱ ∈ P fice. Indeed, the standard dimensionality reduction argument extremal property that, for (Y, X, Q) GY X PXQ and any ϱ ∼ | can be applied: Consider the image of the map PXQ other (Y ′, X ′, Q′) G PX Q with PX Q , 2 +→ ∼ Y ′ X′ ′ ′ ′ ′ ∈ P (EX , sλ(X,ϱ Q)),takenoveralldistributionsPXQ on R | | ×Q with X 2 1 and, say for concreteness, .Asusual, h(Y Q) h(X Q) h(Y ′ Q′) h(X ′ Q′). (40) E R | − | ≤ | − | this image≤ is a convex set in 2 by definition.Q = Hence, by the R We will argue below that by choosing P in this manner, Fenchel-Caratheodory-Bunt theorem [72, Theorem 18(ii)], for XQ the corresponding conditional law PX Q q is Gaussian with any point a b in this set, there is a distribution P on |{ = } ( , ) XQ variance not depending on q. R ,withPQ supported on at most two points in ,that × Q 2 Q Toward this end, let X , X , Y , Y , Q be as in achieves the values EX aandsλ(X,ϱ Q) b. Since + − + − = | = Lemma 1, constructed from independent copies of we will often consider random variables Q taking values on ϱ (Y, X, Q) GY X PXQ.Sincethetransformation different spaces, the set will be implicitly defined as the set ∼ | Q (X1, X2) (X , X ) is variance-preserving, each of in which the random variable Q takes values. However, taking +→ + − the four quantities sλ(X ,ϱ X , Q), sλ(X ,ϱ Y , Q), 1, 2, 3 is generally sufficient for our purposes as will + | − − | + Q ={ } sλ(X ,ϱ Y , Q) and sλ(X ,ϱ X , Q) is at least Vλ(ϱ) be made clear in Section VI-B. + − − + by definition| (i.e., (35)). However,| the assumption that PXQ combined with Lemma 1 establishes the reverse A. Heuristic Proof of Theorem 9 inequality,∈ P giving equality. E.g., Having argued that Theorem 9 is the crux, we now give sλ(X ,ϱ Y , Q) sλ(X ,ϱ Y , Q) Vλ(ϱ). (41) aheuristicdiscussionofthemainideasthatwillbecarried + | − = − | + = forward later into the full arguments. In order to minimize Thus, roughly speaking, the set is closed under the doubling interruption to the flow of the argument, remarks on related operation. It is tempting at thisP point to imagine applying literature are collected at the conclusion of this subsection. the doubling operation ad infinitum, leading to Gaussianity Establishing sufficiency of Gaussian X in optimization by the central limit theorem. However, this approach runs problem (35) appears nontrivial from the outset. However, into the problem that the auxiliary alphabet grows at each the first crucial idea is that we may exploit a “doubling” stage, leading to technical difficulties down theQ line. One could property enjoyed by the functional sλ.Inparticular,wehave potentially control this growth by applying the dimensionality the following simple lemma, established in Section VI-A: reduction operation described in Remark 2. Unfortunately, it is Lemma 1: Let PXQ be a distribution on R ,let not clear that such dimensionality reduction will preserve any ϱ × Q (Y, X, Q) GY X PXQ,andlet(Y1, X1, Q1) and (Y2, X2, Q2) additional “Gaussianity" gained through doubling. The need denote two∼ independent| copies of (Y, X, Q).Define to circumvent this problem is what motivates the additional extremal property (40) which we assumed of PXQ and have X1 X2 X1 X2 X + , X − (37) not yet exploited. 
+ = √ − = √ 2 2 To see how the argument works, let B Bernoulli(1/2) be and, in a similar manner, define Y , Y .LettingQ aBernoullirandomvariabletakingvalueson∼ , ,indepen- + − = {+ −} (Q1, Q2),wehaveforλ 1 dent of X , X , Y , Y , Q.DefineB to be the complement ≥ ¯ of B in + , − ,meaningthat+ − B B , 2sλ(X,ϱ Q) sλ(X ,ϱ X , Q) sλ(X ,ϱ Y , Q) (38) {+ −} { =+}⇔{¯ =−} | ≥ + | − + − | + and vice-versa. Now, let us define a new pair of random and variables (X ′, Q′) as follows: X ′ X B and Q′ (B, YB, Q). By construction and (41), = = ¯ 2sλ(X,ϱ Q) sλ(X ,ϱ Y , Q) sλ(X ,ϱ X , Q). (39) | ≥ + | − + − | + 1 1 sλ(X ′,ϱ Q′) sλ(X ,ϱ Y , Q) sλ(X ,ϱ Y , Q) In view of the central limit theorem, the doubling opera- | = 2 + | − + 2 − | + tion (37) should produce random variables X ,andX that Vλ(ϱ). + − = are “more Gaussian” than X.Theessentialcontentof 2 2 We also have EX ′ EX ,sothatPX Q ,where Lemma 1 is that the doubling operation (37) also improves = ′ ′ ∈ P PX Q denotes the law of (X ′, Q′).Unravelingdefinitions, the value of the sλ functional in the sense of (38) and (39). ′ ′ So, the general idea will be to exploit this property to show the following crucial identity may be obtained: that Gaussian X minimizes the functional sλ( ,ϱ). · h(Y Q) h(X Q) h(Y ′ Q′) h(X ′ Q′) Let us now make this intuition precise, while still glossing | − | = | 1− | over some of the challenging technicalities that need to be I (X X Y , Y , Q). (42) + 2 +; −| + − dealt with. In particular, let us assume that the infimum in (35) Therefore, assumption (40) together with non-negativity of is attained, and let denote the subset of distributions P XQ mutual information implies that we must have achieving the correspondingP minimum, with the additional property that EX 0. The assumption of centered X is I (X1 X2 X1 X2 Y1, Y2, Q1, Q2) = + ; − | only for convenience, since the functional sλ( ,ϱ) is invariant · I (X X Y , Y , Q) 0. (43) to translation of the mean of its argument (an immediate = +; −| + − = consequence of the same translation invariance enjoyed by Let us now take a final leap of faith, and pretend for entropy and mutual information). sake of simplicity that Y1, Y2 are absent in the conditioning COURTADE: STRONG ENTROPY POWER INEQUALITY 2181

in (43). This would imply that I (X X Q1, Q2) 0. inventing the general approach; indeed, Carlen’s contempo- +; −| = Equivalently, X Q1, Q2 q1, q2 and X Q1, Q2 raneous work [46] attributes the idea to Lieb and coined +|{ = } −|{ = q1, q2 are independent for PQ PQ -a.e. q1, q2.However, the “doubling trick” terminology. The fact that this doubling } × X1 Q1, Q2 q1, q2 and X2 Q1, Q2 q1, q2 are inde- trick works well for establishing Gaussian optimality in both pendent|{ by construction.= } At this|{ point,= a classical} char- information-theoretic and functional inequalities is not coin- acterization of the normal law due to Bernstein [73] is cidental. Indeed, there is a precise duality between these helpful: different classes of inequalities; interested readers are referred Lemma 2 [74, Theorem 5.1.1]: If A1, A2 are independent to [78]–[81] for further discussion. random variables such that A1 A2 and A1 A2 are indepen- In a different direction, we remark that the usual state- + − dent, then A1 and A2 are normal, with identical variances. ment of Bernstein’s theorem does not comment on the Since Xi Q1, Q2 q1, q2 is equal in distribution to identical variances of A1, A2 as stated in Lemma 2. |{ = } Xi Qi qi ,fori 1, 2(byindependenceofthecopies However, assuming without loss of generality that A1, A2 |{ = } = (X1, Q1), (X2, Q2)), Bernstein’s theorem allows us to con- are zero-mean, the observation that A1 and A2 have identical 2 2 clude from the above that X1 Q1 q1 and X2 Q2 q2 variances is immediate since E A E A E (A1 A2) |{ = } |{ = } [ 1]− [ 2]= [ − are normal with identical variances. Moreover, freedom of (A1 A2) 0. This fact was explicitly noted by + ]= choosing q1 and q2 lets us further surmise that these variances Geng and Nair [70]. do not depend on the specific choice of q1, q2.So,weare 2 2 left to conclude that X Q q N(µq ,σ ),whereσ |{ = }∼ X X does not depend on q,butthemeanµq may. However, since B. From Theorem 9 to Theorem 8 the functional sλ( ,ϱ) is invariant to translations of the mean, · Taking Theorem 9 for granted, we now detail the com- we arrive at the statement of Theorem 9. putations needed to derive Theorem 8 from it. To begin, The above heuristic argument provides a roadmap for the we establish the following explicit characterization of Vλ(ϱ). complete proof that follows. In order to make the proof Theorem 10: rigorous, there are three main technical issues to be dealt with: First, it is not clear a priori that the infimum in (35) is achieved. Vλ(ϱ) We remedy this by adapting the above argument to work for 1 λ2πe 2πe 1 2 λ log λ 1 log λ 1 log(ϱ) if ϱ λ 1 distributions which closely approach the infimum in (35), and − − − + ≥ − = ⎧ 1 , ) * ) * - 1 are nearly extremal in the sense of (40). This allows us to infer 2 λ log (2πe(1 ϱ)) log (2πe) if ϱ λ 1 . ⎨ + − ≤ − the existence of a sequence of distributions which approach , - the infimum (35), but also have mutual informations of the We require⎩ the following lemma, which is a simple conse- form (43) that asymptotically vanish. Second, the application quence of the conditional EPI, and an equivalent formulation of Bernstein’s theorem above relied on equality in (43), of an inequality observed by Oohama [19]. and also neglected the conditioning on the variables Y , Y . Lemma 3: Let X N(0,γ) and Z N(0, 1) be indepen- 1 2 ∼ ∼ To correct these issues, we develop an information-theoretic dent, and define Y √ϱX Z. 
For λ 1, = + ≥ variation of Bernstein’s theorem that enables us to conclude that the aforementioned asymptotically optimal sequence of inf I (Y V ) λI (X V ) V X Y V ; − ; distributions converges weakly to Gaussian. Finally, to pass : → → ) 1 * λ 1 from this weakly convergent sequence to Theorem 9, we need 2 log ((λ 1)γ ϱ) λ log −λ (1 γϱ) − 1 − + to establish a local semicontinuity property enjoyed by the ⎧ if γϱ λ 1 = 3 ≥ − 4 56 functional PX sλ(X,ϱ).Generallyspeaking,thesethree ⎪0 if γϱ 1 . +→ ⎨ λ 1 points are separately dealt with in Section VI, and Appen- ≤ − dices A and B, respectively. Proof: Let⎩⎪ V be such that X Y V ,andlet Literature Notes: At this point, we would like to make a X V v , Y V v denote the random→ → variables con- few remarks on literature relevant to the above proof. Although ditioned|{ = on} V|{ v=.Since} X, Y are jointly Gaussian and considerably different in the details, the doubling argument V Y {X,wehave= } X V v ρY V v W, presented above draws inspiration from Geng and Nair’s → → γ √ϱ |{ = }=γ 2 ϱ |{ = }+ where ρ 1 γϱ and W N 0,γ 1 γϱ is independent exposition in [70], which established Gaussian optimality := + ∼ − + of Y V v .Bytheentropypowerinequality,itholdsthat) * for different information functionals by appealing to rota- |{ = } 2h(X V v) 2h(ρY V v) 2 h(W ) tional invariance of the optimizers. They refer to their proof e | = e | = e technique as “factorization of concave envelopes”. Another ≥ 2 + 2 γ ϱ 2h(Y V v) γ ϱ e | = 2πe γ . argument appealing to rotational invariance appeared in [75], = (1 γϱ)2 + − 1 γϱ providing an alternate proof to the fact that Gaussian inputs + ' + ( maximize mutual information inGaussianchannels.Inaddi- Rearranging, applying Jensen’s inequality to the convex func- 2 2 tion to these relatively recent appearances in the information tion t 1 log γ ϱ e2t 2 e γ ϱ ,andrear- 2 (1 γϱ)2 π γ 1 γϱ theory literature, this general strategy has been successfully +→ + + − + ranging again, yields) ) ** employed for establishing extremality of Gaussian kernels 2I (Y V ) in functional inequalities [46], [76], [77]. Our reading of 2 I (X V ) 1 γϱe− ; e− ; + . the literature suggests that Lieb [76] deserves credit for ≥ 1 γϱ + 2182 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 4, APRIL 2018
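Theorem 10 (stated above) can be cross-checked numerically against Lemma 3: for Gaussian X_γ the inner infimum reduces to a one-dimensional minimization over t = I(Y;V), and minimizing the resulting s_λ(X_γ, ϱ) over γ ∈ [0,1] should reproduce the closed form. The sketch below is illustrative only; the threshold in Theorem 10 is read as ϱ versus 1/(λ−1), consistent with Lemma 3 and the proof that follows.

```python
import numpy as np

TWO_PI_E = 2 * np.pi * np.e

def s_lambda_gaussian(gam, rho, lam, grid=10_000):
    """s_lambda(X_gamma, rho) for X_gamma ~ N(0, gam), with the inner infimum computed by
    minimizing the Lemma-3 expression t + (lam/2) log(1 + gam*rho*e^{-2t}) - (lam/2) log(1 + gam*rho)
    over t = I(Y;V) >= 0."""
    hX = 0.5 * np.log(TWO_PI_E * gam)
    hY = 0.5 * np.log(TWO_PI_E * (1 + gam * rho))
    t = np.linspace(0, 20, grid)
    inner = np.min(t + 0.5 * lam * np.log1p(gam * rho * np.exp(-2 * t))
                   - 0.5 * lam * np.log1p(gam * rho))
    return -hX + lam * hY + inner

def V_lambda_closed_form(rho, lam):
    """Closed form of Theorem 10, with the threshold read as rho vs 1/(lam - 1)."""
    if rho >= 1.0 / (lam - 1):
        return 0.5 * (lam * np.log(lam * TWO_PI_E / (lam - 1))
                      - np.log(TWO_PI_E / (lam - 1)) + np.log(rho))
    return 0.5 * (lam * np.log(TWO_PI_E * (1 + rho)) - np.log(TWO_PI_E))

rho, lam = 1.5, 2.5
brute = min(s_lambda_gaussian(g, rho, lam) for g in np.linspace(1e-3, 1.0, 400))
print(f"brute-force min over gamma: {brute:.5f}   closed form: {V_lambda_closed_form(rho, lam):.5f}")
```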

It follows that on the event X n ,letYn Xn W and define Vn via {| |≤ } = + 2 PV Y Yn Vn.SinceXn is bounded, E Xn < so that I (Y V ) λI (X V ) | : +→ [ ] ∞ ; − ; 2(h(Yn) I (Xn Vn)) 2(h(Xn) I (Yn Vn)) 2h(W ) λ 2I (Y V ) λ e − ; e − ; e . I (Y V ) log 1 γϱe− ; log(1 γϱ) ≥ ; + 2 + − 2 + ≥ + 1 ) λ 1 * The dominated convergence theorem can be used to show 2 log ((λ 1)γ ϱ) λ log −λ (1 γϱ) − 1 − + that limn h(Xn) h(X),providedh(X) exists and is if γϱ> →∞ = ≥ ⎧ 3 λ 1 4 56 finite, which covers all cases requiring proof in view of the ⎪ −1 ⎨0ifγϱ λ 1 , ≤ comments following Theorem 1. Moreover, since Xn D X, − −→ where⎪ the second inequality follows by minimizing over the Lemma 10 (see Appendix A) asserts that limn h(Xn ⎩ 1 →∞ + scalar quantity I (Y V ) 0. When γϱ λ 1 ,thisis W) h(X W),sothath(Yn) h(Y ).Itiseasytoseethat trivially achieved by; setting≥V constant.Ontheotherhand,≤ − = + → (Xn, Vn) D (X, V ),soliminfn I (Xn Vn) I (X V ) 1 = −→ →∞ ; ≥ ; if γϱ> λ 1 ,thenitiseasytoseethatthelowerboundis by lower semicontinuity of relative entropy. Finally, the fact − 1 γϱ 1 achieved by taking V Y W,whereW N(0, γϱ(λ+ 1) 1 ) that X n Y V ,combinedwiththechainrulefor = + ∼ − − {| |≤ } → → is independent of Y . ! mutual information implies Proof of Theorem 10: Let Xγ N(0,γ).Recallingthe ∼ 1 definition of sλ( ,ϱ),Lemma3implies I (Y V ) I ( X n , Y V ) · ; = {| |≤ } ; I (Y V 1 X n ) sλ(Xγ ,ϱ) ≥ ; | {| |≤ } I (Yn Vn)P X n , 1 λ2πe 2πe 1 ≥ ; {| |≤ } 2 λ log λ 1 log λ 1 log(ϱ) if γϱ λ 1 − − − + ≥ − = 1 1 giving lim supn I (Yn Vn) I (Y V ).Puttingtheseobser- 7 2 ,λ log ()2πe (*1 γϱ))) log* (2πeγ )- if γϱ λ 1 . →∞ ; ≤ ; + − ≤ − vations together, we have established Differentiating3 with respect to the quantity6 γ ,wefindthat 2(h(Y ) I (X V )) 2(h(X) I (Y V)) 2h(W ) 1 e − ; e − ; e 2 λ log (2πe (1 γϱ)) log (2πeγ ) is decreasing in γ pro- +1 − ≥ + vided γϱ λ 1 .Therefore,takingγ 1 minimizes 3 ≤ − 6 = in absence of second moment constraints on X,asdesired.! sλ(Xγ ,ϱ) over the interval γ 0, 1 .Takentogetherwith ∈[ ] Theorem 9, the claim is proved. ! Given the explicit characterization of Vλ(ϱ) afforded by V. E XTENSION TO RANDOM VECTORS Theorem 10, we may now prove Theorem 8. The vector generalization of the classical EPI is usually Proof of Theorem 8: We first establish (34) under the proved by a combination of conditioning, Jensen’s inequality additional assumption that E X 2 < ,andgeneralizeatthe [ ] ∞ and induction (e.g., [52, Problem 2.9]). The same argument end via a truncation argument. Toward this goal, since mutual does not appear to readily apply in generalizing Theorem 8 information is invariant to scaling, it is sufficient to prove to its vector counterpart in Theorem 1 due to complica- that, for Y √ϱX Z with E X 2 1andZ N(0, 1) = + [ ]≤ ∼ tions arising from the Markov constraint X Y V . independent of X,wehave However, the desired generalization may be established→ → by 2(h(Y ) I (X V)) 2(h(X) I (Y V )) 2h(Z) e − ; ϱ e − ; e (44) noting an additivity property enjoyed by the functional sλ of ≥ + Definition 1, appropriately generalized to probability distribu- d for V satisfying X Y V .Multiplyingbothsides tions on R . by σ 2 and choosing ϱ→ Var→(X) gives the desired inequal- d σ 2 For a random vector X PX in R ,lettheconditionallaw 2 := ∼ 1/2 ity (34) when E X < .Thus,toprove(44),observeby PY X be defined via the Gaussian channel Y / X Z, [ ] ∞ | = + definition of Vλ(ϱ) that for all λ 1 where Z N(0, I) is independent of X and / is a d d ≥ diagonal matrix∼ with nonnegative diagonal entries. 
Analogous× h(X) I (Y V ) λ(I (X V ) h(Y )) Vλ(ϱ). (45) − + ; ≥ ; − + to the scalar case, (X, Y ) PX PY X ,wedefinethefamily ∼ | Inequality (44) is now immediately verified by substituting of functionals of PX into (45) the unique λ 1whichsatisfies ≥ sλ(X,/) h(X) λh(Y ) λ 1 2(I (X V ) h(Y )) := − + e− ; − (46) inf I (Y V ) λI (X V ) λ 1 = 2πe + V X Y V ; − ; − : → → and simplifying. Note that there always exists a unique λ 1 " # ≥ parameterized by λ 1. Similarly, for (Y, X, Q) PY X PXQ, that solves (46) since define ≥ ∼ | 1 2(I (X V ) h(Y )) 2h(Z) 2(I (X V ) h(Y )) e− ; − e− e− ; − sλ(X,/ Q) h(X Q) λh(Y Q) 2πe = | := − | + | 2 I (X Y V ) e ; | 1. inf I (Y V Q) λI (X V Q) , + V X Y V Q ; | − ; | = ≥ : → → | 2 " # Now, we eliminate the assumption that E X < .Toward and consider the optimization problem this end, let X have density, let W be Gaussian[ ] independent∞ of X,andconsiderV satisfying X Y V ,whereY Vλ(/) inf sλ(X,/ Q). (47) → → = = P X2 1,i d | X W.DefineXn to be the random variable X conditioned XQ E i + : [ ]≤ ∈[ ] COURTADE: STRONG ENTROPY POWER INEQUALITY 2183

Theorem 11: If / diag(ϱ1,ϱ2,...,ϱd ),then Section IV-A and fill in the details where needed. In particular, = d we first establish the doubling property of the functional sλ, and then proceed to adapt the theheuristicargumentsothat Vλ(/) Vλ(ϱi ). = i 1 it applies to near-extremal distributions. 8= Proof: Consider (Y, X, Q) PY X PXQ defined as above, and let / be a block diagonal∼ matrix| with blocks given by A. Proof of Lemma 1 / diag(/1,/2).PartitionX (X1, X2) and Z (Z1, Z2) In order to establish the doubling property asserted by = 1/2 = = such that Yi / Xi Zi for i 1, 2. Then, for any V Lemma 1, we require the following simple observation, hold- = i + = such that X Y V Q,itfollowsfromLemma4(see ing for all random variables with joint distribution of a → → | Section VI-B) that prescribed form. Lemma 4: Let X (X , X ), Y (Y , Y ),andQhave sλ(X,/ Q) sλ(X1,/1 X2, Q) sλ(X2,/2 Y1, Q). = + − = + − | ≥ | + | joint distribution of the form PXYQ PX X Q PY X PY X . = + − +| + −| − Therefore, by definitions, V (/) V (/ ) V (/ ).The If V satisfies X Y V Q, then for λ 1,wehave λ λ 1 λ 2 → → | ≥ reverse direction of this inequality≥ is immediate;+ the infimum I (Y , Y V Q) h(X , X Q) in (47) cannot decrease if we restrict attention to measures of + −; | − + −| product form PXQ PX Q PX Q ,onwhichsλ is additive. λ (I (X , X V Q) h(Y , Y Q)) = 1 1 2 2 − + −; | − + −| The general case then follows by induction. ! I (Y V X , Q) h(X X , Q) ≥ +; | − − +| − Evidently, Theorem 11 relates Vλ(/) for matrix-valued λ (I (X V X , Q) h(Y X , Q)) parameter / to the quantity V (ϱ),forscalarϱ.However, − +; | − − +| − λ I (Y V Y , Q) h(X Y , Q) we have already seen a complete characterization of the latter + −; | + − −| + quantity in Theorem 10. Thus, we are now positioned to λ (I (X V Y , Q) h(Y Y , Q)) . (48) − −; | + − −| + establish the first stated result of this paper, Theorem 1. Moreover, X Y V (X , Q) and X Y Proof of Theorem 1: Define Y X W for con- + → + → | − − → − → = + V (Y , Q). venience, where W N 0 is independent of X. + ( ,$W ) | Proof: The second claim is straightforward. As in the scalar setting,∼ we establish the claim first under 2 Indeed, using PXYQ PX X Q PY X PY X , the constraint X < .Thegeneralresultfollowsbya + − +| + −| − E we can factor the joint= distribution of (X, Y, V, Q) truncation argument[∥ ∥ exactly] ∞ as in the scalar setting. Moreover, as PXYVQ PX X Q PY X PY X PV Y Y Q we may assume $W 0, else the inequality reduces to = + − +| + −| − | + − = ≻ PX X Q PY X PY V Y X Q .MarginalizingoverY , h(Y ) I (Y V ) h(X) I (X V ),whichistriviallytrueby + − +| + − | + − − + ; ≥ + ; we find that X Y V (X , Q).The the data processing inequality and the fact that conditioning symmetric Markov+ chain→ follows+ → similarly| − by writing reduces entropy. PXYV, Q PX X Y Q PY X PV Y Y Q and marginalizing Thus, due to positive definiteness of and invariance of + − + −| − | + − $W over X . = mutual information under one-one transformations, we may + 1/d To prove the claimed inequality, note the following identity multiply both sides of (3) by $W to obtain the equivalent | |− which does not make use of any particular structure of the inequality joint distribution: 2 1/2 1/2 (h($− Y ) I ($− X V )) e d W − W ; I (Y , Y V Q) h(X , X Q) 2 1/2 1/2 2 1/2 (h($− X) I ($− Y V )) h($− W ) + −; | − + −| e d W − W ; e d W . 
I (Y V Q) I (Y V Q, Y ) ≥ + = +; | + −; | + 1/2 1/2 2 However, $W− W N(0, I) and, E $W− X < h(X Q) h(X Q, X ) 2 ∼ [∥ ∥ ] ∞ − −| − +| − provided E X < ,sowemayassumewithoutlossof I (Y V Q) I (Y V Q, Y ) generality that[∥ ∥W] N∞(0, I) in establishing (3). = +; | + −; | + ∼ 2 h(X Q, Y ) h(X Q, X ) I (X Y Q) To simplify further, put ϱ max1 i d E X .Notethatwe − −| + − +| − − −; +| := ≤ ≤ [ i ] I (Y V Q, X ) I (Y V Q, Y ) may assume ϱ>0, else the claimed inequality is trivial since = +; | − + −; | + h(X) and h(Y ) I (X V ) h(Y ) I (X Y ) h(W). h(X Q, Y ) h(X Q, X ) I (X Y Q, V ), − −| + − +| − − −; +| Therefore,=−∞ (3) is equivalent− to; ≥ − ; = Asymmetricargumentgives 2 (h(Y ) I (X V )) 2 (h(X) I (Y V )) 2 h(Z) e d − ; ϱ e d − ; e d ≥ + I (X , X V Q) h(Y , Y Q) + −; | − + −| holding for X Y V ,whereY √ϱX Z, I (X V Q, Y ) I (X V Q, X ) → → = 2 + = −; | + + +; | − Z N(0, I) is independent of X,andmax1 i d E X 1. ∼ ≤ ≤ [ i ]≤ h(Y Q, X ) h(Y Q, Y ) I (X Y Q, V ). This is established exactly as in the proof of Theorem 8 − +| − − −| + − −; +| (starting from equation (44)), since Vλ(ϱ I) d Vλ(ϱ) by Therefore, · = Theorem 11. ! I (Y , Y V Q) h(X , X Q) + −; | − + −| VI. PROOF OF THEOREM 9 λ (I (X , X V Q) h(Y , Y Q)) − + −; | − + −| This section is dedicated to the proof of Theorem 9. Our I (Y V X , Q) h(X X , Q) = +; | − − +| − agenda will be to revisit the heuristic discussion given in λ (I (X V X , Q) h(Y X , Q)) − +; | − − +| − 2184 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 4, APRIL 2018
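The intuition that the rotation (37) makes X_+ and X_- "more Gaussian" while leaving the Gaussian channel noise i.i.d. can be seen in a small Monte Carlo experiment (illustrative only, not part of the proof): for i.i.d. copies of a centered exponential input, the excess kurtosis of X_+ is roughly halved, while the rotated noise pair remains uncorrelated with unit variances.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

# Non-Gaussian inputs: centered exponential (excess kurtosis 6).
X1 = rng.exponential(1.0, n) - 1.0
X2 = rng.exponential(1.0, n) - 1.0
Xp = (X1 + X2) / np.sqrt(2)          # X_+ in (37)
Xm = (X1 - X2) / np.sqrt(2)          # X_-
print("excess kurtosis X1 :", excess_kurtosis(X1))   # about 6
print("excess kurtosis X_+:", excess_kurtosis(Xp))   # about 3: closer to Gaussian
print("excess kurtosis X_-:", excess_kurtosis(Xm))   # about 3

# The same rotation maps i.i.d. Gaussian channel noise to i.i.d. Gaussian noise.
Z1, Z2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
Zp, Zm = (Z1 + Z2) / np.sqrt(2), (Z1 - Z2) / np.sqrt(2)
print("Var(Z_+), Var(Z_-), Cov:", np.var(Zp), np.var(Zm), np.cov(Zp, Zm)[0, 1])
```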

We are now in a position to establish Lemma 1, stated in Section IV-A.

Proof of Lemma 1: Let all quantities be as defined in the statement of the lemma. The crucial observation is that the unitary transformation $(Y_1,Y_2)\mapsto(Y_+,Y_-)$ preserves the Gaussian nature of the channel. That is, if $Y_i=\sqrt{\varrho}\,X_i+Z_i$, then $Y_+=\sqrt{\varrho}\,X_+ + \tfrac{1}{\sqrt2}(Z_1+Z_2)$ and $Y_-=\sqrt{\varrho}\,X_- + \tfrac{1}{\sqrt2}(Z_1-Z_2)$, where the pair $\big(\tfrac{1}{\sqrt2}(Z_1+Z_2),\tfrac{1}{\sqrt2}(Z_1-Z_2)\big)$ is equal in distribution to $(Z_1,Z_2)$.

Thus, consider an arbitrary $V$ satisfying $(X_+,X_-)\to(Y_+,Y_-)\to V\,|\,Q$. By Lemma 4 and the above observation, we have
$$
\begin{aligned}
&I(Y_1,Y_2;V|Q) - h(X_1,X_2|Q) - \lambda\big(I(X_1,X_2;V|Q) - h(Y_1,Y_2|Q)\big)\\
&\quad= I(Y_+,Y_-;V|Q) - h(X_+,X_-|Q) - \lambda\big(I(X_+,X_-;V|Q) - h(Y_+,Y_-|Q)\big)\\
&\quad\ge I(Y_+;V|X_-,Q) - h(X_+|X_-,Q) - \lambda\big(I(X_+;V|X_-,Q) - h(Y_+|X_-,Q)\big)\\
&\qquad + I(Y_-;V|Y_+,Q) - h(X_-|Y_+,Q) - \lambda\big(I(X_-;V|Y_+,Q) - h(Y_-|Y_+,Q)\big)\\
&\quad\ge s_\lambda(X_+,\varrho\,|\,X_-,Q) + s_\lambda(X_-,\varrho\,|\,Y_+,Q).
\end{aligned}
$$
This proves (38) since
$$
\begin{aligned}
&\inf_{V\,:\,X\to Y\to V|Q}\Big\{ I(Y_1,Y_2;V|Q) - h(X_1,X_2|Q) - \lambda\big(I(X_1,X_2;V|Q) - h(Y_1,Y_2|Q)\big)\Big\}\\
&\quad\le \sum_{i=1}^{2}\;\inf_{V\,:\,X_i\to Y_i\to V|Q_i}\Big\{ I(Y_i;V|Q_i) - h(X_i|Q_i) - \lambda\big(I(X_i;V|Q_i) - h(Y_i|Q_i)\big)\Big\} = 2\,s_\lambda(X,\varrho\,|\,Q),
\end{aligned}
$$
where the inequality follows since the infimum is taken over a smaller set. ∎
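The rotation step above relies only on the fact that $\big((Z_1+Z_2)/\sqrt2,\,(Z_1-Z_2)/\sqrt2\big)$ is again a pair of independent standard Gaussians, so that $(X_+,X_-)$ sees the same channel as $(X_1,X_2)$. A one-line covariance computation confirms this; the Monte Carlo sample below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((2, 200_000))
zp, zm = (z1 + z2) / np.sqrt(2), (z1 - z2) / np.sqrt(2)

# The rotated pair should again have (approximately) identity covariance,
# i.e., behave as two i.i.d. N(0,1) variables.
print(np.cov(np.vstack([zp, zm])).round(3))
```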
B. Existence of Sequences Satisfying $\lim_{n\to\infty} s_\lambda(X_n,\varrho\,|\,Q_n)=V_\lambda(\varrho)$ That Converge Weakly to Gaussian

Define $\mathcal{Q}_3 := \{1,2,3\}$. The following definition will also be convenient.

Definition 2: For given $\varrho>0$ and $\lambda\ge 1$, a sequence $\{X_n,Q_n\}_{n\ge 1}$ is said to be admissible if, for each $n\ge 1$, $(X_n,Q_n)$ takes values in $\mathbb{R}\times\mathcal{Q}_3$, and the following two conditions hold:
$$\lim_{n\to\infty} s_\lambda(X_n,\varrho\,|\,Q_n) = V_\lambda(\varrho) \qquad (49)$$
$$\mathbb{E}[X_n^2]\le 1 \qquad \forall\, n\ge 1. \qquad (50)$$

From Remark 2, it is clear that the set of admissible sequences is nonempty, even under the restriction that the variables $Q_n$ take values in the three-point set $\mathcal{Q}_3$. In fact, two points would suffice to ensure the existence of admissible sequences, but the additional degree of freedom afforded by the third point will be needed in our proof. Associate to each $(X_n,Q_n)$ taking values in $\mathbb{R}\times\mathcal{Q}_3$ the random variable $Y_n$ defined by $Y_n=\sqrt{\varrho}\,X_n+Z$, where $Z\sim N(0,1)$ is independent of $(X_n,Q_n)$. We define the following quantity:
$$h^\star(\varrho) = \inf\Big\{\, \liminf_{n\to\infty}\big(h(Y_n|Q_n)-h(X_n|Q_n)\big) \,:\, \text{the sequence }\{X_n,Q_n\}\text{ is admissible}\,\Big\}.$$
This definition is meant to capture the extremal property (40) in the discussion of Section IV-A, appropriately modified for admissible sequences. In particular, we will aim to show that admissible sequences $\{X_n,Q_n\}$ which are extremal in the sense that $\liminf_{n\to\infty}\big(h(Y_n|Q_n)-h(X_n|Q_n)\big)=h^\star(\varrho)$ will be guaranteed to approach normality in a precise sense.
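For orientation, the conditional entropy gap appearing in the definition of $h^\star(\varrho)$ is easy to evaluate on the simplest family one might try: Gaussian inputs with a constant $Q$ (both assumptions made only for this sketch). For $X\sim N(0,v)$ and $Y=\sqrt{\varrho}X+Z$ the gap is $\tfrac12\log\big((\varrho v+1)/v\big)$, which over the admissible range $v\le 1$ is smallest at $v=1$. This only evaluates the objective on one family; it is not a statement about the value of $h^\star(\varrho)$.

```python
import numpy as np

rho = 0.7  # illustrative channel gain (assumed)

def gap(v):
    """h(Y) - h(X) for X ~ N(0, v), Y = sqrt(rho) X + Z, Z ~ N(0, 1)."""
    return 0.5 * np.log((rho * v + 1.0) / v)

for v in [0.1, 0.25, 0.5, 1.0]:
    print(f"v = {v:4.2f}   gap = {gap(v):.4f} nats")
print(f"0.5*log(1+rho) = {0.5*np.log(1+rho):.4f} nats (attained at v = 1)")
```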

First, we observe that degeneracy is avoided in the sense that $h^\star(\varrho)$ is always finite.

Proposition 2: It holds that $|h^\star(\varrho)|<\infty$.

Proof: Let $(Y_n,X_n,Q_n)$ be as above, with $\{X_n,Q_n\}$ being an admissible sequence. Note first that $h(Y_n|Q_n)-h(X_n|Q_n)\ge 0$ since conditioning reduces entropy, so $h^\star(\varrho)\ge 0$. On the other hand, since $\{X_n,Q_n\}$ is admissible, we have $s_\lambda(X_n,\varrho\,|\,Q_n)\to V_\lambda(\varrho)>-\infty$. However, this is always the case. Indeed, for $\lambda\ge 1$, we use again the inequality $h(Y_n|Q_n)-h(X_n|Q_n)\ge 0$ to observe
$$-h(X_n|Q_n)+\lambda h(Y_n|Q_n) \ge (\lambda-1)h(Y_n|Q_n) \ge (\lambda-1)h(Z) = (\lambda-1)\tfrac12\log(2\pi e), \qquad (51)$$
where $Z\sim N(0,1)$ and (51) follows from the definition of $Y_n$ and the fact that conditioning reduces entropy. Hence,
$$
\begin{aligned}
s_\lambda(X_n,\varrho\,|\,Q_n) &= -h(X_n|Q_n)+\lambda h(Y_n|Q_n) + \inf_{V\,:\,X_n\to Y_n\to V|Q_n}\big\{ I(Y_n;V|Q_n)-\lambda I(X_n;V|Q_n)\big\}\\
&\ge (\lambda-1)\tfrac12\log(2\pi e) + \inf_{V\,:\,X_n\to Y_n\to V|Q_n}\big\{ I(Y_n;V|Q_n)-\lambda I(X_n;V|Q_n)\big\}\\
&\ge (\lambda-1)\tfrac12\log(2\pi e) - (\lambda-1)\,I(X_n;Y_n|Q_n), \qquad (52)
\end{aligned}
$$
where (52) follows from the data processing inequalities $I(Y_n;V|Q_n)\ge I(X_n;V|Q_n)$ and $I(X_n;Y_n|Q_n)\ge I(X_n;V|Q_n)$. Of course, $I(X_n;Y_n|Q_n)\le\tfrac12\log(1+\varrho)$ by the maximum entropy property of Gaussians (since $\mathbb{E}X_n^2\le 1$ by admissibility), so that $V_\lambda(\varrho)>-\infty$ as claimed.

Therefore, if $n$ is sufficiently large so that $s_\lambda(X_n,\varrho\,|\,Q_n)<V_\lambda(\varrho)+1$, there is necessarily some $V_n$ satisfying $X_n\to Y_n\to V_n\,|\,Q_n$ for which
$$
\begin{aligned}
h(Y_n|Q_n)-h(X_n|Q_n) &\le V_\lambda(\varrho)+1+\lambda I(X_n;V_n|Q_n)-I(Y_n;V_n|Q_n)-(\lambda-1)h(Y_n|Q_n)\\
&\le V_\lambda(\varrho)+1+(\lambda-1)I(X_n;V_n|Q_n)-(\lambda-1)h(Y_n|Q_n) \qquad (53)\\
&\le V_\lambda(\varrho)+1+(\lambda-1)I(X_n;Y_n|Q_n)-(\lambda-1)h(Y_n|Q_n)\\
&= V_\lambda(\varrho)+1-(\lambda-1)h(Y_n|X_n), \qquad (54)
\end{aligned}
$$
where (53) and (54) are both due to the data processing inequality. Since $V_\lambda(\varrho)<\infty$ trivially and $h(Y_n|X_n)=\tfrac12\log 2\pi e$, we conclude that $h^\star(\varrho)<\infty$, establishing the claim. ∎
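The only channel-specific fact used above is the maximum-entropy bound $I(X;Y)\le\tfrac12\log(1+\varrho\,\mathbb{E}X^2)$ for $Y=\sqrt{\varrho}X+Z$. The sketch below checks it numerically for a unit-variance binary input (an arbitrary non-Gaussian choice made for illustration): $I(X;Y)=h(Y)-h(Z)$, with $h(Y)$ obtained by numerically integrating the two-component Gaussian mixture density.

```python
import numpy as np

rho = 0.7                       # illustrative channel gain (assumed)
y = np.linspace(-12, 12, 200_001)
dy = y[1] - y[0]

def phi(t):                     # standard normal density
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

# X uniform on {-1, +1}  (so E[X^2] = 1);  Y = sqrt(rho) X + Z
f_y = 0.5 * phi(y - np.sqrt(rho)) + 0.5 * phi(y + np.sqrt(rho))
h_y = -np.sum(f_y * np.log(f_y)) * dy          # numerical differential entropy
h_z = 0.5 * np.log(2 * np.pi * np.e)

print(f"I(X;Y)                       = {h_y - h_z:.4f} nats")
print(f"Gaussian bound 0.5*log(1+rho) = {0.5*np.log(1+rho):.4f} nats")
```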

At this point, we may essentially repeat the heuristic discussion of Section IV-A to obtain the following conclusion that applies to distributions that are near-extremal in a precise sense.

Lemma 5: Fix $\epsilon>0$. Consider a distribution $P_{XQ}$ on $\mathbb{R}\times\mathcal{Q}_3$ such that $\mathbb{E}X=0$, $\mathbb{E}X^2\le 1$,
$$s_\lambda(X,\varrho\,|\,Q)\le V_\lambda(\varrho)+\epsilon \qquad (55)$$
and
$$h(Y|Q)-h(X|Q)\le h^\star(\varrho)+\epsilon, \qquad (56)$$
where $(Y,X,Q)\sim G^{\varrho}_{Y|X}P_{XQ}$. There exists a distribution $P_{X'Q'}$ on $\mathbb{R}\times\mathcal{Q}_3$ such that, for $(Y',X',Q')\sim G^{\varrho}_{Y'|X'}P_{X'Q'}$, we have $\mathbb{E}X'^2\le 1$,
$$s_\lambda(X',\varrho\,|\,Q')\le V_\lambda(\varrho)+2\epsilon$$
and
$$h(Y'|Q')-h(X'|Q') + \tfrac12\, I(X_1+X_2;X_1-X_2\,|\,Y_1,Y_2,Q_1,Q_2) \le h^\star(\varrho)+\epsilon,$$
where $(Y_1,X_1,Q_1)$ and $(Y_2,X_2,Q_2)$ denote independent copies of $(Y,X,Q)$.

Proof: Let $X_+,X_-,Y_+,Y_-,Q$ be as in Lemma 1, constructed from the two independent copies of $(Y,X,Q)$. Applying Lemma 1 to the variables $Q\to(X_+,X_-)\to(Y_+,Y_-)$, we obtain
$$2s_\lambda(X,\varrho\,|\,Q) \ge s_\lambda(X_+,\varrho\,|\,X_-,Q) + s_\lambda(X_-,\varrho\,|\,Y_+,Q), \qquad (57)$$
and the symmetric inequality
$$2s_\lambda(X,\varrho\,|\,Q) \ge s_\lambda(X_+,\varrho\,|\,Y_-,Q) + s_\lambda(X_-,\varrho\,|\,X_+,Q). \qquad (58)$$
By independence of $X_1$ and $X_2$ and the assumption that $\mathbb{E}X=0$, we have
$$\mathbb{E}[X_+^2] = \mathbb{E}[X_-^2] = \tfrac12\mathbb{E}[X_1^2]+\tfrac12\mathbb{E}[X_2^2] = \mathbb{E}[X^2]\le 1.$$
Hence, it follows that the terms in the RHS of (57) and the RHS of (58) are each lower bounded by $V_\lambda(\varrho)$. Combined with (55), we may conclude that
$$\tfrac12 s_\lambda(X_+,\varrho\,|\,Y_-,Q) + \tfrac12 s_\lambda(X_-,\varrho\,|\,Y_+,Q) \le V_\lambda(\varrho)+2\epsilon. \qquad (59)$$

Next, exactly as in Section IV-A, let $B\sim\mathrm{Bernoulli}(1/2)$ be a Bernoulli random variable taking values in $\{+,-\}$, independent of $X_+,X_-,Y_+,Y_-,Q$, and define $\bar B$ to be the complement of $B$ in $\{+,-\}$. Construct a new pair of random variables $(\tilde X,\tilde Q)$ as follows: $\tilde X=X_B$ and $\tilde Q=(B,Y_{\bar B},Q)$. Clearly, $\mathbb{E}\tilde X^2=\mathbb{E}X^2\le 1$. Also, by construction and (59),
$$s_\lambda(\tilde X,\varrho\,|\,\tilde Q) = \tfrac12 s_\lambda(X_+,\varrho\,|\,Y_-,Q) + \tfrac12 s_\lambda(X_-,\varrho\,|\,Y_+,Q) \le V_\lambda(\varrho)+2\epsilon.$$
Moreover, using Markovity, we may establish
$$
\begin{aligned}
h(Y|Q)-h(X|Q) &= \tfrac12\big(h(Y_+,Y_-|Q)-h(X_+,X_-|Q)\big)\\
&= \tfrac12\big(h(Y_-|Y_+,Q)-h(X_-|Y_+,Q)\big) + \tfrac12\big(h(Y_+|Y_-,Q)-h(X_+|Y_-,Q)\big) + \tfrac12 I(X_+;X_-|Y_+,Y_-,Q)\\
&= h(\tilde Y|\tilde Q)-h(\tilde X|\tilde Q) + \tfrac12 I(X_+;X_-|Y_+,Y_-,Q),
\end{aligned}
$$
where $\tilde Y=\sqrt{\varrho}\,\tilde X+Z$, with $Z\sim N(0,1)$ independent of $(\tilde X,\tilde Q)$. In particular, using (56), we have
$$h(\tilde Y|\tilde Q)-h(\tilde X|\tilde Q) + \tfrac12 I(X_+;X_-|Y_+,Y_-,Q) \le h^\star(\varrho)+\epsilon.$$
Evidently, the pair $(\tilde X,\tilde Q)$ satisfies nearly all of the stated properties of $(X',Q')$ in the claim to be proved. The only exception is that $\tilde Q$ takes values in the space $\{+,-\}\times\mathbb{R}\times\mathcal{Q}_3\times\mathcal{Q}_3$, whereas we require that $Q'$ takes values in $\mathcal{Q}_3$. This is easily dealt with by applying the standard dimensionality reduction procedure described in Remark 2. Indeed, by the Fenchel–Carathéodory–Bunt theorem [72, Theorem 18(ii)], there is a pair $(X',Q')$ with $Q'$ supported on at most three points that preserves the values $\mathbb{E}X'^2=\mathbb{E}\tilde X^2$, $s_\lambda(X',\varrho\,|\,Q')=s_\lambda(\tilde X,\varrho\,|\,\tilde Q)$ and $h(Y'|Q')-h(X'|Q')=h(\tilde Y|\tilde Q)-h(\tilde X|\tilde Q)$. Thus, the claim is proved. ∎

Applying Lemma 5 to an appropriately chosen admissible sequence, we obtain the following asymptotic analog of property (43), which was a key milestone in the heuristic discussion of Section IV-A.

Lemma 6: There exists an admissible sequence $\{X_n,Q_n\}$ and a distribution $P_{X_*Q_*}$ on $\mathbb{R}\times\mathcal{Q}_3$ such that $(X_n,Q_n)\overset{D}{\to}(X_*,Q_*)\sim P_{X_*Q_*}$, with $\mathbb{E}[X_*^2]\le 1$ and
$$\liminf_{n\to\infty} I\big(X_{1,n}+X_{2,n};X_{1,n}-X_{2,n}\,\big|\,\mathbf{Y}_n,\mathbf{Q}_n=q\big) = 0 \qquad (60)$$
for $P_{Q_*}\times P_{Q_*}$-a.e. $q$, where $\mathbf{Y}_n:=(Y_{1,n},Y_{2,n})$, $\mathbf{Q}_n:=(Q_{1,n},Q_{2,n})$, and $(Y_{1,n},X_{1,n},Q_{1,n})$ and $(Y_{2,n},X_{2,n},Q_{2,n})$ denote independent copies of $(Y_n,X_n,Q_n)\sim G^{\varrho}_{Y_n|X_n}P_{X_nQ_n}$.

Proof: Let $\{X_n,Q_n\}$ be an admissible sequence with the additional property that
$$\lim_{n\to\infty}\big(h(Y_n|Q_n)-h(X_n|Q_n)\big) = h^\star(\varrho).$$
By a diagonalization argument, such an admissible sequence exists. Since $\mathcal{Q}_3$ is finite and $\mathbb{E}[X_n^2]\le 1$, the sequence $\{X_n,Q_n\}$ is tight. By Prokhorov's theorem [82], we may assume that there is some $(X_*,Q_*)$ for which $(X_n,Q_n)\overset{D}{\to}(X_*,Q_*)$ by restricting our attention to a subsequence of $\{X_n,Q_n\}$ if necessary. Moreover, $\mathbb{E}[X_*^2]\le\liminf_{n\to\infty}\mathbb{E}[X_n^2]\le 1$ by Fatou's lemma. Applying Lemma 5 along the sequence $\{X_n,Q_n\}$ for each $n$, we conclude the existence of another admissible sequence $\{X_n',Q_n'\}$ with the property that
$$\tfrac12\liminf_{n\to\infty} I\big(X_{1,n}+X_{2,n};X_{1,n}-X_{2,n}\,\big|\,\mathbf{Y}_n,\mathbf{Q}_n\big) + \liminf_{n\to\infty}\big(h(Y_n'|Q_n')-h(X_n'|Q_n')\big) \le h^\star(\varrho).$$
However, as $\{X_n',Q_n'\}$ is admissible, the definition of $h^\star(\varrho)$ combined with the fact that it is finite (Proposition 2) implies that we must have
$$\liminf_{n\to\infty} I\big(X_{1,n}+X_{2,n};X_{1,n}-X_{2,n}\,\big|\,\mathbf{Y}_n,\mathbf{Q}_n\big) = 0. \qquad (61)$$
Since $\mathcal{Q}_3$ is finite and $\mathbf{Q}_n\overset{D}{\to}\mathbf{Q}_*$, the desired result follows from (61). ∎

As suggested in the discussion at the end of Section IV-A, to finish the proof of Theorem 9 we will require (i) a generalization of Bernstein's theorem with hypothesis compatible with (60); and (ii) a local semicontinuity property of $s_\lambda(\cdot,\varrho)$. Stated precisely, the two required ingredients are the following:

Lemma 7: Suppose $(X_{1,n},X_{2,n})\overset{D}{\to}(X_{1,*},X_{2,*})$ with $\sup_n\mathbb{E}[X_{i,n}^2]<\infty$ for $i=1,2$. Let $(Z_1,Z_2)\sim N(0,\sigma^2\mathrm{I})$ be pairwise independent of $(X_{1,n},X_{2,n})$ and, for $i=1,2$, define $Y_{i,n}=X_{i,n}+Z_i$. If $X_{1,n},X_{2,n}$ are independent and
$$\liminf_{n\to\infty} I(X_{1,n}+X_{2,n};X_{1,n}-X_{2,n}\,|\,Y_{1,n},Y_{2,n}) = 0, \qquad (62)$$
then $X_{1,*},X_{2,*}$ are independent Gaussian random variables with identical variances.

Lemma 8: If $X_n\overset{D}{\to}X_*\sim N(\mu,\sigma_X^2)$ and $\sup_n\mathbb{E}[X_n^2]<\infty$, then
$$\liminf_{n\to\infty} s_\lambda(X_n,\varrho)\ge s_\lambda(X_*,\varrho). \qquad (63)$$

By assembling Lemmas 6, 7 and 8, Theorem 9 follows almost immediately. The argument is as follows.

Proof of Theorem 9: Let $\{X_n,Q_n\}$, $(Y_{1,n},X_{1,n},Q_{1,n})$, $(Y_{2,n},X_{2,n},Q_{2,n})$ and $(X_*,Q_*)$ be as in Lemma 6. Since $\mathcal{Q}_3$ is finite, we of course have $X_{i,n}|\{Q_{i,n}=q_i\}\overset{D}{\to}X_*|\{Q_*=q_i\}$, for $i=1,2$ and $P_{Q_*}$-a.e. $q_i$. Now, (60) fulfills the hypothesis of Lemma 7, so we are able to conclude that for $P_{Q_*}\times P_{Q_*}$-a.e. $q_1,q_2$, the random variables $X_*|\{Q_*=q_1\}$ and $X_*|\{Q_*=q_2\}$ are Gaussian with identical variances. In other words, for $P_{Q_*}$-a.e. $q$, we have $X_*|\{Q_*=q\}\sim N(\mu_q,\sigma_X^2)$, where $\mu_q$ possibly depends on $q$, but $\sigma_X^2$ does not. Since $\mathbb{E}X_*^2\le 1$, we must have $\sigma_X^2\le 1$.

So, using the lower semicontinuity property of Lemma 8, for $P_{Q_*}$-a.e. $q$,
$$\liminf_{n\to\infty} s_\lambda\big(X_n|\{Q_n=q\},\varrho\big) \ge s_\lambda\big(N(\mu_q,\sigma_X^2),\varrho\big) = s_\lambda\big(N(0,\sigma_X^2),\varrho\big),$$
where the equality follows since $s_\lambda(\cdot,\varrho)$ is invariant to translation of the mean. Thus,
$$\liminf_{n\to\infty} P_{Q_n}(q)\,s_\lambda\big(X_n|\{Q_n=q\},\varrho\big) \ge P_{Q_*}(q)\,s_\lambda\big(N(0,\sigma_X^2),\varrho\big)$$
for each $q\in\mathcal{Q}_3$ satisfying $P_{Q_*}(q)>0$.

We now turn our attention to the possible situation where $P_{Q_*}(q)=0$ for some $q$. Following the same steps leading to (52), for $\lambda\ge 1$ we have the lower bound
$$P_{Q_n}(q)\,s_\lambda\big(X_n|\{Q_n=q\},\varrho\big) \ge P_{Q_n}(q)\Big((\lambda-1)\tfrac12\log(2\pi e) - I(X_n;Y_n|Q_n=q)\Big), \qquad (64)$$
valid for all $q\in\mathcal{Q}_3$. Since $\{X_n,Q_n\}$ is admissible, for each $n\ge 1$,
$$P_{Q_n}(q)\,\mathbb{E}[X_n^2|Q_n=q] \le \mathbb{E}[X_n^2]\le 1 \qquad\forall\, q\in\mathcal{Q}_3.$$
Rearranging gives $\mathbb{E}[X_n^2|Q_n=q]\le\big(P_{Q_n}(q)\big)^{-1}$. By the maximum-entropy property of Gaussians and the definition of $(Y_n,X_n,Q_n)$,
$$I(X_n;Y_n|Q_n=q) \le \tfrac12\log\big(1+\varrho\,\mathbb{E}[X_n^2|Q_n=q]\big) \le \tfrac12\log\Big(1+\varrho\,\big(P_{Q_n}(q)\big)^{-1}\Big),$$
so that (64) yields
$$P_{Q_n}(q)\,s_\lambda\big(X_n|\{Q_n=q\},\varrho\big) \ge P_{Q_n}(q)\Big((\lambda-1)\tfrac12\log(2\pi e) - \tfrac12\log\big(1+\varrho\,(P_{Q_n}(q))^{-1}\big)\Big).$$
Hence, if $q$ is such that $P_{Q_n}(q)\to P_{Q_*}(q)=0$, then
$$\liminf_{n\to\infty} P_{Q_n}(q)\,s_\lambda\big(X_n|\{Q_n=q\},\varrho\big) \ge 0.$$
Combining the above establishes
$$\liminf_{n\to\infty} s_\lambda(X_n,\varrho\,|\,Q_n) \ge s_\lambda\big(N(0,\sigma_X^2),\varrho\big). \qquad (65)$$
Since $\{X_n,Q_n\}$ was assumed to be an admissible sequence, (65) and (49) together ensure that
$$V_\lambda(\varrho) \ge s_\lambda\big(N(0,\sigma_X^2),\varrho\big),$$
which proves the claim. ∎
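Lemma 7 above is a weak form of Bernstein's classical characterization (Lemma 2): for independent summands, independence of the sum and the difference forces Gaussianity. The Monte Carlo sketch below illustrates only the classical dichotomy; the uniform alternative and the particular dependence statistic are choices made for this illustration, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def dependence_score(x1, x2):
    """Correlation between centered squares of X1+X2 and X1-X2.

    A clearly nonzero value certifies that the sum and difference are
    dependent; zero is necessary (not sufficient) for independence.
    """
    s, d = x1 + x2, x1 - x2
    return np.corrcoef((s - s.mean()) ** 2, (d - d.mean()) ** 2)[0, 1]

gauss = rng.standard_normal((2, n))
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))   # unit variance
print(f"Gaussian summands : score = {dependence_score(*gauss):+.4f}")
print(f"Uniform summands  : score = {dependence_score(*unif):+.4f}")
```

The Gaussian score hovers around zero, while the uniform score is visibly negative, consistent with the sum and difference being dependent for any non-Gaussian choice.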

APPENDIX A
A WEAK FORM OF BERNSTEIN'S THEOREM

The aim of this appendix is to prove Lemma 7, which is largely a matter of assembling the needed ingredients. We begin by recalling two facts about random variables that are contaminated by Gaussian noise. Of particular interest to us are weakly convergent sequences of random variables, and corresponding continuity properties when densities are regularized via convolution with a Gaussian density.

Lemma 9 [74, Lemma 5.1.3]: If $X,Z$ are independent random variables and $Z$ is normal, then $X+Z$ has a non-vanishing probability density function which has derivatives of all orders.

Lemma 10 [70, Propositions 16 and 18]: Let $X_n\overset{D}{\to}X_*$ with $\sup_n\mathbb{E}[\|X_n\|^2]<\infty$, and let $Z\sim N(0,\sigma^2\mathrm{I})$ be a non-degenerate Gaussian, independent of $\{X_n\},X_*$. Let $Y_n=X_n+Z$ and $Y_*=X_*+Z$. Finally, let $f_n$ and $f_*$ denote the density of $Y_n$ and $Y_*$, respectively. Then
1) $Y_n\overset{D}{\to}Y_*$;
2) $\|f_n-f_*\|_\infty\to 0$;
3) $h(Y_n)\to h(Y_*)$.

Let us also make note of two characterizations of the normal distribution. First, we remind the reader of the result of Bernstein given previously as Lemma 2. Second, we will need the following observation:

Lemma 11: Let $Y=X+Z$, where $Z\sim N(0,\sigma^2)$ is a non-degenerate Gaussian, independent of $X$. If $X|\{Y=y\}$ is normal for $P_Y$-a.e. $y$, with variance $\sigma_X^2$ not depending on $y$, then $X$ is normal with variance $\frac{\sigma_X^2\sigma^2}{\sigma^2-\sigma_X^2}$.

Proof: If $X|\{Y=y\}$ is normal for $P_Y$-a.e. $y$ with variance $\sigma_X^2$ not depending on $y$, then $X=\mathbb{E}[X|Y]+W$ in distribution, where $W\sim N(0,\sigma_X^2)$ is independent of $Y$. In particular, $X$ has density $f_X$ by Lemma 9. Also by Lemma 9, $Y$ has density $f_Y$. The conditional density $f_{Y|X}$ exists and is Gaussian by definition, and $f_{X|Y}$ is a valid Gaussian density for $P_Y$-a.e. $y$, with corresponding variance $\sigma_X^2$ not depending on $y$. Thus, we have
$$\log f_X(x) = \log f_Y(y) + \log f_{X|Y}(x|y) - \log f_{Y|X}(y|x). \qquad (66)$$
The key observation is that the RHS of (66) is a quadratic function in $x$. Since $f_X$ is a density and must integrate to unity, it must therefore be Gaussian. Direct computation reveals that $X$ has variance $\frac{\sigma_X^2\sigma^2}{\sigma^2-\sigma_X^2}$. ∎

In final preparation for the proof of Lemma 7, we record the following consequence of Lemma 10 and lower semicontinuity of relative entropy:

Lemma 12: Suppose $(X_{1,n},X_{2,n})\overset{D}{\to}(X_{1,*},X_{2,*})$ with $\sup_n\mathbb{E}[X_{i,n}^2]<\infty$ for $i=1,2$. Let $(Z_1,Z_2)\sim N(0,\sigma^2\mathrm{I})$ be pairwise independent of $(X_{1,n},X_{2,n})$ and $(X_{1,*},X_{2,*})$, and, for $i=1,2$, define $Y_{i,n}=X_{i,n}+Z_i$ and $Y_{i,*}=X_{i,*}+Z_i$. Then $(Y_{1,n},Y_{2,n})\overset{D}{\to}(Y_{1,*},Y_{2,*})$ and
$$\liminf_{n\to\infty} I(X_{1,n};X_{2,n}|Y_{1,n},Y_{2,n}) \ge I(X_{1,*};X_{2,*}|Y_{1,*},Y_{2,*}). \qquad (67)$$

Proof: The fact that $(Y_{1,n},Y_{2,n})\overset{D}{\to}(Y_{1,*},Y_{2,*})$ follows from Lemma 10. Lemma 10 also establishes that
$$h(Y_{1,n},Y_{2,n})\to h(Y_{1,*},Y_{2,*}). \qquad (68)$$
On account of the Markov chains $(X_{2,n},Y_{2,n})\to X_{1,n}\to Y_{1,n}$ and $(X_{1,n},Y_{1,n})\to X_{2,n}\to Y_{2,n}$, we have the identity
$$I(X_{1,n};X_{2,n}|Y_{1,n},Y_{2,n}) = I(X_{1,n},Y_{2,n};Y_{1,n},X_{2,n}) - I(X_{1,n},X_{2,n};Y_{1,n},Y_{2,n}), \qquad (69)$$
which is verified as follows:
$$
\begin{aligned}
I(X_{1,n};X_{2,n}|Y_{1,n},Y_{2,n})
&= I(X_{1,n};Y_{1,n}|Y_{2,n}) + I(X_{1,n};X_{2,n}|Y_{1,n},Y_{2,n}) - I(X_{1,n};Y_{1,n}|Y_{2,n})\\
&= I(X_{1,n};Y_{1,n},X_{2,n}|Y_{2,n}) - I(X_{1,n},X_{2,n};Y_{1,n}|Y_{2,n})\\
&= I(Y_{2,n};X_{2,n}) + I(X_{1,n};Y_{1,n},X_{2,n}|Y_{2,n}) - I(X_{1,n},X_{2,n};Y_{2,n}) - I(X_{1,n},X_{2,n};Y_{1,n}|Y_{2,n})\\
&= I(Y_{2,n};Y_{1,n},X_{2,n}) + I(X_{1,n};Y_{1,n},X_{2,n}|Y_{2,n}) - I(X_{1,n},X_{2,n};Y_{1,n},Y_{2,n})\\
&= I(X_{1,n},Y_{2,n};Y_{1,n},X_{2,n}) - I(X_{1,n},X_{2,n};Y_{1,n},Y_{2,n}).
\end{aligned}
$$
Observe that $\liminf_{n\to\infty} I(X_{1,n},Y_{2,n};Y_{1,n},X_{2,n})\ge I(X_{1,*},Y_{2,*};Y_{1,*},X_{2,*})$ due to lower semicontinuity of relative entropy, and $\lim_{n\to\infty} I(X_{1,n},X_{2,n};Y_{1,n},Y_{2,n})=I(X_{1,*},X_{2,*};Y_{1,*},Y_{2,*})$ due to (68) and the fact that $h(Y_{1,*},Y_{2,*}|X_{1,*},X_{2,*})=h(Y_{1,n},Y_{2,n}|X_{1,n},X_{2,n})=h(Z_1,Z_2)$ is constant. Thus, (67) is proved by applying the identity (69) again for $(X_{1,*},X_{2,*},Y_{1,*},Y_{2,*})$. ∎
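Lemma 10 is the workhorse of this appendix, so a small numerical illustration may help. Below, a discrete uniform grid on $[0,1]$ plays the role of $X_n$ converging weakly to $X_*\sim\mathrm{Unif}[0,1]$ (an arbitrary choice for this sketch), $Z\sim N(0,1)$, and the smoothed densities and entropies are evaluated by quadrature; one observes $\|f_n-f_*\|_\infty\to 0$ and $h(Y_n)\to h(Y_*)$, in line with claims 2) and 3).

```python
import numpy as np
from scipy.stats import norm

y = np.linspace(-8, 9, 100_001)
dy = y[1] - y[0]

def entropy(f):
    """Numerical differential entropy of a density sampled on the grid `y`."""
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dy

# Y_* = X_* + Z with X_* ~ Unif[0,1]: density is Phi(y) - Phi(y - 1).
f_star = norm.cdf(y) - norm.cdf(y - 1.0)
h_star = entropy(f_star)

for n in [2, 5, 20, 100]:
    atoms = (np.arange(n) + 0.5) / n                  # X_n uniform on an n-point grid in [0,1]
    f_n = norm.pdf(y[:, None] - atoms[None, :]).mean(axis=1)
    print(f"n = {n:3d}   sup|f_n - f_*| = {np.max(np.abs(f_n - f_star)):.5f}   "
          f"h(Y_n) = {entropy(f_n):.5f}   h(Y_*) = {h_star:.5f}")
```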

Having assembled the required ingredients, we now prove Lemma 7.

Proof of Lemma 7: Let $Y_{i,*}$ be as in the statement of Lemma 12, and recall that the same lemma asserts $(Y_{1,n},Y_{2,n})\overset{D}{\to}(Y_{1,*},Y_{2,*})$. By definition of $Z_1,Z_2$, the random variables $(Z_1+Z_2)$ and $(Z_1-Z_2)$ are independent and Gaussian with respective variances $2\sigma^2$. Thus, noting that assumption (62) is equivalent to
$$\liminf_{n\to\infty} I(X_{1,n}+X_{2,n};X_{1,n}-X_{2,n}\,|\,Y_{1,n}+Y_{2,n},Y_{1,n}-Y_{2,n}) = 0,$$
we may apply Lemma 12 to the sequences $\{X_{1,n}+X_{2,n},\,X_{1,n}-X_{2,n}\}$ and $\{Y_{1,n}+Y_{2,n},\,Y_{1,n}-Y_{2,n}\}$ to obtain
$$I(X_{1,*}+X_{2,*};X_{1,*}-X_{2,*}\,|\,Y_{1,*},Y_{2,*}) = I(X_{1,*}+X_{2,*};X_{1,*}-X_{2,*}\,|\,Y_{1,*}+Y_{2,*},Y_{1,*}-Y_{2,*}) = 0. \qquad (70)$$
Using independence of $X_{1,n},X_{2,n}$, Lemma 12 applied directly yields
$$I(X_{1,*};X_{2,*}|Y_{1,*},Y_{2,*}) = 0. \qquad (71)$$
In particular, for $P_{Y_{1,*}Y_{2,*}}$-a.e. $y_1,y_2$, the random variables $X_{1,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ and $X_{2,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ are independent by (71), and $(X_{1,*}+X_{2,*})|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ and $(X_{1,*}-X_{2,*})|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ are independent by (70). Therefore, Lemma 2 implies that $X_{1,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ and $X_{2,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}$ are each normal with identical variances for $P_{Y_{1,*}Y_{2,*}}$-a.e. $y_1,y_2$. Starting with the third claim of Lemma 10 and applying lower semicontinuity of relative entropy, we observe
$$I(X_{1,*};Y_{1,*}) = \lim_{n\to\infty} I(X_{1,n};Y_{1,n}) = \lim_{n\to\infty} I(X_{1,n};Y_{1,n},Y_{2,n}) \ge I(X_{1,*};Y_{1,*},Y_{2,*}) = I(X_{1,*};Y_{1,*}) + I(X_{1,*};Y_{2,*}|Y_{1,*}),$$

so it follows that $X_{1,*}\to Y_{1,*}\to Y_{2,*}$, and therefore $X_{1,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}\sim X_{1,*}|\{Y_{1,*}=y_1\}$ by conditional independence. Similarly, $X_{2,*}|\{Y_{1,*},Y_{2,*}=y_1,y_2\}\sim X_{2,*}|\{Y_{2,*}=y_2\}$. So, we may conclude that the random variables $X_{1,*}|\{Y_{1,*}=y_1\}$ and $X_{2,*}|\{Y_{2,*}=y_2\}$ are normal, with identical variances not depending on $y_1,y_2$. Invoking Lemma 11, we find that both $X_{1,*}$ and $X_{2,*}$ are normal with identical variances, completing the proof. ∎

APPENDIX B
A LOWER SEMICONTINUITY PROPERTY OF $s_\lambda(\cdot,\varrho)$

The goal of this appendix is to establish the lower semicontinuity property of $s_\lambda(\cdot,\varrho)$ as stated in Lemma 8. In particular, if $X_n\overset{D}{\to}X_*\sim N(\mu,\sigma_X^2)$ and $\sup_n\mathbb{E}[X_n^2]<\infty$, then
$$\liminf_{n\to\infty} s_\lambda(X_n,\varrho) \ge s_\lambda(X_*,\varrho). \qquad (72)$$
Recall that $s_\lambda(X,\varrho)$ is defined in the context of the Gaussian channel $Y=\sqrt{\varrho}\,X+Z$. However, for the purposes of the proof, it will be convenient to omit the scaling factor $\varrho$, and instead parametrize the channel in terms of the noise variance. Toward this end, let $Z\sim N(0,\sigma^2)$. For $\lambda>0$ and a random variable $X\sim P_X$, independent of $Z$, define $Y=X+Z$ and the functionals
$$F_{\lambda,\sigma^2}(X) = \inf_{V\,:\,X\to Y\to V}\big\{ I(Y;V)-\lambda I(X;V)\big\}, \qquad G_{\lambda,\sigma^2}(X) = -h(X)+\lambda h(Y).$$
The semicontinuity property (72) is an immediate corollary of weak lower semicontinuity of $G_{\lambda,\sigma^2}(X)$ and $F_{\lambda,\sigma^2}(X)$ at Gaussian $X$. These facts are established below in separate subsections. The former is straightforward, while the latter is a bit more delicate.

A. Semicontinuity of $G_{\lambda,\sigma^2}$

Here, we establish a semicontinuity property enjoyed by $G_{\lambda,\sigma^2}$. Specifically,

Lemma 13: If $X_n\overset{D}{\to}X_*\sim N(\mu,\sigma_X^2)$ and $\sup_n\mathbb{E}[X_n^2]<\infty$, then
$$\liminf_{n\to\infty} G_{\lambda,\sigma^2}(X_n) \ge G_{\lambda,\sigma^2}(X_*).$$
Proof: Fix $\delta>0$ and define $N_\delta\sim N(0,\delta)$, pairwise independent of $\{X_n\},X_*$. Observe that
$$G_{\lambda,\sigma^2}(X_n) = -h(X_n)+\lambda h(Y_n) \ge -h(X_n+N_\delta)+\lambda h(Y_n).$$
By the third claim of Lemma 10, we have $-h(X_n+N_\delta)+\lambda h(Y_n)\to -h(X_*+N_\delta)+\lambda h(Y_*)$ as $n\to\infty$. Thus,
$$\liminf_{n\to\infty} G_{\lambda,\sigma^2}(X_n) \ge -h(X_*+N_\delta)+\lambda h(Y_*).$$
Since $h(X_*+N_\delta)=\tfrac12\log\big(2\pi e(\sigma_X^2+\delta)\big)$ is continuous in $\delta$, we may take $\delta\downarrow 0$ to prove the claim. ∎

B. Semicontinuity of $F_{\lambda,\sigma^2}$

This section is devoted to proving the required semicontinuity property of $F_{\lambda,\sigma^2}$. Specifically, we aim to show:

Lemma 14: If $X_n\overset{D}{\to}X_*\sim N(\mu,\sigma_X^2)$ and $\sup_n\mathbb{E}[X_n^2]<\infty$, then
$$\liminf_{n\to\infty} F_{\lambda,\sigma^2}(X_n) \ge F_{\lambda,\sigma^2}(X_*). \qquad (73)$$
The proof of Lemma 14 boils down to careful control over various error terms, which are ultimately shown to be negligible. To this end, we begin by establishing three technical estimates. These are then assembled together in the proof of Lemma 14.

Lemma 15: Let $\{Y_n\},Y_*$ be as in Lemma 10, and let $P_{V|Y}$ be given. Fix $b>0$ and define $V_n,V_*$ according to $P_{V|Y}:Y_n\mapsto V_n$ and $P_{V|Y}:Y_*\mapsto V_*$. There exists a positive sequence $\{\epsilon_n\}$ depending on $b$ and $\{Y_n\}$, but not on $P_{V|Y}$, which satisfies $\lim_{n\to\infty}\epsilon_n=0$ and
$$
\begin{aligned}
I\big(V_n;Y_n\,\big|\,|Y_n|\le b\big) &\le (1+\epsilon_n)\,I\big(V_*;Y_*\,\big|\,|Y_*|\le b\big) + \epsilon_n,\\
I\big(V_n;Y_n\,\big|\,|Y_n|\le b\big) &\ge (1-\epsilon_n)\,I\big(V_*;Y_*\,\big|\,|Y_*|\le b\big) - \epsilon_n,\\
\left|\frac{\mathbb{P}(|Y_n|\le b)}{\mathbb{P}(|Y_*|\le b)} - 1\right| &\le \epsilon_n. \qquad (74)
\end{aligned}
$$
Proof: Let $f_n$ and $f_*$ denote the density of $Y_n$ and $Y_*$, respectively. By Lemma 9, the density $f_*$ is continuous and does not vanish, and is therefore bounded away from zero on the interval $B=[-b,b]$. By Lemma 10, $\|f_n-f_*\|_\infty\to 0$, so it follows that
$$\sup_{y\in B}\left|1-\frac{f_n(y)}{f_*(y)}\right|\le\epsilon_n \quad\text{and}\quad \sup_{y\in B}\left|1-\frac{f_*(y)}{f_n(y)}\right|\le\epsilon_n \qquad (75)$$
for some positive $\epsilon_n\to 0$ as $n\to\infty$ (note that $\epsilon_n$ does not depend on $P_{V|Y}$). As a consequence,
$$\mathbb{P}(Y_*\in B) = \int_B f_*(y)\,dy \le (1+\epsilon_n)\int_B f_n(y)\,dy = (1+\epsilon_n)\,\mathbb{P}(Y_n\in B).$$
By symmetry of (75), this establishes (74).

Now, for $y\in B$, the conditional densities of $Y_n|\{Y_n\in B\}$ and $Y_*|\{Y_*\in B\}$ satisfy
$$\frac{f_{Y_n|Y_n\in B}(y)}{f_{Y_*|Y_*\in B}(y)} = \frac{f_n(y)}{f_*(y)}\cdot\frac{\mathbb{P}(Y_*\in B)}{\mathbb{P}(Y_n\in B)} \le (1+\epsilon_n)^2. \qquad (76)$$
Therefore, for any Borel set $A\subset\mathcal{V}$ (here we implicitly assume $P_{V|Y=y}$ is a Borel measure on a topological space $\mathcal{V}$ for each $y$),
$$\mathbb{P}(V_n\in A|Y_n\in B) = \int_B f_{Y_n|Y_n\in B}(y)\,P_{V|Y=y}(A)\,dy \le (1+\epsilon_n)^2\int_B f_{Y_*|Y_*\in B}(y)\,P_{V|Y=y}(A)\,dy = (1+\epsilon_n)^2\,\mathbb{P}(V_*\in A|Y_*\in B).$$
As a consequence,
$$\frac{dP_{V_n|Y_n\in B}}{dP_{V_*|Y_*\in B}}(v) \le (1+\epsilon_n)^2, \qquad v\in\mathcal{V}. \qquad (77)$$

By a symmetric argument, the following counterpart to (76) is obtained:
$$f_{Y_n|Y_n\in B}(y) \ge (1+\epsilon_n)^{-2} f_{Y_*|Y_*\in B}(y) \qquad \forall\, y\in B. \qquad (78)$$
Using the fact that mutual information may be written as a relative entropy and $P_{V_n|Y_n}=P_{V|Y}$ for each $n$,
$$I(V_n;Y_n|Y_n\in B) = \int f_{Y_n|Y_n\in B}(y)\,D\big(P_{V_n|Y_n=y,Y_n\in B}\,\big\|\,P_{V_n|Y_n\in B}\big)\,dy = \int f_{Y_n|Y_n\in B}(y)\,D\big(P_{V|Y=y}\,\big\|\,P_{V_n|Y_n\in B}\big)\,dy.$$
Similarly, since $P_{V_*|Y_*}=P_{V|Y}$,
$$I(V_*;Y_*|Y_*\in B) = \int f_{Y_*|Y_*\in B}(y)\,D\big(P_{V|Y=y}\,\big\|\,P_{V_*|Y_*\in B}\big)\,dy.$$
Hence, if we can show that, for each $y$,
$$D\big(P_{V|Y=y}\,\big\|\,P_{V_n|Y_n\in B}\big) \ge D\big(P_{V|Y=y}\,\big\|\,P_{V_*|Y_*\in B}\big) - 3\epsilon_n,$$
then we may apply (78) and non-negativity of divergence to conclude
$$I(V_n;Y_n|Y_n\in B) \ge (1+\epsilon_n)^{-2}\,I(V_*;Y_*|Y_*\in B) - 3\epsilon_n. \qquad (79)$$
Toward this end, let us recall the Donsker–Varadhan variational formula for relative entropy [83] between probability measures $P\ll Q$:
$$D(P\|Q) = \sup_\phi\left\{\int\phi\,dP - \log\left(\int e^\phi\,dQ\right)\right\},$$
where the supremum is over all bounded $Q$-measurable functions $\phi$. Applied to our situation, $P_{V|Y=y}\ll P_{V_*|Y_*\in B}$ for $y\in B$ (except possibly for $y$ in a null set, which we may neglect), so we are ensured the existence of a bounded function $\phi^*$ for which
$$\int\phi^*\,dP_{V|Y=y} - \log\left(\int e^{\phi^*}\,dP_{V_*|Y_*\in B}\right) \ge D\big(P_{V|Y=y}\,\big\|\,P_{V_*|Y_*\in B}\big) - \epsilon_n.$$
Therefore, using the Donsker–Varadhan formula for $D(P_{V|Y=y}\,\|\,P_{V_n|Y_n\in B})$ and the estimate (77), we have
$$
\begin{aligned}
D\big(P_{V|Y=y}\,\big\|\,P_{V_n|Y_n\in B}\big)
&\ge \int\phi^*\,dP_{V|Y=y} - \log\left(\int e^{\phi^*}\,dP_{V_n|Y_n\in B}\right)\\
&\ge \int\phi^*\,dP_{V|Y=y} - \log\left((1+\epsilon_n)^2\int e^{\phi^*}\,dP_{V_*|Y_*\in B}\right)\\
&= \int\phi^*\,dP_{V|Y=y} - \log\left(\int e^{\phi^*}\,dP_{V_*|Y_*\in B}\right) - 2\log(1+\epsilon_n)\\
&\ge D\big(P_{V|Y=y}\,\big\|\,P_{V_*|Y_*\in B}\big) - 3\epsilon_n.
\end{aligned}
$$
The final step used the inequality $2\log(1+x)\le 2x$ for the purposes of simplifying the expression. Hence, (79) is proved. By symmetry of (75) and the above argument, we also have the complementary inequality
$$I(V_*;Y_*|Y_*\in B) \ge (1+\epsilon_n)^{-2}\,I(V_n;Y_n|Y_n\in B) - 3\epsilon_n.$$
It is now a simple matter to construct a new vanishing sequence $\{\epsilon_n'\}$ from $\{\epsilon_n\}$ which satisfies the claims of the lemma. ∎

Lemma 16: Let $X\sim P_X$ and let $Z\sim N(0,\sigma^2)$ be a non-degenerate Gaussian, independent of $X$. It holds that
$$\lim_{b\to\infty}\mathbb{P}(|X|>b)\,I\big(X;X+Z\,\big|\,|X|>b\big) = 0.$$
Proof: The proof follows that of [84, Theorem 6]. By lower semicontinuity of relative entropy, we have
$$\liminf_{b\to\infty} I\big(X;X+Z\,\big|\,|X|\le b\big) \ge I(X;X+Z).$$
Also,
$$I(X;X+Z) \ge \mathbb{P}(|X|\le b)\,I\big(X;X+Z\,\big|\,|X|\le b\big),$$
so that
$$\lim_{b\to\infty}\mathbb{P}(|X|\le b)\,I\big(X;X+Z\,\big|\,|X|\le b\big) = I(X;X+Z).$$
By the chain rule for mutual information,
$$\mathbb{P}(|X|>b)\,I\big(X;X+Z\,\big|\,|X|>b\big) = I(X;X+Z) - I\big(\mathbf{1}_{\{|X|\le b\}};X+Z\big) - \mathbb{P}(|X|\le b)\,I\big(X;X+Z\,\big|\,|X|\le b\big),$$
so the claim is proved since $I(\mathbf{1}_{\{|X|\le b\}};X+Z)$ vanishes as $b\to\infty$. ∎

Lemma 17: $F_{\lambda,\sigma^2}(X)$ is continuous in $\lambda$. Furthermore, if $X\sim N(\mu,\sigma_X^2)$, then
$$F_{\lambda,\sigma^2}(X) = \frac12\left[\log\left((\lambda-1)\frac{\sigma_X^2}{\sigma^2}\right) - \lambda\log\left(\frac{\lambda-1}{\lambda}\left(1+\frac{\sigma_X^2}{\sigma^2}\right)\right)\right]$$
for $\lambda\ge 1+\frac{\sigma^2}{\sigma_X^2}$, and $F_{\lambda,\sigma^2}(X)=0$ when $0\le\lambda\le 1+\frac{\sigma^2}{\sigma_X^2}$. In particular, $F_{\lambda,\sigma^2}(X)$ is continuous in the parameters $\sigma^2$, $\sigma_X^2$ and $\lambda$ for Gaussian $X$.

Proof: The function $F_{\lambda,\sigma^2}(X)$ is the pointwise infimum of linear functions in $\lambda$, and is therefore concave and continuous on the open interval $\lambda\in(0,\infty)$ for any distribution $P_X$. The explicit expression for $F_{\lambda,\sigma^2}(X)$ follows by identifying $\gamma\varrho\leftarrow\frac{\sigma_X^2}{\sigma^2}$ in Lemma 3. ∎
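As a numerical cross-check of the Gaussian formula in Lemma 17, the sketch below scans auxiliaries of the form $V=Y+N(0,r^2)$ and compares the resulting minimum of $I(Y;V)-\lambda I(X;V)$ with the closed form. Restricting to this Gaussian family is an assumption of the sketch; the fact that its minimum matches the closed form is consistent with Lemma 17. The specific parameter values are arbitrary choices for this illustration.

```python
import numpy as np

sx2, s2, lam = 1.5, 1.0, 3.0          # sigma_X^2, sigma^2 (noise), lambda (assumed values)
assert lam >= 1 + s2 / sx2            # regime where the closed form is nonzero

def objective(r2):
    """I(Y;V) - lam*I(X;V) for X ~ N(0, sx2), Y = X + N(0, s2), V = Y + N(0, r2)."""
    i_yv = 0.5 * np.log((sx2 + s2 + r2) / r2)
    i_xv = 0.5 * np.log((sx2 + s2 + r2) / (s2 + r2))
    return i_yv - lam * i_xv

r2_grid = np.logspace(-4, 6, 20_001)
numeric = min(objective(r2) for r2 in r2_grid)

closed = 0.5 * (np.log((lam - 1) * sx2 / s2)
                - lam * np.log((lam - 1) / lam * (1 + sx2 / s2)))
print(f"grid minimum = {numeric:.6f}")
print(f"closed form  = {closed:.6f}")
```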

We may now prove the desired result, which will conclude the proof of Lemma 8.

Proof of Lemma 14: Fix an interval $B=[-b,b]$, a conditional law $P_{V|Y}$, and $\delta$ satisfying $0<\delta<\sigma^2/2$. Recalling the definition of $Z\sim N(0,\sigma^2)$, decompose $Z=N_1+N_2+N_3$, where $N_1\sim N(0,\delta)$, $N_2\sim N(0,\sigma^2-2\delta)$ and $N_3\sim N(0,\delta)$ are mutually independent. Define $X_n^\delta=X_n+N_1$ and $Y_n^\delta=Y_n-N_3=X_n^\delta+N_2$. Note that we have $X_n\to X_n^\delta\to Y_n^\delta\to Y_n\to V_n$, where $V_n$ is defined by the stochastic transformation $P_{V|Y}:Y_n\mapsto V_n$. Using the notation of Lemma 10, we also have $X_*\to X_*^\delta\to Y_*^\delta\to Y_*\to V_*$, where $Y_*=X_*+Z$, $X_*^\delta=X_*+N_1$, $Y_*^\delta=X_*+N_1+N_2$ and $V_*$ is defined via $P_{V|Y}:Y_*\mapsto V_*$. With these definitions in hand, we may apply Lemma 15 to the processes $\{X_n^\delta\}$, $\{Y_n^\delta\}$ to conclude the existence of a sequence $\epsilon_n\to 0$, not depending on $P_{V|Y}$, that satisfies
$$
\begin{aligned}
I(V_n;X_n^\delta\,|\,X_n^\delta\in B) &\le (1+\epsilon_n)\,I(V_*;X_*^\delta\,|\,X_*^\delta\in B) + \epsilon_n \qquad (80)\\
I(V_n;Y_n^\delta\,|\,Y_n^\delta\in B) &\ge (1-\epsilon_n)\,I(V_*;Y_*^\delta\,|\,Y_*^\delta\in B) - \epsilon_n \qquad (81)\\
\mathbb{P}(X_n^\delta\in B) &\le (1+\epsilon_n)\,\mathbb{P}(X_*^\delta\in B) \qquad (82)\\
\mathbb{P}(Y_n^\delta\in B) &\ge (1-\epsilon_n)\,\mathbb{P}(Y_*^\delta\in B). \qquad (83)
\end{aligned}
$$
Without loss of generality, we may assume that $\epsilon_n<1$ for all $n$.

Having completed the setup, our goal at this point will be to obtain a lower bound on the quantity $I(Y_n;V_n)-\lambda I(X_n;V_n)$ which does not depend on the specific conditional law $P_{V|Y}$. The key idea will be to work with the perturbed random variables $X_n\to X_n^\delta$ and $Y_n\to Y_n^\delta$, and resist temptation to take any limits $n\to\infty$ or $b\to\infty$ until after dependence on $P_{V|Y}$ is eliminated. We begin with the following sequence of inequalities, each of whose steps is individually justified in the sequel:
$$
\begin{aligned}
&I(Y_n;V_n) - \lambda I(X_n;V_n)\\
&\quad\ge I(Y_n^\delta;V_n) - \lambda I(X_n^\delta;V_n) \qquad (84)\\
&\quad= I\big(Y_n^\delta,\mathbf{1}_{\{Y_n^\delta\in B\}};V_n\big) - \lambda I\big(X_n^\delta,\mathbf{1}_{\{X_n^\delta\in B\}};V_n\big) \qquad (85)\\
&\quad= \mathbb{P}(Y_n^\delta\in B)\,I(Y_n^\delta;V_n|Y_n^\delta\in B) + \mathbb{P}(Y_n^\delta\notin B)\,I(Y_n^\delta;V_n|Y_n^\delta\notin B) + I\big(\mathbf{1}_{\{Y_n^\delta\in B\}};V_n\big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\in B)\,I(X_n^\delta;V_n|X_n^\delta\in B) + \mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;V_n|X_n^\delta\notin B) + I\big(\mathbf{1}_{\{X_n^\delta\in B\}};V_n\big)\Big) \qquad (86)\\
&\quad\ge \mathbb{P}(Y_n^\delta\in B)\,I(Y_n^\delta;V_n|Y_n^\delta\in B)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\in B)\,I(X_n^\delta;V_n|X_n^\delta\in B) + \mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) \qquad (87)\\
&\quad\ge \mathbb{P}(Y_n^\delta\in B)\Big((1-\epsilon_n)I(Y_*^\delta;V_*|Y_*^\delta\in B)-\epsilon_n\Big) - \lambda\,\mathbb{P}(X_n^\delta\in B)\Big((1+\epsilon_n)I(X_*^\delta;V_*|X_*^\delta\in B)+\epsilon_n\Big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) \qquad (88)\\
&\quad\ge \mathbb{P}(Y_n^\delta\in B)(1-\epsilon_n)\,I(Y_*^\delta;V_*|Y_*^\delta\in B) - \lambda\,\mathbb{P}(X_n^\delta\in B)(1+\epsilon_n)\,I(X_*^\delta;V_*|X_*^\delta\in B)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) - (\lambda+1)\epsilon_n. \qquad (89)
\end{aligned}
$$
The above steps are justified as follows:
- (84) follows by the data processing inequality.
- (85) follows since $\mathbf{1}_{\{Y_n^\delta\in B\}}$ and $\mathbf{1}_{\{X_n^\delta\in B\}}$ are functions of $Y_n^\delta$ and $X_n^\delta$, respectively.
- (86) follows from the chain rule for mutual information.
- (87) follows from non-negativity of mutual information, the fact that $I(\mathbf{1}_{\{X_n^\delta\in B\}};V_n)\le H(\mathbf{1}_{\{X_n^\delta\in B\}})$, and the data processing inequality, which implies $I(X_n^\delta;V_n|X_n^\delta\notin B)\le I(X_n^\delta;Y_n|X_n^\delta\notin B)$.
- (88) follows from (80) and (81).
- (89) follows since $\mathbb{P}(Y_n^\delta\in B)$ and $\mathbb{P}(X_n^\delta\in B)$ are each at most one.

Now, let us separately bound the first two terms in (89). First, we note that the chain rule for mutual information implies
$$\mathbb{P}(Y_*^\delta\in B)\,I(Y_*^\delta;V_*|Y_*^\delta\in B) = I(Y_*^\delta;V_*) - I\big(\mathbf{1}_{\{Y_*^\delta\in B\}};V_*\big) - \mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;V_*|Y_*^\delta\notin B).$$
So, the first term in (89) may be bounded as
$$
\begin{aligned}
&\mathbb{P}(Y_n^\delta\in B)(1-\epsilon_n)\,I(Y_*^\delta;V_*|Y_*^\delta\in B)\\
&\quad= \frac{\mathbb{P}(Y_n^\delta\in B)}{\mathbb{P}(Y_*^\delta\in B)}(1-\epsilon_n)\Big( I(Y_*^\delta;V_*) - I\big(\mathbf{1}_{\{Y_*^\delta\in B\}};V_*\big) - \mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;V_*|Y_*^\delta\notin B)\Big)\\
&\quad\ge (1-\epsilon_n)^2\Big( I(Y_*^\delta;V_*) - I\big(\mathbf{1}_{\{Y_*^\delta\in B\}};V_*\big) - \mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;V_*|Y_*^\delta\notin B)\Big) \qquad (90)\\
&\quad\ge (1-\epsilon_n)^2\,I(Y_*^\delta;V_*) - \Big(\mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;Y_*|Y_*^\delta\notin B) + H\big(\mathbf{1}_{\{Y_*^\delta\in B\}}\big)\Big), \qquad (91)
\end{aligned}
$$
where
- (90) follows from (83) (note that the term inside the parentheses is proportional to $I(Y_*^\delta;V_*|Y_*^\delta\in B)$, and is therefore nonnegative);
- (91) uses the simple facts that $I(\mathbf{1}_{\{Y_*^\delta\in B\}};V_*)\le H(\mathbf{1}_{\{Y_*^\delta\in B\}})$, that $(1-\epsilon_n)^2\le 1$, and that $I(Y_*^\delta;V_*|Y_*^\delta\notin B)\le I(Y_*^\delta;Y_*|Y_*^\delta\notin B)$, where the third is due to the data processing inequality.

Next, we bound the second term in (89). To start, note that the chain rule for mutual information combined with non-negativity of mutual information gives
$$I(X_*^\delta;V_*|X_*^\delta\in B) \le \frac{1}{\mathbb{P}(X_*^\delta\in B)}\,I(X_*^\delta;V_*).$$
Thus,
$$\lambda\,\mathbb{P}(X_n^\delta\in B)(1+\epsilon_n)\,I(X_*^\delta;V_*|X_*^\delta\in B) \le \frac{\mathbb{P}(X_n^\delta\in B)}{\mathbb{P}(X_*^\delta\in B)}\,\lambda(1+\epsilon_n)\,I(X_*^\delta;V_*) \le \lambda(1+\epsilon_n)^2\,I(X_*^\delta;V_*),$$
where the last inequality is due to (82).

Combining the above two bounds with (89), we have established
$$
\begin{aligned}
&I(Y_n;V_n) - \lambda I(X_n;V_n)\\
&\quad\ge (1-\epsilon_n)^2\,I(Y_*^\delta;V_*) - \lambda(1+\epsilon_n)^2\,I(X_*^\delta;V_*)\\
&\qquad - \Big(\mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;Y_*|Y_*^\delta\notin B) + H\big(\mathbf{1}_{\{Y_*^\delta\in B\}}\big)\Big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) - (\lambda+1)\epsilon_n \qquad (92)\\
&\quad= (1-\epsilon_n)^2\Big( I(Y_*^\delta;V_*) - \lambda_n I(X_*^\delta;V_*)\Big)\\
&\qquad - \Big(\mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;Y_*|Y_*^\delta\notin B) + H\big(\mathbf{1}_{\{Y_*^\delta\in B\}}\big)\Big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) - (\lambda+1)\epsilon_n \qquad (93)\\
&\quad\ge (1-\epsilon_n)^2\,F_{\lambda_n,(\sigma^2-2\delta)}(X_*^\delta)\\
&\qquad - \Big(\mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;Y_*|Y_*^\delta\notin B) + H\big(\mathbf{1}_{\{Y_*^\delta\in B\}}\big)\Big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_n^\delta\notin B)\,I(X_n^\delta;Y_n|X_n^\delta\notin B) + H\big(\mathbf{1}_{\{X_n^\delta\in B\}}\big)\Big) - (\lambda+1)\epsilon_n. \qquad (94)
\end{aligned}
$$
Above, we make the definition $\lambda_n := \frac{(1+\epsilon_n)^2}{(1-\epsilon_n)^2}\lambda$ in (93), and used the definition of $F_{\lambda_n,(\sigma^2-2\delta)}(X_*^\delta)$ in (94).

As desired, the RHS of (94) does not depend on $V_n$ (i.e., on $P_{V|Y}$). Thus, taking the infimum over $V_n$ satisfying $X_n\to Y_n\to V_n$ and then letting $n\to\infty$, we arrive at
$$
\begin{aligned}
\liminf_{n\to\infty} F_{\lambda,\sigma^2}(X_n)
&\ge F_{\lambda,(\sigma^2-2\delta)}(X_*^\delta) - \Big(\mathbb{P}(Y_*^\delta\notin B)\,I(Y_*^\delta;Y_*|Y_*^\delta\notin B) + H\big(\mathbf{1}_{\{Y_*^\delta\in B\}}\big)\Big)\\
&\qquad - \lambda\Big(\mathbb{P}(X_*^\delta\notin B)\,I(X_*^\delta;Y_*|X_*^\delta\notin B) + H\big(\mathbf{1}_{\{X_*^\delta\in B\}}\big)\Big), \qquad (95)
\end{aligned}
$$
which follows due to $\epsilon_n\to 0$ and the following:
- $F_{\lambda_n,(\sigma^2-2\delta)}(X_*^\delta)\to F_{\lambda,(\sigma^2-2\delta)}(X_*^\delta)$ by continuity of the mapping $\lambda\mapsto F_{\lambda,\sigma^2}(X)$ (Lemma 17).
- $\mathbb{P}(X_n^\delta\notin B)\to\mathbb{P}(X_*^\delta\notin B)$ since $X_n^\delta\overset{D}{\to}X_*^\delta$ by the first claim of Lemma 10. By the same token, $H(\mathbf{1}_{\{X_n^\delta\in B\}})\to H(\mathbf{1}_{\{X_*^\delta\in B\}})$ by continuity of the binary entropy function.
- $I(X_n^\delta;Y_n|X_n^\delta\notin B)\to I(X_*^\delta;Y_*|X_*^\delta\notin B)$ by the third claim of Lemma 10, since $\limsup_n\mathbb{E}[X_n^2\,|\,X_n^\delta\notin B]<\infty$ due to the fact that $\sup_n\mathbb{E}[X_n^2]<\infty$ and $\mathbb{P}(X_n^\delta\notin B)\to\mathbb{P}(X_*^\delta\notin B)$, a positive constant.

As we take $b\to\infty$, continuity of the binary entropy function and Lemma 16 together imply that the latter two terms in the RHS of (95) vanish, yielding the inequality
$$\liminf_{n\to\infty} F_{\lambda,\sigma^2}(X_n) \ge F_{\lambda,(\sigma^2-2\delta)}(X_*^\delta). \qquad (96)$$
We finally arrive at the point where the hypothesis $X_*\sim N(\mu,\sigma_X^2)$ is needed. In particular, since $\delta$ was arbitrary and $F_{\lambda,(\sigma^2-2\delta)}(X_*^\delta)$ is continuous in $\delta$ by Lemma 17, the proof is complete by letting $\delta\downarrow 0$. ∎

ACKNOWLEDGMENT

The author acknowledges the Simons Institute for the Theory of Computing, where some of this research was completed.

REFERENCES

[1] T. A. Courtade, "Strengthening the entropy power inequality," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 2294–2298.
[2] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 4, pp. 623–656, Oct. 1948.
[3] A. J. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Inf. Control, vol. 2, no. 2, pp. 101–112, Jun. 1959.
[4] N. M. Blachman, "The convolution inequality for entropy powers," IEEE Trans. Inf. Theory, vol. IT-11, no. 2, pp. 267–271, Apr. 1965.
[5] E. Carlen and A. Soffer, "Entropy production by block variable summation and central limit theorems," Commun. Math. Phys., vol. 140, no. 2, pp. 339–371, 1991.
[6] Y. Polyanskiy and Y. Wu, "Dissipation of information in channels with input constraints," IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan. 2016.
[7] F. du Pin Calmon, Y. Polyanskiy, and Y. Wu, "Strong data processing inequalities in power-constrained Gaussian channels," in Proc. ISIT, Jun. 2015, pp. 2558–2562.
[8] G. Toscani, "A strengthened entropy power inequality for log-concave densities," IEEE Trans. Inf. Theory, vol. 61, no. 12, pp. 6550–6559, Dec. 2015.
[9] T. A. Courtade, M. Fathi, and A. Pananjady, "Wasserstein stability of the entropy power inequality for log-concave random vectors," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 659–663.
[10] M. Madiman and A. Barron, "Generalized entropy power inequalities and monotonicity properties of information," IEEE Trans. Inf. Theory, vol. 53, no. 7, pp. 2317–2329, Jul. 2007.
[11] S. Artstein, K. Ball, F. Barthe, and A. Naor, "Solution of Shannon's problem on the monotonicity of entropy," J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982, 2004.
[12] M. Madiman, J. Melbourne, and P. Xu, "Forward and reverse entropy power inequalities in convex geometry," in Convexity and Concentration (The IMA Volumes in Mathematics and its Applications), vol. 161, E. Carlen, M. Madiman, and E. Werner, Eds. New York, NY, USA: Springer, 2017.
[13] M. H. M. Costa, "A new entropy power inequality," IEEE Trans. Inf. Theory, vol. IT-31, no. 6, pp. 751–760, Nov. 1985.
[14] R. Liu, T. Liu, H. V. Poor, and S. Shamai (Shitz), "A vector generalization of Costa's entropy-power inequality with applications," IEEE Trans. Inf. Theory, vol. 56, no. 4, pp. 1865–1879, Apr. 2010.
[15] M. Costa, "On the Gaussian interference channel," IEEE Trans. Inf. Theory, vol. IT-31, no. 5, pp. 607–615, Sep. 1985.
[16] Y. Polyanskiy and Y. Wu, "Wasserstein continuity of entropy and outer bounds for interference channels," IEEE Trans. Inf. Theory, vol. 62, no. 7, pp. 3992–4002, Jul. 2016.
[17] G. Bagherikaram, A. S. Motahari, and A. K. Khandani, "The secrecy capacity region of the Gaussian MIMO broadcast channel," IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 2673–2682, May 2013.
[18] A. B. Wagner, S. Tavildar, and P. Viswanath, "Rate region of the quadratic Gaussian two-encoder source-coding problem," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 1938–1961, May 2008.
[19] Y. Oohama, "Gaussian multiterminal source coding," IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1912–1923, Nov. 1997.
[20] O. Rioul, "Information theoretic proofs of entropy power inequalities," IEEE Trans. Inf. Theory, vol. 57, no. 1, pp. 33–55, Jan. 2011.
[21] P. Bergmans, "A simple converse for broadcast channels with additive white Gaussian noise (Corresp.)," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 279–280, Mar. 1974.
[22] H. Weingarten, Y. Steinberg, and S. Shamai (Shitz), "The capacity region of the Gaussian multiple-input multiple-output broadcast channel," IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 3936–3964, Sep. 2006.
[23] M. Mohseni and J. M. Cioffi, "A proof of the converse for the capacity of Gaussian MIMO broadcast channels," in Proc. IEEE Int. Symp. Inf. Theory, Jul. 2006, pp. 881–885.
[24] S. Leung-Yan-Cheong and M. E. Hellman, "The Gaussian wire-tap channel," IEEE Trans. Inf. Theory, vol. IT-24, no. 4, pp. 451–456, Jul. 1978.
[25] E. Tekin and A. Yener, "The Gaussian multiple access wire-tap channel," IEEE Trans. Inf. Theory, vol. 54, no. 12, pp. 5747–5755, Dec. 2008.
[26] L. Ozarow, "On a source-coding problem with two channels and three receivers," Bell Syst. Tech. J., vol. 59, no. 10, pp. 1909–1921, Dec. 1980.
[27] Y. Oohama, "Rate-distortion theory for Gaussian multiterminal source coding systems with several side informations at the decoder," IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2577–2593, Jul. 2005.
[28] V. Prabhakaran, D. Tse, and K. Ramachandran, "Rate region of the quadratic Gaussian CEO problem," in Proc. Int. Symp. Inf. Theory (ISIT), Jun. 2004, p. 119.
[29] L. Gross, "Logarithmic Sobolev inequalities," Amer. J. Math., vol. 97, no. 4, pp. 1061–1083, 1975.
[30] D. Bakry, I. Gentil, and M. Ledoux, Analysis and Geometry of Markov Diffusion Operators, vol. 348. Cham, Switzerland: Springer, 2013.
[31] M. Ledoux, The Concentration of Measure Phenomenon (Mathematical Surveys and Monographs), vol. 89. Providence, RI, USA: AMS, 2005.
[32] J. A. Thomas and T. M. Cover, Elements of Information Theory. Hoboken, NJ, USA: Wiley, 2012.
[33] A. Dembo, "Simple proof of the concavity of the entropy power with respect to added Gaussian noise," IEEE Trans. Inf. Theory, vol. 35, no. 4, pp. 887–888, Jul. 1989.

[34] C. Villani, "A short proof of the 'concavity of entropy power'," IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 1695–1696, Jul. 2000.
[35] T. A. Courtade, "Concavity of entropy power: Equivalent formulations and generalizations," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2017, pp. 56–60.
[36] T. A. Courtade, G. Han, and Y. Wu. (2017). "A counterexample to the vector generalization of Costa's EPI, and partial resolution." [Online]. Available: https://arxiv.org/abs/1704.06164
[37] M. Madiman, "On the entropy of sums," in Proc. IEEE Inf. Theory Workshop, May 2008, pp. 303–307.
[38] M. Costa and T. Cover, "On the similarity of the entropy power inequality and the Brunn–Minkowski inequality (Corresp.)," IEEE Trans. Inf. Theory, vol. IT-30, no. 6, pp. 837–839, Nov. 1984.
[39] S. Bobkov and M. Madiman, "Reverse Brunn–Minkowski and reverse entropy power inequalities for convex measures," J. Funct. Anal., vol. 262, no. 7, pp. 3309–3339, 2012.
[40] V. D. Milman, "Inégalité de Brunn-Minkowski inverse et applications à la théorie locale des espaces normés," C. R. Acad. Sci. Paris, vol. 302, no. 1, pp. 25–28, 1986.
[41] K. Ball, F. Barthe, and A. Naor, "Entropy jumps in the presence of a spectral gap," Duke Math. J., vol. 119, no. 1, pp. 41–63, 2003.
[42] O. Johnson and A. Barron, "Fisher information inequalities and the central limit theorem," Probab. Theory Rel. Fields, vol. 129, no. 3, pp. 391–409, Jul. 2004.
[43] O. Johnson, Information Theory and the Central Limit Theorem. Singapore: World Scientific, 2004.
[44] K. Ball and V. H. Nguyen, "Entropy jumps for isotropic log-concave random vectors and spectral gap," Studia Math., vol. 213, no. 1, pp. 81–96, 2012.
[45] E. Nelson, "The free Markoff field," J. Funct. Anal., vol. 12, no. 2, pp. 211–227, 1973.
[46] E. A. Carlen, "Superadditivity of Fisher's information and logarithmic Sobolev inequalities," J. Funct. Anal., vol. 101, no. 1, pp. 194–211, 1991.
[47] M. Raginsky and I. Sason, "Concentration of measure inequalities in information theory, communications, and coding," Found. Trends Commun. Inf. Theory, vol. 10, nos. 1–2, pp. 1–247, 2013.
[48] S. G. Bobkov, N. Gozlan, C. Roberto, and P.-M. Samson, "Bounds on the deficit in the logarithmic Sobolev inequality," J. Funct. Anal., vol. 267, no. 11, pp. 4110–4138, 2014.
[49] M. Fathi, E. Indrei, and M. Ledoux, "Quantitative logarithmic Sobolev inequalities and stability estimates," Discrete Continuous Dyn. Syst., A, vol. 36, no. 12, pp. 6835–6853, 2016.
[50] J. Dolbeault and G. Toscani, "Stability results for logarithmic Sobolev and Gagliardo–Nirenberg inequalities," Int. Math. Res. Notices, vol. 2016, no. 2, pp. 473–498, Jan. 2015.
[51] T. A. Courtade, M. Fathi, and A. Pananjady. (2017). "Existence of Stein kernels under a spectral gap, and discrepancy bounds." [Online]. Available: https://arxiv.org/abs/1703.07707
[52] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[53] T. Berger, "Multiterminal source coding," in The Information Theory Approach to Communications, G. Longo, Ed. New York, NY, USA: Springer-Verlag, 1977.
[54] S.-Y. Tung, "Multiterminal source coding," Ph.D. dissertation, Dept. Elect. Eng., Cornell Univ., Ithaca, NY, USA, 1978.
[55] J. Wang, J. Chen, and X. Wu, "On the sum rate of Gaussian multiterminal source coding: New proofs and results," IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3946–3960, Aug. 2010.
[56] T. A. Courtade and J. Jiao, "An extremal inequality for long Markov chains," in Proc. 52nd Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2014, pp. 763–770.
[57] T. A. Courtade and T. Weissman, "Multiterminal source coding under logarithmic loss," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 740–761, Jan. 2014.
[58] J. Jiao, T. A. Courtade, K. Venkat, and T. Weissman, "Justification of logarithmic loss via the benefit of side information," IEEE Trans. Inf. Theory, vol. 61, no. 10, pp. 5357–5365, Oct. 2015.
[59] H. Sato, "The capacity of the Gaussian interference channel under strong interference (Corresp.)," IEEE Trans. Inf. Theory, vol. IT-27, no. 6, pp. 786–788, Nov. 1981.
[60] T. Han and K. Kobayashi, "A new achievable rate region for the interference channel," IEEE Trans. Inf. Theory, vol. IT-27, no. 1, pp. 49–60, Jan. 1981.
[61] M. H. M. Costa, "Noisebergs in Z-Gaussian interference channels," in Proc. Inf. Theory Appl. Workshop (ITA), Feb. 2011, pp. 1–6.
[62] C. Nair and M. H. Costa, "Gaussian Z-interference channel: Around the corner," in Proc. Inf. Theory Appl. Workshop (ITA), 2016, pp. 1–6.
[63] R. Ahlswede and P. Gács, "Spreading of sets in product spaces and hypercontraction of the Markov operator," Ann. Probab., vol. 4, no. 6, pp. 925–939, Dec. 1976.
[64] C. Nair, "Equivalent formulations of hypercontractivity using information measures," in Proc. Int. Zurich Seminar Commun., 2014, p. 42.
[65] V. Anantharam, A. A. Gohari, S. Kamath, and C. Nair, "On hypercontractivity and the mutual information between Boolean functions," in Proc. Allerton, 2013, pp. 13–19.
[66] M. Raginsky, "Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3355–3389, Jun. 2016.
[67] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. (2013). "On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover." [Online]. Available: https://arxiv.org/abs/1304.6133
[68] T. A. Courtade, "Outer bounds for multiterminal source coding via a strong data processing inequality," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2013, pp. 559–563.
[69] C. Nair, "Upper concave envelopes and auxiliary random variables," Int. J. Adv. Eng. Sci. Appl. Math., vol. 5, no. 1, pp. 12–20, 2013.
[70] Y. Geng and C. Nair, "The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages," IEEE Trans. Inf. Theory, vol. 60, no. 4, pp. 2087–2104, Apr. 2014.
[71] C. Nair, L. Xia, and M. Yazdanpanah, "Sub-optimality of Han–Kobayashi achievable region for interference channels," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2015, pp. 2416–2420.
[72] H. G. Eggleston, Convexity (Cambridge Tracts in Mathematics and Mathematical Physics), no. 47. Cambridge, U.K.: Cambridge Univ. Press, 1958.
[73] S. N. Bernstein, "On a property characteristic of the normal law," Trudy Leningrad. Polytech. Inst., vol. 3, pp. 21–22, 1941.
[74] W. Bryc, The Normal Distribution: Characterizations With Applications, vol. 100. New York, NY, USA: Springer-Verlag, 2012.
[75] Y. Polyanskiy, "Channel coding: Non-asymptotic fundamental limits," Ph.D. dissertation, Dept. Elect. Eng., Princeton Univ., Princeton, NJ, USA, 2010.
[76] E. H. Lieb, "Gaussian kernels have only Gaussian maximizers," Invent. Math., vol. 102, no. 1, pp. 179–208, 1990.
[77] F. Barthe, "Optimal Young's inequality and its converse: A simple proof," Geometric Funct. Anal., vol. 8, no. 2, pp. 234–242, 1998.
[78] E. A. Carlen and D. Cordero-Erausquin, "Subadditivity of the entropy and its relation to Brascamp–Lieb type inequalities," Geometric Funct. Anal., vol. 19, no. 2, pp. 373–405, 2009.
[79] J. Liu, T. A. Courtade, P. Cuff, and S. Verdú, "Brascamp–Lieb inequality and its reverse: An information theoretic view," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1048–1052.
[80] S. Beigi and C. Nair, "Equivalent characterization of reverse Brascamp–Lieb-type inequalities using information measures," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1038–1042.
[81] S. Kamath, "Reverse hypercontractivity using information measures," in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2015, pp. 627–633.
[82] R. Durrett, Probability: Theory and Examples. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[83] M. D. Donsker and S. R. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time, I," Commun. Pure Appl. Math., vol. 28, no. 1, pp. 1–47, 1975.
[84] Y. Wu and S. Verdú, "Functional properties of minimum mean-square error and mutual information," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1289–1301, Mar. 2012.

Thomas A. Courtade (S'06–M'13) received the B.Sc. degree (summa cum laude) in electrical engineering from Michigan Technological University, Houghton, MI, USA, in 2007, and the M.S. and Ph.D. degrees from the University of California, Los Angeles (UCLA), CA, USA, in 2008 and 2012, respectively. He is an Assistant Professor with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA. Prior to joining UC Berkeley in 2014, he was a Postdoctoral Fellow supported by the NSF Center for Science of Information.

Prof. Courtade's honors include a Distinguished Ph.D. Dissertation Award and an Excellence in Teaching Award from the UCLA Department of Electrical Engineering, and a Jack Keil Wolf Student Paper Award for the 2012 International Symposium on Information Theory. He is the recipient of a Hellman Fellowship and an NSF CAREER Award.