The Differential Entropy of Mixtures: New Bounds and Applications

James Melbourne1, Saurav Talukdar1, Shreyas Bhaban1, Mokshay Madiman2 and Murti Salapaka1

1 Department of Electrical and Computer Engineering, University of Minnesota
2 Department of Mathematical Sciences, University of Delaware

Abstract—Mixture distributions are extensively used as a modeling tool in diverse areas, from machine learning to communications engineering to physics, and obtaining bounds on the entropy of probability distributions is of fundamental importance in many of these applications. This article provides sharp bounds on the entropy concavity deficit, which is the difference between the entropy of the mixture and the weighted sum of entropies of the constituent components. Toward establishing lower and upper bounds on the concavity deficit, results that are of importance in their own right are obtained. In order to obtain nontrivial upper bounds, properties of the skew-divergence are developed and notions of "skew" f-divergences are introduced; a reverse Pinsker inequality and a bound on the Jensen-Shannon divergence are obtained along the way. Complementary lower bounds are derived with special attention paid to the case that corresponds to independent summation of a continuous and a discrete random variable. Several applications of the bounds are delineated, including to the capacity of additive noise channels, thermodynamics of computation, and functional inequalities.

I. INTRODUCTION

Mixture models are extensively employed in diverse disciplines including genetics, biology, medicine, economics, and speech recognition; they also arise as the distribution of a signal at the receiver of a communication channel when the transmitter sends a random element of a codebook, or in models of clustering or classification in machine learning (see, e.g., [23], [52]). A mixture

model is described by a density of the form $f(x) = \sum_i p_i f_i(x)$, where each $f_i$ is a probability density function and each $p_i$ is a nonnegative weight with $\sum_i p_i = 1$. Such mixture densities have a natural probabilistic meaning as outcomes of a two stage random process, with the first stage being a random draw, $i$, from the probability mass function $p$, followed by choosing a real-valued vector $x$ following a distribution $f_i(\cdot)$; equivalently, it is the density of $X + Z$, where $X$ is a discrete random

variable taking values $x_i$ with probabilities $p_i$, and $Z$ is a dependent random variable such that $\mathbb{P}(Z \in A \mid X = x_i) = \int_A f_i(z - x_i)\,dz$. The differential entropy of this mixture is of significant interest. Our original motivation for this paper came from the fundamental study of thermodynamics of computation, in which memory models are well approximated by mixture models, while the erasure of a bit of information is akin to the state described by a single unimodal density. Of fundamental importance here is the entropy of the mixture model, which is used to estimate the thermodynamic change in entropy in an erasure process and for other computations [31], [57]. It is not possible, analytically, to determine the differential entropy of the mixture model $\sum_i p_i f_i$, even in the simplest case where the $f_i$ are normal distributions, and hence one is interested in refined bounds on the same. While this was our original motivation, the results of this paper are more broadly applicable and we strive to give general statements so as not to limit the applicability. For a random vector $Z$ taking values in $\mathbb{R}^d$ with probability density $f$, the differential entropy is defined as $h(Z) =$

$h(f) = -\int_{\mathbb{R}^d} f(z)\ln f(z)\,dz$, where the integral is taken with respect to Lebesgue measure. We will frequently omit the qualifier "differential" when this is obvious from context and simply call it the entropy. It is to be noted that, unlike $h(\sum_i p_i f_i)$, the quantity $\sum_i p_i h(f_i)$ is more readily determinable, and thus the concavity deficit $h(\sum_i p_i f_i) - \sum_i p_i h(f_i)$ is of interest. This quantity can also be interpreted as a generalization of the Jensen-Shannon divergence [5], and its quantum analog (with density functions replaced by density matrices and Shannon entropy replaced by von Neumann entropy) is the Holevo information, which plays a key role in Holevo's theorem bounding the amount of accessible (classical) information in a quantum state [26]. It is a classical fact going back to the origins of information theory that the entropy $h$ is a concave function, which means that the concavity deficit is always nonnegative:

$$h(f) - \sum_i p_i h(f_i) \geq 0. \qquad (1)$$
Let $X$ be a random variable that takes values in a countable set, where $X = x_i$ with probability $p_i$. Intimately related to the entropy $h$ of a mixture distribution $f = \sum_i p_i f_i$ and the concavity deficit are the quantities $H(p) := -\sum_i p_i \log p_i$, and the conditional entropies $h(Z|X)$ and $H(X|Z)$. Indeed, it is easy to show an upper bound on the concavity deficit (see, e.g., [64]) in the form

$$h(f) - \sum_i p_i h(f_i) \leq H(p), \qquad (2)$$

which relates the entropy of the continuous variable $Z$ with density $f = \sum_i p_i f_i$ to the entropy of a random variable that lies in a countable set. A main thrust of this article will be to provide refined upper and lower bounds on the concavity deficit that improve upon the basic bounds (1) and (2).

The main upper bound we establish is inspired by bounds in the quantum setting developed by Audenaert [3] and utilizes the total variation distance. Given two probability densities $f_1$ and $f_2$ with respect to a common measure $\mu$, the total variation distance between them is defined as $\|f_1 - f_2\|_{TV} = \frac{1}{2}\int |f_1 - f_2|\,d\mu$. We will state the following theorem in terms of the usual differential entropy, on Euclidean space with respect to the Lebesgue measure. Within the article, the statements and proofs will be given for a general Polish measure space $(E, \gamma)$, from which the result below can be recovered as a special case.

Theorem I.1. Suppose $f = \sum_i p_i f_i$, where the $f_i$ are probability density functions on $\mathbb{R}^d$, $p_i \geq 0$, $\sum_i p_i = 1$. Define the mixture complement of $f_j$ by $\tilde{f}_j(z) = \sum_{i \neq j} \frac{p_i}{1 - p_j} f_i(z)$. Then

$$h(f) - \sum_i p_i h(f_i) \leq T_f\,H(p)$$
where
$$T_f := \sup_i \|f_i - \tilde{f}_i\|_{TV}.$$
Theorem I.1 shows that as the component distributions cluster in total variation distance, the concavity deficit vanishes. The above result thus considerably reduces the conservativeness of the upper bound on the concavity deficit given by (2). Indeed, consider the following example with $f_1(z) = e^{-(z-a)^2/2}/\sqrt{2\pi}$ and $f_2(z) = e^{-(z+a)^2/2}/\sqrt{2\pi}$. By (2), for $p \in (0,1)$,
$$h(p f_1 + (1-p) f_2) \leq H(p) + \tfrac{1}{2}\log 2\pi e.$$
However, noting that $\tilde{f}_1 = f_2$ and $\tilde{f}_2 = f_1$ implies,

$$T_f = \|f_1 - f_2\|_{TV} = \int_0^\infty \Big(e^{-(z-a)^2/2}/\sqrt{2\pi} - e^{-(z+a)^2/2}/\sqrt{2\pi}\Big)\,dz = \Phi(a) - \Phi(-a),$$
where $\Phi$ is the standard normal distribution function $\Phi(t) := \int_{-\infty}^t e^{-x^2/2}/\sqrt{2\pi}\,dx$. Since $\Phi(a) - \Phi(-a) \leq a\sqrt{2/\pi}$, Theorem I.1 gives
$$h(p f_1 + (1-p) f_2) \leq a\sqrt{\tfrac{2}{\pi}}\,H(p) + \tfrac{1}{2}\log 2\pi e. \qquad (3)$$
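As a quick numerical illustration of this example (a minimal sketch of ours, not taken from the paper; variable names are ours), the following Python snippet estimates the mixture entropy by numerical integration and compares it with the right-hand side of (3).

```python
import numpy as np

def mixture_entropy_bound_check(a=1.0, p=0.3):
    """Numerically check bound (3) for the two-Gaussian mixture example (nats)."""
    z = np.linspace(-15, 15, 200001)
    f1 = np.exp(-(z - a) ** 2 / 2) / np.sqrt(2 * np.pi)
    f2 = np.exp(-(z + a) ** 2 / 2) / np.sqrt(2 * np.pi)
    f = p * f1 + (1 - p) * f2

    # differential entropy of the mixture, h(f) = -int f log f
    h_mix = -np.trapz(f * np.log(f), z)

    # binary entropy H(p) and the bound a*sqrt(2/pi)*H(p) + 0.5*log(2*pi*e)
    H_p = -p * np.log(p) - (1 - p) * np.log(1 - p)
    bound = a * np.sqrt(2 / np.pi) * H_p + 0.5 * np.log(2 * np.pi * np.e)

    print(f"h(mixture) = {h_mix:.4f}  <=  bound (3) = {bound:.4f}")

mixture_entropy_bound_check()
```

On the stated parameters the mixture entropy sits comfortably below the bound, and the gap closes as $a \to 0$, matching the clustering intuition above.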

Another interpretation of Theorem I.1 is as a generalization of the classical bounds on the Jensen-Shannon divergence by the total variation distance [35], [59], which is recovered by taking $p_1 = p_2 = \frac{1}{2}$; see Corollary III.11. Let us point out that $\Phi(a) - \Phi(-a) = 1 - \mathbb{P}(|Z| > a)$, where $Z$ is a standard normal variable; thus an alternative representation of these bounds in this special case is as the mutual information bound
$$I(X; Z) \leq H(X) - \mathbb{P}(|Z| > a)\,H(X), \qquad (4)$$
where $X$ denotes an independent Bernoulli variable taking the values $\pm a$ with probabilities $p$ and $1 - p$.

The methods and technical development toward establishing Theorem I.1 are of independent interest. We develop a notion of skew $f$-divergence for general $f$-divergences, generalizing the skew divergence (or skew relative entropy) introduced by Lee [32], and in Theorem III.1 show that the class of $f$-divergences is stable under the skew operation. After proving elementary properties of the skew relative entropy in Proposition III.3 and of an introduced skew chi-squared divergence in Proposition III.8, we adapt arguments due to Audenaert [3] from the quantum setting to prove that the two $f$-divergences are intertwined through a differential equality and that the classical upper bound of the relative entropy by the chi-square divergence can be generalized to the skew setting (see Theorem III.2). After further bounding the skew chi-square divergence by the total variation distance, we integrate the differential equality and obtain a bound of the skew divergence by the total variation in Theorem III.3. As a corollary we obtain a reverse Pinsker inequality due to Verdú [62]. With these tools in hand, Theorem I.1 is proven and we demonstrate that the bound of the Jensen-Shannon divergence by total variation distance [35], [59] is an immediate special case.

In the converse direction, which provides lower bounds on the concavity deficit, our main result applies to the case where all the component densities come from perturbations of a random vector $W$ in $\mathbb{R}^d$ that has a log-concave and spherically symmetric distribution. We say a random vector $W$ has a log-concave distribution when it possesses a density $\varphi$ satisfying $\varphi((1-t)z + ty) \geq \varphi^{1-t}(z)\varphi^t(y)$. We say $W$ has a spherically symmetric distribution when there exists $\psi : \mathbb{R} \to [0, \infty)$ such that the density $\varphi(z) = \psi(|z|)$ for every $z \in \mathbb{R}^d$, where $|z| := \sqrt{z_1^2 + z_2^2 + \cdots + z_d^2}$. We employ the notation

$B_\lambda = \{x \in \mathbb{R}^d : |x| \leq \lambda\}$ for the centered closed ball of radius $\lambda$ in $\mathbb{R}^d$, $T(t) := T_W(t) := \mathbb{P}(|W| > t)$ for the tail probability of $W$, and $\|A\| := \sup_{\|w\|_2 = 1}\|Aw\|_2$ for the operator norm of a linear map $A : \mathbb{R}^d \to \mathbb{R}^d$.

Theorem I.2. Suppose that there exists $\tau \geq 1$ such that for each $x \in \mathcal{X}$, $Z|X = x$ has distribution given by $T_x(W)$, where $W$ has density $\varphi$, spherically symmetric and log-concave, and $T_x$ is a $\sqrt{\tau}$ bi-Lipschitz function. For $i, j \in \mathcal{X}$, take $T_{ij} := T_i^{-1} \circ T_j$, and further assume there exists $\lambda > 0$ such that for any $k \neq i$, $T_{ij}(B_\lambda) \cap T_{kj}(B_\lambda) = \emptyset$. Then
$$h(Z) - h(Z|X) \geq H(X) - \tilde{C}(W), \qquad (5)$$
where $\tilde{C}$ is the following function, dependent heavily on the tail behavior of $|W|$,

$$\tilde{C}(W) = T(\lambda)\big(1 + h(W)\big) + T^{1/2}(\lambda)\big(\sqrt{d} + K(\varphi)\big) \qquad (6)$$
with
$$K(\varphi) := \log\Big[\tau^d\Big(\|\varphi\|_\infty + \big(\tfrac{3}{\lambda}\big)^d\omega_d^{-1}\Big)\Big]\,\mathbb{P}^{1/2}(|W| > \lambda) + d\left(\int_{B_\lambda^c}\varphi(w)\log^2\Big(1 + \tau + \frac{\tau^2|w|}{\lambda}\Big)dw\right)^{1/2} \qquad (7)$$
where $\omega_d$ denotes the volume of the $d$-dimensional unit ball and $B_\lambda^c$ denotes the complement of $B_\lambda$ in $\mathbb{R}^d$.

We note the quantity $H(X|Z)$ connotes the uncertainty in the discrete variable $X$ conditioned on the continuous variable $Z$; such a quantity needs to be defined/determined from knowledge of the probabilities $p_i$ that the discrete variable $X = x_i$ and the description of the conditional probability density function $p(z|x_i) = f_i(z)$. These notions are made precise in Section II. There, it is also established that (2) can be equivalently formulated as $H(X|Z) \leq H(X)$ for a particular coupling of a discrete variable $X$ taking values with probabilities $\{p_i\}$, and a variable $Z$ with density $f_i$ when conditioned on $X = x_i$. From this perspective the super-concavity bound of Theorem IV.1 gives $H(X|Z) \leq \tilde{C}(W)$. One should also note that when $\{p_i\}_{i=1}^n$ is a finite sequence, classical bounds on $H(X|Z)$ are provided by Fano's inequality: for a Markov triple of random variables $X \to Z \to \hat{X}$, and $e = \{X \neq \hat{X}\}$,

$$H(X|Z) \leq H(e) + \mathbb{P}(e)\log(\#\mathcal{X} - 1), \qquad (8)$$
where $\#\mathcal{X}$ denotes the cardinality of the set $\mathcal{X}$, and where we have employed the notation, for a measurable set $A$, $H(A) = -\mathbb{P}(A)\log\mathbb{P}(A) - (1 - \mathbb{P}(A))\log(1 - \mathbb{P}(A))$. This gives a strengthening of concavity in many situations; it yields
$$h(f) \geq \sum_{i=1}^n p_i h(f_i) + H(p) - \big(H(e) + \mathbb{P}(e)\log(\#\mathcal{X} - 1)\big). \qquad (9)$$
To compare the strength of the bounds derived in Theorem IV.1 to Fano's, we compare $H(e) + \mathbb{P}(e)\log(\#\mathcal{X} - 1)$ and $\tilde{C}(W)$; as is established in Section IV, even in simple cases, $\tilde{C}(W)$ can be arbitrarily small even while $\min_{\hat{X}} H(e) + \mathbb{P}(e)\log(\#\mathcal{X} - 1)$ is arbitrarily large.

The study of the entropy of mixtures has a long history and is scattered across a variety of papers that often have other primary emphases. Consequently it is difficult to exhaustively review all related work. Nonetheless, the references that we were able to find that attempt to obtain refined bounds under various circumstances are [1], [8], [27], [30], [42], [46]. In all of these papers, however, either the bounds deal with specialized situations, or with a general setup but employing different (and typically far more) information than we require. We emphasize that our bounds deal with general multidimensional situations (including in particular multivariate Gaussian mixtures, which are historically and practically of high interest) and in that sense go beyond the previous literature.

The article is organized as follows. In Section II we give notation and preliminaries, where we delineate definitions and relationships for entropies and conditional entropies, emphasizing a mix of continuous and discrete variables. Section III is devoted to the proof of Theorem I.1 (a preliminary version has appeared in the conference paper [40]). In Section IV we prove Theorem IV.1. These results give a considerable generalization of earlier work of the authors in [39], [56]. The result hinges on Lemma IV.2, bounding the sum $\sum_i \varphi(x_i)$ for $x_i$ well spaced and $\varphi$ log-concave and spherically symmetric, and on a concentration result from convex geometry [22]. We close by discussing bounds on $\mathbb{P}(|W| \geq t)$ in the case that $W$ is log-concave and strongly log-concave; see Corollary IV.9. In Section V we demonstrate applications of the theorems to a diverse group of problems: hypothesis testing, capacity estimation, nanoscale energetics, and functional inequalities.

II. NOTATION AND PRELIMINARIES

In this part of the article, we will elucidate definitions and results for entropy, conditional entropy, and mutual information when a mix of discrete valued and continuous random variables are involved. We will assume that
1) $X$ takes values in a discrete countable set $\mathcal{X}$, and that $\mathbb{P}(X = x) = p_x > 0$.

2) $Z$ is a random variable which takes values in a Polish space $E$. The conditional distribution is described by $\mathbb{P}(Z \in A|X = x) = \int_A f_x(z)\,d\gamma(z)$, where $f_x(z)$ is a density function with respect to a Radon reference measure $\gamma$.
We will denote by $m$ the counting measure on $\mathcal{X}$, so that for $A \subseteq \mathcal{X}$, the measure of $A$ is its cardinality,
$$m(A) = \#(A). \qquad (10)$$

Integration with respect to the counting measure corresponds to summation; for $g : \mathcal{X} \to \mathbb{R}$ such that $\sum_{x \in \mathcal{X}}|g(x)| < \infty$,

$$\int_{\mathcal{X}} g(x)\,dm := \sum_{x \in \mathcal{X}} g(x). \qquad (11)$$
We will denote by $dm\,d\gamma$ the product measure on $\mathcal{X} \times E$ where, for $A \subseteq \mathcal{X}$ and measurable $B \subseteq \mathbb{R}^d$,

$$\int_{\mathcal{X} \times E} 1_{A \times B}(x, z)\,dm(x)\,d\gamma(z) := m(A)\,\gamma(B). \qquad (12)$$
When $\gamma$ denotes the $d$-dimensional Lebesgue measure, we will use $|B|_d$, or $|B|$ when there is no risk of confusion, to denote the Lebesgue volume of a measurable set $B$. For measures $P$ and $Q$ on a shared measure space, we say $P$ has a density $\varphi$ with respect to $Q$ when any measurable set $A$ satisfies

$$P(A) = \int_A \varphi\,dQ. \qquad (13)$$
Such a $\varphi$ will also be written as $\frac{dP}{dQ}$. For random variables $U$ and $V$ whose induced probability measures admit densities with respect to a reference measure $\gamma$, in the sense that $\mu(A) := \mathbb{P}(U \in A) = \int_A u\,d\gamma$ and $\nu(A) := \mathbb{P}(V \in A) = \int_A v\,d\gamma$ where $u$ and $v$ are density functions with respect to $\gamma$, the relative entropy (or KL divergence) is defined as
$$D(U\|V) := D(\mu\|\nu) := D(u\|v) := \int u\log\frac{u}{v}\,d\gamma. \qquad (14)$$
When $U$ is an $E$-valued random variable with density $u$ with respect to a reference measure $\gamma$, denote the entropy

$$h_\gamma(U) := h_\gamma(u) := -\int u(z)\log u(z)\,d\gamma(z), \qquad (15)$$
whenever the above integral is well defined. When $\gamma$ is the Lebesgue measure we denote the usual differential entropy

$$h(U) := h(u) := -\int u(x)\log u(x)\,dx. \qquad (16)$$
When $U$ is discrete, taking values $x \in \mathcal{X}$ with probability $p_x$, define

$$H(U) := H(p) := -\sum_{x \in \mathcal{X}} p_x\log p_x. \qquad (17)$$
When $t \in [0,1]$, we define $H(t)$ to be the entropy of a Bernoulli random variable with parameter $t$, $H(t) := -(1-t)\log(1-t) - t\log t$. When $A$ is an event, $H(A) := H(\mathbb{P}(A))$. The following proposition elucidates notions of conditional entropy and mutual information for a mix of discrete and continuous random variables.

Proposition II.1. Suppose $X$ is a discrete random variable with values in a countable set $\mathcal{X}$ and $Z$ is a Borel measurable random variable taking values in $E$. Suppose, for all $x \in \mathcal{X}$, $\mathbb{P}(Z \in A|X = x) = \int_A f_x(z)\,d\gamma(z)$ for a density $f_x(z)$ with respect to a common reference measure $\gamma$. Then the following hold.
- The joint distribution of $(X, Z)$ on $\mathcal{X} \times E$ has a density

$$F(x, z) = p_x f_x(z) \qquad (18)$$
with respect to $dm\,d\gamma$.
- $Z$ has a density

$$f(z) = \sum_{x \in \mathcal{X}} p_x f_x(z) \qquad (19)$$
with respect to $\gamma$ on $E$.
- The conditional density of $X$ with respect to $Z = z$, defined as
$$p(x|z) = \begin{cases}\frac{p_x f_x(z)}{f(z)} & \text{for } f(z) > 0 \\ 0 & \text{otherwise,}\end{cases} \qquad (20)$$

satisfies $\mathbb{P}(X = x) = \int_E p(x|z)f(z)\,d\gamma(z) = p_x$.

Proof. Note that since a set $A \subseteq \mathcal{X} \times E$ can be decomposed into a countable union of disjoint sets $\{x\} \times A_x$ where $A_x = \{z \in E : (x, z) \in A\}$, to prove $F(x, z)$ is the joint density function of $(X, Z)$, it suffices to prove

$$P_{XZ}(\{x\} \times A) = \int_{\{x\}\times A} F(x, z)\,dm(x)\,d\gamma(z). \qquad (21)$$
Indeed,

$$P_{XZ}(A) = P_{XZ}\big(\cup_x\{x\} \times A_x\big) \qquad (22)$$
$$= \sum_x P_{XZ}(\{x\} \times A_x), \qquad (23)$$
while

$$\int_A F(x, z)\,dm(x)\,d\gamma(z) = \int_{\cup_x\{x\}\times A_x} F(x, z)\,dm(x)\,d\gamma(z) \qquad (24)$$
$$= \sum_x \int_{\{x\}\times A_x} F(x, z)\,dm(x)\,d\gamma(z). \qquad (25)$$
Since (21) would give equality of the summands of (23) and (25), the result would follow. We compute directly.

$$P_{XZ}(\{x\} \times A) = \mathbb{P}(X = x, Z \in A) \qquad (26)$$
$$= \mathbb{P}(X = x)\,\mathbb{P}(Z \in A|X = x) \qquad (27)$$

$$= p_x\int_A f_x(z)\,d\gamma(z) \qquad (28)$$
$$= p_x\int_{\{x\}}\int_A f_x(z)\,d\gamma(z)\,dm(x) \qquad (29)$$
$$= \int_{\{x\}\times A} F(x, z)\,dm(x)\,d\gamma(z). \qquad (30)$$
This gives the first claim. For the second,

$$\mathbb{P}(Z \in A) = \sum_{x \in \mathcal{X}}\mathbb{P}(X = x, Z \in A) \qquad (31)$$
$$= \sum_{x \in \mathcal{X}} p_x\int_A f_x(z)\,d\gamma(z) \qquad (32)$$
$$= \int_A f(z)\,d\gamma(z). \qquad (33)$$
The last assertion is immediate,

$$\int_E p(x|z)f(z)\,d\gamma(z) = \int_E \frac{p_x f_x(z)}{f(z)}\,f(z)\,d\gamma(z) = p_x. \qquad (34)$$
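To make Proposition II.1 concrete, here is a small Python sketch (ours, not from the paper; the Gaussian mixture and its parameters are illustrative assumptions) for a finite mixture on the real line: it evaluates the joint density $F(x,z) = p_x f_x(z)$, the marginal mixture density $f(z)$, the conditional mass function $p(x|z)$ of (20), and checks numerically that $\int p(x|z)f(z)\,dz$ recovers $p_x$.

```python
import numpy as np

# mixture weights p_x and component means (unit-variance Gaussian components f_x)
p = np.array([0.2, 0.5, 0.3])
means = np.array([-3.0, 0.0, 4.0])

def f_x(z):
    """Component densities f_x(z), one row per component."""
    return np.exp(-(z[None, :] - means[:, None]) ** 2 / 2) / np.sqrt(2 * np.pi)

z = np.linspace(-15, 15, 100001)
F = p[:, None] * f_x(z)             # joint density F(x, z) = p_x f_x(z)
f = F.sum(axis=0)                   # mixture density f(z) = sum_x p_x f_x(z)
post = F / f                        # conditional p(x|z), eq. (20) (f > 0 on this grid)

# check P(X = x) = \int p(x|z) f(z) dz = p_x for each x
recovered = np.trapz(post * f, z, axis=1)
print(np.round(recovered, 4))       # should reproduce [0.2, 0.5, 0.3]
```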

Proposition II.1 allows the following definitions of conditional entropies.

- $h_\gamma(Z|X = x) = -\int_E f(z|X = x)\log f(z|X = x)\,d\gamma(z) = -\int f_x\log f_x\,d\gamma$ and thus
$$h(Z|X) := E_X[h(Z|X = x)] = -\sum_{x \in \mathcal{X}} p_x\int_{z \in E} f_x(z)\log f_x(z)\,d\gamma(z) = \sum_{x \in \mathcal{X}} p_x h_\gamma(f_x). \qquad (35)$$
- $H(X|Z = z) = -\sum_{x \in \mathcal{X}} p(X = x|Z = z)\log p(X = x|Z = z) = -\sum_{x \in \mathcal{X}} p(x|z)\log p(x|z) = H(p(\cdot|z))$ and thus
$$H(X|Z) = E_Z[H(X|Z = z)] := \int_E\Big(-\sum_{x \in \mathcal{X}} p(x|z)\log p(x|z)\Big)f(z)\,d\gamma(z). \qquad (36)$$
Let us note how the entropy of a mixture can be related to its relative entropy with respect to a dominating distribution $g$. The entropy concavity deficit of a convex combination of densities $f_i$ is the convexity deficit of the relative entropy with respect to a reference measure in the following sense.

Proposition II.2. For a density $g$ such that $\sum_x p_x D(f_x\|g) < \infty$,

$$h_\gamma(f) - \sum_x p_x h_\gamma(f_x) = \sum_x p_x D(f_x\|g) - D(f\|g). \qquad (37)$$

Proof.
$$\sum_x p_x D(f_x\|g) - D(f\|g) = \sum_x p_x\left(\int f_x\log\frac{f_x}{g} - f_x\log\frac{f}{g}\,d\gamma\right) = \sum_x p_x\left(\int f_x\log f_x - f_x\log f\,d\gamma\right) = h_\gamma(f) - \sum_x p_x h_\gamma(f_x).$$

Note that the left hand side of Proposition II.2 is invariant with respect to $g$. Thus for $g_1, g_2$ such that $\sum_x p_x D(f_x\|g_j) < \infty$,

$$\sum_x p_x D(f_x\|g_1) - D(f\|g_1) = \sum_x p_x D(f_x\|g_2) - D(f\|g_2). \qquad (38)$$
Taking $g_1 = g$ and $g_2 = f = \sum_x p_x f_x$ yields the compensation identity,

$$\sum_x p_x D(f_x\|g) = \sum_x p_x D(f_x\|f) + D(f\|g), \qquad (39)$$
which is often used to obtain its immediate corollary

$$\min_g\sum_x p_x D(f_x\|g) = \sum_x p_x D(f_x\|f). \qquad (40)$$
We define the mutual information between probability measures $P_U$ and $P_V$ with joint distribution $P_{UV}$ and product distribution $P_U P_V$ as the relative entropy of the joint distribution with respect to the product distribution,

$$I(P_U; P_V) = D(P_{UV}\|P_U P_V).$$

For random variables $U$ and $V$ inducing probability measures $P_U$ and $P_V$, we will write $I(U; V) = I(P_U; P_V)$.

Proposition II.3. For $X$ discrete with $\mathbb{P}(X = x) = p_x$ and $Z$ satisfying $\mathbb{P}(Z \in B|X = x) = \int_B f_x(z)\,d\gamma(z)$,

$$I(X; Z) = h_\gamma(Z) - h_\gamma(Z|X) \qquad (41)$$
$$= H(X) - H(X|Z) \qquad (42)$$

$$= \sum_{x \in \mathcal{X}} p_x D(f_x\|f). \qquad (43)$$
Proof. By Proposition II.1, $P_{XZ}$ has density $F(x, z) = p_x f_x(z)$ with respect to $dm(x)\,d\gamma(z)$, the product of the counting measure $m$ and $\gamma$. The product measure $P_X P_Z$ has density $G(x, z) = p_x f(z)$ with respect to $dm\,d\gamma$, and it follows that
$$\frac{dP_{XZ}}{dP_X P_Z}(x, z) = \frac{dP_{XZ}/dm\,d\gamma}{dP_X P_Z/dm\,d\gamma}(x, z) = \frac{F(x, z)}{G(x, z)} = \frac{f_x(z)}{f(z)}. \qquad (44)$$
By equation (44),

$$D(P_{XZ}\|P_X P_Z) = \int_{\mathcal{X}\times E} F(x, z)\log\frac{f_x(z)}{f(z)}\,dm\,d\gamma \qquad (45)$$
$$= \int_E\sum_{x \in \mathcal{X}} p_x f_x(z)\log\frac{f_x(z)}{f(z)}\,d\gamma(z). \qquad (46)$$
Recalling $p(x|z)$ from Proposition II.1, using the algebra of logarithms and Fubini-Tonelli,

$$\int_E\sum_{x \in \mathcal{X}} p_x f_x(z)\log\frac{f_x(z)}{f(z)}\,d\gamma(z) \qquad (47)$$
$$= \int_E\sum_{x \in \mathcal{X}} p_x f_x(z)\log\frac{p(x|z)}{p_x}\,d\gamma(z) \qquad (48)$$
$$= -\sum_{x \in \mathcal{X}} p_x\log p_x\int_E f_x(z)\,d\gamma(z) + \int_E f(z)\sum_{x \in \mathcal{X}} p(x|z)\log p(x|z)\,d\gamma(z) \qquad (49)$$
$$= H(p) - \int f(z)H(p(\cdot|z))\,d\gamma(z) \qquad (50)$$
$$= H(X) - H(X|Z), \qquad (51)$$
giving (42). By Fubini-Tonelli,

$$\int_E\sum_{x \in \mathcal{X}} p_x f_x(z)\log\frac{f_x(z)}{f(z)}\,d\gamma(z) = \sum_{x \in \mathcal{X}} p_x\int_E f_x(z)\log\frac{f_x(z)}{f(z)}\,d\gamma(z) \qquad (52)$$
$$= \sum_{x \in \mathcal{X}} p_x D(f_x\|f), \qquad (53)$$
we have expression (43). By Proposition II.2,

$$\sum_{x \in \mathcal{X}} p_x D(f_x\|f) = h_\gamma(f) - \sum_{x \in \mathcal{X}} p_x h_\gamma(f_x) \qquad (54)$$
$$= h(Z) - h(Z|X), \qquad (55)$$
and (41) follows.

Using Proposition II.3, we can give a simple information theoretic proof of a result proved analytically in [9], [64].

Corollary II.4. When $\mathcal{X} \subseteq E$, and $\gamma$ is a Haar measure, then $X$ and $Z$ satisfy

$$h_\gamma(X + Z) \leq H(X) + h_\gamma(Z|X) \qquad (56)$$
which reduces to
$$h_\gamma(X + Z) \leq H(X) + h_\gamma(Z) \qquad (57)$$
in the case that $X$ and $Z$ are independent.

Proof. Applying Proposition II.3 to $X$ and $\tilde{Z} = X + Z$ we have

$$h_\gamma(X + Z) = H(X) + h_\gamma(X + Z|X) - H(X|X + Z) \qquad (58)$$

$$= H(X) + h_\gamma(Z|X) - H(X|X + Z), \qquad (59)$$
where the second equality follows from the assumption that $\gamma$ is a Haar measure. Since $H(X|X + Z) \geq 0$, (56) follows, while (57) follows from $h_\gamma(X + Z|X) = h_\gamma(Z)$ under the assumption of independence.

Incidentally, the main use of Corollary II.4 in [64] is to give a rearrangement-based proof of the entropy power inequality (see [37] for much more in this vein).
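Before moving on, here is a short numerical sanity check (our own sketch, not part of the paper; the two-component Gaussian mixture is an illustrative assumption) of Proposition II.3 and Corollary II.4: the two expressions (41) and (43) for $I(X;Z)$ should agree, and $h(Z)$ should not exceed $H(X) + h(Z|X)$ as in (56).

```python
import numpy as np

p = np.array([0.25, 0.75])          # P(X = x_i)
locs = np.array([-2.0, 2.0])        # x_i; Z | X = x_i ~ N(x_i, 1), i.e. Z = X + W
z = np.linspace(-12, 12, 100001)

fx = np.exp(-(z[None, :] - locs[:, None]) ** 2 / 2) / np.sqrt(2 * np.pi)
f = (p[:, None] * fx).sum(axis=0)   # mixture density of Z

def ent(d):                          # differential entropy (nats) on the grid
    return -np.trapz(d * np.log(d), z)

H_X = -(p * np.log(p)).sum()
h_Z = ent(f)
h_Z_given_X = sum(pi * ent(fi) for pi, fi in zip(p, fx))
kl_mix = (p * np.trapz(fx * np.log(fx / f), z, axis=1)).sum()

print("h(Z) - h(Z|X)       =", round(h_Z - h_Z_given_X, 4))   # (41)
print("sum_x p_x D(f_x||f) =", round(kl_mix, 4))               # (43), should match
print("H(X) + h(Z|X)       =", round(H_X + h_Z_given_X, 4), ">= h(Z) =", round(h_Z, 4))  # (56)
```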

III. UPPER BOUNDS

In this section we will provide upper bounds on the concavity deficit and give a proof of Theorem I.1. We will first introduce the notion of $f$-divergences.

Definition III.1. For a convex function $f$ satisfying $f(1) = 0$, and probability measures $\mu$ and $\nu$, with densities $u = \frac{d\mu}{d\gamma}$ and $v = \frac{d\nu}{d\gamma}$ with respect to a common reference measure $\gamma$, the $f$-divergence from $\mu$ to $\nu$ is
$$D_f(\mu\|\nu) := D_f(u\|v) := \int f\Big(\frac{u}{v}\Big)v\,d\gamma. \qquad (60)$$
Note that a common reference measure for measures $\mu$ and $\nu$ always exists, take $\frac{1}{2}(\mu + \nu)$ for instance, and the value of $D_f(\mu\|\nu)$ is independent of the choice of reference measure, as can be seen by comparing a reference measure to one with respect to which it has a density. When the value of a functional of a pair of probability distributions $\mu, \nu$ is given by (60) we will call the functional an $f$-divergence. An $f$-divergence satisfies the following: (i) non-negativity, $D_f(\mu\|\nu) \geq 0$, and (ii) the map $(\mu, \nu) \mapsto D_f(\mu\|\nu)$ is convex. We direct the reader to [34], [50], [51] for further background on $f$-divergences and their properties. When $f(x) = x\log x$, the divergence induced is the relative entropy.

This section is organized as follows. We first introduce the concept of skewing, which replaces the $f$-divergence from $\mu$ to $\nu$ by the $f$-divergence from a convex combination $(1-t)\mu + t\nu$ to $\nu$. Skewing provides a more regular version of the original divergence measure; for example, the Radon-Nikodym derivative of $\mu$ with respect to $(1-t)\mu + t\nu$ always exists even if the Radon-Nikodym derivative of $\mu$ with respect to $\nu$ may not, whereby the skew divergence is well defined even when the original divergence is not. Skew divergence as we develop it preserves important features of the original divergence. We first state elementary properties of the skew relative information, corresponding to skewing the relative entropy, with proofs given in an appendix, and then introduce a skew $\chi^2$-divergence which interpolates between the well known Neyman $\chi^2$ divergence and the Pearson $\chi^2$ divergence. We will pause to demonstrate that the class of $f$-divergences is stable under skewing and recover as a special case a recent result of Nielsen [47], that the generalized Jensen-Shannon divergence is an $f$-divergence. Then we establish several inequalities between the skew relative information and the introduced skew $\chi^2$ divergence. We will show in Theorem III.2 that the skew relative information can be controlled by the skew $\chi^2$ divergence, extending the classical bound of relative entropy by the Pearson $\chi^2$ divergence, and, using an argument due to Audenaert in the quantum setting [3], we show that the rate of decrease of the skew relative information with respect to the skewing parameter can be described exactly as a multiple of the skew $\chi^2$ divergence. Theorem III.3 also appropriates a quantum argument [3] to show that though neither the Neyman nor the Pearson divergence can be controlled by total variation, their skewed counterparts can be. We harness this bound alongside the differential relationship between the two skew divergences to bound the skew relative entropy by the total variation as well. As a brief aside we demonstrate that this bound is equivalent to a reverse Pinsker type inequality derived by Verdú [62], before using Theorem III.3 to give our proof of Theorem I.1. Finally, to conclude the section, we demonstrate that one may obtain the classical result of Lin [35] bounding the Jensen-Shannon divergence by total variation as a special case of Theorem I.1.

A. Skew Relative Information

We will consider the following generalization of the relative entropy due to Lee.

Definition III.2. [32] For probability measures $\mu$ and $\nu$ on a common set $Y$ and $t \in [0, 1]$, define their skew relative information
$$S_t(\mu\|\nu) = \int\log\frac{d\mu}{d(t\mu + (1-t)\nu)}\,d\mu.$$
In the case that $d\mu = u\,d\gamma$ and $d\nu = v\,d\gamma$ we will also write

$$S_t(u\|v) = S_t(\mu\|\nu).$$
We state some important properties of the skew relative information, with the proofs provided in the Appendix.

Proposition III.3. For probability measures $\mu$ and $\nu$ on a common set and $t \in [0, 1]$, the skew relative information satisfies the following properties.

1) $S_t(\mu\|\nu) = D(\mu\|t\mu + (1-t)\nu)$. In particular, $S_0(\mu\|\nu) = D(\mu\|\nu)$.
2) $S_t(\mu\|\nu) = 0$ iff $t = 1$ or $\mu = \nu$.
3) For $0 < t < 1$ the Radon-Nikodym derivative of $\mu$ with respect to $t\mu + (1-t)\nu$ does exist, and $S_t(\mu\|\nu) \leq -\log t$.
4) $S_t(\mu\|\nu)$ is convex, non-negative, and decreasing in $t$.
5) $S_t$ is an $f$-divergence with $f(x) = x\log(x/(tx + (1-t)))$.
Motivated by the fact that the act of skewing the relative entropy preserves its status as an $f$-divergence, we introduce the act of skewing an $f$-divergence.

Definition III.4. Given a convex function $f : [0, \infty) \to \mathbb{R}$ with $f(1) = 0$ and its associated divergence $D_f(\cdot\|\cdot)$, define the $r, t$-skew of $D_f$ by

$$S_{f,r,t}(\mu\|\nu) := D_f\big(r\mu + (1-r)\nu\,\|\,t\mu + (1-t)\nu\big). \qquad (61)$$

It can be shown that for $t \in (0, 1)$, $S_{f,r,t}(\mu\|\nu) < \infty$.

Theorem III.1. The class of $f$-divergences is stable under skewing. That is, if $f$ is convex, satisfying $f(1) = 0$, then
$$\hat{f}(x) := (tx + (1-t))\,f\Big(\frac{rx + (1-r)}{tx + (1-t)}\Big) \qquad (62)$$
is convex with $\hat{f}(1) = 0$ as well, so that the $r, t$-skew of $D_f$ defined in (61) is an $f$-divergence as well.

Proof. If $\mu$ and $\nu$ have respective densities $u$ and $v$ with respect to a reference measure $\gamma$, then $r\mu + (1-r)\nu$ and $t\mu + (1-t)\nu$ have densities $ru + (1-r)v$ and $tu + (1-t)v$, so that
$$S_{f,r,t}(\mu\|\nu) = \int f\Big(\frac{ru + (1-r)v}{tu + (1-t)v}\Big)(tu + (1-t)v)\,d\gamma \qquad (63)$$
$$= \int f\Big(\frac{r\frac{u}{v} + (1-r)}{t\frac{u}{v} + (1-t)}\Big)\Big(t\frac{u}{v} + (1-t)\Big)v\,d\gamma \qquad (64)$$
$$= \int\hat{f}\Big(\frac{u}{v}\Big)v\,d\gamma. \qquad (65)$$
Since $\hat{f}(1) = f(1) = 0$, we need only prove $\hat{f}$ convex. For this, recall that the conic transform $g$ of a convex function $f$, defined by $g(x, y) = yf(x/y)$ for $y > 0$, is convex, since
$$\frac{y_1 + y_2}{2}\,f\Big(\frac{x_1 + x_2}{2}\Big/\frac{y_1 + y_2}{2}\Big) = \frac{y_1 + y_2}{2}\,f\Big(\frac{y_1}{y_1 + y_2}\frac{x_1}{y_1} + \frac{y_2}{y_1 + y_2}\frac{x_2}{y_2}\Big) \qquad (66)$$
$$\leq \frac{y_1}{2}f(x_1/y_1) + \frac{y_2}{2}f(x_2/y_2). \qquad (67)$$

Our result follows since $\hat{f}$ is the composition of the affine function $A(x) = (rx + (1-r),\,tx + (1-t))$ with the conic transform of $f$,
$$\hat{f}(x) = g(A(x)). \qquad (68)$$

Let us note that in the special case that $D_f$ corresponds to relative entropy, Theorem III.1 demonstrates that the "generalized Jensen-Shannon divergence" developed recently by Nielsen (see [47, Definition 1]) is in fact an $f$-divergence, as it is defined as a weighted sum of $r_i, t$-skew divergences associated to the relative entropy.

Corollary III.5. For a vector $\alpha \in [0, 1]^k$ and $w_i > 0$ such that $\sum_i w_i = 1$, the $(\alpha, w)$-Jensen-Shannon divergence between two densities $p, q$ defined by
$$JS^{\alpha, w}(p : q) := \sum_{i=1}^k w_i D\big((1 - \alpha_i)p + \alpha_i q\,\|\,(1 - \bar{\alpha})p + \bar{\alpha}q\big) \qquad (69)$$
with $\bar{\alpha} = \sum_i w_i\alpha_i$, is an $f$-divergence.

Proof. By Theorem III.1 the mapping $(p, q) \mapsto D((1 - \alpha_i)p + \alpha_i q\,\|\,(1 - \bar{\alpha})p + \bar{\alpha}q)$ is an $f$-divergence, and the result follows since the class of $f$-divergences is stable under non-negative linear combinations.

We will only further pursue the case that $r = 1$, and write $S_{f,t}(\mu\|\nu) := S_{f,1,t}(\mu\|\nu)$. We now skew Pearson's $\chi^2$ divergence, which we recall below.

Definition III.6. [49] For measures $\mu$ and $\nu$ absolutely continuous with respect to a common reference measure $d\gamma$, so that $d\mu = u\,d\gamma$ and $d\nu = v\,d\gamma$, define
$$\chi^2(\mu; \nu) = \int\Big(1 - \frac{d\mu}{d\nu}\Big)^2 d\nu = \int\frac{(u - v)^2}{v}\,d\gamma,$$
and $\chi^2(\mu; \nu) = \infty$ when $\frac{d\mu}{d\nu}$ does not exist.

Definition III.7. For $t \in [0, 1]$ and measures $\mu$ and $\nu$, define the skew $\chi_t^2$ via:
$$\chi_t^2(\mu; \nu) = \int\frac{\big(1 - \frac{d\mu}{d\nu}\big)^2}{t\frac{d\mu}{d\nu} + (1 - t)}\,d\nu.$$
Formally, the $\chi^2$ divergence of Neyman [44] differs only by a notational convention, $\chi_N^2(\nu; \mu) = \chi^2(\mu; \nu)$; see [34] for a more modern treatment and [17] for background on the distance's significance in statistics. Now let us present a skew $\chi^2$ divergence which interpolates the Pearson and Neyman $\chi^2$ divergences.

Proposition III.8. The skew $\chi_t^2$ divergence satisfies the following.
1) When $d\mu = u\,d\gamma$ and $d\nu = v\,d\gamma$ with respect to some reference measure $\gamma$, then
$$\chi_t^2(\mu; \nu) = \int\frac{(u - v)^2}{tu + (1 - t)v}\,d\gamma.$$
2) For $t \neq 1$, $(1 - t)^2\chi_t^2(\mu; \nu) = \chi^2(\mu; t\mu + (1 - t)\nu)$.
3) $\chi_t^2(\mu; \nu) = \chi_{1-t}^2(\nu; \mu)$.
4) $\chi_t^2$ is an $f$-divergence with $f(x) = (x - 1)^2/(1 + t(x - 1))$.
5) The skew $\chi_t^2$ interpolates the divergences of Neyman and Pearson: $\chi_0^2(\mu; \nu) = \chi^2(\mu; \nu)$ and $\chi_1^2(\mu; \nu) = \chi_N^2(\mu; \nu)$.

Proof. For (1), the formula follows in the case $\mu \ll \nu$ from the fact that on the support of $\nu$, $\frac{d\mu}{d\nu} = \frac{u}{v}$, so that
$$\chi_t^2(\mu; \nu) = \int\frac{\big(1 - \frac{u}{v}\big)^2}{t\frac{u}{v} + (1 - t)}\,v\,d\gamma = \int\frac{(u - v)^2}{tu + (1 - t)v}\,d\gamma.$$

To prove (2), we use (1). Note that $d\mu = u\,d\gamma$ and $d\nu = v\,d\gamma$ imply that $d(t\mu + (1-t)\nu) = (tu + (1-t)v)\,d\gamma$, so that
$$\chi^2(\mu; t\mu + (1-t)\nu) = \int\frac{(u - (tu + (1-t)v))^2}{tu + (1-t)v}\,d\gamma = (1-t)^2\int\frac{(u - v)^2}{tu + (1-t)v}\,d\gamma = (1-t)^2\chi_t^2(\mu; \nu).$$
It is immediate from (1) that (3) holds. That $\chi_t^2$ is an $f$-divergence follows from (2) and Theorem III.1, so that (4) follows. To prove (5), note that $\chi_0^2(\mu; \nu) = \chi^2(\mu; \nu)$ is immediate from the definition. Applying this and the symmetry from (3) we have $\chi_1^2(\mu; \nu) = \chi_0^2(\nu; \mu) = \chi^2(\nu; \mu) = \chi_N^2(\mu; \nu)$.

The skew divergence and skew $\chi^2$ inherit bounds from the $t = 0$ case, and enjoy an interrelation unique to the skew setting, as described below.

Theorem III.2. For probability measures $\mu$ and $\nu$ and $t \in (0, 1)$,
$$S_t(\mu\|\nu) \leq (1 - t)^2\chi_t^2(\mu; \nu) \qquad (70)$$
and
$$\frac{d}{dt}S_t(\mu\|\nu) = (t - 1)\chi_t^2(\mu; \nu). \qquad (71)$$
Proof. Recall that when $t = 0$, the concavity of the logarithm bounds $\log x$ by its tangent line $x - 1$, so that
$$\int\log\frac{d\mu}{d\nu}\,d\mu \leq \int\Big(\frac{d\mu}{d\nu} - 1\Big)d\mu \qquad (72)$$
$$= \int\Big(\frac{d\mu}{d\nu} - 1\Big)^2 d\nu, \qquad (73)$$
giving the classical bound,
$$D(\mu\|\nu) \leq \chi^2(\mu; \nu). \qquad (74)$$
Applying (74) to the identities of Proposition III.3(1) and Proposition III.8(2) gives

$$S_t(\mu\|\nu) = D(\mu\|t\mu + (1-t)\nu) \qquad (75)$$
$$\leq \chi^2(\mu; t\mu + (1-t)\nu) \qquad (76)$$
$$= (1 - t)^2\chi_t^2(\mu; \nu). \qquad (77)$$
Applying the identity $(1-t)(y - 1) = y - (ty + (1-t))$ we have
$$(1-t)\chi_t^2(\mu; \nu) = \int\frac{(\frac{d\mu}{d\nu} - 1)\big(\frac{d\mu}{d\nu} - (t\frac{d\mu}{d\nu} + (1-t))\big)}{t\frac{d\mu}{d\nu} + (1-t)}\,d\nu \qquad (78)$$
$$= \int\frac{\frac{d\mu}{d\nu} - 1}{t\frac{d\mu}{d\nu} + (1-t)}\,d\mu - \int\Big(\frac{d\mu}{d\nu} - 1\Big)d\nu \qquad (79)$$
$$= \int\frac{\frac{d\mu}{d\nu} - 1}{t\frac{d\mu}{d\nu} + (1-t)}\,d\mu. \qquad (80)$$
Observing the expression
$$S_t(\mu\|\nu) = \int\Big(\log\frac{d\mu}{d\nu} - \log\Big(t\frac{d\mu}{d\nu} + (1-t)\Big)\Big)d\mu,$$
we compute directly,
$$\frac{d}{dt}S_t(\mu\|\nu) = -\int\frac{\frac{d\mu}{d\nu} - 1}{t\frac{d\mu}{d\nu} + (1-t)}\,d\mu. \qquad (81)$$
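The inequality (70) is easy to probe numerically. Below is a small self-contained Python sketch (ours, written only to illustrate Theorem III.2 on discrete distributions; not code from the paper) comparing $S_t(\mu\|\nu)$ with $(1-t)^2\chi_t^2(\mu;\nu)$ over a grid of skew parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.dirichlet(np.ones(6))   # density of mu w.r.t. counting measure
v = rng.dirichlet(np.ones(6))   # density of nu

def skew_rel_info(u, v, t):
    """S_t(mu||nu) = D(mu || t*mu + (1-t)*nu), Proposition III.3(1)."""
    mix = t * u + (1 - t) * v
    return np.sum(u * np.log(u / mix))

def skew_chi2(u, v, t):
    """chi_t^2(mu;nu) = sum (u-v)^2 / (t*u + (1-t)*v), Proposition III.8(1)."""
    return np.sum((u - v) ** 2 / (t * u + (1 - t) * v))

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    lhs = skew_rel_info(u, v, t)
    rhs = (1 - t) ** 2 * skew_chi2(u, v, t)
    print(f"t={t:.1f}:  S_t = {lhs:.4f}  <=  (1-t)^2 chi_t^2 = {rhs:.4f}")
```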

Recall the total variation norm of a signed measure $\gamma$ to be $\sup_A|\gamma(A)|$, and adopting the notation $x_+ = \max\{x, 0\}$, then
$$\|\mu - \nu\|_{TV} = \int\Big(\frac{d\mu}{d\nu} - 1\Big)_+ d\nu.$$

Theorem III.3. For $\mu$ and $\nu$, and $t \in (0, 1)$,
$$\chi_t^2(\mu; \nu) \leq \frac{\|\mu - \nu\|_{TV}}{t(1 - t)} \qquad (82)$$

$$S_t(\mu\|\nu) \leq -\log t\,\|\mu - \nu\|_{TV}. \qquad (83)$$
Proof. From the identity in (78) we have
$$\chi_t^2(\mu; \nu) = \frac{1}{1 - t}\int\frac{\frac{d\mu}{d\nu}\big(\frac{d\mu}{d\nu} - 1\big)}{t\frac{d\mu}{d\nu} + (1 - t)}\,d\nu \leq \frac{1}{t(1 - t)}\int\Big(\frac{d\mu}{d\nu} - 1\Big)_+ d\nu = \frac{\|\mu - \nu\|_{TV}}{t(1 - t)}.$$
Define the function $\varphi(\lambda) := S_{e^{-\lambda}}(\mu\|\nu)$ for $\lambda \in [0, \infty)$ and note that $\varphi(0) = D(\mu\|\mu) = 0$. Thus we can write
$$\varphi(\lambda) = \int_0^\lambda\frac{d}{ds}S_{e^{-s}}(\mu\|\nu)\,ds = \int_0^\lambda e^{-s}(1 - e^{-s})\chi^2_{e^{-s}}(\mu; \nu)\,ds.$$
Applying (82) gives
$$S_{e^{-\lambda}}(\mu\|\nu) = \varphi(\lambda) \leq \int_0^\lambda\|\mu - \nu\|_{TV}\,ds = \lambda\|\mu - \nu\|_{TV}.$$
The substitution $t = e^{-\lambda}$ gives (83).

Observe that (83) of Theorem III.3 recovers a reverse Pinsker inequality due to Verdú [62].

Corollary III.9 ([62, Theorem 7]). For probability measures $\mu$ and $\gamma$ such that $\frac{d\mu}{d\gamma} \leq \frac{1}{\beta}$ with $\beta \in (0, 1)$,
$$\|\mu - \gamma\|_{TV} \geq \frac{1 - \beta}{\log\frac{1}{\beta}}\,D(\mu\|\gamma).$$
Proof. The hypothesis implies that $\nu = \frac{\gamma - \beta\mu}{1 - \beta}$ is a probability measure satisfying $\gamma = \beta\mu + (1 - \beta)\nu$. Applying (83),
$$D(\mu\|\gamma) = S_\beta(\mu\|\nu) \leq -\log\beta\,\|\mu - \nu\|_{TV} = \frac{-\log\beta}{1 - \beta}\,\|\mu - \gamma\|_{TV}.$$

It is easily seen that the two results, (83) and Theorem 7 of [62], are actually equivalent. In contrast, the proof of (83) hinges on foundational properties of the divergence metrics, while Verdú leverages the monotonicity of $x\ln x/(x - 1)$ for $x > 1$.

Proof of Theorem I.1. From Proposition II.2,

$$h\Big(\sum_i p_i f_i\Big) = \sum_i p_i h(f_i) + \sum_i p_i D(f_i\|f) \qquad (84)$$
$$= \sum_i p_i h(f_i) + \sum_i p_i S_{p_i}(f_i\|\tilde{f}_i). \qquad (85)$$
By Theorem III.3, $S_{p_i}(f_i\|\tilde{f}_i) \leq \log\frac{1}{p_i}\|f_i - \tilde{f}_i\|_{TV}$. Applying Hölder's inequality completes the proof,
$$\sum_i p_i S_{p_i}(f_i\|\tilde{f}_i) \leq \sum_i p_i\log\frac{1}{p_i}\|f_i - \tilde{f}_i\|_{TV} \qquad (86)$$
$$\leq T\sum_i p_i\log\frac{1}{p_i}, \qquad (87)$$
where we recall $T := \sup_i\|f_i - \tilde{f}_i\|_{TV}$.

Since the total variation distance between any two probability measures is bounded above by 1, this is indeed a sharpening of (2). Expressed in terms of random variables it is

$$h_\gamma(Z) \leq T\,H(X) + h_\gamma(Z|X), \qquad (88)$$
which, when $\gamma$ is a Haar measure and we apply it to $\tilde{Z} = X + Z$, gives

$$h_\gamma(X + Z) \leq T\,H(X) + h_\gamma(Z|X), \qquad (89)$$
while the right hand side of (89) reduces further to

$$h_\gamma(X + Z) \leq T\,H(X) + h_\gamma(Z) \qquad (90)$$
in the case that $X$ and $Z$ are independent.

Note that the quantity $h_\gamma(\sum_i p_i f_i) - \sum_i p_i h_\gamma(f_i) = \sum_i p_i D(f_i\|f)$ can be considered a generalized Jensen-Shannon divergence, as in the case that $n = 2$ and $p_1 = p_2 = \frac{1}{2}$ this is exactly the Jensen-Shannon divergence.

Definition III.10. For probability measures $\mu$ and $\nu$ define the Jensen-Shannon divergence,
$$JSD(\mu\|\nu) = \frac{1}{2}\Big(D\big(\mu\|2^{-1}(\mu + \nu)\big) + D\big(\nu\|2^{-1}(\mu + \nu)\big)\Big). \qquad (91)$$
Theorem I.1 recovers the classical bound of the Jensen-Shannon divergence by the total variation, due to Lin; see also [58], [59] for other proofs.

Corollary III.11. [35] For $\mu$ and $\nu$ probability measures,

$$JSD(\mu\|\nu) \leq \|\mu - \nu\|_{TV}\,\log 2.$$

Proof. Apply Theorem I.1 to the Jensen-Shannon divergence, and observe that $T = \|\mu - \nu\|_{TV}$ in the case of two summands.
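A quick numerical illustration of Corollary III.11 (again a sketch of ours, not from the paper), checking $JSD(\mu\|\nu) \leq \|\mu - \nu\|_{TV}\log 2$ on random discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

def jsd(u, v):
    """Jensen-Shannon divergence in nats, eq. (91)."""
    m = (u + v) / 2
    return 0.5 * np.sum(u * np.log(u / m)) + 0.5 * np.sum(v * np.log(v / m))

def tv(u, v):
    """Total variation distance, 0.5 * sum |u - v|."""
    return 0.5 * np.sum(np.abs(u - v))

for _ in range(5):
    u, v = rng.dirichlet(np.ones(8)), rng.dirichlet(np.ones(8))
    print(f"JSD = {jsd(u, v):.4f}  <=  TV*log2 = {tv(u, v) * np.log(2):.4f}")
```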

IV. LOWER BOUNDS

Let us state our assumptions and notation for this section.
1) $X$ is a random variable taking values in a countable space $\mathcal{X}$, such that for $i \in \mathcal{X}$, $\mathbb{P}(X = i) = p_i$.
2) $Z$ is an $\mathbb{R}^d$ valued random variable, with conditional densities $f_i$ satisfying

$$\mathbb{P}(Z \in A|X = i) = \int_A f_i(z)\,dz = \mathbb{P}(T_i(W) \in A) \qquad (92)$$
for $T_i$ a $\sqrt{\tau}$ bi-Lipschitz function, where $W$ is a spherically symmetric log-concave random vector with density $\varphi$.
3) There exist $\lambda, M > 0$ such that for any $i, j$,

$$\#\{k : T_{kj}(B_\lambda) \cap T_{ij}(B_\lambda) \neq \emptyset\} \leq M, \qquad (93)$$
with $\#$ denoting cardinality and $T_{ij} := T_i^{-1} \circ T_j$.

Our assumption that $W$ is log-concave and spherically symmetric is equivalent to $W$ possessing a density $\varphi$ that is spherically symmetric, in the sense that $\varphi(x) = \varphi(y)$ for $|x| = |y|$, and log-concave, in the sense that $\varphi((1-t)x + ty) \geq \varphi^{1-t}(x)\varphi^t(y)$ holds for $t \in [0, 1]$ and $x, y \in \mathbb{R}^d$. By the spherical symmetry of $\varphi$, there exists $\psi : [0, \infty) \to [0, \infty)$ such that $\varphi(x) = \psi(|x|)$. Note that by Rademacher's theorem, Lipschitz continuous functions are almost everywhere differentiable, and bi-Lipschitz functions are necessarily invertible. Thus, using this and Proposition II.1, it follows that $Z$ has a density given by the following expression

$$f(z) = \sum_i p_i f_i(z) = \sum_i p_i\,\varphi(T_i^{-1}(z))\,\det\big((T_i^{-1})'(z)\big). \qquad (94)$$
Note that $T_i$ being $\sqrt{\tau}$ bi-Lipschitz implies $T_i^{-1}$ is $\sqrt{\tau}$ bi-Lipschitz as well, thus $T_{ij}$ is $\tau$-bi-Lipschitz; hence, after potentially adjusting $T_{ij}$ on a set of measure zero, we have $\frac{1}{\tau} \leq \|T'_{ij}(z)\| \leq \tau$. Under these assumptions we will prove the following generalization of Theorem I.2.

Theorem IV.1. For $X$ and $Z$ satisfying the assumptions of Section IV,
$$h(Z) - h(Z|X) \geq H(X) - \tilde{C}(W), \qquad (95)$$
where $\tilde{C}$ is the following function dependent heavily on the tail behavior of $|W|$,

$$\tilde{C}(W) = (M - 1)\big(1 - T(\lambda\tau)\big) + T(\lambda)\big(M + h(W)\big) + T^{1/2}(\lambda)\big(\sqrt{d} + K(\varphi)\big) \qquad (96)$$

with
$$K(\varphi) := \log\Big[\tau^d M\Big(\|\varphi\|_\infty + \big(\tfrac{3}{\varepsilon}\big)^d\omega_d^{-1}\Big)\Big]\,\mathbb{P}^{1/2}(|W| > \lambda) + d\left(\int_{B_\lambda^c}\varphi(w)\log^2\Big(1 + \frac{\varepsilon\tau + \tau^2|w|}{\lambda}\Big)dw\right)^{1/2} \qquad (97)$$
where $\omega_d$ denotes the volume of the $d$-dimensional unit ball and $B_\lambda^c$ denotes the complement of $B_\lambda$ in $\mathbb{R}^d$. Note that when $M = 1$, Theorem IV.1 reduces to Theorem I.2. Additionally, observe that $\log^2(x)$ is a concave function for $x \geq e$. If one writes $m := \max\{e, 1 + \tau\}$ then by Jensen's inequality,

$$\int_{B_\lambda^c}\varphi(w)\log^2\Big(1 + \tau + \frac{\tau^2|w|}{\lambda}\Big)dw \leq \int_{\mathbb{R}^d}\varphi(w)\log^2\Big(m + \frac{\tau^2|w|}{\lambda}\Big)dw \qquad (98)$$
$$\leq \log^2\Big(m + \frac{\tau^2\int\varphi(w)|w|\,dw}{\lambda}\Big). \qquad (99)$$
Thus we can further bound
$$K(\varphi) \leq \log\Big[\tau^d M\Big(\|\varphi\|_\infty + \big(\tfrac{3}{\varepsilon}\big)^d\omega_d^{-1}\Big)\Big]\,\mathbb{P}^{1/2}(|W| > \lambda) + d\log\Big(m + \frac{\tau^2\int\varphi(w)|w|\,dw}{\lambda}\Big). \qquad (100)$$
We now derive some implications of our assumptions on $T_{ji}$: a partitioning result on $T_{ji}(B_\lambda)$ based on the axiom of choice, and, for the reader's convenience, some elementary consequences of the boundedness of the derivatives of $T_{ij}$.

Proposition IV.1. For $T_{ij} = T_i^{-1} \circ T_j$,

Tijp0q ` Bλ{τ Ď TijpBλq Ď Tijp0q ` Bλτ (101) That

#tk : TjipBλq X TjkpBλqu ď M (102) implies that any collection X Ď N has a partition X1,..., Xn, with n ď M such that x1, x2 P Xk implies Tjx1 pBλq X

Tjx2 pBλq “ H. ´1 Proof. To prove (101), observe that Tij τ-bi-Lipschitz implies, that Tij exists and is τ-bi-Lipschitz as well. Observing that τ-Lipschitz implies,

|Tijpxq ´ Tijp0q| ď τ|x|. (103) ´1 If we take |x| ă λ, this inequality shows TijpBλq Ď Tijp0q`Bλτ . For the other inclusion, observe that since Tij is τ-Lipschitz as well, ´1 ´1 ´1 |Tij pT p0q ` xq| “ |Tij pTijp0q ` xq ´ Tij pTijp0qq| (104) ď τ|x|. (105) ´1 Taking |x| ă λ, this shows that Tij pTijp0q ` Bλ{τ q Ď Bλ, which the desired inclusion follows from.

Now we prove the existence of the partitioning. If M “ 1, the result is obvious, and we proceed by induction. Choose X1 to be a maximal subset of X such that tTjxuxPX1 are disjoint. For x0 P X ´ X1,

#tk P X ´ X1 : TjipBλq X TjkpBλq ‰ Hu ď M ´ 1. (106)

Indeed, for every $k \in \mathcal{X}$, $T_{jk}(B_\lambda)$ intersects at most $M$ others, and since $\mathcal{X}_1$ is maximal and $k \notin \mathcal{X}_1$, $T_{jk}(B_\lambda)$ must intersect one of the $T_{jx}(B_\lambda)$ for $x \in \mathcal{X}_1$, which leaves $T_{jk}(B_\lambda)$ to intersect at most $M - 1$ of $\{T_{ji}(B_\lambda)\}_{i \in \mathcal{X} - \mathcal{X}_1}$. By induction the result follows.

We will need the following concentration result for the information content of a log-concave vector [22], [45], [63].

Theorem IV.2. For a log-concave density function $\varphi$ on $\mathbb{R}^d$,
$$\int\Big(\log\frac{1}{\varphi(x)} - h(\varphi)\Big)^2\varphi(x)\,dx \leq d, \qquad (107)$$
where $h(\varphi)$ is the entropy of the density $\varphi$. See [21] for a generalization to convex measures, which can be heavy-tailed. The following lemma upper bounds the sum of a sequence whose values are obtained by evaluating a spherically symmetric density at well spaced points.

Lemma IV.2. If $\phi$ is a density on $\mathbb{R}^d$, not necessarily log-concave, given by $\phi(x) = \psi(|x|)$ for $\psi : [0, \infty) \to [0, \infty)$ decreasing, $\lambda > 0$, and a discrete set $\mathcal{X} \subseteq \mathbb{R}^d$ admitting a partition $\mathcal{X}_1, \ldots, \mathcal{X}_M$ such that distinct $x, y \in \mathcal{X}_k$ satisfy $|x - y| \geq 2\lambda$, then there exists an absolute constant $c \leq 3$ such that
$$\sum_{x \in \mathcal{X}}\phi(x) \leq M\Big(\|\phi\|_\infty + \Big(\frac{c}{\lambda}\Big)^d\omega_d^{-1}\Big), \qquad (108)$$
where $\omega_d = |\{x : |x| \leq 1\}|_d$, and we recall $|\cdot|_d$ is the $d$-dimensional Lebesgue volume. In particular, if for all $x_0 \in \mathcal{X}$

$$\#\{x \in \mathcal{X} : |x - x_0| < 2\lambda\} \leq M, \qquad (109)$$
then (108) holds.
Note that when $M = 1$ and $\phi$ is the uniform distribution on a $d$-dimensional ball, this reduces to a sphere packing bound,
$$\#\{\text{disjoint } \lambda\text{-balls contained in } B_{R+\lambda}\} \leq \Big(1 + \frac{R}{\lambda}\Big)^d, \qquad (110)$$
from which it follows, due to classical bounds of Minkowski, that $c \geq \frac{1}{2}$.
For the proof below we use the notation $B_\lambda(x) = \{w \in \mathbb{R}^d : |w - x| \leq \lambda\}$ and we identify $B_\lambda \equiv B_\lambda(0)$.

Proof. Let us first see that it is enough to prove the result when $M = 1$:
$$\sum_{x \in \mathcal{X}}\phi(x) = \sum_{k=1}^M\sum_{x \in \mathcal{X}_k}\phi(x) \qquad (111)$$
$$\leq \sum_{k=1}^M\Big(\|\phi\|_\infty + \Big(\frac{c}{\lambda}\Big)^d\omega_d^{-1}\Big) \qquad (112)$$
$$= M\Big(\|\phi\|_\infty + \Big(\frac{c}{\lambda}\Big)^d\omega_d^{-1}\Big), \qquad (113)$$
where (112) follows if we have the result when $M = 1$. We proceed in the case that $M = 1$ and observe that $\psi$ non-increasing enables the following Riemann sum bound,

1 “ φpxqdx (114)

ż d d d ě ψpkλqωdλ pk ` 1q ´ k , (115) k“0 ÿ ` ˘ d d d where ωdλ ppk ` 1q ´ k q is the volume of the annulus Bpk`1qλ ´ Bkλ. Define

Λk :“ x P X : |x| P rkλ, pk ` 1qλq , (116) then ( 8 φpxq “ φpxq (117)

xPX k“0 xPΛk ÿ ÿ8 ÿ ď #Λkψpkλq, (118) k“0 ÿ as ψ is non-increasing. Let us now bound #Λk. Using the assumption that any two elements x and y in X satisfy |x´y| ě 2λ,

| tx ` Bλu| “ #Λk|Bλ| (119)

xPΛk ď d “ #Λkωdλ , (120) so that it suffices to bound | YxPΛk tx ` Bλu|. Observe that we also have YxPΛk tx ` Bλu contained in an annulus,

YxPΛk Bλpxq Ď tx : |x| P rpk ´ 1qλ, pk ` 2qλqu, (121) which combined with (120) gives d #Λkωdλ ď |tx : |x| P rpk ´ 1qλ, pk ` 2qλqu|d (122) d d d “ ωdλ pk ` 2q ´ pk ´ 1q , (123) ` ˘ 15

so that d d #Λk ď pk ` 2q ´ pk ´ 1q . (124) Note the following bound, for k ě 1 pk ` 2qd ´ pk ´ 1qd ď 3d pk ` 1qd ´ kd . (125)

Indeed, by the mean value theorem, there exists x0 P rk ´ 1, k ` 2s` ˘ d d d´1 pk ` 2q ´ pk ´ 1q “ 3dx0 (126) ď 3dpk ` 2qd´1 (127) and there exists y0 P rk, k ` 1s such that, d d d´1 pk ` 1q ´ k “ dy0 (128) ě dkd´1. (129) Thus our bound follows from the obvious fact that for k ě 1 pk ` 2qd´1 ď 3d´1kd´1. (130) Compiling the above, pk ` 2qd ´ pk ´ 1qd ď 3dpk ` 2qd´1 (131) ď 3d3d´1kd´1 (132) ď 3d pk ` 1qd ´ kd . (133) Thus for k ě 1, (124) and (125) give, ` ˘ d d d #Λk ď 3 pk ` 1q ´ k . (134) Applying this inequality to (117) gives ` ˘ 8 φpxq ď ψpλkq (135) xPX k“0 xPΛ ÿ ÿ ÿk ď }φ}8 ` ψpλkq#Λk (136) k“1 ÿ d d d ď }φ}8 ` ψpλkq3 pk ` 1q ´ k (137) k“1 ÿ ` ˘ 3 d ď }φ} ` ω´1 , (138) 8 d λ ˆ ˙ where (137) follows from the fact that #Λ0 ď 1 (any x P Λ0 has 0 P tx ` Bλu), and the last inequality follows from the Riemann sum bound (115). Let us now show that any X satisfying (109) necessarily satisfy (108), by show that X satisfying (109) can be partitioned M into M subsets tX1,..., XM u such Yk“1Xk “ X and x, y P Xk implies |x ´ y| ě 2λ for x ‰ y. Choose X1 to be a subset of X maximal with respect to the property x, y P X1 implies |x ´ y| ě 2λ. We note that such a maximal subset necessarily 1 exists by Zorn’s Lemma with the partial order given by set theoretic inclusion. Now suppose that x0 P X “ X ´ X1. Then if follows that 1 #tx P X : |x0 ´ x| ă 2λu ď M ´ 1. (139) 1 1 Indeed, as x0 R X1 we have from the maximality of X1, that there exists a x0 P X1 such that }x0 ´ x0} ď 2λ. Note 1 1 that the cardinality of the set tx P X : }x ´ x0} ď 2λu is at most M with x0 R X :“ X ´ X1 in the set. It follows that 1 1 #tx P X : }x ´ x0} ď 2λu ď M ´ 1. Applying the result inductively on X our claim follows. ? Lemma IV.3. For Ti τ-bi-Lipshitz satisfying #tl : |Tljp0q ´ Tijp0q| ă 2λu ď M for all x, then for ε ą 0, λ ` ε ` 2τ|w| d #tl : |T pwq ´ T pwq| ă εu ď M . (140) ij lj λ ˆ ˙ In particular, ε ` λ d #tl : |T p0q ´ T p0q| ă εu ď M . (141) ij lj λ ˆ ˙ 16

Proof. We will first prove and then leverage (141). Note that there is nothing to prove when ε ď 2λ as (141) is weaker than the assumption. When 2λ ă ε choose (by Zorn’s lemma for instance) Λ to be a maximal subset of N such that for k P Λ, 1 1 |Tkjp0q ´ Tijp0q| ă ε and |Tkjp0q ´ Tk1jp0q| ě 2λ for k, k P Λ, with k ‰ k . By construction, for a fixed j, Tkjp0q ` Bλ are disjoint over k P Λ contained in Tijp0q ` Bλ`ε. Thus d λ ωd#Λ “ | YkPΛ tTkjp0q ` Bλu| (142)

ď |tTijp0q ` Bλ`εu| (143) d “ pλ ` εq ωd, (144) and we have the following bound on the cardinality of Λ, λ ` ε d #Λ ď . (145) λ ˆ ˙ Applying (145), the assumed cardinality bounds, and the maximality of Λ, which impliesYkPΛtTkjp0q ` Bλu contains every Tljp0q such that |Tljp0q ´ Tijp0q| ă ε, we have

#tl : |Tijp0q ´ Tljp0q| ă εu ď #tm : |Tmjp0q ´ Tkjp0q| ă λu (146) kPΛ ÿ λ ` ε d ď M . (147) λ ˆ ˙ Towards (140), by the mean value theorem, there exists t P r0, 1s such that 1 1 Tijpwq ´ Tljpwq “ Tijp0q ´ Tljp0q ` pTijptwq ´ Tljptwqqw. (148)

Note that if |Tijpwq ´ Tljpwq| ă ε, then 1 1 |Tijp0q ´ Tljp0q| “ |Tijpwq ´ Tljpwq ´ pTijptwq ´ Tljptwqqw| (149) 1 1 ď |Tijpwq ´ Tljpwq| ` |pTijptwq ´ Tljptwqqw| (150) ď ε ` 2τ|w|. (151) Thus

#tl : |Tijpwq ´ Tljpwq| ă εu ď #tl : |Tijp0q ´ Tljp0q| ă ε ` 2τ|w|u. (152) Applying (141), λ ` ε ` 2τ|w| d #tl : |T pwq ´ T pwq| ă εu ď M . (153) ij lj λ ˆ ˙

? ´1 Corollary IV.4. For a density φpwq “ ψp|w|q, with ψ decreasing, ε ą 0, Ti τ-bi-Lipschitz, Tij “ Ti ˝ Tj, such that there exists M ě 1 such that for any i

|tk : TijpBλq X TkjpBλq ‰ Hu| ď M, (154) then d ετ ` τ 2|w| 3 d φpT pwqqdetpT 1 pwqq ď τ dM 1 ` }φ} ` ω´1 . (155) ij ij λ 8 ε d i ˜ ¸ ÿ ˆ ˙ ˆ ˙ Proof. For any k suppose |Tijp0q ´ Tkjp0q| ă 2λ{τ, then tTijp0q ` Bλ{τ u X tTkjp0q ` Bλ{τ u ‰ H which by Proposition IV.1 implies TijpBλq X TkjpBλq ‰ H. Thus, #tk : |Tijp0q ´ Tkjp0q| ă 2λ{τu ď M. By Lemma IV.3, d 2λ τ ` 2ε ` 2τ|w| #tl : |Tijpwq ´ Tljpwq| ă 2εu ď M 2λ (156) ˜ τ ¸ d ετ ` τ 2|w| “ M 1 ` . (157) λ ˆ ˙ d ετ`τ 2|w| This shows that (109) holds for xi “ Tijpwq, λ “ ε and M in (109) identified with M 1 ` λ . It follows from Lemma IV.2 that ´ ¯ d ετ ` τ 2|w| 3 d φpT pwqq ď M 1 ` }φ} ` ω´1 . (158) ij λ 8 ε d i ˜ ¸ ÿ ˆ ˙ ˆ ˙ 17

1 d This, combined with }Tijpxq} ď τ giving the determinant bounds detpTijpwqq ď τ yields, 1 d φpTijpwqqdetpTijpwqq ď τ φpTijpwqq (159) i i ÿ ÿ d ετ ` τ 2|w| 3 d ď τ dM 1 ` }φ} ` ω´1 . (160) λ 8 ε d ˆ ˙ ˜ ˆ ˙ ¸

? Corollary IV.5. Consider Ti, τ-bi-Lipschitz and suppose there exists M, λ ą 0 such that #tk : TkjpBλqXTijpBλq ‰ Hu ď M holds for any i, j. Then for a spherically symmetric log-concave density ϕpxq “ ψp|x|q,

1 1 2 ϕpwq log pi ϕpTijpwqqdetpTijpwqq ď KpϕqP p|W | ą λq, (161) c Bλ ˜ i j ¸ ż ÿ ÿ where 1 2 2 d 3 ´1 1 2 τ |w| Kpϕq :“ log τ M }ϕ} ` ω 2 p|W | ą λq ` d ϕpwq log 1 ` τ ` dw . (162) 8 λ d P λ „ ˆ ˆ ˙ ˙ ˆżBλ „  ˙ Note that Kpϕq depends on only on the statistics of ϕ, its maximum and its a logrithmic scaling of its norm, which can be easily further bounded by concavity results. The proof does not leverage log-concavity, a stronger assumption than ψ non-increasing1, except to ensure that the relevant statistics are finite. Proof. Note that,

1 ϕpwq log pi ϕpTijpwqqdetpTijpwqq dw (163) c Bλ ˜ i j ¸ ż ÿ ÿ 2 d d d τ |w| 3 ´1 ď ϕpwq1Bc log τ M 1 ` τ ` }ϕ}8 ` ω dw (164) λ λ λ d ż ˜ ˆ ˙ ˜ ˆ ˙ ¸¸ 1 ď Kpϕq P 2 p|W | ą λq (165) where the first inequality is an application of Corollary IV.4 with ε “ λ and the second from Cauchy-Schwartz. ´1 ´1 1 Proof of Theorem IV.1. Using fipxq “ ϕpTi pxqqdetppTi q pxqq and applying the substitution x “ Tipwq we can write

j‰i pjfjpxq fipxq log 1 ` dx (166) p f pxq ż ˆ ř i i ˙ ´1 ´1 1 pjϕpTj pxqqdetppTj q pxqq “ ϕpT ´1pxqqdetppT ´1q1pxqq log 1 ` j‰i dx (167) i i p ϕpT ´1pxqqdetppT ´1q1pxqq ż ˜ ř i i i ¸ ´1 ´1 1 1 pjϕpTj pTipwqqqdetppTj q pTipwqqqdetpTi pxqq “ ϕpwq log 1 ` j‰i dw (168) p ϕpwq ż ˜ ř i ¸ 1 pjϕpTjipwqqdetpTjipwqq “ ϕpwq log 1 ` j‰i dw (169) p ϕpwq ż ˜ ř i ¸ Thus, applying Jensen’s inequality

j‰i pjfjpxq pi fipxq log 1 ` dx (170) p f pxq i ż ˆ ř i i ˙ ÿ 1 j‰i pjϕpTjipwqqdetpTjipwqq “ pi ϕpwq log 1 ` dw (171) p ϕpwq i ż ˜ ř i ¸ ÿ 1 pjϕpTjipwqqdetpTjipwqq ď ϕpwq log 1 ` i j‰i dw (172) ϕpwq ż ˜ ř ř ¸ 1 pj ϕpTjipwqqdetpTjipwqq “ ϕpwq log 1 ` j i‰j dw (173) ϕpwq ż ˜ ř ř ¸

1 ´x`x Under spherical symmetry and log-concavity }ϕ}8 “ ϕp0q. Indeed, ϕp0q “ ϕp 2 q ě ϕp´xqϕpxq “ ϕpxq. Using log-concavity again for t P p0, 1q, ψpt|x|q “ ϕpp1 ´ tq0 ` txq ě ϕ1´tp0qϕtpxq ě ϕpxq “ ψp|x|q. Thus it follows that ψ is non-increasing. a 18

We will split the integral in to two pieces. Using logp1 ` xq ď x on Bλ 1 pj ϕpTjipwqqdetpTjipwqq ϕpwq log 1 ` j i‰j dw (174) ϕpwq żBλ ˜ ř ř ¸ 1 ď pj ϕpTjipwqqdetpTjipwqqdw (175) Bλ j i‰j ż ÿ ÿ 1 “ pj ϕpTjipwqqdetpTjipwqqdw (176) j i‰j Bλ ÿ ÿ ż

“ pj ϕpxqdx (177) j ˜i‰j TjipBλq ¸ ÿ ÿ ż

“ pj ϕpxqdx ` ϕpxqdx (178) ¨ ˛ j i j:T B B TjipBλq i:T B B TjipBλq ÿ t ‰ jip ÿλqX λ‰Hu ż t jip λÿqX λ“Hu ż ˝ ‚

ď pj ϕpxqdx ` M ϕpxqdx (179) ¨¨ ˛ c ˛ j i j:T B B Tjip0q`Bλτ Bλ ÿ t ‰ jip ÿλqX λ‰Hu ż ż ˝˝ ‚ ‚ ď pj pM ´ 1q ϕpxqdx ` M ϕpxqdx (180) c j ˜ Bλτ Bλ ¸ ÿ ż ż “ pM ´ 1qPp|W | ď λτq ` MPp|W | ą λq (181) where inequality (179) follows from Proposition IV.1 and inequality (180) follows another application of Proposition IV.1 and the fact that the map spxq “ x`B ϕpzqdz is maximized at 0. To see this, observe that s can be realized as the convolution λτ 1 of two spherically symmetric unimodal functions, explicitly spxq “ ϕ ˚ Bλτ pxq. Since the class of such functions is stable under convolution, see for instanceş [33, Proposition 8], s is unimodal and spherically symmetric which obviously implies sp0q ě spxq for all x. Using the fact that Tiipwq “ w, 1 pj ϕpTjipwqqdetpTjipwqq ϕpwq log 1 ` j i‰j dw (182) Bc ϕpwq ż λ ˜ ř ř ¸

1 “ ϕpwq log pj ϕpTjipwqqdetpTjipwqq dw ´ ϕpwq log ϕpwqdw (183) c c Bλ ˜ j i ¸ Bλ ż ÿ ÿ ż 1 ď KpϕqP 2 p|W | ď λq ´ ϕpwq log ϕpwqdw, (184) Bc ż λ where the bound Kpϕq is defined from Corollary IV.5. By Cauchy-Schwartz, followed by Theorem IV.2,

´ ϕpwq log ϕpwqdw (185) Bc ż λ 1 “ hpW qPp|W | ě λq ` ϕpwq log ´ hpW q dw (186) Bc ϕpwq ż λ ˆ ˙ 1 ď hpW qPp|W | ě λq ` plog ´ hpϕqq2ϕpwqdw ϕpwqdw (187) ϕpwq Bc dż ż λ ? 1 ď hpW qPp|W | ě λq ` d P 2 p|W | ě λq. (188)

A. Commentary on Theorem IV.1

Let us comment on the nature of $\mathbb{P}(|W| \geq t)$ for $W$ log-concave. It is well known that in broad generality (see [10]–[12], [24], [36]), log-concave random variables satisfy "sub-exponential" large deviation inequalities. The following is enough to suit our needs.

Lemma IV.6. [36, Theorem 2.8] When $W$ is log-concave and $t, r > 0$, then

$$\mathbb{P}(|W| > rt) \leq \mathbb{P}(|W| > t)^{\frac{r+1}{2}}.$$

Corollary IV.7. For a spherically symmetric log-concave random vector $W$ such that $\mathbb{E}W_1^2 = \sigma^2$, where $W_1$ is the random variable given by the first coordinate of $W$,
$$\mathbb{P}(|W| > t) \leq Ce^{-ct},$$
where $C = 2^{-1/2}$ and $c = \frac{\log 2}{2\sqrt{2d\sigma^2}}$.

Proof. By Chebyshev's inequality $\mathbb{P}(|W| > \sqrt{2d\sigma^2}) \leq \frac{1}{2}$. Hence for $r > 1$, by Lemma IV.6,

$$\mathbb{P}\big(|W| > r\sqrt{2d\sigma^2}\big) \leq \mathbb{P}\big(|W| > \sqrt{2d\sigma^2}\big)^{\frac{r+1}{2}} \leq 2^{-\frac{r+1}{2}}.$$
Taking $t = r\sqrt{2d\sigma^2}$ gives the result.

Lemma IV.8. Suppose that the spherically symmetric log-concave $W$ is strongly log-concave in the sense that its density function $\varphi$ satisfies $\varphi((1-t)x + ty) \geq e^{t(1-t)|x-y|^2/2}\varphi^{1-t}(x)\varphi^t(y)$ for $x, y \in \mathbb{R}^d$ and $t \in [0, 1]$. Then
$$\mathbb{P}(|W| > t) \leq \mathbb{P}(|Z| > t), \qquad (189)$$
where $Z$ is a standard normal vector.

Proof. By the celebrated result of Caffarelli [13], $W$ can be expressed as $T(Z)$ where $T$ is the necessarily 1-Lipschitz Brenier transportation map. Moreover it follows from the assumed spherical symmetry of $W$, the radial symmetry of the Gaussian, and [13, Lemma 1] that $T$ is spherically symmetric. In particular $T(0) = 0$, so that $|Z| \leq t$ implies $|T(Z)| = |T(Z) - T(0)| \leq |Z - 0| \leq t$, and our result follows.

Corollary IV.9. Suppose $X$ and $Y$ are variables satisfying the conditions of Section IV for $\tau$, $M = 1$, $\lambda$, and $W$ possessing spherically symmetric log-concave density $\varphi$. Then
$$H(X|Y) \leq \tilde{C}\,2^{-\frac{1 + \lambda/\sqrt{2\sigma^2 d}}{4}}. \qquad (190)$$
Furthermore, if $W$ is strongly log-concave then

$$H(X|Y) \leq \tilde{C}\,\mathbb{P}^{1/2}(|Z| > \lambda), \qquad (191)$$
where $Z$ is a standard Gaussian vector and $\tilde{C} = 1 + \sqrt{d} + h(W) + K(\varphi)$, with $K(\varphi)$ defined as in (97).

Proof. The proof is an immediate application of Corollary IV.7 and Lemma IV.8 to Theorem IV.1.

V. APPLICATIONS

Mixture distributions are ubiquitous, and their entropy is a fundamental quantity. We now give special attention to how the ideas of Section IV can be sharpened in the case that $W$ is Gaussian.

Proposition V.1. When $X$ and $Y$ satisfy the assumptions of Section IV for $\tau$, $M$, $\lambda$, $T_{ij}$, and $W \sim \varphi(w) = e^{-|w|^2/(2\sigma^2)}/(2\pi\sigma^2)^{d/2}$, then

HpX|Y q ď pM ´ 1qPp|W | ď τλq ` JdpϕqPp|W | ą λq (192) with ? 2 d d 2 τ dσ 3 2πσ J pϕq “ log epλ{σq `M pτeqdM 1 ` τ ` τ 2 ` 1 ` ω´1 (193) d λ λ d « ˆ ˙ ˜ ˆ ˙ ¸ff for d ě 2 and

2 2 τ σ 9π σ J pϕq “ log epλ{σq `M`2τM 1 ` τ ` τ 2 ` 1 ` , (194) 1 λ2 2 λ « ˆ ˙ ˜ c ¸ff when d “ 1. The proof of the case d ě 2 is given below, a similar argument in the d “ 1 case is given in the appendix as Proposition A.4. 20

Proof. As in the proof of Theorem IV.1,

1 j‰i pjϕpTjipwqqdetpTjipwqq HpX|Y q “ pi ϕpwq log 1 ` dw (195) p ϕpwq i ż ˜ ř i ¸ ÿ 1 pj ϕpTjipwqqdetpTjipwqq ď ϕpwq log 1 ` j i‰j dw (196) Bc ϕpwq ż λ ˜ ř ř ¸ 1 pj ϕpTjipwqqdetpTjipwqq ` ϕpwq log 1 ` j i‰j dw. (197) ϕpwq żBλ ˜ ř ř ¸ We use the general bound from the proof of Theorem IV.1 for

1 pj ϕpTjipwqqdetpTjipwqq ϕpwq log 1 ` j i‰j dw (198) ϕpwq żBλ ˜ ř ř ¸ ď pM ´ 1qPp|W | ď λτq ` MPp|W | ą λq. (199) Splitting the integral,

1 pj ϕpTjipwqqdetpTjipwqq ϕpwq log 1 ` j i‰j dw (200) Bc ϕpwq ż λ ˜ ř ř ¸

1 “ ϕpwq log pj ϕpTjipwqqdetpTjipwqq dw ´ ϕpwq log ϕpwqdw. (201) c c Bλ ˜ j i ¸ Bλ ż ÿ ÿ ż Using Corollary IV.4, Jensen’s inequality, and then Proposition A.2,

1 ϕpwq log pj ϕpTjipwqqdetpTjipwqq dw (202) c Bλ ˜ j i ¸ ż ÿ ÿ 2 d d d τ |w| 3 ´1 ď ϕpwq log τ M 1 ` τ ` }ϕ}8 ` ωd (203) Bc λ λ ż λ « ˆ ˙ ˜ ˆ ˙ ¸ff 2 1 d d d τ t|w|ąλuϕpwq|w|dw 3 ´1 ď Pp|W | ą λq log τ M 1 ` τ ` }ϕ}8 ` ω (204) p|W | ą λqλ λ d « ˆ ş P ˙ ˜ ˆ ˙ ¸ff d τ 2pλ ` σdq 3 d ď p|W | ą λq log τ dM 1 ` τ ` }ϕ} ` ω´1 . (205) P λ 8 λ d « ˆ ˙ ˜ ˆ ˙ ¸ff Then applying Proposition A.1

d 2 2 2 2 ´ ϕpwq log ϕpwqdw ď log 2πe σ ` λ {σ Pp|W | ą λq (206) Bc 2 ż λ ˆ ˙ Combining (199), (205), and (206) we have

HpX|Y q ď pM ´ 1qPp|W | ď λτq ` JdpϕqPp|W | ą λq (207) with ? 2 d d pλ{σq2`M d 2 τ σd 2 ´ d 3 ´1 J pϕq “ log e pτeσ 2πq M 1 ` τ ` τ ` p2πσ q 2 ` ω (208) d λ λ d « ˆ ˙ ˜ ˆ ˙ ¸ff ? 2 d d pλ{σq2`M d 2 τ σd ´ d 3σ ´1 “ log e pτe 2πq M 1 ` τ ` τ ` p2πq 2 ` ω . (209) λ λ d « ˆ ˙ ˜ ˆ ˙ ¸ff

For example, when $X$ takes values $\{x_i\} \subseteq \mathbb{R}$ such that $|x_i - x_j| \geq 2\lambda$ and $Y$ is given by $X + W$ where $W$ is independent Gaussian noise with variance $\sigma^2$, then $Y$ has density $\sum_i p_i f_i(y)$ with $f_i(y) = \varphi_\sigma(y - x_i)$. Let $T_i(y) = y - x_i$. Thus $T_{ij}(B_\lambda) = B_\lambda + x_j - x_i$, so that $\{T_{kj}(B_\lambda)\}_k$ are disjoint and we can take $M = 1$ and $\tau = 1$. Applying Proposition V.1, we have

HpX|Y q “ HpX|X ` W q ď J1pϕqPp|W | ą λq “ J1pϕqPp|Z| ą λ{σq (210) 21 with

pλ{σq2`3 σ 9π σ J1pϕq “ log e 3 ` 2 1 ` . (211) « λ ˜ 2 λ ¸ff ´ ¯ c Notice that for t ą 0, tX and tX ` tW satisfy the same conditions with λ˜ “ tλ and σ˜ “ tσ, and since HptX|tX ` tW q “ HpX|X ` W q, after applying the result to tX, tX ` tW we have HpX|X ` W q “ HptX|tX ` tW q (212)

pλ{σq2`3 σ 9π σ ď log e 3 ` 2 1 ` Pp|Z| ą λ{σq. (213) « tλ ˜ 2 λ ¸ff ´ ¯ c Taking the limit with t Ñ 8, we obtain,

2 9π σ HpX|X ` W q ď log 3epλ{σq `3 1 ` p|Z| ą λ{σq. (214) 2 λ P « ˜ c ¸ff We collect these observations in the following Corollary.

Corollary V.2. When $X$ is a discrete $\mathbb{R}$-valued random variable taking values $\{x_i\}$ such that $|x_i - x_j| \geq 2\lambda$ for $i \neq j$, and $W$ is an independent Gaussian variable with variance $\sigma^2$, then

$$H(X|X + W) \leq \log\Big[3e^{(\lambda/\sigma)^2 + 3}\Big(1 + \sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{\lambda}\Big)\Big]\,\mathbb{P}(|Z| > \lambda/\sigma). \qquad (215)$$
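For a sense of scale, the following Python snippet (our sketch; it simply evaluates the right-hand side of (215) as reconstructed above) tabulates the Corollary V.2 bound on $H(X|X+W)$ as the spacing-to-noise ratio $\lambda/\sigma$ grows, illustrating the sub-Gaussian decay of the conditional entropy.

```python
import math

def corollary_v2_bound(lam, sigma=1.0):
    """Right-hand side of (215): an upper bound on H(X | X + W) in nats."""
    r = lam / sigma
    tail = math.erfc(r / math.sqrt(2))          # P(|Z| > lam/sigma), Z standard normal
    prefactor = math.log(3.0 * math.exp(r ** 2 + 3) * (1 + math.sqrt(9 * math.pi / 2) / r))
    return prefactor * tail

for lam in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"lambda/sigma = {lam:.0f}:  H(X|X+W) <= {corollary_v2_bound(lam):.3e}")
```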

A. Fano's inequality

A multiple hypothesis testing problem is described as follows: an index $i \in \mathcal{X}$ is drawn according to $X$, and then samples are drawn from the distribution $f_i$, with the goal of determining the value $i$. If $Z$ denotes a random variable with $\mathbb{P}(Z \in A|X = i) =$

$\int_A f_i(z)\,dz$, then by the identities of Proposition II.3, $H(X|Z) = h(Z|X) - h(Z) + H(X)$. Thus bounds on the entropy of the mixture distribution are equivalent to bounds on $H(X|Z)$. For $\hat{X} = g(Z)$, Fano's inequality provides the following bound

$$H(X|Z) \leq H(e) + \mathbb{P}(e)\log(\#\mathcal{X} - 1), \qquad (216)$$
where $e = \{\hat{X} \neq X\}$ is the occurrence of an error. Fano and Fano-like inequalities are important in multiple hypothesis testing, as they can be leveraged to deliver bounds on the Bayes risk (and hence minimax risk); we direct the reader to [7], [25], [68] for more background. Fano's inequality gives a lower bound on the entropy of a mixture distribution that can also give a non-trivial improvement on the concavity of entropy through the equality $H(X|Z) = H(X) + h(Z|X) - h(Z)$. Combined with (216),

$$h\Big(\sum_i p_i f_i\Big) - \sum_i p_i h(f_i) \geq H(p) - \big(H(e) + \mathbb{P}(e)\log(|\mathcal{X}| - 1)\big). \qquad (217)$$
In concert with Theorem I.1 we have the following corollary.

Corollary V.3. For $X$ distributed on indices $i \in \mathcal{X}$, and $Z$ such that $Z|\{X = i\}$ is distributed according to $f_i$, then given an estimator $\tilde{X} = g(Z)$, with $e = \{X \neq \tilde{X}\}$,

$$(1 - T_f)\,H(X) \leq H(e) + \mathbb{P}(e)\log(|\mathcal{X}| - 1). \qquad (218)$$

Proof. By Fano's inequality $H(e) + \mathbb{P}(e)\log(N - 1) \geq H(X|Z)$. Recalling that $H(X|Z) = H(X) - \big(h(\sum_i p_i f_i) - \sum_i p_i h(f_i)\big)$, and, by Theorem I.1,
$$H(X) - \Big(h\Big(\sum_i p_i f_i\Big) - \sum_i p_i h(f_i)\Big) \geq H(X) - T_f H(X), \qquad (219)$$
gives our result.

Heuristically, this demonstrates that "good" estimators are only possible for hypothesis distributions discernible in total variation distance. For example, in the simplest case of binary hypothesis testing where $n = 2$, the inequality is $(1 - \|f_1 - f_2\|_{TV})H(X) \leq H(e)$, demonstrating that the existence of an estimator improving on the trivial lower bound $H(X)$ is limited explicitly by the total variation distance of the two densities.
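To illustrate the binary case of Corollary V.3, here is a short sketch (ours, with hypothetical parameter choices) that computes the total variation distance between two unit-variance Gaussian hypotheses centered at $\pm a$ and the resulting lower bound $(1 - \|f_1 - f_2\|_{TV})H(X)$ on the error entropy $H(e)$ of any estimator.

```python
import math

def tv_two_gaussians(a, sigma=1.0):
    """||N(a, s^2) - N(-a, s^2)||_TV = Phi(a/s) - Phi(-a/s)."""
    return math.erf(a / (sigma * math.sqrt(2)))

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

p = 0.5                       # prior P(X = +a); H(X) = log 2 nats
H_X = binary_entropy(p)
for a in (0.1, 0.5, 1.0, 2.0):
    tv = tv_two_gaussians(a)
    lower = (1 - tv) * H_X    # Corollary V.3: H(e) must be at least this large
    print(f"a = {a}:  TV = {tv:.3f},  H(e) >= {lower:.3f} nats")
```

When the two hypotheses nearly coincide (small $a$), the bound forces $H(e)$ close to $\log 2$, i.e., no estimator does much better than guessing.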

We note that the pursuit of good estimators $\hat{X}$ is a non-trivial problem in most interesting cases, so much so that Fano's inequality is often used to provide a lower bound on the potential performance of a general estimator by the ostensibly simpler quantity $H(X|Z)$, as determining an optimal value for $\mathbb{P}(e)$ is often intractable. A virtue of Theorem IV.1 is that it provides upper bounds on $H(X|Z)$ in terms of tail bounds of a single log-concave variable $|W|$. Thus, Theorem IV.1 asserts that for a large class of models, $H(X|Z)$ can be controlled by a single easily computable quantity, which in the case that $M = 1$ decays sub-exponentially in $\lambda$ to 0. However, the example delineated below demonstrates that even in simple cases where an optimal estimator of $X$ admits explicit computation, the bounds of Theorem IV.1 may outperform the best possible bounds based on Fano's inequality.

Suppose that $X$ is uniformly distributed on $\{1, 2, \ldots, N\}$ and that $W$ is an independent, symmetric log-concave variable with density $\varphi$, and $Z = X + W$; then $Z$ has density $f(z) = \sum_{i=1}^N \frac{f_i(z)}{N}$ with $f_i(z) = \varphi(z - i)$. The optimal (Bayes) estimator of $X$ is given by $\Theta(z) = \mathrm{argmax}_i\{f_i(z) : i \in \{1, 2, \ldots, N\}\}$, which by the assumption of symmetric log-concavity can be expressed explicitly as:

$$\Theta(z) = \mathbf{1}_{(-\infty,\frac{3}{2})} + \sum_{i=2}^{N-1} i\,\mathbf{1}_{(i-\frac{1}{2},\,i+\frac{1}{2})} + N\,\mathbf{1}_{(N-\frac{1}{2},\,\infty)}. \tag{220}$$

Thus, $\mathbb{P}(\Theta \neq X)$ can be written explicitly as well. Indeed,

$$\mathbb{P}(X = \Theta(Z)) = \sum_i \mathbb{P}\big(X = i,\ i = \Theta(i+W)\big) \tag{221}$$
$$= \frac{1}{N}\mathbb{P}\Big(W \le \tfrac{1}{2}\Big) + \frac{1}{N}\mathbb{P}\Big(W \ge -\tfrac{1}{2}\Big) + \sum_{i=2}^{N-1}\frac{1}{N}\mathbb{P}\Big(W \in \big(-\tfrac{1}{2},\tfrac{1}{2}\big)\Big) \tag{222}$$
$$= \mathbb{P}\Big(W \in \big(-\tfrac{1}{2},\tfrac{1}{2}\big)\Big) + \frac{2}{N}\mathbb{P}\Big(W \ge \tfrac{1}{2}\Big). \tag{223}$$

Thus writing $\mathbb{P}(e) = \mathbb{P}(X \neq \Theta)$, we have

$$\mathbb{P}(e) = 2\,\mathbb{P}\Big(W \ge \tfrac{1}{2}\Big)\Big(1 - \frac{1}{N}\Big) \tag{224}$$
$$= \mathbb{P}\Big(|W| \ge \tfrac{1}{2}\Big)\Big(1 - \frac{1}{N}\Big) \tag{225}$$
$$= \mathbb{P}\Big(|Z| \ge \tfrac{1}{2\sigma}\Big)\Big(1 - \frac{1}{N}\Big), \tag{226}$$

where, in the Gaussian case $W \sim N(0,\sigma^2)$, $Z$ denotes a standard normal variable. Thus the optimal bounds achievable through Fano's inequality are described by

$$H(X|Z) \le H(e) + \mathbb{P}(e)\log(N-1) \tag{227}$$

with $\mathbb{P}(e) = \mathbb{P}(|Z| \ge \tfrac{1}{2\sigma})\big(1 - \tfrac{1}{N}\big)$. Note that as $N \to \infty$, the bounds attainable through Fano's inequality become meaningless, since $\lim_{N\to\infty} H(e) + \mathbb{P}(e)\log(N-1) = \infty$ independent of $\sigma$. In contrast, since $Z = X + W$, Corollary V.2 gives the following bound:

$$H(X|Z) \le \log\left[3e^{(\lambda/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{\lambda}\right)\right]\mathbb{P}(|Z| > \lambda/\sigma), \tag{228}$$

independent of $N$.
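The divergence of the Fano bound (227) and the stability of (228) as $N$ grows can be seen numerically; the sketch below is ours, with illustrative parameter values, for unit spacing (so $\lambda = 1/2$).

```python
# Illustrative sketch: the Fano-based bound (227) versus the N-independent bound (228),
# for X uniform on {1,...,N} with unit spacing (lambda = 1/2) and Gaussian noise of
# standard deviation sigma; entropies are in nats.
import numpy as np
from scipy.stats import norm

def binary_entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

sigma, lam = 0.1, 0.5
tail = 2.0 * norm.sf(lam / sigma)                       # P(|Z| > lambda/sigma)
cor_bound = np.log(3 * np.exp((lam / sigma) ** 2 + 3)
                   * (1 + np.sqrt(9 * np.pi / 2) * sigma / lam)) * tail
for N in [10, 10**3, 10**9]:
    p_e = 2.0 * norm.sf(0.5 / sigma) * (1 - 1 / N)      # optimal-estimator error, eq. (226)
    fano = binary_entropy(p_e) + p_e * np.log(N - 1)    # right-hand side of (227)
    print(f"N = {N}: Fano bound = {fano:.3e},  Corollary V.2 bound = {cor_bound:.3e}")
```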

B. Channel capacity with discrete inputs

Consider a channel that admits discrete inputs (and possibly continuous inputs as well), with output density $f_i(z) = p(z|i)$ when conditioned on an input $i$. Suppose the input $X$ takes value $i$ with probability $p_i$; then the output $Z$ will have density $\sum_i p_i f_i$. Thus

$$I(Z;X) = h(Z) - h(Z|X) \tag{229}$$
$$= H(X) - H(X|Z) \tag{230}$$

$$= h\Big(\sum_i p_i f_i\Big) - \sum_i p_i h(f_i). \tag{231}$$

Thus, any choice of input $X$ gives a lower bound on the capacity of the channel. In the context of the additive white Gaussian noise channel, [48] gave rigorous bounds supporting the findings of [60] that finite input constellations can nearly achieve capacity.
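For concreteness, the identity (231) can be evaluated directly by quadrature; the following sketch is ours, with an illustrative four-point constellation and noise level, and computes $I(Z;X)$ as the mixture entropy minus the common component entropy.

```python
# Illustrative sketch: evaluating I(Z;X) through (231) by quadrature for a
# four-point input over additive Gaussian noise (all parameter choices are ours).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

points = np.array([-3.0, -1.0, 1.0, 3.0])   # input constellation
p = np.full(4, 0.25)                         # input probabilities
sigma = 0.5                                  # noise standard deviation

def mixture_pdf(z):
    return float(np.sum(p * norm.pdf(z, loc=points, scale=sigma)))

def neg_f_log_f(z):
    f = mixture_pdf(z)
    return -f * np.log(f) if f > 0 else 0.0

h_mix, _ = quad(neg_f_log_f, -8.0, 8.0, limit=200)       # h(sum_i p_i f_i)
h_comp = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)     # h(f_i), identical for all i
I = h_mix - h_comp                                       # = I(Z;X) by (231), in nats
print(f"I(Z;X) = {I:.4f} nats, compared with H(X) = {np.log(4):.4f} nats")
```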

Theorem V.1 (Ozarow-Wyner [48]). Suppose $X$ is uniformly distributed on $N$ evenly spaced points $\{2\lambda, 4\lambda, \ldots, 2N\lambda\}$, so that its variance is $\mathbb{E}X^2 - (\mathbb{E}X)^2 = \lambda^2\frac{N^2-1}{3}$, and let $Z = X + W$, where $W$ is Gaussian with variance one and independent of $X$. Then
1)
$$I(Z;X) \ge \big(1 - (\pi K)^{-1/2}e^{-K}\big)H(X) - h\big((\pi K)^{-1/2}e^{-K}\big) \tag{232}$$
where
$$K = \frac{3}{2\alpha^2}\big(1 - 2^{-2C}\big) \tag{233}$$
$$\alpha = N2^{-C} \tag{234}$$
$$C = \frac{1}{2}\log\Big(1 + \lambda^2\frac{N^2-1}{3}\Big). \tag{235}$$
2)
$$I(Z;X) \ge C - \frac{1}{2}\log\frac{\pi e}{6} - \frac{1}{2}\log\frac{1+\alpha^2}{\alpha^2}. \tag{236}$$

In the notation of this paper, $K = \frac{\lambda^2}{2}\big(1 - \frac{1}{N^2}\big)$. Defining

$$p_o := \frac{e^{-\frac{\lambda^2}{2}\left(1 - \frac{1}{N^2}\right)}}{\sqrt{\pi\frac{\lambda^2}{2}\left(1 - \frac{1}{N^2}\right)}}, \tag{237}$$

we can re-write (232) as

$$I(X;Z) \ge H(X) - \big(p_o H(X) + h(p_o)\big). \tag{238}$$

Note that $\alpha = N2^{-C} = \frac{N}{\sqrt{1+\lambda^2(N^2-1)/3}} \approx \sqrt{3}/\lambda$ for $N$ large. Thus (232) gives a bound with sub-Gaussian-like convergence in $\lambda$ to $H(X)$ for fixed $N$, but gives worse than trivial bounds for fixed $\lambda$ and $N \to \infty$. In contrast, Corollary V.2 gives

$$I(X;Z) \ge H(X) - \log\left[3e^{\lambda^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{\lambda}\right)\right]\mathbb{P}(|Z| > \lambda). \tag{239}$$

Comparing the bound on the gap between $I(X;Z)$ and $H(X)$ provided by (238) and Corollary V.2, we see that Corollary V.2 outperforms Theorem V.1 for large $\lambda$. Indeed, one can easily find an explicit rational function $q$ such that

$$\frac{p_o H(X) + h(p_o)}{\log\left[3e^{\lambda^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{\lambda}\right)\right]\mathbb{P}(|Z| > \lambda)} \ge q(\lambda)\,e^{\frac{\lambda^2}{2N^2}}. \tag{240}$$

Additionally, (239) gives a universal bound, independent of $N$. These results have been of recent interest; see for example [18], [19], where results improving and generalizing (2) have been studied in the form

$$H(X) - \mathrm{gap}^* \le I(Z;X) \le H(X), \tag{241}$$

with an emphasis on achieving $\mathrm{gap}^*$ bounds that are independent of $N$ and viable for more general noise models. The significance of the results of Theorem IV.1 in this context is that the $\mathrm{gap}^*$ bounds provided converge exponentially fast to zero in $\lambda$, independent of $H(X)$, while, for example, in [19] the $\mathrm{gap}^*$ satisfies

$$\mathrm{gap}^* \ge \frac{1}{2}\log\frac{2\pi e}{12}. \tag{242}$$

Additionally, the tools developed can be extended to perturbations of the output $Y$ and signal-dependent noise through Theorem IV.1. A related investigation of recent interest is the relationship between finite input approximations of capacity achieving distributions, particularly the number of "constellations" needed to approach capacity. For example, in [65], [66] the rate of convergence in $n$ of the capacity of an $n$-point-input power-constrained additive white Gaussian noise channel to that of the usual additive white Gaussian noise channel is obtained. In many practical situations, although a Gaussian input is capacity achieving, discrete inputs are used. We direct the reader to [67] for background on the role this practice plays in Multiple Input-Multiple Output channels, pivotal in the development of 5G technology. Additionally, in the amplitude-constrained discrete-time additive white Gaussian noise channel, the capacity achieving distribution is itself discrete [55]. In fact, many important channels achieve capacity for discrete distributions; see for example [2], [14], [28], [54], [61]. Thus, in the case that the noise model is independent of the input, the capacity achieving output will be a mixture distribution, and the capacity of the channel is given by calculating the entropy of said mixture.

Theorem IV.1 shows that for sparse input, relative to the strength of the noise, the mutual information between the input and output distributions is sub-exponentially close to the entropy of the input in the case of log-concave noise, and sub-Gaussianly close to the entropy of the input in the case of strongly log-concave noise, which includes Gaussian noise as a special case. In contrast, Theorem I.1 gives a reverse inequality, demonstrating that when the mixture components are close to one another, in the sense that their total variation distance from the mixture "with themselves removed" is small, then the mutual information is quantifiably lessened.
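The comparison between (238) and (239) discussed above can be reproduced numerically; the sketch below is ours, with $\sigma = 1$ (as in Theorem V.1) and an illustrative constellation size, and exhibits the crossover for moderately large $\lambda$.

```python
# Illustrative sketch: the entropy gap in the Ozarow-Wyner bound (238),
# p_o H(X) + h(p_o), versus the gap implied by (239), with unit-variance noise.
import numpy as np
from scipy.stats import norm

def binary_entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

N, sigma = 4, 1.0
HX = np.log(N)
for lam in [3.0, 4.0, 5.0, 6.0]:
    K = 0.5 * lam**2 * (1 - 1 / N**2)
    p_o = np.exp(-K) / np.sqrt(np.pi * K)
    gap_ow = p_o * HX + binary_entropy(p_o)                      # gap in (238)
    gap_new = np.log(3 * np.exp(lam**2 + 3)
                     * (1 + np.sqrt(9 * np.pi / 2) * sigma / lam)) \
              * 2 * norm.sf(lam)                                 # gap in (239)
    print(f"lambda = {lam}: Ozarow-Wyner gap = {gap_ow:.3e}, Corollary V.2 gap = {gap_new:.3e}")
```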

C. Energetics of non-equilibrium thermodynamics

For a process $x_t$ that satisfies an overdamped Langevin stochastic differential equation, $dx_t = -\frac{\nabla U(x_t,t)}{\gamma}\,dt + \sqrt{2D}\,d\zeta_t$, with time-varying potential $U: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ and $\zeta_t$ a Brownian motion, where $D = k_B T/\gamma$, $\gamma$ is the viscosity constant, $k_B$ is Boltzmann's constant and $T$ is temperature, one can define natural thermodynamic quantities, in particular trajectory-dependent notions of the work $W$ done on the system (see [29]) and the heat $Q$ dissipated, respectively,

$$W := \int_0^{t_f} \partial_t U(x_t, t)\,dt \tag{243}$$

and

$$Q := -\int_0^{t_f} \nabla_x U(x_t, t) \circ dx_t, \tag{244}$$

where the above is a Stratonovich stochastic integral. Recall that Stratonovich integrals satisfy a chain rule, $dU(x_t,t) = \nabla_x U(x_t,t) \circ dx_t + \partial_t U(x_t,t)\,dt$, so that we immediately have a first law of thermodynamics

$$\Delta U := U(t_f, x(t_f)) - U(0, x(0)) \tag{245}$$

$$= \int_0^{t_f} \partial_t U(x_t,t)\,dt + \int_0^{t_f} \nabla_x U(x_t,t) \circ dx_t \tag{246}$$
$$= W - Q. \tag{247}$$
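The trajectory-wise definitions (243)-(244) and the first law (247) can be checked by direct simulation; the following sketch is ours: it integrates the Langevin dynamics with an Euler-Maruyama step and a midpoint (Stratonovich) evaluation of the work and heat increments, for an illustrative stiffening quadratic potential, and verifies (247) up to discretization error.

```python
# Illustrative sketch: one overdamped Langevin trajectory in U(x,t) = (1 + 4t) x^2 / 2,
# accumulating the work (243) and heat (244) and checking Delta U = W - Q, eq. (247).
import numpy as np

rng = np.random.default_rng(0)
gamma, kB, T = 1.0, 1.0, 1.0
D = kB * T / gamma
dt, tf = 1e-4, 1.0

k    = lambda t: 1.0 + 4.0 * t               # time-varying stiffness
U    = lambda x, t: 0.5 * k(t) * x**2
dUdx = lambda x, t: k(t) * x                 # grad_x U
dUdt = lambda x, t: 2.0 * x**2               # partial_t U

x0 = rng.normal(scale=np.sqrt(kB * T / k(0.0)))  # equilibrium start at stiffness k(0)
x, W, Q = x0, 0.0, 0.0
for step in range(int(tf / dt)):
    t = step * dt
    dx = -dUdx(x, t) / gamma * dt + np.sqrt(2.0 * D) * rng.normal(scale=np.sqrt(dt))
    W += dUdt(x + 0.5 * dx, t + 0.5 * dt) * dt    # work increment, eq. (243)
    Q += -dUdx(x + 0.5 * dx, t + 0.5 * dt) * dx   # heat increment (Stratonovich), eq. (244)
    x += dx

print(f"Delta U = {U(x, tf) - U(x0, 0.0):+.4f},  W - Q = {W - Q:+.4f}")
```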

Further, if $\rho_t$ denotes the distribution of $x_t$ at time $t$, satisfying the Fokker-Planck equation, then it can be shown [4] (see also [15], [38], [53]) that

$$\mathbb{E}Q = k_B T\big(h(\rho_0) - h(\rho_{t_f})\big) + \int_0^{t_f}\mathbb{E}|v(t,x_t)|^2\,dt, \tag{248}$$

where $v$ is the mean local velocity (see [4], or the current velocity in [43]). Since the term $\int_0^{t_f}\mathbb{E}|v(t,x_t)|^2\,dt$ is non-negative, one has a fundamental bound on the efficiency of a process's evolution: the average heat dissipated in a transfer from configuration $\rho_0$ to $\rho_{t_f}$ is bounded below by the change in entropy, with equality in the quasistatic limit in which this term goes to $0$,

$$\mathbb{E}Q = k_B T\big(h(\rho_0) - h(\rho_{t_f})\big). \tag{249}$$

A celebrated example of this inequality is Landauer's principle [31], which proposes fundamental thermodynamic limits to the efficiency of a computer utilizing logically irreversible computations (see also [6]). More explicitly, (249) suggests that the average heat dissipated in the erasure of a bit, that is, the act of transforming a random bit to a deterministic 0, is at least $k_B T\log 2$. This can be reasoned from the above by presuming that the entropy of a random bit should satisfy $h(\rho_0) = \log 2$ and that the reset bit should satisfy $h(\rho_{t_f}) = 0$. In the context of nanoscale investigations (like protein pulling or the intracellular transport of cargo by molecular motors), it is often the case that phenomena take one of finitely many configurations with an empirically derived probability. However, at this scale, thermal fluctuations can make discrete modeling of the phenomena unreasonable, and hence the distributions $\rho_0$ and

$\rho_{t_f}$ in such problems are more accurately modeled as a discrete distribution disrupted by thermal noise, and are thus mixture distributions. Consequently, bounds on the entropy of mixture distributions translate directly to bounds on the energetics of nanoscale phenomena [41], [56]. For example, in the context of Landauer's bound, the distribution of the position of a physical bit is typically modeled by a Gaussian bistable well, explicitly by the density

$$f_p(x) = p\,\frac{e^{-(x-a)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} + (1-p)\,\frac{e^{-(x+a)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}. \tag{250}$$

The variable $p$ connotes the probability that the bit takes the value 1, and $(1-p)$ the probability that the bit takes the value 0. This can be modeled by $X_p$, a Bernoulli variable taking values $\pm a$, and $Z_p = X_p + \sigma Z$ where $Z$ is a standard normal, so that $Z_p$ has density $f_p$.

Corollary V.4. The average heat dissipated $Q_0$ in an optimal erasure protocol, resetting a random bit to zero in the framework of (250), can be bounded above and below as

$$\tilde{C}_L\,\mathbb{P}(|Z| > a/\sigma) \le \mathbb{E}Q_0 - k_B T\log 2 \le \tilde{C}_U\,\mathbb{P}(|Z| > a/\sigma), \tag{251}$$

where $\tilde{C}_L = -k_B T\log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right]$ and $\tilde{C}_U = k_B T\log\left[\tfrac{3}{2}e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right]$.

More generally, in the case that the erasure is imperfect, so that the probability of failure is non-negligible, we have the following bound,

$$C_L\,\mathbb{P}(|Z| > a/\sigma) \le \mathbb{E}Q_0 - k_B T\big(H(p_0) - H(p_1)\big) \le C_U\,\mathbb{P}(|Z| > a/\sigma), \tag{252}$$

where

$$C_L = -k_B T\left(\log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right] - H(X_{p_1})\right) \tag{253}$$

$$C_U = k_B T\left(\log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right] - H(X_{p_0})\right). \tag{254}$$

Proof. First let us note that we understand a random bit to be the case $p_0 = \frac{1}{2}$, while an erased bit is to be understood as a deterministic $X$ with $p_1 = 0$. Thus, (251) follows immediately from (252). If we let $\rho_0 = f_{p_0}$ and $\rho_{t_f} = f_{p_1}$, and let

$Z_{p_i} = X_{p_i} + \sigma Z$ denote the corresponding variables, then (249) gives

$$\mathbb{E}Q_0 = k_B T\big(h(f_{p_0}) - h(f_{p_1})\big) \tag{255}$$

$$= k_B T\big(I(Z_{p_0}; X_{p_0}) - I(Z_{p_1}; X_{p_1})\big), \tag{256}$$

where the second equality follows from the fact that both $Z_{p_i}$ variables are conditionally Gaussian with the same variance. Using the corollary of Theorem I.2 obtained in (228),

$$I(Z_{p_i}; X_{p_i}) \ge H(X_{p_i}) - \log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right]\mathbb{P}(|Z| > a/\sigma), \tag{257}$$

while Theorem I.1, applied as in (4), gives

$$I(Z_{p_i}; X_{p_i}) \le T_f H(X_{p_i}) \tag{258}$$

$$= H(X_{p_i}) - \mathbb{P}(|Z| > a/\sigma)\,H(X_{p_i}), \tag{259}$$

since $T_f = 1 - \mathbb{P}(|Z| > a/\sigma)$. Combining these results gives

$$I(Z_{p_0}; X_{p_0}) - I(Z_{p_1}; X_{p_1}) \le H(X_{p_0}) - H(X_{p_1}) + \mathbb{P}(|Z| > a/\sigma)\left(\log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right] - H(X_{p_0})\right) \tag{260}$$

$$I(Z_{p_0}; X_{p_0}) - I(Z_{p_1}; X_{p_1}) \ge H(X_{p_0}) - H(X_{p_1}) - \mathbb{P}(|Z| > a/\sigma)\left(\log\left[3e^{(a/\sigma)^2+3}\left(1+\sqrt{\tfrac{9\pi}{2}}\,\frac{\sigma}{a}\right)\right] - H(X_{p_1})\right). \tag{261}$$

Inserting these equations into (256) completes the proof.
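The two-sided bound (251) is easy to check numerically; the sketch below is ours, in units with $k_B T = 1$ and with illustrative values of $a$ and $\sigma$: it computes $\mathbb{E}Q_0$ by quadrature of the bistable density (250) and compares it with $\tilde{C}_L\,\mathbb{P}(|Z|>a/\sigma)$ and $\tilde{C}_U\,\mathbb{P}(|Z|>a/\sigma)$.

```python
# Illustrative sketch: checking (251) by quadrature for a bistable well with
# separation a and width sigma, in units with k_B T = 1.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, sigma = 2.0, 0.5

def f(x, p):                                   # the bistable density (250)
    return p * norm.pdf(x, a, sigma) + (1 - p) * norm.pdf(x, -a, sigma)

def diff_entropy(p):
    def integrand(x):
        v = f(x, p)
        return -v * np.log(v) if v > 0 else 0.0
    return quad(integrand, -a - 8 * sigma, a + 8 * sigma, limit=300)[0]

EQ0  = diff_entropy(0.5) - diff_entropy(0.0)   # = h(f_{1/2}) - h(f_0), eq. (255), k_B T = 1
tail = 2.0 * norm.sf(a / sigma)                # P(|Z| > a/sigma)
pref = np.exp((a / sigma) ** 2 + 3) * (1 + np.sqrt(9 * np.pi / 2) * sigma / a)
C_L, C_U = -np.log(3 * pref), np.log(1.5 * pref)
print(f"{C_L * tail:.4e} <= {EQ0 - np.log(2):.4e} <= {C_U * tail:.4e}")
```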

D. Functional inequalities

Mixture distributions arise naturally in mathematical contexts as well. For example, in [9] Bobkov and Marsiglietti found an interesting application of $h(X+Z) \le H(X) + h(Z)$, for $X$ discrete and $Z$ independent and continuous, in the investigation of the entropic Central Limit Theorem for discrete random variables under smoothing. In the study of stability of the Gaussian log-Sobolev inequality by Eldan, Lehec, and Shenfeld [20], it is proven (as Proposition 5 there) that the deficit in the Gaussian log-Sobolev inequality, defined as

$$\delta(\mu) = \frac{I(\mu\|\gamma)}{2} - D(\mu\|\gamma) \tag{262}$$

for a measure $\mu$ and $\gamma$ the standard $d$-dimensional Gaussian measure, where $I$ is the relative Fisher information,

$$I(\mu\|\gamma) := \int_{\mathbb{R}^d}\left|\nabla\log\frac{d\mu}{d\gamma}\right|^2 d\mu, \tag{263}$$

is small for Gaussian mixtures. More explicitly, for $p_i$ non-negative numbers summing to 1,

$$\delta\Big(\sum_i p_i\gamma_i\Big) \le H(p). \tag{264}$$

In the language of Theorem I.1, a sharper bound can be achieved.

Corollary V.5. When the $\gamma_i$ are translates of the standard Gaussian measure, then

$$\delta\Big(\sum_i p_i\gamma_i\Big) \le T\,H(p), \tag{265}$$

where $T$ is defined as in Theorem I.1.

Proof. By the convexity of the relative Fisher information, the identity $D(\sum_i p_i\gamma_i\|\gamma) = \sum_i p_i D(\gamma_i\|\gamma) - \big(h(\sum_i p_i\gamma_i) - \sum_i p_i h(\gamma_i)\big)$, the fact that $\frac{I(\gamma_i\|\gamma)}{2} - D(\gamma_i\|\gamma) = 0$, and an application of Theorem I.1, we have

$$\delta\Big(\sum_i p_i\gamma_i\Big) = \frac{I(\sum_i p_i\gamma_i\|\gamma)}{2} - D\Big(\sum_i p_i\gamma_i\,\Big\|\,\gamma\Big) \tag{266}$$
$$\le \sum_i p_i\left(\frac{I(\gamma_i\|\gamma)}{2} - D(\gamma_i\|\gamma)\right) + h\Big(\sum_i p_i\gamma_i\Big) - \sum_i p_i h(\gamma_i) \tag{267}$$
$$= h\Big(\sum_i p_i\gamma_i\Big) - \sum_i p_i h(\gamma_i) \tag{268}$$
$$\le T\,H(p). \tag{269}$$
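As a numerical illustration of the weaker bound (264) that Corollary V.5 sharpens, the following sketch (ours) computes the log-Sobolev deficit of a one-dimensional two-component mixture of translated standard Gaussians by quadrature; the means and weights are illustrative choices.

```python
# Illustrative sketch: delta(sum_i p_i gamma_i) <= H(p), eq. (264), checked by
# quadrature for a two-component mixture of translated standard Gaussians.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

w = np.array([0.5, 0.5])                 # mixture weights
m = np.array([0.0, 2.0])                 # translates (means)

f  = lambda x: np.sum(w * norm.pdf(x - m))                   # mixture density
df = lambda x: np.sum(w * (-(x - m)) * norm.pdf(x - m))      # f'(x)
g  = lambda x: norm.pdf(x)                                   # standard Gaussian density

D, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -12, 14, limit=300)
I, _ = quad(lambda x: (df(x) / f(x) + x) ** 2 * f(x), -12, 14, limit=300)
delta = I / 2 - D
Hp = -np.sum(w * np.log(w))
print(f"delta = {delta:.4f} <= H(p) = {Hp:.4f}")
```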

VI.CONCLUSIONS

In this article, the entropy of mixture distributions is estimated by providing tight upper and lower bounds. The efficacy of the bounds is demonstrated, for example, by showing that existing bounds on the conditional entropy $H(X|Z)$ of a random variable $X$ taking values in a countable set $\mathcal{X}$, conditioned on a continuous random variable $Z$, become meaningless as the cardinality of $\mathcal{X}$ increases, while the bounds obtained here remain relevant. Significantly enhanced bounds on the mutual information of channels that admit discrete inputs with continuous outputs are obtained, based on the bounds on the entropy of mixture distributions. The technical methodology developed is of interest in its own right, whereby connections to existing results either can be derived as corollaries of more general theorems in the article or are improved upon by the results in the article. These include the reverse Pinsker inequality, bounds on the Jensen-Shannon divergence, and bounds that are obtainable via Fano's inequality.

VII.ACKNOWLEDGEMENT

The authors acknowledge the support of the National Science Foundation for funding the research under Grant Nos. 1462862 (CMMI), 1544721 (CNS), and 1248100 (DMS).

APPENDIX

Proof of Proposition III.3. There is nothing to prove in (1); this is exactly the definition of the usual relative entropy from $\mu$ to $t\mu + (1-t)\nu$. For (2), by (1), $S_t(\mu\|\nu) = 0$ iff $D(\mu\|t\mu+(1-t)\nu) = 0$, which is true iff $\mu = t\mu+(1-t)\nu$, which happens iff $t = 1$ or $\mu = \nu$. To prove (3), observe that for a Borel set $A$,

$$\mu(A) \le \frac{1}{t}\big(t\mu+(1-t)\nu\big)(A).$$

This gives the following inequality, from which absolute continuity, and the existence of $\frac{d\mu}{d(t\mu+(1-t)\nu)}$, follow immediately:

$$\frac{d\mu}{d(t\mu+(1-t)\nu)} \le \frac{1}{t}. \tag{270}$$

Integrating the logarithm of (270) with respect to $\mu$ gives

$$S_t(\mu\|\nu) \le -\log t. \tag{271}$$
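As a quick numerical illustration of (271), and of the monotonicity in $t$ established next, the following sketch (ours, with two illustrative discrete distributions) evaluates $S_t(\mu\|\nu) = D(\mu\|t\mu+(1-t)\nu)$ over a grid of $t$.

```python
# Illustrative sketch: the skew divergence S_t(mu||nu) = D(mu || t mu + (1-t) nu)
# for two discrete distributions, checking S_t <= -log t and monotonicity in t.
import numpy as np

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.1, 0.3, 0.6])

def skew_divergence(t):
    mix = t * mu + (1 - t) * nu
    return np.sum(mu * np.log(mu / mix))

for t in [0.1, 0.25, 0.5, 0.75, 0.99]:
    print(f"t = {t:.2f}:  S_t = {skew_divergence(t):.4f}  <=  -log t = {-np.log(t):.4f}")
```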

To prove (4), notice that for fixed $\mu$ and $\nu$, the map $\Phi_t = t\mu + (1-t)\nu$ is affine in $t$, and since the relative entropy is jointly convex [16], convexity in $t$ follows from the computation below.

$$S_{(1-\lambda)t_1+\lambda t_2}(\mu\|\nu) = D\big(\mu\,\big\|\,\Phi_{(1-\lambda)t_1+\lambda t_2}\big)$$
$$= D\big(\mu\,\big\|\,(1-\lambda)\Phi_{t_1} + \lambda\Phi_{t_2}\big)$$
$$\le (1-\lambda)D(\mu\|\Phi_{t_1}) + \lambda D(\mu\|\Phi_{t_2})$$
$$= (1-\lambda)S_{t_1}(\mu\|\nu) + \lambda S_{t_2}(\mu\|\nu).$$

Since $t \mapsto S_t(\mu\|\nu)$ is a non-negative convex function on $(0,1]$ with $S_1(\mu\|\nu) = 0$, it is necessarily non-increasing. When $\mu \neq \nu$, $\mu \neq t\mu + (1-t)\nu$, so that $S_t(\mu\|\nu) > 0$ for $t < 1$; hence, as a function of $t$, the skew divergence is strictly decreasing. To prove that $S_t$ is an $f$-divergence, recall Definition III.1. It is straightforward that $S_t$ can be expressed in the form (60) with $f(x) = x\log\big(x/(tx+(1-t))\big)$. Convexity of $f$ follows from the second derivative computation,

$$f''(x) = \frac{(t-1)^2}{x\,(tx+(1-t))^2} > 0.$$

Since $f(1) = 0$ the proof is complete.

In this section we consider $W \sim \varphi_\sigma$ with $\varphi_\sigma(w) = e^{-|w|^2/2\sigma^2}/(2\pi\sigma^2)^{d/2}$, and use $\varphi$ to denote $\varphi_1$ and $Z$ in place of $W$ in this case.

Proposition A.1. For $d \ge 2$,

$$-\int_{B_\lambda^c}\varphi_\sigma(w)\log\varphi_\sigma(w)\,dw \le \left(\frac{d}{2}\log(2\pi e^2\sigma^2) + \frac{\lambda^2}{\sigma^2}\right)\mathbb{P}(|W| > \lambda). \tag{272}$$

Proof. We first show that the result for general $\sigma$ follows from the case $\sigma = 1$. Indeed, assuming (272) holds when $\sigma = 1$, the substitution $u = w/\sigma$ gives

$$-\int_{B_\lambda^c}\varphi_\sigma(w)\log\varphi_\sigma(w)\,dw = \mathbb{P}(|Z| > \lambda/\sigma)\,d\log\sigma - \int_{B_{\lambda/\sigma}^c}\varphi(u)\log\varphi(u)\,du \tag{273}$$
$$\le \left(\frac{d}{2}\log(2\pi e^2\sigma^2) + \frac{\lambda^2}{\sigma^2}\right)\mathbb{P}(|Z| > \lambda/\sigma), \tag{274}$$

where we have applied the $\sigma = 1$ case of (272) to achieve the inequality. Since $W$ has the same distribution as $\sigma Z$, the reduction holds. By direct computation,

$$-\int_{B_\lambda^c}\varphi(w)\log\varphi(w)\,dw = \int_{B_\lambda^c}\varphi(w)\,\frac{d}{2}\log 2\pi\,dw + \int_{B_\lambda^c}\varphi(w)\,\frac{|w|^2}{2}\,dw \tag{275}$$
$$= \mathbb{P}(|Z| > \lambda)\left(\frac{d}{2}\log 2\pi + \frac{(2\pi)^{-d/2}\omega_d\int_\lambda^\infty r^{d+1}e^{-r^2/2}\,dr}{2\,(2\pi)^{-d/2}\omega_d\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr}\right) \tag{276}$$
$$= \mathbb{P}(|Z| > \lambda)\left(\frac{d}{2}\log 2\pi + \frac{\lambda^d e^{-\lambda^2/2} + d\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr}{2\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr}\right) \tag{277}$$
$$= \mathbb{P}(|Z| > \lambda)\left(\frac{d}{2}\log 2\pi e + \frac{\lambda^d e^{-\lambda^2/2}}{2\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr}\right). \tag{278}$$

Using $r^{d-1} \ge r\lambda^{d-2}$ for $r \ge \lambda$ when $d \ge 2$,

$$\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr \ge \lambda^{d-2}\int_\lambda^\infty re^{-r^2/2}\,dr = \lambda^{d-2}e^{-\lambda^2/2}. \tag{279}$$

Thus,

$$-\int_{B_\lambda^c}\varphi(w)\log\varphi(w)\,dw \le \mathbb{P}(|Z| > \lambda)\left(\frac{d}{2}\log(2\pi e^2) + \lambda^2\right). \tag{280}$$

Proposition A.2. For $d \ge 2$,

$$\int_{B_\lambda^c}\varphi_\sigma(w)|w|\,dw \le (\lambda + d\sigma)\,\mathbb{P}(|W| > \lambda). \tag{281}$$

Proof. Again we reduce to the case $\sigma = 1$. Substituting $u = w/\sigma$ gives

$$\int_{B_\lambda^c}\varphi_\sigma(w)|w|\,dw = \sigma\int_{B_{\lambda/\sigma}^c}\varphi(u)|u|\,du \tag{282}$$
$$\le \Big(\frac{\lambda}{\sigma} + d\Big)\sigma\,\mathbb{P}(|Z| > \lambda/\sigma) \tag{283}$$
$$= (\lambda + d\sigma)\,\mathbb{P}(|W| > \lambda), \tag{284}$$

where we have used the $\sigma = 1$ case of (281) for the inequality, and the fact that $\sigma Z$ is equidistributed with $W$ for the last equality. We now proceed in the reduced case. By change of coordinates and integration by parts,

$$\frac{\int_{B_\lambda^c}\varphi(w)|w|\,dw}{\mathbb{P}(|Z| > \lambda)} = \frac{\int_\lambda^\infty r^{d}e^{-r^2/2}\,dr}{\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr} \tag{285}$$
$$\le \frac{\left.-r^{d-1}e^{-r^2/2}\right|_\lambda^\infty + d\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr}{\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr} \tag{286}$$
$$= \frac{\lambda^{d-1}e^{-\lambda^2/2}}{\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr} + d. \tag{287}$$

Using $r \ge \lambda$, for $d \ge 2$,

$$\int_\lambda^\infty r^{d-1}e^{-r^2/2}\,dr \ge \lambda^{d-2}\int_\lambda^\infty re^{-r^2/2}\,dr = \lambda^{d-2}e^{-\lambda^2/2}. \tag{288}$$

Thus,

$$\frac{\int_{B_\lambda^c}\varphi(w)|w|\,dw}{\mathbb{P}(|Z| > \lambda)} \le \lambda + d. \tag{289}$$
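A simple Monte Carlo check of Proposition A.2 (ours; the dimension, scale, and threshold below are illustrative) compares $\mathbb{E}\big[|W|\mathbf{1}\{|W|>\lambda\}\big]$ with $(\lambda + d\sigma)\,\mathbb{P}(|W|>\lambda)$.

```python
# Illustrative sketch: Monte Carlo check of Proposition A.2 for a scaled
# standard Gaussian in d dimensions.
import numpy as np

rng = np.random.default_rng(1)
d, sigma, lam, n = 3, 1.0, 2.5, 10**6
W = sigma * rng.standard_normal((n, d))
R = np.linalg.norm(W, axis=1)
lhs = np.mean(R * (R > lam))                 # E[|W| 1{|W| > lambda}]
rhs = (lam + d * sigma) * np.mean(R > lam)   # (lambda + d sigma) P(|W| > lambda)
print(f"LHS = {lhs:.5f} <= RHS = {rhs:.5f}")
```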

Proposition A.3. When $d = 1$, so that $W \sim \varphi_\sigma(w) = e^{-w^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$, we have the following bounds for $\lambda > 0$:

$$\sigma^2\varphi_\sigma(\lambda) = \int_\lambda^\infty w\varphi_\sigma(w)\,dw \le \Big(\lambda + \frac{\sigma^2}{\lambda}\Big)\int_\lambda^\infty\varphi_\sigma(w)\,dw, \tag{290}$$
$$-\int_\lambda^\infty\varphi_\sigma(w)\log\varphi_\sigma(w)\,dw \le \Big(\lambda^2/\sigma^2 + 2 + \log(\sqrt{2\pi}\sigma)\Big)\int_\lambda^\infty\varphi_\sigma(w)\,dw. \tag{291}$$

Proof. The inequality (290) is standard; it can be reduced to the case $\sigma = 1$ by the change of variables $u = w/\sigma$, and the proof then follows from the $\sigma = 1$ case. Recall $\varphi'(w) = -w\varphi(w)$ and observe that the function

$$g(\lambda) = \int_\lambda^\infty\varphi(w)\,dw - \frac{\lambda}{\lambda^2+1}\varphi(\lambda) \tag{292}$$

satisfies $g(0) > 0$, $\lim_{\lambda\to\infty}g(\lambda) = 0$, and has derivative

$$\frac{-2\varphi(\lambda)}{(\lambda^2+1)^2} < 0, \tag{293}$$

so that $g(\lambda) > 0$, which is equivalent to (290). To prove (291), we again reduce to the $\sigma = 1$ case by the substitution $u = w/\sigma$. Then, computing directly and using integration by parts,

$$-\int_\lambda^\infty\varphi(w)\log\varphi(w)\,dw \le \int_\lambda^\infty\log(\sqrt{2\pi})\varphi(w)\,dw + \int_\lambda^\infty w^2\varphi(w)\,dw \tag{294}$$
$$= \log(\sqrt{2\pi})\int_\lambda^\infty\varphi(w)\,dw + \lambda\varphi(\lambda) + \int_\lambda^\infty\varphi(w)\,dw \tag{295}$$
$$\le \Big(\lambda^2 + 2 + \log\sqrt{2\pi}\Big)\int_\lambda^\infty\varphi(w)\,dw. \tag{296}$$

The last inequality is an application of (290).
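Both inequalities of Proposition A.3 can be checked by quadrature; the sketch below is ours, with illustrative $\sigma$ and $\lambda$ and an upper integration limit truncated where the Gaussian tail is negligible.

```python
# Illustrative sketch: quadrature check of (290) and (291) for one choice of
# sigma and lambda.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma, lam = 0.7, 1.5
upper = lam + 20 * sigma                     # truncation; tail beyond is negligible
phi = lambda w: norm.pdf(w, scale=sigma)

tail    = quad(phi, lam, upper)[0]
lhs_290 = quad(lambda w: w * phi(w), lam, upper)[0]
lhs_291 = quad(lambda w: -phi(w) * np.log(phi(w)), lam, upper)[0]

print(f"(290): {lhs_290:.5f} <= {(lam + sigma**2 / lam) * tail:.5f}")
print(f"(291): {lhs_291:.5f} <= {(lam**2 / sigma**2 + 2 + np.log(np.sqrt(2*np.pi)*sigma)) * tail:.5f}")
```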

Proposition A.4. When $X$ and $Z$ satisfy the conditions of Section IV for the one-dimensional Gaussian $W \sim \varphi_\sigma(w) = e^{-w^2/2\sigma^2}/\sqrt{2\pi\sigma^2}$, then

$$H(X|Z) \le (M-1)\,\mathbb{P}(|W| \le \tau\lambda) + J(\varphi)\,\mathbb{P}(|W| \ge \lambda) \tag{297}$$

with

$$J(\varphi) = \log\left[e^{(\lambda/\sigma)^2+M+2}\,\tau M\sigma\sqrt{2\pi}\left(1+\tau+\tau^2\Big(1+\frac{\sigma^2}{\lambda^2}\Big)\right)\left((2\pi\sigma^2)^{-\frac{1}{2}} + \frac{3}{2\lambda}\right)\right]. \tag{298}$$

Proof. As in the proof of Theorem IV.1,

$$H(X|Z) = \sum_i p_i\int\varphi(w)\log\left(1 + \frac{\sum_{j\neq i}p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))}{p_i\varphi(w)}\right)dw \tag{299}$$
$$\le \int_{B_\lambda^c}\varphi(w)\log\left(1 + \frac{\sum_j\sum_{i\neq j}p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))}{\varphi(w)}\right)dw \tag{300}$$
$$\quad + \int_{B_\lambda}\varphi(w)\log\left(1 + \frac{\sum_j\sum_{i\neq j}p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))}{\varphi(w)}\right)dw, \tag{301}$$

with

$$\int_{B_\lambda}\varphi(w)\log\left(1 + \frac{\sum_j\sum_{i\neq j}p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))}{\varphi(w)}\right)dw \tag{302}$$
$$\le (M-1)\,\mathbb{P}(|W| \le \lambda\tau) + M\,\mathbb{P}(|W| > \lambda). \tag{303}$$

Splitting the integral,

$$\int_{B_\lambda^c}\varphi(w)\log\left(1 + \frac{\sum_j\sum_{i\neq j}p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))}{\varphi(w)}\right)dw \tag{304}$$

$$= \int_{B_\lambda^c}\varphi(w)\log\left(\sum_j\sum_i p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))\right)dw - \int_{B_\lambda^c}\varphi(w)\log\varphi(w)\,dw. \tag{305}$$

Using Corollary IV.4, Jensen's inequality, and (290),

$$\int_{B_\lambda^c}\varphi(w)\log\left(\sum_j\sum_i p_j\varphi(T_{ji}(w))\det(T_{ji}'(w))\right)dw \tag{306}$$
$$\le \int_{B_\lambda^c}\varphi(w)\log\left[\tau M\left(1+\tau+\frac{\tau^2|w|}{\lambda}\right)\left(\|\varphi\|_\infty + \frac{3}{2\lambda}\right)\right]dw \tag{307}$$
$$\le \mathbb{P}(|W| > \lambda)\log\left[\tau M\left(1+\tau+\frac{\tau^2}{\lambda}\,\frac{\int_{\{|w|>\lambda\}}\varphi(w)|w|\,dw}{\mathbb{P}(|W| > \lambda)}\right)\left(\frac{1}{\sqrt{2\pi}\sigma} + \frac{3}{2\lambda}\right)\right] \tag{308}$$
$$\le \mathbb{P}(|W| > \lambda)\log\left[\tau M\left(1+\tau+\frac{\tau^2\big(\lambda + \frac{\sigma^2}{\lambda}\big)}{\lambda}\right)\left(\frac{1}{\sqrt{2\pi}\sigma} + \frac{3}{2\lambda}\right)\right]. \tag{309}$$

Then applying (291),

$$-\int_{B_\lambda^c}\varphi(w)\log\varphi(w)\,dw \le \Big((\lambda/\sigma)^2 + 2 + \log(\sqrt{2\pi}\sigma)\Big)\mathbb{P}(|W| > \lambda). \tag{310}$$

Combining (303), (309), and (310) we have

$$H(X|Z) \le (M-1)\,\mathbb{P}(|W| \le \lambda\tau) + J(\varphi)\,\mathbb{P}(|W| > \lambda) \tag{311}$$

with

$$J(\varphi) = \log\left[e^{(\lambda/\sigma)^2+M+2}\,\tau M\sigma\sqrt{2\pi}\left(1+\tau+\frac{\tau^2\big(\lambda+\frac{\sigma^2}{\lambda}\big)}{\lambda}\right)\left((2\pi\sigma^2)^{-\frac{1}{2}} + \frac{3}{2\lambda}\right)\right]. \tag{312}$$

REFERENCES [1] E. Abbe and A. Barron. Polar coding schemes for the AWGN channel. In Proc. IEEE Intl. Symp. Inform. Theory, pages 194–198, St. Petersburg, Russia, August 2011. [2] Ibrahim C Abou-Faycal, Mitchell D Trott, and Shlomo Shamai. The capacity of discrete-time memoryless rayleigh-fading channels. IEEE Transactions on Information Theory, 47(4):1290–1301, 2001. [3] Koenraad MR Audenaert. Quantum skew divergence. Journal of Mathematical Physics, 55(11):112202, 2014. [4] Erik Aurell, Krzysztof Gawedzki, Carlos Mejía-Monasterio, Roya Mohayaee, and Paolo Muratore-Ginanneschi. Refined second law of thermodynamics for fast random processes. Journal of statistical physics, 147(3):487–505, 2012. [5] F. Barthe and N. Huet. On Gaussian Brunn–Minkowski inequalities. Studia Math., 191(3):283–304, 2009. [6] Charles H Bennett. Logical reversibility of computation. IBM journal of Research and Development, 17(6):525–532, 1973. [7] Lucien Birgé. A new lower bound for multiple hypothesis testing. IEEE transactions on information theory, 51(4):1611–1615, 2005. [8] S. Bobkov and M. Madiman. The entropy per coordinate of a random vector is highly constrained under convexity conditions. IEEE Trans. Inform. Theory, 57(8):4940–4954, August 2011. 30

[9] Sergey G Bobkov and Arnaud Marsiglietti. Entropic clt for smoothed convolutions and associated entropy bounds. arXiv preprint arXiv:1903.03666, 2019. [10] Sergey G Bobkov, James Melbourne, et al. Hyperbolic measures on infinite dimensional spaces. Probability Surveys, 13:57–88, 2016. [11] SG Bobkov and J Melbourne. Localization for infinite-dimensional hyperbolic measures. In Doklady Mathematics, volume 91, pages 297–299. Springer, 2015. [12] C. Borell. Complements of Lyapunov’s inequality. Math. Ann., 205:323–331, 1973. [13] L. A. Caffarelli. Monotonicity properties of optimal transportation and the FKG and related inequalities. Comm. Math. Phys., 214(3):547–563, 2000. [14] Terence H Chan, Steve Hranilovic, and Frank R Kschischang. Capacity-achieving probability measure for conditionally gaussian channels with bounded inputs. IEEE Transactions on Information Theory, 51(6):2073–2088, 2005. [15] Raphaël Chetrite, Gregory Falkovich, and Krzysztof Gawedzki. Fluctuation relations in simple examples of non-equilibrium steady states. Journal of Statistical Mechanics: Theory and Experiment, 2008(08):P08005, 2008. [16] T.M. Cover and J.A. Thomas. Elements of Information Theory. J. Wiley, New York, 1991. [17] Noel Cressie and Timothy RC Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):440–464, 1984. [18] Alex Dytso, Mario Goldenbaum, H Vincent Poor, and Shlomo Shamai Shitz. A generalized ozarow-wyner capacity bound with applications. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1058–1062. IEEE, 2017. [19] Alex Dytso, Daniela Tuninetti, and Natasha Devroye. Interference as noise: Friend or foe? IEEE Transactions on Information Theory, 62(6):3561–3596, 2016. [20] Ronen Eldan, Joseph Lehec, and Yair Shenfeld. Stability of the logarithmic sobolev inequality via the fz" ollmer process. arXiv preprint arXiv:1903.04522, 2019. [21] M. Fradelizi, J. Li, and M. Madiman. Concentration of information content for convex measures. Electron. J. Probab., 25(20):1–22, 2020. Available online at arXiv:1512.01490v3. [22] M. Fradelizi, M. Madiman, and L. Wang. Optimal concentration of information content for log-concave densities. In C. Houdré, D. Mason, P. Reynaud- Bouret, and J. Rosinski, editors, High Dimensional Probability VII: The Cargèse Volume, Progress in Probability. Birkhäuser, Basel, 2016. Available online at arXiv:1508.04093. [23] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005. [24] O. Guédon. Kahane-Khinchine type inequalities for negative exponent. Mathematika, 46(1):165–173, 1999. [25] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inform. Theory, 57(4):2386–2399, 2011. [26] Alexander S Holevo. Quantum systems, channels, information: a mathematical introduction, volume 16. Walter de Gruyter, 2012. [27] M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck. On entropy approximation for Gaussian mixture random vectors. In Proceedings of IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Seoul, Korea, August 2008. [28] W. Huleihel, Z. Goldfeld, T. Koch, M. Madiman, and M. Médard. Design of discrete constellations for peak-power-limited complex Gaussian channels. In Proc. IEEE Intl. Symp. Inform. Theory, Vail, CO, June 2018. [29] Christopher Jarzynski. 
Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56(5):5018, 1997. [30] K. Kampa, E. Hasanbelliu, and J. C. Principe. Closed-form Cauchy-Schwarz PDF divergence for mixture of Gaussians. In Proceedings of International Joint Conference on Neural Networks, San Jose, CA, August 2011. [31] Rolf Landauer. Irreversibility and heat generation in the computing process. IBM journal of research and development, 5(3):183–191, 1961. [32] Lillian Lee. Measures of distributional similarity. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 25–32. Association for Computational Linguistics, 1999. [33] Jiange Li, Arnaud Marsiglietti, and James Melbourne. Further investigations of Rényi entropy power inequalities and an entropic characterization of s-concave densities. arXiv preprint arXiv:1901.10616, 2019. [34] Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006. [35] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991. [36] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures Algorithms, 4(4):359–412, 1993. [37] M. Madiman, J. Melbourne, and P. Xu. Forward and reverse entropy power inequalities in convex geometry. In E. Carlen, M. Madiman, and E. M. Werner, editors, Convexity and Concentration, volume 161 of IMA Volumes in Mathematics and its Applications, pages 427–485. Springer, 2017. Available online at arXiv:1604.04225. [38] Christian Maes, Karel Netocnˇ y,` and Bram Wynants. Steady state statistics of driven diffusions. Physica A: Statistical Mechanics and its Applications, 387(12):2675–2689, 2008. [39] J. Melbourne, S. Talukdar, S. Bhaban, and M. Salapaka. Error bounds for a mixed entropy inequality. In Information Theory (ISIT), 2018 IEEE International Symposium on, 2018. [40] James Melbourne, Mokshay Madiman, and Murti V Salapaka. Relationships between certain f-divergences. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1068–1073. IEEE, 2019. [41] James Melbourne, Saurav Talukdar, and Murti V Salapaka. Realizing information erasure in finite time. In 2018 IEEE Conference on Decision and Control (CDC), pages 4135–4140. IEEE, 2018. [42] K. Moshksar and A. K. Khandani. Arbitrarily tight bounds on differential entropy of Gaussian mixtures. IEEE Trans. Inform. Theory, 62(6):3340–3354, 2016. [43] Edward Nelson. Dynamical theories of Brownian motion, volume 3. Princeton university press, 1967. [44] Jerzy Neyman. Contribution to the theory of the χ2 test. In Proceedings of the Berkeley symposium on mathematical statistics and probability, volume 1, pages 239–273. University of California Press Berkeley, 1949. [45] V. H. Nguyen. Inégalités fonctionnelles et convexité. PhD thesis, Université Pierre et Marie Curie (Paris VI), October 2013. [46] F. Nielsen and R. Nock. Maxent upper bounds for the differential entropy of univariate continuous distributions. IEEE Signal Processing Letters, 24(4):402–406, April 2017. [47] Frank Nielsen. On a generalization of the Jensen-Shannon divergence. arXiv preprint arXiv:1912.00610, 2019. [48] Lawrence H Ozarow and Aaron D Wyner. On the capacity of the gaussian channel with a finite number of input levels. 
IEEE transactions on information theory, 36(6):1426–1428, 1990. [49] Karl Pearson. X. on the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900. [50] Igal Sason. On f-divergences: Integral representations, local behavior, and inequalities. Entropy, 20(5):383, 2018. [51] Igal Sason and Sergio Verdu. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016. [52] Peter Schlattmann. Medical applications of finite mixture models. Springer, 2009. [53] Udo Seifert. Entropy production along a stochastic trajectory and an integral fluctuation theorem. Physical review letters, 95(4):040602, 2005. 31

[54] Shlomo Shamai and Israel Bar-David. The capacity of average and peak-power-limited quadrature gaussian channels. IEEE Transactions on Information Theory, 41(4):1060–1071, 1995. [55] Joel G Smith. The information capacity of amplitude-and variance-constrained scalar gaussian channels. Information and Control, 18(3):203–219, 1971. [56] Saurav Talukdar, Shreyas Bhaban, James Melbourne, and Murti Salapaka. Analysis of heat dissipation and reliability in information erasure: A gaussian mixture approach. Entropy, 20(10):749, 2018. [57] Saurav Talukdar, Shreyas Bhaban, and Murti V Salapaka. Memory erasure using time-multiplexed potentials. Physical Review E, 95(6):062121, 2017. [58] F. Topsoe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on information theory, 46(4):1602– 1609, 2000. [59] F. Topsøe. Jenson-shannon divergence and norm-based measures of discrimination and variation. preprint, 2003. [60] Gottfried Ungerboeck. Channel coding with multilevel/phase signals. IEEE transactions on Information Theory, 28(1):55–67, 1982. [61] Lav R Varshney. Transporting information and energy simultaneously. In 2008 IEEE International Symposium on Information Theory, pages 1612–1616. IEEE, 2008. [62] Sergio Verdú. Total variation distance and the distribution of relative information. In 2014 Information Theory and Applications Workshop (ITA), pages 1–3. IEEE, 2014. [63] L. Wang. Heat capacity bound, energy fluctuations and convexity. PhD thesis, Yale University, May 2014. [64] Liyao Wang and Mokshay Madiman. Beyond the entropy power inequality, via rearrangements. IEEE Transactions on Information Theory, 60(9):5116– 5137, 2014. [65] Yihong Wu and Sergio Verdú. Functional properties of mmse. In 2010 IEEE International Symposium on Information Theory, pages 1453–1457. IEEE, 2010. [66] Yihong Wu and Sergio Verdú. The impact of constellation cardinality on gaussian channel capacity. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 620–628. IEEE, 2010. [67] Yongpeng Wu, Chengshan Xiao, Zhi Ding, Xiqi Gao, and Shi Jin. A survey on mimo transmission with finite input signals: Technical challenges, advances, and future trends. Proceedings of the IEEE, 106(10):1779–1833, 2018. [68] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27(5):1564–1599, 1999.