Optimality of the Plug-in for Differential Entropy Estimation under Gaussian Convolutions

Ziv Goldfeld (MIT), Kristjan Greenewald (IBM Research), Jonathan Weed (MIT), Yury Polyanskiy (MIT)

Abstract—This paper establishes the optimality of the plug-in estimator for the problem of differential entropy estimation under Gaussian convolutions. Specifically, we consider the estimation of the differential entropy h(X + Z), where X and Z are independent d-dimensional random variables with Z ∼ N(0, σ²I_d). The distribution of X is unknown and belongs to some nonparametric class, but n independently and identically distributed samples from it are available. We first show that, despite the regularizing effect of noise, any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in d. We then analyze the absolute-error risk of the plug-in estimator and show that it converges as c^d/√n, thus attaining the parametric estimation rate. This implies the optimality of the plug-in estimator for the considered problem. We provide numerical results comparing the performance of the plug-in estimator to general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbor (kNN) techniques) applied to samples of X + Z. These results reveal a significant empirical superiority of the plug-in to state-of-the-art KDE- and kNN-based methods.

I. INTRODUCTION

Consider the problem of estimating differential entropy under Gaussian convolutions that was recently introduced in [1]. Namely, let X ∼ P be an arbitrary random variable with values in R^d and Z ∼ N(0, σ²I_d) be an independent isotropic Gaussian. Upon observing n independently and identically distributed (i.i.d.) samples X^n ≜ (X_1, ..., X_n) from P and assuming σ is known¹, we aim to estimate h(X + Z) = h(P ∗ N_σ), where N_σ is a centered isotropic Gaussian measure with parameter σ. To investigate the decision-theoretic fundamental limit, we consider the minimax absolute-error estimation risk

R⋆(n, σ, F_d) ≜ inf_{ĥ} sup_{P ∈ F_d} E |h(P ∗ N_σ) − ĥ(X^n, σ)|,

where F_d is a nonparametric class of distributions and ĥ is the estimator. The sample complexity n⋆(η, σ, F_d) is the smallest number of samples for which estimation within an additive gap η is possible. This estimation setup was originally motivated by measuring the information flow in deep neural networks [2] for testing the Information Bottleneck compression conjecture of [3].

¹The extension to unknown σ is omitted for space reasons. Note that samples from P contain no information about σ. Hence, for unknown σ, samples of both X ∼ P and Z would presumably be required. Under this alternative model, σ² can be estimated as the empirical variance of Z and then plugged into our estimator. It can be shown that this σ² estimate converges as O((nd)^{−1/2}), which does not affect our estimator's overall convergence rate.

A. Contributions

The results herein establish the optimality of the plug-in estimator for the considered problem. Defining T_σ(P) ≜ h(P ∗ N_σ) as the functional (of P) that we aim to estimate, the plug-in estimator is T_σ(P̂_{X^n}) = h(P̂_{X^n} ∗ N_σ), where P̂_{X^n} = (1/n) Σ_{i=1}^n δ_{X_i} is the empirical measure associated with the samples X^n and δ_{X_i} is the Dirac measure at X_i. Despite the suboptimality of plug-in techniques for vanilla discrete (Shannon) and differential entropy estimation (see [4] and [5], respectively), we show that h(P̂_{X^n} ∗ N_σ) attains the parametric estimation rate of O_{σ,d}(1/√n) for the considered setup when P is a subgaussian measure. This establishes the plug-in estimator as minimax rate-optimal for differential entropy estimation under Gaussian convolutions.

The derivation of this optimal convergence rate first bounds the risk by a weighted total variation (TV) distance between the original measure P ∗ N_σ and the empirical one P̂_{X^n} ∗ N_σ. This bound is derived by linking the two measures via the maximal TV coupling, and it reduces the analysis to controlling certain d-dimensional integrals. The subgaussianity of P is used to bound the integrals by a c^d/√n term, with all constants explicitly characterized. It is then shown that the exponential dependence on d is unavoidable. Specifically, we prove that any good estimator of h(P ∗ N_σ), within an additive gap η, has a sample complexity n⋆(η, σ, F_d) = Ω(2^{γ(σ)d}/(ηd)), where γ(σ) is positive and monotonically decreasing in σ. The proof relates the estimation of h(P ∗ N_σ) to estimating the discrete entropy of a distribution supported on a capacity-achieving codebook for an additive white Gaussian noise (AWGN) channel.
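For concreteness, the following minimal Python sketch illustrates the plug-in estimator: P̂_{X^n} ∗ N_σ is a uniform Gaussian mixture centered at the samples, so h(P̂_{X^n} ∗ N_σ) = −E log p̂(Y) with Y drawn from that mixture, which is approximated below by plain Monte Carlo. This is not the evaluation scheme of [1], which comes with mean-squared error guarantees (see Section V); the function name and all parameter choices are ours and purely illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.spatial.distance import cdist

def plug_in_entropy(X, sigma, num_mc=2000, rng=None):
    """Monte Carlo sketch of the plug-in estimate h(P_hat_{X^n} * N_sigma).

    P_hat_{X^n} * N_sigma is the Gaussian mixture (1/n) sum_i N(X_i, sigma^2 I_d),
    so its differential entropy equals -E[log p_hat(Y)] for Y drawn from the mixture.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)                      # shape (n, d)
    n, d = X.shape
    # draw Monte Carlo points from the mixture: pick a sample, add Gaussian noise
    Y = X[rng.integers(0, n, size=num_mc)] + sigma * rng.standard_normal((num_mc, d))
    # log-density of the mixture at each Y via a log-sum-exp over the n centers
    log_p = (logsumexp(-cdist(Y, X, "sqeuclidean") / (2 * sigma**2), axis=1)
             - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma**2))
    return -log_p.mean()

# toy usage: X uniform on the corners of [-1, 1]^5
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(1000, 5))
print(plug_in_entropy(X, sigma=1.0, rng=rng))
```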
B. Related Past Works and Comparison

General-purpose differential entropy estimators are applicable in the considered setup by accessing the noisy samples of X + Z. There are two prevailing approaches for estimating the nonsmooth differential entropy functional: the first relies on kernel density estimators (KDEs) [6], and the other uses k nearest neighbor (kNN) techniques (see [7] for a comprehensive survey). Many performance analyses of such estimators restrict attention to smooth nonparametric density classes and assume these densities are bounded away from zero. Since the density associated with P ∗ N_σ violates the boundedness-from-below assumption, any such result does not apply in our setup.

Two recent works weakened or dropped the boundedness-from-below assumption, providing general-purpose estimators whose risk bounds are valid in our setup. The first is [5], which proposed a KDE-based differential entropy estimator that also combines best polynomial approximation techniques. Assuming subgaussian densities with unbounded support, Theorem 2 of [5] bounded the estimation risk by O(n^{−s/(s+d)}),² where s is a Lipschitz smoothness parameter assumed to satisfy 0 < s ≤ 2. While the result is applicable in our setup when P is compactly supported or subgaussian, its convergence rate quickly deteriorates with dimension d and is unable to exploit the smoothness of P ∗ N_σ due to the s ≤ 2 restriction.³ A second relevant work is [8], which studied a weighted-KL estimator (in the spirit of [9], [10]) for very smooth densities. Under certain assumptions on the densities' speed of decay to zero (which captures P ∗ N_σ when, e.g., P is compactly supported), the proposed estimator was shown to attain O(1/√n) risk. Despite the estimator's efficiency, empirically it is significantly outperformed by the plug-in estimator studied herein even in rather simple scenarios (see Section V). In fact, our simulations show that the vanilla (unweighted) kNN estimator of [11], which is also inferior to the plug-in, typically performs better than the weighted version from [8]. The poor empirical performance of the latter may originate from the dependence of the associated risk on d, which was overlooked in [8].

²Multiplicative polylogarithmic factors are overlooked in this restatement.
³Such convergence rates are typical in estimating h(p) under boundedness or smoothness conditions on p. Indeed, the results cited above (applicable in our framework or otherwise), as well as many others, bound the estimation risk by terms that decay as O(n^{−α/(β+d)}), where α, β are constants that may depend on s and d.

II. PRELIMINARIES AND DEFINITIONS

Logarithms are with respect to (w.r.t.) base e, ‖x‖ is the Euclidean norm in R^d, and I_d is the d × d identity matrix. We use E_P for an expectation w.r.t. a distribution P, omitting the subscript when P is clear. For a continuous X ∼ P with probability density function (PDF) p, we interchangeably use h(X), h(P) and h(p) for its differential entropy. The n-fold product extension of P is denoted by P^⊗n. The convolution of two distributions P and Q on R^d is (P ∗ Q)(A) = ∫∫ 1_A(x + y) dP(x) dQ(y), where 1_A is the indicator of the Borel set A.

Let F_d be the set of distributions P with supp(P) ⊆ [−1, 1]^d.⁴ We also consider the class of K-subgaussian distributions F^{(SG)}_{d,K} [12]. Namely, P ∈ F^{(SG)}_{d,K}, for K > 0, if X ∼ P satisfies

E_P[exp(α^T(X − EX))] ≤ exp(0.5 K²‖α‖²), ∀α ∈ R^d. (1)

In other words, every one-dimensional projection of X is subgaussian. Clearly, there exists a K' > 0 such that F_d ⊆ F^{(SG)}_{d,K'}. We therefore state our lower bound result (Theorem 1) for F_d, while the upper bound (Theorem 2) is given for F^{(SG)}_{d,K}.

⁴One may consider any other class of compactly supported distributions.

III. EXPONENTIAL SAMPLE COMPLEXITY

As claimed next, the sample complexity of any good estimator of h(P ∗ N_σ) is exponential in d.

Theorem 1 (Exp. Sample Complexity): The following holds:
1) Fix σ > 0. There exist d_0(σ) ∈ N, η_0(σ) > 0 and γ(σ) > 0 (monotonically decreasing in σ), such that for all d ≥ d_0(σ) and η < η_0(σ), we have n⋆(η, σ, F_d) ≥ Ω(2^{γ(σ)d}/(dη)).
2) Fix d ∈ N. There exist σ_0(d), η_0(d) > 0, such that for all σ < σ_0(d) and η < η_0(d), we have n⋆(η, σ, F_d) ≥ Ω(2^d/(ηd)).

Part 1 of Theorem 1 is proven in Section VI-A. It relates the estimation of h(P ∗ N_σ) to discrete entropy estimation of a distribution supported on a capacity-achieving codebook for a peak-constrained AWGN channel. Since the codebook size is exponential in d, discrete entropy estimation over the codebook within a small gap η > 0 is impossible with less than order of 2^{γ(σ)d}/(ηd) samples [13]. The exponent γ(σ) is monotonically decreasing in σ, implying that larger σ values are favorable for estimation. Part 2 of the theorem follows by similar arguments but for a d-dimensional AWGN channel with an input distributed on the vertices of the [−1, 1]^d hypercube; the proof is omitted (see [1, Section V-B2]).

Remark 1 (Exponential Sample Complexity for Restricted Classes of Distributions): Restricting F_d by imposing smoothness or lower-boundedness assumptions on the distributions in the class would not alleviate the exponential dependence on d from Theorem 1. For instance, consider convolving any P ∈ F_d with N_{σ/√2}, i.e., replacing each P with Q = P ∗ N_{σ/√2}. These Q distributions are smooth, but if one could accurately estimate h(Q ∗ N_{σ/√2}) over the convolved class, then h(P ∗ N_σ) over F_d could have been estimated as well. Therefore, Theorem 1 applies also for the class of such smooth Q distributions.

IV. OPTIMALITY OF PLUG-IN ESTIMATOR

We next establish the minimax rate-optimality of the plug-in estimator. Given a collection of samples X^n ∼ P^⊗n, the estimator is h(P̂_{X^n} ∗ N_σ), where P̂_{X^n} = (1/n) Σ_{i=1}^n δ_{X_i}.

Theorem 2 (Plug-in Risk Bound): Fix σ > 0, d ≥ 1. Then

sup_{P ∈ F^{(SG)}_{d,K}} E_{P^⊗n} |h(P ∗ N_σ) − h(P̂_{X^n} ∗ N_σ)| ≤ C_{σ,d,K} n^{−1/2}, (2)

where C_{σ,d,K} = O_{σ,K}(c^d) for a numerical constant c.

The proof of Theorem 2 is given in Section VI-B, where an explicit expression for C_{σ,d,K} is stated in (12). The derivation exploits the maximal TV coupling to bound the left-hand side of (2) by a weighted TV distance between P ∗ N_σ and P̂_{X^n} ∗ N_σ. Exploiting the Gaussian smoothing, we control this TV distance by a c^d/√n term, as desired.

Remark 2 (Minimax Rate-Optimality): A convergence rate faster than 1/√n cannot be attained for parameter estimation under the absolute-error loss. This follows from, e.g., Proposition 1 of [14], which establishes this rate as a lower bound for the parametric estimation problem. Combined with Theorem 2, this establishes the plug-in estimator as minimax rate-optimal.
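To make the 1/√n rate of Theorem 2 concrete, the toy simulation below (our own illustration; the two-atom P, σ, and sample sizes are arbitrary choices) works in d = 1, where both h(P ∗ N_σ) and the plug-in estimate are entropies of two-component Gaussian mixtures and can be evaluated by numerical integration. The printed mean absolute error should roughly halve each time n is quadrupled.

```python
import numpy as np

sigma = 0.5
grid = np.linspace(-8.0, 8.0, 20001)
dy = grid[1] - grid[0]

def mixture_entropy(centers, weights):
    """Differential entropy of sum_i w_i N(c_i, sigma^2) via 1-D numerical integration."""
    c = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    p = (weights[:, None] * c *
         np.exp(-(grid[None, :] - centers[:, None])**2 / (2 * sigma**2))).sum(axis=0)
    p = np.clip(p, 1e-300, None)                 # avoid log(0) far in the tails
    return float(-(p * np.log(p)).sum() * dy)

# P puts mass 0.2 on -1 and 0.8 on +1, so P * N_sigma is a two-component mixture.
h_true = mixture_entropy(np.array([-1.0, 1.0]), np.array([0.2, 0.8]))

rng = np.random.default_rng(0)
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(50):
        x = rng.choice([-1.0, 1.0], size=n, p=[0.2, 0.8])
        w_plus = (x > 0).mean()                  # the empirical measure keeps the same two atoms
        h_plug = mixture_entropy(np.array([-1.0, 1.0]), np.array([1 - w_plus, w_plus]))
        errs.append(abs(h_plug - h_true))
    print(n, float(np.mean(errs)))               # error decays roughly like 1/sqrt(n)
```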

V. EXPERIMENTS

We present empirical results illustrating the convergence of the plug-in estimator compared to several competing methods: (i) the KDE-based estimator of [6]; (ii) the kNN Kozachenko-Leonenko (KL) estimator [15]; and (iii) the recently developed weighted-KL (wKL) estimator from [8].

P with Bounded Support: Convergence rates in the bounded support regime are illustrated first. We set P as a mixture of Gaussians truncated to have support in [−1, 1]^d. Before truncation, the mixture consists of 2^d Gaussian components with means at the 2^d corners of [−1, 1]^d. Fig. 1 shows estimation results as a function of n, for d = 5, 10 and σ = 0.1, 0.2. The kernel width for the KDE estimate was chosen via cross-validation, varying with both d and n; the KL, wKL and plug-in estimators require no tuning parameters. We stress that the KDE estimate is highly unstable and, while not shown here, the estimated value is very sensitive to the chosen kernel width. The KDE, KL and wKL estimators converge slowly, at a rate that degrades with increased d, underperforming the plug-in estimator. Finally, we note that, in accordance with the explicit risk bound from (12), the absolute error increases with larger d and smaller σ.

Fig. 1: Estimation results comparing the plug-in estimator to: (i) a KDE-based method [6]; (ii) the KL estimator [15]; and (iii) a weighted-KL estimator [8]. Here P is a truncated d-dimensional mixture of 2^d Gaussians and Z ∼ N(0, σ²I_d). Error bars are one standard deviation over 20 random trials.

P with Unbounded Support: In Fig. 2, we show the convergence rates in the unbounded support regime by considering the same setting with d = 15 but without truncating the 2^d-mode Gaussian mixture. The fast convergence of the plug-in estimator is preserved, outperforming the competing methods.

Fig. 2: Estimation results in the unbounded support regime, where P is a d-dimensional mixture of 2^d Gaussians. Error bars are one standard deviation over 20 random trials.
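The snippet below is a scaled-down sketch of the bounded-support experiment (our choices: d = 4, pre-truncation component standard deviation 0.25, and a basic 1-nearest-neighbor Kozachenko-Leonenko estimate in place of the exact implementations of [6], [8], [15] used for Figs. 1-2). It contrasts the plug-in estimate, computed from the clean samples of X, with a general-purpose kNN estimate computed from the noisy samples of X + Z.

```python
import numpy as np
from scipy.special import gammaln, digamma, logsumexp
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
d, n, sigma = 4, 2000, 0.2
corners = np.array(np.meshgrid(*([[-1.0, 1.0]] * d))).reshape(d, -1).T   # the 2^d means

def sample_truncated_mixture(n):
    """Rejection-sample the 2^d-component Gaussian mixture truncated to [-1, 1]^d."""
    out = np.empty((0, d))
    while out.shape[0] < n:
        c = corners[rng.integers(0, corners.shape[0], size=n)]
        x = c + 0.25 * rng.standard_normal((n, d))
        out = np.vstack([out, x[np.all(np.abs(x) <= 1.0, axis=1)]])
    return out[:n]

def plug_in_entropy(X, sigma, num_mc=2000):
    """Monte Carlo sketch of h(P_hat_{X^n} * N_sigma), as in the earlier listing."""
    Y = X[rng.integers(0, X.shape[0], size=num_mc)] + sigma * rng.standard_normal((num_mc, d))
    log_p = (logsumexp(-cdist(Y, X, "sqeuclidean") / (2 * sigma**2), axis=1)
             - np.log(X.shape[0]) - 0.5 * d * np.log(2 * np.pi * sigma**2))
    return -log_p.mean()

def kl_1nn_entropy(Y):
    """Basic Kozachenko-Leonenko estimate (k = 1) from samples of X + Z."""
    m, dim = Y.shape
    r = cKDTree(Y).query(Y, k=2)[0][:, 1]                 # distance to the nearest neighbor
    log_vd = 0.5 * dim * np.log(np.pi) - gammaln(dim / 2 + 1)
    return digamma(m) - digamma(1) + log_vd + dim * np.mean(np.log(r))

X = sample_truncated_mixture(n)
Z = sigma * rng.standard_normal((n, d))
print("plug-in (from X)    :", plug_in_entropy(X, sigma))
print("KL 1-NN (from X + Z):", kl_1nn_entropy(X + Z))
```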
Reed-Muller Codes for AWGN Channels: We next consider data transmission over an AWGN channel using binary phase-shift keying (BPSK) modulation of a Reed-Muller code. A Reed-Muller code RM(r, m) of parameters r, m ∈ N, where 0 ≤ r ≤ m, encodes messages of length k = Σ_{i=0}^{r} (m choose i) into binary codewords of length 2^m. Let C_{RM(r,m)} be the set of BPSK-modulated sequences corresponding to RM(r, m) (with 0 and 1 mapped to −1 and 1, respectively). The number of bits reliably transmittable over the 2^m-dimensional AWGN channel with noise variance σ² is I(X; X + Z) = h(X + Z) − (d/2) log(2πeσ²), where X ∼ Unif(C_{RM(r,m)}) and Z are independent. Despite I(X; X + Z) being a well-behaved function of σ,⁵ an exact computation of this quantity is infeasible.

⁵Evaluating the plug-in estimator requires computing a d-dimensional integral, which has no closed-form solution. Nonetheless, in [1] we propose an efficient Monte Carlo integration method to perform this computation. The method's accuracy is ensured via mean-squared error bounds [1, Theorem 5], and the computational complexity is shown to be on average O(n log n).

Our estimator readily estimates I(X; X + Z) from samples of X. Results for the Reed-Muller codes RM(4, 4) and RM(5, 5) (containing 2^16 and 2^32 codewords, respectively) are shown in Fig. 3 for various values of σ and n. Fig. 3(a) shows our estimate of I(X; X + Z) for an RM(4, 4) code as a function of σ, for different values of n. As expected, the plug-in estimator converges faster when σ is larger. Fig. 3(b) shows the estimated I(X; X + Z) for X ∼ Unif(C_{RM(5,5)}) and σ = 2, with the KDE and KL estimates based on samples of (X + Z) shown for comparison. Our method significantly outperforms the general-purpose estimators. The wKL estimator is omitted due to its instability in this high-dimensional (d = 32) setting.

Fig. 3: Estimating I(X; X + Z), where X comes from a BPSK-modulated Reed-Muller code and Z ∼ N(0, σ²I_d): (a) Estimated I(X; X + Z) as a function of σ, for different n values, for the RM(4, 4) code. (b) Plug-in, KDE and KL estimates of I(X; X + Z) for the RM(5, 5) code and σ = 2 as a function of n.

Remark 3: When supp(P) lies inside a ball of radius √d, the subgaussian constant K is proportional to d, and the bound from (2) scales as d^d/√n. This scenario corresponds to the popular setup of an AWGN channel with an input constraint.
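A minimal version of this pipeline is sketched below for the much smaller code RM(1, 3) (d = 8, 16 codewords) so that it runs in seconds; the generator-matrix construction, σ, and sample sizes are our own illustrative choices, and the plain Monte Carlo evaluation stands in for the guaranteed scheme of [1, Theorem 5].

```python
import numpy as np
from itertools import combinations
from scipy.special import logsumexp
from scipy.spatial.distance import cdist

def rm_bpsk_codebook(r, m):
    """All BPSK-modulated codewords of the Reed-Muller code RM(r, m)."""
    pts = np.array(np.meshgrid(*([[0, 1]] * m))).reshape(m, -1).T        # points of F_2^m
    rows = [np.ones(2 ** m, dtype=int)]                                  # degree-0 monomial
    for deg in range(1, r + 1):
        for S in combinations(range(m), deg):
            rows.append(np.prod(pts[:, S], axis=1))                      # monomial evaluations
    G = np.array(rows)                                                   # k x 2^m generator matrix
    msgs = np.array(np.meshgrid(*([[0, 1]] * G.shape[0]))).reshape(G.shape[0], -1).T
    return 2.0 * (msgs @ G % 2) - 1.0                                    # map 0/1 -> -1/+1

rng = np.random.default_rng(0)
sigma, n, num_mc = 1.0, 2000, 2000
C = rm_bpsk_codebook(1, 3)                          # 16 codewords of length d = 8
d = C.shape[1]
X = C[rng.integers(0, C.shape[0], size=n)]          # i.i.d. samples of X ~ Unif(codebook)

# plug-in estimate of h(X + Z) from the samples of X alone, then I = h - (d/2) log(2*pi*e*sigma^2)
Y = X[rng.integers(0, n, size=num_mc)] + sigma * rng.standard_normal((num_mc, d))
log_p = (logsumexp(-cdist(Y, X, "sqeuclidean") / (2 * sigma**2), axis=1)
         - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma**2))
mi_est = -log_p.mean() - 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)
print(f"estimated I(X; X+Z) = {mi_est:.3f} nats (cannot exceed log 16 = {np.log(16):.3f})")
```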

VI. PROOFS

A. Proof of Theorem 1

Let Y = A + N be an AWGN channel with input peak constraint A ∈ [−1, 1] almost surely, and noise N ∼ N(0, σ²). The capacity C_AWGN(σ) = max_{A∼P: supp(P)⊆[−1,1]} I(A; Y) is positive for any σ < ∞. This positivity implies [16] that for any ε ∈ (0, C_AWGN(σ)) and large enough d, there exists a codebook C_d ⊂ [−1, 1]^d of size |C_d| ≐ e^{d(C_AWGN(σ)−ε)} and a decoder ψ_d : R^d → [−1, 1]^d, such that

P(ψ_d(Y^d) = c | A^d = c) ≥ 1 − e^{−d/2}, ∀c ∈ C_d, (3)

where A^d ≜ (A_1, A_2, ..., A_d) and Y^d ≜ (Y_1, Y_2, ..., Y_d) are the channel input and output sequences, respectively.⁶

⁶a_k ≐ b_k denotes equality in the exponential scale: lim_{k→∞} (1/k) log(a_k/b_k) = 0.

From (3) it follows that if A^d ∼ P, for any P with supp(P) = C_d, and we set Â^d ≜ ψ_d(Y^d), then

P(Â^d ≠ A^d) = Σ_{c∈C_d} P(c) P(ψ_d(c + N^d) ≠ c | A^d = c) ≤ e^{−d/2}.

Invoking Fano's inequality, we further obtain

H(A^d | Â^d) ≤ H_b(e^{−d/2}) + e^{−d/2} log |C_d| ≜ δ^{(1)}_{σ,d}, (4)

where H_b(α) = −α log α − (1 − α) log(1 − α), for α ∈ [0, 1], is the binary entropy function. Although not explicit in our notation, the dependence of δ^{(1)}_{σ,d} on σ is through ε. Note that lim_{d→∞} δ^{(1)}_{σ,d} = 0, for all σ > 0, because log |C_d| grows only linearly with d and lim_{q→0} H_b(q) = 0. This further gives

I(A^d; Y^d) ≥^{(a)} H(A^d) − H(A^d | Â^d) ≥^{(b)} H(A^d) − δ^{(1)}_{σ,d},

where (a) is since H(A|B) ≤ H(A | f(B)) for any random variables (A, B) and deterministic function f, while (b) uses (4). Since we also have I(A^d; Y^d) ≤ H(A^d), it follows that

H(A^d) − I(A^d; Y^d) ≤ δ^{(1)}_{σ,d}, (5)

which means that a good estimator of the mutual information I(A^d; Y^d) over the class {P : supp(P) = C_d} (⊆ F_d) is also a good estimator of the discrete entropy H(A^d). Using the well-known lower bound on discrete entropy estimation sample complexity (see, e.g., [17, Corollary 10]), we have that estimating H(A^d) within a sufficiently small additive gap η > 0 requires at least Ω(|C_d|/(η log |C_d|)) = Ω(2^{γ(σ)d}/(ηd)) samples, where γ(σ) ≜ C_AWGN(σ) − ε > 0.

We relate the above back to the estimation of h(X + Z) by noting that I(A^d; Y^d) = h(A^d + N^d) − (d/2) log(2πeσ²). Letting X ∼ P and noting that Z =_D N^d, where =_D denotes equality in distribution, we have h(A^d + N^d) = h(X + Z). Assuming in contradiction that there exists an estimator of h(X + Z) that uses o(2^{γ(σ)d}/(ηd)) samples and achieves an additive gap η > 0 over {P : supp(P) = C_d}, implies that H(A^d) can be estimated from these samples within gap η + δ^{(1)}_{σ,d}. This follows from (5) by taking the estimator of h(X + Z) and subtracting the constant (d/2) log(2πeσ²). We arrive at a contradiction.
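As a quick numerical sanity check (ours; the capacity value below is an arbitrary stand-in for C_AWGN(σ) − ε, and the error-probability term is the one appearing in (3) as reconstructed above), the Fano term δ^{(1)}_{σ,d} of (4) indeed vanishes as d grows, since log |C_d| is only linear in d.

```python
import numpy as np

def hb(p):
    """Binary entropy function (in nats)."""
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

cap_minus_eps = 0.5                           # stand-in for C_AWGN(sigma) - eps, in nats
for d in [10, 20, 40, 80, 160]:
    pe = np.exp(-d / 2)                       # decoding-error bound from (3)
    delta = hb(pe) + pe * d * cap_minus_eps   # (4) with log|C_d| = d (C_AWGN(sigma) - eps)
    print(d, float(delta))                    # tends to 0 as d grows
```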

B. Proof of Theorem 2

We start with two technical lemmata used for the proof.

Lemma 1: Let U ∼ P_U and V ∼ P_V be continuous random variables with densities p_U and p_V, respectively. If h(U), h(V) < ∞, then

|h(U) − h(V)| ≤ max{ E log (p_V(V)/p_V(U)), E log (p_U(U)/p_U(V)) }.

Proof: Recall the identity

h(U) − h(V) ≤ h(U) − h(V) + D(P_U ‖ P_V) = E log (p_V(V)/p_V(U)),

where the inequality uses D(P_U ‖ P_V) ≥ 0. Reversing the roles of U and V completes the proof.

Lemma 2: Let U ∼ P_U and V ∼ P_V be continuous random variables with PDFs p_U and p_V, respectively. For any measurable function g : R^d → R,

|E g(U) − E g(V)| ≤ ∫ |g(z)| · |p_U(z) − p_V(z)| dz.

Proof: We couple P_U and P_V via the maximal TV coupling⁷

π ≜ (Id, Id)_♯ (P_U ∧ P_V) + (1/α) (P_U − P_V)_+ ⊗ (P_U − P_V)_−, (6)

where (P_U − P_V)_+ and (P_U − P_V)_− are the positive and negative parts of the signed measure (P_U − P_V); P_U ∧ P_V ≜ P_U − (P_U − P_V)_+; (Id, Id)_♯ (P_U ∧ P_V) is the push-forward measure of P_U ∧ P_V by the map (Id, Id); ⊗ denotes a product measure; and α ≜ (1/2) ∫ |p_U(x) − p_V(x)| dx satisfies ∫ d(P_U − P_V)_+ = ∫ d(P_U − P_V)_− = α.

⁷The maximal coupling attains maximal probability for the event {U = V}.

Jensen's inequality implies |E g(U) − E g(V)| ≤ E_π |g(U) − g(V)|, and hence

E_π |g(U) − g(V)|
≤ (1/α) ∫∫ (|g(u)| + |g(v)|) (p_U(u) − p_V(u))_+ (p_U(v) − p_V(v))_− du dv
= ∫ |g(u)| (p_U(u) − p_V(u))_+ du + ∫ |g(v)| (p_U(v) − p_V(v))_− dv
= ∫ |g(z)| [ (p_U(z) − p_V(z))_+ + (p_U(z) − p_V(z))_− ] dz
= ∫ |g(z)| · |p_U(z) − p_V(z)| dz.
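The finite sanity check below (our construction, on a six-point alphabet so that every quantity is an exact sum) builds the maximal TV coupling (6) for two discrete distributions and verifies the two inequalities used in the proof of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
pU = rng.dirichlet(np.ones(m))
pV = rng.dirichlet(np.ones(m))
g = rng.normal(size=m)                                # an arbitrary test function g

diff = pU - pV
pos, neg = np.clip(diff, 0, None), np.clip(-diff, 0, None)
alpha = pos.sum()                                     # mass of (pU - pV)_+, i.e. the TV distance
# maximal TV coupling (6): keep the common mass on the diagonal, spread the rest as a product
pi = np.diag(np.minimum(pU, pV)) + np.outer(pos, neg) / alpha

assert np.allclose(pi.sum(axis=1), pU) and np.allclose(pi.sum(axis=0), pV)
lhs = abs(g @ pU - g @ pV)                            # |E g(U) - E g(V)|
mid = np.sum(pi * np.abs(g[:, None] - g[None, :]))    # E_pi |g(U) - g(V)|
rhs = np.sum(np.abs(g) * np.abs(diff))                # sum_z |g(z)| |p_U(z) - p_V(z)|
print(lhs <= mid + 1e-12, mid <= rhs + 1e-12, (lhs, mid, rhs))
```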

Fix any P ∈ F^{(SG)}_{d,K} and assume that E_P X = 0. This assumption comes with no loss of generality since both the target functional h(P ∗ N_σ) and the plug-in estimator are translation invariant. Note that h(P ∗ N_σ), h(P̂_{X^n} ∗ N_σ) < ∞. Combining Lemmas 1 and 2, we a.s. have

|h(P ∗ N_σ) − h(P̂_{X^n} ∗ N_σ)|
≤ max{ ∫ |log r̃_{X^n}(z)| · |q(z) − r_{X^n}(z)| dz , ∫ |log q̃(z)| · |q(z) − r_{X^n}(z)| dz }, (7)

where q and r_{X^n}, respectively, denote the PDFs of P ∗ N_σ and P̂_{X^n} ∗ N_σ, and we set q̃ ≜ q/c₁ and r̃_{X^n} ≜ r_{X^n}/c₁, for c₁ = (2πσ²)^{−d/2}.

To control the above integrals we require one last lemma.

Lemma 3: Let X ∼ P. For all z ∈ R^d it holds that

E_{P^⊗n} (log r̃_{X^n}(z))² ≤ (1/(4σ⁴)) E_P ‖z − X‖⁴, (8a)
(log q̃(z))² ≤ (1/(4σ⁴)) E_P ‖z − X‖⁴. (8b)

Proof: We prove (8a); the proof of (8b) is similar. The map x ↦ (log x)² is convex on (0, 1]. For any fixed x^n, let X̂ ∼ P̂_{x^n}. Jensen's inequality gives

(log r̃_{x^n}(z))² = ( log E_{P̂_{x^n}} exp(−‖z − X̂‖²/(2σ²)) )² ≤ E_{P̂_{x^n}} ‖z − X̂‖⁴/(4σ⁴).

Taking an outer expectation w.r.t. X^n ∼ P^⊗n yields

E_{P^⊗n} (log r̃_{X^n}(z))² ≤ E_{P^⊗n} E_{P̂_{X^n}} ‖z − X̂‖⁴/(4σ⁴) = E_P ‖z − X‖⁴/(4σ⁴).
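The inequalities of Lemma 3 are pointwise consequences of Jensen's inequality, which the small check below (our construction: a five-atom P in d = 3, with arbitrary parameters) confirms for (8b) at randomly drawn points z.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.7
atoms = rng.uniform(-1.0, 1.0, size=(5, d))           # support of a 5-atom P
weights = rng.dirichlet(np.ones(5))

for _ in range(1000):
    z = rng.normal(scale=2.0, size=d)
    sq = ((z - atoms) ** 2).sum(axis=1)               # ||z - x_i||^2 for each atom
    q_tilde = weights @ np.exp(-sq / (2 * sigma**2))  # q(z) / c_1 for this discrete P
    lhs = np.log(q_tilde) ** 2
    rhs = weights @ (sq ** 2) / (4 * sigma**4)        # E_P ||z - X||^4 / (4 sigma^4)
    assert lhs <= rhs + 1e-12
print("(8b) held at all 1000 sampled points z")
```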

Following (7), we bound E ∫ |log r̃_{X^n}(z)| · |q(z) − r_{X^n}(z)| dz; the bound for the other integral is identical and thus omitted. Let f_a : R^d → R be the PDF of N(0, (1/(2a)) I_d), for a > 0 specified later. The Cauchy-Schwarz inequality implies

( E_{P^⊗n} ∫ |log r̃_{X^n}(z)| · |q(z) − r_{X^n}(z)| dz )²
≤ ∫ E_{P^⊗n} (log r̃_{X^n}(z))² f_a(z) dz · ∫ E_{P^⊗n} |q(z) − r_{X^n}(z)|² / f_a(z) dz. (9)

By virtue of Lemma 3, we bound the first integral as

∫ E_{P^⊗n} (log r̃_{X^n}(z))² f_a(z) dz
≤ ∫ ( E ‖z − X‖⁴ / (4σ⁴) ) · exp(−a‖z‖²) / √(π^d a^{−d}) dz
≤^{(a)} ∫ ( 2E‖X‖⁴/σ⁴ + 2‖z‖⁴/σ⁴ ) · exp(−a‖z‖²) / √(π^d a^{−d}) dz
≤^{(b)} 32K⁴d²/σ⁴ + d(d + 2)/(2σ⁴a²),

where (a) follows from the triangle inequality, and (b) uses the K-subgaussianity of X [18, Lemma 5.5].

For the second integral, note that r_{X^n}(z) is an average of n i.i.d. terms, each with expectation q(z). This implies E_{P^⊗n} |q(z) − r_{X^n}(z)|² ≤ (1/n) E[ c₁² e^{−‖z−X‖²/σ²} ] and further gives

∫ E_{P^⊗n} |q(z) − r_{X^n}(z)|² / f_a(z) dz ≤ (c₁/(n 2^{d/2})) E[ 1/f_a(X + Z/√2) ], (10)

where Z ∼ N(0, σ²I_d) and X ∼ P are independent. Setting c₂ ≜ (π/a)^{d/2}, we have f_a(z)^{−1} = c₂ exp(a‖z‖²). Since X is K-subgaussian and Z is σ-subgaussian, X + Z/√2 is (K + σ/√2)-subgaussian. Following (10), for any 0 < a < 1/(2(K + σ/√2)²), we have

(c₁/(n 2^{d/2})) E[ 1/f_a(X + Z/√2) ] = (c₁c₂/(n 2^{d/2})) E exp( a ‖X + Z/√2‖² )
≤^{(a)} (c₁c₂/(n 2^{d/2})) exp( (K + σ/√2)² a d + (K + σ/√2)⁴ a² d / (1 − 2(K + σ/√2)² a) ), (11)

where (a) is by [12, Remark 2.3].

Setting a = 1/(4(K + σ/√2)²), we combine (7) and (9)-(11) to obtain the result (recalling that the second integral from (7) is bounded exactly as the first and using E max{|X|, |Y|} ≤ E|X| + E|Y|). For any P ∈ F^{(SG)}_{d,K} we have

E_{P^⊗n} |h(P ∗ N_σ) − h(P̂_{X^n} ∗ N_σ)|
≤ ( 64√2 ( d²K⁴ + d(d + 2)(K + σ/√2)⁴ ) / σ⁴ ) · ( K/σ + 1/√2 )^d · e^{3d/8} · (1/√n). (12)

ACKNOWLEDGEMENTS

We thank Yihong Wu for fruitful discussions on this topic. This work was partially supported by the MIT-IBM Watson AI Lab. The work of Z. Goldfeld and Y. Polyanskiy was also supported in part by the National Science Foundation CAREER award under grant agreement CCF-12-53205, by the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-09-39370, and a grant from the Skoltech-MIT Joint Next Generation Program (NGP). The work of J. Weed was supported in part by the Josephine de Kármán fellowship.

REFERENCES

[1] Z. Goldfeld, K. Greenewald, and Y. Polyanskiy. Estimating differential entropy under Gaussian convolutions. arXiv preprint arXiv:1810.11589, 2018.
[2] Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in neural networks. arXiv preprint arXiv:1810.05728, Nov. 2018.
[3] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, Mar. 2017.
[4] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1254, 2004.
[5] Y. Han, J. Jiao, T. Weissman, and Y. Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141, 2017.
[6] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, pages 397–405, 2015.
[7] G. Biau and L. Devroye. Lectures on the Nearest Neighbor Method. Springer, 2015.
[8] T. B. Berrett, R. J. Samworth, and M. Yuan. Efficient multivariate entropy estimation via k-nearest neighbour distances. Annals of Statistics, 47(1):288–318, 2019.
[9] K. Sricharan, D. Wei, and A. Hero. Ensemble estimators for multivariate entropy estimation. IEEE Trans. Inf. Theory, 59(7):4374–4388, 2013.
[10] K. Moon, K. Sricharan, K. Greenewald, and A. Hero. Ensemble estimation of information divergence. Entropy, 20(8), 2018.
[11] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69(6):066138, June 2004.
[12] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17, 2012.
[13] Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory, 62(6):3702–3720, Jun. 2016.
[14] J. Chen. A general lower bound of minimax risk for absolute-error loss. Canadian Journal of Statistics, 25(4):545–558, Dec. 1997.
[15] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
[16] R. Gallager. Information Theory and Reliable Communication, volume 2. Springer, 1968.
[17] G. Valiant and P. Valiant. A CLT and tight lower bounds for estimating entropy. In ECCC, volume 17, page 9, Nov. 2010.
[18] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, Nov. 2010.