Optimality of the Plug-In Estimator for Differential Entropy Estimation Under Gaussian Convolutions
Ziv Goldfeld (MIT), Kristjan Greenewald (IBM Research), Jonathan Weed (MIT), Yury Polyanskiy (MIT)
[email protected]  [email protected]  [email protected]  [email protected]

This work was partially supported by the MIT-IBM Watson AI Lab. The work of Z. Goldfeld and Y. Polyanskiy was also supported in part by the National Science Foundation CAREER award under grant agreement CCF-12-53205, by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-09-39370, and by a grant from the Skoltech–MIT Joint Next Generation Program (NGP). The work of J. Weed was supported in part by the Josephine de Kármán fellowship.

Abstract—This paper establishes the optimality of the plug-in estimator for the problem of differential entropy estimation under Gaussian convolutions. Specifically, we consider the estimation of the differential entropy h(X + Z), where X and Z are independent d-dimensional random variables with Z ∼ N(0, σ²I_d). The distribution of X is unknown and belongs to some nonparametric class, but n independently and identically distributed samples from it are available. We first show that, despite the regularizing effect of noise, any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in d. We then analyze the absolute-error risk of the plug-in estimator and show that it converges as √(c^d/n), thus attaining the parametric estimation rate. This implies the optimality of the plug-in estimator for the considered problem. We provide numerical results comparing the performance of the plug-in estimator to general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbors (kNN) techniques) applied to samples of X + Z. These results reveal a significant empirical superiority of the plug-in estimator over state-of-the-art KDE- and kNN-based methods.

I. INTRODUCTION

Consider the problem of estimating differential entropy under Gaussian convolutions, recently introduced in [1]. Namely, let X ∼ P be an arbitrary random variable with values in R^d and let Z ∼ N(0, σ²I_d) be an independent isotropic Gaussian. Upon observing n independently and identically distributed (i.i.d.) samples X^n := (X_1, ..., X_n) from P, and assuming σ is known¹, we aim to estimate h(X + Z) = h(P ∗ N_σ), where N_σ is the centered isotropic Gaussian measure with parameter σ.

¹The extension to unknown σ is omitted for space reasons. Note that samples from P contain no information about σ; hence, for unknown σ, samples of both X ∼ P and Z would presumably be required. Under this alternative model, σ² can be estimated by the empirical variance of Z and then plugged into our estimator. It can be shown that this estimate of σ² converges as O((nd)^{−1/2}), which does not affect our estimator's overall convergence rate.

To investigate the decision-theoretic fundamental limit, we consider the minimax absolute-error estimation risk

$$ R^\star(n, \sigma, \mathcal{F}_d) \;\triangleq\; \inf_{\hat h}\, \sup_{P \in \mathcal{F}_d} \mathbb{E}\,\bigl| h(P \ast \mathcal{N}_\sigma) - \hat h(X^n, \sigma) \bigr|, $$

where F_d is a nonparametric class of distributions and ĥ is the estimator. The sample complexity n*(η, σ, F_d) is the smallest number of samples for which estimation within an additive gap η is possible. This estimation setup was originally motivated by measuring the information flow in deep neural networks [2] in order to test the Information Bottleneck compression conjecture of [3].

A. Contributions

The results herein establish the optimality of the plug-in estimator for the considered problem. Defining T_σ(P) := h(P ∗ N_σ) as the functional (of P) that we aim to estimate, the plug-in estimator is T_σ(P̂_{X^n}) = h(P̂_{X^n} ∗ N_σ), where P̂_{X^n} := (1/n) Σ_{i=1}^n δ_{X_i} is the empirical measure associated with the samples X^n and δ_{X_i} is the Dirac measure at X_i. Despite the suboptimality of plug-in techniques for vanilla discrete (Shannon) and differential entropy estimation (see [4] and [5], respectively), we show that h(P̂_{X^n} ∗ N_σ) attains the parametric estimation rate O_{σ,d}(1/√n) in the considered setup whenever P is a subgaussian measure. This establishes the plug-in estimator as minimax rate-optimal for differential entropy estimation under Gaussian convolutions.
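Since P̂_{X^n} ∗ N_σ is simply a uniform mixture of n Gaussians centered at the samples, evaluating the plug-in estimator amounts to computing the differential entropy of that mixture. The paper does not prescribe a particular numerical method for this step; the sketch below (function and parameter names are ours) uses plain Monte Carlo integration and is meant only to illustrate the estimator's structure.

```python
import numpy as np
from scipy.special import logsumexp

def plug_in_entropy(x, sigma, n_mc=20_000, batch=2_000, seed=0):
    """Monte Carlo approximation of h(P_hat * N_sigma): the differential
    entropy (in nats) of the uniform mixture of Gaussians N(x_i, sigma^2 I_d)
    centered at the observed samples x_1, ..., x_n (the rows of x)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    log_const = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    total = 0.0
    for start in range(0, n_mc, batch):
        m = min(batch, n_mc - start)
        # Draw Y ~ P_hat * N_sigma: pick a sample uniformly at random, add Gaussian noise.
        y = x[rng.integers(n, size=m)] + sigma * rng.standard_normal((m, d))
        # log q(y) = logsumexp_i[-||y - x_i||^2 / (2 sigma^2)] - log n - (d/2) log(2 pi sigma^2)
        sq = ((y[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)  # shape (m, n)
        total += (logsumexp(-sq / (2.0 * sigma ** 2), axis=1) - log_const).sum()
    return -total / n_mc  # h(q) = -E[log q(Y)]

# Example: i.i.d. samples from a standard normal (subgaussian) P in d = 2 dimensions.
# For large n the output should approach h(X + Z) = (d/2) * log(2*pi*e*(1 + sigma**2)).
x = np.random.default_rng(1).standard_normal((1000, 2))
print(plug_in_entropy(x, sigma=1.0))
```

The Monte Carlo step only controls the numerical error in evaluating the mixture entropy; the statistical guarantees discussed below concern the quantity h(P̂_{X^n} ∗ N_σ) itself.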
The derivation of this optimal convergence rate first bounds the risk by a weighted total variation (TV) distance between the original measure P ∗ N_σ and the empirical one P̂_{X^n} ∗ N_σ. This bound is derived by linking the two measures via the maximal TV coupling, and it reduces the analysis to controlling certain d-dimensional integrals. The subgaussianity of P is then used to bound these integrals by a √(c^d/n) term, with all constants explicitly characterized.

It is then shown that the exponential dependence on d is unavoidable. Specifically, we prove that any good estimator of h(P ∗ N_σ), within an additive gap η, has sample complexity n*(η, σ, F_d) = Ω(2^{γ(σ)d}/(ηd)), where γ(σ) is positive and monotonically decreasing in σ. The proof relates the estimation of h(P ∗ N_σ) to estimating the discrete entropy of a distribution supported on a capacity-achieving codebook for an additive white Gaussian noise (AWGN) channel.

B. Related Past Works and Comparison

General-purpose differential entropy estimators are applicable in the considered setup by accessing the noisy samples of X + Z. There are two prevailing approaches for estimating the nonsmooth differential entropy functional: the first relies on kernel density estimators (KDEs) [6], and the other uses k nearest neighbor (kNN) techniques (see [7] for a comprehensive survey). Many performance analyses of such estimators restrict attention to smooth nonparametric density classes and assume the densities are bounded away from zero. Since the density associated with P ∗ N_σ violates the boundedness-from-below assumption, such results do not apply in our setup.

Two recent works weakened or dropped the boundedness-from-below assumption, providing general-purpose estimators whose risk bounds are valid in our setup. The first is [5], which proposed a KDE-based differential entropy estimator that also incorporates best polynomial approximation techniques. Assuming subgaussian densities with unbounded support, Theorem 2 of [5] bounds the estimation risk by O(n^{−s/(s+d)}), where s is a Lipschitz smoothness parameter assumed to satisfy 0 < s ≤ 2. While this result applies to our setup when P is compactly supported or subgaussian, its convergence rate deteriorates quickly with the dimension d and is unable to exploit the smoothness of P ∗ N_σ because of the restriction s ≤ 2.

A second relevant work is [8], which studied a weighted-KL estimator (in the spirit of [9], [10]) for very smooth densities. Under certain assumptions on the densities' speed of decay to zero (which capture P ∗ N_σ when, e.g., P is compactly supported), the proposed estimator was shown to attain O(1/√n) risk. Despite this efficiency, the estimator is significantly outperformed empirically by the plug-in estimator studied herein, even in rather simple scenarios (see Section V). In fact, our simulations show that the vanilla (unweighted) kNN estimator of [11], which is also inferior to the plug-in, typically performs better than the weighted version from [8]. The poor empirical performance of the latter may originate from the dependence of the associated risk on d, which was overlooked in [8].
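For comparison, a general-purpose kNN estimator of the kind cited above is applied directly to the noisy observations Y_i = X_i + Z_i. The sketch below is a standard Kozachenko–Leonenko-style construction written for illustration; it is not claimed to match the exact variants analyzed in [11] or the weighted version of [8], and the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(y, k=1):
    """Kozachenko-Leonenko-style k-NN differential entropy estimate (in nats)
    from samples y of shape (n, d); here y would hold the noisy observations
    y_i = x_i + z_i, which is all a general-purpose estimator has access to."""
    n, d = y.shape
    # Distance from each point to its k-th nearest neighbor (index 0 is the point itself).
    r = cKDTree(y).query(y, k=k + 1)[0][:, -1]
    log_unit_ball_volume = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
    return digamma(n) - digamma(k) + log_unit_ball_volume + d * np.mean(np.log(r))
```

Estimators of this type treat the noisy samples as coming from a generic density, whereas the plug-in estimator explicitly exploits the known noise parameter σ.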
II. PRELIMINARIES AND DEFINITIONS

Logarithms are taken with respect to (w.r.t.) base e, ‖x‖ is the Euclidean norm in R^d, and I_d is the d × d identity matrix.

III. EXPONENTIAL SAMPLE COMPLEXITY

As claimed next, the sample complexity of any good estimator of h(P ∗ N_σ) is exponential in d.

Theorem 1 (Exp. Sample Complexity): The following holds:
1) Fix σ > 0. There exist d_0(σ) ∈ N, η_0(σ) > 0 and γ(σ) > 0 (monotonically decreasing in σ), such that for all d ≥ d_0(σ) and η < η_0(σ), we have n*(η, σ, F_d) ≥ Ω(2^{γ(σ)d}/(dη)).
2) Fix d ∈ N. There exist σ_0(d), η_0(d) > 0, such that for all σ < σ_0(d) and η < η_0(d), we have n*(η, σ, F_d) ≥ Ω(2^d/(dη)).

Part 1 of Theorem 1 is proven in Section VI-A. It relates the estimation of h(P ∗ N_σ) to discrete entropy estimation of a distribution supported on a capacity-achieving codebook for a peak-constrained AWGN channel (a simplified, heuristic version of this reduction is sketched at the end of the section, after Remark 1). Since the codebook size is exponential in d, discrete entropy estimation over the codebook within a small gap η > 0 is impossible with fewer than order 2^{γ(σ)d}/(ηd) samples [13]. The exponent γ(σ) is monotonically decreasing in σ, which implies that larger σ values are favorable for estimation. Part 2 of the theorem follows by similar arguments, but for a d-dimensional AWGN channel with an input distributed on the vertices of the [−1, 1]^d hypercube; the proof is omitted (see [1, Section V-B2]).

Remark 1 (Exponential Sample Complexity for Restricted Classes of Distributions): Restricting F_d by imposing smoothness or lower-boundedness assumptions on the distributions in the class would not alleviate the exponential dependence on d from Theorem 1. For instance, consider convolving any P ∈ F_d with N_{σ/2}, i.e., replacing each P with Q = P ∗ N_{σ/2}. These Q distributions are smooth, but if one could accurately estimate h(Q ∗ N_{σ/2}) over the convolved class, then h(P ∗ N_σ) over F_d could have been estimated as well. Therefore, Theorem 1 applies also to the class of such smooth Q distributions.
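To give a sense of the reduction behind Part 1 without reproducing the proof of Section VI-A, here is a heuristic sketch; the constants, and the careful treatment of overlapping Gaussian components, are handled rigorously in the actual argument. Suppose P is supported on a codebook C = {c_1, ..., c_M} ⊂ R^d of size M on the order of 2^{γ(σ)d}, with probability mass function p. If the codewords are well separated relative to σ, the components of the mixture P ∗ N_σ barely overlap, so

$$ h(P \ast \mathcal{N}_\sigma) \;\approx\; H(p) + \frac{d}{2}\log\bigl(2\pi e \sigma^2\bigr), $$

where H(p) is the discrete (Shannon) entropy of p and the second term is the differential entropy of a single Gaussian component. An estimator of h(P ∗ N_σ) within additive gap η therefore yields an estimator of H(p) within a gap of roughly η. Estimating the discrete entropy of a distribution over an alphabet of size M to within a small additive gap requires on the order of M/(η log M) samples (the type of bound invoked via [13]), which for M ≈ 2^{γ(σ)d} is of order 2^{γ(σ)d}/(ηd), matching the statement of Part 1.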