Undersmoothed kernel entropy estimators

Liam Paninski and Masanao Yajima
Department of Statistics, Columbia University
[email protected], [email protected]
http://www.stat.columbia.edu/~liam

Abstract— We develop a "plug-in" kernel estimator for the differential entropy that is consistent even if the kernel width tends to zero as quickly as 1/N, where N is the number of i.i.d. samples. Thus, accurate density estimates are not required for accurate kernel entropy estimates; in fact, it is a good idea when estimating entropy to sacrifice some accuracy in the quality of the corresponding density estimate.

Index Terms— Approximation theory, bias, consistency, distribution-free bounds, density estimation.

LP was partially supported by an NSF CAREER award.

INTRODUCTION

The estimation of the entropy and of related quantities (mutual information, Kullback-Leibler divergence, etc.) from i.i.d. samples is a very well-studied problem. Work on estimating the discrete entropy began shortly after the appearance of Shannon's original work (Miller, 1955; Basharin, 1959; Antos and Kontoyiannis, 2001; Paninski, 2003). A variety of nonparametric approaches for estimating the differential entropy have been studied, including histogram-based estimators, "plug-in" kernel estimators, resampled kernel estimators, and nearest-neighbor estimators; see (Beirlant et al., 1997) for a nice review.

In particular, this previous work has established the consistency of several kernel- or nearest-neighbor-based estimators of the differential entropy, under certain smoothness or tail conditions on the underlying (unknown) distribution p. In the kernel case, consistency is established under the assumption that the kernel width scales more slowly than 1/N (Beirlant et al., 1997); this is the usual assumption guaranteeing that the corresponding kernel density estimate is consistent (not "undersmoothed"). While these consistency results are well understood, worst-case error bounds (i.e., bounds on the estimator's average error over a large class of underlying probability measures p) are more rare.

Our main result here is an adaptation of the discrete (histogram-based) techniques of (Paninski, 2003; Paninski, 2004) to the kernel estimator case. This earlier work established universal consistency for a histogram-based estimator of the entropy assuming that the number of histogram bins, $m = m_N$, obeyed the scaling $m_N = O(N)$; in addition, nonparametric error bounds were established for any $(m, N)$ pair. To adapt these results here we decompose the error of the kernel estimator into three parts: a (deterministic) smoothing error, and an estimation error consisting of the usual bias and variance terms.

The smoothing error generically decreases as the kernel width shrinks, and therefore it is beneficial to make the kernel width as small as possible; on the other hand, in the classical plug-in entropy estimators, making the kernel width too small can make the estimation error component (the bias plus the variance) large. We provide an estimator whose estimation error term may be bounded by a term which goes to zero even if the kernel width scales as 1/N. Thus, accurate density estimates are not required for accurate kernel entropy estimates; in fact, it is a good idea when estimating entropy to sacrifice some accuracy in the quality of the corresponding density estimate (i.e., to undersmooth). Some comparisons on simulated data are provided.

MAIN RESULTS

We assume that data $\{x_j\}$, $1 \le j \le N$, are drawn i.i.d. from some arbitrary probability measure $p$. We are interested in estimating the differential entropy of $p$ (Cover and Thomas, 1991),

$$H(p) = -\int \frac{dp(s)}{ds} \log \frac{dp(s)}{ds} \, ds$$

(for clarity, we will restrict our attention here to the case that the base measure $ds$ is Lebesgue measure on a finite one-dimensional interval $\mathcal{X}$ of length $\mu(\mathcal{X})$, though extensions of the following results to more general measure spaces are possible).

We will consider kernel entropy estimators of the following form:

$$\hat{H} = \int g(\hat{p}(s)) \, ds,$$

where we define the kernel density estimate

$$\hat{p}(s) = \frac{1}{N} \sum_{j=1}^N k(s - x_j),$$

with $k(\cdot)$ the kernel; as usual, $\int k \, ds = 1$ and $k \ge 0$. The standard "plug-in" estimator for the entropy is obtained by setting

$$g(u) = h(u) \equiv -u \log u;$$

our basic plan is to optimize $g(\cdot)$, in some sense, to obtain a better estimate than the plug-in estimate.
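As a concrete illustration (not part of the derivation; the function names, the Riemann-sum evaluation of the integral, the grid size, and the choice $\mathcal{X} = [0,1]$ below are assumptions of the sketch), the plug-in estimator, using the step kernel introduced later in this section, can be computed as follows.

    import numpy as np

    def step_kernel_density(samples, s_grid, w):
        # Kernel density estimate p_hat(s) = (1/N) sum_j k(s - x_j), with the
        # step kernel k(u) = (1/w) 1{|u| <= w/2} of width w.
        x = np.asarray(samples, dtype=float)
        inside = np.abs(s_grid[:, None] - x[None, :]) <= w / 2.0
        return inside.sum(axis=1) / (len(x) * w)

    def plugin_entropy(samples, w, n_grid=10_000, support=(0.0, 1.0)):
        # Plug-in estimate H_hat = integral of h(p_hat(s)) ds, h(u) = -u log u,
        # with the integral over X approximated by a Riemann sum on a grid.
        s_grid = np.linspace(support[0], support[1], n_grid)
        p_hat = step_kernel_density(samples, s_grid, w)
        h = np.where(p_hat > 0, -p_hat * np.log(np.maximum(p_hat, 1e-300)), 0.0)
        return float(np.sum(h) * (s_grid[1] - s_grid[0]))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 1.0, size=1000)
        # true differential entropy of the uniform density on [0, 1] is 0 nats
        print(plugin_entropy(x, w=0.05))

For the uniform density this returns a value near 0 nats at this moderate kernel width; the modifications developed below are aimed at the regime where $w$ is pushed down to the $1/N$ scale.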
Our development begins with the standard bias-variance decomposition for the squared error of the estimator:

$$E(\hat{H} - H)^2 = \left(E_{app} + B(\hat{H})\right)^2 + V(\hat{H}),$$

with the approximation error

$$E_{app} = H(p * k) - H(p),$$

and the bias term

$$B(\hat{H}) = E_p(\hat{H}) - H(p * k)$$

defined relative to the smoothed measure

$$p * k(s) = E_p(\hat{p}(s)) = \int k(s - x) \, dp(x).$$

Note that $E_{app}$ is generically positive and increasing with the kernel width (since smoothing tends to increase entropy), while the bias $B(\hat{H})$ of the standard plug-in estimator ($g(\cdot) = h(\cdot)$) is always negative, by Jensen's inequality.

Clearly, it is impossible to obtain any nontrivial risk bounds on the expected mean-square error of any estimator of the differential entropy, since we might have $H = -\infty$ (in the case that $p$ is singular). Thus, instead of trying to obtain bounds on the full error $E(H(p) - \hat{H})^2$, our goal will be to bound the estimation error

$$E(H(p * k) - \hat{H})^2 = B(\hat{H})^2 + V(\hat{H}),$$

and then choose the kernel $k$ so that the smoothing error $E_{app}$ is as small as possible, under the constraint that the worst-case expected estimation error is acceptably small.

For this class of kernel entropy estimators, we have some simple bounds on the bias and variance (adapted from bounds derived in (Antos and Kontoyiannis, 2001; Paninski, 2003)). We may bound the variance $V(\hat{H}_{g,N})$ using McDiarmid's technique (Devroye et al., 1996; McDiarmid, 1989):

Lemma 1 (Variance bound, general kernel).

$$V(\hat{H}_{g,N}) \le N \left( \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds \right)^2.$$

In the special case that $g(\cdot)$ is Lipschitz, $\sup_{s,t} |g(s) - g(s+t)| \le c|t|$ for some $0 < c < \infty$, the bound simplifies considerably:

$$V(\hat{H}_{g,N}) \le c^2/N.$$

Proof: McDiarmid's variance inequality (Devroye et al., 1996; McDiarmid, 1989) says that if we may bound the maximal coordinatewise difference

$$\sup_{x_1, \ldots, x_j, x'_j, \ldots, x_N} \left| \hat{H}_{g,N}(x_1, \ldots, x_j, \ldots, x_N) - \hat{H}_{g,N}(x_1, \ldots, x'_j, \ldots, x_N) \right| \le c_j,$$

where $\hat{H}_{g,N}(x_1, \ldots, x_N)$ denotes the estimator evaluated on some arbitrary configuration of the observed samples $\{x_j\}_{1 \le j \le N}$, then

$$V(\hat{H}_{g,N}) \le \frac{1}{4} \sum_j c_j^2.$$

We have here that $c_j$, as defined above, may be chosen as

$$c_j = 2 \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds;$$

plugging in, we obtain the general bound in the lemma. In the Lipschitz case,

$$N \left( \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds \right)^2 \le N \left( \frac{c}{N} \int k(s) \, ds \right)^2 = N (c/N)^2 = c^2/N,$$

where the first inequality follows by the Lipschitz condition and the first equality by the fact that the kernel $k$ integrates to one.

Exponential tail bounds are also available (McDiarmid, 1989; Devroye et al., 1996; Antos and Kontoyiannis, 2001; Paninski, 2003) in case almost-sure results are desired, but these bounds will not be necessary here.

We now specialize to the simplest possible kernel, the step kernel of width $w$:

$$k_w(s) = \frac{1}{w} \, 1\!\left(s \in [-w/2, w/2]\right).$$

In this case we only need to define $g(u)$ at the $N+1$ points $u \in \{0, \tfrac{1}{Nw}, \tfrac{2}{Nw}, \ldots, \tfrac{1}{w}\}$, and we have the following simplification of Lemma 1:

Lemma 2 (Variance bound, step kernel).

$$\sup_p V(\hat{H}_{g,N}) \le N w^2 \max_{0 \le j < N} \left( g\!\left(\frac{j+1}{Nw}\right) - g\!\left(\frac{j}{Nw}\right) \right)^2.$$
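For concreteness, the right-hand side of Lemma 2 is simple to evaluate for any candidate $g$ specified on the grid $\{j/(Nw)\}$; the following sketch (illustrative only; the function names and the plug-in choice $g = h$ are assumptions of the example) does so.

    import numpy as np

    def lemma2_variance_bound(g_vals, N, w):
        # Right-hand side of Lemma 2: N * w^2 * max_j (g((j+1)/(Nw)) - g(j/(Nw)))^2,
        # where g_vals[j] = g(j/(Nw)) for j = 0, ..., N.
        g_vals = np.asarray(g_vals, dtype=float)
        return N * w**2 * float(np.max(np.diff(g_vals) ** 2))

    if __name__ == "__main__":
        N = 1000
        w = 1.0 / N                       # kernel width of order 1/N
        u = np.arange(N + 1) / (N * w)    # the grid {j/(Nw)}, here {0, 1, ..., 1/w}
        h = np.where(u > 0, -u * np.log(np.maximum(u, 1e-300)), 0.0)
        print(lemma2_variance_bound(h, N, w))   # small even at this width

In this example the worst-case variance term is already small even though $w$ is of order $1/N$; it is the bias, treated next, that drives the choice of $g$ away from the plug-in $h$.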

The bias of this estimator may be written as

$$B(\hat{H}_{g,N}) = E_p(\hat{H}_{g,N}) - H(p * k_w) = \int \left( \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] - h[p * k_w(x)] \right) dx, \qquad (1)$$

where the binomial functions

$$B_{j,N}(u) \equiv \binom{N}{j} u^j (1-u)^{N-j};$$

the derivation of this formula exactly follows that in the discrete case, as described in (Paninski, 2003) (all that is required is an interchange of an integral and a finite sum). From this we may easily derive the following approximation-theoretic bound:

Lemma 3 (Bias bound).¹

$$\sup_p |B(\hat{H}_{g,N})| \le \mu(\mathcal{X}) \max_{0 \le u \le 1} \left| \frac{1}{w} h(u) + \log w - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|.$$

Proof: We apply the simple inequality $|\int f(x) \, d\mu(x)| \le \mu(\mathcal{X}) \sup_x |f(x)|$ to the expression for the bias in equation (1). First we rewrite

$$\int h[p * k_w(x)] \, dx = \int h\!\left[\frac{1}{w} \, w \, p * k_w(x)\right] dx = \log w + \frac{1}{w} \int h[w \, p * k_w(x)] \, dx.$$

Now

$$B(\hat{H}_{g,N}) = \int \left( -\log w + \frac{1}{w} h[w \, p * k_w(x)] - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] \right) dx,$$

so $|B(\hat{H}_{g,N})|$ is bounded above by

$$\mu(\mathcal{X}) \sup_x \left| \log w + \frac{1}{w} h[w \, p * k_w(x)] - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] \right| = \mu(\mathcal{X}) \max_{0 \le u \le 1} \left| \log w + \frac{1}{w} h(u) - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|,$$

since $0 \le w \, p * k_w(x) \le 1$. The maximum is attained by compactness and continuity of $h(u)$ and $B_{j,N}(u)$.

¹A direct generalization to the infinite $\mu(\mathcal{X})$ case is not possible without some restrictions on the decay of $p$. We will not pursue such bounds here.

Note that each of the above bounds is distribution-free, that is, uniform over all possible underlying distributions $p$. We may combine these to obtain uniform bounds on the mean-square error:

$$\sup_p E(\hat{H}_{g,N} - H(p * k_w))^2 = \sup_p \left[ B(\hat{H}_{g,N})^2 + V(\hat{H}_{g,N}) \right] \le \left( \sup_p |B(\hat{H}_{g,N})| \right)^2 + \sup_p V(\hat{H}_{g,N})$$

$$\le \mu(\mathcal{X})^2 \max_{0 \le u \le 1} \left| \frac{1}{w} h(u) + \log w - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|^2 + N w^2 \max_{0 \le j < N} \left( g\!\left(\frac{j+1}{Nw}\right) - g\!\left(\frac{j}{Nw}\right) \right)^2.$$

Thus, if we write $g\!\left(\frac{j}{Nw}\right) = \frac{1}{w} a\!\left(\frac{j}{N}\right) + \log w$ for some function $a(\cdot)$ defined on the points $\{j/N\}_{0 \le j \le N}$, the bound becomes

$$\left( \frac{\mu(\mathcal{X})}{w} \right)^2 \max_{0 \le u \le 1} \left| h(u) - \sum_{j=0}^N a\!\left(\frac{j}{N}\right) B_{j,N}(u) \right|^2 + N \max_{0 \le j < N} \left| a\!\left(\frac{j+1}{N}\right) - a\!\left(\frac{j}{N}\right) \right|^2.$$
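This combined bound can be evaluated numerically for any candidate coefficient vector $a$; in the following sketch (illustrative only: the supremum over $u$ is approximated on a finite grid, and the crude candidate $a = h$ is not the optimized sequence of (Paninski, 2004)) the two terms correspond to the bias and variance bounds above.

    import numpy as np
    from scipy.stats import binom

    def combined_mse_bound(a_vals, N, w, mu_X=1.0, n_u=2001):
        # Approximate the distribution-free bound
        #   (mu(X)/w)^2 * max_u |h(u) - sum_j a(j/N) B_{j,N}(u)|^2
        #       + N * max_j |a((j+1)/N) - a(j/N)|^2,
        # where a_vals[j] = a(j/N) and the max over u in [0,1] is taken on a grid.
        a_vals = np.asarray(a_vals, dtype=float)
        u = np.linspace(0.0, 1.0, n_u)
        h = np.where(u > 0, -u * np.log(np.maximum(u, 1e-300)), 0.0)
        # Bernstein basis B_{j,N}(u) = C(N,j) u^j (1-u)^(N-j), via the binomial pmf
        B = binom.pmf(np.arange(N + 1)[:, None], N, u[None, :])   # shape (N+1, n_u)
        bias_term = (mu_X / w) ** 2 * float(np.max(np.abs(h - a_vals @ B))) ** 2
        variance_term = N * float(np.max(np.diff(a_vals) ** 2))
        return bias_term + variance_term

    if __name__ == "__main__":
        N = 200
        w = 1.0 / N
        j = np.arange(N + 1)
        a = np.where(j > 0, -(j / N) * np.log(np.maximum(j / N, 1e-300)), 0.0)
        print(combined_mse_bound(a, N, w))

Minimizing a quantity of this form over $a$ is the approximation-theoretic problem whose solution, invoked next, was constructed in (Paninski, 2004).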

This is small for suitably-chosen $a_N(\cdot)$; more precisely, the main result of (Paninski, 2004) states that a sequence $\{a_N(j/N)\}_{j=0}^N$ exists (computed via an approximation-theoretic convex optimization problem) such that

$$m_N^2 \max_{0 \le u \le 1} \left| h(u) - \sum_{j=0}^N a_N\!\left(\frac{j}{N}\right) B_{j,N}(u) \right|^2 + N \max_{0 \le j < N} \left| a_N\!\left(\frac{j+1}{N}\right) - a_N\!\left(\frac{j}{N}\right) \right|^2$$

converges to zero as $N \to \infty$ for any sequence $m_N$ satisfying $m_N = O(N)$.²

²In (Paninski, 2004), $m$ was defined as the finite number of points on which the discrete probability measure was supported. Note that this definition of $m$ is consistent with the definition of $m$ in the case of a histogram-based method, in which we divide the space $\mathcal{X}$ into $m$ bins and the effective kernel width $w$ is exactly $\mu(\mathcal{X})/m$.

Now, setting $m_N = \mu(\mathcal{X})/w_N$, we may easily deduce the main result of this paper:

Theorem 4. Let $N w_N \ge c > 0$, uniformly in $N$. There exists an estimator $\hat{H}_{g,N}$ for the entropy $H$ which is uniformly smoothed-consistent in mean square; that is,

$$\sup_p E(\hat{H}_{g,N} - H(p * k_{w_N}))^2 < \epsilon(c, N),$$

with $\epsilon(c, N) \searrow 0$ as $N \to \infty$, where the supremum is taken over all probability measures $p$.

Proof: We need only apply the main result of (Paninski, 2004) guaranteeing the existence of the sequence $a_N$ described above, and then take $g\!\left(\frac{j}{Nw}\right) = \frac{1}{w} a_N\!\left(\frac{j}{N}\right) + \log w$.

As a corollary, it is easy to show that a uniformly consistent estimator exists even if $N w_N \to 0$, provided it does so sufficiently slowly; as in (Paninski, 2004), this follows by a straightforward diagonalization argument.

Note that $w_N = O(N^{-1})$ (and certainly $w_N = o(N^{-1})$) does not lead to consistent density estimates, even under smoothness restrictions on $p$ (Paninski, 2003; Braess and Dette, 2004; Paninski, 2005). Thus the content of the theorem is that we can undersmooth the density and still estimate the entropy well. In fact, undersmoothing is a good idea, because it generically decreases the approximation bias $E_{app}$.

Finally, it is worth noting that an identical result may be obtained in the multidimensional case; the only difference in the statement and proof of the result is that in the general case the inverse measure of the support of our step kernel must be $O(N)$, whereas in the one-dimensional case (Theorem 4) we restrict the inverse length $1/w_N$ to be $O(N)$.

NUMERICAL RESULTS

Sample-spacing estimators also have the "undersmoothing" property: consistent density estimates are not required for consistent entropy estimates (Beirlant et al., 1997). Thus it makes sense to compare the performance of the estimator introduced here with that of these sample-spacing estimators.

The m-sample spacing estimator is defined as follows. Given $N$ real-valued samples $X_i$, we may form the usual order statistics $X_{(i)}$. The gaps between the $i$-th and $(i+m)$-th order statistics, $X_{(i+m)} - X_{(i)}$, are called the m-spacings. It is easy to form a density estimator based on these m-spacings (Beirlant et al., 1997), and plugging this estimator into the differential entropy formula (and performing a bias correction) gives the following estimator for the entropy:

$$\hat{H}^{(m)} \equiv \frac{1}{N} \sum_{i=1}^{N-m} \log\!\left( \frac{N}{m} \left( X_{(i+m)} - X_{(i)} \right) \right) - \psi(m) + \log m,$$

where we have abbreviated the digamma function

$$\psi(x) = \left. \frac{\partial \log \Gamma(t)}{\partial t} \right|_{t=x}.$$
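A direct implementation of this estimator takes only a few lines; the following sketch (illustrative only; it uses scipy's digamma and assumes the samples are distinct, so that no spacing is zero) follows the display above.

    import numpy as np
    from scipy.special import digamma

    def m_spacing_entropy(samples, m=1):
        # m-spacing estimate:
        # (1/N) sum_{i=1}^{N-m} log((N/m) (X_(i+m) - X_(i))) - digamma(m) + log(m)
        x = np.sort(np.asarray(samples, dtype=float))
        N = len(x)
        spacings = x[m:] - x[:-m]          # X_(i+m) - X_(i), i = 1, ..., N-m
        return float(np.sum(np.log((N / m) * spacings)) / N - digamma(m) + np.log(m))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.uniform(0.0, 1.0, size=10_000)
        # true differential entropy of the uniform density on [0, 1] is 0 nats
        print(m_spacing_entropy(x, m=1))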
[Figure 1 about here.]

Fig. 1. Top: the true density $p(x)$ used for the simulations described in the text and in Fig. 2 (shown for $N = 5$). Bottom: the smoothed density $p * k(x)$ ($N = 5$, $w = 1/(4N)$). Both panels show density versus $x$ on $[0, 1]$.

To compare the performance of the estimators, it is useful to choose a bounded, absolutely continuous density whose entropy is very vulnerable to oversmoothing, that is, a density $p$ for which $H(p)$ and $H(p * k)$ are very different and therefore the approximation error $E_{app}$ is large. One such density is of the one-dimensional sawtooth form

$$p(x) = p_N(x) \equiv \begin{cases} 2, & \frac{n-1}{N} \le x \le \frac{2n-1}{2N} \\ 0, & \frac{2n-1}{2N} < x \le \frac{n}{N}, \end{cases}$$

where $n = 1, 2, \ldots, N$ (Fig. 1). (More generally, any density with large fluctuations on a $1/w$ scale will induce a large approximation error $E_{app}$; the density $p$ chosen here just has a particularly convenient form.) The entropy $H(p)$ of this distribution is easily calculated as $-\log 2$.

The smoothed entropy $H(p * k)$ may also be computed explicitly here. The density $p * k$ is simply a sum of trapezoids, of the form

$$p * k(x) = \begin{cases} \frac{2}{w}\left(x - \left(\frac{n-1}{N} - \frac{w}{2}\right)\right), & \frac{n-1}{N} - \frac{w}{2} \le x \le \frac{n-1}{N} + \frac{w}{2} \\ 2, & \frac{n-1}{N} + \frac{w}{2} \le x \le \frac{2n-1}{2N} - \frac{w}{2} \\ 2\left[1 - \frac{1}{w}\left(x - \left(\frac{2n-1}{2N} - \frac{w}{2}\right)\right)\right], & \frac{2n-1}{2N} - \frac{w}{2} \le x \le \frac{2n-1}{2N} + \frac{w}{2} \\ 0, & \frac{2n-1}{2N} + \frac{w}{2} \le x \le \frac{n}{N}, \end{cases}$$

where we have assumed that $w \le 1/(2N)$. Thus for the smoothed entropy we obtain

$$H(p * k) = -\int_{\Re} p * k(x) \log p * k(x) \, dx = -N \int_{w/2}^{\frac{1}{2N} - \frac{w}{2}} 2 \log 2 \, dx - 2N \int_0^w \frac{2x}{w} \log \frac{2x}{w} \, dx$$

$$= -N \left( \frac{1}{2N} - w \right) 2 \log 2 - N w \left[ \frac{1}{2} y^2 \log y - \frac{1}{4} y^2 \right]_0^2 = -\log 2 + 2 N w \log 2 - N w (2 \log 2 - 1) = -\log 2 + N w,$$

where in the second line we have substituted $y = 2x/w$.
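The closed form $H(p*k) = -\log 2 + Nw$ is easy to check numerically; the following sketch (an illustrative sanity check only; the quadrature grid and function names are assumptions of the example) evaluates $H(p*k)$ by quadrature over a single trapezoid, all $N$ teeth contributing equally, and also provides a sampler for $p_N$.

    import numpy as np

    def sample_sawtooth(N, size, rng):
        # Draw i.i.d. samples from p_N: pick a tooth n uniformly at random, then a
        # point uniform on [(n-1)/N, (2n-1)/(2N)], the half-tooth where p_N = 2.
        n = rng.integers(1, N + 1, size=size)
        return (n - 1) / N + rng.uniform(0.0, 0.5 / N, size=size)

    def smoothed_entropy_numeric(N, w, n_grid=200_001):
        # H(p * k_w) for the sawtooth density, by quadrature over one trapezoid
        # (assumes w <= 1/(2N), as in the text).
        x = np.linspace(-w, 1.0 / N, n_grid)
        # p*k_w(x) = 2 * |[x - w/2, x + w/2] intersect [0, 1/(2N)]| / w
        overlap = np.minimum(x + w / 2, 0.5 / N) - np.maximum(x - w / 2, 0.0)
        pk = 2.0 * np.clip(overlap, 0.0, None) / w
        integrand = np.where(pk > 0, -pk * np.log(np.maximum(pk, 1e-300)), 0.0)
        return float(N * np.sum(integrand) * (x[1] - x[0]))

    if __name__ == "__main__":
        N, w = 16, 1.0 / 64                       # w = 1/(4N), as in Fig. 1
        print(smoothed_entropy_numeric(N, w))     # numerical value
        print(-np.log(2.0) + N * w)               # closed form: -log 2 + N w
        rng = np.random.default_rng(2)
        x = sample_sawtooth(N, size=1000, rng=rng)   # samples for experiments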

We illustrate the performance of the new kernel estimator (which we will refer to by the initials "BUB," for "best upper bound," as in (Paninski, 2003)) versus the m-spacing estimator with $m = 1$ (this value of $m$ led to the best performance here; data not shown) in Fig. 2.³ The idea was to choose $w$ to be as small as possible (to make the smoothing error $H(p * k) - H(p)$ as small as possible), within the constraint that the maximal error $\max_p [\hat{H} - H(p * k)]$ is decreasing as a function of $N$ (to ensure smoothed-consistency of the estimate $\hat{H}$). This behavior is illustrated in Fig. 2: we see that the error bound does in fact tend to zero (albeit slowly), implying that $\hat{H} \to H(p * k)$ in mean square; at the same time, since $N w_N \to 0$, $H(p * k) \to H(p)$, and we have that $\hat{H}$ is not only smoothed-consistent in mean square but in fact mean-square consistent for $H(p)$. On the other hand, the m-spacing estimator has an asymptotic bias; since the m-spacing estimator is constructed from a density estimate whose kernel width, roughly speaking, corresponds to $1/[N p(x)]$, this estimator cannot detect the structure on the $o(1/N)$ scale which is necessary to consistently estimate $H(p)$ here. (But note that the m-spacing estimator can be superior in the case of an unbounded density $p$, where the smoothing error $H(p * k) - H(p)$ of the kernel estimator is large but where the small effective width $1/[N p(x)]$ of the m-spacing estimator can lead to a much smaller bias; data not shown.)

[Figure 2 about here.]

Fig. 2. Comparison of the performance of the m-spacing ($m = 1$) and BUB estimators applied to the density $p$ shown in Fig. 1. For each of several values of the sample size $N$, we chose $N$ i.i.d. samples from $p$ (with $N$ in the definition of $p$ chosen to equal the sample size in each case; i.e., the number of sawtooths in the definition of $p$ increases linearly with $N$), then replicated the experiment ten times, in order to obtain reliable estimates of the sample mean and standard deviation of the two estimates. This sample mean, plus and minus a single standard deviation, is plotted for the m-spacing and BUB estimates (dotted and solid black traces, respectively), as a function of $N$ (up to $N = 10^7$), in units of nats. Note the large positive asymptotic bias of the m-spacing estimator (the variance of both the m-spacing and BUB estimators is relatively negligible). The true value of the entropy, $H(p) = -\log 2$, is indicated by the dashed line; the gray trace shows the true smoothed entropy $H(p*k)$, plus or minus the square root of the maximal mean-squared error of the BUB estimator. Note that this maximal error tends to zero as $N \to \infty$, and $H(p * k) \to H(p)$, for the values of $w_N$ chosen here, implying mean-square consistency of $\hat{H}$ for $H(p)$.

CONCLUSIONS

We have presented a kernel estimator of the entropy (based on a simple step kernel) which can be applied even when the kernel undersmooths the true underlying density (that is, when the kernel width tends to zero as quickly as $1/N$). This kernel estimator is shown to have better numerical performance than the classical m-spacing estimators when the underlying density $p$ is very jagged. We anticipate that this new estimator will be useful in applications that require the estimation of the differential entropy of a random vector, or of the mutual information between two random variables.

REFERENCES

Antos, A. and Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions. Random Structures and Algorithms, 19:163–193.

Basharin, G. (1959). On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and its Applications, 4:333–336.

Beirlant, J., Dudewicz, E., Gyorfi, L., and van der Meulen, E. (1997). Nonparametric entropy estimation: an overview. International Journal of Mathematical and Statistical Sciences, 6:17–39.

Braess, D. and Dette, H. (2004). The asymptotic minimax risk for the estimation of constrained binomial and multinomial probabilities. Sankhya, 66:707–732.

Cover, T. and Thomas, J. (1991). Elements of information theory. Wiley, New York.

Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer-Verlag, New York.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press.

Miller, G. (1955). Note on the bias of information estimates. In Information theory in psychology II-B, pages 95–100.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.

Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50:2200–2203.

Paninski, L. (2005). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17.

³Kernel density estimators are typically computationally expensive, requiring $O(Nt)$ time to compute, where $t$ denotes the number of points at which we evaluate the integrand in the definition of the entropy estimate; the m-spacing estimates, on the other hand, may be computed after a simple sorting operation which requires $O(N \log N)$ time (typically $t$ is taken to be significantly larger than $\log N$; i.e., the m-spacing estimator is computationally cheaper). However, in the case of the step kernel used here, applied to one-dimensional data, it is possible to compute the density estimate, and therefore $\hat{H}$, in $O(N \log N)$ time: we need only sort the sample points (as in the case of the m-spacing estimator); then, to compute the integral in the definition of $\hat{H}$, we need only keep track of the $2N$ points at which the density estimate $\hat{p}(s)$ jumps up or down (at the points $\{x_i - w/2\}$ and $\{x_i + w/2\}$, respectively); the whole computation requires just a couple lines of code.
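A sketch of this $O(N \log N)$ computation (illustrative only; the function names are assumptions, and the plug-in $g = h$ is used purely for concreteness, although the same bookkeeping applies to any $g$): sort the $2N$ breakpoints $\{x_i \pm w/2\}$, between which $\hat{p}$ is constant, and accumulate the integral exactly over those flat pieces.

    import numpy as np

    def step_kernel_entropy(samples, w):
        # Exact evaluation of H_hat = integral of h(p_hat(s)) ds for the step
        # kernel, in O(N log N): p_hat is piecewise constant, jumping by +1/(Nw)
        # at each x_i - w/2 and by -1/(Nw) at each x_i + w/2.
        x = np.asarray(samples, dtype=float)
        N = len(x)
        edges = np.concatenate([x - w / 2, x + w / 2])
        jumps = np.concatenate([np.full(N, 1.0), np.full(N, -1.0)]) / (N * w)
        order = np.argsort(edges)
        edges, jumps = edges[order], jumps[order]
        levels = np.cumsum(jumps)[:-1]        # value of p_hat between breakpoints
        widths = np.diff(edges)
        h = np.where(levels > 1e-12, -levels * np.log(np.maximum(levels, 1e-300)), 0.0)
        return float(np.sum(h * widths))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 1.0, size=100_000)
        # With w of order 1/N the raw plug-in g = h is badly biased; the point
        # here is only the O(N log N) bookkeeping, which applies to any g.
        print(step_kernel_entropy(x, w=1.0 / len(x)))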