Undersmoothed kernel entropy estimators

Liam Paninski and Masanao Yajima
Department of Statistics, Columbia University
[email protected], [email protected]
http://www.stat.columbia.edu/~liam

Abstract— We develop a "plug-in" kernel estimator for the differential entropy that is consistent even if the kernel width tends to zero as quickly as 1/N, where N is the number of i.i.d. samples. Thus, accurate density estimates are not required for accurate kernel entropy estimates; in fact, it is a good idea when estimating entropy to sacrifice some accuracy in the quality of the corresponding density estimate.

Index Terms— Approximation theory, bias, consistency, distribution-free bounds, density estimation.

LP was partially supported by an NSF CAREER award.

INTRODUCTION

The estimation of the entropy and of related quantities (mutual information, Kullback-Leibler divergence, etc.) from i.i.d. samples is a very well-studied problem. Work on estimating the discrete entropy began shortly after the appearance of Shannon's original work (Miller, 1955; Basharin, 1959; Antos and Kontoyiannis, 2001; Paninski, 2003). A variety of nonparametric approaches for estimating the differential entropy have been studied, including histogram-based estimators, "plug-in" kernel estimators, resampled kernel estimators, and nearest-neighbor estimators; see (Beirlant et al., 1997) for a nice review.

In particular, this previous work has established the consistency of several kernel- or nearest-neighbor-based estimators of the differential entropy, under certain smoothness or tail conditions on the underlying (unknown) distribution p. In the kernel case, consistency is established under the assumption that the kernel width scales more slowly than 1/N (Beirlant et al., 1997); this is the usual assumption guaranteeing that the corresponding kernel density estimate is consistent (not "undersmoothed"). While these consistency results are well understood, worst-case error bounds (i.e., bounds on the estimator's average error over a large class of underlying probability measures p) are more rare.

Our main result here is an adaptation of the discrete (histogram-based) techniques of (Paninski, 2003; Paninski, 2004) to the kernel estimator case. This earlier work established universal consistency for a histogram-based estimator of the entropy assuming that the number of histogram bins, $m = m_N$, obeyed the scaling $m_N = O(N)$; in addition, nonparametric error bounds were established for any $(m, N)$ pair. To adapt these results here we decompose the error of the kernel estimator into three parts: a (deterministic) smoothing error, and an estimation error consisting of the usual bias and variance terms.

The smoothing error generically decreases as the kernel width shrinks, and therefore it is beneficial to make the kernel width as small as possible; on the other hand, in the classical plug-in entropy estimators, making the kernel width too small can make the estimation error component (the bias plus the variance) large. We provide an estimator whose estimation error term may be bounded by a term which goes to zero even if the kernel width scales as 1/N. Thus, accurate density estimates are not required for accurate kernel entropy estimates; in fact, it is a good idea when estimating entropy to sacrifice some accuracy in the quality of the corresponding density estimate (i.e., to undersmooth). Some comparisons on simulated data are provided.

MAIN RESULTS

We assume that data $\{x_j\}$, $1 \le j \le N$, are drawn i.i.d. from some arbitrary probability measure $p$. We are interested in estimating the differential entropy of $p$ (Cover and Thomas, 1991),

$$H(p) = -\int \frac{dp(s)}{ds} \log \frac{dp(s)}{ds} \, ds$$

(for clarity, we will restrict our attention here to the case that the base measure $ds$ is Lebesgue measure on a finite one-dimensional interval $\mathcal{X}$ of length $\mu(\mathcal{X})$, though extensions of the following results to more general measure spaces are possible).

We will consider kernel entropy estimators of the following form:

$$\hat{H} = \int g(\hat{p}(s)) \, ds,$$

where we define the kernel density estimate

$$\hat{p}(s) = \frac{1}{N} \sum_{j=1}^N k(s - x_j),$$

with $k(\cdot)$ the kernel; as usual, $\int k \, ds = 1$ and $k \ge 0$. The standard "plug-in" estimator for the entropy is obtained by setting

$$g(u) = h(u) \equiv -u \log u;$$

our basic plan is to optimize $g(\cdot)$, in some sense, to obtain a better estimate than the plug-in estimate.
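As a concrete illustration (not part of the derivation; the function names, the Riemann-sum evaluation of the integral, the grid size, and the choice $\mathcal{X} = [0,1]$ below are assumptions of the sketch), the plug-in estimator, using the step kernel introduced later in this section, can be computed as follows.

    import numpy as np

    def step_kernel_density(samples, s_grid, w):
        # Kernel density estimate p_hat(s) = (1/N) sum_j k(s - x_j), with the
        # step kernel k(u) = (1/w) 1{|u| <= w/2} of width w.
        x = np.asarray(samples, dtype=float)
        inside = np.abs(s_grid[:, None] - x[None, :]) <= w / 2.0
        return inside.sum(axis=1) / (len(x) * w)

    def plugin_entropy(samples, w, n_grid=10_000, support=(0.0, 1.0)):
        # Plug-in estimate H_hat = integral of h(p_hat(s)) ds, h(u) = -u log u,
        # with the integral over X approximated by a Riemann sum on a grid.
        s_grid = np.linspace(support[0], support[1], n_grid)
        p_hat = step_kernel_density(samples, s_grid, w)
        h = np.where(p_hat > 0, -p_hat * np.log(np.maximum(p_hat, 1e-300)), 0.0)
        return float(np.sum(h) * (s_grid[1] - s_grid[0]))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 1.0, size=1000)
        # true differential entropy of the uniform density on [0, 1] is 0 nats
        print(plugin_entropy(x, w=0.05))

For the uniform density this returns a value near 0 nats at this moderate kernel width; the modifications developed below are aimed at the regime where $w$ is pushed down to the $1/N$ scale.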
Our development begins with the standard bias-variance decomposition for the squared error of the estimator:

$$E(\hat{H} - H)^2 = \left(E_{app} + B(\hat{H})\right)^2 + V(\hat{H}),$$

with the approximation error

$$E_{app} = H(p * k) - H(p),$$

and the bias term

$$B(\hat{H}) = E_p(\hat{H}) - H(p * k)$$

defined relative to the smoothed measure

$$p * k(s) = E_p(\hat{p}(s)) = \int k(s - x) \, dp(x).$$

Note that $E_{app}$ is generically positive and increasing with the kernel width (since smoothing tends to increase entropy), while the bias $B(\hat{H})$ of the standard plug-in estimator ($g(\cdot) = h(\cdot)$) is always negative, by Jensen's inequality.

Clearly, it is impossible to obtain any nontrivial risk bounds on the expected mean-square error of any estimator of the differential entropy, since we might have $H = -\infty$ (in the case that $p$ is singular). Thus, instead of trying to obtain bounds on the full error $E(H(p) - \hat{H})^2$, our goal will be to bound the estimation error

$$E(H(p * k) - \hat{H})^2 = B(\hat{H})^2 + V(\hat{H}),$$

and then choose the kernel $k$ so that the smoothing error $E_{app}$ is as small as possible, under the constraint that the worst-case expected estimation error is acceptably small.

For this class of kernel entropy estimators, we have some simple bounds on the bias and variance (adapted from bounds derived in (Antos and Kontoyiannis, 2001; Paninski, 2003)). We may bound the variance $V(\hat{H}_{g,N})$ using McDiarmid's technique (Devroye et al., 1996; McDiarmid, 1989):

Lemma 1 (Variance bound, general kernel).

$$V(\hat{H}_{g,N}) \le N \left( \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds \right)^2.$$

In the special case that $g(\cdot)$ is Lipschitz, $\sup_{s,t} |g(s) - g(s+t)| \le c|t|$ for some $0 < c < \infty$, the bound simplifies considerably:

$$V(\hat{H}_{g,N}) \le c^2/N.$$

Proof: McDiarmid's variance inequality (Devroye et al., 1996; McDiarmid, 1989) says that if we may bound the maximal coordinatewise difference

$$\sup_{x_1, \ldots, x_j, x'_j, \ldots, x_N} \left| \hat{H}_{g,N}(x_1, \ldots, x_j, \ldots, x_N) - \hat{H}_{g,N}(x_1, \ldots, x'_j, \ldots, x_N) \right| \le c_j,$$

where $\hat{H}_{g,N}(x_1, \ldots, x_N)$ denotes the estimator evaluated on some arbitrary configuration of the observed samples $\{x_j\}_{1 \le j \le N}$, then

$$V(\hat{H}_{g,N}) \le \frac{1}{4} \sum_j c_j^2.$$

We have here that $c_j$, as defined above, may be chosen as

$$c_j = 2 \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds;$$

plugging in, we obtain the general bound in the lemma. In the Lipschitz case,

$$N \left( \int \sup_y \left| g(y) - g\!\left(y + \frac{k(s)}{N}\right) \right| ds \right)^2 \le N \left( \frac{c}{N} \int k(s) \, ds \right)^2 = N (c/N)^2 = c^2/N,$$

where the first inequality follows by the Lipschitz condition and the first equality by the fact that the kernel $k$ integrates to one.

Exponential tail bounds are also available (McDiarmid, 1989; Devroye et al., 1996; Antos and Kontoyiannis, 2001; Paninski, 2003) in case almost-sure results are desired, but these bounds will not be necessary here.

We now specialize to the simplest possible kernel, the step kernel of width $w$:

$$k_w(s) = \frac{1}{w} \, 1\!\left(s \in [-w/2, w/2]\right).$$

In this case we only need to define $g(u)$ at the $N+1$ points $u \in \{0, \tfrac{1}{Nw}, \tfrac{2}{Nw}, \ldots, \tfrac{1}{w}\}$, and we have the following simplification of Lemma 1:

Lemma 2 (Variance bound, step kernel).

$$\sup_p V(\hat{H}_{g,N}) \le N w^2 \max_{0 \le j < N} \left( g\!\left(\frac{j+1}{Nw}\right) - g\!\left(\frac{j}{Nw}\right) \right)^2.$$
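For concreteness, the right-hand side of Lemma 2 is simple to evaluate for any candidate $g$ specified on the grid $\{j/(Nw)\}$; the following sketch (illustrative only; the function names and the plug-in choice $g = h$ are assumptions of the example) does so.

    import numpy as np

    def lemma2_variance_bound(g_vals, N, w):
        # Right-hand side of Lemma 2: N * w^2 * max_j (g((j+1)/(Nw)) - g(j/(Nw)))^2,
        # where g_vals[j] = g(j/(Nw)) for j = 0, ..., N.
        g_vals = np.asarray(g_vals, dtype=float)
        return N * w**2 * float(np.max(np.diff(g_vals) ** 2))

    if __name__ == "__main__":
        N = 1000
        w = 1.0 / N                       # kernel width of order 1/N
        u = np.arange(N + 1) / (N * w)    # the grid {j/(Nw)}, here {0, 1, ..., 1/w}
        h = np.where(u > 0, -u * np.log(np.maximum(u, 1e-300)), 0.0)
        print(lemma2_variance_bound(h, N, w))   # small even at this width

In this example the worst-case variance term is already small even though $w$ is of order $1/N$; it is the bias, treated next, that drives the choice of $g$ away from the plug-in $h$.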

The bias of this estimator may be written as

$$B(\hat{H}_{g,N}) = E_p(\hat{H}_{g,N}) - H(p * k_w) = \int \left( \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] - h[p * k_w(x)] \right) dx, \qquad (1)$$

where the binomial functions

$$B_{j,N}(u) \equiv \binom{N}{j} u^j (1-u)^{N-j};$$

the derivation of this formula exactly follows that in the discrete case, as described in (Paninski, 2003) (all that is required is an interchange of an integral and a finite sum). From this we may easily derive the following approximation-theoretic bound:

Lemma 3 (Bias bound).¹

$$\sup_p |B(\hat{H}_{g,N})| \le \mu(\mathcal{X}) \max_{0 \le u \le 1} \left| \frac{1}{w} h(u) + \log w - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|.$$

Proof: We apply the simple inequality $|\int f(x) \, d\mu(x)| \le \mu(\mathcal{X}) \sup_x |f(x)|$ to the expression for the bias in equation (1). First we rewrite

$$\int h[p * k_w(x)] \, dx = \int h\!\left[\frac{1}{w} \, w \, p * k_w(x)\right] dx = \log w + \frac{1}{w} \int h[w \, p * k_w(x)] \, dx.$$

Now

$$B(\hat{H}_{g,N}) = \int \left( -\log w + \frac{1}{w} h[w \, p * k_w(x)] - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] \right) dx,$$

so $|B(\hat{H}_{g,N})|$ is bounded above by

$$\mu(\mathcal{X}) \sup_x \left| \log w + \frac{1}{w} h[w \, p * k_w(x)] - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}[w \, p * k_w(x)] \right| = \mu(\mathcal{X}) \max_{0 \le u \le 1} \left| \log w + \frac{1}{w} h(u) - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|,$$

since $0 \le w \, p * k_w(x) \le 1$. The maximum is attained by compactness and continuity of $h(u)$ and $B_{j,N}(u)$.

¹A direct generalization to the infinite $\mu(\mathcal{X})$ case is not possible without some restrictions on the decay of $p$. We will not pursue such bounds here.

Note that each of the above bounds is distribution-free, that is, uniform over all possible underlying distributions $p$. We may combine these to obtain uniform bounds on the mean-square error:

$$\sup_p E(\hat{H}_{g,N} - H(p * k_w))^2 = \sup_p \left[ B(\hat{H}_{g,N})^2 + V(\hat{H}_{g,N}) \right] \le \left( \sup_p |B(\hat{H}_{g,N})| \right)^2 + \sup_p V(\hat{H}_{g,N})$$

$$\le \mu(\mathcal{X})^2 \max_{0 \le u \le 1} \left| \frac{1}{w} h(u) + \log w - \sum_{j=0}^N g\!\left(\frac{j}{Nw}\right) B_{j,N}(u) \right|^2 + N w^2 \max_{0 \le j < N} \left( g\!\left(\frac{j+1}{Nw}\right) - g\!\left(\frac{j}{Nw}\right) \right)^2.$$

Thus, if we write $g\!\left(\frac{j}{Nw}\right) = \frac{1}{w} a\!\left(\frac{j}{N}\right) + \log w$ for some function $a(\cdot)$ defined on the points $\{j/N\}_{0 \le j \le N}$, the bound becomes

$$\left( \frac{\mu(\mathcal{X})}{w} \right)^2 \max_{0 \le u \le 1} \left| h(u) - \sum_{j=0}^N a\!\left(\frac{j}{N}\right) B_{j,N}(u) \right|^2 + N \max_{0 \le j < N} \left| a\!\left(\frac{j+1}{N}\right) - a\!\left(\frac{j}{N}\right) \right|^2.$$
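This combined bound can be evaluated numerically for any candidate coefficient vector $a$; in the following sketch (illustrative only: the supremum over $u$ is approximated on a finite grid, and the crude candidate $a = h$ is not the optimized sequence of (Paninski, 2004)) the two terms correspond to the bias and variance bounds above.

    import numpy as np
    from scipy.stats import binom

    def combined_mse_bound(a_vals, N, w, mu_X=1.0, n_u=2001):
        # Approximate the distribution-free bound
        #   (mu(X)/w)^2 * max_u |h(u) - sum_j a(j/N) B_{j,N}(u)|^2
        #       + N * max_j |a((j+1)/N) - a(j/N)|^2,
        # where a_vals[j] = a(j/N) and the max over u in [0,1] is taken on a grid.
        a_vals = np.asarray(a_vals, dtype=float)
        u = np.linspace(0.0, 1.0, n_u)
        h = np.where(u > 0, -u * np.log(np.maximum(u, 1e-300)), 0.0)
        # Bernstein basis B_{j,N}(u) = C(N,j) u^j (1-u)^(N-j), via the binomial pmf
        B = binom.pmf(np.arange(N + 1)[:, None], N, u[None, :])   # shape (N+1, n_u)
        bias_term = (mu_X / w) ** 2 * float(np.max(np.abs(h - a_vals @ B))) ** 2
        variance_term = N * float(np.max(np.diff(a_vals) ** 2))
        return bias_term + variance_term

    if __name__ == "__main__":
        N = 200
        w = 1.0 / N
        j = np.arange(N + 1)
        a = np.where(j > 0, -(j / N) * np.log(np.maximum(j / N, 1e-300)), 0.0)
        print(combined_mse_bound(a, N, w))

Minimizing a quantity of this form over $a$ is the approximation-theoretic problem whose solution, invoked next, was constructed in (Paninski, 2004).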

This is small for suitably-chosen $a_N(\cdot)$; more precisely, the main result of (Paninski, 2004) states that a sequence $\{a_N(j/N)\}_{j=0}^N$ exists (computed via an approximation-theoretic convex optimization problem) such that

$$m_N^2 \max_{0 \le u \le 1} \left| h(u) - \sum_{j=0}^N a_N\!\left(\frac{j}{N}\right) B_{j,N}(u) \right|^2 + N \max_{0 \le j < N} \left| a_N\!\left(\frac{j+1}{N}\right) - a_N\!\left(\frac{j}{N}\right) \right|^2$$

converges to zero as $N \to \infty$ for any sequence $m_N$ satisfying $m_N = O(N)$.²

²In (Paninski, 2004), $m$ was defined as the finite number of points on which the discrete probability measure was supported. Note that this definition of $m$ is consistent with the definition of $m$ in the case of a histogram-based method, in which we divide the space $\mathcal{X}$ into $m$ bins and the effective kernel width $w$ is exactly $\mu(\mathcal{X})/m$.

Now, setting $m_N = \mu(\mathcal{X})/w_N$, we may easily deduce the main result of this paper:

Theorem 4. Let $N w_N \ge c > 0$, uniformly in $N$. There exists an estimator $\hat{H}_{g,N}$ for the entropy $H$ which is uniformly smoothed-consistent in mean square; that is,

$$\sup_p E(\hat{H}_{g,N} - H(p * k_{w_N}))^2 < \epsilon(c, N),$$

with $\epsilon(c, N) \searrow 0$ as $N \to \infty$, where the supremum is taken over all probability measures $p$.

Proof: We need only apply the main result of (Paninski, 2004) guaranteeing the existence of the sequence $a_N$ described above, and then take $g\!\left(\frac{j}{Nw}\right) = \frac{1}{w} a_N\!\left(\frac{j}{N}\right) + \log w$.

As a corollary, it is easy to show that a uniformly consistent estimator exists even if $N w_N \to 0$, provided it does so sufficiently slowly; as in (Paninski, 2004), this follows by a straightforward diagonalization argument.

Note that $w_N = O(N^{-1})$ (and certainly $w_N = o(N^{-1})$) does not lead to consistent density estimates, even under smoothness restrictions on $p$ (Paninski, 2003; Braess and Dette, 2004; Paninski, 2005). Thus the content of the theorem is that we can undersmooth the density and still estimate the entropy well. In fact, undersmoothing is a good idea, because it generically decreases the approximation bias $E_{app}$.

Finally, it is worth noting that an identical result may be obtained in the multidimensional case; the only difference in the statement and proof of the result is that in the general case the inverse measure of the support of our step kernel must be $O(N)$, whereas in the one-dimensional case (Theorem 4) we restrict the inverse length $1/w_N$ to be $O(N)$.

NUMERICAL RESULTS

Sample-spacing estimators also have the "undersmoothing" property: consistent density estimates are not required for consistent entropy estimates (Beirlant et al., 1997). Thus it makes sense to compare the performance of the estimator introduced here with that of these sample-spacing estimators.

The m-sample spacing estimator is defined as follows. Given $N$ real-valued samples $X_i$, we may form the usual order statistics $X_{(i)}$. The gaps between the $i$-th and $(i+m)$-th order statistics, $X_{(i+m)} - X_{(i)}$, are called the m-spacings. It is easy to form a density estimator based on these m-spacings (Beirlant et al., 1997), and plugging this estimator into the differential entropy formula (and performing a bias correction) gives the following estimator for the entropy:

$$\hat{H}^{(m)} \equiv \frac{1}{N} \sum_{i=1}^{N-m} \log\!\left( \frac{N}{m} \left( X_{(i+m)} - X_{(i)} \right) \right) - \psi(m) + \log m,$$

where we have abbreviated the digamma function

$$\psi(x) = \left. \frac{\partial \log \Gamma(t)}{\partial t} \right|_{t=x}.$$
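A direct implementation of this estimator takes only a few lines; the following sketch (illustrative only; it uses scipy's digamma and assumes the samples are distinct, so that no spacing is zero) follows the display above.

    import numpy as np
    from scipy.special import digamma

    def m_spacing_entropy(samples, m=1):
        # m-spacing estimate:
        # (1/N) sum_{i=1}^{N-m} log((N/m) (X_(i+m) - X_(i))) - digamma(m) + log(m)
        x = np.sort(np.asarray(samples, dtype=float))
        N = len(x)
        spacings = x[m:] - x[:-m]          # X_(i+m) - X_(i), i = 1, ..., N-m
        return float(np.sum(np.log((N / m) * spacings)) / N - digamma(m) + np.log(m))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.uniform(0.0, 1.0, size=10_000)
        # true differential entropy of the uniform density on [0, 1] is 0 nats
        print(m_spacing_entropy(x, m=1))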
[Figure 1 about here.]

Fig. 1. Top: the true density $p(x)$ used for the simulations described in the text and in Fig. 2 (shown for $N = 5$). Bottom: the smoothed density $p * k(x)$ ($N = 5$, $w = 1/(4N)$). Both panels show density versus $x$ on $[0, 1]$.

To compare the performance of the estimators, it is useful to choose a bounded, absolutely continuous density whose entropy is very vulnerable to oversmoothing, that is, a density $p$ for which $H(p)$ and $H(p * k)$ are very different and therefore the approximation error $E_{app}$ is large. One such density is of the one-dimensional sawtooth form

$$p(x) = p_N(x) \equiv \begin{cases} 2, & \frac{n-1}{N} \le x \le \frac{2n-1}{2N} \\ 0, & \frac{2n-1}{2N} < x \le \frac{n}{N}, \end{cases}$$

where $n = 1, 2, \ldots, N$ (Fig. 1). (More generally, any density with large fluctuations on a $1/w$ scale will induce a large approximation error $E_{app}$; the density $p$ chosen here just has a particularly convenient form.) The entropy $H(p)$ of this distribution is easily calculated as $-\log 2$.

The smoothed entropy $H(p * k)$ may also be computed explicitly here. The density $p * k$ is simply a sum of trapezoids, of the form

$$p * k(x) = \begin{cases} \frac{2}{w}\left(x - \left(\frac{n-1}{N} - \frac{w}{2}\right)\right), & \frac{n-1}{N} - \frac{w}{2} \le x \le \frac{n-1}{N} + \frac{w}{2} \\ 2, & \frac{n-1}{N} + \frac{w}{2} \le x \le \frac{2n-1}{2N} - \frac{w}{2} \\ 2\left[1 - \frac{1}{w}\left(x - \left(\frac{2n-1}{2N} - \frac{w}{2}\right)\right)\right], & \frac{2n-1}{2N} - \frac{w}{2} \le x \le \frac{2n-1}{2N} + \frac{w}{2} \\ 0, & \frac{2n-1}{2N} + \frac{w}{2} \le x \le \frac{n}{N}, \end{cases}$$

where we have assumed that $w \le 1/(2N)$. Thus for the smoothed entropy we obtain

$$H(p * k) = -\int_{\Re} p * k(x) \log p * k(x) \, dx = -N \int_{w/2}^{\frac{1}{2N} - \frac{w}{2}} 2 \log 2 \, dx - 2N \int_0^w \frac{2x}{w} \log \frac{2x}{w} \, dx$$

$$= -N \left( \frac{1}{2N} - w \right) 2 \log 2 - N w \left[ \frac{1}{2} y^2 \log y - \frac{1}{4} y^2 \right]_0^2 = -\log 2 + 2 N w \log 2 - N w (2 \log 2 - 1) = -\log 2 + N w,$$

where in the second line we have substituted $y = 2x/w$.
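The closed form $H(p*k) = -\log 2 + Nw$ is easy to check numerically; the following sketch (an illustrative sanity check only; the quadrature grid and function names are assumptions of the example) evaluates $H(p*k)$ by quadrature over a single trapezoid, all $N$ teeth contributing equally, and also provides a sampler for $p_N$.

    import numpy as np

    def sample_sawtooth(N, size, rng):
        # Draw i.i.d. samples from p_N: pick a tooth n uniformly at random, then a
        # point uniform on [(n-1)/N, (2n-1)/(2N)], the half-tooth where p_N = 2.
        n = rng.integers(1, N + 1, size=size)
        return (n - 1) / N + rng.uniform(0.0, 0.5 / N, size=size)

    def smoothed_entropy_numeric(N, w, n_grid=200_001):
        # H(p * k_w) for the sawtooth density, by quadrature over one trapezoid
        # (assumes w <= 1/(2N), as in the text).
        x = np.linspace(-w, 1.0 / N, n_grid)
        # p*k_w(x) = 2 * |[x - w/2, x + w/2] intersect [0, 1/(2N)]| / w
        overlap = np.minimum(x + w / 2, 0.5 / N) - np.maximum(x - w / 2, 0.0)
        pk = 2.0 * np.clip(overlap, 0.0, None) / w
        integrand = np.where(pk > 0, -pk * np.log(np.maximum(pk, 1e-300)), 0.0)
        return float(N * np.sum(integrand) * (x[1] - x[0]))

    if __name__ == "__main__":
        N, w = 16, 1.0 / 64                       # w = 1/(4N), as in Fig. 1
        print(smoothed_entropy_numeric(N, w))     # numerical value
        print(-np.log(2.0) + N * w)               # closed form: -log 2 + N w
        rng = np.random.default_rng(2)
        x = sample_sawtooth(N, size=1000, rng=rng)   # samples for experiments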

We illustrate the performance of the new kernel estimator (which we will refer to by the initials "BUB," for "best upper bound," as in (Paninski, 2003)) versus the m-spacing estimator with $m = 1$ (this value of $m$ led to the best performance here; data not shown) in Fig. 2.³ The idea was to choose $w$ to be as small as possible (to make the smoothing error $H(p * k) - H(p)$ as small as possible), within the constraint that the maximal error $\max_p [\hat{H} - H(p * k)]$ is decreasing as a function of $N$ (to ensure smoothed-consistency of the estimate $\hat{H}$). This behavior is illustrated in Fig. 2: we see that the error bound does in fact tend to zero (albeit slowly), implying that $\hat{H} \to H(p * k)$ in mean square; at the same time, since $N w_N \to 0$, $H(p * k) \to H(p)$, and we have that $\hat{H}$ is not only smoothed-consistent in mean square but in fact mean-square consistent for $H(p)$. On the other hand, the m-spacing estimator has an asymptotic bias; since the m-spacing estimator is constructed from a density estimate whose kernel width, roughly speaking, corresponds to $1/[N p(x)]$, this estimator cannot detect the structure on the $o(1/N)$ scale which is necessary to consistently estimate $H(p)$ here. (But note that the m-spacing estimator can be superior in the case of an unbounded density $p$, where the smoothing error $H(p * k) - H(p)$ of the kernel estimator is large but where the small effective width $1/[N p(x)]$ of the m-spacing estimator can lead to a much smaller bias; data not shown.)

[Figure 2 about here.]

Fig. 2. Comparison of the performance of the m-spacing ($m = 1$) and BUB estimators applied to the density $p$ shown in Fig. 1. For each of several values of the sample size $N$, we chose $N$ i.i.d. samples from $p$ (with $N$ in the definition of $p$ chosen to equal the sample size in each case; i.e., the number of sawtooths in the definition of $p$ increases linearly with $N$), then replicated the experiment ten times, in order to obtain reliable estimates of the sample mean and standard deviation of the two estimates. This sample mean, plus and minus a single standard deviation, is plotted for the m-spacing and BUB estimates (dotted and solid black traces, respectively), as a function of $N$ (up to $N = 10^7$), in units of nats. Note the large positive asymptotic bias of the m-spacing estimator (the variance of both the m-spacing and BUB estimators is relatively negligible). The true value of the entropy, $H(p) = -\log 2$, is indicated by the dashed line; the gray trace shows the true smoothed entropy $H(p*k)$, plus or minus the square root of the maximal mean-squared error of the BUB estimator. Note that this maximal error tends to zero as $N \to \infty$, and $H(p * k) \to H(p)$, for the values of $w_N$ chosen here, implying mean-square consistency of $\hat{H}$ for $H(p)$.

CONCLUSIONS

We have presented a kernel estimator of the entropy (based on a simple step kernel) which can be applied even when the kernel undersmooths the true underlying density (that is, when the kernel width tends to zero as quickly as $1/N$). This kernel estimator is shown to have better numerical performance than the classical m-spacing estimators when the underlying density $p$ is very jagged. We anticipate that this new estimator will be useful in applications that require the estimation of the differential entropy of a random vector, or of the mutual information between two random variables.

REFERENCES

Antos, A. and Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions. Random Structures and Algorithms, 19:163–193.

Basharin, G. (1959). On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and its Applications, 4:333–336.

Beirlant, J., Dudewicz, E., Gyorfi, L., and van der Meulen, E. (1997). Nonparametric entropy estimation: an overview. International Journal of Mathematical and Statistical Sciences, 6:17–39.

Braess, D. and Dette, H. (2004). The asymptotic minimax risk for the estimation of constrained binomial and multinomial probabilities. Sankhya, 66:707–732.

Cover, T. and Thomas, J. (1991). Elements of information theory. Wiley, New York.

Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer-Verlag, New York.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press.

Miller, G. (1955). Note on the bias of information estimates. In Information theory in psychology II-B, pages 95–100.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.

Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50:2200–2203.

Paninski, L. (2005). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17.

³Kernel density estimators are typically computationally expensive, requiring $O(Nt)$ time to compute, where $t$ denotes the number of points at which we evaluate the integrand in the definition of the entropy estimate; the m-spacing estimates, on the other hand, may be computed after a simple sorting operation which requires $O(N \log N)$ time (typically $t$ is taken to be significantly larger than $\log N$; i.e., the m-spacing estimator is computationally cheaper). However, in the case of the step kernel used here, applied to one-dimensional data, it is possible to compute the density estimate, and therefore $\hat{H}$, in $O(N \log N)$ time: we need only sort the sample points (as in the case of the m-spacing estimator); then, to compute the integral in the definition of $\hat{H}$, we need only keep track of the $2N$ points at which the density estimate $\hat{p}(s)$ jumps up or down (at the points $\{x_i - w/2\}$ and $\{x_i + w/2\}$, respectively); the whole computation requires just a couple lines of code.
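A sketch of this $O(N \log N)$ computation (illustrative only; the function names are assumptions, and the plug-in $g = h$ is used purely for concreteness, although the same bookkeeping applies to any $g$): sort the $2N$ breakpoints $\{x_i \pm w/2\}$, between which $\hat{p}$ is constant, and accumulate the integral exactly over those flat pieces.

    import numpy as np

    def step_kernel_entropy(samples, w):
        # Exact evaluation of H_hat = integral of h(p_hat(s)) ds for the step
        # kernel, in O(N log N): p_hat is piecewise constant, jumping by +1/(Nw)
        # at each x_i - w/2 and by -1/(Nw) at each x_i + w/2.
        x = np.asarray(samples, dtype=float)
        N = len(x)
        edges = np.concatenate([x - w / 2, x + w / 2])
        jumps = np.concatenate([np.full(N, 1.0), np.full(N, -1.0)]) / (N * w)
        order = np.argsort(edges)
        edges, jumps = edges[order], jumps[order]
        levels = np.cumsum(jumps)[:-1]        # value of p_hat between breakpoints
        widths = np.diff(edges)
        h = np.where(levels > 1e-12, -levels * np.log(np.maximum(levels, 1e-300)), 0.0)
        return float(np.sum(h * widths))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(0.0, 1.0, size=100_000)
        # With w of order 1/N the raw plug-in g = h is badly biased; the point
        # here is only the O(N log N) bookkeeping, which applies to any g.
        print(step_kernel_entropy(x, w=1.0 / len(x)))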