A Probabilistic Upper Bound on Differential Entropy
Joseph DeStefano, Member, IEEE, and Erik Learned-Miller
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract— A novel, non-trivial, probabilistic upper bound on the entropy of an unknown one-dimensional distribution, given the support of the distribution and a sample from that distribution, is presented. No knowledge beyond the support of the unknown distribution is required, nor is the distribution required to have a density. Previous distribution-free bounds on the cumulative distribution function of a random variable given a sample of that variable are used to construct the bound. A simple, fast, and intuitive algorithm for computing the entropy bound from a sample is provided.

Index Terms— Differential entropy, entropy bound, string-tightening algorithm

I. INTRODUCTION

The differential entropy of a distribution [6] is a quantity employed ubiquitously in communications, statistical learning, physics, and many other fields. If X is a one-dimensional random variable with absolutely continuous distribution P(x) and density p(x), the differential entropy of X is defined to be

  h(X) = - \int p(x) \log p(x) \, dx.    (1)

For our purposes, the distribution need not have a density. If there are discontinuities in the distribution function P(x), then no density exists and the entropy is −∞.

It is well known [2] that the entropy of a distribution with support [a, b] is at most log(b − a), which is the entropy of the distribution that is uniform over the support. Given a sample of size N from an unknown distribution with this support, we cannot rule out with certainty the possibility that this sample came from the uniform distribution over this interval. Thus, we cannot hope to improve a deterministic upper bound on the entropy over such an interval when nothing more is known about the distribution.

However, given a sample from an unknown distribution, we can say that it is unlikely to have come from a distribution with entropy greater than some value. In this paper, we formalize this notion and give a specific, probabilistic upper bound for the entropy of an unknown distribution using both the support of the distribution and a sample of this distribution. To our knowledge, this is the first non-trivial upper bound on differential entropy which incorporates information from a sample and can be applied to any one-dimensional probability distribution.

Fig. 1. A typical empirical cumulative distribution (blue). The red lines show the probabilistic bounds provided by Equation 2 on the true cumulative.

II. THE BOUND

Given N samples, x^1 through x^N, from an unknown distribution P(x), we seek a bound on the entropy of P. (We assume for the remainder of the paper that N ≥ 2, as this simplifies certain analyses.) Our approach will primarily be concerned with the order statistics of the sample, x_(1) through x_(N), that is, the sample values arranged in non-decreasing order, so that x_(1) is the minimum sample value, x_(2) the next largest value, and so on. We assume that the distribution has finite support and that we know this support. For ease of exposition, we label the left end of the support x_(0) and the right end x_(N+1), making the support values act like additional order statistics of the sample. This is done merely for notational convenience and does not imply in any way that these are real samples of the random variable.

We start with a bound due to Dvoretzky, Kiefer, and Wolfowitz [4] on the supremum of the distance between the empirical N-sample cumulative, P^N(x), and the true distribution:

  Pr( \sup_x | P(x) - P^N(x) | > \epsilon ) \le 2 e^{-2 N \epsilon^2}.    (2)

Thus, with probability at least 1 − 2e^{−2Nε²}, the true cumulative nowhere differs from the empirical cumulative by more than ε. This is a distribution-free bound; that is, it is valid for any one-dimensional probability distribution. For background on such bounds and their uses, see [3].
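As a concrete illustration of the band implied by (2), the following Python sketch (ours, not from the paper; the helper name dkw_band is hypothetical) computes the empirical cumulative at the order statistics of a sample together with the ±ε envelope around it, clipped to [0, 1], as plotted in Fig. 1. It uses the standard empirical heights i/N; the peg construction of Section II-A below uses slightly different heights.

import numpy as np

def dkw_band(sample, eps):
    """Empirical CDF at the order statistics, with the +/- eps envelope of (2).

    Returns (x, f_hat, lower, upper): the sorted sample, the empirical
    cumulative f_hat(x_(i)) = i/N, and the band clipped to [0, 1].
    """
    x = np.sort(np.asarray(sample, dtype=float))   # order statistics x_(1) <= ... <= x_(N)
    n = x.size
    f_hat = np.arange(1, n + 1) / n                # standard empirical CDF heights
    lower = np.clip(f_hat - eps, 0.0, 1.0)         # probabilistic lower bound on P(x_(i))
    upper = np.clip(f_hat + eps, 0.0, 1.0)         # probabilistic upper bound on P(x_(i))
    return x, f_hat, lower, upper

# Example: a small sample on [0, 4], as in Fig. 1, with a deliberately loose eps.
rng = np.random.default_rng(0)
x, f_hat, lo, hi = dkw_band(rng.uniform(0.0, 4.0, size=10), eps=0.3)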
There is a family of cumulative distribution curves which fit the constraints of this bound, and with probability at least c the true cumulative must be one of these curves. If we can find the curve in this family with maximum entropy and compute its entropy, then we have confidence at least c that the true entropy is less than or equal to the entropy of this maximum entropy distribution.

Figure 1 shows the relationship between the empirical cumulative and the bounds provided by (2). The piecewise constant blue curve is the empirical cumulative distribution based on the sample. The red stars show the upper and lower bounds on the true cumulative for some particular ε. The green curve shows one possibility for the true cumulative distribution that satisfies the probabilistic constraints of Equation 2. Our goal is to find, of all cumulative distributions which satisfy these constraints, the one with the greatest entropy, and then to calculate the entropy of this maximum entropy distribution.

Note that in Figure 1, as in all the figures presented, a very small sample size is used to keep the figures legible. As can be seen from Equation 3 below, this results in a loose bound. In practice, a larger value of N drives the bound much closer to the empirical cumulative (with ε tending to zero), with a corresponding improvement in the resulting entropy bound.

For a desired confidence level c, we can compute a corresponding ε from Eq. 2 that meets that level:

  \epsilon = \sqrt{ \frac{ \log \frac{2}{1-c} }{ 2N } }.    (3)

We conclude that with probability c, the true distribution lies within this ε of the empirical distribution at all x.

Fig. 2. Pulling tight the ends of a loose string (the green line) threaded through the pegs will cause the string to trace out the maximum-entropy distribution that satisfies the bound (2).

A. Pegs

The points u_i = (x_(i), i/(N+1) + ε), and respectively l_i = (x_(i), i/(N+1) − ε), for 0 ≤ i ≤ N+1, describe a piecewise linear function P^+ (respectively, P^-) marking the probabilistic upper (respectively, lower) boundary of the true cumulative P. We call these points the upper (lower) pegs, and note that we clip them to the range [0, 1]. Also, our knowledge of the support of the distribution allows us to take u_0 = l_0 = (x_(0), 0) and u_(N+1) = l_(N+1) = (x_(N+1), 1).

Fig. 3. The maximum entropy cumulative distribution which fits the constraints of the Dvoretzky-Kiefer-Wolfowitz inequality for the given empirical cumulative distribution. Notice that the cumulative is piecewise linear, implying a piecewise constant density function. With probability at least c, the true cumulative distribution has entropy less than or equal to this maximum entropy distribution.
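The two quantities just introduced, the ε of Eq. 3 and the pegs, are straightforward to compute. The Python sketch below is ours, not the authors' code; the helper names and the i/(N+1) peg heights reflect our reading of Section II-A. It inverts (2) to obtain ε for a desired confidence c, then forms the clipped upper and lower pegs, pinning the two support points to 0 and 1.

import numpy as np

def dkw_epsilon(n, confidence):
    """Invert the DKW bound (2): the eps of Eq. 3 for confidence c and sample size n."""
    return np.sqrt(np.log(2.0 / (1.0 - confidence)) / (2.0 * n))

def pegs(sample, support, confidence):
    """Upper and lower peg heights at x_(0), ..., x_(N+1) for the given support and confidence.

    Heights are i/(N+1) +/- eps at the order statistics, clipped to [0, 1],
    with the support points pinned to 0 and 1 as described above.
    """
    x = np.concatenate(([support[0]], np.sort(sample), [support[1]]))  # x_(0), ..., x_(N+1)
    n = len(sample)
    eps = dkw_epsilon(n, confidence)
    heights = np.arange(n + 2) / (n + 1)             # i/(N+1) for i = 0, ..., N+1
    upper = np.clip(heights + eps, 0.0, 1.0)
    lower = np.clip(heights - eps, 0.0, 1.0)
    upper[0] = lower[0] = 0.0                        # u_0 = l_0 = (x_(0), 0)
    upper[-1] = lower[-1] = 1.0                      # u_(N+1) = l_(N+1) = (x_(N+1), 1)
    return x, upper, lower

# Example: 10 points drawn on the known support [0, 4], at 90% confidence.
x_knots, upper, lower = pegs(np.random.default_rng(1).uniform(0.0, 4.0, size=10),
                             support=(0.0, 4.0), confidence=0.90)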
The largest of the entropies of all distributions that fall within ε of the empirical cumulative (i.e., that lie entirely between P^- and P^+) provides our desired upper bound on the entropy of the true distribution, again with probability c. The distribution that achieves this entropy we call P*.

To illustrate what such a distribution looks like, one can imagine a loose string threaded between each pair of pegs placed at the u_i and l_i, as in Figure 2. When the ends are pulled tight, the string traces out a distribution which, as we show below, has the greatest possible entropy of all such distributions. Since this distribution turns out to be piecewise linear, its entropy, which again is the probability-c bound on the true entropy, is easily computed.

It is interesting to note that the extreme values of ε correspond to the empirical distribution itself (ε = 0) and to the naïve entropy estimate log(x_(N+1) − x_(0)) (ε = 1), so in some sense our algorithm produces a bound between these two extremes.

III. THE STRING-TIGHTENING ALGORITHM

We first develop several properties of P* that provide the basis for our algorithm, which we call the string-tightening algorithm.

Lemma 1: P* is linear between consecutive order statistics.

Proof: Although the true distribution may have discontinuities, the entropy of any such distribution is −∞. We can therefore restrict our search for P* to distributions with densities. First we must show that the bound (2) allows P* to be linear; that is, we must show that the curve between two consecutive order statistics is not so restricted by the bound that it cannot be linear. To do this, it suffices to show that the bound ε is always larger than the step size 1/(N+1) in the cumulative distribution.

Lemma 2: An increase (decrease) in slope of P* can occur only at the upper (lower) peg of a sample.

Proof: By contradiction. Suppose that two connected segments of P*, one from (x_(i-1), a) to (x_(i), b) and one from (x_(i), b) to (x_(i+1), d), meet with increasing slope at (x_(i), b), with b strictly below the upper peg u_i. Then there is an interval [x_(i) − γ, x_(i) + γ], with γ > 0, on which the line segment joining P*(x_(i) − γ) and P*(x_(i) + γ) still lies between P^- and P^+.
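As noted above, once the string has been pulled tight the resulting cumulative is piecewise linear, so its entropy is easily computed. The Python sketch below is ours, not the paper's pseudocode: it evaluates the differential entropy of any piecewise linear cumulative from its knot points (x_k, F_k); applied to the knots of P*, it yields the probability-c upper bound.

import numpy as np

def piecewise_linear_cdf_entropy(knots_x, knots_f):
    """Differential entropy of a piecewise linear CDF with knots (x_k, F_k).

    On each segment the density is constant, p_k = dF_k / dx_k, so
    h = -sum_k dF_k * log(p_k) = sum_k dF_k * log(dx_k / dF_k).
    Segments with dF_k = 0 carry no probability mass and contribute nothing.
    """
    x = np.asarray(knots_x, dtype=float)
    f = np.asarray(knots_f, dtype=float)
    dx = np.diff(x)
    df = np.diff(f)
    mask = df > 0                                    # skip zero-probability segments
    return float(np.sum(df[mask] * np.log(dx[mask] / df[mask])))

# Sanity check: a cumulative rising linearly from 0 to 1 over [0, 4] is the
# uniform distribution on [0, 4], whose entropy is log 4.
assert abs(piecewise_linear_cdf_entropy([0.0, 4.0], [0.0, 1.0]) - np.log(4.0)) < 1e-12

The sanity check recovers the deterministic bound of the introduction: for ε large enough that the pegs impose no constraint, the taut string is the straight line from (x_(0), 0) to (x_(N+1), 1) and the bound reduces to log(x_(N+1) − x_(0)).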