
Kullback-Leibler Divergence Estimation of Continuous Distributions

Fernando Pérez-Cruz, Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08544. Email: [email protected]

Abstract—We present a method for estimating the KL divergence between continuous densities and we prove it converges almost surely. Divergence estimation is typically solved by estimating the densities first. Our main result shows that this intermediate step is unnecessary and that the divergence can be estimated either from the empirical cdf or from k-nearest-neighbour density estimates, which do not converge to the true measure for finite k. The convergence proof is based on describing the statistics of our estimator using waiting-time distributions, such as the exponential or the Erlang. We illustrate the proposed estimators, show how they compare to existing methods based on density estimation, and outline how our divergence estimators can be used for solving the two-sample problem.

I. INTRODUCTION

The Kullback-Leibler divergence [11] measures the distance between two density distributions. This divergence is also known as information divergence and relative entropy. If the densities P and Q exist with respect to a Lebesgue measure, the Kullback-Leibler divergence is given by:

    D(P||Q) = \int_{\mathbb{R}^d} p(x) \log\frac{p(x)}{q(x)}\, dx \ge 0.    (1)

This divergence is finite whenever P is absolutely continuous with respect to Q, and it is zero only if P = Q.
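For later reference (a standard closed form, not stated in the paper), the divergence between two univariate Gaussians is available analytically, which is handy for checking the estimators of Sections II and III on the Gaussian pairs used in Section IV:

    D(\mathcal{N}(\mu_1,\sigma_1^2)\,||\,\mathcal{N}(\mu_2,\sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.

For N(0, 2) versus N(0, 1) this gives 0.5 - 0.5 log 2, approximately 0.153 nats, which is consistent with the scale of the dotted reference line in Figure 1(b).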
The KL divergence is central to information theory and statistics. Mutual information, which measures the information one random variable contains about a related random variable, can be computed as a special case of the KL divergence, and from the mutual information we can define the entropy and conditional entropy of a random variable as its self-information. The KL divergence can be directly defined as the mean of the log-likelihood ratio, and it is the exponent in large deviation theory. The two-sample problem can also be naturally approached with this divergence, as its goal is to detect whether two sets of samples have been drawn from the same distribution [1].

The KL divergence also plays a leading role in machine learning and neuroscience. In Bayesian machine learning, it is typically used to approximate an intractable density model: for example, expectation propagation [16] iteratively fits an exponential family model to the desired density by minimising the inclusive KL divergence D(P||P_app), while variational methods [15] minimize the exclusive KL divergence D(P_app||P) to fit the best approximation to P. In neuroscience, an information-theoretic analysis of neural data is unavoidable given the questions neurophysiologists are interested in; see [19] for a detailed discussion of mutual information estimation in neuroscience. There are other research areas in which KL divergence estimation is used to measure the difference between two density functions; for example, in [17] it is used for multimedia classification and in [8] for text classification.

In this paper, we focus on estimating the KL divergence for continuous random variables from independent and identically distributed (i.i.d.) samples. Specifically, we address the issue of estimating this divergence without estimating the densities, i.e. the density estimates used to compute the KL divergence do not converge to their measures as the number of samples tends to infinity. In a way, we follow Vapnik's advice [20] about not solving an intermediate (harder) problem to estimate the quantity we are interested in. We propose a new method for estimating the KL divergence based on the empirical cumulative distribution function (cdf) and we show that it converges almost surely to the actual divergence.

There are several approaches to estimating this divergence from samples of continuous random variables [21], [12], [5], [22], [13], [18]; see also the references therein. Other methods concentrate on estimating the divergence for discrete random variables [19], [4], but we will not discuss them further, as they lie outside the scope of this paper. Most of these approaches are based on estimating the densities first, hence ensuring the convergence of the estimator to the divergence as the number of samples tends to infinity. For example, in [21] the authors propose to estimate the densities with data-dependent histograms that place a fixed number of samples from q(x) in each bin, and in [5] the authors compute relative frequencies on data-driven partitions achieving local independence, for estimating mutual information. In [12] local likelihood density estimation is used to estimate the divergence between a parametric model and the available data. In [18] the authors compute the divergence between p(x) and q(x) using a variational approach, in which convergence is proven by ensuring that the estimate of p(x)/q(x) converges to the true measure ratio. Finally, we only know of two previous approaches based on k-nearest-neighbour density estimation [22], [13], in which the authors prove mean-square consistency of the divergence estimator for finite k, even though this density estimate does not converge to its measure. A good survey of the different proposals for entropy estimation can be found in [3].

The rest of the paper is organized as follows. We present the proposed method for one-dimensional data in Section II, together with its proof of convergence. In Section III we extend our proposal to multidimensional problems, and in Section III-A we discuss how to extend this approach to kernels, which is of relevance for solving the two-sample problem with non-real-valued data, such as graphs or sequences. In Section IV we compute the KL divergence for known and unknown density models and indicate how it can be used for solving the two-sample problem. We conclude the paper in Section V with some final remarks and proposed further work.

II. DIVERGENCE ESTIMATION FOR 1D DATA

We are given n i.i.d. samples from p(x), X = \{x_i\}_{i=1}^n, and m i.i.d. samples from q(x), X' = \{x'_j\}_{j=1}^m; without loss of generality we assume the samples in these sets are sorted in increasing order. Let P(x) and Q(x), respectively, denote the absolutely continuous cdfs of p(x) and q(x). The empirical cdf is given by:

    P_e(x) = \frac{1}{n} \sum_{i=1}^{n} U(x - x_i)    (2)

where U(x) is the unit-step function with U(0) = 0.5. We also define a continuous piecewise-linear extension of P_e(x):

    P_c(x) = \begin{cases} 0, & x < x_0 \\ a_i x + b_i, & x_{i-1} \le x < x_i \\ 1, & x_{n+1} \le x \end{cases}    (3)

where a_i and b_i are chosen so that P_c(x) takes the same value as P_e(x) at the sampled points, which yields a continuous approximation. Here x_0 < \inf\{X\} and x_{n+1} > \sup\{X\}; their exact values are inconsequential for our estimate. Both of these empirical cdfs converge uniformly, and independently of the distribution, to their cdfs [20].
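For illustration only (not part of the paper), the construction in (2)-(3) can be coded directly; the end points x_0 and x_{n+1} below are set arbitrarily, since the text notes that their exact values are inconsequential:

import numpy as np

def piecewise_linear_cdf(samples):
    # Sketch of P_c in (3): linear interpolation of the empirical cdf (2),
    # which takes the value (i - 0.5)/n at the i-th sorted sample (U(0) = 0.5).
    xs = np.sort(np.asarray(samples, dtype=float))
    n = len(xs)
    pad = np.median(np.diff(xs))                 # arbitrary choice of x_0, x_{n+1}
    knots = np.concatenate(([xs[0] - pad], xs, [xs[-1] + pad]))
    vals = np.concatenate(([0.0], (np.arange(1, n + 1) - 0.5) / n, [1.0]))
    return lambda t: np.interp(t, knots, vals)   # 0 below x_0, 1 above x_{n+1}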
The proposed divergence estimator is given by:

    \hat{D}(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\delta P_c(x_i)}{\delta Q_c(x_i)}    (4)

where \delta P_c(x_i) = P_c(x_i) - P_c(x_i - \epsilon) for any \epsilon < \min_i\{x_i - x_{i-1}\}.

Theorem 1. Let P and Q be absolutely continuous probability measures and assume their KL divergence is finite. Let X = \{x_i\}_{i=1}^n and X' = \{x'_i\}_{i=1}^m be i.i.d. samples sorted in increasing order, respectively, from P and Q. Then

    \hat{D}(P||Q) - 1 \xrightarrow{a.s.} D(P||Q)    (5)

Proof: We can rearrange (4) as follows:

    \hat{D}(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta P_c(x_i)/\Delta x_i}{\Delta Q_c(x'_{mi})/\Delta x'_{mi}}    (6)

where \Delta P_c(x_i) = P_c(x_i) - P_c(x_{i-1}), \Delta x_i = x_i - x_{i-1}, \Delta x'_{mi} = \min\{x'_j : x'_j \ge x_i\} - \max\{x'_j : x'_j < x_i\} and \Delta Q_c(x'_{mi}) = Q_c(\min\{x'_j : x'_j \ge x_i\}) - Q_c(\max\{x'_j : x'_j < x_i\}). The equality holds because P_c(x) and Q_c(x) are piecewise-linear approximations to their cdfs. Let us rearrange (6) as follows:

    \hat{D}(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta P(x_i)/\Delta x_i}{\Delta Q(x'_{mi})/\Delta x'_{mi}} - \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta P(x_i)}{\Delta P_c(x_i)} + \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta Q(x'_{mi})}{\Delta Q_c(x'_{mi})} = \hat{D}_e(P||Q) - C_1(P) + C_2(P,Q)    (7)

The first term in (7) satisfies

    \hat{D}_e(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta P(x_i)/\Delta x_i}{\Delta Q(x'_{mi})/\Delta x'_{mi}} \xrightarrow{a.s.} D(P||Q),    (8)

because \Delta P(x_i)/\Delta x_i \to p(x_i) and \Delta Q(x'_{mi})/\Delta x'_{mi} \to q(x_i) as the number of samples tends to infinity, since p(x) is absolutely continuous with respect to q(x).

The second term in (7) is

    C_1(P) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\Delta P(x_i)}{\Delta P_c(x_i)} = \frac{1}{n} \sum_{i=1}^{n} \log\, n\Delta P(x_i)    (9)

since \Delta P_c(x_i) = 1/n between two consecutive samples of X. As x_i is distributed according to p(x), P(x_i) is distributed as a uniform random variable between 0 and 1, and z_i = n\Delta P(x_i) is the difference (waiting time) between two consecutive samples of a uniform distribution between 0 and n with one arrival per unit time; therefore it is distributed as a unit-mean exponential random variable. Consequently,

    C_1(P) = \frac{1}{n} \sum_{i=1}^{n} \log z_i \xrightarrow{a.s.} \int_0^{\infty} \log z\, e^{-z}\, dz = -0.5772,    (10)

which is the negated Euler-Mascheroni constant.

The third term in (7) is

    C_2(P,Q) = \frac{1}{n} \sum_{j=1}^{m} n\Delta P_e(x'_j) \log \frac{\Delta Q(x'_j)}{\Delta Q_c(x'_j)} = \frac{1}{m} \sum_{j=1}^{m} \frac{\Delta P_e(x'_j)/\Delta x'_j}{\Delta Q(x'_j)/\Delta x'_j}\, m\Delta Q(x'_j) \log\big(m\Delta Q(x'_j)\big)    (11)

where n\Delta P_e(x'_j) counts the number of samples from the set X between two consecutive samples from X', and \Delta Q_c(x'_j) = 1/m between such samples. As before, m\Delta Q(x'_j) is distributed as a unit-mean exponential, independently of q(x), and \Delta Q(x'_j)/\Delta x'_j and \Delta P_e(x'_j)/\Delta x'_j tend, respectively, to q(x) and to p_e(x); hence

    C_2(P,Q) \xrightarrow{a.s.} \int \frac{p_e(x)}{q(x)}\, z \log z\, e^{-z}\, q(x)\, dz\, dx = \int_0^{\infty} z \log z\, e^{-z}\, dz \int_{\mathbb{R}} p_e(x)\, dx = 0.4228,    (12)

where p_e(x) is a density model, but it does not need to tend to p(x) for C_2(P,Q) to converge to 0.4228, i.e. one minus the Euler-Mascheroni constant.

The three terms in (7) converge almost surely to D(P||Q), -0.5772 and 0.4228, respectively, by the strong law of large numbers, and hence so does their sum [7].

From the last equality in (6), we can see that we are using a data-dependent histogram with one sample per bin as the density estimate for p(x) and q(x), e.g. \hat{p}(x_i) = 1/(n\Delta x_i), to estimate the KL divergence. In [14], the authors show that data-dependent histograms converge to their true measures when two conditions are met. The first condition states that the number of bins must grow sublinearly with the number of samples, and it is violated by our density estimate. Hence our KL divergence estimator converges almost surely even though it is based on non-convergent density estimates.
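To make the estimator concrete, the following is a minimal sketch (our own code, not the authors') of (4) through the equivalent slope-ratio form (6), including the -1 correction of Theorem 1; the handling of samples of X falling outside the range of X' is our own choice, since the paper leaves the end points unspecified:

import numpy as np

def kl_1d_cdf(x, xq):
    # x: i.i.d. samples from p(x); xq: i.i.d. samples from q(x).
    # Returns \hat{D}(P||Q) - 1, which converges a.s. to D(P||Q) (Theorem 1).
    x = np.sort(np.asarray(x, dtype=float))
    xq = np.sort(np.asarray(xq, dtype=float))
    n, m = len(x), len(xq)
    dx = np.diff(x)                       # Delta x_i = x_i - x_{i-1}
    xi = x[1:]                            # x_1 has no left neighbour; skip it
    j = np.searchsorted(xq, xi)           # index of the first x'_j >= x_i
    j = np.clip(j, 1, m - 1)              # assumption: clip at the ends of X'
    dxq = xq[j] - xq[j - 1]               # Delta x'_{mi}
    # ratio of cdf slopes: (1/(n*dx)) / (1/(m*dxq)) = m*dxq / (n*dx)
    return np.mean(np.log((m * dxq) / (n * dx))) - 1.0

For samples drawn from N(0, 2) and N(0, 1), the returned value should approach the closed-form divergence of roughly 0.153 nats as n and m grow.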
III. DIVERGENCE ESTIMATOR FOR VECTORIAL DATA

The procedure for estimating the divergence from samples proposed in the previous section is based on the empirical cdf, and it is not straightforward to extend it to vectorial data. However, taking a closer look at the last part of equation (6), we can reinterpret our estimator as follows: first compute nearest-neighbour estimates of p(x) and q(x), and then use these estimates to compute the divergence,

    \hat{D}(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}(x_i)}{\hat{q}(x_i)} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{m\Delta x'_{mi}}{n\Delta x_i}    (13)

where we employ the nearest neighbour smaller than x_i in X to estimate p(x_i), and the two nearest neighbours in X', one smaller and one larger than x_i, to estimate q(x_i). We showed that \hat{D}(P||Q) - 1 converges to the KL divergence even though \hat{p}(x_i) and \hat{q}(x_i) do not converge to their true measures, and nearest neighbours can be readily used for multidimensional data.

The idea of using k-nearest-neighbour density estimation as an intermediate step to estimate the KL divergence was put forward in [22], [13]; it follows a similar idea proposed for estimating differential entropy [9], which has also been used to estimate mutual information [10]. In [22], [13], the authors prove mean-square consistency of their estimator for finite k under some regularity conditions imposed on the densities p(x) and q(x), since for finite k the nearest-neighbour density estimates do not converge to their measures. From our point of view their proof is rather technical. In this paper, we prove the almost sure convergence of this KL divergence estimator using waiting-time distributions, without needing to impose additional conditions on the density models. Given n i.i.d. samples from p(x) and m i.i.d. samples from q(x), we can estimate D(P||Q) from a k-nearest-neighbour density estimate as follows:

    \hat{D}_k(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{p}_k(x_i)}{\hat{q}_k(x_i)} = \frac{d}{n} \sum_{i=1}^{n} \log \frac{s_k(x_i)}{r_k(x_i)} + \log \frac{m}{n-1}    (14)

where

    \hat{p}_k(x_i) = \frac{k}{n-1}\, \frac{\Gamma(d/2+1)}{\pi^{d/2}\, r_k^d(x_i)}    (15)

    \hat{q}_k(x_i) = \frac{k}{m}\, \frac{\Gamma(d/2+1)}{\pi^{d/2}\, s_k^d(x_i)}    (16)

and r_k(x_i) and s_k(x_i) are, respectively, the Euclidean distances to the k-th nearest neighbour of x_i in X \ \{x_i\} and in X', and \pi^{d/2}/\Gamma(d/2+1) is the volume of the unit ball in \mathbb{R}^d. Before proving that (14) converges almost surely to D(P||Q), let us show an intermediate, necessary result.

Lemma 1. Given n i.i.d. samples X = \{x_i\}_{i=1}^n from an absolutely continuous probability distribution P, the random variable p(x)/\hat{p}_1(x) converges in probability to a unit-mean exponential distribution for any x in the support of p(x).

Proof: Let us initially assume that p(x) is a d-dimensional uniform distribution over a given support. The set S_{x,R} = \{x_i : ||x_i - x||_2 \le R\} contains all the samples from X inside the ball of radius R centred at x. Therefore, the values \{||x_i - x||_2^d : x_i \in S_{x,R}\} are uniformly distributed between 0 and R^d, provided the ball lies inside the support of p(x). Hence the random variable r_1^d(x) = \min_{x_j \in S_{x,R}} ||x_j - x||_2^d is an exponential random variable, as it measures the waiting time between the origin and the first event of a uniformly spaced distribution [2]. Since p(x)\, n\, \pi^{d/2}/\Gamma(d/2+1) is the mean number of samples per unit ball centred at x, p(x)/\hat{p}_1(x) is distributed as an exponential distribution with unit mean. This holds for all n; it is not an asymptotic result.

For a non-uniform absolutely continuous p(x), P(r_1(x) > \epsilon) \to 0 as n \to \infty for any x in the support of p(x) and every \epsilon > 0. Therefore, as n tends to infinity we can consider x and its nearest neighbour in X to come from a uniform distribution, and hence p(x)/\hat{p}_1(x) converges in probability to a unit-mean exponential distribution.

Corollary 1. Given n i.i.d. samples X = \{x_i\}_{i=1}^n from an absolutely continuous probability distribution p(x), the random variable p(x)/\hat{p}_k(x) converges in probability to a gamma distribution with unit mean and variance 1/k for any x in the support of p(x).

Proof: In the previous proof, instead of measuring the waiting time to the first event, we measure the waiting time to the k-th event of a uniformly spaced distribution. This waiting time follows an Erlang distribution, i.e. a gamma distribution with unit mean and variance 1/k.
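A quick numerical check of Corollary 1 (our own illustration, not part of the paper): draw repeated sample sets from a standard normal, evaluate p(x)/\hat{p}_k(x) at a fixed point, and compare the empirical mean and variance with the predicted values 1 and 1/k.

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(0)
d, n, k, trials = 2, 2000, 3, 500
x0 = np.zeros(d)                               # evaluation point
p_x0 = (2 * np.pi) ** (-d / 2)                 # standard-normal density at x0
vol = np.pi ** (d / 2) / gamma_fn(d / 2 + 1)   # volume of the unit ball

ratios = []
for _ in range(trials):
    x = rng.standard_normal((n, d))
    rk = cKDTree(x).query(x0, k=k)[0][-1]      # k-th nearest-neighbour distance
    p_hat = k / (n * vol * rk ** d)            # k-NN density estimate at x0
    ratios.append(p_x0 / p_hat)

print(np.mean(ratios), np.var(ratios))         # should be close to 1 and 1/k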
Now we can prove the almost sure convergence of the KL divergence estimator based on k-nearest-neighbour density estimation.

Theorem 2. Let P and Q be absolutely continuous probability measures and assume their KL divergence is finite. Let X = \{x_i\}_{i=1}^n and X' = \{x'_i\}_{i=1}^m be i.i.d. samples, respectively, from P and Q. Then

    \hat{D}_k(P||Q) \xrightarrow{a.s.} D(P||Q)    (17)

Proof: We can rearrange \hat{D}_k(P||Q) in (14) as follows:

    \hat{D}_k(P||Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{p(x_i)}{q(x_i)} - \frac{1}{n} \sum_{i=1}^{n} \log \frac{p(x_i)}{\hat{p}_k(x_i)} + \frac{1}{n} \sum_{i=1}^{n} \log \frac{q(x_i)}{\hat{q}_k(x_i)}    (18)

The first term converges almost surely to the KL divergence between P and Q, and the second and third terms both converge almost surely to \int_0^{\infty} z^{k-1} \log z\, e^{-z}\, dz/(k-1)!, because the sum of random variables that converge in probability converges almost surely [7]. Finally, the sum of almost surely convergent terms also converges almost surely [7].
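A minimal sketch of the estimator in (14)-(16) for real-valued vectorial data (the function name, the use of scipy's k-d tree and the handling of k = 1 are our own choices, not the paper's):

import numpy as np
from scipy.spatial import cKDTree

def kl_knn(x, xq, k=1):
    # x: samples from p, shape (n, d); xq: samples from q, shape (m, d).
    # Assumes continuous data: duplicate points give zero distances and log(0).
    x = np.asarray(x, dtype=float)
    xq = np.asarray(xq, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if xq.ndim == 1:
        xq = xq[:, None]
    n, d = x.shape
    m = xq.shape[0]
    # r_k(x_i): distance to the k-th nearest neighbour of x_i in X \ {x_i}
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # s_k(x_i): distance to the k-th nearest neighbour of x_i in X'
    s = cKDTree(xq).query(x, k=k)[0]
    if k > 1:
        s = s[:, -1]
    # equation (14): (d/n) * sum log(s_k/r_k) + log(m/(n-1))
    return (d / n) * np.sum(np.log(s / r)) + np.log(m / (n - 1.0))

With a few thousand samples from N(0, 2) and N(0, 1), for instance, kl_knn should fluctuate around the closed-form value of roughly 0.153 nats quoted after (1).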

In the proof of Theorem 2, we can see that the first term of (18) is an unbiased estimator of the divergence and, as the other two terms cancel each other out in the limit, one could think that the estimator is unbiased. But this is not the case, because the convergence rates of the second and third terms to their means are not equal. For example, for k = 1, p(x_i)/\hat{p}_1(x_i) converges much faster to an exponential distribution than q(x_i)/\hat{q}_1(x_i) does, because the x_i samples come from p(x). For samples from p(x) lying in a low-probability region of q(x), many samples from q(x) are needed before the nearest neighbour is close enough for q(x_i)/\hat{q}_1(x_i) to be distributed like an exponential. Hence this estimator is biased, and the bias depends on the distributions.

If the divergence is zero, the estimator is unbiased, as the distributions of p(x_i)/\hat{p}_k(x_i) and q(x_i)/\hat{q}_k(x_i) are identical. For the two-sample problem this is a very interesting result, as it allows us to measure the variance of our estimator for P = Q and to set the threshold for rejecting the null hypothesis according to a fixed probability of type I errors (false positives).

A. Estimating KL Divergence with Kernels

The KL divergence estimator in (14) can be computed using kernels. This extension allows measuring the divergence for non-real-valued data, such as graphs or sequences, which could not be measured otherwise. There is only one previous proposal for solving the two-sample problem with kernels [6], which makes it possible to test whether two non-real-valued sets belong to the same distribution.

To compute (14) with kernels we need to measure the distance to the k-th nearest neighbour of x_i. Let x_{nn_i^k} and x'_{nn_i^k} be, respectively, the k-th nearest neighbour of x_i in X \ \{x_i\} and in X'. Then

    r_k(x_i) = \sqrt{k(x_i, x_i) + k(x_{nn_i^k}, x_{nn_i^k}) - 2 k(x_i, x_{nn_i^k})}    (19)

    s_k(x_i) = \sqrt{k(x_i, x_i) + k(x'_{nn_i^k}, x'_{nn_i^k}) - 2 k(x_i, x'_{nn_i^k})}    (20)

Finally, to measure the divergence we need to set the dimension d of our feature space. For kernels with finite VC dimension, such as polynomial kernels, d is the VC dimension of the kernel, while for kernels with infinite VC dimension we set d = n + m - 1, as our data cannot live in a space larger than that.
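A minimal sketch of (19)-(20) for kernelised data, assuming the Gram matrices are available (the function name and the clipping of small negative values are our own choices); the resulting r_k and s_k, together with a choice of d as discussed above, can be plugged into (14):

import numpy as np

def kernel_knn_distances(K_xx, K_xq, K_qq, k=1):
    # K_xx = k(X, X) of shape (n, n); K_xq = k(X, X') of shape (n, m);
    # K_qq = k(X', X') of shape (m, m).
    dx = np.diag(K_xx)
    dq = np.diag(K_qq)
    # squared feature-space distances, clipped to avoid tiny negative values
    d2_xx = np.maximum(dx[:, None] + dx[None, :] - 2 * K_xx, 0.0)
    d2_xq = np.maximum(dx[:, None] + dq[None, :] - 2 * K_xq, 0.0)
    np.fill_diagonal(d2_xx, np.inf)                    # exclude x_i itself
    r_k = np.sqrt(np.sort(d2_xx, axis=1)[:, k - 1])    # equation (19)
    s_k = np.sqrt(np.sort(d2_xq, axis=1)[:, k - 1])    # equation (20)
    return r_k, s_k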
IV. EXPERIMENTS

We have conducted three series of experiments to show the performance of the proposed divergence estimators. First, we have estimated the divergence using (4) for a unit-mean exponential versus a N(3, 4) density, and for two zero-mean Gaussians with variances 2 and 1; the results are shown as solid lines in Figure 1. For comparison purposes, we have also plotted the divergence estimator proposed in [21] (Algorithm A, with \sqrt{m} bins for the density estimation), which was shown in [21] to be more accurate than the divergence estimator in [5] and than one based on Parzen-window density estimation. Each curve is the mean value of 100 independent trials. We have not depicted the variance of each estimator for clarity; both are similar and tend towards zero as 1/n. In Figure 1 we can see that the proposed estimator is more accurate than the one in [21], as it converges faster to the true divergence as the number of samples increases.

Fig. 1. Divergence estimation for a unit-mean exponential and N(3, 4) in (a), and for N(0, 2) and N(0, 1) in (b). The solid lines represent the estimator in (4) and the dashed lines the estimator in [21]. The dotted lines show the KL divergences.
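For a rough sanity check of this first experiment (our own illustration, reusing the hypothetical kl_1d_cdf sketch from Section II rather than the authors' code), the Gaussian pair can be compared against the closed form quoted in Section I:

import numpy as np

rng = np.random.default_rng(0)
true_kl = 0.5 - 0.5 * np.log(2.0)        # closed-form D(N(0,2)||N(0,1)), ~0.153 nats
for n in (100, 1000, 10000):
    runs = [kl_1d_cdf(rng.normal(0.0, np.sqrt(2.0), n),
                      rng.normal(0.0, 1.0, n)) for _ in range(20)]
    print(n, np.mean(runs), true_kl)     # the estimate should approach true_kl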

In the second experiment we measure the divergence between two 2-dimensional densities:

    p(x) = \mathcal{N}(\mathbf{0}, \mathbf{I})    (21)

    q(x) = \mathcal{N}\left( \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix}, \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 0.3 \end{bmatrix} \right)    (22)

In Figure 2(a) we have contour-plotted these densities, and in Figure 2(b) we have depicted the mean values of \hat{D}_k(P||Q) and \hat{D}_k(Q||P) over 100 independent experiments with k = 1 and k = 10. We can see that \hat{D}_k(Q||P) converges much faster to its divergence than \hat{D}_k(P||Q) does. This can be readily understood by looking at the density contours in Figure 2(a). When we compute s_k(x_i) for \hat{D}_k(Q||P), there is always a sample from p(x) close by for every sample of q(x); hence both ratios p/\hat{p}_k and q/\hat{q}_k quickly converge to their limiting unit-mean, 1/k-variance gamma distribution. The converse is not true, as there is a high-density region of p(x) that is not well covered by q(x), so q(x_i)/\hat{q}_k(x_i) needs many more samples to converge than p(x_i)/\hat{p}_k(x_i) does, which explains the strong bias of this estimator. As we increase k, the divergence estimate takes longer to converge in both cases, because the tenth nearest neighbour is further away than the first, and so the convergence of p(x_i)/\hat{p}_k(x_i) and q(x_i)/\hat{q}_k(x_i) to their limiting distributions needs more samples.

Fig. 2. In (a) we have plotted the contour lines of p(x) (dashed) and of q(x) (solid). In (b) we have plotted \hat{D}_k(Q||P) (solid) and \hat{D}_k(P||Q) (dashed); the curves with bullets represent the results for k = 10. The dotted and dash-dotted lines represent, respectively, D(Q||P) and D(P||Q).

Finally, we have estimated the divergence between the threes and the twos in the MNIST dataset (http://yann.lecun.com/exdb/mnist/) in a 784-dimensional space; we have used all the MNIST data (training and test) for these experiments. In Figure 3(a) we have plotted the mean values over 100 experiments of \hat{D}_1(3, 2) (solid line) and \hat{D}_1(3, 3) (dashed line), together with their two-standard-deviation confidence intervals. As expected, \hat{D}_1(3, 3) is unbiased, so it is close to zero for any sample size. \hat{D}_1(3, 2) seems to level off around 260 nats, but we do not believe this is the true divergence between the threes and the twos, as we need to resample from a population of only around 7000 samples per digit in each experiment. Nevertheless, with as few as 20 samples we can clearly distinguish between these two populations. For comparison purposes we have plotted the MMD test from [6], in which a kernel method was proposed for solving the two-sample problem; we have used the code available at http://www.kyb.mpg.de/bs/people/arthur/mmd.htm with its bootstrap estimate for our comparisons. Although a more thorough examination is required, it seems that our divergence estimator performs similarly to [6] for the two-sample problem without needing to choose a kernel and its hyperparameters.

Fig. 3. In (a) we have plotted \hat{D}_1(3||2) (solid), \hat{D}_1(3||3) (dashed) and their ±2 standard deviation confidence intervals (dotted). In (b) we have repeated the same plots using the MMD test from [6].
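The two-sample use suggested above can also be approximated with a simple permutation test built on the hypothetical kl_knn sketch from Section III (the paper itself sets the threshold from the variance of the estimator under P = Q rather than by permutation):

import numpy as np

def two_sample_pvalue(x, xq, k=1, n_perm=200, rng=None):
    # Pool the two sets, re-split them at random, and compare the observed
    # divergence estimate against the permutation distribution.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    xq = np.asarray(xq, dtype=float)
    observed = kl_knn(x, xq, k=k)
    pooled = np.concatenate([x, xq], axis=0)
    n = len(x)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        null.append(kl_knn(pooled[perm[:n]], pooled[perm[n:]], k=k))
    # one-sided p-value: fraction of random splits at least as divergent
    return float(np.mean(np.asarray(null) >= observed))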

V. CONCLUSIONS AND FURTHER WORK

We have proposed a divergence estimator based on the empirical cdf which does not need to estimate the densities as an intermediate step, and we have proven its almost sure convergence to the true divergence. The extension to vectorial data coincides with a divergence estimator based on k-nearest-neighbour density estimation that had already been proposed in [22], [13]. In this paper we prove its almost sure convergence, and we do not need to impose additional conditions on the densities to ensure convergence, as is needed in [22], [13] to prove mean-square convergence. We illustrated in the experimental section that the proposed estimators are more accurate than estimators based on convergent density estimation. Finally, we have also suggested that this divergence estimator can be used for solving the two-sample problem, although a thorough examination of its merits for that task is left as further work.

REFERENCES

[1] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41–54, 1994.
[2] K. Balakrishnan and A. P. Basu. The Exponential Distribution: Theory, Methods and Applications. Gordon and Breach Publishers, Amsterdam, Netherlands, 1996.
[3] J. Beirlant, E. Dudewicz, L. Györfi, and E. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, pages 17–39, 1997.
[4] H. Cai, S. Kulkarni, and S. Verdú. Universal divergence estimation for finite-alphabet sources. IEEE Trans. Information Theory, 52(8):3456–3475, 2006.
[5] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Information Theory, 45(4):1315–1321, 1999.
[6] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.
[7] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, UK, 3rd edition, 2001.
[8] I. S. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003.
[9] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23(2):95–101, 1987.
[10] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):1–16, 2004.
[11] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
[12] Y. K. Lee and B. U. Park. Estimation of Kullback-Leibler divergence by local likelihood. Annals of the Institute of Statistical Mathematics, 58(2):327–340, 2006.

[13] N. N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 2007. Submitted.
[14] G. Lugosi and A. Nobel. Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24(2):687–706, 1996.
[15] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003.
[16] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[17] P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. Technical Report HPL-2004-4, HP Laboratories, Cambridge, MA, 2004.
[18] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric estimation of the likelihood ratio and divergence functionals. In IEEE Int. Symp. Information Theory, Nice, France, 2007.
[19] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[20] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[21] Q. Wang, S. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Information Theory, 51(9):3064–3074, 2005.
[22] Q. Wang, S. Kulkarni, and S. Verdú. A nearest-neighbor approach to estimating divergence between continuous random vectors. In IEEE Int. Symp. Information Theory, Seattle, USA, 2006.