arXiv:1801.00852v1 [math.PR] 2 Jan 2018

A Concentration Result of Estimating Phi-Divergence using Data Dependent Partition

Fengqiao Luo and Sanjay Mehrotra

Department of Industrial Engineering and Management Science, Northwestern University

September 18, 2018

Abstract

Estimation of the φ-divergence between two unknown probability distributions using empirical data is a fundamental problem in information theory and statistical learning. We consider a multi-variate generalization of the data dependent partitioning method for estimating the divergence between the two unknown distributions. Under the assumption that the distribution satisfies a power law of decay, we provide a convergence rate result for this method on the number of samples and hyper-rectangles required to ensure that the estimation error is bounded by a given level with a given probability.

1 Introduction

Let P and Q be two probability distributions defined on (R^d, B_{R^d}), where B_{R^d} is the Borel measure on R^d. The φ-divergence of P and Q is defined as:

    D_φ(P||Q) = ∫_{R^d} φ(dP/dQ) dQ.    (1)

The φ-divergence family includes the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), the Hellinger distance (Nikulin, 2001), the total variation distance, the χ²-divergence, and the α-divergence, among others. Many other information-theoretic quantities such as entropy and mutual information can be formulated as special cases of φ-divergence. When the distributions P and Q are unknown, the estimation of D_φ(P||Q) based on i.i.d. samples from P and Q is a fundamental problem in information theory and statistics. The estimation of φ-divergence, as well as of entropy and mutual information, has many important applications.

In statistics, divergence estimators can be used for hypothesis testing of whether two sets of i.i.d. samples are drawn from the same distribution. Methods for divergence estimation can also provide sample sizes required to achieve a given significance level in hypothesis testing (Moon and Hero, 2004). Divergence is also applicable as a loss function in evaluating and optimizing the performance of density estimation methods (Hall, 1987; Wang et al., 2009). Similarly, entropy estimation is applicable for parameter estimation in semi-parametric regression models, where the distribution function of the error is unknown (Wolsztynski et al., 2005). Entropy estimators can also be used to build goodness-of-fit tests for the entropy of a random vector (Goria et al., 2005).

In machine learning, many important algorithms for regression, classification, clustering, etc. operate on finite vector spaces. Using divergence estimation can potentially expand the scope of many machine learning algorithms by enabling them to operate on the space of probability distributions. For example, divergence estimation can be employed to construct kernel functions defined in the probability density function space. The kernel basis constructed in this way leads to better performance in multimedia classification than using the ordinary kernel basis defined in the sample space (Moreno et al., 2004). Using statistical divergence to construct kernel functions, and i.i.d. samples to estimate the divergence, has also been applied to image classification (Poczos et al., 2012). Entropy estimation and mutual information estimation have been used for texture classification (Hero et al., 2002a,b), feature selection (Peng et al., 2005), clustering (Aghagolzadeh et al., 2007), optimal experimental design (Lewi et al., 2007), fMRI data processing (Chai et al., 2009), prediction of protein structures (Adami, 2004), boosting and facial expression recognition (Shan et al., 2005), independent component and subspace analysis (Learned-Miller and Fisher, 2003; Szabó et al., 2007), as well as image registration (Hero et al., 2002a,b; Kybic, 2006).

We now put our work in the context of concentration results known in the literature. Liu et al. (2012) derived an exponential concentration bound for an estimator of the two-dimensional Shannon entropy over [0, 1]^2. Singh and Póczos (2014) generalized the method in (Liu et al., 2012) to develop an estimator of Rényi divergence for a smooth Hölder class of densities on the d-dimensional unit cube [0, 1]^d using kernel functions, and they also derived an exponential concentration inequality on the convergence rate of the estimator. Pérez-Cruz (2008) constructed an estimator of KL divergence and differential entropy, using a k-nearest-neighbor approach to approximate the densities at each sample point. They show that the estimator converges to the true value almost surely. Pál et al. (2010) constructed estimators of Rényi entropy and mutual information based on a generalized nearest-neighbor graph. They show the almost sure convergence of the estimator, and provide an upper bound on the rate of convergence for the case that the density function is Lipschitz continuous.

Wang et al. (2005) developed an estimator of Kullback-Leibler (KL) divergence for a one-dimensional sample space based on partitioning the sample space into sub-intervals. The partition depends on the observed sample points. They show that this estimator converges to the true KL divergence almost surely as the number of sub-intervals and the number of sample points inside each sub-interval go to infinity. However, they do not provide a convergence rate result for this method.

In this paper, we investigate a multivariate generalization of the data-dependent partition scheme proposed in (Wang et al., 2005) to estimate the φ-divergence. We also generalize the analysis to a family of φ-divergences. The generalized method is based on partitioning the sample space into finitely many hyper-rectangles, using counts of samples in each hyper-rectangle to estimate the divergence within the hyper-rectangle, and finally summing up the estimated divergence over all hyper-rectangles. We provide a convergence rate result for this method (Theorem 1). The results are proved under the assumption that the probability density functions satisfy the power law and certain additional regularity assumptions, which are satisfied by most well known continuous distributions such as Gaussian, exponential, χ², etc., and all light-tailed distributions.

2 A Data Dependent Partition Method for φ-Divergence Estimation

In this section, we describe a discretization method to estimate the φ-divergence D_φ(P||Q) for two unknown probability measures P and Q defined on (R^d, B_{R^d}), where B_{R^d} is the Borel measure on R^d. We assume that P is absolutely continuous with respect to Q (denoted as P ≪ Q); otherwise the divergence is infinite. We also assume that the densities of P and Q with respect to the Lebesgue measure exist, and they are denoted as p(x) and q(x), respectively. Suppose the random vectors X and Y follow the distributions P and Q, respectively, and there are i.i.d. observations {X_i}_1^{n_1} of X and {Y_i}_1^{n_2} of Y. The sample space R^d is partitioned into a number of hyperrectangles (or d-dimensional rectangles). The estimator of the divergence is the sum of the estimates in each hyperrectangle, using counts of the empirical data that fall into that hyperrectangle. In this section, we present the estimation method for a general φ-divergence. In Sections 4 and 5, we provide a rate of convergence on the number of samples needed to ensure a probability guarantee for this method.

Before introducing the space partitioning algorithm, we give the following definitions. The space partitioning algorithm is described as Algorithm 1. Our analysis is based on the density functions p(x) and q(x) satisfying the power law regularity condition defined as follows:

Definition 1. A probability density function f satisfies the power law regularity condition with parameters (c > 0, α > 0) if

    ∫_{ {x ∈ R^d : ‖x‖ > r} } f(x) dx < c/r^α,    (2)

for any r > 0.
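For instance, any density with a finite α-th absolute moment satisfies this condition: by Markov's inequality applied to ‖X‖^α,

    ∫_{ {x ∈ R^d : ‖x‖ > r} } f(x) dx = P( ‖X‖ > r ) ≤ E[‖X‖^α] / r^α    for all r > 0,

so (2) holds for any c > E[‖X‖^α]. In particular, Gaussian, exponential and χ² densities satisfy the power law regularity condition for every α > 0.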

Definition 2. Let φ be a univariate differentiable function defined on [0, ∞). The function φ is ε-regularized by {K_0(·), K_1(·,·), K_2(·)} if for any ε > 0 small enough and L > ε, the following regularity conditions hold:

(a) max_{s∈[0,L]} |φ(s)| ≤ K_0(L);

(b) max_{s∈[ε,L]} |φ'(s)| ≤ K_1(ε, L);

(c) |φ(s_2) - φ(s_1)| ≤ K_2(ε) for all s_1, s_2 ∈ [0, ε].

For any ε, L > 0, let K(ε, L) := max{ K_0(L), K_1(ε, L), K_2(ε) }.

Remark 1. If a function φ is ε-regularized by {K_0(·), K_1(·,·), K_2(·)}, then the following inequality holds:

    |φ(s_2) - φ(s_1)| ≤ K_1(ε, L) |s_2 - s_1| + 2K_2(ε)    ∀ s_1 ∈ [0, ε], ∀ s_2 ∈ [ε, L].

Table 1 provides the values of {K_0(·), K_1(·,·), K_2(·)} for some specific choices of φ-divergences.

Table 1: List of possible regularization functions for common φ-divergence distances

    φ-divergence               φ(t)            K_0(L)                   K_1(ε, L)                       K_2(ε)
    KL-divergence              t log t         L log L                  max{|log ε|, |log L|} + 1       2|ε log ε|
    Hellinger distance         (√t - 1)^2      max{1, (√L - 1)^2}       max{|1 - 1/√ε|, |1 - 1/√L|}     2√ε
    Total variation distance   (1/2)|t - 1|    max{1/2, (1/2)|L - 1|}   1/2                             (1/2)ε
    χ²-divergence              (t - 1)^2       max{1, (L - 1)^2}        max{2, 2|L - 1|}                2ε
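As an illustration of Definition 2 and Table 1, the following Python sketch collects φ and the regularization functions K_0, K_1, K_2 for the KL-divergence and χ²-divergence rows; the function names are our own shorthand and simply transcribe the table entries.

import math

# KL-divergence row of Table 1: phi(t) = t log t, K0(L) = L log L,
# K1(eps, L) = max{|log eps|, |log L|} + 1, K2(eps) = 2 |eps log eps|.
def phi_kl(t):
    return t * math.log(t) if t > 0 else 0.0   # phi(0) := 0 by continuity

def K0_kl(L):
    return L * math.log(L)

def K1_kl(eps, L):
    return max(abs(math.log(eps)), abs(math.log(L))) + 1.0

def K2_kl(eps):
    return 2.0 * abs(eps * math.log(eps))

# chi^2-divergence row: phi(t) = (t - 1)^2, K0(L) = max{1, (L - 1)^2},
# K1(eps, L) = max{2, 2|L - 1|}, K2(eps) = 2 eps.
def phi_chi2(t):
    return (t - 1.0) ** 2

def K0_chi2(L):
    return max(1.0, (L - 1.0) ** 2)

def K1_chi2(eps, L):
    return max(2.0, 2.0 * abs(L - 1.0))

def K2_chi2(eps):
    return 2.0 * eps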

Assumption 1. Regularity conditions of the divergence function φ and probability measures P and Q:

(a) The divergence function φ is ε-regularized by {K_0(·), K_1(·,·), K_2(·)}.

(b) The probability measures P and Q have density functions p(x) and q(x), respectively.

(c) The density functions p and q satisfy the power law regularity condition with parameters (c, α), and they are also Lipschitz continuous with Lipschitz constant L_1.

(d) There exists an L_2 > 0 such that p(x)/q(x) ≤ L_2 for every x ∈ R^d with q(x) > 0.

Now we describe the divergence estimation method in detail. Let P_{n_1} and Q_{n_2} be the empirical distributions of P and Q, respectively. First, we use Algorithm 1 to partition R^d into m hyperrectangles according to Q_{n_2}. Denote the partition as ℐ = {I_i}_1^m. For i ∈ [m], let p_i (resp. q_i) be the number of samples from P_{n_1} (resp. Q_{n_2}) that fall into I_i. By the construction of the partition, q_i = n_2/m. The estimator of D_φ(P||Q) is constructed as

    D̂_φ^{n_1,n_2}(P||Q) = Σ_{i=1}^m φ( P_{n_1}(I_i) / Q_{n_2}(I_i) ) Q_{n_2}(I_i),    (3)

where P_{n_1}(I_i) = p_i/n_1 and Q_{n_2}(I_i) = 1/m.
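To make (3) concrete, the following Python sketch evaluates the estimator given the two samples and a list of hyperrectangles produced by a partition routine in the spirit of Algorithm 1 (a sketch of such a routine is given after the algorithm below). The function and variable names are illustrative choices, not part of the formal description of the method.

import numpy as np

def phi_divergence_estimate(phi, X, Y, cells):
    # X: (n1, d) array of samples from P; Y: (n2, d) array of samples from Q.
    # cells: list of (lo, hi) coordinate-wise bounds describing hyperrectangles
    # (lo, hi] that partition R^d (use -inf / +inf for unbounded sides).
    estimate = 0.0
    for lo, hi in cells:
        P_hat = np.all((X > lo) & (X <= hi), axis=1).mean()   # P_{n1}(I_i) = p_i / n1
        Q_hat = np.all((Y > lo) & (Y <= hi), axis=1).mean()   # Q_{n2}(I_i) = 1/m by construction
        estimate += phi(P_hat / Q_hat) * Q_hat                # one term of (3)
    return estimate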

Algorithm 1: An algorithm to partition R^d into m equal-measure hyperrectangles with respect to an empirical probability measure μ_n.

Input: An empirical probability measure μ_n := (1/n) Σ_{i=1}^n 1_{X_i} consisting of n samples {X_i}_{i=1}^n. A partition number m satisfying m = m_0^d for some m_0 ∈ N^+. The integers m and n satisfy m | n.
Output: A partition ℐ = {I_i}_{i=1}^m of R^d such that μ_n(I_i) = 1/m for all i ∈ [m].
Initialization: Set I ← R^d, k ← 0, r(I) ← 0 and ℐ ← {I}.
for k = 1, ..., d do
    for each I ∈ ℐ with r(I) = k - 1 do
        Suppose I is of the form I = ∏_{i=1}^{k-1} J_i × R^{d-k+1}, where every J_i is in one of the three forms (-∞, a], (a, b], or (b, ∞) for some a or b.
        Partition I into m_0 subregions {I_j}_{j=1}^{m_0}, where

            I_j = ∏_{i=1}^{k-1} J_i × (-∞, a_1] × R^{d-k}            if j = 1,
            I_j = ∏_{i=1}^{k-1} J_i × (a_{j-1}, a_j] × R^{d-k}        if j = 2, ..., m_0 - 1,      (4)
            I_j = ∏_{i=1}^{k-1} J_i × (a_{m_0-1}, ∞) × R^{d-k}        if j = m_0,

        such that μ_n(I_j) = 1/m_0^k for every j ∈ [m_0]. Set r(I_j) ← k for every j ∈ [m_0].
        Let ℐ ← (ℐ \ {I}) ∪ {I_j}_{j=1}^{m_0}.
    end for
end for
Assign indices 1, ..., m to the m hyperrectangles in ℐ, respectively.
return ℐ.
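A minimal Python sketch of the partition step follows. It mirrors the coordinate-by-coordinate splitting of Algorithm 1 under two simplifying assumptions: m_0^d divides n, and the sample has no ties in any coordinate (which holds with probability one for continuous distributions). The helper name equal_measure_partition and the data layout are our own choices.

import numpy as np

def equal_measure_partition(Y, m0):
    # Partition R^d into m = m0**d hyperrectangles, each holding n / m of the
    # n sample points in Y (an (n, d) array), following Algorithm 1.
    n, d = Y.shape
    cells = [(np.full(d, -np.inf), np.full(d, np.inf), Y)]   # (lower, upper, points inside)
    for k in range(d):                                       # refine the k-th coordinate
        refined = []
        for lo, hi, pts in cells:
            pts = pts[np.argsort(pts[:, k], kind="stable")]
            chunks = np.array_split(pts, m0)                 # equal counts since m0**d divides n
            left = -np.inf
            for j, chunk in enumerate(chunks):
                right = np.inf if j == m0 - 1 else chunk[-1, k]   # cut at the chunk's largest value
                new_lo, new_hi = lo.copy(), hi.copy()
                new_lo[k], new_hi[k] = left, right           # the interval (left, right], as in (4)
                refined.append((new_lo, new_hi, chunk))
                left = right
        cells = refined
    return [(lo, hi) for lo, hi, _ in cells]

Applied to the Q-sample, the returned cells can be passed directly to the estimator sketch given after (3).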

Without loss of generality, let n_1 = n_2 = n. The error of the estimator D̂_φ^{n_1,n_2}(P||Q) with respect to the true value D_φ(P||Q) can be bounded by two terms as follows:

    | D̂_φ^{(n,n)}(P||Q) - D_φ(P||Q) |
      = | Σ_{i=1}^m φ( P_n(I_i)/Q_n(I_i) ) Q_n(I_i) - Σ_{i=1}^m ∫_{I_i} φ( dP/dQ ) dQ |
      ≤ | Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) - Σ_{i=1}^m ∫_{I_i} φ( dP/dQ ) dQ |                              =: T_1
        + | Σ_{i=1}^m φ( P_n(I_i)/Q_n(I_i) ) Q_n(I_i) - Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) |.                 =: T_2    (5)

Note that T_1 is the error of numerical integration and T_2 is the error of random sampling. The convergence rate for T_1 is given in Section 4, and the convergence rates for T_2 and for D̂_φ(P||Q) are given in Section 5.

3 A convergence rate of the data-dependent partition error

In this section, we establish a key intermediate result that bounds the error of partition generated from Algorithm 1 in probability. This result is used to prove the convergence rate presented in Sections 4 and 5.

Specifically, let μ_n be an empirical distribution consisting of n samples drawn from μ. For a partition π of R^d, the error of π is defined as Σ_{A∈π} | μ_n(A) - μ(A) |. Let 𝒜 be a (possibly infinite) family of partitions of R^d. The maximal cell count of 𝒜 is defined as m(𝒜) = sup_{π∈𝒜} |π|. The growth function is a combinatorial quantity that measures the complexity of a set family (Vapnik and Chervonenkis, 1971). One can analogously define the growth function of a partition family 𝒜. Specifically, given n points x_1, ..., x_n ∈ R^d, let B = {x_1, ..., x_n}. Let Δ(𝒜, B) be the number of distinct partitions of B of the form

    { A_1 ∩ B, ..., A_r ∩ B }    (6)

that are induced by partitions {A_1, ..., A_r} ∈ 𝒜. Note that the order of appearance of the individual sets in (6) is disregarded. The growth function of 𝒜 is then defined as

    Δ*_n(𝒜) = max_{B ∈ R^{n·d}} Δ(𝒜, B).    (7)

It is the largest number of distinct partitions of any n-point subset of R^d that can be induced by the partitions in 𝒜. The convergence rate of the partition error is given in Lemma 2.

Lemma 1 (Lugosi and Nobel (1996)). Let μ be any B_{R^d}-measurable probability distribution, and let μ_n be its empirical distribution with n samples. Let 𝒜 be any collection of partitions of R^d. For each n ≥ 1 and every ε > 0,

    P{ sup_{π∈𝒜} Σ_{A∈π} | μ_n(A) - μ(A) | > ε } ≤ 4 Δ*_{2n}(𝒜) 2^{m(𝒜)} exp(-nε^2/32).    (8)

Lemma 2 (The error of a partition). Let μ be any B_{R^d}-measurable probability distribution, and let μ_n be its empirical distribution with n i.i.d. samples {Z_i}_1^n, where Z_i ∼ μ for each i ∈ [n]. Let 𝒜 be the collection of all possible partitions of R^d generated by Algorithm 1 over all possible outcomes of {Z_i}_1^n, i.e.,

    𝒜 = ∪_{ {z_i}_1^n ∈ [supp(μ)]^n } { π : π is obtained from applying Algorithm 1 on {z_i}_1^n },    (9)

where supp(μ) is the support of μ in R^d. If n > N*(m, d, ε, δ), then

    P{ sup_{π∈𝒜} Σ_{A∈π} | μ_n(A) - μ(A) | > ε } ≤ δ,    (10)

where N*(m, d, ε, δ) := max{ c_1 m^{d+1/d-1}/ε^4, c_2 [log(1/δ) + m]/ε^2 }, and c_1, c_2 are some constants.

Proof. In Algorithm 1, each k-level hyperrectangle is partitioned into m^{1/d} (k+1)-level hyperrectangles with equal measure with respect to μ_n, for every k ∈ {0, 1, ..., d-1}. Consider a modified algorithm, named ModAlg, in which each k-level hyperrectangle is partitioned into m^{1/d} (k+1)-level hyperrectangles in an arbitrary way for every k ∈ {0, 1, ..., d-1}, and let 𝒜' be the collection of all possible final partitions generated from this modified algorithm. Clearly, we have 𝒜 ⊂ 𝒜'. Now we estimate the growth function Δ*_{2n}(𝒜'). For any fixed 2n points in R^d, the number of distinct partitions produced by ModAlg at level 0 is \binom{2n + m^{1/d}}{m^{1/d}}. At level k, there are m^{k/d} k-level hyperrectangles to be partitioned. For a specific k-level hyperrectangle I^{(k)} containing l points, there are \binom{l + m^{1/d}}{m^{1/d}} ways to partition these l points at level k. Therefore,

    Δ*_{2n}(𝒜) ≤ Δ*_{2n}(𝒜') ≤ ∏_{k=0}^{d-1} \binom{2n + m^{1/d}}{m^{1/d}}^{m^{k/d}} = \binom{2n + m^{1/d}}{m^{1/d}}^{m^{(d-1)/2}}.    (11)

Now we need to find a threshold of n to ensure

    4 Δ*_{2n}(𝒜) 2^{m(𝒜)} exp(-nε^2/32) < δ    (12)

in Lemma 1. Note that m(𝒜) = m. Taking the logarithm on both sides of (12), we impose that

    log 4 + m log 2 + log Δ*_{2n}(𝒜) - nε^2/32 < log δ
    ⇐ nε^2/32 - log Δ*_{2n}(𝒜) > log(1/δ) + m log 2 + log 4    (13)
    ⇐ nε^2/32 - m^{(d-1)/2} log \binom{2n + m^{1/d}}{m^{1/d}} > log(1/δ) + m log 2 + log 4.

By the inequality log \binom{s}{t} ≤ s·h(t/s), where h(x) = -x log x - (1-x) log(1-x) for x ∈ (0, 1), we have

    (13) ⇐ nε^2/32 - m^{(d-1)/2} (2n + m^{1/d}) h( m^{1/d}/(2n + m^{1/d}) ) > log(1/δ) + m log 2 + log 4.    (14)

First, we impose that n > m^{1/d}, and then we have

    (2n + m^{1/d}) h( m^{1/d}/(2n + m^{1/d}) ) ≤ 3n h( m^{1/d}/(2n) )
      ≤ 3n [ (m^{1/d}/(2n)) log( 2n/m^{1/d} ) - (1 - m^{1/d}/(2n)) log( 1 - m^{1/d}/(2n) ) ]
      ≤ 3n [ (m^{1/d}/(2n)) log( 2n/m^{1/d} ) + m^{1/d}/(2n) ],    (15)

where we use the inequality -(1-x) log(1-x) ≤ x for x ∈ (0, 1) in (15). Now substituting (15) into (14), we obtain that

    (12) ⇐ [ ε^2/32 - 3 m^{(d-1)/2} (m^{1/d}/(2n)) log( 2n/m^{1/d} ) - 3 m^{(d-1)/2} (m^{1/d}/(2n)) ] · n > log(1/δ) + m log 2 + log 4.    (16)

To ensure (16), we further impose that

    3 m^{(d-1)/2} (m^{1/d}/(2n)) log( 2n/m^{1/d} ) ≤ ε^2/96.    (17)

Let k = 2n/m^{1/d}; we have

    (17) ⇐ (log k)/k ≤ ε^2/(288 m^{(d-1)/2}) ⇐ 2/√k ≤ ε^2/(288 m^{(d-1)/2}),    (18)

where we use the inequality log k ≤ 2√k in the second implication of (18). Then (18) implies that we should impose

    n ≥ 2 · 288^2 m^{d+1/d-1} / ε^4    (19)

to ensure (17). Now substituting (17) into (16), we have

    (12) ⇐ (ε^2/96) n > log(1/δ) + m log 2 + log 4.    (20)

Combining (19) and (20), we obtain that if

    n > N*(m, d, ε, δ) := max{ 2 · 288^2 m^{d+1/d-1}/ε^4, 96 [log(1/δ) + m log 2 + log 4]/ε^2 },    (21)

then (10) holds.
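Since the threshold in (21) is an explicit formula, checking whether a given sample size satisfies the requirement of Lemma 2 is a one-line computation; the sketch below simply evaluates (21) (the function name is our own).

import math

def partition_error_threshold(m, d, eps, delta):
    # N*(m, d, eps, delta) as given in (21)
    term_1 = 2.0 * 288.0**2 * m**(d + 1.0 / d - 1.0) / eps**4
    term_2 = 96.0 * (math.log(1.0 / delta) + m * math.log(2.0) + math.log(4.0)) / eps**2
    return max(term_1, term_2)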

4 A convergence rate of the approximation error of numerical integration

In this section, we provide a rate of convergence for T_1. Note that the partition algorithm only ensures that the empirical measures of all hyperrectangles in the partition are equal; it does not guarantee that the size of each hyperrectangle is small. The idea of bounding T_1 is to show that, with a desired high probability, most of the hyperrectangles in the partition obtained from Algorithm 1 are of small size as n goes to infinity. The error of the numerical integration can then be estimated based on the Lipschitz condition of the density functions. Denote by H_R the d-dimensional cube [-R, R]^d, and let ∂H_R be the boundary of H_R. For a bounded hyperrectangle I = ∏_{j=1}^d (a_j, b_j], let v(I) := ∏_{j=1}^d |b_j - a_j| and l(I) := max_{j∈[d]} |b_j - a_j| be the volume and the length of I, respectively. For an empirical distribution μ_n induced by n i.i.d. samples following the probability measure μ, apply Algorithm 1 to divide R^d into m hyperrectangles ℐ := {I_i}_1^m with respect to μ_n. Divide ℐ into four classes based on the position and size of the hyperrectangles. The definitions of the four classes are

    Γ_1(μ_n, m, H_R) := { i ∈ [m] : I_i ∩ ∂H_R ≠ ∅ },
    Γ_2(μ_n, m, H_R) := { i ∈ [m] : I_i ⊂ R^d \ H_R },
    Γ_3(μ_n, m, H_R) := { i ∈ [m] : I_i ⊂ H_R and l(I_i) ≥ m^{-(2d+1)/(2d(d+1))} },
    Γ_3^j(μ_n, m, H_R) := { i ∈ [m] : I_i ⊂ H_R, and the length of the j-th interval of I_i is no less than m^{-(2d+1)/(2d(d+1))} },   ∀ j ∈ [d].

We note that Γ_1(μ_n, m, H_R) is the index set of hyperrectangles that intersect the boundary of H_R. The set Γ_2(μ_n, m, H_R) is the index set of hyperrectangles that fall outside of H_R. The sets Γ_3(μ_n, m, H_R) and Γ_3^j(μ_n, m, H_R) are the index sets of hyperrectangles whose maximum edge length and whose j-th edge length, respectively, are no less than m^{-(2d+1)/(2d(d+1))}. Proposition 2 shows that the cardinalities of Γ_1(μ_n, m, H_R), Γ_2(μ_n, m, H_R) and Γ_3(μ_n, m, H_R) are a small fraction of m with a desired high probability as m goes to infinity.

Proposition 1 (Chernoff Bound (Boucheron and Lugosi, 2016)). Let X_1, ..., X_n be random variables such that 0 ≤ X_i ≤ 1 for all i. Let X = Σ_{i=1}^n X_i and set μ = E(X). Then, for all δ > 0:

    P{ X ≥ (1 + δ)μ } ≤ exp( -2δ^2 μ^2 / n ).    (22)

Proposition 2. Let μ be a probability measure defined on (R^d, B_{R^d}). Let f be the density function of μ, and suppose f satisfies the power law regularity condition with parameters (c, α). Let μ_n be an empirical distribution of μ induced by n i.i.d. samples, and apply Algorithm 1 to divide R^d into m hyperrectangles ℐ := {I_i}_1^m with respect to μ_n. The following properties hold:

(a) | Γ_1(μ_n, m, H_R) | ≤ 2d m^{(d-1)/d} a.s., and | Γ_3(μ_n, m, H_R) | ≤ 2dR m^{(2d^2+2d-1)/(2d^2+2d)} a.s.

(b) For every ε, δ > 0, if n > 2 log(1/δ)/ε^2 and R > (2c/ε)^{1/α}, then P{ | Γ_2(μ_n, m, H_R) | < εm } > 1 - δ.

Proof. (a) First, we prove | Γ_1(μ_n, m, H_R) | ≤ 2d m^{(d-1)/d} a.s. For simplicity, we assume m^{1/d} is an integer. Since the hyperplanes that compose ∂H_R are either parallel or perpendicular to the boundary of every I ∈ ℐ, it is clear that the following inequality holds:

    | Γ_1(μ_n, m, H_R) | ≤ | { i ∈ [m] : I_i is infinitely large } |.    (23)

It suffices to show that 2d m^{(d-1)/d} is an upper bound of | { i ∈ [m] : I_i is infinitely large } |. Note that any I ∈ ℐ can be written as I = ∏_{j=1}^d I_j, where I_j is the j-th component of I for j ∈ [d]. Since each coordinate of R^d is divided into m^{1/d} intervals, for a fixed j ∈ [d] we have:

    | { I ∈ ℐ : I_j is infinitely large } | ≤ 2 (m^{1/d})^{d-1},    (24)

because in (24) there are only 2 options for the form of I_j, i.e., either (-∞, a] or (a, ∞), and there are m^{1/d} options to choose the interval I_{j'} for each j' ≠ j. Then it follows that

    | { I ∈ ℐ : I is infinitely large } | ≤ Σ_{j=1}^d | { I ∈ ℐ : I_j is infinitely large } | ≤ 2d m^{(d-1)/d}.

Next, we prove | Γ_3(μ_n, m, H_R) | ≤ 2dR m^{(2d^2+2d-1)/(2d^2+2d)} a.s. Consider an arbitrary (k-1)-level (k ∈ [d]) hyperrectangle in Algorithm 1, denoted as I = ∏_{j=1}^{k-1} J_j × R^{d-k+1}. According to the algorithm, μ_n(I) = 1/m^{(k-1)/d}. Then the k-th component of I is divided into m^{1/d} intervals in the algorithm. Denote the corresponding k-level hyperrectangles by ℐ' := {I'_i}_1^{m^{1/d}}, where the first k-1 components of each I'_i are the same as those of I. Then | Γ_3^j | ≤ 2R m^{(2d+1)/(2d(d+1))} × (m^{1/d})^{d-1} = 2R m^{(2d^2+2d-1)/(2d^2+2d)}. This is because the number of intervals within [-R, R] whose length is greater than m^{-(2d+1)/(2d(d+1))} in the j-th dimension is less than 2R m^{(2d+1)/(2d(d+1))}, and for any other dimension one can have at most m^{1/d} options to choose the interval. Therefore,

    | Γ_3 | ≤ Σ_{j=1}^d | Γ_3^j | ≤ 2dR m^{(2d^2+2d-1)/(2d^2+2d)}.

Now we prove (b). Define indicator variables X_i for every i ∈ [n], such that X_i = 1 if the i-th sample point falls into R^d \ B(0, R), and X_i = 0 otherwise. Let X := Σ_{i=1}^n X_i. Since the density function f satisfies the power law regularity condition with parameters (c, α), we have:

    λ := E[X] = Σ_{i=1}^n E[X_i] < cn/R^α.    (25)

Since B(0, R) ⊂ H_R, it follows that

    P{ | { i ∈ [m] : I_i ⊂ R^d \ H_R } | ≥ εm } ≤ P{ | { i ∈ [m] : I_i ⊂ R^d \ B(0, R) } | ≥ εm }
      ≤ P{ X ≥ εn } = P{ X ≥ [1 + (εn/λ - 1)] λ } ≤ exp( -2(εn - λ)^2/n )
      ≤ exp( -2(εn - cn/R^α)^2/n ) ≤ exp( -nε^2/2 ) < δ,    (26)

where we use λ < cn/R^α, R > (2c/ε)^{1/α} and n > 2 log(1/δ)/ε^2 in (26).

Proposition 3. Let I = ∏_{j=1}^d (a_j, b_j] be any bounded hyperrectangle. For any x ∈ I, let x_j be the j-th component of x. Then

    ∫_I Σ_{j=1}^d | y_j - x_j | dy ≤ (d/2) l(I) v(I).    (27)

Proof. The integral of interest can be evaluated as follows:

    ∫_I Σ_{j=1}^d | y_j - x_j | dy = Σ_{j=1}^d ( ∫_{a_j}^{b_j} | y_j - x_j | dy_j ) ∏_{j': j' ≠ j} (b_{j'} - a_{j'})
      ≤ Σ_{j=1}^d (1/2) | b_j - a_j |^2 ∏_{j': j' ≠ j} (b_{j'} - a_{j'}) ≤ (d/2) l(I) v(I).

Lemma 3 (The error of numerical integration). Let P and Q be probability measures defined on (R^d, B_{R^d}), and P ≪ Q. Let Assumption 1 hold. Let P_n and Q_n be empirical measures of P and Q with n samples for each probability measure, respectively. Divide R^d into m equal-measured hyperrectangles {I_i}_1^m with respect to Q_n. If n = O( m^{d+1/d+3} + m^2 log(1/δ) ), and m = C(c, α, d, L_1, L_2) · [K_1(ε_1, L_2)]^{((1+α)/α)(2d^2+2d)} · ε^{-max{(2d^2+2d)/α, 2d}}, where ε_1 is the largest value such that K_2(ε_1) ≤ ε/10, and C(c, α, d, L_1, L_2) is a constant only depending on c, α, d, L_1, L_2, then

    P{ | Σ_{i=1}^m ∫_{I_i} φ( p(x)/q(x) ) q(x) dx - Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) | > ε } < δ.    (28)

To prove this lemma, we introduce an intermediate error notation ε_2 > 0 for the sake of clarity; we will eventually rewrite ε_2 using ε_1 and ε. We set R > (2c/ε_2)^{1/α}, and use the notations Γ_1, Γ_2 and Γ_3 to briefly represent Γ_1(Q_n, m, H_R), Γ_2(Q_n, m, H_R) and Γ_3(Q_n, m, H_R), respectively. Without loss of generality, we assume that Q(I_i) > 0 for each i ∈ [m]. Otherwise, suppose Q(I_i) = 0 for some i ∈ [m]. It implies that P(I_i) = 0 as P ≪ Q, and hence p(x) = 0 and q(x) = 0 for every x ∈ I_i. Then we can remove the corresponding term of I_i from the left side of (28). The proof of Lemma 3 can be decomposed into proving the following intermediate inequalities:

(a) T_1 ≤ K_1(ε_1, L_2) Σ_{i∈[m]} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx + 2K_2(ε_1);

(b) Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx ≤ (1/2) d L_1 (L_2 + 1) m^{-1/(2d)};

(c) if n > max{ N*(m, d, 1/(2m), δ/3), 2 log(3/δ)/ε_2^2 }, then

    P{ Σ_{i∈Γ_1∪Γ_2} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx ≤ 6L_2 d/m^{1/d} + 3L_2 ε_2 } ≥ 1 - (2/3)δ;

(d) P{ Σ_{i∈[m]\(Γ_1∪Γ_2)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx ≤ 6 d L_2 R m^{-1/(2d^2+2d)} + (1/2) d L_1 (L_2 + 1) m^{-1/(2d)} } ≥ 1 - (1/3)δ.

Proof of(a). We can first bound the integration error by summing up the error in each hyperrectangle as follows:

    | Σ_{i=1}^m ∫_{I_i} φ( p(x)/q(x) ) q(x) dx - Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) |
      = | Σ_{i=1}^m ∫_{I_i} φ( p(x)/q(x) ) q(x) dx - Σ_{i=1}^m ∫_{I_i} φ( P(I_i)/Q(I_i) ) q(x) dx |
      ≤ Σ_{i=1}^m ∫_{I_i} | φ( p(x)/q(x) ) - φ( P(I_i)/Q(I_i) ) | q(x) dx.    (29)

Let Γ^{ε_1} := { i ∈ [m] : P(I_i)/Q(I_i) ≤ ε_1 }, and let I_i^{ε_1} := { x ∈ I_i : p(x)/q(x) ≤ ε_1 }. For clarity of notation, let W_i(x) := | φ( p(x)/q(x) ) - φ( P(I_i)/Q(I_i) ) |. Then the summation of errors on the right side of (29) can be partitioned as follows:

    Σ_{i=1}^m ∫_{I_i} W_i(x) q(x) dx
      = Σ_{i∈Γ^{ε_1}} ∫_{I_i^{ε_1}} W_i(x) q(x) dx + Σ_{i∈Γ^{ε_1}} ∫_{I_i \ I_i^{ε_1}} W_i(x) q(x) dx
        + Σ_{i∈[m]\Γ^{ε_1}} ∫_{I_i^{ε_1}} W_i(x) q(x) dx + Σ_{i∈[m]\Γ^{ε_1}} ∫_{I_i \ I_i^{ε_1}} W_i(x) q(x) dx
      ≤ K_2(ε_1) Σ_{i∈Γ^{ε_1}} Q(I_i^{ε_1})
        + Σ_{i∈Γ^{ε_1}} ∫_{I_i \ I_i^{ε_1}} [ K_1(ε_1, L_2) | p(x)/q(x) - P(I_i)/Q(I_i) | + 2K_2(ε_1) ] q(x) dx
        + Σ_{i∈[m]\Γ^{ε_1}} ∫_{I_i^{ε_1}} [ K_1(ε_1, L_2) | p(x)/q(x) - P(I_i)/Q(I_i) | + 2K_2(ε_1) ] q(x) dx
        + Σ_{i∈[m]\Γ^{ε_1}} ∫_{I_i \ I_i^{ε_1}} K_1(ε_1, L_2) | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
      ≤ K_1(ε_1, L_2) Σ_{i∈[m]} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
        + K_2(ε_1) [ Σ_{i∈Γ^{ε_1}} Q(I_i^{ε_1}) + Σ_{i∈Γ^{ε_1}} 2Q(I_i \ I_i^{ε_1}) + Σ_{i∈[m]\Γ^{ε_1}} 2Q(I_i^{ε_1}) ]
      ≤ K_1(ε_1, L_2) Σ_{i∈[m]} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx + 2K_2(ε_1),    (30)

where we use the fact that Σ_{i∈Γ^{ε_1}} Q(I_i^{ε_1}) + Σ_{i∈Γ^{ε_1}} 2Q(I_i \ I_i^{ε_1}) + Σ_{i∈[m]\Γ^{ε_1}} 2Q(I_i^{ε_1}) ≤ 2 Σ_{i∈[m]} Q(I_i) = 2 in the last inequality.

Proof of (b). Since we have

    Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
      = Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} | p(x) Q(I_i) - P(I_i) q(x) | dx
      = Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} | ∫_{I_i} p(x) q(y) dy - ∫_{I_i} p(y) q(x) dy | dx
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} [ ∫_{I_i} p(x) ( q(x) + L_1 ‖y - x‖ ) dy - ∫_{I_i} ( p(x) - L_1 ‖y - x‖ ) q(x) dy ] dx
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} ∫_{I_i} L_1 ‖y - x‖ ( p(x) + q(x) ) dy dx
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} ∫_{I_i} L_1 Σ_{j=1}^d | y_j - x_j | ( p(x) + q(x) ) dy dx,    (31)

where x_j and y_j are the j-th components of the vectors x and y, respectively. Applying Proposition 3, (31) can be further upper bounded as:

    Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} | p(x) Q(I_i) - P(I_i) q(x) | dx
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} (1/2) d L_1 l(I_i) v(I_i) [ p(x) + q(x) ] dx
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) (1/2) d L_1 (L_2 + 1) l(I_i) v(I_i) Q(I_i)
      = Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/2) d L_1 (L_2 + 1) l(I_i) v(I_i)
      ≤ Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/2) d L_1 (L_2 + 1) · m^{-(2d+1)/(2d^2+2d)} · ( m^{-(2d+1)/(2d^2+2d)} )^d
      = Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/2) d L_1 (L_2 + 1) m^{-(2d+1)/(2d)}
      ≤ (1/2) d L_1 (L_2 + 1) m^{-1/(2d)}.    (32)

Proof of (c). By Assumption 1(d), we have

    Σ_{i∈Γ_1∪Γ_2} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx ≤ 2L_2 Σ_{i∈Γ_1∪Γ_2} Q(I_i).    (33)

Since n > N*(m, d, 1/(2m), δ/3) and Q_n(I_i) = 1/m, by Lemma 2, we have

    P{ 1/(2m) ≤ Q(I_i) ≤ 3/(2m) } = P{ | Q(I_i) - Q_n(I_i) | ≤ 1/(2m) } ≥ 1 - (1/3)δ, for every i,

and hence P{ Σ_{i∈Γ_1∪Γ_2} Q(I_i) ≤ (3/(2m)) ( |Γ_1| + |Γ_2| ) } ≥ 1 - (1/3)δ. Since n > 2 log(3/δ)/ε_2^2 and R > (2c/ε_2)^{1/α}, by Proposition 2, we have |Γ_1| ≤ 2d m^{(d-1)/d} and P{ |Γ_2| < ε_2 m } > 1 - (1/3)δ, and hence

    P{ Σ_{i∈Γ_1∪Γ_2} Q(I_i) ≤ 3d/m^{1/d} + (3/2)ε_2 } ≥ 1 - (2/3)δ.    (34)

The inequalities (33) and (34) imply the inequality of (c).

Proof of (d). Since I_i is a bounded hyperrectangle for every i ∈ [m] \ (Γ_1 ∪ Γ_2), we have

    Σ_{i∈[m]\(Γ_1∪Γ_2)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
      = Σ_{i∈Γ_3} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx + Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
      ≤ Σ_{i∈Γ_3} 2L_2 Q(I_i) + Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx
      ≤ 2L_2 | Γ_3(Q_n, m, H_R) | · max_{i∈[m]} Q(I_i) + Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} | p(x) Q(I_i) - P(I_i) q(x) | dx
      ≤ 4L_2 d R m^{(2d^2+2d-1)/(2d^2+2d)} · max_{i∈[m]} Q(I_i) + Σ_{i∈[m]\(Γ_1∪Γ_2∪Γ_3)} (1/Q(I_i)) ∫_{I_i} | p(x) Q(I_i) - P(I_i) q(x) | dx.    (35)

Now substitute the inequality of (b) into (35), and note that P{ max_{i∈[m]} Q(I_i) ≤ 3/(2m) } ≥ 1 - (1/3)δ for n > N*(m, d, 1/(2m), δ/3). It follows that

    P{ Σ_{i∈[m]\(Γ_1∪Γ_2)} ∫_{I_i} | p(x)/q(x) - P(I_i)/Q(I_i) | q(x) dx ≤ 6 d L_2 R m^{-1/(2d^2+2d)} + (1/2) d L_1 (L_2 + 1) m^{-1/(2d)} } ≥ 1 - (1/3)δ.    (36)

Proof of Lemma 3. Now incorporating inequalities from (a), (c) and (d), we obtain

    P{ | Σ_{i=1}^m ∫_{I_i} φ( p(x)/q(x) ) q(x) dx - Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) |
         < 2K_2(ε_1) + 2K_1(ε_1, L_2) L_2 ( 3d/m^{1/d} + (3/2)ε_2 )
           + 6K_1(ε_1, L_2) L_2 R d · m^{-1/(2d^2+2d)} + (1/2) K_1(ε_1, L_2) L_1 (L_2 + 1) d · m^{-1/(2d)} } ≥ 1 - 3δ_1.

Now we set δ_1 = δ/3, 2K_2(ε_1) < ε/5, 6K_1(ε_1, L_2) L_2 d m^{-1/d} < ε/5, 3K_1(ε_1, L_2) L_2 ε_2 < ε/5, 6K_1(ε_1, L_2) L_2 R d m^{-1/(2d^2+2d)} < ε/5 and (1/2) K_1(ε_1, L_2) L_1 (L_2 + 1) d m^{-1/(2d)} < ε/5, respectively. These settings ensure that

    P{ | Σ_{i=1}^m ∫_{I_i} φ( p(x)/q(x) ) q(x) dx - Σ_{i=1}^m φ( P(I_i)/Q(I_i) ) Q(I_i) | < ε } ≥ 1 - δ,    (37)

under the conditions that n = O( m^{d+1/d+3} + m^2 log(1/δ) ) and m = C(c, α, d, L_1, L_2) · [K_1(ε_1, L_2)]^{((1+α)/α)(2d^2+2d)} · ε^{-max{(2d^2+2d)/α, 2d}}, where ε_1 is the largest value such that K_2(ε_1) ≤ ε/10.

5 Concentration of the φ-divergence estimator

Theorem 1 (Concentration of the φ-divergence estimator). Let P and Q be probability measures defined on (R^d, B_{R^d}), and P ≪ Q. Let Assumption 1 hold. Let P_n and Q_n be empirical measures of P and Q with n samples for each probability measure, respectively. Divide R^d into m equal-measured hyperrectangles ℐ = {I_i}_1^m with respect to Q_n. If

    n = O( max{ m^{d+1/d+3}, m^2 log(1/δ), [K_1(K_2^{-1}(ε/9), L_2)]^4 · m^{d+1/d-1}/ε^4, [K_1(K_2^{-1}(ε/9), L_2)]^2 · log(1/δ)/ε^2 } )

and m = C(c, α, d, L_1, L_2) · [K_1(ε_1, L_2)]^{((1+α)/α)(2d^2+2d)} · ε^{-max{(2d^2+2d)/α, 2d}}, where ε_1 is the largest value such that K_2(ε_1) ≤ ε/30, then

    P{ | D̂_φ^{(n,n)}(P||Q) - D_φ(P||Q) | ≤ ε } ≥ 1 - δ.    (38)

Proof. By (5), | D̂_φ^{(n,n)}(P||Q) - D_φ(P||Q) | ≤ T_1 + T_2, where T_1 is the approximation error of integration, which can be bounded using Lemma 3. To bound T_2, we split it into the following two terms:

    T_2 ≤ Σ_{i=1}^m | φ( P_n(I_i)/Q_n(I_i) ) - φ( P(I_i)/Q(I_i) ) | Q_n(I_i)          =: T_21
          + Σ_{i=1}^m | φ( P(I_i)/Q(I_i) ) | · | Q_n(I_i) - Q(I_i) |,                 =: T_22    (39)

and then we bound T_21 and T_22. In the following analysis, we introduce parameters ε_1, ε_2 and δ_1 to temporarily measure errors and uncertainty. They will be rewritten in terms of ε and δ at the end of the proof. First, T_22 is more straightforward to estimate:

    T_22 ≤ ( max_{s∈[0,L_2]} |φ(s)| ) Σ_{i=1}^m | Q_n(I_i) - Q(I_i) | ≤ K_0(L_2) ε_1,    (40)

with probability greater than 1 - δ_1 when n > N*(m, d, ε_1, δ_1). Next, we bound T_21. Let

    J_1^{ε_2} := { i ∈ [m] : P_n(I_i)/Q_n(I_i) ≤ ε_2 and P(I_i)/Q(I_i) ≤ ε_2 },
    J_2^{ε_2} := { i ∈ [m] : P_n(I_i)/Q_n(I_i) ≤ ε_2 and P(I_i)/Q(I_i) > ε_2 } ∪ { i ∈ [m] : P_n(I_i)/Q_n(I_i) > ε_2 and P(I_i)/Q(I_i) ≤ ε_2 },
    J_3^{ε_2} := { i ∈ [m] : P_n(I_i)/Q_n(I_i) > ε_2 and P(I_i)/Q(I_i) > ε_2 },

and W_i := | φ( P_n(I_i)/Q_n(I_i) ) - φ( P(I_i)/Q(I_i) ) |. Then

    T_21 = Σ_{i∈J_1^{ε_2}} W_i Q_n(I_i) + Σ_{i∈J_2^{ε_2}} W_i Q_n(I_i) + Σ_{i∈J_3^{ε_2}} W_i Q_n(I_i)
         ≤ Σ_{i∈J_1^{ε_2}} K_2(ε_2) Q_n(I_i)
           + Σ_{i∈J_2^{ε_2}} [ K_1(ε_2, L_2) | P_n(I_i)/Q_n(I_i) - P(I_i)/Q(I_i) | + 2K_2(ε_2) ] Q_n(I_i)
           + Σ_{i∈J_3^{ε_2}} K_1(ε_2, L_2) | P_n(I_i)/Q_n(I_i) - P(I_i)/Q(I_i) | Q_n(I_i)
         ≤ 3K_2(ε_2) + K_1(ε_2, L_2) · Σ_{i=1}^m | P_n(I_i)/Q_n(I_i) - P(I_i)/Q(I_i) | Q_n(I_i).    (41)

Note that the term Σ_{i=1}^m | P_n(I_i)/Q_n(I_i) - P(I_i)/Q(I_i) | Q_n(I_i) in the above inequality can be bounded as

    Σ_{i=1}^m | P_n(I_i)/Q_n(I_i) - P(I_i)/Q(I_i) | Q_n(I_i) = Σ_{i=1}^m | P_n(I_i) - P(I_i) + P(I_i) - (P(I_i)/Q(I_i)) Q_n(I_i) |
      ≤ Σ_{i=1}^m | P_n(I_i) - P(I_i) | + Σ_{i=1}^m (P(I_i)/Q(I_i)) | Q(I_i) - Q_n(I_i) |
      ≤ Σ_{i=1}^m | P_n(I_i) - P(I_i) | + L_2 Σ_{i=1}^m | Q(I_i) - Q_n(I_i) |
      ≤ (L_2 + 1) ε_1    (42)

with probability greater than 1 - 2δ_1 when n > N*(m, d, ε_1, δ_1). Incorporating (39), (40), (41) and (42), we obtain that

    P{ T_2 ≤ 3K_2(ε_2) + [ K_0(L_2) + (L_2 + 1) K_1(ε_2, L_2) ] ε_1 } ≥ 1 - 2δ_1.    (43)

Now let 3K_2(ε_2) ≤ ε/3, [ K_0(L_2) + (L_2 + 1) K_1(ε_2, L_2) ] ε_1 ≤ ε/3 and δ_1 = δ/3; we then have P{ T_2 ≤ 2ε/3 } ≥ 1 - 2δ/3 if

    n ≥ C(L_2) · max{ [K_1(K_2^{-1}(ε/9), L_2)]^4 · m^{d+1/d-1}/ε^4, [K_1(K_2^{-1}(ε/9), L_2)]^2 · log(1/δ)/ε^2 },

where C(L_2) is a constant that only depends on K_0(L_2) and L_2. Now using Lemma 3 to guarantee that P{ T_1 ≤ ε/3 } ≥ 1 - δ/3, we eventually obtain P{ T_1 + T_2 ≤ ε } ≥ 1 - δ if

    n = O( max{ m^{d+1/d+3}, m^2 log(1/δ), [K_1(K_2^{-1}(ε/9), L_2)]^4 · m^{d+1/d-1}/ε^4, [K_1(K_2^{-1}(ε/9), L_2)]^2 · log(1/δ)/ε^2 } )

and m = C(c, α, d, L_1, L_2) · [K_1(ε_1, L_2)]^{((1+α)/α)(2d^2+2d)} · ε^{-max{(2d^2+2d)/α, 2d}}, with ε_1 the largest value such that K_2(ε_1) ≤ ε/30.

6 Concluding Remarks

We provide a convergence rate result for the φ-divergence estimator constructed based on the data dependent partition scheme. The estimation error consists of two parts: the error of numerical integration and the error of random sampling. We use concentration inequalities to bound both types of errors with a given probability guarantee.

7 Acknowledgement

We acknowledge NSF grant CMMI-1362003, which was used to support this research.

References

Adami, C. (2004). Information theory in molecular biology. Physics of Life Reviews, 1:3–22.

Aghagolzadeh, M., Soltanian-Zadeh, H., Araabi, B., and Aghagolzadeh, A. (2007). A hierarchical clustering based on mutual information maximization. In Proc. of IEEE International Conference on Image Processing, pages 277–280.

Boucheron, S. and Lugosi, G. (2016). Concentration inequalities: a non-asymptotic theory of independence. Oxford.

Chai, B., Walther, D., Diane, M., and Li, F. (2009). Exploring functional connectivity of the human brain using multivariate information analysis. In Advances in Neural Information Processing Systems (NIPS).

Goria, M., Leonenko, N., Mergel, V., and Inverardi, P. (2005). A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277–297.

Hall, P. (1987). On Kullback-Leibler loss and density estimation. Annals of Statistics, 15(4):1491–1519.

Hero, A. O., Ma, B., Michel, O. J. J., and Gorman, J. (2002a). Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95.

Hero, A. O., Michel, O., and Gorman, J. (2002b). Alpha-divergence for classification, indexing and retrieval. Communications and signal processing laboratory technical report cspl-328.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86.

Kybic, J. (2006). Incremental updating of nearest neighbor-based high-dimensional entropy estimation. In Proc. Acoustics, Speech and Signal Processing.

Learned-Miller, E. and Fisher, J. (2003). ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295.

Lewi, J., Butera, R., and Paninski, L. (2007). Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Advances in Neural Information Processing Systems (NIPS).

Liu, H., Lafferty, J., and Wasserman, L. (2012). Exponential concentration inequality for mutual information estimation. In Neural Information Processing Systems (NIPS).

Lugosi, G. and Nobel, A. (1996). Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics, 24(2):687–706.

Moon, K. and Hero, A. (2004). Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems (NIPS).

Moreno, P., Ho, P., and Vasconcelos, N. (2004). A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In Neural Information Processing Systems (NIPS).

Nikulin, M. S. (2001). Hellinger distance. In Hazewinkel, M., editor, Encyclopedia of Mathematics. Springer.

Pál, D., Póczos, B., and Szepesvári, C. (2010). Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proceedings of the Neural Information Processing Systems (NIPS).

Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: criteria of max- dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., 27(8).

Pérez-Cruz, F. (2008). Estimation of information theoretic measures for continuous random variables. In Advances in Neural Information Processing Systems, volume 21.

Poczos, B., Xiong, L., Sutherland, D., and Schneider, J. (2012). Nonparametric kernel estimators for image classification. In 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Shan, C., Gong, S., and Mcowan, P. (2005). Conditional mutual information based boosting for facial expression recognition. In British Machine Vision Conference (BMVC).

Singh, S. and Póczos, B. (2014). Generalized exponential concentration inequality for Rényi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning, volume 32, Beijing, China.

Szabó, Z., Póczos, B., and Lőrincz, A. (2007). Undercomplete blind subspace deconvolution. Journal of Machine Learning Research, 8:1063–1095.

Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):561.

Wang, Q., Kulkarni, S., and Verdú, S. (2009). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405.

Wang, Q., Kulkarni, S. R., and Verdú, S. (2005). Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Trans. Inf. Theory, 51(9):3064–3074.

Wolsztynski, E., Thierry, E., and Pronzato, L. (2005). Minimum-entropy estimation in semi-parametric models. Journal of Nonparametric Statistics, 17:277–297.
