Article

Statistical Estimation of the Kullback–Leibler Divergence
Alexander Bulinski 1,* and Denis Dimitrov 2
1 Steklov Mathematical Institute of Russian Academy of Sciences, 119991 Moscow, Russia
2 Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119234 Moscow, Russia; [email protected]
* Correspondence: [email protected]; Tel.: +7-495-939-14-03
Abstract: Asymptotic unbiasedness and L2-consistency are established, under mild conditions, for the estimates of the Kullback–Leibler divergence between two probability measures in R^d, absolutely continuous with respect to (w.r.t.) the Lebesgue measure. These estimates are based on certain k-nearest neighbor statistics for a pair of independent identically distributed (i.i.d.) vector samples. The novelty of the results is also in treating mixture models. In particular, they cover mixtures of nondegenerate Gaussian measures. The mentioned asymptotic properties of related estimators for the Shannon entropy and cross-entropy are strengthened. Some applications are indicated.
Keywords: Kullback–Leibler divergence; Shannon differential entropy; statistical estimates; k-nearest neighbor statistics; asymptotic behavior; Gaussian model; mixtures
MSC: 60F25; 62G20; 62H12
Citation: Bulinski, A.; Dimitrov, D. Statistical Estimation of the Kullback–Leibler Divergence. Mathematics 2021, 9, 544. https://doi.org/10.3390/math9050544

Academic Editors: Irina Shevtsova and Victor Korolev

Received: 25 January 2021; Accepted: 28 February 2021; Published: 4 March 2021

1. Introduction

The Kullback–Leibler divergence introduced in [1] is used for quantification of similarity of two probability measures. It plays an important role in various domains such as statistical inference (see, e.g., [2,3]), metric learning [4,5], machine learning [6,7], computer vision [8,9], network security [10], feature selection and classification [11–13], physics [14], biology [15], medicine [16,17], finance [18], among others. It is worth emphasizing that mutual information, widely used in many research directions (see, e.g., [19–23]), is a special case of the Kullback–Leibler divergence for certain measures. Moreover, the Kullback–Leibler divergence itself belongs to a class of f-divergence measures (with f(t) = t log t). For comparison of various f-divergence measures see, e.g., [24]; their estimates are considered in [25,26].

Let P and Q be two probability measures on a measurable space (S, B). The Kullback–Leibler divergence between P and Q is defined, according to [1], by way of

D(P||Q) := ∫_S log(dP/dQ) dP if P ≪ Q, and D(P||Q) := ∞ otherwise, (1)
where dP/dQ stands for the Radon–Nikodym derivative. The integral in (1) can take values in [0, ∞]. We employ the base e of logarithms since a constant factor is not essential here.

If (S, B) = (R^d, B(R^d)), where d ∈ N, and (absolutely continuous) P and Q have densities, p(x) and q(x), x ∈ R^d, w.r.t. the Lebesgue measure µ, then (1) can be expressed as

D(P||Q) = ∫_{R^d} p(x) log(p(x)/q(x)) dx, (2)
where we write dx instead of µ(dx) to simplify notation. One formally sets 0/0 := 0, a/0 := ∞ if a > 0, log 0 := −∞, log ∞ := ∞ and 0 · log 0 := 0. Then log(p(x)/q(x)) is a measurable function with values in [−∞, ∞]. So, the right-hand sides of (1) and (2) coincide. Formula (2) is justified by Lemma A1, see Appendix A. Denote by S(f) := {x ∈ R^d : f(x) > 0} the support of a (version of) probability density f. The integral in (2) is taken over S(p) and does not depend on the choice of the versions of p and q.

The following two functionals are closely related to the Kullback–Leibler divergence. For probability measures P and Q on (R^d, B(R^d)) having densities p(x) and q(x), x ∈ R^d, w.r.t. the Lebesgue measure µ, one can introduce, according to [27], p. 35, the entropy H (also called the Shannon differential entropy) and the cross-entropy C as follows:

H(P) := −∫_{R^d} p(x) log p(x) dx,  C(P, Q) := −∫_{R^d} p(x) log q(x) dx.

In view of (2), D(P||Q) = C(P, Q) − H(P) whenever the right-hand side is well defined.

Usually one constructs statistical estimates of some characteristics of a stochastic model under consideration relying on a collection of observations. In the pioneering paper [28] an estimator of the Shannon differential entropy was proposed, based on nearest neighbor statistics. In a series of papers this estimate was studied and applied. Moreover, estimators of the Rényi entropy, mutual information and the Kullback–Leibler divergence have appeared (see, e.g., [29–31]). However, the authors of [32] indicated the occurrence of gaps in the known proofs concerning the limit behavior of such statistics. Almost all of these flaws refer to the lack of proved correctness of using the (reversed) Fatou lemma (see, e.g., [28], inequality after the statement (21), or [31], inequality (91)) or the generalized Helly–Bray lemma (see, e.g., [30], page 2171). One can find these lemmas in [33], p. 233, and [34], p. 187.
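To make the identity D(P||Q) = C(P, Q) − H(P) concrete, here is a small numerical sanity check (our illustration, not part of the paper) for two univariate Gaussian laws, with all integrals approximated by trapezoidal quadrature over a fine grid:

```python
import numpy as np

# Illustrative check of D(P||Q) = C(P, Q) - H(P) via (2),
# for P = N(0, 1) and Q = N(1, 4) on the real line.
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-20.0, 20.0, 400001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 2.0)

trap = lambda f: (np.sum(f) - 0.5 * (f[0] + f[-1])) * dx   # trapezoid rule
H = -trap(p * np.log(p))          # Shannon differential entropy H(P)
C = -trap(p * np.log(q))          # cross-entropy C(P, Q)
D = trap(p * np.log(p / q))       # Kullback-Leibler divergence, formula (2)

# Closed form for univariate Gaussians:
# log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
D_exact = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5
assert abs(D - (C - H)) < 1e-9
assert abs(D - D_exact) < 1e-6
```

The grid bounds are chosen so that both densities are numerically negligible outside the integration window.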
Paper [32] has attracted our attention and motivated the study of the declared asymptotic properties. Furthermore, we would like to highlight the important role of the papers [28,30–32]. Thus, in the recent work [35], new functionals were introduced to prove the asymptotic unbiasedness and L2-consistency of the Kozachenko–Leonenko estimators of the Shannon differential entropy. We used the criterion of uniform integrability, for different families of functions, to avoid employment of the Fatou lemma, since it is not clear whether one could indicate the due majorizing functions for those families. The present paper is aimed at extending our approach to grasp the Kullback–Leibler divergence estimation. Instead of the nearest neighbor statistics we employ the k-nearest neighbor statistics (on order statistics see, e.g., [36]) and also use more general forms of the mentioned functionals.

Note in passing that there exist investigations treating important aspects of the entropy, Kullback–Leibler divergence and mutual information estimation. The mixed models and conditional entropy estimation are studied, e.g., in [37,38]. The central limit theorem (CLT) for the Kozachenko–Leonenko estimates is established in [39]. In [40], a deep analysis of the efficiency of functional weighted estimates was performed (including the CLT). The limit theorems for point processes on manifolds are employed in [41] to analyze the behavior of the Shannon and the Rényi entropy estimates. The convergence rates for the Shannon entropy (truncated) estimates are obtained in [42] for the one-dimensional case; see also [43] for the multidimensional case. A kernel density plug-in estimator of various divergence functionals is studied in [25]. The principal assumptions of that paper are the following: the densities are smooth and have common bounded support S, they are strictly lower bounded on S, and, moreover, the set S is smooth with respect to the employed kernel.
Ensemble estimation of various divergence functionals is studied in [25]. Profound results for smooth bounded densities are established in the recent work [44]. The mutual information estimation by the local Gaussian approximation is developed in [45]. Note that various deep results (including the central limit theorem) were obtained for the Kullback–Leibler estimates under certain conditions imposed on the derivatives of the unknown densities (see, e.g., the recent papers [25,46]). In a series of papers the authors demand boundedness of densities to prove
L2-consistency for the Kozachenko–Leonenko estimates of the differential Shannon entropy (see, e.g., [47]).

Our goal is to provide wide conditions for the asymptotic unbiasedness and L2-consistency of the specified Kullback–Leibler divergence estimates without such smoothness and boundedness hypotheses. Furthermore, we do not assume that the densities have bounded supports. As a byproduct we obtain new results concerning the Shannon differential entropy and cross-entropy. We employ probabilistic and analytical techniques, namely, weak convergence of probability measures, conditional expectations, regular probability distributions, k-nearest neighbor statistics, probability inequalities, integration by parts in the Lebesgue–Stieltjes integral, analysis of integrals depending on certain parameters and taken over specified domains, the criterion of uniform integrability of various families of functions, and slowly varying functions.

The paper is organized as follows. In Section 2, we introduce some notation. In Section 3 we formulate the main results, i.e., Theorems 1 and 2. Their proofs are provided in Sections 4 and 5, respectively. Section 6 contains concluding remarks and perspectives of future research. Proofs of several lemmas are given in Appendix A.
2. Notation

Let X and Y be random vectors taking values in R^d and having distributions P_X and P_Y, respectively (below we will take P = P_X and Q = P_Y). Consider random vectors X_1, X_2, ... and Y_1, Y_2, ... with values in R^d such that law(X_i) = law(X) and law(Y_i) = law(Y), i ∈ N. Assume also that {X_i, Y_i, i ∈ N} are independent. We are interested in statistical estimation of D(P_X||P_Y) constructed by means of observations X_n := {X_1, ..., X_n} and Y_m := {Y_1, ..., Y_m}, n, m ∈ N. All the random variables under consideration are defined on a probability space (Ω, F, P), and each measure space is assumed complete.

For a finite set E = {z_1, ..., z_N} ⊂ R^d, where z_i ≠ z_j (i ≠ j), and a vector v ∈ R^d, renumber the points of E as z_(1)(v), ..., z_(N)(v) in such a way that ‖v − z_(1)(v)‖ ≤ ... ≤ ‖v − z_(N)(v)‖, where ‖·‖ is the Euclidean norm in R^d. If there are points z_{i_1}, ..., z_{i_s} having the same distance from v then we numerate them in increasing order of the indices i_1, ..., i_s. In other words, for k = 1, ..., N, z_(k)(v) is the k-nearest neighbor of v in the set E. To indicate that z_(k)(v) is constructed by means of E we write z_(k)(v, E). Fix k ∈ {1, ..., n − 1}, l ∈ {1, ..., m} and (for each ω ∈ Ω) put
R_{n,k}(i) := ‖X_i − X_(k)(X_i, X_n \ {X_i})‖,  V_{m,l}(i) := ‖X_i − Y_(l)(X_i, Y_m)‖,  i = 1, ..., n.
We assume that X and Y have densities p = dP_X/dµ and q = dP_Y/dµ. Then with probability one all the points in X_n are distinct, as are the points of Y_m. Following [31] (see Formula (17) there) introduce an estimate of D(P_X||P_Y):

D̃_{n,m}(K_n, L_n) := (1/n) ∑_{i=1}^n (ψ(k_i) − ψ(l_i)) + log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,l_i}(i)/R_{n,k_i}(i)), (3)
where ψ(t) = (d/dt) log Γ(t) = Γ'(t)/Γ(t) is the digamma function, t > 0, K_n := {k_i}_{i=1}^n, L_n := {l_i}_{i=1}^n are collections of integers and, for some r ∈ N and all i ∈ N, k_i ≤ r, l_i ≤ r. Note that (3) is well-defined for n ≥ max_{i=1,...,n} k_i + 1, m ≥ max_{i=1,...,n} l_i. If k_i = k and l_i = l, i = 1, ..., n, then, for n ≥ k + 1 and m ≥ l, we write

D̂_{n,m}(k, l) := ψ(k) − ψ(l) + log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,l}(i)/R_{n,k}(i)). (4)
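The estimate (4) is straightforward to implement. The following sketch (an assumed implementation for illustration; the paper itself contains no code) uses brute-force nearest-neighbor search and evaluates the digamma function at integer arguments via ψ(k) = −γ + ∑_{j=1}^{k−1} 1/j:

```python
import numpy as np

# A minimal sketch of the estimator D_hat_{n,m}(k, l) from (4).
GAMMA = 0.5772156649015329                 # Euler-Mascheroni constant

def psi_int(k):
    # digamma at a positive integer: psi(k) = -gamma + sum_{j=1}^{k-1} 1/j
    return -GAMMA + sum(1.0 / j for j in range(1, k))

def kl_knn_estimate(X, Y, k, l):
    n, d = X.shape
    m = Y.shape[0]
    # R_{n,k}(i): distance from X_i to its k-th nearest neighbor in X \ {X_i}
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = np.sort(dXX, axis=1)[:, k]         # column 0 is X_i itself (distance 0)
    # V_{m,l}(i): distance from X_i to its l-th nearest neighbor in Y_m
    dXY = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    V = np.sort(dXY, axis=1)[:, l - 1]
    return (psi_int(k) - psi_int(l) + np.log(m / (n - 1.0))
            + (d / n) * np.sum(np.log(V / R)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 1))   # sample from P = N(0, 1)
Y = rng.normal(1.0, 2.0, size=(2000, 1))   # sample from Q = N(1, 4)
est = kl_knn_estimate(X, Y, k=3, l=3)
# the true divergence here is log 2 + (1 + 1)/8 - 1/2 ≈ 0.443
```

For large samples a k-d tree would replace the quadratic-cost distance matrices; the brute-force version is kept only for transparency.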
If k = l then

D̂_{n,m}(k) = log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,k}(i)/R_{n,k}(i))
and we come to formula (5) in [31]. For an intuitive background of the proposed estimates one can address [31] (Introduction, Parts B and C).

We write B(x, r) := {y ∈ R^d : ‖x − y‖ ≤ r} for x ∈ R^d, r > 0, and V_d = µ(B(0, 1)) is the volume of the unit ball in R^d. Similar to (3), with the same notation and the same conditions on k_i and l_i, i = 1, ..., n, one can define the Kozachenko–Leonenko type estimates of H(P_X) and C(P_X, P_Y), respectively, by the formulas
H̃_n(K_n) := −(1/n) ∑_{i=1}^n ψ(k_i) + log V_d + log(n − 1) + (d/n) ∑_{i=1}^n log R_{n,k_i}(i), (5)
C̃_{n,m}(L_n) := −(1/n) ∑_{i=1}^n ψ(l_i) + log V_d + log m + (d/n) ∑_{i=1}^n log V_{m,l_i}(i). (6)
In [28], the estimate (5) was proposed for k_i = 1, i = 1, ..., n. If k_i = k, l_i = l, i = 1, ..., n, n ≥ k + 1 and m ≥ l, then one has
Ĥ_n(k) := (1/n) ∑_{i=1}^n log( V_d R_{n,k}^d(i) (n − 1) / e^{ψ(k)} ),  Ĉ_{n,m}(l) := (1/n) ∑_{i=1}^n log( V_d V_{m,l}^d(i) m / e^{ψ(l)} ). (7)
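As an illustration (ours, with arbitrarily chosen sample size and dimension), the estimate Ĥ_n(k) from (7) can be sketched as follows for a bivariate standard Gaussian sample, whose true entropy log(2πe) ≈ 2.838 is known in closed form:

```python
import math
import numpy as np

# Sketch of the entropy estimate H_hat_n(k) from (7):
# (1/n) * sum_i log( V_d * R_{n,k}(i)^d * (n - 1) / e^{psi(k)} ).
GAMMA = 0.5772156649015329
psi = lambda k: -GAMMA + sum(1.0 / j for j in range(1, k))  # digamma at integers

def h_knn_estimate(X, k):
    n, d = X.shape
    V_d = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)  # unit-ball volume in R^d
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = np.sort(dXX, axis=1)[:, k]                          # k-NN distance, self excluded
    return np.mean(np.log(V_d * R ** d * (n - 1)) - psi(k))

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))                              # sample from N(0, I_2)
est = h_knn_estimate(X, k=3)
# true value: H(N(0, I_2)) = log(2*pi*e) ≈ 2.838
```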
Remark 1. All our results are valid for statistics (3). To simplify notation we consider the estimates (4), since the study of D̃_{n,m}(K_n, L_n) follows the same lines. For the same reason, as in the case of the Kullback–Leibler divergence, we will only deal with (7), since (5) and (6) can be analyzed in quite the same way.
Some extra notation is necessary. As in [35], given a probability density f in R^d, we consider the following functions of x ∈ R^d, r > 0 and R > 0, that is, define integral functionals (depending on parameters)

I_f(x, r) := ( ∫_{B(x,r)} f(y) dy ) / (V_d r^d), (8)
M_f(x, R) := sup_{r∈(0,R]} I_f(x, r),  m_f(x, R) := inf_{r∈(0,R]} I_f(x, r). (9)

Some properties of the function ∫_{B(x,r)} f(y) dy are demonstrated in [48]. By virtue of Lemma 2.1 [35], for each probability density f, the function I_f(x, r) introduced above is continuous in (x, r) on R^d × (0, ∞). Hence, on account of Theorem 15.84 [49], the functions m_f(·, R) and M_f(·, R), for any R > 0, are upper semicontinuous and lower semicontinuous, respectively. Therefore, Borel measurability of these nonnegative functions ensues from Proposition 15.82 [49]. On the other hand, the function m_f(x, ·) is evidently nonincreasing whereas M_f(x, ·) is nondecreasing for each x ∈ R^d. Notably, changing sup_{r∈(0,R]} to sup_{r∈(0,∞)} transforms the function M_f(x, R) into the famous Hardy–Littlewood maximal function, well known in harmonic analysis.

Set e[−1] := 0 and e[N] := exp{e[N−1]}, N ∈ Z_+. Introduce a function log_[1](t) := log t, t > 0. For N ∈ N, N > 1, set log_[N](t) := log(log_[N−1](t)). Evidently, this function (for each N ∈ N) is defined if t > e[N−2]. For N ∈ N, consider the continuous nondecreasing function G_N : R_+ → R_+, given by the formula

G_N(t) := 0 for t ∈ [0, e[N−1]], and G_N(t) := t log_[N](t) for t ∈ (e[N−1], ∞). (10)
In other words, we employ a function of the form t r(t), where the function r(t), taken as N iterations of log t, is slowly varying for large t.
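The bookkeeping behind e[N], log_[N] and G_N in (10) can be sketched as follows (an illustrative helper of ours, not from the paper):

```python
import math

# e[-1] = 0, e[N] = exp(e[N-1]); log_[1] t = log t, log_[N] t = log(log_[N-1] t);
# G_N(t) = 0 on [0, e[N-1]] and G_N(t) = t * log_[N](t) for t > e[N-1].
def e_tower(N):
    t = 0.0                        # e[-1] = 0
    for _ in range(N + 1):         # N + 1 exponentiations give e[N]
        t = math.exp(t)
    return t

def log_iter(N, t):
    for _ in range(N):
        t = math.log(t)
    return t

def G(N, t):
    return 0.0 if t <= e_tower(N - 1) else t * log_iter(N, t)

assert e_tower(0) == 1.0 and e_tower(1) == math.e
assert abs(G(1, math.e ** 2) - 2.0 * math.e ** 2) < 1e-12
assert G(2, math.e) == 0.0         # e = e[1] lies inside the zero region for N = 2
```

Note that G_N is continuous at t = e[N−1], since log_[N](t) → 0 there.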
For probability densities p, q in R^d, N ∈ N and positive constants ν, t, ε, R, introduce the functionals taking values in [0, ∞]:

K_{p,q}(ν, N, t) := ∫_{x,y∈R^d, ‖x−y‖>t} G_N(|log‖x − y‖|^ν) p(x) q(y) dx dy, (11)

Q_{p,q}(ε, R) := ∫_{R^d} M_q^ε(x, R) p(x) dx, (12)

T_{p,q}(ε, R) := ∫_{R^d} m_q^{−ε}(x, R) p(x) dx. (13)
Set K_{p,q}(ν, N) := K_{p,q}(ν, N, e[N]).
Remark 2. We have stipulated that 1/0 := ∞ (thus m_q^{−ε}(x, R) := ∞ whenever m_q(x, R) = 0). One can write in (12) and (13) the integrals over the support S(p) instead of integrating over R^d, whatever versions of p and q are taken.
3. Main Results
Theorem 1. Let, for some positive ε, R and N ∈ N, the functionals K_{p,f}(1, N), Q_{p,f}(ε, R), T_{p,f}(ε, R) be finite for f = p and f = q. Then D(P_X||P_Y) < ∞ and
lim_{n,m→∞} E D̂_{n,m}(k, l) = D(P_X||P_Y). (14)
Consider three kinds of conditions (labeled A, B, C, possibly with indices, and involving parameters indicated in parentheses) on probability densities.

(A; p, f, ν) For probability densities p, f in R^d and some positive ν,

L_{p,f}(ν) := ∫_{R^d×R^d} |log‖x − y‖|^ν p(x) f(y) dx dy < ∞. (15)

As usual, ∫_A g(z) Q(dz) = 0 whenever g(z) = ∞ (or −∞) for z ∈ A and Q(A) = 0, where Q is a σ-finite measure on (R^d, B(R^d)). Condition (15) with ν > 1 is used, e.g., in [28,31,47].

(B1; f) A version of f is upper bounded by a positive number M(f) ∈ (0, ∞):

f(x) ≤ M(f), x ∈ R^d.

(C1; f) A version of f is lower bounded by a positive number m(f) ∈ (0, ∞):

f(x) ≥ m(f), x ∈ S(f).
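For condition (A; p, f, ν), the functional L_{p,f}(ν) in (15) is simply E|log‖X − Z‖|^ν with X ∼ p and Z ∼ f independent, so it can be probed by Monte Carlo. A small sketch (ours; the choice p = f = density of N(0, I_3) is an arbitrary example for which (15) holds for every ν > 0):

```python
import numpy as np

# Monte Carlo approximation of L_{p,f}(nu) = E |log ||X - Z|| |^nu from (15),
# with X ~ p and Z ~ f independent; here p = f = the N(0, I_3) density.
rng = np.random.default_rng(2)
d, n = 3, 200000
X = rng.normal(size=(n, d))
Z = rng.normal(size=(n, d))
r = np.linalg.norm(X - Z, axis=1)            # ||X - Z|| is a.s. positive
for nu in (1.0, 2.0, 3.0):
    L_hat = float(np.mean(np.abs(np.log(r)) ** nu))
    # for Gaussian densities these moments are finite for every nu > 0
    assert np.isfinite(L_hat) and L_hat < 50.0
```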
Corollary 1. Let, for some ν > 1, condition (A; p, f, ν) be satisfied for f = p and f = q. Then the statements of Theorem 1 are true, provided that (B1; f) and (C1; f) are both valid for f = p and f = q. Moreover, if the latter assumption involving (B1; f) and (C1; f) holds, then the conditions of Theorem 1 are satisfied whenever p and q have bounded supports.
Next we formulate conditions to guarantee L2-consistency of estimates (4).
Theorem 2. Let the requirement K_{p,f}(1, N) < ∞ in the conditions of Theorem 1 be replaced by K_{p,f}(2, N) < ∞, for f = p and f = q. Then D(P_X||P_Y) < ∞ and, for any fixed k, l ∈ N, the estimates D̂_{n,m}(k, l) are L2-consistent, i.e.,

lim_{n,m→∞} E( D̂_{n,m}(k, l) − D(P_X||P_Y) )^2 = 0. (16)
Corollary 2. For some ν > 2, let condition (A; p, f, ν) be satisfied for f = p and f = q. Assume that (B1; f) and (C1; f) are both valid for f = p and f = q. Then the statements of Theorem 2 are true. Moreover, if the latter assumption involving (B1; f) and (C1; f) holds, then the conditions of Theorem 2 are satisfied whenever p and q have bounded supports.
Currently we dwell on a modification of condition (C1; f) introduced in [35] that allows us to work with densities that need not have bounded supports.

(C2; f) There exist a version of density f and R > 0 such that, for some c > 0,

m_f(x, R) ≥ c f(x), x ∈ R^d.
Remark 3. If, for some positive ε, R and c, condition (C2; q) is true and

∫_{R^d} q(x)^{−ε} p(x) dx < ∞, (17)
then T_{p,q}(ε, R) is finite. Hence we could apply, for f = p and f = q in Theorems 1 and 2, condition (C2; f) and presume, for some ε > 0, the validity of (17) and the finiteness of ∫_{R^d} p^{1−ε}(x) dx instead of the corresponding assumptions T_{p,q}(ε, R) < ∞ and T_{p,p}(ε, R) < ∞. An illustrative example to this point is provided by a density having unbounded support.
Corollary 3. Let X, Y be Gaussian random vectors in R^d with EX = µ_X, EY = µ_Y and nondegenerate covariance matrices Σ_X and Σ_Y, respectively. Then relations (14) and (16) hold, where

D(P_X||P_Y) = (1/2) [ tr(Σ_Y^{−1} Σ_X) + (µ_Y − µ_X)^T Σ_Y^{−1} (µ_Y − µ_X) − d + log(det Σ_Y / det Σ_X) ].
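The closed-form expression of Corollary 3 is easy to evaluate numerically; the sketch below (our illustration) also checks the elementary facts that D(P||P) = 0 and that the formula reduces to the familiar univariate expression log(σ_2/σ_1) + (σ_1² + (µ_1 − µ_2)²)/(2σ_2²) − 1/2:

```python
import numpy as np

def gaussian_kl(mu_x, S_x, mu_y, S_y):
    """Closed-form D(P_X||P_Y) for nondegenerate Gaussian laws (Corollary 3)."""
    mu_x, S_x = np.asarray(mu_x, float), np.asarray(S_x, float)
    mu_y, S_y = np.asarray(mu_y, float), np.asarray(S_y, float)
    d = mu_x.size
    S_y_inv = np.linalg.inv(S_y)
    diff = mu_y - mu_x
    return 0.5 * (np.trace(S_y_inv @ S_x) + diff @ S_y_inv @ diff - d
                  + np.log(np.linalg.det(S_y) / np.linalg.det(S_x)))

I2 = np.eye(2)
assert abs(gaussian_kl([0, 0], I2, [0, 0], I2)) < 1e-12     # D(P||P) = 0
val = gaussian_kl([0.0], [[1.0]], [1.0], [[4.0]])           # N(0,1) vs N(1,4)
assert abs(val - (np.log(2.0) + 2.0 / 8.0 - 0.5)) < 1e-12   # univariate formula
```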
The latter formula can be found, e.g., in [2], p. 147, Example 6.3. The proof of Corollary 3 is discussed in Appendix A. Similarly to condition (C2; f), let us consider the following one.

(B2; f) There exist a version of density f and R > 0 such that, for some C > 0,
M_f(x, R) ≤ C f(x), x ∈ S(f).
Remark 4. If, for some positive ε, R and C, condition (B2; q) is true and

∫_{R^d} q(x)^ε p(x) dx < ∞, (18)
then obviously Q_{p,q}(ε, R) < ∞. Thus, in Theorems 1 and 2 one can employ, for f = p and f = q, condition (B2; f) and exploit, for some ε > 0, the validity of (18) and the finiteness of ∫_{R^d} p^{1+ε}(x) dx instead of the assumptions Q_{p,q}(ε, R) < ∞ and Q_{p,p}(ε, R) < ∞, respectively.
Remark 5. D. Evans applied the "positive density condition" in Definition 2.1 of [48], assuming the existence of constants β > 1 and δ > 0 such that r^d/β ≤ ∫_{B(x,r)} q(y) dy ≤ β r^d for all 0 ≤ r ≤ δ and x ∈ R^d. Consequently, m_q(x, δ) ≥ 1/(β V_d) =: m > 0, x ∈ R^d. Then T_{p,q}(ε, δ) ≤ m^{−ε} ∫_{R^d} p(x) dx = m^{−ε} < ∞ for all ε > 0. Analogously, M_q(x, δ) ≤ β/V_d =: M, M > 0, x ∈ R^d, and Q_{p,q}(ε, δ) ≤ M^ε ∫_{R^d} p(x) dx = M^ε < ∞ for all ε > 0. The above mentioned inequalities from Definition 2.1 of [48] are valid, provided that the density f is smooth and its support in R^d is a convex closed body; see the proof in [50]. Therefore, if p and q are smooth and their supports are compact convex bodies in R^d, the relations (14) and (16) are valid.
Moreover, as a byproduct of Theorems 1 and 2, we obtain new results indicating both the asymptotic unbiasedness and L2-consistency of the estimates (7) for the Shannon differential entropy and cross-entropy.
Theorem 3. Let Q_{p,q}(ε, R) < ∞ and T_{p,q}(ε, R) < ∞ for some positive ε and R. Then C(P_X, P_Y) is finite and the following statements hold for any fixed l ∈ N.
(1) If, for some N ∈ N, K_{p,q}(1, N) < ∞, then E Ĉ_{n,m}(l) → C(P_X, P_Y), n, m → ∞.
(2) If, for some N ∈ N, K_{p,q}(2, N) < ∞, then E( Ĉ_{n,m}(l) − C(P_X, P_Y) )^2 → 0, n, m → ∞.
In particular, one can employ L_{p,q}(ν) with ν > 1 instead of K_{p,q}(1, N), and with ν > 2 instead of K_{p,q}(2, N), where N ∈ N.
The first claim of this theorem follows from the proof of Theorem 1. In a similar way one can infer the second statement from the proof of Theorem 2. If we take q = p in the conditions of Theorem 3 then we get the statement concerning the entropy, since C(P_X, P_X) = H(P_X).

Now we consider the case when p and q are mixtures of some probability densities. Namely,

p(x) := ∑_{i=1}^I a_i p_i(x),  q(x) := ∑_{j=1}^J b_j q_j(x), (19)
where p_i(x), q_j(x) are probability densities (w.r.t. the Lebesgue measure µ) and the positive weights a_i, b_j are such that ∑_{i=1}^I a_i = 1, ∑_{j=1}^J b_j = 1, i = 1, ..., I, j = 1, ..., J, x ∈ R^d. Some applications of models described by mixtures are treated, e.g., in [51].
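Since mixtures such as (19) admit no closed-form Kullback–Leibler divergence in general, a reference value of D(P_X||P_Y) has to be computed numerically via (2). A quadrature sketch (ours, with arbitrarily chosen mixture parameters) for two univariate Gaussian mixtures:

```python
import numpy as np

# Numerical evaluation of D(P||Q) via (2) for two 1-d Gaussian mixtures
# of the form (19); a Riemann sum over a wide grid serves as quadrature.
def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
p = 0.3 * normal_pdf(x, -2.0, 1.0) + 0.7 * normal_pdf(x, 2.0, 1.0)  # sum a_i p_i
q = 0.5 * normal_pdf(x, 0.0, 2.0) + 0.5 * normal_pdf(x, 3.0, 1.0)   # sum b_j q_j
D = np.sum(p * np.log(p / q)) * dx
assert np.isfinite(D) and D > 0.0      # KL divergence is nonnegative
```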
Corollary 4. Let random vectors X and Y have densities of the form (19). Assume that, for some positive ε, R and N ∈ N, the functionals K_{f,g}(1, N), Q_{f,g}(ε, R), T_{f,g}(ε, R) are finite whenever f ∈ {p_1, ..., p_I} and g ∈ {p_1, ..., p_I, q_1, ..., q_J}. Then D(P_X||P_Y) < ∞ and, for any fixed k, l ∈ N, (14) holds. Moreover, if the requirement K_{f,g}(1, N) < ∞ is replaced by K_{f,g}(2, N) < ∞, then (16) is true.
The proof of this corollary is given in Appendix A. Thus, due to Corollaries 3 and 4, one can guarantee the validity of (14) and (16) for any mixtures of nondegenerate Gaussian densities. Note also that in a similar way we can claim the asymptotic unbiasedness and L2-consistency of the estimates (7) for mixtures satisfying the conditions of Corollary 4.
Remark 6. Let us compare our new results with those established in [35]. Developing the approach of [35] to the analysis of the asymptotic behavior of the Kozachenko–Leonenko estimates of the Shannon differential entropy, we encounter new complications due to dealing with k-nearest neighbor statistics for k ∈ N (not only for k = 1). Accordingly, in the framework of the Kullback–Leibler divergence estimation, we propose a new way to bound the function 1 − F_{m,l,x}(u) playing the key role in the proofs (see Formula (28)). Furthermore, instead of the function G(t) = t log t (for t > 1), used in [35] for the Shannon entropy estimates, we employ a regularly varying function G_N(t) = t log_[N](t), where (for t large enough) log_[N](t) is the N-fold iteration of the logarithmic function and N ∈ N can be large. Hence, in the definition of the integral functional K_{p,q}(ν, N, t) by formula (11), one can take a function G_N(z) having, for z > 0, a growth rate close to that of the function z. Moreover, this entails a generalization of the results of [35]. Now we invoke the convexity of G_N (see Lemma 6) to provide more general conditions for the asymptotic unbiasedness and L2-consistency of the Shannon differential entropy estimates as opposed to [35].
4. Proof of Theorem 1

Note that the general structure of this proof, as well as that of Theorem 2, is similar to the one originally proposed in [28] and later used in various papers (see, e.g., [30,31,47]). Nevertheless, in order to prove both theorems correctly we employ new ideas and conditions (such as uniform integrability of a family of random variables) in our reasoning.
Remark 7. In the proof, for certain random variables α, α_1, α_2, ... (depending on some parameters), we will demonstrate that Eα_n → Eα, as n → ∞ (and that all these expectations are finite). To this end, for a fixed R^d-valued random vector τ and each x ∈ A, where A is a specified subset of R^d, we will prove that

E(α_n|τ = x) → E(α|τ = x), n → ∞. (20)
It turns out that E(α_n|τ = x) = E(α_{n,x}) and E(α|τ = x) = Eα_x, where the auxiliary random variables α_{n,x} and α_x can be constructed explicitly for each x ∈ R^d. Moreover, it is possible to show that, for each x ∈ A, one has α_{n,x} → α_x in law, n → ∞. Thus, to prove (20) the Fatou lemma is not used, as it is not evident whether there exists a random variable majorizing those under consideration. Instead we verify, for each x ∈ A, the uniform integrability (w.r.t. the measure P) of a family (α_{n,x})_{n≥n_0(x)}. Here we employ the necessary and sufficient conditions of uniform integrability provided by the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 in [52]). After that, to prove the desired relation Eα_n → Eα, n → ∞, we have a new task. Namely, we check the uniform integrability of the family (E(α_n|τ = x))_{n≥k_0}, where x ∈ A, w.r.t. the measure P_τ, i.e., the law of τ, and k_0 does not depend on x. Then we can prove that

Eα_n = ∫_A E(α_n|τ = x) P_τ(dx) → ∫_A E(α|τ = x) P_τ(dx) = Eα, n → ∞.

Further we will explain a number of nontrivial details concerning the proofs of uniform integrability of various families, the choice of the mentioned random variables (vectors), the set A, n_0(x) and k_0.
The first auxiliary result explains why, without loss of generality (w.l.g.), we can consider the same parameters ε, R, N for different functionals in the conditions of Theorems 1 and 2.
Lemma 1. Let p and q be any probability densities in R^d. Then the following statements are valid.

(1) If K_{p,q}(ν_0, N_0) < ∞ for some ν_0 > 0 and N_0 ∈ N, then K_{p,q}(ν, N) < ∞ for any ν ∈ (0, ν_0] and each N ≥ N_0.
(2) If Q_{p,q}(ε_1, R_1) < ∞ for some ε_1 > 0 and R_1 > 0, then Q_{p,q}(ε, R) < ∞ for any ε ∈ (0, ε_1] and each R > 0.
(3) If T_{p,q}(ε_2, R_2) < ∞ for some ε_2 > 0 and R_2 > 0, then T_{p,q}(ε, R) < ∞ for any ε ∈ (0, ε_2] and each R > 0.

In particular, one can take q = p and the statements of Lemma 1 still remain valid. The proof of Lemma 1 is given in Appendix A.
Remark 8. The results of Lemma 1 allow us to ensure (14) by demanding the finiteness of the functionals K_{p,q}(1, N_1), Q_{p,q}(ε_1, R_1), T_{p,q}(ε_2, R_2), K_{p,p}(1, N_2), Q_{p,p}(ε_3, R_3), T_{p,p}(ε_4, R_4), for some ε_i > 0, R_i > 0 and N_j ∈ N, where i = 1, 2, 3, 4 and j = 1, 2. Moreover, if we assume the finiteness of K_{p,q}(2, N_3) and K_{p,p}(2, N_4), for some N_3 ∈ N, N_4 ∈ N, instead of the finiteness of K_{p,q}(1, N_1) and K_{p,p}(1, N_2), then (16) holds.
According to Remark 2.4 of [35], if, for some positive ε, R, the integrals Q_{p,q}(ε, R), T_{p,q}(ε, R), Q_{p,p}(ε, R), T_{p,p}(ε, R) are finite, then

∫_{R^d} p(x)|log q(x)| dx < ∞,  ∫_{R^d} p(x)|log p(x)| dx < ∞. (21)
Therefore D(p||q) < ∞ (and thus P_X ≪ P_Y in view of Lemma A1). For n ∈ N such that n > 1, for fixed k, l ∈ N and m ∈ N, where 1 ≤ k ≤ n − 1, 1 ≤ l ≤ m, and i = 1, ..., n, set φ_{m,l}(i) := m V_{m,l}^d(i), ζ_{n,k}(i) := (n − 1) R_{n,k}^d(i). Then we can rewrite the estimate D̂_{n,m}(k, l) as follows:
D̂_{n,m}(k, l) = ψ(k) − ψ(l) + (1/n) ∑_{i=1}^n ( log φ_{m,l}(i) − log ζ_{n,k}(i) ). (22)
It is sufficient to prove the following two assertions.
Statement 1. For each fixed l, all m large enough and any i ∈ N, E|log φ_{m,l}(i)| is finite.
Moreover,

(1/n) ∑_{i=1}^n E log φ_{m,l}(i) = E log φ_{m,l}(1) → ψ(l) − log V_d − ∫_{R^d} p(x) log q(x) dx, m → ∞. (23)
Statement 2. For each fixed k, all n large enough and any i ∈ N, E|log ζ_{n,k}(i)| is finite. Moreover,

(1/n) ∑_{i=1}^n E log ζ_{n,k}(i) = E log ζ_{n,k}(1) → ψ(k) − log V_d − ∫_{R^d} p(x) log p(x) dx, n → ∞. (24)
Then, in view of (2) and (21)–(24),

E D̂_{n,m}(k, l) → −∫_{R^d} p(x) log q(x) dx + ∫_{R^d} p(x) log p(x) dx = D(P_X||P_Y), n, m → ∞,

and Theorem 1 will be proved.

Recall that, as explained in [35], for a nonnegative random variable V (thus 0 ≤ EV ≤ ∞) and any R^d-valued random vector X, one has

EV = ∫_{R^d} E(V|X = x) P_X(dx). (25)

This signifies that both sides of (25) coincide, being finite or infinite simultaneously. Let F(u, ω) be a regular conditional distribution function of a nonnegative random variable U given X, where u ∈ R and ω ∈ Ω. Let h be a measurable function such that h : R → [0, ∞). It was also explained in [35] that, for P_X-almost all x ∈ R^d, it follows (without assuming Eh(U) < ∞) that

E(h(U)|X = x) = ∫_{[0,∞)} h(u) dF(u, x). (26)

This means that both sides of (26) are finite or infinite simultaneously and coincide. By virtue of (25) and (26) one can establish that E|log φ_{m,l}(i)| < ∞, for all m large enough, fixed l and all i, and that (23) holds. To perform this, take U = φ_{m,l}(i), X = X_i, h(u) = |log u|, u > 0 (we use h(u) = log² u in the proof of Theorem 2) and V = h(U). If h : R → R and E|h(U)| < ∞ then (26) is true as well. To avoid increasing the volume of this paper we will only examine the evaluation of E log φ_{m,l}(i), as all the steps of the proof are the same when treating E|log φ_{m,l}(i)|.

The proof of Statement 1 is partitioned into four steps. The first three demonstrate that there is a measurable A ⊂ S(p), depending on the versions of p and q, such that P_X(S(p) \ A) = 0 and, for any x ∈ A, i ∈ N, the following relation holds:
E(log φ_{m,l}(i)|X_i = x) = E(log φ_{m,l}(1)|X_1 = x) → ψ(l) − log V_d − log q(x), m → ∞. (27)
The last Step 4 justifies the desired result (23). Finally, Step 5 validates Statement 2.

Step 1. Here we establish the convergence in distribution of certain auxiliary random variables. Fix any i ∈ N and l ∈ {1, ..., m}. To simplify notation we do not indicate the dependence of functions on d. For x ∈ R^d and u ≥ 0, we identify the asymptotic behavior (as m → ∞) of the function
F^i_{m,l,x}(u) := P(φ_{m,l}(i) ≤ u | X_i = x) = P( m V_{m,l}^d(i) ≤ u | X_i = x )

= 1 − P( V_{m,l}(i) > (u/m)^{1/d} | X_i = x ) = 1 − P( ‖x − Y_(l)(x, Y_m)‖ > (u/m)^{1/d} ) (28)

= 1 − ∑_{s=0}^{l−1} \binom{m}{s} (W_{m,x}(u))^s (1 − W_{m,x}(u))^{m−s} =: P(ξ_{m,l,x} ≤ u),
where

W_{m,x}(u) := ∫_{B(x, r_m(u))} q(z) dz,  r_m(u) := (u/m)^{1/d},  ξ_{m,l,x} := m ‖x − Y_(l)(x, Y_m)‖^d. (29)
We take into account in (28) that the random vectors Y_1, ..., Y_m, X_i are independent and that Y_1, ..., Y_m have the same law as Y. We also note that the event {‖x − Y_(l)(x, Y_m)‖ > r_m(u)} is a union of pairwise disjoint events A_s, s = 0, ..., l − 1. Here A_s means that exactly s observations among Y_m belong to the ball B(x, r_m(u)) and the other m − s are outside this ball (the probability that Y belongs to the sphere {z ∈ R^d : ‖z − x‖ = r} equals 0 since Y has a density w.r.t. the Lebesgue measure µ). Formulas (28) and (29) show that F^i_{m,l,x}(u) is the regular conditional distribution function of φ_{m,l}(i) given X_i = x. Moreover, (28) means that φ_{m,l}(i), i ∈ {1, ..., n}, are identically distributed and we may omit the dependence on i. So, one can write F_{m,l,x}(u) instead of F^i_{m,l,x}(u). According to the Lebesgue differentiation theorem (see, e.g., [49], p. 654), if q ∈ L^1(R^d), then, for µ-almost all x ∈ R^d, one has

lim_{r→0+} (1/µ(B(x, r))) ∫_{B(x,r)} |q(z) − q(x)| dz = 0. (30)
Let Λ(q) denote the set of Lebesgue points of a function q, namely the points in R^d satisfying (30). Evidently, it depends on the choice of the version within the class of functions in L^1(R^d) equivalent to q, and, for an arbitrary version of q, we have µ(R^d \ Λ(q)) = 0. Clearly, for each u ≥ 0, r_m(u) → 0 as m → ∞, and µ(B(x, r_m(u))) = V_d r_m^d(u) = V_d u/m. Therefore, by virtue of (30), for any fixed x ∈ Λ(q) and u ≥ 0,

W_{m,x}(u) = (V_d u / m)(q(x) + α_m(x, u)),

where α_m(x, u) → 0, m → ∞. Hence, for x ∈ Λ(q) ∩ S(q) (thus q(x) > 0), due to (28),

F_{m,l,x}(u) → 1 − ∑_{s=0}^{l−1} ((V_d q(x) u)^s / s!) e^{−V_d q(x) u} =: F_{l,x}(u), m → ∞. (31)
Relation (31) means that

ξ_{m,l,x} → ξ_{l,x} in law, x ∈ Λ(q) ∩ S(q), m → ∞,

where ξ_{l,x} has the Gamma distribution Γ(α, λ) with parameters α = V_d q(x) and λ = l. For any x ∈ S(q), one can assume w.l.g. that the random variables ξ_{l,x} and {ξ_{m,l,x}}_{m≥l} are defined on a probability space (Ω, F, P). Indeed, by the Lomnicki–Ulam theorem (see, e.g., [53], p. 93), independent copies of Y_1, Y_2, ... and {ξ_{l,x}}_{x∈S(q)} exist on a certain probability space. The convergence in distribution of random variables survives under a continuous mapping. Thus, for any x ∈ Λ(q) ∩ S(q), we see that

log ξ_{m,l,x} → log ξ_{l,x} in law, m → ∞. (32)
We have employed that ξ_{l,x} > 0 a.s. for each x ∈ Λ(q) ∩ S(q) and that Y has a density, so it follows that P(ξ_{m,l,x} > 0) = P(‖x − Y_(l)(x, Y_m)‖ > 0) = 1. More precisely, we take strictly positive versions of ξ_{l,x} and ξ_{m,l,x} for each x ∈ Λ(q) ∩ S(q).

Step 2. Now we show that, instead of the validity of (27), one can verify the following assertion: for µ-almost every x ∈ Λ(q) ∩ S(q),

E log ξ_{m,l,x} → E log ξ_{l,x}, m → ∞. (33)
Note that if η ∼ Γ(α, λ), where α > 0 and λ > 0, then E log η = ψ(λ) − log α, where ψ is the digamma function. Set α = V_d q(x) for x ∈ S(q) (then α > 0) and λ = l. Hence

E log ξ_{l,x} = ψ(l) − log(V_d q(x)) = ψ(l) − log V_d − log q(x).

By virtue of (26), for each x ∈ R^d,
E log ξ_{m,l,x} = ∫_{(0,∞)} log u dF_{m,l,x}(u) = ∫_{(0,∞)} log u dP(φ_{m,l}(1) ≤ u | X_1 = x) = E(log φ_{m,l}(1) | X_1 = x).
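The identity E log η = ψ(λ) − log α used above (η ∼ Γ(α, λ) with rate α and shape λ) is easy to check numerically. The sketch below is not from the paper; it approximates the digamma function by a central difference of `math.lgamma` and uses the stdlib Gamma sampler.

```python
import math, random

def digamma(x, h=1e-5):
    # Central-difference approximation of psi(x) = d/dx log Gamma(x).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

random.seed(0)
alpha_rate, lam = 3.0, 2.0   # rate alpha, shape lambda (illustrative values)
n = 200_000
# random.gammavariate(shape, scale); rate alpha corresponds to scale 1/alpha.
mc = sum(math.log(random.gammavariate(lam, 1.0 / alpha_rate)) for _ in range(n)) / n
exact = digamma(lam) - math.log(alpha_rate)
print(mc, exact)  # the two values should agree to about 1e-2
```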
Hence, for x ∈ Λ(q) ∩ S(q), the relation E(log φ_{m,l}(1) | X_1 = x) → ψ(l) − log V_d − log q(x) holds if and only if (33) is true. According to Theorem 3.5 of [54], we would have established (33) if relation (32) could be supplemented, for µ-almost all x ∈ Λ(q) ∩ S(q), by the condition of uniform integrability of the family {log ξ_{m,l,x}}_{m≥m_0(x)}. Note that, for each N ∈ N, the function G_N(t) introduced by (10) is nondecreasing on (0, ∞) and G_N(t)/t → ∞ as t → ∞. By the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 of [52]), to ensure, for µ-almost every x ∈ Λ(q) ∩ S(q), the uniform integrability of {log ξ_{m,l,x}}_{m≥m_0(x)}, it suffices to prove the following statement: for the indicated x, some positive C_0(x) and m_0(x) ∈ N, one has
sup_{m≥m_0(x)} E G_N(|log ξ_{m,l,x}|) ≤ C_0(x) < ∞,   (34)
where G_N appears in the conditions of Theorem 1. Moreover, as we will show further, it is possible to find m_0 ∈ N that does not depend on x ∈ R^d.
Step 3. This step is devoted to proving the validity of (34). It is convenient to divide it into parts (3a), (3b), etc. For any N ∈ N, set

g_N(t) :=
−(1/t)(log_[N](−log t) + 1)/∏_{j=1}^{N−1} log_[j](−log t),  t ∈ (0, 1/e_[N]],
0,  t ∈ (1/e_[N], e_[N]],
(1/t)(log_[N](log t) + 1)/∏_{j=1}^{N−1} log_[j](log t),  t ∈ (e_[N], ∞),

where the product over an empty set (when N = 1) is equal to 1. The proof of the following result is placed in Appendix A.
Lemma 2. Let F(u), u ∈ R, be a distribution function such that F(0) = 0. Then, for each N ∈ N, one has

(1) ∫_{(0, 1/e_[N])} G_N(|log u|) dF(u) = ∫_{(0, 1/e_[N])} F(u)(−g_N(u)) du,

(2) ∫_{(e_[N], ∞)} G_N(|log u|) dF(u) = ∫_{(e_[N], ∞)} (1 − F(u)) g_N(u) du.
Fix N appearing in the conditions of Theorem 1. Observe that, for u ∈ [1/e_[N], e_[N]], one has G_N(|log u|) = 0. Therefore, according to Lemma 2, for x ∈ Λ(q) ∩ S(q) and m ≥ l, we get E G_N(|log ξ_{m,l,x}|) = I_1(m, x) + I_2(m, x), where

I_1(m, x) := ∫_{(0, 1/e_[N])} F_{m,l,x}(u)(−g_N(u)) du,  I_2(m, x) := ∫_{(e_[N], ∞)} (1 − F_{m,l,x}(u)) g_N(u) du.
For convenience we write I_1(m, x) and I_2(m, x) without indicating their dependence on N, l and d (these parameters are fixed).
Part (3a). We provide bounds for I_1(m, x). Take R > 0 appearing in the conditions of Theorem 1 and any u ∈ (0, 1/e_[N]]. Introduce m_1 := max{⌈1/(e_[N] R^d)⌉, l}, where, for a ∈ R, ⌈a⌉ := inf{m ∈ Z : m ≥ a}. Then r_m(u) = (u/m)^{1/d} ≤ (1/(e_[N] m))^{1/d} ≤ R if m ≥ m_1. Note also that we can consider only m ≥ l everywhere below, because the size of the sample Y_m is not less than the number of neighbors l (see, e.g., (28)). Thus, for R > 0, u ∈ (0, 1/e_[N]], x ∈ R^d and m ≥ m_1,

W_{m,x}(u)/µ(B(x, r_m(u))) = (∫_{B(x,r_m(u))} q(y) dy)/(r_m(u)^d V_d) ≤ sup_{r∈(0,R]} (∫_{B(x,r)} q(y) dy)/(r^d V_d) = M_q(x, R),
and we arrive at the inequality

W_{m,x}(u) ≤ M_q(x, R) µ(B(x, r_m(u))) = M_q(x, R) V_d u / m.   (35)

If γ ∈ (0, 1] and t ∈ [0, 1] then, for all m ≥ 1, invoking the Bernoulli inequality, one has

1 − (1 − t)^m ≤ (mt)^γ.   (36)

Recall that we assume Q_{p,q}(ε, R) < ∞ for some ε > 0, R > 0. By virtue of Lemma 1 one can take ε < 1. So, due to (36) and since W_{m,x}(u) ∈ [0, 1] for all x ∈ R^d, u > 0 and m ≥ l, we get

1 − (1 − W_{m,x}(u))^m ≤ (m W_{m,x}(u))^ε.   (37)

Thus, in view of (28), (35) and (37), we have established that, for all x ∈ Λ(q) ∩ S(q), u ∈ (0, 1/e_[N]] and m ≥ m_1,

F_{m,l,x}(u) = 1 − ∑_{s=0}^{l−1} C(m, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{m−s} ≤ 1 − (1 − W_{m,x}(u))^m ≤ (m M_q(x, R) V_d u / m)^ε = (M_q(x, R))^ε V_d^ε u^ε,   (38)

where C(m, s) denotes the binomial coefficient.
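Inequality (36) follows since 1 − (1 − t)^m ≤ min(1, mt) ≤ (mt)^γ for γ ∈ (0, 1]. A brute-force grid check (illustrative only, not from the paper):

```python
# Grid check of 1 - (1 - t)**m <= (m*t)**gamma for gamma in (0, 1], t in [0, 1].
# A tiny epsilon absorbs floating-point rounding.
ok = all(
    1 - (1 - t) ** m <= (m * t) ** gamma + 1e-12
    for gamma in (0.1, 0.5, 1.0)
    for m in (1, 2, 10, 100)
    for t in [i / 100 for i in range(101)]
)
print(ok)  # True
```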
Therefore, for any x ∈ Λ(q) ∩ S(q) and m ≥ m_1, one can write

I_1(m, x) ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/e_[N])} u^ε (−g_N(u)) du ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/e_[N])} (log_[N](−log u) + 1) u^{ε−1} du = U_1(ε, N, d)(M_q(x, R))^ε,   (39)

where U_1(ε, N, d) := V_d^ε L_N(ε) and L_N(ε) := ∫_{[e_[N−1],∞)} (log_[N](t) + 1) e^{−εt} dt < ∞. We took into account that (−g_N(u)) ≤ (1/u)(log_[N](−log u) + 1) whenever u ∈ (0, 1/e_[N]].
Part (3b). We give bounds for I_2(m, x). Since g_N(u) ≤ (log_[N+1](u) + 1)/u if u ∈ (e_[N], ∞), we can write, for m ≥ max{e_[N]², l},
I_2(m, x) ≤ ∫_{(e_[N], √m]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du + ∫_{(√m, m²]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du + ∫_{(m², ∞)} (1 − F_{m,l,x}(u)) g_N(u) du := J_1(m, x) + J_2(m, x) + J_3(m, x).
Evidently,

1 − F_{m,l,x}(u) = ∑_{r=m−l+1}^{m} C(m, r)(P_{m,x}(u))^r (1 − P_{m,x}(u))^{m−r} = P(Z ≥ m − l + 1),

where P_{m,x}(u) := 1 − W_{m,x}(u) and Z ∼ Bin(m, P_{m,x}(u)). By Markov's inequality, P(Z ≥ t) ≤ e^{−λt} E e^{λZ} for any λ > 0 and t > 0. One has

E e^{λZ} = ∑_{j=0}^{m} e^{λj} C(m, j)(P_{m,x}(u))^j (1 − P_{m,x}(u))^{m−j} = (1 − P_{m,x}(u) + e^λ P_{m,x}(u))^m.

Consequently, for each λ > 0,

1 − F_{m,l,x}(u) ≤ e^{−λ(m−l+1)} (1 − P_{m,x}(u) + e^λ P_{m,x}(u))^m = e^{−λ(m−l+1)} (W_{m,x}(u) + e^λ (1 − W_{m,x}(u)))^m = e^{λ(l−1)} (1 − (1 − e^{−λ}) W_{m,x}(u))^m.   (40)
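The Markov (Chernoff-type) step above can be checked numerically by comparing an exact binomial tail with e^{−λt}(1 − p + e^λ p)^m; the parameters below are hypothetical, chosen only for illustration.

```python
import math

def binom_tail(m, p, t):
    """Exact P(Z >= t) for Z ~ Bin(m, p)."""
    return sum(math.comb(m, r) * p**r * (1 - p) ** (m - r) for r in range(t, m + 1))

def chernoff(m, p, t, lam):
    """Bound e^{-lam*t} * (1 - p + e^{lam} * p)^m obtained from Markov's inequality."""
    return math.exp(-lam * t) * (1 - p + math.exp(lam) * p) ** m

m, p, l = 50, 0.9, 3
t = m - l + 1
exact = binom_tail(m, p, t)
for lam in (0.5, 1.0, 2.0):
    assert exact <= chernoff(m, p, t, lam)
print(exact <= chernoff(m, p, t, 1.0))  # True
```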
To simplify the bounds we take λ = 1 and set S_1 = S_1(l) := e^{l−1}, S_2 := 1 − 1/e (recall that l is fixed). Thus, S_1 ≥ 1 and S_2 < 1. Therefore,

1 − F_{m,l,x}(u) ≤ S_1 (1 − S_2 W_{m,x}(u))^m ≤ S_1 exp{−S_2 m W_{m,x}(u)},   (41)

where we have used the simple inequality 1 − t ≤ e^{−t}, t ∈ [0, 1]. For R > 0 appearing in the conditions of the Theorem and any u ∈ (e_[N], √m], one can choose m_2 := max{⌈1/R^{2d}⌉, ⌈e_[N]²⌉, l} such that if m ≥ m_2 then r_m(u) = (u/m)^{1/d} ≤ (1/√m)^{1/d} ≤ R. Due to (29) and (41), for u ∈ (e_[N], √m] and m ≥ m_2, one has

1 − F_{m,l,x}(u) ≤ S_1 exp{−S_2 V_d u W_{m,x}(u)/(V_d u/m)} = S_1 exp{−S_2 V_d u (∫_{B(x,r_m(u))} q(z) dz)/µ(B(x, r_m(u)))} ≤ S_1 exp{−S_2 V_d u m_q(x, R)},   (42)
by definition of m_f (for f = q) in (9). Now we use the following Lemma 3.2 of [35].
Lemma 3. For a version of a density q and each R > 0, one has µ(S(q) \ Dq(R)) = 0 where Dq(R) := {x ∈ S(q) : mq(x, R) > 0} and mq(·, R) is defined according to (9).
It is easily seen that, for any t > 0 and each δ ∈ (0, e], one has e^{−t} ≤ t^{−δ}. Thus, for x ∈ D_q(R), m ≥ m_2, u ∈ (e_[N], √m] and ε > 0, we deduce from the conditions of the Theorem (in view of Lemma 1 one can suppose that ε ∈ (0, e]) that

1 − F_{m,l,x}(u) ≤ S_1 (S_2 V_d u m_q(x, R))^{−ε}.   (43)

We also took into account that m_q(x, R) > 0 for x ∈ D_q(R) and applied relation (42). Thus, for all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2,

J_1(m, x) ≤ (S_1/((S_2 V_d)^ε (m_q(x, R))^ε)) ∫_{(e_[N],∞)} ((log_[N+1](u) + 1)/u^{1+ε}) du = U_2(ε, N, d, l)(m_q(x, R))^{−ε},   (44)
where U_2(ε, N, d, l) := S_1(l) L_N(ε)(S_2 V_d)^{−ε}.
Part (3c). We provide the bound for J_2(m, x). For all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2, in view of (43), it holds that 1 − F_{m,l,x}(√m) ≤ S_1 (S_2 V_d m_q(x, R) √m)^{−ε}. Hence (as m_2 ≥ 2),
J_2(m, x) ≤ ∫_{(√m, m²]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du ≤ (1 − F_{m,l,x}(√m)) ∫_{(√m, m²]} (log_[N+1](u) + 1) d log u ≤ S_1 (S_2 V_d)^{−ε} (m_q(x, R))^{−ε} m^{−ε/2} (log_[N](2 log m) + 1) (3/2) log m.
Then, for all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2,

J_2(m, x) ≤ U_3(m, ε, N, d, l)(m_q(x, R))^{−ε},   (45)

where U_3(m, ε, N, d, l) := (3/2) S_1(l)(S_2 V_d)^{−ε} m^{−ε/2} log m (log_[N](2 log m) + 1) → 0, m → ∞.
Part (3d). To indicate bounds for J3(m, x) we employ several auxiliary results.
Lemma 4. For each N ∈ N and any ν > 0, there are a := a(d, ν) ≥ 0 and b := b(N, d, ν) ≥ 0 such that, for arbitrary x, y ∈ R^d,

G_N(|log ‖x − y‖^d|^ν) ≤ a G_N(|log ‖x − y‖|^ν) + b.
The proof is given in Appendix A. On the one hand, by (29), for any w ≥ 0, we get

W_{m,x}(mw) = ∫_{B(x, w^{1/d})} q(z) dz = W_{1,x}(w).

On the other hand, by (28), one has F_{1,1,x}(w) = 1 − (1 − W_{1,x}(w)) = W_{1,x}(w). Consequently, for any m ∈ N, w ≥ 0 and all x ∈ R^d,

W_{m,x}(mw) = F_{1,1,x}(w).   (46)

Moreover, F_{1,1,x}(w) = P(‖Y − x‖^d ≤ w), so ξ_{1,1,x} =_law ‖Y − x‖^d. Thus, due to Lemmas 2 and 4 (for ν = 1),

∫_{(e_[N],∞)} (1 − F_{1,1,x}(w)) g_N(w) dw = ∫_{(e_[N],∞)} G_N(log w) dF_{1,1,x}(w)
= E[G_N(log ξ_{1,1,x}) I{ξ_{1,1,x} > e_[N]}] = E[G_N(log ‖Y − x‖^d) I{‖Y − x‖^d > e_[N]}]
= ∫_{y∈R^d, ‖x−y‖>(e_[N])^{1/d}} G_N(log ‖x − y‖^d) q(y) dy
≤ a(d, 1) ∫_{y∈R^d, ‖x−y‖>(e_[N])^{1/d}} G_N(|log ‖x − y‖|) q(y) dy + b(N, d, 1)
= a(d, 1) ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) q(y) dy + b(N, d, 1),   (47)
since G_N(t) = 0 for t ∈ [0, e_[N−1]], N ∈ N.
Now we estimate 1 − F_{m,l,x}(u) in a way different from (40). Fix any δ > 0. Note that, for all m ≥ (l − 1)(1 + 1/δ) and s ∈ {0, ..., l − 1}, it holds that m/(m − s) ≤ m/(m − l + 1) ≤ 1 + δ. Then, for all x ∈ R^d, u ≥ 0 and m ≥ max{l, (l − 1)(1 + 1/δ)}, in view of (28) one can write

1 − F_{m,l,x}(u) = (1 − W_{m,x}(u)) ∑_{s=0}^{l−1} (m/(m − s)) C(m − 1, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{(m−1)−s}
≤ (1 + δ)(1 − W_{m,x}(u)) ∑_{s=0}^{l−1} C(m − 1, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{(m−1)−s}
≤ (1 + δ)(1 − W_{m,x}(u)).   (48)
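Since 1 − F_{m,l,x}(u) = P(Bin(m, W_{m,x}(u)) ≤ l − 1), inequality (48) can be probed with exact binomial probabilities. The parameters in this sketch are hypothetical; it is not the paper's code.

```python
import math

def tail(m, l, W):
    """1 - F_{m,l,x}(u) = sum_{s<l} C(m,s) W^s (1-W)^(m-s), i.e. P(Bin(m,W) <= l-1)."""
    return sum(math.comb(m, s) * W**s * (1 - W) ** (m - s) for s in range(l))

l, delta = 4, 0.5
m_min = math.ceil((l - 1) * (1 + 1 / delta))  # = 9 here
for m in range(max(l, m_min), 60):
    for W in [i / 50 for i in range(51)]:
        # Bound (48): valid once m >= (l-1)(1 + 1/delta).
        assert tail(m, l, W) <= (1 + delta) * (1 - W) + 1e-12
print("bound (48) verified on the grid")
```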
We are going to employ the following statement as well.
Lemma 5. For each N ∈ N, a function log[N](t), t > e[N−1], is slowly varying at infinity.
The proof is elementary and thus is omitted.
Part (3e). Now we are ready to get the bound for J_3(m, x). Set u = mw. Then one has

J_3(m, x) = ∫_{(m²,∞)} (1 − F_{m,l,x}(u)) (1/u)((log_[N](log u) + 1)/∏_{j=1}^{N−1} log_[j](log u)) du
= ∫_{(m,∞)} (1 − F_{m,l,x}(mw)) (1/w)((log_[N+1](mw) + 1)/∏_{j=2}^{N} log_[j](mw)) dw.

For w > m, Lemma 5 implies that log_[N+1](mw) ≤ log_[N+1](w²) = log_[N](2 log w) ≤ 2 log_[N+1](w) for w large enough, namely for all w ≥ W, where W = W(N). Take δ > 0 and set m_3 := max{l, ⌈(l − 1)(1 + 1/δ)⌉, ⌈W(N)⌉, e_[N]}. Let further m ≥ m_3. Then

J_3(m, x) ≤ 2 ∫_{(m,∞)} (1 − F_{m,l,x}(mw)) (1/w)((log_[N+1](w) + 1)/∏_{j=2}^{N} log_[j](w)) dw.
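The slow variation claimed in Lemma 5, and the factor-2 comparison just used, can be probed numerically; the values below are illustrative only.

```python
import math

def iter_log(t, N):
    """log_[N](t): the N-fold iterated natural logarithm."""
    for _ in range(N):
        t = math.log(t)
    return t

# Slow variation: log_[N](c * t) / log_[N](t) -> 1 as t -> infinity.
t = 1e300
for N in (1, 2, 3):
    ratio = iter_log(2 * t, N) / iter_log(t, N)
    assert abs(ratio - 1) < 1e-2

# Factor-2 comparison for large arguments (here with m <= w, sample values):
w, m = 1e12, 1e6
assert iter_log(m * w, 3) <= 2 * iter_log(w, 3)
print("slow-variation checks passed")
```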
By virtue of (46) and (48) one has
1 − Fm,l,x(mw) ≤ (1 + δ)(1 − Wm,x(mw)) = (1 + δ)(1 − F1,1,x(w)). (49)
Hence it can be seen that

J_3(m, x) ≤ 2(1 + δ) ∫_{(m, ∞)} (1 − F_{1,1,x}(w)) g_N(w) dw.   (50)
Introduce

R_N(x) := ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) q(y) dy,  A_p(G_N) := {x ∈ S(p) : R_N(x) < ∞}.
Let us note that (1) P_X(S(p) \ A_p(G_N)) = 0 as K_{p,q}(1, N) < ∞; (2) P_X(S(p) \ S(q)) = 0 as P_X ≪ P_Y (see Lemma A1); (3) µ(S(q) \ (Λ(q) ∩ D_q(R))) = 0 due to Lemma 3. Since P_X ≪ µ, we conclude that P_X(S(q) \ (Λ(q) ∩ D_q(R))) = 0. Hence, one has P_X(S(p) \ (Λ(q) ∩ D_q(R))) = 0 in view of (2) and because B \ C ⊂ (B \ A) ∪ (A \ C) for any A, B, C ⊂ R^d. Set further A := Λ(q) ∩ S(q) ∩ D_q(R) ∩ S(p) ∩ A_p(G_N). It follows from (1), (2) and (3) that P_X(S(p) \ A) = 0, so P_X(A) = 1. We are going to consider only x ∈ A. Then, by virtue of (47) and (50), for all m ≥ m_3 and x ∈ A, we come to the inequality

J_3(m, x) ≤ 2(1 + δ)(a(d, 1) R_N(x) + b(N, d, 1)) = A(δ, d) R_N(x) + B(δ, d, N),   (51)

where A(δ, d) := 2(1 + δ) a(d, 1) and B(δ, d, N) := 2(1 + δ) b(N, d, 1).
Part (3f). Here we get the upper bound for E G_N(|log ξ_{m,l,x}|). For m ≥ max{m_1, m_2, m_3} and each x ∈ A, taking into account (39), (44), (45) and (51), we can claim that
E G_N(|log ξ_{m,l,x}|) ≤ I_1(m, x) + J_1(m, x) + J_2(m, x) + J_3(m, x)
≤ U_1(ε, N, d)(M_q(x, R))^ε + U_2(ε, N, d, l)(m_q(x, R))^{−ε} + U_3(m, ε, N, d, l)(m_q(x, R))^{−ε} + (A(δ, d) R_N(x) + B(δ, d, N)).   (52)
For any κ > 0, one can take m_4 = m_4(κ, ε, N, d, l) ∈ N such that U_3(m, ε, N, d, l) ≤ κ if m ≥ m_4. Then, by virtue of (52), for each x ∈ A and m ≥ m_0 := max{m_1, m_2, m_3, m_4},
E G_N(|log ξ_{m,l,x}|) ≤ U_1(ε, N, d)(M_q(x, R))^ε + (U_2(ε, N, d, l) + κ)(m_q(x, R))^{−ε} + (A(δ, d) R_N(x) + B(δ, d, N)) := C_0(x) < ∞.   (53)
Hence, for each x ∈ A, we have established the uniform integrability of the family {log ξ_{m,l,x}}_{m≥m_0}.
Step 4. Now we verify (23). It was checked, for each x ∈ A (thus, for P_X-almost every x belonging to S(p)), that E(log φ_{m,l}(1)|X_1 = x) → ψ(l) − log V_d − log q(x), m → ∞. Set Z_{m,l}(x) := E(log φ_{m,l}(1)|X_1 = x) = E log ξ_{m,l,x}. Consider x ∈ A and take any m ≥ m_0. We use the following property of G_N, which is shown in Appendix A.
Lemma 6. For each N ∈ N, the function G_N is convex on R_+.

Since G_N is nondecreasing and convex, by the Jensen inequality,

G_N(|Z_{m,l}(x)|) = G_N(|E log ξ_{m,l,x}|) ≤ G_N(E|log ξ_{m,l,x}|) ≤ E G_N(|log ξ_{m,l,x}|).
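The chain above only uses that a nondecreasing convex G satisfies G(|E ξ|) ≤ G(E|ξ|) ≤ E G(|ξ|). A toy check with a stand-in convex function (G_N itself is defined by (10) in the paper and is not reproduced here):

```python
import math, random

random.seed(42)
G = lambda t: t * math.log1p(t)  # stand-in: nondecreasing and convex on [0, inf)
xs = [random.gauss(0.3, 1.0) for _ in range(100_000)]

abs_mean = abs(sum(xs) / len(xs))              # |E xi|
mean_abs = sum(abs(x) for x in xs) / len(xs)   # E |xi|
mean_G = sum(G(abs(x)) for x in xs) / len(xs)  # E G(|xi|)

# Triangle inequality + monotonicity, then Jensen for the empirical measure:
assert G(abs_mean) <= G(mean_abs) <= mean_G
print("Jensen chain holds for the sample")
```

Both inequalities hold deterministically for any sample, since Jensen applies to the empirical distribution.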
Relation (53) guarantees that, for all m ≥ m_0,

∫_{R^d} G_N(|Z_{m,l}(x)|) p(x) dx ≤ U_1(ε, N, d) Q_{p,q}(ε, R) + (U_2(ε, N, d, l) + κ) T_{p,q}(ε, R) + A(δ, d) K_{p,q}(1, N) + B(δ, d, N) < ∞.
Now we know that the family {Z_{m,l}(x)}_{m≥m_0}, x ∈ A, is uniformly integrable w.r.t. the measure P_X. Thus, for i ∈ N,

E log φ_{m,l}(i) = ∫_{R^d} E(log φ_{m,l}(1)|X_1 = x) P_X(dx) = ∫_{R^d} Z_{m,l}(x) p(x) dx → ψ(l) − log V_d − ∫_{R^d} p(x) log q(x) dx, m → ∞,

and we come to relation (23), establishing Statement 1.
Step 5. Here we prove Statement 2. Similarly to F_{m,l,x}(u), one can introduce, for n, k ∈ N, n ≥ k + 1, x ∈ R^d and u ≥ 0, the following function
F̃_{n,k,x}(u) := P(ζ_{n,k}(i) ≤ u|X_i = x) = 1 − P(‖x − X_(k)(x, X_n \ {x})‖ > r_{n−1}(u))
= 1 − ∑_{s=0}^{k−1} C(n − 1, s)(V_{n−1,x}(u))^s (1 − V_{n−1,x}(u))^{n−1−s} := P(ξ̃_{n,k,x} ≤ u),   (54)
where r_n(u) was defined in (29), and

V_{n,x}(u) := ∫_{B(x,r_n(u))} p(z) dz,  ξ̃_{n,k,x} := (n − 1)‖x − X_(k)(x, X_n \ {x})‖^d.   (55)
Formulas (54) and (55) show that F̃_{n,k,x}(u) is the regular conditional distribution function of ζ_{n,k}(i) given X_i = x. Moreover, for any fixed u ≥ 0 and x ∈ Λ(p) ∩ S(p) (thus p(x) > 0),

F̃_{n,k,x}(u) → 1 − ∑_{s=0}^{k−1} ((V_d p(x) u)^s / s!) e^{−V_d p(x) u} := F̃_{k,x}(u), n → ∞.
Hence, ξ̃_{n,k,x} →_law ξ̃_{k,x}, x ∈ Λ(p) ∩ S(p), n → ∞. Set Ã_p(G_N) := {x ∈ S(p) : R̃_N(x) < ∞}, where N ∈ N and

R̃_N(x) := ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) p(y) dy.
Take Ã := Λ(p) ∩ S(p) ∩ D_p(R) ∩ Ã_p(G_N). Then P_X(Ã) = 1 and, for x ∈ Ã, one can verify that E G_N(|log ξ̃_{n,k,x}|) ≤ C̃_0(x) < ∞ for all n ≥ n_0, and therefore E log ξ̃_{n,k,x} → E log ξ̃_{k,x} as n → ∞. Thus, E(log ζ_{n,k}(1)|X_1 = x) → ψ(k) − log V_d − log p(x), n → ∞. Set Z̃_{n,k}(x) := E(log ζ_{n,k}(1)|X_1 = x). One can see that, for all n ≥ n_0, ∫_{R^d} G_N(|Z̃_{n,k}(x)|) p(x) dx < ∞. Hence, similarly to Steps 1–4, we come to relation (24). So, (14) holds and the proof of Theorem 1 is complete.
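For orientation, here is a minimal sketch (under stated assumptions, not the paper's code) of the k-nearest-neighbor divergence estimate whose building blocks φ_{m,l}(i) = m‖X_i − Y_(l)(X_i, Y_m)‖^d and ζ_{n,k}(i) = (n−1)‖X_i − X_(k)(X_i, X_n \ {X_i})‖^d appear in (23) and (24): an estimator of the form D̂ = (1/n) ∑_i [log φ_{m,l}(i) − log ζ_{n,k}(i)] + ψ(k) − ψ(l) then has expectation converging to D(P_X||P_Y). The code below takes d = 1 and k = l (so the digamma terms cancel) and uses sorted arrays for fast 1-d nearest-neighbor search; it is tried on Gaussian samples, where D(N(0,1)||N(1,1)) = 1/2.

```python
import bisect, math, random

def kth_nn_dist(t, sorted_pts, k, exclude_self=False):
    """Distance from t to its k-th nearest neighbor in sorted_pts (1-d)."""
    j = bisect.bisect_left(sorted_pts, t)
    lo, hi = j - 1, j
    need = k + (1 if exclude_self else 0)  # the extra pick skips the zero self-distance
    dist = 0.0
    for _ in range(need):
        left = t - sorted_pts[lo] if lo >= 0 else math.inf
        right = sorted_pts[hi] - t if hi < len(sorted_pts) else math.inf
        if left <= right:
            dist, lo = left, lo - 1
        else:
            dist, hi = right, hi + 1
    return dist

def kl_estimate_1d(xs, ys, k=1):
    """D_hat = (1/n) sum_i [log(m * nu_i) - log((n-1) * rho_i)] for d = 1, k = l."""
    n, m = len(xs), len(ys)
    sx, sy = sorted(xs), sorted(ys)
    total = 0.0
    for x in xs:
        nu = kth_nn_dist(x, sy, k)                      # l-th NN distance in the Y-sample
        rho = kth_nn_dist(x, sx, k, exclude_self=True)  # k-th NN distance in X_n \ {x}
        total += math.log(m * nu) - math.log((n - 1) * rho)
    return total / n

random.seed(7)
n = m = 4000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [random.gauss(1.0, 1.0) for _ in range(m)]
est = kl_estimate_1d(xs, ys)
print(est)  # true D(N(0,1)||N(1,1)) = 0.5
```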
5. Proof of Theorem 2
We will follow the general scheme described in Remark 7; however, now this scheme is more involved. First of all, note that, in view of Lemma 1, the finiteness of K_{p,q}(2, N) and K_{p,p}(2, N) implies the finiteness of K_{p,q}(1, N) and K_{p,p}(1, N), respectively. Thus, the conditions of Theorem 2 entail the validity of the statements of Theorem 1. Consequently, under the conditions of Theorem 2, for n and m large enough, one can claim that D̂_{n,m}(k, l) ∈ L¹(Ω) and E D̂_{n,m}(k, l) → D(P_X||P_Y) as n, m → ∞.
We will show that D̂_{n,m}(k, l) ∈ L²(Ω) for all n and m large enough. Then one can write

E(D̂_{n,m}(k, l) − D(P_X||P_Y))² = var(D̂_{n,m}(k, l)) + (E D̂_{n,m}(k, l) − D(P_X||P_Y))².

Therefore, to prove (16) we will demonstrate that var(D̂_{n,m}(k, l)) → 0, n, m → ∞.
Due to (28), the random variables log φ_{m,l}(1), ..., log φ_{m,l}(n) are identically distributed (and log ζ_{n,k}(1), ..., log ζ_{n,k}(n) are identically distributed as well). The variables φ_{m,l}(i) and ζ_{n,k}(i) are the same as in (22). We will demonstrate that log φ_{m,l}(1) and log ζ_{n,k}(1) belong to L²(Ω). Hence (22) yields

var(D̂_{n,m}(k, l)) = (1/n²) ∑_{i,j=1}^{n} cov(log φ_{m,l}(i) − log ζ_{n,k}(i), log φ_{m,l}(j) − log ζ_{n,k}(j))
= (1/n) var(log φ_{m,l}(1)) + (2/n²) ∑_{1≤i<j≤n} cov(log φ_{m,l}(i), log φ_{m,l}(j)) + (1/n) var(log ζ_{n,k}(1))
+ (2/n²) ∑_{1≤i<j≤n} cov(log ζ_{n,k}(i), log ζ_{n,k}(j)) − (2/n²) ∑_{i,j=1}^{n} cov(log φ_{m,l}(i), log ζ_{n,k}(j)).   (56)

We mainly follow the notation employed in the above proof of Theorem 1, except for the possibly different choice of the sets A ⊂ R^d, Ã ⊂ R^d, positive U_j, C_j(x), C̃_j(x) and integers m_j, n_j, where j ∈ Z_+ and x ∈ R^d. The proof of Theorem 2 is also subdivided into 5 steps. Steps 1–3 deal with the demonstration of the relation (1/n) var(log φ_{m,l}(1)) → 0 as n, m → ∞. Step 4 validates the relations (2/n²) ∑_{1≤i<j≤n} cov(log φ_{m,l}(i), log φ_{m,l}(j)) → 0, n, m → ∞, and (2/n²) ∑_{1≤i<j≤n} cov(log ζ_{n,k}(i), log ζ_{n,k}(j)) → 0, n → ∞.
Set A_{p,2}(G_N) := {x ∈ S(p) : R_{N,2}(x) < ∞}. Then P_X(S(p) \ A_{p,2}(G_N)) = 0 since K_{p,q}(2, N) < ∞. Consider

A := Λ(q) ∩ S(q) ∩ D_q(R) ∩ S(p) ∩ A_{p,2}(G_N),   (58)

where the first four sets appeared in the proof of Theorem 1, and R and N are indicated in the conditions of Theorem 2. It is easily seen that P_X(A) = 1; the reasoning is exactly the same as in the proof of Theorem 1.
Recall that, for each x ∈ A, one has log ξ_{m,l,x} →_law log ξ_{l,x}, m → ∞, where ξ_{m,l,x} := m‖x − Y_(l)(x, Y_m)‖^d and ξ_{l,x} has the Γ(V_d q(x), l) distribution. Convergence in law of random variables is maintained by continuous transformations. Thus, for each x ∈ A, we get

log² ξ_{m,l,x} →_law log² ξ_{l,x}, m → ∞.   (59)

For any x ∈ A, according to (28),

E log² ξ_{m,l,x} = ∫_{(0,∞)} log² u dF_{m,l,x}(u) = ∫_{(0,∞)} log² u dP(φ_{m,l}(1) ≤ u|X_1 = x) = E(log² φ_{m,l}(1)|X_1 = x).   (60)

Note that if η ∼ Γ(α, λ), where α > 0 and λ > 0, then it is not difficult to verify that

E log² η = Γ''(λ)/Γ(λ) − 2 ψ(λ) log α + log² α.

Since ξ_{l,x} ∼ Γ(V_d q(x), l), for x ∈ S(q), one has

E log² ξ_{l,x} = Γ''(l)/Γ(l) − 2 ψ(l) log(V_d q(x)) + log²(V_d q(x)) = log² q(x) + h_1 log q(x) + h_2,   (61)

where h_1 := h_1(l, d) and h_2 := h_2(l, d) depend only on the fixed l and d. We prove now that, for x ∈ A, one has

E(log² φ_{m,l}(1)|X_1 = x) → log² q(x) + h_1 log q(x) + h_2, m → ∞.   (62)

Taking into account (60) and (61), we can claim that relation (62) is equivalent to the following one: E log² ξ_{m,l,x} → E log² ξ_{l,x}, m → ∞. So, in view of (59), to prove (62) it is sufficient to show that, for each x ∈ A, the family {log² ξ_{m,l,x}}_{m≥m_0(x)} is uniformly integrable for some m_0(x) ∈ N. Then, following the proof of Theorem 1, one can certify that, for all x ∈ A and some nonnegative C_0(x),

sup_{m≥m_0(x)} E G_N(log² ξ_{m,l,x}) ≤ C_0(x) < ∞.   (63)

Step 2. Now we will prove (63). For each N ∈ N, introduce ρ(N) := exp{√e_[N−1]} and

h_N(t) := 0 for t ∈ [1/ρ(N), ρ(N)],
h_N(t) := (2 log t / t)((log_[N](log² t) + 1)/∏_{j=1}^{N−1} log_[j](log² t)) for t ∈ (0, 1/ρ(N)) ∪ (ρ(N), ∞).

As usual, a product over an empty set (if N = 1) equals 1. To show (63) we refer to the next lemma.

Lemma 7. Let F(u), u ∈ R, be a distribution function such that F(0) = 0. Fix an arbitrary N ∈ N. Then

(1) ∫_{(0, 1/ρ(N))} G_N(log² u) dF(u) = ∫_{(0, 1/ρ(N))} F(u)(−h_N(u)) du,

(2) ∫_{(ρ(N),∞)} G_N(log² u) dF(u) = ∫_{(ρ(N),∞)} (1 − F(u)) h_N(u) du.

The proof of this lemma is omitted, being quite similar to that of Lemma 2. By Lemma 7 and since G_N(log² u) = 0 for u ∈ [1/ρ(N), ρ(N)], one has

E G_N(log² ξ_{m,l,x}) = ∫_{(0, 1/ρ(N))} F_{m,l,x}(u)(−h_N(u)) du + ∫_{(ρ(N),∞)} (1 − F_{m,l,x}(u)) h_N(u) du := I_1(m, x) + I_2(m, x).

To simplify notation we do not indicate the dependence of I_i(m, x) (i = 1, 2) on the fixed N, l and d. For clarity, further implementation of Step 2 is divided into several parts.
Part (2a). At first we consider I_1(m, x).
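As an aside (not from the paper), the Gamma second-moment identity behind (61) can be checked numerically: with Γ''(λ)/Γ(λ) = ψ'(λ) + ψ(λ)², one has E log² η = ψ'(λ) + (ψ(λ) − log α)² for η ∼ Γ(α, λ). Below, ψ and ψ' are approximated by finite differences of `math.lgamma`.

```python
import math, random

def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / h**2

random.seed(3)
alpha_rate, lam = 2.0, 3.0   # illustrative rate alpha and shape lambda
n = 200_000
mc = sum(math.log(random.gammavariate(lam, 1.0 / alpha_rate)) ** 2 for _ in range(n)) / n
# Gamma''(lam)/Gamma(lam) = trigamma(lam) + digamma(lam)**2
exact = (trigamma(lam) + digamma(lam) ** 2
         - 2 * digamma(lam) * math.log(alpha_rate) + math.log(alpha_rate) ** 2)
print(mc, exact)  # should agree to about 1e-2
```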
As in the proof of Theorem 1, for fixed R > 0 and ε > 0 appearing in the conditions of Theorem 2, the inequality F_{m,l,x}(u) ≤ (M_q(x, R))^ε V_d^ε u^ε holds for any x ∈ A, u ∈ (0, 1/ρ(N)] and m ≥ m_1 := max{⌈1/(ρ(N) R^d)⌉, l}. Taking into account that (−h_N(u)) ≤ (−2 log u)(log_[N](log² u) + 1)/u if u ∈ (0, 1/ρ(N)], we get, for m ≥ m_1,

I_1(m, x) ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/ρ(N))} (−2 log u)(log_[N](log² u) + 1) u^{ε−1} du = U_1(ε, N, d)(M_q(x, R))^ε.   (64)

Here U_1(ε, N, d) := V_d^ε L_{N,2}(ε), with L_{N,2}(ε) := ∫_{(√e_[N−1],∞)} 2t(log_[N](t²) + 1) e^{−εt} dt < ∞ for each ε > 0 and any N ∈ N.
Part (2b). Consider I_2(m, x). Following the proof of the previous theorem, we first observe that h_N(u) ≤ (2 log u/u)(log_[N](log² u) + 1) for u ∈ (ρ(N), ∞). So, for all m ≥ max{ρ²(N), l},

I_2(m, x) ≤ ∫_{(ρ(N), √m]} ((2 log u)(log_[N](log² u) + 1)/u)(1 − F_{m,l,x}(u)) du
+ ∫_{(√m, m²]} ((2 log u)(log_[N](log² u) + 1)/u)(1 − F_{m,l,x}(u)) du
+ ∫_{(m², ∞)} (1 − F_{m,l,x}(u)) h_N(u) du := J_1(m, x) + J_2(m, x) + J_3(m, x),

where we do not indicate the dependence of J_j(m, x) (j = 1, 2, 3) on N, l and d. For R > 0 and ε > 0 appearing in the conditions of Theorem 2, one can show (see the proof of Theorem 1) that the inequality