Article

Statistical Estimation of the Kullback–Leibler Divergence
Alexander Bulinski 1,* and Denis Dimitrov 2
1 Steklov Mathematical Institute of Russian Academy of Sciences, 119991 Moscow, Russia
2 Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, 119234 Moscow, Russia; [email protected]
* Correspondence: [email protected]; Tel.: +7-495-939-14-03
Abstract: Asymptotic unbiasedness and L2-consistency are established, under mild conditions, for the estimates of the Kullback–Leibler divergence between two probability measures in R^d, absolutely continuous with respect to (w.r.t.) the Lebesgue measure. These estimates are based on certain k-nearest neighbor statistics for a pair of independent identically distributed (i.i.d.) vector samples. The novelty of the results is also in treating mixture models. In particular, they cover mixtures of nondegenerate Gaussian measures. The mentioned asymptotic properties of related estimators for the Shannon entropy and cross-entropy are strengthened. Some applications are indicated.
Keywords: Kullback–Leibler divergence; Shannon differential entropy; statistical estimates; k-nearest neighbor statistics; asymptotic behavior; Gaussian model; mixtures
MSC: 60F25; 62G20; 62H12
Citation: Bulinski, A.; Dimitrov, D. Statistical Estimation of the Kullback–Leibler Divergence. Mathematics 2021, 9, 544. https://doi.org/10.3390/math9050544

Academic Editors: Irina Shevtsova and Victor Korolev

Received: 25 January 2021; Accepted: 28 February 2021; Published: 4 March 2021

1. Introduction

The Kullback–Leibler divergence introduced in [1] is used for quantification of similarity of two probability measures. It plays an important role in various domains such as statistical inference (see, e.g., [2,3]), metric learning [4,5], machine learning [6,7], computer vision [8,9], network security [10], feature selection and classification [11–13], physics [14], biology [15], medicine [16,17], finance [18], among others. It is worth emphasizing that mutual information, widely used in many research directions (see, e.g., [19–23]), is a special case of the Kullback–Leibler divergence for certain measures. Moreover, the Kullback–Leibler divergence itself belongs to a class of f-divergence measures (with f(t) = t log t). For comparison of various f-divergence measures see, e.g., [24]; their estimates are considered in [25,26].

Let P and Q be two probability measures on a measurable space (S, B). The Kullback–Leibler divergence between P and Q is defined, according to [1], by way of

D(P||Q) := ∫_S log(dP/dQ) dP if P ≪ Q, and D(P||Q) := ∞ otherwise, (1)
where dP/dQ stands for the Radon–Nikodym derivative. The integral in (1) can take values in [0, ∞]. We employ the base e of logarithms since a constant factor is not essential here.

If (S, B) = (R^d, B(R^d)), where d ∈ N, and (absolutely continuous) P and Q have densities, p(x) and q(x), x ∈ R^d, w.r.t. the Lebesgue measure µ, then (1) can be expressed as

D(P||Q) = ∫_{R^d} p(x) log(p(x)/q(x)) dx, (2)
where we write dx instead of µ(dx) to simplify notation. One formally sets 0/0 := 0, a/0 := ∞ if a > 0, log 0 := −∞, log ∞ := ∞ and 0 · log 0 := 0. Then log(p(x)/q(x)) is a measurable function with values in [−∞, ∞]. So, the right-hand sides of (1) and (2) coincide. Formula (2) is justified by Lemma A1, see Appendix A. Denote by S(f) := {x ∈ R^d : f(x) > 0} the support of a (version of) probability density f. The integral in (2) is taken over S(p) and does not depend on the choice of the versions of p and q.

The following two functionals are closely related to the Kullback–Leibler divergence. For probability measures P and Q on (R^d, B(R^d)) having densities p(x) and q(x), x ∈ R^d, w.r.t. the Lebesgue measure µ, one can introduce, according to [27], p. 35, the entropy H (also called the Shannon differential entropy) and the cross-entropy C as follows:

H(P) := −∫_{R^d} p(x) log p(x) dx,  C(P, Q) := −∫_{R^d} p(x) log q(x) dx.

In view of (2), D(P||Q) = C(P, Q) − H(P) whenever the right-hand side is well defined.

Usually one constructs statistical estimates of some characteristics of a stochastic model under consideration relying on a collection of observations. In the pioneering paper [28] an estimator of the Shannon differential entropy was proposed, based on nearest neighbor statistics. In a series of papers this estimate was studied and applied. Moreover, estimators of the Rényi entropy, mutual information and the Kullback–Leibler divergence have appeared (see, e.g., [29–31]). However, the authors of [32] indicated the occurrence of gaps in the known proofs concerning the limit behavior of such statistics. Almost all of these flaws refer to the lack of proved correctness of using the (reversed) Fatou lemma (see, e.g., [28], inequality after the statement (21), or [31], inequality (91)) or the generalized Helly–Bray lemma (see, e.g., [30], page 2171). One can find these lemmas in [33], p. 233, and [34], p. 187.
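To make the identity D(P||Q) = C(P, Q) − H(P) concrete, here is a small numerical sanity check (our illustration, not part of the paper) for two univariate Gaussian laws, with all integrals approximated by trapezoidal quadrature over a fine grid:

```python
import numpy as np

# Illustrative check of D(P||Q) = C(P, Q) - H(P) via (2),
# for P = N(0, 1) and Q = N(1, 4) on the real line.
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-20.0, 20.0, 400001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 2.0)

trap = lambda f: (np.sum(f) - 0.5 * (f[0] + f[-1])) * dx   # trapezoid rule
H = -trap(p * np.log(p))          # Shannon differential entropy H(P)
C = -trap(p * np.log(q))          # cross-entropy C(P, Q)
D = trap(p * np.log(p / q))       # Kullback-Leibler divergence, formula (2)

# Closed form for univariate Gaussians:
# log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
D_exact = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5
assert abs(D - (C - H)) < 1e-9
assert abs(D - D_exact) < 1e-6
```

The grid bounds are chosen so that both densities are numerically negligible outside the integration window.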
Paper [32] has attracted our attention and motivated the study of the declared asymptotic properties. Furthermore, we would like to highlight the important role of the papers [28,30–32]. Thus, in the recent work [35], new functionals were introduced to prove the asymptotic unbiasedness and L2-consistency of the Kozachenko–Leonenko estimators of the Shannon differential entropy. We used the criterion of uniform integrability, for different families of functions, to avoid employment of the Fatou lemma, since it is not clear whether one could indicate the due majorizing functions for those families. The present paper is aimed at extending our approach to grasp the Kullback–Leibler divergence estimation. Instead of the nearest neighbor statistics we employ the k-nearest neighbor statistics (on order statistics see, e.g., [36]) and also use more general forms of the mentioned functionals.

Note in passing that there exist investigations treating important aspects of the entropy, Kullback–Leibler divergence and mutual information estimation. The mixed models and conditional entropy estimation are studied, e.g., in [37,38]. The central limit theorem (CLT) for the Kozachenko–Leonenko estimates is established in [39]. In [40], a deep analysis of the efficiency of functional weighted estimates was performed (including the CLT). The limit theorems for point processes on manifolds are employed in [41] to analyze the behavior of the Shannon and the Rényi entropy estimates. The convergence rates for the Shannon entropy (truncated) estimates are obtained in [42] for the one-dimensional case; see also [43] for the multidimensional case. A kernel density plug-in estimator of various divergence functionals is studied in [25]. The principal assumptions of that paper are the following: the densities are smooth and have common bounded support S, they are strictly lower bounded on S, and, moreover, the set S is smooth with respect to the employed kernel.
Ensemble estimation of various divergence functionals is studied in [25]. Profound results for smooth bounded densities are established in the recent work [44]. The mutual information estimation by the local Gaussian approximation is developed in [45]. Note that various deep results (including the central limit theorem) were obtained for the Kullback–Leibler estimates under certain conditions imposed on the derivatives of the unknown densities (see, e.g., the recent papers [25,46]). In a series of papers the authors demand boundedness of densities to prove
L2-consistency for the Kozachenko–Leonenko estimates of the differential Shannon entropy (see, e.g., [47]).

Our goal is to provide wide conditions for the asymptotic unbiasedness and L2-consistency of the specified Kullback–Leibler divergence estimates without such smoothness and boundedness hypotheses. Furthermore, we do not assume that the densities have bounded supports. As a byproduct we obtain new results concerning the Shannon differential entropy and cross-entropy. We employ probabilistic and analytical techniques, namely, weak convergence of probability measures, conditional expectations, regular probability distributions, k-nearest neighbor statistics, probability inequalities, integration by parts in the Lebesgue–Stieltjes integral, analysis of integrals depending on certain parameters and taken over specified domains, the criterion of uniform integrability of various families of functions, and slowly varying functions.

The paper is organized as follows. In Section 2, we introduce some notation. In Section 3 we formulate the main results, i.e., Theorems 1 and 2. Their proofs are provided in Sections 4 and 5, respectively. Section 6 contains concluding remarks and perspectives of future research. Proofs of several lemmas are given in Appendix A.
2. Notation

Let X and Y be random vectors taking values in R^d and having distributions P_X and P_Y, respectively (below we will take P = P_X and Q = P_Y). Consider random vectors X_1, X_2, ... and Y_1, Y_2, ... with values in R^d such that law(X_i) = law(X) and law(Y_i) = law(Y), i ∈ N. Assume also that {X_i, Y_i, i ∈ N} are independent. We are interested in statistical estimation of D(P_X||P_Y) constructed by means of observations X_n := {X_1, ..., X_n} and Y_m := {Y_1, ..., Y_m}, n, m ∈ N. All the random variables under consideration are defined on a probability space (Ω, F, P), and each measure space is assumed complete.

For a finite set E = {z_1, ..., z_N} ⊂ R^d, where z_i ≠ z_j (i ≠ j), and a vector v ∈ R^d, renumber the points of E as z_(1)(v), ..., z_(N)(v) in such a way that ‖v − z_(1)(v)‖ ≤ ... ≤ ‖v − z_(N)(v)‖, where ‖·‖ is the Euclidean norm in R^d. If there are points z_{i_1}, ..., z_{i_s} having the same distance from v then we numerate them in increasing order of the indices i_1, ..., i_s. In other words, for k = 1, ..., N, z_(k)(v) is the k-nearest neighbor of v in the set E. To indicate that z_(k)(v) is constructed by means of E we write z_(k)(v, E). Fix k ∈ {1, ..., n − 1}, l ∈ {1, ..., m} and (for each ω ∈ Ω) put
R_{n,k}(i) := ‖X_i − X_(k)(X_i, X_n \ {X_i})‖,  V_{m,l}(i) := ‖X_i − Y_(l)(X_i, Y_m)‖,  i = 1, ..., n.
We assume that X and Y have densities p = dP_X/dµ and q = dP_Y/dµ. Then with probability one all the points in X_n are distinct, as are the points of Y_m. Following [31] (see Formula (17) there) introduce an estimate of D(P_X||P_Y):

D̃_{n,m}(K_n, L_n) := (1/n) ∑_{i=1}^n (ψ(k_i) − ψ(l_i)) + log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,l_i}(i)/R_{n,k_i}(i)), (3)
where ψ(t) = (d/dt) log Γ(t) = Γ'(t)/Γ(t) is the digamma function, t > 0, K_n := {k_i}_{i=1}^n, L_n := {l_i}_{i=1}^n are collections of integers and, for some r ∈ N and all i ∈ N, k_i ≤ r, l_i ≤ r. Note that (3) is well-defined for n ≥ max_{i=1,...,n} k_i + 1, m ≥ max_{i=1,...,n} l_i. If k_i = k and l_i = l, i = 1, ..., n, then, for n ≥ k + 1 and m ≥ l, we write

D̂_{n,m}(k, l) := ψ(k) − ψ(l) + log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,l}(i)/R_{n,k}(i)). (4)
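The estimate (4) is straightforward to implement. The following sketch (an assumed implementation for illustration; the paper itself contains no code) uses brute-force nearest-neighbor search and evaluates the digamma function at integer arguments via ψ(k) = −γ + ∑_{j=1}^{k−1} 1/j:

```python
import numpy as np

# A minimal sketch of the estimator D_hat_{n,m}(k, l) from (4).
GAMMA = 0.5772156649015329                 # Euler-Mascheroni constant

def psi_int(k):
    # digamma at a positive integer: psi(k) = -gamma + sum_{j=1}^{k-1} 1/j
    return -GAMMA + sum(1.0 / j for j in range(1, k))

def kl_knn_estimate(X, Y, k, l):
    n, d = X.shape
    m = Y.shape[0]
    # R_{n,k}(i): distance from X_i to its k-th nearest neighbor in X \ {X_i}
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = np.sort(dXX, axis=1)[:, k]         # column 0 is X_i itself (distance 0)
    # V_{m,l}(i): distance from X_i to its l-th nearest neighbor in Y_m
    dXY = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    V = np.sort(dXY, axis=1)[:, l - 1]
    return (psi_int(k) - psi_int(l) + np.log(m / (n - 1.0))
            + (d / n) * np.sum(np.log(V / R)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 1))   # sample from P = N(0, 1)
Y = rng.normal(1.0, 2.0, size=(2000, 1))   # sample from Q = N(1, 4)
est = kl_knn_estimate(X, Y, k=3, l=3)
# the true divergence here is log 2 + (1 + 1)/8 - 1/2 ≈ 0.443
```

For large samples a k-d tree would replace the quadratic-cost distance matrices; the brute-force version is kept only for transparency.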
If k = l then

D̂_{n,m}(k) = log(m/(n − 1)) + (d/n) ∑_{i=1}^n log(V_{m,k}(i)/R_{n,k}(i))
and we come to formula (5) in [31]. For an intuitive background of the proposed estimates one can address [31] (Introduction, Parts B and C).

We write B(x, r) := {y ∈ R^d : ‖x − y‖ ≤ r} for x ∈ R^d, r > 0, and V_d = µ(B(0, 1)) is the volume of the unit ball in R^d. Similar to (3), with the same notation and the same conditions on k_i and l_i, i = 1, ..., n, one can define the Kozachenko–Leonenko type estimates of H(P_X) and C(P_X, P_Y), respectively, by the formulas
H̃_n(K_n) := −(1/n) ∑_{i=1}^n ψ(k_i) + log V_d + log(n − 1) + (d/n) ∑_{i=1}^n log R_{n,k_i}(i), (5)
C̃_{n,m}(L_n) := −(1/n) ∑_{i=1}^n ψ(l_i) + log V_d + log m + (d/n) ∑_{i=1}^n log V_{m,l_i}(i). (6)
In [28], the estimate (5) was proposed for k_i = 1, i = 1, ..., n. If k_i = k, l_i = l, i = 1, ..., n, n ≥ k + 1 and m ≥ l, then one has
Ĥ_n(k) := (1/n) ∑_{i=1}^n log( V_d R_{n,k}^d(i) (n − 1) / e^{ψ(k)} ),  Ĉ_{n,m}(l) := (1/n) ∑_{i=1}^n log( V_d V_{m,l}^d(i) m / e^{ψ(l)} ). (7)
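As an illustration (ours, with arbitrarily chosen sample size and dimension), the estimate Ĥ_n(k) from (7) can be sketched as follows for a bivariate standard Gaussian sample, whose true entropy log(2πe) ≈ 2.838 is known in closed form:

```python
import math
import numpy as np

# Sketch of the entropy estimate H_hat_n(k) from (7):
# (1/n) * sum_i log( V_d * R_{n,k}(i)^d * (n - 1) / e^{psi(k)} ).
GAMMA = 0.5772156649015329
psi = lambda k: -GAMMA + sum(1.0 / j for j in range(1, k))  # digamma at integers

def h_knn_estimate(X, k):
    n, d = X.shape
    V_d = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)  # unit-ball volume in R^d
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = np.sort(dXX, axis=1)[:, k]                          # k-NN distance, self excluded
    return np.mean(np.log(V_d * R ** d * (n - 1)) - psi(k))

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))                              # sample from N(0, I_2)
est = h_knn_estimate(X, k=3)
# true value: H(N(0, I_2)) = log(2*pi*e) ≈ 2.838
```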
Remark 1. All our results are valid for statistics (3). To simplify notation we consider the estimates (4), since the study of D̃_{n,m}(K_n, L_n) follows the same lines. For the same reason, as in the case of the Kullback–Leibler divergence, we will only deal with (7), since (5) and (6) can be analyzed in quite the same way.
Some extra notation is necessary. As in [35], given a probability density f in R^d, we consider the following functions of x ∈ R^d, r > 0 and R > 0, that is, define integral functionals (depending on parameters)

I_f(x, r) := ( ∫_{B(x,r)} f(y) dy ) / (V_d r^d), (8)
M_f(x, R) := sup_{r∈(0,R]} I_f(x, r),  m_f(x, R) := inf_{r∈(0,R]} I_f(x, r). (9)

Some properties of the function ∫_{B(x,r)} f(y) dy are demonstrated in [48]. By virtue of Lemma 2.1 [35], for each probability density f, the function I_f(x, r) introduced above is continuous in (x, r) on R^d × (0, ∞). Hence, on account of Theorem 15.84 [49], the functions m_f(·, R) and M_f(·, R), for any R > 0, are upper semicontinuous and lower semicontinuous, respectively. Therefore, Borel measurability of these nonnegative functions ensues from Proposition 15.82 [49]. On the other hand, the function m_f(x, ·) is evidently nonincreasing whereas M_f(x, ·) is nondecreasing for each x ∈ R^d. Notably, changing sup_{r∈(0,R]} to sup_{r∈(0,∞)} transforms the function M_f(x, R) into the famous Hardy–Littlewood maximal function, well known in harmonic analysis.

Set e[−1] := 0 and e[N] := exp{e[N−1]}, N ∈ Z_+. Introduce a function log_[1](t) := log t, t > 0. For N ∈ N, N > 1, set log_[N](t) := log(log_[N−1](t)). Evidently, this function (for each N ∈ N) is defined if t > e[N−2]. For N ∈ N, consider the continuous nondecreasing function G_N : R_+ → R_+, given by the formula

G_N(t) := 0 for t ∈ [0, e[N−1]], and G_N(t) := t log_[N](t) for t ∈ (e[N−1], ∞). (10)
In other words, we employ a function of the form t r(t), where the function r(t), taken as N iterations of log t, is slowly varying for large t.
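The bookkeeping behind e[N], log_[N] and G_N in (10) can be sketched as follows (an illustrative helper of ours, not from the paper):

```python
import math

# e[-1] = 0, e[N] = exp(e[N-1]); log_[1] t = log t, log_[N] t = log(log_[N-1] t);
# G_N(t) = 0 on [0, e[N-1]] and G_N(t) = t * log_[N](t) for t > e[N-1].
def e_tower(N):
    t = 0.0                        # e[-1] = 0
    for _ in range(N + 1):         # N + 1 exponentiations give e[N]
        t = math.exp(t)
    return t

def log_iter(N, t):
    for _ in range(N):
        t = math.log(t)
    return t

def G(N, t):
    return 0.0 if t <= e_tower(N - 1) else t * log_iter(N, t)

assert e_tower(0) == 1.0 and e_tower(1) == math.e
assert abs(G(1, math.e ** 2) - 2.0 * math.e ** 2) < 1e-12
assert G(2, math.e) == 0.0         # e = e[1] lies inside the zero region for N = 2
```

Note that G_N is continuous at t = e[N−1], since log_[N](t) → 0 there.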
For probability densities p, q in R^d, N ∈ N and positive constants ν, t, ε, R, introduce the functionals taking values in [0, ∞]:

K_{p,q}(ν, N, t) := ∫_{x,y∈R^d, ‖x−y‖>t} G_N(|log‖x − y‖|^ν) p(x) q(y) dx dy, (11)

Q_{p,q}(ε, R) := ∫_{R^d} M_q^ε(x, R) p(x) dx, (12)

T_{p,q}(ε, R) := ∫_{R^d} m_q^{−ε}(x, R) p(x) dx. (13)
Set K_{p,q}(ν, N) := K_{p,q}(ν, N, e[N]).
Remark 2. We have stipulated that 1/0 := ∞ (thus m_q^{−ε}(x, R) := ∞ whenever m_q(x, R) = 0). One can write in (12) and (13) the integrals over the support S(p) instead of integrating over R^d, whatever versions of p and q are taken.
3. Main Results
Theorem 1. Let, for some positive ε, R and N ∈ N, the functionals K_{p,f}(1, N), Q_{p,f}(ε, R), T_{p,f}(ε, R) be finite for f = p and f = q. Then D(P_X||P_Y) < ∞ and
lim_{n,m→∞} E D̂_{n,m}(k, l) = D(P_X||P_Y). (14)
Consider three kinds of conditions (labeled A, B, C, possibly with indices, and involving parameters indicated in parentheses) on probability densities.

(A; p, f, ν) For probability densities p, f in R^d and some positive ν,

L_{p,f}(ν) := ∫_{R^d×R^d} |log‖x − y‖|^ν p(x) f(y) dx dy < ∞. (15)

As usual, ∫_A g(z) Q(dz) = 0 whenever g(z) = ∞ (or −∞) for z ∈ A and Q(A) = 0, where Q is a σ-finite measure on (R^d, B(R^d)). Condition (15) with ν > 1 is used, e.g., in [28,31,47].

(B1; f) A version of f is upper bounded by a positive number M(f) ∈ (0, ∞):

f(x) ≤ M(f), x ∈ R^d.

(C1; f) A version of f is lower bounded by a positive number m(f) ∈ (0, ∞):

f(x) ≥ m(f), x ∈ S(f).
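For condition (A; p, f, ν), the functional L_{p,f}(ν) in (15) is simply E|log‖X − Z‖|^ν with X ∼ p and Z ∼ f independent, so it can be probed by Monte Carlo. A small sketch (ours; the choice p = f = density of N(0, I_3) is an arbitrary example for which (15) holds for every ν > 0):

```python
import numpy as np

# Monte Carlo approximation of L_{p,f}(nu) = E |log ||X - Z|| |^nu from (15),
# with X ~ p and Z ~ f independent; here p = f = the N(0, I_3) density.
rng = np.random.default_rng(2)
d, n = 3, 200000
X = rng.normal(size=(n, d))
Z = rng.normal(size=(n, d))
r = np.linalg.norm(X - Z, axis=1)            # ||X - Z|| is a.s. positive
for nu in (1.0, 2.0, 3.0):
    L_hat = float(np.mean(np.abs(np.log(r)) ** nu))
    # for Gaussian densities these moments are finite for every nu > 0
    assert np.isfinite(L_hat) and L_hat < 50.0
```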
Corollary 1. Let, for some ν > 1, condition (A; p, f, ν) be satisfied for f = p and f = q. Then the statements of Theorem 1 are true, provided that (B1; f) and (C1; f) are both valid for f = p and f = q. Moreover, if the latter assumption involving (B1; f) and (C1; f) holds, then the conditions of Theorem 1 are satisfied whenever p and q have bounded supports.
Next we formulate conditions to guarantee L2-consistency of estimates (4).
Theorem 2. Let the requirement K_{p,f}(1, N) < ∞ in the conditions of Theorem 1 be replaced by K_{p,f}(2, N) < ∞, for f = p and f = q. Then D(P_X||P_Y) < ∞ and, for any fixed k, l ∈ N, the estimates D̂_{n,m}(k, l) are L2-consistent, i.e.,

lim_{n,m→∞} E( D̂_{n,m}(k, l) − D(P_X||P_Y) )^2 = 0. (16)
Corollary 2. For some ν > 2, let condition (A; p, f, ν) be satisfied for f = p and f = q. Assume that (B1; f) and (C1; f) are both valid for f = p and f = q. Then the statements of Theorem 2 are true. Moreover, if the latter assumption involving (B1; f) and (C1; f) holds, then the conditions of Theorem 2 are satisfied whenever p and q have bounded supports.
Currently we dwell on a modification of condition (C1; f) introduced in [35] that allows us to work with densities that need not have bounded supports.

(C2; f) There exist a version of density f and R > 0 such that, for some c > 0,

m_f(x, R) ≥ c f(x), x ∈ R^d.
Remark 3. If, for some positive ε, R and c, condition (C2; q) is true and

∫_{R^d} q(x)^{−ε} p(x) dx < ∞, (17)
then T_{p,q}(ε, R) is finite. Hence we could apply, for f = p and f = q in Theorems 1 and 2, condition (C2; f) and presume, for some ε > 0, the validity of (17) and the finiteness of ∫_{R^d} p^{1−ε}(x) dx instead of the corresponding assumptions T_{p,q}(ε, R) < ∞ and T_{p,p}(ε, R) < ∞. An illustrative example to this point is provided by a density having unbounded support.
Corollary 3. Let X, Y be Gaussian random vectors in R^d with EX = µ_X, EY = µ_Y and nondegenerate covariance matrices Σ_X and Σ_Y, respectively. Then relations (14) and (16) hold, where

D(P_X||P_Y) = (1/2) [ tr(Σ_Y^{−1} Σ_X) + (µ_Y − µ_X)^T Σ_Y^{−1} (µ_Y − µ_X) − d + log(det Σ_Y / det Σ_X) ].
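The closed-form expression of Corollary 3 is easy to evaluate numerically; the sketch below (our illustration) also checks the elementary facts that D(P||P) = 0 and that the formula reduces to the familiar univariate expression log(σ_2/σ_1) + (σ_1² + (µ_1 − µ_2)²)/(2σ_2²) − 1/2:

```python
import numpy as np

def gaussian_kl(mu_x, S_x, mu_y, S_y):
    """Closed-form D(P_X||P_Y) for nondegenerate Gaussian laws (Corollary 3)."""
    mu_x, S_x = np.asarray(mu_x, float), np.asarray(S_x, float)
    mu_y, S_y = np.asarray(mu_y, float), np.asarray(S_y, float)
    d = mu_x.size
    S_y_inv = np.linalg.inv(S_y)
    diff = mu_y - mu_x
    return 0.5 * (np.trace(S_y_inv @ S_x) + diff @ S_y_inv @ diff - d
                  + np.log(np.linalg.det(S_y) / np.linalg.det(S_x)))

I2 = np.eye(2)
assert abs(gaussian_kl([0, 0], I2, [0, 0], I2)) < 1e-12     # D(P||P) = 0
val = gaussian_kl([0.0], [[1.0]], [1.0], [[4.0]])           # N(0,1) vs N(1,4)
assert abs(val - (np.log(2.0) + 2.0 / 8.0 - 0.5)) < 1e-12   # univariate formula
```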
The latter formula can be found, e.g., in [2], p. 147, Example 6.3. The proof of Corollary 3 is discussed in Appendix A. Similarly to condition (C2; f), let us consider the following one.

(B2; f) There exist a version of density f and R > 0 such that, for some C > 0,
M_f(x, R) ≤ C f(x), x ∈ S(f).
Remark 4. If, for some positive ε, R and C, condition (B2; q) is true and

∫_{R^d} q(x)^ε p(x) dx < ∞, (18)
then obviously Q_{p,q}(ε, R) < ∞. Thus, in Theorems 1 and 2 one can employ, for f = p and f = q, condition (B2; f) and exploit, for some ε > 0, the validity of (18) and the finiteness of ∫_{R^d} p^{1+ε}(x) dx instead of the assumptions Q_{p,q}(ε, R) < ∞ and Q_{p,p}(ε, R) < ∞, respectively.
Remark 5. D. Evans applied the "positive density condition" in Definition 2.1 of [48], assuming the existence of constants β > 1 and δ > 0 such that r^d/β ≤ ∫_{B(x,r)} q(y) dy ≤ β r^d for all 0 ≤ r ≤ δ and x ∈ R^d. Consequently, m_q(x, δ) ≥ 1/(β V_d) =: m > 0, x ∈ R^d. Then T_{p,q}(ε, δ) ≤ m^{−ε} ∫_{R^d} p(x) dx = m^{−ε} < ∞ for all ε > 0. Analogously, M_q(x, δ) ≤ β/V_d =: M, M > 0, x ∈ R^d, and Q_{p,q}(ε, δ) ≤ M^ε ∫_{R^d} p(x) dx = M^ε < ∞ for all ε > 0. The above mentioned inequalities from Definition 2.1 of [48] are valid, provided that the density f is smooth and its support in R^d is a convex closed body; see the proof in [50]. Therefore, if p and q are smooth and their supports are compact convex bodies in R^d, the relations (14) and (16) are valid.
Moreover, as a byproduct of Theorems 1 and 2, we obtain new results indicating both the asymptotic unbiasedness and L2-consistency of the estimates (7) for the Shannon differential entropy and cross-entropy.
Theorem 3. Let Q_{p,q}(ε, R) < ∞ and T_{p,q}(ε, R) < ∞ for some positive ε and R. Then C(P_X, P_Y) is finite and the following statements hold for any fixed l ∈ N.
(1) If, for some N ∈ N, K_{p,q}(1, N) < ∞, then E Ĉ_{n,m}(l) → C(P_X, P_Y), n, m → ∞.
(2) If, for some N ∈ N, K_{p,q}(2, N) < ∞, then E( Ĉ_{n,m}(l) − C(P_X, P_Y) )^2 → 0, n, m → ∞.
In particular, one can employ L_{p,q}(ν) with ν > 1 instead of K_{p,q}(1, N), and with ν > 2 instead of K_{p,q}(2, N), where N ∈ N.
The first claim of this theorem follows from the proof of Theorem 1. In a similar way one can infer the second statement from the proof of Theorem 2. If we take q = p in the conditions of Theorem 3 then we get the statement concerning the entropy, since C(P_X, P_X) = H(P_X).

Now we consider the case when p and q are mixtures of some probability densities. Namely,

p(x) := ∑_{i=1}^I a_i p_i(x),  q(x) := ∑_{j=1}^J b_j q_j(x), (19)
where p_i(x), q_j(x) are probability densities (w.r.t. the Lebesgue measure µ) and the positive weights a_i, b_j are such that ∑_{i=1}^I a_i = 1, ∑_{j=1}^J b_j = 1, i = 1, ..., I, j = 1, ..., J, x ∈ R^d. Some applications of models described by mixtures are treated, e.g., in [51].
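Since mixtures such as (19) admit no closed-form Kullback–Leibler divergence in general, a reference value of D(P_X||P_Y) has to be computed numerically via (2). A quadrature sketch (ours, with arbitrarily chosen mixture parameters) for two univariate Gaussian mixtures:

```python
import numpy as np

# Numerical evaluation of D(P||Q) via (2) for two 1-d Gaussian mixtures
# of the form (19); a Riemann sum over a wide grid serves as quadrature.
def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
p = 0.3 * normal_pdf(x, -2.0, 1.0) + 0.7 * normal_pdf(x, 2.0, 1.0)  # sum a_i p_i
q = 0.5 * normal_pdf(x, 0.0, 2.0) + 0.5 * normal_pdf(x, 3.0, 1.0)   # sum b_j q_j
D = np.sum(p * np.log(p / q)) * dx
assert np.isfinite(D) and D > 0.0      # KL divergence is nonnegative
```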
Corollary 4. Let random vectors X and Y have densities of the form (19). Assume that, for some positive ε, R and N ∈ N, the functionals K_{f,g}(1, N), Q_{f,g}(ε, R), T_{f,g}(ε, R) are finite whenever f ∈ {p_1, ..., p_I} and g ∈ {p_1, ..., p_I, q_1, ..., q_J}. Then D(P_X||P_Y) < ∞ and, for any fixed k, l ∈ N, (14) holds. Moreover, if the requirement K_{f,g}(1, N) < ∞ is replaced by K_{f,g}(2, N) < ∞, then (16) is true.
The proof of this corollary is given in Appendix A. Thus, due to Corollaries 3 and 4, one can guarantee the validity of (14) and (16) for any mixtures of nondegenerate Gaussian densities. Note also that in a similar way we can claim the asymptotic unbiasedness and L2-consistency of the estimates (7) for mixtures satisfying the conditions of Corollary 4.
Remark 6. Let us compare our new results with those established in [35]. Developing the approach of [35] to the analysis of the asymptotic behavior of the Kozachenko–Leonenko estimates of the Shannon differential entropy, we encounter new complications due to dealing with k-nearest neighbor statistics for k ∈ N (not only for k = 1). Accordingly, in the framework of the Kullback–Leibler divergence estimation, we propose a new way to bound the function 1 − F_{m,l,x}(u) playing the key role in the proofs (see Formula (28)). Furthermore, instead of the function G(t) = t log t (for t > 1), used in [35] for the Shannon entropy estimates, we employ a regularly varying function G_N(t) = t log_[N](t), where (for t large enough) log_[N](t) is the N-fold iteration of the logarithmic function and N ∈ N can be large. Hence, in the definition of the integral functional K_{p,q}(ν, N, t) by formula (11), one can take a function G_N(z) having, for z > 0, a growth rate close to that of the function z. Moreover, this entails a generalization of the results of [35]. Now we invoke the convexity of G_N (see Lemma 6) to provide more general conditions for the asymptotic unbiasedness and L2-consistency of the Shannon differential entropy estimates as opposed to [35].
4. Proof of Theorem 1

Note that the general structure of this proof, as well as that of Theorem 2, is similar to the one originally proposed in [28] and later used in various papers (see, e.g., [30,31,47]). Nevertheless, in order to prove both theorems correctly we employ new ideas and conditions (such as uniform integrability of a family of random variables) in our reasoning.
Remark 7. In the proof, for certain random variables α, α_1, α_2, ... (depending on some parameters), we will demonstrate that Eα_n → Eα, as n → ∞ (and that all these expectations are finite). To this end, for a fixed R^d-valued random vector τ and each x ∈ A, where A is a specified subset of R^d, we will prove that

E(α_n|τ = x) → E(α|τ = x), n → ∞. (20)
It turns out that E(α_n|τ = x) = E(α_{n,x}) and E(α|τ = x) = Eα_x, where the auxiliary random variables α_{n,x} and α_x can be constructed explicitly for each x ∈ R^d. Moreover, it is possible to show that, for each x ∈ A, one has α_{n,x} → α_x in law, n → ∞. Thus, to prove (20) the Fatou lemma is not used, as it is not evident whether there exists a random variable majorizing those under consideration. Instead we verify, for each x ∈ A, the uniform integrability (w.r.t. the measure P) of a family (α_{n,x})_{n≥n_0(x)}. Here we employ the necessary and sufficient conditions of uniform integrability provided by the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 in [52]). After that, to prove the desired relation Eα_n → Eα, n → ∞, we have a new task. Namely, we check the uniform integrability of the family (E(α_n|τ = x))_{n≥k_0}, where x ∈ A, w.r.t. the measure P_τ, i.e., the law of τ, and k_0 does not depend on x. Then we can prove that

Eα_n = ∫_A E(α_n|τ = x) P_τ(dx) → ∫_A E(α|τ = x) P_τ(dx) = Eα, n → ∞.

Further we will explain a number of nontrivial details concerning the proofs of uniform integrability of various families, the choice of the mentioned random variables (vectors), the set A, n_0(x) and k_0.
The first auxiliary result explains why, without loss of generality (w.l.g.), we can consider the same parameters ε, R, N for different functionals in the conditions of Theorems 1 and 2.
Lemma 1. Let p and q be any probability densities in R^d. Then the following statements are valid.

(1) If K_{p,q}(ν_0, N_0) < ∞ for some ν_0 > 0 and N_0 ∈ N, then K_{p,q}(ν, N) < ∞ for any ν ∈ (0, ν_0] and each N ≥ N_0.
(2) If Q_{p,q}(ε_1, R_1) < ∞ for some ε_1 > 0 and R_1 > 0, then Q_{p,q}(ε, R) < ∞ for any ε ∈ (0, ε_1] and each R > 0.
(3) If T_{p,q}(ε_2, R_2) < ∞ for some ε_2 > 0 and R_2 > 0, then T_{p,q}(ε, R) < ∞ for any ε ∈ (0, ε_2] and each R > 0.

In particular, one can take q = p and the statements of Lemma 1 still remain valid. The proof of Lemma 1 is given in Appendix A.
Remark 8. The results of Lemma 1 allow us to ensure (14) by demanding the finiteness of the functionals K_{p,q}(1, N_1), Q_{p,q}(ε_1, R_1), T_{p,q}(ε_2, R_2), K_{p,p}(1, N_2), Q_{p,p}(ε_3, R_3), T_{p,p}(ε_4, R_4), for some ε_i > 0, R_i > 0 and N_j ∈ N, where i = 1, 2, 3, 4 and j = 1, 2. Moreover, if we assume the finiteness of K_{p,q}(2, N_3) and K_{p,p}(2, N_4), for some N_3 ∈ N, N_4 ∈ N, instead of the finiteness of K_{p,q}(1, N_1) and K_{p,p}(1, N_2), then (16) holds.
According to Remark 2.4 of [35], if, for some positive ε, R, the integrals Q_{p,q}(ε, R), T_{p,q}(ε, R), Q_{p,p}(ε, R), T_{p,p}(ε, R) are finite, then

∫_{R^d} p(x)|log q(x)| dx < ∞,  ∫_{R^d} p(x)|log p(x)| dx < ∞. (21)
Therefore D(p||q) < ∞ (and thus P_X ≪ P_Y in view of Lemma A1). For n ∈ N such that n > 1, for fixed k, l ∈ N and m ∈ N, where 1 ≤ k ≤ n − 1, 1 ≤ l ≤ m, and i = 1, ..., n, set φ_{m,l}(i) := m V_{m,l}^d(i), ζ_{n,k}(i) := (n − 1) R_{n,k}^d(i). Then we can rewrite the estimate D̂_{n,m}(k, l) as follows:
D̂_{n,m}(k, l) = ψ(k) − ψ(l) + (1/n) ∑_{i=1}^n ( log φ_{m,l}(i) − log ζ_{n,k}(i) ). (22)
It is sufficient to prove the following two assertions.
Statement 1. For each fixed l, all m large enough and any i ∈ N, E|log φ_{m,l}(i)| is finite.
Moreover,

(1/n) ∑_{i=1}^n E log φ_{m,l}(i) = E log φ_{m,l}(1) → ψ(l) − log V_d − ∫_{R^d} p(x) log q(x) dx, m → ∞. (23)
Statement 2. For each fixed k, all n large enough and any i ∈ N, E|log ζ_{n,k}(i)| is finite. Moreover,

(1/n) ∑_{i=1}^n E log ζ_{n,k}(i) = E log ζ_{n,k}(1) → ψ(k) − log V_d − ∫_{R^d} p(x) log p(x) dx, n → ∞. (24)
Then, in view of (2) and (21)–(24),

E D̂_{n,m}(k, l) → −∫_{R^d} p(x) log q(x) dx + ∫_{R^d} p(x) log p(x) dx = D(P_X||P_Y), n, m → ∞,

and Theorem 1 will be proved.

Recall that, as explained in [35], for a nonnegative random variable V (thus 0 ≤ EV ≤ ∞) and any R^d-valued random vector X, one has

EV = ∫_{R^d} E(V|X = x) P_X(dx). (25)

This signifies that both sides of (25) coincide, being finite or infinite simultaneously. Let F(u, ω) be a regular conditional distribution function of a nonnegative random variable U given X, where u ∈ R and ω ∈ Ω. Let h be a measurable function such that h : R → [0, ∞). It was also explained in [35] that, for P_X-almost all x ∈ R^d, it follows (without assuming Eh(U) < ∞) that

E(h(U)|X = x) = ∫_{[0,∞)} h(u) dF(u, x). (26)

This means that both sides of (26) are finite or infinite simultaneously and coincide. By virtue of (25) and (26) one can establish that E|log φ_{m,l}(i)| < ∞, for all m large enough, fixed l and all i, and that (23) holds. To perform this, take U = φ_{m,l}(i), X = X_i, h(u) = |log u|, u > 0 (we use h(u) = log² u in the proof of Theorem 2) and V = h(U). If h : R → R and E|h(U)| < ∞ then (26) is true as well. To avoid increasing the volume of this paper we will only examine the evaluation of E log φ_{m,l}(i), as all the steps of the proof are the same when treating E|log φ_{m,l}(i)|.

The proof of Statement 1 is partitioned into four steps. The first three demonstrate that there is a measurable A ⊂ S(p), depending on the versions of p and q, such that P_X(S(p) \ A) = 0 and, for any x ∈ A, i ∈ N, the following relation holds:
E(log φ_{m,l}(i)|X_i = x) = E(log φ_{m,l}(1)|X_1 = x) → ψ(l) − log V_d − log q(x), m → ∞. (27)
The last Step 4 justifies the desired result (23). Finally, Step 5 validates Statement 2.

Step 1. Here we establish the convergence in distribution of certain auxiliary random variables. Fix any i ∈ N and l ∈ {1, ..., m}. To simplify notation we do not indicate the dependence of functions on d. For x ∈ R^d and u ≥ 0, we identify the asymptotic behavior (as m → ∞) of the function
F^i_{m,l,x}(u) := P(φ_{m,l}(i) ≤ u | X_i = x) = P( m V_{m,l}^d(i) ≤ u | X_i = x )

= 1 − P( V_{m,l}(i) > (u/m)^{1/d} | X_i = x ) = 1 − P( ‖x − Y_(l)(x, Y_m)‖ > (u/m)^{1/d} ) (28)

= 1 − ∑_{s=0}^{l−1} \binom{m}{s} (W_{m,x}(u))^s (1 − W_{m,x}(u))^{m−s} =: P(ξ_{m,l,x} ≤ u),
where

W_{m,x}(u) := ∫_{B(x, r_m(u))} q(z) dz,  r_m(u) := (u/m)^{1/d},  ξ_{m,l,x} := m ‖x − Y_(l)(x, Y_m)‖^d. (29)
We take into account in (28) that the random vectors Y_1, ..., Y_m, X_i are independent and that Y_1, ..., Y_m have the same law as Y. We also note that the event {‖x − Y_(l)(x, Y_m)‖ > r_m(u)} is a union of pairwise disjoint events A_s, s = 0, ..., l − 1. Here A_s means that exactly s observations among Y_m belong to the ball B(x, r_m(u)) and the other m − s are outside this ball (the probability that Y belongs to the sphere {z ∈ R^d : ‖z − x‖ = r} equals 0 since Y has a density w.r.t. the Lebesgue measure µ). Formulas (28) and (29) show that F^i_{m,l,x}(u) is the regular conditional distribution function of φ_{m,l}(i) given X_i = x. Moreover, (28) means that φ_{m,l}(i), i ∈ {1, ..., n}, are identically distributed and we may omit the dependence on i. So, one can write F_{m,l,x}(u) instead of F^i_{m,l,x}(u). According to the Lebesgue differentiation theorem (see, e.g., [49], p. 654), if q ∈ L^1(R^d), then, for µ-almost all x ∈ R^d, one has

lim_{r→0+} (1/µ(B(x, r))) ∫_{B(x,r)} |q(z) − q(x)| dz = 0. (30)
Let Λ(q) denote the set of Lebesgue points of a function q, namely the points in R^d satisfying (30). Evidently, it depends on the choice of the version within the class of functions in L^1(R^d) equivalent to q, and, for an arbitrary version of q, we have µ(R^d \ Λ(q)) = 0. Clearly, for each u ≥ 0, r_m(u) → 0 as m → ∞, and µ(B(x, r_m(u))) = V_d r_m^d(u) = V_d u/m. Therefore, by virtue of (30), for any fixed x ∈ Λ(q) and u ≥ 0,

W_{m,x}(u) = (V_d u / m)(q(x) + α_m(x, u)),

where α_m(x, u) → 0, m → ∞. Hence, for x ∈ Λ(q) ∩ S(q) (thus q(x) > 0), due to (28),

F_{m,l,x}(u) → 1 − ∑_{s=0}^{l−1} ((V_d q(x) u)^s / s!) e^{−V_d q(x) u} =: F_{l,x}(u), m → ∞. (31)
Relation (31) means that

ξ_{m,l,x} → ξ_{l,x} in law, x ∈ Λ(q) ∩ S(q), m → ∞,

where ξ_{l,x} has the Gamma distribution Γ(α, λ) with parameters α = V_d q(x) and λ = l. For any x ∈ S(q), one can assume w.l.g. that the random variables ξ_{l,x} and {ξ_{m,l,x}}_{m≥l} are defined on a probability space (Ω, F, P). Indeed, by the Lomnicki–Ulam theorem (see, e.g., [53], p. 93), independent copies of Y_1, Y_2, ... and {ξ_{l,x}}_{x∈S(q)} exist on a certain probability space. The convergence in distribution of random variables survives under a continuous mapping. Thus, for any x ∈ Λ(q) ∩ S(q), we see that

log ξ_{m,l,x} → log ξ_{l,x} in law, m → ∞. (32)
We have employed that ξ_{l,x} > 0 a.s. for each x ∈ Λ(q) ∩ S(q) and that Y has a density, so it follows that P(ξ_{m,l,x} > 0) = P(‖x − Y_(l)(x, Y_m)‖ > 0) = 1. More precisely, we take strictly positive versions of ξ_{l,x} and ξ_{m,l,x} for each x ∈ Λ(q) ∩ S(q).

Step 2. Now we show that, instead of the validity of (27), one can verify the following assertion: for µ-almost every x ∈ Λ(q) ∩ S(q),

E log ξ_{m,l,x} → E log ξ_{l,x}, m → ∞. (33)
Note that if η ∼ Γ(α, λ), where α > 0 and λ > 0, then E log η = ψ(λ) − log α, where ψ is the digamma function. Set α = V_d q(x) for x ∈ S(q) (then α > 0) and λ = l. Hence

E log ξ_{l,x} = ψ(l) − log(V_d q(x)) = ψ(l) − log V_d − log q(x).

By virtue of (26), for each x ∈ R^d,
E log ξ_{m,l,x} = ∫_{(0,∞)} log u dF_{m,l,x}(u) = ∫_{(0,∞)} log u dP(φ_{m,l}(1) ≤ u | X_1 = x) = E(log φ_{m,l}(1) | X_1 = x).
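The identity E log η = ψ(λ) − log α used above (η ∼ Γ(α, λ) with rate α and shape λ) is easy to check numerically. The sketch below is not from the paper; it approximates the digamma function by a central difference of `math.lgamma` and uses the stdlib Gamma sampler.

```python
import math, random

def digamma(x, h=1e-5):
    # Central-difference approximation of psi(x) = d/dx log Gamma(x).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

random.seed(0)
alpha_rate, lam = 3.0, 2.0   # rate alpha, shape lambda (illustrative values)
n = 200_000
# random.gammavariate(shape, scale); rate alpha corresponds to scale 1/alpha.
mc = sum(math.log(random.gammavariate(lam, 1.0 / alpha_rate)) for _ in range(n)) / n
exact = digamma(lam) - math.log(alpha_rate)
print(mc, exact)  # the two values should agree to about 1e-2
```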
Hence, for x ∈ Λ(q) ∩ S(q), the relation E(log φ_{m,l}(1) | X_1 = x) → ψ(l) − log V_d − log q(x) holds if and only if (33) is true. According to Theorem 3.5 of [54], we would have established (33) if relation (32) could be supplemented, for µ-almost all x ∈ Λ(q) ∩ S(q), by the condition of uniform integrability of the family {log ξ_{m,l,x}}_{m≥m_0(x)}. Note that, for each N ∈ N, the function G_N(t) introduced by (10) is nondecreasing on (0, ∞) and G_N(t)/t → ∞ as t → ∞. By the de la Vallée–Poussin theorem (see, e.g., Theorem 1.3.4 of [52]), to ensure, for µ-almost every x ∈ Λ(q) ∩ S(q), the uniform integrability of {log ξ_{m,l,x}}_{m≥m_0(x)}, it suffices to prove the following statement: for the indicated x, some positive C_0(x) and m_0(x) ∈ N, one has
sup_{m≥m_0(x)} E G_N(|log ξ_{m,l,x}|) ≤ C_0(x) < ∞,   (34)
where G_N appears in the conditions of Theorem 1. Moreover, as we will show further, it is possible to find m_0 ∈ N that does not depend on x ∈ R^d.
Step 3. This step is devoted to proving the validity of (34). It is convenient to divide it into parts (3a), (3b), etc. For any N ∈ N, set

g_N(t) :=
−(1/t)(log_[N](−log t) + 1)/∏_{j=1}^{N−1} log_[j](−log t),  t ∈ (0, 1/e_[N]],
0,  t ∈ (1/e_[N], e_[N]],
(1/t)(log_[N](log t) + 1)/∏_{j=1}^{N−1} log_[j](log t),  t ∈ (e_[N], ∞),

where the product over an empty set (when N = 1) is equal to 1. The proof of the following result is placed in Appendix A.
Lemma 2. Let F(u), u ∈ R, be a distribution function such that F(0) = 0. Then, for each N ∈ N, one has

(1) ∫_{(0, 1/e_[N])} G_N(|log u|) dF(u) = ∫_{(0, 1/e_[N])} F(u)(−g_N(u)) du,

(2) ∫_{(e_[N], ∞)} G_N(|log u|) dF(u) = ∫_{(e_[N], ∞)} (1 − F(u)) g_N(u) du.
Fix N appearing in the conditions of Theorem 1. Observe that, for u ∈ [1/e_[N], e_[N]], one has G_N(|log u|) = 0. Therefore, according to Lemma 2, for x ∈ Λ(q) ∩ S(q) and m ≥ l, we get E G_N(|log ξ_{m,l,x}|) = I_1(m, x) + I_2(m, x), where

I_1(m, x) := ∫_{(0, 1/e_[N])} F_{m,l,x}(u)(−g_N(u)) du,  I_2(m, x) := ∫_{(e_[N], ∞)} (1 − F_{m,l,x}(u)) g_N(u) du.
For convenience we write I_1(m, x) and I_2(m, x) without indicating their dependence on N, l and d (these parameters are fixed).
Part (3a). We provide bounds for I_1(m, x). Take R > 0 appearing in the conditions of Theorem 1 and any u ∈ (0, 1/e_[N]]. Introduce m_1 := max{⌈1/(e_[N] R^d)⌉, l}, where, for a ∈ R, ⌈a⌉ := inf{m ∈ Z : m ≥ a}. Then r_m(u) = (u/m)^{1/d} ≤ (1/(e_[N] m))^{1/d} ≤ R if m ≥ m_1. Note also that we can consider only m ≥ l everywhere below, because the size of the sample Y_m is not less than the number of neighbors l (see, e.g., (28)). Thus, for R > 0, u ∈ (0, 1/e_[N]], x ∈ R^d and m ≥ m_1,

W_{m,x}(u)/µ(B(x, r_m(u))) = (∫_{B(x,r_m(u))} q(y) dy)/(r_m(u)^d V_d) ≤ sup_{r∈(0,R]} (∫_{B(x,r)} q(y) dy)/(r^d V_d) = M_q(x, R),
and we arrive at the inequality

W_{m,x}(u) ≤ M_q(x, R) µ(B(x, r_m(u))) = M_q(x, R) V_d u / m.   (35)

If γ ∈ (0, 1] and t ∈ [0, 1] then, for all m ≥ 1, invoking the Bernoulli inequality, one has

1 − (1 − t)^m ≤ (mt)^γ.   (36)

Recall that we assume Q_{p,q}(ε, R) < ∞ for some ε > 0, R > 0. By virtue of Lemma 1 one can take ε < 1. So, due to (36) and since W_{m,x}(u) ∈ [0, 1] for all x ∈ R^d, u > 0 and m ≥ l, we get

1 − (1 − W_{m,x}(u))^m ≤ (m W_{m,x}(u))^ε.   (37)

Thus, in view of (28), (35) and (37), we have established that, for all x ∈ Λ(q) ∩ S(q), u ∈ (0, 1/e_[N]] and m ≥ m_1,

F_{m,l,x}(u) = 1 − ∑_{s=0}^{l−1} C(m, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{m−s} ≤ 1 − (1 − W_{m,x}(u))^m ≤ (m M_q(x, R) V_d u / m)^ε = (M_q(x, R))^ε V_d^ε u^ε,   (38)

where C(m, s) denotes the binomial coefficient.
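Inequality (36) follows since 1 − (1 − t)^m ≤ min(1, mt) ≤ (mt)^γ for γ ∈ (0, 1]. A brute-force grid check (illustrative only, not from the paper):

```python
# Grid check of 1 - (1 - t)**m <= (m*t)**gamma for gamma in (0, 1], t in [0, 1].
# A tiny epsilon absorbs floating-point rounding.
ok = all(
    1 - (1 - t) ** m <= (m * t) ** gamma + 1e-12
    for gamma in (0.1, 0.5, 1.0)
    for m in (1, 2, 10, 100)
    for t in [i / 100 for i in range(101)]
)
print(ok)  # True
```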
Therefore, for any x ∈ Λ(q) ∩ S(q) and m ≥ m_1, one can write

I_1(m, x) ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/e_[N])} u^ε (−g_N(u)) du ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/e_[N])} (log_[N](−log u) + 1) u^{ε−1} du = U_1(ε, N, d)(M_q(x, R))^ε,   (39)

where U_1(ε, N, d) := V_d^ε L_N(ε) and L_N(ε) := ∫_{[e_[N−1],∞)} (log_[N](t) + 1) e^{−εt} dt < ∞. We took into account that (−g_N(u)) ≤ (1/u)(log_[N](−log u) + 1) whenever u ∈ (0, 1/e_[N]].
Part (3b). We give bounds for I_2(m, x). Since g_N(u) ≤ (log_[N+1](u) + 1)/u if u ∈ (e_[N], ∞), we can write, for m ≥ max{e_[N]², l},
I_2(m, x) ≤ ∫_{(e_[N], √m]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du + ∫_{(√m, m²]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du + ∫_{(m², ∞)} (1 − F_{m,l,x}(u)) g_N(u) du := J_1(m, x) + J_2(m, x) + J_3(m, x).
Evidently,

1 − F_{m,l,x}(u) = ∑_{r=m−l+1}^{m} C(m, r)(P_{m,x}(u))^r (1 − P_{m,x}(u))^{m−r} = P(Z ≥ m − l + 1),

where P_{m,x}(u) := 1 − W_{m,x}(u) and Z ∼ Bin(m, P_{m,x}(u)). By Markov's inequality, P(Z ≥ t) ≤ e^{−λt} E e^{λZ} for any λ > 0 and t > 0. One has

E e^{λZ} = ∑_{j=0}^{m} e^{λj} C(m, j)(P_{m,x}(u))^j (1 − P_{m,x}(u))^{m−j} = (1 − P_{m,x}(u) + e^λ P_{m,x}(u))^m.

Consequently, for each λ > 0,

1 − F_{m,l,x}(u) ≤ e^{−λ(m−l+1)} (1 − P_{m,x}(u) + e^λ P_{m,x}(u))^m = e^{−λ(m−l+1)} (W_{m,x}(u) + e^λ (1 − W_{m,x}(u)))^m = e^{λ(l−1)} (1 − (1 − e^{−λ}) W_{m,x}(u))^m.   (40)
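The Markov (Chernoff-type) step above can be checked numerically by comparing an exact binomial tail with e^{−λt}(1 − p + e^λ p)^m; the parameters below are hypothetical, chosen only for illustration.

```python
import math

def binom_tail(m, p, t):
    """Exact P(Z >= t) for Z ~ Bin(m, p)."""
    return sum(math.comb(m, r) * p**r * (1 - p) ** (m - r) for r in range(t, m + 1))

def chernoff(m, p, t, lam):
    """Bound e^{-lam*t} * (1 - p + e^{lam} * p)^m obtained from Markov's inequality."""
    return math.exp(-lam * t) * (1 - p + math.exp(lam) * p) ** m

m, p, l = 50, 0.9, 3
t = m - l + 1
exact = binom_tail(m, p, t)
for lam in (0.5, 1.0, 2.0):
    assert exact <= chernoff(m, p, t, lam)
print(exact <= chernoff(m, p, t, 1.0))  # True
```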
To simplify the bounds we take λ = 1 and set S_1 = S_1(l) := e^{l−1}, S_2 := 1 − 1/e (recall that l is fixed). Thus, S_1 ≥ 1 and S_2 < 1. Therefore,

1 − F_{m,l,x}(u) ≤ S_1 (1 − S_2 W_{m,x}(u))^m ≤ S_1 exp{−S_2 m W_{m,x}(u)},   (41)

where we have used the simple inequality 1 − t ≤ e^{−t}, t ∈ [0, 1]. For R > 0 appearing in the conditions of the Theorem and any u ∈ (e_[N], √m], one can choose m_2 := max{⌈1/R^{2d}⌉, ⌈e_[N]²⌉, l} such that if m ≥ m_2 then r_m(u) = (u/m)^{1/d} ≤ (1/√m)^{1/d} ≤ R. Due to (29) and (41), for u ∈ (e_[N], √m] and m ≥ m_2, one has

1 − F_{m,l,x}(u) ≤ S_1 exp{−S_2 V_d u W_{m,x}(u)/(V_d u/m)} = S_1 exp{−S_2 V_d u (∫_{B(x,r_m(u))} q(z) dz)/µ(B(x, r_m(u)))} ≤ S_1 exp{−S_2 V_d u m_q(x, R)},   (42)
by definition of m_f (for f = q) in (9). Now we use the following Lemma 3.2 of [35].
Lemma 3. For a version of a density q and each R > 0, one has µ(S(q) \ Dq(R)) = 0 where Dq(R) := {x ∈ S(q) : mq(x, R) > 0} and mq(·, R) is defined according to (9).
It is easily seen that, for any t > 0 and each δ ∈ (0, e], one has e^{−t} ≤ t^{−δ}. Thus, for x ∈ D_q(R), m ≥ m_2, u ∈ (e_[N], √m] and ε > 0, we deduce from the conditions of the Theorem (in view of Lemma 1 one can suppose that ε ∈ (0, e]) that

1 − F_{m,l,x}(u) ≤ S_1 (S_2 V_d u m_q(x, R))^{−ε}.   (43)

We also took into account that m_q(x, R) > 0 for x ∈ D_q(R) and applied relation (42). Thus, for all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2,

J_1(m, x) ≤ (S_1/((S_2 V_d)^ε (m_q(x, R))^ε)) ∫_{(e_[N],∞)} ((log_[N+1](u) + 1)/u^{1+ε}) du = U_2(ε, N, d, l)(m_q(x, R))^{−ε},   (44)
where U_2(ε, N, d, l) := S_1(l) L_N(ε)(S_2 V_d)^{−ε}.
Part (3c). We provide the bound for J_2(m, x). For all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2, in view of (43), it holds that 1 − F_{m,l,x}(√m) ≤ S_1 (S_2 V_d m_q(x, R) √m)^{−ε}. Hence (as m_2 ≥ 2),
J_2(m, x) ≤ ∫_{(√m, m²]} ((log_[N+1](u) + 1)/u)(1 − F_{m,l,x}(u)) du ≤ (1 − F_{m,l,x}(√m)) ∫_{(√m, m²]} (log_[N+1](u) + 1) d log u ≤ S_1 (S_2 V_d)^{−ε} (m_q(x, R))^{−ε} m^{−ε/2} (log_[N](2 log m) + 1) (3/2) log m.
Then, for all x ∈ Λ(q) ∩ S(q) ∩ D_q(R) and any m ≥ m_2,

J_2(m, x) ≤ U_3(m, ε, N, d, l)(m_q(x, R))^{−ε},   (45)

where U_3(m, ε, N, d, l) := (3/2) S_1(l)(S_2 V_d)^{−ε} m^{−ε/2} log m (log_[N](2 log m) + 1) → 0, m → ∞.
Part (3d). To indicate bounds for J3(m, x) we employ several auxiliary results.
Lemma 4. For each N ∈ N and any ν > 0, there are a := a(d, ν) ≥ 0 and b := b(N, d, ν) ≥ 0 such that, for arbitrary x, y ∈ R^d,

G_N(|log ‖x − y‖^d|^ν) ≤ a G_N(|log ‖x − y‖|^ν) + b.
The proof is given in Appendix A. On the one hand, by (29), for any w ≥ 0, we get

W_{m,x}(mw) = ∫_{B(x, w^{1/d})} q(z) dz = W_{1,x}(w).

On the other hand, by (28), one has F_{1,1,x}(w) = 1 − (1 − W_{1,x}(w)) = W_{1,x}(w). Consequently, for any m ∈ N, w ≥ 0 and all x ∈ R^d,

W_{m,x}(mw) = F_{1,1,x}(w).   (46)

Moreover, F_{1,1,x}(w) = P(‖Y − x‖^d ≤ w), so ξ_{1,1,x} =_law ‖Y − x‖^d. Thus, due to Lemmas 2 and 4 (for ν = 1),

∫_{(e_[N],∞)} (1 − F_{1,1,x}(w)) g_N(w) dw = ∫_{(e_[N],∞)} G_N(log w) dF_{1,1,x}(w)
= E[G_N(log ξ_{1,1,x}) I{ξ_{1,1,x} > e_[N]}] = E[G_N(log ‖Y − x‖^d) I{‖Y − x‖^d > e_[N]}]
= ∫_{y∈R^d, ‖x−y‖>(e_[N])^{1/d}} G_N(log ‖x − y‖^d) q(y) dy
≤ a(d, 1) ∫_{y∈R^d, ‖x−y‖>(e_[N])^{1/d}} G_N(|log ‖x − y‖|) q(y) dy + b(N, d, 1)
= a(d, 1) ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) q(y) dy + b(N, d, 1),   (47)
since G_N(t) = 0 for t ∈ [0, e_[N−1]], N ∈ N.
Now we estimate 1 − F_{m,l,x}(u) in a way different from (40). Fix any δ > 0. Note that, for all m ≥ (l − 1)(1 + 1/δ) and s ∈ {0, ..., l − 1}, it holds that m/(m − s) ≤ m/(m − l + 1) ≤ 1 + δ. Then, for all x ∈ R^d, u ≥ 0 and m ≥ max{l, (l − 1)(1 + 1/δ)}, in view of (28) one can write

1 − F_{m,l,x}(u) = (1 − W_{m,x}(u)) ∑_{s=0}^{l−1} (m/(m − s)) C(m − 1, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{(m−1)−s}
≤ (1 + δ)(1 − W_{m,x}(u)) ∑_{s=0}^{l−1} C(m − 1, s)(W_{m,x}(u))^s (1 − W_{m,x}(u))^{(m−1)−s}
≤ (1 + δ)(1 − W_{m,x}(u)).   (48)
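Since 1 − F_{m,l,x}(u) = P(Bin(m, W_{m,x}(u)) ≤ l − 1), inequality (48) can be probed with exact binomial probabilities. The parameters in this sketch are hypothetical; it is not the paper's code.

```python
import math

def tail(m, l, W):
    """1 - F_{m,l,x}(u) = sum_{s<l} C(m,s) W^s (1-W)^(m-s), i.e. P(Bin(m,W) <= l-1)."""
    return sum(math.comb(m, s) * W**s * (1 - W) ** (m - s) for s in range(l))

l, delta = 4, 0.5
m_min = math.ceil((l - 1) * (1 + 1 / delta))  # = 9 here
for m in range(max(l, m_min), 60):
    for W in [i / 50 for i in range(51)]:
        # Bound (48): valid once m >= (l-1)(1 + 1/delta).
        assert tail(m, l, W) <= (1 + delta) * (1 - W) + 1e-12
print("bound (48) verified on the grid")
```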
We are going to employ the following statement as well.
Lemma 5. For each N ∈ N, a function log[N](t), t > e[N−1], is slowly varying at infinity.
The proof is elementary and thus is omitted.
Part (3e). Now we are ready to get the bound for J_3(m, x). Set u = mw. Then one has

J_3(m, x) = ∫_{(m²,∞)} (1 − F_{m,l,x}(u)) (1/u)((log_[N](log u) + 1)/∏_{j=1}^{N−1} log_[j](log u)) du
= ∫_{(m,∞)} (1 − F_{m,l,x}(mw)) (1/w)((log_[N+1](mw) + 1)/∏_{j=2}^{N} log_[j](mw)) dw.

For w > m, Lemma 5 implies that log_[N+1](mw) ≤ log_[N+1](w²) = log_[N](2 log w) ≤ 2 log_[N+1](w) for w large enough, namely for all w ≥ W, where W = W(N). Take δ > 0 and set m_3 := max{l, ⌈(l − 1)(1 + 1/δ)⌉, ⌈W(N)⌉, e_[N]}. Let further m ≥ m_3. Then

J_3(m, x) ≤ 2 ∫_{(m,∞)} (1 − F_{m,l,x}(mw)) (1/w)((log_[N+1](w) + 1)/∏_{j=2}^{N} log_[j](w)) dw.
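The slow variation claimed in Lemma 5, and the factor-2 comparison just used, can be probed numerically; the values below are illustrative only.

```python
import math

def iter_log(t, N):
    """log_[N](t): the N-fold iterated natural logarithm."""
    for _ in range(N):
        t = math.log(t)
    return t

# Slow variation: log_[N](c * t) / log_[N](t) -> 1 as t -> infinity.
t = 1e300
for N in (1, 2, 3):
    ratio = iter_log(2 * t, N) / iter_log(t, N)
    assert abs(ratio - 1) < 1e-2

# Factor-2 comparison for large arguments (here with m <= w, sample values):
w, m = 1e12, 1e6
assert iter_log(m * w, 3) <= 2 * iter_log(w, 3)
print("slow-variation checks passed")
```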
By virtue of (46) and (48) one has
1 − Fm,l,x(mw) ≤ (1 + δ)(1 − Wm,x(mw)) = (1 + δ)(1 − F1,1,x(w)). (49)
Hence it can be seen that

J_3(m, x) ≤ 2(1 + δ) ∫_{(m, ∞)} (1 − F_{1,1,x}(w)) g_N(w) dw.   (50)
Introduce

R_N(x) := ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) q(y) dy,  A_p(G_N) := {x ∈ S(p) : R_N(x) < ∞}.
Let us note that (1) P_X(S(p) \ A_p(G_N)) = 0 as K_{p,q}(1, N) < ∞; (2) P_X(S(p) \ S(q)) = 0 as P_X ≪ P_Y (see Lemma A1); (3) µ(S(q) \ (Λ(q) ∩ D_q(R))) = 0 due to Lemma 3. Since P_X ≪ µ, we conclude that P_X(S(q) \ (Λ(q) ∩ D_q(R))) = 0. Hence, one has P_X(S(p) \ (Λ(q) ∩ D_q(R))) = 0 in view of (2) and because B \ C ⊂ (B \ A) ∪ (A \ C) for any A, B, C ⊂ R^d. Set further A := Λ(q) ∩ S(q) ∩ D_q(R) ∩ S(p) ∩ A_p(G_N). It follows from (1), (2) and (3) that P_X(S(p) \ A) = 0, so P_X(A) = 1. We are going to consider only x ∈ A. Then, by virtue of (47) and (50), for all m ≥ m_3 and x ∈ A, we come to the inequality

J_3(m, x) ≤ 2(1 + δ)(a(d, 1) R_N(x) + b(N, d, 1)) = A(δ, d) R_N(x) + B(δ, d, N),   (51)

where A(δ, d) := 2(1 + δ) a(d, 1) and B(δ, d, N) := 2(1 + δ) b(N, d, 1).
Part (3f). Here we get the upper bound for E G_N(|log ξ_{m,l,x}|). For m ≥ max{m_1, m_2, m_3} and each x ∈ A, taking into account (39), (44), (45) and (51), we can claim that
E G_N(|log ξ_{m,l,x}|) ≤ I_1(m, x) + J_1(m, x) + J_2(m, x) + J_3(m, x)
≤ U_1(ε, N, d)(M_q(x, R))^ε + U_2(ε, N, d, l)(m_q(x, R))^{−ε} + U_3(m, ε, N, d, l)(m_q(x, R))^{−ε} + (A(δ, d) R_N(x) + B(δ, d, N)).   (52)
For any κ > 0, one can take m_4 = m_4(κ, ε, N, d, l) ∈ N such that U_3(m, ε, N, d, l) ≤ κ if m ≥ m_4. Then, by virtue of (52), for each x ∈ A and m ≥ m_0 := max{m_1, m_2, m_3, m_4},
E G_N(|log ξ_{m,l,x}|) ≤ U_1(ε, N, d)(M_q(x, R))^ε + (U_2(ε, N, d, l) + κ)(m_q(x, R))^{−ε} + (A(δ, d) R_N(x) + B(δ, d, N)) := C_0(x) < ∞.   (53)
Hence, for each x ∈ A, we have established the uniform integrability of the family {log ξ_{m,l,x}}_{m≥m_0}.
Step 4. Now we verify (23). It was checked, for each x ∈ A (thus, for P_X-almost every x belonging to S(p)), that E(log φ_{m,l}(1)|X_1 = x) → ψ(l) − log V_d − log q(x), m → ∞. Set Z_{m,l}(x) := E(log φ_{m,l}(1)|X_1 = x) = E log ξ_{m,l,x}. Consider x ∈ A and take any m ≥ m_0. We use the following property of G_N, which is shown in Appendix A.
Lemma 6. For each N ∈ N, the function G_N is convex on R_+.

Since G_N is nondecreasing and convex, by the Jensen inequality,

G_N(|Z_{m,l}(x)|) = G_N(|E log ξ_{m,l,x}|) ≤ G_N(E|log ξ_{m,l,x}|) ≤ E G_N(|log ξ_{m,l,x}|).
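The chain above only uses that a nondecreasing convex G satisfies G(|E ξ|) ≤ G(E|ξ|) ≤ E G(|ξ|). A toy check with a stand-in convex function (G_N itself is defined by (10) in the paper and is not reproduced here):

```python
import math, random

random.seed(42)
G = lambda t: t * math.log1p(t)  # stand-in: nondecreasing and convex on [0, inf)
xs = [random.gauss(0.3, 1.0) for _ in range(100_000)]

abs_mean = abs(sum(xs) / len(xs))              # |E xi|
mean_abs = sum(abs(x) for x in xs) / len(xs)   # E |xi|
mean_G = sum(G(abs(x)) for x in xs) / len(xs)  # E G(|xi|)

# Triangle inequality + monotonicity, then Jensen for the empirical measure:
assert G(abs_mean) <= G(mean_abs) <= mean_G
print("Jensen chain holds for the sample")
```

Both inequalities hold deterministically for any sample, since Jensen applies to the empirical distribution.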
Relation (53) guarantees that, for all m ≥ m_0,

∫_{R^d} G_N(|Z_{m,l}(x)|) p(x) dx ≤ U_1(ε, N, d) Q_{p,q}(ε, R) + (U_2(ε, N, d, l) + κ) T_{p,q}(ε, R) + A(δ, d) K_{p,q}(1, N) + B(δ, d, N) < ∞.
Now we know that the family {Z_{m,l}(x)}_{m≥m_0}, x ∈ A, is uniformly integrable w.r.t. the measure P_X. Thus, for i ∈ N,

E log φ_{m,l}(i) = ∫_{R^d} E(log φ_{m,l}(1)|X_1 = x) P_X(dx) = ∫_{R^d} Z_{m,l}(x) p(x) dx → ψ(l) − log V_d − ∫_{R^d} p(x) log q(x) dx, m → ∞,

and we come to relation (23), establishing Statement 1.
Step 5. Here we prove Statement 2. Similarly to F_{m,l,x}(u), one can introduce, for n, k ∈ N, n ≥ k + 1, x ∈ R^d and u ≥ 0, the following function
F̃_{n,k,x}(u) := P(ζ_{n,k}(i) ≤ u|X_i = x) = 1 − P(‖x − X_(k)(x, X_n \ {x})‖ > r_{n−1}(u))
= 1 − ∑_{s=0}^{k−1} C(n − 1, s)(V_{n−1,x}(u))^s (1 − V_{n−1,x}(u))^{n−1−s} := P(ξ̃_{n,k,x} ≤ u),   (54)
where r_n(u) was defined in (29), and

V_{n,x}(u) := ∫_{B(x,r_n(u))} p(z) dz,  ξ̃_{n,k,x} := (n − 1)‖x − X_(k)(x, X_n \ {x})‖^d.   (55)
Formulas (54) and (55) show that F̃_{n,k,x}(u) is the regular conditional distribution function of ζ_{n,k}(i) given X_i = x. Moreover, for any fixed u ≥ 0 and x ∈ Λ(p) ∩ S(p) (thus p(x) > 0),

F̃_{n,k,x}(u) → 1 − ∑_{s=0}^{k−1} ((V_d p(x) u)^s / s!) e^{−V_d p(x) u} := F̃_{k,x}(u), n → ∞.
Hence, ξ̃_{n,k,x} →_law ξ̃_{k,x}, x ∈ Λ(p) ∩ S(p), n → ∞. Set Ã_p(G_N) := {x ∈ S(p) : R̃_N(x) < ∞}, where N ∈ N and

R̃_N(x) := ∫_{y∈R^d, ‖x−y‖>e_[N]} G_N(log ‖x − y‖) p(y) dy.
Take Ã := Λ(p) ∩ S(p) ∩ D_p(R) ∩ Ã_p(G_N). Then P_X(Ã) = 1 and, for x ∈ Ã, one can verify that E G_N(|log ξ̃_{n,k,x}|) ≤ C̃_0(x) < ∞ for all n ≥ n_0, and therefore E log ξ̃_{n,k,x} → E log ξ̃_{k,x} as n → ∞. Thus, E(log ζ_{n,k}(1)|X_1 = x) → ψ(k) − log V_d − log p(x), n → ∞. Set Z̃_{n,k}(x) := E(log ζ_{n,k}(1)|X_1 = x). One can see that, for all n ≥ n_0, ∫_{R^d} G_N(|Z̃_{n,k}(x)|) p(x) dx < ∞. Hence, similarly to Steps 1–4, we come to relation (24). So, (14) holds and the proof of Theorem 1 is complete.
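For orientation, here is a minimal sketch (under stated assumptions, not the paper's code) of the k-nearest-neighbor divergence estimate whose building blocks φ_{m,l}(i) = m‖X_i − Y_(l)(X_i, Y_m)‖^d and ζ_{n,k}(i) = (n−1)‖X_i − X_(k)(X_i, X_n \ {X_i})‖^d appear in (23) and (24): an estimator of the form D̂ = (1/n) ∑_i [log φ_{m,l}(i) − log ζ_{n,k}(i)] + ψ(k) − ψ(l) then has expectation converging to D(P_X||P_Y). The code below takes d = 1 and k = l (so the digamma terms cancel) and uses sorted arrays for fast 1-d nearest-neighbor search; it is tried on Gaussian samples, where D(N(0,1)||N(1,1)) = 1/2.

```python
import bisect, math, random

def kth_nn_dist(t, sorted_pts, k, exclude_self=False):
    """Distance from t to its k-th nearest neighbor in sorted_pts (1-d)."""
    j = bisect.bisect_left(sorted_pts, t)
    lo, hi = j - 1, j
    need = k + (1 if exclude_self else 0)  # the extra pick skips the zero self-distance
    dist = 0.0
    for _ in range(need):
        left = t - sorted_pts[lo] if lo >= 0 else math.inf
        right = sorted_pts[hi] - t if hi < len(sorted_pts) else math.inf
        if left <= right:
            dist, lo = left, lo - 1
        else:
            dist, hi = right, hi + 1
    return dist

def kl_estimate_1d(xs, ys, k=1):
    """D_hat = (1/n) sum_i [log(m * nu_i) - log((n-1) * rho_i)] for d = 1, k = l."""
    n, m = len(xs), len(ys)
    sx, sy = sorted(xs), sorted(ys)
    total = 0.0
    for x in xs:
        nu = kth_nn_dist(x, sy, k)                      # l-th NN distance in the Y-sample
        rho = kth_nn_dist(x, sx, k, exclude_self=True)  # k-th NN distance in X_n \ {x}
        total += math.log(m * nu) - math.log((n - 1) * rho)
    return total / n

random.seed(7)
n = m = 4000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [random.gauss(1.0, 1.0) for _ in range(m)]
est = kl_estimate_1d(xs, ys)
print(est)  # true D(N(0,1)||N(1,1)) = 0.5
```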
5. Proof of Theorem 2
We will follow the general scheme described in Remark 7; however, now this scheme is more involved. First of all, note that, in view of Lemma 1, the finiteness of K_{p,q}(2, N) and K_{p,p}(2, N) implies the finiteness of K_{p,q}(1, N) and K_{p,p}(1, N), respectively. Thus, the conditions of Theorem 2 entail the validity of the statements of Theorem 1. Consequently, under the conditions of Theorem 2, for n and m large enough, one can claim that D̂_{n,m}(k, l) ∈ L¹(Ω) and E D̂_{n,m}(k, l) → D(P_X||P_Y) as n, m → ∞.
We will show that D̂_{n,m}(k, l) ∈ L²(Ω) for all n and m large enough. Then one can write

E(D̂_{n,m}(k, l) − D(P_X||P_Y))² = var(D̂_{n,m}(k, l)) + (E D̂_{n,m}(k, l) − D(P_X||P_Y))².

Therefore, to prove (16) we will demonstrate that var(D̂_{n,m}(k, l)) → 0, n, m → ∞.
Due to (28), the random variables log φ_{m,l}(1), ..., log φ_{m,l}(n) are identically distributed (and log ζ_{n,k}(1), ..., log ζ_{n,k}(n) are identically distributed as well). The variables φ_{m,l}(i) and ζ_{n,k}(i) are the same as in (22). We will demonstrate that log φ_{m,l}(1) and log ζ_{n,k}(1) belong to L²(Ω). Hence (22) yields

var(D̂_{n,m}(k, l)) = (1/n²) ∑_{i,j=1}^{n} cov(log φ_{m,l}(i) − log ζ_{n,k}(i), log φ_{m,l}(j) − log ζ_{n,k}(j))
= (1/n) var(log φ_{m,l}(1)) + (2/n²) ∑_{1≤i<j≤n} cov(log φ_{m,l}(i), log φ_{m,l}(j)) + (1/n) var(log ζ_{n,k}(1))
+ (2/n²) ∑_{1≤i<j≤n} cov(log ζ_{n,k}(i), log ζ_{n,k}(j)) − (2/n²) ∑_{i,j=1}^{n} cov(log φ_{m,l}(i), log ζ_{n,k}(j)).   (56)

We mainly follow the notation employed in the above proof of Theorem 1, except for the possibly different choice of the sets A ⊂ R^d, Ã ⊂ R^d, positive U_j, C_j(x), C̃_j(x) and integers m_j, n_j, where j ∈ Z_+ and x ∈ R^d. The proof of Theorem 2 is also subdivided into 5 steps. Steps 1–3 deal with the demonstration of the relation (1/n) var(log φ_{m,l}(1)) → 0 as n, m → ∞. Step 4 validates the relations (2/n²) ∑_{1≤i<j≤n} cov(log φ_{m,l}(i), log φ_{m,l}(j)) → 0, n, m → ∞, and (2/n²) ∑_{1≤i<j≤n} cov(log ζ_{n,k}(i), log ζ_{n,k}(j)) → 0, n → ∞.
Set A_{p,2}(G_N) := {x ∈ S(p) : R_{N,2}(x) < ∞}. Then P_X(S(p) \ A_{p,2}(G_N)) = 0 since K_{p,q}(2, N) < ∞. Consider

A := Λ(q) ∩ S(q) ∩ D_q(R) ∩ S(p) ∩ A_{p,2}(G_N),   (58)

where the first four sets appeared in the proof of Theorem 1, and R and N are indicated in the conditions of Theorem 2. It is easily seen that P_X(A) = 1; the reasoning is exactly the same as in the proof of Theorem 1.
Recall that, for each x ∈ A, one has log ξ_{m,l,x} →_law log ξ_{l,x}, m → ∞, where ξ_{m,l,x} := m‖x − Y_(l)(x, Y_m)‖^d and ξ_{l,x} has the Γ(V_d q(x), l) distribution. Convergence in law of random variables is maintained by continuous transformations. Thus, for each x ∈ A, we get

log² ξ_{m,l,x} →_law log² ξ_{l,x}, m → ∞.   (59)

For any x ∈ A, according to (28),

E log² ξ_{m,l,x} = ∫_{(0,∞)} log² u dF_{m,l,x}(u) = ∫_{(0,∞)} log² u dP(φ_{m,l}(1) ≤ u|X_1 = x) = E(log² φ_{m,l}(1)|X_1 = x).   (60)

Note that if η ∼ Γ(α, λ), where α > 0 and λ > 0, then it is not difficult to verify that

E log² η = Γ''(λ)/Γ(λ) − 2 ψ(λ) log α + log² α.

Since ξ_{l,x} ∼ Γ(V_d q(x), l), for x ∈ S(q), one has

E log² ξ_{l,x} = Γ''(l)/Γ(l) − 2 ψ(l) log(V_d q(x)) + log²(V_d q(x)) = log² q(x) + h_1 log q(x) + h_2,   (61)

where h_1 := h_1(l, d) and h_2 := h_2(l, d) depend only on the fixed l and d. We prove now that, for x ∈ A, one has

E(log² φ_{m,l}(1)|X_1 = x) → log² q(x) + h_1 log q(x) + h_2, m → ∞.   (62)

Taking into account (60) and (61), we can claim that relation (62) is equivalent to the following one: E log² ξ_{m,l,x} → E log² ξ_{l,x}, m → ∞. So, in view of (59), to prove (62) it is sufficient to show that, for each x ∈ A, the family {log² ξ_{m,l,x}}_{m≥m_0(x)} is uniformly integrable for some m_0(x) ∈ N. Then, following the proof of Theorem 1, one can certify that, for all x ∈ A and some nonnegative C_0(x),

sup_{m≥m_0(x)} E G_N(log² ξ_{m,l,x}) ≤ C_0(x) < ∞.   (63)

Step 2. Now we will prove (63). For each N ∈ N, introduce ρ(N) := exp{√e_[N−1]} and

h_N(t) := 0 for t ∈ [1/ρ(N), ρ(N)],
h_N(t) := (2 log t / t)((log_[N](log² t) + 1)/∏_{j=1}^{N−1} log_[j](log² t)) for t ∈ (0, 1/ρ(N)) ∪ (ρ(N), ∞).

As usual, a product over an empty set (if N = 1) equals 1. To show (63) we refer to the next lemma.

Lemma 7. Let F(u), u ∈ R, be a distribution function such that F(0) = 0. Fix an arbitrary N ∈ N. Then

(1) ∫_{(0, 1/ρ(N))} G_N(log² u) dF(u) = ∫_{(0, 1/ρ(N))} F(u)(−h_N(u)) du,

(2) ∫_{(ρ(N),∞)} G_N(log² u) dF(u) = ∫_{(ρ(N),∞)} (1 − F(u)) h_N(u) du.

The proof of this lemma is omitted, being quite similar to that of Lemma 2. By Lemma 7 and since G_N(log² u) = 0 for u ∈ [1/ρ(N), ρ(N)], one has

E G_N(log² ξ_{m,l,x}) = ∫_{(0, 1/ρ(N))} F_{m,l,x}(u)(−h_N(u)) du + ∫_{(ρ(N),∞)} (1 − F_{m,l,x}(u)) h_N(u) du := I_1(m, x) + I_2(m, x).

To simplify notation we do not indicate the dependence of I_i(m, x) (i = 1, 2) on the fixed N, l and d. For clarity, further implementation of Step 2 is divided into several parts.
Part (2a). At first we consider I_1(m, x).
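As an aside (not from the paper), the Gamma second-moment identity behind (61) can be checked numerically: with Γ''(λ)/Γ(λ) = ψ'(λ) + ψ(λ)², one has E log² η = ψ'(λ) + (ψ(λ) − log α)² for η ∼ Γ(α, λ). Below, ψ and ψ' are approximated by finite differences of `math.lgamma`.

```python
import math, random

def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / h**2

random.seed(3)
alpha_rate, lam = 2.0, 3.0   # illustrative rate alpha and shape lambda
n = 200_000
mc = sum(math.log(random.gammavariate(lam, 1.0 / alpha_rate)) ** 2 for _ in range(n)) / n
# Gamma''(lam)/Gamma(lam) = trigamma(lam) + digamma(lam)**2
exact = (trigamma(lam) + digamma(lam) ** 2
         - 2 * digamma(lam) * math.log(alpha_rate) + math.log(alpha_rate) ** 2)
print(mc, exact)  # should agree to about 1e-2
```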
As in the proof of Theorem 1, for fixed R > 0 and ε > 0 appearing in the conditions of Theorem 2, the inequality F_{m,l,x}(u) ≤ (M_q(x, R))^ε V_d^ε u^ε holds for any x ∈ A, u ∈ (0, 1/ρ(N)] and m ≥ m_1 := max{⌈1/(ρ(N) R^d)⌉, l}. Taking into account that (−h_N(u)) ≤ (−2 log u)(log_[N](log² u) + 1)/u if u ∈ (0, 1/ρ(N)], we get, for m ≥ m_1,

I_1(m, x) ≤ (M_q(x, R))^ε V_d^ε ∫_{(0, 1/ρ(N))} (−2 log u)(log_[N](log² u) + 1) u^{ε−1} du = U_1(ε, N, d)(M_q(x, R))^ε.   (64)

Here U_1(ε, N, d) := V_d^ε L_{N,2}(ε), with L_{N,2}(ε) := ∫_{(√e_[N−1],∞)} 2t(log_[N](t²) + 1) e^{−εt} dt < ∞ for each ε > 0 and any N ∈ N.
Part (2b). Consider I_2(m, x). Following the proof of the previous theorem, we first observe that h_N(u) ≤ (2 log u/u)(log_[N](log² u) + 1) for u ∈ (ρ(N), ∞). So, for all m ≥ max{ρ²(N), l},

I_2(m, x) ≤ ∫_{(ρ(N), √m]} ((2 log u)(log_[N](log² u) + 1)/u)(1 − F_{m,l,x}(u)) du
+ ∫_{(√m, m²]} ((2 log u)(log_[N](log² u) + 1)/u)(1 − F_{m,l,x}(u)) du
+ ∫_{(m², ∞)} (1 − F_{m,l,x}(u)) h_N(u) du := J_1(m, x) + J_2(m, x) + J_3(m, x),

where we do not indicate the dependence of J_j(m, x) (j = 1, 2, 3) on N, l and d. For R > 0 and ε > 0 appearing in the conditions of Theorem 2, one can show (see the proof of Theorem 1) that the inequality