arXiv:2009.12337v1 [cs.IT] 25 Sep 2020

Measuring Dependencies of Order Statistics: An Information Theoretic Perspective

Alex Dytso, Martina Cardone, and Cynthia Rush

New Jersey Institute of Technology, Newark, NJ 07102, USA. Email: [email protected]
University of Minnesota, Minneapolis, MN 55404, USA. Email: [email protected]
Columbia University, New York, NY, USA. Email: [email protected]

The work of M. Cardone was supported in part by the U.S. National Science Foundation under Grant CCF-1849757.

Abstract—Consider a random sample $X_1, X_2, \ldots, X_n$ drawn independently and identically distributed from some known sampling distribution $P_X$. Let $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$ represent the order statistics of the sample. The first part of the paper focuses on distributions with an invertible cumulative distribution function. Under this assumption, a distribution-free property is established: the $f$-divergence between the joint distribution of order statistics and the product distribution of order statistics does not depend on the original sampling distribution. Moreover, it is shown that the mutual information between two subsets of order statistics also satisfies this distribution-free property; that is, it does not depend on the sampling distribution. Furthermore, the decoupling rates between $X_{(r)}$ and $X_{(m)}$ (i.e., the rates at which the mutual information approaches zero) are characterized for various choices of $(r, m)$. The second part of the paper considers a family of discrete distributions, which does not satisfy the assumptions of the first part. In comparison to the results of the first part, it is shown that in the discrete setting the mutual information between order statistics does depend on the sampling distribution, and its exact value is computed for the Bernoulli case. Nonetheless, it is shown that the results of the first part can still be used as upper bounds on the decoupling rates.

I. INTRODUCTION

Consider a random sample $X_1, X_2, \ldots, X_n$ drawn independently and identically distributed (i.i.d.) from some known sampling distribution $P_X$. Let the random variables $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$ represent the order statistics of the sample. In this work, we are interested in studying the dependence between $X_{(\mathcal{I}_1)} = \{X_{(i)}\}_{i \in \mathcal{I}_1}$ and $X_{(\mathcal{I}_2)} = \{X_{(i)}\}_{i \in \mathcal{I}_2}$, where $\mathcal{I}_1$ and $\mathcal{I}_2$ are two arbitrary subsets of $\{1, \ldots, n\}$. In particular, we choose to use information measures, namely the $f$-divergence and the mutual information, as measures of such dependence.

Our contributions and paper outline are as follows. In Section II, we consider the order statistics when the sample is drawn from a large family of distributions, namely, the set of all distributions having an invertible cumulative distribution function (cdf). Under this assumption, we show that the $f$-divergence between the joint distribution and the product distribution of a subset of order statistics does not depend on the sampling distribution. Moreover, we show that the mutual information between two subsets of order statistics also does not depend on the sampling distribution. Furthermore, we characterize the rates of decoupling between $X_{(r)}$ and $X_{(m)}$ (i.e., the rates at which the mutual information approaches zero) for various choices of $(r, m)$. For example, we show that the minimum and maximum (i.e., $(r, m) = (1, n)$) decouple at a rate of $1/n^2$, while the median and maximum decouple at a rate of $1/n$. In Section III, we consider a family of discrete distributions, which does not fall into the family of Section II. In comparison to the results in Section II, we show that in the discrete setting the mutual information between order statistics does depend on the sampling distribution. Nonetheless, we prove that the results in Section II can still be used as upper bounds on the decoupling rates. Finally, to provide some comparisons, we compute the mutual information in the discrete setting for the case when the sampling distribution comes from the Bernoulli distribution.

Related Work. Order statistics have a wide range of applications in statistical signal processing; the interested reader is referred to [1] for a comprehensive survey. Information measures of the distribution of order statistics have received some attention. For example, the authors of [2] showed conditions under which the differential entropy of order statistics characterizes the sampling distribution. Other information measures that have been considered on the distribution of order statistics include the cumulative entropies [5], the Rényi entropy [3], [4], and the Fisher information [6]. Distribution-free properties for information measures on order statistics have also been observed in the past. For instance, the authors of [4] have shown that, for continuous distributions, the Rényi divergence between order statistics does not depend on the underlying sampling distribution. The authors of [7], for continuous distributions, have related the average entropy of the individual order statistics to the entropy of the sampling distribution, showing that the difference between the two does not depend on the underlying distribution. The authors of [8], for continuous distributions, have shown that the mutual information between consecutive order statistics is independent of the sampling distribution.

We generalize the above results in several directions. First, we show that the distribution-free property holds for the $f$-divergence, a large family of information measures. Second, the proof technique that we use for the distribution-free property allows us to extend the results beyond continuous distributions and to arbitrary subsets of random variables. Third, we find the exact large sample size behavior of the mutual information between $X_{(r)}$ and $X_{(m)}$ for various regimes.
Notation. We use $[n]$ to denote the collection $\{1, 2, \ldots, n\}$. Logarithms are assumed to be in base $e$. The notation $\overset{D}{=}$ denotes equality in distribution. The harmonic number, denoted as $H_r$, is defined as follows. For $r \in \mathbb{N}$,

$H_r = \sum_{k=1}^{r} \frac{1}{k}.$   (1)

We also define, for $r \in \mathbb{N}$,

$T_r = \log(r!) - r H_r.$   (2)

The Euler-Mascheroni constant is denoted by $\gamma \approx 0.5772$. Let $f : (0, \infty) \to \mathbb{R}$ be a convex function such that $f(1) = 0$. Then, for two probability distributions $P$ and $Q$ over a space $\Omega$ such that $P \ll Q$ (i.e., $P$ is absolutely continuous with respect to $Q$), the $f$-divergence is defined as

$D_f(P \| Q) = \int_{\Omega} f\!\left(\frac{dP}{dQ}\right) dQ.$   (3)

II. THE CASE OF CONTINUOUS DISTRIBUTIONS

In this section, we consider a setting in which the cdf of the sampling distribution is an invertible function (i.e., a bijective function). Several classes of probability distributions satisfy this property. For example, all absolutely continuous distributions with a non-zero probability density function (pdf) satisfy this property since, in this case, the cdfs are strictly increasing and, therefore, have an inverse. A non-example, however, is the set of discrete distributions having step functions for their cdfs, which do not have a proper inverse.

Out of the two aforementioned classes of distributions, the class of distributions with a non-zero pdf is the one that is typically studied the most in conjunction with the order statistics [9]. Because the probability of ties in the sample for this case equals zero, the analysis considerably simplifies. For discrete distributions, one must account for the possibility of samples taking the exact same value, and the analysis often becomes combinatorially cumbersome. Nonetheless, we consider the case of discrete distributions in the next section.

A. Distribution-Free Property for the f-divergence

We begin our study on the dependence structure of order statistics by showing that a large class of divergences, namely the $f$-divergence, has the following distribution-free property: if the cdf of the sampling distribution is invertible, then the $f$-divergence between the joint distribution of order statistics and the product distribution of order statistics does not depend on the sampling distribution.

Theorem 1. Fix a subset $\mathcal{I} \subseteq [n]$ and assume that $X_1, \ldots, X_n$ i.i.d. $\sim P_X$, with $P_X$ having an invertible cdf. Then,

$D_f\bigl(P_{\{X_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{X_{(i)}}\bigr) = D_f\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr),$   (4)

where $P_{\{X_{(i)}\}_{i\in\mathcal{I}}}$ and $\prod_{i\in\mathcal{I}} P_{X_{(i)}}$ are the joint distribution and the product distribution of the sequence $\{X_{(i)}\}_{i\in\mathcal{I}}$, respectively; $(U_{(1)}, \ldots, U_{(n)})$ are the order statistics associated with the sample $(U_1, \ldots, U_n)$ i.i.d. $\sim \mathcal{U}(0,1)$, where $\mathcal{U}(0,1)$ denotes the uniform distribution over $(0,1)$; and $P_{\{U_{(i)}\}_{i\in\mathcal{I}}}$ and $\prod_{i\in\mathcal{I}} P_{U_{(i)}}$ are the joint distribution and the product distribution of the sequence $\{U_{(i)}\}_{i\in\mathcal{I}}$, respectively.

Proof: Let $F_X^{-1}$ be the inverse cdf of the sampling distribution $P_X$. Recall that for $(U_1, \ldots, U_n)$ i.i.d. $\sim \mathcal{U}(0,1)$, we have $(X_1, \ldots, X_n) \overset{D}{=} (F_X^{-1}(U_1), \ldots, F_X^{-1}(U_n))$. Then, since $F_X^{-1}(\cdot)$ is order preserving (see [10, eq. (2.4.2)]), we have

$X_{(\mathcal{I})} = \{X_{(i)}\}_{i\in\mathcal{I}} \overset{D}{=} \{F_X^{-1}(U_{(i)})\}_{i\in\mathcal{I}}.$   (5)

Then, since $F_X$ is a one-to-one mapping and the $f$-divergence is invariant under invertible transformations [11, Thm. 14],

$D_f\bigl(P_{\{X_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{X_{(i)}}\bigr) = D_f\bigl(P_{\{F_X^{-1}(U_{(i)})\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{F_X^{-1}(U_{(i)})}\bigr) = D_f\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr).$   (6)

This concludes the proof of Theorem 1.

Remark 1. Computing the $f$-divergence in (4) requires knowledge of the joint distribution of $\{U_{(i)}\}_{i\in\mathcal{I}}$ for any subset $\mathcal{I}$. The joint pdf of this sequence can be readily computed and is given by the following expression [10]: let $\mathcal{I} = \{(i_1, i_2, \ldots, i_k) : 1 \le i_1 < i_2 < \ldots < i_k \le n\}$ with $|\mathcal{I}| = k$; then $P_{\{U_{(i)}\}_{i\in\mathcal{I}}}$ is non-zero only if $x_{(i_1)} < x_{(i_2)} < \ldots < x_{(i_k)}$, and, when this is true, its expression is

$P_{\{U_{(i)}\}_{i\in\mathcal{I}}} = c_{\mathcal{I}} \prod_{t=1}^{k+1} \bigl(x_{(i_t)} - x_{(i_{t-1})}\bigr)^{i_t - i_{t-1} - 1},$   (7)

where $x_{(i_0)} = 0$ and $x_{(i_{k+1})} = 1$, and, with $i_0 = 0$ and $i_{k+1} = n+1$,

$c_{\mathcal{I}} = \frac{n!}{\prod_{t=1}^{k+1} (i_t - i_{t-1} - 1)!}.$

The next result, the proof of which is in Appendix A, evaluates the Kullback-Leibler (KL) divergence, which is a special case of the $f$-divergence with $f(x) = x\log(x)$.

Proposition 1. Under the assumptions of Theorem 1, where $\mathcal{I} \subseteq [n]$ with $|\mathcal{I}| = k$, we have that

$D_{\mathrm{KL}}\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr) = \sum_{t=2}^{k} \bigl(T_{i_t-1} - T_{i_t - i_{t-1} - 1}\bigr) + \sum_{t=1}^{k-1} T_{n-i_t} - (k-1)T_n.$   (8)

In particular:

(Whole Sequence). For $\mathcal{I} = [n]$, we have that

$D_{\mathrm{KL}}\bigl(P_{\{U_{(1)}, \ldots, U_{(n)}\}} \,\big\|\, \textstyle\prod_{i=1}^{n} P_{U_{(i)}}\bigr) = 2\sum_{t=2}^{n} T_{t-1} - (n-1)T_n.$

(Min and Max). For $\mathcal{I} = \{1, n\}$, we have that

$D_{\mathrm{KL}}\bigl(P_{\{U_{(1)}, U_{(n)}\}} \,\big\|\, P_{U_{(1)}} P_{U_{(n)}}\bigr) = \log\Bigl(\frac{n-1}{n}\Bigr) + \frac{1}{n-1}.$
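Proposition 1 and its special cases are straightforward to evaluate numerically. The following is a minimal sketch (Python, assuming NumPy and SciPy are available; the helper names T and kl_joint_vs_product are ours, not from the paper) that computes (8) for an arbitrary index set and checks it against the two special cases above.

    import numpy as np
    from scipy.special import gammaln

    def T(r):
        # T_r = log(r!) - r * H_r from (2), with T_0 = 0.
        if r == 0:
            return 0.0
        return gammaln(r + 1) - r * np.sum(1.0 / np.arange(1, r + 1))

    def kl_joint_vs_product(indices, n):
        # KL divergence in (8) between the joint and product distributions
        # of the uniform order statistics U_(i), i in `indices`, sample size n.
        i = sorted(indices)
        k = len(i)
        term1 = sum(T(i[t] - 1) - T(i[t] - i[t - 1] - 1) for t in range(1, k))
        term2 = sum(T(n - i[t]) for t in range(k - 1))
        return term1 + term2 - (k - 1) * T(n)

    n = 20
    # Min and Max: should equal log((n-1)/n) + 1/(n-1).
    print(kl_joint_vs_product([1, n], n), np.log((n - 1) / n) + 1 / (n - 1))
    # Whole sequence: should equal 2 * sum_{t=2}^{n} T_{t-1} - (n-1) * T_n.
    print(kl_joint_vs_product(range(1, n + 1), n),
          2 * sum(T(t - 1) for t in range(2, n + 1)) - (n - 1) * T(n))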

Remark 2. For $\mathcal{I} = \{1, n\}$, Proposition 1 says that, when $n \to \infty$, we have that

$\lim_{n\to\infty} n^2\,D_{\mathrm{KL}}\bigl(P_{\{U_{(1)}, U_{(n)}\}} \,\big\|\, P_{U_{(1)}} P_{U_{(n)}}\bigr) = \lim_{n\to\infty} n^2\Bigl[\log\Bigl(\frac{n-1}{n}\Bigr) + \frac{1}{n-1}\Bigr] = \frac{1}{2},$

where the last equality follows by using the Maclaurin series for the natural logarithm. Thus, when the KL divergence is considered, the joint and product distributions of the minimum and maximum converge at a rate equal to $1/n^2$.

B. Distribution-Free Property for the Mutual Information

Here we consider the mutual information measure. In particular, as a special case of the approach used for the proof of Theorem 1, we have the following result.

Theorem 2. Assume that $X_1, \ldots, X_n$ i.i.d. $\sim P_X$, with $P_X$ having an invertible cdf, and fix two sets $\mathcal{I}_1, \mathcal{I}_2 \subseteq [n]$. Then,

$I(X_{(\mathcal{I}_1)}; X_{(\mathcal{I}_2)}) = I(U_{(\mathcal{I}_1)}; U_{(\mathcal{I}_2)}),$   (9)

where $X_{(\mathcal{I}_k)} = \{X_{(i)}\}_{i\in\mathcal{I}_k}$ and $U_{(\mathcal{I}_k)} = \{U_{(i)}\}_{i\in\mathcal{I}_k}$ for both $k \in \{1, 2\}$. Consequently, $I(X_{(\mathcal{I}_1)}; X_{(\mathcal{I}_2)})$ is not a function of $P_X$, the sampling distribution of $X$. Moreover, for $r < m$,

$I(X_{(r)}; X_{(m)}) = T_{m-1} + T_{n-r} - T_{m-r-1} - T_n.$   (10)

Proof: The proof of (9) follows along the same lines as the proof of Theorem 1 and relies on the invariance of the mutual information to one-to-one transformations. To compute (10), recall that the mutual information can be written as a KL divergence; then, using (8),

$I(U_{(r)}; U_{(m)}) = D_{\mathrm{KL}}\bigl(P_{U_{(r)}, U_{(m)}} \,\big\|\, P_{U_{(r)}} P_{U_{(m)}}\bigr) = T_{m-1} + T_{n-r} - T_{m-r-1} - T_n.$

This concludes the proof of Theorem 2.

Remark 3. In Theorem 2, if $\mathcal{I}_1 \cap \mathcal{I}_2 \neq \emptyset$, then $I(X_{(\mathcal{I}_1)}; X_{(\mathcal{I}_2)}) = \infty$. Moreover, the assumption that $r < m$ in (10) is without loss of generality.

The following lemma, the proof of which is in Appendix C, characterizes the asymptotic behavior of $T_k$ defined in (2).

Lemma 1. For $k > 0$,

$T_k = k\log\Bigl(\frac{2k}{2k+1}\Bigr) + \frac{1}{2}\log(2\pi k) - (1+\gamma)k - e(k),$   (12)

$T_{k+1} - T_k = \log\Bigl(\frac{2k+2}{2k+3}\Bigr) - (1+\gamma) + \frac{1}{k+1} - c(k),$   (13)

where

$\frac{k}{24(k+1)^2} - \frac{1}{12k} \le e(k) \le \frac{1}{24k} - \frac{1}{12k+1},$   (14)

$\frac{1}{24(k+2)^2} \le c(k) \le \frac{1}{24(k+1)^2}.$   (15)

C. Large Sample Size Asymptotics of the Mutual Information

Using Theorem 2 and the approximations in Lemma 1, we now study the rates of decoupling of the order statistics as the sample size grows. In particular, we have the next theorem, which is proved in Appendix D.

Theorem 3. Under the assumptions of Theorem 2, we have the following:

1) ($r$th vs. Max). Fix some $r \ge 1$ independent of $n$. Then,
$\lim_{n\to\infty} n^2\,I(X_{(r)}; X_{(n)}) = r/2.$   (16)

2) ($r$th vs. $m$th). Fix some $r < m$ independent of $n$. Then,
$\lim_{n\to\infty} I(X_{(r)}; X_{(m)}) = T_{m-1} - T_{m-r-1} + (1+\gamma)r.$   (17)

3) ($k$-Step). Fix some $k \ge 1$. Then,
$\lim_{n\to\infty} I(X_{(n-k)}; X_{(n)}) = \log(k) - H_{k-1} + \gamma = \log\Bigl(\frac{2k}{2k+1}\Bigr) + \frac{1}{k} - c(k-1).$   (18)
Consequently, for a one-step difference ($k = 1$),
$\lim_{n\to\infty} I(X_{(n-1)}; X_{(n)}) = \gamma.$   (19)

4) ($\lfloor\alpha n\rfloor$ vs. $\lceil\beta n\rceil$). Fix $0 < \alpha < \beta < 1$, with $(\alpha, \beta)$ independent of $n$. Then,
$\lim_{n\to\infty} I(X_{(\lfloor\alpha n\rfloor)}; X_{(\lceil\beta n\rceil)}) = \frac{1}{2}\log\Bigl(\frac{\beta(1-\alpha)}{\beta-\alpha}\Bigr).$   (20)

5) ($\lfloor\alpha n\rfloor$ vs. Max). Fix $0 < \alpha < 1$ independent of $n$. Then,
$\lim_{n\to\infty} n\,I(X_{(\lfloor\alpha n\rfloor)}; X_{(n)}) = \frac{\alpha}{2(1-\alpha)}.$   (21)
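The limits in Theorem 3 can be checked numerically against the exact expression (10). A minimal sketch (Python, assuming NumPy/SciPy; the helper names are ours):

    import numpy as np
    from scipy.special import gammaln

    def T(r):
        # T_r = log(r!) - r * H_r from (2), with T_0 = 0.
        return 0.0 if r == 0 else gammaln(r + 1) - r * np.sum(1.0 / np.arange(1, r + 1))

    def mi_uniform(r, m, n):
        # I(X_(r); X_(m)) = T_{m-1} + T_{n-r} - T_{m-r-1} - T_n from (10), for r < m.
        return T(m - 1) + T(n - r) - T(m - r - 1) - T(n)

    for n in [100, 1000, 10000]:
        print(n,
              n**2 * mi_uniform(3, n, n),      # -> r/2 = 1.5 by (16)
              mi_uniform(n - 1, n, n),         # -> Euler-Mascheroni gamma ~ 0.5772 by (19)
              n * mi_uniform(n // 2, n, n))    # -> alpha/(2(1-alpha)) = 0.5 for alpha = 1/2 by (21)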

[Fig. 1. Convergence of $n\,I(X_{(\lfloor n/2\rfloor)}; X_{(n)})$ (Median vs. Max) to (23); the plot compares the exact value computed via (9) with the approximation in (21) for $n$ up to 100.]

[Fig. 2. $I(X_{(1)}; X_{(n)})$ in (27) versus $p \in (0, 1)$, for $n = 9, 10, 11$.]

Remark 5. We compare the rate of decoupling of the mutual information $I(U_{(r)}; U_{(m)})$ for integers $r < m$ with the rate of decoupling of the covariance $\mathrm{Cov}(U_{(r)}, U_{(m)})$; the rates are summarized in Table I.

                                                              Cov(U_(r), U_(m))    I(U_(r); U_(m))
Case 1: $m = n$                                                $1/n^3$              $1/n^2$
Case 2: $r < m$ fixed                                          $1/n^2$              No decoupling
Case 3: $r = n - k$ and $m = n$                                $1/n^2$              No decoupling
Case 4: $r = \lfloor\alpha n\rfloor$ and $m = \lceil\beta n\rceil$   $1/n$          No decoupling
Case 5: $r = \lfloor\alpha n\rfloor$ and $m = n$               $1/n^2$              $1/n$

TABLE I. RATES OF DECOUPLING FOR $\mathrm{Cov}(U_{(r)}, U_{(m)})$ AND $I(U_{(r)}; U_{(m)})$.

III. THE CASE OF DISCRETE DISTRIBUTIONS

In this section, we consider the case when the sampling distribution is discrete. Historically, order statistics with a discrete distribution have received far less attention than those with a continuous distribution. However, recently, since discrete distributions naturally occur in several practical situations (e.g., image processing), discrete order statistics have started to receive more attention in the literature [12], [13].

The mutual information of discrete order statistics often behaves differently from that of continuous order statistics. For example, while order statistics $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$ from a continuous distribution form a Markov chain [10], for the case of discrete distributions, the order statistics form a Markov chain if and only if the sampling distribution has at most two points in its support [9].

The next theorem (see Appendix E for the proof) shows that, unlike in the case of continuous order statistics, when the sampling distribution is discrete, the mutual information can depend on the sampling distribution. In particular, it considers a simple discrete distribution (i.e., Bernoulli with parameter $p \in (0,1)$) for which the mutual information does depend on the sampling distribution (i.e., the parameter $p$). For illustration, Fig. 2 shows the dependence of $I(X_{(1)}; X_{(n)})$ in (27) over $p \in (0, 1)$.

The fact that the mutual information does depend on the sampling distribution prevents us from having universal results similar to those derived in Section II. However, the results in Section II can still be used as upper bounds, as we show next.

Theorem 5. Assume that $X_1, \ldots, X_n$ i.i.d. $\sim P_X$, with $P_X$ an arbitrary sampling distribution. Then,

• ($f$-divergence). For $\mathcal{I} \subset [n]$,
$D_f\bigl(P_{\{X_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{X_{(i)}}\bigr) \le D_f\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr),$   (28)

• (Mutual Information). For $\mathcal{I}_1, \mathcal{I}_2 \subset [n]$,
$I(X_{(\mathcal{I}_1)}; X_{(\mathcal{I}_2)}) \le I(U_{(\mathcal{I}_1)}; U_{(\mathcal{I}_2)}).$   (29)

Proof: The proof relies on the data processing inequality. Due to space constraints, we only show it for $I(X_{(r)}; X_{(m)})$. We start by defining the quantile function as $F_X^{-1}(y) = \sup\{x : F_X(x) \le y\}$. Then, as discussed in the proof of Theorem 1, for an arbitrary sampling distribution [10, eq. (1.1.3)],

$(X_{(r)}, X_{(m)}) \overset{D}{=} (F_X^{-1}(U_{(r)}), F_X^{-1}(U_{(m)})),$   (30)

thus $I(F_X^{-1}(U_{(r)}); F_X^{-1}(U_{(m)})) = I(X_{(r)}; X_{(m)})$. Moreover,

$I(U_{(r)}; U_{(m)}) \ge I(F_X^{-1}(U_{(r)}); F_X^{-1}(U_{(m)})) = I(X_{(r)}; X_{(m)}),$

where the inequality uses the data processing inequality for the mutual information, since $U_{(r)} \to U_{(m)} \to F_X^{-1}(U_{(m)})$ and $F_X^{-1}(U_{(r)}) \to U_{(r)} \to F_X^{-1}(U_{(m)})$ are Markov chains. This concludes the proof of Theorem 5.

The bound in (29) is appealing since all the results derived in Section II (e.g., Theorem 3) can be used to obtain upper bounds on $I(X_{(r)}; X_{(m)})$ for any arbitrary sampling distribution. However, the bound in (29) can be very sub-optimal. For example, as shown in Fig. 3, we have that $\lim_{n\to\infty} I(X_{(n-1)}; X_{(n)}) = 0$ for the Bernoulli case, while $\lim_{n\to\infty} I(U_{(n-1)}; U_{(n)}) = \gamma$, computed in (19).

Remark 6. Theorem 5 can be generalized to arbitrary non-overlapping subsets of order statistics. Moreover, it can be generalized to $f$-divergences.

[Fig. 3. $I(X_{(n-1)}; X_{(n)})$ in (26) and $I(U_{(n-1)}; U_{(n)})$ in (10) versus $n$.]
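For concreteness, the Bernoulli-case mutual information plotted in Fig. 3 can be computed from Binomial tail probabilities, following the construction in Appendix E, and compared against the distribution-free value $I(U_{(n-1)}; U_{(n)})$ from (10). A minimal sketch (Python, assuming NumPy/SciPy; the helper names are ours), illustrating the upper bound (29):

    import numpy as np
    from scipy.stats import binom
    from scipy.special import gammaln

    def mi_bernoulli(r, m, n, p):
        # I(X_(r); X_(m)) for X_i i.i.d. Bernoulli(p), r < m, via the Binomial-tail
        # probabilities used in Appendix E, with B ~ Binomial(n, 1-p) counting the zeros.
        P_r0 = binom.sf(r - 1, n, 1 - p)   # P(B >= r) = P(X_(r) = 0)
        P_m0 = binom.sf(m - 1, n, 1 - p)   # P(B >= m) = P(X_(m) = 0)
        joint = {(0, 0): P_m0, (0, 1): P_r0 - P_m0, (1, 1): 1 - P_r0}
        marg_r = {0: P_r0, 1: 1 - P_r0}
        marg_m = {0: P_m0, 1: 1 - P_m0}
        mi = 0.0
        for (a, b), pab in joint.items():
            if pab > 0:
                mi += pab * np.log(pab / (marg_r[a] * marg_m[b]))
        return mi

    def T(k):
        # T_k from (2); used to get I(U_(n-1); U_(n)) via (10).
        return 0.0 if k == 0 else gammaln(k + 1) - k * np.sum(1.0 / np.arange(1, k + 1))

    for n in [5, 10, 20, 50]:
        i_bern = mi_bernoulli(n - 1, n, n, p=0.3)
        i_unif = T(n - 1) + T(1) - T(0) - T(n)   # I(U_(n-1); U_(n)) from (10)
        print(n, i_bern, i_unif)                 # i_bern <= i_unif, per (29); i_bern -> 0, i_unif -> gamma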

APPENDIX A
PROOF OF PROPOSITION 1

The computation for an arbitrary $\mathcal{I} \subseteq [n]$ with $|\mathcal{I}| = k$ proceeds as follows. First, notice that

$D_{\mathrm{KL}}\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr) \overset{(a)}{=} \mathbb{E}\left[\log \frac{c_{\mathcal{I}} \prod_{t=1}^{k+1} \bigl(U_{(i_t)} - U_{(i_{t-1})}\bigr)^{i_t - i_{t-1} - 1}}{\prod_{t=1}^{k} c_{i_t}\, U_{(i_t)}^{i_t - 1} \bigl(1 - U_{(i_t)}\bigr)^{n - i_t}}\right],$

where the labeled equality follows from: (a) using the expression of the joint distribution in (7) and $P_{U_{(r)}}(x) = c_r x^{r-1}(1-x)^{n-r}$, where $c_r = \frac{n!}{(r-1)!\,(n-r)!}$. Then, we simplify further:

$D_{\mathrm{KL}}\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr)$
$= \log\bigl(c_{\mathcal{I}} \textstyle\prod_{t=1}^{k} c_{i_t}^{-1}\bigr) - \sum_{t=1}^{k} (i_t - 1)\,\mathbb{E}[\log(U_{(i_t)})] - \sum_{t=1}^{k} (n - i_t)\,\mathbb{E}[\log(1 - U_{(i_t)})] + \sum_{t=1}^{k+1} (i_t - i_{t-1} - 1)\,\mathbb{E}[\log(U_{(i_t)} - U_{(i_{t-1})})]$
$= \log\bigl(c_{\mathcal{I}} \textstyle\prod_{t=1}^{k} c_{i_t}^{-1}\bigr) - \sum_{t=1}^{k} (i_t - 1)(\psi(i_t) - \psi(n+1)) - \sum_{t=1}^{k} (n - i_t)(\psi(n+1-i_t) - \psi(n+1)) + \sum_{t=1}^{k+1} (i_t - i_{t-1} - 1)(\psi(i_t - i_{t-1}) - \psi(n+1)),$   (31)

where we have used the fact that $U_{(m)} \sim \mathrm{Beta}(m, n+1-m)$ and $1 - U_{(r)} \sim \mathrm{Beta}(n+1-r, r)$, with the difference $U_{(m)} - U_{(r)} \sim \mathrm{Beta}(m-r, n-(m-r)+1)$; hence (see, e.g., [10]),

$\mathbb{E}[\log(U_{(m)})] = \psi(m) - \psi(n+1)$,  $\mathbb{E}[\log(1 - U_{(r)})] = \psi(n+1-r) - \psi(n+1)$,  $\mathbb{E}[\log(U_{(m)} - U_{(r)})] = \psi(m-r) - \psi(n+1)$,

where $\psi(\cdot)$ is the digamma function and where we use the convention that $U_{(0)} = 0$ and $U_{(n+1)} = 1$. Finally, collecting terms and using the result of (31), we have

$D_{\mathrm{KL}}\bigl(P_{\{U_{(i)}\}_{i\in\mathcal{I}}} \,\big\|\, \textstyle\prod_{i\in\mathcal{I}} P_{U_{(i)}}\bigr)$
$= \log\bigl(c_{\mathcal{I}} \textstyle\prod_{t=1}^{k} c_{i_t}^{-1}\bigr) + \sum_{t=1}^{k+1} (i_t - i_{t-1} - 1)\psi(i_t - i_{t-1}) - \sum_{t=1}^{k} (i_t - 1)\psi(i_t) - \sum_{t=1}^{k} (n - i_t)\psi(n+1-i_t) + \psi(n+1)\Bigl[\sum_{t=1}^{k} \bigl((i_t - 1) + (n - i_t)\bigr) - \sum_{t=1}^{k+1} (i_t - i_{t-1} - 1)\Bigr]$
$\overset{(a)}{=} \log\bigl(c_{\mathcal{I}} \textstyle\prod_{t=1}^{k} c_{i_t}^{-1}\bigr) + \sum_{t=1}^{k+1} (i_t - i_{t-1} - 1)\psi(i_t - i_{t-1}) - \sum_{t=1}^{k} (i_t - 1)\psi(i_t) - \sum_{t=1}^{k} (n - i_t)\psi(n+1-i_t) + \psi(n+1)\bigl(k(n-1) - (n-k)\bigr)$
$\overset{(b)}{=} -\sum_{t=1}^{k+1} T_{i_t - i_{t-1} - 1} + \sum_{t=1}^{k} T_{i_t - 1} + \sum_{t=1}^{k} T_{n - i_t} - (k-1)T_n,$   (32)

where the labeled equalities follow from: (a) using the fact that $\sum_{t=1}^{k+1} (i_t - i_{t-1} - 1) = i_{k+1} - i_0 - (k+1) = n - k$, where we recall that $i_0 = 0$ and $i_{k+1} = n+1$; and (b) using the identity $\psi(n) = H_{n-1} - \gamma$, with $\gamma$ being the Euler-Mascheroni constant, noting that the terms that multiply $\gamma$ cancel out, and using the definition of $T_n$ from (2). The result in (8) comes from canceling the terms $T_{i_1 - i_0 - 1} = T_{i_1 - 1}$ and $T_{i_{k+1} - i_k - 1} = T_{n - i_k}$ from (32).

The special case of $\mathcal{I} = [n]$ follows by observing that $i_t = t$ for $t \in \{0, 1, \ldots, n+1\}$ and hence $i_t - i_{t-1} = 1$, leading to

$D_{\mathrm{KL}}\bigl(P_{\{U_{(1)},\ldots,U_{(n)}\}} \,\big\|\, \textstyle\prod_{i=1}^{n} P_{U_{(i)}}\bigr) = \sum_{t=2}^{n} \bigl(T_{t-1} - T_{t-(t-1)-1}\bigr) + \sum_{t=1}^{n-1} T_{n-t} - (n-1)T_n = 2\sum_{t=2}^{n} T_{t-1} - (n-1)T_n,$

where the first equality follows from (8), and the last equality follows since $T_0 = 0$ and noting that $\sum_{t=2}^{n} T_{t-1} = \sum_{t=1}^{n-1} T_{n-t}$.

The special case of $\mathcal{I} = \{1, n\}$ (hence $k = 2$ with $i_2 = n$ and $i_1 = 1$) follows by observing that (8) gives

$D_{\mathrm{KL}}\bigl(P_{\{U_{(1)}, U_{(n)}\}} \,\big\|\, P_{U_{(1)}} P_{U_{(n)}}\bigr) = T_{n-1} - T_{n-2} + T_{n-1} - T_n = H_{n-1} - H_{n-2} - \log(n) + \log(n-1) = \log\Bigl(\frac{n-1}{n}\Bigr) + \frac{1}{n-1},$

and the second equality follows since

$T_{k+1} - T_k = \log((k+1)!) - (k+1)H_{k+1} - \log(k!) + kH_k = \log(k+1) - H_k - 1,$   (33)

where the last equality follows since $(k+1)H_{k+1} = (k+1)H_k + 1$ by the definition of the harmonic number in (1).
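The Beta-distribution identities for the expectations used in this proof can be sanity-checked by simulation. A minimal sketch (Python, assuming NumPy/SciPy):

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    n, m, r, trials = 10, 7, 3, 200_000

    # Rows are independent samples of n uniforms; sorting gives the order statistics.
    U = np.sort(rng.uniform(size=(trials, n)), axis=1)
    print(np.mean(np.log(U[:, m - 1])), digamma(m) - digamma(n + 1))                 # E[log U_(m)]
    print(np.mean(np.log(1 - U[:, r - 1])), digamma(n + 1 - r) - digamma(n + 1))     # E[log(1 - U_(r))]
    print(np.mean(np.log(U[:, m - 1] - U[:, r - 1])), digamma(m - r) - digamma(n + 1))  # E[log(U_(m) - U_(r))]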

APPENDIX B
PROOF OF (11)

First, by [14, page 18], if the sample is drawn from an exponential distribution, then the $k$th order statistic for $k \in [n]$ is distributed as $X_{(k)} \overset{D}{=} \frac{1}{\lambda}\sum_{i=1}^{k} a_{n,i} Z_i$, where the $Z_i$'s are i.i.d. exponential random variables with rate one and $a_{n,i} = (n-i+1)^{-1}$. Then, $\mathbb{E}[X_{(k)}] = \frac{1}{\lambda}\sum_{i=1}^{k} a_{n,i}\,\mathbb{E}[Z_i] = \frac{1}{\lambda}\sum_{i=1}^{k} a_{n,i}$, since $\mathbb{E}[Z_i] = 1$. Thus, $\mathbb{E}[X_{(1)}] = a_{n,1}/\lambda$ and $\mathbb{E}[X_{(2)}] = a_{n,1}/\lambda + a_{n,2}/\lambda$. Moreover, using $\mathbb{E}[Z_i^2] = 2$,

$\mathbb{E}[X_{(1)} X_{(2)}] = \frac{1}{\lambda^2}\,\mathbb{E}[(a_{n,1} Z_1)(a_{n,1} Z_1 + a_{n,2} Z_2)] = \frac{1}{\lambda^2}\bigl(a_{n,1}^2\,\mathbb{E}[Z_1^2] + a_{n,1} a_{n,2}\,\mathbb{E}[Z_1 Z_2]\bigr) = \frac{1}{\lambda^2}\bigl(2 a_{n,1}^2 + a_{n,1} a_{n,2}\bigr).$

Therefore, using the fact that $a_{n,1} = 1/n$,

$\mathrm{Cov}(X_{(1)}, X_{(2)}) = \mathbb{E}[X_{(1)} X_{(2)}] - \mathbb{E}[X_{(1)}]\,\mathbb{E}[X_{(2)}] = \frac{1}{\lambda^2}\bigl(2 a_{n,1}^2 + a_{n,1} a_{n,2}\bigr) - \frac{1}{\lambda^2}\bigl(a_{n,1}^2 + a_{n,1} a_{n,2}\bigr) = \frac{1}{\lambda^2}\,a_{n,1}^2 = \frac{1}{\lambda^2 n^2}.$

APPENDIX C
PROOF OF LEMMA 1

By [15] (see also [16, page 76]),

$k H_k - k\log\Bigl(k + \frac{1}{2}\Bigr) - k\gamma = e_H(k),$   (34)

where $\frac{k}{24(k+1)^2} \le e_H(k) \le \frac{1}{24k}$, and, using a strong form of Stirling's approximation [17], for $\frac{1}{12k+1} \le e_f(k) \le \frac{1}{12k}$,

$\log(k!) - \Bigl[\frac{1}{2}\log(2\pi k) + k\log(k) - k\Bigr] = e_f(k).$   (35)

Combining (34) and (35) with the definition of $T_k$ in (2), namely $T_k = \log(k!) - k H_k$, we find the result in (12), where we have defined $e(k) := e_H(k) - e_f(k)$; using the bounds for $e_H(k)$ and $e_f(k)$ stated above, we find the result in (14).

Next, we show (13). First, from the definition of $T_k$ in (2), we have that (33) holds. Next, observe that by (34),

$H_k = H_{k+1} - \frac{1}{k+1} = \log\Bigl(k + \frac{3}{2}\Bigr) + \gamma + \frac{e_H(k+1) - 1}{k+1},$   (36)

and therefore, by substituting this inside (33), we obtain (13) with $c(k)$ defined as $c(k) := e_H(k+1)/(k+1)$; the bounds in (15) follow from $\frac{k+1}{24(k+2)^2} \le e_H(k+1) \le \frac{1}{24(k+1)}$.
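The expansion (12) and the bounds just derived can be checked numerically. A minimal sketch (Python, assuming NumPy/SciPy; T implements (2)):

    import numpy as np
    from scipy.special import gammaln

    def T(k):
        # T_k = log(k!) - k * H_k from (2).
        return gammaln(k + 1) - k * np.sum(1.0 / np.arange(1, k + 1))

    for k in [5, 50, 500]:
        approx = (k * np.log(2 * k / (2 * k + 1)) + 0.5 * np.log(2 * np.pi * k)
                  - (1 + np.euler_gamma) * k)
        e_k = approx - T(k)              # e(k) in (12): T_k = approx - e(k)
        lower = k / (24 * (k + 1) ** 2) - 1 / (12 * k)
        upper = 1 / (24 * k) - 1 / (12 * k + 1)
        print(k, lower <= e_k <= upper)  # the bounds in (14) should hold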

C. Case 3: Proof of the k-Step I(X(⌊αn⌋); X(n))= Tn−1 + Tn−⌊αn⌋ − Tn−⌊αn⌋−1 − Tn To show (18) observe that 2n 1 =− log − + c(n − 1) − c(n −⌊αn⌋− 1) (a) 2n +1 n lim I(X(n−k); X(n)) = lim [Tn−1 + Tk − Tk−1 − Tn]   n→∞ n→∞ 2(n −⌊αn⌋) 1 + log + . (b) (c) 2(n −⌊αn⌋)+1 n −⌊αn⌋ = Tk − Tk−1 +(1+ γ) = log(k) − Hk−1 + γ   (d) 2k 1 Finally, the limit in (21) is given by = log + − c(k − 1), 2k +1 k 1 1 1   lim nI(X(⌊αn⌋); X(n))= − 1 − + n→∞ 2 2(1 − α) 1 − α where the labeled equalities follow from: (a) eq. (10); (b) eq. α (38); eq. (33); and eq. (36) and . = , (c) (d) c(k − 1) = eH (k)/k 2(1 − α) D. Case 4: Proof of ⌊αn⌋ vs. ⌈βn⌉ where we have used the limit (#4) in Table III, the fact that from (15) we have limn→∞ nc(n) = limn→∞ nc(n−⌊αn⌋)= TABLE III 0, and the facts that LIST OF LIMITS FOR CASE 4. 2(n −⌊αn⌋) 1 function lim n log = − , f(n) limn→∞ f(n) n→∞ 2(n −⌊αn⌋)+1 2(1 − α) (#1) (⌈βn⌉ − ⌊αn⌋− 1) log 2(⌈βn⌉−⌊αn⌋−1) − 1    2(⌈βn⌉−⌊αn⌋)−1  2 n 1 (#2) (⌈βn⌉− 1) log 2(⌈βn⌉−1) − 1 lim = ,  2⌈βn⌉−1  2 n→∞ n −⌊αn⌋ 1 − α (#3) (n − ⌊αn⌋) log 2(n−⌊αn⌋) − 1  2(n−⌊αn⌋)+1  2 which follow by using the Maclaurin series for the natural (#4) 2n 1 n log  2n+1  − 2 logarithm and since, for fixed 0 <α< 1 with α independent (#5) 1 ⌈βn⌉−⌊αn⌋−1 1 β−α of n, we can write ⌊αn⌋ = αn−{αn}, where 0 ≤{αn} < 1. 2 log  ⌈βn⌉−1  2 log  β  (#6) 1 n−⌊αn⌋ 1 2 log  n  2 log (1 − α) APPENDIX E PROOF OF THEOREM 4 Setting and in (10) and then using (37) r = ⌊αn⌋ m = ⌈βn⌉ First, notice that P (X(k) = 0) is the probability that and (12), we obtain there are k or more zeros in the sample, hence, for B a Binomial(n, 1 − p) random variable, P (X(k) =0)= P (B ≥ I(X(⌊αn⌋); X(⌈βn⌉)) k) and P (X(k) =1)=1 − P (B ≥ k). Moreover, we observe = T⌈βn⌉−1 + Tn−⌊αn⌋ − T⌈βn⌉−⌊αn⌋−1 − Tn that P (X(r) = 0,X(m) = 0) = P (X(m) = 0) = P (B ≥ m) 2(⌈βn⌉−⌊αn⌋− 1) and . = −(⌈βn⌉−⌊αn⌋− 1) log P (X(r) =1,X(m) =1)= P (X(r) =1)=1−P (B ≥ r) 2(⌈βn⌉−⌊αn⌋) − 1 Finally, we will use that   1 ⌈βn⌉−⌊αn⌋− 1 − log + e(⌈βn⌉−⌊αn⌋− 1) P (X(r) =0,X(m) = 1) 2 ⌈βn⌉− 1   = P (X r = 0) − P (X r =0,X m = 0) 2(⌈βn⌉− 1) ( ) ( ) ( ) + (⌈βn⌉− 1) log − e(⌈βn⌉− 1) = P (X = 0) − P (X =0)= P (B ≥ r) − P (B ≥ m). 2⌈βn⌉− 1 (r) (m)   2(n −⌊αn⌋) 1 n −⌊αn⌋ Therefore, the mutual information is given by + (n −⌊αn⌋) log + log 2(n −⌊αn⌋)+1 2 n    I(X(r); X(m)) 2n P (X r =0,X m = 0) − e(n −⌊αn⌋) − n log + e(n), = P (X =0,X = 0) log ( ) ( ) 2n +1 (r) (m) P (X = 0)P (X = 0)    (r) (m)  and then using the limits in Table III, and the fact that P (X(r) =0,X(m) = 1) + P (X r =0,X m = 1) log limk→∞ e(k)=0 from (14), ( ) ( ) P (X = 0)P (X = 1)  (r) (m)  1 β(1 − α) P (X(r) =1,X(m) = 1) lim I(X(⌊αn⌋); X(⌈βn⌉))= log . (39) + P (X(r) =1,X(m) = 1) log . n→∞ 2 β − α P (X = 1)P (X = 1)    (r) (m)  Specifically, the limits in Table III follow from: (#1)-(#4) the Plugging in the values for these probabilities in terms of B as Maclaurin series for the natural logarithm, and (#5)-(#6) the discussed above gives the result in (26). REFERENCES [1] C. R. Rao, C. Rao, and V. Govindaraju, Handbook of Statistics. Elsevier, 2006, vol. 17. [2] S. Baratpour, J. Ahmadi, and N. R. Arghami, “Some characterizations based on entropy of order statistics and record values,” Communications in Statistics-Theory and Methods, vol. 36, no. 1, pp. 47–57, 2007. [3] ——, “Characterizations based on rényi entropy of order statistics and record values,” Journal of Statistical Planning and Inference, vol. 138, no. 8, pp. 2544–2551, 2008. [4] M. Abbasnejad and N. R. 
REFERENCES

[1] C. R. Rao, C. Rao, and V. Govindaraju, Handbook of Statistics. Elsevier, 2006, vol. 17.
[2] S. Baratpour, J. Ahmadi, and N. R. Arghami, "Some characterizations based on entropy of order statistics and record values," Communications in Statistics - Theory and Methods, vol. 36, no. 1, pp. 47-57, 2007.
[3] S. Baratpour, J. Ahmadi, and N. R. Arghami, "Characterizations based on Rényi entropy of order statistics and record values," Journal of Statistical Planning and Inference, vol. 138, no. 8, pp. 2544-2551, 2008.
[4] M. Abbasnejad and N. R. Arghami, "Rényi entropy properties of order statistics," Communications in Statistics - Theory and Methods, vol. 40, no. 1, pp. 40-52, 2010.
[5] N. Balakrishnan, F. Buono, and M. Longobardi, "On cumulative entropies in terms of moments of order statistics," arXiv preprint arXiv:2009.02029, 2020.
[6] G. Zheng, N. Balakrishnan, and S. Park, "Fisher information in ordered data: A review," Statistics and its Interface, vol. 2, pp. 101-113, 2009.
[7] K. M. Wong and S. Chen, "The entropy of ordered sequences and order statistics," IEEE Transactions on Information Theory, vol. 36, no. 2, pp. 276-284, 1990.
[8] N. Ebrahimi, E. S. Soofi, and H. Zahedi, "Information properties of order statistics and spacings," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 177-183, 2004.
[9] H. N. Nagaraja, "On the non-Markovian structure of discrete order statistics," Journal of Statistical Planning and Inference, vol. 7, no. 1, pp. 29-33, 1982.
[10] B. C. Arnold, N. Balakrishnan, and H. N. Nagaraja, A First Course in Order Statistics. SIAM, 1992, vol. 54.
[11] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394-4412, 2006.
[12] D. L. Evans, L. M. Leemis, and J. H. Drew, "The distribution of order statistics for discrete random variables with applications to bootstrapping," INFORMS Journal on Computing, vol. 18, no. 1, pp. 19-30, 2006.
[13] A. Dembińska, "Discrete order statistics," Wiley StatsRef: Statistics Reference Online, 2014.
[14] H. A. David and H. N. Nagaraja, Order Statistics, Third edition. John Wiley & Sons, 2003.
[15] D. W. DeTemple, "The noninteger property of sums of reciprocals of successive integers," The Mathematical Gazette, vol. 75, no. 472, pp. 193-194, 1991.
[16] J. Havil, Gamma: Exploring Euler's Constant. Princeton University Press, 2003. [Online]. Available: http://www.jstor.org/stable/j.ctt7sd75
[17] H. Robbins, "A remark on Stirling's formula," The American Mathematical Monthly, vol. 62, no. 1, pp. 26-29, 1955.