<<

ON THE ASYMPTOTIC DISTRIBUTION OF U-

by

Alan J. Lee*

University of Auckland and University of North Carolina at Chapel Hill

Abstract

The asymptotic distribution of a U-statistic is found in the case when

the corresponding Von Mises functional is stationary of order I. Practical

methods for the tabulation of the limit distributions are discussed, and the

results extended to certain incomplete U-statistics.

Key Words and Phrases: asymptotic distribution, stationary statistical

functional, incomplete U-statistic

*The work of this author was partially supported by the Air Force Office of .e Scientific Research under Contract AFOSR-7S-2796. §1. Introduction.

Let Xl"",X be independent random k-vectors with common distribution n ¢(~l""'~) function F. Let be a function on R km symmetric in the

k-vectors ~xl""'x"'III and let

m 8(F) = J...J ¢(xl,.·.,x) IT F(dx.) (1) D1< R m i=l 1 k

be a functional defined on the space of d.f.s. on ~, assuming that the integral exists. An unbiased estimate of 8 = 8(F) is furnished by the U-statistic

¢(X. , ... ,X. ) (2) 1 1 1 m

where the sum extends over all (~) subsets of the Xi' Following Hoeffding

[10], define

c=1,2, ... ,m

and ~c(xl""'xc) = ~c(xl""'xc) - 8, assuming all expectations exist. Also define

1';0 = 0 , c = 1,2, ... ,m .

< ~ If for some nonnegative integer. d m, sd+1 0 but Sc = 0 for c = O, ... ,d, the functional (1) is said to be stationary of order d at F. If d = 0, then (2) has an asymptotically after suitable normalization [10].

If d ~ 0, the distribution is no longer normal and ma)' be obtained by 2

applying the theory of differentiable statistical functionals due to

Von Mises and Filippova" ([13], [7]) .

Let Fn be the empirical distribution function of Xl' ... 'Xn . The distribution of

1 n n S(F ) = - I (X. , ... ,X. ) (3) n m. 1 .L 1 1 n 1 1 =1 1 m 1= m is closely related to the distribution of U as described in §2, and the n asymptotic distribution of S(F ) is given by the following theorem. n

Theorem (Von Mises, Filippova). Let 6 be a stationary functional of order d+l d. Then theasynlptotic distribution "of n2 (S(F ) -6) is identical to that " n of

d II (F (dx.)-F(dx.» • (4) i=l n 1 1

In the case d = 1, the asymptotic distribution of (4) can be found using integral equation techniques. Filippova "gives an expression for the characteristic function of the asymptotic distribution of (4) in terms of Fredholm determinants. Gregory [8] gives a more concrete representation of the asymptotic distribution of (4) as an infinite series of random variables, in the case m = 2. Below we combine these results to give explicit expressions for the c.f.s. of the limit distributions and discuss practical methods for the tabulation of the limit distributions. We also make some remarks about incomplete U-statistics. 3

§2. Relationship between 8(P ) and U. n." n

Let 1be the set of indices {(i , ... , i ): 1 ~ iJl, ~ n, JI, = 1, ... ,m} and let l m 1. be the subset of 1 consisting of those m-tuples with exactly j distinct J integers. Define for each j = i, ... ,m a sYmmetric kernel

1 m = -

.e

where u~j) is the U-statistic corresponding to the kernel

(5)

where

(6)

If 8 is stationary of order 1, then n(8(P )-8) converges to a nondegenerate . n

distribution as seen in §3, and Var(nU ) is bounded as n + 00 ([10]). n Moreover, E(Zn) converges to the quantity -m(~-l) + 8(m-l) where 4

(m-I) .• . - . [IJ . . . (j) 8 = E[(m_I) (XI,···,Xm_I)] and S1nce Var Un - 0 n2 . and Var Un = for j < mit follows from (6) that Z converges in probabi.1 ity to n -m(m-l) (m-I) ---2-- + 0 . , lind so the asymptotic di.s1ri.butiolls of n(O(F )-0) lind n n(U -8) differ only by a shift. A more explicit expression for e(m-l) is n

1 = (m-I)! I(m-l) E((Xi "",Xi » 1 m

1 m! (m-I) = (m-I)! 2 E((XI,XI'X2"",Xm_I»

_ mem-I) - 2 E(~2(XI,XI»

(m2JE(~2(XI,XI»' and so we..may write limnE(Z n) = lim nE n(8(F n)-8) =

§3. Asymptotic. distribution of n(8(F. n)-8) and n(Un -8).

By the Von Mises - Filippova theorem, the asymptotic distribution of n(8(F )-8) coincides with that of n

(7)

- '" where G (x) /:R(F (x)-F(x». Consider the karnel ~2(xI,x2) where (all n = n integrals are over II1c)

Then it is easily seen that 5

'" JJ '1'Z(x1,xZ)Gn(dx1)Gn(dxZ) = J J'1'(x1,x2)Gn(dx1)Gn(dx2)

=! 1ry(X.,X.) (8) n·I ·I _1J 1=1 J=1

Now let A., j ~ 1, be the (real) eigenvalues of the linear operator J '" LZ(IRk,dF) + LZ(IRk,dF) with (symmetric) kernel ~2' We note that by the

assumption that ?:Z exists, JJ \'1'z(u,v)\z F(du)F(dv) < 00. By the results of

[8], we can see that

(i) the asymptotic distribution of (8) is the same as

(mz}{ I A. (Y .-1) + E(o/Z(X ,X ).)}. (9) j=l J J 1 1

where the Y are independent xi random variables; and j 00 Cii) if L 11...1 < 00 then the asymptotic distribution of (8) is that of j=l J

A.Y .. JJ

To prove these, set the functions h in Theorem 2.1 of [8] to be 0, then n 00 !\''''L '1'ZCX.,X. ) converges in distribution to \'L A.(Y.-1). Since n1. " ;tJ 1. J J'=l J J

1 n n 1 '" 1 n '" 1 n '" - I I '1'2(X.,x.) = - I '1'2(X.,X,) + - I '1'2(X.)x.) and- I '1'z(x.,x.) n i=l j=l 1. J n i;tj 1. J n i=l 1. 1. n i=l . 1. 1.

'" converges in probability to E['1'Z(X1,X1)] by the WLLN, (9) follows upon noting that

E['1'2(X1,X1)] = J '1'2(x1,X1)F(dx1) - E('1'2(Xl'X2))

=J '1'Z(x1,X1)F(dx1) 6

since E[~2(X1,X2)] = 0 by [10]. (ii) follows as in 1.2.3 of [8].

Using (i) and (ii) above and (5) we see that the asymptotic (~) distribution of n(U -8) is that of I A.(Y.-1), and if I IA.I < co, the n j=l JJ J

distribution to that of (~) ~~1 ~/~ - E !'J!~(XrXl) ~. The characteristic

functions of the limit distributions are thus seen to be

co -itA'• 1

Aj = (~}Aj' j ~ 1.

§4. Tabulation of the limit distribution.

In this section we discuss practical means of computing the limit distribution of n(U -8). We first consider the case when the eigenvalues A. n J are surnmab1e. The method employed to tabulate the limit distribution will depend on the amount of information we possess about the eigenvalues. There are three subcases.

(i) The eigenvalues must be determined numerically, and we have

accurate determinations of x.. for 1 ::; j ::; N. J (ii) The eigenvalues can be explicitly dete-rmined for all j. co (iii) The characteristic function of I A. Y., namely co 1 j=l JJ

[5], Bohman [3], Martynov [12]). In subcase (ii), computation of the inner product is necessary. A reason~b1e way to accomplish this is described in N-1 !,: [2] and consists of expJ'es~ing

00 log ¢(t) = -~ I 10g(1-2itA.). Formally, we have j=N J 00 00 00 00 k log ¢(t) = ~ I I (2itA .)k/k = ~ I c k(2it)k;lk where C k = I A.. j=N k=l J k=l N' N, j=N J The formal manipulations above are valid if the last power series converges.

Since IcNkl ~ [.Y IA.I]k = c~, say, this convergence will take place if J=N J ItI ~ 2~ • The numerical inversion methods above will require the N computation of ¢2 at a finite number M of ordinates t , so choose N so large . m m~x that CN < (2 Itml)-l and use the power series to compute ¢2' In special cases other methods may be used. If the eigenvalues A. are J all positive, the asymptotic results of Zolotarev [14] and Hoeffding [11] furnish good approximations for the tail probabilities. If the eigenvalues are all positive and of multiplicity unity, then Smirnov's formula may be used as an efficient numerical inversion method; see Martynov [12] for details. In subcase (i) two possible approaches suggest themselves. The use N 00 of the distribution of I A.Y. to approximate that of I A.Y. will not be j=l JJ j=l JJ satisfactory in general for reasonable values of N. For a discussion of this point and bounds on the truncation error see [2]. Rather, some method of approximating the tail of the series is required. If all the A. for J j > N are positive, we may approximate (see [6]) the distribution of

00 N 2 I A.Y. by that of I A.Y. + cY where Y is an x variate and c and v are j=l J J j=l JJ v chosen to make the mean and variance of the two distributions coincide.

Thus c and v are obtained from 8

N A~ 2 I: + C v = Var(\1'2(Xl ,X2)) = l';2 5=1 J

N I: A. + cv = E[\1'2 (Xl' Xl) ] 5=1 J yielding

N I;; - L A~ 2 ·=l J c = -----"--J ---'----'~-- ~ A. ) (E[\1'2(Xl ,Xl )] - . J j=1

v = ~[E(~2(Xl,Xl)) - i~l Ail .

The c.f. of the approximating r.v. is

N = II (1-2iA.t)-~ (1_2ict)-V/2 (10) 5=1 J and a numerical inversion of (10) furnishes the desired approximation. If all but a finite number of the A. are not of the same sign, one may N-l J compute the density of I: A.Y. by e.g. the integral equation method of 5=1 J J Grenander et al. [grand approximate the distribution function of the

00 remainder I: A.Y. by means of a Cornish-Fisher expansion. The cumulants of j=N JJ the remainder are easily seen to be

= 2r - l (r_l)lC Kr . N,r

N-l N-l 2 so K = E[\1'2(X ,X )] - .1: A.,K = 62 - .1: A. and for r ~ 3 the K may be 1 1 1 J=1 J 2 . J=1 J r approximated reasonably well by computing a few more eigenvalues. The limit df may then be obtained by a numerical convolution. 9

If all but a finite number of eigenvalues are positive (or negative) then the kernel ~2(xl,x2) can be expressed as the sum of a degenerate and a positive definite (negative definite) kernel and hence its eigenvalues will be summable. Thus in the nonsummable case, an infinite number of both positive and negative eigenvalues will be encountered. If only a finite number are known, we may employ the Cornish-Fisher method suggested above. 00 -itA.· 1 If all are known, we may compute <1>2 (t) = IT e J (1-2iA. t) -'2 by 00 j=N J log 2(t) = -~ I C k(2it)k/k and proceed as before. k=2 N'

§5. Some remarks on incomplete U-statistics. The calculation of the U-statistic (2) requires the averaging of (~) terms, which may not be practical if m and n are not small. To reduce the volume of computation Blom [1] and Brown and Kildea [4] have proposed the use of incomplete U-statistics of the form

l U = N I (X i , ... ,x. ) (11) 1 1 m when the sum in (11) is taken over N specified or randomly selected m-subsets of the indices. The asymptotic distributions of nondegenerate incomplete U-statistics (~l > 0) are studied in the references above. For the degenerate case ~1=0'~2>0 some of their results remain true while others need modification. Denote by <1>. the r.v. (X. , ... ,X. ) where (il,...,i ) is the i-th 1 1 1 m 1 m subset of indices in the summation (11). As·in [1], let pc denote the 2 proportion of the N pairs (<1>.,<1>.) having c indices in common. p will be a 1 J c fixed constant or an r.v.depending on the method of subset selection. From

[1] we have 10

m Var U = I E(pc)sc c=2

and the assumption is made that n Var U + S where 8 is some nonnegative

constant. N01:e that the E(p~) do not depenq on ~ but only on the method of subset selection.

The asymptotic distribution of U depends on the ratio nINo If n/N + 00

as n,N both + 00; the quantities <1>. are asymptotically independent; and the 1 degeneracy of the complete U-statistic is irrelevant to the limit distribution of !n(u-e), which will be normal with mean zero and variance

s. On the other hand, if n/N + 0, the limit distribution of U will m coincide with that of the complete U-statistic under certain conditions.

For example, from [1] we have (denoting the complete U-statistic by Uo)

2 Var U- Var Uo = E(U-UO), so n(UO-e) and n(U-e) will have the same asymptotic distribution if n2(Var U- Var U ) converges to zero. Now n2Var U converges to 2(~}2 s2 O o so a sufficient condition for the coincidence of the asymptotic distributions is

c = 2 ,

c> 2 •

Alternatively, if the N subsets are chosen at random, with replacement from the (:} possible subsets, then ([1]) 11

2 so the asymptotic distributions coincide if n IN converges to zero. Thus 2 E the incomplete U-statistic based on n + randomly chosen subsets will have the same asymptotic distribution as that of the complete U-statistic based

on (:) subsets.

References

[1] Blom, G. (1976). Some properties of incomplete U-statistics. Biometrika, 63, pp. 573-580. [2] Blum, J.R., Kiefer, J., and Rosenblatt, M. (1961). Distribution-free tests of independence based on the sample distribution function. Ann. Math. Statist., 32, pp. 485-498. [3] Bohman, H. (1972). From the characteristic function to the distribution function via Fourier Analysis. BIT, 12, pp. 279-283. [4] Brown, B.M. and Kildea, D.G. (1978). Reduced U-statistics and the Hodges-Lehmann . Ann. Statist., 6, pp. 828-835. [5] Davies, R.B. (1973). Numerical inversion of the characteristic function. Biometrika, 60, pp. 415-417. [6] Durbin, J. and Knott, M. (1972). Components of Cramer - Von Mises statistics I. J.R. Statist. Soa., 34, pp. 290-307. [7] Filippova, A.A. (1962). Von Mises '-tlieorem on the asymptotic behavior of functionals of empirical distribution functions and its statistical applications. Theop. Ppob. AppZ., 7, pp. 24-57.

[8] Gregory, G.G. (1977). Large sample theory for U-statistics and tests of fit. Ann. Statist. if 5, pp. 110-115. [9] Grenander; U., Pollak, H.O., and Slepian, D. (1959). The distribution of quadratic forms in normal variates. J.S.I.A.M., 7, pp. 374-401.

[10] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist., 19, pp. 293-325.

[11] Hoeffding, W. (1964). On a theorem of V.M. Zolotarev. Theop. Ppob. AppZ., 9, pp. 89-91. [12] Martynov, G.V. (1975). Computation of distribution functions of quadratic forms in normally distributed random variables. Theop. .e Ppob. AppZ., 20, pp. 782-793 . 12

[13] Von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functi0nals. Ann. Math. Statist.,. 18, pp. 309-348.

[14] Zolotarev, V.M. (1961). Concerning a certain probability problem. Theor. Prob. AppZ., 6, pp. 201-204. UNCLASSIFIED SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered) READ INSTRUCTIONS REPORT DOCUMENTAriON PAGE BEFORE COMPLETING FORM l. REPORT NUMBER 2. GOV" ACCESSION NO. 3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle) 5. TYPE OF REPORT & PERIOD COVERED TECHNICAL On the Asymptotic Distributioh at U-Statistic 6. PERFORMING oqG. REPORT NUMBER Mimeo Series No. 1255

7. AUTHOR(s) 8. CONTRACT OR GRANT NUMBER(s) Alan J. Lee AFOSR-75-2796

9. PERFORMING ORGANIZATION NAME AND ADORESS 10. PROGRAM ELEMENT, PROJECT, TASK AREA 8c WORK UNIT NUMBERS

11. CONTROLLING OFFICE NAME AND ADDRESS 12. REPORT DATE Sentember 1979 Air Force Office of Scientific Research 13. NUMBER OF PAGES Bolling AFB, DC 20332 13 14. MON ITORIN G AGENCY NAME 8c ADDRESS(if different from Controlllnll Office) 15. SECURITY CLASS. (of this report) UNCLASSIFIED

15a. DECLASSI FICATIONI DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)

Approved for public release - distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different (rom Report)

18. SUPPL EMENT ARY NOTES

19. KEY WORDS (Contfnu.e on reverse side if necessary and Identlfy by block number)

Asymptotic distribution, stationary statistical functional, incomplete U-statistic

20. ABSTRACT (Continue on reverse side If necessary and identify by block number) The asymptotic distribution of a U-statistic is found in the case when the corresponding Von Mises functional is.stationary of order l. Practical methods for the tabulation of the limit distributions are discussed, and the results extended to certain incomplete U-statistics.

FORM. DO 1 JAN 73 1473 EDITION OF I NOV 65 IS OBSOLETE UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE (When D"t" Entered) SECURITY CLASSIFICA1'ION OF THIS PAGE(lf'hen D.t. Bntered)

SECURITY CLASSIFICATIOII OF TU" f'AGE(Wh"n Dat. Entur '1;