9 U-Statistics

Suppose X 1, X 2, . . . , X n are P i.i.d. with CDF F . Our goal is to estimate the expectation 2P t (P)=Eh(X 1, X 2, . . . , X m ). Note that this expectation requires more than one X in contrast to 2 EX , EX , or Eh(X ). One example is E X 2 X 1 or P((X 1, X 2) S). | | 2 For Eh(X 1), the empirical estimator is asymptotically optimal. Now, U-statistics general- ize the idea. The advantage of using U-statistics is unbiasedness.

Notation. Let t (P)= h(x 1, . . . , x m )d F (x 1) d F (x m ) where h is a known function; h is called kernel. Assum···e that, in the following···, without loss of generality, h is symmetric, i.e. h x , x h xR , x R. If h is not symmetric, we can replace it with h x , . . . , x ( 1 2)= ( 2 1) ⇤( 1 m )= 1 m ! h x 1 , . . . , x m where ⇡ : and is set of all possible permutations of ( ) ⇡ ⇧ ( ⇡( ) ⇡( )) N N ⇧ 1, . . . , m .2Note that h is also unbiased, 7! ⇤ { P} 1 i.i.d 1 Eh⇤ =(m !) Eh(X⇡(1), . . . , X⇡(m )) =(m !) Eh(X 1, . . . , X m )=t (P). ⇡ ⇧ ⇡ ⇧ X2 X2 Deﬁnition 9.1. Suppose h : Rm R is symmetric in its arguments. The U-statistic for esti- 7! mating t (P)=Eh(X 1, . . . , X m ) is a symmetric average 1 u = u (X 1, . . . , X m )= n h(X i 1 , . . . , X i m ). m 1 i 1

(a) Suppose t (P)=EX 1 = x 1d F (x 1). Then, h(x 1)=x 1 and 1 1 uRX , . . . , X h X X X¯ . ( 1 m )= n ( i 1 )= n i = 1 i 1 i X X 2 2 2 ¯ 2 (b) Suppose t (P)=(EX 1) = x 1d F (x 1) . Then u = X from (a) is biased since

ÄR n ä n E X¯ 2 n 2 EX X E X 2 EX 2. ( )= 2 i j + ( i ) 3 =( 1) i 1 j i i 1 6 = = 2 = 2 EX 1 EX EX 2 6XX6 =( ) X 1 =( 1) 7 6 6 7 4 5 |–3{z5– } | {z } 9 U-STATISTICS

Now, write t (P)= x 1x 2d F (x 1)d F (x 2) and h(x 1, x 2)=x 1x 2. Then

RR 2 u u X , . . . , X X X . = ( 1 n )= n n 1 i j ( ) i

(c) Suppose t (P)=P(X 1 t 0)=F (t 0)= h(x 1)d F (x 1). Then h(x 1)=1( ,t 0](x 1) and  1 R1 u = n 1 X i t 0 = Fˆ(t 0) i {  } X which is just the empirical CDF. x 2 x 2 2x x (d) Suppose t P VarX 1 + 2 1 2 d F x d F x . Then h x , x x 2 x 2 2x x 2 ( )= 1 = 2 ( 1) ( 2) ( 1 2)=( 1 + 2 1 2)/ = 2 (x 1 x 2) /2. Note that RR 2 2 i.i.d. Eh(x 1, x 2)= Var(X 1)+[E(X 1)] + Var(X 2)+(EX 2) 2E(X 1X 2) /2 = Var(X 1). { } Hence, 2 u h X , X = n n 1 ( i j ) ( ) i

–36– 9 U-STATISTICS

Remark 9.3 (Preliminary Remark). Write X X , . . . , X as the order statistic. U-statistic (n) =( (1) (n)) can be regarded as conditional expectation given X . For m 1, (n) =

u n 1 h X n 1 h X E h X X . = ( i )= ( (i ))= Fˆ[ ( 1) (n)] i i | X X For m=2, 1 1 u h X , X h X , X E h X , X X . = n ( i j )= n ( (i ) (j ))= Fˆ[ ( 1 2) (n)] 2 i

Theorem 9.4. Let S = S(X 1, . . . , X n ) be an unbiased estimator of t (P) with corresponding U- statistic u . Then, u is unbiased as well and Varu VarS with the equality holding if P(u = S)=1. 

Proof. u is unbiased: E u E E S X t P . ( )= [ Fˆ( (n))] = ( ) | E(S) Since both u and S are unbiased we show Eu 2 ES2, | {z }  Jensen inequ. Eu 2 E E2 S X E E S2 X E S2 = [ Fˆ( (n))] [ Fˆ( (n))] = ( ) |  | with " " if the distribution of E S X is degenerate with E S X S almost surely. = Fˆ( (n)) Fˆ( (n))= | |

Note: This also follows from the Rao-Blackwell Theorem: Taking conditional expectation of an unbiased statistic conditional on a sufﬁcient statistic (eg. X here) will give us an (n) estimator which is as least as good in the sense of lower risk/variance. Remark 9.5 (The Variance for m 2 (heuristics)).  m = 1. 1 n 1 Varu Var h X Varh X O 1/n . = n ( i ) = n ( 1)= ( ) i =1 ! X –37– 9 U-STATISTICS

m = 2. Since 1 2 u h X , X h X , X = n ( i j )= n n 2 ( i j ) 2 i

2 Varu Var h X , X = n n 1 ( i j ) 0 ( ) i j >i 1 XX B2 2 C @ CovAh X , X , h X , X . = n n 1 [ ( i j ) ( k l )] ( ) i j >i k l >k ✓ ◆ XXXX =0 if i ,j ,k ,l different

3 The second largest term (three sums): e.g. i = k |but k , j , l a{zre different}( u ) ' 1 2 n 3 1 Varu n 3 same order as m 1. 4 = = ⇠ n(n 1) ⇠ n n ✓ ◆ Thus, we expect in general, Varu = O(1/n).

2 Notation. h i (x 1, . . . , x i )=E[h(X 1, . . . , X m ) X 1 = x 1, . . . , X i = x i ] and i = Varh i (X 1, . . . , X i ) | Lemma 9.6.

(a) Eh i (X 1, . . . , X i )=t (P)(= Eh(X 1, . . . , X m )) for all 1 i m . (b) Cov h X , . . . , X , X , . . . , X , h X , . . . , X , X ,. . . , X 2 [ ( 1 i i +1 m ) ( 1 i i0+1 m0 )] = i

Proof. We only consider m = 2 here. (a)

Eh 2(X 1, X 2)=E E[h(X 1, X 2) X 1, X 2] = E[h(X 1, X 2)] = t (P) { | } Eh 1(X 1, X 2)=E E[h(X 1, X 2) X 1] = E[h(X 1, X 2)] = t (P) { | } (b) i = 2. 2 Cov[h(X 1, X 2), h(X 1, X 2)] = Var[h(X 1, X 2)] = Varh 2(X 1, X 2)=2 .

The second equality is because h 2(X 1, X 2)=E[h(X 1, X 2) X 1, X 2]=h(X 1, X 2). i = 1. |

Cov[h(X 1, X 2), h(X 1, X 20 )] = E[h(X 1, X 2)h(X 1, X 20 )] E[h(X 1, X 2)]E[h(X 1, X 20 )] . 2 2 = E[h(X 1,X 2)] =[t (P)] { } –38– | {z } 9 U-STATISTICS

Note that ﬁrst term on the right hand side can be computed by

E[h(X 1, X 2)h(X 1, X 20 )] = E E [h(X 1, X 2)h(X 1, X 20 ) X 1] { | } = E E [h(X 1, X 2) X 1]E [h(X 1, X 20 ) X 1] { 2 | | } = Eh 1(X 1).

Theorem 9.7 (Hoeffding).

(a) The variance of the U-statistic is

1 m m n m Varu 2. = n i m i i m i =1 X✓ ◆✓ ◆ 2 Note that one can compute i from Lemma 9.6. 2 2 (b) If i > 0 and i < for i = 1, 2, . . . , m , then 1 2 2 Var(pnu ) m i . !

Proof. (a) For general proof, see Lee “U-statistics” (1990). We only prove the case of m = 2. We want to show that 1 2 n 2 2 n 2 1 Varu 2 2 2 n 2 2 2 = n 1 2 1 1 + 2 2 2 2 = n [ ( ) 1 )+ 2 ] 2 2 ✓✓ ◆✓ ◆ ✓ ◆✓ ◆ ◆ with 2 Cov h X , X , h X , X and 2 Cov h X , X , h X , X Var h X , X . 1 = [ ( 1 2) ( 1 20 )] 2 = [ ( 1 2) ( 1 2)] = [ ( 1 2)] From Remark 9.5 we have

2 1 Varu = n Cov[h(X i , X j ), h(X k , X l )] Ç 2 å i j >i k l >k XXXX 1 Cov h X , X , h X , X [ ( i j ) ( k l )] = n n . 2 i j >i k l >k 2 XXXX 2 n 2 2 2 = ( )1 +2 Case 1 i , j , k , l are all differe|nt. Then Cov = 0.{z } Case 2 i = k and j = l :

2 Cov[h(X i , X j ), h(X k , X l )] = Cov[h(X i , X j ), h(X i , X j )] = Var[h(X i , X j )] = 2 .

Note that the number of ways to choose i , j out of 1, 2, . . . , n is n . This gives us 2 n 2 n 2. { } 2 2 / 2 = 2

–39– 9 U-STATISTICS

Case 3 (i = k and j = l ) or (i = k and j = l ). 6 6 2 Cov[h(X i , X j ), h(X k , X l )] = 1 . Note that the number of ways to choose i , j , k , l is 1 n n n 1 n 2 2 2 n 2 . ( ) 2 = 2 ( ) i · · · · or j =i l ✓ ◆ 6 This gives |{z} | {z } |{z} |{z} n 2 n 2 2 2 ( )1 2 n = 2(n 2)1 . 2 (b) Consider the variance formula from (a) k n m (n m )(n m 1) (n m k + 1) n = ··· k k ! ⇠ k ! ✓ ◆ is large if k is large. This implies that n m is large if m i is large, i.e. i 1. Thus the m i = terms of the sum are dominated by the i= 1 term, i.e., by m n m 1 n m 1 1 2 2 m 2 m 2 2 . 1 m 1 1 n m 1 ! 1 n m m ! = n m ⇠ ( ) / ✓ ◆✓ ◆ Hence, Var nu m 2 2. (p ) 1 !

Theorem 9.8. 2 2 pn(u t (P)) D N (0, m 1 ). ! Proof. For general proof, see p. 178 of Serﬂing. m m n 1 u g X , . . . , X o n 1/2 . n = c c c ( i 1 i c )+ p ( ) c=1 1 i 1<

For ( ), ⇤ 1 Cov u , h X Cov h X , X , h X 22. ( 1( i )) = n n 1 [ ( l k ) 1( i )] = 1 ( ) i k l k = (†) 2 X XXX6 =1 , if k =i or l =i The number of non-zero terms is 2n(n 1). | {z } For (†),

2 (9.6) 1 = Cov[h(X 1, X 2), h(X 1, X 20 )] 2 = E[h(X 1, X 2)h(X 1, X 20 )] [t (P)] 2 = E E[h(X 1, X 2)h(X 1, X 20 ) X 1, X 2] [t (P)] { | } 2 = E h(X 1, X 2)E[h(X 1, X 20 ) X 1, X 2] [t (P)] { | } =E(h(X 1,X 2) X 1)=h 1(X 1) | Cov h X , X , h X . = [ ( 1 |2) 1( 1{z)] }

Example 9.9. Suppose t (P)=P(X 1 + X 2 > 0), m = 2, h(X 1, X 2)=1(X 1 + X 2 > 0). Then 1 u 1 X X > 0 = n n 1 ( i + j ) ( ) i )) D ( 1 ) 1 ! 2 (9.6) 1 = Cov[h(X 1, X 2), h(X 1, X 20 )] 2 = E[h(X 1, X 2)h(X 1, X 20 )] [P(X 1 + X 2 > 0)] 2 = E[1(X 1 + X 2 > 0)1(X 1 + X 20 > 0)] [P(X 1 + X 2 > 0)] 2 = E[1(X 1 + X 2 > 0, X 1 + X 20 > 0)] [P(X 1 + X 2 > 0)] . To obtain an explicit form we need some assumptions. Suppose, for example, F is symmetric around zero, i.e. P(X 1 < a )=P(X 1 > a )=P( X 1 < a ) which implies X 1 and X 1 have the same distribution. Moreover, suppose F is continuous. Then P(X 1 + X 2 > 0)=P( X 1 X 2 > 0)=P(X 1 + X 2 < 0) with P(X 1 + X 2 = 0 = 0), and therefore 1 1 = P(X 1 + X 2 < 0)+P(X 1 + X 2 = 0)+P(X 1 + X 2 > 0)=2P(X 1 + X 2 > 0) P(X 1 + X 2 > 0)= . ) 2

On the other hand, since P(X 1 + X 2 > 0)=P(X 1 > X 2)=P(X 1 > X 2),

P(X 1+X 2 > 0, X 1+X 20 > 0)=P(X 1 = max X 1, X 2, X 20 )=P(X i = max X 1, X 2, X 3 , i = 1, 2, 3)=1/3. { } { } 2 2 In sum, 1 = 1/3 (1/2) = 1/12.

–41– 9 U-STATISTICS

Remark 9.10 (Generalization: Two-Sample Problems). Suppose X 1, . . . , X n are i.i.d. with cdf F , Y1, . . . , Yn are i.i.d. with cdf G , and F and G are unknown, X i and Yi are independent. Let

h(X 1, . . . , X mX , Y1, . . . , YmY ) symmetric symmetric be a function of mX +m Y arguments,|with{zmX } |n X a{znd m}Y n Y . We want to estimate t (P)=   Eh(X 1, . . . , X mX , Y1, . . . , YmY ). For example, t (P)=P(X 1 < Y1)=Eh(X 1, Y1) where h(X , Y )=1(X <

Y ); or t (P)=E Y X . Note also that h(X 1, . . . , X mX , Y1, . . . , YmY ) is trivally unbiased for t (P)= Eh. Thus h X |, . .. , X | , Y , . . . , Y with 1 i . . . i n and 1 j . . . j n is ( i 1 i mX j 1 j mY ) 1 mX X 1 mY Y unbiased as well. The number of combinations is n X n Y . Our unbiased estiamtor is mX mY

1 u h X , . . . , X , Y , . . . , Y = n X n Y i 1 i mX j 1 j mY mX mY ··· X X Ä ä Example (mX = m Y = 2). 1 u h X , X , Y , Y = n X n Y ( i k j l ) 2 2 i X 2 X 1 ). No|te that this U{z-statistic can}be used to test Y is more dispersed than| X . | | |

Remark 9.11 (Properties of the Two-Sample U-statistic).

(a) Formula for Var u , see Lee or Serﬂing. (b) Asymptotic variance and normality. Let n = n X + n Y and n X /n c where c (0, 1). ! 2 Suppose Var[h(X 1, X 2, . . . , X mX , Y1, Y2, . . . , YmY )] > 0. Then

2 2 2 mX 2 m Y 2 Var(pnu ) = 10 + 01 ! c c and 2 pn(u t (P)) D N (0, ) ! where

2 Cov h X 1, X 2, . . . , X m , Y1, Y2, . . . , Ym , h X 1, X , . . . , X , Y , Y , . . . , Y 10 = [ ( X Y ) ( 20 m0 X 10 20 m0 Y ) 2 Cov h X 1, X 2, . . . , X m , Y1, Y2, . . . , Ym , h X , X , . . . , X , Y1, Y , . . . , Y 01 = [ ( X Y ) ( 10 20 m0 X 20 m0 Y )]

–42– 9 U-STATISTICS

Example 9.12. Suppose t (P)=P(X < Y )=E[1(X < Y )] (thus mX = m Y = 1). Then 1 u 1 X < Y = n n ( i j ) X Y i j XX and

2 2 10 = Cov[1(X 1 < Y1), 1(X 1 < Y10)] = E[(1(X 1 < Y1)1(X 1 < Y10)] E[1(X 1 < Y1)] 2 { } = P(X 1 < Y1, X 1 < Y10) [P(X 1 < Y1)] 2 2 01 = Cov[1(X 1 < Y1), 1(X 10 < Y1)] = E[(1(X 1 < Y1)1(X i0 < Yj )] E[1(X 1 < Y1)] 2 { } = P(X 1 < Y1, X 10 < Y1) [P(X 1 < Y1)] . Now suppose F = G is continuous. Then P(X 1 < Y1)=1 P(X 1 Y1)=1 P(X 1 > Y1)= 1 P Y X and hence P X Y 1 2. Furthermore, P X Y , X Y P X ( 1 > 1) ( 1 < 1)= / ( 1 < 1 1 < 10)= ( 1 = min X , Y , Y 1 3 and P X Y , X Y P Y max X , X , Y 1 3. Then 2 1 1 10 )= / ( 1 < 1 10 < 1)= ( 1 = 1 10 1 )= / 10 = 2 { } { } 01 = 1/12 and the asymptotic variance is

2 2 2 mX 2 m Y 2 1 = 10 + 01 = . c c 12c(1 c) In sum, 1 pn(u P(X < Y )) D N 0, ! 12c(1 c) ✓ ◆ Note: n X n Y u = i j 1(X i < Yj ) is the Mann-Whitney test statistic with H0 : F = G (and equivalent to a Wilcoxon rank sum statistic). P P

Deﬁnition 9.13. Consider a symmetric function h : Rm R with m n . The V-statistic for 7!  estimating t (P)=Eh(X 1, . . . , X m ) is

1 n n V V X , . . . , X h X , . . . , X = ( 1 m )= n m ( i 1 i m ) i 1=1 ···i m =1 X X

Remark 9.14 (Comparing U- and V-Statistics).

m 1. u n 1 h X v = = ( i )= m = 2. First note that P 2 1 u h X , X h X , X . = n n 1 ( i j )= n n 1 ( i j ) ( ) i

–43– 9 U-STATISTICS

On the other hand,

1 1 v h X , X h X , X h X , X = n 2 ( i j )= n 2 ( i j )+ ( i i ) i j i 2 j =i 3 XX X X6 1 1 6 7 h X , X h4 X , X 5 = n 2 ( i j )+ n 2 ( i i ) i j =i i XX6 X n n 1 1 ( ) u h X , X . = n2 + n 2 ( i i ) i 1 X P ! 0 ! | {z } Moreover, Eh(X 1, X 2)=t (P). | {z } n 1 1 Ev Eu Eh X , X = n + n ( 1 1) 1 1 = t (P) t (P)+ Eh(X 1, X 1) n n 1 = t (P)+ Eh(X 1, X 1) t (P) n 2 3 =constant 6 7 4 =bias 0 5 | {z! }

2 | {z } 2 2 Theorem 9.15. Let m = 2, i = Varh i (X 1, . . . , X i ) (see 9.6), and suppose 0 < 1 < , 2 < . Then U- and V-statistics have the same asymptotic distribution, 1 1

2 pn(V t (P)) D N (0, 41 ). ! Proof. From Remark 9.14,

n 1 1 n 1 1 n V t P n u h X , X t P + p ( ( )) = p n + n 2 ( i i ) ( ) n i ! n 1 X 1 1 n u t P h X , X t P = p ( ( )) + 2 ( i i ) ( ) n n n n ✓1 1 X ◆ n u t P n h X , X t P = p ( ( )) + 2 p ( ( i i ) ( )) n n n 1 1 1 hX i 2 = pn(u t (P))+ (h(X i , X i ) t (P)) D N (0, 41 ). n pn n ! 2 1 D N (0,4 ) p 1 XE h X ,X t P constant ! ! [ ( i i ) ( )]= ! | {z } | {z } | {z }

Conclusion: U- and V-statistics are asymptotically equivalent. The V-statistic is a more intu- itive estimator, the U-statistic is more convenient for proofs (and unbiased).

–44–