Theoretical Statistics. Lecture 5. Peter Bartlett

1. U-statistics.

1 Outline of today’s lecture

We’ll look at U-statistics, a family of estimators that includes many interesting examples. We’ll study their properties: unbiasedness, lower variance, concentration (via an application of the bounded differences inequality), asymptotic variance, asymptotic distribution. (See Chapter 12 of van der Vaart.) First, we’ll consider the standard unbiased estimate of variance, a special case of a U-statistic.

2 Variance estimates

$$
\begin{aligned}
s_n^2 &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 \\
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \left[ (X_i - \bar X_n)^2 + (X_j - \bar X_n)^2 \right] \\
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \left( (X_i - \bar X_n) - (X_j - \bar X_n) \right)^2 \\
&= \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{2} (X_i - X_j)^2 \\
&= \frac{1}{\binom{n}{2}} \sum_{i<j} \frac{1}{2} (X_i - X_j)^2
\end{aligned}
$$
(the cross terms vanish in the third equality because $\sum_i (X_i - \bar X_n) = 0$).
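As a quick sanity check, here is a minimal numerical sketch (not from the lecture; variable names are illustrative) confirming that the unbiased sample variance coincides with the pairwise average of $\frac{1}{2}(X_i - X_j)^2$:

```python
# Numerical check: s_n^2 equals the average of (1/2)(X_i - X_j)^2
# over all pairs i < j.
import itertools
import random

random.seed(0)
n = 20
X = [random.gauss(0.0, 1.0) for _ in range(n)]

xbar = sum(X) / n
s2 = sum((x - xbar) ** 2 for x in X) / (n - 1)

pairs = list(itertools.combinations(range(n), 2))
u = sum(0.5 * (X[i] - X[j]) ** 2 for i, j in pairs) / len(pairs)

print(s2, u)  # the two values agree up to floating-point error
```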

3 Variance estimates

This is an unbiased estimator of the variance for i.i.d. data:
$$
\begin{aligned}
E s_n^2 &= E\, \frac{1}{2}(X_1 - X_2)^2 \\
&= \frac{1}{2} E \left( (X_1 - E X_1) - (X_2 - E X_2) \right)^2 \\
&= \frac{1}{2} E \left[ (X_1 - E X_1)^2 + (X_2 - E X_2)^2 \right] \\
&= E (X_1 - E X_1)^2
\end{aligned}
$$
(the cross term vanishes by independence).

4 U-statistics

Definition: A U-statistic of order $r$ with kernel $h$ is
$$
U = \frac{1}{\binom{n}{r}} \sum_{\{i_1, \ldots, i_r\} \subseteq [n]} h(X_{i_1}, \ldots, X_{i_r}),
$$
where $h$ is symmetric in its arguments.

[If h is not symmetric in its arguments, we can also average over permutations.] “U” for “unbiased.” Introduced by Wassily Hoeffding in the 1940s.
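A brute-force implementation of this definition might look as follows (a sketch; the helper name u_statistic is ours, and the enumeration over all $\binom{n}{r}$ subsets makes it practical only for small $n$ and $r$):

```python
# Brute-force U-statistic: average a symmetric kernel h over all
# size-r subsets of the sample. O(n^r) -- for illustration only.
import itertools
from math import comb

def u_statistic(X, h, r):
    """Average of h(X_{i_1}, ..., X_{i_r}) over all size-r subsets."""
    total = sum(h(*(X[i] for i in idx))
                for idx in itertools.combinations(range(len(X)), r))
    return total / comb(len(X), r)

# Example: the sample variance via the kernel h(x, y) = (x - y)^2 / 2.
X = [1.0, 4.0, 2.0, 8.0]
print(u_statistic(X, lambda x, y: 0.5 * (x - y) ** 2, r=2))
```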

5 U-statistics

Theorem: [Halmos] A statistical functional $\theta$ (i.e., a function defined on a family of distributions) admits an unbiased estimator (i.e., for all sufficiently large $n$, some function of the $n$ i.i.d. observations has expectation $\theta$) iff for some $k$ there is a kernel $h$ such that

$$
\theta = E h(X_1, \ldots, X_k).
$$

Necessity is trivial. Sufficiency uses the estimator

$$
\hat\theta(X_1, \ldots, X_n) = h(X_1, \ldots, X_k).
$$

U-statistics make better use of the sample than this, since they are a symmetric function of the data.

6 U-statistics: Examples

• $s_n^2$ is a U-statistic of order 2 with kernel $h(x, y) = \frac{1}{2}(x - y)^2$.

• $\bar X_n$ is a U-statistic of order 1 with kernel $h(x) = x$.

• The U-statistic with kernel $h(x, y) = |x - y|$ estimates the pairwise deviation or Gini mean difference; see the sketch after this list. [The Gini coefficient, $G = E|X - Y| / (2 E X)$, is commonly used as a measure of income inequality.]

• The third k-statistic,
$$
k_3 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n (X_i - \bar X_n)^3,
$$
is a U-statistic that estimates the 3rd cumulant.
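As promised above, a self-contained sketch (our own illustration) of the Gini mean difference estimator; for Exp(1) data the true values are $E|X - Y| = 1$ and $G = 1/2$:

```python
# Estimate the Gini mean difference E|X - Y| and the Gini coefficient
# G = E|X - Y| / (2 EX) on simulated Exp(1) data.
import itertools
import random

random.seed(1)
X = [random.expovariate(1.0) for _ in range(200)]

pairs = list(itertools.combinations(X, 2))
gini_mean_diff = sum(abs(x - y) for x, y in pairs) / len(pairs)
gini_coeff = gini_mean_diff / (2 * sum(X) / len(X))
print(gini_mean_diff, gini_coeff)  # roughly 1 and 1/2 for Exp(1)
```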

7 U-statistics: Examples

• The U-statistic with kernel $h(x, y) = \frac{1}{2}(x - y)(x - y)^T$ estimates the variance-covariance matrix.

• Kendall’s $\tau$: For a random pair $P_1 = (X_1, Y_1)$, $P_2 = (X_2, Y_2)$ of points in the plane,
$$
\begin{aligned}
\tau &= \Pr(P_1 P_2 \text{ has positive slope}) - \Pr(P_1 P_2 \text{ has negative slope}) \\
&= E \left( 1[P_1 P_2 \text{ has positive slope}] - 1[P_1 P_2 \text{ has negative slope}] \right),
\end{aligned}
$$

where $P_1 P_2$ is the line from $P_1$ to $P_2$. It is a measure of correlation: $\tau \in [-1, 1]$; $\tau = 0$ for independent $X, Y$; $\tau = \pm 1$ for $Y = f(X)$ with $f$ monotone. Clearly, $\tau$ can be estimated using a U-statistic of order 2.
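A sketch of that order-2 U-statistic (our own illustration; the kernel is the sign of the slope through the two points, which is symmetric in its arguments):

```python
# Kendall's tau as an order-2 U-statistic: the kernel is the sign of
# the slope of the line through two points.
import itertools
import random

def kendall_tau_u(points):
    def h(p, q):  # symmetric: swapping p and q flips both factors
        s = (q[0] - p[0]) * (q[1] - p[1])
        return (s > 0) - (s < 0)  # +1 positive slope, -1 negative
    pairs = list(itertools.combinations(points, 2))
    return sum(h(p, q) for p, q in pairs) / len(pairs)

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100)]
print(kendall_tau_u(list(zip(xs, xs))))  # Y = X (monotone): tau = 1
print(kendall_tau_u([(x, random.gauss(0, 1)) for x in xs]))  # near 0
```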

8 U-statistics: Examples

The Wilcoxon one-sample rank statistic:
$$
T^+ = \sum_{i=1}^n R_i\, 1[X_i > 0],
$$
where $R_i$ is the rank of $|X_i|$ (its position when $|X_1|, \ldots, |X_n|$ are arranged in ascending order). It’s used to test whether the distribution is symmetric about zero.

Assuming the $|X_i|$ are all distinct, we can write

$$
R_i = \sum_{j=1}^n 1[|X_j| \le |X_i|].
$$

9 U-statistics: Examples

Hence
$$
\begin{aligned}
T^+ &= \sum_{i=1}^n \sum_{j=1}^n 1[|X_j| \le X_i] \\
&= \sum_{i<j} \left( 1[|X_j| \le X_i] + 1[|X_i| \le X_j] \right) + \sum_{i=1}^n 1[X_i > 0] \\
&= \sum_{i<j} 1[X_i + X_j > 0] + \sum_{i=1}^n 1[X_i > 0] \\
&= \binom{n}{2} \cdot \frac{1}{\binom{n}{2}} \sum_{i<j} 1[X_i + X_j > 0] + n \cdot \frac{1}{n} \sum_{i=1}^n 1[X_i > 0],
\end{aligned}
$$
where the third equality holds because, for distinct $|X_i|$, exactly one of $|X_j| \le X_i$ and $|X_i| \le X_j$ occurs iff $X_i + X_j > 0$.

10 U-statistics: Examples

So
$$
T^+ = \binom{n}{2} U_2 + n U_1,
$$
where
$$
U_2 = \frac{1}{\binom{n}{2}} \sum_{i<j} h_2(X_i, X_j), \qquad h_2(X_i, X_j) = 1[X_i + X_j > 0],
$$
$$
U_1 = \frac{1}{n} \sum_{i=1}^n h_1(X_i), \qquad h_1(X_i) = 1[X_i > 0].
$$

So it’s a sum of U-statistics. [Why is it not a U-statistic?]
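A numerical sketch of this decomposition (illustrative names), checking that $T^+$ computed from ranks matches $\binom{n}{2} U_2 + n U_1$:

```python
# Check: T+ (sum of ranks of |X_i| over positive X_i) equals
# C(n,2) * U_2 + n * U_1.
import itertools
import random
from math import comb

random.seed(3)
n = 15
X = [random.gauss(0.5, 1.0) for _ in range(n)]

order = sorted(range(n), key=lambda i: abs(X[i]))
rank = {i: pos + 1 for pos, i in enumerate(order)}  # ranks of |X_i|
t_plus = sum(rank[i] for i in range(n) if X[i] > 0)

u2 = sum(X[i] + X[j] > 0
         for i, j in itertools.combinations(range(n), 2)) / comb(n, 2)
u1 = sum(x > 0 for x in X) / n
print(t_plus, comb(n, 2) * u2 + n * u1)  # equal up to float error
```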

11 Properties of U-statistics

• “U” for “unbiased”: $U$ is an unbiased estimator of $E h(X_1, \ldots, X_r)$:

$$
E U = E h(X_1, \ldots, X_r).
$$

• $U$ is a lower variance estimate than $h(X_1, \ldots, X_r)$, because $U$ is an average over permutations. Indeed, since $U$ is an average over permutations $\pi$ of $h(X_{\pi(1)}, \ldots, X_{\pi(r)})$, we can write
$$
U(X_1, \ldots, X_n) = E \left[ h(X_1, \ldots, X_r) \mid X_{(1)}, \ldots, X_{(n)} \right],
$$
where $(X_{(1)}, \ldots, X_{(n)})$ is the data in some sorted order. Thus, for $E U = \theta$, we can write the variance as:

12 Properties of U-statistics

$$
\begin{aligned}
E (U - \theta)^2 &= E \left( E \left[ h(X_1, \ldots, X_r) - \theta \mid X_{(1)}, \ldots, X_{(n)} \right] \right)^2 \\
&\le E\, E \left[ \left( h(X_1, \ldots, X_r) - \theta \right)^2 \mid X_{(1)}, \ldots, X_{(n)} \right] \\
&= E \left( h(X_1, \ldots, X_r) - \theta \right)^2,
\end{aligned}
$$
by Jensen’s inequality (for a convex function $\phi$, we have $\phi(E X) \le E \phi(X)$). This is the Rao-Blackwell theorem: the mean squared error of the estimator $h(X_1, \ldots, X_r)$ is reduced by replacing it with its conditional expectation, given the sufficient statistic $(X_{(1)}, \ldots, X_{(n)})$.
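A simulation sketch of this variance reduction (our own illustration): compare the naive unbiased estimator $h(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$ with the full U-statistic $s_n^2$ over many replications.

```python
# Variance reduction: the single-pair estimator h(X_1, X_2) versus the
# full U-statistic s_n^2, both unbiased for the variance (here 1).
import random
import statistics

random.seed(4)
n, reps = 10, 5000
h_vals, u_vals = [], []
for _ in range(reps):
    X = [random.gauss(0, 1) for _ in range(n)]
    h_vals.append(0.5 * (X[0] - X[1]) ** 2)  # naive unbiased estimator
    u_vals.append(statistics.variance(X))    # the U-statistic s_n^2
print(statistics.variance(h_vals), statistics.variance(u_vals))
# The U-statistic has much smaller variance; both have mean close to 1.
```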

13 Recall: Bounded Differences Inequality

Theorem: Suppose $f : \mathcal{X}^n \to \mathbb{R}$ satisfies the following bounded differences inequality: for all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,
$$
\left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \le B_i.
$$
Then
$$
P \left( |f(X) - E f(X)| \ge t \right) \le 2 \exp \left( -\frac{2 t^2}{\sum_i B_i^2} \right).
$$

14 Bounded Differences Inequality

Consider a U-statistic of order 2:
$$
U = \frac{1}{\binom{n}{2}} \sum_{i<j} h(X_i, X_j).
$$

Theorem: If $|h(X_1, X_2)| \le B$ a.s., then
$$
P \left( |U - E U| \ge t \right) \le 2 \exp \left( -\frac{n t^2}{8 B^2} \right).
$$

15 Bounded Differences Inequality

Proof: For $X, X'$ differing in a single coordinate, at most $n - 1$ of the $\binom{n}{2}$ terms of $U$ change, each by at most $2B$, so
$$
|U - U'| \le \frac{1}{\binom{n}{2}} \sum_{i<j} \left| h(X_i, X_j) - h(X_i', X_j') \right| \le \frac{(n-1) \cdot 2B}{\binom{n}{2}} = \frac{4B}{n}.
$$
Applying the bounded differences inequality with $B_i = 4B/n$ gives $\sum_i B_i^2 = 16 B^2 / n$, and hence the bound $2 \exp(-n t^2 / (8 B^2))$.
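A Monte Carlo sketch of the tail bound (our own illustration), using the bounded kernel $h(x, y) = 1[x + y > 0]$, so $B = 1$ and, for data symmetric about zero, $E U = 1/2$:

```python
# Compare the empirical deviation probability of U with the bound
# 2 exp(-n t^2 / (8 B^2)), here with B = 1.
import itertools
import math
import random

random.seed(5)
n, reps, t = 100, 1000, 0.3
pairs = list(itertools.combinations(range(n), 2))

def U(X):
    return sum(X[i] + X[j] > 0 for i, j in pairs) / len(pairs)

hits = sum(abs(U([random.gauss(0, 1) for _ in range(n)]) - 0.5) >= t
           for _ in range(reps))
print(hits / reps, 2 * math.exp(-n * t ** 2 / 8))
# The empirical probability is far below the (conservative) bound.
```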

16 Variance of U-statistics

Now we’ll compute the asymptotic variance of a U-statistic. Recall the definition:
$$
U = \frac{1}{\binom{n}{r}} \sum_{\{i_1, \ldots, i_r\} \subseteq [n]} h(X_{i_1}, \ldots, X_{i_r}).
$$
So [letting $S, S'$ range over subsets of $\{1, \ldots, n\}$ of size $r$]:
$$
\begin{aligned}
\mathrm{Var}(U) &= \frac{1}{\binom{n}{r}^2} \sum_S \sum_{S'} \mathrm{Cov}(h(X_S), h(X_{S'})) \\
&= \frac{1}{\binom{n}{r}} \sum_{c=0}^r \binom{r}{c} \binom{n-r}{r-c} \zeta_c,
\end{aligned}
$$
where $\binom{n}{r} \binom{r}{c} \binom{n-r}{r-c}$ is the number of ways of choosing $S$ and $S'$ with an intersection of size $c$ (first choose $S$, then choose the intersection from $S$, then choose the rest of $S'$ from the $n - r$ elements outside $S$).
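A brute-force sketch verifying this count for small $n$ and $r$ (enumeration over all pairs of size-$r$ subsets; names are illustrative):

```python
# Count pairs (S, S') of size-r subsets of [n] with intersection size c
# and compare with C(n,r) * C(r,c) * C(n-r, r-c).
import itertools
from math import comb

n, r = 7, 3
subsets = list(itertools.combinations(range(n), r))
for c in range(r + 1):
    count = sum(len(set(S) & set(Sp)) == c
                for S in subsets for Sp in subsets)
    print(c, count, comb(n, r) * comb(r, c) * comb(n - r, r - c))
```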

17 Variance of U-statistics

Also, $\zeta_c = \mathrm{Cov}(h(X_S), h(X_{S'}))$ depends only on $c = |S \cap S'|$. To see this, suppose that $S \cap S' = I$ with $|I| = c$, and write $X_a^b = (X_a, \ldots, X_b)$. Then
$$
\begin{aligned}
\zeta_c &= \mathrm{Cov}(h(X_S), h(X_{S'})) \\
&= \mathrm{Cov}\left( h(X_I, X_{S - I}),\; h(X_I, X_{S' - I}) \right) \\
&= \mathrm{Cov}\left( h(X_1^c, X_{c+1}^r),\; h(X_1^c, X_{r+1}^{2r-c}) \right) \\
&= E\, \mathrm{Cov}\left( h(X_1^c, X_{c+1}^r),\; h(X_1^c, X_{r+1}^{2r-c}) \mid X_1^c \right) + \mathrm{Cov}\left( E\left[ h(X_1^c, X_{c+1}^r) \mid X_1^c \right],\; E\left[ h(X_1^c, X_{r+1}^{2r-c}) \mid X_1^c \right] \right) \\
&= \mathrm{Var}\left( E\left[ h(X_1^c, X_{c+1}^r) \mid X_1^c \right] \right),
\end{aligned}
$$
since, given $X_1^c$, the blocks $X_{c+1}^r$ and $X_{r+1}^{2r-c}$ are independent (so the conditional covariance vanishes) and the two conditional expectations are identical.

Clearly, $\zeta_0 = 0$: disjoint $S$ and $S'$ give independent terms.
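As a sketch (our own illustration), these $\zeta_c$ can be estimated by simulation for the sample variance kernel $h(x, y) = \frac{1}{2}(x - y)^2$ with N(0,1) data: $E[h \mid X_1] = (X_1^2 + 1)/2$, so $\zeta_1 = \mathrm{Var}(X_1^2)/4 = 1/2$, and $\zeta_2 = \mathrm{Var}(h(X_1, X_2)) = 2$.

```python
# Monte Carlo sketch: zeta_c = Var(E[h | X_1^c]) for the kernel
# h(x, y) = (x - y)^2 / 2 with N(0,1) data.
import random
import statistics

random.seed(7)
reps = 20000
z1_samples = [(random.gauss(0, 1) ** 2 + 1) / 2 for _ in range(reps)]
z2_samples = [0.5 * (random.gauss(0, 1) - random.gauss(0, 1)) ** 2
              for _ in range(reps)]
print(statistics.variance(z1_samples))  # zeta_1, close to 0.5
print(statistics.variance(z2_samples))  # zeta_2, close to 2
```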

18 Variance of U-statistics

Now,
$$
\begin{aligned}
\mathrm{Var}(U) &= \frac{1}{\binom{n}{r}} \sum_{c=1}^r \binom{r}{c} \binom{n-r}{r-c} \zeta_c \\
&= \sum_{c=1}^r \Theta(n^{-r})\, \Theta(n^{r-c})\, \zeta_c \\
&= \sum_{c=1}^r \Theta(n^{-c})\, \zeta_c.
\end{aligned}
$$

19 Variance of U-statistics

So if $\zeta_1 \ne 0$, the first term dominates:
$$
n\, \mathrm{Var}(U) \to \frac{n\, r!\, (n-r)!\; r\, (n-r)!}{n!\, (r-1)!\, (n-2r+1)!}\, \zeta_1 \to r^2 \zeta_1.
$$

If $\zeta_1 = 0$, we say that $U$ is degenerate.
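A simulation sketch of this limit (our own illustration): for the sample variance, $r = 2$ and, with N(0,1) data, $\zeta_1 = 1/2$ (as computed above), so $n\, \mathrm{Var}(U) \to r^2 \zeta_1 = 2$.

```python
# Check n * Var(s_n^2) -> r^2 * zeta_1 = 2 for N(0,1) data.
import random
import statistics

random.seed(6)
n, reps = 200, 4000
u_vals = [statistics.variance([random.gauss(0, 1) for _ in range(n)])
          for _ in range(reps)]
print(n * statistics.variance(u_vals))  # close to 2
```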
