Theoretical Statistics. Lecture 5. Peter Bartlett

1. U-statistics.

1 Outline of today’s lecture

We’ll look at U-statistics, a family of estimators that includes many interesting examples. We’ll study their properties: unbiasedness, lower variance, concentration (via an application of the bounded differences inequality), asymptotic variance, asymptotic distribution. (See Chapter 12 of van der Vaart.) First, we’ll consider the standard unbiased estimate of variance, a special case of a U-statistic.

2 Variance estimates

$$
\begin{aligned}
s_n^2 &= \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 \\
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \left[ (X_i - \bar X_n)^2 + (X_j - \bar X_n)^2 \right] \\
&= \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \left( (X_i - \bar X_n) - (X_j - \bar X_n) \right)^2 \\
&= \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{2} (X_i - X_j)^2 \\
&= \frac{1}{\binom{n}{2}} \sum_{i<j} \frac{1}{2} (X_i - X_j)^2
\end{aligned}
$$
(the cross terms vanish in the third equality because $\sum_i (X_i - \bar X_n) = 0$).
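As a quick sanity check, here is a minimal numerical sketch (not from the lecture; variable names are illustrative) confirming that the unbiased sample variance coincides with the pairwise average of $\frac{1}{2}(X_i - X_j)^2$:

```python
# Numerical check: s_n^2 equals the average of (1/2)(X_i - X_j)^2
# over all pairs i < j.
import itertools
import random

random.seed(0)
n = 20
X = [random.gauss(0.0, 1.0) for _ in range(n)]

xbar = sum(X) / n
s2 = sum((x - xbar) ** 2 for x in X) / (n - 1)

pairs = list(itertools.combinations(range(n), 2))
u = sum(0.5 * (X[i] - X[j]) ** 2 for i, j in pairs) / len(pairs)

print(s2, u)  # the two values agree up to floating-point error
```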

3 Variance estimates

This is an unbiased estimator of the variance for i.i.d. data:
$$
\begin{aligned}
E s_n^2 &= E\, \frac{1}{2}(X_1 - X_2)^2 \\
&= \frac{1}{2} E \left( (X_1 - E X_1) - (X_2 - E X_2) \right)^2 \\
&= \frac{1}{2} E \left[ (X_1 - E X_1)^2 + (X_2 - E X_2)^2 \right] \\
&= E (X_1 - E X_1)^2
\end{aligned}
$$
(the cross term vanishes by independence).

4 U-statistics

Definition: A U-statistic of order $r$ with kernel $h$ is
$$
U = \frac{1}{\binom{n}{r}} \sum_{\{i_1, \ldots, i_r\} \subseteq [n]} h(X_{i_1}, \ldots, X_{i_r}),
$$
where $h$ is symmetric in its arguments.

[If h is not symmetric in its arguments, we can also average over permutations.] “U” for “unbiased.” Introduced by Wassily Hoeffding in the 1940s.
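A brute-force implementation of this definition might look as follows (a sketch; the helper name u_statistic is ours, and the enumeration over all $\binom{n}{r}$ subsets makes it practical only for small $n$ and $r$):

```python
# Brute-force U-statistic: average a symmetric kernel h over all
# size-r subsets of the sample. O(n^r) -- for illustration only.
import itertools
from math import comb

def u_statistic(X, h, r):
    """Average of h(X_{i_1}, ..., X_{i_r}) over all size-r subsets."""
    total = sum(h(*(X[i] for i in idx))
                for idx in itertools.combinations(range(len(X)), r))
    return total / comb(len(X), r)

# Example: the sample variance via the kernel h(x, y) = (x - y)^2 / 2.
X = [1.0, 4.0, 2.0, 8.0]
print(u_statistic(X, lambda x, y: 0.5 * (x - y) ** 2, r=2))
```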

5 U-statistics

Theorem: [Halmos] A statistical functional $\theta$ (i.e., a function defined on a family of distributions) admits an unbiased estimator (i.e., for all sufficiently large $n$, some function of the $n$ i.i.d. observations has expectation $\theta$) iff for some $k$ there is a kernel $h$ such that

$$
\theta = E h(X_1, \ldots, X_k).
$$

Necessity is trivial. Sufficiency uses the estimator

$$
\hat\theta(X_1, \ldots, X_n) = h(X_1, \ldots, X_k).
$$

U-statistics make better use of the sample than this, since they are a symmetric function of the data.

6 U-statistics: Examples

• $s_n^2$ is a U-statistic of order 2 with kernel $h(x, y) = \frac{1}{2}(x - y)^2$.

• $\bar X_n$ is a U-statistic of order 1 with kernel $h(x) = x$.

• The U-statistic with kernel $h(x, y) = |x - y|$ estimates the pairwise deviation or Gini mean difference; see the sketch after this list. [The Gini coefficient, $G = E|X - Y| / (2 E X)$, is commonly used as a measure of income inequality.]

• The third k-statistic,
$$
k_3 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n (X_i - \bar X_n)^3,
$$
is a U-statistic that estimates the 3rd cumulant.
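As promised above, a self-contained sketch (our own illustration) of the Gini mean difference estimator; for Exp(1) data the true values are $E|X - Y| = 1$ and $G = 1/2$:

```python
# Estimate the Gini mean difference E|X - Y| and the Gini coefficient
# G = E|X - Y| / (2 EX) on simulated Exp(1) data.
import itertools
import random

random.seed(1)
X = [random.expovariate(1.0) for _ in range(200)]

pairs = list(itertools.combinations(X, 2))
gini_mean_diff = sum(abs(x - y) for x, y in pairs) / len(pairs)
gini_coeff = gini_mean_diff / (2 * sum(X) / len(X))
print(gini_mean_diff, gini_coeff)  # roughly 1 and 1/2 for Exp(1)
```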

7 U-statistics: Examples

• The U-statistic with kernel $h(x, y) = \frac{1}{2}(x - y)(x - y)^T$ estimates the variance-covariance matrix.

• Kendall’s $\tau$: For a random pair $P_1 = (X_1, Y_1)$, $P_2 = (X_2, Y_2)$ of points in the plane,
$$
\begin{aligned}
\tau &= \Pr(P_1 P_2 \text{ has positive slope}) - \Pr(P_1 P_2 \text{ has negative slope}) \\
&= E \left( 1[P_1 P_2 \text{ has positive slope}] - 1[P_1 P_2 \text{ has negative slope}] \right),
\end{aligned}
$$

where $P_1 P_2$ is the line from $P_1$ to $P_2$. It is a measure of correlation: $\tau \in [-1, 1]$; $\tau = 0$ for independent $X, Y$; $\tau = \pm 1$ for $Y = f(X)$ with $f$ monotone. Clearly, $\tau$ can be estimated using a U-statistic of order 2.
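A sketch of that order-2 U-statistic (our own illustration; the kernel is the sign of the slope through the two points, which is symmetric in its arguments):

```python
# Kendall's tau as an order-2 U-statistic: the kernel is the sign of
# the slope of the line through two points.
import itertools
import random

def kendall_tau_u(points):
    def h(p, q):  # symmetric: swapping p and q flips both factors
        s = (q[0] - p[0]) * (q[1] - p[1])
        return (s > 0) - (s < 0)  # +1 positive slope, -1 negative
    pairs = list(itertools.combinations(points, 2))
    return sum(h(p, q) for p, q in pairs) / len(pairs)

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100)]
print(kendall_tau_u(list(zip(xs, xs))))  # Y = X (monotone): tau = 1
print(kendall_tau_u([(x, random.gauss(0, 1)) for x in xs]))  # near 0
```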

8 U-statistics: Examples

The Wilcoxon one-sample rank statistic:
$$
T^+ = \sum_{i=1}^n R_i\, 1[X_i > 0],
$$
where $R_i$ is the rank of $|X_i|$ (its position when $|X_1|, \ldots, |X_n|$ are arranged in ascending order). It’s used to test whether the distribution is symmetric about zero.

Assuming the $|X_i|$ are all distinct, we can write

$$
R_i = \sum_{j=1}^n 1[|X_j| \le |X_i|].
$$

9 U-statistics: Examples

Hence
$$
\begin{aligned}
T^+ &= \sum_{i=1}^n \sum_{j=1}^n 1[|X_j| \le X_i] \\
&= \sum_{i<j} \left( 1[|X_j| \le X_i] + 1[|X_i| \le X_j] \right) + \sum_{i=1}^n 1[X_i > 0] \\
&= \sum_{i<j} 1[X_i + X_j > 0] + \sum_{i=1}^n 1[X_i > 0] \\
&= \binom{n}{2} \cdot \frac{1}{\binom{n}{2}} \sum_{i<j} 1[X_i + X_j > 0] + n \cdot \frac{1}{n} \sum_{i=1}^n 1[X_i > 0],
\end{aligned}
$$
where the third equality holds because, for distinct $|X_i|$, exactly one of $|X_j| \le X_i$ and $|X_i| \le X_j$ occurs iff $X_i + X_j > 0$.

10 U-statistics: Examples

So
$$
T^+ = \binom{n}{2} U_2 + n U_1,
$$
where
$$
U_2 = \frac{1}{\binom{n}{2}} \sum_{i<j} h_2(X_i, X_j), \qquad h_2(X_i, X_j) = 1[X_i + X_j > 0],
$$
$$
U_1 = \frac{1}{n} \sum_{i=1}^n h_1(X_i), \qquad h_1(X_i) = 1[X_i > 0].
$$

So it’s a sum of U-statistics. [Why is it not a U-statistic?]
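A numerical sketch of this decomposition (illustrative names), checking that $T^+$ computed from ranks matches $\binom{n}{2} U_2 + n U_1$:

```python
# Check: T+ (sum of ranks of |X_i| over positive X_i) equals
# C(n,2) * U_2 + n * U_1.
import itertools
import random
from math import comb

random.seed(3)
n = 15
X = [random.gauss(0.5, 1.0) for _ in range(n)]

order = sorted(range(n), key=lambda i: abs(X[i]))
rank = {i: pos + 1 for pos, i in enumerate(order)}  # ranks of |X_i|
t_plus = sum(rank[i] for i in range(n) if X[i] > 0)

u2 = sum(X[i] + X[j] > 0
         for i, j in itertools.combinations(range(n), 2)) / comb(n, 2)
u1 = sum(x > 0 for x in X) / n
print(t_plus, comb(n, 2) * u2 + n * u1)  # equal up to float error
```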

11 Properties of U-statistics

• “U” for “unbiased”: $U$ is an unbiased estimator of $E h(X_1, \ldots, X_r)$:

$$
E U = E h(X_1, \ldots, X_r).
$$

• $U$ is a lower variance estimate than $h(X_1, \ldots, X_r)$, because $U$ is an average over permutations. Indeed, since $U$ is an average over permutations $\pi$ of $h(X_{\pi(1)}, \ldots, X_{\pi(r)})$, we can write
$$
U(X_1, \ldots, X_n) = E \left[ h(X_1, \ldots, X_r) \mid X_{(1)}, \ldots, X_{(n)} \right],
$$
where $(X_{(1)}, \ldots, X_{(n)})$ is the data in some sorted order. Thus, for $E U = \theta$, we can write the variance as:

12 Properties of U-statistics

$$
\begin{aligned}
E (U - \theta)^2 &= E \left( E \left[ h(X_1, \ldots, X_r) - \theta \mid X_{(1)}, \ldots, X_{(n)} \right] \right)^2 \\
&\le E\, E \left[ \left( h(X_1, \ldots, X_r) - \theta \right)^2 \mid X_{(1)}, \ldots, X_{(n)} \right] \\
&= E \left( h(X_1, \ldots, X_r) - \theta \right)^2,
\end{aligned}
$$
by Jensen’s inequality (for a convex function $\phi$, we have $\phi(E X) \le E \phi(X)$). This is the Rao-Blackwell theorem: the mean squared error of the estimator $h(X_1, \ldots, X_r)$ is reduced by replacing it with its conditional expectation, given the sufficient statistic $(X_{(1)}, \ldots, X_{(n)})$.
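A simulation sketch of this variance reduction (our own illustration): compare the naive unbiased estimator $h(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$ with the full U-statistic $s_n^2$ over many replications.

```python
# Variance reduction: the single-pair estimator h(X_1, X_2) versus the
# full U-statistic s_n^2, both unbiased for the variance (here 1).
import random
import statistics

random.seed(4)
n, reps = 10, 5000
h_vals, u_vals = [], []
for _ in range(reps):
    X = [random.gauss(0, 1) for _ in range(n)]
    h_vals.append(0.5 * (X[0] - X[1]) ** 2)  # naive unbiased estimator
    u_vals.append(statistics.variance(X))    # the U-statistic s_n^2
print(statistics.variance(h_vals), statistics.variance(u_vals))
# The U-statistic has much smaller variance; both have mean close to 1.
```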

13 Recall: Bounded Differences Inequality

Theorem: Suppose $f : \mathcal{X}^n \to \mathbb{R}$ satisfies the following bounded differences inequality: for all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,
$$
\left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \le B_i.
$$
Then
$$
P \left( |f(X) - E f(X)| \ge t \right) \le 2 \exp \left( -\frac{2 t^2}{\sum_i B_i^2} \right).
$$

14 Bounded Differences Inequality

Consider a U-statistic of order 2:
$$
U = \frac{1}{\binom{n}{2}} \sum_{i<j} h(X_i, X_j).
$$

Theorem: If $|h(X_1, X_2)| \le B$ a.s., then
$$
P \left( |U - E U| \ge t \right) \le 2 \exp \left( -\frac{n t^2}{8 B^2} \right).
$$

15 Bounded Differences Inequality

Proof: For $X, X'$ differing in a single coordinate, at most $n - 1$ of the $\binom{n}{2}$ terms of $U$ change, each by at most $2B$, so
$$
|U - U'| \le \frac{1}{\binom{n}{2}} \sum_{i<j} \left| h(X_i, X_j) - h(X_i', X_j') \right| \le \frac{(n-1) \cdot 2B}{\binom{n}{2}} = \frac{4B}{n}.
$$
Applying the bounded differences inequality with $B_i = 4B/n$ gives $\sum_i B_i^2 = 16 B^2 / n$, and hence the bound $2 \exp(-n t^2 / (8 B^2))$.
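A Monte Carlo sketch of the tail bound (our own illustration), using the bounded kernel $h(x, y) = 1[x + y > 0]$, so $B = 1$ and, for data symmetric about zero, $E U = 1/2$:

```python
# Compare the empirical deviation probability of U with the bound
# 2 exp(-n t^2 / (8 B^2)), here with B = 1.
import itertools
import math
import random

random.seed(5)
n, reps, t = 100, 1000, 0.3
pairs = list(itertools.combinations(range(n), 2))

def U(X):
    return sum(X[i] + X[j] > 0 for i, j in pairs) / len(pairs)

hits = sum(abs(U([random.gauss(0, 1) for _ in range(n)]) - 0.5) >= t
           for _ in range(reps))
print(hits / reps, 2 * math.exp(-n * t ** 2 / 8))
# The empirical probability is far below the (conservative) bound.
```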

16 Variance of U-statistics

Now we’ll compute the asymptotic variance of a U-statistic. Recall the definition:
$$
U = \frac{1}{\binom{n}{r}} \sum_{\{i_1, \ldots, i_r\} \subseteq [n]} h(X_{i_1}, \ldots, X_{i_r}).
$$
So [letting $S, S'$ range over subsets of $\{1, \ldots, n\}$ of size $r$]:
$$
\begin{aligned}
\mathrm{Var}(U) &= \frac{1}{\binom{n}{r}^2} \sum_S \sum_{S'} \mathrm{Cov}(h(X_S), h(X_{S'})) \\
&= \frac{1}{\binom{n}{r}} \sum_{c=0}^r \binom{r}{c} \binom{n-r}{r-c} \zeta_c,
\end{aligned}
$$
where $\binom{n}{r} \binom{r}{c} \binom{n-r}{r-c}$ is the number of ways of choosing $S$ and $S'$ with an intersection of size $c$ (first choose $S$, then choose the intersection from $S$, then choose the rest of $S'$ from the $n - r$ elements outside $S$).
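A brute-force sketch verifying this count for small $n$ and $r$ (enumeration over all pairs of size-$r$ subsets; names are illustrative):

```python
# Count pairs (S, S') of size-r subsets of [n] with intersection size c
# and compare with C(n,r) * C(r,c) * C(n-r, r-c).
import itertools
from math import comb

n, r = 7, 3
subsets = list(itertools.combinations(range(n), r))
for c in range(r + 1):
    count = sum(len(set(S) & set(Sp)) == c
                for S in subsets for Sp in subsets)
    print(c, count, comb(n, r) * comb(r, c) * comb(n - r, r - c))
```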

17 Variance of U-statistics

Also, $\zeta_c = \mathrm{Cov}(h(X_S), h(X_{S'}))$ depends only on $c = |S \cap S'|$. To see this, suppose that $S \cap S' = I$ with $|I| = c$, and write $X_a^b = (X_a, \ldots, X_b)$. Then
$$
\begin{aligned}
\zeta_c &= \mathrm{Cov}(h(X_S), h(X_{S'})) \\
&= \mathrm{Cov}\left( h(X_I, X_{S - I}),\; h(X_I, X_{S' - I}) \right) \\
&= \mathrm{Cov}\left( h(X_1^c, X_{c+1}^r),\; h(X_1^c, X_{r+1}^{2r-c}) \right) \\
&= E\, \mathrm{Cov}\left( h(X_1^c, X_{c+1}^r),\; h(X_1^c, X_{r+1}^{2r-c}) \mid X_1^c \right) + \mathrm{Cov}\left( E\left[ h(X_1^c, X_{c+1}^r) \mid X_1^c \right],\; E\left[ h(X_1^c, X_{r+1}^{2r-c}) \mid X_1^c \right] \right) \\
&= \mathrm{Var}\left( E\left[ h(X_1^c, X_{c+1}^r) \mid X_1^c \right] \right),
\end{aligned}
$$
since, given $X_1^c$, the blocks $X_{c+1}^r$ and $X_{r+1}^{2r-c}$ are independent (so the conditional covariance vanishes) and the two conditional expectations are identical.

Clearly, $\zeta_0 = 0$: disjoint $S$ and $S'$ give independent terms.
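As a sketch (our own illustration), these $\zeta_c$ can be estimated by simulation for the sample variance kernel $h(x, y) = \frac{1}{2}(x - y)^2$ with N(0,1) data: $E[h \mid X_1] = (X_1^2 + 1)/2$, so $\zeta_1 = \mathrm{Var}(X_1^2)/4 = 1/2$, and $\zeta_2 = \mathrm{Var}(h(X_1, X_2)) = 2$.

```python
# Monte Carlo sketch: zeta_c = Var(E[h | X_1^c]) for the kernel
# h(x, y) = (x - y)^2 / 2 with N(0,1) data.
import random
import statistics

random.seed(7)
reps = 20000
z1_samples = [(random.gauss(0, 1) ** 2 + 1) / 2 for _ in range(reps)]
z2_samples = [0.5 * (random.gauss(0, 1) - random.gauss(0, 1)) ** 2
              for _ in range(reps)]
print(statistics.variance(z1_samples))  # zeta_1, close to 0.5
print(statistics.variance(z2_samples))  # zeta_2, close to 2
```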

18 Variance of U-statistics

Now,
$$
\begin{aligned}
\mathrm{Var}(U) &= \frac{1}{\binom{n}{r}} \sum_{c=1}^r \binom{r}{c} \binom{n-r}{r-c} \zeta_c \\
&= \sum_{c=1}^r \Theta(n^{-r})\, \Theta(n^{r-c})\, \zeta_c \\
&= \sum_{c=1}^r \Theta(n^{-c})\, \zeta_c.
\end{aligned}
$$

19 Variance of U-statistics

So if $\zeta_1 \ne 0$, the first term dominates:
$$
n\, \mathrm{Var}(U) \to \frac{n\, r!\, (n-r)!\; r\, (n-r)!}{n!\, (r-1)!\, (n-2r+1)!}\, \zeta_1 \to r^2 \zeta_1.
$$

If $\zeta_1 = 0$, we say that $U$ is degenerate.
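A simulation sketch of this limit (our own illustration): for the sample variance, $r = 2$ and, with N(0,1) data, $\zeta_1 = 1/2$ (as computed above), so $n\, \mathrm{Var}(U) \to r^2 \zeta_1 = 2$.

```python
# Check n * Var(s_n^2) -> r^2 * zeta_1 = 2 for N(0,1) data.
import random
import statistics

random.seed(6)
n, reps = 200, 4000
u_vals = [statistics.variance([random.gauss(0, 1) for _ in range(n)])
          for _ in range(reps)]
print(n * statistics.variance(u_vals))  # close to 2
```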
