Chapter 10
When one is willing to assume the existence of a simple random sample $X_1,\ldots,X_n$, U-statistics generalize common notions of unbiased estimation such as the sample mean and the unbiased sample variance (in fact, the "U" in "U-statistics" stands for "unbiased"). Even though U-statistics may be considered a bit of a special topic, their study in a large-sample theory course has side benefits that make them valuable pedagogically. The theory of U-statistics nicely demonstrates the application of some of the large-sample topics presented thus far. Furthermore, the study of U-statistics even enables a theoretical discussion of statistical functionals, which gives insight into the common modern practice of bootstrapping.
10.1 Statistical Functionals and V-Statistics
Let $S$ be a set of cumulative distribution functions and let $T$ denote a mapping from $S$ into the real numbers $\mathbb{R}$. Then $T$ is called a statistical functional. We may think of statistical functionals as parameters of interest. If, say, we are given a simple random sample from a distribution with unknown distribution function $F$, we may want to learn the value of $\theta = T(F)$ for a (known) functional $T$. Some particular instances of statistical functionals are as follows:
• If $T(F) = F(c)$ for some constant $c$, then $T$ is a statistical functional mapping each $F$ to $P_F(X \le c)$.
• If $T(F) = F^{-1}(p)$ for some constant $p$, where $F^{-1}(p)$ is defined in Equation (3.13), then $T$ maps $F$ to its $p$th quantile.
• If $T(F) = E_F(X)$, then $T$ maps $F$ to its mean.
Suppose $X_1,\ldots,X_n$ is an independent and identically distributed sequence with distribution function $F(x)$. We define the empirical distribution function $\hat F_n$ to be the distribution function for a discrete uniform distribution on $\{X_1,\ldots,X_n\}$. In other words,
\[
\hat F_n(x) = \frac{1}{n}\,\#\{i : X_i \le x\} = \frac{1}{n}\sum_{i=1}^n I\{X_i \le x\}.
\]
Since $\hat F_n(x)$ is a distribution function, a reasonable estimator of $T(F)$ is the so-called plug-in estimator $T(\hat F_n)$. For example, if $T(F) = E_F(X)$, then the plug-in estimator given a simple random sample $X_1, X_2, \ldots$ from $F$ is
\[
T(\hat F_n) = E_{\hat F_n}(X) = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n.
\]
As we will see later, a plug-in estimator is also known as a V-statistic or a V-estimator.
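To make the plug-in idea concrete, here is a minimal Python sketch (the helper name and the exponential example are ours, not the text's) that evaluates $\hat F_n$ and computes the plug-in estimates of the three example functionals listed above:

```python
import numpy as np

def empirical_cdf(sample, x):
    """Evaluate the empirical distribution function: #{i : X_i <= x} / n."""
    return np.mean(sample <= x)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=10_000)  # F = Exponential with mean 2

# Plug-in estimates T(F_n-hat) for the three example functionals:
print(empirical_cdf(sample, 1.0))   # estimates P_F(X <= 1) = 1 - exp(-1/2)
print(np.quantile(sample, 0.5))     # estimates the median F^{-1}(1/2)
print(np.mean(sample))              # estimates E_F(X) = 2; this is X-bar
```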
Suppose that for some real-valued function $\phi(x)$, we define $T(F) = E_F\,\phi(X)$. Note in this case that
\[
T\{\alpha F_1 + (1-\alpha)F_2\} = \alpha\,E_{F_1}\phi(X) + (1-\alpha)\,E_{F_2}\phi(X) = \alpha\,T(F_1) + (1-\alpha)\,T(F_2).
\]
For this reason, such a functional is sometimes called a linear functional (see Definition 10.1). To generalize this idea, we consider a real-valued function taking more than one real argument, say $\phi(x_1,\ldots,x_a)$ for some $a > 1$, and define
\[
T(F) = E_F\,\phi(X_1,\ldots,X_a), \tag{10.1}
\]
which we take to mean the expectation of $\phi(X_1,\ldots,X_a)$ where $X_1,\ldots,X_a$ is a simple random sample from the distribution function $F$. We see that
\[
E_F\,\phi(X_1,\ldots,X_a) = E_F\,\phi(X_{\pi(1)},\ldots,X_{\pi(a)})
\]
for any permutation $\pi$ mapping $\{1,\ldots,a\}$ onto itself. Since there are $a!$ such permutations, consider the function
\[
\phi^*(x_1,\ldots,x_a) \stackrel{\text{def}}{=} \frac{1}{a!}\sum_{\text{all } \pi} \phi(x_{\pi(1)},\ldots,x_{\pi(a)}).
\]
Since $E_F\,\phi(X_1,\ldots,X_a) = E_F\,\phi^*(X_1,\ldots,X_a)$ and $\phi^*$ is symmetric in its arguments (i.e., permuting its $a$ arguments does not change its value), we see that in Equation (10.1) we may assume without loss of generality that $\phi$ is symmetric in its arguments. A function defined as in Equation (10.1) is called an expectation functional, as summarized in the following definition:
Definition 10.1 For some integer $a \ge 1$, let $\phi\colon \mathbb{R}^a \to \mathbb{R}$ be a function symmetric in its $a$ arguments. The expectation of $\phi(X_1,\ldots,X_a)$ under the assumption that $X_1,\ldots,X_a$ are independent and identically distributed from some distribution $F$ will be denoted by $E_F\,\phi(X_1,\ldots,X_a)$. Then the functional $T(F) = E_F\,\phi(X_1,\ldots,X_a)$ is called an expectation functional. If $a = 1$, then $T$ is also called a linear functional.
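The symmetrization step above is easy to carry out mechanically. The following Python sketch (our own helper, not from the text) averages an arbitrary kernel over all $a!$ permutations of its arguments to produce $\phi^*$:

```python
import itertools
import math

def symmetrize(phi):
    """Return phi*, the average of phi over all a! permutations of its arguments."""
    def phi_star(*args):
        perms = itertools.permutations(args)
        return sum(phi(*p) for p in perms) / math.factorial(len(args))
    return phi_star

# A non-symmetric kernel and its symmetrized version; both define the
# same expectation functional (here, the mean), but only phi* is symmetric.
phi = lambda x1, x2: x1            # E_F phi(X1, X2) = E_F(X)
phi_star = symmetrize(phi)         # phi*(x1, x2) = (x1 + x2) / 2
print(phi_star(1.0, 3.0))          # 2.0
```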
Expectation functionals are important in this chapter because they are precisely the functionals that give rise to V-statistics and U-statistics. The function $\phi(x_1,\ldots,x_a)$ in Definition 10.1 is used so frequently that we give it a special name:
Definition 10.2 Let $T(F) = E_F\,\phi(X_1,\ldots,X_a)$ be an expectation functional, where $\phi\colon \mathbb{R}^a \to \mathbb{R}$ is a function that is symmetric in its arguments. In other words, $\phi(x_1,\ldots,x_a) = \phi(x_{\pi(1)},\ldots,x_{\pi(a)})$ for any permutation $\pi$ of the integers 1 through $a$. Then $\phi$ is called the kernel function associated with $T(F)$.

Suppose $T(F)$ is an expectation functional defined according to Equation (10.1). If we have a simple random sample of size $n$ from $F$, then as noted earlier, a natural way to estimate $T(F)$ is by the use of the plug-in estimator $T(\hat F_n)$. This estimator is called a V-estimator or a V-statistic. It is possible to write down a V-statistic explicitly: since $\hat F_n$ assigns probability $\frac{1}{n}$ to each $X_i$, we have
\[
V_n = T(\hat F_n) = E_{\hat F_n}\,\phi(X_1,\ldots,X_a) = \frac{1}{n^a} \sum_{i_1=1}^n \cdots \sum_{i_a=1}^n \phi(X_{i_1},\ldots,X_{i_a}). \tag{10.2}
\]
In the case $a = 1$, Equation (10.2) becomes
\[
V_n = \frac{1}{n}\sum_{i=1}^n \phi(X_i).
\]
It is clear in this case that $E\,V_n = T(F)$, which we denote by $\theta$. Furthermore, if $\sigma^2 = \operatorname{Var}_F \phi(X) < \infty$, then the central limit theorem implies that
\[
\sqrt{n}\,(V_n - \theta) \xrightarrow{d} N(0, \sigma^2).
\]
For $a > 1$, however, the sum in Equation (10.2) contains some terms in which $i_1,\ldots,i_a$ are not all distinct. The expectation of such terms is not necessarily equal to $\theta = T(F)$ because in Definition 10.1, $\theta$ requires $a$ independent random variables from $F$. Thus, $V_n$ is not necessarily unbiased for $a > 1$.
Example 10.3 Let $a = 2$ and $\phi(x_1, x_2) = |x_1 - x_2|$. It may be shown (Problem 10.2) that the functional $T(F) = E_F|X_1 - X_2|$ is not linear in $F$. Furthermore, since $|X_{i_1} - X_{i_2}|$ is identically zero whenever $i_1 = i_2$, it may also be shown that the V-estimator of $T(F)$ is biased.
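A quick Monte Carlo check makes the bias visible. In the following sketch (the Uniform(0,1) choice and the simulation design are ours; for that distribution $\theta = E_F|X_1 - X_2| = 1/3$), the average of the V-statistic over many samples falls short of $\theta$ because the diagonal terms $|X_i - X_i| = 0$ drag the average down:

```python
import itertools
import numpy as np

def v_statistic(phi, sample, a):
    """V-statistic of Equation (10.2): average phi over all n^a index tuples."""
    n = len(sample)
    values = [phi(*(sample[k] for k in idx))
              for idx in itertools.product(range(n), repeat=a)]
    return np.mean(values)

rng = np.random.default_rng(1)
phi = lambda x1, x2: abs(x1 - x2)
n, reps, theta = 10, 2000, 1.0 / 3.0
v_mean = np.mean([v_statistic(phi, rng.uniform(size=n), a=2) for _ in range(reps)])
print(v_mean, theta)   # v_mean is systematically below theta = 1/3
```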
Since the bias in $V_n$ is due to the duplication among the subscripts $i_1,\ldots,i_a$, one way to correct this bias is to restrict the summation in Equation (10.2) to sets of subscripts $i_1,\ldots,i_a$ that contain no duplication. For example, we might sum instead over all possible subscripts satisfying $i_1 < \cdots < i_a$. The result is the U-statistic, which is the topic of Section 10.2.
Exercises for Section 10.1
Exercise 10.1 Let $X_1,\ldots,X_n$ be a simple random sample from $F$. For a fixed $x$ for which $0 < F(x) < 1$, find the asymptotic distribution of $\hat F_n(x)$.
Exercise 10.2 Let $T(F) = E_F|X_1 - X_2|$.
(a) Show that $T(F)$ is not a linear functional by exhibiting distributions $F_1$ and $F_2$ and a constant $\alpha \in (0,1)$ such that
\[
T\{\alpha F_1 + (1-\alpha)F_2\} \ne \alpha\,T(F_1) + (1-\alpha)\,T(F_2).
\]
(b) For $n > 1$, demonstrate that the V-statistic $V_n$ is biased in this case by finding $c_n \ne 1$ such that $E_F V_n = c_n T(F)$.
Exercise 10.3 Let $X_1,\ldots,X_n$ be a random sample from a distribution $F$ with finite third absolute moment.
(a) For $a = 2$, find $\phi(x_1, x_2)$ such that $E_F\,\phi(X_1, X_2) = \operatorname{Var}_F X$. Your $\phi$ function should be symmetric in its arguments.
Hint: The fact that $\theta = E\,X_1^2 - E\,X_1X_2$ leads immediately to a non-symmetric $\phi$ function. Symmetrize it.
(b) For $a = 3$, find $\phi(x_1, x_2, x_3)$ such that $E_F\,\phi(X_1, X_2, X_3) = E_F(X - E_F X)^3$. As in part (a), $\phi$ should be symmetric in its arguments.
10.2 Hoeffding's Decomposition and Asymptotic Normality
Because the V-statistic
\[
V_n = \frac{1}{n^a} \sum_{i_1=1}^n \cdots \sum_{i_a=1}^n \phi(X_{i_1},\ldots,X_{i_a})
\]
is in general a biased estimator of the expectation functional $T(F) = E_F\,\phi(X_1,\ldots,X_a)$
due to the presence of summands in which there are duplicated indices on the $X_{i_k}$, one way to produce an unbiased estimator is to sum only over those $(i_1,\ldots,i_a)$ in which no duplicates occur. Because $\phi$ is assumed to be symmetric in its arguments, we may without loss of generality restrict attention to the cases in which $1 \le i_1 < \cdots < i_a \le n$. Doing this, we obtain the U-statistic $U_n$:
Definition 10.4 Let $a$ be a positive integer and let $\phi(x_1,\ldots,x_a)$ be the kernel function associated with an expectation functional $T(F)$ (see Definitions 10.1 and 10.2). Then the U-statistic corresponding to this functional equals
\[
U_n = \frac{1}{\binom{n}{a}} \sum_{1 \le i_1 < \cdots < i_a \le n} \phi(X_{i_1},\ldots,X_{i_a}), \tag{10.3}
\]
where $X_1,\ldots,X_n$ is a simple random sample of size $n \ge a$.

The "U" in "U-statistic" stands for unbiased (the "V" in "V-statistic" stands for von Mises, who was one of the originators of this theory in the late 1940s). The unbiasedness of $U_n$ follows since it is the average of $\binom{n}{a}$ terms, each with expectation $T(F) = E_F\,\phi(X_1,\ldots,X_a)$.

Example 10.5 Consider a random sample $X_1,\ldots,X_n$ from $F$, and let
\[
R_n = \sum_{j=1}^n j\,I\{W_j > 0\}
\]
be the Wilcoxon signed rank statistic, where $W_1,\ldots,W_n$ are simply $X_1,\ldots,X_n$ reordered in increasing absolute value. We showed in Example 9.12 that
\[
R_n = \sum_{i=1}^n \sum_{j=1}^i I\{X_i + X_j > 0\}.
\]
Letting $\phi(a, b) = I\{a + b > 0\}$, we see that $\phi$ is symmetric in its arguments and thus
\[
\frac{1}{\binom{n}{2}} R_n = U_n + \frac{1}{\binom{n}{2}} \sum_{i=1}^n I\{X_i > 0\} = U_n + O_P\!\left(\frac{1}{n}\right),
\]
where $U_n$ is the U-statistic corresponding to the expectation functional $T(F) = P_F(X_1 + X_2 > 0)$. Therefore, many of the properties of the signed rank test that we have already derived elsewhere can also be obtained using the theory of U-statistics developed later in this section.

In the special case $a = 1$, the V-statistic and the U-statistic coincide. In this case, we have already seen that both $U_n$ and $V_n$ are asymptotically normal by the central limit theorem. However, for $a > 1$, the two statistics do not coincide in general. Furthermore, we may no longer use the central limit theorem to obtain asymptotic normality because the summands are not independent (each $X_i$ appears in more than one summand). To prove the asymptotic normality of U-statistics, we shall use a method sometimes known as the H-projection method after its inventor, Wassily Hoeffding.

If $\phi(x_1,\ldots,x_a)$ is the kernel function of an expectation functional $T(F) = E_F\,\phi(X_1,\ldots,X_a)$, suppose $X_1,\ldots,X_n$ is a simple random sample from the distribution $F$. Let $\theta = T(F)$ and let $U_n$ be the U-statistic defined in Equation (10.3). For $1 \le i \le a$, suppose that the values of $X_1,\ldots,X_i$ are held constant, say $X_1 = x_1,\ldots,X_i = x_i$. This may be viewed as projecting the random vector $(X_1,\ldots,X_a)$ onto the $(a-i)$-dimensional subspace in $\mathbb{R}^a$ given by $\{(x_1,\ldots,x_i,c_{i+1},\ldots,c_a) : (c_{i+1},\ldots,c_a) \in \mathbb{R}^{a-i}\}$. If we take the conditional expectation, the result will be a function of $x_1,\ldots,x_i$, which we will denote by $\phi_i$. That is, we define for $i = 1,\ldots,a$
\[
\phi_i(x_1,\ldots,x_i) = E_F\,\phi(x_1,\ldots,x_i, X_{i+1},\ldots,X_a). \tag{10.4}
\]
Equivalently, we may use conditional expectation notation to define
\[
\phi_i(X_1,\ldots,X_i) = E_F\{\phi(X_1,\ldots,X_a) \mid X_1,\ldots,X_i\}. \tag{10.5}
\]
From Equation (10.5), we obtain $E_F\,\phi_i(X_1,\ldots,X_i) = \theta$ for all $i$. Furthermore, we define
\[
\sigma_i^2 = \operatorname{Var}_F \phi_i(X_1,\ldots,X_i). \tag{10.6}
\]
The variance of $U_n$ can be expressed in terms of the $\sigma_i^2$ as follows:

Theorem 10.6 The variance of a U-statistic is
\[
\operatorname{Var}_F U_n = \frac{1}{\binom{n}{a}} \sum_{k=1}^a \binom{a}{k}\binom{n-a}{a-k} \sigma_k^2.
\]
If $\sigma_1^2,\ldots,\sigma_a^2$ are all finite, then
\[
\operatorname{Var}_F U_n = \frac{a^2\sigma_1^2}{n} + o\!\left(\frac{1}{n}\right).
\]

Theorem 10.6 is proved in Exercise 10.4. This theorem shows that the variance of $\sqrt{n}\,U_n$ tends to $a^2\sigma_1^2$, and indeed we may well wonder whether it is true that $\sqrt{n}\,(U_n - \theta)$ is asymptotically normal with this limiting variance. It is the idea of the H-projection method of Hoeffding to prove exactly that fact.

We shall derive the asymptotic normality of $U_n$ in a sequence of steps. The basic idea will be to show that $U_n - \theta$ has the same limiting distribution as the sum
\[
\tilde U_n = \sum_{j=1}^n E_F(U_n - \theta \mid X_j) \tag{10.7}
\]
of projections. The asymptotic distribution of $\tilde U_n$ follows from the central limit theorem because $\tilde U_n$ is the sum of independent and identically distributed random variables.
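As a numerical sanity check on Definition 10.4 and on the variance claim of Theorem 10.6, the following Python sketch (our design; for the Uniform(0,1) kernel $\phi(x_1, x_2) = |x_1 - x_2|$ one can compute $\theta = 1/3$, $\phi_1(x) = x^2 - x + \frac12$, and $\sigma_1^2 = 1/180$) evaluates $U_n$ by summing over index subsets and estimates the variance of $\sqrt{n}\,(U_n - \theta)$:

```python
import itertools
import numpy as np

def u_statistic(phi, sample, a):
    """U-statistic of Equation (10.3): average phi over all (n choose a) subsets."""
    n = len(sample)
    values = [phi(*(sample[k] for k in idx))
              for idx in itertools.combinations(range(n), a)]
    return np.mean(values)

rng = np.random.default_rng(2)
phi = lambda x1, x2: abs(x1 - x2)   # Gini mean difference kernel, a = 2
n, reps, theta = 50, 1000, 1.0 / 3.0
z = [np.sqrt(n) * (u_statistic(phi, rng.uniform(size=n), a=2) - theta)
     for _ in range(reps)]
print(np.mean(z), np.var(z))   # mean near 0 (unbiasedness); variance near
                               # a^2 * sigma_1^2 = 4/180, per Theorem 10.6
```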
Lemma 10.7 For all $1 \le j \le n$,
\[
E_F(U_n - \theta \mid X_j) = \frac{a}{n}\{\phi_1(X_j) - \theta\}.
\]
Proof: Note that
\[
E_F(U_n - \theta \mid X_j) = \frac{1}{\binom{n}{a}} \sum_{1 \le i_1 < \cdots < i_a \le n} E_F\{\phi(X_{i_1},\ldots,X_{i_a}) - \theta \mid X_j\},
\]
where $E_F\{\phi(X_{i_1},\ldots,X_{i_a}) - \theta \mid X_j\}$ equals $\phi_1(X_j) - \theta$ whenever $j$ is among $\{i_1,\ldots,i_a\}$ and 0 otherwise. The number of ways to choose $\{i_1,\ldots,i_a\}$ so that $j$ is among them is $\binom{n-1}{a-1}$, so we obtain
\[
E_F(U_n - \theta \mid X_j) = \frac{\binom{n-1}{a-1}}{\binom{n}{a}}\{\phi_1(X_j) - \theta\} = \frac{a}{n}\{\phi_1(X_j) - \theta\}.
\]
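Lemma 10.7 combined with Equation (10.7) gives $\tilde U_n = \sum_{j=1}^n \frac{a}{n}\{\phi_1(X_j) - \theta\}$, and this projection can be compared with $U_n - \theta$ directly. A minimal sketch, again using the Uniform(0,1) kernel $|x_1 - x_2|$ (for which $\phi_1(x) = x^2 - x + \frac12$ and $\theta = 1/3$, both computed by us for this example):

```python
import itertools
import numpy as np

phi = lambda x1, x2: abs(x1 - x2)   # kernel, a = 2
phi1 = lambda x: x**2 - x + 0.5     # phi_1(x) = E_F |x - X_2| under Uniform(0,1)
a, n, theta = 2, 200, 1.0 / 3.0

rng = np.random.default_rng(3)
x = rng.uniform(size=n)
u_n = np.mean([phi(x[i], x[j]) for i, j in itertools.combinations(range(n), 2)])
u_tilde = np.sum((a / n) * (phi1(x) - theta))   # projection from Equation (10.7)
print(u_n - theta, u_tilde)   # the two agree up to a remainder of smaller order
```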