
1 The Glivenko-Cantelli Theorem

Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution function $F$ on $\mathbf{R}$. The empirical distribution function is the function of $x$ defined by
$$\hat{F}_n(x) = \frac{1}{n} \sum_{1 \leq i \leq n} I\{X_i \leq x\} .$$
For a given $x \in \mathbf{R}$, we can apply the strong law of large numbers to the sequence $I\{X_i \leq x\}$, $i = 1, \ldots, n$ to assert that
$$\hat{F}_n(x) \to F(x) \text{ a.s.}$$
(In order to apply the strong law of large numbers we only need to show that $E[|I\{X_i \leq x\}|] < \infty$, which in this case is trivial because $|I\{X_i \leq x\}| \leq 1$.) In this sense, $\hat{F}_n(x)$ is a reasonable estimate of $F(x)$ for a given $x \in \mathbf{R}$.
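As a quick numerical illustration (a sketch, not part of the argument: the uniform distribution, the evaluation point $x = 0.3$, and the sample sizes are arbitrary choices), the pointwise convergence $\hat{F}_n(x) \to F(x)$ is easy to see by simulation:

```python
import random

def ecdf(sample, x):
    """Empirical distribution function F_n(x): fraction of observations <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

random.seed(0)
# Uniform(0,1) draws, so F(x) = x on [0,1]; watch F_n(0.3) approach 0.3.
for n in (100, 10000, 200000):
    sample = [random.random() for _ in range(n)]
    print(n, abs(ecdf(sample, 0.3) - 0.3))
```

The reported error shrinks at the usual $O(n^{-1/2})$ Monte Carlo rate.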

But is $\hat{F}_n(x)$ a reasonable estimate of $F(x)$ when both are viewed as functions of $x$? The Glivenko-Cantelli Theorem provides an answer to this question. It asserts the following:

Theorem 1.1 Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution function $F$ on $\mathbf{R}$. Then,
$$\sup_{x \in \mathbf{R}} |\hat{F}_n(x) - F(x)| \to 0 \text{ a.s.} \quad (1)$$

This result is perhaps the oldest and most well known result in the very large field of empirical process theory, which is at the center of much of modern econometrics. The quantity on the left-hand side of (1) is an example of a Kolmogorov-Smirnov statistic. The proof of the result will require the following lemma.
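The supremum in (1) can be computed exactly from the order statistics, since $\hat{F}_n$ only jumps at sample points: at the $i$-th order statistic $X_{(i)}$, $\hat{F}_n$ jumps from $(i-1)/n$ to $i/n$. A sketch (the uniform distribution and seed are arbitrary choices):

```python
import random

def ks_statistic(sample, F):
    """sup_x |F_n(x) - F(x)| for continuous F: check each order statistic
    X_(i), where F_n jumps from (i-1)/n to i/n, against F(X_(i))."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, i / n - F(x), F(x) - (i - 1) / n)
    return d

random.seed(1)
F = lambda x: min(max(x, 0.0), 1.0)  # Uniform(0,1) CDF
for n in (100, 10000):
    d = ks_statistic([random.random() for _ in range(n)], F)
    print(n, d)
```

The printed values shrink roughly like $n^{-1/2}$, consistent with (1).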

Lemma 1.1 Let $F$ be a (nonrandom) distribution function on $\mathbf{R}$. For each $\epsilon > 0$ there exists a finite partition of the real line of the form $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that for $0 \leq j \leq k - 1$,
$$F(t_{j+1}^-) - F(t_j) \leq \epsilon ,$$
where $F(t^-)$ denotes the left limit $\lim_{z \uparrow t} F(z)$, with the conventions $F(t_0) = 0$ and $F(t_k^-) = 1$.

Proof: Let $\epsilon > 0$ be given. Let $t_0 = -\infty$ and for $j \geq 0$ define
$$t_{j+1} = \sup\{z : F(z) \leq F(t_j) + \epsilon\} .$$
Note that $F(t_{j+1}) \geq F(t_j) + \epsilon$ whenever $t_{j+1} < \infty$. To see this, suppose that $F(t_{j+1}) < F(t_j) + \epsilon$. Then, by right continuity of $F$ there would exist $\delta > 0$ so that $F(t_{j+1} + \delta) < F(t_j) + \epsilon$, which would contradict the definition of $t_{j+1}$. Thus, between $t_j$ and $t_{j+1}$, $F$ increases by at least $\epsilon$. Since this can happen at most a finite number of times, the partition is of the desired form, that is, $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ with $k < \infty$. Moreover, $F(t_{j+1}^-) \leq F(t_j) + \epsilon$. To see this, note that by definition of $t_{j+1}$ we have $F(t_{j+1} - \delta) \leq F(t_j) + \epsilon$ for all $\delta > 0$. The desired result thus follows from the definition of $F(t_{j+1}^-)$.
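The construction in the proof can be carried out numerically. A sketch for the standard normal distribution (the choice of $F$, the bracketing interval $[-10, 10]$, and the bisection depth are all arbitrary assumptions; for a continuous, strictly increasing $F$ the supremum defining $t_{j+1}$ is simply the quantile $F^{-1}(F(t_j) + \epsilon)$):

```python
import math

def F(x):
    """Standard normal CDF: continuous and strictly increasing on R."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_partition(eps, lo=-10.0, hi=10.0):
    """Partition of Lemma 1.1 for this F: t_{j+1} = sup{z : F(z) <= F(t_j) + eps}
    is the quantile F^{-1}(F(t_j) + eps), located by bisection on [lo, hi]."""
    ts = [-math.inf]
    p = 0.0  # F(t_0) = F(-inf) = 0
    while p + eps < 1.0 - 1e-12:  # tolerance guards against float round-off
        target = p + eps
        a, b = lo, hi
        for _ in range(80):
            m = 0.5 * (a + b)
            if F(m) <= target:
                a = m
            else:
                b = m
        ts.append(0.5 * (a + b))
        p = target
    ts.append(math.inf)
    return ts

def Fbar(x):
    # Extend F to the endpoints of the partition.
    return 0.0 if x == -math.inf else 1.0 if x == math.inf else F(x)

ts = build_partition(0.1)
# F is continuous here, so F(t^-) = F(t) and each increment is at most eps.
gaps = [Fbar(ts[j + 1]) - Fbar(ts[j]) for j in range(len(ts) - 1)]
print(len(ts) - 1, max(gaps))
```

With $\epsilon = 0.1$ the partition has $k = 10$ cells, matching the roughly $1/\epsilon$ cells the proof predicts.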

Proof of Theorem 1.1: It suffices to show that for any $\epsilon > 0$,
$$\limsup_{n \to \infty} \sup_{x \in \mathbf{R}} |\hat{F}_n(x) - F(x)| \leq \epsilon \text{ a.s.}$$
To this end, let $\epsilon > 0$ be given and consider a partition of the real line into finitely many pieces of the form $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that for $0 \leq j \leq k - 1$,
$$F(t_{j+1}^-) - F(t_j) \leq \frac{\epsilon}{2} .$$
The existence of such a partition is ensured by the previous lemma. For any $x \in \mathbf{R}$, there exists $j$ such that $t_j \leq x < t_{j+1}$. For such $j$,
$$\hat{F}_n(t_j) \leq \hat{F}_n(x) \leq \hat{F}_n(t_{j+1}^-) \quad \text{and} \quad F(t_j) \leq F(x) \leq F(t_{j+1}^-) ,$$
which implies that
$$\hat{F}_n(t_j) - F(t_{j+1}^-) \leq \hat{F}_n(x) - F(x) \leq \hat{F}_n(t_{j+1}^-) - F(t_j) .$$

Furthermore,
$$\hat{F}_n(t_j) - F(t_j) + F(t_j) - F(t_{j+1}^-) \leq \hat{F}_n(x) - F(x) \leq \hat{F}_n(t_{j+1}^-) - F(t_{j+1}^-) + F(t_{j+1}^-) - F(t_j) .$$

By construction of the partition, we have that
$$\hat{F}_n(t_j) - F(t_j) - \frac{\epsilon}{2} \leq \hat{F}_n(x) - F(x) \leq \hat{F}_n(t_{j+1}^-) - F(t_{j+1}^-) + \frac{\epsilon}{2} .$$
Since the partition contains only finitely many points, the desired result now follows from the strong law of large numbers, applied to $\hat{F}_n(t_j)$ and $\hat{F}_n(t_j^-)$ for each $j$.

2 The Sample Median

We now give a brief application of the Glivenko-Cantelli Theorem. Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution function $F$. Suppose one is interested in the median of $F$. Concretely, we will define
$$\mathrm{Med}(F) = \inf\{x : F(x) \geq \tfrac{1}{2}\} .$$

A natural estimator of $\mathrm{Med}(F)$ is the sample analog, $\mathrm{Med}(\hat{F}_n)$. Under what conditions is $\mathrm{Med}(\hat{F}_n)$ a reasonable estimate of $\mathrm{Med}(F)$? Let $m = \mathrm{Med}(F)$ and suppose that $F$ is well behaved at $m$ in the sense that $F(t) > \frac{1}{2}$ whenever $t > m$. Under this condition, we can show using the Glivenko-Cantelli Theorem that $\mathrm{Med}(\hat{F}_n) \to \mathrm{Med}(F)$ a.s. We will now prove this result.
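Before turning to the proof, note that $\mathrm{Med}(\hat{F}_n)$ has a simple closed form: $\hat{F}_n(X_{(k)}) = k/n$ at the $k$-th order statistic, so the infimum is attained at the smallest rank $k$ with $k/n \geq 1/2$, i.e. $k = \lceil n/2 \rceil$. A minimal sketch (the sample values are hypothetical):

```python
def sample_median(values):
    """Med(F_n) = inf{x : F_n(x) >= 1/2}. Since F_n(X_(k)) = k/n, this is
    the order statistic of rank k = ceil(n/2)."""
    s = sorted(values)
    k = (len(s) + 1) // 2  # ceil(n/2): smallest k with k/n >= 1/2
    return s[k - 1]

print(sample_median([3.0, 1.0, 2.0]))       # rank 2 of n = 3 -> 2.0
print(sample_median([4.0, 1.0, 3.0, 2.0]))  # rank 2 of n = 4 -> 2.0
```

Note that for even $n$ this is the lower median, exactly matching the infimum definition above, rather than the midpoint convention used by, e.g., Python's `statistics.median`.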

Suppose $F_n$ is a (nonrandom) sequence of distribution functions such that
$$\sup_{x \in \mathbf{R}} |F_n(x) - F(x)| \to 0 .$$
Let $\epsilon > 0$ be given. We wish to show that there exists $N = N(\epsilon)$ such that for all $n > N$,
$$|\mathrm{Med}(F_n) - \mathrm{Med}(F)| < \epsilon .$$

Choose $\delta > 0$ so that
$$\delta < \tfrac{1}{2} - F(m - \epsilon) \quad \text{and} \quad \delta < F(m + \epsilon) - \tfrac{1}{2} ,$$
which in turn implies that
$$F(m - \epsilon) < \tfrac{1}{2} - \delta \quad \text{and} \quad F(m + \epsilon) > \tfrac{1}{2} + \delta .$$
(Such a $\delta$ exists: $F(m - \epsilon) < \frac{1}{2}$ by the definition of $m$, and $F(m + \epsilon) > \frac{1}{2}$ by the assumption on $F$. It might help to draw a picture to see why we should pick $\delta$ in this way.) Next choose $N$ so that for all $n > N$,
$$\sup_{x \in \mathbf{R}} |F_n(x) - F(x)| < \delta .$$

Let $m_n = \mathrm{Med}(F_n)$. For such $n$, $m_n > m - \epsilon$, for if $m_n \leq m - \epsilon$, then $F_n(m - \epsilon) \geq F_n(m_n) \geq \frac{1}{2}$, so that
$$F(m - \epsilon) > F_n(m - \epsilon) - \delta \geq \tfrac{1}{2} - \delta ,$$
which contradicts the choice of $\delta$. We also have that $m_n < m + \epsilon$, for if $m_n \geq m + \epsilon$, then
$$F(m + \epsilon) < F_n(m + \epsilon) + \delta \leq \tfrac{1}{2} + \delta ,$$
which again contradicts the choice of $\delta$. Thus, for $n > N$, $|m_n - m| < \epsilon$, as desired.

By the Glivenko-Cantelli Theorem, it follows immediately that Med(Fˆn) → Med(F ) a.s.
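The convergence $\mathrm{Med}(\hat{F}_n) \to \mathrm{Med}(F)$ a.s. is easy to see in simulation. A sketch (the Exponential(1) distribution and sample sizes are arbitrary choices; its CDF $F(x) = 1 - e^{-x}$ is strictly increasing, so the condition $F(t) > \frac{1}{2}$ for $t > m$ holds, and $\mathrm{Med}(F) = \ln 2$):

```python
import math
import random

random.seed(2)
true_med = math.log(2.0)  # Med(F) for Exponential(1)
for n in (100, 10000, 100000):
    s = sorted(random.expovariate(1.0) for _ in range(n))
    m_n = s[(n + 1) // 2 - 1]  # Med(F_n): smallest order statistic with rank/n >= 1/2
    print(n, abs(m_n - true_med))
```

The error shrinks with $n$, as the consistency result predicts.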

3 A Generalized Glivenko-Cantelli Theorem

Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution $P$. We no longer require that the random variables are real-valued. The empirical measure is the measure defined by
$$\hat{P}_n = \frac{1}{n} \sum_{1 \leq i \leq n} \delta_{X_i} ,$$
where $\delta_x$ is the measure that puts mass one at $x$. In this language, the empirical distribution function is simply $\hat{P}_n(-\infty, x]$ and the uniform convergence over $x \in \mathbf{R}$ asserted in (1) can instead be understood as uniform convergence of the empirical measure to the true measure over sets of the form $(-\infty, x]$ for $x \in \mathbf{R}$. Can this result be extended to other classes of sets? It turns out that the convergence of the empirical measure to the true measure is uniform for large classes of sets provided that the sets do not "pick out" too many subsets of any set of $n$ points in the space.

Definition 3.1 We say that a class $\mathcal{V}$ of subsets of a space $S$ has polynomial discrimination (of degree $v$) if there exists a polynomial $\rho(\cdot)$ (of degree $v$) so that from every set of $n$ points in $S$ the class picks out at most $\rho(n)$ distinct subsets. Formally, if $S_0 \subseteq S$ consists of $n$ points, then there are at most $\rho(n)$ distinct sets of the form $V \cap S_0$ with $V \in \mathcal{V}$.
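For the half-intervals $(-\infty, x]$ this is easy to check by brute force: on $n$ distinct points, the sets $(-\infty, x] \cap S_0$ are exactly the "prefixes" of the sorted points, giving at most $n + 1$ distinct subsets. A sketch with hypothetical points:

```python
def picked_out(points, x):
    """The subset of `points` picked out by the half-interval (-inf, x]."""
    return frozenset(p for p in points if p <= x)

points = [0.5, 1.7, 3.2, 4.4]
# Every x in R yields one of the "prefix" subsets, so it suffices to try
# one threshold below all the points and one at each point.
thresholds = [-1.0] + points
subsets = {picked_out(points, x) for x in thresholds}
print(len(subsets))  # n + 1 = 5 distinct subsets, versus 2^n = 16
```

So $\rho(n) = n + 1$ works, a polynomial of degree 1.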

It is easy to see that half-intervals of the form $(-\infty, x]$ with $x \in \mathbf{R}$ have polynomial discrimination of degree 1. Unfortunately, it is difficult to come up with further classes of sets that have polynomial discrimination. For that purpose, the following observation is useful. A class of sets $\mathcal{V}$ is said to "shatter" a set of points $S_0$ if it can "pick out" every possible subset of $S_0$, i.e., if each subset of $S_0$ can be written as $V \cap S_0$ for some $V \in \mathcal{V}$. A lemma due to Sauer asserts that if a class of sets shatters no set of $v$ points, then it has polynomial discrimination of degree (no greater than) $v - 1$. For example, on the real line, the half-intervals shatter no set of two points, which is consistent with our earlier observation. More interestingly, the class of all closed discs in $\mathbf{R}^2$ shatters no set of four points. This follows from the following result, which allows us to generate many classes of sets that have polynomial discrimination.

Lemma 3.1 Let $\mathcal{G}$ be a vector space of real-valued functions on $S$ with dimension $v - 1$. The class of sets of the form $\{g \geq 0\}$ as $g$ varies over $\mathcal{G}$ shatters no set of $v$ points in $S$.

Proof: Choose $s_1, \ldots, s_v$ distinct points from $S$. We must show that there is some subset of these points that is not "picked out" by the sets of the form $\{g \geq 0\}$ as $g$ varies over $\mathcal{G}$. Define the map $L : \mathcal{G} \to \mathbf{R}^v$ by the rule
$$L(g) = (g(s_1), \ldots, g(s_v)) .$$

$L$ is a linear map. Hence, $L(\mathcal{G})$ is a linear subspace of $\mathbf{R}^v$ of dimension no greater than $v - 1$. As a result, there exists a nonzero $\gamma \in \mathbf{R}^v$ that is orthogonal to $L(\mathcal{G})$, i.e.,
$$\sum_{1 \leq i \leq v} \gamma_i g(s_i) = 0 \text{ for all } g \in \mathcal{G} .$$
Equivalently,
$$\sum_{i : \gamma_i \geq 0} \gamma_i g(s_i) = \sum_{i : \gamma_i < 0} -\gamma_i g(s_i) . \quad (2)$$

Assume without loss of generality that $\{i : \gamma_i < 0\} \neq \emptyset$ (otherwise replace $\gamma$ by $-\gamma$). Consider the set $\{s_i : \gamma_i \geq 0\}$. Suppose by way of contradiction that there exists a $g \in \mathcal{G}$ such that $\{g \geq 0\} \cap \{s_1, \ldots, s_v\} = \{s_i : \gamma_i \geq 0\}$. For such a $g$, the left-hand side of (2) is nonnegative while the right-hand side is strictly negative (for each $i$ with $\gamma_i < 0$ we have $-\gamma_i > 0$ and $g(s_i) < 0$), so (2) cannot hold.
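The simplest instance of the lemma can be checked numerically: take $\mathcal{G} = \mathrm{span}\{1, x\}$ on $\mathbf{R}$, of dimension 2, so $v = 3$ and no set of three points should be shattered by sets $\{g \geq 0\}$ with $g$ affine. A brute-force sketch over a grid of coefficients (the three points and the grid are arbitrary assumptions; a grid can only understate the set of achievable patterns, so the count below is an upper-bound check on shattering):

```python
from itertools import product

points = (0.0, 1.0, 2.0)
grid = [k / 4.0 for k in range(-20, 21)]  # coefficients a, b in [-5, 5]

# Sign patterns (g(s) >= 0 for each point s) realized by g(x) = a + b*x.
patterns = {tuple(a + b * s >= 0 for s in points)
            for a, b in product(grid, repeat=2)}
print(len(patterns))
```

The "middle point only" pattern `(False, True, False)` and its complement never appear: an affine $g$ satisfies $g(1) = (g(0) + g(2))/2$, so it cannot be negative at $0$ and $2$ but nonnegative at $1$. Hence fewer than $2^3 = 8$ patterns occur and the three points are not shattered.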

The following theorem generalizes the classical Glivenko-Cantelli Theorem to all classes of sets with polynomial discrimination, including those that can be generated using the previous lemma.

Theorem 3.1 Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution $P$ on $S$. For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$ with polynomial discrimination,
$$\sup_{V \in \mathcal{V}} |\hat{P}_n V - PV| \to 0 \text{ a.s.}$$

We will break the proof of this result up into several steps.

Lemma 3.2 Let $\{Z(t) : t \in T\}$ and $\{Z'(t) : t \in T\}$ be independent stochastic processes. Suppose there exist $\alpha > 0$ and $\beta > 0$ such that $P\{|Z'(t)| \leq \alpha\} \geq \beta$ for every $t \in T$. Then,
$$P\{\sup_{t \in T} |Z(t)| > \epsilon\} \leq \frac{1}{\beta} P\{\sup_{t \in T} |Z(t) - Z'(t)| > \epsilon - \alpha\} .$$

Proof: Let $\tau$ be a random element of $T$, depending only on the process $Z$, such that
$$\sup_{t \in T} |Z(t)| > \epsilon \iff |Z(\tau)| > \epsilon .$$
Note that
$$P\{|Z(\tau)| > \epsilon, |Z'(\tau)| \leq \alpha\} = P\{|Z'(\tau)| \leq \alpha \mid |Z(\tau)| > \epsilon\} P\{|Z(\tau)| > \epsilon\} \geq \beta P\{|Z(\tau)| > \epsilon\} ,$$
where the inequality follows because
$$P\{|Z'(\tau)| \leq \alpha \mid |Z(\tau)| > \epsilon\} = E[P\{|Z'(\tau)| \leq \alpha \mid Z\} \mid |Z(\tau)| > \epsilon]$$
and
$$P\{|Z'(\tau)| \leq \alpha \mid Z\} \geq \beta .$$

Furthermore,
$$P\{|Z(\tau)| > \epsilon, |Z'(\tau)| \leq \alpha\} \leq P\{|Z(\tau) - Z'(\tau)| > \epsilon - \alpha\} \leq P\{\sup_{t \in T} |Z(t) - Z'(t)| > \epsilon - \alpha\} .$$
The desired result thus follows.

Lemma 3.3 Let $X_i$, $i = 1, \ldots, n$ and $X'_i$, $i = 1, \ldots, n$ be independent i.i.d. sequences of random variables with distribution $P$ on $S$, and let $\hat{P}'_n$ denote the empirical measure of $X'_i$, $i = 1, \ldots, n$. For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$,
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - PV| > \epsilon\} \leq 2 P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - \hat{P}'_n V| > \frac{\epsilon}{2}\}$$
whenever $n \geq 8/\epsilon^2$.

Proof: By Chebyshev's inequality,
$$P\{|\hat{P}'_n V - PV| > \frac{\epsilon}{2}\} \leq \frac{4}{n\epsilon^2} .$$
Hence,
$$P\{|\hat{P}'_n V - PV| \leq \frac{\epsilon}{2}\} \geq \frac{1}{2}$$
whenever $n \geq 8/\epsilon^2$. We may therefore apply the preceding lemma to $Z(V) = \hat{P}_n V - PV$ and $Z'(V) = \hat{P}'_n V - PV$ with $\alpha = \epsilon/2$ and $\beta = 1/2$ to reach the desired conclusion.

Lemma 3.4 Let $X_i$, $i = 1, \ldots, n$ and $X'_i$, $i = 1, \ldots, n$ be independent i.i.d. sequences of random variables with distribution $P$ on $S$. Independently, let $\sigma_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution putting equal mass on $-1$ and $1$. Let
$$\hat{P}_n^{\circ} = \frac{1}{n} \sum_{1 \leq i \leq n} \sigma_i \delta_{X_i} .$$

For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$,
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - \hat{P}'_n V| > \frac{\epsilon}{2}\} \leq 2 P\{\sup_{V \in \mathcal{V}} |\hat{P}_n^{\circ} V| > \frac{\epsilon}{4}\} .$$

Proof: Note that the distributions of the random variables $I\{X_i \in V\} - I\{X'_i \in V\}$ for $i = 1, \ldots, n$ and $V \in \mathcal{V}$ and $\sigma_i[I\{X_i \in V\} - I\{X'_i \in V\}]$ for $i = 1, \ldots, n$ and $V \in \mathcal{V}$ are the same. To see this, simply consider the distribution conditional on $\sigma_i$, $i = 1, \ldots, n$. Then,
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - \hat{P}'_n V| > \frac{\epsilon}{2}\} = P\{\sup_{V \in \mathcal{V}} |\frac{1}{n} \sum_{1 \leq i \leq n} \sigma_i [I\{X_i \in V\} - I\{X'_i \in V\}]| > \frac{\epsilon}{2}\}$$
$$\leq P\{\sup_{V \in \mathcal{V}} |\frac{1}{n} \sum_{1 \leq i \leq n} \sigma_i I\{X_i \in V\}| > \frac{\epsilon}{4}\} + P\{\sup_{V \in \mathcal{V}} |\frac{1}{n} \sum_{1 \leq i \leq n} \sigma_i I\{X'_i \in V\}| > \frac{\epsilon}{4}\} ,$$
from which the desired result follows.
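The key distributional identity in this proof, that $I\{X_i \in V\} - I\{X'_i \in V\}$ and $\sigma_i[I\{X_i \in V\} - I\{X'_i \in V\}]$ have the same joint distribution (each difference is symmetric about 0), can be checked by simulation for a single hypothetical set $V$ with $P(V) = 0.3$ (the value $0.3$, the sample size, and the replication count are arbitrary choices):

```python
import random
from collections import Counter

random.seed(4)
p, n, reps = 0.3, 5, 40000

def total(symmetrize):
    """Sum over i of I{X_i in V} - I{X'_i in V}, optionally multiplied
    by independent Rademacher signs sigma_i."""
    t = 0
    for _ in range(n):
        d = (random.random() < p) - (random.random() < p)
        if symmetrize:
            d *= random.choice((-1, 1))
        t += d
    return t

plain = Counter(total(False) for _ in range(reps))
signed = Counter(total(True) for _ in range(reps))
for k in range(-n, n + 1):
    print(k, plain[k] / reps, signed[k] / reps)
```

The two empirical distributions agree up to Monte Carlo error, which is exactly what the conditioning argument in the proof asserts.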

We are now almost prepared to prove the desired result, but to do so we will require the following inequality due to Hoeffding for tail probabilities of sums of independent, bounded random variables.

Lemma 3.5 Let $Y_i$, $i = 1, \ldots, n$ be independent random variables with zero mean and satisfying $a_i \leq Y_i \leq b_i$. For each $\eta > 0$,
$$P\{|\sum_{1 \leq i \leq n} Y_i| \geq \eta\} \leq 2 \exp(-2\eta^2 / \sum_{1 \leq i \leq n} (b_i - a_i)^2) .$$
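A quick simulation comparing Hoeffding's bound with the actual tail probability for Rademacher sums (the parameters $n$, $\eta$, and the replication count are arbitrary choices; $Y_i = \sigma_i$ has $a_i = -1$, $b_i = 1$, so each range is 2):

```python
import math
import random

def hoeffding_bound(eta, n, width):
    """2 exp(-2 eta^2 / sum_i (b_i - a_i)^2) when all n terms share range `width`."""
    return 2.0 * math.exp(-2.0 * eta ** 2 / (n * width ** 2))

random.seed(5)
n, eta, reps = 200, 30.0, 5000
hits = sum(abs(sum(random.choice((-1, 1)) for _ in range(n))) >= eta
           for _ in range(reps))
print(hits / reps, hoeffding_bound(eta, n, 2.0))  # empirical tail vs. bound
```

The empirical tail probability sits comfortably below the bound; Hoeffding's inequality trades sharpness for the exponential decay in $n$ that the proof below exploits.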

Proof of Theorem 3.1: Let $\mathcal{X}_n = \{X_1, \ldots, X_n\}$ and consider
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n^{\circ} V| > \frac{\epsilon}{4} \mid \mathcal{X}_n\} .$$
Since $\mathcal{V}$ has polynomial discrimination, we may replace $\mathcal{V}$ with $\mathcal{V}^*$, which is a collection of at most $\rho(n)$ sets from $\mathcal{V}$. Note that $\mathcal{V}^*$ may depend on $\mathcal{X}_n$. Thus,
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n^{\circ} V| > \frac{\epsilon}{4} \mid \mathcal{X}_n\} \leq \rho(n) \max_{V \in \mathcal{V}^*} P\{|\hat{P}_n^{\circ} V| > \frac{\epsilon}{4} \mid \mathcal{X}_n\} .$$

Apply Hoeffding's inequality to $Y_i = \sigma_i I\{X_i \in V\}$ to conclude that
$$P\{|\hat{P}_n^{\circ} V| > \frac{\epsilon}{4} \mid \mathcal{X}_n\} \leq 2 \exp(-n\epsilon^2/32) .$$

Integrating out with respect to $\mathcal{X}_n$ and using the previous lemmas, we have that
$$P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - PV| > \epsilon\} \leq 8\rho(n) \exp(-n\epsilon^2/32) .$$
It follows that
$$\sum_{1 \leq n < \infty} P\{\sup_{V \in \mathcal{V}} |\hat{P}_n V - PV| > \epsilon\} < \infty ,$$
so the desired result follows from the Borel-Cantelli lemma.
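To see why the Borel-Cantelli lemma applies: the bound $8\rho(n)\exp(-n\epsilon^2/32)$ is summable in $n$ for any polynomial $\rho$, since the exponential factor eventually dominates. A numeric sketch for the half-interval case $\rho(n) = n + 1$ and an arbitrary choice $\epsilon = 0.1$:

```python
import math

def gc_bound(n, eps, rho):
    """The tail bound 8 rho(n) exp(-n eps^2 / 32) from the proof."""
    return 8.0 * rho(n) * math.exp(-n * eps ** 2 / 32.0)

rho = lambda n: n + 1  # half-intervals: polynomial discrimination of degree 1
eps = 0.1
terms = [gc_bound(n, eps, rho) for n in range(1, 300001)]
print(terms[99999], sum(terms))  # the n = 100000 term is already negligible
```

The partial sum stabilizes at a finite value: the polynomial factor only delays, and never prevents, the exponential decay.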
