<<

Class Notes of Stat 8112

1 Bayes

Here are three methods of estimating parameters: (1) MLE; (2) Method; (3) Bayes Method. An example of Bayes argument: Let X F (x θ), θ H . We want to estimate g(θ) R1. ∼ | ∈ ∈ Suppose t(X) is an and look at

MSE (t)= E (t(X) g(θ))2. θ θ −

The problem is MSEθ(t) depends on θ. So minimizing one point may costs at other points.

Bayes idea is to average MSEθ(t) over θ and then minimize over t’s. Thus we pretend to have a distribution for θ, say, π, and look at

H(t)= E(t(X) g(θ))2 − where E now refers to the joint distribution of X and θ, that is

E(t(X) g(θ))2 = (t(x) g(θ))2F (dx θ)π(dθ). (1.1) − − | Z Next pick t(X) to minimize H(t). The minimizer is called the .

LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability space and be the set of functions from R1 to R1. Then H min E(Z h(W ))2 = E(Z E(Z W ))2. h − − | ∈H That is, the minimizer above is E(Z W ). | Proof. Note that

E(Z h(W ))2 = E(Z E(Z W )+ E(Z W ) h(W ))2 − − | | − = E(Z E(Z W ))2 + E(E(Z W ) h(W ))2 − | | − +2E (Z E(Z W ))(E(Z W ) h(W )) . { − | | − } Conditioning on W, we have that the cross term is zero. Thus

E(Z h(W ))2 = E(Z E(Z W ))2 + E(E(Z W ) h(W ))2. − − | | −

1 Now E(Z E(Z W ))2 is fixed. The second term E(E(Z W ) h(W ))2 depends on h. So − | | − choose h(W )= E(Z W ) to minimize E(Z h(W ))2.  | −

Now choose W = X, t = h and Z = g(θ). Then

COROLLARY 1.1 The Bayes estimator in (1.1) is g(θ)= E(g(θ) X), which is the poste- | rior . d At a later occasion, we call E(g(θ) X) the posterior mean of g(θ). Similarly, we have the | following multivariate analogue of the above corollary. The proof is in the same spirit of Lemma 1.1. We omit it.

THEOREM 1 Let X Rm and X F ( θ), θ H . Let g(θ) = (g (θ), , g (θ)) Rk. ∈ ∼ ·| ∈ 1 · · · k ∈ The MSE between an estimator t(X) and g(θ) is

E t(X) g(θ) 2 = t(x) g(θ) 2F (dx θ)π(dθ). k − k k − k | Z Then Bayes estimator g(θ)= E(g(θ) X) := (E(g (θ) X), , E(g (θ) X)). | 1 | · · · k | Definition. The distributiond of θ (we thought there is such) , π(θ), is called a prior distribution of θ. The conditional distribution of θ given X is called the posterior distribu- tion.

Example. Let X , , X be i.i.d. Ber(p) and p Beta(α, β). We know that the 1 · · · n ∼ density function of p is

Γ(α + β) α 1 β 1 f(p)= p − (1 p) − Γ(α)Γ(β) − and E(p)= α/(α + β). The joint distribution of X , , X and p is 1 · · · n x n x α 1 β 1 f(x , ,x ,p)= f(x , ,x p) π(p) = p (1 p) − C(α, β)p − (1 p) − 1 · · · n 1 · · · n| · − · − x+α 1 n+β x 1 = C(α, β)p − (1 p) − − − where x = x . The marginal density of (X , , X ) is i i 1 · · · n P 1 x+α 1 n+β x 1 f(x , ,x )= C(α, β) p − (1 p) − − dp = C(α, β) B(x + α, n + β x). 1 · · · n − · − Z0 Therefore,

f(x1, ,xn,p) x+α 1 n+β x 1 f(p x , ,x )= · · · = D(x, α, β) p − (1 p) − − . | 1 · · · n f(x , ,x ) − 1 · · · n 2 This says that p x Beta(x + α, n + β x). Thus the Bayes estimator is | ∼ − X + α pˆ = E(p X)= , | n + α + β n where X = i=1 Xi. One can easily check that np + α P E(ˆp)= . n + α + β Sop ˆ is biased unless α = β = 0, which is not allowed in the prior distribution. The next lemma is useful

LEMMA 1.2 The posterior distribution depends only on sufficient . Let t(X) be a sufficient . Then E(g(θ) X)= E(g(θ) t(X)). | | Proof. The joint density function of X and θ is p(x θ)π(θ) where π(θ) is the prior distri- | bution of θ. Let t(X) be a sufficient statistic. The by the factorization theorem

p(x θ)= q(t(x) θ)h(x). | | So the joint distribution function is q(t(x) θ)h(x)π(θ). It follows that the conditional distri- | bution of θ given X is q(t(x) θ)h(x)π(θ) q(t(x) θ)π(θ) p(θ x)= | = | | q(t(x) ξ)h(x)π(ξ) dξ q(t(x) ξ)π(ξ) dξ | | which is a function of t(x)R and θ. R By the above conclusion, E(g(θ) X) = l(t(X)) for some function l( ). Take conditional | · expectation for both sides with respect to the algebra σ(t(X)). Note that σ(t(X)) σ(X). ⊂ By the Tower theorem, l(t(X)) = E(g(θ) t(X)). This proves the second conclusion.  |

Example. Let X , , X be iid from N(µ, 1) and µ π(µ) = N(µ ,τ ). We know 1 · · · n ∼ 0 0 that X¯( N(µ, 1/n)) is a sufficient statistic. The Bayes estimator isµ ˆ = E(µ X¯ ). We need ∼ | to calculate the joint distribution of (µ, X¯ )T first. T It is not difficult to see that (µ, X¯) is bivariate normal. We know that Eµ = µ0, V ar(µ)= τ , E(X¯)= E(E(X¯ µ)) = E(µ)= µ . Now 0 | 0 1 V ar(X¯)= E(X¯ µ + µ µ )2 = E((1/n) + (µ µ )2)= + τ ; − − 0 − 0 n 0 Cov(X,µ¯ )= E(E((X¯ µ )(µ µ ) µ)) = E(µ µ )2 = τ . − 0 − 0 | − 0 0 Thus,

X¯ µ0 (1/n)+ τ0 τ0 N , . µ ! ∼ µ0! τ0 τ0!!

3 Recall the following fact: if

Y1 µ1 Σ11 Σ12 N , , Y2! ∼ µ2! Σ21 Σ22!! then

1 Y1 Y2 N µ1 +Σ12Σ22− (Y2 µ2), Σ11 2 | ∼ − · 1  where Σ11 2 =Σ11 Σ12Σ22− Σ21. It follows that · − τ τ 2 µ X¯ N µ + 0 (X¯ µ ),τ 0 . | ∼ 0 (1/n)+ τ − 0 0 − (1/n)+ τ  0 0  Thus the Bayes estimator is

τ0 τ0 µ0 µˆ = µ0 + (X¯ µ0)= X¯ + . (1/n)+ τ0 − (1/n)+ τ0 1+ nτ0

One can see thatµ ˆ µ as τ 0 andµ ˆ X¯ as τ . → 0 0 → → 0 →∞

Example. Let X N (θ,I ) and π(θ) N (0,τI ),τ> 0. The Bayes estimator is the ∼ p p ∼ p p posterior mean E(θ X). We claim that |

X 0 (1 + τ)Ip τIp N , . θ ! ∼ 0! τIp τIp!!

Indeed, by the prior we know that Eθ = 0 and Cov(θ)=τIp. By conditioning argument one

can verify that E(Xθ′) = τIp and E(XX′) = (1+ τ)Ip. So by conditional distribution of normal random variables,

τ τ θ X N X, I . | ∼ 1+ τ 1+ τ p  

So the Bayes estimator is t0(X)= τX/(1 + τ).

The usual MVUE is t1(X)= X. It follows that

MSE (θ)= E X θ 2 = p t1 k − k which doesn’t depend on θ. Now τ τ 1 MSE (θ)= E X θ 2 = E (X θ)+ − θ 2 t0 θk1+ τ − k θk1+ τ − 1+ τ k τ 2 1 = p + θ 2 (1.2) 1+ τ (1 + τ)2 k k   4 which goes to p as τ . What happens to the prior when τ ? It “converges → ∞ → ∞ to” the Lebesgue measure, or “uniform distribution over the real line” which generates no information (recall the information of X N (0, (1 + τ)I ) is p/(1 + τ) and θ¯ = ∼ p p i uniform over [ τ,τ] with θ¯ , 1 i p i.i.d. is p/(2τ 2). Both go to zero and the later is − i ≤ ≤ faster than the former). This explains why the MSE of the former and the limiting MSE are identical. Now let

2 M = inf sup Eθ t(X) θ t θ k − k

called the risk and any estimator t∗ with

2 M = sup Eθ t∗(X) θ θ k − k is called a .

THEOREM 2 For any p 1, t (X)= X is a minimax. ≥ 0 Proof. First,

2 2 M = inf sup Eθ t(X) θ sup Eθ X θ = p. t θ k − k ≤ θ k − k Second 2 2 2 τ M = inf sup Eθ t(X) θ inf Eθ t(X) θ πτ (dθ) p t k − k ≥ t k − k ≥ 1+ τ θ Z   for any τ > 0 where π (θ)= N(0,τI ), where the last step is from (1.2). Then M p by τ p ≥ letting p . The above says that M = sup E X θ 2 = p.  →∞ θ θk − k

Question. Does there exist another estimator t1(X) such that

E t (X) θ 2 E X θ 2 θk 1 − k ≤ θk − k

for all θ, and the strict inequality holds for some θ? If so, we say t0(X)= X is inadmissible (because it can be beaten for any θ by some estimator t, and strictly beaten at some θ). Here is the answer: (i) For p = 1, there is not. This is proved by Blyth in 1951. (ii) For p = 2, there is not. This result was shown by Stein in 1961. (iii) When p 3, Stein (1956) shows there is such estimator, which is called James-Stein ≥ estimator.

Recall the density function of N(µ,σ2) is φ(x) = (√2πσ) 1 exp( (x µ)2/(2σ2)). − − −

5 LEMMA 1.3 (Stein’s lemma). Let Y N(µ,σ2) and g(y) be a function such that g(b) ∼ − g(a)= b g (y) dy for any a and b and some function g (y). If E g (Y ) < . Then a ′ ′ | ′ | ∞ R 2 Eg(Y )(Y µ)= σ Eg′(Y ). − Proof. Let φ(y) = (1/√2πσ) exp( (y µ)2/(2σ2)), the density of N(µ,σ2). Then φ (y)= − − ′ 2 z µ y z µ φ(y)(y µ)/σ . Thus, φ(y)= y∞ σ−2 φ(z) dz = σ−2 φ(z) dz. We then have that − − − −∞ R R ∞ Eg′(Y ) = g′(y)φ(y) dy Z−∞ 0 y ∞ ∞ z µ z µ = g′(y) −2 φ(z) dz dy g′(y) −2 φ(z) dz dy 0 y σ − σ Z Z  Z−∞ Z−∞  z 0 0 ∞ z µ z µ = −2 φ(z) g′(y) dy dz −2 φ(z) g′(y) dy dz 0 σ 0 − σ z Z Z  Z−∞ Z  0 ∞ z µ = + −2 φ(z)(g(z) g(0)) dz 0 σ − Z Z−∞ 1 1 = ∞ (z µ)g(z)φ(z) dz = E(Y µ)g(Y ). σ2 − σ2 − Z−∞ The Fubini’s theorem is used in the third step; the mean of a centered normal is zero is used in the fifth step. 

Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) g(a) = b − a h(x) dx for any a < b. This does not mean that g′(x) = h(x) for all x. Actually, in this case,R g(x) is differentiable almost everywhere and g′(x)= h(x) a.s. under Lebesgue measure. For example, let h(x)=1if x is an irrational number, and h(x)=0if x is rational. Then g(x) = x = x h(t) dt for any x R. We know in this case that g (x) = h(x) a.s. The 0 ∈ ′ following areR true from real analysis: Fundamental Theorem of Calculus. If f(x) is absolutely continuous on [a, b], then f(x) is differentiable almost everywhere and

x f(x) f(a)= f ′(t) dt, x [a, b]. − ∈ Za

Another fact. Let f(x) be differentiable everywhere on [a, b] and f ′(x) is integrable over [a, b]. Then

x f(x) f(a)= f ′(t) dt x [a, b]. − ∈ Za Remark 2. What drives the Stein’s lemma is the formula of integration by parts:

6 suppose g(x) is differentiable then

2 ∞ Eg(Y )(Y µ) = g(y)(y µ)φ(y) dy = σ g(y)φ′(y) dy − R − − Z Z−∞ 2 2 2 ∞ = σ lim g(y)φ(y) σ lim g(y)φ(y)+ σ g′(y)φ(y) dy. y − y + →−∞ → ∞ Z−∞ 2 It is reasonable to assume the two limits are zero. The last term is exactly σ Eg′(Y ).

THEOREM 3 Let X N (θ,I ) for some p 3. Define ∼ p p ≥ p 2 δ (X)= 1 c − X. c − X 2  k k  Then c(2 c) E δ (X) θ 2 = p (p 2)2E − . k c − k − − X 2  k k  Proof. Let g (x)= c(p 2)x / x 2 and g(x) = (g (x), , g (x)) for x = (x ,x , ,x ). i − i k k 1 · · · p 1 2 · · · p Then g(x)= c(p 2)x/ x 2. It follows that − k k 2 2 2 2 E δc(X) θ = E (X θ) g(X) = E X θ + E g(X) 2E X θ, g(X) k − k k − − k k − k k p k − h − i = p + E g(X) 2 2 E(X θ )g (X). k k − i − i i Xi=1 Since X , , X are independent, by conditioning and Stein’s lemma, 1 · · · p ∂g E(X θ )g (X)= E [E (X θ )g (X) (X , j = i) ]= E i (X) . i − i i { i − i i | j 6 } ∂x  i  It is easy to calculate that

∂g p x2 2x2 x 2 2x2 i (x)= c(p 2) k=1 k − i = c(p 2)k k − i . ∂x − ( p x2)2 − x 4 i P k=1 k k k Thus, P

p p X 2 2X2 1 E(X θ )g (X)= c(p 2)E k k − i = c(p 2)2E . i − i i − X 4 − X 2 Xi=1 Xi=1 k k k k  On the other hand, E g(X) 2 = c2(p 2)2E(1/ X 2). Combine all above together, the k k − k k conclusion follows. 

We have to show that E X 2 < for p 3. Indeed, k k− ∞ ≥ 1 1 1 E 2 1+ E 2 I( X 1) 1+ 2 dx1 dxp X ≤ X k k≤ ≤ · · · x 1 x · · · k k  k k  Z Zk k≤ k k 7 since the density of a is bounded by one. By the polar-transformation, the last integral is equal to

1 π π p 1 1 r − J(θ1, ,θp) p 1 p 3 dr dθ1 dθp 1 · · · dr Cπ − r − dr < · · · − r2 ≤ ∞ Z0 Z0 Z0 Z0 if the integer p 3, where J(θ , ,θ ) is a multivariate polynomial of sin θ and cos θ , i = ≥ 1 · · · p i i 1, 2, , p, bounded by a constant C. · · · When 0 c 2, the term c(2 c) 0 and attains the maximum at c = 1. It follows ≤ ≤ − ≥ that

COROLLARY 1.2 The estimator δ (X) dominates X provided 0

When c = 1, the estimator δ1(X) is called the James-Stein estimator.

COROLLARY 1.3 The James-Stein estimator δ dominates all estimators δ with c = 1. 1 c 6

2 Likelihood Ratio Test

We consider test H : θ H vs H : θ H , where H and H are disjoint. The 0 ∈ 0 a ∈ a 0 a probability density or probability mass function of a random observation is f(x θ). The | likelihood ratio test statistic is sup f(x θ) λ(x)= H0 | . supH0 Ha f(x θ) ∪ |

When H0 is true, the value of λ(x) tends to be large. So the rejection region is

λ(x) c . { ≤ } Example. The t-test as a likelihood ratio test. Let X , , X be i.i.d. N(µ,σ2) with both parameters unknown. Test H : µ = µ vs 1 · · · n 0 0 H : µ = µ . The joint density function of X ’s is a 6 0 i n n 1 2 n/2 1 2 f(x1, ,xn µ,σ)= (σ )− exp 2 (xi µ) . (2.1) · · · | √2π −2σ − !   Xi=1 This one has maximum at (X,¯ V ) where

n 1 V = (X X¯ )2. n i − Xi=1

8 The maximum is

n/2 1 2 n/2 n/2 max f(x1, ,xn µ,σ)= CV − exp (Xi X¯) = CV − e− µ,σ · · · | −2V −  X  Where C = (2π) n/2. When µ = µ , f(x , ,x µ,σ) has maximum at − 0 1 · · · n| n 1 V = (X µ )2. 0 n i − 0 Xi=1 And the corresponding maximum value is

n/2 1 2 n/2 n/2 max f(x1, ,xn µ0,σ)= CV0− exp (Xi µ0) = CV0− e− . σ · · · | −2V0 −  X  So the likelihood ratio is

n/2 2 n/2 V − V + (X¯ µ ) − Λ= 0 = − 0 . V V     So the rejection region is X¯ µ Λ a = − 0 b { ≤ } S ≥  

for some constant b, where S = (1/(n 1)) n (X X¯ )2. This yields exactly a student t − i=1 i − test. P

Example. Let X , , X be a random sample from 1 · · · n (x θ) e− − , if x θ; f(x θ)= ≥ (2.2) |  0, otherwise. The joint density function is 

P xi+nθ e− , if θ x(1); f(x θ)= ≤ |  0, otherwise.

Consider the test H0 : θ θ0 versusHa : θ>θ0. It is easy to see ≤ P x +nx max f(x θ)= e− i (1) , θ R | ∈

which is achieved at θ = x(1). Now consider maxθ θ0 f(x θ). If θ0 x(1), the two maxima ≤ | ≥ are identical; If θ0

P xi+nx(1) e− , if θ0 x(1); max f(x θ)= ≥ θ θ0 |  P xi+nθ0 ≤ e− , if θ0

1, if x(1) θ0; λ(x)= ≤  n(θ0 x(1)) e − , if x(1) >θ0. The rejection region is λ(x) awhich is equivalent to x b for some b. { ≤ } { (1) ≥ }

Again, the general form of a hypothesis test is H : θ H vs H : θ H . 0 ∈ 0 a ∈ a THEOREM 4 Suppose X has pdf or pmf f(x θ) and T (X) is sufficient for θ. Let λ(x) and | λ∗(y) be the LRT statistics obtained from X and Y = T (X). Then λ(x)= λ∗(T (x)) for any x.

Proof. Suppose the pdf or pmf of T (X) is g(t θ). To be simple, suppose X is discrete, that | is, P (T (X)= t)= g(t θ). Then θ | f(x θ) = P (X = x, T (X)= T (x)) | θ = P (T (X)= T (x))P (X = x T (X)= T (x)) θ θ | = g(T (x) θ)h(x). (2.3) | by the definition of sufficiency, where h(x) is a function depending only on x (not θ). Evidently,

supθ H0 f(x θ) supθ H0 g(t θ) λ(x)= ∈ | and λ∗(t)= ∈ | . (2.4) supθ H0 Ha f(x θ) supθ H0 Ha g(t θ) ∈ ∪ | ∈ ∪ | By (2.3) and (2.4)

supθ H0 f(T (x) θ) λ(x)= ∈ | = λ∗(T (x)).  supθ H0 Ha f(T (x) θ) ∈ ∪ | This theorem says that the simplification of λ(x) is eventually relevant to x through a sufficient statistic for the parameter.

3 Evaluating tests

Consider the test H : θ H versus H : θ H , where H and H are disjoint. When 0 ∈ 0 a ∈ a 0 a making decisions, there are two types of mistakes:

Type I error: Reject H0 when it is true.

Type II error: Do not reject H0 when it is false.

10 Not reject H0 reject H0

H0 correct type I error

H1 Type II error correct

Suppose R denotes the rejection region for the test. The chances of making the two types of mistakes are P (X R) for θ H and P (X Rc), θ H . So θ ∈ ∈ 0 θ ∈ ∈ a

probability of a Type I error, if θ H 0; P (X R)= ∈ θ  ∈ 1 probability of a Type II error, if θ H .  − ∈ a DEFINITION 3.1 The power function of the test above with the rejection region R is a function of θ defined by β(θ)= P (X R). θ ∈

DEFINITION 3.2 For α [0, 1], a test with power function β(θ) is a size α test if ∈ supθ H β(θ)= α. ∈ 0 DEFINITION 3.3 For α [0, 1], a test with power function β(θ) is a level α test if ∈ supθ H β(θ) α. ∈ 0 ≤ There are some differences between the above two definitions when dealing with complicated models. Example. Let X , , X be from N(µ,σ2) with µ unknown but known σ. Look at the 1 · · · n test H : µ µ vs H : µ>µ . It is easy to check that by likelihood rato test, the rejection 0 ≤ 0 a 0 region is

X¯ µ − 0 > c . σ/√n   So the power function is

X¯ µ X¯ µ µ µ β(µ)= P − 0 > c = P − > c + 0 − θ σ/√n θ σ/√n σ/√n     µ µ = 1 Φ c + 0 − . − σ/√n   It is easy to see that

lim β(µ) = 1 lim β(µ)=0 and β(µ0)= a if P (Z > c)= a, µ µ →∞ →−∞ where Z N(0, 1). ∼

11 Example. In general, a size α LRT is constructed by choosing c such that supθ H Pθ(λ(X) ∈ 0 c)= α. Let X , , X be a random sample from N(θ, 1). Test H : θ = θ vs H : θ = θ . ≤ 1 · · · n 0 0 a 6 0 Here one can easily see that n λ(x) = exp (¯x θ )2 − 2 − 0   So λ(X) c = X¯ θ d . For α given, d can be chosen such that P ( Z > d√n)= α. { ≤ } {| − 0|≥ } θ0 | | For the example in (2.2), the test is H : θ θ vs θ>θ and the rejection region 0 ≤ 0 0 λ(X) c = X > d . For size α, first choose d such that { ≤ } { (1) }

n(d θ0) α = Pθ0 (X(1) > d)= e− − .

Thus, d = θ (log α)/n. Now 0 −

n(d θ) n(d θ0) sup Pθ(X(1) > d)= sup Pθ(X(1) > d)= sup e− − = e− − = α. θ H θ θ0 θ θ0 ∈ 0 ≤ ≤ 4 Most Powerful Tests

Consider a hypothesis test H : θ H vs H : θ H . There are a lot of decision rules. 0 ∈ 0 1 ∈ a We plan to select the best one. Since we are not able to reduce the two types of errors at same time (the sum of chances of two errors and two correct decisions is one). We fix the the chance of the first type of error and maximize the powers among all decision rules.

DEFINITION 4.1 Let be a class of tests for testing H : θ H vs H : θ H . A test C 0 ∈ 0 1 ∈ a in class , with power function β(θ), is a uniformly most powerful (UMP) class test if C C β(θ) β (θ) for every θ H and every β (θ) that is a power function of a test in class . ≥ ′ ∈ a ′ C In this section we only consider as the set of tests such that the level α is fixed, i.e., C β(θ); supθ H β(θ) α . { ∈ 0 ≤ } The following gives a way to obtain UMP tests.

THEOREM 5 (Neyman-Pearson Lemma). Let X be a sample with pdf or pmf f(x θ). | Consider testing H0 : θ = θ0 vs H1 : θ = θ1, using a test with rejection region R that satisfies

x R if f(x θ ) > kf(x θ ) and x Rc if f(x θ ) < kf(x θ ) (4.1) ∈ | 1 | 0 ∈ | 1 | 0 for some k 0, and ≥ α = P (X R). (4.2) θ0 ∈

12 Then (i) (Sufficiency). Any test satisfies (4.1) and (4.2) is a UMP level α test. (ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then every UMP level α test is a size α test (satisfies (4.2)); Every UMP level α test satisfies (4.1) except on a set A satisfying P (X A) = 0 for θ = θ and θ . θ ∈ 0 1 Proof. We will prove the theorem only for the case that f(x θ) is continuous. | Since H consists of only one point, the level α test and size α test are equivalent. 0 (i) Let R satisfy (4.1) and (4.2) and φ(x)= I(x R) with power function β(θ). Now take ∈ another rejection region R with β (θ)= P (X R ) and β (θ ) α. Set φ (x)= I(x R ). ′ ′ θ ∈ ′ ′ 0 ≤ ′ ∈ ′ It suffices to show that

β(θ ) β′(θ ). (4.3) 1 ≥ 1 Actually, (φ(x) φ (x))(f(x θ ) kf(x θ )) 0 (check it based on if f(x θ ) > kf(x θ ) or − ′ | 1 − | 0 ≥ | 1 | 0 not). Then

0 (φ(x) φ′(x))(f(x θ ) kf(x θ )) dx = β(θ ) β′(θ ) k(β(θ ) β′(θ )). (4.4) ≤ − | 1 − | 0 1 − 1 − 0 − 0 Z Thus the assertion (4.3) follows from (4.2) and that k 0. ≥ (ii) Suppose the test satisfying (4.1) and (4.2) has power function β(θ). By (i) the test is

UMP. For the second UMP level α test, say, it has power function β′(θ). Then β(θ1)= β′(θ1). Since k > 0, (4.4) implies β (θ ) β(θ )= α. So β (θ )= α. ′ 0 ≥ 0 ′ 0 Given a UMP level α test with power function β′(θ). By the proved (ii), β(θ1)= β′(θ1)

and β(θ0) = β′(θ0) = α. Then (4.4) implies that the expectation in there is zero. Being always nonnegative, (φ(x) φ (x))(f(x θ ) kf(x θ )) = 0 except on a set A of Lebesgue − ′ | 1 − | 0 measure zero, which leads to (4.1) (by considering f(x θ ) > kf(x θ ) or not). Since X has | 1 | 0 density, P (X A)=0 for θ = θ and θ = θ .  θ ∈ 0 1

Example. Let X Bin(2,θ). Consider H : θ = 1/2 vs H : θ = 3/4. Note that ∼ 0 a x 2 x x 2 x 2 3 1 − 2 1 1 − f(x θ )= > k = kf(x θ ) | 1 x 4 4 x 2 2 | 0           is equivalent to 3x > 4k . { } (i) The case k 9/4 and the case k < 1/4 correspond to R = and Ω, respectively. ≥ ∅ The according UMP level α = 0 and α = 1. (ii) If 1/4 k < 3/4, then R = 1, 2 , and α = P (Bin(2, 1/2) = 1 or 2) = 3/4. ≤ { } (iii) If 3/4 k < 9/4, then R = 2 . The level α = P (Bin(2, 1/2) = 2) = 1/4. ≤ { }

13 This example says that if we firmly want a size α test, then there are only two such α’s: α = 1/4 and α = 3/4.

Example. Let X , , X be a random sample from N(θ,σ2) with σ known. Look 1 · · · n at the test H : θ = θ vs H : θ = θ , where θ > θ . The density function f(x θ,σ) = 0 0 1 1 0 1 | (1/√2πσ) exp( (x θ)2/(2σ2)). Now f(x θ ) > kf(θ ) is equivalent to that R = X¯ < c . − − | 1 0 { } ¯ Set α = Pθ0 (X < c). One can check easily that c = θ0 + (σzα/√n). So the UMP rejection region is σ R = X<θ¯ + z .  0 √n α   LEMMA 4.1 Let X be a random variable. Let also f(x) and g(x) be non-decreasing real functions. Then

E(f(X)g(X)) Ef(X) Eg(X) ≥ · provided all the above three expectations are finite.

Proof. Let the µ be the distribution of X. Noting that (f(x) f(y))(g(x) g(y)) 0 for − − ≥ any x and y. Then

0 (f(x) f(y))(g(x) g(y)) µ(dx) µ(dy) ≤ − − ZZ = 2 f(x)g(x) µ(dx) 2 f(x)g(y) µ(dx) µ(dy) − Z Z = 2[E(f(X)g(X)) Ef(X) Eg(X)]. − · The desired result follows. 

Let f(x θ), θ R be a pmf or pdf with common support, that is, the set x : f(x θ) > 0 | ∈ { | } is identical for every θ. This family of distributions is said to have monotone likelihood ratio (MLR) with respect to a statistic T (x) if

fθ2 (x) on the support is a non-decreasing function in T (x) for any θ2 >θ1. (4.5) fθ1 (x) LEMMA 4.2 Let X be a random vector with X f(x θ), θ R, where f(x θ) is a pmf ∼ | ∈ | or pdf with a common support. Suppose f(x θ) has MLR with a statistic T (x). Let | 1 when T (x) > C, φ(x)= γ when T (x)= C, (4.6)   0 when T (x)

Proof. Thanks to MLT, for any θ1 < θ2 there exists a nondecreasing function g(t) such that f(x θ )/f(x θ )= g(T (x)), where g(t) is a nondecreasing function. Thus | 2 | 1 f(x θ ) E φ(X)= φ(x) | 2 f(x θ ) dx = E (g(T (X)) h(T (X))) , θ2 f(x θ ) | 1 θ1 · Z | 1 where

1 when x>C, h(x)= γ when x = C,   0 when x

E (g(T (X)) h(T (X))) E g(T (X)) E h(T (X)) = E φ(X) θ1 · ≥ θ1 · θ1 θ1  since Eθ1 g(T (X)) = 1.

In (4.6), if T (X) > C reject H0; if T (X) < C, don’t reject H0; if T (X) = C, with chance γ reject H and with chance 1 γ don’t reject H . The last case is equivalent to 0 − 0 flipping a coin with chance γ for head and 1 γ for tail. If the head occurs, reject H ; − 0 otherwise, don’t reject H0. This is a . The purpose is to make the test UMP. The randomization occurs only when T (X) is discrete.

THEOREM 6 Consider testing H : θ θ vs H : θ>θ . Let X f(x θ), θ R, where 0 ≤ 0 1 0 ∼ | ∈ f(x θ) is a pmf or pdf with a common support. Also let T be a sufficient statistic for θ. If | f(x θ) has MLR w.r.t. T (x), then φ(X) in (4.6) is a UMP level α test, where α = E φ(X). | θ0

Proof. Let β(θ)= Eθφ(X). Then by the previous lemma, supθ θ0 β(θ)= Eθ0 φ(X)= α. ≤

We need to show β(θ1) β′(θ1) for any θ1 > θ0 and β′ such that supθ θ0 β′(θ) α. ≥ ≤ ≤ Consider test H0 : θ = θ0 vs Ha : θ = θ1. We know that

α = P (X R)= E φ(X) and β′(θ ) α. (4.7) θ0 ∈ θ0 1 ≤ Again, we use the same notation as in Lemma 4.2, f(x θ )/f(x θ ) = g(T (x)), where | 1 | 0 g(t) is a non-decreasing function in t. Take k = g(C) in the Neyman-Pearson Lemma 5.

15 If f(x θ )/f(x θ ) > k, since g(t) is non-decreasing, then T (X) > C, which is the same | 1 | 0 as saying to reject H by the functional form of φ(x) as in (4.6). If f(x θ )/f(x θ ) < k 0 | 1 | 0 then T (X) < C, that is, not to reject H0. By Neyman-Pearson lemma and (4.7), φ(x) corresponds to a UMP level-α test. Thus, β(θ ) β (θ ).  1 ≥ ′ 1

16 5 Probability convergence

Let (Ω, , P ) be a probability space. Let Also X, X ; n 1 be a sequence of ran- F { n ≥ } dom variables defined on this probability space. We say X X in probability or X is n → n consistent with X if

P ( X X ǫ) 0 as n | n − |≥ → →∞ for any ǫ> 0. Example. Let X ; n 1 be a sequence of independent random variables with mean { n ≥ } zero and one. Set S = n X , n 1. Then S /n 0 in probability as n . n i=1 i ≥ n → →∞ In fact, P S E(S )2 1 P n ǫ n = 0 n ≥ ≤ nǫ2 nǫ2 →   by Chebyshev’s inequality.

Recall X and X , n 1 are real measurable functions defined on (Ω, ). We say X n ≥ F n converges to X almost surely if

P (ω Ω : lim Xn(ω)= X(ω)) = 1. n ∈ →∞ For a sequence of events E ; n 1 in . Define n ≥ F ∞ E ; i.o. = E . { n } k n=1 k n \ [≥ Here “i.o.” “infinitely often”. If ω E ; i.o. , then there are infinitely many n ∈ { n } such that ω E . Conversely, if ω / E ; i.o. , or ω E ; i.o. c, then only finitely many ∈ n ∈ { n } ∈ { n } En contain ω. In proving almost convergence, the following so-called Borel-Cantelli lemma is very powerful.

LEMMA 5.1 (Borel-Cantelli Lemma). Let E ; n 1 be a sequence of events in . If n ≥ F ∞ P (E ) < (5.1) n ∞ nX=1 then P (E occurs only finite times)= P ( E , i.o. c) = 1. n { n }

Proof. Recall P (En)= E(IEn ). Let X(ω)= n∞=1 IEn (ω). Then

∞ ∞ P EX = E I = P (E ) < . En n ∞ Xi=1 Xi=1 17 Thus, X < a.s.  ∞

LEMMA 5.2 (Second Borel-Cantelli Lemma). Let E ; n 1 be a sequence of independent n ≥ events in . If F ∞ P (E )= (5.2) n ∞ n=1 X then P (En, i.o.) = 1.

Proof. First,

E , i.o. c = Ec. { n } k n k n [ \≥

With pk denoting P (Ek), we have that

P Ec = (1 p ) exp p = 0  k − k ≤ − k k n k n k n \≥ Y≥ X≥     as n , where we used the inequality that 1 x e x for any x R. Thus, →∞ − ≤ − ∈

P ( E , i.o. c) P Ec = 0.  { n } ≤  k n 1 k n X≥ \≥   As an application, we have that Example. Let X ; n 1 be a sequence of i.i.d. random variables with mean µ and { n ≥ } finite fourth moment. Then S n µ a.s. n → as n . →∞ Proof. W.L.O.G., assume µ = 0. Let ǫ> 0. Then by Chebyshev’s inequality

S E(S )4 P n ǫ n . n ≥ ≤ n4ǫ4  

Now, by independence,

4 4 E(S4) = E X4 + X2X2 + X X3 n  k 2 i j 2 i j  k   1 i=j n   1 i=j=n X ≤X6 ≤ ≤X6 = nE(X4) + 6n(n 1)E(X2).  1 − 1

18 Sn 2 Sn Thus, P n ǫ = O(n− ). Consequently, n 1 P n ǫ < for any ǫ > 0. This ≥ ≥ ≥ ∞ means that, with probability one, Sn/n ǫ forP sufficiently large n. Therefore, | |≤

S lim sup | n| ǫ a.s. n n ≤ for any ǫ> 0. Let ǫ 0. The desired conclusion follows.  ↓

By using a truncation method, we have the following Kolmogorov’s Strong Law of Large Numbers.

THEOREM 7 Let X ; n 1 be a sequence of i.i.d. random variables with mean µ. { n ≥ } Then S n µ a.s. n → as n . →∞ There are many variations of the above strong law of large numbers. Among them, Etemadi showed that Theorem 7 holds if X ; n 1 is a sequence of pairwise independent { n ≥ } and identically distributed random variables with mean µ. The next proposition tells the relationship between convergence in probability and con- vergence almost surely.

PROPOSITION 5.1 Let X and X be random variables taking values in Rk. If X n n → X a.s. as n , then X converges to X in probability; second, X converges to X in → ∞ n n probability iff for any subsequence of X there exists a further subsequence, say, X ; k { n} { nk ≥ 1 such that X X a.s. as k . } nk → →∞ Proof. The joint almost sure convergence and joint convergence in probability are equiv-

alent to the corresponding pointwise convergences. So W.L.O.G., assume Xn and X are one-dimensional. Given ǫ> 0. The almost sure convergence implies that

P ( X X ǫ occurs only for finitely many n) = 1. | n − |≥ This says that P ( X X ǫ, i.o.) = 0. But | n − |≥

lim P ( Xn X ǫ) lim P Xk X ǫ = P ( Xn X ǫ, i.o.) = 0 n | − |≥ ≤ n  {| − |≥ } | − |≥ k n [≥   19 where we use the fact that

X X ǫ X X ǫ, i.o. . {| k − |≥ } ↑ {| n − |≥ } k n [≥ Now suppose X X in probability. W.L.O.G., we assume the subsequence is X n → n itself. Then for any ǫ> 0, there exists nǫ > 0 such that supn nǫ P ( Xn X ǫ) < ǫ. We ≥ | − |≥ then have a subsequence n such that P ( X X 2 k) < 2 k for each k 1. By the k | nk − | ≥ − − ≥ Borel-Cantelli lemma

k P X X < 2− for sufficiently large k = 1 | nk − |   which implies that X X a.s. as k . nk → →∞ Now suppose Xn doesn’t converge to X in probability. That is, there exists ǫ0 > 0 and subsequence X ; l 1 such that { kl ≥ } P ( X X ǫ ) ǫ | kl − |≥ 0 ≥ 0 for all l 1. Therefore any subsequence of X can not converge to X in probability, hence ≥ kl it can not converge to X almost surely. 

Example Let X , i 1 be a sequence of i.i.d. random variables with mean zero and i ≥ variance one. Let S = X + + X , n 1. Then S /√2n log log n 0 in pobability but n 1 · · · n ≥ n → it does not converge to zero almost surely. Actually lim supn Sn/√2 log log n = 1 a.s. →∞ PROPOSITION 5.2 Let X ; k 1 be a sequence of i.i.d. random variables with X { k ≥ } 1 ∼ N(0, 1). Let Wn = max1 i n Xi. Then ≤ ≤ W lim n = √2 a.s. n →∞ √log n Proof. Recall

x x2/2 1 x2/2 e− P (X1 x) e− (5.3) √2π (1 + x2) ≤ ≥ ≤ √2πx for any x> 0. Given ǫ> 0, let bn = (√2+ ǫ)√log n. Then

n 2 1 P (W b ) nP (X b ) e bn/2 n n 1 n − ǫ2/2 ≥ ≤ ≥ ≤ bn ≤ n

2 for n 3. Let n = [k4/ǫ ]+1 for k 1. Then, ≥ k ≥ 1 P (W b ) nk ≥ nk ≤ k2

20 for large k. By the Borel-Cantelli lemma, P (W b , i.o.) = 0. This implies that nk ≥ nk W lim sup nk √2+ ǫ a.s. k bnk ≤ For any n n there exists k such that n n

This proves the upper bound. Now we turn to prove the lower bound. Choose ǫ (0, √2). ∈ Set c = (√2 ǫ)√log n. Then n −

n n nP (X1>cn) P (W c )= P (X c ) = (1 P (X > c )) e− . n ≤ n 1 ≤ n − 1 n ≤ By (5.3),

ncn c2 /2 C 1 (√2 ǫ)2/2 nP (X > c ) e− n n − − 1 n 2 ≥ √2π (1 + cn) ∼ √2π log n as n is sufficiently large, where C is a constant. Since 1 (√2 ǫ)2/2 > 0, − − P (W c ) < . n ≤ n ∞ n 2 X≥ By the Borel-Cantelli again, P (W c , i.o.) = 0. This implies that n ≤ n W lim inf n √2 ǫ a.s. n √log n ≥ − Let ǫ 0. We obtain ↓ W lim inf n √2 a.s. n √log n ≥

This together with (5.4) implies the desired result. 

The above results also hold for random variables taking abstract values. Now let’s introduce such random variables.

21 A metric space is a set D equipped with a metric. A metric is a map d : D D [0, ) × → ∞ with the properties

(i) d(x,y) = 0 if and only if x = y; (ii) d(x,y)= d(y,x); (iii) d(x, z) d(x,y)+ d(y, z) ≤ for any x,y and z. A semi-metric satisfies (ii) and (iii) , but not necessarily (i). An open ball is a set of the form y : d(x,y) < r . A subset of a metric space is open if and only if it { } is the union of open balls; it is closed if and only if the complement is open. A sequence xn converges to x if and only if d(x ,x) 0, and denoted by x x. The closure A¯ of a set n → n → A is the set of all points that are the limit of a sequence in A; it is the smallest closed set containing A. The interior A˚ is the collection of all points x such that x G A for some ∈ ⊂ open set G; it is the largest open set contained in A. A function f : D E between metric → spaces is continuous at a point x if and only if f(x ) f(x) for every sequence x x; n → n → it is continuous at every x if and only if f 1(G) is also open for every open set G E. A − ⊂ subset of a metric space is dense if and only if its closure is the whole space. A metric space is separable if and only if it has a countable dense subset. A subset K of a metric space is compact if and inly if it is closed and every sequence in K has a convergent subsequence. A set K is totally bounded if and only if for every ǫ> 0 it can be covered by finitely many balls of radius ǫ > 0. The space is complete if and only if any Cauchy sequence has a limit. A subset of a complete metric space is compact if and only if it is totally bounded and closed. A norm space D is a vector space equipped with a norm. A norm is a map : D k · k → [0, ) such that for every x,y in D, and α R, ∞ ∈ (i) x + y x + y ; k k ≤ k k k k (ii) αx = α x ; k k | |k k (iii) x = 0 if and only if x = 0. k k Definition. The Borel σ-field on a metric space D is the smallest σ-field that contains the open sets. A function taking values in a metric space is called Borel-measurable if it is measurable relative to the Borel σ-field. A Borel-measurable map X :Ω D defined on a → probability space (Ω, , P ) is referred to as a random element with values in D. F The following lemma is obvious.

LEMMA 5.3 A continuous map between two metric spaces is Borel-measurable.

22 Example. Let C[a, b]= f(x); f is a continuous function on [a, b] . Define { } f = sup f(x) . k k a x b | | ≤ ≤ This is a complete normed linear space which is also called a Banach space. This space is separable. Any separable Banach space is isomorphic to a subset of C[a, b].

Remark. Proposition 5.1 still holds when the random variables taking values in a Polish space. Let X, X ; n 1 be a sequence of random variables. We say X converges to X { n ≥ } n weakly or in distribution if

P (X x) P (X x) n ≤ → ≤ as n for every continuous point x, that is, P (X x)= P (X

LEMMA 5.4 Let Xn and X be random variables. The following are equivalent:

(i) Xn converges weakly to X; (ii) Ef(X ) Ef(X) for any bounded continuous function f(x) defined on R; n → (iii) liminf P (X G) P (X G) for any open set G; n n ∈ ≥ ∈ (iv) limsup P (X F ) P (X F ) for any closed set F ; n n ∈ ≤ ∈ Remark. The above lemma is actually true for any finite dimensional space. Further, it is true for random variables taking values in separable, complete metric spaces, which are also called Polish spaces. Proof. (i) = (ii). Given ǫ > 0, choose K > 0 such that K and K are continuous ⇒ − point of all X ’s and X and P ( K X K) 1 ǫ. By (i) n − ≤ ≤ ≥ −

lim P ( Xn K) = lim P (Xn K) lim P (Xn K) = P (X K) P (X K) n | |≤ n ≤ − n ≤− ≤ − ≤− 1 ǫ ≥ − In other words

lim P ( Xn > K) ǫ. (5.5) n | | ≤ Since f(x) is continuous on [ K, K], we can cut the interval into, say, m subintervals − [ai, ai+1], 1 i m such that each ai is a continuous point of X and supa x a f(x) ≤ ≤ i≤ ≤ i+1 | − f(a ) ǫ. Set i |≤ m g(x)= f(a )I(a x < a ). i i ≤ i+1 Xi=1 23 It is easy to see that

f(x) g(x) ǫ for x K and f(x) g(x) f for x > K, | − |≤ | |≤ | − | ≤ k k | |

where f = supx R f(x) . It follows that k k ∈ | | E f(Y ) Eg(Y ) ǫ + f P ( Y > K) | − |≤ k k · | |

for any random variable Y. Apply this to Xn and X we obtain from (5.5) that

lim sup Ef(Xn) Eg(Xn) (1 + f )ǫ and Ef(X) Eg(X) (1 + f )ǫ. n | − |≤ k k | − |≤ k k Easily Eg(X ) Eg(X). Therefore by triangle inequality, n →

lim sup Ef(Xn) Ef(X) 2(1 + f )ǫ. n | − |≤ k k Then (ii) follows by letting ǫ 0. ↓ (ii) = (iii). Define f (x) = (md(x, Gc)) 1 for m 1. Since G is open, f (x) is a ⇒ m ∧ ≥ m continuous function bounded by one and f (x) 1 (x) for each x as m . It follows m ↑ G → ∞ that

lim inf P (Xn G) lim inf Efm(Xn) Efm(X) n ∈ ≥ n → for each m. (iii) follows by letting m . →∞ (iii) and (iv) are equivalent because F c is open. (iv) = (i). Let x be a continuous point. Then by (iii) and (iv) ⇒

lim sup P (Xn x) P (X x)= P (X

Example. Let X be uniformly distributed over 1/n, 2/n, ,n/n . Prove that X con- n { · · · } n verges weakly to U[0, 1]. Actually, for any bounded continuous function f(x),

n 1 i 1 Ef(Xn)= f f(t) dt = Ef(Y ) n n → 0 Xi=1   Z where Y U[0, 1]. ∼

24 Example. Let X , i 1 be i.i.d. random variables. Define i ≥ n n 1 1 # i : X A µ = δ through µ (A)= I(X A)= { i ∈ } n n Xi n n i ∈ n Xi=1 Xi=1 for any set A. Then, for any measurable function f(x), by the strong law of large numbers,

n 1 f(x)µ ( dx)= f(X ) Ef(X )= f(x)µ( dx) n n i → 1 Z Xi=1 Z where µ = (X ). This concludes that µ µ weakly as n . L 1 n → →∞

COROLLARY 5.1 Suppose Xn converges to X weakly and g(x) is a continuous map de- 1 fined on R (not necessarily one-dimensional). Then g(Xn) converges to g(X) weakly.

The following two results are called Delta methods.

THEOREM 8 Let Y be a sequence of random variables that satisfies √n(Y θ) = n n − ⇒ N(0,σ2). Let g be a real-valued function such that g (θ) = 0. Then ′ 6 2 2 √n(g(Y ) g(θ)) = N(0,σ (g′(θ) )). n − ⇒ Proof. By Taylor’s expansion g(x)= g(θ)+ g (θ)(x θ)+ a(x θ) for some real function ′ − − a(x) such that a(t)/t 0 as t 0. Then, → →

√n(g(Y ) g(θ)) = g′(θ)√n(Y θ)+ √n a(Y θ). n − n − · n − We only need to show that √n a(Y θ) 0 in probability as n (latter we will see · n − → →∞ this easily by Slusky’s lemma. But why now?) Indeed, for any m 1 there exists a δ > 0 ≥ such that a(x) x /m for all x δ. Thus, | | ≤ | | | |≤ P ( √n a(Y θ) b) P ( Y θ > δ)+ P ( √n(Y θ) mb) | · n − |≥ ≤ | n − | | n − |≥ P ( N(0,σ2) mb) → | |≥ as n . Now let m . we obtain that √n a(Y θ) 0 in probability.  →∞ ↑∞ · n − →

Similarly, one can prove that

THEOREM 9 Let Y be a sequence of random variables that satisfies √n(Y θ) = n n − ⇒ 2 N(0,σ ). Let g be a real-valued function such that g′′(x) exists around x = θ with g′(θ) = 0 and g (θ) = 0. Then ′′ 6 1 2 2 n(g(Y ) g(θ)) = σ g′′(θ)χ . n − ⇒ 2 1

25 Proof. By Taylor’s expansion again,

2 1 2 Yn θ 2 n(g(Y ) g(θ)) = σ g′′(θ) n − + o(n Y θ ). n − 2 · σ | n − | "   # This leads to the desired conclusion by the same arguments as in Theorem 8. 

PROPOSITION 5.3 Let X, X ; n 1 be a sequence of random variables on Rk. Look { n ≥ } at the following three statements:

(i) Xn converges to X almost surely;

(ii)Xn converges to X in probability;

(iii) Xn converges to X weakly. Then (i) = (ii) = (iii). ⇒ ⇒ Proof. The implication “(i) = (ii)” is proved earlier. Now we prove that (ii) = (iii). ⇒ ⇒ For any closed set F let

F ǫ = x : inf d(x,y) ǫ . { y F ≤ } ∈ Then F ǫ is a closed set. When X takes values in R, the metric d(x,y)= x y . It is easy | − | to see that

X F X F 1/k X X 1/k { n ∈ } ⊂ { ∈ } ∪ {| n − |≥ } for any integer k 1. Therefore ≥ P (X F ) P (X F 1/k)+ P ( X X 1/k). n ∈ ≤ ∈ | n − |≥ Let n we have that →∞ 1/k lim sup P (Xn F ) P (X F ) n ∈ ≤ ∈

Note that F 1/k F as k . (iii) follows by letting k .  ↓ →∞ →∞

Any random vector X taking values in Rk is tight: For every ǫ > 0 there exists a number M > 0 such that P ( X > M) < ǫ. A set of random vectors X ; α A is called k k { α ∈ } uniformly tight if there exists a constant M > 0 such that

sup P ( Xα > M) < ǫ. α A k k ∈

26 A random vector X taking values in a polish space (D, d) is also tight: For every ǫ> 0 there exists a compact set K D such that P (X / K) < ǫ. This will be proved later. ⊂ ∈

Remark. It is easy to see that a finite number of random variables is uniformly tight if every individual is tight. Also, if X ; α A and X ; α B are uniformly tight, respectively, { α ∈ } { α ∈ } then X , α A B is also tight. Further, if X = X, then X, X ; n 1 is uniformly { α ∈ ∪ } n ⇒ { n ≥ } tight. Actually, for any ǫ > 0, choose M > 0 such that P ( X M) 1 ǫ. Since the k k ≥ ≥ − k set x R , x > M is an open set, lim infn P ( Xn > M) P ( X > M) 1 ǫ. { ∈ k k } →∞ k k ≥ k k ≥ − The tightness of X, X follows. { n} LEMMA 5.5 Let µ be a probability measure on (D, ), where (D, d) is a Polish space and B is the Borel σ-algebra. Then µ is tight. B Proof. Since (D, d) is Polish, it is separable, there exists a countable set x ,x , D { 1 2 ···}⊂ such that it is dense in D.

Given ǫ > 0. For any integer k 1, since n 1B(xn, 1/k) = D, there exists nk such ≥ ∪ ≥ nk k D nk D that µ( n=1B(xn, 1/k)) > 1 ǫ/2 . Set k = n=1B(xn, 1/k) and M = k 1 k. Then M ∪ − ∪ ∩ ≥ is totally bounded (M is always covered by a finite number of balls with any prescribed radius). Also

∞ ∞ ǫ µ(M c) µ(Dc ) = ǫ. ≤ k ≤ 2k Xk=1 Xk=1 Let K be the closure of M. Since D is complete, M K D and K is compact. Also by ⊂ ⊂ the above fact, µ(Kc) < ǫ. 

LEMMA 5.6 Let X ; n 1 be a sequence of random variables. Let F be the cdf of X . { n ≥ } n n Then there exists a subsequence F with the property that F (x) F (x) at each continuity nj nj → point of x of a possibly defective distribution function F. Further, if X is tight, then F (x) { n} is a cumulative distribution function.

Proof. Put all rational numbers q ,q , in a sequence. Because the sequence F (q ) 1 2 · · · n 1 is contained in the interval [0, 1], it has a converging subsequence. Call the indexing sub- sequence n1 and the limit G(q ). Next, extract a further subsequence n2 n1 { j }j∞=1 1 { j } ⊂ { j } along which F (q ) converges to a limit G(q ), a further subsequence n3 n2 along n 2 2 { j } ⊂ { j } which F (q ) converges to a limit G(q ), , and so forth. The tail of the diagonal sequence n 3 3 · · · n := nj belongs to every sequence ni . Hence F (q ) G(q ) for every i = 1, 2, . Since j j j nj i → i · · · each F is nondecreasing, G(q) G(q ) if q q . Define n ≤ ′ ≤ ′ F (x)= inf G(q). q>x

27 It is not difficult to verify that F (x) is decreasing and right-continuous at every point x. It is also easy to show that F (x) F (x) at each continuity point of x of F by monotonicity nj → of F (x). Now assume X is tight. For given ǫ (0, 1), there exists a rational number M > 0 { n} ∈ such that P ( M < X M) 1 ǫ. Equivalently, F (M) F ( M) 1 ǫ for all n. − n ≤ ≥ − n − n − ≥ − Passing the subsequence n to . We have that G(M) G( M) 1 ǫ. By definition, j ∞ − − ≥ − F (M) F ( M) 1 ǫ. Therefore F (x) is a cdf.  − − ≥ −

THEOREM 10 (Almost-sure representations). If X = X then there exist random n ⇒ 0 variables Y on ([0, 1], [0, 1]) such that (Y ) = (X ) for each n 0 and Y Y n B L n L n ≥ n → 0 almost surely.

Proof. Let Fn(x) be the cdf of Xn. Let U be a random variable uniformly distributed on ([0, 1], [0, 1]). Define Y = F 1(U) where the inverse here is the generalized inverse of a B n n− non-decreasing function defined by

1 F − (t) = inf x : F (x) t n { n ≥ }

1 1 for 0 t 1. One can verify that F (t) F − (t) for every t such that F (x) is continuous ≤ ≤ n− → 0 0 at t. We know that the set of the discontinuous points is countable. Therefore the Lebesgue 1 1 measure is zero. This shows that Y = F (U) F − (U)= Y almost surely. n n− → 0 0 It is easy to check that (X )= (Y ) for all n = 0, 1, 2, .  L n L n · · · LEMMA 5.7 (Slusky). Let a , b and X are sequences of random variables such { n} { n} { n} that a a in probability and b b in probability and X = X for constants a and b n → n → n ⇒ and some random variable X. Then a X + b = aX + b as n . n n n ⇒ →∞ Proof. We will show that (a X +b ) (aX +b) 0 in probability. If this is the case, n n n − n → then the desired conclusion follows from the weak convergence of Xn (why?). Given ǫ> 0. Since X is convergent, it is tight. So there exists M > 0 such that P ( X M) ǫ. { n} | n| ≥ ≤ Moreover, by the given condition, ǫ P a a <ǫ and P ( b b ǫ) <ǫ | n − |≥ 2M | n − |≥   as n is sufficiently large. Note that

(a X + b ) (aX + b) a a X + b b . | n n n − n | ≤ | n − || n| | n − |

28 Thus,

P ( (anXn + bn) (aXn + b) 2ǫ) | ǫ− |≥ P a a + P ( X M)+ P ( b b ǫ) < 3ǫ ≤ | n − |≥ 2M | n|≥ || n − |≥   as n is sufficiently large. 

Remark. All the above statements of weak convergence of random variables Xn can be replaced by their distributions µ := (X ). So “X = X” is equivalent to “µ = µ”. n L n n ⇒ n ⇒

THEOREM 11 (L´evy’s continuity theorem). Let µ ; n = 0, 1, 2, be a sequence of prob- n · · · k ability measures on R with characteristic function φn. We have that (i) If µ = µ then φ (t) φ (t) for every t Rk; n ⇒ 0 n → 0 ∈ (ii) If φn(t) converges pointwise to a limit φ0(t) that is continuous at 0, then the as-

sociated sequence of distributions µn is tight and converges weakly to the measure µ0 with

characteristic function φ0.

Remark. The condition that “φ (t) is continuous at 0” is essential. Let X 0 n ∼ N(0,n), n 1. So P (Xn x) 1/2 for any x, which means that Xn doesn’t converges ≥ ≤ 2 → weakly. Note that φ (t)= e nt /2 0 as n and φ (0) = 1. So φ (t) is not continuous. n − → →∞ n 0 If one ignores the continuity condition at zero, the conclusion that Xn converges weakly may wrongly follow.

Proof of Theorem 11. (i) is trivial.

(ii) Because marginal tightness implies joint tightness. W.L.O.G., assume that µn is a

probability measure on R and generated by Xn. Note that for every x and δ > 0,

sin δx 1 δ 1 δx > 2 2 1 = (1 cos tx) dt. {| | }≤ − δx δ δ −   Z−

Replace x by Xn, take expectations, and use Fubini’s theorem to obtain that

δ δ 2 1 itXn 1 P Xn > (1 Ee ) dt = (1 φn(t)) dt. | | δ ≤ δ δ − δ δ −   Z− Z− By the given condition,

2 1 δ lim sup P Xn > (1 φ0(t)) dt. n | | δ ≤ δ δ −   Z− The right hand side above goes to zero as δ 0. So tightness follows. ↓

29 To prove the weak convergence, we have to show that f(x) du f(x) dµ for every n → bounded continuous function f(x). It is equivalent to thatR for any subsequence,R there is a further subsequence, say n , such that f(x) du f(x) dµ . k nk → 0 For any subsequence, by Lemma 5.6,R the Helly selectionR principle, there is a further

subsequence, say µnk , such that it converges to µ0 weakly, by the earlier proof, the c.f. of

µ0 is φ0. We know that a c.f. uniquely determines a distribution. This says that every limit is identical. So f(x) du f(x) dµ .  nk → 0 THEOREM 12R (Central LimitR Theorem). Let X , X , , X be a sequence of i.i.d. random 1 2 · n variables with mean zero and variance one. Let X¯ = (X + + X )/n. Then √nX¯ = n 1 · · · n n ⇒ N(0, 1).

2 Proof. Let φ(t) be the c.f. of X1. Since EX1 = 0 and EX1 = 1 we have that φ′(0) = iEX = 0 and φ (0) = i2EX2 = 1. By Taylor’s expansion 1 ′′ 1 − t t2 1 φ = 1 + o √n − 2n n     as n is sufficiently large. Hence n 2 n 2 it√nX¯n t t 1 t /2 Ee = φ = 1 + o e− √n − 2n n →      as n . By Levy’s continuity theorem, the desired conclusion follows.  →∞

To deal with multivariate random variables, we have the next tool.

THEOREM 13 (Cram´er-Wold device). Let Xn and X be random variables taking values in Rk. Then

X = X if and only if tT X = tT X for all t Rk. n ⇒ n ⇒ ∈ Proof. By Levy’s continuity Theorem 11, X = X if and only if E exp(itT X ) n ⇒ n → E exp(itT X) for each t Rk, which is equivalent to E exp(iu(tT X )) E exp(iu(tT X)) for ∈ n → T T any real number u. This is same as saying that the c.f. of t Xn converges to that of t X for each t Rk. It is equivalent to that tT X = tT X for all t Rk.  ∈ n ⇒ ∈

As an application, we have the multivariate analogue of the one-dimensional CLT.

THEOREM 14 Let X , X , be a sequence of i.i.d. random vectors in Rk with mean 1 2 · · · vector µ = EX and matrix Σ= E(X µ)(X µ)T . Then 1 1 − 1 − n 1 (X µ)= √n(Y¯ µ)= N (0, Σ). √n i − n − ⇒ k Xi=1

30 Proof. By the Cram´er-Wold device, we need to show that

n 1 (tT X tT µ)= N(0,tT Σt) √n i − ⇒ Xi=1 for ant t Rk. Note that tT X tT µ, i = 1, 2, are i.i.d. real random variables. By ∈ { i − ···} the one-dimensional CLT,

n 1 (tT X tT µ)= N(0, Var(tT X )). √n i − ⇒ 1 Xi=1 T T This leads to the desired conclusion since Var(t X1)= t Σt.  We state the following theorem without giving the proof. The spirit of the proof is similar to that of Theorem 12. It is called Lindeberg-Feller Theorem.

THEOREM 15 Let k ; n 1 be a sequence of positive integers and k + as n . n ≥ } n → ∞ →∞ For each n let Y , ,Y be independent random vectors with finite such that n,1 · · · n,kn

kn kn Cov (Y ) Σ and E Y 2I Y >ǫ 0 n,i → k n,ik {k n,ik }→ Xi=1 Xi=1 for every ǫ> 0. Then the sequence Σkn (Y EY )= N(0, Σ). i=1 n,i − n,i ⇒

The Central Limit Theorems hold not only for independent random variables, it is also true for some other dependent random variables of special structures. Let Y ,Y , be a sequence of random variables. A sequence of σ-algebra satisfies 1 2 · · · Fn that σ(Y , ,Y ) and is non-decreasing for all n 0, where = , Ω . Fn ⊃ 1 · · · n Fn ⊂ Fn+1 ≥ F0 {∅ } When E(Y ) = 0 for any n 0, we say Y ; n 1 are martingale differences. n+1|Fn ≥ { n ≥ } THEOREM 16 For each n 1, let X , 1 i k be a sequence of martingale differ- ≥ { ni ≤ ≤ n} ences relative to nested σ-algebras , 1 i k , that is, for i < j, such {Fni ≤ ≤ n} Fni ⊂ Fnj that 1. max X ; n 1 is uniformly integrable; { i | ni| ≥ } 2. E (max X ) 0; i | ni| → 3. X2 1 in probability. i ni → Then S = X N(0, 1) in distribution. Pn i ni → P Example. Let X ; i 1 be i.i.d. with mean zero and variance one. Then one can i ≥ verify that the required conditions in Theorem 16 hold for X = X /√n for i = 1, 2, ,n. ni i · · ·

31 Therefore we recollect the classical CLT that n X /√n = N(0, 1). Actually, i=1 i ⇒ P n 2 P (max Xni ǫ) nP ( X1 √nǫ) E X1 I X1 √nǫ 0 and i | |≥ ≤ | |≥ ≤ nǫ2 | | {| |≥ }→ n 2 2 i=1 Xi E(max Xni ) E = 1 i | | ≤ n P This shows that max X 0 in probability and max X , n 1 is uniformly inte- i | ni| → { i | ni| ≥ } grable. So condition 2 holds.

LEMMA 5.8 Let U ; n 1 and T ; n 1 be two sequences of random variables such { n ≥ } n ≥ } that 1. U a in probability, where a is a constant; n → 2. T is uniformly integrable; { n} 3. U T is uniformly integrable; { n n} 4. ET 1. n → Then E(T U ) a. n n → Proof. Write T U = T (U a)+ aT . Then E(T U ) = E T (U a) + aET . Evi- n n n n − n n n { n n − } n dently, T (U a) is uniformly integrable by the given condition. Since T is uniformly { n n − } { n} integrable, it is not difficult to verify that T (U a) 0 in probability. Then the result n n − → follows. 

By the Taylor expansion, we have the following equality

ix x2/2+r(x) e = (1+ ix)e− (5.6) for some function r(x)= O(x3) as x . Look at the set-up in Theorem 16, we have that →∞ itSn e = TnUn, where

T = (1 + itX ) and U = exp( (t2/2) X2 + r(tX )). (5.7) n nj n − nj nj Yj Xj Xj We next obtain a general result on CLT by using Lemma 5.8.

THEOREM 17 Let X , 1 i k , n 1 be a triangular array of random variables. { nj ≤ ≤ n ≥ } Suppose 1. ET 1; n → 2. T is uniformly integrable; { n}

32 3. X2 1 in probability; j nj → 4. E(max X ) 0. P j | nj| → Then Sn = j Xnj converges to N(0, 1) in distribution. P itSn Proof. Since TnUn = e = 1, we know that TnUn is uniformly integrable. By Lemma | | | | 2 { } 5.8, it is enough to show that U e t /2 in probability. Review (5.7) and condition 3, it n → − suffices to show that r(tX ) 0 in probability. The assertion r(x) = O(x3) say that j nj → there exists A> 0 and δ > 0 such that r(x) A x3 as x < δ. Then P | |≤ | | | |

P ( r(tXnj) >ǫ) P (max Xnj δ)+ P ( r(tXnj) > ǫ, max Xnj < δ) | | ≤ j | |≥ | | j | | Xj Xj 1 2 δ− E(max Xnj )+ P (max Xnj Xnj >ǫ) 0 ≤ j | | j | | · | | → Xj as n by the given conditions.  →∞

Proof of Theorem 16. Define Zn1 = Xn1 and

j 1 − Z = X I( X2 2), 2 j k . nj nj nk ≤ ≤ ≤ n Xk=1 It is easy to check that Z is still a martingale relative to the original σ-algebra. Now { nj} 2 define J = inf j 1; 1 k j Xnk > 2 kn. Then { ≥ ≤ ≤ }∧ P P (X = Z for some k k ) = P (J k 1) nk 6 nk ≤ n ≤ n − P ( X2 > 2) 0 ≤ nk → 1 k kn ≤X≤ as n . Therefore, P (S = kn Z ) 0 as n . To prove the result, we only need →∞ n 6 j=1 nj → →∞ to show P

kn Z = N(0, 1). (5.8) nj ⇒ Xj=1

We now apply Theorem 17 to prove this. Replacing Xnj by Znj in (5.7), we have new Tn and

Un. Let’s literally verify the four conditions in Theorem 17. By a martingale property and iteration, ET = 1. So condition 1 holds. Now max Z max X 0 in probability n j | nj|≤ j | nj|→ as n . So condition 4 holds. It is also easy to check that Z2 1 in probability. → ∞ j nj → Then it remains to show condition 2. P

33 kn By definition, Tn = j=1(1 + itZnj)= 1 j J (1 + itZnj). Thus, ≤ ≤ Q J 1 Q − T = (1 + t2X2 )1/2 1+ tX | n| nk | nJ | kY=1 2 J 1 t − 2 exp Xnk (1 + t XnJ ) ≤ 2 ! | | · | | Xk=1 t2 e (1 + t max Xnj ) ≤ | | · j | | which is uniform integrable by condition 1 in Theorem 16. 

The next is an extreme limit theorem, which is different than the CLT. The limiting distribution is called an extreme distribution.

THEOREM 18 Let Xi; 1 i n be i.i.d. N(0, 1) random variables. Let Wn = max1 i n Xi. ≤ ≤ ≤ ≤ Then

1 (log2 n)+ x ex/2 P Wn 2 log n e− 2√π ≤ − 2√2 log n →   p for any x R, where log n = log(log n). ∈ 2

Proof. Let t be the right hand side of “ ” in the probability above. Then n ≤ P (W t )= P (X t )n = (1 P (X >t ))n. n ≤ n 1 ≤ n − 1 n Since (1 x )n e a as n if x a/n. To prove the theorem, it suffices to show that − n ∼ − →∞ n ∼ 1 P (X >t ) ex/2. (5.9) 1 n ∼ 2√πn

Actually, we know that

1 x2/2 P (X1 >x) e− ∼ √2πx as x + . It is easy to calculate that → ∞ 1 1 t2 x and n log n log( log n) √2πtn ∼ 2√π log n 2 ∼ − − 2 p as n . This leads to (5.9).  →∞

34 As an application of the CLT of i.i.d. random variables, we will be able to derive the classical χ2-test as follows. Let’s consider a multinomial distribution with n trails, k classes with parameter p = (p , ,p ). Roll a die twenty times. Let X be the number of the occurrences of “i dots”, 1 · · · k i 1 i 6. Then (X , X , , X ) follows a multinomial distribution with “success rate” ≤ ≤ 1 2 · · · 6 p = (p , ,p ). How to test the die is fair or not? From an introductory course, we know 1 · · · 6 that, the null hypothesis is H : p = = p = 1/6. The test statistic is 0 1 · · · 6 6 (X np )2 χ2 = i − i . npi Xi=1 We use the fact that χ2 is roughly χ2(5) as n is large. We will prove this next. In general, X = (X , , X ) follows a multinomial distribution with n trails and n n,1 · · · n,k “success rate” p = (p , ,p ). Of course, k X = n. To be more precise, 1 · · · k i=1 n,i P n! x P (X = x , , X = x )= pxn,1 p n,k , n,1 n,1 · · · n,k n,k x ! x ! 1 · · · k n,1 · · · n,k k where i=1 xn,i = n. Now we prove the following theorem, P THEOREM 19 As n , →∞ k 2 2 (Xi npi) 2 χn = − = χ (k 1). npi ⇒ − Xi=1 We need a lemma.

LEMMA 5.9 Let Y N (0, Σ). Then ∼ k k d Y 2 = λ Z2, k k i i Xi=1 where λ , , λ are eigenvalues of Σ and Z , , Z are i.i.d. N(0, 1). 1 · · · k 1 · · · k Proof. Decompose Σ = Odiag(λ , , λ )OT for some orthogonal matrix O. Then the k 1 · · · k coordinates of OY are independent with mean zero and variances λ ’s. So Y 2 = OY 2 i || k k k k 2  is equal to i=1 λiZi , where Zi’s are i.i.d. N(0, 1). P Another fact we will use is that AX = AX for any k k matrix A if X = X, where n ⇒ × n ⇒ k Xn and X are R -valued random vectors. This can be shown easily by the Cramer-Wold device.

35 Proof of the Theorem. Let Y , ,Y be i.i.d. with a multinomial distribution with 1 1 · · · n trails and “success rate” p = (p , ,p ). Then X = Y + Y . Each Y is a k-dimensional 1 · · · k n 1 · · · n i vector. By CLT, X EX n − n = N (0, Cov(Y )), √n ⇒ k 1 where it is easy to see that EXn = np and

p (1 p ) p p p p 1 − 1 − 1 2 ··· − 1 k  p1p2 p2(1 p2) p2pk  Σ1 := Cov(Y1)= − − ··· − .    ··· ··· ··· ···     p1pk p2pk pk(1 pk)  − − · · · −    Set D = diag(1/√p1, , 1/√p ). Then · · · k T √p1 √p1 p p T √ 2 √ 2 DΣ1D == I . . . −  .   .          √pk √pk         k Since i=1 pi = 1, the above matrix is symmetric and idempotent or a projection. Also, trace of Σ is k 1. So Σ has eigenvalue 0 with one multiplicity and 1 with k 1 multiplicity. P 1 − 1 − Therefore, by the lemma, we have that D(X np) n − = N(0, DΣ DT ). √n ⇒ 1

By the lemma, since g(x)= x 2 is a continuous function on Rk, k k k 2 2 k 1 (X np ) D(X np) − d i − i = n − = Z2 = χ2(k 1).  np √n ⇒ i − i=1 i i=1 X X

6 Central Limit Theorems for Stationary Random Variables and Markov Chains

Central limit theorems hold not only for independent random variables but also for depen- dent random variables. In the next we will study the Central Limit Theorems for α-mixing random variables and and martingale (a special class of dependent random variables). Let X , X , be a sequence of random variables. For each n 1, define α by 1 2 · · · ≥ n α(n) := sup P (A B) P (A)P (B) (6.10) | ∩ − |

36 where the supremum is taken over all A σ(X , , X ), B σ(X , X , ), and ∈ 1 · · · k ∈ k+n k+n+1 · · · k 1. Obviously, α(n) is decreasing. When α(n) 0, then X and X are approximately ≥ → k k+n independent. In this case the sequence X is said to be α-mixing. If the distribution of { n} the random vector (X , X , , X ) does not depend on n, the sequence is said to be n n+1 · · · n+j stationary. The sequence is called m-dependent if (X , , X ) and (X , , X ) are inde- 1 · · · k k+n · · · k+n+l pendent for any n>m, k 1 and l 1. In this case the sequence is α-mixing with α = 0 ≥ ≥ n for n>m. An independent sequence is 0-dependent. For , we have the following the Central Limit Theorems.

THEOREM 20 Suppose that X , X , is stationary with EX = 0 and X A for 1 2 · · · 1 | 1| ≤ some constant A> 0. Let S = X + + X . If ∞ α(n) < , then n 1 · · · n i=1 ∞ V ar(S ) P∞ n σ2 = E(X2) + 2 E(X X ) (6.11) n → 1 1 k+1 Xk=1 where the series converges absolutely. If σ > 0, then S /√n = N(0,σ2). n ⇒ THEOREM 21 Suppose that X , X , is stationary with EX = 0 and E X 2+δ < 1 2 · · · 1 | 1| ∞ δ/(2+δ) for some constant δ > 0. Let S = X + + X . If ∞ α(n) < , then n 1 · · · n i=1 ∞ V ar(S ) ∞ P n σ2 = E(X2) + 2 E(X X ) n → 1 1 k+1 Xk=1 where the series converges absolutely. If σ > 0, then S /√n = N(0,σ2). n ⇒ Easily we have the following corollary.

COROLLARY 6.1 Let X , X , be m-dependent random variables with the same distri- 1 2 · · · bution. If EX = 0 and EX2+δ < for some δ > 0. Then S /√n N(0,σ2), where 1 1 ∞ n → 2 2 m+1 σ = EX1 + 2 i=1 E(X1Xi).

Remark. If weP argue by using the truncation method as in the proof of Theorem 21, we can show that the above corollary still holds only under EX2 < . 1 ∞

LEMMA 6.1 If Y σ(X , , X ) and Z σ(X , X , , X ) for n m ∈ 1 · · · k ∈ k+n k+n+1 · · · k+m ≤ ≤ + . Assume Y C and Z D. Then ∞ | |≤ | |≤ E(Y Z) (EY )EZ 4CDα(n). | − |≤

37 Proof. W.l.o.g, assume that C = D = 1. Now since expectations can be approximated by finite sums, so w.l.o.g, we further assume that

m n Y = yiAi and Z = zjBj Xi=1 Xj=1 where Ai’s and Bj’s are respectively partitions of the sample space. Then

E(Y Z) (EY )EZ = y z (P (A B ) P (A )P (B )) | − | | i j i ∩ j − i j | Xi Xj z (P (A B ) P (A )P (B )) . ≤ | j i ∩ j − i j | Xi Xj Let ξ =1 or 1 depends on whether or not z (P (A B ) P (A )P (B ) 0 or not. Then i − j i ∩ j − i j ≥ the last sum is equal to ξ z (P (A B ) P (A )P (B )) which is also identical to i j i j i ∩ j − i j z ξ (P (A B ) P (A )P (B )). By the same argument, there are η equal to 1 or j j j i i ∩ j −P Pi j i 1 such that the last quantity is bounded by ξ η (P (A B ) P (A )P (B )). Let P− P j j i j i ∩ j − i j c E1 be the union of Ai such that ξi = 1 and E2 P= EP1; Let F1 and F2 be defined similarly for Bj according to the sign of ηj’s. Then

E(Y Z) (EY )EZ P (E F ) P (E )P (F ) 4α(n) | − |≤ | i ∩ j − i j |≤ 1 i,j 2 ≤X≤ LEMMA 6.2 If Y σ(X , , X ) and Z σ(X , X , ). Assume for some ∈ 1 · · · k ∈ k+n k+n+1 · · · δ > 0, E(Y 2+δ) C and E(Z2+δ) D. Then ≤ ≤ E(Y Z) (EY )EZ 5(1 + C + D)α(n)δ/(2+δ). | − |≤ Proof. Let Y = YI( Y a) and Y = YI( Y > a); and Z = ZI( Z a) and 0 | | ≤ 1 | | 0 | | ≤ Z = ZI( Z > a). Then 1 | | E(Y Z) (EY )EZ E(Y Z ) (EY )EZ . | − |≤ | i j − i j | 1 i,j 2 ≤X≤ By the previous lemma, the term corresponding to i = j = 0 is bounded by 4a2α(n). For i = j = 1, E(Y Z ) (EY )EZ = Cov(Y , Z ) which is bounded by E(Y 2)E(Z2) | 1 1 − 1 1| | 1 1 | 1 1 ≤ √CD/aδ since E(Y 2) C/aδ. Now 1 ≤ p E(Y Z ) (EY )EZ E( Y EY Z EZ ) 2a 2E Z | 0 1 − 0 1|≤ | 0 − 0| · | 1 − 1| ≤ · | 1| which is less than 4D/aδ. The term for i = 1, j = 0 is bounded by 4C/aδ. In summary

|E(YZ) − (EY)EZ| ≤ 4a²α(n) + (√(CD) + 4C + 4D)/a^δ ≤ 4a²α(n) + 4.5(C + D)/a^δ.

Now take a = α(n)^{−1/(2+δ)}. The conclusion follows.  □

Proof of Theorem 20. By Lemma 6.1, |E(X_1 X_{1+k})| ≤ 4A²α(k), so the series in (6.11) converges absolutely. Moreover, let ρ_k = E(X_1 X_{1+k}). Then

ES_n² = nEX_1² + 2 Σ_{1≤i<j≤n} ρ_{j−i} = nEX_1² + 2 Σ_{k=1}^{n−1} (n − k) ρ_k,    (6.12)

and hence Var(S_n)/n → σ². We next claim that

E(S_n⁴) = n³ δ_n    (6.13)

for a sequence of constants δ_n → 0. Actually, E(S_n⁴) ≤ Σ_{1≤i,j,k,l≤n} |E(X_i X_j X_k X_l)|. Among the terms in the sum there are n terms of the form X_i⁴, and at most n² terms of each of the forms X_i²X_j² and X_i³X_j with i ≠ j. So the contribution of these terms is bounded by Cn² for a universal constant C. There are only two types of terms left: X_i²X_jX_k and X_iX_jX_kX_l, where the indices are all different. We will deal with the second type of terms only; the first can be handled similarly and more easily. Note that

Σ_{1≤i,j,k,l≤n, all distinct} |E(X_i X_j X_k X_l)| ≤ 4! n Σ_{1≤i+j+k≤n} |E(X_1 X_{1+i} X_{1+i+j} X_{1+i+j+k})|    (6.14)

where the three indices i, j, k in the last sum are not zero. By Lemma 6.1, the term inside is bounded by 4A⁴ α_i; by the same argument, it is also bounded by 4A⁴ α_k. Therefore, the sum in (6.14) is bounded by

K n² Σ_{2≤i+k≤n} min{α_i, α_k} ≤ 2K n² Σ_{2≤i+k≤n, i≤k} min{α_i, α_k} ≤ 2K n² Σ_{k=1}^n k α_k = n² · n δ_n    (6.15)

for a sequence of constants δ_n such that 0 < δ_n → 0, where K is a constant depending only on A. This is because

Σ_{k=1}^n k α_k = Σ_{k≤√n} k α_k + Σ_{√n<k≤n} k α_k ≤ √n Σ_{k≥1} α_k + n Σ_{k>√n} α_k = o(n).

Therefore (6.13) follows. From this, we know that ES_n⁴ ≤ n³δ_n ≤ n³(δ_n + 1/n) ≤ n³ δ'_n, where δ'_n := sup{δ_k + 1/k; k ≥ n} is decreasing to 0 and is bounded below by 1/n. Thus, w.l.o.g., assume δ_n itself has these properties.

Define pn = [√n log(1/δ[√n])] and qn = [n/pn] and rn = [n/(pn + qn)]. Then

p2 δ 1 1 √n p √n log n and n pn δ log2 δ log2 0 (6.16) ≪ n ≤ n ≤ pn δ ≤ pn δ → [√n] !  pn  as n . Now for finite n, break X , X , , X into some blocks as follows: →∞ 1 2 · · · n X , , X X , , X X , , X , X 1 · · · pn | pn+1 · · · pn+qn | pn+qn+1 · · · 2pn+qn | ··· n For i = 1, 2, , r , set · · · n

Un,i = X(i 1)(pn+qn)+1 + X(i 1)(pn+qn)+2 + + Xi(pn+qn); − − · · ·

Vn,i = Xipn+(i 1)qn+1 + Xipn+(i 1)qn+2 + + Xi(pn+qn); − − · · · W = X + X + + X , n rn(pn+qn)+1 rn(pn+qn)+1 · · · n rn rn and Sn′ = i=1 Un,i and Sn′′ = i=1 Vn,i. In what follows, we will show that Sn′′ and Wn are negligible.P Now, by (6.12) andP its reasoning rn V ar(S′′) = r V ar(V ) + 2 (r k)E(V V ) n n n,1 n − n,1 n,k Xk=2 rn Cr q + 2r E(V V ) . ≤ n n n | n,1 n,k | Xk=2 2 2 By assumption, Vn,i Aqn for any i 1. By Lemma 6.1, E(Vn,1Vn,k) 4A qnα(k 1)pn . | |≤ ≥ | | ≤ − Thus, by monotoncity,

r r r (k 1)pn n n n − 2 2 2 2 1 E(Vn,1Vn,k) 4A qn α(k 1)pn 4A qn αj | |≤ − ≤ pn k=2 k=2 k=2 j=(k 2)pn+1 X X X −X 2 2 4A qn ∞ αj. ≤ pn Xj=1 Since q /p 0, the above two inequalities imply that n n → V ar(S ) n′′ 0 (6.17) n → as n . This says that S /√n 0 in probability. By Chebyshev’s inequality and (6.12) →∞ n′′ → we see that W /√n 0 in probability. Thus, to prove the theorem, it suffices to show that n → S /√n = N(0,σ2). This will be done through the following two steps n′ ⇒ S˜n ˜ = N(0,σ2) and EeitSn′ /√n EeitSn/√n 0 (6.18) √n ⇒ − →

40 for any t R, where S˜ = rn U and U ;ı=1, 2, , r are i.i.d. with the same ∈ n i=1 n,i′ { n,i′ · · · n} distribution as that of U . First, by (6.12), V ar(S˜ ) = r V ar(U ) r p σ2. It follows n,1 P n n n,1 ∼ n n that V ar(S˜ )/n σ2. Also, EU = 0. By (6.13) and (6.16), n → n,i′ rn 3 2 1 4 rn 4 rnpnδpn 4 pnδpn E(U ′ ) E(U )= σ− 0 2 n,i 2 4 n,1 2 4 V ar(S˜n) ∼ n σ n σ ∼ n → Xi=1 as n . By the Lyapounov , we know the first assertion in (6.18) → ∞ holds. Last, by Lemma 6.12,

rn EeitSn′ /√n (EeitUn,1/√n)E exp(it U /√n) 16α(q ). | − n,j |≤ n Xj=2 rn Then repeat this inequality for E exp(it j=2 Un,j/√n) and so on, we obtain that

P rn ˜ itU /√n EeitSn′ /√n EeitSn/√n = EeitSn′ /√n Ee n,j′ 16α(q )r . | − | | − |≤ n n jY=1

Since α(n) is decreasing and n 1 α(n) < , we have nα(n)/2 [n/2] i n α(i) 0. ≥ ∞ ≤ ≤ ≤ → Thus the above bound is o(r /q )= o(n/(p q )) 0 as n . The second statement in n Pn n n → →∞ P (6.18) follows. The proof is complete. 

Proof of Theorem 21. By Lemma 6.2, EX X 5(1 + 2E X 2+δ)α(n)2/(2+δ), | 1 k| ≤ | 1| 2 which is summable by the assumption. Thus the series EX1 +2 i∞=2 E(X1Xk) is absolutely convergent. By the exact same argument as in proving (6.11), we obtain that V ar(S )/n P n → σ2. Now for a given constant A> 0 and all i = 1, 2, , define · · · Y = X I X A E(X I X A ) and Z = X I X > A E(X I X > A ), i i {| i|≤ }− i {| i|≤ } i i {| i| }− i {| i| } n n and Sn′ = i=1 Yi and Sn′′ = i=1 Zi. Obviously, Sn = Sn′ + Sn′′. By Theorem 20, P P S n = N(0,σ(A)2) (6.19) √n ⇒

2 2 as n for each fixed A > 0, where σ(A) := EY + 2 ∞ EY Y . By the summable → ∞ 1 i=2 1 k 2 assumption on α(n), it is easy to check that both the sumP in the definition of σ(A) and 2 2 σ˜(A) := EZ1 + 2 i∞=2 E(Z1Zk) are absolutely convergent, and P σ(A)2 σ2 andσ ˜(A)2 0 (6.20) → →

41 as A + . Now → ∞ Sn′′ V ar(Sn′′) σ˜(A) lim sup P (| | ǫ) lim sup 2 2 . (6.21) n √n ≥ ≤ n nǫ → ǫ →∞ →∞ by using the same argument as in (6.2). Since P (S /√n F ) P (S /√n F¯ǫ) + n ∈ ≤ n′ ∈ P ( S /√n ǫ) for any closed set F R and any ǫ > 0, where F¯ǫ is the closed ǫ- | n′′| ≥ ⊂ neighborhood of F. Thus, from (6.19) and (6.21) S S lim sup P ( n F ) P (N(0,σ(A)2) F¯ǫ)+ limsup P (| n′′| ǫ) n √n ∈ ≤ ∈ n √n ≥ →∞ →∞ σ σ˜(A) P N(0,σ2) F¯ǫ + ≤ ∈ σ(A) ǫ2   for any A> 0 and ǫ> 0. First let A + , and then let ǫ 0+ to obtain from (6.20) that → ∞ → S lim sup P ( n F ) P (N(0,σ2) F ). n √n ∈ ≤ ∈ →∞ The proof is complete. 

Define byα ˜(n) the supremum of

E (f(X , , X )g(X , , X )) Ef(X , , X ) Eg(X , , X ) (6.22) | 1 · · · k k+n · · · k+m − 1 · · · k · k+n · · · k+m | over all k 1,m 0 and measurable functions f(x) defined on Rm and g(x) defined on ≥ ≥ Rm n+1 with f(x) 1 and g(x) 1 for all x R. − | |≤ | |≤ ∈ LEMMA 6.3 The following are true:

(i) α(n) = sup sup sup P (A B) P (A)P (B) . k 1 m 0 A σ(X1, ,X ), B σ(X , ,X ) | ∩ − | ≥ ≥ ∈ ··· k ∈ k+n ··· k+n+m (ii) For all n 1, we have that α(n) α˜(n) 4α(n). ≥ ≤ ≤ Proof. (i) For two sets A and B, let A∆B = (A B) (B A), which is their symmetric \ ∪ \ difference. Then I = I I . Let Z ; i 1 be a sequence of random variables. We A∆B | A − B| i ≥ claim that

σ(Z , Z , )= B σ(Z , Z , ); for any ǫ> 0, there exists l 1 and (6.23) 1 2 · · · { ∈ 1 2 · · · ≥ C σ(Z , , Z ) such that P (B∆C) <ǫ . (6.24) ∈ 1 · · · l } If this is true, then for any ǫ > 0, there exists l < and C σ(Z , , Z ) such that ∞ ∈ 1 · · · l P (B) P (C) < ǫ. Thus (ii) follows. | − |

42 Now we prove this claim. First, note that σ(Z , Z , )= σ Z E ; E (R); i 1 2 · · · {{ i ∈ i} i ∈B ≥ 1 . Denote by the right hand side of (6.23). Then contains all Z E ; E } F F { i ∈ i} i ∈ (R); i 1. We only need to show that the right hand side is a σ-algebra. B ≥ It is easy to see that (i) Ω ; (ii) Bc if B since A∆B = Ac∆Bc. (iii) If B ∈ F ∈ F ∈ F i ∈ F for i 1, there exist m < and C σ(Z , , Z ) such that P (B ∆C ) < ǫ/2i for all ≥ i ∞ i ∈ 1 · · · mi i i i 1. Evidently, n B B as n . Therefore, there exists n < such that ≥ ∪i=1 i ↑ ∪i∞=1 i → ∞ 0 ∞ P ( B ) P ( n0 B ) ǫ/2. It is easy to check that ( n0 B )∆( n0 C ) n0 (B ∆C ). | ∪i∞=1 i − ∪i=1 i |≤ ∪i=1 i ∪i=1 i ⊂∪i=1 i i Write B = B and B˜ = n0 B and C = n0 C . Note B∆C (B B˜) (B˜∆C). These ∪i∞=1 i ∪i=1 i ∪i=1 i ⊂ \ ∪ facts show that P (B∆C) < ǫ and C σ(Z , Z , , Z ) for some l < . Thus is a ∈ 1 2 · · · l ∞ F σ-algebra. (ii) Take f and g to be indicator functions, from (i) we get α(n) α˜(n). By Lemma ≤ 6.1, we have thatα ˜(n) 4α(n).  ≤

Let X ; i 1 be a sequence of random variables. We say it is a Markov chain if { i ≥ } P (X A X , , X )= P (X A X ) for any Borel set A and any k 1. A Markov k+1 ∈ | 1 · · · k k+1 ∈ | k ≥ chain has the property that given the present observations, the past and future observations are (conditionally) independent. This is literally stated in Proposition 10.3. Let’s look at the strong mixing coefficientα ˜(n). Recall the definition ofα ˜(n). If X , X , 1 2 · · · is a Markov chain, for simplicity, we write f and g, respectively, for f(X , , X ) and 1 · · · k g(Xk+n, , Xk+m). Let f˜ = E(f Xk+1)) andg ˜ = E(g Xk+n 1). Then by Proposition 10.3, · · · | | − E(fg)= E(f˜g˜), and Ef = Ef˜ and Eg = Eg.˜ Thus,

E(fg) (Ef)Eg = E(f˜g˜) (Ef˜)Eg.˜ − −

The point is that f̃ is σ(X_{k+1})-measurable, and g̃ is σ(X_{k+n−1})-measurable. Therefore

α(n) sup E(f(Xk+1)g(Xk+n 1) (Ef(Xk+1))Eg(Xk+n 1) (6.25) ≤ f 1 g 1 | − − − | | |≤ | |≤ 4sup P (Xk+1 A, Xk+n 1 B) P (Xk+1 A)P (Xk+n 1 B) (6.26) ≤ A, B | ∈ − ∈ − ∈ − ∈ |

by Lemma 6.1 by taking k there to be k +1 and k + n to be k + n +1 and Xi = 0 for all positive integer i excluding k +1 and k + n + 1. If the Markov process is stationary, then

α(n + 1) 4 sup P (X1 A, Xn B) P (X1 A)P (Xn B) . (6.27) ≤ A, B | ∈ ∈ − ∈ ∈ |

43 Let P n(x,B) represent the probability of X B given X = x, and π be the marginal n+1 ∈ 1 of X1. Then

P (X B, X A)= P n(x,B)π( dx) n+1 ∈ 1 ∈ ZA From this,

α(n + 2) ≤ 4 sup_{A,B} | ∫_A (P^n(x, B) − P(X_n ∈ B)) π(dx) |.

For two probability measures µ and ν, define their total variation distance ‖µ − ν‖ = sup_A |µ(A) − ν(A)|. Then

α(n + 2) ≤ 4 ∫ ‖P^n(x, ·) − π‖ π(dx),  n ≥ 2.

In practice, for example in the theory of Markov chain Monte Carlo, we have conditions that guarantee a convergence rate of ‖P^n(x, ·) − π‖. In other words, we have a function M(x) ≥ 0 and a rate γ(n) such that

‖P^n(x, ·) − π‖ ≤ M(x) γ(n)

for all x and n. Here are some examples:

If a chain is so-called polynomially ergodic of order m, then γ(n) = n^{−m};
If a chain is geometrically ergodic, then γ(n) = t^n for some t ∈ (0, 1);
If a chain is uniformly geometrically ergodic, then γ(n) = t^n for some t ∈ (0, 1) and sup_x M(x) < ∞.

These three conditions, combined with the Central Limit Theorems for stationary random variables stated in Theorems 20 and 21, imply the following CLTs for Markov chains directly.

Let E_π M = ∫ M(x) π(dx).

THEOREM 22 Suppose {X_1, X_2, ...} is a stationary Markov chain with L(X_1) = π and E_π M < ∞. If the chain is geometrically ergodic, in particular if it is uniformly geometrically ergodic, then

(S_n − nEX_1)/√n ⇒ N(0, σ²)    (6.28)

where σ² = E(X_1²) + 2 Σ_{k=2}^∞ E(X_1 X_k).

THEOREM 23 Suppose {X_1, X_2, ...} is a stationary Markov chain with L(X_1) = π and E_π M < ∞.
(i) If |X_1| ≤ C for some constant C > 0, and the chain is polynomially ergodic of order m > 1, then (6.28) holds.
(ii) If the chain is polynomially ergodic of order m, and E|X_1|^{2+δ} < ∞ for some δ > 0 satisfying mδ > 2 + δ, then (6.28) holds.

Remark. In practice, given an initial distribution λ and a transition kernel P(x, A), here is the way to generate a Markov chain: draw X_0 = x from λ, then draw X_1 from P(X_0, ·) and, in general, X_{n+1} from P(X_n, ·) for any n ≥ 1. In this way a Markov chain X_1, X_2, ... with initial distribution λ and transition kernel P(x, ·) has been generated (a simulation sketch is given after this remark). Let π be the stationary distribution of this chain, that is,

π(·) = ∫ P(x, ·) π(dx).

Theorem 17.1.6 (on p. 420 of "Markov Chains and Stochastic Stability" by S.P. Meyn and R.L. Tweedie) says that if a Markov chain CLT holds for a certain initial distribution λ_0, then it holds for any initial distribution λ. In particular, if Theorem 22 or Theorem 23 applies (λ = π in that case), then the CLT holds for the Markov chain with the same transition kernel P(x, ·) and stationary distribution π under any initial distribution λ.
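As a concrete illustration of the recipe in the remark above, here is a small Python sketch added to these notes. It generates a chain from a transition kernel and checks the normalized variance of S_n. The chain X_{k+1} = ρX_k + ε_{k+1} with standard normal innovations is a standard example of a geometrically ergodic chain with stationary law N(0, 1/(1−ρ²)); starting from that stationary law, the CLT variance works out to σ² = 1/(1−ρ)². The value of ρ, the sample size and the replication count are arbitrary choices for the sketch.

import numpy as np

rng = np.random.default_rng(1)
rho, n, reps = 0.5, 5000, 2000
sd_stat = 1.0 / np.sqrt(1.0 - rho**2)        # sd of the stationary N(0, 1/(1 - rho^2)) law
x = sd_stat * rng.standard_normal(reps)      # draw X_0 from the stationary distribution (lambda = pi)
s = np.zeros(reps)
for _ in range(n):
    x = rho * x + rng.standard_normal(reps)  # X_{k+1} drawn from P(X_k, .)
    s += x
vals = s / np.sqrt(n)
print("empirical variance of S_n/sqrt(n):", vals.var())
print("theoretical sigma^2 = 1/(1 - rho)^2 =", 1.0 / (1.0 - rho) ** 2)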

7 U-Statistics

Let X_1, ..., X_n be a sequence of i.i.d. random variables. Sometimes we are interested in estimating θ = Eh(X_1, ..., X_m) for 1 ≤ m ≤ n. To make the discussion simple, we assume h(x_1, ..., x_m) is a symmetric function. Since the order statistic X_(1) ≤ X_(2) ≤ ··· ≤ X_(n) is sufficient and complete,

U_n = \binom{n}{m}^{-1} Σ_{1≤i_1<···<i_m≤n} h(X_{i_1}, ..., X_{i_m})    (7.1)

PROPOSITION 7.1 The equality U_n = E(h(X_1, ..., X_m) | X_(1), ..., X_(n)) holds for all 1 ≤ m ≤ n.

Proof. One can see that

E(h(X_1, ..., X_m) | X_(1), ..., X_(n)) = E(h(X_{i_1}, ..., X_{i_m}) | X_(1), ..., X_(n))

for any 1 ≤ i_1 < ··· < i_m ≤ n. Summing both sides above over all such indices and then dividing by \binom{n}{m}, we obtain

E(h(X_1, ..., X_m) | X_(1), ..., X_(n)) = \binom{n}{m}^{-1} Σ_{1≤i_1<···<i_m≤n} h(X_(i_1), ..., X_(i_m)) = U_n.  □

Example. Let θ = Var(X_1). We know Var(X_1) = (1/2)E(X_1 − X_2)². Taking h(x_1, x_2) = (1/2)(x_1 − x_2)², the corresponding U-statistic is

U_n = (2/(n(n−1))) Σ_{1≤i<j≤n} (1/2)(X_i − X_j)² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)² = S²,

the sample variance.
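A quick numerical check of this identity (a Python sketch added here, with an arbitrary simulated sample) compares the U-statistic with kernel h(x_1, x_2) = (x_1 − x_2)²/2 against the usual unbiased sample variance.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
x = rng.exponential(size=30)                      # any sample will do
n = len(x)
u = sum((x[i] - x[j]) ** 2 / 2 for i, j in combinations(range(n), 2)) * 2 / (n * (n - 1))
print(u, np.var(x, ddof=1))                       # the two numbers agree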

Example. The one-sample Wilcoxon statistic estimates θ = P(X_1 + X_2 ≤ 0). The corresponding U-statistic is

U_n = (2/(n(n−1))) Σ_{1≤i<j≤n} I(X_i + X_j ≤ 0).

Example. Gini's mean difference measures the concentration of a distribution; the parameter is θ = E|X_1 − X_2|. The statistic is

U_n = (2/(n(n−1))) Σ_{1≤i<j≤n} |X_i − X_j|.
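The same pattern covers any kernel. The sketch below is an addition; the helper name u_stat and the simulated data are illustrative choices. It evaluates a U-statistic of order m by averaging a symmetric kernel over all index combinations and applies it to the Wilcoxon and Gini kernels above; the cost grows like n^m, so this is only meant for small samples.

import numpy as np
from itertools import combinations
from math import comb

def u_stat(h, x, m):
    # average the order-m kernel h over all m-subsets of the sample x
    n = len(x)
    total = sum(h(*(x[i] for i in idx)) for idx in combinations(range(n), m))
    return total / comb(n, m)

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
print("Wilcoxon:", u_stat(lambda a, b: float(a + b <= 0), x, 2))       # estimates P(X1 + X2 <= 0)
print("Gini mean difference:", u_stat(lambda a, b: abs(a - b), x, 2))  # estimates E|X1 - X2|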

THEOREM 24 (Hoeffding's Theorem) For a U-statistic given by (7.1) with E[h(X_1, ..., X_m)²] < ∞,

Var(U_n) = \binom{n}{m}^{-1} Σ_{k=1}^m \binom{m}{k} \binom{n−m}{m−k} ζ_k,

where ζ_k = Var(h_k(X_1, ..., X_k)) and h_k(x_1, ..., x_k) = E h(x_1, ..., x_k, X_{k+1}, ..., X_m).

Proof. Consider two sets {i_1, ..., i_m} and {j_1, ..., j_m} of m distinct integers from {1, 2, ..., n} with exactly k integers in common. The total number of such pairs is \binom{n}{m}\binom{m}{k}\binom{n−m}{m−k}. Now

Var(U_n) = \binom{n}{m}^{-2} Σ Cov(h(X_{i_1}, ..., X_{i_m}), h(X_{j_1}, ..., X_{j_m})),

where the sum is over all indices 1 ≤ i_1 < ··· < i_m ≤ n and 1 ≤ j_1 < ··· < j_m ≤ n. If {i_1, ..., i_m} and {j_1, ..., j_m} have no integers in common, the covariance is zero by independence. Now we classify the sum according to the number of integers in common. Then

Var(U_n) = \binom{n}{m}^{-2} Σ_{k=1}^m \binom{n}{m}\binom{m}{k}\binom{n−m}{m−k} · Cov(h(X_1, ..., X_m), h(X_1, ..., X_k, X_{m+1}, ..., X_{2m−k}))
         = \binom{n}{m}^{-1} Σ_{k=1}^m \binom{m}{k}\binom{n−m}{m−k} ζ_k,

where the fact (obtained by conditioning on X_1, ..., X_k) that the covariance above equals ζ_k is used in the last step.  □

LEMMA 7.1 Under the condition of Theorem 24,
(i) 0 = ζ_0 ≤ ζ_1 ≤ ··· ≤ ζ_m = Var(h);
(ii) if, for fixed m and some k ≥ 1, ζ_0 = ··· = ζ_{k−1} = 0 and ζ_k > 0, then

Var(U_n) = k! \binom{m}{k}² ζ_k / n^k + O(1/n^{k+1})

as n → ∞.

Proof. (i) For any 1 ≤ k ≤ m − 1,

h_k(X_1, ..., X_k) = E( h(X_1, ..., X_m) | X_1, ..., X_k ) = E( h_{k+1}(X_1, ..., X_{k+1}) | X_1, ..., X_k ).

Then by Jensen's inequality, E(h_k²) ≤ E(h_{k+1}²). Note that Eh_k = Eh_{k+1}. Then Var(h_k) ≤ Var(h_{k+1}), so (i) follows.
(ii) By Theorem 24,

m 1 m n m 1 m n m V ar(Un)= n − ζk + n − ζi. m k m k m i m i   −  i=Xk+1   −  n m n m m i Since m is fixed, m n /m! and m− i n − for any k + 1 i m. Thus the last ∼ − ≤ ≤ ≤ term above is bounded by O(n (k+1)) as n . Now  − →∞ 1 m n m − ζ n k m k k m   −  m m! ζk (n m)! =  k − n(n 1) (n m + 1) · (m k)!(n 2m + k)! − · · ·  − − − m 2 k! ζk n n n (n m) (n 2m + k + 1) = k · · · · − · · · − nk · n(n 1) (n k + 1) · (n k) (n m + 1)  − · · · − − · · · − m 2 k 1 1 2m k 1 m 1 1 k! ζk − i − − − i − i − = k 1 1 1 . nk · − n · − n · − n  i=1 i=m i=k Y   Y   Y   Note that since m is fixed, each of the above product is equal to 1+O(1/n). The conclusion then follows. 
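For the sample-variance kernel h(x_1, x_2) = (x_1 − x_2)²/2 and standard normal data one has h_1(x) = (x² + 1)/2, so ζ_1 = Var(X_1²)/4 = 1/2 and the leading term of Lemma 7.1 is m²ζ_1/n = 2/n. The Python sketch below is an added illustration (sample sizes and replication counts are arbitrary) comparing n·Var(U_n), estimated by Monte Carlo, with this value.

import numpy as np

rng = np.random.default_rng(4)
for n in (20, 50, 100):
    reps = 20000
    x = rng.standard_normal((reps, n))
    u = x.var(axis=1, ddof=1)      # U_n for the kernel (x1 - x2)^2 / 2 equals the sample variance
    print(n, n * u.var(), "vs leading term m^2 * zeta_1 =", 2.0)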

Let X_1, ..., X_n be a sample, and let T_n be a statistic based on this sample. The projection of T_n on some random variables Y_1, ..., Y_p is defined by

T'_n = ET_n + Σ_{i=1}^p (E(T_n | Y_i) − ET_n).

Suppose T_n is a symmetric function of X_1, ..., X_n. Set ψ(X_i) = E(T_n | X_i) for i = 1, 2, ..., n. Then ψ(X_1), ..., ψ(X_n) are i.i.d. random variables with mean ET_n. If Var(T_n) < ∞, then

(1/√(n Var(ψ(X_1)))) Σ_{i=1}^n (ψ(X_i) − ET_n) → N(0, 1)    (7.4)

in distribution as n → ∞. Now let T'_n be the projection of T_n on X_1, ..., X_n. Then

T_n − T'_n = (T_n − ET_n) − Σ_{i=1}^n (ψ(X_i) − ET_n).    (7.5)

We will next show that T_n − T'_n is negligible compared with the order appearing in (7.4). The CLT for T_n then follows from (7.4) and the Slutsky lemma.

48 LEMMA 7.2 Let T be a symmetric statistic with V ar(T ) < for every n, and T be n n ∞ n′ the projection of T on X , , X . Then ET = ET and n 1 · · · n n′ n 2 E(T T ′ ) = V ar(T ) V ar(T ′ ). n − n n − n

Proof. Since ETn = ETn′ ,

2 E(T T ′ ) = V ar(T )+ V ar(T ′ ) 2Cov(T , T ′ ). n − n n n − n n First, V ar(T )= nV ar(E(T X )) by independence. Second, by (7.5) n′ n| 1

Cov(Tn, Tn′ ) n = V ar(T )+ nV ar(E(T X )) 2 Cov(T ET )(ψ(X ) ET ). (7.6) n n| 1 − n − n i − n Xi=1 Now the i-th covariance above is equal to E(T E(T X )) E(E(T X )) 2 = V ar(E(T X )). n n| i −{ n| i } n| 1 We already see V ar(T )= nV ar(E(T X )). So the proof is complete by the two facts and n′ n| 1 (7.6). 

THEOREM 25 Let U_n be the statistic given by (7.1) with E[h(X_1, ..., X_m)]² < ∞.
(i) If ζ_1 > 0, then √n(U_n − EU_n) → N(0, m²ζ_1) in distribution.
(ii) If ζ_1 = 0 and ζ_2 > 0, then

n(U_n − EU_n) → (m(m−1)/2) Σ_{j=1}^∞ λ_j (χ²_{1j} − 1),

where the χ²_{1j}'s are i.i.d. χ²(1) random variables, and the λ_j's are constants satisfying Σ_{j=1}^∞ λ_j² = ζ_2.

Proof. We will only prove (i) here. The proof of (ii) is very technical, it is omitted. Let U be the projection of U on X , , X . Then n′ n 1 · · · n 1 U ′ = Eh + E(h(X , , X ) X ) Eh. n n i1 · · · ik | i − m 1 i1<

8 Empirical Processes

Let X , X , be a random sample, where X takes values in a metric space M. Define a 1 2 · · · i probability measure µn such that

µ_n(A) = (1/n) Σ_{i=1}^n I(X_i ∈ A),  A ⊂ M.

The measure µ_n is called an empirical probability measure. If X_1 takes real values, that is, M = R, set

F_n(t) = (1/n) Σ_{i=1}^n I(X_i ≤ t),  t ∈ R,  A = (−∞, t].

The random process {F_n(t), t ∈ R} is called an empirical process. By the LLN and CLT,

F_n(t) → F(t) a.s.  and  √n(F_n(t) − F(t)) ⇒ N(0, F(t)(1 − F(t))),

where F (t) is the cdf of X1. Among many interesting questions, here we are interested in the uniform convergence of the above almost sure and weak convergences. Specifically, we want to answer the following two questions:

1) Does ∆_n := sup_t |F_n(t) − F(t)| → 0 a.s.?
2) For fixed n ≥ 1, regarding {F_n(t); t ∈ R} as one point in a big space, does the CLT hold?
The answers are yes for both questions. We first answer the first question.

THEOREM 26 (Glivenko-Cantelli) Let X_1, X_2, ... be a sequence of i.i.d. random variables with cdf F(t). Then

∆_n = sup_t |F_n(t) − F(t)| → 0 a.s.

as n → ∞.

Remark. When F(t) is continuous and strictly increasing on the support of X_1, the distribution of the above ∆_n does not depend on the distribution of X_1. First, the F(X_i)'s are i.i.d. U[0, 1]-distributed random variables. Second,

∆_n = sup_t |(1/n) Σ_i I(X_i ≤ t) − F(t)| = sup_t |(1/n) Σ_i I(F(X_i) ≤ F(t)) − F(t)| = sup_{0≤y≤1} |(1/n) Σ_i I(Y_i ≤ y) − y|,

where {Y_i = F(X_i); 1 ≤ i ≤ n} are i.i.d. random variables with the uniform distribution over [0, 1].

Proof. Set F_n(t−) = (1/n) Σ_i I(X_i < t) and F(t−) = P(X_1 < t). For any ǫ > 0, choose −∞ = t_0 < t_1 < ··· < t_k = ∞ such that F(t_i−) − F(t_{i−1}) ≤ ǫ for each i. By the strong law of large numbers, F_n(t_i) → F(t_i) and F_n(t_i−) → F(t_i−) a.s. for every i. For t ∈ [t_{i−1}, t_i), monotonicity gives F_n(t) − F(t) ≤ F_n(t_i−) − F(t_{i−1}) ≤ F_n(t_i−) − F(t_i−) + ǫ, and similarly F_n(t) − F(t) ≥ F_n(t_{i−1}) − F(t_{i−1}) − ǫ. Hence

∆_n ≤ max_{1≤i≤k} { |F_n(t_i) − F(t_i)|, |F_n(t_i−) − F(t_i−)| } + ǫ → ǫ a.s.

as n → ∞. The conclusion follows by letting ǫ ↓ 0.  □
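The supremum ∆_n can be computed exactly from the order statistics: on the interval between consecutive order statistics F_n is constant, so ∆_n = max_i max{ i/n − F(X_(i)), F(X_(i)) − (i−1)/n }. The Python sketch below is an addition (the exponential example and the sample sizes are arbitrary) that uses this formula to watch ∆_n → 0, in line with the Glivenko-Cantelli theorem.

import numpy as np

def sup_dist(x, F):
    # exact sup_t |F_n(t) - F(t)|, evaluated at the jump points of F_n
    x = np.sort(x)
    n = len(x)
    cdf = F(x)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - cdf, cdf - (i - 1) / n))

rng = np.random.default_rng(5)
F = lambda t: 1.0 - np.exp(-t)        # cdf of Exp(1)
for n in (100, 1000, 10000, 100000):
    x = rng.exponential(size=n)
    print(n, sup_dist(x, F))          # decreases towards 0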

Given (t_1, t_2, ..., t_k) ∈ R^k, it is easy to check that √n(F_n(t_1) − F(t_1), ..., F_n(t_k) − F(t_k)) ⇒ N_k(0, Σ), where

Σ = (σ_ij)  and  σ_ij = F(t_i ∧ t_j) − F(t_i)F(t_j),  1 ≤ i, j ≤ k.    (8.1)

Let D[−∞, ∞] be the set of all functions defined on (−∞, ∞) that are right-continuous with left limits everywhere. Equipped with the so-called Skorohod metric, it becomes a Polish space. The random element √n(F_n − F), viewed as an element of D[−∞, ∞], converges weakly to a continuous Gaussian process ξ(t), where ξ(±∞) = 0 and the covariance structure is given in (8.1). This limit is usually called a Brownian bridge. It has the same distribution as G_λ ∘ F, where G_λ is the limiting process obtained when F is the cumulative distribution function of the uniform distribution over [0, 1].

THEOREM 27 (Donsker) If X_1, X_2, ... are i.i.d. random variables with distribution function F, then the sequence of empirical processes √n(F_n − F) converges in distribution in the space D[−∞, ∞] to a random element G_F, whose finite-dimensional marginal distributions are zero-mean Gaussian with covariance function (8.1).
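A quick finite-dimensional check of Donsker's theorem, added here as a Python sketch (uniform data and the grid of time points are arbitrary choices): simulate √n(F_n(t) − F(t)) at a few points t and compare the empirical covariance matrix with F(s ∧ t) − F(s)F(t) from (8.1).

import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 20000
t = np.array([0.25, 0.5, 0.75])                 # F is the identity on [0, 1] for U(0, 1) data
x = rng.random((reps, n))
Gn = np.sqrt(n) * ((x[:, :, None] <= t).mean(axis=1) - t)   # reps-by-3 array of process values
print(np.cov(Gn, rowvar=False))                 # empirical covariance
print(np.minimum.outer(t, t) - np.outer(t, t))  # target covariance F(s ^ t) - F(s)F(t)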

Later we will see that

√n ‖F_n − F‖_∞ = √n sup_t |F_n(t) − F(t)| ⇒ sup_t |G_F(t)|,

where

P( sup_t |G_F(t)| ≥ x ) = 2 Σ_{j=1}^∞ (−1)^{j+1} e^{−2j²x²},  x ≥ 0.

Also, the DKW (Dvoretzky, Kiefer, and Wolfowitz) inequality says that

P( √n ‖F_n − F‖_∞ > x ) ≤ 2 e^{−2x²},  x > 0.

Let X_1, X_2, ... be i.i.d. random variables taking values in a measurable space (X, A) with L(X_1) = P. We write µf = ∫ f(x) µ(dx). So

P_n f = (1/n) Σ_{i=1}^n f(X_i)  and  Pf = ∫_X f(x) P(dx).

By the Glivenko-Cantelli theorem,

sup_{x∈R} |P_n((−∞, x]) − P((−∞, x])| → 0 a.s.

Also, if P has no point mass, then

sup_A |P_n(A) − P(A)| = 1, the supremum being over all measurable sets A,

because P_n(A) = 1 and P(A) = 0 for A = {X_1, X_2, ..., X_n}. We are therefore searching for classes F of functions such that

sup_{f∈F} |P_n f − Pf| → 0 a.s.    (8.2)

The previous two examples say that (8.2) holds for F = {I_{(−∞,x]}, x ∈ R} but does not hold for the power set 2^R. The class F is called P-Glivenko-Cantelli if (8.2) holds.

Define G_n = √n(P_n − P). Given k measurable functions f_1, ..., f_k such that Pf_i² < ∞ for all i, one can also check by the multivariate Central Limit Theorem (CLT) that

(G f , , G f )= N (0, Σ) n 1 · · · n k ⇒ k

where Σ = (σij)1 i,j n and σij = P (fifj) (Pfi)Pfj. This tells us that Gnf, f ≤ ≤ − { ∈ F} satisfies CLT in Rk. If has infinite many members, how we define weak convergence? We F first define a space similar to Rk. Set

l∞( )= bounded function z : R F { F → } with norm z = supF z(F ) . Then (l∞( ), ) is a Banach space; it is separable k k∞ ∈F | | F k · k∞ k if is countable. When has finite elements, (l∞( ), ) = (R , ). F F F k · k∞ k · k∞ So, G : R is a random element and can be viewed as an element in Banach space n F → l ( ). We say is P -Donsker if G = G, where G is a tight element in l ( ). ∞ F F n ⇒ ∞ F

LEMMA 8.1 (Bernstein's inequality) Let X_1, ..., X_n be independent random variables with mean zero, |X_j| ≤ K for some constant K and all j, and σ_j² = EX_j² > 0. Let S_n = Σ_{i=1}^n X_i and s_n² = Σ_{j=1}^n σ_j². Then

P(|S_n| ≥ x) ≤ 2 e^{−x²/(2(s_n² + Kx))},  x > 0.

Proof. It suffices to show that

P(S_n ≥ x) ≤ e^{−x²/(2(s_n² + Kx))},  x > 0.    (8.3)

First, for any λ > 0,

P(S_n > x) ≤ e^{−λx} E e^{λS_n} = e^{−λx} Π_{i=1}^n E e^{λX_i}.

Now

E e^{λX_j} = 1 + Σ_{i=2}^∞ (λ^i/i!) EX_j^i ≤ 1 + (σ_j²λ²/2) Σ_{i=2}^∞ (λK)^{i−2} = 1 + σ_j²λ²/(2(1 − λK)) ≤ exp( σ_j²λ²/(2(1 − λK)) )

if λK < 1. Thus

P(S_n > x) ≤ exp( −λx + λ²s_n²/(2(1 − λK)) ).

Now (8.3) follows by choosing λ = x/(s_n² + Kx).  □

Recall G_n f = Σ_{i=1}^n (f(X_i) − Pf)/√n. The random variable Y_i := (f(X_i) − Pf)/√n here corresponds to X_i in the above lemma. Note that Var(Σ_{i=1}^n Y_i) = Var(f(X_1)) ≤ Pf² and |Y_i| ≤ 2‖f‖_∞/√n. Applying the above lemma to Σ_{i=1}^n Y_i, we obtain

COROLLARY 8.1 For any bounded, measurable function f,

P(|G_n f| ≥ x) ≤ 2 exp( −(1/4) · x²/(Pf² + x‖f‖_∞/√n) )

for any x > 0.
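A small numerical sanity check of Corollary 8.1, added as a Python sketch (the indicator function, the sample size and the grid of x values are arbitrary): for f = I_{(−∞, 0.3]} and U(0, 1) data we have Pf = Pf² = 0.3 and ‖f‖_∞ = 1, and the simulated tail of |G_n f| should sit below the stated bound.

import numpy as np

rng = np.random.default_rng(7)
n, reps = 400, 100000
p = 0.3                                           # Pf = Pf^2 for an indicator function
Gn = np.sqrt(n) * ((rng.random((reps, n)) <= p).mean(axis=1) - p)
for x in (0.5, 1.0, 1.5, 2.0):
    emp = (np.abs(Gn) >= x).mean()
    bound = 2 * np.exp(-0.25 * x**2 / (p + x / np.sqrt(n)))
    print(x, emp, "<=", bound)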

Notice that

Cx P ( G f x) 2e− when x is large and | n |≥ ≤ Cx2 P ( G f x) 2e− when x is small. (8.4) | n |≥ ≤ x Let’s estimate E max1 i m Yi provided m is large and P ( Yi x) e− for all i and ≤ ≤ | | | | ≥ ≤ x> 0. Then E Y k k! for k = 1, 2, . The immediate estimate is | i| ≤ · · · m E max Yi E Yi m. 1 i m | |≤ | |≤ ≤ ≤ Xi=1 54 By Holder’s inequality,

m 2 1/2 2 1/2 E max Yi (E max Yi ) ( E Yi ) √2m. 1 i m | |≤ 1 i m | | ≤ | | ≤ ≤ ≤ ≤ ≤ Xi=1 Let ψ(x)= ex 1, x 0. Following this logic, by Jensen’s inequality − ≥

ψ(E max Yi /2) m max Eψ( Yi /2) 2m. 1 i m | | ≤ · 1 i m | | ≤ ≤ ≤ ≤ ≤ 1 Take inverse ψ− for both sides. We obtain that

E max Yi 2 log(2m). 1 i m | |≤ ≤ ≤ This together with (8.4) leads to the following lemma.

LEMMA 8.2 For any finite class of bounded, measurable, square-integrable functions F with elements, there is an universal constant C > 0 such that |F| log(1 + ) C E max Gnf max f |F| + max f P,2 log(1 + ) . · f | |≤ f k k∞ √n f k k |F| p where f = (Pf 2)1/2. k kP,2 2 Proof. Define a = 24 f /√n and b = 24Pf . Define Af = (Gnf)I Gnf > b/a and k k∞ {| | } B = (G f)I G f b/a , then G f = A + B . It follows that f n {| n |≤ } n f f

E max Gnf E max Af + E max Bf . (8.5) f | |≤ f | | f | | For x b/a and x b/a the exponent in the Bernstein inequality is bounded above by ≥ ≤ 3x/a and 3x2/b, respectively. By Bernstein’s inequality, − − 3x P ( A x) P ( G f x b/a) 2exp , | f |≥ ≤ | n |≥ ∨ ≤ − a   3x2 P ( B x)= P (b/a G f x) P ( G f x b/a) 2exp | f |≥ ≥ | n |≥ ≤ | n |≥ ∧ ≤ − b   for all x 0. Let ψ (x) = exp(xp) 1, x 0, p 1. ≥ p − ≥ ≥ A /a A | f | ∞ Eψ | f | = E ex dx = P ( A > ax)ex dx 1. 1 a | f | ≤   Z0 Z0 By a similar argument we find that Eψ ( B /√b) 1. Because ψ ( ) is convex for all p 1, 2 | f | ≤ p · ≥ by Jensen’s inequality

Af maxf Af Af ψ1 E max | | Eψ1 | | E ψ1 | | . f a ≤ a ≤ a ≤ |F|     Xf   55 Taking inverse function, we have the first term on the right hand side. Similarly we have the second term. The conclusion follows from (8.5). 

We actually use the following lemma above

LEMMA 8.3 Let X be a real valued random variable and f : [0, ) R be differentiable s ∞ → with f (t) dt < for any s> 0 and ∞ P (X>t) f (t) dt < . Then 0 | ′ | ∞ 0 | ′ | ∞ R ∞ R Ef(X)= P (X t)f ′(t) dt + f(0). ≥ Z0 Proof. First, since t f (t) dt < for any t> 0, we have 0 | ′ | ∞ X R ∞ f(X) f(0) = f ′(t) dt = I(X t)f ′(t) dt. − ≥ Z0 Z0 Then, by Fubini’s theorem,

∞ ∞ Ef(X)= E I(X t)f ′(t) dt + f(0) = P (X t)f ′(t) dt + f(0).  ≥ ≥ Z0 Z0 Remark. When f(x) is differentiable, f ′(x) is not necessarily Lebesgue integrable even though it is Riemann-integrable. The following is an example: Let

x2 cos(1/x2), if x = 0; f(x)= 6  0, otherwise. Then  2x cos(1/x2) + (2/x) sin(1/x2), if x = 0; f (x)= 6 ′  0, otherwise is not Lebesgue-integrable on [0, 1], but obviously Riemann-integrable. We only need to show g(x) := (2/x) sin(1/x2) is not integrable. Suppose it is, 1 1 1 1 1 sin t sin dx = | | dt =+ , x x2 2 t ∞ Z0   Z0 yields a contradiction.

Now we describe the size of . For any f , define F ∈ F f = (P f r)1/r. k kP,r | | Let l and u be two functions, the bracket [l, u] is set of all functions f such that l f u. ≤ ≤ An ǫ-bracket in L (P ) is a bracket [l, u] such that u l < ǫ. The bracketing number r k − kP,r N (ǫ, ,L (P )) is the minimum numbers of ǫ-brackets needed to cover . The entropy with [] F r F bracketing is the logarithm of the bracketing number.

8.1 Outer Measures and Expectations

Recall X is a random variable from (Ω, ) (R, (R)) if X 1(B) for any set B (R). G → B − ∈ G ∈B For an arbitrary map X the inverse may not be in ), particularly for small ). For G G example, when = , Ω , many maps are not random variables. In empirical processes, G {∅ } we will deal with Z =: supt T Xt for some index set T. If T is big, then Z may not be ∈ measurable. It does not make sense to study expectations and probabilities related to such a random variable. But we have another way to get around this. Definition Let X be an arbitrary map from (Ω, , P ) (R, (R)). Define G → B

E∗X = inf EY ; Y X and Y is a measurable map : (Ω, ) (R, (R)) ; { ≥ G → B } P ∗(X A) = inf P (X B); X B , B A , A,B (R). ∈ { ∈ { ∈ } ∈ G ⊃ } ∈B

One can easily show that E∗X, as an infimum, can be achieved, i.e., there exists a random variable X : (Ω, , P ) (R, (R)) such that EX = E X. Further, X is P -almost surely ∗ F → B ∗ ∗ ∗ unique, i.e., if there exist two such random variables X1∗ and X2∗, then P (X1∗ = X2∗) = 1.

We call X∗ the measurable cover function. Obviously

(X + X )∗ X∗ + X∗ and X∗ X∗ if X X . 1 2 ≤ 1 2 1 ≤ 2 1 ≤ 2 One can define E and P similarly. ∗ ∗ Let (M, d) be a metric space. A sequence of arbitrary maps X : (Ω , ) (M, d) n n Gn → converges in distribution to a random vector X if

E∗f(X ) Ef(X) n → for any bounded, continuous function f defined on (M, d). We still have an analogue of the Portmanteau theorem.

THEOREM 28 The following are equivalent: (i) E f(X ) Ef(X) for every bounded, continuous function f defined on (M, d); ∗ n → (ii) E f(X ) Ef(X) for every bounded, Lipschitz function f defined on (M, d), that ∗ n → is, there is a constant C > 0 such that f(x) f(y) Cd(x,y) for any x,y M; | − |≤ ∈ (iii) lim infn P (Xn G) P (X G) for any open set G; ∗ ∈ ≥ ∈ (iv) lim sup P (X F ) P (X F ) for any closed set F ; n ∗ n ∈ ≤ ∈ (v) lim sup P (X H)= P (X H) for any set H such that P (X ∂H) = 0, where n ∗ n ∈ ∈ ∈ ∂H is the boundary of H.

Let

J_[](δ, F, L_2(P)) = ∫_0^δ √( log N_[](ǫ, F, L_2(P)) ) dǫ.

The proof of the following theorem is easy and is omitted.

THEOREM 29 Every class F of measurable functions such that N_[](ǫ, F, L_1(P)) < ∞ for every ǫ > 0 is P-Glivenko-Cantelli.

THEOREM 30 Every class F of measurable functions with J_[](1, F, L_2(P)) < ∞ is P-Donsker.

To prove the theorem, we need some preparation.

THEOREM 31 A sequence of arbitrary maps X = (X , t T ) : (Ω , ) l (T ) n n,t ∈ n Gn → ∞ converges weakly to a tight random element if and only if both of the following conditions hold: (i) The sequence (X , , X ) converges in distribution in Rk for every finite set n,t1 · · · n,tk of points t , ,t in T ; 1 · · · k (ii) for every ǫ,η > 0 there exists a partition of T into finitely many sets T , , T 1 · · · k such that

lim sup P ∗ sup sup Xn,s Xn,t ǫ η. n i s,t T | − |≥ ≤ →∞ ∈ i ! Proof. We only prove sufficiency. Step 1: A preparation. For each integer m 1, let T m, , T m be a partition of T such ≥ 1 · · · km that

1 1 lim sup P ∗ sup sup Xn,s Xn,t m m . (8.1) n j s,t T m | − |≥ 2 ≤ 2 →∞ ∈ j ! Since the supremum above becomes smaller when a partition becomes more refined, w.l.o.g., assume the partitions are successive refinements as m increases. Define a semi-metric

0, if s,t belong to the same partitioning set T m for some j; ρ (s,t)= j m  1, otherwise. Easily, by the nesting of the partitions, ρ ρ . Define 1 ≤ 2 ≤··· ∞ ρ (s,t) ρ(s,t)= m for s,t T. 2m ∈ m=1 X 58 k m m Obviously, ρ(s,t) ∞ 2 2 when s,t T for some j. So (T, ρ) is totally ≤ k=m+1 − ≤ − ∈ j bounded. Let T0 beP the countable ρ-dense subset constructed by choosing an arbitrary m m point tj from every Tj . Step 2: Construct the limit of X . For two finite subsets of T, S = (s , ,s ) and n 1 · · · p U = (s , ,s , , ,s ), by assumption (i), there are two probability measures µ on Rp 1 · · · p · · · q p q and µq on R such that

(X , , X )= µ ; (X , , X , , X )= µ n,s1 · · · n,sp ⇒ p n,s1 · · · n,sp · · · n,sq ⇒ q q p p µ (A R − )= µ (A) for any Borel set A R . q × p ⊂ By the Kolmogorov consistency theorem there exists a stochastic process X ; t T on { t ∈ 0} some probability space such that (X , , X ) = (X , , X ) for every finite set n,t1 · · · n,tk ⇒ t1 · · · tk of points t , ,t in T . It follows that, for a finite set S T, 1 · · · k 0 ⊂

sup sup Xn,s Xn,t sup sup Xs Xt m | − |→ m | − | i s,t Ti i s,t Ti s,t∈ S s,t∈ S ∈ ∈ as n for fixed m 1. By the Portmanteau theorem and (8.1) →∞ ≥

1 1 1 P sup sup Xs Xt > m  lim sup P ∗ sup sup Xn,s Xn,t m  m i s,t T m | − | 2 ≤ n i s,t T m | − |≥ 2 ≤ 2 i →∞ i s,t∈ S s,t∈ S  ∈   ∈      for each m 1. By Monotone Convergence Theorem, since T is countable, let S T, we ≥ ↑ obtain that

1 1 P sup sup Xs Xt > m  m . i s,t T m | − | 2 ≤ 2 ∈ i s,t T0  ∈    If ρ(s,t) < 2 m, then ρ (s,t) < 1 (otherwise, ρ (s,t) = 1 implies ρ(s,t) 2 m). This − m m ≥ − says that s,t T m for the same j. So the above inequality implies that ∈ j

1 1 P sup Xs Xt > m m  m | − | 2  ≤ 2 ρ(s,t)<2−  s,t T0   ∈  for each m 1. By the Borel-Cantelli lemma, when m is large, with probability one ≥ 1 sup Xs Xt m m | − |≤ 2 ρ(s,t)<2− s,t T0 ∈

59 as m is large enough. This says that, with probability one, Xt, as a function of t is uniformly continuous on T . Extending this from T to T, we obtain a ρ-continuous process X , t T. 0 0 t ∈ We then have that with probability one 1 1 X X for any s,t T with ρ(s,t) < (8.2) | s − t|≤ 2m ∈ 2m for sufficiently large m. Step 3: Show X = X. Define π : T T as the map that maps every point in n ⇒ m → the partitioning set T m onto the point tm T m. For any bounded, Lipschitz continuous i i ∈ i function f : l (T ) R, ∞ →

lim E∗f(Xn) Ef(X) lim E∗f(Xn) E∗f(Xn πm) n n →∞ | − | ≤ →∞ | − ◦ | + lim E∗f(Xn πm) Ef(X πm) + E f(X πm) f(X) . n →∞ | ◦ − ◦ | | ◦ − | X π and X π are essentially finite dimensional random vectors, by assumption (i), n ◦ m ◦ m the first one converges to the second in distribution. So the middle term above is zero. Now, if s,t T m, ρ(s,t) 2 m. It follows that ∈ i ≤ − 1 X π X = sup sup X X m < . (8.3) m t ti m 1 k ◦ − k∞ i t T m | − | 2 − ∈ i m+1 as m is large enough. Thus, E f(X πm) f(X) f 2− . Since f is Lipschitz, | ◦ − | ≤ k k∞ m m E∗f(Xn) E∗f(Xn πm)) f 2− + P Xn Xn πm 2− | − ◦ | ≤ k k∞ k − ◦ k≥ By the same argument as in (8.3), using (8.1) to obtain 

m m m+1 lim E∗f(Xn) Ef(X) f 2− + 2− + 2− f . n ∞ ∞ →∞ | − | ≤ k k k k Letting m , we have that E f(X ) Ef(X) as n .  →∞ ∗ n → →∞

Let be a class of measurable functions f : R. Review F X → δ a(δ)= , log N (δ, ,L (P )) [] F 2 q δ J[](δ, ,L2(P )) = log N[](ǫ, ,L2(P )) dǫ. F 0 F Z q 2 2 LEMMA 8.4 Suppose there is δ > 0 and a measurable function F > 0 such that P ∗f < δ and f F for every f . Then there exists a constant C > 0 such that | |≤ ∈ F G C EP∗ n J[](δ, ,L2(P )) + √nP ∗F F > √na(δ) . · k kF ≤ F { }

60 Proof. Step 1: Truncation. Recall G f = (1/√n) n (f(X ) Pf). If f g, then n i=1 i − | |≤ n P 1 G f (g(X )+ P g). | n |≤ √n i Xi=1 It follows that

E∗ GnfI F > √na(δ) 2√nP F F > √na(δ) . { } F ≤ { }

We will bound E∗ GnfI F √na(δ) next. The bracketing number of the class of k { ≤ }kF functions fI F > √na(δ) if f ranges over are smaller than the bracketing number of { } F the class . To simplify notation, we assume, w.l.o.g., f √na(δ) for every f . F | |≤ ∈ F Step 2: Discretization of the integral. Choose an integer q such that 4δ 2 q0 8δ. 0 ≤ − ≤ We claim there exists a nested sequence of -partitions ; 1 i N , indexed by the F {Fqi ≤ ≤ q} integers q q , into N disjoint subsets and measurable functions ∆ 2F such that ≥ 0 q qi ≤ 1 δ C q log Nq log N[](ǫ, ,L2(P )) dǫ, (8.4) 2 ≤ 0 F q q0 Z X≥ p q 2 1 sup f g ∆qi, and P ∆qi 2q (8.5) f,g | − |≤ ≤ 2 ∈Fqi for an universal constant C > 0. For convenience, write N(ǫ)= N (ǫ, ,L (P )). Thus, [] F 2

q0+3 q δ 1/2 ∞ 1/2 log N(ǫ) dǫ log N(ǫ) dǫ = log N(ǫ) dǫ ≥ q+1 0 0 q=q +3 1/2 Z p Z p X0 Z p ∞ 1 1 log N . ≥ 2q+1 2q 3 q=q +3 s − X0   Re-indexing the sum, we have that

1 ∞ 1 δ q log Nq log N(ǫ) dǫ, 16 2 ≤ 0 q=q0 Z X p p where N = N(2 q). By definition of N , there exists a partition = [l , u ]; 1 i q − q {Fqi qi qi ≤ ≤ N , where l and u are functions such that E(u l )2 < 2 q. Set ∆ = u l . Then q} qi qi qi − qi − qi qi − qi 2 2q ∆qi 2F and supf,g f g ∆qi, and P ∆qi 2− . ≤ ∈Fqi | − |≤ ≤ We can also assume, w.l.o.g., that the partitions ; 1 i N are successive {Fqi ≤ ≤ q} refined as q increases. Actually, at q-th level, we can make intersections of elements from q ki; 1 i Nk as k goes from q0 to q : ki. The total possible number of {F ≤ ≤ } ∩k=q0 F such intersections is no more than N¯ := N N N . Since the current ’s become q q0 q0+1 · · · q Fqi

61 smaller, all requirements on qi still hold obviously except possibly (8.4). Now we verify F q (8.4). Actually, noticing log N¯q √log Nk. Then ≤ k=q0 q q P 1 1 ∞ 1 ∞ 1 log N¯ log N = log N = 2 log N . 2q q ≤ 2q k 2q k 2k k q q0 q q0 k=q0 k=q0 q k k=q0 X≥ q X≥ X p X X≥ p X p

So (8.4) still holds when replacing Nq by N¯q.

Step 3: Chaining-a skill. For each fixed q q , choose a fixed element f from each ≥ 0 qi partition set , and set Fqi π f = f , ∆ f =∆ , if f . (8.6) q qi q qi ∈ Fqi Thus, π f and ∆ f runs through a set of N functions if f run through . Without loss of q q q F generality, we can assume

∆qf ∆q 1f for any f and q q0 + 1. (8.7) ≤ − ∈ F ≥ ¯ Actually, let ∆qi be the measurable cover function of supf,g f g . Then (8.5) also holds. ∈Fqi | − | Let q 1 j be a partition set at the q 1 level such that qi q 1 j. Then supf,g f g F − − F ⊂ F − ∈Fqi | − |≤ ¯ ¯ supf,g q 1 j f g . Thus, ∆qi ∆q 1 j. The assertion (8.7) follows by replacing ∆qi with ∈F − | − | ≤ − ∆¯ qi. By (8.6), P (π f f)2 max P ∆2 < 2 2q. Thus, P (π f f)2 = P (π f f)2 < q − ≤ i qi − q q − q q − . So (π f f)2 < a.s. This implies ∞ q q − ∞ P P P π f f a.s. on P (8.8) q → as q for any f . Define →∞ ∈ F q aq = 2− / log Nq+1 , τ = τ(n,fp) = inf q q :∆ f > √na . { ≥ 0 q q} The value of τ is thought to be + if the above set is empty. This is the first time ∆ f > ∞ q √na . By construction, 2a(δ) = 2δ(log N (δ, ,L (P ))) 1/2 a since the denominator is q [] F 2 − ≤ q0 decreasing in δ. By Step 1, we know that ∆ f 2√na(δ) √na . This says that τ >q . | q |≤ ≤ q0 0 We claim that

∞ ∞ f πq0 f = (f πqf)I τ = q + (πqf πq 1f)I τ q , a.s on P. (8.9) − − { } − − { ≥ } q +1 q +1 X0 X0 q1 In fact, write f πq0 f = (f πq1 f)+ q +1(πqf πq 1f) for q1 >q0. Now, − − 0 − − P 62 (i) If τ = , the right hand side above is identical to limq πqf πq0 f = f πq0 f a.s. ∞ →∞ − − by (8.8);

q1 (ii) If τ = q1 < , the right hand is equal to (f πq1 f)+ q +1(πqf πq 1f). ∞ − 0 − − P Step 4. Bound terms in the chain. Apply G to the both sides of (8.9). For f g, n | | ≤ note that G f G g + 2√nP g. By (8.6), f π f ∆ f. One obtains that | n | ≤ | n | | − q |≤ q f ∞ E∗ Gn (f πqf)I τ = q k − { } kF q +1 X0 z }| { ∞ ∞ E∗ Gn(∆qfI τ = q ) + 2√n P (∆qfI τ = q ) . (8.10) ≤ k { } kF k { } kF q +1 q +1 X0 g X0 | {z } 2 Now, by (8.7), ∆qfI τ = q ∆q 1fI τ = q √naq 1. Moreover, P (∆qfI τ = q ) { }≤ − { }≤ − { } ≤ 2q 2− . By Lemma 8.2, the middle term above is bounded by

∞ q (aq 1 log Nq + 2− log Nq). − q +1 X0 p By Holder’s inequality, P (∆ fI τ = q ) (P (∆ f)2)1/2P (τ = q)1/2 2 qP (∆ f > q { } ≤ q ≤ − q √na )1/2 2 q(√na ) 1(P (∆ f)2)1/2 2 2q(√na ) 1. So the last term in (8.10) is q ≤ − q − q ≤ − q − bounded by 2 2 2q/a . In summary, · − q

∞ ∞ q E∗ G (f π f)I τ = q C 2− log N (8.11) n − q { } ≤ q q0+1 q0+1 X X p F

for some universal constant C > 0. Second, there are at most Nq functions πqf πq 1f and two values the indicator functions − − I(τ q) takes. Because the partitions are nested, the function πqf πq 1f I τ q ≥ | − − | { ≥ } ≤ q+1 ∆q 1fI τ q √naq 1. The L2(P )-norm of πqf πq 1f is bounded by 2− . Applying − { ≥ }≤ − − − Lemma 8.2 again to obtain

∞ ∞ q E∗ Gn(πqf πq 1f)I τ q (aq 1 log Nq + 2− log Nq) − − { ≥ } ≤ − q0+1 q0+1 X X p F ∞ q C 2− log N (8.12) ≤ q q +1 X0 p for some universal constant C > 0.

63 At last, we consider π f. Because π f F a(δ)√n √na and P (π f)2 δ2 q0 | q0 | ≤ ≤ ≤ q0 q0 ≤ by assumption, another application of Lemma 8.2 leads to

E∗ Gnπq0 f aq0 log Nq0 + δ log Nq0 . k kF ≤ p In view of the choice of q0, this is no more than the bound in (8.12). All above inequalities together with (8.4) yield the desired result. 

COROLLARY 8.2 For any class of measurable functions with envelope function F, there F exists an universal constant C such that

G EP∗ n C J[]( F P,2, ,L2(P )). k kF ≤ · k k F Proof. Since has a single bracket [ F, F ], we have that N (δ, ,L (P )) = 1 for δ = F − [] F 2 2 F . Review the definition in Lemma 8.4. Choose δ = F . It follows that k kP,2 k kP,2 F a(δ)= k kP,2 . log N ( F , ,L (P )) [] k kP,2 F 2 q Now √nP FI(F > √na(δ)) F 2 /a(δ)= F log N ( F , ,L (P )) by Markov’s ∗ ≤ k kP,2 k kP,2 [] k kP,2 F 2 inequality, which is bounded by J ( F , ,L (P ))q since the integrand is non-decreasing [] k kP,2 F 2 and hence

F P,2 k k  log N[](ǫ, ,L2(P )) dǫ F P,2 log N[]( F P,2, ,L2(P )). 0 F ≥ k k k k F Z q q

Proof of Theorem 30. We will use Theorem 31 to prove this theorem. The part (i) is easily satisfied. Now we verify (ii). Note there is no cover on and we don’t know if Pf 2 < δ2. Let = f g, f,g . F G { − ∈ F} With a given set of ǫ-brackets [l , u ] over we can construct 2ǫ-brackets over by taking i i F G differences [l u , u l ] of upper and lower bounds. Therefore, the bracketing number i − j i − j N (2ǫ, ,L (P )) are bounded by the square of the bracketing number N (ǫ, ,L (P )). [] G 2 [] F 2 Easily, (u l ) (l u ) 2ǫ. Hence k i − j − i − j k≤ N (2ǫ, ,L (P )) N (ǫ, ,L (P ))2. [] G 2 ≤ [] F 2 This says that

J (ǫ, ,L (P )) < . (8.13) [] G 2 ∞

64 For a given, small δ > 0, by the definition of N (δ, ,L (P )), choose a minimal number [] F 2 of brackets of size δ that cover , and use them to form a partition of = . The subsets F F ∪iFi consisting of differences f g of functions f and g belonging to the same partitioning G − set consists of functions of L2(P )-norm smaller than δ. Hence, by Lemma 8.4, there exists a finite number a(δ) := a(δ) < a(2δ) and a universal constant C such that F G

C E∗ sup sup Gn(f g) = CE∗ sup Gnh · i f,g | − | h | | ∈Fi ∈H J (δ, ,L (P )) + 2√nPFI(F > a(δ)√n) ≤ [] H 2 J (δ, ,L (P )) + 2√nPFI(F > a(δ)√n). ≤ [] G 2 since := f g; f, g , for all i , where the envelope function F can be taken H { − ∈ Fi } ⊂ G equal to the supremum of the absolute values of the upper and lower bounds of finitely many brackets that cover , for instance a minimal set of brackets of size 1. This F is F square integrable. The second term above is bounded by a(δ) 1P F 2I(F > a(δ)√n) 0 as − → n for fixed δ. First let n , then let δ 0 the left hand side goes to 0. So part (ii) →∞ →∞ ↓ in (31) is valid by using Markov’s inequality. 

Example. Let = f = I( ,t], t R . Then F { t −∞ ∈ } f f = (F (t) F (s))1/2 for s

2 F (ti ) F (ti 1) <ǫ for i = 1, 2, , k. − − − · · · 1/2 Since I( ,ti) I( ,ti 1] P,2 = (F (ti ) F (ti 1)) < ǫ. Namely, (I( ,ti 1], I( ,ti)) k −∞ − −∞ − k − − − −∞ − −∞ is an ǫ-bracket for i = 1, 2, , k. It follows that · · · 2 N (ǫ, ,L (P )) for ǫ (0, 1). [] F 2 ≤ ǫ2 ∈ But 1 log(1/ǫ) dǫ < . The previous classical Glivenko-Cantelli’s Lemma 26 and Donsker’s 0 ∞ TheoremR p 27 follow from Theorems 29 and 30. Since f = sup f(t) is the norm in l , it is continuous. By the Delta method, we k k t | | ∞ have

sup √n Fn(t) F (t) sup G(t) t R | − |→ t R ∈ ∈ 65 weakly, i.e., in distribution, as n , where G(t) is the Gaussian process we mentioned → ∞ before.

Example. Let F = {f_θ; θ ∈ Θ} be a collection of measurable functions with Θ ⊂ R^d bounded. Suppose Pm^r < ∞ and |f_{θ_1}(x) − f_{θ_2}(x)| ≤ m(x) ‖θ_1 − θ_2‖ for any θ_1 and θ_2. Then

N_[](ǫ‖m‖_{P,r}, F, L_r(P)) ≤ K (diam(Θ)/ǫ)^d    (8.14)

for any 0 < ǫ < diam(Θ). Indeed, as long as ‖θ_1 − θ_2‖ < ǫ, we have l(x) := f_{θ_1}(x) − m(x)ǫ ≤ f_{θ_2}(x) ≤ f_{θ_1}(x) + m(x)ǫ =: u(x). Also, ‖u − l‖_{P,r} = 2ǫ‖m‖_{P,r}, so [l, u] is (up to a constant) an ǫ‖m‖_{P,r}-bracket. It therefore suffices to count the minimal number of balls of radius ǫ needed to cover Θ. Note that Θ ⊂ R^d lies in a cube with side length no bigger than diam(Θ). We can cover Θ with fewer than (2 diam(Θ)/ǫ)^d cubes of side ǫ. The circumscribed balls have radius a constant multiple of ǫ and also cover Θ, and the intersections of these balls with Θ cover Θ. So the claim is true. This says that F is a P-Donsker class.
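The construction in this example is easy to carry out explicitly. The sketch below is an addition; the class f_θ(x) = |x − θ| on Θ = [0, 1], for which m(x) ≡ 1, and the grid sizes are illustrative choices. It builds the brackets [f_{θ_j} − ǫ, f_{θ_j} + ǫ] over an ǫ-grid of Θ and verifies on a fine grid of θ and x values that every f_θ is caught by some bracket; the number of brackets grows like 1/ǫ, matching (8.14) with d = 1.

import numpy as np

eps = 0.05
centers = np.arange(0.0, 1.0 + eps, eps)          # eps-grid of Theta = [0, 1]
x = np.linspace(-2.0, 3.0, 501)                   # points at which the bracket property is tested
ok = True
for theta in np.linspace(0.0, 1.0, 1001):
    j = np.argmin(np.abs(centers - theta))        # nearest grid point, within eps of theta
    f, c = np.abs(x - theta), np.abs(x - centers[j])
    ok &= np.all(c - eps <= f) and np.all(f <= c + eps)
print("number of brackets:", len(centers), " all f_theta covered:", ok)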

Example (Sobolev classes). For k 1, let ≥ 1 = f : [0, 1] R; f 1 and (f (k)(x))2 dx 1 . F → k k∞ ≤ ≤  Z0  Then there exists an universal constant K such that 1 1/k log N[](ǫ, , ) K F k · k∞ ≤ ǫ   Since f P,2 f for any P, it is easy to check that N[](ǫ, ,L2(P )) N[](ǫ, , ) k k ≤ k k∞ F ≤ F k · k∞ for any P. So is P -Donsker class for any P. F

Example (Bounded Variation). Let

= f : R [ 1, 1] of bounded variation 1 . F { → − } Any function of bounded variation is the difference of two monotone increasing functions. Then for any r 1 and probability measure P, ≥ 1 log N (ǫ, ,L (P )) K . [] F r ≤ ǫ   Therefore is P -Donsker for every P. F

9 Consistency and Asymptotic Normality of Maximum Likelihood Estimators

9.1 Consistency

Let X_1, X_2, ..., X_n be a random sample from a population distribution with pdf or pmf f(x|θ), where θ is an unknown parameter. Under certain conditions on f(x|θ) the MLE θ̂ will be consistent and satisfy a central limit theorem, but in some cases these properties fail. Let us look at a good example and a pathological example.

Example. Let X_1, X_2, ..., X_n be i.i.d. from Exp(θ) with pdf

f(x|θ) = θ e^{−θx} if x > 0, and 0 otherwise.

So EX_1 = 1/θ and Var(X_1) = 1/θ². It is easy to check that θ̂ = 1/X̄. By the CLT,

√n (X̄ − 1/θ) ⇒ N(0, θ^{−2})

as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX_1) = θ and g′(EX_1) = −θ². By the Delta method,

√n(θ̂ − θ) ⇒ N(0, g′(µ)² σ²) = N(0, θ²)

as n → ∞. Of course, θ̂ → θ in probability.

Example. Let X_1, X_2, ..., X_n be i.i.d. from U[0, θ], where θ is unknown. The MLE is θ̂ = max_i X_i. First, it is easy to check that θ̂ → θ in probability. But θ̂ does not obey the CLT. In fact,

P( n(θ − θ̂)/θ ≤ x ) → 1 − e^{−x} if x > 0, and → 0 otherwise,

as n → ∞. In some cases even consistency, i.e., θ̂ → θ in probability, does not hold. Next, we study some sufficient conditions for consistency; later we provide sufficient conditions for the CLT.

Let X_1, ..., X_n be a random sample from a density p_θ with reference measure µ, that is, P(X_1 ∈ A) = ∫_A p_θ(x) µ(dx), where θ ∈ Θ. The maximum likelihood estimator θ̂_n maximizes the function h(θ) := Σ_i log p_θ(X_i) over Θ, or equivalently, the function

M_n(θ) = (1/n) Σ_{i=1}^n log (p_θ/p_{θ_0})(X_i),

where θ_0 is the true parameter. Under suitable conditions, by the weak law of large numbers,

M_n(θ) → M(θ) := E_{θ_0} log (p_θ/p_{θ_0})(X_1) = ∫ p_{θ_0}(x) log (p_θ/p_{θ_0})(x) µ(dx)    (9.1)

in probability as n → ∞. The number −M(θ) is called the Kullback-Leibler divergence of p_θ and p_{θ_0}. Let Y = p_{θ_0}(Z)/p_θ(Z), where Z ∼ p_θ. Then E_θ Y = 1 and

M(θ) = −E_θ(Y log Y) ≤ −(E_θ Y) log(E_θ Y) = 0

by Jensen's inequality. Of course M(θ_0) = 0, that is, θ_0 attains the maximum of M(θ).

Obviously, M(θ_0) = M_n(θ_0) = 0. The following gives the consistency of θ̂_n for θ_0.

THEOREM 32 Suppose sup_{θ∈Θ} |M_n(θ) − M(θ)| → 0 in probability and sup_{θ: d(θ,θ_0)≥ǫ} M(θ) < M(θ_0) for every ǫ > 0. Then θ̂_n → θ_0 in probability.

Proof. For any ǫ > 0, we need to show that

P(d(θ̂_n, θ_0) ≥ ǫ) → 0    (9.2)

as n → ∞. By the second condition, there exists δ > 0 such that

sup_{θ: d(θ,θ_0)≥ǫ} M(θ) < M(θ_0) − δ.

Thus, if d(θ̂_n, θ_0) ≥ ǫ, then M(θ̂_n) < M(θ_0) − δ. Note that

M(θ̂_n) − M(θ_0) = [M_n(θ̂_n) − M_n(θ_0)] + [M_n(θ_0) − M(θ_0)] + [M(θ̂_n) − M_n(θ̂_n)] ≥ −2 sup_{θ∈Θ} |M_n(θ) − M(θ)|

since M_n(θ̂_n) ≥ M_n(θ_0). Consequently,

P(d(θ̂_n, θ_0) ≥ ǫ) ≤ P(M(θ̂_n) − M(θ_0) < −δ) ≤ P( sup_{θ∈Θ} |M_n(θ) − M(θ)| ≥ δ/2 ) → 0

by the first assumption.  □

Let X_1, ..., X_n be a random sample from a density p_θ with reference measure µ, that is, P(X_1 ∈ A) = ∫_A p_θ(x) µ(dx), where θ ∈ Θ. The maximum likelihood estimator θ̂_n maximizes Σ_i log p_θ(X_i) over Θ, or equivalently, the function

M_n(θ) := (1/n) Σ_{i=1}^n log (p_θ/p_{θ_0})(X_i),

where θ_0 is the true parameter. In practice this is usually carried out by solving the equation ∂M_n(θ)/∂θ = 0 for θ. Let

ψ_θ(x) = ∂ log(p_θ(x)/p_{θ_0}(x)) / ∂θ.

Then the maximum likelihood estimator θ̂_n is a zero of the function

Ψ_n(θ) := (1/n) Σ_{i=1}^n ψ_θ(X_i).    (9.3)

Though we still use the notation Ψ_n in the following theorem, it is not necessarily the function above; it can be any function with the properties stated below.

THEOREM 33 Let Θ be a subset of the real line, let Ψ_n be random functions and let Ψ be a fixed function of θ such that Ψ_n(θ) → Ψ(θ) in probability for every θ. Assume that either Ψ_n(θ) is continuous and has exactly one zero θ̂_n, or Ψ_n is nondecreasing with Ψ_n(θ̂_n) = o_P(1). Let θ_0 be a point such that Ψ(θ_0 − ǫ) < 0 < Ψ(θ_0 + ǫ) for every ǫ > 0. Then θ̂_n → θ_0 in probability.

Proof. Case 1. Suppose Ψn(θ) is continuous and has exactly one zero θˆn. Then

P (Ψ (θ ǫ) < 0, Ψ (θ + ǫ) > 0) P (θ ǫ θˆ θ + ǫ). n 0 − n 0 ≤ 0 − ≤ n ≤ 0 Since Ψ (θ ǫ) Ψ(θ ǫ) in probability, and Ψ(θ ǫ) < 0 < Ψ(θ + ǫ), the left hand n 0 ± → 0 ± 0 − 0 side goes to one.

Case 2. Suppose Ψn(θ) is nondecreasing satisfying Ψn(θˆn)= oP (1). Then

P ( θˆ θ ǫ) P (θˆ >θ + ǫ)+ P (θˆ <θ ǫ) | n − 0|≥ ≤ n 0 n 0 − P (Ψ (θˆ ) Ψ (θ + ǫ)) + P (Ψ (θˆ ) Ψ (θ ǫ)). ≤ n n ≥ n 0 n n ≤ n 0 − Now, Ψ (θ ǫ) Ψ(θ ǫ) in probability. This together with Ψ (θˆ )= o (1) shows that n 0 ± → 0 ± n n P P ( θˆ θ ǫ) 0 as n .  | n − 0|≥ → →∞

Example. Let X_1, ..., X_n be a random sample from Exp(θ), that is, the density function is p_θ(x) = θ^{−1} exp(−x/θ) I(x ≥ 0) for θ > 0. We know that, under the true model, the MLE θ̂_n = X̄_n → θ_0 in probability. Let us verify that this conclusion can indeed be deduced from Theorem 33. Actually, ψ_θ(x) = (log p_θ(x))′ = −θ^{−1} + xθ^{−2} for x ≥ 0. Thus Ψ_n(θ) = −θ^{−1} + X̄_n θ^{−2}, which converges to Ψ(θ) = E_{θ_0}(−θ^{−1} + X_1 θ^{−2}) = θ^{−2}(θ_0 − θ); this is positive or negative according to whether θ is smaller or bigger than θ_0. Also, θ̂_n = X̄_n is the unique zero of Ψ_n. Applying Theorem 33 to −Ψ_n and −Ψ, we obtain the consistency result.
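The following Python sketch is an addition (the true value θ_0, the sample sizes and the replication count are arbitrary). It simulates this example, checking the consistency θ̂_n = X̄_n → θ_0 and, anticipating the next subsection, that √n(θ̂_n − θ_0) is approximately N(0, θ_0²), the limit suggested by the Fisher information I(θ_0) = 1/θ_0² of this parametrization.

import numpy as np

rng = np.random.default_rng(8)
theta0, reps = 2.0, 20000
for n in (50, 500, 5000):
    x = rng.exponential(scale=theta0, size=(reps, n))
    mle = x.mean(axis=1)                      # theta_hat_n = sample mean for p_theta(x) = exp(-x/theta)/theta
    z = np.sqrt(n) * (mle - theta0)
    print(n, "mean of theta_hat:", mle.mean(),
          " sd of sqrt(n)*(theta_hat - theta0):", z.std(), " target sd:", theta0)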

9.2 Asymptotic Normality

Now we study central limit theorems for the MLE. We first illustrate the idea behind the asymptotic normality of the MLE. Recalling (9.3), do a Taylor expansion of ψ_{θ̂_n}(X_i) around θ_0:

ψ_{θ̂_n}(X_i) ≈ ψ_{θ_0}(X_i) + (θ̂_n − θ_0) ψ̇_{θ_0}(X_i) + (1/2)(θ̂_n − θ_0)² ψ̈_{θ_0}(X_i).

We will use the following notation:

Pf = ∫ f(x) p_{θ_0}(x) µ(dx) for any real function f(x), and P_n = (1/n) Σ_{i=1}^n δ_{X_i}.

Then P_n f = (1/n) Σ_{i=1}^n f(X_i). Thus,

0 = Ψ_n(θ̂_n) ≈ P_n ψ_{θ_0} + (θ̂_n − θ_0) P_n ψ̇_{θ_0} + (1/2)(θ̂_n − θ_0)² P_n ψ̈_{θ_0}.

Reorganize this in the following form:

√n(θ̂_n − θ_0) ≈ − √n(P_n ψ_{θ_0}) / ( P_n ψ̇_{θ_0} + (1/2)(θ̂_n − θ_0) P_n ψ̈_{θ_0} ).

Recall the Fisher information

I(θ_0) = E_{θ_0} ( ∂ log p_θ(X_1)/∂θ |_{θ=θ_0} )² = E_{θ_0} (ψ_{θ_0}(X_1))² = −E_{θ_0} ( ∂² log p_θ(X_1)/∂θ² |_{θ=θ_0} ) = −E_{θ_0} (ψ̇_{θ_0}(X_1)).

By the CLT and LLN, √n(P_n ψ_{θ_0}) ⇒ N(0, I(θ_0)), P_n ψ̇_{θ_0} → −I(θ_0) and P_n ψ̈_{θ_0} → E_{θ_0} ψ̈_{θ_0}(X_1) in probability. This illustrates that

1 √n(θˆ θ )= N(0,I(θ )− ) n − 0 ⇒ 0 as n . The next two theorems will make these steps rigorous. →∞ Let g(θ)= Eθ0 (mθ(X1)) = mθ(x)pθ0 (x)µ( dx). We need the following condition: R 1 g(θ)= g(θ )+ (θ θ )T V (θ θ )+ o( θ θ 2), (9.1) 0 2 − 0 θ0 − 0 k − 0k where

∂mθ Rd Vθ0 = Eθ0 θ=θ0 , θ = (θ1, ,θd) . ∂θi∂θj | 1 i,j d · · · ∈    ≤ ≤

70 THEOREM 34 For each θ in an open subset of Euclidean space let x m (x) be a mea- 7→ θ surable function such that θ m (x) is differentiable at θ for P -almost every x with 7→ θ 0 derivative m˙ θ0 (x) and such that , for every θ1 and θ2 in a neighborhood of θ0 and a mea- surable function n(x) with En(X )2 < 1 ∞ m (x) m (x) n(x) θ θ . | θ1 − θ2 |≤ k 1 − 2k

Furthermore, assume the map θ Emθ(X1) has expression as in (9.1). If Pnmˆ 7→ θn ≥ sup P m o (n 1) and θˆ P θ , then θ n θ − P − n → 0 n 1 1 √n(θˆn θ0)= V − m˙ θ (Xi)+ oP (1). − − θ0 √n 0 Xi=1 1 T 1 In particular, √n(θˆn θ0)= N(0, V − E(m ˙ θ m˙ )V − ) as n . − ⇒ θ0 0 θ0 θ0 →∞ A (p ,θ Θ) is called differentiable in quadratic mean if there exists a θ ∈ measurable vector-valued function l˙ such that, as θ θ , θ0 → 0 1 2 √p √p (θ θ )T l˙ √p du = o( θ θ 2). θ − θ0 − 2 − 0 θ0 θ0 k − 0k Z   THEOREM 35 Suppose that the model (P : θ Θ) is differentiable in quadratic mean at θ ∈ an inner point θ of Θ Rk. Furthermore, suppose that there exists a measurable function 0 ⊂ l(x) with E l2(X ) < such that, for every θ and θ in a neighborhood of θ , θ0 1 ∞ 1 2 0 log p (x) log p (x) l(x) θ θ . | θ1 − θ2 |≤ k 1 − 2k ˆ If the Fisher information matrix Iθ0 is non-singular and θn is consistent, then

n 1 1 √n(θˆn θ0)= I− l˙θ (Xi)+ oP (1) − θ0 √n 0 Xi=1 1 In particular, √n(θˆn θ0)= N(0, I− ) as n , where − ⇒ θ0 →∞ 2 ∂ log pθ(X1) I(θ0)= E . − ∂θi∂θj 1 i,j k   ≤ ≤ We need some preparations in proving the above theorems. Given functions x m (x), θ Rd, we need conditions that ensure that, for a given → θ ∈ sequence r and any sequence h˜ = O (1), n →∞ n P∗ T P Gn rn(m ˜ mθ ) h˜ m˙ θ 0. (9.2) θ0+hn/rn − 0 − n 0 →   71 LEMMA 9.1 For each θ in an open subset of Euclidean space let mθ(x), as a function of x, be measurable for each θ; and as a function of θ, is differentiable for almost every x

(w.r.t. P ) with derivative m˙ θ0 (x) and such that for every θ1 and θ2 in a neighborhood of θ0 and a measurable function m˙ such that pm˙ 2 < , ∞ m (x) m (x) m˙ (x) θ θ . k θ1 − θ2 k≤ k 1 − 2k

Then (9.2) holds for every random sequence h˜n that is bounded in probability.

Proof. Because h˜n is bounded in probability, to show (9.2), it is enough to show, w.l.o.g. G ˜T P sup n(rn(mθ0+θ/rn mθ0 ) θ m˙ θ0 ) 0 θ 1 | − − | → | |≤ as n . Define →∞ = f := r (m m ) θT m˙ , θ 1 . Fn { θ n θ0+θ/rn − θ0 − θ0 | |≤ } Then

f (x) f (x) 2m ˙ (x) θ θ | θ1 − θ2 |≤ θ0 k 1 − 2k d for any θ1 and θ2 in the unit ball in R by the Lipschitz condition. Further, set

T Hn = sup rn(mθ0+θ/rn mθ0 ) θ m˙ θ0 . θ 1 | − − | | |≤ Then H is a cover and H 0 as n by the definition of partial derivative. By n n → → ∞ Bounded Convergence Theorem, δ := (PH2)1/2 0. Thus, by Corollary 8.2 and Example n n → 8.14,

δn G d EP∗ n n C J[]( H P,2, n,L2(P )) C log(Kǫ− ) dǫ 0 k kF ≤ · k k F ≤ 0 → Z q as n . The desired conclusion follows.  →∞ We need some preparation before proving Theorem 34.

Let Pn be the empirical distribution of a random sample of size n from a distribution P, and, for every θ in a metric space (Θ, d), mθ(x) be a measurable function. Let θˆn (nearly) maximize the criterion function Pnmθ. The number θ0 is the truth, that is, the maximizer of m over θ Θ. Recall G = √n(P P ). θ ∈ n n − THEOREM 36 (Rate of Convergence) Assume that for fixed constants C and α>β, for every n and every sufficiently small δ > 0,

α sup P (mθ mθ0 ) Cδ , d(θ,θ0)>δ − ≤− G β E∗ sup n(mθ mθ0 ) Cδ . d(θ,θ0)<δ | − |≤

72 α/(2β 2α) If the sequence θˆn satisfies Pnmˆ Pnmθ OP (n − ) and converges in outer prob- θn ≥ 0 − 1/(2α 2β) ˆ ability to θ0, then n − d(θn, θ0)= OP∗ (1).

1/(2α 2β) α/(2β 2α) Proof. Set rn = n − and Pnmˆ Pnmθ Rn with 0 Rn = OP (n − ). θn ≥ 0 − ≤ Partition (0, ) by S = θ : 2j 1 < r d(θ,θ ) 2j for all integers j. If r d(θˆ ,θ ) 2M ∞ j,n { − n 0 ≤ } n n 0 ≥ ˆ P for a given M, then θn is in one of the shells Sj,n with j M. In that case, supθ S ( nmθ ≥ ∈ j,n − P m ) R . It follows that n θ0 ≥− n

M K P ∗(r d(θˆ ,θ ) 2 ) P ∗ sup (P m P m ) (9.3) n n 0 n θ n θ0 rα ≥ ≤ j θ Sj,n − ≥− n ! j M,2 ǫrn ∈ ≥ X≤ α +P ∗(2d(θˆ ,θ ) ǫ)+ P (r R K). n 0 ≥ n n ≥ The middle term on right goes to zero; the last term can be arbitrarily small by choosing large K. We only need to show the sum is arbitrarily small when M is large enough as n . By the given condition →∞ j 1 2 − α sup P (mθ mθ0 ) C α . θ Sj,n − ≤− rn ∈ For M such that (1/2)C2(M 1)α K, by the fact that sup (f + g ) sup f + sup g , − ≥ s s s ≤ s s s s K 2(j 1)α P P G √ − P ∗ sup ( nmθ nmθ0 ) α P ∗ sup n(mθ mθ0 ) C n α . θ Sj,n − ≥−rn ≤ θ Si,n − ≥ 2rn ∈ ! ∈ ! Therefore, the sum in (9.3) is bounded by

(j 1)α j β α 2 − (2 /rn) 2rn P ∗ sup G (m m ) C√n | n θ − θ0 |≥ 2rα ≤ (j 1)α j θ Si,n n ! √n2 − j M,2 ǫrn ∈ j M ≥ X≤ X≥ by Markov’s inequality and the definition of r . The right hand side goes to zero as M . n →∞ 

COROLLARY 9.1 For each θ in an open subset of Euclidean space, mθ(x) is a measurable function such that, there exists m˙ (x) with P m˙ 2 < satisfying ∞ m (x) m (x) m˙ (x) θ θ . | θ1 − θ2 |≤ k 1 − 2k Furthermore, assume 1 Pm = Pm + (θ θ )T V (θ θ )+ o( θ θ 2) (9.4) θ θ0 2 − 0 θ0 − 0 k − 0k

1 P with Vθ nonsingular. If Pnmˆ Pnmθ OP (n− ) and θˆn θ0, then √n(θˆn θ0)= OP (1). 0 θn ≥ 0 − → −

73 Proof. (9.4) implies the first condition of Theorem 36 holds with α = 2 since Vθ0 is nonsingular. Now we use Corollary 8.2 to the class of functions = m m ; θ θ < δ F { θ− θ0 k − 0k } to see the second condition is valid with β = 1. This class has envelope function F = δm,˙ while

δ m˙ P,2 G k k E∗ sup n(mθ mθ0 ) C log[](ǫ, ,L2(P )) dǫ θ θ0 <δ | − | ≤ · 0 F k − k Z q δ m˙ P,2 δ d k k log K dǫ = C δ v 1 ≤ 0 u ǫ ! Z u   t by (8.14) with Diam(Θ) = const δd, where C depends on m˙ .  · 1 k kP,2

Proof of Theorem 34. By Lemma 9.1,

T P Gn rn(m ˜ mθ ) h˜ m˙ θ 0. (9.5) θ0+hn/rn − 0 − n 0 →   Expand P (rn(m ˜ mθ ) by condition 9.1, we obtain θ0+hn/rn − 0

1 T T nPn(m ˜ mθ )= h˜ Vθ h˜n + h˜ Gnm˙ θ + oP (1). θ0+hn/√n − 0 2 n 0 n 0

By Corollary 9.1, √n(θˆ θ) is bounded in probability (same as tightness). Take hˆ = n − n 1 √n(θˆn θ) and h˜n = V − Gnm˙ θ , we then obtain the Taylor’s expansions of Pnmˆ and − − θ0 0 θn P m 1 as follows n θ0 V − Gnm˙ θ − θ0 0

1 T T nPn(mˆ mθ )= hˆ Vθ hˆn + hˆ Gnm˙ θ + oP (1), θn − 0 2 n 0 n 0 1 T 1 nP (m 1 m )= G m˙ V − G m˙ + o (1), n θ0 V − Gnm˙ θ /√n θ0 n θ0 θ0 n θ0 P − θ0 0 − −2 where the second one is obtained through a bit algebra. By the definition of θˆn, the left hand side of the first equation is greater than that of the second one. So are he right hand sides. Take the difference and make it to a complete square, we then have

1 1 T 1 (hˆn + V − Gnm˙ θ ) Vθ (hˆn + V − Gnm˙ θ )+ oP (1) 0. 2 θ0 0 0 θ0 0 ≥

We know Vθ0 is strictly negative-definite, the quadratic form must converge to zero in 1 1 probability. This is also the case hˆn + V − Gnm˙ θ , that is, hˆn = V − Gnm˙ θ + oP (1). k θ0 0 k − θ0 0 

74 10 Appendix

Let be a collection of subsets of Ω and is generated by , that is, = σ( ). Let P be A B A B A a probability measure on (Ω, ). B LEMMA 10.1 Suppose has the following property: (i) Ω , (ii)Ac if A , A ∈ A ∈ A ∈ A and (iii) m A if A for all 1 i m. Then, for any B and ǫ > 0, there ∪i=1 i ∈ A i ∈ A ≤ ≤ ∈ B exists A such that P (B∆A) < ǫ. ∈A Proof. Let be the set of B satisfying the conclusion. Obviously, . It is B′ ∈B A⊂B′ ⊂B enough to verify that is a σ-algebra. B′ It is easy to see that (i) Ω ; (ii) Bc if B since A∆B = Ac∆Bc. (iii) If ∈ B′ ∈ B′ ∈ B′ B for i 1, there exist A such that P (B ∆A ) < ǫ/2i for all i 1. Evidently, i ∈ B′ ≥ i ∈ A i i ≥ n B B as n . Therefore, there exists n < such that P ( B ) ∪i=1 i ↑ ∪i∞=1 i → ∞ 0 ∞ | ∪i∞=1 i − P ( n0 B ) ǫ/2. It is easy to check that ( n0 B )∆( n0 A ) n0 (B ∆A ). Write ∪i=1 i | ≤ ∪i=1 i ∪i=1 i ⊂ ∪i=1 i i B = B and B˜ = n0 B and A = n0 A . Then A . Note B∆A (B B˜) (B˜∆A). ∪i∞=1 i ∪i=1 i ∪i=1 i ∈A ⊂ \ ∪ The above facts show that P (B∆A) < ǫ. Thus is a σ-algebra.  B′ LEMMA 10.2 Let X , X , , X , m 1 be random variables defined on (Ω, , ). Let 1 2 · · · m ≥ F P f(x , ,x ) be a real measurable function with E f(X , , X ) p < for some p 1. 1 · · · m | 1 · · · m | ∞ ≥ Then there exists f (X , , X ); n 1 such that { n 1 · · · m ≥ } (i) f (X , , X ) f(X , , X ) a.s. n 1 · · · m → 1 · · · m (ii) f (X , , X ) f(X , , X ) in Lp(Ω, , ). n 1 · · · m → 1 · · · m F P (iii) For each n 1, f (X , , X )= km c g (X ) g (X ) for some k < , ≥ n 1 · · · m i=1 i i1 1 · · · im m m ∞ constants c , and g (X ) = I (X ) for some sets A (R), and all 1 i k and i ij j Ai,j j P i,j ∈ B ≤ ≤ m 1 j m. ≤ ≤ Proof. To save notation, we write f = f(x , ,x ). Since E fI( f C) f p 0 1 · · · m | | | ≤ − | → as C , choose C such that E fI( f C ) f p 1/k2 for k = 1, 2, . We will → ∞ k | | | ≤ k − | ≤ · · · show that there exists function g = g (x , ,x ) of the form in (iii) such that E fI( f k k 1 · · · k | | |≤ C ) g p 1/k2 for all k 1. Thus, E f g p < 2/k2 for all k 1. The assertion (ii) k − k| ≤ ≥ | − k| ≥ p p then follows. Also, it implies E( k 1 f gk ) < , then k 1 f gk < a.s. we ≥ | − | ∞ ≥ | − | ∞ ontain (i). Therefore, to prove thisP lemma, we assume w.l.o.g.P that f is bounded and we need to show that there exist f = f (x , ,x ); n 1 of the form in (iii) such that n n 1 · · · m ≥ 1 E f f p (10.6) | − n| ≤ n2 for all n 1. ≥

Since $f$ is bounded, for any $n\ge 1$ there exists $h_n$ such that

$$\sup_{x\in\mathbb{R}^m}|f(x)-h_n(x)|<\frac{1}{2n^2},\qquad(10.7)$$

where $h_n$ is a simple function, i.e., $h_n(x)=\sum_{i=1}^{k_n}c_i I\{x\in B_i\}$ for some $k_n<\infty$, constants $c_i$ and sets $B_i\in\mathcal{B}(\mathbb{R}^m)$.

Now set $X=(X_1,\cdots,X_m)\in\mathbb{R}^m$ and let $\mu$ be the probability measure of $X$ under $P$. Let $\mathcal{A}$ be the set of all finite unions of sets in $\mathcal{A}_1:=\{\prod_{i=1}^m A_i;\ A_i\in\mathcal{B}(\mathbb{R})\}$. By the construction of $\mathcal{B}(\mathbb{R}^m)$, we know that $\mathcal{B}(\mathbb{R}^m)=\sigma(\mathcal{A})$. It is not difficult to verify that $\mathcal{A}$ satisfies the conditions in Lemma 10.1. Thus there exist $E_i\in\mathcal{A}$ such that

$$\int|I_{B_i}(x)-I_{E_i}(x)|^p\,d\mu=\mu(B_i\Delta E_i)<\frac{1}{(2cn^2)^p}\quad\mbox{and}\quad E_i=\bigcup_{j=1}^{k_i}\prod_{l=1}^m A_{i,j,l},$$

where $c=1+\sum_{i=1}^{k_n}|c_i|$ and $A_{i,j,l}\in\mathcal{B}(\mathbb{R})$ for all $i,j$ and $l$. Now, since $\|\cdot\|_p$ is a norm, we have

$$\Big\|h_n(X)-\sum_{i=1}^{k_n}c_i I_{E_i}(X)\Big\|_p\le\sum_{i=1}^{k_n}|c_i|\cdot\|I_{E_i}(X)-I_{B_i}(X)\|_p\le\frac{1}{2n^2}.\qquad(10.8)$$

Observe that $\nu(E):=I_E(X)$ is a measure (the point mass at $X$) on $(\mathbb{R}^m,\mathcal{B}(\mathbb{R}^m))$, and that the intersection of any finitely many product sets $\prod_{l=1}^m A_{i,j,l}$ is still in $\mathcal{A}_1$. By the inclusion-exclusion formula, $I_{E_i}(X)$ is a finite linear combination of $I_F(X)$ with $F\in\mathcal{A}_1$. Thus $f_n(X):=\sum_{i=1}^{k_n}c_i I_{E_i}(X)$ is of the form in (iii). Now, by (10.7) and (10.8),

$$\|f(X)-f_n(X)\|_p\le\|f(X)-h_n(X)\|_p+\|h_n(X)-f_n(X)\|_p\le\frac{1}{n^2}.$$

Thus, (10.6) follows. □
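As a numerical illustration of Lemma 10.2 (a sketch, not part of the proof), a bounded function of two random variables can be approximated in $L^p$ by sums of the form in (iii). Below the sums are built from a rectangular grid, with $c_{ij}$ set to the value of $f$ at the cell center; the function $f(x,y)=\sin(x)\,y$, the truncated normal inputs, the grid and $p=2$ are arbitrary choices, since the lemma only asserts that some such approximating sequence exists.

```python
# L^p approximation of f(X1, X2) by sums of products of indicators of
# one-dimensional Borel sets (here: intervals of a rectangular grid).
import numpy as np

rng = np.random.default_rng(1)
X1 = np.clip(rng.normal(size=100_000), -3, 3)
X2 = np.clip(rng.normal(size=100_000), -3, 3)
f_vals = np.sin(X1) * X2
p = 2

for n_bins in (2, 8, 32, 128):
    edges = np.linspace(-3, 3, n_bins + 1)
    # cell index of (X1, X2); each cell is A_i x B_j, so the approximation is
    #   sum_{i,j} c_{ij} I_{A_i}(X1) I_{B_j}(X2)
    i1 = np.clip(np.digitize(X1, edges) - 1, 0, n_bins - 1)
    i2 = np.clip(np.digitize(X2, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    c = np.sin(centers)[:, None] * centers[None, :]   # c_{ij} = f at the cell center
    approx = c[i1, i2]
    print(n_bins, "cells per axis, empirical E|f - f_n|^p =",
          np.mean(np.abs(f_vals - approx) ** p))
```

The empirical $L^p$ error shrinks as the grid is refined, mirroring (10.6).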

LEMMA 10.3 Let $\xi_1,\cdots,\xi_k$ be random variables, and let $\phi(x)$, $x\in\mathbb{R}^k$, be a measurable function with $E|\phi(\xi_1,\cdots,\xi_k)|<\infty$. Let $\mathcal{F}$ and $\mathcal{G}$ be two $\sigma$-algebras. Suppose

$$E\big(f_1(\xi_1)\cdots f_k(\xi_k)\,|\,\mathcal{F}\big)=E\big(f_1(\xi_1)\cdots f_k(\xi_k)\,|\,\mathcal{G}\big)\quad\mbox{a.s.}\qquad(10.9)$$

holds for all bounded, measurable functions $f_i:\mathbb{R}\to\mathbb{R}$, $1\le i\le k$. Then $E(\phi(\xi_1,\cdots,\xi_k)\,|\,\mathcal{F})=E(\phi(\xi_1,\cdots,\xi_k)\,|\,\mathcal{G})$ almost surely.

Proof. From (10.9), we know that the desired conclusion holds for any $\phi$ that is a finite linear combination of products $f_1\cdots f_k$. By Lemma 10.2, there exist such functions $\phi_n$, $n\ge 1$, such that

$E|\phi_n(\xi_1,\cdots,\xi_k)-\phi(\xi_1,\cdots,\xi_k)|\to 0$ as $n\to\infty$. Therefore, by Jensen's inequality,

$$E\big|E(\phi_n(\xi_1,\cdots,\xi_k)\,|\,\mathcal{F})-E(\phi(\xi_1,\cdots,\xi_k)\,|\,\mathcal{F})\big|\le E\big|\phi_n(\xi_1,\cdots,\xi_k)-\phi(\xi_1,\cdots,\xi_k)\big|\to 0,$$

and the same holds with $\mathcal{G}$ in place of $\mathcal{F}$. Since $E(\phi_n(\xi_1,\cdots,\xi_k)\,|\,\mathcal{F})=E(\phi_n(\xi_1,\cdots,\xi_k)\,|\,\mathcal{G})$ for all $n\ge 1$, the conclusion follows because the $L^1$-limit is unique up to a set of measure zero. □

PROPOSITION 10.1 Let $X_1,X_2,\cdots$ be a Markov chain. Let $1\le l\le m\le n-1$. Assume $\phi(x)$, $x\in\mathbb{R}^{n-m}$, is a measurable function such that $E|\phi(X_{m+1},\cdots,X_n)|<\infty$. Then

$$E\big(\phi(X_{m+1},\cdots,X_n)\,|\,X_m,\cdots,X_l\big)=E\big(\phi(X_{m+1},\cdots,X_n)\,|\,X_m\big).$$

Proof. Let $p=n-m$. By Lemma 10.3, we only need to prove the proposition for $\phi(x)=\phi_1(x_1)\cdots\phi_p(x_p)$ with $x=(x_1,\cdots,x_p)$ and the $\phi_i$'s bounded functions. We do this by induction on $p$.

If $n=m+1$, then by the Markov property, $Y:=E(\phi_1(X_{m+1})\,|\,X_m,\cdots,X_1)=E(\phi_1(X_{m+1})\,|\,X_m)$. Recall the property that $E(E(Z\,|\,\mathcal{F})\,|\,\mathcal{G})=E(E(Z\,|\,\mathcal{G})\,|\,\mathcal{F})=E(Z\,|\,\mathcal{F})$ if $\mathcal{F}\subset\mathcal{G}$. We then have

$$E(\phi_1(X_{m+1})\,|\,X_m,\cdots,X_l)=E(Y\,|\,X_m,\cdots,X_l)=E(\phi_1(X_{m+1})\,|\,X_m)$$

since $Y=E(\phi_1(X_{m+1})\,|\,X_m)$ is $\sigma(X_m,\cdots,X_l)$-measurable.

Now assume the conclusion holds for products of $p$ bounded functions, and consider $n+1$ in place of $n$. Using the same argument (condition first on $X_1,\cdots,X_n$ and apply the Markov property), we obtain

$$E\{\phi_1(X_{m+1})\cdots\phi_{p+1}(X_{n+1})\,|\,X_m,\cdots,X_l\} =E\{\phi_1(X_{m+1})\cdots\phi_{p-1}(X_{n-1})\tilde\phi_p(X_n)\,|\,X_m,\cdots,X_l\} =E\{\phi_1(X_{m+1})\cdots\phi_{p-1}(X_{n-1})\tilde\phi_p(X_n)\,|\,X_m\},$$

where $\tilde\phi_p(X_n)=\phi_p(X_n)\cdot E(\phi_{p+1}(X_{n+1})\,|\,X_n)$, and the last step is by the induction hypothesis. Since $\phi_1(X_{m+1})\cdots\phi_{p-1}(X_{n-1})\tilde\phi_p(X_n)$ is equal to $E(\phi_1(X_{m+1})\cdots\phi_{p+1}(X_{n+1})\,|\,X_n,\cdots,X_1)$, the tower property gives $E\{\phi_1(X_{m+1})\cdots\phi_{p-1}(X_{n-1})\tilde\phi_p(X_n)\,|\,X_m\}=E\{\phi_1(X_{m+1})\cdots\phi_{p+1}(X_{n+1})\,|\,X_m\}$. The conclusion follows. □
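Here is a minimal simulation sketch of Proposition 10.1 on a two-state chain (the transition matrix, the function of the future block, and the time indices are arbitrary choices; the code indexes time from 0 and starts the chain in state 0). The empirical conditional mean of $\phi$ given the block $(X_l,\cdots,X_m)$ should agree with the one given $X_m$ alone.

```python
# Empirical check: E[phi(future) | X_l,...,X_m] depends only on X_m.
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])            # transition matrix of a 2-state chain

def sample_chain(T, reps):
    X = np.zeros((reps, T), dtype=int)     # chain started in state 0, time indexed from 0
    for t in range(1, T):
        u = rng.random(reps)
        X[:, t] = np.where(u < P[X[:, t - 1], 1], 1, 0)
    return X

l, m, n = 1, 3, 5                     # condition on (X_l,...,X_m); phi uses (X_{m+1},...,X_n)
X = sample_chain(n + 1, 400_000)
phi = (X[:, m + 1] + X[:, n]) ** 2    # some bounded function of the future block

for past in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]:
    mask_full = np.all(X[:, l:m + 1] == past, axis=1)   # condition on the whole block
    mask_last = X[:, m] == past[-1]                     # condition on X_m only
    print(past, round(phi[mask_full].mean(), 3), "vs", round(phi[mask_last].mean(), 3))
```

All rows print approximately the same pair of numbers, since the extra past coordinates carry no additional information about the future once $X_m$ is fixed.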

The following proposition says that a Markov chain also has the Markov property in reversed time: conditionally on $X_{m+1},\cdots,X_n$, the past block depends only on $X_{m+1}$.

PROPOSITION 10.2 Let $X_1,X_2,\cdots$ be a Markov chain. Let $1\le m<n$. Assume $\phi(x)$, $x\in\mathbb{R}^m$, is a measurable function such that $E|\phi(X_1,\cdots,X_m)|<\infty$. Then

$$E\big(\phi(X_1,\cdots,X_m)\,|\,X_{m+1},\cdots,X_n\big)=E\big(\phi(X_1,\cdots,X_m)\,|\,X_{m+1}\big).$$

Proof. By Lemma 10.3, we may assume without loss of generality that $\phi(x)$ is a bounded function. From the definition of conditional expectation, to prove the proposition it suffices to show that

$$E\{\phi(X_1,\cdots,X_m)\,g(X_{m+1},\cdots,X_n)\} =E\{E(\phi(X_1,\cdots,X_m)\,|\,X_{m+1})\cdot g(X_{m+1},\cdots,X_n)\}\qquad(10.10)$$

for any bounded, measurable function $g(x)$, $x\in\mathbb{R}^{n-m}$. By Lemma 10.3 again, we only need to prove (10.10) for $g(x)=g_1(x_1)\cdots g_p(x_p)$, where $p=n-m$. Now

$$E\{E(\phi(X_1,\cdots,X_m)\,|\,X_{m+1})\cdot g_1(X_{m+1})\cdots g_p(X_n)\} =E\{Z\cdot g_2(X_{m+2})\cdots g_p(X_n)\},$$

where $Z=E(\phi(X_1,\cdots,X_m)g_1(X_{m+1})\,|\,X_{m+1})$. Use the fact $E(E(\xi\,|\,\mathcal{F})\cdot\eta)=E(\xi\cdot E(\eta\,|\,\mathcal{F}))$ for any integrable $\xi,\eta$ and $\sigma$-algebra $\mathcal{F}$ to obtain

$$E\{Z\cdot g_2(X_{m+2})\cdots g_p(X_n)\}=E\{\phi(X_1,\cdots,X_m)g_1(X_{m+1})\cdot E(\eta\,|\,X_{m+1})\},$$

where $\eta=g_2(X_{m+2})\cdots g_p(X_n)$. By Proposition 10.1, $E(\eta\,|\,X_{m+1})=E(\eta\,|\,X_{m+1},\cdots,X_1)$. Therefore the above is identical to

$$E\big\{E\big(\phi(X_1,\cdots,X_m)g_1(X_{m+1})\cdot\eta\,\big|\,X_{m+1},\cdots,X_1\big)\big\} =E\{\phi(X_1,\cdots,X_m)\,g(X_{m+1},\cdots,X_n)\}$$

since $\phi(X_1,\cdots,X_m)g_1(X_{m+1})$ is $\sigma(X_1,\cdots,X_{m+1})$-measurable and $g(x)=g_1(x_1)\cdots g_p(x_p)$. This proves (10.10). □

The following proposition says that a Markov chain has the property that, given the present, the past and the future are conditionally independent.

PROPOSITION 10.3 Let $1\le k<l\le m$ be such that $l-k\ge 2$. Let $\phi(x)$, $x\in\mathbb{R}^k$, and $\psi(x)$, $x\in\mathbb{R}^{m-l+1}$, be two measurable functions satisfying $E(\phi(X_1,\cdots,X_k)^2)<\infty$ and $E(\psi(X_l,\cdots,X_m)^2)<\infty$. Then

$$E\{\phi(X_1,\cdots,X_k)\cdot\psi(X_l,\cdots,X_m)\,|\,X_{k+1},X_{l-1}\} =E\{\phi(X_1,\cdots,X_k)\,|\,X_{k+1},X_{l-1}\}\cdot E\{\psi(X_l,\cdots,X_m)\,|\,X_{k+1},X_{l-1}\}.$$

Proof. To save notation, write $\phi=\phi(X_1,\cdots,X_k)$ and $\psi=\psi(X_l,\cdots,X_m)$. Then

$$E(\phi\psi\,|\,X_{k+1},X_{l-1}) = E\{E(\phi\psi\,|\,X_1,\cdots,X_{l-1})\,|\,X_{k+1},X_{l-1}\} = E\{\phi\cdot E(\psi\,|\,X_{l-1})\,|\,X_{k+1},X_{l-1}\}$$

by Proposition 10.1 and the fact that $\phi$ is $\sigma(X_1,\cdots,X_{l-1})$-measurable. Since $E(\psi\,|\,X_{l-1})$ is $\sigma(X_{k+1},X_{l-1})$-measurable, the last term equals $E(\psi\,|\,X_{l-1})\cdot E\{\phi\,|\,X_{k+1},X_{l-1}\}$. The conclusion follows since $E(\psi\,|\,X_{l-1})=E(\psi\,|\,X_{k+1},X_{l-1})$, which holds by Proposition 10.1 and the tower property. □
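In the same spirit, here is a minimal simulation sketch of Proposition 10.3: conditionally on $(X_{k+1},X_{l-1})$, the empirical mean of $\phi\psi$ approximately factors into the product of the two conditional means. The chain, the functions and the indices below are arbitrary choices, with time indexed from 0.

```python
# Empirical check of conditional independence of past and future blocks
# given (X_{k+1}, X_{l-1}) for a two-state Markov chain.
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def sample_chain(T, reps):
    X = np.zeros((reps, T), dtype=int)
    X[:, 0] = (rng.random(reps) < 0.5).astype(int)   # random initial state
    for t in range(1, T):
        X[:, t] = np.where(rng.random(reps) < P[X[:, t - 1], 1], 1, 0)
    return X

k, l, m = 1, 4, 6                         # past block (X_0,...,X_k), future block (X_l,...,X_m)
X = sample_chain(m + 1, 300_000)
phi = X[:, :k + 1].sum(axis=1)            # function of the past block
psi = X[:, l:m + 1].sum(axis=1)           # function of the future block

for a in (0, 1):
    for b in (0, 1):
        mask = (X[:, k + 1] == a) & (X[:, l - 1] == b)
        lhs = (phi[mask] * psi[mask]).mean()          # E[phi*psi | X_{k+1}=a, X_{l-1}=b]
        rhs = phi[mask].mean() * psi[mask].mean()     # product of conditional means
        print((a, b), round(lhs, 3), "vs", round(rhs, 3))
```

Up to simulation error, the two printed numbers agree for every value of $(X_{k+1},X_{l-1})$, which is the factorization asserted by the proposition.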
