Class Notes of Stat 8112
1 Bayes estimators
Here are three methods of estimating parameters: (1) maximum likelihood (MLE); (2) the method of moments; (3) the Bayes method. An example of the Bayes argument: let $X \sim F(x \mid \theta)$, $\theta \in H$. We want to estimate $g(\theta) \in \mathbb{R}^1$. Suppose $t(X)$ is an estimator and look at
$$\mathrm{MSE}_\theta(t) = E_\theta\,(t(X) - g(\theta))^2.$$
The problem is that $\mathrm{MSE}_\theta(t)$ depends on $\theta$, so minimizing it at one point may cost at other points.
The Bayes idea is to average $\mathrm{MSE}_\theta(t)$ over $\theta$ and then minimize over $t$. Thus we pretend to have a distribution for $\theta$, say $\pi$, and look at
$$H(t) = E\,(t(X) - g(\theta))^2,$$
where $E$ now refers to the joint distribution of $X$ and $\theta$, that is,
$$E\,(t(X) - g(\theta))^2 = \int\!\!\int (t(x) - g(\theta))^2\, F(dx \mid \theta)\,\pi(d\theta). \tag{1.1}$$
Next pick $t(X)$ to minimize $H(t)$. The minimizer is called the Bayes estimator.
LEMMA 1.1 Suppose $Z$ and $W$ are real random variables defined on the same probability space and $\mathcal{H}$ is the set of functions from $\mathbb{R}^1$ to $\mathbb{R}^1$. Then
$$\min_{h \in \mathcal{H}} E\,(Z - h(W))^2 = E\,(Z - E(Z \mid W))^2.$$
That is, the minimizer above is $E(Z \mid W)$.
Proof. Note that
$$\begin{aligned}
E\,(Z - h(W))^2 &= E\,(Z - E(Z \mid W) + E(Z \mid W) - h(W))^2 \\
&= E\,(Z - E(Z \mid W))^2 + E\,(E(Z \mid W) - h(W))^2 \\
&\quad + 2E\{(Z - E(Z \mid W))(E(Z \mid W) - h(W))\}.
\end{aligned}$$
Conditioning on $W$, we have that the cross term is zero. Thus
$$E\,(Z - h(W))^2 = E\,(Z - E(Z \mid W))^2 + E\,(E(Z \mid W) - h(W))^2.$$
Now $E\,(Z - E(Z \mid W))^2$ is fixed. The second term $E\,(E(Z \mid W) - h(W))^2$ depends on $h$. So choose $h(W) = E(Z \mid W)$ to minimize $E\,(Z - h(W))^2$.
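The step "conditioning on $W$" in the proof of Lemma 1.1 can be spelled out as below. This is a short sketch that assumes $Z$ and $h(W)$ have finite second moments, which the notes use implicitly.

```latex
\begin{align*}
E\{(Z - E(Z\mid W))(E(Z\mid W) - h(W))\}
 &= E\big[\,E\{(Z - E(Z\mid W))(E(Z\mid W) - h(W)) \mid W\}\,\big] \\
 &= E\big[(E(Z\mid W) - h(W))\,E\{Z - E(Z\mid W) \mid W\}\big] \\
 &= E\big[(E(Z\mid W) - h(W))\cdot 0\big] = 0 .
\end{align*}
```

The second equality pulls out the factor $E(Z\mid W) - h(W)$, which is a function of $W$ alone.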
Now choose W = X, t = h and Z = g(θ). Then
COROLLARY 1.1 The Bayes estimator in (1.1) is $\widehat{g(\theta)} = E(g(\theta) \mid X)$, which is the posterior mean.
From now on, we call $E(g(\theta) \mid X)$ the posterior mean of $g(\theta)$. Similarly, we have the following multivariate analogue of the above corollary. The proof is in the same spirit as that of Lemma 1.1, so we omit it.
THEOREM 1 Let $X \in \mathbb{R}^m$ and $X \sim F(\cdot \mid \theta)$, $\theta \in H$. Let $g(\theta) = (g_1(\theta), \dots, g_k(\theta)) \in \mathbb{R}^k$. The MSE between an estimator $t(X)$ and $g(\theta)$ is
$$E\,\|t(X) - g(\theta)\|^2 = \int\!\!\int \|t(x) - g(\theta)\|^2\, F(dx \mid \theta)\,\pi(d\theta).$$
Then the Bayes estimator is $\widehat{g(\theta)} = E(g(\theta) \mid X) := (E(g_1(\theta) \mid X), \dots, E(g_k(\theta) \mid X))$.
Definition. The distribution of $\theta$ (we suppose there is one), $\pi(\theta)$, is called a prior distribution of $\theta$. The conditional distribution of $\theta$ given $X$ is called the posterior distribution.
Example. Let $X_1, \dots, X_n$ be i.i.d. $\mathrm{Ber}(p)$ and $p \sim \mathrm{Beta}(\alpha, \beta)$. We know that the density function of $p$ is
$$f(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1}(1 - p)^{\beta - 1}$$
and $E(p) = \alpha/(\alpha + \beta)$. The joint distribution of $X_1, \dots, X_n$ and $p$ is
$$\begin{aligned}
f(x_1, \dots, x_n, p) &= f(x_1, \dots, x_n \mid p)\cdot \pi(p) = p^{x}(1 - p)^{n - x}\cdot C(\alpha, \beta)\, p^{\alpha - 1}(1 - p)^{\beta - 1} \\
&= C(\alpha, \beta)\, p^{x + \alpha - 1}(1 - p)^{n + \beta - x - 1},
\end{aligned}$$
where $x = \sum_i x_i$. The marginal density of $(X_1, \dots, X_n)$ is
$$f(x_1, \dots, x_n) = C(\alpha, \beta)\int_0^1 p^{x + \alpha - 1}(1 - p)^{n + \beta - x - 1}\, dp = C(\alpha, \beta)\cdot B(x + \alpha,\, n + \beta - x).$$
Therefore,
$$f(p \mid x_1, \dots, x_n) = \frac{f(x_1, \dots, x_n, p)}{f(x_1, \dots, x_n)} = D(x, \alpha, \beta)\, p^{x + \alpha - 1}(1 - p)^{n + \beta - x - 1}.$$
This says that $p \mid x \sim \mathrm{Beta}(x + \alpha,\, n + \beta - x)$. Thus the Bayes estimator is
$$\hat{p} = E(p \mid X) = \frac{X + \alpha}{n + \alpha + \beta},$$
where $X = \sum_{i=1}^n X_i$. One can easily check that
$$E(\hat{p}) = \frac{np + \alpha}{n + \alpha + \beta}.$$
So $\hat{p}$ is biased unless $\alpha = \beta = 0$, which is not allowed in the prior distribution. The next lemma is useful.
LEMMA 1.2 The posterior distribution depends only on sufficient statistics. Let $t(X)$ be a sufficient statistic. Then $E(g(\theta) \mid X) = E(g(\theta) \mid t(X))$.
Proof. The joint density function of $X$ and $\theta$ is $p(x \mid \theta)\pi(\theta)$, where $\pi(\theta)$ is the prior density of $\theta$. Let $t(X)$ be a sufficient statistic. Then by the factorization theorem
$$p(x \mid \theta) = q(t(x) \mid \theta)\,h(x).$$
So the joint density function is $q(t(x) \mid \theta)\,h(x)\,\pi(\theta)$. It follows that the conditional density of $\theta$ given $X$ is
$$p(\theta \mid x) = \frac{q(t(x) \mid \theta)\,h(x)\,\pi(\theta)}{\int q(t(x) \mid \xi)\,h(x)\,\pi(\xi)\, d\xi} = \frac{q(t(x) \mid \theta)\,\pi(\theta)}{\int q(t(x) \mid \xi)\,\pi(\xi)\, d\xi},$$
which is a function of $t(x)$ and $\theta$.
By the above conclusion, $E(g(\theta) \mid X) = l(t(X))$ for some function $l(\cdot)$. Take the conditional expectation of both sides with respect to the $\sigma$-algebra $\sigma(t(X))$. Note that $\sigma(t(X)) \subset \sigma(X)$. By the tower property, $l(t(X)) = E(g(\theta) \mid t(X))$. This proves the second conclusion.
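The Beta-Bernoulli example and Lemma 1.2 can be checked numerically. The sketch below is only an illustration: the function names, the data, and the choice $\alpha = 2$, $\beta = 3$ are assumptions made here, not part of the notes. It compares the closed-form posterior mean $(x + \alpha)/(n + \alpha + \beta)$ with a direct midpoint-rule integration of likelihood times prior. Both depend on the data only through $x = \sum_i x_i$.

```python
def closed_form_posterior_mean(xs, alpha, beta):
    """Posterior mean (x + alpha) / (n + alpha + beta) from the example."""
    n = len(xs)
    x = sum(xs)  # sufficient statistic: number of successes
    return (x + alpha) / (n + alpha + beta)


def grid_posterior_mean(xs, alpha, beta, grid_size=20000):
    """Posterior mean by midpoint-rule integration of likelihood * prior.

    The normalizing constant C(alpha, beta) cancels in the ratio.
    """
    n = len(xs)
    x = sum(xs)
    num = 0.0
    den = 0.0
    for j in range(grid_size):
        p = (j + 0.5) / grid_size
        weight = p ** (x + alpha - 1) * (1 - p) ** (n - x + beta - 1)
        num += p * weight
        den += weight
    return num / den


data = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical Bernoulli sample
print(closed_form_posterior_mean(data, alpha=2.0, beta=3.0))
print(grid_posterior_mean(data, alpha=2.0, beta=3.0))
```

Reordering `data` leaves both values unchanged, which is the content of Lemma 1.2 in this example.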
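Example. Let $X_1, \dots, X_n$ be i.i.d. from $N(\mu, 1)$ and $\mu \sim \pi(\mu) = N(\mu_0, \tau_0)$. We know that $\bar{X}$ ($\sim N(\mu, 1/n)$ given $\mu$) is a sufficient statistic. The Bayes estimator is $\hat{\mu} = E(\mu \mid \bar{X})$. We need to calculate the joint distribution of $(\mu, \bar{X})^T$ first.
It is not difficult to see that $(\mu, \bar{X})^T$ is bivariate normal. We know that $E\mu = \mu_0$, $\mathrm{Var}(\mu) = \tau_0$, $E(\bar{X}) = E(E(\bar{X} \mid \mu)) = E(\mu) = \mu_0$. Now
$$\mathrm{Var}(\bar{X}) = E(\bar{X} - \mu + \mu - \mu_0)^2 = E\big((1/n) + (\mu - \mu_0)^2\big) = \frac{1}{n} + \tau_0;$$
$$\mathrm{Cov}(\bar{X}, \mu) = E\big(E((\bar{X} - \mu_0)(\mu - \mu_0) \mid \mu)\big) = E(\mu - \mu_0)^2 = \tau_0.$$
Thus,
$$\begin{pmatrix} \bar{X} \\ \mu \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_0 \\ \mu_0 \end{pmatrix}, \begin{pmatrix} (1/n) + \tau_0 & \tau_0 \\ \tau_0 & \tau_0 \end{pmatrix} \right).$$
Recall the following fact: if
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),$$
then
$$Y_1 \mid Y_2 \sim N\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Y_2 - \mu_2),\ \Sigma_{11\cdot 2}\big),$$
where $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. It follows that
$$\mu \mid \bar{X} \sim N\left( \mu_0 + \frac{\tau_0}{(1/n) + \tau_0}(\bar{X} - \mu_0),\ \tau_0 - \frac{\tau_0^2}{(1/n) + \tau_0} \right).$$
Thus the Bayes estimator is
$$\hat{\mu} = \mu_0 + \frac{\tau_0}{(1/n) + \tau_0}(\bar{X} - \mu_0) = \frac{\tau_0}{(1/n) + \tau_0}\,\bar{X} + \frac{\mu_0}{1 + n\tau_0}.$$
One can see that $\hat{\mu} \to \mu_0$ as $\tau_0 \to 0$ and $\hat{\mu} \to \bar{X}$ as $\tau_0 \to \infty$.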
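The shrinkage behaviour of $\hat{\mu}$ is easy to see numerically. The following sketch simply codes the two formulas derived in the example; the function names and the sample values ($\bar{x} = 2$, $n = 10$, $\mu_0 = 0$) are illustrative assumptions.

```python
def posterior_mean(xbar, n, mu0, tau0):
    """Bayes estimator mu0 + tau0 / (1/n + tau0) * (xbar - mu0)."""
    weight = tau0 / (1.0 / n + tau0)  # weight placed on the sample mean
    return mu0 + weight * (xbar - mu0)


def posterior_variance(n, tau0):
    """Posterior variance tau0 - tau0^2 / (1/n + tau0)."""
    return tau0 - tau0 ** 2 / (1.0 / n + tau0)


# A small prior variance pulls the estimate to mu0, a large one to xbar.
for tau0 in (1e-6, 1.0, 1e6):
    print(tau0, posterior_mean(xbar=2.0, n=10, mu0=0.0, tau0=tau0))
```

The posterior variance simplifies to $1/(n + 1/\tau_0)$, so it is always smaller than both the prior variance $\tau_0$ and the sampling variance $1/n$.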
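Example. Let $X \sim N_p(\theta, I_p)$ and $\pi(\theta) \sim N_p(0, \tau I_p)$, $\tau > 0$. The Bayes estimator is the posterior mean $E(\theta \mid X)$. We claim that
$$\begin{pmatrix} X \\ \theta \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} (1 + \tau)I_p & \tau I_p \\ \tau I_p & \tau I_p \end{pmatrix} \right).$$
Indeed, by the prior we know that $E\theta = 0$ and $\mathrm{Cov}(\theta) = \tau I_p$. By a conditioning argument one can verify that $E(X\theta') = \tau I_p$ and $E(XX') = (1 + \tau)I_p$. So by the conditional distribution of normal random variables,
$$\theta \mid X \sim N\left( \frac{\tau}{1 + \tau}X,\ \frac{\tau}{1 + \tau}I_p \right).$$
So the Bayes estimator is $t_0(X) = \tau X/(1 + \tau)$.
The usual MVUE is $t_1(X) = X$. It follows that
$$\mathrm{MSE}_{t_1}(\theta) = E_\theta\|X - \theta\|^2 = p,$$
which doesn't depend on $\theta$. Now
$$\begin{aligned}
\mathrm{MSE}_{t_0}(\theta) &= E_\theta\Big\|\frac{\tau}{1 + \tau}X - \theta\Big\|^2 = E_\theta\Big\|\frac{\tau}{1 + \tau}(X - \theta) - \frac{1}{1 + \tau}\theta\Big\|^2 \\
&= \Big(\frac{\tau}{1 + \tau}\Big)^2 p + \frac{1}{(1 + \tau)^2}\|\theta\|^2, \qquad (1.2)
\end{aligned}$$
which goes to $p$ as $\tau \to \infty$. What happens to the prior when $\tau \to \infty$? It "converges to" the Lebesgue measure, or the "uniform distribution over the real line", which carries no information (recall that the information of $X \sim N_p(0, (1 + \tau)I_p)$ is $p/(1 + \tau)$, and that of $\bar{\theta} = (\bar{\theta}_i)$, with $\bar{\theta}_i$, $1 \le i \le p$, i.i.d. uniform over $[-\tau, \tau]$, is $p/(2\tau^2)$. Both go to zero, and the latter does so faster than the former). This explains why the MSE of the former and the limiting MSE are identical. Now let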
$$M = \inf_t \sup_\theta E_\theta\|t(X) - \theta\|^2,$$
called the minimax risk; any estimator $t^*$ with
$$M = \sup_\theta E_\theta\|t^*(X) - \theta\|^2$$
is called a minimax estimator.
THEOREM 2 For any $p \ge 1$, $t_0(X) = X$ is minimax.
Proof. First,
$$M = \inf_t \sup_\theta E_\theta\|t(X) - \theta\|^2 \le \sup_\theta E_\theta\|X - \theta\|^2 = p.$$
Second,
$$M = \inf_t \sup_\theta E_\theta\|t(X) - \theta\|^2 \ge \inf_t \int E_\theta\|t(X) - \theta\|^2\, \pi_\tau(d\theta) \ge \frac{\tau}{1 + \tau}\,p$$
for any $\tau > 0$, where $\pi_\tau(\theta) = N(0, \tau I_p)$ and the last step is from (1.2). Then $M \ge p$ by letting $\tau \to \infty$. The above says that $M = \sup_\theta E_\theta\|X - \theta\|^2 = p$.
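The last inequality in the proof of Theorem 2 can be checked by integrating (1.2) against the prior. The sketch below uses only two facts: under $\pi_\tau$ the coordinates of $\theta$ are i.i.d. $N(0, \tau)$, so $E\|\theta\|^2 = p\tau$; and, by Theorem 1, the integrated risk is minimized by the posterior mean $\tau X/(1 + \tau)$.

```latex
\begin{align*}
\inf_t \int E_\theta\|t(X)-\theta\|^2\,\pi_\tau(d\theta)
 &= \int \Big[\Big(\frac{\tau}{1+\tau}\Big)^2 p
      + \frac{\|\theta\|^2}{(1+\tau)^2}\Big]\,\pi_\tau(d\theta) \\
 &= \frac{\tau^2 p}{(1+\tau)^2} + \frac{p\,\tau}{(1+\tau)^2}
  = \frac{\tau}{1+\tau}\,p \;\longrightarrow\; p
  \qquad (\tau \to \infty).
\end{align*}
```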
Question. Does there exist another estimator $t_1(X)$ such that
$$E_\theta\|t_1(X) - \theta\|^2 \le E_\theta\|X - \theta\|^2$$
for all $\theta$, with strict inequality for some $\theta$? If so, we say $t_0(X) = X$ is inadmissible (because some estimator $t$ is at least as good at every $\theta$ and strictly better at some $\theta$). Here is the answer:
(i) For $p = 1$, there is not. This was proved by Blyth in 1951.
(ii) For $p = 2$, there is not. This result was shown by Stein in 1961.
(iii) When $p \ge 3$, Stein (1956) showed that there is such an estimator, which is called the James-Stein estimator.
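Recall that the density function of $N(\mu, \sigma^2)$ is $\phi(x) = (\sqrt{2\pi}\,\sigma)^{-1}\exp(-(x - \mu)^2/(2\sigma^2))$.
LEMMA 1.3 (Stein’s lemma). Let $Y \sim N(\mu, \sigma^2)$ and let $g(y)$ be a function such that $g(b) - g(a) = \int_a^b g'(y)\, dy$ for any $a$ and $b$ and some function $g'(y)$. If $E|g'(Y)| < \infty$, then
$$E\,g(Y)(Y - \mu) = \sigma^2 E\,g'(Y).$$
Proof. Let $\phi(y) = (1/(\sqrt{2\pi}\,\sigma))\exp(-(y - \mu)^2/(2\sigma^2))$, the density of $N(\mu, \sigma^2)$. Then $\phi'(y) = -\phi(y)(y - \mu)/\sigma^2$. Thus, $\phi(y) = \int_y^\infty \frac{z - \mu}{\sigma^2}\phi(z)\, dz = -\int_{-\infty}^y \frac{z - \mu}{\sigma^2}\phi(z)\, dz$. We then have that
$$\begin{aligned}
E\,g'(Y) &= \int_{-\infty}^{\infty} g'(y)\phi(y)\, dy \\
&= \int_0^\infty g'(y)\int_y^\infty \frac{z - \mu}{\sigma^2}\phi(z)\, dz\, dy - \int_{-\infty}^0 g'(y)\int_{-\infty}^y \frac{z - \mu}{\sigma^2}\phi(z)\, dz\, dy \\
&= \int_0^\infty \frac{z - \mu}{\sigma^2}\phi(z)\int_0^z g'(y)\, dy\, dz - \int_{-\infty}^0 \frac{z - \mu}{\sigma^2}\phi(z)\int_z^0 g'(y)\, dy\, dz \\
&= \Big(\int_0^\infty + \int_{-\infty}^0\Big)\frac{z - \mu}{\sigma^2}\phi(z)(g(z) - g(0))\, dz \\
&= \frac{1}{\sigma^2}\int_{-\infty}^{\infty}(z - \mu)g(z)\phi(z)\, dz = \frac{1}{\sigma^2}E(Y - \mu)g(Y).
\end{aligned}$$
Fubini’s theorem is used in the third step; the fact that a centered normal random variable has mean zero is used in the fifth step.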
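Stein’s lemma can be sanity-checked by simulation. The sketch below is an illustration under assumed choices that are not in the notes: the test function $g(y) = \sin y$ (so $g'(y) = \cos y$), the parameters $\mu = 0.7$, $\sigma = 1.5$, and the function name `stein_check`. For this $g$ the right-hand side is also known exactly, since $E\cos(Y) = \cos(\mu)e^{-\sigma^2/2}$.

```python
import math
import random


def stein_check(mu, sigma, n_draws=200000, seed=0):
    """Monte Carlo estimates of E g(Y)(Y - mu) and sigma^2 E g'(Y), g = sin."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_draws):
        y = rng.gauss(mu, sigma)
        lhs += math.sin(y) * (y - mu)    # g(Y)(Y - mu)
        rhs += sigma ** 2 * math.cos(y)  # sigma^2 g'(Y)
    return lhs / n_draws, rhs / n_draws


lhs, rhs = stein_check(mu=0.7, sigma=1.5)
exact = 1.5 ** 2 * math.cos(0.7) * math.exp(-1.5 ** 2 / 2)
print(lhs, rhs, exact)
```

The two Monte Carlo averages and the exact value should agree up to simulation error of order $n^{-1/2}$.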
Remark 1. Suppose two functions $g(x)$ and $h(x)$ are given such that $g(b) - g(a) = \int_a^b h(x)\, dx$ for any $a < b$. This does not mean that $g'(x) = h(x)$ for all $x$. Actually, in this case, $g(x)$ is differentiable almost everywhere and $g'(x) = h(x)$ a.e. under Lebesgue measure. For example, let $h(x) = 1$ if $x$ is an irrational number, and $h(x) = 0$ if $x$ is rational. Then $g(x) = x = \int_0^x h(t)\, dt$ for any $x \in \mathbb{R}$. We know in this case that $g'(x) = h(x)$ a.e. The following are true from real analysis:
Fundamental Theorem of Calculus. If $f(x)$ is absolutely continuous on $[a, b]$, then $f(x)$ is differentiable almost everywhere and
$$f(x) - f(a) = \int_a^x f'(t)\, dt, \quad x \in [a, b].$$
Another fact. Let $f(x)$ be differentiable everywhere on $[a, b]$ with $f'(x)$ integrable over $[a, b]$. Then
$$f(x) - f(a) = \int_a^x f'(t)\, dt, \quad x \in [a, b].$$
Remark 2. What drives Stein’s lemma is the formula of integration by parts: suppose $g(x)$ is differentiable; then
$$\begin{aligned}
E\,g(Y)(Y - \mu) &= \int_{\mathbb{R}} g(y)(y - \mu)\phi(y)\, dy = -\sigma^2\int_{-\infty}^{\infty} g(y)\phi'(y)\, dy \\
&= \sigma^2\lim_{y \to -\infty} g(y)\phi(y) - \sigma^2\lim_{y \to +\infty} g(y)\phi(y) + \sigma^2\int_{-\infty}^{\infty} g'(y)\phi(y)\, dy.
\end{aligned}$$
It is reasonable to assume the two limits are zero. The last term is exactly $\sigma^2 E\,g'(Y)$.
THEOREM 3 Let $X \sim N_p(\theta, I_p)$ for some $p \ge 3$. Define
$$\delta_c(X) = \Big(1 - c\,\frac{p - 2}{\|X\|^2}\Big)X.$$
Then
$$E\|\delta_c(X) - \theta\|^2 = p - (p - 2)^2 E\Big[\frac{c(2 - c)}{\|X\|^2}\Big].$$
Proof. Let $g_i(x) = c(p - 2)x_i/\|x\|^2$ and $g(x) = (g_1(x), \dots, g_p(x))$ for $x = (x_1, x_2, \dots, x_p)$. Then $g(x) = c(p - 2)x/\|x\|^2$. It follows that
$$\begin{aligned}
E\|\delta_c(X) - \theta\|^2 &= E\|(X - \theta) - g(X)\|^2 = E\|X - \theta\|^2 + E\|g(X)\|^2 - 2E\langle X - \theta, g(X)\rangle \\
&= p + E\|g(X)\|^2 - 2\sum_{i=1}^p E(X_i - \theta_i)g_i(X).
\end{aligned}$$
Since $X_1, \dots, X_p$ are independent, by conditioning and Stein’s lemma,
$$E(X_i - \theta_i)g_i(X) = E\big[E\{(X_i - \theta_i)g_i(X) \mid (X_j, j \ne i)\}\big] = E\Big[\frac{\partial g_i}{\partial x_i}(X)\Big].$$
It is easy to calculate that
$$\frac{\partial g_i}{\partial x_i}(x) = c(p - 2)\,\frac{\sum_{k=1}^p x_k^2 - 2x_i^2}{(\sum_{k=1}^p x_k^2)^2} = c(p - 2)\,\frac{\|x\|^2 - 2x_i^2}{\|x\|^4}.$$
Thus,
$$\sum_{i=1}^p E(X_i - \theta_i)g_i(X) = c(p - 2)\,E\sum_{i=1}^p \frac{\|X\|^2 - 2X_i^2}{\|X\|^4} = c(p - 2)^2 E\Big[\frac{1}{\|X\|^2}\Big].$$
On the other hand, $E\|g(X)\|^2 = c^2(p - 2)^2 E(1/\|X\|^2)$. Combining all of the above, the conclusion follows.
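The risk identity of Theorem 3 can be checked by simulation. The sketch below is an illustration only: the names `delta_c` and `simulate_risk`, the dimension $p = 5$, and $\theta = (1, \dots, 1)$ are assumptions chosen here. It estimates $E\|\delta_c(X) - \theta\|^2$ directly and compares it with $p - (p - 2)^2 c(2 - c)E(1/\|X\|^2)$, where the expectation is estimated from the same draws.

```python
import random


def delta_c(x, c=1.0):
    """Shrinkage estimator (1 - c (p - 2) / ||x||^2) x from Theorem 3."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - c * (p - 2) / norm_sq
    return [shrink * v for v in x]


def simulate_risk(theta, c=1.0, n_rep=20000, seed=0):
    """Return (Monte Carlo risk of delta_c, value of the Theorem 3 formula)."""
    rng = random.Random(seed)
    p = len(theta)
    loss_sum = 0.0
    inv_norm_sum = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]  # X ~ N_p(theta, I_p)
        d = delta_c(x, c)
        loss_sum += sum((di - ti) ** 2 for di, ti in zip(d, theta))
        inv_norm_sum += 1.0 / sum(v * v for v in x)
    mc_risk = loss_sum / n_rep
    formula = p - (p - 2) ** 2 * c * (2 - c) * inv_norm_sum / n_rep
    return mc_risk, formula


mc_risk, formula = simulate_risk([1.0] * 5, c=1.0)
print(mc_risk, formula)  # both should fall below p = 5, the risk of X
```

Calling `simulate_risk` with other values of `c` in $(0, 2)$ shows the same agreement, with the smallest risk at `c = 1`.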
We have to show that $E\|X\|^{-2} < \infty$ for $p \ge 3$. Indeed,
$$E\,\frac{1}{\|X\|^2} \le 1 + E\Big[\frac{1}{\|X\|^2}\, I(\|X\| \le 1)\Big] \le 1 + \int\!\cdots\!\int_{\|x\| \le 1} \frac{1}{\|x\|^2}\, dx_1 \cdots dx_p,$$
since the density of a normal distribution is bounded by one. By the polar transformation, the last integral is equal to
$$\int_0^1\!\int_0^\pi\!\cdots\!\int_0^\pi\!\int_0^{2\pi} \frac{r^{p-1}J(\theta_1, \dots, \theta_{p-1})}{r^2}\, d\theta_{p-1}\, d\theta_{p-2} \cdots d\theta_1\, dr \le 2C\pi^{p-1}\int_0^1 r^{p-3}\, dr < \infty$$
if the integer $p \ge 3$, where $J(\theta_1, \dots, \theta_{p-1})$ is a multivariate polynomial of $\sin\theta_i$ and $\cos\theta_i$, $i = 1, 2, \dots, p - 1$, bounded by a constant $C$.
When $0 \le c \le 2$, the term $c(2 - c) \ge 0$ and attains its maximum at $c = 1$. It follows that
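COROLLARY 1.2 The estimator $\delta_c(X)$ dominates $X$ provided $0 < c < 2$.
When $c = 1$, the estimator $\delta_1(X)$ is called the James-Stein estimator.
COROLLARY 1.3 The James-Stein estimator $\delta_1$ dominates all estimators $\delta_c$ with $c \ne 1$.
2 Likelihood Ratio Test
We consider the test $H_0: \theta \in H_0$ vs $H_a: \theta \in H_a$, where $H_0$ and $H_a$ are disjoint. The probability density or probability mass function of a random observation is $f(x \mid \theta)$. The likelihood ratio test statistic is
$$\lambda(x) = \frac{\sup_{H_0} f(x \mid \theta)}{\sup_{H_0 \cup H_a} f(x \mid \theta)}.$$
When $H_0$ is true, the value of $\lambda(x)$ tends to be large. So the rejection region is $\{\lambda(x) \le c\}$.
Example. The t-test as a likelihood ratio test.
Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with both parameters unknown. Test $H_0: \mu = \mu_0$ vs $H_a: \mu \ne \mu_0$. The joint density function of the $X_i$’s is
$$f(x_1, \dots, x_n \mid \mu, \sigma) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^n (\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big). \tag{2.1}$$
As a function of $(\mu, \sigma^2)$, this has its maximum at $(\bar{X}, V)$, where
$$V = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$
The maximum is
$$\max_{\mu, \sigma} f(x_1, \dots, x_n \mid \mu, \sigma) = CV^{-n/2}\exp\Big(-\frac{1}{2V}\sum (X_i - \bar{X})^2\Big) = CV^{-n/2}e^{-n/2},$$
where $C = (2\pi)^{-n/2}$. When $\mu = \mu_0$, $f(x_1, \dots, x_n \mid \mu_0, \sigma)$ has its maximum at $\sigma^2 = V_0$, where
$$V_0 = \frac{1}{n}\sum_{i=1}^n (X_i - \mu_0)^2.$$
The corresponding maximum value is
$$\max_\sigma f(x_1, \dots, x_n \mid \mu_0, \sigma) = CV_0^{-n/2}\exp\Big(-\frac{1}{2V_0}\sum (X_i - \mu_0)^2\Big) = CV_0^{-n/2}e^{-n/2}.$$
So the likelihood ratio is
$$\Lambda = \Big(\frac{V_0}{V}\Big)^{-n/2} = \Big(\frac{V + (\bar{X} - \mu_0)^2}{V}\Big)^{-n/2}.$$
So the rejection region is
$$\{\Lambda \le a\} = \Big\{\frac{|\bar{X} - \mu_0|}{S} \ge b\Big\}$$
for some constant $b$, where $S^2 = (1/(n - 1))\sum_{i=1}^n (X_i - \bar{X})^2$. This yields exactly a Student t-test.
Example. Let $X_1, \dots, X_n$ be a random sample from
$$f(x \mid \theta) = \begin{cases} e^{-(x - \theta)}, & \text{if } x \ge \theta; \\ 0, & \text{otherwise.} \end{cases} \tag{2.2}$$
The joint density function is
$$f(x \mid \theta) = \begin{cases} e^{-\sum x_i + n\theta}, & \text{if } \theta \le x_{(1)}; \\ 0, & \text{otherwise.} \end{cases}$$
Consider the test $H_0: \theta \le \theta_0$ versus $H_a: \theta > \theta_0$. It is easy to see that
$$\max_{\theta \in \mathbb{R}} f(x \mid \theta) = e^{-\sum x_i + nx_{(1)}},$$
which is achieved at $\theta = x_{(1)}$. Now consider $\max_{\theta \le \theta_0} f(x \mid \theta)$. If $\theta_0 \ge x_{(1)}$, the two maxima are identical; if $\theta_0 < x_{(1)}$, the restricted maximum is attained at $\theta = \theta_0$. Hence
$$\max_{\theta \le \theta_0} f(x \mid \theta) = \begin{cases} e^{-\sum x_i + nx_{(1)}}, & \text{if } \theta_0 \ge x_{(1)}; \\ e^{-\sum x_i + n\theta_0}, & \text{if } \theta_0 < x_{(1)}, \end{cases}$$
and the likelihood ratio statistic is
$$\lambda(x) = \begin{cases} 1, & \text{if } x_{(1)} \le \theta_0; \\ e^{n(\theta_0 - x_{(1)})}, & \text{if } x_{(1)} > \theta_0. \end{cases}$$
The rejection region is $\{\lambda(x) \le a\}$, which is equivalent to $\{x_{(1)} \ge b\}$ for some $b$.
Again, the general form of a hypothesis test is $H_0: \theta \in H_0$ vs $H_a: \theta \in H_a$.
THEOREM 4 Suppose $X$ has pdf or pmf $f(x \mid \theta)$ and $T(X)$ is sufficient for $\theta$. Let $\lambda(x)$ and $\lambda^*(y)$ be the LRT statistics obtained from $X$ and $Y = T(X)$. Then $\lambda(x) = \lambda^*(T(x))$ for any $x$.
Proof. Suppose the pdf or pmf of $T(X)$ is $g(t \mid \theta)$. For simplicity, suppose $X$ is discrete, that is, $P_\theta(T(X) = t) = g(t \mid \theta)$. Then
$$\begin{aligned}
f(x \mid \theta) &= P_\theta(X = x, T(X) = T(x)) \\
&= P_\theta(T(X) = T(x))\,P_\theta(X = x \mid T(X) = T(x)) \\
&= g(T(x) \mid \theta)\,h(x) \qquad (2.3)
\end{aligned}$$
by the definition of sufficiency, where $h(x)$ is a function depending only on $x$ (not $\theta$). Evidently,
$$\lambda(x) = \frac{\sup_{\theta \in H_0} f(x \mid \theta)}{\sup_{\theta \in H_0 \cup H_a} f(x \mid \theta)} \quad \text{and} \quad \lambda^*(t) = \frac{\sup_{\theta \in H_0} g(t \mid \theta)}{\sup_{\theta \in H_0 \cup H_a} g(t \mid \theta)}. \tag{2.4}$$
By (2.3) and (2.4), the factor $h(x)$ cancels and
$$\lambda(x) = \frac{\sup_{\theta \in H_0} g(T(x) \mid \theta)}{\sup_{\theta \in H_0 \cup H_a} g(T(x) \mid \theta)} = \lambda^*(T(x)).$$
This theorem says that, after simplification, $\lambda(x)$ depends on $x$ only through a sufficient statistic for the parameter.
3 Evaluating tests
Consider the test $H_0: \theta \in H_0$ versus $H_a: \theta \in H_a$, where $H_0$ and $H_a$ are disjoint. When making decisions, there are two types of mistakes:
Type I error: Reject $H_0$ when it is true.
Type II error: Do not reject $H_0$ when it is false.

              Do not reject H0    Reject H0
  H0 true     correct             Type I error
  Ha true     Type II error       correct

Suppose $R$ denotes the rejection region for the test. The chances of making the two types of mistakes are $P_\theta(X \in R)$ for $\theta \in H_0$ and $P_\theta(X \in R^c)$ for $\theta \in H_a$. So
$$P_\theta(X \in R) = \begin{cases} \text{probability of a Type I error}, & \text{if } \theta \in H_0; \\ 1 - \text{probability of a Type II error}, & \text{if } \theta \in H_a. \end{cases}$$
DEFINITION 3.1 The power function of the test above with the rejection region $R$ is a function of $\theta$ defined by $\beta(\theta) = P_\theta(X \in R)$.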
DEFINITION 3.2 For $\alpha \in [0, 1]$, a test with power function $\beta(\theta)$ is a size $\alpha$ test if $\sup_{\theta \in H_0}\beta(\theta) = \alpha$.
DEFINITION 3.3 For $\alpha \in [0, 1]$, a test with power function $\beta(\theta)$ is a level $\alpha$ test if $\sup_{\theta \in H_0}\beta(\theta) \le \alpha$.
There are some differences between the above two definitions when dealing with complicated models.
Example. Let $X_1, \dots, X_n$ be from $N(\mu, \sigma^2)$ with $\mu$ unknown but $\sigma$ known. Look at the test $H_0: \mu \le \mu_0$ vs $H_a: \mu > \mu_0$. It is easy to check that, by the likelihood ratio test, the rejection region is
$$\Big\{\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > c\Big\}.$$
So the power function is
$$\beta(\mu) = P_\mu\Big(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} > c\Big) = P_\mu\Big(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\Big) = 1 - \Phi\Big(c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\Big).$$
It is easy to see that
$$\lim_{\mu \to \infty}\beta(\mu) = 1, \quad \lim_{\mu \to -\infty}\beta(\mu) = 0 \quad \text{and} \quad \beta(\mu_0) = a \text{ if } P(Z > c) = a,$$
where $Z \sim N(0, 1)$.
Example. In general, a size $\alpha$ LRT is constructed by choosing $c$ such that $\sup_{\theta \in H_0} P_\theta(\lambda(X) \le c) = \alpha$. Let $X_1, \dots, X_n$ be a random sample from $N(\theta, 1)$. Test $H_0: \theta = \theta_0$ vs $H_a: \theta \ne \theta_0$. Here one can easily see that
$$\lambda(x) = \exp\Big(-\frac{n}{2}(\bar{x} - \theta_0)^2\Big).$$
So $\{\lambda(X) \le c\} = \{|\bar{X} - \theta_0| \ge d\}$. For $\alpha$ given, $d$ can be chosen such that $P(|Z| > d\sqrt{n}) = \alpha$, where $Z \sim N(0, 1)$.
For the example in (2.2), the test is $H_0: \theta \le \theta_0$ vs $H_a: \theta > \theta_0$ and the rejection region is $\{\lambda(X) \le c\} = \{X_{(1)} > d\}$. For size $\alpha$, first choose $d$ such that
$$\alpha = P_{\theta_0}(X_{(1)} > d) = e^{-n(d - \theta_0)}.$$
Thus, $d = \theta_0 - (\log\alpha)/n$. Now
$$\sup_{\theta \in H_0} P_\theta(X_{(1)} > d) = \sup_{\theta \le \theta_0} P_\theta(X_{(1)} > d) = \sup_{\theta \le \theta_0} e^{-n(d - \theta)} = e^{-n(d - \theta_0)} = \alpha.$$
4 Most Powerful Tests
Consider a hypothesis test $H_0: \theta \in H_0$ vs $H_1: \theta \in H_a$. There are a lot of decision rules, and we plan to select the best one. Since we are not able to reduce the two types of errors at the same time (under each hypothesis, the chance of an error and the chance of a correct decision sum to one), we fix the chance of the first type of error and maximize the power among all decision rules.
DEFINITION 4.1 Let $\mathcal{C}$ be a class of tests for testing $H_0: \theta \in H_0$ vs $H_1: \theta \in H_a$.
A test in class $\mathcal{C}$, with power function $\beta(\theta)$, is a uniformly most powerful (UMP) class $\mathcal{C}$ test if $\beta(\theta) \ge \beta'(\theta)$ for every $\theta \in H_a$ and every $\beta'(\theta)$ that is a power function of a test in class $\mathcal{C}$.
In this section we only consider $\mathcal{C}$ as the set of tests such that the level $\alpha$ is fixed, i.e., $\{\beta(\theta);\ \sup_{\theta \in H_0}\beta(\theta) \le \alpha\}$.
The following gives a way to obtain UMP tests.
THEOREM 5 (Neyman-Pearson Lemma). Let $X$ be a sample with pdf or pmf $f(x \mid \theta)$. Consider testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$, using a test with rejection region $R$ that satisfies
$$x \in R \text{ if } f(x \mid \theta_1) > kf(x \mid \theta_0) \quad \text{and} \quad x \in R^c \text{ if } f(x \mid \theta_1) < kf(x \mid \theta_0) \tag{4.1}$$
for some $k \ge 0$, and
$$\alpha = P_{\theta_0}(X \in R). \tag{4.2}$$
Then
(i) (Sufficiency). Any test that satisfies (4.1) and (4.2) is a UMP level $\alpha$ test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with $k > 0$. Then every UMP level $\alpha$ test is a size $\alpha$ test (satisfies (4.2)), and every UMP level $\alpha$ test satisfies (4.1) except on a set $A$ satisfying $P_\theta(X \in A) = 0$ for $\theta = \theta_0$ and $\theta_1$.
Proof. We will prove the theorem only for the case that $f(x \mid \theta)$ is a density. Since $H_0$ consists of only one point, a test has level $\alpha$ if $\beta(\theta_0) \le \alpha$ and size $\alpha$ if $\beta(\theta_0) = \alpha$.
(i) Let $R$ satisfy (4.1) and (4.2) and $\phi(x) = I(x \in R)$ with power function $\beta(\theta)$. Now take another rejection region $R'$ with $\beta'(\theta) = P_\theta(X \in R')$ and $\beta'(\theta_0) \le \alpha$. Set $\phi'(x) = I(x \in R')$. It suffices to show that
$$\beta(\theta_1) \ge \beta'(\theta_1). \tag{4.3}$$
Actually, $(\phi(x) - \phi'(x))(f(x \mid \theta_1) - kf(x \mid \theta_0)) \ge 0$ (check it according to whether $f(x \mid \theta_1) > kf(x \mid \theta_0)$ or not). Then
$$0 \le \int (\phi(x) - \phi'(x))(f(x \mid \theta_1) - kf(x \mid \theta_0))\, dx = \beta(\theta_1) - \beta'(\theta_1) - k(\beta(\theta_0) - \beta'(\theta_0)). \tag{4.4}$$
Thus the assertion (4.3) follows from (4.2) and $k \ge 0$, since $\beta(\theta_0) - \beta'(\theta_0) = \alpha - \beta'(\theta_0) \ge 0$.
(ii) Suppose the test satisfying (4.1) and (4.2) has power function $\beta(\theta)$. By (i) the test is UMP. Take a second UMP level $\alpha$ test, with power function $\beta'(\theta)$. Then $\beta(\theta_1) = \beta'(\theta_1)$. Since $k > 0$, (4.4) implies $\beta'(\theta_0) \ge \beta(\theta_0) = \alpha$. So $\beta'(\theta_0) = \alpha$.
Now let a UMP level $\alpha$ test with power function $\beta'(\theta)$ be given. By what has just been proved, $\beta(\theta_1) = \beta'(\theta_1)$ and $\beta(\theta_0) = \beta'(\theta_0) = \alpha$. Then (4.4) implies that the integral there is zero. Being always nonnegative, $(\phi(x) - \phi'(x))(f(x \mid \theta_1) - kf(x \mid \theta_0)) = 0$ except on a set $A$ of Lebesgue measure zero, which leads to (4.1) (by considering whether $f(x \mid \theta_1) > kf(x \mid \theta_0)$ or not). Since $X$ has a density, $P_\theta(X \in A) = 0$ for $\theta = \theta_0$ and $\theta = \theta_1$.
Example. Let $X \sim \mathrm{Bin}(2, \theta)$. Consider $H_0: \theta = 1/2$ vs $H_a: \theta = 3/4$. Note that
$$f(x \mid \theta_1) = \binom{2}{x}\Big(\frac{3}{4}\Big)^x\Big(\frac{1}{4}\Big)^{2 - x} > k\binom{2}{x}\Big(\frac{1}{2}\Big)^x\Big(\frac{1}{2}\Big)^{2 - x} = kf(x \mid \theta_0)$$
is equivalent to $\{3^x > 4k\}$.
(i) The case $k \ge 9/4$ and the case $k < 1/4$ correspond to $R = \emptyset$ and $R = \Omega$, respectively. The corresponding UMP levels are $\alpha = 0$ and $\alpha = 1$.
(ii) If $1/4 \le k < 3/4$, then $R = \{1, 2\}$, and $\alpha = P(\mathrm{Bin}(2, 1/2) = 1 \text{ or } 2) = 3/4$.
(iii) If $3/4 \le k < 9/4$, then $R = \{2\}$. The level is $\alpha = P(\mathrm{Bin}(2, 1/2) = 2) = 1/4$.
This example says that if we insist on a nontrivial size $\alpha$ test, then there are only two such $\alpha$’s: $\alpha = 1/4$ and $\alpha = 3/4$.
Example. Let $X_1, \dots, X_n$ be a random sample from $N(\theta, \sigma^2)$ with $\sigma$ known. Look at the test $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$, where $\theta_0 > \theta_1$. The density function is $f(x \mid \theta, \sigma) = (1/(\sqrt{2\pi}\,\sigma))\exp(-(x - \theta)^2/(2\sigma^2))$. Now $f(x \mid \theta_1) > kf(x \mid \theta_0)$ is equivalent to $R = \{\bar{X} < c\}$. Set $\alpha = P_{\theta_0}(\bar{X} < c)$. One can check easily that $c = \theta_0 + (\sigma z_\alpha/\sqrt{n})$, where $z_\alpha$ satisfies $P(Z \le z_\alpha) = \alpha$ for $Z \sim N(0, 1)$. So the UMP rejection region is
$$R = \Big\{\bar{X} < \theta_0 + \frac{\sigma}{\sqrt{n}}z_\alpha\Big\}.$$
LEMMA 4.1 Let $X$ be a random variable. Let also $f(x)$ and $g(x)$ be non-decreasing real functions. Then
$$E(f(X)g(X)) \ge Ef(X)\cdot Eg(X)$$
provided all three expectations above are finite.
Proof. Let $\mu$ be the distribution of $X$. Note that $(f(x) - f(y))(g(x) - g(y)) \ge 0$ for any $x$ and $y$. Then
$$\begin{aligned}
0 &\le \int\!\!\int (f(x) - f(y))(g(x) - g(y))\, \mu(dx)\, \mu(dy) \\
&= 2\int f(x)g(x)\, \mu(dx) - 2\int\!\!\int f(x)g(y)\, \mu(dx)\, \mu(dy) \\
&= 2[E(f(X)g(X)) - Ef(X)\cdot Eg(X)].
\end{aligned}$$
The desired result follows.
Let $f(x \mid \theta)$, $\theta \in \mathbb{R}$, be a pmf or pdf with common support, that is, the set $\{x : f(x \mid \theta) > 0\}$ is identical for every $\theta$.
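This family of distributions is said to have monotone likelihood ratio (MLR) with respect to a statistic $T(x)$ if
$$\frac{f_{\theta_2}(x)}{f_{\theta_1}(x)} \text{ on the support is a non-decreasing function of } T(x) \text{ for any } \theta_2 > \theta_1. \tag{4.5}$$
LEMMA 4.2 Let $X$ be a random vector with $X \sim f(x \mid \theta)$, $\theta \in \mathbb{R}$, where $f(x \mid \theta)$ is a pmf or pdf with a common support. Suppose $f(x \mid \theta)$ has MLR with respect to a statistic $T(x)$. Let
$$\phi(x) = \begin{cases} 1 & \text{when } T(x) > C, \\ \gamma & \text{when } T(x) = C, \\ 0 & \text{when } T(x) < C, \end{cases} \tag{4.6}$$
where $C$ is a constant and $\gamma \in [0, 1]$. Then $E_\theta\phi(X)$ is non-decreasing in $\theta$.
Proof. Thanks to MLR, for any $\theta_1 < \theta_2$ there exists a non-decreasing function $g(t)$ such that $f(x \mid \theta_2)/f(x \mid \theta_1) = g(T(x))$. Thus
$$E_{\theta_2}\phi(X) = \int \phi(x)\,\frac{f(x \mid \theta_2)}{f(x \mid \theta_1)}\,f(x \mid \theta_1)\, dx = E_{\theta_1}\big(g(T(X))\cdot h(T(X))\big),$$
where
$$h(x) = \begin{cases} 1 & \text{when } x > C, \\ \gamma & \text{when } x = C, \\ 0 & \text{when } x < C. \end{cases}$$
Both $g$ and $h$ are non-decreasing, so by Lemma 4.1,
$$E_{\theta_2}\phi(X) = E_{\theta_1}\big(g(T(X))\cdot h(T(X))\big) \ge E_{\theta_1}g(T(X))\cdot E_{\theta_1}h(T(X)) = E_{\theta_1}\phi(X),$$
since $E_{\theta_1}g(T(X)) = 1$.
In (4.6), if $T(X) > C$, reject $H_0$; if $T(X) < C$, don’t reject $H_0$; if $T(X) = C$, reject $H_0$ with chance $\gamma$ and don’t reject $H_0$ with chance $1 - \gamma$. The last case is equivalent to flipping a coin with chance $\gamma$ for head and $1 - \gamma$ for tail. If the head occurs, reject $H_0$; otherwise, don’t reject $H_0$. This is a randomization. The purpose is to make the test UMP. The randomization is needed only when $T(X)$ is discrete.
THEOREM 6 Consider testing $H_0: \theta \le \theta_0$ vs $H_1: \theta > \theta_0$. Let $X \sim f(x \mid \theta)$, $\theta \in \mathbb{R}$, where $f(x \mid \theta)$ is a pmf or pdf with a common support. Also let $T$ be a sufficient statistic for $\theta$. If $f(x \mid \theta)$ has MLR w.r.t. $T(x)$, then $\phi(X)$ in (4.6) is a UMP level $\alpha$ test, where $\alpha = E_{\theta_0}\phi(X)$.
Proof. Let $\beta(\theta) = E_\theta\phi(X)$. Then by the previous lemma, $\sup_{\theta \le \theta_0}\beta(\theta) = E_{\theta_0}\phi(X) = \alpha$. We need to show $\beta(\theta_1) \ge \beta'(\theta_1)$ for any $\theta_1 > \theta_0$ and any $\beta'$ such that $\sup_{\theta \le \theta_0}\beta'(\theta) \le \alpha$. Consider the test $H_0: \theta = \theta_0$ vs $H_a: \theta = \theta_1$. We know that
$$\alpha = P_{\theta_0}(X \in R) = E_{\theta_0}\phi(X) \quad \text{and} \quad \beta'(\theta_0) \le \alpha. \tag{4.7}$$
Again, we use the same notation as in Lemma 4.2: $f(x \mid \theta_1)/f(x \mid \theta_0) = g(T(x))$, where $g(t)$ is a non-decreasing function of $t$. Take $k = g(C)$ in the Neyman-Pearson Lemma (Theorem 5).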
If $f(x \mid \theta_1)/f(x \mid \theta_0) > k$, then, since $g(t)$ is non-decreasing, $T(X) > C$, which is the same as saying to reject $H_0$ by the functional form of $\phi(x)$ in (4.6). If $f(x \mid \theta_1)/f(x \mid \theta_0) < k$, then $T(X) < C$, that is, do not reject $H_0$. By the Neyman-Pearson lemma and (4.7), $\phi(x)$ corresponds to a UMP level-$\alpha$ test. Thus, $\beta(\theta_1) \ge \beta'(\theta_1)$.
5 Probability convergence
Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\{X, X_n;\ n \ge 1\}$ be a sequence of random variables defined on this probability space. We say $X_n \to X$ in probability, or $X_n$ is consistent with $X$, if
$$P(|X_n - X| \ge \epsilon) \to 0 \quad \text{as } n \to \infty$$
for any $\epsilon > 0$.
Example. Let $\{X_n;\ n \ge 1\}$ be a sequence of independent random variables with mean zero and variance one. Set $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. Then $S_n/n \to 0$ in probability as $n \to \infty$. In fact,
$$P\Big(\Big|\frac{S_n}{n}\Big| \ge \epsilon\Big) \le \frac{E(S_n^2)}{n^2\epsilon^2} = \frac{1}{n\epsilon^2} \to 0$$
by Chebyshev’s inequality.
Recall that $X$ and $X_n$, $n \ge 1$, are real measurable functions defined on $(\Omega, \mathcal{F})$. We say $X_n$ converges to $X$ almost surely if
$$P\big(\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\big) = 1.$$
For a sequence of events $\{E_n;\ n \ge 1\}$ in $\mathcal{F}$, define
$$\{E_n;\ \text{i.o.}\} = \bigcap_{n=1}^{\infty}\bigcup_{k \ge n} E_k.$$
Here “i.o.” means “infinitely often”. If $\omega \in \{E_n;\ \text{i.o.}\}$, then there are infinitely many $n$ such that $\omega \in E_n$. Conversely, if $\omega \notin \{E_n;\ \text{i.o.}\}$, or $\omega \in \{E_n;\ \text{i.o.}\}^c$, then only finitely many $E_n$ contain $\omega$.
In proving almost sure convergence, the following so-called Borel-Cantelli lemma is very powerful.
LEMMA 5.1 (Borel-Cantelli Lemma). Let $\{E_n;\ n \ge 1\}$ be a sequence of events in $\mathcal{F}$. If
$$\sum_{n=1}^{\infty} P(E_n) < \infty \tag{5.1}$$
then $P(E_n \text{ occurs only finitely many times}) = P(\{E_n, \text{i.o.}\}^c) = 1$.
Proof. Recall $P(E_n) = E(I_{E_n})$. Let $X(\omega) = \sum_{n=1}^{\infty} I_{E_n}(\omega)$. Then
$$EX = E\sum_{n=1}^{\infty} I_{E_n} = \sum_{n=1}^{\infty} P(E_n) < \infty.$$
Thus, $X < \infty$ a.s.
LEMMA 5.2 (Second Borel-Cantelli Lemma). Let $\{E_n;\ n \ge 1\}$ be a sequence of independent events in $\mathcal{F}$. If
$$\sum_{n=1}^{\infty} P(E_n) = \infty \tag{5.2}$$
then $P(E_n, \text{i.o.}) = 1$.
Proof. First,
$$\{E_n, \text{i.o.}\}^c = \bigcup_{n \ge 1}\bigcap_{k \ge n} E_k^c.$$
With $p_k$ denoting $P(E_k)$, we have by independence that
$$P\Big(\bigcap_{k \ge n} E_k^c\Big) = \prod_{k \ge n}(1 - p_k) \le \exp\Big(-\sum_{k \ge n} p_k\Big) = 0$$
for every $n \ge 1$, where we used the inequality $1 - x \le e^{-x}$ for any $x \in \mathbb{R}$. Thus,
$$P(\{E_n, \text{i.o.}\}^c) \le \sum_{n \ge 1} P\Big(\bigcap_{k \ge n} E_k^c\Big) = 0.$$
As an application, we have the following.
Example. Let $\{X_n;\ n \ge 1\}$ be a sequence of i.i.d. random variables with mean $\mu$ and finite fourth moment. Then
$$\frac{S_n}{n} \to \mu \quad \text{a.s.}$$
as $n \to \infty$.
Proof. W.L.O.G., assume $\mu = 0$. Let $\epsilon > 0$. Then by Chebyshev’s inequality
$$P\Big(\Big|\frac{S_n}{n}\Big| \ge \epsilon\Big) \le \frac{E(S_n^4)}{n^4\epsilon^4}.$$
Now, by independence and $EX_1 = 0$, every term in the expansion of $S_n^4$ that contains some $X_i$ to the first power has mean zero, so
$$E(S_n^4) = E\Big(\sum_{k=1}^n X_k^4 + \binom{4}{2}\sum_{1 \le i < j \le n} X_i^2X_j^2\Big) = nE(X_1^4) + 3n(n - 1)\big(E(X_1^2)\big)^2.$$
Thus, $P(|S_n/n| \ge \epsilon) = O(n^{-2})$. Consequently, $\sum_{n \ge 1} P(|S_n/n| \ge \epsilon) < \infty$ for any $\epsilon > 0$. By the Borel-Cantelli lemma this means that, with probability one, $|S_n/n| \le \epsilon$ for sufficiently large $n$. Therefore,
$$\limsup_n \frac{|S_n|}{n} \le \epsilon \quad \text{a.s.}$$
for any $\epsilon > 0$. Let $\epsilon \downarrow 0$. The desired conclusion follows.
By using a truncation method, we have the following Kolmogorov’s Strong Law of Large Numbers.
THEOREM 7 Let $\{X_n;\ n \ge 1\}$ be a sequence of i.i.d. random variables with mean $\mu$. Then
$$\frac{S_n}{n} \to \mu \quad \text{a.s.}$$
as $n \to \infty$.
There are many variations of the above strong law of large numbers. Among them, Etemadi showed that Theorem 7 holds if $\{X_n;\ n \ge 1\}$ is a sequence of pairwise independent and identically distributed random variables with mean $\mu$.
The next proposition describes the relationship between convergence in probability and almost sure convergence.
PROPOSITION 5.1 Let $X_n$ and $X$ be random variables taking values in $\mathbb{R}^k$. First, if $X_n \to X$ a.s. as $n \to \infty$, then $X_n$ converges to $X$ in probability. Second, $X_n$ converges to $X$ in probability iff for any subsequence of $\{X_n\}$ there exists a further subsequence, say, $\{X_{n_k};\ k \ge 1\}$, such that $X_{n_k} \to X$ a.s. as $k \to \infty$.
Proof. Joint almost sure convergence and joint convergence in probability are equivalent to the corresponding coordinatewise convergences. So W.L.O.G., assume $X_n$ and $X$ are one-dimensional.
Given $\epsilon > 0$.
The almost sure convergence implies that
$$P(|X_n - X| \ge \epsilon \text{ occurs only for finitely many } n) = 1.$$
This says that $P(|X_n - X| \ge \epsilon, \text{i.o.}) = 0$. But
$$\lim_n P(|X_n - X| \ge \epsilon) \le \lim_n P\Big(\bigcup_{k \ge n}\{|X_k - X| \ge \epsilon\}\Big) = P(|X_n - X| \ge \epsilon, \text{i.o.}) = 0,$$
where we use the fact that
$$\bigcup_{k \ge n}\{|X_k - X| \ge \epsilon\} \downarrow \{|X_n - X| \ge \epsilon, \text{i.o.}\}$$
as $n \to \infty$.
Now suppose $X_n \to X$ in probability. W.L.O.G., we assume the subsequence is $\{X_n\}$ itself. Then for any $\epsilon > 0$, there exists $n_\epsilon > 0$ such that $\sup_{n \ge n_\epsilon} P(|X_n - X| \ge \epsilon) < \epsilon$. We then have a subsequence $n_k$ such that $P(|X_{n_k} - X| \ge 2^{-k}) < 2^{-k}$ for each $k \ge 1$. By the Borel-Cantelli lemma
$$P\big(|X_{n_k} - X| < 2^{-k} \text{ for sufficiently large } k\big) = 1,$$
which implies that $X_{n_k} \to X$ a.s. as $k \to \infty$.
Now suppose $X_n$ doesn’t converge to $X$ in probability. That is, there exist $\epsilon_0 > 0$ and a subsequence $\{X_{k_l};\ l \ge 1\}$ such that
$$P(|X_{k_l} - X| \ge \epsilon_0) \ge \epsilon_0$$
for all $l \ge 1$. Therefore no subsequence of $\{X_{k_l}\}$ can converge to $X$ in probability, hence none can converge to $X$ almost surely.
Example. Let $X_i$, $i \ge 1$, be a sequence of i.i.d. random variables with mean zero and variance one. Let $S_n = X_1 + \cdots + X_n$, $n \ge 1$. Then $S_n/\sqrt{2n\log\log n} \to 0$ in probability but it does not converge to zero almost surely. Actually $\limsup_{n \to \infty} S_n/\sqrt{2n\log\log n} = 1$ a.s.
PROPOSITION 5.2 Let $\{X_k;\ k \ge 1\}$ be a sequence of i.i.d. random variables with $X_1 \sim N(0, 1)$. Let $W_n = \max_{1 \le i \le n} X_i$. Then
$$\lim_{n \to \infty}\frac{W_n}{\sqrt{\log n}} = \sqrt{2} \quad \text{a.s.}$$
Proof. Recall
$$\frac{x}{\sqrt{2\pi}\,(1 + x^2)}\,e^{-x^2/2} \le P(X_1 \ge x) \le \frac{1}{\sqrt{2\pi}\,x}\,e^{-x^2/2} \tag{5.3}$$
for any $x > 0$. Given $\epsilon > 0$, let $b_n = (\sqrt{2} + \epsilon)\sqrt{\log n}$. Then
$$P(W_n \ge b_n) \le nP(X_1 \ge b_n) \le \frac{n}{b_n}\,e^{-b_n^2/2} \le \frac{1}{n^{\epsilon^2/2}}$$
for $n \ge 3$. Let $n_k = [k^{4/\epsilon^2}] + 1$ for $k \ge 1$. Then,
$$P(W_{n_k} \ge b_{n_k}) \le \frac{1}{k^2}$$
for large $k$. By the Borel-Cantelli lemma, $P(W_{n_k} \ge b_{n_k}, \text{i.o.}) = 0$. This implies that
$$\limsup_k \frac{W_{n_k}}{\sqrt{\log n_k}} \le \sqrt{2} + \epsilon \quad \text{a.s.}$$
For any $n \ge n_1$ there exists $k$ such that $n_k \le n < n_{k+1}$. Since $W_n$ is non-decreasing in $n$ and eventually positive, $W_n \le W_{n_{k+1}}$ and $\log n \ge \log n_k$, while $\log n_{k+1}/\log n_k \to 1$ as $k \to \infty$. Hence $\limsup_n W_n/\sqrt{\log n} \le \sqrt{2} + \epsilon$ a.s. Letting $\epsilon \downarrow 0$, we obtain
$$\limsup_n \frac{W_n}{\sqrt{\log n}} \le \sqrt{2} \quad \text{a.s.} \tag{5.4}$$
This proves the upper bound. Now we turn to prove the lower bound. Choose $\epsilon \in (0, \sqrt{2})$. Set $c_n = (\sqrt{2} - \epsilon)\sqrt{\log n}$. Then
$$P(W_n \le c_n) = P(X_1 \le c_n)^n = (1 - P(X_1 > c_n))^n \le e^{-nP(X_1 > c_n)}.$$
By (5.3),
$$nP(X_1 > c_n) \ge \frac{nc_n}{\sqrt{2\pi}\,(1 + c_n^2)}\,e^{-c_n^2/2} \sim \frac{C\,n^{1 - (\sqrt{2} - \epsilon)^2/2}}{\sqrt{2\pi\log n}}$$
as $n$ is sufficiently large, where $C$ is a constant. Since $1 - (\sqrt{2} - \epsilon)^2/2 > 0$,
$$\sum_{n \ge 2} P(W_n \le c_n) < \infty.$$
By the Borel-Cantelli lemma again, $P(W_n \le c_n, \text{i.o.}) = 0$. This implies that
$$\liminf_n \frac{W_n}{\sqrt{\log n}} \ge \sqrt{2} - \epsilon \quad \text{a.s.}$$
Let $\epsilon \downarrow 0$. We obtain
$$\liminf_n \frac{W_n}{\sqrt{\log n}} \ge \sqrt{2} \quad \text{a.s.}$$
This together with (5.4) implies the desired result.
The above results also hold for random variables taking abstract values. Now let’s introduce such random variables.
A metric space is a set $D$ equipped with a metric. A metric is a map $d : D \times D \to [0, \infty)$ with the properties
(i) $d(x, y) = 0$ if and only if $x = y$;
(ii) $d(x, y) = d(y, x)$;
(iii) $d(x, z) \le d(x, y) + d(y, z)$
for any $x$, $y$ and $z$. A semi-metric satisfies (ii) and (iii), but not necessarily (i). An open ball is a set of the form $\{y : d(x, y) < r\}$. A subset of a metric space is open if and only if it is the union of open balls; it is closed if and only if the complement is open. A sequence $x_n$ converges to $x$ if and only if $d(x_n, x) \to 0$, and this is denoted by $x_n \to x$. The closure $\bar{A}$ of a set $A$ is the set of all points that are the limit of a sequence in $A$; it is the smallest closed set containing $A$. The interior $\mathring{A}$ is the collection of all points $x$ such that $x \in G \subset A$ for some open set $G$; it is the largest open set contained in $A$. A function $f : D \to E$ between metric spaces is continuous at a point $x$ if and only if $f(x_n) \to f(x)$ for every sequence $x_n \to x$; it is continuous at every $x$ if and only if $f^{-1}(G)$ is open for every open set $G \subset E$. A subset of a metric space is dense if and only if its closure is the whole space. A metric space is separable if and only if it has a countable dense subset. A subset $K$ of a metric space is compact if and only if it is closed and every sequence in $K$ has a convergent subsequence.
A set $K$ is totally bounded if and only if for every $\epsilon > 0$ it can be covered by finitely many balls of radius $\epsilon$. The space is complete if and only if any Cauchy sequence has a limit. A subset of a complete metric space is compact if and only if it is totally bounded and closed.
A normed space $D$ is a vector space equipped with a norm. A norm is a map $\|\cdot\| : D \to [0, \infty)$ such that for every $x, y$ in $D$, and $\alpha \in \mathbb{R}$,
(i) $\|x + y\| \le \|x\| + \|y\|$;
(ii) $\|\alpha x\| = |\alpha|\,\|x\|$;
(iii) $\|x\| = 0$ if and only if $x = 0$.
Definition. The Borel $\sigma$-field on a metric space $D$ is the smallest $\sigma$-field that contains the open sets. A function taking values in a metric space is called Borel-measurable if it is measurable relative to the Borel $\sigma$-field. A Borel-measurable map $X : \Omega \to D$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is referred to as a random element with values in $D$.
The following lemma is obvious.
LEMMA 5.3 A continuous map between two metric spaces is Borel-measurable.
Example. Let $C[a, b] = \{f(x);\ f \text{ is a continuous function on } [a, b]\}$. Define
$$\|f\| = \sup_{a \le x \le b}|f(x)|.$$
This is a complete normed linear space, which is also called a Banach space. This space is separable. Any separable Banach space is isomorphic to a subset of $C[a, b]$.
Remark. Proposition 5.1 still holds when the random variables take values in a Polish space.
Let $\{X, X_n;\ n \ge 1\}$ be a sequence of random variables. We say $X_n$ converges to $X$ weakly or in distribution if
$$P(X_n \le x) \to P(X \le x)$$
as $n \to \infty$ for every continuity point $x$, that is, every $x$ with $P(X \le x) = P(X < x)$.
LEMMA 5.4 Let $X_n$ and $X$ be random variables. The following are equivalent:
(i) $X_n$ converges weakly to $X$;
(ii) $Ef(X_n) \to Ef(X)$ for any bounded continuous function $f(x)$ defined on $\mathbb{R}$;
(iii) $\liminf_n P(X_n \in G) \ge P(X \in G)$ for any open set $G$;
(iv) $\limsup_n P(X_n \in F) \le P(X \in F)$ for any closed set $F$.
Remark. The above lemma is actually true for any finite dimensional space.
Further, it is true for random variables taking values in separable, complete metric spaces, which are also called Polish spaces.

Proof. (i) $\Rightarrow$ (ii). Given $\epsilon > 0$, choose $K > 0$ such that $K$ and $-K$ are continuity points of all $X_n$'s and $X$, and $P(-K \le X \le K) \ge 1 - \epsilon$. By (i),
\[
\lim_n P(|X_n| \le K) = \lim_n P(X_n \le K) - \lim_n P(X_n \le -K) = P(X \le K) - P(X \le -K) \ge 1 - \epsilon.
\]
In other words,
\[
\lim_n P(|X_n| > K) \le \epsilon. \tag{5.5}
\]
Since $f(x)$ is continuous on $[-K, K]$, we can cut the interval into, say, $m$ subintervals $[a_i, a_{i+1}]$, $1 \le i \le m$, such that each $a_i$ is a continuity point of $X$ and $\sup_{a_i \le x \le a_{i+1}} |f(x) - f(a_i)| \le \epsilon$. Set
\[
g(x) = \sum_{i=1}^m f(a_i) I(a_i \le x < a_{i+1}).
\]
It is easy to see that $|f(x) - g(x)| \le \epsilon$ for $|x| \le K$ and $|f(x) - g(x)| \le \|f\|$ for $|x| > K$, where $\|f\| = \sup_{x \in R} |f(x)|$. It follows that
\[
|Ef(Y) - Eg(Y)| \le \epsilon + \|f\| \cdot P(|Y| > K)
\]
for any random variable $Y$. Applying this to $X_n$ and $X$, we obtain from (5.5) that
\[
\limsup_n |Ef(X_n) - Eg(X_n)| \le (1 + \|f\|)\epsilon \quad \text{and} \quad |Ef(X) - Eg(X)| \le (1 + \|f\|)\epsilon.
\]
Easily $Eg(X_n) \to Eg(X)$. Therefore, by the triangle inequality,
\[
\limsup_n |Ef(X_n) - Ef(X)| \le 2(1 + \|f\|)\epsilon.
\]
Then (ii) follows by letting $\epsilon \downarrow 0$.

(ii) $\Rightarrow$ (iii). Define $f_m(x) = (m\, d(x, G^c)) \wedge 1$ for $m \ge 1$. Since $G$ is open, $f_m(x)$ is a continuous function bounded by one and $f_m(x) \uparrow 1_G(x)$ for each $x$ as $m \to \infty$. It follows that
\[
\liminf_n P(X_n \in G) \ge \liminf_n Ef_m(X_n) = Ef_m(X)
\]
for each $m$. (iii) follows by letting $m \to \infty$.

(iii) and (iv) are equivalent because $F^c$ is open.

(iv) $\Rightarrow$ (i). Let $x$ be a continuity point. Then by (iii) and (iv),
\[
\limsup_n P(X_n \le x) \le P(X \le x) = P(X < x) \le \liminf_n P(X_n < x) \le \liminf_n P(X_n \le x),
\]
so $P(X_n \le x) \to P(X \le x)$.

Example. Let $X_n$ be uniformly distributed over $\{1/n, 2/n, \cdots, n/n\}$. Prove that $X_n$ converges weakly to $U[0,1]$. Actually, for any bounded continuous function $f(x)$,
\[
Ef(X_n) = \frac{1}{n} \sum_{i=1}^n f\left(\frac{i}{n}\right) \to \int_0^1 f(t)\, dt = Ef(Y)
\]
where $Y \sim U[0,1]$.

Example. Let $X_i$, $i \ge 1$, be i.i.d. random variables. Define
\[
\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \quad \text{through} \quad \mu_n(A) = \frac{1}{n} \sum_{i=1}^n I(X_i \in A) = \frac{\#\{i : X_i \in A\}}{n}
\]
for any set $A$.
Then, for any bounded measurable function $f(x)$, by the strong law of large numbers,
\[
\int f(x)\, \mu_n(dx) = \frac{1}{n} \sum_{i=1}^n f(X_i) \to Ef(X_1) = \int f(x)\, \mu(dx)
\]
almost surely, where $\mu = \mathcal{L}(X_1)$. This concludes that $\mu_n \to \mu$ weakly as $n \to \infty$.

COROLLARY 5.1 Suppose $X_n$ converges to $X$ weakly and $g(x)$ is a continuous map defined on $R^1$ (not necessarily one-dimensional). Then $g(X_n)$ converges to $g(X)$ weakly.

The following two results are called Delta methods.

THEOREM 8 Let $Y_n$ be a sequence of random variables that satisfies $\sqrt{n}(Y_n - \theta) \Rightarrow N(0, \sigma^2)$. Let $g$ be a real-valued function, differentiable at $\theta$, such that $g'(\theta) \ne 0$. Then
\[
\sqrt{n}(g(Y_n) - g(\theta)) \Rightarrow N(0, \sigma^2 g'(\theta)^2).
\]

Proof. By Taylor's expansion, $g(x) = g(\theta) + g'(\theta)(x - \theta) + a(x - \theta)$ for some real function $a(x)$ such that $a(t)/t \to 0$ as $t \to 0$. Then
\[
\sqrt{n}(g(Y_n) - g(\theta)) = g'(\theta)\sqrt{n}(Y_n - \theta) + \sqrt{n} \cdot a(Y_n - \theta).
\]
We only need to show that $\sqrt{n} \cdot a(Y_n - \theta) \to 0$ in probability as $n \to \infty$ (later we will see this easily by Slutsky's lemma. But why now?) Indeed, for any $m \ge 1$ there exists a $\delta > 0$ such that $|a(x)| \le |x|/m$ for all $|x| \le \delta$. Thus, for any $b > 0$,
\[
P(|\sqrt{n} \cdot a(Y_n - \theta)| \ge b) \le P(|Y_n - \theta| > \delta) + P(|\sqrt{n}(Y_n - \theta)| \ge mb) \to P(|N(0, \sigma^2)| \ge mb)
\]
as $n \to \infty$. Now let $m \uparrow \infty$. We obtain that $\sqrt{n} \cdot a(Y_n - \theta) \to 0$ in probability.

Similarly, one can prove the following.

THEOREM 9 Let $Y_n$ be a sequence of random variables that satisfies $\sqrt{n}(Y_n - \theta) \Rightarrow N(0, \sigma^2)$. Let $g$ be a real-valued function such that $g''(x)$ exists around $x = \theta$ with $g'(\theta) = 0$ and $g''(\theta) \ne 0$. Then
\[
n(g(Y_n) - g(\theta)) \Rightarrow \frac{1}{2} \sigma^2 g''(\theta) \chi^2_1.
\]

Proof. By Taylor's expansion again,
\[
n(g(Y_n) - g(\theta)) = \frac{1}{2} \sigma^2 g''(\theta) \cdot \left[ \frac{\sqrt{n}(Y_n - \theta)}{\sigma} \right]^2 + o(n |Y_n - \theta|^2).
\]
This leads to the desired conclusion by the same arguments as in Theorem 8.

PROPOSITION 5.3 Let $\{X, X_n;\ n \ge 1\}$ be a sequence of random variables on $R^k$. Look at the following three statements:
(i) $X_n$ converges to $X$ almost surely;
(ii) $X_n$ converges to $X$ in probability;
(iii) $X_n$ converges to $X$ weakly.
Then (i) $\Rightarrow$ (ii) $\Rightarrow$ (iii).

Proof. The implication "(i) $\Rightarrow$ (ii)" is proved earlier.
Now we prove that (ii) $\Rightarrow$ (iii). For any closed set $F$ let
\[
F^\epsilon = \{x : \inf_{y \in F} d(x, y) \le \epsilon\}.
\]
Then $F^\epsilon$ is a closed set. When $X$ takes values in $R$, the metric is $d(x,y) = |x - y|$. It is easy to see that
\[
\{X_n \in F\} \subset \{X \in F^{1/k}\} \cup \{|X_n - X| \ge 1/k\}
\]
for any integer $k \ge 1$. Therefore
\[
P(X_n \in F) \le P(X \in F^{1/k}) + P(|X_n - X| \ge 1/k).
\]
Letting $n \to \infty$, we have that
\[
\limsup_n P(X_n \in F) \le P(X \in F^{1/k}).
\]
Note that $F^{1/k} \downarrow F$ as $k \to \infty$. (iii) follows by letting $k \to \infty$ and using Lemma 5.4.

Any random vector $X$ taking values in $R^k$ is tight: for every $\epsilon > 0$ there exists a number $M > 0$ such that $P(\|X\| > M) < \epsilon$. A set of random vectors $\{X_\alpha;\ \alpha \in A\}$ is called uniformly tight if for every $\epsilon > 0$ there exists a constant $M > 0$ such that
\[
\sup_{\alpha \in A} P(\|X_\alpha\| > M) < \epsilon.
\]
A random vector $X$ taking values in a Polish space $(D, d)$ is also tight: for every $\epsilon > 0$ there exists a compact set $K \subset D$ such that $P(X \notin K) < \epsilon$. This will be proved later.

Remark. It is easy to see that a finite number of random variables is uniformly tight if every individual is tight. Also, if $\{X_\alpha;\ \alpha \in A\}$ and $\{X_\alpha;\ \alpha \in B\}$ are uniformly tight, respectively, then $\{X_\alpha;\ \alpha \in A \cup B\}$ is also uniformly tight. Further, if $X_n \Rightarrow X$, then $\{X, X_n;\ n \ge 1\}$ is uniformly tight. Actually, for any $\epsilon > 0$, choose $M > 0$ such that $P(\|X\| < M) \ge 1 - \epsilon$. Since the set $\{x \in R^k : \|x\| < M\}$ is an open set, $\liminf_{n\to\infty} P(\|X_n\| < M) \ge P(\|X\| < M) \ge 1 - \epsilon$, so $P(\|X_n\| \ge M) \le 2\epsilon$ for all sufficiently large $n$. The remaining finitely many $X_n$ are uniformly tight, and the tightness of $\{X, X_n\}$ follows.

LEMMA 5.5 Let $\mu$ be a probability measure on $(D, \mathcal{B})$, where $(D, d)$ is a Polish space and $\mathcal{B}$ is the Borel $\sigma$-algebra. Then $\mu$ is tight.

Proof. Since $(D, d)$ is Polish, it is separable, so there exists a countable set $\{x_1, x_2, \cdots\} \subset D$ that is dense in $D$. Let $\epsilon > 0$ be given. For any integer $k \ge 1$, since $\cup_{n \ge 1} B(x_n, 1/k) = D$, there exists $n_k$ such that $\mu(\cup_{n=1}^{n_k} B(x_n, 1/k)) > 1 - \epsilon/2^k$. Set $D_k = \cup_{n=1}^{n_k} B(x_n, 1/k)$ and $M = \cap_{k \ge 1} D_k$. Then $M$ is totally bounded ($M$ is always covered by a finite number of balls with any prescribed radius). Also
\[
\mu(M^c) \le \sum_{k=1}^\infty \mu(D_k^c) < \sum_{k=1}^\infty \frac{\epsilon}{2^k} = \epsilon.
\]
Let $K$ be the closure of $M$. Since $D$ is complete, $M \subset K \subset D$ and $K$ is compact. Also by the above fact, $\mu(K^c) < \epsilon$.

LEMMA 5.6 Let $\{X_n;\ n \ge 1\}$ be a sequence of random variables. Let $F_n$ be the cdf of $X_n$. Then there exists a subsequence $F_{n_j}$ with the property that $F_{n_j}(x) \to F(x)$ at each continuity point $x$ of a possibly defective distribution function $F$. Further, if $\{X_n\}$ is tight, then $F(x)$ is a cumulative distribution function.

Proof. Put all rational numbers $q_1, q_2, \cdots$ in a sequence. Because the sequence $F_n(q_1)$ is contained in the interval $[0, 1]$, it has a converging subsequence. Call the indexing subsequence $\{n_j^1\}_{j=1}^\infty$ and the limit $G(q_1)$. Next, extract a further subsequence $\{n_j^2\} \subset \{n_j^1\}$ along which $F_n(q_2)$ converges to a limit $G(q_2)$, a further subsequence $\{n_j^3\} \subset \{n_j^2\}$ along which $F_n(q_3)$ converges to a limit $G(q_3)$, $\cdots$, and so forth. The tail of the diagonal sequence $n_j := n_j^j$ belongs to every sequence $\{n_j^i\}$. Hence $F_{n_j}(q_i) \to G(q_i)$ for every $i = 1, 2, \cdots$. Since each $F_n$ is nondecreasing, $G(q) \le G(q')$ if $q \le q'$. Define
\[
F(x) = \inf_{q > x} G(q).
\]
It is not difficult to verify that $F(x)$ is nondecreasing and right-continuous at every point $x$. It is also easy to show that $F_{n_j}(x) \to F(x)$ at each continuity point $x$ of $F$ by monotonicity of $F(x)$.

Now assume $\{X_n\}$ is tight. For given $\epsilon \in (0, 1)$, there exists a rational number $M > 0$ such that $P(-M < X_n \le M) \ge 1 - \epsilon$ for all $n$. Equivalently, $F_n(M) - F_n(-M) \ge 1 - \epsilon$ for all $n$. Passing along the subsequence $n_j$ to $\infty$, we have that $G(M) - G(-M) \ge 1 - \epsilon$. By definition, $F(M) - F(-M) \ge 1 - \epsilon$. Therefore $F(x)$ is a cdf.

THEOREM 10 (Almost-sure representations). If $X_n \Rightarrow X_0$ then there exist random variables $Y_n$ on $([0,1], \mathcal{B}[0,1])$ such that $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$ for each $n \ge 0$ and $Y_n \to Y_0$ almost surely.

Proof. Let $F_n(x)$ be the cdf of $X_n$. Let $U$ be a random variable uniformly distributed on $([0,1], \mathcal{B}[0,1])$.
Define $Y_n = F_n^{-1}(U)$, where the inverse here is the generalized inverse of a non-decreasing function, defined by
\[
F_n^{-1}(t) = \inf\{x : F_n(x) \ge t\}
\]
for $0 \le t \le 1$. One can verify that $F_n^{-1}(t) \to F_0^{-1}(t)$ for every $t$ such that $F_0^{-1}$ is continuous at $t$. We know that the set of discontinuity points of a monotone function is countable. Therefore its Lebesgue measure is zero. This shows that $Y_n = F_n^{-1}(U) \to F_0^{-1}(U) = Y_0$ almost surely.

It is easy to check that $\mathcal{L}(X_n) = \mathcal{L}(Y_n)$ for all $n = 0, 1, 2, \cdots$.

LEMMA 5.7 (Slutsky). Let $\{a_n\}$, $\{b_n\}$ and $\{X_n\}$ be sequences of random variables such that $a_n \to a$ in probability, $b_n \to b$ in probability and $X_n \Rightarrow X$ for constants $a$ and $b$ and some random variable $X$. Then $a_n X_n + b_n \Rightarrow aX + b$ as $n \to \infty$.

Proof. We will show that $(a_n X_n + b_n) - (a X_n + b) \to 0$ in probability. If this is the case, then the desired conclusion follows from the weak convergence of $X_n$ (why?). Let $\epsilon > 0$ be given. Since $\{X_n\}$ is convergent, it is tight. So there exists $M > 0$ such that $P(|X_n| \ge M) \le \epsilon$ for all $n$. Moreover, by the given condition,
\[
P\left(|a_n - a| \ge \frac{\epsilon}{2M}\right) < \epsilon \quad \text{and} \quad P(|b_n - b| \ge \epsilon) < \epsilon
\]
as $n$ is sufficiently large. Note that
\[
|(a_n X_n + b_n) - (a X_n + b)| \le |a_n - a| |X_n| + |b_n - b|.
\]
Thus,
\[
P(|(a_n X_n + b_n) - (a X_n + b)| \ge 2\epsilon) \le P\left(|a_n - a| \ge \frac{\epsilon}{2M}\right) + P(|X_n| \ge M) + P(|b_n - b| \ge \epsilon) < 3\epsilon
\]
as $n$ is sufficiently large.

Remark. All the above statements of weak convergence of random variables $X_n$ can be replaced by their distributions $\mu_n := \mathcal{L}(X_n)$. So "$X_n \Rightarrow X$" is equivalent to "$\mu_n \Rightarrow \mu$".

THEOREM 11 (Lévy's continuity theorem). Let $\mu_n$; $n = 0, 1, 2, \cdots$ be a sequence of probability measures on $R^k$ with characteristic functions $\phi_n$. We have that
(i) If $\mu_n \Rightarrow \mu_0$ then $\phi_n(t) \to \phi_0(t)$ for every $t \in R^k$;
(ii) If $\phi_n(t)$ converges pointwise to a limit $\phi_0(t)$ that is continuous at 0, then the associated sequence of distributions $\mu_n$ is tight and converges weakly to the measure $\mu_0$ with characteristic function $\phi_0$.

Remark.
The condition that "$\phi_0(t)$ is continuous at 0" is essential. Let $X_n \sim N(0, n)$, $n \ge 1$. So $P(X_n \le x) \to 1/2$ for any $x$, which means that $X_n$ does not converge weakly. Note that $\phi_n(t) = e^{-nt^2/2} \to 0$ as $n \to \infty$ for every $t \ne 0$, and $\phi_n(0) = 1$. So the limit $\phi_0(t)$ is not continuous at 0. If one ignores the continuity condition at zero, the conclusion that $X_n$ converges weakly may wrongly follow.

Proof of Theorem 11. (i) is trivial.

(ii) Because marginal tightness implies joint tightness, w.l.o.g. assume that $\mu_n$ is a probability measure on $R$ and generated by $X_n$. Note that for every $x$ and $\delta > 0$,
\[
1\{|\delta x| > 2\} \le 2\left(1 - \frac{\sin \delta x}{\delta x}\right) = \frac{1}{\delta} \int_{-\delta}^{\delta} (1 - \cos tx)\, dt.
\]
Replace $x$ by $X_n$, take expectations, and use Fubini's theorem to obtain that
\[
P\left(|X_n| > \frac{2}{\delta}\right) \le \frac{1}{\delta} \int_{-\delta}^{\delta} (1 - Ee^{itX_n})\, dt = \frac{1}{\delta} \int_{-\delta}^{\delta} (1 - \phi_n(t))\, dt.
\]
By the given condition,
\[
\limsup_n P\left(|X_n| > \frac{2}{\delta}\right) \le \frac{1}{\delta} \int_{-\delta}^{\delta} (1 - \phi_0(t))\, dt.
\]
The right hand side above goes to zero as $\delta \downarrow 0$. So tightness follows.

To prove the weak convergence, we have to show that $\int f(x)\, d\mu_n \to \int f(x)\, d\mu_0$ for every bounded continuous function $f(x)$. This is equivalent to the statement that for any subsequence there is a further subsequence, say $n_k$, such that $\int f(x)\, d\mu_{n_k} \to \int f(x)\, d\mu_0$.

For any subsequence, by Lemma 5.6 (the Helly selection principle) and tightness, there is a further subsequence, say $\mu_{n_k}$, that converges weakly to some $\mu_0$. By the earlier proof, the c.f. of $\mu_0$ is $\phi_0$. We know that a c.f. uniquely determines a distribution. This says that every limit is identical. So $\int f(x)\, d\mu_{n_k} \to \int f(x)\, d\mu_0$.

THEOREM 12 (Central Limit Theorem). Let $X_1, X_2, \cdots, X_n$ be a sequence of i.i.d. random variables with mean zero and variance one. Let $\bar{X}_n = (X_1 + \cdots + X_n)/n$. Then $\sqrt{n} \bar{X}_n \Rightarrow N(0, 1)$.

Proof. Let $\phi(t)$ be the c.f. of $X_1$. Since $EX_1 = 0$ and $EX_1^2 = 1$ we have that $\phi'(0) = iEX_1 = 0$ and $\phi''(0) = i^2 EX_1^2 = -1$. By Taylor's expansion,
\[
\phi\left(\frac{t}{\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)
\]
as $n$ is sufficiently large. Hence
\[
Ee^{it\sqrt{n}\bar{X}_n} = \phi\left(\frac{t}{\sqrt{n}}\right)^n = \left(1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right)^n \to e^{-t^2/2}
\]
as $n \to \infty$.
By Lévy's continuity theorem, the desired conclusion follows.

To deal with multivariate random variables, we have the next tool.

THEOREM 13 (Cramér-Wold device). Let $X_n$ and $X$ be random variables taking values in $R^k$. Then $X_n \Rightarrow X$ if and only if $t^T X_n \Rightarrow t^T X$ for all $t \in R^k$.

Proof. By Lévy's continuity theorem (Theorem 11), $X_n \Rightarrow X$ if and only if $E\exp(it^T X_n) \to E\exp(it^T X)$ for each $t \in R^k$, which is equivalent to $E\exp(iu(t^T X_n)) \to E\exp(iu(t^T X))$ for any real number $u$ and each $t \in R^k$. This is the same as saying that the c.f. of $t^T X_n$ converges to that of $t^T X$ for each $t \in R^k$. It is equivalent to $t^T X_n \Rightarrow t^T X$ for all $t \in R^k$.

As an application, we have the multivariate analogue of the one-dimensional CLT.

THEOREM 14 Let $X_1, X_2, \cdots$ be a sequence of i.i.d. random vectors in $R^k$ with mean vector $\mu = EX_1$ and covariance matrix $\Sigma = E(X_1 - \mu)(X_1 - \mu)^T$. Then
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) = \sqrt{n}(\bar{X}_n - \mu) \Rightarrow N_k(0, \Sigma).
\]

Proof. By the Cramér-Wold device, we need to show that
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n (t^T X_i - t^T \mu) \Rightarrow N(0, t^T \Sigma t)
\]
for any $t \in R^k$. Note that $\{t^T X_i - t^T \mu,\ i = 1, 2, \cdots\}$ are i.i.d. real random variables. By the one-dimensional CLT,
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n (t^T X_i - t^T \mu) \Rightarrow N(0, Var(t^T X_1)).
\]
This leads to the desired conclusion since $Var(t^T X_1) = t^T \Sigma t$.

We state the following theorem without giving the proof. The spirit of the proof is similar to that of Theorem 12. It is called the Lindeberg-Feller Theorem.

THEOREM 15 Let $\{k_n;\ n \ge 1\}$ be a sequence of positive integers with $k_n \to +\infty$ as $n \to \infty$. For each $n$ let $Y_{n,1}, \cdots, Y_{n,k_n}$ be independent random vectors with finite variances such that
\[
\sum_{i=1}^{k_n} Cov(Y_{n,i}) \to \Sigma \quad \text{and} \quad \sum_{i=1}^{k_n} E\|Y_{n,i}\|^2 I\{\|Y_{n,i}\| > \epsilon\} \to 0
\]
for every $\epsilon > 0$. Then the sequence $\sum_{i=1}^{k_n} (Y_{n,i} - EY_{n,i}) \Rightarrow N(0, \Sigma)$.

The Central Limit Theorems hold not only for independent random variables; they are also true for some dependent random variables of special structures.

Let $Y_1, Y_2, \cdots$ be a sequence of random variables.
A sequence of $\sigma$-algebras $\mathcal{F}_n$ satisfies $\mathcal{F}_n \supset \sigma(Y_1, \cdots, Y_n)$ and is non-decreasing, that is, $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n \ge 0$, where $\mathcal{F}_0 = \{\emptyset, \Omega\}$. When $E(Y_{n+1} | \mathcal{F}_n) = 0$ for any $n \ge 0$, we say $\{Y_n;\ n \ge 1\}$ are martingale differences.

THEOREM 16 For each $n \ge 1$, let $\{X_{ni},\ 1 \le i \le k_n\}$ be a sequence of martingale differences relative to nested $\sigma$-algebras $\{\mathcal{F}_{ni},\ 1 \le i \le k_n\}$, that is, $\mathcal{F}_{ni} \subset \mathcal{F}_{nj}$ for $i < j$, such that
1. $\{\max_i |X_{ni}|;\ n \ge 1\}$ is uniformly integrable;
2. $E(\max_i |X_{ni}|) \to 0$;
3. $\sum_i X_{ni}^2 \to 1$ in probability.
Then $S_n = \sum_i X_{ni} \to N(0, 1)$ in distribution.

Example. Let $X_i$; $i \ge 1$ be i.i.d. with mean zero and variance one. Then one can verify that the required conditions in Theorem 16 hold for $X_{ni} = X_i/\sqrt{n}$ for $i = 1, 2, \cdots, n$. Therefore we recover the classical CLT that $\sum_{i=1}^n X_i/\sqrt{n} \Rightarrow N(0, 1)$. Actually,
\[
P(\max_i |X_{ni}| \ge \epsilon) \le nP(|X_1| \ge \sqrt{n}\epsilon) \le \frac{n}{n\epsilon^2} E|X_1|^2 I\{|X_1| \ge \sqrt{n}\epsilon\} \to 0
\]
and
\[
E(\max_i |X_{ni}|^2) \le E\left(\frac{\sum_{i=1}^n X_i^2}{n}\right) = 1.
\]
This shows that $\max_i |X_{ni}| \to 0$ in probability and $\{\max_i |X_{ni}|,\ n \ge 1\}$ is uniformly integrable. So condition 2 holds.

LEMMA 5.8 Let $\{U_n;\ n \ge 1\}$ and $\{T_n;\ n \ge 1\}$ be two sequences of random variables such that
1. $U_n \to a$ in probability, where $a$ is a constant;
2. $\{T_n\}$ is uniformly integrable;
3. $\{U_n T_n\}$ is uniformly integrable;
4. $ET_n \to 1$.
Then $E(T_n U_n) \to a$.

Proof. Write $T_n U_n = T_n(U_n - a) + aT_n$. Then $E(T_n U_n) = E\{T_n(U_n - a)\} + aET_n$. Evidently, $\{T_n(U_n - a)\}$ is uniformly integrable by the given condition. Since $\{T_n\}$ is uniformly integrable, it is not difficult to verify that $T_n(U_n - a) \to 0$ in probability. Then the result follows.

By the Taylor expansion, we have the following equality
\[
e^{ix} = (1 + ix) e^{-x^2/2 + r(x)} \tag{5.6}
\]
for some function $r(x) = O(x^3)$ as $x \to 0$. Looking at the set-up in Theorem 16, we have that $e^{itS_n} = T_n U_n$, where
\[
T_n = \prod_j (1 + itX_{nj}) \quad \text{and} \quad U_n = \exp\left(-\frac{t^2}{2} \sum_j X_{nj}^2 + \sum_j r(tX_{nj})\right). \tag{5.7}
\]
We next obtain a general result on CLT by using Lemma 5.8.
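The factorization (5.6) can be sanity-checked numerically. The sketch below is not part of the notes: it assumes that $r$ is defined through the principal branch of the complex logarithm, $r(x) = ix + x^2/2 - \log(1 + ix)$, which is one function that makes (5.6) an identity. The names `remainder` and `identity_gap` are hypothetical helpers. Under that assumption, $|r(x)|/|x|^3$ stays bounded near $x = 0$ and approaches $1/3$, which is the sense in which $r(x) = O(x^3)$ is used in the proof of Theorem 17.

```python
import cmath


def remainder(x):
    """r(x) from e^{ix} = (1 + ix) * exp(-x^2/2 + r(x)), using the principal log.

    Assumption (not from the notes): r(x) = ix + x^2/2 - log(1 + ix).
    """
    return 1j * x + x * x / 2.0 - cmath.log(1.0 + 1j * x)


def identity_gap(x):
    """Absolute difference between the two sides of (5.6) at the point x."""
    lhs = cmath.exp(1j * x)
    rhs = (1.0 + 1j * x) * cmath.exp(-x * x / 2.0 + remainder(x))
    return abs(lhs - rhs)


def cubic_ratio(x):
    """|r(x)| / |x|^3, which should stay bounded as x -> 0."""
    return abs(remainder(x)) / abs(x) ** 3


if __name__ == "__main__":
    for x in (0.5, 0.1, 0.01):
        print(x, identity_gap(x), cubic_ratio(x))
```

Since $\log(1 + ix) = ix + x^2/2 - ix^3/3 + \cdots$ for small $x$, the leading term of $r(x)$ is $ix^3/3$, which explains the limiting ratio.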
THEOREM 17 Let $\{X_{nj},\ 1 \le j \le k_n,\ n \ge 1\}$ be a triangular array of random variables. Suppose
1. $ET_n \to 1$;
2. $\{T_n\}$ is uniformly integrable;
3. $\sum_j X_{nj}^2 \to 1$ in probability;
4. $E(\max_j |X_{nj}|) \to 0$.
Then $S_n = \sum_j X_{nj}$ converges to $N(0, 1)$ in distribution.

Proof. Since $|T_n U_n| = |e^{itS_n}| = 1$, we know that $\{T_n U_n\}$ is uniformly integrable. By Lemma 5.8, it is enough to show that $U_n \to e^{-t^2/2}$ in probability. Reviewing (5.7) and condition 3, it suffices to show that $\sum_j r(tX_{nj}) \to 0$ in probability. The assertion $r(x) = O(x^3)$ says that there exist $A > 0$ and $\delta > 0$ such that $|r(x)| \le A|x|^3$ for $|x| < \delta$. Then, absorbing the constants $A$ and $|t|$ into $\delta$ and $\epsilon$,
\[
P\Big(\Big|\sum_j r(tX_{nj})\Big| > \epsilon\Big) \le P(\max_j |X_{nj}| \ge \delta) + P\Big(\Big|\sum_j r(tX_{nj})\Big| > \epsilon,\ \max_j |X_{nj}| < \delta\Big)
\]
\[
\le \delta^{-1} E(\max_j |X_{nj}|) + P\Big(\max_j |X_{nj}| \cdot \sum_j |X_{nj}|^2 > \epsilon\Big) \to 0
\]
as $n \to \infty$ by the given conditions.

Proof of Theorem 16. Define $Z_{n1} = X_{n1}$ and
\[
Z_{nj} = X_{nj} I\Big(\sum_{k=1}^{j-1} X_{nk}^2 \le 2\Big), \quad 2 \le j \le k_n.
\]
It is easy to check that $\{Z_{nj}\}$ is still a martingale difference sequence relative to the original $\sigma$-algebras. Now define $J = \inf\{j \ge 1;\ \sum_{1 \le k \le j} X_{nk}^2 > 2\} \wedge k_n$. Then
\[
P(X_{nk} \ne Z_{nk} \text{ for some } k \le k_n) = P(J \le k_n - 1) \le P\Big(\sum_{1 \le k \le k_n} X_{nk}^2 > 2\Big) \to 0
\]
as $n \to \infty$. Therefore, $P(S_n \ne \sum_{j=1}^{k_n} Z_{nj}) \to 0$ as $n \to \infty$. To prove the result, we only need to show
\[
\sum_{j=1}^{k_n} Z_{nj} \Rightarrow N(0, 1). \tag{5.8}
\]
We now apply Theorem 17 to prove this. Replacing $X_{nj}$ by $Z_{nj}$ in (5.7), we have new $T_n$ and $U_n$. Let's literally verify the four conditions in Theorem 17. By a martingale property and iteration, $ET_n = 1$. So condition 1 holds. Now $\max_j |Z_{nj}| \le \max_j |X_{nj}|$, and $E(\max_j |X_{nj}|) \to 0$ as $n \to \infty$ by condition 2 of Theorem 16. So condition 4 holds. It is also easy to check that $\sum_j Z_{nj}^2 \to 1$ in probability. Then it remains to show condition 2.

By definition, $T_n = \prod_{j=1}^{k_n} (1 + itZ_{nj}) = \prod_{1 \le j \le J} (1 + itZ_{nj})$. Thus,
\[
|T_n| = \prod_{k=1}^{J-1} (1 + t^2 X_{nk}^2)^{1/2} \cdot |1 + itX_{nJ}| \le \exp\Big(\frac{t^2}{2} \sum_{k=1}^{J-1} X_{nk}^2\Big) (1 + |t| \cdot |X_{nJ}|) \le e^{t^2} (1 + |t| \cdot \max_j |X_{nj}|),
\]
which is uniformly integrable by condition 1 in Theorem 16.
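Theorem 16 can be illustrated with a small simulation on an array that is a martingale difference sequence but not independent. The sketch below is an illustration under stated assumptions, not a construction from the notes: it takes i.i.d. signs $\epsilon_i = \pm 1$ and sets $X_{ni} = \epsilon_i v_i / \sqrt{nc}$ with the predictable weight $v_i = 1 + 0.5\,\epsilon_{i-1}$ and $c = E v_i^2 = 1.25$. Then $E(X_{ni} | \mathcal{F}_{n,i-1}) = 0$, $\max_i |X_{ni}| \le 1.5/\sqrt{nc} \to 0$, and $\sum_i X_{ni}^2 \to 1$ by the law of large numbers, so the three conditions hold. The function names, the sample sizes and the seed are arbitrary choices.

```python
import math
import random


def simulate_martingale_sum(n, rng):
    """One draw of S_n = sum_i X_{ni}, with X_{ni} = eps_i * v_i / sqrt(n * c)."""
    c = 1.25  # E[v_i^2] for v_i = 1 + 0.5 * eps_{i-1}
    prev = 1 if rng.random() < 0.5 else -1  # eps_0
    total = 0.0
    for _ in range(n):
        v = 1.0 + 0.5 * prev  # predictable weight: depends on the past only
        eps = 1 if rng.random() < 0.5 else -1  # fresh sign, independent of the past
        total += eps * v
        prev = eps
    return total / math.sqrt(n * c)


def summarize(n=200, reps=2000, seed=0):
    """Sample mean, sample variance and empirical P(S_n <= 1) over `reps` draws."""
    rng = random.Random(seed)
    draws = [simulate_martingale_sum(n, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    frac_le_1 = sum(d <= 1.0 for d in draws) / reps
    return mean, var, frac_le_1


if __name__ == "__main__":
    mean, var, frac = summarize()
    phi_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))  # standard normal cdf at 1
    print(mean, var, frac, phi_1)
```

The summands are dependent because $|X_{ni}|$ is determined by $\epsilon_{i-1}$, yet the sample mean, the sample variance and the empirical value of $P(S_n \le 1)$ should be close to $0$, $1$ and $\Phi(1) \approx 0.841$, as the theorem predicts.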
The next is an extreme limit theorem, which is different from the CLT. The limiting distribution is called an extreme distribution.

THEOREM 18 Let $X_i$; $1 \le i \le n$ be i.i.d. $N(0, 1)$ random variables. Let $W_n = \max_{1 \le i \le n} X_i$. Then
\[
P\left(W_n \le \sqrt{2 \log n} - \frac{(\log_2 n) + x}{2\sqrt{2 \log n}}\right) \to \exp\left(-\frac{1}{2\sqrt{\pi}} e^{x/2}\right)
\]
for any $x \in R$, where $\log_2 n = \log(\log n)$.

Proof. Let $t_n$ be the right hand side of "$\le$" in the probability above. Then
\[
P(W_n \le t_n) = P(X_1 \le t_n)^n = (1 - P(X_1 > t_n))^n.
\]
Since $(1 - x_n)^n \to e^{-a}$ as $n \to \infty$ if $x_n \sim a/n$, to prove the theorem it suffices to show that
\[
P(X_1 > t_n) \sim \frac{1}{2\sqrt{\pi}\, n} e^{x/2}. \tag{5.9}
\]
Actually, we know that
\[
P(X_1 > x) \sim \frac{1}{\sqrt{2\pi}\, x} e^{-x^2/2}
\]
as $x \to +\infty$. It is easy to calculate that
\[
\frac{1}{\sqrt{2\pi}\, t_n} \sim \frac{1}{2\sqrt{\pi \log n}} \quad \text{and} \quad \frac{t_n^2}{2} = \log n - \log(\sqrt{\log n}) - \frac{x}{2} + o(1)
\]
as $n \to \infty$. This leads to (5.9).

As an application of the CLT for i.i.d. random variables, we will be able to derive the classical $\chi^2$-test as follows.

Let's consider a multinomial distribution with $n$ trials and $k$ classes with parameter $p = (p_1, \cdots, p_k)$. Roll a die twenty times. Let $X_i$ be the number of occurrences of "$i$ dots", $1 \le i \le 6$. Then $(X_1, X_2, \cdots, X_6)$ follows a multinomial distribution with "success rate" $p = (p_1, \cdots, p_6)$. How to test whether the die is fair or not? From an introductory course, we know that the null hypothesis is $H_0 : p_1 = \cdots = p_6 = 1/6$. The test statistic is
\[
\chi^2 = \sum_{i=1}^6 \frac{(X_i - np_i)^2}{np_i}.
\]
We use the fact that $\chi^2$ is roughly $\chi^2(5)$ as $n$ is large. We will prove this next.

In general, $X_n = (X_{n,1}, \cdots, X_{n,k})$ follows a multinomial distribution with $n$ trials and "success rate" $p = (p_1, \cdots, p_k)$. Of course, $\sum_{i=1}^k X_{n,i} = n$. To be more precise,
\[
P(X_{n,1} = x_{n,1}, \cdots, X_{n,k} = x_{n,k}) = \frac{n!}{x_{n,1}! \cdots x_{n,k}!}\, p_1^{x_{n,1}} \cdots p_k^{x_{n,k}},
\]
where $\sum_{i=1}^k x_{n,i} = n$. Now we prove the following theorem.

THEOREM 19 As $n \to \infty$,
\[
\chi_n^2 = \sum_{i=1}^k \frac{(X_{n,i} - np_i)^2}{np_i} \Rightarrow \chi^2(k - 1).
\]

We need a lemma.

LEMMA 5.9 Let $Y \sim N_k(0, \Sigma)$. Then
\[
\|Y\|^2 \stackrel{d}{=} \sum_{i=1}^k \lambda_i Z_i^2,
\]
where $\lambda_1, \cdots, \lambda_k$ are the eigenvalues of $\Sigma$ and $Z_1, \cdots, Z_k$ are i.i.d. $N(0, 1)$.

Proof. Decompose $\Sigma = O\, diag(\lambda_1, \cdots, \lambda_k)\, O^T$ for some orthogonal matrix $O$. Then the $k$ coordinates of $O^T Y$ are independent with mean zero and variances $\lambda_i$'s. So $\|Y\|^2 = \|O^T Y\|^2$ is equal in distribution to $\sum_{i=1}^k \lambda_i Z_i^2$, where the $Z_i$'s are i.i.d. $N(0, 1)$.

Another fact we will use is that $AX_n \Rightarrow AX$ for any $k \times k$ matrix $A$ if $X_n \Rightarrow X$, where $X_n$ and $X$ are $R^k$-valued random vectors. This can be shown easily by the Cramér-Wold device.

Proof of the Theorem. Let $Y_1, \cdots, Y_n$ be i.i.d. with a multinomial distribution with 1 trial and "success rate" $p = (p_1, \cdots, p_k)$. Then $X_n = Y_1 + \cdots + Y_n$. Each $Y_i$ is a $k$-dimensional vector. By the CLT,
\[
\frac{X_n - EX_n}{\sqrt{n}} \Rightarrow N_k(0, Cov(Y_1)),
\]
where it is easy to see that $EX_n = np$ and
\[
\Sigma_1 := Cov(Y_1) = \begin{pmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_k \\ \cdots & \cdots & \cdots & \cdots \\ -p_1 p_k & -p_2 p_k & \cdots & p_k(1 - p_k) \end{pmatrix}.
\]
Set $D = diag(1/\sqrt{p_1}, \cdots, 1/\sqrt{p_k})$. Then
\[
D \Sigma_1 D^T = I - \begin{pmatrix} \sqrt{p_1} \\ \sqrt{p_2} \\ \vdots \\ \sqrt{p_k} \end{pmatrix} \begin{pmatrix} \sqrt{p_1} \\ \sqrt{p_2} \\ \vdots \\ \sqrt{p_k} \end{pmatrix}^T.
\]
Since $\sum_{i=1}^k p_i = 1$, the above matrix is symmetric and idempotent, that is, a projection. Also, the trace of $D \Sigma_1 D^T$ is $k - 1$. So $D \Sigma_1 D^T$ has eigenvalue 0 with multiplicity one and eigenvalue 1 with multiplicity $k - 1$. Therefore, by the fact above, we have that
\[
\frac{D(X_n - np)}{\sqrt{n}} \Rightarrow N(0, D \Sigma_1 D^T).
\]
By the lemma, since $g(x) = \|x\|^2$ is a continuous function on $R^k$,
\[
\sum_{i=1}^k \frac{(X_{n,i} - np_i)^2}{np_i} = \left\| \frac{D(X_n - np)}{\sqrt{n}} \right\|^2 \Rightarrow \sum_{i=1}^{k-1} Z_i^2 \stackrel{d}{=} \chi^2(k - 1).
\]

6 Central Limit Theorems for Stationary Random Variables and Markov Chains

Central limit theorems hold not only for independent random variables but also for dependent random variables. Next we will study the Central Limit Theorems for $\alpha$-mixing random variables and martingales (a special class of dependent random variables).

Let $X_1, X_2, \cdots$ be a sequence of random variables.
For each $n \ge 1$, define $\alpha(n)$ by
\[
\alpha(n) := \sup |P(A \cap B) - P(A)P(B)| \tag{6.10}
\]
where the supremum is taken over all $A \in \sigma(X_1, \cdots, X_k)$, $B \in \sigma(X_{k+n}, X_{k+n+1}, \cdots)$, and $k \ge 1$. Obviously, $\alpha(n)$ is decreasing. When $\alpha(n) \to 0$, then $X_k$ and $X_{k+n}$ are approximately independent. In this case the sequence $\{X_n\}$ is said to be $\alpha$-mixing. If the distribution of the random vector $(X_n, X_{n+1}, \cdots, X_{n+j})$ does not depend on $n$, the sequence is said to be stationary.

The sequence is called $m$-dependent if $(X_1, \cdots, X_k)$ and $(X_{k+n}, \cdots, X_{k+n+l})$ are independent for any $n > m$, $k \ge 1$ and $l \ge 1$. In this case the sequence is $\alpha$-mixing with $\alpha(n) = 0$ for $n > m$. An independent sequence is 0-dependent.

For stationary processes, we have the following Central Limit Theorems.

THEOREM 20 Suppose that $X_1, X_2, \cdots$ is stationary with $EX_1 = 0$ and $|X_1| \le A$ for some constant $A > 0$. Let $S_n = X_1 + \cdots + X_n$. If $\sum_{n=1}^\infty \alpha(n) < \infty$, then
\[
\frac{Var(S_n)}{n} \to \sigma^2 = E(X_1^2) + 2 \sum_{k=1}^\infty E(X_1 X_{k+1}) \tag{6.11}
\]
where the series converges absolutely. If $\sigma > 0$, then $S_n/\sqrt{n} \Rightarrow N(0, \sigma^2)$.

THEOREM 21 Suppose that $X_1, X_2, \cdots$ is stationary with $EX_1 = 0$ and $E|X_1|^{2+\delta} < \infty$ for some constant $\delta > 0$. Let $S_n = X_1 + \cdots + X_n$. If $\sum_{n=1}^\infty \alpha(n)^{\delta/(2+\delta)} < \infty$, then
\[
\frac{Var(S_n)}{n} \to \sigma^2 = E(X_1^2) + 2 \sum_{k=1}^\infty E(X_1 X_{k+1})
\]
where the series converges absolutely. If $\sigma > 0$, then $S_n/\sqrt{n} \Rightarrow N(0, \sigma^2)$.

Easily we have the following corollary.

COROLLARY 6.1 Let $X_1, X_2, \cdots$ be $m$-dependent random variables with the same distribution. If $EX_1 = 0$ and $E|X_1|^{2+\delta} < \infty$ for some $\delta > 0$, then $S_n/\sqrt{n} \Rightarrow N(0, \sigma^2)$, where $\sigma^2 = EX_1^2 + 2 \sum_{i=2}^{m+1} E(X_1 X_i)$.

Remark. If we argue by using the truncation method as in the proof of Theorem 21, we can show that the above corollary still holds under the condition $EX_1^2 < \infty$ only.

LEMMA 6.1 Let $Y \in \sigma(X_1, \cdots, X_k)$ and $Z \in \sigma(X_{k+n}, X_{k+n+1}, \cdots, X_{k+m})$ for $n \le m \le +\infty$. Assume $|Y| \le C$ and $|Z| \le D$. Then
\[
|E(YZ) - (EY)EZ| \le 4CD\alpha(n).
\]

Proof.
W.l.o.g., assume that $C = D = 1$. Since expectations can be approximated by finite sums, w.l.o.g. we further assume that
\[
Y = \sum_{i=1}^m y_i I_{A_i} \quad \text{and} \quad Z = \sum_{j=1}^n z_j I_{B_j}
\]
where the $A_i$'s and $B_j$'s are respectively partitions of the sample space. Then
\[
|E(YZ) - (EY)EZ| = \Big| \sum_i \sum_j y_i z_j (P(A_i \cap B_j) - P(A_i)P(B_j)) \Big| \le \sum_i \Big| \sum_j z_j (P(A_i \cap B_j) - P(A_i)P(B_j)) \Big|.
\]
Let $\xi_i = 1$ or $-1$ depending on whether $\sum_j z_j (P(A_i \cap B_j) - P(A_i)P(B_j)) \ge 0$ or not. Then the last sum is equal to $\sum_i \xi_i \sum_j z_j (P(A_i \cap B_j) - P(A_i)P(B_j))$, which is also identical to $\sum_j z_j \sum_i \xi_i (P(A_i \cap B_j) - P(A_i)P(B_j))$. By the same argument, there are $\eta_j$ equal to 1 or $-1$ such that the last quantity is bounded by $\sum_j \sum_i \xi_i \eta_j (P(A_i \cap B_j) - P(A_i)P(B_j))$. Let $E_1$ be the union of the $A_i$ such that $\xi_i = 1$ and $E_2 = E_1^c$; let $F_1$ and $F_2$ be defined similarly for the $B_j$ according to the signs of the $\eta_j$'s. Then
\[
|E(YZ) - (EY)EZ| \le \sum_{1 \le i, j \le 2} |P(E_i \cap F_j) - P(E_i)P(F_j)| \le 4\alpha(n).
\]

LEMMA 6.2 Let $Y \in \sigma(X_1, \cdots, X_k)$ and $Z \in \sigma(X_{k+n}, X_{k+n+1}, \cdots)$. Assume for some $\delta > 0$, $E|Y|^{2+\delta} \le C$ and $E|Z|^{2+\delta} \le D$. Then
\[
|E(YZ) - (EY)EZ| \le 5(1 + C + D)\alpha(n)^{\delta/(2+\delta)}.
\]

Proof. Let $Y_0 = YI(|Y| \le a)$ and $Y_1 = YI(|Y| > a)$; and $Z_0 = ZI(|Z| \le a)$ and $Z_1 = ZI(|Z| > a)$. Then
\[
|E(YZ) - (EY)EZ| \le \sum_{0 \le i, j \le 1} |E(Y_i Z_j) - (EY_i)EZ_j|.
\]
By the previous lemma, the term corresponding to $i = j = 0$ is bounded by $4a^2\alpha(n)$. For $i = j = 1$, $|E(Y_1 Z_1) - (EY_1)EZ_1| = |Cov(Y_1, Z_1)|$, which is bounded by $\sqrt{E(Y_1^2)E(Z_1^2)} \le \sqrt{CD}/a^\delta$ since $E(Y_1^2) \le C/a^\delta$. Now
\[
|E(Y_0 Z_1) - (EY_0)EZ_1| \le E(|Y_0 - EY_0| \cdot |Z_1 - EZ_1|) \le 2a \cdot 2E|Z_1|,
\]
which is less than $4D/a^\delta$. The term for $i = 1$, $j = 0$ is bounded by $4C/a^\delta$. In summary,
\[
|E(YZ) - (EY)EZ| \le 4a^2\alpha(n) + \frac{\sqrt{CD} + 4C + 4D}{a^\delta} \le 4a^2\alpha(n) + \frac{4.5(C + D)}{a^\delta}.
\]
Now take $a = \alpha(n)^{-1/(2+\delta)}$. The conclusion follows.

Proof of Theorem 20. By Lemma 6.1, $|EX_1 X_{1+k}| \le 4A^2\alpha(k)$, so the series (6.11) converges absolutely.
Moreover, let $\rho_k = E(X_1 X_{1+k})$. Then
\[
ES_n^2 = nEX_1^2 + 2 \sum_{1 \le i < j \le n} E(X_i X_j) = nEX_1^2 + 2 \sum_{k=1}^{n-1} (n - k)\rho_k. \tag{6.12}
\]
[...]
\[
E(S_n^4) \le n^3 \delta_n \tag{6.13}
\]
for a sequence of constants $\delta_n \to 0$. Actually, $E(S_n^4) \le \sum_{1 \le i,j,k,l \le n} |E(X_i X_j X_k X_l)|$. Among the terms in the sum there are $n$ terms of $X_i^4$, and at most $n^2$ of $X_i^2 X_j^2$ and $X_i^3 X_j$ respectively for $i \ne j$. So the contribution of these terms is bounded by $Cn^2$ for a universal constant $C$. There are only two types of terms left: $X_i^2 X_j X_k$ and $X_i X_j X_k X_l$ where these indices are all different. We will deal with the second type of terms only. The first one can be done similarly and more easily. Note that
\[
\sum_{1 \le i,j,k,l \le n} |E(X_i X_j X_k X_l)| \le 4!\, n \sum_{1 \le i+j+k \le n} |E(X_1 X_{1+i} X_{1+i+j} X_{1+i+j+k})| \tag{6.14}
\]
where the three indices in the last sum are not zero. By Lemma 6.1, the term inside is bounded by $4A^4\alpha_i$, where we write $\alpha_i$ for $\alpha(i)$. By the same argument, it is also bounded by $4A^4\alpha_k$. Therefore, the sum in (6.14) is bounded by
\[
4 \cdot 4! A^4 n^2 \sum_{2 \le i+k \le n} \min\{\alpha_i, \alpha_k\} \le 2Kn^2 \sum_{2 \le i+k \le n,\ i \le k} \min\{\alpha_i, \alpha_k\} \le 2Kn^2 \sum_{k=1}^n k\alpha_k = n^3 \delta_n \tag{6.15}
\]
for a sequence of constants $\delta_n$ such that $0 < \delta_n \to 0$. This is because
\[
\sum_{k=1}^n k\alpha_k = \sum_{k \le \sqrt{n}} k\alpha_k + \sum_{\sqrt{n} < k \le n} k\alpha_k \le \sqrt{n} \sum_{k \ge 1} \alpha_k + n \sum_{k > \sqrt{n}} \alpha_k = o(n).
\]
Therefore (6.13) follows. From this, we know that $ES_n^4 \le n^3\delta_n \le n^3(\delta_n + (1/n)) \le n^3\delta_n'$, where $\delta_n' := \sup_{k \ge n}\{\delta_k + (1/k)\}$ is decreasing to 0 and is bounded below by $1/n$. Thus, w.l.o.g., assume $\delta_n$ has this property. Define $p_n = [\sqrt{n} \log(1/\delta_{[\sqrt{n}]})]$, $q_n = [n/p_n]$ and $r_n = [n/(p_n + q_n)]$. Then
\[
\sqrt{n} \ll p_n \le \sqrt{n} \log n \quad \text{and} \quad \frac{p_n^2}{n} \delta_{p_n} \le \delta_{p_n} \log^2\left(\frac{1}{\delta_{[\sqrt{n}]}}\right) \le \delta_{p_n} \log^2\left(\frac{1}{\delta_{p_n}}\right) \to 0 \tag{6.16}
\]
as $n \to \infty$. Now for finite $n$, break $X_1, X_2, \cdots, X_n$ into blocks as follows:
\[
X_1, \cdots, X_{p_n} \ |\ X_{p_n+1}, \cdots, X_{p_n+q_n} \ |\ X_{p_n+q_n+1}, \cdots, X_{2p_n+q_n} \ |\ \cdots, X_n.
\]
For $i = 1, 2, \cdots, r_n$, set
\[
U_{n,i} = X_{(i-1)(p_n+q_n)+1} + X_{(i-1)(p_n+q_n)+2} + \cdots + X_{ip_n+(i-1)q_n};
\]
\[
V_{n,i} = X_{ip_n+(i-1)q_n+1} + X_{ip_n+(i-1)q_n+2} + \cdots + X_{i(p_n+q_n)};
\]
\[
W_n = X_{r_n(p_n+q_n)+1} + X_{r_n(p_n+q_n)+2} + \cdots + X_n,
\]
and $S_n' = \sum_{i=1}^{r_n} U_{n,i}$ and $S_n'' = \sum_{i=1}^{r_n} V_{n,i}$.
In what follows, we will show that $S_n''$ and $W_n$ are negligible. Now, by (6.12) and its reasoning,
\[
Var(S_n'') = r_n Var(V_{n,1}) + 2 \sum_{k=2}^{r_n} (r_n - k) E(V_{n,1} V_{n,k}) \le Cr_n q_n + 2r_n \sum_{k=2}^{r_n} |E(V_{n,1} V_{n,k})|.
\]
By assumption, $|V_{n,i}| \le Aq_n$ for any $i \ge 1$. By Lemma 6.1, $|E(V_{n,1} V_{n,k})| \le 4A^2 q_n^2 \alpha_{(k-1)p_n}$. Thus, by monotonicity,
\[
\sum_{k=2}^{r_n} |E(V_{n,1} V_{n,k})| \le 4A^2 q_n^2 \sum_{k=2}^{r_n} \alpha_{(k-1)p_n} \le 4A^2 q_n^2 \frac{1}{p_n} \sum_{k=2}^{r_n} \sum_{j=(k-2)p_n+1}^{(k-1)p_n} \alpha_j \le \frac{4A^2 q_n^2}{p_n} \sum_{j=1}^\infty \alpha_j.
\]
Since $q_n/p_n \to 0$, the above two inequalities imply that
\[
\frac{Var(S_n'')}{n} \to 0 \tag{6.17}
\]
as $n \to \infty$. This says that $S_n''/\sqrt{n} \to 0$ in probability. By Chebyshev's inequality and (6.12) we see that $W_n/\sqrt{n} \to 0$ in probability. Thus, to prove the theorem, it suffices to show that $S_n'/\sqrt{n} \Rightarrow N(0, \sigma^2)$. This will be done through the following two steps:
\[
\frac{\tilde{S}_n}{\sqrt{n}} \Rightarrow N(0, \sigma^2) \quad \text{and} \quad Ee^{itS_n'/\sqrt{n}} - Ee^{it\tilde{S}_n/\sqrt{n}} \to 0 \tag{6.18}
\]
for any $t \in R$, where $\tilde{S}_n = \sum_{i=1}^{r_n} U_{n,i}'$ and $\{U_{n,i}';\ i = 1, 2, \cdots, r_n\}$ are i.i.d. with the same distribution as that of $U_{n,1}$. First, by (6.12), $Var(\tilde{S}_n) = r_n Var(U_{n,1}) \sim r_n p_n \sigma^2$. It follows that $Var(\tilde{S}_n)/n \to \sigma^2$. Also, $EU_{n,i}' = 0$. By (6.13) and (6.16),
\[
\frac{1}{Var(\tilde{S}_n)^2} \sum_{i=1}^{r_n} E(U_{n,i}')^4 \sim \frac{r_n}{n^2\sigma^4} E(U_{n,1}^4) \le \frac{r_n p_n^3 \delta_{p_n}}{n^2\sigma^4} \sim \sigma^{-4} \frac{p_n^2 \delta_{p_n}}{n} \to 0
\]
as $n \to \infty$. By the Lyapounov central limit theorem, we know the first assertion in (6.18) holds. Last, by Lemma 6.1,
\[
\Big| Ee^{itS_n'/\sqrt{n}} - (Ee^{itU_{n,1}/\sqrt{n}})\, E\exp\Big(it \sum_{j=2}^{r_n} U_{n,j}/\sqrt{n}\Big) \Big| \le 16\alpha(q_n).
\]
Then repeating this inequality for $E\exp(it \sum_{j=2}^{r_n} U_{n,j}/\sqrt{n})$ and so on, we obtain that
\[
|Ee^{itS_n'/\sqrt{n}} - Ee^{it\tilde{S}_n/\sqrt{n}}| = \Big| Ee^{itS_n'/\sqrt{n}} - \prod_{j=1}^{r_n} Ee^{itU_{n,j}'/\sqrt{n}} \Big| \le 16\alpha(q_n) r_n.
\]
Since $\alpha(n)$ is decreasing and $\sum_{n \ge 1} \alpha(n) < \infty$, we have $n\alpha(n)/2 \le \sum_{[n/2] \le i \le n} \alpha(i) \to 0$. Thus the above bound is $o(r_n/q_n) = o(n/(p_n q_n)) \to 0$ as $n \to \infty$. The second statement in (6.18) follows. The proof is complete.

Proof of Theorem 21. By Lemma 6.2, $|EX_1 X_k| \le 5(1 + 2E|X_1|^{2+\delta})\alpha(k-1)^{\delta/(2+\delta)}$ for $k \ge 2$, which is summable by the assumption.
Thus the series $EX_1^2 + 2 \sum_{k=2}^\infty E(X_1 X_k)$ is absolutely convergent. By the exact same argument as in proving (6.11), we obtain that $Var(S_n)/n \to \sigma^2$.

Now for a given constant $A > 0$ and all $i = 1, 2, \cdots$, define
\[
Y_i = X_i I\{|X_i| \le A\} - E(X_i I\{|X_i| \le A\}) \quad \text{and} \quad Z_i = X_i I\{|X_i| > A\} - E(X_i I\{|X_i| > A\}),
\]
and $S_n' = \sum_{i=1}^n Y_i$ and $S_n'' = \sum_{i=1}^n Z_i$. Obviously, $S_n = S_n' + S_n''$. By Theorem 20,
\[
\frac{S_n'}{\sqrt{n}} \Rightarrow N(0, \sigma(A)^2) \tag{6.19}
\]
as $n \to \infty$ for each fixed $A > 0$, where $\sigma(A)^2 := EY_1^2 + 2 \sum_{k=2}^\infty E(Y_1 Y_k)$. By the summability assumption on $\alpha(n)$, it is easy to check that both the sum in the definition of $\sigma(A)^2$ and the sum in $\tilde{\sigma}(A)^2 := EZ_1^2 + 2 \sum_{k=2}^\infty E(Z_1 Z_k)$ are absolutely convergent, and
\[
\sigma(A)^2 \to \sigma^2 \quad \text{and} \quad \tilde{\sigma}(A)^2 \to 0 \tag{6.20}
\]
as $A \to +\infty$. Now
\[
\limsup_{n\to\infty} P\left(\frac{|S_n''|}{\sqrt{n}} \ge \epsilon\right) \le \limsup_{n\to\infty} \frac{Var(S_n'')}{n\epsilon^2} = \frac{\tilde{\sigma}(A)^2}{\epsilon^2} \tag{6.21}
\]
by using the same argument as in (6.2). Note that $P(S_n/\sqrt{n} \in F) \le P(S_n'/\sqrt{n} \in \bar{F}^\epsilon) + P(|S_n''|/\sqrt{n} \ge \epsilon)$ for any closed set $F \subset R$ and any $\epsilon > 0$, where $\bar{F}^\epsilon$ is the closed $\epsilon$-neighborhood of $F$. Thus, from (6.19) and (6.21),
\[
\limsup_{n\to\infty} P\left(\frac{S_n}{\sqrt{n}} \in F\right) \le P(N(0, \sigma(A)^2) \in \bar{F}^\epsilon) + \limsup_{n\to\infty} P\left(\frac{|S_n''|}{\sqrt{n}} \ge \epsilon\right) \le P\left(N(0, \sigma^2) \in \frac{\sigma}{\sigma(A)} \bar{F}^\epsilon\right) + \frac{\tilde{\sigma}(A)^2}{\epsilon^2}
\]
for any $A > 0$ and $\epsilon > 0$. First let $A \to +\infty$, and then let $\epsilon \to 0+$ to obtain from (6.20) that
\[
\limsup_{n\to\infty} P\left(\frac{S_n}{\sqrt{n}} \in F\right) \le P(N(0, \sigma^2) \in F).
\]
The proof is complete.

Define by $\tilde{\alpha}(n)$ the supremum of
\[
|E(f(X_1, \cdots, X_k) g(X_{k+n}, \cdots, X_{k+m})) - Ef(X_1, \cdots, X_k) \cdot Eg(X_{k+n}, \cdots, X_{k+m})| \tag{6.22}
\]
over all $k \ge 1$, $m \ge n$ and measurable functions $f(x)$ defined on $R^k$ and $g(x)$ defined on $R^{m-n+1}$ with $|f(x)| \le 1$ and $|g(x)| \le 1$ for all $x$.

LEMMA 6.3 The following are true:
(i) $\alpha(n) = \sup_{k \ge 1} \sup_{m \ge 0} \sup_{A \in \sigma(X_1, \cdots, X_k),\ B \in \sigma(X_{k+n}, \cdots, X_{k+n+m})} |P(A \cap B) - P(A)P(B)|$.
(ii) For all $n \ge 1$, we have that $\alpha(n) \le \tilde{\alpha}(n) \le 4\alpha(n)$.

Proof. (i) For two sets $A$ and $B$, let $A \Delta B = (A \backslash B) \cup (B \backslash A)$, which is their symmetric difference. Then $I_{A \Delta B} = |I_A - I_B|$. Let $Z_i$; $i \ge 1$ be a sequence of random variables.
We claim that
\begin{align}
\sigma(Z_1, Z_2, \cdots) = \{ & B \in \sigma(Z_1, Z_2, \cdots);\ \text{for any } \epsilon > 0, \text{ there exist } l \ge 1 \text{ and} \tag{6.23} \\
& C \in \sigma(Z_1, \cdots, Z_l) \text{ such that } P(B \Delta C) < \epsilon \}. \tag{6.24}
\end{align}
If this is true, then for any $B \in \sigma(Z_1, Z_2, \cdots)$ and any $\epsilon > 0$, there exist $l < \infty$ and $C \in \sigma(Z_1, \cdots, Z_l)$ such that $|P(B) - P(C)| < \epsilon$. Thus (i) follows.

Now we prove this claim. First, note that $\sigma(Z_1, Z_2, \cdots) = \sigma\{\{Z_i \in E_i\};\ E_i \in \mathcal{B}(R);\ i \ge 1\}$. Denote by $\mathcal{F}$ the right hand side of (6.23). Then $\mathcal{F}$ contains all $\{Z_i \in E_i\}$; $E_i \in \mathcal{B}(R)$; $i \ge 1$. We only need to show that the right hand side is a $\sigma$-algebra.

It is easy to see that (i) $\Omega \in \mathcal{F}$; (ii) $B^c \in \mathcal{F}$ if $B \in \mathcal{F}$, since $A \Delta B = A^c \Delta B^c$. (iii) If $B_i \in \mathcal{F}$ for $i \ge 1$, there exist $m_i < \infty$ and $C_i \in \sigma(Z_1, \cdots, Z_{m_i})$ such that $P(B_i \Delta C_i) < \epsilon/2^{i+1}$ for all $i \ge 1$. Evidently, $\cup_{i=1}^n B_i \uparrow \cup_{i=1}^\infty B_i$ as $n \to \infty$. Therefore, there exists $n_0 < \infty$ such that $|P(\cup_{i=1}^\infty B_i) - P(\cup_{i=1}^{n_0} B_i)| \le \epsilon/2$. It is easy to check that $(\cup_{i=1}^{n_0} B_i) \Delta (\cup_{i=1}^{n_0} C_i) \subset \cup_{i=1}^{n_0} (B_i \Delta C_i)$. Write $B = \cup_{i=1}^\infty B_i$, $\tilde{B} = \cup_{i=1}^{n_0} B_i$ and $C = \cup_{i=1}^{n_0} C_i$. Note that $B \Delta C \subset (B \backslash \tilde{B}) \cup (\tilde{B} \Delta C)$. These facts show that $P(B \Delta C) < \epsilon$ and $C \in \sigma(Z_1, Z_2, \cdots, Z_l)$ for some $l < \infty$. Thus $\mathcal{F}$ is a $\sigma$-algebra.

(ii) Taking $f$ and $g$ to be indicator functions, from (i) we get $\alpha(n) \le \tilde{\alpha}(n)$. By Lemma 6.1, we have that $\tilde{\alpha}(n) \le 4\alpha(n)$.

Let $\{X_i;\ i \ge 1\}$ be a sequence of random variables. We say it is a Markov chain if $P(X_{k+1} \in A | X_1, \cdots, X_k) = P(X_{k+1} \in A | X_k)$ for any Borel set $A$ and any $k \ge 1$. A Markov chain has the property that given the present observations, the past and future observations are (conditionally) independent. This is literally stated in Proposition 10.3.

Let's look at the strong mixing coefficient $\tilde{\alpha}(n)$. Recall the definition of $\tilde{\alpha}(n)$. If $X_1, X_2, \cdots$ is a Markov chain, for simplicity, we write $f$ and $g$, respectively, for $f(X_1, \cdots, X_k)$ and $g(X_{k+n}, \cdots, X_{k+m})$. Let $\tilde{f} = E(f | X_{k+1})$ and $\tilde{g} = E(g | X_{k+n-1})$.
Then by Proposition 10.3, $E(fg) = E(\tilde{f}\tilde{g})$, and $Ef = E\tilde{f}$ and $Eg = E\tilde{g}$. Thus,

$$E(fg) - (Ef)Eg = E(\tilde{f}\tilde{g}) - (E\tilde{f})E\tilde{g}.$$

The point is that $\tilde{f}$ is $X_{k+1}$-measurable, and $\tilde{g}$ is $X_{k+n-1}$-measurable. Therefore

\begin{align}
\alpha(n) &\le \sup_{|f| \le 1,\ |g| \le 1} |E(f(X_{k+1})\, g(X_{k+n-1})) - (Ef(X_{k+1}))\, Eg(X_{k+n-1})| \tag{6.25} \\
&\le 4 \sup_{A, B} |P(X_{k+1} \in A,\ X_{k+n-1} \in B) - P(X_{k+1} \in A) P(X_{k+n-1} \in B)| \tag{6.26}
\end{align}

by Lemma 6.1, taking $k$ there to be $k+1$ and $k+n$ there to be $k+n-1$, and $X_i = 0$ for all positive integers $i$ other than $k+1$ and $k+n-1$. If the Markov process is stationary, then

$$\alpha(n+1) \le 4 \sup_{A, B} |P(X_1 \in A,\ X_n \in B) - P(X_1 \in A) P(X_n \in B)|. \tag{6.27}$$

Let $P^n(x, B)$ denote the probability of $X_{n+1} \in B$ given $X_1 = x$, and let $\pi$ be the marginal probability distribution of $X_1$. Then

$$P(X_{n+1} \in B,\ X_1 \in A) = \int_A P^n(x, B)\, \pi(dx).$$

From this,

$$\alpha(n+2) \le 4 \cdot \sup_{A, B} \Big| \int_A (P^n(x, B) - P(X_n \in B))\, \pi(dx) \Big|.$$

For two probability measures $\mu$ and $\nu$, define their total variation distance $\|\mu - \nu\| = \sup_A |\mu(A) - \nu(A)|$. Then

$$\alpha(n+2) \le 4 \int \|P^n(x, \cdot) - \pi\|\, \pi(dx), \quad n \ge 2.$$

In practice, for example in the theory of Markov chain Monte Carlo, there are conditions that guarantee a convergence rate for $\|P^n(x, \cdot) - \pi\|$. In other words, we have a function $M(x) \ge 0$ and a sequence $\gamma(n)$ such that

$$\|P^n(x, \cdot) - \pi\| \le M(x)\gamma(n)$$

for all $x$ and $n$. Here are some examples:

If a chain is so-called polynomially ergodic of order $m$, then $\gamma(n) = n^{-m}$;

If a chain is geometrically ergodic, then $\gamma(n) = t^n$ for some $t \in (0, 1)$;

If a chain is uniformly geometrically ergodic, then $\gamma(n) = t^n$ for some $t \in (0, 1)$ and $\sup_x M(x) < \infty$.

These three conditions, combined with the Central Limit Theorems for stationary random variables stated in Theorems 20 and 21, directly imply the following CLT for Markov chains. Let $E_{\pi}M = \int M(x)\, \pi(dx)$.

THEOREM 22 Suppose $\{X_1, X_2, \cdots\}$ is a stationary Markov chain with $\mathcal{L}(X_1) = \pi$ and $E_{\pi}M < \infty$.
If the chain is geometrically ergodic, in particular, if it is uniformly geometrically ergodic, then

$$\frac{S_n - nEX_1}{\sqrt{n}} \Rightarrow N(0, \sigma^2) \tag{6.28}$$

where $\sigma^2 = E(X_1^2) + 2\sum_{k=2}^{\infty} E(X_1X_k)$.

THEOREM 23 Suppose $\{X_1, X_2, \cdots\}$ is a stationary Markov chain with $\mathcal{L}(X_1) = \pi$ and $E_{\pi}M < \infty$.
(i) If $|X_1| \le C$ for some constant $C > 0$, and the chain is polynomially ergodic of order $m > 1$, then (6.28) holds.
(ii) If the chain is polynomially ergodic of order $m$, and $E|X_1|^{2+\delta} < \infty$ for some $\delta > 0$ satisfying $m\delta > 2 + \delta$, then (6.28) holds.

Remark. In practice, given an initial distribution $\lambda$ and a transition kernel $P(x, A)$, here is the way to generate a Markov chain: randomly draw $X_0 = x$ from the distribution $\lambda$, then draw $X_1$ from the distribution $P(X_0, \cdot)$, and $X_{n+1}$ from $P(X_n, \cdot)$ for any $n \ge 1$. In this way a Markov chain $X_0, X_1, X_2, \cdots$ with initial distribution $\lambda$ and transition kernel $P(x, \cdot)$ has been generated. Let $\pi$ be the stationary distribution of this chain, that is,

$$\pi(\cdot) = \int P(x, \cdot)\, \pi(dx).$$

Theorem 17.1.6 (on p. 420 of "Markov Chains and Stochastic Stability" by S.P. Meyn and R.L. Tweedie) says that if a Markov chain CLT holds for a certain initial distribution $\lambda_0$, then it holds for any initial distribution $\lambda$. In particular, if Theorem 22 or Theorem 23 ($\lambda = \pi$ in this case) holds, then the CLT holds for the Markov chain with the same transition kernel $P(x, \cdot)$ and stationary distribution $\pi$ for any initial distribution $\lambda$.

7 U-Statistics
Let $X_1, \cdots, X_n$ be a sequence of i.i.d. random variables. Sometimes we are interested in estimating $\theta = Eh(X_1, \cdots, X_m)$ for $1 \le m \le n$. To keep the discussion simple, we assume $h(x_1, \cdots, x_m)$ is a symmetric function. If $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$, the order statistic, is sufficient and complete, then
1 U = h(X , , X ) (7.1) n n i1 · · · im m 1 i1<