
Probability Theory Oral Exam study notes

Notes transcribed by Mihai Nica. Abstract. These are some study notes that I made while studying for my oral exams on the topic of Probability Theory. I took these notes from a few different sources, and they are not in any particular order. They tend to move around a lot. They also skip some of the basics of measure theory, which are covered in the real analysis notes. Please be extremely cautious with these notes: they are rough notes and were originally only for me to help me study. They are not complete and likely have errors. I have made them available to help other students on their oral exams. (Note: I specialized in , so these go a bit further into a few topics than most people would for their oral exam.) See also the sections on Conditional Expectation and the Law of the Iterated Logarithm from my Limit Theorem II notes.

Contents

Independence and Weak Law of Large Numbers
1.1. Independence
1.2. Weak Law of Large Numbers

Borel Cantelli Lemmas
2.3. Borel Cantelli Lemmas
2.4. Bounded Convergence Theorem

Central Limit Theorems
3.5. The De Moivre-Laplace Theorem
3.6. Weak Convergence
3.7. Characteristic Functions
3.8. The moment problem
3.9. The Central Limit Theorem
3.10. Other Facts about CLT results
3.11. Law of the Iterated Log

Moment Methods
4.12. Basics of the Moment Method
4.13. Poisson RVs
4.14. Central Limit Theorem

Martingales
5.15. Martingales
5.16. Stopping Times

Uniform Integrability
6.17. An 'absolute continuity' property
6.18. Definition of a UI family
6.19. Two simple sufficient conditions for the UI property
6.20. UI property of conditional expectations
6.21. Convergence in Probability
6.22. Elementary Proof of Bounded Convergence Theorem
6.23. Necessary and Sufficient Conditions for L1 convergence

UI Martingales
7.24. UI Martingales
7.25. Levy's 'Upward' Theorem
7.26. Martingale Proof of Kolmogorov 0-1 Law
7.27. Levy's 'Downward' Theorem
7.28. Martingale Proof of the Strong Law
7.29. Doob's Submartingale Inequality

Kolmogorov's Three Series Theorem using L2 Martingales

A few different proofs of the LLN
9.30. Truncation Lemma
9.31. Truncation + K3 Theorem + Kronecker Lemma
9.32. Truncation + Sparsification
9.33. Levy's Downward Theorem and the 0-1 Law
9.34. Ergodic Theorem
9.35. Non-convergence for infinite mean

Ergodic Theorems
10.36. Definitions and Examples
10.37. Birkhoff's Ergodic Theorem
10.38. Recurrence
10.39. A Subadditive Ergodic Theorem
10.40. Applications

Large Deviations
10.41. LDP for Finite Dimensional Spaces
10.42. Cramer's Theorem

Bibliography

Independence and Weak Law of Large Numbers

These are notes from Chapter 2 of [2].

1.1. Independence

Definition. A, B are independent if P(A ∩ B) = P(A)P(B). X, Y are independent if P(X ∈ C, Y ∈ D) = P(X ∈ C)P(Y ∈ D) for every pair of Borel sets C, D. Two σ-algebras F and G are independent if every A ∈ F and B ∈ G are independent.

Exercise. (2.1.1) i) Show that if X, Y are independent then σ(X) and σ(Y) are. ii) Show that if X is F-measurable, Y is G-measurable, and F, G are independent, then X and Y are independent.

Proof. This is immediate from the definitions. 

Exercise. (2.1.2) i) Show that if A, B are independent then A^c and B are independent too. ii) Show A, B are independent iff 1_A and 1_B are independent.

Proof. i) P(B) = P(A ∩ B) + P(A^c ∩ B) = P(A)P(B) + P(A^c ∩ B), rearrange. ii) Simple using the fact that {1_A ∈ C} is one of ∅, A, A^c, Ω depending on which of 0, 1 lie in C. 

Remark. By the above quick exercises, all of independence is defined by independence of σ-algebras, so we will view that as the central object.

Definition. F_1, F_2, ... are independent if for any finite index set I ⊂ ℕ and any choice of sets A_i ∈ F_i, i ∈ I, we have P(∩_{i∈I} A_i) = ∏_{i∈I} P(A_i). X_1, X_2, ... are independent if σ(X_1), σ(X_2), ... are independent. A_1, A_2, ... are independent if 1_{A_1}, 1_{A_2}, ... are independent.

Exercise. (2.1.3.) Same as the previous exercise with more than two sets.

Example. (2.1.1) The usual example of three events which are pairwise independent but not independent, on the space of three fair coin flips.

1.1.1. Sufficient Conditions for Independence. We will work our way to Theorem 2.1.3, which is the main result for this subsection.

Definition. We call a collection of classes of sets 𝒜_1, 𝒜_2, ..., 𝒜_n independent if for any index set I ⊂ {1, ..., n} and any choice A_i ∈ 𝒜_i, i ∈ I, the events A_i, i ∈ I, are independent. (Just like the definition for the σ-algebras, only we don't require 𝒜_i to be a σ-algebra.)

Lemma. (2.1.1) If we suppose that each 𝒜_i contains Ω, then the criterion for independence works with I = {1, ..., n}.

Proof. When you put A_k = Ω it doesn't change the intersection, and it doesn't change the product since P(A_k) = 1. 


Definition. A π-system is a collection 𝒜 which is closed under intersections, i.e. A, B ∈ 𝒜 ⟹ A ∩ B ∈ 𝒜. A λ-system is a collection ℒ that satisfies: i) Ω ∈ ℒ; ii) A, B ∈ ℒ and A ⊂ B ⟹ B − A ∈ ℒ; iii) A_n ∈ ℒ and A_n ↑ A ⟹ A ∈ ℒ.

Remark. (Mihai - From Wiki) An equivalent definition of a λ-system is: i) Ω ∈ ℒ; ii) A ∈ ℒ ⟹ A^c ∈ ℒ; iii) disjoint A_1, A_2, ... ∈ ℒ ⟹ ∪_{n=1}^∞ A_n ∈ ℒ. In this form, the definition of a λ-system is more comparable to the definition of a σ-algebra, and we can see that it is strictly easier to be a λ-system than a σ-algebra (we only need to be closed under disjoint unions rather than arbitrary countable unions). The first definition presented by Durrett, however, is more useful since it is easier to check in practice!

Theorem. (2.1.2) (Dynkin's π-λ Theorem) If 𝒫 is a π-system and ℒ is a λ-system that contains 𝒫, then σ(𝒫) ⊂ ℒ.

Proof. In the appendix apparently? Will come back to this when I do measure theory. 

Theorem. (2.1.3) Suppose 𝒜_1, ..., 𝒜_n are independent and each 𝒜_i is a π-system. Then σ(𝒜_1), ..., σ(𝒜_n) are independent.

Proof. (You can basically reduce to the case n = 2.) Fix any A_2, ..., A_n in 𝒜_2, ..., 𝒜_n respectively and let F = A_2 ∩ ... ∩ A_n. Let ℒ = {A : P(A ∩ F) = P(A)P(F)}. Then we verify that ℒ is a λ-system by using basic properties of P. For A ⊂ B both in ℒ we have:

P((B − A) ∩ F) = P(B ∩ F) − P(A ∩ F) = P(B)P(F) − P(A)P(F) = P(B − A)P(F)

(increasing limits are easy by continuity of probability)

By the π-λ theorem, σ(𝒜_1) ⊂ ℒ. Since this works for any F, we then have that σ(𝒜_1), 𝒜_2, 𝒜_3, ..., 𝒜_n are independent. Iterating the argument n − 1 more times gives the desired result. 

Remark. (Durrett) The reason the π-λ theorem is helpful here is because it is hard to check for A, B ∈ ℒ that A ∩ B ∈ ℒ or that A ∪ B ∈ ℒ. However, it is easy to check that if A, B ∈ ℒ with A ⊂ B then B − A ∈ ℒ. The π-λ theorem converts statements about π- and λ-systems (which are easier to work with) into statements about σ-algebras (which are harder to work with, but more useful).

Theorem. (2.1.4) In order for X_1, X_2, ..., X_n to be independent, it is sufficient that for all x_1, x_2, ..., x_n ∈ (−∞, ∞]:

P(X_1 ≤ x_1, ..., X_n ≤ x_n) = ∏_{i=1}^n P(X_i ≤ x_i)

Proof. Let 𝒜_i be the collection of sets of the form {X_i ≤ x_i}. It is easy to check that this is a π-system, so the result is a direct application of the previous theorem. 

Exercise. (2.1.4.) Suppose (X_1, ..., X_n) has density f(x_1, ..., x_n) and f can be written as a product g_1(x_1) · ... · g_n(x_n). Show that the X_i's are independent.

Proof. Let g̃_i(x_i) = c_i g_i(x_i) where c_i is chosen so that ∫ g̃_i = 1. Can verify that ∏ c_i = 1 from f = ∏ g_i and ∫ f = 1. Then integrate along the margins to see that each g̃_i is in fact a pdf for X_i. Can then apply Thm 2.1.4 after replacing the g's by the g̃'s. 

Exercise. (2.1.5) Same as 2.1.4 but on a discrete space, with a probability mass function instead of a probability density function.

Proof. Works the same but with sums instead of integrals. 

We will now prove that functions of independent random variables are still independent (to be made more precise in a bit).

Theorem. (2.1.5) Suppose F_{i,j}, 1 ≤ i ≤ n and 1 ≤ j ≤ m(i), are independent σ-algebras and let G_i = σ(∪_j F_{i,j}). Then G_1, ..., G_n are independent.

Proof. (It's another π-λ proof based on Thm 2.1.3.) Let 𝒜_i be the collection of sets of the form ∩_j A_{i,j} with A_{i,j} taken from F_{i,j}. Can verify that 𝒜_i is a π-system that contains ∪_j F_{i,j} (it's closed under intersections by its definition). Since the F_{i,j}'s are all independent, it is clear the π-systems 𝒜_i are independent. By Thm 2.1.3, we know that the σ(𝒜_i) = G_i are all independent too. 

Theorem. (2.1.6) If for 1 ≤ i ≤ n and 1 ≤ j ≤ m(i) the random variables X_{i,j} are independent and f_i : R^{m(i)} → R are measurable functions, then the random variables Y_i := f_i(X_{i,1}, ..., X_{i,m(i)}) are all independent.

Proof. Let F_{i,j} = σ(X_{i,j}) and G_i = σ(∪_j F_{i,j}). By the previous theorem (2.1.5.), the G_i's are independent. Since Y_i := f_i(X_{i,1}, ..., X_{i,m(i)}) is G_i-measurable, these random variables are independent too. 

Remark. (Durrett) This theorem is the rigorous justification for reasoning like: if X_1, ..., X_n are iid, then X_1 is independent of (X_2, ..., X_n).

1.1.2. Independence, Distribution, and Expectation.

Example. (Mihai - From Wiki) The Lebesgue measure on R is the unique measure with µ((a, b)) = b − a.

Proof. We will show that it is the unique such measure on [0, 1]; you get uniqueness on all of R by stitching together all the intervals [n, n + 1]. Say ν is another measure with ν((a, b)) = b − a. First verify that µ(A) = ν(A) for all sets A ∈ 𝒜, where 𝒜 := {(a, b), (a, b], [a, b), [a, b] : 0 ≤ a, b ≤ 1}, and that 𝒜 is a π-system. Then notice the collection ℒ = {A : µ(A) = ν(A)} is a λ-system by the basic properties of a measure. Since 𝒜 ⊂ ℒ by the hypothesis of the problem, by the π-λ theorem σ(𝒜) ⊂ ℒ. But σ(𝒜) = B is all of the Borel sets! Hence µ and ν agree on all Borel sets. 

Theorem. (2.1.7.) Suppose X_1, ..., X_n are independent and X_i has distribution µ_i. Then (X_1, ..., X_n) has distribution µ_1 × ... × µ_n.

Proof. Verify that the distribution of (X_1, ..., X_n) and the measure µ_1 × ... × µ_n agree on rectangle sets of the form A_1 × ... × A_n using independence. Since these sets are a π-system that generates the entire Borel σ-algebra, the two measures agree everywhere. (More specifically, the rectangle sets are a π-system with σ(Rectangles) = Borel sets. The collection of sets where µ_{(X_1,...,X_n)} = µ_1 × ... × µ_n is easily verified to be a λ-system. So Rectangles ⊂ Sets where they agree ⟹ Borel sets ⊂ Sets where they agree, by the π-λ theorem.) 

Theorem. (2.1.8.) Suppose X, Y are independent and have distributions µ and ν. If h : R² → R has h ≥ 0 or E|h(X,Y)| < ∞ then:

E[h(X,Y)] = ∫∫ h(x, y) µ(dx) ν(dy)

Suppose now h(x, y) = f(x)g(y). If f ≥ 0 and g ≥ 0, or if E|f(X)| < ∞ and E|g(Y)| < ∞, then:

E[f(X)g(Y)] = E[f(X)] E[g(Y)]

Proof. This is essentially the Tonelli theorem and the Fubini theorem. Review this when you go over integration. 

Theorem. (2.1.9) If X_1, ..., X_n are independent and have X_i ≥ 0 for all i, or E|X_i| < ∞ for all i, then:

E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i)

Proof. By induction, using the last theorem and the result that X_1 is independent from (X_2, ..., X_n) in this case. 

Remark. (Durrett) Don't forget that uncorrelated ⇏ independent.

1.1.3. Sums of Independent Random Variables.

Theorem. (2.1.10) If X and Y are independent, with F (x) = P(X ≤ x) and G(y) = P(Y ≤ y) then:

P(X + Y ≤ z) = ∫ F(z − y) dG(y)

where integrating dG is shorthand for: integrate with respect to the measure ν whose distribution function is G.

Proof. Let h(x, y) = 1_{x+y≤z}. Let µ and ν be the probability measures with distribution functions F and G. For fixed y we have:

∫ h(x, y) µ(dx) = ∫ 1_{x+y≤z}(x) µ(dx) = ∫ 1_{(−∞, z−y]}(x) µ(dx) = µ((−∞, z − y]) = F(z − y)

Hence:

P(X + Y ≤ z) = ∫∫ 1_{x+y≤z} µ(dx) ν(dy) = ∫ F(z − y) ν(dy)


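As a quick numerical sanity check of this convolution formula, here is a minimal sketch (assuming numpy and scipy are available; the choice of two Exp(1) variables, so that X + Y is Gamma(2,1), is just one convenient example):

import numpy as np
from scipy import stats, integrate

# Monte Carlo estimate of P(X + Y <= z) vs. the convolution integral
# int F(z - y) dG(y) for X, Y independent Exp(1); here dG(y) = g(y) dy.
rng = np.random.default_rng(0)
z = 1.5
x = rng.exponential(size=10**6)
y = rng.exponential(size=10**6)
monte_carlo = np.mean(x + y <= z)

F = stats.expon.cdf
convolution, _ = integrate.quad(lambda s: F(z - s) * stats.expon.pdf(s), 0, z)

print(monte_carlo, convolution)          # both close to the exact value below
print(stats.gamma.cdf(z, a=2))           # X + Y ~ Gamma(2, 1)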

Theorem. (2.1.11) If X, Y have densities f and g, then X + Y has density:

h(x) = ∫ f(x − y) dG(y) = ∫ f(x − y) g(y) dy

Proof. Write:

∫ F(z − y) ν(dy) = ∫_{−∞}^{z} ∫ f(x − y) dG(y) dx



There are some examples with the gamma distributions (basically gamma(α, λ) is the sum of α independent exponential variables with parameter λ, for integer α).

1.1.4. Constructing Independent Random Variables. The Kolmogorov Extension Theorem is here; I'll come back to this later.

Theorem. (Kolmogorov Extension Theorem) If you are given a family of measures µ_n on R^n that are consistent:

µ_{n+1}((a_1, b_1] × ... × (a_n, b_n] × R) = µ_n((a_1, b_1] × ... × (a_n, b_n])

then there exists a unique probability measure P on R^ℕ so that P agrees with µ_n on cylinder sets of size n.

1.2. Weak Law of Large Numbers

1.2.1. L2 Weak Laws.

Theorem. (2.2.1) If X_1, ... are uncorrelated and E(X_i²) < ∞ then:

Var(∑ X_i) = ∑ Var(X_i)

Proof. Just expand it out; the cross terms die since the r.v.s are uncorrelated. (Helps to assume WOLOG that E(X_i) = 0.) 

Lemma. (2.2.2) If p > 0 and E|Z_n|^p → 0 then Z_n → 0 in probability.

Proof. By the Chebyshev inequality. This was one of the things on our types of convergence diagram. 

Theorem. (2.2.3.) If X_1, ... are uncorrelated with E(X_i) = µ and Var(X_i) ≤ C < ∞, and S_n = ∑_{i≤n} X_i, then S_n/n → µ in L² and in probability.

Proof. E[(S_n/n − µ)²] = (1/n²) Var(S_n) = (1/n²) ∑ Var(X_i) ≤ Cn/n² → 0. 

Example. (2.2.1) Basically this example is a similar result set up in such a way as to remark that the convergence is uniform in some sense. Specifically, they have a family of coupled Bernoulli(p) variables X_n^p, and they are saying that E[(S_n/n − µ)²] → 0 uniformly for the whole family. Indeed, the estimate we used in Thm 2.2.3 just needs an upper bound on the variances.
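A minimal numerical sketch of Theorem 2.2.3 (assuming numpy; the exponential summands and the 500-run ensemble are just one convenient choice): the mean-square error E[(S_n/n − µ)²], estimated over the runs, shrinks like C/n.

import numpy as np

# Estimate E[(S_n/n - mu)^2] over 500 independent runs and compare with C/n,
# where C = Var(X_i) = mu^2 for Exp(mean mu) summands.
rng = np.random.default_rng(1)
mu = 2.0
for n in [10**2, 10**3, 10**4]:
    samples = rng.exponential(scale=mu, size=(500, n))
    sn_over_n = samples.mean(axis=1)
    print(n, np.mean((sn_over_n - mu) ** 2), mu**2 / n)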

1.2.2. Triangular Arrays.

Definition. A triangular array is an array of random variables X_{n,k}. The row sum is defined to be S_n = ∑_k X_{n,k}.

Theorem. (2.2.4.) Let µ_n = E(S_n), σ_n² := Var(S_n). If σ_n²/b_n² → 0 then:

(S_n − µ_n)/b_n → 0

in L² and in probability.

Proof. Have (similar to before) E[((S_n − µ_n)/b_n)²] = b_n^{−2} Var(S_n) → 0. 

Example. These examples are actually really nice and I like them a lot! However, I'm going to skip them for now.

1.2.3. Truncation. To truncate an X at level M means to consider: X̃ = X 1_{|X| ≤ M}

Theorem. (2.2.6) For each n let X_{n,k}, 1 ≤ k ≤ n, be a triangular array of independent random variables. Let b_n > 0 with b_n → ∞, and let X̃_{n,k} = X_{n,k} 1_{|X_{n,k}| ≤ b_n}. Suppose that:

∑_{k=1}^n P(|X_{n,k}| > b_n) → 0 as n → ∞, and

b_n^{−2} ∑_{k=1}^n E(X̃_{n,k}²) → 0 as n → ∞.

Then, letting S_n = ∑_k X_{n,k} and a_n = ∑_{k=1}^n E(X̃_{n,k}), we have:

(S_n − a_n)/b_n → 0 in probability.

Proof. Writing S̃_n for the sum of the truncated variables, we have:

P(|(S_n − a_n)/b_n| > ε) ≤ P(S̃_n ≠ S_n) + P(|(S̃_n − a_n)/b_n| > ε)

The first term is controlled by ∑ P(|X_{n,k}| > b_n), namely:

P(S̃_n ≠ S_n) ≤ P(∪_{k=1}^n {X̃_{n,k} ≠ X_{n,k}}) ≤ ∑_{k=1}^n P(|X_{n,k}| > b_n) → 0

The second term is controlled by the Chebyshev inequality and our bound on the X̃_n's. We have L² convergence by:

E[((S̃_n − a_n)/b_n)²] = b_n^{−2} Var(S̃_n) = b_n^{−2} ∑_{k=1}^n Var(X̃_{n,k}) ≤ b_n^{−2} ∑_{k=1}^n E(X̃_{n,k}²) → 0

so this also converges in probability to zero (by the Chebyshev inequality). 

Theorem. (2.2.7.) (Weak law of large numbers) Let X1,... be iid with:

x P(|X_i| > x) → 0 as x → ∞.

Let S_n = X_1 + ... + X_n and let µ_n = E(X_1 1_{|X_1| ≤ n}). Then S_n/n − µ_n → 0 in probability.

Proof. Apply the last result with X_{n,k} = X_k and b_n = n. The first condition, ∑_{k=1}^n P(|X_{n,k}| > b_n) = n P(|X_1| > n) → 0 as n → ∞, is clear since the variables are iid and by hypothesis. The second hypothesis to be verified needs the easy lemma that E(Y^p) = ∫_0^∞ p y^{p−1} P(Y > y) dy (proven by Fubini easily). With this established, we see that:

b_n^{−2} ∑_{k=1}^n E(X̃_{n,k}²) = (1/n) E(X̃_{n,1}²) ≤ (1/n) ∫_0^n 2y P(|X_1| > y) dy

This → 0 as n → ∞ since we are given that 2y P(|X_1| > y) → 0 as y → ∞ by hypothesis. 

Remark. By the converging together lemma (the one that says if X_n ⇒ X and |X_n − Y_n| → 0 in probability, then Y_n ⇒ X), and the fact that convergence in probability is the same as weak convergence when the target is a constant, this result shows S_n/n − µ → 0 whenever µ_n → µ. (Have S_n/n − µ_n ⇒ 0 and |(S_n/n − µ_n) − (S_n/n − µ)| = |µ_n − µ| → 0 by LDCT.)

Another way to phrase the converging together lemma is that if X_n ⇒ X and Y_n ⇒ c a constant, then X_n + Y_n ⇒ X + c. This small improvement leads to:

Theorem. (2.2.9) Let X_1, ..., X_n be iid with E|X_i| < ∞. Let S_n = X_1 + ... + X_n and let µ = E(X_1). Then S_n/n → µ in probability.

Proof. x P(|X_1| > x) ≤ E(|X_1| 1_{|X_1| > x}) → 0 by LDCT since E(|X_1|) < ∞, and µ_n = E(X_1 1_{|X_1| ≤ n}) → E(X_1) = µ, again by LDCT. 

Borel Cantelli Lemmas

2.3. Borel Cantelli Lemmas

2.3.1. Preliminaries. For sets A_n define:

lim sup A_n = {ω : lim sup_n 1_{A_n}(ω) = 1} = lim_{n→∞} ∪_{k≥n} A_k = ∩_n ∪_{k≥n} A_k = {A_n i.o.}

lim inf A_n = {ω : lim inf_n 1_{A_n}(ω) = 1} = lim_{n→∞} ∩_{k≥n} A_k = ∪_n ∩_{k≥n} A_k = {A_n a.b.f.o.}

Here i.o. stands for infinitely often and a.b.f.o. stands for all but finitely many often. (Sometimes people write the latter as a.a. for almost always.)

Lemma. (lim sup A_n)^c = lim inf(A_n^c)

Proof. Just check it from the definitions, or convince yourself by logic: if A_n does not happen infinitely often, then it stops happening at some finite point, so A_n^c happens all but finitely many often. 

Theorem. P(lim sup A_n) ≥ lim sup P(A_n) and P(lim inf A_n) ≤ lim inf P(A_n)

Proof. By continuity of measure, P(lim sup A_n) = lim_{n→∞} P(∪_{k≥n} A_k), and each P(∪_{k≥n} A_k) ≥ sup_{k≥n} P(A_k) by set inclusion, and the result follows. The other direction can be proven in the same way, OR you can take complements to see it from the first result. 

2.3.2. First Borel Cantelli Lemma and Applications.

Lemma. If A_n are events with ∑_{n=1}^∞ P(A_n) < ∞ then:

P(A_n i.o.) = 0

Proof. Standard proof: P(A_n i.o.) = lim_{n→∞} P(∪_{k≥n} A_k) ≤ lim_{n→∞} ∑_{k≥n} P(A_k) = 0. Fancy proof: let N = ∑_n 1_{A_n} so that {A_n i.o.} = {N = ∞}. By Fubini/Tonelli however, E(N) = E(∑_n 1_{A_n}) = ∑_n E(1_{A_n}) = ∑_n P(A_n) < ∞ ⟹ P(N = ∞) = 0. 

Theorem. X_n → X in probability if and only if for every subsequence X_{n_m} there is a further subsequence X_{n_{m_k}} → X a.s.

Proof. (⇒) Take the sub-subsequence so that P(|X_{n_{m_k}} − X| > 1/k) < 2^{−k}; then by Borel Cantelli X_{n_{m_k}} → X a.s.
(⇐) Suppose by contradiction that there is a δ > 0 and ε_0 > 0 so that P(|X_n − X| > δ) > ε_0 for infinitely many n. But then that subsequence can have no a.s. convergent sub-subsequence: contradiction. 

Theorem. If f is continuous and X_n → X in probability, then f(X_n) → f(X) in probability. If f is moreover bounded, then E(f(X_n)) → E(f(X)) too.

Proof. Check the subsequence/sub-subsequence criterion for the f(X_n) → f(X) bit. The next part is the in-probability bounded convergence theorem. One can see it nicely with the subsequence/sub-subsequence property as follows: given any subsequence X_{n_k}, find a sub-subsequence X_{n_{k_m}} → X a.s., and then by the a.s. bounded convergence theorem we will have E f(X_{n_{k_m}}) → E f(X). But then E f(X_n) → E f(X). (Otherwise there is an ε_0 so that they are off by ε_0 infinitely often, but this creates a subsequence with no convergent sub-subsequence.) This can also be proven directly (it is proven somewhere else). 

Theorem. If X_n are non-negative r.v.'s and ∑_{n=1}^∞ E(X_n) < ∞ then X_n → 0 a.s.

Proof. Find a sequence a_n of positive real numbers so that a_n → ∞ and ∑_{n=1}^∞ a_n E(X_n) < ∞ still. Now consider:

∑_{n=1}^∞ P(X_n > 1/a_n) ≤ ∑_{n=1}^∞ a_n E(X_n) < ∞

which means that P(X_n > 1/a_n i.o.) = 0 by Borel Cantelli. Since 1/a_n → 0, this means that X_n → 0 a.s.
(How does one get the a_n's? Very roughly: by scaling by a constant let's suppose ∑ E(X_n) = 1. Find numbers n_1, n_2, ... so that n_k is the first number with ∑_{i=1}^{n_k} E(X_i) > 1 − 2^{−k}. Clearly n_k → ∞. Now you can choose a_i = k for all n_k < i ≤ n_{k+1} safely, by comparison with the sequence ∑_k k 2^{−k} which we know converges.) 

Theorem. (4-moment Strong Law of Large Numbers) If X_1, X_2, ... are iid with E(X_1⁴) < ∞ then S_n/n → E(X_1) a.s.

Proof. (It's like a weak law proof using the Chebyshev inequality, but the fact that E(X_1⁴) < ∞ makes the inequality so strong as to be summable. Then the Borel Cantelli lemma is used to improve convergence in probability to a.s. convergence.)

WOLOG assume that the X_n are mean 0. (Just subtract it.) Let's begin by computing (using combinatorics basically):

  " 4# n !4 Sn X E = n−4E X n  i  i=1   " n # n n X 4 X X = n−4E X4 + n−4 E  X2X2 i 2  i j  i=1 i=1 j=1 j

for some constants c1 and c2. −2 −3 Finally, since n and n are both summable, we can nd a sequence n → 0 so that −4 −2 and −4 −3 are STILL summable (example: −1/10) we can n n n n n = n now use a Cheb inequality: ∞   ∞ " 4# X Sn X Sn P >  ≤ −4E n n n n n=1 n=1 ∞ ∞ X −4 −3 X −4 −2 ≤ c1 n n + c2 n n n=1 n=1 < ∞

So by Borel Cantelli, Sn i.o. and we conclued that Sn a.s. P n > n = 0 n → 0 
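A one-path illustration of the strong law (a minimal sketch assuming numpy; Uniform(0,1) summands, which certainly have a finite fourth moment, are just one convenient choice): the running average S_n/n settles down to E(X_1) = 1/2 along a single sample path.

import numpy as np

# Running averages S_n/n along one long sample path of iid Uniform(0,1)'s.
rng = np.random.default_rng(2)
x = rng.random(10**6)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in [10**2, 10**4, 10**6]:
    print(n, running_mean[n - 1])        # approaches E(X_1) = 0.5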

Remark. You could also leave ε_n = ε fixed, and then this shows that for every ε > 0, eventually |S_n/n| < ε. Then taking the intersection of a countable number of these events we'd see that S_n/n → 0.
Once you have that E[(S_n/n)⁴] ≤ c_1 n^{−3} + c_2 n^{−2}, you could also just apply the last theorem to see that (S_n/n)⁴ → 0 almost surely, but this is a bit more fancy.

Remark. The converse of the Borel Cantelli lemma is false. As one sees in the fancy proof, ∑ P(A_n) is actually equal to E(N) where N is the number of events that occur. Of course, E(N) could be ∞ and yet N < ∞ a.s. (To explicitly construct this, take any random variable N with E(N) = ∞ but N < ∞ a.s., and then on the probability space (Ω, F, P) = ([0, 1], B, m) put A_n = (0, P(N > n)], so that P(A_n) = P(N > n) and ∑ P(A_n) = ∑ P(N > n) = E(N) = ∞, and yet lim sup A_n = ∩_n A_n = ∅.)

2.3.3. Second Borel Cantelli Lemma.

Theorem. If A_n are independent and ∑ P(A_n) = ∞ then P(A_n i.o.) = 1.

Proof. (Depends on the inequality 1 − p ≤ e^{−p} and the idea to look at the complement.) We will actually show that P(lim inf A_n^c) = 0. Have:

P(∩_{n=M}^N A_n^c) = ∏_{n=M}^N (1 − P(A_n)) ≤ ∏_{n=M}^N e^{−P(A_n)} = exp(−∑_{n=M}^N P(A_n)) → 0 as N → ∞

Since ∩_{n=M}^N A_n^c ↓ ∩_{n=M}^∞ A_n^c as N → ∞, we then have P(∩_{n=M}^∞ A_n^c) = lim_{N→∞} P(∩_{n=M}^N A_n^c) = 0, and consequently, since this holds for all M, we will have P(lim inf A_n^c) = lim_{M→∞} P(∩_{n=M}^∞ A_n^c) = 0. 

Theorem. If X_1, X_2, ... are iid with E|X_i| = ∞ then P(|X_n| ≥ n i.o.) = 1. Moreover, we can improve this and show that P(lim sup |X_n|/n = ∞) = 1 and P(lim sup |S_n|/n = ∞) = 1.

Proof. We have the bound:

∞ = E|X_1| = ∫_0^∞ P(|X_1| > x) dx ≤ ∑_{n=0}^∞ P(|X_1| > n) = ∑_{n=0}^∞ P(|X_n| > n)

So by the second Borel Cantelli, P(|X_n| ≥ n i.o.) = 1.
We can improve this a bit to see that for any C > 0, P(|X_n| ≥ Cn i.o.) = 1, since E(|X_1|/C) is still ∞. Since this holds for every C, choosing C = 1, 2, 3, ... and taking the intersection of the countable set of probability-1 events {|X_n|/n > k i.o.}, we see that P(lim sup |X_n|/n = ∞) = 1.
Now to see that P(lim sup |S_n|/n = ∞) = 1, we will show for each C > 0 that P(|S_n|/n > C i.o.) = 1. This is indeed the case because we know that P(|X_n|/n > 2C i.o.) = 1, and whenever |X_n|/n > 2C we have either |S_{n−1}|/(n − 1) > C or |S_n|/n > C, which shows that

{|X_n|/n > 2C i.o.} ⊂ {|S_n|/n > C i.o.}

(Pf of this fact: if |S_{n−1}|/(n − 1) > C then done! Otherwise |S_{n−1}|/(n − 1) ≤ C, and then |S_n|/n ≥ |X_n|/n − |S_{n−1}|/n > 2C − C = C.) 
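For contrast, here is a minimal sketch (assuming numpy; the standard Cauchy is just one convenient distribution with E|X_i| = ∞) of how S_n/n fails to settle down when the mean is infinite:

import numpy as np

# With standard Cauchy summands E|X_i| is infinite, and S_n/n keeps getting
# hit by single huge terms instead of converging; rerun with other seeds to
# see wildly different values.
rng = np.random.default_rng(3)
x = rng.standard_cauchy(10**6)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in [10**2, 10**4, 10**6]:
    print(n, running_mean[n - 1])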

2.4. Bounded Convergence Theorem

Theorem. (In-probability bounded convergence theorem) If Y_n → Y in probability and |Y_n| ≤ K a.s., then Y_n → Y in L¹.

Proof. Notice that |Y| ≤ K a.s. too (or else convergence in probability fails due to the positive measure set where |Y| > K + δ with δ small enough). Write for any ε > 0 that:

E(|Y_n − Y|) = E(|Y_n − Y|; |Y_n − Y| > ε) + E(|Y_n − Y|; |Y_n − Y| ≤ ε)

≤ 2K P(|Y_n − Y| > ε) + ε → ε as n → ∞

Since this holds for all ε > 0, we have indeed that Y_n → Y in L¹. 

Central Limit Theorems

These are notes from Chapter 3 of [2].

3.5. The De Moivre-Laplace Theorem

Let X_1, X_2, ... be iid with P(X_1 = 1) = P(X_1 = −1) = 1/2 and let S_n = ∑_{k≤n} X_k. By simple combinatorics,

P(S_{2n} = 2k) = (2n choose n+k) 2^{−2n}

Using Stirling's Formula now, n! ∼ n^n e^{−n} √(2πn) as n → ∞, where a_n ∼ b_n means that a_n/b_n → 1 as n → ∞, so then:

(2n choose n+k) 2^{−2n} = ...... ∼ (πn)^{−1/2} (1 − k²/n²)^{−n} (1 + k/n)^{−k} (1 − k/n)^{k}

We now use the basic lemma:

Lemma. (3.1.1.) If c_j → 0, a_j → ∞ and a_j c_j → λ then (1 + c_j)^{a_j} → e^λ.

Exercise. (3.1.1.) (Generalization of the above) If max_{1≤j≤n} |c_{j,n}| → 0 and ∑_{j=1}^n c_{j,n} → λ and sup_n ∑_{j=1}^n |c_{j,n}| < ∞, then ∏_{j=1}^n (1 + c_{j,n}) → e^λ.
(Both proofs can be seen by taking logs and using the fact from the Taylor expansion for log that log(1 + x)/x → 1.)
This leads to:

Theorem. (3.1.2.) If 2k/√(2n) → x then P(S_{2n} = 2k) ∼ (πn)^{−1/2} e^{−x²/2}

And a bit more gives (mostly just a change of variables from here):

Theorem. (3.1.3) (The De Moivre-Laplace Theorem) If a < b then as m → ∞ we have:

P(a ≤ S_m/√m ≤ b) → ∫_a^b (2π)^{−1/2} e^{−x²/2} dx
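A quick numerical check of the local limit result 3.1.2 above (a minimal sketch assuming numpy and scipy; the particular n and k values are arbitrary). Note S_{2n} = 2k means n + k of the 2n flips are +1.

import numpy as np
from scipy import stats

# Compare P(S_{2n} = 2k) = binom(2n, n+k) 2^{-2n} with (pi*n)^{-1/2} exp(-x^2/2),
# where x = 2k / sqrt(2n).
n = 500
for k in [0, 10, 20, 30]:
    exact = stats.binom.pmf(n + k, 2 * n, 0.5)
    x = 2 * k / np.sqrt(2 * n)
    approx = (np.pi * n) ** -0.5 * np.exp(-x**2 / 2)
    print(k, exact, approx)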

3.6. Weak Convergence

Definition. For distribution functions F_n, we write F_n ⇒ F and say "F_n converges weakly to F" to mean that F_n(x) = P(X_n ≤ x) → F(x) = P(X ≤ x) for all continuity points x of the function F(x). We sometimes conflate the random variable and its distribution function and write X_n ⇒ X or even F_n ⇒ X or other combinations.


3.6.1. Examples.

Example. (3.2.1.) Let X_1, ... be iid coin flips; by our above work in the De Moivre-Laplace theorem, we have that the distribution functions F_n of S_n/√n converge weakly to the standard normal F.

Example. (3.2.2.) (The Glivenko-Cantelli Theorem) Let X_1, ..., X_n be iid with distribution function F and let

F_n(x) = (1/n) ∑_{m=1}^n 1_{X_m ≤ x}

be its empirical distribution function. The G-C theorem is that:

sup_x |F_n(x) − F(x)| → 0 a.s.

In words: the empirical distribution function converges uniformly, almost surely.
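A quick numerical illustration of the statement before the proof (a minimal sketch assuming numpy and scipy; N(0,1) samples are just one convenient choice): the distance sup_x |F_n(x) − F(x)| shrinks as n grows.

import numpy as np
from scipy import stats

# sup_x |F_n(x) - F(x)| for a continuous F is attained at the data points:
# compare F_n(x_i) = i/n and F_n(x_i-) = (i-1)/n with F(x_i) at the sorted data.
rng = np.random.default_rng(4)
for n in [10**2, 10**4, 10**6]:
    x = np.sort(rng.standard_normal(n))
    F = stats.norm.cdf(x)
    i = np.arange(1, n + 1)
    sup_dist = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(n, sup_dist)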

Proof. (Pf idea: the fact that F_n(x) → F(x) is just the strong law for the indicators 1_{X_m ≤ x}; the only trick is to make the convergence uniform.)

Similarly, let F(x−) := lim_{y↑x} F(y) and let Z_m = 1_{X_m < x}, so that F_n(x−) = (1/n) ∑_{m≤n} Z_m → F(x−) a.s. by the strong law again. Now, for ε > 0 and η > 0 choose any ε-net of [0, 1] (i.e. finitely many points of distance no more than ε from each other), say 0 = y_0 < ... < y_k = 1. Let x_j = inf{y : F(y) ≥ y_j} (this is called the right inverse or something). Notice that in this way we have F(x_j−) − F(x_{j−1}) ≤ y_j − y_{j−1} ≤ ε. Choose N(ω) so large now that |F_n(x_j) − F(x_j)| < η and |F_n(x_j−) − F(x_j−)| < η for all j (this is ok since there are finitely many points x_j and we have almost sure convergence at each of them).

For any point x ∈ (x_{j−1}, x_j) we use the monotonicity of F, along with our inequalities with ε and η, to get:

F_n(x) ≤ F_n(x_j−) ≤ F(x_j−) + η ≤ F(x_{j−1}) + η + ε ≤ F(x) + η + ε

F_n(x) ≥ F_n(x_{j−1}) ≥ F(x_{j−1}) − η ≥ F(x_j−) − η − ε ≥ F(x) − η − ε

So we have an inequality sandwich, |F_n(x) − F(x)| ≤ ε + η, which holds for every x. Since ε and η are arbitrary, we get the convergence. 

Example. (3.2.3.) Let X have distribution F. Then X + 1/n has distribution F_n(x) = F(x − 1/n). As n → ∞ we have:

F_n(x) → F(x−) = lim_{y↑x} F(y)

So in this case the convergence really is only at the continuity points.

Example. (3.2.4.) (Waiting for rare events, convergence of a geometric to an exponential) Let X_p be the number of trials needed to get a success in a sequence of independent trials with success probability p. Then P(X_p ≥ n) = (1 − p)^{n−1} for n = 1, 2, ... and it follows that:

P(pX_p > x) → e^{−x} for all x ≥ 0

In other words pXp ⇒ E where E is an exponential random variable.

Example. (3.2.5.) (Birthday Problem) Fix an N and let X_1, ... be independent and uniformly distributed on {1, 2, ..., N}, and let T_N = min{n : X_n = X_m for some m < n}. Notice that:

P(T_N > n) = ∏_{m=2}^n (1 − (m − 1)/N)

By exercise 3.1.1. (the one that concludes ∏_{j=1}^n (1 + c_{j,n}) → e^λ when ∑_{j=1}^n c_{j,n} → λ and the c_{j,n}'s are small), we have then:

P(T_N/N^{1/2} > x) → exp(−x²/2) for all x ≥ 0

Theorem. (Scheffé's Theorem) If f_n are probability density functions with f_n → f_∞ pointwise as n → ∞, then µ_n → µ_∞ in the total variation distance:

‖µ_n − µ_∞‖_TV := sup_B |µ_n(B) − µ_∞(B)| → 0

Proof. Have:

|∫_B f_n − ∫_B f_∞| ≤ ∫_Ω |f_n − f_∞| = 2 ∫_Ω (f_∞ − f_n)^+ → 0

by the dominated convergence theorem, since (f_∞ − f_n)^+ ≤ f_∞ which is integrable. (We have employed the usual trick for the TV distance here btw: ∫(f_n − f_∞) = 0 since both are densities, so ∫|f_n − f_∞| = 2∫(f_∞ − f_n)^+.) 

Example. (3.2.6.) (Central Order Statistics) Put 2n + 1 points uniformly at random and independently in (0, 1). Let V_{n+1} be the (n + 1)-st largest point.

Lemma. V_{n+1} has density function:

f_{V_{n+1}}(x) = (2n + 1) (2n choose n) x^n (1 − x)^n

Proof. There are 2n + 1 ways to pick which of the 2n + 1 points will be the special central order point. Then there are (2n choose n) ways to divide up the remaining 2n points into two groups of n, one group to be placed on the left and another to be placed on the right, and finally x^n (1 − x)^n is the probability density that these points fall where they need to.
Changing variables x = 1/2 + y/(2√(2n)), Y_n = 2(V_{n+1} − 1/2)√(2n) gives

f_{Y_n}(y) → (2π)^{−1/2} exp(−y²/2) as n → ∞

By Scheffé's theorem then, we have weak convergence to the standard normal. (This isn't entirely surprising since f looks a bit like a binomial random variable... this is actually used in this calculation.) 

Exercise. (Convergence of the maxima of random variables) Come back to this when you review maximum distributions.

3.6.2. Theory. The next result is useful for proving things about weak convergence.

Theorem. (3.2.2.) [Skorokhod Representation Theorem] If F_n ⇒ F_∞ then there are random variables Y_n with the distribution F_n so that Y_n → Y_∞ a.s.

Proof. Let (Ω, F, P) = ((0, 1), B, m) be the usual Lebesgue measure on (0, 1). Define Y_n(x) = sup{y : F_n(y) < x} =: F_n^←(x) (this is called the right inverse). By an earlier theorem (or easy exercise), we know Y_n has the distribution F_n. We will now show that Y_n(x) → Y_∞(x) for all but a countable number of x. Durrett has a proof of this... but I prefer the proof from Resnick where he develops the idea of the left inverse a bit more fully. Review this when you do extreme values! 

Remark. This theorem only works if the random variables have separable support. Since we are working with R-valued random variables, this works. If they were functions or something, it might not work.
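The right inverse above is exactly how one simulates from a given F. A minimal sketch (assuming numpy and scipy; Exp(1), where the inverse is −log(1 − x), is just one convenient choice):

import numpy as np
from scipy import stats

# Push a Uniform(0,1) sample through the (right) inverse of the Exp(1) cdf
# F(y) = 1 - e^{-y}; the result has distribution F.
rng = np.random.default_rng(5)
u = rng.random(10**6)
y = -np.log(1 - u)
print(y.mean(), y.var())                  # both close to 1 for Exp(1)
print(stats.kstest(y, "expon"))           # tiny KS distance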

Exercise. (3.2.4.) (Fatou's lemma) Let g ≥ 0 be continuous. If Xn ⇒ X∞ then:

lim inf E(g(Xn)) ≥ E (g(X∞))

Proof. Just use Skorokhod to make versions of the X_n's which converge a.s. and apply the ordinary Fatou lemma to that. 

Exercise. (3.2.5.) (Integration to the limit) Suppose g, h are continuous with g(x) > 0 and |h(x)|/g(x) → 0 as |x| → ∞. If F_n ⇒ F and ∫ g(x) dF_n(x) ≤ C < ∞, then:

∫ h(x) dF_n(x) → ∫ h(x) dF(x)

Proof. Create random variables on Ω = (0, 1) as in the Skorokhod representation theorem so that F_n, F are the distribution functions for X_n, X and X_n → X a.s. We desire to show that E(h(X_n)) → E(h(X)). Now for any ε > 0 we find M so large that |x| > M ⟹ |h(x)| ≤ ε g(x), and then we will have for any x > M that:

E(h(X_n)) = E(h(X_n); |X_n| ≤ x) + E(h(X_n); |X_n| > x)

→ E(h(X); |X| ≤ x) + E(h(X_n); |X_n| > x)   (bounded convergence thm for the first term)

≤ E(h(X); |X| ≤ x) + ε E(g(X_n); |X_n| > x) ≤ E(h(X); |X| ≤ x) + εC

The first term converges to E(h(X)) as x → ∞ by the LDCT. The other side of the inequality can be handled by Fatou, or I think we could do the above work a little more carefully to get it. 

Theorem. (3.2.3.) Xn ⇒ X∞ if and only if E (g(Xn)) → E (g(X)) for every bounded continuous function g.

Proof. (⟹) Follows by the Skorokhod representation theorem and the bounded convergence theorem.

(⟸) Let g_{x,ε} be a smoothed out version of the step-down Heaviside function at x. For example:

g_{x,ε}(y) = 1 for y ≤ x; = 0 for y ≥ x + ε; linear for x ≤ y ≤ x + ε

This is a bounded continuous (piecewise linear) function, so we have E(g_{x,ε}(X_n)) → E(g_{x,ε}(X)) for each choice of x, ε. This basically gives the result; to do it properly, look at:

lim sup_{n→∞} P(X_n ≤ x) ≤ lim sup_{n→∞} E(g_{x,ε}(X_n)) = E(g_{x,ε}(X)) ≤ P(X ≤ x + ε)

lim inf_{n→∞} P(X_n ≤ x) ≥ lim inf_{n→∞} E(g_{x−ε,ε}(X_n)) = E(g_{x−ε,ε}(X)) ≥ P(X ≤ x − ε)

Combining these inequalities we see that we have the desired convergence at continuity points of X. 

Theorem. (3.2.4.) (Continuous mapping theorem) Let g be measurable and D_g = {x : g is discontinuous at x}. If X_n ⇒ X and P(X ∈ D_g) = 0, then g(X_n) ⇒ g(X), and if g is in addition bounded then E(g(X_n)) → E(g(X)).

Proof. Firstly, let's notice that if g is bounded, then using Skorokhod we will have a version of the variables so that X_n → X a.s., and then g(X_n) → g(X) everywhere except on the set {X ∈ D_g}. Since this is a null set, we get E(g(X_n)) → E(g(X)) by bounded convergence.
Now, we verify that g(X_n) ⇒ g(X) by the bounded-continuous characterization: if f is any bounded continuous function then f ∘ g is bounded and has D_{f∘g} ⊂ D_g, so P(X ∈ D_{f∘g}) = 0 and by the above argument E(f ∘ g(X_n)) → E(f ∘ g(X)). 

Theorem. (3.2.5.) (Portmanteau Theorem) The following are equivalent:

(i) X_n ⇒ X
(ii) E(f(X_n)) → E(f(X)) for all bounded continuous f
(iii) lim inf_n P(X_n ∈ G) ≥ P(X ∈ G) for all open G
(iv) lim sup_n P(X_n ∈ F) ≤ P(X ∈ F) for all closed F
(v) P(X_n ∈ A) → P(X ∈ A) for all A with P(X ∈ ∂A) = 0

Proof. [(i) ⟹ (ii)] Is the theorem we just did.
[(ii) ⟹ (iii)] Use Skorokhod to get an a.s. convergent version and then use Fatou.
[(iii) ⟺ (iv)] Take complements.
[(iii)+(iv) ⟹ (v)] Look at the interior A° and closure Ā. Since ∂A = Ā − A° is a null set, we know that A, Ā, A° are all the same up to null sets. We then get P(X_n ∈ A) → P(X ∈ A) by getting a liminf/limsup sandwich, doing the liminf on P(X_n ∈ A°) ≤ P(X_n ∈ A) and the limsup on P(X_n ∈ Ā) ≥ P(X_n ∈ A), and using the inequalities from (iii) and (iv). 

Remark. In Billingsley he does the proof without using Skorokhod, by looking at sets like A^ε = ∪_{x∈A} B_ε(x) and the function f^ε(x) = (1 − d(x, A)/ε)^+, which is 1 on A, 0 outside of A^ε, and uniformly continuous. This has the advantage that it will work in spaces where the Skorokhod theorem will not work (for example, I think the Skorokhod theorem will fail for function-valued random variables). One advantage of doing this is that, since the f^ε that appears above is uniformly continuous, we may restrict our attention to uniformly continuous functions f, which is sometimes a little easier to check.

Theorem. (3.2.6.) (Helly's Selection Theorem) For every sequence F_n of distribution functions, there is a subsequence F_{n_k} and a right continuous non-decreasing function F so that lim_{k→∞} F_{n_k}(y) = F(y) at all continuity points y of F.

Remark. F is not necessarily a distribution function, as it might not have lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. This happens when mass leaks out to +∞ or −∞, e.g. if F_n is the distribution function of a point mass δ_n at n.

Proof. Let q_k enumerate the rationals. Since F_n(q_1) ∈ [0, 1], by Bolzano-Weierstrass there is a subsequence n_k^{(1)} so that F_{n_k^{(1)}}(q_1) converges; call the limit G(q_1). Since F_{n_k^{(1)}}(q_2) ∈ [0, 1], again by B-W we find a sub-subsequence n_k^{(2)} so that F_{n_k^{(2)}}(q_2) converges. Repeating this, we get a big collection of nested subsequences so that F_{n_k^{(j)}}(q_j) → G(q_j) for each j. Taking the diagonal n_k = n_k^{(k)}, we get that F_{n_k}(q) → G(q) for all rationals q.
Now let F(x) = inf{G(q) : q ∈ Q, q > x}. This is right continuous since: for x_n ↓ x,

lim F(x_n) = inf{G(q) : q ∈ Q, q > x_n for some n} = inf{G(q) : q ∈ Q, q > x} = F(x)

To see that F_{n_k}(x) → F(x) at continuity points, just approximate F(x) by F(r_1) and F(r_2) where r_1, r_2 are rationals with r_1 < x < r_2, and use the convergence of G. F is non-decreasing since F_n(x_1) ≤ F_n(x_2) for all x_1 ≤ x_2, so any limit along a subsequence will have this property too. 

Remark. In functional analysis the selection theorem is that for X a separable n.v.s., the unit ball in the dual space B* = {f ∈ X* : ‖f‖ ≤ 1} is weak* compact. (Recall the weak* topology on X* is the one characterized by f_n → f ⟺ f_n(x) → f(x) for all x.) The proof of this is just the subsequences/diagonal sequences trick of the first part of the proof.
If we apply this version of Helly's theorem to the space X = C([0, 1]), which has X* = finite signed measures, then we get that every sequence of probability measures µ_n has a subsequence µ_{n_k} so that E_{µ_{n_k}}(f) → E_ν(f) for every bounded continuous f. This is exactly the notion of weak convergence µ_{n_k} ⇒ ν we have! ν need not be a probability measure here; it is only a finite signed measure. However, by using

Fatou, we know that ν([0, 1]) ≤ lim inf_{n→∞} µ_n([0, 1]) = 1, so mass can be lost, but mass cannot be gained. We will now look at conditions under which we can be sure that no mass is lost... i.e. the limit is in fact a proper probability measure.

Definition. We say that a family of probability measures {µ_α}_{α∈I} is tight if for all ε > 0 there is a compact set K so that:

µ_α(K^c) < ε ∀α ∈ I

Theorem. (3.2.7.) Let µ_n be a sequence of probability measures on R. Every subsequential limit µ_{n_k} ⇒ ν is a probability measure if and only if µ_n is tight.

Proof. (⟸) Suppose µ_n is tight and F_{n_{k_l}} ⇒ F. For any ε > 0 find M so that µ_n([−M, M]^c) < ε for all n. Find continuity points r and s for F so that r < −M and s > M. Since F_{n_{k_l}}(r) → F(r) and F_{n_{k_l}}(s) → F(s) we have:

µ([r, s]^c) = 1 − F(s) + F(r) ≤ ... ≤ ε

This shows that lim sup_{x→∞} (1 − F(x) + F(−x)) ≤ ε, and since ε was arbitrary this limsup is 0. Now, since lim sup_{x→∞} F(x) ≤ 1 and lim inf_{x→−∞} F(x) ≥ 0, we have that 0 = lim sup_{x→∞} (1 − F(x) + F(−x)) ≥ lim inf_{x→∞} (1 − F(x) + F(−x)) ≥ 1 − 1 + 0 = 0, so we have equality everywhere, from which we conclude that lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0, and so indeed F is the distribution function of a probability measure. (Basically the estimate above shows that no mass can leak away.)

(⟹) We prove the contrapositive. If the F_n are not tight, then there exists an ε_0 > 0 so that for every set [−n, n] there is an n_k with F_{n_k}([−n, n]^c) ≥ ε_0. By Helly, this subsequence has a further convergent subsequence, and the limit is easily verified not to be a probability measure. 

Theorem. (3.2.8.) If there is a ϕ ≥ 0 so that ϕ(x) → ∞ as |x| → ∞ and:

sup_n ∫ ϕ(x) dF_n(x) = C < ∞

then F_n is tight.

Proof. Do a generalized Chebyshev-type estimate:

F_n([−M, M]^c) = P(|X_n| ≥ M) ≤ E(ϕ(X_n)) / inf_{|x| ≥ M} ϕ(x) ≤ C / inf_{|x| ≥ M} ϕ(x) → 0 as M → ∞



Theorem. (3.2.9.) If each subsequence of X_n has a further subsequence that converges a.s. to X, then X_n ⇒ X.

Proof. The first condition is actually equivalent to X_n → X in probability, which implies X_n ⇒ X. 

Definition. (The Levy/Prokhorov Metric) Define a metric on measures by: d(µ, ν) is the smallest ε > 0 so that:

P_µ(A) ≤ P_ν(A^ε) + ε and P_ν(A) ≤ P_µ(A^ε) + ε ∀A ∈ B

Exercise. (3.2.6.) This is indeed a metric, and µ_n ⇒ µ if and only if d(µ_n, µ) → 0.

Proof. (⟸) is clear by using closed sets and checking the limsup-over-closed-sets characterization of weak convergence. (⟹) You need to use the separability of R here. 

Exercise. (3.2.12.) If X_n ⇒ c then X_n → c in probability.

Proof. For any ε > 0 we have P(|X_n − c| > ε) = E(1_{[c−ε,c+ε]^c}(X_n)) → E(1_{[c−ε,c+ε]^c}(c)) = 0, using the continuous mapping theorem (the indicator is discontinuous only at c ± ε, which carry no mass under the constant limit). 

Remark. The converse is also true since converging in probability implies weak convergence (most easily seen by the bounded convergence theorem).

Exercise. (3.2.13) (Converging Together Lemma) If X_n ⇒ X and Y_n ⇒ c then X_n + Y_n ⇒ X + c.

Remark. This does not hold if Y_n ⇒ Y in general. For example, on Ω = (0, 1) let X_n(ω) = the n-th digit of the binary expansion of ω. Notice that X_n is always a coin flip, so X_n ⇒ A where A is a coin flip. (This is a good example where X_n ⇒ X but there is no convergence in probability or a.s. convergence or anything.) If Y_n(ω) = 1 − X_n(ω) then this is still a coin flip. But X_n + Y_n = 1 is now a constant and does not converge to a sum of coin flips or anything like that.

Proof. Since Y_n ⇒ c we know that Y_n → c in probability. Now, for any closed set F let F^ε = {x : d(x, F) ≤ ε}. Then:

P(X_n + Y_n ∈ F) ≤ P(|Y_n − c| > ε) + P(X_n + c ∈ F^ε)

So taking limsup now:

lim sup P(X_n + Y_n ∈ F) ≤ lim sup P(|Y_n − c| > ε) + lim sup P(X_n + c ∈ F^ε)

≤ 0 + P(X + c ∈ F^ε)

Finally, since F is closed, we have that P(X + c ∈ F) = P(X + c ∈ ∩_n F^{1/n}) = lim_{n→∞} P(X + c ∈ F^{1/n}), so taking ε → 0 in the above inequality gives us weak convergence via the portmanteau theorem. 

Corollary. If X_n ⇒ X, Y_n ≥ 0 and Y_n ⇒ c then X_n Y_n ⇒ cX. (This actually holds without Y_n ≥ 0 or c > 0 by splitting up the probability space.)

Proof. Notice that log(X_n) ⇒ log(X) and log(Y_n) ⇒ log(c). Then by the converging together lemma log(X_n) + log(Y_n) ⇒ log(X) + log(c). Then take exp to get X_n Y_n ⇒ cX. (Might need to truncate or something to make this more rigorous.) 

Exercise. (3.2.15) If X^n = (X_1^n, ..., X_n^n) is uniformly distributed over the surface of a sphere of radius √n, then X_1^n ⇒ a standard normal.

Proof. Let Y_1, ... be iid standard normals and let X_i^n = Y_i √n / (∑_{m=1}^n Y_m²)^{1/2}; then check indeed that (X_1^n, ..., X_n^n) is uniformly distributed over the surface of a sphere of radius √n, so this is a legit way to construct the distribution of X_i^n. Then X_1^n = Y_1 (n / ∑_{m=1}^n Y_m²)^{1/2} ⇒ Y_1 by the strong law of large numbers and the converging together lemma. 

3.7. Characteristic Functions

3.7.1. Definition, Inversion Formula. If X is a random variable, we define its characteristic function by:

ϕ(t) = E(e^{itX}) = E(cos tX) + iE(sin tX)

Proposition. (3.3.1) All characteristic functions have:
i) ϕ(0) = 1
ii) ϕ(−t) = the complex conjugate of ϕ(t)
iii) |ϕ(t)| = |E(e^{itX})| ≤ E|e^{itX}| = 1
iv) |ϕ(t + h) − ϕ(t)| ≤ E|e^{ihX} − 1|; since this does not depend on t, this shows that ϕ is uniformly continuous.
v) ϕ_{aX+b}(t) = e^{itb} ϕ_X(at)
vi) For X_1, X_2 independent, ϕ_{X_1+X_2}(t) = ϕ_{X_1}(t) ϕ_{X_2}(t)

Proof. The only one I will comment on is iv). This holds since |z| = (x² + y²)^{1/2} is convex (so |E(W)| ≤ E|W|), so:

|ϕ(t + h) − ϕ(t)| = |E(e^{i(t+h)X} − e^{itX})| ≤ E|e^{i(t+h)X} − e^{itX}| = E|e^{ihX} − 1| → 0 as h → 0 by BCT

Since this → 0 and does not depend on t, we see that ϕ is uniformly continuous. 

Example. I collect the examples in a table (Name: PDF; Char. Fun; Remark):

Coin Flip: P(X = ±1) = 1/2; ϕ(t) = (e^{it} + e^{−it})/2 = cos t.

Poisson: P(X = k) = e^{−λ} λ^k / k!; ϕ(t) = exp(λ(e^{it} − 1)).

Normal: ρ(x) = (2π)^{−1/2} exp(−x²/2); ϕ(t) = exp(−t²/2). Prove this by deriving ϕ' = −tϕ, or complete the square.

Uniform on [a, b]: ρ(x) = 1_{[a,b]}(x)/(b − a); ϕ(t) = (exp(itb) − exp(ita))/(it(b − a)). This one is useful to think about for the inversion formula: ∫_{−∞}^∞ ((e^{−ita} − e^{−itb})/(it(b − a))) ϕ_X(t) dt ≈ µ_X((a, b))/(b − a); e.g. if b = −a this kernel is ϕ(t) = sin(at)/(at), so ∫ ϕ_{1_A} ϕ_X dt ≈ µ_X(A)/L(A).

Triangular: ρ(x) = (1 − |x|)^+; ϕ(t) = 2(1 − cos t)/t². Use the fact that the triangular is the sum of two independent Uniforms.

Difference: Y = X − X̃ (independent copy); ϕ_Y(t) = |ϕ_X(t)|².

Superposition / Difference coin flip: Y = CX where C is a ±1 coin flip; ϕ_Y(t) = Re(ϕ_X(t)). Use the fact that ϕ is linear with respect to superposition: ∑ λ_i F_i has char fun ∑ λ_i ϕ_i. In this case it's (1/2)F_X + (1/2)F_{−X}.

Exponential: ρ(x) = e^{−x} for x ≥ 0; ϕ(t) = 1/(1 − it).

Bilateral Exp: ρ(x) = (1/2)e^{−|x|}; ϕ(t) = Re(1/(1 − it)) = 1/(1 + t²). Use the superposition/difference trick.

Polya's: ρ(x) = (1 − cos x)/(πx²); ϕ(t) = (1 − |t|)^+. Pf comes from the inversion formula and the triangular distribution. Used in Polya's theorem to show that convex, decreasing functions ϕ with ϕ(0) = 1 are characteristic functions of something.

Cauchy: ρ(x) = 1/(π(1 + x²)); ϕ(t) = exp(−|t|). Pf comes from inverting the bilateral exponential.
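Two rows of the table are easy to sanity-check numerically by estimating E[e^{itX}] from samples (a minimal sketch, assuming numpy; the particular t, a, b are arbitrary choices):

import numpy as np

# Monte Carlo estimate of the characteristic function vs. the closed forms
# for Uniform(a, b) and Cauchy at a fixed t.
rng = np.random.default_rng(6)
t, a, b = 1.7, -2.0, 3.0
x_unif = rng.uniform(a, b, size=10**6)
x_cauchy = rng.standard_cauchy(size=10**6)

print(np.mean(np.exp(1j * t * x_unif)),
      (np.exp(1j * t * b) - np.exp(1j * t * a)) / (1j * t * (b - a)))
print(np.mean(np.exp(1j * t * x_cauchy)).real, np.exp(-abs(t)))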

Theorem. (3.3.4.) (The inversion formula) Let ϕ(t) = ∫ e^{itx} µ(dx) where µ is a probability measure. If a < b then:

lim_{T→∞} (2π)^{−1} ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = µ((a, b)) + (1/2) µ({a, b})

Proof. Write

I_T = ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^{T} ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} µ(dx) dt

Notice that (e^{−ita} − e^{−itb})/(it) = ∫_a^b e^{−ity} dy, so this is bounded above in norm by b − a. Hence we can apply Fubini to get:

I_T = ∫ ∫_{−T}^{T} ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt µ(dx)
= ∫ ∫_{−T}^{T} ∫_a^b e^{it(x−y)} dy dt µ(dx)
= ∫ ∫_{−T}^{T} ((e^{it(x−a)} − e^{it(x−b)})/(it)) dt µ(dx)
= ∫ ( ∫_{−T}^{T} (sin(t(x − a))/t) dt − ∫_{−T}^{T} (sin(t(x − b))/t) dt ) µ(dx)

where we use the fact that cos(tθ)/t is odd in t, so those terms cancel themselves out over the symmetric interval. (This relies on T finite.)
Now if we let R(θ, T) = ∫_{−T}^{T} (sin(θt)/t) dt, then we can show that R(θ, T) = 2 sgn(θ) ∫_0^{T|θ|} (sin x / x) dx → π sgn(θ) for θ ≠ 0, and = 0 for θ = 0. We have then:

I_T = ∫ (R(x − a, T) − R(x − b, T)) µ(dx) → ∫ ( 2π, a < x < b; π, x = a or x = b; 0, x < a or x > b ) µ(dx)

So by the bounded convergence theorem, we get the result. 
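A numerical check of the inversion formula (a minimal sketch assuming numpy and scipy; N(0,1) and the window (a, b) = (−1, 0.5) are just one convenient choice). The real part of the kernel (e^{−ita} − e^{−itb})/(it) is (sin(tb) − sin(ta))/t, which tends to b − a as t → 0, and the imaginary part integrates to zero by symmetry.

import numpy as np
from scipy import stats, integrate

# (2*pi)^{-1} int_{-T}^{T} (sin(tb) - sin(ta))/t * phi(t) dt ~ mu((a,b)),
# with phi(t) = exp(-t^2/2) for N(0,1); there are no atoms, so no 1/2 term.
a, b, T = -1.0, 0.5, 50.0
kernel = lambda t: (np.sin(t * b) - np.sin(t * a)) / t if t != 0 else b - a
val, _ = integrate.quad(lambda t: kernel(t) * np.exp(-t**2 / 2), -T, T, limit=200)
print(val / (2 * np.pi))                      # approximate mu((a, b))
print(stats.norm.cdf(b) - stats.norm.cdf(a))  # exact value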

Exercise. (3.3.2.) Similarly:

µ({a}) = lim_{T→∞} (1/(2T)) ∫_{−T}^{T} e^{−ita} ϕ(t) dt

The inversion formula basically tells us that distributions are characterized by their char functions. Two easy consequences are then that:

Exercise. (3.3.3.) ϕ is real if and only if X and −X have the same distribution.

Proof. X and −X have the same distribution iff X =_d CX for C a ±1 coin flip independent of X, iff ϕ_X = ϕ_{CX}, iff ϕ_X = Re(ϕ_X). (We had everything except the ⟸ before the inversion formula, and the inversion formula gives us this.) 

Exercise. (3.3.4.) The sum of two independent normal random variables is again normal.

Proof. The c.f. for a sum of two independent normals is the c.f. for a normal, so it must be a normal. 

Theorem. (3.3.5.) If ∫ |ϕ(t)| dt < ∞ then µ has bounded continuous density function:

f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt

Proof. As we observed before, the kernel (e^{−ita} − e^{−itb})/(it) has |(e^{−ita} − e^{−itb})/(it)| ≤ b − a, so we see that µ has no point masses since:

µ((a, b)) + (1/2) µ({a, b}) = (1/2π) ∫_{−∞}^{∞} ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt ≤ ((b − a)/2π) ∫_{−∞}^{∞} |ϕ(t)| dt → 0 as b − a → 0

Can now calculate the density function by looking at µ((x, x + h)) with the inversion formula, using Fubini and taking h → 0:

µ((x, x + h)) + 0 = (1/2π) ∫ ∫_x^{x+h} e^{−ity} ϕ(t) dy dt = ∫_x^{x+h} ( (1/2π) ∫ e^{−ity} ϕ(t) dt ) dy

The dominated convergence theorem tells us that f is continuous:

f(y + h) − f(y) = (1/2π) ∫ e^{−ity} (e^{−ith} − 1) ϕ(t) dt → 0 by LDCT 

Exercise. (3.3.5.) Give an example of a measure µ which has a density, but for which ∫ |ϕ(t)| dt = ∞

Proof. The uniform distribution has this since ϕ(t) ≈ 1/t. This makes sense since the density is not continuous. 

Exercise. (3.3.6.) If X_1, X_2, ... are iid uniform on (−1, 1) then ∑_{i≤n} X_i has density:

f(x) = (1/π) ∫_0^∞ ((sin t)/t)^n cos(tx) dt

This is a piecewise polynomial.

Remark. Theorem 3.3.5, the Riemann-Lebesgue Lemma, and the following result all tell us that point masses in the distribution correspond to the behaviour at ∞ of the c.f.

Theorem. (Riemann-Lebesgue lemma)

If µ has a density function f, then ϕ_µ(t) → 0 as t → ∞.

Proof. If f is differentiable and compactly supported then we have:

|ϕ_µ(t)| = |∫ f(x) e^{itx} dx| = |(1/(it)) ∫ f'(x) e^{itx} dx| ≤ (1/|t|) ∫ |f'(x)| dx → 0

Any arbitrary density function can be approximated by such f (indeed, these are dense in L¹ by the construction of the Lebesgue integral... just smoothly approximate open intervals). (Alternatively, show it for simple functions, which are also dense in L¹.) 

Exercise. (3.3.7.) If X, X̃ are iid copies of a R.V. with c.f. ϕ then:

lim_{T→∞} (1/(2T)) ∫_{−T}^{T} |ϕ(t)|² dt = P(X − X̃ = 0) = ∑_x µ({x})²

Proof. The result follows by the inversion formula (for point masses) applied to X − X̃, since ϕ_{X−X̃} = |ϕ_X|². 

Corollary. If ϕ(t) → 0 as t → ∞ then µ has no atoms.

Proof. If ϕ(t) → 0 then the average (1/(2T)) ∫_{−T}^{T} |ϕ(t)|² dt → 0 as T → ∞ too, so by the previous formula µ has no atoms. 

Remark. Don't forget there are distributions, like the Cantor distribution, which have no atoms and yet no density.

3.7.2. Weak Convergence - IMPORTANT!.

Theorem. (3.3.6.) (Continuity Theorem) Let µ_n be probability measures with c.f.'s ϕ_n.
i) If µ_n ⇒ µ_∞ then ϕ_n(t) → ϕ_∞(t) for all t.
ii) If ϕ_n → ϕ pointwise and ϕ(t) is continuous at 0, then the associated sequence of distributions µ_n is tight and converges weakly to the measure µ with char function ϕ.

Proof. i) is clear since e^{itx} is bounded and continuous.
ii) We will first show that it suffices to check that µ_n is tight. Suppose µ_n is tight. We claim that µ_n ⇒ µ by the every-subsequence-has-a-sub-subsequence criterion. Indeed, given any subsequence µ_{n_k} we use Helly to find a sub-subsequence µ_{n_{k_l}} with µ_{n_{k_l}} ⇒ µ_0 for some µ_0. Now since the sequence is tight, we know that µ_0 is a legitimate probability measure. By i), since µ_{n_{k_l}} ⇒ µ_0, we know that ϕ_{n_{k_l}}(t) → ϕ_0(t). However, by hypothesis, ϕ_n → ϕ, so it must be that ϕ_0 = ϕ!
To see that the sequence is tight, you use the continuity at 0. The idea is that (1/u) ∫_{−u}^{u} (1 − ϕ(t)) dt → 0 as u → 0 by this continuity. If you write it out, (1/u) ∫_{−u}^{u} (1 − ϕ_n(t)) dt should control something like µ_n(|x| > 2/u), so this is exactly what is needed for tightness. 

Remark. Here is a good example to keep in mind for the continuity theorem: take X_n ∼ N(0, n) so that ϕ_n(t) = exp(−nt²/2) → 0 for t ≠ 0 and = 1 at t = 0 (a limit that is not continuous at 0). X_n can't converge weakly to anything since µ_n((−∞, x]) → 1/2 for every x, and the constant 1/2 is not a distribution function: the mass escapes to ±∞.

Exercise. (3.3.9) If X_n ⇒ X, X_n is normal with mean µ_n and variance σ_n², and X is normal with mean µ and variance σ², then µ_n → µ and σ_n² → σ².

Proof. Look at the c.f.'s. 

Exercise. (3.3.10) If X_n and Y_n are independent and X_n ⇒ X_∞ and Y_n ⇒ Y_∞ then X_n + Y_n ⇒ X_∞ + Y_∞.

Proof. Look at the c.f.'s. 

Exercise. (3.3.12.) Interpret the identity sin t / t = ∏_{m=1}^∞ cos(t/2^m) probabilistically. (You can get this formula from sin t = 2 sin(t/2) cos(t/2) applied repeatedly and using 2^k sin(t/2^k) → t as k → ∞.)

Proof. sin t / t is the c.f. for a uniform [−1, 1] random variable. cos(t/2^m) is the c.f. for a coin flip X_m = ±1/2^m. So this is saying that the sum of infinitely many coin flips like this is a uniform random variable. This is equivalent to the fact that the n-th binary digit of a uniformly chosen x from [0, 1] is a coin flip. (Add 1 = ∑_{m=1}^∞ 2^{−m} to both random variables, and then divide by 2.) 

Exercise. (3.3.13.) Let X_1, ... be iid coin flips taking values 0 and 1 and let X = ∑_{j≥1} (2X_j)/3^j. [This converges almost surely by the Kolmogorov 3-series theorem.] This has the Cantor distribution. Compute the ch.f. ϕ of X and notice that ϕ has the same value at t = 3^k π for every k.

Proof. Each X_j has c.f. (1 + e^{it})/2, so 2X_j/3^j has c.f. (1 + e^{2it/3^j})/2, so we have that ϕ(t) = ∏_{j=1}^∞ (1 + e^{2it/3^j})/2. If you put in t = 3^k π, the first k factors each pick up an e^{2πi·(integer)} = 1 and the remaining factors are the same as for t = π, so the value of the function does not change: ϕ(3^k π) = ϕ(π). 

Remark. This fact shows that ϕ(t) does not tend to 0 as t → ∞, which means that it is impossible for this random variable to have a density (by the Riemann-Lebesgue lemma). You can see there are no atoms because for ω ∈ {0, 1}^ℕ a possible sequence of the X_j's, X(ω) is the number whose ternary expansion is ω (with digits doubled), so for a given x, the set {X(ω) = x} consists of at most one sequence: namely the ternary expansion of x (you have to replace 2's with 1's... you get the idea). Consequently, we can explicitly see that for any x, P(X = x) = 0 and X has no atoms.

3.7.3. Moments and Derivatives. Part of the proof of the "ϕ_n → ϕ with ϕ continuous at 0 implies X_n ⇒ X" theorem was the estimate µ(|x| > 2/u) ≤ (1/u) ∫_{−u}^{u} (1 − ϕ(t)) dt (this was used to show that if ϕ is continuous at 0 then the ϕ_n are tight). This suggests that the local behaviour at 0 is related to the decay of the measure at ∞. We see more of this here:

Exercise. (3.3.14) If ∫ |x|^n µ(dx) < ∞ then the c.f. ϕ has continuous derivatives of order n given by ϕ^{(n)}(t) = ∫ (ix)^n e^{itx} µ(dx).

Proof. Let's do n = 1 first. Consider that:

(ϕ(t + h) − ϕ(t))/h = (1/h) ∫ e^{itx} (e^{ihx} − 1) µ(dx)

Now use the estimate |(1/h)(e^{ihx} − 1)| ≤ |x| (to see this write e^{ihx} − 1 = ix ∫_0^h e^{izx} dz and then bound the integral), so we can apply a dominated convergence theorem to get the result. The case n > 1 is not much different. 

Exercise. (3.3.15.) By differentiating the characteristic function e^{−t²/2} of a normal random variable X ∼ N(0, 1), we see that:

E(X^{2n}) = (2n − 1)(2n − 3) · ... · 3 · 1 ≡ (2n − 1)!!

The next big result is that:

Theorem. (3.3.8.) If E(|X|²) < ∞ then:

ϕ(t) = 1 + itE(X) − t²E(X²)/2 + ε(t)

with:

|ε(t)| ≤ E(min(|tX|³/3!, 2|tX|²/2!))

Remark. Just using what we just did, you would need the condition E(|X|³) < ∞ to get that ϕ was thrice differentiable, and then you would have the result by Taylor's theorem. However, by doing some slightly more careful calculations, we can actually show that:

|e^{ix} − ∑_{m=0}^n (ix)^m/m!| ≤ min(|x|^{n+1}/(n + 1)!, 2|x|^n/n!)

(Again this is done by integration type estimates.) Then using Jensen's inequality we'll have:

|E(e^{itX}) − ∑_{m=0}^n E((itX)^m)/m!| ≤ E(min(|tX|^{n+1}/(n + 1)!, 2|tX|^n/n!))

So even if we only have two moments, we know the error is bounded by t²E(|X|²) and is hence o(t²).

Theorem. (3.3.10) (Polya's criterion) Let ϕ(t) be real, nonnegative, with ϕ(0) = 1 and ϕ(t) = ϕ(−t), and say ϕ is decreasing and convex on (0, ∞) with:

lim_{t↓0} ϕ(t) = 1, lim_{t↑∞} ϕ(t) = 0

Then there is a probability measure ν on (0, ∞) so that:

ϕ(t) = ∫_0^∞ (1 − |t|/s)^+ ν(ds)

Since this is a superposition of rescaled copies of the characteristic function of the Polya r.v., ρ(X_POLYA = x) ∼ (1 − cos x)/x², ϕ_POLYA(t) ∼ (1 − |t|)^+, this shows that ϕ is a char function.

Proof. I'm going to skip the rather technical proof. Part of it is that you can approximate ϕ by piecewise linear functions, and for those it's easier. 

Example. (3.3.10.) exp(−|t|^α) is a char function for all 0 < α < 2.

Proof. The idea is to write exp(−|t|^α) as a limit of characteristic functions, namely:

exp(−|t|^α) = lim_{n→∞} ψ(t · √2 · n^{−1/α})^n

where ψ(t) = 1 − (1 − cos t)^{α/2}. This convergence is seen since 1 − cos(t) ∼ t²/2. ψ is a char function because we can write it as a linear combination of powers (cos t)^n (recall cos t is the char fun for a coin flip), by:

1 − (1 − cos t)^{α/2} = ∑_{n=1}^∞ (α/2 choose n) (−1)^{n+1} (cos t)^n

Exercise. (3.3.23.) This family of RV's is of interest because they are stable in the sense that a scaled sum of many iid copies has the same distribution:

(X_1 + ... + X_n)/n^{1/α} ∼ X

The case α = 2 is the normal distribution and the case α = 1 is the Cauchy distribution.

3.8. The moment problem

Suppose ∫ x^k dF_n(x) has a limit µ_k for each k. We know then that the F_n are tight (a single function ϕ that goes to ∞ at ∞ for which sup_n ∫ ϕ(x) dF_n(x) < ∞ shows tightness by a Chebyshev estimate; here ϕ(x) = x² works). Then we know by Helly that every subsequence has a sub-subsequence that converges weakly to a legit probability distribution. We also know that every limit point will have moments µ_k (can check in this case that F_{n_k} ⇒ F and ∫ x^k dF_{n_k}(x) → µ_k shows ∫ x^k dF(x) = µ_k too... this will use the fact that sup_n ∫ |x|^{k+1} dF_n(x) < ∞, I think), so every limit distribution has the moments µ_k.
Question: Is there a unique limit???? It would suffice to check that there is only one distribution with the moments µ_k. Unfortunately this is not always true.

Example. (An example of a collection of different probability distributions which have the same moments) Consider the lognormal density:

f_0(x) = (2π)^{−1/2} x^{−1} exp(−(log x)²/2), x ≥ 0

(This is what you get if you look at the density of exp(Z) for Z ∼ N(0, 1) a standard normal.) For −1 ≤ a ≤ 1 let:

f_a(x) = f_0(x) (1 + a sin(2π log x))

To see that this has the same moments as f_0, check that:

∫_0^∞ x^r f_0(x) sin(2π log x) dx = 0

Indeed, after the change of variable s = log x (and completing the square, using that r is an integer), we see that the integral is a constant times ∫ exp(−s²/2) sin(2πs) ds, which is 0 since the integrand is odd. One can check that the moments of the lognormal density are µ_k = E(exp(kZ)) = exp(k²/2) (it's the usual completing-the-square trick, or you can do it with derivatives). Notice that these get very large very fast! Also notice that the density decays like exp(−(log x)²/2), which is rather slow. We will now show that if the moments don't grow too fast (or equivalently if the density decays fast) then there IS a unique distribution.

Theorem. (3.3.11) If lim sup_{k→∞} µ_{2k}^{1/2k}/(2k) = r < ∞ then there is at most one distribution function with moments µ_k.

Proof. Let F be any d.f. with moments µ_k. By Cauchy-Schwarz, the absolute moments ν_k = ∫ |x|^k dF(x) have ν_{2k} = µ_{2k} and ν_{2k+1} ≤ √(µ_{2k} µ_{2k+2}), and so:

lim sup_{k→∞} ν_k^{1/k}/k = r < ∞

We next use the modified Taylor series estimate for char functions to conclude that the error in the Taylor expansion of ϕ is:

|ϕ(θ + t) − ϕ(θ) − tϕ'(θ) − ... − (t^{n−1}/(n − 1)!) ϕ^{(n−1)}(θ)| ≤ |t|^n ν_n / n!

Since ν_k ≤ (r + ε)^k k^k for large k, and using the bound e^k ≥ k^k/k!, we see that the above estimate implies that ϕ converges to its Taylor series about any point θ in a n'h'd of fixed radius |t| < 1/(er). If G is any other distribution with moments µ_k, we know that ϕ_G and ϕ_F have the same Taylor series (the derivatives are determined by the moments)! But then ϕ_G and ϕ_F agree in a n'h'd of 0 by the above characterization. By induction, we can repeatedly make the radius of the n'h'd bigger and bigger to see the char functions of F and G agree everywhere. But in this case F must be equal to G by the inversion formula. 

Remark. This condition is slightly stronger than Carleman's Condition, which requires only that:

∑_{k=1}^∞ 1/µ_{2k}^{1/2k} = ∞
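A numerical check of the lognormal example above (a minimal sketch assuming numpy and scipy): after substituting s = log x, the k-th moment of f_a is ∫ e^{ks} (2π)^{−1/2} e^{−s²/2} (1 + a sin(2πs)) ds, and it comes out to exp(k²/2) for every a in [−1, 1].

import numpy as np
from scipy import integrate

# Integer moments of the perturbed lognormal densities f_a, computed after the
# substitution s = log x; they agree with exp(k^2/2) regardless of a.
def moment(k, a):
    g = lambda s: (np.exp(k * s) * (2 * np.pi) ** -0.5 * np.exp(-s**2 / 2)
                   * (1 + a * np.sin(2 * np.pi * s)))
    val, _ = integrate.quad(g, -np.inf, np.inf)
    return val

for k in [1, 2, 3]:
    print(k, [round(moment(k, a), 4) for a in (-1.0, 0.0, 1.0)], np.exp(k**2 / 2))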

3.9. The Central Limit Theorem

Proposition. We have the following estimate:

|E(e^{itX}) − (1 + itE(X) − (t²/2)E(X²))| ≤ E(min(|tX|³/3!, 2|tX|²/2!)) ≤ t² E(min(|t||X|³/3!, |X|²))

Proof. Do integration by parts on the function f(x) = e^{ix}. We aim to show |e^{ix} − (1 + ix − x²/2)| ≤ min(|x|³/3!, 2|x|²/2!). The fact that |e^{ix} − (1 + ix − x²/2)| ≤ |x|³/3! is just by the usual Taylor series. The other fact follows by a trick. Do a Taylor series expansion to first order (with exact integral remainder), then add/subtract x²/2 = ∫_0^x y dy:

e^{ix} = 1 + ix − ∫_0^x y e^{i(x−y)} dy
= 1 + ix − x²/2 + ∫_0^x y dy − ∫_0^x y e^{i(x−y)} dy
= 1 + ix − x²/2 − ∫_0^x y (e^{i(x−y)} − 1) dy

But |e^{i(x−y)} − 1| ≤ 2, so the error term is bounded by 2 ∫_0^{|x|} y dy = 2|x|²/2! and we are done.
Now put in x = tX, take E, and use Jensen's inequality:

|E(e^{itX}) − (1 + itE(X) − (t²/2)E(X²))| ≤ E|e^{itX} − (1 + itX − (tX)²/2)| ≤ E(min(|tX|³/3!, 2|tX|²/2!)) = t² E(min(|t||X|³/3!, |X|²))





Theorem. (iid CLT) If X_n are iid with E(X) = 0 and E(X²) = 1 then S_n/√n ⇒ N(0,1).

Proof. Have by the last estimate that:

φ(t) = 1 − t²/2 + ε(t)

where the error ε(t) satisfies |ε(t)| ≤ t² E[min(|t||X|³/3!, |X|²)]. Hence the characteristic function for S_n/√n is:

φ_{S_n/√n}(t) = φ(t/√n)^n = (1 − t²/(2n) + ε(t/√n))^n

Now notice that n(−t²/(2n) + ε(t/√n)) → −t²/2, since n|ε(t/√n)| ≤ t² E[min(|t||X|³/(√n 3!), |X|²)] → 0 as n → ∞ by the dominated convergence theorem. We now use the fact that if c_n → c then (1 + c_n/n)^n → e^c, so we have:

φ_{S_n/√n}(t) → exp(−t²/2)

By the continuity theorem, we then have S_n/√n ⇒ N(0,1). ∎
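Not in the original notes: a minimal Monte Carlo sketch of this statement, assuming ±1 coin-flip increments and the sample sizes below, comparing S_n/√n against N(0,1) tail probabilities:

    import numpy as np

    # Monte Carlo illustration of the iid CLT with X_i = +/-1 coin flips
    # (E X = 0, E X^2 = 1).  S_n = 2*Binomial(n, 1/2) - n  and  Z = S_n / sqrt(n).
    rng = np.random.default_rng(0)
    n, trials = 10_000, 200_000
    S = 2.0 * rng.binomial(n, 0.5, size=trials) - n
    Z = S / np.sqrt(n)

    print("mean     %+.3f" % Z.mean())        # ~ 0
    print("variance %.3f" % Z.var())          # ~ 1
    print("P(Z > 1) %.4f" % (Z > 1).mean())   # ~ 0.1587 for N(0,1)
    print("P(Z > 2) %.4f" % (Z > 2).mean())   # ~ 0.0228 for N(0,1)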

Lemma. If z_n → z then (1 + z_n/n)^n → e^z.

Proof. Suppose WLOG that |z_n/n| < 1/2 for all n in consideration. Since log(1 + z) is holomorphic and invertible (with e^z as its inverse) in the neighborhood |z| < 1/2, it suffices to show that:

n log(1 + z_n/n) → z

But log(1 + z) has an absolutely convergent Taylor series expansion around z = 0 in this neighborhood. Write this as log(1 + z) = z + z g(z) where g(z) is given by some absolutely convergent power series in this neighborhood and g(0) = 0. We have:

n log(1 + z_n/n) = n (z_n/n + (z_n/n) g(z_n/n)) = z_n + z_n g(z_n/n) → z + z·g(0) = z

 Theorem. (CLT for Triangular Arrays - Lindeberg-Feller Theorem) A sum of small independent errors is normally distributed.

Let X_{n,m}, 1 ≤ m ≤ n, be a triangular array of independent (but not necessarily iid) random variables. Suppose E(X_{n,m}) = 0. Suppose that:

Σ_{m=1}^n E(X_{n,m}²) → σ² > 0

and that:

∀ε > 0,  lim_{n→∞} Σ_{m=1}^n E(|X_{n,m}|²; |X_{n,m}| > ε) = 0

Then S_n ⇒ N(0, σ²).

Proof. The characteristic function for S_n is:

φ_{S_n}(t) = Π_{m=1}^n φ_{X_{n,m}}(t) = Π_{m=1}^n (1 − (t²/2) E(X_{n,m}²) + ε_{n,m}(t))

The error satisfies |ε_{n,m}(t)| ≤ |t|² E[min(|t||X_{n,m}|³, |X_{n,m}|²)]. We now claim that:

Σ_{m=1}^n (−(t²/2) E(X_{n,m}²) + ε_{n,m}(t)) → −(t²/2) σ²

as n → ∞. It suffices to show that Σ_{m=1}^n ε_{n,m}(t) → 0. Indeed for any ε > 0 we have:

Σ_{m=1}^n |ε_{n,m}(t)| ≤ Σ_{m=1}^n |t|² E[min(|t||X_{n,m}|³, |X_{n,m}|²)]
  ≤ Σ_{m=1}^n |t|² E[min(|t||X_{n,m}|³, |X_{n,m}|²); |X_{n,m}| > ε] + Σ_{m=1}^n |t|² E[min(|t||X_{n,m}|³, |X_{n,m}|²); |X_{n,m}| ≤ ε]
  ≤ Σ_{m=1}^n |t|² E[|X_{n,m}|²; |X_{n,m}| > ε] + ε Σ_{m=1}^n |t|³ E[|X_{n,m}|²]
  → 0 + ε |t|³ σ²

Now also use that max_{1≤m≤n} E(X_{n,m}²) → 0 as n → ∞ (implied by the ε condition) and then use the fact for complex numbers that if max_{1≤j≤n} |c_{j,n}| → 0, Σ_{j=1}^n c_{j,n} → λ and sup_n Σ_{j=1}^n |c_{j,n}| < ∞, then Π_{j=1}^n (1 + c_{j,n}) → e^λ. ∎

3.10. Other Facts about CLT results

Theorem. S_n = X_1 + X_2 + ... + X_n for X_i iid has Z_n := (S_n − nE(X))/√(n Var(X)) ⇒ N(0,1) weakly as n → ∞.

Example. Z_n does not converge in probability or in the almost sure sense.

Proof. Assume WLOG that E(X) = 0 and Var(X) = 1. Take a subsequence n_k. Let Y_k = Σ_{i=n_{k−1}+1}^{n_k} X_i / √(n_k − n_{k−1}) so that the Y_k's are independent. Write after some manipulation that:

Z_{n_k} = √(1 − n_{k−1}/n_k) Y_k + √(n_{k−1}/n_k) Z_{n_{k−1}}

Now, if Z_n → A almost surely, then by choosing a sequence n_k with n_{k−1}/n_k → 0 (e.g. n_k = k!) we see from the above that Y_k → A almost surely too. But since the Y_k are independent, A must be a constant, which is absurd. ∎

Example. If X_1, ... are independent and X_n → Z as n → ∞ almost surely or in probability, then Z ≡ const almost surely.

Proof. Suppose by contradiction that Z is not almost surely a constant. Then find x < y so that P(Z ≤ x) ≠ 0 and P(Z ≥ y) ≠ 0. (E.g. look at inf{x : P(Z ≤ x) ≠ 0} and sup{y : P(Z ≥ y) ≠ 0}; if these are equal then Z is a.s. constant, and if they are not equal, then we can find the desired x, y.) Assume WLOG that x, y are continuity points of Z (indeed, any subinterval of (x, y) will work and there can only be countably many discontinuities). Now, since a.s. convergence and convergence in probability are stronger than weak convergence, we have that P(X_n ≤ x) → P(Z ≤ x) ≠ 0 and P(X_n ≥ y) → P(Z ≥ y) ≠ 0. In particular the sequence P(X_n ≤ x) is not summable. By the second Borel-Cantelli lemma, X_n ≤ x happens infinitely often a.s. Similarly, X_n ≥ y happens infinitely often a.s. On this measure 1 set, X_n has no chance to converge a.s. or in probability, as it is both ≤ x and ≥ y infinitely often and these are separated by some positive gap. ∎

Example. Once we have proven the CLT, we can show that actually lim sup S_n/√n = ∞.

Proof. First, check that {lim sup S_n/√n > M} is a tail event. By the Kolmogorov 0-1 law, to show that this has probability 1 it suffices to show that the probability is positive. Then notice that by the central limit theorem lim_{n→∞} P(S_n/√n > M) = P(χ > M) > 0 for χ ∼ N(0,1), so it cannot be that {lim sup S_n/√n > M} is a probability 0 event. Since this holds for every M we get the result. ∎

Theorem. (2.5.7 from Durrett) If X_1, ... are iid with E(X_1) = 0 and E(X_1²) = σ² < ∞ then:

S_n / (√n (log n)^{1/2+ε}) → 0 a.s.

Remark. Compare this with the law of the iterated log:

lim sup_{n→∞} S_n / √(n log log n) = σ√2

Proof. It suffices, via the Kronecker lemma, to check that Σ X_n/(√n (log n)^{1/2+ε}) converges a.s. By the Kolmogorov one-series theorem, it suffices to check that the variances are summable. Indeed:

Var(X_n/(√n (log n)^{1/2+ε})) ≤ σ²/(n (log n)^{1+2ε})

which is indeed summable (e.g. by the Cauchy condensation test)! ∎

3.11. Law of the Iterated Log

The Law of the Iterated Log tells us that when the X_n are iid coin flips ±1 with probability 1/2, or when the X_n are iid N(0,1) Gaussian random variables, then:

lim sup_{n→∞} S_n/√(n log(log n)) = √2  a.s.

The proof is divided into two parts:

3.11.1. The upper bound:

∀ε > 0,  P(lim sup_{n→∞} S_n/√(n log(log n)) > √2 + ε) = 0

The √2 comes from the 2 appearing in the Chernoff bound P(max_{k≤n} S_k ≥ λ) ≤ exp(−λ²/(2n)).

Proposition. (Chernoff bound)

P(max_{k≤n} S_k ≥ λ) ≤ exp(−λ²/(2n))

Proof. Since S_n is a martingale, we have Doob's inequality for submartingales:

P(max_{k≤n} S_k ≥ λ) ≤ E(|S_n|)/λ

Of course, any convex function of a martingale is a submartingale, so we can apply the inequality like we would for any Chebyshev-type inequality, e.g.

P(max_{k≤n} S_k ≥ λ) ≤ E[exp(θS_n)] / exp(θλ)

If the S_n are Gaussian, then applying this directly leads to the Chernoff bound after optimizing over θ. If the X_n are ±1 then one must first get subgaussian tails for E[exp(θS_n)], which is known as the Hoeffding inequality. Again, once we have this, optimizing over θ gives the result. ∎

Idea of this half: With the Chernoff bound in hand, we can estimate:

P(max_{k≤n} S_k > (√2 + ε)√(c log(log c))) ≤ exp(−(√2 + ε)² c log(log c)/(2n))
  ≤ exp(−(1 + ε̃)(c/n) log(log c))

So now if we let A_n be the event that S_k > (√2 + ε)√(k log(log k)) for some θ^{n−1} ≤ k ≤ θ^n, then P(A_n) ≤ P(max_{k≤θ^n} S_k > (√2 + ε)√(c log(log c))) with c = θ^{n−1}, so by the above estimate (applied with θ^n in place of n) it is:

P(A_n) ≤ exp(−(1 + ε̃)(θ^{n−1}/θ^n) log(log(θ^{n−1}))) ≤ C n^{−(1+ε̃)/θ}

which is summable if we choose θ correctly! Hence by Borel-Cantelli, this event only happens finitely often almost surely.

Extra short summary of the idea: Doob's inequality controls the whole sequence max_{k≤n} S_k just by looking at S_n. By the Chernoff bound, S_n has subgaussian tails. By looking in the range θ^{n−1} ≤ k ≤ θ^n we can use this to ensure that the probability of exceeding (√2 + ε)√(n log(log n)) is summable.
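Not from the notes: a rough Monte Carlo sanity check of the Chernoff bound used above (the ±1 increments, the value of n, and the level λ = 2√n are my assumptions):

    import numpy as np

    # Monte Carlo check of  P(max_{k<=n} S_k >= lam) <= exp(-lam^2/(2n))
    # for fair +/-1 increments (the Hoeffding case of the Chernoff bound).
    rng = np.random.default_rng(1)
    n, trials = 1_000, 20_000
    lam = 2.0 * np.sqrt(n)                     # a "2 standard deviations" level

    hits = 0
    for _ in range(trials):
        S = np.cumsum(rng.choice([-1, 1], size=n))
        hits += (S.max() >= lam)
    print("empirical P(max S_k >= lam):", hits / trials)
    print("Chernoff bound exp(-lam^2/(2n)):", np.exp(-lam**2 / (2 * n)))

The empirical probability should come out well below the bound (the bound is not tight, but it is the summability it provides that drives the argument above).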

3.11.2. The lower bound:

∀ε > 0,  P(lim sup_{n→∞} S_n/√(n log(log n)) > √2 − ε) = 1

Proof. The idea is based on the fact that the sequence S_n/√(n log(log n)) behaves a little bit like a sequence of independent random variables. (Compare this to the reason that S_n/√n cannot converge almost surely.) In this half we use the fact that the increments S_{n_k} − S_{n_{k−1}} form an independent family of random variables. We can then use Borel-Cantelli for independent r.v.s and the Mills ratio to show that S_{n_k} − S_{n_{k−1}} > (√2 − ε)√(n_k log(log n_k)) infinitely often. Then by the first half of the law of the iterated log, S_{n_{k−1}} is not too big... so these large differences force S_n/√(n log log n) > √2 − ε infinitely often. ∎

Moment Methods

These are based on some ideas from the notes by Terence Tao [4].

4.12. Basics of the Moment Method

4.12.1. Introduction. Suppose you wanted to prove that X_n ⇒ X for some random variables X_n. One method, called the moment method, is to first prove that E(X_n^k) → E(X^k) for every k and then to do some analytical work to show that this is sufficient in this case for X_n ⇒ X. The reason this is good is that E(X_n^k) might be easy to work with. In many situations X_n^k has some combinatorial way of being looked at that makes the limit easy to see. For example, if X_n = Σ_{i=1}^n Y_i is a sum, then X_n^k gives a binomial-type formula, and so on. The same thing works for random matrices and the empirical spectral distribution (i.e. X_n = a randomly chosen eigenvalue of a matrix M_n): then E(X_n^k) ≃ E(Tr M_n^k), which again can be written out as sums of the form Σ M_{i_1 i_2} M_{i_2 i_3} ... and attacked combinatorially.

Some analytical justification is always needed, because E(X_n^k) → E(X^k) does not always imply X_n ⇒ X. Every case must be handled individually. For this reason, every proof here is divided into two sections: one where combinatorics is used to justify E(X_n^k) → E(X^k) and one where analytical methods are used to argue why this is enough to show X_n ⇒ X. (The latter half sometimes includes tricks like reducing to bounded random variables via truncation.)

4.12.2. Criteria for which moments converging implies weak convergence. There are some broadly applicable criteria under which E(X_n^k) → E(X^k) ⟹ X_n ⇒ X; these are covered here and will be useful throughout.

Theorem. Suppose that E(X_n^k) → µ_k and there is a unique distribution X for which µ_k = E(X^k). Then X_n ⇒ X.

Proof. [sketch] First, we see that the X_n must be tight (this uses the fact that lim sup E(X_n²) < ∞, since it converges to µ_2; a Chebyshev inequality then shows tightness). By Helly's theorem, every subsequence has a further subsequence X_{n_{k_l}} that converges weakly, and the tightness guarantees that the limit is a legitimate probability distribution. The resulting distribution must have moments µ_k. Since X is the unique distribution with moments µ_k, it must be that X_{n_{k_l}} ⇒ X. Since every sequence has a sub-sub-sequence converging to X, we conclude X_n ⇒ X (this is the sub-sequence/sub-sub-sequence characterization of convergence in a metric space). ∎

Unfortunately, given a random variable X whose moments are µk, it is not always true that X is the unique distribution with moments µk.


Example. (An example of a collection of different probability distributions which all have the same moments.) Consider the lognormal density:

f_0(x) = (2π)^{-1/2} x^{-1} exp(−(log x)²/2),  x ≥ 0

(This is what you get if you look at the density of exp(Z) for Z ∼ N(0,1) a standard normal.) For −1 ≤ a ≤ 1 let:

f_a(x) = f_0(x) (1 + a sin(2π log x))

To see that this has the same moments as f_0, check that:

∫_0^∞ x^r f_0(x) sin(2π log x) dx = 0

Indeed after a change of variable, we see that the integral is like ∫ exp(−s²/2) sin(2πs) ds, which is 0 since the integrand is odd.

Remark. One can check that the moments of the lognormal density are µ_k = E(exp(kZ)) = exp(k²/2) (the usual completing-the-square trick, or you can do it with derivatives). Notice that these get very large very fast! Also notice that the density decays like exp(−(log x)²/2) as x → ∞, which is rather slow. The following criteria show that as long as µ_k is not growing too fast, or if the density goes to zero fast enough as x → ∞, then there IS at most one distribution X with moments µ_k. Here is a super baby version of the type of result that is useful:

Theorem. If X_n and X are all bounded in some range [−M, M] then E(X_n^k) → E(X^k) for all k iff X_n ⇒ X.

Proof. (⇐) is clear because the function f(x) = x^k is bounded and continuous when we restrict to the range [−M, M].
(⟹) We will verify that for any bounded continuous function f we have E(f(X_n)) → E(f(X)). By the Stone-Weierstrass theorem, we can approximate any continuous f uniformly on [−M, M] by polynomials. I.e. ∀ε > 0 find a polynomial p so that ‖p − f‖ ≤ ε where ‖·‖ is the sup norm on the interval [−M, M]. We have that E(p(X_n)) → E(p(X)) since p is a finite linear combination of powers x^k and we know that E(X_n^k) → E(X^k) for each k. Hence:

|E (f(Xn)) − E (f(X))| ≤ |E (f(Xn) − p(Xn))| + |E (p(Xn)) − E(p(X))| + |E (f(X) − p(X))|

≤ ε + |E(p(X_n)) − E(p(X))| + ε → 2ε as n → ∞, and ε was arbitrary. ∎

Remark. The same idea works to show that if X, Y are two random variables on [−M, M] whose moments agree, then E(f(X)) = E(f(Y)) for all bounded continuous f, or in other words X =_d Y. The more advanced proofs use Fourier analysis. The connection is that the moments of X correspond to the derivatives of the characteristic function, and the upshot is that if two distributions have the same characteristic function, then they are the same random variable.

The first theorem says that if the moments don't grow too fast then there is at most one distribution function.

Theorem. (3.3.11 from Durrett) If lim sup_{k→∞} µ_{2k}^{1/2k}/(2k) = r < ∞ then there is at most one distribution function with moments µ_k.

Proof. Let F be any d.f. with moments µ_k. By Cauchy-Schwarz, the absolute value |X| has moments ν_k = ∫ |x|^k dF(x) with ν_{2k} = µ_{2k} and ν_{2k+1} ≤ √(µ_{2k} µ_{2k+2}), and so:

lim sup_{k→∞} ν_k^{1/k}/k = r < ∞

We next use the modified Taylor estimate for characteristic functions to conclude that the error in the Taylor expansion of φ at any point θ is:

|φ(θ+t) − φ(θ) − tφ'(θ) − ... − (t^{n−1}/(n−1)!) φ^{(n−1)}(θ)| ≤ |t|^n ν_n / n!

Since ν_k ≤ (r+ε)^k k^k and using the bound e^k ≥ k^k/k!, we see that the above estimate implies that φ converges uniformly to its Taylor series about any point θ in a neighborhood of fixed radius |t| ≤ 1/(er). If G is any other distribution with moments µ_k, we know that G and F have the same Taylor series! But then G and F agree in a neighborhood of 0 by the above characterization. By induction, we can repeatedly make the radius of the neighborhood bigger and bigger to see that the characteristic functions of F and G agree everywhere. But in this case F must be equal to G by the inversion formula. ∎

The next theorem says that if the probability density goes to zero fast enough as x → ∞, then again we have that moments converging implies weak convergence.

Theorem. (2.2.9 from Tao) (Carleman continuity theorem) Let X_n be a sequence of uniformly subgaussian real random variables (i.e. ∃ constants C, c so that P(|X_n| > λ) ≤ C exp(−cλ²) for all X_n and for all λ), and let X be another subgaussian random variable. Then:

E(X_n^k) → E(X^k) for all k  ⟺  X_n ⇒ X

Remark. The subgaussian hypothesis is key for both directions here.

Proof. (⇐) Fix a k. Take a smooth function ϕ that is 1 inside [−1, 1] and vanishes outside of [−2, 2]; notice that ϕ(·/N) is then 1 inside [−N, N] and vanishes outside of [−2N, 2N]. Hence, for any N, the function x ↦ x^k ϕ(x/N) is bounded and continuous. Hence by weak convergence, we know that E(X_n^k ϕ(X_n/N)) → E(X^k ϕ(X/N)) for any choice of N. On the other hand, from the uniformly subgaussian hypothesis, E(X_n^k (1 − ϕ(X_n/N))) and E(X^k (1 − ϕ(X/N))) can be made arbitrarily small by taking N large enough, as |E(X_n^k (1 − ϕ(X_n/N)))| ≤ E(|X_n|^k 1_{|X_n|>N}), which is at most a constant times ∫_N^∞ λ^{k−1} C exp(−cλ²) dλ → 0 uniformly in n. For any ε > 0, choose N so large that this is < ε, and we will then have:

|E(X_n^k) − E(X^k)| ≤ |E(X_n^k ϕ(X_n/N)) − E(X^k ϕ(X/N))| + |E(X_n^k (1 − ϕ(X_n/N)))| + |E(X^k (1 − ϕ(X/N)))|
                   ≤ |E(X_n^k ϕ(X_n/N)) − E(X^k ϕ(X/N))| + 2ε → 2ε

(⟹) From the uniformly subgaussian hypothesis, we know that E(|X_n|^k) ≤ (Ck)^{k/2}, so by Taylor's theorem with remainder we know that:

F_{X_n}(t) = Σ_{j=0}^k ((it)^j/j!) E(X_n^j) + O((Ck)^{k/2} |t|^{k+1}/(k+1)!)

where the error term is uniform in t and in n. The same thing holds for F_X, so we have:

lim sup_{n→∞} |F_{X_n}(t) − F_X(t)| = O((Ck)^{k/2} |t|^{k+1}/(k+1)!) = O((C'|t|/√k)^k)

So now for any fixed t, taking k → ∞ shows that F_{X_n}(t) → F_X(t). By the Levy continuity theorem (the one that says pointwise convergence of the characteristic functions to a target function that is continuous at 0 implies weak convergence), we conclude X_n ⇒ X. ∎

Here is the Levy continuity theorem that was used above, in case you need a refresher:

Theorem. (Levy Continuity Theorem) Say X_n is a sequence of RVs and F_{X_n} converges pointwise to something, call it F. The following are equivalent:
i) F is continuous at 0
ii) X_n is tight
iii) F is the characteristic function of some R.V.
iv) X_n converges in distribution to some R.V. X.

4.13. Poisson RVs

Put the characterization of Poisson r.v.s and the trick for looking at sums of indicator r.v.s from Remco's random graph notes here.

Example. One nice way to do this: use the formula that if Y = Σ_i I_i is a sum of indicators then E[(Y)_r] = Σ_{i_1,...,i_r} P(I_{i_1} = ... = I_{i_r} = 1), where the sum is over distinct indices and (Y)_r = Y(Y−1)⋯(Y−r+1) denotes the falling factorial.

Remark. This follows from the general properties of occupation numbers, namely binom(Occ¹_A, k) = Occ^k_{A^k}. Any disjoint sum of indicator functions can be thought of as a Bernoulli point process with Σ_i I_i = Occ¹_A.
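As an illustration (not from the notes; the iid Bernoulli(p) indicators and the parameters are my assumptions), one can check numerically that E[(Y)_r] matches the sum over distinct index tuples, which in this iid case equals n(n−1)⋯(n−r+1) p^r:

    import numpy as np
    from math import perm

    # Check E[(Y)_r] = sum over distinct index tuples of P(I_{i1}=...=I_{ir}=1)
    # for Y a sum of n iid Bernoulli(p) indicators; the RHS is n(n-1)...(n-r+1) p^r.
    # Here (Y)_r = Y(Y-1)...(Y-r+1) is the falling factorial.
    rng = np.random.default_rng(2)
    n, p, trials = 30, 0.2, 500_000
    Y = rng.binomial(n, p, size=trials).astype(float)   # same law as a sum of indicators

    for r in range(1, 4):
        falling = np.ones(trials)
        for j in range(r):
            falling *= (Y - j)
        print(r, round(falling.mean(), 2), round(perm(n, r) * p**r, 2))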

4.14. Central Limit Theorem

The standard way to prove the CLT is by Fourier analysis / characteristic functions. Here we are going to explore how the moment method can be used to prove the CLT using combinatorial methods and other tricks, like truncation and concentration bounds, only. The proof is maybe not as easy as the Fourier methods, but it is a good example of the power of these other methods.

4.14.1. Convergence of Moments for Bounded Random Variables.

Proposition. For a N(0,1) Gaussian random variable Z we have:

E(Z^k) = 0 if k is odd,   E(Z^k) = (k − 1)!! if k is even

where (k − 1)!! = (k − 1)(k − 3)·...·3·1.

Proof. The odd moments are 0 since the distribution is symmetric. The even moments can be found by calculating and then differentiating the moment generating function of a Gaussian, or one can use integration by parts on x^{2k} exp(−x²/2) to get a recurrence relation E(Z^{2k}) = (2k − 1) E(Z^{2k−2}) for the even moments. ∎
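A quick numerical check of this proposition (a sketch, not part of the notes; the sample size is an assumption):

    import numpy as np

    # Monte Carlo check of E[Z^k] for Z ~ N(0,1): odd moments ~ 0,
    # even moments ~ (k-1)!! = (k-1)(k-3)...3*1.
    rng = np.random.default_rng(3)
    Z = rng.standard_normal(2_000_000)

    def double_factorial(m):
        out = 1
        while m > 1:
            out *= m
            m -= 2
        return out

    for k in range(1, 9):
        exact = 0 if k % 2 == 1 else double_factorial(k - 1)
        print(k, round(float(np.mean(Z**k)), 3), exact)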

Remark. We will now strive to show that for iid X_1,...,X_n, the moments of Z_n = (X_1 + ... + X_n)/√n converge to those of a Gaussian. Let's look at the first few moments. Have by linearity of expectation:

E(Z_n^k) = n^{-k/2} Σ_{1≤i_1,...,i_k≤n} E(X_{i_1} ⋯ X_{i_k})

We can analyze this easily for small values of k:
o) for k = 0 this expression is trivially 1
i) for k = 1 the expression is trivially 0 since E(X) = 0
ii) when k = 2 we can split into diagonal and off-diagonal components:

n^{-1} Σ_{1≤i≤n} E(X_i²) + 2 n^{-1} Σ_{1≤i<j≤n} E(X_i X_j) = 1 + 0 = 1

iii) when k = 3, every term either contains a lone factor X_i (which kills it by independence and E(X) = 0) or is a diagonal term, and the diagonal terms contribute n^{-3/2} Σ_i E(X_i³) = n^{-1/2} E(X³) → 0
iv) when k = 4, the only terms that survive as n → ∞ are those of the form E(X_i²)E(X_j²) with i ≠ j; there are 3n(n−1) of them, so E(Z_n⁴) → 3 = (4 − 1)!!

We will now tackle the general case. With analogy to case iv) above, we will see that the only terms that can survive as n → ∞ are the terms where all of the powers are 2 (this explains why odd moments must vanish... they can't have such a term!). To explain this nicely, it is easiest to use some notation from graphs. This gives us the terminology we need to refer to the different objects and see what's relevant.

Theorem. Suppose X_1,...,X_n are independent and uniformly bounded by some constant M. Suppose also that for every i, E(X_i) = 0 and Var(X_i) = E(X_i²) = 1. Then the random variable Z_n = (X_1 + ... + X_n)/√n has:

E(Z_n^k) → 0 if k is odd
E(Z_n^{2w}) → (2w − 1)!!

Proof. Let K_n be the complete graph on n vertices (including a self loop for each vertex) and let P_k be the paths of length k on this graph. Notice that there is a bijection between elements of P_k and monomials in the expansion of (X_1 + ... + X_n)^k, namely π = (π_1, ..., π_k) ↔ X_{π_1} X_{π_2} ⋯ X_{π_k}. For this reason, we can write out:

E(Z_n^k) = n^{-k/2} Σ_{π∈P_k} E(X_{π_1} X_{π_2} ⋯ X_{π_k})

To talk about the terms appearing here, we introduce some notation. For 1 ≤ j ≤ n let #_j(π) be the number of times the vertex j appears on the path π. Define supp(π), the support of π, to be the set of vertices that π passes through. Let the weight of π, denoted wt(π), be the number of distinct vertices the path π visits, so that wt(π) = |supp(π)|. The following claims show that in the above expression for E(Z_n^k), only paths whose weight is exactly k/2 have a non-zero contribution as n → ∞.

Claim 1: If w < k/2 then:

n^{-k/2} Σ_{π∈P_k : wt(π)=w} E(X_{π_1} X_{π_2} ⋯ X_{π_k}) → 0  as n → ∞

Rmk: This is because there are not enough paths with wt(π) = w.
Pf: Each term has |E(X_{π_1} X_{π_2} ⋯ X_{π_k})| ≤ M^k by the boundedness assumption. Hence it suffices to show that n^{-k/2} |{π ∈ P_k : wt(π) = w}| → 0. Indeed, to construct such a path π, there are binom(n, w) ways to pick which vertices will be in the support of π and then there are at most w choices for each step of the path. This bound gives:

n^{-k/2} |{π ∈ P_k : wt(π) = w}| ≤ n^{-k/2} binom(n, w) w^k ≤ n^{-k/2} n^w C_k → 0  since w < k/2 by hypothesis

Claim 2: If w > k/2 then:

n^{-k/2} Σ_{π∈P_k : wt(π)=w} E(X_{π_1} X_{π_2} ⋯ X_{π_k}) = 0

Rmk: This is because each such path has a vertex which appears exactly once, and so by independence, E(X_singleton) = 0 kills the whole monomial.

Pf: By the pigeonhole principle, if w > k/2 then for any path π ∈ {π ∈ P_k : wt(π) = w} there must be at least one index j_0 ∈ supp(π) so that #_{j_0}(π) = 1. (Otherwise all the vertices in supp(π) have #_j(π) ≥ 2, and then Σ_j #_j(π) = length(π) = k would be violated, since Σ_j #_j(π) ≥ 2|supp(π)| = 2w > k.) But then, for such π:

E(X_{π_1} X_{π_2} ⋯ X_{π_k}) = E(Π_{j∈supp(π)} X_j^{#_j(π)})
  = Π_{j∈supp(π)} E(X_j^{#_j(π)})   by independence
  = E(X_{j_0}^{#_{j_0}(π)}) Π_{j∈supp(π), j≠j_0} E(X_j^{#_j(π)})
  = E(X_{j_0}^1) Π_{j∈supp(π), j≠j_0} E(X_j^{#_j(π)})
  = 0   since E(X) = 0

Since all the terms vanish, their sum does too. These two claims together already show that E(Z_n^k) → 0 for k odd, since for k odd every path π has either wt(π) > k/2 or wt(π) < k/2. All that remains in the proof is the following claim:

Claim 3: When k is even and w = k/2 we have:

n^{-k/2} Σ_{π∈P_k : wt(π)=w} E(X_{π_1} X_{π_2} ⋯ X_{π_k}) → (2w − 1)!!

Pf: For any path π ∈ P_k with wt(π) = w = k/2, either #_j(π) = 2 for all j ∈ supp(π) or there is at least one vertex j_0 with #_{j_0}(π) = 1. (If there are no vertices with #_{j_0}(π) = 1, all the vertices in supp(π) have #_j(π) ≥ 2, and then Σ_j #_j(π) = length(π) = k lets us see that k = Σ_j #_j(π) ≥ 2|supp(π)| = 2w = k, so the middle inequality is actually an equality, which only happens if #_j(π) = 2 for all j.)

In the case that there is a vertex with #_{j_0}(π) = 1, the same argument as in the previous claim shows that E(X_{π_1} X_{π_2} ⋯ X_{π_k}) is 0 (essentially: the E(X_{j_0}^1) factors out of the term and this is zero).

In the case that #_j(π) = 2 for all j ∈ supp(π) we proceed as follows. (These are the only terms that contribute!) We have:

E(X_{π_1} X_{π_2} ⋯ X_{π_k}) = E(Π_{j∈supp(π)} X_j^{#_j(π)})
  = Π_{j∈supp(π)} E(X_j^{#_j(π)})
  = Π_{j∈supp(π)} E(X_j²)
  = Π_{j∈supp(π)} 1 = 1

Hence, by the above arguments, we have:

n^{-k/2} Σ_{π∈P_k : wt(π)=w} E(X_{π_1} ⋯ X_{π_k}) = n^{-k/2} Σ_{π∈P_k : wt(π)=w, #_j(π)=2 ∀j∈supp(π)} E(X_{π_1} ⋯ X_{π_k})
  = n^{-k/2} Σ_{π∈P_k : wt(π)=w, #_j(π)=2 ∀j∈supp(π)} 1
  = n^{-k/2} |{π ∈ P_k : wt(π) = w, #_j(π) = 2 ∀j ∈ supp(π)}|

It remains only to enumerate this set. To create such a path, we first choose w indices from the n possible ones to be on the list; there are binom(n, w) ways to do this. Then, from the list of the chosen w indices, we must arrange them into a path of length k so that each appears twice. There are binom(k; 2, 2, ..., 2) (w twos) ways to do this, where we use the notation that for n = Σ_{i=1}^m a_i we write binom(n; a_1,...,a_m) := n!/(a_1! a_2! ⋯ a_m!). This is the number of ways you can divide a group of n people into m rooms, where the room labeled i must have a_i people in it. (For us we are dividing the k terms π_1,...,π_k into w groups of 2.) Hence:

n^{-k/2} |{π ∈ P_k : wt(π) = w, #_j(π) = 2 ∀j ∈ supp(π)}| = n^{-k/2} binom(n, w) binom(k; 2,...,2)
  = n^{-w} (n(n−1)⋯(n−w+1)/w!) (2w)!/2^w   using 2w = k
  = (2w)!/(w! 2^w) + O(n^{-1})
  → (2w − 1)!!

As desired! ∎

Remark. [The Lindeberg swapping trick] The proof above shows that only the terms coming from paths π with wt(π) = w and with #_j(π) = 2 for all j ∈ supp(π) contribute. We counted these explicitly to get the result. The Lindeberg swapping trick is a different way to show the convergence we need: we show that E(Z_n^k) → E(Z^k) for a N(0,1) R.V. Z without explicitly calculating E(Z^k) or E(Z_n^k). To do this let Y_1,...,Y_n be iid N(0,1) Gaussians; since the sum of Gaussians is Gaussian, W_n := (Y_1 + ... + Y_n)/√n is a N(0,1) Gaussian too. Hence E(Z^k) = E(W_n^k). By the arguments in Claims 1, 2 and 3 from the proof, the only terms that don't vanish are the terms from paths π with wt(π) = w and #_j(π) = 2 for all j ∈ supp(π), and since E(X²) = 1 we get that E(Z_n^k) = n^{-k/2} |{π ∈ P_k : wt(π) = w, #_j(π) = 2 ∀j ∈ supp(π)}| + o(1). However, the same logic applies to E(W_n^k), so E(W_n^k) = n^{-k/2} |{π ∈ P_k : wt(π) = w, #_j(π) = 2 ∀j ∈ supp(π)}| + o(1) too. So, without having to explicitly enumerate this set, we have E(Z_n^k) − E(W_n^k) → 0. Since E(W_n^k) = E(Z^k) this is the desired result.

4.14.2. Analytic step. We start by using the moment results above to prove a central limit theorem for bounded random variables. We will then use truncation to reduce the unbounded case to the bounded case.

Theorem. (CLT for bounded random variables)

Suppose X_1,...,X_n are iid, uniformly bounded by some constant M, with E(X) = 0 and Var(X) = 1. Then:

(X_1 + ... + X_n)/√n ⇒ N(0, 1)

Proof. Since the X's are bounded, by the Chernoff bound the sums Z_n = (X_1 + ... + X_n)/√n are uniformly subgaussian. By the Carleman continuity theorem, to show weak convergence it suffices to check that E(Z_n^k) → E(G^k), which is the content of the previous section. Alternatively, since we know the moments of N(0,1), we can check directly with the Stirling approximation that µ_{2k}^{1/2k} ≈ √k, so sup_k µ_{2k}^{1/2k}/k < ∞ and the other criterion also applies. ∎

We use truncation to improve this to a central limit theorem for arbitrary random variables.

Theorem. (Bounded CLT ⟹ CLT) Suppose that for any sequence of iid bounded random variables X_1, X_2, ... with E(X) = µ and Var(X) = σ² we have a CLT:

((X_1 + ... + X_n) − nµ)/√(nσ²) ⇒ N(0, 1)

Then we have a central limit theorem for any sequence of iid random variables with finite variance, i.e. we can drop the condition of being bounded.

Proof. As usual write X = X_{n,>N} + X_{n,≤N} with S_{n,≤N} = Σ_{k=1}^n X_{k,≤N} and S_{n,>N} = Σ_{k=1}^n X_{k,>N}. Similarly define µ_{≤N}, µ_{>N}, σ_{≤N}, σ_{>N} to be the means and standard deviations of the truncated r.v.s. By the bounded CLT, we know that for any fixed N:

Z_{n,≤N} := (S_{n,≤N} − nµ_{≤N})/(√n σ_{≤N}) ⇒ N(0, 1) as n → ∞ with N fixed

We now claim that there is a sequence N_n → ∞ so that Z_{n,≤N_n} ⇒ N(0, 1) as n → ∞ too. This is a diagonalization argument. (This is just a fact about metric spaces: if x_{n,N} → x_0 as n → ∞ for every N, then there is a sequence N_n → ∞ so that x_{n,N_n} → x_0. To do this, start with N_1 = 1. Take K_1 so large that d(x_{n,2}, x_0) < 1/1 for all n > K_1; let N_1 = ... = N_{K_1} = 1 and N_{K_1+1} = 2. Now repeat: take K_2 ≥ K_1 + 1 so large that d(x_{n,3}, x_0) < 1/2 for all n > K_2; let N_{K_1+1} = ... = N_{K_2} = 2 and N_{K_2+1} = 3. Continuing in this way we get switching points K_k → ∞ with N_i = k for K_k < i ≤ K_{k+1}, and for such i we have d(x_{i,N_i}, x_0) < 1/k, which shows x_{n,N_n} → x_0. Recall that weak convergence is metrizable by the Levy metric.)

Now, by the dominated convergence theorem, σ_{≤N_n} → σ since N_n → ∞. We then have:

(S_{n,≤N_n} − nµ_{≤N_n})/(√n σ) = [(S_{n,≤N_n} − nµ_{≤N_n})/(√n σ_{≤N_n})] · (σ_{≤N_n}/σ) ⇒ N(0, 1)

(You can check that c_n → c and X_n ⇒ X imply c_n X_n ⇒ cX by the Skorokhod representation theorem, or by the converging together lemma.)

Meanwhile, from the dominated convergence theorem again, we see that σ_{>N_n} → 0 as n → ∞. Hence the other piece (S_{n,>N_n} − nµ_{>N_n})/(√n σ) → 0 in probability by a Chebyshev estimate:

P(|S_{n,>N_n} − nµ_{>N_n}|/(√n σ) > λ) ≤ Var(S_{n,>N_n})/(λ√n σ)² = (1/λ²)(σ²_{>N_n}/σ²) → 0

So finally, using the converging together lemma (this says that X_n ⇒ X and Y_n ⇒ Y give X_n + Y_n ⇒ X + Y as long as one of X or Y is a constant), we have that:

(S_n − nµ)/(√n σ) = (S_{n,≤N_n} − nµ_{≤N_n})/(√n σ) + (S_{n,>N_n} − nµ_{>N_n})/(√n σ) ⇒ N(0, 1) + 0

 Exercise. (2.2.15.) (Lindeberg CLT) Assuming the bounded central limit theorem, prove the following:

Let X_1, X_2, ... be independent (but not necessarily iid) with E(X_i) = 0 and Var(X_i) = 1. Suppose we have the Lindeberg condition:

lim_{N→∞} lim sup_{n→∞} (1/n) Σ_{j=1}^n E(|X_{j,>N}|²) = 0

Show that (X_1 + ... + X_n)/√n ⇒ N(0, 1).

Proof. As before, truncate and find a sequence N_n → ∞ so that (S_{n,≤N_n} − nµ_{≤N_n})/√n ⇒ N(0, 1), and from here we have only left to prove that (S_{n,>N_n} − nµ_{>N_n})/√n → 0. By our Chebyshev inequality, we have:

P(|S_{n,>N_n} − nµ_{>N_n}|/√n > λ) ≤ Var(S_{n,>N_n})/(λ√n)² ≤ (1/(λ²n)) Σ_{j=1}^n E(|X_{j,>N_n}|²)
  ≤ (1/λ²) lim sup_{k→∞} (1/k) Σ_{j=1}^k E(|X_{j,>N_n}|²)

By the Lindeberg hypothesis, this → 0 as n → ∞ since Nn → ∞ as n → ∞. By the converging together lemma again, we can sum the two pieces to see that we have a CLT.  4.14.3. More Lindeberg swapping trick.

Theorem. (2.2.11 from Tao) (Berry-Esseen theorem, weak form) Let X have E(X) = 0, Var(X) = 1 and finite third moment, and let ϕ be a smooth function with uniformly bounded derivatives up to third order. Let Z_n = (X_1 + ... + X_n)/√n where X_1,...,X_n are iid copies of X. Then we have:

|E(ϕ(Z_n)) − E(ϕ(G))| ≤ (1/n^{1/2}) (1/3!) (E(|G|³) + E(|X|³)) sup_{x∈R} |ϕ'''(x)|

where G is a standard N(0,1) Gaussian.

Proof. We use the Lindeberg swapping trick. Let Y_1,...,Y_n be iid N(0,1) Gaussians and let W_n = (Y_1 + ... + Y_n)/√n so that W_n =_d G. The main refinement of the Lindeberg swapping trick here is that we swap the variables one at a time. That is, define for 0 ≤ i ≤ n:

Z_{n,i} := (X_1 + ... + X_i + Y_{i+1} + ... + Y_n)/√n

This partially swapped random variable is some kind of mix between Z_n and W_n, with Z_{n,n} = Z_n and Z_{n,0} = W_n. We have a telescoping sum now:

ϕ(Z_n) − ϕ(W_n) = −Σ_{i=0}^{n−1} (ϕ(Z_{n,i}) − ϕ(Z_{n,i+1}))

So to estimate our error it suffices to estimate the error in ϕ(Z_{n,i}) − ϕ(Z_{n,i+1}). This is doable since these two terms are very similar. If we define S_{n,i} by:

S_{n,i} = (X_1 + ... + X_i + Y_{i+2} + ... + Y_n)/√n

(notice the term with index i+1 is omitted) then we have that:

Z_{n,i} = S_{n,i} + Y_{i+1}/√n,   Z_{n,i+1} = S_{n,i} + X_{i+1}/√n

Hence we can Taylor expand both terms about S_{n,i}:

ϕ(Z_{n,i}) = ϕ(S_{n,i}) + ϕ'(S_{n,i}) Y_{i+1}/√n + (1/2) ϕ''(S_{n,i}) Y_{i+1}²/n + E_1,   |E_1| ≤ (1/3!)(|Y_{i+1}|³/n^{3/2}) sup_{x∈R} |ϕ'''(x)|

And similarly:

ϕ(Z_{n,i+1}) = ϕ(S_{n,i}) + ϕ'(S_{n,i}) X_{i+1}/√n + (1/2) ϕ''(S_{n,i}) X_{i+1}²/n + E_2,   |E_2| ≤ (1/3!)(|X_{i+1}|³/n^{3/2}) sup_{x∈R} |ϕ'''(x)|

Now, since X_{i+1} and Y_{i+1} have the same first and second moments, and both are independent of S_{n,i}, we may write:

E(ϕ(Z_{n,i}) − ϕ(Z_{n,i+1})) = E(ϕ(S_{n,i}) − ϕ(S_{n,i})) + E(ϕ'(S_{n,i})) E((Y_{i+1} − X_{i+1})/√n) + (1/2) E(ϕ''(S_{n,i})) E((Y_{i+1}² − X_{i+1}²)/n) + E(E_1 − E_2)

The first three terms on the RHS vanish! The last error term is bounded by:

|E(E_1 − E_2)| ≤ E(|E_1|) + E(|E_2|) ≤ (1/3!)(E(|Y_{i+1}|³)/n^{3/2}) sup_{x∈R}|ϕ'''(x)| + (1/3!)(E(|X_{i+1}|³)/n^{3/2}) sup_{x∈R}|ϕ'''(x)|
  = (1/(3! n^{3/2})) (E(|G|³) + E(|X|³)) sup_{x∈R}|ϕ'''(x)|

So finally, summing we have:

|E(ϕ(Z_n)) − E(ϕ(W_n))| = |Σ_{i=0}^{n−1} E(ϕ(Z_{n,i}) − ϕ(Z_{n,i+1}))| ≤ n · (1/(3! n^{3/2})) (E(|G|³) + E(|X|³)) sup_{x∈R}|ϕ'''(x)| = (1/(3! n^{1/2})) (E(|G|³) + E(|X|³)) sup_{x∈R}|ϕ'''(x)|

as desired. ∎

Remark. By the method of the proof, we can see that if X and the Gaussian agreed up to third moments, the error term would instead involve ϕ^{(4)}, and so on.

Martingales

These are notes based on the second half of the book by David Williams [5] and also some parts of the book by Jeffrey Rosenthal [3].

5.15. Martingales

Definition. A filtration on a probability triple (Ω, F, P) is an increasing family of σ-algebras F_n:

F_0 ⊂ F_1 ⊂ F_2 ⊂ ... ⊂ F

We define F_∞ := σ(∪_n F_n).

Remark. Filtrations are a way to keep track of information. F_n is the amount of information one has at time n, and we are gaining information as time goes on. To be precise, the information in F_n is the value Z(ω) for every F_n-measurable function Z. Very commonly, we are in the setup that there is a stochastic process W and F_n is given by: F_n = σ(W_0, W_1, ..., W_n)

So that the information we have in Fn is the value of W0(ω),W1(ω) ...Wn(ω)

Definition. A process X = (X_n : n ≥ 0) is called adapted (to the filtration F_n) if for each n, X_n is F_n-measurable.

Remark. An adapted process X is one where X_n(ω) is known at time n. In the setup where F_n = σ(W_0, ..., W_n) this means that X_n = f_n(W_0, W_1, ..., W_n) for some Borel measurable function f_n : R^{n+1} → R.

Definition. A process X is called a martingale (relative to Fn, P) if: i) X is adapted. ii) E [|Xn|] < ∞∀n iii) E [Xn |Fn−1 ] = Xn−1 a.s. for n ≥ 1 A supermartingale is one where we have:

E [Xn |Fn−1 ] ≤ Xn−1 a.s. for n ≥ 1

And a submartingale is one where:

E [Xn |Fn−1 ] ≥ Xn−1 a.s. for n ≥ 1 Remark. A supermartingale decreases on average, and a submartingale in- crease on average. These correspond to superharmonic/subharmonic functions. We can prove many theorems about martingales by rst proving the corresponding statement about supermartingales, and then use the fact that for martingales X,

both X and −X are supermartingales. By replacing X_n by X_n − X_0 we can suppose WLOG that X_0 ≡ 0.

Proposition. For m < n, a supermartingale X has:

E[X_n | F_m] = E[E[X_n | F_{n−1}] | F_m]

≤ E [Xn−1 |Fm ] ≤ ...

≤ E [Xm |Fm ] = Xm

Example. Let X_1, ... be a sequence of independent RVs with E[|X_k|] < ∞ and E(X_k) = 0 ∀k. Define S_0 := 0, S_n := Σ_{i=1}^n X_i and F_n := σ(X_1, ..., X_n). For n ≥ 1 we have:

E [Sn |Fn−1 ] = E [Sn−1 |Fn−1 ] + E [Xn |Fn−1 ]

= Sn−1 + E [Xn]

= S_{n−1}

The Three Series Theorem tells you when such a sequence converges! There is a nice proof using martingales that we will get to.

Example. Let X_1, ... be a sequence of independent non-negative random variables with E(X_k) = 1. Define M_0 := 1 and M_n := Π_{i=1}^n X_i, with again F_n = σ(X_1, X_2, ..., X_n). Then for n ≥ 1 we have a.s.:

E [Mn |Fn−1 ] = E [Mn−1Xn |Fn−1 ]

= Mn−1E [Xn |Fn−1 ]

= M_{n−1} E[X_n] = M_{n−1}

So M is a martingale.

Remark. Because M here is a non-negative martingale, it will be a consequence of the martingale convergence theorem that M_∞ = lim_{n→∞} M_n exists.

Example. Take any filtration F_n and let ξ ∈ L¹(Ω, F, P). Define M_n := E[ξ | F_n]. By the tower property:

E [Mn |Fn−1 ] = E [E [ξ |Fn ] |Fn−1 ]

= E [ξ |Fn−1 ]

= Mn−1 Hence M is a martingale.

Remark. In this case, the Levy Upward theorem will tell us that M_n → M_∞ := E[ξ | F_∞] a.s.

Remark. One sometimes calls a martingale a fair game by imagining that we are gambling on the process X_n, so that X_n − X_{n−1} is our net winnings per unit stake in game n (i.e. if this is 0 we break even, if this is +10 we gain $10, if this is -1000 we owe $1000 to the casino). The martingale condition says that our expected return is 0 (i.e. it's a fair game). A supermartingale is a typical casino game where we are always losing money on average.

Definition. We call a process C = (C_1, C_2, C_3, ...) previsible if C_n is F_{n−1}-measurable for n ≥ 1. In other words, the value of C_n can be determined with only the information from F_{n−1}.

Remark. We can think of a previsible process as a gambling strategy for an adapted process X_n by defining a new process:

Y_n = Σ_{1≤k≤n} C_k (X_k − X_{k−1}) =: (C • X)_n

We think of Y_n as our net wealth at time n, where we make a stake of C_n on the n-th betting event X_n − X_{n−1}. C_n must be previsible for this to make sense, because when we decide how much to bet, we can base this decision only on F_{n−1}, which is all the information known at that time. Note that (C • X)_0 = 0. (This looks a lot like a stochastic integral!) C • X is sometimes called the martingale transform of X by C. Notice that if C_n = 1 for every n, then C • X = X − X_0.

Theorem. (You can't beat the system theorem)
i) Let C be a bounded non-negative previsible process, i.e. ∃K so that 0 ≤ C_n(ω) ≤ K for all n and a.e. ω. Let X be a supermartingale [respectively martingale]. Then C • X is a supermartingale [martingale].
ii) If C is bounded (but possibly not non-negative), |C_n(ω)| ≤ K for all n and a.e. ω, and X is a martingale, then C • X is a martingale.
iii) In i) and ii), the boundedness condition on C may be replaced by the requirement that C_n ∈ L², provided we also insist that X_n ∈ L².

Remark. i) says that if you are playing a casino-favourable game, and you have to bet between 0 and K at each round, what remains is still a casino-favourable game. ii) says that if you are playing a fair game, and you are even allowed to bet EITHER way (i.e. can bet a negative stake), then the game is still fair. iii) relaxes the boundedness a bit.

Proof. i) Let Y = C • X. Since Cn is bounded and non-negative, we have that:

E[Y_n − Y_{n−1} | F_{n−1}] = C_n E[X_n − X_{n−1} | F_{n−1}] ≤ 0  [respectively = 0]

(The boundedness is needed for the 'taking out what is known' proposition we used.)
ii) Write Y = C • X = C⁺ • X − C⁻ • X where C⁺ = |C| 1_{C≥0} and C⁻ = |C| 1_{C<0}. Apply part i) to both pieces separately.
iii) If C_n and X_n are both in L², then E|Y_n| = E|(C • X)_n| < ∞ by Cauchy-Schwarz, and the same conditional computation as in i) and ii) still goes through. ∎
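Not in the original notes: a small simulation in the spirit of the theorem, assuming a fair ±1 walk and one particular bounded previsible strategy (double the stake after a loss, capped at K, reset after a win); the sample mean of (C • X)_n should be 0 up to Monte Carlo error:

    import numpy as np

    # "You can't beat the system": X a fair +/-1 random walk, C_n a bounded
    # previsible stake (depends only on flips 1..n-1).  E[(C . X)_n] should be ~0.
    rng = np.random.default_rng(4)
    n, trials, K = 200, 20_000, 16
    wealth = np.zeros(trials)
    for t in range(trials):
        flips = rng.choice([-1, 1], size=n)
        stake, total = 1.0, 0.0
        for x in flips:
            total += stake * x                                # bet, then observe X_k - X_{k-1}
            stake = 1.0 if x == 1 else min(2.0 * stake, K)    # previsible update for next round
        wealth[t] = total
    print("E[(C.X)_n] ~", wealth.mean(), "+/-", wealth.std() / np.sqrt(trials))

The reported mean should be within a couple of standard errors of 0, no matter how clever the (bounded, previsible) strategy looks.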

5.16. Stopping Times

Definition. A map T : Ω → {0, 1, 2, ..., ∞} is called a stopping time if:

{T ≤ n} ∈ F_n  ∀n ≤ ∞

The idea is that the time T is when you would choose to quit gambling at the casino. Of course, your decision to quit after time n can depend only on the information up to and including time n.

Definition. Let X be a supermartingale and let T be a stopping time. Suppose you always bet 1 unit and quit playing at (immediately after) time T. Then your 'stake process' is C^{(T)} given by:

C_n^{(T)} := 1 if n ≤ T(ω), and 0 otherwise

Your 'winnings process' is the process X^T given by:

X_n^T(ω) := (C^{(T)} • X)_n(ω) = X_{T(ω)∧n}(ω) − X_0(ω)

C_n^{(T)} is previsible since T is a stopping time ({n ≤ T} = {T ≤ n−1}^c ∈ F_{n−1}). Moreover, it is non-negative and bounded above by 1.

Proposition. If X is a supermartingale [respectively martingale] and T is a stopping time, then the stopped process X^T = (X_{T∧n} : n ∈ Z⁺) is a supermartingale [respectively martingale]. So in particular, E(X_{T∧n}) ≤ E(X_0) ∀n [respectively =].

Remark. Don't get confused and think that E(X_T) = E(X_0); this requires some conditions on T, and we have none here. (Example: the hitting time of level 1 for a simple random walk on Z.) Another way to think about this is to think of τ = T ∧ n as a bounded stopping time. This is the point of view taken by Rosenthal.

Proof. The process X^T = C^{(T)} • X is a supermartingale [resp. martingale] by the previous theorem. ∎

Theorem. (10.10) Doob's Optional-Stopping Theorem. Let T be a stopping time and let X be a supermartingale [respectively a martingale]. Then under any of the following conditions:
i) T is bounded (i.e. ∃N so that T < N a.s.)
ii) X is bounded (i.e. ∃K so that |X_n| ≤ K ∀n a.s.) and T is a.s. finite
iii) E(T) < ∞ and ∃K so that |X_n − X_{n+1}| ≤ K ∀n a.s.
iv) E(T) < ∞ and ∃C so that E(|X_{n+1} − X_n| | F_n) 1_{T>n} ≤ C
we conclude that:

E(X_T) ≤ E(X_0)  [respectively =]

Proof. Everything is based on the fact that E(XT ∧n) ≤ E(X0) which has already been proven. i) Follows by setting n = N ii) Follows since T ∧ n → T a.s. as n → ∞ and so XT ∧n → XT a.s. as n → ∞. Since X is bounded, by the bounded convergence theorem, E(XT ) ≤ E(X0) too. iii) Consider:

|X_{T∧n} − X_0| ≤ K(T ∧ n) ≤ KT

Using KT as a dominating function, by the dominated convergence theorem the result follows as in ii).
iv) Similar to iii), I guess? ∎

Theorem. If X is a non-negative supermartingale, and T is a stopping time which is a.s. finite, then:

E(X_T) ≤ E(X_0)

Proof. Have XT ∧n → XT a.s. in this case, and now by Fatou's lemma:

E(X_T) = E(lim inf X_{T∧n})
       ≤ lim inf E(X_{T∧n})
       ≤ E(X_0)  ∎

Theorem. (From Rosenthal) Let X be a martingale and T a stopping time which is almost surely finite. Then:

E(XT ) = E(X0) ⇐⇒ lim E(XT ∧n) = E(lim XT ∧n)

Proof. lim E(X_{T∧n}) = lim E(X_0) = E(X_0), while lim X_{T∧n} = X_T since T is a.s. finite, so the equivalence above is clear. ∎

Theorem. (From Rosenthal) Let X be a martingale and T a stopping time which is almost surely finite. Suppose E|X_T| < ∞ and lim_{n→∞} E(X_n 1_{T>n}) = 0. Then E(X_T) = E(X_0).

Proof. Have:

E(X_0) = E(X_{T∧n}) = E[X_n 1_{T>n}] + E[X_T 1_{T≤n}]

= E [Xn1T >n] + E [XT ] − E [XT 1T >n]

X_T 1_{T>n} → 0 a.s. since T < ∞ a.s., and it is bounded by the dominating function |X_T|, so E[X_T 1_{T>n}] → 0 by the LDCT. The other term E[X_n 1_{T>n}] goes to zero by hypothesis. ∎

Lemma. Suppose that T is a stopping time and that for some N ∈ N and some ε > 0 we have:

P(T ≤ n + N | F_n) > ε  a.s.  ∀n ∈ N

Then E(T) < ∞.

Example. Suppose a monkey types on a typewriter and let T be the first time he has typed ABRACADABRA (11 letters). Then choosing N = 11 and ε = 26^{-11} we get the above (he always has an ε chance of typing the phrase in the next 11 letters).

Proof. (Thinking of the monkey typing, the idea is to pretend that a 'letter' is a block of 11 keystrokes, with each of the 26^{11} possible blocks appearing with equal probability. The time until ABRACADABRA is ever typed is at most the time until the ABRACADABRA block appears, and this is bounded easily. The lemma is the generalization of this with 11 replaced by N and 26^{-11} replaced by ε.)

Notice P(T > N) ≤ 1 − ε. Now by induction, we show that P(T > kN) ≤ (1 − ε)^k:

P(T > kN) ≤ P(T > kN | T > (k−1)N) P(T > (k−1)N) ≤ (1 − ε)(1 − ε)^{k−1} = (1 − ε)^k

by setting n = (k−1)N and the induction hypothesis. Now let S = ⌈T⌉_N / N, where ⌈x⌉_N := kN for the unique k so that (k−1)N < x ≤ kN (it is like the ceiling function using N as the rounding threshold instead of 1; it is easily verified that ⌈·⌉_N = N ⌈·/N⌉). We have easily that T ≤ NS. The previous result tells us that P(S > k) ≤ (1 − ε)^k, so finally:

E(T) ≤ N E[S] = N Σ_{k≥0} P(S > k) ≤ N Σ_{k≥0} (1 − ε)^k = N/ε < ∞  ∎

Example. (Setting up the right martingale!) In the monkey/typewriter problem, use martingales to explicitly compute E(T).

Proof. We know by the lemma that E(T) < ∞. Let us open a casino that allows gamblers to arrive and bet with us on what letters the monkey types. We will offer only fair bets to all our gamblers, so that our resulting fortune is a martingale. We will close the casino and stop gambling only at time T. Now suppose that we have a sequence of greedy gamblers who each have the following betting strategy:

• Start with $1.
• Bet $1 that an 'A' comes up.
  – If lose: fortune at $0 and forced to quit the game.
  – If win: fortune at $26 and continue to gamble!
• Bet $26 that a 'B' comes up.
  – If lose: fortune at $0 and forced to quit the game.
  – If win: fortune at $26² and continue to gamble!
• Bet $26² that an 'R' comes up.
• ...

The gambler goes through all the letters in ABRACADABRA in this way. Notice these are always fair bets: he goes up 25 times his stake if he wins, and loses his stake when he loses: −1·(25/26) + 25·(1/26) = 0.

Suppose a gambler with this betting strategy starts his betting sequence at each letter of the monkey's typing sequence. Let X_n be the casino's fortune at time n. This is a martingale, and has E(|X_{n+1} − X_n|) ≤ K for some K (K = 26^{11} · 26 should do), so since E(T) < ∞ we can use our theorem to conclude that 0 = E(X_0) = E(X_T). On the other hand, what is the casino's fortune at time T? At time T, the monkey has just finished typing out ABRACADABRA for the very first time. This is very good for 3 gamblers: the gambler who started playing when the first underlined A was typed, the gambler who started when the second underlined A was typed, and the gambler who started when the third underlined A was typed. These three gamblers are up 26^{11} − 1, 26^4 − 1 and 26^1 − 1 dollars respectively (their fortune minus their initial money), so the total the casino has to pay out to these guys is 26^{11} + 26^4 + 26 − 3. Every other gambler has lost his initial $1. (They kept betting till they lost!) There are (T − 3) such gamblers, so the casino has made a profit of T − 3 from them. In sum, the casino's fortune is X_T = −(26^{11} + 26^4 + 26 − 3) + (T − 3) = T − (26^{11} + 26^4 + 26). E(X_T) = 0 then gives E(T) = 26^{11} + 26^4 + 26. ∎
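Not from the notes: a Monte Carlo sanity check of the same casino argument on a smaller, assumed example — the word "ABA" over a fair 2-letter alphabet, for which the analogous formula gives E(T) = 2³ + 2¹ = 10 (one term for each prefix of the word that is also a suffix):

    import numpy as np

    # Monte Carlo check of the pattern waiting-time formula on a 2-letter alphabet:
    # for the word "ABA" the gambler/martingale argument gives E(T) = 2^3 + 2^1 = 10.
    rng = np.random.default_rng(5)
    word, trials = "ABA", 20_000
    times = np.zeros(trials)
    for t in range(trials):
        recent, n = "", 0
        while True:
            n += 1
            recent = (recent + "AB"[rng.integers(0, 2)])[-len(word):]  # keep last 3 letters
            if recent == word:
                times[t] = n
                break
    print("empirical E(T):", times.mean(), "  formula:", 2**3 + 2**1)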

Remark: To simplify the algebra slightly (the ±3 thing) just pretend that every gambler pays a $1 fee to play at the casino, and that fee includes their first bet! Then all the gamblers paid the fee (+$T) and the payout the casino has to give is $(26^{11} + 26^4 + 26).

Here is a small improvement on the optional stopping theorem:

Theorem. If X is a martingale and T is a stopping time with E(T) < ∞ and E(|X_{n+1} − X_n| 1_{T>n} | F_n) ≤ c 1_{T>n}, then E(X_T) = E(X_0).

Proof. Write:

X_{T∧n} = X_0 + Σ_{k=0}^{T∧n−1} (X_{k+1} − X_k)

Therefore |X_{T∧n}| ≤ M for every n, with:

M := |X_0| + Σ_{k=0}^∞ |X_{k+1} − X_k| 1_{T>k}

Now we claim E(M) < ∞, since:

E(M) ≤ E|X_0| + Σ_{k=0}^∞ E(|X_{k+1} − X_k| 1_{T>k})
     = E|X_0| + Σ_{k=0}^∞ E(E(|X_{k+1} − X_k| 1_{T>k} | F_k))
     ≤ E|X_0| + c Σ_{k=0}^∞ P(T > k)   by the hypothesis
     = E|X_0| + c E(T) < ∞

So X_{T∧n} → X_T dominated by M, and by the LDCT E(X_T) = lim E(X_{T∧n}) = E(X_0). ∎

Uniform Integrability

These are notes based on the second half of the book by David Williams [5].

6.17. An 'absolute continuity' property

Lemma. Suppose X ∈ L¹. Then given any ε > 0 there exists δ > 0 so that P(F) < δ ⟹ E(|X|; F) < ε.

Proof. One can do this directly by looking at X_n = X 1_{|X|<n} and monotone convergence; here is a proof by contradiction instead.

Suppose by contradiction that there exists ε_0 > 0 and a sequence F_n with P(F_n) < 2^{-n} and E(|X|; F_n) ≥ ε_0. Let F = lim sup F_n. By Borel-Cantelli, P(F) = 0, while on the other hand, by Fatou's lemma we have that:

lim inf_n E(|X|; F_n^c) = lim inf_n E(|X| 1_{F_n^c}) ≥ E(lim inf_n |X| 1_{F_n^c}) = E(|X| 1_{F^c}) = E(|X|; F^c)

where here we have used that lim inf 1_{A_n} = 1_{lim inf A_n} and lim inf A_n^c = (lim sup A_n)^c. Subtracting the above inequality from E(|X|) gives:

ε_0 ≤ lim sup_n E(|X|; F_n) ≤ E(|X|; F)

which is a contradiction since P(F) = 0. ∎

Corollary. If X ∈ L¹ and ε > 0 then there exists K > 0 such that E(|X|; |X| > K) < ε.

Proof. Take δ as in the lemma and then use the Markov inequality: P(|X| > K) ≤ (1/K) E(|X|) → 0, so eventually P(|X| > K) < δ for K large enough. ∎

6.18. Definition of a UI family

Definition. (12.1) A class C of random variables is called uniformly integrable (UI) if given ε > 0 there exists K ∈ [0, ∞) such that:

E(|X|; |X| > K) < ε  ∀X ∈ C

Remark. In this language the first lemma says that a single integrable random variable is always uniformly integrable. Of course by taking maximums over a finite set, it is clear that any finite family of random variables is UI too.

Proposition. Every uniformly integrable family is bounded in L¹.


Proof. Choose ε = 1 to get a K_1 so that for every X ∈ C we have:

E(|X|) = E(|X|; |X| > K_1) + E(|X|; |X| ≤ K_1) ≤ 1 + K_1

which is a uniform bound. ∎

Example. Here is an example of a family that is bounded in L¹ but is NOT uniformly integrable. On (Ω, F, P) = ([0,1], B, m) choose X_n = n 1_{(0, n^{-1})}, so that E(X_n) = 1 for all n, but for any K > 0, choosing n so large that n > K gives E(X_n; X_n > K) = E(X_n) = 1. Notice in this case that X_n → 0 a.e. but E(X_n) ↛ 0.

6.19. Two simple sufficient conditions for the UI property

Theorem. If C is a class of random variables which is bounded in L^p for some p > 1, then C is UI.

Proof. Suppose that E(|X|^p) < A for all X ∈ C. We use the inequality that x > K ⟹ x ≤ K^{1−p} x^p. Then consider:

E(|X|; |X| > K) ≤ K^{1−p} E(|X|^p; |X| > K) ≤ K^{1−p} A → 0 as K → ∞  ∎

Theorem. If C is a class of random variables dominated by some integrable non-negative random variable Y (i.e. |X| ≤ Y a.e. for all X ∈ C), then C is UI.

Proof. Have:

E(|X|; |X| > K) ≤ E(Y; Y > K) → 0 as K → ∞

which follows by the result for a single R.V. that we proved already. ∎

6.20. UI property of conditional expectations

Theorem. If X ∈ L¹ then the class {E(X | G) : G a sub-σ-algebra of F} is uniformly integrable.

Proof. Let ε > 0 be given. Choose δ > 0 such that for F ∈ F, P(F) < δ ⟹ E(|X|; F) < ε, and then choose K so large that K^{-1} E(|X|) < δ. Now let G be any sub-σ-algebra of F and let Y be any version of E(X | G). By Jensen's inequality:

|Y| ≤ E(|X| | G)  a.s.

Hence E(|Y|) ≤ E(|X|) and:

K P(|Y| > K) ≤ E(|Y|) ≤ E(|X|) ≤ δK

So we have that P(|Y| > K) < δ. Since {|Y| > K} is G-measurable, we then have:

E(|Y|; |Y| ≥ K) ≤ E(|X|; |Y| ≥ K) < ε  ∎

6.21. Convergence in Probability

Definition. We say that X_n → X in probability if for every ε > 0:

P(|X_n − X| > ε) → 0 as n → ∞

Lemma. If X_n → X a.s. then X_n → X in probability.

Proof. Suppose by contradiction that there exists ε_0 > 0 so that P(|X_{n_k} − X| > ε_0) ↛ 0 along some subsequence. But then let A = lim sup {|X_n − X| > ε_0}; then P(A) ≥ lim sup P(|X_n − X| > ε_0) > 0 since the limit is not zero, and we know for sure that X_n ↛ X on the set A of positive measure. ∎

6.22. Elementary Proof of Bounded Convergence Theorem

Theorem. If X_n → X in probability and |X_n| ≤ K for some constant K, then E(|X_n − X|) → 0.

Proof. First verify that P(|X| ≤ K) = 1, since for every ε > 0 we have that P(|X| > K + ε) ≤ P(|X_n − X| > ε) + P(|X_n| > K) → 0 as n → ∞. (Then take the intersection over a countable collection of ε's.) Now consider that:

E(|X_n − X|) = E(|X_n − X|; |X_n − X| < ε) + E(|X_n − X|; |X_n − X| ≥ ε)
             ≤ ε + 2K P(|X_n − X| > ε) → ε

and ε was arbitrary. ∎

6.23. Necessary and Sufficient Conditions for L¹ convergence

Theorem. Let (X_n) be a sequence in L¹ and let X ∈ L¹. Then X_n → X in L¹ (i.e. E(|X_n − X|) → 0) if and only if the following two conditions are satisfied: i) X_n → X in probability; ii) the sequence X_n is UI.

Proof. (⇐) Define φ_K : R → R by

φ_K(x) = −K if x < −K,  x if −K ≤ x ≤ K,  K if x > K

This is a sort of 'de-extreming' function since it never takes values more than K in absolute value.

Given any ε > 0, use the uniform integrability of X_n to find a K so that for all n we have:

E(|X_n − φ_K(X_n)|) ≤ E(|X_n|; |X_n| > K) < ε/3

Since the singleton X is UI on its own we can also assume the above works for X as well:

E(|X − φ_K(X)|) ≤ E(|X|; |X| > K) < ε/3

Now we notice that since |φ_K(x) − φ_K(y)| ≤ |x − y|, X_n → X in probability means that φ_K(X_n) → φ_K(X) in probability (it's actually true for any continuous function...), so by the bounded convergence theorem we know that:

E(|φ_K(X_n) − φ_K(X)|) < ε/3

for n large enough. By the triangle inequality, combining these three estimates we get that E(|X_n − X|) < ε as desired.
(⇒) The two facts are seen separately. Convergence in probability is just by the Markov inequality:

P(|X_n − X| > ε) ≤ (1/ε) E(|X_n − X|) → 0

UI goes as follows:

Given ε > 0, choose N so large that n ≥ N ⟹ E(|X_n − X|) < ε/2. Now use the UI of a finite family to choose δ so that P(F) < δ gives:

E(|X_n|; F) < ε for 1 ≤ n ≤ N,   E(|X|; F) < ε/2

Now since X_n is bounded in L¹ (E(|X_n|) ≤ E(|X_n − X|) + E(|X|) → 0 + E(|X|)), we can choose K such that:

K^{-1} sup_r E(|X_r|) < δ

And then by Markov we have P(|X_n| > K) ≤ K^{-1} sup_r E(|X_r|) < δ for any n. Now for n ≥ N we exploit the fact that E(|X_n − X|) < ε/2 to get:

E(|X_n|; |X_n| > K) ≤ E(|X|; |X_n| > K) + E(|X − X_n|) < ε/2 + ε/2

And for n ≤ N we exploit the fact that any finite family is UI:

E(|X_n|; |X_n| > K) < ε  (since P(|X_n| > K) < δ)  ∎

Remark. You can also say that X_n → X in L¹ ⟺ X_n → X in probability and E(|X_n|) → E(|X|); the proof of the harder direction is similar to the above, this time using a bump function φ_K which is equal to x on [0, K], 0 on [K+1, ∞), and linearly interpolates on [K, K+1].

UI Martingales

These are notes based on the second half of the book by David Williams [5].

7.24. UI Martingales

Definition. A UI martingale is a martingale M so that {M_n} is a collection of UI random variables.

Theorem. Let M be a UI martingale. Then:

M_∞ := lim M_n exists a.s. and in L¹

Moreover, for every n, we have:

M_n = E(M_∞ | F_n) a.s.

Proof. If M is a UI martingale, then it is bounded in L¹ (every UI collection is bounded in L¹) and hence it converges a.s. to some M_∞ by the martingale convergence theorem. Moreover, by our characterization of L¹ convergence (as convergence in probability + UI), since M_n → M_∞ a.s. and M_n is UI, we know moreover that M_n → M_∞ in L¹ as well. (Basically, if you are a UI martingale, you are fantastically friendly!)

Finally, to see that M_n = E(M_∞ | F_n) a.s. it suffices to show that E(M_∞; F) = E(M_n; F) for each F ∈ F_n. To see this consider that for any r ≥ n we have E(M_n; F) = E(M_r; F) by the martingale property, and so we can write:

|E (Mn; F ) − E (M∞; F )| = |E (Mr; F ) − E (M∞; F )|

≤ E (|Mr − M∞| ; F )

≤ E (|Mr − M∞|) → 0 as r → ∞  Remark. The UI property of the martingale is what gives the L1 convergence, which is what is used to verify that Mn = E (M∞ |Fn ). If the martingale was convergent, but not UI, we wouldn't have this!

7.25. Levy's 'Upward' Theorem

Theorem. (Levy's 'Upward' Theorem) Let ξ ∈ L¹ and define M_n = E(ξ | F_n) a.s. Then M is a UI martingale and:

M_n → η := E(ξ | F_∞)  almost surely and in L¹

To put it more simply: E(ξ | F_n) → E(ξ | F_∞) a.s. and in L¹.


Proof. We know M is a martingale because of the tower property; M is UI by the UI property of conditional expectations (which basically follows from Jensen, since |E(X | G)| ≤ E(|X| | G)). By the previous theorem, we know that M_∞ = lim_{n→∞} M_n exists and the convergence is a.s. and in L¹. It remains only to show that M_∞ = η a.s. with η defined as above. WLOG assume that ξ ≥ 0. We claim that E(η; F) = E(M_∞; F) for every F ∈ F_∞. Indeed, if we let A be the collection of such sets, we see that each F_n ⊂ A by the tower property, and then F_∞ = σ(∪ F_n) ⊂ A too. Now since M_∞ and η are both F_∞-measurable (indeed M_∞ = lim M_n) and E(η; F) = E(M_∞; F), the two are both versions of E(η | F_∞) = η. (If X, Y are F-measurable and E(X; A) = E(Y; A) for all A ∈ F, then set A = {X > Y} to see that E(X − Y; X > Y) = 0 ⟹ P(X > Y) = 0, and similarly the other way around... it is really just a sanity check that the definition of conditional expectation makes sense.) ∎

7.26. Martingale Proof of Kolmogorov 0-1 Law

Theorem. Let X_1, X_2, ... be a sequence of independent RVs and define:

T_n = σ(X_{n+1}, ...),   T := ∩_n T_n

Then if F ∈ T, then P(F) = 0 or 1.

Proof. Define F_n = σ(X_1, X_2, ..., X_n). Let F ∈ T and let η = 1_F. Since η is F_∞-measurable, Levy's Upward Theorem shows that:

η = E(η | F_∞) = lim E(η | F_n)  a.s. and in L¹

However, for each n, η is T_n-measurable and hence independent of F_n (T_n involves the variables with index greater than n while F_n involves the variables with index at most n). Hence:

E(η | F_n) = E(η) = P(F)

So we have η = P(F) a.s. But this can only happen if P(F) = 0 or 1, as these are the only values that 1_F takes. ∎

Remark. You can do this proof in one line. For any F ∈ T:

1_F = E(1_F | F_∞)
    = lim_{n→∞} E(1_F | F_n)
    = lim_{n→∞} E(1_F)
    = P(F)

7.27. Levy's 'Downward' Theorem

Theorem. Suppose (Ω, F, P) is a probability triple and that {G_{-n} : n ∈ N} is a collection of sub-σ-algebras of F such that:

G_{-∞} := ∩_k G_{-k} ⊂ ... ⊂ G_{-(n+1)} ⊂ G_{-n} ⊂ ... ⊂ G_{-1}

(In other words, they are backwards of a normal filtration: a normal filtration gets finer and finer, these get coarser and coarser... except they are labeled with negative numbers.) Let γ ∈ L¹ and define:

M_{-n} := E(γ | G_{-n})

Then:

M_{-∞} := lim M_{-n} exists a.s. and in L¹

and:

M−∞ = E (γ |G−∞ ) a.s. Proof. The upcrossing lemma applied to the martingale:

(M_k, G_k : −N ≤ k ≤ −1) can be used exactly as in the proof of the martingale convergence theorem to show that lim M_{-n} exists a.s. The uniform-integrability result about UI martingales then shows that lim M_{-n} exists in L¹ as well as a.s. The fact that M_{-∞} = E(γ | G_{-∞}) holds uses the same trick as earlier: for F ∈ G_{-∞} ⊂ G_{-r} we have that E(M_{-r}; F) = E(γ; F) for every r, and taking r → ∞ gives the result. ∎

7.28. Martingale Proof of the Strong Law

Theorem. Let X_1, X_2, ... be iid with E(|X|) < ∞. Let S_n = Σ_{k≤n} X_k. Then n^{-1} S_n → E(X) a.s.

Proof. Define G_{-n} := σ(S_n, S_{n+1}, ...) and G_{-∞} = ∩_n G_{-n}. By a symmetry argument, we see that E(X_1 | G_{-n}) = n^{-1} S_n. So by Levy's downward theorem, n^{-1} S_n = E(X_1 | G_{-n}) → E(X_1 | G_{-∞}) =: L a.s. and in L¹. We now argue that L must be a constant, because the event {lim n^{-1} S_n = c} is a tail event (one can see that it does not depend on the first N variables, and this works for any N), and by the 0-1 law L must therefore be constant. Since the convergence n^{-1} S_n → L is in L¹ we can check that:

L = E(L) = lim E(n^{-1} S_n) = E(X)  ∎
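A one-path numerical illustration of the theorem (not from the notes; the Exponential(1) distribution and the path length are my assumptions):

    import numpy as np

    # Strong law illustration: running averages n^{-1} S_n for iid Exponential(1)
    # variables (E X = 1), along a single sample path.
    rng = np.random.default_rng(6)
    X = rng.exponential(1.0, size=1_000_000)
    running_mean = np.cumsum(X) / np.arange(1, X.size + 1)
    for n in [10, 100, 10_000, 1_000_000]:
        print(n, running_mean[n - 1])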

7.29. Doob's Submartingale Inequality

Theorem. Let Z be a non-negative submartingale. Then for c > 0:

P(sup_{k≤n} Z_k ≥ c) ≤ (1/c) E(Z_n; sup_{k≤n} Z_k ≥ c) ≤ (1/c) E(Z_n)

Proof. Let F = {sup_{k≤n} Z_k ≥ c}; then F is a disjoint union:

F = F0 ∪ ... ∪ Fn

where F_k is the event that Z_k exceeds the value c for the first time: F_k = {Z_k ≥ c} ∩ {Z_i < c for 1 ≤ i ≤ k−1}. On each of these events we can perform a Chebyshev-type inequality comparing to Z_k, and at the last step we can compare to Z_n using the submartingale property:

P(F_k) = E(1_{F_k})
       ≤ E((Z_k/c) 1_{F_k})
       = (1/c) E(Z_k; F_k)
       ≤ (1/c) E(Z_n; F_k)   by the submartingale property

Summing over k then gives the result. ∎
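A rough Monte Carlo check of the inequality (not from the notes), assuming Z_k = |S_k| for a fair ±1 walk — a non-negative submartingale — and the parameters below:

    import numpy as np

    # Check P(max_{k<=n} |S_k| >= c) <= E|S_n| / c  for Z_k = |S_k|, S a fair +/-1 walk.
    rng = np.random.default_rng(7)
    n, trials, c = 400, 30_000, 60.0
    max_abs = np.empty(trials)
    last_abs = np.empty(trials)
    for t in range(trials):
        S = np.cumsum(rng.choice([-1, 1], size=n))
        max_abs[t] = np.abs(S).max()
        last_abs[t] = abs(S[-1])
    print("P(max |S_k| >= c):", (max_abs >= c).mean())
    print("bound E|S_n| / c :", last_abs.mean() / c)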

Lemma. If M is a martingale, c is a convex function, and E(|c(M_n)|) < ∞ for all n, then c(M) is a submartingale.

Proof. This is immediate by the conditional form of Jensen's inequality:

E(c(M_n) | F_m) ≥ c(E(M_n | F_m)) = c(M_m)  ∎

Theorem. (Kolmogorov's Inequality) Let X_1, ... be a sequence of mean zero RVs in L². Define σ_k² = Var(X_k), S_n = Σ_{k≤n} X_k and V_n = Var(S_n) = Σ_{k=1}^n σ_k². Then for c > 0:

P(sup_{k≤n} |S_k| ≥ c) ≤ V_n / c²

Proof. S_n is a martingale, and so S_n² is a submartingale. Applying Doob's submartingale inequality to S_n² is exactly the result we want. ∎

Kolmogorov's Three Series Theorem using L² Martingales

These are notes based on the second half of the book by David Williams [5]. Here is a mind map of the theorems in this section:

66 KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 67

) < ∞ K n ) < ∞ converges a.s K n n > K) < ∞ n indep K>0.and: n K3 Thm - Fwd X i) ΣP(X ii) ΣVar(X iii) ΣE(X Then ΣX

) < ∞ n converges a. n ) < ∞

n ] 2 )

n-1 indep.and: n

]< ∞ and K2 Thm - Fwd X i) ΣVar(X ii) ΣE(X Then ΣX s 2 -M ) n

2 n-1 Mrtngl Mrtngl -M 2 2 n

)+ΣE[(M 2 0

) = 0 and: n )< ∞ iff ΣE[(M )=E(M 2 2 n ) < ∞ n n converges a. n

E(M n Pythagoras for L E(M converges a.s. and in L n indep. E(X Convergence for L sup then: M n

K1 Thm - Fwd X i) ΣVar(X Then ΣX s

exists a. n

|

X n n-1

- X n

) < ∞ ) K ) < ∞ 0 n > K) < ∞ K n then lim n converges a.s ∞

n and |X )=E(X | < ∞ T indep. and: n n K3 Thm - Bckwd X i) ΣX Then for anyK: i) ΣVar(X ii) ΣE(X iii) ΣE(X Hint: 2nd Borel Cantelli

E|X n

Doob's Stopping Time Thm X a mrtngl, T stopping time. If either: i) T bounded ii) X bounded and T a.s. finite iii) E(T) < Then E(X (small improvments possible) Basic Measure/Integration Facts -Borel Cantelli (Both directions!) -Dominated Convergence Thm etc. Martingale Convergence Thm If sup s.

n

) < ∞ - indp copy of X ] n n - converges a.s ) < ∞

n | < K a.s. n n - a) indep. and:

N n K2 Thm - Bckwd X i) ΣX ii) |X Then: ΣVar(X ΣE(X Hint: X

2

[a,b] ] <= E[ (X N and X in L

2

[a,b] be the number of -ΣVar N 2 ) ) < ∞ n n

) = 0 and: n Doob's Upcrossing Lemma Let U upcrossing of [a,b] up to time N. Then: (b-a)E[ U Stochastic "Integrals" C is previsible and X a mrtngl. If either: i) |C| < K a,s, ii) C in L Then C•X is a mrtngl too. Basic Definitions and Facts -Conditional Probablity -Filtration -Martingale -Previsible Process and "stochastic integral" converges a.s

n | < K a.s. n ) and use stopping indep. E(X n n K1 Thm - Bckwd X i) ΣX ii) |X Then: ΣVar(X Hint: look at (ΣX (X times KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 68

Definition. We say that a martingale is in 2 if 2 for all . M L E(Mn) < ∞ n ∈ N 2 2 Theorem. (Pythagoras for L Martingales) Say Mn is an L martingale then:

n 2 2 X h 2i E(Mn) = E M0 + E (Mk − Mk−1) k=1

Proof. Since E (Mk − Mk−1 |Fk−1 ) = 0, we have that E [(Mk − Mk−1) X] = 0 for any Fk−1 measurable X. Putting X = Mj −Mj−1 shows that the terms Mk − and are orthogonal. Then write Pn Mk−1 Mj −Mj−1 Mn = M0 = k=1 (Mk − Mk−1) gives the result. 

2 2 Theorem. (Convergence of L Martingales) Say Mn is an L martingale. Have:

2 X  2 sup E(Mn) < ∞ ⇐⇒ E (Mk − Mk−1) < ∞ and in the case that this holds, we have:

2 Mn → M∞ almost surely and in L Proof. By Pythagoras:

n 2 2 X h 2i sup E(Mn) = E(M0 ) + sup E (Mk − Mk−1) n n k=1 Hence 2 if and only if P  2 . supn E(Mn < ∞ E (Mk − Mk−1) < ∞ 1/2  2 Since E(|Mn|) ≤ E |Mn| < ∞, by the Martingale Convergence Theorem, M converges to some M almost surely. To get L2 convergence, use Pythagoras n h ∞i h i to bound 2 by P 2 E (M∞ − Mn) k≥n E (Mk − Mk−1) . 

Theorem. (Kolmogorov 1 Series Theorem - Fwd Direction) Suppose Xn are independent. If:

∞ X Var(Xn) < ∞ n=1 E(Xn) = 0∀n Then: ∞ X Xn converges a.s. n=1 Pn P Proof. Let Mn = Xk, this is a martingale. ∞ > Var(Xk) = h k=1 i P 2 P 2 so by the convergence theorem for 2 mar- E Xk = E (Mn − Mn−1) L tingales, converges a.s. Xn 

Theorem. (Kolmogorov 2 Series Theorem - Fwd Direction) Suppose Xn are independent. KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 69

If:

∞ X Var(Xn) < ∞ n=1 ∞ X E(Xn) < ∞ n=1 Then: ∞ X Xn converges a.s. n=1 P P Proof. Let Yk = Xk − E(Xk) then Var (Yk) = Var(Xk) < ∞ by hy- P pothesis, and so Yk converges almost surely by the 1 series theorem. Have then: n n n X X X Xk = Yk + E(Xk) k=1 k=1 k=1 ∞ ∞ a.s X X −−→ Yk + E(Xk) k=1 k=1 

Theorem. (Kolmogorov 3 Series Theorem - Fwd Direction) Suppose Xn are independent and K > 0. Dene: K Xn := Xn1{|Xn|≤K} If:

∞ X P(Xn > K) < ∞ n=1 ∞ X K E(Xn ) < ∞ n=1 ∞ X K  Var Xn < ∞ n=1 Then: ∞ X Xn converges a.s. n=1 Proof. Since P∞ K  P∞ , by the Borel n=1 P Xn 6= Xn = n=1 P(Xn > K) < ∞ Cantelli lemma, we know that K i.o. . Hence K for all but P Xn 6= Xn = 0 Xn = Xn nitely many n almost surely. Since P∞ K  and P∞ K , we have P∞ K con- n=1 E Xn < ∞ n=1 Var(Xn ) < ∞ n=1 Xn verges almost surely by the Two Series Theorem. Hence P∞ PN P∞ K converges almost surely too. n=1 Xn = n=1 Xn + n=N+1 Xn 

Theorem. (Kolmogorov 1 Series Theorem - Bckwd Direction) Suppose Xn are independent and K > 0. KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 70

If:

E(Xn) = 0 ∀n

|Xn| < K a.s. ∀n P∞ converges a.s. n=1 Xn Then: ∞ X Var(Xn) < ∞ n=1 Proof. The trick is to look at:

n !2 n X X Yn := Xk − Var(Xk) k=1 k=1

is easily veried to be a martingale by expanding the 2 and using Pn−1 Yn (·) E(Xn k=1 Xk) = and 2 . 0 E(Xn) = Var(Xn) [[Side note: Since Yn is a martingale, we have that for any n that E(Yn) = 0, or in other words:

n  n !2 X X Var(Xk) = E  Xk  k=1 k=1 Which is very close to giving us what we want. (We now see why an extra P condition such as |Xn| < K is needed: almost sure convergence of Xn is not enough to give L2 convergence) ]] To nish the proof, we use stopping times in a smart way. For convenience, dene the martingale Pn . For each dene the stopping time: Sn := k=1 Xk c > 0

Tc = inf {n : |Sn| > c}

Notice that:

|ST c | ≤ |ST c−1| + |XT c | ≤ c + K

This actually works for Sn too as long as n < Tc. by the hypothesis that |Xn| < ∞ for all n. Hence, using Tc as a stopping time on the martingale Y gives us E(YT ∧n) = E(Y0) = 0 for all n, or in other words:

"T ∧n #  T ∧n !2 Xc Xc E Var(Xk) = E  Xk  k=1 k=1 ≤ (c + K)2 by the previous argument

Now, we claim there exists some c so that P(Tc = ∞) > 0. Otherwise, Tn < ∞ almost surely for every n ∈ N, and taking the intersection over all N we conclude from the denition of Tc that lim sup |Sn| = ∞, which contradicts the hypothseis that P∞ converges almost surely. k=1 Xk KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 71

For such c, we have that:

"Tc∧n # 2 X (c + K) ≥ E Var(Xk) k=1 n ! T ! X Xc = Var(Xk) P (Tc > n) + Var(Xk) P (Tc ≤ n) k=1 k=1 n ! X ≥ Var(Xk) P (Tc > n) k=1 ∞ ! X → Var(Xk) P (Tc = ∞) k=1 So P∞ −1 2 as desired. k=1 Var(Xk) ≤P (Tc = ∞) (c + K) < ∞ 

Theorem. (Kolmogorov 2 Series Theorem - Bckwd Direction) Suppose Xn are independent and K > 0. If:

|Xn| < K a.s. ∀n P∞ converges a.s. n=1 Xn Then: ∞ X Var(Xn) < ∞ n=1 ∞ X E(Xn) < ∞ n=1

Proof. The trick is to create two independent copies of the sequence Xn, call them Xn and X˜n, and to look at the dierence:

Zn := Xn − X˜n (To be precise, dene the 0 on some probability space , the ˜ 's on some Xns Ω Xn probability space and the 0 are dened on ˜) This will be a sequence ω˜ Zns Ω × Ω P of independent random variables with E(Zn) = 0, |Zn| < 2K. Since Xn con- P verges almost surely, we have also that X˜n converges almost surely. Hence P Zn converges almost surely too! Finally notice that < ∞ by hypothesis. Now, P by the One Series Thm (Bck Direction) we have that Var(Zn) < ∞. Hence P 1 P Var(Xn) = Var(Zn) < ∞ too. 2 P Finally, we use the One Series Thm (Fwd direction) to see that (Xn − E(Xn)) P P P converges almost surely, and we conclude that E(Xn) = Xn− (Xn − E(Xn)) converges too since its the dierence of two a.s. convergent things. 

Theorem. (Kolmogorov 3 Series Theorem - Bckwd Direction) Let Xn be a sequence of random variables. If: X Xn < ∞ Then for any K > 0 we dene: K Xn := Xn1{|Xn|≤K} KOLMOGOROV'S THREE SERIES THEOREM USING L2 MARTINGALES 72

and we have:

∞ X P(Xn > K) < ∞ n=1 ∞ X K E(Xn ) < ∞ n=1 ∞ X K  Var Xn < ∞ n=1 P Proof. Since Xn converges almost surely, then Xn → 0 almost surely (in- P deed {Xn → 0} ⊃ { Xn converges}). Hence |Xn| > K for all but nitely many K i.e. P(|Xn| > K i.o) = 0. Since the Xn are independent, by the second Borel Cantelli Lemma, we have that P∞ . n=1 P(Xn > K) < ∞ Since K for all but nitely many , we have that P K converges Xn = Xn n Xn almost surely. P∞ K P∞ K  follow then by the Kol- n=1 E(Xn ) < ∞, n=1 Var Xn < ∞ moogrov Two Series Theorem Bckward Direction.  A few dierent proofs of the LLN

Here is a collection of some dierent proofs of the LLN...this is a repeat of some of the earlier sections. There are dierent ways to prove the strong law of large numbers. Here are the techniques I know:

• Truncation + K3 Theorem + Kronecker Lemma • Truncation + Sparsication • Levy's downward theorem + 0-1 Law • Ergodic Theorem

9.30. Truncation Lemma This lemma is the starting point for two of the dierent proofs.

Lemma. (Kolmogorov's Truncation Lemma) Suppose that are IID with . Dene X1,X2,... E(|X|) < ∞ Yn = Xn1{|Xn|

i) E(Yn) → E(X) ii) P (Yn = Xn eventually ) = 1 iii) P∞ −2 2 n=1 n E(Yn ) ≤ 4E (|X|) < ∞

Proof. i) is just by the dominated convergence theorem using X as a dominant majorant. ii) is just by Borel Cantelli since P (Yn 6= Xn) = P (|Xn| ≥ n) and everyon'e favourite formula for expectation: P P (|X| ≥ n) = P P (b|X|c ≥ n) = E (b|X|c) ≤ E(|X|). iii) follows is based on the telescoping sum estimate:

X 1 X 2 X 2 2 2 ≤ = − ≤ n2 n(n + 1) n n + 1 m n≥m n≥m n≥m

73 9.31. TRUNCATION + K3 THEOREM + KRONECKER LEMMA 74 and using Fubini to switch the sum and the integral. Consider that:

∞ ∞ X −2 2 X −2 2  n E Yn = E n X 1{|X|



9.31. Truncation + K3 Theorem + Kronecker Lemma

Lemma. (Kolmogorov 3 Series Thm) Say X1,... are iid and K > 0. Dene K then: Xn = Xn1{|Xn| K) < ∞ n=1 ∞ ∞ X K  X E Xn < ∞ ⇐⇒ Xn converges a.s n=1 n=1 ∞ X K  Var Xn < ∞ n=1 \

Remark. The proof of =⇒ is fairly straightforward using the idea of L2 martingales and the martingale convergence theorem. Even with martingale tools, the proof of ⇐= requires a few tricks. Remark. For the strong law however, we only need the forward direction, and actually we only need the 1 series theorem (which is the case where K = ∞ and E(Xk) = 0 always). To get this from scratch is actually surprisingly quick using martingales, the steps are:

• Upcrossing lemma and Martingale convergence theorem • Pythagoras formula for L2 martingales Lemma. Cesaro Lemma: P If an are positive weights with ak → ∞ and xn is a sequence with xn → L then the weighted average: n X akxk k=1 too n → L X ak k=1 9.31. TRUNCATION + K3 THEOREM + KRONECKER LEMMA 75

P Proof. The fact that ak → ∞ shows that for any xed N the rst N terms of the sequence are irrelevant; PN is some nite amount and the xk k=1 akxk numerator Pn drowns them out. Hence only the tail of the 's matter k=1 ak → ∞ xk and since xk is converging to L, every term is eventually with  of L and so the weighted average will also be with  of L from this point on.  Lemma. Kronecker's Lemma

If bn is an increasing sequence of real numbers with bn ↑ ∞ and xn is a sequence of real numbers then: P X xn xn < ∞ =⇒ → 0 bn bn

Pn xi Proof. Let un = be the partial sums of our convergent sequence with i=1 bi u∞ = limn un. Use the dierences bn+1 − bn as the weights for the Cesaro lemma to get: X X xn = bk (uk − uk−1) n X = bnun − (bk − bk−1)uk−1 k=1 So then: P Pn xn k=1(bk − bk−1)uk−1 = un − Pn bn k=1(bk − bk−1) → u∞ − u∞ = 0  Here is a bonus little result that we basically used as part of the theorem: Theorem. (Strong Law under Variance Constraint)

Let Wn be a sequence of independent random variables so that E(Wn) = and P −2 . Then −1 P a.s. n Var(Wn) < ∞ n k≤n Wk → 0 P −1 Proof. By the Kronecker lemma, it is sucient to show that n Wn con- verges a.s. But the convergence of this series is exactly the result of the Kolmogorov 1 Series theorem, so we are done!  Theorem. (Stong Law) If are iid with then −1 Pn X1,... E (|X|) < ∞ n k=1 Xk → E(X) a.s. Proof. Dene as in the truncation lemma. Since Yn = Xn1{|Xn|

−1 X −1 X −1 X n Yn = n Wn + n E(Yk) → 0 + E(X) k≤n k≤n k≤n since (to be precise, we invoked the Cesaro lemma here) E(Yk) → E(X)  9.32. TRUNCATION + SPARSIFICATION 76

One advantage of doing it with the K3 theorem is that you can get slightly stronger results and estimates about how fast the convergence goes. Theorem. (2.5.7. from Durret) If are iid with and 2 X1,... E(X1) = 0 E(X1 ) = σ2 < ∞ then: S n → 0 a.s. √ 1 + n log n 2 Remark. Compare this with the law of the iterated log: S √ lim sup √ n = σ 2 n→∞ n log log n

Proof. It suces, via the Kronecker lemma, to check that P √ Xn n log n1/2+ converges. By the K1 series theorem, it suces to check that the variances are summable. Indeed:  X  σ2 Var √ n ≤ n log n1/2+ n log1+2 which is indeed summable (e.g. Cauchy condensation test)!  9.32. Truncation + Sparsication I'm going to do some examples of problems where if you want to show that: X n → 1 cn

a.s. for some non-negative increasing sequences Xn, cn ≥ 0 it is enough to prove the result for a subsequence that has nk cnk+1 /cnk → 1 Remark. The advantage to looking along a subsequence is that along a sub- sequence we can use things like the Borel Cantelli lemma's more easily. A lot of times we use this to prove large number type laws of the form Xn/E(Xn) → 1 a.s. and this technique can be used to prove the SLLN.

Theorem. Let Xn be an increasing sequence of non-negative random variables and let cn be an increasing sequence of non-negative constants. Suppose that nk is a sub-sequence with: c nk → 1 as k → ∞ cnk+1 Then: X X nk → 1 a.s =⇒ n → 1 a.s. cnk cn

Proof. For any n nd the unique k so that nk ≤ n < nk+1. Then, since and are increasing and non-negative, we have that and Xn cn Xnk < Xn < Xnk+1 so then: cnk+1 > cn > cnk  c  X X X X c  X nk nk = nk < n < nk+1 = nk+1 nk+1 cnk+1 cnk cnk+1 cn cnk cnk cnk+1 c Now since nk → 1 as k → ∞, for any  > 0 we can nd K so large so that cnk+1 and for all and then have: cnk+1 /cnk < 1 +  cnk /cnk+1 > 1 −  k ≥ K X X X (1 − ) nk < n < (1 + ) nk+1 cnk cn cnk+1 9.32. TRUNCATION + SPARSIFICATION 77

X Now taking limsup and liminf of both sides, using the fact that nk → 1 a.s. cn we get that: k X X (1 − ) < lim inf n < lim sup n < (1 + ) a.s cn cn But since this holds for all  > 0 it must be that the limit exists and equals 1 a.s.  Remark. The main thing is the inequality that is used. The hypothesis can be relaxed in diernt ways, for example, suppose that for ever  > 0 there exists cnk Xnk a subsequence nk so that lim sup < 1 +  and → 1 a.s. Then the result cnk+1 cnk will still hold. Theorem. (2.3.8. From Durret Prob Theory and Examples) If are pairwise independent and P∞ then as A1,A2,... n=1 P (An) = ∞ n → ∞ :

n X 1Am m=1 n = 1 a.s. X P (Am) m=1 Proof. We use a Cheb inequality to control how far Pn can be Sn := m=1 1Am from its mean, and then we can use Borel Cantelli to get a.s. convergence along certain subsequences. Then the sparsication will nish it up for us. By Cheb, we have: 1 P (|Sn − E(Sn)| > δE(Sn)) ≤ 2 2 Var(Sn) δ E(Sn)

E(Sn) ≤ 2 2 δ E(Sn) 1 = 2 δ E(Sn) Where we have using independence that P P 2 Var(Sn) = Var(1Am ) ≤ E(1A ) = P P m P(Am) = E(Sn). Now by hypothesis, E(Sn) = P(Am) → ∞ so letting  2 nk = inf n : E(Sn) ≥ k gives us a subsequence along which:   Snk 1 P − 1 ≥ δ ≤ 2 2 E(Snk ) δ k

Snk So by Borel Cantelli, we know that → 1. (Pf 1) choose δ = δk → 0 E(Sn ) k n o Snk so that its still summabl and apply BC lemma once, get − 1 ≥ δk i.o E(Snk ) n S o is null Pf 2) For each δ we know that nk − 1 ≥ δ i.o is null and so taking E(Snk ) intersections wiht 1 shows us the a.s. convergence) δ = n Finally we notice that since , we have that E(Sn+1) − E(Sn) ≤ 1 cnk+1 /cnk ≤  2  2 (k + 1) + 1 /k → 1, so the sparication trick works out here. 

Example. If are increasing r.v.'s with E(Xn) and β Xn nα → a Var(Xn) ≤ Bn with , show that Xn almost surely. β < 2α nα → a 9.32. TRUNCATION + SPARSIFICATION 78

Proof. It suces to show that Xn−E(Xn) . Have: nα → 0 1 B P (|X − E(X )| > δnα) ≤ Var (X ) ≤ nβ−2α n n δ2n2α n δ2 c If we now look along a subsequnece nm = m with c chosen so large so that mc(β−2α) is summable (i.e. c > (2α − β)−1 will do by the p-test) then a standard α Xnm −E(Xnm ) nm+1 Borel Cantelli shows us that α → 0. Now we notice that α = nm nm (m+1)αc so our sparsifaction kicks in and fnishes the job for us. mαc → 1  P Example. If Xn are independent Poisson RV with E(Xn) = λn and if λn = P Sn ∞ then Sn = Xk has → 1 a.s. k≤n E(Sn) Proof. Similar to the above at rst.....I'm having a bit of a hiccup showing nm+1 that → 1 though in the case that the λns are very large though. nm 

Example. (Record Values) If X1,X2 ... is a sequence of iid random variables with no atoms, let  be the record at  event. (Think Ak = Xk > supj

Theorem. (Etememadi's SLLN) If Xn are identically distributed and pairwise independent and then −1 P a.s. E(|X|) < ∞ n k≤n Xk → E(X) Proof. By handling the positive and negative parts separately, we assume WOLOG that X ≥ 0. By ii) from the truncation lemma it suces to show that −1 P a.s. where . Actually, since n k≤n Yk → E(X) Yk = Xk1{Xk 1, let nm = bα c and we will restrict out attention to this subsequence. Consider:

     

X X 1 X E  Yk − E  Yk > δnm ≤ 2 2 Var  Yk δ nm k≤nm k≤nm k≤nm 1 X = 2 2 Var (Yk) δ nm k≤nm Now we sum over m and use Fubini to swich the order of the sum:     ∞ ∞ X X X 2 1 X 1 X E  Yk − E  Yk > δnm ≤ 2 2 Var (Yk) δ nm m=1 k≤nm k≤nm m=1 k≤nm ∞ 1 X X 1 = 2 Var(Yk) 2 δ nm k=1 m:nm>k 9.33. LEVY'S DOWNWARD THEOREM AND THE 0-1 LAW 79

P 1 1 Now we just have to estimate 2 ∼ 2 and then we will be able m:nm>k n k to apply the truncation lemma to see the convergencem we want. Indeed this is a P 1 P −2m −2 −2 geometric series so we have 2 = m α ≤ (1 − α )k nm>k n m:α >k Hence by the truncation lemma thism is summable, and we conclude by a stan- dard application of Borel cantelli that: P P  k≤n Yk − E k≤n Yk m m → 0 nm By E(Y ) ↑ E(X) this shows that n−1 P Y → E(X) which is exactly the n m k≤nm k convergence we want along the subsequence nm. We will now use our sparsication tools to strenthen the result to convergence −1 P . For convenience at this n k≤n Yk point, dene P and we desire to show that −1 for Wn = k≤n Yk nm Wnm → E(X) −1 every α > 1 =⇒ n Wn. Indeed, for any n > 0 nd the nm so that nm < n < nm+1 and consider that: W n W W W W n nm m = nm ≤ n ≤ nm+1 = nm+1 m+1 nm nm+1 nm+1 n nm nm+1 nm Now take and everywhere, use −1 and lim sup lim inf nm Wnm → E(X) nm/nm+1 → α we get: 1 W W E (X) ≤ lim inf n ≤ lim sup n ≤ αE(X) α n n But since this holds for any we get Wn as desired. α lim n = E(X)  9.33. Levy's Downward Theorem and the 0-1 Law This one requires the most tech: You need Levy's Downward theorem: Theorem. Suppose is a probability triple and that is (Ω, F, P) {G−n : n ∈ N} a collection of sub-σ−algebras of F such that: \ G−∞ := G−k ⊂ ... ⊂ G−(n+1) ⊂ G−n ⊂ ... ⊂ G−1 k (In other words, they are backwards of a normal ltration, which gets ner and ner, these get coarser and coarser.....except they are labeled with negative numbers) Let γ ∈ L1 and dene:

M−n := E (γ |G−n ) Then: 1 M−∞ := lim M−n exists a.s and in L and:

M−∞ = E (γ |G−∞ ) a.s. Remark. To prove this one from scratch, you would do:

• Upcrossing lemma • Martingale convergence theorem • Facts about uniform integrability  UI =⇒ L1 bounded  L1 conv ⇐⇒ UI and conv in prob  E(X |Fn ) is UI 9.34. ERGODIC THEOREM 80

We also need the Kolmogorov 0 − 1 law for this proof. (fortunatly this has an easy proof using the Levy upward theorem).

Theorem. (Kolmogorov 0-1 Law) If X1,X2,... are independent and A ∈ T = ∞ then or 1, ∩n=1σ (Xn,Xn+1,...) P(A) = 0 Proof. We will show that A is independent of itself. Firstly: a) If A ∈ σ (X1,...,Xk) and B ∈ σ (Xk+1,...) then A and B are independent. (This follows by some of those proofs using π systems) b) If A ∈ σ (X1,...) and B ∈ T then A, B are indep. By a), B ∈ T ⊂ σ (Xk+1,...) is independent of σ (X1,...,Xk) for each k. Hence, it will be independent of σ(X1,...) = ∪nσ(X1,...) Since T ⊂ σ (X1,...) part b) shows us that A is independent of itself. (Put A = B in the last theorem) and the result follows.  Remark. Here's another slightly slicker proof using monotone class theorem rather than π − λ: a) If B ∈ T then B is independent to σ (X1,...,Xn) b) Let A be the set of events that are independent of B. Check that its a montone class! (Follows by continuity of measure)

c) Since A contains each σ (X1,...,Xn), it contains the algebra ∪nσ (X1,...,Xn) (Remark: its not a sigma algebra! these are all the events that depend on netly many things)

d) By the monotone class theorem, we have that A actually contains σ (∪nσ (X1,...,Xn)) = σ (X1,...). Hence B ∈ A is independent of itself! Once we have all this tech though, the following sneaky trick proves the strong law right away: Theorem. Let be iid with . Let P . Then X1,X2,... E (|X|) < ∞ Sn = k≤n Xk −1 n Sn → E(X) a.s. Proof. Dene and T . By a symmetry G−n := σ (Sn,Sn+1,...) G−∞ = n G−n −1 argument, we see that E (X1 |G−n ) = n Sn. So by Levy's downward theorem, −1 1 n Sn = E (X1 |G−n ) → E (X1 |G−∞ ) = L a.s. and in L . We now argue that L  −1 must be a constant because the event n Sn = c is a tail event (can see that it does not depend on the rst N variables and this works for any N) and by the 0-1 −1 1 law it must therefore be constant. Since the convergence n Sn → L is in L we can check that: −1  L = E(L) = lim E n Sn = E(X) 

9.34. Ergodic Theorem Problem. The strong law of large numbers says that for an iid sequence

X1,X2,... with E |X1| < ∞ that:

N 1 X lim Xn = E(X1) a.s. N→∞ N n=1 Show how this is a consequence of the pointwise ergodic theorem. 9.35. NON-CONVERGENCE FOR INFINITE MEAN 81

Proof. Let µ be the law of the random variable X. Introduce the space Ω = N of sequences and put the product meausre R = {(x1, x2,...): xi ∈ R} µ × µ × ... on this space. (To be more technical: one should use the Caratheodory extension theorem starting with the algebra of cylinder sets and the premeasure given by nite products of µ. This also makes it clear that sigma algebra on this space is the sigma algebra generated by these cylinder sets. Since this is a standard rst theorem in a probability course, I will omit the details of this.) Dene f :Ω → R to be the rst random coordinate:

f (x1, x2, x3,...) = x1

Notice that the law of f is µ, so since E |X1| < ∞ is given, we know that f ∈ L1 (µ × µ × ...). Let T :Ω → Ω be the shift operator:

T (x1, x2, x3,...) = (x2, x3,...) Since we are working with the product measure, all the coordinates have the same distribution and T is a measure preserving transformation on (Ω, F, P). We now notice that T is ergodic. This will follow by the Kolmogorov-Zero-One law for iid sequences. Suppose A is is T -invariant. Then it is also T n-invariant, and consequently any such event A does not depend on the rst n coordinates of the point ω ∈ Ω. Since this holds for every n, we conclude that A is a tail event. (In the tail sigma algebra ∞ ∞ where is the sigma algebra of events σ (∩n=1σ (∪k=nFk)) Fk that depend only on the rst k coordinates.) The Kolmogorov-Zero-One law states that such events have either P(A) = 0 or 1 and hence T is ergodic. (Again, I omit the proof of the Kolmogorov-Zero-One law) Finally, notice that:

N N 1 X 1 X f (T nx) = (T nx) N N 1 n=1 n=1 N 1 X = x N n n=1 so by the pointwise ergodic theorem, applied to the ergodic transformation T and the L1 fucntion f on this space we know that: N 1 X µ×... a.s. x −−−−−−→ E(f) = E(X) N n n=1 This is exactly the statement of the strong law of large numbers! (One remark: I am using the statement of the ergodic theorem for a single transformation and am taking averages 1 PN n . In class we proved it T N n=1 f (T x) for Zd-systems. The proof for the Zd systems can be adapted for this simpler case I am using.)  9.35. Non-convergence for innite mean

Remark. If E(|X|) = ∞ then we should have lim sup |Sn| /n using the trick that |Xn| > n i.o and |Xn| > C =⇒ |Sn−1| / (n − 1) > C/2 or |Sn| /n > C. Ergodic Theorems

These are notes from Chapter 7 of [2].

Xn is siad to be a stationary sequence if for each k ≥ 1 it has the same distribution as the shifted sequnece Xn+k. The basic fact about these sequences, called the ergodic theorem is that if E (|f(X0)|) < ∞ then:

n−1 1 X lim f (Xm) exits a.s. n→∞ n m=0

If Xn is ergodic (a generalization of the notion of irreducibility for Markov chains) then the limit is E (f(X0))

10.36. Denitions and Examples

Definition. X0,X1,... is said to be a stationary sequence if for every k the shifted sequence {Xn+k : n ≥ 0} has the same distribution i.e. for each m the vectors (X0,...,Xm) and (Xk,...,Xk+m) have the same distribution. We begin by giving four examples that will be our constant companions:

Example. (7.1.1.) X0,X1,... are iid

Example. (7.1.2.) Let Xn be a Markov chain with transition probability p(x, A) and stationary distribution π, i.ie. π(A) = p(x, A)π(dx). If X0 has distribution π then X0,X1,... is a stationary sequence.´ A special case to keep in mind for this one is the deterministic Markov chain on the state space S = {A, B} with A moving to B with probability 1. The stationary distribution is 1 I.e. or π(A) = π(B) = 2 (X0,X1,...) = (A, B, A, B, . . .) (B, A, B, A, . . .) each with probability 1 . 2 Example. (7.1.3.) (Rotation of a circle) Let Ω = [0, 1), F =Borel subsets, P =Lebesgue measure. Let θ ∈ (0, 1) and for n ≥ 0 let Xn(ω) = (ω + nθ)mod 1 where x mod 1 := x − bxc is the decimal part of x. This is a special case of the deterministic shift Markov chain this time with innitely many states. (i.e. p(x, {y}) = 1 if y = (x + θ) mod 1 and 0 for all other sets)

Theorem. (7.1.1.) If is a stationary sequence and {0,1,...} X0,X1,... g : R → R is measurable then Yk = g (Xk,Xk+1,...) is a stationary sequence Proof. For {0,1,...} dene and for {0,1,...} x ∈ R gk(x) = g(xk, xk+1,...) B ∈ B let:

A = {x :(g0(x), g1(x),...) ∈ B}

82 10.36. DEFINITIONS AND EXAMPLES 83

To check stationarity now, we observe that:

P ((Y0,Y1,...) ∈ B) = P ((X0,X1,...) ∈ A)

= P ((Xk,Xk+1,...) ∈ A)

= P ((Yk,Yk+1,...) ∈ B) which proves the desired result.  Example. (7.1.4.) (Bernoulli Shifts) Let Ω = [0, 1) and F =Borel subsets, P =Lebesgue measure. Let Yn(ω) = 2Yn−1(ω) mod 1. This is a special case of 7.1.1, the iid sequence case, the connection being that between an innite sequence of iid coin ips and a in [0, 1) identied by the diadic decomposition. If we let be iid with 1 and we let X0,... P(Xi = 0) = P(Xi = 1) = 2 g(x) = P∞ −(i+1) then the identication (check that i=0 xi2 Yn = g (Xn,Xn+1,...) Yn = 2Yn−1 mod 1here) shows us that Yn is indeed stationary. This is also a special case of the determenistic Markov chain with p(x, {y}) = 1 for y = 2x mod 1 and 0 for all other sets. Examples 7.1.3 and 7.1.4 are special cases of the following situation:

Definition. Let (Ω, F, P) be a probability space. A map ϕ :Ω → Ω is called measure preserving if P ϕ−1A = P(A) for all A ∈ F. Example. (7.1.5.) Let ϕn be ϕ composed with itself n times. If X is a random n variable then Xn(ω) = X (ϕ ω) denes a stationary sequence. To check this let n+1 B ∈ R and A = {ω :(X0(ω),...Xn(ω)) ∈ B} then: k  P ((Xk,...,Xk+1) ∈ B) = P ϕ ω ∈ A = P(ω ∈ A) = P ((X0,...,Xn) ∈ B) Remark. The last exampe is more than an important example. In fact, it is the ONLY example! For if Y0,Y1,... is a stationary sequence taking valuies in a nice space, then the Kolmogorov Extension Thoerem allows us to construct a measure {0,1,...} {0,1,...) on the sequence space S , S so that the sequence Xn(ω) = ωn has the same distribution as that of {Yn, n ≥ 0}. If we let ϕ be the shift operator here, ϕ (ω1, ω2,...) = (ω1, ω2,...) and let X(ω) = ω0 then ϕ is measure preserving (this n follows sinc e Y is stationary) and Xn(ω) = X(ϕ ω).

Theorem. (7.1.2.) Any stationary sequence {Xn, n ≥ 0} can be embedded in a two-sided stationary sequence {Yn; n ∈ Z}

Proof. Dene: P (Y−m ∈ A0,...Yn ∈ Am+n) := P (X0 ∈ A0,...,Xm+n ∈ Am+n). Since Xn is a stationary sequence, this is a consist denition and the Kolmogorov extension theorem gives us the measure we want, is a shift. Yn(ω) = ωn  In view of the above observation, it suces to give our denitons and prove our results in the setting of Example 7.1.5.. Thus our set up is: (Ω, F, P) ≡ a probability space ϕ ≡ a measure preserving map n Xn(ω) = X (ϕ ω) where X is a r.v. Definition. Let ϕ be a measure preserving map. We say that a set A ∈ F is invariant for ϕ if ϕ−1A = A a.s. (i.e. the symmetric dierence of the two sets is a null set: some people call this almost invariant). We will say that a set B is invariant in the strict sense if ϕ−1B = B exactly. 10.36. DEFINITIONS AND EXAMPLES 84

Exercise. (7.1.1) The class of invariant events I is a σ−algebra and a random variable X is I measurable if and only ifX is an invariant , i.e. X ◦ ϕ = X a.s. Proof. Closed under countable unions and intersections is clear since ϕ−1 plays nice with these operations. If X ◦ ϕ = X a.s. then we have that for any Borel set B that X−1(B) = (X ◦ ϕ)−1 (B) = ϕ−1 X−1(B) which shows that every set −1 X (B) is an invariant set, i.e. σ(X) ⊂ I and X is I measurable.  Exercise. (7.1.2.) i) For any set dene ∞ −n . Then −1 A B = ∪n=0ϕ (A) ϕ (B) ⊂ B ii) If −1 and ∞ −n then −1 ϕ (B) ⊂ B C = ∩n=0ϕ (B) ϕ (C) = C iii) A set A is almost invariant ⇐⇒ there is a set C which is invariant in the strict sense and P (A4C) = 0 Proof. i) Have −1 −1 ∞ −n ∞ −n . ϕ (B) = ϕ (∪n=0ϕ (A)) = ∪n=1ϕ (A) ⊂ B ii) Since ϕ−1(B) ⊂ B, we conclude actually that ϕ−n(B) ⊂ ϕ−(n−1) (B) and so the sequnce B, ϕ−1(B), ϕ−2(B) ... is actually a decreasing sequence. Hence −n −1 −n C = limn→∞ ϕ (B) and we see that ϕ (C) = limn→∞ ϕ (B) too, so the two are equal. iii) ( =⇒ ) If A is invariant, then dene B and C as above. If A is almost invariant, then each ϕ−1(A) is a.s. equal to A and so the countable intersection B is a.s. equal to A too, and then the countable intersection C is a.s. equal to A too, so P(A4C) = 0. (By a.s. equal here I mean · a.s. equal ? ⇐⇒ P (·4?) = 0 ) (⇐=) Conversly if P (A4C) = 0 is invariant, then ϕ−1(A) = ϕ−1(C −A4C) = ϕ−1(C)−ϕ−1(A4C) = C−ϕ−1(A4C) which is eqaual to A a.s. since P(A4C) = 0 −1  and P ϕ (A4C) = P(A4C) = 0 too 

Definition. A measure preserving transformation on (Ω, F, P) is said to be ergodic if I is trivial, i.e. for all A ∈ I, P(A) ∈ {0, 1}. Equivalently, by the previous exercise, the only invariant random variables are a.s. constants. If ϕ is not ergodic, then the space can be split into two sets, A and Ac each of positive measure so that ϕ(A) = A and ϕ(Ac) = Ac. In other words, ϕ is not irreducible. Lets go back and see what this means for the examples we have:

Example. (7.1.6) (iid sequences) We begin by observing that if Ω = R{0,1,2...}and −1 ϕ is the shift operator, then an invaraint set A has A = ϕ (A) ∈ σ(X1,X2,...) iterating gives us that: ∞ the tails -algebra A ∈ ∩n=1σ(Xn,Xn+1,...) = T σ So I ⊂ T , by the Kolmogorov 0-1 law, T is indeed trivial so we know that the sequence is ergodic. Example. (7.1.7) (Markov Chains) We will show that the resulting stationary sequence and map is ergodic if and only if the Markov chain is irreducible. Supppose we have a Markov chain on a countable state space S with stationary distribution π(x) > 0. All the states are then recurrent and we can write S = ∪iRi where Ri are the disjoint irreducible closed sets. If X0 ∈ Ri then with probability one, Xn ∈ Ri for all n ≥ 1 so {ω : X0(ω) ∈ Ri} ∈ I. This shows that if the Markov chain is not irreducible, then the sequence is not ergodic. 10.36. DEFINITIONS AND EXAMPLES 85

To prove the converse, note that if A ∈ I, then 1A ◦ θn = 1A where θn is the shift by n operator. If we let Fn = σ (X0,...Xn) then the shift invariance of 1A and the Markov property imply that:

Eπ (1A |Fn ) = Eπ (1A ◦ θn |Fn ) = h(Xn)

where h(x) = Ex(1A). By Levy's 0-1 law, the LHS converges to 1A as n → ∞. .... I'm going to skip the rest of this for now...

Example. (7.1.8) Rotation of the circle is not Ergodic if θ = m/n where m < n are positive integers, for the set n−1 will be invariant for any set A = ∪k=0 (B + k/n) B ⊂ [0, 1/n] Coversly, if θ is irrational then ϕ IS ergodic. This relies on the unique writ- ing, for 2 functions, that P 2πikx in the 2 sense, and with L f(x) = k cke L ck = −2πikxd . Now P 2πik(x+θ) P 2πikθ 2πikxso this equals fe x f(ϕ(x)) = k cke = cke e 2ikθ f´(x) if and only if e = 1 for some k or else ck = 0. If θ is irrational, then we get that ck = 0 for all k 6= 0 and so f is a constant. Exercise. (7.1.3.) (A direct proof of ergodicitiy for irrational shifts)

Show that xn = nθ mod 1 is dense in [0, 1), then show that for a Borel set with |A| > 0 then for every δ > 0 there is an interval J s.t. |A ∩ J| > (1 − δ) |J| Example. (7.1.9) Bernoulli Shifts is ergodic. To prove this, we recall the representation that: P∞ −(m+1) where are iid coin ips. Yn = m=0 2 Xn+m X0,X1,. . . We then use the following fact:

Theorem. (7.1.3) Let {0,1,...} be measurable. If is an g : R → R X0,X1,... ergodic stationary sequence, then Yk = g (Xk,...) is ergodic too. Proof. The same set translation idea as before works, e.g. let:

A = {x :(g0(x), g1(x),...) ∈ B} And then observe:

P ((Y0,Y1,...) ∈ B) = P ((X0,X1,...) ∈ A)

= P ((Xk,Xk+1,...) ∈ A)

= P ((Yk,Yk+1,...) ∈ B) So if any and both are trivial together. IY 0s = IX0s  Exercise. (7.1.4.) Prove directly using a Fourier series type argument that the Bernoulli shift is ergodic.

Exercise. (7.1.5.) Continued Fractions. For any x ∈ (0, 1), we have the decomposition: 1 x = 1 a0 + a1+...

This is called the representation, we write x = [a0; a1; ...] for short. The way to compute these is by n with 1  1  1 an = bϕ xc ϕx = x − x = x mod 1. Check that preserves the given by 1 dx for . ϕ µ µ(A) = log 2 A 1+x A ⊂ (0, 1) ´ 10.37. BIRKHOFF'S ERGODIC THEOREM 86

Proof. Assume µ is given by integration against a density ρ. The requirement that −1  gives (from −1 S∞ 1 1 ): µ ϕ (a, b) = µ(a, b) ϕ (a, b) = n=1( b+n , a+n ) b ∞ 1 X a+n ρ(x)dx = ρ(x)dx ˆ ˆ 1 a n=1 b+n Taking the b derivative now gives us: ∞ X  1  1 ρ(x) = ρ x + n (x + n)2 n=1 From which we can verify that 1 is a solution. ρ(x) = 1+x 

10.37. Birkho's Ergodic Theorem Throughout this section, we will suppose that ϕ is a measure preserving trans- formation on (Ω, F, P) . The main theorem in this section is: Theorem. (7.2.1.) (Birkho's Ergodic Theorem) For any X ∈ L1:

n−1 1 X X (ϕmω) → E (X |I ) a.s. and in L1 n m=0 Remark. When the sequence is ergodic, I is trivial, so E(X |I) = E(X). The proof is not intuitive but none of the steps are dicult.

j  Lemma. (7.2.2.) (Maximal ergodic lemma) Let Xj(ω) = X ϕ ω . and Sk(ω) = P and then 0≤j≤k−1 Xj(ω) Mk(ω) = max (0,S1(ω),...,Sk(ω)) E (X; Mk > 0) ≥ 0

Proof. The idea is to compare X to something like Mk(ω) − Mk(ϕω) (we will use X(ω) = Sj+1(ω) − Sj(ϕω) to get the ball rolling) and then notice that this integrates to 0, (since ϕ is measure preserving) and is hence positive on the set c Mk > 0 (since Mk = 0 and Mk(ϕω) ≥ 0 on {Mk > 0} ...so really we are noticing it is non-positive on the complement and it integrates to zero.)

Specically, for j ≤ k we have Mk(ϕω) ≥ Sj(ϕω) so adding X(ω) gives:

X(ω) + Mk(ϕω) ≥ X(ω) + Sj(ϕω) = Sj+1(ω) So rearranging gives:

X(ω) ≥ Sj+1(ω) − Mk(ϕω) for j ≤ k

The result also holds for j = 0 since S0+1(ω) = X(ω) and Mk ≥ 0, so we taking the maximum over all these k + 1 inequalities gives us:

X(ω) ≥ max (S1(ω),...,Sk(ω)) − Mk(ϕω)

= Mk(ω) − Mk(ϕω) Now integrating gives:

E (X; M > 0) ≥ M (ω) − M (ϕω)dP k ˆ k k {Mk>0} ≥ M (ω) − M (ϕω)dP ˆ k k = 0 10.37. BIRKHOFF'S ERGODIC THEOREM 87

Where the second ≥ comes from the fact that Mk(ω) − Mk(ϕω) is non-positive c on the complement {Mk > 0} and the last = 0 comes from the fact that ϕ is measure preserving.  Theorem. (7.2.1.) (Birkho's Ergodic Theorem) For any X ∈ L1:

n−1 1 X X (ϕmω) → E (X |I ) a.s. and in L1 n m=0 Proof. (Idea) We will assume WOLOG that E(X |I ) = 0 (we can apply the theorem to Y = X − E(X |I ) and exploit that E (X |I ) is I is ϕ invari- ant, so it comes out of the sum). Let P . The idea is Sk(ω) = 0≤j≤k−1 Xj(ω) to look at ¯ 1 . Since this is invariant, it is measurable and con- X = lim sup n Sn I sequently the set {X¯ > } ∈ I which we use to our advantage. We want to show ¯ for all . If we let ∗ , then P(X > ) = 0  > 0 X = (X(ω) − ) 1{X>¯ } ∗     E X {X¯ > } = E X {X > } − P X¯ >  = E E (X |I ) {X > } −   ∗   P X¯ >  = 0 − P X¯ >  , so it suces to show that E X X¯ >  ≥ 0 which we will do with the maximal ergodic lemma. ...I am going to skip the details for now...  Theorem. (7.2.3.) Wiener's Maximal Inequality Let j and P . Let Xj(ω) = X(ϕ ω) Sk(ω) = 0≤j≤k−1 Xj(ω) Ak(ω) = Sk(ω)/k and Dk = max (A1,...,Ak). For α > 0 we have: −1 P (Dk > α) ≤ α E (|X|)

0 Proof. Let B = {Dk > α} apply the maximal ergodic lemma to X = X − α with 0 0 j , 0 P 0 and 0 0 0 Xj(ω) = X (ϕ ω) Sk(ω) = 0≤j≤k−1 Xj(ω) Mk = max (0,S1 ...,Sk) we conclude that 0 0 . Since 0 it follows that: E (X ; M > 0) ≥ 0 {Mk > 0} = {Dk > α}

(X − α)dP ≥ 0 ˆ {Dk>α} =⇒ XdP ≥ αdP = αP(D > α) ˆ ˆ k {Dk>α} {Dk>α} And we nally make the comparison E |X| ≥ XdP to get the result. {Dk>α} (not sure why this holds....) ´  Let's apply the ergodic theorem to our examples

Example. (7.2.1.) iid sequences. Since I is trivial, the ergodic theorem says:

n−1 1 X X → E(X ) a.s. and in L1 n m 0 m=0 The a.s. convergence is the strong law of large numbers

Remark. The L1 part of the strong law can be proven by the charaterization 1 Xn → X in L ⇐⇒ Xn → X in P and Xn is UI ⇐⇒ Xn → X in P and E(|Xn|) → E(|X|) the last bit is easy to check for the sequence of averages, look at the positive and negative parts seperatly. 10.39. A SUBADDITIVE ERGODIC THEOREM 88

Example. (7.2.2.) For irreducible Markov chains, this says that:

n−1 1 X X f(X ) → f(x)π(x) a.s. and in L1 n m m=0 x Example. (7.2.3.) (Rotation of the circle) When θ is irrational, the sequence is ergodic, and so I is trivial, we have: n−1 1 X 1 m → |A| n {ϕ ω∈A} m=0 Example. (7.2.4.) (Benfords Law) Let and take θ = log10 2 Ak = [log10 k, log10(k + 1)] taking x = 0 as the starting point for the shifts, we have: n−1 1 X k + 1 1 (ϕm(0)) → log n Ak 10 k m=0 Notice that the rst digit of m is if and only if 2 k mθ = m log10 2 mod 1 ∈ Ak (take logs to see this), so the above result tells us something about the distribution of the rst digit of 2m, namely that the long time averages of P(rst digit is k) tends to k+1  log10 k This is called Benfords law!

Example. (7.2.5.) (Bernoulli Shift) If Ω = [0, 1), ϕ(ω) = 2ω mod 1. Let −1 −k −k i1, . . . , ik ∈ {0, 1} and let r = i12 +...+ik2 and let X(ω) = 1 if r ≤ ω ≤ r+2 in other words X = 1 if the rst k binary digits match i1, . . . , ik. The ergodic theorem implies that: n−1 1 X X (ϕmω) → 2−k a.s. n m=0 i.e. in almost every ω ∈ [0, 1] the pattern i1, . . . , ik occurs with the expected frequencey. This is the denition of a normal number....this result says numbers all normal.

10.38. Recurrence I'm going to skip this section for now

10.39. A Subadditive Ergodic Theorem

Theorem. (7.4.1.) (Subadditive Ergodic Theorem) Suppose Xm,n, 0 ≤ m < n is a collection of random variables that satisfy:

i) X0,m + Xm,n ≥ X0,n ii) {Xnk,(n+1)k, n ≥ 1} is a stationary sequence for each k iii) The distribution of {Xm,m+k, k ≥ 1} does not depend on m iv) +  and for each , where E X0,1 < ∞ n E (X0,n) ≥ γ0n γ0 > −∞ Then: a) E(X0,n) E(X0,m) limn→∞ n = infm m =: γ b) X0,n exists a.s. and in 1so X = limn→∞ n L E(X) = γ c) If all the stationary sequences in ii) are ergodic, then X = γ a.s. Here are some examples for motivation. The validitiy of ii) and iii) is clear in each case, so we just verify i) and iv). 10.39. A SUBADDITIVE ERGODIC THEOREM 89

Example. (7.4.1.) (Stationary Sequences.) This example shows that Birkho's Ergodic theorem is contained as a special case of this ergodic theorem.

Suppose ξ1, ξ2,... is a stationary sequence with E |ξk| < ∞ and let Xm,n = ξm+1 + ... + ξn. Then X0,n = X0,m + Xm,n and iv) holds with γ0 = E(ξ). The conclusion of the subadditive ergodic theorem is exactly that 1 P limn→∞ n k≤n ξk exists a.s. and in L1, which is the statement of Birkho's Ergodic theorem. From here though I think you have to check by hand what the limit is. Remark. This is a good example of the feature that usual happens when apply the subadditive ergodic lemma:

Xn,m ≈ something up to size m − something up to size n or ≈ something happening between m and n

Which makes the condition X0,m + Xm,n look a lot like X0,n, and they can be easily compered. Another example, which we wont see for a bit is to let Xm,n be the passage time from (m, m) to (n, n) in a last passage percolation setting. X0,m + Xm,n is the passage time for paths from (0, 0) to (n, n) passing through the point (m, m) so it less than the passage time X0,n which is the time from (0, 0, ) to (n, n) without this restriction. Example. (7.4.3.) (Longest Common Subsequences) Given two ergodic sta- tionary subsequences and let s.t. and X1,... Y1,... Lm,n = max {K : Xik = Yjk m < i1 < i2 < . . . < iK ≤ n m < j1 < j2 < . . . < jK ≤ n} . It is clear that:

L0,m + Lm,n ≥ L0,n

So the subadditive theorem applies to Xm,n = −Lm,n. iv) holds since 0 ≤ Lm,n ≤ n a.s. The conclusion of the theorem gives us that: L L  0,n → γ = sup E 0,m n m≥1 m Remark. By mapping to the plane, you can make this look like some sort of highly dependent percolation ala the precognitive/psychic dude I saw a talk about once

Theorem. (7.4.1.) (Subadditive Ergodic Theorem) Suppose Xm,n, 0 ≤ m < n is a collection of random variables that satisfy:

i) X0,m + Xm,n ≥ X0,n ii) {Xnk,(n+1)k, n ≥ 1} is a stationary sequence for each k iii) The distribution of {Xm,m+k, k ≥ 1} does not depend on m iv) +  and for each , where E X0,1 < ∞ n E (X0,n) ≥ γ0n γ0 > −∞ Then: a) E(X0,n) E(X0,m) limn→∞ n = infm m =: γ b) X0,n exists a.s. and in 1so X = limn→∞ n L E(X) = γ c) If all the stationary sequences in ii) are ergodic, then X = γ a.s. Proof. There are four steps, the rst second and fourth are standarderidized, but there are many dierent proofs of the third step. Step 1: Check that and use this to show that E(X0,n) E |X0,n| ≤ Cn limn→∞ n = E(X0,m) using subadditivity of the sequence of expected values. infm m =: γ 10.40. APPLICATIONS 90

Step 2: Use Birkho's Ergodic thoerem to get a limit for the averages 1 k (X0,m + Xn,m+n + X2n,2n+m + ...) → E (X0,m |Im ) =: Am for each choice of m. We then use these Am's to control ¯ from which we can conclude that ¯ X := lim supn→∞ X0,n/n ≤ Am/m E(X) ≤ γ Step 3: Let X0,n and show that . X := lim infn→∞ n E (X) ≥ γ Once we have the sandwhich X ≤ X and the γ ≤ E(X) ≤ E(X) ≤ γ , we can conclude that X = X , which is exactly the converge a.e. we want. Step 4: Do some work to prove that the covnergence happens in L1 too. I might go over the details later  10.40. Applications

Example. (7.5.1.) (Products of random matrices) Suppose A1,... is a sta- tionary sequence of k × k matrices with positive entries and let:

αm,n(i, j) = (Am+1 ··· An)(i, j) be the i, j-th entry of the product. It is clear that:

α0,m(1, 1)αm,n(1, 1) ≤ α0,n(1, 1)

so if we let Xm,n = − log αm,n(1, 1) then X is subadditive. Example. (7.5.2.) (Increasing sequences in random permutations) Let π be a permutation of {1, 2, . . . n} and let `(π) be the length of the longest increasing sequence in π. Hammersely attacked this problem by putting a rate one Poisson proces in the plane and for s < t let Ys,t denote the length of the longest increasing path lying in the square Rs,t from (s, s) to (t, t). It is clear then that Y0,m +Ym,n ≤ Y0,n. Applying the subadditive ergodic theorem to −Y0,n (i.e. a superadditive ergodic theorem) shows that: Y Y  0,n → γ ≡ sup E 0,m a.s. n m≥1 m

Since Ynk,(n+1)k, n ≥ 0 is iid and hence ergodic, the limit is actually a constant! To get from this result to the result about random permutations, let τ(n) be the smallest value of for which there are points in . By a law of large t √ n R0,t numbers we will have that τ(n)/ n → 1 a.s., which allows us to convert from our Poissonized grid to the permutations. Large Deviations

These are notes from the rst few chapters of the famous book by Amir Dembo and Ofer Zeitouni [1] A large deviation principle (LDP) characterizes the limign behavious as  → 0 of a family of probability measures µ on (X , B) where X is a topological space and B is the Borel sigma albebra of the topology (smallest sigma algebra containing the open sets). We will suppose we are working in a metric space actually so that we can use sequences instead of nets.

Definition. A function f : X → R is called lower semicontinuous if the level sets Ψf (α) := {x : f(x) ≤ α} are closed subsets of X .

Proposition. f is lower semicontinous if and only if for all sequence xn → x0we have: lim inf f(xn) ≥ f(x0) n

Proof. (⇒) Suppose by contradiction that xn → x0 but lim infn f(xn) < f(x0). Then ∃0 > 0 so that lim infn f(xn) < f(x0) − 0. For any  > 0 the set {x : f(x) > f(x0) − } is an open set (by the def'n of lower semicontinuous) containing x0. Hence since xn → x0, xn ∈{x : f(x) > f(x0) − 0} eventually. But then lim infn f(xn) ≥ f(x0) − 0, a contradiction. (⇐) Fix an α and suppose {x : f(x) ≤ α} is non-empty. Given any convergent sequence yn ∈ {x : f(x) ≤ α} , yn → y we wish to show that y ∈ {x : f(x) ≤ α} too. Suppose by contradiction y∈ / {x : f(x) ≤ α}. Then there exists 0 > 0 so that f(y) > α + 0. Now, yn → y and the given property gives that lim infn f(yn) ≥ . But this is impossible as for every is given. f(y) > α + 0 f(yn) ≤ α n  Definition. A rate function I is a lower semicontinous mapping from I : X → [0, ∞].A good rate function is a rate function so that the level sets {x : f(x) ≤ α} are compact. The eective domain of I is the set DI := {x : I(x) < ∞}. When there is no confusion, we call DI the domain of I. Remark. If a rate function I is good, the sets {x : f(x) ≤ α} are compact. This means that infx {x : f(x) ≤ α} is achieved at some point x0.

Definition. A collection of probability measure µ are said to satisfy the large deviation principle with rate function I if for all A ∈ B we have:

− inf I(x) ≤ lim inf  log µ(A) x∈A◦ →0

≤ lim sup  log µ(A) ≤ − inf I(x) →0 x∈A¯ Definition. A set that satises is called an A infx∈A◦ I(x) = infx∈A¯ I(x) I-continuity set.

91 10.41. LDP FOR FINITE DIMESNIONAL SPACES 92

Remark. What's the deal with the A◦and A¯ in the above denition? Are they really needed? ◦ Suppose we replaced A by A. Then take some non-atomic measures i.e µ({x}) = 0 ∀x ∈ X . Then plugging in A = {x}, the LDP would give −I(x0) ≤ lim inf  log(0) = −∞ so I(x0) = ∞ .... since this holds for any {x0} we conclude I is a silly rate function! The form of the LDP codies a particularly convenient way of stating asymp- totic results that, on the one hand, are accurate enough to be useful and, on the other hand, are loose enough to be correct.

Remark. Since µ (X ) = 1, pluggin in A = X the whole space tells us that infx∈X I(x) = 0. When I is a good rate function, this means that there exists at least one point x for which I(x) = 0

Remark. The following two conditions together are equivalent to the LDP (just unravel some denitions) 1. For every α < ∞ and every measurable set A with A¯ ⊂ {x : I(x) > α}:=

lim sup  log µ(A) ≤ −α →0

2. For any x ∈ DI the eective domain of I and any measurable function A with x ∈ A◦:

lim inf  log µ(A) ≥ −I(x) →0

10.41. LDP for Finite Dimesnional Spaces Lets look at the simplest possible framework where we can have some large deviation results. 10.41. LDP FOR FINITE DIMESNIONAL SPACES 93

Definition. Table of denitions used in this section:

Σ = {a1, . . . , aN } : Underlying Alphabet with |Σ| = N elements

M1(Σ) : Probability measures on the alphabet Σ

M1(Σ) = {µ : P (Σ) → [0, 1] : µ is a probability measure} ( ) |N| and X M1(Σ) = v ∈ R : 0 ≤ vi ≤ 1∀i vi = 1 i

Σµ; µ ∈ M1(Σ) : Elements of Σthat µ takes with non-zero probability

Σµ = {ai : µ(ai) > 0} ⊂ Σ y n Empirical measure of a sequence , called the type of Ln; y = (y1, . . . , yn) ∈ Σ : y y n 1 X Ly(a ) = 1 (y ) n i n ai j j=1 y Fraction of occurances of in Ln(ai) = ai y Y Empircal measure of a Ln : Y iid with Y = (Y1,Y2,...,Yn) ,Yi ∼ µ

Ln : All possible types sequences of length n y for some Ln = {ν : ν = Ln y}

Tn(ν); ν ∈ Ln : The type class of ν n y Tn(ν) = {y ∈ Σ : Ln = ν}

10.41.1. Basic Results and Sanov's Theorem. Lemma. (Approximation lemma for sequence types) |Σ| a) |Ln| ≤ (n + 1) b)d d |Σ| V (ν, Ln) := infµ∈Ln V (ν, µ) ≤ 2n Here dV (µ, ν) is the total variation distance between two probability measures d V (µ, ν) := supA⊂Σ [ν(A) − µ(A)] Proof. To prove a): For any , we know that y for some . µ ∈ Ln µ = Ln y Hence, each  0 1 2 n , so there are at most choices for each µ(ai) ∈ n , n , n ,..., n n + 1 component of µ. This means that there are at most (n + 1)|Σ|choices for such a measure µ since there are |Σ|-components for the measure µ. To prove b): Since L contains all the probability vectors whose elements come from the set  0 1 n , for any measure we can always nd a n , n ,..., n ν ∈ M1 (Σ) measure so that 1 for each . The result follows by the µ ∈ Ln |µ(ai) − ν(ai)| ≤ 2n ai fact that: |Σ| 1 X d (µ, ν) = |ν(a ) − µ(a )| V 2 i i i=1 (See the notes on Markov chains for details on this one...its actually pretty simple if you draw some pictures!)  Remark. The result in a) can be improved by using the fact that the compo- nents µ(ai) of µ cannot all be chosen independently (since they must sum to 1). 10.41. LDP FOR FINITE DIMESNIONAL SPACES 94

For example, the last component is determined once the rst |Σ| − 1 components |Σ|−1 are xed, so we could improve the ineuality to |Ln| ≤ (n + 1) .

Remark. From this lemma we know that the size of Ln grows polynomial in n and that sets in Ln approximate all the measures in M1(Σ) uniformly and arbitartily well as n → ∞. Both of these properties will fail to hold when |Σ| = ∞.

Definition. The entropy of a probability vector ν ∈ M1(Σ) is:

|Σ| X H(ν) = − ν(ai) log (ν(ai)) i=1 |Σ| X  1  = ν(a ) log i ν(a ) i=1 i

Definition. The relative entropy of a probability vector ν with respect to another probability vector µ is:

|Σ|   X ν(ai) H(ν|µ) = ν(a ) log i µ(a ) i=1 i

For the purposes of handling 0's in this formula, we will take the convention that and 0  . 0 log(0) = 0 0 log 0 = 0

Remark. By application of Jensen's inequality to the convex function x log(x) , it is possible to verify that H(ν|µ) ≥ 0 with equality only when µ = ν. Also, notice that H(ν|µ) is nite whenever Σν ⊂ Σµ. Moreover, if we think of H(·|µ): M1(Σ) → , we see that is a continuous function on the compact set R H(·|µ) {ν :Σν ⊂ Σµ} ⊂ M1(Σ) because x log(x) is continuous for 0 ≤ x ≤ 1. Also H(·|µ) = ∞ for ν outside this set because we will have a ν(ai) when but . These are log( 0 ) ν(ai) 6= 0 µ(ai) = 0 some properties we expect of a good rate function!

We will now do some estimates relating the size of dierent sets (e.g. the number of sequences in a type class) to quantities involving these entropies.

Lemma. Choose a measure µ ∈ M1(Σ) and consider a sequence Y1,...,YN of i.i.d. µ random variables. (We wil use Pµwhen denoting involving these to remind us that the Y 's are µdistributed). Let ν ∈ Ln. If y ∈ Tn(ν) is in the type class of ν, then:

Pµ ((Y1,Y2,...,Yn) = y) = exp (−n [H(ν) + H(ν|µ)])

Proof. If , then there is an element with 1 that has Σν * Σµ ak ν(ak) ≥ n µ(ak) = 0. In this case, there is at least one instance of ak on the sequence y but there are NONE on the sequence (Y1,...,Yn) (almost surely). So the probability of the event we are interested in on the LHS is 0. The RHS is also 0 becuase when , so we have the desired equality. H(ν|µ) = ∞ Σν * Σµ 10.41. LDP FOR FINITE DIMESNIONAL SPACES 95

Otherwise, we may assume that Σν ⊂ Σµ and that H(ν|µ) < ∞. Notice that:

n Y Pµ ((Y1,...,Yn) = y) = Pµ (Yk = yk) by independence k=1

#{k:yk=a1} #{k:yk=a2} = (Pµ(Y = a1)) · (Pµ(Y = a2)) ... regroup terms |Σ| Y nν(ai) = µ(ai) since Y ∼ µand y ∈Tn(ν) i=1 Hence:

|Σ| X log (Pµ (Y1,...,Yn) = y) = n ν(ai) log (µ(ai)) i=1   |Σ|   X ν(ai) = n ν(a ) log (ν(a )) − ν(a ) log  i i i µ(a )  i=1 i = n (−H(ν) − H(ν|µ)) = −n (H(ν) + H(ν|µ))

Taking exp of both sides gives the desired result. 

Corollary. Pµ ((Y1,...,Yn) = y) = exp (−nH(µ))

Proof. Follows by the lemma since H(µ|µ) = 0. 

Lemma. For every ν ∈ Ln:

−|Σ| (n + 1) exp (nH(ν)) ≤ |Tn(ν)| ≤ exp (nH(ν)) Proof. The upper bound is what you get when you take our last corollary and bound the probability by 1:

1 ≥ Pν ((Y1,...,Yn) ∈ Tn(ν)) X = Pν ((Y1,...,Yn) = y)

y∈Tn(ν)

= |Tn(ν)| exp (−nH(ν))

To prove the lower bound, we rst aim to prove that for any measure µ ∈ Ln that the empirical measure Y of the sequence has: Ln (Y1,...,Yn)

Y  Y  Pν Ln = ν ≥ Pν Ln = µ

(Note: Y is a bit like but not quite because Pν (Ln = µ) Pν ((Y1,...,Yn) = y) the former does NOT care about the order of the Y1,...,Yn, while in the latter the order does matter. Hence there will be some factors counting the number of ways to rearrange the Y 's that will arise here. These factors are the additional diculty we must overcome here) When the probability on the RHS is 0 and the Σµ * Σν 10.41. LDP FOR FINITE DIMESNIONAL SPACES 96 result holds. Otherwise, consider:

Y  P P L = ν Pν ((Y1,...,Yn) = y) ν n = y∈Tn(ν) P (LY = µ) P P ((Y ,...,Y ) = y) ν n y∈Tn(µ) ν 1 n |Σ| |T (ν)| Q ν(a )nν(ai) = n i=1 i by the previous lemma |Σ| Q nµ(ai) |Tn(µ)| i=1 ν(ai) |Σ| Y (nµ(ai))! = ν(a )n(ν(ai)−µ(ai)) by a counting argument (nν(a ))! i i=1 i

The last expression is a product of terms of the form m! ` `−m with `! n ` = nν(ai) and m = nµ(ai). Considering the cases m ≥ ` and m < ` separately, it is easily veried that m! (m−`) always holds. Hence m! ` `−m m−` so we have: `! ≥ ` `! n ≥ n

Y  |Σ| Pν Ln = ν Y P P ≥ nn(µ(ai)−ν(ai)) = nn( µ(ai)− ν(ai)) = nn(1−1) = 1 P (LY = µ) ν n i=1 Which proves the desired mini-result that Y  Y . Fi- Pν Ln = ν ≥ Pν Ln = µ nally, to get the lower bound on the inequality of the lemma, we have:

X Y  1 = Pν Ln = µ

µ∈Ln Y  ≤ |Ln| Pν Ln = ν

= |Ln| |Tn(ν)| exp (−nH(ν))

|Σ| So rearranging and using the bound |Ln| ≤ (n + 1) gives the desired lower bound for . |Tn(ν)| 

Lemma. For any ν ∈ Ln we have:

−|Σ| Y  (n + 1) exp (−nH(ν|µ)) ≤ Pµ Ln = ν ≤ exp (−nH(ν|µ)) Proof. We have:

Y  X Pµ Ln = ν = P ((Y1,...,Yn) = y)

y∈Tn(ν)

= |Tn(ν)| exp (−n (H(ν) + H(ν|µ))

−|Σ| So the bounds (n + 1) exp (nH(ν)) ≤ |Tn(ν)| ≤ exp (nH(ν)) give exactly our desired result. 

Theorem. (Sanov's Theorem for Finite Alphabets) For every Γ ⊂ M1(Σ) we have:

1 Y  − inf H(ν|µ) ≤ lim inf log Pµ Ln ∈ Σ ν∈Γ◦ n→∞ n 1 Y  ≤ lim sup log Pµ Ln ∈ Σ ≤ − inf H(ν|µ) n→∞ n ν∈Γ 10.41. LDP FOR FINITE DIMESNIONAL SPACES 97

Proof. By the lemmas, we have the upper bound:

Y  X Y  Pµ Ln ∈ Γ = Pµ Ln = ν

ν∈Γ∩Ln X ≤ exp (−nH(ν|µ))

ν∈Γ∩Ln X   ≤ exp −n inf H(ν|µ) ν∈Γ∩Ln ν∈Γ∩Ln   ≤ |Γ ∩ Ln| exp −n inf H(ν|µ) ν∈Γ∩Ln   ≤ |Ln| exp −n inf H(ν|µ) ν∈Γ∩Ln   ≤ (n + 1)|Σ| exp −n inf H(ν|µ) ν∈Γ∩Ln The accompanying lower bound is:

Y  X Y  Pµ Ln ∈ Γ = Pµ Ln = ν

ν∈Γ∩Ln X ≥ (n + 1)−|Σ| exp (−nH(ν|µ))

ν∈Γ∩Ln   X ≥ (n + 1)−|Σ| exp −n inf H(ν|µ) + 0 ν∈Γ∩Ln ν∈Γ∩Ln Now, since 1 |Σ| , this term has no contribution to the limn→∞ n log (n + 1) = 0 nal logarthimic limit. So we have, by taking log and then limsup of these two inequalities (recall: a lim sup(−xn) = − lim inf(xn)):   1 Y  lim sup log Pµ Ln ∈ Γ ≤ − lim inf inf H(ν|µ) n→∞ n n→∞ ν∈Γ∩Ln   1 Y  lim sup log Pµ Ln ∈ Γ ≥ − lim inf inf H(ν|µ) + 0 n→∞ n n→∞ ν∈Γ∩Ln Since these are equal, it must be that 1 Y  . lim supn→∞ n log Pµ Ln ∈ Γ = − lim infn→∞ {infν∈Γ∩Ln H(ν|µ)} Similarly, by taking lim inf of our two inequalities we get we get:   1 Y  lim inf log Pµ Ln ∈ Γ = − lim sup inf H(ν|µ) n→∞ n n→∞ ν∈Γ∩Ln

It remains only to argue that ◦ − lim supn→∞ {infν∈Γ∩Ln H(ν|µ)} ≥ − infν∈Γ H(ν|µ) and . The latter is trivial be- − lim infn→∞ {infν∈Γ∩Ln H(ν|µ)} ≤ − infν∈Γ H(ν|µ) cause for every , so always holds. Γ ∩ Ln ⊂ Γ n infν∈Γ∩Ln H(ν|µ) ≥ infν∈Γ H(ν|µ) The former holds because of the fact we saw earlier that Ln approximates any ◦ set Γ uniformly well, in the sense of dV . To be precise, take any ν ∈ Γ with 0 0 Σν ⊂ Σµ and let δ so small so that {ν : dV (ν, ν ) < δ} ⊂ Γ. Take N so large now so that |Σ| /2n < δ for n > N and then we will be able to nd νn ∈ Ln so 0 0 that νn ∈ {ν : dV (ν, ν ) < δ} ⊂ Γ. Doing this for every n > N, we can obtain a sequence νn → ν as n → ∞ in the dV sense. For this sequence, we have (keeping 10.41. LDP FOR FINITE DIMESNIONAL SPACES 98 in mind that H(·|µ) is continuous here:   0 − lim sup inf H(ν |µ) ≥ − lim H(νn|µ) = −H(ν|µ) 0 n→∞ ν ∈Γ∩Ln n→∞ The inequality also holds trivially for points ◦ with because in this ν ∈ Γ Σν * Σµ case the RHS is −∞. Since this holds for every point ν ∈ Γ◦, taking inf over the LHS gives us the desired result.  Exercise. Prove that for every open set Γ that   1 Y  − lim inf H(ν|µ) = lim log Pµ Ln ∈ Γ = − inf H(ν|µ) n→∞ ν∈Γ∩Ln n→∞ n ν∈Γ If Γ is an open set, then Γ◦ = Γ . Hence, looking at the statement of the theorem, we notice that the LHS and RHS are equal, so we must have equality everywhere (its an equality sandwhich!). Hence 1 Y  limn→∞ n log Pµ Ln ∈ Γ = − infν∈Γ H(ν|µ). (Rmk: For a general large deviation principle, one side of the inequality is Γ¯ and the other side is Γ◦ so the fact that the limit exists in this case is something special that doesnt happen in general).

Exercise. Prove that if Γ is a subset of {ν ∈ M_1(Σ) : Σ_ν ⊂ Σ_µ} and Γ ⊂ cl(Γ°) (the closure of its interior), then the same conclusion as in the previous exercise holds. Moreover, show that there is a ν* ∈ cl(Γ°) that achieves the minimum of H(·|µ) in this case.

Suppose Γ ⊂ {ν ∈ M_1(Σ) : Σ_ν ⊂ Σ_µ} and Γ ⊂ cl(Γ°). Since H(·|µ) is continuous on the set {ν ∈ M_1(Σ) : Σ_ν ⊂ Σ_µ}, we have inf_{ν∈Γ} H(ν|µ) ≥ inf_{ν∈cl(Γ°)} H(ν|µ) = inf_{ν∈Γ°} H(ν|µ), the last equality by continuity. This again completes the sandwich in this case, so we know lim_{n→∞} (1/n) log P_µ(L_n^Y ∈ Γ) = −inf_{ν∈Γ} H(ν|µ). Moreover, since cl(Γ°) is compact here, there must be a minimizing element ν* ∈ cl(Γ°) with H(ν*|µ) = inf_{ν∈cl(Γ°)} H(ν|µ) = inf_{ν∈Γ°} H(ν|µ).

Exercise. Assume that Σ_µ = Σ and that Γ is a convex subset of M_1(Σ) with non-empty interior. Prove that all of the conclusions of the previous exercise apply. Moreover, prove that the point ν* is unique.

First we claim that in this case Γ ⊂ cl(Γ°), so we can apply the last exercise. If this is not the case, then there is a ν ∈ Γ with ν ∉ cl(Γ°). Take any ν_0 ∈ Γ° and consider the line ν_t = tν_0 + (1−t)ν. It is sufficient to show that ν_t ∈ Γ° for every t > 0, for then ν = lim_{t→0} ν_t is a limit point of Γ° and we have our contradiction. To see that ν_t ∈ Γ°, find ε_0 > 0 so small that B(ν_0, ε_0) ⊂ Γ°. For every ρ ∈ B(ν_0, ε_0) and every 0 < s ≤ 1, the point sρ + (1−s)ν is contained in Γ by convexity. Finally, we notice that the collection of points {sρ + (1−s)ν : 0 < s ≤ 1, ρ ∈ B(ν_0, ε_0)} contains a small neighbourhood, of radius on the order of tε_0, around the point ν_t. (Draw a picture to see the geometry here.) Hence each ν_t ∈ Γ° as desired.

To see that the minimizer ν* is unique, it suffices to prove that H(·|µ) is strictly convex: for if H(·|µ) is strictly convex, then taking a convex combination of two different minimizing points ν_1* and ν_2* would yield an even smaller value, H(tν_1* + (1−t)ν_2* | µ) < tH(ν_1*|µ) + (1−t)H(ν_2*|µ), which contradicts that ν_1*, ν_2* are minimizing.

Lemma. H(·|µ) is strictly convex.

Proof. We will actually prove a slightly stronger statement, namely that

H(tν_1 + (1−t)ν_2 | tµ_1 + (1−t)µ_2) ≤ tH(ν_1|µ_1) + (1−t)H(ν_2|µ_2)

with equality only when ν_1µ_2 = ν_2µ_1. Setting µ_1 = µ_2 = µ will prove the strict convexity of H(·|µ). The result is a consequence of the following ("log-sum") inequality. Let x_1, ..., x_n and y_1, ..., y_n be non-negative numbers. Then the following inequality holds, with equality only if x_i/y_i is constant:

Σ_i x_i log(x_i/y_i) ≥ (Σ_i x_i) log( (Σ_i x_i) / (Σ_i y_i) )

This holds by using the fact that f(z) = z log z is strictly convex, so by Jensen's inequality with weights y_i / Σ_j y_j:

( Σ_i y_i f(x_i/y_i) ) / ( Σ_i y_i ) ≥ f( (Σ_i x_i) / (Σ_i y_i) )

To prove the inequality for H(·|·) now, apply the above inequality with n = 2 and x_1 = tν_1(a_i), x_2 = (1−t)ν_2(a_i), y_1 = tµ_1(a_i), y_2 = (1−t)µ_2(a_i). We get:

tν_1(a_i) log( tν_1(a_i) / tµ_1(a_i) ) + (1−t)ν_2(a_i) log( (1−t)ν_2(a_i) / (1−t)µ_2(a_i) ) ≥ ( tν_1(a_i) + (1−t)ν_2(a_i) ) log( ( tν_1(a_i) + (1−t)ν_2(a_i) ) / ( tµ_1(a_i) + (1−t)µ_2(a_i) ) )

Summing over a_i now gives exactly H(tν_1 + (1−t)ν_2 | tµ_1 + (1−t)µ_2) ≤ tH(ν_1|µ_1) + (1−t)H(ν_2|µ_2), as desired. □
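A quick numerical spot-check of the strict convexity just proved, on randomly generated measures (the alphabet size and random seed are arbitrary choices):

import math, random

def H(nu, mu):   # relative entropy on a finite alphabet, with 0 log 0 = 0
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def random_prob_vector(k):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
mu = random_prob_vector(4)
nu1, nu2 = random_prob_vector(4), random_prob_vector(4)
for t in (0.25, 0.5, 0.75):
    mix = [t * a + (1 - t) * b for a, b in zip(nu1, nu2)]
    lhs = H(mix, mu)
    rhs = t * H(nu1, mu) + (1 - t) * H(nu2, mu)
    print(round(lhs, 6), "<", round(rhs, 6), lhs < rhs)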

10.41.3. Cramer's Theorem for Finite Alphabets in R.

Definition. Fix a function f : Σ → R and, in the language of the last section where Y_i ∼ µ iid, let X_i = f(Y_i). Without loss of generality assume further that Σ = Σ_µ and that f(a_1) < ... < f(a_{|Σ|}). Let Ŝ_n = (1/n) Σ_{j=1}^n X_j be the average of the first n random variables. Cramer's theorem deals with the LDP associated with the real-valued random variables Ŝ_n.

Remark. In the setting here, the random variables X_i take values in the compact interval K = [f(a_1), f(a_{|Σ|})]. Consequently, so does the normalized average Ŝ_n. Moreover, by our definitions:

Ŝ_n = Σ_{i=1}^{|Σ|} f(a_i) L_n^Y(a_i) = ⟨f, L_n^Y⟩, where f := (f(a_1), ..., f(a_{|Σ|}))

Hence, given any A ⊂ R, we can associate a set Γ_A ⊂ M_1(Σ) by Γ_A = {ν : ⟨f, ν⟩ ∈ A}. By the above and this definition, we have:

Ŝ_n ∈ A ⟺ L_n^Y ∈ Γ_A

This naturally gives rise to an LDP for the averages Ŝ_n:

Theorem. (Cramer's Theorem for finite subsets of R) For any set A ⊂ R we have:

−inf_{ν∈Γ_{A°}} H(ν|µ) = −inf_{x∈A°} I(x) ≤ liminf_{n→∞} (1/n) log P_µ(Ŝ_n ∈ A)
≤ limsup_{n→∞} (1/n) log P_µ(Ŝ_n ∈ A) ≤ −inf_{x∈A} I(x) = −inf_{ν∈Γ_A} H(ν|µ)

with I(x) = inf_{ν : ⟨f,ν⟩ = x} H(ν|µ). One can verify that I(x) is continuous at x ∈ K and that it satisfies there:

I(x) = sup_{λ∈R} { λx − Λ(λ) }, where Λ(λ) = log( Σ_{i=1}^{|Σ|} µ(a_i) e^{λf(a_i)} )

Proof. When the set A is open, so is the set ΓA, and the bounds directly follow. To get the fact about Λ, consider as follows. By Jensen's inequality, for any measure ν, we have:

Λ(λ) = log( Σ_{i=1}^{|Σ|} µ(a_i) e^{λf(a_i)} )
= log( Σ_{i=1}^{|Σ|} ν(a_i) · ( µ(a_i) e^{λf(a_i)} / ν(a_i) ) )
≥ Σ_{i=1}^{|Σ|} ν(a_i) log( µ(a_i) e^{λf(a_i)} / ν(a_i) )
= Σ_{i=1}^{|Σ|} ν(a_i) [ log( µ(a_i)/ν(a_i) ) + λf(a_i) ]
= λ⟨f, ν⟩ − H(ν|µ)

Since log is strictly concave, equality holds exactly when all the terms µ(a_i)e^{λf(a_i)}/ν(a_i) are equal. We will denote the special measure ν that achieves this by ν_λ, which has ν_λ(a_i) = µ(a_i) exp(λf(a_i) − Λ(λ)). Rearranging the above inequality to isolate H(ν|µ) and then taking the infimum over {ν : ⟨f,ν⟩ = x} gives:

I(x) = inf_{ν : ⟨f,ν⟩ = x} H(ν|µ) ≥ λx − Λ(λ) for every x

with equality at x = ⟨f, ν_λ⟩, i.e. I(⟨f, ν_λ⟩) = λ⟨f, ν_λ⟩ − Λ(λ) for every λ.

Now, the function Λ(λ) is differentiable with Λ'(λ) = ⟨f, ν_λ⟩. Fix an x_0 in {Λ'(λ) : λ ∈ R}, say x_0 = Λ'(λ_0) = ⟨f, ν_{λ_0}⟩. Consider the optimization problem sup_{λ∈R} { λx_0 − Λ(λ) }. By taking derivatives, it is clear the maximum occurs at λ = λ_0, and the optimal value here is λ_0⟨f, ν_{λ_0}⟩ − Λ(λ_0). This is equal to I(x_0) by our earlier remark! This argument establishes that I(x) = sup_{λ∈R} { λx − Λ(λ) } for every x ∈ {Λ'(λ) : λ ∈ R}.

We will now show that K° = (f(a_1), f(a_{|Σ|})) ⊂ {Λ'(λ) : λ ∈ R}. First we notice that Λ'(·) is strictly increasing (indeed Λ''(λ) = Var_{ν_λ}(f) > 0, since f is non-constant on Σ = Σ_µ, so Λ is strictly convex). Notice moreover from Λ'(λ) = ⟨f, ν_λ⟩ that f(a_1) ≤ Λ'(λ) ≤ f(a_{|Σ|}). Moreover, since ν_λ(a_i) ∝ µ(a_i) exp(λf(a_i)), by taking λ → −∞ or λ → +∞ we can achieve Λ'(λ) = ⟨f, ν_λ⟩ → f(a_1) and → f(a_{|Σ|}) respectively. By continuity, therefore, Λ' achieves every value in the open interval K° = (f(a_1), f(a_{|Σ|})).

To include the endpoints, set x = f(a_1) and let ν* be the point mass with ν*(a_1) = 1, so that ⟨f, ν*⟩ = f(a_1) = x. We have H(ν*|µ) = −log µ(a_1), and then the string of inequalities:

−log µ(a_1) = H(ν*|µ) ≥ I(x) ≥ sup_{λ∈R} { λx − Λ(λ) } ≥ lim_{λ→−∞} ( λx − Λ(λ) ) = −log µ(a_1)

So we get an equality sandwich and conclude that I(x) = sup_λ { λx − Λ(λ) } for x = f(a_1). The same argument shows that the point x = f(a_{|Σ|}) works too. Putting together all these arguments, we indeed recover the desired result for x anywhere in the whole interval K = [f(a_1), f(a_{|Σ|})]. □
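The identity I(x) = sup_λ {λx − Λ(λ)} and the role of the tilted measures ν_λ can be checked numerically. The sketch below (a made-up three-point alphabet; the sup is taken by grid search, so the match is only up to grid resolution) verifies that I(⟨f, ν_λ⟩) = λ⟨f, ν_λ⟩ − Λ(λ) = H(ν_λ|µ), and that I vanishes at the mean ⟨f, µ⟩.

import math
import numpy as np

mu = np.array([0.2, 0.5, 0.3])        # made-up law on a three-letter alphabet
f = np.array([-1.0, 0.0, 2.0])        # f(a1) < f(a2) < f(a3)

def Lambda(lam):
    return math.log(np.sum(mu * np.exp(lam * f)))

lams = np.linspace(-10, 10, 4001)
Lams = np.array([Lambda(l) for l in lams])

def I(x):                              # Legendre transform by grid search
    return float(np.max(lams * x - Lams))

# For each lambda, the tilted measure nu_lambda should satisfy
# I(<f,nu_lambda>) = lambda*<f,nu_lambda> - Lambda(lambda) = H(nu_lambda | mu).
for lam in (-1.0, 0.0, 1.5):
    nu = mu * np.exp(lam * f - Lambda(lam))
    x = float(np.dot(f, nu))
    H = float(np.sum(nu * np.log(nu / mu)))
    print(round(I(x), 4), round(lam * x - Lambda(lam), 4), round(H, 4))

print("I at the mean:", round(I(float(np.dot(f, mu))), 6))   # should be ~0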

10.41.4. Gibbs Conditioning Principle. We know that Ŝ_n → ⟨f, µ⟩ as n → ∞, and Cramer's theorem quantifies the rate for us. Let us ask the following question now: given some set A ⊂ R, what does the event {Ŝ_n ∈ A} look like? When A does not contain the mean ⟨f, µ⟩, this is a rare event whose probability is very small, so when conditioning on it we will get some interesting behaviour. To be precise, we will look at the conditional law of the constituent random variables Y_1, ..., Y_n that make up Ŝ_n when we condition on Ŝ_n ∈ A:

µ_n*(a_i) = P_µ( Y_1 = a_i | Ŝ_n ∈ A ), i = 1, ..., |Σ|

By the symmetry of switching indices i ↔ j, we expect that Y_i and Y_j will be identically distributed (although not independent) when we condition like this. For this reason, we restrict our attention to Y_1 as above. Notice that for any function φ : Σ → R:

⟨φ, µ_n*⟩ = E[ φ(Y_1) | Ŝ_n ∈ A ]
= E[ φ(Y_j) | Ŝ_n ∈ A ] for any j = 1, ..., n
= E[ (1/n) Σ_{j=1}^n φ(Y_j) | Ŝ_n ∈ A ]
= E[ ⟨φ, L_n^Y⟩ | ⟨f, L_n^Y⟩ ∈ A ] by writing Ŝ_n = ⟨f, L_n^Y⟩

If we let Γ = {ν : ⟨f, ν⟩ ∈ A}, then this can be written simply as:

µ_n* = E[ L_n^Y | L_n^Y ∈ Γ ]

We will address the question of what the possible limits of this conditional law µ_n* are as n → ∞. This is called the Gibbs Conditioning Principle.

Theorem. (Gibbs Conditioning Principle) Suppose that we have a set Γ for which I_Γ := inf_{ν∈Γ°} H(ν|µ) = inf_{ν∈cl(Γ)} H(ν|µ). Define the set of measures that minimize the relative entropy:

M := { ν ∈ cl(Γ) : H(ν|µ) = I_Γ }

Then:
a) All the limit points of {µ_n*} belong to co(M), the closure of the convex hull of M.
b) When Γ is a convex set of non-empty interior, the set M consists of a single point, to which µ_n* converges as n → ∞.

Remark. a) The condition that the two infima defining I_Γ agree, or the condition that Γ be convex, might seem strange. However, the earlier exercises give natural situations in which the equality defining I_Γ holds; also, by an earlier exercise, when Γ is convex with non-empty interior there is a unique minimizing element of H(·|µ), i.e. M = {ν*}.
b) The result is fairly intuitive: by Sanov's theorem, conditionally on {L_n^Y ∈ Γ} the probability that L_n^Y is near a measure ν ∈ Γ with H(ν|µ) strictly larger than I_Γ is exponentially small, so the conditioned empirical measure should concentrate near the minimizing set M (Claim 2 in the proof below makes exactly this precise).
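Here is a Monte Carlo illustration of the principle (a toy example with made-up parameters, not from the sources). For Γ = {ν : ⟨f, ν⟩ ≥ a} with a above the mean, the previous subsection suggests that the unique entropy minimizer is the tilted measure ν_λ with ⟨f, ν_λ⟩ = a, so the conditional law of Y_1 given {Ŝ_n ≥ a} should be approximately that tilted measure for moderately large n:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.3])       # made-up law; mean of f under mu is 0.4
f = np.array([-1.0, 0.0, 2.0])
a = 0.8                              # conditioning event {S_n >= a}, moderately rare
n, trials = 30, 200_000

def tilted(lam):                     # nu_lambda, the exponentially tilted measure
    w = mu * np.exp(lam * f)
    return w / w.sum()

# bisection for lambda with <f, nu_lambda> = a (the map is increasing in lambda)
lo, hi = 0.0, 20.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.dot(f, tilted(mid)) < a else (lo, mid)
nu_star = tilted(0.5 * (lo + hi))

samples = rng.choice(len(mu), size=(trials, n), p=mu)    # indices of letters a_i
S = f[samples].mean(axis=1)
hit = S >= a
cond_law_Y1 = np.bincount(samples[hit, 0], minlength=len(mu)) / hit.sum()
print("conditional law of Y1:", np.round(cond_law_Y1, 3))
print("tilted measure nu*   :", np.round(nu_star, 3))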

Proof. Firstly, we notice that part a) ⟹ part b): when Γ is a convex set of non-empty interior, we know from an earlier exercise that M consists of a single point. By compactness, every subsequence of µ_n* must have a sub-subsequence converging to something, and the only candidate is the single point in M by part a). Hence µ_n* converges to the single point in M too. To prove part a), we break up the main ideas into two claims.

Claim 1: For any U ⊂ M_1(Σ), we have that d_V( µ_n*, co(U) ) ≤ P_µ( L_n^Y ∈ U^c | L_n^Y ∈ Γ ).

Pf: For any U, we have:

E[L_n^Y | L_n^Y ∈ Γ] − E[L_n^Y | L_n^Y ∈ U∩Γ]
= E[L_n^Y | L_n^Y ∈ Γ∩U] P(L_n^Y ∈ U | L_n^Y ∈ Γ) + E[L_n^Y | L_n^Y ∈ Γ∩U^c] P(L_n^Y ∈ U^c | L_n^Y ∈ Γ) − E[L_n^Y | L_n^Y ∈ U∩Γ]
= E[L_n^Y | L_n^Y ∈ Γ∩U] ( P(L_n^Y ∈ U∩Γ)/P(L_n^Y ∈ Γ) − 1 ) + E[L_n^Y | L_n^Y ∈ Γ∩U^c] P(L_n^Y ∈ U^c | L_n^Y ∈ Γ)
= −E[L_n^Y | L_n^Y ∈ Γ∩U] P(L_n^Y ∈ U^c∩Γ)/P(L_n^Y ∈ Γ) + E[L_n^Y | L_n^Y ∈ Γ∩U^c] P(L_n^Y ∈ U^c | L_n^Y ∈ Γ)
= P(L_n^Y ∈ U^c | L_n^Y ∈ Γ) ( E[L_n^Y | L_n^Y ∈ Γ∩U^c] − E[L_n^Y | L_n^Y ∈ Γ∩U] )

Hence, since d_V(α, β) = (1/2) Σ_{i=1}^{|Σ|} |α(a_i) − β(a_i)| depends only on the difference between the measures, we can factor P(L_n^Y ∈ U^c | L_n^Y ∈ Γ) out of any such formulas. Now use that E[L_n^Y | L_n^Y ∈ Γ∩U] ∈ co(U) (NOTE: this is where the convex hull comes in — if you take the average of some measures in a set U, you might escape the set U, but you will still be in co(U) since averaging is like a convex combination) and that µ_n* = E[L_n^Y | L_n^Y ∈ Γ] to get:

d_V( µ_n*, co(U) ) ≤ d_V( E[L_n^Y | L_n^Y ∈ Γ], E[L_n^Y | L_n^Y ∈ Γ∩U] )
≤ P(L_n^Y ∈ U^c | L_n^Y ∈ Γ) · d_V( E[L_n^Y | L_n^Y ∈ Γ∩U^c], E[L_n^Y | L_n^Y ∈ Γ∩U] )
≤ P(L_n^Y ∈ U^c | L_n^Y ∈ Γ) · 1

Which proves the result.

Claim 2: Let M^δ = {ν : d_V(ν, M) < δ}. Then for every δ > 0:

lim_{n→∞} P_µ( L_n^Y ∈ M^δ | L_n^Y ∈ Γ ) = 1

Pf: By Sanov's theorem in this case we know that inf_{ν∈cl(Γ)} H(ν|µ) = I_Γ = −lim_{n→∞} (1/n) log P_µ(L_n^Y ∈ Γ). We also have:

limsup_{n→∞} (1/n) log P_µ( L_n^Y ∈ (M^δ)^c ∩ Γ ) ≤ −inf_{ν∈(M^δ)^c∩Γ} H(ν|µ) ≤ −inf_{ν∈(M^δ)^c∩cl(Γ)} H(ν|µ)

Since M^δ is an open set, (M^δ)^c ∩ cl(Γ) is a compact set, so this last infimum is achieved at some point ν̃ ∈ (M^δ)^c ∩ cl(Γ). Since ν̃ ∉ M, we know that H(ν̃|µ) > I_Γ. So finally, putting all this together, we conclude that P_µ( L_n^Y ∈ (M^δ)^c | L_n^Y ∈ Γ ) goes to zero exponentially fast:

limsup_{n→∞} (1/n) log P_µ( L_n^Y ∈ (M^δ)^c | L_n^Y ∈ Γ ) = limsup_{n→∞} [ (1/n) log P_µ( L_n^Y ∈ (M^δ)^c ∩ Γ ) − (1/n) log P_µ( L_n^Y ∈ Γ ) ] ≤ −H(ν̃|µ) + I_Γ < 0

This proves lim_{n→∞} P_µ( L_n^Y ∈ (M^δ)^c | L_n^Y ∈ Γ ) = 0, and taking complements gives the desired result.

Finally, applying the two claims together, we know that for any δ > 0, d_V( µ_n*, co(M^δ) ) ≤ P_µ( L_n^Y ∈ (M^δ)^c | L_n^Y ∈ Γ ) → 0 as n → ∞. Hence each limit point of µ_n* must be in co(M^δ). Since δ is arbitrary, it must be that all the limit points are in co(M), as desired. □

10.42. Cramer's Theorem

Theorem. Suppose X_i are iid random variables taking values in R with law µ. Assume further that X_i has exponential moments of all orders. Define:

Λ(λ) = log( E[exp(λX)] )

(This is well defined for all λ ∈ R since X has finite exponential moments.) Let

S_n = (1/n) Σ_{i=1}^n X_i

Then S_n satisfies an LDP with rate function

Λ*(x) = sup_{λ∈R} { λx − Λ(λ) }

That is to say, for all measurable sets Γ ⊂ R:

−inf_{x∈Γ°} Λ*(x) ≤ liminf_{n→∞} (1/n) log P(S_n ∈ Γ) ≤ limsup_{n→∞} (1/n) log P(S_n ∈ Γ) ≤ −inf_{x∈cl(Γ)} Λ*(x)

The proof is divided into several lemmas.
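Before starting the proof, a sanity check of the statement in the one case where everything is explicit: for standard normal X_i we have Λ(λ) = λ²/2, hence Λ*(x) = x²/2, and S_n ∼ N(0, 1/n), so the tail over the closed set [a, ∞) can be computed exactly from the normal CDF (the value of a is an arbitrary choice):

import math

a = 0.5                         # Gamma = [a, infinity); inf over Gamma of Lambda* is a^2/2
for n in (10, 100, 1000, 5000):
    tail = 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2))   # P(S_n >= a), S_n ~ N(0, 1/n)
    print(n, round(math.log(tail) / n, 5), "->", -a * a / 2)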

Lemma. (The Λ-Lemma: Properties of Λ and Λ*)
a) Λ(λ) is convex.
b) Λ*(x) is convex.
c) Λ*(x) has compact level sets and is lower semicontinuous (i.e. Λ* is a good rate function).
d) Λ*(x) can be rewritten:

Λ*(x) = sup_{λ≥0} { λx − Λ(λ) } for x ≥ E(X), and Λ*(x) = sup_{λ≤0} { λx − Λ(λ) } for x ≤ E(X)

e) Λ*(x) is increasing for x ≥ E(X) and is decreasing for x ≤ E(X), and it has Λ*(E(X)) = 0.
f) Λ is differentiable with Λ'(η) = E[X exp(ηX)] / E[exp(ηX)], and Λ'(η) = y ⟹ Λ*(y) = ηy − Λ(η).

Proof. Proof here! 
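Since the proof is left as a placeholder above, here is a quick numerical check (not from the sources) of parts (d)-(f) for a Bernoulli(p) variable, where Λ*(x) = x log(x/p) + (1−x) log((1−x)/(1−p)) on (0, 1) is the well-known closed form; the value of p and the grid for λ are arbitrary choices, so the grid-search sup only matches up to grid resolution.

import math
import numpy as np

p = 0.3
def Lam(lam):  return math.log(1 - p + p * math.exp(lam))
def Lam_star_closed(x):  return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

lams = np.linspace(-30, 30, 6001)
Lams = np.array([Lam(l) for l in lams])
def Lam_star_grid(x):  return float(np.max(lams * x - Lams))     # sup over the grid

print("Lambda*(E X) =", round(Lam_star_grid(p), 6))              # (e): ~0 at the mean
for x in (0.1, 0.3, 0.6, 0.9):                                   # (d): matches closed form
    print(x, round(Lam_star_grid(x), 5), round(Lam_star_closed(x), 5))

# (f): if Lambda'(eta) = y then Lambda*(y) = eta*y - Lambda(eta)
eta = 1.2
y = p * math.exp(eta) / (1 - p + p * math.exp(eta))              # = Lambda'(eta)
print(round(Lam_star_closed(y), 6), round(eta * y - Lam(eta), 6))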

Lemma. (The upper bound) For closed sets F we have that:

limsup_{n→∞} (1/n) log P(S_n ∈ F) ≤ −inf_{x∈F} Λ*(x)

Proof. The main idea of the proof is the exponential Chebyshev inequality: for any random variable Z, any x ∈ R, and any r ≥ 0 we have:

P(Z > x) = P( exp(rZ) > exp(rx) )
= E[ 1_{exp(rZ) > exp(rx)} ]
≤ E[ 1_{exp(rZ) > exp(rx)} · exp(rZ)/exp(rx) ]
≤ E[ exp(rZ) ] / exp(rx)
= exp(−rx) E[ exp(rZ) ]
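As a quick numerical check of this exponential Chebyshev (Chernoff) bound before we optimize over the parameter: the sketch below uses Bernoulli(1/2) summands so the tail of S_n is an exact binomial sum; the values of n, a and the tested λ's are arbitrary choices, and the bound should hold for each of them.

import math

p, n, a = 0.5, 60, 0.7
def Lam(lam):  return math.log(1 - p + p * math.exp(lam))

k0 = (7 * n + 9) // 10                     # smallest k with k/n >= 7/10 (exact arithmetic)
tail = sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n   # exact P(S_n >= a), p = 1/2

for lam in (0.5, 0.85, 1.5):
    bound = math.exp(-n * (lam * a - Lam(lam)))      # exp(-n(lambda*a - Lambda(lambda)))
    print(lam, tail <= bound, round(tail, 6), round(bound, 6))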

We will first prove the result of the lemma for half-lines of the form F = [a, ∞) with a > E(X) and F = (−∞, b] with b < E(X). We will then use this to prove it for general closed sets F. (Once the claim is established for these semi-infinite intervals, any closed set F that misses E(X) can be covered by a union of two of them, and we will use this to get the result.)

When F is a semi-infinite interval: If F = [a, ∞) with a > E(X), then we fix λ ≥ 0 arbitrary and apply the exponential Chebyshev inequality to S_n with r = nλ to get:

P(S_n ∈ [a, ∞)) = P(S_n ≥ a)
≤ exp(−anλ) E[ exp(nλ S_n) ]
= exp(−anλ) E[ exp(λX) ]^n by independence
= exp(−anλ) exp(nΛ(λ))

Taking logs and dividing by n now gives:

(1/n) log P(S_n ∈ [a, ∞)) ≤ −( aλ − Λ(λ) )

Since this holds for any λ ≥ 0, we can take the infimum over all possible λ ≥ 0 on the RHS and the inequality will still be true. After passing through the minus sign, this becomes a sup and we get:

(1/n) log P(S_n ∈ [a, ∞)) ≤ −sup_{λ≥0} ( aλ − Λ(λ) ) = −Λ*(a) by the Λ-lemma = −inf_{x∈[a,∞)} Λ*(x)

We know that inf_{x∈[a,∞)} Λ*(x) = Λ*(a) since Λ* is increasing to the right of E(X). The result also holds for sets F of the form F = (−∞, b] with b < E(X) by the same argument, inserting some factors of −1 in the appropriate places. (Alternatively, apply the result for sets [a, ∞) to the random variables X̃ = −X.)

When F is an arbitrary closed set: For arbitrary closed sets F there are two cases to consider. If E(X) ∈ F, then inf_{x∈F} Λ*(x) = 0 by the Λ-lemma since Λ*(E(X)) = 0, and the bound holds trivially. Otherwise, if E(X) ∉ F, since F is closed we can find an interval (b, a) so that E(X) ∈ (b, a) and (b, a) ⊂ F^c. We then have F ⊂ (−∞, b] ∪ [a, ∞) with b < E(X) < a, and we can use our previous estimates:

limsup_{n→∞} (1/n) log P(S_n ∈ F)
≤ limsup_{n→∞} (1/n) log P(S_n ∈ (−∞, b] ∪ [a, ∞))
≤ limsup_{n→∞} (1/n) log [ P(S_n ∈ (−∞, b]) + P(S_n ∈ [a, ∞)) ]
≤ max( limsup_{n→∞} (1/n) log P(S_n ∈ (−∞, b]), limsup_{n→∞} (1/n) log P(S_n ∈ [a, ∞)) )
≤ max( −inf_{x∈(−∞,b]} Λ*(x), −inf_{x∈[a,∞)} Λ*(x) )
= −inf_{x∈(−∞,b]∪[a,∞)} Λ*(x)
≤ −inf_{x∈F} Λ*(x)

Here we have employed the fact that limsup (1/n) log(A_n + B_n) = max( limsup (1/n) log A_n, limsup (1/n) log B_n ), which follows from the simple bound A_n + B_n ≤ 2 max(A_n, B_n). □

Lemma. (Tilting inequality) For any random variable X, we have that:

liminf_{n→∞} (1/n) log P(S_n ∈ (−δ, δ)) ≥ inf_{λ∈R} Λ(λ) = −Λ*(0)

Proof. The second equality follows from the definition of Λ*: Λ*(0) = sup_{λ∈R} ( −Λ(λ) ) = −inf_{λ∈R} Λ(λ). The proof of the inequality is divided into two cases, depending on where the random variable X has its mass.

Case I: P(X < 0) > 0 and P(X > 0) > 0. In this case we claim that there is an η ∈ R such that:

Λ(η) = inf_{λ∈R} Λ(λ) and Λ'(η) = 0

This is because Λ(λ) = log E[exp(λX)] → ∞ as λ → +∞: X has positive probability to be > 0, and these values give a contribution to E[exp(λX)] that tends to +∞. By the same token, since X has positive probability to be negative, Λ(λ) = log E[exp(λX)] → ∞ as λ → −∞. Hence Λ must achieve its infimum somewhere. This point is unique since Λ is convex, and since Λ is differentiable, its derivative must be zero at this point. Now define a new probability measure µ̃ by declaring the Radon-Nikodym derivative of µ̃ w.r.t. µ to be:

dµ̃(x) = exp(ηx − Λ(η)) dµ(x)

(Note that exp(−Λ(η)) is exactly the normalizing constant we need to make µ̃ a probability measure, since ∫ exp(ηx) dµ(x) = E[exp(ηX)] = exp(Λ(η)).) Notice that under this new law,

E_µ̃(X) = ∫ x exp(ηx − Λ(η)) dµ(x) = E_µ[X exp(ηX)] exp(−Λ(η)) = Λ'(η) = 0

by the choice of η. Now consider a kind of reverse Chebyshev estimate, using the inequality exp(−η Σ x_i) ≥ exp(−nδ|η|) when |Σ x_i| < nδ. (Here we use the notation µ_n for the law of S_n, the normalized sum under X ∼ µ, and likewise µ̃_n for the normalized sum under X ∼ µ̃.)

µ_n((−δ, δ)) = ∫_{|Σ x_i| < nδ} dµ(x_1) ... dµ(x_n)
= ∫_{|Σ x_i| < nδ} exp(−η Σ x_i + nΛ(η)) dµ̃(x_1) ... dµ̃(x_n)
≥ exp(−n|η|δ) exp(nΛ(η)) ∫_{|Σ x_i| < nδ} dµ̃(x_1) ... dµ̃(x_n)
= exp( n(Λ(η) − |η|δ) ) µ̃_n((−δ, δ))

By the weak law of large numbers under µ̃ (which has mean 0), we know that µ̃_n((−δ, δ)) → 1. So taking (1/n) log and letting n → ∞, we get:

liminf_{n→∞} (1/n) log µ_n((−δ, δ)) ≥ Λ(η) − |η|δ + 0

Now finally, to get rid of the −|η|δ here, take any ε > 0 small and use the above inequality with δ replaced by ε/|η| ≤ δ:

liminf_{n→∞} (1/n) log µ_n((−δ, δ)) ≥ liminf_{n→∞} (1/n) log µ_n((−ε/|η|, ε/|η|)) ≥ Λ(η) − ε = inf_{λ∈R} Λ(λ) − ε

Since this works for any ε we get the desired inequality. (If η = 0 there is nothing to remove and the first bound already gives Λ(0) = inf_λ Λ(λ).)

Case II: Assume that P(X < 0) = 0. In this case Λ(·) is monotone increasing and inf_{λ∈R} Λ(λ) = log µ({0}) (achieved in the limit λ → −∞, since E[exp(λX)] → P(X = 0)). The inequality then follows from µ_n((−δ, δ)) ≥ P(X_i = 0 for each i) = µ({0})^n. The case P(X > 0) = 0 is similar. □

Remark. There is a similar proof of Case II by replacing X with X_θ := X + θZ, where Z ∼ N(0, 1) is independent. The random variable X_θ falls back into Case I, and by taking θ sufficiently small we can control the relevant probabilities.

Corollary. For any random variable X, we have that:

liminf_{n→∞} (1/n) log P( S_n ∈ (x_0 − δ, x_0 + δ) ) ≥ −Λ*(x_0)

Proof. Put X̃ = X − x_0. Then Λ̃(λ) = log E[exp(λX̃)] = Λ(λ) − λx_0, and

Λ̃*(x) = sup_λ { λx − Λ̃(λ) } = sup_λ { λ(x + x_0) − Λ(λ) } = Λ*(x + x_0)

So, applying the inequality from the previous lemma to X̃ (whose normalized sums are S_n − x_0) gives

liminf_{n→∞} (1/n) log P( S_n ∈ (x_0 − δ, x_0 + δ) ) ≥ −Λ̃*(0) = −Λ*(x_0)

Notice that in this case the tilting parameter η satisfies Λ'(η) = x_0 and Λ*(x_0) = ηx_0 − Λ(η). One could alternatively have proven the corollary directly by finding an η for which Λ satisfies these two conditions. □
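The change of measure in this proof is also the standard importance-sampling recipe for estimating such rare-event probabilities, which gives a way to "see" the corollary numerically. The sketch below (a made-up three-point law; all parameters are arbitrary choices) samples under the tilted law with mean x_0, reweights back to µ, and compares the estimated (1/n) log P(S_n ∈ (x_0 − δ, x_0 + δ)) with −Λ*(x_0); for finite n and a small window the two should be in the same ballpark, with the corollary guaranteeing that the liminf is at least −Λ*(x_0).

import math
import numpy as np

rng = np.random.default_rng(1)
vals = np.array([-1.0, 0.0, 2.0])       # made-up support of X
mu = np.array([0.2, 0.5, 0.3])          # made-up law; mean is 0.4
x0, delta, n, trials = 1.2, 0.05, 200, 50_000

def Lam(lam):  return math.log(np.sum(mu * np.exp(lam * vals)))

# bisection for eta with Lambda'(eta) = x0, i.e. the tilted mean equals x0
lo, hi = 0.0, 20.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    w = mu * np.exp(mid * vals); w /= w.sum()
    lo, hi = (mid, hi) if np.dot(vals, w) < x0 else (lo, mid)
eta = 0.5 * (lo + hi)
mu_tilde = mu * np.exp(eta * vals - Lam(eta))        # the tilted measure

X = rng.choice(vals, size=(trials, n), p=mu_tilde)   # sample under mu_tilde
S = X.mean(axis=1)
weights = np.exp(-eta * X.sum(axis=1) + n * Lam(eta))    # d(mu^n)/d(mu_tilde^n)
prob_est = float(np.mean(weights * (np.abs(S - x0) < delta)))

Lam_star_x0 = eta * x0 - Lam(eta)                    # Lambda*(x0), by part (f) of the Lambda-lemma
print(round(math.log(prob_est) / n, 4), "vs", round(-Lam_star_x0, 4))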

Lemma. (The lower bound: extending the local lower bound to arbitrary open sets) For open sets G:

−inf_{x∈G} Λ*(x) ≤ liminf_{n→∞} (1/n) log P(S_n ∈ G)

Proof. We extend the previous corollary, which was such a lower bound for intervals (x_0 − δ, x_0 + δ), to arbitrary open sets. This is an argument that works in general and will be reused many times.

For any ε > 0, find an x* ∈ G so that inf_{x∈G} Λ*(x) > Λ*(x*) − ε. Since G is an open set, find a δ > 0 so that the ball of radius δ centred at x* is completely contained in G, that is (x* − δ, x* + δ) ⊂ G. By the previous corollary (equivalently, the tilting lemma applied to the shifted variable X̃ = X − x*, which at the level of Λ has Λ_X̃(λ) = Λ_X(λ) − λx* and Λ*_X̃(y) = Λ*_X(y + x*)), we get that:

liminf_{n→∞} (1/n) log P( S_n ∈ (x* − δ, x* + δ) ) ≥ −Λ*(x*)

Consider then:

liminf_{n→∞} (1/n) log P(S_n ∈ G) ≥ liminf_{n→∞} (1/n) log P( S_n ∈ (x* − δ, x* + δ) ) ≥ −Λ*(x*) > −inf_{x∈G} Λ*(x) − ε

Since this holds for all ε > 0, we get the desired result. □

Theorem. (Cramer's Theorem in R^d) Same statement as in R, now for iid random vectors X_i ∈ R^d, with Λ(λ) = log E[exp(⟨λ, X⟩)] and Λ*(x) = sup_{λ∈R^d} { ⟨λ, x⟩ − Λ(λ) }.

Lemma. (Upper Bound for Cramer's Theorem in R^d) For compact sets K ⊂ R^d we have the following WEAK LDP upper bound:

limsup_{n→∞} (1/n) log P(S_n ∈ K) ≤ −inf_{x∈K} Λ*(x)

and moreover, S_n is exponentially tight: for all α ∈ R, there exists a compact set K_α so that:

limsup_{n→∞} (1/n) log P(S_n ∈ K_α^c) ≤ −α

Together, these two statements imply the LDP upper bound.

Proof. To see that S_n is exponentially tight, we use Cramer's theorem in 1d applied to the coordinate random variables ⟨X, e_i⟩. By creating a big enough rectangular box so that all of these coordinates are controlled, we can make sure that outside this box the exponential rate of decay is as large as we want. More specifically, we have from the 1d Cramer (Chernoff) upper bound that:

P( |⟨S_n, e_i⟩| > L ) ≤ 2 exp( −n ( Λ_i*(L) ∧ Λ_i*(−L) ) )

where Λ_i* is the Legendre transform for the i-th coordinate. Since each Λ_i*(L) → ∞ as L → ±∞, we can find an L large enough so that all of these exponents are ≥ α + 1. The box K_α = [−L, L]^d will now work for exponential tightness.

To prove the weak LDP upper bound, consider as follows. Fix any ε > 0. For every x ∈ K find λ_x so that Λ*(x) ≤ ⟨λ_x, x⟩ − Λ(λ_x) + ε. Then, since inner products are continuous, find a δ_x > 0 so that |⟨λ_x, x⟩ − ⟨λ_x, y⟩| < ε for all y ∈ B_{δ_x}(x). We then have, for all y ∈ B_{δ_x}(x), that Λ*(x) ≤ ⟨λ_x, y⟩ − Λ(λ_x) + 2ε. Finally then, by the exponential Chebyshev inequality with parameter λ_x we have that:

P( S_n ∈ B_{δ_x}(x) ) ≤ P( ⟨λ_x, S_n⟩ ≥ inf_{y∈B_{δ_x}(x)} ⟨λ_x, y⟩ )
≤ exp( −n inf_{y∈B_{δ_x}(x)} ⟨λ_x, y⟩ ) E[ exp( n⟨λ_x, S_n⟩ ) ]
= exp( −n inf_{y∈B_{δ_x}(x)} ⟨λ_x, y⟩ ) exp( nΛ(λ_x) )
= exp( −n inf_{y∈B_{δ_x}(x)} ( ⟨λ_x, y⟩ − Λ(λ_x) ) )
≤ exp( −n ( Λ*(x) − 2ε ) )

Now this argument produces a δ_x for each x ∈ K. Since K is a compact set and {B_{δ_x}(x)}_{x∈K} is an open cover of K, we can find a finite set of points x_1, ..., x_N so that {B_{δ_{x_i}}(x_i)}_{i=1}^N covers K. Finally then we have:

P(S_n ∈ K) ≤ Σ_{i=1}^N P( S_n ∈ B_{δ_{x_i}}(x_i) )
≤ N exp( −n ( min_{1≤i≤N} Λ*(x_i) − 2ε ) )
≤ N exp( −n ( inf_{x∈K} Λ*(x) − 2ε ) )

Taking (1/n) log gives:

(1/n) log P(S_n ∈ K) ≤ (1/n) log N − inf_{x∈K} Λ*(x) + 2ε

For n large enough, the first term is less than ε and we get the LDP upper bound with an error of 3ε. However, since this argument works for any ε > 0, we get the LDP upper bound precisely. □

Bibliography

[1] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Stochastic Modelling and Applied Probability. Springer, 2009.
[2] R. Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2010.
[3] J.S. Rosenthal. A First Look at Rigorous Probability Theory. World Scientific, 2000.
[4] T. Tao. Topics in Random Matrix Theory. Graduate Studies in Mathematics. American Mathematical Society.
[5] D. Williams. Probability with Martingales. Cambridge Mathematical Textbooks. Cambridge University Press, 1991.
