Probability Theory Oral Exam study notes
Notes transcribed by Mihai Nica

Abstract. These are some study notes that I made while studying for my oral exams on the topic of Probability Theory. I took these notes from a few different sources, and they are not in any particular order. They tend to move around a lot. They also skip some of the basics of measure theory, which are covered in the real analysis notes. Please be extremely cautious with these notes: they are rough notes and were originally only for me to help me study. They are not complete and likely have errors. I have made them available to help other students on their oral exams. (Note: I specialized in probability theory, so these go a bit further into a few topics than most people would for their oral exam.) See also the sections on Conditional Expectation and the Law of the Iterated Logarithm from my Limit Theorems II notes.

Contents
Independence and Weak Law of Large Numbers
1.1. Independence
1.2. Weak Law of Large Numbers
Borel Cantelli Lemmas
2.3. Borel Cantelli Lemmas
2.4. Bounded Convergence Theorem
Central Limit Theorems
3.5. The De Moivre-Laplace Theorem
3.6. Weak Convergence
3.7. Characteristic Functions
3.8. The moment problem
3.9. The Central Limit Theorem
3.10. Other Facts about CLT results
3.11. Law of the Iterated Log
Moment Methods
4.12. Basics of the Moment Method
4.13. Poisson RVs
4.14. Central Limit Theorem
Martingales
5.15. Martingales
5.16. Stopping Times
Uniform Integrability
6.17. An 'absolute continuity' property
6.18. Definition of a UI family
6.19. Two simple sufficient conditions for the UI property
6.20. UI property of conditional expectations
6.21. Convergence in Probability
6.22. Elementary Proof of Bounded Convergence Theorem
6.23. Necessary and Sufficient Conditions for L1 convergence
UI Martingales
7.24. UI Martingales
7.25. Levy's 'Upward' Theorem
7.26. Martingale Proof of Kolmogorov 0-1 Law
7.27. Levy's 'Downward' Theorem
7.28. Martingale Proof of the Strong Law
7.29. Doob's Submartingale Inequality
Kolmogorov's Three Series Theorem using L2 Martingales
A few different proofs of the LLN
9.30. Truncation Lemma
9.31. Truncation + K3 Theorem + Kronecker Lemma
9.32. Truncation + Sparsification
9.33. Levy's Downward Theorem and the 0-1 Law
9.34. Ergodic Theorem
9.35. Non-convergence for infinite mean
Ergodic Theorems
10.36. Definitions and Examples
10.37. Birkhoff's Ergodic Theorem
10.38. Recurrence
10.39. A Subadditive Ergodic Theorem
10.40. Applications
Large Deviations
10.41. LDP for Finite Dimensional Spaces
10.42. Cramer's Theorem
Bibliography

Independence and Weak Law of Large Numbers
These are notes from Chapter 2 of [2].
1.1. Independence
Definition. Events A, B are independent if P(A ∩ B) = P(A)P(B). Random variables X, Y are independent if P(X ∈ C, Y ∈ D) = P(X ∈ C)P(Y ∈ D) for all Borel sets C, D ⊂ R. Two σ-algebras F and G are independent if every A ∈ F and B ∈ G are independent events.
Exercise. (2.1.1) i) Show that if X, Y are independent then σ(X) and σ(Y) are. ii) Show that if X is F-measurable, Y is G-measurable, and F, G are independent, then X and Y are independent.
Proof. This is immediate from the definitions.
Exercise. (2.1.2) i) Show that if A, B are independent then A^c and B are independent too. ii) Show A, B are independent iff 1_A and 1_B are independent.
Proof. i) P(B) = P(A ∩ B) + P(A^c ∩ B) = P(A)P(B) + P(A^c ∩ B), rearrange. ii) Simple using the fact that {1_A ∈ C} is one of ∅, A, A^c, Ω depending on which of 0, 1 lie in C, and each of these events has the right product property.
Remark. By the above quick exercises, all of independence is defined by independence of σ-algebras, so we will view that as the central object.
Definition. σ-algebras F1, F2, ... are independent if for any finite index set I ⊂ N and any choice of Ai ∈ Fi for i ∈ I we have P(∩_{i∈I} Ai) = ∏_{i∈I} P(Ai). Random variables X1, X2, ... are independent if σ(X1), σ(X2), ... are independent. Events A1, A2, ... are independent if 1_{A1}, 1_{A2}, ... are independent.
Exercise. (2.1.3.) Same as the previous exercise with more than two sets.
Example. (2.1.1) The usual example of three events which are pairwise independent but not independent, on the space of three fair coin flips.
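Here is a brute-force check of Example 2.1.1. The three events below (pairs of flips agreeing) are my choice of concrete example; the notes don't spell out which three events are meant, but this is the standard one.

```python
from itertools import product

# Exhaustive check on the space of three fair coin flips (8 outcomes).
outcomes = list(product([0, 1], repeat=3))

def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] == w[1]
B = lambda w: w[1] == w[2]
C = lambda w: w[0] == w[2]

print(prob(A), prob(B), prob(C))               # 0.5 each
print(prob(lambda w: A(w) and B(w)))           # 0.25 = P(A)P(B): pairwise indep
print(prob(lambda w: A(w) and B(w) and C(w)))  # 0.25 != 1/8 = P(A)P(B)P(C)
```

The triple intersection is the event "all three flips agree", which already forces the third pairwise agreement, so the product rule fails for all three events together.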
1.1.1. Sufficient Conditions for Independence. We will work our way to Theorem 2.1.3, which is the main result for this subsection.
Definition. We call collections of sets A1, A2, ..., An independent if for any index set I ⊂ {1, ..., n}, any choice of Ai ∈ Ai for i ∈ I gives independent events. (Just like the definition for σ-algebras, only we don't require Ai to be a σ-algebra.)
Lemma. (2.1.1) If we suppose that each Ai contains Ω, then the criterion for independence works with I = {1, ..., n}.
Proof. When you put Ak = Ω it doesn't change the intersection, and it doesn't change the product since P(Ak) = 1.
Definition. A π-system is a collection A which is closed under intersections, i.e. A, B ∈ A ⟹ A ∩ B ∈ A. A λ-system is a collection L that satisfies: i) Ω ∈ L; ii) A, B ∈ L and A ⊂ B ⟹ B − A ∈ L; iii) An ∈ L and An ↑ A ⟹ A ∈ L.
Remark. (Mihai - From Wiki) An equivalent definition of a λ-system is: i) Ω ∈ L; ii) A ∈ L ⟹ A^c ∈ L; iii) disjoint A1, A2, ... ∈ L ⟹ ∪_{n=1}^∞ An ∈ L. In this form, the definition of a λ-system is more comparable to the definition of a σ-algebra, and we can see that it is strictly easier to be a λ-system than a σ-algebra (we only need closure under disjoint unions rather than arbitrary countable unions). The first definition, presented by Durrett, is however more useful since it is easier to check in practice!
Theorem. (2.1.2) (Dynkin's π-λ Theorem) If P is a π-system and L is a λ-system that contains P, then σ(P) ⊂ L.
Proof. In the appendix apparently? Will come back to this when I do measure theory.
Theorem. (2.1.3) Suppose A1, ..., An are independent and each Ai is a π-system. Then σ(A1), ..., σ(An) are independent.
Proof. (You can basically reduce to the case n = 2.) Fix any A2, ..., An in A2, ..., An respectively and let F = A2 ∩ ... ∩ An. Let L = {A : P(A ∩ F) = P(A)P(F)}. Then we verify that L is a λ-system by using basic properties of P. For A ⊂ B both in L we have:
P((B − A) ∩ F) = P(B ∩ F) − P(A ∩ F) = P(B)P(F) − P(A)P(F) = P(B − A)P(F)
(increasing limits are easy by continuity of probability)
By the π-λ theorem, σ(A1) ⊂ L. Since this works for any such F, we have then that σ(A1), A2, A3, ..., An are independent. Iterating the argument n − 1 more times gives the desired result.
Remark. (Durrett) The reason the π-λ theorem is helpful here is because it is hard to check for A, B ∈ L that A ∩ B ∈ L or that A ∪ B ∈ L. However, it is easy to check that if A, B ∈ L with A ⊂ B then B − A ∈ L. The π-λ theorem converts these π- and λ-systems (which are easier to work with) to σ-algebras (which are harder to work with, but more useful).
Theorem. (2.1.4) In order for X1, X2, ..., Xn to be independent, it is sufficient that for all x1, x2, ..., xn ∈ (−∞, ∞]:
P(X1 ≤ x1, ..., Xn ≤ xn) = ∏_{i=1}^n P(Xi ≤ xi)
Proof. Let Ai be the collection of sets of the form {Xi ≤ xi}. It is easy to check that this is a π-system, so the result is a direct application of the previous theorem.
Exercise. (2.1.4.) Suppose (X1, ..., Xn) has density f(x1, ..., xn) and f can be written as a product g1(x1) · ... · gn(xn). Show that the Xi's are independent.
Proof. Let g̃i(xi) = ci gi(xi) where ci is chosen so that ∫ g̃i = 1. Can verify that ∏ ci = 1 from f = ∏ gi and ∫ f = 1. Then integrate along the margins to see that each g̃i is in fact a pdf for Xi. Can then apply Thm 2.1.4 after replacing the g's by g̃'s.
Exercise. (2.1.5) Same as 2.1.4 but on a discrete space with a probability mass function instead of a probability density function.
Proof. Works the same but with sums instead of integrals.
We will now prove that functions of independent random variables are still independent (to be made more precise in a bit).
Theorem. (2.1.5) Suppose Fi,j 1 ≤ i ≤ n and 1 ≤ j ≤ m(i) are independent σ−algebras and let Gi = σ(∪jFi,j). Then G1,..., Gn are independent.
Proof. (It's another π-λ proof based on Thm 2.1.3.) Let Ai be the collection of sets of the form ∩_j Ai,j with Ai,j taken from Fi,j. Can verify that Ai is a π-system that contains ∪_j Fi,j (it's closed under intersections by its definition). Since the Fi,j's are all independent, it is clear the π-systems Ai are independent. By Thm 2.1.3, we know that σ(Ai) = Gi are all independent too.
Theorem. (2.1.6) If for 1 ≤ i ≤ n and 1 ≤ j ≤ m(i) the random variables Xi,j are independent and fi : R^{m(i)} → R are measurable functions, then the random variables Yi := fi(Xi,1, ..., Xi,m(i)) are all independent.
Proof. Let Fi,j = σ(Xi,j) and Gi = σ(∪_j Fi,j). By the previous theorem (2.1.5.), the Gi's are independent. Since Yi := fi(Xi,1, ..., Xi,m(i)) is Gi-measurable, these random variables are independent too.
Remark. (Durrett) This theorem is the rigorous justification for reasoning like "If X1, ..., Xn are iid, then X1 is indep of (X2, ..., Xn)".
1.1.2. Independence, Distribution, and Expectation.
Example. (Mihai - From Wiki) The Lebesgue measure on R is the unique measure with µ((a, b)) = b − a.
Proof. We will show that it is the unique such measure on [0, 1]; you get uniqueness on all of R by stitching together all the intervals [n, n + 1]. Say ν is another measure with ν((a, b)) = b − a. First verify that µ(A) = ν(A) for all sets A ∈ A := {(a, b), (a, b], [a, b), [a, b] : 0 ≤ a ≤ b ≤ 1} and that A is a π-system. Then notice the collection L = {A : µ(A) = ν(A)} is a λ-system by the basic properties of a measure. Since A ⊂ L by the hypothesis of the problem, by the π-λ theorem σ(A) ⊂ L. But σ(A) = B is all of the Borel sets! Hence µ and ν agree on all Borel sets.
Theorem. (2.1.7.) Suppose X1, ..., Xn are independent and Xi has distribution µi. Then (X1, ..., Xn) has distribution µ1 × ... × µn.
Proof. Verify that the distribution of (X1, ..., Xn) and the measure µ1 × ... × µn agree on rectangle sets of the form A1 × ... × An using independence. Since these sets are a π-system that generates the entire Borel σ-algebra, the two measures agree everywhere. (More specifically, the rectangle sets are a π-system with σ(Rectangles) = Borel sets. The collection of sets where µ_{(X1,...,Xn)} = µ1 × ... × µn is easily verified to be a λ-system. So Rectangles ⊂ Sets where they agree ⟹ Borel sets ⊂ Sets where they agree, by the π-λ theorem.)
Theorem. (2.1.8.) Suppose X, Y are independent and have distributions µ and ν. If h : R² → R has h ≥ 0 or E|h(X, Y)| < ∞ then:
E h(X, Y) = ∫∫ h(x, y) µ(dx) ν(dy)
Suppose now h(x, y) = f(x)g(y). If f ≥ 0 and g ≥ 0, or if E|f(X)| < ∞ and E|g(Y)| < ∞, then:
E[f(X)g(Y)] = E[f(X)] E[g(Y)]
Proof. This is essentially the Tonelli theorem and the Fubini theorem. Review this when you go over integration.
Theorem. (2.1.9) If X1, ..., Xn are independent and have Xi ≥ 0 for all i, or E|Xi| < ∞ for all i, then:
E(∏_{i=1}^n Xi) = ∏_{i=1}^n E(Xi)
Proof. By induction using the last theorem and the result that X1 is independent from (X2, ..., Xn) in this case.
Remark. (Durrett) Don't forget that uncorrelated does NOT imply independent.
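A quick Monte Carlo reminder of that remark. X ~ N(0,1) and Y = X² is my standard choice of counterexample here, not one taken from the notes: Cov(X, X²) = E(X³) = 0, yet Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
Y = X**2

print(np.corrcoef(X, Y)[0, 1])           # ~ 0: uncorrelated (E[X^3] = 0)
print((Y > 4).mean())                    # P(Y > 4) ~ 0.0455
print((Y[np.abs(X) > 2] > 4).mean())     # P(Y > 4 | |X| > 2) = 1.0: dependent
```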
1.1.3. Sums of Independent Random Variables.
Theorem. (2.1.10) If X and Y are independent, with F(x) = P(X ≤ x) and G(y) = P(Y ≤ y), then:
P(X + Y ≤ z) = ∫ F(z − y) dG(y)
where integrating dG is shorthand for integrating with respect to the measure ν whose distribution function is G.
Proof. Let h(x, y) = 1_{x+y≤z}. Let µ and ν be the probability measures with distribution functions F and G. For fixed y we have:
∫ h(x, y) µ(dx) = ∫ 1_{x+y≤z}(x, y) µ(dx) = ∫ 1_{(−∞, z−y]}(x) µ(dx) = µ(−∞, z − y] = F(z − y)
Hence:
P(X + Y ≤ z) = ∫∫ 1_{x+y≤z} µ(dx) ν(dy) = ∫ F(z − y) ν(dy)
Theorem. (2.1.11) If X has density f and Y has distribution function G, then X + Y has density:
h(x) = ∫ f(x − y) dG(y)
and when Y also has density g this is:
h(x) = ∫ f(x − y) g(y) dy
Proof. Write:
∫ F(z − y) ν(dy) = ∫ ∫_{−∞}^{z} f(x − y) dx dG(y) = ∫_{−∞}^{z} ∫ f(x − y) dG(y) dx
so ∫ f(x − y) dG(y) is indeed a density for X + Y.
There are some examples with the gamma distributions (basically gamma(α, λ) is the sum of α independent exponential variables with parameter λ, at least when α is an integer).
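A quick numerical sanity check of Theorem 2.1.11 before moving on: the density of a sum of two independent Uniform(−1,1)'s should be the convolution h(x) = ∫ f(x − y) g(y) dy = (2 − |x|)/4 on (−2, 2), a triangle (this specific pair of uniforms is my choice of test case).

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, 500_000) + rng.uniform(-1, 1, 500_000)

hist, edges = np.histogram(S, bins=40, range=(-2, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
exact = (2 - np.abs(centers)) / 4        # convolution of the two pdfs
print(np.abs(hist - exact).max())        # small: Monte Carlo error only
```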
1.1.4. Constructing Independent Random Variables. The Kolmogorov Extension Theorem is here; I'll come back to this later.
Theorem. (Kolmogorov Extension Theorem) If you are given a family of measures µn on R^n that are consistent:
µn+1((a1, b1] × ... × (an, bn] × R) = µn((a1, b1] × ... × (an, bn])
then there exists a unique probability measure P on R^N (sequence space) so that P agrees with µn on cylinder sets of size n.
1.2. Weak Law of Large Numbers
1.2.1. L2 Weak Laws.
Theorem. (2.2.1) If X1, ... are uncorrelated and E(Xi²) < ∞ then:
Var(∑ Xi) = ∑ Var(Xi)
Proof. Just expand it out; cross terms die since the r.v.'s are uncorrelated. (Helps to assume WLOG that E(Xi) = 0.)
Lemma. (2.2.2) If p > 0 and E|Zn|^p → 0, then Zn → 0 in probability.
Proof. By the Cheb inequality. This was one of the things on our types-of-convergence diagram.
Theorem. (2.2.3.) If X1, ... are uncorrelated with E(Xi) = µ and Var(Xi) ≤ C < ∞, and Sn = ∑ Xi, then Sn/n → µ in L² and in probability.
Proof. E[(Sn/n − µ)²] = (1/n²) Var(Sn) = (1/n²) ∑ Var(Xi) ≤ Cn/n² → 0.
Example. (2.2.1) Basically this example is a similar result set up in such a way as to remark that the convergence is uniform in some sense. Specifically, they have a family of coupled Bernoulli variables Xn^p, and they are saying that E[(Sn/n − µ)²] → 0 uniformly for the whole family. Indeed, the estimate we used in Thm 2.2.3 just needs an upper bound on the variance.
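Here is the L² weak law in simulation, with the rate visible: E[(Sn/n − µ)²] ≤ C/n. Exponential(1) summands are an arbitrary choice with µ = 1 and C = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1_000, 10_000]:
    means = rng.exponential(1.0, size=(2_000, n)).mean(axis=1)
    print(n, ((means - 1.0) ** 2).mean())   # ~ 1/n, matching the C/n bound
```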
1.2.2. Triangular Arrays.
Definition. A triangular array is an array of random variables Xn,k, 1 ≤ k ≤ n. The row sum is defined to be Sn = ∑_k Xn,k.
Theorem. (2.2.4.) Let µn = E(Sn) and σn² := Var(Sn). If σn²/bn² → 0 then:
(Sn − µn)/bn → 0
in L² and in probability.
Proof. Have (similar to before) E[((Sn − µn)/bn)²] = bn^{−2} Var(Sn) → 0.
Example. These examples are actually really nice and I like them a lot! However, I'm going to skip them for now.
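One standard triangular-array application (I believe this is the flavor of example Durrett gives here, but I'm reconstructing it from memory) is the coupon collector: if Tn is the number of draws needed to see all n coupon types, then Tn is a sum of independent Geometric((n − k)/n) waiting times, and Theorem 2.2.4 with bn = n log n gives Tn/(n log n) → 1 in probability.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in [100, 1_000, 10_000]:
    p = (n - np.arange(n)) / n                     # n/n, (n-1)/n, ..., 1/n
    T = rng.geometric(p, size=(200, n)).sum(axis=1)
    r = T / (n * np.log(n))
    print(n, r.mean().round(3), r.std().round(3))  # mean -> 1, spread shrinks
```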
1.2.3. Truncation. To truncate a random variable X at level M means to consider:
X̄ = X 1_{|X| ≤ M}
Theorem. (2.2.6) For each n let Xn,k, 1 ≤ k ≤ n, be a triangular array of independent random variables. Let bn > 0 with bn → ∞, and let X̄n,k = Xn,k 1_{|Xn,k| ≤ bn}. Suppose that:
∑_{k=1}^n P(|Xn,k| > bn) → 0 as n → ∞, and
bn^{−2} ∑_{k=1}^n E(X̄n,k²) → 0 as n → ∞
Then let Sn = ∑_k Xn,k and put an = ∑_{k=1}^n E(X̄n,k). Then:
(Sn − an)/bn → 0 in probability
Proof. Writing S̄n for the sum of the truncated variables, we have:
P(|Sn − an|/bn > ε) ≤ P(Sn ≠ S̄n) + P(|S̄n − an|/bn > ε)
The first term is controlled by ∑ P(|Xn,k| > bn), namely:
P(Sn ≠ S̄n) ≤ P(∪_{k=1}^n {X̄n,k ≠ Xn,k}) ≤ ∑_{k=1}^n P(|Xn,k| > bn) → 0
The second term is controlled by the Cheb inequality and our bound on the X̄n's. Have L² convergence by:
E[((S̄n − an)/bn)²] = bn^{−2} Var(S̄n) = bn^{−2} ∑_{k=1}^n Var(X̄n,k) ≤ bn^{−2} ∑_{k=1}^n E(X̄n,k²) → 0
So this also converges in probability to zero (by Cheb ineq).
Theorem. (2.2.7.) (Weak law of large numbers) Let X1, ... be iid with:
x P(|Xi| > x) → 0 as x → ∞
Let Sn = X1 + ... + Xn and let µn = E(X1 1_{|X1| ≤ n}). Then Sn/n − µn → 0 in probability.
Proof. Apply the last result with Xn,k = Xk and bn = n. The first condition is clear since they are iid, so ∑_{k=1}^n P(|Xn,k| > bn) = n P(|X1| > n) → 0 by hypothesis. The second hypothesis to be verified needs the easy lemma that E(Y^p) = ∫_0^∞ p y^{p−1} P(Y > y) dy (proven by Fubini easily). With this established, we see that:
bn^{−2} ∑_{k=1}^n E(X̄n,k²) = (1/n) E(X̄n,1²) = (1/n) ∫_0^n 2y P(|X1| > y) dy
This → 0 as n → ∞ since we are given that 2y P(|X1| > y) → 0 as y → ∞ (the average of a function tending to 0 also tends to 0).
Remark. By the converging together lemma (the one that says if Xn ⇒ X and |Xn − Yn| → 0 in probability then Yn ⇒ X), and the fact that convergence in probability is the same as weak convergence when the target is a constant, this result shows Sn/n − µ → 0 when µn → µ. (Have Sn/n − µn ⇒ 0 and |(Sn/n − µn) − (Sn/n − µ)| = |µn − µ| → 0 by LDCT.) Another way to phrase the converging together lemma is that if An ⇒ A and Bn ⇒ c a constant, then An + Bn ⇒ A + c. This small improvement leads to:
Theorem. (2.2.9) Let X1, ..., Xn be iid with E|Xi| < ∞. Let Sn = X1 + ... + Xn and let µ = E(X1). Then Sn/n → µ in probability.
Proof. x P(|X1| > x) ≤ E(|X1| 1_{|X1| > x}) → 0 by LDCT since E|X1| < ∞, and µn = E(X1 1_{|X1| ≤ n}) → E(X1) = µ.

Borel Cantelli Lemmas
2.3. Borel Cantelli Lemmas
2.3.1. Preliminaries. For sets An define:
lim sup An = {ω : lim sup 1_{An}(ω) = 1} = ∩_n ∪_{k≥n} Ak = {An i.o.}
lim inf An = {ω : lim inf 1_{An}(ω) = 1} = ∪_n ∩_{k≥n} Ak = {An a.b.f.o.}
Here i.o. stands for "infinitely often" and a.b.f.o. stands for "all but finitely-many often". (Sometimes people write this one as a.a. for "almost always".)
Lemma. (lim sup An)^c = lim inf(An^c)
Proof. Just check it from the definitions, or convince yourself by logic: if An does not happen infinitely often, then it stops happening at some finite point, so An^c happens all but finitely often.
Theorem. P(lim sup An) ≥ lim sup P(An) and P(lim inf An) ≤ lim inf P(An)
Proof. By continuity of measure, P(lim sup An) = lim_{n→∞} P(∪_{k≥n} Ak), and P(∪_{k≥n} Ak) ≥ sup_{k≥n} P(Ak) by set inclusion, so the result follows. The other direction can be proven in the same way, OR you can take complements to see it from the first result.
2.3.2. First Borel Cantelli Lemma and Applications.
Lemma. If An are events with ∑_{n=1}^∞ P(An) < ∞, then P(An i.o.) = 0.
Proof. Standard proof: P(An i.o.) = lim_{n→∞} P(∪_{k≥n} Ak) ≤ lim_{n→∞} ∑_{k≥n} P(Ak) = 0. Fancy proof: Let N = ∑_n 1_{An}, so that {An i.o.} = {N = ∞}. By Fubini/Tonelli, E(N) = E(∑_n 1_{An}) = ∑_n E(1_{An}) = ∑_n P(An) < ∞ ⟹ P(N = ∞) = 0.
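A concrete look at the first Borel-Cantelli lemma on ([0,1], Lebesgue), with my choice of events An = (0, 1/n²): here ∑ P(An) = π²/6 < ∞, and the count N(ω) = #{n : ω < 1/n²} = ⌊1/√ω⌋ is finite for every ω > 0, exactly as the lemma promises.

```python
import numpy as np

rng = np.random.default_rng(4)
omega = rng.uniform(0, 1, 1_000_000)
N = np.floor(1 / np.sqrt(omega)).astype(np.int64)  # #{n : omega < 1/n^2}
print(N.mean())   # ~ pi^2/6 = 1.6449... = E(N) = sum of P(A_n)
print(N.max())    # finite even over a million samples: no omega is in i.o. many
```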
Theorem. Xn → X in probability if and only if for every subsequence Xn_m there is a further subsequence Xn_{m_k} → X a.s.
Proof. (⇒) Take the sub-subsequence so that P(|Xn_{m_k} − X| > 1/k) < 2^{−k}; then by Borel Cantelli, Xn_{m_k} → X a.s. (⇐) Suppose by contradiction that there are δ > 0 and ε0 > 0 so that P(|Xn − X| > δ) > ε0 for infinitely many n. But then that subsequence can have no a.s. convergent sub-subsequence; contradiction.
Theorem. If f is continuous and Xn → X in probability, then f(Xn) → f(X) in probability. If f is also bounded, then E(f(Xn)) → E(f(X)) too.
Proof. Check the subsequence/sub-subsequence criterion for the f(Xn) → f(X) bit. The next part is the in-probability bounded convergence theorem. One can see it nicely with the subsequence/sub-subsequence property as follows: given any subsequence Xn_k, find a sub-subsequence Xn_{k_m} → X a.s., and then by the a.s. bounded convergence theorem we will have E f(Xn_{k_m}) → E f(X). But then E f(Xn) → E f(X). (Otherwise there is an ε0 so that they are off by ε0 infinitely often, but this creates a subsequence with no convergent sub-subsequence.) This can also be proven directly (it is proven somewhere else).
Theorem. If Xn are non-negative r.v.'s and ∑_{n=1}^∞ E(Xn) < ∞, then Xn → 0 a.s.
Proof. Find a sequence an of positive real numbers so that an → ∞ and still ∑_{n=1}^∞ an E(Xn) < ∞. Now consider:
∑_{n=1}^∞ P(Xn > 1/an) ≤ ∑_{n=1}^∞ an E(Xn) < ∞
which means that P(Xn > 1/an i.o.) = 0 by Borel Cantelli. Since 1/an → 0, this means that Xn → 0 a.s.
(How does one get the an's? Very roughly: by scaling by a constant, let's suppose ∑ E(Xn) = 1. Find numbers n1, n2, ... so that nk is the first number with ∑_{i=1}^{nk} E(Xi) > 1 − 2^{−k}. Clearly nk → ∞. Now you can choose ai = k for all nk < i ≤ nk+1, safely by comparison with the sequence ∑ k 2^{−k}, which we know converges.)
Theorem. (4th-moment Strong Law of Large Numbers) If X1, X2, ... are iid with E(X1⁴) < ∞, then Sn/n → E(X1) a.s.
Proof. (It's like a weak law proof using the Cheb inequality, but the fact that E(X1⁴) < ∞ makes the inequality so strong as to be summable. Then the Borel Cantelli lemma is used to improve convergence in probability to a.s. convergence.) WLOG assume the Xn are mean 0 (just subtract it). Let's begin by computing (using combinatorics basically):
E[(Sn/n)⁴] = n^{−4} E[(∑_{i=1}^n Xi)⁴] = n^{−4} ∑_{i=1}^n E(Xi⁴) + n^{−4} · 3 ∑_{i≠j} E(Xi²)E(Xj²) = c1 n^{−3} + c2 n^{−2}
for some constants c1 and c2 (the cross terms with a lone factor of some Xi vanish since E(Xi) = 0). Finally, since n^{−2} and n^{−3} are both summable, we can find a sequence εn → 0 so that εn^{−4} n^{−2} and εn^{−4} n^{−3} are STILL summable (example: εn = n^{−1/10}). We can now use a Cheb inequality:
∑_{n=1}^∞ P(|Sn/n| > εn) ≤ ∑_{n=1}^∞ εn^{−4} E[(Sn/n)⁴] ≤ c1 ∑_{n=1}^∞ εn^{−4} n^{−3} + c2 ∑_{n=1}^∞ εn^{−4} n^{−2} < ∞
So by Borel Cantelli, P(|Sn/n| > εn i.o.) = 0, and we conclude that Sn/n → 0 a.s.
Remark. You could also leave εn = ε fixed, and then this shows that for every ε > 0, |Sn/n| < ε eventually almost surely. Then taking the intersection of a countable number of these events we'd see that Sn/n → 0.
Once you have that E[(Sn/n)⁴] = c1 n^{−3} + c2 n^{−2} you could also just apply the last theorem to (Sn/n)⁴ to see that Sn/n → 0 almost surely, but this is a bit more fancy.
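Watching the strong law along individual paths is a good reality check: the running averages Sn/n of centered Uniform(−1,1)'s (an arbitrary choice with all moments finite, so the 4th-moment version applies) settle down to 0 pathwise, not just on average.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(5, 100_000))            # 5 independent paths
paths = np.cumsum(X, axis=1) / np.arange(1, 100_001)  # S_n / n along each path
for n in [100, 1_000, 10_000, 100_000]:
    print(n, np.round(paths[:, n - 1], 4))             # every row heads to 0
```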
Remark. The converse of the Borel Cantelli lemma is false. As one sees in the fancy proof, ∑ P(An) is actually equal to E(N) where N is the number of events that occur. Of course, E(N) could be ∞ and yet N < ∞ a.s. (To explicitly construct this, take any random variable N with E(N) = ∞ but N < ∞ a.s., and then on the probability space (Ω, F, P) = ([0, 1], B, m) put An = (0, P(N > n)], so that P(An) = P(N > n) and ∑ P(An) = ∑ P(N > n) = E(N) = ∞, and yet lim sup An = ∩_n An = ∅.)
2.3.3. Second Borel Cantelli Lemma.
Theorem. If An are independent and ∑ P(An) = ∞, then P(An i.o.) = 1.
Proof. (Depends on the inequality 1 − p ≤ e^{−p} and the idea to look at the complement.) We will actually show that P(lim inf An^c) = 0. Have:
P(∩_{n=M}^N An^c) = ∏_{n=M}^N (1 − P(An)) ≤ ∏_{n=M}^N e^{−P(An)} = exp(−∑_{n=M}^N P(An)) → 0 as N → ∞
Since ∩_{n=M}^N An^c ↓ ∩_{n=M}^∞ An^c as N → ∞, we have then P(∩_{n=M}^∞ An^c) = lim_{N→∞} P(∩_{n=M}^N An^c) = 0, and consequently, since this holds for all M, we will have P(lim inf An^c) = lim_{M→∞} P(∩_{n=M}^∞ An^c) = 0.
Theorem. If X1, X2, ... are iid with E|Xi| = ∞, then P(|Xn| ≥ n i.o.) = 1. Moreover, we can improve this and show that P(lim sup |Xn|/n = ∞) = 1 and P(lim sup |Sn|/n = ∞) = 1.
Proof. We have the bound:
∞ = E|X1| = ∫_0^∞ P(|X1| > x) dx ≤ ∑_{n=0}^∞ P(|X1| > n) = ∑_{n=0}^∞ P(|Xn| > n)
So by the second Borel Cantelli, P(|Xn| ≥ n i.o.) = 1. We can improve this a bit to see that for any C > 0, P(|Xn| ≥ Cn i.o.) = 1, since E|X1|/C is still ∞. Since this holds for every C, choosing C = 1, 2, 3, ... and taking the intersection of the countable set of probability-1 events {|Xn|/n > k i.o.}, we see that P(lim sup |Xn|/n = ∞) = 1.
Now to see that P(lim sup |Sn|/n = ∞) = 1, we will show for each C > 0 that P(|Sn|/n > C i.o.) = 1. This is indeed the case because we know that P(|Xn|/n > 2C i.o.) = 1, and whenever |Xn|/n > 2C we have either |Sn−1|/(n−1) > C or |Sn|/n > C, which shows that
{|Xn|/n > 2C i.o.} ⊂ {|Sn|/n > C i.o.}
(Pf of this fact: if |Sn−1|/(n−1) > C then done! Otherwise |Sn−1|/(n−1) ≤ C, and then |Sn|/n ≥ |Xn|/n − |Sn−1|/n > 2C − C = C.)
2.4. Bounded Convergence Theorem
Theorem. (In-probability bounded convergence theorem) If Yn → Y in probability and |Yn| ≤ K a.s., then Yn → Y in L¹.
Proof. Notice that |Y| ≤ K a.s. too (or else convergence in probability fails due to the positive measure set where |Y| > K + δ with δ small enough). Write for any ε > 0 that:
E(|Yn − Y|) = E(|Yn − Y| ; |Yn − Y| > ε) + E(|Yn − Y| ; |Yn − Y| ≤ ε) ≤ 2K P(|Yn − Y| > ε) + ε → ε as n → ∞
Since this holds for all ε > 0, we have indeed that Yn → Y in L¹.

Central Limit Theorems
These are notes from Chapter 3 of [2].
3.5. The De Moivre-Laplace Theorem
Let X1, X2, ... be iid with P(X1 = 1) = P(X1 = −1) = 1/2 and let Sn = ∑_{k≤n} Xk. By simple combinatorics,
P(S2n = 2k) = 2^{−2n} (2n choose n+k)
Using Stirling's Formula now, n! ∼ n^n e^{−n} √(2πn) as n → ∞, where an ∼ bn means that an/bn → 1 as n → ∞, so then:
P(S2n = 2k) ∼ (πn)^{−1/2} (1 − k²/n²)^{−n} (1 + k/n)^{−k} (1 − k/n)^{k}
We now use the basic lemma:
Lemma. (3.1.1.) If cj → 0, aj → ∞ and aj cj → λ, then (1 + cj)^{aj} → e^λ.
Exercise. (3.1.1.) (Generalization of the above) If max_{1≤j≤n} |cj,n| → 0 and ∑_{j=1}^n cj,n → λ and sup_n ∑_{j=1}^n |cj,n| < ∞, then ∏_{j=1}^n (1 + cj,n) → e^λ.
(Both proofs can be seen by taking logs and using the fact from the Taylor expansion for log that log(1 + x)/x → 1.)
This leads to:
Theorem. (3.1.2.) If 2k/√(2n) → x, then P(S2n = 2k) ∼ (πn)^{−1/2} e^{−x²/2}.
And a bit more gives (mostly just a change of variables from here):
Theorem. (3.1.3) (The De Moivre-Laplace Theorem) If a < b, then as m → ∞ we have:
P(a ≤ Sm/√m ≤ b) → ∫_a^b (2π)^{−1/2} e^{−x²/2} dx
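A quick numerical check of Theorem 3.1.3 (the interval endpoints a, b and sample sizes below are arbitrary): for ±1 flips, Sm is 2·Binomial(m, 1/2) − m, so we can sample it directly.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
a, b, m = -1.0, 2.0, 10_000
S = 2 * rng.binomial(m, 0.5, size=100_000) - m
emp = ((a <= S / sqrt(m)) & (S / sqrt(m) <= b)).mean()
gauss = 0.5 * (erf(b / sqrt(2)) - erf(a / sqrt(2)))   # Phi(b) - Phi(a)
print(emp, gauss)   # both ~ 0.8186, up to Monte Carlo error
```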
3.6. Weak Convergence
Definition. For distribution functions Fn, we write Fn ⇒ F and say Fn converges weakly to F to mean that Fn(x) = P(Xn ≤ x) → F(x) = P(X ≤ x) for all continuity points x of the function F(x). We sometimes conflate the random variable and its distribution function and write Xn ⇒ X, or even Fn ⇒ X, or other combinations.
3.6.1. Examples.
Example. (3.2.1.) Let X1, ... be iid coin flips; by our above work in the De Moivre-Laplace theorem, we have that Fn ⇒ F, where Fn is the d.f. of Sn/√n and F is the standard normal d.f.
Example. (3.2.2.) (The Glivenko-Cantelli Theorem) Let X1, ..., Xn be iid with distribution function F and let
Fn(x) = (1/n) ∑_{m=1}^n 1_{Xm ≤ x}
be the empirical distribution function. The G-C theorem is that:
sup_x |Fn(x) − F(x)| → 0 a.s.
In words: the empirical distribution function converges uniformly, almost surely.
Proof. (Pf idea: the fact that Fn(x) → F(x) is just the strong law for the indicators 1_{Xm ≤ x}; the only trick is to make the convergence uniform.) Similarly, letting F(x−) := lim_{h↓0} F(x − h), the strong law applied to Zm = 1_{Xm < x} gives Fn(x−) → F(x−). Now fix ε > 0 and choose points −∞ = x0 < x1 < ... < xk = ∞ so that F(xj−) − F(xj−1) ≤ ε for each j; for n large enough (a.s.) we have, at this finite collection of points, |Fn(xj) − F(xj)| < η and |Fn(xj−) − F(xj−)| < η for all j. For any point x ∈ (xj−1, xj) we use the monotonicity of F, along with our inequalities with ε and η, to get:
Fn(x) ≤ Fn(xj−) ≤ F(xj−) + η ≤ F(xj−1) + η + ε ≤ F(x) + η + ε
Fn(x) ≥ Fn(xj−1) ≥ F(xj−1) − η ≥ F(xj−) − η − ε ≥ F(x) − η − ε
So we have an inequality sandwich, |Fn(x) − F(x)| ≤ ε + η, which holds for every x. Since ε and η are arbitrary, we get the convergence.
Example. (3.2.3.) Let X have distribution F. Then X + 1/n has distribution Fn(x) = F(x − 1/n). As n → ∞ we have:
Fn(x) → F(x−) = lim_{y↑x} F(y)
So in this case the convergence really is only at the continuity points.
Example. (3.2.4.) (Waiting for rare events: convergence of a geometric to an exponential distribution) Let Xp be the number of trials needed to get a success in a sequence of independent trials with success probability p. Then P(Xp ≥ n) = (1 − p)^{n−1} for n = 1, 2, ..., and it follows that:
P(p Xp > x) → e^{−x} for all x ≥ 0
In other words, p Xp ⇒ E where E is an exponential random variable.
Example. (3.2.5.) (Birthday Problem) Fix an N and let X1, ... be independent and uniformly distributed on {1, 2, ..., N}, and let TN = min{n : Xn = Xm for some m < n}. Notice that:
P(TN > n) = ∏_{m=2}^n (1 − (m − 1)/N)
By Exercise 3.1.1 (the one that concludes ∏_{j=1}^n (1 + cj,n) → e^λ when ∑_{j=1}^n cj,n → λ and the cj,n's are small), we have then:
P(TN/N^{1/2} > x) → exp(−x²/2) for all x ≥ 0
Theorem. (Scheffé's Theorem) If fn are probability density functions with fn → f∞ pointwise as n → ∞, then µn → µ∞ in the total variation distance:
‖µn − µ∞‖_TV := sup_B |µn(B) − µ∞(B)| → 0
Proof. Have:
|∫_B fn − ∫_B f∞| ≤ ∫_Ω |fn − f∞| = 2 ∫_Ω (f∞ − fn)^+ → 0
by the dominated convergence theorem, since (f∞ − fn)^+ ≤ f∞ (we have employed the usual trick for the TV distance here, btw: both densities integrate to 1, so ∫|fn − f∞| = 2∫(f∞ − fn)^+).
Example. (3.2.6.) (Central Order Statistics) Put 2n + 1 points uniformly at random and independently in (0, 1). Let Vn+1 be the (n + 1)-st largest point.
Lemma. Vn+1 has density function:
f_{Vn+1}(x) = (2n + 1) (2n choose n) x^n (1 − x)^n
Proof. There are 2n + 1 ways to pick which of the 2n + 1 points will be the special central order point. Then there are (2n choose n) ways to divide up the remaining 2n points into two groups of n, one group to be placed on the left and another to be placed on the right, and finally x^n(1 − x)^n is the probability that these points fall where they need to.
Changing variables x = 1/2 + y/(2√(2n)), i.e. Yn = 2√(2n)(Vn+1 − 1/2), gives:
f_{Yn}(y) → (2π)^{−1/2} exp(−y²/2) as n → ∞
By Scheffé's theorem then, we have weak convergence to the standard normal. (This isn't entirely surprising since f looks a bit like a binomial random variable... this is actually used in this calculation.)
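Back to Example 3.2.2 for a moment, here is Glivenko-Cantelli in action; Exponential(1) samples are an arbitrary test case. Since Fn is flat between order statistics and F is continuous here, checking just before and after each jump captures the sup exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
for n in [100, 1_000, 10_000, 100_000]:
    x = np.sort(rng.exponential(1.0, n))
    F = 1 - np.exp(-x)                                 # true CDF at the data
    d_hi = np.abs(np.arange(1, n + 1) / n - F).max()   # F_n(x_(i))
    d_lo = np.abs(np.arange(0, n) / n - F).max()       # F_n(x_(i)-)
    print(n, max(d_hi, d_lo))   # shrinks toward 0, roughly like 1/sqrt(n)
```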
Exercise. (Convergence of the maxima of random variables) Come back to this when you review maximum distributions.
3.6.2. Theory. The next result is useful for proving things about weak convergence.
Theorem. (3.2.2.) [Skorokhod Representation Theorem] If Fn ⇒ F∞, then there are random variables Yn with distribution Fn so that Yn → Y∞ a.s.
Proof. Let (Ω, F, P) = ((0, 1), B, m) be the usual Lebesgue measure on (0, 1). Define Yn(x) = sup{y : Fn(y) < x} =: Fn^→(x) (this is called the right inverse). By an earlier theorem (or easy exercise), we know Yn has the distribution Fn. We will now show that Yn(x) → Y∞(x) for all but a countable number of x. Durrett has a proof of this... but I prefer the proof from Resnick where he develops the idea of the left inverse a bit more fully. Review this when you do extreme values!
Remark. This theorem only works if the random variables have separable support. Since we are working with R-valued random variables, this works. If they were functions or something, it might not work.
Exercise. (3.2.4.) (Fatou's lemma) Let g ≥ 0 be continuous. If Xn ⇒ X∞ then:
lim inf E(g(Xn)) ≥ E(g(X∞))
Proof. Just use Skorokhod to make versions of the Xn's which converge a.s. and apply the ordinary Fatou lemma to that.
Exercise. (3.2.5.) (Integration to the limit) Suppose g, h are continuous with g(x) > 0 and |h(x)|/g(x) → 0 as |x| → ∞. If Fn ⇒ F and ∫ g(x) dFn(x) ≤ C < ∞, then:
∫ h(x) dFn(x) → ∫ h(x) dF(x)
Proof. Create random variables on Ω = (0, 1) as in the Skorokhod representation theorem so that Fn, F are the distribution functions for Xn, X and Xn → X a.s. We desire to show that E(h(Xn)) → E(h(X)). Now for any ε > 0 we find M so large that |x| > M ⟹ |h(x)| ≤ ε g(x), and then we will have for any x > M that:
E(h(Xn)) = E(h(Xn); |Xn| ≤ x) + E(h(Xn); |Xn| > x)
→ E(h(X); |X| ≤ x) + E(h(Xn); |Xn| > x) (bounded convergence thm)
≤ E(h(X); |X| ≤ x) + ε E(g(Xn); |Xn| > x)
≤ E(h(X); |X| ≤ x) + εC
The first term converges to E(h(X)) by the LDCT. The other side of the inequality can be handled by Fatou, or I think we could do the above work a little more carefully to get it.
Theorem. (3.2.3.) Xn ⇒ X∞ if and only if E(g(Xn)) → E(g(X∞)) for every bounded continuous function g.
Proof. (⟹) Follows by the Skorokhod representation theorem and the bounded convergence theorem.
(⟸) Let g_{x,ε} be a smoothed-out version of the step-down Heaviside function at x. For example:
g_{x,ε}(y) = 1 for y ≤ x; 0 for y ≥ x + ε; linear in between
This is a bounded continuous function, so we have E(g_{x,ε}(Xn)) → E(g_{x,ε}(X)) for each choice of x, ε. This basically gives the result; to do it properly, look at:
lim sup_{n→∞} P(Xn ≤ x) ≤ lim sup_{n→∞} E(g_{x,ε}(Xn)) = E(g_{x,ε}(X)) ≤ P(X ≤ x + ε)
lim inf_{n→∞} P(Xn ≤ x) ≥ lim inf_{n→∞} E(g_{x−ε,ε}(Xn)) = E(g_{x−ε,ε}(X)) ≥ P(X ≤ x − ε)
Combining these inequalities, we see that we have the desired convergence at continuity points of X.
Theorem. (3.2.4.) (Continuous mapping theorem) Let g be measurable and Dg = {x : g is discontinuous at x}. If Xn ⇒ X and P(X ∈ Dg) = 0, then g(Xn) ⇒ g(X); and if g is bounded, then E(g(Xn)) → E(g(X)).
Proof. Firstly, let's notice that if g is bounded, then using Skorokhod we will have a version of the variables so that Xn → X a.s., and then g(Xn) → g(X) everywhere except on the set Dg.
Since this is a null set, we get E(g(Xn)) → E(g(X)).
Now, we verify that g(Xn) ⇒ g(X) by the bounded-continuous characterization: if f is any bounded continuous function, then f ∘ g is bounded and has P(X ∈ D_{f∘g}) = 0, so E(f ∘ g(Xn)) → E(f ∘ g(X)) by the above argument.
Theorem. (3.2.5.) (Portmanteau Theorem) The following are equivalent:
(i) Xn ⇒ X
(ii) E(f(Xn)) → E(f(X)) for all bounded continuous f
(iii) lim inf_n P(Xn ∈ G) ≥ P(X ∈ G) for all G open
(iv) lim sup_n P(Xn ∈ F) ≤ P(X ∈ F) for all F closed
(v) P(Xn ∈ A) → P(X ∈ A) for all A with P(X ∈ ∂A) = 0
Proof. [(i) ⟹ (ii)] Is the theorem we just did. [(ii) ⟹ (iii)] Use Skorokhod to get an a.s. convergent version and then use Fatou. [(iii) ⟺ (iv)] Take complements. [(iii)+(iv) ⟹ (v)] Look at the interior A° and closure Ā. Since ∂A = Ā − A° is a null set, we know that A, Ā, A° are all the same up to null sets. We then get P(Xn ∈ A) → P(X ∈ A) by a liminf/limsup sandwich, doing the liminf on P(Xn ∈ A°) ≤ P(Xn ∈ A) and the limsup on P(Xn ∈ Ā) ≥ P(Xn ∈ A), and using the inequalities from (iii) and (iv).
Remark. In Billingsley he does the proof without using Skorokhod, by looking at sets like A_ε = ∪_{x∈A} B_ε(x) and the function f = (1 − d(x, A)/ε)^+, which is 1 on A, 0 outside of A_ε, and is uniformly continuous. This has the advantage that it will work in spaces where the Skorokhod theorem will not work (for example, I think the Skorokhod theorem will fail for function-valued random variables). One advantage of doing this is that, since the f that appears above is uniformly continuous, we may restrict our attention to uniformly continuous functions f, which is sometimes a little easier to check.
Theorem. (3.2.6.) (Helly's Selection Theorem) For every sequence Fn of distribution functions, there is a subsequence Fn_k and a right-continuous non-decreasing function F so that lim_{k→∞} Fn_k(y) = F(y) at all continuity points y of F.
Remark. F is not necessarily a distribution function, as it might not have lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. This happens when mass leaks out to +∞ or −∞, e.g. Fn the distribution function of the point mass δn at n.
Proof. Let qk enumerate the rationals. Since Fn(q1) ∈ [0, 1], by Bolzano-Weierstrass there is a subsequence n_k^(1) so that F_{n_k^(1)}(q1) converges; call the limit G(q1). Since F_{n_k^(1)}(q2) ∈ [0, 1], again by B-W we find a sub-subsequence n_k^(2) so that F_{n_k^(2)}(q2) converges. Repeating this, we get a big collection of nested subsequences so that F_{n_k^(k)}(qk) → G(qk) for each k. Taking the diagonal nk = n_k^(k), we get that Fn_k(q) → G(q) for all rationals q.
Now let F(x) = inf{G(q) : q ∈ Q, q > x}. This is right continuous since:
lim_{xn↓x} F(xn) = inf{G(q) : q ∈ Q, q > xn for some n} = inf{G(q) : q ∈ Q, q > x} = F(x)
To see that Fn_k(x) → F(x) at continuity points, just approximate F(x) by F(r1) and F(r2) where r1, r2 are rationals with r1 < x < r2, and use the convergence of G. F is non-decreasing since Fn(x1) ≤ Fn(x2) for all x1 ≤ x2, so any limit along a subsequence will have this property too.
Remark. In functional analysis the selection theorem is that for X a separable n.v.s., the unit ball in the dual space B* = {f ∈ X* : ‖f‖ ≤ 1} is weak* compact. (Recall the weak* topology on X* is the one characterized by fn → f ⟺ fn(x) → f(x) for all x.) The proof of this is just the subsequences/diagonal-sequences trick of the first part of the proof.
If we apply this version of Helly's theorem to the space X = C([0, 1]), which has X* = finite signed measures, then we get that every sequence of probability measures µn has a subsequence µn_k so that E_{µn_k}(f) → E_ν(f) for every bounded continuous f. This is exactly the notion of weak convergence, µn_k ⇒ ν, we have! ν need not be a probability measure here; it is only a finite signed measure. However, by using Fatou, we know that ν([0, 1]) ≤ lim inf_{n→∞} µn([0, 1]) = 1, so mass can be lost, but mass cannot be gained. We will now look at conditions under which we can be sure that no mass is lost... i.e. the limit is in fact a proper probability measure.
Definition. We say that a family of probability measures {µα}_{α∈I} is tight if for all ε > 0 there is a compact set K so that:
µα(K^c) < ε ∀α ∈ I
Theorem. (3.2.7.) Let µn be a sequence of probability measures on R. Every subsequential limit µn_k ⇒ ν is a probability measure if and only if µn is tight.
Proof. (⟸) Suppose µn is tight and Fn_k ⇒ F. For any ε > 0 find M so that µn([−M, M]^c) < ε for all n. Find continuity points r and s for F so that r < −M and s > M. Since Fn(r) → F(r) and Fn(s) → F(s) we have:
1 − F(s) + F(r) = lim_k (1 − Fn_k(s) + Fn_k(r)) ≤ ε
This shows that lim sup_{x→∞} (1 − F(x) + F(−x)) ≤ ε, and since ε is arbitrary this limsup is 0. Now, since lim sup_{x→∞} F(x) ≤ 1 and lim inf_{x→−∞} F(x) ≥ 0, we have that 0 = lim sup_{x→∞} (1 − F(x) + F(−x)) ≥ lim inf_{x→∞} (1 − F(x) + F(−x)) ≥ 1 − 1 + 0 = 0, so we have equality everywhere, from which we conclude that lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0, and so indeed F is a probability measure. (Basically the estimate above shows that no mass can leak away.)
(⟹) We prove the contrapositive. If Fn is not tight, then there exists an ε0 > 0 so that for every set [−n, n] there is an nk so that Fn_k([−n, n]^c) ≥ ε0. By Helly, this has a convergent sub-subsequence and the limit is easily verified to not be a probability measure.
Theorem. (3.2.8.) If there is a ϕ ≥ 0 so that ϕ(x) → ∞ as |x| → ∞ and:
sup_n ∫ ϕ(x) dFn(x) = C < ∞
then Fn is tight.
Proof. Do a generalized Chebyshev-type estimate:
Fn([−M, M]^c) = P(|Xn| ≥ M) ≤ E(ϕ(Xn)) / inf_{|x|≥M} ϕ(x) ≤ C / inf_{|x|≥M} ϕ(x) → 0
Theorem. (3.2.9.) If each subsequence of Xn has a further subsequence that converges a.s. to X, then Xn ⇒ X.
Proof. The first condition is actually equivalent to Xn → X in probability, which implies Xn ⇒ X.
Definition. (The Lévy/Prokhorov Metric) Define a metric on measures by: d(µ, ν) is the smallest ε > 0 so that, with A^ε = {x : d(x, A) ≤ ε},
µ(A) ≤ ν(A^ε) + ε and ν(A) ≤ µ(A^ε) + ε ∀A ∈ B
Exercise. (3.2.6.) This is indeed a metric, and µn ⇒ µ if and only if d(µn, µ) → 0.
Proof. (⟸) is clear by using closed sets and checking the limsup-on-closed-sets characterization of weak convergence. (⟹) You need to use the separability of R here.
Exercise. (3.2.12.) If Xn ⇒ c then Xn → c in probability.
Proof. For any ε > 0 we have P(|Xn − c| > ε) = E(1_{[c−ε, c+ε]^c}(Xn)) → E(1_{[c−ε, c+ε]^c}(c)) = 0 (justified by Portmanteau (v): the boundary points c ± ε are a null set for the point mass at c).
Remark. The converse is also true, since converging in probability implies weak convergence (most easily seen by the bounded convergence theorem).
Exercise. (3.2.13) (Converging Together Lemma) If Xn ⇒ X and Yn ⇒ c, then Xn + Yn ⇒ X + c.
Remark. Does not hold if Yn ⇒ Y in general. For example, on Ω = (0, 1) let Xn(ω) = the n-th digit of the binary expansion of ω. Notice that Xn is always a coin flip, so Xn ⇒ A where A is a coin flip. (This is a good example where Xn ⇒ X but there is no convergence in probability or a.s. convergence or anything.) If Yn(ω) = 1 − Xn(ω) then this is still a coin flip.
But Xn + Yn = 1 now is a constant, and does not converge to a sum of two coin flips or anything like that.
Proof. Since Yn ⇒ c we know that Yn → c in probability. Now, for any closed set F let F_ε = {x : d(x, F) ≤ ε}. Then:
P(Xn + Yn ∈ F) ≤ P(|Yn − c| > ε) + P(Xn + c ∈ F_ε)
So taking limsup now:
lim sup P(Xn + Yn ∈ F) ≤ lim sup P(|Yn − c| > ε) + lim sup P(Xn + c ∈ F_ε) ≤ 0 + P(X + c ∈ F_ε)
Finally, since F is closed, we have that P(X + c ∈ F) = P(X + c ∈ ∩_n F_{1/n}) = lim_{n→∞} P(X + c ∈ F_{1/n}), so taking ε → 0 in the above inequality gives us weak convergence via the Portmanteau theorem.
Corollary. If Xn ⇒ X, Yn ≥ 0 and Yn ⇒ c > 0, then Xn Yn ⇒ cX. (This actually holds without Yn ≥ 0 or c > 0 by splitting up the probability space.)
Proof. Notice that log(Yn) ⇒ log(c). Then by the converging together lemma, log(Xn) + log(Yn) ⇒ log(X) + log(c), and taking exp gives Xn Yn ⇒ cX. (Might need to truncate or something to make this more rigorous, since the X's need not be positive.)
Exercise. (3.2.15) If Xn = (Xn^1, ..., Xn^n) is uniformly distributed over the surface of a sphere of radius √n, then Xn^1 ⇒ a standard normal.
Proof. Let Y1, Y2, ... be iid standard normals and let Xn^i = Yi √n / (∑_{m=1}^n Ym²)^{1/2}; then check indeed that (Xn^1, ..., Xn^n) is uniformly distributed over the surface of a sphere of radius √n, so this is a legit way to construct the distribution. Then Xn^1 = Y1 √n / (∑_{m=1}^n Ym²)^{1/2} ⇒ Y1 · 1 by the strong law of large numbers and the converging together lemma.
3.7. Characteristic Functions
3.7.1. Definition, Inversion Formula. If X is a random variable, we define its characteristic function by:
ϕ(t) = E(e^{itX}) = E(cos tX) + iE(sin tX)
Proposition. (3.3.1) All characteristic functions have:
i) ϕ(0) = 1
ii) ϕ(−t) = the complex conjugate of ϕ(t)
iii) |ϕ(t)| = |E(e^{itX})| ≤ E|e^{itX}| = 1
iv) |ϕ(t + h) − ϕ(t)| ≤ E|e^{ihX} − 1|; since this does not depend on t, this shows that ϕ is uniformly continuous.
v) ϕ_{aX+b}(t) = e^{itb} ϕ_X(at)
vi) For X1, X2 independent, ϕ_{X1+X2}(t) = ϕ_{X1}(t) ϕ_{X2}(t)
Proof. The only one I will comment on is iv). This holds since:
|ϕ(t + h) − ϕ(t)| = |E(e^{i(t+h)X} − e^{itX})| ≤ E|e^{i(t+h)X} − e^{itX}| = E|e^{ihX} − 1| → 0 as h → 0 by BCT
Since this → 0 and does not depend on t, we see that ϕ is uniformly continuous.
Example. I collect the examples in a table (Name: pdf/pmf; char. fun; remark):
• Coin Flip: P(X = ±1) = 1/2; ϕ(t) = (e^{it} + e^{−it})/2 = cos t.
• Poisson: P(X = k) = e^{−λ} λ^k/k!; ϕ(t) = exp(λ(e^{it} − 1)).
• Normal: ρ(x) = (1/√(2π)) exp(−x²/2); ϕ(t) = exp(−t²/2). Prove this by deriving ϕ' = −tϕ, or complete the square.
• Uniform on [a, b]: ρ(x) = 1_{[a,b]}(x)/(b − a); ϕ(t) = (exp(itb) − exp(ita))/(it(b − a)). This one is useful to think about for the inversion formula: ∫ (e^{−ita} − e^{−itb})/(it(b − a)) ϕ_X(t) dt ≈ µ_X(a, b)/(b − a), i.e. in spirit ∫ ϕ_{1_A} ϕ_X dt ≈ µ_X(A)/L(A).
• Triangular: ρ(x) = (1 − |x|)^+; ϕ(t) = 2(1 − cos t)/t². Use the fact that the triangular is the sum of two independent Uniforms.
• Difference: Y = X − X̃ (independent copy); ϕ_Y(t) = |ϕ_X(t)|².
• Superposition/Difference: Y = CX where C is a ±1 coin flip; ϕ_Y(t) = Re(ϕ_X(t)). Use the fact that ϕ is linear with respect to superposition: ∑ λi Fi has char fun ∑ λi ϕi; in this case it's (1/2)F_X + (1/2)F_{−X}.
• Exponential: ρ(x) = e^{−x}, x ≥ 0; ϕ(t) = 1/(1 − it).
• Bilateral Exponential: ρ(x) = e^{−|x|}/2; ϕ(t) = Re(1/(1 − it)) = 1/(1 + t²). Use the superposition/difference trick.
• Polya's: ρ(x) = (1 − cos x)/(πx²); ϕ(t) = (1 − |t|)^+. Pf comes from the inversion formula and the triangle distribution. Used in Polya's theorem to show that convex, decreasing functions ϕ with ϕ(0) = 1 are characteristic functions of something.
• Cauchy: ρ(x) = 1/(π(1 + x²)); ϕ(t) = exp(−|t|). Pf comes from inverting the bilateral exponential.
Theorem. (3.3.4.) (The inversion formula) Let ϕ(t) = ∫ e^{itx} µ(dx) where µ is a probability measure. If a < b then:
lim_{T→∞} (2π)^{−1} ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) · ϕ(t) dt = µ(a, b) + (1/2) µ({a, b})
Proof. Write:
I_T = ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) ϕ(t) dt = ∫_{−T}^T (e^{−ita} − e^{−itb})/(it) ∫ e^{itx} µ(dx) dt
Notice that (e^{−ita} − e^{−itb})/(it) = ∫_a^b e^{−ity} dy, so this is bounded above in norm by b − a. Hence we can apply Fubini to get:
I_T = ∫ ∫_{−T}^T ∫_a^b e^{it(x−y)} dy dt µ(dx) = ∫ ( ∫_{−T}^T sin(t(x−a))/t dt − ∫_{−T}^T sin(t(x−b))/t dt ) µ(dx)
where we use the fact that cos is even and sin is odd here so the imaginary part cancels itself out (this relies on T finite). Now if we let R(θ, T) = ∫_{−T}^T sin(θt)/t dt, then we can show that R(θ, T) = 2 sgn(θ) ∫_0^{|θ|T} (sin x)/x dx → π sgn(θ) for θ ≠ 0, and = 0 for θ = 0. We have then:
I_T = ∫ [R(x − a, T) − R(x − b, T)] µ(dx) → ∫ (2π 1_{a<x<b} + π 1_{x∈{a,b}}) µ(dx)
So by the bounded convergence theorem, we get the result.
Exercise. (3.3.2.) Similarly:
µ({a}) = lim_{T→∞} (1/2T) ∫_{−T}^T e^{−ita} ϕ(t) dt
The inversion formula basically tells us that distributions are characterized by their char functions. Two easy consequences are then:
Exercise. (3.3.3.) ϕ is real if and only if X and −X have the same distribution.
Proof. X and −X have the same distribution iff X =^d CX for C a ±1 coin flip iff ϕ_X = ϕ_{CX} iff ϕ_X = Re(ϕ_X). (We had everything except the ⟸ before the inversion formula, and the inversion formula gives us this.)
Exercise. (3.3.4.) The sum of two independent normal distributions is again normal.
Proof. The c.f. for a sum of two independent normals is the c.f. of a normal, so it must be a normal distribution.
Theorem. (3.3.5.) If ∫ |ϕ(t)| dt < ∞, then µ has bounded continuous density function:
f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt
Proof. As we observed before, the kernel (e^{−ita} − e^{−itb})/(it) has norm ≤ b − a, so we see that µ has no point masses since:
µ(a, b) + (1/2)µ({a, b}) = (1/2π) |∫_{−∞}^∞ (e^{−ita} − e^{−itb})/(it) ϕ(t) dt| ≤ ((b − a)/2π) ∫_{−∞}^∞ |ϕ(t)| dt → 0 as b − a → 0
Can now calculate the density function by looking at µ(x, x + h) with the inversion formula, using Fubini and taking h → 0:
µ(x, x + h) + 0 = (1/2π) ∫ ∫_x^{x+h} e^{−ity} ϕ(t) dy dt = ∫_x^{x+h} ( (1/2π) ∫ e^{−ity} ϕ(t) dt ) dy
The dominated convergence theorem tells us that f is continuous:
f(y + h) − f(y) = (1/2π) ∫ e^{−ity} (e^{−ith} − 1) ϕ(t) dt → 0 by LDCT
Exercise. (3.3.5.) Give an example of a measure µ which has a density, but for which ∫ |ϕ(t)| dt = ∞.
Proof. The uniform distribution has this, since ϕ(t) ≈ 1/t. This makes sense since its density is not continuous.
Exercise. (3.3.6.) If X1, X2, ... are iid uniform on (−1, 1), then ∑_{i≤n} Xi has density:
f(x) = (1/π) ∫_0^∞ ((sin t)/t)^n cos(tx) dt
This is a piecewise n-th order polynomial.
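A numerical check of the density inversion formula (Thm 3.3.5) for the standard normal, whose ϕ(t) = exp(−t²/2) is certainly integrable. A Riemann sum over a wide window stands in for the integral over R; the window and grid are arbitrary choices.

```python
import numpy as np

t = np.linspace(-40, 40, 20_001)
phi = np.exp(-t**2 / 2)
dt = t[1] - t[0]
for y in [0.0, 0.5, 1.0, 2.0]:
    f = ((np.exp(-1j * t * y) * phi).real.sum() * dt) / (2 * np.pi)
    print(y, f, np.exp(-y**2 / 2) / np.sqrt(2 * np.pi))  # columns agree
```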
Remark. Theorem 3.3.5, the Riemann-Lebesgue Lemma, and the following result all tell us that point masses in the distribution correspond to the behaviour at ∞ of the c.f.
Theorem. (Riemann-Lebesgue lemma) If µ has a density function f, then ϕ_µ(t) → 0 as t → ∞.
Proof. If f is differentiable and compactly supported, then integrating by parts we have:
|ϕ_µ(t)| = |∫ f(x) e^{−itx} dx| = |(1/it) ∫ f'(x) e^{−itx} dx| ≤ (1/|t|) ∫ |f'(x)| dx → 0
Any arbitrary density function can be approximated in L¹ by such f (indeed, these are dense in L¹ by the construction of the Lebesgue integral... just smoothly approximate open intervals). (Alternatively, show it for simple functions, which are also dense in L¹.)
Exercise. (3.3.7.) If X, X̃ are iid copies of an R.V. with c.f. ϕ, then:
lim_{T→∞} (1/2T) ∫_{−T}^T |ϕ(t)|² dt = P(X − X̃ = 0) = ∑_x µ({x})²
Proof. The result follows by the inversion formula (Exercise 3.3.2) and since ϕ_{X−X̃} = |ϕ_X|².
Corollary. If ϕ(t) → 0 as t → ∞, then µ has no atoms.
Proof. If ϕ(t) → 0, then the average (1/2T) ∫_{−T}^T |ϕ(t)|² dt → 0 as T → ∞ too, so by the previous formula µ has no atoms.
Remark. Don't forget there are distributions, like the Cantor distribution, which have no atoms and yet no density.
3.7.2. Weak Convergence - IMPORTANT!
Theorem. (3.3.6.) (Continuity Theorem) Let µn be probability measures with c.f.'s ϕn.
i) If µn ⇒ µ∞, then ϕn(t) → ϕ∞(t) for all t.
ii) If ϕn → ϕ pointwise and ϕ(t) is continuous at 0, then the associated sequence of distribution functions µn is tight and converges weakly to the measure µ with char function ϕ.
Proof. i) is clear since e^{itx} is bounded and continuous.
ii) We will first show that it suffices to check that µn is tight. Suppose µn is tight. We claim that µn ⇒ µ∞ by the every-subsequence-has-a-sub-subsequence criterion. Indeed, given any subsequence µn_k, we use Helly to find a sub-subsequence µn_{k_l} with µn_{k_l} ⇒ µ0 for some µ0. Now since the sequence is tight, we know that µ0 is a legitimate probability distribution. By i), since µn_{k_l} ⇒ µ0, we know that ϕn_{k_l}(t) → ϕ0(t). However, by hypothesis, ϕn → ϕ, so it must be that ϕ0 = ϕ!
To see that the sequence is tight, you use the continuity at 0. The idea is to use that (1/u) ∫_{−u}^u (1 − ϕ(t)) dt → 0 as u → 0 by this continuity. If you write it out, (1/u) ∫_{−u}^u (1 − ϕn(t)) dt controls something like µn(|x| > 2/u), so this is exactly what is needed for tightness.
Remark. Here is a good example to keep in mind for the continuity theorem: take Xn ∼ N(0, n) so that ϕn(t) = exp(−nt²/2) → 0 for t ≠ 0 and = 1 at t = 0. Xn can't converge weakly to anything since µn((−∞, x]) → 1/2 for every x, and the constant 1/2 is not a distribution function (half the mass escapes to each of ±∞). Note the limit of the ϕn's is discontinuous at 0 here.
Exercise. (3.3.9) If Xn ⇒ X, Xn is normal with mean µn and variance σn², and X is normal with mean µ and variance σ², then µn → µ and σn² → σ².
Proof. Look at the c.f.'s.
Exercise. (3.3.10) If Xn and Yn are independent and Xn ⇒ X∞ and Yn ⇒ Y∞, then Xn + Yn ⇒ X∞ + Y∞.
Proof. Look at the c.f.'s.
Exercise. (3.3.12.) Interpret the identity sin t/t = ∏_{m=1}^∞ cos(t/2^m) probabilistically. (You can get this formula from sin t = 2 sin(t/2) cos(t/2) applied repeatedly and using 2^k sin(t/2^k) → t as k → ∞.)
Proof. sin t/t is the c.f. for a Uniform[−1, 1] random variable. cos(t/2^m) is the c.f. for a coin flip Xm = ±1/2^m. So this is saying that the sum of infinitely many coin flips like this is a uniform random variable. This is equivalent to the fact that the n-th binary digit of a uniformly chosen x from [0, 1] is a coin flip. (Add 1 = ∑_{m=1}^∞ 2^{−m} to both random variables, and then divide by 2.)
Exercise. (3.3.13.) Let X1, ... be iid coin flips taking values 0 and 1, and let X = ∑_{j≥1} 2Xj/3^j. [This converges almost surely by the Kolmogorov 3-series theorem.] This has the Cantor distribution. Compute the c.f. ϕ of X and notice that ϕ has the same value at t = 3^k π for every k.
Proof. Each Xj has c.f. (1 + e^{it})/2, so 2Xj/3^j has c.f. (1 + e^{2it/3^j})/2, so we have that:
ϕ(t) = ∏_{j=1}^∞ (1 + e^{2it/3^j})/2
If you put in t = 3^k π, the first k factors equal 1 (their exponents are integer multiples of 2πi) and the remaining product is the same as for t = π, so the value of the function does not change.
Remark. This fact shows that ϕ(t) does not tend to 0 as t → ∞, which means that it is impossible for this random variable to have a density (by the Riemann-Lebesgue lemma). You can see there are no atoms because for ω ∈ {0, 1}^N a possible sequence of the Xi's, X(ω) is the number whose ternary expansion is ω with 1's replaced by 2's, so for a given x, the set {X(ω) = x} consists of at most one sequence: namely the ternary expansion of x (you have to replace 2's with 1's... you get the idea). Consequently, we can explicitly see that for any x, P(X = x) = 0, and X has no atoms.
3.7.3. Moments and Derivatives. Part of the proof of the "ϕn → ϕ continuous at 0 implies Xn ⇒ X" theorem was the estimate µ(|x| > 2/u) ≤ u^{−1} ∫_{−u}^u (1 − ϕ(t)) dt. (This was used to show that if ϕ is continuous at 0 then ϕn is tight.) This suggests that the local behaviour at 0 is related to the decay of the measure at ∞. We see more of this here:
Exercise. (3.3.14) If ∫ |x|^n µ(dx) < ∞, then the c.f. ϕ has continuous derivatives of order n given by ϕ^(n)(t) = ∫ (ix)^n e^{itx} µ(dx).
Proof. Let's do n = 1 first. Consider:
(ϕ(t + h) − ϕ(t))/h = (1/h) ∫ e^{itx}(e^{ihx} − 1) µ(dx)
Now use the estimate |e^{ihx} − 1|/|h| ≤ |x| (to see this write (e^{ihx} − 1)/h = (ix/h) ∫_0^h e^{izx} dz and then bound the integral), so we can apply a dominated convergence theorem to get the result. The case n > 1 is not much different.
Exercise. (3.3.15.) By differentiating the characteristic function of a normal random variable, we see that for X ∼ N(0, 1) with c.f. e^{−t²/2} we have:
E(X^{2n}) = (2n − 1)(2n − 3) · ... · 3 · 1 ≡ (2n − 1)!!
The next big result is:
Theorem. (3.3.8.) If E(|X|²) < ∞, then:
ϕ(t) = 1 + itE(X) − t²E(X²)/2 + ε(t)
with:
|ε(t)| ≤ E[min(|tX|³/3!, 2|tX|²/2!)]
Remark. Just using what we just did, you would need the condition E|X|³ < ∞ to get that ϕ is thrice differentiable, and then you would have the result by Taylor's theorem. However, by doing some slightly more careful calculations, we can actually show that:
|e^{ix} − ∑_{m=0}^n (ix)^m/m!| ≤ min(|x|^{n+1}/(n+1)!, 2|x|^n/n!)
(again this is done by integration-type estimates), and then using Jensen's inequality we'll have:
|E(e^{itX}) − ∑_{m=0}^n E((itX)^m)/m!| ≤ E[min(|tX|^{n+1}/(n+1)!, 2|tX|^n/n!)]
So even if we only have two moments, the error is bounded by t² E[min(|t||X|³/3!, |X|²)]; since the min is dominated by |X|² and tends to 0 pointwise as t → 0, the error is o(t²) by dominated convergence.
Theorem. (3.3.10) (Polya's criterion) Let ϕ(t) be real, nonnegative, with ϕ(0) = 1 and ϕ(t) = ϕ(−t), and say ϕ is decreasing and convex on (0, ∞) with:
lim_{t↓0} ϕ(t) = 1, lim_{t↑∞} ϕ(t) = 0
Then there is a probability measure ν on (0, ∞) so that:
ϕ(t) = ∫_0^∞ (1 − |t|/s)^+ ν(ds)
Since this is a superposition of (rescaled) characteristic functions of the Polya r.v. (ρ_POLYA(x) ∼ (1 − cos x)/x², ϕ_POLYA(t) = (1 − |t|)^+), this shows that ϕ is a char function.
Proof. I'm going to skip the rather technical proof. Part of it is that you can approximate ϕ by piecewise linear functions, and for those it's easier.
Example. (3.3.10.) exp(−|t|^α) is a char function for all 0 < α < 2.
Proof. The idea is to write exp(−|t|^α) as a limit of characteristic functions, namely:
exp(−|t|^α) = lim_{n→∞} ψ(t · √2 · n^{−1/α})^n
where ψ(t) = 1 − (1 − cos t)^{α/2}.
This convergence is seen since 1 − cos t ∼ t²/2. ψ is a char function because we can write it as a convex combination of powers (cos t)^n (recall cos t is the char fun for a coin flip), via the binomial series:
1 − (1 − cos t)^{α/2} = ∑_{n=1}^∞ (α/2 choose n) (−1)^{n+1} (cos t)^n
Exercise. (3.3.23.) This family of RV's is of interest because they are stable, in the sense that a scaled sum of many iid copies has the same distribution:
(X1 + ... + Xn)/n^{1/α} ∼ X
The case α = 2 is the normal distribution and the case α = 1 is the Cauchy distribution.
3.8. The moment problem
Suppose ∫ x^k dFn(x) has a limit µk for each k. We know then that the Fn are tight (a single function φ that goes to ∞ at ∞ and for which sup_n ∫ φ(x) dFn(x) < ∞ shows it's tight by a Cheb estimate; take φ(x) = x²). Then we know by Helly that every subsequence has a sub-subsequence that converges weakly to a legit probability distribution. We also know that every limit point will have moments µk (can check in this case that Fn_k ⇒ F and ∫ x^k dFn(x) → µk shows ∫ x^k dF(x) = µk too... this will use the fact that sup_n ∫ |x|^{k+1} dFn(x) < ∞, I think), so every limit distribution has the moments µk.
Question: Is there a unique limit? It would suffice to check that there is only one distribution with the moments µk. Unfortunately this is not always true.
Example. (An example of a collection of different probability distributions which have the same moments) Consider the lognormal density:
f0(x) = (2π)^{−1/2} x^{−1} exp(−(log x)²/2), x ≥ 0
(This is what you get if you look at the density of exp(Z) for Z ∼ N(0, 1) a standard normal.) For −1 ≤ a ≤ 1 let:
fa(x) = f0(x)(1 + a sin(2π log x))
To see that this has the same moments as f0, check that for every integer r ≥ 0:
∫_0^∞ x^r f0(x) sin(2π log x) dx = 0
Indeed, after the change of variables x = e^{s+r} (shifting by the integer r to complete the square), the integral is like ∫ exp(−s²/2) sin(2πs) ds, which is 0 since the integrand is odd (here we use that r is an integer, so sin(2π(s + r)) = sin(2πs)).
One can check that the moments of the lognormal density are µk = E(exp(kZ)) = exp(k²/2) (it's the usual completing-the-square trick, or you can do it with derivatives). Notice that these get very large very fast! Also notice that the density decays like exp(−(log x)²/2), which is rather slow. We will now show that if the moments don't grow too fast (or equivalently if the density decays fast), then there IS a unique distribution.
Theorem. (3.3.11) If lim sup_{k→∞} µ_{2k}^{1/2k}/2k = r < ∞, then there is at most one distribution function with moments µk.
Proof. Let F be any d.f. with moments µk. By Cauchy-Schwarz, the absolute moments νk = ∫ |x|^k dF(x) have ν_{2k} = µ_{2k} and ν_{2k+1} ≤ √(µ_{2k} µ_{2k+2}), and so:
lim sup_{k→∞} νk^{1/k}/k = 2r < ∞
We next use the modified Taylor series for char functions to conclude that the error in the Taylor expansion of ϕ about any point θ is:
|ϕ(θ + t) − ϕ(θ) − tϕ'(θ) − ... − (t^{n−1}/(n−1)!) ϕ^{(n−1)}(θ)| ≤ |t|^n νn/n!
Since νk ≤ (r + ε)^k k^k for large k (by the limsup), and using the bound e^k ≥ k^k/k!, we see that the above estimate implies that ϕ converges to its Taylor series about any point θ in a nbhd of fixed radius |t| ≤ 1/er. If G is any other distribution with moments µk, we know that G and F have the same Taylor series! But then G and F agree in a nbhd of 0 by the above characterization. By induction, we can repeatedly make the radius of the nbhd bigger and bigger to see the char functions of F and G agree everywhere. But in this case F must be equal to G by the inversion formula.
Remark. This condition is slightly stronger than Carleman's Condition that:
∑_{k=1}^∞ 1/µ_{2k}^{1/2k} = ∞
3.9. The Central Limit Theorem
Proposition. We have the following estimate:
|E(e^{itX}) − (1 + itE(X) − (t²/2)E(X²))| ≤ E[min(|tX|³/3!, 2|tX|²/2!)] ≤ t² E[min(|t||X|³/3!, |X|²)]
Proof. Do integration-type estimates on the function f(x) = e^{ix}. We aim to show |e^{ix} − (1 + ix − x²/2)| ≤ min(|x|³/3!, 2|x|²/2!). The bound by |x|³/3! is just by the usual Taylor series. The other bound follows by a trick: do a Taylor series expansion to first order, then add/subtract x²/2, writing x²/2 = ∫_0^x y dy:
e^{ix} = 1 + ix − ∫_0^x y e^{i(x−y)} dy
= 1 + ix − x²/2 + ∫_0^x y dy − ∫_0^x y e^{i(x−y)} dy
= 1 + ix − x²/2 − ∫_0^x y (e^{i(x−y)} − 1) dy
But |e^{i(x−y)} − 1| ≤ 2, so the error term is bounded by 2 ∫_0^{|x|} y dy = 2|x|²/2!, and we are done. Now put in x = tX, take E, and use Jensen's inequality:
|E(e^{itX}) − (1 + itE(X) − (t²/2)E(X²))| ≤ E|e^{itX} − (1 + itX − (tX)²/2)| ≤ E[min(|tX|³/3!, 2|tX|²/2!)]
Theorem. (iid CLT) If Xn are iid with E(X) = 0 and E(X²) = 1, then Sn/√n ⇒ N(0, 1).
Proof. Have by the last estimate that:
φ(t) = 1 − t²/2 + ε(t)
where the error satisfies |ε(t)| ≤ t² E[min(|t||X|³/3!, |X|²)]. Hence the char function for Sn/√n is:
φ_{Sn/√n}(t) = φ(t/√n)^n = (1 − t²/2n + ε(t/√n))^n
Now notice that n(−t²/2n + ε(t/√n)) → −t²/2, since n|ε(t/√n)| ≤ t² E[min(|t||X|³/(3!√n), |X|²)] → 0 as n → ∞ by the LDCT. We now use the fact that if cn → c then (1 + cn/n)^n → e^c, so we have:
φ_{Sn/√n}(t) → exp(−t²/2)
By the continuity theorem, we have then Sn/√n ⇒ N(0, 1).
Lemma. If zn → z, then (1 + zn/n)^n → e^z.
Proof. Suppose WLOG that |zn/n| < 1/2 for all n in consideration. Since log(1 + z) is a holomorphic function and invertible (with e^z as its inverse) in the nbhd |z| < 1/2, it suffices to show that:
n log(1 + zn/n) → z
But log(1 + z) has an absolutely convergent Taylor series expansion around z = 0 in this nbhd. Write this as log(1 + z) = z + z g(z) where g(z) is given by some absolutely convergent power series in this nbhd and g(0) = 0. We have:
n log(1 + zn/n) = n(zn/n + (zn/n) g(zn/n)) = zn + zn g(zn/n) → z + z·g(0) = z
Theorem. (CLT for Triangular Arrays - Lindeberg-Feller Theorem) A sum of small independent errors is normally distributed. Let Xn,m, 1 ≤ m ≤ n, be a triangular array of independent (but not necessarily iid) random variables. Suppose E(Xn,m) = 0. Suppose that:
∑_{m=1}^n E(Xn,m²) → σ² > 0
and that:
∀ε > 0, lim_{n→∞} ∑_{m=1}^n E(|Xn,m|²; |Xn,m| > ε) = 0
Then:
Sn ⇒ N(0, σ²)
Proof. The characteristic function for Sn is:
φ_{Sn}(t) = ∏_{m=1}^n φ_{Xn,m}(t) = ∏_{m=1}^n (1 − (t²/2)E(Xn,m²) + εn,m(t))
The error satisfies |εn,m(t)| ≤ |t|² E[min(|t||Xn,m|³, |Xn,m|²)]. We now claim that:
∑_{m=1}^n (−(t²/2)E(Xn,m²) + εn,m(t)) → −(t²/2)σ²
as n → ∞. It suffices to show that ∑_{m=1}^n εn,m(t) → 0. Indeed for any ε > 0, we have:
∑_{m=1}^n |εn,m(t)| ≤ |t|² ∑_{m=1}^n E[min(|t||Xn,m|³, |Xn,m|²); |Xn,m| > ε] + |t|² ∑_{m=1}^n E[min(|t||Xn,m|³, |Xn,m|²); |Xn,m| ≤ ε]
≤ |t|² ∑_{m=1}^n E(|Xn,m|²; |Xn,m| > ε) + |t|³ ε ∑_{m=1}^n E(|Xn,m|²)
→ 0 + |t|³ ε σ²
Since ε > 0 is arbitrary, the claim follows. Now also use that max_{1≤m≤n} E(Xn,m²) → 0 as n → ∞ (implied by the ε condition), and then use the fact for complex numbers that if max_{1≤j≤n} |cj,n| → 0 and ∑_{j=1}^n cj,n → λ and sup_n ∑_{j=1}^n |cj,n| < ∞, then ∏_{j=1}^n (1 + cj,n) → e^λ.
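A quick Monte Carlo look at the iid CLT just proven. Centered Exponential(1) summands are a deliberately skewed choice (my own, for illustration), and the standardized sums nonetheless match standard normal tail probabilities.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000
Z = (rng.exponential(1.0, size=(20_000, n)).sum(axis=1) - n) / np.sqrt(n)
for c in [0.5, 1.0, 2.0]:
    print(c, (Z > c).mean())   # ~ 1 - Phi(c): 0.309, 0.159, 0.023
```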
3.10. Other Facts about CLT results

Theorem. $S_n = X_1 + X_2 + \dots + X_n$ for $X_i$ iid has $Z_n := (S_n - nE(X))/\sqrt{n\operatorname{Var}(X)} \Rightarrow N(0,1)$ weakly as $n \to \infty$.

Example. $Z_n$ does not converge in probability or in the almost sure sense.

Proof. Assume WLOG that $EX = 0$ and $\operatorname{Var} X = 1$. Take a subsequence $n_k$. Let $Y_k = \sum_{i=n_{k-1}+1}^{n_k} X_i / \sqrt{n_k - n_{k-1}}$, so that the $Y_k$'s are independent. Write after some manipulation that:
$$Z_{n_k} = \sqrt{1 - \frac{n_{k-1}}{n_k}}\, Y_k + \sqrt{\frac{n_{k-1}}{n_k}}\, Z_{n_{k-1}}$$
Now, if $Z_n \to A$ almost surely, then by choosing a sequence $n_k$ with $\frac{n_{k-1}}{n_k} \to 0$ (e.g. $n_k = k!$) we see from the above that $Y_k \to A$ almost surely too. But since the $Y_k$ are independent, $A$ is a constant, which is absurd (since $Z_n \Rightarrow N(0,1)$ forces $A \sim N(0,1)$, not a constant).

Example. If $X_1, \dots$ are independent and $X_n \to Z$ as $n \to \infty$ almost surely or in probability, then $Z \equiv \text{const}$ almost surely.

Proof. Suppose by contradiction that $Z$ is not almost surely a constant. Then find $x < y$ so that $P(Z \le x) \ne 0$ and $P(Z \ge y) \ne 0$. (E.g. look at $\inf\{x : P(Z \le x) \ne 0\}$ and $\sup\{y : P(Z \ge y) \ne 0\}$; if these are equal then $Z$ is a.s. constant, and if they are not equal then we can find the desired $x, y$.) Assume WLOG that $x, y$ are continuity points of the d.f. of $Z$ (indeed, any subinterval of $(x,y)$ will work and there can only be countably many discontinuities). Now, since a.s. convergence and convergence in probability are stronger than weak convergence, we have that $P(X_n \le x) \to P(Z \le x) \ne 0$ and $P(X_n \ge y) \to P(Z \ge y) \ne 0$. In particular the sequence $P(X_n \le x)$ is not summable. By the second Borel-Cantelli lemma, $X_n \le x$ happens infinitely often a.s. Similarly, $X_n \ge y$ happens infinitely often a.s. On this measure-1 set, $X_n$ has no chance to converge a.s. or in probability, as it is both $\le x$ and $\ge y$ infinitely often and these are separated by some positive gap.

Example. Once we have proven the CLT, we can show that actually:
$$\limsup \frac{S_n}{\sqrt{n}} = \infty \quad \text{a.s.}$$

Proof. First, check that $\left\{\limsup \frac{S_n}{\sqrt{n}} > M\right\}$ is a tail event. By the Kolmogorov 0-1 law, to show that this has probability 1 it suffices to show that the probability is positive. Then notice that by the central limit theorem, $\lim_{n\to\infty} P\left(\frac{S_n}{\sqrt{n}} > M\right) = P(\chi > M) > 0$ for $\chi \sim N(0,1)$, so it cannot be that $\left\{\limsup \frac{S_n}{\sqrt{n}} > M\right\}$ is a probability-0 event. Since this holds for every $M$ we get the result.

Theorem. (2.5.7 from Durrett) If $X_1, \dots$ are iid with $E(X_1) = 0$ and $E(X_1^2) = \sigma^2 < \infty$ then:
$$\frac{S_n}{\sqrt{n}\,(\log n)^{1/2 + \epsilon}} \to 0 \quad \text{a.s.}$$

Remark. Compare this with the law of the iterated log:
$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{n \log\log n}} = \sigma\sqrt{2}$$

Proof. It suffices, via the Kronecker lemma, to check that $\sum \frac{X_n}{\sqrt{n}\,(\log n)^{1/2+\epsilon}}$ converges a.s. By the Kolmogorov one-series theorem, it suffices to check that the variances are summable. Indeed:
$$\operatorname{Var}\left(\frac{X_n}{\sqrt{n}\,(\log n)^{1/2+\epsilon}}\right) = \frac{\sigma^2}{n (\log n)^{1+2\epsilon}}$$
which is indeed summable (e.g. by the Cauchy condensation test)!

3.11. Law of the Iterated Log

The Law of the Iterated Log tells us that when the $X_n$ are iid coinflips $\pm 1$ with probability $1/2$, or when the $X_n$ are iid $N(0,1)$ Gaussian random variables, then:
$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{n\log(\log n)}} = \sqrt{2} \quad \text{a.s.}$$
The proof is divided into two parts:

3.11.1. First half:
$$\forall \epsilon > 0, \quad P\left(\limsup_{n\to\infty} \frac{S_n}{\sqrt{n\log(\log n)}} > \sqrt{2} + \epsilon\right) = 0$$
The $\sqrt{2}$ comes from the $2$ appearing in the Chernoff bound $P\left(\max_{k\le n} S_k \ge \lambda\right) \le \exp\left(-\frac{\lambda^2}{2n}\right)$.

Proposition. Chernoff Bound:
$$P\left(\max_{k\le n} S_k \ge \lambda\right) \le \exp\left(-\frac{\lambda^2}{2n}\right)$$

Proof. Since $S_n$ is a martingale, we have Doob's inequality for submartingales:
$$P\left(\max_{k\le n} S_k \ge \lambda\right) \le \frac{E(|S_n|)}{\lambda}$$
Of course, any convex function of a martingale is a submartingale, so we can apply the inequality like we would any Chebyshev-type inequality, e.g.:
$$P\left(\max_{k\le n} S_k \ge \lambda\right) \le \frac{E[\exp(\theta S_n)]}{\exp(\theta\lambda)}$$
If the $S_n$ are Gaussian, then applying this directly leads to the Chernoff bound after optimizing over $\theta$. If the $X_n$ are $\pm 1$, then one must first get subgaussian tails for $E[\exp(\theta S_n)]$, which is known as the Hoeffding inequality. Again, once we have this, optimizing over $\theta$ gives the result.
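Here is a minimal Monte Carlo sketch (assuming numpy is available; trial counts are arbitrary choices of mine) comparing the running-maximum tail of a $\pm 1$ walk against the Chernoff/Hoeffding bound just derived:

```python
# Compare P(max_{k<=n} S_k >= lam) for a +-1 random walk to exp(-lam^2/(2n)).
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 20000
steps = rng.choice([-1, 1], size=(trials, n))
walk_max = steps.cumsum(axis=1).max(axis=1)   # max_{k<=n} S_k per trial

for lam in [30, 50, 70, 90]:
    empirical = np.mean(walk_max >= lam)
    bound = np.exp(-lam**2 / (2 * n))
    print(lam, empirical, bound)              # empirical <= bound in each row
```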
Idea of this half: With the Chernoff bound in hand, we can estimate:
$$P\left(\max_{k\le n} S_k > \left(\sqrt{2}+\epsilon\right)\sqrt{c\log(\log c)}\right) \le \exp\left(-\frac{\left(\sqrt{2}+\epsilon\right)^2 c\log(\log c)}{2n}\right) \le \exp\left(-(1+\tilde{\epsilon})\,\frac{c}{n}\,\log(\log c)\right)$$
So now let $A_n$ be the event that $S_k > (\sqrt{2}+\epsilon)\sqrt{k\log(\log k)}$ for some $\theta^{n-1} \le k \le \theta^n$. Then $P(A_n) \le P\left(\max_{k\le \theta^n} S_k > (\sqrt{2}+\epsilon)\sqrt{c\log(\log c)}\right)$ with $c = \theta^{n-1}$ and horizon $\theta^n$, so by the above estimate:
$$P(A_n) \le \exp\left(-(1+\tilde{\epsilon})\,\frac{1}{\theta}\,\log\left(\log \theta^{n-1}\right)\right) = \left((n-1)\log\theta\right)^{-(1+\tilde{\epsilon})/\theta} \lesssim n^{-(1+\tilde{\epsilon})/\theta}$$
which is summable if we choose $\theta$ correctly (e.g. $\theta$ close enough to $1$ that $(1+\tilde{\epsilon})/\theta > 1$)! Hence by Borel-Cantelli, this event only happens finitely often almost surely.

Extra short summary of the idea: Doob's inequality controls the whole sequence $\max_{k\le n} S_k$ just by looking at $S_n$. By the Chernoff bound, $S_n$ has subgaussian tails. By looking in the range $\theta^{n-1} \le k \le \theta^n$ we can use this to ensure that the probability of exceeding $(\sqrt{2}+\epsilon)\sqrt{k\log(\log k)}$ is summable.

3.11.2. Second half:
$$\forall \epsilon > 0, \quad P\left(\limsup_{n\to\infty} \frac{S_n}{\sqrt{n\log(\log n)}} > \sqrt{2} - \epsilon\right) = 1$$

Proof. The idea is based on the fact that the sequence $S_n/\sqrt{n\log(\log n)}$ behaves a little bit like a sequence of independent random variables. (Compare this to the reason that $S_n/\sqrt{n}$ cannot converge almost surely.) In this half we use the fact that the $S_{n_k} - S_{n_{k-1}}$ are an independent family of random variables. We can then use Borel-Cantelli for independent r.v.'s and the Mills ratio to show that $S_{n_k} - S_{n_{k-1}} > (\sqrt{2}-\epsilon)\sqrt{n_k \log(\log n_k)}$ infinitely often. Then by the first half of the law of the iterated log, $S_{n_{k-1}}$ is not too big... so these large differences force $S_n/\sqrt{n\log\log n} > \sqrt{2}-\epsilon$ infinitely often.

Moment Methods

These are based on some ideas from the notes by Terence Tao [4].

4.12. Basics of the Moment Method

4.12.1. Introduction. Suppose you wanted to prove that $X_n \Rightarrow X$ for some random variables $X_n$. One method, called the moment method, is to first prove that $E(X_n^k) \to E(X^k)$ for every $k$, and then to do some analytical work to show that this is sufficient in this case for $X_n \Rightarrow X$.

The reason this is good is that $E(X_n^k)$ might be easy to work with. In many situations $X_n^k$ has some combinatorial way of being looked at that might make a limit easy to see. For example, if $X_n = \sum_{i=1}^n Y_i$ is a sum, then $X_n^k$ gives a binomial-type formula, and so on. The same thing works for random matrices and the empirical spectral distribution (i.e. $X_n$ = a randomly chosen eigenvalue of a random matrix $M_n$): then $E(X_n^k) \simeq E\operatorname{Tr}(M_n^k)$, which again has a nice writing-out as sums of the form $\sum M_{i_1 i_2} M_{i_2 i_3} \cdots$, and again you can try to look at it combinatorially (see the sketch at the end of this introduction).

Some analytical justification is always needed, because $E(X_n^k) \to E(X^k)$ does not always imply $X_n \Rightarrow X$. Every case must be handled individually. For this reason, every proof here is divided into two sections: one where combinatorics is used to justify $E(X_n^k) \to E(X^k)$, and one where analytical methods are used to argue why this is enough to show $X_n \Rightarrow X$. (The latter half sometimes includes tricks like reducing to bounded random variables via truncation.)
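As a concrete instance of the random matrix remark above, here is a minimal sketch (assuming numpy is available) of the empirical spectral moments $\frac{1}{n}\operatorname{Tr}(M^k)$ for a Wigner matrix; it leans on the standard fact, not proved in these notes, that these approach the Catalan numbers for even $k$ (the moments of the semicircle law) and $0$ for odd $k$:

```python
# Empirical spectral moments (1/n) Tr(M^k) of an n x n Wigner matrix with
# entries of variance ~ 1/n, compared to the semicircle-law moments.
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n = 2000
A = rng.normal(size=(n, n)) / np.sqrt(n)
M = (A + A.T) / np.sqrt(2)            # symmetric, off-diagonal variance 1/n

eigs = np.linalg.eigvalsh(M)          # (1/n) Tr(M^k) = mean of eigs**k
for k in range(1, 7):
    catalan = comb(k, k // 2) // (k // 2 + 1) if k % 2 == 0 else 0
    print(k, np.round(np.mean(eigs**k), 3), catalan)
```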
4.12.2. Criteria for which moments converging implies weak convergence. There are some broadly applicable criteria under which $E(X_n^k) \to E(X^k) \implies X_n \Rightarrow X$, which are covered here and will be useful throughout.

Theorem. Suppose that $E(X_n^k) \to \mu_k$ and there is a unique distribution $X$ for which $\mu_k = E(X^k)$. Then $X_n \Rightarrow X$.

Proof. [sketch] First, we see that the $X_n$ must be tight (this uses the fact that $\limsup E(X_n^2) < \infty$, since it converges to $\mu_2$... a Chebyshev inequality then shows tightness). By Helly's theorem, every subsequence has a further subsequence $X_{n_{k_l}}$ that converges weakly, and the tightness guarantees us that the limit is a legitimate probability distribution. The resulting distribution must have moments $\mu_k$. Since $X$ is the unique distribution with moments $\mu_k$, it must be that $X_{n_{k_l}} \Rightarrow X$. Since every subsequence has a sub-subsequence converging to $X$, we conclude $X_n \Rightarrow X$ (this is the subsequence/sub-subsequence characterization of convergence in a metric space).

Unfortunately, given a random variable $X$ whose moments are $\mu_k$, it is not always true that $X$ is the unique distribution with moments $\mu_k$.

Example. (An example of a collection of different probability distributions which have the same moments) Consider the lognormal density:
$$f_0(x) = (2\pi)^{-1/2} x^{-1} \exp\left(-(\log x)^2/2\right), \quad x \ge 0$$
(This is what you get if you look at the density of $\exp(Z)$ for $Z \sim N(0,1)$ a standard normal.) For $-1 \le a \le 1$ let:
$$f_a(x) = f_0(x)\left(1 + a \sin(2\pi \log x)\right)$$
To see that this has the same moments as $f_0$, check that:
$$\int_0^\infty x^r f_0(x) \sin(2\pi \log x)\, dx = 0$$
Indeed, after a change of variables we see that the integral is like $\int \exp(-s^2/2)\sin(2\pi s)\, ds$, which is $0$ since the integrand is odd.

Remark. One can check that the moments of the lognormal density are $\mu_k = E(\exp(kZ)) = \exp(k^2/2)$ (it's the usual completing-the-square trick, or you can do it with derivatives). Notice that these get very large very fast! Also notice that the density decays like $\exp(-(\log x)^2/2)$ as $x \to \infty$, which is rather slow.

The following criteria show that as long as $\mu_k$ is not growing too fast, or the density goes to zero fast enough as $x \to \infty$, then there IS at most one distribution $X$ with moments $\mu_k$. Here is a super-baby version of the type of result that is useful:

Theorem. If $X_n$ and $X$ are all bounded in some range $[-M, M]$, then $E(X_n^k) \to E(X^k)$ for all $k$ iff $X_n \Rightarrow X$.

Proof. ($\Leftarrow$) is clear because the function $f(x) = x^k$ is bounded and continuous when we restrict to the range $[-M, M]$.
($\Rightarrow$) We will verify that for any bounded continuous function $f(x)$, $E(f(X_n)) \to E(f(X))$. By the Stone-Weierstrass theorem, we can approximate any continuous $f$ uniformly by polynomials on $[-M,M]$. I.e. $\forall \epsilon > 0$ find a polynomial $p$ so that $\|p - f\| \le \epsilon$, where $\|\cdot\|$ is the sup norm on the interval $[-M, M]$. We have that $E(p(X_n)) \to E(p(X))$, since $p$ is a finite linear combination of powers $x^k$ and we know that $E(X_n^k) \to E(X^k)$ for each $k$. Hence:
$$|E(f(X_n)) - E(f(X))| \le |E(f(X_n) - p(X_n))| + |E(p(X_n)) - E(p(X))| + |E(p(X) - f(X))| \le \epsilon + |E(p(X_n)) - E(p(X))| + \epsilon \to 2\epsilon$$
Since $\epsilon$ was arbitrary, this completes the proof; see the numerical sketch below.

Remark. The same idea works to show that if $X, Y$ are two random variables on $[-M, M]$ whose moments agree, then $E(f(X)) = E(f(Y))$ for all bounded continuous $f$, or in other words $X \stackrel{d}{=} Y$. The more advanced proofs use Fourier analysis. The connection is that the moments of $X$ correspond to the derivatives of the characteristic function, and the upshot is that if two distributions have the same characteristic function, then they are the same distribution.
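Here is a minimal numerical sketch of this Stone-Weierstrass step (assuming numpy; the test function, polynomial degree, and choice of $X \sim \mathrm{Uniform}[-1,1]$ are all arbitrary choices of mine): on a bounded interval, $E(f(X))$ is recovered from the moments alone via a polynomial approximation $p \approx f$.

```python
# E p(X) = sum_k c_k mu_k depends only on the moments mu_k; if p ~ f
# uniformly on [-1,1], this recovers E f(X) to within the sup-norm error.
import numpy as np

f = lambda x: np.exp(np.sin(3 * x))      # a bounded continuous test function
xs = np.linspace(-1, 1, 2001)
coeffs = np.polynomial.polynomial.polyfit(xs, f(xs), deg=16)   # p ~ f on [-1,1]

# moments of X ~ Uniform[-1,1]: mu_k = 1/(k+1) for even k, 0 for odd k
mu = np.array([1.0 / (k + 1) if k % 2 == 0 else 0.0 for k in range(len(coeffs))])

print(coeffs @ mu)       # E p(X), computed from the moments alone
print(f(xs).mean())      # direct approximation of E f(X); should agree closely
```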
The first theorem says that if the moments don't grow too fast then there is at most one distribution function with those moments.

Theorem. (3.3.11 from Durrett) If $\limsup_{k\to\infty} \mu_{2k}^{1/2k}/2k = r < \infty$ then there is at most one distribution function with moments $\mu_k$.

Proof. Let $F$ be any d.f. with moments $\mu_k$. By Cauchy-Schwarz, the absolute moments $\nu_k = \int|x|^k\, dF(x)$ satisfy $\nu_{2k} = \mu_{2k}$ and $\nu_{2k+1} \le \sqrt{\mu_{2k}\mu_{2k+2}}$, and so:
$$\limsup_{k\to\infty} \frac{\nu_k^{1/k}}{k} = r < \infty$$
We next use the modified Taylor series for characteristic functions to conclude that the error in the Taylor expansion of $\varphi(t)$ is:
$$\left|\varphi(\theta+t) - \varphi(\theta) - t\varphi'(\theta) - \dots - \frac{t^{n-1}}{(n-1)!}\varphi^{(n-1)}(\theta)\right| \le \frac{|t|^n\nu_n}{n!}$$
Since $\nu_k \le (r+\epsilon)^k k^k$ for large $k$, and using the bound $e^k \ge k^k/k!$, we see that the above estimate implies that $\varphi$ converges uniformly to its Taylor series about any point $\theta$ on a neighborhood of fixed radius $|t| \le 1/er$. If $G$ is any other distribution with moments $\mu_k$, we know that $G$ and $F$ have the same Taylor series! But then $G$ and $F$ agree in a neighborhood of $0$ by the above characterization. By induction, we can repeatedly make the radius of the neighborhood bigger and bigger to see that the characteristic functions of $F$ and $G$ agree everywhere. But in this case $F$ must be equal to $G$ by the inversion formula.

The next theorem says that if the probability density goes to zero fast enough as $x \to \infty$, then again moments converging implies weak convergence.

Theorem. (2.2.9 from Tao) (Carleman continuity theorem) Let $X_n$ be a sequence of uniformly subgaussian real random variables (i.e. $\exists$ constants $C, c$ so that $P(|X_n| > \lambda) \le C\exp(-c\lambda^2)$ for all $n$ and all $\lambda$), and let $X$ be another subgaussian random variable. Then:
$$E(X_n^k) \to E(X^k) \text{ for all } k \iff X_n \Rightarrow X$$

Remark. The subgaussian hypothesis is key for both directions here.

Proof. ($\Leftarrow$) Fix a $k$. Take a smooth function $\varphi$ that is $1$ inside $[-1,1]$ and vanishes outside of $[-2,2]$; notice that $\varphi(\cdot/N)$ is then $1$ inside $[-N,N]$ and vanishes outside of $[-2N, 2N]$. Hence, for any $N$, the function $x \mapsto x^k\varphi(x/N)$ is bounded and continuous. Hence by weak convergence, we know that $E(X_n^k \varphi(X_n/N)) \to E(X^k\varphi(X/N))$ for any choice of $N$. On the other hand, from the uniformly subgaussian hypothesis, $E(X_n^k(1 - \varphi(X_n/N)))$ and $E(X^k(1 - \varphi(X/N)))$ can be made arbitrarily small by taking $N$ large enough, as:
$$\left|E\left(X_n^k(1 - \varphi(X_n/N))\right)\right| \le E\left(|X_n|^k 1_{\{|X_n|>N\}}\right) \lesssim \int_N^\infty \lambda^{k-1}\, C\exp(-c\lambda^2)\, d\lambda \to 0$$
uniformly in $n$. For any $\epsilon > 0$ choose $N$ so large that this is $< \epsilon$, and we will then have:
$$\left|E(X_n^k) - E(X^k)\right| \le \left|E(X_n^k\varphi(X_n/N)) - E(X^k\varphi(X/N))\right| + \left|E(X_n^k(1-\varphi(X_n/N)))\right| + \left|E(X^k(1-\varphi(X/N)))\right|$$
$$\le \left|E(X_n^k\varphi(X_n/N)) - E(X^k\varphi(X/N))\right| + 2\epsilon \to 2\epsilon$$

($\Rightarrow$) From the uniformly subgaussian hypothesis, we know that $E|X_n|^k \le (Ck)^{k/2}$, so by Taylor's theorem with remainder we know that:
$$F_{X_n}(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!} E\left(X_n^j\right) + O\left((Ck)^{k/2}\,\frac{|t|^{k+1}}{(k+1)!}\right)$$
where the error term is uniform in $t$ and in $n$. The same thing holds for $F_X$, so we have:
$$\limsup_{n\to\infty}\left|F_{X_n}(t) - F_X(t)\right| = O\left((Ck)^{k/2}\,\frac{|t|^{k+1}}{(k+1)!}\right) = O\left(\left(\frac{O(|t|)}{\sqrt{k}}\right)^k\right)$$
So now for any fixed $t$, taking $k \to \infty$ shows that $F_{X_n}(t) \to F_X(t)$, and we conclude by the Levy continuity theorem (the one that says pointwise convergence of the characteristic functions to a target function that is continuous at $0$ implies weak convergence).

Here is the Levy continuity theorem that was used above, for the case that you need a refresher:

Theorem. (Levy Continuity Theorem) Say $X_n$ is a sequence of RVs and $F_{X_n}$ converges pointwise to something, call it $F$. The following are equivalent:
i) $F$ is continuous at $0$
ii) $X_n$ is tight
iii) $F$ is the characteristic function of some R.V.
iv) $X_n$ converges in distribution to some R.V. $X$.
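To make condition (i) concrete, here is a tiny numerical sketch (assuming numpy; the variance schedule is an arbitrary choice of mine) of how tightness fails when the pointwise limit of the characteristic functions is discontinuous at $0$:

```python
# For X_n ~ N(0, n) the characteristic functions exp(-n t^2 / 2) converge
# pointwise to 1_{t=0}, which is NOT continuous at 0 -- matching the failure
# of tightness (the mass of X_n escapes to infinity).
import numpy as np

ts = np.array([0.0, 0.01, 0.1, 1.0])
for n in [1, 10, 100, 10000]:
    cf = np.exp(-n * ts**2 / 2)      # char. function of N(0, n) at each t
    print(n, np.round(cf, 4))        # tends to [1, 0, 0, 0] as n grows
```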
4.13. Poisson RVs

Put the characterization of Poisson RVs and the trick for looking at sums of indicator RVs from Remco's random graph notes here.

Example. One nice way to do this: use the formula that if $Y = \sum_i I_i$ is a sum of indicators, then
$$E[(Y)_r] = \sum_{i_1,\dots,i_r} P\left(I_{i_1} = \dots = I_{i_r} = 1\right)$$
where $(Y)_r = Y(Y-1)\cdots(Y-r+1)$ is the falling factorial and the sum is over distinct indices $i_1, \dots, i_r$.

Remark. This follows from the general properties of occupation numbers: the $k$-th factorial power of the occupation number of a set $A$ counts ordered $k$-tuples of distinct points, $(\mathrm{Occ}_A)_k = \mathrm{Occ}^{(k)}_{A^k}$. Any sum of indicator functions can be thought of as a Bernoulli point process with $\mathrm{Occ}_A = \sum_{i \in A} I_i$.

4.14. Central Limit Theorem

The standard way to prove the CLT is by Fourier analysis/characteristic functions. Here we are going to explore how the moment method can be used to prove the CLT, using combinatorial methods and other tricks (like truncation/concentration bounds) only. The proof is maybe not as easy as the Fourier methods, but it is a good example of the power of these other methods.

4.14.1. Convergence of Moments for Bounded Random Variables.

Proposition. For a $N(0,1)$ Gaussian random variable $Z$ we have:
$$E(Z^k) = 0 \text{ if } k \text{ is odd}, \qquad E(Z^k) = (k-1)!! \text{ if } k \text{ is even}$$
where $(k-1)!! = (k-1)(k-3)\cdots 3\cdot 1$.

Proof. The odd moments are $0$ since the distribution is symmetric. The even moments can be found by calculating and then differentiating the moment generating function of a Gaussian, or one can use integration by parts on $\int x^{2k}\exp(-x^2/2)\,dx$ to get a recurrence relation $E(Z^{2k}) = (2k-1)\,E(Z^{2k-2})$ for the even moments.

Remark. We will now strive to show that for iid $X_1, \dots, X_n$, the moments of $Z_n$ converge to those of a Gaussian. Let's look at the first few moments. We have by linearity of expectation: