<<

1

1 36-752: Lecture 1

2 How is this course different from your earlier probability courses? There are some prob- 3 lems that simply can’t be handled with finite-dimensional sample spaces and random vari- 4 ables that are either discrete or have densities.

5 Example 1. Try to express the strong law of large numbers without using an infinite- 6 dimensional space. Oddly enough, the weak law of large numbers requires only a sequence 7 of finite-dimensional spaces, but the strong law concerns entire infinite sequences.

8 Example 2. Consider a distribution whose cumulative distribution function (cdf) in- 9 creases continuously part of the time but has some jumps. Such a distribution is neither 10 discrete nor continuous. How do you define the mean of such a random variable? Is there 11 a way to treat such distributions together with discrete and continuous ones in a unified 12 manner?

13 General Measures. Both of the above examples are accommodated by a generaliza- 14 tion of the theories of summation and integration. Indeed, summation becomes a special 15 case of the more general theory of integration. It all begins with a generalization of the 16 concept of “size” of a .

17 Example 3. One way to the size of a set is to count its elements. All infinite 18 sets would have the same size (unless you distinguish different infinite cardinals).

19 Example 4. Special of Euclidean spaces can be measured by length, area, vol- 20 ume, etc. But what about sets with lots of holes in them? For example, how large is the set 21 of irrational numbers between 0 and 1?

22 We will use measures to say how large sets are. First, we have to decide which sets we 23 will measure.

24 Definition 5. Let Ω be a set. A collection of subsets of Ω is called a field if it F 25 satisfies

26 Ω , • ∈F C 27 for each A , A , • ∈F ∈F 28 for all A , A , A A . • 1 2 ∈F 1 ∪ 2 ∈F 29 A field is a σ-field if, in addition, it satisfies F

30 for every sequence A ∞ in , ∞ A . • { k}k=1 F k=1 k ∈F S 31 We will define measures on fields and σ-field’s. A set Ω together with a σ-field is called F 32 a (Ω, ), and the elements of are called measurable sets . F F 2

1 Example 6. Let Ω = IR and define to be the collection of all unions of finitely many U 2 disjoint intervals of the form (a, b] or( , b] or(a, )or ( , ), together with ∅. Then −∞ ∞ −∞ ∞ 3 is a field. U

4 Example 7. () Let Ω be an arbitrary set. The collection of all subsets of Ω 5 Ω is a σ-field. It is denoted 2 and is called the power set of Ω.

6 Example 8. (Trivial σ-field) Let Ω be an arbitrary set. Let = Ω, ∅ . This is F { } 7 the trivial σ-field.

8 Definition 9. The extended reals is the set of all real numbers together with and + ∞ 9 . We shall denote this set IR. The positive extended reals, denoted IR is (0, ], and the −∞ +0 ∞ 10 nonnegative extended reals, denoted IR is [0, ]. ∞ +0 11 Definition 10. Let (Ω, ) be a measurable space. Let µ : IR satisfy F F → 12 µ(∅)=0, •

13 for every sequence A ∞ of mutually disjoint elements of , µ( ∞ A )= ∞ µ(A ). • { k}k=1 F k=1 k k=1 k 14 Then µ is called a measure on (Ω, ) and (Ω, ,µ)isa measure spaceS. If is merelyP a field, F F F 15 then a µ that satisfies the above two conditions whenever ∞ A is called a measure k=1 k ∈F 16 on the field . F S

17 Example 11. Let Ω be arbitrary with the trivial σ-field. Define µ(∅) = 0 and F 18 µ(Ω) = c for arbitrary c> 0 (with c = possible). ∞ Ω 19 Example 12. (Counting measure) Let Ω be arbitrary and = 2 . For each finite F 20 A of Ω, define µ(A) to be the number of elements of A. Let µ(A)= for all infinite ∞ 21 subsets. This is called counting measure on Ω. 3

1 36-752: Lecture 2

2 For every collection of subsets of Ω, there is a smallest field containing and a smallest C C 3 σ-field containing . These are called the field generated by and the σ-fieldgenerated by C C 4 . Just check that the intersection of an arbitrary collection of fields is a field and the C 5 intersection of an arbitrary collection of σ-field’s is a σ-field. These collections are nonempty Ω 6 because 2 is always a σ-field that contains every collection of subsets of Ω. The σ-field 7 generated by is sometimes denoted σ( ). C C

8 Exercise 13. Let , ,... be classes of sets in a common space Ω such that F1 F2 Fn ⊂Fn+1 9 for each n. Show that if each is a field, then ∞ is also a field. Fn ∪n=1Fn 10 If each is a σ-field, then is ∞ also necessarily a σ-field? Think about the following Fn ∪n=1Fn 11 case: Ω is the set of nonnegative integers and σ( 0 , 1 ,..., n ). Fn ≡ {{ } { } { }}

12 Example 14. Let = A for some nonempty A that is not itself Ω. Then σ( ) = C C { } C 13 ∅, A, A , Ω . { }

14 Example 15. Let Ω = IR and let be the collection of all intervals of the form (a, b]. C 15 Then the field generated by is from Example 6 while σ( ) is larger. C U C

16 Example 16. (Borel σ-field) Let Ω be a topological space and let be the collection C 17 of open sets. Then σ( ) is called the Borel σ-field. If Ω = IR, the Borel σ-field is the same C k k 18 as σ( ) in Example 15. The Borel σ-field of subsets of IR is denoted . C B 1 19 Exercise 17. Give some examples of classes of sets such that σ( )= . C C B 1 20 Exercise 18. Are there subsets of IR which are not in ? B

21 Definition 19. Let (Ω, ,P ) be a measure space. If P (Ω) = 1, then P is called a F 22 probability, (Ω, ,P )is a probability space, and elements of are called events. F F

23 Sometimes, if the name of the probability P is understood or is not even mentioned, we 24 will denote P (E) by Pr(E) for events E. 25 Infinite measures pose a few unique problems. Some infinite measures are just like finite 26 ones.

27 Definition 20. Let (Ω, ,µ) be a measure space, and let . Suppose that there F C⊆F 28 exists a sequence A ∞ of elements of such that µ(A ) < for all n and Ω = ∞ A . { n}n=1 C n ∞ n=1 n 29 Then we say that µ is σ-finite on . If µ is σ-finite on , we merely say that µ is σ-finite. C F S Ω 30 Example 21. Let Ω = ZZ with =2 and µ being counting measure. This measure is F 31 σ-finite. Counting measure on an uncountable space is not σ-finite.

32 Exercise 22. Prove the claims in Example 21. 4

1 Properties of Measures. There are several useful properties of measures that are 2 worth knowing. 3 First, measures are countably subadditive in the sense that

∞ ∞ (23) µ An µ(An), 4 ≤ n=1 ! n=1 [ X

5 for arbitrary sequences An n∞=1. The proof of this uses a standard trick for dealing with { } n 1 6 countable sequences of sets. Let B = A and let B = A − B for n > 1. The B ’s 1 1 n n \ i=1 i n 7 are disjoint and have the same finite and countable unions as the An’s. The proof of (23) S 8 relies on the additional fact that µ(B ) µ(A ) for all n. n ≤ n 9 Next, if µ(An) = 0 for all n, it follows that µ ( n∞=1 An) = 0. This gets used a lot in 10 proofs. Similarly, if µ is a probability and µ(An)=1 for all n, then µ ( ∞ An)=1. S n=1 C 11 Definition 24. Suppose that some statement about elements of ΩT holds for all ω A ∈ 12 where µ(A) = 0. Then we say that the statement holds almost everywhere, denoted a.e. [µ]. 13 If P is a probability, then almost everywhere is often replaced by almost surely, denoted a.s. 14 [P ].

15 Example 25. Let (Ω, ,P ) be a probability space. Let Xn n∞=1 be a sequence of func- F { } a.s. 16 tions from Ω to IR. To say that Xn converges to X a.s. [P ] (denoted Xn X) means that C → 17 there is a set A with P (A) = 0 and P ( ω A : limn Xn(ω)= X(ω) )=1. { ∈ →∞ }

18 Proposition 26. If µ ,µ ,... are all measures on (Ω, ) and if a ∞ is a sequence 1 2 F { n}n=1 19 of positive numbers, then ∞ a µ is a measure on (Ω, ). n=1 n n F

20 Exercise 27. Prove PropositionP 26.

21 Definition 28. Define the indicator function IA : Ω 0, 1 for the set A Ω as C → { } ⊆ 22 I (ω)=1 if ω A and I (ω)=0if ω A . A ∈ A ∈

23 Definition 29. Let (Ω, ,µ) be a measure space. A sequence A ∞ of elements F { n}n=1 24 of is called monotone increasing if A A for each n. It is monotone decreasing if F n ⊆ n+1 25 A A for each n. For a general sequence, we define n ⊇ n+1 ∞ ∞ lim sup An = An, 26 n i=1 n=i →∞ \ [ ∞ ∞ lim inf An = An. 27 n →∞ i=1 n=i [ \

28 If lim supn An = lim infn An, the common set is called limn An. The set lim supn An →∞ →∞ →∞ →∞ 29 is often called An infinitely often or An i.o. because a point ω is in that set if and only if ω 30 is in infinitely many of the An sets. The set lim infn An is often called An all but finitely →∞ 31 often or An eventually (An ev.). This set has all those ω such that ω is in all of the An except 32 possibly finitely many of the An, i.e., eventually. 5

1 Exercise 30. What is the relationship between the definition of the lim sup and lim inf 2 of a sequence of reals x ∞ and this definition of the lim sup and lim inf of a sequence of { n}n=1 3 sets?

4 Exercise 31. Define A to be the set ( 1/n, 1] if n is odd, and to be ( 1, 1/n] if n is n − − 5 even. What are lim supn An and lim infn An? →∞ →∞

6 It is easy to establish some simple facts about these limiting sets.

7 Proposition 32. Let A ∞ be a sequence of sets. { n}n=1

8 lim infn An = lim supn An, if and only if, for each ω, limn IAn (ω) exists. • →∞ →∞ →∞

9 If the sequence is monotone increasing, then limn An = ∞ An. • →∞ n=1

10 If the sequence is monotone decreasing, then limn An = S∞ An. • →∞ n=1

11 Exercise 33. Prove Proposition 32. T 6

1 36-752: Lecture 3

2 Lemma 34. Let (Ω, ,µ) be a measure space. Let A ∞ be a monotone sequence of F { n}n=1 3 elements of . Then limn µ(An)= µ (limn An) if either of the following hold: F →∞ →∞ 4 the sequence is increasing, • 5 the sequence is decreasing and µ(A ) < for some k. • k ∞

6 Proof. Define A = limn An. In the first case, write B1 = A1 and Bn = An An 1 ∞ n →∞ n \ − 7 for n > 1. Then A = B for all n (including n = ). Then µ(A ) = µ(B ), n k=1 k ∞ n k=1 k 8 and n S ∞ P µ lim An = µ(A )= µ(Bk) = lim µ(Bk) = lim µ(An). 9 n ∞ n n →∞ →∞ →∞   Xk=1 Xk=1 10 In the second case, write B = A A for all n k. Then, for all n>k, n n \ n+1 ≥ n 1 − Ak An = Bi, 11 \ i[=k ∞ Ak A = Bi. 12 \ ∞ i[=k 13 By the first case, ∞ lim µ(Ak An)= µ Bi = µ(Ak A ). 14 n ∞ →∞ \ ! \ i[=k 15 Because An Ak for all n>k and A Ak, it follows that ⊆ ∞ ⊆ 16 µ(A A ) = µ(A ) µ(A ), k \ n k − n 17 µ(Ak A ) = µ(Ak) µ(A ). \ ∞ − ∞

18 It now follows that limn µ(An)= µ(A ).  →∞ ∞

19 Exercise 35. Construct a simple counterexample to show that the condition µ(Ak) < 20 is required in the second claim of Lemma 34. ∞

21 Uniqueness of Measures. There is a popular method for proving uniqueness theo- 22 rems about measures. The idea is to define a function µ on a convenient class of sets and C 23 then prove that there can be at most one extension of µ to σ( ). C

24 Example 36. Suppose it is given that for any a IR, ∈ a 1 P (( , a]) = exp u2/2 du. 25 −∞ √2π − Z−∞  1 26 Does that uniquely define a probability measure on the class of Borel subsets of the line, ? B 7

1 Definition 37. A collection of subsets of Ω is a π-system if, for all A , A , A 1 2 ∈ A 2 A A . A class is a λ-system if 1 ∩ 2 ∈A C 3 Ω , • ∈ C C 4 for each A , A , • ∈ C ∈ C

5 for each sequence A ∞ of disjoint elements of , ∞ A . • { n}n=1 C n=1 n ∈ C S 6 Example 38. The collection of all intervals of the form ( , a]isa π-system of subsets −∞ 7 of IR. So too is the collection of all intervals of the form (a, b] (together with ∅). The 2 8 collection of all sets of the form (x, y) : x a, y b is a π-system of subsets of IR . So { ≤ ≤ } 9 too is the collection of all rectangles with sides parallel to the coordinate axes.

10 Some simple results about π-systems and λ-systems are the following.

11 Proposition 39. If Ω is a set and is both a π-system and a λ-system, then is a C C 12 σ-field.

13 Proposition 40. Let Ω be a set and let Λ be a λ-system of subsets. If A Λ and C ∈ 14 A B Λ then A B Λ. ∩ ∈ ∩ ∈

15 Exercise 41. Prove Propositions 39 and 40.

16 Lemma 42. (π λ theorem) Let Ω be a set and let Π be a π-system and let Λ be a − 17 λ-system that contains Π. Then σ(Π) Λ. ⊆

18 Proof. Define λ(Π) to be the smallest λ-system containing Π. For each A Ω, define ⊆ 19 to be the collection of all sets B Ω such that A B λ(Π). GA ⊆ ∩ ∈ 20 First, we show that A is a λ-system for each A λ(Π). To see this, note that A Ω G ∈ C ∩ ∈ 21 λ(Π), so Ω A. If B A, then A B λ(Π), and Proposition 40 says that A B λ(Π), C ∈ G ∈ G ∩ ∈ ∩ ∈ 22 so B . Finally, B ∞ with the B disjoint implies that A B λ(Π) with ∈ GA { n}n=1 ∈ GA n ∩ n ∈ 23 A B disjoint, so their union is in λ(Π). But their union is A ( ∞ B ). So ∞ B . ∩ n ∩ n=1 n n=1 n ∈ GA 24 Next, we show that λ(Π) for every C λ(Π). Let A, B Π, and notice that ⊆ GC ∈ S ∈ S 25 A B Π, so B . Since is a λ-system containing Π, it must contain λ(Π). It follows ∩ ∈ ∈ GA GA 26 that A C λ(Π) for all C λ(Π). If C λ(Π), it then follows that A . So, Π ∩ ∈ ∈ ∈ ∈ GC ⊆ GC 27 for all C λ(Π). Since is a λ-system containing Π, it must contain λ(Π). ∈ GC 28 Finally, if A, B λ(Π), we just proved that B , so A B λ(Π) and hence λ(Π) is ∈ ∈ GA ∩ ∈ 29 also a π-system. By Proposition 39, λ(Π) is a σ-field containing Π and hence must contain 30 σ(Π). Since λ(Π) Λ, the proof is complete.  ⊆ 31 The uniqueness theorem is the following.

32 Theorem 43. Suppose that µ and µ are measures on (Ω, ) and is the smallest 1 2 F F 33 σ-field containing the π-system Π. If µ1 and µ2 are both σ-finite on Π and they agree on Π, 34 then they agree on . F 8

1 Proof. First, let C Π be such that µ (C) = µ (C) < , and define to be the ∈ 1 2 ∞ GC 2 collection of all B such that µ (B C) = µ (B C). It is easy to see that is a ∈ F 1 ∩ 2 ∩ GC 3 λ-system that contains Π, hence it equals by Lemma 42. (For example, if B , F ∈ GC C C 4 µ (B C)= µ (C) µ (B C)= µ (C) µ (B C)= µ (B C), 1 ∩ 1 − 1 ∩ 2 − 2 ∩ 2 ∩ C 5 so B .) ∈ GC 6 Since µ and µ are σ-finite, there exists a sequence C ∞ Π such that µ (C ) = 1 2 { n}n=1 ∈ 1 n 7 µ (C ) < , andΩ= ∞ C . (Since Π is only a π-system, we cannot assume that the C 2 n ∞ n=1 n n 8 are disjoint.) For each A , S ∈F n

µj(A) = lim µj [Ci A] for j =1, 2. 9 n ∩ →∞ i=1 ! [ n 10 Since µ ( [C A]) can be written as a linear combination of values of µ at sets of the j i=1 i ∩ j 11 form A C, where C Π is the intersection of finitely many of C1,...,Cn, it follows from ∩S n ∈ n 12 A that µ ( [C A]) = µ ( [C A]) for all n, hence µ (A)= µ (A).  ∈ GC 1 i=1 i ∩ 2 i=1 i ∩ 1 2

13 Exercise 44.S Return to ExampleS 36. You should now be able to answer the question 14 posed there.

15 Exercise 45. Suppose that Ω = a, b, c, d, e and I tell you the value of P ( a, b ) and { } { } 16 P ( b, c ). For which subset of Ω do I need to define P ( ) in order to have a unique extension { } · 17 of P to a σ-field of subsets of Ω? 9

1 36-752: Lecture 4

2 Measures Based on Increasing Functions. Let F be a cdf (nondecreasing, right- 3 continuous, limits equal 0 and 1 at and respectively). Let be the field in Example 6. −∞n ∞ U n 4 Define µ : [0, 1] by µ(A) = F (b ) F (a ) when A = (a , b ] and (a , b ] U → k=1 k − k k=1 k k { k k } 5 are disjoint. This set-function is well-defined and finitely additive. To see that it is well- P m S 6 defined, look at an alternative representation as µ(A) = F (d ) F (c ). Consider the j=1 j − j 7 partition of A into the refinement of the two partitions given. The sum over the refinement P 8 is the same as both of the two sums we started with. Is µ countably additive as probabilities 9 are supposed to be? That is, if A = i∞=1 Ai where the Ai’s are disjoint, each Ai is a union 10 of finitely many disjoint intervals, and A itself is the union of finitely many disjoint intervals 11 S (ak, bk] for k =1,...,n, does µ(A)= i∞=1 µ(Ai)? First, take the collection of intervals that 12 go into all of the Ai’s and split them, if necessary, so that each is a subset of at most one of P 13 the (ak, bk] intervals. Then apply the following result to each (ak, bk].

14 Lemma 46. Let (a, b] = ∞ (c ,d ] with the (c ,d ]’s disjoint. Then F (b) F (a) = k=1 k k k k − 15 ∞ F (dk) F (ck). k=1 − S n n 16 P Proof. Since (a, b] (c ,d ] for all n, it follows that F (b) F (a) F (d ) ⊇ k=1 k k − ≥ k=1 k − 17 F (ck), hence F (b) F (a) ∞ F (dk) F (ck). We need to prove the opposite inequality. − ≥S k=1 − P 18 Suppose first that both a and b are finite. Let ǫ > 0. For each k, there is ek > dk such P 19 that ǫ 20 F (d ) F (e ) F (d )+ . k ≤ k ≤ k 2k 21 Also, there is f > a such that F (a) F (f) ǫ. Now, the interval [f, b] is compact and ≥ − 22 [f, b] k∞=1(ck, ek). So there are finitely many (ck, ek)’s (suppose they are the first n) such ⊆ n 23 that [f, b] (ck, ek). Now, S ⊆ k=1 S n n F (b) F (a) F (b) F (f)+ ǫ ǫ + F (ek) F (ck) 2ǫ + F (dk) F (ck). 24 − ≤ − ≤ − ≤ − Xk=1 Xk=1 25 It follows that F (b) F (a) 2ǫ + ∞ F (d ) F (c ). Since this is true for all ǫ> 0, it is − ≤ k=1 k − k 26 true for ǫ = 0. P 27 If = a be such that F (g) < ǫ. The above argument shows −∞ ∞ −∞ 28 that ∞ ∞ F (b) F (g) F (dk g) F (ck g) F (dk) F (ck). 29 − ≤ ∨ − ∨ ≤ − Xk=1 Xk=1 30 Since limg F (g) = 0, it follows that F (b) ∞ F (dk) F (ck). Similar arguments →−∞ ≤ k=1 − 31 work when a < b = and = a < b = .  ∞ −∞ ∞ P 32 In Lemma 46 you can replace F by an arbitrary nondecreasing right-continuous function 33 with only a bit more effort. (See the supplement following at the end of this lecture.) 34 The function µ defined in terms of a nondecreasing right-continuous function is a measure 35 on the field . There is an extension theorem that gives conditions under which a measure U 36 on a field can be extended to a measure on the generated σ-field. Furthermore, the extension 37 is unique. 10

1 Example 47. (Lebesgue measure) Start with the function F (x)= x, form the mea- 2 sure µ on the field and extend it to the Borel σ-field. The result is called Lebesgue measure, U 3 and it extends the concept of “length” from intervals to more general sets.

4 Example 48. Every distribution function for a random variable has a corresponding 5 probability measure on the real line.

6 Another concept that is occasionally useful is that of a complete measure space.

7 Definition 49. A measure space (Ω, ,µ) is complete if, for every A such that F ∈ F 8 µ(A) = 0 and every B A, B . ⊆ ∈F

9 Theorem 50. (Caratheodory extension) Let µ be a σ-finite measure on the field 10 of subsets of Ω. There exists a σ-field that contains and a unique extension µ∗ of µ C A C 11 to a measure on (Ω, ). Furthermore (Ω, ,µ∗) is a complete measure space. A A

12 Exercise 51. In this exercise, we prove Theorem 50. Ω 13 First, for each B 2 , define ∈

∞ (52) µ∗(B) = inf µ(Ai), 14 i=1 X

15 where the inf is taken over all A ∞ such that B ∞ A and A for all i. Since { i}i=1 ⊆ i=1 i i ∈ C C 16 is a field, we can assume that the Ai’s are mutually disjoint without changing the value of S 17 µ∗(B). Let

Ω C Ω 18 = B 2 : µ∗(C)= µ∗(C B)+ µ∗(C B ), for all C 2 . A { ∈ ∩ ∩ ∈ }

19 Now take the following steps:

20 1. Show that µ∗ extends µ, i.e. that µ∗(A)= µ(A) for each A . ∈ C

21 2. Show that µ∗ is monotone and subadditive.

22 3. Show that . C⊆A 23 4. Show that is a field. A

24 5. Show that µ∗ is finitely additive on . A 25 6. Show that is a σ-field. A

26 7. Show that µ∗ is countably additive on . A

27 8. Show that µ∗ is the unique extension of µ to a measure of (Ω, ). A

28 9. Show that (Ω, ,µ∗) is a complete measure space. A 11

1 Supplement: Measures from Increasing Functions

2 Lemma 46 deals only with functions F that are cdf’s. Suppose that F is an unbounded 3 nondecreasing function that is continuous from the right. If . Then there must ∞ k −∞ k −∞ 7 P be a subsequence kj j∞=1 such that limj ckj = . For each j, let (cj,n′ ,dj,n′ ] n∞=1 be the { } →∞ −∞ { } 8 subsequence of intervals that cover (ckj , b]. For each j, the proof of Lemma 46 applies to 9 show that ∞ (53) F (b) F (ckj )= F (dj,n′ ) F (cj,n′ ). 10 − − n=1 X 11 As j , the left side of (53) goes to while the right side eventually includes every → ∞ ∞ 12 interval in the original collection. 13 A similar proof works for an interval of the form (a, ) when limx F (x) = . A ∞ →∞ ∞ 14 combination of the two works for ( , ). −∞ ∞ 12

1 36-752: Lecture 5

2 Measurable Functions. Measurable functions are the types of functions that we can 3 integrate with respect to measures in much the same way that continuous functions can 4 be integrated “dx”. Recall that the Riemann integral of a continuous function f over a 5 bounded interval is defined as a limit of sums of lengths of subintervals times values of f 6 on the subintervals. The measure of a set generalizes the length while elements of the σ- 7 field generalize the intervals. Recall that a real-valued function is continuous if and only if 8 the inverse image of every open set is open. This generalizes to the inverse image of every 9 measurable set being measurable.

10 Definition 54. Let (Ω, ) and (S, ) be measurable spaces. Let f : Ω S be a 1 F A → 11 function that satisfies f − (A) for each A . Then we say that f is / -measurable. ∈F ∈A F A 12 If the σ-field’s are to be understood from context, we simply say that f is measurable.

Ω 13 Example 55. Let = 2 . Then every function from Ω to a set S is measurable no F 14 matter what is. A

15 Example 56. Let = ∅,S . Then every function from a set Ω to S is measurable, A { } 16 no matter what is. F

17 Proving that a function is measurable is facilitated by noticing that inverse image com- 1 C 1 C 18 mutes with union, complement, and intersection. That is, f − (A ) = [f − (A)] for all A, 19 and for arbitrary collections of sets Aα α , { } ∈ℵ

1 1 f − Aα = f − (Aα), 20 α ! α [∈ℵ [∈ℵ 1 1 f − Aα = f − (Aα). 21 α ! α \∈ℵ \∈ℵ

22 Exercise 57. Is the inverse image of a σ-field is a σ-field? That is, if f : Ω S and if 1 → 23 is a σ-field of subsets of S, then f − ( ) is a σ-field of subsets of Ω. A A

24 Proposition 58. If f is a continuous function from one topological space to another 25 (each with Borel σ-field’s) then f is measurable.

26 The proof of this makes use of Lemma 60.

1 27 Definition 59. Let f : Ω S, where (S, ) is a measurable space. The σ-field f − ( ) → A 1 A 28 is called the σ-field generated by f. The σ-field f − ( ) is sometimes denoted σ(f). A 1 29 It is easy to see that f − ( ) is the smallest σ-field such that f is / -measurable. We A C C A 30 can now prove the following helpful result. 13

1 Lemma 60. Let (Ω, ) and (S, ) be measurable spaces and let f : Ω S. Suppose that F A 1 → 2 is a collection of sets that generates . Then f is measurable if f − ( ) . C A C ⊆F 3 Exercise 61. Prove Proposition 60.

4 Definition 62. If (Ω, ,P ) is a probability space and X : Ω IR is measurable, then F → 5 X is called a random variable. In general, if X : Ω S, where (S, ) is a measurable space, → A 6 we call X a random quantity.

7 Exercise 63. Prove the following. Let S = IR in Lemma 60. Let D be a dense subset 8 of IR, and let be the collection of all intervals of the form ( , a), for a D. To prove C −∞ ∈ 9 that a real-valued function is measurable, one need only show that ω : f(ω) < a for { }∈F 10 all a D. Similarly, we can replace < a by > a or a or a. ∈ ≤ ≥ 11 Exercise 64. Show that a monotone increasing function is measurable.

12 Example 65. Suppose that f : Ω IR takes values in the extended reals. Then 1 1 C → 13 f − ( , ) = [f − (( , ))] . Also {−∞ ∞} −∞ ∞ ∞ 1 f − ( )= ω : f(ω) > n , 14 {∞} { } n=1 \ 15 and similarly for . In order to check whether f is measurable, we need to see that the −∞ 16 inverse images of all semi-infinite intervals are measurable sets. If we include the infinite 17 endpoint in these intervals, then we don’t need to check anything else. If we don’t include 18 the infinite endpoint, and if both infinite values are possible, then we need to check that at 19 least one of or has measurable inverse image. {∞} {−∞} 20 Definition 66. A measurable function that takes at most finitely many values is called 21 a simple function.

22 Example 67. Let (Ω, ) be a measurable space and let A1,...,An be disjoint elements F n 23 of , and let a1,...,an be real numbers. Then f = i=1 aiIAi defines a simple function F 1 24 since f − (( , a)) is a union of at most finitely many measurable sets. −∞ P 25 Definition 68. Let f be a simple function whose distinct values are a1,...,an, and let n 26 A = ω : f(ω)= a . Then f = a I is called the canonical representation of f. i { i} i=1 i Ai 27 Lemma 69. Let f be a nonnegativeP measurable extended real-valued function from Ω. 28 Then there exists a sequence f ∞ of nonnegative (finite) simple functions such that f f { n}n=1 n ≤ 29 for all n and limn fn(ω)= f(ω) for all ω. →∞ 1 2 30 Proof. For each n, define An,k = f − ((k/n, (k +1)/n]) for k =1,...,n 1 and An,0 = 2 1 1 1 n 1 − 31 f − ([0, 1/n] (n, )). Let A = f − ( ). Define fn(ω)= − kIA (ω)+ nA . The ∪ ∞ ∞ {∞} n k=0 n,k ∞ 32 proof is easy to complete now.  P 33 Lemma 69 says that each nonnegative measurable function f can be approximated ar- 34 bitrarily closely from below by simple functions. It is easy to see that if f is bounded the 35 approximation is uniform once n is greater than the bound. 36 Many theorems about real-valued functions are easier to prove for nonnegative measurable 37 functions. This leads to the common device of splitting a measurable function f as follows. 14

+ 1 Definition 70. Let f be a real-valued function. The positive part f of f is defined as + 2 f (ω) = max f(ω), 0 . The negative part f − of f is f −(ω)= min f(ω), 0 . { } − { } 3 Notice that both the positive and negative parts of a function are nonnegative. It follows + 4 easily that f = f f −. It is easy to prove that the positive and negative parts of a − 5 measurable function are measurable. 6 Here are some simple properties of measurable functions.

7 Theorem 71. Let (Ω, ), (S, ), and (T, ) be measurable spaces. F A B 8 1. If f is an extended real-valued measurable function and a is a constant, then af is 9 measurable.

10 2. If f : Ω S and g : S T are measurable, then g(f) : Ω T is measurable. → → → 11 3. If f and g are measurable real-valued functions, then f + g and fg are measurable.

12 Proof. For a = 0, part 1 is trivial. Assume a = 0. Because 6 ω : f(ω) < c/a if a> 0, ω : af(ω) c/a if a< 0,  { } 14 we see that af is measurable. 1 1 1 15 For part 2, just notice that [g(f)]− (B)= f − (g− (B)). 2 16 For part 3, let h : IR IR be defined by h(x, y) = x + y. This function is continuous, → 17 hence measurable. Then f + g = h(f,g). We now show that (f,g) : Ω IR is measurable, → 18 where (f,g)(ω)=(f(ω),g(ω)). To see that (f,g) is measurable, look at inverse images of 2 19 sets that generate , namely sets of the form ( , a] ( , b], and apply Lemma 60. We B −∞ × −∞ 20 see that 1 1 1 21 (f,g)− (( , a] ( , b]) = f − (( , a]) g− (( , b]), −∞ × −∞ −∞ ∩ −∞ 22 which is measurable. Hence, (f,g) is measurable and h(f,g) is measurable by part 2. Simi- 23 larly fg is measurable.  24 You can also prove that f/g is measurable when the ratio is defined to be an arbitrary 25 constant when g = 0. Similarly, part 3 can be extended to extended real-valued functions 26 so long as care is taken to handle cases of and 0. ∞−∞ ∞ × 15

1 36-752: Lecture 6

2 Theorem 72. Let f : Ω IR be measurable for all n. Then the following are measur- n → 3 able:

4 1. lim supn fn, →∞

5 2. lim infn fn, →∞

6 3. ω : limn fn exists . { →∞ }

limn fn where the limit exists, 4. f = →∞ 7 0 elsewhere. 

8 Exercise 73. Prove Lemma 72.

9 Random Variables and Induced Measures.

10 Example 74. Let Ω = (0, 1) with the Borel σ-field, and let µ be Lebesgue measure, 11 a probability. Let Z0(ω) = ω. For n 1, define Xn(ω) = 2Zn 1(ω) and Zn(ω) = ≥ ⌊ − ⌋ 12 2Zn 1(ω) Xn(ω). All Xn’s and Zn’s are random variables. Each Xn takes only two values, − − 13 0 and 1. It is easy to see that µ( ω : X (ω)=1 )=1/2. It is also easy to see that { n } 14 µ( ω : Z (ω) c )= c for 0 c 1. { n ≤ } ≤ ≤ 15 Each measurable function from a measure space to another measurable space induces a 16 measure on its range space.

17 Lemma 75. Let (Ω, ,µ) be a measure space and let (S, ) be a measurable space. Let F A 18 f : Ω S be a measurable function. Then f induces a measure on (S, ) defined by → 1 A 19 ν(A)= µ(f − (A)) for each A . ∈A

20 Proof. Clearly, ν 0 and ν(∅)=0. Let A ∞ be a sequence of disjoint elements of . ≥ { n}n=1 A 21 Then

∞ ∞ 1 ν An = µ f − An 22 n=1 ! "n=1 #! [ [ ∞ 1 = µ f − [An] 23 n=1 ! [ ∞ 1 = µ(f − [An]) 24 n=1 X ∞ = ν(An).  25 n=1 X 26 The measure ν in Lemma 75 is called the measure induced on (S, ) from µ by f. This A 27 measure is only interesting in special cases. First, if µ is a probability then so is ν. 16

1 Definition 76. Let (Ω, ,P ) be a probability space and let (S, ) be a measurable F A 2 space. Let X : Ω S be a random quantity. Then the measure induced on (S, ) from P → A 3 by X is called the distribution of X.

4 We typically denote the distribution of X by µX . In this case, µX is a measure on the space 5 (S, ). A 6 Example 77. Consider the random variables in Example 74. The distribution of each 7 Xn is the Bernoulli distribution with parameter 1/2. The distribution of each Zn is the 8 uniform distribution on the interval (0, 1). These were each computed in Example 74.

9 If µ is infinite and f is not one-to-one, then the induced measure may be of no interest 10 at all.

11 Exercise 78. Either prove or create a counterexample to the following conjecture: If µ 12 is a σ-finite on some measurable space (Ω, ), then for any measurable function f from Ω F 13 to S, the induced measure is also σ-finite. k 14 Example 79. (Jacobians) If Ω = S = IR and f is one-to-one with a differentiable 15 inverse, then ν is the measure you get from the usual change-of-variables formula using 16 Jacobians.

17 We have just seen how to construct the distribution from a random variable. Oddly 18 enough, the opposite construction is also available. First notice that every probability ν on 1 19 (IR, ) has a distribution function F defined by F (x)= ν(( , x]). Now, we can construct B −∞ 1 1 20 a probability space (Ω, ,P ) and a random variable X : Ω IR such that ν = P (X− ). F 1 → 21 Indeed, just let Ω = IR, = , P = ν, and X(ω)= ω. F B

22 Integration. Let (Ω, ,µ) be a measure space. The definition of integral is done in F 23 three stages. We start with simple functions. +0 24 Definition 80. Let f : Ω IR be a simple function with canonical representation n → n 25 f(ω)= i=1 aiIAi (ω) The integral of f with respect to µ is defined to be i=1 aiµ(Ai). The 26 integral is denoted variously as fdµ, f(ω)µ(dω), or f(ω)dµ(ω). P P 27 The values are allowed for an integral. ±∞ R R R 28 We use the following convention whenever necessary in defining an integral: 0 =0. ±∞× 29 This applies to both the case when the function is 0 on a set of infinite measure and when 30 the function is infinite on a set of 0 measure.

31 Proposition 81. If f g and both are nonegative and simple, then fdµ gdµ. ≤ ≤ 32 Definition 82. We say that f is integrable with respect to µ if fdµRis finite.R 33 Example 83. A real-valued simple function is always integrableR with respect to a finite 34 measure.

1Notation: When X is a random quantity and B is a set in the space where X takes its values, we use the following two symbols interchangeably: X−1(B) and X B. Both of these stand for ω : X(ω) B . Finally, for all B, ∈ { ∈ } −1 µX (B) = Pr(X B)= P (X (B)). ∈ 17

1 36-752: Lecture 7

2 The second step in the definition of integral is to consider nonnegative measurable func- 3 tions.

4 Definition 84. For nonnegative measurable f, define the integral of f with respect to 5 µ by fdµ = sup gdµ. 6 nonnegative finite simple g f Z ≤ Z

7 That is, if f is nonnegative and measurable, fdµ is the least upper bound (possibly infinite) 8 of the integrals of nonnegative finite simple functions g f. Proposition 81 helps to show R ≤ 9 that Definition 80 is a special case of Definition 84, so the two definitions do not conflict 10 when they both apply. 11 Finally, for arbitrary measurable f, we first split f into its positive and negative parts, + 12 f = f f −. − + 13 Definition 85. Let f be measurable. If either f or f − is integrable with respect to µ, + 14 we define the integral of f with respect to µ to be f dµ f −dµ, otherwise the integral − 15 does not exist. R R 16 It is easy to see that Definition 84 is a special case of Definition 85, so the two definitions 17 do not conflict when they both apply. The reason for splitting things up this way is to avoid 18 ever having to deal with . ∞−∞ 19 One unfortunate consequence of this three-part definition is that many theorems about 20 integrals must be proven in three steps. One fortunate consequence is that, for most of these 21 theorems, at least some of the three steps are relatively straightforward.

22 Definition 86. If A , we define fdµ by I fdµ. ∈F A A 23 Proposition 87. If f g and both integralsR areR defined, then fdµ gdµ. ≤ ≤ 24 Example 88. Let µ be counting measure on a set Ω. (ThisR measureR is not σ-finite 25 unless Ω is countable.) If A Ω, then µ(A)=#(A), the number of elements in A. If f is a ⊆ n 26 nonnegative simple function, f = i=1 aiIAi , then P n fdµ = ai#(Ai)= f(ω). 27 i=1 Z X AllXω 28 It is not difficult to see that the equality of the first and last terms above continues to hold 29 for all nonnegative functions, and hence for all integrable functions.

30 Before we study integration in detail, we should note that integration with respect to 31 Lebesgue measure is the same as the Riemann integral in many cases.

32 Theorem 89. Let f be a continuous function on a closed bounded interval [a, b]. Let µ b 33 be Lebesgue measure. Then the Riemann integral a f(x)dx equals [a,b] fdµ. R R 18

1 Exercise 90. Prove Theorem 89.

2 Example 91. A case in which the Riemann integral differs from the Lebesgue integral 3 is that of “improper” Riemann integrals. These are defined as limits of Riemann integrals 4 that are each defined in the usual way. For example, integrals of unbounded functions and 5 integrals over unbounded regions cannot be defined in the usual way because the Riemann 6 sums would always be or undefined. Consider the function f(x) = sin(x)/x over the ∞ + 7 interval [1, ). It is not difficult to see that neither f nor f − is integrable with respect to ∞ 8 Lebesgue measure. Hence, the integral that we have defined here does not exist. However, T 9 the improper Riemann integral is defined as limT f(x)dx, if the limit exists. In this →∞ 1 10 case, the limit exists. R 11 Some simple properties of integrals include the following:

12 For c a constant, cfdµ = c fdµ if the latter exists. • 13 If f 0, then fdµR 0. R • ≥ ≥ 1 14 If f is extendedR real-valued, then fdµ < only if µ(f − ( )) = 0. • ∞ {±∞} 15 if f = g a.e. [µ] and if either fdµ R or gdµ exists, then so does the other, and they • 16 are equal. Similarly, if one of the integrals doesn’t exist, then neither does the other. R R 17 Definition 92. If P is a probability and X is a random variable, then XdP is called 18 the mean of X, expected value of X, or expectation of X and denoted E(X). If E(X)= µ is 2 R 19 finite, then the variance of X is Var(X) = E[(X µ) ]. − 20 The mean and variance of a random variable have an interesting relation to the tail of 21 the distribution.

22 Proposition 93. (Markov inequality) Let X be a nonnegative random variable. 23 Then Pr(X c) E(X)/c. ≥ ≤ 24 There is also a famous corollary.

25 Corollary 94. (Tchebychev inequality) Let X have finite mean µ. Then Pr( X 2 | − 26 µ c) Var(X)/c . | ≥ ≤ 27 Exercise 95. Show that there is some random variable X for which Pr( X µ 2 | − | ≥ 28 c) = Var(X)/c . Thus, without additional assumptions, Tchebychev’s inequality cannot be 29 improved.

30 We would like to be able to prove that (f + g)dµ = fdµ + gdµ whenever at least 31 two of them are finite. We could prove this for nonnegative simple functions now, but not R R R 32 in general.

33 Proposition 96. Let f and g be nonnegative simple functions defined on a measure 34 space (Ω, ,µ). Then (f + g)dµ = fdµ + gdµ. F 35 The proof for generalR functions requiresR someR limit theorems first. 19

1 36-752: Lecture 8

2 One of the famous limit theorems is the following.

3 Theorem 97. (Fatou’s lemma) Let f ∞ be a sequence of nonnegative measurable { n}n=1 4 functions. Then

lim inf fndµ lim inf fndµ. 5 n ≤ n Z Z

6 The proof of Theorem 97 is given in a separate document. Here is an outline of the proof. 7 Let f = lim inf f , and let φ be an arbitrary nonnegative simple function such that φ f. n n ≤ 8 We need to show that φdµ lim inf f dµ. The set ω : φ(ω) > 0 can be written as ≤ n n { } 9 the union of the sets R R

10 A = ω : f (ω) > (1 ǫ)φ(ω), for all k n . n { k − ≥ }

11 For each n, fndµ (1 ǫ) φdµ. The liminf of the right sides can be shown to equal ≥ − An 12 (1 ǫ) φdµ. Since lim inf f dµ (1 ǫ) φdµ for all ǫ> 0, we have what we need. − R n R n ≥ − 13 The first of the two most useful limit theorems is the following. R R R

14 Theorem 98. (Monotone convergence theorem) Let f ∞ be a sequence of { n}n=1 15 measurable nonnegative functions, and let f be a measurable function such that f f a.e. n ≤ 16 [µ] and limn fn = f(x) a.e. [µ]. Then, →∞

lim fndµ = fdµ. 17 n →∞ Z Z

18 Proof. Since f f for all n, f dµ fdµ for all n. Hence n ≤ n ≤ R R

19 lim inf fndµ lim sup fndµ fdµ. n ≤ n ≤ →∞ Z →∞ Z Z

20 By Fatou’s lemma, fdµ lim infn fndµ.  ≤ →∞ R R 21 Exercise 99. Why is it called the “monotone” convergence theorem?

22 We are now in a position to prove that the integral of the sum is the sum of the integrals.

23 Theorem 100. If fdµ and gdµ are defined and they are not both infinite and of 24 opposite signs, then [f + g]dµ = fdµ + gdµ. R R

25 Proof. If f,g R 0, then byR LemmaR 69, there exist sequences of nonnegative simple ≥ 26 functions f ∞ and g ∞ such that f f and g g. Then (f + g ) (f + g) and { n}n=1 { n}n=1 n ↑ n ↑ n n ↑ 27 [fn +gn]dµ = fndµ+ gndµ by Proposition 96. The result now follows from the monotone R R R 20

+ + + 1 convergence theorem. For integrable f and g, note that (f+g) +f −+g− =(f+g)−+f + g . 2 What we just proved for nonnegative functions implies that

+ 3 (f + g) dµ + f −dµ + g−dµ Z Z Z + 4 = [(f + g) + f − + g−]dµ Z + + 5 = [(f + g)− + f + g ]dµ Z + + 6 = (f + g)−dµ + f dµ + g dµ. Z Z Z 7 Rearranging the terms in the first and last expressions gives the desired result. If both f 8 and g have infinite integral of the same sign, then it follows easily that f + g has infinite 9 integral of the same sign. Finally, if only one of f and g has infinite integral, it also follows 10 easily that f + g has infinite integral of the same sign.  11 For proving theorems about integrals, there is a common sequence of steps that is often 12 called the standard machinery or standard machine. It is illustrated in the next result, the 13 measure-theoretic version of the change-of-variables formula.

14 Lemma 101. Let (Ω, ,µ) be a measure space and let (S, ) be a measurable space. Let F A 15 f : Ω S be a measurable function. Let ν be the measure induced on (S, ) by f from µ. → 1 A 16 (See Definition 76.) Let g : S IR be / measurable. Then → A B

17 (102) gdν = g(f)dµ, Z Z 18 if either integral exists.

19 Proof. First, assume that g = IA for some A . Then (102) becomes ν(A) = 1 ∈ A 20 µ(f − (A)), which is the definition of ν. Next, if g is a nonnegative simple function, then 21 (102) holds by linearity of integrals. If g is a nonnegative function, then use the monotone 22 convergence theorem and a sequence of nonnegative simple functions converging to g from + 23 below to see that (102) holds. Finally, for general g, (102) holds if either g or g− is 24 integrable. 

25 Exercise 103. Suppose that f is integrable for each n and sup f dµ < . Show n n n ∞ 26 that, if f f, then f is integrable and f dµ fdµ. n ↑ n → R 27 Exercise 104. Show that if f and Rg are integrable,R then

fdµ gdµ f g dµ. 28 − ≤ | − | Z Z Z

29 Exercise 105. Assume the sequence of functions fn is defined on a measure space 30 (Ω, ,µ) such that µ(Ω) < . Further, suppose that the f are uniformly bounded and F ∞ n 31 that f f uniformly. Show that f dµ fdµ. n → n → R R 21

1 Fatou’s Lemma

2 Theorem 97. (Fatou’s lemma) Let f ∞ be a sequence of nonnegative measurable { n}n=1 3 functions. Then

lim inf fndµ lim inf fndµ. 4 n ≤ n Z Z 5 Proof. Let f(ω) = lim infn fn(ω). Because →∞

fdµ = sup φdµ, 6 finite simple φ f Z ≤ Z

7 we need only prove that, for every finite simple φ f, ≤

φdµ lim inf fndµ. 8 ≤ n Z →∞ Z 9 Let φ f be finite and simple, and let ǫ> 0. For each n, define ≤

10 A = ω A : f (ω) (1 ǫ)φ(ω), for all k n . n { ∈ k ≥ − ≥ }

11 Since (1 ǫ)φ(ω) f(ω) for all ω with strict inequality wherever either side is positive, − ≤ C 12 ∞ A = Ω and A A for all n. Let B = A A . n=1 n n ⊆ n+1 n ∩ n S (106) fndµ fndµ (1 ǫ) φdµ. 13 ≥ ≥ − Z ZAn ZAn m 14 Let the canonical representation of φ be i=1 ciICi . Then, for all n. P m φdµ = ciµ(Ci An). 15 ∩ An i=1 Z X

16 Because the An’s form an increasing sequence whose union is Ω, limn µ(Ci An)= µ(Ci) →∞ ∩ 17 for all i. Taking the lim infn of both sides of (106) yields

m

lim inf fndµ (1 ǫ) ciµ(Ci)=(1 ǫ) φdµ. 18 n ≥ − − i=1 Z X Z 19 Since this is true for every ǫ> 0,

lim inf fndµ φdµ.  20 n ≥ →∞ Z Z 21 22

1 36-752: Lecture 9

2 Lemma 101 has a widely-used corollary.

3 Corollary 107. (Law of the unconscious statistician) If X : Ω S is a → 4 random quantity with distribution µ and if f : S IR is measurable, then E[f(X)] = X → 5 fdµX.

6 R Another useful application of monotone convergence is the following.

+0 7 Theorem 108. Let (Ω, ,µ) be a measure space, and let f : Ω IR be measurable. F → 8 Then ν(A)= fdµ is a measure on (Ω, ). A F R 9 Exercise 109. Prove Theorem 108.

10 If µ is σ-finite and if f is finite a.e. [µ], then ν in Theorem 108 is σ-finite. 11 What goes wrong with the conclusion to Theorem 108 if f is integrable but not necessarily 12 nonnegative? If f can take negative values then ν(A) = A fdµ might be negative. Let 13 A = ω : f(ω) < 0 . Suppose that µ(A) > 0. Write A = ∞ A , where A = ω : f(ω) < { } nR=1 n n { 14 1/n . If µ(A) > 0, then there exists n such that µ(An) > 0. (This argument is used often − } S 15 in proving probability results.) Then

1 ν(A)= I ( f)dµ I ( f)dµ µ(A ) > 0. 16 − A − ≥ An − ≥ n n Z Z 17 Here is another application of the standard machinery.

18 Theorem 110. Assume the same conditions as Theorem 108. Integrals with respect to 19 ν can be computed as gdν = gfdµ, if either exists. R R 20 Proof. We prove the result in four stages. First, assume that g is a indicator IA of some 21 set A . Then the definition of ν says that gdν = ν(A)= I fdµ. Second, assume that ∈F A 22 g is a nonnegative simple function. The result holds for g by linearity of integrals. Third, R R 23 assume that g is nonnegative. Approximate g from below by nonnegative simple functions 24 g ∞ . Then g dν = g fdµ for each n and the monotone convergence theorem says { n}n=1 n n 25 that the left side converges to gdν and the right side converges to gfdµ. Finally, if g is R + R + + 26 measurable, write g = g g− (the positive and negative parts). Then g dν = g fdµ − R R 27 and g dν = g fdµ. We see that gdν exists if and only if gfdµ exists, and if either − − R R 28 exists they are equal.  R R R R 29 The standard machinery corresponds to the three stages in defining integrals. The first 30 stage is split into indicators and nonnegative simple functions to make four steps in the 31 standard machinery.

32 Definition 111. The function f in Theorem 108 is called the density of ν with respect 33 to µ. 23

1 Example 112. (Probability density functions) Consider a continuous random 2 variable X having a density f. That is, a Pr(X a)= f(x)dx. 3 ≤ Z−∞ 1 4 Then the distribution of X, defined by µ (B) = Pr(X B) for B , satisfies X ∈ ∈ B

5 µX (B)= fdλ, ZB 6 where λ is Lebesgue measure. That is, the probability density functions of the usual contin- 7 uous distributions that you learned about in earlier courses are also densities with respect 8 to Lebesgue measure in the sense defined above.

9 Example 113. (Probability mass functions) Consider a typical discrete random 10 variable X with mass function f, i.e., f(x) = Pr(X = x) for all x. There are at most 11 countably many x such that f(x) > 0. Let µX be the distribution of X. For each set B, we 12 know that µX (B) = Pr(X B)= f(x). 13 ∈ x B X∈ 14 The rightmost term in this equation is fdµ, where µ is counting measure on the range 15 space of X. So, f is the density of µ with respect to µ. X R 16 The other major limit theorem is the following.

17 Theorem 114. (Dominated convergence theorem) Let f ∞ be a sequence of { n}n=1 18 measurable functions, and let f and g be measurable functions such that f f a.e. [µ], n → 19 f g a.e. [µ], and gdµ < . Then, | n| ≤ ∞ R lim fndµ = fdµ. 20 n →∞ Z Z

21 Proof. We have g f g a.e. [µ], hence − ≤ n ≤ 22 g + f 0, a.e. [µ], n ≥ 23 g f 0, a.e. [µ], − n ≥ 24 lim [g + fn] = g + f a.e. [µ], n →∞ 25 lim [g fn] = g f a.e. [µ]. n →∞ − − 26 It follows from Fatou’s lemma and Theorem 100 that

[g + f]dµ lim inf [g + fn]dµ 27 ≤ n Z →∞ Z = gdµ + lim inf fndµ, 28 n Z →∞ Z fdµ lim inf fndµ. 29 ≤ n Z →∞ Z 24

1 Similarly, it follows that

[g f]dµ lim inf [g fn]dµ 2 − ≤ n − Z →∞ Z

3 = gdµ lim sup fndµ, − n Z →∞ Z

4 fdµ lim sup fndµ. ≥ n Z →∞ Z 5 Together, these imply the conclusion of the theorem. 

6 Example 115. Let µ be a finite measure. Then limits and integrals can be interchanged 7 whenever the functions in the sequence are uniformly bounded.

8 An alternate version of the dominated convergence theorem is the following.

9 Proposition 116. Let f ∞ , g ∞ be sequences of measurable functions such that { n}n=1 { n}n=1 10 fn gn, a.e. [µ]. Let f and g be measurable functions such that limn fn = f and | | ≤ →∞ 11 limn gn = g, a.e. [µ]. Suppose that limn gndµ = gdµ < . Then, limn fndµ = →∞ →∞ ∞ →∞ 12 fdµ. R R R

13 TheR proof is the same as the proof of Theorem 114, except that gn replaces g in the first 14 three lines and wherever g appears with fn and a limit is being taken. 15 For finite measure spaces (i.e. (Ω, ,µ) with µ(Ω) < ), the minimal condition that F ∞ 16 guarantees convergence of integrals is uniform integrability .

17 Definition 117. A sequence of integrable functions f ∞ is uniformly integrable { n}n=1 18 (with respect to µ) if lim sup f dµ = 0. c n ω: fn(ω) >c n →∞ { | | } | | R 19 Theorem 118. Let µ be a finite measure. Let f ∞ be a sequence of integrable func- { n}n=1 20 tions such that limn fn = f a.e. [µ]. Suppose that fn n∞=1 is uniformly integrable. Then →∞ { } 21 limn fndµ = fdµ. →∞ R R 22 If the fn’s in Theorem 118 are nonnegative and integrable and fn f, then limn fndµ = → →∞ 23 fdµ implies that f ∞ are uniformly integrable. We will not use this result, however. { n}n=1 R R 25

1 36-752: Lecture 10

2 Here are some more useful properties of integrals.

3 Theorem 119. Let (Ω, ,µ) be a measure space. Let f and g be measurable extended F 4 real-valued functions.

5 1. If f is nonnegative and µ( ω : f(ω) > 0 ) > 0, then fdµ > 0. { } 6 2. If f and g are integrable and if fdµ = gdµ forR all A , then f = g a.e. [µ]. A A ∈F 7 3. If µ is σ-finite and if fdµ = R gdµ forR all A , then f = g a.e. [µ]. A A ∈F 8 4. Let Π be a π-system thatR generatesR . Suppose that Ω is a finite or countable union F 9 of elements of Π. If f and g are integrable and if fdµ = gdµ for all A Π, then A A ∈ 10 f = g a.e. [µ]. R R 11 Proof.

12 1. Let A = ω : f(ω) > c for each c 0. Because µ(A ) > 0 and A = ∞ A , it c { } ≥ 0 0 n=1 1/n 13 follows from Lemma 34 that there exists n such that µ(A1/n) > 0. Since f fIA , S ≥ 1/n 14 we have fdµ fdµ. But (1/n)IA1 is a simple function that is fIA1 and ≥ A1/n /n ≤ /n 15 (1/n)I dµ = µ(A ) > 0. It follows that fdµ > 0. AR1/n R 1/n

16 2.R This will appear on a homework assignment. R

17 3. First, assume that f and g are real-valued. Let A ∞ be disjoint elements of such { n}n=1 F 18 that µ(A ) < and ∞ A = Ω. Let B = ω : f(ω) < m, g(ω) < m for each n ∞ n=1 n m { | | | | } 19 integer m. For each pair (n, m), fIAn Bm and gIAn Bm satisfy the conditions of the ∩ ∩ 20 S previous part, so fIAn Bm = gIAn Bm a.e. [µ]. Let C = ω : f(ω) = g(ω) . Since ∩ ∩ { 6 }

∞ ∞ C = [C Bm An] , 21 ∩ ∩ n=1 m=1 [ [ 22 and each µ(C B A ) = 0, it follows that µ(C)=0. ∩ m ∩ n 23 Next, suppose that f and/or g is extended real-valued. Let E = f = ∆ g = , { ∞} { ∞} 24 the set where one function is but the other is not. If µ(E) > 0, then there is a ∞ 25 subset A of E such that 0 <µ(A) < and one of the functions is bounded above on ∞ 26 A while the other is infinite. This contradicts A fdµ = A gdµ. A similar result holds 27 for . −∞ R R + + 28 + + 4. Define ν1 (A)= A f dµ, ν2 (A)= A g dµ, ν1−(A)= A f −dµ, and ν2−(A)= A g−dµ. 29 These are all finite measures according to Theorem 108. The additional condition R R R R 30 implies that they are all σ-finite on Π. The equality of the integrals implies that + + + + 31 ν1 + ν2− = ν1− + ν2 for all sets in Π. Theorem 43 implies that ν1 + ν2− = ν1− + ν2 for 32 all sets in . Hence, the condition of part 2 hold and the result is proven.  F 26

1 The condition about unions in part 4 of the above theorem holds for the π-systems in 2 Example 38.

3 Corollary 120. If µ is σ-finite and ν is related to µ as in Theorem 108, then the 4 density of ν with respect to µ is unique, a.e. [µ].

5 There is an interesting characterization of σ-finite measures in terms of integrals.

6 Theorem 121. Let (Ω, ,µ) be a measure space. Then µ is σ-finite if and only if there F 7 exists a strictly positive integrable function.

8 Exercise 122. Prove Theorem 121.

9 Absolute Continuity. There is a special relationship between measures on the same 10 space that is very useful in .

11 Definition 123. Let ν and µ be measures on the space (Ω, ). We say that ν µ F ≪ 12 (read ν is absolutely continuous with respect to µ) if for every A , µ(A) = 0 implies ∈ F 13 ν(A)=0.

14 That is, ν µ if and only if every measure 0 set under µ is also a measure 0 set under ν. ≪ 15 Example 124. Let (Ω, ,µ) be a measure space. Let f be a nonnegative function, and F 16 define ν(A)= fdµ. Then ν is a measure and ν µ. If f < a.e. [µ] and if µ is σ-finite, A ≪ ∞ 17 then ν is σ-finite as well. R 18 Example 125. Let µ1 and µ2 be measures on the same space. Let µ = µ1 + µ2. Then 19 µ µ for i =1, 2. i ≪ 20 Absolute continuity has a connection with continuity of functions.

21 Proposition 126. Let ν and µ be measures on the space (Ω, ). Suppose that, for every F 22 ǫ> 0, there exists δ such that for every A , µ(A) < δ implies ν(A) < ǫ. Then ν µ. ∈F ≪ 23 A concept related to absolute continuity is singularity.

24 Definition 127. Two measures µ and ν on the same space (Ω, ) are (mutually) sin- F C C 25 gular (denoted µ ν) if there exist disjoint sets S and S such that µ(S )= ν(S )=0. ⊥ µ ν µ ν 26 Example 128. Let f and g be nonnegative functions such that fg = 0 a.e. [µ]. Define 27 ν (A)= fdµ and ν (A)= gdµ. Then ν ν . 1 A 2 A 1 ⊥ 2 28 The mainR theoretical resultR on absolute continuity is the Radon-Nikodym theorem which 29 says that, in the σ-finite case, all absolute continuity is of the type in Example 124.

30 Theorem 129. (Radon-Nikodym) Let µ and ν be σ-finite measures on the space 31 (Ω, ). Then ν µ if and only if there exists a nonnegative measurable f such that F ≪ 32 ν(A)= fdµ for all A . The function f is unique a.e. [µ]. A ∈F 33 OneR proof of this result is given in a separate course document. Another proof is given 34 later after we introduce conditional expectation.

35 Definition 130. The function f in Theorem 129 is called a Radon-Nikodym derivative 36 of ν with respect to µ. It is denoted dν/dµ. Each such function is called a version of dν/dµ. 27

1 36-752: Lecture 11

2 The uniqueness of Radon-Nikodym derivatives is only a.e. [µ]. If f = dν/dµ, then every 3 measurable function that equals f a.e. [µ] could also be called dν/dµ. All of these functions 4 are called versions of the Radon-Nikodym derivative.

5 Definition 131. If µ ν and ν µ, we say that µ and ν are equivalent. ≪ ≪ 6 If µ and ν are equivalent, then dµ 1 = . dν dν 7 dµ

8 If ν µ η, then the chain rule for R-N derivatives says ≪ ≪ dν dν dµ = . 9 dη dµ dη

10 Absolute continuity plays an important role in statistical inference. Parametric families are 11 collections of probability measures that are all absolutely continuous with respect to a single 12 measure.

13 Theorem 132. Let (Ω, ,µ) be a σ-finite measure space. Let µ : θ Θ be a collec- F { θ ∈ } 14 tion of measures on (Ω, ) such that µ µ for all θ Θ. Then there exists a sequence of F θ ≪ ∈ 15 nonnegative numbers c ∞ and a sequence of elements θ ∞ of Θ such that ∞ c { n}n=1 { n}n=1 n=1 n 16 and µθ ∞ cnµθ for all θ Θ. ≪ n=1 n ∈ P P 17 We will not prove this theorem here. (See Theorem A.78 in Schervish 1995.)

18 Random Vectors. In Definition 62 we defined random variables and random quanti- 19 ties. A special case of the latter and generalization of the former is a random vector.

k 20 Definition 133. Let (Ω, ,P ) be a probability space. Let X : Ω IR be a measurable F → 21 function. Then X is called a random vector .

k 22 There arises, in this definition, the question of what σ-field of subsets of IR should be used. 23 When left unstated, we always assume that the σ-field of subsets of a multidimensional real 24 space is the Borel σ-field, namely the smallest σ-field containing the open sets. However, k 25 because IR is also a product set of k sets, each of which already has a natural σ-field 26 associated with it, we might try to use a σ-field that corresponds to that product in some 27 way.

k 28 Product Spaces. The set IR has a topology in its own right, but it also happens to 29 be a product set. Each of the factors in the product comes with its own σ-field. There is 30 a way of constructing σ-field’s of subsets of product sets directly without appealing to any 31 additional structure that they might have. 28

1 Definition 134. Let (Ω , ) and (Ω , ) be measurable spaces. Let be the 1 F1 2 F2 F1 ⊗F2 2 smallest σ-field of subsets of Ω Ω containing all sets of the form A A where A 1 × 2 1 × 2 i ∈Fi 3 for i =1, 2. Then is the product σ-field. F1 ⊗F2 4 Lemma 135. Let (Ω , ) and (Ω , ) be measurable spaces. Suppose that is a π- 1 F1 2 F2 Ci 5 system that generates for i = 1, 2. Let = C C : C ,C . Then Fi C { 1 × 2 1 ∈ C1 2 ∈ C2} 6 σ( )= , and is a π-system. C F1 ⊗F2 C 7 Proof. Because σ( )isa σ-field, it contains all sets of the form C A where A . C 1 × 2 2 ∈F2 8 For the same reason, it must contain all sets of the form A A for A (i = 1, 2). 1 × 2 i ∈ Fi 9 Because 10 (C C ) (D D )=(C D ) (C D ), 1 × 2 ∩ 1 × 2 1 ∩ 1 × 2 × 2 11 we see that is a π-system.  C 1 12 Example 136. Let Ω = IR for i = 1, 2, and let and both be . Let be i F1 F2 B Ci 13 the collection of all intervals centered at rational numbers with rational lengths. Then Ci 14 generates for i = 1, 2 and the product topology is the smallest topology containing as Fi C 15 defined in Lemma 135. It follows that 1 2 is the smallest σ-field containing the product 2 F ⊗F 16 topology. We call this σ-field . B 2 3 17 Example 137. This time, let Ω1 = IR and Ω2 = IR. The product set is IR and the 3 3 18 product σ-field is called . It is also the smallest σ-field containing all open sets in IR . B k 19 The same idea extends to each finite-dimensional Euclidean space, with Borel σ-field’s , B 20 for k =1, 2,....

21 Lemma 138. Let (Ω , ) and (S , ) be measurable spaces for i =1, 2. Let f : Ω S i Fi i Ai i i → i 22 be a function for i = 1, 2. Define g(ω1,ω2)=(f1(ω1), f2(ω2)), which is a function from 23 Ω Ω to S S . Then f is / -measurable for i =1, 2 if and only if g is / - 1× 2 1× 2 i Fi Ai F1⊗F2 A1⊗A2 24 measurable.

25 Proof. For the “only if” direction, assume that each fi is measurable. It suffices to 1 26 show that for each product set A1 A2 (with Ai i for i =1, 2) g− (A1 A2) 1 2. 1 × 1 ∈A 1 × ∈F ⊗F 27 But, it is easy to see that g− (A1 A2)= f1− (A1) f2− (A2) 1 2. × × ∈F ⊗F 1 28 For the “if” direction, suppose that g is measurable. Then for every A1 1, g− (A1 1 1 1 ∈A × 29 S ) . But g− (A S ) = f − (A ) Ω . The fact that f − (A ) will now 2 ∈ F1 ⊗F2 1 × 2 1 1 × 2 1 1 ∈ F1 30 follow from the first claim in Proposition 140. (Sorry for the forward reference.) So f1 is 31 measurable. Similarly, f2 is measurable. 

32 Proposition 139. Let (Ω, ), (S , ), and (S , ) be measurable spaces. Let X : F 1 A1 2 A2 i 33 Ω S for i =1, 2. Define X =(X ,X ) a function from Ω to S S . Then X is / → i 1 2 1 × 2 i F Ai 34 measurable for i =1, 2 if and only if X is / measurable. F A1 ⊗A2 35 Lemma 138 and Proposition 139 extend to higher-dimensional products as well. 36 The product σ-field is also the smallest σ-field such that the coordinate projection func- 37 tions are measurable. The coordinate projection functions for a product set S S are the 1 × 2 38 functions f : S S S (for i =1, 2) defined by f (s ,s )= s (for i =1, 2). i 1 × 2 → i i 1 2 i 29

Infinite-dimensional product spaces pose added complications that we will not consider until later in the course.

There are a number of facts about product spaces that we might take for granted.

Proposition 140. Let $(\Omega_1,\mathcal{F}_1)$ and $(\Omega_2,\mathcal{F}_2)$ be measurable spaces.

• For each $B\in\mathcal{F}_1\otimes\mathcal{F}_2$ and each $\omega_1\in\Omega_1$, the $\omega_1$-section of $B$, $B_{\omega_1}=\{\omega_2 : (\omega_1,\omega_2)\in B\}$, is in $\mathcal{F}_2$.

• If $\mu_2$ is a σ-finite measure on $(\Omega_2,\mathcal{F}_2)$, then $\mu_2(B_{\omega_1})$ is a measurable function from $\Omega_1$ to $\overline{\mathbb{R}}$.

• If $f:\Omega_1\times\Omega_2\to S$ is measurable, then for every $\omega_1\in\Omega_1$, the function $f_{\omega_1}:\Omega_2\to S$ defined by $f_{\omega_1}(\omega_2)=f(\omega_1,\omega_2)$ is measurable.

• If $\mu_2$ is a σ-finite measure on $(\Omega_2,\mathcal{F}_2)$ and if $f:\Omega_1\times\Omega_2\to\overline{\mathbb{R}}$ is nonnegative, then $\int f(\omega_1,\omega_2)\,\mu_2(d\omega_2)$ defines a measurable (possibly infinite-valued) function of $\omega_1$.

To prove results like these, start with product sets or indicators of product sets and then show that the collection of sets that satisfy the results is a σ-field. Then, if necessary, proceed with the standard machinery. For example, consider the second claim. For the case of finite $\mu_2$, the claim is true if $B$ is a product set. It is easy to show that the collection $\mathcal{C}$ of all sets $B$ for which $\mu_2(B_{\omega_1})$ is measurable is a λ-system. Then use Lemma 42. Here is the proof that the second claim holds for σ-finite measures once it is proven that it holds for finite measures. Let $\{A_n\}_{n=1}^\infty$ be elements of $\mathcal{F}_2$ that cover $\Omega_2$ (without loss of generality, mutually disjoint) and have finite $\mu_2$ measure. Define $\mathcal{F}_{2,n}=\{C\cap A_n : C\in\mathcal{F}_2\}$ and $\mu_{2,n}(C)=\mu_2(A_n\cap C)$ for all $C\in\mathcal{F}_2$. Then $(A_n,\mathcal{F}_{2,n},\mu_{2,n})$ is a finite measure space for each $n$, and $\mu_{2,n}(B_{\omega_1})$ is measurable for all $n$ and all $B$ in the product σ-field. Finally, notice that
\[
\mu_2(B_{\omega_1})=\sum_{n=1}^{\infty}\mu_2(B_{\omega_1}\cap A_n)=\sum_{n=1}^{\infty}\mu_{2,n}(B_{\omega_1}),
\]
a sum of nonnegative measurable functions, hence measurable. The standard machinery can be used to prove the third and fourth claims. (Even though the third claim does not involve integrals, the steps in the proof are similar to those of the standard machinery.)

Radon-Nikodym Theorem

Theorem 129. (Radon-Nikodym) Let $\mu$ and $\nu$ be σ-finite measures on the space $(\Omega,\mathcal{F})$. Then $\nu\ll\mu$ if and only if there exists a nonnegative measurable $f$ such that $\nu(A)=\int_A f\,d\mu$ for all $A\in\mathcal{F}$. The function $f$ is unique a.e. $[\mu]$.

The proof of this result relies upon the theory of signed measures.

Definition 141. Let $(\Omega,\mathcal{F})$ be a measurable space. Let $\eta:\mathcal{F}\to\overline{\mathbb{R}}$. We call $\eta$ a signed measure if

• $\eta(\emptyset)=0$,

• for every sequence $\{A_k\}_{k=1}^\infty$ of mutually disjoint elements of $\mathcal{F}$, $\eta\left(\bigcup_{k=1}^\infty A_k\right)=\sum_{k=1}^\infty\eta(A_k)$,

• $\eta$ takes at most one of the two values $\pm\infty$.

Example 142. Let $\mu_1$ and $\mu_2$ be measures on the same space such that at most one of them is infinite. Then $\mu_1-\mu_2$ is a signed measure.

Example 143. Let $f$ be integrable with respect to $\mu$, and define $\eta(A)=\int_A f\,d\mu$. Then $\eta$ is a finite signed measure. If the integral of $f$ is merely defined, but not finite, then $\eta(A)=\int_A f\,d\mu$ is a signed measure.

The nice thing about σ-finite signed measures is that they divide up nicely into positive and negative parts, just like measurable functions.

Theorem 144. (Hahn and Jordan decompositions) Let $\eta$ be a finite signed measure on $(\Omega,\mathcal{F})$. Then there exists a set $A^+$ such that every subset $A$ of $A^+$ has $\eta(A)\geq 0$ and every subset $B$ of $A^{+C}$ has $\eta(B)\leq 0$. Also, there exist finite mutually singular measures $\eta^+$ and $\eta^-$ such that $\eta=\eta^+-\eta^-$.
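
Before the proof, a concrete illustration (ours, not part of the original notes) may help. Let $\Omega=[-1,1]$ with Lebesgue measure $\lambda$, and define $\eta(A)=\int_A x\,\lambda(dx)$. Then we may take $A^+=[0,1]$: every measurable subset of $[0,1]$ has nonnegative $\eta$-measure, and every measurable subset of $A^{+C}=[-1,0)$ has nonpositive $\eta$-measure. The Jordan decomposition is
\[
\eta^+(A)=\int_{A\cap[0,1]}x\,dx,\qquad \eta^-(A)=-\int_{A\cap[-1,0)}x\,dx,
\]
and $\eta^+\perp\eta^-$ because $\eta^+$ concentrates on $[0,1]$ while $\eta^-$ concentrates on $[-1,0)$.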

Proof. Let $\alpha=\sup_{A\in\mathcal{F}}\eta(A)$. Let $\lim_{n\to\infty}\eta(A_n)=\alpha$. Although the sequence $\{\bigcup_{i=1}^n A_i\}_{n=1}^\infty$ is monotone increasing and signed measures do satisfy Lemma 1 of the course notes, $\eta\left(\bigcup_{i=1}^n A_i\right)$ is not necessarily as large as $\eta(A_n)$. However, the following trick replaces $\bigcup_{i=1}^n A_i$ by a sequence of sets whose signed measures do increase. For each $n$, partition $\Omega$ using the sets $A_1,\ldots,A_n$ and their complements. Let $C_n$ be the union of all of the component sets that have positive signed measure. Since the $(n+1)$st partition is a refinement of the $n$th partition, we see that $C_{n+1}\cap C_n^C$ is a union of sets with positive signed measure and
\[
\eta(A_n)\leq\eta(C_n)\leq\eta\left(C_n\cup C_{n+1}\right).
\]
By induction, we then show that $A^+=\bigcap_{m=1}^\infty\bigcup_{n=m}^\infty C_n$ has $\eta(A^+)=\alpha$. The conclusions now follow easily. $\square$

Theorem 144 has an interesting consequence.

Lemma 145. Suppose that $\mu$ and $\nu$ are finite and not mutually singular. Then there exists $\epsilon>0$ and a set $A$ with $\mu(A)>0$ and $\epsilon\mu(E)\leq\nu(E)$ for every $E\subseteq A$.

Proof. For each $n$, let $\eta_n=\nu-(1/n)\mu$. Let $\beta=\nu(\Omega)$. Let $A_n^+$ be the set called $A^+$ in Theorem 144 when $\eta$ is $\eta_n$. Let $M=\bigcap_{n=1}^\infty A_n^{+C}$. Since $\eta_n(E)\leq 0$ for every subset $E$ of $A_n^{+C}$, we have $\eta_n(M)\leq 0$ for all $n$ and $\nu(M)\leq(1/n)\mu(M)$. It follows that $\nu(M)=0$ and $\nu(M^C)=\beta$. Since $\mu$ and $\nu$ are not mutually singular, $\mu(M^C)>0$ and at least one $\mu(A_n^+)>0$. Let $A=A_n^+$ and $\epsilon=1/n$. $\square$

Proof (Theorem 129). The σ-finite case follows easily from the finite case, so assume that $\mu$ and $\nu$ are finite with $\nu\ll\mu$. Let $\mathcal{G}$ be the set of all nonnegative measurable functions $g$ such that $\int_E g\,d\mu\leq\nu(E)$ for all $E\in\mathcal{F}$. Because $0\in\mathcal{G}$, we know that $\mathcal{G}$ is nonempty. If $g_1$ and $g_2$ are in $\mathcal{G}$, we know that $\{g_1\leq g_2\}$ is measurable, hence it is easy to see that $\max\{g_1,g_2\}\in\mathcal{G}$. Also, if $g_n\in\mathcal{G}$ for all $n$ and $g_n\uparrow g$, then the monotone convergence theorem implies that $g\in\mathcal{G}$. So, let $\alpha=\sup_{g\in\mathcal{G}}\int g\,d\mu$ and let $\lim_{n\to\infty}\int g_n\,d\mu=\alpha$. Let $f_n=\max\{g_1,\ldots,g_n\}$, so that there is $f$ such that $f_n\uparrow f$, $f_n\in\mathcal{G}$ for all $n$, and $\lim_{n\to\infty}\int f_n\,d\mu=\alpha$. It follows that $\int f\,d\mu=\alpha$ and $f\in\mathcal{G}$. Define $\nu_1(E)=\int_E f\,d\mu$ and $\nu_2=\nu-\nu_1$, which is a measure since $\nu_1\leq\nu$. If $\nu_2$ and $\mu$ were not mutually singular, there would exist $\epsilon>0$ and a set $A$ with $\mu(A)>0$ and $\epsilon\mu(E)\leq\nu_2(E)$ for all $E\subseteq A$. For each $E\in\mathcal{F}$,
\[
\int_E(f+\epsilon I_A)\,d\mu=\int_E f\,d\mu+\epsilon\mu(E\cap A)\leq\nu_1(E)+\nu_2(E\cap A)\leq\nu_1(E)+\nu_2(E)=\nu(E).
\]
Hence $h=f+\epsilon I_A\in\mathcal{G}$, but $\int h\,d\mu=\alpha+\epsilon\mu(A)>\alpha$, a contradiction. It follows that $\nu_2$ and $\mu$ are mutually singular. Hence, there exists $S$ such that $\nu_2(S)=\mu(S^C)=0$. Since $\nu\ll\mu$, we have $\nu(S^C)=0$. Because $\nu_2\leq\nu$, we have $\nu_2(S^C)=0$ and $\nu_2(\Omega)=0$. It follows that $\nu=\nu_1$ and the proof of existence is complete. Uniqueness follows from Theorem 119 in the class notes. $\square$

Notice that absolute continuity was not used in the proof until the final steps.

Definition 146. The decomposition of $\nu$ into $\nu_1\ll\mu$ and $\nu_2\perp\mu$ in the proof of Theorem 129 is called the Lebesgue decomposition of $\nu$ into an absolutely continuous part and a singular part relative to $\mu$.
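
As a computational aside (ours, not in the notes), the finite case of Theorem 129 is transparent: on a finite sample space, a version of the Radon-Nikodym derivative is just the pointwise ratio of masses wherever $\mu$ puts mass. A minimal sketch with made-up weights:

    import numpy as np

    # Finite sample space Omega = {0, 1, 2, 3} with hypothetical masses.
    mu = np.array([0.1, 0.4, 0.5, 0.0])   # dominating measure
    nu = np.array([0.2, 0.2, 0.6, 0.0])   # nu << mu: nu has no mass where mu has none

    # f = dnu/dmu, defined mu-a.e. (set arbitrarily to 0 where mu = 0).
    f = np.divide(nu, mu, out=np.zeros_like(nu), where=mu > 0)

    # Check nu(A) = integral of f over A with respect to mu, for every subset A.
    for bits in range(16):
        A = np.array([(bits >> i) & 1 for i in range(4)])
        assert np.isclose(np.dot(A, nu), np.dot(A, f * mu))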

36-752: Lecture 12

Theorem 147. Let $(\Omega_i,\mathcal{F}_i)$ for $i=1,2,3$ be measurable spaces. Let $f:\Omega_1\to\Omega_2$ be a measurable onto function. Suppose that $\mathcal{F}_3$ contains all singletons. Let $\mathcal{A}_1=\sigma(f)$. Let $g:\Omega_1\to\Omega_3$ be $\mathcal{F}_1/\mathcal{F}_3$-measurable. Then $g$ is $\mathcal{A}_1/\mathcal{F}_3$-measurable if and only if there exists an $\mathcal{F}_2/\mathcal{F}_3$-measurable $h:\Omega_2\to\Omega_3$ such that $g=h(f)$.

Proof. For the “if” part, assume that there is a measurable $h:\Omega_2\to\Omega_3$ such that $g(\omega)=h(f(\omega))$ for all $\omega\in\Omega_1$. Let $B\in\mathcal{F}_3$. We need to show that $g^{-1}(B)\in\mathcal{A}_1$. Since $h$ is measurable, $h^{-1}(B)\in\mathcal{F}_2$, so $h^{-1}(B)=A$ for some $A\in\mathcal{F}_2$. Since $g^{-1}(B)=f^{-1}(h^{-1}(B))$, it follows that $g^{-1}(B)=f^{-1}(A)\in\mathcal{A}_1$.

For the “only if” part, assume that $g$ is $\mathcal{A}_1$-measurable. For each $t\in\Omega_3$, let $C_t=g^{-1}(\{t\})$. Since $g$ is measurable with respect to $\mathcal{A}_1=f^{-1}(\mathcal{F}_2)$, every element of $g^{-1}(\mathcal{F}_3)$ is in $f^{-1}(\mathcal{F}_2)$. So let $A_t\in\mathcal{F}_2$ be such that $C_t=f^{-1}(A_t)$. Define $h(\omega)=t$ for all $\omega\in A_t$. (Note that if $t_1\neq t_2$, then $A_{t_1}\cap A_{t_2}=\emptyset$, so $h$ is well defined.) To see that $g(\omega)=h(f(\omega))$, let $g(\omega)=t$, so that $\omega\in C_t=f^{-1}(A_t)$. This means that $f(\omega)\in A_t$, which in turn implies $h(f(\omega))=t=g(\omega)$.

To see that $h$ is measurable, let $A\in\mathcal{F}_3$. We must show that $h^{-1}(A)\in\mathcal{F}_2$. Since $g$ is $\mathcal{A}_1$-measurable, $g^{-1}(A)\in\mathcal{A}_1$, so there is some $B\in\mathcal{F}_2$ such that $g^{-1}(A)=f^{-1}(B)$. We will show that $h^{-1}(A)=B\in\mathcal{F}_2$ to complete the proof. If $\omega\in h^{-1}(A)$, let $t=h(\omega)\in A$ and $\omega=f(x)$ (because $f$ is onto). Hence, $x\in C_t\subseteq g^{-1}(A)=f^{-1}(B)$, so $f(x)\in B$. Hence, $\omega\in B$. This implies that $h^{-1}(A)\subseteq B$. Lastly, if $\omega\in B$, $\omega=f(x)$ for some $x\in f^{-1}(B)=g^{-1}(A)$ and $h(\omega)=h(f(x))=g(x)\in A$. So, $h(\omega)\in A$ and $\omega\in h^{-1}(A)$. This implies $B\subseteq h^{-1}(A)$. $\square$

The condition that $f$ be onto can be relaxed at the expense of changing the domain of $h$ to be the image of $f$, i.e. $h:f(\Omega_1)\to\Omega_3$, with a different σ-field. The proof is slightly more complicated due to having to keep track of the image of $f$, which might not be a measurable set in $\mathcal{F}_2$.

The following is an example to show why the condition that $\mathcal{F}_3$ contains all singletons is included in Theorem 147.

Example 148. Let $\Omega_i=\mathbb{R}$ for all $i$ and let $\mathcal{F}_1=\mathcal{F}_2=\mathcal{B}^1$, while $\mathcal{F}_3=\{\mathbb{R},\emptyset\}$. Then every function $g:\Omega_1\to\Omega_3$ is $\sigma(f)/\mathcal{F}_3$-measurable, no matter what $f:\Omega_1\to\Omega_2$ is. For example, let $f(x)=x^2$ and $g(x)=x$ for all $x$. Then $g^{-1}(\mathcal{F}_3)\subseteq\sigma(f)$, but $g$ is not a function of $f$.

The reason that we need the condition about singletons is the following. Suppose that there are two points $t_1,t_2\in\Omega_3$ such that $t_1\in A$ implies $t_2\in A$ and vice versa for every $A\in\mathcal{F}_3$. Then there can be a set $A\in\mathcal{F}_3$ that contains both $t_1$ and $t_2$, and $g$ can take both of the values $t_1$ and $t_2$, but $f$ is constant on $g^{-1}(A)$ and all the measurability conditions still hold. In this case, $g$ is not a function of $f$.

Product Measures. Product measures are measures on product spaces that arise from individual measures on the component spaces. Product measures are just like joint distributions of independent random variables, as we shall see after we define both concepts.

Theorem 149. Let $(\Omega_i,\mathcal{F}_i,\mu_i)$ for $i=1,2$ be σ-finite measure spaces. There exists a unique measure $\mu$ defined on $(\Omega_1\times\Omega_2,\mathcal{F}_1\otimes\mathcal{F}_2)$ that satisfies $\mu(A_1\times A_2)=\mu_1(A_1)\mu_2(A_2)$ for all $A_1\in\mathcal{F}_1$ and $A_2\in\mathcal{F}_2$.

Proof. The uniqueness will follow from Theorem 43, since any two such measures will agree on the π-system of product sets. For the existence, consider the measurable function $\mu_2(B_{\omega_1})$ defined in Proposition 140. For $B\in\mathcal{F}_1\otimes\mathcal{F}_2$, define
\[
\mu(B)=\int\mu_2(B_{\omega_1})\,\mu_1(d\omega_1).
\]
Because $\mu_2(B_{\omega_1})\geq 0$, $\mu$ is a σ-finite measure. (See Example 124.) If $B$ is a product set $A_1\times A_2$, then $B_{\omega_1}=A_2$ for all $\omega_1\in A_1$ (and $B_{\omega_1}=\emptyset$ otherwise), and
\[
\mu(B)=\int\mu_2(A_2)I_{A_1}(\omega_1)\,\mu_1(d\omega_1)=\mu_1(A_1)\mu_2(A_2).
\]
It follows that $\mu$ is the desired measure. $\square$
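
A small numerical aside (ours): for finite spaces, the product measure of Theorem 149 is literally an outer product of mass vectors, and the defining rectangle property can be checked directly.

    import numpy as np

    # Hypothetical finite measures on Omega1 = {0,1,2} and Omega2 = {0,1}.
    mu1 = np.array([0.5, 1.0, 2.0])
    mu2 = np.array([0.3, 0.7])

    # The product measure puts mass mu1(i) * mu2(j) on the point (i, j).
    mu = np.outer(mu1, mu2)

    # Check mu(A1 x A2) = mu1(A1) * mu2(A2) on the rectangle A1 = {0,2}, A2 = {1}.
    a1 = np.array([True, False, True])
    a2 = np.array([False, True])
    assert np.isclose(mu[np.ix_(a1, a2)].sum(), mu1[a1].sum() * mu2[a2].sum())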

Definition 150. The measure $\mu$ in Theorem 149 is called the product measure of $\mu_1$ and $\mu_2$ and is sometimes denoted $\mu_1\times\mu_2$.

How to integrate with respect to a product measure is an interesting question. For nonnegative functions, there is a simple answer.

Theorem 151. (Fubini/Tonelli theorem) Let $(\Omega_1,\mathcal{F}_1,\mu_1)$ and $(\Omega_2,\mathcal{F}_2,\mu_2)$ be σ-finite measure spaces. Let $f:\Omega_1\times\Omega_2\to\overline{\mathbb{R}}$ be a nonnegative $\mathcal{F}_1\otimes\mathcal{F}_2/\mathcal{B}^1$-measurable function. Then
\[
(152)\qquad \int f\,d(\mu_1\times\mu_2)=\int\left[\int f(\omega_1,\omega_2)\,\mu_1(d\omega_1)\right]\mu_2(d\omega_2)=\int\left[\int f(\omega_1,\omega_2)\,\mu_2(d\omega_2)\right]\mu_1(d\omega_1).
\]

Proof. We will use the standard machinery. If $f$ is the indicator of a set $B$, then all three integrals in (152) equal $\mu_1\times\mu_2(B)$, as in the proof of Theorem 149. By linearity of integrals, the three integrals are the same for all nonnegative simple functions. Next, let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative simple functions, all $\leq f$, such that $\lim_{n\to\infty}f_n=f$. We have just shown that, for each $n$,
\[
\int f_n\,d(\mu_1\times\mu_2)=\int\left[\int f_n(\omega_1,\omega_2)\,\mu_1(d\omega_1)\right]\mu_2(d\omega_2).
\]
For each $\omega_2$, the monotone convergence theorem says
\[
\lim_{n\to\infty}\int f_n(\omega_1,\omega_2)\,\mu_1(d\omega_1)=\int f(\omega_1,\omega_2)\,\mu_1(d\omega_1).
\]
Again, the monotone convergence theorem says that
\[
\lim_{n\to\infty}\int\left[\int f_n(\omega_1,\omega_2)\,\mu_1(d\omega_1)\right]\mu_2(d\omega_2)=\int\left[\lim_{n\to\infty}\int f_n(\omega_1,\omega_2)\,\mu_1(d\omega_1)\right]\mu_2(d\omega_2).
\]
Combining these last three equations proves that the first two integrals in (152) are equal. A similar argument shows that the first and third are equal. $\square$
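
For intuition, an aside of ours: in the finite, discrete case, (152) is just the statement that a nonnegative double sum may be evaluated in either order.

    import numpy as np

    rng = np.random.default_rng(0)
    f = rng.random((50, 40))   # nonnegative f on a finite product space
    w1 = rng.random(50)        # point masses for mu1
    w2 = rng.random(40)        # point masses for mu2

    total = (f * np.outer(w1, w2)).sum()           # integral against mu1 x mu2
    iter21 = ((f * w2).sum(axis=1) * w1).sum()     # integrate omega2 first, then omega1
    iter12 = ((f.T * w1).sum(axis=1) * w2).sum()   # integrate omega1 first, then omega2
    assert np.isclose(total, iter12) and np.isclose(total, iter21)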

36-752: Lecture 13

Theorem 151 says that nonnegative product-measurable functions can be integrated in either order to get the integral with respect to product measure. A similar result holds for integrable product-measurable functions.

Corollary 153. Let $(\Omega_1,\mathcal{F}_1,\mu_1)$ and $(\Omega_2,\mathcal{F}_2,\mu_2)$ be σ-finite measure spaces. Let $f:\Omega_1\times\Omega_2\to\overline{\mathbb{R}}$ be a function that is integrable with respect to $\mu_1\times\mu_2$. Then (152) holds.

The only sticky point in the proof of Corollary 153 is making sure that $\infty-\infty$ occurs with measure zero in the iterated integrals. But if $\infty$ ($-\infty$) occurs with positive measure for $f^+$ ($f^-$) in either of the iterated integrals, that iterated integral would be infinite and $f^+$ ($f^-$) would not be integrable.

Exercise 154. Let $X$ be a nonnegative random variable defined on a probability space $(\Omega,\mathcal{F},P)$ having distribution function $F$. Show that $\mathrm{E}(X)=\int_0^\infty[1-F(x)]\,dx$.
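
A quick numerical sanity check of this identity (our aside), using an Exponential(2) random variable whose mean is $1/2$:

    import numpy as np

    lam = 2.0
    x = np.linspace(0.0, 40.0, 200_001)   # truncate the upper limit; the tail is negligible
    survival = np.exp(-lam * x)           # 1 - F(x) for the Exponential(lam) cdf
    print(np.trapz(survival, x))          # approximately 1/lam = 0.5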

Example 155. This example satisfies neither the conditions of Theorem 151 nor those of Corollary 153. Let
\[
f(x,y)=\begin{cases} x\exp(-[1+x^2]y/2) & \text{if } y>0,\\ 0 & \text{otherwise.}\end{cases}
\]
Then
\[
\int f(x,y)\,dx=\exp(-y/2)\int x\exp(-x^2y/2)\,dx=0,
\]
\[
\int f(x,y)\,dy=\int_0^\infty x\exp(-[1+x^2]y/2)\,dy=\frac{2x}{1+x^2}.
\]
The iterated integral in one direction is 0 and is undefined in the other direction.

These results extend to arbitrary finite products.

Example 156. The product of $k$ copies of Lebesgue measure on $\mathbb{R}^1$ is Lebesgue measure on $\mathbb{R}^k$. Theorem 151 and Corollary 153 give conditions under which integrals can be performed in any desired order.

Independence. We shall define what it means for collections of events and random quantities to be independent.

Definition 157. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be subsets of $\mathcal{F}$. We say that $\mathcal{C}_1$ and $\mathcal{C}_2$ are independent if $P(A_1\cap A_2)=P(A_1)P(A_2)$ for all $A_1\in\mathcal{C}_1$ and $A_2\in\mathcal{C}_2$.

Example 158. If each of $\mathcal{C}_1$ and $\mathcal{C}_2$ contains only one event, then $\mathcal{C}_1$ being independent of $\mathcal{C}_2$ is the same as those events being independent.

Definition 159. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $(S_i,\mathcal{A}_i)$ for $i=1,2$ be measurable spaces. Let $X_i:\Omega\to S_i$ be $\mathcal{F}/\mathcal{A}_i$-measurable for $i=1,2$. We say that $X_1$ and $X_2$ are independent if the σ-fields $\sigma(X_1)$ and $\sigma(X_2)$ (see Definition 59) are independent.

Proposition 160. If $\mathcal{C}_1$ and $\mathcal{C}_2$ are independent π-systems, then $\sigma(\mathcal{C}_1)$ and $\sigma(\mathcal{C}_2)$ are independent.

Example 161. Let $f_1$ and $f_2$ be densities with respect to Lebesgue measure. Let $P$ be defined on $(\mathbb{R}^2,\mathcal{B}^2)$ by $P(C)=\int\!\!\int_C f_1(x)f_2(y)\,dx\,dy$. Then the following two σ-fields are independent:
\[
\mathcal{C}_1=\{A\times\mathbb{R} : A\in\mathcal{B}^1\},\qquad \mathcal{C}_2=\{\mathbb{R}\times A : A\in\mathcal{B}^1\}.
\]
Also, the following two random variables are independent: $X_1(x,y)=x$ and $X_2(x,y)=y$, the coordinate projection functions. Indeed, $\mathcal{C}_i=\sigma(X_i)$ for $i=1,2$.

Example 162. Let $X_1$ and $X_2$ be two random variables defined on the same probability space $(\Omega,\mathcal{F},P)$. Suppose that the joint distribution of $(X_1,X_2)$ has a density $f(x,y)$ that factors into $f(x,y)=f_1(x)f_2(y)$, the two marginal densities. Then, for each product set $A\times B$ with $A,B\in\mathcal{B}^1$,
\[
\Pr(X_1\in A,X_2\in B)=\Pr((X_1,X_2)\in A\times B)=\int_A\int_B f_1(x)f_2(y)\,dy\,dx=\int_A f_1(x)\,dx\int_B f_2(y)\,dy=\Pr(X_1\in A)\Pr(X_2\in B).
\]
So, $X_1$ and $X_2$ are independent. The same reasoning would apply if the two random variables were discrete. It would also apply if one were discrete and the other continuous.

These definitions extend to more than two collections of events and more than two random variables.

Definition 163. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $\{\mathcal{C}_\alpha : \alpha\in\aleph\}$ be a collection of subsets of $\mathcal{F}$. We say that the $\mathcal{C}_\alpha$'s are (mutually) independent if, for every finite integer $n\geq 2$ (and no more than the cardinality of $\aleph$), and for all distinct $\alpha_1,\ldots,\alpha_n\in\aleph$ and $A_{\alpha_i}\in\mathcal{C}_{\alpha_i}$ for $i=1,\ldots,n$,
\[
P\left(\bigcap_{i=1}^n A_{\alpha_i}\right)=\prod_{i=1}^n P(A_{\alpha_i}).
\]

Definition 164. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $\{(S_\alpha,\mathcal{A}_\alpha) : \alpha\in\aleph\}$ be measurable spaces. Let $X_\alpha:\Omega\to S_\alpha$ be $\mathcal{F}/\mathcal{A}_\alpha$-measurable for each $\alpha\in\aleph$. We say that $\{X_\alpha : \alpha\in\aleph\}$ are (mutually) independent if the σ-fields $\{\sigma(X_\alpha) : \alpha\in\aleph\}$ are mutually independent.

Theorem 165. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $(S_i,\mathcal{A}_i)$ for $i=1,2$ be measurable spaces. Let $X_1:\Omega\to S_1$ and $X_2:\Omega\to S_2$ be random quantities. Define $X=(X_1,X_2)$. The distribution $\mu_X$ of $X:\Omega\to S_1\times S_2$ is the product measure $\mu_{X_1}\times\mu_{X_2}$ if and only if $X_1$ and $X_2$ are independent.

Proof. For the “if” direction, suppose that $X_1$ and $X_2$ are independent. Then for every product set $A_1\times A_2$,
\[
\mu_X(A_1\times A_2)=\Pr(X_1\in A_1,X_2\in A_2)=\Pr(X_1\in A_1)\Pr(X_2\in A_2)=\mu_{X_1}(A_1)\mu_{X_2}(A_2).
\]
It follows from the uniqueness of product measure that $\mu_X$ is the product measure.

For the “only if” direction, suppose that $\mu_X=\mu_{X_1}\times\mu_{X_2}$. Then, for every $A_1\in\mathcal{A}_1$ and $A_2\in\mathcal{A}_2$,
\[
\Pr(X_1\in A_1,X_2\in A_2)=\mu_X(A_1\times A_2)=\mu_{X_1}(A_1)\mu_{X_2}(A_2)=\Pr(X_1\in A_1)\Pr(X_2\in A_2).\qquad\square
\]

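
An empirical aside (ours) on Theorem 165: for independent coordinates, the empirical joint distribution factors, rectangle by rectangle, into the product of the empirical marginals.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200_000
    x1 = rng.normal(size=n)        # X1 ~ N(0,1)
    x2 = rng.exponential(size=n)   # X2 ~ Exp(1), drawn independently of X1

    # Compare P(X1 in A1, X2 in A2) with P(X1 in A1) P(X2 in A2)
    # for the rectangle A1 = (-1, 0.5], A2 = (0.2, 1].
    in_a1 = (x1 > -1.0) & (x1 <= 0.5)
    in_a2 = (x2 > 0.2) & (x2 <= 1.0)
    print(np.mean(in_a1 & in_a2), np.mean(in_a1) * np.mean(in_a2))  # agree to Monte Carlo error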

Theorem 166. (First Borel-Cantelli lemma) Let $(\Omega,\mathcal{F},\mu)$ be a measure space. If $\sum_{n=1}^\infty\mu(A_n)<\infty$, then $\mu(\limsup_{n\to\infty}A_n)=0$.

Proof. Let $B_i=\bigcup_{n=i}^\infty A_n$. Then $\{B_i\}_{i=1}^\infty$ is a decreasing sequence of sets, each of which has finite measure, so the second part of Lemma 34 says that
\[
\lim_{i\to\infty}\mu(B_i)=\mu\left(\lim_{i\to\infty}B_i\right)=\mu\left(\bigcap_{i=1}^\infty B_i\right)=\mu\left(\limsup_{n\to\infty}A_n\right).
\]
Since $\sum_{n=1}^\infty\mu(A_n)<\infty$, it follows that $\lim_{i\to\infty}\sum_{n=i}^\infty\mu(A_n)=0$. Since $\mu(B_i)\leq\sum_{n=i}^\infty\mu(A_n)$, $\lim_{i\to\infty}\mu(B_i)=0$, and the result follows. $\square$

Theorem 167. (Second Borel-Cantelli lemma) Let $(\Omega,\mathcal{F},P)$ be a probability space. If $\sum_{n=1}^\infty P(A_n)=\infty$ and if $\{A_n\}_{n=1}^\infty$ are mutually independent, then $P(\limsup_{n\to\infty}A_n)=1$.

Proof. Let $B=\limsup_{n\to\infty}A_n$. We shall prove that $P(B^C)=0$. Let $C_i=\bigcap_{n=i}^\infty A_n^C$. Then $B^C=\bigcup_{i=1}^\infty C_i$. So, we shall prove that $P(C_i)=0$ for all $i$. Now, for each $i$ and $k>i$,
\[
P(C_i)=P\left(\bigcap_{n=i}^\infty A_n^C\right)\leq P\left(\bigcap_{n=i}^k A_n^C\right)=\prod_{n=i}^k[1-P(A_n)].
\]
Use the fact that $\log(1-x)\leq -x$ for all $0\leq x\leq 1$ to see that, for every $k>i$,
\[
\log[P(C_i)]\leq\sum_{n=i}^k\log[1-P(A_n)]\leq-\sum_{n=i}^k P(A_n).
\]
Since this is true for all $k>i$, it follows that $\log[P(C_i)]\leq-\sum_{n=i}^\infty P(A_n)=-\infty$. Hence, $P(C_i)=0$ for all $i$. $\square$
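
A finite-horizon simulation aside (ours): with independent events $A_n=\{U_n<p_n\}$, the two lemmas predict finitely many occurrences when $p_n=n^{-2}$ (summable) and infinitely many when $p_n=n^{-1}$ (divergent sum).

    import numpy as np

    rng = np.random.default_rng(2)
    n = np.arange(1, 100_001)
    u = rng.random(n.size)

    hits_sq = np.flatnonzero(u < 1.0 / n**2) + 1   # summable case: a few early hits, then silence
    hits_lin = np.flatnonzero(u < 1.0 / n) + 1     # divergent case: about log(N) hits by horizon N

    print(len(hits_sq), hits_sq[:5])
    print(len(hits_lin), hits_lin[-3:])            # hits keep appearing out to the horizon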

Theorem 168. (Kolmogorov 0-1 law) Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent random quantities. Define $\mathcal{T}_n=\sigma(\{X_i : i\geq n\})$ and $\mathcal{T}=\bigcap_{n=1}^\infty\mathcal{T}_n$. Then every event in $\mathcal{T}$ has probability either 0 or 1.

Proof. Let $\mathcal{U}_n=\sigma(\{X_i : i\leq n\})$, and let $\mathcal{U}=\bigcup_{n=1}^\infty\mathcal{U}_n$. Let $A\in\mathcal{U}$ and $B\in\mathcal{T}$. There exists $n$ such that $A\in\mathcal{U}_n$. Because $B\in\mathcal{T}_{n+1}$, it follows that $A$ and $B$ are independent. So $\mathcal{U}$ and $\mathcal{T}$ are independent. It follows from Proposition 160 that $\sigma(\mathcal{U})=\sigma(\{X_n\}_{n=1}^\infty)$ and $\mathcal{T}$ are independent. Since $\mathcal{T}\subseteq\sigma(\mathcal{U})$, it follows that $\mathcal{T}$ is independent of itself, hence for all $B\in\mathcal{T}$, $\Pr(B)\in\{0,1\}$ by a homework problem. $\square$

Definition 169. The σ-field $\mathcal{T}$ in Theorem 168 is called the tail σ-field of the sequence $\{X_n\}_{n=1}^\infty$.

Exercise 170. Let $X_1,X_2,\ldots$ be independent, real-valued random variables defined on a probability space. Let $S_n=X_1+X_2+\cdots+X_n$. Which of the following is in $\mathcal{T}$?

1. $\{\lim_{n\to\infty}S_n\ \text{exists}\}$

2. $\{\limsup_{n\to\infty}S_n>0\}$

3. $\{\limsup_{n\to\infty}S_n/c_n>x\}$ where $c_n\to\infty$

36-752: Lecture 13b

Stochastic Processes. A stochastic process is an indexed collection of random quantities.

Definition 171. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $\aleph$ be a set. Suppose that, for each $\alpha\in\aleph$, there is a measurable space $(\mathcal{X}_\alpha,\mathcal{A}_\alpha)$ and a random quantity $X_\alpha:\Omega\to\mathcal{X}_\alpha$. The collection $\{X_\alpha : \alpha\in\aleph\}$ is called a stochastic process, and $\aleph$ is called the index set.

The most popular stochastic processes are those for which $\mathcal{X}_\alpha=\mathbb{R}$ for all $\alpha$. Among those, there are two very commonly used index sets, namely $\aleph=\mathbb{Z}^+$ (sequences of random variables) and $\aleph=\mathbb{R}_{+0}$ (continuous-time stochastic processes). There are, however, many more general index sets than these, and they are all handled in the same general fashion.

Example 172. (Random vector) Let $\aleph=\{1,\ldots,k\}$ and for each $i\in\aleph$, let $X_i$ be a random variable (all defined on the same probability space). Then $(X_1,\ldots,X_k)$ is one way to represent $\{X_i : i\in\{1,\ldots,k\}\}$.

Example 173. (Random probability measure) Let $\Theta:\Omega\to\mathbb{R}^k$ be a random vector with distribution $\mu_\Theta$. Let $f:\mathbb{R}\times\mathbb{R}^k\to\mathbb{R}_{+0}$ be a measurable function such that $\int f(x,\theta)\,dx=1$ for all $\theta\in\mathbb{R}^k$. Let $\aleph=\mathcal{B}^1$, the Borel σ-field of subsets of $\mathbb{R}$. For each $B\in\aleph$, define
\[
X_B(\omega)=\int_B f(x,\Theta(\omega))\,dx.
\]
The stochastic process $\{X_B : B\in\mathcal{B}^1\}$ is a random probability measure.

The distribution of a stochastic process is the probability measure induced on its range space. Unfortunately, if $\aleph$ is an infinite set, the range space of a stochastic process is an infinite-dimensional product set. We need to be able to construct a σ-field of subsets of such a set.

An infinite product of sets is usually defined as a set of functions.

Definition 174. Let $\aleph$ be a set. Suppose that, for each $\alpha\in\aleph$, there is a set $\mathcal{X}_\alpha$. The product set $\mathcal{X}=\prod_{\alpha\in\aleph}\mathcal{X}_\alpha$ is defined to be the set of all functions $f:\aleph\to\bigcup_{\alpha\in\aleph}\mathcal{X}_\alpha$ such that, for every $\alpha$, $f(\alpha)\in\mathcal{X}_\alpha$. When each $\mathcal{X}_\alpha$ is the same set $\mathcal{Y}$, then the product set is denoted $\mathcal{Y}^\aleph$.

The above definition applies to all product sets, not just infinite ones.

Example 175. It is easy to see that finite product sets can be considered sets of functions also. Each $k$-tuple is a function $f$ from $\{1,\ldots,k\}$ to some space, where the $i$th coordinate is $f(i)$. For example, the notation $\mathbb{R}^k$ can be thought of as a shorthand for $\mathbb{R}^{\{1,\ldots,k\}}$. A vector $(x_1,\ldots,x_k)$ is the function $f$ such that $f(i)=x_i$ for $i=1,\ldots,k$.

Example 176. (Random probability measure) In Example 173, let $\mathcal{X}_B=[0,1]$ for all $B\in\aleph$. Then each random variable $X_B$ takes values in $\mathcal{X}_B$. The infinite product set $\mathcal{X}$ is $[0,1]^{\mathcal{B}^1}$. Each probability measure on $(\mathbb{R},\mathcal{B}^1)$ is a function from $\mathcal{B}^1$ into $[0,1]$. The product set contains other functions that are not probabilities. For example, the function $f(B)=1$ for all $B\in\mathcal{B}^1$ is in the product set, but is not a probability.

We want the σ-field of subsets of a product space to be large enough so that all of the coordinate projection functions are measurable.

Definition 177. Let $\aleph$ be a set. For each $\alpha\in\aleph$, let $(\mathcal{X}_\alpha,\mathcal{A}_\alpha)$ be a measurable space. Let $\mathcal{X}=\prod_{\alpha\in\aleph}\mathcal{X}_\alpha$ be the product set. For each $\alpha\in\aleph$, the $\alpha$-coordinate projection function $p_\alpha:\mathcal{X}\to\mathcal{X}_\alpha$ is defined as $p_\alpha(f)=f(\alpha)$. A one-dimensional cylinder set is a set of the form $\prod_{\alpha\in\aleph}B_\alpha$ where there exists one $\alpha_0\in\aleph$ and $B\in\mathcal{A}_{\alpha_0}$ such that $B_{\alpha_0}=B$ and $B_\alpha=\mathcal{X}_\alpha$ for all $\alpha\neq\alpha_0$. Define $\bigotimes_{\alpha\in\aleph}\mathcal{A}_\alpha$ to be the σ-field generated by the one-dimensional cylinder sets, and call this the product σ-field.

Example 178. Let $\mathcal{X}=\mathbb{R}^k$ for finite $k$. For $1\leq i\leq k$, the $i$-coordinate projection function is $p_i(x_1,\ldots,x_k)=x_i$. An example of a one-dimensional cylinder set (in the case $k=3$) is $\mathbb{R}\times[-3.7,4.2)\times\mathbb{R}$.

Example 179. (Random probability measure) In Example 176, let $Q$ be a probability on $\mathcal{B}^1$. Then $Q$ is an element of the infinite product set $[0,1]^{\mathcal{B}^1}$. For each $B\in\aleph$, the $B$-coordinate projection function evaluated at $Q$ is $p_B(Q)=Q(B)$.

Lemma 180. The product σ-field is the smallest σ-field such that all $p_\alpha$ are measurable.

Proof. Notice that, for each $\alpha_0\in\aleph$ and each $B_{\alpha_0}\in\mathcal{A}_{\alpha_0}$, $p_{\alpha_0}^{-1}(B_{\alpha_0})$ is the one-dimensional cylinder set $\prod_{\alpha\in\aleph}B_\alpha$ where $B_\alpha=\mathcal{X}_\alpha$ for all $\alpha\neq\alpha_0$. This makes every $p_\alpha$ measurable. Notice also that the sets required to make all the $p_\alpha$ measurable generate the product σ-field, hence the product σ-field is the smallest σ-field such that the $p_\alpha$ are all measurable. $\square$

A stochastic process can be thought of as a random function. When a product space is explicitly considered a function space, the coordinate projection functions are sometimes called evaluation functionals.

Theorem 180. Let $(\Omega,\mathcal{F},P)$ be a probability space. Let $\aleph$ be a set. For each $\alpha\in\aleph$, let $(\mathcal{X}_\alpha,\mathcal{A}_\alpha)$ be a measurable space and let $X_\alpha:\Omega\to\mathcal{X}_\alpha$ be a function. Let $\mathcal{X}=\prod_{\alpha\in\aleph}\mathcal{X}_\alpha$. Define $X:\Omega\to\mathcal{X}$ by setting $X(\omega)$ to be the function $f$ defined by $f(\alpha)=X_\alpha(\omega)$ for all $\alpha$. Then $X$ is $\mathcal{F}/\bigotimes_{\alpha\in\aleph}\mathcal{A}_\alpha$-measurable if and only if each $X_\alpha:\Omega\to\mathcal{X}_\alpha$ is $\mathcal{F}/\mathcal{A}_\alpha$-measurable.

Proof. For the “if” direction, assume that each $X_\alpha$ is measurable. Let $\mathcal{C}$ be the collection of one-dimensional cylinder sets, which generates the product σ-field. Let $C\in\mathcal{C}$. Then there exists $\alpha_0$ and $B\in\mathcal{A}_{\alpha_0}$ such that $C=\prod_{\alpha\in\aleph}B_\alpha$ where $B_{\alpha_0}=B$ and $B_\alpha=\mathcal{X}_\alpha$ for all $\alpha\neq\alpha_0$. It follows that $X^{-1}(C)=X_{\alpha_0}^{-1}(B)\in\mathcal{F}$. So, $X$ is measurable by Lemma 60.

For the “only if” direction, assume that $X$ is measurable. Let $p_\alpha$ be the $\alpha$-coordinate projection function for each $\alpha\in\aleph$. It is trivial to see that $X_\alpha=p_\alpha(X)$. Since each $p_\alpha$ is measurable, it follows that each $X_\alpha$ is measurable. $\square$

The function $X$ defined in Theorem 180 is an alternative way to represent the stochastic process $\{X_\alpha : \alpha\in\aleph\}$. That is, instead of thinking of a stochastic process as an indexed set of random quantities, think of it as just another random quantity, but one whose range space is itself a function space. In this way, stochastic processes can be thought of as random functions. The idea is that, instead of thinking of $X_\alpha$ as a function of $\omega$ for each $\alpha$, think of $X(\omega)$ as a function of $\alpha$ for each $\omega$.

Here are some examples of how to think of stochastic processes as random functions and vice-versa.

Example 181. Let $\beta_0$ and $\beta_1$ be random variables. Let $\aleph=\mathbb{R}$. For each $x\in\mathbb{R}$, define $X_x(\omega)=\beta_0(\omega)+\beta_1(\omega)x$. Define $X$ as in Theorem 180. Then $X$ is a random linear function. This means that, for every $\omega$, $X(\omega)$ is a linear function from $\mathbb{R}$ to $\mathbb{R}$. Indeed, it is the function that maps the number $x$ to the number $\beta_0(\omega)+\beta_1(\omega)x$.
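
A tiny computational aside (ours): the random-function viewpoint is literally how one would code Example 181. Drawing $\omega$ fixes the coefficients; what is returned is a function of $x$.

    import numpy as np

    rng = np.random.default_rng(3)

    def draw_random_line():
        """Sample one omega: return the realized path x -> beta0 + beta1 * x."""
        beta0, beta1 = rng.normal(size=2)   # hypothetical N(0,1) coefficients
        return lambda x: beta0 + beta1 * x

    path = draw_random_line()    # one realization X(omega): a whole function
    print(path(0.0), path(1.0))  # evaluating the path = applying coordinate projections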

Example 182. (Random probability measure) In Example 173, define $X(\omega)$ to be the function (element of the product set) that maps each set $B$ to $\int_B f(x,\Theta(\omega))\,dx$. To see that $X:\Omega\to[0,1]^{\mathcal{B}^1}$ is measurable, let $C$ be the one-dimensional cylinder set $\prod_{B\in\aleph}C_B$ where each $C_B=[0,1]$ except $C_{B_0}=D$. Define $g(\theta)=\int_{B_0}f(x,\theta)\,dx$. We know that $g:\mathbb{R}^k\to[0,1]$ is measurable. Hence $g(\Theta):\Omega\to[0,1]$ is measurable. It follows that $X^{-1}(C)=(g(\Theta))^{-1}(D)$, a measurable set.

Clearly, there must exist probability measures on product spaces such as $\left(\prod_{\alpha\in\aleph}\mathcal{X}_\alpha,\bigotimes_{\alpha\in\aleph}\mathcal{A}_\alpha\right)$. If we start with a stochastic process $\{X_\alpha : \alpha\in\aleph\}$ and represent it as a random function $X$, then the distribution of $X$ is a probability measure on the product space. This distribution has the obvious marginal distributions for the individual $X_\alpha$'s. But, in general, nothing much can be said about other aspects of the joint distribution.

When a stochastic process is a sequence of independent random quantities, then we can say more.

Theorem 183. (Kolmogorov 0-1 law) Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent random quantities. Define $\mathcal{T}_n=\sigma(\{X_i : i\geq n\})$ and $\mathcal{T}=\bigcap_{n=1}^\infty\mathcal{T}_n$. Then every event in $\mathcal{T}$ has probability either 0 or 1.

Proof. Let $\mathcal{U}_n=\sigma(\{X_i : i\leq n\})$, and let $\mathcal{U}=\bigcup_{n=1}^\infty\mathcal{U}_n$. Let $A\in\mathcal{U}$ and $B\in\mathcal{T}$. There exists $n$ such that $A\in\mathcal{U}_n$. Because $B\in\mathcal{T}_{n+1}$, it follows that $A$ and $B$ are independent. So $\mathcal{U}$ and $\mathcal{T}$ are independent. It follows from Proposition 160 that $\sigma(\mathcal{U})=\sigma(\{X_n\}_{n=1}^\infty)$ and $\mathcal{T}$ are independent. Since $\mathcal{T}\subseteq\sigma(\mathcal{U})$, it follows that $\mathcal{T}$ is independent of itself, hence for all $B\in\mathcal{T}$, $\Pr(B)\in\{0,1\}$ by a homework problem. $\square$

Definition 184. The σ-field $\mathcal{T}$ in Theorem 168 is called the tail σ-field of the sequence $\{X_n\}_{n=1}^\infty$.

There is such a thing as product measure on an infinite product space, but to prove it, we need a little more machinery. There is a theorem that says that finite-dimensional distributions that satisfy a certain intuitive condition will determine a unique joint distribution on the product space. This theorem is stated and proven in another course document.

36-752: Lecture 14

Definition 185. Let $(\Omega,\mathcal{F},\mu)$ be a measure space and suppose $f$ is a Borel measurable function defined on this space. Define, for $1\leq p<\infty$,
\[
\|f\|_p=\left(\int|f|^p\,d\mu\right)^{1/p}
\]
and
\[
\|f\|_\infty=\inf\{\alpha : \mu[\omega : |f(\omega)|>\alpha]=0\}.
\]
Let $L^p(\Omega,\mathcal{F},\mu)$ denote the class of all Borel measurable functions $f$ such that $\|f\|_p<\infty$. When the measure space is clear, we usually write only $L^p$ to denote this space of functions.

Convergence of Random Variables. Let $(\Omega,\mathcal{F},P)$ be a probability space. We have already discussed convergence a.s., in the context of what a.s. means. Each $L^p$ space has a sense of convergence.

Definition 186. Suppose $f_1,f_2,\ldots$ is a sequence of Borel measurable functions defined on $(\Omega,\mathcal{F},\mu)$ and each $f_n\in L^p$. Let $f$ be another Borel measurable function on $(\Omega,\mathcal{F},\mu)$. Then we say that $f_n$ converges in $L^p$ to $f$ if $\|f_n-f\|_p\to 0$. Write this as $f_n\overset{L^p}{\to}f$.

Exercise 187. Assume $(\Omega,\mathcal{F},\mu)$ is a measure space with $\mu(\Omega)<\infty$. Show that, for $f,f_1,f_2,\ldots$ real-valued functions defined on this space, $f_n\overset{L^p}{\to}f$ implies $f_n\overset{L^r}{\to}f$ for $r<p$.

Convergence in $L^p$ is different from convergence a.s.

Example 188. Let $\Omega=(0,1)$ with $P$ being Lebesgue measure. Consider the sequence of functions $1,I_{(0,1/2]},I_{(1/2,1)},I_{(0,1/3]},I_{(1/3,2/3]},\ldots$. These functions converge to 0 in $L^p$ for all finite $p$, since the integrals of their absolute values go to 0. But they clearly don't converge to 0 a.s., since every $\omega$ has $f_n(\omega)=1$ infinitely often. These functions are in $L^\infty$, but they don't converge to 0 in $L^\infty$, because their $L^\infty$ norms are all 1.

Example 189. Let $\Omega=(0,1)$ with $P$ being Lebesgue measure. Consider the sequence of functions
\[
f_n(\omega)=\begin{cases} 0 & \text{if } 0<\omega<1/n,\\ 1/\omega & \text{if } 1/n\leq\omega<1.\end{cases}
\]
Each $f_n$ is in $L^p$ for all $p$, and $\lim_{n\to\infty}f_n(\omega)=1/\omega$ a.s. But the limit function is not in $L^p$ for even a single $p$. Clearly, $\{f_n\}_{n=1}^\infty$ does not converge in $L^p$.

Example 190. Let $\Omega=(0,1)$ with $P$ being Lebesgue measure. Consider the sequence of functions
\[
f_n(\omega)=\begin{cases} n & \text{if } 0<\omega<1/n,\\ 0 & \text{otherwise.}\end{cases}
\]
Then $f_n$ converges to 0 a.s. but not in $L^p$, since $\int|f_n|^p\,dP=n^{p-1}$ for all $n$ and finite $p$. In this case, the a.e. limit is in $L^p$, but it is not an $L^p$ limit.
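
A numerical aside (ours) on Example 190: the pointwise limit at any fixed $\omega$ is 0, while the exact $L^p$ masses $n^{p-1}$ refuse to vanish.

    import numpy as np

    w = 0.37                                 # any fixed point of (0,1)
    ns = np.array([10, 100, 1000, 10_000])
    f_at_w = np.where(1.0 / ns > w, ns, 0)   # f_n(w) is 0 once n > 1/w
    print(f_at_w)                            # [0 0 0 0]: the pointwise limit is 0
    print(ns ** 0, ns ** 1)                  # integral of |f_n|^p is n^(p-1): stays 1 for p=1, diverges for p=2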

Oddly enough, convergence in $L^\infty$ does imply convergence a.e., the reason being that $L^\infty$ convergence is “almost” uniform convergence.

Proposition 191. Let $(\Omega,\mathcal{F},\mu)$ be a measure space. If $f_n$ converges to $f$ in $L^\infty$, then $\lim_{n\to\infty}f_n=f$, a.e. $[\mu]$.

There are other modes of convergence besides those mentioned above.

Definition 192. Let $(\Omega,\mathcal{F},\mu)$ be a measure space and let $f$ and $\{f_n\}_{n=1}^\infty$ be measurable functions that take values in a metric space with metric $d$. We say that $f_n$ converges to $f$ in measure if, for every $\epsilon>0$,
\[
\lim_{n\to\infty}\mu(\{\omega : d(f_n(\omega),f(\omega))>\epsilon\})=0.
\]
When $\mu$ is a probability, convergence in measure is called convergence in probability, denoted $f_n\overset{P}{\to}f$.

Convergence in measure is different from a.e. convergence. Example 188 is a classic example of a sequence that converges in measure (in probability in that example) but not a.e. Here is an example of a.e. convergence without convergence in measure (only possible in infinite measure spaces).

Example 193. Let $\Omega=\mathbb{R}$ with $\mu$ being Lebesgue measure. Let $f_n(x)=I_{[n,\infty)}(x)$ for all $n$. Then $f_n$ converges to 0 a.e. $[\mu]$. However, $f_n$ does not converge in measure to 0, because $\mu(\{|f_n|>\epsilon\})=\infty$ for every $n$.

Example 190 is an example of convergence in probability but not in $L^p$. Indeed, convergence in probability is weaker than $L^p$ convergence.
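
A computational aside (ours) on Example 188: the moving indicator blocks have shrinking width, so $P(f_n=1)\to 0$ (convergence in probability), yet every $\omega$ lands inside infinitely many blocks (no a.s. convergence). Both facts can be checked directly:

    import numpy as np

    def block(n):
        """Endpoints of the n-th moving block: (0,1], (0,1/2], (1/2,1], (0,1/3], ..."""
        k, start = 1, 0
        while start + k <= n:
            start += k
            k += 1
        j = n - start
        return j / k, (j + 1) / k

    w = 0.61                       # a fixed sample point
    widths, hits = [], []
    for n in range(5000):
        a, b = block(n)
        widths.append(b - a)       # P(f_n = 1) = block width -> 0
        if a < w <= b:
            hits.append(n)         # f_n(w) = 1 at these indices

    print(widths[-1])              # small: convergence in probability
    print(len(hits), hits[-3:])    # the hits never stop: no a.s. convergence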

Proposition 194. If $X_n$ converges to $X$ in $L^p$ for some $p\geq 1$, then $X_n\overset{P}{\to}X$.

Convergence in probability is also weaker than convergence a.s.

Lemma 195. If $X_n\to X$ a.s., then $X_n\overset{P}{\to}X$.

Proof. Let $\epsilon>0$. Let $C=\{\omega : \lim_{n\to\infty}X_n(\omega)=X(\omega)\}$, and define $C_n=\{\omega : d(X_k(\omega),X(\omega))<\epsilon\text{ for all }k\geq n\}$. Clearly, $C\subseteq\bigcup_{n=1}^\infty C_n$. Because $\Pr(C)=1$ and $\{C_n\}_{n=1}^\infty$ is an increasing sequence of events, $\Pr(C_n)\to 1$. Because $\{\omega : d(X_n(\omega),X(\omega))>\epsilon\}\subseteq C_n^C$,
\[
\Pr(d(X_n,X)>\epsilon)\to 0.\qquad\square
\]
A partial converse of this lemma is true.

Lemma 196. If $X_n\overset{P}{\to}X$, then there is a subsequence $\{X_{n_k}\}_{k=1}^\infty$ such that $X_{n_k}\overset{\text{a.s.}}{\to}X$.

Proof. Let $n_k$ be large enough so that $n_k>n_{k-1}$ and $\Pr(d(X_{n_k},X)>1/2^k)<1/2^k$. Because $\sum_{k=1}^\infty\Pr(d(X_{n_k},X)>1/2^k)<\infty$, we know that $\Pr(d(X_{n_k},X)>1/2^k\text{ i.o.})=0$. Let $A=\{d(X_{n_k},X)>1/2^k\text{ i.o.}\}$. Then $\Pr(A^C)=1$ and $\lim_{k\to\infty}X_{n_k}(\omega)=X(\omega)$ for every $\omega\in A^C$. $\square$

There is an even weaker form of convergence that we will discuss later in the course.

36-752: Lecture 15

Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables. As we pointed out earlier, the tail σ-field contains all events of the form $\{X_n\text{ converges}\}$ or $\{X_n\text{ converges to }c\}$. Because $\frac{1}{n}\sum_{i=1}^n X_i$ converges if and only if $\frac{1}{n}\sum_{i=\ell}^n X_i$ converges for all $\ell=1,2,\ldots$, we see that $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n X_i$, if it exists, is measurable with respect to the tail σ-field. The Kolmogorov 0-1 law says that the tail σ-field of an independent sequence has all probabilities 0 and 1. So, the sample averages of an independent sequence must converge a.s. to constants if they converge at all. Also, $\sum_{i=1}^n X_i$ must converge either a.s. or with probability 0, although it will not necessarily be measurable with respect to the tail σ-field. Next, we will begin the study of sums of independent random variables, finding conditions under which sums and averages converge or don't.

Sums of Independent Random Variables. There are several useful theorems about sums of independent random variables. All of these make use of a common setup. Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables, and define, for each $n$, $S_n=\sum_{k=1}^n X_k$. First, there is the weak law of large numbers, this version of which does not assume that the $X_n$'s are independent.

Theorem 197. (Weak law of large numbers) Let $\{X_n\}_{n=1}^\infty$ be uncorrelated random variables with mean 0 and such that $\sum_{i=1}^n\mathrm{Var}(X_i)=o(n^2)$. Then $\frac{1}{n}\sum_{i=1}^n X_i\overset{P}{\to}0$.

Proof. Since the $X_n$'s are uncorrelated,
\[
\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\sum_{i=1}^n\mathrm{Var}(X_i),
\]
which we have assumed goes to 0 as $n\to\infty$. According to Tchebychev's inequality (Corollary 94),
\[
\Pr\left(\left|\frac{1}{n}\sum_{i=1}^n X_i\right|>\epsilon\right)\leq\frac{1}{\epsilon^2}\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right),
\]
which we just showed goes to 0 as $n\to\infty$. $\square$

There are various strong laws of large numbers that conclude that the average converges almost surely. A proof of Theorem 198 is given in another course document.

Theorem 198. (Strong law of large numbers) Assume that $\{X_k\}_{k=1}^\infty$ are independent and identically distributed random variables with finite mean $\mu$. Then $\lim_{n\to\infty}S_n/n=\mu$, a.s.

We will prove a stronger law than Theorem 198 later in the course. For now, we will concentrate on sums of independent random variables.
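
A quick illustration (our aside): simulated running averages of iid Uniform(0,1) variables settle onto the mean $1/2$, as Theorem 198 predicts.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.random(100_000)   # iid Uniform(0,1), mean 1/2
    running_avg = np.cumsum(x) / np.arange(1, x.size + 1)
    print(running_avg[[99, 999, 9999, 99_999]])   # drifting toward 0.5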

Theorem 199. (Kolmogorov's maximal inequality) Let $\{X_k\}_{k=1}^n$ be a finite collection of independent random variables with finite variance and mean 0. Define $S_k=\sum_{i=1}^k X_i$ for all $k$. Then
\[
\Pr\left(\max_{1\leq k\leq n}|S_k|\geq\epsilon\right)\leq\frac{\mathrm{Var}(S_n)}{\epsilon^2}.
\]

Proof. For $n=1$, the result is just Chebyshev's inequality, so assume that $n>1$ for the rest of the proof. Let $A_k$ be the event that $|S_k|\geq\epsilon$ but $|S_j|<\epsilon$ for $j<k$. These events are disjoint, and
\[
(200)\qquad \left\{\max_{1\leq k\leq n}|S_k|\geq\epsilon\right\}=\bigcup_{k=1}^n A_k.
\]
It follows that
\[
\mathrm{E}(S_n^2)\geq\sum_{k=1}^n\int_{A_k}S_n^2\,dP=\sum_{k=1}^n\int_{A_k}\left[S_k^2+2S_k(S_n-S_k)+(S_n-S_k)^2\right]dP\geq\sum_{k=1}^n\int_{A_k}\left[S_k^2+2S_k(S_n-S_k)\right]dP
\]
\[
=\sum_{k=1}^n\int_{A_k}S_k^2\,dP\geq\sum_{k=1}^n\epsilon^2\Pr(A_k)=\epsilon^2\Pr\left(\max_{1\leq k\leq n}|S_k|\geq\epsilon\right),
\]
where the first two inequalities and the first equality are obvious. The second equality follows from the fact that $I_{A_k}S_k$ is independent of $(S_n-S_k)$, which has mean 0. The third inequality follows since $S_k^2\geq\epsilon^2$ on $A_k$, and the third equality follows from (200). $\square$

The reason that this theorem works is that whenever the maximum $|S_k|$ is large, it most likely is $|S_n|$ that is large. There is another inequality like that of Kolmogorov that is often used in proofs, but we will not discuss it in this class:

Proposition 201. (Etemadi Lemma) Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent random variables. Then, for each $\epsilon>0$ and each finite or infinite $m$,
\[
\Pr\left(\max_{1\leq n\leq m}|S_n|>3\epsilon\right)\leq 3\max_{1\leq n\leq m}\Pr(|S_n|>\epsilon).
\]
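
A Monte Carlo aside (ours) checking Theorem 199: with iid mean-zero steps, the frequency of $\max_{k\leq n}|S_k|\geq\epsilon$ should sit below $\mathrm{Var}(S_n)/\epsilon^2$.

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps, eps = 50, 20_000, 10.0
    steps = rng.choice([-1.0, 1.0], size=(reps, n))   # iid signs: mean 0, variance 1
    s = np.cumsum(steps, axis=1)                      # partial sums S_1, ..., S_n per replication

    lhs = np.mean(np.max(np.abs(s), axis=1) >= eps)   # empirical P(max_k |S_k| >= eps)
    print(lhs, "<=", n / eps**2)                      # the bound Var(S_n)/eps^2 = 0.5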

The first theorem on the convergence of sums has a simple condition.

Theorem 202. Let $\{X_n\}_{n=1}^\infty$ be independent with mean 0 and suppose that $\sum_{n=1}^\infty\mathrm{Var}(X_n)<\infty$. Then $S_n$ converges a.s.

Proof. The proof is to show that $S_n$ is a Cauchy sequence a.s. The sequence $\{S_n(\omega)\}_{n=1}^\infty$ is not Cauchy if and only if there exists a rational $\epsilon>0$ such that for every $n$, $\sup_{j,k>n}|S_j(\omega)-S_k(\omega)|\geq\epsilon$. For each $n$ and $\epsilon>0$, let
\[
B_{n,\epsilon}=\left\{\sup_{j,k>n}|S_j-S_k|\geq\epsilon\right\},\qquad C_{n,\epsilon}=\left\{\sup_{k\geq 1}|S_{n+k}-S_n|\geq\frac{\epsilon}{2}\right\},
\]
so that $B_{n,\epsilon}\subseteq C_{n,\epsilon}$ for all $n$ and $\epsilon>0$. Then $\{S_n(\omega)\}_{n=1}^\infty$ is not a Cauchy sequence if and only if
\[
\omega\in\bigcup_{\text{rational }\epsilon>0}\ \bigcap_{n=1}^\infty B_{n,\epsilon}\subseteq\bigcup_{\text{rational }\epsilon>0}\ \bigcap_{n=1}^\infty C_{n,\epsilon}.
\]
So, it suffices to show that
\[
(203)\qquad \lim_{n\to\infty}\Pr\left(\sup_{k\geq 1}|S_{n+k}-S_n|\geq\frac{\epsilon}{2}\right)=0.
\]
To show (203), use Theorem 199 to see that, for each $r\geq 1$,
\[
\Pr\left(\max_{1\leq k\leq r}|S_{n+k}-S_n|\geq\frac{\epsilon}{2}\right)\leq\frac{4}{\epsilon^2}\sum_{k=1}^r\mathrm{Var}(X_{n+k}).
\]
The sets whose probabilities are on the left side increase with $r$, so we can take a limit on both sides as $r\to\infty$:
\[
\Pr\left(\sup_{k\geq 1}|S_{n+k}-S_n|\geq\frac{\epsilon}{2}\right)\leq\frac{4}{\epsilon^2}\sum_{k=1}^\infty\mathrm{Var}(X_{n+k})=\frac{4}{\epsilon^2}\sum_{j=n+1}^\infty\mathrm{Var}(X_j).
\]
Since $\sum_{j=1}^\infty\mathrm{Var}(X_j)<\infty$, the tail sums must go to 0, and this implies (203). $\square$

Corollary 204. Let $\{X_n\}_{n=1}^\infty$ be independent. Suppose that $\sum_{n=1}^\infty\mathrm{Var}(X_n)<\infty$ and $\sum_{k=1}^n\mathrm{E}(X_k)$ converges. Then $S_n$ converges a.s.

Proof. Let $\mu_n=\sum_{k=1}^n\mathrm{E}(X_k)$. Write $S_n=(S_n-\mu_n)+\mu_n$. Theorem 202 says that $S_n-\mu_n$ converges a.s., and we are assuming that $\mu_n$ converges, hence the sum of the two converges a.s. $\square$

Example 205. Let the $X_n$'s have normal distribution with mean $1/n^2$ and variance $1/n^2$. Then Theorem 202 says that the partial sums $\sum_{i=1}^n(X_i-1/i^2)$ converge a.s. It follows easily that $\sum_{i=1}^n X_i$ converges a.s. as well. Later we will be able to prove that the distribution of the limit is normal with mean and variance equal to $\sum_{n=1}^\infty 1/n^2$.

Example 206. Let the $X_n$'s have uniform distributions on the intervals $[-1/n,1/n]$. Then $\sum_{i=1}^n X_i$ converges a.s.
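
An illustrative simulation (ours) of Theorem 202 via Example 205: independent sample paths of $\sum_{i\leq n}X_i$ with $X_i\sim N(1/i^2,1/i^2)$ each flatten out to their own (random) limits.

    import numpy as np

    rng = np.random.default_rng(6)
    i = np.arange(1, 50_001)
    for _ in range(3):
        x = rng.normal(loc=1.0 / i**2, scale=1.0 / i)   # sd 1/i, so variance 1/i^2
        s = np.cumsum(x)
        print(s[999], s[9999], s[-1])                   # each path has essentially converged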

Strong Law of Large Numbers

The preliminary results in this document are numbered locally because they do not figure in the course notes.

Lemma 207. (Kronecker's lemma) Let $\{x_k\}_{k=1}^\infty$ and $\{b_k\}_{k=1}^\infty$ be sequences of real numbers such that $\sum_{k=1}^\infty x_k=s<\infty$ and $b_k\uparrow\infty$. Then
\[
\lim_{n\to\infty}\frac{1}{b_n}\sum_{k=1}^n b_k x_k=0.
\]

Proof. Define $r_n=\sum_{k=n+1}^\infty x_k$, so that $r_0=s$. Then $x_k=r_{k-1}-r_k$ for all $k$. So
\[
\sum_{k=1}^n b_k x_k=\sum_{k=1}^n b_k(r_{k-1}-r_k)=\sum_{k=0}^{n-1}b_{k+1}r_k-\sum_{k=1}^n b_k r_k=\sum_{k=1}^{n-1}(b_{k+1}-b_k)r_k+b_1 s-b_n r_n.
\]
Take absolute values to conclude that
\[
\left|\sum_{k=1}^n b_k x_k\right|\leq\sum_{k=1}^{n-1}(b_{k+1}-b_k)|r_k|+b_1|s|+b_n|r_n|.
\]
Let $\epsilon>0$. Because $|r_n|\to 0$, there exists $N$ such that for all $k\geq N$, $|r_k|<\epsilon$. It follows that
\[
\lim_{n\to\infty}\frac{1}{b_n}\left|\sum_{k=1}^n b_k x_k\right|\leq\lim_{n\to\infty}\frac{\epsilon}{b_n}\sum_{k=N}^{n-1}(b_{k+1}-b_k)=\epsilon\lim_{n\to\infty}\left(1-\frac{b_N}{b_n}\right)=\epsilon.
\]
Since this is true for all $\epsilon>0$, the limit is 0. $\square$

Theorem 198. (Strong law of large numbers) Assume that $\{X_k\}_{k=1}^\infty$ are independent and identically distributed random variables with finite mean $\mu$. Then $\lim_{n\to\infty}S_n/n=\mu$, a.s.

Proof. Define $Y_k=X_kI_{[-k,k]}(X_k)$, $S_n^*=\sum_{k=1}^n Y_k$, and $\mu_k=\mathrm{E}(Y_k)$. Let $\mu_X$ denote the distribution of each $X_i$. Recall that $\mathrm{Var}(Y_k)\leq\mathrm{E}(Y_k^2)$. Also,
\[
\sum_{k=1}^\infty\frac{1}{k^2}\mathrm{E}(Y_k^2)=\sum_{k=1}^\infty\frac{1}{k^2}\int_{|x|\leq k}x^2\,d\mu_X(x)=\sum_{k=1}^\infty\frac{1}{k^2}\sum_{j=1}^k\int_{j-1<|x|\leq j}x^2\,d\mu_X(x)
\]
\[
=\sum_{j=1}^\infty\left(\int_{j-1<|x|\leq j}x^2\,d\mu_X(x)\right)\sum_{k=j}^\infty\frac{1}{k^2}<\sum_{j=1}^\infty\frac{2}{j}\int_{j-1<|x|\leq j}x^2\,d\mu_X(x)\leq 2\mathrm{E}(|X_1|)<\infty,
\]
where the first inequality follows from the fact that $\sum_{k=j}^\infty 1/k^2<2/j$ (and the second since $x^2\leq j|x|$ when $|x|\leq j$). So, $\sum_{k=1}^n\mathrm{Var}(Y_k)/k^2$ converges. It follows from Theorem 202 in the class notes that $\sum_{k=1}^n(Y_k-\mu_k)/k$ converges a.s. Now, apply Lemma 207 to conclude that $\frac{1}{n}\sum_{k=1}^n(Y_k-\mu_k)$ converges a.s. to 0. Since $\mu_k\to\mu$, it follows that $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n\mu_k=\mu$. So $\frac{1}{n}\sum_{k=1}^n Y_k$ converges a.s. to $\mu$.

Notice that $\Pr(Y_k\neq X_k)=\Pr(|X_k|>k)$. Recall that
\[
\mathrm{E}(|X_1|)=\int_0^\infty\Pr(|X_1|>t)\,dt\geq\sum_{k=1}^\infty\Pr(|X_k|>k).
\]
Because $\mathrm{E}(|X_1|)<\infty$, the first Borel-Cantelli lemma says that $\Pr(Y_k\neq X_k\text{ i.o.})=0$. Hence $\sum_{k=1}^\infty(Y_k-X_k)$ is finite a.s. and $\frac{1}{n}\sum_{k=1}^n X_k$ converges a.s. to $\mu$. $\square$

36-752: Lecture 16

Another interesting theorem about sums of independent random variables is the following. It gives necessary and sufficient conditions for convergence of $S_n$. For each $c>0$ and each $n$, let $X_n^{(c)}(\omega)=X_n(\omega)I_{[0,c]}(|X_n(\omega)|)$. We will prove only the sufficiency part of the result. The necessity proof is not included here but can be found in another course document.

Theorem 208. (Three-series theorem) Suppose that $\{X_n\}_{n=1}^\infty$ are independent. For each $c>0$, consider the following three series:
\[
(209)\qquad \sum_{n=1}^\infty\Pr(|X_n|>c),\qquad \sum_{n=1}^\infty\mathrm{E}(X_n^{(c)}),\qquad \sum_{n=1}^\infty\mathrm{Var}(X_n^{(c)}).
\]
A necessary condition for $S_n$ to converge a.s. is that all three series are finite for all $c>0$. A sufficient condition is that all three series converge for some $c>0$.

Example 210. Let $X_n$ have a uniform distribution on the interval $[a_n,b_n]$. A necessary condition for convergence of $S_n$ is that $\sum_{n=1}^\infty(b_n-a_n)^2<\infty$ (the third series). Another necessary condition is that $\sum_{n=1}^\infty(a_n+b_n)$ converge (the second series). It follows that $a_n$ and $b_n$ must both converge to 0, so that the first series also converges for all $c>0$. That the two conditions above are sufficient for the convergence of $S_n$ follows from Corollary 204.

Proof (Theorem 208, sufficiency). First, define some notation. For each $c>0$ and each $n$, define
\[
S_n^{(c)}=\sum_{k=1}^n X_k^{(c)},\qquad M_n^{(c)}=\sum_{k=1}^n\mathrm{E}(X_k^{(c)}),\qquad s_n^{(c)}=\sqrt{\sum_{k=1}^n\mathrm{Var}(X_k^{(c)})}.
\]
For sufficiency, assume that all three series converge for some $c>0$. Because the second and third series in (209) converge, Corollary 204 says that $S_n^{(c)}$ converges a.s. We know that $\Pr(X_n\neq X_n^{(c)})=\Pr(|X_n|>c)$. Since the first series in (209) converges, the first Borel-Cantelli lemma says that $\Pr(X_n\neq X_n^{(c)}\text{ i.o.})=0$. Hence, for almost all $\omega$, there exists $N(\omega)$ such that $S_n(\omega)-S_n^{(c)}(\omega)$ is the same for all $n\geq N(\omega)$. Hence $S_n(\omega)$ converges for almost all $\omega$. $\square$

Example 211. Let
\[
\Pr(X_n=x)=\begin{cases} \frac{1}{2n^2} & \text{if } x=n\text{ or }x=-n,\\[2pt] \frac{1}{2}-\frac{1}{2n^2} & \text{if } x=-1/n\text{ or }x=1/n,\\[2pt] 0 & \text{otherwise.}\end{cases}
\]
Then $\mathrm{E}(X_n)=0$ and $\mathrm{Var}(X_n)=1+1/n^2-1/n^4$. So Theorem 202 does not imply that $S_n$ converges a.s. However, for $c>0$, $\mathrm{E}(X_n^{(c)})=0$ and $\mathrm{Var}(X_n^{(c)})$ eventually equals $1/n^2-1/n^4$, while $\Pr(|X_n|>c)$ eventually equals $1/n^2$, so the three-series theorem does imply that $S_n$ converges a.s.

Conditional Expectation. The measure-theoretic definition of conditional expectation is a bit unintuitive, but we will show how it matches what we already know from earlier study.

Definition 212. Let $(\Omega,\mathcal{F},P)$ be a probability space, and let $\mathcal{C}\subseteq\mathcal{F}$ be a sub-σ-field. Let $X$ be a random variable whose mean is defined. We use the symbol $\mathrm{E}(X|\mathcal{C})$ to stand for any function $h:\Omega\to\mathbb{R}$ that is $\mathcal{C}/\mathcal{B}^1$-measurable and that satisfies
\[
(213)\qquad \int_C h\,dP=\int_C X\,dP,\quad\text{for all }C\in\mathcal{C}.
\]
We call such a function $h$ a version of the conditional expectation of $X$ given $\mathcal{C}$.

Equation (213) can also be written $\mathrm{E}(I_C h)=\mathrm{E}(I_C X)$ for all $C\in\mathcal{C}$. Any two versions of $\mathrm{E}(X|\mathcal{C})$ must be equal a.s. according to Theorem 119 (part 3). Also, any $\mathcal{C}/\mathcal{B}^1$-measurable function that equals a version of $\mathrm{E}(X|\mathcal{C})$ a.s. is another version.

Example 214. If $X$ is itself $\mathcal{C}/\mathcal{B}^1$-measurable, then $X$ is a version of $\mathrm{E}(X|\mathcal{C})$.

Example 215. If $X=a$ a.s., then $\mathrm{E}(X|\mathcal{C})=a$ a.s.

Let $Y$ be a random quantity and let $\mathcal{C}=\sigma(Y)$. We will use the notation $\mathrm{E}(X|Y)$ to stand for $\mathrm{E}(X|\mathcal{C})$. According to Theorem 147, $\mathrm{E}(X|Y)$ is some function $g(Y)$ because it is $\sigma(Y)/\mathcal{B}^1$-measurable. We will also use the notation $\mathrm{E}(X|Y=y)$ to stand for $g(y)$.

Example 216. (Joint Densities) Let $(X,Y)$ be a pair of random variables with a joint density $f_{X,Y}$ with respect to Lebesgue measure. Let $\mathcal{C}=\sigma(Y)$. The usual marginal and conditional densities are
\[
f_Y(y)=\int f_{X,Y}(x,y)\,dx,\qquad f_{X|Y}(x|y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}.
\]
The traditional calculation of the conditional mean of $X$ given $Y=y$ is
\[
g(y)=\int x f_{X|Y}(x|y)\,dx.
\]
That is, $\mathrm{E}(X|Y)=g(Y)$ is the traditional definition of the conditional mean of $X$ given $Y$. We also use the symbol $\mathrm{E}(X|Y=y)$ to stand for $g(y)$. We can prove that $h=g(Y)$ is a version of the conditional mean according to Definition 212. Since $g(Y)$ is a function of $Y$, we know that it is $\mathcal{C}/\mathcal{B}^1$-measurable. We need to show that (213) holds. Let $C\in\mathcal{C}$, so that there exists $B\in\mathcal{B}^1$ so that $C=Y^{-1}(B)$. Then $I_C(\omega)=I_B(Y(\omega))$ for all $\omega$. Then
\[
\int_C h\,dP=\int I_C h\,dP=\int I_B(Y)g(Y)\,dP=\int I_B g\,d\mu_Y=\int I_B(y)g(y)f_Y(y)\,dy
\]
\[
=\int I_B(y)\left[\int x f_{X|Y}(x|y)\,dx\right]f_Y(y)\,dy=\int\!\!\int I_B(y)\,x\,f_{X,Y}(x,y)\,dx\,dy=\mathrm{E}(I_B(Y)X)=\mathrm{E}(I_C X).
\]

Example 216 can be extended easily to handle two more general cases. First, we could find $\mathrm{E}(r(X)|Y)$ by virtually the same calculation. Second, the use of conditional densities extends to the case in which the joint distribution of $(X,Y)$ has a density with respect to an arbitrary product measure.
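
A numerical aside (ours) on Example 216, using the made-up joint density $f_{X,Y}(x,y)=x+y$ on the unit square: compute $g(y)=\mathrm{E}(X|Y=y)$ by quadrature and check the defining equation on $C=\{Y\leq 1/2\}$.

    import numpy as np

    x = np.linspace(0.0, 1.0, 2001)
    y = np.linspace(0.0, 1.0, 2001)
    X, Y = np.meshgrid(x, y)
    f = X + Y                                            # hypothetical joint density on [0,1]^2

    f_y = np.trapz(f, x, axis=1)                         # marginal density of Y
    g = np.trapz(X * f, x, axis=1) / f_y                 # g(y) = E(X | Y = y)

    c = (y <= 0.5).astype(float)                         # indicator of C as a function of y
    lhs = np.trapz(c * g * f_y, y)                       # E[g(Y) I_C]
    rhs = np.trapz(c * np.trapz(X * f, x, axis=1), y)    # E[X I_C]
    print(lhs, rhs)                                      # both approximately 11/48

The two integrals coincide, mirroring the chain of equalities in Example 216.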

Three-Series Theorem

Theorem 208. (Three-series theorem) Suppose that $\{X_n\}_{n=1}^\infty$ are independent. For each $c>0$, consider the following three series:
\[
(209)\qquad \sum_{n=1}^\infty\Pr(|X_n|>c),\qquad \sum_{n=1}^\infty\mathrm{E}(X_n^{(c)}),\qquad \sum_{n=1}^\infty\mathrm{Var}(X_n^{(c)}).
\]
A necessary condition for $S_n$ to converge a.s. is that all three series are finite for all $c>0$. A sufficient condition is that all three series converge for some $c>0$.

Proof. Recall some notation. For each $c>0$ and each $n$, define
\[
m_n^{(c)}=\mathrm{E}(X_n^{(c)}),\qquad S_n^{(c)}=\sum_{k=1}^n X_k^{(c)},\qquad M_n^{(c)}=\sum_{k=1}^n m_k^{(c)},\qquad s_n^{(c)}=\sqrt{\sum_{k=1}^n\mathrm{Var}(X_k^{(c)})}.
\]
The sufficiency was proved in the course notes. For necessity, suppose that $S_n$ converges a.s. Let $c>0$. For each $\omega$ such that $S_n(\omega)$ converges, we must have $\lim_{n\to\infty}X_n(\omega)=0$. It follows that $X_n(\omega)=X_n^{(c)}(\omega)$ for all but finitely many $n$, and so $S_n^{(c)}(\omega)$ converges. Since $\Pr(X_n\neq X_n^{(c)})=\Pr(|X_n|>c)$, the contrapositive of the second Borel-Cantelli lemma says that the first series in (209) converges. Suppose that the third series in (209) diverges. Since the $X_n^{(c)}-m_n^{(c)}$ are uniformly bounded and $s_n^{(c)}\to\infty$, the central limit theorem says that, for all $y>x$,
\[
(217)\qquad \lim_{n\to\infty}\Pr\left(x<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\leq y\right)=\Phi(y)-\Phi(x),
\]
where $\Phi$ is the standard normal df. Since $S_n^{(c)}$ converges a.s., we have $\lim_{n\to\infty}S_n^{(c)}/s_n^{(c)}=0$ a.s. Hence, $S_n^{(c)}/s_n^{(c)}\overset{P}{\to}0$. For each $1/2>\epsilon>0$,
\[
(218)\qquad \Pr\left(x<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\leq y,\ \left|\frac{S_n^{(c)}}{s_n^{(c)}}\right|<\epsilon\right)
\]
\[
(219)\qquad\qquad \geq\Pr\left(x<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\leq y\right)-\Pr\left(\left|\frac{S_n^{(c)}}{s_n^{(c)}}\right|\geq\epsilon\right).
\]
Notice that the event on the left side of (218) can occur only if $x-\epsilon<-M_n^{(c)}/s_n^{(c)}\leq y+\epsilon$. Hence, the event on the left side of (218) cannot occur for both of the pairs $(x,y)=(\epsilon-1,-\epsilon)$ and $(x,y)=(\epsilon,1-\epsilon)$. Let $\delta>0$ be smaller than both $\Phi(-\epsilon)-\Phi(\epsilon-1)$ and $\Phi(1-\epsilon)-\Phi(\epsilon)$. Then (217) says that there exists $N_1(x,y)$ large enough so that $n\geq N_1(x,y)$ implies that the first probability on the right of (219) is within $\delta/2$ of $\Phi(y)-\Phi(x)$. Also, since $S_n^{(c)}/s_n^{(c)}\overset{P}{\to}0$, there exists $N_2$ so that $n\geq N_2$ implies that the second probability on the right of (219) is at most $\delta/2$. So, if
\[
n\geq\max\{N_1(\epsilon-1,-\epsilon),\ N_1(\epsilon,1-\epsilon),\ N_2\},
\]
we have
\[
\Pr\left(\epsilon-1<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\leq-\epsilon,\ \left|\frac{S_n^{(c)}}{s_n^{(c)}}\right|<\epsilon\right)\geq\Phi(-\epsilon)-\Phi(\epsilon-1)-\delta>0,
\]
\[
\Pr\left(\epsilon<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\leq 1-\epsilon,\ \left|\frac{S_n^{(c)}}{s_n^{(c)}}\right|<\epsilon\right)\geq\Phi(1-\epsilon)-\Phi(\epsilon)-\delta>0.
\]
This contradicts the fact that at least one of the two events on the far left sides of these inequalities is impossible. Hence, $s_n^{(c)}$ cannot diverge, and the third series in (209) converges. Theorem 202 now says that $S_n^{(c)}-M_n^{(c)}$ converges a.s. Since we already showed that $S_n^{(c)}$ converges a.s., the second series in (209) must converge. $\square$

36-752: Lecture 17

All of the familiar results about conditional expectation are special cases of the general definition. Here is an unfamiliar example.

Example 220. Let $X_1,X_2$ be independent with $U(0,\theta)$ distribution for some known $\theta$. Let $Y=\max\{X_1,X_2\}$ and $X=X_1$. Find the conditional mean of $X$ given $Y$. In this case $(X,Y)$ do not have a joint density with respect to any product measure. But we can argue what the conditional distribution, and hence conditional mean, of $X$ given $Y$ should be. With probability 1/2, $X=Y$. With probability 1/2, $X$ is the min of $X_1$ and $X_2$ and ought to be uniformly distributed between 0 and $Y$. The mean of this hybrid distribution is $Y/2+Y/4=3Y/4$. Let's verify this.

First, we see that $h=3Y/4$ is measurable with respect to $\mathcal{C}=\sigma(Y)$. Next, let $C\in\mathcal{C}$. We need to show that $\mathrm{E}(XI_C)=\mathrm{E}([3Y/4]I_C)$. Theorem 119 (part 4) says that we only need to check this for sets of the form $C=Y^{-1}([0,d])$ with $0<d<\theta$. Since $Y$ has density $2y/\theta^2$ on $(0,\theta)$,
\[
\mathrm{E}([3Y/4]I_C)=\int_0^d\frac{3y}{4}\cdot\frac{2y}{\theta^2}\,dy=\frac{d^3}{2\theta^2},
\]
while, since $(X_1,X_2)$ is uniform on the square $(0,\theta)^2$ and $C=\{X_1\leq d,X_2\leq d\}$,
\[
\mathrm{E}(XI_C)=\frac{1}{\theta^2}\int_0^d\int_0^d x_1\,dx_1\,dx_2=\frac{d^3}{2\theta^2}.
\]
The two agree, so $3Y/4$ is a version of $\mathrm{E}(X|Y)$.
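
A simulation aside (ours) agreeing with this calculation:

    import numpy as np

    rng = np.random.default_rng(7)
    theta, d = 2.0, 1.2
    x1, x2 = rng.uniform(0, theta, size=(2, 1_000_000))
    y = np.maximum(x1, x2)

    c = y <= d                     # a generating set C = {Y <= d}
    print(np.mean(x1 * c))         # E(X I_C), approximately d^3 / (2 theta^2) = 0.216
    print(np.mean(0.75 * y * c))   # E([3Y/4] I_C): matches up to Monte Carlo error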

A reminder about versions: If two functions $h_1$ and $h_2$ are both $\mathcal{C}/\mathcal{B}^1$-measurable and if they both satisfy $\mathrm{E}(h_iI_C)=\mathrm{E}(XI_C)$ for all $C\in\mathcal{C}$, then they are both versions of $\mathrm{E}(X|\mathcal{C})$. Similarly, any function $h'$ that equals a version of $\mathrm{E}(X|\mathcal{C})$ a.s. and is $\mathcal{C}/\mathcal{B}^1$-measurable is another version.

Example 222. In Example 220,
\[
h'=\begin{cases} 3Y/4 & \text{if } Y\text{ is irrational},\\ 0 & \text{otherwise},\end{cases}
\]
is another version of $\mathrm{E}(X|Y)$.

The following fact is immediate by letting $C=\Omega$.

Proposition 223. $\mathrm{E}(\mathrm{E}(X|\mathcal{C}))=\mathrm{E}(X)$.

Here is a generalization of Proposition 223, which is sometimes called the tower property of conditional expectations.

Proposition 224. (Law of total probability) If $\mathcal{C}_1\subseteq\mathcal{C}_2\subseteq\mathcal{F}$ are sub-σ-fields and $\mathrm{E}(X)$ exists, then $\mathrm{E}(X|\mathcal{C}_1)$ is a version of $\mathrm{E}(\mathrm{E}(X|\mathcal{C}_2)|\mathcal{C}_1)$.

Proof. By definition, $\mathrm{E}(X|\mathcal{C}_1)$ is $\mathcal{C}_1/\mathcal{B}^1$-measurable. We need to show that, for every $C\in\mathcal{C}_1$,
\[
\int_C\mathrm{E}(X|\mathcal{C}_1)\,dP=\int_C\mathrm{E}(X|\mathcal{C}_2)\,dP.
\]
The left side is $\mathrm{E}(XI_C)$ by definition of conditional mean. Similarly, because $C\in\mathcal{C}_2$ also, the right side is $\mathrm{E}(XI_C)$ as well. $\square$

Example 225. Let $(X,Y,Z)$ be a triple of random variables. Then $\mathrm{E}(X|Y)$ is a version of $\mathrm{E}(\mathrm{E}(X|(Y,Z))|Y)$.

Here is a simple property that extends from expectations to conditional expectations.

Lemma 226. If $X_1\leq X_2$ a.s., then $\mathrm{E}(X_1|\mathcal{C})\leq\mathrm{E}(X_2|\mathcal{C})$ a.s.

Proof. Suppose that both $\mathrm{E}(X_1|\mathcal{C})$ and $\mathrm{E}(X_2|\mathcal{C})$ exist. Let
\[
C_0=\{\infty>\mathrm{E}(X_1|\mathcal{C})>\mathrm{E}(X_2|\mathcal{C})\},\qquad C_1=\{\infty=\mathrm{E}(X_1|\mathcal{C})>\mathrm{E}(X_2|\mathcal{C})\}.
\]
Then, for $i=0,1$,
\[
0\leq\int_{C_i}[\mathrm{E}(X_1|\mathcal{C})-\mathrm{E}(X_2|\mathcal{C})]\,dP=\int_{C_i}(X_1-X_2)\,dP\leq 0.
\]
It follows that all terms in this string are 0 and $P(C_i)=0$ for $i=0,1$. Since $C_0\cup C_1=\{\mathrm{E}(X_1|\mathcal{C})>\mathrm{E}(X_2|\mathcal{C})\}$, the result is proven. $\square$

We can prove that versions of conditional expectations exist by the Radon-Nikodym theorem. However, the “modern” way to prove the existence of conditional expectations is through the theory of Hilbert spaces.

36-752: Lecture 18

The following corollary to Proposition 224 is sometimes useful.

Corollary 227. Assume that $\mathcal{C}_1\subseteq\mathcal{C}_2\subseteq\mathcal{F}$ are sub-σ-fields and $\mathrm{E}(X)$ exists. If a version of $\mathrm{E}(X|\mathcal{C}_2)$ is $\mathcal{C}_1/\mathcal{B}^1$-measurable, then $\mathrm{E}(X|\mathcal{C}_1)$ is a version of $\mathrm{E}(X|\mathcal{C}_2)$ and $\mathrm{E}(X|\mathcal{C}_2)$ is a version of $\mathrm{E}(X|\mathcal{C}_1)$.

Example 228. Suppose that $X$ and $Y$ have a joint conditional density given $\Theta$ that factors,
\[
f_{X,Y|\Theta}(x,y|\theta)=f_{X|\Theta}(x|\theta)f_{Y|\Theta}(y|\theta).
\]
Then, the conditional density of $X$ given $(Y,\Theta)$ is
\[
f_{X|Y,\Theta}(x|y,\theta)=\frac{f_{X,Y|\Theta}(x,y|\theta)}{f_{Y|\Theta}(y|\theta)}=f_{X|\Theta}(x|\theta).
\]
With $\mathcal{C}_1=\sigma(\Theta)$ and $\mathcal{C}_2=\sigma(Y,\Theta)$, we see that $\mathrm{E}(r(X)|\mathcal{C}_1)$ will be a version of $\mathrm{E}(r(X)|\mathcal{C}_2)$ for every function $r(X)$ with defined mean.

Here is another example of a result that extends from expectations to conditional expectations.

Lemma 229. If $\mathrm{E}(X)$, $\mathrm{E}(Y)$, and $\mathrm{E}(X+Y)$ all exist, then $\mathrm{E}(X|\mathcal{C})+\mathrm{E}(Y|\mathcal{C})$ is a version of $\mathrm{E}(X+Y|\mathcal{C})$.

Proof. Clearly $\mathrm{E}(X|\mathcal{C})+\mathrm{E}(Y|\mathcal{C})$ is $\mathcal{C}/\mathcal{B}^1$-measurable. We need to show that for all $C\in\mathcal{C}$,
\[
(230)\qquad \int_C[\mathrm{E}(X|\mathcal{C})+\mathrm{E}(Y|\mathcal{C})]\,dP=\int_C(X+Y)\,dP.
\]
The left side of (230) is $\int_C X\,dP+\int_C Y\,dP=\int_C(X+Y)\,dP$ because $\mathrm{E}(I_CX)$, $\mathrm{E}(I_CY)$, and $\mathrm{E}(I_C[X+Y])$ all exist. $\square$

The following theorem is used extensively in later results.

Theorem 231. Let $(\Omega,\mathcal{F},P)$ be a probability space and let $\mathcal{C}$ be a sub-σ-field of $\mathcal{F}$. Suppose that $\mathrm{E}(Y)$ and $\mathrm{E}(XY)$ exist and that $X$ is $\mathcal{C}/\mathcal{B}^1$-measurable. Then $\mathrm{E}(XY|\mathcal{C})=X\mathrm{E}(Y|\mathcal{C})$.

Proof. Clearly, $X\mathrm{E}(Y|\mathcal{C})$ is $\mathcal{C}/\mathcal{B}^1$-measurable. We will use the standard machinery on $X$. If $X=I_B$ for a set $B\in\mathcal{C}$, then
\[
(232)\qquad \mathrm{E}(I_CXY)=\mathrm{E}(I_{C\cap B}Y)=\mathrm{E}(I_{C\cap B}\mathrm{E}(Y|\mathcal{C}))=\mathrm{E}(I_CX\mathrm{E}(Y|\mathcal{C})),
\]
for all $C\in\mathcal{C}$. Hence, $X\mathrm{E}(Y|\mathcal{C})=\mathrm{E}(XY|\mathcal{C})$. By linearity of expectation, the extreme ends of (232) are equal for every nonnegative simple function $X$. Next, suppose that $X$ is nonnegative and let $\{X_n\}$ be a sequence of nonnegative simple functions converging to $X$ from below. Then
\[
\mathrm{E}(I_CX_nY^+)=\mathrm{E}(I_CX_n\mathrm{E}(Y^+|\mathcal{C})),\qquad \mathrm{E}(I_CX_nY^-)=\mathrm{E}(I_CX_n\mathrm{E}(Y^-|\mathcal{C})),
\]
for each $n$ and each $C\in\mathcal{C}$. Apply the monotone convergence theorem to all four sequences above to get
\[
\mathrm{E}(I_CXY^+)=\mathrm{E}(I_CX\mathrm{E}(Y^+|\mathcal{C})),\qquad \mathrm{E}(I_CXY^-)=\mathrm{E}(I_CX\mathrm{E}(Y^-|\mathcal{C})),
\]
for all $C\in\mathcal{C}$. It now follows easily from Lemma 229 that $X\mathrm{E}(Y|\mathcal{C})=\mathrm{E}(XY|\mathcal{C})$. Finally, if $X$ is general, use what we just proved to see that $X^+\mathrm{E}(Y|\mathcal{C})=\mathrm{E}(X^+Y|\mathcal{C})$ and $X^-\mathrm{E}(Y|\mathcal{C})=\mathrm{E}(X^-Y|\mathcal{C})$. Apply Lemma 229 one last time. $\square$

In all of the proofs so far, we have proven that the defining equation for conditional expectation holds for all $C\in\mathcal{C}$. Sometimes, this is too difficult and the following result can simplify a proof.

Proposition 233. Let $(\Omega,\mathcal{F},P)$ be a probability space and let $\mathcal{C}$ be a sub-σ-field of $\mathcal{F}$. Let $\Pi$ be a π-system that generates $\mathcal{C}$. Assume that $\Omega$ is the finite or countable union of sets in $\Pi$. Let $Y$ be a random variable whose mean exists. Let $Z$ be a $\mathcal{C}/\mathcal{B}^1$-measurable random variable such that $\mathrm{E}(I_CZ)=\mathrm{E}(I_CY)$ for all $C\in\Pi$. Then $Z$ is a version of $\mathrm{E}(Y|\mathcal{C})$.

One proof of this result relies on signed measures, and is very similar to the proof of Theorem 43.

Conditional Probability. For $A\in\mathcal{F}$, define $\Pr(A|\mathcal{C})=\mathrm{E}(I_A|\mathcal{C})$. That is, treat $I_A$ as a random variable $X$ and define the conditional probability of $A$ to be the conditional mean of $X$. We would like to show that conditional probabilities behave like probabilities. The first thing we can show is that they are additive. That is a consequence of the following result.

It follows easily from Lemma 229 that $\Pr(A|\mathcal{C})+\Pr(B|\mathcal{C})=\Pr(A\cup B|\mathcal{C})$ a.s. if $A$ and $B$ are disjoint. The following additional properties are straightforward, and we will not do them all in class. They are similar to Lemma 229.

Example 234. (Probability at most 1) We shall show that $\Pr(A|\mathcal{C})\leq 1$ a.s. Let $B=\{\omega : \Pr(A|\mathcal{C})>1\}$. Then $B\in\mathcal{C}$, and
\[
P(B)\leq\int_B\Pr(A|\mathcal{C})\,dP=\int_B I_A\,dP=P(A\cap B)\leq P(B),
\]
where the first inequality is strict if $P(B)>0$. Clearly, neither of the inequalities can be strict, hence $P(B)=0$.

Example 235. (Countable Additivity) Let $\{A_n\}_{n=1}^\infty$ be disjoint elements of $\mathcal{F}$. Let $W=\sum_{n=1}^\infty\Pr(A_n|\mathcal{C})$. We shall show that $W$ is a version of $\Pr\left(\bigcup_{n=1}^\infty A_n\,\middle|\,\mathcal{C}\right)$. Let $C\in\mathcal{C}$.
\[
\mathrm{E}\left(I_CI_{\cup_{n=1}^\infty A_n}\right)=P\left(C\cap\left[\bigcup_{n=1}^\infty A_n\right]\right)=\sum_{n=1}^\infty P(C\cap A_n)=\sum_{n=1}^\infty\int_C\Pr(A_n|\mathcal{C})\,dP=\int_C\sum_{n=1}^\infty\Pr(A_n|\mathcal{C})\,dP=\int_C W\,dP,
\]
where the sum and integral are interchangeable by the monotone convergence theorem.

We could also prove that $\Pr(A|\mathcal{C})\geq 0$ a.s. and $\Pr(\Omega|\mathcal{C})=1$, a.s. But there are generally uncountably many different $A\in\mathcal{F}$ and uncountably many different sequences of disjoint events. Although countable additivity holds a.s. separately for each sequence of disjoint events, how can we be sure that it holds simultaneously for all sequences a.s.?

Definition 236. Let $\mathcal{A}\subseteq\mathcal{F}$ be a sub-σ-field. We say that a collection of versions $\{\Pr(A|\mathcal{C}) : A\in\mathcal{A}\}$ are regular conditional probabilities if, for each $\omega$, $\Pr(\cdot|\mathcal{C})(\omega)$ is a probability measure on $(\Omega,\mathcal{A})$.

Rarely do regular conditional probabilities exist on $(\Omega,\mathcal{F})$, but there are lots of common sub-σ-fields $\mathcal{A}$ such that regular conditional probabilities exist on $(\Omega,\mathcal{A})$. Oddly enough, the existence of regular conditional probabilities doesn't seem to depend on $\mathcal{C}$.

Example 237. (Joint Densities) Use the same setup as in Example 216. For each $y$ such that $f_Y(y)=0$, define $f_{X|Y}(x|y)=\phi(x)$, the standard normal density. For each $y$ such that $f_Y(y)>0$, define $f_{X|Y}$ as in Example 216. Next, for each $A\in\sigma(X)$, define
\[
h(y)=\int_B f_{X|Y}(x|y)\,dx,
\]
for all $y$, where $A=X^{-1}(B)$. Finally, define $\Pr(A|\mathcal{C})(\omega)=h(Y(\omega))$. The calculation done in Example 216 shows that this is a version of the conditional mean of $I_A$ given $\mathcal{C}$. But it is easy to see that for each $\omega$, $\Pr(\cdot|\mathcal{C})(\omega)$ is a probability measure on $(\Omega,\sigma(X))$.

36-752: Lecture 19

Convergence in Distribution. Let X be a topological space and let B be its Borel σ-field. Let (Ω, F, P) be a probability space and let X_n : Ω → X be F/B-measurable. Also, let X : Ω → X be another random quantity. This will be the standard setup for all discussions of convergence in distribution.

Definition 238. We say that X_n converges in distribution to X if

lim_{n→∞} E[f(X_n)] = E[f(X)],

for all bounded continuous functions f : X → IR. We denote this property X_n →^D X.

Example 239. Let Ω = IR^∞ with F = B^∞ and P being the joint distribution of a sequence of iid standard normal random variables (the coordinates ω_j are iid standard normal under P). Let X_n(ω) = (1/√n) Σ_{j=1}^n ω_j. Let X = X_1. Then X_n →^D X in a trivial way, since each X_n is itself standard normal.

There are several conditions that are all equivalent to X_n →^D X.

Theorem 240. (Portmanteau theorem) The following are all equivalent if X is a metric space:

1. lim_{n→∞} E[f(X_n)] = E[f(X)] for all bounded continuous f,

2. for each closed C ⊆ X, lim sup_{n→∞} μ_{X_n}(C) ≤ μ_X(C),

3. for each open A ⊆ X, lim inf_{n→∞} μ_{X_n}(A) ≥ μ_X(A),

4. for each B ∈ B such that μ_X(∂B) = 0, lim_{n→∞} μ_{X_n}(B) = μ_X(B).

We will not prove this whole theorem, but we will look a bit more at the four conditions. If X = IR, then the fourth condition is a lot like the familiar convergence of cdf's at points where the limit is continuous. An interval B = (−∞, b] has μ_X(∂B) = 0 if and only if there is no mass at b, hence if and only if the cdf is continuous at b. The second condition says that we don't want any mass from the distributions of the X_n's to be able to escape from a closed set, although it could happen that mass from outside of a closed set approaches the boundary. That is why the inequality goes the way it does. Similarly, for the third condition, mass can escape from an open set but nothing should be allowed to "jump" into the open set. The first condition is related to the often overlooked fact that the distribution of a random quantity is determined by the means of all bounded continuous functions. The first condition is also a version of what mathematicians call weak∗ convergence, a concept that arises in the theory of normed linear spaces. Many statisticians and probabilists call convergence in distribution "weak convergence," but convergence in distribution is not quite the same as weak convergence in normed linear spaces.

Proof of Theorem 240. First, notice that the second and third conditions are equivalent, since closed sets are complements of open sets. Together, the second and third conditions

imply the fourth one. We will prove that the fourth condition implies the first one, and we will wave our hands over the proof that the first condition implies the second one.

Assume the fourth condition. Let f be bounded and continuous, with |f(x)| ≤ K for all x. Let ε > 0. Let v_0 < v_1 < ⋯ < v_M be real numbers such that v_0 < −K < K < v_M, v_j − v_{j−1} < ε for each j, and μ_X({x : f(x) = v_j}) = 0 for each j (possible because at most countably many values v can have μ_X({x : f(x) = v}) > 0). Define F_j = {x : v_{j−1} < f(x) ≤ v_j}, so that

{x : v_{j−1} < f(x) < v_j} ⊆ int(F_j) ⊆ cl(F_j) ⊆ {x : v_{j−1} ≤ f(x) ≤ v_j}.

By construction,

| Σ_{j=1}^M v_j μ_{X_n}(F_j) − E[f(X_n)] | ≤ ε,   | Σ_{j=1}^M v_j μ_X(F_j) − E[f(X)] | ≤ ε.

By assumption, μ_X(∂F_j) = 0 for all j, and

lim_{n→∞} Σ_{j=1}^M v_j μ_{X_n}(F_j) = Σ_{j=1}^M v_j μ_X(F_j).

Combining these yields lim sup_{n→∞} |E[f(X_n)] − E[f(X)]| ≤ 2ε; since ε was arbitrary, the first condition holds.

To see why the first condition implies the second one, let C be a closed set. For each m, let C_m be the set of points that are at most 1/m away from C. The function f_m(x) = max{0, 1 − m d(x, C)} is bounded and continuous, equals 0 on C_m^C, equals 1 on C, and lies between 0 and 1 everywhere. We know that lim_{n→∞} E(f_m(X_n)) = E(f_m(X)) for all m. Also, μ_{X_n}(C) ≤ E(f_m(X_n)) ≤ μ_{X_n}(C_m) for all n and m. So

(241)  lim sup_{n→∞} μ_{X_n}(C) ≤ E(f_m(X)) ≤ μ_X(C_m),

for all m. Since {C_m}_{m=1}^∞ is a decreasing sequence of sets whose intersection is C, we have lim_{m→∞} μ_X(C_m) = μ_X(C). Since the left side of (241) doesn't depend on m, we have the result. ∎

Because convergence in distribution depends only on the distributions of the random quantities involved, we do not actually need random quantities in order to discuss convergence in distribution. Hence, we might also use notation like μ_n →^D μ, where μ_n and μ are probability measures on the same space. If X = IR, we might refer to the cdf's and say F_n →^D F. We might even refer to the names of distributions and say that X_n converges in distribution to a standard normal distribution or some other distribution. Even if we do have random quantities, they don't even have to be defined on the same probability spaces. They do have to take values in the same space, however. For example, for each n, let (Ω_n, F_n, P_n) be a probability space, and let (Ω, F, P) be another one. Let (X, B) be a topological space with its Borel σ-field. Let X_n : Ω_n → X and X : Ω → X be random quantities. We could then ask whether or not X_n →^D X. We won't use this last bit of added generality.

Example 242. Let {X_n}_{n=1}^∞ be a sequence of iid standard normal random variables. Then X_n converges in distribution to standard normal, but does not converge in probability to anything.

Some authors use the expression converges in law to mean "converges in distribution". They might write this X_n →^L X. Others use the expression converges weakly and might write it X_n →^w X.²

Skorohod proved a result that simplifies some proofs of convergence in distribution when X = IR.

Lemma 243. (Skorohod theorem) Let (X, B) = (IR, B^1). Suppose that X_n →^D X. Then there exist {Y_n}_{n=1}^∞ and Y defined on ((0,1), B, λ) (λ being Lebesgue measure) such that Y_n has the same distribution as X_n for all n, Y has the same distribution as X, and Y_n(ω) → Y(ω) for a.e. ω [λ].

Proof. Let F_n be the cdf of X_n and let F be the cdf of X. Then lim_{n→∞} F_n(x) = F(x) for all x at which F is continuous, by part 4 of Theorem 240. Define Y_n(ω) = F_n^{-1}(ω) and Y(ω) = F^{-1}(ω). Here, the inverse of a general cdf G is defined by G^{-1}(p) = inf{x : G(x) ≥ p}. It is easy to see that Y_n has the same distribution as X_n and Y has the same distribution as X. For example,

Pr(Y ≤ y) = Pr(F^{-1}(ω) ≤ y) = Pr(ω ≤ F(y)) = F(y).

To see that Y_n(ω) → Y(ω), let ε > 0 and let x be such that Y(ω) − ε < x < Y(ω) and F is continuous at x (continuity points are dense). Then F(x) < ω, so F_n(x) < ω for all large n, which implies Y_n(ω) ≥ x > Y(ω) − ε for all large n. Hence lim inf_{n→∞} Y_n(ω) ≥ Y(ω). Similarly, for ω′ > ω, choose a continuity point x of F with Y(ω′) < x < Y(ω′) + ε; then F(x) ≥ ω′ > ω, so F_n(x) > ω for all large n, which implies Y_n(ω) ≤ x < Y(ω′) + ε for all large n. Hence lim sup_{n→∞} Y_n(ω) ≤ Y(ω′) for every ω′ > ω. If Y is continuous at ω, letting ω′ ↓ ω gives Y_n(ω) → Y(ω). Since Y is nondecreasing, it has at most countably many discontinuities, so Y_n → Y a.e. [λ]. ∎
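The quantile construction in the proof is easy to watch numerically. Here is a sketch under an assumed concrete family (my choice, not the notes'): X_n exponential with rate 1 + 1/n, so that F_n^{-1}(p) = −log(1 − p)/(1 + 1/n) converges to F^{-1}(p) = −log(1 − p) at every p ∈ (0,1).

# Sketch of the coupling Y_n(ω) = F_n^{-1}(ω) on ((0,1), B, λ).
import math

def F_inv(p, rate):
    # quantile function G^{-1}(p) = inf{x : G(x) >= p} for exponential(rate)
    return -math.log(1.0 - p) / rate

for omega in (0.1, 0.5, 0.9):            # sample points of (0,1)
    Y = F_inv(omega, 1.0)                # limit variable Y(ω)
    for n in (1, 10, 100, 1000):
        Yn = F_inv(omega, 1.0 + 1.0 / n) # coupled Y_n(ω)
        print(f"ω={omega:.1f} n={n:4d}  Y_n(ω)={Yn:.6f}  Y(ω)={Y:.6f}")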

Lemma 244. Let (X, B) = (IR, B^1). Let F_n be the cdf of X_n and let F be the cdf of X. Then X_n →^D X if and only if lim_{n→∞} F_n(x) = F(x) for all x at which F is continuous.

Proof. The proof of the "only if" direction is direct from Theorem 240, because F is continuous at x if and only if μ_X({x}) = 0, and {x} is the boundary of (−∞, x]. For the "if" part, construct Y_n and Y as in the proof of Lemma 243. It then follows from the dominated convergence theorem that E(f(Y_n)) → E(f(Y)) for all bounded continuous f. ∎

Example 245. Let Φ be the standard normal cdf, and let

F_n(x) = 0                                     if x < −n,
       = [Φ(x) − Φ(−n)] / [Φ(n) − Φ(−n)]       if −n ≤ x < n,
       = 1                                     if x ≥ n.

Then, we see that lim_{n→∞} F_n(x) = Φ(x) for all x. Each F_n gives probability 1 to a bounded set, but the limit distribution does not.

²Convergence in distribution is not the same as weak convergence of continuous linear functionals in functional analysis. It is the same as weak∗ convergence, but we will not go into that distinction here.

Example 246. Let Φ be the standard normal cdf, and let

F_n(x) = 0      if x < −n,
       = Φ(x)   if −n ≤ x < n,
       = 1      if x ≥ n.

Then, we see that lim_{n→∞} F_n(x) = Φ(x) for all x. Each F_n is neither discrete nor continuous, but the limit is continuous.

Example 247. Enumerate the dyadic rationals in this sequence: 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16, .... Let μ_n be the measure that puts mass 1/n on each of the first n in the list. Then the subsequence {μ_{2^n − 1}}_{n=1}^∞ converges in distribution to the uniform distribution on [0,1], but the whole sequence does not converge. Consider the subsequence {μ_{2^n + 2^{n−1} − 1}}_{n=1}^∞, which converges to a distribution with twice as much probability on [0, 1/2] as on (1/2, 1].
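A quick computation (not in the notes) makes the two subsequential limits in Example 247 visible: the mass μ_n assigns to [0, 1/2] is near 1/2 along n = 2^k − 1 and near 2/3 along n = 2^k + 2^{k−1} − 1.

def dyadics():
    # enumerate 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, ...
    denom = 2
    while True:
        for num in range(1, denom, 2):
            yield num / denom
        denom *= 2

def mass_left_half(n):
    gen = dyadics()
    pts = [next(gen) for _ in range(n)]
    return sum(1 for p in pts if p <= 0.5) / n

for k in range(3, 12):
    n1 = 2 ** k - 1                  # first subsequence
    n2 = 2 ** k + 2 ** (k - 1) - 1   # second subsequence
    print(k, round(mass_left_half(n1), 4), round(mass_left_half(n2), 4))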

Example 248. Let F_n be the cdf of the uniform distribution on [−n, n]. No subsequence of {F_n} converges in distribution, even though each cdf gives probability 1 to a bounded set.

Examples 245 and 248 illustrate a necessary and sufficient condition for a sequence of distributions to have a convergent (in distribution) subsequence. Even though the F_n in both examples assign probability 1 to the same intervals, the probability moves out to infinity at different rates in the two examples. Later, we will see a condition (tightness) on how fast probability can move out to infinity and still allow subsequences to converge in distribution. Convergence in distribution is weaker than convergence in probability, hence it is also weaker than convergence a.s. and L^p convergence.

Proposition 249. Let (X, B) be a metric space (having metric d) with its Borel σ-field. Let {X_n}_{n=1}^∞ be a sequence of random quantities taking values in X and let X be another random quantity taking values in X.

1. If lim_{n→∞} X_n = X a.s., then X_n →^P X.

2. If X_n →^P X, then X_n →^D X.

3. If X is degenerate and X_n →^D X, then X_n →^P X.

4. If X_n →^P X, then there is a subsequence {n_k}_{k=1}^∞ such that lim_{k→∞} X_{n_k} = X a.s.

Proof. The first and last claims were proven earlier and are only included for completeness. For the second claim, let C be a closed set and let C_m = {x : d(x, C) ≤ 1/m} for each integer m > 0. Then

μ_{X_n}(C) ≤ μ_X(C_m) + Pr(d(X, X_n) > 1/m).

It follows that lim sup_{n→∞} μ_{X_n}(C) ≤ μ_X(C_m). Since lim_{m→∞} μ_X(C_m) = μ_X(C), we have that X_n →^D X by Theorem 240. The third claim follows by approximating I_{[c−ε, c+ε]} by a bounded continuous function, where Pr(X = c) = 1. ∎

36-752: Lecture 20

If f is a continuous function and X_n →^D X, then f(X_n) →^D f(X). Indeed, even if f is not continuous, so long as μ_X assigns 0 probability to the set of discontinuities of f, the result still holds.

Theorem 250. (Continuous mapping theorem) Let {X_n}_{n=1}^∞ be a sequence of random quantities, and let X be another random quantity, all taking values in the same metric space X. Suppose that X_n →^D X. Let Y be a metric space and let g : X → Y. Define

C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(X_n) →^D g(X).

The proof of Theorem 250 together with the proof of Theorem 252 are in another course document. They both rely on the second part of Theorem 240, and they resemble the part of the proof of Proposition 249 that we already did.

Example 251. If (S_n − nμ)/[√n σ] converges in distribution to standard normal, then (S_n − nμ)²/(nσ²) converges in distribution to χ² with one degree of freedom.

Theorem 252. Let {X_n}_{n=1}^∞, X, and {Y_n}_{n=1}^∞ be random quantities taking values in a metric space with metric d. If X_n →^D X and d(X_n, Y_n) →^P 0, then Y_n →^D X.

20 Theorem 253. Let Xn take values in a metric space and let Yn take values in a metric P 21 space. Suppose that X D X and Y c, then (X ,Y ) D (X,c). n → n → n n → 22 Proof. Let d1 be the metric in the space where Xn takes values and let d2 be the metric 23 in the space where Yn takes values. Then

24 d((x1,y1), (x2,y2)) = d1(x1, x2)+ d2(y1,y2),

25 defines a metric in the product space and the product σ-field is the Borel σ-field. First note 26 that (X ,c) D (X,c) since every bounded continuous function of (X ,c) is a bounded n → n 27 continuous function of Xn alone. Next, note that d((Xn,Yn), (Xn,c)) = d2(Yn,c) and P 28 P (d (Y ,c) > ǫ) 0 for all ǫ > 0, so d((X ,Y ), (X ,c)) 0. By Theorem 252, n 2 n → n n n → 29 (X ,Y ) D (X,c).  n n → 30 Example 254. Suppose that Un =(Sn nµ)/(√nσ) converges in distribution to stan- P − 31 dard normal. Suppose also, that T σ. Then (U , T ) D (Z, σ), where Z N(0, 1). n → n n → ∼ 32 Consider the continuous function g(z,s)= zσ/s. It follows that

Sn nµ D 33 g(Un, Tn)= − Z. √nTn → 63

Example 255. (Delta method) Suppose that lim_{n→∞} r_n = ∞ and r_n(X_n − a) →^D Y. Then X_n →^P a. Suppose that g is a function that has a derivative g′(a) at a. Define

h(x) = [g(x) − g(a)]/(x − a) − g′(a).

We know that lim_{x→a} h(x) = 0, so we can make h continuous at a by setting h(a) = 0. Also, g(x) − g(a) = (x − a)g′(a) + (x − a)h(x). So,

r_n[g(X_n) − g(a)] = r_n(X_n − a)g′(a) + r_n(X_n − a)h(X_n).

It follows from Theorems 250 and 249 that h(X_n) →^P 0. By Theorem 253, r_n(X_n − a)h(X_n) →^P 0 and r_n(X_n − a)g′(a) →^D g′(a)Y. By Theorem 252, r_n[g(X_n) − g(a)] →^D g′(a)Y. After we see the central limit theorem, there will be many examples of the use of this result.
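Here is a Monte Carlo sketch of the delta method. The specific choices are assumptions for illustration only: X_n is the mean of n Uniform(0, 2) draws, so a = 1, r_n = √n, and Y ~ N(0, 1/3); with g(x) = x², g′(a) = 2, so r_n[g(X_n) − g(a)] should have variance near 4·(1/3) = 4/3.

import random, statistics

random.seed(0)
n, reps = 2_000, 1_000
vals = []
for _ in range(reps):
    xbar = sum(random.uniform(0.0, 2.0) for _ in range(n)) / n
    vals.append(n ** 0.5 * (xbar ** 2 - 1.0))   # r_n [g(X_n) - g(a)]

print("sample variance:", round(statistics.variance(vals), 3), "(theory: 4/3 ≈ 1.333)")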

If g′(a) = 0 in the above example, there may still be hope if a higher derivative is nonzero.

Example 256. Let {X_n}_{n=1}^∞ be iid with exponential distribution with parameter 2. That is, the density is 2 exp(−2x) for x > 0. Let Y_n = min{X_1, ..., X_n}. Then Y_n has an exponential distribution with parameter 2n. So n(Y_n − 0) →^D X_1. Let g(y) = cos(y), so that g′(y) = −sin(y). Then n[cos(Y_n) − 1] →^D 0. But g(y) − 1 = 0 − y²/2 + o(y²) as y → 0. So,

n²[g(Y_n) − 1] = −(1/2) n² Y_n² + Z_n →^D −(1/2) X_1²,

where Z_n →^P 0.

Continuous Mapping and Related Theorems

Lemma 257. Let X and Y be metric spaces. Let B be a closed subset of Y. Let g : X → Y. If x ∈ cl(g^{-1}(B)) and g is continuous at x, then x ∈ g^{-1}(B).

Proof. If x ∈ cl(g^{-1}(B)), then there exists a sequence {x_n}_{n=1}^∞ of elements of g^{-1}(B) such that x_n → x. If g is continuous at x, then g(x_n) → g(x). Since all g(x_n) ∈ B and B is closed, g(x) ∈ B. ∎

Theorem 250. (Continuous mapping theorem) Let {X_n}_{n=1}^∞ be a sequence of random quantities, and let X be another random quantity, all taking values in the same metric space X. Suppose that X_n →^D X. Let Y be a metric space and let g : X → Y. Define

C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(X_n) →^D g(X).

Proof. Let Q_n be the distribution of g(X_n) and let Q be the distribution of g(X). Let R_n be the distribution of X_n and let R be the distribution of X. Let B be a closed subset of Y. If x ∈ cl(g^{-1}(B)) but x ∉ g^{-1}(B), then g is not continuous at x, by Lemma 257. It follows that cl(g^{-1}(B)) ⊆ g^{-1}(B) ∪ C_g^C. Now write

lim sup_{n→∞} Q_n(B) = lim sup_{n→∞} R_n(g^{-1}(B)) ≤ lim sup_{n→∞} R_n(cl(g^{-1}(B)))
  ≤ R(cl(g^{-1}(B))) ≤ R(g^{-1}(B)) + R(C_g^C) = R(g^{-1}(B)) = Q(B),

and the result now follows from Theorem 240. ∎

Theorem 252. Let {X_n}_{n=1}^∞, X, and {Y_n}_{n=1}^∞ be random quantities taking values in a metric space with metric d. If X_n →^D X and d(X_n, Y_n) →^P 0, then Y_n →^D X.

Proof. Let Q_n be the distribution of Y_n, let R_n be the distribution of X_n, and let R be the distribution of X. Let B be an arbitrary closed set. According to Theorem 240, it suffices to show that lim sup_n Q_n(B) ≤ R(B). Note that

{Y_n ∈ B} ⊆ {d(X_n, B) ≤ ε} ∪ {d(X_n, Y_n) > ε}.

Define C_ε = {x : d(x, B) ≤ ε}, which is a closed set. So,

Q_n(B) = P_n(Y_n ∈ B) ≤ P_n(d(X_n, B) ≤ ε) + P_n(d(X_n, Y_n) > ε) = R_n(C_ε) + P_n(d(X_n, Y_n) > ε).

We have assumed that lim_{n→∞} P_n(d(X_n, Y_n) > ε) = 0 and that X_n →^D X, so we conclude lim sup_{n→∞} Q_n(B) ≤ lim sup_{n→∞} R_n(C_ε) ≤ R(C_ε). Since B is closed, lim_{ε→0} R(C_ε) = R(B). It follows then that

lim sup_{n→∞} Q_n(B) ≤ R(B),

hence Y_n →^D X. ∎

Here is the convergence in probability version of Theorem 250.

Theorem 258. Let {X_n}_{n=1}^∞ be a sequence of random quantities, and let X be another random quantity, all taking values in the same metric space X with metric d_1. Suppose that X_n →^P X. Let Y be a metric space with metric d_2 and let g : X → Y. Define

C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(X_n) →^P g(X).

Proof. For each x ∈ C_g and ε > 0, there exists δ(x, ε) such that d_2(g(x), g(y)) < ε whenever d_1(x, y) < δ(x, ε). For every set A,

Pr(A ∩ {d_2(g(X), g(X_n)) < ε}) ≥ Pr(A ∩ {d_1(X, X_n) < δ(X, ε)}).

Define A_m = C_g ∩ {X : δ(X, ε) ≥ 1/m}. We know that lim_{m→∞} Pr(A_m) = 1. Also, for every m,

Pr(d_2(g(X), g(X_n)) < ε) ≥ Pr(A_m ∩ {d_2(g(X), g(X_n)) < ε})
  ≥ Pr(A_m ∩ {d_1(X, X_n) < δ(X, ε)})
  ≥ Pr(A_m ∩ {d_1(X, X_n) < 1/m}).

This last converges to Pr(A_m) as n → ∞ because X_n →^P X. Hence

lim inf_{n→∞} Pr(d_2(g(X), g(X_n)) < ε) ≥ Pr(A_m).

Now, take limits as m → ∞ on both sides to get that lim inf_{n→∞} Pr(d_2(g(X), g(X_n)) < ε) ≥ 1. So, g(X_n) →^P g(X). ∎

36-752: Lecture 21

Characteristic Functions. For the special case in which (X, B) = (IR^p, B^p), there is a useful technique for determining whether a sequence of random vectors converges in distribution. It is based on a characterization of distributions by something simpler than the means of all bounded continuous functions. The means of a special collection of bounded continuous functions, namely {exp(it⊤x) : t ∈ IR^p}, are enough to characterize a distribution. From here on in the notes, i is one of the complex square-roots of −1.³

Definition 259. The function φ_X(t) = E exp(it⊤X) is called the characteristic function (cf) of X.

(Mathematicians will recognize the cf as the Fourier transform of the density, when one exists.) Every distribution on IR^p has a cf, regardless of whether moments exist. Recall from complex analysis that exp(iu) = cos(u) + i sin(u). So, we see that exp(it⊤x) is indeed bounded as a function of x for each t.

Example 260. (Cauchy distribution) Let f_X(x) = [π(1 + x²)]^{-1}. Then φ_X(t) = exp(−|t|). To prove this requires contour integration.

The remaining theorems about convergence in distribution are

• the inversion/uniqueness theorem, which says that each cf corresponds to a unique distribution,

• the continuity theorem, which says that X_n →^D X if and only if φ_{X_n}(t) → φ_X(t) for all t (the "only if" direction being trivial), and

• the central limit theorem, which says that certain normalized sums of independent (not necessarily identically distributed) random variables with finite variance converge in distribution to a standard normal distribution.

Proposition 261. If X and Y are independent, then φ_{X+Y}(t) = φ_X(t) φ_Y(t).

Example 262. (Normal distribution) Let f_X(x) = exp(−x²/2)/√(2π) be the density of X. Then

φ_X(t) = (1/√(2π)) ∫ exp(itx − x²/2) dx
       = (1/√(2π)) ∫ exp( −(1/2)[x − it]² − t²/2 ) dx
       = exp(−t²/2).

³If I ever use i again to mean something else, stop me so that I can fix it.

Example 263. (Uniform distribution) Let f(x) = 1/2 for −1 < x < 1. Then φ_X(t) = (1/2) ∫_{−1}^{1} exp(itx) dx = sin(t)/t.

The smoothness of the cf is related to the existence of moments. Of course, all cf's are continuous by the dominated convergence theorem: since |exp(it⊤x) − exp(iu⊤x)| ≤ 2 for all t, u, x, we can pass the limit as u → t under the integral in ∫ [exp(it⊤x) − exp(iu⊤x)] dμ_X(x) to get 0 for the limit. Now, suppose that X is a random variable with finite mean. We can write

|exp(ix) − 1|² = |cos(x) + i sin(x) − 1|² = 2 − 2 cos(x) = 2 ∫_0^x sin(t) dt ≤ 2 ∫_0^x t dt = x².

This implies that |exp(ix) − 1| ≤ |x| for all x. Clearly, |exp(ix) − 1| ≤ 2 for all x also. So

(264)  |exp(ix) − 1| ≤ min{2, |x|}.

This implies that [exp(ixt) − 1]/t is bounded by the μ_X-integrable function |x|. By the dominated convergence theorem, we can pass the limit as t → 0 under the integral to get that φ′(0) exists and equals iE(X). With a bit more effort, similar results hold if higher moments exist. That is, higher-order derivatives of φ exist and equal powers of i times the moments times real constants.

Theorem 265. (Inversion and uniqueness) Let φ be the cf for the probability P on (IR^p, B^p). Let A be a rectangular region of the form

A = {(x_1, ..., x_p) : a_j ≤ x_j ≤ b_j for all j},

where a_j < b_j for all j and P(∂A) = 0. For each T > 0, let

B_T = {(t_1, ..., t_p) : −T ≤ t_j ≤ T for all j}.

Then

P(A) = lim_{T→∞} (1/(2π)^p) ∫_{B_T} ∏_{j=1}^p [ (exp(−it_j a_j) − exp(−it_j b_j)) / (it_j) ] φ(t) dt_1 ⋯ dt_p.

Distinct probability measures have distinct cf's.

The proof of Theorem 265 is provided in another course document. The proof relies on the following interesting result:

lim_{T→∞} ∫_{−T}^{T} [sin(ct)/t] dt = π if c > 0, = 0 if c = 0, = −π if c < 0.

(Because dt/t is an invariant measure with respect to scale changes on (0, ∞), the integral doesn't depend on |c| for c ≠ 0.) Basically, replace φ(t) by ∫ ∏_{j=1}^p exp(it_j x_j) dP(x), change the order of integration, pass the limit inside the integral over x, combine the two products into one, rewrite exp(it_j c_j) in terms of sines and cosines (for c_j ∈ {x_j − a_j, x_j − b_j}), notice that the cosine terms integrate to 0 over t_j, and apply the above formula to the sine terms. When x_j is between a_j and b_j, the limit of the integral over t_j yields π − (−π) = 2π. When x_j is outside of [a_j, b_j], the limit yields either π − π or −π − (−π), both 0.

The following lemma allows us to simplify some future proofs by doing only the p = 1 case.

Lemma 266. (Cramér–Wold) Let X and Y be p-dimensional random vectors. Then X and Y have the same distribution if and only if α⊤X and α⊤Y have the same distribution for every α ∈ IR^p.

Proof. We know that X and Y have the same distribution if and only if φ_X(t) = φ_Y(t) for every t ∈ IR^p. This is true if and only if φ_X(sα) = φ_Y(sα) for all α ∈ IR^p and all s ∈ IR. But φ_X(sα) is the cf of α⊤X (as a function of s) and φ_Y(sα) is the cf of α⊤Y. So, φ_X(sα) = φ_Y(sα) for all α ∈ IR^p and all s ∈ IR if and only if α⊤X and α⊤Y have the same distribution for every α ∈ IR^p. ∎

If the characteristic function is integrable, a continuous density exists. We will not prove this result.

Proposition 267. If φ is the cf of the cdf F on (IR, B^1) and if φ is integrable, then F has a density

(268)  f(x) = (1/2π) ∫_{−∞}^{∞} exp(−itx) φ(t) dt,

which is continuous.

The connection between characteristic functions and convergence in distribution is the following.

Theorem 269. (Continuity theorem) Let {P_n}_{n=1}^∞ be a sequence of probabilities on (IR^p, B^p), and let P be another probability. Let φ_n be the cf for P_n, and let φ be the cf for P. Then P_n →^D P if and only if lim_{n→∞} φ_n(t) = φ(t) for all t ∈ IR^p.

See Section 7.2 in Ash for the proof and underlying theory for the case p = 1.

Example 270. For each j, let Y_j have a uniform distribution on the interval [−1, 1], and let X_n = √(3/n) Σ_{j=1}^n Y_j. Then the cf of X_n is

φ_n(t) = [ sin(t√(3/n)) / (t√(3/n)) ]^n.

We can write sin(t) = t − t³/6 + o(t³), so that, for each t,

sin(t√(3/n)) / (t√(3/n)) = 1 − t²/(2n) + o(1/n),

as n → ∞. It follows easily that lim_{n→∞} φ_n(t) = exp(−t²/2). This is the cf of the standard normal distribution.
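A short numerical check (not in the notes) that φ_n(t) from Example 270 approaches exp(−t²/2) as n grows.

import math

def phi_n(t, n):
    s = t * math.sqrt(3.0 / n)
    return (math.sin(s) / s) ** n     # cf of X_n, valid for s ≠ 0

for t in (0.5, 1.0, 2.0):
    target = math.exp(-t * t / 2.0)
    for n in (10, 100, 10_000):
        print(f"t={t} n={n:6d}  φ_n(t)={phi_n(t, n):.6f}  exp(-t²/2)={target:.6f}")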

An interesting corollary to the continuity theorem is that if lim_{n→∞} φ_n(t) exists for all t and the limit is continuous at 0, then the limit is a cf, and the distributions converge in distribution to the distribution with that cf. Another interesting corollary (thanks to Cramér and Wold) is that if {X_n}_{n=1}^∞ is a sequence of p-dimensional random vectors and X is a random vector, then X_n →^D X if and only if α⊤X_n →^D α⊤X for all α ∈ IR^p.

Inversion/Uniqueness Theorem

Lemma 271.

(272)  lim_{T→∞} ∫_{−T}^{T} [sin(ct)/t] dt = π if c > 0, = 0 if c = 0, = −π if c < 0.

Proof. Since sin(−ct) = −sin(ct) and sin(0t) = 0, it suffices to prove the result for c > 0. First, we do an auxiliary integral. For fixed u and c, consider

f(x) = [1/(1 + u²)] [1 − exp(−ucx)(u sin(cx) + cos(cx))].

Then f(0) = 0 and f′(x) = c exp(−ucx) sin(cx). It follows that

∫_0^T c exp(−uct) sin(ct) dt = f(T).

The integrand in (272) is symmetric around 0, so we will work with the integral from 0 to T. Assume that c > 0. Since

∫_0^∞ c exp(−uct) du = 1/t,

we have

∫_0^T [sin(ct)/t] dt = ∫_0^T ∫_0^∞ c sin(ct) exp(−uct) du dt
  = ∫_0^∞ ∫_0^T c exp(−uct) sin(ct) dt du
  = ∫_0^∞ du/(1 + u²) − ∫_0^∞ [exp(−ucT)/(1 + u²)] (u sin(cT) + cos(cT)) du.

The first integral in the last line equals π/2, and the last integral goes to 0 as T → ∞. ∎

Theorem 265. (Inversion and uniqueness) Let φ be the cf for the probability P on (IR^p, B^p). Let A be a rectangular region of the form

A = {(x_1, ..., x_p) : a_j ≤ x_j ≤ b_j for all j},

where a_j < b_j for all j and P(∂A) = 0. For each T > 0, let

B_T = {(t_1, ..., t_p) : −T ≤ t_j ≤ T for all j}.

Then

P(A) = lim_{T→∞} (1/(2π)^p) ∫_{B_T} ∏_{j=1}^p [ (exp(−it_j a_j) − exp(−it_j b_j)) / (it_j) ] φ(t) dt_1 ⋯ dt_p.

Distinct probability measures have distinct cf's.

Proof. Apply Fubini's theorem to write

(273)  ∫_{B_T} ∏_{j=1}^p [ (exp(−it_j a_j) − exp(−it_j b_j)) / (it_j) ] φ(t) dt_1 ⋯ dt_p
     = ∫_{IR^p} ∫_{B_T} ∏_{j=1}^p [ (exp(it_j[x_j − a_j]) − exp(it_j[x_j − b_j])) / (it_j) ] dt_1 ⋯ dt_p dμ(x).

We can do this because the integrand is bounded by ∏_{j=1}^p |b_j − a_j| according to (264), and the set over which we are integrating has finite product measure. Rewrite the jth factor in the integrand on the right side of (273) as

[ cos(t_j[x_j − a_j]) − cos(t_j[x_j − b_j]) + i sin(t_j[x_j − a_j]) − i sin(t_j[x_j − b_j]) ] / (it_j).

Since the integration over t_j is from −T to T and {cos(t_j[x_j − a_j]) − cos(t_j[x_j − b_j])}/t_j is bounded and an odd function of t_j, its integral is 0. We rewrite the right side of (273) as

(274)  ∫_{IR^p} ∫_{B_T} ∏_{j=1}^p [ sin(t_j[x_j − a_j])/t_j − sin(t_j[x_j − b_j])/t_j ] dt_1 ⋯ dt_p dμ(x).

Define

g_T(x) = ∫_{B_T} ∏_{j=1}^p [ sin(t_j[x_j − a_j])/t_j − sin(t_j[x_j − b_j])/t_j ] dt_1 ⋯ dt_p
       = ∏_{j=1}^p ∫_{−T}^{T} [ sin(t_j[x_j − a_j])/t_j − sin(t_j[x_j − b_j])/t_j ] dt_j.

This function is uniformly bounded for all T and x, hence the limit as T → ∞ of the integral in (274) equals ∫ lim_{T→∞} g_T(x) dμ(x). If we define

ψ_{a,b}(x) = 0 if x < a, = π if x = a, = 2π if a < x < b, = π if x = b, = 0 if x > b,

then Lemma 271 says that lim_{T→∞} g_T(x) = ∏_{j=1}^p ψ_{a_j, b_j}(x_j), which equals (2π)^p for x ∈ int(A) and equals 0 for x outside the closure of A. Since μ(∂A) = 0, we have

(1/(2π)^p) ∫_{IR^p} lim_{T→∞} g_T(x) dμ(x) = μ(A).

At most countably many hyperplanes perpendicular to the coordinate axes can have positive μ probability. So the rectangular regions A with μ(∂A) = 0 form a π-system that generates B^p. It follows from the inversion formula that φ_1 = φ_2 implies μ_1 = μ_2. That is, the characteristic function determines the distribution. ∎
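Lemma 271, used at the heart of the proof above, can be checked numerically. The sketch below uses a simple midpoint rule; the step counts are arbitrary choices. Convergence is slow (the error oscillates on the order of 1/T), which is visible in the output.

import math

def dirichlet(c, T, steps=200_000):
    h = 2.0 * T / steps
    total = 0.0
    for i in range(steps):
        t = -T + (i + 0.5) * h          # midpoint rule avoids t = 0
        total += math.sin(c * t) / t * h
    return total

for c in (2.0, -2.0):
    for T in (10.0, 100.0, 1000.0):
        print(f"c={c:+.0f} T={T:6.0f}  integral={dirichlet(c, T):+.5f}  (±π = ±{math.pi:.5f})")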

36-752: Lecture 22

Central Limit Theorem. Suppose that S_n is the sum of n independent random variables with mean 0. Under some conditions to be given soon, there will exist a constant c_n such that S_n/c_n has a cf that is close to the cf of the standard normal distribution. If we can show that the cf of S_n/c_n converges to the cf of the standard normal distribution, then we have that S_n/c_n converges in distribution to the standard normal distribution. To achieve this goal, we will need to be able to approximate arbitrary characteristic functions. By various integrations by parts and reasoning similar to that which achieved (264), we can obtain the following bound:

| exp(ix) − (1 + ix − x²/2) | ≤ min{ |x|³, x² }.

In terms of the cf of a random variable X with mean 0 and variance σ², this equation says that

(275)  | φ_X(t) − (1 − t²σ²/2) | ≤ E( min{ |Xt|³, (Xt)² } ).

Notice that only a second moment is required in order for the mean on the far right to exist.

In order to apply a bound like this to a sum like S_n, we need to approximate a product of cf's by a product of approximations. The following simple results are useful. Their proofs are contained in another course document.

Proposition 276. Let z_1, ..., z_m and w_1, ..., w_m be complex numbers with modulus at most 1. Then

| ∏_{k=1}^m z_k − ∏_{k=1}^m w_k | ≤ Σ_{k=1}^m |z_k − w_k|.

Proposition 277. For complex z, |exp(z) − 1 − z| ≤ |z|² exp(|z|).

We are now in position to state and prove a central limit theorem.

Theorem 278. (Lindeberg–Feller central limit theorem) Let {r_n}_{n=1}^∞ be a sequence of integers. For each n = 1, 2, ..., let X_{n,1}, ..., X_{n,r_n} be independent random variables with X_{n,k} having mean 0 and finite nonzero variance σ_{n,k}². Define s_n² = Σ_{k=1}^{r_n} σ_{n,k}² and S_n = Σ_{k=1}^{r_n} X_{n,k}. Assume that, for every ε > 0,

(279)  lim_{n→∞} (1/s_n²) Σ_{k=1}^{r_n} ∫_{{|X_{n,k}| ≥ ε s_n}} X_{n,k}²(ω) dP(ω) = 0.

Then S_n/s_n converges in distribution to the standard normal distribution.

The usual iid central limit theorem is a special case. If X_1, X_2, ... are iid with mean 0 and variance σ², then let r_n = n and X_{n,k} = X_k for all n and all k ≤ n. Then s_n² = nσ² and

(1/s_n²) Σ_{k=1}^n ∫_{{|X_{n,k}| ≥ ε s_n}} |X_{n,k}(ω)|² dP(ω) = (1/σ²) ∫_{{|X_1| ≥ εσ√n}} |X_1(ω)|² dP(ω),

which goes to 0 as n → ∞, because {ω : |X_1(ω)| ≥ εσ√n} decreases to the empty set as n → ∞.

The Lyapounov central limit theorem is another special case. In this theorem, instead of assuming (279), we assume that there exists δ > 0 such that E[|X_{n,k}|^{2+δ}] < ∞ and that

(280)  lim_{n→∞} (1/s_n^{2+δ}) Σ_{k=1}^{r_n} E( |X_{n,k}|^{2+δ} ) = 0.

Since |X_{n,k}|² ≤ |X_{n,k}|^{2+δ}/[ε^δ s_n^δ] when |X_{n,k}| ≥ ε s_n, we have that the sum in (279) is bounded by

(1/(ε^δ s_n^{2+δ})) Σ_{k=1}^{r_n} ∫_{{|X_{n,k}| ≥ ε s_n}} |X_{n,k}(ω)|^{2+δ} dP(ω) ≤ (1/(ε^δ s_n^{2+δ})) Σ_{k=1}^{r_n} E( |X_{n,k}|^{2+δ} ).

Hence, if (280) holds, so does (279).

Example 281. Let Y_1, Y_2, ... be independent Poisson random variables, with the parameter of Y_k being 1/k. Then let X_{n,k} = Y_k − 1/k for all n and all k ≤ n. Now, s_n² = L_n = Σ_{k=1}^n 1/k. Take δ = 1. Since |x|³ ≤ x² + x⁴ for all x, and since the central moments of a Poisson variable with parameter 1/k give E(X_{n,k}²) = 1/k and E(X_{n,k}⁴) = 1/k + 3/k²,

E( |X_{n,k}|³ ) ≤ E( X_{n,k}² + X_{n,k}⁴ ) = 2/k + 3/k² ≤ 5/k.

The sum on the left of (280) is then bounded by 5 L_n / L_n^{3/2} = 5/√L_n, which goes to 0. So [Σ_{k=1}^n Y_k − L_n]/√L_n converges in distribution to standard normal. Notice that L_n = log(n) + c_n, where c_n is bounded. By Theorem 252, [Σ_{k=1}^n Y_k − log(n)]/√(log(n)) converges in distribution to standard normal also.
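A simulation sketch of Example 281 (not in the notes). Drawing the Poisson variates with Knuth's classical algorithm is my choice here; it is adequate because the means 1/k are at most 1. The standardized sums should have mean near 0 and variance near 1.

import math, random, statistics

random.seed(0)

def poisson(lam):
    # Knuth's algorithm: count uniform multiplications until the product
    # drops below exp(-lam); the count is Poisson(lam) distributed
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

n = 2_000
L_n = sum(1.0 / k for k in range(1, n + 1))
draws = []
for _ in range(1_000):
    S = sum(poisson(1.0 / k) for k in range(1, n + 1))
    draws.append((S - L_n) / math.sqrt(L_n))   # [Σ Y_k - L_n]/√L_n

print("mean ≈", round(statistics.mean(draws), 3),
      " variance ≈", round(statistics.variance(draws), 3))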

The proof of Theorem 278 works by applying the continuity theorem 269. We must show that the cf of S_n/s_n converges to exp(−t²/2) for all t. The proof has two (lengthy) steps. One is to approximate the cf φ_{n,k} of each X_{n,k}/s_n by 1 − t²σ_{n,k}²/(2s_n²). The other is to approximate exp(−t²/2) by ∏_{k=1}^{r_n} [1 − t²σ_{n,k}²/(2s_n²)].

Also, notice that X_{n,k} is divided by s_n in all formulas in the statement of the theorem. Hence, without loss of generality, we can assume that s_n = 1 for all n. We do this in the proof, given in a separate document.

Proposition 282. If the X_{n,k} are uniformly bounded and if lim_{n→∞} s_n² = ∞, then (279) will hold.

Example 283. (Bernoulli distribution) If X_{n,k} has a (centered) Bernoulli distribution with parameter 1/k and r_n = n, the condition holds. The theorem does not apply, however, if the Bernoulli parameter is 1/k². Indeed, if the Bernoulli parameter is 1/k², then Σ_{k=1}^n X_{n,k} converges almost surely according to Theorem 202. As another example, if r_n = n and the Bernoulli parameter is k/(n+1) for k = 1, ..., n, then s_n² = n(n+2)/[6(n+1)]. In fact, r_n could be as small as n^{1/2+ε} for 0 < ε ≤ 1/2, and the theorem would still apply. This example cannot be described as a single sequence, as all of the distributions of X_{n,k} change as n changes.

Example 284. (Delta method) Suppose that Y_1, Y_2, ... are iid with common mean η and common variance σ². Let X̄_n = (1/n) Σ_{j=1}^n Y_j. Then √n(X̄_n − η) →^D Z, where Z has a normal distribution with mean 0 and variance σ². If g is a function with derivative g′(η) at η, then √n[g(X̄_n) − g(η)] converges in distribution to a normal distribution with mean 0 and variance [g′(η)]²σ².

A multivariate central limit theorem exists for iid sequences, and the proof combines the univariate central limit theorem with the method of the Cramér–Wold lemma 266 and the continuity theorem 269.

Theorem 285. (Multivariate central limit theorem) Let {X_n}_{n=1}^∞ be a sequence of iid random vectors with common mean vector η and common covariance matrix Σ. Let X̄_n be the average of the first n of these vectors. Then Z_n = √n(X̄_n − η) converges in distribution to a multivariate normal distribution with zero mean vector and covariance matrix Σ.

Proof. By Lemma 266 and its application to convergence in distribution, all we need to show is that, for all α, α⊤Z_n →^D N(0, α⊤Σα). For every vector α, let Y_k = α⊤X_k; these are iid with common mean α⊤η and common variance α⊤Σα. Let s_n² = nα⊤Σα. If α⊤Σα = 0, then Pr(Y_k = α⊤η) = 1 and Pr(α⊤Z_n = 0) = 1 for all n, which means that α⊤Z_n →^D N(0, α⊤Σα). For the rest of the proof, assume that α⊤Σα > 0. Theorem 278 says that

(nα⊤X̄_n − nα⊤η)/s_n = α⊤Z_n / √(α⊤Σα) →^D N(0, 1).

Multiply by √(α⊤Σα) to get that α⊤Z_n →^D N(0, α⊤Σα). ∎

A multivariate central limit theorem also exists for general independent sequences, but it is very cumbersome to state. (Imagine replacing all of the σ²'s and s_n²'s in Theorem 278 by matrices.)

Some Results About Complex Numbers

Proposition 276. Let z_1, ..., z_m and w_1, ..., w_m be complex numbers with modulus at most 1. Then

| ∏_{k=1}^m z_k − ∏_{k=1}^m w_k | ≤ Σ_{k=1}^m |z_k − w_k|.

Proof. We shall use induction. The result is trivially true when m = 1. Assume that it is true for m = m_0. For m = m_0 + 1, we have

| ∏_{k=1}^{m_0+1} z_k − ∏_{k=1}^{m_0+1} w_k |
  = | ∏_{k=1}^{m_0+1} z_k − w_{m_0+1} ∏_{k=1}^{m_0} z_k + w_{m_0+1} ∏_{k=1}^{m_0} z_k − ∏_{k=1}^{m_0+1} w_k |
  ≤ | ∏_{k=1}^{m_0} z_k | |z_{m_0+1} − w_{m_0+1}| + |w_{m_0+1}| | ∏_{k=1}^{m_0} z_k − ∏_{k=1}^{m_0} w_k |
  ≤ Σ_{k=1}^{m_0} |z_k − w_k| + |z_{m_0+1} − w_{m_0+1}|. ∎

Proposition 277. For complex z, |exp(z) − 1 − z| ≤ |z|² exp(|z|).

Proof. Write exp(z) − 1 − z = Σ_{k=2}^∞ z^k/k!. Since k! ≤ (k+2)! for k ≥ 0, we have

| Σ_{k=2}^∞ z^k/k! | ≤ |z|² Σ_{k=0}^∞ |z|^k/(k+2)! ≤ |z|² Σ_{k=0}^∞ |z|^k/k! = |z|² exp(|z|). ∎
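Both propositions are easy to sanity-check numerically with Python's built-in complex arithmetic; this quick test (not in the notes) draws random complex numbers and verifies the two bounds.

import cmath, math, random

random.seed(0)

def rand_unit_disk():
    # rejection sampling of a point in the closed unit disk
    while True:
        z = complex(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0))
        if abs(z) <= 1.0:
            return z

for _ in range(1_000):
    m = random.randint(1, 6)
    zs = [rand_unit_disk() for _ in range(m)]
    ws = [rand_unit_disk() for _ in range(m)]
    prod_z, prod_w = complex(1.0), complex(1.0)
    for z, w in zip(zs, ws):
        prod_z *= z
        prod_w *= w
    # Proposition 276
    assert abs(prod_z - prod_w) <= sum(abs(z - w) for z, w in zip(zs, ws)) + 1e-12
    # Proposition 277
    z = complex(random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0))
    assert abs(cmath.exp(z) - 1.0 - z) <= abs(z) ** 2 * math.exp(abs(z)) + 1e-12

print("both bounds held in 1000 random trials")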

Central Limit Theorem

Proof of Theorem 278. Without loss of generality, we assume that s_n = 1 for all n. The cf of S_n is

φ_n(t) = ∏_{k=1}^{r_n} φ_{n,k}(t).

According to (275), for each n, k, and t,

| φ_{n,k}(t) − (1 − t²σ_{n,k}²/2) | ≤ E( min{ |X_{n,k} t|³, (X_{n,k} t)² } )
  ≤ ∫_{{|X_{n,k}| < ε}} |t X_{n,k}(ω)|³ dP(ω) + ∫_{{|X_{n,k}| ≥ ε}} |t X_{n,k}(ω)|² dP(ω)
  ≤ ε |t|³ σ_{n,k}² + t² ∫_{{|X_{n,k}| ≥ ε}} X_{n,k}²(ω) dP(ω).

It follows that

Σ_{k=1}^{r_n} | φ_{n,k}(t) − (1 − t²σ_{n,k}²/2) | ≤ ε|t|³ + t² Σ_{k=1}^{r_n} ∫_{{|X_{n,k}| ≥ ε}} X_{n,k}²(ω) dP(ω).

The last sum goes to 0 as n → ∞ according to (279). Since ε is arbitrary, we have

(286)  lim_{n→∞} Σ_{k=1}^{r_n} | φ_{n,k}(t) − (1 − t²σ_{n,k}²/2) | = 0.

In order to apply Proposition 276, we need the σ_{n,k}² to all be small. For each ε > 0, we have

σ_{n,k}² = ∫_{{|X_{n,k}| ≤ ε}} X_{n,k}²(ω) dP(ω) + ∫_{{|X_{n,k}| > ε}} X_{n,k}²(ω) dP(ω)
  ≤ ε² + ∫_{{|X_{n,k}| > ε}} X_{n,k}²(ω) dP(ω).

It follows from (279) that

(287)  lim_{n→∞} max_k σ_{n,k}² = 0.

Next, fix t ≠ 0 and notice that, for n sufficiently large, 0 < t²σ_{n,k}²/2 < 1 for all k simultaneously. It follows from Proposition 276 and (286) that

(288)  lim_{n→∞} | φ_n(t) − ∏_{k=1}^{r_n} (1 − t²σ_{n,k}²/2) | = 0.

Since s_n² = 1, we have that exp(−t²/2) = ∏_{k=1}^{r_n} exp(−t²σ_{n,k}²/2). For n large enough so that t²σ_{n,k}²/2 < 1 for all k, write

| exp(−t²/2) − ∏_{k=1}^{r_n} (1 − t²σ_{n,k}²/2) | ≤ Σ_{k=1}^{r_n} | exp(−t²σ_{n,k}²/2) − 1 + t²σ_{n,k}²/2 |
  ≤ Σ_{k=1}^{r_n} (t⁴σ_{n,k}⁴/4) exp(t²σ_{n,k}²/2)
(289)  ≤ (t⁴/4) max_k σ_{n,k}² exp(t²/2),

where the first inequality follows from Proposition 276, the second follows from Proposition 277, and the third follows from the fact that s_n² = 1. Finally, the last term in (289) goes to 0 according to (287). Combining this with (288) says that lim_{n→∞} φ_n(t) = exp(−t²/2). ∎

Existence of rcd's

This document contains more details about the proof of the existence of regular conditional distributions. For each rational number q, let μ_{X|C}((−∞, q]) be a version of Pr(X ≤ q | C). Define

C_1 = {ω : μ_{X|C}((−∞, q])(ω) = inf_{rational r > q} μ_{X|C}((−∞, r])(ω), for all rational q},

C_2 = {ω : lim_{x→−∞, x rational} μ_{X|C}((−∞, x])(ω) = 0},

C_3 = {ω : lim_{x→∞, x rational} μ_{X|C}((−∞, x])(ω) = 1}.

(Notice that C_2 and C_3 are defined slightly differently than in the original class notes.) Define

M_{q,r} = {ω : μ_{X|C}((−∞, q])(ω) < μ_{X|C}((−∞, r])(ω)},   M = ∪_{q>r} M_{q,r}.

If P(M_{q,r}) > 0 for some q > r, then

Pr(M_{q,r} ∩ {X ≤ q}) = ∫_{M_{q,r}} μ_{X|C}((−∞, q]) dP < ∫_{M_{q,r}} μ_{X|C}((−∞, r]) dP = Pr(M_{q,r} ∩ {X ≤ r}),

which is a contradiction, because {X ≤ r} ⊆ {X ≤ q} when r < q. Hence, P(M) = 0. Next, define

N_q = {ω ∈ M^C : lim_{r↓q, r rational} μ_{X|C}((−∞, r])(ω) ≠ μ_{X|C}((−∞, q])(ω)},   N = ∪_{all q} N_q.

If P(N_q) > 0 for some q, then

Pr(N_q ∩ {X ≤ q}) = ∫_{N_q} μ_{X|C}((−∞, q]) dP < lim_{r↓q, r rational} ∫_{N_q} μ_{X|C}((−∞, r]) dP = lim_{r↓q, r rational} Pr(N_q ∩ {X ≤ r}),

which is a contradiction. We can use Example 235 once again to prove that P(N) = 0. Notice that C_1 = N^C, so P(C_1) = 1.

Next, notice that (since the intersection over all rational x of {X ≤ x} is empty)

0 = P( C_1 ∩ C_2^C ∩ ∩_{rational x} {X ≤ x} ) = lim_{x→−∞, x rational} P( C_1 ∩ C_2^C ∩ {X ≤ x} ) = lim_{x→−∞, x rational} ∫_{C_1 ∩ C_2^C} μ_{X|C}((−∞, x]) dP.

If P(C_1 ∩ C_2) < 1, then the last integral above is strictly positive, a contradiction. The interchange of limit and integral is justified by the fact that, for ω ∈ C_1, μ_{X|C}((−∞, x])(ω) is nondecreasing in x. A similar contradiction arises if P(C_1 ∩ C_3) < 1.

Martingales

Martingales. Let (Ω, F, P) be a probability space.

Definition 290. Let F_1 ⊆ F_2 ⊆ ⋯ be a sequence of sub-σ-fields of F. We call {F_n}_{n=1}^∞ a filtration. If X_n : Ω → IR is F_n-measurable for every n, we say that {X_n}_{n=1}^∞ is adapted to the filtration. If {X_n}_{n=1}^∞ is adapted to a filtration {F_n}_{n=1}^∞, and if E|X_n| < ∞ for all n and E(X_{n+1} | F_n) = X_n for all n, then we say that {X_n}_{n=1}^∞ is a martingale relative to the filtration. Alternatively, we say that {(X_n, F_n)}_{n=1}^∞ is a martingale. If X_n ≤ E(X_{n+1} | F_n) for all n, we say that {(X_n, F_n)}_{n=1}^∞ is a submartingale. If the inequality goes the other way, it is a supermartingale.

Proposition 291. A martingale is both a submartingale and a supermartingale. {X_n}_{n=1}^∞ is a submartingale if and only if {−X_n}_{n=1}^∞ is a supermartingale.

Example 292. Let {Y_n}_{n=1}^∞ be a sequence of independent random variables with finite mean. Let F_n = σ(Y_1, ..., Y_n) and X_n = Σ_{j=1}^n Y_j. If E(Y_n) = 0 for every n, then {(X_n, F_n)}_{n=1}^∞ is a martingale. If E(Y_n) ≥ 0 for every n, then we have a submartingale, and if E(Y_n) ≤ 0 for every n, we have a supermartingale.

Example 293. Let (Ω, F, P) be a probability space. Let {F_n}_{n=1}^∞ be a filtration. Let ν be a finite measure on (Ω, F) such that, for every n, ν has a density X_n with respect to P when both are restricted to (Ω, F_n). Then {X_n}_{n=1}^∞ is adapted to the filtration. To see that we have a martingale, we need to show that, for every n and A ∈ F_n,

(294)  ∫_A X_{n+1}(ω) dP(ω) = ∫_A X_n(ω) dP(ω).

Since F_n ⊆ F_{n+1}, each A ∈ F_n is also in F_{n+1}. Hence both sides of (294) equal ν(A).

Example 295. As a more specific example of Example 293, let Ω = IR^∞ and let F_n = {B × IR^∞ : B ∈ B^n}. That is, F_n is the collection of cylinder sets corresponding to the first n coordinates (the σ-field generated by the first n coordinate projection functions). Let P be the joint distribution of an infinite sequence of iid standard normal random variables. Let ν be the joint distribution of an infinite sequence of iid exponential random variables with parameter 1. For each n, when we restrict both P and ν to F_n, ν has the density

X_n(ω) = (2π)^{n/2} exp( Σ_{j=1}^n [ω_j²/2 − ω_j] )   if ω_1, ..., ω_n > 0,
       = 0                                            otherwise,

with respect to P. It is easy to see that

E(X_{n+1} | F_n) = X_n E( √(2π) exp(ω_{n+1}²/2 − ω_{n+1}) I_{(0,∞)}(ω_{n+1}) ) = X_n.

Example 296. Let {(X_n, F_n)}_{n=1}^∞ be a martingale. Let φ : IR → IR be a convex function such that E[φ(X_n)] is finite for all n. Define Y_n = φ(X_n). Then

E(Y_{n+1} | F_n) = E[φ(X_{n+1}) | F_n] ≥ φ( E(X_{n+1} | F_n) ) = φ(X_n) = Y_n,

where the inequality follows from the conditional version of Jensen's inequality. So {(Y_n, F_n)}_{n=1}^∞ is a submartingale.

Example 297. Let (Ω, F, P) be a probability space. Let {Y_n}_{n=1}^∞ be a sequence of random variables and F_n = σ(Y_1, ..., Y_n). Suppose that, for each n, μ_{Y_1,...,Y_n} has a strictly positive density p_n with respect to Lebesgue measure λ^n. Let Q be another probability on (Ω, F) such that Q((Y_1, ..., Y_n)^{-1}(·)) has a density q_n with respect to λ^n for each n. Define

X_n = q_n(Y_1, ..., Y_n) / p_n(Y_1, ..., Y_n).

It is easy to check that, for each n and H ∈ B^n,

E( I_H(Y_1, ..., Y_n) X_n ) = ∫_H [ q_n(y_1, ..., y_n) / p_n(y_1, ..., y_n) ] p_n(y_1, ..., y_n) dλ^n(y_1, ..., y_n) = Q((Y_1, ..., Y_n)^{-1}(H)).

This makes X_n a density for Q with respect to P on the σ-field F_n. Hence we have a martingale according to the construction in Example 293. This is an example of likelihood ratios, and it is a generalization of Example 295.
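Here is a simulation sketch of a likelihood-ratio martingale. To keep the Monte Carlo stable, I use an assumed setup simpler than Example 295 (my choice, not the notes'): under P the Y_j are iid N(0,1), and Q makes them iid N(1/2, 1), so X_n = ∏ exp(Y_j/2 − 1/8). Then E_P(X_n) = 1 for every n, while the typical value of X_n drifts toward 0.

import math, random, statistics

random.seed(0)
MU = 0.5  # assumed mean shift under Q (hypothetical, chosen for stability)

def lik_ratio(path):
    # X_n = q_n/p_n = Π exp(MU*y_j - MU²/2) for N(MU,1) versus N(0,1) densities
    return math.exp(sum(MU * y - MU * MU / 2.0 for y in path))

for n in (1, 5, 25):
    vals = [lik_ratio([random.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(100_000)]
    print(f"n={n:2d}  E_P(X_n) ≈ {statistics.mean(vals):.3f} (theory: 1)"
          f"  median ≈ {statistics.median(vals):.3f} (drifts toward 0)")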

Example 298. Let {F_n}_{n=1}^∞ be a filtration and let X be a random variable with finite mean. Define X_n = E(X | F_n). By the law of total probability, we have a martingale. Such a martingale is sometimes called a Lévy martingale.

Example 299. Consider Example 292 again. Think of Y_n as being the amount that a gambler wins per unit of currency bet on the nth play in a sequence of games. Let Y_0 denote the gambler's initial fortune, which we can assume is a known value, and let F_0 be the trivial σ-field. (We could let Y_0 be a random variable and let F_0 = σ(Y_0), but then we would also have to expand F_n to σ(Y_0, ..., Y_n).) Suppose that the gambler devises a system for determining how much W_n ≥ 0 to bet on the nth play. We assume that W_n is F_{n−1}-measurable for each n. This forces the gambler to choose the amount to bet before knowing what will happen. Now, define Z_n = Y_0 + Σ_{j=1}^n W_j Y_j. Since

E(W_{n+1} Y_{n+1} | F_n) = W_{n+1} E(Y_{n+1} | F_n) = W_{n+1} E(Y_{n+1}),

and W_{n+1} ≥ 0, we have that E(W_{n+1} Y_{n+1} | F_n) is ≥ 0, = 0, or ≤ 0 according as E(Y_{n+1}) is ≥ 0, = 0, or ≤ 0. That is, {(Z_n, F_n)}_{n=1}^∞ is a submartingale, a martingale, or a supermartingale according as {(X_n, F_n)}_{n=1}^∞ is a submartingale, a martingale, or a supermartingale. This result is often described by saying that gambling systems cannot change whether a game is favorable, fair, or unfavorable to a gambler.

Definition 300. A sequence {W_n}_{n=1}^∞ of random variables such that W_n is F_{n−1}/B^1-measurable is called previsible. (If there is no F_0, then require that W_1 be constant.)

Example 301. Let {(X_n, F_n)}_{n=1}^∞ be a martingale, and let {W_n}_{n=1}^∞ be previsible. Define Z_1 = X_1 and Z_{n+1} = Z_n + W_{n+1}(X_{n+1} − X_n) for n ≥ 1. Then Z_n is F_n/B^1-measurable and

E(Z_{n+1} | F_n) = Z_n + W_{n+1} E(X_{n+1} − X_n | F_n) = Z_n,

for all n ≥ 1. This makes {(Z_n, F_n)}_{n=1}^∞ a martingale. It is called a martingale transform. Example 299 is an example of this.

Theorem 302. (Doob decomposition) {(X_n, F_n)}_{n=1}^∞ is a submartingale if and only if there is a martingale {(Z_n, F_n)}_{n=1}^∞ and a nondecreasing previsible process {A_n}_{n=1}^∞ with A_1 = 0 such that X_n = Z_n + A_n for all n. The decomposition is unique (a.s.).

Proof. For the "if" direction, notice that

E(Z_{n+1} + A_{n+1} | F_n) = Z_n + A_{n+1} ≥ Z_n + A_n = X_n.

For the "only if" direction, define A_1 = 0 and

A_n = Σ_{k=2}^n ( E(X_k | F_{k−1}) − X_{k−1} ),

for n > 1. Also, define Z_n = X_n − A_n. Because E(X_k | F_{k−1}) ≥ X_{k−1} for all k > 1, we have A_n ≥ A_{n−1} for all n > 1, so {A_n}_{n=1}^∞ is nondecreasing. Also, E(X_k | F_{k−1}) is F_{n−1}/B^1-measurable for all 1 < k ≤ n, so {A_n}_{n=1}^∞ is previsible. Next,

E(Z_{n+1} | F_n) = E(X_{n+1} | F_n) − Σ_{k=2}^{n+1} [ E(X_k | F_{k−1}) − X_{k−1} ]
  = X_n − Σ_{k=2}^n [ E(X_k | F_{k−1}) − X_{k−1} ] = Z_n,

so {Z_n}_{n=1}^∞ is a martingale.

For uniqueness, suppose that X_n = Y_n + W_n is another decomposition, so that {Y_n} is a martingale and {W_n} is previsible with W_1 = 0. Then write

Σ_{k=2}^n [ E(X_k | F_{k−1}) − X_{k−1} ] = Σ_{k=2}^n [ E(Y_k + W_k | F_{k−1}) − X_{k−1} ]
  = Σ_{k=2}^n ( Y_{k−1} + W_k − X_{k−1} )
  = Σ_{k=2}^n ( W_k − W_{k−1} ) = W_n.

It follows that W_n = A_n a.s., hence Y_n = Z_n a.s. ∎

The previsible process in Theorem 302 is called the compensator for the submartingale.
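A concrete sketch of the decomposition (the submartingale is my choice, not the notes'): take X_n = S_n² for a simple symmetric random walk S_n. Since E(X_n | F_{n−1}) = X_{n−1} + 1, the compensator is A_n = n − 1 and Z_n = S_n² − (n − 1) should be a martingale; the simulation checks that E(Z_n) stays constant.

import random

random.seed(0)
N, reps = 50, 100_000
sums = [0.0] * (N + 1)
for _ in range(reps):
    s = 0
    for n in range(1, N + 1):
        s += random.choice((-1, 1))
        sums[n] += s * s - (n - 1)      # Z_n = X_n - A_n

for n in (1, 10, 50):
    print(f"n={n:2d}  E(Z_n) ≈ {sums[n] / reps:.3f}  (theory: 1)")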

Stopping Times. Let (Ω, F, P) be a probability space, and let {F_n}_{n=1}^∞ be a filtration.⁴

Definition 303. A positive extended-integer-valued random variable τ is called a stopping time with respect to the filtration if {τ = n} ∈ F_n for all finite n. A special σ-field, F_τ, is defined by

F_τ = {A ∈ F : A ∩ {τ ≤ k} ∈ F_k, for all finite k}.

If {X_n}_{n=1}^∞ is adapted to the filtration and if τ < ∞ a.s., then X_τ is defined as X_{τ(ω)}(ω). (Define X_τ equal to some arbitrary random variable X_∞ for τ = ∞.)

Example 304. Let {X_n}_{n=1}^∞ be adapted to the filtration and let τ = k_0, a constant. Then {τ = n} is either Ω or ∅ and it is in every F_n, so τ is a stopping time. Also,

A ∩ {τ ≤ k} = A if k_0 ≤ k, = ∅ if k_0 > k.

So A ∩ {τ ≤ k} ∈ F_k for all finite k if and only if A ∈ F_{k_0}. So F_τ = F_{k_0}.

Example 305. Let {X_n}_{n=1}^∞ be adapted to the filtration. Let B be a Borel set and let τ = inf{n : X_n ∈ B}. As usual, inf ∅ = ∞. For each finite n,

{τ = n} = {X_n ∈ B} ∩ ∩_{k<n} {X_k ∈ B^C} ∈ F_n.

We can show that τ and X_τ are both F_τ-measurable. For example, to show that X_τ is F_τ-measurable, we must show that, for all B ∈ B^1, X_τ^{-1}(B) ∈ F and, for all 1 ≤ k < ∞, {τ ≤ k} ∩ X_τ^{-1}(B) ∈ F_k. Now,

X_τ^{-1}(B) = ∪_{k=1}^∞ ( {τ = k} ∩ X_k^{-1}(B) ) ∪ ( {τ = ∞} ∩ X_∞^{-1}(B) ) ∈ F.

This shows that X_τ is F-measurable. Next, fix k and write

{τ ≤ k} ∩ X_τ^{-1}(B) = ∪_{j=1}^k ( X_j^{-1}(B) ∩ {τ = j} ) ∈ F_k.

This proves that X_τ is F_τ-measurable. Suppose that τ_1 and τ_2 are two stopping times such that τ_1 ≤ τ_2. Let A ∈ F_{τ_1}. Since A ∩ {τ_2 ≤ k} = A ∩ {τ_1 ≤ k} ∩ {τ_2 ≤ k} for every event A, it follows that A ∩ {τ_2 ≤ k} ∈ F_k and A ∈ F_{τ_2}. Hence F_{τ_1} ⊆ F_{τ_2}. As an example, let τ be an arbitrary stopping time (not necessarily finite a.s.) and define τ_k = min{k, τ} for finite k. Then τ_k is a finite stopping time with τ_k ≤ τ. Hence X_{τ_k} is F_{τ_k}-measurable for each k, and so X_{τ_k} is F_τ-measurable. Similarly, τ_k ≤ k, so that F_{τ_k} ⊆ F_k and X_{τ_k} is F_k-measurable.

⁴If your filtration starts at n = 0, you can allow stopping times to be nonnegative-valued. Indeed, if your filtration starts at an arbitrary integer k, then a stopping time can take any value from k on up. There is a trivial extension of every filtration to one lower subscript. For example, if we start at n = 1, we can extend to n = 0 by defining F_0 = {Ω, ∅}. Every martingale can also be extended by defining X_0 = E(X_1). For the rest of the course, we will assume that the lowest possible value for a stopping time is 1.

Example 306. The gambler in Example 299 can try to build a stopping time into a gambling system. For example, let τ = min{n : Z_n ≥ Y_0 + x} for some integer x > 0. This would seem to guarantee winning at least x. There are two possible drawbacks. One is that there may be positive probability that τ = ∞. Even if τ < ∞ a.s., it might require infinite resources to guarantee that we can survive until τ. For example, let Y_0 = 0 and let Y_n have equal probability of being 1 or −1 for all n. So, we stop as soon as we have won x more than we have lost. If we modify the problem so that we have only finite resources (say k units), then this becomes the classic gambler's ruin problem. The probability of achieving Z_n = x before Z_n = −k is k/(k + x), which goes to 1 as k → ∞. So, if we have infinite resources, the probability is 1 that τ < ∞; otherwise, we may never achieve the goal. If the probability of winning on each game is less than 1/2, then P(τ = ∞) > 0.

Suppose that we start with a martingale {(X_n, F_n)}_{n=1}^∞ and a stopping time τ. We can define

X_n* = X_n if n ≤ τ, = X_τ if n > τ; that is, X_n* = X_{min{τ,n}}.

We can call this the stopped martingale. It turns out that {X_n*}_{n=1}^∞ is also a martingale relative to the filtration. First, note that X_{min{τ,n}} is F_n-measurable. Next, notice that

E(|X_n*|) = Σ_{k=1}^{n−1} ∫_{{τ=k}} |X_k| dP + ∫_{{τ≥n}} |X_n| dP ≤ Σ_{k=1}^n E(|X_k|) < ∞.

Finally, let A ∈ F_n. Then A ∩ {τ > n} ∈ F_n, so

∫_{A∩{τ>n}} X_{n+1} dP = ∫_{A∩{τ>n}} X_n dP,

because X_n = E(X_{n+1} | F_n). It now follows that

∫_A X_{n+1}* dP = ∫_{A∩{τ>n}} X_{n+1} dP + ∫_{A∩{τ≤n}} X_τ dP
  = ∫_{A∩{τ>n}} X_n dP + ∫_{A∩{τ≤n}} X_τ dP
  = ∫_A X_n* dP.

It follows that X_n* = E(X_{n+1}* | F_n), and the stopped martingale is also a martingale. Notice that lim_{n→∞} X_n* = X_τ a.s., if τ < ∞ a.s.

Example 307. Consider the stopping time in Example 306 with x = 1. That is, τ is the first time that a gambler, betting on iid fair coin flips, wins 1 more than he/she has lost. This τ < ∞ a.s. It follows that lim_{n→∞} X_n* = X_τ a.s. However, E(X_n*) = 0 for all n, while E(X_τ) = 1 because X_τ = 1 a.s.
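Example 307 is easy to reproduce by simulation (a sketch, with an arbitrary finite horizon standing in for n → ∞): the stopped walk has mean 0 at every fixed n, while X_τ = 1 on the event {τ < ∞}, whose probability approaches 1.

import random

random.seed(0)
reps, checkpoints = 100_000, (10, 100, 1000)
totals = {m: 0 for m in checkpoints}
hit = 0
for _ in range(reps):
    s, tau = 0, None
    for n in range(1, checkpoints[-1] + 1):
        s += random.choice((-1, 1))
        if n in totals:
            totals[n] += s              # X*_n = X_n while n <= τ
        if s == 1:
            tau = n
            break
    if tau is not None:
        hit += 1
        for m in checkpoints:
            if m > tau:
                totals[m] += 1          # X*_m = X_τ = 1 after stopping

for m in checkpoints:
    print(f"E(X*_{m}) ≈ {totals[m] / reps:+.4f}  (theory: 0)")
print(f"P(τ ≤ {checkpoints[-1]}) ≈ {hit / reps:.3f}  (→ 1 as the horizon grows)")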

Optional Sampling. Let {(X_n, F_n)}_{n=1}^∞ be a martingale. Consider a sequence of a.s. finite stopping times {τ_n}_{n=1}^∞ such that 1 ≤ τ_j ≤ τ_{j+1} for all j. Then we can construct {(X_{τ_n}, F_{τ_n})}_{n=1}^∞ and ask whether or not it is a martingale. In general, an unpleasant integrability condition is needed to prove this. We shall do a simplified case.

Theorem 308. (Optional sampling theorem) Let {(X_n, F_n)}_{n=1}^∞ be a (sub)martingale. Suppose that, for each n, there is a finite constant M_n such that τ_n ≤ M_n a.s. Then {(X_{τ_n}, F_{τ_n})}_{n=1}^∞ is a (sub)martingale.

The proof is given in a separate document.

The unpleasant integrability condition that can replace P(τ_n ≤ M_n) = 1 is the following: for every n,

• P(τ_n < ∞) = 1,

• E(|X_{τ_n}|) < ∞, and

• lim inf_{m→∞} E( |X_m| I_{(m,∞)}(τ_n) ) = 0.

Because we can use a constant stopping time to stop a martingale, it follows that martingale theorems will apply to finite sequences of random variables as well as infinite sequences.

Martingale Convergence. The upcrossing lemma says that a submartingale cannot cross a fixed nondegenerate interval very often with high probability. If the submartingale were to cross an interval infinitely often, then its lim sup and lim inf would have to be different, and it couldn't converge.

Lemma 309. (Upcrossing lemma) Let {(X_k, F_k)}_{k=1}^n be a submartingale. Let r < q, and define V to be the number of times that the sequence X_1, ..., X_n crosses from below r to above q. Then

(310)  E(V) ≤ (E|X_n| + |r|) / (q − r).

We will only give an outline of the proof of Lemma 309. Let Y_k = max{0, X_k − r}. Then V is the number of times that Y_k moves from 0 to above q − r, and {(Y_k, F_k)}_{k=1}^n is a submartingale. Figure 1 shows an example of the path of Y_k, indicating those times that it is crossing up.

Fig. 1. Step in the proof of Lemma 309.

It is easy to see that V is at most the sum of the upcrossing increments divided by q − r. That is,

V ≤ (1/(q − r)) Σ_{k=2}^n (Y_k − Y_{k−1}) I_{E_k},

where E_k is the event that the path is crossing up at time k. Notice that E_k ∈ F_{k−1} for all k. Hence, for each k ≥ 2,

E( [Y_k − Y_{k−1}] I_{E_k} ) = ∫_{E_k} (Y_k − Y_{k−1}) dP = ∫_{E_k} [ E(Y_k | F_{k−1}) − Y_{k−1} ] dP.

Because E(Y_k | F_{k−1}) − Y_{k−1} ≥ 0 a.s. by the submartingale property, we can expand the integral from E_k to all of Ω to get

E( [Y_k − Y_{k−1}] I_{E_k} ) ≤ ∫ [ E(Y_k | F_{k−1}) − Y_{k−1} ] dP = E(Y_k − Y_{k−1}).

It follows that E(V) ≤ [E(Y_n) − E(Y_1)]/(q − r) ≤ E(Y_n)/(q − r), because Y_1 ≥ 0. Because max{0, x} is a convex function of x, E(Y_n) ≤ E(|X_n|) + |r|. The full proof is in another course document.
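Here is a small utility (not from the notes) that counts upcrossings of [r, q] as in Lemma 309, together with a Monte Carlo comparison of E(V) against the bound (310) for a simple symmetric random walk (a martingale, hence a submartingale).

import random, statistics

random.seed(0)

def upcrossings(path, r, q):
    # V = number of moves from (below r) to (above q), as in Lemma 309
    count, below = 0, False
    for x in path:
        if x < r:
            below = True
        elif x > q and below:
            count += 1
            below = False
    return count

n, r, q, reps = 200, -1.0, 1.0, 20_000
vs, absx = [], []
for _ in range(reps):
    s, path = 0, []
    for _ in range(n):
        s += random.choice((-1, 1))
        path.append(s)
    vs.append(upcrossings(path, r, q))
    absx.append(abs(path[-1]))

bound = (statistics.mean(absx) + abs(r)) / (q - r)   # (E|X_n| + |r|)/(q - r)
print(f"E(V) ≈ {statistics.mean(vs):.3f}  ≤  bound ≈ {bound:.3f}")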

Theorem 311. (Martingale convergence theorem) Let {(X_n, F_n)}_{n=1}^∞ be a submartingale such that sup_n E|X_n| < ∞. Then X = lim_{n→∞} X_n exists a.s. and E|X| < ∞.

Proof. Let X* = lim sup_{n→∞} X_n and X_* = lim inf_{n→∞} X_n. Let B = {ω : X_*(ω) < X*(ω)}. We will prove that P(B) = 0. We can write

B = ∪_{r<q, r,q rational} {ω : X*(ω) ≥ q > r ≥ X_*(ω)}.

Now, X*(ω) ≥ q > r ≥ X_*(ω) if and only if the values of X_n(ω) cross from being below r to being above q infinitely often. For fixed r and q, we now prove that this has probability 0; hence P(B) = 0. Let V_n equal the number of times that X_1, ..., X_n cross from below r to above q. According to Lemma 309,

sup_n E(V_n) ≤ (1/(q − r)) ( sup_n E(|X_n|) + |r| ) < ∞.

The number of times the values of {X_n(ω)}_{n=1}^∞ cross from below r to above q equals lim_{n→∞} V_n(ω). By the monotone convergence theorem,

∞ > sup_n E(V_n) = E( lim_{n→∞} V_n ).

It follows that P({ω : lim_{n→∞} V_n(ω) = ∞}) = 0.

Since P(B) = 0, we have that X = lim_{n→∞} X_n exists a.s. Fatou's lemma says E(|X|) ≤ lim inf_{n→∞} E(|X_n|) ≤ sup_n E(|X_n|) < ∞. ∎

Example 312. For the Lévy martingale of Example 298,

E(|X_n|) = E( |E[X | F_n]| ) ≤ E( E(|X| | F_n) ) = E(|X|) < ∞,

for all n, so the martingale converges. In Theorem 316, we can say even more about the limit.

Example 313. For the random walk martingale of Example 292, if the Y_n's are iid with finite variance σ², then X_n/√n converges in distribution, so X_n can't converge a.s. Indeed, the Markov inequality says that

E(|X_n|)/(c√n) ≥ P(|X_n| > c√n) → 2[1 − Φ(c/σ)],

for each positive c. So, eventually E(|X_n|) ≥ c√n [1 − Φ(c/σ)], and lim_{n→∞} E(|X_n|) = ∞.

Example 314. For the martingale of Example 292, if Σ_{n=1}^∞ Var(Y_n) < ∞, then Theorem 202 already told us that the sum converges a.s.

We need the following result before we can identify the limit of a Lévy martingale. The proof is given in another course document.

Lemma 315. Let {F_n}_{n=1}^∞ be a sequence of σ-fields. Let E(|X|) < ∞. Define X_n = E(X | F_n). Then {X_n}_{n=1}^∞ is a uniformly integrable sequence.

Theorem 316. (Lévy's theorem) Let {F_n}_{n=1}^∞ be an increasing sequence of σ-fields. Let F_∞ be the smallest σ-field containing all of the F_n's. Let E(|X|) < ∞. Define X_n = E(X | F_n) and X_∞ = E(X | F_∞). Then lim_{n→∞} X_n = X_∞ a.s.

The proof of Theorem 316 is in another course document.

Lemma 317. Let {(X_n, F_n)}_{n=1}^∞ be a nonnegative supermartingale. Then X_n converges a.s. to a random variable with finite mean.

Proof. Let Y_n = −X_n. Then {(Y_n, F_n)}_{n=1}^∞ is a submartingale, and

E(|Y_n|) = E(X_n) = E[ E(X_n | F_{n−1}) ] ≤ E(X_{n−1}).

It follows that E(|Y_n|) ≤ E(X_1) < ∞ for all n, so Theorem 311 applies and Y_n converges a.s. Trivially, X_n = −Y_n also converges. ∎

Reversed Martingales.

Definition 318. For n = −1, −2, ..., let F_{n−1} ⊆ F_n be sub-σ-fields of F, suppose that X_n is F_n-measurable, E(|X_n|) < ∞, and E(X_n | F_{n−1}) = X_{n−1}. Then {(X_n, F_n)}_{n=−∞}^{−1} is a reversed martingale.

An equivalent way to think about reversed martingales is through a decreasing sequence of σ-fields {F_n}_{n=1}^∞ such that F_{n+1} ⊆ F_n for n ≥ 1. The proofs of the next two theorems are similar to those of the corresponding theorems for forward martingales.

Theorem 319. (Reversed martingale convergence theorem) If {(X_n, F_n)}_{n=−∞}^{−1} is a reversed martingale, then X = lim_{n→−∞} X_n exists a.s. and E(X) = E(X_{−1}).

Proof. Just as in the proof of Theorem 311, we let V_n be the number of times that the finite sequence X_n, X_{n+1}, ..., X_{−1} crosses from below a rational r to above another rational q (for n < 0). Lemma 309 says that

E(V_n) ≤ (E(|X_{−1}|) + |r|)/(q − r) < ∞.

As in the proof of Theorem 311, it follows that X = lim_{n→−∞} X_n exists with probability 1. Since X_n = E(X_{−1} | F_n) for each n < −1, Lemma 315 says that

E(X) = lim_{n→−∞} E(X_n) = E(X_{−1}). ∎

Notice that reversed martingales are all of the Lévy type. Not surprisingly, there is a version of Lévy's theorem 316 for reversed martingales. We state it in terms of decreasing σ-fields.

Theorem 320. Let {F_n}_{n=1}^∞ be a decreasing sequence of σ-fields. Let F_∞ = ∩_{n=1}^∞ F_n. Let E(|X|) < ∞. Define X_n = E(X | F_n) and X_∞ = E(X | F_∞). Then lim_{n→∞} X_n = X_∞ a.s.

Proof. It is easy to see that {(X_{−n}, F_{−n})}_{n=−∞}^{−1} is a reversed martingale and that E(|X_1|) < ∞. By Theorem 319, it follows that lim_{n→∞} X_n = Y exists and is finite a.s. To prove that Y = X_∞ a.s., note that X_∞ = E(X_1 | F_∞), since F_∞ ⊆ F_1. So, we must show that Y = E(X_1 | F_∞). Let A ∈ F_∞. Then

∫_A X_n(ω) dP(ω) = ∫_A X_1(ω) dP(ω),

since A ∈ F_n and X_n = E(X_1 | F_n). Once again using Lemma 315, it follows that

lim_{n→∞} ∫_A X_n(ω) dP(ω) = ∫_A Y(ω) dP(ω) = ∫_A X_1(ω) dP(ω),

hence Y = E(X_1 | F_∞). ∎

Theorem 320 allows us to prove a strong law of large numbers that is even more general than the usual version. The greater generality comes from the fact that it applies to sequences that are not necessarily independent.

Exchangeability. A sequence of random quantities $\{X_n\}_{n=1}^\infty$ is exchangeable if, for every $n$ and all distinct $j_1, \ldots, j_n$, the joint distribution of $(X_{j_1}, \ldots, X_{j_n})$ is the same as the joint distribution of $(X_1, \ldots, X_n)$.

Example 321. (Conditionally iid random quantities) Let $\{X_n\}_{n=1}^\infty$ be conditionally iid given a $\sigma$-field $\mathcal{C}$. Then $\{X_n\}_{n=1}^\infty$ is an exchangeable sequence. The result follows easily from the fact that
$$\mu_{X_{j_1},\ldots,X_{j_n} \mid \mathcal{C}} = \mu_{X_1,\ldots,X_n \mid \mathcal{C}}, \quad \text{a.s.}$$

Example 322. Let $\{X_n\}_{n=1}^\infty$ be Bernoulli random variables such that
$$P(X_1 = x_1, \ldots, X_n = x_n) = \frac{1}{(n+1)\binom{n}{y}},$$
where $y = \sum_{j=1}^n x_j$. One can show that this specifies consistent joint distributions. One can also check that the $X_n$'s are not independent:
$$P(X_1 = 1) = \frac{1}{2}, \qquad P(X_1 = 1, X_2 = 1) = \frac{1}{3} \ne \left(\frac{1}{2}\right)^2.$$
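Although the verification is left unstated in the notes, one way to see both claims at once (anticipating Example 328) is to observe that these probabilities arise by first drawing $\theta$ uniform on $[0,1]$ and then, given $\theta$, making the $X_n$'s iid Bernoulli$(\theta)$; the beta integral gives
$$P(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 \theta^y (1-\theta)^{n-y}\,d\theta = \frac{y!\,(n-y)!}{(n+1)!} = \frac{1}{(n+1)\binom{n}{y}}.$$
Consistency and exchangeability are immediate from this mixture representation.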

Theorem 323. (Strong law of large numbers) Let $\{X_n\}_{n=1}^\infty$ be an exchangeable sequence of random variables with finite mean. Then $\lim_{n\to\infty} \frac{1}{n}\sum_{j=1}^n X_j$ exists a.s. and has mean equal to $E(X_1)$. If the $X_j$'s are independent, then the limit equals $E(X_1)$ a.s.

Proof. Define $Y_n = \frac{1}{n}\sum_{j=1}^n X_j$ and let $\mathcal{F}_n$ be the $\sigma$-field generated by all functions of $(X_1, X_2, \ldots)$ that are invariant under permutations of the first $n$ coordinates. (For example, $Y_n$ is such a function.) Let $Z_n = E(X_1 \mid \mathcal{F}_n)$. Theorem 320 says that $Z_n$ converges a.s. to $E(X_1 \mid \mathcal{F}_\infty)$, where $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$. We prove next that $Z_n = Y_n$, a.s. Since $Y_n$ is $\mathcal{F}_n$-measurable, we need only prove that, for all $A \in \mathcal{F}_n$, $E(I_A Y_n) = E(I_A X_1)$. Notice that $I_A$ is a function of $X_1, X_2, \ldots$ that is invariant under permutations of $X_1, \ldots, X_n$, since it depends on $X_1, \ldots, X_n$ only through $Y_n$. That is, there are functions $g$ and $h$ such that
$$\text{(324)} \qquad I_A(\omega) = h(X_1(\omega), X_2(\omega), \ldots) = g(X_1(\omega) + \cdots + X_n(\omega), X_{n+1}(\omega), X_{n+2}(\omega), \ldots).$$
For all $j = 1, \ldots, n$, $X_1 h(X_1, X_2, \ldots)$ has the same distribution as $X_j h(X_j, X_2, \ldots, X_{j-1}, X_1, X_{j+1}, \ldots)$. But
$$h(X_j, X_2, \ldots, X_{j-1}, X_1, X_{j+1}, \ldots) = h(X_1, X_2, \ldots) = I_A,$$
according to (324). Hence, for all $j = 1, \ldots, n$, $I_A X_j$ has the same distribution as $I_A X_1$. It follows that
$$E(I_A X_1) = \frac{1}{n}\sum_{j=1}^n E(I_A X_j) = E(I_A Y_n).$$
Clearly $E(X_1 \mid \mathcal{F}_\infty)$ has mean $E(X_1)$. If the $X_n$'s are independent, then the limit, being measurable with respect to the tail $\sigma$-field, must be constant a.s., by Theorem 168 (Kolmogorov 0-1 law). The constant must equal the mean of the random variable, which is $E(X_1)$. $\Box$

Example 325. In Example 322, we know that $Y_n$ converges a.s., hence it converges in distribution. We can compute the distribution of $Y_n$ exactly: $P(Y_n = k/n) = 1/(n+1)$ for $k = 0, \ldots, n$. Hence, $Y_n$ converges in distribution to the uniform distribution on the interval $[0,1]$, which must be the distribution of the limit. The limit is not a.s. constant.
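A simulation sketch of this example (the sample sizes are arbitrary choices): using the mixture representation noted after Example 322, each sequence tracks its own mixing value $\theta$, and across independent sequences $Y_n$ is approximately uniform on $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5000, 10_000
theta = rng.uniform(size=reps)       # one mixing value per simulated sequence
Yn = rng.binomial(n, theta) / n      # given theta, Y_n is Binomial(n, theta)/n

print(np.abs(Yn - theta).max())      # each Y_n is close to its own theta
# Across replications, Y_n is roughly uniform on [0, 1] (about 0.2 per bin):
print(np.histogram(Yn, bins=5, range=(0.0, 1.0))[0] / reps)
```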

The rest of this material was not covered in class, but is left here for your information. There is a very useful theorem due to deFinetti about exchangeable random quantities that relies upon the strong law of large numbers. To state the theorem, we need to recall the concept of "random probability measure" that was introduced in Example 173. Let $(\mathcal{X}, \mathcal{B})$ be a Borel space, and let $\mathcal{P}$ be the set of all probability measures on $(\mathcal{X}, \mathcal{B})$. We can think of $\mathcal{P}$ as a subset of the function space $[0,1]^{\mathcal{B}}$, hence it has a product $\sigma$-field. Recall that the product $\sigma$-field is the smallest $\sigma$-field such that, for all $B \in \mathcal{B}$, the function $f_B : \mathcal{P} \to [0,1]$ defined by $f_B(Q) = Q(B)$ is measurable. These are the coordinate projection functions.

Example 326. (Empirical probability measure) Let $X_1, \ldots, X_n$ be random quantities taking values in $\mathcal{X}$. For each $B \in \mathcal{B}$, define $P_n(\omega)(B) = \frac{1}{n}\sum_{j=1}^n I_B(X_j(\omega))$. For each $\omega$, $P_n(\omega)(\cdot)$ is clearly a probability measure, so $P_n : \Omega \to \mathcal{P}$ is a function that we could prove is measurable, but that proof will not be given here. Theorem 323 says that $P_n(\omega)(B)$ converges to $E(I_B(X_1) \mid \mathcal{F}_\infty)(\omega)$ for all $B$ and almost all $\omega$. If we assume that the $X_n$'s take values in a Borel space, then $E(I_B(X_1) \mid \mathcal{F}_\infty) = \Pr(X_1 \in B \mid \mathcal{F}_\infty)$ is part of an rcd. This rcd plays an important role in deFinetti's theorem.
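In code, the empirical probability measure is just an average of indicators. A minimal sketch (the helper name `empirical_measure` and the normal sample are assumptions made only for illustration):

```python
import numpy as np

def empirical_measure(x, indicator):
    """P_n(B) = (1/n) * sum_j I_B(X_j), with B given as a boolean test."""
    return float(np.mean(indicator(np.asarray(x))))

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
print(empirical_measure(x, lambda t: t <= 0.0))   # roughly Phi(0) = 0.5
```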

DeFinetti's theorem says that a sequence of random quantities is exchangeable if and only if it is conditionally iid given a random probability measure, and that random probability measure is the limit of the empirical probability measures of $X_1, \ldots, X_n$. That is, Example 321 is essentially the only example of an exchangeable sequence. The proof is given in another course document.

Theorem 327. (DeFinetti's theorem) A sequence $\{X_n\}_{n=1}^\infty$ of random quantities is exchangeable if and only if $P_n$ (the empirical probability measure of $X_1, \ldots, X_n$) converges a.s. to a random probability measure $P$ and the $X_n$'s are conditionally iid with distribution $Q$ given $P = Q$.

Example 328. In Example 322, the empirical probability measure is equivalent to $Y_n = \frac{1}{n}\sum_{k=1}^n X_k$, since $Y_n$ is one minus the proportion of the observations less than or equal to 0. So $P$ is equivalent to the limit of $Y_n$, the limiting relative frequency of 1's in the sequence. Conditional on the limiting relative frequency of 1's being $x$, the $X_k$'s are iid with the Bernoulli distribution with parameter $x$.

Optional Sampling Theorem

Proof of Theorem 308. Without loss of generality, assume that $M_n \le M_{n+1}$ for every $n$. Since $\tau_n \le M_n$,
$$E(|X_{\tau_n}|) = \sum_{k=1}^{M_n} \int_{\{\tau_n = k\}} |X_k|\,dP \le \sum_{k=1}^{M_n} E(|X_k|) < \infty.$$
We already know that $X_{\tau_n}$ is $\mathcal{F}_{\tau_n}$-measurable. Let $A \in \mathcal{F}_{\tau_n}$. We need to show that $\int_A X_{\tau_{n+1}}\,dP \;(\ge)=\; \int_A X_{\tau_n}\,dP$, where here and below the parenthetical inequality refers to the submartingale case and the equality to the martingale case. Write
$$\int_A [X_{\tau_{n+1}} - X_{\tau_n}]\,dP = \int_{A \cap \{\tau_{n+1} > \tau_n\}} [X_{\tau_{n+1}} - X_{\tau_n}]\,dP.$$

Next, for each $\omega \in \{\tau_{n+1} > \tau_n\}$, write
$$X_{\tau_{n+1}}(\omega) - X_{\tau_n}(\omega) = \sum_{\{k \,:\, \tau_n(\omega) < k \le \tau_{n+1}(\omega)\}} [X_k(\omega) - X_{k-1}(\omega)].$$
The smallest $k$ such that $\tau_n(\omega) < k \le \tau_{n+1}(\omega)$ can be no smaller than 2, and the largest is at most $M_{n+1}$. So

$$\int_A [X_{\tau_{n+1}} - X_{\tau_n}]\,dP = \sum_{k=2}^{M_{n+1}} \int_A I_{\{\tau_n < k \le \tau_{n+1}\}}\,(X_k - X_{k-1})\,dP.$$
Define
$$B_k = A \cap \{\tau_n < k \le \tau_{n+1}\}$$
for each $k$. So

$$\int_A [X_{\tau_{n+1}} - X_{\tau_n}]\,dP = \sum_{k=2}^{M_{n+1}} \int_{B_k} (X_k - X_{k-1})\,dP \;(\ge)=\; \sum_{k=2}^{M_{n+1}} \int_{B_k} [X_k - E(X_k \mid \mathcal{F}_{k-1})]\,dP = 0,$$
because $X_{k-1} \;(\le)=\; E(X_k \mid \mathcal{F}_{k-1})$ and $B_k \in \mathcal{F}_{k-1}$. $\Box$
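As a sanity check on the optional sampling conclusion, here is a Monte Carlo sketch under illustrative assumptions (a simple random walk martingale, with two bounded stopping times $\tau_1 \le \tau_2$ capped at $M_1 = 20$ and $M_2 = 50$): both stopped means should be approximately 0.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, horizon = 100_000, 50
X = np.cumsum(rng.choice([-1, 1], size=(reps, horizon)), axis=1)

def stopped_value(X, level, cap):
    """X_tau for tau = min{n : |X_n| >= level}, capped at n = cap."""
    hit = np.abs(X[:, :cap]) >= level
    tau = np.where(hit.any(axis=1), hit.argmax(axis=1), cap - 1)
    return X[np.arange(len(X)), tau]

# tau_1 <= tau_2: the walk must reach level 2 before level 4, and 20 <= 50.
print(stopped_value(X, 2, 20).mean())   # approximately 0
print(stopped_value(X, 4, 50).mean())   # approximately 0
```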

Upcrossing Lemma

Lemma 309. (Upcrossing lemma) Let $\{(X_k, \mathcal{F}_k)\}_{k=1}^n$ be a submartingale. Let $r < q$, and define $V$ to be the number of times that the sequence $X_1, \ldots, X_n$ crosses from below $r$ to above $q$. Then
$$\text{(310)} \qquad E(V) \le \frac{1}{q-r}\left(E|X_n| + |r|\right).$$

Proof. Let $Y_k = \max\{0, X_k - r\}$ for every $k$, so that $\{(Y_k, \mathcal{F}_k)\}_{k=1}^n$ is a submartingale. Note that a consecutive set of $X_k(\omega)$ cross from below $r$ to above $q$ if and only if the corresponding consecutive set of $Y_k(\omega)$ cross from 0 to above $q - r$. Let $T_0(\omega) = 0$ and define $T_m$ for $m = 1, 2, \ldots$ as
$$T_m(\omega) = \begin{cases} \inf\{k \le n : k > T_{m-1}(\omega),\ Y_k(\omega) = 0\}, & \text{if } m \text{ is odd}, \\ \inf\{k \le n : k > T_{m-1}(\omega),\ Y_k(\omega) \ge q - r\}, & \text{if } m \text{ is even}, \end{cases}$$
with $T_m(\omega) = n + 1$ if the corresponding set above is empty. Now $V(\omega)$ is one-half of the largest even $m$ such that $T_m(\omega) \le n$. Define, for $k = 1, \ldots, n$,

$$R_k(\omega) = \begin{cases} 1 & \text{if } T_m(\omega) < k \le T_{m+1}(\omega) \text{ for some odd } m, \\ 0 & \text{otherwise}, \end{cases}$$
and set $\hat{X} = \sum_{k=1}^n R_k (Y_k - Y_{k-1})$, with $Y_0 = 0$. Each completed upcrossing contributes at least $q - r$ to $\hat{X}$, so $(q-r)V \le \hat{X}$. Note also that
$$\text{(329)} \qquad \{\omega : R_k(\omega) = 1\} \in \mathcal{F}_{k-1},$$
since whether $T_m < k \le T_{m+1}$ is determined by $Y_1, \ldots, Y_{k-1}$. Then
$$\begin{aligned}
E(\hat{X}) &= \sum_{k=1}^n \int_{\{\omega : R_k(\omega) = 1\}} [Y_k(\omega) - Y_{k-1}(\omega)]\,dP(\omega) \\
&= \sum_{k=1}^n \int_{\{\omega : R_k(\omega) = 1\}} [E(Y_k \mid \mathcal{F}_{k-1})(\omega) - Y_{k-1}(\omega)]\,dP(\omega) \\
&\le \sum_{k=1}^n \int [E(Y_k \mid \mathcal{F}_{k-1})(\omega) - Y_{k-1}(\omega)]\,dP(\omega) \\
&= \sum_{k=1}^n [E(Y_k) - E(Y_{k-1})] = E(Y_n),
\end{aligned}$$
where the second equality follows from (329) and the inequality follows from the fact that $\{(Y_k, \mathcal{F}_k)\}_{k=1}^n$ is a submartingale. It follows that $(q-r)E(V) \le E(Y_n)$. Since $E(Y_n) \le |r| + E(|X_n|)$, it follows that (310) holds. $\Box$
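Inequality (310) is easy to probe by simulation. A sketch, using a symmetric random walk (a martingale, hence a submartingale) with the arbitrary choices $n = 200$, $r = -1$, $q = 1$:

```python
import numpy as np

def upcrossings(x, r, q):
    """Count the crossings of the sequence x from below r to above q."""
    count, below = 0, False
    for v in x:
        if v < r:
            below = True
        elif v > q and below:
            count += 1
            below = False
    return count

rng = np.random.default_rng(5)
n, reps, r, q = 200, 5_000, -1.0, 1.0
V, abs_xn = [], []
for _ in range(reps):
    x = np.cumsum(rng.choice([-1.0, 1.0], size=n))
    V.append(upcrossings(x, r, q))
    abs_xn.append(abs(x[-1]))
print(np.mean(V), "<=", (np.mean(abs_xn) + abs(r)) / (q - r))  # checks (310)
```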

Lévy's Theorem

Lemma 315. Let $\{\mathcal{F}_n\}_{n=1}^\infty$ be a sequence of $\sigma$-fields. Let $E(|X|) < \infty$. Define $X_n = E(X \mid \mathcal{F}_n)$. Then $\{X_n\}_{n=1}^\infty$ is a uniformly integrable sequence.

Proof. Since $E(X \mid \mathcal{F}_n) = E(X^+ \mid \mathcal{F}_n) - E(X^- \mid \mathcal{F}_n)$, and the sum of uniformly integrable sequences is uniformly integrable, we will prove the result for nonnegative $X$. Let $A_{c,n} = \{X_n \ge c\} \in \mathcal{F}_n$, so that $\int_{A_{c,n}} X_n(\omega)\,dP(\omega) = \int_{A_{c,n}} X(\omega)\,dP(\omega)$. If we can find, for every $\epsilon > 0$, a $C$ such that $\int_{A_{c,n}} X(\omega)\,dP(\omega) < \epsilon$ for all $n$ and all $c \ge C$, we are done. Define $\eta(A) = \int_A X(\omega)\,dP(\omega)$. We have $\eta \ll P$ and $\eta$ is finite. Absolute continuity implies that for every $\epsilon > 0$ there exists $\delta$ such that $P(A) < \delta$ implies $\eta(A) < \epsilon$. By the Markov inequality,
$$P(A_{c,n}) \le \frac{1}{c}E(X_n) = \frac{1}{c}E(X),$$
for all $n$. Let $C = 2E(X)/\delta$. Then $c \ge C$ implies $P(A_{c,n}) < \delta$ for all $n$, so $\eta(A_{c,n}) < \epsilon$ for all $n$. $\Box$

Theorem 316. (Lévy's theorem) Let $\{\mathcal{F}_n\}_{n=1}^\infty$ be an increasing sequence of $\sigma$-fields. Let $\mathcal{F}_\infty$ be the smallest $\sigma$-field containing all of the $\mathcal{F}_n$'s. Let $E(|X|) < \infty$. Define $X_n = E(X \mid \mathcal{F}_n)$ and $X_\infty = E(X \mid \mathcal{F}_\infty)$. Then $\lim_{n\to\infty} X_n = X_\infty$, a.s.

Proof. By Lemma 315, $\{X_n\}_{n=1}^\infty$ is a uniformly integrable sequence. Let $Y$ be the limit of the martingale guaranteed by Theorem 311. Since $Y$ is a limit of functions of the $X_n$, it is measurable with respect to $\mathcal{F}_\infty$. It follows from uniform integrability that for every event $A$, $\lim_{n\to\infty} E(X_n I_A) = E(Y I_A)$. Next, note that, for every $m$ and $A \in \mathcal{F}_m$,

$$\int_A Y\,dP = \lim_{n\to\infty} \int_A E(X \mid \mathcal{F}_n)\,dP = \lim_{n\to\infty} \int_A X_n\,dP = \int_A X\,dP,$$
where the last equality follows from the fact that $A \in \mathcal{F}_n$ for all $n \ge m$, so $\int_A X_n\,dP = \int_A X\,dP$ because $X_n = E(X \mid \mathcal{F}_n)$. Since $\int_A Y\,dP = \int_A X\,dP$ for all $A \in \mathcal{F}_m$ for all $m$, it holds for all $A$ in the field $\mathcal{F} = \bigcup_{n=1}^\infty \mathcal{F}_n$. Since $|X|$ is integrable and $\mathcal{F}$ is a field, we can conclude that the equality holds for all $A \in \mathcal{F}_\infty$, the smallest $\sigma$-field containing $\mathcal{F}$. The equality $E(XI_A) = E(YI_A)$ for all $A \in \mathcal{F}_\infty$, together with the fact that $Y$ is $\mathcal{F}_\infty$-measurable, is precisely what it means to say that $Y = E(X \mid \mathcal{F}_\infty) = X_\infty$. $\Box$
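To visualize Theorem 316, consider an assumed concrete setting: $U$ uniform on $(0,1)$, $X = U^2$, and $\mathcal{F}_n$ generated by the first $n$ binary digits of $U$. Then $X_n = E(X \mid \mathcal{F}_n)$ is the average of $t^2$ over the dyadic interval of length $2^{-n}$ containing $U$, computable in closed form, and it converges to $X = X_\infty$ as the theorem asserts.

```python
import numpy as np

rng = np.random.default_rng(6)
u = rng.uniform()
for n in (1, 2, 5, 10, 20):
    a = np.floor(u * 2 ** n) / 2 ** n     # left end of dyadic interval around u
    b = a + 2.0 ** -n
    Xn = (a * a + a * b + b * b) / 3.0    # exact mean of t^2 over [a, b]
    print(n, Xn)
print("X = u^2 =", u ** 2)                # X_n approaches this value
```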

Solutions to Exercises

Exercise 13: For notational convenience, define $\mathcal{F} \equiv \bigcup_{n=1}^\infty \mathcal{F}_n$.

For the first part, it is clear that $\Omega \in \mathcal{F}$ and that $\mathcal{F}$ is closed under complements. If $A \in \mathcal{F}$ and $B \in \mathcal{F}$, then $A \in \mathcal{F}_n$ for some $n$ and $B \in \mathcal{F}_m$ for some $m$. Thus, $A \cup B \in \mathcal{F}_k \subseteq \mathcal{F}$ for $k \ge \max\{n, m\}$.

For the second part, consider the given example and define the set $A_n$ to be $\{n\}$ if $n$ is even, and to be $\emptyset$ if $n$ is odd. Clearly, $A_n \in \mathcal{F}_n \subseteq \mathcal{F}$ for each $n$. But $\bigcup_{n=1}^\infty A_n \notin \mathcal{F}$. To see this, consider that $\bigcup_{n=1}^\infty A_n$ is the set of all positive even integers. The class $\mathcal{F}_n$ includes the set of positive even integers less than or equal to $n$. While it is true that $n \to \infty$, there is no $\mathcal{F}_n$ which has $\bigcup_{n=1}^\infty A_n$ as a member, so it cannot be in $\mathcal{F}$.

Exercise 17: Examples include:

1. The set of all closed intervals,
2. All intervals $[a, b)$ where $a, b \in \mathbb{R}$,
3. All intervals $(a, \infty)$ where $a \in \mathbb{R}$,
4. All intervals $[a, \infty)$ where $a \in \mathbb{R}$,
5. All intervals $(-\infty, b)$ where $b \in \mathbb{R}$,
6. All intervals $(-\infty, b]$ where $b \in \mathbb{R}$.

Exercise 18: Yes. See, for example, page 45 in Billingsley or Exercise 6 on page 34 of Ash and Doleans-Dade.

Exercise 22: If $\Omega = \mathbb{Z}$, $\mathcal{F} = 2^\Omega$ and $\mu$ is counting measure, then let $A_1 = \{0\}$ and $A_k = \{-(k-1), k-1\}$ for $k > 1$. It is impossible to construct a countable sequence of sets, each with a finite number of elements, whose union is an uncountable set.

Exercise 27: Not here yet...

Exercise 30: First, recall that
$$\limsup_{n\to\infty} x_n = \inf_n \sup_{k \ge n} x_k \qquad \text{and} \qquad \liminf_{n\to\infty} x_n = \sup_n \inf_{k \ge n} x_k.$$
Then we can write

$$I_{\limsup_{n\to\infty} A_n} = \limsup_{n\to\infty} I_{A_n} \qquad \text{and} \qquad I_{\liminf_{n\to\infty} A_n} = \liminf_{n\to\infty} I_{A_n}.$$
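A tiny numerical check of these indicator identities on a finite sample space, with hypothetical sets $A_n$ (alternating evens and odds, chosen purely for illustration); for such periodic sets, any late tail already realizes the limsup and liminf pointwise:

```python
import numpy as np

omega = np.arange(10)                                  # sample space {0,...,9}
I = np.array([omega % 2 == (n % 2) for n in range(1, 101)])  # rows are I_{A_n}
print(I[50:].any(axis=0).astype(int))   # I_{limsup A_n}: all ones (every point
                                        # lies in infinitely many A_n)
print(I[50:].all(axis=0).astype(int))   # I_{liminf A_n}: all zeros
```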

Exercise 31:
$$\limsup_{n\to\infty} A_n = (-1, 1] \qquad \text{and} \qquad \liminf_{n\to\infty} A_n = \{0\}.$$

Exercise 33: Not here yet...

Exercise 35: Let $(\Omega, \mathcal{F}) = (\mathbb{R}, \mathcal{B}^1)$, let $\mu$ be Lebesgue measure, and let $A_n = (n, \infty)$. Then $\mu(A_n) = \infty$ for all $n$ but $\lim_{n\to\infty} A_n = \emptyset$.

Exercise 41: For Proposition 39, it is clear that $\Omega \in \mathcal{C}$ and that $\mathcal{C}$ is closed under complements since $\mathcal{C}$ is a $\lambda$-system. Further, if $A_1, A_2, \ldots \in \mathcal{C}$, then
$$\bigcup_{i=1}^n A_i = A_1 \cup \left(A_2 \cap A_1^C\right) \cup \left(A_3 \cap A_1^C \cap A_2^C\right) \cup \cdots,$$
and $A_1 \in \mathcal{C}$ since $\mathcal{C}$ is a $\lambda$-system, $A_2 \cap A_1^C \in \mathcal{C}$ since $\mathcal{C}$ is a $\pi$-system, $A_1 \cup (A_2 \cap A_1^C) \in \mathcal{C}$ since $\mathcal{C}$ is a $\lambda$-system, and so forth.

For Proposition 40, note that $A \cap B = \left(A^C \cup (A \cap B^C)\right)^C$.
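The two set identities above can be spot-checked mechanically. A small sketch on random subsets of a toy universe (illustrative only; `U - A` plays the role of $A^C$):

```python
import random

U = set(range(8))
rnd = random.Random(0)
for _ in range(100):
    A = {x for x in U if rnd.random() < 0.5}
    B = {x for x in U if rnd.random() < 0.5}
    C = {x for x in U if rnd.random() < 0.5}
    # A intersect B = (A^C union (A intersect B^C))^C
    assert A & B == U - ((U - A) | (A & (U - B)))
    # a union rewritten as a disjoint union, as in the display above
    assert A | B | C == A | (B - A) | (C - A - B)
print("both identities hold on all trials")
```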

Exercise 44: If we define $\mathcal{C}$ to be the class of subsets of $\Omega = \mathbb{R}^1$ of the form $(-\infty, a]$ with $a \in \mathbb{R}$, then $\mathcal{C}$ is a $\pi$-system and $\sigma(\mathcal{C}) = \mathcal{B}^1$. Also note that $P$ is $\sigma$-finite since it is a probability measure. Thus, there is a unique extension of $P$ from $\mathcal{C}$ to $\mathcal{B}^1$.

Exercise 45: If we define the value of $P(\{b\})$, we will define the measure on the $\sigma$-field generated by the sets $\{a, b\}$ and $\{b, c\}$. This is the case since the class of sets over which $P$ is defined is now a $\pi$-system.

Exercise 51: First, we show that $\mu^*$ extends $\mu$. We only need to show that $\mu^*(A) = \mu(A)$ for all $A \in \mathcal{C}$. Clearly, $\mu^*(A) \le \mu(A)$ for all $A \in \mathcal{C}$. To show the reverse inequality, let $\{A_i\}_{i=1}^\infty$ be disjoint elements of $\mathcal{C}$ such that $A \subseteq \bigcup_{i=1}^\infty A_i$, and let $B_i = A_i \cap A$ so that $A = \bigcup_{i=1}^\infty B_i$. Countable additivity of $\mu$ gives us
$$\mu(A) = \sum_{i=1}^\infty \mu(B_i) \le \sum_{i=1}^\infty \mu(A_i).$$
Since every sum in (52) is a case of the rightmost sum in the equation immediately above, we have $\mu(A) \le \mu^*(A)$.

Next, we show that $\mu^*$ is monotone and subadditive. Clearly, $B_1 \subseteq B_2$ implies $\mu^*(B_1) \le \mu^*(B_2)$. It is also easy to see that $\mu^*(B_1 \cup B_2) \le \mu^*(B_1) + \mu^*(B_2)$ for all $B_1, B_2 \in 2^\Omega$. In fact, if $\{B_n\}_{n=1}^\infty \subseteq 2^\Omega$, then $\mu^*(\bigcup_{i=1}^\infty B_i) \le \sum_{i=1}^\infty \mu^*(B_i)$. To see this, first choose $\epsilon > 0$. Note that there must exist a sequence of sets $A_{i1}, A_{i2}, \ldots \in \mathcal{C}$ such that $\mu^*(B_i) > \sum_{j=1}^\infty \mu(A_{ij}) - \epsilon/2^i$ and $B_i \subseteq \bigcup_{j=1}^\infty A_{ij}$ for each $i$. Thus,
$$\sum_{i=1}^\infty \mu^*(B_i) > \sum_{i=1}^\infty \sum_{j=1}^\infty \mu(A_{ij}) - \epsilon \ge \mu^*\left(\bigcup_{i=1}^\infty B_i\right) - \epsilon,$$
since the $\{A_{ij}\}$ cover $\bigcup_{i=1}^\infty B_i$. Since $\mu^*(\bigcup_{i=1}^\infty B_i) < \sum_{i=1}^\infty \mu^*(B_i) + \epsilon$ for every $\epsilon > 0$, the desired inequality holds.

Next, we show that $\mathcal{C} \subseteq \mathcal{A}$. Let $A \in \mathcal{C}$ and $C \in 2^\Omega$. Since $\mu^*$ is subadditive, we only need to show that $\mu^*(C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^C)$. If $\mu^*(C) = \infty$, this is clearly true. So let $\mu^*(C) < \infty$. From the definition of $\mu^*$, for every $\epsilon > 0$ there exists a collection $\{A_i\}_{i=1}^\infty$ of elements of $\mathcal{C}$ covering $C$ such that $\sum_{i=1}^\infty \mu(A_i) < \mu^*(C) + \epsilon$. Since $\mu(A_i) = \mu(A_i \cap A) + \mu(A_i \cap A^C)$ for every $i$, we have
$$\mu^*(C) + \epsilon > \sum_{i=1}^\infty \mu(A_i \cap A) + \sum_{i=1}^\infty \mu(A_i \cap A^C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^C).$$
Since this is true for every $\epsilon > 0$, it must be that $\mu^*(C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^C)$, hence $A \in \mathcal{A}$.

Next, we show that $\mathcal{A}$ is a field. It is clear that $\Omega \in \mathcal{A}$ and that $A \in \mathcal{A}$ implies $A^C \in \mathcal{A}$, by the symmetry in the definition of $\mathcal{A}$. Let $A_1, A_2 \in \mathcal{A}$ and $C \in 2^\Omega$. We can write
$$\begin{aligned} \mu^*(C) &= \mu^*(C \cap A_1) + \mu^*(C \cap A_1^C) \\ &= \mu^*(C \cap A_1) + \mu^*(C \cap A_1^C \cap A_2) + \mu^*(C \cap A_1^C \cap A_2^C) \\ &\ge \mu^*(C \cap [A_1 \cup A_2]) + \mu^*(C \cap [A_1 \cup A_2]^C), \end{aligned}$$
where the two equalities follow from $A_1, A_2 \in \mathcal{A}$, and the inequality follows from the subadditivity of $\mu^*$. Another application of subadditivity shows that $A_1 \cup A_2 \in \mathcal{A}$.

Next, we prove that $\mu^*$ is finitely additive on $\mathcal{A}$. If $A_1, A_2$ are disjoint elements of $\mathcal{A}$, then $A_1 = (A_1 \cup A_2) \cap A_1$ and $A_2 = (A_1 \cup A_2) \cap A_1^C$. It follows that
$$\mu^*(A_1 \cup A_2) = \mu^*(A_1) + \mu^*(A_2).$$
By induction, $\mu^*$ is finitely additive on $\mathcal{A}$.

Next, we prove that $\mathcal{A}$ is a $\sigma$-field. (We have already shown that $\mathcal{A}$ is a field.) Let $\{A_n\}_{n=1}^\infty \subseteq \mathcal{A}$; then we can write $A = \bigcup_{i=1}^\infty A_i = \bigcup_{i=1}^\infty B_i$, where each $B_i \in \mathcal{A}$ and the $B_i$ are disjoint. (This just makes use of complements and finite unions of elements of $\mathcal{A}$ being in $\mathcal{A}$.) Let $D_n = \bigcup_{i=1}^n B_i$ and $C \in 2^\Omega$. By the same argument we used in proving that $\mu^*$ is finitely additive, we have

$$\mu^*(C \cap [B_1 \cup B_2]) = \mu^*(C \cap B_1) + \mu^*(C \cap B_2).$$
A simple induction argument extends this to
$$\mu^*(C \cap D_n) = \sum_{i=1}^n \mu^*(C \cap B_i).$$
Since $A^C \subseteq D_n^C$ and $D_n \in \mathcal{A}$ for each $n$, we have
$$\begin{aligned} \mu^*(C) &= \mu^*(C \cap D_n) + \mu^*(C \cap D_n^C) \\ &\ge \mu^*(C \cap D_n) + \mu^*(C \cap A^C) \\ &= \sum_{i=1}^n \mu^*(C \cap B_i) + \mu^*(C \cap A^C). \end{aligned}$$
Since this is true for every $n$,

$$\mu^*(C) \ge \sum_{i=1}^\infty \mu^*(C \cap B_i) + \mu^*(C \cap A^C) \ge \mu^*(C \cap A) + \mu^*(C \cap A^C),$$
where the last inequality follows from subadditivity. The reverse inequality holds by subadditivity, so $A \in \mathcal{A}$, and $\mathcal{A}$ is a $\sigma$-field.

Next, we show that $\mu^*$ is countably additive when restricted to $\mathcal{A}$. (We already proved that $\mu^*$ is finitely additive.) Let $A = \bigcup_{i=1}^\infty A_i$, where each $A_i \in \mathcal{A}$ and the $A_i$ are disjoint. Since $\bigcup_{i=1}^n A_i \subseteq A$, we have, for every $n$, $\mu^*(A) \ge \sum_{i=1}^n \mu^*(A_i)$, which implies $\mu^*(A) \ge \sum_{i=1}^\infty \mu^*(A_i)$. By subadditivity, we get the reverse inequality, hence $\mu^*$ is countably additive on $\mathcal{A}$.

Next, we prove uniqueness. Suppose that $\mu'$ also extends $\mu$ to $\mathcal{A}$. Since $\mathcal{C}$ is a $\pi$-system and $\mu$ is $\sigma$-finite on $\mathcal{C}$, Theorem 43 implies that $\mu' = \mu$ on $\sigma(\mathcal{C})$. It is straightforward to prove that $\mu' = \mu$ on the completion of $\sigma(\mathcal{C})$. (See Theorem 1.3.8 in Ash and Doleans-Dade.)

Finally, we prove completeness. Since we already proved that $\mu^*$ is monotone, we need only prove that $\mathcal{A}$ contains all $B$ such that $\mu^*(B) = 0$. Let $\mu^*(B) = 0$. Then $\mu^*(B \cap C) = 0$ for all $C \in 2^\Omega$. By subadditivity and monotonicity, we have
$$\mu^*(C) \le \mu^*(C \cap B) + \mu^*(C \cap B^C) = \mu^*(C \cap B^C) \le \mu^*(C).$$
It follows that $B \in \mathcal{A}$.

Exercise 57: Nothing yet...

Exercise 61: The result is true if we can prove that $\mathcal{D} = \{A \in \mathcal{A} : f^{-1}(A) \in \mathcal{F}\}$ is a $\sigma$-field. Clearly $S \in \mathcal{D}$. Since inverse image commutes with complement, $A \in \mathcal{D}$ implies $A^C \in \mathcal{D}$. Since inverse image commutes with union, $A_n \in \mathcal{D}$ for all $n$ implies $\bigcup_{n=1}^\infty A_n \in \mathcal{D}$. So, $\mathcal{D}$ is a $\sigma$-field.

Exercise 63: Nothing yet...

Exercise 64: Let $f$ be a monotone increasing function. Then, for each $a$, there exists $b$ such that $f^{-1}((-\infty, a])$ is an interval of the form $(-\infty, b)$ or $(-\infty, b]$. A similar result holds for decreasing functions. Hence, monotone functions are measurable.

Exercise 73: For part 1,
$$\left\{\omega : \limsup_n f_n \ge a\right\} = \bigcap_{m=1}^\infty \left\{f_n \ge a - 1/m, \text{ i.o.}\right\},$$
a measurable set. Similarly for part 2. For part 3, the set in question is the set where the difference of two measurable functions is 0, a measurable set. Part 4 is clear from the first three parts, but you can fill in the details in a homework problem. The functions in parts 1, 2, and 4 might be extended real-valued.

Exercise 78: Let $\Omega = \mathbb{Z}^2$ with the $\sigma$-field $2^\Omega$, while $S = \mathbb{Z}$ with $\sigma$-field $2^S$. Let $f(x, y) = x$. Let $\mu$ be counting measure. Then, for each integer $x$, $f^{-1}(\{x\}) = \{(x, y) : y \in \mathbb{Z}\}$ and $\nu(\{x\}) = \infty$. Even though $\mu$ is $\sigma$-finite, $\nu$ is not.

Exercise 90: Let $g = I_{[a,b]} f$. Every lower Riemann sum is the integral of a simple function $\phi_1 \le g$ and every upper Riemann sum is the integral of a simple function $\phi_2 \ge g$. It follows that
$$\int \phi_1\,d\mu \le \int g\,d\mu \le \int \phi_2\,d\mu.$$
But for each $\epsilon > 0$, $\phi_1$ and $\phi_2$ can be chosen so that $\int \phi_2\,d\mu - \int \phi_1\,d\mu < \epsilon$. So, the limits of the upper and lower sums both equal $\int g\,d\mu$.

Exercise 95: Let $X$ put probability $1/2$ on $c$ and $-c$.

Exercise 109: Clearly, $\nu$ is nonnegative and $\nu(\emptyset) = 0$, since $f I_\emptyset = 0$, a.e. $[\mu]$. Let $\{A_n\}_{n=1}^\infty$ be disjoint. For each $n$, define $g_n = f I_{A_n}$ and $f_n = \sum_{i=1}^n g_i$. Define $A = \bigcup_{n=1}^\infty A_n$. Then $0 \le f_n \le f I_A$, a.e. $[\mu]$, and $f_n$ converges to $f I_A$, a.e. $[\mu]$. So, the monotone convergence theorem says that
$$\text{(330)} \qquad \lim_{n\to\infty} \int f_n\,d\mu = \nu(A).$$
Also, $\nu(A_i) = \int g_i\,d\mu$ for each $i$. It follows from Theorem 100 that
$$\text{(331)} \qquad \nu\left(\bigcup_{i=1}^n A_i\right) = \int f_n\,d\mu = \sum_{i=1}^n \int g_i\,d\mu = \sum_{i=1}^n \nu(A_i).$$
Take the limit as $n \to \infty$ of the second and last terms in (331) and compare to (330) to see that $\nu$ is countably additive.