Probability Theory

Andrew Kobin

Spring 2015

Contents

0 Introduction

1 Probability and Normal Numbers
  1.1 The Weak Law of Large Numbers
  1.2 The Strong Law of Large Numbers
  1.3 Further Properties of Normal Numbers

2 Probability Measures
  2.1 Fields, σ-fields and Probability Measures
  2.2 The Lebesgue Measure on the Unit Interval
  2.3 Extension to σ-fields
  2.4 π-systems and λ-systems
  2.5 Monotone Classes
  2.6 Complete Extensions
  2.7 Non-Measurable Sets

3 Denumerable Probabilities
  3.1 Limit Inferior, Limit Superior and Convergence
  3.2 Independence
  3.3 Subfields
  3.4 The Borel-Cantelli Lemmas

4 Simple Random Variables
  4.1 Convergence in Measure
  4.2 Independent Variables
  4.3 Expected Value and ...
  4.4 Abstract Laws of Large Numbers
  4.5 Second Borel-Cantelli Lemma Revisited
  4.6 Bernstein's Theorem
  4.7 Gambling
  4.8 Markov Chains
  4.9 Transience and Persistence

5 Abstract Measure Theory
  5.1 Measures
  5.2 ...
  5.3 Lebesgue Measure on R^n
  5.4 Measurable Functions
  5.5 Distribution Functions

6 Integration Theory
  6.1 Measure-Theoretic Integration
  6.2 Properties of Integration


0 Introduction

These notes were compiled from a course on measure-theoretic probability taught by Dr. Sarah Raynor in Spring 2015 at Wake Forest University. The course covers the basic concepts in measure theory and uses them to deepen understanding of probability. The companion text for the course is Probability and Measure, 4th ed., by P. Billingsley.

One of the best examples to illustrate the nuance of measure theory is the Cantor set. The Cantor set C is defined as follows. Let A_0 = [0, 1], the unit interval. Let A_1 be the set A_0 − (1/3, 2/3) formed by deleting the middle third of A_0. Next, A_2 is similarly formed by deleting the middle thirds (1/9, 2/9) and (7/9, 8/9) from each component of A_1. The process is continued to define

A_n = A_{n−1} − ⋃_{k=0}^∞ ((1 + 3k)/3^n, (2 + 3k)/3^n).

Finally, the Cantor set is the subset of [0, 1] given by C = ⋂_{n=0}^∞ A_n, that is, the points remaining in the unit interval after iterating this process over the natural numbers. Length is our first idea of measure, from which many others will stem. If we take the usual length of an interval on the real line to be its right endpoint minus its left endpoint, then the unit interval [0, 1] has length 1. One may then ask: How long is the Cantor set? To measure the length of C, we instead calculate the length of its complement and subtract it from 1. This is the following infinite sum:

1/3 + 2/9 + 4/27 + 8/81 + ...

which is a geometric series converging to (1/3)/(1 − 2/3) = 1. Thus the complement of the Cantor set has length 1, but the total unit interval has length 1, meaning the Cantor set has length 0. This is our first example of a set of measure zero. Area, volume, hypervolume, etc. are all extensions of length to higher dimensions; these are also examples of measures. For example, the area of an annulus is easy to compute. Consider the following region R.

[Figure: the annular region R, with inner radius 1 and outer radius 2.]


We compute the area A by A = 4π − π = 3π. However, one may also want to compute the mass of the annulus, say if it were made of aluminum or steel. Given a density function, e.g. ρ = e^{−r²} kg/cm², find the mass of the annular region. This is computed by a double integral,

∬_R ρ dA = ∫_0^{2π} ∫_1^2 e^{−r²} r dr dθ = ∫_0^{2π} −(1/2)(e^{−4} − e^{−1}) dθ = π(e^{−1} − e^{−4}).

If we think of a double integral as the limit of the process of breaking the region into smaller regions and adding together all their masses, we see the same concept at work as in the Cantor set example.
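As a quick sanity check on the arithmetic above, here is a small Python sketch (ours, not part of the original notes) that approximates the mass by a midpoint Riemann sum in the radial variable; the angular integral simply contributes a factor of 2π.

import math

# Midpoint Riemann sum for the mass of the annulus 1 <= r <= 2 with density e^{-r^2},
# computed in polar coordinates: mass = int_0^{2pi} int_1^2 e^{-r^2} r dr dtheta.
def annulus_mass(steps=2000):
    dr = 1.0 / steps
    total = 0.0
    for i in range(steps):
        r = 1.0 + (i + 0.5) * dr             # midpoint of the i-th radial strip
        total += math.exp(-r * r) * r * dr   # e^{-r^2} * r dr
    return 2 * math.pi * total               # the theta integral contributes 2*pi

print(annulus_mass())                          # ~1.0982
print(math.pi * (math.exp(-1) - math.exp(-4))) # pi(e^{-1} - e^{-4}) ~1.0982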

How does this relate to probability?

Example 0.0.1. What is the probability of rolling a prime number on a standard six-sided die? This can be computed by the same divide-and-conquer approach:

P(prime) = P(2) + P(3) + P(5) = 3/6 = 1/2.

Example 0.0.2. When playing craps (rolling two dice), what is the probability of rolling either a 7 or an 11?

P(7 or 11) = P(7) + P(11) = 6/36 + 2/36 = 2/9.

Example 0.0.3. Given a dartboard of unit area, the probability of hitting a small region on the board with your dart is precisely the area of that region.
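The two dice computations can be checked by brute-force enumeration of equally likely outcomes; a short Python sketch (illustration only, not part of the notes):

from fractions import Fraction
from itertools import product

# P(prime) on one die: count favorable faces out of 6.
primes = {2, 3, 5}
p_prime = Fraction(sum(1 for d in range(1, 7) if d in primes), 6)
print(p_prime)                      # 1/2

# P(7 or 11) on two dice: count favorable pairs out of 36.
pairs = list(product(range(1, 7), repeat=2))
p_7_or_11 = Fraction(sum(1 for a, b in pairs if a + b in (7, 11)), len(pairs))
print(p_7_or_11)                    # 2/9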

There is a common theme among the above examples, which is that the calculation of probability relies on our ability to measure things and compare the relative measures. We generalize this in the following way.

Definition. A measure µ on a set S is a function µ : P(S) → [0, ∞] such that µ is countably additive, that is, if A is a subset of S of the form A = ⋃_{n=1}^∞ A_n with A_n ∩ A_m = ∅ for all n ≠ m, then µ(A) = ∑_{n=1}^∞ µ(A_n).

This isn't quite the full definition yet; we will formalize everything in Chapter 5. However, some interesting questions arise from defining a measure this way:


1 Is every subset of S measurable? When the set is finite, the answer is yes. However, for the unit interval with length as a measure, the answer is no. A counterexample is difficult to produce at this time.

2 The Banach-Tarski Paradox (sometimes called the Banach-Tarski Theorem) says that it is possible to take a solid ball of any size, say the size of a basketball, decompose it into finitely many pieces and put them back together only using rigid motions to get a ball the size of the sun. How is this possible?

The domain of a measure must have a special structure, which is called a σ-field (some- times σ-algebra in the literature).

Definition. Let S be a set and F a collection of of S. F is a σ-algebra provided

(1) S ∈ F.

(2) If A ∈ F then AC ∈ F as well.

(3) If A_1, A_2, ... ∈ F (this may be a countable list) then ⋃_{n=1}^∞ A_n ∈ F.

It turns out that this is just enough structure to allow us to define a measure on F. This will be the main ‘universe’ in which we work, defining probability measures and developing their applications. The first four chapters may be treated as an extensive case study of probability spaces, that is, spaces with measure 1. In Chapter 5 we finally introduce the terminology and main theorems in abstract measure theory.
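When Ω is finite, closure under finite unions already gives closure under countable unions, so the σ-field axioms can be checked mechanically. A small Python sketch (purely illustrative; the helper name is ours) that verifies the three axioms for a collection of subsets of a finite Ω:

def is_sigma_field(omega, F):
    """Check the sigma-field axioms for a collection F of subsets of the finite set omega.
    (For finite omega, closure under finite unions equals closure under countable unions.)"""
    F = {frozenset(A) for A in F}
    omega = frozenset(omega)
    if omega not in F:
        return False                                   # (1) Omega is in F
    if any(omega - A not in F for A in F):
        return False                                   # (2) closed under complements
    return all(A | B in F for A in F for B in F)       # (3) closed under unions

omega = {1, 2, 3, 4}
smallest = [set(), omega]                              # {emptyset, Omega}
partition = [set(), {1, 2}, {3, 4}, omega]             # generated by the blocks {1,2}, {3,4}
print(is_sigma_field(omega, smallest), is_sigma_field(omega, partition))   # True True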


1 Probability and Normal Numbers

In these notes we will denote a sample space by Ω and a particular event taken from this sample space by ω. Our prototypical example will have Ω = (0, 1]. For technical reasons we will always assume an interval of the real line is of the form (a, b] so that collections of intervals may be chosen disjointly (so they don't overlap at the endpoints). If I = (a, b] we will denote the usual notion of length by |I| = |b − a|.

Suppose A = ⋃_{i=1}^n I_i where I_i = (a_i, b_i] are pairwise disjoint intervals in the sample space Ω = (0, 1]. We define

Definition. The probability of event A occurring within the sample space Ω is

P(A) := ∑_{i=1}^n |I_i| = ∑_{i=1}^n |b_i − a_i|.

At this point we are carefully avoiding complicated subsets of Ω, such as the Cantor set in the introduction. These will be the focus in later chapters. If A and B are disjoint subsets of Ω and each of A, B is a finite disjoint union of intervals, then P(A ∪ B) = P(A) + P(B). This is called the finite additivity of probability. So far we have brushed over something important: is our definition of P(A) well-defined? That is, if A has two different representations as finite disjoint unions of intervals in Ω, do they both give the same probability? Well, suppose A = ⋃_{i=1}^n I_i = ⋃_{j=1}^m J_j. We create a collection of intervals K_ij = I_i ∩ J_j, called a refinement of the I_i and J_j. Notice that

A = ⋃_{j=1}^m ⋃_{i=1}^n K_ij = ⋃_{j=1}^m ⋃_{i=1}^n (I_i ∩ J_j).

Since each I_i is the disjoint union of the K_ij (over j) and each J_j is the disjoint union of the K_ij (over i), both representations give ∑_i |I_i| = ∑_{i,j} |K_ij| = ∑_j |J_j|. This implies well-definedness of our definition of P(A).

Example 1.0.4. This relates to the Riemann integral in an important way. For a subset A ⊂ Ω which is a disjoint union of finitely many intervals in Ω = (0, 1], define the characteristic function

f_A = ∑_{i=1}^n χ_{I_i}, where χ_{I_i}(x) = 1 if x ∈ I_i and 0 if x ∉ I_i.

Similarly define g_B = ∑_{j=1}^m χ_{J_j}. Then finite additivity of probability implies the additive property of Riemann integrals:

∫_0^1 (f_A + g_B) dx = ∫_0^1 f_A dx + ∫_0^1 g_B dx.


This is because

∫_0^1 χ_I(x) dx = |I| = b − a.

Keep in mind that for the moment we are only dealing with event spaces that are finite disjoint unions of half-open intervals; when we encounter more complicated subsets of Ω, Riemann integration breaks down. In that case we will need to use Lebesgue integration, one of the main tools in modern integration theory. Our next goal is to equate the probabilistic notion of selecting points from the unit interval with the physical act of flipping an infinite number of coins and counting heads and tails. Define d_i(ω) to be the result of the ith flip of the infinite sequence of coin flips; we will denote this numerically by

d_i(ω) = 1 if heads, 0 if tails.

The event ω can be represented as a sequence of 1's and 0's: (d_1(ω), d_2(ω), d_3(ω), ...). We will also make use of the dyadic (binary) representation

ω = ∑_{i=1}^∞ d_i(ω) 2^{−i}.

Each sequence of 0's and 1's corresponds to a unique real number in the interval [0, 1]. However, not every real number in [0, 1] has a unique dyadic representation. For example, 5/8 can be represented by 0.101000... but also by the non-terminating 0.100111... It is convention to prefer the non-terminating representation, since this will coincide with our other preference for half-open intervals (a, b]. Notice that picking only non-terminating dyadic representations excludes 0 = 0.000... from our probability space, so we are indeed constructing (0, 1]. Now, drawing at random with uniform probability from Ω = (0, 1] is equivalent to the dyadic representation of an infinite sequence of coin flips. The reason is that P[d_i(ω) = 1] is equal to the sum of the lengths of 2^{i−1} intervals, each of which has length 2^{−i}. This is illustrated below.

d_1 = 0 on (0, 1/2] and d_1 = 1 on (1/2, 1];
d_2 = 0 on (0, 1/4] ∪ (1/2, 3/4] and d_2 = 1 on (1/4, 1/2] ∪ (3/4, 1].

These are sometimes called dyadic intervals. From this we can see that the probability of any single flip coming up heads is 1/2, since at any level, half of the 2^i intervals are included in this event. The 2^n intervals of length 2^{−n} for any n are called the set of rank n dyadic intervals; they have the nice property of being nested. Formally, if n > m and I_i is an interval of rank n, there is a unique J_j of rank m such that I_i ⊂ J_j.
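A short Python sketch (our own illustration) of the digit functions d_i under the non-terminating convention; it reproduces the two representations of 5/8 mentioned above and checks that exactly half of the rank-n intervals have d_n = 1.

from fractions import Fraction

def dyadic_digits(omega, n):
    """First n digits d_1,...,d_n of omega in (0,1], using the non-terminating
    convention that matches the half-open dyadic intervals ((k-1)/2^i, k/2^i]."""
    omega = Fraction(omega)
    digits = []
    for _ in range(n):
        if omega > Fraction(1, 2):       # omega in (1/2, 1]  ->  d_i = 1
            digits.append(1)
            omega = 2 * omega - 1
        else:                            # omega in (0, 1/2]  ->  d_i = 0
            digits.append(0)
            omega = 2 * omega
    return digits

print(dyadic_digits(Fraction(5, 8), 8))          # [1, 0, 0, 1, 1, 1, 1, 1], i.e. 0.100111...

# Exactly half of the 2^n rank-n dyadic intervals have d_n = 1 (take one point per interval):
n = 4
midpoints = [Fraction(2 * k + 1, 2 ** (n + 1)) for k in range(2 ** n)]
print(sum(dyadic_digits(m, n)[-1] for m in midpoints))   # 8 = 2^(n-1)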


Example 1.0.5. The binomial formula expresses the probability that k heads will be flipped in n trials. Using the interval construction above, we see that

P(k heads in the first n flips) = P(k of the first n digits are 1)
  = (# subsets of {1, ..., n} with k elements) · 2^{−n}
  = (n choose k) · 2^{−n},

which is exactly the same as provided by the binomial formula.
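A quick check of this count in Python (illustrative only), comparing a brute-force enumeration of the rank-n dyadic intervals, encoded as 0/1 strings, with the binomial formula:

from fractions import Fraction
from itertools import product
from math import comb

n, k = 10, 4
# Count the rank-n dyadic intervals (0/1 strings of length n) with exactly k ones.
count = sum(1 for bits in product((0, 1), repeat=n) if sum(bits) == k)
print(Fraction(count, 2 ** n))            # 105/512
print(Fraction(comb(n, k), 2 ** n))       # 105/512, the binomial formula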

Notice that if {Ii}i∈N is a collection of rank n dyadic intervals and n ≥ m, then dm(x) is constant on Ii for all Ii of rank n.

1.1 The Weak Law of Large Numbers

This brings us to the Law of Large Numbers. In probability theory, the LLN states that the average of a sequence of random trials converges to a particular value: the expected value (EV). In this course, we will distinguish between two different versions of the LLN.

Theorem 1.1.1 (Weak Law of Large Numbers). Let ω be an event in the sample space Ω = (0, 1] which may be expressed as a finite disjoint union of intervals. Then for any ε > 0,

lim_{n→∞} P( |(1/n) ∑_{i=1}^n d_i(ω) − 1/2| > ε ) = 0.

Proof. To prove the Weak LLN, we first define the Rademacher functions,

r_i(ω) = 2d_i(ω) − 1 = 1 if heads, −1 if tails,

for each i, and the cumulative Rademacher function of rank n,

s_n(ω) = ∑_{i=1}^n r_i(ω).

In this language, the above probability may be expressed as

lim_{n→∞} P( (1/n)|s_n(ω)| > ε ) = 0.

In addition, the Rademacher functions are defined so P[r_i = 1] = 1/2 and P[r_i = −1] = 1/2, meaning they have an average value of 0:

∫_0^1 r_i(ω) dω = 0.


This implies that the cumulative function also has average value 0. Furthermore, it's easy to see that whenever i ≠ j,

∫_0^1 r_i(ω) r_j(ω) dω = 0

by looking at their graphs. However, r_i(ω)² = 1 for all i, so

∫_0^1 r_i(ω)² dω = ∫_0^1 1 dω = 1.

Putting this all together, we get

∫_0^1 s_n(ω)² dω = ∫_0^1 ( ∑_{i=1}^n r_i(ω) )² dω
  = ∫_0^1 ∑_{i=1}^n ∑_{j=1}^n r_i(ω) r_j(ω) dω
  = ∫_0^1 ( ∑_{i=1}^n r_i(ω)² + ∑_{i≠j} r_i(ω) r_j(ω) ) dω
  = ∑_{i=1}^n 1 + 0 = n.

Finally, we need Chebyshev's inequality, which says that if f is a nonnegative step function and α > 0, then

(1) The set {ω | f(ω) > α} is a finite union of intervals, and

(2) P[ω | f(ω) > α] ≤ (1/α) ∫_0^1 f(ω) dω.

This is used in the following calculation of the original probability limit:

P( (1/n)|s_n(ω)| > ε ) = P[ |s_n(ω)| > nε ]
  = P[ s_n(ω)² > n²ε² ]
  ≤ (1/(n²ε²)) ∫_0^1 s_n(ω)² dω   by Chebyshev's inequality
  = n/(n²ε²) = 1/(nε²),

which converges to 0 as n → ∞. This completes the proof of the Weak LLN (1.1.1).

Let's take a moment to prove the inequality used in the proof above. In measure theory this is known as a weak type ℓ¹ inequality.


Theorem 1.1.2 (Chebyshev's Inequality). If f is a nonnegative step function, then for any α > 0,

(1) The set {ω | f(ω) > α} is a finite union of intervals.

(2) P[ω | f(ω) > α] ≤ (1/α) ∫_0^1 f(ω) dω.

Proof. Let f be a step function given by the values y_i = f(ω) for ω ∈ (x_i, x_{i+1}], where 0 = x_0 < x_1 < ··· < x_m = 1. Then clearly S = {ω | f(ω) > α} equals the union of the (x_i, x_{i+1}] for those values of i such that y_i > α. Note that since we defined S in terms of a partition of (0, 1], the probability P(S) may be written as

P(S) = ∑_{i∈I} |x_{i+1} − x_i|,   where I = {i : 0 ≤ i < m, y_i > α}.

Now

α P[ω | f(ω) > α] = α P(S) = ∑_{i∈I} α |x_{i+1} − x_i|
  < ∑_{i∈I} y_i |x_{i+1} − x_i|
  ≤ ∑_{i=0}^{m−1} y_i |x_{i+1} − x_i|
  = ∫_0^1 f(ω) dω.

Hence the inequality is proved.
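To see the Weak LLN and the Chebyshev bound 1/(nε²) side by side, here is a small Monte Carlo sketch in Python (ours, not from the notes); it estimates P(|s_n/n| > ε) by simulating Rademacher sums.

import random

random.seed(0)
eps, trials = 0.1, 5000
for n in (10, 100, 1000):
    exceed = 0
    for _ in range(trials):
        s = sum(random.choice((-1, 1)) for _ in range(n))   # s_n = r_1 + ... + r_n
        if abs(s / n) > eps:
            exceed += 1
    # empirical P(|s_n/n| > eps) versus the Chebyshev bound 1/(n*eps^2)
    print(n, exceed / trials, 1 / (n * eps * eps))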

1.2 The Strong Law of Large Numbers

Definition. A normal number is a real number ω ∈ (0, 1] such that

lim_{n→∞} (1/n) ∑_{i=1}^n d_i(ω) = 1/2.

The Strong Law of Large Numbers says the following about normal numbers:

Theorem 1.2.1 (Strong Law of Large Numbers). The set N of normal numbers has probability P(N) = 1.

Instead of proving the SLLN directly, we will prove an equivalent statement known as Borel's Theorem on Normal Numbers (1.2.5). In order to make sense of the statement of Borel's Theorem, we must justify the use of P(N); however, this requires defining the Lebesgue measure, which we will cover in subsequent chapters. Alternatively, we can show that P(N^C) = 0. This also requires the Lebesgue measure, but we can get away with just using the definition of measure 0:


Definition. A set W ⊂ Ω is said to be negligible (alternatively, has measure 0) if for every ε > 0, there is a countable collection of intervals {I_k}_{k=1}^∞ such that

(a) W ⊂ ⋃_{k=1}^∞ I_k.

(b) ∑_{k=1}^∞ |I_k| < ε.

Lemma 1.2.2. Suppose {W_i}_{i=1}^∞ are all negligible sets in Ω. Then W = ⋃_{i=1}^∞ W_i is also negligible. That is, the countable union of negligible sets is negligible.

Proof. Let ε > 0 and let i ∈ N. Since W_i is negligible, there is a countable collection {I_k^i}_{k=1}^∞ so that W_i ⊂ ⋃_{k=1}^∞ I_k^i and ∑_{k=1}^∞ |I_k^i| < ε/2^i. Then

W ⊂ ⋃_{i=1}^∞ ⋃_{k=1}^∞ I_k^i   and   ∑_{i=1}^∞ ∑_{k=1}^∞ |I_k^i| < ∑_{i=1}^∞ ε/2^i = ε.

Since the doubly-indexed collection {I_k^i}_{(i,k)∈N²} is countable, we conclude that W is negligible.

This has an immediate and important consequence.

Corollary 1.2.3. All countable sets are negligible.

Proof. If we can prove that any singleton set is negligible, then Lemma 1.2.2 immediately applies since a countable set is just the countable union of singletons. This is easy to establish, since for any singleton {x} ⊂ Ω, the interval (x − ε/2, x + ε/2] covers {x} and has length ε.

Example 1.2.4. By Corollary 1.2.3, Q is a negligible set. This is odd, however, since we know that Q is an example of a set that is dense in the real numbers. It turns out that there are even larger sets than Q that are still negligible. Corollary 1.2.3 suggests an interesting question: Is every negligible set countable? The answer turns out to be no; for example, the Cantor set is uncountable, but as we saw in the introduction, C has measure 0.

Recall that we want to prove the complement of the normal numbers has measure 0. It would be nice if N^C were countable, but unfortunately this is not the case. To see this, what are some numbers in N^C = {ω : lim_{n→∞} (1/n) ∑_{i=1}^n d_i(ω) ≠ 1/2}? Some obvious examples are (0, 1, 1, 1, 1, 1, ...) and (1, 0, 1, 1, 1, 1, ...), but a more interesting one is (1, 0, 0, 1, 0, 0, 1, 0, 0, ...). In fact, anything of the form (a, 0, 0, b, 0, 0, c, 0, 0, d, 0, 0, ...) is in N^C because

(1/n) ∑_{i=1}^n d_i(ω) ≤ 1/3 + 1/n for every n, so the limit of the averages (if it exists) is at most 1/3.


Now (a, b, c, d, ...) is an infinite sequence of 0's and 1's, of which there are uncountably many. Therefore there are uncountably many sequences of the form (a, 0, 0, b, 0, 0, c, 0, 0, ...) in N^C. Hence N^C must be uncountable. Nevertheless, there is a way to prove P(N^C) = 0, which is given below.

Theorem 1.2.5 (Borel's Normal Number Theorem). N^C is negligible.

Proof. Clearly N = {ω | (1/n) s_n(ω) → 0} in the language of Rademacher functions. Then the theorem may be restated as P[ω : |s_n(ω)| > εn] = 0 for any ε > 0. By Chebyshev's Inequality (Theorem 1.1.2),

P[ω : |s_n(ω)| > εn] = P[ω : s_n(ω)⁴ > ε⁴n⁴]
  ≤ (1/(n⁴ε⁴)) ∫_0^1 s_n(ω)⁴ dω.

To evaluate this integral, note that

s_n(ω)⁴ = ( ∑_{α=1}^n r_α(ω) )( ∑_{β=1}^n r_β(ω) )( ∑_{γ=1}^n r_γ(ω) )( ∑_{δ=1}^n r_δ(ω) ) = ∑_{α,β,γ,δ=1}^n r_α(ω) r_β(ω) r_γ(ω) r_δ(ω).

Recall that even powers of the r_i(ω) are identically 1, while odd powers are equal to r_i(ω) itself. The possible values of the Rademacher functions, as well as what they contribute to the integral, are shown in the table below.

distinct indices among α, β, γ, δ | functions | integral | number of instances
1 (α = β = γ = δ) | r_i(ω)⁴ | ∫_0^1 r_i(ω)⁴ dω = 1 | n
2 | r_i(ω)² r_j(ω)² | ∫_0^1 r_i(ω)² r_j(ω)² dω = 1 | 3n(n − 1)
2 | r_i(ω)³ r_j(ω) | ∫_0^1 r_i(ω)³ r_j(ω) dω = 0 | (not needed)
3 | r_i(ω)² r_j(ω) r_k(ω) | ∫_0^1 r_i(ω)² r_j(ω) r_k(ω) dω = 0 | (not needed)
4 | r_i(ω) r_j(ω) r_k(ω) r_l(ω) | ∫_0^1 r_i(ω) r_j(ω) r_k(ω) r_l(ω) dω = 0 | (not needed)

Thus the only parts that don't integrate to 0 are the first two:

∫_0^1 s_n(ω)⁴ dω = 1 · n + 1 · 3n(n − 1) + 0 + 0 = 3n² − 2n < 3n².

This gives us P[ω : |s_n(ω)| > εn] < (1/(ε⁴n⁴)) · 3n² = 3/(ε⁴n²). For a given n, choose ε_n = n^{−1/8}. Then we have

P[ω : |s_n(ω)| > ε_n n] < 3/(ε_n⁴ n²) = 3/((n^{−1/8})⁴ n²) = 3/n^{3/2}


and this tends to 0 as n → ∞. We will use this calculation in a moment.

Let A_n = {ω : (1/n)|s_n(ω)| > ε_n}. We need to verify three things:

(i) N^C ⊂ ⋃_{n=m}^∞ A_n for every m;

(ii) The An are all finite unions of intervals; and

(iii) ∑_{n=m}^∞ |A_n| is sufficiently small.

First, note that (i) is the same as N ⊃ ⋂_{n=m}^∞ A_n^C by DeMorgan's Laws. If ω_0 ∈ ⋂_{n=m}^∞ A_n^C and r is some number such that n > r ≥ m, then

ω_0 ∈ {ω : (1/n)|s_n(ω)| ≤ ε_n} ⊆ {ω : (1/n)|s_n(ω)| ≤ ε_r}.

So lim sup_{n→∞} (1/n)|s_n(ω_0)| ≤ ε_r, and as r → ∞, ε_r → 0, which makes this limit go to 0. Hence ω_0 is normal.

(ii) For a fixed n, we claim that A_n is a finite union of disjoint intervals. From this it will follow that ⋃_{n=m}^∞ A_n is a countable union of intervals, since countable unions of finite unions are countable. Consider A_n = {ω : (1/n)|s_n(ω)| > ε_n}. For any n, s_n(ω) is a step function and therefore so is (1/n)|s_n(ω)|. Hence by (1) of Chebyshev's inequality (Theorem 1.1.2), A_n is a finite union of intervals.

(iii) By (ii), ⋃_{n=m}^∞ A_n is a countable union of some intervals {I_k}. Note that ∑_{k=1}^∞ |I_k| ≤ ∑_{n=m}^∞ P(A_n), since all the intervals I_k on the left appear in the sum on the right. By our work above,

∑_{n=m}^∞ P(A_n) < ∑_{n=m}^∞ 3/n^{3/2} ≤ 3c/√m

for some constant c, by the integral test. Given ε > 0, we can choose m sufficiently large so that ε > 3c/√m. Hence ∑_{k=1}^∞ |I_k| < ε, which completes the proof that N^C is negligible.

We now have at our disposal a 'weak' and 'strong' law of large numbers; naturally from the way they are named, the Strong LLN implies the Weak LLN. However, at the moment we don't have the tools to prove this. So far we have several good examples of negligible sets, so one might be tempted to think that all sets are negligible. Of course that isn't true, or else the very concept of negligibility would be meaningless, so let's find some non-negligible sets.

Proposition 1.2.6. Ω = (0, 1] is not negligible.


Proof. Given a countable collection of intervals {I_k}_{k=1}^∞ such that Ω ⊂ ⋃_{k=1}^∞ I_k, we want to show that ∑_{k=1}^∞ |I_k| ≥ |Ω| = 1. This results from the more general theorem below.

Theorem 1.2.7. Let I and {I_k}_{k=1}^∞ be intervals.

(1) If ⋃_{k=1}^∞ I_k ⊂ I and the I_k are disjoint, then ∑_{k=1}^∞ |I_k| ≤ |I|.

(2) If ⋃_{k=1}^∞ I_k ⊃ I, then ∑_{k=1}^∞ |I_k| ≥ |I|.

(3) If ⋃_{k=1}^∞ I_k = I and the I_k are disjoint, then ∑_{k=1}^∞ |I_k| = |I|.

Proof. First note that (1) and (2) imply (3). We use induction to prove (1) and (2). First suppose there's only one interval I_1 ⊂ I. Then clearly |I_1| ≤ |I|. Now assume inductively that the conclusion holds for I_1, ..., I_n and consider the case for n + 1 intervals. Write {I_k}_{k=1}^{n+1} in order, using the fact that they are disjoint:

a ≤ a1 < b1 ≤ a2 < b2 ≤ · · · < bn ≤ an+1 < bn+1 ≤ b

where I = (a, b] and I_k = (a_k, b_k]. Since ⋃_{k=1}^n I_k ⊂ (a, b_n], the inductive hypothesis gives us

∑_{k=1}^n |I_k| ≤ |b_n − a| ≤ |a_{n+1} − a|.

Then

∑_{k=1}^{n+1} |I_k| = ∑_{k=1}^n |I_k| + |b_{n+1} − a_{n+1}|
  ≤ |a_{n+1} − a| + |b_{n+1} − a_{n+1}|
  ≤ |a_{n+1} − a| + |b − a_{n+1}| = |b − a|,

since the differences are all positive by our chosen ordering. So property (1) holds for finite collections of disjoint intervals. Now consider a countable collection of disjoint intervals {I_k}_{k=1}^∞. For any finite n, ∑_{k=1}^n |I_k| ≤ |I|, and taking the limit as n → ∞ preserves the inequality (e.g. by the monotone convergence theorem); hence ∑_{k=1}^∞ |I_k| ≤ |I| as well.


(2) The case with one interval is the same as above. Assume the inequality holds for n intervals and let {I_k}_{k=1}^{n+1} be a collection of intervals, not necessarily disjoint, so that ⋃_{k=1}^{n+1} I_k ⊃ I. This is the same as (a, b] ⊂ ⋃_{k=1}^{n+1} (a_k, b_k]. Because the intervals are not disjoint, we can't order the a_k and b_k like we did last time; however, we can relabel so that

b_{n+1} ≥ b_n ≥ ··· ≥ b_1, with b_{n+1} ≥ b.

If a_{n+1} ≤ a we're done, since then (a, b] ⊂ (a_{n+1}, b_{n+1}]. Otherwise (a, a_{n+1}] ⊂ ⋃_{k=1}^n (a_k, b_k]. By the inductive hypothesis, |a_{n+1} − a| ≤ ∑_{k=1}^n |b_k − a_k|, so

|b − a_{n+1}| + |a_{n+1} − a| ≤ ( ∑_{k=1}^n |b_k − a_k| ) + |b − a_{n+1}|
  ⟹ |b − a| ≤ ( ∑_{k=1}^n |b_k − a_k| ) + |b − a_{n+1}|
  ≤ ( ∑_{k=1}^n |b_k − a_k| ) + |b_{n+1} − a_{n+1}|
  = ∑_{k=1}^{n+1} |b_k − a_k|.

Hence (2) holds for finite collections of intervals. To prove the property for countable collections, we will exploit the completeness of R via the Heine-Borel Theorem, which states that any closed, bounded interval of real numbers is compact. Thus any open cover of such an interval will have a finite subcover and we can reduce to the finite case. At the moment, though, we have neither a closed interval nor an open cover of the interval. To remedy this, let ε > 0 be arbitrarily small. Shrink I to the closed interval [a + ε, b] and consider the open cover

⋃_{k=1}^∞ (a_k, b_k + ε/2^k) ⊃ [a + ε, b].

⋃_{j=1}^J (a_{k_j}, b_{k_j} + ε/2^{k_j}) ⊃ [a + ε, b],


so the finite case above tells us that

|b − a − ε| ≤ ∑_{j=1}^J |b_{k_j} + ε/2^{k_j} − a_{k_j}|
|b − a| − ε ≤ ∑_{j=1}^J ( |b_{k_j} − a_{k_j}| + ε/2^{k_j} )
  ≤ ∑_{k=1}^∞ ( |b_k − a_k| + ε/2^k ) = ∑_{k=1}^∞ |I_k| + ε.

Letting ε → 0 proves (2) in the countable case.

We obtain two immediate and useful results of this theorem.

Corollary 1.2.8. Any finite interval on the real line is not negligible. In particular, Ω = (0, 1] is not negligible.

Corollary 1.2.9. If A is negligible then A^C is not negligible.

Corollary 1.2.8 is subtle but quite vital to the foundations of measure theory: now that we know our universe (Ω = (0, 1] in this section) is not negligible, we can show sets have full measure by demonstrating that their complements are negligible (Corollary 1.2.9). In particular, once we establish that negligible sets have measure zero (Section 2.3), we will have completed the proof of the Strong Law of Large Numbers (1.2.1).

1.3 Further Properties of Normal Numbers

In Section 1.2 we proved that N is not negligible by showing that its complement N^C is negligible. In this section we explore some consequences of this fact and highlight the difference between negligibility and various other measures of 'smallness'.

Proposition 1.3.1. N and N^C are both dense in (0, 1].

Proof. Suppose ω ∈ (0, 1]. Given ε > 0, let j be a natural number such that 2^{−j} < ε/2. Write ω = ∑_{i∈N} 2^{−i} d_i(ω). Then the number n = ∑_{i=1}^{j−1} 2^{−i} d_i(ω) + ∑_{i=j}^∞ 2^{−i} e_i, where e_i = 0 when i is odd and 1 when i is even, is a normal number, since the tail looks like ...0101010101... In addition,

|ω − n| = | ∑_{i=j}^∞ 2^{−i} (d_i(ω) − e_i) |
  ≤ ∑_{i=j}^∞ 2^{−i} |d_i(ω) − e_i|   by the triangle inequality
  ≤ ∑_{i=j}^∞ 2^{−i} · 1 = 2^{−j}/(1 − 1/2) = 2^{−j+1} < ε   by geometric series and our choice of j above.


Hence N is dense in (0, 1]. On the other hand, given the same ω ∈ (0, 1], we can instead append the sequence 100100100100..., since we know this makes the average term go to 1/3 ≠ 1/2. By the same logic as above, this new number is within ε of ω, so N^C is dense in (0, 1].

Definition. A set A is trifling if for each ε > 0 there exists a finite sequence of intervals I_k satisfying

(i) A ⊂ ⋃_k I_k and

(ii) ∑_k |I_k| < ε.

Clearly a trifling set is negligible, and finite unions of trifling sets are trifling. Trifling sets are especially 'small' and have some nice properties.

Proposition 1.3.2. If A is trifling then Cl(A) is trifling.

Proof. Suppose A is trifling, so that there is a finite collection of intervals {I_k}_{k=1}^n covering A whose total length is less than ε/2. Write I_k = (a_k, b_k]. We enlarge each I_k to form a new collection {J_k}_{k=1}^n given by

J_k = (a_k − ε/2^{k+1}, b_k].

Notice that the maximum length added to the total length of the I_k is

∑_{k=1}^n ε/2^{k+1} ≤ ∑_{k=1}^∞ ε/2^{k+1} = ε/2,

so that their total length is still less than ε/2 + ε/2 = ε. Also, since each I_k ⊂ J_k and the I_k cover A, {J_k}_{k=1}^n is also a finite cover of A. Moreover, each closed interval [a_k, b_k] ⊂ J_k and therefore their union ⋃_{k=1}^n [a_k, b_k] contains the closure of A. This proves that Cl(A) is trifling if A is trifling.

Examples.

1 By Corollary 1.2.3, Q ∩ (0, 1] is negligible. Since Q is dense in the reals, the closure of Q ∩ (0, 1] is (0, 1] so the contrapositive to Proposition 1.3.2 implies that Q ∩ (0, 1] is not trifling.

2 The Cantor set C, as defined in the Introduction, is an uncountable set. However, C is trifling. To see this, let ε > 0 be given and take n to be a natural number such that (2/3)^n < ε. The construction of C shows that at the nth level, every number in C is contained in one of 2^n intervals, each of which has length 3^{−n}. Clearly the sum of the lengths of these intervals is (2/3)^n < ε by our choice of n, and 2^n is finite, so the collection of these intervals satisfies conditions (i) and (ii), proving C is trifling. A short numerical sketch of this covering follows below.
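The sketch (ours, for illustration): generate the 2^n level-n intervals of the Cantor construction and check that their total length (2/3)^n drops below a given ε.

from fractions import Fraction

def cantor_level(n):
    """The 2^n closed intervals of length 3^{-n} whose union contains the Cantor set."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        refined = []
        for a, b in intervals:
            third = (b - a) / 3
            refined.append((a, a + third))        # keep the left third
            refined.append((b - third, b))        # keep the right third
        intervals = refined
    return intervals

eps = Fraction(1, 100)
n = 12                                            # (2/3)^12 < 1/100
cover = cantor_level(n)
total = sum(b - a for a, b in cover)
print(len(cover), float(total), total < eps)      # 4096 0.0077... True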


Definition. A set A ⊂ Ω is nowhere dense if for every interval J ⊂ Ω, there is a subinterval I ⊂ J such that I ∩ A = ∅.

Proposition 1.3.3. A ⊂ Ω is nowhere dense ⇐⇒ the interior of the closure of A is empty.

Proof omitted.

Proposition 1.3.4. A trifling set is nowhere dense.

Proof. Suppose x ∈ Int(Cl(A)). Then there exists an ε > 0 such that J = (x − ε, x + ε) ⊂ Cl(A). Since the closure of a trifling set is trifling, we can cover Cl(A) with a collection of intervals {I_k}_{k=1}^n such that ∑_{k=1}^n |I_k| < ε. However, notice that |J| = |(x + ε) − (x − ε)| = 2ε > ε, so by (2) of Theorem 1.2.7,

J ⊄ ⋃_{k=1}^n I_k,

a contradiction. Hence A is nowhere dense.

Proposition 1.3.5. A compact negligible set is trifling.

Proof. Suppose A ⊂ Ω is compact and negligible. For a given ε > 0, negligibility implies there exists a collection {I_k}_{k=1}^∞ with the following properties:

(a) For each k, I_k = (a_k, b_k].

(b) ⋃_{k=1}^∞ I_k ⊇ A.

(c) ∑_{k=1}^∞ |I_k| < ε/2.

From this collection we form an open cover {J_k}_{k=1}^∞ where J_k = (a_k, b_k + ε/2^{k+1}). Observe that the most length we could have added to the original collection of I_k's is

∑_{k=1}^∞ ε/2^{k+1} = ε/2   by geometric series.

So ∑_{k=1}^∞ |J_k| < ε and {J_k} is an open cover of A. By compactness, there exists a finite (open) subcover {J_{k_i}}_{i=1}^n of A. Then the collection {J*_{k_i}}_{i=1}^n, where J*_{k_i} = (a_{k_i}, b_{k_i} + ε/2^{k_i+1}], is a finite collection of intervals covering A and satisfying

∑_{i=1}^n |J*_{k_i}| = ∑_{i=1}^n |J_{k_i}| ≤ ∑_{k=1}^∞ |J_k| < ε.

Hence A is trifling.


Example.

3 Let B = ⋃_n (r_n − 2^{−n−2}, r_n + 2^{−n−2}], where r_1, r_2, ... is an enumeration of the rationals in (0, 1]. Then B^C = (0, 1] − B is nowhere dense but not trifling (or even negligible).

Proof. To prove B^C is nowhere dense, we will prove that every interval contains a subinterval which is contained in (B^C)^C = B. An additional fact we will exploit is that, given an enumeration r_1, r_2, ... of the rationals in (0, 1], the set {r_m}_{m=n}^∞ for any n ≥ 1 is dense in (0, 1].

Suppose J is an interval in Ω; denote the midpoint of J by x and let |J| = ε. To buy ourselves some space away from the endpoints of J, we will consider the subinterval J* = (x − ε/4, x + ε/4). Let n be a natural number large enough that 2^{−n−1} < ε/4. By the comments above, {r_m}_{m=n}^∞ is dense in (0, 1] for this choice of n, so a rational r_m, m ≥ n, can be found in J*. Then the interval I = (r_m − 2^{−m−2}, r_m + 2^{−m−2}] is contained in B, that is, I ∩ B^C = ∅. What's more, |I| = 2^{−m−1} ≤ 2^{−n−1} < ε/4, and since r_m ∈ J*, I extends at most ε/4 to the right or left of the endpoints of J*. This shows that I ⊂ J, and therefore we conclude that B^C is nowhere dense in Ω.

Definition. We say a set is of the first category if the set can be represented as a countable union of nowhere dense sets.

This is a topological notion of smallness, just as negligibility is a measure-theoretic notion of smallness. The following examples illustrate that neither of these conditions implies the other.

Examples. 4 The non-negligible set N of normal numbers is of the first category.

Proof. To prove the statement, we will show that A_m = ⋂_{n=m}^∞ {ω : |n^{−1} s_n(ω)| < 1/2} is nowhere dense and N ⊂ ⋃_m A_m. Consider A_m for a fixed m ∈ N. To prove A_m is nowhere dense in Ω = (0, 1], we will show that for any interval J ⊂ Ω there is a subinterval I ⊂ J such that I ∩ A_m = ∅. Given such an interval J = (a, b], choose a dyadic interval I ⊂ J of rank n_0 > m such that

(1/n_0)|s_{n_0}(ω)| > 1/2 for all ω ∈ I.

Such a choice of I is possible because specifying I is equivalent to a choice of the first n_0 digits of every ω ∈ I: choose the first few digits so that I ⊂ J, take the remaining digits (up to the n_0th) to be 1's, and take n_0 large enough so that (1/n_0)|s_{n_0}(ω)| is sufficiently close to 1 for all ω ∈ I. We then claim that I ∩ A_m = ∅. To see this, recall the definition of A_m:

A_m = ⋂_{n=m}^∞ {ω : (1/n)|s_n(ω)| < 1/2}.

If ω ∈ I then our choice of n_0 means that ω ∉ {ω : (1/n_0)|s_{n_0}(ω)| < 1/2}, and therefore ω does not lie in the intersection defining A_m. Hence A_m and I are disjoint.


Now to prove N ⊂ ⋃_m A_m, recall that the set of normal numbers may be defined by

N = {ω : lim_{n→∞} (1/n)|s_n(ω)| = 0}
  = {ω : for all ε > 0, (1/n)|s_n(ω)| < ε for all n ≥ some n_0}.

Let ω ∈ N and choose ε = 1/2. Then for all n ≥ n_0, for the appropriate choice of n_0, (1/n)|s_n(ω)| < 1/2. This shows that ω ∈ A_{n_0} and so ω ∈ ⋃_m A_m. Hence N ⊂ ⋃_m A_m.

5 On the other hand, the negligible set N C is not of the first category.

Proof. If N C were a countable union of nowhere dense sets, then (0, 1] = N ∪ N C would be as well since by example 4 , N is of the first category. However, a famous theorem of Baire (see Royden’s Real Analysis) says that a nonempty interval is not of the first category, so it follows that N C is not of the first category.


2 Probability Measures

2.1 Fields, σ-fields and Probability Measures

All the discussion about negligible sets and normal numbers highlights the utility of computing 'lengths' of more complicated sets than finite unions of intervals. If we try to define a function giving length for any subset of (0, 1], there are immediate logical contradictions (e.g. the Banach-Tarski Paradox in R³ and similar issues in other dimensions). So we need to restrict our attention to certain types of subsets. For instance, we want to calculate the length of the normal numbers, which may be written

N = ⋂_{k=1}^∞ ⋃_{m=1}^∞ ⋂_{n=m}^∞ {ω : |(1/n) s_n(ω)| < 1/k}.

Notice that for any particular n and k, the set inside all the unions and intersections is a finite union of intervals. The entirety of measure theory is centered on describing the technicalities of performing probabilistic computations on countable unions.

Definition. Suppose Ω is a set and F is a collection of subsets of Ω. F is a field if

(1) Ω ∈ F.

(2) A ∈ F implies A^C ∈ F.

(3) A, B ∈ F implies A ∪ B ∈ F.

In other words, F is closed under finite set operations. Note that a field is closed under intersection by DeMorgan's Laws: A ∩ B = ((A ∩ B)^C)^C = (A^C ∪ B^C)^C, which is why we say a field is closed under all finite set operations. Also notice that (1) and (2) imply that the empty set ∅ is always an element of the field.

Definition. A field F on Ω is called a σ-field if F is also closed under countable unions. The elements of a σ-field are called measurable sets or F-sets.

Examples.

1 Let B0 = {A ⊂ Ω | A is a finite union of intervals} on Ω = (0, 1].

Claim. B0 is a field.

Proof. (1) Ω ∈ B_0. To prove (2), note that the complement of a finite union of intervals is a finite union of intervals: if

A = ⋃_{i=1}^n (a_i, b_i]

then A^C = ⋂_{i=1}^n (a_i, b_i]^C = ⋂_{i=1}^n ((0, a_i] ∪ (b_i, 1]), which is a finite union of intervals. (3) is very similar to (2).


Note that B_0 is not a σ-field, since for example (0, 1/2) is not in B_0 but can be expressed as a countable union of sets in B_0:

(0, 1/2) = ⋃_{n=1}^∞ (0, 1/2 − 1/n].

2 For a space Ω, consider the following collections of subsets of Ω.

F_1 = {finite and cofinite subsets of Ω} = {X ⊂ Ω | X or X^C is finite}
F_2 = {countable and cocountable subsets} = {X ⊂ Ω | X or X^C is countable}
F_3 = P(Ω)
F_4 = {∅, Ω}.

All four are examples of fields on Ω. In fact, F2 is a σ-field but F1 is not. Both F3 and F4 are special σ-fields: F3 is the largest possible σ-field on Ω and similarly F4 is the smallest.

Definition. The σ-field generated by a collection a ⊂ P(Ω) is the smallest σ-field containing a:

σ(a) := ⋂ {G | G is a σ-field containing a}.

Proposition 2.1.1. For any collection a ⊂ P(Ω), σ(a) exists and is a σ-field.

Proof. First, σ(a) exists because P(Ω) is a σ-field containing a, so it suffices to prove that if {G} is a collection of σ-fields, then so is ⋂G. (1) Ω ∈ G for all G, so Ω ∈ ⋂G. (2) If X ∈ ⋂G then X is in each G and so X^C is in each G. Thus X^C ∈ ⋂G. (3) Same as (2). Hence σ(a) is a σ-field.

Proposition 2.1.2. Let a ⊂ P(Ω). The following are properties of σ(a). (1) a ⊂ σ(a).

(2) a is a σ-field if and only if a = σ(a).

(3) If G is a σ-field containing a then σ(a) ⊂ G.

Proof. Obvious from the definition.

Examples.

3 If a = {{x} | x ∈ Ω} then σ(a) = F_2, the countable/cocountable σ-algebra.

4 The Borel σ-field is defined as B = σ(B0) where B0 is as defined in 1 . An element of B is called a Borel set. Any countable sequence of set operations applied to an interval (a, b] will produce a Borel set.


It is important to note that not every Borel set is obtained in this way; that is, not every Borel set is the result of applying countably many set operations to an interval. This illustrates the difference between the field generated by a and the σ-field generated by a: the former can be defined as the collection of all finite set operations on sets in a, while the latter may not be defined in this way with countable operations. The Borel σ-field is our favourite σ-field in probability theory, as it allows us to define measures in a meaningful way, i.e. so that they are compatible with all Borel sets. However, the one important type of set that we have studied so far – a negligible or measure zero set – is not always a Borel set.

Notice that open sets (and therefore closed sets) are Borel sets. This is because an open set U ⊂ (0, 1] contains a countable dense subset, Q ∩ U. Therefore for any x ∈ U, there exists ε > 0 with (x − ε, x + ε) ⊂ U, and there exist rationals p_x, q_x ∈ Q such that

x − ε ≤ px < x ≤ qx < x + ε.

Then x ∈ (p_x, q_x] ⊂ (x − ε, x + ε) ⊂ U. We can thus express U as a countable union of intervals:

U = ⋃_{x∈U} (p_x, q_x]

(only countably many distinct intervals appear here, since their endpoints are rational). Hence U is Borel.

5 The set of normal numbers N is a Borel set.

We next focus on defining probability measures on a field. Later we will extend this notion to σ-fields.

Definition. A set function P : F → R where F is a field on a set Ω is called a probability measure provided (1) 0 ≤ P (A) ≤ 1 for all A ∈ F.

(2) P (Ω) = 1 and P (∅) = 0.

(3) If {A_i}_{i=1}^∞ are disjoint sets in F and ⋃_{i=1}^∞ A_i ∈ F, then P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

The third property is called countable additivity – this is the key property for a probability measure.

Remarks.

(a) F may not be a σ-field, so in order to define P(⋃ A_i) we require that ⋃ A_i ∈ F.

(b) Property (3) implies finite additivity: if {A_i}_{i=1}^n are disjoint F-sets then

P(⋃_{i=1}^n A_i) = ∑_{i=1}^n P(A_i).

This requires the fact that P(∅) = 0.


(c) If A ∈ F then by (2),

P (A) + P (AC ) = P (A ∪ AC ) = P (Ω) = 1.

Hence P (AC ) = 1 − P (A) for all A. This actually implies P (∅) = 0, so stating it in the definition was redundant.

(d) Further, since P(A^C) = 1 − P(A) ≥ 0, P(A) ≤ 1 for all A ∈ F, so this was redundant in the definition too.

Given these redundancies, we can rewrite the conditions of a probability measure as

(1) 0 ≤ P(A) for all A ∈ F.

(2) P (Ω) = 1.

(3) If {A_i}_{i=1}^∞ are disjoint sets in F and ⋃_{i=1}^∞ A_i ∈ F, then P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

Definition. Let F be a field on a set Ω and P : F → R be a probability measure defined on F. The triple (Ω, F, P) is called a probability space.

Definition. If F ∈ F is such that P(F) = 1, F is called a support of P on F.

Examples.

6 Let Ω = N = {1, 2, 3, ...} and define the function p : Ω → R by p(n) = 1/2^n. Notice that ∑_{n=1}^∞ 1/2^n = 1. For any A ⊂ Ω, we define a probability measure P by

P(A) = ∑_{a∈A} p(a).

(A small numerical sketch of this measure appears after these examples.)

The right σ-field to use here is F = P(Ω), which makes (Ω, F,P ) into what is called a discrete probability space (see below).

7 Let Ω = (0, ∞) and consider the σ-field F = P(Ω). We can use exactly the same formula for p to define

P(A) = ∑_{a∈A∩Z} p(a).

For example, P((1/2, 21/2]) = 1/2 + 1/4 + ··· + 1/2^{10}. Oddly, many intervals have zero probability in this space: P((1/3, 1/2]) = 0. In this example, Z is a support for P.

8 Let Ω be a countable space. For a nonnegative function p : Ω → [0, ∞) such that ∑_{ω∈Ω} p(ω) = 1, define P(A) = ∑_{ω∈A} p(ω). Then Ω is called a discrete probability space.
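A small Python sketch of the discrete measure from Example 6 (illustrative only; finite truncations stand in for infinite sets):

from fractions import Fraction

def P(A):
    """P(A) = sum of p(n) = 2^{-n} over n in the finite set A."""
    return sum(Fraction(1, 2 ** n) for n in A)

print(P({1, 2, 3}))                     # 7/8
print(float(P(range(1, 30))))           # partial sums for P(Omega), approaching 1
print(float(P(range(2, 60, 2))))        # partial sums for the even integers, approaching 1/3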

Claim. A discrete probability space cannot contain an infinite sequence A_1, A_2, ... of independent events each of probability 1/2.


Proof. Suppose Ω is such a space. Consider a number ω ∈ Ω. Then ω must lie in one of the following sets:

A_1 ∩ A_2,   A_1 ∩ A_2^C,   A_1^C ∩ A_2,   or   A_1^C ∩ A_2^C,

each of which has probability (1/2)·(1/2) = 1/4 by independence. Thus P(ω) ≤ 1/4. Likewise at the "nth" level we have a collection of intersections

A_1 ∩ ··· ∩ A_n,   A_1^C ∩ A_2 ∩ ··· ∩ A_n,   ...,   A_1^C ∩ ··· ∩ A_n^C

which partition Ω, and by independence each of these sets has probability 2^{−n}. Then ω must lie in one of these intersections, so again P(ω) ≤ 2^{−n}. Taking n → ∞, we conclude P(ω) = 0, but this is not possible in a discrete probability space, since we would have

P(Ω) = ∑_{ω∈Ω} P(ω) = ∑_{ω∈Ω} 0 = 0 ≠ 1.

This can be generalized as follows.

Claim. Suppose that 0 ≤ p_n ≤ 1, and put α_n = min{p_n, 1 − p_n}. If ∑_n α_n diverges, then no discrete probability space can contain independent events A_1, A_2, ... such that A_n has probability p_n.

Proof. As above, define at the nth level a collection of intersections

{B1 ∩ · · · ∩ Bn}

where B_i is a choice of either A_i or A_i^C. For a particular choice of the B_i, the intersection B = ⋂_{i=1}^n B_i has probability

P(B) ≤ ∏_{i=1}^n (1 − α_i)

by independence, since 1 − α_n corresponds to the maximum of {p_n, 1 − p_n}. However, notice that 1 − α_i ≤ e^{−α_i} for each i (since 1 − x ≤ e^{−x} for all x ≥ 0). Thus the above product is bounded by

∏_{i=1}^n (1 − α_i) ≤ ∏_{i=1}^n e^{−α_i} = e^{−∑_{i=1}^n α_i}.

Since the series in the exponent above is assumed to diverge, the term on the right approaches zero as n → ∞. This shows that P(B) → 0 as n gets large. In particular, any ω ∈ Ω must lie in such an intersection B at every level n, and so P(ω) = 0 for all ω ∈ Ω. As in the previous proof, this cannot happen if Ω is a discrete probability space. Hence no such sequence A_1, A_2, ... exists.


Proposition 2.1.3. (Properties of Probability Measures) (1) Monotonicity: if A ⊆ B then P (A) ≤ P (B). (2) Complement: for any A ⊂ Ω, P (AC ) = 1 − P (A). (3) Inclusion-exclusion: P (A ∪ B) = P (A) + P (B) − P (A ∩ B). Proof. (1) Suppose A ⊆ B. Then

P (B) = P ((B ∩ AC ) ∪ A) = P (B ∩ AC ) + P (A) ≥ P (A).

(2) was proven in remark (c). (3) Consider A ∪ B = A ∪ (B ∩ AC ). Then P (A ∪ B) = P (A) + P (B ∩ AC ). Also note that B = (B ∩ AC ) ∪ (A ∩ B), so that P (B) = P (B ∩ AC ) + P (A ∩ B). Putting this together gives us

P (A ∪ B) + P (A ∩ B) = P (A) + P (B ∩ AC ) + P (A ∩ B) = P (A) + P (B).

Rearranging yields P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Part (3) of Proposition 2.1.3 is actually a special case of the more general inclusion-exclusion principle:

Proposition 2.1.4 (The Inclusion-Exclusion Principle). For any finite collection of subsets {A_k}_{k=1}^n,

P(⋃_{k=1}^n A_k) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k) − ··· + (−1)^{n+1} P(A_1 ∩ ··· ∩ A_n).

Proof. We induct on n; the base case n = 2 is part (3) of Proposition 2.1.3. For the inductive step,

P(⋃_{i=1}^{n+1} A_i) = P( (⋃_{i=1}^n A_i) ∪ A_{n+1} )
  = P(⋃_{i=1}^n A_i) + P(A_{n+1}) − P( (⋃_{i=1}^n A_i) ∩ A_{n+1} )

by the base case. In particular,

P( (⋃_{i=1}^n A_i) ∩ A_{n+1} ) = P( ⋃_{i=1}^n (A_i ∩ A_{n+1}) )
  = ∑_{1≤i≤n} P(A_i ∩ A_{n+1}) − ∑_{1≤i<j≤n} P(A_i ∩ A_j ∩ A_{n+1}) + ···

by the inductive hypothesis applied to the n sets A_i ∩ A_{n+1}.


Now substituting this into the formula above for P(⋃_{i=1}^{n+1} A_i) and collecting terms gives the desired formula for n + 1. Therefore by induction, the inclusion-exclusion principle holds for all n.
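A brute-force numerical check of the inclusion-exclusion formula in Python (our own illustration), using the uniform measure P(A) = |A|/|Ω| on a small finite Ω:

from fractions import Fraction
from itertools import combinations
import random

random.seed(1)
omega = set(range(20))
P = lambda A: Fraction(len(A), len(omega))      # uniform probability measure
sets = [set(random.sample(sorted(omega), 8)) for _ in range(4)]

lhs = P(set().union(*sets))                     # P(A_1 u ... u A_4)
rhs = Fraction(0)
for r in range(1, len(sets) + 1):
    for combo in combinations(sets, r):
        rhs += (-1) ** (r + 1) * P(set.intersection(*combo))
print(lhs == rhs, lhs)                          # True, and the common value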

∞ Theorem 2.1.5. Let P be a probability measure on a field F and suppose {Ak}k=1 is a countable collection of subsets belonging to F.

(1) Continuity from below: if A_1 ⊂ A_2 ⊂ A_3 ⊂ ··· and their union A = ⋃_{k=1}^∞ A_k lies in F, then lim_{n→∞} P(A_n) = P(A).

(2) Continuity from above: if A_1 ⊃ A_2 ⊃ A_3 ⊃ ··· and their intersection A = ⋂_{n=1}^∞ A_n lies in F, then lim_{n→∞} P(A_n) = P(A).

(3) Countable subadditivity: if ⋃_{k=1}^∞ A_k ∈ F, then

P(⋃_{k=1}^∞ A_k) ≤ ∑_{k=1}^∞ P(A_k).

Proof. (1) By monotonicity, lim_{n→∞} P(A_n) exists (and is ≤ 1). Set B_1 = A_1 and define B_k = A_k − A_{k−1}. Then the B_k are disjoint, and each A_n, along with A, can be written as a disjoint union:

A_n = ⋃_{k=1}^n B_k   and   A = ⋃_{k=1}^∞ B_k.

By countable additivity,

P(A) = ∑_{k=1}^∞ P(B_k) = lim_{n→∞} ∑_{k=1}^n P(B_k) = lim_{n→∞} P(A_n).

(2) Suppose A_1 ⊃ A_2 ⊃ A_3 ⊃ ···. Then the complementary sequence is ascending: A_1^C ⊂ A_2^C ⊂ A_3^C ⊂ ···, with union A^C, and since F is a field, ⋂ A_k ∈ F implies (⋂ A_k)^C = ⋃ A_k^C ∈ F as well. Hence by part (1) and property (2) from before,

P(A) = 1 − P(A^C) = 1 − lim_{n→∞} P(A_n^C) = lim_{n→∞} (1 − P(A_n^C)) = lim_{n→∞} P(A_n).

(3) omitted.


2.2 The Lebesgue Measure on the Unit Interval

In this section we explore an important measure on our favourite space, Ω = (0, 1]. Consider the field B_0 on this space; recall that this is the set of finite unions of intervals in (0, 1], which by closure also contains finite intersections and complements. We define a function λ : B_0 → R by λ((a, b]) = b − a and extend by additivity for finite, disjoint unions of intervals:

λ( ⋃_{i=1}^n (a_i, b_i] ) = ∑_{i=1}^n λ((a_i, b_i]).

As with other functions that assign values to disjoint unions of intervals, we must check that λ is well-defined.

Claim. If A = ⋃_{i=1}^n I_i = ⋃_{j=1}^m J_j are two representations of A as finite disjoint unions of intervals, then ∑_{i=1}^n λ(I_i) = ∑_{j=1}^m λ(J_j).

Proof. Notice that each I_i ⊂ A, so we can write these intervals as disjoint unions of their intersections with the J_j's:

I_i = ⋃_{j=1}^m (I_i ∩ J_j).

We can replace the I_i's in the union above to obtain

∑_{i=1}^n |I_i| = ∑_{i=1}^n ∑_{j=1}^m |I_i ∩ J_j|.

By the same logic the Jj’s can be decomposed into their intersections with each of the Ii’s, and because both summations are finite we may reverse the order. This implies the claim.

Definition. The function λ : B0 → R on the set Ω is called the Lebesgue measure. Theorem 2.2.1. The Lebesgue measure is a probability measure.

Proof. It suffices to show countable additivity. Suppose {A_k}_{k=1}^∞ is a disjoint collection of sets in B_0 and A = ⋃_{k=1}^∞ A_k ∈ B_0. This means that each A_k = ⋃_{j=1}^{m_k} J_{kj} for disjoint intervals J_{kj}, and A = ⋃_{i=1}^n I_i where the I_i are also disjoint intervals in B_0. By definition of the Lebesgue measure,

λ(A) = ∑_{i=1}^n |I_i|

and I_i = ⋃_{k=1}^∞ ⋃_{j=1}^{m_k} (I_i ∩ J_{kj}) for all i, since I_i ⊂ ⋃_{k=1}^∞ A_k. Everything in this equation is an interval, so by Theorem 1.2.7,

|I_i| = ∑_{k=1}^∞ ∑_{j=1}^{m_k} |I_i ∩ J_{kj}|.


Therefore,

λ(A) = ∑_{i=1}^n ∑_{k=1}^∞ ∑_{j=1}^{m_k} |I_i ∩ J_{kj}|
  = ∑_{k=1}^∞ ∑_{j=1}^{m_k} ( ∑_{i=1}^n |I_i ∩ J_{kj}| )   since the triple sum converges
  = ∑_{k=1}^∞ ∑_{j=1}^{m_k} |J_{kj}| = ∑_{k=1}^∞ λ(A_k).

Hence λ is a probability measure on B0.

2.3 Extension to σ-fields

We want to extend the Lebesgue measure to a measure on the σ-field B – and in general any probability measure on a field F – so that we can measure sets like N and N^C. The following is perhaps the single most important construction in all of measure theory.

Theorem 2.3.1. Suppose F_0 is a field and P is a probability measure on F_0. Then there exists a unique extension Q of P such that Q is a probability measure on F = σ(F_0) and Q|_{F_0} = P.

Notice that if we apply this theorem to λ on B0, we obtain a probability measure on the Borel σ-field B; this is also called the Lebesgue measure on (0, 1] and is denoted λ. This is consistent with the literature and there should be no confusion as long as context is understood. For the remainder of the chapter we construct such an extension Q and prove that it has the desired properties in the statement of the theorem. Starting with P and F0, we define Definition. For any subset A ⊂ Ω, the P -outer measure of A is

P*(A) = inf { ∑_{k=1}^∞ P(A_k) | A_k ∈ F_0 for all k and ⋃_{k=1}^∞ A_k ⊃ A }.

It turns out that P* satisfies almost every axiom of a probability measure, except that it is not countably additive – however, it is countably subadditive. We will prove this and several other properties of P* in a moment.

Definition. For a set A ⊂ Ω, the P -inner measure of A is defined as

P_*(A) = sup { ∑_{k=1}^∞ P(B_k) | B_k ∈ F_0 for all k and ⋃_{k=1}^∞ B_k ⊂ A }.


One can rewrite the inner measure a bit using the field axioms:

P_*(A) = 1 − inf { ∑_{k=1}^∞ P(B_k) | B_k ∈ F_0, ⋃_{k=1}^∞ B_k ⊃ A^C } = 1 − P*(A^C) = P*(Ω) − P*(A^C).

The inner and outer measures clearly agree on sets in B_0; that is, if A ∈ B_0 then P*(A) = P_*(A) = P(A). In general, we want to say a set A is 'nice' in some way if P*(A) = P_*(A); in other words, P*(A) + P*(A^C) = P*(Ω). It turns out that this isn't quite strong enough; we need the following condition.

Definition. A ⊂ Ω is P*-measurable if for every E ⊂ Ω, P*(A ∩ E) + P*(A^C ∩ E) = P*(E). This is sometimes called the Carathéodory condition.

Proposition 2.3.2. Let P* be the P-outer measure for a probability measure P on F_0.

(1) P*(∅) = 0.

(2) (Positivity) P*(A) ≥ 0 for all A ⊂ Ω.

(3) (Monotonicity) If A ⊆ B then P*(A) ≤ P*(B).

(4) (Countable subadditivity) For {A_n}_{n=1}^∞, P*(⋃_{n=1}^∞ A_n) ≤ ∑_{n=1}^∞ P*(A_n).

Proof. (1) – (3) are obvious from the definition. To prove (4), note that for any ε > 0 and each n,

P*(A_n) = inf { ∑_{k=1}^∞ P(B_k) | B_k ∈ F_0, ⋃_k B_k ⊃ A_n } ≥ ( ∑_{k=1}^∞ P(B_{nk}) ) − ε/2^n

for some choice of {B_{nk}}_{k=1}^∞. Then ⋃_{n=1}^∞ ⋃_{k=1}^∞ B_{nk} ⊃ ⋃_{n=1}^∞ A_n, so

P*(⋃_{n=1}^∞ A_n) ≤ ∑_{n=1}^∞ ∑_{k=1}^∞ P(B_{nk})
  = ∑_{n=1}^∞ [ ( ∑_{k=1}^∞ P(B_{nk}) ) − ε/2^n ] + ε
  ≤ ( ∑_{n=1}^∞ P*(A_n) ) + ε.

Since ε > 0 was arbitrary, we can take ε → 0, which preserves the inequality. This proves that P* is countably subadditive.


Note that property (4) implies P*(E) ≤ P*(A ∩ E) + P*(A^C ∩ E) for all choices of A, E ⊂ Ω, so in the future we need only check the reverse inequality to prove that the Carathéodory condition holds.

Remark. It is clear from the definition of outer measure that for the Lebesgue measure λ, λ∗(A) = 0 ⇐⇒ A is a negligible subset of Ω. This finally proves the Strong Law of Large Numbers (1.2.1) rigorously, since Borel’s normal number theorem (1.2.5) established that N C was negligible.

We now prove a series of lemmata that will construct the extension Q on σ(F0). Let M = {A ⊂ Ω | A is P ∗-measurable}.

Lemma 2.3.3. M is a field.

Proof. (1) Note that P ∗(Ω ∩ E) + P ∗(ΩC ∩ E) = P ∗(E) + P ∗(∅) = P ∗(E) so Ω ∈ M. (2) Suppose A ∈ M. Then for all E ⊂ Ω,

P ∗(E) = P ∗(A ∩ E) + P ∗(AC ∩ E) = P ∗((AC )C ∩ E) + P ∗(AC ∩ E).

So AC ∈ M. (3) It suffices to prove that A, B ∈ M =⇒ A ∪ B ∈ M and induct. Let E ⊂ Ω. Then

P*(E) = P*(B ∩ E) + P*(B^C ∩ E)
  = P*(A ∩ (B ∩ E)) + P*(A^C ∩ (B ∩ E)) + P*(A ∩ (B^C ∩ E)) + P*(A^C ∩ (B^C ∩ E))   since A ∈ M
  ≥ P*( (A ∩ B ∩ E) ∪ (A^C ∩ B ∩ E) ∪ (A ∩ B^C ∩ E) ) + P*(A^C ∩ B^C ∩ E)   by subadditivity
  = P*((A ∪ B) ∩ E) + P*((A ∪ B)^C ∩ E)   by DeMorgan.

By the above remark, this shows A ∪ B ∈ M.

Lemma 2.3.4. Suppose {A_i}_{i=1}^∞ ⊂ M, the A_i are disjoint, and E ⊂ Ω. Then

P*( E ∩ (⋃_{i=1}^∞ A_i) ) = ∑_{i=1}^∞ P*(E ∩ A_i).

Proof. Suppose first that we have a finite collection {A_i}_{i=1}^n. It suffices to show the property for A_1, A_2 (i.e. n = 2) and induct. Since A_1, A_2 and A_1 ∪ A_2 are in M, we have

P*(E) = P*(E ∩ (A_1 ∪ A_2)) + P*(E ∩ (A_1 ∪ A_2)^C)
  = P*(E ∩ A_1) + P*(E ∩ A_1^C)
  = P*(E ∩ A_2) + P*(E ∩ A_2^C).


Combining the first two equations gives us

P*(E ∩ (A_1 ∪ A_2)) = P*(A_1 ∩ E) + P*(E ∩ A_1^C) − P*(E ∩ (A_1 ∪ A_2)^C)
  = P*(E ∩ A_1) + P*((E ∩ A_1^C) ∩ A_2) + P*((E ∩ A_1^C) ∩ A_2^C) − P*(E ∩ (A_1^C ∩ A_2^C))
  = P*(E ∩ A_1) + P*(E ∩ (A_1^C ∩ A_2))
  = P*(E ∩ A_1) + P*(E ∩ A_2),

since A_2 ⊂ A_1^C (by disjointness). After induction, we have

P*( E ∩ (⋃_{i=1}^n A_i) ) = ∑_{i=1}^n P*(E ∩ A_i).

Further, E ∩ (⋃_{i=1}^n A_i) ⊂ E ∩ (⋃_{i=1}^∞ A_i) for any finite n, so by monotonicity,

P*( E ∩ (⋃_{i=1}^∞ A_i) ) ≥ P*( E ∩ (⋃_{i=1}^n A_i) ) = ∑_{i=1}^n P*(E ∩ A_i).

Now taking n → ∞ preserves the inequality, so we have

P*( E ∩ (⋃_{i=1}^∞ A_i) ) ≥ ∑_{i=1}^∞ P*(E ∩ A_i).

For the other inequality, subadditivity gives us

P*( E ∩ (⋃_{i=1}^∞ A_i) ) = P*( ⋃_{i=1}^∞ (E ∩ A_i) ) ≤ ∑_{i=1}^∞ P*(E ∩ A_i).

Lemma 2.3.5. M is a σ-field and P*|_M is countably additive.

Proof. Suppose {A_i}_{i=1}^∞ are disjoint M-sets and set A = ⋃_{i=1}^∞ A_i. For any n ≥ 1, denote F_n = ⋃_{i=1}^n A_i. By Lemma 2.3.3, F_n ∈ M for each n, so

P*(E) = P*(E ∩ F_n) + P*(E ∩ F_n^C)
  ≥ P*(E ∩ F_n) + P*(E ∩ A^C)   by monotonicity
  = ∑_{i=1}^n P*(E ∩ A_i) + P*(E ∩ A^C).


Letting n → ∞, we obtain

P*(E) ≥ ∑_{i=1}^∞ P*(E ∩ A_i) + P*(E ∩ A^C)
  ≥ P*( ⋃_{i=1}^∞ (E ∩ A_i) ) + P*(E ∩ A^C)   by subadditivity
  = P*( E ∩ (⋃_{i=1}^∞ A_i) ) + P*(E ∩ A^C)
  = P*(E ∩ A) + P*(E ∩ A^C).

Hence A ∈ M so M is a σ-field. Now, letting E = Ω we have

1 = P*(A) + P*(A^C) = ∑_{i=1}^∞ P*(A_i) + P*(A^C)

by the above inequalities, which are now equalities. Subtracting P*(A^C) from both sides gives P*(A) = ∑_{i=1}^∞ P*(A_i), so P*|_M is countably additive.

Lemma 2.3.6. F_0 ⊂ M.

Proof. Let A ∈ F_0, E ⊂ Ω, and ε > 0, and choose A_n ∈ F_0 such that E ⊂ ⋃_{n=1}^∞ A_n and ∑_{n=1}^∞ P(A_n) ≤ P*(E) + ε. Let B_n = A_n ∩ A and C_n = A_n ∩ A^C, which are all in F_0 since F_0 is a field. Then E ∩ A ⊂ ⋃_{n=1}^∞ B_n and E ∩ A^C ⊂ ⋃_{n=1}^∞ C_n, so

P*(E ∩ A) + P*(E ∩ A^C) ≤ ∑_{n=1}^∞ P(B_n) + ∑_{n=1}^∞ P(C_n) = ∑_{n=1}^∞ (P(B_n) + P(C_n))
  = ∑_{n=1}^∞ P(B_n ∪ C_n)   by finite additivity
  = ∑_{n=1}^∞ P(A_n) ≤ P*(E) + ε.

Taking ε → 0, we have P ∗(E ∩ A) + P ∗(E ∩ AC ) ≤ P ∗(E). The other direction follows from subadditivity. Hence A is P ∗-measurable.

Lemma 2.3.7. P*|_{F_0} = P.

Proof. Clearly any set A ∈ F_0 covers itself, so

P*(A) = inf { ∑_{n=1}^∞ P(A_n) : {A_n} covers A } ≤ P(A).


For the other inequality, suppose {A_n} is any cover of A by F_0-sets. We will show that ∑_{n=1}^∞ P(A_n) ≥ P(A). Consider

P(A) ≤ P(⋃_{n=1}^∞ A_n)   by monotonicity
  ≤ ∑_{n=1}^∞ P(A_n)   by subadditivity
⟹ P*(A) = inf { ∑_{n=1}^∞ P(A_n) : {A_n} covers A } ≥ P(A).

∗ Together this shows that P (A) = P (A) for all A ∈ F0.

At this point we have a countably additive measure P* on a σ-field M ⊃ σ(F_0) for which P*|_{F_0} = P. This completes the existence portion of Theorem 2.3.1.

2.4 π-systems and λ-systems

For uniqueness, we will prove the following result and apply it to the σ-field F.

Theorem 2.4.1. Let P be a π-system (defined below) and suppose P_1 and P_2 are probability measures on σ(P) satisfying P_1|_P = P_2|_P. Then P_1 = P_2.

Definition. A π-system is a collection P ⊂ P(Ω) for which A1,A2 ∈ P =⇒ A1 ∩ A2 ∈ P.

Definition. A λ-system is a collection L ⊂ P(Ω) satisfying (1) Ω ∈ L.

(2) A ∈ L =⇒ AC ∈ L.

(3) If {A_i}_{i=1}^∞ are disjoint sets in L then ⋃_{i=1}^∞ A_i ∈ L.

Notice that a λ-system is almost a σ-field – the disjointness requirement in (3) is a key difference. Further, a collection that is both a λ-system and a π-system is automatically a σ-field. To see this, let {A_i}_{i=1}^∞ be a collection of not necessarily disjoint sets in such a system. Then by setting B_i = A_i ∖ ⋃_{j=1}^{i−1} A_j, we see that the B_i have the same union as the A_i and are disjoint.

Remark. Notice that an equivalent condition to (2), given that (1) and (3) are true, is: A_1, A_2 ∈ L and A_1 ⊂ A_2 ⟹ A_2 ∖ A_1 ∈ L.

Theorem 2.4.2 (Dynkin's π-λ Theorem). Suppose P is a π-system and L is a λ-system with P ⊂ L. Then σ(P) ⊂ L.


Proof. Define L_0 to be the λ-system generated by P, i.e. the smallest λ-system containing P. Then P ⊂ L_0 ⊂ L. We will show that σ(P) ⊂ L_0, which implies the result. We do this by showing L_0 is a π-system; by the comments above this will mean L_0 is a σ-field. Define L_A = {B ∈ L_0 | B ∩ A ∈ L_0} for a set A ⊂ Ω. First assume that A ∈ L_0. Under this hypothesis we can show that L_A is a λ-system:

(1) Ω ∩ A = A ∈ L_0, so Ω ∈ L_A.

(2) Suppose B_1, B_2 ∈ L_A and B_1 ⊂ B_2. Then A ∩ (B_2 ∖ B_1) = (A ∩ B_2) ∖ (A ∩ B_1), and we notice that A ∩ B_2 and A ∩ B_1 are both in L_0. Since L_0 is a λ-system, the whole expression above is in L_0. Hence B_2 ∖ B_1 ∈ L_A, so (2) holds by the remark.

(3) Suppose {B_i}_{i=1}^∞ are disjoint elements of L_A. Then A ∩ (⋃_{i=1}^∞ B_i) = ⋃_{i=1}^∞ (A ∩ B_i) and A ∩ B_i ∈ L_0 for each i, so because L_0 is a λ-system, ⋃_{i=1}^∞ (A ∩ B_i) ∈ L_0. Thus ⋃_{i=1}^∞ B_i ∈ L_A.

Now suppose A, B ∈ P, which implies A ∩ B ∈ P since P is a π-system. That means that if A ∈ P then L_A ⊃ P, and moreover L_A is a λ-system containing P, so L_A ⊃ L_0. Therefore if A ∈ P and B ∈ L_0 then A ∩ B ∈ L_0. Switching the roles of A and B, we can see that L_B ⊃ P if B ∈ L_0. Thus for every B ∈ L_0, L_B is a λ-system containing P, which implies L_B ⊃ L_0. Finally, for all A, B ∈ L_0, A ∩ B ∈ L_0, which shows L_0 is a π-system. Hence L_0 is a σ-field and the theorem is proved.

Now we can prove Theorem 2.4.1 and use it to prove the uniqueness statement of Theorem 2.3.1.

Proof. Let L = {A ∈ σ(P) | P_1(A) = P_2(A)}. Note that P ⊂ L by hypothesis. We will prove that L is a λ-system, which will imply L ⊃ σ(P) by the π-λ Theorem.

(1) Clearly Ω ∈ L.

(2) If A ∈ L then P_1(A^C) = 1 − P_1(A) = 1 − P_2(A) = P_2(A^C), so A^C ∈ L.

(3) If {A_n}_{n=1}^∞ are disjoint sets in L then

P_1(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ P_1(A_n) = ∑_{n=1}^∞ P_2(A_n) = P_2(⋃_{n=1}^∞ A_n)

because P1 and P2 are both countably additive. Hence L is a λ-system and the result follows.


2.5 Monotone Classes

There is an alternate proof of the uniqueness portion of Theorem 2.3.1 which is stated in terms of monotone classes:

Definition. A subset M of P(Ω) is monotone if ∞ \ (1) For every sequence M1 ⊃ M2 ⊃ · · · of sets in M, Mn ∈ M. n=1

∞ [ (2) For every sequence M1 ⊂ M2 ⊂ · · · of sets in M, Mn ∈ M. n=1 Lemma 2.5.1. A monotone class which is a field is a σ-field.

∞ S∞ Proof. We just need to check that if {An}n=1 ⊂ M then n=1 An ∈ M. Define Bn = Sn S∞ S∞ k=1 Ak for each n ∈ N. Then B1 ⊂ B2 ⊂ · · · and n=1 Bn = n=1 An. Since M is a S∞ monotone class, (2) shows that n=1 Bn ∈ M. The following is an analog to Dynkin’s π-λ Theorem which allows one to finish the uniqueness proof in the same way as above.

Theorem 2.5.2 (Halmos’ Class Theorem). If F0 is a field and M is a monotone class containing F0, then M contains σ(F0).

Proof. First define m(F0) to be the monotone class generated by F0, that is the small- est monotone class containing F0. It suffices to show m(F0) is a field and then apply Lemma 2.5.1. (1) Ω ∈ m(F0) since Ω ∈ F0 ⊂ m(F0) by definition of a field. C (2) Suppose G = {A | A ∈ m(F0)}. Since the definition of monotone class is symmetric with respect to complements, we see G is a monotone class. Moreover, since F0 is a field, C G ⊃ F0 so by minimality of m(F0), G ⊃ m(F0). Thus A ∈ m(F0). ∞ (3) Define G1 = {A ∈ m(F0) | A ∪ B ∈ m(F0) for all B ∈ F0}. If {Ai}i=1 ⊂ G1 with Ai ⊂ Ai+1 for all i, then for any B ∈ F0,

∞ ! ∞ [ [ Ai ∪ B = (Ai ∪ B) i=1 i=1 and each of these pieces is in G1. Hence G1 is a monotone class. Now G1 ⊃ F0 since F0 is a field, so by Lemma 2.5.1 G1 must contain σ(F0). Set

G2 = {B ∈ m(F0) | A ∪ B ∈ m(F0) for all A ∈ m(F0)}.

For the same reason as above, G2 is a monotone class, and G2 ⊃ F0 since G1 ⊃ m(F0). This shows that G2 ⊃ m(F0) so for every A, B ∈ m(F0), A ∪ B is also in m(F0). Hence we conclude that m(F0) is a field, and applying Lemma 2.5.1 shows that m(F0) is a σ-field. By definition σ(F0) is the smallest σ-field containing F0), so this proves finally that M ⊃ σ(F0).

34 2.6 Complete Extensions 2 Probability Measures

Now given Ω = (0, 1] with the Lebesgue measure λ defined on B0, Theorem 2.3.1 says that there exists an extension of λ to B = σ(B0), what we are also calling the Lebesgue measure λ. Notice that I = {(a, b] : 0 ≤ a < b ≤ 1} is a π-system, and further, σ(I) = B because σ(I) ⊃ B0. Then by Theorem 2.4.1, λ is the only measure on Ω that restricts to the length function on the collection I of intervals. This means that whenever we make a choice of measure on (0, 1] and want it to coincide with the natural notion of length of an interval, we have no choice but to choose the Lebesgue measure.

Examples. ∞ 1 Suppose {ri}i=1 is an enumeration of Q ∩ (0, 1]. Let ε > 0 and for each i ∈ N, let ε ε  S∞ Ii = ri − 2i+1 , ri + 2i+1 ∩ (0, 1]. Then Ii is open so A = i=1 Ii is open and ∞ X ε λ(A) ≤ = ε 2i i=1 by subadditivity. On the other hand, λ(A) is clearly positive since A contains nonempty intervals of nonzero length. It turns out that AC is nowhere dense, but λ(B) ≥ 1 − ε, i.e. λ(B) is almost always 1.

2 Let An = {ω ∈ Ω | di(ω) = dn+i(ω) = d2n+i(ω) for all i = 1, . . . , n}. For example, an element of A5 might look like 0.010110101101011 01000110100 ... anything 2n S∞ 1 Then P (An) = (23)n for each n, and if A = n=1 An, λ(A) ≤ 3 . As in the previ- ous example, AC turns out to be nowhere dense, but with (relatively) large measure: C 2 λ(A ) = 3 .

2.6 Complete Extensions

We briefly discuss the notion of a complete extension of a probability space in this section. Definition. A probability space (Ω, F,P ) is said to be complete if whenever A and B are subsets of Ω with B ∈ F, A ⊂ B and P (B) = 0, then A ∈ F as well. Note that A ⊂ B and A ∈ F imply by subadditivity that P (A) = 0 as well. Example 2.6.1. Recall the σ-field M = {A ⊂ (0, 1] : A is P ∗-measurable}. The extension ∗ (Ω, M,P |M) of the probability space defined on the Borel set is complete because for any A ⊂ B in M with P (B) = 0 and E ⊂ Ω, P ∗(E ∩ A) + P ∗(E ∩ AC ) ≤ P ∗(E ∩ B) + P ∗(E ∩ AC ) by monotonicity ≤ 0 + P ∗(E ∩ AC ) = P ∗(E). Hence A ∈ M.

35 2.6 Complete Extensions 2 Probability Measures

On any probability space (Ω, F,P ), it is possible to enlarge the σ-field and extend P so that the resulting probability space is complete. In particular, we have

Definition. Given the probability space (Ω, B0, λ) where λ is the Lebesgue measure, the complete extension of λ onto M is called the Lebesgue σ-field. Any set A ∈ M is called a Lebesgue set.

There is a difference in preferences between probabilists and analysts on whether to work with B or its complete extension M. Measure theorists tend to prefer M because it offers completeness, whereas probabilists often switch measures on B0 and therefore do not want to lose the properties they are working with (recall that an extension depends on the underlying measure, so complete extensions are not unique).

36 2.7 Non-Measurable Sets 2 Probability Measures

2.7 Non-Measurable Sets

In this section we construct a non-measurable set. To really show the limits of σ-fields, we will produce a set that is not measurable in the Lebesgue σ-field M, which will then be a counterexample on B0 and B. Example 2.7.1. The sets used to deconstruct and reassemble the sphere in the Banach- Tarski Paradox are examples of non-measurable sets. As these are very technical to describe, we will detail a simpler set on (0, 1] below.

Suppose x, y ∈ (0, 1] and define x ⊕ y = (x + y) mod 1, by which we mean x ⊕ y is the result of adding the two numbers together and, if necessary, subtracting 1 so the sum lands back in the interval (0, 1]. For a set A and any x ∈ Ω, denote A ⊕ x = {a ⊕ x | a ∈ A}. Let L = {A ∈ B | λ(A ⊕ x) = λ(A) for all x ∈ Ω}, i.e. the translation-invariant sets in B. It is easy to verify that L is a λ-system. Also, L contains the collection I of intervals in Ω, so by Halmos’ Class Theorem, L ⊃ σ(I) = B. Hence Lebesgue measure is translation-invariant on B and by uniqueness, this implies λ is translation-invariant on M. We will construct a set that is not a Lebesgue set, thereby proving the existence of non- measurable sets. Define a relation ∼ on Ω by x ∼ y ⇐⇒ x − y ∈ Q. By the Axiom of Choice, there exists a set H ⊂ Ω such that H contains exactly one element from each equivalence class of ∼.

Claim. H is not in M.

Proof. If H were in M, we would have λ(H ⊕ ri) = λ(H) for all ri ∈ Q. Notice that by definition of ⊕, if i 6= j then (H ⊕ ri) ∩ (H ⊕ rj) = ∅, so if H ∈ M, H ⊕ ri must also be in S∞ M for all ri. Further, the H ⊕ ri partition Ω, so i=1(H ⊕ ri) = Ω and this implies

∞ X λ(H ⊕ ri) = λ(Ω) = 1 by countable additivity i=1 ∞ X =⇒ λ(H) = 1. i=1 But adding up the same value λ(H) infinitely many times either equals 0 (if λ(H) = 0) or ∞ (if λ(H) > 0), so this cannot possibly equal 1. Hence H is not an element of M.

37 3 Denumerable Probabilities

3 Denumerable Probabilities

3.1 Limit Inferior, Limit Superior and Convergence

Assume (Ω, F,P ) is a generic probability space for a σ-field F. We will assume in this chapter that all sets are F-sets. Our goal is to explore some more complicated concepts in probability with infinite sequences of subsets (events) in a probability space Ω.

∞ Proposition 3.1.1. Let {An}n=1 be a countable collection of sets in Ω. S∞ (1) If P (An) = 0 for all n then P ( n=1 An) = 0. T∞ (2) If P (Bn) = 1 for all n then P ( n=1 Bn) = 1. Proof. (1) is proven by subadditivity, and (1) =⇒ (2) by DeMorgan’s Law.

∞ Definition. For a sequence {An}n=1 of measurable sets in Ω, the limit superior and limit inferior of the An are

∞ ∞ ∞ ∞ \ [ [ \ lim sup An = Am and lim inf An = Am. n→∞ n→∞ n=1 m=n n=1 m=n

If there exists A ∈ F such that lim sup An = lim inf An = A, we say the An converge to A.

Remark. Notice that x ∈ lim sup An if for every n ∈ N there is some m ≥ n such that x ∈ Am, or equivalently, if x lies in some subsequence of {An}. It is sometimes said that such an x lies in the collection {An} infinitely often (i.o.). On the other hand, x ∈ lim inf An if there exists an n ∈ N such that for every m ≥ n, x ∈ Am, or alternatively if x lies in every Am beyond some cutoff An. It is also said that x lies in all but finitely many of the An. For fixed n,

∞ ∞ ∞ ∞ ∞ [ \ \ [ \ Am ⊃ Am =⇒ Am ⊃ Am m=n m=n n=1 m=n m=n ∞ ∞ ∞ ∞ \ [ [ \ =⇒ Am ⊃ Am. n=1 m=n n=1 m=n

Hence lim sup An ⊃ lim inf An, just as is the case with the numerical lim sup and lim inf.

Example 3.1.2. For ω ∈ Ω, define the value `n(ω) to be the length of the run of heads in a sequence of coin tosses starting at the nth flip. Explicitly,

`n(ω) = {k | dn+k(ω) = 1 and dn+i(ω) = 0 for i = 0, . . . , k − 1}.

38 3.1 Limit Inferior, Limit Superior and Convergence 3 Denumerable Probabilities

Notice that for any nonzero values of k and r,

k+1 Y 1 1 P [` (ω) = k] = = n 2 2k+1 i=1 ∞ X 1 1 and P [` (ω) ≥ r] = = . n 2k+1 2r k=r

Set An = {ω | `n(ω) ≥ r}. Then for any ω ∈ Ω,

∞ ∞ \ [ ω ∈ An infinitely often ⇐⇒ ω ∈ lim sup An = Am n→∞ n=1 m=n

⇐⇒ for every n there is some m ≥ n such that ω ∈ Am

⇐⇒ there is an infinite subsequence {Am} containing ω.

∞ Theorem 3.1.3. For any collection of sets {An}n=1,

(1) P (lim inf An) ≤ lim inf P (An) ≤ lim sup P (An) ≤ P (lim sup An).

(2) If lim An exists, the above inequalities are equalities.

Proof. Note that (2) follows from (1) since if a limit exists, it equals lim inf An and lim sup An. The middle inequality in (1) is obvious. Further, the third inequality is obtained by taking complements in the first inequality, so it suffices to prove the latter. Note that if Bn = T∞ m=n Am so that ∞ ∞ ! ∞ [ \ [ lim inf An = Am = Bn, n=1 m=n n=1 then Bn ⊂ An for all n. So by monotonicity, P (Bn) ≤ P (An). Moreover, the Bn are ascend- ing and their limit is lim inf An. By continuity from below (Theorem 2.1.5), lim P (Bn) = P (lim inf An) so together this gives us P (lim inf An) = lim P (Bn) ≤ lim inf P (An). Remark. Theorem 3.1.3 only holds in the finite measure case, that is, when P (ω) < ∞. When Ω is a space of infinite measure, we cannot take complements and subtract the prob- ability from 1, so the arguments above – including continuity from below – do not apply.

In Example 3.1.2, we can apply Theorem 3.1.3 to see that 1 P [ω | ` (ω) ≥ r i.o.] = P (lim sup A ) ≥ lim sup P (A ) = . n n n 2r

39 3.2 Independence 3 Denumerable Probabilities

3.2 Independence

Definition. Suppose P (A) > 0. Then the conditional probability of B given A is P (B ∩ A) P (B | A) = . P (A) This definition can be rearranged as P (A ∩ B) = P (A)P (B | A) and inducted on to produce the formula

P (A ∩ B ∩ C) = P (A)P (B | A)P (C | A ∩ B), and so on.

∞ Proposition 3.2.1. If {An}n=1 partition Ω then for any B ∈ Ω, ∞ X P (B) = P (B | An). n=1 S∞ Proof sketch. Since B may be expressed as a disjoint, countable union B = n=1(B ∩ An), use countable additivity to produce the result. Definition. Two events A, B ∈ Ω are said to be independent if P (A ∩ B) = P (A)P (B). If P (A),P (B) > 0, the definition of independence is equivalent to

P (B) = P (B | A) and P (A) = P (A | B).

n Definition. A (finite) collection {Ai}i=1 is an independent collection if for any subcol- m lection {Akj }j=1, m m ! Y \ P (Akj ) = P Akj . j=1 j=1 Example 3.2.2. Note that simple pairwise independence is not the same as independence on the whole collection. For instance, define Buv = {ω | du(ω) = dv(ω)}. Then B12,B13 and B23 are pairwise independence events, but the collection of all three is not independent. Remarks. 1 For a collection of n sets, there are 2n − n − 1 possible conditions on those sets.

2 Any subcollection of a collection of independent events is also independent.

3 Independence is invariant under reordering of the sets in a collection.

4 We say an arbitrary collection of sets {Aθ}θ∈Θ is independent if and only if every finite subcollection is independent.

Definition. Given two collections A1 and A2, we say these are independent collections, or A1 is independent of A2, if for every A1 ∈ A1 and A2 ∈ A2, P (A1 ∩ A2 = P (A1)P (A2). n Similarly, a collection of collections {Ak}k=1 is independent if A1,...,An are independent whenever Ak ∈ Ak. This can be extended to arbitrary collections in the same way as before.

40 3.2 Independence 3 Denumerable Probabilities

Example 3.2.3. Let Hn = {ω | dn(ω) = 0}, which represents the event of getting heads ∞ on the nth coin flip (if we identify heads with 0 and tails with 1). Then {Hn}n=1 is an independent collection of events. Furthermore, if we define A1 = {H2k+1 | k ∈ N} the collection of heads on odd flips and A2 = {H2k | k ∈ N} the collection of heads on even flips, then A1 and A2 are independent.

n Theorem 3.2.4. Suppose {Ak}k=1 is an independent collection of collections, where Ak is ∞ a π-system for k = 1, . . . , n. Then {σ(Ak)}k=1 is independent.

Proof. For each k define Bk = Ak ∪ {Ω} which are all still π-systems. The fact that the Ak are independent is equivalent to

n n ! Y \ P (Bk) = P Bk for all Bk ∈ Bk. k=1 k=1

Fix B2 ∈ B2,...,Bn ∈ Bn and define

( n n !!) Y \ L = B ∈ F : P (B) P (Bk) = P B ∩ Bk . k=2 k=2

Clearly L contains B1. We claim that L is a λ-system.

(1)Ω ∈ B1 =⇒ Ω ∈ L. (2) For any B ∈ L,

n n C Y Y P (B ) P (Bk) = (1 − P (B)) P (Bk) k=2 k=2 n n Y Y = P (Ω) P (Bk) − P (B) P (Bk) k=2 k=2 n !! n !! \ \ = P Ω ∩ Bk − P B ∩ Bk k=2 k=2 since Ω,B ∈ L n !! \ = P (Ω r B) ∩ Bk by additivity k=2 n !! C \ = P B ∩ Bk . k=2

Hence BC ∈ L.

(3) follows similarly, again breaking A ∪ B into sets in L and using additivity.

41 3.2 Independence 3 Denumerable Probabilities

Hence L is a λ-system, so by the π-λ theorem, L ⊃ σ(B1). This shows that σ(B1) is independent of B2,..., Bn. Now, since σ(B1) is a π-system, a similar argument can be made by fixing B1 ∈ σ(B1),B3 ∈ B3,...,Bn ∈ Bn to show that {σ(B1), σ(B2), B3,..., Bn} are independent. Repeating shows that {σ(Bk)} are independent, but notice that σ(Ak) = σ(Bk) for all k, so we’re done.

Corollary 3.2.5. If {Aθ}θ∈Θ is an independent, arbitrary collection of π-systems then {σ(Aθ)}θ∈Θ is also independent. Corollary 3.2.6. Given a matrix of events

A11 A12 A13 ··· A21 A22 A23 ··· ...... such that the collection {Aij}(i,j)∈N2 is independent, define Fi to be the σ-field of the events in the ith row. Then the collection of Fi is an independent collection.

Proof. Let Ai be the collection of finite intersections of events in the ith row. This is a π- system by construction. Clearly σ(Ai) ⊂ Fi, and since a single set is a (trivial) intersection, Fi ⊂ σ(Ai) as well, which shows σ(Ai) = Fi for each i. Now we verify that the Ai are independent. Given B1 ∈ Ak1,...,Bj ∈ Akj, each one is a finite intersection of the Aij so we have j j m j m ! j ! Y Y Yi \ \i \ P (Bi) = P (Air) = P Air = P Bi . i=1 i=1 r=1 i=1 r=1 i=1

Therefore Theorem 3.2.4 applies to give us independence of the Fi. Examples. 1 For the even/odd coin flips, we have a matrix

H1 H3 H5 ··· H2 H4 H6 ··· which is independent by Example 3.2.3. By Corollary 3.2.6, the σ-fields of the two rows are independent, but these just correspond to the even and odd flips. What this says is that any event constructed with the odd-numbered coin flips is independent of any other event constructed with the even flips.

2 Recall Buv = {ω | du(ω) = dv(ω)}. The collections A1 = {B12,B13} and A2 = {B23} are independent, but we know their union is not, so σ(A1) and σ(A2) must not be independent. This suggests that some condition of Theorem 3.2.4 fails to be met – in fact, A1 is not a π-system.

3 Suppose A = {Ai} is a partition of Ω and for some set B ∈ F, P (B | Ai) = p for each i such that P (Ai) > 0. Then P (B) = p by Proposition 3.2.1 and B = {B} is independent of σ(A).

42 3.3 Subfields 3 Denumerable Probabilities

3.3 Subfields

It is common in probability theory to only have partial information about an event. This can mean several things, such as knowing some class of sets the event lies in or knowing conditions on the probability of the event based on other events. For a probability space (Ω, F,P ), such partial information information can be represented in terms of a subfield A ⊂ F where A is a σ-field (or sometimes just a field). If B ∈ F is independent of A, then P (B) = P (B ∩ A)P (A) for every A ∈ A. This shows that knowing information about P (A) does not necessarily tell you anything about P (B). 0 Define an equivalence relation ∼A on Ω by ω ∼A ω if and only if for every A ∈ A, 0 χA(ω) = χA(ω ), where χ denotes the characteristic function on a set: ( 1 if ω ∈ A χA(ω) = 0 if ω 6∈ A.

As an example of so-called ‘partial information’ in probability theory, even if we know ev- erything about the characteristic functions on a collection A, we still cannot distinguish between ω and ω0 that satisfy ω ∼ ω0.

Proposition 3.3.1. For a subfield A ⊂ F, ∼A and ∼σ(A) are the same equivalence relation.

0 Proof. It is clear that ∼σ(A) is a finer partition of Ω than ∼A, since A ⊂ σ(A). Fix ω, ω ∈ Ω. 0 0 Then Aω,ω0 = {A ∈ F : χA(ω) = χA(ω )} is a σ-field. Notice that if ω ∼A ω then Aω,ω0 ⊃ A 0 so by the π-λ theorem, Aω,ω0 ⊃ σ(A). Hence ω ∼σ(A) ω . Examples.

0 1 Consider A = {H2n}n∈N where H2n are as defined in Example 3.2.3. Then ω ∼A ω if and only if ω and ω0 have the same results on even coin flips, but this tells us nothing about the odd flips of either event.

2 Let A be the σ-field generated by countable and cocountable subsets of Ω, which is a subfield of the Borel σ-field B on Ω = (0, 1]. Then B is independent of A so we get practically no information about events in A; in fact, every countable set has measure 0 and every cocountable set has measure 1, but this does nothing to distinguish events. 0 0 On the other hand, A is generated by the singletons, so ω ∼A ω ⇐⇒ ω = ω . In this sense, we have all the information on events in A, paradoxically. The difference between these two situations, where we either have no information or complete information is that there is a measure present on B but not on A. The mathematics is actually contained in the statement of independent, and while it may be useful to think of σ- fields and subfields in terms of ‘information’, it is clear that this analogy breaks down in situations such as this.

43 3.4 The Borel-Cantelli Lemmas 3 Denumerable Probabilities

3.4 The Borel-Cantelli Lemmas

P∞ Lemma 3.4.1. If n=1 P (An) converges then P (lim sup An) = 0. S∞ Proof. Fix n. Then lim sup An ⊂ k=n Ak so by subadditivity we have

∞ X P (lim sup An) ≤ P (Ak). k=n But the tail of a convergent series tends to 0 as n gets big, so the right side of this inequality goes to 0. Hence P (lim sup An) = 0. The first Borel-Cantelli Lemma would have been useful at the end of the proof of Borel’s Normal Number Theorem (1.2.5); however, at the time we did not have the constructions required to describe this lemma. The second Borel-Cantelli Lemma is

P∞ ∞ Lemma 3.4.2. If n=1 P (An) diverges and the events {An}n=1 are independent, then P (lim sup An) = 1.

∞ ∞ ! \ [ Proof. We want to show that P (lim sup An) = P Ak = 1. Note that this is the n=1 k=n same as showing ∞ ∞ !C ∞ ∞ ! \ [ [ \ C P Ak = P Ak = 0, n=1 k=n n=1 k=n T∞ C  which will be true if P k=n Ak = 0 for all n ≥ 1. By independence, we have

∞ ! ∞ ∞ \ C Y C Y P Ak = P (Ak ) = (1 − P (Ak)) k=n k=n k=n ∞ Y ≤ e−P (Ak) since 1 − x ≤ e−x on (0, 1] k=n − P∞ P (A ) = e k=n k = 0

P∞ since k=n P (Ak) diverges. Examples.

1 Recall the funtion `n(ω) which denotes the length of the run of 0’s (heads) starting at the nth term in the sequence expression of ω. Let (rn) be a sequence of real numbers ∞ X 1 such that converges. Then P [ω : `n(ω) ≥ rn i.o.] = 0. To see this, note that 2rn n=1

P [ω : `n(ω) ≥ rn i.o.] = P (lim sup{ω | `n(ω) ≥ rn})

1 and we have computed P [ω : `n(ω) ≥ rn] = 2rn . Applying the first Borel-Cantelli Lemma gives the result.

44 3.4 The Borel-Cantelli Lemmas 3 Denumerable Probabilities

rn 1+ε P∞ −rn For ε > 0, set rn = (1 + ε) log2 n so that 2 = n . Then n=1 2 barely converges, but the first Borell-Cantelli Lemma says that

P [ω : `n(ω) ≥ (1 + ε) log2 n i.o.] = 0   ` (ω)   =⇒ P ω : lim sup n > 1 = 0. log2 n

2 Notice that the collection of {ω : `n(ω) = 0} = {ω : dn(ω) = 1} for all n are in- 1 dependent events, each with probability 2 . By the second Borel-Cantelli Lemma, P [ω : `n(ω) = 0 i.o.] = 1. On the other hand, define

An = {ω : `n(ω) = 1} = {ω : dn(ω) = 0, dn+1(ω) = 1}.

1 P∞ Then P (An) = 4 so n=1 P (An) diverges, but we cannot directly apply BC2 since ∞ the An are not independent. However, the collection {A2n}n=1 is independent with 1 P (A2n) = 4 for all n, so BC2 tells us P [ω | ω ∈ A2n i.o.] = 1. Further, we see that {ω | ω ∈ A2n i.o.} ⊆ {ω : `n(ω) = 1 i.o.} so by subadditivity, P [ω : `n(ω) = 1 i.o.] = 1 as well. In the same way, one can prove that P [ω : `n(ω) = k i.o.] = 1 for any k ∈ N.

P∞ 1 3 Suppose (r ) is a nondecreasing sequence such that r diveges. We claim that n n=1 rn2 n P∞ 1 P [ω : ` (ω) ≥ r i.o.] = 1. First, it is known that r diverges if and only if n n n=1 rn2 n P∞ 1 s diverges, where s = dr e. Thus without loss of generality we may assume n=1 sn2 n n n the rn are integers.

Define n1 = 1 and nk+1 = nk + rnk for each k ≥ 2. The events Ak = {ω : `nk (ω) ≥ rnk } P∞ for all k ∈ N are independent events. By BC2, P [ω | ω ∈ Ak i.o.] = 1 if k=1 P (Ak) diverges, but this series can be written

∞ ∞ X 1 X 1 r = 2 nk 2nk+1−nk k=1 k=1 ∞ X nk+1 − nk = rn rn 2 k k=1 k ∞ nk+1 X X 1 = rnk rnk 2 k=1 n=nk+1 ∞ nk+1 X X 1 ≥ r since the rn are nondecreasing rn2 n k=1 n=nk+1 ∞ X 1 = r 2n n=1 n

which diverges. Hence P [ω | ω ∈ Ak i.o.] = 1, and since {ω : `n(ω) ≥ rn i.o.} ⊇ lim sup Ak, we have shown that P [ω : `n(ω) ≥ rn i.o.] = 1.

45 3.4 The Borel-Cantelli Lemmas 3 Denumerable Probabilities

P∞ 1 P∞ 1 4 Let rn = log n as before. Then rn = diverges, so BC2 tells us 2 n=1 rn2 n=1 n log2 n that P [ω : `n(ω) ≥ rn i.o.] = 1, which implies

  ` (ω)   P ω : lim sup n ≥ 1 = 1 log2 n but before we showed that   ` (ω)   P ω : lim sup n > 1 = 0 log2 n (notice the strict inequality). By additivity, this implies that

  ` (ω)   P ω : lim sup n = 1 = 1. log2 n

∞ Definition. Suppose {An}n=1 ⊂ F and set

∞ \ T = σ(Ak,Ak+1,Ak+2,...). k=1

T is called the tail σ-field of the An. The most important theorem related to the tail σ-field is stated below. This is sometimes known as Kolmogorov’s 0-1 Law.

∞ Theorem 3.4.3 (Kolmogorov). Suppose {An}n=1 are independent and A ∈ T . Then either P (A) = 0 or P (A) = 1.

Proof. Clearly σ(A1), σ(A2), . . . , σ(An), σ(An+1,An+2,...) are independent for any fixed n since the Ak are independent (Theorem 3.2.4). So if A ∈ T then A ∈ σ(An+1,An+2,...). Thus A is independent of σ(A1), . . . , σ(An) for any n, which shows that A, A1,A2,... is an independent sequence of events. By Theorem 3.2.4, σ(A) and σ(A1,A2,...) are independent, but A ∈ T ⊆ σ(A1,A2,...) so we have shown that A is independent of itself. Finally,

A is independent of itself ⇐⇒ P (A) = P (A ∩ A) = P (A)P (A) ⇐⇒ P (A) = 0 or 1.

46 4 Simple Random Variables

4 Simple Random Variables

Definition. Given a probability space (Ω, F,P ), we say X :Ω → R is a simple random variable if

(1) X(Ω) is finite. This is the simple condition.

(2) For any x ∈ R, the set {ω : X(ω) = x} is an F-set. This is sometimes called the measurable condition.

The measurability condition (2) will allow us to integrate simple random variables over P -measurable sets. Of course, the sets {ω : X(ω) = x} need not be intervals. For example, consider the rational indicator function on Ω = [0, 1]: ( 1 if x ∈ X(ω) = Q 0 if x 6∈ Q.

R Q

[ ] I Ω

Example 4.0.1. Recall the functions dn, rn and sn introduced in Chapter 1 for the space of infinite sequences of coin flips (equivalently, dyadic representations of ω ∈ (0, 1]). Each of these is a simple random variable. On the other hand, the length function `n is a random variable because it’s measurable, but `n is not simple since its range is infinite. Proposition 4.0.2. X is a simple random variable ⇐⇒ there exists a finite collection of r Ai ∈ F such that {Ai}i=1 is a finite partition of Ω and there exist xi ∈ R such that for any ω ∈ Ω, X(ω) may be expressed in the form

r X X(ω) = xiχAi (ω). i=1 Proof sketch. The backward direction is easy. For the forward direction, suppose X is a −1 simple random variable. Let {x1, . . . , xr} = X(Ω). For i = 1, . . . , r, let Ai = X (xi). Then the Ai partition Ω.

Notice that the partition {Ai} need not be unique. Even more importantly, the xi may not be unique – they may even repeat so that in some cases Ai and Aj, i 6= j, have the same value under X. This is useful if we are comparing simple random variables X and Y and want to use the same partition of Ω for each.

47 4 Simple Random Variables

Definition. For a subfield G ⊂ F, we say a simple random variable X is G-measurable if {ω : X(ω) = x} ∈ G for every x ∈ R. Proposition 4.0.3. If X is G-measurable and H ⊂ R then {ω : X(ω) ∈ H} lies in G. [ Proof. Notice that {ω : X(ω) = H} = {ω : X(ω) = x} is a finite union. x∈H Proposition 4.0.3 can also be stated: if X is G-measurable then X−1(H) is measurable for every H ⊂ R. Define σ(X) to be the smallest σ-field G (equivalently, the intersection of all G) for which X is G-measurable. The next theorem characterizes σ(X) for collections of simple random variables.

Theorem 4.0.4. Suppose X1,...,Xn and Y are simple random variables on a probability n space (Ω, F,P ). Write X = (X1,...,Xn) so that for any ω ∈ Ω, X(ω) ∈ R . n −1 n (1) σ(X) = {{(X1(ω),...,Xn(ω)) ∈ H} | ω ∈ Ω,H ⊂ R } = {X (H) | H ⊂ R }. Moreover, the H may be taken as finite subsets of Rn. (2) Y is σ(X)-measurable if and only if there exists a function f : Rn → R such that Y = f(X) = f(X1,...,Xn). −1 Proof. (1) Let M = {X (H) | H ⊂ Rn}. We will show that M = σ(X). Consider a set −1 X (H) ∈ M. Clearly

r r n ! −1 [ −1 [ \ −1 X (H) = X (~x) = Xj (xj) , i=1 i=1 j=1

Tn −1 −1 where ~x = (x1, . . . , xn), and each j=1 Xj (xj) lies in σ(X), so X (H) ∈ σ(X). This shows M ⊂ σ(X). Additionally, M is a σ-field since (i)Ω= X−1(Rn) ∈ M. −1 −1 (ii) If A ∈ M then A = X (H) =⇒ AC = X (HC ) ∈ M.

∞ ∞ ∞ ! [ [ −1 −1 [ (iii) For Ai ∈ M, Ai = X (Hi) = X Hi which lies in M. i=1 i=1 i=1

Finally, for fixed i, {ω : Xi(ω) = xi} ∈ M but this is precisely the set

{ω : X(ω) ∈ R × R × · · · × {xi} × · · · × R} which lies in M. We have shown that M is a σ-field contained in σ(X) on which the Xi are measurable, so it follows that M = σ(X). (2) On one hand, suppose Y = f(X). Then {ω : Y (ω) = y} = {ω : X(ω) ∈ f −1(y)} ∈ M which lies in σ(X), so Y is σ(X)-measurable. On the other hand, if Y is σ(X)-measurable n then let Y (Ω) = {y1, . . . , yr}. By (1), there exist subsets H1,...,Hr ⊂ R such that for all 1 ≤ i ≤ r, {ω : Y (ω) = yi} = {ω : X(ω) ∈ Hi}. By construction the Hi are disjoint, and we r X can define f(X) = yiχHi (xi). This completes the proof. i=1

48 4.1 Convergence in Measure 4 Simple Random Variables

As a result of this theorem, functions of simple random variables are simple random variables, so in probability theory we can take simple random variables X and Y and form new simple random variables: X2, etX ,X + Y, log X, etc.

Examples.

1 Consider d1, . . . , dn for sequences ω of coin flips. Then dn 6∈ σ(d1, . . . , dn−1) which implies that dn is independent of the other di. These di are defined in this way so as to make different coin flips independent.

2 The Rademacher functions r1, . . . , rn and the cumulative Rademacher functions s1, . . . , sn generate the same σ-field:

σ(r1, . . . , rn) = σ(s1, . . . , sn). Pk This is because for each k, sk = i=1 ri and rk = sk − sk−1 so one can apply (2) of Theorem 4.0.4.

3 The characteristic function χA is G-measurable if and only if A ∈ G.

4.1 Convergence in Measure

∞ Suppose we have a sequence {Xn}n=1 of simple random variables and X a simple random variable to which the Xn might converge. We are interested in pointwise convergence .

Definition. The sequence (Xn) is said to converge to X pointwise almost everywhere h i if P ω : lim Xn(ω) = X(ω) = 1, that is, lim Xn ≡ X on all of Ω except possibly on a set n→∞ of measure zero.

In analytic terms, lim Xn(ω) = X(ω) ⇐⇒ for every ε > 0, there exists an N ∈ N such that for all n > N, |Xn(ω) − X(ω)| < ε. Thus for a given ω ∈ Ω, Xn does not converge to X(ω) if and only if there is some ε > 0 such that |Xn(ω) − X(ω)| ≥ ε infinitely often. 1 Because we want to utilize countable unions, we replace ε with m (these are equivalent by the Archimedean Principle). Then the set {ω : lim Xn(ω) = X(ω)} has complement S∞  1 m=1 ω : |Xn(ω) − X(ω)| ≥ m i.o. and its probability may be expressed as

∞ ! C h i [  1 P ω : lim Xn(ω) = X(ω) = P ω : |Xn(ω) − X(ω)| ≥ i.o. n→∞ m m=1 ∞ ! [  1 = P lim sup ω : |Xn(ω) − X(ω)| ≥ m m=1 ∞ X = 0 = 0. m=1 This suggests a different notion of convergence.

49 4.1 Convergence in Measure 4 Simple Random Variables

Definition. We say a sequence of simple random variables (Xn) converges in measure to X, denoted (Xn) →p X, if for every ε > 0,

lim P [ω : |Xn(ω) − X(ω)| ≥ ε] = 0. n→∞ We can see from the work above that pointwise convergence (a.e.) implies convergence in measure, so the latter is a weaker form of convergence. In several sections we will see that the notions of pointwise convergence (a.e.) and convergence in measure can be translated into generalized laws of large numbers, with the weaker convergence corresponding to the weak law.

Examples.

1 1 Define An = {ω : `n(ω) ≥ log2 n}. We proved that P (An) = n so this tends to 0 as n → ∞, showing (An) converges in measure to ∅. However, we proved in Example 4 of Section 3.4 that P [ω : ω ∈ An i.o.] = 1 using the Borel-Cantelli lemmas. So (An) clearly does not converge to ∅ pointwise. 2 The following sequence of simple random variables is sometimes called ‘the typewriter’ in measure theory.

f1 f2 1

f3 f4 f5 f6

1

Define the sequence beginning with

f1 = χ 1 f2 = χ 1 [0, 2 ] [ 2 ,1]

f3 = χ 1 f4 = χ 1 1 f5 = χ 1 3 f6 = χ 3 [0, 4 ] [ 4 , 2 ] [ 2 , 4 ] [ 4 ,1] etc.

Then (fn) →P 0, but for any x ∈ [0, 1], fn(x) diverges so we see that convergence in measure does not imply pointwise convergence.

50 4.2 Independent Variables 4 Simple Random Variables

4.2 Independent Variables

Definition. A sequence {X1,X2,...} of simple random variables is said to be independent n if σ(X1), σ(X2),... are independent, that is, if whenever H1,H2,... ⊂ R ,

P [ω : X1(ω) ∈ H1,...,Xn(ω) ∈ Hn] = P [ω : X1(ω) ∈ H1] · ... · P [ω : Xn(ω) ∈ Hn].

Example 4.2.1. Let Ω = Sn, the set of permutations of a set of n elements. Assign the 1 discrete probability P (ω) = n! for all ω ∈ Sn. Define a simple random variable ( 1 if position k is the last position in a cycle in ω Xk(ω) = 0 otherwise.

For example, if ω = (1 4 3)(2 5) then X3 = X5 = 1 and X1 = X2 = X4 = 0. For any n X ω ∈ Sn, define S(ω) = Xk(ω). Then S represents the number of disjoint cycles in a cycle k=1 decomposition of ω.

1 Claim. The Xk are independent with P [ω : Xk(ω) = 1] = n−k+1 .

This is easy to justify heuristically, but the details can be tricky. The idea is that X1 = 1 1 if and only if ω is a permutation fixing 1; the probability of this happening is n . If ω(1) = 1 then ω(2) is one of the remaining values 2, . . . , n and thus X2(ω) = 1 if and only if ω(2) = 2; 1 this happens with probability n−1 . On the other hand, if X1(ω) = 0, then ω(1) = i 6= 1, so that ω(i) is one of the values 1, . . . , i − 1, i + 1, . . . , n. Then X2(ω) = 1 if and only if 1 ω(i) = 1 which happens with probability n−1 . This argument can be continued to show that 1 P [ω : Xk(ω) = 1] = n−k+1 , showing the Xk are indeed independent. Definition. Let X be a simple random variable, the distribution of X is the probability measure µ defined for all subsets A ⊂ R by µ(A) = P [ω : X(ω) ∈ A].

The distribution µ is a discrete probability measure. If {x1, . . . , xn} are the distinct values X in the range of X then pi := µ({xi}) = P [ω : X(ω) = xi] so for any A ⊂ R, µ(A) = pi. In xi∈A particular, this shows that µ(R) = 1, but even better, if B is the range of X, then µ(B) = 1 so µ has finite support.

Theorem 4.2.2. If {µn} is a sequence of probability measures on the class of all subsets of R such that each µn has finite support, then there exists an independent sequence {Xn} of simple random variables on some probability space (Ω, F,P ) such that Xn has distribution µn. Proof. Let Ω = (0, 1], let F = B the Borel σ-field and let P = λ, the Lebesgue measure on F. We first consider the case where each µn has range {0, 1}. Set

pn = µn({0}) and qn = 1 − pn = µn({1}).

51 4.3 Expected Value and Variance 4 Simple Random Variables

Divide Ω = (0, 1] into two intervals I0 and I1, where |I0| = p1 and |I1| = q1. Define X1 by ( 0 ω ∈ I0 X1(ω) = 1 ω ∈ I1.

Since P is the Lebesgue measure, we have P [ω : X1(ω) = 0] = p1 and P [ω : X1(ω) = 1] = q1, so X1 has been constructed so that its distribution is µ1. Next, split I0 into intervals I00 and I01 of lengths p1p2 and p1q2; likewise split I1 into intervals I10 and I11 of lengths q1p2 and q1q2. Define the second simple random variable X2 by ( 0 ω ∈ I00 ∪ I10 X2(ω) = 1 ω ∈ I01 ∪ I11.

By construction, P [ω : X1 = X2 = 0] = p1p2 and similarly for the other three choices, so X1 and X2 are independent and X2 has distribution µ2. Repeat this process to define a sequence {Xn} of independent s.r.v.’s such that each Xn has distribution µn. In the general case, µ1 has finite support so instead of dividing Ω into two intervals, we divide the space into the number of intervals corresponding to the size of the range of µ1. The above proof is easily adapted to this setup. Example 4.2.3. A special case of the above construction is called a sequence of Bernoulli trials. Explicitly, Bernoulli trials are a sequence {Xn} of independent random variables satisfying P [ω : Xn(ω) = 1] = p and P [ω : Xn(ω) = 0] = 1 − p for all n. The dyadic interval 1 construction introduced in Chapter 1 are the intervals used to construct {Xn} for p = 2 .

4.3 Expected Value and Variance

Pn Definition. Consider a simple random variable X = i=1 xiχAi for xi ∈ R and Ai ⊂ R. The expected value or mean value of X is

n X E[X] := xiP (Ai). i=1

n Let {xi}i=1 be the range of X. Then the expected value can be written

n X E[X] = xiP [ω : X(ω) = xi]. i=1 This formulation shows that E[X] only depends on the distribution of X, so if for two simple random variables X,Y , P [ω : X(ω) = Y (ω)] = 1 then E[X] = E[Y ].

Examples.

1 −1 1 1 Suppose X = 4 χ(0,1/2] + 4 χ(1/2,1]. Another distribution for X is X = 4 χ(0,1/4] + 1 1 2 χ(1/4,1/2] − 4 χ(1/4,1].

52 4.3 Expected Value and Variance 4 Simple Random Variables

2 For A ⊆ R, recall that χA is a simple random variable. Then E[χA] = P (A).

3 Let f : R → R be a function. Recall from Theorem 4.0.4 that if X is a simple random variable, so is f(X). Then n X X E[f(X)] = f(xi)P (Ai) = f(x)P [ω : X(ω) = x] i=1 x where the last sum is over all x in the range of f(X). Definition. The kth moment of a simple random variable X is X E[Xk] = yP [ω : Xk(ω) = y] y where the sum is over all y in the range of Xk. As above, an alternate expression for the kth moment of X is n k X k E[X ] = xiP [ω : X (ω) = xi] i=1 n where {xi}i=1 is the range of X. Pn Proposition 4.3.1. Let X and Y be two simple random variables given by X = i=1 xiχAi Pm and Y = j=1 yjχBj . (a) (Linearity) For α, β ∈ R, αX + βY is a simple random variable with expected value E[αX + βY ] = αE[X] + βE[Y ].

(b) If X(ω) ≤ Y (ω) for all ω ∈ S where S is a support of P , then E[X] ≤ E[Y ]. (c) |E[X − Y ]| ≤ E[|X − Y |]. (d) If X and Y are independent then E[XY ] = E[X]E[Y ].

Proof. (a) We create a mutual refinement of the Ai and Bj by Cij = Ai∩Bj. Then αX+βY = P i,j(αxi + βyj)χCij . Therefore expected value is given by X X E[αX + βY ] = (αxi + βyj)P (Cij) = (αxi + βyj)P (Ai ∩ Bj) i,j i,j n m X X = α xiP (Ai) + β yjP (Bj) = αE[X] + βE[Y ]. i=1 j=1

(b) If X(ω) ≤ Y (ω) for all ω ∈ S, then xi ≤ yj whenever Ai ∩ Bj is nonempty and thus X X E[X] = xiP (Ai ∩ Bj) ≤ yiP (Ai ∩ Bj) = E[Y ]. i,j i,j (c) Using (b), we have E[−|X|] ≤ E[X] ≤ E[|X|] so |E[X]| ≤ E[|X|]. Moreover, linearity from part (a) gives us |E[X − Y ]| ≤ E[|X − Y |]. P (d) Note that XY = i,j xiyjχCij , so if Ai = [ω : X(ω) = xi] and Bj = [ω : Y (ω) = yj] then P (Ai∩Bj) = P (Ai)P (Bj) by definition of independence. Thus E[XY ] = E[X]E[Y ].

53 4.3 Expected Value and Variance 4 Simple Random Variables

Theorem 4.3.2 (Bounded Convergence). Suppose {Xn} is a sequence of simple random variables that is uniformly bounded, i.e. there is some K > 0 such that |Xn(ω)| ≤ K for all ω and n. If (Xn) converges pointwise a.e. to X, then lim E[Xn] = E[X].

Proof. We know pointwise convergence a.e. implies convergence in measure, so (Xn) →P X. Choose K large enough so that it bounds |X(ω)| as well as |Xn(ω)|, which is possible since X takes on finitely many values. Then for any n, |Xn −X| ≤ 2K. If A = [ω : |Xn(ω)−X(ω)| ≥ ε] then for all ω,

|Xn(ω) − X(ω)| ≤ 2KχA(ω) + εχAC (ω) ≤ 2KχA(ω) + ε.

Then by properties of expected value, E[|Xn −X|] ≤ 2KP [ω : |Xn(ω)−X(ω)| ≥ ε]+ε. Since (Xn) converges in measure to X, P [ω : |Xn(ω)−X(ω)| ≥ ε] −→ 0. Therefore E[|Xn−X|] < ε for any arbitrary ε, which shows lim E[|Xn − X|] = 0. Linearity implies the result. Definition. For a simple random variable X, the variance of X is

Var[X] = E[(X − E[X])2] = E[X2] − E[X]2.

Proposition 4.3.3. (a) If X is a simple random variable and α, β ∈ R then αX + β is a simple random variable with variance Var[αX + β] = α2 Var[X].

(b) If X1,...,Xn are independent, simple random variables then

" n # n X X Var Xi = Var[Xi]. i=1 i=1 Proof. (a) We have already seen that functions of simple random variables are simple random variables, so by properties of expected value,

Var[αX + β] = E[((αX + β) − (αE[X] + β))2] = E[α2(X − E[X])2] = α2 Var[X].

Pn Pn (b) For each i, let mi = E[Xi]. Then by linearity, E [ i=1 Xi] = i=1 mi =: m and

 n !2  n !2 X X E  Xi − m  = E  (Xi − mi)  i=1 i=1 n X 2 X = E[(Xi − mi) ] + 2 E[(Xi − mi)(Xj − mj)]. i=1 1≤i

If the Xi are independent, each E[(Xi−mi)(Xj −mj)] splits into (E[Xi]−mi)(E[Xj]−mj) = 0 so the second sum vanishes. This implies the result.

Definition. A function ϕ : I → R is convex if for all 0 ≤ p ≤ 1 and x, y ∈ I, ϕ(px + (1 − p)y) ≤ pϕ(x) + (1 − p)ϕ(y).

Notice that a sufficient condition for convexity is that ϕ00(x) ≥ 0 for all x ∈ I.

54 4.4 Abstract Laws of Large Numbers 4 Simple Random Variables

Theorem 4.3.4. Let X be a simple random variable with E[X] = m. 1 (1) (Chebyshev’s Inequality) For any α > 0, P [ω : |X(ω)| ≥ α] ≤ E[|X|]. α 1 (2) (Markov’s Inequality) For any α > 0, P [ω : |X(ω)| ≥ α] ≤ E[|X|k]. αk 1 (3) (Chebyshev-Bienaym´eInequality) For α > 0, P [ω : |X − m| ≥ α] ≤ Var[X]. α2 (4) (Jensen’s Inequality) Suppose ϕ is a convex function on an interval containing the range of X. Then ϕ(E[X]) ≤ E[ϕ(X)].

1 1 (5) (H¨older’sInequality) Suppose p, q > 1 are numbers satisfying p + q = 1. Then E[|XY |] ≤ E[|X|p]1/p · E[|Y |q]1/q. The p, q = 2 case of H¨older’sinequality is called Schwartz’s inequality: E[|XY |] ≤ E[X2]1/2 · E[Y 2]1/2. β β Moreover, setting p = α , q = β−α for some 0 < α ≤ β, taking Y ≡ 1 and replacing X with |X|α gives us Lyapounov’s inequality: E[|X|α]1/α ≤ E[|X|β]1/β.

4.4 Abstract Laws of Large Numbers

In this section we generalize the strong and weak laws of large numbers (Chapter 1) to a more general setting involving independent, simple random variables.

Definition. A sequence X1,X2,... of simple random variables on a probability space (Ω, F,P ) is said to be identically distributed if their distributions are all the same. In the case that the Xn are also independent, this is abbreviated i.i.d.

Theorem 4.4.1 (Strong Law). Suppose {Xn} is a sequence of i.i.d. simple random variables on (Ω, F,P ). For each n, set E[Xn] = m and Sn = X1 + ... + Xn. Then  1  P ω : lim Sn(ω) = m = 1. n→∞ n

Proof. Without loss of generality, we may assume m = 0 by shifting all the Xn. Since they are identically distributed, it makes sense to set E[Xn] = m for each n. First we show that  1  ω : lim n Sn(ω) = 0 is an F-set so that we can define its probability. Note that 1 lim Sn = 0 ⇐⇒ for every m ∈ N there is an N ∈ N such that n→∞ n

1 1 for all n > N, Sn − 0 < n m n X n ⇐⇒ X < . k m k=1

55 4.4 Abstract Laws of Large Numbers 4 Simple Random Variables

 Pn n  Clearly ω : | k=1 Xk(ω)| < m ∈ F and we can construct the desired set out of these:

∞ ∞ ∞ " n #  1  \ [ \ X n ω : lim Sn(ω) = 0 = ω : Xk(ω) < . n→∞ n m m=1 N=1 n=N+1 k=1

 1   1  We have shown that P lim n Sn = 0 = 1 ⇐⇒ P lim n Sn ≥ ε i.o. = 0 for any arbitrary ε so we will prove this equivalent condition. Denote the 2nd and 4th moments of each Xi by 2 2 4 4 4 4 4 2 E[Xi ] = σ and E[Xi ] = ξ . Then E[Sn] = nξ + 3n(n − 1)σ ≤ kn for some k (see the table in Section 1.2 to see where these values come from). By Markov’s inequality,

k kn2 k P [|S | ≥ nε] ≤ E[|S |4] ≤ = . n n4ε4 n n4ε4 n2ε4

As n → ∞, this approaches 0 so by the first Borel-Cantelli lemma, P [|Sn| ≥ nε i.o.] = 0. As discussed above, this implies the result.

Example 4.4.2. Consider the Bernoulli trials ( 0 with probability p Xn = 1 with probability 1 − p.

It doesn’t matter on which subset of the reals the Xn are defined, since by Theorem 4.2.2, there exists a sequence of independent simple random variables having the prescribed distri- 1 bution. Clearly E[X] = m, so by the Strong Law, n Sn → p with probability 1. Moreover, since Bernoulli trials are independent, variance is given by σ = Var[Xn] = p(1 − p). The main limitation of the Strong Law is that we must have control over the 4th moments of the Xi, via the i.i.d. condition. The Weak Law is weaker than the Strong Law in its conclusion but it is useful when we only have control over lower moments.

Theorem 4.4.3 (Weak Law). Let {Xn}, m and Sn be defined as in the Strong Law. Then for any ε > 0,   1 lim P ω : Sn − m ≥ ε = 0. n→∞ n Proof. By Chebyshev-Bienaym´e,   1 Var[Sn] P ω : Sn − m ≥ ε ≤ n n2ε2 n Var[X] = by independence n2ε2 −→ 0 as n → ∞.

Hence the Weak Law is proved. In general, Var[X] is proportional to the 2nd moment E[X2] which is easier to control than higher moments such as E[X4].

56 4.4 Abstract Laws of Large Numbers 4 Simple Random Variables

Example 4.4.4. Recall the probability space Ωn = Sn, the set of permutations of n symbols, equipped with the cycle completion variables: ( 1 if the kth position finishes a cycle Xnk = 0 otherwise.

1 Then for any n, {Xn1,...,Xnn} are independent and P [Xnk = 1] = n−k+1 =: mnk. Variance 2 in this case is σnk = mnk(1 − mnk) as with Bernoulli trials. Let Sn = Xn1 + ... + Xnn so that Pn 1 for a permutation ω ∈ Ωn, Sn(ω) represents the number of cycles of ω. Define Ln = k=1 k . Then expected value and variance are calculated by

n n X X 1 E[S ] = E[X ] = = L ; n nk n − k + 1 n k=1 k=1 n X and Var[Sn] = Var[Xnk] by independence k=1 n X = mnk(1 − mnk) < mnk. k=1 So for any ε > 0,   Sn(ω) − Ln(ω) P ω : ≥ ε = P [ω : |Sn(ω) − Ln(ω)| ≥ εLn(ω)] Ln(ω) 1 ≤ 2 Var[Sn] by Chebyshev-Bienaym´e Lnε2 1 1 = 2 2 Ln = 2 Lnε Lnε

but Ln diverges as n → ∞, so this fraction approaches 0. Thus the conclusion of the Weak Law holds and moreover we can see that Sn ∼ Ln ∼ log n. So there are approximately (in a weak sense) log n cycles in an average permutation of n symbols. The Strong Law cannot be applied in this case since Ωn is a different space for each n.

57 4.5 Second Borel-Cantelli Lemma Revisited 4 Simple Random Variables

4.5 Second Borel-Cantelli Lemma Revisited

The independence condition in the second Borel-Cantelli lemma (Section 3.4) is sometimes too restrictive. For example, recall simple random variables `n defined as the length of the run of heads beginning on the nth coin flip in an infinite sequence of flips. The problem is that the `n are not independent, but in some sense they are independent if the runs are far enough apart. Although we cannot apply BC2, we will prove a weaker theorem that will apply to the `n and other non-independent examples.

Let A1,A2,... ∈ F and for each n ∈ N, define Nn(ω) = χA1 (ω) + ... + χAn (ω). Then Nn(ω) represents the number of occurrences of ω in the first n sets in the sequence {Ai} and more importantly, [ω : ω ∈ An i.o.] = [ω : sup Nn(ω) = +∞]. Suppose the An are independent and pn = P (An); set mn = p1 + ... + pn. Then

n X E[Nn] = mn and Var[Nn] = pk(1 − pk) < mn. k=1

If x < mn then   P [ω : Nn(ω) ≤ x] ≤ P ω : |Nn(ω) − mn| ≥ |x − mn|

Var[Nn] mn ≤ 2 < 2 . |x − mn| (mn − x) P If pk diverges then mn → ∞ so for every x ∈ R, P [ω : Nn(ω) ≤ x] −→ 0. Moreover, since P [ω : sup Nn(ω) ≤ x] ≤ P [ω : Nn(ω) ≤ x] for all n and the right term goes to 0, we conclude

P [ω : sup Nn(ω) < ∞] = 0 =⇒ P [ω : ω ∈ An i.o.] = 1.

This is an alternate way to view the second Borel-Cantelli lemma. Notice that the conclusion Var[Nn] still holds, even if the An are not independent, as long as 2 −→ 0. It turns out that (mn − x) this happens when P P (Ai ∩ Aj) lim inf i,j≤n = 1, n→∞ P 2 k≤n P (Ak) as we will see in the proof of the next theorem. In general, this liminf is greater than or equal to 1, so the variance condition will hold as long as the liminf is less than or equal to 1. The following generalizes the second Borel-Cantelli lemma.

Theorem 4.5.1. Let {An} be a sequence of (not-necessarily) independent events and suppose P∞ P (An) diverges and n=1 P P (Ai ∩ Aj) lim inf i,j≤n ≤ 1. n→∞ P 2 k≤n P (Ak)

Then P [ω ∈ An i.o.] = 1.

58 4.5 Second Borel-Cantelli Lemma Revisited 4 Simple Random Variables

Proof. As above, let Nn = χA1 (ω) + ... + χAn (ω). Set P P (Ai ∩ Aj) θ = i,j≤n n P 2 k≤n P (Ak)

so that the hypothesis reads lim inf θn ≤ 1. By the work in the preceding paragraph, it is enough to show that Var[Nn] 2 −→ 0 as n → ∞. (mn − x) First, we estimate variance by

2 2 Var[Nn] = E[Nn] − E[Nn] ! X 2 = E[χAi χAj ] − mn i,j≤n ! X 2 = P (Ai ∩ Aj) − mn i,j≤n P !2 !2 i,j≤n P (Ai ∩ Aj) X X = P (A ) − P (A P 2 k k k≤n P (Ak) k≤n k≤n !2 X 2 = (θn − 1) P (Ak) = (θn − 1)mn. k≤n

For a fixed x ∈ R, 2 Var[Nn] (θn − 1)mn P [ω : Nn(ω) ≤ x] ≤ 2 ≤ 2 (mn − x) (mn − x) and as n → ∞, mn → ∞, so if lim inf θn ≤ 1 then the term on the right approaches 0. This implies P [ω : Nn(ω) → ∞] = 1, i.e. P [An i.o.] = 1.

If the An are independent, it turns out that

P 2 (pk − p ) θ = 1 + k≤n k . n P 2 k≤n pk

This ratio of series goes to 0, so lim inf θn = 1 which implies the original BC2.

Example 4.5.2. Although the sequence {`n} of run-length simple random variables is not independent, we can use this generalized BC2 to prove the following result.

P∞ −rn Claim. P [ω : `n(ω) ≥ rn i.o.] = 1 if and only if n=1 2 diverges.

Proof. As before, without loss of generality we can assume the rn are integers. Define

An = [ω : `n(ω) ≥ rn] = [dn(ω) = dn+1(ω) = ... = dn+rn−1(ω) = 0]. As previously noted, if j + rj ≤ k then Aj and Ak are independent so P (Aj ∩ Ak) = P (Aj)P (Ak) in that case. On

59 4.6 Bernstein’s Theorem 4 Simple Random Variables

the other hand, if j < k < j + rj then P (Aj ∩ Ak) ≤ P [dj = dj+1 = ... = dk−1 = 0 | Ak]. These events are now independent, so 1 P (A ∩ A ) ≤ P [d = d = ... = d = 0]P (A ) = P (A ). j k j j+1 k−1 k 2k−j k Putting all the cases together, we have

X X X X 1 P (A ∩ A ) ≤ P (A ) + 2 P (A )P (A ) + 2 P (A ) j k k j k 2k−j k j,k≤n k≤n j

P∞ P∞ −rn Therefore if n=1 P (An) = n=1 2 diverges, we have P P 2 3 P (Ak) + P (Ak) θ = k≤n k≤n −→ 1. n P 2 k≤n P (Ak)

So by Theorem 4.5.1 we get the desired conclusion, i.e. P [ω : `n(ω) ≥ rn i.o.] = 1. On the other hand, the first Borel-Cantelli lemma applies even in the non-independent case to give P −rn us the converse. Hence P [ω : `n(ω) ≥ rn i.o.] = 1 ⇐⇒ 2 diverges.

4.6 Bernstein’s Theorem

Definition. Suppose f is a real-valued function on [0, 1]. The nth Bernstein polynomial for f is defined by n X k  n B (x) = f xk(1 − x)n−k. n n k k=0

Theorem 4.6.1 (Bernstein). If f is continuous on [0, 1] then the sequence (Bn) converges to f uniformly on [0, 1].

Notice that Bernstein’s Theorem is just a restatement of the Weierstrass Approximation Theorem. In functional analysis we typically prove Weierstrass’s theorem using function convolution. Here, we instead have defined an interpolation of f for each n. We will explicitly prove that the Bn converge to f uniformly. Proof. Since f is continuous on [0, 1], f is bounded and uniformly continuous. Set M = sup |f(x)| and δ(ε) = sup |f(x) − f(y)| over all x, y ∈ [0, 1] such that |x − y| < ε. This x∈[0,1] value δ(ε) is sometimes called the modulus of continuity of f. We will show that 2M sup |f(x) − Bn(x)| ≤ δ(ε) + 2 x∈[0,1] nε

60 4.6 Bernstein’s Theorem 4 Simple Random Variables

−1/3 −1/3 2M which will imply the result since if ε = n , sup |f(x) − Bn(x)| ≤ δ(n ) + n1/3 which approaches 0 as n → ∞. Fix n ∈ N, x ∈ [0, 1] and let X1,X2,... be simple random variables such that for each i, P [Xi = 1] = x and P [Xi = 0] = 1 − x. Set Sn = X1 + ... + Xn. Then for any n k n−k k, P [Sn = k] = k x (1 − x) so the probabilities are the coefficients in the Bernstein n X polynomials. This implies E[f(Sn)] = f(k)P [Sn = k] so k=0    n   Sn X k E f = f P [S = k] n n n k=0 n X k  n = f xk(1 − x)n−k = B (x). n k n k=0

 Sn   Sn   This implies |f(x) − Bn(x)| = f(x) − E f n ≤ E f(x) − f n so we will esti-  Sn   Sn Sn  mate E f(x) − f n . Note that when n − x < ε, f n − f(x) < δ(ε), and when Sn Sn  n − x ≥ ε, f n − f(x) < 2M. Then

 Sn    Sn   Sn  E f(x) − f n ≤ δ(ε)P n − x < ε + 2M · P n − x ≥ ε  Sn   Sn  ≤ δ(ε) + 2M · P n − m ≥ ε where m = E n = δ(ε) + 2M · P [|Sn − nx| ≥ nε] Var[S ] ≤ δ(ε) + 2M · n by Chebyshev’s inequality. n2ε2 By independence,

n X Var[Sn] = Var[Xk] = n Var[Xk] = nx(1 − x) ≤ n, k=1 so our estimate for the above expected value is 2M  Sn   E f(x) − f ≤ δ(ε) + . n nε2

 Sn   Finally, apply the comments from above and the fact that |f(x)−Bn(x)| ≤ E f(x) − f n for all x ∈ [0, 1] to finish the proof.

61 4.7 Gambling 4 Simple Random Variables

4.7 Gambling

In this section, our goal is to convince the reader that, in simple terms, gambling is a bad idea. We will show that, under some fallacious assumptions often made by gamblers, the standard casino game roulette is heavily biased in the casino’s favor. The techniques developed in this section are easily adapted to any game involving a ‘unit bet’, that is, a bet of a fixed sum which is either doubled with a win or lost with a loss. ∞ Suppose {Xi}i=1 is an independent sequence of simple random variables on a probability space (Ω, F,P ) where each variable is defined by ( 1 with probability p Xi(ω) = −1 with probability q = 1 − p.

As usual, set Sn = X1 + ... + Xn and S0 = 0 (i.e. you can’t win if you don’t play).

1 1 Definition. If p > 2 , the game is said to be favorable; if p = 2 , it is a fair game; and if 1 p < 2 , the game is said to be unfavorable. Example 4.7.1. The standard casino setup for the game of roulette is as follows. A wheel with 38 slots is spun and a ball is dropped in so that it lands in one of the slots by the time the wheel stops spinning. There are 18 red slots and 18 black slots which together are labelled with the numbers 1 through 36. In addition, there are two green slots labelled 0 and 18 20 $ 00. Set p = 38 and q = 1 − p = 38 . We will assume the player places a 1 bet on a single red or black number, with a 2:1 payout. The odds against the player winning on a single spin q 10 are ρ = p = 9 . Suppose the player starts with a capital of a dollars and plays $1 per spin (or per trial for a general game). She plays until reaching a fixed goal of c dollars (success) or until she loses her money, i.e. reaches 0 dollars (ruin). After n plays, the player has a total of a + Sn dollars. Set

Aa,n = [ω : a + Sn(ω) = c] ∩ [ω : 0 < a + Si(ω) < c for all i = 1, . . . , n − 1]

Ba,n = [ω : a + Sn(ω) = 0] ∩ [ω : 0 < a + Si(ω) < c for all i = 1, . . . , n − 1] ∞ ! ∞ [ X and sc(a) = the probability of success = P Aa,n = P (Aa,n). n=1 n=1 Fixing c and allowing a to vary, we want to study the optimal starting capital for achieving success. Later we will investigate other strategies for success. By convention, set Aa,0 = ∅ and Ac,0 = Ω so that sc(0) = 0 and sc(c) = 1. Given a, sc(a) = psc(a + 1) + qsc(a − 1) which is really a second order boundary value problem:

sc(a) = psc(a + 1) + qsc(a − 1)

sc(0) = 0

sc(c) = 1.

62 4.7 Gambling 4 Simple Random Variables

q Let ρ = p . It turns out (see Billingsley) that the solutions to this boundary value problem are of the form ( A + Bρa if ρ 6= 1 sc(a) = A + Ba if ρ = 1. Suppose ρ 6= 1, i.e. the game is unfavorable. Given the first boundary condition, we see that 0 = A + B =⇒ A = −B, so the second boundary condition gives us 1 1 = A + Bρc = −B + Bρc =⇒ B = . ρc − 1 Thus the solution for an unfavorable game is ρa − 1 s (a) = . c ρc − 1 On the other hand, if the game is favorable (ρ = 1), then A = 0 and A + Bc = 1, implying B = 1 . This gives the solution c a s (a) = . c c Examples. 1 Suppose the player starts with a = $900 and has a goal of reaching c = $1000. In the 10 fair game, sc(a) = .9 which is reasonably high. However, when ρ = 9 as in roulette, sc(a) = .00003, an extremely low probability.

2 Things are worse when c = $20, 000. In the fair case, sc(a) = .005 which is unsurpris- −911 ingly low. But in the unfavorable case, sc(a) ≈ 3 × 10 . Remark. Ruin is symmetric with success, i.e. the boundary value problem

rc(a) = prc(a + 1) + qrc(a − 1)

rc(0) = 1

rc(c) = 0 has the same solution in terms of ruin: ρc−1 − 1  if ρ 6= 1 r (a) = ρc − 1 c c − a  if ρ = 1. c

Notice that for any choices of c and a, rc(a) + sc(s) = 1, to the probability of the game ending (in success or ruin) is always 1. What if our player had infinite capital? What are the odds of ever achieving the goal? Suppose a, b > 0 and define Ha,b to be the event of reaching +b before −a. For any finite capital a, this can be written

∞ n−1 !! [ \ Ha,b = [ω : Sn(ω) = b] ∩ [ω : −a < Sk(ω) < b] . n=1 k=1

63 4.7 Gambling 4 Simple Random Variables

We can write the probability of one of these events as P (Ha,b) = sa+b(a). Also set Hb to be the event of ever gaining +b, that is the event of success with infinite capital:

∞ [ Hb = Ha,b. a=1

Notice that for a fixed b, {Ha,b} is a monotone increasing sequence that converges to Hb so by continuity from below (Theorem 2.1.5),

P (Hb) = lim P (Ha,b) = lim sa+b(a) a→∞ a→∞  a  lim if ρ = 1 a→∞ a + b = 1 − ρa  lim if ρ 6= 1 a→∞ 1 − ρa+b ( 1 if ρ ≤ 1 = ρ−b if ρ > 1.

Examples. Use the same setup as before.

10 1 If c = $1000 and ρ = 1 then lim sa+c(a) = 1, but if ρ = 9 then lim sa+c(a) = .00003 — the same as before.

10 2 If c = $20, 000 and ρ = 1 then lim sa+c(a) = 1, but on the other hand when ρ = 9 , −911 lim sa+c(a) ≈ 3 × 10 . This shows that having infinite capital helps in the fair game (this is to be expected) but has no effect on the unfavorable game.

What if we have some sort of ‘strategy’ for when to place our bets? We will see that this does not change the unfavorable game’s outcome either. Suppose {Xn} are i.i.d. simple random variables and define an additional random variable ( 1 if we bet on the nth trial Bn = 0 if we don’t bet on the nth trial.

The way Bn is defined can only depend on X1,...,Xn−1, i.e. Bn cannot ‘predict the future’. In other words, Bn is measurable with respect to Fn−1 = σ(X1,...,Xn−1). Recall that this means Bn = f(X1,...,Xn−1) for a function f. Set Nn to be the time the nth bet is placed – notice Nn is not simple random. We will assume P [Bn = 1 i.o.] = 1 so that the game always terminates.

Definition. Given {Xn} a sequence of i.i.d. simple random variables, a choice of Bn that satisfies the above conditions is called a selection scheme.

Theorem 4.7.2. The sequence {Yn}, where Yn = XNn , is independent with P [Yn = 1] = p and P [Yn = −1] = q. In other words, selection schemes do not change the outcome of a game.

64 4.7 Gambling 4 Simple Random Variables

Proof omitted.

Set F0 = a, the ‘initial fortune’ of the player and for each n, let Fn be her/his fortune after the nth trial. We next define a way of altering the player’s wagers for each trial in order to potentially optimize the odds of winning.

Definition. Define the random variable Wn to represent the player’s wager (in dollars, e.g.) on the nth trial of a game. If Wn = gn(F0,X1,...,Xn−1), that is, Wn ∈ σ(X1,...,Xn−1) and Wn depends on initial fortune, and in addition Wn ≥ 0 for all n, then the choice of Wn is called a betting system.

Our player’s fortune can thus be written Fn = Fn−1 + WnXn.

Example 4.7.3. If X1 = ... = Xn−1 = −1 then an example of a betting system is the n−1 double-or-nothing approach, Wn = 2 for all n ≥ 1.

As we have it defined, Wn is not a simple random variable since it depends on F0, which can take on any finite positive value. However, fixing F0 makes Wn a simple random variable. Also notice that Wn is independent of Xn, so we can write

E[WnXn] = E[Wn]E[Xn] = (p − q)E[Wn].

If p = q then E[Fn] = E[Fn−1 + WnXn] = E[Fn−1] + E[Wn] · 0 = E[Fn−1]. In other words, if the game is fair then E[Fn] = F0 for every n ≥ 1. If the game is unfavorable, then E[Fn] ≤ E[Fn−1] and so E[Fn] ≤ F0 for all n. This shows that a betting scheme cannot make an unfavorable game fair (or favorable) but it can increase one’s odds.

Definition. Suppose τ(F0, ω) is a function assigning values in N ∪ {0} for all ω ∈ Ω and F0 ≥ 0 by τ = n when the player bets on the first n games and then stops. If for all n, [ω : τ(F0, ω) = n] ∈ σ(X1,...,Xn) and τ is finite with probability 1, then τ is called a stopping time. The dependence condition requires that τ does not depend on future information, only the previous trials of the game. The finite with probability 1 condition allows for some set of measure 0 on which τ has infinite values. Therefore τ is not a simple random variable because a game can be arbitrarily long.

Definition. A gambling policy, denoted π, is a betting system Wn for a particular initial capital F0 together with a stopping time τ.

Example 4.7.4. A selection scheme is a betting system defined by Wn = Bn ∈ {0, 1}. A stopping time for this Wn is τ = n where n is the first loss. To see this, note that P [ω : τ(ω) > n] = pn and since 0 < p < 1 (assuming an unfavorable game),

∞ ! \ P [ω : τ(ω) is not finite] = P [ω : τ(ω) > n] = lim P [ω : τ(ω) > n] = lim pn = 0. n→∞ n→∞ n=1 Thus τ is finite on a set of probability 1. Now consider

n−1 ! \ [ω : τ(ω) = n] = [ω : Xk(ω) = 1] ∩ [ω : Xn(ω) = −1] k=1

65 4.7 Gambling 4 Simple Random Variables

which lies in Fn := σ(X1,...,Xn). Hence τ is a stopping time so a selection scheme is a gambling policy. As Theorem 4.7.2 above shows, a selection scheme is an example of a policy that does not increase one’s odds of winning.

Suppose π = (F0,Wn, τ) is a gambling policy. Define ( ∗ Fn n ≤ τ ∗ Fn = and Wn = χ[ω:n≤τ(ω)]Wn. Fτ n > τ

∗ ∗ ∗ Then Fn = Fn−1 + Wn Xn so we can embed the stopping time into our betting system. Explicitly, this system tells the player to bet Wn = 0 when n is later than the stopping time. It still follows that E[Fn] = E[Fn−1] = ... = F0 for any n in the fair scenario and E[Fn] ≤ E[Fn−1] ≤ F0 in the unfavorable scenario. In a finite capital scenario, stopping times cannot make an unfavorable game profitable. However, things are different in the infinite capital world. Assuming the game terminates, ∗ Fn converges to Fτ with probability 1 so by the bounded convergence theorem (4.3.2), ∗ lim E[Fn] = E[Fτ ] if the Fn are uniformly bounded. In this case, E[Fτ ] = F0 if the game is ∗ fair and E[Fτ ] ≤ F0 if the game is unfavorable. If the Fn are not uniformly bounded, this means either the gambler or casino started with access to infinite capital, so the uniform bound condition is realistic. 1 Suppose p ≤ 2 and F0 = a. If the player stops when either Fn = c or Fn = 0 then a sc(a) < (1 − sc)(0) + sc(c) = c .

On the other hand, if the player plays until Fn = c, possibly allowing for Fn < 0, then Fτ = c > a =⇒ E[Fτ ] > F0 — wait what? There is a possibility of raising the expected fortune above that of starting levels even in an unfair game! Of course this is only possible with access to infinite capital, so in the real world the probabilities behave closer to expectation. There is a strategy that optimizes winning conditions. First, rescale the fortune scale so that Fn ∈ [0, 1] for n = 0, 1, 2,... and set τ to be the stopping time when one of 0 or 1 is reached for the first time. The gambling policy πb with this stopping time and

( 1 Fn−1 if 0 ≤ Fn−1 ≤ 2 Wn = 1 1 − Fn−1 if 2 ≤ Fn−1 ≤ 1 is called bold play. Informally, it says that if the player can’t reach her goal on the next trial, she wagers everything in a ‘double-or-nothing’ strategy. If she can reach her goal, she wagers exactly the amount that would guarantee her to have fortune 1 with a win. To see that πb is a valid policy, we just need to check that τ is a stopping time. It’s clear that τ only depends on X1,...,Xn so it suffices to check the finite property. Consider

P [ω : τ(ω) > n | τ(ω) > n − 1] = P [ω : Fn(ω) = 0, 1 | Fn−1ω) 6= 0, 1].

1 If Fn−1 ≤ 2 , the outcomes can be split into

Fn = Fn−1 + Fn−1 = 2Fn−1 if Xn = 1; happens with probability p

Fn = Fn−1 − Fn−1 = 0 if Xn = −1; happens with probability q.

66 4.7 Gambling 4 Simple Random Variables

1 On the other hand, if Fn−1 ≥ 2 then

Fn = Fn−1 + (1 − Fn−1) = 1 if Xn = 1; happens with probability p

Fn = Fn−1 − (1 − Fn−1) = 2Fn−1 − 1 if Xn = −1; happens with probability q.

The game terminates in the second and third cases, so the conditional probability is

P [ω : τ(ω) > n | τ(ω) > n − 1] ≤ max{p, q}.

Setting m = max{p, q}, we have P [ω : τ(ω) > n] ≤ mn which tends to 0 as n → ∞. Hence τ is a valid stopping time. Theorem 4.7.5 (Dubins-Savage). Bold play is optimal in the unfavorable case.

Proof sketch. Assume Fτ = 0 or 1 so that τ is a simple random variable. Write Qπ(x) =

P [Fτ = 1 | F0 = x] for any gambling policy π and for each x ∈ [0, 1]. Also set Q(x) = Qπb (x) for bold play. One can check that for any π, Qπ(0) = 0, Qπ(1) = 1 and 0 ≤ Qπ(x) ≤ 1. The 1 theorem then reduces to proving that for every π on a game with p ≤ 2 , Qπ(x) ≤ Q(x). The bold play function has the following properties: ˆ Q is increasing on [0, 1].

ˆ Q is a continuous function of x. ( pQ(2x) 0 ≤ x ≤ 1 ˆ 2 Q(x) = 1 p + qQ(2x − 1) 2 ≤ x ≤ 1 These can be used to show Q(x) ≥ pQ(x+t)+qQ(x−t) whenever 0 ≤ x−t < x < x+t ≤ 1 which implies the result. Remark. This optimal policy is not unique; however the Dubins-Savage theorem shows that bold play is one optimal policy. It is also known that the optimal probability of success sc(a) is computable. Examples.

1 Set a = $900 and c = $1000 as before. In the fair case, sc(a) = .9 and for any policy 10 π, Qπ(x) = x for every x. In the roulette odds, i.e. p = 9 , recall that unit stakes give us sc(a) = .00003. However, bold play yields substantially better odds: Q(.9) = .88.

2 Set a = $100 and c = $20, 000. This generates the following probabilities of success:  .005 fair game  −911 sc(a) = 3 × 10 roulette odds .003 bold play.

Notice that in both examples, bold play gives us odds that are on the same order of magnitude as the fair game odds.

67 4.8 Markov Chains 4 Simple Random Variables

4.8 Markov Chains

Definition. Let S be a countable set and consider a doubly-indexed sequence {pij}i,j∈S such P that pij ≥ 0 and for each i, j∈S pij = 1.A Markov chain on a probability space (Ω, F,P ) ∞ is a sequence {Xn}n=0 of random variables on (Ω, F,P ) such that

(i) For each n, Xn(Ω) ⊂ S.

(ii) For every subset {i0, i1, . . . , in} ⊂ S such that P [X0 = i0,...,Xn = in] > 0,

P [Xn+1 = j | X0 = i0,...,Xn = in] = P [Xn+1 = j | Xn = in] = pinj.

The pij are called the transition probabilities for the Markov chain, each Xn is called the nth state and S is referred to as the state space. The Markov property (ii) implies independence of history, that is, each state only depends on the previous state and ‘forgets’ what has happened before that. Also, property (ii) also implies that the states are autonomous, i.e. they don’t depend on n – if n is thought of as a time variable, then the outcome has the same dependence on the previous outcome no matter the current time. We will denote the initial probabilities by P [X0 = i] = αi for each i ∈ S. Notice that P i∈S αi = 1 and αi ≥ 0 for every i ∈ S. The transition probabilities pij correspond to a matrix P = (pij) called a transition matrix. The conditions on the pij mean that the transition matrix P = (pij) is stochastic.

Examples. 1 Markov chains are useful for describing change in physical models. Consider the fol- lowing (oversimplified) representation of diffusion. Suppose we have 2 buckets, a left and a right one, which are each filled with r balls that are either black or white. We know that there are k white balls, and therefore r − k black balls, in the left bucket and k black balls, and therefore r − k white balls, in the right bucket.

r balls r balls k white r − k white r − k black k black

Consider the state space S = {0, 1, . . . , r} where a state k ∈ S corresponds to there being k white balls in the left bucket. The example above is at state k = 3. The Markov process will be to draw one ball from the left bucket and one ball from the right bucket simultaneously, and to swap them and place each ball in the opposite bucket. We will compute the transition probabilities. Given Xn = k, the possible states for Xn+1 are:

68 4.8 Markov Chains 4 Simple Random Variables

left right probability Xn+1 k  r − k  white white k r r k  k  white black k − 1 r r r − k  r − k  black white k + 1 r r r − k  k  black black k r r Thus the transition probabilities are given by (r − k)2 p = k(k+1) r2 2k(r − k) p = kk r2 k2 p = k(k−1) r2 pkj = 0 if |j − k| ≥ 2. P Notice that for each k, pkj ≥ 0 for all j and j pkj = 1. 2 Random walks are another classic example of Markov chains. Suppose a person, dubbed the ‘walker’, is standing on the number line consisting of integer points. First consider the finite state space S = {0, 1, . . . , r} and assume the walker starts at one of the points in S. We also assume in the finite random walk scenario that the endpoints are absorbing, i.e. if Xn = 0 or r then Xn+j = Xn for all j ≥ 1. At time n, the walker moves to the right with probability p and moves left with probability q = 1 − p. Thus each Xn is a simple random variable. The transition probabilities are as follows:  p if j = i + 1, 0 < i < r  q if j = i − 1, 0 < i < r  pij = 0 if |j − i| ≥ 2, 0 < i < r  1 if i = j = 0 or r  0 if i 6= j = 0 or r.

We typically assume αi = 1 for some i ∈ S and αj = 0 for all j 6= i. That is, there is a positive probability that the walker starts on one of the endpoints. 3 Most of the time we consider an unrestricted random walk, where the state space is S = Z. Here the transition probabilities are  p if j = i + 1  pij = q if j = i − 1 0 otherwise. If p = q, the random walk is said to be symmetric. In the symmetric case, the probability is 1 that the walker returns to her starting point.

69 4.8 Markov Chains 4 Simple Random Variables

4 For higher dimensional random walks, the state space is S = Zk, k ≥ 2. Each point y ∈ Zk has 2k ‘neighbors’ so the notation would be pretty ugly to write down explicitly. Things are even worse if the probabilities of the walker moving to a particular neighbor are not uniform, so let’s assume the random walk is symmetric. Then

( 1 2k if x is a neighbor of y pyx = 0 otherwise.

It turns out that for k = 2, the probability that the walker returns to her starting location is still 1, but for any k > 2, the probability is 0. This suggests something subtle about the geometry of random walks. For this reason, random walks are an active area of modern research. 5 The following is known as either the Princess Problem or the Secretary Problem (in the latter, the princess is replaced by a businessperson trying to hire a secretary). Suppose a princess is trying to find a suitor. The rules are: the suitors appear in random order (e.g. they do not appear in order of increasing wealth or attractiveness); they appear one at a time; after meeting each suitor, the princess must decide whether to accept the suitor, at which point the process ends, or reject the suitor and continue the process. We also assume that the princess has some way of determining how the current suitor relates to every previous suitor in terms of desirability. If a suitor is more desirable than every previous suitor, we will say this suitor is dominant.

Let S1,...,Sr be the suitors in the order they appear, so that S = {S1,...,Sr} is the state space. Set X1 = 1 and for each n ≥ 2, set Xn to be the position of the nth dominant suitor, or r + 1 if the last dominant suitor has already occurred.

For example, suppose S = {S1,S2,S3,S4,S5,S6} and these are ranked in the order S3 < S6 < S2 < S1 < S4. The dominant suitors are S1 (which will always occur) and S4, so X1 = S1, X2 = S4 and X3 = X4 = ... = r + 1 = 7. In general, (k − 1)! 1 P [S is dominant] = = k k! k (j − 2)! 1 P [S is the next dominant suitor] = = if j > i and 0 otherwise j j! j(j − 1) (r−1)! r! k P [Xn+1 = r + 1 | Xn = k] = 1 = . k r Thus the transition probabilities are  k  if k < j ≤ r j(j − 1)  0 if j ≤ k < r pkj = k if j = r + 1  r  1 if j = i = r  0 otherwise.

70 4.8 Markov Chains 4 Simple Random Variables

The princess’ strategy will be to pick a dominant suitor according to some stopping time τ. If she stops at Xτ then we want to know the probability that she picked the k overall best suitor; this is expressed by f(Xτ ) where f(k) = r . Given r the number of suitors, we also want to compute E[f(Xτ )] for different choices of τ but first we need to learn about expected values for Markov chains. We will return to the Princess Problem. The Markov condition of independence of history allows us to calculate higher order transitions by stepping through one state at a time:

P [X0 = i0,X1 = i1,...,Xn = in] = αi0 pi0i1 pi1i2 ··· pin−1in and in general,

P [Xm+t = jt, t = 0, . . . , n | Xs = is for 0 ≤ s ≤ m] = pimj1pj1j2 ··· pjn−1jn .

(n) We will denote a transition of degree n by pij . These can be written

(n) X pij = P [Xm+n = j | Xm = i] = pik1 pk1k2 ··· pkn−1j. k1,...,kn−1∈S

(n) n If S is a finite state space then the transition matrix is (pij ) = P where P = (pij). Notice (0) that P0 = I and pij = δij, the Kronecker delta. If S is countably infinite, the transition probabilities correspond to an infinite matrix which really lives in a Hilbert space.

Theorem 4.8.1. Suppose (pij) is a doubly-indexed sequence of nonnegative real numbers P P such that for all i, j pij = 1 and suppose αi ≥ 0 satisfy i αi = 1. Then there exists ∞ a probability space (Ω, F,P ) and a Markov chain {Xn}n=0 on (Ω, F,P ) with the pij as its transition probabilities and the αi as its initial probabilities. Proof sketch. Let Ω = (0, 1]; F = B, the Borel σ-field; and P = λ, the Lebesgue measure on B. First we want X0 to equal i with probability αi for each i. By Theorem 4.2.2, this is possible in theory but we want to explicitly construct the random variable X0 so that (0) we may continue the process in the next steps. Construct a collection of intervals Ii with (0) (0) length αi for each i by the following process: set I1 = (0, α1],I2 = (α1, α1 + α2], etc. It is evident that X0 satisfies the desired properties. Next, we want X1 to satisfy

P [X1 = j, X0 = i] = P [X0 = i]P [X1 = j | X0 = i].

(0) (1) (0) Subdivide each Ii into Iij by a similar process as above, so that each Iij has length αipij. Repeating this process of subdivision constructs a collection of intervals I(n) with length i0i1···in

αi0 pi0i1 ··· pin−1in . Finally, set

[ (n) i if ω ∈ I  i0i1···in−1i Xn(ω) = i0,...,in−1 0 otherwise.

By construction, {Xn} is a Markov chain with the given initial and transition probabilities.

71 4.9 Transience and Persistence 4 Simple Random Variables

4.9 Transience and Persistence

∞ Let {Xn}n=0 be a Markov chain. First, let’s set up some notation. If a probability is conditioned on X0 = i, we will denote this by Pi. Define

(n) fij = Pi[X1 6= j, X2 6= j, . . . , Xn−1 6= j, Xn = j] which represents the probability that the first occurrence of state j is at time n, given X0 = i. ∞ ! ∞ [ X (n) Also set fij = Pi [Xn = j] = fij . n=1 n=1

∞ Definition. For a Markov chain {Xn}n=0 on state space S, a state i is persistent if fij = 1 and transient if fij < 1.

Suppose n1 < n2 < . . . < nk and consider

(n1) (n2−n1) (nk−nk−1) Pi[X1 6= j, X2 6= j, . . . , Xn1 = j, . . . , Xnk = j] = fij fjj ··· fjj .

Then

X (n1) (n2−n1) (nk−nk−1) Pi[Xn = j at least k times] ≥ fij fjj ··· fjj n1,...,nk k−1 = fijfjj ··· fjj = fijfjj .

Therefore ( 0 if fjj < 1 Pi[Xn = j i.o.] = 1 if fjj = 1.

P (n) Theorem 4.9.1. A state i is transient ⇐⇒ Pi[Xn = i i.o.] = 0 ⇐⇒ n pii converges. Similarly, a state i is persistent ⇐⇒ Pi[Xn = i i.o.] = 1 P (n) Proof. By the first Borel-Cantelli lemma, if n pij < ∞ then Pi[Xn = i i.o.] = 0. Based P (n) on the calculation above, fii < 1 so by definition i is transient. This proves n pii < ∞ =⇒ Pi[Xn = i i.o.] = 0 =⇒ i transient. To close the logic loop, we must show P (n) fii < 1 =⇒ n pii < ∞. For any i, j, consider

(n) pij = Pi[Xn = j] n−1 X = Pi[X1 6= j, . . . , Xn−s−1 6= j, Xn−s = j, Xn = j] s=0 n−1 X = Pi[X1 6= j, . . . , Xn−s−1 6= j, Xn−s = j]Pj[Xs = j] by autonomy s=0 n−1 X (n−s) (s) = fij pjj . s=0

72 4.9 Transience and Persistence 4 Simple Random Variables

(n) Next, we compute the sum of the pii :

n n t−1 X (t) X X (t−s) (s) pii = fii pii t=1 t=1 s=0 n−1 n X (s) X (t−s) = pii fii switching order of summation s=0 t=s+1 n−1 n X (s) X (s) ≤ pii fii ≤ pii fii s=0 s=0 n X (t) (0) = pii fii + fii since pii = 1 by a previous calculation. t=1

n X (t) Rearranging this gives us (1 − fii) pii ≤ fii so if 0 < fii < 1, t=1

n ∞ X (t) fii X (t) fii p ≤ =⇒ p ≤ ii 1 − f ii 1 − f t=1 ii t=1 ii and hence the sum converges. The statement for persistence is proven in a similar fashion. Example 4.9.2. We will prove P´olya’s Theorem for symmetric k-dimensional random walks, P which we state below. First, in order to employ Theorem 4.9.1 we want to know if n an converges or not. If k = 1 then the only way to return to one’s starting position is after an even number of moves, so a2n+1 = 0 for all n. On the other hand, if the walker returns to the start after 2n moves then she had to move left an equal number of times as she moved 2n −2n right. This means a2n = n 2 . A well-known result called Stirling’s Formula says that √ nn n! ∼ 2πn . e

Using this on a2n, we have √ 2n 2n √ 2n 2n 1 (2n)! 4πn e 2 πn 2 n e2n 1 a2n = = √ = = √ . 2 2n n n 2 2n 2n 1 (n!) 2   2n 2πn 2 n 2n πn 2πn e 2 e P Then clearly n an diverges (e.g. by a comparison test) so Theorem 4.9.1 tells us that each state in the state space is persistent. Next, suppose k = 2. By similar logic as above,

n X (2n)! a = 4−2n 2n u!u!(n − u)!(n − u)! u=0 and another application of Stirling’s Formula yields 1 a ∼ . 2n πn 73 4.9 Transience and Persistence 4 Simple Random Variables

P So n an diverges, and thus the states are persistent in the k = 2 case. 1 It turns out that for k ≥ 3, a2n ∼ nk/2 which corresponds to a convergent series, so in these cases the states are all transient. P´olya’s Theorem states this formally:

Theorem 4.9.3 (P´olya). A symmetric, k-dimensional random walk is persistent when k = 1, 2 and transient otherwise.

Definition. A Markov chain is said to be irreducible if for every i, j there exists an n such (n) that pij > 0. In other words, in an irreducible chain there is always a finite sequence of transitions between any pair of states.

Theorem 4.9.4. If a chain is irreducible then either every state is transient or every state is persistent. Furthermore,

S  P (n) (1) Transience is equivalent to Pi j[Xn = j i.o.] = 0 and also to n pij < ∞ for all states i, j ∈ S.

S  P (n) (2) Persistence is equivalent to Pi j[Xn = j i.o.] = 1 and also to n pij = ∞ for all i, j ∈ S.

(r) (s) Proof. Irreducibility implies for all i, j ∈ S there exist r, s such that pij > 0 and pji > 0. (r+s+n) (r) (n) (s) P (n) P (n) Then pii ≥ pij pjj pji so if n pii converges then n pjj converges as well. By Theorem 4.9.4, this shows that if any one state is transient then they all are. If this is the case, then fjj < 1 so

! ∞ ∞ ∞ [ X X X Pi [Xn = j i.o.] ≤ Pi[Xn = j i.o.] = fjj = 0 = 0. j n=1 j=1 j=1

S  Hence Pi j[Xn = j i.o.] = 0. In addition,

n X (n) X X (v) (n−v) pij = fij pjj n n v=1 ∞ ∞ X (v) X (n) = fij pjj switching the summation v=1 n=v ∞ ∞ X (v) X (n) ≤ fij pjj v=1 n=0 ∞ X (n) ≤ pij < ∞ since fij ≤ 1 for all i, j. n=0

P (n) Hence n pij converges.

74 4.9 Transience and Persistence 4 Simple Random Variables

On the other hand, if every state is persistent then Pj[Xn = j i.o.] = 1 by Theorem 4.9.4. Then (m)  pii = Pj[Xm = i] = Pj [Xm = i] ∩ [Xn = j i.o.] X ≤ Pj[Xm = i, Xm+1 6= i, . . . , Xn = j] n>m X (m) (n−m) (m) = pji fij = pji fij. n>m

(m) (m) So pji ≤ pji fij which shows that fij ≥ 1. But fij is a probability so fij = 1. Then by definition Pi[Xn = j i.o.] = 1. Finally, by the contrapositive to the first Borel-Cantelli P (n) lemma, n pij must diverges. Examples. 1 Suppose we have an irreducible Markov chain modeling a restricted random walk. Theorem 4.9.4 can be used to show that the probability of the random walker returning to her initial state infinitely often is 1. In other words, there are no transient states in a finite state space – if transient states exist (in an irreducible chain) then they imply the random walk will exit any finite subset of the state space.

1 2 Consider an asymmetric random walk, i.e. one where p < 2 . Suppose the state space p is unrestricted, e.g. S = Z. Then f01 = q < 1 so every state is transient. Notice in this case that f10 = 1 so it’s not true that fij < 1 for every i, j ∈ S. The previous results 1 only guarantee that fii < 1 for all i. When p < 2 (if p is the probability of the walker moving right), it appears that the chain of right movements is persistent while the left movements are transient.

1 3 If the unrestricted walk is symmetric, i.e. p = q = 2 , and 2 | (n + j − i) then  n  1 1 (n) √ pij = n+j−i n ∼ . 2 2 n

(n) If |j − i| = −1, 0, 1 then lim pij = 0 even though the chain is persistent.

Definition. A matrix Q = (qij) is said to be substochastic if qij ≥ 0 for every i and the P row sums satisfy j qij ≤ 1 for every i.

n (n) (n) P (n) Write Q = (qij ) and σi = j qij so that

(n+1) X (n) X (n) σi = qijσj ≤ σj . j j

(1) (n+1) (n) (n) This implies that σi ≤ 1 and σi ≤ σi for all i, n. So (σi ) is a bounded, monotone (n) sequence and hence σi = limn σi exists. Each σi satisfies a difference equation: X σi = qijσj. j

75 4.9 Transience and Persistence 4 Simple Random Variables

As it turns out, the σi are the maximal solutions to the boundary value problem X xi = qijxj, 1 ≤ i ≤ n j

0 ≤ xi ≤ 1.

(n+1) (n) (This is easily shown using the fact that σi ≤ σi for all n.) Now if U ⊂ S is a subset of (n) the state space then (pij)U is a substochastic matrix and σi = Pi[Xt ∈ U, t ≤ n]. Therefore by the above computations,

(n) σi = lim σi = Pi[Xt ∈ U for all t ≥ 1]. n→∞ Example.

4 Consider a half-random walk where the state space is U = N ∪ {0}, the right half of S = Z. The difference equation from before is now a boundary value problem:

x0 = px1

xi = pxi+1 + qxi−1, i ≥ 1

0 ≤ xi ≤ 1.

q n−1 If ρ = p then the solutions are of the form xn = A + Bn if p = q or xn = A + Bρ if p 6= q. Notice that when p ≤ q, the solution is necessarily unbounded. However, when p > q, the solution is bounded. We want 0 ≤ x ≤ 1 so we must have A = 0 in the case when p = q, or B = −A in the case when p 6= q. Thus the solutions are ( A − An ρ = 1 xn = A − Aρn−1 ρ 6= 1.

Theorem 4.9.5. A state i0 is transient ⇐⇒ there exists a nontrivial solution to the system X xi = pijxj, 0 ≤ xi ≤ 1 for all i 6= i0.

j6=i0

Proof. On one hand, suppose i0 is persistent. By the discussion above, Pi[Xn 6= i0 for all n]

is a maximal solution to this system. But Pi[Xn 6= i0 for all n] = 1 − fii0 so there is a

nontrivial solution ⇐⇒ fii0 < 1 for some i 6= i0 but this is impossible in the persistent case.

On the other hand, we proved that transience implies fi0i0 < 1, but

∞ X X fi0i0 = Pi0 [X1 = i0] + Pi0 [X1 = i, X2 6= i0,...,Xn = i0]

n=2 i6=i0 X = pi0i0 + pi0ifii0 .

i6=i0

If the fii0 were all 1, this would add up to 1 but fi0i0 < 1 so the above shows that fii0 < 1 for some i 6= i0. Hence there is a nontrivial solution.

76 4.9 Transience and Persistence 4 Simple Random Variables

Example. 5 Queueing is used to model physical situations, such as customers standing in line, as well as computer processing. Suppose we have a state space S = N ∪ {0} which represents the number of people currently in line. At each time k, one person at the front of the line is helped and then leaves, and simultaneously, 0, 1 or 2 people enter the line with probabilities t0, t1 and t2, respectively. These satisfy t0 + t1 + t2 = 1 and we assume t0, t2 > 0 so that the chain is irreducible. The queueing ‘matrix’ here is infinite:   t0 t1 t2 0 0 0 ··· t t t 0 0 0 ···  0 1 2  0 t t t 0 0 ··· P =  0 1 2  0 0 t0 t1 t2 0 ···  ......  ...... Notice that the first row is different than the random walk’s transition matrix: if i = 0, no one is served until someone enters the line. Fix i0 = 0 – since the chain is irreducible, either every state is persistent or every state is transient so no generality is lost. The system here is

x1 = t1x1 + t2x2

xk = t0xk−1 + t1xk + t2xk+1, k ≥ 2. This is essentially the same as the system for a half-random walk (see Example 4 ). So there is a nontrivial solution, i.e. i0 is transient, if and only if t2 > t0 and conversely i0 is persistent if and only if t2 ≤ t0. P Definition. A distribution is a sequence πi satisfying 0 ≤ πi ≤ 1, πi = 1 and P i∈S i∈S πipij = πj for all i, j ∈ S. Additionally, the distribution is stationary if P [X0 = j] = πj implies P [Xn = j] = πj for all n. (n) Definition. A state j ∈ S has period t if whenever pij > 0 for any i, t | n. If 1 is a period for j then we say j is aperiodic. Example 4.9.6. A 1-dimensional random walk has period 2 since the walker must return to her starting position after an even number of moves.

Remark. In an irreducible chain, all periods are equal. We will usually assume that {Xn} is an aperiodic, irreducible chain, such as in the next lemma.

Lemma 4.9.7. Suppose a chain {Xn} is an aperiodic, irreducible Markov chain. Then for (n) every i, j ∈ S, there exists an n0 ∈ N such that pij > 0 for all n ≥ n0. (n) (m+n) (m) (n) Proof. Let Mj = {n ∈ N | pij > 0}. Then pij ≥ pij pij for all n so Mj is closed under addition. Since the chain is aperiodic, gcd(Mj) = 1 so by elementary number theory there exists an n0 such that n ∈ Mj for all n ≥ n0. Let i, j ∈ S. Since the chain is irreducible, (r) there exists an r such that pij > 0. If we let nij = nj + r, then every n ≥ nij satisfies (n) (r) (n−r) pij ≥ pij pjj > 0 · 0 = 0.

77 4.9 Transience and Persistence 4 Simple Random Variables

Theorem 4.9.8. Let {Xn} be an aperiodic, irreducible Markov chain and suppose a station- (n) ary distribution πj exists. Then the chain is persistent, limn pij = πj, πi > 0 for all i and the distribution is unique. (n) Proof. First suppose the chain is transient. Then pij → 0 as n → ∞ for any j ∈ S. By the P (n) Weierstrass M-test, j pij πj converges absolutely and uniformly in n, so ! X (n) X  (n) πi = lim pij πj = lim pij πj = 0. n n j j

Hence πi ≡ 0 so πiπi 6= 1 and therefore no stationary distribution exists, contradicting the hypotheses. Therefore the chain is persistent. Consider the double-indexed state space S × S. Define p(ij, kl) = pikpjl to form a direct product of Markov chains Xn × Yn. One can prove that this is still irreducible and aperiodic given the assumptions on Xn. Then for all i, j, i0 ∈ S, Pij[(Xn,Yn) = (i0, i0) i.o.] = 1; that is, the two chains meet in finite time with probability 1. Let τ = infn[(Xn,Yn) = (i0, i0)]. Then another way of saying the previous statement is that τ < ∞ with probability 1. This implies (n) (n) |pik − pjk | ≤ Pij[τ > n] → 0 by the M-test. So i and j really don’t affect the outcome after time τ. Note that

(n) X (n) X (n) πj − pjk = πipik − πjpjk i j X (n) (n) = πi(pik − pjk ) i

(n) and the combined sum approaches 0 by the M-test. Thus πk = limn pij for any j ∈ S and P (n) by uniqueness of limits, πj is unique. Also, for n sufficiently large, πk = i πipik > 0. This finishes everything we needed to check. Example. 5 continued. For the queueing model described before, we can plug in the row sums to obtain:

π0 = π0t0 + π1t1

πk = πk−1t0 + πkt1 + πk+1t2, k ≥ 1. This has a nontrivial solution if the chain is persistent:  A − Ak t = t  0 2  k−1 πk = t0 A − A t0 6= t2.  t2

Of course the system is persistent if t0 ≥ t2 and in particular there is a stationary  k−1! X t0 distribution if t0 > t2, in which case the solution A − A is a geometric t2 k

78 4.9 Transience and Persistence 4 Simple Random Variables

series which we may evaluate. On the other hand, if t0 = t2 there is no stationary distribution even though the chain is persistent.

The examples illustrate our three possibilities so far for a Markov chain {Xn}: ˆ (n) The chain is transient. In this case, pij → 0 and the mean return time for a state P (n) j ∈ S is µj = n nfjj = ∞. ˆ The chain is persistent but has no stationary distribution; this is called nullpersistence. (n) In this case for all j ∈ S, pij → πj and µj = ∞. ˆ The chain is positive persistent, i.e. persistent with a stationary distribution. Here (n) 1 p → πj and µj = < ∞. ij πj Our k-dimensional random walks for different k values illustrate all three scenarios. For k = 1, the chain is positive persistent; for k = 2, the chain is nullpersistent; and finally for k ≥ 3, the chain is transient.

79 5 Abstract Measure Theory

5 Abstract Measure Theory

In Chapters 1 and 2 we studied the Lebesgue measure on Ω = (0, 1] and proved various results in measure theory for probability spaces. In this chapter we generalize the theory in Chapter 2 to general measure spaces.

5.1 Measures

First we recall the definitions of a field and σ-field. Definition. For a space Ω, a collection F of subsets of Ω is called a field if (1) Ω ∈ F. (2) If A ∈ F then AC ∈ F. (3) If A, B ∈ F then A ∪ B ∈ F. S∞ Additionally, F is a σ-field if whenever A1,A2,... ∈ F, n=1 An ∈ F as well. Theorem 5.1.1. Let Ω be a space.

(1) If F is a σ-field on Ω and Ω0 ⊂ Ω then F ∩ Ω0 is a σ-field on Ω0.

(2) If a collection A generates F then A ∩ Ω0 generates F ∩ Ω0. Proof. Easy. Example 5.1.2. The main class of sets studied in Chapters 1 and 2 was the Borel σ-field generated by finite intervals (a, b] ⊂ (0, 1]. This can be generalized to the Borel σ-field on the real line Ω = R which is generated by all finite intervals (a, b] ⊂ R. Things can be generalized even further. Let Ω = Rk, Euclidean k-space. Consider the collection of bounded rectangles

k {(x1, . . . , xk) ∈ R | ai < xi < bi, i = 1, . . . , k} = (a1, b1) × (a2, b2) × (ak, bk).

These are the analogs of (a, b) in the one-dimensional case. Let Rk be the σ-field on Rk generated by these rectangles; this is called the k-dimensional Borel σ-field. Note that R1 ∩ (0, 1] = B by Theorem 5.1.1 above, so there really is no difference between Borel sets on the unit interval as defined in Chapter 1 and the Borel sets defined here – in particular both can be generated by open or closed sets. In a moment we will define a general measure µ to be a function whose output may be infinity. In order to make sense of this, we need to be able to do arithmetic and understand inequalities involving ∞. Assume all numbers below are in [0, ∞]. ˆ For any x ∈ [0, ∞], ∞ + x = ∞. For any finite x, ∞ − x = ∞ as well. ˆ There are other conventions (see Billingsley) for sequences and series on [0, ∞] but everything is as expected.

80 5.1 Measures 5 Abstract Measure Theory

ˆ If x ≤ ∞ then either x = ∞ or x is finite. The notation x < ∞ means x is finite as usual.

The moral of this story is that we can do arithmetic with +∞ but only when the space consists of nonnegative real numbers – anytime you have negative numbers (i.e. the cancel- lation of addition) as well as +∞, things get messy. There’s a separate theory of bounded measures (i.e. finite measures) that may take on negative values.

Definition. Let F be a field on a space Ω. A set function µ : F → R is a measure if (1) For all A ∈ F, µ(A) ∈ [0, ∞]. In other words, a measure is a function µ : F → [0, ∞].

(2) µ(∅) = 0. S∞ (3) If A1,A2,... ∈ F are disjoint and n=1 An ∈ F then

∞ ! ∞ [ X µ An = µ(An). n=1 n=1

Definition. A measure µ is finite if µ(Ω) < ∞ and infinite if µ(Ω) = ∞.

Notice that if µ is a finite measure Ω, we may rescale by µ(Ω) to define a finite measure µ0 with µ0(Ω) = 1, i.e. a probability measure. Therefore we see that finite measures and probability measures are really one and the same.

Definition. Let F be a σ-field on a space Ω. Then (Ω, F) is called a and if µ is a measure on F then (Ω, F, µ) is called a measure space.

Definition. If A ∈ F has the property that µ(AC ) = 0 then A is called a support of µ, or alternatively, µ is concentrated on A.

Remark. As we saw in the probability measure case, if µ is a finite measure, A is a support if and only if µ(A) = µ(Ω).

Definition. A measure µ on a field F is σ-finite if there exists a countable collection ∞ S∞ {An}n=1 ⊂ F so that µ(An) < ∞ for all n and n=1 An = Ω. Note that disjointness is not required in the definition above. Moreover, a σ-finite measure space is precisely a countable union of finite measure spaces (subspaces).

Definition. For a subcollection A ⊂ F, we say µ is σ-finite on A if the σ-finite condition ∞ holds for some collection {An}n=1 ⊂ A. Many properties from Section 2.1 are the same for general measures. Let µ be a measure on a field F.

ˆ Finite additivity is implied by condition (3).

ˆ (Monotonicity) If A ⊆ B then µ(A) ≤ µ(B).

81 5.1 Measures 5 Abstract Measure Theory

ˆ S∞ (Countable subadditivity) If n=1 An ∈ F then

∞ ! ∞ [ X µ An ≤ µ(An). n=1 n=1

ˆ S∞ (Continuity from below) If A1 ⊂ A2 ⊂ A3 ⊂ · · · and A = n=1 An ∈ F then lim µ(An) = µ(A). n→∞ ˆ (Finite inclusion-exclusion) For any A, B ∈ F, µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B).

The proofs are the same as in Section 2.1. However, there are some differences here in the abstract case: ˆ T∞ Continuity from above does not hold: A1 ⊃ A2 ⊃ · · · and n=1 An = A do not imply lim µ(An) = µ(A). However, if µ(An) < ∞ for some An in the chain, then continuity n→∞ from above does hold.

ˆ If µ is σ-finite on F, F does not contain an uncountable subcollection of disjoint sets with positive measure.

Theorem 5.1.3. Suppose (Ω, F) is a measurable space and µ1 and µ2 are measures on F such that µ1|P = µ2|P for a π-system P ⊂ F. If Ω is σ-finite with respect to µ1 and µ2 then the measures agree on the σ-field generated by P: µ1|σ(P) = µ2|σ(P).

Proof. Suppose B ∈ P with µ1(B) = µ2(B) < ∞. Define

LB = {A ∈ σ(P) | µ1(A ∩ B) = µ2(A ∩ B)}.

As in the proof of Theorem 2.4.1, LB is a λ-system containing P so by the π-λ theorem, S∞ LB ⊃ σ(P). By σ-finiteness, Ω = k=1 Bk where Bk ∈ P and µ1(Bk) = µ2(Bk) < ∞ for each k. Then by the principle of inclusion-exclusion, for i = 1, 2 and any n ≥ 2,

n ! n [ X X µi (Bk ∩ A) = µi(Bk ∩ A) − µi(Bj ∩ Bk ∩ A) + ... k=1 k=1 1≤j

The sums on the right are all the same for i = 1, 2 since any Bk1 ∩ · · · ∩ Bkm ∩ A lies in LB. Therefore n ! ∞ ! [ [ µ1 (Bk ∩ A) = µ2 (Bk ∩ A) k=1 k=1 S∞ and as n → ∞, µ1 = µ2 on σ(P) since k=1 Bk = Ω. This is a generalization of the uniqueness theorem for probability measures (Theo- rem 2.3.1). Uniqueness does not hold without the σ-finite property.

Corollary 5.1.4. If µ1 and µ2 are finite on σ(P) for a π-system P and Ω is a countable union of P-sets then µ1 = µ2 on σ(P).

82 5.2 Outer Measure 5 Abstract Measure Theory

Examples.

1 Let (Ω, F) be a measurable space. Take a set {wi | i ∈ N} ⊂ Ω. Suppose we have a sequence (mi) ⊂ [0, ∞]. For all A ∈ F, we define the discrete measure P by µ(A) = miδwi where ( 1 if wi ∈ A δwi (A) = 0 if wi 6∈ A.

If {wi} ∈ F for each wi in the collection, then (Ω, F, µ) is σ-finite if and only if mi < ∞ for all i. In general µ is called an atomic measure and the wi are called atoms. In some cases we may even replace the atoms wi with subsets of Ω.

2 Let F = P(Ω), the of Ω. We define the counting measure on sets A ∈ F by ( |A| if A is finite µ(A) = ∞ A is infinite. Then µ is finite if and only if Ω is finite, and µ is σ-finite if and only if Ω is countable. 3 If F is a field, µ is a measure on F and F ⊂ F is a subfield then µ = µ| is a 0 0 F0 measure on F0.

5.2 Outer Measure

Definition. A set function µ∗ : P(Ω) → R is an outer measure if it satisfies (1) µ∗(A) ∈ [0, ∞] for all A ⊂ Ω.

(2) µ∗(∅) = 0. (3) (Monotonicity) If A ⊆ B then µ∗(A) ≤ µ∗(B).

∞ (4) (Countable subadditivity) If {Ai}i=1 ⊂ P(Ω) then

∞ ! ∞ ∗ [ X ∗ µ Ai ≤ µ (Ai). i=1 i=1

Remark. If ρ : A → [0, ∞] is a set function on A ⊂ Ω, ∅ ∈ A and ρ(∅) = 0 then we can define an outer measure µ∗ on Ω by

( ∞ ∞ ) ∗ X [ µ (A) = inf ρ(Ai): Ai ∈ A for all i and Ai ⊃ A . i=1 i=1 Examples. 1 If A = {(a, b] : 0 ≤ a < b ≤ 1} on Ω = (0, 1] and ρ((a, b]) = b − a, then the outer measure µ∗ is Lebesgue outer measure on the unit interval, as defined in Section 2.3.

83 5.2 Outer Measure 5 Abstract Measure Theory

n 2 Let Ω = R . If A = {(a1, b1) × (a2, b2) × · · · × (an, bn) | ai < bi for all i} and Qn ρ(A) = i=1(bi − ai) is the n-dimensional volume function, then the corresponding µ∗ is Lebesgue outer measure on Rn. n/2 n n π ε 3 Suppose A = {Bε(¯x) | ε > 0, x¯ ∈ R } and ρ(Bε(¯x)) = vol(Bε(¯x)) = n , where Γ 2 + 1 Γ(s) is Euler’s Gamma function. Then the corresponding µ∗ is also Lebesgue outer measure, as in 2 .

n r 4 Again let A = {Bε(¯x) | ε > 0, x¯ ∈ R } and this time define ρ(Bε(¯x)) = ε for a fixed 0 ≤ r ≤ n. Then µ∗ is called the r-dimensional Hausdorff outer measure on Rn. Notice that if r = n, this is just a scalar multiple of the Lebesgue outer measure. We also have the special cases:

If n > 2 and r = 2, µ∗ represents (up to a scalar) surface area. If n > 1 and r = 1, µ∗ represents (up to a scalar) arc length.

Definition. A set A ⊂ Ω is µ∗-measurable if for every E ⊂ Ω, the Carath´eodory condition is met: µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ AC ).

Theorem 5.2.1. Let M(µ∗) be the set of µ∗-measurable sets on a space Ω. Then M(µ∗) is ∗ ∗ a σ-field and µ |M(µ∗) is a measure on M(µ ). Proof. The details are the same as in Lemma 2.3.5.

Definition. A collection A ⊂ P(Ω) is called a semiring if

(1) ∅ ∈ A. (2) For all A, B ∈ A, A ∩ B ∈ A.

(3) Whenever A, B ∈ A such that A ⊂ B, there exist disjoint C1,...,Cn ∈ A such that

n [ B r A = Ck. k=1

Lemma 5.2.2. Suppose A, A1,...,An are elements of a semiring A. Then there exist dis- joint C1,...,Cm ∈ A such that

C C A ∩ A1 ∩ · · · ∩ An = C1 ∪ · · · ∪ Cm.

Proof. If n = 1, this is just the semiring property (3). Induct.

84 5.2 Outer Measure 5 Abstract Measure Theory

Theorem 5.2.3 (Extension Theorem). If µ : A → [0, ∞] is a set function on a semiring A such that µ(∅) = 0, µ is finitely additive and countably subadditive, then µ extends to a measure on σ(A).

Proof. Note that if A ⊂ B and {Ck} ⊂ A satisfy the semiring condition (3) for B r A, then n X µ(B) = µ(A) + µ(Ck) by finite additivity. So µ(B) ≥ µ(A) and therefore µ is monotone. k=1 Define the outer measure µ∗ : P(Ω) → [0, ∞] on a set A ∈ A by ( ∞ ∞ ) ∗ X [ µ (A) = inf µ(Ai): Ai ∈ A, Ai ⊃ A . i=1 i=1 Let M(µ∗) be the set of µ∗-measurable sets. By Theorem 5.2.1, M(µ∗) is a σ-field and ∗ ∗ µ |M(µ∗) is a measure on M(µ ). We next show A ⊂ M(µ∗). Take A ∈ A and E ⊂ Ω. If µ∗(E) = ∞, we’re done because ∞ ≥ µ∗(A ∩ E) + µ∗(AC ∩ E). On the other hand, if µ∗(E) < ∞ then for every ε > 0, there ∞ ∞ ∞ [ X ∗ ∗ exists a sequence {An}n=1 ⊂ A such that An ⊃ E and µ (An) ≤ µ (E) + ε. Now A n=1 n=1 is a semiring, so Bn := An ∩ A lies in A for all n. Notice that the Bn cover A ∩ E, and the mn C [ An rBn cover A ∩E. By the semiring property, An rBn = Cnk for some Cnk ∈ A. Then k=1 ∞ ∞ mn ∗ X ∗ C X X µ (A ∩ E) ≤ µ(Bn) and µ (A ∩ E) ≤ µ(Cnk ). Notice that by finite additivity, n=1 n=1 k=1 m Xn µ(An) = µ(Bn) + µ(Cnk ) for each n. Putting this together, we have k=1

∞ ∞ mn ∗ ∗ C X X X µ (A ∩ E) + µ (A ∩ E) ≤ µ(Bn) + µ(Cnk ) n=1 n=1 k=1 ∞ m ! X Xn = µ(Bn) + µ(Cnk ) n=1 k=1 ∞ X = µ(An) n=1 ≤ µ∗(E) + ε. Since ε > 0 was arbitrary, this shows the Carath´eodory condition holds for A, so A ∈ M(µ∗). ∗ ∗ Finally we show that µ |A = µ. First, any A ∈ A covers itself, so µ (A) ≤ µ(A). On the ∞ other hand, if {An}n=1 is a cover of A by A-sets then ∞ ∞ ! X [ µ(An) ≥ µ An ≥ µ(A) n=1 n=1 by countable subadditivity and monotonicity. Taking the infimum shows that µ∗(A) ≥ µ(A), so we conclude that µ∗(A) = µ(A).

85 5.2 Outer Measure 5 Abstract Measure Theory

Example 5.2.4. Take A to be the collection of intervals in (0, 1] with λ((a, b]) = b−a. Then A is a semiring and λ extends to Lebesgue measure on (0, 1]. Similarly, length λ defined on the semiring A of finite intervals in R extends to Lebesgue measure on R1. We can generalize this construction to Rn by taking A to be the semiring of bounded n-dimensional rectangles; this is a semiring because removing a rectangle from a larger rectangle leaves a finite union of rectangles. Then the hypervolume function λ extends to n-dimensional Lebesgue measure on Rn. Definition. For sets A and B, their symmetric difference A4B is defined to be A4B = (A r B) ∪ (B r A). Theorem 5.2.5. Suppose A is a semiring and µ is a measure on σ(A) that is σ-finite on A.

∞ (1) For every B ∈ σ(A) and for every ε > 0, there exist {An}n=1 ⊂ A such that

∞ ∞ ! [ [ B ⊂ An and µ An r B < ε. n=1 n=1

(2) If in addition µ(B) < ∞, then there exist A1,...,An ∈ A such that

µ(B4(A1 ∪ · · · ∪ An)) < ε.

∗ Proof. (1) First assume µ(B) < ∞. Since µ|σ(A) = µ |σ(A) by the uniqueness theorem, ∗ ∞ S∞ µ(B) = µ (B). Then there exists a sequence {An}n=1 ⊂ A such that n=1 An ⊃ B and for all ε > 0, ∞ ! ∞ [ X ∗ µ An ≤ µ(An) ≤ µ (B) + ε = µ(B) + ε. n=1 n=1

These An satisfy (1). On the other hand if µ(B) = ∞, by σ-finiteness there exist Bn ∈ A S∞ ∞ such that Ω = n=1 Bn and µ(Bn) < ∞ for all n. In this case there exist {Ani }i=1 ⊂ A such S∞ that for each n, i=1 Ani ⊃ B ∩ Bn and

∞ ! [ ε µ A (B ∩ B ) < . ni r n 2 i=1

Then these An satisfy the desired properties. ∞ S∞ (2) Finally, if µ(B) < ∞, then choose {An}n=1 as in part (1). Then µ ( n=1 An) < ∞ S∞ ε SN S∞ and µ ( n=1 An r B) < 2 . Moreover, n=1 An r B −→ n=1 An r B so by continuity from below, there exists an N such that

∞ N ! [ [ ε µ A A < . n r n 2 n=1 n=1

86 5.3 Lebesgue Measure on Rn 5 Abstract Measure Theory

Then the measure of the symmetric difference in (2) can be estimated as

N !! N !! N ! ! [ [ [ µ B4 An = µ B r An + µ An r B n=1 n=1 n=1 ∞ N ! ∞ ! [ [ [ ≤ µ An r An + µ An r B n=1 n=1 n=1 ε ε < + = ε. 2 2

5.3 Lebesgue Measure on Rn

Define A as the set of bounded n-dimensional rectangles with rational coordinates in Rn, along with the empty set. This is a semiring as mentioned in Example 5.2.4 and σ(A) n contains the open sets in R . A is also clearly a π-system. Define λn(A) to be the volume of A for any A ∈ A. This is a measure on A and the conditions of the uniqueness and extension theorems hold so we can define

n Definition. The unique extension of λn(A) = vol(A) to the Borel sets R is called the Lebesgue measure on Rn.

Proposition 5.3.1. λn is translation-invariant.

Proof sketch. Volume is translation-invariant and one can check that if {An} ⊂ A cover A n then for any x ∈ R , {An + x} ⊂ A and {An + x} cover A + x.

Corollary 5.3.2. Hyperplanes in Rn have Lebesgue measure 0.

Proof. Fix a hyperplane H. Then Rn is the uncountable union of all the translations of H and λn is preserved under translation, so they all have the same measure. If any were n positive, this would contradict the σ-finite property of λn on R . Therefore λn(H) = 0.

Proposition 5.3.3. If T : Rn → Rn is a non-singular linear transformation and A is λn-measurable then TA is λn-measurable and

λn(TA) = | det T | λn(A).

Proof omitted.

Corollary 5.3.4. Lebesgue measure on Rn is invariant under rotations. The main idea in this section is the following: given a function F with certain properties, called a distribution function, a measure can be uniquely defined from F . We start by exploring this in the one-dimensional case. Suppose µ is a measure on the real line R

87 5.3 Lebesgue Measure on Rn 5 Abstract Measure Theory

such that every bounded, measurable set A has finite measure. Define the cumulative distribution function F by ( µ(0, x] x ≥ 0 F (x) = µ(x, 0] x < 0.

If µ is a finite measure, then F is bounded and in this case Fe(x) = µ(−∞, x] is well-defined and satisfies Fe − F = µ(−∞, 0]. So we see that F and Fe contain the same information in the finite case. A cumulative distribution function F (often shortened to a distribution) has the following properties: (1) F is nondecreasing, by monotonicity of µ.

(2) F is right-continuous, i.e. for every x ∈ R, lim F (y) = F (x). y→x+ In the case that x ≥ 0, this is a consequence of continuity from above, which holds since we have a finite condition on µ. In the case when x < 0, this follows from continuity from below, which always holds. (Monotone, right-continuous functions are nice since they may only have jump discontinuities, and in fact may only have a countable number of jump discontinuities.) (3) For any a < b, µ(a, b] = F (b) − F (a). (4) The Lebesgue measure λ has distribution F (x) = x. (5) µ determines F uniquely up to an additive constant.

Theorem 5.3.5. If F is a nondecreasing, right-continuous function on R then there exists a unique measure µ on B = R1 such that µ(a, b] = F (b) − F (a) for all a < b in R. Proof. The general case below will imply this theorem.

Now suppose B = Rn is the Borel σ-field on Rn, for n ≥ 2. Recall that B is generated by the collection of bounded rectangles

( n ) Y J = (ai, bi]: ai ≤ bi for all i . i=1

n Define S = {Sx¯ | x¯ ∈ R }, wherex ¯ = (x1, . . . , xn) and n Y Sx¯ = (−∞, xi]. i=1

Any bounded rectangle A is thus the finite difference of some collection of Sx¯. Thus S generates J which generates B, so S generates B. This is advantageous because S is a π-system.

88 5.3 Lebesgue Measure on Rn 5 Abstract Measure Theory

Suppose µ is a measure on Rn that is finite on bounded sets. In the two-dimensional case, inclusion-exclusion allows us to write the measure of a rectangle A as

µ(A) = µ(Sx¯1 ) − µ(Sx¯2 ) − µ(Sx¯3 ) + µ(Sx¯4 ),

n wherex ¯1, x¯2, x¯3, x¯4 are the vertices of A. In the n-dimensional case, A has 2 vertices {x¯1,..., x¯2n } wherex ¯i = (xi1, . . . , xin). For each j = 1, . . . , n, either xij = aj or bj for a left endpoint aj or a right endpoint bj. That is,

n Y A = (aj, bj]. j=1

For each vertex ~xi, let ( 1 if xij = aj an even number of times sgnx¯i (A) = −1 if xij = aj an odd number of times.

Define ∆ F = P sgn (A)F (¯x ). Note that if F (¯x) = µ(S ) then F is right-continuous A x¯i x¯i i x¯ (this means that F is continuous on limits taken in each component from the right). The condition ∆AF ≥ 0 will replace the monotonicity condition in the theorem below.

Theorem 5.3.6. Suppose F is right-continuous on Rn and for every rectangle A ∈ J , ∆AF ≥ 0. Then there exists a unique measure µ on B satisfying µ(A) = ∆AF for every A ∈ J .

Proof. Since ∅ is a degenerate rectangle, ∅ is measurable with µ(∅) = 0. Note that J is a π-system generating B so once we check µ exists, we will also have uniqueness. Also note that J is a semiring and µ(A) = ∆AF is defined on J so by the extension theorem (5.2.3) it only remains to check µ is finitely additive and countably subadditive on J . Qn Suppose A ∈ J can be written A = j=1 Jj, where Jj = (aj, bj]. A partition aj = tj0 < SM Qn tj1 < . . . < tjk < . . . < tjm = bj is regular if A = `=1 A` where M = j=1 m,

n Y A` = Jjk and Jjk = (tj(k−1), tjk]. j=1

PM This can be thought of as a product partition. If we write `=1 ∆A` F then each interior vertex appears an even number of times in the sum with cancelling signs. Each real vertex on the other hand appears once with the correct sign, so

M X ∆A` F = ∆AF. `=1 If we have an irregular partition, we may subdivide further to form a regular partition both of A and of each rectangle A` in the irregular partition. By the previous work, it all adds up correctly. Hence µ is finitely additive.

89 5.3 Lebesgue Measure on Rn 5 Abstract Measure Theory

S∞ Finally, suppose A = k=1 Ak, where A, A1,A2,... ∈ J . We want to show that µ(A) ≤ P∞ Qn k=1 µ(Ak). Let A = i=1(ai, bi]. For each i = 1, . . . , n, use continuity from the right to Qn choose δi > 0 such that B = i=1(ai + δi, bi] satisfies µ(B) > µ(A) − ε. Now A ⊃ B ⊃ B and B is closed and bounded, and hence compact by the Heine-Borel theorem. For each Qn Qn k, Ak = i=1(aki, bki] is not open, so choose δk > 0 such that Bk = i=1(aki, bki + δk] and ◦ ◦ ε µ(Bk) < µ(Ak) + 2k . Then Ak ⊂ Bk ⊂ Bk so {Bk} is an open cover of B. Since B is ◦ N compact, there exists a finite subcover {Bk}k=1. Then

N X µ(A) − ε < µ(B) ≤ µ(Bk) by finite subadditivity k=1 N X  ε  < µ(A ) + k 2k k=1 ∞ ! X < µ(Ak) + ε. k=1 Taking ε → 0 gives the result. Therefore there is a unique µ on B such that µ(·) restricts to ∆(·)F on J . Definition. A measure µ on a σ-field F is regular if it satisfies

(1) For all A ∈ F and for all ε > 0, there exist an open set G and a closed set C such that C ⊂ A ⊂ G and µ(G r C) < ε. (2) If µ(A) < ∞ then µ(A) = sup{µ(K) | K ⊂ A and K is compact}.

Proposition 5.3.7. Every measure µ on R1 assigning finite measure to bounded sets is regular.

ε Proof. If we assume µ(A) < ∞, there exists a bounded set A0 such that µ(A r A0) < 2 . Then (2) follows from (1) since there will exist a closed subset K ⊂ A which is compact ε by Heine-Borel and satisfies µ(A0 r K) < 2 . Letting ε → 0 shows that µ(A) is equal to sup{µ(K) | K ⊂ A is compact}. Now to prove (1), suppose A is a bounded rectangle. Take nearby open and closed rectangles using continuity from above and below. Since the collection of rectangles forms a ∞ semiring and generates B, by Theorem 5.2.5 there exists a countable collection {Ak}k=1 such S∞ ε that µ ( k=1 Ak r A) < 2 . Then for each Ak, choose nearby open and closed rectangles, say ε within 2k+1 , and add them up. Examples.

1 A Peano curve is an example of a space-filling curve, i.e. a map α : [0, 1] → R2 that is one-to-one but has positive area. Such a curve can even be made to fill the area of a square!

90 5.4 Measurable Functions 5 Abstract Measure Theory

2 (Banach-Tarski Paradox) Two sets measurable sets A, B ∈ Rn are said to be congru- ent if there exists an isometry between them, that is, a bijection ϕ : Rn → Rn such that ϕ(A) = B and λn is invariant under ϕ. This can be extended in the following way: any two sets A and B are said to be congruent if they can be decomposed into Sk Sk A = i=1 Ai and B = i=1 Bi such that Ai and Bi are congruent for each 1 ≤ i ≤ k. The Banach-Tarski Theorem says that when n ≥ 3, any two bounded sets A and B with nonempty interiors are congruent. The paradoxical demonstration of the theorem is that, in R3, one can decompose two solid spheres of different radius into finitely many nonmeasurable pieces that are congruent with the other sphere’s decomposition. Therefore any two spheres are congruent, even a baseball and the sun.

5.4 Measurable Functions

Definition. Let (Ω, F) and (Ω0, F 0) be measurable spaces. A function f :Ω → Ω0 is said to be measurable with respect to F/F 0 if for every A ∈ F 0, f −1(A) ∈ F. Definition. A real-valued f is called a random variable. If f(Ω) is finite, we say f is a simple random variable, as in Chapter 4. Theorem 5.4.1. Let f : (Ω, F) → (Ω0, F 0) be a measurable function. (1) If A0 generates F 0, it is sufficient to check the measurable condition for f on A0-sets.

(2) If (Ω00, F 00) is a measurable space and g :Ω0 → Ω00 is measurable, then g ◦ f :Ω → Ω00 is also measurable. Proof. Obvious from the definitions.

We are interested in measurable functions into Rn. Definition. For a measurable space (Ω, F), a measurable function f :Ω → Rn is called a random vector.

n Proposition 5.4.2. A function f :Ω → R of the form f(ω) = (f1(ω), . . . , fn(ω)) is measurable if and only if fi :Ω → R is measurable for each i = 1, . . . , n. n Qn Proof. R is generated by products (aka rectangles). So if A = i=1(ai, bi] then n −1 [ −1 f (A) = fi (ai, bi] ∈ F i=1 since F is a σ-field. The result follows. The definition of a measurable function in terms of pullbacks of measurable sets should recall the definition of continuous functions in analysis/topology. The next proposition shows that they are indeed related.

Proposition 5.4.3. If f :Ω → Rn is continuous, then f is measurable with respect to F/Rn.

91 5.4 Measurable Functions 5 Abstract Measure Theory

Proof. This follows from the fact that Rn is generated by open sets. There are plenty of functions that are measurable but not continuous (e.g. monotone functions, including step functions). So continuity is a strictly stronger notion than measur- ability.

n Proposition 5.4.4. Suppose fj :Ω → R are measurable for j = 1, . . . , n and g : R → R is measurable. Then g(f1(ω), . . . , fn(ω)) : Ω → R is measurable. Proof. Follows the analagous proof of Theorem 4.0.4. We immediately obtain

Corollary 5.4.5. If f, g :Ω → R are measurable then the following are also measurable: (1) af + bg for any a, b ∈ R. (2) (fg)(ω) = f(ω)g(ω).

 f  f(ω) (3) g (ω) = g(ω) provided g(ω) 6= 0.

(4) max{f(ω), g(ω)} and min{f(ω), g(ω)}.

(5) The composition of f with any continuous function h : R → R. Next, we need to understand how limits interact with measurable functions. For the moment, allow f :Ω → R ∪ {±∞} so that our limits may approach ±∞. ∞ Theorem 5.4.6. If {fi}i=1 is a sequence of measurable functions Ω → R then

(1) supi fi(ω), infi fi(ω), lim sup fi and lim inf fi are all measurable.

(2) If limi→∞ fi exists everywhere then it is measurable.

(3) The event [ω : limi→∞ fi(ω) exists] is measurable.

(4) If f :Ω → R is a measurable function then the event [ω : limi→∞ fi(ω) = f(ω)] is measurable.

Proof. This all relies on the fact that for any x ∈ R, ∞ \ A = [ω : sup fi(ω) ≤ x] = [ω : fi(ω) ≤ x] i=1 is a measurable event (by the measurability of countable intersections). Since the collection {(−∞, x]: x ∈ R} generates B, it suffices to check (1) – (4) for sets of this form. (1) The measurability of A above shows that sup fi – and by symmetry, inf fi – is mea- surable. Then we can express the lim sup in the following way,   lim sup fi = inf sup fi i→∞ n∈N i≥n

92 5.5 Distribution Functions 5 Abstract Measure Theory

and see that it is measurable since A is. A similar proof shows that lim inf fi is measurable. (2) If lim fi exists, it is equal to lim sup fi which is measurable by (1). (3) We can write [ω : lim fi(ω) exists] = [ω : lim sup fi(ω) − lim inf fi(ω) = 0]. Then since lim sup fi and lim inf fi are measurable by (2) and differences of measurable functions are measurable by Corollary 5.4.5, this set is measurable. (4) Finally, we can write B = [ω : limi→∞ fi(ω) = f(ω)] as

B = [ω : lim sup fi(ω) = lim inf fi(ω)] ∩ [ω : lim sup fi(ω) = f(ω)] which by (1) and (3) is the intersection of measurable sets, and hence is itself measurable.

Theorem 5.4.7. Suppose f :Ω → R is measurable. Then there exists a sequence of simple random variables fn :Ω → R such that if f(ω) ≥ 0 for all ω ∈ Ω, fn(ω) ≥ 0 for all n, ω and fn converges from below to f; and conversely if f(ω) ≤ 0 for all ω ∈ Ω then fn(ω) ≤ 0 for all n, ω and fn converges from above to f.

Proof sketch. First, f(ω) = χ{f≥0}(ω) + χ{f≤0}(ω) so f is measurable if and only if the characteristics functions are. Thus is suffices to prove the nonnegative case. Assume f(ω) ≥ 0 for all ω. Use dyadic intervals to subdivide the R. Then the sequence ( n if f(ω) ≥ n f (ω) = n k n k k+1 2n for k = 0,..., 2 n if 2n ≤ f(ω) ≤ 2n satisfies the desired conditions. Definition. Suppose (Ω, F, µ) is a measure space, (Ω0, F 0) is a measurable space and T is 0 −1 a measurable function with respect to F/F . Then T∗µ(A) := µ(T (A)) is called the push forward measure of T on (Ω0, F 0).

5.5 Distribution Functions

Definition. Suppose X :Ω → R is a random variable on a probability space (Ω, F,P ). The distribution measure of X is µ(A) = P [ω : X(ω) ∈ A] and the distribution function of X is F (x) = µ(−∞, x] = P [ω : X(ω) ≤ x].

Notice that distribution measure is a push forward measure: P [ω : X(ω) ∈ A] = X∗P (A). Proposition 5.5.1. A distribution F is monotone nondecreasing and right continuous. Since F is monotone, F (x−) := lim F (y) also exists. By continuity from below, we can y→x− write F (x−) = lim µ(−∞, y] so that y→x−

F (x) − F (x−) = µ(−∞, x] − µ(−∞, x) = µ({x}).

Since Ω is a finite measure space, the above implies that there are most countably many jump discontinuities of F – and it is known that for a monotone, right continuous function, these are the only discontinuities F may have.

93 5.5 Distribution Functions 5 Abstract Measure Theory

Theorem 5.5.2. If F is nondecreasing, right continuous and satisfies lim F (x) = 0 and lim F (x) = 1, x→−∞ x→∞ then there exists a probability space (Ω, F,P ) and a random variable X on Ω such that F is the distribution function of X. Proof. By Theorem 5.3.5, there exists a measure µ on R such that µ(a, b] = F (b) − F (a) for all a < b. Note that µ is a probability measure in this case, so the probability space (R, B, µ) and the random variable X(x) = x satisfy the theorem. Example 5.5.3. Suppose F is the waiting time distribution function from Example 5 of Section 4.9, i.e. let F (x) represent the probability of waiting less than time x for some event. By convention, we set F (x) = 0 when x ≤ 0. Then for any x, y ≥ 0, F satisfies 1 − F (x + y) = 1 − F (y). 1 − F (x) Let U(x) = 1 − F (x) so that this may be written U(x + y) = U(x)U(y). This is clearly an exponential function: U(x) = e−αx for some α > 0. Then F (x) = 1 − e−αx is called an exponential distribution.

Definition. Suppose F and {F_n} are distribution functions. We say F_n converges weakly to F, denoted F_n ⇀ F, if lim_{n→∞} F_n(x) = F(x) for any x at which F is continuous.

In analysis, it is common to rescale and translate the F_n's when computing weak convergence, e.g. instead of showing F_n(x) ⇀ F(x) we might show F_n(a_n x + b_n) ⇀ F(x). If X_1, ..., X_n are i.i.d. random variables with distribution function G, set M_n(ω) = max{X_1(ω), ..., X_n(ω)}. Then
\[
P[\omega : M_n(\omega) \le x] = P\!\left(\bigcap_{k=1}^n [\omega : X_k(\omega) \le x]\right) = \prod_{k=1}^n P[\omega : X_k(\omega) \le x] = G(x)^n.
\]
Then F_n(x) = G(x)^n is the distribution function of M_n.

Examples.

1. Consider the exponential distribution G(x) = 1 − e^{−αx} from above. Then M_n has distribution F_n(x) = G(x)^n = (1 − e^{−αx})^n. However, for all x > 0, F_n(x) → 0, which isn't a valid probability distribution. So instead we consider the shifted sequence P[M_n ≤ x + (1/α) log n] (with α as above):
\[
\begin{aligned}
P\left[\omega : M_n(\omega) \le x + \tfrac{1}{\alpha}\log n\right] &= F_n\!\left(x + \tfrac{\log n}{\alpha}\right) = G\!\left(x + \tfrac{\log n}{\alpha}\right)^{\!n} \\
&= \left(1 - e^{-(\alpha x + \log n)}\right)^{\!n} = \left(1 - \frac{e^{-\alpha x}}{n}\right)^{\!n} \rightharpoonup e^{-e^{-\alpha x}}.
\end{aligned}
\]
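A short sanity check of Example 1 (α, x and the values of n are arbitrary choices): G(x + (log n)/α)^n should approach the limit e^{−e^{−αx}} as n grows.

```python
import numpy as np

alpha = 2.0
G = lambda x: 1 - np.exp(-alpha * np.maximum(x, 0.0))   # exponential distribution function

x = 0.5
for n in [10, 100, 10_000, 1_000_000]:
    Fn_shifted = G(x + np.log(n) / alpha) ** n           # P[M_n <= x + (log n)/alpha]
    print(n, Fn_shifted, np.exp(-np.exp(-alpha * x)))    # approaches the limit e^{-e^{-alpha x}}
```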


2. Next, consider the distribution
\[
G(x) = \begin{cases} 0 & x \le 1, \\ 1 - x^{-\alpha} & x \ge 1. \end{cases}
\]

[Figure: graph of the distribution function G, equal to 0 for x ≤ 1 and increasing toward 1 as x → ∞.]

Then the sequence F_n(n^{1/α} x) converges weakly to e^{−x^{−α}} for x ≥ 0, so this is the distribution function for lim M_n. This is because
\[
F_n(n^{1/\alpha} x) = G(n^{1/\alpha} x)^n = \left(1 - (n^{1/\alpha} x)^{-\alpha}\right)^{\!n} = \left(1 - \frac{x^{-\alpha}}{n}\right)^{\!n}
\]

and this last expression converges to e^{−x^{−α}} wherever it is continuous.
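As with Example 1, here is a brief numerical check of Example 2 (the parameter values are my own choices): G(n^{1/α} x)^n approaches e^{−x^{−α}} for a fixed x > 0.

```python
import numpy as np

alpha = 3.0
G = lambda x: np.where(x <= 1, 0.0, 1 - x**(-alpha))    # Pareto-type distribution function

x = 1.7
for n in [10, 1_000, 100_000]:
    print(n, G(n**(1 / alpha) * x) ** n, np.exp(-x**(-alpha)))   # the two values converge together
```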

3. Similarly, consider
\[
G(x) = \begin{cases} 0 & x \le 0, \\ 1 - (1 - x)^\alpha & 0 \le x \le 1, \\ 1 & x \ge 1. \end{cases}
\]

Define F_n = G^n, the distribution of M_n, and rescale as in the previous examples:
\[
F_n(n^{-1/\alpha} x + 1) = \begin{cases} 0 & x \le -n^{1/\alpha}, \\[4pt] \left(1 - \dfrac{(-x)^{\alpha}}{n}\right)^{\!n} & -n^{1/\alpha} \le x \le 0, \\[4pt] 1 & x > 0. \end{cases}
\]

Then F_n ⇀ F, the distribution of lim M_n, given by

\[
F(x) = \begin{cases} e^{-|x|^\alpha} & x \le 0, \\ 1 & x > 0. \end{cases}
\]


4. Define the Heaviside function ∆ to be the distribution
\[
\Delta(x) = \begin{cases} 0 & x < 0, \\ 1 & x \ge 0. \end{cases}
\]

Notice that ∆ is the distribution function of the random variable X ≡ 0. Let X1,X2,... be i.i.d. random variables with distribution

\[
X_n = \begin{cases} +1 & \text{with probability } p = \tfrac{1}{2}, \\ -1 & \text{with probability } q = \tfrac{1}{2}. \end{cases}
\]

As in Theorem 4.4.1, set S_n = X_1 + ... + X_n. Then the Weak LLN (1.1.1) says that P[|\tfrac{1}{n}S_n| > ε] converges to 0 as n → ∞ for every ε > 0. Thus if F_n is the distribution function of \tfrac{1}{n}S_n, this implies F_n ⇀ ∆. Notice that if n is odd, P[S_n = 0] = 0, so the symmetry of [S_n ≤ 0] and [S_n ≥ 0] tells us that F_n(0) = 1/2. Thus F_n does not converge to ∆ at 0, but it doesn't matter since weak convergence doesn't take into account the jump discontinuity of ∆ at 0.
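A simulation sketch of Example 4 (the sample size and evaluation points are my own choices): the empirical distribution function of S_n/n tends to 0 at negative arguments and to 1 at positive ones, while its value at 0 stays near 1/2 for odd n – exactly the exceptional point allowed by weak convergence.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_Fn(n, trials=200_000):
    """Empirical distribution function of S_n / n for +/-1 coin flips,
    using S_n = 2 * Binomial(n, 1/2) - n."""
    Sn = 2 * rng.binomial(n, 0.5, size=trials) - n
    vals = Sn / n
    return lambda x: np.mean(vals <= x)

for n in [11, 101, 1001]:                 # odd n, so P[S_n = 0] = 0
    Fn = empirical_Fn(n)
    print(n, Fn(-0.1), Fn(0.0), Fn(0.1))  # tends to 0, stays near 1/2, tends to 1
```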


6 Integration Theory

6.1 Measure-Theoretic Integration

For this chapter we assume f and g are measurable functions on some measure space (Ω, F, µ). We would like to define an integral ∫ f dµ when it exists.

First we consider nonnegative functions. Assume f(ω) ≥ 0 for all ω ∈ Ω. (We are allowing the possibility that f(ω) = +∞ for some ω.) Suppose {A_i}_{i=1}^n is a finite decomposition of Ω into measurable sets. To express the 'area under f' with respect to µ, we have the following sum:

\[
\sum_{i=1}^n \Bigl(\inf_{\omega \in A_i} f(\omega)\Bigr)\, \mu(A_i).
\]
Graphically, this is similar to the approach taken in Riemann integration of adding up the areas of rectangles approximating a curve:

[Figure: the graph of f over R, with rectangles under the curve corresponding to the pieces of a decomposition.]

(Of course, the function f need not be continuous.) We define the measure-theoretic integral of f to be the supremum of these sums over all such decompositions.

Definition. For a nonnegative measurable function f :Ω → R, the integral of f over Ω with respect to µ is

\[
\int_\Omega f\, d\mu = \sup\left\{ \sum_{i=1}^n \Bigl(\inf_{A_i} f(\omega)\Bigr)\mu(A_i) : \{A_i\}_{i=1}^n \text{ is a finite decomposition of } \Omega \right\}.
\]
If f is not necessarily nonnegative, we extend this by defining the functions

\[
f^+(\omega) = \max\{f(\omega), 0\} \quad \text{and} \quad f^-(\omega) = \max\{-f(\omega), 0\}.
\]

Observe that f⁺ and f⁻ are nonnegative, measurable functions. We would like to define the integral as the difference of ∫ f⁺ dµ and ∫ f⁻ dµ, but this only makes sense as long as the two are not both infinite.
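Before moving to signed functions, here is a small Python sketch illustrating the sup-over-decompositions definition for a nonnegative function (the choice f(x) = x² on ([0, 1], Lebesgue) and the dyadic level-set decomposition are mine, not from the notes): the lower sums increase toward ∫₀¹ x² dx = 1/3.

```python
import numpy as np

# Lower sums for f(x) = x^2 on ([0,1], Lebesgue), decomposing [0,1] into the
# dyadic level sets A_k = {x : k/2^n <= f(x) < (k+1)/2^n}.  Since f is
# increasing, A_k is the interval [sqrt(k/2^n), sqrt((k+1)/2^n)), whose
# Lebesgue measure can be written down exactly.
def lower_sum(n):
    k = np.arange(2**n)
    inf_on_Ak = k / 2**n                                   # inf of f on A_k
    mu_Ak = np.sqrt((k + 1) / 2**n) - np.sqrt(k / 2**n)    # length of A_k
    return np.sum(inf_on_Ak * mu_Ak)

for n in [2, 4, 8, 12]:
    print(n, lower_sum(n))    # increases toward the true integral 1/3
```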


Definition. Let f : Ω → R be measurable. If ∫_Ω f⁺ dµ = +∞ and ∫_Ω f⁻ dµ < ∞, we set ∫_Ω f dµ = +∞. If ∫_Ω f⁺ dµ < ∞ and ∫_Ω f⁻ dµ = +∞, we set ∫_Ω f dµ = −∞. If both are infinite, f is said to be nonintegrable. Otherwise f is integrable, and when ∫_Ω f⁺ dµ and ∫_Ω f⁻ dµ are both finite, the integral of f is defined as
\[
\int_\Omega f\, d\mu = \int_\Omega f^+\, d\mu - \int_\Omega f^-\, d\mu.
\]

Theorem 6.1.1. Suppose that f and g are nonnegative, measurable, integrable functions.

(1) If f is a simple function, i.e. f = Σ_{i=1}^n x_i χ_{A_i} for x_i ∈ R and A_i ∈ F, then

\[
\int_\Omega f\, d\mu = \sum_{i=1}^n x_i\, \mu(A_i).
\]

(2) If f(ω) ≤ g(ω) for all ω, then ∫_Ω f dµ ≤ ∫_Ω g dµ.

(3) (Monotone Convergence Theorem) If f_n converges to f from below, then ∫ f_n dµ converges to ∫ f dµ from below.

(4) For any α, β ≥ 0, αf + βg is integrable with
\[
\int_\Omega (\alpha f + \beta g)\, d\mu = \alpha \int_\Omega f\, d\mu + \beta \int_\Omega g\, d\mu.
\]

Proof. (1) By definition, ∫ f dµ ≥ Σ_{i=1}^n x_i µ(A_i). We want to show that for any decomposition {B_j}_{j=1}^m of Ω,
\[
\sum_{i=1}^n x_i\, \mu(A_i) \ge \sum_{j=1}^m \Bigl(\inf_{B_j} f(\omega)\Bigr)\mu(B_j).
\]
By finite additivity, we may expand the right side as

\[
\begin{aligned}
\sum_{j=1}^m \Bigl(\inf_{B_j} f(\omega)\Bigr)\mu(B_j) &= \sum_{j=1}^m \sum_{i=1}^n \inf\{f(\omega) : \omega \in B_j\}\,\mu(A_i \cap B_j) \\
&\le \sum_{j=1}^m \sum_{i=1}^n x_i\,\mu(A_i \cap B_j) = \sum_{i=1}^n \sum_{j=1}^m x_i\,\mu(A_i \cap B_j) \\
&= \sum_{i=1}^n x_i\,\mu(A_i),
\end{aligned}
\]
using finite additivity in the last step. This proves (1).


(2) is clear by the property that inf and sup preserve inequalities.

(3) First, if f_n is monotone increasing then ∫ f_n dµ is monotone increasing. In particular, ∫ f_n dµ is a monotone sequence of real numbers, so the real analysis version of the monotone convergence theorem says ∫ f_n dµ converges (possibly to +∞). Also, f_n ≤ f implies ∫ f_n dµ ≤ ∫ f dµ by (2), and limits preserve inequalities, so lim_{n→∞} ∫ f_n dµ ≤ ∫ f dµ. To show the other inequality, we will show that for any decomposition {A_i},

\[
S := \sum_{i=1}^m \nu_i\, \mu(A_i) \le \lim_{n\to\infty} \int f_n\, d\mu,
\]

where ν_i = inf_{A_i} f(ω). First suppose each ν_i and µ(A_i) is finite and positive. Fix ε > 0 with ε < ν_i for each i. Set A_{in} = [ω ∈ A_i : f_n(ω) ≥ ν_i − ε] and notice that for each i, the A_{in} converge from below to A_i: for a fixed ω ∈ A_i, f_n(ω) ≥ f(ω) − ε ≥ ν_i − ε for n sufficiently large, and the f_n are increasing. Set B_1 = A_{1n}, B_2 = A_{2n}, ..., B_m = A_{mn} and B_0 = (∪_{i=1}^m A_{in})^C. Then {B_j}_{j=0}^m is a finite decomposition of Ω, so

\[
\int f_n\, d\mu \ge \sum_{j=0}^m \Bigl(\inf_{B_j} f_n(\omega)\Bigr)\mu(B_j) \ge \sum_{j=1}^m \Bigl(\inf_{B_j} f_n(\omega)\Bigr)\mu(B_j) \ge \sum_{i=1}^m (\nu_i - \varepsilon)\,\mu(A_{in}).
\]
Letting n → ∞ and then ε → 0 shows that S ≤ lim ∫ f_n dµ, so (3) is proved in the finite, positive case.

Now suppose S is finite but not necessarily positive. Simply ignore the terms that are 0 and repeat the same procedure from the last paragraph. Finally, if S = +∞, there is some i with ν_i > 0 and µ(A_i) = +∞, or with ν_i = +∞ and µ(A_i) > 0. Then for fixed x and y such that 0 < x < ν_i and 0 < y < µ(A_i), we can show that
\[
\lim_{n\to\infty} \int f_n\, d\mu \ge xy.
\]
Taking the sup over all such x, y gives the result.

(4) is easy to check for simple functions, using the same technique of mutual refinement as above. In the non-simple case, Theorem 5.4.7 allows us to select sequences of simple functions f_n and g_n such that f_n converges to f from below and g_n converges to g from below. Then we can apply (3) and the simple case to produce the desired formula.

Recall that the term almost everywhere (abbreviated a.e.) refers to a condition that holds on a set whose complement has measure zero.

Theorem 6.1.2. Suppose f and g are nonnegative functions on a measure space (Ω, F, µ). Then

(1) If f = 0 a.e. then ∫ f dµ = 0.

(2) Conversely, if µ[ω : f(ω) > 0] > 0 then ∫ f dµ > 0.

(3) If ∫ f dµ < ∞ then f is finite a.e.


(4) If f ≤ g a.e. then ∫ f dµ ≤ ∫ g dµ.

(5) If f = g a.e. then ∫ f dµ = ∫ g dµ.

Proof. (1) Wherever f > 0 the measure is 0, so every term of every lower sum vanishes and ∫ f dµ = 0.

(2) Notice that the sequence A_n = [ω : f(ω) > 1/n] is monotone and converges from below to A = [ω : f(ω) > 0]. If µ(A) > 0 then there is some n such that µ(A_n) > 0, and therefore inf{f(ω) : ω ∈ A_n} ≥ 1/n > 0. This implies
\[
\int f\, d\mu \ge \frac{1}{n}\, \mu\bigl[\omega : f(\omega) > \tfrac{1}{n}\bigr] > 0.
\]

(3) If µ[ω : f(ω) = ∞] > 0 then
\[
\int f\, d\mu \ge \infty \cdot \mu[\omega : f(\omega) = \infty] = \infty.
\]

(5) follows from (4). To prove (4), note that we have already proved the statement when f ≤ g everywhere (Theorem 6.1.1(2)); but as in (1), a set of measure 0 cannot affect the value of ∫ f dµ or ∫ g dµ.

6.2 Properties of Integration

Remark. The integral is the unique linear operator that integrates indicator functions correctly – that is, by assigning ∫ χ_A dµ = µ(A) for all measurable sets A with respect to a given measure µ – and satisfies the monotone convergence property.

We have defined the integral for a not necessarily nonnegative function f as
\[
\int_\Omega f\, d\mu = \int_\Omega f^+\, d\mu - \int_\Omega f^-\, d\mu.
\]
Both terms are finite precisely when ∫_Ω |f| dµ < ∞.

The next theorem extends the results on integration for nonnegative functions to all integrable functions.

Theorem 6.2.1. Suppose f and g are integrable functions.

(1) If f ≤ g a.e. then ∫_Ω f dµ ≤ ∫_Ω g dµ.

(2) For any α, β ∈ R, αf + βg is integrable and
\[
\int_\Omega (\alpha f + \beta g)\, d\mu = \alpha \int_\Omega f\, d\mu + \beta \int_\Omega g\, d\mu.
\]


Proof. Write f = f⁺ − f⁻ and g = g⁺ − g⁻ and apply Theorems 6.1.2 and 6.1.1.

Remark. Every theorem about general integration has an analogous theorem for summation. Let Ω = N and let µ be the counting measure, µ(A) = |A|, where |A| may be infinite. Then integration with respect to this measure is summation:
\[
\int_{\mathbb{N}} f\, d\mu = \sum_{n=1}^\infty f(n)\mu(\{n\}) = \sum_{n=1}^\infty f(n).
\]
Notice that this is a theory of absolutely convergent series, since by the first remark above, f : N → R is integrable if and only if Σ_{n=1}^∞ |f(n)| < ∞.

The three most important results in measure theory are stated next. They are: the Monotone Convergence Theorem (MCT), Fatou's Lemma and the Lebesgue Dominated Convergence Theorem (DCT). All three are equivalent, in the sense that starting from any one of them the other two can be derived.

Theorem 6.2.2 (Monotone Convergence Theorem). Suppose {f_n}_{n=1}^∞ and f are nonnegative, integrable functions such that f_n converges from below a.e. to f. Then ∫ f_n dµ converges from below to ∫ f dµ.

Proof. This is a restatement of the MCT from Theorem 6.1.1.

Examples.

1. Let µ be the counting measure on Ω = N. The remark above says that there is a corresponding monotone convergence property for series: if x_{mn} and x_n are real numbers such that 0 ≤ x_{mn} ≤ x_n and x_{mn} → x_n as m → ∞ for each n, then
\[
\sum_{n=1}^\infty x_{mn} \nearrow \sum_{n=1}^\infty x_n \quad \text{as } m \to \infty.
\]
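A numerical illustration of Example 1 (the choice x_n = 1/n² and the truncation-style x_{mn} are mine, not from the notes): the sums Σ_n x_{mn} increase to Σ_n x_n = π²/6.

```python
import numpy as np

# Illustrative choice: x_n = 1/n^2 and x_{mn} = x_n for n <= m, else 0.
# Then sum_n x_{mn} is the m-th partial sum, which increases to sum_n x_n = pi^2/6.
N = 10**6                       # truncation used only to evaluate the series numerically
n = np.arange(1, N + 1)
x = 1.0 / n**2

for m in [10, 1_000, 100_000]:
    print(m, x[:m].sum())       # increases toward
print(np.pi**2 / 6)             # the limit sum_n x_n
```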

2. If µ_n is a sequence of measures on (Ω, F) and µ = Σ_{n=1}^∞ µ_n, then for a nonnegative function f, by the MCT (6.2.2),
\[
\sum_{n=1}^\infty \int_\Omega f\, d\mu_n = \int_\Omega f\, d\mu.
\]
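And a tiny check of Example 2 under one concrete choice (again my own, not the notes'): take µ_n to be Lebesgue measure restricted to [n−1, n), so that µ = Σ µ_n is Lebesgue measure on [0, ∞), and f(x) = e^{−x}.

```python
from math import exp

# Integral of e^{-x} over [n-1, n) with respect to Lebesgue measure, in closed form.
integral_mu_n = lambda n: exp(-(n - 1)) - exp(-n)

partial = sum(integral_mu_n(n) for n in range(1, 60))
print(partial)   # sum over n of  int f d(mu_n), which approaches
print(1.0)       # int f d(mu) = int_0^infty e^{-x} dx = 1
```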

Theorem 6.2.3 (Fatou's Lemma). If f_n ≥ 0 for all n, then
\[
\int \liminf_{n\to\infty} f_n\, d\mu \le \liminf_{n\to\infty} \int f_n\, d\mu.
\]

Proof. Define a sequence v_n = inf{f_k : k ≥ n}. Then v_n converges from below to lim inf f_n as n → ∞. For each n, v_n ≤ f_n, so ∫ v_n dµ ≤ ∫ f_n dµ by monotonicity. Then taking limits of both sides and applying the MCT (6.2.2) to the left side gives
\[
\lim_{n\to\infty} \int v_n\, d\mu \le \liminf_{n\to\infty} \int f_n\, d\mu \implies \int \liminf_{n\to\infty} f_n\, d\mu \le \liminf_{n\to\infty} \int f_n\, d\mu.
\]


Theorem 6.2.4 (Lebesgue Dominated Convergence Theorem). If f_n is a sequence of measurable functions and g is an integrable function such that |f_n| ≤ g a.e. for all n ∈ N and f_n → f, then the f_n and f are all integrable, with
\[
\lim_{n\to\infty} \int f_n\, d\mu = \int \lim_{n\to\infty} f_n\, d\mu = \int f\, d\mu.
\]

Proof. First note that ∫ |f_n| dµ ≤ ∫ g dµ < ∞ by monotonicity, so all the f_n are integrable.

Thus g + f_n is integrable and nonnegative a.e. for each n. By Fatou's Lemma (6.2.3),
\[
\begin{aligned}
\int \liminf_{n\to\infty} (g + f_n)\, d\mu &\le \liminf_{n\to\infty} \int (g + f_n)\, d\mu \\
\implies \int (g + f)\, d\mu &\le \liminf_{n\to\infty} \int (g + f_n)\, d\mu = \int g\, d\mu + \liminf_{n\to\infty} \int f_n\, d\mu \\
\implies \int g\, d\mu + \int f\, d\mu &\le \int g\, d\mu + \liminf_{n\to\infty} \int f_n\, d\mu.
\end{aligned}
\]

Now since g is integrable, ∫ g dµ is finite, so we can actually subtract this term from both sides, leaving
\[
\int f\, d\mu \le \liminf_{n\to\infty} \int f_n\, d\mu.
\]

Similarly, g − f_n is integrable, so
\[
\begin{aligned}
\int (g - f)\, d\mu &\le \liminf_{n\to\infty} \int (g - f_n)\, d\mu \\
\implies \int g\, d\mu - \int f\, d\mu &\le \int g\, d\mu - \limsup_{n\to\infty} \int f_n\, d\mu \\
\implies \int f\, d\mu &\ge \limsup_{n\to\infty} \int f_n\, d\mu.
\end{aligned}
\]
Therefore lim sup ∫ f_n dµ ≤ ∫ f dµ ≤ lim inf ∫ f_n dµ, but lim inf is always less than or equal to lim sup, so we have equality.

Examples.

3. Let λ be Lebesgue measure on R. Consider the three sequences of measurable functions, cleverly nicknamed:

\[
\begin{aligned}
\text{``the train''}: \quad & f_n = \chi_{[n,\, n+1]} \\
\text{``the steamroller''}: \quad & f_n = \tfrac{1}{n}\, \chi_{[0,\, n]} \\
\text{``the teepee''}: \quad & f_n = n\, \chi_{[0,\, 1/n]}.
\end{aligned}
\]
Notice that each f_n converges to 0 a.e. but ∫ f_n dλ = 1 for all n, so lim_{n→∞} ∫ f_n dλ = 1 for each of these. The train, steamroller and teepee are all examples of how Fatou's Lemma (6.2.3) can have a strict inequality.
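A direct numerical look at these three sequences (the grid size and sample points are arbitrary choices): each f_n integrates to (approximately) 1, while the values at any fixed x tend to 0, so the inequality in Fatou's Lemma is strict here.

```python
import numpy as np

train       = lambda n, x: ((n <= x) & (x <= n + 1)).astype(float)
steamroller = lambda n, x: ((0 <= x) & (x <= n)).astype(float) / n
teepee      = lambda n, x: n * ((0 <= x) & (x <= 1 / n)).astype(float)

x = np.linspace(0, 200, 2_000_001)   # grid on [0, 200], spacing 1e-4
dx = x[1] - x[0]

for name, f in [("train", train), ("steamroller", steamroller), ("teepee", teepee)]:
    n = 100
    integral = np.sum(f(n, x)) * dx                        # crude Riemann sum, close to 1
    pointwise = [f(m, np.array([2.0]))[0] for m in [10, 100, 1000]]
    print(name, round(integral, 3), pointwise)             # integral ~1, values at x=2 tend to 0
```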


From MCT (6.2.2) and DCT (6.2.4), we obtain the following long list of corollaries.

Corollary 6.2.5 (Bounded Convergence). If µ(Ω) < ∞ and the sequence f_n of measurable functions is uniformly bounded, then f_n → f a.e. implies
\[
\int_\Omega f_n\, d\mu \longrightarrow \int_\Omega f\, d\mu.
\]

Corollary 6.2.6 (Weierstrass M-Test). If (x_{nm}) is a doubly indexed sequence of real numbers satisfying
\[
\sum_{m=1}^\infty |x_{nm}| \le M_n
\]
for all n, where Σ_{n=1}^∞ M_n < ∞ (that is, the sequence (M_n) is integrable with respect to counting measure), then
\[
\sum_{n=1}^\infty \sum_{m=1}^\infty x_{nm} = \sum_{m=1}^\infty \sum_{n=1}^\infty x_{nm}
\]
and both double summations converge.

Proof. Apply the DCT (6.2.4), with respect to counting measure on N, to the partial sums, with dominating function g(n) = M_n.
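A small numerical illustration of the M-test (the double array x_{nm} = (−1)^m/(n²·2^m) is my own choice; it satisfies Σ_m |x_{nm}| ≤ 1/n² =: M_n with Σ_n M_n < ∞): the two iterated sums of a large truncation agree.

```python
import numpy as np

# Illustrative double array: x_{nm} = (-1)^m / (n^2 * 2^m).
N, M = 2000, 60                                   # truncation levels for the two indices
n = np.arange(1, N + 1).reshape(-1, 1)
m = np.arange(1, M + 1).reshape(1, -1)
x = (-1.0) ** m / (n**2 * 2.0**m)

print(x.sum(axis=1).sum())   # sum over m first, then n
print(x.sum(axis=0).sum())   # sum over n first, then m  -- same value
```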

Corollary 6.2.7. If f_n ≥ 0 then
\[
\sum_{n=1}^\infty \int f_n\, d\mu = \int \sum_{n=1}^\infty f_n\, d\mu.
\]

Corollary 6.2.8. If Σ f_n converges a.e. and there exists an integrable function g such that
\[
\left|\sum_{n=1}^N f_n\right| \le g \quad \text{for all } N,
\]
then
\[
\sum_{n=1}^\infty \int f_n\, d\mu = \int \sum_{n=1}^\infty f_n\, d\mu.
\]

Corollary 6.2.9. If Σ_{n=1}^∞ ∫ |f_n| dµ < ∞ then Σ_{n=1}^∞ f_n converges absolutely a.e. and
\[
\sum_{n=1}^\infty \int f_n\, d\mu = \int \sum_{n=1}^\infty f_n\, d\mu.
\]
