
18.675: Theory of Probability

Lecturer: Professor Nike Sun Notes by: Andrew Lin

Fall 2019

Introduction

The course website for this class can be found at [1]. The page notably contains a summary of topics covered, as well as some brief lecture notes and further references. 18.675 is the new numbering for 18.175; the material covered will be similar to previous years. The basic goal is to cover basic objects and laws in probability in a formal mathematical framework; some background in analysis and probability is strongly recommended. (If we are missing 18.100, we should talk to Professor Sun after class.) The main readings for this class will come from chapters 1–4 of our textbook (see [2]): readings for each chapter are included in the summary of topics linked above.

Grades are mostly homework (25%), exam 1 (35%), and exam 2 (35%). Class participation is another 5% of the grade – this is a graduate class, but we are expected to come to class. This is a large class that will probably get smaller, so that's one way attendance may be checked. Alternatively, a short quiz may be given occasionally in class. The quizzes won't be graded; they'll mostly be a check of whether we're all following the material (and a formal record of who showed up). Realistically, attending class is highly correlated with doing well!

By the way, there is some overlap with 18.125 (measure theory), particularly near the beginning. However, most of the content will be significantly different.

1 September 4, 2019

We’ll start the class with some measure theory, which gives us a formal language for probability. On its own, measure theory is a bit dry, so let’s try to think about why we need it at all.

Example 1 (We should know this, e.g. from 18.600)
Consider a p-biased coin, where the probability of a head is p and the probability of a tail is 1 − p. Then the probability of six independent p-biased coins coming up H, H, T, H, T, H is just the product of the probabilities: it's p^4 (1 − p)^2. Since there are (6 choose 4) ways to pick 4 heads among the 6 coin tosses, the probability of having 4 heads in total is (6 choose 4) p^4 (1 − p)^2.
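As a quick numerical sanity check on this computation (a sketch, not part of the notes; the choice p = 0.3 is arbitrary):

```python
from math import comb

def prob_sequence(p, seq):
    """Probability of one exact sequence of independent p-biased tosses."""
    return p ** seq.count("H") * (1 - p) ** seq.count("T")

def prob_k_heads(p, n, k):
    """Probability of exactly k heads in n independent p-biased tosses."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = 0.3
# The specific sequence H, H, T, H, T, H has probability p^4 (1 - p)^2 ...
assert abs(prob_sequence(p, "HHTHTH") - p**4 * (1 - p)**2) < 1e-12
# ... and there are comb(6, 4) = 15 sequences with exactly four heads.
assert abs(prob_k_heads(p, 6, 4) - comb(6, 4) * p**4 * (1 - p)**2) < 1e-12
```

Summing prob_k_heads over k = 0, …, 6 returns 1, as the binomial distribution should.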

We should generally also be familiar with what it means to have U_n be a random variable distributed uniformly over a finite set like {1/n, 2/n, ··· , (n − 1)/n, 1}, as well as what it means for V_n to be an independent copy of U_n: then we can write statements of independence like

P(U_n = j/n, V_n = k/n) = P(U_n = j/n) · P(V_n = k/n) = 1/n^2.
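For finite n this can be verified by brute-force enumeration of the joint distribution (a minimal sketch, not part of the notes; n = 5 is an arbitrary choice):

```python
from fractions import Fraction

n = 5
states = [Fraction(j, n) for j in range(1, n + 1)]  # {1/n, 2/n, ..., 1}

# Joint law of an independent pair (U_n, V_n): product of the marginals.
joint = {(u, v): Fraction(1, n) * Fraction(1, n) for u in states for v in states}

# Every pair (j/n, k/n) has probability exactly 1/n^2, and the mass sums to 1.
assert all(p == Fraction(1, n**2) for p in joint.values())
assert sum(joint.values()) == 1
```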

But this becomes a bit more complicated if the set of allowed states is not finite. For example, what does it mean for U to be distributed uniformly in the range [0, 1]? We can't write down the probability that U takes on any particular value. Also, is it true that U_n converges to U as n → ∞, at least in some sense (since we seem to be populating [0, 1] uniformly)? Let's now see an example that illustrates that this kind of problem is trickier than it might initially seem:

Problem 2 Alice and Bob play a game where they have an infinite sequence of boxes. Alice puts a real number in each box, and Bob can see the number in every box except for one of them, which he can choose. “Show” that Bob can guess the unseen number with 90 percent probability.

“Solution”. To show this, let's set up some preliminary notation: let S be the set of infinite sequences of real numbers, which can also be denoted R^N. For two sequences x, y ∈ S, say that x ∼ y (they are equivalent) if they eventually agree: x_n = y_n for all n > N, for some finite index N. This is an equivalence relation on the space of real-valued sequences, so there is some quotient space S/∼ of equivalence classes. We can pick one representative z from each equivalence class. In other words, if x ∈ S, there is a unique representative sequence z = z(x) such that z ∼ x: since they are equivalent by definition, we can let n(x) be the first index after which x and z agree (that is, x_m = z_m for all m ≥ n(x)).

So here's Bob's strategy for guessing the unseen number. He splits the boxes into ten rows: row k contains the boxes which are in spots k mod 10. We now have ten sequences x^(1), ··· , x^(10) corresponding to our ten rows. Bob picks one of these rows, uniformly at random (say row 10 without loss of generality). He reveals all of the other rows, and he evaluates

N = max{n(x^(1)), n(x^(2)), ··· , n(x^(9))} + 1.

(In other words, he looks at the first nine rows, and he finds the first index after which each row agrees with its representative sequence.) He now reveals every box in row 10 except for box N: he now has enough information to deduce z^(10), the representative sequence for x^(10). Bob now guesses z^(10)_N.

Why is this correct with 90 percent probability? We chose this row uniformly at random, and there's at most a 10 percent chance that n(x^(10)) is the largest out of all the n(x^(i))s, which is the only case where x^(10)_N can disagree with z^(10)_N.

This is a strange example, and we've obviously violated some rules of probability to get to this point. But it's constructed in a way such that we don't really know where we've messed up! So that's why it's worthwhile to understand a little more about measure theory. Let's start with a seemingly basic problem: how do we define the variable U ∼ Unif[0, 1] from above? We could try to define a function (or measure) µ on subsets of [0, 1], such that µ(A) is the probability that U ∈ A for any subset A. For example, for any 0 ≤ a ≤ b ≤ 1, it's natural to say that

µ([a, b]) = b − a.

We also want to have some form of additivity: if A, B ⊆ [0, 1] are disjoint, then we want their disjoint union to satisfy

µ(A ⊔ B) = µ(A) + µ(B).

By induction, this means that for any positive integer n,

µ(⊔_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).

In addition, we can think about the limit of the expression ⊔_{i=1}^n A_i → ⊔_{i=1}^∞ A_i: it seems natural to also want countable additivity:

µ(⊔_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i).

Remark 3. Note that µ([0, 1]) = 1, but [0, 1] is an uncountable union of points, and each point has measure zero. So it's not really reasonable to expect uncountable additivity to hold, too.

So to summarize, our measure µ(A) = P(U ∈ A) should satisfy the following:

• µ([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1 (“lengths” of intervals).

• Countable additivity (as shown above).

• We should also want translation invariance: if A ⊆ [0, 1], and we define A_x = A + x to be A shifted rightward by x (mod 1), then µ(A) = µ(A_x).

This is not a lot of requirements: is it enough? If we're given any subset A, can we use these three properties to determine µ(A)? Let's try a slightly complicated example:

Example 4
Consider the Cantor set C = ⋂_{n=0}^∞ C_n, where the C_n are iteratively defined as

C_0 = [0, 1],  C_{n+1} = C_n \ (“middle third” of each interval of C_n).

(Because the C_n are decreasing, the limit C can indeed be defined this way.) How do we find the measure of the Cantor set? We know that µ(C_0) = 1, and we also know that C_n is the disjoint union of C_{n+1} and the middle thirds of C_n, so

µ(C_{n+1}) = µ(C_n) − µ(middle thirds of C_n) = (2/3) µ(C_n).

This means that µ(C_n) = (2/3)^n for all n, and taking the limit, µ(C) = 0. This seems encouraging towards an answer of "yes, we can determine µ(A)," but it turns out the Cantor set is actually pretty well-behaved compared to some worse sets. In fact, we've actually been asking the wrong question:
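The recursion µ(C_{n+1}) = (2/3) µ(C_n) can be checked by tracking the intervals of each C_n explicitly, in exact rational arithmetic (a sketch, not part of the notes):

```python
from fractions import Fraction

def cantor_step(intervals):
    """Remove the open middle third of each closed interval [a, b]."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))   # left third
        out.append((b - third, b))   # right third
    return out

intervals = [(Fraction(0), Fraction(1))]   # C_0 = [0, 1]
for n in range(1, 6):
    intervals = cantor_step(intervals)
    measure = sum(b - a for a, b in intervals)
    assert measure == Fraction(2, 3) ** n  # mu(C_n) = (2/3)^n
```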

Proposition 5 There exists no function µ : P([0, 1]) → [0, 1] that satisfies all three of the properties we want!

Proof (the Vitali construction). Define an equivalence relation on the real numbers

x ∼ y if x − y ∈ Q.

This partitions the real numbers into equivalence classes (cosets of the form [x] = x + Q). Each one is a shift of the rationals

by some real number, and similarly to the Alice and Bob problem, we pick a representative z from each equivalence class. For simplicity, we can pick all z to be in [0, 1]. Let A be the set of all representatives, which is a subset of [0, 1]. (This is sometimes called the Vitali set.) By the properties above, because we can partition [0, 1] into translations of A,

1 = µ([0, 1]) = µ(⋃_{q ∈ Q ∩ [0,1)} A_q) = Σ_q µ(A_q)

by countable additivity. But now all the A_q have equal measure by translation invariance, which means that

1 = Σ_q µ(A),

which cannot work: 1 cannot be the countable sum of a bunch of equal numbers.

The way measure theory deals with this problem is to not require that µ be defined on all subsets. As a historical note, when this was first proposed, it was a pretty surprising idea, but it’s really the only thing we can do if we try to write formal definitions down! We don’t really want to relax the three properties we’re given, so it’s better to just not define the measure on some pathological subsets.

Definition 6 A measure space is a triple (Ω, F, µ) satisfying the following axioms:

• Ω, the state space or outcome space, is nonempty.

• F, a set of measurable subsets or events, is a σ-field or σ-algebra over Ω (the terms will be used interchangeably). In other words, F is a nonempty collection of subsets of Ω which is closed under complementation and countable union: if A ∈ F, then A^c = Ω \ A is also in F, and if A_i ∈ F for all i, then ⋃_{i=1}^∞ A_i is also in F.

• µ, the measure, is a map F → [0, ∞] (the extended positive reals) which is not ∞ everywhere and is countably additive: if the A_i ∈ F are disjoint,

µ(⊔_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i).

The goal of next week is basically to construct a uniform measure on [0, 1]: we'll define some measure space ([0, 1], L, µ), where L is still to be defined, and by extension also define (R, L_R, µ).

Definition 7 µ is called a probability measure (denoted P) if µ(Ω) = 1.

Working with the definition, we can gather a few immediate consequences about measure spaces.

Proposition 8
Given a measure space (Ω, F, µ),

• ∅ and Ω are in F.

• µ(∅) = 0.

• (Continuity from below) If A_i ↑ A, that is, A_1 ⊆ A_2 ⊆ ··· and A = ⋃_{i=1}^∞ A_i, then µ(A_i) ↑ µ(A).

• (Continuity from above) If A_i ↓ A, that is, A_1 ⊇ A_2 ⊇ ··· and A = ⋂_{i=1}^∞ A_i, and also µ(A_1) < ∞, then µ(A_i) ↓ µ(A).

• For any Ω, F = {∅, Ω} is a valid σ-field. So is P(Ω).

Proof. For the first point, F is nonempty, so there is some A in F. Then A^c ∈ F, so A ∪ A^c = Ω is in F, and so is Ω^c = ∅. Similarly, for the second point, there is some A such that µ(A) < ∞: then µ(A ⊔ ∅) = µ(A) = µ(A) + µ(∅), so µ(∅) = 0.

For the third point, use countable additivity on A = ⊔_{i=1}^∞ B_i, where B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ A_2, and so on. Then

µ(A) = Σ_{n=1}^∞ µ(B_n),

and since µ(A_n) = Σ_{i=1}^n µ(B_i), µ(A_i) indeed converges to µ(A) as desired. Similarly, for the fourth point, define B_1 = ∅, B_2 = A_1 \ A_2, B_3 = A_1 \ A_3, and so on. All the B_i have finite measure since µ(A_1) is finite, and B_i ↑ A_1 \ A, so we can use the third point (together with µ(A_i) = µ(A_1) − µ(B_i)) to finish. (A counterexample with µ(A_1) = ∞ is A_i = [i, ∞).) Finally, the examples in the last point satisfy both closure axioms (complementation and countable union).

If Ω were a finite set, and we were trying to define a probability space in 18.600, we would likely say that P({ω}) = a_ω for every ω ∈ Ω, with Σ_{ω∈Ω} a_ω = 1. In our new terminology, F = P(Ω) is just the power set of Ω, and P(A) = Σ_{ω∈A} a_ω: we can check that this is a valid probability space. But what if Ω = [0, 1] or R? If we take F = P(Ω), we do have a valid σ-field, but the Vitali set shows that we can't define a measure µ on it. Next week, we'll basically ask how we need to restrict our set F to get a useful measure!
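This finite recipe is easy to implement directly. A minimal sketch (the three weights below are an arbitrary example) that checks finite additivity over the whole power set:

```python
from fractions import Fraction
from itertools import chain, combinations

# A finite probability space: Omega with weights a_omega summing to 1.
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
assert sum(weights.values()) == 1

def P(event):
    """P(A) = sum of a_omega over omega in A, with F = the power set."""
    return sum(weights[w] for w in event)

omega = list(weights)
# F = P(Omega): every subset of Omega is an event.
events = [set(s) for s in
          chain.from_iterable(combinations(omega, r) for r in range(len(omega) + 1))]

# Check finite additivity on every pair of disjoint events.
for A in events:
    for B in events:
        if A.isdisjoint(B):
            assert P(A | B) == P(A) + P(B)
```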

Definition 9

The Borel σ-field BR is the smallest σ-field over R that contains I, the set of all open intervals.

When we see the word "smallest," we should ask the question "is it well-defined?". We check that if {F_α} is any collection of σ-fields over Ω, then their intersection is also a σ-field. Indeed, if a set A is in the intersection, then both A and A^c are in all of the σ-fields, so A^c is in the intersection; the same argument works for countable unions. Now we can just let B_R be the intersection of all σ-fields containing I, and this is well-defined.

2 September 9, 2019

This lecture and the next are being given by Professor Subhabrata Sen. Recall that our central question today is how to define the uniform measure on [0, 1]: we want to define a function µ on (some) subsets of [0, 1] with the requirements that

• µ((a, b]) = b − a (half-open is a matter of convention).

• Given disjoint subsets {A_i : i ≥ 1}, µ(⊔_i A_i) = Σ_i µ(A_i).

• If we define A_x = (A + x mod 1), then µ(A_x) = µ(A) (translation invariance).

Vitali's construction tells us that we can't do this with all subsets of [0, 1]: there is no function µ : 2^{[0,1]} → [0, 1] that satisfies all three conditions above. So we have to compromise and try to define µ on just a sub-collection of the subsets of [0, 1]. Last time, we defined a measure space (Ω, F, µ), which consists of a nonempty state space, a σ-field, and a measure, respectively. We also defined the Borel σ-field to be the smallest σ-field containing the half-open intervals (which are the sets where we really want the measure to be defined in a particular way):

B = σ(I), I = {(a, b ] ∩ R : a ≤ b}.

With this, let's try to construct the Lebesgue measure on [0, 1]. Our hope is that because we already know how to assign a measure to specific sets, we can assign measures to intervals and then keep assigning measures to more subsets iteratively off of that. Unfortunately, we can't write this down rigorously, because there's no closed form for an arbitrary element of B_R: it's not true that every set in the Borel σ-algebra can be written as a countable union and/or intersection of our intervals! So instead, we do define µ((a, b]) = b − a, and we do define µ(⊔_i (a_i, b_i]) = Σ_i (b_i − a_i) for unions of disjoint intervals (a_i, b_i], but we're not quite done. We now need to know whether there is a consistent extension of our function µ to all sets in B_R. Since this idea will come up again later in the course, we'll make a slight generalization: say we define µ((0, x]) to be some function F(x), where we're currently working with the case F(x) = x.

Remark 10. Remember that measures in general do not need to be translation-invariant: the conditions that we’ve been using are to ensure that we end up with a uniform measure.

In order for us to extend our function µ to be defined on all subsets in BR, this function F (x) needs to satisfy a few properties:

• µ((0, x_1]) ≥ µ((0, x_2]) if x_1 > x_2, which means that F(x_1) ≥ F(x_2) (F is monotone).

• If x_n ↓ x, then the intervals (0, x_n] ↓ (0, x]. From our work earlier, this implies that µ((0, x_n]) ↓ µ((0, x]), so F(x_n) ↓ F(x) (F is right-continuous). Notably, we can't quite make this argument in the other direction to conclude that F is left-continuous (this will be explained later).

Definition 11 A Stieltjes measure function on R is a function F : R → R which is non-decreasing and right-continuous.

Theorem 12

For any Stieltjes measure function F, there exists a unique measure µ = µ_F on (R, B_R) satisfying µ((a, b]) = F(b) − F(a) for all a ≤ b in R.

Here’s the proof idea:

1. There exists a unique extension from µ : I → [0, ∞] to µ : A → [0, ∞], where

A = {A ⊆ R : A is a finite disjoint union of sets in I}.

This sort of corresponds to the "finite union" step: once we've shown that we can do this step in a well-defined way, we'll be able to define our measure on any finite disjoint union of intervals.

2. Show that we can extend our measure from being defined on A to being defined on BR.

Remark 13. Why is using A a reasonable intermediate step? Notice that A is closed under complements and finite unions (exercise), while the Borel σ-algebra is closed under complements and countable unions. So we’re not at the σ-algebra yet, but we’re kind of close.

Proof of step 1. The construction itself is pretty intuitive. Given any subset A ∈ A, we can write it as A = ⊔_{j=1}^n E_j for some E_j in I. Then just define

µ(A) = Σ_{j=1}^n µ(E_j).

Note that there are a lot of ways to write an element of A as a finite disjoint union of intervals. So one thing we need to check is that different representations are consistent with each other:

A = ⊔_{j=1}^n E_j = ⊔_{k=1}^m F_k  =⇒  Σ_{j=1}^n µ(E_j) = Σ_{k=1}^m µ(F_k).

Indeed, we take the common refinement: because all the E_j and F_k are half-open intervals,

Σ_{j=1}^n µ(E_j) = Σ_{j=1}^n µ(E_j ∩ A) = Σ_{j=1}^n µ(E_j ∩ (⊔_{k=1}^m F_k)) = Σ_{j=1}^n Σ_{k=1}^m µ(E_j ∩ F_k),

and starting from Σ_k µ(F_k) also yields this result, so µ(A) is consistently defined.
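To make the consistency check concrete, here is a sketch (in the uniform case F(x) = x, which is an assumption of this example) that evaluates µ on two different decompositions of the same set:

```python
def F(x):
    return x  # Stieltjes function for the uniform measure (an assumed choice)

def mu(intervals):
    """Measure of a finite disjoint union of half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

# Two decompositions of the same set (0, 1]:
decomp1 = [(0.0, 1.0)]
decomp2 = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]   # a common refinement
assert abs(mu(decomp1) - mu(decomp2)) < 1e-12
```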

In preparation for next lecture, let's record a few properties of the function µ we've just defined on A:

Proposition 14
The function µ we've defined has the following properties:

1. µ is monotone and finitely additive over A: for disjoint {A_i : 1 ≤ i ≤ n} ⊆ A, µ(⊔_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).

2. µ is finitely sub-additive over A: for any {A_i : 1 ≤ i ≤ n} ⊆ A, µ(⋃_{i=1}^n A_i) ≤ Σ_{i=1}^n µ(A_i).

3. µ is countably sub-additive over I: if A, A_i ∈ I and A ⊆ ⋃_i A_i, then µ(A) ≤ Σ_i µ(A_i).

4. µ is countably additive on A.

This last point is really where we want to be, because it’s a lot closer to the condition we are trying to get on BR.

Proof. It suffices to show (1) and (2) for n = 2; the general case follows by induction.

For (1), note that we can write A_1 as E_1 ⊔ ··· ⊔ E_x and A_2 as F_1 ⊔ ··· ⊔ F_y. Because these are disjoint, µ(E_1 ⊔ ··· ⊔ E_x ⊔ F_1 ⊔ ··· ⊔ F_y) is the measure of A_1 ⊔ A_2 by definition; µ of the first x terms gives the measure of A_1, and µ of the last y terms gives the measure of A_2, which is what we want. (Monotonicity follows because all measures are nonnegative.)

For (2), notice that (A_2 \ A_1) ⊆ A_2 (where A_2 \ A_1 denotes all of A_2 not contained in A_1), and A_2 \ A_1 is disjoint from A_1. So now by additivity and monotonicity from (1),

µ(A_1 ∪ A_2) = µ(A_1 ⊔ (A_2 \ A_1)) = µ(A_1) + µ(A_2 \ A_1) ≤ µ(A_1) + µ(A_2),

as desired.

For (3), denote A = (a, b] and A_i = (a_i, b_i] (all sets here are intervals because we're working in I), and we know that (a, b] ⊆ ⋃_{i=1}^∞ (a_i, b_i]. Recall that our measure is defined on intervals uniquely: µ(A) = F(b) − F(a), and µ(A_i) = F(b_i) − F(a_i). Since F is right-continuous, there exist x > a and c_i > b_i such that

F(x) − F(a) ≤ ε,  F(c_i) − F(b_i) ≤ ε/2^i.

(Intuitively, this now lets us work with A as a closed interval and the A_i as open intervals.) Now consider the interval

[x, b] ⊆ ⋃_{i=1}^∞ (a_i, c_i).

Since [x, b] is compact, and the right hand side is an open covering, there exists a finite subcover by the Heine–Borel theorem:

[x, b] ⊆ ⋃_{i=1}^n (a_i, c_i)

(with some abuse of notation). Now by points (1) and (2), because µ is finitely subadditive and monotone,

µ((a, b]) ≤ µ((a, x]) + Σ_{i=1}^n µ((a_i, c_i])
         = F(x) − F(a) + Σ_{i=1}^n (F(c_i) − F(a_i))
         ≤ 2ε + Σ_{i=1}^∞ (F(b_i) − F(a_i))

(one ε comes from F(x) − F(a), and the other comes from the sum of the geometric series Σ_i ε/2^i). Letting ε → 0, we have that

µ((a, b]) ≤ Σ_{i=1}^∞ µ((a_i, b_i]),

as desired.

To prove (4), we are trying to show that given A, A_i ∈ A with A = ⊔_{i=1}^∞ A_i, we have µ(A) = Σ_{i=1}^∞ µ(A_i). First of all, we can write A as

A = (⊔_{i=1}^n A_i) ⊔ (⊔_{i=n+1}^∞ A_i) = (⊔_{i=1}^n A_i) ⊔ C_{n+1},

which is a finite disjoint union, so we can use (1) to write

µ(A) = Σ_{i=1}^n µ(A_i) + µ(C_{n+1}) ≥ Σ_{i=1}^n µ(A_i).

Letting n go to ∞,

µ(A) ≥ Σ_{i=1}^∞ µ(A_i),

which gives us one direction of what we want. Now we need to show the upper bound: without loss of generality, we can assume that A_i ∈ I, because all the A_i are finite disjoint unions of sets in I. By definition, we can write

A = ⊔_{j=1}^n E_j,  E_j ∈ I,

and we've defined

µ(A) = Σ_{j=1}^n µ(E_j) = Σ_{j=1}^n µ(E_j ∩ (⋃_i A_i)) = Σ_{j=1}^n µ(⋃_i (E_j ∩ A_i))

(because A is entirely contained in ⋃_i A_i anyway). By (3), this means

µ(A) ≤ Σ_{j=1}^n Σ_{i=1}^∞ µ(E_j ∩ A_i) = Σ_{i=1}^∞ Σ_{j=1}^n µ(E_j ∩ A_i)

by swapping the order of summation (this is a nonnegative series, so we're allowed to do this), and this is Σ_{i=1}^∞ µ(A ∩ A_i) = Σ_{i=1}^∞ µ(A_i) by finite additivity.

So now we want to extend µ from A to the actual σ-field: let's start moving in that direction. For now, let's assume the measure we want to define is a finite measure.

Theorem 15 (Carathéodory Extension Theorem)
Let A be an algebra over Ω (closed under complements and finite unions), and let µ : A → [0, ∞) be a countably additive finite measure. Then there exists a unique measure µ : σ(A) → [0, ∞) on the generated σ-algebra which extends µ.

We’ll show this next time!

3 September 11, 2019

Recall that the central object we've been trying to study is B_R, the Borel σ-field, which is the smallest σ-algebra containing the intervals I = {(a, b] ∩ R : a ≤ b}. We're working towards a proof that given any Stieltjes measure function (monotone, right-continuous) F, there exists a unique µ_F on (R, B_R) such that µ((a, b]) = F(b) − F(a). Last lecture, we extended the measure from I to the algebra A, consisting of finite disjoint unions of intervals in I, and today, we'll extend this to B_R = σ(A).

First, a quick clarification from last lecture: why can't we prove that F is left-continuous as well if we're defining

F(x) = µ((0, x])? In general, when x_n ↑ x, we have (0, x_n] ↑ (0, x). But this is not exactly what we want, because F(x) uses (0, x] instead:

µ((0, x_n]) ↑ µ((0, x)) ≠ µ((0, x]) = F(x).

The example to keep in mind for this is a step function, coming from a point mass! Then (0, x) and (0, x ] differ by the “mass” at point x. To finish the construction of a uniform measure, we’ll prove Theorem 15 today. We’ll go through the proof in a slightly nonlinear way, so let’s first sketch out the ideas here: we want to prove (1) that an extension exists and (2) that it is unique. We’ll do the latter first, and we’ll do so by introducing the π-λ Theorem.
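The step-function example can be made concrete (a sketch, not part of the notes; putting the unit point mass at x = 1/2 is an arbitrary choice):

```python
def F(x):
    # Right-continuous step function: a unit point mass at x = 1/2.
    return 1.0 if x >= 0.5 else 0.0

def mu(a, b):
    """mu((a, b]) = F(b) - F(a)."""
    return F(b) - F(a)

# As x_n increases to 1/2 from below, mu((0, x_n]) stays 0 ...
xs = [0.5 - 2.0 ** (-k) for k in range(2, 12)]
assert all(mu(0.0, x) == 0.0 for x in xs)
# ... but mu((0, 1/2]) = 1: the limit recovers mu((0, 1/2)), not F(1/2).
assert mu(0.0, 0.5) == 1.0
```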

Definition 16 A π-system P on a set Ω is a nonempty collection of subsets of Ω such that for any A, B ∈ P, A ∩ B ∈ P.

For example, I, the collection of (half-open) intervals, is a π-system.

9 Definition 17 A λ-system L on a set Ω satisfies the following conditions:

• Ω ∈ L.

• For any A, B ∈ L where B ⊆ A, A \ B ∈ L as well.

• If we have a sequence An ∈ L, and An ↑ A, then A ∈ L.

All σ-algebras are both π-systems (we have intersections by definition) and λ-systems: in a σ-algebra, we have the whole set Ω = A ∪ A^c for any A; for B ⊆ A we have A \ B = A ∩ B^c; and ⋃_n A_n = A for any A_n ↑ A. But we also have a converse:

Proposition 18 If a λ-system L is also a π-system, then L is a σ-algebra.

Proof. L contains the whole set Ω, and for any A ∈ L we have Ω \ A = A^c ∈ L, so L is closed under complementation. For countable unions, given A_n ∈ L, the finite unions B_n = A_1 ∪ ··· ∪ A_n = (A_1^c ∩ ··· ∩ A_n^c)^c are in L (using complementation and the π-system property), and B_n ↑ ⋃_n A_n, so the countable union is in L as well.

In other words, the definition of a σ-algebra can be split into the intersections part (π-system) and the rest (λ-system), and this turns out to be very helpful:

Theorem 19 (Dynkin’s π-λ Theorem) If P is a π-system, L is a λ-system, and P ⊆ L, then σ(P) ⊆ L.

We’re going to see applications of this in a lot of different scenarios!

Proof. Without loss of generality, let L be the smallest λ-system containing P. This is well-defined, because there is at least one λ-system – the powerset of Ω – and intersections L1 ∩ L2 of λ-systems are λ-systems as well:

• Ω is in both L1 and L2, so it’s in the intersection.

• If A and B are in the intersection with B ⊆ A, then they're both in L_1 and L_2, so A \ B is in both λ-systems, meaning it's in the intersection.

• If A_n ↑ A and all the A_n are in the intersection, then they're in both λ-systems, so A is in both λ-systems and therefore in the intersection as well.

So L is just the intersection of all λ-systems containing P. It suffices to show that L is also a π-system: this means L is a σ-field containing P, so it must contain σ(P), the smallest σ-field containing P. To do this, we just need to show that given any A, B ∈ L, A ∩ B ∈ L. Fix A ∈ P, and define

LA = {B ∈ L : A ∩ B ∈ L} ⊆ L.

LA is a λ-system:

• A ∩ Ω = A ∈ L, so Ω ∈ LA.

• If B_2 ⊆ B_1 with B_1, B_2 ∈ L_A (so A ∩ B_1, A ∩ B_2 ∈ L), then (A ∩ B_1) \ (A ∩ B_2) = A ∩ (B_1 \ B_2) is also in L. This means that B_1 \ B_2 is also in L_A.

• If we have an increasing sequence B_n ↑ B with B_n ∈ L_A (that is, A ∩ B_n ∈ L for all n), then A ∩ B_n ↑ A ∩ B, so A ∩ B ∈ L since L is a λ-system. But this means B ∈ L_A, as desired.

In addition, L_A contains P: since A ∈ P, for any B ∈ P we have A ∩ B ∈ P ⊆ L. But L is the smallest λ-system containing P, so L_A = L! Therefore, for all A ∈ P and B ∈ L, A ∩ B ∈ L. Now fix B ∈ L, and define

LB = {A ∈ L : A ∩ B ∈ L}.

We just showed that P ⊆ L_B, and we can show by the exact same argument that L_B is a λ-system, so L_B = L. This means that L is indeed a π-system, and we're done.

The important message here is that measure-theoretic proofs are difficult because we don’t have a closed form for all measurable subsets. Often, the right way is to show that every measurable subset has some property. The power of this theorem is in bootstrapping results we find from P, a π-system, to the σ-algebra of P!

Proof of uniqueness of measure. Suppose there were two extensions µ_1 and µ_2 of our measure µ from A to σ(A). Then µ(A) = µ_1(A) = µ_2(A) for all A ∈ A by definition. Define

L = {B ∈ σ(A): µ1(B) = µ2(B)}.

(We're trying to show that L is the σ-algebra σ(A) itself.) We've established that A ⊆ L, and we know that A is a π-system (the intersection of two finite disjoint unions of intervals is again a finite disjoint union of intervals). Now L is a λ-system:

• [0, 1] is in L, because [0, 1] is a finite interval and is therefore in A. (Small technicality here: we really have the interval (0, 1], so the easiest way to deal with this is to just define the measure on the real line instead. Also, for the uniform measure, any individual point is a measure zero set.)

• If µ1(A) = µ2(A), µ1(B) = µ2(B), and B ⊆ A, then µ1(A \ B) = µ2(A \ B) by additivity, so A \ B is also in L.

• If µ_1(A_n) = µ_2(A_n) for all n in an increasing sequence A_n ↑ A, we can define B_1 = A_1, B_2 = A_2 \ A_1, and so on. From the previous point, µ_1(B_i) = µ_2(B_i) for all i, so by countable additivity we have µ_1(⊔ B_i) = µ_2(⊔ B_i). Since µ_1(⊔ B_i) = µ_1(⋃ A_i) = µ_1(A) (and the same holds for µ_2), µ_1(A) = µ_2(A) as desired.

So by the π-λ Theorem, σ(A) = L, and thus µ1 = µ2.

This is powerful – we only had to verify that µ1(A) = µ2(A) for a small subset of all measurable subsets, but L helped us bootstrap that to something more general!

Proof of existence. We start with a definition:

Definition 20
An outer measure is a map ν : 2^Ω → [0, ∞) which is monotone and countably subadditive.

This is basically a “crude” measure: it has certain desirable characteristics of a measure but not all of them.

Definition 21
Given a measure µ on an algebra A, we define

µ*(E) = inf { Σ_{i=1}^∞ µ(A_i) : A_i ∈ A, E ⊆ ⋃_i A_i }

for all E ⊆ Ω.

This is a covering argument: we try to cover E with minimum measure. We can verify that µ* is indeed an outer measure:

• It's monotone, because any covering of B ⊇ A also covers A, so µ*(B) ≥ µ*(A).

• It's countably subadditive, because the union of coverings of the A_i will also cover ⋃_i A_i, and we're still covering with a countable number of sets in A.
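As an illustration of how far countable subadditivity reaches: covering the i-th rational in [0, 1] by an interval of length ε/2^i shows µ*(Q ∩ [0, 1]) ≤ ε for every ε > 0, hence µ*(Q ∩ [0, 1]) = 0. A sketch of the bookkeeping (not part of the notes; enumerating by denominator is an arbitrary choice):

```python
from fractions import Fraction

def rationals_in_unit_interval(q_max):
    """Enumerate the distinct rationals p/q in [0, 1] with denominator q <= q_max."""
    seen = set()
    for q in range(1, q_max + 1):
        for p in range(0, q + 1):
            seen.add(Fraction(p, q))
    return sorted(seen)

eps = Fraction(1, 100)
qs = rationals_in_unit_interval(20)
# Cover the i-th rational (i = 1, 2, ...) by an interval of length eps / 2^i.
total_length = sum(eps / 2**i for i in range(1, len(qs) + 1))
assert total_length < eps   # so mu*(Q cap [0,1]) <= eps, for every eps > 0
```

The covering lengths form a geometric series summing to less than ε no matter how many rationals we enumerate.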

Lemma 22 µ∗ also extends µ: that is, µ∗(A) = µ(A) for all A ∈ A.

Proof. We know that µ*(A) ≤ µ(A), since A covers itself. Suppose that µ*(A) < µ(A) for the sake of contradiction: then there is a covering of A by a countable union of half-open intervals with total measure less than µ(A), and each covering interval may be assumed to be contained within one of the intervals of A (otherwise, we could split the interval up or cut the ends off, resulting in a covering of no larger measure). Say A is made up of N intervals: by the pigeonhole principle, there is some interval I ∈ A covered by intervals I_n ∈ I such that Σ_n µ*(I_n) < µ(I): pick the one with the largest deviation, and we can write

Σ_n µ*(I_n) = µ(I) − ε

for some ε ≥ (µ(A) − µ*(A))/N > 0. But this can't happen, because µ*(I_n) = µ(I_n), and we showed countable subadditivity of µ over intervals in the last lecture:

µ(I) ≤ Σ_n µ(I_n),

which is a contradiction. Thus µ*(A) = µ(A), as desired.

So this outer measure tries to assign a value to every subset of our space Ω. We saw that doing this didn't quite work when we tried it naively: to refine this argument, we need to find a suitable subcollection A* ⊆ 2^Ω such that A* is a σ-algebra and µ*|_{A*} is a measure. If we can find such a set, then A ⊆ A* implies that σ(A) ⊆ σ(A*) = A*, and this gives us the extension that we want!

Definition 23
A subset E ⊆ Ω is measurable if

µ*(F) = µ*(F ∩ E) + µ*(F ∩ E^c)

for all F ⊆ Ω.

This is a collection of sets which satisfy a "desirable property," and it turns out the subcollection we want is

A* = {E : E is measurable with respect to µ*}.

First of all, we need to make sure that we actually have a large enough subcollection:

Proposition 24 A ⊆ A∗.

Proof. We want to show

µ*(F) = µ*(F ∩ A) + µ*(F ∩ A^c)

for any A ∈ A. First of all, we have

µ*(F) ≤ µ*(F ∩ A) + µ*(F ∩ A^c),

because any covering of the right side definitely covers the left side. Now, suppose the left side is optimally covered by intervals I_1, I_2, ···. We wish to show that we can cover the right side with at most 4ε more measure. Since A is a finite collection of intervals, each interval I_k can only have a finite number of changes between the parts contained in A and in A^c, so there are countably many changes between A and A^c overall. Break up the intervals at those changes.

Thus, we cover the right hand side as follows: F ∩ A is a subset of (⋃ I_k) ∩ A (since the I_k cover F), which is a countable union of intervals (possibly closed or open). Similarly, F ∩ A^c is a subset of (⋃ I_k) ∩ A^c, which is another countable union of intervals. Extend both sides of the i-th interval in (⋃ I_k) ∩ A by ε/2^i, and extend both sides of the j-th interval in (⋃ I_k) ∩ A^c by ε/2^j. We've then used at most 4ε more measure than in our covering of F, but we've covered both F ∩ A and F ∩ A^c, so

µ*(F) + 4ε ≥ µ*(F ∩ A) + µ*(F ∩ A^c),

and we're done by taking ε → 0, showing the equality.

To finish our construction, then, we just need to show that A* is a σ-algebra and that µ* is countably additive on it. It's easy to show A* is closed under complementation, because E and E^c play symmetric roles in Definition 23.

So now, let's show that A* is closed under countable unions. First, we prove that it's closed under finite unions: by induction, it suffices to show that if E_1, E_2 ∈ A*, then E_1 ∪ E_2 ∈ A*. By definition, this is equivalent to plugging E_1 ∪ E_2 into Definition 23, but it suffices to just show the inequality

µ*(F) ≥ µ*(F ∩ (E_1 ∪ E_2)) + µ*(F ∩ (E_1 ∪ E_2)^c),

because the reverse inequality is true by subadditivity (given to us by the definition of µ*). By subadditivity,

µ*(F ∩ (E_1 ∪ E_2)) ≤ µ*(F ∩ E_1) + µ*((F \ E_1) ∩ E_2)

(this is just set algebra), and we also have

µ*((F \ E_1) ∩ E_2) = µ*(F ∩ E_1^c) − µ*(F \ (E_1 ∪ E_2))

because we know that E_2 ∈ A* is measurable, and we can plug F ∩ E_1^c into the definition of measurability. Now adding the two equations,

µ*(F ∩ (E_1 ∪ E_2)) + µ*(F \ (E_1 ∪ E_2)) ≤ µ*(F ∩ E_1) + µ*(F ∩ E_1^c) = µ*(F),

where the last step comes from E_1 ∈ A* being measurable. This is what we set out to prove, so A* is indeed closed under finite unions.

To prove that A* is also closed under countable unions, we need to show that given E_i ∈ A*, we also have ⋃_{i=1}^∞ E_i ∈ A*. We claim that for disjoint subsets E_1, E_2 ∈ A*,

µ*(F ∩ (E_1 ∪ E_2)) = µ*(F ∩ E_1) + µ*(F ∩ E_2).

This is actually not too bad: because E_1 is measurable, we know that (using F ∩ (E_1 ∪ E_2) in the definition instead of F)

µ*(F ∩ (E_1 ∪ E_2)) = µ*(F ∩ (E_1 ∪ E_2) ∩ E_1) + µ*(F ∩ (E_1 ∪ E_2) \ E_1) = µ*(F ∩ E_1) + µ*(F ∩ E_2).

To show closure under countable unions, we can first assume without loss of generality that the E_i are disjoint (because we already have closure under complementation and finite unions, so we can take differences of sets by defining M_1 = E_1, M_2 = E_2 \ E_1 = E_2 ∩ E_1^c, and so on). Define

A_n = ⋃_{i=1}^n E_i,  A = ⋃_{i=1}^∞ E_i.

Since we've proved closure for finite unions, we know that all A_n ∈ A*. Thus,

µ*(F) = µ*(F ∩ A_n) + µ*(F \ A_n)

for any n (by definition of measurability), so this is

n X ∗ ∗ ≥ µ (F ∩ Ei ) + µ (F \ A) i=1

(the first term by the claim we just established, and the second term by the monotonicity of µ∗). So now take n → ∞: we find that ∞ ∗ X ∗ ∗ µ (F ) ≥ µ (F ∩ Ei ) + µ (F \ A), i=1 and by countable subadditivity, this is ≥ µ∗(F ∩ A) + µ∗(F \ A), which is what we want! And now we’ve shown that A∗ is a σ-algebra. To prove that µ∗ is indeed a measure on A∗, we need to show countable additivity: and we can do this by showing inequalities on both sides. We know that

∞ ! ∞ ∗ G X ∗ µ Ai ≤ µ (Ai ) i=1 i=1 by countable subadditivity of µ∗, and also

∞ ! n ! n ∗ G ∗ G X ∗ µ Ai ≥ µ Ai = µ (Ai ), i=1 i=1 i=1 and taking n → ∞ gives the other direction of the inequality. This gives us countable additivity of µ∗, and we’re done.

Since we’ve already proved this is unique, we’ve finally extended our measure to σ(I), as desired! So we do have a uniform distribution on [0, 1].

4 September 16, 2019

The first homework assignment will be posted later today (after lecture); it will be due in class next Wednesday. There will be about 4–5 problem sets in this class: this first one will have us work a little more with properties of measures. Let’s start by restating what’s been covered over the last two lectures:

Theorem 25 (Lebesgue-Stieltjes)

Given a nondecreasing and right-continuous function F : R → R, there exists a unique measure µ = µF on (R, BR) such that µ((a, b ]) = F (b) − F (a).

This proof has two parts: first, we extend µ from I, the set of half-open intervals of the form (a, b], to A = {finite disjoint unions of elements of I}. Then, we extend from A to σ(A) (the Borel σ-algebra) with the Carathéodory extension theorem: uniqueness follows from the π-λ theorem, and existence comes from the construction of the outer measure. This is a class about probability, so let’s start to introduce random variables and their properties! We’ll see soon that these variables are specific examples of measurable functions, and that properties like the mean are specific examples of Lebesgue integrals.

Definition 26 A probability space is a measure space (Ω, F, µ) such that µ(Ω) = 1. We will often denote µ as P.

One way to interpret this is that Ω is a space of outcomes, F is the collection of possible events (some sets of outcomes are not “well-behaved” enough to be assigned a probability), and P is the measure itself.

Example 27 We can encode a sequence of n independent fair coin tosses via

Ω = {heads, tails}ⁿ = {0, 1}ⁿ, F = P(Ω) (the powerset), P(A) = |A| / |Ω| ∀A ⊆ Ω.

Notice that it’s okay to take the powerset of Ω because it is a finite set. By definition, here the probability measure is uniform: for any individual event (that is, |A| = 1), we have P(A) = 1/2ⁿ. We’ll now be a bit more general and abstract: let (Ω, F, µ) be any measure space, in particular allowing µ(Ω) = ∞.
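The finite coin space in Example 27 is small enough to enumerate by brute force. Here is a quick Python sketch (our own illustration; the helper name `uniform_P` is not from the course):

```python
from itertools import product
from fractions import Fraction

# Example 27: Omega = {0,1}^n with the uniform measure P(A) = |A| / |Omega|
# defined on the full powerset.
def uniform_P(A, n):
    """Probability of an event A (any set of outcomes) for n fair tosses."""
    return Fraction(len(A), 2 ** n)

n = 3
Omega = list(product([0, 1], repeat=n))      # all 2^n outcomes

A = {w for w in Omega if w[0] == 1}          # event: first toss is heads (= 1)
assert uniform_P(A, n) == Fraction(1, 2)

# every singleton event has probability 1/2^n, and the total mass is 1
assert all(uniform_P({w}, n) == Fraction(1, 2 ** n) for w in Omega)
assert sum(uniform_P({w}, n) for w in Omega) == 1
```

Using `Fraction` keeps all the probabilities exact rather than floating-point.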

Definition 28 An indicator function for a set A ∈ F is defined by 1A(ω) = 1 if ω ∈ A, and 1A(ω) = 0 if ω ∈ Ω \ A.

This may also be denoted 1{ω ∈ A}.

Definition 29 A simple function f :Ω → R can be written in the form

f = ∑_{i=1}^n ci 1_{Ai},

where ci ∈ R and Ai ∈ F.

Definition 30 A measurable function f is one that can be pointwise approximated by simple functions: that is, there exists a

sequence of simple functions {fn} such that

f(ω) = lim_{n→∞} fn(ω) ∀ω ∈ Ω.
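To make this pointwise approximation concrete, here is a small Python sketch (our own illustration, not from the notes) of one standard choice of approximating sequence: dyadic staircases ⌊2ⁿf⌋/2ⁿ capped at 2ⁿ, which increase pointwise to a nonnegative f:

```python
import math

# One standard sequence of simple functions converging pointwise to a
# nonnegative f: chop the *target* axis into dyadic steps of height 1/2^n
# and cap the value at 2^n.  Each f_n takes finitely many values, so it is
# simple, and f_n increases pointwise up to f.
def simple_approx(f, n):
    return lambda x: min(math.floor(f(x) * 2 ** n) / 2 ** n, 2 ** n)

f = lambda x: x * x          # an example measurable function
for x in [0.0, 0.3, 1.7, 5.0]:
    errs = [abs(simple_approx(f, n)(x) - f(x)) for n in range(1, 20)]
    assert errs == sorted(errs, reverse=True)   # error shrinks monotonically
    assert errs[-1] < 1e-4                      # ... toward 0 at each point
```

The same staircase construction reappears whenever one extends a statement from simple functions to all nonnegative measurable functions.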

Definition 31 Let (Ω, F) and (S, G) be two measurable spaces. A measurable mapping is a function f : Ω → S such that for every G ∈ G, the preimage f⁻¹(G) belongs to F.

The intuition to have in mind is that if we take any “well-behaved” set in S, its preimage is “well-behaved” in Ω.

Proposition 32 A measurable function f on (Ω, F) (as defined in Definition 30) is equivalent to a measurable mapping (Ω, F) →

(R, BR) as defined in Definition 31.

Proof. Exercise, left for our homework!

The definition of a measurable function is more concrete, while the mapping is a bit more abstract. The latter is useful because a measure µ on the domain space Ω naturally induces a measure on the target space S!

Proposition 33 If µ is a measure on (Ω, F), and we have a measurable mapping f : (Ω, F) → (S, G), then we have a measure ν on (S, G) called the pushforward measure: ν(G) = µ(f⁻¹(G)) ∀G ∈ G.

This is typically written as ν = f∗µ or f#µ.

Proposition 34 If f , g are measurable, then af + bg is measurable for all a, b ∈ R, and so is f · g.

Proof sketch. Write f and g as pointwise limits of simple functions and do some bookkeeping.

Definition 35 A random variable is a measurable function defined on a probability space (Ω, F, P). (This is usually denoted X or Y rather than f.) More generally, an (S, G)-valued random variable is a measurable mapping (Ω, F) → (S, G).

Usually, random variables are real-valued, but we may occasionally specify a different target space.

Fact 36 (Helpful notation) Since X is a function from Ω → R, we often consider the pre-image X⁻¹(B) = {ω ∈ Ω : X(ω) ∈ B}. This will be denoted {X ∈ B}: for example,

{X ≥ 1/2} = {ω ∈ Ω : X(ω) ≥ 1/2}.

Why is this a good notation? X is a mapping from some space (Ω, F, P) to the real line (R, B), so if we take the pushforward measure X∗P, we get the distribution or law of our random variable X, often denoted LX . Let’s do the symbol pushing to make sure we understand what this means! This is a measure on our target space (R, B): it takes in a measurable subset B ∈ B and outputs

LX(B) = (X∗P)(B) = P(X⁻¹(B)) = P({ω ∈ Ω : X(ω) ∈ B}).

So measure theory gives a meaning to P(X ∈ B), which is a notation that everyone likes to use.

Example 37 Looking back at the coin-tossing example, where we have F = P(Ω) and P a uniform measure on Ω, we can consider the following events:

X(ω) = ω1 = 1{first toss is heads},  Y(ω) = ∑_{i=1}^n ωi = number of heads.

We can visualize Ω (in the case n = 3) as the vertices of a cube: under the map Y, each vertex (x, y, z) of the cube maps to x + y + z ∈ R. (This is an example of why we don’t always just want the identity mapping for our random variables: we’re condensing our information.) The law LY is a measure on (R, B), supported on {0, 1, ···, n}, so we just need to find the weight assigned to each integer. Let’s work this out in formal notation:

LY({k}) = P(Y⁻¹({k})) = P({ω ∈ {0, 1}ⁿ : Y(ω) = k}) = (n choose k) / 2ⁿ.

This is the Bin(n, 1/2) distribution. Now that we have a more concrete picture of our random variables, let’s move on to Lebesgue integration. Assume that (Ω, F, µ) is a σ-finite measure space: that means that there exists a sequence Ωn ↑ Ω such that µ(Ωn) < ∞ for all n. Suppose that we have a measurable function f : (Ω, F) → (R, B): our goal is to define the quantity

∫_Ω f dµ = ∫_Ω f(ω) dµ(ω),

such that in the case where µ = P is a probability measure, this yields the mean of f. Since we build up measurable functions from simple functions, it makes sense to do the same for integrals.

1. For any simple function f = ∑_{i=1}^n ci 1_{Ai}, define

∫ f dµ = ∑_{i=1}^n ci µ(Ai).

It’s bookkeeping to check that this integral doesn’t depend on the representation of f as a simple function. (If we can write f using two representations ∑_{i=1}^n ci 1_{Ai} and ∑_{j=1}^m c′j 1_{Bj}, then we can split both integrals up into sums over the intersections Ai ∩ Bj.)

2. Now, let’s define the integral for bounded measurable functions: for a measurable function f : (Ω, F) → (R, B) with sup_{ω∈Ω} |f(ω)| < ∞ and µ({f ≠ 0}) < ∞ (meaning the support has finite measure), define

∫ f dµ = sup { ∫ ϕ dµ : ϕ ≤ f, ϕ a simple function } = S.

It seems arbitrary to approach f from below: in particular, we can also define

I = inf { ∫ ϕ dµ : ϕ ≥ f, ϕ a simple function }.

Claim 38. I = S: these give the same integral.

Proof of claim. It’s clear that S ≤ I, because I is the approximation of f from above and S is the one from below. To show that I ≤ S as well, we can sandwich our function between simple functions:

f_m^lower ≤ f ≤ f_m^upper, where f_m^upper = ⌈mf⌉/m, f_m^lower = (⌈mf⌉ − 1)/m · 1{f ≠ 0}.

Basically, this sifts our function f in increments of 1/m. Now if M is the supremum of |f|, then f_m^upper is at most M + 1, and f_m^lower is at least −M − 1, so these are both bounded as well. (Also, both are 0 when f = 0.) Now we can write I − S in terms of integrals of simple functions:

I − S ≤ ∫ f_m^upper dµ − ∫ f_m^lower dµ,

and these two functions only differ by at most 1/m on {f ≠ 0}, so the right side is at most (1/m) · µ({f ≠ 0}), which goes to 0 as m → ∞. So I − S ≤ 0, meaning I ≤ S.
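Here is a quick numerical illustration of this sandwich (our own sketch; the “integral” below is a fine-grid approximation on [0, 1], not a true Lebesgue integral):

```python
import math

# f_m^upper = ceil(m f)/m and f_m^lower = (ceil(m f) - 1)/m on {f != 0}
# differ by exactly 1/m wherever f != 0, so their integrals differ by at
# most mu({f != 0})/m -- here mu({f != 0}) = 1.
f = lambda x: math.sin(math.pi * x)

def upper(x, m): return math.ceil(m * f(x)) / m
def lower(x, m): return (math.ceil(m * f(x)) - 1) / m if f(x) != 0 else 0.0

N = 10_000
xs = [(i + 0.5) / N for i in range(N)]        # grid approximating [0, 1]
for m in [10, 100, 1000]:
    # sandwich holds pointwise (up to float rounding)
    assert all(lower(x, m) <= f(x) <= upper(x, m) + 1e-12 for x in xs)
    gap = sum(upper(x, m) - lower(x, m) for x in xs) / N
    assert gap <= 1 / m + 1e-9                # integral gap is at most 1/m
```

As m grows, the gap between the upper and lower approximations shrinks like 1/m, mirroring the proof above.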

3. Now we can define the integral on nonnegative (but not necessarily bounded) functions f :

∫ f dµ = sup { ∫ h dµ : 0 ≤ h ≤ f, h bounded with µ({h ≠ 0}) < ∞ }.

In other words, we define the integral by approximating with bounded functions from below.

Claim 39. It’s equivalent to define

∫ f dµ = lim_{m→∞} ∫_{Ωm} min{f, m} dµ,

where Ωm ↑ Ω and µ(Ωm) < ∞.

In other words, we can approximate with a specific family of functions (by setting successively higher “cutoff points” of m).

Proof of claim. Denote

Im = ∫_{Ωm} min{f, m} dµ.

We claim that Im ↑ I (where I denotes ∫ f dµ as defined above). Im is increasing with respect to m, because we’re integrating a larger nonnegative function over a larger set. Also, I ≥ lim_{m→∞} Im, because each min{f, m}·1_{Ωm} is an example of a function h in the definition above: we just want to show that we can reach equality.

By definition of the integral, for any ε > 0, we can find a bounded function h (with 0 ≤ h ≤ f and µ({h ≠ 0}) < ∞) such that ∫_Ω h dµ ≥ I − ε. h is bounded, so pick a constant L such that |h| ≤ L. Then for all m ≥ L,

∫_Ω h dµ = ∫_Ω min{h, m} dµ

(because m ≥ L ≥ h), and we can separate this as

= ∫_{Ωm} min{h, m} dµ + ∫_{Ω\Ωm} min{h, m} dµ.

By definition h ≤ f, so we can replace h with f in the first term. In the second term, h is bounded from above by L, so we can make a crude approximation there:

≤ Im + L · µ((Ω \ Ωm) ∩ {h ≠ 0}).

L is a fixed constant, and as m → ∞, (Ω \ Ωm) ∩ {h ≠ 0} decreases to the empty set. Also, every set in this sequence has finite measure because {h ≠ 0} has finite measure. Recall that if µ(A0) < ∞ and An ↓ ∅, then µ(An) ↓ 0. Thus, for m large enough, the second term is at most ε:

∫_Ω h dµ ≤ Im + ε.

So for any ε > 0 and all sufficiently large m,

I − ε ≤ ∫_Ω h dµ ≤ Im + ε,

and taking m → ∞ and then ε → 0 yields I ≤ lim_{m→∞} Im, so lim_{m→∞} Im = I, as desired.

4. Now, the last part is pretty easy: we want to extend from nonnegative functions to all measurable functions f : (Ω, F) → (R, B). Just break up

f = f+ − f−, f+ = max(f , 0), f− = max(−f , 0).

(We can check that f+ and f− are both measurable if f is measurable.) Then

∫ f dµ = ∫ f+ dµ − ∫ f− dµ,

where both terms on the right side have been defined above. One caveat: if this yields ∞ − ∞, we say that the integral of f is undefined. In contrast, a function f is integrable if ∫ |f| dµ < ∞.

Definition 40 The expectation of a random variable X is

EX = ∫_Ω X(ω) dP(ω) = ∫_Ω X dP.

Why is this integral useful? Any function f : (R, B) → (R, B) that is Riemann-integrable is measurable, and the Riemann integral is the same as the Lebesgue integral with respect to the Lebesgue measure. On the other hand, the function 1_Q is not Riemann integrable (because it is very spiky), but the Lebesgue integral is perfectly well-defined: because the Lebesgue measure of a countable set is 0, we have ∫ 1_Q dµ = 0. We’ll also soon see that Lebesgue integrals have nice convergence properties! On the other hand, we can also interpret discrete sums as Lebesgue integrals:

∑_{i=1}^∞ f(i) = ∫_R f dn,

where n is the counting measure n(A) = |A ∩ N|. So Lebesgue integration can cover both very “smooth” and very discrete cases, and even some in between (which we’ll see on the homework).
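The discrete case is easy to play with directly: against a purely atomic measure, the Lebesgue integral is just a weighted sum. A small sketch (our own toy encoding of a measure as a dict of atoms, not a standard library API):

```python
from fractions import Fraction

# Integral of f against an atomic measure {atom: mass}.
def integrate(f, mu):
    return sum(f(x) * m for x, m in mu.items())

# the counting measure n restricted to {1, ..., 10}: each point has mass 1
counting = {i: 1 for i in range(1, 11)}
f = lambda i: Fraction(1, i * (i + 1))

# ∫ f dn is exactly the partial sum of the series, which telescopes:
assert integrate(f, counting) == Fraction(10, 11)   # sum = 1 - 1/11
```

The same `integrate` helper would work for any finite weighted collection of atoms, which is exactly the “in between” situation hinted at for the homework.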

5 September 18, 2019

Last time, we defined the Lebesgue integral. There are some questions about measurability on our homework (refresh the document, because some typos have been corrected since last class), so we’ll start by restating some definitions.

Remark 41. Professor Sun says that the homework “takes a while,” so we shouldn’t leave it until the last minute. It’s due next Wednesday during class.

A function f is called simple if it can be written as a linear combination f = ∑_{i=1}^n ci 1_{Ai} of indicator functions. (The Ai s can have infinite measure if we want.) Then a measurable function is defined as the pointwise limit of simple functions: it is a homework exercise to show that this is equivalent to having f⁻¹(B) ∈ F for all B ∈ B_R.

So how do we define the Lebesgue integral? Say that (Ω, F, µ) is σ-finite, meaning that there exist Ωm ↑ Ω with all µ(Ωm) < ∞. Then we can define our integral first over simple functions:

f = ∑_{i=1}^n ci 1_{Ai} with all µ(Ai) < ∞ ⟹ ∫_Ω f dµ = ∑_{i=1}^n ci µ(Ai).

With this, we can define the integral of any nonnegative function f :

∫ f dµ = sup { ∫ h dµ : 0 ≤ h ≤ f, h a simple function, µ({h > 0}) < ∞ }.

(We did this in two steps last time by defining the integral for bounded functions first – this definition is indeed equivalent.) Then we define the integral of any real-valued f to be the integral of f+ (the positive part) minus the integral of f− (the negative part): ∫_Ω f dµ = ∫_Ω f(ω) dµ(ω) is then called the Lebesgue integral. Note that this construction is defined independently of the Lebesgue measure (it works for any σ-finite measure).

At the end of last lecture, we started talking about random variables: given a measurable function X on a probability space (Ω, F, P), we can define its law to be the pushforward measure

LX = X#P.

Note that the Lebesgue integral ∫ X dP is the mean EX, which is the reason we started considering it in the first place!

Fact 42 We can think of the Riemann integral as summing the area of a bunch of rectangles by partitioning the domain into small steps. In contrast, we can think of the Lebesgue integral as partitioning the target space into small steps!

If f is sufficiently nice, both of these exist, and we want them to be consistent. This is indeed the case:

Theorem 43 If a function f :[a, b] → R is Riemann-integrable, then f is also Lebesgue-integrable, and the two integrals agree.

But as we mentioned last time, Lebesgue integration is more powerful: we can encode discrete summation via

∑_{i=1}^∞ a(i) = ∫ a dn,

where n is the counting measure on N. We can also define the Lebesgue integral in more abstract spaces (Ω, F, µ), while the Riemann integral requires some sort of Euclidean structure. Today, we’ll talk about a third advantage of the Lebesgue integral: it’s well suited for convergence theorems!

Remark 44. In practice, the Lebesgue integral is usually computed by splitting into some kind of summation. For

example, if we can write f = h + 1_Q, where h is continuous, we can Riemann integrate h and deal with 1_Q by discrete summation. So it’s pretty rare to actually use the definition of the Lebesgue integral directly! In fact, even if the integral is on a more abstract space, we can push forward onto the real line. But we need convergence properties for that. We’ll be asking questions like “if fn → f converges pointwise, does ∫ fn dµ → ∫ f dµ?” Generally, no:

Example 45 If we define

fn = n · 1_{(0, 1/n)},

the integral of each fn is 1, but the sequence fn converges pointwise to 0.
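This “escaping mass” is easy to check by hand, since each fn is simple (a quick sketch of ours, not from the notes):

```python
from fractions import Fraction

# f_n = n * 1_{(0, 1/n)} takes the value n on a set of Lebesgue measure 1/n,
# so its integral is exactly n * (1/n) = 1 for every n.
def f(n, x):
    return n if 0 < x < 1 / n else 0

assert all(Fraction(n) * Fraction(1, n) == 1 for n in range(1, 100))

# But at any fixed x > 0 the sequence is eventually 0, so f_n -> 0 pointwise:
x = 0.01
assert f(50, x) == 50                                # early terms can be large
assert all(f(n, x) == 0 for n in range(101, 200))    # ... but eventually vanish
```

So the pointwise limit has integral 0 while every term has integral 1: the mass slides off toward the origin.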

For the rest of this lecture, we’ll assume that (Ω, F, µ) is a σ-finite measure space, and that fn, f , gn, g are all

measurable functions on (Ω, F). Also, let Ωk ↑ Ω be a sequence of sets with µ(Ωk) < ∞ for all k. Last time, we showed that for a function f ≥ 0,

∫_{Ωk} min{f, k} dµ ↑ ∫ f dµ

as k → ∞. Let’s introduce a weaker notion of convergence:

Definition 46

Say that fn → f converges with respect to µ almost everywhere if fn(ω) → f (ω) except for ω in a set of

µ-measure zero. Similarly, say that fn → f converges in µ-measure if for all ε > 0,

µ ({ω : |fn(ω) − f (ω)| ≥ ε}) → 0

as n → ∞.

It will be a future homework problem to show that pointwise convergence implies µ-almost-everywhere convergence, which implies convergence in measure, but that the converses are not true! Let’s start with something easier, which is unfortunately the least useful:
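To see the definition of convergence in µ-measure in action (our own sketch, separate from the homework problem), take fn(x) = xⁿ on [0, 1] with Lebesgue measure and f = 0: the “bad set” is an interval whose measure we can compute in closed form.

```python
# For f_n(x) = x^n on [0, 1], the exceedance set {x : x^n >= eps} is the
# interval [eps^(1/n), 1], which has Lebesgue measure 1 - eps^(1/n).
def measure_of_exceedance(n, eps):
    return 1 - eps ** (1 / n)

eps = 0.1
ms = [measure_of_exceedance(n, eps) for n in range(1, 200)]
assert ms == sorted(ms, reverse=True)   # the bad set shrinks with n
assert ms[-1] < 0.02                    # ... and its measure tends to 0
```

Here the sequence happens to converge almost everywhere as well; producing examples that separate the notions is exactly the homework exercise mentioned above.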

Theorem 47 (Bounded Convergence Theorem)

Suppose that fn are uniformly bounded: there exists M < ∞ such that |fn| ≤ M for all n. Also suppose that

µ({ω : fn(ω) ≠ 0 for some n}) < ∞.

Then if fn → f in µ-measure, we have ∫ fn dµ → ∫ f dµ.

Proof. Let E be the set {ω : fn(ω) ≠ 0 for some n}. The assumptions imply that |f| ≤ M everywhere except a set of measure zero, and the set {f ≠ 0} is contained in E up to a null set (we can’t have fn(x) = 0 for all n but f(x) ≠ 0, outside a set of measure zero). Thus, we basically don’t care about Ω \ E, because all of our functions are 0 there. We’ll basically split up E into two regions. Fix ε > 0, and define

Bn = {|fn − f | ≥ ε}.

We know that since fn → f in µ-measure, this set can’t have large measure. Now we can bound the difference:

|∫_Ω f dµ − ∫_Ω fn dµ| = |∫_E (fn − f) dµ|
≤ ∫_E |fn − f| dµ
≤ ε·µ(E \ Bn) + µ(Bn) · 2M
≤ ε·µ(E) + 2M·µ(Bn),

because |fn − f| is at most 2M (on Bn) and less than ε (on E \ Bn). By assumption, as n → ∞, µ(Bn) → 0, so

lim sup_{n→∞} |∫_Ω (fn − f) dµ| ≤ ε·µ(E),

and now we can send ε → 0 to show that ∫_Ω fn dµ does indeed converge to ∫_Ω f dµ, as desired.

Theorem 48 (Fatou’s lemma)

If fn ≥ 0, then

∫ ( lim inf_{n→∞} fn ) dµ ≤ lim inf_{n→∞} ( ∫ fn dµ ).

Notice that hidden in this statement, we are assuming that if fn are all measurable, then the lim inf of fn is also measurable. We can check this for ourselves!

Remark 49. Whenever we see lim inf, we should expect that this comes with the µ-almost-everywhere condition, and this is okay because we’re integrating over the function.

Proof. Let f = lim inf_{n→∞} fn. By definition,

lim inf_{n→∞} fn = sup_{n≥1} ( inf_{ℓ≥n} f_ℓ )

(we can use sup instead of lim because the inner term inf_{ℓ≥n} f_ℓ is nondecreasing with n). Define this inner term to be gn: note that 0 ≤ gn ≤ fn for any n, and gn ↑ f. Fix k; we claim that

∫_{Ωk} min{f, k} dµ = lim_{n→∞} ∫_{Ωk} min{gn, k} dµ.

This is because gn converges pointwise to f, so it converges in measure, and then we can use the bounded convergence theorem! (By definition, Ωk has finite measure.) And now

lim_{n→∞} ∫_{Ωk} min{gn, k} dµ ≤ lim_{n→∞} ∫ gn dµ ≤ lim inf_{n→∞} ∫ fn dµ

(first step because we removed the cap on the function and also on the space Ωk, second step because gn ≤ fn). Now, letting k → ∞, the left side ∫_{Ωk} min{f, k} dµ converges to ∫ f dµ (this was a result from last time). Thus ∫ f dµ is at most lim inf_{n→∞} ∫ fn dµ, as desired.

Example 50

Coming back to our example fn = n·1_{(0, 1/n)}, where fn → f = 0 pointwise, notice that

0 = ∫ ( lim inf_{n→∞} fn ) dµ < lim inf_{n→∞} ∫ fn dµ = 1.

Fatou’s lemma is useful because of its applications:

Theorem 51 (Monotone Convergence Theorem) Given a sequence fn ≥ 0 of nonnegative functions, if fn ↑ f, then ∫ fn dµ ↑ ∫ f dµ (where the integrals are allowed to be infinite).

Proof. Since the Lebesgue integral is monotone, ∫ fn dµ is a nondecreasing sequence, so it converges to some limit I ≤ ∫ f dµ (since all fn sit below f). The other direction follows directly by Fatou’s lemma:

∫ f dµ = ∫ ( lim inf_{n→∞} fn ) dµ ≤ lim inf_{n→∞} ∫ fn dµ = lim_{n→∞} ∫ fn dµ.

Theorem 52 (Dominated Convergence Theorem) Suppose there exists an integrable function g (that is, ∫ |g| dµ < ∞) such that |fn| ≤ g for all n. Then if fn → f µ-almost-everywhere, we have ∫ fn dµ → ∫ f dµ.

(It may be good to notice how our example fails each of the integral convergence tests.)

Proof. By assumption, −g ≤ fn ≤ g for all n, so fn + g ≥ 0 and g − fn ≥ 0. Since the fn converge to f almost everywhere, f is the lim inf of the fn (up to a null set), so by Fatou’s lemma,

∫ (g ± f) dµ ≤ lim inf_{n→∞} ∫ (g ± fn) dµ.

Rearranging (using both +f and −f),

lim sup_{n→∞} ∫ fn dµ ≤ ∫ f dµ ≤ lim inf_{n→∞} ∫ fn dµ,

which gives us what we want.
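A concrete instance of dominated convergence (our own illustration): fn(x) = xⁿ on [0, 1] is dominated by g ≡ 1, which is integrable on [0, 1], and fn → 0 almost everywhere (everywhere except x = 1).

```python
from fractions import Fraction

# The exact integrals ∫_0^1 x^n dx = 1/(n+1) indeed decrease to 0 = ∫ f dµ.
integrals = [Fraction(1, n + 1) for n in range(1, 1000)]
assert all(a > b for a, b in zip(integrals, integrals[1:]))
assert float(integrals[-1]) < 1e-2

# Cross-check one closed-form integral with a midpoint Riemann sum:
riemann = sum(((i + 0.5) / 1000) ** 5 for i in range(1000)) / 1000
assert abs(riemann - 1 / 6) < 1e-3   # ∫_0^1 x^5 dx = 1/6
```

By contrast, the example fn = n·1_{(0,1/n)} admits no integrable dominating g (the smallest candidate, sup_n fn, has infinite integral), which is why its integrals fail to converge to the integral of the limit.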

Here’s an application of what we’ve proved so far. Say that we’re working with a probability space (Ω, F, P), and we have a random variable Y with mean EY = ∫_Ω Y dP. Say that Y can be written as f(X), where X is another random variable:

(Ω, F, P) −X→ (S, G, µ) −f→ (R, B, ν),

and Y = f ∘ X. Since µ is the pushforward measure under X of P, meaning that µ = X#P = P ∘ X⁻¹, and ν is the pushforward of µ under f, meaning that ν = f#µ = Y#P, it’s natural to ask if

∫_Ω Y dP ≟ ∫_S f dµ.

We’ll start with a simple case:

Example 53 When Ω is finite, we can just write ∑_{ω∈Ω} p(ω) = 1. Then Y is a simple function Y = ∑_{ω∈Ω} f(X(ω)) 1_{ω}, so the expectation of Y is

EY = ∫_Ω Y dP = ∑_{ω∈Ω} f(X(ω)) p(ω).

f is also a simple function, so we can write

f = ∑_{x∈S} f(x) 1_{x} ⟹ ∫_S f dµ = ∑_{x∈S} f(x) µ({x}).

Now µ({x}) = (X#P)({x}) = P(X⁻¹({x})), which is the same as the measure of the set of ωs that are sent to x under X, so rearranging gives the result.

But this isn’t always just symbol pushing: in general, the theorem requires a bit more work.

Theorem 54 (Change of variables formula) Say we have maps

(Ω, F, P) −X→ (S, G, µ) −f→ (R, B, ν), Y = f(X).

If either f ≥ 0 or ∫ |Y| dP < ∞, then

EY = ∫_Ω f(X(ω)) dP(ω) = ∫_S f dµ = ∫_S f(x) d(P ∘ X⁻¹)(x).

Proof. This is a pretty standard proof method, so we should pay attention to it. Whenever we want to show something about integrals, start with indicators. Assume f = 1_B for some set B ∈ G: since f is simple on S,

∫_S f dµ = µ(B) = P(X⁻¹(B)).

On the other hand, if we consider Y = f(X) = 1_B(X), Y is defined on Ω, and it’s the indicator function 1_{X⁻¹(B)}. Since this is a simple function on Ω,

∫_Ω Y dP = P(X⁻¹(B)).

Thus, it follows by linearity that

∫_S f dµ = ∫_Ω f(X) dP

for all simple functions (which are finite linear combinations of indicator functions). Now, we can prove this for nonnegative functions f ≥ 0: the useful trick here is to consider

fn = min{ ⌊2ⁿ f⌋ / 2ⁿ, 2ⁿ }.

(It’s important that we’re subdividing our partitions each time to ensure that the functions fn are nondecreasing; this wouldn’t work if we used n instead of 2ⁿ.) This is a simple function, and fn ↑ f pointwise. We know that

∫ fn dµ = ∫ fn(X) dP

for all n, and sending n → ∞, these converge by the monotone convergence theorem to

∫ f dµ = ∫ Y dP,

as desired. Finally, split into the positive and negative parts for a general function f : in order for this to work, Y must be integrable.

This is most commonly used when we’re trying to compute a Lebesgue integral: if we want to find EX for some random variable X on a complicated space (Ω, F, P), we can apply some map

(Ω, F, P) −X→ (R, B, LX) −id→ (R, B, LX).

(Here Y = X.) The theorem above then implies that

EX = ∫_R id(x) dLX(x) = ∫_R x dLX(x),

which is what we expect: the expected value of X should only depend on its properties on the real line!
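The change of variables formula is easy to verify by brute force on the coin example (our own check, with Y = number of heads):

```python
from itertools import product
from fractions import Fraction

# Computing E[Y] directly on Omega and via the law L_Y = Y_# P on
# {0, ..., n} must give the same answer.
n = 4
Omega = list(product([0, 1], repeat=n))
p = Fraction(1, 2 ** n)                       # uniform mass of each outcome
Y = lambda w: sum(w)

EY_omega = sum(Y(w) * p for w in Omega)       # ∫_Ω Y dP

law = {}                                      # L_Y({k}) = P(Y = k)
for w in Omega:
    law[Y(w)] = law.get(Y(w), 0) + p
EY_law = sum(k * q for k, q in law.items())   # ∫_R x dL_Y(x)

assert sum(law.values()) == 1
assert EY_omega == EY_law == 2                # n/2 = 2 heads on average
```

This is exactly the finite case of Example 53: both computations rearrange the same sum over outcomes.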

6 September 23, 2019

This is a final reminder that homework is due on Wednesday. Because of a seminar this week, office hours have been changed to Monday 2–3 and Tuesday 11–12. (Homework should be printed and submitted in class.) Two TAs will be reading and grading, so if we are handwriting, we should make sure it is neat.

Remark 55. Writing more is not generally helpful; solutions should be correct in some minimal way.

So far, we’ve been considering measures on R, which is one-dimensional. Today, we’ll be generalizing to higher dimensions, and to do this, we’ll actually need to study our construction on R in more detail. Recall that we started with the collection of half-open intervals equipped with a Lebesgue-Stieltjes function F :

I = {(a, b ]}, µ((a, b ]) = F (b) − F (a).

We first extended this to A, the finite disjoint unions of elements of I, and we had to check that µ was indeed countably additive: that is, given disjoint Ai ∈ A with ⊔_{i=1}^∞ Ai ∈ A,

µ( ⊔_{i=1}^∞ Ai ) = ∑_{i=1}^∞ µ(Ai).

Once we did this, we used the Caratheodory extension theorem to obtain a measure on σ(A). Remember that our proof of this second step holds in an abstract setting, as long as A is an algebra (meaning it is closed under complementation and finite union) and µ is countably additive on A. However, the first step (where we extend from I to A) did use some topological properties of R. So if we want to do this step in general, we want I to actually be a semialgebra:

Definition 56 A semialgebra S is closed under finite intersection and semiclosed under complementation. In other words, if E ∈ S, then Ω \ E is a finite disjoint union of elements of S.

For example, R \ (a, b] = (−∞, a] ⊔ (b, ∞), and both pieces are elements of I as well, so I is a semialgebra. (If we’re concerned that (b, ∞) lacks a closed right endpoint, recall that we actually defined I to contain elements of the form (a, b] ∩ R, with b = ∞ allowed.) So once we have this semialgebra S, our

goal is to extend µ to A, the set of finite disjoint unions of elements in S (in other words, the algebra generated by S).

Lemma 57 Let S be a semialgebra over Ω, and let µ : S → [0, ∞] be countably additive over S. In other words, if we have Ei, E ∈ S such that E = ⊔_{i=1}^∞ Ei, then µ(E) = ∑_{i=1}^∞ µ(Ei). Then there is a unique µ : A → [0, ∞] which extends µ on S and is countably additive over A.

Proof. This was essentially proved during lectures 2 and 3, and it’s a bit of bookkeeping. The details will be on our homework.

We’ll be applying these to product measures. The idea is that if we have (for instance) a Lebesgue measure defined on two different axes, then the measure of a rectangle should just be the product of the Lebesgue measures of the two side lengths. Our goal is to formalize that! For the rest of this lecture, say that we have two measurable spaces (S, G) and (T, H). The product space can just be written as Ω = S × T , but it’s a bit harder to write down what the measurable subsets of Ω should be.

Definition 58 The product σ-field F of (S, G) and (T, H) is defined as the sigma-field σ(S), where

S = {A × B : A ∈ G,B ∈ H}

is the set of “rectangles.”

Note that we’re claiming here that S is indeed a semialgebra. We should draw some pictures to convince ourselves of this: the intersection of two rectangles is a rectangle, and we can decompose Ω \ (A × B) into rectangles in a picture too. But we obviously shouldn’t write that on a test: if we want to be more formal, the right thing to do is to say (A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)

(and indeed both sets on the right side are measurable), and

Ω \ (A × B) = (A × (T \ B)) ⊔ ((S \ A) × T).

So the (product) measurable space that we’ll be trying to work with here is

(Ω, F) = (S × T, G ⊗ H).

All that’s left is for us to actually put a measure on it! If we’re equipped with measures (S, G, λ) and (T, H, ρ), and we’ll assume here that λ(S), ρ(T ) < ∞ (though σ-finite is sufficient), our goal is to define a product measure µ = λ ⊗ ρ on (Ω, F). We’re trying to use Lemma 57 here, which motivates the following:

Lemma 59 Define a function µ : S → [0, ∞) via µ(A × B) = λ(A) · ρ(B). Then µ is countably additive over S: that is, if we can write a rectangle A × B as a disjoint countable union of rectangles A × B = ⊔_{i=1}^∞ (Ai × Bi), then

λ(A)ρ(B) = ∑_{i=1}^∞ λ(Ai)ρ(Bi).

Proof. Assume without loss of generality that all sets are nonempty. (Otherwise, discard them.) Consider the coordinate projection map π(x, y) = x, which just outputs the element of S. Applying π to both sides,

A = π(A × B) = π( ⊔_{i=1}^∞ (Ai × Bi) ) = ⋃_{i=1}^∞ Ai,

since we’re basically getting the projections of the Ai s individually (though we obviously no longer need to have a disjoint union). Thus, A is the union of the Ai s, and by symmetry B is the union of the Bi s.

Now by definition, the pair (x, y) belongs to A × B if and only if there exists some unique index i such that (x, y) ∈ Ai × Bi. Then for any x ∈ A, we can look at the cross-section along B and break it up into disjoint pieces:

B = ⊔_{i : x∈Ai} Bi.

Since we know that ρ is a measure, we can say that

ρ(B) = ∑_{i : x∈Ai} ρ(Bi)

for any x ∈ A. So we can define the function

f(x) = 1{x ∈ A} · ρ(B) = 1{x ∈ A} ∑_{i : x∈Ai} ρ(Bi) = ∑_{i=1}^∞ 1{x ∈ Ai} ρ(Bi).

Since the left side is a number ρ(B) times an indicator function 1_A, f is measurable. Similarly, the right-hand side is a pointwise limit of simple functions (because the partial sums are monotone), so it is measurable too. Thus we can integrate both sides:

λ(A)ρ(B) = ∫ [ ∑_{i=1}^∞ 1{x ∈ Ai} ρ(Bi) ] λ(dx),

and by the monotone convergence theorem, we can interchange the sum and the integral:

= ∑_{i=1}^∞ λ(Ai)ρ(Bi),

and we’re done.
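As a quick numerical sanity check of Lemma 59 (our own illustration, on one concrete partition):

```python
from fractions import Fraction

# The rectangle A x B = [0,2) x [0,2) written as a disjoint union of three
# rectangles, with lambda = rho = length, so mu(A x B) = lambda(A) * rho(B).
def mu(rect):
    (a1, a2), (b1, b2) = rect
    return Fraction(a2 - a1) * Fraction(b2 - b1)

whole = ((0, 2), (0, 2))
pieces = [((0, 1), (0, 2)),    # left half
          ((1, 2), (0, 1)),    # bottom-right quarter
          ((1, 2), (1, 2))]    # top-right quarter

# disjointness and covering are by inspection; additivity checks out:
assert mu(whole) == sum(mu(r) for r in pieces) == 4
```

Note that the pieces need not come from a product grid, which is exactly why the cross-section argument in the proof is needed.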

So now applying Lemma 57, we’ve found a unique µ that extends µ and is countably additive over A, and applying the Caratheodory extension theorem gives us a measure µ on (Ω, F)! There’s an obvious generalization: we can define

(µ1 ⊗ µ2) ⊗ µ3 ⊗ ···,

and we can check that the order of operations doesn’t matter here. It’s also very common to work with infinite product spaces

⊗_{i=1}^∞ (Ωi, Fi, µi),

but these are more tricky and we’ll talk about them later. For now, let’s discuss the generalization of the Lebesgue integral to product spaces as well! Recall that special cases here include double sums ∑_{i,j} aij, so we’re not just working

with Riemann double integrals. Recall that there’s a trick of changing the order of summation,

∑_{i=1}^N ∑_{j=1}^M aij = ∑_{j=1}^M ∑_{i=1}^N aij,

but we don’t always get something so nice for infinite sums:

Example 60 Consider

aij = 1 if i = j, aij = −1 if i = j + 1, aij = 0 otherwise.

Then

∑_{i=1}^∞ ( ∑_{j=1}^∞ aij ) = 1 + 0 + ··· = 1,

while

∑_{j=1}^∞ ( ∑_{i=1}^∞ aij ) = 0 + 0 + ··· = 0.
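The steps above are easy to reproduce numerically (our own check), since each row and column has only finitely many nonzero entries, so the inner sums below are exact and only the outer sums are truncated:

```python
# a_ij = 1 if i == j, -1 if i == j + 1, 0 otherwise.
def a(i, j):
    return 1 if i == j else (-1 if i == j + 1 else 0)

N = 200
def row_sum(i): return sum(a(i, j) for j in range(1, N + 2))
def col_sum(j): return sum(a(i, j) for i in range(1, N + 2))

assert sum(row_sum(i) for i in range(1, N + 1)) == 1   # ∑_i (∑_j a_ij) = 1
assert sum(col_sum(j) for j in range(1, N + 1)) == 0   # ∑_j (∑_i a_ij) = 0
```

Every row sums to 0 except the first (which sums to 1), while every column sums to 0: the two iterated sums genuinely disagree.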

In cases like this, we’ll say that the double sum ∑_{i,j} aij is undefined (and we can’t freely change the order). However, there are some cases where this is okay:

Theorem 61 Let (S, G, λ) and (T, H, ρ) be finite measure spaces, and define (Ω, F, µ) = (S × T, G ⊗ H, λ ⊗ ρ). Let f be a measurable function defined on (Ω, F): if f ≥ 0 or ∫ |f| dµ < ∞, then

∫_S ( ∫_T f(x, y) dρ(y) ) dλ(x) = ∫_Ω f dµ = ∫_T ( ∫_S f(x, y) dλ(x) ) dρ(y).

This goes by the names of Tonelli’s theorem (f ≥ 0) and Fubini’s theorem (∫ |f| dµ < ∞). Recall that the middle integral here is defined by the way we’ve constructed the Lebesgue integral, but it isn’t clear a priori that the left and right double integrals are actually defined!

Proof. We’ll prove the first equality (the other follows by symmetry). We’ll do the usual proof method: let’s start with an indicator function f = 1E for a measurable E ∈ F = G ⊗ H. Fix x, and define fx (y) = f (x, y) to emphasize that f is just a function of y for now. Then

(1E)x (y) = 1{(x, y) ∈ E} = 1{y ∈ Ex },

where Ex is the cross-section of x across E in the space T .

Claim 62. Ex is measurable.

Proof of claim. A slick way to do this is to consider

Fx = {E ∈ F : Ex measurable} ⊆ F.

We want to show that Fx = F. Fx is a σ-field, because:

• If E ∈ Fx, then (Ω \ E)x = T \ Ex (because the complement of the cross-section is the cross-section of the complement), which belongs to H as well, because Ex ∈ H is measurable by definition and H is a σ-field.

• If Ei ∈ Fx, the countable union satisfies (⋃_{i=1}^∞ Ei)x = ⋃_{i=1}^∞ (Ei)x ∈ H by the same logic (since the countable union of the cross-sections is the cross-section of the countable union).

Also, the cross-section of a rectangle A × B is either B or empty, so Fx contains S. But Fx is then a σ-field containing S, so it contains σ(S) = F, and thus Fx = F, as desired.

So going back to our function f = 1_E, the function fx(y) = 1{y ∈ Ex} is a measurable function of y on (T, H). Thus, the inner integral is indeed well-defined:

∫_T f(x, y) dρ(y) = ∫_T 1_E(x, y) dρ(y) = ∫_T 1{y ∈ Ex} dρ(y) = ρ(Ex).

It just remains to do the outer integral: we first need to show that the map x ↦ ρ(Ex) is a measurable function on (S, G). The picture in our mind is that we basically slide x along the axis for S, and we want to see if this is a measurable function! To do this, we use a similar trick: define

F^ρ = { E ∈ F : x ↦ ρ(Ex) is measurable on (S, G), and ∫_S ρ(Ex) dλ(x) = µ(E) } ⊆ F.

(This final condition shows what we want: that this double integral is indeed equal to ∫_Ω f dµ.) To show that F^ρ = F, note that F^ρ contains the rectangles S, because (1) if E = A × B, then ρ(Ex) gives back ρ(B) if x ∈ A and 0 otherwise, so ρ(Ex) = ρ(B)·1{x ∈ A}, which is indeed measurable, and (2) this integrates to ρ(B)λ(A) = µ(E). Also, S is a π-system (it’s closed under intersection), so it’s our goal to show that F^ρ is a λ-system.

• The total space Ω = S × T belongs to F^ρ, because it is a rectangle.

• Given E1, E2 ∈ F^ρ with E1 ⊇ E2, notice that

ρ((E1 \ E2)x ) = ρ((E1)x \ (E2)x ) = ρ((E1)x ) − ρ((E2)x )

because the cross-section of the difference is the difference of the cross-sections, and because ρ is a measure. Since both terms are measurable functions of x and the difference of measurable functions is measurable, this is indeed a measurable function! The last condition (that it integrates to µ(E1 \ E2)) is true by linearity of integration.

• Finally, we want to show that if Ei ∈ F^ρ and Ei ↑ E, then the limit E ∈ F^ρ. The cross-sections will increase to the cross-section of the limit:

ρ((Ei )x ) ↑ ρ(Ex )

for each x because ρ is a measure (continuity from below, taking the countable union), and each ρ((E_i)_x) is measurable, so ρ(E_x) is measurable in x. The last condition holds by the monotone convergence theorem.

So now by the π-λ theorem, F^ρ = F, and we've finished the proof for indicator functions 1_E of any measurable E ∈ F = G ⊗ H. Now we just run the usual machinery: the result holds for simple functions by linearity, then for all nonnegative functions by approximating from below with the monotone convergence theorem, and finally for all integrable functions by splitting into positive and negative parts.

7 September 25, 2019

Recall that last time, we constructed the product measure on two σ-finite measure spaces: given (S, G, λ) and (T, H, ρ), we define Ω = S × T, F = G ⊗ H, µ = λ ⊗ ρ. Note that there are also non-product measures on such a space, such as the uniform measure on some triangle: some examples of these will be on our homework for next time. We can always extend this to finite product spaces by induction:

(∏_{i=1}^n Ω_i, ⊗_{i=1}^n F_i, ⊗_{i=1}^n µ_i).

Last time, we mentioned that infinite products are more subtle, and that's what we'll be going into today! For example, consider the percolation model on the infinite integer lattice: at every point (x, y) ∈ Z², flip a coin to decide whether it is a 0 or a 1. The percolation problem then asks about the large-scale structure of this model! There can be more complicated models too (for example in statistical physics), where perhaps there is some nearest-neighbor interaction between the sites. We also don't really want to work with a large finite subset of the lattice, because it's often hard to handle things around the boundary (it becomes very annoying). There are two main results we're going to prove today:

• A special case of the Ionescu-Tulcea theorem, which says that we can define an infinite product of probability spaces with no further conditions.

• The Kolmogorov extension theorem, which says that we can define a non-product measure on an infinite product space, under a mild regularity condition.

We're using a product σ-algebra in both cases, so we'll start by defining what that means:

Definition 63

Let (Ω_α, F_α) be measurable spaces, indexed by α ∈ I (so there can be an uncountable number of them). Then the product σ-field

F = ⊗_{α∈I} F_α

is the minimal σ-field over Ω = ∏_{α∈I} Ω_α such that all single-coordinate projections π_α are measurable. In other words, we have

(π_α)^{−1}(E_α) ∈ F   ∀ E_α ∈ F_α.

Note that the preimage under π_α looks like

E_α × ∏_{γ∈I\{α}} Ω_γ.

This is not exactly the same as taking Cartesian products of measurable sets Eα ∈ Fα, because I can be uncountable (sigma-algebras only give us countable unions).

Remark 64. This may look familiar if we’ve heard of a product topology before.

Definition 65 (Notation)
For any J ⊆ I, denote the partial products

Ω_J = ∏_{α∈J} Ω_α,   F_J = ⊗_{α∈J} F_α.

Why is this a good definition? Here's where the probability comes in: we often care about finite-dimensional distributions, also known as marginals. Say we have some probability measure on an infinite product probability space (Ω, F, P). If we have a product measure, and I is countable, then

E = ∏_{α∈I} E_α

is measurable, so we would guess that

P(E) ≟ ∏_{α∈I} P_α(E_α).

(This is not exactly true, but it gives us some intuition.) In order for this to be a positive number, most of the factors need to be pretty close to 1. So in most of the coordinates the set basically needs to include the whole space, which motivates looking at only finitely many coordinates at a time! Let J be a finite subset of I. Then the law can be written as

[(π_J)_# P](E_J) = P(π_J^{−1}(E_J)) = P(E_J × Ω_{I\J}).

These are called the finite-dimensional marginals. Note that if we project down, we need a consistent definition: given any finite J ⊆ J′ ⊆ I, the marginal P_J of P on J should agree with what we get by first projecting down to J′ and then from J′ down to J. So to construct a measure on an infinite product space, we already have some necessary conditions: we want to see whether these conditions alone give us a measure on the whole space. We're going in reverse: if we start with a family of P_J's for finite J's, which we defined last class, we want to get a measure P consistent with those marginals. Such a family of finite-dimensional measures is usually called a family of finite-dimensional distributions, abbreviated f.d.d.

Remark 66. The measurability condition in the definition of the product σ-field is necessary to make sure everything in this consistency statement is well-defined!

One more thing before we get to our theorems. We want to apply the Carathéodory extension theorem in this infinite-dimensional case too, which says that if we have a measure µ that is countably additive on an algebra A, then it extends uniquely to a valid measure on σ(A).

Lemma 67 (Variant of Carathéodory)
Say that A is an algebra over Ω, and µ : A → [0, ∞) is a finite measure. Suppose µ is finitely (not necessarily countably) additive, and for any sequence B_n ∈ A such that B_n ↓ ∅, we have µ(B_n) ↓ 0. Then µ is countably additive over A, so we can apply the Carathéodory extension theorem.

Proof. Say that A_n ∈ A are disjoint, with A = ⊔_{n=1}^∞ A_n also in A. (We need to prove countable additivity.) Define B_n = A \ ∪_{i=1}^n A_i: this is in the algebra as well, because we only need to take a finite union. The B_n decrease to the empty set, and now because

µ(A) = Σ_{i=1}^n µ(A_i) + µ(B_n),

taking n → ∞ yields the result (since we assume µ(B_n) goes to 0).

How can we specialize this lemma to (Ω, F) = (∏ Ω_α, ⊗ F_α)? We start with some algebra: as an exercise, check that if we take

A = {(π_J)^{−1}(E_J) = E_J × Ω_{I\J}}

for finite J and E_J measurable in F_J, then we get an algebra. (We just need to check complementation and finite unions, both of which are pretty easy.)

Now, some more simplification for later. Consider a sequence B_n ∈ A such that B_n ↓ ∅. The total number of coordinates that can be involved in the B_i's is countable, because each B_i depends on finitely many coordinates. So we can renumber the coordinates to Q = {1, 2, 3, ...} for convenience, and we can assume without loss of generality that B_1 depends on the first coordinate, B_2 depends on the first two, and so on. This means that we can write

B_n = B̄_n × Ω_{Q\[n]} × Ω_rest,

where Ω_rest is the product of the Ω_α over the index set I \ Q. (Note here that Q depends on the sequence B_n, but it is always countable.) Then to apply Lemma 67 in our theorems, we just need to show that the measures of the B_n decrease to zero.

Theorem 68 (Ionescu-Tulcea theorem, special case)

Given probability spaces (Ω_α, F_α, P_α) indexed by α ∈ I, there exists a unique probability measure P on

(∏_{α∈I} Ω_α, ⊗_{α∈I} F_α)

whose family of finite-dimensional distributions is given by

P_J = ⊗_{α∈J} P_α

for any finite J.

This is essentially us creating a “product measure.”

Proof. Let our algebra consist of the sets

A = {E_J × Ω_{I\J} : J finite, E_J ∈ F_J}.

According to the conditions, we should define

P(E_J × Ω_{I\J}) = (⊗_{α∈J} P_α)(E_J).

It's a bookkeeping exercise to show that P is well-defined and finitely additive over A. So we'll be done by Lemma 67 if we can show that for any sequence B_n as in the boxed equation above, P(B_n) ↓ 0. By definition,

P(B_n) = P_[n](B̄_n)

(because in all other coordinates the set is the whole space), which is then equal to

P_{[n+1]}(B̄_n × Ω_{n+1}) ≥ P_{[n+1]}(B̄_{n+1})

since the B_n are decreasing. Thus, the P(B_n) are decreasing, and we just need to show that their limit is zero. By Fubini's theorem,

P(B_n) = P_[n](B̄_n) = ∫ (··· (∫ 1{B̄_n} dP_n) ···) dP_1.

Consider the doubly infinite array a_{ij} such that a_{ii} = 1{B̄_i}; each time we move right from column n to column n + 1, we cross with Ω_{n+1}, and each time we move from column n + 1 back to column n, we integrate dP_{n+1}.

The 0th column is then P(B_1), P(B_2), .... The first column starts with 1{B̄_1}; the next entry down is 1{B̄_2} integrated over the second coordinate, the entry after that is 1{B̄_3} integrated over the second and third coordinates, and so on.

Each column is nonincreasing, so it has a limit: let g_n be the limit of the nth column. (It will be a function of the first n coordinates.) By the bounded convergence theorem, the "integrate dP_{n+1}" passes through the limit, so

g_n = ∫ g_{n+1} dP_{n+1}   ∀n.

Suppose that g_0 > 0. Then there exists some ω_1 such that g_1(ω_1) is positive. But then

g_1(ω_1) = ∫ g_2(ω_1, ω_2) dP_2(ω_2),

meaning there exists a coordinate ω_2 such that g_2(ω_1, ω_2) is positive. Repeating this, there exists some (ω_1, ω_2, ...) such that g_n(ω_1, ..., ω_n) > 0 for all n. Since g_n is bounded above by 1{(ω_1, ..., ω_n) ∈ B̄_n}, the diagonal entry in our infinite array, this means that

(ω_1, ω_2, ...) ∈ ∩_n B̄_n × Ω_{Q\[n]},

but this is a contradiction: the B_n = B̄_n × Ω_{Q\[n]} × Ω_rest decrease to the empty set, so the intersection of the B̄_n × Ω_{Q\[n]} must be empty. Thus g_0 = 0, and we can apply Lemma 67 to finish.
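The infinite product measure we just built has a computational shadow: a sample from ⊗P_α only ever requires realizing finitely many coordinates, and every finite-dimensional marginal is the corresponding finite product. Here's a small Python sketch of this lazy-sampling idea (the class name and the bias p = 0.3 are our own illustrative choices, not from the lecture):

```python
import random

class LazyCoinSequence:
    """One sample point of the infinite product of Ber(p) measures.

    Coordinates are generated on demand and cached, so only the finitely
    many coordinates we actually inspect are ever realized -- mirroring
    the fact that events in the product sigma-field depend on at most
    countably many coordinates.
    """
    def __init__(self, p, rng):
        self.p = p
        self.rng = rng
        self.cache = {}

    def __getitem__(self, alpha):
        if alpha not in self.cache:
            self.cache[alpha] = 1 if self.rng.random() < self.p else 0
        return self.cache[alpha]

rng = random.Random(0)
omega = LazyCoinSequence(0.3, rng)
# Inspect three widely separated coordinates: only those three are realized.
values = [omega[5], omega[100], omega[10**9]]
realized = len(omega.cache)
```

Asking for `omega[10**9]` costs no more than asking for `omega[5]`; the "infinite" sample point is just a promise to flip each coin when first needed.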

Theorem 69 (Kolmogorov extension theorem)

Let Ω_α be metric spaces with Borel σ-fields F_α. Say we have a (not necessarily product) consistent family of f.d.d.'s P_J for finite J, and say that each P_J is inner regular: every measurable set can be approximated from within by compact sets. Then there exists a probability measure P on the whole space (Ω, F) with these finite-dimensional distributions.

Proof. Use the same A as before, and again define P : A → [0, 1] via

P(EJ × ΩI\J ) = PJ (EJ ).

(Again, it's a bookkeeping exercise to show that P is well-defined and finitely additive over A.) Then we just need to check the condition of Lemma 67: given B_n decreasing to the empty set, which we can rewrite in the convenient form B_n = B̄_n × Ω_{Q\[n]} × Ω_rest, we want to show that P(B_n) goes to 0. Say for contradiction that P(B_n) ↓ ε > 0.

The key step is to replace Bn with a compact set Kn that sits inside it such that the same hypotheses still hold.

By assumption, P_[n] is in our family of finite-dimensional distributions, so it is inner regular, and thus we can find a compact C̄_n ⊆ B̄_n such that

P_[n](C̄_n) ≥ P_[n](B̄_n) − ε/2^{n+1}.

Note that the sets

C_n = C̄_n × Ω_{I\[n]}

are no longer necessarily decreasing, so we instead define

K̄_n = ∩_{ℓ=1}^n (C̄_ℓ × Ω_{[n]\[ℓ]}).

This is a closed subset of a compact set, so it is compact. And now

K_n = K̄_n × Ω_{I\[n]}

are decreasing and contained in the B_n's, so they decrease to the empty set. (Note: K_n is not necessarily compact, but K̄_n is. Otherwise we'd be done immediately, because a nested sequence of nonempty compact sets has nonempty intersection.) Now as an exercise, we should show that

P(K_n) ≥ P(B_n) − ε/2,

which implies that P(K_n) ↓ δ ≥ ε/2, a positive value. But K_n ⊆ B_n, so K_n ↓ ∅. We know that K̄_n is nonempty for all n, so we can find some element ω^{(n)} ∈ K̄_n. Denote ω^{0,n} = ω^{(n)}, and for every ℓ ≥ 1, consider the sequence

(π_[ℓ](ω^{ℓ−1,n}))_{n≥1}.

This is a sequence in the compact set K̄_ℓ, so it has a convergent subsequence; call that subsequence (ω^{ℓ,n})_{n≥1}.

Now we're almost done: take the diagonal (ω^{n,n})_{n≥1}. For every coordinate ℓ,

π_ℓ(ω^{n,n}) → x_ℓ ∈ Ω_ℓ

as n → ∞. But now

(x_1, x_2, ...) ∈ ∩_{n=1}^∞ K̄_n × Ω_{Q\[n]},

which is a contradiction, since the K_n = K̄_n × Ω_{Q\[n]} × Ω_rest decrease to the empty set, so the intersection of the K̄_n × Ω_{Q\[n]} must be empty.

Thus P(Bn) decreases to 0, and we do have a valid probability measure.

8 September 30, 2019

Last week, we did a lot of work to construct product measure spaces, especially in the case where we have a possibly uncountable product of the form

(Ω, F, P) = (∏_{α∈I} Ω_α, ⊗_{α∈I} F_α, ⊗_{α∈I} P_α).

Today, we'll mostly talk about "easier things": we have likely seen concepts like variance, correlation, and independence before, and we'll basically review them in a more formal setting. Also, note that we're now looking at an arbitrary probability space, not necessarily a product space.

Definition 70
Let X be a random variable on (Ω, F, P) (in other words, X is a measurable function Ω → R). Then the σ-field generated by X, denoted σ(X), is the minimal σ-field over Ω such that X is measurable:

σ(X) = {X^{−1}(B) : B ∈ B_R}.

We can indeed check that this is a σ-field. Intuitively, this tells us about the events in Ω that are distinguishable from our mapping X into R. For example, if X just sends everything to 0, we can’t distinguish very much: the preimage of any Borel set B is Ω if it contains 0 and ∅ otherwise, so

σ(X) = {∅, Ω}.

Basically, that means we’ve defined a pretty pointless random variable: the larger σ(X) is, the more “helpful” X is for giving us information.

Definition 71
Two random events A, B ∈ F are independent, denoted A ⊥⊥ B, if and only if P(A ∩ B) = P(A)P(B).

Note that this also implies

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)P(B^c),

so A ⊥⊥ B ⟺ A ⊥⊥ B^c, and so on. This can be generalized:

Definition 72

Events (A_α : α ∈ I) are (mutually) independent if for all finite J ⊆ I,

P(∩_{α∈J} A_α) = ∏_{α∈J} P(A_α).

By extension, if (Cα : α ∈ I) are each a collection of events, they are independent if for all finite J ⊆ I and

Aα ∈ Cα, the above equality holds.

Definition 73

If (Xα : α ∈ I) are variables defined on the same space, we say that they are (mutually) independent random

variables on (Ω, F, P) if and only if their sigma-algebras (σ(Xα): α ∈ I) are independent.

Notably, this means that for any finite set J ⊆ I and B_α ∈ B_R,

P(X_α ∈ B_α for all α ∈ J) = P(∩_{α∈J} X_α^{−1}(B_α)),

and since the X_α^{−1}(B_α)'s lie in the σ-algebras of the X_α's, we can use the definition of independence to continue:

= ∏_{α∈J} P(X_α^{−1}(B_α)) = ∏_{α∈J} P(X_α ∈ B_α).

Equating the first and last expressions gives what we've likely seen in a less formal probability class.
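We can sanity-check this factorization by brute-force enumeration on the smallest nontrivial product space, two p-biased coins (the bias p = 0.3 below is an arbitrary illustrative choice):

```python
from itertools import product

p = 0.3  # illustrative bias, not from the lecture
# Product probability space Omega = {0,1} x {0,1} with P = Ber(p) x Ber(p).
prob = {(a, b): (p if a else 1 - p) * (p if b else 1 - p)
        for a, b in product((0, 1), repeat=2)}

def P(event):
    """Probability of a subset of Omega."""
    return sum(prob[w] for w in event)

# Coordinate variables X1(w) = w[0], X2(w) = w[1]; run over every Borel
# set B1, B2 (here: every subset of {0,1}) and check the factorization.
for B1 in [set(), {0}, {1}, {0, 1}]:
    for B2 in [set(), {0}, {1}, {0, 1}]:
        joint = {w for w in prob if w[0] in B1 and w[1] in B2}
        E1 = {w for w in prob if w[0] in B1}
        E2 = {w for w in prob if w[1] in B2}
        assert abs(P(joint) - P(E1) * P(E2)) < 1e-12
factorizes = True
```

Since σ(X_1) and σ(X_2) each have only four events here, this check exhausts the definition of independence exactly, not just approximately.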

Example 74
A special case of this is the product probability space (Ω, F, P) = (∏_{α∈I} Ω_α, ⊗_{α∈I} F_α, ⊗_{α∈I} P_α) which we've seen before: we can define random variables X_α(ω) = f_α(ω_α), measurable functions that only look at the α-coordinate.

Then for a measurable subset B,

X_α^{−1}(B) = f_α^{−1}(B) × ∏_{γ∈I\{α}} Ω_γ.

Notice that for any X_α, because we have "no information" about the other coordinates when we only look at the α-coordinate,

σ(X_α) = σ(f_α) ⊗ (⊗_{γ∈I\{α}} T_γ),

where T_γ is the trivial σ-field {∅, Ω_γ}. From this, it's easy to see that this fits our formal definition of independence: the (X_α : α ∈ I) are all independent random variables on (Ω, F, P).

This seems like a special case, but it is really the only case! Suppose we try to define something more general: say that we have some general probability space, and we're told that (X_α : α ∈ I) are independent random variables. Then define a new random variable as a tuple of the X_α's:

X(ω) = (X_α(ω))_{α∈I} ∈ ∏_{α∈I} R,

which is a map from (Ω, F, P) to (∏_{α∈I} R, ⊗_{α∈I} B_R, L_X), where

L_X = X_# P = P ∘ X^{−1}.

Then the X_α's are independent random variables if and only if the law L_X is a product measure! In other words, we should treat independence and product measure as one and the same.

Remark 75. By extension, if we have a bunch of events (or collection of events) Eα, they are independent if and only if the indicator variables 1Eα : α ∈ I are independent random variables.

This is the point in the class where measure theory basically ends: it’s time for us to start talking about more interesting things in probability!

Definition 76
Let X be a random variable on (Ω, F, P). Then the pth moment of X is defined to be E(|X|^p) (which can be infinite), and the L^p norm of X is defined as

||X||_p = ||X||_{L^p(Ω,F,P)} = E(|X|^p)^{1/p}.

X belongs to L^p if ||X||_p is finite.

Here’s a useful fact which we won’t prove here:

Theorem 77 (Jensen's inequality)
Suppose G ⊆ R is an open interval, and g : G → R is a convex function:

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)   ∀ x, y ∈ G, λ ∈ [0, 1].

Then if X is a random variable such that P(X ∈ G) = 1 and E|X| < ∞, E|g(X)| < ∞, we have E(g(X)) ≥ g(E(X)).

This gives us a useful property:

Corollary 78 (L^p monotonicity)
Let 0 < r ≤ p < ∞, and let X be a random variable such that ||X||_r, ||X||_p are both finite. Then since g(y) = |y|^{p/r} is convex,

E(|X|^p) = E((|X|^r)^{p/r}) ≥ (E|X|^r)^{p/r},

or equivalently

||X||_r ≤ ||X||_p.
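As a quick numerical sanity check of this monotonicity, we can compute the L^p norms of a small discrete random variable (the four-point distribution below is an arbitrary choice of ours):

```python
# A discrete random variable on a probability space: values and probabilities.
values = [0.5, 1.0, 2.0, 5.0]
probs = [0.4, 0.3, 0.2, 0.1]  # illustrative distribution, sums to 1

def lp_norm(p):
    """||X||_p = E(|X|^p)^(1/p) for this discrete distribution."""
    return sum(q * abs(x) ** p for x, q in zip(values, probs)) ** (1 / p)

# Corollary 78 predicts ||X||_r <= ||X||_p whenever r <= p.
norms = [lp_norm(p) for p in (0.5, 1, 2, 4)]
is_monotone = all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))
```

Note that the total mass being 1 is essential; the next paragraphs show the inequality reverses on sequence spaces.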

The corollary assumed both norms were finite; it would be surprising if the infinite-norm cases behaved differently, and we can rigorize this:

Proposition 79
If 0 < r ≤ p < ∞ and ||X||_p < ∞, then ||X||_r < ∞. In other words,

L^r(Ω, F, P) ⊇ L^p(Ω, F, P).

Proof. Note that we can split up

E(|X|^r) = E(|X|^r 1{|X| ≤ 1}) + E(|X|^r 1{|X| > 1}).

As standard notation in this class, we'll denote this

= E(|X|^r; |X| ≤ 1) + E(|X|^r; |X| > 1).

The first term is bounded by 1, because |X| ≤ 1 ⟹ |X|^r ≤ 1, so finiteness only depends on the second term. Now because r ≤ p, we have |X|^r ≤ |X|^p on the event {|X| > 1}, so

E(|X|^r; |X| > 1) ≤ E(|X|^p; |X| > 1),

and since the right-hand side is finite, so is the left-hand side.

Note that this L^p monotonicity only holds for probability spaces! For example, consider the sequence spaces, whose elements are of the form x = (x_1, x_2, ...): the ℓ^p norm of x is defined as

||x||_p = (Σ_{i=1}^∞ |x_i|^p)^{1/p}.

In these spaces, if r < p, then ||x||_r ≥ ||x||_p (the inequality goes in the opposite direction, and it has nothing to do with Jensen). Thus, the nesting also goes in the other direction:

ℓ^r ⊆ ℓ^p.

One way to remember this is that the unit ball of ℓ^r is nested inside the unit ball of ℓ^p. For example, the unit ball in ℓ^1 is the set of points with |x_1| + |x_2| + ··· ≤ 1 (a diamond), while the unit ball in ℓ^2 is an actual ball. As we take p larger and larger, the unit ball grows, and the ℓ^∞ unit ball is a cube of side length 2.

As another example, consider the classical function spaces L^p(R, B, λ) under the Lebesgue measure, where

||f||_p = (∫ |f(x)|^p dx)^{1/p}.
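The reversed inequality for sequence spaces is easy to verify numerically (the particular finite sequence below is an arbitrary choice; a finite sequence is an element of every ℓ^p):

```python
x = [0.5, 0.25, 0.125, 0.0625]  # illustrative sequence (geometric tail)

def ellp_norm(seq, p):
    """The little-ell-p norm (sum |x_i|^p)^(1/p)."""
    return sum(abs(t) ** p for t in seq) ** (1 / p)

# On sequence spaces, r < p gives ||x||_r >= ||x||_p -- the reverse of the
# probability-space inequality, because counting measure has total mass > 1.
n1 = ellp_norm(x, 1)
n2 = ellp_norm(x, 2)
ninf = max(abs(t) for t in x)  # the ell-infinity norm
reversed_ok = n1 >= n2 >= ninf
```

Intuitively: raising entries of size at most 1 to a higher power only shrinks them, which is the opposite of what averaging against a probability measure does.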

It turns out that for these function spaces, L^p and L^r are not nested in each other in either direction.

Looking back at random variables X on (Ω, F, P), L^p monotonicity tells us that because L^1 ⊇ L^2,

E(|X|^2) < ∞ ⟹ E|X| < ∞.

In addition, by Jensen's inequality, E|X| ≤ E(X^2)^{1/2}, so we can define the following nonnegative quantity:

Definition 80
The variance of a random variable X with finite second moment is

Var(X) = E[(X − E[X])^2] = E(X^2) − E(X)^2.

Theorem 81 (Cauchy-Schwarz)
For any two random variables X, Y ∈ L^2,

E|XY| ≤ ||X||_2 ||Y||_2.

Proof. Note that if ||X||_2 = 0 or ||Y||_2 = 0, then X = 0 or Y = 0 almost surely, so E|XY| = 0. Otherwise, both ||X||_2 and ||Y||_2 are positive. Consider first the special case where E|XY| < ∞ and X and Y both have L^2 norm 1: then

0 ≤ E[(|X| − |Y|)^2] = E(|X|^2) + E(|Y|^2) − 2E|XY|,

so E|XY| ≤ 1. Next, whenever E|XY| < ∞, we can rescale X and Y to have norm 1:

E[ (|X|/||X||_2) · (|Y|/||Y||_2) ] ≤ 1.

Finally, in general we can truncate our random variables so that the expectation is finite:

E[min{|X|, n} · min{|Y|, n}] ≤ ||min{|X|, n}||_2 ||min{|Y|, n}||_2 ≤ ||X||_2 ||Y||_2.

Now we can take n → ∞ and use the monotone convergence theorem to finish.
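A finite sample, viewed as a uniform probability space, satisfies Cauchy-Schwarz exactly (it's the finite-dimensional inequality), so a quick empirical check always passes; the sample sizes and distributions below are arbitrary choices of ours:

```python
import math
import random

rng = random.Random(1)
# Treat N equally likely outcomes as the probability space, so every
# expectation is a plain average over the sample.
N = 1000
X = [rng.gauss(0, 1) for _ in range(N)]
Y = [rng.gauss(1, 2) for _ in range(N)]

E_absXY = sum(abs(a * b) for a, b in zip(X, Y)) / N
norm_X = math.sqrt(sum(a * a for a in X) / N)   # ||X||_2 under this measure
norm_Y = math.sqrt(sum(b * b for b in Y) / N)   # ||Y||_2 under this measure
holds = E_absXY <= norm_X * norm_Y + 1e-12
```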

This allows us to make another definition:

Definition 82 If X,Y ∈ L2, then their covariance

Cov(X,Y ) = E((X − E[X])(Y − E[Y ])).

X,Y are uncorrelated if Cov(X,Y ) = 0.

Notably, if X, Y ∈ L^2 and X ⊥⊥ Y, then Cov(X, Y) = 0. The converse is false, though: two uncorrelated random variables can still be dependent. We're going to work with this towards laws of large numbers. (Joke: a "standard" example is that if someone goes to prison and flips lots of coins in their free time, we should expect about half of them to be heads.) Note that many of these laws are really just applications of the Pythagorean theorem, just like Cauchy-Schwarz!

If X1, ··· ,Xn are pairwise uncorrelated, consider their sample mean or empirical mean

n 1 X X = X . n n i i=1 Then the variance n n 1 X X Var X = Cov( X , X ) n n2 i i i=1 i=1 has all cross-terms disappear, which means that the expression simplifies to

n 1 X Var X = Var(X ). n n2 i i=1

(Alternatively, we can treat Xi − E(Xi ) as vectors vi , and note that all scalar products (vi , vj ) = 0 by assumption.

38 Thus, we have orthogonal vectors, and thus the variance

n ! n 2 1 X 1 X Var X = v , n i n i i=1 i=1 2 which by the Pythagorean theorem is just n 1 X = ||v ||2 n2 i i=1 2 because the vi s are orthogonal.) So now if all the Xi s have the same law as some specific X ∈ L , then

1 Var(X) Var(X ) = · n Var(X) = . n n2 n
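We can watch the Var(X)/n scaling appear in a quick Monte Carlo experiment (the Uniform(0,1) distribution, which has Var(X) = 1/12, and the sample sizes are arbitrary choices of ours):

```python
import random

rng = random.Random(2)
n, trials = 50, 20000
# Estimate Var(sample mean) for iid Uniform(0,1): repeat the averaging
# experiment many times and compute the empirical variance of the results.
means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
m = sum(means) / trials
var_of_mean = sum((x - m) ** 2 for x in means) / trials

predicted = (1 / 12) / n   # Var(X)/n with Var(Uniform(0,1)) = 1/12
close = abs(var_of_mean - predicted) < 0.3 * predicted
```

Doubling n should roughly halve `var_of_mean`, which is exactly the 1/n decay the next results exploit.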

So taking the average will make the variance smaller! To get to our first result, we’ll employ a few more useful inequalities:

Theorem 83 (Markov's inequality)
If X ≥ 0, then

P(X ≥ t) = E[1{X ≥ t}] ≤ E[X/t] = E[X]/t.

Corollary 84 (Chebyshev's inequality)
If X ∈ L^2, then the probability of a large deviation satisfies

P(|X − E[X]| ≥ t) ≤ E[(X − E[X])²]/t² = Var(X)/t².
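A quick empirical check of Chebyshev's bound, using an exponential random variable (an arbitrary choice with E X = Var X = 1, so the bound reads P(|X − 1| ≥ t) ≤ 1/t²):

```python
import random

rng = random.Random(3)
N = 100000
xs = [rng.expovariate(1.0) for _ in range(N)]
mean = sum(xs) / N  # close to the true mean 1

for t in (1.0, 2.0, 3.0):
    # Empirical tail probability of a deviation of size at least t.
    emp = sum(1 for x in xs if abs(x - mean) >= t) / N
    # Chebyshev bound Var(X)/t^2 = 1/t^2, with a little sampling slack.
    assert emp <= 1.0 / t ** 2 + 0.01
chebyshev_ok = True
```

As usual with Chebyshev, the bound is far from tight here (the true exponential tails decay much faster than 1/t²), but it requires only a second moment.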

Using Chebyshev on the boxed statement above then gives us our first law of large numbers:

Theorem 85 (L² weak law of large numbers)
If X, X_i ∈ L² are pairwise uncorrelated and identically distributed, then

X̄_n = (1/n) Σ_{i=1}^n X_i

converges to E[X] as n → ∞ in L² and in probability.

For completeness, we’ll write out the proof here:

Proof. Note that  n !2  1 X  Var X ||X − X||2 = (X − [X]) = , n E 2 E n i E n  i=1 

L2 which goes to 0 as n → ∞, so Xn → E[X]. Then we can apply Chebyshev: to show convergence in probability, note that for any ε, Var Xn lim P(|Xn − E[X]| ≥ ε) ≤ lim = 0. n→∞ n→∞ ε2
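Here is the weak law in action for fair coin flips (the sample sizes and the 0.05 threshold are illustrative choices of ours; Chebyshev with Var(X) = 1/4 gives P(|X̄_n − 1/2| ≥ 0.05) ≤ 1/(4 · 2000 · 0.0025) = 0.05 at n = 2000):

```python
import random

rng = random.Random(4)

def sample_mean(n):
    """Empirical mean of n iid fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# Repeat the experiment 200 times and see how often the mean deviates
# from 1/2 by at least 0.05; Chebyshev caps this frequency at about 0.05.
devs = [abs(sample_mean(2000) - 0.5) for _ in range(200)]
frac_far = sum(d >= 0.05 for d in devs) / len(devs)
```

In practice `frac_far` is essentially zero, since 0.05 is about 4.5 standard deviations of the sample mean at n = 2000; Chebyshev is a crude but sufficient bound.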

This proves the simplest version of the law of large numbers, and the main idea is just the geometric fact that the average of a bunch of orthogonal vectors v_1, v_2, ... has small norm. Next time, we'll see how far we can extend this (beyond simple L² geometry)!

9 October 2, 2019

Last time, we started talking about moments of random variables, leading us to define

Var X = E[(X − E X)²] ∈ [0, ∞)

as long as X ∈ L² (meaning that the L² norm ||X||_2 = E[X²]^{1/2} is finite). This led us to the weak law of large numbers: given X, X_i ∈ L² that are pairwise uncorrelated, with all X_i distributed identically to X, we have that

X̄_n = (1/n) Σ_{i=1}^n X_i

converges to E[X] in L² and in probability. We'll strengthen this result in a few ways today: we'll relax the L² assumption, we'll show almost-sure convergence under some conditions, and we'll generalize to triangular arrays. For the rest of class, consider a triangular array of random variables

X_{1,1}
X_{2,1}  X_{2,2}
X_{3,1}  X_{3,2}  X_{3,3}
...

Our basic question: if S_n = Σ_{k=1}^n X_{n,k} is the sum of the nth row of this array, does S_n/b_n go to a constant for some deterministic b_n? Usually b_n = n, but we'll see some other examples as well. Under the L² assumption, it was okay to just assume all vectors were pairwise orthogonal; now we're going to need something stronger. From now on, we'll assume independence within each row.

Theorem 86 (L² weak law for triangular arrays)
If we have a triangular array X_{n,k} as above such that Var X_{n,k} ≤ C < ∞ for all n, k, then (S_n − E S_n)/n converges to 0 in L² and in probability.

Proof. Take the L² norm

||(S_n − E S_n)/n||²_2 = (1/n²) ||S_n − E S_n||²_2,

and by the Pythagorean theorem, because the X_{n,k}'s within a row are uncorrelated, this is

= (1/n²) Σ_{k=1}^n Var X_{n,k} ≤ C/n,

which goes to 0 as n → ∞. Convergence in probability again follows from Chebyshev.

Theorem 87 (Weak law for triangular arrays)
Take the same triangular array as before. Let b_n → ∞ be such that

Σ_{k=1}^n P(|X_{n,k}| > b_n) → 0   and   (1/b_n²) Σ_{k=1}^n E(X_{n,k}²; |X_{n,k}| ≤ b_n) → 0

as n → ∞. Under these assumptions, if we define

Y_{n,k} = X_{n,k} · 1{|X_{n,k}| ≤ b_n},   T_n = Σ_{k=1}^n Y_{n,k},

then (S_n − E[T_n])/b_n goes to 0 in probability.

We do this because under our current definition, E[Sn] might not be defined. The theorem is stated in an ugly way so that it is easy to prove!

Proof. First, we want S_n to be close to T_n; in fact S_n and T_n are equal with high probability. This is because

P(S_n ≠ T_n) ≤ Σ_{k=1}^n P(X_{n,k} ≠ Y_{n,k})

(because T_n is the sum of the truncated versions of the X_{n,k}, while S_n is the sum of the X_{n,k} themselves). This is then bounded by

≤ Σ_{k=1}^n P(|X_{n,k}| > b_n),

which goes to 0 by the first of our convenient assumptions. Therefore, S_n − T_n → 0 in probability, and therefore (S_n − T_n)/b_n → 0 in probability as well (since b_n → ∞). Thus, it suffices to prove the statement for T_n in place of S_n.

But we've designed T_n so that it is nicely bounded: calculating the L² norm,

||(T_n − E T_n)/b_n||²_2 = (1/b_n²) Σ_{k=1}^n Var Y_{n,k}

since the Y_{n,k}'s are uncorrelated (they are functions of the X_{n,k}'s, which are independent), and then this is bounded by

≤ (1/b_n²) Σ_{k=1}^n E(Y_{n,k}²),

which goes to 0 by the second assumption.

Remark 88. Note that we do need the Xn,k s to be independent: just knowing that they’re uncorrelated wouldn’t

guarantee that the Yn,k s are uncorrelated here.

Let's now use the ugly theorem to prove nicer results: the rest of our results are going to involve iid variables, which we can put in a triangular array via

X_1
X_1  X_2
X_1  X_2  X_3
...

Notice now that the columns are strongly dependent on each other, but the rows are independent if our Xi s are independent.

Theorem 89 (Weak law of large numbers for iid sequences)
Let X, X_k be independent and identically distributed, such that as x → ∞,

g(x) = x P(|X| ≥ x) → 0.

Then as n → ∞, S_n/n − µ_n goes to 0 in probability, where we define µ_n = E(X; |X| ≤ n).

Notice that this condition does not imply that E X is finite (which is equivalent to E|X| being finite): E|X| is the same as ∫_0^∞ P(|X| ≥ x) dx, and the condition just tells us that P(|X| ≥ x) decays faster than 1/x as x → ∞. But there are functions that decay faster than 1/x that aren't integrable! For example,

∫_a^b dx/(x log x) = log log x |_a^b,

and this diverges as b → ∞. So that's why we need to define µ_n in the way that we do: the assumptions here are a bit weaker than having a finite first moment.

Proof. Apply Theorem 87 with b_n = n. We just need to check the two conditions. First of all,

Σ_{k=1}^n P(|X_{n,k}| ≥ b_n) = n P(|X| ≥ n) = g(n)

goes to 0 by assumption. The second condition is basically checking that

E(Y_n²)/n → 0,   where Y_n = X · 1{|X| ≤ n}.

But from a homework exercise, we know that

(1/n) E(Y_n²) = (1/n) ∫_0^∞ P(Y_n² ≥ t) dt = (1/n) ∫_0^∞ P(|Y_n| ≥ √t) dt.

Making the change of variables y = √t, dt = 2y dy, this becomes

= (1/n) ∫_0^∞ P(|Y_n| ≥ y) · 2y dy.

We can then cap the upper limit of the integral, because |Y_n| cannot be larger than b_n = n, to get

(1/n) ∫_0^n P(|Y_n| ≥ y) 2y dy = (1/n) ∫_0^n P(n ≥ |X| ≥ y) 2y dy ≤ (2/n) ∫_0^n g(y) dy.

Since this is an average over a large interval of the function g, which decays to 0, this integral goes to 0, and we've checked our second condition as well, finishing the proof.

Remark 90. Remember that because of the way we defined the expectation E X as a Lebesgue integral, the mean E X exists (as a finite number) if and only if E|X| is finite (otherwise we get something like ∞ − ∞).

Corollary 91
If X, X_i are iid with finite mean E X = µ, then S_n/n converges to µ in probability.

Proof. We bound g similarly to Markov's inequality:

g(x) = x P(|X| ≥ x) ≤ E[|X|; |X| ≥ x].

Now the integrand in E[|X|; |X| ≥ x] = E[|X| · 1{|X| ≥ x}] goes to 0 almost surely as x → ∞, and it is dominated by |X|, which has finite mean. Thus, we can use the dominated convergence theorem (remember that the expectation is defined as a Lebesgue integral!) to conclude that g(x) → 0 as x → ∞. Therefore, by Theorem 89, S_n/n − µ_n goes to 0 in probability, and it remains to check that µ_n goes to µ:

|µ − µ_n| ≤ E[|X|; |X| > n],

which again goes to 0 as n gets large by the same logic.

Theorem 92 (Strong law of large numbers)
If X, X_k are iid with E X = µ finite, then S_n/n converges to µ almost surely.

Proof. Assume X ≥ 0; otherwise we can break X into its positive and negative parts and handle them separately. Truncate the X's, setting

Y_k = X_k · 1{X_k ≤ k}.

This is a different truncation than before: previously, we took T_n = Σ_{k=1}^n X_k 1{|X_k| ≤ n}, truncating everything in the same row of our triangular array at the same level, but this time we take

T_n = Σ_{k=1}^n X_k 1{X_k ≤ k}.

Again, we want to show that S_n and T_n are close: we do this by showing that

Σ_{k=1}^∞ P(X_k ≠ Y_k) ≤ Σ_{k=1}^∞ P(X_k ≥ k)

(because those are the points where truncation happens), and now because the X's are iid, this is

= Σ_{k=1}^∞ P(X ≥ k) ≤ 1 + ∫_0^∞ P(X ≥ t) dt

(the +1 is just a constant coming from comparing the Riemann sum to the integral), and this last term is 1 + E X < ∞. By Borel-Cantelli (homework), this means that the probability of the event {X_k ≠ Y_k infinitely often} is 0, where we define

{Xk 6= Yk i.o.} = {ω ∈ Ω: Xk (ω) 6= Yk (ω) for infinitely many k}.

So if {X_k ≠ Y_k i.o.} occurs with probability zero, then (S_n − T_n)/n goes to 0 almost surely: there are only finitely many indices k where X_k ≠ Y_k, so the nonnegative quantity S_n − T_n is eventually equal to some fixed finite C(ω), and since we divide by n → ∞, the quantity (S_n − T_n)/n is eventually bounded by C(ω)/n.

So we want to show that S_n/n → µ almost surely. We've shown (S_n − T_n)/n → 0 almost surely, and E[T_n]/n → µ because

µ − E[T_n]/n = E[S_n − T_n]/n = (1/n) Σ_{k=1}^n E[X; X > k],

and the summand E[X; X > k] goes to 0 as k → ∞ (by the dominated convergence theorem), so the average also

goes to 0. So we can replace Sn with Tn, and we can replace µ with the expectation of Tn: it finally suffices to prove that T − T n E n → 0 almost surely. n We’ll do the same trick as usual: taking the L2 norm,

2 n n Tn − ETn 1 X 1 X 2 = Var Yk ≤ E[Yk ]. n n2 n2 2 k=1 k=1

This is a bit tricky to deal with, because the X_k's aren't even assumed to have finite variance. Suppose that we knew

Σ_{k=1}^∞ E[Y_k²]/k² ≤ A < ∞.

Then we could show our result by noticing that

(1/n²) Σ_{k=1}^n E[Y_k²] = Σ_{k=1}^n (k/n)² · E[Y_k²]/k²,

and we can then split the sum to make the summand small in both regimes:

≤ (log n / n)² Σ_{k≤log n} E[Y_k²]/k² + Σ_{k>log n} E[Y_k²]/k² ≤ (log n / n)² A + Σ_{k>log n} E[Y_k²]/k²,

and this goes to 0 as n → ∞ because we're taking the tail of a convergent sum. So let's show the boxed statement: we do the same calculation as in the weak law for iid sequences to get

Σ_{k=1}^∞ E[Y_k²]/k² ≤ Σ_{k=1}^∞ (1/k²) ∫_0^k 2y P(X ≥ y) dy,

which simplifies after swapping the sum and the integral (okay because everything is positive) to

∫_0^∞ P(X ≥ y) [2y Σ_{k≥y} 1/k²] dy.

The bracketed quantity is uniformly bounded over all y ≥ 0, because small y give 2y times something constant, and large y make the inner sum proportional to 1/y. Thus, our whole expression is bounded by

≤ c ∫_0^∞ P(X ≥ y) dy = c E[X] < ∞,

so in particular (T_n − E T_n)/n goes to 0 in probability. But notice that the boxed expression gives us a bit more: we have a rate of decay. Suppose that we have a subsequence of integers n(ℓ) going to infinity: then

Σ_{ℓ≥1} ||(T_{n(ℓ)} − E T_{n(ℓ)})/n(ℓ)||²_2 ≤ Σ_{ℓ≥1} (1/n(ℓ)²) Σ_{k=1}^{n(ℓ)} E[Y_k²],

and changing the order of summation yields

= Σ_{k=1}^∞ E(Y_k²) Σ_{ℓ≥1} 1{n(ℓ) ≥ k}/n(ℓ)².

This is finite, by the boxed condition, if we can show that

Σ_{ℓ≥1} 1{n(ℓ) ≥ k}/n(ℓ)² ≤ C/k².

This works if we take n(ℓ) = α^ℓ (rounded to an integer) for α > 1: then by a geometric series sum,

Σ_{ℓ≥1} 1{n(ℓ) ≥ k}/n(ℓ)² ≤ C(α)/k²

for all k ≥ 1. So combining this with the boxed statement above, summing over this sufficiently sparse subsequence gives us

Σ_ℓ ||(T_{α^ℓ} − E T_{α^ℓ})/α^ℓ||²_2 < ∞,

so

(T_{α^ℓ} − E T_{α^ℓ})/α^ℓ → 0

almost surely as ℓ → ∞ (by Chebyshev and Borel-Cantelli). In general, for the full sequence, we can use that the X's are nonnegative: if α^ℓ ≤ n ≤ α^{ℓ+1}, then

T_{α^ℓ}/α^{ℓ+1} ≤ T_n/n ≤ T_{α^{ℓ+1}}/α^ℓ,

and since the left and right sides both converge,

µ/α ≤ lim inf T_n/n ≤ lim sup T_n/n ≤ αµ,

and sending α ↓ 1 gives the result.
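The strong law is a statement about individual sample paths: a single realized sequence of running means settles down near µ, not just its distribution. A quick numerical illustration (the Exp(1) distribution, path length, and cutoffs are arbitrary choices of ours):

```python
import random

rng = random.Random(5)
# One sample path of running means S_n / n for iid Exp(1), which has mean 1.
total = 0.0
path = []
for n in range(1, 50001):
    total += rng.expovariate(1.0)
    path.append(total / n)

# The strong law says this single path converges; check that its late
# portion (n >= 20000) stays uniformly within 0.05 of the mean.
late_ok = all(abs(v - 1.0) < 0.05 for v in path[20000:])
```

Contrast with the weak law, which would only say each individual time n is likely to be close to 1; here the whole tail of one path is close at once.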

10 October 7, 2019

Remark 93. There was an attendance quiz in class today, “not intended to be difficult.”

Here’s what the problem was:

Problem 94
Let X be a random variable such that

P(X ≤ x) = 0 for x ≤ 1,   P(X ≤ x) = 1 − 1/x for x ≥ 1.

Calculate E(X^p; X ≤ n).

Since we have the distribution, we could just find the probability density, but we were asked to not do this.

Solution. The probability that $X$ is large is bounded via
$$P(X\ge x) = \frac{1}{x} \quad \forall x\ge 1,$$

and this is a pretty common situation where we have a tail bound on $X$. So now

$$E(X^p;\,X\le n) = E(Y^p), \qquad Y = X\,1\{X\le n\}.$$

Now
$$E(Y^p) = \int_0^\infty P(Y^p\ge t)\,dt,$$
and now changing variables such that $t = y^p$, $dt = py^{p-1}\,dy$, this evaluates to
$$\int_0^\infty py^{p-1}\,P(Y\ge y)\,dy = \int_0^n py^{p-1}\,P(Y\ge y)\,dy,$$
and then we can just directly integrate. The point is that we can use this approach if we just have something like
$$P(X\ge x)\le\frac{c}{x},$$
without needing to have an explicit density function!
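As a quick numerical check (my own addition, not from the lecture), we can evaluate the tail-integral formula for the Problem 94 distribution and compare it against the answer obtained directly from the density $1/x^2$ on $[1,n]$:

```python
import math

def truncated_moment(p, n, steps=200000):
    """E[X^p; X <= n] via  int_0^n p*y^(p-1)*P(Y >= y) dy,  where
    Y = X*1{X <= n} and the tail is P(X >= x) = 1/x for x >= 1."""
    def tail(y):
        # for 0 < y <= 1: P(Y >= y) = P(X <= n) = 1 - 1/n
        # for 1 < y <= n: P(Y >= y) = P(y <= X <= n) = 1/y - 1/n
        return 1.0 - 1.0 / n if y <= 1 else 1.0 / y - 1.0 / n

    h = n / steps  # midpoint rule on (0, n]
    return sum(p * ((i + 0.5) * h) ** (p - 1) * tail((i + 0.5) * h) * h
               for i in range(steps))

p, n = 0.5, 4.0
numeric = truncated_moment(p, n)
# direct integration against the density 1/x^2 on [1, n], p != 1:
closed = (n ** (p - 1) - 1) / (p - 1)
```

The two agree up to quadrature error, even though the tail-formula route never touches the density.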

Back to class content. The last thing we proved last time is the strong law of large numbers: if $X$ and $X_i$ are iid and $E[X]$ is finite, then $\frac1n\sum_{i=1}^n X_i$ converges to $\mu$ almost surely. This time, we're going to work "around that" and find the probability that $\frac{S_n}{n}$ is about $\mu + \varepsilon$ for some constant $\varepsilon$.

Example 95 Let’s say that X ∼ N(0, 1) is standard Gaussian with density

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right).$$

Then $S_n = \sum_{i=1}^n X_i \sim N(0,n)$ is another Gaussian with variance $n$, so $\frac{S_n}{\sqrt n} \sim N(0,1)$ again. In addition, $S_n$ has probability density
$$\varphi\left(\frac{x}{\sqrt n}\right)\frac{dx}{\sqrt n},$$

and the strong law of large numbers tells us that $\frac{S_n}{n}\to0$ almost surely because we have mean zero. But what if we're a small constant $\varepsilon$ away from the mean? For example, can we find

$$P\left(\frac{S_n}{n}\ge\varepsilon\right)?$$
Since $S_n$ has standard deviation $\sqrt n$, this is equivalent to
$$P(N(0,1)\ge\varepsilon\sqrt n) = \int_{z\ge\varepsilon\sqrt n}\varphi(z)\,dz.$$

Since $\varphi$ decreases very fast, this is approximately on the order of $\varphi(\varepsilon\sqrt n) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{n\varepsilon^2}{2}\right)$. How good is this approximation? We should be able to show more precisely that the actual value of the integral is

$$\exp\left(-\frac{n\varepsilon^2}{2} + o(n)\right),$$

so the discrepancy does not affect the leading (exponential) order. So that means that deviating from the mean by a constant $\varepsilon$ is exponentially unlikely in $n$.
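We can watch this rate emerge numerically. The sketch below is my own addition, using the standard identity $P(N(0,1)\ge x) = \frac12\operatorname{erfc}(x/\sqrt2)$ to compute the exact tail, then comparing $-\frac1n\log P(S_n/n\ge\varepsilon)$ against the predicted limit $\varepsilon^2/2$:

```python
import math

def gaussian_tail_rate(eps, n):
    """-(1/n) * log P(N(0,1) >= eps*sqrt(n)), via the erfc tail formula."""
    x = eps * math.sqrt(n)
    tail = 0.5 * math.erfc(x / math.sqrt(2))  # P(N(0,1) >= x)
    return -math.log(tail) / n

eps = 0.5
rates = [gaussian_tail_rate(eps, n) for n in (10, 100, 1000)]
limit = eps ** 2 / 2  # the predicted large-deviation rate
```

The computed rates decrease monotonically toward $\varepsilon^2/2$; the leftover gap is the sub-exponential $o(n)/n$ correction.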

Example 96

Let $X\sim\operatorname{Ber}(p)$ (so 1 with probability $p$ and 0 with probability $1-p$). Then $S_n$ is binomial, $S_n\sim\operatorname{Bin}(n,p)$, with probability mass function
$$P(S_n = k) = \binom{n}{k} p^k(1-p)^{n-k}.$$

Again by the strong law of large numbers, $\frac{S_n}{n}$ approaches $p$ almost surely. The Central Limit Theorem actually tells us that
$$\frac{S_n - n\mu}{\sigma\sqrt n} = \frac{S_n - np}{\sqrt{np(1-p)}} \approx N(0,1)$$
(we haven't proved this yet). But we will do the heuristic calculation to say why this is "approximately" the normal distribution here! Recall Stirling's formula, which says that
$$n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n.$$
So if $0<x<1$, then assuming $nx$ is an integer,

$$P(S_n = nx) = \binom{n}{nx}p^{nx}(1-p)^{n(1-x)},$$
and by Stirling's approximation, this is about
$$\frac{\sqrt{2\pi n}}{\sqrt{2\pi nx}\,\sqrt{2\pi n(1-x)}}\cdot\frac{(n/e)^n}{(nx/e)^{nx}\,(n(1-x)/e)^{n(1-x)}}\cdot p^{nx}(1-p)^{n(1-x)}.$$

(This is good when $x$ is not too close to 0 or 1, because that's when Stirling gives good approximations.) Then $(n/e)^n$ cancels out in the top and bottom, leaving

$$\frac{1}{\sqrt{2\pi nx(1-x)}}\left(\frac{p}{x}\right)^{nx}\left(\frac{1-p}{1-x}\right)^{n(1-x)}.$$

Taking logs and putting everything in the exponential, this is

$$\frac{1}{\sqrt{2\pi nx(1-x)}}\exp\left(-n\left[x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}\right]\right).$$

This inner bracketed term is often denoted $I_p(x)$ or $H(x|p)$, also known as the binary relative entropy. From here, note that we're trying to show that $z$ is standard Gaussian, where

$$z = \frac{S_n - np}{\sqrt{np(1-p)}} = \frac{n(x-p)}{\sqrt{np(1-p)}}:$$
this suggests a change of variables from $x$ to $z$. Implicitly, we need $z$ to be bounded ($O(1)$) for the next calculation, meaning that $x = p\pm\frac{O(1)}{\sqrt n}$, so $S_n$ is pretty close to the mean. We can substitute in
$$P\left(S_n = np + \sqrt{np(1-p)}\,z\right) = \frac{1}{\sqrt{2\pi np(1-p)}}\exp\left(-nI_p\left(p + \sqrt{\frac{p(1-p)}{n}}\,z\right)\right).$$

Notice that the $x$ in the constant in front can be (and has been) replaced with $p$, because $x$ is close to $p$. Now we're taking $I_p$ close to its minimizer at $p$, so we can Taylor expand around $p$: the first two terms disappear because we have a local minimum, and thus the $I_p$ term ends up being approximately $\frac{z^2}{2n} + \frac{O(1)}{n^{3/2}}$. Putting everything together, this is

$$P\left(S_n = np + \sqrt{np(1-p)}\,z\right) \sim \frac{1}{\sqrt{2\pi np(1-p)}}\exp\left(-\frac{z^2}{2} + \frac{O(1)}{\sqrt n}\right),$$

and the last term here gets absorbed as a constant in our proportionality relation. So our random variable $\frac{S_n - np}{\sqrt{np(1-p)}}$ converges to the Gaussian in the sense that if $a<b$ are fixed, then

$$P\left(a\le\frac{S_n - np}{\sqrt{np(1-p)}}\le b\right) \sim \sum_{z\in[a,b]}\frac{1}{\sqrt{2\pi np(1-p)}}\exp\left(-\frac{z^2}{2}\right),$$

where $z$ lies in the lattice $\frac{\mathbb Z - np}{\sqrt{np(1-p)}}$, so that $S_n$ is equal to an integer (don't forget we're still working with a binomial distribution, which is discrete-valued). So this hopefully gives a good sense of how the Central Limit Theorem is supposed to work in general! Note that this does not actually tell us anything of the form

$$P(S_n - np \ge n\varepsilon)$$
for a constant $\varepsilon$: we can only look at $S_n - np$ on the order of $\sqrt n$. We can indeed go a bit further than this because of the extra $\frac{O(1)}{\sqrt n}$ term in our Taylor expansion, but we can check that we can't go out to $S_n - np$ linear in $n$ and still expect Gaussian behavior! To show that, note that

$$P(S_n - np \ge n\varepsilon) = \sum_{k\ge n(p+\varepsilon)}\binom{n}{k}p^k(1-p)^{n-k},$$

and we already know how to approximate each term using Stirling's formula. Here $k$ is larger than the mean, and thus the sum will be dominated by $k$ near the smallest allowed value $n(p+\varepsilon)$: doing out the calculations, this is

$$\approx \exp\left(-nI_p(p+\varepsilon) + o(n)\right).$$

(Again, $o(n)$ is not the best approximation we can get, but we only care about the leading term.) And this is not the same as the naive Gaussian guess: in other words,

$$P\left(\frac{S_n - np}{\sqrt{np(1-p)}}\ge\frac{n\varepsilon}{\sqrt{np(1-p)}}\right) \not\sim \exp\left(-\frac{n\varepsilon^2}{2p(1-p)}\right).$$
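Both claims — that the Stirling form approximates the exact pmf well, and that the true upper-tail rate is $I_p(p+\varepsilon)$ rather than the naive Gaussian guess — can be checked exactly for moderate $n$. This is a numerical illustration of my own; the parameter values are arbitrary:

```python
import math

def log_pmf(n, k, p):
    """Exact log P(Bin(n, p) = k)."""
    return (math.log(math.comb(n, k)) + k * math.log(p)
            + (n - k) * math.log(1 - p))

def I(x, p):
    """Binary relative entropy I_p(x) = x log(x/p) + (1-x) log((1-x)/(1-p))."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def stirling_log_pmf(n, k, p):
    """log of  exp(-n I_p(x)) / sqrt(2 pi n x (1-x)),  with x = k/n."""
    x = k / n
    return -n * I(x, p) - 0.5 * math.log(2 * math.pi * n * x * (1 - x))

def log_tail(n, p, a):
    """Exact log P(Bin(n, p) >= n*a), summed stably in log space."""
    logs = [log_pmf(n, k, p) for k in range(math.ceil(n * a), n + 1)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(v - mx) for v in logs))

n, p, eps = 2000, 0.3, 0.2
pmf_err = abs(log_pmf(n, 700, p) - stirling_log_pmf(n, 700, p))  # x = 0.35
rate = -log_tail(n, p, p + eps) / n           # -(1/n) log P(S_n >= n(p+eps))
gauss_guess = eps ** 2 / (2 * p * (1 - p))    # what naive Gaussian scaling would give
```

The exact rate sits just above $I_p(p+\varepsilon)$ (consistent with the Chernoff bound) and is clearly closer to the entropy rate than to the Gaussian guess.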

So the main goal for this week is to find conditions under which $X, X_i$ iid with $E[X]$ finite satisfy

$$P(S_n - n\mu\ge n\varepsilon) = \exp\left(-nf(\varepsilon,\mu) + o(n)\right)$$
for some function $f$ of $\varepsilon$ and $\mu$. (This is a subject called large deviations theory, which is pretty deep, but we'll only study it for a bit.) To finish class today, we'll do a simple combinatorial example with some large deviations ideas.

Problem 97 Say we have a k × n table of bins for fixed k, where we care about the limit as n → ∞. We put nkp balls into the bins of the table uniformly at random (each bin can hold at most one ball). Let F be the event that every column receives at least one ball. What is the probability of F ?

First of all, we need $nkp\ge n$ (so that we can actually put enough balls in our $n$ columns). For simplicity, let's assume this inequality is strict, so $kp > 1$. Consider the case $k=2$: then we have a $2\times n$ table, and we have $2np$ balls that go in. Every column has either one or two entries: say that $nx$ of them have a single entry, and the remaining $n(1-x)$ have two entries. Then the probability is
$$P(F) = \frac{1}{\binom{2n}{2np}}\cdot\binom{n}{nx}\cdot 2^{nx},$$
because we pick which $nx$ columns only get one entry and whether each takes the top or bottom cell. Also, remember that $x$ is not a free variable: since we have $2np$ balls, $nx + 2n(1-x) = 2np$, so $x = 2(1-p)$. Then plugging in $x = 2(1-p)$ gives the answer, and we can use Stirling's formula if we want something nicer. $k=3$ is a bit trickier: we now break up into columns with one, two, or three occupied entries, but now the number of each type is no longer uniquely determined. So now it's not such an immediate calculation.

So here's one way to view the calculation! Fix a parameter $\theta\in(0,1)$ (currently arbitrary), and say that $X, X_i$ are iid $\operatorname{Bin}(k,\theta)$ for all $1\le i\le n$. We claim that

$$P(F) = P_\theta\left(X_i\ge1\ \forall\,1\le i\le n \;\Big|\; \sum_{i=1}^n X_i = nkp\right)$$

for all $\theta\in(0,1)$. The idea is that we put an independent $\operatorname{Ber}(\theta)$ in every entry, and we just need to condition on actually getting $nkp$ occupied entries. ($\theta$ is completely arbitrary, but conditioning on the total number of 1s gives us the desired setup, since any two configurations with $nkp$ balls are equally likely.) Let $G$ be the event $\{X_i\ge1\ \forall\,1\le i\le n\}$, and let $S$ be the event $\{\sum_{i=1}^n X_i = nkp\}$. By Bayes' rule, we wish to find
$$P_\theta(G|S) = \frac{P_\theta(G\cap S)}{P_\theta(S)} = \frac{P_\theta(G)\,P_\theta(S|G)}{P(\operatorname{Bin}(nk,\theta) = nkp)}.$$
We know how to calculate this denominator from our earlier work. The numerator is not too hard either: $P_\theta(G)$ is now much easier to calculate than $P_\theta(G|S)$, because we now have independent events, so that just gives us $P(\operatorname{Bin}(k,\theta)\ge1)^n$. It remains to deal with $P_\theta(S|G)$, and we can always just bound that to be less than 1: this already gives us a bound of the form
$$P_\theta(G|S) \le \frac{P(\operatorname{Bin}(k,\theta)\ge1)^n}{P(\operatorname{Bin}(nk,\theta) = nkp)},$$
which can be optimized over $\theta$. But it would be nice to know $P(F)$ exactly: specifically, if we pick the optimal $\theta$, do we get the correct answer? Let's write out our final term:

$$P_\theta(S|G) = P_\theta\left(\sum_{i=1}^n X_i = nkp \;\Big|\; X_i\ge1\ \forall i\right) = P\left(\sum_{i=1}^n Y_i = nkp\right),$$

where $Y\sim(X\,|\,X\ge1)$. It's then natural to guess that we want to pick $\theta$ such that $E_\theta Y = kp$: explicitly, this means that
$$E_\theta(X\,|\,X\ge1) = \frac{k\theta}{1-(1-\theta)^k} = kp.$$

Picking this value of $\theta$, we know that $\frac1n\sum Y_i$ converges almost surely to $kp$. On its own, that doesn't mean that the probability $P(\sum Y_i = nkp) = \exp(-o(n))$, but this can be done with a bit more work: it's the local central limit theorem. Note that the $\theta$ here is the same one that optimizes the earlier expression! The purpose of this exercise is to show that a specific configuration-counting problem can be made simpler with probability. Next lecture, we'll look at the more general question we've proposed!
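The equation $\frac{k\theta}{1-(1-\theta)^k} = kp$ generally has no closed form, but since the conditional mean $E_\theta(X|X\ge1)$ is increasing in $\theta$, it is easy to solve numerically by bisection. A minimal sketch of my own, with illustrative parameters:

```python
def tilt_theta(k, p, tol=1e-12):
    """Solve k*theta / (1 - (1-theta)^k) = k*p for theta in (0, 1).
    The left side is E_theta[X | X >= 1] for X ~ Bin(k, theta), which
    increases from 1 to k as theta goes from 0 to 1; requires 1/k < p < 1."""
    def cm(theta):
        return k * theta / (1.0 - (1.0 - theta) ** k)
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cm(mid) < k * p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

k, p = 3, 0.6  # kp = 1.8 > 1, as the problem requires
theta = tilt_theta(k, p)
cond_mean = k * theta / (1 - (1 - theta) ** k)
```

Note that the solution satisfies $\theta < p$: conditioning on $X\ge1$ inflates the mean, so the tilt must compensate downward.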

11 October 9, 2019

Another homework assignment has been posted: it will be due in a week. Our most recent discussion has to do with sums of independent random variables: in particular, we’re saying that

if $X, X_i$ are iid with $E[X]$ finite, then the average of $X_1$ through $X_n$ approaches $\mu$ almost surely. We're curious about the probability that this average is more than $\mu+\varepsilon$ for some constant $\varepsilon>0$. Last time, we did a calculation to verify that in the case where $X$ is standard normal,

$$P(S_n\ge n\varepsilon) = \exp\left(-\frac{n\varepsilon^2}{2} + o(n)\right),$$
and similarly if $X$ is Bernoulli with probability $p$, then

$$P(S_n - np\ge n\varepsilon) = \exp\left(-nI_p(p+\varepsilon) + o(n)\right).$$

Today, we’ll generalize these ideas.

Definition 98 If X is a random variable, its moment generating function (also mgf) is

$$m(\theta) = E(e^{\theta X}).$$

Note that $e^{\theta X} > 0$ always, so $m(\theta)\in(0,\infty]$. We always have $m(0) = 1$: it's possible that this is the only finite value of the moment generating function!

Lemma 99 θX Let m(θ) = E(e ), and define

θ+ = sup{θ : m(θ) < ∞}, θ− = inf{θ : m(θ) < ∞}.

Note here that θ+ is in the range [0, ∞], and θ− is in the range [−∞, 0]. Then on the open interval (θ−, θ+), m(θ) is a smooth function with kth derivative

(k) k θX m (θ) = E(X e ) < ∞.

(k) k Also, if θ+ > 0, then m (θ) → E(X ) as θ ↓ 0.

The idea is that in general we can't exchange the derivative and the expectation, but it's okay to do that in this particular case! The proof is basically an exercise in the Dominated Convergence Theorem.

Definition 100 The cumulant generating function (also cgf) is defined as κ(θ) = log m(θ).

So let's look at our question again: what is the probability $P(S_n\ge na)$ for $a>\mu$? Define $x_{\max}$ to be the essential supremum of $X$:

$$x_{\max} = \sup(\operatorname{supp}\mathcal L_X) = \sup\{x\in\mathbb R: P(X>x)>0\}.$$

Note that for any $a > x_{\max}$, $P(X\ge a) = 0$, so $P(S_n\ge na) = 0$ (because each term in the sum $S_n$ is at most $x_{\max}$ almost surely).

Meanwhile, if $a = x_{\max}$,
$$P(S_n\ge na) = P(X=a)^n,$$
since $X$ is upper bounded by $a$ almost surely, and this last value can be positive if we have a point of positive measure at $X=a$. So we already know the answer in those cases: we will assume from now on that

$$EX < a < \operatorname{ess\,sup} X.$$

(This also rules out the case where $EX = \operatorname{ess\,sup}X$, in which case the distribution is concentrated entirely at one value.)

Theorem 101 (Cramér)
Let $X, X_i$ be iid. If $m(\theta) = E(e^{\theta X})$ is finite for some positive $\theta$, then $EX = \mu\in[-\infty,\infty)$ is well-defined. In addition, for any $\mu < a < x_{\max} = \operatorname{ess\,sup}X$, the sum $S_n = \sum_{i=1}^n X_i$ satisfies the "large deviations principle:"
$$-\lim_{n\to\infty}\frac1n\log P(S_n\ge na) = I(a) = \sup_{\theta\ge0}\left[\theta a - \kappa(\theta)\right] = \sup_{\theta\in\mathbb R}\left[\theta a - \kappa(\theta)\right].$$

Here, $E(e^{\theta X})$ being finite for some positive $\theta$ rules out the case where $X$ is really big, but it doesn't do anything about $X$ being really small (so $EX = EX_+ - EX_-$ could be $-\infty$). The actual definition of the large deviations principle is a bit more advanced, but we can read that on our own: the central idea of our first equality is still that $P(S_n\ge na) = \exp(-nI(a)+o(n))$. This last function is denoted $\kappa^*(a)$, and it is often known as the Legendre dual of $\kappa$.

The main idea is that if $S_n\ge na$, which is an atypically large value, there may be many ways for $X_1,\dots,X_n$ to reach that value, but it's intuitively likely that we won't have an edge case like $X_1,\dots,X_{n-1} = 0$, $X_n = na$.

Proof. We start with the upper bound, meaning that we upper-bound the probability $P(S_n\ge na)$. For any $\theta\ge0$,

$$P(S_n\ge na) \le P(e^{\theta S_n}\ge e^{n\theta a})$$

(inequality rather than equality only to cover the case $\theta=0$), and then by Markov's inequality,

$$\le \frac{E(e^{\theta S_n})}{e^{n\theta a}} = \frac{m(\theta)^n}{e^{n\theta a}} = \exp\left[-n(\theta a - \kappa(\theta))\right],$$
so this tells us that
$$P(S_n\ge na) \le \exp\left[-n\sup_{\theta\ge0}(\theta a - \kappa(\theta))\right].$$
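The Legendre dual $\sup_\theta[\theta a - \kappa(\theta)]$ rarely needs to be computed by hand; for Bernoulli variables it should recover the binary relative entropy $I_p(a)$ from last lecture. A quick numerical check of my own (grid search over $\theta$, arbitrary parameters):

```python
import math

def kappa(theta, p):
    """Cumulant generating function of Ber(p): log(1 - p + p e^theta)."""
    return math.log(1 - p + p * math.exp(theta))

def legendre(a, p, lo=-20.0, hi=20.0, grid=100000):
    """sup over theta of [theta*a - kappa(theta)], by a fine grid search."""
    best = float("-inf")
    for i in range(grid + 1):
        th = lo + (hi - lo) * i / grid
        best = max(best, th * a - kappa(th, p))
    return best

p, a = 0.3, 0.5
I_dual = legendre(a, p)
# the dual should match the binary relative entropy H(a|p):
I_entropy = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
```

The agreement confirms that the Chernoff exponent and the combinatorial (Stirling) exponent are the same object.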

Looking at the next equality in our chain, for any $\theta\in(\theta_-,\theta_+)$, we can define a probability measure $P_\theta$ such that we assign to any event $A$
$$P_\theta(A) = E\left[1_A\,\frac{e^{\theta X}}{m(\theta)}\right].$$
This is a reweighting of our measure: we'll cover this kind of operation more in a later lecture, but it is a probability measure because $P_\theta(\Omega) = E\left[\frac{e^{\theta X}}{m(\theta)}\right]$ is indeed normalized to 1. So for any measurable $f$,

$$E_\theta(f(X)) = E\left[f(X)\,\frac{e^{\theta X}}{m(\theta)}\right]$$

(this is true for simple functions by the definition, and then we can approximate by simple functions), and thus

$$E_\theta\left[\frac{m(\theta)}{e^{\theta X}}\,f(X)\right] = E[f(X)].$$

This is often called an exponential tilting of P. So looking at

$$\kappa(\theta) = \log E(e^{\theta X}) \implies \kappa'(\theta) = \frac{m'(\theta)}{m(\theta)} = \frac{E(Xe^{\theta X})}{E(e^{\theta X})} = E_\theta(X)$$
(the expectation of $X$ under the $\theta$-measure), by Lemma 99. The second derivative is also interesting:

$$\kappa''(\theta) = \frac{m''(\theta)}{m(\theta)} - \left(\frac{m'(\theta)}{m(\theta)}\right)^2 = \frac{E(X^2e^{\theta X})}{E(e^{\theta X})} - \left(\frac{E(Xe^{\theta X})}{E(e^{\theta X})}\right)^2 = E_\theta(X^2) - E_\theta(X)^2,$$

and this is the variance $\operatorname{Var}_\theta(X)$, which is strictly positive (because we assumed that $\mu < \operatorname{ess\,sup}X$). In other words,

$\kappa(\theta)$ is strictly convex on $(\theta_-,\theta_+)$. Now since $\kappa'(\theta)$ approaches $\mu$ as $\theta\downarrow0$ (by Lemma 99), $\theta$ can be picked small enough that $\mu < \kappa'(\theta) < a$. For such $\theta$,
$$\frac{d}{d\theta}\left(\theta a - \kappa(\theta)\right) = a - \kappa'(\theta) > 0.$$
Thus, we indeed know that
$$\sup_{\theta\ge0}(\theta a - \kappa(\theta)) = \sup_{\theta\in\mathbb R}(\theta a - \kappa(\theta)),$$
because $\kappa$ is strictly convex on its domain, and $\theta a$ is linear, so $\theta a - \kappa(\theta)$ is strictly concave on its domain $(\theta_-,\theta_+)$. We can now prove the lower bound; we'll only do it in a nice case, assuming that $\sup_{\theta\in\mathbb R}(\theta a - \kappa(\theta))$ is actually attained by some value $\theta_a\in(\theta_-,\theta_+)$. This is an assumption which doesn't always hold: the theorem is true without it, but it just has some annoying infinite cases.
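The identities $\kappa'(\theta) = E_\theta(X)$ and $\kappa''(\theta) = \operatorname{Var}_\theta(X)$ are easy to confirm numerically, say for a Bernoulli variable, where the tilted measure is again Bernoulli with success probability $pe^\theta/m(\theta)$. A small sketch of my own (parameter values arbitrary):

```python
import math

p, theta = 0.3, 0.7

def m(t):
    """mgf of Ber(p)."""
    return 1 - p + p * math.exp(t)

def kappa(t):
    return math.log(m(t))

# tilted measure: P_theta(X = 1) = p e^theta / m(theta)
p_tilt = p * math.exp(theta) / m(theta)
mean_tilt, var_tilt = p_tilt, p_tilt * (1 - p_tilt)

h = 1e-5  # numeric first and second derivatives of kappa
kappa1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)
kappa2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / (h * h)
```

The match between the finite differences and the tilted mean/variance is exactly the exponential-tilting mechanics used in the proof.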

So now pick $\theta\in(\theta_a,\theta_+)$ (we know that $\theta_a > 0$ by the work above). By convexity,

$$\kappa'(\theta) > \kappa'(\theta_a) = a,$$

because $\theta_a$ is a critical point of $\theta a - \kappa(\theta)$, and now take $b > \kappa'(\theta)$. We do another change of measure here:

$$P_\theta\left(\bigcap_{i=1}^n\{X_i\in A_i\}\right) = E\left[\prod_{i=1}^n 1\{X_i\in A_i\}\,\frac{e^{\theta X_i}}{m(\theta)}\right]$$

(in other words, we do the change in each random variable). Note that the $X_i$s are iid under $P_\theta$ as well because we're doing this in a product form. So now,
$$P(S_n\ge na) = E_\theta\left[\frac{m(\theta)^n}{e^{\theta S_n}}\cdot 1\{S_n\ge na\}\right],$$
and we want to prove a lower bound, so we restrict $S_n$ a little more: with our $b > \kappa'(\theta)$,
$$P(S_n\ge na) \ge E_\theta\left[\frac{m(\theta)^n}{e^{\theta S_n}}\cdot 1\{S_n\in[na,nb]\}\right].$$

On the event $S_n\in[na,nb]$, the factor $e^{\theta S_n}$ in the denominator can't be too big, meaning that this is at least $\frac{m(\theta)^n}{e^{\theta nb}}\,P_\theta(S_n\in[na,nb])$. Now $P_\theta(S_n\in[na,nb])$ is $1-o(1)$ by the law of large numbers, because under our changed measure,

$$E_\theta[X] = \kappa'(\theta)\in(a,b)$$

by construction. So we've turned a rare event into a typical event! Thus this reduces to

$$P(S_n\ge na) \ge (1-o(1))\,\frac{m(\theta)^n}{e^{\theta nb}} = \exp\left[-n(\theta b - \kappa(\theta)) + o(n)\right]$$

for any $\theta > \theta_a$, $b > \kappa'(\theta)$. Now take $\theta\downarrow\theta_a$, $b\downarrow\kappa'(\theta_a) = a$ to finish.

The main idea is that something magical is happening: the convenient exponentiation in the upper bound when we used Markov's inequality was a bit arbitrary, and we did a pretty arbitrary-looking change of measure in the lower bound too. These bounds just happen to work out to the same Legendre dual! So in some sense, exponentiating is optimal on both sides, and informally we should think of it this way: the most efficient (probable) way to achieve a large deviation of this form is to have $X_1,\dots,X_n$ all behave like a collection of samples from our tilted measure $P_\theta$, choosing $\theta$ such that $E_\theta X = a$. In some special cases, though, it's easier to calculate this in a more explicit way, and that's what we'll do now.

Let's do this in the case where $\mathcal L_X$ has finite support of size $k$: say that

$$P(X = x_j) = \pi_j, \quad 1\le j\le k.$$

(Of course, $\sum_{j=1}^k \pi_j = 1$.) Then given values of our variables $X_1,\dots,X_n$, their empirical measure is denoted (in general, not just when the $X$s have finite support)

$$L_n^X = \frac1n\sum_{i=1}^n \delta_{X_i}.$$

(We put a weight of $\frac1n$ at each point that some $X_i$ achieves.) $L_n^X$ is a probability measure on $\mathbb R$, and it is a random variable that is $\sigma(X_1,\dots,X_n)$-measurable. In our particular case where the support is finite, we can just view this as a vector $L_n^X\in[0,1]^k$ (it's the vector of empirical fractions):
$$(L_n^X)_j = \frac1n\cdot\#\{1\le i\le n: X_i = x_j\}.$$
So now we can do a direct calculation of any particular value:
$$P\left(L_n^X = \nu = (\nu_1,\dots,\nu_k)\right) = \frac{n!}{(n\nu_1)!\,(n\nu_2)!\cdots(n\nu_k)!}\cdot\pi_1^{n\nu_1}\cdots\pi_k^{n\nu_k}$$

(this is the probability that a $\nu_1$ fraction of our $X$s land on $x_1$, and so on). Expanding this with Stirling's formula, we find that
$$P(L_n^X = \nu) = \exp\left[-nH(\nu|\pi) + o(n)\right],$$
where
$$H(\nu|\pi) = \sum_{j=1}^k \nu_j\log\frac{\nu_j}{\pi_j} = D_{\mathrm{KL}}(\nu\,\|\,\pi)$$
is the relative entropy or Kullback-Leibler divergence between the two measures. (Note that this is not symmetric between $\nu$ and $\pi$.) This is just a combinatorial calculation, but we can now do something with this: let's assume that the support of $\pi$ has all $k\ge2$ entries, meaning that $H(\nu|\pi)$ is strictly convex in $\nu$. $S_n$ doesn't depend on the ordering of the $X_i$s: in particular, it's just a function of $L_n^X$ via

$$S_n = n\sum_{j=1}^k x_j\,L_n^X(x_j) = n\,\langle x, L_n^X\rangle.$$

So now suppose that $EX < a < \operatorname{ess\,sup}X$: then

$$P(S_n\ge na) = \sum_\nu P(L_n^X = \nu)\cdot 1\{n\langle x,\nu\rangle\ge na\} = \sum_\nu 1\{\langle x,\nu\rangle\ge a\}\exp\left(-nH(\nu|\pi)+o(n)\right).$$

The number of possible $\nu$ is at most polynomial in $n$ (at most $(n+1)^k$), so the total is not comparable to the scale of the exponential. So that can be absorbed into the exponential $o(n)$ term, meaning that

$$P(S_n\ge na) = \exp\left[-n\inf\{H(\nu|\pi): \langle x,\nu\rangle\ge a\} + o(n)\right].$$

This is now a constrained optimization problem, so writing down the Lagrangian (remembering that we have the constraint that our probability needs to sum to 1)
$$\mathcal L = H(\nu|\pi) + \rho\Big(1 - \sum_j\nu_j\Big) + \theta\Big(a - \sum_j x_j\nu_j\Big),$$
and taking its derivative,
$$\frac{\partial\mathcal L}{\partial\nu_j} = 1 + \log\nu_j - \log\pi_j - \rho - \theta x_j = 0 \implies \nu_j = \frac{\pi_j e^{\theta x_j}}{C},$$
where $C = m(\theta)$ normalizes the probabilities to 1. So this is exactly the $P_\theta$ measure that we were talking about! In this calculation, we didn't need to pull exponentials out of nowhere: they come from the final optimization problem. So to achieve this large deviation event $S_n\ge na$, the optimal value is achieved at $L_n^X\sim\nu = P_\theta$. It's good as an exercise to show that this also tells us
$$P(S_n\ge na) = \exp\left[-n\inf_{\theta:\,E_\theta X\ge a} H(P_\theta|\pi) + o(n)\right],$$
and the optimal $\theta$ has equality $E_\theta X = a$ (it is better to optimize without going farther into the tail).
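The identity behind this — that at the tilted measure $\nu_\theta$ with $E_\theta X = a$ we have $H(\nu_\theta|\pi) = \theta a - \kappa(\theta)$ — can be verified numerically on a small example. The support, base measure, and target below are my own arbitrary choices:

```python
import math

x = [0.0, 1.0, 2.0]      # finite support
pi = [0.5, 0.3, 0.2]     # base probabilities pi_j; E[X] = 0.7
a = 1.2                  # target mean, with E[X] < a < max(support)

def tilted(theta):
    """nu_j proportional to pi_j * exp(theta * x_j); also returns kappa(theta)."""
    w = [q * math.exp(theta * xi) for q, xi in zip(pi, x)]
    Z = sum(w)  # Z = m(theta)
    return [wi / Z for wi in w], math.log(Z)

lo, hi = -50.0, 50.0     # bisect for E_theta[X] = a (tilted mean is increasing)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    nu, _ = tilted(mid)
    if sum(nj * xj for nj, xj in zip(nu, x)) < a:
        lo = mid
    else:
        hi = mid
theta = 0.5 * (lo + hi)
nu, kap = tilted(theta)
mean_tilt = sum(nj * xj for nj, xj in zip(nu, x))
H = sum(nj * math.log(nj / qj) for nj, qj in zip(nu, pi))  # H(nu|pi)
dual = theta * a - kap                                     # Legendre dual at a
```

So the entropy of the constrained minimizer and the Legendre dual from Cramér's theorem are literally the same number.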

12 October 16, 2019

There is a midterm on Monday, so we should make sure to remember to show up to class. Today, we’ll cover some loose ends that have been mentioned but not in detail, and we’ll also introduce the Fourier transform (which is a topic for after the exam). Part of this is fair game for the midterm, but this will be communicated clearly. Here is a sample problem for the midterm:

Problem 102 Prove that the law of a real-valued random variable X is uniquely determined by its cumulative distribution function F (x) = P(X ≤ x).

This is the level of formality we should expect: we should understand, for example, that $X$ is a random variable on some probability space $(\Omega,\mathcal F,P)$, and its law is defined as $\mathcal L_X = X_\#P = \mu$, which is a probability measure on $(\mathbb R,\mathcal B_{\mathbb R})$.

Solution. The cdf $F(x) = P(X\le x)$ is the same as the measure (under $\mu$) of $(-\infty,x]$, so clearly $\mu$ determines $F$: we're asked to show the converse. Suppose there were two measures $\mu,\nu$ on $\mathbb R$ that both satisfy this relation, so $\mu((-\infty,x]) = \nu((-\infty,x])$ for all $x$. This means that
$$\mu((a,b]) = F(b) - F(a) = \nu((a,b])$$

for all half-open intervals. Then if $\mathcal F$ is the set of Borel sets $B$ such that $\mu(B) = \nu(B)$, we need to show that $\mathcal F = \mathcal B_{\mathbb R}$, so we can use a $\pi$-$\lambda$ argument. We know that $\mathcal F$ contains the set of half-open intervals, which is a $\pi$-system, and we can check that $\mathcal F$ is a $\lambda$-system as well. So by the $\pi$-$\lambda$ theorem, $\mathcal F$ contains the $\sigma$-algebra generated by the set of half-open intervals, which is $\mathcal B_{\mathbb R}$.

We won’t be allowed to bring cheat sheets, but some things will be written on the cover page. (However, it is reasonable for us to remember the full weak law of large numbers.)

Remark 103. If something follows directly from a theorem in class, we can cite the theorem and check that all the conditions hold. Otherwise, if it is very similar but we are told to reproduce it, that will be written out.

Let’s review something we talked about but not too formally:

Definition 104

Let $X,Y$ be independent, real-valued random variables with laws $\mathcal L_X = \mu$, $\mathcal L_Y = \nu$, so that $\mathcal L_{(X,Y)} = \mu\otimes\nu$. Denote their cdfs by $F(x) = P(X\le x)$, $G(y) = P(Y\le y)$. Then because $\mu,\nu$ are in one-to-one correspondence with $F,G$ respectively, we can define the convolution $\mu*\nu = \nu*\mu$ to be the law of $Z = X+Y$, and this corresponds to the cdf $F*G = G*F$.

Note that $\mu,\nu$ can be discrete, continuous, or anything in between. We can derive an expression for $F*G$ explicitly:
$$(F*G)(z) = P(X+Y\le z) = \iint 1\{x+y\le z\}\,d\mu(x)\,d\nu(y),$$
where we've implicitly used the change of variables formula and Tonelli's theorem to show that this is well-defined. Then the inner integral is just the probability that $x\le z-y$, which is
$$= \int F(z-y)\,d\nu(y) = \int F(z-y)\,dG(y),$$
where the last step is just saying that $\nu$ and $G$ carry the same information. (By symmetry, this is also equal to $\int G(z-x)\,dF(x)$.) We can say something nicer in the case where $\mu$ and $\nu$ have densities: then $\mu*\nu$ also has a density. To show this, let's say $f,g$ are the densities of $\mu,\nu$. By definition, this means that for all Borel sets $B$, $\mu(B) = \int 1_B f\,dx$ (using the standard Lebesgue measure), where $f$ is a nonnegative measurable function on $(\mathbb R,\mathcal B_{\mathbb R})$, which means that we can just write
$$F(x) = \int_{-\infty}^x f(t)\,dt.$$
We can also derive that if $h(X)$ is integrable, $E[h(X)] = \int h(t)f(t)\,dt$. So then because $dG(y) = g(y)\,dy$,
$$(F*G)(z) = \int F(z-y)\,dG(y) = \int F(z-y)\,g(y)\,dy = \int\int_{-\infty}^z f(t-y)\,dt\;g(y)\,dy,$$
and by Tonelli's theorem this is also $\int_{-\infty}^z\int f(t-y)g(y)\,dy\,dt$, and thus
$$\mu*\nu \text{ has density } (f*g)(z) = \int f(z-y)g(y)\,dy = \int g(z-x)f(x)\,dx.$$
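As a concrete illustration of my own (not from the notes): convolving two Uniform[0,1] densities numerically with the formula above reproduces the triangle density of the sum of two independent uniforms:

```python
def conv_density(f, g, z, lo=-3.0, hi=3.0, steps=20000):
    """(f*g)(z) = int f(z-y) g(y) dy, approximated by the midpoint rule."""
    h = (hi - lo) / steps
    return sum(f(z - (lo + (i + 0.5) * h)) * g(lo + (i + 0.5) * h)
               for i in range(steps)) * h

uniform = lambda t: 1.0 if 0.0 <= t <= 1.0 else 0.0

# the sum of two independent Uniform[0,1] variables has the triangle
# density on [0,2]: value 1 at the peak z=1, value 0.5 at z=0.5
val_peak = conv_density(uniform, uniform, 1.0)
val_half = conv_density(uniform, uniform, 0.5)
```

This also illustrates that the convolution of two discontinuous densities is already continuous — convolving smooths.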

We’ll now move on to our next topic: the Fourier transform. (This won’t be on the first exam, and not everything mentioned today is necessary to know, but it’s nice to know.)

Definition 105

The Fourier transform or characteristic function of a probability measure µ on (R, BR) is a function

$$\varphi_\mu = \hat\mu: \mathbb R\to\mathbb C,$$

defined via
$$\varphi_\mu(t) = \int e^{itx}\,d\mu(x).$$

In particular, if $X$ is a random variable with law $\mu$, we can also write this as

$$\varphi_\mu(t) = E[e^{itX}].$$

Notice that $e^{itX}$ is always on the unit circle in $\mathbb C$, so $\varphi_\mu(t)$ is contained in the unit ball in $\mathbb C$. Thus, for all $t\in\mathbb R$, (by the bounded convergence theorem) $\varphi_\mu$ is well-defined and continuous in $t$. This is good for proving something like the Central Limit Theorem, because the Fourier transform turns convolution into multiplication! If $X$ is a random variable with law $\mu$ and $Y$ is a random variable with law $\nu$, where $X\perp\!\!\!\perp Y$, then

$$\varphi_{\mu*\nu}(t) = E(e^{it(X+Y)}) = E[e^{itX}]\,E[e^{itY}]$$
by independence, which is just $\varphi_\mu(t)\varphi_\nu(t)$. Eventually, we'll have the Fourier transform on $\mathbb R$, which captures a lot of properties of the measure, but first we can look at the Fourier transform on a finite space: let's consider $\Omega = \mathbb Z/n\mathbb Z$ (the integers mod $n$), just to make everything simpler. Let $V$ be the set of functions $f:\Omega\to\mathbb C$, which is equivalent to $\mathbb C^{|\Omega|}$: this is a finite-dimensional complex vector space. $V$ has a hermitian inner product

$$\langle f,g\rangle = \sum_{x=0}^{n-1} f(x)\overline{g(x)} = g^*f,$$
where $g^*$ is the conjugate transpose of $g$ (we're viewing both as vectors). Then for any (nonzero) $z\in\Omega$, we can consider the translation operator on functions in $V$:

$$(T_zf)(x) = f(x-z).$$

We can also think of $T_z$ as a matrix because everything's finite dimensional: it is a circulant matrix, where the shift depends on $z$. What are the eigenvectors and eigenvalues of $T_z$? $f$ is an eigenvector of $T_z$ with eigenvalue $\lambda$ if and only if

$$\lambda f(x) = T_zf(x) = f(x-z)$$
for all $x$, but iterating this $k$ times, $\lambda^k f(x) = f(x-kz) = f(x)$ whenever $kz\equiv0\bmod n$. In particular, this means that $\lambda^n = 1$, so $\lambda$ must be an $n$th root of unity.

If $n$ is prime, $T_z$ (for nonzero $z$) just has $n$ distinct eigenvalues, which are exactly those $n$th roots of unity. So then we can solve for the eigenvectors pretty easily: they're of the form

$$\chi_\ell(x) = \frac{1}{\sqrt n}\exp\left(\frac{2\pi i\ell x}{n}\right)$$

for a translation $T_z$, with corresponding eigenvalue $\exp\left(-\frac{2\pi i\ell z}{n}\right)$. But if $n$ is not prime, there is no longer a unique eigenbasis. Luckily, it turns out those $\chi_0,\dots,\chi_{n-1}$ still do form an eigenbasis of $T_z$ for all $z\in\Omega$, and $\chi_\ell$ is an eigenvector of $T_z$ with eigenvalue $\lambda_{z,\ell} = \exp\left(-\frac{2\pi i\ell z}{n}\right)$. So in summary, we have a finite space with some periodic structure: diagonalizing those translation operators, they have some nice form. We can also find that the scalar product between two vectors in our eigenbasis is

$$\langle\chi_k,\chi_\ell\rangle = (\chi_\ell)^*\chi_k = \sum_x \chi_k(x)\overline{\chi_\ell(x)} = \sum_x \frac1n\exp\left(\frac{2\pi ix(k-\ell)}{n}\right).$$

This is 1 when $k=\ell$ and 0 otherwise, which means that with respect to the hermitian inner product, the $\chi_\ell$ form an orthonormal basis. In other words, if we define
$$U = \begin{pmatrix} | & & | \\ \chi_0 & \cdots & \chi_{n-1} \\ | & & | \end{pmatrix},$$

we have a unitary matrix, meaning that $U^*U = UU^* = I$. So now $\chi_0,\dots,\chi_{n-1}\in V$ form the Fourier basis for $V$, and the Fourier transform is just the change-of-basis operation sending $f$ to $\hat f = U^*f$, the coordinates of $f$ in the Fourier basis: in other words,
$$f = UU^*f = U\hat f,$$
and this last expression is just $\sum_\ell \hat f(\ell)\chi_\ell$, which is $f$ in the Fourier basis. We can actually write out the coefficients explicitly here:
$$\hat f(\ell) = (U^*f)_\ell = \langle f,\chi_\ell\rangle = \sum_x f(x)\overline{\chi_\ell(x)} = \frac{1}{\sqrt n}\sum_x f(x)\exp\left(-\frac{2\pi i\ell x}{n}\right).$$
Other than the normalizing constants, this should look very similar to how we're multiplying $f$ by some complex exponential in the regular Fourier transform! So the whole point is that the Fourier transform is a change-of-basis operation which diagonalizes the translation operators, and the exponentials are motivated by the discrete case, where they just pop out of the eigenvalue calculation. Well, in this discrete case, we have

$$(f*g)(x) = \sum_z g(z)\,f(x-z),$$
which can be viewed as a linear combination of translations of $f$:

$$= \sum_z g(z)\,(T_zf)(x).$$

So we can think of $f*g = C_gf$ as an operator which is a linear combination of the translation operators:
$$C_g = \sum_z g(z)\,T_z.$$

Once we see that this convolution is a linear combination of translations, since the Fourier basis diagonalizes all of the $T_z$s simultaneously, we should expect that the Fourier basis is also good for $C_g$. Here are the details: since
$$C_g = \sum_{z\in\Omega} g(z)\,T_z,$$

and $T_z$ has the diagonal form $U\Lambda_zU^*$ (where $\Lambda_z$ is the diagonal matrix with entries $\lambda_{z,\ell}$, $\ell = 0,\dots,n-1$), we can also write this equivalently as
$$T_z = \sum_\ell \lambda_{z,\ell}\,\chi_\ell(\chi_\ell)^*.$$

So then
$$C_g = \sum_z g(z)\left(\sum_\ell \lambda_{z,\ell}\,\chi_\ell\chi_\ell^*\right) = \sum_\ell\left(\sum_z \lambda_{z,\ell}\,g(z)\right)\chi_\ell\chi_\ell^*,$$
and now because $\lambda_{z,\ell} = \exp\left(-\frac{2\pi i\ell z}{n}\right)$, which matches the entry $U^*_{\ell,z}$ of the matrix $U^*$ (up to the $\sqrt n$ coming from our unitary normalization of the $\chi_\ell$), this inner sum is the same as the $\ell$th coordinate of $U^*g$, up to that factor. In matrix form, this looks like

$$C_g = \begin{pmatrix} | & & | \\ \chi_0 & \cdots & \chi_{n-1} \\ | & & | \end{pmatrix}\begin{pmatrix}(U^*g)_0 & & \\ & \ddots & \\ & & (U^*g)_{n-1}\end{pmatrix}U^* = U\operatorname{diag}(U^*g)\,U^*.$$

This then means that (using $U^*U = I$)

$$\widehat{f*g} = U^*C_gf = \operatorname{diag}(U^*g)\,U^*f = \operatorname{diag}(\hat g)\,\hat f,$$
which is indeed the entrywise product of $\hat f$ and $\hat g$ (again up to the $\sqrt n$ normalization)! So this is an alternate way of arriving at the same idea: this is the unique object that behaves in a nice way under this kind of convolution. We'll finish the class with one more sample problem:

Problem 106

Suppose $X_n$ are random variables that are Cauchy in probability, which means that for all $\varepsilon>0$, $P(|X_m - X_n|\ge\varepsilon)$

goes to 0 as m, n → ∞. Prove there exists a random variable X such that Xn → X in probability.

Solution. First, we need to find an $X$: the Cauchy property lets us extract a subsequence $n_k\to\infty$ along which we get almost sure convergence. In other words, we can choose $n_k$ so that for all $m,n\ge n_k$,
$$P\left(|X_m - X_n|\ge\frac{1}{2^k}\right) \le \frac{1}{2^k}.$$

If we then define $A_k$ to be the event $\{|X_{n_{k+1}} - X_{n_k}|\ge\frac{1}{2^k}\}$, then the probability that $A_k$ occurs infinitely often is 0 by Borel-Cantelli, so on $\Omega\setminus\{A_k\text{ i.o.}\}$, the sequence $(X_{n_k})$ is Cauchy and therefore converges to some limit $X$ almost surely.

So now we need to show that $X_n\to X$ in probability along the full sequence. Well, by the triangle inequality,
$$P(|X_n - X|\ge\varepsilon) \le P\left(|X_n - X_{n_k}| + |X_{n_k} - X|\ge\varepsilon\right),$$
so this is union bounded by
$$\le P\left(|X_n - X_{n_k}|\ge\frac{\varepsilon}{2}\right) + P\left(|X_{n_k} - X|\ge\frac{\varepsilon}{2}\right).$$
The second term goes to 0 because $X_{n_k}$ converges to $X$ almost surely (hence in probability), and the first term goes to 0 as well because we can take $n$ and $n_k$ to infinity using Cauchy convergence in probability.

Office hours for Professor Sun will be Friday 11-12, and the TAs will have office hours as well.

13 October 23, 2019

Recall that last time, we gave some motivation for the Fourier transform by considering the discrete case $\Omega = \mathbb Z/n\mathbb Z$. Letting $V$ be the set of functions from $\Omega\to\mathbb C$, we have a finite-dimensional vector space isomorphic to $\mathbb C^\Omega$ with hermitian inner product
$$\langle f,g\rangle = g^*f = \sum_{x\in\Omega} f(x)\overline{g(x)}.$$
We found that under this model, there exists an orthonormal (under the hermitian inner product) basis

$$(\chi_0,\dots,\chi_{n-1}): \qquad \chi_\ell(x) = \frac{1}{\sqrt n}\exp\left(\frac{2\pi i\ell x}{n}\right)$$

for $V$, which means that the matrix
$$U = (\chi_0\ \chi_1\ \cdots\ \chi_{n-1})\in\mathbb C^{n\times n}$$
is unitary: that is, $UU^* = U^*U = I$. Thus, the Fourier transform on $\Omega$ is a change-of-basis operation, meaning

$$\hat f = U^{-1}f = U^*f.$$

(Explicitly, this means that the coefficient $\hat f(\ell)$ in the Fourier basis is $\langle f,\chi_\ell\rangle$.) It's also nice that we can easily do the Fourier inversion: $f = U\hat f$.

Remark 107. Note that $\langle f,g\rangle = \langle\hat f,\hat g\rangle$ because $U$ is unitary.
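Both Remark 107 and the discrete convolution theorem from last lecture can be checked in a few lines. This sketch is my own (arbitrary test vectors); note that with the unitary normalization $\chi_\ell(x) = n^{-1/2}e^{2\pi i\ell x/n}$, the convolution theorem picks up a factor of $\sqrt n$:

```python
import cmath

n = 8

def dft(f):
    """Unitary DFT: f_hat(l) = (1/sqrt(n)) sum_x f(x) exp(-2 pi i l x / n)."""
    return [sum(f[x] * cmath.exp(-2j * cmath.pi * l * x / n) for x in range(n))
            / n ** 0.5 for l in range(n)]

def circ_conv(f, g):
    """(f*g)(x) = sum_z g(z) f(x - z), with indices mod n."""
    return [sum(g[z] * f[(x - z) % n] for z in range(n)) for x in range(n)]

f = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 1.0]
g = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
fh, gh = dft(f), dft(g)

# Remark 107: the unitary DFT preserves the hermitian inner product.
ip_time = sum(a * b for a, b in zip(f, g))              # f, g are real here
ip_freq = sum(a * b.conjugate() for a, b in zip(fh, gh))
parseval_err = abs(ip_time - ip_freq)

# Convolution theorem, with the sqrt(n) from this normalization.
lhs = dft(circ_conv(f, g))
rhs = [n ** 0.5 * a * b for a, b in zip(fh, gh)]
conv_err = max(abs(a - b) for a, b in zip(lhs, rhs))
```

Both errors are at machine precision, confirming unitarity and the diagonalization of convolution.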

Today, we’ll finish what we need to cover about the Fourier transform, which will let us prove the Central Limit Theorem next time! We’ll cover this for functions on R instead of over a finite space.

Let's start by fixing the normalization: recall that if $\mu$ is a probability measure on $(\mathbb R,\mathcal B_{\mathbb R})$, we defined the Fourier transform or characteristic function to be
$$\varphi_\mu(t) = \int e^{itx}\,d\mu(x) = E[e^{itX}]$$

if X ∼ µ. This has some basic properties, which are left as an exercise:

Proposition 108

For any probability measure $\mu$ on $\mathbb R$, $\varphi_\mu$ is a uniformly continuous mapping from $\mathbb R$ to the closed unit disk $D = \{z\in\mathbb C: |z|\le1\}$.

Uniform continuity can be seen because

$$|\varphi_\mu(t+h) - \varphi_\mu(t)| = \left|E\left(e^{i(t+h)X} - e^{itX}\right)\right|,$$

and now by Jensen's inequality on the absolute value, this is

$$\le E\left|e^{itX}(e^{ihX} - 1)\right| = E\left|e^{ihX} - 1\right|,$$

which tends to 0 as $h\to0$ by the bounded convergence theorem, uniformly in $t$ since the bound doesn't depend on $t$. All of this has a special case, which we might be more familiar with: if $\mu$ has a density $f$, then
$$\varphi_\mu(t) = \int e^{itx}f(x)\,dx = \hat f(t)$$

is known as the $L^1$ Fourier transform (because $f$ integrates to 1, so it's in $L^1$). An important result to keep in mind from Fourier analysis:

Theorem 109 (Plancherel/Parseval)
If we have functions $f,h\in L^1(\mathbb R)\cap L^2(\mathbb R)$, then $\hat f,\hat h\in L^\infty(\mathbb R)$ (because the result is bounded) and also $L^2(\mathbb R)$. In addition, the inner product $\langle\hat f,\hat h\rangle = 2\pi\langle f,h\rangle$.

Thus, the mapping
$$U: L^1(\mathbb R)\cap L^2(\mathbb R)\to L^\infty(\mathbb R)\cap L^2(\mathbb R), \qquad U(f) = \frac{\hat f}{\sqrt{2\pi}}$$
has a unique continuous extension to a unitary map $L^2(\mathbb R)\to L^2(\mathbb R)$.

The point is that in principle, the integral $\int e^{itx}f(x)\,dx$ may not be defined if $f$ is in $L^2$ but not in $L^1$. But then we can approximate $f$ with functions in $L^1\cap L^2$, and the limit will make sense! Also, because this is a unitary mapping, we have a simple inverse. If $h\in L^1(\mathbb R)\cap L^2(\mathbb R)$, we can define

$$U^*h(x) = \frac{1}{\sqrt{2\pi}}\int e^{-itx}h(t)\,dt = \frac{\hat h(-x)}{\sqrt{2\pi}}$$
(this just has a negative sign in the exponent). The logic from above applies here as well, so this map extends to a map $U^*: L^2(\mathbb R)\to L^2(\mathbb R)$. This gives us an important result: the Fourier inversion theorem tells us that
$$f(x) = U^*Uf = U^*\left(\frac{\hat f}{\sqrt{2\pi}}\right) = \frac{1}{2\pi}\hat{\hat f}(-x).$$

So the Fourier transform is an involution except for the scaling factor! Well, when we look at the Fourier transform for probability measures, things are not quite so nice.
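Parseval's identity $\langle\hat f,\hat h\rangle = 2\pi\langle f,h\rangle$ can be checked numerically for a function whose transform we know in closed form — for instance the standard Gaussian density, whose Fourier transform is $e^{-t^2/2}$ (a standard fact). A quick sketch of my own:

```python
import math

def f(x):
    """Standard Gaussian density, which lies in both L^1 and L^2."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def f_hat(t):
    """Its Fourier transform: int e^{itx} f(x) dx = exp(-t^2/2)."""
    return math.exp(-t * t / 2)

def inner(u, v, lo=-12.0, hi=12.0, steps=40000):
    """L^2 inner product of real-valued functions, by the midpoint rule."""
    h = (hi - lo) / steps
    return sum(u(lo + (i + 0.5) * h) * v(lo + (i + 0.5) * h)
               for i in range(steps)) * h

lhs = inner(f_hat, f_hat)          # <f_hat, f_hat> = int e^{-t^2} dt = sqrt(pi)
rhs = 2 * math.pi * inner(f, f)    # 2*pi*<f, f>
```

Both sides equal $\sqrt\pi$ up to quadrature error, matching the theorem's normalization.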

Theorem 110 (Fourier inversion formula for probability measures on $\mathbb R$)
Suppose $\mu$ is a probability measure on $\mathbb R$ and $\varphi_\mu(t)$ is defined as above. Then

$$\int_{-T}^T \varphi(t)\,\frac{e^{-ita} - e^{-itb}}{2\pi it}\,dt$$
converges as $T\to\infty$ to
$$\tilde\mu(a,b) = \mu((a,b)) + \frac{\mu(\{a,b\})}{2}.$$

Note that this µ˜ is enough to uniquely determine µ (this is an exercise), even though µ˜ and µ are not exactly the

same. But the whole point is that φµ determines µ! By the way, we do need to integrate from −T to T to ensure the integral actually exists, and it’s part of the content of the theorem that this converges.

Example 111
Consider the random variable X which is ±1, each with probability 1/2. Then

φ_µ(t) = (e^{it} + e^{−it})/2 = cos t.

This has no decay as |t| → ∞, so we can't simply integrate from −∞ to ∞.
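As a sanity check on Theorem 110, the truncated inversion integral can be evaluated numerically for this example (a rough sketch, not from the lecture; the truncation level T and grid size are arbitrary choices). The interval (0.5, 1.5) contains the atom at +1 and neither endpoint carries mass, so the result should be near µ̃(0.5, 1.5) = 1/2:

```python
import cmath
import math

def inversion_integral(phi, a, b, T=200.0, n=400_000):
    """Midpoint Riemann sum of phi(t)(e^{-ita} - e^{-itb})/(2*pi*i*t) over [-T, T]."""
    dt = 2 * T / n
    total = 0.0
    for k in range(n):
        t = -T + (k + 0.5) * dt  # midpoints avoid the (removable) singularity at t = 0
        kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (2 * math.pi * 1j * t)
        total += (phi(t) * kernel).real * dt
    return total

# X = +/-1 with probability 1/2 each, so phi(t) = cos(t)
approx = inversion_integral(math.cos, 0.5, 1.5)
print(round(approx, 2))  # close to mu((0.5, 1.5)) = 1/2
```

The convergence in T is slow (the error decays like 1/T, as the sinc analysis below suggests), which is why T = 200 is used here.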

Intuitively, what is going on here? This is not too rigorous, but we're basically given φ and we want to recover the distribution by finding, for example, µ((a, b)) for some a < b. This is equal to

∫ h(x) dµ(x), where h = 1_{(a,b)},

and the Fourier transform of the indicator function h is (there are no convergence problems with the usual definition, because h has bounded support)

ĥ(t) = ∫ e^{itx} h(x) dx = ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/(it).

So we sort of want to write

µ((a, b)) = ∫ h dµ = ⟨µ, h⟩

(whatever that means, because µ is a measure), which we should believe yields, by expanding out the hermitian inner product as if Plancherel applied,

(1/2π)⟨φ_µ, ĥ⟩ = ∫_{−∞}^{∞} φ(t) · (e^{−ita} − e^{−itb})/(2πit) dt.

The integral might not even be defined, but this has approximately the correct form: that's the motivation for what's going on here. How do we rigorize this a bit more? We can approximate by truncating in the Fourier domain, integrating the last expression from −T to T only:

∫_{−T}^{T} φ(t) · (e^{−ita} − e^{−itb})/(2πit) dt = (1/2π) ∫ φ(t) · conj(ĥ(t)) · k_T(t) dt,

where k_T is the truncation function k_T(t) = 1{|t| ≤ T}. This then evaluates, at least formally, to

(1/2π)⟨φ, ĥ k_T⟩ = ⟨µ, h_T⟩,

where h_T is the function such that

ĥ_T = ĥ k_T.

We can solve for h_T explicitly, but the idea is that we have a not-smooth function h (the indicator of (a, b)) mapped to the Fourier domain. Since we cut out the high frequencies by truncating with k_T, we should have h_T be essentially a smooth version of h! And we hope that as T → ∞, the smoothing goes away:

lim_{T→∞} h_T(x) = h̃(x) = 1_{(a,b)}(x) + (1/2) · 1_{{a,b}}(x).

This completes the intuition for why we get only half the measure at the endpoints, and we'll go through all of the rigor now! Before diving into the proof, let's figure out what h_T should be so that it satisfies the equations above.

Fact 112
The function sinc(t) = sin(t)/t is not in L¹(R), because it only decays like 1/t. However, the truncated integrals

S(T) = ∫_{−T}^{T} sinc(t) dt

converge: S(T) → π as T → ∞.

Proof. There are various ways to do this; one is through complex analysis. The integral is the same as

∫_{−T}^{T} (e^{it} − e^{−it})/(2it) dt = ∫_{−T}^{T} e^{it}/(it) dt.

The function h(z) = e^{iz}/(iz) is holomorphic on C \ {0}, so if we integrate around any closed contour γ not containing or passing through the origin,

∮_γ h(z) dz = 0.

Integrate once counterclockwise around an indented semicircle in the upper half-plane: segments II and IV along the real axis from −R to −ε and from ε to R, a small semicircle I of radius ε around the origin, and a large semicircle III of radius R; then take R → ∞ and ε → 0.

We want the quantity along II and IV, so it suffices to show that (1) as ε → 0, the integral along I goes to −π, and (2) as R → ∞, the integral along III goes to 0. (2) can be done with some careful bounding, and (1) can be done by parametrizing z = εe^{iθ} for θ from π to 0, which yields

∫_{γ_ε} h(z) dz = ∫_π^0 (e^{iεe^{iθ}}/(iεe^{iθ})) · iεe^{iθ} dθ = ∫_π^0 e^{iεe^{iθ}} dθ.

As ε → 0, the integrand goes to 1, and thus this simplifies to

−∫_0^π 1 dθ = −π.

Thus, because the total contour integral is 0, the integral along II and IV is π in the limit, as desired.

To find h_T, we can apply the Fourier inversion formula for functions:

h_T(x) = (1/2π) ∫_{−T}^{T} e^{−itx} ĥ(t) dt = (1/2π) ∫_{−T}^{T} e^{−itx} · (e^{itb} − e^{ita})/(it) dt.

Averaging the integrand's values at t and −t, and using the fact that sin θ = (e^{iθ} − e^{−iθ})/(2i), this is equivalent to

∫_{−T}^{T} [sin(t(b − x))/(2πt) − sin(t(a − x))/(2πt)] dt.

Looking at each term: if b − x = 0 we get 0, and in all other cases we can make a u-substitution (flipping the limits of integration if b − x or a − x is negative) to turn this into

h_T(x) = [sgn(b − x) S(|b − x| T) − sgn(a − x) S(|a − x| T)] / (2π),

where we define sgn(0) = 0. Now if b − x and a − x are both positive or both negative, both terms converge to the same value (because S converges to π as T → ∞), so we only get something nontrivial when x ∈ [a, b]. If x is strictly between a and b, this goes to 2π/(2π) = 1, and otherwise (if x = a or x = b) it goes to π/(2π) = 1/2, which indeed means that h_T converges pointwise to h̃(x) = 1_{(a,b)}(x) + (1/2) · 1_{{a,b}}(x).

Remark 113. Another way to see that h_T is smooth is that the Fourier transform takes convolution to multiplication, so

ĥ_T = ĥ k_T ⟹ h_T = h ∗ S_T, where Ŝ_T = k_T.

Here S_T is a sinc-type kernel: it is oscillatory, and it becomes more concentrated (and more wavy) as T → ∞.

So now we’re ready to prove the Fourier inversion theorem for probability measures:

Proof. Note that

I_T = ∫_{−T}^{T} φ(t) · (e^{−ita} − e^{−itb})/(2πit) dt = (1/2π) ∫_{−T}^{T} φ(t) · conj(ĥ(t)) dt = (1/2π) ∫_{−T}^{T} (∫ e^{itx} dµ(x)) · conj(ĥ(t)) dt,

and now we can swap the order of integration (because the inner integrand is bounded and the outer integral is over a finite domain) and move the conjugation outside:

I_T = ∫ conj((1/2π) ∫_{−T}^{T} e^{−itx} ĥ(t) dt) dµ(x) = ∫ conj(h_T(x)) dµ(x) = ∫ h_T(x) dµ(x)

for any fixed T (the last step because h_T is explicitly real by our direct calculations). We just need to show that this converges to ∫ h̃(x) dµ(x) as T → ∞: we know that h_T → h̃ pointwise from our work above, and we notice that

h_T(x) = (1/2π)(± S(|b − x| T) ± S(|a − x| T)).

Since S is continuous with a finite limit at infinity, it is uniformly bounded, so h_T is bounded uniformly as well. Thus sup_T ‖h_T‖_∞ is finite, and we're done by the bounded convergence theorem.

By the way, let’s try taking the Fourier transform of a Gaussian

1 2 g(x) = √ e−x /2 : 2π the characteristic function  2  Z 1 2 1 Z (x − it) 2 φ(t) =g ˆ(t) = eitx g(x)dx = √ eitx e−x /2dx = √ exp − e−t /2dx 2π 2π 2 which is an integral along the real line   2 Z 1 1 2 Z 1 2 = e−t /2 √ exp − (x − it)2 dx = e−t /2 √ e−z /2dz. 2 x∈R 2π R−it 2π We can show that this integral is also equal to the integral over the real line by integrating over a rectangle as R → ∞:

y −R R x I IV II III −R − it R − it

63 Since e−z 2/2 is holomorphic everywhere as well, the integral is 0, and the integrals along II and IV go to 0. Thus, the integrals along the real line R and the shifted real line R + it are the same, and they’re both equal to 1. This implies that φ(t) = e−t2/2.
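This identity is easy to check numerically (a sketch, not from the lecture; the truncation |x| ≤ 10 and the grid size are our choices, harmless because the Gaussian tail beyond 10 is negligible):

```python
import cmath
import math

def g(x):
    # standard Gaussian density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g_hat(t, L=10.0, n=20_000):
    # midpoint Riemann sum for the characteristic function of N(0, 1)
    dx = 2 * L / n
    return sum(cmath.exp(1j * t * (-L + (k + 0.5) * dx)) * g(-L + (k + 0.5) * dx)
               for k in range(n)) * dx

for t in (0.0, 1.0, 2.5):
    # compare the numerical transform against exp(-t^2/2); errors should be tiny
    print(t, abs(g_hat(t) - math.exp(-t * t / 2)) < 1e-5)
```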

This is an important example: ignoring factors of 2π, the Fourier transform of a Gaussian is a Gaussian! So we have an eigenvector of some sort for the transform. To finish, let’s do a weak form of the L2 isometry:

Proposition 114
Suppose f, h ∈ L¹ ∩ L² ∩ L^∞ are continuous. Then

⟨f̂, ĥ⟩ = 2π⟨f, h⟩.

Proof. Consider the following integral as ε → 0:

∫ f̂(t) · conj(ĥ(t)) · exp(−εt²/2) dt.

This is well-defined because f̂, ĥ are bounded and the Gaussian factor is integrable, and it equals

∫ (∫ e^{itx} f(x) dx)(∫ e^{−ity} conj(h(y)) dy) exp(−εt²/2) dt.

Since our functions are in L¹, we can swap the order of integration to find that this is

∫∫ f(x) conj(h(y)) (∫ e^{it(x−y)} exp(−εt²/2) dt) dx dy.

The inner integral is √(2π/ε) · E[e^{i(x−y)Z/√ε}] with Z ∼ N(0, 1), and this expectation equals e^{−(x−y)²/(2ε)}. This leaves us with

√(2π/ε) ∫∫ f(x) conj(h(y)) e^{−(x−y)²/(2ε)} dx dy.

(Notice that a very spread-out Gaussian in the frequency domain becomes a very concentrated Gaussian in the space domain.) As ε → 0, the kernel (1/√(2πε)) e^{−(x−y)²/(2ε)} is an approximate identity, so all of the weight goes to x = y, and the expression converges to

2π ∫ f(x) conj(h(x)) dx = 2π⟨f, h⟩,

while the left side converges to ⟨f̂, ĥ⟩, and we're done.

14 October 28, 2019

Exams will be brought to class on Wednesday. Last time, we discussed the Fourier inversion theorem for probability measures on R, which lets us recover the measure µ for a random variable X given its characteristic function. One key example is that the characteristic function

for the standard Gaussian is

φ(t) = exp(−t²/2).

The main goal of today will be to prove the Central Limit Theorem for iid sequences: basically, we'll show that if E[X] = 0 and E[X²] = 1 (finite second moment), then

(1/√n) Σ_{i=1}^n X_i →d N(0, 1)

in some sense of convergence (“in distribution”) which we'll formally define.

Definition 115
Let S be a metric space with Borel sigma-algebra B generated by the open sets of the metric topology. If µ_n, µ are probability measures on (S, B), then µ_n converges weakly to µ, denoted µ_n ⟹ µ, if

∫ f dµ_n → ∫ f dµ

for all bounded continuous functions f : S → R. If X_n, X are random variables, then X_n →d X (X_n converges in distribution to X) if and only if the laws converge weakly, L(X_n) ⟹ L(X).

Why do we need the continuity assumption? Consider X_n = 1/n (not even random): this converges almost surely to X = 0, but for the discontinuous function f(x) = 1{x > 0}, we have f(X_n) = 1 for all n, which does not converge to f(X) = 0.

Fact 116
Every probability measure µ on (S, B) is regular, which means that for all A ∈ B,

µ(A) = sup{µ(F) : F ⊆ A, F closed} = inf{µ(G) : G ⊇ A, G open}.

The proof in general is similar to our homework. In particular, this implies that µ is determined by the values {µ(F) : F closed}.

This gives us the following result:

Lemma 117
If ∫ f dµ = ∫ f dν for all bounded continuous functions f : S → R, then µ = ν.

Proof. For any closed F, we can approximate its indicator function with the continuous function

f(x) = max(0, 1 − dist(x, F)/ε).

So if x is inside F, f(x) = 1, and if dist(x, F) ≥ ε, then f(x) = 0: we interpolate linearly in between. If F^ε denotes the ε-neighborhood of F, then

1_F ≤ f ≤ 1_{F^ε},

which means that (because f is a bounded continuous function)

µ(F) ≤ ∫ f dµ = ∫ f dν ≤ ν(F^ε)

for all ε > 0. As ε ↓ 0, F^ε ↓ F because F is closed, so by continuity from above, ν(F^ε) ↓ ν(F), and thus ν(F) ≥ µ(F). Repeating the argument in the other direction, we see that µ(F) = ν(F) for all closed F, giving us the result.

Corollary 118

µn cannot converge weakly to two different limits.

So now, for any ε > 0, any bounded continuous functions f_1, ⋯, f_k (k finite), and any probability measure µ on S, define the set

U_{ε, f_1, ⋯, f_k}(µ) = {ν a probability measure on S : |∫ f_i dν − ∫ f_i dµ| < ε for all 1 ≤ i ≤ k}

to be a kind of neighborhood around µ. Let T be the topology on P generated by these neighborhoods, where P is the space of probability measures on (S, B). Then it's a fact that weak convergence is equivalent to convergence in the topology T. Given that we have a notion of a neighborhood of µ, it makes sense to define a distance:

Definition 119
Given µ, ν ∈ P, define

π(µ, ν) = inf{ε > 0 : µ(A) ≤ ν(A^ε) + ε and ν(A) ≤ µ(A^ε) + ε for all A ∈ B},

where A^ε is the ε-neighborhood of A. This is called the Prohorov metric, and if S is a complete separable metric space (called a Polish space), then weak convergence is equivalent to convergence in the Prohorov metric.

Remark 120. “Complete” means that Cauchy sequences converge, and “separable” means that there is a countable dense subset.

When S is Polish, P is itself complete and separable, so it is again a Polish space. This is one of the reasons why almost all of probability happens on Polish spaces: we often want to pick a random probability measure from our space of probability measures, or something like that. So in this class, we'll always work in this setting. With this, we can turn to the Central Limit Theorem. In general, the main strategy for showing convergence in distribution is to break it into two parts:

• Show that the family of measures µ_n is contained in a compact subset of P, so that there are subsequences that converge. This is usually done with rougher estimates.

• Once we know subsequences converge, leverage this to show that all subsequential limits coincide.

This might be a bit abstract, so it's important for us to understand what compact subsets of P actually look like!

Definition 121
Let {µ_α : α ∈ I} ⊆ P be a set of probability measures. Then {µ_α} is tight if for all ε > 0, there exists a compact K_ε ⊆ S such that µ_α(K_ε) ≥ 1 − ε for all α ∈ I.

Some examples of families that are not tight: consider {ν_n = N(0, n)}, which are very spread-out Gaussians: these are not tight because the mass spreads out over all of R. Similarly, {γ_n = N(n, 1)} is another non-example, where the mass escapes to +∞. It turns out that tightness and relative compactness are essentially equivalent:

Theorem 122 (Prohorov)

Let {µ_α} be probability measures on a metric space S. If {µ_α} is tight, then the family is relatively compact, meaning it has compact closure in P. Furthermore, if S is a Polish space, then the converse is true as well, but this direction is less useful.

The case S = R is easier to prove, and there the result is known as the Helly selection theorem. We should make sure we understand the statement of Prohorov's theorem, and we're responsible for reading and understanding the proof of the Helly selection theorem (these will also be in the notes eventually). The key to the proof of the CLT is to relate weak convergence with characteristic functions:

Lemma 123
If µ is a probability measure on R with characteristic function φ, then for all u > 0,

µ({x ∈ R : |x| ≥ 2/u}) ≤ (1/u) ∫_{−u}^{u} (1 − φ(t)) dt.

In other words, when u is small, the tail behavior of µ is controlled by the behavior of the characteristic function near t = 0.

Proof. The right-hand side is

(1/u) ∫_{−u}^{u} (1 − φ(t)) dt = (1/u) ∫_{−u}^{u} ∫ (1 − e^{itx}) dµ(x) dt,

and we can change the order of integration by Fubini (the integrand is bounded and the domain has finite measure) to get

∫ [(1/u) ∫_{−u}^{u} (1 − e^{itx}) dt] dµ(x).

The inner integral evaluates directly to (pulling out a factor of 2)

2 ∫ (1 − (e^{iux} − e^{−iux})/(2iux)) dµ(x) = 2 ∫ (1 − sin(ux)/(ux)) dµ(x).

The integrand is always nonnegative (since |sin θ| ≤ |θ|), so we can restrict the integral to any domain of x. Whenever |ux| ≥ 2, we have |sin(ux)/(ux)| ≤ 1/|ux| ≤ 1/2, so the integrand is at least 1 there, and thus the whole expression is

≥ µ({x : |ux| ≥ 2}),

which is the result we're looking for.
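As a quick illustration (not from the lecture): for the two-point measure of Example 111, with φ(t) = cos t, both sides of the lemma are explicit, since (1/u)∫_{−u}^{u}(1 − cos t) dt = 2 − 2 sin(u)/u, and the inequality can be spot-checked:

```python
import math

def rhs(u):
    # (1/u) * integral of (1 - cos t) over [-u, u] = 2 - 2*sin(u)/u
    return 2 - 2 * math.sin(u) / u

def lhs(u):
    # mu({|x| >= 2/u}) for the measure putting mass 1/2 at each of -1 and +1
    return 1.0 if 2 / u <= 1 else 0.0

for u in (0.5, 1.0, 2.0, 3.0, 5.0):
    assert lhs(u) <= rhs(u)  # the bound of Lemma 123
    print(u, lhs(u), round(rhs(u), 3))
```

Note that the bound only becomes informative once u ≥ 2, which matches the heuristic: small-u behavior of φ controls the far tail.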

This lemma basically directly proves the following:

Theorem 124 (Continuity theorem)

Let µn be probability measures on R with characteristic functions φn. Then

• If µn converge weakly to some µ, then φn converges pointwise to φ, the characteristic function of µ.

• Conversely, if φn converges pointwise to φ, and φ is continuous at t = 0, then φ is a valid characteristic

function of some probability measure µ and µn =⇒ µ.

Example 125 (A non-example)
Say that µ_n = N(0, n) are the same wide Gaussians. Then φ_n(t) = exp(−nt²/2) converges pointwise to the indicator 1{t = 0}, which is not continuous at 0. Indeed, µ_n does not converge weakly to anything.

Proof. The first part is immediate: if µ_n ⟹ µ, then

φ_n(t) = ∫ e^{itx} dµ_n(x)

is the integral of the bounded continuous function x ↦ e^{itx}, so this converges to ∫ e^{itx} dµ(x) = φ(t) for all t ∈ R, which is pointwise convergence. For the other direction, we use the lemma, which says that

µ_n({|x| ≥ 2/u}) ≤ (1/u) ∫_{−u}^{u} (1 − φ_n(t)) dt

for all u, n. Keep u fixed and let n → ∞: the integrand is bounded and converges pointwise to 1 − φ, so the right side converges to

f(u) := (1/u) ∫_{−u}^{u} (1 − φ(t)) dt

by the bounded convergence theorem. Because φ is continuous at 0 with φ(0) = lim φ_n(0) = 1, we get f(u) → 0 as u ↓ 0. Thus the µ_n are tight: we have lim sup_n µ_n({|x| ≥ 2/u}) ≤ f(u), so given ε we can choose u with f(u) < ε, take the compact set [−2/u, 2/u] for all n ≥ N(ε), and enlarge the compact set to handle the finitely many n at the beginning of the sequence.

This means µ_n converges weakly along subsequences by Prohorov's theorem. For any subsequence with µ_{n_k} ⟹ ν, the first part gives φ_{n_k} → φ_ν pointwise. But we assumed φ_n → φ along the whole sequence, so φ_ν = φ: thus φ is a valid characteristic function, and all subsequential limits have the same characteristic function, hence coincide by the inversion theorem. So µ_n ⟹ µ for this common limit µ, as desired.

Remark 126 (Joke). “Does this imply CLT?” “I don’t know, does it?”

As a result of this, all we need to do is show that the characteristic function of the normalized sum in our statement of the CLT converges pointwise to that of the standard Gaussian. Here's a helpful tool for that:

Lemma 127
For all x ∈ R,

|e^{ix} − (1 + ix − x²/2)| ≤ min(x², |x|³/6).

Proof. This is a calculus bash.

We’ll now prove two versions of the central limit theorem:

Theorem 128 (Central limit theorem for iid sequences)
Let X, X_i be iid with E[X] = 0 and E[X²] = 1. (These can always be arranged by rescaling, but notice that this is strictly more restrictive than the strong law of large numbers, because we need a finite second moment.) Then

(1/√n) Σ_{i=1}^n X_i →d N(0, 1).

Proof. Notice that

|E[e^{itX} − (1 + itX − t²X²/2)]| = |φ_X(t) − (1 − t²/2)|

(by linearity of expectation, E[X] = 0, E[X²] = 1), and by Lemma 127 this can be bounded as t → 0:

|φ_X(t) − (1 − t²/2)| ≤ E[min{t²X², |t|³|X|³/6}] = o(t²).

Indeed, dividing by t², the quantity min{X², |t||X|³/6} converges to 0 almost surely as t → 0 (because the term |t||X|³/6 goes to 0) and is dominated by X², which is integrable, so the dominated convergence theorem says its expectation goes to 0. Thus

φ_X(t) = 1 − t²/2 + o(t²)

with an error term of smaller order than t². Now

Z_n = (1/√n) Σ_{i=1}^n X_i

has characteristic function

φ_{Z_n}(t) = φ_X(t/√n)^n = (1 − t²/(2n) + o(t²/n))^n.

This goes to exp(−t²/2) as n → ∞, which is the characteristic function of the standard Gaussian, so we're done by the continuity theorem.
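The theorem is easy to see in simulation. Here is a hedged sketch (the Rademacher choice X = ±1 and the parameters n, trials are ours, not from the lecture), estimating P(Z_n ≤ 1) and comparing with Φ(1) ≈ 0.8413:

```python
import math
import random

random.seed(0)

def standardized_sum(n):
    # Z_n = (X_1 + ... + X_n)/sqrt(n) with X_i = +/-1 (mean 0, variance 1)
    return sum(random.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)

trials, n = 5_000, 400
freq = sum(standardized_sum(n) <= 1.0 for _ in range(trials)) / trials

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal cdf
print(round(freq, 3), round(Phi(1.0), 3))  # the two numbers should be close
```

The small residual gap comes from the lattice structure of the Rademacher sum (a discreteness effect of order 1/√n) plus sampling noise.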

Remark 129. A similar proof with the moment generating function is harder to produce, because it’s possible to have finite second moment and not have the moment generating function be defined anywhere but t = 0. Also, we don’t have such a nice inversion theorem in that case.

There’s also a more general central limit theorem:

Theorem 130 (Lindeberg-Feller CLT)
Suppose that we have a triangular array such that for each n,

{X_{n,j} : 1 ≤ j ≤ n}

are mutually independent with mean 0 (but not necessarily iid). Suppose that the array is suitably normalized, so that

Var(X_{n,1} + ⋯ + X_{n,n}) = Σ_{j=1}^n E(X_{n,j}²)

converges to some σ² > 0 as n → ∞. Also (the key assumption), for all ε > 0, as n → ∞,

Σ_{j=1}^n E(X_{n,j}²; |X_{n,j}| > ε) → 0.

Then S_n = Σ_{j=1}^n X_{n,j} converges in distribution to N(0, σ²).

The iid CLT is a special case, with X_{n,j} = X_j/√n. Basically, we're assuming that our triangular array is already normalized! And the second condition says that no individual variable can contribute significantly to the variance.

Proof. We’ll show again that the characteristic function converges to the Gaussian. Let φn,j be the characteristic 2 2 function for Xn,j , and let σn,j = E(Xn,j ). Then by the same inequality, ! σ2 t2  t3|X |3  n,j 2 2 n,j φn,j (t) − 1 − ≤ E min t Xn,j , . 2 6

When Xn,j is small, the second term here is smaller, and when it is large, we want to use the condition in the problem. So we split by the value of |X| with cutoff at ε, and this gives us the bound

" 2 # tεE(Xn,j ) ≤ t2 + (X2 ; |X | ≥ ε) 6 E n,j n,j for any ε (where the first term comes from the |X|3 term and the second from the X2 term). Notice that for any

zi , wj ∈ C with modulus at most 1,

n n n Y Y X zi − wj ≤ |zj − wj |

i=1 j=1 j=1

by writing a telescoping sum. We can apply this to the equation above, because φn,j (t) has modulus at most 1, and

2 2 2 2 σn,j = E(Xn,j ) ≤ ε + E(Xn,j ; |Xn,j | ≥ ε)

σ2 t2 and taking ε 0, n makes this go to 0. Thus, for sufficiently large n, the term n,j is always at most 1, → → ∞ 2 meaning

n n σ2 t2 ! n " "tε (X2 ) ## Y Y n,j X 2 E n,j 2 φn,j (t) − 1 − ≤ t + E(Xn,j ; |X| ≥ ε) 2 6 j=1 j=1 j=1 2 Both terms tend to 0 as ε → 0 and n → ∞ (the first term because the expected sum of Xn,j goes to a finite value,

70 second term by the second assumption), so the whole thing goes to 0. And finally this means that

n n 2 2 ! Y Y σn,j t φ (t) → 1 − n,j 2 j=1 j=1

 σ2t2  converges as n → ∞, and indeed the expression on the right goes to exp − 2 as desired.

15 October 30, 2019

Office hours will be 4–6pm today, and we can come and see our midterms. (For logistical reasons, we're not going to get them back during class.) Grades are on Stellar already: the average is fairly low and the distribution is very interesting. The score is out of 40: when we see our score, we should divide it by 40 and multiply by 200 to think about how well we did. Basically, if we got a 10 out of 40, this is problematic (and we should talk to Professor Sun about whether we should stay in the class); otherwise, we can interpret the score as how well we're doing. Moving forward, the class won't change much: the pace and difficulty will stay about the same as they have been so far, but since the problem may be that we've been running out of time, we might arrange a time outside of class and have the next exam at night (but longer than class time). Last time, we talked about the topology of weak convergence (of probability measures on a metric space). Notably, we discussed Prohorov's theorem, which was used to prove the Lindeberg-Feller Central Limit Theorem, which tells us that given a triangular array of random variables X_{n,j} (for 1 ≤ j ≤ n) that are independent with mean zero, if the variance of the sum converges to a finite value,

Σ_{j=1}^n E(X_{n,j}²) → σ²,

and large values of the X_{n,j} don't contribute much to the variance,

Σ_{j=1}^n E(X_{n,j}²; |X_{n,j}| ≥ ε) → 0 for every ε > 0,

then the row sum Σ_j X_{n,j} converges in distribution to N(0, σ²).

Example 131 Consider the number of cycles in a random permutation. Let π be a uniformly random permutation on [n] = {1, ··· , n}: there are n! such possibilities, and we pick one uniformly at random.

As n grows large, what is the behavior here? We’ll write the permutations in (sorted) cycle notation: for example, we can write π = (136)(2975)(48) to indicate that 1 maps to 3 maps to 6 maps to 1, and so on, and to ensure uniqueness of representation, we make sure that the smallest number in each cycle appears at the beginning, and also the cycles are sorted from left to right by minimum element. So now we can sample π sequentially from left to right: start by writing down “(1”, and pick a random integer (uniformly) for 1 to go to, say 3. (If it’s 1, we close the parentheses.) Then pick another random integer that is not 3, and whenever we return to 1, we close this cycle and ignore those numbers from now on.

This is nice, because we can now define the variable

I_{n,k} = 1{a right parenthesis appears after the kth number in sorted cycle notation}.

For example, (I_{n,k})_k takes the values (0, 0, 1, 0, 0, 0, 1, 0, 1) for our example permutation π. We're trying to determine the behavior of S_n = Σ_{k=1}^n I_{n,k}, which is the number of right parentheses, or equivalently the number of cycles. If we look at how we're sampling sequentially, the I_{n,k} are actually independent Bernoulli variables with success probability 1/(n − (k − 1)): there are k − 1 numbers that the kth entry cannot go to, and out of the remaining n − (k − 1) choices, exactly one (the current first entry of the open cycle) closes the parenthesis.

So now let’s try to apply the Central Limit Theorem. The expectation of Sn is

n n X 1 X 1 S = = = log n + O(1). E n n − (k − 1) j k=1 j=1

We can also calculate the variance: because Ber(p) has variance p(1 − p),

n n n X X 1  1 X 1 Var S = Var(I ) = 1 − = S − = log n + O(1). n n,k j j E n j 2 k=1 j=1 j=1
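These two formulas can be checked by direct simulation (a sketch; the sample sizes are arbitrary choices of ours): counting cycles of uniformly random permutations should give sample mean close to Σ_{j≤n} 1/j and sample variance close to Σ 1/j − Σ 1/j²:

```python
import random

random.seed(1)

def cycle_count(perm):
    # count cycles of a permutation given as a list: perm[i] is the image of i
    seen, cycles = [False] * len(perm), 0
    for i in range(len(perm)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, trials = 1000, 2000
counts = []
for _ in range(trials):
    p = list(range(n))
    random.shuffle(p)
    counts.append(cycle_count(p))

harmonic = sum(1 / j for j in range(1, n + 1))        # ~ log n + 0.577
expected_var = harmonic - sum(1 / j**2 for j in range(1, n + 1))
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print(round(harmonic, 2), round(mean, 2), round(expected_var, 2), round(var, 2))
```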

So we can apply the CLT as follows: recenter and rescale, defining the variables

X_{n,k} = (I_{n,k} − E[I_{n,k}]) / √(log n).

Now the first condition of Lindeberg-Feller holds (the sum of the variances of the X_{n,k} converges to 1), and the second condition is pretty easy to check too: each term of

Σ_{k=1}^n E[X_{n,k}²; |X_{n,k}| ≥ ε]

is just equal to 0 for all n large enough, because the numerator |I_{n,k} − E I_{n,k}| is at most 1, so 1/√(log n) eventually drops below ε no matter what happens to our random variable. So the conditions that we want do hold, and thus

Σ_{k=1}^n X_{n,k} = (S_n − E S_n)/√(log n) →d N(0, 1).

Since E S_n is log n plus O(1), and O(1)/√(log n) → 0, this also means that

(S_n − log n)/√(log n) →d N(0, 1).

So that's an example of the CLT where we don't have identically distributed variables! Notably, though, not everything converges to a Gaussian:

Example 132
Say that I_{n,k} (1 ≤ k ≤ n) are iid, distributed according to Ber(λ/n) for some constant λ > 0.

Then S_n = Σ_{k=1}^n I_{n,k} ∼ Bin(n, λ/n) has expectation λ by linearity of expectation, so the chance that S_n is extremely large (like 100λ) is pretty small. The idea is that most of the mass is supported on a bounded range, but it's also supported only on the integers, so it doesn't really make sense to expect something like a Gaussian.

We can carry out the calculations to make this more precise:

P(S_n = k) = (n choose k) (λ/n)^k (1 − λ/n)^{n−k} = (λ^k/k!) · ((n)_k/n^k) · (1 − λ/n)^{n−k}

(where (n)_k is the falling factorial n!/(n − k)!), and as we fix k and take n → ∞, the middle factor goes to 1, and the −k in the last exponent becomes irrelevant:

P(S_n = k) → e^{−λ} λ^k/k!,

so S_n actually converges in law to the Poisson distribution Pois(λ). Again, this is supported on the integers, so it's definitely not Gaussian! However, if we take λ large enough, we do recover the Gaussian limit:

(Pois(λ) − λ)/√λ →d N(0, 1)

as λ → ∞. Another way of saying this is that if we consider

(Bin(n, p) − np)/√(np(1 − p)),

which is a random variable with mean 0 and variance 1, then (either through calculation or the CLT) if we take p fixed and n large, this goes to the standard normal; on the other hand, if we take p = λ/n with n large (so np stays constant), Bin(n, p) goes to a Poisson distribution. Let's prove the Gaussian-limit fact:

Proposition 133
As λ → ∞, the standardized Poisson distribution (Pois(λ) − λ)/√λ converges in distribution to N(0, 1).

Proof. We can show this using characteristic functions. If Y ∼ Pois(λ), then

φ(t) = E(e^{itY}) = Σ_{k≥0} e^{itk} e^{−λ} λ^k/k! = exp(−λ(1 − e^{it}))

is the characteristic function of Y. And now if we take

Z_λ = (Y_λ − λ)/√λ,

the characteristic function of Z_λ is

ψ(t) = E[exp(it(Y_λ − λ)/√λ)] = exp(−it√λ) · E(e^{itY_λ/√λ}),

and plugging this into what we already know, we end up with

ψ(t) = exp(−λ(1 − e^{it/√λ}) − it√λ).

As λ gets large, we can do a series expansion e^{it/√λ} = 1 + it/√λ − t²/(2λ) + O(λ^{−3/2}):

ψ(t) = exp(−λ(−it/√λ + t²/(2λ) + O(λ^{−3/2})) − it√λ) = exp(−t²/2 + O(λ^{−1/2})) → e^{−t²/2},

which is the characteristic function of the standard normal.
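A quick numerical check of this proposition (a sketch; λ = 400 and the evaluation point are arbitrary choices of ours): summing the exact Poisson pmf up to λ + √λ should come out close to Φ(1), up to a continuity correction of order λ^{−1/2}:

```python
import math

def pois_cdf(lam, m):
    # P(Y <= m) for Y ~ Pois(lam), computed with log-probabilities to avoid overflow
    return sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
               for k in range(m + 1))

lam = 400
exact = pois_cdf(lam, int(lam + math.sqrt(lam)))  # P((Y - lam)/sqrt(lam) <= 1)
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
print(round(exact, 3), round(Phi(1.0), 3))  # close, up to the continuity correction
```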

Professor Sun once taught a class similar to 18.600, so she feels a little silly going over this next topic, but apparently many of us haven’t taken an intro probability class before.

Example 134
We're going to review all of the standard probability distributions. Consider Bernoulli trials, which means that we have I_k distributed iid according to Ber(p) for all k ≥ 1.

Say that 1 indicates a success and 0 a failure: then the number of successes by time m is distributed according to the binomial distribution

B_m = Σ_{k=1}^m I_k ∼ Bin(m, p).

Meanwhile, the time of the first success, which is the first index k such that I_k = 1, is distributed according to the geometric distribution

Geo(p): P(G = k) = (1 − p)^{k−1} p.

As an extension, the time X_r of the rth success is distributed as

X_r =d G_1 + ⋯ + G_r (where G_i, G are all iid Geo(p)),

which is the negative binomial distribution NegBin(r, p). Specifically, for any t ≥ r,

P(X_r = t) = (t − 1 choose r − 1) (1 − p)^{t−r} p^r

(because we need exactly r − 1 successes in the first t − 1 trials, and then a success at time t). By definition, this is also the geometric distribution convolved with itself r times.

So now imagine that we scale p = 1/n to be small, so the vast majority of our experiments are failures. If we also scale time by n (so that we actually see successes at a reasonable rate instead of very rarely), then by time t we'll have done nt experiments, so

B_{nt} ∼ Bin(nt, 1/n) →d Pois(t)

converges to the Poisson distribution. The rescaled time of the first success then converges to a continuous random variable,

(1/n) Geo(1/n) →d Exp,

the exponential distribution with density 1{t ≥ 0} e^{−t} (we can prove this with characteristic functions or direct calculations), and the rescaled time of the rth success converges to

(G_1 + ⋯ + G_r)/n ∼ (1/n) NegBin(r, 1/n) →d E_1 + ⋯ + E_r,

where the E_i are iid exponential. This is the exponential law convolved with itself r times, which is the Gamma distribution Gamma(r). The gamma density is good to remember, and it can be derived by taking the limit of the negative binomial distribution:

P((1/n) NegBin(r, 1/n) ∈ [z, z + dz]) = (nz − 1 choose r − 1) (1 − 1/n)^{nz−r} (1/n)^r (n dz) ≈ ((nz)^{r−1}/(r − 1)!) · n^{1−r} · (1 − 1/n)^{nz−r} dz,

and now if we fix z and take n → ∞, this approaches

(z^{r−1} e^{−z}/(r − 1)!) dz

for all z ≥ 0. For general α > 0 (not necessarily an integer), the Gamma(α) distribution has the analogous density

1{x ≥ 0} x^{α−1} e^{−x}/Γ(α),

where Γ(α) is the normalizing constant! (In particular, Γ is a generalization of the factorial, with Γ(r) = (r − 1)! for positive integers r.) And when do we expect to see a Gaussian distribution? This happens if we take time to be very large relative to our sampling rate, which means we collect enough samples.
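The convergence Geo(1/n)/n →d Exp can be seen directly from the exact tail formulas (a sketch; P(Geo(p) > m) = (1 − p)^m, so P(Geo(1/n)/n > t) = (1 − 1/n)^{⌊nt⌋} → e^{−t}):

```python
import math

def geo_tail_scaled(n, t):
    # P(Geo(1/n)/n > t) = (1 - 1/n)^{floor(n t)}
    return (1 - 1 / n) ** math.floor(n * t)

for t in (0.5, 1.0, 2.0):
    approx = geo_tail_scaled(10_000, t)
    print(t, round(approx, 4), round(math.exp(-t), 4))  # tails nearly agree
```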

Remark 135. This is a bit outside the scope of this class, but remember that if π is a random permutation of [n] and S_n is the number of cycles of π, we showed that

(S_n − log n)/√(log n) →d N(0, 1).

If we let C_{n,k} be the number of cycles in π of length k, then the tuple (C_{n,1}, C_{n,2}, ⋯, C_{n,n}) converges in distribution to (Y_k)_{k≥1}, where the Y_k are independent with Y_k ∼ Pois(1/k). The proof is only a tiny bit outside the scope of what we've done so far: see Arratia and Tavaré's paper [3] for more details (the main idea is that a large number of trials of a rare event is approximately Poisson, and the events that two different elements lie in cycles of length k are close to independent). We can at least easily show that E C_{n,k} = 1/k: combinatorially, there are (n)_k/k possible cycles of length k, and each occurs with probability 1/(n)_k (explore the cycle one element at a time).

With the remaining time, let’s review some weak convergence ideas, with a reminder of what results and proofs we should know for this class in general. If we let S be a complete separable metric space with Borel sigma-algebra B, and µn =⇒ µ (converges weakly in distribution), let P be the space of all probability measures on S (this is a Polish space).

Theorem 136 (Portmanteau theorem)
The following are equivalent:

• µ_n ⟹ µ.

• ∫ f dµ_n → ∫ f dµ for all bounded uniformly continuous f.

• lim sup µ_n(F) ≤ µ(F) for all closed F ⊆ S.

• lim inf µ_n(G) ≥ µ(G) for all open G ⊆ S.

• lim µ_n(A) = µ(A) for all A with µ(∂A) = 0.

We also talked about Prohorov’s theorem, which tells us that tightness is equivalent to relative compactness.

Theorem 137 (Skorohod)

If µn =⇒ µ, then there exists a probability space (Ω, F, P) and measurable mappings Xn,X :Ω → S such that

LXn = µn, LX = µ, and Xn converges to X almost surely under P.

We should know the statements of all of these theorems, and we should know a bit more in the special case of the real line (S = R). There, the µ_n can be represented by their cdfs F_n, and µ by its cdf F. Say that F_n ⟹ F if F_n(x) → F(x) for all x where F is continuous: then µ_n converges weakly to µ if and only if the cdfs converge, F_n ⟹ F. (This is a consequence of the Portmanteau theorem!)

Tightness in R is easy to characterize: it’s equivalent to say that for any ε > 0, infn{Fn(x) − Fn(−x)} ≥ 1 − ε. So showing Prohorov’s theorem for R means that we just need to extract subsequences for our Fns which don’t have

75 mass escaping from the side: we can use the Helly selection theorem to finish, and we should know that proof (it’s a diagonalization argument)! Skorohod’s theorem is somewhat abstract in general, but it’s simple over R, and it’s also very useful. If F is a one-to-one function, then taking a random variable U uniformly distributed from 0 to 1, we claim that F −1(U) ∼ µ. This is true because −1 P(F (U) ≤ x) = P(U ≤ F (x)) = F (x) (first equality because F is monotone and one-to-one). Thus, F −1(U) is distributed according to µ, and we’re done! More generally, if it’s not one-to-one, let

$$X(u) = \inf\{x : F(x) \ge u\}.$$

Then X(u) ≤ x ⇐⇒ F(x) ≥ u because F is right-continuous, and thus P(X(U) ≤ x) = P(F(x) ≥ U) = F(x) again. So given any cdf, we can just define a rough inverse! Now let (Ω, F, P) be ([0, 1], B, Leb), and define X, Xn : Ω → R via

$$X(\omega) = \inf\{x : F(x) \ge \omega\}, \qquad X_n(\omega) = \inf\{x : F_n(x) \ge \omega\}.$$

Because Fn converges to F except at countably many points, Xn converges to X almost surely, and the laws of Xn, X are µn, µ, so we're done. The proof of Skorohod's theorem in general is related to this idea, but we don't have the simple mapping anymore, so we have to do something more complicated.
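As a quick numerical sanity check of this quantile-transform construction (a Python sketch, not part of the lecture; the exponential distribution and the tolerance are arbitrary choices), we can sample from a law by plugging uniforms into the rough inverse X(u) = inf{x : F(x) ≥ u}:

```python
import math
import random

def quantile(u):
    # Rough inverse X(u) = inf{x : F(x) >= u} for the Exp(1) cdf
    # F(x) = 1 - e^{-x}; F is continuous and strictly increasing here,
    # so this is just the ordinary inverse function.
    return -math.log(1.0 - u)

random.seed(0)
n = 200_000
samples = [quantile(random.random()) for _ in range(n)]

# The empirical cdf at x = 1 should be close to F(1) = 1 - 1/e.
emp = sum(s <= 1.0 for s in samples) / n
print(emp)
```

The same one-liner, run with Fn in place of F, is exactly the coupling used in the proof above.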

16 November 4, 2019

We had another attendance quiz today:

Problem 138
Let X, Xi be iid exponential random variables: this means that they have density 1{x > 0} e^{−x} dx. Find bn, Y such that
$$\max_{1 \le i \le n} X_i - b_n \overset{d}{\longrightarrow} Y.$$

Solution. First of all, we know the distribution of the maximum:
$$\mathbb{P}\Big(\max_{1 \le i \le n} X_i \le t\Big) = \mathbb{P}(X \le t)^n = (1 - e^{-t})^n,$$

so if we scale such that t = log n + x,
$$\mathbb{P}\Big(\max_{1 \le i \le n} X_i \le \log n + x\Big) = \left(1 - e^{-(\log n + x)}\right)^n = \left(1 - \frac{e^{-x}}{n}\right)^n \to \exp(-e^{-x}).$$

Thus, if we just take bn = log n, then max_{1≤i≤n} Xi − bn converges in distribution to the random variable Y with
$$\mathbb{P}(Y \le y) = e^{-e^{-y}},$$
which is called the Gumbel distribution. This is an example of an extreme value statistic (it's the limiting law of the extreme of a bunch of samples).
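We can watch this convergence numerically (a quick illustration, not from the lecture; the sample sizes are arbitrary):

```python
import math
import random

random.seed(1)
n, trials = 500, 5000
# Sample the max of n iid Exp(1) variables, centered by b_n = log n.
vals = [max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
        for _ in range(trials)]

# The Gumbel cdf is P(Y <= y) = exp(-e^{-y}); compare the fit at y = 0,
# where the cdf equals 1/e.
emp = sum(v <= 0.0 for v in vals) / trials
print(emp)
```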

(By the way, convergence in law is the same as weak convergence.)

Let's do another example:

Example 139
Let Z, Zi ∼ N(0, 1) be standard Gaussians with density g(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}: what does the maximum of Z1, · · · , Zn look like?

If we define
$$\Psi(x) = \mathbb{P}(Z \ge x) = \int_x^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz,$$
this has no closed form expression, but we can get a nice upper bound, especially when x is large:

$$\Psi(x) \le \frac{\mathbb{E}(Z;\, Z \ge x)}{x} = \frac{1}{\sqrt{2\pi}\,x} \int_x^\infty z \exp\left(-\frac{z^2}{2}\right) dz,$$
and now we can integrate this with a substitution:
$$= \frac{1}{\sqrt{2\pi}\,x} \exp\left(-\frac{x^2}{2}\right) = \frac{g(x)}{x}.$$
This is only an upper bound, and it's not very useful for small x. But it is pretty tight for large x, and we'll show this with an almost-matching lower bound: integrate

$$\sqrt{2\pi}\,\Psi(x) = \int_x^\infty \frac{1}{z}\cdot z \exp\left(-\frac{z^2}{2}\right) dz$$
by parts to get
$$= \left[-\frac{1}{z}\exp\left(-\frac{z^2}{2}\right)\right]_x^\infty - \int_x^\infty \frac{1}{z^2}\exp\left(-\frac{z^2}{2}\right) dz.$$

The first term evaluates to \frac{1}{x}\exp(-\frac{x^2}{2}), and the second can be integrated by parts again: since \frac{1}{z^2} = \frac{1}{z^3}\cdot z, this expression becomes
$$\frac{1}{x}\exp\left(-\frac{x^2}{2}\right) - \frac{1}{x^3}\exp\left(-\frac{x^2}{2}\right) + \int_x^\infty \frac{3}{z^4}\exp\left(-\frac{z^2}{2}\right) dz.$$
We want a lower bound, so note that we can ignore the last term because it's adding a nonnegative amount: this means
$$\Psi(x) \ge \frac{g(x)}{x}\left(1 - \frac{1}{x^2}\right).$$

So for large x, take some t ∈ R with |t| ≪ x: then
$$\frac{\Psi(x+t)}{\Psi(x)} \sim \frac{\frac{1}{\sqrt{2\pi}(x+t)}\exp\left(-\frac{(x+t)^2}{2}\right)}{\frac{1}{\sqrt{2\pi}\,x}\exp\left(-\frac{x^2}{2}\right)} \sim \exp\left(-xt - \frac{t^2}{2}\right),$$
and the main term here is proportional to e^{−tx}. So a change of variables tells us that
$$\frac{\Psi\left(x + \frac{t}{x}\right)}{\Psi(x)} \to \exp(-t).$$

With this, we can go back to our original question: we have Zi that are iid standard Gaussians, and we want to understand the distribution of Mn = max Zi. Again, we can write out the distribution:
$$\mathbb{P}(M_n \le x) = (1 - \Psi(x))^n.$$

Just like in the quiz problem, we want to pick x so that Ψ(x) is approximately 1/n: pick x = bn such that Ψ(bn) = 1/n. (This is a deterministic real number, because Ψ is monotone.) So then

$$\mathbb{P}\left(M_n \le b_n + \frac{t}{b_n}\right) = \left(1 - \Psi\left(b_n + \frac{t}{b_n}\right)\right)^n = \left(1 - \frac{e^{-t}}{n}(1 - o_n(1))\right)^n$$

by the calculation above, and then this converges to exp(−e^{−t}) (just like before!). So the only difference here is that our variables need to be rescaled:
$$b_n\left(\max_{1 \le i \le n} Z_i - b_n\right) \overset{d}{\longrightarrow} \text{Gumbel}.$$

Remark 140. Not all variables behave this way, though: it basically depends on the tail behavior of our distribution. But it is true that there is a large class of variables for which this happens: that's why the Gumbel is an extreme value statistic!

And we can actually say a bit more about the bn values we have here: we can estimate bn by just solving
$$\frac{g(b_n)}{b_n} = \frac{1}{n}, \qquad \text{i.e.} \qquad \frac{1}{\sqrt{2\pi}\,b_n}\exp\left(-\frac{b_n^2}{2}\right) = \frac{1}{n}.$$

So to leading order, it seems that we want b_n² = 2 log n + (lower order): to find those next lower order terms, we want to account for the bn prefactor as well, and we can do this by setting
$$\frac{b_n^2}{2} = \log n - \frac{1}{2}\log\log n + O(1) \implies b_n = \sqrt{2\log n}\left(1 - \frac{\log\log n + O(1)}{4\log n}\right).$$

(And the estimate of Ψ(x) as g(x)/x is good enough here, because the (1 − 1/x²) factor only contributes a 1 − O(1)/log n correction.)

We'll do some more examples of weak convergence on our homework: it's useful in general to be able to show that one thing converges to another, so there will definitely be something about that on our next exam! We'll spend a bit of time talking about weak convergence on R^d here: we should also read Section 3.10 in the textbook, but we'll go over some of the main points.
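Before moving on, here's a quick numerical check of the b_n asymptotics (a Python sketch, not from the lecture; it uses the identity Ψ(x) = erfc(x/√2)/2, and the comparison tolerance is ad hoc since the formula carries an O(1) inside):

```python
import math

def Psi(x):
    # Gaussian upper tail P(Z >= x), via the complementary error function.
    return 0.5 * math.erfc(x / math.sqrt(2))

def b(n):
    # Solve Psi(b_n) = 1/n by bisection; Psi is strictly decreasing.
    lo, hi = 0.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Psi(mid) > 1.0 / n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n = 10**6
L = math.log(n)
bn = b(n)
approx = math.sqrt(2 * L) * (1 - math.log(L) / (4 * L))
print(bn, approx)  # the gap reflects the O(1) term in the expansion
```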

Definition 141
Let µ, µn be probability measures on R^d: define the generalized cdf
$$F(x) = \mu\left(\prod_{i=1}^d (-\infty, x_i]\right)$$
for all x ∈ R^d.

We showed in homework that F is monotone and right-continuous, and also that the measure of any d-dimensional rectangle can be found by an inclusion-exclusion argument:
$$\mu\left(\prod_{i=1}^d (a_i, b_i]\right) = \sum_{v \in V} (-1)^{\#\{i\,:\,v_i = a_i\}}\, F(v),$$
where V is the set of 2^d corners of the rectangle (each v_i ∈ {a_i, b_i}).

Then µn =⇒ µ if and only if Fn converges weakly to F , meaning that

Fn(x) → F (x) ∀x where F is continuous.

There’s also a notion of a characteristic function here: define Z χµ(t) = exp (iht, xi) dµ(x)

78 d for t ∈ R (notice that this integrand is still bounded in the unit disk). Then there’s a Fourier inversion theorem here as well, but it’s easier to state by assuming the boundary has zero measure:

Theorem 142
If A = ∏_{i=1}^d [a_i, b_i] with µ(∂A) = 0, then
$$\mu(A) = \lim_{T\to\infty} \int_{[-T,T]^d} \prod_{j=1}^d \frac{e^{-it_j a_j} - e^{-it_j b_j}}{2\pi i t_j}\, \varphi_\mu(t)\, dt.$$

With this, the one-dimensional continuity theorem also generalizes directly:

Theorem 143 d Let µn be probability measures on R with characteristic functions φn. Then

• If µn =⇒ µ, then the φn converge pointwise to φµ.

• If φn converge pointwise to φ and φ is continuous at t = 0, then φ is the characteristic function of some µ

and µn =⇒ µ.

Theorem 144 (Cramér–Wold)
If Xn, X are R^d-valued random variables, then Xn converges to X in distribution if and only if the one-dimensional distributions converge:
$$\langle\theta, X_n\rangle \overset{d}{\longrightarrow} \langle\theta, X\rangle \qquad \forall\, \theta \in \mathbb{R}^d.$$

Proof. The forward direction is clear (by definition). For the other direction, note that for all θ,
$$\mathbb{E}[\exp(i\langle\theta, X_n\rangle)] \to \mathbb{E}[\exp(i\langle\theta, X\rangle)],$$
because ⟨θ, Xn⟩ is a real random variable converging in distribution to ⟨θ, X⟩, and we're taking expectations of the bounded continuous function e^{is}. The limit is a characteristic function and it's continuous in θ, so we're done by the continuity theorem.

Corollary 145 (CLT for iid sequences in R^d)
Let X, X^{(i)} be iid R^d-valued random variables, such that E(‖X‖²) is finite. Define
$$\mu = \mathbb{E}X \in \mathbb{R}^d, \qquad \Sigma = \operatorname{Cov}(X) = \mathbb{E}[(X-\mu)(X-\mu)^T],$$
where Cov(X) is the d × d covariance matrix given by Σ_{ij} = Cov(X_i, X_j) (covariances between the different coordinates of X). Then
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) \overset{d}{\longrightarrow} N(0 \in \mathbb{R}^d,\, \Sigma)$$
converges to the multivariate normal distribution.

Proof. Clearly, we’re going to show that the characteristic functions converge, so let’s start by figuring out the characteristic function for the multivariate Gaussian. Σ is a d × d positive semidefinite matrix, so it has a Cholesky T d k d decomposition AA , where A ∈ R × : now if Y ∼ N(0, Σ), it’s equivalent to saying that Y = AZ where Z is a standard Gaussian N(0,Ik×k ).

79 So the characteristic function for this distribution is

T φΣ(θ) = E[exp(ihθ, Y i)] = E[exp(ihθ, AZi)] = E[exp(ihA θ, Zi)].

But Z is Gaussian with the identity covariance, so Z = (Z1,Z2, ··· ,Zk ) with all entries standard Gaussian. Thus this is just  ||AT θ||2   1   hθ, Σθi = exp − = exp − (θT AAT θ) = exp − . 2 2 2 So to finish the proof, we just need to show the left side’s characteristic function converges to this: it suffices to show that for all θ ∈ d that R n !! 1 X exp ihθ, √ (X − µ)i → φ (θ). E n i Σ i=1 We know by one-dimensional CLT that

n 1 X d √ hθ, X − µi −→ N(0, Var(hθ, Xi)) = hθ, Σθi. n i i=1 Thus, the characteristic functions converge in the above expression: !! 1 X  hθ, Σθi exp i √ hθ, X − µi → exp − , E n i 2 i as desired.
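A quick simulation of this corollary (my own illustration, not from the notes; the choice X = (U, U + V) with U, V uniform on [−1, 1] is arbitrary) shows the scaled sums acquiring the covariance Σ:

```python
import random

random.seed(2)
n, trials = 400, 4000
samples = []
for _ in range(trials):
    s1 = s2 = 0.0
    for _ in range(n):
        u = random.uniform(-1, 1)
        v = random.uniform(-1, 1)
        # X = (U, U + V) has mean 0 and covariance [[1/3, 1/3], [1/3, 2/3]].
        s1 += u
        s2 += u + v
    samples.append((s1 / n**0.5, s2 / n**0.5))

# The empirical covariance of the scaled sums should approach Sigma.
m1 = sum(a for a, _ in samples) / trials
m2 = sum(b for _, b in samples) / trials
c11 = sum((a - m1) ** 2 for a, _ in samples) / trials
c12 = sum((a - m1) * (b - m2) for a, b in samples) / trials
c22 = sum((b - m2) ** 2 for _, b in samples) / trials
print(c11, c12, c22)  # roughly 1/3, 1/3, 2/3
```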

So this finishes what we’re covering from chapter 3 of Durrett, and now we’re going to move on to martingales and conditional expectation!

Definition 146 Say we’re working on (Ω, F, P), and say that A, B ∈ F with P(A) > 0. Then the conditional probability

$$\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}.$$

In particular, P(B|A) = P(B) if A, B are independent. If P(A) > 0 and X ∈ L¹ is integrable, we can define
$$\mathbb{E}(X \mid A) = \frac{\mathbb{E}(X;\, A)}{\mathbb{P}(A)}.$$

Then if X,Y are both random variables on (Ω, F, P) and P(Y = y) > 0, we can define

$$g(y) = \mathbb{E}(X \mid Y = y) = \frac{\mathbb{E}(X;\, Y = y)}{\mathbb{P}(Y = y)}.$$

In particular, if Y satisfies P(Y = y) > 0 for all y ∈ supp LY , then we can define

E[X|Y ] = g(Y ).

Note that in general, it doesn't make sense to condition on events of probability zero in the definition above. However, we will often want to do something that looks like that: for example, say that X and Y are jointly distributed via
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}\right)$$
for some nondegenerate covariance matrix with det Σ > 0. In general, it's not true that a Gaussian N(0, Σ = AA^T) will have a density on R^d, unless A is actually an invertible d × d matrix. But if we assume that det Σ > 0 in this case, L(X, Y) is going to be a measure on R² with some density g_Σ (it's a calculus argument to figure out how to work this out). But the point is that we would like to have
$$\mathbb{E}(X \mid Y = y) = \frac{\int x\, g_\Sigma(x, y)\, dx}{\int g_\Sigma(x, y)\, dx},$$
so the notion of conditioning from an introductory probability class isn't exactly enough: we'll talk about this more next time.

17 November 6, 2019

The second exam will be similar difficulty to the first one (and weighted equally). The time will be scheduled between 5 and 9pm on December 4th, depending on our availability, so that we’ll have more time. Today, we’ll talk some more about conditional expectation and martingales! Recall that if X is a random variable on (Ω, F, P) and E|X| < ∞, then we can define

$$\mathbb{E}(X \mid A) = \frac{\mathbb{E}(X;\, A)}{\mathbb{P}(A)}$$

for any event A with positive probability. In addition, if Y is another random variable with P(Y = y) positive, we can also condition:
$$\mathbb{E}(X \mid Y = y) = \frac{\mathbb{E}(X;\, Y = y)}{\mathbb{P}(Y = y)}.$$
However, this doesn't really work as generally as we want it to: we want to define E(X|Y) even when P(Y = y) = 0, particularly when Y is completely nonatomic. Since it doesn't make sense to condition on individual events of measure zero, the key idea is that we shouldn't think about individual events {Y = y}: instead, think about Y as a whole, and try to define E(X|Y) to be some other random variable. This should be σ(Y)-measurable, and it should satisfy the key property that
$$\mathbb{E}[\mathbb{E}(X|Y)\, h(Y)] = \mathbb{E}[X\, h(Y)],$$
as long as both sides are well-defined. Let's formalize this:

Definition 147 Say X is a random variable on (Ω, F, P), and say that E|X| < ∞. If G is a sub-σ-algebra of F, then Y is a version of E(X|G) if

• Y ∈ G, meaning Y is measurable with respect to G, • For all A ∈ G, E(X; A) = E(Y ; A).

It’s an exercise to us to reconcile this with the “conditional probability” that we learned in introductory probability! We need to show that this random variable Y exists and also that it’s unique in some way: also, the notation suggests that Y should be integrable, so we should check that as well. But eventually we can use this to define

E(X|Y ) = E(X|σ(Y )).

Lemma 148 (Integrability) If Y is a version of E(X|G), then E|Y | ≤ E|X| < ∞ (so Y is integrable).

Proof. Because Y is G-measurable, {Y > 0} is in G, meaning that
$$\mathbb{E}[Y_+] = \mathbb{E}(Y;\, Y > 0) = \mathbb{E}(X;\, Y > 0),$$
and similarly
$$\mathbb{E}[Y_-] = \mathbb{E}(-Y;\, Y < 0) = \mathbb{E}(-X;\, Y < 0),$$
and adding these says that
$$\mathbb{E}|Y| = \mathbb{E}[X;\, Y > 0] + \mathbb{E}[-X;\, Y < 0] \le \mathbb{E}|X|.$$

Lemma 149 (Uniqueness) If Y, Y′ are both versions of E(X|G), then Y = Y′ almost surely.

Proof. Both functions are G-measurable, so let
$$A_\varepsilon = \{Y - Y' \ge \varepsilon\}.$$
(This is G-measurable.) By the second condition in the definition, this means that
$$0 = \mathbb{E}(X;\, A_\varepsilon) - \mathbb{E}(X;\, A_\varepsilon) = \mathbb{E}[Y - Y';\, A_\varepsilon] \ge \varepsilon\, \mathbb{P}(A_\varepsilon),$$
so P(A_ε) = 0 for all ε > 0. Take ε → 0 to conclude that Y ≤ Y′ almost surely, and the proof is symmetric, so Y′ ≤ Y almost surely as well.

Definition 150 If µ, ν are both measures on (Ω, F), say that ν is absolutely continuous with respect to µ, denoted ν  µ, if for all A ∈ F with µ(A) = 0, we have ν(A) = 0.

Theorem 151 (Radon–Nikodym) Say that µ, ν are σ-finite measures on (Ω, F), such that ν ≪ µ. Then there is an F-measurable function f such that
$$\nu(A) = \int_A f\, d\mu$$
for all A ∈ F (the usual notation is f = dν/dµ).

In other words, ν has a density with respect to µ. We won't prove this right now, but we might come back to it a bit later. Notice that we've done something similar already: in the proof of Cramér's theorem, we did the exponential tilt
$$\frac{d\mathbb{P}_\theta}{d\mathbb{P}} = \frac{\exp(\theta X)}{m(\theta)}.$$
We won't prove the theorem for now, but it will allow us to show existence of the conditional expectation: say X is a random variable on (Ω, F, P), and say that E|X| < ∞. First, define

µ = P|G,

82 which is a probability measure on (Ω, G). If we define

ν(A) = E(X+; A) ∀A ∈ G,

this is a finite measure with ν(Ω) = E(X+) < ∞ by assumption. Then ν is absolutely continuous with respect to µ, because if A is an event of probability zero, then E(X+; A) = 0. So by the Radon–Nikodym theorem, there exists a function Y = dν/dµ ∈ G such that
$$\nu(A) = \int_A Y\, d\mu \qquad \forall A \in \mathcal{G}.$$

But ν(A) = E(X+; A), and the right-hand side is E(Y; A): this is exactly the defining property of the conditional expectation! Thus Y is a version of E(X+|G), and we can subtract a version of E(X−|G) (constructed the same way) to get the result. We should also read the textbook for some basic properties of the conditional expectation: it's linear and monotone, it satisfies a version of Jensen's inequality, and so on. But there are a few properties which are not analogous to the usual expectation:

Proposition 152
If G_small is a sub-σ-algebra of G ⊆ F, and E(X|G) ∈ G_small (that is, E(X|G) is G_small-measurable), then E(X|G) = E(X|G_small).

Proof. Let Y = E(X|G): by assumption, Y is measurable with respect to G_small, and it suffices to show that for all A ∈ G_small, E(X; A) = E(Y; A). But this is true by definition, because this holds for all A ∈ G, so it must hold in particular for all A ∈ G_small.

Proposition 153 (Tower property)
Say that G_small ⊆ G ⊆ F, as before. Then
$$\mathbb{E}(\mathbb{E}(X|\mathcal{G}_{\mathrm{small}})\,|\,\mathcal{G}) = \mathbb{E}(\mathbb{E}(X|\mathcal{G})\,|\,\mathcal{G}_{\mathrm{small}}) = \mathbb{E}(X|\mathcal{G}_{\mathrm{small}}).$$

(The proof here is similar.)

Proposition 154
If E|Y| is finite, E|XY| is finite, and X ∈ G, then
$$\mathbb{E}(XY\,|\,\mathcal{G}) = X\,\mathbb{E}(Y\,|\,\mathcal{G})$$
(in other words, X behaves like a constant with respect to G, because we know everything about it already).

Proof. Let’s start by considering a simple case: we’ll show that the equality holds if X = 1B for B ∈ G. We want to show that the right hand side is a version of E(XY |G): it’s indeed measurable because it’s the product of measurable functions, and we now just need to check that

E(XY ; A) = E(XE(Y |G); A) for all A ∈ G. Indeed, this is true because

E(XY ; A) = E(Y ; A ∩ B) = E(E(Y |G); A ∩ B)

83 (because A ∩ B ∈ G is measurable), and rearranging makes this indeed equal to E(XE(Y |G); A). By linearity, this means this extends to simple and then measurable functions in general.

One last note here is that we can sometimes interpret conditional expectation as an orthogonal projection: if G ⊆ F, then the space of random variables L2(G) is a subspace of L2(F).

Proposition 155
If X ∈ L²(F), then E(X|G) is the orthogonal projection of X onto L²(G).

Proof. First of all, E(E(X|G)²) ≤ E(X²) < ∞ (by the conditional version of Jensen's inequality), so E(X|G) is indeed in L²(G). To show orthogonality, we need to show that the difference between X and E(X|G) is orthogonal to any element of L²(G): that is, if Z ∈ L²(G), then
$$\mathbb{E}[Z(X - \mathbb{E}(X|\mathcal{G}))] = 0.$$

But the first term is E[XZ], and the second term is E[Z E(X|G)]. Since Z is measurable with respect to G, we can put it inside the conditional expectation to make this whole expression
$$\mathbb{E}[XZ] - \mathbb{E}[\mathbb{E}(XZ|\mathcal{G})],$$
and the outer expectation is conditioning on the trivial σ-algebra, so by the tower property, this reduces to E[XZ] − E[XZ] = 0, as desired.
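On a finite sample space this projection picture is completely concrete (a toy illustration of my own; the numbers are made up): E(X|G) averages X over the blocks of the partition generating G, and the residual X − E(X|G) is orthogonal to every G-measurable Z.

```python
# Finite sample space with uniform probability; G generated by a partition.
omega = [0, 1, 2, 3, 4, 5]
blocks = [[0, 1, 2], [3, 4, 5]]
X = {0: 1.0, 1: 4.0, 2: 7.0, 3: 2.0, 4: 2.0, 5: 8.0}

# E(X|G) is constant on each block, equal to the block average.
condX = {}
for B in blocks:
    avg = sum(X[w] for w in B) / len(B)
    for w in B:
        condX[w] = avg

# Orthogonality: for any G-measurable Z, E[Z (X - E(X|G))] = 0.
Z = {w: (10.0 if w in blocks[0] else -3.0) for w in omega}
inner = sum(Z[w] * (X[w] - condX[w]) for w in omega) / len(omega)
print(condX[0], condX[3], inner)  # 4.0 4.0 0.0
```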

There’s one special topic in our textbook about regular probability distributions, which we’ll skip for now. We’ll now move on to martingales, which have many interesting applications. Throughout this discussion, we’re on a probability space (Ω, F, P).

Definition 156

A filtration is a sequence (Fn)n≥0 of σ-fields such that F0 ⊆ F1 ⊆ · · · ⊆ F (we gradually reveal more information).

A sequence of random variables (Xn)n≥0 on (Ω, F, P) is adapted to the filtration (Fn)n≥0 if Xn ∈ Fn for all n.

The natural filtration is Fn = σ(X1, · · · , Xn).

(Note that the above definition doesn’t involve any probability.)

Definition 157

Given a filtration F0 ⊆ F1 ⊆ · · · on (Ω, F, P), say that (Xn)n≥0 is a martingale with respect to Fn if Xn is adapted to Fn, E|Xn| < ∞ for all n (doesn’t have to be uniformly bounded), and

E(Xn+1|Fn) = Xn.

In other words, given everything we know up to the nth step, the next step’s expected value is equal to the current

value. If we have only an inequality E(Xn+1|Fn) ≤ Xn, this is a supermartingale, and if we have E(Xn+1|Fn) ≥ Xn, this is a submartingale.

Example 158
Take Xn to be a simple random walk on Z: start with X0 = 0, and let Xn = Σ_{i=1}^n Yi, where the Yi are iid, taking values in {1, −1} with probability 1/2 each. Then this is indeed a martingale, because each step has probability 1/2 of adding 1 to Xn and probability 1/2 of subtracting 1.

So the first observation we make:

E(Xn) = E [E(Xn|Fn−1)] = EXn−1

first by the tower property, and then by the martingale property. So this really tells us (iteratively) that

E(Xn) = E(Xn−1) = ··· = E(X0).

But we will show that
$$\mathbb{E}X_0 = \mathbb{E}X_\tau$$
for suitable random times τ as well, which will have many applications!

Definition 159 Let τ be a nonnegative integer-valued random variable. Then τ is a stopping time with respect to a filtration

(Fn)n≥0 if {τ = n} ∈ Fn for all n.

The idea is that a martingale can model our money in a stock market, and we can only choose to stop our random

process τ = n based on the information from what we already know (Fn).

Example 160

Again, let Xn be a simple random walk on Z, and let X0 = 0. Define

τ = inf{n : Xn = 1}.

Then τ is a valid stopping time: it's a nonnegative integer-valued random variable, it's finite almost surely, and at any time we can decide whether τ = n (this is measurable with respect to what we already know). But it's obvious that 0 = EX0 ≠ EXτ = 1.

On the other hand, is there any reason to expect that EX0 = EXτ ? The key observation is that if Xn is a martingale and τ is a stopping time, then the stopped process

$$Y_n = X_{n\wedge\tau}, \qquad n \wedge \tau = \min(n, \tau)$$

is also a martingale, because Yn − Xn is only nonzero if τ < n:
$$Y_n = \sum_{i=0}^{n-1} 1\{\tau = i\}\, X_i + 1\{\tau \ge n\}\, X_n.$$

Once we check the integrability property (exercise), notice that the first sum is measurable with respect to F_{n−1}, and 1{τ ≥ n} = 1 − Σ_{i=0}^{n−1} 1{τ = i}, so this is also in F_{n−1}. So if we condition on F_{n−1}, we can take out the known terms:

$$\mathbb{E}(Y_n\,|\,\mathcal{F}_{n-1}) = \sum_{i=0}^{n-1} 1\{\tau = i\}\, X_i + 1\{\tau \ge n\}\, \mathbb{E}(X_n\,|\,\mathcal{F}_{n-1}),$$

and by the martingale property of X, this gives us back Y_{n−1} because E(Xn|F_{n−1}) = X_{n−1}. So the stopped process is also a martingale! This means that the expectation of Yn is constant (by our earlier computation), so
$$\mathbb{E}Y_n = \mathbb{E}Y_0 \implies \mathbb{E}X_0 = \mathbb{E}(X_{n\wedge\tau}).$$

As n → ∞, n ∧ τ → τ almost surely: that is, there exists n(ω) such that τ(ω) ∧ n = τ(ω) for all n ≥ n(ω). Thus X_{n∧τ} converges to Xτ almost surely, so it's reasonable to expect that with enough integrability conditions, we'll indeed get convergence of the expectations, giving us EX0 = EXτ. So much of this part of the class will be about finding conditions under which
$$\mathbb{E}(X_{n\wedge\tau}) \to \mathbb{E}X_\tau.$$
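The simple random walk example above can be simulated directly (a sketch, not from the lecture; the horizon and sample sizes are arbitrary): the stopped process has mean ≈ 0 at any fixed time, even though the walk has already hit 1 in most runs.

```python
import random

random.seed(3)
trials, horizon = 20_000, 200
total_stopped = 0
hits = 0
for _ in range(trials):
    x = 0
    for _ in range(horizon):
        if x == 1:          # tau has occurred: the process is frozen
            break
        x += random.choice((1, -1))
    total_stopped += x      # this is X_{horizon ∧ tau}
    hits += (x == 1)

mean_stopped = total_stopped / trials   # should be near E X_0 = 0
frac_hit = hits / trials                # most paths have already hit 1
print(mean_stopped, frac_hit)
```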

Let’s describe one nice application of this:

Example 161
Let G = (V, E) be a finite connected graph with some "boundary" Z (a proper subset of V). Say that we have some boundary condition f : Z → R. We run a simple random walk on the graph G: at a vertex, we go to a uniformly random neighbor at the next step, and we let τ be the first time we hit the boundary. Define
$$M_n = \mathbb{E}[f(X_\tau)\,|\,\mathcal{F}_n],$$
where Fn = σ(X0, · · · , Xn) is the filtration of the walk. Then Mn is a martingale (exercise), and Mn is a function of the current position: it's h(Xn), where h is the harmonic interpolation of f. (In other words, the value of h at each non-boundary vertex is the average of the neighboring values. Note that if we replace harmonic with subharmonic, we get a submartingale.)
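On a path graph this is the classical gambler's-ruin computation (an illustrative sketch of my own; the graph and parameters are arbitrary): with boundary {0, N} and f(0) = 0, f(N) = 1, the harmonic interpolation is h(x) = x/N, and simulating the walk recovers it.

```python
import random

random.seed(4)
N = 10  # path graph on {0, 1, ..., N} with boundary {0, N}

def h_est(start, trials=20_000):
    # Estimate h(start) = E[f(X_tau) | X_0 = start] by running the walk
    # until it hits the boundary, with f(0) = 0 and f(N) = 1.
    total = 0
    for _ in range(trials):
        x = start
        while 0 < x < N:
            x += random.choice((1, -1))
        total += (x == N)
    return total / trials

# The harmonic interpolation on the path is h(x) = x / N.
est = h_est(3)
print(est)  # close to 3/10
```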

18 November 13, 2019

Last time, we started defining a martingale: given a filtration, which is a nested sequence of σ-fields F0 ⊆ F1 ⊆

· · · ⊆ F, a martingale is a sequence (Xn)n≥0 where all Xi are integrable, Xn ∈ Fn (the sequence is adapted to the filtration), and we satisfy the martingale property

E(Xn+1|Fn) = Xn

If we replace these conditions with an inequality ≥ or ≤, we get a submartingale or supermartingale, respectively.

Remark 162. If Xn is a submartingale, then EX0 ≤ EX1 ≤ · · · , and we have the reverse inequality for supermartingales. We’ll prove some convergence theorems today, but we’ll start with some more preliminary facts.

Lemma 163 1 Let (Xn) be a martingale, and let φ be a convex function with φ(Xn) ∈ L . Then φ(Xn) is a submartingale. (Replacing convex with concave yields a supermartingale.)

Proof. Convex functions are measurable, so φ(Xn) will be measurable with respect to Fn. It suffices to check the submartingale condition: indeed,

E(φ(Xn+1)|Fn) ≥ φ(E(Xn+1|Fn)) = φ(Xn) by the conditional version of Jensen’s inequality, and this is exactly what we want.

Lemma 164
If Xn is a submartingale, and φ is nondecreasing and convex with φ(Xn) ∈ L¹, then φ(Xn) is a submartingale. (Again, replacing "submartingale" and "convex" with "supermartingale" and "concave" also works.)

Proof. This is almost identical to the previous proof: we’re checking the submartingale condition again, and

E(φ(Xn+1)|Fn) ≥ φ(E(Xn+1|Fn)) ≥ φ(Xn) because of the submartingale condition and the fact that φ is nondecreasing.

Definition 165

A process (Hn)_{n≥0} is predictable or previsible with respect to a filtration (Fn) if Hn ∈ F_{n−1} for all n. Then
$$(H \cdot X)_n = \sum_{k=1}^n H_k (X_k - X_{k-1})$$
is called the H-transform of X.

The way to interpret this definition is as follows: we play a randomized game at each time k, where we bet 1 dollar and make net winnings of Xk − X_{k−1} = ΔXk (so we get back 1 + ΔXk dollars). Then say we bet exactly 1 dollar each time: the net winnings at time n are just
$$\sum_{k=1}^n \Delta X_k = X_n - X_0$$

(with a telescoping sum). But we can choose to bet different amounts each time, so if we bet Hk dollars at time k,

we do indeed make net winnings of (H · X)n at time n. And we need Hn ∈ Fn−1, because the amount we bet at time

n should not depend on what we win at time n (this motivates the name “previsible”). Note that if E(∆Xk |Fk−1) ≤ 0 for all k (because casinos tend to make us lose in expectation), we have a supermartingale.

Remark 166. Recall the definition of a stopping time τ, which means that {τ = n} ∈ Fn for all n. Then if Hn = 1{τ ≥ n} = 1 − Σ_{ℓ=0}^{n−1} 1{τ = ℓ}, the process H is previsible, and
$$(H \cdot X)_n = X_{n\wedge\tau} - X_0$$
(because we're intuitively saying that we only bet at time n if τ hasn't happened yet).

Lemma 167
Let Xn be a submartingale, and say that Hn is nonnegative and previsible: suppose also that Hn ∈ L^∞ for all n, which means that for each n there is a constant bounding Hn almost surely. Then H · X is a submartingale as well (an analogous statement holds for supermartingales, too).

Proof. Let Yn = (H · X)n = Σ_{k=1}^n Hk(Xk − X_{k−1}); we want to show that Yn is a submartingale. Yn is integrable, because it's the sum of finitely many terms, where each Hk is bounded and each Xk is integrable (by X being a submartingale). In addition, Yn ∈ Fn because all of the individual terms are in Fn: it suffices to show the submartingale condition, which can be stated as

$$\mathbb{E}(Y_{n+1}\,|\,\mathcal{F}_n) \ge Y_n \iff \mathbb{E}(Y_{n+1} - Y_n\,|\,\mathcal{F}_n) \ge 0$$

because E(Yn|Fn) = Yn when Yn is Fn-measurable. And we know this difference:
$$\mathbb{E}(Y_{n+1} - Y_n\,|\,\mathcal{F}_n) = \mathbb{E}(H_{n+1}(X_{n+1} - X_n)\,|\,\mathcal{F}_n) = H_{n+1}\,\mathbb{E}(X_{n+1} - X_n\,|\,\mathcal{F}_n)$$
(because H_{n+1} is measurable with respect to Fn), and then this last expression is nonnegative because H_{n+1} ≥ 0 and Xn is a submartingale.

Corollary 168

If X is a submartingale and τ is a stopping time, then Yn = Xn∧τ is also a submartingale (because the stopped process can be expressed as an H-transform).

One goal of today's class is to prove the following result, which we should keep in mind: given any nonnegative supermartingale Xn, Xn converges almost surely to some integrable limit X∞ with EX∞ ≤ EX0. Because EX0 ≥ EX1 ≥ · · · ≥ 0, this fact shouldn't be too surprising: the expectations are decreasing and bounded below by 0. The main technical component of the proof is a bound on how often a submartingale can go up, which is equivalent to a bound on how often a supermartingale can go down.

Definition 169

For any −∞ < a < b < ∞ and a submartingale (Xn)n≥0, define τ0 = −1 and

$$\tau_{2k-1} = \inf\{n > \tau_{2k-2} : X_n \le a\}, \qquad \tau_{2k} = \inf\{n > \tau_{2k-1} : X_n \ge b\}$$

for all positive integers k.

In other words, we’re finding the times where Xn goes below a, then above b, then below a, and so on. It’s possible, for example, that τ1 = ∞ (if X never goes below a).

We keep track of the ranges (τ2k−1, τ2k ], which are known as upcrossings (because we’re going from a to b).

Define Un(a, b) to be the number of upcrossings of [a, b] completed by time n, which can also be written as
$$U_n(a, b) = \sup\{k : \tau_{2k} \le n\}.$$

Theorem 170 (Doob’s upcrossing inequality)

Let Xn be a submartingale. Then
$$\mathbb{E}\,U_n(a, b) \le \frac{\mathbb{E}[(X_n - a)_+ - (X_0 - a)_+]}{b - a} \le \frac{\mathbb{E}[(X_n - a)_+]}{b - a}.$$

Proof. Let Hn be the indicator that n ∈ (τ_{2k−1}, τ_{2k}] for some integer k (we are inside an upcrossing). It's equivalent to write this as
$$H_n = \sum_{k \ge 1} 1\{n \in (\tau_{2k-1}, \tau_{2k}]\} = \sum_{k \ge 1} \big(1\{n \le \tau_{2k}\} - 1\{n \le \tau_{2k-1}\}\big),$$
because the intervals are disjoint. All the τs are stopping times, so all indicators here are measurable with respect to F_{n−1}: if we're worried about an infinite sum, note that we only need to sum up to k = n anyway, because we can have at most n upcrossings by time n. Intuitively, we know whether we're in an upcrossing based on what we've seen up to the (n − 1)th step, and this means that Hn is previsible.

Since Hn is bounded by 1 almost surely and is nonnegative, we can apply Lemma 167 to see that H · X and K · X are both submartingales, where we define K = 1 − H (which is also bounded between 0 and 1). Now we have a minor problem, which is that we don't quite count the last (incomplete) upcrossing, during which the process might be arbitrarily negative (which is bad for our bound). So we replace Xn with
$$Y_n = \max(a, X_n) = a + (X_n - a)_+,$$
which is a nondecreasing convex function of Xn, so it is also a submartingale by Lemma 164. By identical logic, H · Y and K · Y are also submartingales.

Now consider (H · Y)n: this is at least (b − a)Un(a, b), because the Y-process moves anything below a up to a, and then H · Y means we only bet during the upcrossings. So we win the increment that we make across each completed upcrossing, which is at least b − a, a total of Un(a, b) times, and we might win some extra amount at the end! (This is why we replaced X with Y.) And now taking expectations of both sides,
$$\mathbb{E}\,U_n(a, b) \le \frac{\mathbb{E}(H \cdot Y)_n}{b - a},$$
and we just want to upper bound this last quantity. Notice that if we bet 1 each time during our process,

$$Y_n - Y_0 = (1 \cdot Y)_n = (H \cdot Y)_n + (K \cdot Y)_n,$$

and rearranging tells us that

$$\mathbb{E}(H \cdot Y)_n = \mathbb{E}[Y_n - Y_0 - (K \cdot Y)_n].$$

But Yn − Y0 = (Xn − a)+ − (X0 − a)+, and (K · Y)n is a submartingale started from 0, meaning E(K · Y)n ≥ 0. Plugging this into the expression for E(H · Y)n gives the result that we want!

Intuitively, the key inequality here is E(K · Y)n ≥ 0: if the process goes up many times, it's hard for (K · Y), the "outside of the upcrossings" part, to go down by much.
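We can test the upcrossing inequality on a simple random walk, which is in particular a submartingale (an illustrative sketch of my own; the interval [a, b] = [−1, 1] and run lengths are arbitrary):

```python
import random

random.seed(5)
n, trials = 200, 5000
a, b = -1.0, 1.0
up_total = 0.0
bound_total = 0.0
for _ in range(trials):
    x, path = 0, [0]
    for _ in range(n):
        x += random.choice((1, -1))
        path.append(x)
    # Count completed upcrossings of [a, b] along the path.
    ups, below = 0, False
    for v in path:
        if not below and v <= a:
            below = True
        elif below and v >= b:
            ups += 1
            below = False
    up_total += ups
    bound_total += max(path[-1] - a, 0.0)   # (X_n - a)_+

mean_ups = up_total / trials
bound = bound_total / trials / (b - a)      # E[(X_n - a)_+] / (b - a)
print(mean_ups, bound)  # Doob: mean_ups should sit below bound
```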

Theorem 171 (Submartingale convergence theorem)
If Xn is a submartingale with sup_n E[(Xn)+] < ∞, then Xn converges almost surely to some X∞ ∈ L¹.

Proof. Note that (x − a)+ ≤ x+ + |a| for all a, x ∈ R, so
$$\mathbb{E}[U_n(a, b)] \le \frac{\mathbb{E}[(X_n)_+ + |a|]}{b - a}.$$
By hypothesis, this is uniformly bounded in n, so we expect a finite number of upcrossings. Indeed, Un(a, b) is monotone in n and converges to some limit U∞(a, b) as n → ∞: since that limit has finite expectation (by monotone convergence), it's finite almost surely. So now

$$\mathbb{P}\big(U_\infty(a, b) < \infty \ \forall \text{ rational } -\infty < a < b < \infty\big) = 1,$$
because the event fails with probability 0 for each of the countably many pairs (a, b). So we can't cross any rational interval infinitely many times during our process, and thus X∞ = lim_{n→∞} Xn exists (possibly infinite) with probability 1. It remains to show that we have a finite limit and that it is integrable. By Fatou's lemma, because the positive part of Xn converges to the positive part of X∞,

$$\mathbb{E}(X_\infty)_+ \le \liminf_{n\to\infty} \mathbb{E}(X_n)_+ < \infty$$
by hypothesis, and thus the positive part has finite expectation and is finite almost surely. Fatou also tells us that

E(X∞)− is similarly bounded by lim inf_{n→∞} E(Xn)−, but we don't have an analogous statement in our hypothesis. Instead, we write
$$\liminf_{n\to\infty} \mathbb{E}(X_n)_- = \liminf_{n\to\infty} \mathbb{E}[(X_n)_+ - X_n],$$
and we're okay because E(Xn)+ is uniformly bounded and the expectation of −Xn goes down as n → ∞ (because Xn is a submartingale). Thus, this is
$$\le \liminf_{n\to\infty} \mathbb{E}(X_n)_+ - \mathbb{E}X_0 < \infty.$$
So X∞ is in L¹ because both positive and negative parts have finite expectation, and thus X∞ is indeed finite almost surely.

Theorem 172

If Xn is a nonnegative supermartingale, then it converges almost surely to some X∞ with EX∞ ≤ EX0.

Proof. Define Yn = −Xn: this is a submartingale with (Yn)+ = 0 almost surely for all n, so the submartingale convergence theorem can be applied to show that Yn → Y∞ almost surely, meaning Xn converges almost surely to X∞ = −Y∞. Then Fatou's lemma tells us that
$$\mathbb{E}X_\infty \le \liminf_{n\to\infty} \mathbb{E}X_n \le \mathbb{E}X_0.$$

One point of caution is that there’s no guarantee in general that EXn → EX∞:

Example 173

Let Xn be a simple random walk on Z, with X0 = 1, and let τ = inf{n ≥ 0 : Xn = 0}. Then if we let Yn = X_{n∧τ}, we have a nonnegative martingale (which is in particular a supermartingale), so it converges almost surely: specifically, it converges to Y∞ = Xτ = 0, even though EYn = EX0 = 1 for every n.

Here’s an application of the convergence theorem:

Example 174 (Branching process)

Say we have an array Xn,i of iid nonnegative integer random variables, all equal in distribution to X. Then the law of X is some probability measure p, supported on the nonnegative integers, called the offspring law. Define

a random Galton-Watson tree, where we start with a root vertex which gets a random number of children X1,1.

Then each of those children gets a random number of children: the first child gets X2,1 children, and so on. (0 children is allowed.)

This tree may be finite or infinite. Let the Galton-Watson process be a summary of this tree, where

Zn = number of vertices at level n.

So now if we define

Fn = σ (X`,i : i ≥ 1, 1 ≤ ` ≤ n) ,

which encodes the tree up to depth n, notice that Zn ∈ Fn (so the process is adapted to Fn). In fact, Fn encodes

more than just σ(Z0,Z1, ··· ,Zn), because Zi only tell us the total population in each generation.

Basically, we can express the process Zn via
$$Z_0 = 1, \qquad Z_n = \sum_{i=1}^{Z_{n-1}} X_{n,i}.$$

In particular, if Z_{n−1} = 0, then Zn = 0 (all future populations are dead). The key observation here is that
$$\mathbb{E}(Z_{n+1}\,|\,\mathcal{F}_n) = Z_n\,\mathbb{E}X,$$
because each of the Zn individuals in generation n generates µ = EX children on average. This means that Zn/µ^n is a nonnegative martingale! Next time, we'll use this to study what happens to the random tree as n gets large.
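A quick simulation (an illustrative sketch of my own; the offspring law P(X=0) = 1/4, P(X=1) = 1/4, P(X=2) = 1/2, so µ = 5/4, is an arbitrary choice) confirms that E[Zn/µ^n] stays at 1, while a positive fraction of trees die out:

```python
import random

random.seed(6)

def offspring():
    # Offspring law: P(X=0) = 1/4, P(X=1) = 1/4, P(X=2) = 1/2, so mu = 5/4.
    u = random.random()
    return 0 if u < 0.25 else (1 if u < 0.5 else 2)

mu, n_gen, trials = 1.25, 12, 10_000
total_W = 0.0
extinct = 0
for _ in range(trials):
    z = 1
    for _ in range(n_gen):
        z = sum(offspring() for _ in range(z))
    total_W += z / mu**n_gen   # the martingale Z_n / mu^n
    extinct += (z == 0)

mean_W = total_W / trials
frac_extinct = extinct / trials
print(mean_W, frac_extinct)
```

For this offspring law the extinction probability solves s = 1/4 + s/4 + s²/2, giving s = 1/2, which matches the extinct fraction.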

19 November 18, 2019

This is a reminder that the next homework assignment is due on Wednesday (there are two extra problems from the original assignment).

Last time, we proved the submartingale convergence theorem: if E(Xn)+ is uniformly bounded, i.e. sup_n E[(Xn)+] < ∞, then Xn converges almost surely to some finite random variable X∞. A special case of this is that if Xn is a nonnegative supermartingale, then Xn converges almost surely to X∞.

However, neither statement implies that the expected values converge, E(Xn) → E(X∞), and it is useful to know when that is true. For example, if Yn = M_{n∧τ} is the stopped process for a martingale Mn, we want to find conditions such that EM_{n∧τ} → EMτ (because this would prove that EMτ = EM0). So the main result we'll be proving today is the L^p martingale convergence theorem: if Xn is a martingale such that sup_n ‖Xn‖_p < ∞ for some 1 < p < ∞, then Xn → X∞ almost surely and in L^p; this implies Xn → X∞ in L¹ as well, and therefore that EXn → EX∞ (which is what we actually care about). Notably, this is not true for p = 1!

Lemma 175 (Doob’s maximal inequality)

If $X_n$ is a submartingale, define $\bar X_n = \max\{(X_i)_+ : 0 \le i \le n\}$. (This is a nondecreasing process.) Then for any $\lambda > 0$,
\[
\lambda P\Big(\max_{0\le i\le n} X_i \ge \lambda\Big) = \lambda P(\bar X_n \ge \lambda) \le E(X_n;\ \bar X_n \ge \lambda) \le E[(X_n)_+;\ \bar X_n \ge \lambda] \le E[(X_n)_+].
\]

All of the inequalities here except the second follow pretty directly from the definitions. Note that if we replaced $\bar X_n$ with $X_n$ in the third expression, the bound would just be Markov's inequality!

Proof. First of all, if $X_n$ is a submartingale and $\tau$ is a stopping time with $0 \le \tau \le k$ almost surely, then $EX_0 \le EX_\tau \le EX_k$ (an exercise, using that $X_n$ is a submartingale). So if $\sigma = \inf\{i \ge 0 : X_i \ge \lambda\}$ is the first time we reach level $\lambda$, and we let $\tau = \sigma \wedge n$ (so that the stopping time is bounded), then $EX_\tau \le EX_n$.

Now notice that $X_\tau \ne X_n$ is only possible when $\sigma \le n$, which is exactly the event $\{\bar X_n \ge \lambda\}$ (we hit level $\lambda$ by time $n$). So the inequality $EX_\tau \le EX_n$ can be restricted to that event:
\[
E(X_\tau;\ \bar X_n \ge \lambda) \le E(X_n;\ \bar X_n \ge \lambda).
\]
But on $\{\bar X_n \ge \lambda\}$ we have $\tau = \sigma$ and $X_\tau = X_\sigma \ge \lambda$, so the left-hand side is at least $\lambda P(\bar X_n \ge \lambda)$, and we're done.
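As a numerical sanity check (my own, not from the lecture), we can compare the two ends of the chain of inequalities for a simple symmetric random walk, which is a martingale and hence a submartingale:

```python
import random

def check_doob(n=200, lam=5.0, trials=10000, seed=1):
    """Monte Carlo comparison of lambda * P(max_{i<=n} X_i >= lam) against
    E[(X_n)_+; max_{i<=n} X_i >= lam] for a simple symmetric random walk."""
    rng = random.Random(seed)
    hits, restricted_sum = 0, 0.0
    for _ in range(trials):
        x, running_max = 0, 0
        for _ in range(n):
            x += rng.choice((-1, 1))
            running_max = max(running_max, x)
        if running_max >= lam:
            hits += 1
            restricted_sum += max(x, 0)
    return lam * hits / trials, restricted_sum / trials

lhs, rhs = check_doob()
print(lhs, rhs)  # the first number should not exceed the second
```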

Corollary 176 (Kolmogorov maximal inequality)

Let $M_n$ be a martingale with $M_0 = 0$, and let $X_n = M_n^2$ (this is a submartingale because $x^2$ is convex). Then
\[
P\Big(\max_{0\le i\le n} |M_i| \ge x\Big) = P\Big(\max_{0\le i\le n} X_i \ge x^2\Big) \le \frac{E[(X_n)_+]}{x^2} = \frac{\operatorname{Var} M_n}{x^2}.
\]

Notice that Chebyshev's inequality just gives us
\[
P(|M_n| \ge x) \le \frac{\operatorname{Var} M_n}{x^2},
\]
while here we're bounding a maximum, which is stronger! This lemma is primarily useful for the following result:

Theorem 177 ($L^p$ maximal inequality)

Let $X_n$ be a submartingale, and define $\bar X_n = \max\{(X_i)_+ : 0 \le i \le n\}$. Then for all $p \in (1, \infty)$, we have
\[
\|\bar X_n\|_p \le \frac{p}{p-1}\,\|(X_n)_+\|_p.
\]

In other words, the maximum over the first $n$ values can be controlled by just the endpoint value. Note that this is really only a statement about the positive part of $X_n$.

Proof. One annoyance here is that we don't know in advance whether $\bar X_n$ is in $L^p$, so we take care of this by truncating at a finite constant $M$, working with $\bar X_n \wedge M$. We claim that
\[
\lambda P(\bar X_n \wedge M \ge \lambda) \le E[(X_n)_+;\ \bar X_n \wedge M \ge \lambda].
\]
This is because whenever $M$ is smaller than $\lambda$ the left side is zero, and whenever $M$ is at least $\lambda$ we just get Lemma 175. So now
\[
E[(\bar X_n \wedge M)^p] = \int_0^\infty p y^{p-1}\, P(\bar X_n \wedge M \ge y)\, dy
\]
with the usual integration trick, and we plug in the bound above:
\[
\le \int_0^\infty p y^{p-2}\, E[(X_n)_+;\ \bar X_n \wedge M \ge y]\, dy.
\]
Change the order of integration (by Tonelli's theorem) to get
\[
= E\left[(X_n)_+ \int_0^\infty p y^{p-2}\, 1\{\bar X_n \wedge M \ge y\}\, dy\right] = \frac{p}{p-1}\, E\big[(X_n)_+\,(\bar X_n \wedge M)^{p-1}\big],
\]
and now by Hölder's inequality, this is
\[
\le \frac{p}{p-1}\, \|(X_n)_+\|_p\, \big\|(\bar X_n \wedge M)^{p-1}\big\|_{p'},
\]
where $\frac{1}{p} + \frac{1}{p'} = 1 \implies p' = \frac{p}{p-1}$, and this is
\[
= \frac{p}{p-1}\, \|(X_n)_+\|_p\, \|\bar X_n \wedge M\|_p^{p-1}.
\]
So we have a bound on the $L^p$ norm of the truncation, and rearranging (dividing by $\|\bar X_n \wedge M\|_p^{p-1}$, which is finite) indeed gives
\[
\|\bar X_n \wedge M\|_p \le \frac{p}{p-1}\, \|(X_n)_+\|_p;
\]
now take $M \to \infty$ and use the monotone convergence theorem.

Theorem 178 ($L^p$ martingale convergence theorem)

Let $X_n$ be a martingale. If there exists $p \in (1, \infty)$ such that $\sup_n \|X_n\|_p < \infty$, then $X_n \to X_\infty$ almost surely and in $L^p$.

Proof. Note that $X_n$ satisfies $\sup_n E[(X_n)_+] \le \sup_n E|X_n| < \infty$, because the family is uniformly bounded in $L^p$ norm and we can use monotonicity of $L^p$ norms on a probability space. So $X_n$ satisfies the conditions of the submartingale convergence theorem, and therefore $X_n \to X_\infty$ almost surely.

For the other part, apply the $L^p$ maximal inequality, which tells us that
\[
\Big\|\max_{0\le i\le n}(X_i)_+\Big\|_p \le \frac{p}{p-1}\,\|(X_n)_+\|_p \quad\text{and}\quad \Big\|\max_{0\le i\le n}(X_i)_-\Big\|_p \le \frac{p}{p-1}\,\|(X_n)_-\|_p
\]
(because both $X_n$ and $-X_n$ are submartingales if $X_n$ is a martingale). By the hypothesis, the two right-hand sides are uniformly bounded in $n$, and the left-hand sides are nondecreasing in $n$, so taking the limit $n \to \infty$, $Y = \sup_{i\ge0} |X_i|$ is an $L^p$-integrable random variable. This means that
\[
\lim_{n\to\infty} E[|X_n - X_\infty|^p] = 0
\]
by the dominated convergence theorem, because $X_n \to X_\infty$ almost surely and $|X_n - X_\infty|^p$ is dominated by $(2Y)^p$.

Let’s apply these to branching processes! Recall the definition of a Galton-Watson tree: we have a double infinite

array (Xn,i )n≥1,i≥1, which are all iid to an offspring law X supported on the nonnegative integers with finite mean

EX = µ. We start with some root vertex o, which generates X1,1 children at depth 1. Then in general, the ith child

at depth (n − 1) generates Xn,i children at depth n, and the Galton-Watson process Zn is just the number of vertices at depth n: Zn−1 X Zn = Xn,i . i=1 We also showed last time that Z (Z |F ) = µZ =⇒ M = n is a martingale E n+1 n n n µn where Fn = σ(X`,i for all i ≥ 1, 1 ≤ ` ≤ n): in other words, we know the structure of the tree up to depth n. Since

Mn is a nonnegative martingale, we know that Mn converges almost surely to M∞, where EM∞ ≤ EM0 = 1. One case that is easy to understand is the subcritical Galton-Watson: if µ < 1, then by Markov’s inequality (and using that Zn is nonnegative-valued) n P(Zn > 0) ≤ E(Zn) = µ , which is exponentially small. Thus Zn must converge to 0 almost surely: the population goes extinct with probability

Zn 1. In fact, we actually have in this case that Mn = µn → 0 almost surely, because Zn is integer-valued so it must eventually be zero. So the limit M∞ is just identically zero, and in fact EM∞ = 0 < 1! The more interesting cases are µ = 1 and µ > 1: µ = 1 is the critical Galton-Watson: we assume that P(X = 1) < 1 (because that case is where the parent has exactly one child, which has one child, and so on). Then it

93 turns out that Zn → 0 almost surely, so we have extinction in this case as well! Showing this is less straightforward, though:

Proof. Here, $Z_n$ is a nonnegative martingale (hence also a supermartingale), so it converges almost surely to some $Z_\infty$. Since $Z_n$ is integer-valued, $Z_\infty$ is also integer-valued, and in fact we must just have $Z_n = Z_\infty$ for all sufficiently large $n \ge n(\omega)$ (depending on $\omega$). This is equivalent to saying that
\[
P\Big(\bigcup_{k\ge0} \bigcup_{m\ge0} \{Z_n = k \text{ for all } n \ge m\}\Big) = 1.
\]
We want to say that this is not possible for any $k > 0$: we're asking the process to stay at population $k$ forever, while $X$ is a nondegenerate random variable. To make that rigorous, note that (for any $k > 0$)
\[
P(Z_{n+1} = k \mid Z_n = k) = c_k < 1,
\]
where $c_k$ does not depend on $n$, so
\[
P(Z_n = k\ \forall n \in \{m, \ldots, m+\ell\}) \le c_k^\ell.
\]
This decreases exponentially with $\ell$, so the probability that $Z_n = k$ for all $n \ge m$ is zero. So for any $k > 0$, the countable union over all $m$ has probability zero, and thus it's equivalent to say that
\[
P\Big(\bigcup_{m\ge0} \{Z_n = 0 \text{ for all } n \ge m\}\Big) = 1,
\]
which is exactly the statement that $Z_n \to 0$ almost surely.

The most interesting case is the supercritical Galton-Watson tree, where $\mu > 1$. Here, if $p_0 = P(X = 0) = 0$, then $Z_n > 0$ almost surely for all $n$ (because every vertex produces a positive number of children). But if $p_0 > 0$, the tree does have positive probability of going extinct. The idea, though, is that surviving to (for example) 100 generations means it's likely the tree is pretty big, so we expect the tree to survive from that point on. We want to know whether the probability of nonextinction is positive. Well, condition on the first level of the tree: if the root has 3 children, say, then the tree is dead at level $n+1$ exactly when all three of the children's subtrees are dead $n$ levels below them. So we have a recursive formula:
\[
P(Z_{n+1} = 0) = \sum_{k\ge0} p_k\, P(Z_n = 0)^k = \psi(P(Z_n = 0)),
\]
where we define the function (a reparameterization of the moment generating function)
\[
\psi(s) = \sum_{k\ge0} p_k s^k = E[s^X].
\]

This is guaranteed to be finite for $0 \le s \le 1$ (since $X$ is nonnegative), and we're only ever applying $\psi$ to probabilities, so $\psi$ is indeed defined. This function is also nondecreasing and convex in $s$, and
\[
\psi'(s) \uparrow \mu \quad \text{as } s \uparrow 1
\]
(this last statement is by differentiating term-by-term and using the dominated convergence theorem). Since $\psi(1) = 1$ and $\psi(0) = p_0$, the graph looks like the sketch from lecture: a convex curve from $(0, p_0)$ up to $(1, 1)$, with slope approaching $\mu$ at $s = 1$.

Since the slope at $s = 1$ is $\mu > 1$, the graph does actually cross the line $\psi(s) = s$ at some unique point $0 < \rho < 1$. And now
\[
P(Z_n = 0) = \psi(P(Z_{n-1} = 0)) = \psi^n(P(Z_0 = 0)) = \psi^n(0).
\]
This converges to $\rho$ as $n \to \infty$ (repeatedly applying this convex function brings us towards the fixed point), so the probability of extinction by time $n$ converges to $\rho$. Now the events $\{Z_n = 0\}$ are nested: if we're extinct at time $n$, we're extinct at time $n+1$. So these events are nondecreasing up to the event $\{\text{tree goes extinct}\}$, and thus the probability that the Galton-Watson tree goes extinct is exactly $\rho < 1$, and we're done.

A more complicated question: if we have a supercritical Galton-Watson tree and we condition on the event that we do not go extinct, what is the limiting random variable

\[
M_\infty = \lim_{n\to\infty} \frac{Z_n}{\mu^n}?
\]

This tells us a lot about the process: if this converged to some finite limit in $(0, \infty)$, then we'd know that $Z_n$ grows like $\mu^n$ up to a constant factor (which is much stronger than something like $Z_n = \mu^{n+o(n)}$). It turns out that we have an exact characterization of when this happens! First of all, we claim that
\[
\psi(P(M_\infty = 0)) = P(M_\infty = 0)
\]
by using the same recursive idea. So $P(M_\infty = 0)$ is a fixed point of $\psi$: it's either $0$, $\rho$, or $1$, and it can't be zero because $P(M_\infty = 0) \ge P(\text{extinction}) = \rho$. (Here, we're using that $p_0 \ne 0$, so $\rho \ne 0$ either.) It turns out both remaining possibilities occur:

Theorem 179 (Kesten-Stigum L log L criterion)

In a supercritical Galton-Watson tree, $M_\infty = \lim_n Z_n/\mu^n$ is not identically zero if and only if the offspring law satisfies $E(X \log X) < \infty$.

We’ll see a special case on our homework, and we’ll discuss this a bit further in a later lecture.

20 November 20, 2019

The last homework assignment is already posted on Stellar: it’s due on Monday, December 2 (the class day before the exam). There will be lecture that Monday but not on Wednesday: we should make sure we can both study for the exam and finish the problem set.

We've been discussing martingale convergence theorems so far: if $X_n$ is a submartingale with a moment condition ($\sup_n E[(X_n)_+] < \infty$), then it converges almost surely (because of the upcrossing inequality). A special case is that any nonnegative supermartingale converges almost surely.

However, we have no guarantee that there is convergence in $L^1$: $EX_n \to EX_\infty$ does not always hold. One sufficient condition we mentioned was the $L^p$ martingale convergence theorem: if there exists some $p \in (1, \infty)$ such that $\sup_n \|X_n\|_p < \infty$, then $X_n \to X_\infty$ almost surely and in $L^p$. Convergence in $L^p$ implies convergence in $L^1$, so the expectation does indeed go to the limiting expectation. This condition isn't necessary, though: we'll find an exact characterization today!

Definition 180

Say that $(X_i)_{i\in I}$ is a family of random variables on a probability space $(\Omega, \mathcal F, P)$. Then $(X_i)$ is uniformly integrable (u.i.) if
\[
\sup_{i\in I} E\big[|X_i|;\ |X_i| \ge M\big] \to 0 \quad \text{as } M \to \infty.
\]

This relates to the notion of tightness for a set of measures (but isn't exactly equivalent)! If we just have one random variable $X \in L^1$, then $E[|X|;\ |X| \ge M]$ automatically goes to 0 as $M \to \infty$ (by the dominated convergence theorem): uniform integrability instead asks for this to hold uniformly over the whole family.

Remark 181. Note that if $(X_i)_{i\in I}$ is uniformly integrable, then $\sup_{i\in I} E|X_i| < \infty$, so uniform integrability is stronger than a uniform bound on the $L^1$ norm. This is because
\[
E|X_i| \le M + E[|X_i|;\ |X_i| \ge M],
\]
and we can make the second term uniformly small by taking $M$ to be sufficiently large. In particular, this shows that for any given $M$, the supremum in the definition is finite.
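The converse fails: a uniform $L^1$ bound does not imply uniform integrability. A standard counterexample (my own illustration, not from the notes) is $X_n = n \cdot 1\{U \le 1/n\}$ with $U$ uniform on $[0,1]$: then $E|X_n| = 1$ for every $n$, but $E[|X_n|;\ |X_n| \ge M] = 1$ whenever $n > M$, so the tail contribution never becomes uniformly small. A quick numerical check:

```python
import random

def tail_expectation(n, M, trials=200000, seed=2):
    """Estimate E[|X_n|; |X_n| >= M] for X_n = n * 1{U <= 1/n}, U ~ Unif[0,1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = n if rng.random() <= 1.0 / n else 0
        if abs(x) >= M:
            total += abs(x)
    return total / trials

# E|X_n| = 1 for every n, yet the tail term stays near 1 for all n > M:
for n in (10, 100, 1000):
    print(n, tail_expectation(n, M=5))
```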

Here’s one natural way to construct such a family of uniformly integrable random variables:

Lemma 182

If $X \in L^1(\Omega, \mathcal F, P)$, then $\{E(X|\mathcal G) : \mathcal G \subseteq \mathcal F \text{ a sub-}\sigma\text{-field}\}$ is uniformly integrable.

Proof. Note that for all ε > 0, there exists δ > 0 such that

P(A) ≤ δ =⇒ E(|X|; A) ≤ ε.

(This was on our homework! The idea is that if this weren't true, we could find a sequence of events $A_n$ with $P(A_n) \downarrow 0$ but $E(|X|; A_n)$ bounded away from 0; since $X \in L^1$, this violates the dominated convergence theorem.) Now by Markov's inequality and then (conditional) Jensen's inequality,
\[
P(|E(X|\mathcal G)| \ge M) \le \frac{E|E(X|\mathcal G)|}{M} \le \frac{E|X|}{M}.
\]
The event $\{|E(X|\mathcal G)| \ge M\}$ is $\mathcal G$-measurable, so its indicator can be brought inside the conditional expectation:
\[
E\big[|E(X|\mathcal G)|;\ |E(X|\mathcal G)| \ge M\big] = E\big|E\big(X\,1\{|E(X|\mathcal G)| \ge M\}\ \big|\ \mathcal G\big)\big|.
\]
By Jensen's inequality again, this is then bounded as
\[
\le E\big[|X|;\ |E(X|\mathcal G)| \ge M\big] \le \varepsilon
\]
whenever $P(|E(X|\mathcal G)| \ge M) \le \delta$. By the Markov bound above, the latter holds for all $\mathcal G$ simultaneously once $M \ge E|X|/\delta$, so taking $M$ large enough (which means taking $\delta$, and hence $\varepsilon$, to 0) gives the result.

Theorem 183

If $X_n \to X$ in probability, then the following are equivalent:

1. $\{X_n\}_{n\ge0}$ is uniformly integrable,
2. $X_n \to X$ in $L^1$,
3. $E|X_n| \to E|X|$.

This is true for random variables in general, but we’ll apply it to martingales!

Proof. (1) ⟹ (2): since our variables are uniformly integrable, we showed that $\sup_n E|X_n| < \infty$. Since there exists a subsequence $X_{n_k} \to X$ almost surely, we know that $E|X| \le \liminf_{k\to\infty} E|X_{n_k}| < \infty$ by Fatou's lemma, so $X$ is in $L^1$.

Now we want to show that $X_n$ goes to $X$ in $L^1$: define the (truncated identity) function
\[
\varphi(x) = \begin{cases} M & x \ge M \\ -M & x \le -M \\ x & \text{otherwise.} \end{cases}
\]
Then
\[
E|X_n - X| \le E|\varphi(X_n) - \varphi(X)| + E|\varphi(X) - X| + E|X_n - \varphi(X_n)|
\]
by the triangle inequality.

• $\varphi$ is a continuous function and it's uniformly bounded, and $X_n$ goes to $X$ in probability, so $E|\varphi(X_n) - \varphi(X)|$ goes to 0 by the bounded convergence theorem as $n \to \infty$.
• Since $|x - \varphi(x)| \le |x|\,1\{|x| \ge M\}$, the second term is upper bounded by $E(|X|;\ |X| \ge M)$. Similarly, the third term is bounded by $E(|X_n|;\ |X_n| \ge M)$ (this time we do need uniform integrability).

And now we can make all three terms small: fix $M$ and take $n \to \infty$ to get rid of the first term, and then take $M \to \infty$.

(2) ⟹ (3): this follows since $\big|E|X_n| - E|X|\big| \le E|X_n - X|$ by the triangle inequality.

(3) ⟹ (1): it suffices to show that
\[
\lim_{M\to\infty}\Big\{\limsup_{n\to\infty} E(|X_n|;\ |X_n| \ge M)\Big\} = 0
\]
(the usual definition uses $\sup$ instead of $\limsup$, but the finitely many remaining terms can always be handled by taking $M$ larger). We went over a similar argument when dealing with tightness! Consider the function
\[
\psi(x) = \begin{cases} x & |x| \le M-1 \\ 0 & |x| \ge M \\ \text{linear interpolation} & \text{otherwise.} \end{cases}
\]
The main point is that $\psi$ is continuous, and
\[
|x|\,1\{|x| \ge M\} \le |x - \psi(x)| \le |x|\,1\{|x| \ge M-1\}.
\]
So now we can use the bound
\[
\limsup_{n\to\infty} E[|X_n|;\ |X_n| \ge M] \le \limsup_{n\to\infty} E|X_n - \psi(X_n)| = \limsup_{n\to\infty}\big(E|X_n| - E|\psi(X_n)|\big):
\]
using the bounded convergence theorem on the second term ($\psi$ is continuous and uniformly bounded by $M$) and the assumption $E|X_n| \to E|X|$ on the first term yields
\[
\le E|X| - E|\psi(X)| = E|X - \psi(X)| \le E[|X|;\ |X| \ge M-1].
\]
As we send $M \to \infty$, the right-hand side goes to 0, which yields the result we're looking for.

We can apply the results now to martingales:

Theorem 184 (Submartingale $L^1$ convergence theorem)

For a submartingale $(X_n)_{n\ge0}$, the following are equivalent:
1. Uniform integrability,
2. Convergence almost surely and in $L^1$,
3. Convergence in $L^1$.

Proof. (1) ⟹ (2): again, $\sup_n E|X_n| < \infty$ from our earlier observations, and thus by the submartingale convergence theorem, the process converges almost surely. So $X_n \to X_\infty$ in probability as well, and we can apply Theorem 183 to get convergence in $L^1$. (2) ⟹ (3) is trivial. (3) ⟹ (1): convergence in $L^1$ implies convergence in probability, so this is a consequence of (2) ⟹ (1) in Theorem 183.

Theorem 185 (Martingale $L^1$ convergence theorem)

For a martingale $(X_n)_{n\ge0}$, the following are equivalent:
1. Uniform integrability,
2. Convergence almost surely and in $L^1$,
3. Convergence in $L^1$,
4. Existence of $X \in L^1$ such that $X_n = E(X|\mathcal F_n)$ for all $n$.

The first three conditions are the same as above (martingales are submartingales), but this also tells us that if we have a uniformly integrable martingale, then there exists an integrable random variable for which the martingale is just “exposing information!”

Proof. The first three are equivalent from above. To show that (3) ⟹ (4), note that
\[
X_n = E[X_\ell \mid \mathcal F_n] \quad \forall \ell \ge n
\]
by the martingale property, which is equivalent to saying that for all $A \in \mathcal F_n$, we have
\[
E[X_n; A] = E[X_\ell; A] \quad \forall \ell \ge n.
\]
Take $\ell \to \infty$: by our assumption of $L^1$ convergence, we get $E[X_n; A] = E[X_\infty; A]$. Thus $E(X_\infty|\mathcal F_n) = X_n$, as desired (so $X = X_\infty$ works).

(4) =⇒ (1): this is Lemma 182.

Definition 186

Given a filtration $\mathcal F_n$ on $(\Omega, \mathcal F, P)$, and given any $X \in L^1(\Omega, \mathcal F, P)$, the sequence $X_n = E(X|\mathcal F_n)$ is the Doob martingale of $X$ with respect to $\mathcal F_n$.

The idea is that we’re revealing the randomness of our object “a little bit at a time,” and in fact, any martingale that converges in L1 is exactly a Doob martingale (by the above result).
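For a concrete picture (my own example, not from the notes), take $X$ to be the number of heads in $N$ fair coin flips and $\mathcal F_n = \sigma(\text{first } n \text{ flips})$; then $E(X|\mathcal F_n)$ equals the heads seen so far plus $(N-n)/2$ for the flips still hidden:

```python
import random

def doob_martingale_path(N=10, seed=3):
    """For X = number of heads among N fair flips and F_n = sigma(first n flips),
    the Doob martingale is E(X | F_n) = (heads in first n flips) + (N - n) / 2."""
    rng = random.Random(seed)
    flips = [rng.randint(0, 1) for _ in range(N)]
    heads_so_far = 0
    path = [N / 2]  # M_0 = E[X] = N/2: nothing revealed yet
    for n, f in enumerate(flips, start=1):
        heads_so_far += f
        path.append(heads_so_far + (N - n) / 2)
    return flips, path

flips, path = doob_martingale_path()
print(path)  # starts at E[X] = 5.0 and ends at M_N = X itself
```

Each flip moves the conditional expectation by $\pm 1/2$: the randomness of $X$ is revealed a little bit at a time.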

Theorem 187 (Lévy convergence theorem)

Let $\mathcal F_n$ be a filtration of $(\Omega, \mathcal F, P)$, and let
\[
\mathcal F_\infty = \sigma\Big(\bigcup_n \mathcal F_n\Big).
\]
(The union of $\sigma$-algebras is not itself a $\sigma$-algebra, but we can take the $\sigma$-algebra it generates. This is sometimes denoted $\mathcal F_n \uparrow \mathcal F_\infty$.) Then if $X \in L^1$, we have $E(X|\mathcal F_n) \to E(X|\mathcal F_\infty)$ almost surely and in $L^1$.

This is a statement about the limit of the Doob martingale. We should compare this with the previous result: there, we explicitly constructed a random variable $X$ satisfying the properties we wanted. The point is that $\mathcal F_\infty$ might be strictly smaller than $\mathcal F$, and the Lévy convergence theorem holds even if $X$ is an arbitrary integrable function on the big $\sigma$-field (even if it's not $\mathcal F_\infty$-measurable to begin with).

Proof. Let $X_n = E(X|\mathcal F_n)$: then $\{X_n\}_{n\ge0}$ is a uniformly integrable family by Lemma 182. Thus it converges to some $X_\infty$ almost surely and in $L^1$ by the $L^1$ martingale convergence theorem. To finish, we just need to show that $X_\infty = E(X|\mathcal F_\infty)$: this is an exercise with the definition of conditional expectation. We have
\[
X_n = E(X|\mathcal F_n) \implies E(X_n; A) = E(X; A) = E(X_\infty; A) \quad \forall A \in \mathcal F_n,
\]
where the last equality comes from $X_n = E(X_\infty|\mathcal F_n)$. So
\[
E(X; A) = E(X_\infty; A) \quad \forall A \in \bigcup_{n\ge0} \mathcal F_n,
\]
and we need to extend this to the generated $\sigma$-algebra by showing that $X_\infty$ is $\mathcal F_\infty$-measurable (true because $X_\infty$ is the almost-sure limit of the $X_n$, each of which is measurable with respect to $\mathcal F_n$ and therefore with respect to $\mathcal F_\infty$), and that the identity above holds for all $A \in \mathcal F_\infty$. This last part is just the standard $\pi$-$\lambda$ argument!

Corollary 188 (Lévy 0-1 law)

If $\mathcal F_n \uparrow \mathcal F_\infty$ and $A \in \mathcal F_\infty$, then $P(A|\mathcal F_n) \to 1_A$ almost surely; in particular, the limit is either 0 or 1.
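To see this in action (my own illustration, not from the notes), take $A = \{\text{at least 50 heads among 100 fair flips}\} \in \mathcal F_{100} \subseteq \mathcal F_\infty$. Then $P(A|\mathcal F_n)$ is an explicit binomial tail probability, and along any sample path it ends at exactly 0 or 1:

```python
import math
import random

def binom_tail(m, k):
    """P(Binomial(m, 1/2) >= k), computed exactly."""
    if k <= 0:
        return 1.0
    if k > m:
        return 0.0
    return sum(math.comb(m, j) for j in range(k, m + 1)) / 2**m

def conditional_prob_path(N=100, seed=4):
    """P(A | F_n) for A = {at least N/2 heads among N fair flips}: given the
    first n flips, A holds iff the remaining N - n flips cover the deficit."""
    rng = random.Random(seed)
    heads, path = 0, [binom_tail(N, N // 2)]
    for n in range(1, N + 1):
        heads += rng.randint(0, 1)
        path.append(binom_tail(N - n, N // 2 - heads))
    return path

path = conditional_prob_path()
print(path[0], path[-1])  # starts near 1/2, ends at exactly 0.0 or 1.0
```

Here $A$ is not a tail event, but it is $\mathcal F_\infty$-measurable, which is all the Lévy 0-1 law needs.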

We'll finish with a few helpful hints for the homework. The Azuma-Hoeffding bound says that if $X_n$ is a martingale with bounded increments, then $X_n - EX_n$ is well-concentrated. One consequence is the concentration of the independence number of $G_{n,p}$, the random graph on $n$ vertices in which each pair of vertices is connected independently with probability $p$. Recall that if $G = (V, E)$ is a graph, then $S \subseteq V$ is an independent set if no two vertices in $S$ are neighbors in $G$; the independence number $\alpha(G)$ is the maximal cardinality $|S|$ of such a set (a number between 1 and $|V|$). We're asked to determine the behavior of $X = \alpha(G)$ as a random variable. One useful way to define a filtration is

\[
\mathcal F_\ell = \sigma(\text{edges restricted to the first } \ell \text{ vertices}).
\]

This is a vertex-revealing filtration (often called the vertex exposure filtration): we're told how vertex $\ell$ is connected to the previously revealed vertices, for increasingly large $\ell$. There are variations on this, but the point is that if we look at $E(X|\mathcal F_\ell)$, we have no integrability issues (because $G$ is a finite graph), and we can get a bound on the concentration by looking at this Doob martingale and using the Azuma-Hoeffding bound! Note that this kind of strategy also works for the clique number and chromatic number of a random graph, too. (This general type of argument comes from a paper [4] by Shamir and Spencer.)

Remark 189. The bound is only good for some asymptotic values of p, though: again, we’ll see the specifics on our homework!
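For intuition about the strength of this kind of bound (a sketch of my own; the graph setting on the homework is more involved), here is the simplest instance of Azuma-Hoeffding: for a martingale with $\pm1$ increments, such as simple random walk, $P(|S_n| \ge t) \le 2\exp(-t^2/(2n))$.

```python
import math
import random

def azuma_check(n=100, t=20, trials=20000, seed=5):
    """Compare P(|S_n| >= t) for a +-1 simple random walk against the
    Azuma-Hoeffding bound 2 * exp(-t^2 / (2n))."""
    rng = random.Random(seed)
    hits = sum(abs(sum(rng.choice((-1, 1)) for _ in range(n))) >= t
               for _ in range(trials))
    return hits / trials, 2 * math.exp(-t * t / (2 * n))

empirical, bound = azuma_check()
print(empirical, bound)  # the empirical probability sits below the bound
```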

21 November 25, 2019

We’ve proved a few results about submartingale convergence so far:

• If $X_n$ is a submartingale with $\sup_n E[(X_n)_+]$ finite, then $X_n \to X_\infty$ almost surely, with $E|X_\infty| < \infty$.
• For a submartingale, uniform integrability, convergence almost surely and in $L^1$, and convergence in $L^1$ are all equivalent.

Today, we're going to talk about the optional stopping theorem! Remember that we defined a stopping time $\tau$ with respect to a filtration $\mathcal F_n$ to be a random variable such that $\{\tau = n\} \in \mathcal F_n$ for all $n$. We know that for any stopping time $0 \le \tau \le k$ (almost surely) and submartingale $X_n$, we have $EX_0 \le EX_\tau \le EX_k$. More generally, we know that
\[
EX_0 \le EX_{\tau\wedge n} \le EX_n
\]
for all finite $n$, but not necessarily if we replace $n$ with $\infty$. Today, we'll actually take the limit and see what we can do!

Lemma 190

If Xn is a uniformly integrable submartingale and τ is any stopping time, then E|Xτ | < ∞.

(Here, we allow for P(τ = ∞) > 0.)

Proof. First of all, we need to check that $X_\tau$ makes sense even on $\{\tau = \infty\}$: since $X_n$ is uniformly integrable, $X_n$ converges to $X_\infty$ almost surely, so $X_\tau$ is indeed well-defined.

Notice that $(X_n)_+$ is also a submartingale, because we're composing with a nondecreasing convex function, which implies that
\[
E[(X_{n\wedge\tau})_+] \le E[(X_n)_+]
\]
as well (using the bounded-stopping-time fact from earlier). The right-hand side of this inequality is uniformly bounded (uniform integrability implies a uniform $L^1$ bound), and thus the left-hand side is also uniformly bounded:
\[
\sup_n E[(X_{n\wedge\tau})_+] < \infty.
\]
Since $(X_{n\wedge\tau})$ is a stopped version of $(X_n)$, it is a submartingale as well, so by the submartingale convergence theorem, $X_{n\wedge\tau}$ converges almost surely to a limit which is in $L^1$. But that limit is exactly $X_\tau$, so we're done.

Corollary 191

If Xn is a uniformly integrable submartingale, and τ is any stopping time, then Yn = Xn∧τ is also a uniformly integrable submartingale.

Proof. We need to show that
\[
\lim_{M\to\infty}\Big\{\sup_n E\big[|X_{n\wedge\tau}|;\ |X_{n\wedge\tau}| \ge M\big]\Big\} = 0.
\]
We break this up into its parts:
\[
E\big[|X_{n\wedge\tau}|;\ |X_{n\wedge\tau}| \ge M\big] = E\big[|X_n|\,1\{\tau > n\} + |X_\tau|\,1\{\tau \le n\};\ |X_{n\wedge\tau}| \ge M\big],
\]
and this simplifies to (dropping indicators and seeing whether we're "using" $n$ or $\tau$)
\[
\le E\big[|X_n|;\ |X_n| \ge M\big] + E\big[|X_\tau|;\ |X_\tau| \ge M\big].
\]
(This bound is uniform over all $n$.) As $M$ gets larger, the first term is small because the $X_n$ are uniformly integrable, and the second is small because $X_\tau \in L^1$ by the above lemma (and then we use the dominated convergence theorem).

Theorem 192

If Xn is a uniformly integrable submartingale and τ is a stopping time, then EX0 ≤ EXτ ≤ EX∞.

Note that if we only know that Xn∧τ is a uniformly integrable submartingale, that’s enough for the EX0 ≤ EXτ part of the inequality. (This will be clear from the proof.)

Proof. We already noted that

\[
EX_0 \le EX_{\tau\wedge n} \le EX_n
\]
for all finite $n$. If $X_{\tau\wedge n}$ is uniformly integrable, then it converges to $X_\tau$ in $L^1$, so $EX_0 \le EX_\tau$. Furthermore, if we know that $X_n$ itself is uniformly integrable, then $X_n$ converges to $X_\infty$ almost surely and in $L^1$, so $EX_n \to EX_\infty$, which gives us the other bound.

Uniform integrability is often difficult to check, though, so here’s an easier criterion:

Theorem 193

Suppose Xn is a submartingale such that E[|Xn+1 − Xn||Fn] ≤ B for all n, and say that τ is a stopping time with Eτ < ∞. Then Xn∧τ is uniformly integrable, so we do have EX0 ≤ EXτ .

Proof. For any $n \ge 0$, we can write
\[
|X_{n\wedge\tau}| \le |X_0| + \sum_{i\ge0} 1\{\tau > i\}\,|X_{i+1} - X_i|.
\]
The right-hand side doesn't depend on $n$: call it $Y$. Note that $1\{\tau > i\}$ is measurable with respect to $\mathcal F_i$, so taking expectations and conditioning on $\mathcal F_i$ term by term,
\[
EY \le E|X_0| + \sum_{i\ge0} E\big(1\{\tau > i\}\,E[|X_{i+1} - X_i| \mid \mathcal F_i]\big)
\]
(we pulled the indicators out of the conditional expectations), and now we can use the other assumption:
\[
\le E|X_0| + B\sum_{i\ge0} P(\tau > i) = E|X_0| + B\,E\tau < \infty.
\]
So the whole stopped process is dominated by the single integrable variable $Y$, which gives uniform integrability.

Here’s a result that’s a bit disconnected from the ones so far: we don’t need uniform integrability at all.

Theorem 194

If Xn is a nonnegative supermartingale, and τ is any stopping time, then EX0 ≥ EXτ .

Proof. From earlier results, $X_n \to X_\infty$ almost surely, so $X_\tau$ is well-defined. We also know that $EX_0 \ge EX_{n\wedge\tau}$ for all $n$, so by Fatou's lemma,
\[
EX_\tau \le \liminf_{n\to\infty} EX_{n\wedge\tau} \le EX_0.
\]

Next, we’ll take a generalization of the first part of Theorem 192:

Definition 195

If τ is a stopping time with respect to Fn, then the stopping time sigma-algebra associated to τ is

\[
\mathcal F_\tau = \{A \in \mathcal F : A \cap \{\tau = n\} \in \mathcal F_n \text{ for all } n\}.
\]

We can check that if $\tau = k$ is some fixed constant (almost surely), then $\mathcal F_\tau = \mathcal F_k$, and also that if $Y_n \in \mathcal F_n$ for each $n$, then $Y_\tau \in \mathcal F_\tau$. Basically, if $A \in \mathcal F_n$, then at time $n$ we know whether we're in $A$ or in $\Omega \setminus A$: thus, on $\{\tau = n\}$ we should know whether we're in $A$, which is exactly the condition $A \cap \{\tau = n\} \in \mathcal F_n$.

Theorem 196

Say that Xn∧τ is a uniformly integrable submartingale. If σ, τ are both stopping times and σ ≤ τ almost surely, then

E(Xτ |Fσ) ≥ Xσ.

(We’ve already proved this for σ = 0.)

Proof. Since $Y_n = X_{n\wedge\tau}$ is a uniformly integrable submartingale, Theorem 192 applies to $Y$ with the stopping time $\sigma$: $EY_0 \le EY_\sigma \le EY_\infty$. Since $\sigma \le \tau$, we have $Y_\sigma = X_{\sigma\wedge\tau} = X_\sigma$ and $Y_\infty = X_\tau$, so plugging in, $EX_0 \le EX_\sigma \le EX_\tau$.

Given any $A \in \mathcal F_\sigma$, define a random variable
\[
\xi(\omega) = \sigma 1_A + \tau 1_{A^c}
\]
(which is either $\sigma$ or $\tau$). This is a stopping time, because
\[
\{\xi = n\} = \big(A \cap \{\sigma = n\}\big) \sqcup \big(A^c \cap \{\tau = n\}\big).
\]
Since $A$ is in $\mathcal F_\sigma$, the first term is in $\mathcal F_n$. $A^c$ is also in $\mathcal F_\sigma$, and we can further decompose the second term:
\[
A^c \cap \{\tau = n\} = \Big(\bigcup_{k=0}^{n} A^c \cap \{\sigma = k\}\Big) \cap \{\tau = n\}.
\]
(The union over $k$ only goes up to $n$, because $\sigma \le \tau = n$ in this case.) And now $A^c \cap \{\sigma = k\} \in \mathcal F_k \subseteq \mathcal F_n$ and $\{\tau = n\} \in \mathcal F_n$, so indeed $\xi$ is a stopping time. $\xi$ is also bounded by $\tau$ almost surely, so by the same logic as before, $EY_\xi \le EY_\infty$, i.e. $EX_\xi \le EX_\tau$. But expanding $EX_\xi$, this says
\[
E(X_\sigma; A) + E(X_\tau; A^c) \le E(X_\tau) \implies E(X_\sigma; A) \le E(X_\tau; A) = E\big[E(X_\tau|\mathcal F_\sigma);\ A\big]
\]
by the definition of conditional expectation, and since this holds for every $A \in \mathcal F_\sigma$ (and both $X_\sigma$ and $E(X_\tau|\mathcal F_\sigma)$ are $\mathcal F_\sigma$-measurable), we get the desired result.

Example 197 (Asymmetric random walk)

We perform a walk
\[
S_n = \sum_{i=1}^n \xi_i, \qquad \xi_i = \begin{cases} +1 & \text{with probability } p \\ -1 & \text{with probability } q = 1 - p, \end{cases}
\]
where $p \in (0.5, 1)$. (Generally, there is a drift in the upward direction.)

We know that
\[
S_n = \sum_{i=1}^n (\xi_i - E\xi) + n \cdot E\xi.
\]
The first term here is the sum of $n$ iid centered terms, so it has standard deviation on the order of $\sqrt n$. Meanwhile, the second term is of order $n$! So the drift is the dominant term here, and eventually we won't return to the 0 line. Let's quantify this with martingales. Notice that
\[
E(s^\xi) = ps + \frac{q}{s} = 1
\]
has solutions at $s = 1$ and $s = \frac{q}{p}$. So if we take
\[
M_n = \left(\frac{q}{p}\right)^{S_n},
\]
which is a product of iid terms of mean 1, we have a nonnegative martingale! For any integer $x$, let $\tau_x = \inf\{n \ge 0 : S_n = x\}$ be the first hitting time, which can be infinite ($\inf \emptyset = \infty$). So now if we take any $a, b > 0$ and define

\[
\tau = \tau_{-a} \wedge \tau_b,
\]
then $EM_\tau = EM_0 = 1$, because the stopped martingale is uniformly bounded: up to time $\tau$, $S_n$ stays in $[-a, b]$, so $M_n$ lies between $(q/p)^b$ and $(q/p)^{-a}$. (Also note that $\tau = \infty$ with probability 0, because we will eventually escape $(-a, b)$.) But now
\[
1 = EM_\tau = \pi \left(\frac{q}{p}\right)^{-a} + (1 - \pi)\left(\frac{q}{p}\right)^{b},
\]
where $\pi$ is the probability that we reach $-a$ before we reach $b$. We can solve this to find that
\[
\pi = \frac{1 - (q/p)^b}{(q/p)^{-a} - (q/p)^b}.
\]
Now if we take $a \to \infty$, we find that $P(\tau_b = \infty) = 0$, and if we take $b \to \infty$, we find that $P(\tau_{-a} < \infty) = (q/p)^a$. So given any positive level, we'll hit it almost surely, and given any negative level, there's a positive probability we'll never hit it! What is the expected time that we hit $b$? To figure this out, we should consider the martingale

\[
M_n = S_n - (E\xi)\,n = S_n - n(p - q).
\]
We want to guess that $E\tau_b = \frac{b}{p-q}$, because at the stopping time, $S_{\tau_b} = b$. To justify this, define
\[
S_{\min} = \min\{S_n : n \ge 0\} \le 0.
\]
This is integrable, because we can write
\[
E(-S_{\min}) = \sum_{a\ge1} P(-S_{\min} \ge a) = \sum_{a\ge1} P(\tau_{-a} < \infty) = \sum_{a\ge1} \left(\frac{q}{p}\right)^a < \infty
\]
(a geometric sum). Thus $S_{\min} \in L^1$, and now we have
\[
S_{n\wedge\tau_b} \in [S_{\min}, b] \quad \forall n,
\]
so $S_{n\wedge\tau_b}$ is uniformly integrable. Thus it converges almost surely and in $L^1$ to its limit $S_{\tau_b}$, and $S_{\tau_b} = b$ by definition! Since $M_n$ is a martingale,
\[
E(S_{n\wedge\tau_b}) = (p - q)\,E(n \wedge \tau_b).
\]
The left side converges to $b$, while $E(n \wedge \tau_b)$ converges to $E\tau_b$ by the monotone convergence theorem, so indeed we have
\[
E\tau_b = \frac{b}{p - q} = \frac{b}{2p - 1}.
\]
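Both formulas are easy to check by simulation (my own sanity check, not from the lecture):

```python
import random

def simulate_walk(p, a, b, trials=20000, seed=6):
    """Estimate P(hit -a before b) and E[time to first hit b] for the
    p-biased walk with steps +1 (prob p) and -1 (prob 1 - p), p > 1/2."""
    rng = random.Random(seed)
    hit_minus_a, total_time_to_b = 0, 0
    for _ in range(trials):
        s, t = 0, 0
        while -a < s < b:
            s += 1 if rng.random() < p else -1
            t += 1
        if s == -a:
            hit_minus_a += 1
        while s < b:  # keep walking until b (happens a.s. since p > 1/2)
            s += 1 if rng.random() < p else -1
            t += 1
        total_time_to_b += t
    return hit_minus_a / trials, total_time_to_b / trials

p, a, b = 0.6, 10, 10
q = 1 - p
pi_exact = (1 - (q / p)**b) / ((q / p)**(-a) - (q / p)**b)
print(simulate_walk(p, a, b))     # estimates of (pi, E tau_b)
print(pi_exact, b / (2 * p - 1))  # the exact values derived above
```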

Example 198 (Patterns in a random string)

Sample $\sigma_1, \sigma_2, \ldots$ iid uniformly from the alphabet $\mathcal A = \{A, B, \ldots, Z\}$ (with $|\mathcal A| = 26$), and let $\tau$ be the first time you see a particular sequence of letters ($w = $ "ABRACADABRA") as the most recent 11 letters.

Note that $\tau \ge \mathrm{length}(w) = 11$, and also $E\tau < \infty$ (because within any block of 11 letters there is a positive chance to see the word, so the tail of $\tau$ decays geometrically). Thus $\tau$ is integrable.

Suppose that before each time $n$, a new gambler $G_n$ enters and bets 1 dollar on the event $\{\sigma_n = w_1\}$. If $\sigma_n \ne w_1$, then $G_n$ loses and exits; otherwise $G_n$ wins 26 dollars and bets all of that money on the event $\{\sigma_{n+1} = w_2\}$, and so on.

This continues until there is a mismatch ($\sigma_{n+i-1} \ne w_i$), or until the entire string matches, in which case the gambler wins $26^{11}$ dollars. Let $M_n$ be the total winnings up to time $n$ from the point of view of the casino: this is integrable for any $n$, and all games are fair, so $M_n$ is a martingale (with respect to $\mathcal F_n$).

Now note that $E(|M_{n+1} - M_n| \mid \mathcal F_n)$ is uniformly bounded almost surely, because only the 11 most recent gamblers can still be playing (so $26^{11} \cdot 11$ is a crude bound). Thus by Theorem 193, with $\tau$ the stopping time at which someone first wins the whole word,
\[
0 = EM_0 = EM_\tau = E\big[\tau - 26^{11} - 26^4 - 26\big],
\]
because the casino has collected one dollar from each of the $\tau$ gamblers, while at the stopping point there are three gamblers still alive, who have won $26^{11}$, $26^4$, and $26$ dollars respectively (corresponding to the prefixes of $w$ that are also suffixes: ABRACADABRA, ABRA, and A). And this tells us the exact value $E\tau = 26^{11} + 26^4 + 26$!

On Wednesday, we'll cover reverse martingales and their applications to zero-one laws.
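The full word is far too rare to simulate directly ($E\tau \approx 3.7 \times 10^{15}$), but the same gambling argument can be checked on a short pattern (my own sanity check, not from the lecture): for iid fair coin letters and the pattern HTH, the overlapping prefix-suffixes HTH and H give $E\tau = 2^3 + 2 = 10$.

```python
import random

def mean_hitting_time(pattern="HTH", trials=100000, seed=7):
    """Estimate E[tau] for the first appearance of `pattern` in an iid
    sequence of fair coin letters H/T; the casino martingale argument
    predicts E[tau] = 2^3 + 2 = 10 for the pattern HTH."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        recent, t = "", 0
        while not recent.endswith(pattern):
            recent = (recent + rng.choice("HT"))[-len(pattern):]
            t += 1
        total += t
    return total / trials

print(mean_hitting_time())  # should be close to 10
```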

22 November 27, 2019

We’ll start with two definitions of a reverse martingale:

Definition 199

A reverse martingale is a martingale "indexed by the nonpositive integers $\mathbb Z_{\le 0}$": that is, we have random variables $(M_n)_{n\le0}$ such that

• $M_n \in L^1(\mathcal F_n)$, where $\cdots \subseteq \mathcal F_{-2} \subseteq \mathcal F_{-1} \subseteq \mathcal F_0 \subseteq \mathcal F$, and

• the martingale property holds: $M_n = E(M_{n+1}|\mathcal F_n)$.

Equivalently, we can index with the nonnegative integers: the process is the usual $(M_n)_{n\ge0}$, except that $\mathcal F \supseteq \mathcal F_0 \supseteq \mathcal F_1 \supseteq \mathcal F_2 \supseteq \cdots$, and the martingale property reads $M_{n+1} = E(M_n|\mathcal F_{n+1})$. (We'll use this latter notation.)

One point is that if $\mathcal F \supseteq \mathcal F_0 \supseteq \cdots$, then $\mathcal F_n \downarrow \mathcal F_\infty = \bigcap_{n\ge0} \mathcal F_n$. Basically, intersections of $\sigma$-algebras are $\sigma$-algebras themselves (though unions may not be)!

The theory of reverse martingales is actually easier than that of normal martingales: note, for example, that $M_n = E(M_0|\mathcal F_n)$ for all $n$, by iterating the martingale property.

Theorem 200

If $(M_n)_{n\ge0}$ is a reverse martingale with respect to a filtration $\mathcal F_n \downarrow \mathcal F_\infty$, then $M_n \to M_\infty = E(M_0|\mathcal F_\infty)$ almost surely and in $L^1$.

Proof. Let $U_n(a, b)$ be the number of upcrossings of $[a, b]$ by the process $(M_n, \ldots, M_0)$ (going backwards in time, which gives us a normal martingale). Thus, we can use the Doob upcrossing inequality to show that
\[
EU_n(a, b) \le \frac{E(M_0 - a)_+}{b - a}.
\]
But $M_0$ doesn't change as $n$ grows, so this is bounded uniformly in $n$, and thus by the monotone convergence theorem, $EU_\infty(a, b) < \infty$. So by the same logic as before, $M_n$ must converge almost surely to some $M_\infty$.

Convergence in $L^1$ follows because $M_n = E(M_0|\mathcal F_n)$, which is a collection of conditional expectations of a single random variable: this family is uniformly integrable (Lemma 182) and therefore also converges in $L^1$.

Corollary 201

If $Y \in L^1(\Omega, \mathcal F, P)$ and $\mathcal F_n \downarrow \mathcal F_\infty$ (where all $\mathcal F_n \subseteq \mathcal F$), then $E(Y|\mathcal F_n)$ converges to $E(Y|\mathcal F_\infty)$ almost surely and in $L^1$.

Proof. $M_n = E(Y|\mathcal F_n)$ is a reverse martingale, so it converges almost surely and in $L^1$ to $M_\infty = E(M_0|\mathcal F_\infty) = E(E(Y|\mathcal F_0)|\mathcal F_\infty) = E(Y|\mathcal F_\infty)$ by the tower property, which is exactly what we want.

We’ll spend the rest of the lecture on zero-one laws.

Definition 202

Suppose we have random variables $(X_i)_{i\ge1}$ which are independent but not necessarily iid on a probability space $(\Omega, \mathcal F, P)$, and define
\[
\mathcal F_{n+} = \sigma(X_n, X_{n+1}, \ldots).
\]
Then the tail $\sigma$-algebra is $\mathcal T = \bigcap_{n\ge1} \mathcal F_{n+}$.

Example 203

Let $A$ be the event $\{X_n \ge 3 \text{ i.o.}\}$ (infinitely often). This is in the tail $\sigma$-algebra, because it doesn't depend on the first $k$ terms for any $k$.

Formally, the way to prove this is to write
\[
A = \bigcap_{n\ge1} \bigcup_{m\ge n} \{X_m \ge 3\},
\]
and because the inner union is decreasing as a function of $n$, this is also
\[
A = \bigcap_{n\ge\ell} \bigcup_{m\ge n} \{X_m \ge 3\} \in \mathcal F_{\ell+}
\]
for all $\ell$, meaning that $A \in \mathcal T$.

Theorem 204 (Kolmogorov 0-1 law)

On (Ω, F, P), suppose we have independent random variables (Xi )i≥1, and let T be their tail σ-algebra. Then T is trivial under P: that is, P(A) = 0 or 1 for all A ∈ T .

In other words, if an event doesn’t change if we change only finitely many coordinates, it either always happens or never happens.

Proof without martingales. Let $\mathcal F_k = \sigma(X_1, \ldots, X_k)$: this is independent of $\mathcal F_{m,n} = \sigma(X_m, \ldots, X_n)$ as long as $k < m \le n$. (In other words, as long as we're using disjoint sets of the $X_i$'s, the $\sigma$-algebras will be independent.) Now $\bigcup_{n\ge m} \mathcal F_{m,n}$ is a $\pi$-system generating $\mathcal F_{m+}$, so $\mathcal F_k$ is independent of $\mathcal F_{m+}$ by a $\pi$-$\lambda$ argument. (Let $\mathcal G$ be the collection of sets satisfying the independence property, and check that it's a $\lambda$-system.)

Now $\mathcal F_{m+}$ contains the tail $\sigma$-algebra, so $\mathcal F_k \perp\!\!\!\perp \mathcal T$ for all $k$. Next, $\bigcup_k \mathcal F_k$ is a $\pi$-system which generates $\mathcal F_{1+}$, so $\mathcal F_{1+}$ is independent of $\mathcal T$, which is another $\pi$-$\lambda$ argument. But $\mathcal F_{1+}$ contains $\mathcal T$, so $\mathcal T$ is independent of itself. Thus, for all $A \in \mathcal T$,
\[
P(A) = P(A \cap A) = P(A)^2,
\]
which forces $P(A) \in \{0, 1\}$.

Proof with forward martingales. Again, argue as above that Fk is independent of T for all k. Let A ∈ T , and consider the martingale

Mn = E(1A|Fn).

By the convergence theorem for uniformly integrable martingales (also known as the Lévy convergence theorem), Mn converges almost surely and in L^1 to

M∞ = E(1A|F∞) = 1A

because T ⊆ F∞. But each Mn is also equal to P(A), because the event A is independent of Fn. So P(A) = 1A almost surely; the left side is a number and the right side takes only the values 0 and 1, so P(A) ∈ {0, 1}.

The Kolmogorov 0-1 law is stated in terms of random variables, but we could also state it slightly differently: if we assume that

(Ω, F, P) = (∏_{i=1}^∞ Si, ⊗_{i=1}^∞ Si, ⊗_{i=1}^∞ µi)

has a product structure, then

Xi : Ω → Si, Xi((ωj)_{j=1}^∞) = ωi ∈ Si

just gives us the ith coordinate of ω. Then the law of Xi is exactly µi. The next result does require the coordinates to be iid, but it's stated similarly:

Definition 205
Say that we have a product space (equivalently, a sequence of iid random variables)

(Ω, F, P) = (∏_{i=1}^∞ S, ⊗_{i=1}^∞ S, ⊗_{i=1}^∞ µ).

The exchangeable σ-algebra E, with T ⊆ E ⊆ F, is the collection of events that are invariant under permutations of finitely many coordinates (bijections π : N → N changing only finitely many coordinates). Formally, if ω ∈ Ω = S^N, let ωπ be the tuple (ω_{π(i)})_{i=1}^∞, and for any A ∈ F, let Aπ be the set {ωπ : ω ∈ A}: A is permutable if A = Aπ for all finite permutations π. Then formally,

E = {A : A permutable}.

We will often denote

En = {A ∈ F : A = Aπ ∀π with π(i) = i ∀i > n}

to be those events invariant under permutations of the first n coordinates. Note that En ↓ E.

Example 206
T ⊆ E, because a permutation of finitely many coordinates only changes finitely many coordinates. They aren't the same set in general, though: consider

A = {∑_{i=1}^n Xi ∈ [9, 10] i.o.}.

This is exchangeable but not generally in T.

Theorem 207 (Hewitt-Savage 0-1 law)
On our product space

(Ω, F, P) = (∏_{i=1}^∞ S, ⊗_{i=1}^∞ S, ⊗_{i=1}^∞ µ),

the exchangeable σ-field E is P-trivial.

Proof without martingales. We’ll start with a useful fact:

Fact 208

On any (Ω, F, P), if Fi ↑ F∞ ⊆ F, then for all A ∈ F∞, we can find Ai ∈ Fi such that

P(A△Ai) → 0,

where △ denotes the symmetric difference. (The proof is a π-λ argument.)

So on our product space, take Fn to be the σ-algebra of sets of the form E1 × E2 × ··· × En × ∏_{i>n} S, where the Ei are measurable subsets of S. Then Fn increases to F by definition, so for all A ∈ F, there exists a sequence An ∈ Fn such that P(A△An) → 0.

Now let πn : N → N act on the first 2n coordinates by swapping k and n + k for all 1 ≤ k ≤ n, and let Bn = (An)πn. If A ∈ E, then

P(A△Bn) = P((A△Bn)πn) = P(A△An)

because the measure P is invariant under permutation of indices (when we have a product iid measure), and then by exchangeability (A = Aπn). On the other hand, An depends only on the first n coordinates, while Bn depends on coordinates n + 1 through 2n, so An ⊥⊥ Bn. Thus,

P(An ∩ Bn) = P(An)P(Bn) =⇒ P(A) = P(A)^2,

because both An and Bn approximate A (symmetric differences going to 0).

Proof with reverse martingales. Again, here’s a useful fact:

Fact 209
If G, H are any two sub-σ-fields of F, W ⊥⊥ H, and E(W|G) is H-measurable, then E(W|G) = E(W) is constant.

Proof. Let Y = E(W |G): by definition, this is G-measurable, and it’s also H-measurable by assumption. If we let

A = {Y − E(W ) ≥ ε} ∈ G ∩ H, then by definition of the conditional expectation (since A ∈ G),

E(W ; A) = E(Y ; A) ≥ (EW + ε) P(A).

But A ∈ H, and W is independent of H, so

E(W ; A) = E(W )P(A).

Putting these together, E(W )P(A) ≥ (E(W ) + ε)P(A), contradiction unless P(A) = 0. Do the same for the other bound (at most −ε) to show that Y = E(W ).

We will show that E ⊥⊥ Fk for all k, which implies that E ⊥⊥ F by a π-λ argument. Since F contains E, this shows that E is independent of itself and we're done as above. To show this, it suffices to show that for all bounded measurable functions φ : S^k → R, if we let W = φ(ω1, ···, ωk) ∈ Fk, then E(W|E) = E(W). (Independence follows because we can take W to be an indicator variable.)

Note that En ↓ E, so

E(W |En) → E(W |E) almost surely and in L1 by Theorem 200. But we can directly evaluate

E(W|En) = E[φ(ω1, ···, ωk)|En]:

the idea is that as long as n ≥ k, we average over all ordered k-tuples of distinct indices in [n],

E(W|En) = (1/(n)_k) ∑_{i ∈ [n]^k distinct} φ(ω_{i1}, ···, ω_{ik}),

where (n)_k = n(n−1)···(n−k+1). This average is measurable with respect to En, and permuting the first n coordinates doesn't change it, so this is indeed E(W|En).

And now W ∈ Fk, and we claim that

lim_{n→∞} E(W|En) is measurable with respect to F(k+1)+:

this is because the probability that the tuple (i1, ···, ik) intersects the first k coordinates goes to 0 as n → ∞ (it's at most k^2/n by a union bound), so the contribution from the first k coordinates makes no difference in the limit. On the other hand, we know that

lim_{n→∞} E(W|En) = E(W|E),

so E(W|E) is F(k+1)+-measurable while W ∈ Fk, and these σ-fields are independent of each other; the result follows by Fact 209.

Here’s a final application of the reverse martingale: remember the strong law of large numbers, which says that 1 Pn X,Xi iid with EX = µ implies that n i=1 Xi converges to µ almost surely. Here’s a shorter proof!

Proof. Define (taking Sn = X1 + ··· + Xn to be the usual partial sum)

Gn = σ(Sn, Sn+1, ···) ↓ G∞.

Let Mn = Sn/n, and note that

E(Mn|Gn+1) = E(Sn/n | Sn+1, Sn+2, ···) = E(Sn/n | Sn+1, Xn+2, Xn+3, ···).

Notice that Xn+2, Xn+3, ··· give no extra information about Sn/n, so this is just

E(Sn/n | Sn+1) = Sn+1/(n + 1) = Mn+1,

by symmetry of X1, ···, Xn+1 given Sn+1. So this is a reverse martingale with respect to Gn, and it converges to M∞ = E(M1|G∞) = E(X1|G∞) almost surely by Theorem 200. But G∞ is contained in E, because Gn is contained in En, so by the Hewitt-Savage 0-1 law we are conditioning on a σ-algebra in which every event has probability 0 or 1: thus E(X1|G∞) is just EX1 = µ.

23 December 2, 2019

Today is the last lecture before the exam, so we'll do some new material and have quite a bit of review. (There is no class on Wednesday, and the exam is in the evening.) First of all, recall that given finite measures µ, ν on (Ω, F), we say that ν is absolutely continuous with respect to µ (written ν ≪ µ) if for all A ∈ F with µ(A) = 0, we also have ν(A) = 0. If there is a measurable function f : Ω → [0, ∞) such that for all A ∈ F,

ν(A) = ∫_A f dµ = ∫_Ω f(ω) 1{ω ∈ A} dµ(ω),

then f is known as the Radon-Nikodym derivative of ν with respect to µ, denoted f = dν/dµ. An example of such a derivative is

dPθ/dP = exp(θ ∑_{i=1}^n Xi) / E(e^{θX})^n.

It's not guaranteed in general that the Radon-Nikodym derivative exists, but if it does, ν must be absolutely continuous with respect to µ, because given any set A with µ(A) = 0,

ν(A) = ∫_A f dµ = 0.

The converse holds, too:

Theorem 210 (Radon-Nikodym)
If µ, ν are finite measures on (Ω, F) and ν ≪ µ, then there exists a measurable f such that ν(A) = ∫_A f dµ for all A ∈ F.

Note that this f is unique µ-almost-surely: otherwise we could find a set of positive µ-measure where the integrals differ. And remember that we used this theorem (without proof) earlier in the class to construct the conditional expectation: if X ∈ L^1(Ω, F, P) and we have a sub-σ-algebra G ⊆ F, then we can assume without loss of generality that X is nonnegative, and define

E(X|G) = dν/dµ,

where

µ = P|G, ν(A) = E(X; A) for all A ∈ G.

These are two different measures, and ν is absolutely continuous with respect to µ, because A having measure 0 means the expected value of X restricted to A is 0. This is well-defined, because (1) it is G-measurable and (2) the conditional property follows from the way we defined µ and ν. Let's prove a few properties that are pretty obvious (and have actually already been used):

Lemma 211
If we have finite measures π ≪ ν ≪ µ on (Ω, F), then dπ/dµ = (dπ/dν)(dν/dµ), µ-almost-surely.

Proof. Note that for all nonnegative measurable functions h,

∫ h dν = ∫ h (dν/dµ) dµ.

This is true because it's true for indicator functions h = 1A (by definition of the Radon-Nikodym derivative), then it's true for simple functions by linearity, and then we can approximate from below and use the monotone convergence theorem for general measurable functions. Now for any measurable A,

π(A) = ∫_A (dπ/dν) dν,

and dπ/dν is a nonnegative measurable function, so we can use the above equality to get

π(A) = ∫_A (dπ/dν)(dν/dµ) dµ.

But the Radon-Nikodym derivative is unique µ-almost-surely, so (dπ/dν)(dν/dµ) must agree µ-almost surely with dπ/dµ, as desired.

Lemma 212
If ν ≪ µ and µ ≪ ν, then dµ/dν = (dν/dµ)^{−1} almost surely under both µ and ν.

Proof. This follows directly from Lemma 211 by taking π = µ or π = ν.

The full proof of the Radon-Nikodym theorem is in Appendix A.4 of Durrett, and it's a bit outside the scope of the class (so we won't cover it). However, there is one case which is easy to prove! Suppose Ω = ⨆_{i=1}^∞ Ωi can be partitioned into countably many pieces such that F = σ(Ω1, Ω2, ···). Say that µ, ν are finite measures on Ω with ν ≪ µ: now we can just define

dν/dµ (ω) = f(ω) = ν(Ωi)/µ(Ωi) when ω ∈ Ωi and µ(Ωi) > 0, and f(ω) = 0 (or any number at all) when ω ∈ Ωi and µ(Ωi) = 0.

(The second case has µ-measure zero, so it doesn't affect µ-almost-sure uniqueness.) We can indeed check both properties of the Radon-Nikodym derivative, and this proves the theorem in this simple case (with a countable partition). A more general situation is one where we can approximate the σ-algebra with partitions of the above kind: consider

(Ω, F), where Fn ↑ F and each Fn is generated by a countable partition of Ω. For example, if we take Ω = R and Fn = σ([i/2^n, (i+1)/2^n) : i ∈ Z), then Fn actually increases to the entire Borel σ-algebra B_R (because B_R is generated by open intervals, and we can approximate open intervals by unions of these dyadic intervals).
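The countable-partition construction is completely explicit, so it can be checked mechanically. A minimal sketch (the four-block partition and the block masses are hypothetical choices): f is constant on each block, and integrating f against µ over any union of blocks recovers ν.

```python
from fractions import Fraction

# Masses of the partition blocks Omega_1, ..., Omega_4 under mu and nu.
mu = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
nu = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)]

# dnu/dmu is constant on each block: f = nu(Omega_i) / mu(Omega_i).
# (No block is mu-null here, so the second case of the definition is not needed.)
f = [n / m for n, m in zip(nu, mu)]

def integrate_f(blocks):
    """nu(A) recovered as the integral of f over A = union of the given blocks."""
    return sum(f[i] * mu[i] for i in blocks)

# Check nu(A) = int_A f dmu on a couple of measurable sets.
assert integrate_f({0, 2}) == nu[0] + nu[2]
assert integrate_f({0, 1, 2, 3}) == 1
print([float(x) for x in f])
```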

Now suppose that µ, ν are finite measures on (Ω, F) such that ν ≪ µ, and let µn, νn be the restrictions of µ, ν to Fn. These also satisfy νn ≪ µn, so dνn/dµn exists by the above explicit construction. It's then natural to ask whether dνn/dµn → dν/dµ.

Remark 213. Here's an example where νn ≪ µn for every n, but where we don't have ν ≪ µ: take µ to be the law of an infinite product of Bernoulli(p) variables and ν the law of an infinite product of Bernoulli(q) variables, with p ≠ q. We do have νn ≪ µn for any n, but by the strong law of large numbers, ν is carried by the event that the empirical frequency of 1s is q, which has µ-measure zero: so ν ≪ µ fails.
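This example is easy to watch numerically. Under µ, the likelihood ratio Xn = dνn/dµn = ∏_{i≤n} (q/p)^{ωi} ((1−q)/(1−p))^{1−ωi} is a nonnegative martingale with mean 1, yet when p ≠ q it converges to 0 almost surely. A minimal simulation sketch (p = 0.5 and q = 0.6 are arbitrary choices):

```python
import random

p, q = 0.5, 0.6  # mu = product of Bernoulli(p), nu = product of Bernoulli(q)
rng = random.Random(1)

def likelihood_ratio(n):
    """X_n = d(nu_n)/d(mu_n) evaluated along a sequence sampled from mu."""
    x = 1.0
    for _ in range(n):
        omega = 1 if rng.random() < p else 0
        x *= (q / p) if omega else ((1 - q) / (1 - p))
    return x

# E_mu[X_n] = 1 for every n, yet X_n -> 0 mu-almost surely when p != q.
x_n = likelihood_ratio(20_000)
print(x_n)
```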

Lemma 214
Let µ, ν be probability measures on (Ω, F) (rescaling from finite measures), where we're not assuming ν ≪ µ. Suppose that Fn ↑ F, and define µn = µ|Fn, νn = ν|Fn. Suppose νn ≪ µn, so we can define Xn = dνn/dµn. Then Xn is a martingale with respect to Fn on (Ω, F, µ).

Proof. Let E denote expectation with respect to µ: Xn is Fn-measurable by definition, because it only depends on νn and µn, and

EXn = ∫ (dνn/dµn) dµ = ∫ (dνn/dµn) dµn = νn(Ω) = 1,

so Xn ∈ L^1(Fn). And now for all A ∈ Fn,

E(Xn+1; A) = ∫_A Xn+1 dµ = ∫_A Xn+1 dµn+1 = νn+1(A)

by the definition of the Radon-Nikodym derivative, and this is equal to ν(A), which is the same as νn(A), which is therefore the same as

νn(A) = ∫_A Xn dµn = E(Xn; A),

showing the conditional expectation property.

With this, we can answer our convergence question:

Theorem 215

Let µ, ν be probability measures on (Ω, F), not assuming absolute continuity. Again, suppose Fn ↑ F, µn = µ|Fn, νn = ν|Fn, and say that νn ≪ µn with Xn = dνn/dµn. Then X = lim sup_{n→∞} Xn satisfies

ν(A) = ∫_A X dµ + ν(A ∩ {X = ∞}).

The first part is denoted νcontinuous(A) (it is absolutely continuous with respect to µ), and the second is νsingular(A) (singular, meaning it is carried by a set of µ-measure zero). Here, Xn is a nonnegative martingale under the measure µ, so it converges almost surely to a finite limit under µ, but it may not do so under the measure ν, which explains the "boundary" second term! Also, note that

µ(X = ∞) = 0

by this almost-sure convergence,

but we’re just saying that ν(X = ∞) can be positive. Once we prove this, we’ll have proved the Radon-Nikodym theorem in the special case where we increase with

countable partitions Fn ↑ F (since the singular term is just zero if we already know that ν  µ).

Proof. The key is to introduce a measure that interpolates between µ and ν: let ρ = (µ + ν)/2, and let ρn = ρ|Fn = (µn + νn)/2. By assumption, νn ≪ µn ≪ ρn (because ρn is a mixture of µn and νn, any event with measure zero under ρn must have measure zero under both µn and νn), and thus we can define the derivatives

Xn = dνn/dµn, Yn = dµn/dρn, Zn = dνn/dρn.

Note that Yn + Zn = 2, ρn-almost-surely, and by Lemma 214, Yn and Zn are Fn-martingales on (Ω, F, ρ). Since they are nonnegative, they converge to finite limits Y, Z ρ-almost-surely. Since Yn + Zn = 2, we must have Y + Z = 2, ρ-almost-surely.

Now we claim that Y = dµ/dρ: the nice thing about Y and Z is that they are both bounded, so we have stronger convergence results! Note that if A ∈ Fℓ ⊆ Fn, then

µ(A) = µn(A) = ∫_A Yn dρn = ∫_A Yn dρ,

and this last expression converges to ∫_A Y dρ by the bounded convergence theorem. (This step doesn't necessarily work for X!) Thus, for any A ∈ Fℓ, we have µ(A) = ∫_A Y dρ, and now we want to conclude this holds for all A ∈ F: this is just a π-λ argument. The same reasoning works for Z, so now we know that Y = dµ/dρ, Z = dν/dρ. Now νn ≪ µn ≪ ρn, so

dνn/dρn = (dνn/dµn)(dµn/dρn) =⇒ Zn = Xn Yn

ρ-almost surely, so

Xn = Zn/Yn ∈ [0, ∞],

where the endpoints are only possible when Yn = 0 or Zn = 0. Then X = lim sup_{n→∞} Xn is just Z/Y, and now ν ≪ ρ with derivative Z, so we can write down

ν(A) = ∫_A Z dρ.

To put this into the form we want, note that X = ∞ corresponds to Y = 0, so

ν(A) = ∫_A Z [W Y + 1{Y = 0}] dρ,

where W = 1/Y when Y > 0 and W = 0 otherwise (so the bracketed term is 1, ρ-almost surely). Splitting this up,

ν(A) = ∫_A Z W Y dρ + ∫_A Z 1{Y = 0} dρ,

and now the first term is (because ZW = X on the event that Y is positive, and Y is the derivative dµ/dρ)

∫_A X dµ

(here, we would have to keep track of the indicator 1{X < ∞}, but switching to integrating under µ-measure makes X finite almost surely!), while the second term simplifies to (since {Y = 0} = {X = ∞} and ν has derivative Z with respect to ρ)

ν(A ∩ {X = ∞}),

as desired.

24 December 9, 2019

There won’t be any more assignments for this class: in the last week, we’ll cover a few topics of general interest. Today, we’ll go over the proof of the Kesten-Stigum L log L criterion: the initial work comes from various papers by Kesten and Stigum (including [5]), and many of the ideas covered in this lecture come from [6].

We’ll start with a review of the setup: say that ξ ∼ p is a probability distribution supported on Z≥0, so ξ is the

L in the theorem statement. Define an array ξn,i iid from the law p, and define the Galton-Watson tree by starting

from a root vertex o (at depth 0), where the ith vertex at depth (n − 1) has ξn,i children at depth n. This generates

a finite or infinite tree, and Fn = σ(ξ`,i : i ≥ 1, 1 ≤ ` ≤ n) encodes all the information about the tree.

Let Zn be the population of the tree at depth n, which satisfies the recursive equation

Zn = ∑_{i=1}^{Zn−1} ξn,i.

Assume that E[ξ] = m ∈ (1, ∞), so we’re in the supercritical case. We know from an earlier discussion that

P(extinction) = P(Zn = 0 eventually) = ρ ∈ [0, 1),

so there is a positive probability that the population never dies out. In addition,

Wn = Zn/m^n

is a nonnegative martingale, so it converges almost surely to a limit W ∈ [0, ∞). The probability that W = 0 satisfies the same fixed point equation as the extinction probability, so P(W = 0) ∈ {ρ, 1}. If it's 1, then for large n we know (almost surely) that Zn is of smaller order than m^n; and if it's ρ, then P(W = 0 | nonextinction) = 0 (because whenever extinction occurs, W = 0 as well), so conditioned on nonextinction, W is a positive finite number almost surely, meaning that Zn grows like a constant times m^n. Kesten-Stigum then tells us when each of these cases occurs:
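A minimal simulation sketch of this dichotomy in the "nice" regime: the offspring law is chosen here as Poisson with mean m = 2 (which certainly satisfies E[ξ log ξ] < ∞). Across many runs, Wn = Zn/m^n averages to about EW = 1, and the fraction of runs with Wn = 0 approximates the extinction probability ρ ≈ 0.203 (the root of ρ = e^{2(ρ−1)}).

```python
import math, random

m = 2.0  # supercritical: mean offspring m > 1
rng = random.Random(2)

def poisson(lam):
    """Knuth's inversion sampler for a Poisson(lam) variable."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

def w_n(depth=10):
    """Run the Galton-Watson process and return W_n = Z_n / m^n at depth n."""
    z = 1
    for _ in range(depth):
        z = sum(poisson(m) for _ in range(z))
        if z == 0:
            return 0.0  # extinct, so W = 0
    return z / m ** depth

runs = [w_n() for _ in range(200)]
avg_w = sum(runs) / len(runs)                           # near E[W] = 1
frac_extinct = sum(w == 0.0 for w in runs) / len(runs)  # near rho ~ 0.203
print(avg_w, frac_extinct)
```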

Theorem 216 (Kesten-Stigum)

Suppose (Zn)_{n≥0} is a supercritical Galton-Watson tree with offspring law ξ ∼ p, where Eξ = m ∈ (1, ∞). Let Wn = Zn/m^n → W (almost surely). Then the following are equivalent:

• P(W = 0) = ρ,

• EW = 1,

• E[ξ log+ ξ] is finite, where log+ assigns 0 to 0.

On our homework, we showed that if E(ξ^2) < ∞, then the martingale is bounded uniformly in L^2 norm, so the L^2 martingale convergence theorem says that Wn converges in L^2 to W, and hence also in L^1. This means EW = 1, so the probability that W = 0 can't be 1 (and thus it must be ρ). So this theorem is a stronger version of that result!

Here’s something else from our homework: one thing we’ll use in the proof is that if Y,Yi are iid nonnegative random variables, then  Yn 0 EY < ∞ lim sup = n n→∞ ∞ EY = ∞.

(The proof is that for any x > 0, ∑_{n=1}^∞ P(Yn/n ≥ x) is sandwiched between E⌊Y/x⌋ and E⌈Y/x⌉, so these sums are both finite or both infinite: thus we can use Borel-Cantelli.) As an easy consequence, if Y, Yi are nonnegative integer valued random variables, then for any c > 1,

lim sup_{n→∞} Yn/c^n = 0 if E[log+ Y] < ∞, and = ∞ if E[log+ Y] = ∞.

(Transform the previous result: the only thing to worry about is where Yn = 0, but that changes the value by at most 1.) With this, we can consider a branching process with immigration: define the random variables

Xn = ∑_{i=1}^{Xn−1} ξn,i + Yn,

where Y, Yi are iid and count the individuals that immigrate at level n.

Fact 217

On our homework, we showed that if E(log+ Y) = ∞, then lim sup_{n→∞} Xn/c^n is infinite for all c > 1, and if E(log+ Y) < ∞, then Xn/m^n converges to a finite limit in (0, ∞). The first case is easy by the previous discussion, and the second can be shown using the submartingale convergence theorem.

We’ll spend the rest of today linking the two together! One key component comes from an exam question and was also proved last time:

Theorem 218

Let µ, ν be finite measures on (Ω, F), and suppose Fn ↑ F is a filtration. Denote µn = µ|Fn and similarly for ν, and suppose that for all finite n, νn ≪ µn. Then if we denote

Wn = dνn/dµn, W = lim sup_{n→∞} Wn,

we can decompose ν into a continuous and a singular part:

ν(A) = ∫_A W dµ + ν(A ∩ {W = ∞}) = νcontinuous(A) + νsingular(A).

There are two extreme cases of this: first of all, µ(W = 0) = 1 is equivalent to νcontinuous = 0, i.e. ν = νsingular ⊥ µ, which implies ν(W = ∞) = 1 (take A to be everything). On the other hand, ∫_Ω W dµ = 1, meaning that νcontinuous(Ω) = 1, is equivalent to ν ≪ µ, and then ν(W = ∞) = 0.

To prove Theorem 216, we'll apply this with Ω = {rooted graphs} (actually trees) and Fn = σ(tree truncated at depth n). Ω can be made into a Polish space (also on our homework), and Fn increases to F, the Borel σ-algebra on this metric space. Then the measures we'll use are

µ = Galton-Watson measure with offspring law p

(just generate a Galton-Watson tree at random), and a measure ν, yet to be defined, such that dνn/dµn = Wn = Zn/m^n. We'll then proceed by looking at this process under the measure ν!

In principle, this already pins down a well-defined measure ν, since we are specifying its restriction νn for every n. But ν is a bit confusing: it's not clear what it looks like, so we don't really know how to analyze a statement like ν(W = ∞) = 0. So let's start with a few preliminary steps!

First of all, if Tn is the tree T truncated at depth n, then ν(Tn ∈ B) = νn(Tn ∈ B) = ∫ 1{Tn ∈ B} (Zn/m^n) dµn by the definition of the Radon-Nikodym derivative. In particular, note that for all n,

ν({T dies by level n}) = ∫ 0 dµn = 0,

because Zn is zero on this event. Thus ν(extinction) = 0, so under ν, the tree survives forever (while this is not true under µ).

Also, it’s easy to define ν1: dν1 ξ1,1 = W1 = , dµ1 m

so we bias the original measure by the number of offspring at the first level: ν1 is the law of T1 with a size-bias kp pˆ = k ∀k ≥ 1. k m
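Size-biasing is a completely mechanical operation on the offspring law. A minimal sketch with a hypothetical p supported on {0, 1, 2, 3}; note that p̂ assigns no mass to 0, which is why the marked path in the construction that follows never dies out.

```python
# Hypothetical offspring law p on {0, 1, 2, 3}.
p = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}
m = sum(k * pk for k, pk in p.items())  # offspring mean

# Size-biased law p_hat_k = k * p_k / m, supported on k >= 1.
p_hat = {k: k * pk / m for k, pk in p.items() if k >= 1}

assert abs(sum(p_hat.values()) - 1.0) < 1e-12  # p_hat is a probability law
print(m, p_hat)
```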

More generally, here's a description of νn: let νn* be a probability measure on pairs (Tn, xn), where Tn is a tree of depth n and xn is a vertex at depth n in Tn. (Basically, we point out a distinguished vertex at the nth level.) Define this via

νn*(Tn, xn) = µn(Tn)/normalization,

where the normalization factor is

∑_{Tn, xn} µn(Tn) = ∑_{Tn} µn(Tn) ∑_{xn} 1 = ∑_{Tn} µn(Tn) Zn(Tn) = m^n,

because E Zn = m^n is the expected number of vertices at the nth level under the ordinary Galton-Watson measure. Now let ν̃n be the marginal of νn* on Tn (where we just forget what xn is): then

ν̃n(Tn) = ∑_{xn} νn*(Tn, xn) = ∑_{xn} µn(Tn)/m^n = µn(Tn) Zn(Tn)/m^n.

Then

dν̃n/dµn = Zn/m^n = Wn,

so ν̃n is indeed the νn that we've been looking for! So we get a tree with a marked vertex, and then we forget the position of that marked vertex. Nothing here seems to actually help yet: we still don't have ν. But the key observation is the following: take (Tn, xn) ∼ νn*, and think of this as Tn with the (unique) marked path ζn from the root o to xn. We can then sample the marked tree (Tn, ζn) level-by-level along the path as follows:

• Start from the root: it generates ξ̂ ∼ p̂ children. Choose one of them uniformly at random, x1, to be on the marked path.

• For all other children except x1 (vertices off the marked path), generate children according to the original law p.

• x1 generates children according to pˆ: choose one of them, x2, to be on the marked path.

• Repeat this: anything not on the marked path generates children according to p.

We can continue this process indefinitely, and this will produce an infinite tree with a marked ray. The tree is guaranteed to be infinite because the size-biased law is supported on the positive integers, so an xi will always exist! So this produces an infinite tree (T, ζ) with marked ray ζ; call its law ν*.

Why is this the same law as that of (Tn, xn)? We need to prove that

ν*|Fn = νn* ∀n,

where the right hand side is defined (as above) via µn(Tn)/m^n. Denote the left hand side by νn◦: the sampling procedure implies that

νn◦(Tn, ζn) = (k pk/m) · (1/k) · [∏_{x ≠ x1 at depth 1} µn−1(T^(x)n−1)] · νn−1◦(T^(x1)n−1, ζn−1),

where k is the number of children of the root, because:

• We generate k children with the size-biased law (probability k pk/m).

• We pick one of the k at random to be x1 (the factor 1/k).

• We grow an ordinary Galton-Watson tree below each child x ≠ x1 off the marked path.

• We continue this sampling procedure up to level (n − 1) below the marked vertex x1.

Indeed, νn◦ satisfies the same recursion as νn*: the latter looks like

νn*(Tn, ζn) = µn(Tn)/m^n = (pk/m^n) ∏_{x at depth 1} µn−1(T^(x)n−1)

(where k is the root degree), because apart from the m^n factor this is just the ordinary Galton-Watson probability. This can be rewritten as

= (pk/m) [∏_{x ≠ x1} µn−1(T^(x)n−1)] · µn−1(T^(x1)n−1)/m^{n−1},

and since (k pk/m) · (1/k) = pk/m while µn−1(T^(x1)n−1)/m^{n−1} = νn−1*(T^(x1)n−1, ζn−1), this is the same recursion as for νn◦. Since the two measures also agree at n = 1, we can prove by induction that they are indeed equal! So now we have a measure ν* on (T, ζ), and we can just forget ζ: let ν be the marginal of ν* on T. Indeed, if we restrict ν* to Fn, we get νn*, whose marginal on Tn is ν̃n = νn, exactly as we want.

Now note that if (T, ζ) ∼ ν*, then T \ ζ is a branching process with immigration. This is because the other

children coming off of ζ (that is, the siblings of the marked children) are “immigrating” in as Y1,Y2, ··· ! Then if

X1, X2, ··· are the numbers of vertices at levels 1, 2, ···, excluding the marked ray, then

Xn = ∑_{i=1}^{Xn−1} ξn,i + Yn.

Xn + By Fact 217, we understand the behavior of lim sup mn : it depends on E[log Y ]. But the law here is [ξ log(ξ − 1)] [log+ Y ] = [log+(ξˆ− 1)] = E , E E m

where ξˆ is the size-biased distribution. And this is finite if and only if E[ξ log ξ] is finite!

So in conclusion, Ω is the space of rooted graphs, Fn are the σ-algebras of the tree up to level n, µ is the Galton-Watson law with offspring law ξ ∼ p, ν* is the law on pairs (T, ζ), and ν is the marginal of ν* on T. For finite n, µn, νn are the restrictions of µ, ν to Fn, and dνn/dµn = Wn = Zn/m^n. And now if (T, ζ) ∼ ν*, then T \ ζ is a branching process with immigration law Y = ξ̂ − 1, so

E[ξ log+ ξ] < ∞ ⇐⇒ E[log+ Y] < ∞ ⇐⇒ ν*(lim_{n→∞} Xn/m^n < ∞) = 1

by Fact 217. Now Xn counts everything at the nth level except the single marked vertex, and 1/m^n is negligible: thus, this happens if and only if (we can also replace ν* with ν)

ν(lim_{n→∞} Wn < ∞) = 1 ⇐⇒ ν ≪ µ, ∫ W dµ = 1,

where the last step comes from the extreme cases of Theorem 218. (This implies that W is not almost surely 0 under µ.) On the other hand,

E[ξ log+ ξ] = ∞ ⇐⇒ E[log+ Y] = ∞ ⇐⇒ ν*(lim sup Xn/m^n = ∞) = 1 ⇐⇒ ν(lim sup Wn = ∞) = 1,

which happens exactly when ν is singular with respect to µ, meaning that µ(W = 0) = 1.

25 December 11, 2019

Today, we’ll cover the ζ(2) limit in the random assignment problem. We’ll use some of the techniques we’ve seen in this class, and the results are very beautiful! The following two formulations of the problem are equivalent:

Problem 219 (Version 1)
Let C ∈ R^{n×n} be a random matrix with iid entries distributed as n · Exp (so mean n). Define

An = (1/n) min_{σ ∈ Sn} ∑_{i=1}^n C_{i,σ(i)}.

(We choose the rook-placement such that the total sum of those entries is minimized, and scale by a factor of n.) What does this quantity behave like: what's E An?

Problem 220 (Version 2)
Consider a complete bipartite graph Gn,n (so we have n vertices on the left and n on the right, and we draw all n^2 edges between the left and the right). Think of the left side as the rows and the right side as the columns, so Cij is the weight of the edge connecting i on the left to j on the right (with the weights defined as above). Here, let

An = (1/n) min_M ∑_{i=1}^n C_{i,M(i)},

where M ranges over perfect matchings (a subset of the edges such that every vertex is adjacent to exactly one of those edges).

Here’s the result (from [7]) that we’ll be covering today:

Theorem 221 (Aldous)
We have

lim_{n→∞} E[An] = ∑_{i=1}^∞ 1/i^2 = ζ(2) = π^2/6 ≈ 1.645.

This answer was first conjectured in 1987 by Mezard and Parisi, and previous work showed

1 + 1/e < 1.51 ≤ lim inf E An ≤ lim sup E An ≤ 1.94 < 2.

Various "famous people" worked on these bounds – many people cared about the answer to this problem! The fact that the lim inf and lim sup are equal came from earlier work by Aldous in [8], and further work (see [9] and [10]) showed the exact formula

E An = ∑_{i=1}^n 1/i^2.

We'll just show the original proof that the limit is π^2/6, but we should note that improvements in [11] have further simplified the proof.
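For small n, the minimum over Sn can be taken by brute force, which makes the theorem easy to probe numerically. With mean-1 exponential entries the n · Exp scaling and the 1/n normalization cancel, so An is just the unscaled minimum assignment cost. A minimal sketch (n = 5 and the trial count are arbitrary; for large n one would use the Hungarian algorithm rather than enumerating permutations); by the exact formula above, E[A5] = ∑_{i=1}^5 1/i^2 ≈ 1.464.

```python
import itertools, random

rng = random.Random(3)

def min_assignment(n):
    """Minimum assignment cost for an n x n matrix of iid Exp(1) entries."""
    c = [[rng.expovariate(1.0) for _ in range(n)] for _ in range(n)]
    return min(sum(c[i][sigma[i]] for i in range(n))
               for sigma in itertools.permutations(range(n)))

n, trials = 5, 1000
avg = sum(min_assignment(n) for _ in range(trials)) / trials
exact = sum(1 / i**2 for i in range(1, n + 1))  # = E[A_n], about 1.464
print(avg, exact)
```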

First of all, notice that there’s no scaling in the An value, so we’ll start by explaining that. If we look at this problem

via the bipartite graph Gn,n with weights Cij ∼ n · Exp, then for any given matching M, define the cost

cost(M) = (1/n) ∑_{i=1}^n C_{i,M(i)}.

We want to understand the minimum cost of a matching, but note that for any deterministic M, the expected value of the cost is

(1/n) ∑_{i=1}^n n = n.

independent random variables C1, ··· ,Cn. By definition,

−t/n P(Ci ≥ t) = e ∀t ≥ 0, so the probability that the minimum is at least t is

  n  −t/n −t P min Ci ≥ t = e = e . 1≤i≤n

This is the cdf of a standard exponential variable! Let C(1) be the smallest of the Ci s, C(2) be the second smallest, and so on. Since the exponential distribution is memoryless, meaning that

P(C ≥ s + t|C ≥ s) = P(C ≥ t), let’s condition on the value of C(1) = min1≤i≤n Ci . Then we have n − 1 random variables left, and we know they’re all exponential and larger than Cmin, so memoryless property tells us that

d n C = C + · Exp. (2) (1) n − 1 Repeating this process yields that (if we sort our vector)   d n n (C , ··· ,C ) = E ,E + E , ··· ,E + E + ··· + nE . (1) (n) 1 1 n − 1 2 1 n − 1 2 n

As n → ∞, this converges to (E1,E1 + E2,E1 + E2 + E3, ··· ), where the Ei are iid exponentials. This is a collection of points on the line equivalent to the Poisson point process of rate 1 (we start from 0, and we go to the right by Ei on the ith step. Then the number of points Ei inside an interval [a, b] satisfies a Poisson distribution). So in summary, if we look at the edges incident to a specific vertex i, the first 100 smallest edges will be of constant order as n gets large! So it’s pretty reasonable to assume constant order for each Ci,M(i). This is just a heuristic, though, and what we’ve said doesn’t really imply that the limit of EAn should go to a constant. So we have to be more specific:

Definition 222

The Poisson-weighted infinite tree is constructed as follows: let Π be the random collection of points (E1, E1 + E2, E1 + E2 + E3, ···) as above, where the Ei are iid exponential. Each vertex in the tree has infinitely many children, indexed by N, and the weights of the edges to the children of each vertex are given by an independent copy of Π. (So the first edge out of each vertex is the lightest, and so on.)

We define a matching M on this infinite tree in the same way as on our graph: it's a subset of the edges such that every vertex is covered exactly once. If we think about the graph Gn,n, notice that it converges locally weakly to the random tree T: we did the calculation for depth 1 above, and there's a lot of independence after that! Then the main idea of Aldous' proof is to consider the random bipartite graph (Gn,n, M*), where M* is the best (lowest-cost) matching. Since Gn,n converges locally weakly to T, and M* is just a collection of 1s and 0s (telling us whether edges are included), it seems reasonable to expect

(Gn,n, M*) → (T, M*)

in the local weak sense. Trees are easier to analyze than complete bipartite graphs, so that motivates using the infinite tree instead!

Remark 223. Here’s one way to get a matching on the tree: pick the lightest edge from the root. For all of the children the root does not connect to, connect those children to their lightest child, and for the connected child, connect all of its children to their lightest child. (Basically, pick the lightest edge whenever available.)

This gives some matching Mˆ , and in whatever sense we take the average (truncating at level n or whatever), the ˆ π2 average cost of M is 1, because all lightest edges are Exp(1). Since 1 is less than 6 , something has gone wrong. The point is that M∗ has to arise from a local weak limit process, which is not true here! (Any local weak limit should be spatially invariant: the choice of root of our tree should not play any special role in whether an edge belongs in the matching, by the definition of local weak convergence.) We’ll use the next result without proof:

Theorem 224 (Aldous)
We have

lim_{n→∞} E An = c* = inf{E W_{φ,M(φ)} : spatially invariant M},

where W_{φ,M(φ)} denotes the weight of the edge from the root φ to M(φ).

Basically, consider any spatially invariant matching M of our infinite tree, and define its average cost to be the expected cost of the edge at the root (because of spatial invariance, all edges are equivalent here)! One direction of this, showing that lim inf E An ≥ c*, is more straightforward, because we can use a local weak limit argument: as n → ∞, we can show that a subsequence of (Gn,n, M*) converges locally weakly to (T, M) for some spatially invariant M. Then the average cost for M must be at least c*, because c* is optimal. In other words, this is some kind of compactness argument! The other direction is more difficult, though, because we have to produce a good approximation of the ideal c* on a finite graph.

So for the rest of this lecture, we'll show that c* = π^2/6. Here's a heuristic that the paper also starts off with: we're going to subtract infinity from itself a bit, so there will be some issues. Let cost(T) be the minimum cost of a matching on T, where T is this Poisson-weighted infinite tree. (Obviously every matching has infinite total weight, but ignore that for now.) Then we can write a recursion based on what happens at the first level:

cost(T) = min_{j≥1} [W_{φ,j} + ∑_{i≠j} cost(T^(i)) + cost(T^(j) \ j)].

Basically, pay the weight of the edge φ, j connected to the root, plus the cost from subtrees T (i) at level 1, and then we can’t use vertex j itself for that subtree. And now if we remove the root vertex, we can get a lot of cancellation:

$$\mathrm{cost}(T) - \mathrm{cost}(T \setminus \phi) = \min_{j \ge 1}\Big\{W_{\phi, j} - \big[\mathrm{cost}(T^{(j)}) - \mathrm{cost}(T^{(j)} \setminus j)\big]\Big\},$$

because the cost of T ∖ φ is the sum of the costs of the subtrees T^{(i)}. If we call the left-hand side X_φ and the bracketed term on the right X_j, we get

$$X_\phi = \min_{j \ge 1}\,\big[W_{\phi, j} - X_j\big].$$

But remember that we're looking for a matching which is spatially invariant, so if we look at the X's by themselves, X_φ and X_j should agree in distribution.

Lemma 225
Let 0 < ζ_1 < ζ_2 < ⋯ ∼ Π be a rate-1 Poisson point process (so that (ζ_1, ζ_2, …) is equal in distribution to (E_1, E_1 + E_2, …) for iid Exp(1) variables E_i). Let X, X_i be iid from a law µ (a probability measure on R). Then

$$X \overset{d}{=} \min_{i \ge 1}\,(\zeta_i - X_i)$$

holds if and only if µ is the logistic distribution, with density $f(x) = \frac{1}{(e^{x/2} + e^{-x/2})^2}$ and cdf $F(x) = \frac{e^x}{1 + e^x}$.

(This is proved by calculus.)
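The fixed-point property can also be checked by Monte Carlo (a quick sketch, assuming numpy; truncating the point process to K points is an approximation, harmless since far-away points essentially never achieve the minimum): sample ζ from a rate-1 Poisson process, X_i iid logistic, and compare the law of min_i(ζ_i − X_i) with the logistic cdf.

```python
import numpy as np

rng = np.random.default_rng(1)
samples, K = 20000, 60   # keep K points per process; later points can't win the min

# zeta_i = E_1 + ... + E_i: a rate-1 Poisson point process on (0, infinity).
zeta = rng.exponential(1.0, size=(samples, K)).cumsum(axis=1)

# X_i iid logistic, sampled by inverting the cdf: F^{-1}(u) = log(u / (1 - u)).
u = rng.uniform(size=(samples, K))
X = np.log(u / (1 - u))

Y = (zeta - X).min(axis=1)

# Empirical cdf of Y versus the logistic cdf F(x) = e^x / (1 + e^x).
for x0 in (-2.0, 0.0, 2.0):
    print(x0, (Y <= x0).mean(), 1 / (1 + np.exp(-x0)))
```

The empirical cdf of the minimum should track the logistic cdf at every point, consistent with the "only if" direction of the lemma.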

Lemma 226
Let X, X_i be iid from µ, the logistic distribution. For x ≥ 0, let

$$h(x) = \mathbb{P}(X_1 + X_2 \ge x).$$

Then
$$\int_0^\infty h(x)\,dx = 1, \qquad \int_0^\infty x\,h(x)\,dx = \frac{\pi^2}{6}.$$
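These two integrals can also be verified numerically (a sketch, assuming numpy; the grids and the truncation of the infinite integrals are ad hoc choices):

```python
import numpy as np

def trapezoid(vals, grid, axis=-1):
    """Composite trapezoid rule along the given axis."""
    vals = np.moveaxis(vals, axis, -1)
    return ((vals[..., 1:] + vals[..., :-1]) / 2 * np.diff(grid)).sum(axis=-1)

# Logistic density and cdf from Lemma 225.
f = lambda t: 1.0 / (np.exp(t / 2) + np.exp(-t / 2)) ** 2
F = lambda t: 1.0 / (1.0 + np.exp(-t))

# h(x) = P(X1 + X2 >= x), written as the convolution integral of f(y)(1 - F(x - y)).
y = np.linspace(-50, 50, 2001)
x = np.linspace(0, 80, 1601)
h = trapezoid(f(y) * (1 - F(x[:, None] - y)), y, axis=1)

print(trapezoid(h, x))       # should be close to 1
print(trapezoid(x * h, x))   # should be close to pi^2/6 ≈ 1.6449
```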

This follows directly from Lemma 225: once we know the distribution function, everything is a straightforward calculation. How does this relate to the matching problem? Going back to the boxed recursion above, we can guess that the weight of the edge next to the root in the optimal matching is

$$W_{\phi, M^*(\phi)} \overset{d}{=} \zeta_{i^*}, \qquad \text{where } i^* = \operatorname{argmin}_{i \ge 1}\,(\zeta_i - X_i).$$

So now we just need to calculate E ζ_{i∗}, and hopefully that's the number we want. There are various ways to do this; the paper does the following.

Consider the process (ζ_i, X_i): these are points scattered on the right half-plane (call the ζ-axis z and the X-axis x). Remember that the number of ζ_i in any interval [a, b] is a Poisson random variable with mean b − a; likewise, the number of points (ζ_i, X_i) falling in any region of the plane is a Poisson random variable. Specifically, (ζ_i, X_i) is a Poisson point process on (0, ∞) × R with intensity measure Leb ⊗ µ, where µ is logistic and
$$\#\{\text{points in } R\} \sim \mathrm{Pois}\big((\mathrm{Leb} \otimes \mu)(R)\big)$$
for any region R. So now if we condition on the event

$$A_y = \{\text{some point } j \text{ of } (\zeta_i, X_i) \text{ lands on the vertical line } z = y\},$$
the rest of the process is distributed as an independent Poisson process: the number of points in any region is Poisson, so the counts in any two disjoint regions R and S are independent, and a narrow strip around z = y is disjoint from everything else. So

$$\mathbb{P}(i^* = j \mid A_y) = \mathbb{P}\Big(y - X_j < \min_{i \ge 1}\,(\tilde\zeta_i - \tilde X_i)\Big),$$

because i∗ = j means that ζ_j − X_j is the minimal value, and we're conditioning on ζ_j = y. (The tildes mean that the variables on the right are the rest of the process, which is independent of X_j.) Let X̃ = min_{i≥1}(ζ̃_i − X̃_i): by Lemma 225, X̃ has the logistic law µ and is independent of X_j, so this probability is

$$= \mathbb{P}(X_j + \tilde X > y) = h(y).$$

Thus
$$\mathbb{P}(\zeta_{i^*} \in [y, y + dy]) = \mathbb{P}(A_{[y, y+dy]})\,\mathbb{P}\big(i^* = j \mid A_{[y, y+dy]}\big) = h(y)\,dy,$$
meaning that ζ_{i∗} =^d W_{φ,M∗(φ)} has density h. So the expected value of W_{φ,M∗(φ)} on our infinite tree is just ∫_0^∞ x h(x) dx = π²/6.

So how do we turn this heuristic into an actual proof? There's a random variable i∗ related to the matching, but we still need to construct the actual matching M∗ on the infinite tree T. Morally, we want to sample the infinite tree and then solve the recursion, but there are infinitely many equations. So instead of taking the entire tree, we only look at it up to depth n. Then we don't have to worry about the dependence between the X's and the tree: at the boundary at level n, we put iid X's pointing up into the tree; at each vertex we apply the boxed recursion upward, and once we reach the root, we go back down. This produces an X_{i→j} as well as an X_{j→i} for each edge between i and j, and even though the X's are correlated with the tree, the recursion is satisfied everywhere. Then we apply the Kolmogorov extension theorem to get an infinite tree (T, X)! And now that we have (T, X), we decide whether the edge (ij) belongs to our matching based on whether

$$j = \operatorname{argmin}_{k \sim i}\,(W_{ik} - X_{k \to i}).$$

(This is equivalent to saying that W_{ij} − X_{j→i} < min_{k∼i, k≠j}(W_{ik} − X_{k→i}), and the right-hand side is just X_{i→j} by the recursion! So we just look at the two X variables associated to the edge (ij) and check whether their sum exceeds W_{ij}.) This gives us one direction of the bound, and to show the other, we check that any deviation from this matching fails to satisfy the recursion.
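Here is a toy version of the depth-n construction (a sketch, using only the Python standard library; the cutoff truncating each node's infinitely many children is an ad hoc approximation, justified because edges heavier than the cutoff essentially never achieve the minimum). With iid logistic messages at the boundary, Lemma 225 says the upward message at every vertex, and in particular at the root, should again be logistic:

```python
import math
import random

random.seed(2)
CUTOFF = 10.0   # truncation of the Poisson weighted infinite tree (ad hoc approximation)

def logistic():
    """Sample the logistic law via the inverse cdf: F^{-1}(u) = log(u / (1 - u))."""
    u = random.random()
    return math.log(u / (1 - u))

def message(depth):
    """Upward message X_{i -> parent} = min over children j of (W_{ij} - X_{j -> i}),
    with iid logistic messages placed at the truncation depth (the boundary at level n)."""
    if depth == 0:
        return logistic()
    w = random.expovariate(1.0)            # child edge weights form a Poisson process
    best = w - message(depth - 1)          # always keep the first child, so best is finite
    while True:
        w += random.expovariate(1.0)
        if w > CUTOFF:
            return best
        best = min(best, w - message(depth - 1))

xs = [message(3) for _ in range(400)]
print(sum(v <= 0 for v in xs) / len(xs))   # should be near the logistic value F(0) = 1/2
```

Once the upward messages are known, going back down gives the second message on each edge, and the rule above (sum of the two messages versus the edge weight) decides which edges enter the matching.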

References

[1] N. Sun. 18.675: Theory of Probability. URL: https://math.mit.edu/~nsun/f19-18675.html.

[2] R. Durrett. Probability: Theory and Examples. Cambridge University Press, Cambridge, 2019.

[3] R. Arratia and S. Tavaré. "The Cycle Structure of Random Permutations". Ann. Probab. 20.3 (1992), pp. 1567–1591.

[4] E. Shamir and J. Spencer. "Sharp concentration of the chromatic number on random graphs G_{n,p}". Combinatorica 7.1 (1987), pp. 121–129.

[5] H. Kesten and B.P. Stigum. "Limit theorems for decomposable multi-dimensional Galton-Watson processes". J. Math. Anal. Appl. 17.2 (1967), pp. 309–338.

[6] R. Lyons, R. Pemantle, and Y. Peres. "Conceptual Proofs of L log L Criteria". Ann. Probab. 23.3 (1995), pp. 1125–1138.

[7] D.J. Aldous. "The ζ(2) limit in the random assignment problem". Random Struct. Alg. 18.4 (2001), pp. 381–418.

[8] D. Aldous. "Asymptotics in the random assignment problem". Probab. Theory Relat. Fields 93.4 (1992), pp. 507–534.

[9] S. Linusson and J. Wästlund. "A proof of Parisi's conjecture on the random assignment problem". Probab. Theory Relat. Fields 128 (2004), pp. 419–440.

[10] C. Nair, B. Prabhakar, and M. Sharma. "Proofs of the Parisi and Coppersmith-Sorkin Random Assignment Conjectures". Random Struct. Alg. 27.4 (2005), pp. 413–444.

[11] J. Wästlund. "An easy proof of the ζ(2) limit in the random assignment problem". Electron. Commun. Probab. 14 (2009), pp. 261–269.
