18.675: Theory of Probability
Lecturer: Professor Nike Sun
Notes by: Andrew Lin
Fall 2019
Introduction
The course website for this class can be found at [1]. The page notably contains a summary of topics covered, as well as some brief lecture notes and further references. 18.675 is the new numbering for 18.175; the material covered will be similar to previous years. The basic goal is to cover the basic objects and laws of probability in a formal mathematical framework; some background in analysis and probability is strongly recommended. (If we are missing 18.100, we should talk to Professor Sun after class.)

The main readings for this class will come from chapters 1–4 of our textbook (see [2]): readings for each chapter are included in the summary of topics linked above. Grades are mostly homework (25%), exam 1 (35%), and exam 2 (35%). Class participation is another 5% of the grade – this is a graduate class, but we are expected to come to class. This is a large class that will probably get smaller, so that's one way attendance may be checked. Alternatively, a short quiz may be given occasionally in class. These won't be graded, but they'll mostly be to check whether we're all following the material (and to have a formal record of who showed up). Realistically, attending class is highly correlated with doing well!

By the way, there is some overlap with 18.125 (measure theory), particularly near the beginning. However, most of the content will be significantly different.
1 September 4, 2019
We’ll start the class with some measure theory, which gives us a formal language for probability. On its own, measure theory is a bit dry, so let’s try to think about why we need it at all.
Example 1 (We should know this, e.g. from 18.600)
Consider a p-biased coin, where the probability of a head is p and the probability of a tail is 1 − p. Then the probability of six independent p-biased coins coming up H, H, T, H, T, H is just the product of the probabilities: it's p^4 (1 − p)^2. Since there are (6 choose 4) ways to pick 4 heads among the 6 coin tosses, the probability of having 4 heads in total is (6 choose 4) p^4 (1 − p)^2.
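The product/counting computation above can be sanity-checked numerically (a quick illustration, not part of the notes; the choice p = 0.6 is arbitrary):

```python
import math
import random

def binomial_pmf(n, k, p):
    """P(exactly k heads in n independent p-biased tosses) = C(n,k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.6
exact = binomial_pmf(6, 4, p)   # C(6,4) p^4 (1-p)^2 ≈ 0.31104

# Monte Carlo estimate: toss six p-biased coins many times.
random.seed(0)
trials = 200_000
hits = sum(1 for _ in range(trials)
           if sum(random.random() < p for _ in range(6)) == 4)
estimate = hits / trials

print(abs(estimate - exact) < 0.01)  # the simulation agrees with the formula
```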
We should generally also be familiar with what it means to have U_n be a random variable distributed uniformly over a finite set like {1/n, 2/n, · · · , (n − 1)/n, 1}, as well as what it means for V_n to be an independent copy of U_n: then we can write statements of independence like

P(U_n = j/n, V_n = k/n) = P(U_n = j/n) P(V_n = k/n) = 1/n^2.
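In the finite setting this independence statement can be checked by brute-force enumeration (a toy verification, not from the notes; n = 4 is an arbitrary choice): put the uniform distribution on all n^2 pairs and confirm that it factorizes into the two marginals.

```python
from fractions import Fraction
from itertools import product

n = 4
support = [Fraction(i, n) for i in range(1, n + 1)]   # {1/n, 2/n, ..., 1}

# Uniform distribution on all n^2 pairs (U_n, V_n).
pairs = list(product(support, repeat=2))
p_pair = Fraction(1, len(pairs))                      # each pair has probability 1/n^2

def marg_U(u):
    """Marginal distribution of the first coordinate."""
    return sum(p_pair for (a, b) in pairs if a == u)

def marg_V(v):
    """Marginal distribution of the second coordinate."""
    return sum(p_pair for (a, b) in pairs if b == v)

# P(U_n = j/n, V_n = k/n) = P(U_n = j/n) P(V_n = k/n) = 1/n^2 for every pair.
print(all(p_pair == marg_U(u) * marg_V(v) for (u, v) in pairs))  # True
print(sum(p_pair for _ in pairs) == 1)                           # total mass 1
```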
But this becomes a bit more complicated if the set of allowed states is not finite. For example, what does it mean for U to be distributed uniformly in the range [0, 1]? We can't write down the probability that U takes on any particular value. Also, is it true that U_n converges to U as n → ∞, at least in some sense (since we seem to be populating [0, 1] uniformly)? Let's now see an example that illustrates that this kind of problem is trickier than it might initially seem:
Problem 2 Alice and Bob play a game where they have an infinite sequence of boxes. Alice puts a real number in each box, and Bob can see the number in every box except for one of them, which he can choose. “Show” that Bob can guess the unseen number with 90 percent probability.
“Solution”. To show this, let’s set up some preliminary notation: let S be the set of infinite sequences of real numbers, which can also be denoted RN. For two sequences x, y ∈ S, say that x ∼ y (they are equivalent) if they eventually
agree: x_n = y_n for all n > N, for some finite index N. This is an equivalence relation on the space of real-valued sequences, so there is some quotient space S/∼ of equivalence classes. We can pick one representative z from each equivalence class. In other words, if x ∈ S, there is a unique representative sequence z = z(x) such that z ∼ x: since they are equivalent by definition, we can denote n(x) to be the first index after which x and z agree completely.

So here's Bob's strategy for guessing the unseen number. He splits the boxes into ten rows: row k contains the boxes which are in spots k mod 10. We now have ten sequences x^(1), · · · , x^(10) corresponding to our ten rows. He picks one of these rows, uniformly at random (say row 10 without loss of generality). Bob reveals all of the other rows, and he evaluates

N = max{n(x^(1)), n(x^(2)), · · · , n(x^(9))} + 1.

(In other words, he looks at the first nine rows, and he finds the first index after which every one of those rows agrees with its representative sequence.) He now reveals every box in row 10 except for box N: he now has enough information to deduce z^(10), the representative sequence for x^(10). Bob now guesses z^(10)_N.

Why is this correct with 90 percent probability? We chose this row uniformly at random, and there's at most a 10 percent chance that n(x^(10)) is the largest out of all the n(x^(i))s, which is the only case where x^(10)_N disagrees with z^(10)_N.
This is a strange example, and we’ve obviously violated some rules of probability to get to this point. But it’s constructed in a way such that we don’t really know where we’ve messed up! So that’s why it’s worthwhile to understand a little more about measure theory. Let’s start with a seemingly basic problem: how do we define the variable U ∼ Unif[0, 1] from above? We could try to define a function (or measure) µ on subsets of [0, 1], such that µ(A) is the probability that U ∈ A for any subset A. For example, we could say that for any 0 ≤ a ≤ b ≤ 1, it’s natural to say that
µ([a, b]) = b − a.
We also want to have some form of additivity: if A, B ⊆ [0, 1] are disjoint, then we want their disjoint union to satisfy

µ(A ⊔ B) = µ(A) + µ(B).

By induction, this means that for any positive integer n,

µ(⊔_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i).

In addition, we can think about the limit of the expression ⊔_{i=1}^n A_i → ⊔_{i=1}^∞ A_i: it seems also natural to want countable additivity:

µ(⊔_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i).
Remark 3. Note that µ([0, 1]) = 1, but [0, 1] is an uncountable union of points, and each point has measure zero. So it's not really reasonable to ask for uncountable additivity as well.
So to summarize, our measure µ(A) = P(U ∈ A) should satisfy the following:
• µ([a, b]) = b − a for all 0 ≤ a ≤ b ≤ 1 (“lengths” of intervals).
• Countable additivity (as shown above).
• We should also want translation invariance: if A ⊆ [0, 1], and we define A_x = A + x to be A shifted rightward by x (mod 1), then µ(A) = µ(A_x).

This is not a lot of requirements: is it enough? If we're given any subset A, can we use these three properties to determine µ(A)? Let's try a slightly complicated example:
Example 4
Consider the Cantor set C = ∩_{n=0}^∞ C_n, where the C_n are iteratively defined as

C_0 = [0, 1],  C_{n+1} = C_n \ ("middle third" of each interval of C_n).

(Because the C_n are decreasing, the limit C can indeed be defined this way.) How do we find the measure of the Cantor set? We know that µ(C_0) = 1, and we also know that C_n is the disjoint union of C_{n+1} and the middle thirds of C_n, so

µ(C_{n+1}) = µ(C_n) − µ(middle thirds of C_n) = (2/3) µ(C_n).

This means that µ(C_n) = (2/3)^n for all n, and taking the limit, µ(C) = 0. This seems encouraging towards an answer of "yes, we can determine µ(A)," but it turns out the Cantor set is actually pretty well-behaved compared to some worse sets. In fact, we've actually been asking the wrong question:
Proposition 5 There exists no function µ : P([0, 1]) → [0, 1] that satisfies all three of the properties we want!
Proof (the Vitali construction). Define an equivalence relation on the real numbers
x ∼ y if x − y ∈ Q.
This partitions the real line into equivalence classes (cosets of the form [x] = x +Q). Each one is a shift of the rationals
by some real number, and similar to the Alice and Bob problem, we pick a representative z from each equivalence class. For simplicity, we can pick all z to be in [0, 1]. Let A be the set of all representatives, which is a subset of [0, 1]. (This is sometimes called the Vitali set.) By the properties above, because we can partition [0, 1] into translations of A (shifts mod 1 by rationals q ∈ Q ∩ [0, 1)),

1 = µ([0, 1]) = µ(∪_q A_q) = Σ_q µ(A_q)

by countable additivity. But now all the A_q have equal measure by translation invariance, which means that

1 = Σ_q µ(A),

which cannot work: 1 cannot be the countable sum of a bunch of equal numbers.
The way measure theory deals with this problem is to not require that µ be defined on all subsets. As a historical note, when this was first proposed, it was a pretty surprising idea, but it’s really the only thing we can do if we try to write formal definitions down! We don’t really want to relax the three properties we’re given, so it’s better to just not define the measure on some pathological subsets.
Definition 6 A measure space is a triple (Ω, F, µ) satisfying the following axioms:
• Ω, the state space or outcome space, is nonempty.
• F, a set of measurable subsets or events, is a σ-field or σ-algebra over Ω (terms will be used interchangeably). In other words, F is a nonempty collection of subsets of Ω which is closed under complementation and countable union: if A ∈ F, then A^c = Ω \ A is also in F, and if A_i ∈ F, then ∪_{i=1}^∞ A_i is also in F.

• µ, the measure, is a map F → [0, ∞] (the extended positive reals) which is not ∞ everywhere and is countably additive: if A_i ∈ F are disjoint,

µ(⊔_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i).
The goal of next week is basically to construct a uniform measure on [0, 1]: we’ll define some measure space
([0, 1], L, µ), where L is still to be defined, and by extension also define (R, L_R, µ).
Definition 7 µ is called a probability measure (denoted P) if µ(Ω) = 1.
Working with the definition, we can gather a few immediate consequences about measure spaces.
Proposition 8
Given a measure space (Ω, F, µ),

• ∅ and Ω are in F.
• µ(∅) = 0.
• (Continuity from below) If A_i ↑ A, that is, A_1 ⊆ A_2 ⊆ · · · and A = ∪_{i=1}^∞ A_i, then µ(A_i) ↑ µ(A).
• (Continuity from above) If A_i ↓ A, that is, A_1 ⊇ A_2 ⊇ · · · and A = ∩_{i=1}^∞ A_i, and also µ(A_1) < ∞, then µ(A_i) ↓ µ(A).
• For any Ω, F = {∅, Ω} is a valid σ-field. So is P(Ω).

Proof. For the first point, F is nonempty, so there is some A in F. Then A^c ∈ F, so A ∪ A^c = Ω is in F, and so is Ω^c = ∅. Similarly, for the second point, there is some A such that µ(A) < ∞: then µ(A ⊔ ∅) = µ(A) = µ(A) + µ(∅), so µ(∅) = 0.

For the third point, use countable additivity on A = ⊔_{i=1}^∞ B_i, where B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ A_2, and so on. Then

µ(A) = Σ_{n=1}^∞ µ(B_n),

and since µ(A_n) = Σ_{i=1}^n µ(B_i), µ(A_i) indeed converges to µ(A) as desired. Similarly, for the fourth point, define B_1 = ∅, B_2 = A_1 \ A_2, B_3 = A_1 \ A_3, and so on. All the B_i have finite measure since µ(A_1) is finite, and B_1 ⊆ B_2 ⊆ · · · with ∪_i B_i = A_1 \ A, so we can use the third point to finish: µ(A_i) = µ(A_1) − µ(B_i) ↓ µ(A_1) − µ(A_1 \ A) = µ(A). (A counterexample with µ(A_1) = ∞ is A_i = [i, ∞).) Finally, the examples in the last point satisfy both closure axioms (complementation and countable union).
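Continuity from below can be seen concretely with interval lengths (my own illustration, anticipating the uniform measure we are about to construct): take A_i = (0, 1 − 1/i], whose measures increase to the measure of the union (0, 1). The counterexample A_i = [i, ∞) from the proof shows why µ(A_1) < ∞ is needed for continuity from above: every A_i has infinite length, yet the intersection is empty.

```python
# mu(A_i) for A_i = (0, 1 - 1/i]: the lengths 1 - 1/i increase up to mu((0,1)) = 1.
lengths = [1 - 1 / i for i in range(1, 100001)]

print(all(x <= y for x, y in zip(lengths, lengths[1:])))  # monotone increasing
print(abs(lengths[-1] - 1.0) < 1e-4)                      # mu(A_i) increases to 1

# Continuity from above genuinely needs mu(A_1) < inf: for A_i = [i, inf),
# every mu(A_i) is infinite, but the intersection is empty and has measure 0.
```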
If Ω were a finite set, and we were trying to define a probability space in 18.600, we would likely say that P({ω}) = a_ω for every ω ∈ Ω, with Σ_{ω∈Ω} a_ω = 1. In our new terminology, F = P(Ω) is just the powerset of Ω, and P(A) = Σ_{ω∈A} a_ω: we can check that this is a valid probability space. But what if Ω = [0, 1] or R? If we take F = P(Ω), we do have a valid σ-field, but the Vitali set shows that we can't define a measure µ on it. Next week, we'll basically ask how we need to restrict our set F to get a useful measure!
Definition 9
The Borel σ-field BR is the smallest σ-field over R that contains I, the set of all open intervals.
When we see the word "smallest," we should ask the question "is it well-defined?". We check that if {F_α : α ∈ J} is a collection of σ-fields over Ω, then their intersection is also a σ-field. Indeed, if a set A is in the intersection, then A is in all of the σ-fields, so A^c is in all of them as well, meaning A^c is in the intersection; the same argument works for countable unions. Now we can just let BR be the intersection of all σ-fields containing the open intervals, and this is well-defined.
2 September 9, 2019
This lecture and the next are being given by Professor Subhabrata Sen. Recall that our central question today is how to define the uniform measure on [0, 1]: we want to define a function µ on (some) subsets of [0, 1] with the requirements that
• µ((a, b ]) = b − a (half-open is a matter of convention).
• Given disjoint subsets {A_i : i ≥ 1}, µ(⊔_i A_i) = Σ_i µ(A_i) (countable additivity).
• If we define A_x = (A + x mod 1), then µ(A_x) = µ(A) (translation invariance).
Vitali's construction tells us that we can't do this with all subsets of [0, 1]: there is no function µ : 2^[0,1] → [0, 1] that satisfies all three conditions above. So we have to compromise and try to define µ on just a sub-collection of the subsets of [0, 1]. Last time, we defined a measure space (Ω, F, µ), which consists of a nonempty state space, a σ-field, and a measure, respectively. We also defined the Borel σ-field to be the smallest σ-field containing the open intervals (which are the sets where we really want the measure to be defined in a particular way):
B = σ(I), I = {(a, b ] ∩ R : a ≤ b}.
With this, let's try to construct the Lebesgue measure on [0, 1]. Our hope is that because we already know how to assign a measure to specific sets, we can assign measures to intervals and then keep assigning measures to subsets iteratively off of that. Unfortunately, we can't write this down rigorously, because there's no closed form for an arbitrary element of BR: it's not true that every set in the Borel σ-algebra can be written as a countable union and/or intersection of our intervals!

So instead, we do define µ((a, b ]) = b − a, and we do define µ(⊔_i (a_i, b_i ]) = Σ_i (b_i − a_i) for unions of disjoint intervals (a_i, b_i ], but we're not quite done. We now need to know whether there is a consistent extension of our function µ to all sets in BR. Since this idea will come up again later in the course, we'll make a slight generalization: say we define µ((0, x]) to be some function F(x), and we're currently working with the case F(x) = x.
Remark 10. Remember that measures in general do not need to be translation-invariant: the conditions that we’ve been using are to ensure that we end up with a uniform measure.
In order for us to extend our function µ to be defined on all subsets in BR, this function F (x) needs to satisfy a few properties:
• µ((0, x1]) ≥ µ((0, x2]) if x1 > x2, which means that F (x1) ≥ F (x2) (F is monotone).
• If xn ↓ x, then the intervals (0, xn] ↓ (0, x]. From our work earlier, this implies that µ((0, xn]) ↓ µ((0, x]), so
F (xn) ↓ F (x) (F is right-continuous). Notably, we can’t quite make this argument in the other direction to say that F is left-continuous (will be explained later).
Definition 11 A Stieltjes measure function on R is a function F : R → R which is non-decreasing and right-continuous.
Theorem 12
For any Stieltjes measure function F, there exists a unique measure µ = µ_F on (R, BR) satisfying µ((a, b ]) = F(b) − F(a) for all a ≤ b in R.
Here’s the proof idea:
1. There exists a unique extension from µ : I → [0, ∞] to µ : A → [0, ∞], where
A = {A ⊆ R : A is a finite disjoint union of sets in I}.
This sort of corresponds to the “countable union” step: once we’ve shown that we can do this step in a well-defined way, we’ll be able to define our measure on any finite disjoint union of intervals.
2. Show that we can extend our measure from being defined on A to being defined on BR.
Remark 13. Why is using A a reasonable intermediate step? Notice that A is closed under complements and finite unions (exercise), while the Borel σ-algebra is closed under complements and countable unions. So we’re not at the σ-algebra yet, but we’re kind of close.
Proof of step 1. The construction itself is pretty intuitive. Given any subset A ∈ A, we can write it as A = ⊔_{j=1}^n E_j for some E_j in I. Then just define

µ(A) = Σ_{j=1}^n µ(E_j).

Note that there's a lot of ways to write an element of A as a finite disjoint union of intervals. So one thing we need to check is that different representations are consistent with each other:

A = ⊔_{j=1}^n E_j = ⊔_{k=1}^m F_k  =⇒  Σ_{j=1}^n µ(E_j) = Σ_{k=1}^m µ(F_k).

Indeed, we take the common refinement: because all the E_j and F_k are intervals,

Σ_{j=1}^n µ(E_j) = Σ_{j=1}^n µ(E_j ∩ A) = Σ_{j=1}^n µ(E_j ∩ (⊔_{k=1}^m F_k)) = Σ_{j=1}^n Σ_{k=1}^m µ(E_j ∩ F_k),

and starting from Σ_k µ(F_k) also yields this result, so µ(A) is consistently defined.
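The consistency check via common refinement can be played out numerically (a toy example, not from the notes; it uses the uniform case F(x) = x, so µ((a, b]) = b − a): write the same set as two different disjoint unions and compare total measures.

```python
def mu(a, b):
    """mu((a, b]) = F(b) - F(a) with F(x) = x (the uniform case)."""
    return b - a

# The same set A = (0, 3] written as two different finite disjoint unions.
E_parts = [(0.0, 1.0), (1.0, 3.0)]
F_parts = [(0.0, 2.0), (2.0, 2.5), (2.5, 3.0)]

def refinement(parts1, parts2):
    """Common refinement: the nonempty intersections E_j ∩ F_k."""
    out = []
    for a, b in parts1:
        for c, d in parts2:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                out.append((lo, hi))
    return out

total_E = sum(mu(a, b) for a, b in E_parts)
total_F = sum(mu(a, b) for a, b in F_parts)
total_R = sum(mu(a, b) for a, b in refinement(E_parts, F_parts))
print(total_E == total_F == total_R == 3.0)   # True: mu(A) is well defined
```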
For things we’re going to do next lecture, we’re going to record a few properties of our space A:
Proposition 14
The function µ we've defined has the following properties:

1. µ is monotone and finitely additive over A: for disjoint A_1, . . . , A_n ∈ A, µ(⊔_i A_i) = Σ_{i=1}^n µ(A_i).
2. µ is finitely sub-additive over A: for any A_1, . . . , A_n ∈ A, µ(∪_i A_i) ≤ Σ_{i=1}^n µ(A_i).
3. µ is countably sub-additive over I: if A, A_i ∈ I and A ⊆ ∪_i A_i, then µ(A) ≤ Σ_i µ(A_i).
4. µ is countably additive on A.
This last point is really where we want to be, because it’s a lot closer to the condition we are trying to get on BR.
Proof. We can show (1) and (2) just for n = 2, since the general case then follows by induction.

For (1), note that we can write A_1 as E_1 ⊔ · · · ⊔ E_x and A_2 as F_1 ⊔ · · · ⊔ F_y. Because these are all disjoint, µ(E_1 ⊔ · · · ⊔ E_x ⊔ F_1 ⊔ · · · ⊔ F_y) is the measure of A_1 ⊔ A_2 by definition; µ of the first x terms gives the measure of A_1, and µ of the last y terms gives the measure of A_2, which is what we want. (Monotonicity follows because all measures are nonnegative.)
For (2), notice that (A2 \ A1) ⊆ A2 (where A2 \ A1 denotes all of A2 not contained in A1), and A2 \ A1 is disjoint from A1. So now by additivity and monotonicity from (1),
µ(A1 ∪ A2) = µ(A1 ∪ (A2 \ A1)) = µ(A1) + µ(A2 \ A1) ≤ µ(A1) + µ(A2), as desired.
For (3), denote A = (a, b ] and A_i = (a_i, b_i ] (all sets here are intervals because we're working in I), and we know that (a, b ] ⊆ ∪_{i=1}^∞ (a_i, b_i ]. Recall that our measure is defined on intervals uniquely: µ(A) = F(b) − F(a), and µ(A_i) = F(b_i) − F(a_i).

Since F is right-continuous, there exist x > a and c_i > b_i such that

F(x) − F(a) ≤ ε,  F(c_i) − F(b_i) ≤ ε/2^i.

(Intuitively, this now lets us work with A as a closed interval and the A_i as open intervals.) Now consider the interval

[x, b] ⊆ ∪_{i=1}^∞ (a_i, c_i).

Since [x, b] is compact, and the right hand side is an open covering, there exists a finite subcover by the Heine-Borel theorem:

[x, b] ⊆ ∪_{i=1}^n (a_i, c_i)

(some abuse of notation here). Now by points (1) and (2), because µ is finitely subadditive and monotone,

µ((a, b ]) ≤ µ((a, x ]) + Σ_{i=1}^n µ((a_i, c_i ]) = F(x) − F(a) + Σ_{i=1}^n (F(c_i) − F(a_i)) ≤ 2ε + Σ_{i=1}^∞ (F(b_i) − F(a_i))
(one ε comes from F(x) − F(a), and the other comes from the sum of the geometric series Σ_i ε/2^i). Letting ε → 0, we have that

µ((a, b ]) ≤ Σ_{i=1}^∞ µ((a_i, b_i ]),

as desired.

To prove (4), we are trying to prove that given A, A_i ∈ A, where A = ⊔_{i=1}^∞ A_i, we have µ(A) = Σ_{i=1}^∞ µ(A_i). First of all, we can write A as

A = ⊔_{i=1}^n A_i ⊔ (⊔_{i=n+1}^∞ A_i) = ⊔_{i=1}^n A_i ⊔ C_{n+1},

which is a finite disjoint union, so we can use (1) to write

µ(A) = Σ_{i=1}^n µ(A_i) + µ(C_{n+1}) ≥ Σ_{i=1}^n µ(A_i).

Letting n go to ∞,

µ(A) ≥ Σ_{i=1}^∞ µ(A_i),

which gives us one direction of what we want. Now we need to show the upper bound: without loss of generality, we can assume that A_i ∈ I, because all the A_i are finite disjoint unions of sets in I. By definition, we can write

A = ⊔_{j=1}^n E_j,  E_j ∈ I,

and we've defined

µ(A) = Σ_{j=1}^n µ(E_j) = Σ_{j=1}^n µ(E_j ∩ (∪_i A_i)) = Σ_{j=1}^n µ(∪_i (E_j ∩ A_i))

(because A is entirely contained in ∪_i A_i anyway). By (3), this means

µ(A) ≤ Σ_{j=1}^n Σ_{i=1}^∞ µ(E_j ∩ A_i) = Σ_{i=1}^∞ Σ_{j=1}^n µ(E_j ∩ A_i)

by swapping the order of summation (this is a nonnegative series, so we're allowed to do this), and this is Σ_{i=1}^∞ µ(A ∩ A_i) = Σ_{i=1}^∞ µ(A_i) by finite additivity.
So now we want to extend µ from A to the actual σ-field: let’s start making a move towards that direction. For now, let’s assume the measure we want to define is a finite measure.
Theorem 15 (Carathéodory Extension Theorem) Let A be an algebra over Ω (closed under complements and finite unions), and let µ : A → [0, ∞) be a countably additive finite measure. Then there exists a unique measure on the generated σ-algebra µ : σ(A) → [0, ∞) which extends µ.
We’ll show this next time!
3 September 11, 2019
Recall that the central object we’ve been trying to study is BR, the Borel σ-field, which is the smallest σ-algebra generated by the intervals I = {(a, b ] ∩ R : a ≤ b}. We’re working towards a proof that given any Stieltjes measure function (monotone, right-continuous) F , there
exists a unique µF on (R, BR) such that µ ((a, b ]) = F (b) − F (a). Last lecture, we extended the measure from I to
the algebra A, consisting of finite disjoint unions of intervals in I, and today, we'll extend this to BR = σ(A).

First, a quick clarification from last lecture: why can't we prove that F is left-continuous as well if we're defining F(x) = µ((0, x ])? In general, when x_n ↑ x, (0, x_n ] ↑ (0, x). But this is not exactly what we want, because F(x) uses (0, x] instead:

µ((0, x_n ]) ↑ µ((0, x)) ≠ µ((0, x ]) = F(x).
The example to keep in mind for this is a step function, coming from a point mass! Then (0, x) and (0, x ] differ by the “mass” at point x. To finish the construction of a uniform measure, we’ll prove Theorem 15 today. We’ll go through the proof in a slightly nonlinear way, so let’s first sketch out the ideas here: we want to prove (1) that an extension exists and (2) that it is unique. We’ll do the latter first, and we’ll do so by introducing the π-λ Theorem.
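The point-mass picture can be made concrete (my own illustration, not from the notes): take F to be the distribution function of a unit mass at c = 1/2. F is right-continuous but not left-continuous at c, and µ((0, x_n]) for x_n ↑ c misses the mass sitting at c.

```python
c = 0.5

def F(x):
    """Stieltjes measure function of a unit point mass at c: a right-continuous step."""
    return 1.0 if x >= c else 0.0

def mu(a, b):
    """mu((a, b]) = F(b) - F(a)."""
    return F(b) - F(a)

# x_n increasing to c: the measures mu((0, x_n]) stay at 0, but mu((0, c]) = 1.
print([mu(0, c - 1 / n) for n in (10, 100, 1000)])   # [0.0, 0.0, 0.0]
print(mu(0, c))                                      # 1.0

# x_n decreasing to c: right-continuity gives mu((0, x_n]) -> mu((0, c]) as required.
print([mu(0, c + 1 / n) for n in (10, 100, 1000)])   # [1.0, 1.0, 1.0]
```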
Definition 16 A π-system P on a set Ω is a nonempty collection of subsets of Ω such that for any A, B ∈ P, A ∩ B ∈ P.
For example, I, the collection of (half-open) intervals, is a π-system.
Definition 17 A λ-system L on a set Ω satisfies the following conditions:
• Ω ∈ L.
• For any A, B ∈ L where B ⊆ A, A \ B ∈ L as well.
• If we have a sequence An ∈ L, and An ↑ A, then A ∈ L.
All σ-algebras are both π-systems (we have intersections by definition) and λ-systems (in a σ-algebra, we have the whole set Ω by taking any A ∪ A^c; we have A \ B = A ∩ B^c; and we have ∪_n A_n = A for any A_n ↑ A). But we also have the converse:
Proposition 18 If a λ-system L is also a π-system, then L is a σ-algebra.
Proof. Since Ω ∈ L, we have complements: Ω \ A = A^c ∈ L for any A ∈ L. Combining this with intersections, L contains finite unions, since A ∪ B = (A^c ∩ B^c)^c; and a countable union is the increasing limit of finite unions, so L is closed under countable unions. Thus L is a σ-algebra.
In other words, the definition of a σ-algebra can be split into the intersections part (π-system) and the rest (λ-system), and this turns out to be very helpful:
Theorem 19 (Dynkin’s π-λ Theorem) If P is a π-system, L is a λ-system, and P ⊆ L, then σ(P) ⊆ L.
We’re going to see applications of this in a lot of different scenarios!
Proof. Without loss of generality, let L be the smallest λ-system containing P. This is well-defined, because there is at least one λ-system – the powerset of Ω – and intersections L1 ∩ L2 of λ-systems are λ-systems as well:
• Ω is in both L1 and L2, so it’s in the intersection.
• If A and B are in the intersection, with B ⊆ A, then they're both in L1 and L2, so A \ B is in both λ-systems, meaning it's in the intersection.
• If all Ans are in the intersection, then they’re in both λ-systems, so A is in both λ-systems and therefore the intersection as well.
So L is just the intersection of all λ-systems containing P . We claim that it suffices to show that L is also a π-system: this means L is a σ-field containing P, so it must contain σ(P ), the smallest σ-field containing P. To do this, we just need to show that given any A, B ∈ L, A ∩ B ∈ L. Fix A ∈ P, and define
LA = {B ∈ L : A ∩ B ∈ L} ⊆ L.
LA is a λ-system:
• A ∩ Ω = A ∈ L, so Ω ∈ LA.
• If B2 ⊆ B1 with B1, B2 ∈ LA, then A ∩ B1, A ∩ B2 ∈ L, so (A ∩ B1) \ (A ∩ B2) = A ∩ (B1 \ B2) is also in L. This means that B1 \ B2 is also in LA.
• If we have an increasing sequence Bn ↑ B ∈ L, and Bn ∈ LA (that is, A ∩ Bn ∈ L for all n), then A ∩ B ∈ L
since L is a λ-system. But this means B ∈ LA, as desired.
In addition, it contains P, because A ∈ P (meaning that for any B ∈ P, A ∩ B ∈ P ⊆ L). But L is the smallest λ-system containing P, so LA = L! Therefore, for all A ∈ P and B ∈ L, A ∩ B ∈ L. Now fix B ∈ L, and define
LB = {A ∈ L : A ∩ B ∈ L}.
We just showed that P ⊆ LB, and we can show by the exact same argument that LB is a λ-system, so LB = L. This means that L is indeed a π-system, and we're done.
The important message here is that measure-theoretic proofs are difficult because we don’t have a closed form for all measurable subsets. Often, the right way is to show that every measurable subset has some property. The power of this theorem is in bootstrapping results we find from P, a π-system, to the σ-algebra of P!
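On a finite Ω the theorem can even be checked by brute force (a toy verification, not from the notes; note that on a finite space the increasing-limit axiom of a λ-system is automatic, so a λ-system reduces to a collection containing Ω that is closed under proper differences): generate the smallest λ-system containing a π-system P and confirm it contains σ(P).

```python
def close_sigma(omega, sets):
    """Smallest sigma-algebra on a finite omega containing `sets` (closure to a fixpoint)."""
    fam = {frozenset(s) for s in sets} | {frozenset(omega), frozenset()}
    while True:
        new = set(fam)
        for a in fam:
            new.add(frozenset(omega) - a)       # complements
            for b in fam:
                new.add(a | b)                  # unions (finite = countable here)
        if new == fam:
            return fam
        fam = new

def close_lambda(omega, sets):
    """Smallest lambda-system on a finite omega: contains omega, closed under A \\ B for B in A."""
    fam = {frozenset(s) for s in sets} | {frozenset(omega)}
    while True:
        new = set(fam)
        for a in fam:
            for b in fam:
                if b <= a:
                    new.add(a - b)              # proper differences
        if new == fam:
            return fam
        fam = new

omega = {1, 2, 3, 4}
P = [{1, 2}, {2, 3}, {2}]                        # a pi-system: closed under intersection
assert all(frozenset(a) & frozenset(b) in {frozenset(s) for s in P}
           for a in P for b in P)

L = close_lambda(omega, P)
S = close_sigma(omega, P)
print(S <= L)    # True: sigma(P) is contained in the smallest lambda-system containing P
```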
Proof of uniqueness of measure. Suppose there were two distinct extensions µ1 and µ2 of our measure µ from A to
σ(A). Then µ(A) = µ1(A) = µ2(A) for all A ∈ A by definition. Define
L = {B ∈ σ(A): µ1(B) = µ2(B)}.
(We're trying to show that L is the σ-algebra σ(A) itself.) We've established that A ⊆ L, and we know that A is a π-system (an intersection of two finite disjoint unions of intervals is again a finite disjoint union of intervals). Now L is a λ-system:
• [0, 1] is in L, because [0, 1] is a finite interval and is therefore in A. (Small technicality here: we really have the interval (0, 1], so the easiest way to deal with this is to just define the measure on the real line instead. Also, for the uniform measure, any individual point is a measure zero set.)
• If µ1(A) = µ2(A), µ1(B) = µ2(B), and B ⊆ A, then µ1(A \ B) = µ2(A \ B) by additivity, so A \ B is also in L.
• If µ1(An) = µ2(An) for all n in an increasing sequence {An}, we can define B1 = A1, B2 = A2 \ A1, and so on. From the previous point, µ1(Bi) = µ2(Bi) for all i, so by countable additivity we have µ1(∪ Bi) = µ2(∪ Bi). Since µ1(∪ Bi) = µ1(∪ Ai) = µ1(A) (and same for µ2), µ1(A) = µ2(A) as desired.
So by the π-λ Theorem, σ(A) = L, and thus µ1 = µ2.
This is powerful – we only had to verify that µ1(A) = µ2(A) for a small subset of all measurable subsets, but L helped us bootstrap that to something more general!
Proof of existence. We start with a definition:
Definition 20 An outer measure is a map ν : 2^Ω → [0, ∞) which is monotone and countably subadditive.
This is basically a “crude” measure: it has certain desirable characteristics of a measure but not all of them.
Definition 21 Given a measure µ on an algebra A, we define
µ∗(E) = inf{ Σ_{i=1}^∞ µ(A_i) : A_i ∈ A, E ⊆ ∪_i A_i }

for all E ⊆ Ω.
This is a covering argument: we try to cover E with minimum measure. We can verify that µ∗ is indeed an outer measure:
• It's monotone, because for A ⊆ B, any covering of B also covers A, so µ∗(B) ≥ µ∗(A).
• It's countably subadditive, because the union of coverings of the A_i will also cover ∪_i A_i, and we're still covering with a countable number of sets in A.
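The covering idea behind µ∗ can be illustrated on a simple set (my own sketch, not from the notes; here µ is ordinary length): covering E by its raw intervals may overcount, and merging the overlaps gives a better, and here optimal, covering — the infimum in the definition picks out the efficient one.

```python
def union_length(intervals):
    """Total length of a finite union of half-open intervals (a, b]."""
    total, right = 0.0, float("-inf")
    for a, b in sorted(intervals):
        a = max(a, right)       # skip the part already counted
        if b > a:
            total += b - a
            right = b
    return total

# E = (0, 1] with (0.5, 2]: a wasteful covering vs. the merged one.
E = [(0.0, 1.0), (0.5, 2.0)]
wasteful = sum(b - a for a, b in E)    # 1.0 + 1.5 = 2.5, overcounts the overlap
optimal = union_length(E)              # length of (0, 2] = 2.0

print(wasteful, optimal)   # 2.5 2.0
```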
Lemma 22 µ∗ also extends µ: that is, µ∗(A) = µ(A) for all A ∈ A.
Proof. We know that µ∗(A) ≤ µ(A), since A covers itself. Suppose that µ∗(A) < µ(A) for the sake of contradiction: first of all, a near-optimal covering of A is a countable union of half-open intervals, and we may assume each one is contained within one of the intervals of A (otherwise, we could split the interval up or cut the ends off, resulting in a smaller Σ_i µ(A_i)). Say A is made up of N intervals: by the pigeonhole principle, there is some interval I ∈ A covered by intervals I_n ∈ I such that Σ_n µ(I_n) < µ(I): pick the one with the largest deviation, and we can write

Σ_n µ(I_n) = µ(I) − ε

for some ε > (µ(A) − µ∗(A))/N > 0. But this can't happen: we showed countable subadditivity of µ over intervals in the last lecture, so

µ(I) ≤ Σ_n µ(I_n),

which is a contradiction. Thus µ∗(A) = µ(A), as desired.
So this outer measure tries to assign a value to every subset of our space Ω. We saw that doing this didn't quite work when we tried it naively: to refine this argument, we need to find a suitable subcollection A∗ ⊆ 2^Ω such that A∗ is a σ-algebra and µ∗ restricted to A∗ is a measure. If we can find such a collection, then A ⊆ A∗ implies that σ(A) ⊆ σ(A∗) = A∗, and this gives us the extension that we want!
Definition 23 A subset E ⊆ Ω is measurable if µ∗(F ) = µ∗(F ∩ E) + µ∗(F ∩ Ec )
for all F ⊆ Ω.
This is a collection of sets which satisfy a "desirable property," and it turns out the subcollection we want is

A∗ = {E : E is measurable with respect to µ∗}.
First of all, we need to make sure that we actually have a large enough subcollection:
Proposition 24 A ⊆ A∗.
Proof. We want to show

µ∗(F) = µ∗(F ∩ A) + µ∗(F ∩ A^c)

for any A ∈ A and F ⊆ Ω. First of all, we have

µ∗(F) ≤ µ∗(F ∩ A) + µ∗(F ∩ A^c),

because combining coverings of the two terms on the right side definitely covers the left side. Now, suppose the left side is optimally covered by intervals I_1, I_2, · · ·. We wish to show that we can cover the right side with at most 4ε more measure. Since A is a finite collection of intervals, each interval I_k can only have a finite number of changes between the parts contained in A and A^c, so there are a countable number of changes between A and A^c overall. Break up the intervals at those changes.

Thus, we cover the right hand side as follows: F ∩ A is a subset of (∪ I_k) ∩ A (since the I_k cover F), which is a countable union of intervals (possibly closed or open). Similarly, F ∩ A^c is a subset of (∪ I_k) ∩ A^c, which is another countable union of intervals. Extend both sides of the ith interval in (∪ I_k) ∩ A by ε/2^i, and extend both sides of the jth interval in (∪ I_k) ∩ A^c by ε/2^j. We've then used at most 4ε more measure than in our covering of F, but we've covered both F ∩ A and F ∩ A^c, so

µ∗(F) + 4ε ≥ µ∗(F ∩ A) + µ∗(F ∩ A^c),

and we're done by taking ε → 0, showing the equality.
To finish our construction, then, we just need to show that A∗ is a σ-algebra and that µ∗ is countably additive on it. It's easy to show A∗ is closed under complementation, because E and E^c appear symmetrically in Definition 23.

So now, let's show that A∗ is closed under countable unions. First, we prove that it's closed under finite unions: by induction, it suffices to show that if E_1, E_2 ∈ A∗, then E_1 ∪ E_2 ∈ A∗. By definition, this means plugging E_1 ∪ E_2 into Definition 23, but it suffices to just show the inequality

µ∗(F) ≥ µ∗(F ∩ (E_1 ∪ E_2)) + µ∗(F ∩ (E_1 ∪ E_2)^c),

because the reverse inequality is true by subadditivity (given to us by the definition of µ∗). By subadditivity,
µ∗(F ∩ (E_1 ∪ E_2)) ≤ µ∗(F ∩ E_1) + µ∗((F \ E_1) ∩ E_2)

(this is just algebra: F ∩ (E_1 ∪ E_2) = (F ∩ E_1) ∪ ((F \ E_1) ∩ E_2)), and we also have

µ∗((F \ E_1) ∩ E_2) = µ∗(F ∩ E_1^c) − µ∗(F \ (E_1 ∪ E_2))

because we know that E_2 ∈ A∗ is measurable, and we can plug F ∩ E_1^c into the definition of measurability. Now adding the two equations,

µ∗(F ∩ (E_1 ∪ E_2)) + µ∗(F \ (E_1 ∪ E_2)) ≤ µ∗(F ∩ E_1) + µ∗(F ∩ E_1^c) = µ∗(F),

where the last step comes from E_1 ∈ A∗ being measurable. This is what we set out to prove, so A∗ is indeed closed under finite unions.

To prove that A∗ is also closed under countable unions, we need to show that given E_i ∈ A∗, we also have ∪_{i=1}^∞ E_i ∈ A∗. We claim that for disjoint subsets E_1, E_2 ∈ A∗,
µ∗(F ∩ (E_1 ∪ E_2)) = µ∗(F ∩ E_1) + µ∗(F ∩ E_2).

This is actually not too bad: because E_1 is measurable, we know that (using F ∩ (E_1 ∪ E_2) in the definition instead of F, and using disjointness in the last step)

µ∗(F ∩ (E_1 ∪ E_2)) = µ∗(F ∩ (E_1 ∪ E_2) ∩ E_1) + µ∗(F ∩ (E_1 ∪ E_2) \ E_1) = µ∗(F ∩ E_1) + µ∗(F ∩ E_2).
To show closure under countable unions, we can first assume without loss of generality that the E_i are disjoint (because we already have closure under complementation and finite unions, so we can take differences of sets by defining M_1 = E_1, M_2 = E_2 \ E_1 = E_2 ∩ E_1^c, and so on). Define

A_n = ∪_{i=1}^n E_i,  A = ∪_{i=1}^∞ E_i.

Since we've proved closure for finite unions, we know that all A_n ∈ A∗. Thus,
µ∗(F) = µ∗(F ∩ A_n) + µ∗(F \ A_n)

for any n (by definition of measurability), so this is

≥ Σ_{i=1}^n µ∗(F ∩ E_i) + µ∗(F \ A)

(the first term by the claim we just established, and the second term by the monotonicity of µ∗, since F \ A_n ⊇ F \ A). So now take n → ∞: we find that

µ∗(F) ≥ Σ_{i=1}^∞ µ∗(F ∩ E_i) + µ∗(F \ A),

and by countable subadditivity, this is ≥ µ∗(F ∩ A) + µ∗(F \ A), which is what we want! And now we've shown that A∗ is a σ-algebra.

To prove that µ∗ is indeed a measure on A∗, we need to show countable additivity: and we can do this by showing inequalities on both sides. We know that
µ∗(⨆_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ µ∗(Ai)

by countable subadditivity of µ∗, and also
µ∗(⨆_{i=1}^∞ Ai) ≥ µ∗(⨆_{i=1}^n Ai) = ∑_{i=1}^n µ∗(Ai)

(the inequality by monotonicity of µ∗, and the equality by the finite additivity we've already established), and taking n → ∞ gives the other direction of the inequality. This gives us countable additivity of µ∗, and we're done.
Since we’ve already proved this is unique, we’ve finally extended our measure to σ(I), as desired! So we do have a uniform distribution on [0, 1].
4 September 16, 2019
The first homework assignment will be posted later today (after lecture); it will be due in class next Wednesday. There will be about 4–5 problem sets in this class: this first one will have us work a little more with properties of measures. Let’s start by restating what’s been covered over the last two lectures:
Theorem 25 (Lebesgue-Stieltjes)
Given a nondecreasing and right-continuous function F : R → R, there exists a unique measure µ = µF on (R, BR) such that µ((a, b]) = F(b) − F(a) for all a ≤ b.
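Before the proof, here is a quick numerical sanity check of the theorem (not from the lecture; the function names are my own). We take F to be the CDF of an Exponential(1) distribution, which is nondecreasing and continuous, and check that the recipe µ((a, b]) = F(b) − F(a) is additive over a disjoint decomposition of an interval:

```python
import math

def F(x):
    """CDF of Exponential(1): nondecreasing and right-continuous."""
    return 1 - math.exp(-x) if x >= 0 else 0.0

def mu(a, b):
    """The Lebesgue-Stieltjes recipe: mu_F((a, b]) = F(b) - F(a) for a <= b."""
    return F(b) - F(a)

# Finite additivity over the disjoint decomposition (0, 2] = (0, 1] ⊔ (1, 2]:
assert abs(mu(0, 2) - (mu(0, 1) + mu(1, 2))) < 1e-12
```

Taking F(x) = x on [0, 1] instead would recover Lebesgue measure (the uniform distribution) there.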
This proof has two parts: first, we extend µ from I, the set of half-open intervals of the form (a, b], to A := {finite disjoint unions of elements of I}. Then, we extend from A to the Borel σ-algebra σ(A) = BR with the Carathéodory extension theorem: uniqueness follows from the π-λ theorem, and existence comes from the construction of the outer measure.

This is a class about probability, so let's start to introduce random variables and their properties! We'll see soon that these variables are specific examples of measurable functions, and that properties like the mean are specific examples of Lebesgue integrals.
Definition 26 A probability space is a measure space (Ω, F, µ) such that µ(Ω) = 1. We will often denote µ as P.
One way to interpret this is that Ω is a space of outcomes, F is the collection of possible events (some sets of outcomes are not "well-behaved" enough to be assigned a probability), and P is the measure itself.
Example 27 We can encode a sequence of n independent fair coin tosses via
Ω = {heads, tails}ⁿ = {0, 1}ⁿ, F = P(Ω), P(A) = |A| / |Ω| ∀A ⊆ Ω.
Notice that it's okay to take the powerset of Ω because it is a finite set. By definition, the probability measure here is uniform: for any singleton event (that is, |A| = 1), we have P(A) = 1/2ⁿ.

We'll now be a bit more general and abstract: let (Ω, F, µ) be any measure space, in particular allowing µ(Ω) = ∞.
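The coin-toss example above can be sketched directly in Python (a toy illustration with my own names; since Ω is finite, every subset is an event):

```python
from itertools import product

n = 3
omega = list(product([0, 1], repeat=n))  # Omega = {0,1}^n, all 2^n outcomes

def prob(event):
    """P(A) = |A| / |Omega|: the uniform measure on the powerset of Omega."""
    return len(event) / len(omega)

singleton = [(1, 0, 1)]                  # one particular outcome
assert prob(singleton) == 1 / 2**n       # each outcome has probability 1/2^n
assert prob(omega) == 1                  # P(Omega) = 1: a probability space
```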
Definition 28 An indicator function for a set A ∈ F is defined as 1A(ω) = 1 if ω ∈ A, and 1A(ω) = 0 if ω ∈ Ω \ A.
This may also be denoted 1{ω ∈ A}.
Definition 29 A simple function f : Ω → R is one that can be written in the form
f = ∑_{i=1}^n ci 1_{Ai},
where ci ∈ R and Ai ∈ F.
Definition 30 A measurable function f is one that can be pointwise approximated by simple functions: that is, there exists a
sequence of simple functions {fn} such that
f(ω) = lim_{n→∞} fn(ω) ∀ω ∈ Ω.
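Definition 30 can be illustrated with the standard "dyadic staircase" construction, which approximates a nonnegative f from below by simple functions fn taking values k/2ⁿ, truncated at height n. The sketch below (my own code; the particular f is just an example) checks the pointwise convergence numerically:

```python
import math

def f(x):
    return math.exp(x)  # an arbitrary nonnegative function on R

def f_n(x, n):
    """Dyadic approximation: round f(x) down to the grid 2^-n, cap at n."""
    return min(math.floor(f(x) * 2**n) / 2**n, n)

# Once n exceeds f(x), the cap is inactive and the error is at most 2^-n,
# so f_n(x) -> f(x) pointwise.
x0 = 0.7  # f(x0) ≈ 2.01, so the cap is inactive for n >= 3
for n in range(3, 20):
    assert abs(f(x0) - f_n(x0, n)) <= 2 ** -n
```

Each fn is simple because it takes only finitely many values, each on a measurable set of the form {k/2ⁿ ≤ f < (k+1)/2ⁿ}.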
Definition 31 Let (Ω, F) and (S, G) be two measurable spaces. A measurable mapping is a function f : Ω → S such that for every G ∈ G, the preimage f⁻¹(G) belongs to F.
The intuition to have in mind is that if we take any “well-behaved” set in S, its preimage is “well-behaved” in Ω.
Proposition 32 A measurable function f on (Ω, F) (as defined in Definition 30) is equivalent to a measurable mapping (Ω, F) →
(R, BR) as defined in Definition 31.
Proof. Exercise, left for our homework!
The definition of a measurable function is more concrete, while the mapping is a bit more abstract. The latter is useful because a measure µ on the domain space Ω naturally induces a measure on the target space S!
Proposition 33 If µ is a measure on (Ω, F), and we have a measurable mapping f : (Ω, F) → (S, G), then we have a measure ν on (S, G) called the pushforward measure: ν(G) = µ(f⁻¹(G)) ∀G ∈ G.
This is typically written as ν = f∗µ or f#µ.
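On a finite space the pushforward is just a sum of µ-masses over preimages. Here is a small sketch (my own toy example, not from the lecture): the uniform measure on a six-sided die, pushed forward under the map sending each roll to its parity:

```python
from fractions import Fraction

omega = range(1, 7)
mu = {w: Fraction(1, 6) for w in omega}  # uniform measure on the die

def f(w):
    return "even" if w % 2 == 0 else "odd"  # measurable map into S = {"even","odd"}

def pushforward(mu, f, G):
    """nu(G) = mu(f^{-1}(G)): total mu-mass of the preimage of G."""
    return sum(m for w, m in mu.items() if f(w) in G)

assert pushforward(mu, f, {"even"}) == Fraction(1, 2)
assert pushforward(mu, f, {"even", "odd"}) == 1  # nu is a probability measure
```

Countable additivity of ν follows immediately from that of µ, since preimages of disjoint sets are disjoint.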
Proposition 34 If f , g are measurable, then af + bg is measurable for all a, b ∈ R, and so is f · g.
Proof sketch. Write f and g as pointwise limits of simple functions and do some bookkeeping.
Definition 35 A random variable is a measurable function defined on a probability space (Ω, F, P). (This is usually denoted X or Y rather than f.) More generally, an (S, G)-valued random variable is a measurable mapping (Ω, F) → (S, G).
Usually, random variables are real-valued, but we may occasionally specify a different target space.
Fact 36 (Helpful notation) Since X is a function from Ω → R, we often consider the preimage X⁻¹(B) = {ω ∈ Ω : X(ω) ∈ B}. This will be denoted {X ∈ B}: for example,

{X ≥ 1/2} = {ω ∈ Ω : X(ω) ≥ 1/2}.
Why is this a good notation? X is a mapping from some space (Ω, F, P) to the real line (R, B), so if we take the pushforward measure X∗P, we get the distribution or law of our random variable X, often denoted LX . Let’s do the symbol pushing to make sure we understand what this means! This is a measure on our target space (R, B): it takes in a measurable subset B ∈ B and outputs
LX(B) = (X∗P)(B) = P(X⁻¹(B)) = P({ω ∈ Ω : X(ω) ∈ B}).
So measure theory gives a meaning to P(X ∈ B), which is a notation that everyone likes to use.
Example 37 Looking back at the coin-tossing example, where we have F = P(Ω) and P the uniform measure on Ω, we can consider the following random variables:
X(ω) = ω1 = 1{first toss is heads}, Y(ω) = ∑_{i=1}^n ωi = number of heads.
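As a side computation (my own sketch, for n = 3), the weights P(Y = k) can be tabulated by brute force over Ω, and they match the familiar binomial formula C(n, k)/2ⁿ:

```python
from itertools import product
from math import comb

n = 3
omega = list(product([0, 1], repeat=n))  # Omega = {0,1}^n, uniform measure

def law_Y(k):
    """Weight at k of the law of Y: P({omega : Y(omega) = k}), Y = sum of tosses."""
    preimage = [w for w in omega if sum(w) == k]
    return len(preimage) / len(omega)

for k in range(n + 1):
    assert law_Y(k) == comb(n, k) / 2**n  # e.g. law_Y(1) = 3/8 for n = 3
```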
We can visualize Ω (for n = 3) as the vertices of a cube: under the map Y, each vertex (x, y, z) of the cube maps to x + y + z ∈ R. (This is an example of why we don't always just want the identity mapping for our probability distribution: we're condensing our information.) The law LY is a measure on (R, B), supported on {0, 1, ··· , n}, so we just need to find the weight assigned to each integer. Let's work this out in formal notation: