Probabilistic Measure Theory
Andrew Kobin
Spring 2015

Contents
0 Introduction 1

1 Probability and Normal Numbers 4
  1.1 The Weak Law of Large Numbers 6
  1.2 The Strong Law of Large Numbers 8
  1.3 Further Properties of Normal Numbers 14

2 Probability Measures 19
  2.1 Fields, σ-fields and Probability Measures 19
  2.2 The Lebesgue Measure on the Unit Interval 26
  2.3 Extension to σ-fields 27
  2.4 π-systems and λ-systems 32
  2.5 Monotone Classes 34
  2.6 Complete Extensions 35
  2.7 Non-Measurable Sets 37

3 Denumerable Probabilities 38
  3.1 Limit Inferior, Limit Superior and Convergence 38
  3.2 Independence 40
  3.3 Subfields 43
  3.4 The Borel-Cantelli Lemmas 44

4 Simple Random Variables 47
  4.1 Convergence in Measure 49
  4.2 Independent Variables 51
  4.3 Expected Value and Variance 52
  4.4 Abstract Laws of Large Numbers 55
  4.5 Second Borel-Cantelli Lemma Revisited 58
  4.6 Bernstein’s Theorem 60
  4.7 Gambling 62
  4.8 Markov Chains 68
  4.9 Transience and Persistence 72

5 Abstract Measure Theory 80
  5.1 Measures 80
  5.2 Outer Measure 83
  5.3 Lebesgue Measure on R^n 87
  5.4 Measurable Functions 91
  5.5 Distribution Functions 93

6 Integration Theory 97
  6.1 Measure-Theoretic Integration 97
  6.2 Properties of Integration 100
0 Introduction
These notes were compiled from a course on measure-theoretic probability theory taught by Dr. Sarah Raynor in Spring 2015 at Wake Forest University. The course covers the basic concepts in measure theory and uses them to deepen understanding of probability. The companion text for the course is Probability and Measure, 4th ed., by P. Billingsley.

One of the best examples to illustrate the nuance of measure theory is the Cantor set. The Cantor set C is defined as follows. Let A_0 = [0, 1], the unit interval. Let A_1 be the set A_0 − (1/3, 2/3) formed by deleting the middle third of A_0. Next, A_2 is similarly formed by deleting the middle thirds (1/9, 2/9) and (7/9, 8/9) from each component of A_1. The process is continued to define a sequence
\[ A_n = A_{n-1} - \bigcup_{k=0}^{\infty} \left( \frac{1+3k}{3^n}, \frac{2+3k}{3^n} \right). \]
Finally, the Cantor set is the subset of [0, 1] given by
\[ C = \bigcap_{n=0}^{\infty} A_n, \]
that is, the points remaining in the unit interval after iterating this process over the natural numbers.

Length is our first idea of measure, from which many others will stem. If we take the usual length of an interval on the real line to be endpoint minus starting point, then the unit interval [0, 1] has length 1. One may then ask: how long is the Cantor set? To measure the length of C, we instead calculate the length of its complement and subtract it from 1. The length of the complement is the infinite sum
\[ \frac{1}{3} + \frac{2}{9} + \frac{4}{27} + \frac{8}{81} + \cdots, \]
a geometric series converging to \frac{1/3}{1 - 2/3} = 1. Thus the complement of the Cantor set has length 1, but the total unit interval has length 1, meaning the Cantor set has length 0. This is our first example of a set of measure zero.

Area, volume, hypervolume, etc. are all extensions of length to higher dimensions; these are also examples of measures. For example, the area of an annulus is easy to compute. Consider the following region R.
[Figure: the annular region R, with inner radius 1 and outer radius 2.]
We compute the area A by A = 4π − π = 3π. However, one may also want to compute the mass of the annulus, say if it were made of aluminum or steel. Given a density function, e.g. ρ = e^{−r²} kg/cm², find the mass of the annular region. This is computed by a double integral,
\[ \iint_R \rho \, dA = \int_0^{2\pi} \int_1^2 e^{-r^2} r \, dr \, d\theta = \int_0^{2\pi} -\frac{1}{2} \left( e^{-4} - e^{-1} \right) d\theta = \pi \left( e^{-1} - e^{-4} \right). \]
If we think of a double integral as the limit of the process of breaking the region into smaller regions and adding together all their masses, we see the same concept at work as in the Cantor set example.
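As a sanity check, the double integral above can be approximated numerically. The sketch below is an illustration only; it assumes the inner radius 1 and outer radius 2 from the area computation, under which the closed form works out to π(e^{−1} − e^{−4}). The function name `annulus_mass` and the midpoint-rule approach are choices made here, not part of the course material.

```python
import math

def annulus_mass(n: int = 100_000) -> float:
    """Midpoint-rule approximation of the mass integral
    ∫_0^{2π} ∫_1^2 e^{-r^2} r dr dθ over the annulus 1 ≤ r ≤ 2."""
    h = 1.0 / n                      # width of each subinterval of [1, 2]
    total = 0.0
    for i in range(n):
        r = 1.0 + (i + 0.5) * h      # midpoint of the i-th subinterval
        total += math.exp(-r * r) * r * h
    return 2 * math.pi * total       # the θ-integral contributes 2π

exact = math.pi * (math.exp(-1) - math.exp(-4))
print(annulus_mass(), exact)
```

The numerical value agrees with the closed form to many decimal places, illustrating the "break into small pieces and sum" idea that the text highlights.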
How does this relate to probability?

Example 0.0.1. What is the probability of rolling a prime number on a standard six-sided die? This can be computed by the same divide-and-conquer approach:
\[ P(\text{prime}) = P(2) + P(3) + P(5) = \frac{3}{6} = \frac{1}{2}. \]

Example 0.0.2. When playing craps (rolling two dice), what is the probability of rolling either a 7 or an 11?
\[ P(7 \text{ or } 11) = P(7) + P(11) = \frac{6}{36} + \frac{2}{36} = \frac{8}{36} = \frac{2}{9}. \]

Example 0.0.3. Given a dartboard of unit area, the probability of hitting a small region on the board with your dart is precisely the area of that region.
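The two dice examples can be verified by brute-force enumeration of equally likely outcomes; a small sketch (the variable names are choices made here):

```python
from fractions import Fraction
from itertools import product

# One die: the primes among {1, ..., 6} are 2, 3, 5.
p_prime = Fraction(sum(1 for face in range(1, 7) if face in (2, 3, 5)), 6)

# Two dice: enumerate all 36 equally likely ordered rolls.
rolls = list(product(range(1, 7), repeat=2))
p_7_or_11 = Fraction(sum(1 for a, b in rolls if a + b in (7, 11)), len(rolls))

print(p_prime, p_7_or_11)  # 1/2 2/9
```

Exact rational arithmetic via `Fraction` reproduces the hand computations 1/2 and 2/9 with no floating-point fuzz.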
There is a common theme among the above examples, which is that the calculation of probability relies on our ability to measure things and compare the relative measures. We generalize this in the following way.
Definition. A measure µ on a set S is a function µ : P(S) → [0, ∞] such that µ is countably additive, that is, if A is a subset of S of the form A = ⋃_{n=1}^∞ A_n with A_n ∩ A_m = ∅ for all n ≠ m, then
\[ \mu(A) = \sum_{n=1}^{\infty} \mu(A_n). \]

This isn’t quite the full definition yet; we will formalize everything in Chapter 5. However, some interesting questions arise from defining a measure this way:
(1) Is every subset of S measurable? When the set is finite, the answer is yes. However, for the unit interval with length as a measure, the answer is no. A counterexample is difficult to produce at this time.
(2) The Banach-Tarski Paradox (sometimes called the Banach-Tarski Theorem) says that it is possible to take a solid ball of any size, say the size of a basketball, decompose it into finitely many pieces and put them back together only using rigid motions to get a ball the size of the sun. How is this possible?
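Before moving on, the additivity condition itself is easy to illustrate in the finite setting where every subset is measurable. The sketch below (an illustration only; the names are choices made here) checks that the counting measure µ(A) = |A| is additive over every disjoint pair of subsets of a small set:

```python
from itertools import combinations

def subsets(S):
    """All subsets of a finite set S, as frozensets."""
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

mu = len  # counting measure: µ(A) = number of elements of A

S = frozenset({1, 2, 3, 4})
ok = all(mu(A | B) == mu(A) + mu(B)
         for A in subsets(S) for B in subsets(S) if not (A & B))
print(ok)  # True: additive over all disjoint pairs
```

For a finite set, finite additivity already captures countable additivity, since only finitely many of the A_n can be nonempty.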
The domain of a measure must have a special structure, which is called a σ-field (some- times σ-algebra in the literature).
Definition. Let S be a set and F a collection of subsets of S. F is a σ-algebra provided
(1) S ∈ F.
(2) If A ∈ F then A^c ∈ F as well.
(3) If A_1, A_2, ... ∈ F (this may be a countable list) then ⋃_{n=1}^∞ A_n ∈ F.

It turns out that this is just enough structure to allow us to define a measure on F. This will be the main ‘universe’ in which we work, defining probability measures and developing their applications. The first four chapters may be treated as an extensive case study of probability spaces, that is, spaces with total measure 1. In Chapter 5 we finally introduce the terminology and main theorems in abstract measure theory.
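For a finite set S, the three axioms can be checked by brute force, since countable unions reduce to finite ones. A small sketch (the function names are choices made here, not standard API):

```python
from itertools import combinations

def all_subsets(S):
    """The power set of a finite set S, as a set of frozensets."""
    S = list(S)
    return {frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)}

def is_sigma_field(S, F):
    """Check the three axioms for a collection F of subsets of a finite
    set S. Closure under pairwise unions suffices here, since closure
    under finite unions then follows by induction."""
    S, F = frozenset(S), set(F)
    return (S in F                                  # (1) S ∈ F
            and all(S - A in F for A in F)          # (2) closed under complement
            and all(A | B in F for A in F for B in F))  # (3) closed under union

S = {1, 2, 3}
print(is_sigma_field(S, all_subsets(S)))                               # True
print(is_sigma_field(S, {frozenset(), frozenset(S), frozenset({1})}))  # False
```

The second collection fails because the complement {2, 3} of {1} is missing, showing that axiom (2) has real content even in tiny examples.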
1 Probability and Normal Numbers
In these notes we will denote a sample space by Ω and a particular outcome taken from this sample space by ω. Our prototypical example will have Ω = (0, 1]. For technical reasons we will always assume an interval of the real line is of the form (a, b] so that collections of intervals may be chosen disjointly (so they don’t overlap at the endpoints). If I = (a, b] we will denote the usual notion of length by |I| = b − a.

Suppose A = ⋃_{i=1}^n I_i where the I_i = (a_i, b_i] are pairwise disjoint intervals in the sample space Ω = (0, 1]. We define

Definition. The probability of event A occurring within the sample space Ω is
\[ P(A) := \sum_{i=1}^{n} |I_i| = \sum_{i=1}^{n} (b_i - a_i). \]

At this point we are carefully avoiding complicated subsets of Ω, such as the Cantor set in the introduction. These will be the focus in later chapters. If A and B are disjoint subsets of Ω and each of A, B is a finite disjoint union of intervals, then P(A ∪ B) = P(A) + P(B). This is called the finite additivity of probability.

So far we have brushed over something important: is our definition of P(A) well-defined? That is, if A has two different representations as finite disjoint unions of intervals in Ω, do they both give the same probability? Well, suppose A = ⋃_{i=1}^n I_i = ⋃_{j=1}^m J_j. We create a collection of intervals K_ij = I_i ∩ J_j, called a refinement of the I_i and J_j. Notice that
\[ A = \bigcup_{j=1}^{m} \bigcup_{i=1}^{n} K_{ij} = \bigcup_{j=1}^{m} \bigcup_{i=1}^{n} (I_i \cap J_j). \]
Since each I_i is the disjoint union of the K_ij over j, and each J_j is the disjoint union of the K_ij over i, both representations give ∑_i |I_i| = ∑_{i,j} |K_ij| = ∑_j |J_j|. This implies well-definedness of our definition of P(A).

Example 1.0.4. This relates to the Riemann integral in an important way. For a subset A ⊂ Ω which is a disjoint union of finitely many intervals in Ω = (0, 1], define the characteristic function
\[ f_A = \sum_{i=1}^{n} \chi_{I_i} \quad \text{where} \quad \chi_{I_i}(x) = \begin{cases} 1 & x \in I_i \\ 0 & x \notin I_i. \end{cases} \]
Similarly define g_B = ∑_{j=1}^m χ_{J_j}. Then finite additivity of probability implies the additive property of Riemann integrals:
\[ \int_0^1 (f_A + g_B) \, dx = \int_0^1 f_A \, dx + \int_0^1 g_B \, dx. \]
This is because
\[ \int_0^1 \chi_I(x) \, dx = |I| = b - a. \]
Keep in mind that for the moment we are only dealing with event spaces that are finite disjoint unions of half-open intervals; when we encounter more complicated subsets of Ω, Riemann integration breaks down. In that case we will need to use Lebesgue integration, one of the main tools in modern integration theory.

Our next goal is to equate the probabilistic notion of selecting points from the unit interval with the physical act of flipping an infinite number of coins and counting heads and tails. Define d_i(ω) to be the result of the ith flip of the infinite sequence of coin flips; we will denote this numerically by
\[ d_i(\omega) = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails.} \end{cases} \]
The outcome ω can be represented as a sequence of 1’s and 0’s: (d_1(ω), d_2(ω), d_3(ω), ...). We will also make use of the dyadic (binary) representation
\[ \omega = \sum_{i=1}^{\infty} d_i(\omega) 2^{-i}. \]
Each sequence of 0’s and 1’s corresponds to a unique real number in the interval [0, 1]. However, not every real number in [0, 1] has a unique dyadic representation. For example, 5/8 can be represented by 0.101000... but also by the non-terminating 0.100111... It is convention to prefer the non-terminating representation, since this will coincide with our other preference for half-open intervals (a, b]. Notice that picking only non-terminating dyadic representations excludes 0 = 0.000... from our probability space, so we are indeed constructing (0, 1].

Now, drawing at random with uniform probability from Ω = (0, 1] is equivalent to the dyadic representation of an infinite sequence of coin flips. The reason is that P[d_i(ω) = 1] is equal to the sum of the lengths of 2^{i−1} intervals, each of which has length 2^{−i}. This is illustrated below.
Rank 1: d_1 = 0 on (0, 1/2], and d_1 = 1 on (1/2, 1].

Rank 2: d_2 = 0 on (0, 1/4] and (1/2, 3/4], and d_2 = 1 on (1/4, 1/2] and (3/4, 1].
These are sometimes called dyadic intervals. From this we can see that the probability of any single flip coming up heads is 1/2, since at any level, half of the 2^i intervals are included in this event. The 2^n intervals of length 2^{−n} for any n are called the set of rank n dyadic intervals; they have the nice property of being nested. Formally, if n > m and I_i is an interval of rank n, there is a unique J_j of rank m such that I_i ⊂ J_j.
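Both claims above, the non-terminating digit convention and P[d_i = 1] = 1/2, can be checked with exact rational arithmetic. A sketch (the function name `dyadic_digits` and the recurrence are choices made here; the convention d_i = 1 exactly when the running remainder exceeds 1/2 matches the half-open intervals pictured above):

```python
from fractions import Fraction

def dyadic_digits(omega: Fraction, n: int) -> list:
    """First n digits d_i(ω) of ω ∈ (0, 1] under the non-terminating
    convention: d_i = 1 on the half-open intervals where flip i is heads."""
    x, digits = omega, []
    for _ in range(n):
        d = 1 if x > Fraction(1, 2) else 0
        digits.append(d)
        x = 2 * x - d          # remainder after removing digit d, kept in (0, 1]
    return digits

# 5/8 gets the non-terminating expansion 0.100111..., not 0.101000...
print(dyadic_digits(Fraction(5, 8), 8))   # [1, 0, 0, 1, 1, 1, 1, 1]

# P[d_m = 1] = 1/2: sum the lengths of the rank-n intervals on which d_m
# is 1. Since d_m is constant on each interval, testing the midpoint suffices.
n, m = 4, 3
heads = sum(Fraction(1, 2 ** n)
            for k in range(2 ** n)
            if dyadic_digits(Fraction(2 * k + 1, 2 ** (n + 1)), m)[-1] == 1)
print(heads)  # 1/2
```

Exactly half of the 2^4 rank-4 intervals carry d_3 = 1, and their lengths sum to 1/2, as the text claims.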
Example 1.0.5. The binomial formula expresses the probability that k heads will be flipped in n trials. Using the interval construction above, we see that
\[ P(k \text{ heads in the first } n \text{ flips}) = P(k \text{ of the first } n \text{ binary digits are } 1) = \#\{\text{subsets of } \{1, \ldots, n\} \text{ with } k \text{ elements}\} \cdot 2^{-n} = \binom{n}{k} 2^{-n}, \]
which is exactly the same as provided by the binomial formula.
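The counting argument can be confirmed by listing all length-n flip sequences directly; a quick sketch for one choice of n and k (chosen here for illustration):

```python
from itertools import product
from math import comb

n, k = 6, 2
flips = list(product((0, 1), repeat=n))           # all 2^6 = 64 flip sequences
count = sum(1 for seq in flips if sum(seq) == k)  # sequences with exactly k heads
print(count, comb(n, k), count / 2 ** n)          # 15 15 0.234375
```

Each sequence corresponds to one rank-n dyadic interval of length 2^{−n}, so the probability is the count of favorable sequences times 2^{−n}, matching the binomial coefficient.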
Notice that if {I_i} is a collection of rank n dyadic intervals and n ≥ m, then d_m(x) is constant on each I_i of rank n.
1.1 The Weak Law of Large Numbers
This brings us to the Law of Large Numbers. In probability theory, the LLN states that the average of a sequence of random trials converges to a particular value: the expected value (EV). In this course, we will distinguish between two different versions of the LLN.
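Before stating the weak form, the tail probability it controls can be computed exactly for fair coin flips, since by Example 1.0.5 each count of heads k carries probability C(n, k) 2^{−n}. A sketch (the function name `tail_prob` is a choice made here):

```python
from fractions import Fraction
from math import comb

def tail_prob(n: int, eps: Fraction) -> Fraction:
    """Exact P(|S_n/n - 1/2| > eps), where S_n counts heads in n fair
    flips: sum the binomial weights of the 'bad' head counts k."""
    return sum(Fraction(comb(n, k), 2 ** n)
               for k in range(n + 1)
               if abs(Fraction(k, n) - Fraction(1, 2)) > eps)

eps = Fraction(1, 10)
for n in (10, 100, 500):
    print(n, float(tail_prob(n, eps)))
```

Running this shows the tail probability shrinking toward 0 as n grows (roughly 0.34, 0.035, and then vanishingly small), which is exactly the convergence the Weak LLN asserts.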
Theorem 1.1.1 (Weak Law of Large Numbers). For the sample space Ω = (0, 1] and any ε > 0,
\[ \lim_{n \to \infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} d_i(\omega) - \frac{1}{2} \right| > \varepsilon \right) = 0. \]
Note that for each n, the set of ω satisfying the inequality is a finite disjoint union of intervals, so this probability is defined.

Proof. To prove the Weak LLN, we first define the Rademacher functions
\[ r_i(\omega) = 2 d_i(\omega) - 1 = \begin{cases} 1 & \text{if heads} \\ -1 & \text{if tails} \end{cases} \]
for each i, and the cumulative Rademacher function of rank n,
\[ s_n(\omega) = \sum_{i=1}^{n} r_i(\omega). \]
In this language, the above probability may be expressed as