
PROBABILITY AND MEASURE

J. R. NORRIS

Contents
1. Measures
2. Measurable functions and random variables
3. Integration
4. Norms and inequalities
5. Completeness of L^p and orthogonal projection
6. Convergence in L¹(P)
7. Fourier transforms
8. Gaussian random variables
9. Ergodic theory
10. Sums of independent random variables
Index

These notes are intended for use by students of the Mathematical Tripos at the University of Cambridge. Copyright remains with the author. Please send corrections to [email protected].

Schedule

Measure spaces, σ-algebras, π-systems and uniqueness of extension, statement *and proof* of Carathéodory's extension theorem. Construction of Lebesgue measure on R, Borel σ-algebra of R, existence of a non-measurable subset of R. Lebesgue–Stieltjes measures and probability distribution functions. Independence of events, independence of σ-algebras. Borel–Cantelli lemmas. Kolmogorov's zero–one law.

Measurable functions, random variables, independence of random variables. Construction of the integral, expectation. Convergence in measure and convergence almost everywhere. Fatou's lemma, monotone and dominated convergence, uniform integrability, differentiation under the integral sign. Discussion of product measure and statement of Fubini's theorem.

Chebyshev's inequality, tail estimates. Jensen's inequality. Completeness of L^p for 1 ≤ p ≤ ∞. Hölder's and Minkowski's inequalities, uniform integrability.

L² as a Hilbert space. Orthogonal projection, relation with elementary conditional probability. Variance and covariance. Gaussian random variables, the multivariate normal distribution.

The strong law of large numbers, proof for independent random variables with bounded fourth moments. Measure preserving transformations, Bernoulli shifts. Statements *and proofs* of the maximal ergodic theorem and Birkhoff's almost everywhere ergodic theorem, proof of the strong law.

The Fourier transform of a finite measure, characteristic functions, uniqueness and inversion. Weak convergence, statement of Lévy's continuity theorem for characteristic functions. The central limit theorem.

Appropriate books

P. Billingsley, Probability and Measure. Wiley, 2012 (£90.00 hardback).
R. M. Dudley, Real Analysis and Probability. Cambridge University Press, 2002 (£40.00 paperback).
R. T. Durrett, Probability: Theory and Examples. (£45.00 hardback).
D. Williams, Probability with Martingales. Cambridge University Press, 1991 (£31.00 paperback).

1. Measures

1.1. Definitions. Let E be a set. A σ-algebra E on E is a set of subsets of E, containing the empty set ∅ and such that, for all A ∈ E and all sequences (A_n : n ∈ N) in E,

    A^c ∈ E,   ∪_n A_n ∈ E.

The pair (E, E) is called a measurable space. Given (E, E), each A ∈ E is called a measurable set.
A measure µ on (E, E) is a function µ : E → [0, ∞], with µ(∅) = 0, such that, for any sequence (A_n : n ∈ N) of disjoint elements of E,

    µ(∪_n A_n) = Σ_n µ(A_n).

This property is called countable additivity. The triple (E, E, µ) is called a measure space.

1.2. Discrete measure theory. Let E be a countable set and let E be the set of all subsets of E. A mass function is any function m : E → [0, ∞]. If µ is a measure on (E, E), then, by countable additivity,

    µ(A) = Σ_{x∈A} µ({x}),   A ⊆ E.

So there is a one-to-one correspondence between measures and mass functions, given by

    m(x) = µ({x}),   µ(A) = Σ_{x∈A} m(x).

This sort of measure space provides a toy version of the general theory, where each of the results we prove for general measure spaces reduces to some straightforward fact about the convergence of series. This is all one needs to do elementary discrete probability and discrete-time Markov chains, so these topics are usually introduced without discussing measure theory.
Discrete measure theory is essentially the only context where one can define a measure explicitly because, in general, σ-algebras are not amenable to an explicit presentation which would allow us to make such a definition. Instead one specifies the values to be taken on some smaller set of subsets, which generates the σ-algebra. This gives rise to two problems: first, to know that there is a measure extending the given set function; second, to know that there is not more than one. The first problem, which is one of construction, is often dealt with by Carathéodory's extension theorem. The second problem, that of uniqueness, is often dealt with by Dynkin's π-system lemma.

1.3. Generated σ-algebras. Let A be a set of subsets of E. Define

    σ(A) = {A ⊆ E : A ∈ E′ for all σ-algebras E′ containing A}.

Then σ(A) is a σ-algebra, which is called the σ-algebra generated by A. It is the smallest σ-algebra containing A.
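The discrete theory of §1.2 can be made completely concrete. The following sketch (illustrative only, not from the notes; the names mass and measure are ours) builds a measure from a mass function by summation, with countable additivity reducing to rearrangement of an absolutely convergent series.

    # Discrete measure theory: a mass function m on a countable set E
    # determines a measure by mu(A) = sum of m(x) over x in A.

    def measure(mass, A):
        """Measure of the set A induced by the mass function `mass` (a dict)."""
        return sum(mass[x] for x in A if x in mass)

    # Example: a geometric mass function on E = {0, 1, 2, ...}, truncated for display.
    mass = {n: 2.0 ** -(n + 1) for n in range(50)}

    evens = {n for n in mass if n % 2 == 0}
    odds = {n for n in mass if n % 2 == 1}

    # Additivity is rearrangement of an absolutely convergent series:
    print(measure(mass, evens) + measure(mass, odds))  # ~1, the measure of E
    print(measure(mass, mass.keys()))                  # ~1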

1.4. π-systems and d-systems. Let A be a set of subsets of E. Say that A is a π-system if ∅ ∈ A and, for all A, B ∈ A,

    A ∩ B ∈ A.

Say that A is a d-system if E ∈ A and, for all A, B ∈ A with A ⊆ B and all increasing sequences (A_n : n ∈ N) in A,

    B \ A ∈ A,   ∪_n A_n ∈ A.

Note that, if A is both a π-system and a d-system, then A is a σ-algebra.

Lemma 1.4.1 (Dynkin’s π-system lemma). Let A be a π-system. Then any d-system containing A contains also the σ-algebra generated by A.

Proof. Denote by D the intersection of all d-systems containing A. Then D is itself a d-system. We shall show that D is also a π-system and hence a σ-algebra, thus proving the lemma. Consider

    D′ = {B ∈ D : B ∩ A ∈ D for all A ∈ A}.

Then A ⊆ D′ because A is a π-system. Let us check that D′ is a d-system: clearly E ∈ D′; next, suppose B_1, B_2 ∈ D′ with B_1 ⊆ B_2, then for A ∈ A we have

    (B_2 \ B_1) ∩ A = (B_2 ∩ A) \ (B_1 ∩ A) ∈ D

because D is a d-system, so B_2 \ B_1 ∈ D′; finally, if B_n ∈ D′, n ∈ N, and B_n ↑ B, then for A ∈ A we have

    B_n ∩ A ↑ B ∩ A

so B ∩ A ∈ D and B ∈ D′. Hence D = D′.
Now consider

    D′′ = {B ∈ D : B ∩ A ∈ D for all A ∈ D}.

Then A ⊆ D′′ because D = D′. We can check that D′′ is a d-system, just as we did for D′. Hence D′′ = D, which shows that D is a π-system as promised. □

1.5. Set functions and properties. Let A be any set of subsets of E containing the empty set ∅. A set function is a function µ : A → [0, ∞] with µ(∅) = 0. Let µ be a set function. Say that µ is increasing if, for all A, B ∈ A with A ⊆ B,

    µ(A) ≤ µ(B).

Say that µ is additive if, for all disjoint sets A, B ∈ A with A ∪ B ∈ A,

    µ(A ∪ B) = µ(A) + µ(B).

Say that µ is countably additive if, for all sequences of disjoint sets (A_n : n ∈ N) in A with ∪_n A_n ∈ A,

    µ(∪_n A_n) = Σ_n µ(A_n).

Say that µ is countably subadditive if, for all sequences (A_n : n ∈ N) in A with ∪_n A_n ∈ A,

    µ(∪_n A_n) ≤ Σ_n µ(A_n).

1.6. Construction of measures. Let A be a set of subsets of E. Say that A is a ring on E if ∅ ∈ A and, for all A, B ∈ A,

    B \ A ∈ A,   A ∪ B ∈ A.

Say that A is an algebra on E if ∅ ∈ A and, for all A, B ∈ A,

    A^c ∈ A,   A ∪ B ∈ A.

Theorem 1.6.1 (Carathéodory's extension theorem). Let A be a ring of subsets of E and let µ : A → [0, ∞] be a countably additive set function. Then µ extends to a measure on the σ-algebra generated by A.

Proof. For any B ⊆ E, define the outer measure

    µ*(B) = inf Σ_n µ(A_n),

where the infimum is taken over all sequences (A_n : n ∈ N) in A such that B ⊆ ∪_n A_n, and is taken to be ∞ if there is no such sequence. Note that µ* is increasing and µ*(∅) = 0. Let us say that A ⊆ E is µ*-measurable if, for all B ⊆ E,

    µ*(B) = µ*(B ∩ A) + µ*(B ∩ A^c).

Write M for the set of all µ*-measurable sets. We shall show that M is a σ-algebra containing A and that µ* restricts to a measure on M, extending µ. This will prove the theorem.

Step I. We show that µ* is countably subadditive. Suppose that B ⊆ ∪_n B_n. We have to show that

    µ*(B) ≤ Σ_n µ*(B_n).

It will suffice to consider the case where µ*(B_n) < ∞ for all n. Then, given ε > 0, there exist sequences (A_{nm} : m ∈ N) in A, with

    B_n ⊆ ∪_m A_{nm},   µ*(B_n) + ε/2^n ≥ Σ_m µ(A_{nm}).

Now

    B ⊆ ∪_n ∪_m A_{nm}

so

    µ*(B) ≤ Σ_n Σ_m µ(A_{nm}) ≤ Σ_n µ*(B_n) + ε.

Since ε > 0 was arbitrary, we are done.

Step II. We show that µ* extends µ. Since A is a ring and µ is countably additive, µ is countably subadditive and increasing. Hence, for A ∈ A and any sequence (A_n : n ∈ N) in A with A ⊆ ∪_n A_n,

    µ(A) ≤ Σ_n µ(A ∩ A_n) ≤ Σ_n µ(A_n).

On taking the infimum over all such sequences, we see that µ(A) ≤ µ*(A). On the other hand, it is obvious that µ*(A) ≤ µ(A) for A ∈ A.

Step III. We show that M contains A. Let A ∈ A and B ⊆ E. We have to show that

    µ*(B) = µ*(B ∩ A) + µ*(B ∩ A^c).

By subadditivity of µ*, it is enough to show that

    µ*(B) ≥ µ*(B ∩ A) + µ*(B ∩ A^c).

If µ*(B) = ∞, this is clearly true, so let us assume that µ*(B) < ∞. Then, given ε > 0, we can find a sequence (A_n : n ∈ N) in A such that

    B ⊆ ∪_n A_n,   µ*(B) + ε ≥ Σ_n µ(A_n).

Then

    B ∩ A ⊆ ∪_n (A_n ∩ A),   B ∩ A^c ⊆ ∪_n (A_n ∩ A^c)

so

    µ*(B ∩ A) + µ*(B ∩ A^c) ≤ Σ_n µ(A_n ∩ A) + Σ_n µ(A_n ∩ A^c) = Σ_n µ(A_n) ≤ µ*(B) + ε.

Since ε > 0 was arbitrary, we are done.

Step IV. We show that M is an algebra. Clearly E ∈ M and A^c ∈ M whenever A ∈ M. Suppose that A_1, A_2 ∈ M and B ⊆ E. Then

    µ*(B) = µ*(B ∩ A_1) + µ*(B ∩ A_1^c)
          = µ*(B ∩ A_1 ∩ A_2) + µ*(B ∩ A_1 ∩ A_2^c) + µ*(B ∩ A_1^c)
          = µ*(B ∩ A_1 ∩ A_2) + µ*(B ∩ (A_1 ∩ A_2)^c ∩ A_1) + µ*(B ∩ (A_1 ∩ A_2)^c ∩ A_1^c)
          = µ*(B ∩ (A_1 ∩ A_2)) + µ*(B ∩ (A_1 ∩ A_2)^c).

Hence A_1 ∩ A_2 ∈ M.

Step V. We show that M is a σ-algebra and that µ* restricts to a measure on M. We already know that M is an algebra, so it suffices to show that, for any sequence of disjoint sets (A_n : n ∈ N) in M, for A = ∪_n A_n we have

    A ∈ M,   µ*(A) = Σ_n µ*(A_n).

So, take any B ⊆ E, then

    µ*(B) = µ*(B ∩ A_1) + µ*(B ∩ A_1^c)
          = µ*(B ∩ A_1) + µ*(B ∩ A_2) + µ*(B ∩ A_1^c ∩ A_2^c)
          = ··· = Σ_{i=1}^n µ*(B ∩ A_i) + µ*(B ∩ A_1^c ∩ ··· ∩ A_n^c).

Note that µ*(B ∩ A_1^c ∩ ··· ∩ A_n^c) ≥ µ*(B ∩ A^c) for all n. Hence, on letting n → ∞ and using countable subadditivity, we get

    µ*(B) ≥ Σ_{n=1}^∞ µ*(B ∩ A_n) + µ*(B ∩ A^c) ≥ µ*(B ∩ A) + µ*(B ∩ A^c).

The reverse inequality holds by subadditivity, so we have equality. Hence A ∈ M and, setting B = A, we get

    µ*(A) = Σ_{n=1}^∞ µ*(A_n).  □

1.7. Uniqueness of measures.

Theorem 1.7.1 (Uniqueness of extension). Let µ_1, µ_2 be measures on (E, E) with µ_1(E) = µ_2(E) < ∞. Suppose that µ_1 = µ_2 on A, for some π-system A generating E. Then µ_1 = µ_2 on E.

Proof. Consider D = {A ∈ E : µ_1(A) = µ_2(A)}. By hypothesis, E ∈ D; for A, B ∈ E with A ⊆ B, we have

    µ_1(A) + µ_1(B \ A) = µ_1(B) < ∞,   µ_2(A) + µ_2(B \ A) = µ_2(B) < ∞

so, if A, B ∈ D, then also B \ A ∈ D; if A_n ∈ D, n ∈ N, with A_n ↑ A, then

    µ_1(A) = lim_n µ_1(A_n) = lim_n µ_2(A_n) = µ_2(A)

so A ∈ D. Thus D is a d-system containing the π-system A, so D = E by Dynkin's lemma.  □

1.8. Borel sets and measures. Let E be a topological space. The σ-algebra generated by the set of open sets in E is called the Borel σ-algebra of E and is denoted B(E). The Borel σ-algebra of R is denoted simply by B. A measure µ on (E, B(E)) is called a Borel measure on E. If moreover µ(K) < ∞ for all compact sets K, then µ is called a Radon measure on E.

1.9. Probability measures, finite and σ-finite measures. If µ(E) = 1 then µ is a probability measure and (E, E, µ) is a probability space. The notation (Ω, F, P) is often used to denote a probability space. If µ(E) < ∞, then µ is a finite measure. If there exists a sequence of sets (E_n : n ∈ N) in E with µ(E_n) < ∞ for all n and ∪_n E_n = E, then µ is a σ-finite measure.

1.10. Lebesgue measure.

Theorem 1.10.1. There exists a unique Borel measure µ on R such that, for all a, b ∈ R with a < b,

    µ((a, b]) = b − a.

The measure µ is called Lebesgue measure on R.

Proof. (Existence.) Consider the ring A of finite unions of disjoint intervals of the form

    A = (a_1, b_1] ∪ ··· ∪ (a_n, b_n].

We note that A generates B. Define for such A ∈ A

    µ(A) = Σ_{i=1}^n (b_i − a_i).

Note that the presentation of A is not unique, as (a, b] ∪ (b, c] = (a, c] whenever a < b < c; nevertheless, it is straightforward to check that µ is well defined and additive on A. By Carathéodory's extension theorem, it remains to show that µ is countably additive on A. Since µ is additive and finite on A, this reduces to showing that, for any decreasing sequence (B_n : n ∈ N) in A with B_n ↓ ∅, we have µ(B_n) ↓ 0. Suppose, on the contrary, that for some ε > 0, we have µ(B_n) ≥ 2ε for all n. For each n we can find C_n ∈ A with C̄_n ⊆ B_n and µ(B_n \ C_n) ≤ ε2^{-n}. Then

    µ(B_n \ (C_1 ∩ ··· ∩ C_n)) ≤ µ((B_1 \ C_1) ∪ ··· ∪ (B_n \ C_n)) ≤ Σ_{n∈N} ε2^{-n} = ε.

Since µ(B_n) ≥ 2ε, we must have µ(C_1 ∩ ··· ∩ C_n) ≥ ε, so C_1 ∩ ··· ∩ C_n ≠ ∅, and so K_n = C̄_1 ∩ ··· ∩ C̄_n ≠ ∅. Now (K_n : n ∈ N) is a decreasing sequence of bounded non-empty closed sets in R, so ∅ ≠ ∩_n K_n ⊆ ∩_n B_n, which is a contradiction.
(Uniqueness.) Let λ be any measure on B with λ((a, b]) = b − a for all a < b. Fix n and consider

    µ_n(A) = µ((n, n + 1] ∩ A),   λ_n(A) = λ((n, n + 1] ∩ A).

Then µ_n and λ_n are probability measures on B and µ_n = λ_n on the π-system of intervals of the form (a, b], which generates B. So, by Theorem 1.7.1, µ_n = λ_n on B. Hence, for all A ∈ B, we have

    µ(A) = Σ_n µ_n(A) = Σ_n λ_n(A) = λ(A).  □

The condition which characterizes Lebesgue measure µ on B allows us to check that µ is translation invariant: define for x ∈ R and B ∈ B

    µ_x(B) = µ(B + x),   B + x = {b + x : b ∈ B};

then µ_x((a, b]) = (b + x) − (a + x) = b − a, so µ_x = µ, that is to say µ(B + x) = µ(B). The restriction of Lebesgue measure to B((0, 1]) has another sort of translation invariance, where now we understand B + x as the subset of (0, 1] obtained after translation by x and reduction modulo 1. This can be checked by a similar argument.
If we inspect the proof of Carathéodory's extension theorem, and consider its application in Theorem 1.10.1, we see that we have constructed not only a Borel measure µ but also an extension of µ to the set of outer measurable sets M. In this context, the extension is also called Lebesgue measure and M is called the Lebesgue σ-algebra. In fact, the Lebesgue σ-algebra can be identified as the set of all sets of the form A ∪ N, where A ∈ B and N ⊆ B for some B ∈ B with µ(B) = 0. Moreover µ(A ∪ N) = µ(A) in this case.

1.11. Existence of a non-Lebesgue-measurable subset of R. For x, y ∈ [0, 1), let us write x ∼ y if x − y ∈ Q. Then ∼ is an equivalence relation.
Using the Axiom of Choice, we can find a subset S of [0, 1) containing exactly one representative of each equivalence class. We will show that S cannot be Lebesgue measurable.
Set Q = Q ∩ [0, 1) and, for each q ∈ Q, define S + q = {s + q (mod 1) : s ∈ S}. It is an easy exercise to check that the sets S + q are all disjoint and their union is [0, 1). On the other hand, the Lebesgue σ-algebra and Lebesgue measure on (0, 1] are translation invariant for addition modulo 1. Hence, if we suppose that S is Lebesgue measurable, then so is S + q, with µ(S + q) = µ(S). But then

    1 = µ([0, 1)) = Σ_{q∈Q} µ(S + q) = Σ_{q∈Q} µ(S),

which is impossible. Hence S is not Lebesgue measurable.

1.12. Independence. A probability space (Ω, F, P) provides a model for an experiment whose outcome is subject to chance, according to the following interpretation:
  Ω is the set of possible outcomes,
  F is the set of observable sets of outcomes, or events,
  P(A) is the probability of the event A.
Relative to measure theory, probability theory is enriched by the significance attached to the notion of independence. Let I be a countable set. Say that a family (A_i : i ∈ I) of events is independent if, for all finite subsets J ⊆ I,

    P(∩_{i∈J} A_i) = ∏_{i∈J} P(A_i).

Say that a family (A_i : i ∈ I) of sub-σ-algebras of F is independent if the family of events (A_i : i ∈ I) is independent whenever A_i ∈ A_i for all i. Here is a useful way to establish the independence of two σ-algebras.

Theorem 1.12.1. Let A_1 and A_2 be π-systems contained in F and suppose that

    P(A_1 ∩ A_2) = P(A_1)P(A_2)

whenever A_1 ∈ A_1 and A_2 ∈ A_2. Then σ(A_1) and σ(A_2) are independent.

Proof. Fix A_1 ∈ A_1 and define for A ∈ F

    µ(A) = P(A_1 ∩ A),   ν(A) = P(A_1)P(A).

Then µ and ν are measures which agree on the π-system A_2, with µ(Ω) = ν(Ω) = P(A_1) < ∞. So, by uniqueness of extension, for all A_2 ∈ σ(A_2),

    P(A_1 ∩ A_2) = µ(A_2) = ν(A_2) = P(A_1)P(A_2).

Now fix A_2 ∈ σ(A_2) and repeat the argument with

    µ′(A) = P(A ∩ A_2),   ν′(A) = P(A)P(A_2)

to show that, for all A_1 ∈ σ(A_1),

    P(A_1 ∩ A_2) = P(A_1)P(A_2).  □

1.13. Borel–Cantelli lemmas. Given a sequence of events (A_n : n ∈ N), we may ask for the probability that infinitely many occur. Set

    lim sup A_n = ∩_n ∪_{m≥n} A_m,   lim inf A_n = ∪_n ∩_{m≥n} A_m.

We sometimes write {A_n infinitely often} as an alternative for lim sup A_n, because ω ∈ lim sup A_n if and only if ω ∈ A_n for infinitely many n. Similarly, we write {A_n eventually} for lim inf A_n. The abbreviations i.o. and ev. are often used.

Lemma 1.13.1 (First Borel–Cantelli lemma). If Σ_n P(A_n) < ∞, then P(A_n i.o.) = 0.

Proof. As n → ∞, we have

    P(A_n i.o.) ≤ P(∪_{m≥n} A_m) ≤ Σ_{m≥n} P(A_m) → 0.  □

We note that this argument is valid whether or not P is a probability measure.
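As a numerical aside (not from the notes), the two Borel–Cantelli regimes are easy to see by simulation for independent events, anticipating the second lemma below: with P(A_n) = n^{-2} the occurrences dry up, while with P(A_n) = 1/n every decade of indices keeps scoring hits. The decade windows are an arbitrary choice.

    import random

    # Borel-Cantelli by simulation: independent events A_n with P(A_n) = p(n).
    # Summable p(n): occurrences eventually stop. Non-summable: they never do.
    random.seed(1)

    def hits(p, lo, hi):
        """Number of n in [lo, hi) for which the event A_n occurs."""
        return sum(1 for n in range(lo, hi) if random.random() < p(n))

    for lo, hi in [(10, 100), (100, 1000), (1000, 10000), (10000, 100000)]:
        summable = hits(lambda n: n ** -2.0, lo, hi)
        divergent = hits(lambda n: 1.0 / n, lo, hi)
        print(f"{lo:>6}-{hi:<6}  1/n^2: {summable}   1/n: {divergent}")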

Lemma 1.13.2 (Second Borel–Cantelli lemma). Assume that the events (A_n : n ∈ N) are independent. If Σ_n P(A_n) = ∞, then P(A_n i.o.) = 1.

Proof. We use the inequality 1 − a ≤ e^{−a}. The events (A_n^c : n ∈ N) are also independent. Set a_n = P(A_n). For all n and for N ≥ n, with N → ∞ we have

    P(∩_{m=n}^N A_m^c) = ∏_{m=n}^N (1 − a_m) ≤ exp{−Σ_{m=n}^N a_m} → 0.

Hence P(∩_{m≥n} A_m^c) = 0 for all n, and so P(A_n i.o.) = 1 − P(∪_n ∩_{m≥n} A_m^c) = 1.  □

2. Measurable functions and random variables

2.1. Measurable functions. Let (E, E) and (G, G) be measurable spaces. A function f : E → G is measurable if f^{-1}(A) ∈ E whenever A ∈ G. Here f^{-1}(A) denotes the inverse image of A by f:

    f^{-1}(A) = {x ∈ E : f(x) ∈ A}.

In the case (G, G) = (R, B) we simply call f a measurable function on E. In the case (G, G) = ([0, ∞], B([0, ∞])) we call f a non-negative measurable function on E. This terminology is convenient but it has the consequence that some non-negative measurable functions are not (real-valued) measurable functions. If E is a topological space and E = B(E), then a measurable function on E is called a Borel function. For any function f : E → G, the inverse image preserves set operations:

    f^{-1}(∪_i A_i) = ∪_i f^{-1}(A_i),   f^{-1}(G \ A) = E \ f^{-1}(A).

Therefore, the set {f^{-1}(A) : A ∈ G} is a σ-algebra on E and {A ⊆ G : f^{-1}(A) ∈ E} is a σ-algebra on G. In particular, if G = σ(A) and f^{-1}(A) ∈ E whenever A ∈ A, then {A : f^{-1}(A) ∈ E} is a σ-algebra containing A and hence G, so f is measurable. In the case G = R, the Borel σ-algebra is generated by intervals of the form (−∞, y], y ∈ R, so, to show that f : E → R is Borel measurable, it suffices to show that {x ∈ E : f(x) ≤ y} ∈ E for all y.
If E is any topological space and f : E → R is continuous, then f^{-1}(U) is open in E, and hence measurable, whenever U is open in R; the open sets U generate B, so any continuous function is measurable.
For A ⊆ E, the indicator function 1_A of A is the function 1_A : E → {0, 1} which takes the value 1 on A and 0 otherwise. Note that the indicator function of any measurable set is a measurable function. Also, the composition of measurable functions is measurable.
Given any family of functions f_i : E → G, i ∈ I, we can make them all measurable by taking

    E = σ(f_i^{-1}(A) : A ∈ G, i ∈ I).

Then E is the σ-algebra generated by (f_i : i ∈ I).

Proposition 2.1.1. Let (f_n : n ∈ N) be a sequence of non-negative measurable functions on E. Then the functions f_1 + f_2 and f_1 f_2 are measurable, and so are the following functions:

    inf_n f_n,   sup_n f_n,   lim inf_n f_n,   lim sup_n f_n.

The same conclusion holds for real-valued measurable functions provided the limit functions are also real-valued.

Theorem 2.1.2 (Monotone class theorem). Let (E, E) be a measurable space and let A be a π-system generating E. Let V be a vector space of bounded functions f : E → R such that:

(i) 1 ∈ V and 1_A ∈ V for all A ∈ A;
(ii) if f_n ∈ V for all n and f is bounded with 0 ≤ f_n ↑ f, then f ∈ V.

Then V contains every bounded measurable function.

Proof. Consider D = {A ∈ E : 1_A ∈ V}. Then D is a d-system containing A, so D = E. Since V is a vector space, it contains all finite linear combinations of indicator functions of measurable sets. If f is a bounded and non-negative measurable function, then the functions f_n = 2^{-n}⌊2^n f⌋, n ∈ N, belong to V and 0 ≤ f_n ↑ f, so f ∈ V. Finally, any bounded measurable function is the difference of two non-negative such functions, hence in V.  □
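The dyadic approximations f_n = 2^{-n}⌊2^n f⌋ used in this proof recur throughout the notes. A small illustrative sketch (not from the notes) evaluating them pointwise:

    import math

    # Dyadic approximation of a non-negative function f:
    # f_n(x) = 2^{-n} * floor(2^n * f(x)) satisfies 0 <= f_n <= f and f_n
    # increases to f pointwise as n grows.

    def dyadic(f, n):
        return lambda x: math.floor((2 ** n) * f(x)) / (2 ** n)

    f = lambda x: x * x
    for n in (1, 3, 6, 12):
        fn = dyadic(f, n)
        print(n, fn(0.7))  # increases to f(0.7) = 0.49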

2.2. Image measures. Let (E, E) and (G, G) be measurable spaces and let µ be a measure on E. Then any measurable function f : E → G induces an image measure ν = µ ∘ f^{-1} on G, given by

    ν(A) = µ(f^{-1}(A)).

We shall construct some new measures from Lebesgue measure in this way.

Lemma 2.2.1. Let g : R → R be non-constant, right-continuous and non-decreasing. Set g(±∞) = lim_{x→±∞} g(x) and write I = (g(−∞), g(∞)). Define f : I → R by

    f(x) = inf{y ∈ R : x ≤ g(y)}.

Then f is left-continuous and non-decreasing. Moreover, for x ∈ I and y ∈ R,

    f(x) ≤ y if and only if x ≤ g(y).

Proof. Fix x ∈ I and consider the set J_x = {y ∈ R : x ≤ g(y)}. Note that J_x is non-empty and is not the whole of R. Since g is non-decreasing, if y ∈ J_x and y′ ≥ y, then y′ ∈ J_x. Since g is right-continuous, if y_n ∈ J_x and y_n ↓ y, then y ∈ J_x. Hence J_x = [f(x), ∞) and x ≤ g(y) if and only if f(x) ≤ y. For x ≤ x′, we have J_x ⊇ J_{x′} and so f(x) ≤ f(x′). For x_n ↑ x, we have J_x = ∩_n J_{x_n}, so f(x_n) → f(x). So f is left-continuous and non-decreasing, as claimed.  □

Theorem 2.2.2. Let g : R → R be non-constant, right-continuous and non-decreasing. Then there exists a unique Radon measure dg on R such that, for all a, b ∈ R with a < b,

    dg((a, b]) = g(b) − g(a).

Moreover, we obtain in this way all non-zero Radon measures on R. The measure dg is called the Lebesgue–Stieltjes measure associated with g.

Proof. Define I and f as in the lemma and let µ denote Lebesgue measure on I. Then f is Borel measurable and the induced measure dg = µ ∘ f^{-1} on R satisfies

    dg((a, b]) = µ({x : f(x) > a and f(x) ≤ b}) = µ((g(a), g(b)]) = g(b) − g(a).

The argument used for uniqueness of Lebesgue measure shows that there is at most one Borel measure with this property. Finally, if ν is any non-zero Radon measure on R, we can define g : R → R, right-continuous and non-decreasing, by

    g(y) = ν((0, y]) if y ≥ 0,   g(y) = −ν((y, 0]) if y < 0.

Then ν((a, b]) = g(b) − g(a) whenever a < b, so ν = dg by uniqueness.  □

2.3. Random variables. Let (Ω, F, P) be a probability space and let (E, E) be a measurable space. A measurable function X : Ω → E is called a random variable in E. It has the interpretation of a quantity, or state, determined by chance. Where no space E is mentioned, it is assumed that X takes values in R. The image measure µ_X = P ∘ X^{-1} is called the law or distribution of X. For real-valued random variables, µ_X is uniquely determined by its values on the π-system of intervals ((−∞, x] : x ∈ R), given by

    F_X(x) = µ_X((−∞, x]) = P(X ≤ x).

The function F_X is called the distribution function of X.
Note that F = F_X is increasing and right-continuous, with

    lim_{x→−∞} F(x) = 0,   lim_{x→∞} F(x) = 1.

Let us call any function F : R → [0, 1] satisfying these conditions a distribution function.
Let Ω = (0, 1). Write F for the Borel σ-algebra on Ω and P for the restriction of Lebesgue measure to F. Then (Ω, F, P) is a probability space. Let F be any distribution function. Define X : Ω → R by

    X(ω) = inf{x : ω ≤ F(x)}.

Then, by Lemma 2.2.1, X is a random variable and X(ω) ≤ x if and only if ω ≤ F(x). So

    F_X(x) = P(X ≤ x) = P((0, F(x)]) = F(x).

Thus every distribution function is the distribution function of a random variable.
A countable family of random variables (X_i : i ∈ I) is said to be independent if the family of σ-algebras (σ(X_i) : i ∈ I) is independent. For a sequence (X_n : n ∈ N) of real-valued random variables, this is equivalent to the condition

    P(X_1 ≤ x_1, . . . , X_n ≤ x_n) = P(X_1 ≤ x_1) . . . P(X_n ≤ x_n)

for all x_1, . . . , x_n ∈ R and all n.
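The quantile construction X(ω) = inf{x : ω ≤ F(x)} used above is precisely the inverse-transform method of simulation. A minimal sketch (not from the notes), for the exponential distribution function F(x) = 1 − e^{−x}, where the infimum has the closed form −log(1 − ω):

    import math
    import random

    # Inverse-transform sampling: if omega is uniform on (0,1) and
    # X(omega) = inf{x : omega <= F(x)}, then X has distribution function F.
    # For F(x) = 1 - exp(-x), this infimum is -log(1 - omega).

    def sample_exponential():
        omega = random.random()           # a point of the probability space (0,1)
        return -math.log(1.0 - omega)     # X(omega)

    random.seed(0)
    samples = [sample_exponential() for _ in range(100_000)]
    # Empirical check of F_X(1) against F(1) = 1 - e^{-1}, about 0.632:
    print(sum(1 for s in samples if s <= 1.0) / len(samples))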
A sequence of random variables (X_n : n ≥ 0) is often regarded as a process evolving in time. The σ-algebra generated by X_0, . . . , X_n,

    F_n = σ(X_0, . . . , X_n),

contains those events depending (measurably) on X_0, . . . , X_n and represents what is known about the process by time n.

2.4. Rademacher functions. We continue with the particular choice of probability space (Ω, F, P) made in the preceding section. Provided that we forbid infinite sequences of 0's, each ω ∈ Ω has a unique binary expansion

    ω = 0.ω_1ω_2ω_3 . . . .

Define random variables R_n : Ω → {0, 1} by R_n(ω) = ω_n. Then

    R_1 = 1_{(1/2,1]},   R_2 = 1_{(1/4,1/2]} + 1_{(3/4,1]},   R_3 = 1_{(1/8,1/4]} + 1_{(3/8,1/2]} + 1_{(5/8,3/4]} + 1_{(7/8,1]}.

These are called the Rademacher functions. The random variables R_1, R_2, . . . are independent and Bernoulli, that is to say

    P(R_n = 0) = P(R_n = 1) = 1/2.

The strong law of large numbers (proved in §10) applies here to show that

    P(ω ∈ (0, 1) : |{k ≤ n : ω_k = 1}|/n → 1/2) = P((R_1 + ··· + R_n)/n → 1/2) = 1.

This is called Borel's normal number theorem: almost every point in (0, 1) is normal, that is, has 'equal' proportions of 0's and 1's in its binary expansion.
We now use a trick involving the Rademacher functions to construct on Ω = (0, 1), not just one random variable, but an infinite sequence of independent random variables with given distribution functions.

Proposition 2.4.1. Let (Ω, F, P) be the probability space of Lebesgue measure on the Borel subsets of (0, 1). Let (F_n : n ∈ N) be a sequence of distribution functions. Then there exists a sequence (X_n : n ∈ N) of independent random variables on (Ω, F, P) such that X_n has distribution function F_{X_n} = F_n for all n.

Proof. Choose a bijection m : N² → N and set Y_{k,n} = R_{m(k,n)}, where R_m is the mth Rademacher function. Set

    Y_n = Σ_{k=1}^∞ 2^{-k} Y_{k,n}.

Then Y_1, Y_2, . . . are independent and, for all n, for i2^{-k} = 0.y_1 . . . y_k, we have

    P(i2^{-k} < Y_n ≤ (i + 1)2^{-k}) = P(Y_{1,n} = y_1, . . . , Y_{k,n} = y_k) = 2^{-k},

so P(Y_n ∈ (0, i2^{-k}]) = i2^{-k} for all i and k and, since intervals of this form generate B((0, 1]), Y_n is uniformly distributed on (0, 1]. Set X_n(ω) = inf{x : Y_n(ω) ≤ F_n(x)}. Then, as in §2.3, X_n has distribution function F_n; and X_1, X_2, . . . are independent because Y_1, Y_2, . . . are.  □
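The digit-repacking in this proof is easy to imitate. In the sketch below (not from the notes), a uniform ω is represented by a sequence of independent fair binary digits, exactly as §2.4 describes, and the Cantor pairing function stands in for the bijection m : N² → N; both representations are illustrative choices.

    import random

    # Rademacher digit repacking: from the digits (omega_m) of one point omega
    # of (0,1), build several independent uniforms Y_n = sum_k 2^{-k} omega_{m(k,n)}.

    random.seed(2)
    bits = [random.randint(0, 1) for _ in range(1200)]   # binary digits of one omega

    def rademacher(m):          # R_m(omega) = m-th binary digit (1-indexed)
        return bits[m - 1]

    def pair(k, n):             # one illustrative bijection N^2 -> N (Cantor pairing)
        return (k + n - 2) * (k + n - 1) // 2 + k

    def Y(n, digits=40):
        return sum(2.0 ** -k * rademacher(pair(k, n)) for k in range(1, digits + 1))

    print([round(Y(n), 6) for n in range(1, 5)])  # four independent near-uniform values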

2.5. Convergence of measurable functions and random variables. Let (E, E, µ) be a measure space. A set A ∈ E is sometimes defined by a property shared by its elements. If µ(A^c) = 0, then we say that this property holds almost everywhere (or a.e.). When (E, E, µ) is a probability space, we say instead that the property holds almost surely (or a.s.). Thus, for a sequence of measurable functions (f_n : n ∈ N), we say f_n converges to f almost everywhere to mean that

    µ({x ∈ E : f_n(x) ↛ f(x)}) = 0.

If, on the other hand, we have that

    µ({x ∈ E : |f_n(x) − f(x)| > ε}) → 0, for all ε > 0,

then we say f_n converges to f in measure, or in probability when µ(E) = 1. For a sequence (X_n : n ∈ N) of (real-valued) random variables there is a third notion of convergence. We say that X_n converges to X in distribution if F_{X_n}(x) → F_X(x) as n → ∞ at all points x ∈ R where F_X is continuous. Note that the last definition does not require the random variables to be defined on the same probability space.

Theorem 2.5.1. Let (f_n : n ∈ N) be a sequence of measurable functions.
(a) Assume that µ(E) < ∞. If f_n → 0 a.e., then f_n → 0 in measure.

(b) If f_n → 0 in measure, then f_{n_k} → 0 a.e. for some subsequence (n_k).

Proof. (a) Suppose f_n → 0 a.e.. For each ε > 0,

    µ(|f_n| ≤ ε) ≥ µ(∩_{m≥n} {|f_m| ≤ ε}) ↑ µ(|f_n| ≤ ε ev.) ≥ µ(f_n → 0) = µ(E).

Hence µ(|f_n| > ε) → 0 and f_n → 0 in measure.
(b) Suppose f_n → 0 in measure; then we can find a subsequence (n_k) such that

    Σ_k µ(|f_{n_k}| > 1/k) < ∞.

So, by the first Borel–Cantelli lemma,

    µ(|f_{n_k}| > 1/k i.o.) = 0,

so f_{n_k} → 0 a.e..  □

Theorem 2.5.2. Let X and (X_n : n ∈ N) be real-valued random variables.
(a) If X and (X_n : n ∈ N) are defined on the same probability space (Ω, F, P) and X_n → X in probability, then X_n → X in distribution.
(b) If X_n → X in distribution, then there are random variables X̃ and (X̃_n : n ∈ N) defined on a common probability space (Ω, F, P) such that X̃ has the same distribution as X, X̃_n has the same distribution as X_n for all n, and X̃_n → X̃ almost surely.

Proof. Write S for the subset of R where F_X is continuous.
(a) Suppose X_n → X in probability. Given x ∈ S and ε > 0, there exists δ > 0 such that F_X(x − δ) ≥ F_X(x) − ε/2 and F_X(x + δ) ≤ F_X(x) + ε/2. Then there exists N such that, for all n ≥ N, we have P(|X_n − X| > δ) ≤ ε/2, which implies

    F_{X_n}(x) ≤ P(X ≤ x + δ) + P(|X_n − X| > δ) ≤ F_X(x) + ε

and

    F_{X_n}(x) ≥ P(X ≤ x − δ) − P(|X_n − X| > δ) ≥ F_X(x) − ε.

(b) Suppose now that X_n → X in distribution. Take (Ω, F, P) to be the interval (0, 1) equipped with its Borel σ-algebra and Lebesgue measure. Define for ω ∈ (0, 1)

    X̃_n(ω) = inf{x ∈ R : ω ≤ F_{X_n}(x)},   X̃(ω) = inf{x ∈ R : ω ≤ F_X(x)}.

Then X̃ has the same distribution as X, and X̃_n has the same distribution as X_n for all n. Write Ω_0 for the subset of (0, 1) where X̃ is continuous. Since X̃ is non-decreasing, (0, 1) \ Ω_0 is countable, so P(Ω_0) = 1. Since F_X is non-decreasing, R \ S is countable, so S is dense. Given ω ∈ Ω_0 and ε > 0, there exist x^−, x^+ ∈ S with x^− < X̃(ω) < x^+ and x^+ − x^− < ε, and there exists ω^+ ∈ (ω, 1) such that X̃(ω^+) ≤ x^+. Then F_X(x^−) < ω and F_X(x^+) ≥ ω^+ > ω. So there exists N such that, for all n ≥ N, we have F_{X_n}(x^−) < ω and F_{X_n}(x^+) ≥ ω, which implies X̃_n(ω) > x^− and X̃_n(ω) ≤ x^+, and hence |X̃_n(ω) − X̃(ω)| < ε.  □

2.6. Tail events. Let (X_n : n ∈ N) be a sequence of random variables. Define

    T_n = σ(X_{n+1}, X_{n+2}, . . . ),   T = ∩_n T_n.

Then T is a σ-algebra, called the tail σ-algebra of (X_n : n ∈ N). It contains the events which depend only on the limiting behaviour of the sequence.

Theorem 2.6.1 (Kolmogorov's zero–one law). Suppose that (X_n : n ∈ N) is a sequence of independent random variables. Then the tail σ-algebra T of (X_n : n ∈ N) contains only events of probability 0 or 1. Moreover, any T-measurable random variable is almost surely constant.

Proof. Set F_n = σ(X_1, . . . , X_n). Then F_n is generated by the π-system of events

    A = {X_1 ≤ x_1, . . . , X_n ≤ x_n},

whereas T_n is generated by the π-system of events

    B = {X_{n+1} ≤ x_{n+1}, . . . , X_{n+k} ≤ x_{n+k}},   k ∈ N.

We have P(A ∩ B) = P(A)P(B) for all such A and B, by independence. Hence F_n and T_n are independent, by Theorem 1.12.1. It follows that F_n and T are independent. Now ∪_n F_n is a π-system which generates the σ-algebra F_∞ = σ(X_n : n ∈ N). So, by Theorem 1.12.1 again, F_∞ and T are independent. But T ⊆ F_∞. So, if A ∈ T,

    P(A) = P(A ∩ A) = P(A)P(A)

so P(A) ∈ {0, 1}.
Finally, if Y is any T-measurable random variable, then F_Y(y) = P(Y ≤ y) takes values in {0, 1}, so P(Y = c) = 1, where c = inf{y : F_Y(y) = 1}.  □

2.7. Large values in sequences of independent identically distributed random variables. Consider a sequence (X_n : n ∈ N) of independent random variables, all having the same distribution function F. Assume that F(x) < 1 for all x ∈ R. Then, almost surely, the sequence (X_n : n ∈ N) is unbounded above, so lim sup_n X_n = ∞. A way to describe the occurrence of large values in the sequence is to find a function g : N → (0, ∞) such that, almost surely,

    lim sup_n X_n/g(n) = 1.

We now show that g(n) = log n is the right choice when F(x) = 1 − e^{−x}. The same method adapts to other distributions.
Fix α > 0 and consider the event A_n = {X_n ≥ α log n}. Then P(A_n) = e^{−α log n} = n^{−α}, so the series Σ_n P(A_n) converges if and only if α > 1. By the Borel–Cantelli lemmas, we deduce that, for all ε > 0,

    P(X_n/log n ≥ 1 i.o.) = 1,   P(X_n/log n ≥ 1 + ε i.o.) = 0.

Hence, almost surely,

    lim sup_n X_n/log n = 1.
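A quick simulation (not from the notes) makes the growth rate visible: over a late block of indices, the maximum of X_n/log n for independent Exp(1) samples sits near 1. Any finite run is, of course, only suggestive of the almost sure statement.

    import math
    import random

    # For independent Exp(1) variables, lim sup_n X_n / log n = 1 almost surely.
    # The maximum of X_n / log n over a late block of indices should sit near 1.

    random.seed(3)

    def block_max(lo, hi):
        return max(-math.log(1.0 - random.random()) / math.log(n)
                   for n in range(lo, hi))

    for lo in (10 ** 3, 10 ** 4, 10 ** 5):
        print(lo, round(block_max(lo, 2 * lo), 3))  # values near 1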

3. Integration

3.1. Definition of the integral and basic properties. Let (E, E, µ) be a measure space. We shall define for non-negative measurable functions f on E, and (under a natural condition) for (real-valued) measurable functions f on E, the integral of f, to be denoted

    µ(f) = ∫_E f dµ = ∫_E f(x) µ(dx).

When (E, E) = (R, B) and µ is Lebesgue measure, the usual notation is

    µ(f) = ∫_R f(x) dx.

For a random variable X on a probability space (Ω, F, P), the integral is usually called instead the expectation of X and written E(X).
A simple function is one of the form

    f = Σ_{k=1}^m a_k 1_{A_k},

where 0 ≤ a_k < ∞ and A_k ∈ E for all k, and where m ∈ N. For simple functions f, we define

    µ(f) = Σ_{k=1}^m a_k µ(A_k),

where we adopt the convention 0 · ∞ = 0. Although the representation of f is not unique, it is straightforward to check that µ(f) is well defined and, for simple functions f, g and constants α, β ≥ 0, we have
(a) µ(αf + βg) = αµ(f) + βµ(g),
(b) f ≤ g implies µ(f) ≤ µ(g),
(c) f = 0 a.e. if and only if µ(f) = 0.
We define the integral µ(f) of a non-negative measurable function f by

    µ(f) = sup{µ(g) : g simple, g ≤ f}.

This is consistent with the definition for simple functions by property (b) above. Note that, for all non-negative measurable functions f, g with f ≤ g, we have µ(f) ≤ µ(g). For any measurable function f, set f^+ = f ∨ 0 and f^− = (−f) ∨ 0. Then f = f^+ − f^− and |f| = f^+ + f^−. If µ(|f|) < ∞, then we say that f is integrable and define

    µ(f) = µ(f^+) − µ(f^−).

Note that |µ(f)| ≤ µ(|f|) for all integrable functions f. We sometimes define the integral µ(f) by the same formula, even when f is not integrable, but when one of µ(f^−) and µ(f^+) is finite. In such cases the integral takes the value ∞ or −∞.
Here is the key result for the theory of integration. For x ∈ [0, ∞] and a sequence (x_n : n ∈ N) in [0, ∞], we write x_n ↑ x to mean that x_n ≤ x_{n+1} for all n and x_n → x as n → ∞. For a non-negative function f on E and a sequence of such functions (f_n : n ∈ N), we write f_n ↑ f to mean that f_n(x) ↑ f(x) for all x ∈ E.

Theorem 3.1.1 (Monotone convergence). Let f be a non-negative measurable function and let (f_n : n ∈ N) be a sequence of such functions. Suppose that f_n ↑ f. Then µ(f_n) ↑ µ(f).

Proof. Case 1: f_n = 1_{A_n}, f = 1_A.
The result is a simple consequence of countable additivity.
Case 2: f_n simple, f = 1_A.
Fix ε > 0 and set A_n = {f_n > 1 − ε}. Then A_n ↑ A and

    (1 − ε)1_{A_n} ≤ f_n ≤ 1_A

so

    (1 − ε)µ(A_n) ≤ µ(f_n) ≤ µ(A).

But µ(A_n) ↑ µ(A) by Case 1 and ε > 0 was arbitrary, so the result follows.
Case 3: f_n simple, f simple.
We can write f in the form

    f = Σ_{k=1}^m a_k 1_{A_k}

with a_k > 0 for all k and the sets A_k disjoint. Then f_n ↑ f implies

    a_k^{-1} 1_{A_k} f_n ↑ 1_{A_k}

so, by Case 2,

    µ(f_n) = Σ_k µ(1_{A_k} f_n) ↑ Σ_k a_k µ(A_k) = µ(f).

Case 4: f_n simple, f ≥ 0 measurable.
Let g be simple with g ≤ f. Then f_n ↑ f implies f_n ∧ g ↑ g so, by Case 3,

    µ(f_n) ≥ µ(f_n ∧ g) ↑ µ(g).

Since g was arbitrary, the result follows.
Case 5: f_n ≥ 0 measurable, f ≥ 0 measurable.
Set g_n = (2^{-n}⌊2^n f_n⌋) ∧ n; then g_n is simple and g_n ≤ f_n ≤ f, so

    µ(g_n) ≤ µ(f_n) ≤ µ(f).

But f_n ↑ f forces g_n ↑ f, so µ(g_n) ↑ µ(f), by Case 4, and so µ(f_n) ↑ µ(f).  □

Second proof. Set M = sup_n µ(f_n). We know that

    µ(f_n) ↑ M ≤ µ(f) = sup{µ(g) : g simple, g ≤ f},

so it will suffice to show that µ(g) ≤ M for all simple functions

    g = Σ_{k=1}^m a_k 1_{A_k} ≤ f.

Without loss of generality, we may assume that the sets A_k are disjoint. Define functions g_n by

    g_n(x) = (2^{-n}⌊2^n f_n(x)⌋) ∧ g(x),   x ∈ E.

Then g_n is simple and g_n ≤ f_n for all n. Fix ε ∈ (0, 1) and consider the sets

    A_k(n) = {x ∈ A_k : g_n(x) ≥ (1 − ε)a_k}.

Now g_n ↑ g and g = a_k on A_k, so A_k(n) ↑ A_k, and so µ(A_k(n)) ↑ µ(A_k) by countable additivity. Also, we have

    1_{A_k} g_n ≥ (1 − ε)a_k 1_{A_k(n)},

so µ(1_{A_k} g_n) ≥ (1 − ε)a_k µ(A_k(n)). Finally, we have

    g_n = Σ_{k=1}^m 1_{A_k} g_n

and the integral is additive on simple functions, so

    µ(g_n) = Σ_{k=1}^m µ(1_{A_k} g_n) ≥ (1 − ε) Σ_{k=1}^m a_k µ(A_k(n)) ↑ (1 − ε) Σ_{k=1}^m a_k µ(A_k) = (1 − ε)µ(g).

But µ(g_n) ≤ µ(f_n) ≤ M for all n and ε ∈ (0, 1) is arbitrary, so we see that µ(g) ≤ M, as required.  □

Theorem 3.1.2. For all non-negative measurable functions f, g and all constants α, β ≥ 0,
(a) µ(αf + βg) = αµ(f) + βµ(g),
(b) f ≤ g implies µ(f) ≤ µ(g),
(c) f = 0 a.e. if and only if µ(f) = 0.

Proof. Define simple functions f_n, g_n by

    f_n = (2^{-n}⌊2^n f⌋) ∧ n,   g_n = (2^{-n}⌊2^n g⌋) ∧ n.

Then f_n ↑ f and g_n ↑ g, so αf_n + βg_n ↑ αf + βg. Hence, by monotone convergence,

    µ(f_n) ↑ µ(f),   µ(g_n) ↑ µ(g),   µ(αf_n + βg_n) ↑ µ(αf + βg).

We know that µ(αf_n + βg_n) = αµ(f_n) + βµ(g_n), so we obtain (a) on letting n → ∞.
As we noted above, (b) is obvious from the definition of the integral. If f = 0 a.e., then f_n = 0 a.e. for all n, so µ(f_n) = 0 and µ(f) = 0. On the other hand, if µ(f) = 0, then µ(f_n) = 0 for all n, so f_n = 0 a.e. and f = 0 a.e..  □

Theorem 3.1.3. For all integrable functions f, g and all constants α, β ∈ R,
(a) µ(αf + βg) = αµ(f) + βµ(g),
(b) f ≤ g implies µ(f) ≤ µ(g),
(c) f = 0 a.e. implies µ(f) = 0.

Proof. We note that µ(−f) = −µ(f). For α ≥ 0, we have

    µ(αf) = µ(αf^+) − µ(αf^−) = αµ(f^+) − αµ(f^−) = αµ(f).

If h = f + g then h^+ + f^− + g^− = h^− + f^+ + g^+, so

    µ(h^+) + µ(f^−) + µ(g^−) = µ(h^−) + µ(f^+) + µ(g^+)

and so µ(h) = µ(f) + µ(g). That proves (a). If f ≤ g then µ(g) − µ(f) = µ(g − f) ≥ 0, by (a). Finally, if f = 0 a.e., then f^± = 0 a.e., so µ(f^±) = 0 and so µ(f) = 0.  □
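For Lebesgue measure on (0, 1), the integral of a non-negative function can be approximated through the very simple functions used in these proofs. An illustrative sketch (not from the notes), where each µ(f_n) is computed by a fine Riemann sum and increases to µ(f) as monotone convergence predicts:

    import math

    # Approximate the Lebesgue integral of a non-negative f on (0,1) via the
    # simple functions f_n = (2^{-n} floor(2^n f)) ∧ n from the text, integrating
    # each f_n by a fine Riemann sum (adequate for this continuous f).

    def mu_fn(f, n, grid=200_000):
        total = 0.0
        for i in range(grid):
            x = (i + 0.5) / grid
            total += min(math.floor(2 ** n * f(x)) / 2 ** n, n)
        return total / grid

    f = lambda x: math.sqrt(x)            # integral over (0,1) is 2/3
    for n in (1, 2, 4, 8, 16):
        print(n, round(mu_fn(f, n), 5))   # increases towards 2/3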

Note that in Theorem 3.1.3(c) we lose the reverse implication. The following result is sometimes useful:

Proposition 3.1.4. Let A be a π-system containing E and generating E. Then, for any integrable function f, µ(f 1_A) = 0 for all A ∈ A implies f = 0 a.e..

Here are some minor variants on the monotone convergence theorem.

Proposition 3.1.5. Let (f_n : n ∈ N) be a sequence of non-negative measurable functions. Then

    f_n ↑ f a.e.  ⟹  µ(f_n) ↑ µ(f).

Proposition 3.1.6. Let (g_n : n ∈ N) be a sequence of non-negative measurable functions. Then

    Σ_{n=1}^∞ µ(g_n) = µ(Σ_{n=1}^∞ g_n).

This reformulation of monotone convergence makes it clear that it is the counterpart, for the integration of functions, of the countable additivity property of the measure on sets.

3.2. Integrals and limits. In the monotone convergence theorem, the hypothesis that the given sequence of functions is non-decreasing is essential. In this section we obtain some results on the integrals of limits of functions without such a hypothesis.

Lemma 3.2.1 (Fatou’s lemma). Let (fn : n N) be a sequence of non-negative measurable functions. Then ∈ µ(lim inf f ) lim inf µ(f ). n ≤ n Proof. For k n, we have ≥ inf fm fk m n ≥ ≤ so

    µ(inf_{m≥n} f_m) ≤ inf_{k≥n} µ(f_k) ≤ lim inf µ(f_n).

But, as n → ∞,

    inf_{m≥n} f_m ↑ sup_n (inf_{m≥n} f_m) = lim inf f_n

so, by monotone convergence,

    µ(inf_{m≥n} f_m) ↑ µ(lim inf f_n).  □

Theorem 3.2.2 (Dominated convergence). Let f be a measurable function and let (f_n : n ∈ N) be a sequence of such functions. Suppose that f_n(x) → f(x) for all x ∈ E and that |f_n| ≤ g for all n, for some integrable function g. Then f and f_n are integrable, for all n, and µ(f_n) → µ(f).

Proof. The limit f is measurable and |f| ≤ g, so µ(|f|) ≤ µ(g) < ∞, so f is integrable. We have 0 ≤ g ± f_n → g ± f, so certainly lim inf(g ± f_n) = g ± f. By Fatou's lemma,

    µ(g) + µ(f) = µ(lim inf(g + f_n)) ≤ lim inf µ(g + f_n) = µ(g) + lim inf µ(f_n),
    µ(g) − µ(f) = µ(lim inf(g − f_n)) ≤ lim inf µ(g − f_n) = µ(g) − lim sup µ(f_n).

Since µ(g) < ∞, we can deduce that

    µ(f) ≤ lim inf µ(f_n) ≤ lim sup µ(f_n) ≤ µ(f).

This proves that µ(f_n) → µ(f) as n → ∞.  □

3.3. Transformations of integrals.

Proposition 3.3.1. Let (E, E, µ) be a measure space and let A ∈ E. Then the set E_A of measurable subsets of A is a σ-algebra and the restriction µ_A of µ to E_A is a measure. Moreover, for any non-negative measurable function f on E, we have

    µ(f 1_A) = µ_A(f|_A).

In the case of Lebesgue measure on R, we write, for any interval I with inf I = a and sup I = b,

    ∫_R f 1_I(x) dx = ∫_I f(x) dx = ∫_a^b f(x) dx.

Note that the sets {a} and {b} have measure zero, so we do not need to specify whether they are included in I or not.

Proposition 3.3.2. Let (E, E) and (G, G) be measurable spaces and let f : E → G be a measurable function. Given a measure µ on (E, E), define ν = µ ∘ f^{-1}, the image measure on (G, G). Then, for all non-negative measurable functions g on G,

    ν(g) = µ(g ∘ f).

In particular, for a G-valued random variable X on a probability space (Ω, F, P), for any non-negative measurable function g on G, we have

    E(g(X)) = µ_X(g).

Proposition 3.3.3. Let (E, E, µ) be a measure space and let f be a non-negative measurable function on E. Define ν(A) = µ(f 1_A), A ∈ E. Then ν is a measure on E and, for all non-negative measurable functions g on E,

    ν(g) = µ(fg).

In particular, to each non-negative Borel function f on R, there corresponds a Borel measure µ on R given by µ(A) = ∫_A f(x) dx. Then, for all non-negative Borel functions g,

    µ(g) = ∫_R g(x)f(x) dx.

We say that µ has density f (with respect to Lebesgue measure). If the law µ_X of a real-valued random variable X has a density f_X, then we call f_X a density function for X. Then P(X ∈ A) = ∫_A f_X(x) dx for all Borel sets A and, for all non-negative Borel functions g on R,

    E(g(X)) = µ_X(g) = ∫_R g(x)f_X(x) dx.

3.4. Fundamental theorem of calculus. We show that integration with respect to Lebesgue measure on R acts as an inverse to differentiation. Since we restrict here to the integration of continuous functions, the proof is the same as for the Riemann integral.

Theorem 3.4.1 (Fundamental theorem of calculus).
(a) Let f : [a, b] → R be a continuous function and set

    F_a(t) = ∫_a^t f(x) dx.

Then F_a is differentiable on [a, b], with F_a′ = f.
(b) Let F : [a, b] → R be differentiable with continuous derivative f. Then

    ∫_a^b f(x) dx = F(b) − F(a).

Proof. Fix t ∈ [a, b). Given ε > 0, there exists δ > 0 such that |f(x) − f(t)| ≤ ε whenever |x − t| ≤ δ. So, for 0 < h ≤ δ,

    |(F_a(t + h) − F_a(t))/h − f(t)| = |(1/h) ∫_t^{t+h} (f(x) − f(t)) dx|
        ≤ (1/h) ∫_t^{t+h} |f(x) − f(t)| dx ≤ (ε/h) ∫_t^{t+h} dx = ε.

Hence F_a is differentiable on the right at t with derivative f(t). Similarly, for all t ∈ (a, b], F_a is differentiable on the left at t with derivative f(t). Finally, (F − F_a)′(t) = 0 for all t ∈ (a, b), so F − F_a is constant (by the mean value theorem), and so

    F(b) − F(a) = F_a(b) − F_a(a) = ∫_a^b f(x) dx.  □
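A numerical illustration of Theorem 3.4.1 (not from the notes): approximate F_a(t) by a midpoint Riemann sum and compare a difference quotient of F_a with f. The step sizes are arbitrary choices.

    import math

    # Fundamental theorem of calculus, numerically: F_a(t), the integral of f
    # over [a, t], should have derivative f(t) for continuous f.

    def F(f, a, t, steps=100_000):
        h = (t - a) / steps
        return h * sum(f(a + (i + 0.5) * h) for i in range(steps))  # midpoint rule

    f = math.cos
    a, t, h = 0.0, 1.0, 1e-4
    derivative = (F(f, a, t + h) - F(f, a, t)) / h
    print(derivative, f(t))   # both close to cos(1), about 0.5403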

Proposition 3.4.2. Let φ : [a, b] → R be continuously differentiable and strictly increasing. Then, for all non-negative Borel functions g on [φ(a), φ(b)],

    ∫_{φ(a)}^{φ(b)} g(y) dy = ∫_a^b g(φ(x))φ′(x) dx.

The proposition can be proved as follows. First, the case where g is the indicator function of an interval follows from the Fundamental Theorem of Calculus. Next, show that the set of Borel sets B such that the conclusion holds for g = 1_B is a d-system, which must then be the whole Borel σ-algebra by Dynkin's lemma. The identity extends to simple functions by linearity and then to all non-negative measurable functions g by monotone convergence, using the approximating simple functions (2^{-n}⌊2^n g⌋) ∧ n.
A general formulation of this procedure, which is often used, is given in the monotone class theorem, Theorem 2.1.2.

3.5. Differentiation under the integral sign. Integration in one variable and differentiation in another can be interchanged subject to some regularity conditions.

Theorem 3.5.1 (Differentiation under the integral sign). Let U ⊆ R be open and suppose that f : U × E → R satisfies:
(i) x ↦ f(t, x) is integrable for all t,
(ii) t ↦ f(t, x) is differentiable for all x,
(iii) for some integrable function g, for all x ∈ E and all t ∈ U,

    |(∂f/∂t)(t, x)| ≤ g(x).

Then the function x ↦ (∂f/∂t)(t, x) is integrable for all t. Moreover, the function

F : U → R, defined by

    F(t) = ∫_E f(t, x) µ(dx),

is differentiable and

    (d/dt) F(t) = ∫_E (∂f/∂t)(t, x) µ(dx).

Proof. Take any sequence h_n → 0 and set

    g_n(x) = (f(t + h_n, x) − f(t, x))/h_n − (∂f/∂t)(t, x).

Then g_n(x) → 0 for all x ∈ E and, by the mean value theorem, |g_n| ≤ 2g for all n. In particular, for all t, the function x ↦ (∂f/∂t)(t, x) is the limit of measurable functions, hence measurable, and hence integrable, by (iii). Then, by dominated convergence,

    (F(t + h_n) − F(t))/h_n − ∫_E (∂f/∂t)(t, x) µ(dx) = ∫_E g_n(x) µ(dx) → 0.  □
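A numerical sanity check of Theorem 3.5.1 (not from the notes), for f(t, x) = e^{−tx²} on E = (0, 1) with Lebesgue measure, where the domination hypothesis holds on compact subintervals of t:

    import math

    # Differentiation under the integral sign, checked numerically for
    # F(t) = integral over (0,1) of exp(-t x^2) dx; the theorem gives
    # F'(t) = integral over (0,1) of -x^2 exp(-t x^2) dx.

    def integrate(h, steps=100_000):
        return sum(h((i + 0.5) / steps) for i in range(steps)) / steps

    t, eps = 1.0, 1e-5
    F = lambda s: integrate(lambda x: math.exp(-s * x * x))
    finite_diff = (F(t + eps) - F(t - eps)) / (2 * eps)
    swapped = integrate(lambda x: -x * x * math.exp(-t * x * x))
    print(finite_diff, swapped)  # agree to several decimal places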

3.6. Product measure and Fubini's theorem. Let (E_1, E_1, µ_1) and (E_2, E_2, µ_2) be finite measure spaces. The set

    A = {A_1 × A_2 : A_1 ∈ E_1, A_2 ∈ E_2}

is a π-system of subsets of E = E_1 × E_2. Define the product σ-algebra

    E_1 ⊗ E_2 = σ(A).

Set E = E_1 ⊗ E_2.

Lemma 3.6.1. Let f : E → R be E-measurable. Then, for all x_1 ∈ E_1, the function x_2 ↦ f(x_1, x_2) : E_2 → R is E_2-measurable.

Proof. Denote by V the set of bounded E-measurable functions for which the conclusion holds. Then V is a vector space, containing the indicator function 1_A of every set A ∈ A. Moreover, if f_n ∈ V for all n and if f is bounded with 0 ≤ f_n ↑ f, then also f ∈ V. So, by the monotone class theorem, V contains all bounded E-measurable functions. The rest is easy.  □

Lemma 3.6.2. Let f be a bounded or non-negative measurable function on E. Define for x_1 ∈ E_1

    f_1(x_1) = ∫_{E_2} f(x_1, x_2) µ_2(dx_2).

If f is bounded, then f_1 : E_1 → R is a bounded E_1-measurable function. On the other hand, if f is non-negative, then f_1 : E_1 → [0, ∞] is also an E_1-measurable function.

Proof. Apply the monotone class theorem, as in the preceding lemma. Note that finiteness of µ_2 is needed for the boundedness of f_1 when f is bounded.  □

Theorem 3.6.3 (Product measure). There exists a unique measure µ = µ_1 ⊗ µ_2 on E such that

    µ(A_1 × A_2) = µ_1(A_1)µ_2(A_2)

for all A_1 ∈ E_1 and A_2 ∈ E_2.

Proof. Uniqueness holds because A is a π-system generating E. For existence, by the lemmas, we can define

    µ(A) = ∫_{E_1} (∫_{E_2} 1_A(x_1, x_2) µ_2(dx_2)) µ_1(dx_1)

and use monotone convergence to see that µ is countably additive.  □

Proposition 3.6.4. Let Ê = E_2 ⊗ E_1 and µ̂ = µ_2 ⊗ µ_1. For a function f on E_1 × E_2, write f̂ for the function on E_2 × E_1 given by f̂(x_2, x_1) = f(x_1, x_2). Let f be a non-negative E-measurable function. Then f̂ is a non-negative Ê-measurable function and µ̂(f̂) = µ(f).

Theorem 3.6.5 (Fubini's theorem).
(a) Let f be a non-negative E-measurable function. Then

    µ(f) = ∫_{E_1} (∫_{E_2} f(x_1, x_2) µ_2(dx_2)) µ_1(dx_1).

(b) Let f be a µ-integrable function. Define

    A_1 = {x_1 ∈ E_1 : ∫_{E_2} |f(x_1, x_2)| µ_2(dx_2) < ∞}

and define f_1 : E_1 → R by

    f_1(x_1) = ∫_{E_2} f(x_1, x_2) µ_2(dx_2)

for x_1 ∈ A_1 and f_1(x_1) = 0 otherwise. Then µ_1(E_1 \ A_1) = 0 and f_1 is µ_1-integrable with µ_1(f_1) = µ(f).

Note that the iterated integral in (a) is well defined, for all bounded or non-negative measurable functions f, by Lemmas 3.6.1 and 3.6.2. Note also that, in combination with Proposition 3.6.4, Fubini's theorem allows us to interchange the order of integration in multiple integrals, whenever the integrand is non-negative or µ-integrable.
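Before the proof, a discrete illustration (not from the notes): for counting measure on N × N, the classic signed array a(m, n) = 1 if m = n, −1 if m = n + 1 fails the integrability hypothesis of (b), and its iterated sums genuinely differ.

    # Fubini's theorem for counting measure on N x N: iterated sums of a
    # non-negative array always agree, but integrability matters for signed ones.

    def a(m, n):
        if m == n:
            return 1.0
        if m == n + 1:
            return -1.0
        return 0.0

    M = 100  # each inner range below includes every non-zero term it needs

    rows_first = sum(sum(a(m, n) for n in range(M + 2)) for m in range(M))
    cols_first = sum(sum(a(m, n) for m in range(M + 2)) for n in range(M))
    print(rows_first, cols_first)  # 1.0 and 0.0: the order of summation matters

    # For |a| the iterated sum diverges (each row contributes 2), so the
    # integrability hypothesis of part (b) fails for this array.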

Proof. The conclusion of (a) holds for f = 1_A with A ∈ E, by definition of the product measure µ. It extends to simple functions on E by linearity of the integrals. For f non-negative measurable, consider the sequence of simple functions f_n = (2^{-n}⌊2^n f⌋) ∧ n. Then (a) holds for f_n and f_n ↑ f. By monotone convergence, µ(f_n) ↑ µ(f) and, for all x_1 ∈ E_1,

    ∫_{E_2} f_n(x_1, x_2) µ_2(dx_2) ↑ ∫_{E_2} f(x_1, x_2) µ_2(dx_2)

and hence

    ∫_{E_1} (∫_{E_2} f_n(x_1, x_2) µ_2(dx_2)) µ_1(dx_1) ↑ ∫_{E_1} (∫_{E_2} f(x_1, x_2) µ_2(dx_2)) µ_1(dx_1).

Hence (a) extends to f.
Suppose now that f is µ-integrable. By Lemma 3.6.2, the function

    x_1 ↦ ∫_{E_2} |f(x_1, x_2)| µ_2(dx_2) : E_1 → [0, ∞]

is E_1-measurable, and it is then integrable because, using (a),

    ∫_{E_1} (∫_{E_2} |f(x_1, x_2)| µ_2(dx_2)) µ_1(dx_1) = µ(|f|) < ∞.

Hence A_1 ∈ E_1 and µ_1(E_1 \ A_1) = 0. We see also that f_1 is well defined and, if we set

    f_1^{(±)}(x_1) = ∫_{E_2} f^±(x_1, x_2) µ_2(dx_2),

then f_1 = (f_1^{(+)} − f_1^{(−)})1_{A_1}. Finally, by part (a),

    µ(f) = µ(f^+) − µ(f^−) = µ_1(f_1^{(+)}) − µ_1(f_1^{(−)}) = µ_1(f_1),

as required.  □

The existence of product measure and Fubini's theorem extend easily to σ-finite measure spaces. The operation of taking the product of two measure spaces is associative, by a π-system uniqueness argument. So we can, by induction, take the product of a finite number, without specifying the order. The measure obtained by taking the n-fold product of Lebesgue measure on R is called Lebesgue measure on R^n. The corresponding integral is written

    ∫_{R^n} f(x) dx.

3.7. Laws of independent random variables. Recall that a family X_1, . . . , X_n of random variables on (Ω, F, P) is said to be independent if the family of σ-algebras σ(X_1), . . . , σ(X_n) is independent.

Proposition 3.7.1. Let X_1, . . . , X_n be random variables on (Ω, F, P), with values in (E_1, E_1), . . . , (E_n, E_n) say. Set E = E_1 × ··· × E_n and E = E_1 ⊗ ··· ⊗ E_n. Consider the function X : Ω → E given by X(ω) = (X_1(ω), . . . , X_n(ω)). Then X is E-measurable. Moreover, the following are equivalent:

(a) X_1, . . . , X_n are independent;
(b) µ_X = µ_{X_1} ⊗ ··· ⊗ µ_{X_n};
(c) for all bounded measurable functions f_1, . . . , f_n we have

    E(∏_{k=1}^n f_k(X_k)) = ∏_{k=1}^n E(f_k(X_k)).

Proof. Set ν = µ_{X_1} ⊗ ··· ⊗ µ_{X_n}. Consider the π-system A = {∏_{k=1}^n A_k : A_k ∈ E_k}. If (a) holds, then for all A ∈ A we have

    µ_X(A) = P(X ∈ A) = P(∩_{k=1}^n {X_k ∈ A_k}) = ∏_{k=1}^n P(X_k ∈ A_k) = ∏_{k=1}^n µ_{X_k}(A_k) = ν(A)

and since A generates E this implies that µ_X = ν on E, so (b) holds. If (b) holds, then by Fubini's theorem

    E(∏_{k=1}^n f_k(X_k)) = ∫_E ∏_{k=1}^n f_k(x_k) µ_X(dx) = ∏_{k=1}^n ∫_{E_k} f_k(x_k) µ_{X_k}(dx_k) = ∏_{k=1}^n E(f_k(X_k)),

so (c) holds. Finally, (a) follows from (c) by taking f_k = 1_{A_k} with A_k ∈ E_k.  □

4. Norms and inequalities

4.1. L^p-norms. Let (E, E, µ) be a measure space. For 1 ≤ p < ∞, we denote by L^p = L^p(E, E, µ) the set of measurable functions f with finite L^p-norm:

    ‖f‖_p = (∫_E |f|^p dµ)^{1/p} < ∞.

We denote by L^∞ = L^∞(E, E, µ) the set of measurable functions f with finite L^∞-norm:

    ‖f‖_∞ = inf{λ : |f| ≤ λ a.e.}.

Note that ‖f‖_p ≤ µ(E)^{1/p}‖f‖_∞ for all 1 ≤ p < ∞. For 1 ≤ p ≤ ∞ and f_n, f ∈ L^p, we say that f_n converges to f in L^p if ‖f_n − f‖_p → 0.

4.2. Chebyshev's inequality. Let f be a non-negative measurable function and let λ ≥ 0. We use the notation {f ≥ λ} for the set {x ∈ E : f(x) ≥ λ}. Note that

    λ1_{f≥λ} ≤ f

so on integrating we obtain Chebyshev's inequality

    λµ(f ≥ λ) ≤ µ(f).

Now let g be any measurable function. We can deduce inequalities for g by choosing some non-negative measurable function φ and applying Chebyshev's inequality to f = φ ∘ g. For example, if g ∈ L^p, p < ∞, and λ > 0, then

    µ(|g| ≥ λ) = µ(|g|^p ≥ λ^p) ≤ λ^{−p} µ(|g|^p) < ∞.

So we obtain the tail estimate

    µ(|g| ≥ λ) = O(λ^{−p}), as λ → ∞.

4.3. Jensen's inequality. Let I ⊆ R be an interval. A function c : I → R is convex if, for all x, y ∈ I and t ∈ [0, 1],

    c(tx + (1 − t)y) ≤ tc(x) + (1 − t)c(y).

Lemma 4.3.1. Let c : I → R be convex and let m be a point in the interior of I. Then there exist a, b ∈ R such that c(x) ≥ ax + b for all x, with equality at x = m.

Proof. By convexity, for x, y ∈ I with x < m < y,

    (c(m) − c(x))/(m − x) ≤ (c(y) − c(m))/(y − m),

so we may choose a ∈ R lying between the supremum of the left-hand quantities and the infimum of the right-hand quantities. Then c(x) ≥ a(x − m) + c(m) for all x ∈ I, so we may take b = c(m) − am.  □

Theorem 4.3.2 (Jensen's inequality). Let X be an integrable random variable with values in I and let c : I → R be convex. Then E(c(X)) is well defined and

    E(c(X)) ≥ c(E(X)).

Proof. The case where X is almost surely constant is easy. We exclude it. Then m = E(X) must lie in the interior of I. Choose a, b ∈ R as in the lemma. Then c(X) ≥ aX + b. In particular, E(c(X)^−) ≤ |a|E(|X|) + |b| < ∞, so E(c(X)) is well defined. Moreover,

    E(c(X)) ≥ aE(X) + b = am + b = c(m) = c(E(X)).  □

We deduce from Jensen's inequality the monotonicity of L^p-norms with respect to a probability measure: for 1 ≤ p ≤ q < ∞ and X ∈ L^q(P), applying Jensen's inequality with the convex function c(x) = |x|^{q/p} gives

    ‖X‖_p = E(|X|^p)^{1/p} ≤ E(|X|^q)^{1/q} = ‖X‖_q,

so L^q(P) ⊆ L^p(P).

4.4. Hölder's inequality. Let p, q ∈ (1, ∞) be conjugate indices, that is to say 1/p + 1/q = 1.

Theorem 4.4.1 (Hölder's inequality). For conjugate indices p, q ∈ (1, ∞) and measurable functions f and g, we have

    µ(|fg|) ≤ ‖f‖_p ‖g‖_q.

Proof. The cases where ‖f‖_p = 0 or ‖f‖_p = ∞ are easy. We exclude them. Then we can define a probability measure P on E by

    P(A) = µ(|f|^p 1_A)/‖f‖_p^p.

Writing E for the corresponding expectation, the monotonicity of norms gives E(X) ≤ E(X^q)^{1/q} for non-negative measurable X. Since q(p − 1) = p, we obtain

    µ(|fg|) = µ(|f||g|1_{|f|>0}) = ‖f‖_p^p E((|g|/|f|^{p−1}) 1_{|f|>0})
            ≤ ‖f‖_p^p E((|g|^q/|f|^{q(p−1)}) 1_{|f|>0})^{1/q} ≤ ‖f‖_p^p (µ(|g|^q)/‖f‖_p^p)^{1/q} = ‖f‖_p ‖g‖_q.  □
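A numerical check of Hölder's inequality and of the monotonicity of norms (not from the notes), for Lebesgue measure on (0, 1); the functions chosen are arbitrary examples with finite norms.

    # Numerical check of Hölder's inequality on (0,1) with Lebesgue measure,
    # and of ||f||_p <= ||f||_q (p <= q) for a probability measure.

    def lp_norm(f, p, steps=100_000):
        return (sum(abs(f((i + 0.5) / steps)) ** p for i in range(steps)) / steps) ** (1 / p)

    def integral(h, steps=100_000):
        return sum(h((i + 0.5) / steps) for i in range(steps)) / steps

    f = lambda x: x ** -0.25          # lies in L^p here for every p < 4
    g = lambda x: 1.0 + x

    p, q = 3.0, 1.5                   # conjugate indices: 1/3 + 2/3 = 1
    print(integral(lambda x: abs(f(x) * g(x))), "<=", lp_norm(f, p) * lp_norm(g, q))
    print(lp_norm(f, 1), "<=", lp_norm(f, 2))   # monotonicity on a probability space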

Theorem 4.4.2 (Minkowski’s inequality). For p [1, ) and measurable functions f and g, we have ∈ ∞ f + g f + g . k kp ≤k kp k kp Proof. The cases where p = 1 or where f p = or g p = are easy. We exclude them. Then, since f + g p 2p( f p + kg pk), we∞ havek k ∞ | | ≤ | | | | µ( f + g p) 2p µ( f p)+ µ( g p) < . | | ≤ { | | | | } ∞ The case where f + g = 0 is clear, so let us assume f + g > 0. Observe that k kp k kp p 1 (p 1)q 1/q p 1 1/p f + g − q = µ( f + g − ) = µ( f + g ) − . k| | k | | 30 | | So, by H¨older’s inequality, p p 1 p 1 µ( f + g ) µ( f f + g − )+ µ( g f + g − ) | | ≤ | || | | || | p 1 ( f + g ) f + g − . ≤ k kp k kp k| | kq p 1 The result follows on dividing both sides by f + g − .  k| | kq 4.5. Approximation in Lp. Theorem 4.5.1. Let A be a π-system on E generating E, with µ(A) < for all A A, and such that E E for some sequence (E : n N) in A. Define∞ ∈ n ↑ n ∈ n

4.5. Approximation in L^p.

Theorem 4.5.1. Let A be a π-system on E generating E, with µ(A) < ∞ for all A ∈ A, and such that E_n ↑ E for some sequence (E_n : n ∈ N) in A. Define

    V_0 = {Σ_{k=1}^n a_k 1_{A_k} : a_k ∈ R, A_k ∈ A, n ∈ N}.

Let p ∈ [1, ∞). Then V_0 ⊆ L^p. Moreover, for all f ∈ L^p and all ε > 0 there exists v ∈ V_0 such that ‖v − f‖_p ≤ ε.

Proof. For all A ∈ A, we have ‖1_A‖_p = µ(A)^{1/p} < ∞, so 1_A ∈ L^p. Hence V_0 ⊆ L^p because L^p is a vector space.
Write V for the set of all f ∈ L^p for which the conclusion holds. By Minkowski's inequality, V is a vector space. Consider for now the case E ∈ A and define D = {A ∈ E : 1_A ∈ V}. Then A ⊆ D, so E ∈ D. For A, B ∈ D with A ⊆ B, we have 1_{B\A} = 1_B − 1_A ∈ V, so B \ A ∈ D. For A_n ∈ D with A_n ↑ A, we have ‖1_A − 1_{A_n}‖_p = µ(A \ A_n)^{1/p} → 0, so A ∈ D. Hence D is a d-system and so D = E by Dynkin's lemma. Since V is a vector space, it then contains all simple functions. For f ∈ L^p with f ≥ 0, consider the sequence of simple functions f_n = (2^{-n}⌊2^n f⌋) ∧ n ↑ f. Then |f|^p ≥ |f − f_n|^p → 0 pointwise so, by dominated convergence, ‖f − f_n‖_p → 0. Hence f ∈ V. Hence, as a vector space, V = L^p.
Returning to the general case, we now know that, for all f ∈ L^p and all n ∈ N, we have f 1_{E_n} ∈ V. But |f|^p ≥ |f − f 1_{E_n}|^p → 0 pointwise so, by dominated convergence, ‖f − f 1_{E_n}‖_p → 0, and so f ∈ V.  □

5. Completeness of L^p and orthogonal projection

5.1. L^p as a Banach space. Let V be a vector space. A map v ↦ ‖v‖ : V → [0, ∞) is a norm if
(i) ‖u + v‖ ≤ ‖u‖ + ‖v‖ for all u, v ∈ V,
(ii) ‖αv‖ = |α|‖v‖ for all v ∈ V and α ∈ R,
(iii) ‖v‖ = 0 implies v = 0.
We note that, for any norm, if ‖v_n − v‖ → 0 then ‖v_n‖ → ‖v‖. A symmetric bilinear map (u, v) ↦ ⟨u, v⟩ : V × V → R is an inner product if ⟨v, v⟩ ≥ 0, with equality only if v = 0. For any inner product ⟨·, ·⟩, the map v ↦ ⟨v, v⟩^{1/2} is a norm, by the Cauchy–Schwarz inequality.
Minkowski's inequality shows that each L^p space is a vector space and that the L^p-norms satisfy condition (i) above. Condition (ii) also holds. Condition (iii) fails, because ‖f‖_p = 0 does not imply that f = 0, only that f = 0 a.e.. For f, g ∈ L^p, write f ∼ g if f = g almost everywhere. Then ∼ is an equivalence relation. Write [f] for the equivalence class of f and define

    L^p = {[f] : f ∈ L^p}.

Note that, for f ∈ L^2, we have ‖f‖_2² = ⟨f, f⟩, where ⟨·, ·⟩ is the symmetric bilinear form on L^2 given by

    ⟨f, g⟩ = ∫_E fg dµ.

Thus L^2 is an inner product space. The notion of convergence in L^p defined in §4.1 is the usual notion of convergence in a normed space.
A normed vector space V is complete if every Cauchy sequence in V converges, that is to say, given any sequence (v_n : n ∈ N) in V such that ‖v_n − v_m‖ → 0 as n, m → ∞, there exists v ∈ V such that ‖v_n − v‖ → 0 as n → ∞. A complete normed vector space is called a Banach space. A complete inner product space is called a Hilbert space. Such spaces have many useful properties, which makes the following result important.

Theorem 5.1.1 (Completeness of L^p). Let p ∈ [1, ∞]. Let (f_n : n ∈ N) be a sequence in L^p such that

    ‖f_n − f_m‖_p → 0 as n, m → ∞.

Then there exists f ∈ L^p such that

    ‖f_n − f‖_p → 0 as n → ∞.

Proof. Some modifications of the following argument are necessary in the case p = ∞, which are left as an exercise. We assume from now on that p < ∞. Choose a subsequence (n_k) such that

    S = Σ_{k=1}^∞ ‖f_{n_{k+1}} − f_{n_k}‖_p < ∞.

By Minkowski's inequality, for any K ∈ N,

    ‖Σ_{k=1}^K |f_{n_{k+1}} − f_{n_k}|‖_p ≤ S.

By monotone convergence this bound holds also for K = ∞, so

    Σ_{k=1}^∞ |f_{n_{k+1}} − f_{n_k}| < ∞ a.e..

Hence, by completeness of R, f_{n_k} converges a.e.. We define a measurable function f by setting f(x) = lim_k f_{n_k}(x) if the limit exists, and f(x) = 0 otherwise. Given ε > 0, we can find N so that n ≥ N implies

    µ(|f_n − f_m|^p) ≤ ε for all m ≥ n,

in particular µ(|f_n − f_{n_k}|^p) ≤ ε for all sufficiently large k. Hence, by Fatou's lemma, for n ≥ N,

    µ(|f_n − f|^p) = µ(lim inf_k |f_n − f_{n_k}|^p) ≤ lim inf_k µ(|f_n − f_{n_k}|^p) ≤ ε.

Hence f ∈ L^p and, since ε > 0 was arbitrary, ‖f_n − f‖_p → 0.  □

Corollary 5.1.2. We have
(a) L^p is a Banach space, for all 1 ≤ p ≤ ∞,
(b) L^2 is a Hilbert space.

5.2. L^2 as a Hilbert space. We shall apply some general Hilbert space arguments to L^2. First, we note Pythagoras' rule

    ‖f + g‖_2² = ‖f‖_2² + 2⟨f, g⟩ + ‖g‖_2²

and the parallelogram law

    ‖f + g‖_2² + ‖f − g‖_2² = 2(‖f‖_2² + ‖g‖_2²).

If ⟨f, g⟩ = 0, then we say that f and g are orthogonal. For any subset V ⊆ L^2, we define

    V^⊥ = {f ∈ L^2 : ⟨f, v⟩ = 0 for all v ∈ V}.

A subset V ⊆ L^2 is closed if, for every sequence (f_n : n ∈ N) in V with f_n → f in L^2, we have f = v a.e., for some v ∈ V.

Theorem 5.2.1 (Orthogonal projection). Let V be a closed subspace of L^2. Then each f ∈ L^2 has a decomposition f = v + u, with v ∈ V and u ∈ V^⊥. Moreover, ‖f − v‖_2 ≤ ‖f − g‖_2 for all g ∈ V, with equality only if g = v a.e..

The function v is called (a version of) the orthogonal projection of f on V.

Proof. Choose a sequence g_n ∈ V such that

    ‖f − g_n‖_2 → d(f, V) = inf{‖f − g‖_2 : g ∈ V}.

By the parallelogram law,

    ‖2(f − (g_n + g_m)/2)‖_2² + ‖g_n − g_m‖_2² = 2(‖f − g_n‖_2² + ‖f − g_m‖_2²).

But ‖2(f − (g_n + g_m)/2)‖_2² ≥ 4d(f, V)², so we must have ‖g_n − g_m‖_2 → 0 as n, m → ∞. By completeness, ‖g_n − g‖_2 → 0, for some g ∈ L^2. By closure, g = v a.e., for some v ∈ V. Hence

    ‖f − v‖_2 = lim_n ‖f − g_n‖_2 = d(f, V).

Now, for any h ∈ V and t ∈ R, we have

    d(f, V)² ≤ ‖f − (v + th)‖_2² = d(f, V)² − 2t⟨f − v, h⟩ + t²‖h‖_2².

So we must have ⟨f − v, h⟩ = 0. Hence u = f − v ∈ V^⊥, as required.  □

5.3. Variance, covariance and conditional expectation. In this section we look at some L^2 notions relevant to probability. For X, Y ∈ L^2(P), with m_X = E(X), m_Y = E(Y), we define the variance, covariance and correlation by

    var(X) = E[(X − m_X)²],
    cov(X, Y) = E[(X − m_X)(Y − m_Y)],
    corr(X, Y) = cov(X, Y)/√(var(X) var(Y)).

Note that $\mathrm{var}(X) = 0$ if and only if $X = m_X$ a.s. Note also that, if $X$ and $Y$ are independent, then $\mathrm{cov}(X,Y) = 0$. The converse is generally false. For a random variable $X = (X_1,\dots,X_n)$ in $\mathbb{R}^n$, we define its covariance matrix
\[
\mathrm{var}(X) = (\mathrm{cov}(X_i,X_j))_{i,j=1}^n.
\]

Proposition 5.3.1. Every covariance matrix is non-negative definite.

Suppose now we are given a countable family of disjoint events $(G_i : i \in I)$, whose union is $\Omega$. Set $\mathcal{G} = \sigma(G_i : i \in I)$. Let $X$ be an integrable random variable. The conditional expectation of $X$ given $\mathcal{G}$ is given by
\[
Y = \sum_i \mathbb{E}(X \mid G_i)\, 1_{G_i},
\]
where we set $\mathbb{E}(X \mid G_i) = \mathbb{E}(X 1_{G_i})/\mathbb{P}(G_i)$ when $\mathbb{P}(G_i) > 0$, and define $\mathbb{E}(X \mid G_i)$ in some arbitrary way when $\mathbb{P}(G_i) = 0$. Set $V = L^2(\mathcal{G}, \mathbb{P})$ and note that $Y \in V$. Then $V$ is a subspace of $L^2(\mathcal{F}, \mathbb{P})$, and $V$ is complete and therefore closed.

Proposition 5.3.2. If $X \in L^2$, then $Y$ is a version of the orthogonal projection of $X$ on $V$.
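The following Monte Carlo sketch (an editorial illustration, not part of the notes) checks Proposition 5.3.2 numerically: the partition into four cells of $(0,1)$, the variable $X$ and the competing $\mathcal{G}$-measurable $Z$ are all arbitrary assumptions made for the example.

```python
# Proposition 5.3.2, numerically: Y = sum_i E(X|G_i) 1_{G_i} minimizes
# E[(X - Z)^2] over G-measurable Z, i.e. over Z constant on each G_i.
# Partition: G_i = {U in [i/4, (i+1)/4)} for i = 0,...,3.
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=1_000_000)
X = np.sin(2 * np.pi * U) + rng.normal(size=U.size)   # an L^2 random variable

cells = np.floor(4 * U).astype(int)                   # which G_i each sample lies in
Y = np.zeros_like(X)
for i in range(4):
    Y[cells == i] = X[cells == i].mean()              # Monte Carlo E(X | G_i)

print("E[(X - Y)^2]                 =", np.mean((X - Y) ** 2))
Z = np.where(cells < 2, -0.5, 0.5)                    # some other G-measurable Z
print("E[(X - Z)^2] for a competitor =", np.mean((X - Z) ** 2))
```

The projection $Y$ should achieve the smaller mean-square distance, as the proposition asserts.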

6. Convergence in $L^1(\mathbb{P})$

6.1. Bounded convergence. We begin with a basic, but easy to use, condition for convergence in $L^1(\mathbb{P})$.

Theorem 6.1.1 (Bounded convergence). Let $(X_n : n \in \mathbb{N})$ be a sequence of random variables, with $X_n \to X$ in probability and $|X_n| \le C$ for all $n$, for some constant $C < \infty$. Then $X_n \to X$ in $L^1$.

Proof. By Theorem 2.5.1, $X$ is the almost sure limit of a subsequence, so $|X| \le C$ a.s. For $\varepsilon > 0$, there exists $N$ such that $n \ge N$ implies
\[
\mathbb{P}(|X_n - X| > \varepsilon/2) \le \varepsilon/(4C).
\]
Then

\[
\mathbb{E}|X_n - X| = \mathbb{E}\big(|X_n - X| 1_{|X_n - X| > \varepsilon/2}\big) + \mathbb{E}\big(|X_n - X| 1_{|X_n - X| \le \varepsilon/2}\big) \le 2C \cdot \varepsilon/(4C) + \varepsilon/2 = \varepsilon. \quad \square
\]

6.2. Uniform integrability.

Lemma 6.2.1. Let $X$ be an integrable random variable and set
\[
I_X(\delta) = \sup\{\mathbb{E}(|X| 1_A) : A \in \mathcal{F},\ \mathbb{P}(A) \le \delta\}.
\]
Then $I_X(\delta) \downarrow 0$ as $\delta \downarrow 0$.

Proof. Suppose not. Then, for some $\varepsilon > 0$, there exist $A_n \in \mathcal{F}$, with $\mathbb{P}(A_n) \le 2^{-n}$ and $\mathbb{E}(|X| 1_{A_n}) \ge \varepsilon$ for all $n$. By the first Borel–Cantelli lemma, $\mathbb{P}(A_n \text{ i.o.}) = 0$. But then, by dominated convergence,

\[
\varepsilon \le \mathbb{E}\big(|X| 1_{\bigcup_{m \ge n} A_m}\big) \to \mathbb{E}\big(|X| 1_{\{A_n \text{ i.o.}\}}\big) = 0,
\]
which is a contradiction. $\square$

Let $\mathcal{X}$ be a family of random variables. For $1 \le p \le \infty$, we say that $\mathcal{X}$ is bounded in $L^p$ if $\sup_{X \in \mathcal{X}} \|X\|_p < \infty$. Let us define
\[
I_{\mathcal{X}}(\delta) = \sup\{\mathbb{E}(|X| 1_A) : X \in \mathcal{X},\ A \in \mathcal{F},\ \mathbb{P}(A) \le \delta\}.
\]
Obviously, $\mathcal{X}$ is bounded in $L^1$ if and only if $I_{\mathcal{X}}(1) < \infty$. We say that $\mathcal{X}$ is uniformly integrable, or UI, if $\mathcal{X}$ is bounded in $L^1$ and

\[
I_{\mathcal{X}}(\delta) \downarrow 0 \quad \text{as } \delta \downarrow 0.
\]
Note that, by Hölder's inequality, for conjugate indices $p, q \in (1,\infty)$,
\[
\mathbb{E}(|X| 1_A) \le \|X\|_p\, (\mathbb{P}(A))^{1/q}.
\]
Hence, if $\mathcal{X}$ is bounded in $L^p$ for some $p \in (1,\infty)$, then $\mathcal{X}$ is UI. The sequence $X_n = n 1_{(0,1/n)}$ is bounded in $L^1$ for Lebesgue measure on $(0,1)$, but not uniformly integrable.

Lemma 6.2.1 shows that any single integrable random variable is uniformly integrable. This extends easily to any finite collection of integrable random variables. Moreover, for any integrable random variable $Y$, the set
\[
\mathcal{X} = \{X : X \text{ a random variable},\ |X| \le Y\}
\]
is uniformly integrable, because $\mathbb{E}(|X| 1_A) \le \mathbb{E}(Y 1_A)$ for all $A$.
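Returning to the sequence $X_n = n 1_{(0,1/n)}$ above, here is a small numerical sketch (an editorial addition, not in the notes) of why it fails to be UI; the grid discretisation of $(0,1)$ is the only assumption made.

```python
# X_n = n 1_{(0,1/n)} on (0,1) with Lebesgue measure: E|X_n| = 1 for every n,
# so the family is bounded in L^1, but choosing A = (0,1/n), with P(A) = 1/n
# arbitrarily small, keeps E(|X_n| 1_A) = 1, so I_X(delta) = 1 for all delta > 0.
import numpy as np

u = np.linspace(0, 1, 1_000_001)[1:]          # grid on (0,1]
for n in [10, 100, 1000]:
    X_n = n * (u < 1.0 / n)
    A = u < 1.0 / n                           # event with P(A) = 1/n
    print(f"n={n:5d}  E|X_n| ~ {X_n.mean():.3f}   E(|X_n| 1_A) ~ {(X_n * A).mean():.3f}")
```

However small $\mathbb{P}(A)$ becomes, the mass of $X_n$ concentrates exactly there, which is precisely the failure of uniform integrability.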

The following result gives an alternative characterization of uniform integrability.

Lemma 6.2.2. Let $\mathcal{X}$ be a family of random variables. Then $\mathcal{X}$ is UI if and only if
\[
\sup\{\mathbb{E}(|X| 1_{|X| \ge K}) : X \in \mathcal{X}\} \to 0 \quad \text{as } K \to \infty.
\]

Proof. Suppose $\mathcal{X}$ is UI. Given $\varepsilon > 0$, choose $\delta > 0$ so that $I_{\mathcal{X}}(\delta) < \varepsilon$, then choose $K < \infty$ so that $I_{\mathcal{X}}(1) \le K\delta$. Then, for $X \in \mathcal{X}$ and $A = \{|X| \ge K\}$, we have $K\mathbb{P}(A) \le \mathbb{E}(|X|) \le I_{\mathcal{X}}(1) \le K\delta$, so $\mathbb{P}(A) \le \delta$ and hence $\mathbb{E}(|X| 1_A) < \varepsilon$. Hence, as $K \to \infty$,
\[
\sup\{\mathbb{E}(|X| 1_{|X| \ge K}) : X \in \mathcal{X}\} \to 0.
\]
On the other hand, if this condition holds, then, since

\[
\mathbb{E}(|X|) \le K + \mathbb{E}(|X| 1_{|X| \ge K}),
\]
we have $I_{\mathcal{X}}(1) < \infty$. Given $\varepsilon > 0$, choose $K < \infty$ so that $\mathbb{E}(|X| 1_{|X| \ge K}) < \varepsilon/2$ for all $X \in \mathcal{X}$. Then choose $\delta > 0$ so that $K\delta < \varepsilon/2$. For all $X \in \mathcal{X}$ and $A \in \mathcal{F}$ with $\mathbb{P}(A) < \delta$, we have

\[
\mathbb{E}(|X| 1_A) \le \mathbb{E}(|X| 1_{|X| \ge K}) + K\mathbb{P}(A) < \varepsilon.
\]
Hence $\mathcal{X}$ is UI. $\square$

Here is the definitive result on $L^1$-convergence of random variables.

Theorem 6.2.3. Let $X$ be a random variable and let $(X_n : n \in \mathbb{N})$ be a sequence of random variables. The following are equivalent:
(a) $X_n \in L^1$ for all $n$, $X \in L^1$ and $X_n \to X$ in $L^1$,
(b) $\{X_n : n \in \mathbb{N}\}$ is UI and $X_n \to X$ in probability.

Proof. Suppose (a) holds. By Chebyshev's inequality, for $\varepsilon > 0$,
\[
\mathbb{P}(|X_n - X| > \varepsilon) \le \varepsilon^{-1} \mathbb{E}(|X_n - X|) \to 0,
\]
so $X_n \to X$ in probability. Moreover, given $\varepsilon > 0$, there exists $N$ such that $\mathbb{E}(|X_n - X|) < \varepsilon/2$ whenever $n \ge N$. Then we can find $\delta > 0$ so that $\mathbb{P}(A) \le \delta$ implies
\[
\mathbb{E}(|X| 1_A) \le \varepsilon/2, \qquad \mathbb{E}(|X_n| 1_A) \le \varepsilon, \quad n = 1, \dots, N.
\]
Then, for $n \ge N$ and $\mathbb{P}(A) \le \delta$,
\[
\mathbb{E}(|X_n| 1_A) \le \mathbb{E}(|X_n - X|) + \mathbb{E}(|X| 1_A) \le \varepsilon.
\]
Hence $\{X_n : n \in \mathbb{N}\}$ is UI. We have shown that (a) implies (b).

Suppose, on the other hand, that (b) holds. Then there is a subsequence $(n_k)$ such that $X_{n_k} \to X$ a.s. So, by Fatou's lemma, $\mathbb{E}(|X|) \le \liminf_k \mathbb{E}(|X_{n_k}|) < \infty$. Now, given $\varepsilon > 0$, there exists $K < \infty$ such that, for all $n$,

\[
\mathbb{E}(|X_n| 1_{|X_n| \ge K}) < \varepsilon/3, \qquad \mathbb{E}(|X| 1_{|X| \ge K}) < \varepsilon/3.
\]
Consider the uniformly bounded sequence $X_n^K = (-K) \vee X_n \wedge K$ and set $X^K = (-K) \vee X \wedge K$. Then $X_n^K \to X^K$ in probability so, by bounded convergence, there exists $N$ such that, for all $n \ge N$,
\[
\mathbb{E}|X_n^K - X^K| < \varepsilon/3.
\]
But then, for all $n \ge N$,
\[
\mathbb{E}|X_n - X| \le \mathbb{E}(|X_n| 1_{|X_n| \ge K}) + \mathbb{E}|X_n^K - X^K| + \mathbb{E}(|X| 1_{|X| \ge K}) < \varepsilon.
\]
Since $\varepsilon > 0$ was arbitrary, we have shown that (b) implies (a). $\square$

7. Fourier transforms

7.1. Definitions. In this section (only), for $p \in [1,\infty)$, we will write $L^p = L^p(\mathbb{R}^d)$ for the set of complex-valued Borel measurable functions on $\mathbb{R}^d$ such that
\[
\|f\|_p = \Big( \int_{\mathbb{R}^d} |f(x)|^p\, dx \Big)^{1/p} < \infty.
\]
The Fourier transform $\hat f$ of a function $f \in L^1(\mathbb{R}^d)$ is defined by
\[
\hat f(u) = \int_{\mathbb{R}^d} f(x)\, e^{i\langle u,x\rangle}\, dx, \quad u \in \mathbb{R}^d.
\]
Here $\langle .,. \rangle$ denotes the usual inner product on $\mathbb{R}^d$. Note that $|\hat f(u)| \le \|f\|_1$ and, by the dominated convergence theorem, $\hat f(u_n) \to \hat f(u)$ whenever $u_n \to u$. Thus $\hat f$ is a continuous bounded (complex-valued) function on $\mathbb{R}^d$.

For $f \in L^1(\mathbb{R}^d)$ with $\hat f \in L^1(\mathbb{R}^d)$, we say that the Fourier inversion formula holds for $f$ if

\[
f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat f(u)\, e^{-i\langle u,x\rangle}\, du
\]
for almost all $x \in \mathbb{R}^d$. For $f \in L^1 \cap L^2(\mathbb{R}^d)$, we say that the Plancherel identity holds for $f$ if
\[
\|\hat f\|_2 = (2\pi)^{d/2} \|f\|_2.
\]
The main results of this section establish that, for all $f \in L^1(\mathbb{R}^d)$, the inversion formula holds whenever $\hat f \in L^1(\mathbb{R}^d)$, and the Plancherel identity holds whenever $f \in L^2(\mathbb{R}^d)$.

The Fourier transform $\hat\mu$ of a finite Borel measure $\mu$ on $\mathbb{R}^d$ is defined by

\[
\hat\mu(u) = \int_{\mathbb{R}^d} e^{i\langle u,x\rangle}\, \mu(dx), \quad u \in \mathbb{R}^d.
\]
Then $\hat\mu$ is a continuous function on $\mathbb{R}^d$ with $|\hat\mu(u)| \le \mu(\mathbb{R}^d)$ for all $u$. The definitions are consistent in that, if $\mu$ has density $f$ with respect to Lebesgue measure, then $\hat\mu = \hat f$. The characteristic function $\phi_X$ of a random variable $X$ in $\mathbb{R}^d$ is the Fourier transform of its law $\mu_X$. Thus

\[
\phi_X(u) = \hat\mu_X(u) = \mathbb{E}\big(e^{i\langle u,X\rangle}\big), \quad u \in \mathbb{R}^d.
\]
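As an editorial illustration (not part of the notes), the characteristic function can be estimated directly from samples. The sketch below compares the empirical $\phi_X(u)$ of the exponential law of rate $1$ with the exact value $\phi(u) = 1/(1 - iu)$, a standard computation; the sample size and test points are arbitrary choices.

```python
# Empirical characteristic function phi_X(u) = E exp(i u X) by Monte Carlo,
# compared with the exact transform of the Exp(1) law, 1/(1 - iu).
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(size=200_000)
for u in [0.0, 0.5, 1.0, 2.0]:
    emp = complex(np.exp(1j * u * X).mean())
    exact = 1.0 / (1.0 - 1j * u)
    print(f"u={u:3.1f}  empirical {emp:.4f}   exact {exact:.4f}")
```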

7.2. . For p [1, ) and f Lp(Rd) and for a probability measure ν on Rd, we define the ∈ ∞f ν Lp∈(Rd) by ∗ ∈ f ν(x)= f(x y)ν(dy) ∗ d − ZR whenever the integral exists, setting f ν(x) = 0 otherwise. By Jensen’s inequality and Fubini’s theorem, ∗

\[
\int_{\mathbb{R}^d} \Big( \int_{\mathbb{R}^d} |f(x-y)|\, \nu(dy) \Big)^p dx \le \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |f(x-y)|^p\, \nu(dy)\, dx = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |f(x-y)|^p\, dx\, \nu(dy) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |f(x)|^p\, dx\, \nu(dy) = \|f\|_p^p < \infty,
\]
so the integral defining the convolution exists for almost all $x$, and then

\[
\|f * \nu\|_p = \Big( \int_{\mathbb{R}^d} \Big| \int_{\mathbb{R}^d} f(x-y)\, \nu(dy) \Big|^p dx \Big)^{1/p} \le \|f\|_p.
\]

In the case where $\nu$ has a density function $g$, we write $f * g$ for $f * \nu$. For probability measures $\mu, \nu$ on $\mathbb{R}^d$, we define the convolution $\mu * \nu$ to be the distribution of $X + Y$ for independent random variables $X, Y$ having distributions $\mu, \nu$. Thus

\[
\mu * \nu(A) = \int_{\mathbb{R}^d \times \mathbb{R}^d} 1_A(x+y)\, \mu(dx)\, \nu(dy), \quad A \in \mathcal{B}.
\]
Note that, if $\mu$ has density function $f$, then by Fubini's theorem

\[
\mu * \nu(A) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} 1_A(x+y)\, f(x)\, dx\, \nu(dy) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} 1_A(x)\, f(x-y)\, dx\, \nu(dy) = \int_{\mathbb{R}^d} 1_A(x)\, f * \nu(x)\, dx,
\]
so $\mu * \nu$ has density function $f * \nu$.

It is easy to check, using Fubini's theorem, that $\widehat{f * \nu}(u) = \hat f(u)\, \hat\nu(u)$ for all $f \in L^1(\mathbb{R}^d)$ and all probability measures $\nu$ on $\mathbb{R}^d$. Similarly, we have

\[
\widehat{\mu * \nu}(u) = \mathbb{E}\big(e^{i\langle u, X+Y\rangle}\big) = \mathbb{E}\big(e^{i\langle u,X\rangle}\big)\, \mathbb{E}\big(e^{i\langle u,Y\rangle}\big) = \hat\mu(u)\,\hat\nu(u).
\]
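A small Monte Carlo sketch (an editorial addition) of the two facts just stated: $\mu * \nu$ is the law of $X + Y$ for independent $X \sim \mu$, $Y \sim \nu$, and the transforms multiply. Taking $\mu = \nu$ uniform on $(0,1)$, so that $\mu * \nu$ is the triangular law on $(0,2)$, is an arbitrary choice for the example.

```python
# mu = nu = U(0,1): mu * nu is the triangular law on (0,2), and the Fourier
# transform of the sum factorises as a product of the individual transforms.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(size=500_000)
Y = rng.uniform(size=500_000)
S = X + Y

u = 3.0
lhs = np.exp(1j * u * S).mean()                                   # transform of mu * nu
rhs = np.exp(1j * u * X).mean() * np.exp(1j * u * Y).mean()       # product of transforms
print("transform of the sum :", lhs)
print("product of transforms:", rhs)
print("P(S <= 1) ~", (S <= 1.0).mean(), "(exact value 1/2 for the triangular law)")
```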

7.3. Gaussians. Consider for $t \in (0,\infty)$ the centred Gaussian probability density function $g_t$ on $\mathbb{R}^d$ of variance $t$, given by
\[
g_t(x) = \frac{1}{(2\pi t)^{d/2}}\, e^{-|x|^2/(2t)}.
\]

The Fourier transform $\hat g_t$ may be identified as follows. Let $Z$ be a standard one-dimensional normal random variable. Since $Z$ is integrable, by Theorem 3.5.1, the characteristic function $\phi_Z$ is differentiable and we can differentiate under the integral sign to obtain

\[
\phi_Z'(u) = \mathbb{E}\big(iZ e^{iuZ}\big) = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} ix\, e^{iux}\, e^{-x^2/2}\, dx = -u\, \phi_Z(u),
\]
where we integrated by parts for the last equality. Hence

\[
\frac{d}{du}\big(e^{u^2/2}\, \phi_Z(u)\big) = 0,
\]
so
\[
\phi_Z(u) = \phi_Z(0)\, e^{-u^2/2} = e^{-u^2/2}.
\]

Consider now $d$ independent standard normal random variables $Z_1, \dots, Z_d$ and set $Z = (Z_1,\dots,Z_d)$. Then $\sqrt{t}\, Z$ has density function $g_t$. So
\[
\hat g_t(u) = \mathbb{E}\big(e^{i\langle u, \sqrt{t} Z\rangle}\big) = \mathbb{E}\Big( \prod_{j=1}^d e^{i u_j \sqrt{t} Z_j} \Big) = \prod_{j=1}^d \phi_Z(u_j \sqrt{t}) = e^{-|u|^2 t/2}.
\]
Hence $\hat g_t = (2\pi)^{d/2}\, t^{-d/2}\, g_{1/t}$ and $\hat{\hat g}_t = (2\pi)^d g_t$. Then

\[
g_t(x) = g_t(-x) = (2\pi)^{-d}\, \hat{\hat g}_t(-x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat g_t(u)\, e^{-i\langle u,x\rangle}\, du,
\]
so the Fourier inversion formula holds for $g_t$.
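The identity $\hat g_t(u) = e^{-|u|^2 t/2}$ is easy to confirm numerically; the quadrature sketch below (an editorial addition, for $d = 1$ and an arbitrary $t$) uses the sign convention $\hat g_t(u) = \int g_t(x) e^{iux}\, dx$ from above.

```python
# Quadrature check of g^_t(u) = exp(-u^2 t / 2) in dimension d = 1.
import numpy as np

t = 0.7
x = np.linspace(-12, 12, 200_001)            # tails beyond |x|=12 are negligible
g_t = np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
dx = x[1] - x[0]

for u in [0.0, 1.0, 2.5]:
    ghat = np.sum(g_t * np.exp(1j * u * x)) * dx     # Riemann-sum approximation
    print(f"u={u:3.1f}  quadrature {ghat.real:+.6f}   exact {np.exp(-u**2 * t / 2):+.6f}")
```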

7.4. Gaussian convolutions. By a Gaussian convolution we mean any convolution $f * g_t$ of a function $f \in L^1(\mathbb{R}^d)$ with a Gaussian $g_t$, with $t \in (0,\infty)$. We note that $f * g_t$ is a continuous function and that
\[
\|f * g_t\|_1 \le \|f\|_1, \qquad \|f * g_t\|_\infty \le (2\pi)^{-d/2}\, t^{-d/2}\, \|f\|_1.
\]
Also $\widehat{f * g_t}(u) = \hat f(u)\, \hat g_t(u)$, and we know $\hat g_t$ explicitly, so
\[
\|\widehat{f * g_t}\|_1 \le (2\pi)^{d/2}\, t^{-d/2}\, \|f\|_1, \qquad \|\widehat{f * g_t}\|_\infty \le \|f\|_1.
\]
A straightforward calculation (using the parallelogram identity in $\mathbb{R}^d$) shows that $g_s * g_s = g_{2s}$ for all $s \in (0,\infty)$. Then, for any probability measure $\mu$ on $\mathbb{R}^d$ and any $t = 2s \in (0,\infty)$, we have $\mu * g_s \in L^1(\mathbb{R}^d)$ and hence $\mu * g_t = \mu * (g_s * g_s) = (\mu * g_s) * g_s$ is a Gaussian convolution.

Lemma 7.4.1. The Fourier inversion formula holds for all Gaussian convolutions.

Proof. Let $f \in L^1(\mathbb{R}^d)$ and let $t > 0$. We use the Fourier inversion formula for $g_t$ and Fubini's theorem to see that
\[
(2\pi)^d\, f * g_t(x) = (2\pi)^d \int_{\mathbb{R}^d} f(x-y)\, g_t(y)\, dy = \int_{\mathbb{R}^d \times \mathbb{R}^d} f(x-y)\, \hat g_t(u)\, e^{-i\langle u,y\rangle}\, du\, dy = \int_{\mathbb{R}^d \times \mathbb{R}^d} f(x-y)\, e^{i\langle u, x-y\rangle}\, \hat g_t(u)\, e^{-i\langle u,x\rangle}\, du\, dy = \int_{\mathbb{R}^d} \hat f(u)\, \hat g_t(u)\, e^{-i\langle u,x\rangle}\, du = \int_{\mathbb{R}^d} \widehat{f * g_t}(u)\, e^{-i\langle u,x\rangle}\, du. \quad \square
\]

Lemma 7.4.2. Let $f \in L^p(\mathbb{R}^d)$ with $p \in [1,\infty)$. Then $\|f * g_t - f\|_p \to 0$ as $t \to 0$.

Proof. Given $\varepsilon > 0$, there exists a continuous function $h$ of compact support such that $\|f - h\|_p \le \varepsilon/3$. Then $\|f * g_t - h * g_t\|_p = \|(f-h) * g_t\|_p \le \|f - h\|_p \le \varepsilon/3$. Set
\[
e(y) = \int_{\mathbb{R}^d} |h(x-y) - h(x)|^p\, dx.
\]
Then $e(y) \le 2^p \|h\|_p^p$ for all $y$, and $e(y) \to 0$ as $y \to 0$ by dominated convergence. By Jensen's inequality and then bounded convergence,
\[
\|h * g_t - h\|_p^p = \int_{\mathbb{R}^d} \Big| \int_{\mathbb{R}^d} (h(x-y) - h(x))\, g_t(y)\, dy \Big|^p dx
\]

\[
\le \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |h(x-y) - h(x)|^p\, g_t(y)\, dy\, dx = \int_{\mathbb{R}^d} e(y)\, g_t(y)\, dy = \int_{\mathbb{R}^d} e(\sqrt{t}\, y)\, g_1(y)\, dy \to 0
\]
as $t \to 0$. Now $\|f * g_t - f\|_p \le \|f * g_t - h * g_t\|_p + \|h * g_t - h\|_p + \|h - f\|_p$, so $\|f * g_t - f\|_p < \varepsilon$ for all sufficiently small $t > 0$, as required. $\square$

7.5. Uniqueness and inversion.

Theorem 7.5.1. Let $f \in L^1(\mathbb{R}^d)$. Define, for $t > 0$ and $x \in \mathbb{R}^d$,
\[
f_t(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat f(u)\, e^{-|u|^2 t/2}\, e^{-i\langle u,x\rangle}\, du.
\]
Then $\|f_t - f\|_1 \to 0$ as $t \to 0$. Moreover, the Fourier inversion formula holds whenever $f \in L^1(\mathbb{R}^d)$ and $\hat f \in L^1(\mathbb{R}^d)$.

Proof. Consider the Gaussian convolution $f * g_t$. Then $\widehat{f * g_t}(u) = \hat f(u)\, e^{-|u|^2 t/2}$. So $f_t = f * g_t$ by Lemma 7.4.1, and so $\|f_t - f\|_1 \to 0$ as $t \to 0$ by Lemma 7.4.2.

Now, if $\hat f \in L^1(\mathbb{R}^d)$, then, by dominated convergence with dominating function $|\hat f|$,
\[
f_t(x) \to \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \hat f(u)\, e^{-i\langle u,x\rangle}\, du
\]
as $t \to 0$, for all $x$. On the other hand, $f_{t_n} \to f$ almost everywhere for some sequence $t_n \to 0$. Hence the inversion formula holds for $f$. $\square$

7.6. Fourier transform in $L^2(\mathbb{R}^d)$.

Theorem 7.6.1. The Plancherel identity holds for all $f \in L^1 \cap L^2(\mathbb{R}^d)$. Moreover, there is a unique Hilbert space automorphism $F$ on $L^2$ such that
\[
F[f] = [(2\pi)^{-d/2} \hat f]
\]
for all $f \in L^1 \cap L^2(\mathbb{R}^d)$.

Proof. Suppose to begin that $f \in L^1$ and $\hat f \in L^1$. Then the inversion formula holds and $f, \hat f \in L^\infty$. Also $(x,u) \mapsto f(x)\hat f(u)$ is integrable on $\mathbb{R}^d \times \mathbb{R}^d$. So, by Fubini's theorem, we obtain the Plancherel identity for $f$:
\[
(2\pi)^d \|f\|_2^2 = \int_{\mathbb{R}^d} (2\pi)^d f(x)\, \overline{f(x)}\, dx = \int_{\mathbb{R}^d} \Big( \int_{\mathbb{R}^d} \hat f(u)\, e^{-i\langle u,x\rangle}\, du \Big) \overline{f(x)}\, dx = \int_{\mathbb{R}^d} \hat f(u)\, \overline{\Big( \int_{\mathbb{R}^d} f(x)\, e^{i\langle u,x\rangle}\, dx \Big)}\, du = \int_{\mathbb{R}^d} \hat f(u)\, \overline{\hat f(u)}\, du = \|\hat f\|_2^2.
\]
Now let $f \in L^1 \cap L^2$ and consider for $t > 0$ the Gaussian convolution $f_t = f * g_t$. We consider the limit $t \to 0$. By Lemma 7.4.2, $f_t \to f$ in $L^2$, so $\|f_t\|_2 \to \|f\|_2$. We have $\hat f_t = \hat f \hat g_t$ and $\hat g_t(u) = e^{-|u|^2 t/2}$, so $\|\hat f_t\|_2 \uparrow \|\hat f\|_2$ by monotone convergence. The Plancherel identity holds for $f_t$ because $f_t, \hat f_t \in L^1$. On letting $t \to 0$ we obtain the identity for $f$.

Define $F_0 : L^1 \cap L^2 \to L^2$ by $F_0[f] = [(2\pi)^{-d/2} \hat f]$. Then $F_0$ preserves the $L^2$ norm. Since $L^1 \cap L^2$ is dense in $L^2$, $F_0$ extends uniquely to an isometry $F$ of $L^2$ into itself. Finally, by the inversion formula, $F$ maps the set $V = \{[f] : f \in L^1 \text{ and } \hat f \in L^1\}$ into itself, and $F^4[f] = [f]$ for all $[f] \in V$. But $V$ contains all Gaussian convolutions and hence is dense in $L^2$, so $F$ must be onto $L^2$. $\square$

7.7. Weak convergence and characteristic functions. Let $\mu$ be a Borel probability measure on $\mathbb{R}^d$ and let $(\mu_n : n \in \mathbb{N})$ be a sequence of such measures. We say that $\mu_n$ converges weakly to $\mu$ if $\mu_n(f) \to \mu(f)$ as $n \to \infty$ for all continuous bounded functions $f$ on $\mathbb{R}^d$. Given a random variable $X$ in $\mathbb{R}^d$ and a sequence of such random variables $(X_n : n \in \mathbb{N})$, we say that $X_n$ converges weakly to $X$ if $\mu_{X_n}$ converges weakly to $\mu_X$. There is no requirement that the random variables be defined on a common probability space. Note that a sequence of measures can have at most one weak limit but, if $X$ is a weak limit of the sequence of random variables $(X_n : n \in \mathbb{N})$, then so is any other random variable with the same distribution as $X$. In the case $d = 1$, weak convergence is equivalent to convergence in distribution, as defined in Section 2.5.

Theorem 7.7.1. Let $X$ be a random variable in $\mathbb{R}^d$. Then the distribution $\mu_X$ of $X$ is uniquely determined by its characteristic function $\phi_X$. Moreover, in the case where $\phi_X$ is integrable, $\mu_X$ has a continuous bounded density function given by

\[
f_X(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \phi_X(u)\, e^{-i\langle u,x\rangle}\, du.
\]
Moreover, if $(X_n : n \in \mathbb{N})$ is a sequence of random variables in $\mathbb{R}^d$ such that $\phi_{X_n}(u) \to \phi_X(u)$ as $n \to \infty$ for all $u \in \mathbb{R}^d$, then $X_n$ converges weakly to $X$.

Proof. Let $Z$ be a random variable in $\mathbb{R}^d$, independent of $X$ and having the standard Gaussian density $g_1$. Then $\sqrt{t}\, Z$ has density $g_t$ and $X + \sqrt{t}\, Z$ has density given by the Gaussian convolution $f_t = \mu_X * g_t$. We have $\hat f_t(u) = \phi_X(u)\, e^{-|u|^2 t/2}$ so, by the Fourier inversion formula,

\[
f_t(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \phi_X(u)\, e^{-|u|^2 t/2}\, e^{-i\langle u,x\rangle}\, du.
\]
By bounded convergence, for all continuous bounded functions $g$ on $\mathbb{R}^d$, as $t \to 0$,

\[
\int_{\mathbb{R}^d} g(x)\, f_t(x)\, dx = \mathbb{E}\big(g(X + \sqrt{t}\, Z)\big) \to \mathbb{E}(g(X)) = \int_{\mathbb{R}^d} g(x)\, \mu_X(dx).
\]
Hence $\phi_X$ determines $\mu_X$ uniquely.

If $\phi_X$ is integrable, then $|f_t(x)| \le (2\pi)^{-d} \|\phi_X\|_1$ for all $x$ and, by dominated convergence with dominating function $|\phi_X|$, we have $f_t(x) \to f_X(x)$ for all $x$. Hence $f_X(x) \ge 0$ for all $x$ and, for $g$ continuous of compact support, by bounded convergence,

\[
\int_{\mathbb{R}^d} g(x)\, \mu_X(dx) = \lim_{t \to 0} \int_{\mathbb{R}^d} g(x)\, f_t(x)\, dx = \int_{\mathbb{R}^d} g(x)\, f_X(x)\, dx,
\]
which implies that $\mu_X$ has density $f_X$, as claimed.

Suppose now that $(X_n : n \in \mathbb{N})$ is a sequence of random variables such that $\phi_{X_n}(u) \to \phi_X(u)$ for all $u$. We shall show that $\mathbb{E}(g(X_n)) \to \mathbb{E}(g(X))$ for all integrable functions $g$ on $\mathbb{R}^d$ whose derivative is bounded, which implies that $X_n$ converges weakly to $X$. Given $\varepsilon > 0$, we can choose $t > 0$ so that $\sqrt{t}\, \|\nabla g\|_\infty\, \mathbb{E}|Z| \le \varepsilon/3$. Then $|\mathbb{E}(g(X + \sqrt{t}\, Z)) - \mathbb{E}(g(X))| \le \varepsilon/3$ and $|\mathbb{E}(g(X_n + \sqrt{t}\, Z)) - \mathbb{E}(g(X_n))| \le \varepsilon/3$. On the other hand, by the Fourier inversion formula and dominated convergence with dominating function $|g(x)|\, e^{-|u|^2 t/2}$, we have
\[
\mathbb{E}\big(g(X_n + \sqrt{t}\, Z)\big) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d \times \mathbb{R}^d} g(x)\, \phi_{X_n}(u)\, e^{-|u|^2 t/2}\, e^{-i\langle x,u\rangle}\, dx\, du \to \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d \times \mathbb{R}^d} g(x)\, \phi_X(u)\, e^{-|u|^2 t/2}\, e^{-i\langle x,u\rangle}\, dx\, du = \mathbb{E}\big(g(X + \sqrt{t}\, Z)\big)
\]
as $n \to \infty$. Hence $|\mathbb{E}(g(X_n)) - \mathbb{E}(g(X))| < \varepsilon$ for all sufficiently large $n$, as required. $\square$

There is a stronger version of the last assertion of Theorem 7.7.1, called Lévy's continuity theorem for characteristic functions: if $\phi_{X_n}(u)$ converges as $n \to \infty$, with limit $\phi(u)$ say, for all $u \in \mathbb{R}$, and if $\phi$ is continuous in a neighbourhood of $0$, then $\phi$ is the characteristic function of some random variable $X$, and $X_n \to X$ in distribution. We will not prove this.
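The last assertion of Theorem 7.7.1 can be watched numerically; the following sketch is an editorial addition in which $X_n$ uniform on $\{0, 1/n, \dots, (n-1)/n\}$ and the test function are arbitrary choices. The limiting characteristic function $(e^{iu}-1)/(iu)$ of $U(0,1)$ is a standard computation.

```python
# X_n uniform on {0, 1/n, ..., (n-1)/n}: phi_{X_n}(u) converges pointwise to
# the characteristic function of U(0,1), and E g(X_n) -> E g(U) for bounded
# continuous g, i.e. X_n converges weakly to U(0,1).
import numpy as np

def phi_discrete(u, n):                       # E exp(i u X_n), computed exactly
    return complex(np.exp(1j * u * np.arange(n) / n).mean())

u = 5.0
phi_limit = (np.exp(1j * u) - 1) / (1j * u)   # characteristic function of U(0,1)
for n in [2, 10, 100, 1000]:
    print(f"n={n:5d}  phi_n({u}) = {phi_discrete(u, n):.5f}")
print(f"limit     phi({u})  = {phi_limit:.5f}")

g = lambda x: np.cos(3 * x)                   # a bounded continuous test function
print("E g(X_1000) ~", np.mean(g(np.arange(1000) / 1000)), "  E g(U) =", np.sin(3) / 3)
```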

8. Gaussian random variables

8.1. Gaussian random variables in $\mathbb{R}$. A random variable $X$ in $\mathbb{R}$ is Gaussian if, for some $\mu \in \mathbb{R}$ and some $\sigma^2 \in (0,\infty)$, $X$ has density function
\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}.
\]
We also admit as Gaussian any random variable $X$ with $X = \mu$ a.s., this degenerate case corresponding to taking $\sigma^2 = 0$. We write $X \sim N(\mu, \sigma^2)$.

Proposition 8.1.1. Suppose $X \sim N(\mu, \sigma^2)$ and $a, b \in \mathbb{R}$. Then (a) $\mathbb{E}(X) = \mu$, (b) $\mathrm{var}(X) = \sigma^2$, (c) $aX + b \sim N(a\mu + b, a^2\sigma^2)$, (d) $\phi_X(u) = e^{iu\mu - u^2\sigma^2/2}$.

8.2. Gaussian random variables in $\mathbb{R}^n$. A random variable $X$ in $\mathbb{R}^n$ is Gaussian if $\langle u, X\rangle$ is Gaussian for all $u \in \mathbb{R}^n$. An example of such a random variable is provided by $X = (X_1, \dots, X_n)$, where $X_1, \dots, X_n$ are independent $N(0,1)$ random variables. To see this, we note that
\[
\mathbb{E}\big(e^{i\langle u,X\rangle}\big) = \prod_k \mathbb{E}\big(e^{iu_k X_k}\big) = e^{-|u|^2/2},
\]
so $\langle u,X\rangle \sim N(0, |u|^2)$ for all $u \in \mathbb{R}^n$.

Theorem 8.2.1. Let $X$ be a Gaussian random variable in $\mathbb{R}^n$. Let $A$ be an $m \times n$ matrix and let $b \in \mathbb{R}^m$. Then
(a) $AX + b$ is a Gaussian random variable in $\mathbb{R}^m$,
(b) $X \in L^2$ and $\mu_X$ is determined by $\mu = \mathbb{E}(X)$ and $V = \mathrm{var}(X)$,
(c) $\phi_X(u) = e^{i\langle u,\mu\rangle - \langle u, Vu\rangle/2}$,
(d) if $V$ is invertible, then $X$ has a density function on $\mathbb{R}^n$, given by
\[
f_X(x) = (2\pi)^{-n/2} (\det V)^{-1/2} \exp\{-\langle x - \mu,\, V^{-1}(x-\mu)\rangle/2\},
\]
(e) if $X = (X_1, X_2)$, with $X_1$ in $\mathbb{R}^{n_1}$ and $X_2$ in $\mathbb{R}^{n_2}$, then

\[
\mathrm{cov}(X_1, X_2) = 0 \ \text{implies}\ X_1, X_2 \ \text{independent.}
\]

Proof. For $u \in \mathbb{R}^n$, we have $\langle u, AX + b\rangle = \langle A^T u, X\rangle + \langle u, b\rangle$, so $\langle u, AX + b\rangle$ is Gaussian, by Proposition 8.1.1. This proves (a).

Each component $X_k$ is Gaussian, so $X \in L^2$. Set $\mu = \mathbb{E}(X)$ and $V = \mathrm{var}(X)$. For $u \in \mathbb{R}^n$ we have $\mathbb{E}(\langle u,X\rangle) = \langle u,\mu\rangle$ and $\mathrm{var}(\langle u,X\rangle) = \mathrm{cov}(\langle u,X\rangle, \langle u,X\rangle) = \langle u, Vu\rangle$. Since $\langle u,X\rangle$ is Gaussian, by Proposition 8.1.1 we must have $\langle u,X\rangle \sim N(\langle u,\mu\rangle, \langle u,Vu\rangle)$ and $\phi_X(u) = \mathbb{E}(e^{i\langle u,X\rangle}) = e^{i\langle u,\mu\rangle - \langle u,Vu\rangle/2}$. This is (c), and (b) follows by uniqueness of characteristic functions.

Let $Y_1, \dots, Y_n$ be independent $N(0,1)$ random variables. Then $Y = (Y_1, \dots, Y_n)$ has density
\[
f_Y(y) = (2\pi)^{-n/2} \exp\{-|y|^2/2\}.
\]
Set $\tilde X = V^{1/2} Y + \mu$; then $\tilde X$ is Gaussian, with $\mathbb{E}(\tilde X) = \mu$ and $\mathrm{var}(\tilde X) = V$, so $\tilde X \sim X$. If $V$ is invertible, then $\tilde X$, and hence $X$, has the density claimed in (d), by a linear change of variables in $\mathbb{R}^n$.

Finally, if $X = (X_1, X_2)$ with $\mathrm{cov}(X_1, X_2) = 0$, then, for $u = (u_1, u_2)$, we have
\[
\langle u, Vu\rangle = \langle u_1, V_{11} u_1\rangle + \langle u_2, V_{22} u_2\rangle,
\]
where $V_{11} = \mathrm{var}(X_1)$ and $V_{22} = \mathrm{var}(X_2)$. Then $\phi_X(u) = \phi_{X_1}(u_1)\, \phi_{X_2}(u_2)$, so $X_1$ and $X_2$ are independent. $\square$
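The construction $\tilde X = V^{1/2} Y + \mu$ in the proof is exactly how multivariate Gaussians are simulated in practice. The sketch below (an editorial addition) uses the Cholesky factor $L$ with $V = LL^T$ rather than the symmetric square root; any matrix $A$ with $AA^T = V$ yields the same law, since $AY + \mu$ is Gaussian with mean $\mu$ and covariance $AA^T$. The particular $\mu$ and $V$ are arbitrary.

```python
# Sampling N(mu, V) via X~ = L Y + mu, with L the Cholesky factor of V.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.6],
              [0.6, 1.0]])
L = np.linalg.cholesky(V)                 # V = L L^T

Y = rng.normal(size=(500_000, 2))         # rows: independent standard Gaussians
X = Y @ L.T + mu                          # rows: independent samples of N(mu, V)

print("sample mean      :", X.mean(axis=0))
print("sample covariance:\n", np.cov(X.T))
```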

9. Ergodic theory

9.1. Measure-preserving transformations. Let $(E, \mathcal{E}, \mu)$ be a measure space. A measurable function $\theta : E \to E$ is called a measure-preserving transformation if
\[
\mu(\theta^{-1}(A)) = \mu(A) \quad \text{for all } A \in \mathcal{E}.
\]
A set $A \in \mathcal{E}$ is invariant if $\theta^{-1}(A) = A$. A measurable function $f$ is invariant if $f = f \circ \theta$. The set of all invariant sets forms a $\sigma$-algebra, which we denote by $\mathcal{E}_\theta$. Then $f$ is invariant if and only if $f$ is $\mathcal{E}_\theta$-measurable. We say that $\theta$ is ergodic if $\mathcal{E}_\theta$ contains only sets of measure zero and their complements.

Here are two simple examples of measure-preserving transformations.
(i) Translation map on the torus. Take $E = [0,1)^n$ with Lebesgue measure on its Borel $\sigma$-algebra, and consider addition modulo 1 in each coordinate. For $a \in E$ set
\[
\theta_a(x_1, \dots, x_n) = (x_1 + a_1, \dots, x_n + a_n).
\]

(ii) Bakers’ map. Take E = [0, 1) with Lebesgue measure. Set θ(x)=2x 2x . − ⌊ ⌋ Proposition 9.1.1. If f is integrable and θ is measure-preserving, then f θ is integrable and ◦ fdµ = f θ dµ. ◦ ZE ZE Proposition 9.1.2. If θ is ergodic and f is invariant, then f = c a.e., for some constant c. 44 9.2. Bernoulli shifts. Let m be a probability measure on R. In 2.4, we constructed a probability space (Ω, F, P) on which there exists a sequence of independent§ random variables (Yn : n N), all having distribution m. Consider now the infinite product space ∈ E = RN = x =(x : n N) : x R for all n { n ∈ n ∈ } and the σ-algebra E on E generated by the coordinate maps Xn(x)= xn E = σ(X : n N). n ∈ Note that E is also generated by the π-system A = A : A B for all n, A = R for sufficiently large n . { n n ∈ n } n N Y∈ Define Y : Ω E by Y (ω)=(Yn(ω) : n N). Then Y is measurable and the image → 1 ∈ measure µ = P Y − satisfies, for A = n N An A, ◦ ∈ ∈

9.2. Bernoulli shifts. Let $m$ be a probability measure on $\mathbb{R}$. In §2.4, we constructed a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ on which there exists a sequence of independent random variables $(Y_n : n \in \mathbb{N})$, all having distribution $m$. Consider now the infinite product space
\[
E = \mathbb{R}^{\mathbb{N}} = \{x = (x_n : n \in \mathbb{N}) : x_n \in \mathbb{R} \text{ for all } n\}
\]
and the $\sigma$-algebra $\mathcal{E}$ on $E$ generated by the coordinate maps $X_n(x) = x_n$:
\[
\mathcal{E} = \sigma(X_n : n \in \mathbb{N}).
\]
Note that $\mathcal{E}$ is also generated by the $\pi$-system
\[
\mathcal{A} = \Big\{ \prod_{n \in \mathbb{N}} A_n : A_n \in \mathcal{B} \text{ for all } n,\ A_n = \mathbb{R} \text{ for sufficiently large } n \Big\}.
\]
Define $Y : \Omega \to E$ by $Y(\omega) = (Y_n(\omega) : n \in \mathbb{N})$. Then $Y$ is measurable and the image measure $\mu = \mathbb{P} \circ Y^{-1}$ satisfies, for $A = \prod_{n \in \mathbb{N}} A_n \in \mathcal{A}$,
\[
\mu(A) = \prod_{n \in \mathbb{N}} m(A_n).
\]
By uniqueness of extension, $\mu$ is the unique measure on $E$ having this property. Note that, under the probability measure $\mu$, the coordinate maps $(X_n : n \in \mathbb{N})$ are themselves a sequence of independent random variables with law $m$. The probability space $(E, \mathcal{E}, \mu)$ is called the canonical model for such sequences. Define the shift map $\theta : E \to E$ by
\[
\theta(x_1, x_2, \dots) = (x_2, x_3, \dots).
\]

Theorem 9.2.1. The shift map is an ergodic measure-preserving transformation.

Proof. The details of showing that $\theta$ is measurable and measure-preserving are left as an exercise. To see that $\theta$ is ergodic, we recall the definition of the tail $\sigma$-algebras
\[
\mathcal{T}_n = \sigma(X_m : m \ge n+1), \qquad \mathcal{T} = \bigcap_n \mathcal{T}_n.
\]
For $A = \prod_{n \in \mathbb{N}} A_n \in \mathcal{A}$ we have
\[
\theta^{-n}(A) = \{X_{n+k} \in A_k \text{ for all } k\} \in \mathcal{T}_n.
\]
Since $\mathcal{T}_n$ is a $\sigma$-algebra, it follows that $\theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{E}$, so $\mathcal{E}_\theta \subseteq \mathcal{T}$. Hence $\theta$ is ergodic by Kolmogorov's zero–one law. $\square$

9.3. Birkhoff’s and von Neumann’s ergodic theorems. Throughout this sec- tion, (E, E,µ) will denote a measure space, on which is given a measure-preserving transformation θ. Given an measurable function f, set S = 0 and define, for n 1, 0 ≥ n 1 Sn = Sn(f)= f + f θ + + f θ − . 45◦ ··· ◦ Lemma 9.3.1 (Maximal ergodic lemma). Let f be an integrable function on E. Set S∗ = supn 0 Sn(f). Then ≥ fdµ 0. S∗>0 ≥ Z{ }

Proof. Set $S_n^* = \max_{0 \le m \le n} S_m$ and $A_n = \{S_n^* > 0\}$. Then, for $m = 1, \dots, n$,
\[
S_m = f + S_{m-1} \circ \theta \le f + S_n^* \circ \theta.
\]
On $A_n$, we have $S_n^* = \max_{1 \le m \le n} S_m$, so
\[
S_n^* \le f + S_n^* \circ \theta.
\]
On $A_n^c$, we have
\[
S_n^* = 0 \le S_n^* \circ \theta.
\]
So, integrating and adding, we obtain

\[
\int_E S_n^*\, d\mu \le \int_{A_n} f\, d\mu + \int_E S_n^* \circ \theta\, d\mu.
\]
But $S_n^*$ is integrable, so
\[
\int_E S_n^* \circ \theta\, d\mu = \int_E S_n^*\, d\mu < \infty,
\]
which forces
\[
\int_{A_n} f\, d\mu \ge 0.
\]
As $n \to \infty$, $A_n \uparrow \{S^* > 0\}$ so, by dominated convergence with dominating function $|f|$,
\[
\int_{\{S^* > 0\}} f\, d\mu = \lim_{n \to \infty} \int_{A_n} f\, d\mu \ge 0. \quad \square
\]

Theorem 9.3.2 (Birkhoff's almost everywhere ergodic theorem). Assume that $(E, \mathcal{E}, \mu)$ is $\sigma$-finite and that $f$ is an integrable function on $E$. Then there exists an invariant function $\bar f$, with $\mu(|\bar f|) \le \mu(|f|)$, such that $S_n(f)/n \to \bar f$ a.e. as $n \to \infty$.

Proof. The functions $\liminf_n(S_n/n)$ and $\limsup_n(S_n/n)$ are invariant. Therefore, for $a < b$, so is the following set

\[
D = D(a,b) = \Big\{ \liminf_n (S_n/n) < a < b < \limsup_n (S_n/n) \Big\}.
\]
We shall show that $\mu(D) = 0$. First, we note that either $b > 0$ or $a < 0$. We can interchange the two cases by replacing $f$ by $-f$. Let us assume then that $b > 0$.

Since $D$ is invariant, we may restrict $f$ and $\theta$ to $D$ and apply the maximal ergodic lemma there. Let $B \in \mathcal{E}$ with $B \subseteq D$ and $\mu(B) < \infty$; then $g = f - b 1_B$ is integrable and, for each $x \in D$, for some $n$,
\[
S_n(g)(x) \ge S_n(f)(x) - nb > 0.
\]
Hence $S^*(g) > 0$ everywhere on $D$ and, by the maximal ergodic lemma,

\[
0 \le \int_D (f - b 1_B)\, d\mu = \int_D f\, d\mu - b\mu(B).
\]
Since $\mu$ is $\sigma$-finite, there is a sequence of sets $B_n \in \mathcal{E}$, with $\mu(B_n) < \infty$ for all $n$ and $B_n \uparrow D$. Hence
\[
b\mu(D) = \lim_{n \to \infty} b\mu(B_n) \le \int_D f\, d\mu.
\]
In particular, we see that $\mu(D) < \infty$. A similar argument applied to $-f$ and $-a$, this time with $B = D$, shows that

\[
(-a)\mu(D) \le \int_D (-f)\, d\mu.
\]
Hence
\[
b\mu(D) \le \int_D f\, d\mu \le a\mu(D).
\]
Since $a < b$ and the integral is finite, this forces $\mu(D) = 0$. Set

\[
\Delta = \Big\{ \liminf_n (S_n/n) < \limsup_n (S_n/n) \Big\};
\]
then $\Delta$ is invariant. Also, $\Delta = \bigcup_{a,b \in \mathbb{Q},\, a < b} D(a,b)$, so $\mu(\Delta) = 0$. On the complement of $\Delta$, $S_n/n$ converges in $[-\infty,\infty]$, so we can define an invariant function $\bar f$ by setting $\bar f = \lim_n (S_n/n)$ on $\Delta^c$ and $\bar f = 0$ on $\Delta$. Finally, $\mu(|f \circ \theta^n|) = \mu(|f|)$ for all $n$, so $\mu(|S_n/n|) \le \mu(|f|)$ and, by Fatou's lemma,
\[
\mu(|\bar f|) = \mu\big(\liminf_n |S_n/n|\big) \le \liminf_n \mu(|S_n/n|) \le \mu(|f|). \quad \square
\]

Theorem 9.3.3 (von Neumann's $L^p$ ergodic theorem). Assume that $\mu(E) < \infty$. Let $p \in [1,\infty)$. Then, for all $f \in L^p(\mu)$, $S_n(f)/n \to \bar f$ in $L^p$.

Proof. We have
\[
\|f \circ \theta^n\|_p = \Big( \int_E |f|^p \circ \theta^n\, d\mu \Big)^{1/p} = \|f\|_p.
\]
So, by Minkowski's inequality,

\[
\|S_n(f)/n\|_p \le \|f\|_p.
\]
Given $\varepsilon > 0$, choose $K < \infty$ so that $\|f - g\|_p < \varepsilon/3$, where $g = (-K) \vee f \wedge K$. By Birkhoff's theorem, $S_n(g)/n \to \bar g$ a.e. We have $|S_n(g)/n| \le K$ for all $n$ so, by bounded convergence, there exists $N$ such that, for $n \ge N$,
\[
\|S_n(g)/n - \bar g\|_p < \varepsilon/3.
\]
By Fatou's lemma,

\[
\|\bar f - \bar g\|_p^p = \int_E \liminf_n |S_n(f-g)/n|^p\, d\mu \le \liminf_n \int_E |S_n(f-g)/n|^p\, d\mu \le \|f - g\|_p^p.
\]
Hence, for $n \ge N$,
\[
\|S_n(f)/n - \bar f\|_p \le \|S_n(f-g)/n\|_p + \|S_n(g)/n - \bar g\|_p + \|\bar g - \bar f\|_p < \varepsilon/3 + \varepsilon/3 + \varepsilon/3 = \varepsilon. \quad \square
\]

10. Sums of independent random variables

10.1. Strong law of large numbers for finite fourth moment. The result we obtain in this section will be largely superseded in the next. We include it because its proof is much more elementary than that needed for the definitive version of the strong law which follows.

Theorem 10.1.1. Let $(X_n : n \in \mathbb{N})$ be a sequence of independent random variables such that, for some constants $\mu \in \mathbb{R}$ and $M < \infty$,
\[
\mathbb{E}(X_n) = \mu, \qquad \mathbb{E}(X_n^4) \le M \quad \text{for all } n.
\]
Set $S_n = X_1 + \dots + X_n$. Then
\[
S_n/n \to \mu \quad \text{a.s., as } n \to \infty.
\]

Proof. Consider $Y_n = X_n - \mu$. Then $Y_n^4 \le 2^4(X_n^4 + \mu^4)$, so
\[
\mathbb{E}(Y_n^4) \le 16(M + \mu^4)
\]
and it suffices to show that $(Y_1 + \dots + Y_n)/n \to 0$ a.s. So we are reduced to the case where $\mu = 0$.

Note that $X_n, X_n^2, X_n^3$ are all integrable, since $X_n^4$ is. Since $\mu = 0$, by independence,
\[
\mathbb{E}(X_i X_j^3) = \mathbb{E}(X_i X_j X_k^2) = \mathbb{E}(X_i X_j X_k X_l) = 0
\]
for distinct indices $i, j, k, l$. Hence

\[
\mathbb{E}(S_n^4) = \mathbb{E}\Big( \sum_{1 \le k \le n} X_k^4 + 6 \sum_{1 \le i < j \le n} X_i^2 X_j^2 \Big).
\]
For $i < j$, by independence and the Cauchy–Schwarz inequality,
\[
\mathbb{E}(X_i^2 X_j^2) = \mathbb{E}(X_i^2)\, \mathbb{E}(X_j^2) \le \sqrt{\mathbb{E}(X_i^4)\, \mathbb{E}(X_j^4)} \le M,
\]
so
\[
\mathbb{E}(S_n^4) \le nM + 3n(n-1)M \le 3n^2 M.
\]
Hence
\[
\mathbb{E}\Big( \sum_{n=1}^\infty (S_n/n)^4 \Big) = \sum_{n=1}^\infty \mathbb{E}(S_n^4)/n^4 \le \sum_{n=1}^\infty 3M/n^2 < \infty,
\]
so $\sum_n (S_n/n)^4 < \infty$ a.s., and in particular $S_n/n \to 0$ a.s. $\square$
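A quick Monte Carlo sketch of the theorem (an editorial addition): the centred exponential variables below satisfy the fourth-moment hypothesis, and the running means $S_n/n$ visibly settle at $0$; the distribution and sample size are arbitrary choices.

```python
# Running means S_n/n for i.i.d. X_i = Exp(1) - 1, which have mean 0 and
# finite fourth moment, illustrating Theorem 10.1.1.
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(size=1_000_000) - 1.0
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)
for n in [10, 1000, 100_000, 1_000_000]:
    print(f"n={n:8d}   S_n/n = {running_mean[n - 1]:+.5f}")
```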

10.2. Strong law of large numbers.

Theorem 10.2.1. Let $m$ be a probability measure on $\mathbb{R}$, with
\[
\int_{\mathbb{R}} |x|\, m(dx) < \infty, \qquad \int_{\mathbb{R}} x\, m(dx) = \nu.
\]
Let $(E, \mathcal{E}, \mu)$ be the canonical model for a sequence of independent random variables with law $m$. Then
\[
\mu(\{x : (x_1 + \dots + x_n)/n \to \nu \text{ as } n \to \infty\}) = 1.
\]

Proof. By Theorem 9.2.1, the shift map $\theta$ on $E$ is measure-preserving and ergodic. The coordinate function $f = X_1$ is integrable and $S_n(f) = f + f \circ \theta + \dots + f \circ \theta^{n-1} = X_1 + \dots + X_n$. So $(X_1 + \dots + X_n)/n \to \bar f$ a.e. and in $L^1$, for some invariant function $\bar f$, by Birkhoff's theorem. Since $\theta$ is ergodic, $\bar f = c$ a.e., for some constant $c$, and then $c = \mu(\bar f) = \lim_n \mu(S_n/n) = \nu$. $\square$

Theorem 10.2.2 (Strong law of large numbers). Let $(Y_n : n \in \mathbb{N})$ be a sequence of independent, identically distributed, integrable random variables with mean $\nu$. Set $S_n = Y_1 + \dots + Y_n$. Then
\[
S_n/n \to \nu \quad \text{a.s., as } n \to \infty.
\]

Proof. In the notation of Theorem 10.2.1, take $m$ to be the law of the random variables $Y_n$. Then $\mu = \mathbb{P} \circ Y^{-1}$, where $Y : \Omega \to E$ is given by $Y(\omega) = (Y_n(\omega) : n \in \mathbb{N})$. Hence
\[
\mathbb{P}(S_n/n \to \nu \text{ as } n \to \infty) = \mu(\{x : (x_1 + \dots + x_n)/n \to \nu \text{ as } n \to \infty\}) = 1. \quad \square
\]

10.3. Central limit theorem.

Theorem 10.3.1 (Central limit theorem). Let $(X_n : n \in \mathbb{N})$ be a sequence of independent, identically distributed, random variables with mean $0$ and variance $1$. Set $S_n = X_1 + \dots + X_n$. Then, for all $x \in \mathbb{R}$, as $n \to \infty$,
\[
\mathbb{P}\Big( \frac{S_n}{\sqrt{n}} \le x \Big) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2}\, dy.
\]

Proof. Set $\phi(u) = \mathbb{E}(e^{iuX_1})$. Since $\mathbb{E}(X_1^2) < \infty$, we can differentiate $\mathbb{E}(e^{iuX_1})$ twice under the expectation, to show that

\[
\phi(0) = 1, \qquad \phi'(0) = 0, \qquad \phi''(0) = -1.
\]
Hence, by Taylor's theorem, as $u \to 0$,
\[
\phi(u) = 1 - u^2/2 + o(u^2).
\]
So, for the characteristic function $\phi_n$ of $S_n/\sqrt{n}$,

\[
\phi_n(u) = \mathbb{E}\big(e^{iu(X_1 + \dots + X_n)/\sqrt{n}}\big) = \big\{\mathbb{E}\big(e^{i(u/\sqrt{n})X_1}\big)\big\}^n = \big(1 - u^2/2n + o(u^2/n)\big)^n.
\]
The complex logarithm satisfies, as $z \to 0$,
\[
\log(1+z) = z + o(|z|),
\]
so, for each $u \in \mathbb{R}$, as $n \to \infty$,
\[
\log \phi_n(u) = n \log(1 - u^2/2n + o(u^2/n)) = -u^2/2 + o(1).
\]
Hence $\phi_n(u) \to e^{-u^2/2}$ for all $u$. But $e^{-u^2/2}$ is the characteristic function of the $N(0,1)$ distribution, so $S_n/\sqrt{n} \to N(0,1)$ in distribution by Theorem 7.7.1, as required. $\square$
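To close this section, here is a Monte Carlo sketch of the theorem (an editorial addition): uniform variables on $[-\sqrt 3, \sqrt 3]$ have mean $0$ and variance $1$, and already for modest $n$ the distribution function of $S_n/\sqrt n$ tracks the standard normal $\Phi$. The values of $n$, the number of trials and the test points are arbitrary choices.

```python
# Empirical P(S_n/sqrt(n) <= x) versus Phi(x) for X_i ~ U(-sqrt(3), sqrt(3)),
# which have mean 0 and variance 1.
import math
import numpy as np

rng = np.random.default_rng(5)
n, trials = 50, 100_000
X = rng.uniform(-math.sqrt(3), math.sqrt(3), size=(trials, n))
Z = X.sum(axis=1) / math.sqrt(n)

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
for x in [-1.0, 0.0, 1.0, 2.0]:
    print(f"x={x:+.1f}   empirical {(Z <= x).mean():.4f}   Phi(x) = {Phi(x):.4f}")
```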

Index

$L^p$-norm, 26
$\pi$-system, 4
$\sigma$-algebra, 3; Borel, 8; generated by, 4, 11; product, 24; tail, 16
a.e., 15
almost everywhere, 15
almost surely, 15
Banach space, 29
Bernoulli shift, 42
Birkhoff's almost everywhere ergodic theorem, 44
Borel's normal number theorem, 14
Borel–Cantelli lemmas, 10
bounded convergence theorem, 32
Carathéodory's extension theorem, 5
central limit theorem, 47
characteristic function, 35
Chebyshev's inequality, 27
completeness of $L^p$, 30
conditional expectation, 32
conjugate indices, 28
convergence: almost everywhere, 15; in $L^p$, 27; in distribution, 15; in measure, 15; in probability, 15; weak, 39
convolution, 35
correlation, 31
countable additivity, 5
covariance, 31; matrix, 32
d-system, 4
density, 22
differentiation under the integral sign, 23
distribution, 13
dominated convergence theorem, 21
Dynkin's $\pi$-system lemma, 4
ergodic, 42
event, 9
expectation, 17
Fatou's lemma, 20
Fourier inversion formula, 35
Fourier transform: for measures, 35; on $L^1(\mathbb{R}^d)$, 35; on $L^2(\mathbb{R}^d)$, 38
Fubini's theorem, 25
function: Borel, 11; characteristic, 35; convex, 27; density, 22; distribution, 13; integrable, 18; measurable, 11; Rademacher, 14; simple, 17
fundamental theorem of calculus, 22
Gaussian: convolution, 37; random variable, 40
Hölder's inequality, 28
Hilbert space, 29
independence, 9; of $\sigma$-algebras, 10; of random variables, 13
inner product, 29
integral, 17
Jensen's inequality, 27
Kolmogorov's zero–one law, 16
Lévy's continuity theorem, 40
law, 13
maximal ergodic lemma, 43
measurable set, 3
measure, 3; $\sigma$-finite, 8; Borel, 8; discrete, 3; image, 12; Lebesgue, 8; Lebesgue, on $\mathbb{R}^n$, 26; Lebesgue–Stieltjes, 12; product, 24; Radon, 8
measure-preserving transformation, 42
Minkowski's inequality, 28
monotone class theorem, 12
monotone convergence theorem, 18
norm, 29
orthogonal, 31
parallelogram law, 31
Plancherel identity, 35
probability measure, 8
Pythagoras' rule, 31
random variable, 13
set function, 5
strong law of large numbers, 47; for finite fourth moment, 46
tail estimate, 27
tail event, 16
uniform integrability, 33
variance, 31
von Neumann's $L^p$ ergodic theorem, 45
weak convergence, 39
