
Notes from Limit Theorems 2

Mihai Nica's notes from the class Limit Theorems 2, taught by Professor McKean in Spring 2012. I have tried to carefully go over the bigger theorems from the course and fill in all the details explicitly. There is also a lot of information folded in from other sources.

• The section on Martingales is supplemented with some notes from "A First Look at Rigorous Probability Theory" by Jeffrey S. Rosenthal, which has a really nice introduction to Martingales.
• The section on the law of the iterated logarithm is supplemented with some inequalities which I looked up on the internet...mostly Wikipedia and PlanetMath.
• In the section on the Ergodic theorem, I use a notation for continued fractions that I found on Wikipedia and like. In my pen-and-paper notes there is also a little section about Ergodic theory for geodesics on surfaces, which is really cute; however, I couldn't figure out a good way to draw the pictures, so it hasn't been typed up yet.
• The section on Brownian Motion is supplemented by the book Brownian Motion and Martingales in Analysis by Richard Durrett, which is really wonderful. Some of the slick results are taken straight from there.
• I also include an appendix with results that I found myself reviewing as I went through this material.

Contents

Chapter 1. Martingales
  1. Definitions and Examples
  2. Stopping times
  3. Martingale Convergence Theorem
  4. Applications

Chapter 2. The Law of the Iterated Logarithm
  1. First Half of the Law of the Iterated Logarithm
  2. Second Half of the Law of the Iterated Logarithm

Chapter 3. Ergodic Theorem
  1. Motivation
  2. Birkhoff's Theorem
  3. Continued Fractions

Chapter 4. Brownian Motion
  1. Motivation
  2. Levy's Construction
  3. Construction from Durrett's Book
  4. Some Properties

Chapter 5. Appendix
  1. Conditional Random Variables
  2. Extension Theorems


CHAPTER 1

Martingales

1. Definitions and Examples

This section on Martingales makes heavy use of conditional random variables. I give a quick review of this topic from Limit Theorems 1 in the appendix.

Definition 1.1. A sequence of random variables X0, X1, ... is called a martingale if E(|Xn|) < ∞ for all n and, with probability 1:

E(Xn+1 | X0, X1, ..., Xn) = Xn

Intuitively, this says that the average value of Xn+1 is the same as that of Xn, even if we are given the values of X0 through Xn. Note that conditioning on X0, ..., Xn is just different notation for conditioning on σ(X0, ..., Xn), the sigma algebra generated by preimages of Borel sets through X0, ..., Xn. One can define more general martingales by replacing σ(X0, ..., Xn) with an arbitrary increasing chain of sigma algebras Fn; the results here carry over to that setting too.

Example 1.2. Martingales are sometimes called "fair games". The analogy is that the random variable Xn represents the bankroll of a gambler at time n. The game is fair because, at any point in time, the gambler's expected future capital equals his current capital.

Definition 1.3. A submartingale is a sequence with E(Xn+1 | X0, X1, ..., Xn) ≥ Xn (the capital tends to increase), and a supermartingale is one with E(Xn+1 | X0, X1, ..., Xn) ≤ Xn (the capital tends to decrease). Most of the theorems for martingales work for submartingales after changing the inequality in the right place. To avoid confusion between sub-, super-, and ordinary martingales, we will sometimes call a martingale a "fair martingale".

Example 1.4. The symmetric random walk Xn = Z0 + Z1 + ... + Zn, with each Zk = ±1 with probability 1/2, is a martingale. In terms of the fair game, this is gambling on the outcome of a fair coin.

Remark. Using the tower property of conditional expectation, we see that:

E (Xn+2|X0,X1, ..., Xn) = E (E (Xn+2|X0,X1, ..., Xn+1) |X0, ...Xn)

= E (Xn+1|X0, ...Xn)

= Xn

With a simple argument by induction, we get in general that for m ≥ n:

E (Xm|X0,X1, ..., Xn) = Xn

In particular, E(Xn) = E(X0) for every n. If τ is a random "time" (a non-negative integer valued random variable) that is independent of the Xn's, then E(Xτ) is a weighted average of the E(Xn)'s, so we still have E(Xτ) = E(X0). What if τ is dependent on the


Xn's? In general equality can fail: for the simple symmetric random walk (coin-flip betting), τ = the first time that Xn = −1 gives E(Xτ) = −1 ≠ 0 = E(X0). The next section gives some conditions under which equality does hold.
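This failure is easy to see numerically. The sketch below (a hypothetical illustration, not from the notes) simulates the coin-flip walk and stops it at the bounded time min(τ, N) for a large cap N: the average stays near E(X0) = 0, even though Xτ = −1 on the almost sure event {τ < ∞}.

```python
import random

def stopped_walk(rng, cap=10_000):
    """Simple symmetric random walk, stopped at min(tau, cap),
    where tau = first time the walk hits -1."""
    x = 0
    for _ in range(cap):
        if x == -1:
            break
        x += 1 if rng.random() < 0.5 else -1
    return x

rng = random.Random(0)
trials = 2000
samples = [stopped_walk(rng) for _ in range(trials)]
mean_stopped = sum(samples) / trials               # near E(X_0) = 0
frac_hit = sum(s == -1 for s in samples) / trials  # most paths already hit -1
print(round(mean_stopped, 1), frac_hit > 0.9)
```

The rare paths that have not yet hit −1 sit far above 0, and they exactly balance the many paths stuck at −1; as the cap grows, the balancing term escapes to infinity in size while shrinking in probability.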

2. Stopping times

Definition 2.1. For a martingale {Xn}, a non-negative integer valued random variable τ is a stopping time if it has the property that:

{τ = n} ∈ σ(X0, X1, ..., Xn)

Intuitively, this says that one can determine whether τ = n just by looking at the first n steps of the martingale.

Example 2.2. In the coin-flipping example, if we let τ be the first time that Xn = 10, then τ is a stopping time.

Example 2.3. We are often interested in Xτ, the value of the martingale at the random time τ. This is precisely defined as Xτ(ω) = X_{τ(ω)}(ω). Another handy rewriting is: Xτ = Σ_k Xk 1{τ=k}.

Lemma 2.4. If {Xn} is a submartingale and τ1, τ2 are bounded stopping times, so that ∃M s.t. 0 ≤ τ1 ≤ τ2 ≤ M with probability 1, then E(X_{τ1}) ≤ E(X_{τ2}), with equality for fair martingales.

Proof. For fixed k, the event {τ1 < k ≤ τ2} can be written as {τ1 < k ≤ τ2} = {τ1 ≤ k − 1} ∩ {τ2 ≤ k − 1}^c, from which we see that {τ1 < k ≤ τ2} ∈ σ(X0, X1, ..., X_{k−1}), because τ1 and τ2 are both stopping times. We then have the following manipulation, using a telescoping series, linearity of expectation, the fact that E(Y 1_A) = E(E(Y | X0, X1, ..., X_{k−1}) 1_A) for events A ∈ σ(X0, X1, ..., X_{k−1}), and finally the fact that E(Xk | X0, X1, ..., X_{k−1}) − X_{k−1} ≥ 0 since Xn is a (sub)martingale (with equality for fair martingales):

E(X_{τ2}) − E(X_{τ1}) = E(X_{τ2} − X_{τ1})
                      = E( Σ_{k=τ1+1}^{τ2} (Xk − X_{k−1}) )
                      = E( Σ_{k=1}^{M} (Xk − X_{k−1}) 1{τ1 < k ≤ τ2} )
                      = Σ_{k=1}^{M} E( (E(Xk | X0, ..., X_{k−1}) − X_{k−1}) 1{τ1 < k ≤ τ2} )
                      ≥ 0

where the inequality is an equality in the case of a fair martingale. ∎

Theorem 2.5. Say {Xn} is a martingale and τ a bounded stopping time (that is, ∃M s.t. 0 ≤ τ ≤ M with probability 1). Then:

E(Xτ) = E(X0)

Proof. Let υ be the random variable which is constantly 0. This is a stopping time! So by the above lemma, since 0 ≤ υ ≤ τ ≤ M, we have that E(Xτ) = E(Xυ) = E(X0). ∎
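A quick Monte Carlo check of the bounded-stopping-time theorem (an illustrative sketch of mine; the cap M = 50 and exit level ±3 are arbitrary choices, not from the notes):

```python
import random

def stopped_value(rng, M=50, level=3):
    """X at the bounded stopping time tau = min(first n with |X_n| = level, M)."""
    x = 0
    for _ in range(M):
        if abs(x) == level:
            break
        x += 1 if rng.random() < 0.5 else -1
    return x

rng = random.Random(1)
trials = 20_000
est = sum(stopped_value(rng) for _ in range(trials)) / trials
print(abs(est) < 0.1)  # E(X_tau) = E(X_0) = 0, up to Monte Carlo error
```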

Theorem 2.6. For {Xn} a martingale and τ a stopping time which is almost surely finite (that is, P(τ < ∞) = 1), we have:

E(Xτ) = E(X0) ⇐⇒ E( lim_{n→∞} X_{min(τ,n)} ) = lim_{n→∞} E( X_{min(τ,n)} )

Proof. It suffices to show that E(Xτ) = E(lim_{n→∞} X_{min(τ,n)}) and E(X0) = lim_{n→∞} E(X_{min(τ,n)}). The first equality holds since P(τ < ∞) = 1 gives P(lim_{n→∞} X_{min(τ,n)} = Xτ) = 1, so the two agree almost surely. The second holds by the above theorem concerning bounded stopping times: for any n, min(τ, n) is a bounded stopping time, so E(X_{min(τ,n)}) = E(X0), and the equality holds in the limit too. ∎

Remark. The above theorem can be combined with things like the monotone convergence theorem or the Lebesgue dominated convergence theorem to switch the limits and conclude that E(Xτ) = E(X0). Here are some examples:

Example 2.7. If {Xn} is a martingale and τ a stopping time so that P(τ < ∞) = 1, E(|Xτ|) < ∞, and lim_{n→∞} E(Xn 1{τ>n}) = 0, then E(Xτ) = E(X0).

Proof. For any n we have: X_{min(τ,n)} = Xn 1{τ>n} + Xτ 1{τ≤n}. Taking expectations and then the limit as n → ∞ gives:

lim_{n→∞} E(X_{min(τ,n)}) = lim_{n→∞} E(Xn 1{τ>n}) + lim_{n→∞} E(Xτ 1{τ≤n})
                          = 0 + E(Xτ)

The first term is 0 by hypothesis, and the second limit is justified since Xτ 1{τ≤n} → Xτ pointwise almost surely (because P(τ < ∞) = 1), and the integrable dominating function |Xτ|, with E(|Xτ|) < ∞, lets us use the Lebesgue dominated convergence theorem to conclude the convergence of the expectations. ∎

Example 2.8. Suppose {Xn} is a martingale and τ a stopping time so that E(τ) < ∞ and |Xn+1 − Xn| ≤ M < ∞ for some fixed M and every n. Then E(Xτ) = E(X0).

Proof. Let Y = |X0| + Mτ. Then Y is integrable and dominates every |X_{min(τ,n)}|, so it can be used as the dominating function in a L.D.C.T. argument very similar to the above example, giving the conclusion. ∎

3. Martingale Convergence Theorem

The proof relies on the famous upcrossing lemma:

Lemma 3.1. [The Upcrossing Lemma] Let {Xn} be a submartingale. For fixed α, β ∈ R with β > α, and M ∈ N, let U_M^{α,β} be the number of "upcrossings" that the martingale {Xn} makes of the interval [α, β] in the time period 1 ≤ n ≤ M. (An upcrossing is when Xn goes from being less than α initially to being more than β later. Precisely: U_M^{α,β} = max_k {k : ∃ t1 < u1 < ... < tk < uk ≤ M s.t. X_{ti} ≤ α and X_{ui} ≥ β ∀i}.) Then:

E(U_M^{α,β}) ≤ E(|X_M − X_0|) / (β − α)

Proof. Firstly, we remark that it suffices to prove the result when the submartingale {Xn} is replaced by {max(Xn, α)}: this is still a submartingale, it has the same number of upcrossings as Xn, and |max(X_M, α) − max(X_0, α)| ≤ |X_M − X_0|, so the inequality is only strengthened. In other words, we assume without loss of generality that Xn ≥ α for all n. This simplification is used in exactly one spot later on to get the inequality we need. Let us now carefully nail down where the upcrossings happen. Define u0 = v0 = 0 and iteratively define:

u_j = min(M, inf{k > v_{j−1} : Xk ≤ α})

v_j = min(M, inf{k > u_j : Xk ≥ β})

These record the times where the martingale crosses the interval [α, β]: the u_j's record when it first drops below the interval, and the v_j's record when it then rises above the interval. They are truncated at time M so that they are bounded stopping times. Moreover, since these times are strictly increasing until they hit M, it must be the case that v_M = M. We have then, using some crafty telescoping sums:

E(X_M) = E(X_{v_M})
       = E( X_{v_M} − X_{u_M} + X_{u_M} − X_{v_{M−1}} + X_{v_{M−1}} − ... − X_{u_1} + X_{u_1} − X_0 + X_0 )
       = E(X_0) + E( Σ_{k=1}^{M} (X_{v_k} − X_{u_k}) ) + E( Σ_{k=1}^{M} (X_{u_k} − X_{v_{k−1}}) )

The third term is non-negative! This is because u_k and v_{k−1} are both bounded stopping times with 0 ≤ v_{k−1} ≤ u_k ≤ M, so our theorem about stopping times gives that this expectation is non-negative. (This is subtle! Most of the time (when we haven't hit time M yet) we expect X_{u_k} ≤ α while X_{v_{k−1}} ≥ β, so their difference is negative. However, on the small probability event where v_{k−1} < M and u_k = M, we get a big positive number which balances the whole expectation. Compare to the example of a simple symmetric random walk with a truncated stopping time for τ = the first time that Xn = −1.)

Now for the second term, we have E( Σ_{k=1}^{M} (X_{v_k} − X_{u_k}) ) ≥ E( (β − α) U_M^{α,β} ). This is because each upcrossing counted in U_M^{α,β} contributes at least (β − α) to the sum, null cycles (where u_k = v_k = M) contribute nothing, and the possibly one incomplete cycle (where u_k < M but v_k = M) must give a non-negative contribution to the sum by the simplification that Xn ≥ α. Hence we have:

E(X_M) ≥ E(X_0) + (β − α) E(U_M^{α,β}) + 0

Which gives the desired result. ∎
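The inequality can be sanity-checked by simulation. The following sketch (my own illustration; the interval [−2, 2] and horizon M = 200 are arbitrary choices) counts completed upcrossings of a simple symmetric random walk and compares the empirical mean with E(|X_M − X_0|)/(β − α):

```python
import random

def count_upcrossings(path, a, b):
    """Count completed upcrossings of [a, b]: a visit below a
    followed later by a visit above b."""
    count, below = 0, False
    for x in path:
        if x <= a:
            below = True
        elif x >= b and below:
            count += 1
            below = False
    return count

rng = random.Random(2)
M, trials, a, b = 200, 5000, -2, 2
total_up, total_gap = 0, 0.0
for _ in range(trials):
    x, path = 0, []
    for _ in range(M):
        x += 1 if rng.random() < 0.5 else -1
        path.append(x)
    total_up += count_upcrossings(path, a, b)
    total_gap += abs(path[-1])          # |X_M - X_0| with X_0 = 0
mean_up = total_up / trials
bound = (total_gap / trials) / (b - a)  # empirical E(|X_M - X_0|)/(beta - alpha)
print(mean_up <= bound)
```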

Theorem 3.2. [Martingale Convergence Theorem] Let {Xn} be a submartingale with sup_n E(|Xn|) < ∞. Then there exists a random variable X so that Xn → X almost surely. (That is, Xn(ω) → X(ω) for almost all ω ∈ Ω.)

Proof. Firstly, since sup_n E(|Xn|) < ∞, by Fatou's lemma we have: E(lim inf_n |Xn|) ≤ lim inf_n E(|Xn|) ≤ sup_n E(|Xn|) < ∞, from which it follows that P(|Xn| → ∞) = 0. This ensures that the Xn cannot "leak away" probability to ±∞, which would prevent the limit from being a genuine (almost surely finite) random variable. Now suppose by contradiction that P(lim inf Xn < lim sup Xn) > 0, i.e. there is a non-zero probability of Xn not converging. Then, using the density of the rationals and countable subadditivity, we can find α and β so that P(lim inf Xn < α < β < lim sup Xn) > 0. Counting the number of upcrossings Xn makes of [α, β], we see that we must have: P( lim_{M→∞} U_M^{α,β} = ∞ ) ≥ P(lim inf Xn < α < β < lim sup Xn) > 0. Hence E( lim_{M→∞} U_M^{α,β} ) = ∞. By the monotone convergence theorem, however, we have that lim_{M→∞} E(U_M^{α,β}) = E( lim_{M→∞} U_M^{α,β} ) = ∞. But now we have reached a contradiction! For by the upcrossing lemma:

lim_{M→∞} E(U_M^{α,β}) ≤ lim sup_{M→∞} E(|X_M − X_0|)/(β − α) ≤ 2 sup_n E(|Xn|)/(β − α) < ∞ ∎
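A classical bounded martingale to watch converging is the fraction of red balls in a Polya urn (this example is mine, not from the notes): the fraction is a martingale with values in [0, 1], so the theorem applies and the path settles down.

```python
import random

def polya_fraction_path(steps, rng):
    """Polya urn: start with 1 red and 1 blue ball; draw uniformly,
    return the ball plus a new one of the same color.
    Returns the path of red-ball fractions (a bounded martingale)."""
    red, total, path = 1, 2, []
    for _ in range(steps):
        if rng.random() < red / total:
            red += 1
        total += 1
        path.append(red / total)
    return path

rng = random.Random(3)
path = polya_fraction_path(100_000, rng)
# The path flattens out: late fluctuations are tiny, as the a.s. limit exists.
print(abs(path[-1] - path[-1001]) < 0.01)
```

The limit here is a genuinely random variable (uniform on [0, 1] for this urn), so different seeds settle at different values, but every path settles.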

4. Applications

Theorem 4.1. [Levy] Suppose Z is a random variable with E(|Z|) < ∞, and that {Fn} is a decreasing chain of σ-algebras, F1 ⊃ F2 ⊃ ... (they are getting coarser and coarser). Let F∞ = ∩ Fn. Then we have almost surely:

lim_{n→∞} E(Z | Fn) = E(Z | F∞)

Proof. We first prove that there is an almost sure limit using the martingale convergence theorem, and then we check the defining properties of E(Z | F∞) to verify that this is indeed the limit.

Firstly, let Xn = E(Z | Fn). Then for any fixed M ∈ N, the sequence X_M, X_{M−1}, ..., X2, X1 is a martingale. (Here we are referring to the slightly more general notion of martingale mentioned after our original definition: the sigma algebras σ(X0, ..., Xn) are replaced by an arbitrary increasing chain of sigma algebras. The martingale property follows from the fact that E(E(Z|F)|G) = E(Z|G) when G ⊂ F.) Notice that we had to reverse the order of the sequence to get the sigma algebras to increase (i.e. get finer and finer), so that we really have a martingale. For this reason the martingale convergence theorem does not apply directly, but the idea of its proof still works. Suppose by contradiction, as in the proof of the martingale convergence theorem, that P(lim inf Xn < lim sup Xn) > 0. Then, as before, find α and β so that P(lim inf Xn < α < β < lim sup Xn) > 0. Since there are then infinitely many crossings of the interval [α, β], the number of downcrossings D_M^{α,β} has P( lim_{M→∞} D_M^{α,β} = ∞ ) > 0, and so E( lim_{M→∞} D_M^{α,β} ) = ∞. Hence, since D_M^{α,β} is increasing in M (the number of downcrossings can only increase if we wait longer), by monotone convergence we may find an M0 ∈ N so that E( D_{M0}^{α,β} ) > 2E(|Z|)/(β − α).

Taking now the martingale sequence X_{M0}, X_{M0−1}, ..., X2, X1, we have a violation of the upcrossing lemma (downcrossings of the original sequence are upcrossings of the reversed one, and E(|Xi|) ≤ E(|Z|) for every i), just as in the martingale convergence theorem.

Next, to verify that the limit is indeed E(Z | F∞), we just need to check the two defining properties: that it is F∞-measurable and that it has the correct expectations for events in F∞. The limit lim_{n→∞} E(Z | Fn) is F∞-measurable: for each fixed m, every E(Z | Fn) with n ≥ m is Fn-measurable and Fn ⊂ Fm, so the tail limit is Fm-measurable; since this holds for every m, the limit is measurable with respect to F∞ = ∩ Fm. To see that lim_{n→∞} E(Z | Fn) takes the correct expectations, notice that for any A ∈ F∞ ⊂ Fn we have, for every n, E(E(Z|Fn) 1_A) = E(Z 1_A) since A ∈ Fn, so in the limit lim_{n→∞} E(E(Z|Fn) 1_A) = E(Z 1_A). Hence the problem of proving that E(lim_{n→∞} E(Z|Fn) 1_A) = E(Z 1_A) is reduced to an interchange of a limit with an expectation. If Z is bounded, this is justified by the bounded convergence theorem. For Z not bounded, truncating Z to Z 1{|Z|≤N}, with a bit more work, gives the same interchange of limits. ∎

Theorem 4.2. [Levy] Suppose Z is a random variable with E(|Z|) < ∞, and that {Fn} is an increasing chain of σ-algebras, F1 ⊂ F2 ⊂ ... (they are getting finer and finer). Let F∞ = σ(∪ Fn). Then we have almost surely:

lim_{n→∞} E(Z | Fn) = E(Z | F∞)

Proof. This proof is like the last one. In this case E(Z | Fn) really is a martingale (no reversing backwards needed), so an almost sure limit exists by the martingale convergence theorem. Some more work is needed here... I think you get the desired expectation property by approximation with "tame events": for A ∈ F∞ and every ε > 0 there exists An ∈ Fn (for some n) such that P(A Δ An) < ε. ∎

Remark. This result is often known as the "Levy Zero-One Law", since a common application is to consider an event A ∈ F∞, for which the theorem tells us that:

lim_{n→∞} P(A | Fn) = lim_{n→∞} E(1A | Fn)

= E(1A|F∞)

= 1A

where the last equality holds since A is F∞-measurable. This says in particular that the limiting probability is either 0 or 1, since these are the only two values taken by 1A. In this setting, the theorem gives a short proof of the Kolmogorov zero-one law.

Theorem 4.3. [Kolmogorov Zero-One Law] Let X1, X2, ... be an infinite sequence of i.i.d. random variables. Define:

Fn = σ( ∪_{k=1}^{n} σ(Xk) )

F∞ = σ( ∪_{n=1}^{∞} Fn )

Ftail = ∩_{n=1}^{∞} σ( ∪_{k=n}^{∞} σ(Xk) )

Then any event A ∈ Ftail has either P(A) = 0 or P(A) = 1. These are the events which do not depend on any finite number of the Xn's.

Proof. Let A ∈ Ftail. For any n ∈ N we have that P(A | Fn) = E(1A | Fn) = P(A), since A ∈ Ftail is independent of the first n variables, so its conditional expectation is the constant P(A). We then have, as in the above "Levy 0-1" remark:

P(A) = lim_{n→∞} P(A | Fn) = 1A

since A ∈ F∞. So indeed, the only possible values of P(A) are 0 and 1. ∎

Theorem 4.4. [Strong Law of Large Numbers] Suppose X1, X2, ... are i.i.d. with E(|X1|) < ∞. Then we have almost surely that:

lim_{n→∞} (X1 + X2 + ... + Xn)/n = E(X1)

Proof. Define Sn = X1 + X2 + ... + Xn, and let Fn = σ( ∪_{k=n}^{∞} σ(Sk) ) be the sigma algebra of the tail Sn, S_{n+1}, .... We now claim that:

E(X1 | Fn) = Sn/n

This can be seen in the following slick way. First notice that by symmetry we must have E(X1 | Fn) = E(X2 | Fn) = ... = E(Xn | Fn). By linearity now: Σ_{k=1}^{n} E(Xk | Fn) = E( Σ_{k=1}^{n} Xk | Fn ) = E(Sn | Fn) = Sn, since Sn is Fn-measurable. Hence, since the conditional expectations are all equal and sum to Sn, we get E(X1 | Fn) = Sn/n as desired. By Levy's theorem (the decreasing version) now:

lim_{n→∞} Sn/n = lim_{n→∞} E(X1 | Fn) = E( X1 | ∩_k Fk )

From here, one can use the Hewitt-Savage zero-one law (which says that permutation invariant events have a zero-one law) to see that the sigma algebra ∩_k Fk must be trivial, so that E(X1 | ∩_k Fk) = E(X1). Alternatively, once we have concluded that such an almost sure limit exists, one can remark, by the Kolmogorov zero-one law, that the limit must be a constant (for lim_{n→∞} Sn/n does not depend on any finite number of the Xn's, so any event of the type {lim Sn/n < α} must have probability 0 or 1; by taking a sup over α, we find that the limit is a constant). Combining this with the above, using the fact that conditional expectations preserve the expectation, shows the constant is indeed E(X1). ∎
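A minimal numerical illustration of the theorem (my own sketch, with Uniform(0,1) summands):

```python
import random

rng = random.Random(4)
n = 200_000
s = 0.0
for _ in range(n):
    s += rng.uniform(0.0, 1.0)   # i.i.d. Uniform(0,1), mean 1/2
print(abs(s / n - 0.5) < 0.01)   # S_n / n is close to E(X_1) = 1/2
```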

Theorem 4.5. [Hewitt-Savage Zero-One Law] Let X1, X2, ... be an infinite sequence of i.i.d. random variables. Let A be an event which is unchanged under finite permutations of the indices of the Xi's (i.e. for every finite permutation Π, ω = (x1, x2, ...) ∈ A iff Π(ω) = (x_{Π(1)}, x_{Π(2)}, ...) ∈ A; that is, Π(A) = A). Then P(A) = 0 or 1.

Proof. We call an event "tame" if it depends on only finitely many of the Xi's. The proof is a consequence of the fact that for any ε > 0, any event A can be approximated by a tame event B so that P(B Δ A) < ε. (This is completely analogous to the fact that for the usual Lebesgue measure on R, one can approximate any measurable set S by a finite union of open intervals Ii so that λ( (∪_{i=1}^{n} Ii) Δ S ) < ε. This comes from the definition of the Lebesgue measure as the infimum of the outer measure over open covers, and the fact that every open set is a union of countably many intervals, of which only finitely many are needed to come within ε/2. In the same vein, the probability measure on the space of infinite sequences is generated by the outer measure from tame events. This is usually all packaged up in the Caratheodory extension theorem.) Once we have this tame event B, depending only on X1, ..., Xn, we let Π be the permutation that swaps 1, ..., n with n + 1, ..., 2n, so that B and Π(B) are independent events. We then have:

P(A) ≈ P(A ∩ B) = P(Π(A) ∩ Π(B)) = P(A ∩ Π(B)) ≈ P(B ∩ Π(B)) = P(B) P(Π(B)) = P(B)² ≈ P(A)²

where each approximation holds to within (a constant multiple of) ε by the choice of B. Since we can do this for every ε > 0, we get P(A) = P(A)² and the result follows. ∎

CHAPTER 2

The Law of the Iterated Logarithm

We will prove that for a sequence of i.i.d. random variables X1, X2, ... with mean 0 and variance 1, with Sn = Σ_{i=1}^{n} Xi:

P( lim sup_{n→∞} Sn / √(n log(log n)) = √2 ) = 1

This result gives finer information about these sums than the law of large numbers or the central limit theorem. We need the theory of martingales to get Doob's inequality, and then a bunch of other sneaky tricks, like the Borel-Cantelli lemmas, to get the result. We will also need a few analytic estimates along the way. (Actually, our proof here will only handle the case where the Xn's are ±1 with probability 1/2 each. The result can be generalized by using even finer estimates.)
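Before the proof, here is an illustrative simulation (mine, not from the notes). The running maximum of the normalized statistic Sn/√(n log log n) should be of order √2 ≈ 1.41, though the convergence is extremely slow because of the iterated logarithm:

```python
import math
import random

rng = random.Random(5)
x, best = 0, 0.0
for n in range(1, 1_000_001):
    x += 1 if rng.random() < 0.5 else -1
    if n >= 10:  # need log(log n) > 0
        best = max(best, x / math.sqrt(n * math.log(math.log(n))))
print(round(best, 2))  # typically of order 1, below the asymptotic sqrt(2)
```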

1. First Half of the Law of the Iterated Logarithm

To start, we will first prove some helpful lemmas.

Lemma 1.1. [Doob's Inequality] For a submartingale {Zn}, we have for any α > 0 that:

P( max_{0≤i≤n} Zi ≥ α ) ≤ E(|Zn|) / α

Proof. (Taken from Rosenthal.) Let Ak be the event {Zk ≥ α, but Zi < α for i < k}, i.e. that the process reaches α for the first time at time k. These are disjoint events with A = ∪ Ak = {(max_{0≤i≤n} Zi) ≥ α}, which is the event we want. Now consider:

α P(A) = Σ_{k=0}^{n} α P(Ak)
       = Σ_k E(α 1_{Ak})
       ≤ Σ_k E(Zk 1_{Ak})                          (since Zk ≥ α on Ak)
       ≤ Σ_k E( E(Zn | Z0, Z1, ..., Zk) 1_{Ak} )   (since it's a submartingale)
       = Σ_k E(Zn 1_{Ak})
       = E(Zn 1_A)
       ≤ E(|Zn|)

And the result follows. ∎
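A Monte Carlo sanity check of Doob's inequality on the coin-flip walk (an illustration of mine; n = 100 and α = 15 are arbitrary choices):

```python
import random

rng = random.Random(6)
n, alpha, trials = 100, 15, 20_000
hits, abs_end = 0, 0.0
for _ in range(trials):
    x, peak = 0, 0
    for _ in range(n):
        x += 1 if rng.random() < 0.5 else -1
        peak = max(peak, x)
    hits += peak >= alpha
    abs_end += abs(x)
p_max = hits / trials               # estimate of P(max_{k<=n} S_k >= alpha)
bound = (abs_end / trials) / alpha  # estimate of E(|S_n|) / alpha
print(p_max <= bound)
```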


Remark. This is a "rich man's version" of Chebyshev-type inequalities, which are proved using the same trick as in lines 3 and 4 of the inequality train above. The fact that the behavior of the whole martingale can be controlled by its endpoint gives us the little extra oomph we need.

Lemma 1.2. [Hoeffding's Inequality] Let Y be a random variable so that E(Y) = 0, and a, b ∈ R so that a ≤ Y ≤ b almost surely. Then E(e^{tY}) ≤ e^{t²(b−a)²/8}.

Proof. Write Y as a convex combination of a and b: Y = αb + (1 − α)a, where α = (Y − a)/(b − a). By convexity of the exponential, we then have:

e^{tY} ≤ ((Y − a)/(b − a)) e^{tb} + ((b − Y)/(b − a)) e^{ta}

Taking expectations (and using E(Y) = 0), we have:

E(e^{tY}) ≤ (−a/(b − a)) e^{tb} + (b/(b − a)) e^{ta} = e^{g(t(b−a))}

for g(u) = −γu + log(1 − γ + γe^u) and γ = −a/(b − a). Notice that g(0) = g′(0) = 0 and g″(u) ≤ 1/4 for all u. Hence by Taylor's theorem:

g(u) = g(0) + u g′(0) + g″(ξ) u²/2 ≤ 0 + 0 + (1/4)(u²/2) = u²/8

So then E(e^{tY}) ≤ e^{g(t(b−a))} ≤ e^{t²(b−a)²/8}. ∎

Lemma 1.3. Let X1, X2, ... be i.i.d. with P(X1 = ±1) = 1/2 and Sn = Σ_{k=1}^{n} Xk. Then P(max_{k≤n} Sk > λ) ≤ e^{−λ²/2n}.

Proof. Using Doob's inequality (applied to the submartingale e^{tSk}) and Hoeffding's inequality, for any t > 0 we have:

P(max_{k≤n} Sk > λ) = P(max_{k≤n} e^{tSk} > e^{tλ})
                    ≤ e^{−tλ} E(e^{tSn})
                    = e^{−tλ} E(e^{tX1})^n
                    ≤ e^{−tλ} e^{nt²(b−a)²/8}

Set t = 4λ/(n(b−a)²) to get:

P(max_{k≤n} Sk > λ) ≤ e^{−(4λ/(n(b−a)²))λ} e^{n(4λ/(n(b−a)²))²(b−a)²/8} = e^{−2λ²/(n(b−a)²)}

For simple symmetric steps we have a = −1 and b = 1, so this gives the result. ∎

Theorem 1.4. Let X1, X2, ... be i.i.d. with P(X1 = ±1) = 1/2 and Sn = Σ_{k=1}^{n} Xk. Then for any ε > 0,

P( lim sup_{n→∞} Sn / √(n log(log n)) > √(2 + ε) ) = 0

Or in other words, since this holds for any value of ε > 0:

P( lim sup_{n→∞} Sn / √(n log(log n)) ≤ √2 ) = 1

Proof. Fix some θ > 1 (the choice will be made more precise later). We will show that, with the correct choice of θ, the events An = {Sk > √((2 + ε) k log(log k)) for some k, θ^{n−1} ≤ k < θ^n} happen only finitely many times, which will show that the limsup can't be more than √(2 + ε). To do this it suffices to show that Σ P(An) is finite, because then the Borel-Cantelli lemma will show that An happens only finitely often with probability 1. We have, using our previous lemma:

P(An) = P( Sk > √((2 + ε) k log(log k)) for some θ^{n−1} ≤ k < θ^n )
      ≤ P( Sk > √((2 + ε) θ^{n−1} log(log θ^{n−1})) for some θ^{n−1} ≤ k < θ^n )
      ≤ P( max_{k≤θ^n} Sk > √((2 + ε) θ^{n−1} log(log θ^{n−1})) )
      ≤ exp( −(2 + ε) θ^{n−1} log(log θ^{n−1}) / (2θ^n) )
      = exp( −((2 + ε)/2) θ^{−1} (log(n − 1) + log(log θ)) )
      ≈ exp( −(1 + ε/2) θ^{−1} log(n − 1) )   for large n

So choosing θ < 1 + ε/2 gives (1 + ε/2) θ^{−1} > 1, so this is:

P(An) ≤ (n − 1)^{−(1+ε/2)θ^{−1}}

From this we see that Σ P(An) is finite (it's a p-series!). By the Borel-Cantelli lemma, this means that An happens only finitely many times with probability 1, which is the desired result. ∎

2. Second Half of the Law of the Iterated Logarithm

To prove the other half, we need some more estimates.

Lemma 2.1. [Mill's Inequality] This is an estimate on the tail of the Gaussian density:

(λ/(λ² + 1)) e^{−λ²/2} ≤ ∫_λ^∞ e^{−y²/2} dy ≤ (1/λ) e^{−λ²/2}

Proof. To prove the lower bound, we find a remarkable anti-derivative:

∫_λ^∞ e^{−y²/2} dy ≥ ∫_λ^∞ e^{−y²/2} (y⁴ + 2y² − 1)/(y⁴ + 2y² + 1) dy
                   = [ −(y/(y² + 1)) e^{−y²/2} ]_λ^∞
                   = (λ/(λ² + 1)) e^{−λ²/2}

The upper bound is found by using the estimate y/λ ≥ 1 in the range of integration:

∫_λ^∞ e^{−y²/2} dy ≤ ∫_λ^∞ (y/λ) e^{−y²/2} dy = (1/λ) [ −e^{−y²/2} ]_λ^∞ = (1/λ) e^{−λ²/2} ∎
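Both bounds are easy to verify numerically (an illustration of mine, using simple trapezoidal quadrature for the Gaussian tail integral):

```python
import math

def gauss_tail(lam, upper=40.0, steps=200_000):
    """Trapezoidal approximation of the integral of exp(-y^2/2) over [lam, upper]."""
    h = (upper - lam) / steps
    total = 0.5 * (math.exp(-lam * lam / 2) + math.exp(-upper * upper / 2))
    for i in range(1, steps):
        y = lam + i * h
        total += math.exp(-y * y / 2)
    return total * h

ok = True
for lam in (0.5, 1.0, 2.0, 3.0):
    tail = gauss_tail(lam)
    lower_bd = lam / (lam ** 2 + 1) * math.exp(-lam ** 2 / 2)
    upper_bd = (1 / lam) * math.exp(-lam ** 2 / 2)
    ok = ok and (lower_bd <= tail <= upper_bd)
print(ok)
```

Both bounds become tight as λ grows; at λ = 2 they already bracket the tail to within about 20%.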

Theorem 2.2. Let X1, X2, ... be i.i.d. with P(X1 = ±1) = 1/2 and Sn = Σ_{k=1}^{n} Xk. Then for any ε > 0,

P( lim sup_{n→∞} Sn / √(n log(log n)) ≥ √2 − 2ε ) = 1

Or in other words, since this holds for any value of ε > 0:

P( lim sup_{n→∞} Sn / √(n log(log n)) ≥ √2 ) = 1

Proof. As in the proof of the other half of the law, the idea is to prove that the appropriate events happen infinitely often using the Borel-Cantelli lemmas. Fix θ > 1 (the choice will be made precise later). Let Bn = { S_{θ^n} − S_{θ^{n−1}} ≥ √((2 − ε) θ^n log(log θ^n)) }. We will show that these occur infinitely often, and then show why this gives the result. Notice that the Bn's are independent, as each Bn depends only on the values of Xk for θ^{n−1} < k ≤ θ^n, so to prove that Bn happens infinitely often it suffices to show, via the second Borel-Cantelli lemma, that Σ P(Bn) diverges. Consider:

P(Bn) = P( S_{θ^n} − S_{θ^{n−1}} ≥ √((2 − ε) θ^n log(log θ^n)) )
      = P( S_{θ^n − θ^{n−1}} ≥ √((2 − ε) θ^n log(log θ^n)) )
      ≈ (1/√(2π)) ∫_λ^∞ e^{−y²/2} dy,   where λ = √((2 − ε) θ^n log(log θ^n)) / √(θ^n − θ^{n−1})

where the first equality holds by the Markov property of the sums (equivalently, look at the definition as sums of the Xi's and the fact that the Xi's are i.i.d.), and the approximation comes asymptotically, as θ^n − θ^{n−1} → ∞, from the central limit theorem. Now use Mill's inequality to get:

P(Bn) ≥ (1/√(2π)) (λ/(λ² + 1)) e^{−λ²/2} = (1/√(2π)) (1/(λ + λ^{−1})) e^{−λ²/2}

But now notice that λ = √((2 − ε) θ^n log(log θ^n)) / √(θ^n − θ^{n−1}) ≈ √(2 − ε) √(log n) / √(1 − θ^{−1}), so λ² ≈ (2 − ε)(log n)/(1 − θ^{−1}). So our estimate is:

P(Bn) ≥ C (1/√(log n)) exp( −((2 − ε)/(2(1 − θ^{−1}))) log n )
      ≥ C n^{−(1−ε/2)/(1−θ^{−1})} (log n)^{−1/2}

where the C's are some constants. By choosing θ large enough, (1 − ε/2)/(1 − θ^{−1}) < 1, and this is not summable! Hence Bn occurs infinitely often with probability 1.

Now we show that the events Bn occurring infinitely often is enough to see that S_{θ^n} ≥ (√2 − 2ε) √(θ^n log(log θ^n)) infinitely often too. To do this we use the first half of the law of the iterated logarithm, already proved: for any η > 0, the events {Sk > √((2 + η) k log(log k))} happen only finitely often with probability 1. By symmetry, the events {Sk < −√((2 + η) k log(log k))} happen only finitely often too. Hence the events An = { S_{θ^{n−1}} < −√((2 + η) θ^{n−1} log(log θ^{n−1})) } happen only finitely often with probability 1. Now, since the Bn's occur infinitely often with probability 1, and the An's occur only finitely often with probability 1, the events Bn ∩ An^c occur infinitely often with probability 1 too. This gives us the infinite sequence we need, for on the event Bn ∩ An^c we have the inequalities:

S_{θ^n} − S_{θ^{n−1}} ≥ √((2 − ε) θ^n log(log θ^n))
S_{θ^{n−1}} ≥ −√((2 + η) θ^{n−1} log(log θ^{n−1}))

Hence, with probability 1, we have for infinitely many values of n:

S_{θ^n} ≥ √((2 − ε) θ^n log(log θ^n)) + S_{θ^{n−1}}
        ≥ √((2 − ε) θ^n log(log θ^n)) − √((2 + η) θ^{n−1} log(log θ^{n−1}))
        ≥ √((2 − ε) θ^n log(log θ^n)) − √(((2 + η)/θ) θ^n log(log θ^n))
        = ( √(2 − ε) − √((2 + η)/θ) ) √(θ^n log(log θ^n))

So by fixing η (any choice will do) and then choosing θ large enough, we can make the coefficient √(2 − ε) − √((2 + η)/θ) ≥ √2 − 2ε. (Note that this doesn't disrupt our earlier choice of θ, because that too was a choice to make θ large, so we can always find θ big enough to suit both needs.) We have then that, for infinitely many n:

S_{θ^n} / √(θ^n log(log θ^n)) ≥ √2 − 2ε

So then:

P( lim sup_{n→∞} Sn / √(n log(log n)) ≥ √2 − 2ε ) = 1 ∎

The two halves of the law of the iterated logarithm give the full result:

P( lim sup_{n→∞} Sn / √(n log(log n)) = √2 ) = 1

CHAPTER 3

Ergodic Theorem

1. Motivation

The study of Ergodic Theory was first motivated by statistical mechanics, where one is interested in the long term averages of systems. For example, say we have some particles with positions Q(t) and momenta P(t) at time t. Let f be a function on this state space; for example, f might be the pressure, temperature, or some other macroscopic variable. Can we find a distribution G so that:

lim_{T→∞} (1/T) ∫_0^T f(Q(s), P(s)) ds = ∫ f dG

Gibbs et al. worked on this problem, and it turns out that G = (1/Z) e^{−H/kT}, with Z the partition function, H the Hamiltonian, T the temperature, and k Boltzmann's constant, has this property! These types of long term averages can be useful. We will start with a simple example.

Example 1.1. Let Ω = [0, 1) = {θ : 0 ≤ θ < 1}, where we think of Ω as a circle with perimeter 1 (and θ the position on the circle). For some fixed angle ω, let T : Ω → Ω be rotation by ω, that is, T(θ) = θ + ω mod 1. This is clearly measure preserving in the sense that for any measurable set B we have m(B) = m(T^{−1}(B)), where m is the usual Lebesgue measure. Could it be that:

lim_{N→∞} (1/N) Σ_{n=0}^{N−1} f(T^n x) = ∫_0^1 f(s) ds ?

If ω is rational, this doesn't have a chance, because T^n eventually cycles back to the identity, so the orbit T^n x only samples finitely many points. However, if ω is irrational, this is true! We can prove it in this case using Fourier analysis. When f(x) = e^{2πimx} for m ∈ N, we have the geometric series:

(1/N) Σ_{n=0}^{N−1} f(T^n x) = (1/N) Σ_{n=0}^{N−1} e^{2πim(x+nω)}
                             = (1/N) e^{2πimx} (e^{2πimNω} − 1)/(e^{2πimω} − 1)
                             → 0
                             = ∫_0^1 f(s) ds

where the fact that ω is irrational ensures that e^{2πimω} − 1 ≠ 0. In the case m = 0, f is constant, so of course the result holds. Now for any f ∈ C²(Ω), we can expand

f as a Fourier series to see that the result holds. This lets us calculate, for example:

lim_{N→∞} #{k ≤ N : x + kω ∈ (a, b)} / N = b − a

For if f = 1_{(a,b)}, notice that #{k ≤ N : x + kω ∈ (a, b)} / N = (1/N) Σ_{n=0}^{N−1} f(T^n x). By approximating f by C² functions (in the L¹ sense) from above and below, and applying the limit calculated above, we get the result. Is there a way to do this kind of thing using probability methods (rather than Fourier)? The next result is a nice theorem in this direction.
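The equidistribution statement is easy to check numerically (my own sketch; the golden-ratio rotation number and the interval (0.2, 0.5) are arbitrary choices):

```python
import math

def orbit_fraction(omega, a, b, N, x0=0.0):
    """Fraction of the first N orbit points x0, x0+omega, ... (mod 1)
    that land in the interval (a, b)."""
    x, hits = x0, 0
    for _ in range(N):
        hits += a < x < b
        x = (x + omega) % 1.0
    return hits / N

omega = (math.sqrt(5) - 1) / 2   # an irrational rotation number
frac = orbit_fraction(omega, 0.2, 0.5, 200_000)
print(abs(frac - 0.3) < 0.005)   # close to the length b - a = 0.3
```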

2. Birkhoff's Theorem

Theorem 2.1. [Birkhoff-Khinchin Ergodic Theorem] Say (Ω, F, P) is a probability space. Suppose T : Ω → Ω is a measure preserving map, in the sense that P(T^{−1}(B)) = P(B) for all B ∈ F. Let F0 = {A ∈ F : T^{−1}A = A a.e.} be the field of T-invariant events. For f : Ω → R a random variable with E(|f|) < ∞, we have almost surely:

lim_{N→∞} (1/N) Σ_{n=0}^{N−1} f(T^n x) = E(f | F0)

Corollary 2.2. In the case that F0 is the trivial field, E(f | F0) = E(f) is a constant, so this is exactly the statement we had above. This happens precisely when T^{−1}A = A ⇒ P(A) = 0 or 1; in this case we say that the map T is "ergodic". The proof of the theorem relies on the following lemma.
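As a concrete ergodic example beyond rotations (this example is mine, not from the notes): the logistic map T(x) = 4x(1 − x) on [0, 1] preserves the arcsine density 1/(π√(x(1 − x))) and is ergodic for it, so Birkhoff's theorem says the time average of f(x) = x along a typical orbit approaches the space average ∫_0^1 x/(π√(x(1−x))) dx = 1/2. Floating-point orbits are only a heuristic stand-in for true orbits, but the average comes out right in practice:

```python
def birkhoff_average(f, T, x0, N):
    """Time average (1/N) * sum_{n=0}^{N-1} f(T^n(x0))."""
    x, total = x0, 0.0
    for _ in range(N):
        total += f(x)
        x = T(x)
    return total / N

logistic = lambda x: 4.0 * x * (1.0 - x)
avg = birkhoff_average(lambda x: x, logistic, 0.123456789, 1_000_000)
print(abs(avg - 0.5) < 0.05)   # time average near the space average 1/2
```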

Lemma 2.3. [Maximal Ergodic Lemma] Say (Ω, F, P) is a probability space. Suppose T : Ω → Ω is a measure preserving map, in the sense that P(T^{−1}(B)) = P(B) for all B ∈ F. Say f : Ω → R is a random variable with E(|f|) < ∞. Let Sn = Σ_{k=0}^{n−1} f(T^k x), and let A = {sup_{n≥1} Sn > 0} be the event that this sum is positive at some point. Then:

E(f 1_A) = ∫_A f dP ≥ 0

Proof. Define f^+(x) = f(Tx), let m_n = max{0, S1, S2, ..., Sn}, and define m_n^+ in the same way, replacing f by f^+ in the definition of the Sk's. Notice that by this definition the m_n's are non-decreasing, and that the event A = {sup_{n≥1} Sn > 0} is the increasing union of the events {m_n > 0}. For this reason, it will be enough to restrict our attention to the events {m_n > 0}. Notice that on the event {m_n > 0} we have:

S1 + m_n^+ = S1 + max{0, S1^+, S2^+, ..., Sn^+}
           = S1 + max{0, S2 − S1, S3 − S1, ..., S_{n+1} − S1}
           = max{S1, S2, ..., S_{n+1}}
           = m_{n+1}

where we used that we're on the event {m_n > 0} in the last step, and S_n^+ = Σ_{k=0}^{n−1} f(T^k(Tx)) = Σ_{k=1}^{n} f(T^k x) = S_{n+1} − S1 in the second equality. We have then:

E(f 1_{m_n>0}) = E(S1 1_{m_n>0})
              = E((m_{n+1} − m_n^+) 1_{m_n>0})
              = E(m_{n+1} 1_{m_n>0}) − E(m_n^+ 1_{m_n>0})
              ≥ E(m_{n+1} 1_{m_n>0}) − E(m_n^+)

The last inequality holds since on the event {mn = 0},we have S1 ≤ 0, so + +  + mn = mn+1 − S1 ≥ mn+1 ≥ 0, so E mn 1{mn=0} ≥ 0. Hence E (mn ) = +  +  +  E mn 1{mn>0} + E mn 1{mn=0} ≥ E mn 1{mn>0} . From here, we note that + E(mn ) = E(mn) since the map T is measure preserving, and the only difference + between mn and mn is the map x → T x. Have then:   E f1{mn>0} ≥ E mn+11{mn>0} − E (mn)   = E mn+11{mn>0} − E mn1{mn>0}  = E (mn+1 − mn)1{mn>0} ≥ 0

The second equality holds since mn ≥ 0 always holds, and the last inequality 0 holds since the mns are non-increasing. Finally, to get the result, notice that {mn > 0} is increasing to {sup Sn > 0}, so by a monotone convergence theorem result, we have:   E f1{sup S >0} = lim E f1{m >0} ≥ 0 n n→∞ n  With this in hand, we can prove Birkhoff’s theorem: Theorem 2.4. [Birkoff-Khinchin Ergodic Theorem] Say (Ω, F, P) is a proba- bility space. Suppose T :Ω → Ω is a measure preserving map, in the sense that −1 −1 P(T (B)) = P(B) for all B ∈ F. Let F0 = {A ∈ F : T A = A a.e.} be the field of T -invariant events. For f :Ω → R a random variable with E(|f|) < ∞, we have almost surely: N−1 1 X n lim f (T x) = E (f|F0) N→∞ N n=0 1 PN−1 n Proof. Firstly, we will argue that limN→∞ N n=0 f (T x) converges a.s. to some random variables, and then we (as usual) check that it has the two defining properties of conditional expectation. PN−1 n Define SN = n=0 f(T x) as before, so that we are interested in the sum 1 PN−1 n Sn/n. Suppose by contradiction that limN→∞ N n=0 f (T x) does not converge a.s.. By the usual trick with rational numbers then, we can find a, b ∈ R so that  Sn Sn the even A = lim inf n ≤ a < b ≤ lim sup n hasP (A) > 0. Notice moreover, that A is a T -invariant event, i.e. x ∈ A ⇒ T x ∈ A, since applying T shifts the terms in Sn by one, which does not affect the limsup or liminf of Sn/n. (Indeed, these don’t depend on finitely many of the terms!). For this reason, we may define a new probability measure on the set A, namely we think of (A, F˜, P˜) as a new probability space, with F˜={A ∩ B : B ∈ F}and P˜ (E) = P(E)/P(A). The fact that A is T -invariant means that T nx ∈ A whenever x ∈ A so we can still talk 22 3. ERGODIC THEOREM about Snand so on on this space. The fact that P(A) > 0 means that there is no problem re-normalizing like this. So we have now P˜ (A) = 1 is the whole space. 
With this new space as our framework, we let $f'(\omega) = f(\omega) - b$; then we get new sums $S_n'$ with $\frac{S_n'}{n} = \frac{S_n}{n} - b$, and then $A = \{\liminf \frac{S_n'}{n} \le a - b < 0 \le \limsup \frac{S_n'}{n}\}$. Notice then that $\tilde{P}(\limsup \frac{S_n'}{n} \ge 0) \ge \tilde{P}(A) = 1$, so then $\tilde{P}(\{\sup S_n' > 0\}) = 1$: it is the whole space $A$. Have then by the maximal ergodic lemma that:

$$0 \le \tilde{E}(f' 1_{\{\sup S_n' > 0\}}) = \tilde{E}(f') = \tilde{E}(f) - b$$

The same argument applied to $f''(\omega) = a - f(\omega)$ gives:

$$0 \le \tilde{E}(f'' 1_{\{\sup S_n'' > 0\}}) = a - \tilde{E}(f)$$

But this is a contradiction now, for we have $a \ge \tilde{E}(f) \ge b$, which is impossible since $a < b$. This contradiction means that it is impossible to separate the liminf and the limsup like this; in other words, we have almost sure convergence.

Next it remains only to see that the random variable that this converges to is $E(f \mid \mathcal{F}_0)$. Let us denote $\bar{f} = \lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$. We must show $\bar{f}$ is $\mathcal{F}_0$ measurable and that $E(\bar{f} 1_A) = E(f 1_A)$ for all $A \in \mathcal{F}_0$. Notice that applying $x \to Tx$ does not change $\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$, as it only affects finitely many terms. This shows that $\bar{f}(x) = \bar{f}(Tx)$, which is the reason why $\bar{f}$ is $\mathcal{F}_0$ measurable. More formally, to see that $\bar{f}^{-1}(B)$ is $T$-invariant for any Borel set $B$, just write out the definitions:

$$T^{-1}(\bar{f}^{-1}(B)) = \{x \in \Omega : \bar{f}(Tx) \in B\} = \{x \in \Omega : \bar{f}(x) \in B\} = \bar{f}^{-1}(B)$$

So indeed $\bar{f}^{-1}(B) \in \mathcal{F}_0$, which means $\bar{f}$ is $\mathcal{F}_0$ measurable. To see that $\bar{f}$ has the right expectation values, we first prove the result for indicator functions and then use the "ladder" of integration to get the result we need. Consider that for sets $A \in \mathcal{F}_0$ and $B \in \mathcal{F}$ we have:

$$\int_A 1_B(x)\, dP = \int 1_A(x) 1_B(x)\, dP = \int 1_A(Tx) 1_B(Tx)\, dP = \int 1_A(x) 1_B(Tx)\, dP = \int_A 1_B(Tx)\, dP$$

Where the second equality uses the fact that $P$ is $T$-invariant and the third equality uses the fact that $A \in \mathcal{F}_0 \Rightarrow 1_A(x) = 1_A(Tx)$. Since $\int_A 1_B(x)\, dP = \int_A 1_B(Tx)\, dP$, by following along with the construction of the Lebesgue integral starting from indicator functions, we conclude that $\int_A f(x)\, dP = \int_A f(Tx)\, dP$ for any integrable $f$. Applying this inductively, we see for any $N \in \mathbb{N}$ that:

$$\frac{1}{N}\sum_{k=0}^{N-1} \int_A f(T^k x)\, dP = \int_A f(x)\, dP$$

When $f$ is bounded, we can take the limit as $N \to \infty$ and use the bounded convergence theorem to conclude:

$$\int_A \bar{f}\, dP = \lim_{N\to\infty} \frac{1}{N}\sum_{k=0}^{N-1} \int_A f(T^k x)\, dP = \int_A f(x)\, dP$$

For general $f$, we can use a truncation argument and the monotone convergence theorem to finish the result. $\square$

Example 2.5. If we look at our first example of rotation by an angle $\omega$, we concluded (using Fourier analysis) that when $\omega$ is irrational and $f$ has a Fourier series:

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \int_0^1 f(s)\, ds$$

By Birkhoff's theorem, we know that:

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f \mid \mathcal{F}_0)$$

So we conclude that $\int_0^1 f(s)\, ds = E(f \mid \mathcal{F}_0)$. Since this holds for every $f$, it must be that $\mathcal{F}_0$ is the trivial field. Notice that this improves our result a little bit, since we may now apply it to any integrable $f$, not just $f$ which are $C^2$.
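The equidistribution statement is easy to check numerically. The sketch below is ours, with arbitrary illustrative choices ($\omega = \sqrt{2}$, interval $(a,b) = (0.2, 0.5)$, orbit length $N = 10^5$): the Birkhoff average of $f = 1_{(a,b)}$ along the orbit of the rotation should approach $b - a$.

```python
import math

# Birkhoff average of f = 1_{(a,b)} for the irrational rotation T(x) = x + omega mod 1.
# omega = sqrt(2) and (a, b) = (0.2, 0.5) are illustrative choices, not from the notes.
omega = math.sqrt(2)
a, b = 0.2, 0.5
x = 0.0
N = 100_000
hits = 0
for _ in range(N):
    if a < x < b:
        hits += 1
    x = (x + omega) % 1.0

print(hits / N)  # should be close to b - a = 0.3
```

For a badly approximable angle like $\sqrt{2}$ the discrepancy decays like $\log N / N$, so the agreement is already very good at this orbit length.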

Example 2.6. In the first example, we were essentially looking at $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m (x + n\omega)}$. Now let's ask about the averages $\frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m (2^n x)}$. This is harder to handle with Fourier techniques, but we can still use Birkhoff's theorem. Again take $\Omega = [0,1)$ to be our space, but instead of thinking of this as a circle, think of it as binary sequences (which are the binary expansions of each number between 0 and 1), $\Omega = \{0.e_1 e_2 \ldots : e_i \in \{0,1\}\}$. Let $T : \Omega \to \Omega$ by $T(0.e_1 e_2 e_3 \ldots) = 0.e_2 e_3 \ldots$. This translates to $T(x) = 2x \bmod 1$ (this is the reason that applying it $n$ times gives $2^n x$). It's not hard to verify that this is measure preserving. By the Kolmogorov Zero–One law, the field $\mathcal{F}_0$ of $T$-invariant events must be the trivial field, for by applying $T$ a total of $N$ times, we see that an event $A \in \mathcal{F}_0$ cannot depend on the first $N$ digits $e_1, e_2, \ldots, e_N$. Since this works for any $N$, $\mathcal{F}_0$ is a subset of the tail field, which by the zero–one law is trivial. Hence, by Birkhoff's Theorem, we have:

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = E(f \mid \mathcal{F}_0) = E(f) = \int_0^1 f\, dP$$

For the Fourier basis function $f(x) = e^{2\pi i m x}$, this is saying that:

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} e^{2\pi i m (2^n x)} = 0$$

Example 2.7. We can use Birkhoff's theorem to give yet another proof of the strong law of large numbers. Let $(X_1, X_2, \ldots)$ be a sequence of i.i.d. random variables with finite mean and let $\Omega$ be the probability space for these sequences. Define $T : \Omega \to \Omega$ by $T(x_1, x_2, x_3, \ldots) = (x_2, x_3, \ldots)$. Notice that since the $X$'s are i.i.d., this is measure preserving. As in the previous example, the Kolmogorov zero–one law tells us the field $\mathcal{F}_0$ of $T$-invariant events is trivial. Let $f(x_1, x_2, \ldots) = x_1$. By Birkhoff's theorem:

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} x_{n+1} = \lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x)$$

$$= E(f \mid \mathcal{F}_0) = E(f)$$

$$= E(X_1)$$

Which is exactly the strong law of large numbers.
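Example 2.6 can also be checked numerically, provided we avoid a floating-point pitfall: iterating $x \mapsto 2x \bmod 1$ in floating point erases one bit of precision per step, so the orbit degenerates. The sketch below (seed, orbit length, and the test function $f(x) = \cos(2\pi x)$ are our arbitrary choices) stores the binary digits $e_i$ explicitly, so each orbit point $T^n x$ is known to 60 bits:

```python
import math
import random

# Simulate the shift T(0.e1 e2 ...) = 0.e2 e3 ... on exact binary digits, and
# check that the Birkhoff average of f(x) = cos(2*pi*x) tends to
# E(f) = integral of cos(2*pi*x) over [0,1], which is 0.
random.seed(0)
N = 10_000
bits = [random.randint(0, 1) for _ in range(N + 60)]

def orbit_point(n, precision=60):
    # T^n x = 0.e_{n+1} e_{n+2} ..., truncated to `precision` bits
    return sum(b / 2 ** (i + 1) for i, b in enumerate(bits[n:n + precision]))

avg = sum(math.cos(2 * math.pi * orbit_point(n)) for n in range(N)) / N
print(avg)  # close to 0, with fluctuations of order 1/sqrt(N)
```

The distinct Fourier modes $\cos(2\pi x)$ and $\cos(2\pi \cdot 2^k x)$ are uncorrelated under the uniform measure, so the fluctuations of the average really are of order $N^{-1/2}$.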

3. Continued Fractions

One way to specify a number $x \in [0,1)$ is the binary expansion. Each binary digit tells you "which half" of the number line $x$ is in: e.g. the first digit says if it's in $[0, \frac{1}{2})$ or $[\frac{1}{2}, 1)$, and then we treat that interval like $[0,1)$ and start over again for the next digit. Another way to play this game would be to draw the harmonic series $\frac{1}{n}$ on the number line, and then specify which interval $[\frac{1}{n+1}, \frac{1}{n})$ the number is in. Call this first number $n_1$, and we'll have then that $\frac{1}{n_1 + 1} \le x < \frac{1}{n_1}$. From this we may conclude that:

$$x = \frac{1}{n_1 + \epsilon_1}$$

For some $\epsilon_1 \in [0,1)$. Play the same game again for $\epsilon_1$, and we get:

$$x = \cfrac{1}{n_1 + \cfrac{1}{n_2 + \epsilon_2}}$$

Continuing this indefinitely gives us the "continued fraction expansion" for $x$. Since this is hard to write, we will adopt the convention that $x = [n_1; n_2; n_3; \ldots]$ to mean the continued fraction expansion with $n_1$, then $n_2$, and so on.
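The digits can be generated mechanically, exactly as in the construction above: invert, take the integer part as the digit, and recurse on the remainder. A minimal sketch using exact rational arithmetic (the function name `cf_digits` is ours, not from the notes):

```python
from fractions import Fraction

def cf_digits(x, max_digits=20):
    """Digits n1, n2, ... of x = [n1; n2; ...], obtained by repeatedly
    writing x = 1/(n + eps) with n = floor(1/x) and eps = 1/x - n."""
    digits = []
    for _ in range(max_digits):
        if x == 0:
            break
        inv = 1 / x          # exact, since x is a Fraction
        n = int(inv)         # the digit n_k = floor(1/x)
        digits.append(n)
        x = inv - n          # the remainder eps, in [0, 1)
    return digits

print(cf_digits(Fraction(3, 7)))   # 3/7 = 1/(2 + 1/3), so the digits are [2, 3]
```

Rational inputs terminate (the remainder eventually hits 0), mirroring the fact that finite continued fractions are exactly the rationals.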

Proposition 3.1. If the sequence $[n_1; n_2; n_3; \ldots]$ is periodic (that is, it repeats after some finite number of steps), then $x = [n_1; n_2; n_3; \ldots]$ is algebraic.

Proof. The easiest way to see this is through an example. Suppose we look at $x = [1; 1; 1; \ldots]$. Then:

$$\frac{1}{x} = 1 + \cfrac{1}{1 + \cfrac{1}{1 + \ddots}}$$

So then:

$$\frac{1}{x} = 1 + x$$

But then $x^2 + x - 1 = 0$, so $x$ is the root of a quadratic equation. In this case $x = \frac{\sqrt{5} - 1}{2}$ is the golden section. In general, if the continued fraction expansion is eventually periodic, then $x$ is algebraic; in fact, a classical theorem of Euler and Lagrange says such an $x$ is always the root of a quadratic polynomial. $\square$
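The fixed-point equation can be checked by brute iteration of the self-map implicit in $x = [1;1;1;\ldots]$; the starting value and iteration count below are arbitrary:

```python
# Iterate x -> 1/(1+x). The map is a contraction near its positive fixed
# point, so any positive starting value converges to the golden section.
x = 1.0
for _ in range(60):
    x = 1 / (1 + x)

golden = (5 ** 0.5 - 1) / 2   # the positive root of x^2 + x - 1 = 0
print(x, golden)              # the two agree to machine precision
```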

Definition 3.2. We write $x = [n_1; n_2; n_3; \ldots]$ to mean:

$$x = \cfrac{1}{n_1 + \cfrac{1}{n_2 + \cfrac{1}{n_3 + \cdots}}}$$

Problem 3.3. Let $T : (0,1) \to (0,1)$ by $T([n_1; n_2; \ldots]) = [n_2; n_3; \ldots]$. This is the map $T(x) = \frac{1}{x} \bmod 1$. Is there a probability density $P$ we can put on $(0,1)$ so that $T$ will be measure preserving?

Proof. [Gauss] The probability density $dP = \frac{1}{\log 2} \frac{1}{1+x}\, dx$ will do the trick! Indeed, just notice that by the definition of $T$:

$$T^{-1}(a,b) = \bigcup_{n=1}^{\infty} \left( \frac{1}{b+n}, \frac{1}{a+n} \right)$$

So then the requirement $P(T^{-1}(a,b)) = P((a,b))$ gives (using $\rho$ as a probability density function):

$$\int_a^b \rho(x)\, dx = \sum_{n=1}^{\infty} \int_{\frac{1}{b+n}}^{\frac{1}{a+n}} \rho(x)\, dx$$

Taking the derivative w.r.t. b here gives:

$$\rho(x) = \sum_{n=1}^{\infty} \rho\left( \frac{1}{x+n} \right) \frac{1}{(x+n)^2}$$

This is hard to solve in general, but it is easy to verify that $\rho(x) = \frac{1}{1+x}$ works, since the LHS is $\frac{1}{1+x}$ while the RHS is:

$$\sum_{n=1}^{\infty} \rho\left(\frac{1}{x+n}\right) \frac{1}{(x+n)^2} = \sum_{n=1}^{\infty} \frac{1}{1 + \frac{1}{x+n}} \cdot \frac{1}{(x+n)^2} = \sum_{n=1}^{\infty} \frac{x+n}{1+x+n} \cdot \frac{1}{(x+n)^2} = \sum_{n=1}^{\infty} \frac{1}{(x+n+1)(x+n)} = \sum_{n=1}^{\infty} \left( \frac{1}{x+n} - \frac{1}{x+n+1} \right) = \frac{1}{x+1}$$

Which is a telescoping sum, so we can evaluate it exactly. The factor of $\frac{1}{\log 2}$ normalizes $\rho$ so that $\int_0^1 \rho(x)\, dx = 1$. Indeed:

$$\frac{1}{\log 2} \int_0^1 \frac{1}{x+1}\, dx = \frac{1}{\log 2} \left[ \log(1+x) \right]_0^1 = \frac{\log 2 - \log 1}{\log 2} = 1 \qquad \square$$
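The functional equation can also be sanity-checked numerically by truncating the sum; the test point $x = 0.3$ and the truncation level are arbitrary choices of ours. Since the sum telescopes, the tail from $n = N$ onward is exactly $\frac{1}{x+N}$, which tells us how large $N$ must be:

```python
# Check rho(x) = sum_n rho(1/(x+n)) / (x+n)^2 for the Gauss density rho(x) = 1/(1+x).
# The truncation error (the telescoping tail) is 1/(x+N), so N = 200000 gives ~5e-6.
def rho(x):
    return 1 / (1 + x)

x = 0.3
N = 200_000
rhs = sum(rho(1 / (x + n)) / (x + n) ** 2 for n in range(1, N))
print(rho(x), rhs)   # agree up to the tail error
```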

Theorem 3.4. The shift function $T : [0,1] \to [0,1]$ given by $T([n_1; n_2; \ldots]) = [n_2; n_3; \ldots]$ is ergodic.

Proof. Fix $N \in \mathbb{N}$ and a list of integers $n_1, n_2, \ldots, n_N$. Now define:

$$n(x) := \cfrac{1}{n_1 + \cfrac{1}{n_2 + \cdots + \cfrac{1}{n_N + x}}}$$

For each choice of $n_1, n_2, \ldots, n_N$, the image of $[0,1]$ under $n(x)$ is an interval whose endpoints are $n(0)$ and $n(1)$. As $N$ increases, this interval gets smaller and smaller. An easy proof by induction shows that $n(x)$ can be written as:

$$n(x) = \frac{Ax + B}{Cx + D}$$

For $A, B, C, D \in \mathbb{R}$ with $0 \le A \le B$ and $1 \le C \le D$, and with $AD - BC = \pm 1$, where the sign depends on the parity of $N$. Now, let $I = [n(0), n(1)]$ and let $J = (a, b)$ be an arbitrary interval.

Claim. $|I \cap T^{-N}(J)| \ge \frac{1}{2} |I| |J|$ holds for all $N \in \mathbb{N}$.

Proof. Take $x \in I \cap T^{-N}(J)$. Notice that $x \in I$ if and only if $x = n(y)$ for some $y \in [0,1]$, by definition of $I$. So we can write $x$ as a continued fraction $x = [n_1; n_2; \ldots; n_{N-1}; n_N + y]$. On the other hand, $x \in T^{-N}(J)$ if and only if $T^N x \in J$. But $T^N x = T^N([n_1; n_2; \ldots; n_{N-1}; n_N + y]) = y$ by definition of $T$. This shows that $x \in T^{-N}(J)$ if and only if $y \in J$. Have then, using the observation that $n$ is a fractional linear transformation, that:

$$I \cap T^{-N}(J) = \{n(y) : y \in J\} = [n(a), n(b)]$$

This shows:

$$|I \cap T^{-N}(J)| = |n(b) - n(a)|$$

$$= \left| \frac{Ab + B}{Cb + D} - \frac{Aa + B}{Ca + D} \right| = \frac{|b - a|}{(Ca + D)(Cb + D)} \ge \frac{|b - a|}{(C + D)^2} \quad \text{(since } a, b < 1\text{)}$$
$$\ge \frac{1}{2} |b - a| |I| = \frac{1}{2} |J| |I|$$

The last inequality holds by writing out $|I|$ and using $|AD - BC| = 1$ and the fact that $1 \le C \le D$, so that $C + D \le 2D$:

$$|I| = |n(0) - n(1)|$$

$$= \left| \frac{A + B}{C + D} - \frac{B}{D} \right| = \frac{|AD - BC|}{D(C + D)} = \frac{1}{D(C + D)} \le \frac{2}{(C + D)^2} \qquad \square$$

Finally, to see that $T$ is ergodic, take any Borel set $B \in \mathcal{F}$. By approximating $B$ by intervals, the inequality from the claim still holds:

$$|I \cap T^{-N} B| \ge \frac{1}{2} |I| |B|$$

Take any set $A$ now. Again, by approximating $A$ by intervals $I$, we can use the above inequality to get:

$$|A \cap T^{-N} B| \ge \frac{1}{2} |A| |B|$$

This gives what we want, for if $B$ is $T$-invariant, we have $T^{-N} B = B$ for every $N$. The choice $A = B^c$ in the above gives:

$$\frac{1}{2} |B| |B^c| \le |B^c \cap T^{-N} B| = |B^c \cap B| = 0$$

So $|B| |B^c| = 0$, which is only possible if $|B| = 1$ or $|B| = 0$. This is saying all $T$-invariant sets are either measure zero or full measure. In other words, $T$ is ergodic. $\square$
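Ergodicity of this shift has a concrete payoff when combined with Birkhoff's theorem and the Gauss density: taking $f = 1_{\{n_1 = k\}}$, almost every $x$ has digit $k$ appearing with limiting frequency $P(x \in (\frac{1}{k+1}, \frac{1}{k})) = \frac{1}{\log 2}\log(1 + \frac{1}{k(k+2)})$; for $k = 1$ this is $\log_2(4/3) \approx 0.415$. The Monte Carlo sketch below is ours (sample count, bit depth, and seed are arbitrary); it uses exact rationals so the digits are computed correctly:

```python
import math
import random
from fractions import Fraction

# Empirical frequency of the digit 1 in the continued fraction expansions of
# random points, compared against the Gauss-measure prediction log2(4/3).
random.seed(0)
BITS = 256
total = ones = 0
for _ in range(200):
    x = Fraction(random.getrandbits(BITS) | 1, 2 ** BITS)  # random rational in (0,1)
    while x != 0:
        inv = 1 / x
        n = int(inv)          # the current digit n_k
        total += 1
        if n == 1:
            ones += 1
        x = inv - n           # the Gauss map T(x) = 1/x mod 1
print(ones / total, math.log2(4 / 3))
```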

CHAPTER 4

Brownian Motion

1. Motivation

Our aim is to discuss a stochastic process on $[0,1]$ (that is, a probability space $(\Omega, \mathcal{F}, P)$ and a collection of random variables $B_t(\omega)$, for $t \in [0,1]$) which has the following properties:

• $B_0(\omega) = 0$ for every $\omega \in \Omega$.
• Fix a $T \in [0,1]$, and define for $t > 0$: $B_t^+ = B_{T+t} - B_T$. We want $B_t^+$ to look statistically identical to $B_t$. (This says the process has some sort of "time homogeneous" property.)
• We want $B_t^+$ as defined above to be independent of the path up to time $T$. (This says that the process has some sort of Markov property.)
• $E(B_t^2) < \infty$.
• $E(B_t) = 0$.
• $B_t(\omega)$ is continuous for every (or almost every) $\omega \in \Omega$.

This process is supposed to describe something like a piece of dust that you can see sometimes wiggling about in a sunbeam. Notice that the time homogeneous and Markov properties together mean we can write:

$$B_T = \sum_{k=1}^{N} \left( B_{\frac{kT}{N}} - B_{\frac{(k-1)T}{N}} \right)$$

Which is a sum of many independent increments. By the central limit theorem, this suggests $B_t \sim N(0, \sigma^2)$ is normally distributed (to get this more rigorously would take a bit more work, since the above set-up is not exactly the set-up for the central limit theorem). This is often taken as an "axiom":

• $B_t \sim N(0, \sigma^2)$

A quick calculation shows that $\sigma^2 \propto t$. Let $f(t) = \sigma^2$ be the variance of $B_t$. Then:

$$f(t+s) = E(B_{t+s}^2) = E((B_{t+s} - B_s + B_s)^2) = E((B_{t+s} - B_s)^2) + E(B_s^2) + 2E((B_{t+s} - B_s)B_s) = f(t) + f(s) + 2 \cdot 0$$

Where we used the time homogeneous property and the Markov property. This functional relation means that $f(t)$ must be linear! $f(0) = 0$ holds since $B_0$ is known exactly. Hence $f(t) = c \cdot t$. It doesn't hurt to take $c = 1$, since anything we get can be rescaled for other values of $c$ if need be. Sometimes this is taken as the "axiom":

(1) $B_t \sim N(0, t)$


The following property also turns out to be very useful:

Proposition 1.1. $E(B_a B_b) = \min(a, b)$

Proof. Suppose w.l.o.g. $a < b$. Then: $E(B_a B_b) = E(B_a(B_b - B_a + B_a)) = E(B_a(B_b - B_a)) + E(B_a^2) = 0 + a = \min(a, b)$. $\square$

It remains to see that such a process really exists. The main difficulty is proving that the process is continuous. There is more than one way to skin the cat here; each method is useful because it gives a different insight into what is going on.
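The covariance relation is easy to check by simulation, since the pair $(B_a, B_b)$ can be built from two independent Gaussian increments. The grid points $a = 0.3$, $b = 0.7$, the sample count, and the seed below are arbitrary choices of ours:

```python
import math
import random

# Monte Carlo check of E(B_a B_b) = min(a, b).
random.seed(42)
a, b = 0.3, 0.7
trials = 100_000
acc = 0.0
for _ in range(trials):
    Ba = Ba = math.sqrt(a) * random.gauss(0, 1)       # B_a ~ N(0, a)
    Bb = Ba + math.sqrt(b - a) * random.gauss(0, 1)   # plus an independent increment
    acc += Ba * Bb
print(acc / trials)   # close to min(a, b) = 0.3
```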

2. Levy's Construction

We will construct Brownian motion on $t \in [0,1]$ as a uniform limit of continuous functions $B_t^N$, as $N \to \infty$. Each $B_t^N$ will be an approximation of the Brownian motion that is piecewise linear between the dyadic rationals of the form $\frac{a}{2^N}$. The real trick in the construction is the remarkable observation that the corrections from $B_t^N$ to $B_t^{N+1}$ are independent of the construction so far up to level $N$, which is the crucial fact that makes the construction so nice and allows it to converge. The crucial fact about Brownian motion that makes this possible is captured in the proposition below:

Proposition 2.1. Let $B_t$ be a Brownian path and $0 < a < b < 1$. Consider the line segment joining $B_a$ and $B_b$: $l(t) = B_a + (t - a)\frac{B_b - B_a}{b - a}$. Consider the value of the Brownian path at the midpoint time, $B_{\frac{a+b}{2}}$. The difference between this point and the line $l(t)$ is independent of $B_a$ and $B_b$. That is to say: $X = B_{\frac{a+b}{2}} - l(\frac{a+b}{2}) = B_{\frac{a+b}{2}} - \frac{1}{2}B_a - \frac{1}{2}B_b$ is independent of $B_a$ and $B_b$. Moreover, $X$ is normally distributed: $X \sim N(0, \frac{1}{4}(b - a))$.

Proof. Firstly, we notice that the random variables $X$, $B_a$, and $B_b$ have a joint normal distribution. This can be seen without much difficulty by expanding the definition of $X$ to write any linear combination of $X$, $B_a$, and $B_b$ as a linear combination of $B_{\frac{a+b}{2}}$, $B_a$, and $B_b$. From here, rewrite it as a linear combination of $B_a$, $B_{\frac{a+b}{2}} - B_a$, and $B_b - B_{\frac{a+b}{2}}$. By the hypothesis on our Brownian motion, each of these is an independent Gaussian variable, so any linear combination of them is again Gaussian. Hence any linear combination of $X$, $B_a$, and $B_b$ is Gaussian. This property is a characterization of the joint Gaussian distribution. The observation that $X$, $B_a$, and $B_b$ are jointly normal substantially simplifies the verification of their independence, as jointly normal variables are independent if and only if they are uncorrelated. From here we calculate (with the help of the useful covariance relation):

$$E(B_a X) = E\left( B_a \left( B_{\frac{a+b}{2}} - \frac{1}{2}B_a - \frac{1}{2}B_b \right) \right)$$

$$= E(B_a B_{\frac{a+b}{2}}) - \frac{1}{2}E(B_a^2) - \frac{1}{2}E(B_a B_b) = a - \frac{1}{2}a - \frac{1}{2}a = 0$$

A similar calculation holds for $E(B_b X)$. Since these are uncorrelated and jointly normal, they are independent. A quick calculation using the covariance relation again gives $X \sim N(0, \frac{1}{4}(b - a))$. $\square$

This remarkable fact gives us a nice way to construct Brownian motion starting with an infinite sequence of standard ($E(Z) = 0$, $E(Z^2) = 1$) i.i.d. Gaussian variables $(Z_0, Z_1, Z_2, \ldots)$. The idea is to first construct $B_0 = 0$, $B_1 = Z_0$. Then, once $B_0$ and $B_1$ are constructed, by the above proposition we know that $B_{1/2} - \frac{1}{2}B_0 - \frac{1}{2}B_1$ can be modeled by $\sqrt{\frac{1}{4}} Z_1$, so set $B_{1/2} = \frac{1}{2}B_1 + \sqrt{\frac{1}{4}} Z_1$. Once $B_0$, $B_{1/2}$, $B_1$ are constructed, the above proposition gives us a way to get $B_{1/4}$ and $B_{3/4}$ using two more normal variables $\sqrt{\frac{1}{8}} Z_2$ and $\sqrt{\frac{1}{8}} Z_3$, and so on.

The above proposition and paragraph are the basic idea. It becomes a bit of a mouthful to write it all down. A confused reader should focus on understanding the construction above before digesting the details below. To formalize the process, we let $B_t^N$ be the construction at the $N$-th level, which will have the correct values at points of the form $\frac{a}{2^N}$, $0 \le a \le 2^N$. We fill in between these points with a piecewise linear function. After some bookkeeping, the easiest way to write this down is as follows. First define some "tent" functions which make little peaks of unit height in the interval $\left[\frac{2k}{2^n}, \frac{2(k+1)}{2^n}\right]$:

$$T_{n,k}(t) = \begin{cases} 2^n t - 2k & t \in \left[\frac{2k}{2^n}, \frac{2k+1}{2^n}\right] \\ (2k+2) - 2^n t & t \in \left[\frac{2k+1}{2^n}, \frac{2k+2}{2^n}\right] \\ 0 & t \notin \left[\frac{2k}{2^n}, \frac{2(k+1)}{2^n}\right] \end{cases}$$

Notice that for every level $n$, the range $0 \le k \le 2^{n-1} - 1$ means there are $2^{n-1}$ tents, and notice that these tents have disjoint supports and unit height. Now, at every level of the construction we make sure that $B_t^N$ has the right value at points of the form $\frac{a}{2^N}$ by adding in the right tents, with heights distributed as scaled normal variables:

$$B_t^N = Z_0 t + \sum_{n=1}^{N} \sum_{k=0}^{2^{n-1}-1} \sqrt{\frac{1}{2^{n+1}}} Z_{n,k} T_{n,k}(t)$$

Explanation of this formula: the "$Z_0 t$" is the initial level-0 construction. The sum $1 \le n \le N$ runs over the $N$ levels of construction, and the sum $0 \le k \le 2^{n-1} - 1$ is over the $2^{n-1}$ tents that get added at the $n$-th level. Each tent has a height distributed like $\sqrt{\frac{1}{2^{n+1}}} Z \sim N(0, \frac{1}{2^{n+1}})$, where $Z \sim N(0,1)$ (this is the content of the proposition above!). For convenience, we label the infinite sequence of normal variables so that $Z_{n,k}$ controls the height of the $k$-th tent on the $n$-th level.

Finally we get the Brownian motion as $B_t = \lim_{N\to\infty} B_t^N$, which puts the Brownian motion on the same probability space as the infinite sequence of normal variables. To see that this is continuous, we show that the convergence is uniform almost surely. Since each $B_t^N$ is continuous, and a uniform limit of continuous functions is continuous, this gives that $B_t$ is continuous.

Proposition 2.2. The family of functions $B_t^N$ converges uniformly almost surely.

Proof. As you might suspect, the trick is to use the right summable sequence with a clever application of the Borel–Cantelli lemma. Let $H_n = \max_{t\in[0,1]} \left| \sum_{k=0}^{2^{n-1}-1} \sqrt{\frac{1}{2^{n+1}}} Z_{n,k} T_{n,k}(t) \right|$ be the maximum height contribution to $B_t$ at level $n$. Since the tent functions $T_{n,k}(t)$ have disjoint supports, this is $H_n = \sqrt{\frac{1}{2^{n+1}}} \max_{0 \le k \le 2^{n-1}-1} |Z_{n,k}|$. We now make the following estimate:

$$P\left(H_n > 2^{-\frac{n}{2}} \cdot 2\sqrt{n}\right) = P\left( \max_{0 \le k \le 2^{n-1}-1} |Z_{n,k}| > 2^{\frac{n+1}{2}} \cdot 2^{-\frac{n}{2}} \cdot 2\sqrt{n} \right) \le 2^{n-1} P(|Z| > 2\sqrt{n}) = 2^n P(Z > 2\sqrt{n})$$
$$= \frac{2^n}{\sqrt{2\pi}} \int_{2\sqrt{n}}^{\infty} \exp\left( -\frac{x^2}{2} \right) dx \le \frac{2^n}{\sqrt{2\pi}} \cdot \frac{1}{2\sqrt{n}} \exp\left( -\frac{(2\sqrt{n})^2}{2} \right) \text{ (this is Mill's ratio)} = C \cdot \frac{1}{\sqrt{n}} \cdot \left( \frac{2}{e^2} \right)^n$$

Which is a summable sequence! Hence, by the Borel–Cantelli lemma, this happens only finitely often almost surely. That is to say, for almost every $\omega \in \Omega$ we can find $N \in \mathbb{N}$ so that $H_n(\omega) \le 2^{-\frac{n}{2}} \cdot 2\sqrt{n}$ for all $n > N$. But then we have for all $q > p > N$ and any $t \in [0,1]$:

$$|B_t^p - B_t^q| = \left| \sum_{n=p+1}^{q} \sum_{k=0}^{2^{n-1}-1} \sqrt{\frac{1}{2^{n+1}}} Z_{n,k} T_{n,k}(t) \right| \le \sum_{n=p+1}^{q} H_n \le \sum_{n=p+1}^{q} 2^{-\frac{n}{2}} \cdot 2\sqrt{n} \le \sum_{n=N}^{\infty} 2^{-\frac{n}{2}} \cdot 2\sqrt{n}$$

But since $2^{-\frac{n}{2}} \cdot 2\sqrt{n}$ is summable, this tail can be made arbitrarily small, and we see then that $B_t^N$ is Cauchy in the uniform norm. Since this holds for almost every $\omega \in \Omega$, we indeed have uniform convergence almost surely. $\square$

Finally, to see that the limiting process is really what we want, we just verify that $E((B_t - B_s)^2) = |t - s|$, from which it's easy to check the properties we want. To see this, we use the density of the dyadic rationals in $[0,1]$. The construction above fixes points of the form $\frac{a}{2^n}$ at step $n$, that is to say $B(\frac{a}{2^n}) = B^n(\frac{a}{2^n})$. Hence for $t, s$ dyadic rationals, we have $E((B_t - B_s)^2) = E((B_t^n - B_s^n)^2) = |t - s|$ for $n$ large enough, which is easily checked from the construction above / the earlier proposition.

For arbitrary $t$ now, but $s$ still taken to be a dyadic rational, we take a sequence of dyadic rationals $t_n \to t$. We have then, using Fatou's lemma:

$$E((B_t - B_s)^2) = E\left( \lim_{n\to\infty} (B_{t_n} - B_s)^2 \right) \le \liminf_{n\to\infty} E((B_{t_n} - B_s)^2) = \lim_{n\to\infty} |t_n - s| = |t - s|$$

Now consider, for any $n \in \mathbb{N}$:

$$E((B_t - B_s)^2) = E\left( \left( (B_t - B_{t_n}) - (B_s - B_{t_n}) \right)^2 \right) = E((B_t - B_{t_n})^2) + E((B_s - B_{t_n})^2) - 2E((B_t - B_{t_n})(B_s - B_{t_n}))$$

Since this holds for any $n \in \mathbb{N}$, we get:

$$E((B_t - B_s)^2) = \lim_{n\to\infty} \left[ E((B_t - B_{t_n})^2) + E((B_s - B_{t_n})^2) - 2E((B_t - B_{t_n})(B_s - B_{t_n})) \right] = 0 + \lim_{n\to\infty} |t_n - s| + 0 = |t - s|$$

Where we have observed that the first and last limits are $0$ by using $E((B_t - B_s)^2) \le |t - s|$ in a clever way. First, $\lim_{n\to\infty} E((B_t - B_{t_n})^2) \le \lim_{n\to\infty} |t - t_n| = 0$, and secondly, with the help of Hölder's inequality:

$$\lim_{n\to\infty} \left| E((B_t - B_{t_n})(B_s - B_{t_n})) \right| \le \lim_{n\to\infty} \sqrt{E((B_t - B_{t_n})^2)} \sqrt{E((B_s - B_{t_n})^2)} \le \lim_{n\to\infty} \sqrt{|t - t_n|} \sqrt{|s - t_n|} = 0$$

Once we have $E((B_t - B_s)^2) = |t - s|$ for arbitrary $t$ and dyadic $s$, the same argument repeated again shows that $E((B_t - B_s)^2) = |t - s|$ when both $t$ and $s$ are arbitrary.
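The whole construction fits in a few lines of code. The sketch below is ours (function name and parameters are our choices): it builds the level-$N$ approximation at the dyadic points $k/2^N$ by midpoint displacement, which is equivalent to summing the tent series at those points; piecewise-linear interpolation between them gives $B_t^N$.

```python
import math
import random

def levy_bm(N, rng=random):
    """Level-N Levy approximation of Brownian motion on [0,1], returned as
    its values at the 2^N + 1 dyadic points k / 2^N."""
    pts = 2 ** N + 1
    B = [0.0] * pts
    B[-1] = rng.gauss(0, 1)     # B_1 = Z_0; B_0 = 0 already
    step = pts - 1               # current spacing between known points, in grid units
    for n in range(1, N + 1):
        half = step // 2
        for k in range(half, pts - 1, step):
            # midpoint = average of the two endpoints, plus an independent
            # N(0, 2^{-(n+1)}) correction (the proposition above)
            B[k] = 0.5 * (B[k - half] + B[k + half]) \
                   + math.sqrt(2 ** -(n + 1)) * rng.gauss(0, 1)
        step = half
    return B

random.seed(1)
path = levy_bm(10)   # 1025 values of the level-10 approximation
```

At level $n$ the gap between known points is $2^{-(n-1)}$, so the proposition's correction variance $(b-a)/4$ is exactly $2^{-(n+1)}$, matching the coefficient in the tent-series formula.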

3. Construction from Durret's Book

(I call this "Durret's construction" since I read it out of Durret's book: "Brownian Motion and Martingales in Analysis".)

The above construction is pretty elementary and gives all the desired properties. The following construction is a bit more technical; in particular, it uses a few extension results like the Caratheodory and Kolmogorov theorems. However, it gives immediately that the Brownian motion is not only continuous, but Hölder continuous for exponents $\gamma < \frac{1}{2}$. The extension theorems used here are gone over briefly in the appendix.

Definition 3.1. (Constructing Brownian Motion with the Kolmogorov Extension Theorem)

The Kolmogorov Extension Theorem gives us a quick way to define a measure on a space of functions. However, since the space of functions $\{f : T \to \mathbb{R}\}$ is so large, this theorem often gives us a very unwieldy space to work with, one in which we can't get our hands on the properties we want. The construction of Brownian motion below is a great example: constructing with the Kolmogorov theorem directly is bad, while if we take more care and construct on only countably many points, we get what we want. Let:

$$P_{t_1, t_2, \ldots, t_n}(A_1 \times A_2 \times \cdots \times A_n) = \int_{A_1} dx_1 \int_{A_2} dx_2 \cdots \int_{A_n} dx_n \prod_{k=1}^{n} p_{t_k - t_{k-1}}(x_{k-1}, x_k)$$

where $p_t(x, y) = \frac{1}{\sqrt{2\pi t}} \exp\left( -\frac{|y - x|^2}{2t} \right)$, with the conventions $t_0 = 0$, $x_0 = 0$. This is naively what you get as the distribution of $(B_{t_1}, B_{t_2}, \ldots, B_{t_n})$ if you use the Markov property and normal distribution of the Brownian motion. By Kolmogorov, we get a measure $P$ on the entire space of functions $\{f : [0,1] \to \mathbb{R}\}$. This defines the Brownian motion!

Proposition 3.2. With the above description of $P$, it will be impossible to see that the Brownian motion is almost surely continuous, because the set of continuous functions $C \subset \{f : [0,1] \to \mathbb{R}\}$ is not even measurable.

Proof. Suppose by contradiction $C$ is measurable. Then we can find a sequence $t_1, t_2, \ldots$ of times and Borel sets $B_1, B_2, \ldots$ so that $C = \{f : f(t_i) \in B_i \text{ for all } i\}$. (The proof of this fact comes by showing that sets determined by countably many coordinates form a sigma-algebra which contains the cylinder sets used to define the product sigma-algebra.) Take any continuous function $f$ now, and alter its value at a single point $t \notin \{t_1, t_2, \ldots\}$ to get a function $\hat{f}$ which agrees with $f$ at $\{t_1, t_2, \ldots\}$ but is not continuous. But then $\hat{f} \in C = \{f : f(t_i) \in B_i \text{ for all } i\}$, since it agrees with $f$ at $\{t_1, t_2, \ldots\}$: a contradiction. $\square$

This result means that this construction is not good enough. It is better to construct $B_t$ as follows:

Definition 3.3. (Constructing Brownian Motion with Uniform Continuity)

Step 1. (Define on dyadic rationals.) Let $P_{t_1, \ldots, t_n}$ be as above. Use the countable Kolmogorov Extension Theorem to get a measure $P$ on the set of functions $\Omega = \{f : [0,1] \cap D_2 \to \mathbb{R}\}$ from the dyadic rationals to $\mathbb{R}$.

Step 2. Check that functions in $\Omega$ are almost surely Hölder continuous, i.e. for almost all $f \in \Omega$: $|f(t) - f(s)| \le C|t - s|^{\gamma}$.

Step 3. Conclude that for almost every $f \in \Omega$ there is a unique way to extend $f$ to a continuous function $f : [0,1] \to \mathbb{R}$, since the dyadic rationals are dense in $[0,1]$.

Step 1 is pretty simple, but Step 2 requires some verification and is the real heart of the problem:

Proposition 3.4. Fix $\gamma < \frac{1}{2}$. For almost every $f \in \Omega$, there is a constant $C$ so that $|f(t) - f(s)| \le C|t - s|^{\gamma}$.

We first prove a lemma.

Lemma 3.5. Fix $\gamma < \frac{1}{2}$. Then there exists $\delta > 0$ so that for almost every $f \in \Omega$, there is an $N \in \mathbb{N}$ (which depends on $f$) so that for $n \ge N$ we have:

$$|f(x) - f(y)| \le |x - y|^{\gamma}$$

Whenever $x = i2^{-n}$, $y = j2^{-n}$ and $|x - y| \le 2^{-n(1-\delta)}$.

Proof. Take $m \in \mathbb{N}$ so large that $m > \frac{1}{1 - 2\gamma}$. We use the equality $E|f(t) - f(s)|^{2m} = C_m |t - s|^m$ with $C_m = E|f(1)|^{2m}$ (this follows from the property that $f(t) - f(s) \sim N(0, |t - s|)$). For any $n \in \mathbb{N}$ now, consider the following estimates:

$$P\left( |f(x) - f(y)| > |x - y|^{\gamma} \text{ for some } x = i2^{-n},\ y = j2^{-n} \text{ with } |x - y| \le 2^{-n(1-\delta)} \right) \le \sum |x - y|^{-2m\gamma}\, E|f(x) - f(y)|^{2m}$$

Where the sum on the right hand side is taken over all the possible $x, y$ that satisfy $|x - y| \le 2^{-n(1-\delta)}$ (there are finitely many, since we are restricting ourselves to dyadic rationals $x = i2^{-n}$, $y = j2^{-n}$). We have used the Chebyshev inequality $P(|X| > a) \le a^{-2m} E(|X|^{2m})$ here. Now, by the moment bound above, we have:

$$\text{LHS} \le C_m \sum |x - y|^{-2m\gamma + m} \le C_m\, 2^n\, 2^{n\delta} \left( 2^{-n(1-\delta)} \right)^{m - 2m\gamma} = C_m\, 2^{-n\left( -(1+\delta) + (1-\delta)(m - 2m\gamma) \right)}$$

The last bound comes in because $|x - y| \le 2^{-n(1-\delta)}$ for the $x, y$ in our sum, and there are at most $2^n$ choices for $x$ and roughly $2^{n\delta}$ choices for $y$ once $x$ has been fixed (remember, they are all $n$-th level dyadic rationals). Now, the term $\epsilon$ that appears in the exponent is:

$$\epsilon = -(1 + \delta) + (1 - \delta)(m - 2m\gamma)$$

Since $m$ is so large that $m - 2m\gamma = m(1 - 2\gamma) > 1$, we can choose $\delta$ so small that $\epsilon > 0$. We will then have:

$$\text{LHS} \le C_m 2^{-n\epsilon}$$

Which is a summable sequence! By the Borel–Cantelli lemma, it must be the case that for almost every $f \in \Omega$ the event here happens only finitely many times. This is exactly the statement of the lemma which we wanted to prove. $\square$

Proposition 3.6. Fix $\gamma < \frac{1}{2}$. For almost every $f \in \Omega$, there is a constant $C$ so that $|f(t) - f(s)| \le C|t - s|^{\gamma}$.

Proof. For almost every $f \in \Omega$, find $\delta > 0$ and $N \in \mathbb{N}$ as in the lemma. Take any $t, s \in D_2 \cap [0,1]$ with $t - s < 2^{-N(1-\delta)}$. Choose $m > N$ now so that $2^{-(m+1)(1-\delta)} \le t - s \le 2^{-m(1-\delta)}$. Write now $t = i2^{-m} - 2^{-q_1} - 2^{-q_2} - \cdots - 2^{-q_k}$ with $t > (i-1)2^{-m}$, and $s = j2^{-m} + 2^{-r_1} + \cdots + 2^{-r_l}$ with $s < (j+1)2^{-m}$, for some choice of $q$'s and $r$'s with $m < q_1 < \cdots < q_k$ and $m < r_1 < \cdots < r_l$. Since $t - s < 2^{-m(1-\delta)}$ and the correction terms are each smaller than $2^{-m}$, we have $i2^{-m} - j2^{-m} \le t - s + 2 \cdot 2^{-m}$, which is of the order $2^{-m(1-\delta)}$, so (adjusting constants if necessary) we can apply the result from the lemma to conclude:

$$|f(i2^{-m}) - f(j2^{-m})| \le \left( 2^{m\delta}\, 2^{-m} \right)^{\gamma} = 2^{-m(1-\delta)\gamma}$$

Now, we use the result of the lemma again many times to see that (using our clever rewriting of $t$):

$$|f(t) - f(i2^{-m})| \le |f(i2^{-m} - 2^{-q_1}) - f(i2^{-m})| + |f(i2^{-m} - 2^{-q_1} - 2^{-q_2}) - f(i2^{-m} - 2^{-q_1})| + \cdots \le |2^{-q_1}|^{\gamma} + \cdots + |2^{-q_k}|^{\gamma} \le \sum_{j=m+1}^{\infty} (2^{-j})^{\gamma} \le C 2^{-\gamma m}$$

Since $m < q_p$ for each $p$, and where we bounded the sum by the full geometric series. We similarly get a bound on $|f(s) - f(j2^{-m})|$. Finally then:

$$|f(t) - f(s)| \le C2^{-\gamma m(1-\delta)} + C2^{-\gamma m} + C2^{-\gamma m} \le C'2^{-\gamma m(1-\delta)} = C'2^{\gamma(1-\delta)} \left( 2^{-(m+1)(1-\delta)} \right)^{\gamma} \le C'2^{\gamma(1-\delta)} |t - s|^{\gamma}$$

By the choice of $m$ so that $2^{-(m+1)(1-\delta)} \le t - s$. $\square$

So from here we see that the Brownian motion is almost surely Hölder continuous for exponents $\gamma < \frac{1}{2}$. This lets us find a unique extension of $f(t)$ from the dyadic rationals to all of $[0,1]$ which is not only continuous, but moreover Hölder continuous for exponents $\gamma < \frac{1}{2}$, which is a stronger result than our first construction gave. For ease of notation, we will now change notation a little bit: we will refer to $\omega \in \Omega$ instead of $f$, and we have a family of random variables $B_t(\omega) = \omega(t)$. What we have just proven is that for fixed $\omega$, the map $t \to B_t(\omega)$ is indeed a Hölder continuous path for exponents $\gamma < \frac{1}{2}$.

4. Some Properties

The following slick result shows that the Brownian motion is nowhere Hölder continuous for $\gamma > \frac{1}{2}$, which in particular shows that it is nowhere differentiable.

Proposition 4.1. For $\gamma > \frac{1}{2}$, the set of functions which are Hölder continuous with exponent $\gamma$ at some point is a null set. In other words, the Brownian motion is almost surely nowhere Hölder continuous for exponents $\gamma > \frac{1}{2}$.

Proof. Fix $\gamma > \frac{1}{2}$ and $C \in \mathbb{R}$. Choose $m \in \mathbb{N}$ so large that $\gamma > \frac{m+1}{2m}$. Define the events, starting at $n > m$:

$$A_n = \left\{ \omega : \exists s \in [0,1] \text{ such that } |B_t - B_s| \le C|t - s|^{\gamma}\ \forall t \in \left[ s - \tfrac{m}{n},\ s + \tfrac{m}{n} \right] \right\}$$

Define the random variables:

$$Y_{n,k}(\omega) = \max_{j=1,\ldots,2m} \left| B\left( \frac{k+j}{n} \right) - B\left( \frac{k+j-1}{n} \right) \right|$$

And finally, the events:

$$B_n = \left\{ \text{at least one of the } Y_{n,k} \le 2C\left( \frac{m}{n} \right)^{\gamma} \right\}$$

We now claim that $A_n \subset B_n$: for $\omega \in A_n$, we find an $s$ so that $|B_t - B_s| \le C|t - s|^{\gamma}$ for all $t \in [s - \frac{m}{n}, s + \frac{m}{n}]$; in particular, $|B_t - B_s| \le C(\frac{m}{n})^{\gamma}$ on this interval. By the pigeonhole principle, inside this interval we can find $k$ so that $\{\frac{k}{n}, \frac{k+1}{n}, \frac{k+2}{n}, \ldots, \frac{k+2m}{n}\} \subset [s - \frac{m}{n}, s + \frac{m}{n}]$. But then, for this $k$, we have:

$$Y_{n,k}(\omega) = \max_{j=1,\ldots,2m} \left| B\left( \frac{k+j}{n} \right) - B\left( \frac{k+j-1}{n} \right) \right| \le \max_{j} \left( \left| B\left( \frac{k+j}{n} \right) - B(s) \right| + \left| B(s) - B\left( \frac{k+j-1}{n} \right) \right| \right) \le 2C\left( \frac{m}{n} \right)^{\gamma}$$

So $\omega \in B_n$ by definition. Now consider that:

$$P(A_n) \le P(B_n) \le \sum_{k=0}^{n-m} P\left( Y_{n,k} \le 2C\left( \tfrac{m}{n} \right)^{\gamma} \right) \le \sum_{k=0}^{n-m} P\left( \left| B_{\frac{k+j}{n}} - B_{\frac{k+j-1}{n}} \right| \le 2C\left( \tfrac{m}{n} \right)^{\gamma} \text{ for each } j = 1, \ldots, 2m \right)$$
$$\le n\, P\left( |B_{\frac{1}{n}} - B_0| \le 2C\left( \tfrac{m}{n} \right)^{\gamma} \right)^{2m} = n\, P\left( |B_1 - B_0| \le 2C\left( \tfrac{m}{n} \right)^{\gamma} \sqrt{n} \right)^{2m} \le n \left( \frac{2}{\sqrt{2\pi}} \cdot 2C\left( \tfrac{m}{n} \right)^{\gamma} \sqrt{n} \right)^{2m} = D\, n^{\left( \frac{1}{2} - \gamma \right) 2m + 1} = D\, n^{m+1-2m\gamma} \to 0$$

Where we used the independence of the Brownian motion's increments over disjoint intervals, the scaling relation $P(B_t \le a) = P(B_{ct} \le \sqrt{c}\, a)$, and the easy inequality $P(|N(0,1)| \le \lambda) \le \frac{2\lambda}{\sqrt{2\pi}}$, which comes from bounding the p.d.f. by its maximum. Finally, by the choice of $m$ so that $\gamma > \frac{m+1}{2m}$, we know that $m + 1 - 2m\gamma < 0$, so this probability does indeed go to zero. But then, as the events $A_n$ are increasing, this means that the $A_n$ are all zero probability events, which is the result we wanted. $\square$

CHAPTER 5

Appendix

1. Conditional Random Variables

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X, Y : \Omega \to \mathbb{R}$ random variables. $\mathcal{B}$ is the Borel sigma algebra of $\mathbb{R}$.

Definition 1.1. We define $\sigma(X) \subset \mathcal{F}$ to be the sigma-algebra generated by the preimages of Borel sets through $X$. That is: $\sigma(X) = \sigma(\{X^{-1}(B) : B \in \mathcal{B}\})$.

Remark. The sub-algebra $\sigma(X)$ is coarser than all of $\mathcal{F}$. Intuitively, the random variable $X$ can only "detect" events up to sets in $\sigma(X)$.

Definition 1.2. Let $\Sigma \subset \mathcal{F}$ be a sub-sigma-algebra of $\mathcal{F}$. We say a random variable $X : \Omega \to \mathbb{R}$ is $\Sigma$-measurable if $X^{-1}(B) \in \Sigma$ for all $B \in \mathcal{B}$; equivalently, if $\sigma(X) \subset \Sigma$.

Example 1.3. Every random variable is always $\mathcal{F}$ measurable, since $\sigma(X) \subset \mathcal{F}$.

Definition 1.4. Given $X$ and $Y$, we can define a new random variable $Z = E(Y \mid X)$ to be the unique random variable with the following two properties:
1. $Z$ is $\sigma(X)$ measurable.
2. For any $B \in \mathcal{B}$ we have $E(Z 1_{X \in B}) = E(Y 1_{X \in B})$.

Remark. The existence of this random variable is proven via the Radon–Nikodym theorem, applied with everything restricted to the sigma field $\sigma(X)$.

Remark. There is no problem with picking any sub-sigma-algebra $\Sigma \subset \mathcal{F}$ instead of $\sigma(X)$. The second condition is then simply that for any $S \in \Sigma$ we have $E(Z 1_S) = E(Y 1_S)$, which recovers the condition above when $\Sigma = \sigma(X)$.

Remark. $Z = E(Y|X)$ is a random variable $Z : \Omega \to \mathbb{R}$, but it is often thought of as a function $Z : \mathbb{R} \to \mathbb{R}$ whose input is the random variable $X$. This works because $Z$ is $\sigma(X)$-measurable. The following two little results clear this up a bit:

Proposition 1.5. If $f : \mathbb{R} \to \mathbb{R}$ is measurable and $Z : \Omega \to \mathbb{R}$ is $\Sigma$-measurable, then the random variable $f \circ Z$ is $\Sigma$-measurable too.

Proof. For any $B \in \mathcal{B}$ we have $(f \circ Z)^{-1}(B) = Z^{-1}(f^{-1}(B)) \in \Sigma$, since $f^{-1}(B) \in \mathcal{B}$ as $f$ is measurable and $Z$ is $\Sigma$-measurable. $\square$

Proposition 1.6. If $Z$ is a $\sigma(X)$-measurable random variable, then we may think of $Z$ as a function $Z : \mathbb{R} \to \mathbb{R}$ whose input is $X$.


Proof. Define $\tilde{Z} : \mathbb{R} \to \mathbb{R}$ by $\tilde{Z}(x) = Z(\omega)$ for any representative $\omega \in X^{-1}(\{x\})$. We must justify why this value is independent of the choice of $\omega \in X^{-1}(\{x\})$. Indeed, for $\omega_1, \omega_2 \in X^{-1}(\{x\})$, let $z = Z(\omega_1)$. Since $Z$ is $\sigma(X)$-measurable, we have that:

$$Z^{-1}(\{z\}) \in \sigma(X) \;\Rightarrow\; Z^{-1}(\{z\}) = X^{-1}(B) \text{ for some } B \in \mathcal{B}$$

But then $\omega_1 \in Z^{-1}(\{z\}) = X^{-1}(B)$, so that $X(\omega_1) \in B$. Since $X(\omega_1) = X(\omega_2) = x$, we then have $\omega_2 \in X^{-1}(B) = Z^{-1}(\{z\})$, which means that $Z(\omega_1) = Z(\omega_2) = z$, as desired. Hence $\tilde{Z}$ is well defined! With this definition of $\tilde{Z}$, we see that $Z = \tilde{Z} \circ X$. We often conflate $Z$ with $\tilde{Z}$ in practice. $\square$
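The factorization in Proposition 1.6 is easy to see on a finite space: a $Z$ that is constant on the level sets of $X$ factors as $Z = \tilde{Z} \circ X$. A toy sketch (all names here are hypothetical):

```python
# Finite sketch of Proposition 1.6. Omega has four points; X takes two
# values, and Z is constant on each level set {X = x}.
omega = ["a", "b", "c", "d"]
X = {"a": 0, "b": 0, "c": 1, "d": 1}
Z = {"a": 10, "b": 10, "c": 20, "d": 20}

def factor_through(Z, X):
    # Build Z_tilde(x) = Z(omega) for a representative omega in X^{-1}({x}),
    # checking well-definedness: Z must agree on each level set of X.
    Z_tilde = {}
    for w, x in X.items():
        if x in Z_tilde:
            assert Z_tilde[x] == Z[w], "Z is not sigma(X)-measurable"
        else:
            Z_tilde[x] = Z[w]
    return Z_tilde

Z_tilde = factor_through(Z, X)
assert all(Z[w] == Z_tilde[X[w]] for w in omega)   # Z = Z_tilde o X
```

If $Z$ took different values within one level set of $X$, the well-definedness check would fail, mirroring the argument in the proof.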

2. Extension Theorems

Theorem 2.1. [Caratheodory Extension Theorem] Fix some $(\Omega, \mathcal{A}, P_0)$, where $\Omega$ is a set, $\mathcal{A}$ is an algebra of sets (a.k.a. a field of sets), and $P_0$ is a finitely additive probability measure on $\mathcal{A}$. Suppose we have the additional property:

For sequences of sets $A_1, A_2, \ldots \in \mathcal{A}$ which are pairwise disjoint with the property that $\cup A_n \in \mathcal{A}$ too, we necessarily have $P_0(\cup A_n) = \sum P_0(A_n)$.

Then there is a unique extension to a probability space $(\Omega, \sigma(\mathcal{A}), P)$ so that $P$ and $P_0$ agree on $\mathcal{A}$.

Proof. [sketch] The idea is exactly the same as the construction of the Lebesgue measure on $[0,1]$ from the premeasure generated by $\mu((a,b)) = b - a$ on the algebra of open sets. Define an outer measure:
$$P(E) := \inf_{E \subset \cup A_n} \sum P_0(A_n)$$
From here you check that $P$ is indeed a probability measure. Countable subadditivity and monotonicity are easy. To get that $P(A) = P_0(A)$ for $A \in \mathcal{A}$ requires the special property we are given above. Once this is done, you can define measurable sets a la Caratheodory: $E$ is measurable iff for all $A \in \mathcal{A}$ we have $P(A) = P(A \cap E) + P(A \cap E^c)$. Then you verify that $\sigma(\mathcal{A})$ is a subset of these measurable sets, and declare the restriction of $P$ to $\sigma(\mathcal{A})$ to be the desired measure. $\square$

Remark. The above condition needed in the theorem can be replaced with "continuity from above at $\emptyset$": for $A_1, A_2, \ldots \in \mathcal{A}$ which decrease down to $\emptyset$, we necessarily have $P_0(A_n) \to 0$. The equivalence of these two conditions is not too difficult. The first condition is more intuitive, while this second condition is sometimes easier to verify in practice.
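One direction of this equivalence can be sketched in a couple of lines, using the notation of Theorem 2.1:

```latex
% Continuity from above at \emptyset implies countable additivity on \mathcal{A}.
% Let A_1, A_2, \ldots \in \mathcal{A} be pairwise disjoint with
% A := \cup_n A_n \in \mathcal{A}, and set B_N := A \setminus (A_1 \cup \cdots \cup A_N).
% Each B_N \in \mathcal{A} since \mathcal{A} is an algebra, and B_N \downarrow \emptyset,
% so P_0(B_N) \to 0. Finite additivity then gives:
P_0(A) \;=\; \sum_{n=1}^{N} P_0(A_n) \;+\; P_0(B_N)
\;\xrightarrow[N \to \infty]{}\; \sum_{n=1}^{\infty} P_0(A_n)
```

The converse direction is similar: apply countable additivity to the disjoint differences $A_n \setminus A_{n+1}$.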

Theorem 2.2. [Countable Kolmogorov Extension Theorem] Suppose for every $n \ge 1$ we have a probability measure $P_n$ on $\mathbb{R}^n$. Suppose also that these probability measures satisfy the following consistency condition for every Borel set $E \subset \mathbb{R}^n$:
$$P_{n+k}(E \times \mathbb{R}^k) = P_n(E)$$
Then there exists a unique measure $P$ on the infinite product space $\mathbb{R}^{\infty}$ of sequences, so that for every Borel set $E \subset \mathbb{R}^n$, $P(E \times \mathbb{R} \times \mathbb{R} \times \ldots) = P_n(E)$.

Proof. [sketch] Take $\Omega = \mathbb{R}^{\infty}$ to be the space of real-valued sequences. Define the field of cylinder sets to be:
$$\mathcal{A} = \{ E \times \mathbb{R} \times \mathbb{R} \times \ldots : E \subset \mathbb{R}^n \text{ Borel}, \; n \ge 1 \}$$
with finitely additive measure $P_0(E \times \mathbb{R} \times \mathbb{R} \times \ldots) := P_n(E)$. The given condition on the $P_n$'s shows this is well defined. To see continuity from above at $\emptyset$, notice that if $A_k \downarrow \emptyset$, then we must have $A_k = E_k \times \mathbb{R} \times \mathbb{R} \times \ldots$ for some Borel sets $E_k \subset \mathbb{R}^n$ with $E_k \downarrow \emptyset$. But then of course, since $P_n$ is a probability measure, we have $P_0(A_k) = P_n(E_k) \to 0$. By an application of the Caratheodory extension theorem, we get the desired measure! $\square$

Theorem 2.3. [Kolmogorov Extension Theorem] Let $T$ be any interval $T \subset \mathbb{R}$. Suppose we have a family of probability measures $P_{t_1, t_2, \ldots, t_n}$ on $\mathbb{R}^n$ whenever $t_1, t_2, \ldots, t_n$ is a finite collection of points in $T$. Suppose also that these probability measures satisfy the following consistency condition:
$$P_{t_1, \ldots, t_n, \hat{t}_1, \ldots, \hat{t}_m}(E \times \mathbb{R}^m) = P_{t_1, \ldots, t_n}(E)$$
Then there exists a unique measure $P$ on the set of functions $\{f : T \to \mathbb{R}\}$ so that:

$$P\left( \{ f : (f(t_1), f(t_2), \ldots, f(t_n)) \in E \} \right) = P_{t_1, \ldots, t_n}(E)$$
Remark. This is very similar to the countable version, but requires some more work to carry out. However, since the space of functions $\{f : T \to \mathbb{R}\}$ is so large, this theorem often gives us a very unwieldy space to work with, one in which we can't get our hands on the properties we want. The construction of Brownian motion below is a great example: constructing with the uncountable Kolmogorov theorem is bad, while constructing with the countable one is good.
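The consistency condition in Theorem 2.2 can be illustrated on a finite stand-in for $\mathbb{R}^n$: the laws of $n$ independent fair coin flips on $\{0,1\}^n$ are consistent in exactly this sense. A toy sketch (the setup is mine, chosen only to make the condition checkable):

```python
from itertools import product

def P_n(E, n):
    # Probability of a set E of n-tuples under n independent fair coin
    # flips: count the tuples in E and divide by 2^n.
    return sum(1 for x in product([0, 1], repeat=n) if x in E) / 2**n

# A set E in the 2-flip space, standing in for a Borel set E in R^2.
E = {(0, 0), (1, 0), (1, 1)}

# The cylinder E x {0,1}, standing in for E x R.
E_cyl = {(a, b, c) for (a, b) in E for c in [0, 1]}

# Consistency: P_{n+k}(E x R^k) = P_n(E), here with n = 2, k = 1.
assert P_n(E_cyl, 3) == P_n(E, 2)
```

It is exactly this compatibility across dimensions that lets the cylinder-set premeasure $P_0$ in the proof be well defined.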