Probability Theory

Matthias Löwe

Academic Year 2001/2002

Contents

1 Introduction

2 Basics, Random Variables

3 Expectation, Moments, and Jensen's inequality

4 Convergence of random variables

5 Independence

6 Products and Sums of Independent Random Variables

7 Infinite product spaces

8 Zero-One Laws

9 Laws of Large Numbers

10 The Central Limit Theorem

11 Conditional Expectation

12 Martingales

13 Brownian motion
  13.1 Construction of Brownian Motion

14 Appendix

1 Introduction

Chance, luck, and fortune have been at the centre of the human mind ever since people started to think. At the latest when people began to play cards or roll dice for money, there was also a desire for a mathematical description and understanding of chance. Experience tells us that even in a situation that is governed purely by chance there seems to be some regularity. For example, in a long sequence of fair coin tosses we will typically see heads "about half of the time" and tails "about half of the time". This was formulated already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not much later the French mathematician de Moivre analyzed how much the typical number of heads in a series of $n$ fair coin tosses fluctuates around $\frac{n}{2}$. He thereby discovered the first form of what nowadays has become known as the Central Limit Theorem.

However, a mathematically "clean" description of these results could not be given by either of the two authors. The problem with such a description is that, of course, the average number of heads in a sequence of fair coin tosses will only typically converge to $\frac{1}{2}$. One could, e.g., imagine that we are rather unlucky and toss only heads. So the principal question was: "what is the probability of an event?" The standard idea in the early days was to define it as the limiting relative frequency of the event in a long row of typical experiments. But here we are back at the original problem of what a typical sequence is. It is natural that, even if we can define the concept of typical sequences, they are very difficult to work with. This is why Hilbert, in his famous talk at the International Congress of Mathematicians 1900 in Paris, mentioned the axiomatic foundation of probability theory as the sixth of his 23 open problems in mathematics. This problem was solved by A. N. Kolmogorov in 1933 by ingeniously making use of the newly developed field of measure theory: a probability is understood as a measure on the space of all outcomes of the random experiment. This measure is chosen in such a way that it has total mass one.

From this starting point the whole framework of probability theory was developed: from the laws of large numbers over the Central Limit Theorem to the very new field of mathematical finance (and many, many others). In this course we will meet the most important highlights of probability theory and then turn in the direction of stochastic processes, which are basic for mathematical finance.

2 Basics, Random Variables

As mentioned in the introduction, Kolmogorov's concept understands a probability on a space of outcomes as a measure with total mass one on this space. So in the framework of probability theory we will always consider a triple $(\Omega, \mathcal{F}, P)$ and call it a probability space:

Definition 2.1 A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

• $\Omega$ is a set; $\omega \in \Omega$ is called a state of the world.

• $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$; $A \in \mathcal{F}$ is called an event.

• $P$ is a measure on $\mathcal{F}$ with $P(\Omega) = 1$; $P(A)$ is called the probability of the event $A$.

A probability space can be considered as an experiment we perform. Of course, in an experiment in physics one is not always interested in measuring everything one could in principle measure. Such a measurement in probability theory is called a random variable.

Definition 2.2 A random variable $X$ is a mapping

$$X : \Omega \to \mathbb{R}^d$$

that is measurable. Here we endow $\mathbb{R}^d$ with its Borel $\sigma$-algebra $\mathcal{B}^d$.

The important fact about random variables is that the underlying probability space $(\Omega, \mathcal{F}, P)$ does not really matter. For example, consider the two experiments

$$\Omega_1 = \{0, 1\}, \quad \mathcal{F}_1 = \mathcal{P}(\Omega_1), \quad P_1(\{0\}) = P_1(\{1\}) = \frac{1}{2}$$

and

$$\Omega_2 = \{1, 2, \ldots, 6\}, \quad \mathcal{F}_2 = \mathcal{P}(\Omega_2), \quad P_2(\{i\}) = \frac{1}{6}, \ i \in \Omega_2.$$

Consider the random variables

$$X_1 : \Omega_1 \to \mathbb{R}, \quad \omega \mapsto \omega$$

and

$$X_2 : \Omega_2 \to \mathbb{R}, \quad i \mapsto \begin{cases} 0 & i \text{ is even} \\ 1 & i \text{ is odd.} \end{cases}$$

Then

$$P_1(X_1 = 0) = P_2(X_2 = 0) = \frac{1}{2}$$

and therefore $X_1$ and $X_2$ have the same behavior even though they are defined on completely different spaces. What we learn from this example is that what really matters for a random variable is the "distribution" $P \circ X^{-1}$:

Definition 2.3 The distribution $P_X$ of a random variable $X : \Omega \to \mathbb{R}^d$ is the following probability measure on $(\mathbb{R}^d, \mathcal{B}^d)$:

$$P_X(A) := P(X \in A) = P\left(X^{-1}(A)\right), \quad A \in \mathcal{B}^d.$$

So the distribution of a random variable is its image measure in $\mathbb{R}^d$.
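This invariance can be seen in a quick simulation; the sketch below (plain Python, with illustrative variable names of our choosing) builds both experiments and compares the empirical distributions of $X_1$ and $X_2$:

```python
import random
from collections import Counter

random.seed(0)
N = 100_000

# Experiment 1: Omega_1 = {0, 1} with the uniform measure; X_1(omega) = omega.
x1 = [random.choice([0, 1]) for _ in range(N)]

# Experiment 2: Omega_2 = {1, ..., 6} with the uniform measure;
# X_2(i) = 0 if i is even, 1 if i is odd.
x2 = [random.randint(1, 6) % 2 for _ in range(N)]

# Empirical distributions: both should put mass ~1/2 on each of {0, 1}.
p1 = Counter(x1)[0] / N
p2 = Counter(x2)[0] / N
print(p1, p2)
```

Despite living on different sample spaces, the two empirical laws agree up to simulation error, which is exactly the information that Definition 2.3 records about a random variable.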

Example 2.4 Important distributions of random variables $X$ (which we have already met in introductory courses in probability and statistics) are:

• The Binomial distribution with parameters $n$ and $p$, i.e. a random variable $X$ is Binomially distributed with parameters $n$ and $p$ ($B(n, p)$-distributed), if

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad 0 \le k \le n.$$

The binomial distribution is the distribution of the number of 1's in $n$ independent coin tosses with success probability $p$.

• The Normal distribution with parameters $\mu$ and $\sigma^2$, i.e. a random variable $X$ is normally distributed with parameters $\mu$ and $\sigma^2$ ($\mathcal{N}(\mu, \sigma^2)$-distributed), if

$$P(X \le a) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{a} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \, dx.$$

Here $a \in \mathbb{R}$.

• The Dirac distribution with atom in $a \in \mathbb{R}$, i.e. a random variable $X$ is Dirac distributed with parameter $a$, if

$$P(X = b) = \delta_a(b) = \begin{cases} 1 & \text{if } b = a \\ 0 & \text{otherwise.} \end{cases}$$

Here $b \in \mathbb{R}$.

• The Poisson distribution with parameter $\lambda > 0$, i.e. a random variable $X$ is Poisson distributed with parameter $\lambda$ ($\mathcal{P}(\lambda)$-distributed), if

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \mathbb{N}_0 = \mathbb{N} \cup \{0\}.$$

• The Multivariate Normal distribution with parameters $\mu \in \mathbb{R}^d$ and $\Sigma$, i.e. a random variable $X$ is normally distributed in dimension $d$ with parameters $\mu$ and $\Sigma$ ($\mathcal{N}(\mu, \Sigma)$-distributed), if $\mu \in \mathbb{R}^d$, $\Sigma$ is a symmetric, positive definite $d \times d$ matrix, and for

$$A = (-\infty, a_1] \times \ldots \times (-\infty, a_d]$$

it holds

$$P(X \in A) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \int_{-\infty}^{a_1} \!\cdots\! \int_{-\infty}^{a_d} \exp\left(-\frac{1}{2} \left\langle \Sigma^{-1}(x - \mu), (x - \mu) \right\rangle\right) dx_1 \ldots dx_d.$$

3 Expectation, Moments, and Jensen's inequality

We are now going to consider important characteristics of random variables.

Definition 3.1 The expectation of a random variable is defined as

$$E(X) := E_P(X) := \int X \, dP = \int X(\omega) \, dP(\omega)$$

if this integral is well defined.

Notice that for $A \in \mathcal{F}$ one has $E(1_A) = \int 1_A \, dP = P(A)$. Quite often one may want to integrate $f(X)$ for a function $f : \mathbb{R}^d \to \mathbb{R}^m$. How does this work?

Proposition 3.2 Let $X : \Omega \to \mathbb{R}^d$ be a random variable and $f : \mathbb{R}^d \to \mathbb{R}$ be a measurable function. Then

$$\int f \circ X \, dP = E[f \circ X] = \int f \, dP_X. \qquad (3.1)$$

Proof. If $f = 1_A$, $A \in \mathcal{B}^d$, we have

$$E[f \circ X] = E(1_A \circ X) = P(X \in A) = P_X(A) = \int 1_A \, dP_X.$$

Hence (3.1) holds true for simple functions $f = \sum_{i=1}^{n} \alpha_i 1_{A_i}$, $\alpha_i \in \mathbb{R}$, $A_i \in \mathcal{B}^d$. The standard approximation techniques yield (3.1) for general integrable $f$.

In particular, Proposition 3.2 yields that

$$E[X] = \int x \, dP_X(x).$$

Exercise 3.3 If $X : \Omega \to \mathbb{N}_0$, then

$$EX = \sum_{n=0}^{\infty} n P(X = n) = \sum_{n=1}^{\infty} P(X \ge n).$$

Exercise 3.4 If $X : \Omega \to \mathbb{R}$, then

$$\sum_{n=1}^{\infty} P(|X| \ge n) \le E(|X|) \le \sum_{n=0}^{\infty} P(|X| \ge n).$$

Thus $X$ has an expectation if and only if $\sum_{n=0}^{\infty} P(|X| \ge n)$ converges.

Further characteristics of random variables are the $p$-th moments.

Definition 3.5 1. For $p \ge 1$, the $p$-th moment of a random variable is defined as $E(X^p)$.

2. The centered $p$-th moment of a random variable is defined as $E[(X - EX)^p]$.

3. The variance of a random variable $X$ is its centered second moment, hence

$$V(X) := E\left[(X - EX)^2\right].$$

Its standard deviation $\sigma$ is defined as

$$\sigma := \sqrt{V(X)}.$$

Proposition 3.6 $V(X) < \infty$ if and only if $X \in \mathcal{L}^2(P)$. In this case

$$V(X) = E\left(X^2\right) - (E(X))^2 \qquad (3.2)$$

as well as

$$V(X) \le E\left(X^2\right) \qquad (3.3)$$

and

$$(EX)^2 \le E\left(X^2\right). \qquad (3.4)$$

Proof. If $V(X) < \infty$, then $X - EX \in \mathcal{L}^2(P)$. But $\mathcal{L}^2(P)$ is a vector space and the constant $EX \in \mathcal{L}^2(P)$, hence $X = (X - EX) + EX \in \mathcal{L}^2(P)$. On the other hand, if $X \in \mathcal{L}^2(P)$ then $X \in \mathcal{L}^1(P)$. Hence $EX$ exists and is a constant. Therefore also $X - EX \in \mathcal{L}^2(P)$. By linearity of expectation then:

$$V(X) = E(X - EX)^2 = EX^2 - 2(EX)^2 + (EX)^2 = EX^2 - (EX)^2.$$

This immediately implies (3.3) and (3.4).

Exercise 3.7 Show that $X$ and $X - EX$ have the same variance.

It will turn out in the next step that (3.4) is a special case of a much more general principle. To this end recall the concept of a convex function. Recall that a function $\varphi : \mathbb{R} \to \mathbb{R}$ is convex if for all $\alpha \in (0, 1)$ we have

$$\varphi(\alpha x + (1 - \alpha) y) \le \alpha \varphi(x) + (1 - \alpha) \varphi(y)$$

for all $x, y \in \mathbb{R}$. In a first course in analysis one learns that convexity is implied by $\varphi'' \ge 0$. On the other hand, convex functions do not need to be differentiable. The following exercise shows that they are close to being differentiable.

Exercise 3.8 Let $I$ be an interval and $\varphi : I \to \mathbb{R}$ be a convex function. Show that the right derivative $\varphi'_+(x)$ exists for all $x \in \mathring{I}$ (the interior of $I$) and the left derivative $\varphi'_-(x)$ exists for all $x \in \mathring{I}$. Hence $\varphi$ is continuous on $\mathring{I}$. Show moreover that $\varphi'_+$ is monotonically increasing on $\mathring{I}$ and that

$$\varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x)$$

holds for $x \in \mathring{I}$, $y \in I$.

Applying Exercise 3.8 yields the generalization of (3.4) mentioned above.

Theorem 3.9 (Jensen's inequality) Let $X : \Omega \to \mathbb{R}$ be a random variable on $(\Omega, \mathcal{F}, P)$ and assume $X$ is $P$-integrable and takes values in an open interval $I \subseteq \mathbb{R}$. Then $EX \in I$, and for every convex $\varphi : I \to \mathbb{R}$, $\varphi \circ X$ is a random variable. If this random variable $\varphi \circ X$ is $P$-integrable it holds:

$$\varphi(E(X)) \le E(\varphi \circ X).$$

Proof. Assume $I = (\alpha, \beta)$. Thus $X(\omega) < \beta$ for all $\omega \in \Omega$. But then $E(X) \le \beta$, and even $E(X) < \beta$. Indeed, $E(X) = \beta$ would imply that the strictly positive random variable $\beta - X$ equals $0$ on a set of $P$-measure one, i.e. $P$-almost surely. This is a contradiction. Analogously $EX > \alpha$. According to Exercise 3.8, $\varphi$ is continuous on $\mathring{I} = I$, hence Borel-measurable. Now we know

$$\varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x) \qquad (3.5)$$

for all $x, y \in I$, with equality for $x = y$. Hence

$$\varphi(y) = \sup_{x \in I} \left[ \varphi(x) + \varphi'_+(x)(y - x) \right] \qquad (3.6)$$

for all $y \in I$. Putting $y = X(\omega)$ in (3.5) yields

$$\varphi \circ X \ge \varphi(x) + \varphi'_+(x)(X - x)$$

and by integration

$$E(\varphi \circ X) \ge \varphi(x) + \varphi'_+(x)(E(X) - x)$$

for all $x \in I$. Together with (3.6) this gives

$$E(\varphi \circ X) \ge \sup_{x \in I} \left[ \varphi(x) + \varphi'_+(x)(E(X) - x) \right] = \varphi(E(X)).$$

This is the assertion of Jensen's inequality.
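A numerical sanity check of Jensen's inequality (an illustrative sketch; the convex function and the distribution are arbitrary choices of ours):

```python
import random

random.seed(2)
xs = [random.uniform(0.5, 4.0) for _ in range(100_000)]

def phi(x):
    return x * x      # a convex function on R

mean_x   = sum(xs) / len(xs)
mean_phi = sum(phi(x) for x in xs) / len(xs)

print(phi(mean_x), mean_phi)  # phi(E X) should not exceed E phi(X)
```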

Corollary 3.10 Let $X \in \mathcal{L}^p(P)$ for some $p \ge 1$. Then

$$|E(X)|^p \le E(|X|^p).$$

Exercise 3.11 Let $I$ be an open interval and $\varphi : I \to \mathbb{R}$ be convex. For $x_1, \ldots, x_n \in I$ and $\lambda_1, \ldots, \lambda_n \in \mathbb{R}_+$ with $\sum_{i=1}^{n} \lambda_i = 1$, show that

$$\varphi\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i \varphi(x_i).$$

4 Convergence of random variables

Already in the course in measure theory we met three different types of convergence:

Definition 4.1 Let $(X_n)$ be a sequence of random variables.

1. $X_n$ is stochastically convergent (or convergent in probability) to a random variable $X$, if for each $\varepsilon > 0$

$$\lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0.$$

2. $X_n$ converges almost surely to a random variable $X$, if for each $\varepsilon > 0$

$$P\left(\limsup_{n \to \infty} |X_n - X| \ge \varepsilon\right) = 0.$$

3. $X_n$ converges to a random variable $X$ in $\mathcal{L}^p$ or in $p$-norm, if

$$\lim_{n \to \infty} E(|X_n - X|^p) = 0.$$
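Stochastic convergence in part 1 can be illustrated with the running mean of fair coin tosses, where $X_n = S_n/n \to 1/2$ in probability (a simulation sketch with parameters of our choosing):

```python
import random

random.seed(3)

def prob_deviation(n, eps=0.1, trials=1000):
    """Empirical estimate of P(|S_n / n - 1/2| >= eps) for n fair coin tosses."""
    hits = 0
    for _ in range(trials):
        s = sum(random.randint(0, 1) for _ in range(n))
        if abs(s / n - 0.5) >= eps:
            hits += 1
    return hits / trials

probs = [prob_deviation(n) for n in (10, 100, 1000)]
print(probs)  # the deviation probabilities shrink as n grows
```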

Already in measure theory we proved:

Theorem 4.2 1. If $X_n$ converges to $X$ almost surely or in $\mathcal{L}^p$, then it also converges stochastically. Neither of the converses is true.

2. Almost sure convergence does not imply convergence in $\mathcal{L}^p$ and vice versa.

Definition 4.3 Let $(\Omega, \mathcal{F})$ be a topological space endowed with its Borel $\sigma$-algebra $\mathcal{F}$. This means $\mathcal{F}$ is generated by the topology on $\Omega$. Moreover, for each $n \in \mathbb{N}$ let $\mu_n$ and $\mu$ be probability measures on $(\Omega, \mathcal{F})$. We say that $\mu_n$ converges weakly to $\mu$, if for each bounded, continuous, real-valued function $f : \Omega \to \mathbb{R}$ (we will write $\mathcal{C}_b(\Omega)$ for the space of all such functions) it holds

$$\mu_n(f) := \int f \, d\mu_n \to \int f \, d\mu =: \mu(f) \quad \text{as } n \to \infty. \qquad (4.1)$$

Theorem 4.4 Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of real-valued random variables on a space $(\Omega, \mathcal{F}, P)$. Assume $(X_n)$ converges to a random variable $X$ stochastically. Then $P_{X_n}$ (the sequence of distributions) converges weakly to $P_X$, i.e.

$$\lim_{n \to \infty} \int f \, dP_{X_n} = \int f \, dP_X,$$

or equivalently

$$\lim_{n \to \infty} E(f \circ X_n) = E(f \circ X)$$

for all $f \in \mathcal{C}_b(\mathbb{R})$.

If $X$ is constant $P$-a.s. the converse also holds true.

Proof. First assume $f \in \mathcal{C}_b(\mathbb{R})$ is uniformly continuous. Then for $\varepsilon > 0$ there is a $\delta > 0$ such that for any $x', x'' \in \mathbb{R}$

$$|x' - x''| < \delta \Rightarrow |f(x') - f(x'')| < \varepsilon.$$

Define $A_n := \{|X_n - X| \ge \delta\}$, $n \in \mathbb{N}$. Then

$$\left| \int f \, dP_{X_n} - \int f \, dP_X \right| = |E(f \circ X_n) - E(f \circ X)|$$
$$\le E(|f \circ X_n - f \circ X|)$$
$$= E\left(|f \circ X_n - f \circ X| \cdot 1_{A_n}\right) + E\left(|f \circ X_n - f \circ X| \cdot 1_{A_n^c}\right)$$
$$\le 2\|f\| P(A_n) + \varepsilon P(A_n^c)$$
$$\le 2\|f\| P(A_n) + \varepsilon.$$

Here we used the notation

$$\|f\| := \sup\{|f(x)| : x \in \mathbb{R}\}$$

and that

$$|f \circ X_n - f \circ X| \le |f \circ X_n| + |f \circ X| \le 2\|f\|.$$

But since $X_n \to X$ stochastically, $P(A_n) \to 0$ as $n \to \infty$, so for $n$ large enough $P(A_n) \le \frac{\varepsilon}{2\|f\|}$. Thus for such $n$

$$\left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le 2\varepsilon.$$

Now let $f \in \mathcal{C}_b(\mathbb{R})$ be arbitrary and denote $I_n := [-n, n]$. Since $I_n \uparrow \mathbb{R}$ as $n \to \infty$, we have $P_X(I_n) \uparrow 1$ as $n \to \infty$. Thus for all $\varepsilon > 0$ there is $n_0(\varepsilon) =: n_0$ such that

$$1 - P_X(I_{n_0}) = P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon.$$

We choose the continuous function $u_{n_0}$ in the following way:

$$u_{n_0}(x) = \begin{cases} 1 & x \in I_{n_0} \\ 0 & |x| \ge n_0 + 1 \\ -x + n_0 + 1 & n_0 < x < n_0 + 1 \\ x + n_0 + 1 & -n_0 - 1 < x < -n_0. \end{cases}$$

Eventually put $\bar{f} := u_{n_0} \cdot f$. Since $\bar{f} \equiv 0$ outside the compact set $[-n_0 - 1, n_0 + 1]$, the function $\bar{f}$ is uniformly continuous (and so is $u_{n_0}$) and hence

$$\lim_{n \to \infty} \int \bar{f} \, dP_{X_n} = \int \bar{f} \, dP_X$$

as well as

$$\lim_{n \to \infty} \int u_{n_0} \, dP_{X_n} = \int u_{n_0} \, dP_X;$$

thus also

$$\lim_{n \to \infty} \int (1 - u_{n_0}) \, dP_{X_n} = \int (1 - u_{n_0}) \, dP_X.$$

By the triangle inequality

$$\left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le \int |f - \bar{f}| \, dP_{X_n} + \left| \int \bar{f} \, dP_{X_n} - \int \bar{f} \, dP_X \right| + \int |f - \bar{f}| \, dP_X. \qquad (4.2)$$

For large $n \ge n_1(\varepsilon)$, $\left| \int \bar{f} \, dP_{X_n} - \int \bar{f} \, dP_X \right| \le \varepsilon$. Furthermore, from $0 \le 1 - u_{n_0} \le 1_{\mathbb{R} \setminus I_{n_0}}$ we obtain

$$\int (1 - u_{n_0}) \, dP_X \le P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon,$$

so that for all $n \ge n_2(\varepsilon)$ also

$$\int (1 - u_{n_0}) \, dP_{X_n} < \varepsilon.$$

This yields

$$\int |f - \bar{f}| \, dP_X = \int |f| (1 - u_{n_0}) \, dP_X \le \|f\| \varepsilon$$

on the one hand and

$$\int |f - \bar{f}| \, dP_{X_n} \le \|f\| \varepsilon$$

for all $n \ge n_2(\varepsilon)$ on the other. Hence we obtain from (4.2) for large $n$:

$$\left| \int f \, dP_{X_n} - \int f \, dP_X \right| \le 2\|f\|\varepsilon + \varepsilon.$$

This proves weak convergence of the distributions.

For the converse let $\eta \in \mathbb{R}$ and assume $X \equiv \eta$ ($X$ is identically equal to $\eta \in \mathbb{R}$ $P$-almost surely). This means $P_X = \delta_\eta$, where $\delta_\eta$ is the Dirac measure concentrated in $\eta$. For the open interval $I = (\eta - \varepsilon, \eta + \varepsilon)$, $\varepsilon > 0$, we may find $f \in \mathcal{C}_b(\mathbb{R})$ with $f \le 1_I$ and $f(\eta) = 1$. Then

$$\int f \, dP_{X_n} \le P_{X_n}(I) = P(X_n \in I) \le 1.$$

Since we assumed weak convergence of $P_{X_n}$ to $P_X$ we know that

$$\int f \, dP_{X_n} \to \int f \, dP_X = f(\eta) = 1$$

as $n \to \infty$. Since $\int f \, dP_{X_n} \le P(X_n \in I) \le 1$, this implies

$$P(X_n \in I) \to 1$$

as $n \to \infty$. But

$$\{X_n \in I\} = \{|X_n - \eta| < \varepsilon\}$$

and thus

$$P(|X_n - X| \ge \varepsilon) = P(|X_n - \eta| \ge \varepsilon) = 1 - P(|X_n - \eta| < \varepsilon) \to 0$$

for all $\varepsilon > 0$. This means $X_n$ converges stochastically to $X$.

Definition 4.5 Let $X_n, X$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $P_{X_n}$ converges weakly to $P_X$, we also say that $X_n$ converges to $X$ in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are $\mathbb{R}^d$-valued and the $f : \mathbb{R}^d \to \mathbb{R}$ belong to $\mathcal{C}_b(\mathbb{R}^d)$, the statement of the theorem stays valid.

Example 4.7 For each sequence $\sigma_n > 0$ with $\lim \sigma_n = 0$ we have

$$\lim_{n \to \infty} \mathcal{N}(0, \sigma_n^2) = \delta_0.$$

Here $\delta_0$ denotes the Dirac measure concentrated in $0 \in \mathbb{R}$. Indeed, substituting $x = \sigma y$ we obtain

$$\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} f(x) e^{-\frac{x^2}{2\sigma^2}} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}} f(\sigma y) \, dy.$$

Now for all $y \in \mathbb{R}$

$$\left| e^{-\frac{y^2}{2}} f(\sigma y) \right| \le \|f\| \, e^{-\frac{y^2}{2}},$$

which is integrable. Thus by dominated convergence

$$\lim_{n \to \infty} \frac{1}{\sqrt{2\pi\sigma_n^2}} \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma_n^2}} f(x) \, dx = \lim_{n \to \infty} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}} f(\sigma_n y) \, dy$$
$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}} \lim_{n \to \infty} f(\sigma_n y) \, dy = f(0) = \int f \, d\delta_0.$$
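A Monte Carlo check of Example 4.7 (a sketch; we pick $f = \cos$ as the bounded continuous test function):

```python
import math
import random

random.seed(4)

def integral_against_normal(sigma, trials=50_000):
    """Monte Carlo estimate of the integral of f dN(0, sigma^2), with f = cos."""
    return sum(math.cos(sigma * random.gauss(0.0, 1.0))
               for _ in range(trials)) / trials

vals = [integral_against_normal(s) for s in (1.0, 0.1, 0.01)]
print(vals)  # approaches f(0) = 1, the integral of f against delta_0, as sigma -> 0
```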

Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true if we drop the assumption that $X \equiv \eta$ $P$-a.s.

Exercise 4.9 Assume the following holds for a sequence $(X_n)$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$:

$$P(|X_n| > \varepsilon) < \varepsilon$$

for all $n$ large enough (larger than $n(\varepsilon)$), for each given $\varepsilon > 0$. Is this equivalent to stochastic convergence of $X_n$ to $0$?

Exercise 4.10 For a sequence of Poisson distributions $(\pi_{\alpha_n})$ with parameters $\alpha_n > 0$ show that

$$\lim_{n \to \infty} \pi_{\alpha_n} = \delta_0$$

if $\lim_{n \to \infty} \alpha_n = 0$. Is there a probability measure $\mu$ on $\mathcal{B}^1$ with

$$\lim_{n \to \infty} \pi_{\alpha_n} = \mu \quad \text{(weakly)}$$

if $\lim \alpha_n = \infty$?

5 Independence

The concept of independence of events is one of the most essential in probability theory. It is met already in the introductory courses. Its background is the following: Assume we are given a probability space $(\Omega, \mathcal{F}, P)$. For two events $A, B \in \mathcal{F}$ with $P(B) > 0$ one may ask how the probability of the event $A$ changes if we already know that $B$ has happened; we then only need to consider the probability of $A \cap B$. To obtain a probability again, we normalize by $P(B)$ and get the conditional probability of $A$ given $B$:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Now we would call $A$ and $B$ independent if the knowledge that $B$ has happened does not change the probability that $A$ will happen or not, i.e. if $P(A \mid B)$ and $P(A)$ are the same:

$$P(A \mid B) = P(A).$$

In other words, this means $A$ and $B$ are independent if

$$P(A \cap B) = P(A) \, P(B).$$

More generally we define:

Definition 5.1 A family $(A_i)_{i \in I}$ of events in $\mathcal{F}$ is called independent if for each choice of distinct indices $i_1, \ldots, i_n \in I$

$$P(A_{i_1} \cap \ldots \cap A_{i_n}) = P(A_{i_1}) \cdots P(A_{i_n}). \qquad (5.1)$$

Exercise 5.2 Give an example of a sequence of events that are pairwise independent, i.e. each pair of events from this sequence is independent, but not independent (i.e. all events together are not independent).
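One classical construction behind Exercise 5.2 can be verified exactly on a four-point space (the event names below are our own labels):

```python
from itertools import product

# Sample space: two fair coin tosses, each point has probability 1/4.
omega = list(product([0, 1], repeat=2))

A = {w for w in omega if w[0] == 1}      # first toss is heads
B = {w for w in omega if w[1] == 1}      # second toss is heads
C = {w for w in omega if w[0] == w[1]}   # the two tosses agree

def P(E):
    return len(E) / len(omega)

pairwise = all(P(E & F) == P(E) * P(F)
               for E, F in [(A, B), (A, C), (B, C)])
triple = P(A & B & C) == P(A) * P(B) * P(C)
print(pairwise, triple)  # True, False
```

Each pair satisfies the product formula (5.1), but $P(A \cap B \cap C) = \frac{1}{4} \ne \frac{1}{8}$, so the three events together are not independent.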

We generalize Definition 5.1 to set systems in the following way:

Definition 5.3 For each $i \in I$ let $\mathcal{E}_i \subseteq \mathcal{F}$ be a collection of events. $(\mathcal{E}_i)_{i \in I}$ are called independent if (5.1) holds for each $i_1, \ldots, i_n \in I$, each $n \in \mathbb{N}$, and each $A_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$.

Exercise 5.4 Show the following

1. A family $(\mathcal{E}_i)_{i \in I}$ is independent if and only if every finite sub-family is independent.

2. Independence of $(\mathcal{E}_i)_{i \in I}$ is maintained if we reduce the families $(\mathcal{E}_i)$. More precisely, let $(\mathcal{E}_i)$ be independent and $\mathcal{E}'_i \subseteq \mathcal{E}_i$; then also the families $(\mathcal{E}'_i)$ are independent.

3. If for all $n \in \mathbb{N}$, $(\mathcal{E}_i^n)_{i \in I}$ are independent and for all $n \in \mathbb{N}$ and $i \in I$, $\mathcal{E}_i^n \subseteq \mathcal{E}_i^{n+1}$, then $\left(\bigcup_n \mathcal{E}_i^n\right)_{i \in I}$ are independent.

Exercise 5.5 If $(\mathcal{E}_i)_{i \in I}$ are independent, then so are the Dynkin systems $(\mathcal{D}(\mathcal{E}_i))_{i \in I}$. Here $\mathcal{D}(\mathcal{A})$ is the Dynkin system generated by $\mathcal{A}$; it coincides with the intersection of all Dynkin systems containing $\mathcal{A}$. (See the Appendix for definitions.)

Corollary 5.6 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$. Then also the families $(\sigma(\mathcal{E}_i))_{i \in I}$ of the $\sigma$-algebras generated by the $\mathcal{E}_i$ are independent.

Theorem 5.7 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$ and let

$$I = \bigcup_{j \in J} I_j$$

be a disjoint union, i.e. $I_i \cap I_j = \emptyset$ for $i \ne j$. Let $\mathcal{A}_j := \sigma\left(\bigcup_{i \in I_j} \mathcal{E}_i\right)$. Then also $(\mathcal{A}_j)_{j \in J}$ is independent.

Proof. For $j \in J$ let $\tilde{\mathcal{E}}_j$ be the system of all sets of the form

$$E_{i_1} \cap \ldots \cap E_{i_n},$$

where $\emptyset \ne \{i_1, \ldots, i_n\} \subseteq I_j$ and $E_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$, are arbitrary. Then $\tilde{\mathcal{E}}_j$ is $\cap$-stable. As an immediate consequence of the independence of the $(\mathcal{E}_i)_{i \in I}$, also the $(\tilde{\mathcal{E}}_j)_{j \in J}$ are independent. Eventually $\mathcal{A}_j = \sigma(\tilde{\mathcal{E}}_j)$. Thus the assertion follows from Corollary 5.6.

In the next step we want to show that events that depend on all but finitely many $\sigma$-algebras of a countable family of independent $\sigma$-algebras can only have probability zero or one. To this end we need the following definition.

Definition 5.8 Let $(\mathcal{A}_n)_n$ be a sequence of $\sigma$-algebras from $\mathcal{F}$ and

$$\mathcal{T}_n := \sigma\left(\bigcup_{m=n}^{\infty} \mathcal{A}_m\right)$$

the $\sigma$-algebra generated by $\mathcal{A}_n, \mathcal{A}_{n+1}, \ldots$. Then

$$\mathcal{T}_\infty := \bigcap_{n=1}^{\infty} \mathcal{T}_n$$

is called the $\sigma$-algebra of the tail events.

Exercise 5.9 Why is $\mathcal{T}_\infty$ a $\sigma$-algebra?

This is now the result announced above:

Theorem 5.10 (Kolmogorov's Zero-One Law) Let $(\mathcal{A}_n)$ be an independent sequence of $\sigma$-algebras $\mathcal{A}_n \subseteq \mathcal{F}$. Then for every tail event $A \in \mathcal{T}_\infty$ it holds

$$P(A) = 0 \quad \text{or} \quad P(A) = 1.$$

Proof. Let $A \in \mathcal{T}_\infty$ and let $\mathcal{D}$ be the system of all sets $D \in \mathcal{F}$ that are independent of $A$. We want to show that $A \in \mathcal{D}$. In the exercise below it is shown that $\mathcal{D}$ is a Dynkin system. By Theorem 5.7 the $\sigma$-algebra $\mathcal{T}_{n+1}$ is independent of the $\sigma$-algebra

$$\mathcal{A}^n := \sigma(\mathcal{A}_1 \cup \ldots \cup \mathcal{A}_n).$$

Since $A \in \mathcal{T}_{n+1}$ we know that $\mathcal{A}^n \subseteq \mathcal{D}$ for every $n \in \mathbb{N}$. Thus

$$\mathcal{A} := \bigcup_{n=1}^{\infty} \mathcal{A}^n \subseteq \mathcal{D}.$$

Obviously $(\mathcal{A}^n)$ is increasing. For $E, F \in \mathcal{A}$ there thus exists $n$ with $E, F \in \mathcal{A}^n$ and hence $E \cap F \in \mathcal{A}^n$, thus $E \cap F \in \mathcal{A}$. This means that $\mathcal{A}$ is $\cap$-stable. Since $\mathcal{A} \subseteq \mathcal{D}$, i.e. $(\mathcal{A}, \{A\})$ is independent, from Exercise 5.5 we conclude that $(\mathcal{D}(\mathcal{A}), \{A\})$ is independent, so that $\sigma(\mathcal{A}) = \mathcal{D}(\mathcal{A}) \subseteq \mathcal{D}$. Moreover $\mathcal{A}_n \subseteq \mathcal{A}$ for all $n$. Hence $\mathcal{T}_1 = \sigma\left(\bigcup_n \mathcal{A}_n\right) \subseteq \sigma(\mathcal{A})$. Therefore

$$A \in \mathcal{T}_\infty \subseteq \mathcal{T}_1 \subseteq \sigma(\mathcal{A}) \subseteq \mathcal{D}.$$

Therefore $A$ is independent of itself, i.e. it holds

$$P(A) = P(A \cap A) = P(A) \cdot P(A) = (P(A))^2.$$

Hence $P(A) \in \{0, 1\}$, as asserted.

Exercise 5.11 Show that $\mathcal{D}$ from the proof of Theorem 5.10 is a Dynkin system.

As an immediate consequence of the Kolmogorov Zero-One Law (Theorem 5.10) we obtain

Theorem 5.12 (Borel's Zero-One Law) For each independent sequence $(A_n)_n$ of events $A_n \in \mathcal{F}$ we have

$$P(A_n \text{ for infinitely many } n) = 0 \quad \text{or} \quad P(A_n \text{ for infinitely many } n) = 1,$$

i.e. $P(\limsup A_n) \in \{0, 1\}$.

Proof. Let $\mathcal{A}_n = \sigma(A_n)$, i.e. $\mathcal{A}_n = \{\emptyset, \Omega, A_n, A_n^c\}$. It follows that $(\mathcal{A}_n)_n$ is independent. For $Q_n := \bigcup_{m=n}^{\infty} A_m$ we have $Q_n \in \mathcal{T}_n$. Since $(\mathcal{T}_n)_n$ is decreasing we even have $Q_m \in \mathcal{T}_n$ for all $m \ge n$, $n \in \mathbb{N}$. Since $(Q_n)$ is decreasing we obtain

$$\limsup_{n \to \infty} A_n = \bigcap_{k=1}^{\infty} Q_k = \bigcap_{k=j}^{\infty} Q_k \in \mathcal{T}_j$$

for all $j \in \mathbb{N}$. Hence $\limsup A_n \in \mathcal{T}_\infty$. Hence the assertion follows from Kolmogorov's zero-one law.

Exercise 5.13 In every $\mathcal{F}$ the pairs $(A, B)$ (any $A, B \in \mathcal{F}$ with $P(A) = 0$ or $P(A) = 1$ or $P(B) = 0$ or $P(B) = 1$) are pairs of independent sets. If these are the only pairs of independent sets we call $\mathcal{F}$ independence-free. Show that the following space is independence-free:

$$\Omega = \mathbb{N}, \quad \mathcal{F} = \mathcal{P}(\Omega), \quad P(\{k\}) = 2^{-k!}$$

for each $k \ge 2$, and $P(\{1\}) = 1 - \sum_{k=2}^{\infty} P(\{k\})$. (Hint: by passing to complements if necessary, you may assume that $1 \notin A$ and $1 \notin B$.)

A special case of the above abstract setting is the concept of independent random variables. This will be introduced next. Again we work over a probability space $(\Omega, \mathcal{F}, P)$.

Definition 5.14 A family of random variables (Xi)i is called independent, if the σ-algebras (σ (Xi))i generated by them are independent.

For finite families there is another criterion (which is important, since by definition of independence we only need to check the independence of finite families).

Theorem 5.15 Let $X_1, \ldots, X_n$ be a sequence of $n$ random variables with values in measurable spaces $(\Omega_i, \mathcal{A}_i)$ with $\cap$-stable generators $\mathcal{E}_i$ of $\mathcal{A}_i$. $X_1, \ldots, X_n$ are independent if and only if

$$P(X_1 \in E_1, \ldots, X_n \in E_n) = \prod_{i=1}^{n} P(X_i \in E_i)$$

for all $E_i \in \mathcal{E}_i$, $i = 1, \ldots, n$.

Proof. Put

$$\mathcal{G}_i := \left\{ X_i^{-1}(E_i) : E_i \in \mathcal{E}_i \right\}.$$

Then $\mathcal{G}_i$ generates $\sigma(X_i)$. $\mathcal{G}_i$ is $\cap$-stable and $\Omega \in \mathcal{G}_i$. According to Corollary 5.6 we need to show the independence of $(\mathcal{G}_i)_{i=1,\ldots,n}$, which is equivalent to

$$P(G_1 \cap \cdots \cap G_n) = P(G_1) \cdots P(G_n)$$

for all choices of $G_i \in \mathcal{G}_i$. Sufficiency is evident, since we may choose $G_i = \Omega$ for appropriate $i$.

Exercise 5.16 Random variables $X_1, \ldots, X_{n+1}$ with values in $(\Omega_i, \mathcal{F}_i)$ are independent if and only if $X_1, \ldots, X_n$ are independent and $X_{n+1}$ is independent of $\sigma(X_1, \ldots, X_n)$.

The following theorem states that a measurable deformation of an independent family of random variables stays independent:

Theorem 5.17 Let $(X_i)_{i \in I}$ be a family of independent random variables $X_i$ with values in $(\Omega_i, \mathcal{A}_i)$ and let

$$f_i : (\Omega_i, \mathcal{A}_i) \to (\Omega'_i, \mathcal{A}'_i)$$

be measurable. Then $(f_i(X_i))_{i \in I}$ is independent.

Proof. Let $i_1, \ldots, i_n \in I$. Then

$$P\left(f_{i_1}(X_{i_1}) \in A'_{i_1}, \ldots, f_{i_n}(X_{i_n}) \in A'_{i_n}\right) = P\left(X_{i_1} \in f_{i_1}^{-1}(A'_{i_1}), \ldots, X_{i_n} \in f_{i_n}^{-1}(A'_{i_n})\right)$$
$$= \prod_{\nu=1}^{n} P\left(X_{i_\nu} \in f_{i_\nu}^{-1}(A'_{i_\nu})\right) = \prod_{\nu=1}^{n} P\left(f_{i_\nu}(X_{i_\nu}) \in A'_{i_\nu}\right)$$

by the independence of $(X_i)_{i \in I}$. Here the $A'_{i_\nu} \in \mathcal{A}'_{i_\nu}$ were arbitrary.

Already Theorem 5.15 gives rise to the idea that independence of random variables may be somehow related to product measures. This is made more precise in the following theorem. To this end let $X_1, \ldots, X_n$ be random variables such that

$$X_i : (\Omega, \mathcal{F}) \to (\Omega_i, \mathcal{A}_i).$$

Define

$$Y := X_1 \otimes \cdots \otimes X_n : \Omega \to \Omega_1 \times \cdots \times \Omega_n.$$

The distribution of $Y$ will be denoted by $P_{X_1 \otimes \cdots \otimes X_n}$. Note that $P_{X_1 \otimes \cdots \otimes X_n}$ is a probability measure on $\otimes_{i=1}^{n} \mathcal{A}_i$.

Theorem 5.18 The random variables $X_1, \ldots, X_n$ are independent if and only if their joint distribution is the product measure of the individual distributions, i.e. if

$$P_{X_1 \otimes \ldots \otimes X_n} = P_{X_1} \otimes \cdots \otimes P_{X_n}.$$

Proof. Let $A_i \in \mathcal{A}_i$, $i = 1, \ldots, n$. Then with $Y = X_1 \otimes \cdots \otimes X_n$:

$$P_Y\left(\prod_{i=1}^{n} A_i\right) = P\left(Y \in \prod_{i=1}^{n} A_i\right) = P(X_1 \in A_1, \ldots, X_n \in A_n)$$

as well as

$$P_{X_i}(A_i) = P(X_i \in A_i), \quad i = 1, \ldots, n.$$

Now $P_Y$ is the product measure of the $P_{X_i}$ if and only if

$$P_Y(A_1 \times \ldots \times A_n) = P_{X_1}(A_1) \cdots P_{X_n}(A_n).$$

But this is identical with

$$P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i).$$

According to Theorem 5.15 this is equivalent to the independence of the $X_i$.
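Theorem 5.18 can be checked exactly in a small discrete example (the two random variables and their values below are illustrative choices of ours):

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair dice, product measure 1/36 on each point.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def X1(w):
    return w[0] % 2          # parity of the first die

def X2(w):
    return w[1] >= 4         # whether the second die shows at least 4

def prob(pred):
    return sum(p for w in omega if pred(w))

# The joint law of (X1, X2) factorizes into the product of the marginals.
factorizes = all(
    prob(lambda w, a=a, b=b: X1(w) == a and X2(w) == b)
    == prob(lambda w, a=a: X1(w) == a) * prob(lambda w, b=b: X2(w) == b)
    for a in (0, 1) for b in (False, True)
)
print(factorizes)  # True
```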

6 Products and Sums of Independent Random Variables

In this section we will study independent random variables in greater detail.

Theorem 6.1 Let $X_1, \ldots, X_n$ be independent, real-valued random variables. Then

$$E\left(\prod_{i=1}^{n} X_i\right) = \prod_{i=1}^{n} E(X_i) \qquad (6.1)$$

if $E(X_i)$ is well defined (and finite) for all $i$. (6.1) shows that then also $E\left(\prod_{i=1}^{n} X_i\right)$ is well defined.

Proof. We know that $Q := \otimes_{i=1}^{n} P_{X_i}$ is the joint distribution of $X_1, \ldots, X_n$. By Proposition 3.2 and Fubini's theorem

$$E\left(\prod_{i=1}^{n} |X_i|\right) = \int |x_1| \cdots |x_n| \, dQ(x_1, \ldots, x_n)$$
$$= \int \cdots \int |x_1| \cdots |x_n| \, dP_{X_1}(x_1) \ldots dP_{X_n}(x_n)$$
$$= \int |x_1| \, dP_{X_1}(x_1) \cdots \int |x_n| \, dP_{X_n}(x_n).$$

This shows that integrability of the $X_i$ implies integrability of $\prod_{i=1}^{n} X_i$. In this case the equalities are also true without absolute values. This proves the result.

Exercise 6.2 For any two integrable random variables $X, Y$, Theorem 6.1 tells us that independence of $X, Y$ implies

$$E(X \cdot Y) = E(X) \, E(Y).$$

Show that the converse is not true.

Definition 6.3 For any two random variables $X, Y$ that are integrable and have an integrable product, we define the covariance of $X$ and $Y$ to be

$$\operatorname{cov}(X, Y) = E[(X - EX)(Y - EY)] = E(XY) - EX \, EY.$$

$X$ and $Y$ are called uncorrelated if $\operatorname{cov}(X, Y) = 0$.

Remark 6.4 If $X, Y$ are independent, then $\operatorname{cov}(X, Y) = 0$.

Theorem 6.5 Let $X_1, \ldots, X_n$ be square integrable random variables. Then

$$V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} V(X_i) + \sum_{i \ne j} \operatorname{cov}(X_i, X_j). \qquad (6.2)$$

In particular, if $X_1, \ldots, X_n$ are uncorrelated,

$$V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} V(X_i). \qquad (6.3)$$

Proof. We have

$$V\left(\sum_{i=1}^{n} X_i\right) = E\left[\left(\sum_{i=1}^{n} (X_i - EX_i)\right)^2\right]$$
$$= E\left[\sum_{i=1}^{n} (X_i - EX_i)^2 + \sum_{i \ne j} (X_i - EX_i)(X_j - EX_j)\right]$$
$$= \sum_{i=1}^{n} V(X_i) + \sum_{i \ne j} \operatorname{cov}(X_i, X_j).$$

This proves (6.2). For (6.3) just note that for uncorrelated random variables $X, Y$ one has $\operatorname{cov}(X, Y) = 0$.

Eventually we turn to determining the distribution of the sum of independent random variables.
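Before moving on, (6.3) can be sanity-checked by simulation with two independent (hence uncorrelated) summands (a sketch with arbitrary parameters):

```python
import random

random.seed(5)
N = 100_000

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

x = [random.randint(1, 6) for _ in range(N)]   # a fair die, V = 35/12
y = [random.randint(0, 1) for _ in range(N)]   # a fair coin, V = 1/4
s = [a + b for a, b in zip(x, y)]

print(var(s), var(x) + var(y))  # both close to 35/12 + 1/4
```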

Theorem 6.6 Let $X_1, \ldots, X_n$ be independent $\mathbb{R}^d$-valued random variables. Then the distribution of the sum $S_n := X_1 + \ldots + X_n$ is given by the convolution product of the distributions of the $X_i$, i.e.

$$P_{S_n} = P_{X_1} * P_{X_2} * \cdots * P_{X_n}.$$

Proof. Again let $Y := X_1 \otimes \ldots \otimes X_n : \Omega \to (\mathbb{R}^d)^n$, and let

$$A_n : \mathbb{R}^d \times \ldots \times \mathbb{R}^d \to \mathbb{R}^d$$

denote vector addition. Then $S_n = A_n \circ Y$, hence a random variable. Now $P_{S_n}$ is the image measure of $P$ under $A_n \circ Y$, which we denote by $(A_n \circ Y)(P)$. Thus

$$P_{S_n} = (A_n \circ Y)(P) = A_n(P_Y).$$

Now $P_Y = \otimes P_{X_i}$. So by the definition of the convolution product

$$P_{X_1} * \ldots * P_{X_n} = A_n(P_Y) = P_{S_n}.$$

More explicitly, in the case $d = 1$, let $g(x_1, \ldots, x_n) = 1$ for $x_1 + \cdots + x_n \le s$ and $g(x_1, \ldots, x_n) = 0$ otherwise. Then application of Fubini's theorem yields

$$P(S_n \le s) = E(g(X_1, \ldots, X_n)) = \int\!\!\int g(x_1, x_2, \ldots, x_n) \, dP_{X_1}(x_1) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n)$$
$$= \int P(X_1 \le s - x_2 - \cdots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n).$$

In the case that $X_1$ has a density $f_{X_1}$ with respect to Lebesgue measure, and the order of differentiation with respect to $s$ and integration can be exchanged, it follows that $S_n$ has a density $f_{S_n}$ and

$$f_{S_n}(s) = \int f_{X_1}(s - x_2 - \cdots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n).$$

The same formula holds in the case that $X_1, \ldots, X_n$ have a discrete distribution (that is, almost surely take values in a fixed countable subset of $\mathbb{R}$) if densities are taken with respect to the counting measure.
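For discrete distributions on $\mathbb{N}_0$ the convolution product is an ordinary discrete convolution of the counting densities; the sketch below (helper functions are ours) confirms on small parameters that convolving two binomial pmfs with the same $p$ gives a binomial pmf again:

```python
from math import comb

def binom_pmf(n, p):
    """pmf of B(n, p) as a list indexed by k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(f, g):
    """Discrete convolution of two pmfs supported on {0, ..., len - 1}."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

p = 0.3
lhs = convolve(binom_pmf(4, p), binom_pmf(6, p))  # B(4, p) * B(6, p)
rhs = binom_pmf(10, p)                            # B(10, p)
gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(gap)  # ~0: the two pmfs coincide
```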

Example 6.7 1. As we learned in Introduction to Statistics, the convolution of a Binomial distribution with parameters $n$ and $p$, $B(n, p)$, and a Binomial distribution $B(m, p)$ is a $B(n + m, p)$ distribution:

$$B(n, p) * B(m, p) = B(n + m, p).$$

2. As we learned in Introduction to Statistics, the convolution of a $\mathcal{P}(\lambda)$-distribution (a Poisson distribution with parameter $\lambda$) with a $\mathcal{P}(\mu)$-distribution is a $\mathcal{P}(\lambda + \mu)$-distribution:

$$\mathcal{P}(\lambda) * \mathcal{P}(\mu) = \mathcal{P}(\lambda + \mu).$$

3. As has been communicated in Introduction to Probability:

$$\mathcal{N}(\mu, \sigma^2) * \mathcal{N}(\nu, \tau^2) = \mathcal{N}(\mu + \nu, \sigma^2 + \tau^2).$$

7 Infinite product probability spaces

Many theorems in probability theory start with: "Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of i.i.d. random variables." But how do we know that such sequences really exist? This will be shown in this section. In the last sections we established the framework for some of the most important theorems of probability theory, such as the Weak Law of Large Numbers or the Central Limit Theorem. Those are the theorems that assume: "Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables." Others, such as the Strong Law of Large Numbers, ask for the behavior of an infinite sequence of independent and identically distributed (i.i.d.) random variables; they usually start like "Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables." The natural first question to ask is: does such a sequence exist at all? In the same way as the existence of a finite sequence of i.i.d. random variables is related to finite product measures, the answer to the above question is related to infinite product measures. So assume that we are given a sequence of measure spaces $(\Omega_n, \mathcal{A}_n, \mu_n)$ of which we moreover assume that

µn (Ωn) = 1 for all n.

We construct $(\Omega, \mathcal{A})$ as follows: we want each $\omega \in \Omega$ to be a sequence $(\omega_n)_n$ where $\omega_n \in \Omega_n$. So we put

$$\Omega := \prod_{n=1}^{\infty} \Omega_n.$$

Moreover, we have the idea that a probability measure on $\Omega$ should be determined by what happens on the first $n$ coordinates, $n \in \mathbb{N}$. So for $A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2, \ldots, A_n \in \mathcal{A}_n$, $n \in \mathbb{N}$, we want

$$A := A_1 \times \ldots \times A_n \times \Omega_{n+1} \times \Omega_{n+2} \times \ldots \qquad (7.1)$$

to be in $\mathcal{A}$. By independence we want to define a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns to $A$ defined in (7.1) the mass

$$\mu(A) = \mu_1(A_1) \cdots \mu_n(A_n).$$
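The intended cylinder-set property can be illustrated by simulating the first few coordinates of a product of fair-coin spaces (a sketch; the particular cylinder, constraining only the first two coordinates, is our choice):

```python
import random

random.seed(6)
N = 100_000

# mu_n = fair coin on Omega_n = {0, 1}; the cylinder
# A = {omega : omega_1 = 1, omega_2 = 0} should get mass (1/2)*(1/2) = 1/4,
# no matter what the remaining coordinates do.
hits = 0
for _ in range(N):
    omega = [random.randint(0, 1) for _ in range(10)]  # first 10 coordinates
    if omega[0] == 1 and omega[1] == 0:
        hits += 1

print(hits / N)  # close to 0.25
```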

We will solve this problem in greater generality. Let $I$ be an index set and $(\Omega_i, \mathcal{A}_i, \mu_i)_{i \in I}$ be measure spaces with $\mu_i(\Omega_i) = 1$. For $\emptyset \ne K \subseteq I$ define

$$\Omega_K := \prod_{i \in K} \Omega_i, \qquad (7.2)$$

in particular $\Omega := \Omega_I$. Let $p_J^K$ for $J \subseteq K$ denote the canonical projection from $\Omega_K$ to $\Omega_J$. For $J = \{i\}$ we will also write $p_i^K$ instead of $p_{\{i\}}^K$ and $p_i$ in place of $p_i^I$. Obviously

$$p_J^L = p_J^K \circ p_K^L \quad (J \subseteq K \subseteq L) \qquad (7.3)$$

and

$$p_J := p_J^I = p_J^K \circ p_K \quad (J \subseteq K). \qquad (7.4)$$

Moreover denote by

$$\mathcal{H}(I) := \{J \subseteq I : J \ne \emptyset, \ |J| \text{ finite}\}.$$

For $J \in \mathcal{H}(I)$ the $\sigma$-algebras and measures

$$\mathcal{A}_J := \otimes_{i \in J} \mathcal{A}_i \quad \text{and} \quad \mu_J := \otimes_{i \in J} \mu_i$$

are defined by Fubini's theorem in measure theory. In analogy to the finite-dimensional case we define:

Definition 7.1 The product $\sigma$-algebra $\otimes_{i \in I} \mathcal{A}_i$ of the $\sigma$-algebras $(\mathcal{A}_i)_{i \in I}$ is defined as the smallest $\sigma$-algebra on $\Omega$ such that all projections $p_i : \Omega \to \Omega_i$ are $(\mathcal{A}, \mathcal{A}_i)$-measurable. Hence

$$\otimes_{i \in I} \mathcal{A}_i := \sigma(p_i, \ i \in I). \qquad (7.5)$$

Exercise 7.2 Show that

$$\otimes_{i \in I} \mathcal{A}_i = \sigma(p_J, \ J \in \mathcal{H}(I)). \qquad (7.6)$$

According to the above, we are now looking for a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns mass $\mu_1(A_1) \cdots \mu_n(A_n)$ to each $A$ as defined in (7.1). In other words,

$$\mu\left(p_J^{-1}\left(\prod_{i \in J} A_i\right)\right) = \mu_J\left(\prod_{i \in J} A_i\right).$$

The question whether such a measure exists is solved in

Theorem 7.3 On $\mathcal{A} := \bigotimes_{i \in I} \mathcal{A}_i$ there is a unique measure $\mu$ with
$$p_J(\mu) := \mu \circ p_J^{-1} = \mu_J \tag{7.7}$$
for all $J \in \mathcal{H}(I)$. It holds that $\mu(\Omega) = 1$.

Proof. We may assume $|I| = \infty$, since otherwise the result is known from Fubini's theorem. We start with some preparatory considerations: In Exercise 7.4 below it will be shown that $p_J^K$ is $(\mathcal{A}_K, \mathcal{A}_J)$-measurable for $J \subseteq K$ and that
$$p_J^K(\mu_K) = \mu_J \quad (J \subseteq K, \; J, K \in \mathcal{H}(I)).$$
Hence, if we introduce the $\sigma$-algebra of the $J$-cylinder sets
$$\mathcal{Z}_J := p_J^{-1}(\mathcal{A}_J) \quad (J \in \mathcal{H}(I)), \tag{7.8}$$
the measurability of $p_J^K$ implies $(p_J^K)^{-1}(\mathcal{A}_J) \subset \mathcal{A}_K$ and thus
$$\mathcal{Z}_J \subseteq \mathcal{Z}_K \quad (J \subseteq K, \; J, K \in \mathcal{H}(I)). \tag{7.9}$$
Eventually we introduce the system of all cylinder sets
$$\mathcal{Z} := \bigcup_{J \in \mathcal{H}(I)} \mathcal{Z}_J.$$
Note that due to (7.9), for $Z_1, Z_2 \in \mathcal{Z}$ we have $Z_1, Z_2 \in \mathcal{Z}_J$ for a suitably chosen $J \in \mathcal{H}(I)$. Hence $\mathcal{Z}$ is an algebra (but in general not a $\sigma$-algebra). From (7.5) and (7.6) it follows that
$$\mathcal{A} = \sigma(\mathcal{Z}).$$
Now we come to the main part of the proof. This will be divided into four parts.

1. Assume $\mathcal{Z} \ni Z = p_J^{-1}(A)$, $J \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$. According to (7.7), $Z$ must get mass $\mu(Z) = \mu_J(A)$. We have to show that this is well defined. So let
$$Z = p_J^{-1}(A) = p_K^{-1}(B)$$
for $J, K \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$, $B \in \mathcal{A}_K$. If $J \subseteq K$ we obtain
$$p_J^{-1}(A) = p_K^{-1}\left( (p_J^K)^{-1}(A) \right)$$
and thus
$$p_K^{-1}(B) = p_K^{-1}(B') \quad \text{with} \quad B' := (p_J^K)^{-1}(A).$$
Since $p_K(\Omega) = \Omega_K$ we obtain
$$B = B' = (p_J^K)^{-1}(A).$$
Thus by the introductory considerations
$$\mu_K(B) = \mu_J(A).$$
For arbitrary $J, K$ define $L := J \cup K$. Since $J, K \subseteq L$, (7.9) implies the existence of $C \in \mathcal{A}_L$ with $p_L^{-1}(C) = p_J^{-1}(A) = p_K^{-1}(B)$. Therefore, from what we have just seen, $\mu_L(C) = \mu_J(A)$ and $\mu_L(C) = \mu_K(B)$. Hence $\mu_J(A) = \mu_K(B)$. Thus the function
$$\mu_0\left( p_J^{-1}(A) \right) = \mu_J(A) \quad (J \in \mathcal{H}(I), \; A \in \mathcal{A}_J) \tag{7.10}$$
is well-defined on $\mathcal{Z}$.

2. Now we show that $\mu_0$ as defined in (7.10) is a volume on $\mathcal{Z}$. Trivially it holds that $\mu_0 \geq 0$ and $\mu_0(\emptyset) = 0$. Moreover, as shown above, for $Y, Z \in \mathcal{Z}$ with $Y \cap Z = \emptyset$ there are $J \in \mathcal{H}(I)$ and $A, B \in \mathcal{A}_J$ such that $Y = p_J^{-1}(A)$, $Z = p_J^{-1}(B)$. Now $Y \cap Z = \emptyset$ implies $A \cap B = \emptyset$, and due to
$$Y \cup Z = p_J^{-1}(A \cup B)$$
we obtain
$$\mu_0(Y \cup Z) = \mu_J(A \cup B) = \mu_J(A) + \mu_J(B) = \mu_0(Y) + \mu_0(Z),$$

hence the finite additivity of µ0.

It remains to show that $\mu_0$ is also $\sigma$-additive. Then the general principles from measure theory yield that $\mu_0$ can be uniquely extended to a measure $\mu$ on $\sigma(\mathcal{Z}) = \mathcal{A}$. $\mu$ also is a probability measure, because $\Omega = p_J^{-1}(\Omega_J)$ for all $J \in \mathcal{H}(I)$ and therefore
$$\mu(\Omega) = \mu_0(\Omega) = \mu_J(\Omega_J) = 1.$$

To prove the $\sigma$-additivity of $\mu_0$ we first show:

3. Let $Z \in \mathcal{Z}$ and $J \in \mathcal{H}(I)$. Then for all $\omega_J \in \Omega_J$ the set
$$Z^{\omega_J} := \{ \omega \in \Omega : (\omega_J, p_{I \setminus J}(\omega)) \in Z \}$$
is a cylinder set. This set consists of all $\omega \in \Omega$ with the following property: if we replace the coordinates $\omega_i$ with $i \in J$ by the corresponding coordinates of $\omega_J$, we obtain a point in $Z$. Moreover
$$\mu_0(Z) = \int \mu_0(Z^{\omega_J}) \, d\mu_J(\omega_J). \tag{7.11}$$
This is shown by the following consideration. For $Z \in \mathcal{Z}$ there are $K \in \mathcal{H}(I)$ and $A \in \mathcal{A}_K$ such that $Z = p_K^{-1}(A)$; this means that $\mu_0(Z) = \mu_K(A)$. Since $I$ is infinite we may assume $J \subset K$ and $J \neq K$. For the $\omega_J$-section of $A$ in $\Omega_K$, which we call $A^{\omega_J}$, i.e. for the set of all $\omega' \in \Omega_{K \setminus J}$ with $(\omega_J, \omega') \in A$, it holds that
$$Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J}).$$
By Fubini's theorem $A^{\omega_J} \in \mathcal{A}_{K \setminus J}$, and hence $Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J})$ is a $(K \setminus J)$-cylinder set. Since $\mu_K = \mu_J \otimes \mu_{K \setminus J}$, Fubini's theorem implies
$$\mu_0(Z) = \mu_K(A) = \int \mu_{K \setminus J}(A^{\omega_J}) \, d\mu_J(\omega_J). \tag{7.12}$$
But this is (7.11), since
$$\mu_0(Z^{\omega_J}) = \mu_{K \setminus J}(A^{\omega_J})$$
(because of $Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J})$).

4. Eventually we show that $\mu_0$ is $\emptyset$-continuous and thus $\sigma$-additive. To this end let $(Z_n)$ be a decreasing sequence of cylinder sets in $\mathcal{Z}$ with $\alpha := \inf_n \mu_0(Z_n) > 0$. We will show that
$$\bigcap_{n=1}^{\infty} Z_n \neq \emptyset. \tag{7.13}$$
Now each $Z_n$ is of the form $Z_n = p_{J_n}^{-1}(A_n)$, $J_n \in \mathcal{H}(I)$, $A_n \in \mathcal{A}_{J_n}$. Due to (7.9) we may assume $J_1 \subseteq J_2 \subseteq J_3 \subseteq \ldots$. We apply the result proved in 3. to $J = J_1$ and $Z = Z_n$. As $\omega_{J_1} \mapsto \mu_0\left( Z_n^{\omega_{J_1}} \right)$ is $\mathcal{A}_{J_1}$-measurable,
$$Q_n := \left\{ \omega_{J_1} \in \Omega_{J_1} : \mu_0\left( Z_n^{\omega_{J_1}} \right) \geq \frac{\alpha}{2} \right\} \in \mathcal{A}_{J_1}.$$
Since all $\mu_J$'s have mass one we obtain from (7.11):
$$\alpha \leq \mu_0(Z_n) \leq \mu_{J_1}(Q_n) + \frac{\alpha}{2},$$
hence $\mu_{J_1}(Q_n) \geq \frac{\alpha}{2} > 0$ for all $n \in \mathbb{N}$. Together with $(Z_n)$ also $(Q_n)$ is decreasing. $\mu_{J_1}$, as a finite measure, is $\emptyset$-continuous, which implies $\bigcap_{n=1}^{\infty} Q_n \neq \emptyset$. Hence there is $\omega_{J_1} \in \bigcap_{n=1}^{\infty} Q_n$ with
$$\mu_0\left( Z_n^{\omega_{J_1}} \right) \geq \frac{\alpha}{2} > 0 \quad \text{for all } n. \tag{7.14}$$
Successive application of 3. implies via induction that for each $k \in \mathbb{N}$ there is $\omega_{J_k} \in \Omega_{J_k}$ with
$$\mu_0\left( Z_n^{\omega_{J_k}} \right) \geq 2^{-k} \alpha > 0 \quad \text{and} \quad p_{J_k}^{J_{k+1}}\left( \omega_{J_{k+1}} \right) = \omega_{J_k}.$$
Due to this second property there is $\omega_0 \in \Omega$ with $p_{J_k}(\omega_0) = \omega_{J_k}$. Because of (7.14) we have $Z_n^{\omega_{J_n}} \neq \emptyset$, such that there is $\tilde{\omega}_n \in \Omega$ with $(\omega_{J_n}, p_{I \setminus J_n}(\tilde{\omega}_n)) \in Z_n$. But then also
$$(\omega_{J_n}, p_{I \setminus J_n}(\omega_0)) = \omega_0 \in Z_n.$$
Thus $\omega_0 \in \bigcap_{n=1}^{\infty} Z_n$, which proves (7.13).

Therefore $\mu_0$ is $\sigma$-additive and hence has an extension to $\mathcal{A}$ by Carathéodory's theorem (Theorem 14.7). It is clear that $\mu_0$ has mass one (i.e. $\mu_0(\Omega) = 1$), since for $J \in \mathcal{H}(I)$ we have $\Omega = p_J^{-1}(\Omega_J)$ and hence
$$\mu_0(\Omega) = \mu_J(\Omega_J) = 1.$$
In particular $\mu_0$ is $\sigma$-finite, the extension $\mu$ is unique, and it is a probability measure, that is $\mu(\Omega) = \mu_0(\Omega) = 1$.

This proves the theorem. We conclude the chapter with an exercise that was left open during this proof.

Exercise 7.4 With the notations of this section, in particular of Theorem 7.3, show that $p_J^K$ is $(\mathcal{A}_K, \mathcal{A}_J)$-measurable $(J \subseteq K, \; J, K \in \mathcal{H}(I))$ and that
$$p_J^K(\mu_K) = \mu_J.$$

8 Zero-One Laws

Already in Section 5 we encountered the prototype of a zero-one law: for a sequence of independent events $(A_n)_n$ we have Borel's Zero-One Law (Theorem 5.12):
$$\mathbb{P}(\limsup A_n) \in \{0, 1\}.$$
In a first step we will now ask when the probability in question is zero and when it is one. This leads to the following frequently used lemma:

Lemma 8.1 (Borel-Cantelli Lemma) Let $(A_n)$ be a sequence of events over a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then
$$\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty \;\Rightarrow\; \mathbb{P}(\limsup A_n) = 0. \tag{8.1}$$
If the events $(A_n)$ are pairwise independent, then also
$$\sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty \;\Rightarrow\; \mathbb{P}(\limsup A_n) = 1. \tag{8.2}$$

Remark 8.2 The Borel-Cantelli Lemma is most often used in the form of (8.1). Note that this part does not require any knowledge about the dependence structure of the $A_n$.
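Both halves of the Borel-Cantelli Lemma can be illustrated by a small simulation. The sketch below is only an illustration with arbitrarily chosen parameters (the events $A_n = \{U_n < p_n\}$, the horizon $N$, and the function name `occurrences` are ours, not from the notes): for the summable choice $p_n = 1/n^2$ the occurrences thin out almost immediately, while for $p_n = 1/n$, where the series diverges, events keep occurring throughout the run.

```python
import random

random.seed(1)

def occurrences(p, N=20_000):
    # indices n <= N at which the independent event A_n = {U_n < p(n)} occurs,
    # with U_n drawn i.i.d. uniform on [0,1)
    return [n for n in range(1, N + 1) if random.random() < p(n)]

summable = occurrences(lambda n: 1 / n**2)   # sum of 1/n^2 is finite
divergent = occurrences(lambda n: 1 / n)     # sum of 1/n diverges
print(len(summable), len(divergent))
```

Note that $A_1$ occurs with certainty in both cases since $p_1 = 1$; in the summable case the expected total number of occurrences is only $\sum 1/n^2 = \pi^2/6$.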

Proof of Lemma 8.1. (8.1) is easy. Define

$$A := \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i.$$
This implies
$$A \subseteq \bigcup_{i=n}^{\infty} A_i \quad \text{for all } n \in \mathbb{N},$$
and thus
$$\mathbb{P}(A) \leq \mathbb{P}\left( \bigcup_{i=n}^{\infty} A_i \right) \leq \sum_{i=n}^{\infty} \mathbb{P}(A_i). \tag{8.3}$$
Since $\sum_{i=1}^{\infty} \mathbb{P}(A_i)$ converges, $\sum_{i=n}^{\infty} \mathbb{P}(A_i)$ converges to zero as $n \to \infty$. This implies $\mathbb{P}(A) = 0$, hence (8.1).

For (8.2) again put $A := \limsup A_n$ and furthermore
$$I_n := 1_{A_n}, \qquad S_n := \sum_{j=1}^{n} I_j,$$
and eventually
$$S := \sum_{j=1}^{\infty} I_j.$$
Since the $A_n$ are assumed to be pairwise independent, they are pairwise uncorrelated as well. Hence
$$\mathbb{V}(S_n) = \sum_{j=1}^{n} \mathbb{V}(I_j) = \sum_{j=1}^{n} \left( \mathbb{E}(I_j^2) - \mathbb{E}(I_j)^2 \right) = \mathbb{E}(S_n) - \sum_{j=1}^{n} \mathbb{E}(I_j)^2 \leq \mathbb{E} S_n,$$
where the second equality follows since $I_j^2 = I_j$. Now by assumption $\sum_{n=1}^{\infty} \mathbb{E}(I_n) = +\infty$. Since $S_n \uparrow S$ this is equivalent with
$$\lim_{n \to \infty} \mathbb{E}(S_n) = \mathbb{E}(S) = \infty. \tag{8.4}$$
On the other hand, $\omega \in A$ if and only if $\omega \in A_n$ for infinitely many $n$, which is the case if and only if $S(\omega) = +\infty$. The assertion thus is
$$\mathbb{P}(S = +\infty) = 1.$$
This can be seen as follows. By Chebyshev's inequality
$$\mathbb{P}(|S_n - \mathbb{E}(S_n)| \leq \eta) \geq 1 - \frac{\mathbb{V}(S_n)}{\eta^2}$$
for all $\eta > 0$. Because of (8.4) we may assume that $\mathbb{E} S_n > 0$ and choose $\eta = \frac{1}{2} \mathbb{E} S_n$. Hence
$$\mathbb{P}\left( S_n \geq \frac{1}{2} \mathbb{E}(S_n) \right) \geq \mathbb{P}\left( |S_n - \mathbb{E} S_n| \leq \frac{1}{2} \mathbb{E} S_n \right) \geq 1 - 4 \frac{\mathbb{V}(S_n)}{\mathbb{E}(S_n)^2}.$$
But $\mathbb{V}(S_n) \leq \mathbb{E}(S_n)$ and $\mathbb{E}(S_n) \to \infty$. Thus
$$\lim_{n \to \infty} \frac{\mathbb{V}(S_n)}{\mathbb{E}(S_n)^2} = 0.$$
Therefore, for all $\varepsilon > 0$ and all $n$ large enough,
$$\mathbb{P}\left( S_n \geq \frac{1}{2} \mathbb{E} S_n \right) \geq 1 - \varepsilon.$$
But now $S \geq S_n$ and hence also
$$\mathbb{P}\left( S \geq \frac{1}{2} \mathbb{E} S_n \right) \geq 1 - \varepsilon$$
for all $\varepsilon > 0$. But this implies $\mathbb{P}(S = +\infty) = 1$, which is what we wanted to show.

Example 8.3 Let $(X_n)$ be a sequence of real-valued random variables which satisfies
$$\sum_{n=1}^{\infty} \mathbb{P}(|X_n| > \varepsilon) < \infty \tag{8.5}$$
for all $\varepsilon > 0$. Then $X_n \to 0$ $\mathbb{P}$-a.s. Indeed, the Borel-Cantelli Lemma says that (8.5) implies
$$\mathbb{P}(|X_n| > \varepsilon \text{ infinitely often in } n) = 0.$$
But this is exactly the definition of almost sure convergence of $X_n$ to $0$.

Exercise 8.4 Is (8.5) equivalent with $\mathbb{P}$-almost sure convergence of $X_n$ to $0$?

Here is how Theorem 5.10 translates to random variables.

Theorem 8.5 (Kolmogorov's 0-1 Law) Let $(X_n)_n$ be a sequence of independent random variables with values in arbitrary measurable spaces. Then for every tail event $A$, i.e. for each $A$ with
$$A \in \bigcap_{n=1}^{\infty} \sigma(X_m, \; m \geq n),$$
it holds that $\mathbb{P}(A) \in \{0, 1\}$.

Exercise 8.6 Derive Theorem 8.5 from Theorem 5.10.

Corollary 8.7 Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent, real-valued random variables. Define
$$\mathcal{T}_\infty := \bigcap_{m=1}^{\infty} \sigma(X_i, \; i \geq m)$$
to be the tail $\sigma$-algebra. If then $T$ is a real-valued random variable that is measurable with respect to $\mathcal{T}_\infty$, then $T$ is $\mathbb{P}$-almost surely constant, i.e. there is an $\alpha \in \mathbb{R}$ such that
$$\mathbb{P}(T = \alpha) = 1.$$

Such random variables $T : \Omega \to \mathbb{R}$ that are $\mathcal{T}_\infty$-measurable are called tail functions.

Proof. For each $\gamma \in \bar{\mathbb{R}}$ we have that
$$\{T \leq \gamma\} \in \mathcal{T}_\infty.$$
This implies $\mathbb{P}(T \leq \gamma) \in \{0, 1\}$. On the other hand, $\gamma \mapsto \mathbb{P}(T \leq \gamma)$ being a distribution function, we have
$$\lim_{\gamma \to -\infty} \mathbb{P}(T \leq \gamma) = 0 \quad \text{and} \quad \lim_{\gamma \to +\infty} \mathbb{P}(T \leq \gamma) = 1.$$
Let $C := \{ \gamma \in \mathbb{R} : \mathbb{P}(T \leq \gamma) = 1 \}$ and $\alpha := \inf(C) = \inf\{ \gamma \in \mathbb{R} : \mathbb{P}(T \leq \gamma) = 1 \}$. Then for an appropriately chosen decreasing sequence $(\gamma_n) \subseteq C$ we have $\gamma_n \downarrow \alpha$, and since $\{T \leq \gamma_n\} \downarrow \{T \leq \alpha\}$ we have $\alpha \in C$. Hence $\alpha = \min\{\gamma \in C\}$. This implies
$$\mathbb{P}(T < \alpha) = 0,$$
which implies $\mathbb{P}(T = \alpha) = 1$.

Exercise 8.8 A coin is tossed infinitely often. Show that every finite sequence

$$(\omega_1, \ldots, \omega_k), \quad \omega_i \in \{H, T\}, \; k \in \mathbb{N},$$
occurs infinitely often with probability one.
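The statement of Exercise 8.8 can be illustrated by simulation (this is only an illustration, not a proof; the pattern `HTH` and the run length are chosen arbitrarily): in a long run of fair coin tosses a fixed finite pattern keeps reappearing, with occurrences spread across the whole run.

```python
import random

random.seed(5)

# one long realization of fair coin tossing, encoded as a string of H and T
tosses = "".join(random.choice("HT") for _ in range(100_000))

pattern = "HTH"
# starting indices of (possibly overlapping) occurrences of the pattern
occurrences = [i for i in range(len(tosses) - 2) if tosses[i:i + 3] == pattern]
print(len(occurrences), occurrences[-1])
```

The expected number of occurrences is about $100{,}000 / 8 = 12{,}500$, and the last occurrence typically lies very close to the end of the run, in line with "infinitely often" for the infinite sequence.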

Exercise 8.9 Try to prove (8.2) in the Borel-Cantelli Lemma for independent events $(A_n)$ as follows:

1. For each sequence $(\alpha_n)$ of real numbers with $0 \leq \alpha_n \leq 1$ we have
$$\prod_{i=1}^{n} (1 - \alpha_i) \leq \exp\left( - \sum_{i=1}^{n} \alpha_i \right). \tag{8.6}$$
This implies
$$\sum_{n=1}^{\infty} \alpha_n = \infty \;\Rightarrow\; \lim_{n \to \infty} \prod_{i=1}^{n} (1 - \alpha_i) = 0.$$

2. For $A := \limsup A_n$ we have
$$1 - \mathbb{P}(A) = \lim_{n \to \infty} \mathbb{P}\left( \bigcap_{m=n}^{\infty} A_m^c \right) = \lim_{n \to \infty} \lim_{N \to \infty} \prod_{m=n}^{N} (1 - \mathbb{P}(A_m)).$$

3. As $\sum \mathbb{P}(A_n)$ diverges, we have because of 1.
$$\lim_{N \to \infty} \prod_{m=n}^{N} (1 - \mathbb{P}(A_m)) = 0$$
and hence $\mathbb{P}(A) = 1$ because of 2.

Fill in the missing details!

9 Laws of Large Numbers

The central goal of probability theory is to describe the asymptotic behavior of a sequence of random variables. In its easiest form this has already been done for i.i.d. sequences in Introduction to Probability and Statistics. In the first theorem of this section this is slightly generalized.

Theorem 9.1 (Khintchine) Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of square integrable, real-valued random variables that are pairwise uncorrelated. Assume
$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}(X_i) = 0.$$
Then the weak law of large numbers holds, i.e.
$$\lim_{n \to \infty} \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E} X_i \right| > \varepsilon \right) = 0 \quad \text{for all } \varepsilon > 0.$$

Proof. By Chebyshev's inequality, for each $\varepsilon > 0$:

$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} (X_i - \mathbb{E} X_i) \right| > \varepsilon \right) \leq \frac{1}{\varepsilon^2} \mathbb{V}\left( \frac{1}{n} \sum_{i=1}^{n} (X_i - \mathbb{E} X_i) \right) = \frac{1}{\varepsilon^2} \frac{1}{n^2} \mathbb{V}\left( \sum_{i=1}^{n} (X_i - \mathbb{E} X_i) \right) = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}(X_i - \mathbb{E} X_i) = \frac{1}{\varepsilon^2} \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}(X_i).$$
Here we used that the random variables are pairwise uncorrelated. By assumption the latter expression converges to zero.
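The Chebyshev estimate in the proof can be checked numerically. For i.i.d. Uniform$(0,1)$ variables one has $\mathbb{V}(X_1) = 1/12$, so the bound on $\mathbb{P}(|\bar{X}_n - 1/2| > \varepsilon)$ is $\frac{1}{12 n \varepsilon^2}$. The following Monte Carlo sketch (an illustration; sample sizes, $\varepsilon$, and the name `tail_prob` are our choices) estimates the tail probability for two values of $n$ and compares with the bound.

```python
import random

random.seed(2)

def tail_prob(n, eps=0.05, trials=2000):
    # Monte Carlo estimate of P(|mean of n uniforms - 1/2| > eps)
    count = 0
    for _ in range(trials):
        m = sum(random.random() for _ in range(n)) / n
        if abs(m - 0.5) > eps:
            count += 1
    return count / trials

p10, p1000 = tail_prob(10), tail_prob(1000)
chebyshev_bound = (1 / 12) / (1000 * 0.05**2)  # V(X_1) / (n eps^2) for n = 1000
print(p10, p1000, chebyshev_bound)
```

For $n = 1000$ the estimated probability is essentially zero, well below the Chebyshev bound, while for $n = 10$ deviations of this size are still common.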

Remark 9.2 As we will learn in the next theorem, for an independent, identically distributed sequence square integrability is not even required.

Theorem 9.1 raises the question whether we can replace the stochastic convergence there by almost sure convergence. This will be shown in the following theorem. Such a theorem is called a strong law of large numbers. Its first form was proved by Kolmogorov. We will present a proof due to Etemadi from 1981.

Theorem 9.3 (Strong Law of Large Numbers, Etemadi 1981) For each sequence $(X_n)_n$ of real-valued, pairwise independent, identically distributed, integrable random variables the Strong Law of Large Numbers holds, i.e.
$$\mathbb{P}\left( \limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \mathbb{E} X_1 \right| > \varepsilon \right) = 0 \quad \text{for each } \varepsilon > 0.$$

Before we prove Theorem 9.3 let us make a couple of remarks. These should reveal the structure of the proof a bit:

1. Denote $S_n = \sum_{i=1}^{n} X_i$. Then Theorem 9.3 asserts that $\frac{1}{n} S_n \to \eta := \mathbb{E} X_1$, $\mathbb{P}$-almost surely.

2. Together with $X_n$ also $X_n^+$ and $X_n^-$ (where $X_n^+ = \max(X_n, 0)$ and $X_n^- = (-X_n)^+$) satisfy the assumptions of Theorem 9.3. Since $X_n = X_n^+ - X_n^-$ it therefore suffices to prove Theorem 9.3 for positive random variables. We therefore assume $X_n \geq 0$ for the rest of the proof.

3. All proofs of the Strong Law of Large Numbers use the following trick: we truncate the random variables $X_n$ by cutting off values that are too large. We therefore introduce
$$Y_n := X_n 1_{\{X_n < n\}}.$$

Of course, if $\mu$ is the distribution of $X_n$ and $\mu_n$ is the distribution of $Y_n$, then $\mu_n \neq \mu$. Indeed $\mu_n = f_n(\mu)$, where
$$f_n(x) := \begin{cases} x & \text{if } 0 \leq x < n, \\ 0 & \text{otherwise.} \end{cases}$$
The idea behind truncation is that we gain square integrability of the sequence.

Indeed:
$$\mathbb{E}(Y_n^2) = \mathbb{E}(f_n^2 \circ X_n) = \int f_n^2(x) \, d\mu(x) = \int_0^n x^2 \, d\mu(x) < \infty.$$

4. Of course, after having gained information about the $Y_n$ we need to translate these results back to the $X_n$. To this end we will apply the Borel-Cantelli Lemma and show that
$$\sum_{n=1}^{\infty} \mathbb{P}(X_n \neq Y_n) < \infty.$$
This implies that with probability one $X_n \neq Y_n$ for only finitely many $n$. In particular, if we can show that $\frac{1}{n} \sum_{i=1}^{n} Y_i \to \eta$ $\mathbb{P}$-a.s., we can also show that $\frac{1}{n} \sum_{i=1}^{n} X_i \to \eta$ $\mathbb{P}$-a.s.

5. For the purposes of the proof we remark the following: let $\alpha > 1$ and for $n \in \mathbb{N}$ let
$$k_n := [\alpha^n]$$
denote the largest integer $\leq \alpha^n$. This means $k_n \in \mathbb{N}$ and
$$k_n \leq \alpha^n < k_n + 1.$$
Since
$$\lim_{n \to \infty} \frac{\alpha^n - 1}{\alpha^n} = 1,$$
there is a number $c_\alpha$, $0 < c_\alpha < 1$, such that
$$k_n > \alpha^n - 1 \geq c_\alpha \alpha^n \quad \text{for all } n \in \mathbb{N}. \tag{9.1}$$
We now turn to the proof of Theorem 9.3.

Step 1: Without loss of generality $X_n \geq 0$. Define $Y_n = X_n 1_{\{X_n < n\}}$ and
$$S_n' := \sum_{i=1}^{n} (Y_i - \mathbb{E} Y_i).$$
Let $\varepsilon > 0$ and $\alpha > 1$. Using Chebyshev's inequality and the pairwise independence of the random variables $(Y_n)$ we obtain
$$\mathbb{P}\left( \left| \frac{1}{n} S_n' \right| > \varepsilon \right) \leq \frac{1}{\varepsilon^2} \mathbb{V}\left( \frac{1}{n} S_n' \right) = \frac{1}{\varepsilon^2} \frac{1}{n^2} \mathbb{V}(S_n') = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} \mathbb{V}(Y_i).$$

Observe that $\mathbb{V}(Y_i) = \mathbb{E}(Y_i^2) - (\mathbb{E}(Y_i))^2 \leq \mathbb{E}(Y_i^2)$. Thus
$$\mathbb{P}\left( \left| \frac{1}{n} S_n' \right| > \varepsilon \right) \leq \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^{n} \mathbb{E}(Y_i^2).$$
For $k_n = [\alpha^n]$ this gives
$$\mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \right) \leq \frac{1}{\varepsilon^2 k_n^2} \sum_{i=1}^{k_n} \mathbb{E}(Y_i^2)$$
for all $n \in \mathbb{N}$. Thus
$$\sum_{n=1}^{\infty} \mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \right) \leq \frac{1}{\varepsilon^2} \sum_{n=1}^{\infty} \frac{1}{k_n^2} \sum_{i=1}^{k_n} \mathbb{E}(Y_i^2).$$

By rearranging the order of summation we obtain
$$\sum_{n=1}^{\infty} \mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \right) \leq \frac{1}{\varepsilon^2} \sum_{j=1}^{\infty} t_j \, \mathbb{E}(Y_j^2),$$
where
$$t_j := \sum_{n=n_j}^{\infty} \frac{1}{k_n^2}$$
and $n_j$ is the smallest $n$ with $k_n \geq j$. From (9.1) we obtain
$$t_j \leq \frac{1}{c_\alpha^2} \sum_{n=n_j}^{\infty} \alpha^{-2n} = \frac{1}{c_\alpha^2} \frac{1}{1 - \alpha^{-2}} \, \alpha^{-2 n_j} = d_\alpha \alpha^{-2 n_j},$$
where $d_\alpha = c_\alpha^{-2} (1 - \alpha^{-2})^{-1} > 0$. Since $\alpha^{n_j} \geq k_{n_j} \geq j$, this implies
$$t_j \leq d_\alpha j^{-2}.$$
Using the above and $\mathbb{E}(Y_j^2) = \int_0^j x^2 \, d\mu(x)$,
$$\sum_{n=1}^{\infty} \mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \right) \leq \frac{d_\alpha}{\varepsilon^2} \sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x).$$

Again rearranging the order of summation yields
$$\sum_{j=1}^{\infty} \frac{1}{j^2} \sum_{k=1}^{j} \int_{k-1}^{k} x^2 \, d\mu(x) = \sum_{k=1}^{\infty} \left( \sum_{j=k}^{\infty} \frac{1}{j^2} \right) \int_{k-1}^{k} x^2 \, d\mu(x).$$
Since
$$\sum_{j=k}^{\infty} \frac{1}{j^2} < \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \left( \frac{1}{k} - \frac{1}{k+1} \right) + \left( \frac{1}{k+1} - \frac{1}{k+2} \right) + \ldots = \frac{1}{k^2} + \frac{1}{k} \leq \frac{2}{k},$$
this yields
$$\sum_{n=1}^{\infty} \mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \right) \leq \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} \frac{x^2}{k} \, d\mu(x) \leq \frac{2 d_\alpha}{\varepsilon^2} \sum_{k=1}^{\infty} \int_{k-1}^{k} x \, d\mu(x) = \frac{2 d_\alpha}{\varepsilon^2} \mathbb{E}(X_1) < \infty.$$

Thus by the Borel-Cantelli Lemma
$$\mathbb{P}\left( \left| \frac{1}{k_n} S_{k_n}' \right| > \varepsilon \text{ infinitely often in } n \right) = 0.$$
But this is equivalent with the almost sure convergence
$$\lim_{n \to \infty} \frac{1}{k_n} S_{k_n}' = 0 \quad \mathbb{P}\text{-a.s.} \tag{9.2}$$

Step 2: Next let us see that $\frac{1}{k_n} \sum_{i=1}^{k_n} Y_i$ can indeed only converge to $\mathbb{E}(X_1)$. By definition of $Y_n$ we have that
$$\mathbb{E}(Y_n) = \int x \, d\mu_n(x) = \int_0^n x \, d\mu(x).$$
Thus by monotone convergence
$$\mathbb{E}(X_1) = \lim_{n \to \infty} \mathbb{E}(Y_n).$$
By Exercise 9.6 below this implies
$$\mathbb{E}(X_1) = \lim_{n \to \infty} \frac{1}{n} \left( \mathbb{E} Y_1 + \ldots + \mathbb{E} Y_n \right). \tag{9.3}$$

By definition of the sums $S_n'$ we have
$$\frac{1}{k_n} S_{k_n}' = \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i - \frac{1}{k_n} \sum_{i=1}^{k_n} \mathbb{E}(Y_i);$$
(9.2) and (9.3) together imply
$$\lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} Y_i = \lim_{n \to \infty} \frac{1}{k_n} S_{k_n}' + \lim_{n \to \infty} \frac{1}{k_n} \sum_{i=1}^{k_n} \mathbb{E}(Y_i) = \mathbb{E} X_1 \quad \mathbb{P}\text{-a.s.},$$
which is what we wanted to show in this step.

Step 3: Now we are aiming at removing the truncation from the $X_n$. Consider the sum
$$\sum_{n=1}^{\infty} \mathbb{P}(X_n \neq Y_n) = \sum_{n=1}^{\infty} \mathbb{P}(X_n \geq n).$$
According to Exercise 3.4 this is smaller than $\mathbb{E}(X_1)$, so it is bounded. Therefore
$$\mathbb{P}(X_n \neq Y_n \text{ infinitely often}) = 0.$$
Hence there is an $n_0$ (random) such that with probability one $X_n = Y_n$ for all $n \geq n_0$. But the finitely many differences drop out when averaging, hence
$$\lim_{n \to \infty} \frac{1}{k_n} S_{k_n} = \mathbb{E} X_1 \quad \mathbb{P}\text{-a.s.}$$

Step 4: Eventually we show that the theorem holds not only for subsequences $(k_n)$ chosen as above, but also for the whole sequence. For fixed $\alpha > 1$ the sequence $(k_n)$ is, of course, fixed and diverges to $+\infty$. Hence for every $m \in \mathbb{N}$ there exists $n \in \mathbb{N}$ such that
$$k_n < m \leq k_{n+1}.$$
Since we assumed the $X_i$ to be non-negative, this implies
$$S_{k_n} \leq S_m \leq S_{k_{n+1}},$$
hence
$$\frac{S_{k_n}}{k_n} \cdot \frac{k_n}{m} \leq \frac{S_m}{m} \leq \frac{S_{k_{n+1}}}{k_{n+1}} \cdot \frac{k_{n+1}}{m}.$$
The definition of $k_n$ yields
$$k_n \leq \alpha^n < k_n + 1 \leq m \leq k_{n+1} \leq \alpha^{n+1}.$$
This gives
$$\frac{k_{n+1}}{m} < \frac{\alpha^{n+1}}{\alpha^n} = \alpha,$$
as well as
$$\frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}}.$$
Now, given $\alpha$, for all $n \geq n_1 = n_1(\alpha)$ we have $\alpha^n - 1 \geq \alpha^{n-1}$. Hence, if $m \geq k_{n_1}$ and thus $n \geq n_1$, we obtain
$$\frac{k_n}{m} > \frac{\alpha^n - 1}{\alpha^{n+1}} \geq \frac{\alpha^{n-1}}{\alpha^{n+1}} = \frac{1}{\alpha^2}.$$

Now for each $\alpha$ we have a set $\Omega_\alpha$ with $\mathbb{P}(\Omega_\alpha) = 1$ and
$$\lim_{n \to \infty} \frac{1}{k_n} S_{k_n}(\omega) = \mathbb{E} X_1 \quad \text{for all } \omega \in \Omega_\alpha.$$

Without loss of generality we may assume that the $X_i$ are not $\mathbb{P}$-a.s. identically equal to zero; otherwise the assertion of the Strong Law of Large Numbers is trivially true. Therefore we may assume without loss of generality that $\mathbb{E} X_1 > 0$. Since $\alpha > 1$ we then have
$$\frac{1}{\alpha} \mathbb{E} X_1 < \frac{1}{k_n} S_{k_n}(\omega) < \alpha \, \mathbb{E} X_1$$
for all $\omega \in \Omega_\alpha$ and all $n$ large enough. For such $m$ and $\omega$ this means
$$\left( \alpha^{-3} - 1 \right) \mathbb{E} X_1 < \frac{1}{m} S_m(\omega) - \mathbb{E} X_1 < \left( \alpha^2 - 1 \right) \mathbb{E} X_1.$$
Define
$$\Omega_1 := \bigcap_{n=1}^{\infty} \Omega_{1 + \frac{1}{n}}.$$
Then $\mathbb{P}(\Omega_1) = 1$ and
$$\lim_{m \to \infty} \frac{1}{m} S_m(\omega) = \mathbb{E} X_1$$
for all $\omega \in \Omega_1$. This proves the theorem.

Remark 9.4 Theorem 9.3 in particular implies that for i.i.d. sequences of random variables $(X_n)$ with a finite first moment the Strong Law of Large Numbers holds true. Since stochastic convergence is implied by almost sure convergence, also Theorem 9.1 (the Weak Law of Large Numbers) holds true for such sequences. Therefore the finiteness of the second moment is not necessary for Theorem 9.1 to hold for i.i.d. sequences.
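A single long realization illustrates the almost sure convergence in Theorem 9.3. The sketch below (an illustration with a fair die, not part of the proof; the names `running` and `total` are ours) tracks the running average $S_n/n$ of i.i.d. die rolls, which settles near $\mathbb{E} X_1 = 3.5$.

```python
import random

random.seed(3)

# one long realization of i.i.d. fair die rolls; record the running average
# S_n / n at a few checkpoints
n = 200_000
total = 0
running = []
for i in range(1, n + 1):
    total += random.randint(1, 6)
    if i in (100, 10_000, n):
        running.append(total / i)
print(running)
```

At the final checkpoint the running average differs from $3.5$ by a few thousandths, consistent with the typical fluctuation of order $1/\sqrt{n}$ that the Central Limit Theorem of the next section quantifies.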

Remark 9.5 One might, of course, ask whether a finite first moment is necessary for Theorem 9.3 to hold. Indeed one can prove that, if a sequence of i.i.d. random variables is such that $\frac{1}{n} \sum_{i=1}^{n} X_i$ converges almost surely to some random variable $Y$ (necessarily a tail function as in Corollary 8.7!), then $\mathbb{E} X_1$ exists and $Y = \mathbb{E} X_1$ almost surely. This will not be shown in the context of this course.

Exercise 9.6 Let $(a_m)$ be real numbers such that $\lim_{m \to \infty} a_m = a$. Show that this implies the convergence of their Cesaro means:
$$\lim_{n \to \infty} \frac{1}{n} (a_1 + a_2 + \ldots + a_n) = a.$$

Exercise 9.7 Prove the Strong Law of Large Numbers for a sequence of i.i.d. random variables $(X_n)_n$ with a finite fourth moment, i.e. for random variables with $\mathbb{E}(X_1^4) < \infty$. Do not use the statement of Theorem 9.3 explicitly.

Remark 9.8 A very natural question to ask in the context of Theorem 9.3 is: how fast does $\frac{1}{n} S_n$ converge to $\mathbb{E} X_1$? I.e., given a sequence of i.i.d. random variables $(X_n)_n$, what is
$$\mathbb{P}\left( \left| \frac{1}{n} \sum X_i - \mathbb{E} X_1 \right| \geq \varepsilon \right)?$$

If $X_1$ has a finite logarithmic moment generating function, i.e. if $M(t) := \log \mathbb{E} e^{t X_1} < \infty$ for all $t$, the answer is: it converges exponentially fast. Indeed, Cramér's theorem (which cannot be proven in the context of this course) asserts the following: let $I : \mathbb{R} \to \mathbb{R}$ be given by
$$I(x) = \sup_t \, [xt - M(t)].$$
Then for every closed set $A \subset \mathbb{R}$
$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \in A \right) \leq - \inf_{x \in A} I(x),$$
and for every open set $\mathcal{O} \subset \mathbb{R}$
$$\liminf_{n \to \infty} \frac{1}{n} \log \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \in \mathcal{O} \right) \geq - \inf_{x \in \mathcal{O}} I(x).$$
This is called a principle of large deviations for the random variables $\frac{1}{n} \sum_{i=1}^{n} X_i$. In particular, one can show that the function $I$ is convex and non-negative, with
$$I(x) = 0 \;\Leftrightarrow\; x = \mathbb{E} X_1.$$
We therefore obtain
$$\forall \delta > 0 \; \exists N \; \forall n > N : \quad \mathbb{P}\left( \left| \frac{1}{n} \sum X_i - \mathbb{E} X_1 \right| \geq \varepsilon \right) \leq e^{- n \min\left( \bar{I}(\mathbb{E} X_1 + \varepsilon), \, \bar{I}(\mathbb{E} X_1 - \varepsilon) \right) + n \delta},$$
where $\bar{I}$ is the $I$-function introduced above, evaluated for the random variables $X_i - \mathbb{E} X_1$. The speed of convergence is thus exponentially fast.

Example 9.9 For a Bernoulli $B(1, p)$ random variable $X$,
$$e^{M(t)} = \mathbb{E}(e^{tX}) = p e^t + (1 - p); \qquad I(x) = x \log \frac{x}{p} + (1 - x) \log \frac{1 - x}{1 - p}.$$

Exercise 9.10 Determine the functions $M$ and $I$ for a normally distributed random variable.
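The Bernoulli rate function of Example 9.9 can be compared directly with exact binomial tail probabilities, since $-\frac{1}{n} \log \mathbb{P}(S_n \geq nx)$ should approach $I(x)$ as $n$ grows. The sketch below (an illustration; the function names `rate_I` and `tail_exponent` and the parameters $p = 0.5$, $x = 0.7$ are our choices) computes both quantities exactly. By the Chernoff bound the empirical exponent always lies above $I(x)$ for $x > p$.

```python
import math

def rate_I(x, p):
    # I(x) = x log(x/p) + (1-x) log((1-x)/(1-p)) for a Bernoulli(p) variable
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def tail_exponent(n, x, p):
    # -(1/n) log P(S_n >= n x), with the binomial tail computed exactly
    k0 = math.ceil(n * x)
    tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))
    return -math.log(tail) / n

p, x = 0.5, 0.7
print(rate_I(x, p), tail_exponent(200, x, p), tail_exponent(800, x, p))
```

Already at $n = 800$ the exact exponent agrees with $I(0.7) \approx 0.0823$ to within a few thousandths; the remaining gap is the subexponential prefactor that the large deviation principle does not capture.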

Exercise 9.11 Argue that if the moment generating function of a random variable X is finite, all its moments are finite. In particular both Laws of Large Numbers are applicable to a sequence X1, X2, . . . of iid random variables distributed like X.

At the end of this section we will turn to two applications of the Law of Large Numbers which are interesting in their own right.

The first of these two applications is in number theory. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be given by $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}^1 \cap \Omega$ and $\mathbb{P} = \lambda^1|_\Omega$ (Lebesgue measure). For every number $\omega \in \Omega$ we may consider its $g$-adic representation
$$\omega = \sum_{n=1}^{\infty} \xi_n g^{-n}. \tag{9.4}$$
Here $g \geq 2$ is a natural number and $\xi_n \in \{0, \ldots, g - 1\}$. This representation is unique if we ask that the digits $\xi_n$ are not eventually all equal to $g - 1$. For each $\varepsilon \in \{0, \ldots, g - 1\}$ let $S_n^{\varepsilon, g}(\omega)$ be the number of all $i \in \{1, \ldots, n\}$ with $\xi_i(\omega) = \varepsilon$ in the $g$-adic representation (9.4). We will call a number $\omega \in [0, 1)$ $g$-normal¹ if
$$\lim_{n \to \infty} \frac{1}{n} S_n^{\varepsilon, g}(\omega) = \frac{1}{g}$$

¹The usual meaning of $g$-normality is stronger: each string of digits $\varepsilon_1 \varepsilon_2 \ldots \varepsilon_k$ occurs with frequency $g^{-k}$.

for all $\varepsilon = 0, \ldots, g - 1$. Hence $\omega$ is $g$-normal if in the long run all of its digits occur with the same frequency. We will call $\omega$ absolutely normal if $\omega$ is $g$-normal for all $g \in \mathbb{N}$, $g \geq 2$.

Now for a number $\omega \in [0, 1)$ randomly chosen according to Lebesgue measure the $\xi_i(\omega)$ are i.i.d. random variables; their distribution is the uniform distribution on the set $\{0, \ldots, g - 1\}$. This has to be shown in Exercise 9.13 below and is a consequence of the uniformity of Lebesgue measure. Hence the random variables
$$X_n^{\varepsilon, g}(\omega) = \begin{cases} 1 & \text{if } \xi_n(\omega) = \varepsilon, \\ 0 & \text{otherwise,} \end{cases}$$
are i.i.d. for each $g$ and $\varepsilon$. Moreover $S_n^{\varepsilon, g}(\omega) = \sum_{i=1}^{n} X_i^{\varepsilon, g}(\omega)$. According to the Strong Law of Large Numbers (Theorem 9.3)
$$\frac{1}{n} S_n^{\varepsilon, g}(\omega) \to \mathbb{E}\left( X_1^{\varepsilon, g} \right) = \frac{1}{g} \quad \lambda^1\text{-a.s.}$$
for all $\varepsilon \in \{0, \ldots, g - 1\}$ and all $g \geq 2$. This means $\lambda^1$-almost every number $\omega$ is $g$-normal, i.e. there is a set $N_g$ with $\lambda^1(N_g) = 0$ such that $\omega$ is $g$-normal for all $\omega \in N_g^c$. Now
$$N := \bigcup_{g=2}^{\infty} N_g$$
is a set of Lebesgue measure zero as well. This readily implies

Theorem 9.12 (E. Borel) $\lambda^1$-almost every $\omega \in [0, 1)$ is absolutely normal.

It is rather surprising that hardly any normal numbers (in the usual meaning, see the footnote above) are known. Champernowne (1933) showed that
$$\omega = 0.1234567891011121314\ldots$$
is 10-normal. Whether $\sqrt{2}$, $\log 2$, $e$ or $\pi$ is normal in any base has not been shown yet. No absolutely normal numbers are known explicitly at all.
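Champernowne's number is concrete enough to inspect directly. The sketch below (an illustration; the cutoff $10^5$ and the names `digits`, `freqs` are our choices) counts the digit frequencies in the initial segment of $0.123456789101112\ldots$ obtained by concatenating $1, 2, \ldots, 99999$. The convergence to $1/10$ is known to be slow: leading digits are over-represented in any finite segment, which is visible in the output.

```python
from collections import Counter

# initial digits of Champernowne's number 0.123456789101112...
digits = "".join(str(k) for k in range(1, 100_000))
counts = Counter(digits)
freqs = {d: counts[d] / len(digits) for d in "0123456789"}
print(len(digits), freqs)
```

In this segment the digits 1 through 9 each occur exactly 50000 times (the map swapping two nonzero digits is a bijection of $\{1, \ldots, 99999\}$), while 0 occurs only 38889 times because it never leads; all frequencies are already within about $0.02$ of $1/10$.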

Exercise 9.13 Show that for every $g \geq 2$ the random variables $\xi_n(\omega)$ introduced above are i.i.d. and uniformly distributed on $\{0, \ldots, g - 1\}$.

The second application derives a classical result from analysis (which in principle has nothing to do with probability theory) from the Strong Law of Large Numbers. As may be well known, the approximation theorem of Stone and Weierstraß asserts that every continuous function on $[a, b]$ (more generally, on every compact set) can be approximated uniformly by polynomials. Obviously it suffices to prove this for $[a, b] = [0, 1]$. So let $f \in \mathcal{C}([0, 1])$ be a continuous function on $[0, 1]$. Define the $n$-th Bernstein polynomial for $f$ as
$$B_n f(x) = \sum_{k=0}^{n} f\left( \frac{k}{n} \right) \binom{n}{k} x^k (1 - x)^{n - k}.$$

Theorem 9.14 For each $f \in \mathcal{C}([0, 1])$ the polynomials $B_n f$ converge to $f$ uniformly on $[0, 1]$.

Proof. Since f is continuous and [0, 1] is compact f is uniformly continuous on [0, 1], i.e. for each ε > 0 there exists δ > 0 such that

$$|x - y| < \delta \;\Rightarrow\; |f(x) - f(y)| < \varepsilon.$$

Now consider a sequence of i.i.d. Bernoulli random variables with parameter $p$, $(X_n)_n$. Call
$$S_n^* := \frac{1}{n} S_n := \frac{1}{n} \sum_{i=1}^{n} X_i.$$
Then by Chebyshev's inequality
$$\mathbb{P}(|S_n^* - p| \geq \delta) \leq \frac{1}{\delta^2} \mathbb{V}(S_n^*) = \frac{1}{n^2 \delta^2} \mathbb{V}(S_n) = \frac{p(1 - p)}{n \delta^2} \leq \frac{1}{4 n \delta^2}. \tag{9.5}$$
This yields
$$|B_n f(p) - f(p)| = |\mathbb{E}(f \circ S_n^*) - f(p)| = \left| \int f(x) \, d\mathbb{P}_{S_n^*}(x) - f(p) \right|$$
$$\leq \int_{\{|S_n^* - p| < \delta\}} |f(S_n^*(x)) - f(p)| \, d\mathbb{P}(x) + \int_{\{|S_n^* - p| \geq \delta\}} |f(S_n^*(x)) - f(p)| \, d\mathbb{P}(x)$$
$$\leq \varepsilon + 2 \|f\| \, \mathbb{P}(|S_n^* - p| \geq \delta) \leq \varepsilon + \frac{2 \|f\|}{4 n \delta^2}.$$
Here $\|f\|$ is the sup-norm of $f$. Hence
$$\sup_{p \in [0, 1]} |B_n f(p) - f(p)| \leq \varepsilon + \frac{2 \|f\|}{4 n \delta^2}.$$
This can be made smaller than $2\varepsilon$ by choosing $n$ large enough.

Notice that the Weak Law of Large Numbers by itself would yield, instead of inequality (9.5), only an inequality of the kind
$$\forall \rho > 0 \; \exists N \text{ such that } \forall n \geq N : \quad \mathbb{P}(|S_n^* - p| \geq \delta) \leq \rho.$$
In this approach it is not clear that $N$ can be chosen independently of $p$, so we would only get pointwise convergence.
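The Bernstein polynomials are easy to evaluate directly. The sketch below (an illustration; the test function $f(x) = |x - 1/2|$, the grid, and the names `bernstein`, `err` are our choices) computes $B_n f$ for a continuous but non-smooth $f$ and shows the uniform error shrinking as $n$ grows, in line with Theorem 9.14.

```python
import math

def bernstein(f, n, x):
    # B_n f(x) = sum_k f(k/n) * C(n,k) * x^k * (1-x)^(n-k)
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous on [0,1], not differentiable at 1/2

grid = [i / 100 for i in range(101)]
def err(n):
    # approximate sup-norm error on a grid
    return max(abs(bernstein(f, n, x) - f(x)) for x in grid)

e10, e100 = err(10), err(100)
print(e10, e100)
```

The worst error sits at the kink $x = 1/2$, where it behaves like $\mathbb{E}|S_n^* - 1/2| \approx 0.4/\sqrt{n}$, so the decay is visibly slower than for smooth $f$; note also $B_n f(0) = f(0)$ exactly.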

10 The Central Limit Theorem

In the previous section we met one of the central theorems of probability theory, the Law of Large Numbers: if $\mathbb{E} X_1$ exists, then for a sequence of i.i.d. random variables $(X_n)$ the average $\frac{1}{n} \sum_{i=1}^{n} X_i$ converges to $\mathbb{E} X_1$ (a.s.). The following theorem, called the Central Limit Theorem, analyzes the fine structure in the Law of Large Numbers. Its name is due to Polya; the proof given below goes back to Charles Stein.

First of all notice that, in a certain sense, in order to analyze the fine structure of $\sum_{i=1}^{n} X_i$ the scaling $\frac{1}{n}$ of the Weak Law of Large Numbers is already an overscaling. On this scale we just cannot see the shape of the distribution anymore: by scaling $\sum_{i=1}^{n} X_i$ by a factor $\frac{1}{n}$ we have reduced its variance to the scale $\frac{1}{n}$, which converges to zero. What we see in the Law of Large Numbers is a bell-shaped curve with a tiny, tiny width. Here is what we get if we scale the variance to one:

Theorem 10.1 (Central Limit Theorem, CLT) Let $X_1, \ldots, X_n$ be a sequence of random variables that are independent and have identical distribution (the same for all $n$) with $\mathbb{E} X_1^2 < \infty$. Then
$$\lim_{n \to \infty} \mathbb{P}\left( \frac{\sum_{i=1}^{n} (X_i - \mathbb{E} X_1)}{\sqrt{n \, \mathbb{V} X_1}} \leq a \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx. \tag{10.1}$$
Before proving the Central Limit Theorem let us remark that it holds under weaker assumptions as well.

Remark 10.2 Indeed, the Central Limit Theorem also holds under the following weaker assumption. Assume that for each $n = 1, 2, \ldots$ we are given an independent family of random variables $X_{ni}$, $i = 1, \ldots, n$. For $j = 1, \ldots, n$ let $\eta_{nj} = \mathbb{E} X_{nj}$ and
$$s_n := \sqrt{ \sum_{i=1}^{n} \mathbb{V} X_{ni} }.$$
The sequence $\left( (X_{ni})_{i=1}^{n} \right)_n$ is said to satisfy the Lindeberg condition if $L_n(\varepsilon) \to 0$ as $n \to \infty$ for all $\varepsilon > 0$. Here
$$L_n(\varepsilon) = \frac{1}{s_n^2} \sum_{j=1}^{n} \mathbb{E}\left[ (X_{nj} - \eta_{nj})^2 ; \; |X_{nj} - \eta_{nj}| \geq \varepsilon s_n \right].$$
Intuitively speaking, the Lindeberg condition asks that none of the variables dominates the whole sum. The generalized form of the CLT stated above now asserts that if the sequence $(X_{ni})$ satisfies the Lindeberg condition, it also satisfies the CLT, i.e.
$$\lim_{n \to \infty} \mathbb{P}\left( \frac{\sum_{i=1}^{n} (X_{ni} - \eta_{ni})}{\sqrt{\sum_{i=1}^{n} \mathbb{V} X_{ni}}} \leq a \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2} \, dx.$$
The proof of this more general theorem basically mimics the proof we will give below for Theorem 10.1. We spare ourselves the additional technical work.
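The convergence in (10.1) can be watched numerically. The sketch below (an illustration with Uniform$(0,1)$ summands; $n$, the number of trials, and the name `standardized_sum` are our choices) compares the empirical distribution function of the standardized sum at two points with the standard normal distribution function $\Phi$.

```python
import math
import random

random.seed(4)

def standardized_sum(n):
    # (sum X_i - n E X_1) / sqrt(n V X_1) for X_i ~ Uniform(0,1),
    # where E X_1 = 1/2 and V X_1 = 1/12
    s = sum(random.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12)

n, trials = 50, 20_000
samples = [standardized_sum(n) for _ in range(trials)]

phi = lambda a: 0.5 * (1 + math.erf(a / math.sqrt(2)))      # standard normal CDF
est = lambda a: sum(1 for z in samples if z <= a) / trials  # empirical CDF

print(est(0.0), phi(0.0), est(1.0), phi(1.0))
```

Already for $n = 50$ the empirical values agree with $\Phi(0) = 0.5$ and $\Phi(1) \approx 0.8413$ up to the Monte Carlo error of a few tenths of a percent.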

We will present a proof of the CLT that goes back to Charles Stein. It is based on a couple of facts.

Fact 1: It suffices to prove the CLT for i.i.d. random variables with $\mathbb{E} X_1 = 0$. Otherwise one just subtracts $\mathbb{E} X_1$ from each of the $X_i$.

Fact 2: Define
$$S_n := \sum_{i=1}^{n} X_i \quad \text{and} \quad \sigma^2 := \mathbb{V}(X_1).$$
Theorem 10.1 asserts the convergence in distribution of the (normalized) $S_n$ to a Gaussian random variable. What we need to show is thus
$$\mathbb{E}\left[ f\left( \frac{S_n}{\sqrt{n \sigma^2}} \right) \right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x) e^{-x^2/2} \, dx = \mathbb{E}[f(Y)] \tag{10.2}$$
as $n \to \infty$ for all $f : \mathbb{R} \to \mathbb{R}$ that are uniformly continuous and bounded. Here $Y$ is a standard normal random variable, i.e. it is $\mathcal{N}(0, 1)$ distributed.

We prepare the proof of the CLT with two lemmata.

Lemma 10.3 Let $f : \mathbb{R} \to \mathbb{R}$ be bounded and uniformly continuous. Define
$$\mathcal{N}(f) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(y) e^{-y^2/2} \, dy$$
and
$$g(x) := e^{x^2/2} \int_{-\infty}^{x} (f(y) - \mathcal{N}(f)) \, e^{-y^2/2} \, dy.$$
Then $g$ fulfills
$$g'(x) - x g(x) = f(x) - \mathcal{N}(f). \tag{10.3}$$

Proof. Differentiating $g$ gives
$$g'(x) = x e^{x^2/2} \int_{-\infty}^{x} (f(y) - \mathcal{N}(f)) \, e^{-y^2/2} \, dy + e^{x^2/2} (f(x) - \mathcal{N}(f)) \, e^{-x^2/2} = x g(x) + f(x) - \mathcal{N}(f).$$

The importance of the above lemma becomes obvious if we substitute a random variable $X$ into (10.3) and take expectations:
$$\mathbb{E}[g'(X) - X g(X)] = \mathbb{E}[f(X) - \mathcal{N}(f)].$$
If $X \sim \mathcal{N}(0, 1)$ is standard normal, the right hand side is zero and so is the left hand side. The idea is thus that instead of showing that
$$\mathbb{E}[f(U_n) - \mathcal{N}(f)]$$
converges to zero (with $U_n := S_n / \sqrt{n \sigma^2}$), we may show the same for
$$\mathbb{E}[g'(U_n) - U_n g(U_n)].$$
The next step discusses the function $g$ introduced above.
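The characterizing identity $\mathbb{E}[g'(Z) - Z g(Z)] = 0$ for standard normal $Z$ can be verified numerically for a concrete smooth $g$. The sketch below (an illustration with $g = \sin$, so both $\mathbb{E}[g'(Z)]$ and $\mathbb{E}[Z g(Z)]$ equal $e^{-1/2}$; the quadrature routine `gauss_expect` and its parameters are our choices) computes both sides by deterministic trapezoidal integration against the Gaussian density.

```python
import math

def gauss_expect(h, a=-8.0, b=8.0, m=100_000):
    # trapezoidal approximation of E[h(Z)] for Z ~ N(0,1);
    # the tail beyond |x| = 8 is negligible
    step = (b - a) / m
    total = 0.0
    for i in range(m + 1):
        x = a + i * step
        w = 0.5 if i in (0, m) else 1.0
        total += w * h(x) * math.exp(-x * x / 2)
    return total * step / math.sqrt(2 * math.pi)

lhs = gauss_expect(math.cos)                   # E[g'(Z)] for g = sin
rhs = gauss_expect(lambda x: x * math.sin(x))  # E[Z g(Z)]
print(lhs, rhs, math.exp(-0.5))
```

Both sides agree with each other and with the closed-form value $e^{-1/2}$ to high precision, illustrating why the left hand side of (10.3) vanishes in expectation exactly for the standard normal distribution.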

Lemma 10.4 Let $f : \mathbb{R} \to \mathbb{R}$ be bounded and uniformly continuous and let $g$ be the solution of
$$g'(x) - x g(x) = f(x) - \mathcal{N}(f). \tag{10.4}$$
Then $g(x)$, $x g(x)$ and $g'(x)$ are bounded and continuous.

Proof. Obviously $g$ is even differentiable, hence continuous. But then $x g(x)$ is continuous as well. Eventually
$$g'(x) = x g(x) + f(x) - \mathcal{N}(f)$$
is continuous as the sum of continuous functions. For the boundedness part, first note that any continuous function on a compact set is bounded; hence we only need to check that the functions $g$, $xg$, and $g'$ are bounded as $|x| \to \infty$. To this end first note that
$$g(x) = e^{x^2/2} \int_{-\infty}^{x} (f(y) - \mathcal{N}(f)) \, e^{-y^2/2} \, dy = - e^{x^2/2} \int_{x}^{\infty} (f(y) - \mathcal{N}(f)) \, e^{-y^2/2} \, dy$$
(this is true since the whole integral must equal zero). For $x \leq 0$ we have
$$|g(x)| \leq \sup_{y \leq 0} |f(y) - \mathcal{N}(f)| \; e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2} \, dy,$$
while for $x \geq 0$ we have
$$|g(x)| \leq \sup_{y \geq 0} |f(y) - \mathcal{N}(f)| \; e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2} \, dy.$$
Now for $x \leq 0$
$$e^{x^2/2} \int_{-\infty}^{x} e^{-y^2/2} \, dy \leq e^{x^2/2} \int_{-\infty}^{x} \frac{|y|}{|x|} e^{-y^2/2} \, dy = \frac{1}{|x|},$$
and similarly for $x \geq 0$
$$e^{x^2/2} \int_{x}^{\infty} e^{-y^2/2} \, dy \leq e^{x^2/2} \int_{x}^{\infty} \frac{y}{x} e^{-y^2/2} \, dy = \frac{1}{x}. \tag{10.5}$$
Thus we see that for $x \geq 1$
$$|g(x)| \leq |x g(x)| \leq \sup_{y \geq 0} |f(y) - \mathcal{N}(f)|,$$
as well as for $x \leq -1$
$$|g(x)| \leq |x g(x)| \leq \sup_{y \leq 0} |f(y) - \mathcal{N}(f)|.$$
Hence $g$ and $xg$ are bounded. But then also $g'$ is bounded, since
$$g'(x) = x g(x) + f(x) - \mathcal{N}(f).$$
Now we turn to proving the CLT.

Proof of Theorem 10.1. Besides assuming that $\mathbb{E} X_i = 0$ for all $i$ we may also assume that $\sigma^2 = \mathbb{V} X_1 = 1$; otherwise we just replace $X_i$ by $X_i / \sqrt{\sigma^2}$. We write $S_n := \sum_{i=1}^{n} X_i$ and recall that in order to prove the assertion it suffices to show that for all $f$ bounded and uniformly continuous
$$\mathbb{E}\left[ g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{S_n}{\sqrt{n}} \, g\left( \frac{S_n}{\sqrt{n}} \right) \right] \to 0$$
as $n \to \infty$. Here $g$ is defined as above.

But using the identity
$$\int_0^1 \frac{X_j}{\sqrt{n}} \left[ g'\left( \frac{S_n}{\sqrt{n}} - (1 - s) \frac{X_j}{\sqrt{n}} \right) - g'\left( \frac{S_n - X_j}{\sqrt{n}} \right) \right] ds = g\left( \frac{S_n}{\sqrt{n}} \right) - g\left( \frac{S_n - X_j}{\sqrt{n}} \right) - \frac{X_j}{\sqrt{n}} \, g'\left( \frac{S_n - X_j}{\sqrt{n}} \right),$$
which is to be proven in Exercise 10.5 below, we arrive at
$$\mathbb{E}\left[ g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{S_n}{\sqrt{n}} g\left( \frac{S_n}{\sqrt{n}} \right) \right] = \sum_{j=1}^{n} \mathbb{E}\left[ \frac{1}{n} g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{X_j}{\sqrt{n}} g\left( \frac{S_n}{\sqrt{n}} \right) \right]$$
$$= \sum_{j=1}^{n} \mathbb{E}\left[ \frac{1}{n} g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{X_j}{\sqrt{n}} \left( g\left( \frac{S_n}{\sqrt{n}} \right) - g\left( \frac{S_n - X_j}{\sqrt{n}} \right) \right) \right]$$
$$= \sum_{j=1}^{n} \mathbb{E}\left[ \frac{1}{n} g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{1}{n} g'\left( \frac{S_n - X_j}{\sqrt{n}} \right) - \frac{X_j^2}{n} \int_0^1 \left( g'\left( \frac{S_n}{\sqrt{n}} - (1-s) \frac{X_j}{\sqrt{n}} \right) - g'\left( \frac{S_n - X_j}{\sqrt{n}} \right) \right) ds \right].$$
In these steps we used the linearity of expectation together with $\mathbb{E} X_i = 0$ for all $i$, the independence of $X_j$ and $S_n - X_j$, and $\mathbb{E} X_i^2 = 1$. Let us define the random variable
$$\Gamma_j := \frac{1}{n} g'\left( \frac{S_n}{\sqrt{n}} \right) - \frac{1}{n} g'\left( \frac{S_n - X_j}{\sqrt{n}} \right) - \frac{X_j^2}{n} \int_0^1 \left( g'\left( \frac{S_n}{\sqrt{n}} - (1-s) \frac{X_j}{\sqrt{n}} \right) - g'\left( \frac{S_n - X_j}{\sqrt{n}} \right) \right) ds,$$
so that the expression above equals $\sum_{j=1}^{n} \mathbb{E}\, \Gamma_j$. The idea will now be that the continuous function $g'$ is uniformly continuous on every compact set. So if $\frac{X_j}{\sqrt{n}}$ is small, so are $g'(\frac{S_n}{\sqrt{n}}) - g'(\frac{S_n - X_j}{\sqrt{n}})$ and $g'(\frac{S_n}{\sqrt{n}} - (1-s)\frac{X_j}{\sqrt{n}}) - g'(\frac{S_n - X_j}{\sqrt{n}})$, as long as $\frac{S_n}{\sqrt{n}}$ is inside the chosen compact set. On the other hand, the probabilities that $\frac{S_n}{\sqrt{n}}$ is outside a chosen large compact set or that $\frac{X_j}{\sqrt{n}}$ is large are very small. This

together with the boundedness of $g$ and $g'$ will basically yield the proof. For $K > 0$, $\delta > 0$ we write
$$\Gamma_j^1 := \Gamma_j \, 1_{\{|X_j/\sqrt{n}| \leq \delta\}} 1_{\{|S_n/\sqrt{n}| \leq K\}}, \qquad \Gamma_j^2 := \Gamma_j \, 1_{\{|X_j/\sqrt{n}| \leq \delta\}} 1_{\{|S_n/\sqrt{n}| > K\}}, \qquad \Gamma_j^3 := \Gamma_j \, 1_{\{|X_j/\sqrt{n}| > \delta\}}.$$
Hence
$$\sum_{j=1}^{n} \Gamma_j = \sum_{j=1}^{n} \Gamma_j^1 + \sum_{j=1}^{n} \Gamma_j^2 + \sum_{j=1}^{n} \Gamma_j^3.$$
We first consider the $\Gamma_j^2$-terms. By Chebyshev's inequality
$$\mathbb{P}\left( \left| \frac{S_n}{\sqrt{n}} \right| > K \right) \leq \frac{\mathbb{V}(S_n/\sqrt{n})}{K^2} = \frac{1}{K^2} \quad \text{and} \quad \mathbb{P}\left( \left| \frac{S_n - X_1}{\sqrt{n}} \right| > K - \delta \right) \leq \frac{\mathbb{V}((S_n - X_1)/\sqrt{n})}{(K - \delta)^2} = \frac{n - 1}{n (K - \delta)^2}.$$
Hence for given $\varepsilon > 0$ we can find $K$ so large that
$$\mathbb{P}\left( \left| \frac{S_n}{\sqrt{n}} \right| > K \right) \leq \varepsilon \quad \text{and} \quad \mathbb{P}\left( \left| \frac{S_n - X_1}{\sqrt{n}} \right| > K - \delta \right) \leq \varepsilon.$$
Since $g'$ is bounded by $\|g'\| := \sup_{x \in \mathbb{R}} |g'(x)|$, we have $|\Gamma_j| \leq \frac{2}{n}\|g'\| + \frac{2 X_j^2}{n}\|g'\|$. Moreover, on the event $\{|X_j/\sqrt{n}| \leq \delta, \; |S_n/\sqrt{n}| > K\}$ also $|(S_n - X_j)/\sqrt{n}| > K - \delta$. Using the independence of $X_j$ and $S_n - X_j$ together with $\mathbb{E} X_j^2 = 1$ we obtain
$$\sum_{j=1}^{n} \mathbb{E}|\Gamma_j^2| \leq \sum_{j=1}^{n} \frac{2\|g'\|}{n} \, \mathbb{E}\left[ 1_{\{|S_n/\sqrt{n}| > K\}} \right] + \sum_{j=1}^{n} \frac{2\|g'\|}{n} \, \mathbb{E}\left[ X_j^2 \, 1_{\{|(S_n - X_j)/\sqrt{n}| > K - \delta\}} \right] = 2\|g'\| \, \mathbb{P}\left( \left| \frac{S_n}{\sqrt{n}} \right| > K \right) + 2\|g'\| \, \mathbb{P}\left( \left| \frac{S_n - X_1}{\sqrt{n}} \right| > K - \delta \right)$$

4 g0 ε ≤ || || 1 For the Γj -terms observe that for every fixed K > 0 the continuous function g0 is uniformly continuous on [ K, K]. This means that given ε > 0, there is δ > 0 such that −

\[
|x - y| < \delta \;\Longrightarrow\; |g'(x) - g'(y)| < \varepsilon.
\]

For given $\varepsilon > 0$ we choose such $\delta$, and $K$ as in the first step. Then

\[
\sum_{j=1}^n |\Gamma_j^1|
\le \sum_{j=1}^n E\Big[\Big|\frac1n\,g'\Big(\frac{S_n}{\sqrt n}\Big) - \frac1n\,g'\Big(\frac{S_n-X_j}{\sqrt n}\Big)\Big|\,
1_{\{|X_j/\sqrt n|\le\delta\}}\,1_{\{|S_n/\sqrt n|\le K\}}\Big]
\]
\[
\qquad + \sum_{j=1}^n E\Big[\frac{X_j^2}{n}\,\Big|\int_0^1\Big(g'\Big(\frac{S_n}{\sqrt n}-(1-s)\frac{X_j}{\sqrt n}\Big) - g'\Big(\frac{S_n-X_j}{\sqrt n}\Big)\Big)\,ds\Big|\,
1_{\{|X_j/\sqrt n|\le\delta\}}\,1_{\{|S_n/\sqrt n|\le K\}}\Big]
\]
\[
\le \sum_{j=1}^n\Big(\frac\varepsilon n + \frac\varepsilon n\,E[X_j^2]\Big) = n\,\frac{2\varepsilon}{n} = 2\varepsilon.
\]

Indeed, on the event $\{|X_j/\sqrt n|\le\delta,\ |S_n/\sqrt n|\le K\}$ all arguments of $g'$ above lie in a fixed compact interval and differ by at most $\delta$, so both differences are smaller than $\varepsilon$.

Eventually we turn to the $\Gamma_j^3$-terms. Since $EX_1^2 < \infty$, for a given $\varepsilon > 0$ and $\delta$ as above there exists an $n_0$ such that for all $n \ge n_0$

\[
E\big[X_1^2\,1_{\{|X_1/\sqrt n|>\delta\}}\big] < \varepsilon.
\]

This implies

\[
\sum_{j=1}^n |\Gamma_j^3|
\le \sum_{j=1}^n\Big(\frac{2\|g'\|}{n}\,E\big[1_{\{|X_j/\sqrt n|>\delta\}}\big]
+ \frac{2\|g'\|}{n}\,E\big[X_j^2\,1_{\{|X_j/\sqrt n|>\delta\}}\big]\Big)
= 2\|g'\|\,P\Big(\Big|\frac{X_1}{\sqrt n}\Big|>\delta\Big)
+ 2\|g'\|\,E\big[X_1^2\,1_{\{|X_1/\sqrt n|>\delta\}}\big]
\le 4\|g'\|\,\varepsilon,
\]

where in the last step we used Chebyshev's inequality, $P(|X_1/\sqrt n| > \delta) \le E[X_1^2\,1_{\{|X_1/\sqrt n|>\delta\}}]/(n\delta^2) < \varepsilon$ for $n\delta^2 \ge 1$. Hence for a given $\varepsilon > 0$ with the choice of $\delta > 0$ and $K$ as above we obtain

\[
\Big|E\Big[g'\Big(\frac{S_n}{\sqrt n}\Big) - \frac{S_n}{\sqrt n}\,g\Big(\frac{S_n}{\sqrt n}\Big)\Big]\Big|
\le \Big|\sum_{j=1}^n \Gamma_j^1\Big| + \Big|\sum_{j=1}^n \Gamma_j^2\Big| + \Big|\sum_{j=1}^n \Gamma_j^3\Big|
\le 2\varepsilon + 8\varepsilon\,\|g'\|.
\]

This can be made arbitrarily small by letting $\varepsilon \to 0$. This proves the theorem.

Exercise 10.5 Let $X_1, \ldots, X_n$ be i.i.d. random variables and $S_n = \sum_{i=1}^n X_i$. Let $g : \mathbb R \to \mathbb R$ be a continuously differentiable function. Show that for all $j$

\[
\int_0^1\big[g'(S_n - (1-s)X_j) - g'(S_n - X_j)\big]\,X_j\,ds
= g(S_n) - g(S_n - X_j) - X_j\,g'(S_n - X_j).
\]
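The identity can be checked numerically for a concrete smooth $g$; the following sketch uses $g = \sin$ and arbitrary numerical values for $S_n$ and $X_j$:

```python
import math

def lhs(gprime, s_n, x_j, steps=10000):
    """Midpoint rule for int_0^1 [g'(S_n - (1-s) X_j) - g'(S_n - X_j)] X_j ds."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += (gprime(s_n - (1.0 - s) * x_j) - gprime(s_n - x_j)) * x_j
    return total * h

def rhs(g, gprime, s_n, x_j):
    """g(S_n) - g(S_n - X_j) - X_j g'(S_n - X_j)."""
    return g(s_n) - g(s_n - x_j) - x_j * gprime(s_n - x_j)

# g = sin, S_n = 1.3, X_j = 0.7 (arbitrary values)
print(lhs(math.cos, 1.3, 0.7), rhs(math.sin, math.cos, 1.3, 0.7))
```

Both sides agree up to the quadrature error of the midpoint rule.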

We conclude the section with an informal discussion of two extensions of the Central Limit Theorem. The first is of practical importance, the second of more theoretical interest.

When one tries to apply the Central Limit Theorem, e.g. for a sequence of i.i.d. random variables, it is of course not only important to know that

\[
\tilde X_n := \frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\,V(X_1)}}
\]

converges to a random variable $Z \sim \mathcal N(0,1)$. One also needs to know how close the distributions of $\tilde X_n$ and $Z$ are. This is stated in the following theorem due to Berry and Esseen:

Theorem 10.6 (Berry–Esseen) Let $(X_i)_{i\in\mathbb N}$ be a sequence of i.i.d. random variables with $E(|X_1|^3) < \infty$. Then for an $\mathcal N(0,1)$-distributed random variable $Z$ it holds:

\[
\sup_{a\in\mathbb R}\Big|P\Big(\frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\,V(X_1)}} \le a\Big) - P(Z \le a)\Big|
\le \frac{C\,E(|X_1 - EX_1|^3)}{\sqrt n\,(V(X_1))^{3/2}}.
\]

The numerical value of C is below 6 and larger than 0.4. This is rather easy to prove.
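As an illustration we can estimate the left-hand side of the Berry–Esseen bound by simulation. The following Python sketch (sample sizes and the seed are ad-hoc choices, not part of the theorem) estimates the Kolmogorov distance for fair $\pm1$ coin tosses, for which $E|X_1|^3 = (V(X_1))^{3/2} = 1$, and compares it with the bound using the constant $C = 6$:

```python
import math
import random

def phi(a):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def kolmogorov_distance(n, trials=2000, seed=0):
    """Monte Carlo estimate of sup_a |P(S_n / sqrt(n) <= a) - Phi(a)|
    for S_n a sum of n fair +-1 coin tosses."""
    rng = random.Random(seed)
    samples = sorted(
        sum(rng.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)
        for _ in range(trials)
    )
    d = 0.0
    for i, s in enumerate(samples):
        # the empirical CDF jumps from i/trials to (i+1)/trials at s
        d = max(d, abs(phi(s) - i / trials), abs(phi(s) - (i + 1) / trials))
    return d

for n in (10, 100, 1000):
    print(n, round(kolmogorov_distance(n), 3),
          "Berry-Esseen bound:", round(6 / math.sqrt(n), 3))
```

The estimated distance stays well below the bound and decreases with $n$; note that for small $n$ part of the distance is the Monte Carlo error of the empirical distribution function itself.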

The second extension of the Central Limit Theorem starts with the following observation: Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with finite variance and expectation zero. Then the law of large numbers says that $\frac1n\sum_{i=1}^n X_i$ converges to $EX_1 = 0$ in probability and almost surely. But it tells nothing about the size of the fluctuations. This is considered in greater detail by the Central Limit Theorem. The latter describes the asymptotic probabilities that

\[
P\Big(\frac{\sum_{i=1}^n (X_i - EX_1)}{\sqrt{n\,V(X_1)}} \ge a\Big).
\]

Since these probabilities are asymptotically positive for all $a \in \mathbb R$ according to the Central Limit Theorem, it can be shown that the fluctuations of $\sum_{i=1}^n X_i$ are larger than $\sqrt n$; more precisely, for each positive $a \in \mathbb R$ it holds with probability one that

\[
\limsup_{n\to\infty} \frac{\sum_{i=1}^n X_i}{\sqrt n} \ge a.
\]

The question for the precise size of the fluctuations, i.e. for the right scaling $(a_n)$ such that

\[
\limsup_{n\to\infty} \frac{\sum_{i=1}^n X_i}{a_n}
\]

is almost surely finite, is answered by the law of the iterated logarithm:

Theorem 10.7 (Law of the Iterated Logarithm, Hartman–Wintner) Let $(X_i)_{i\in\mathbb N}$ be a sequence of i.i.d. random variables with expectation zero and $\sigma^2 := V(X_1) < \infty$ ($\sigma > 0$). Then for $S_n := \sum_{i=1}^n X_i$ it holds

\[
\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} = +\sigma \quad P\text{-a.s.}
\qquad\text{and}\qquad
\liminf_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} = -\sigma \quad P\text{-a.s.}
\]
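The Law of the Iterated Logarithm can be illustrated (though of course not verified) by following the normalized quantity $|S_n|/\sqrt{2n\log\log n}$ along one simulated path of fair $\pm1$ coin tosses, so that $\sigma = 1$. Path length and seed below are arbitrary; for finite $n$ the running maximum merely fluctuates in a rough neighbourhood of $\sigma$:

```python
import math
import random

def lil_running_max(n_max, seed=1):
    """Running maximum of |S_n| / sqrt(2 n log log n) along one path of a
    fair +-1 random walk (so sigma = 1)."""
    rng = random.Random(seed)
    s, best = 0, 0.0
    for n in range(1, n_max + 1):
        s += rng.choice((-1, 1))
        if n >= 100:  # skip small n, where log log n is not a sensible scale
            best = max(best, abs(s) / math.sqrt(2 * n * math.log(math.log(n))))
    return best

print(lil_running_max(100000))
```

The convergence of the lim sup is far too slow to be observed directly, which is precisely the practical limitation discussed next.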

Due to the restricted time we will not be able to prove the Law of the Iterated Logarithm in the context of this course. Despite its theoretical interest its practical relevance is rather limited. To understand why, notice that the correction to the scaling $\sqrt{n\,V(X_1)}$ from the Central Limit Theorem to the Law of the Iterated Logarithm is of order $\sqrt{\log\log n}$. Even for a fantastically large number of observations, $10^{100}$ (which is more than one observation per atom in the universe), $\sqrt{\log\log n}$ is really small, e.g.

\[
\sqrt{\log\log 10^{100}} = \sqrt{\log(100\,\log 10)} = \sqrt{\log 100 + \log\log 10} \approx \sqrt{5.44} \approx 2.33.
\]

11 Conditional Expectation

To understand the concept of conditional expectation, we will start with a little example.

Example 11.1 Let $\Omega$ be a finite population and let the random variable $X(\omega)$ denote the income of person $\omega$. So, if we are only interested in income, $X$ contains the full information of our experiment. Now assume we are a sociologist and want to measure the influence of a person's religion on his income. Then we are not interested in the full information given by $X$, but only in how $X$ behaves on each of the sets

{catholic}, {protestant}, {islamic}, {jewish}, {atheist},

etc. This leads to the concept of conditional expectation.

The basic idea of conditional expectation will be, given a random variable

\[
X : (\Omega, \mathcal F) \to \mathbb R
\]

and a sub-$\sigma$-algebra $\mathcal A$ of $\mathcal F$, to introduce a new random variable $E[X \mid \mathcal A] =: X_0$ such that $X_0$ is $\mathcal A$-measurable and

\[
\int_C X_0\,dP = \int_C X\,dP
\]

for all $C \in \mathcal A$. So $X_0$ contains all information necessary when we only consider events in $\mathcal A$. First we need to see that such an $X_0$ can be found in a unique way.

Theorem 11.2 Let $(\Omega, \mathcal F, P)$ be a probability space and $X$ an integrable random variable. Let $\mathcal C \subseteq \mathcal F$ be a sub-$\sigma$-algebra. Then (up to $P$-a.s. equality) there is a unique random variable $X_0$ which is $\mathcal C$-measurable and satisfies

\[
\int_C X_0\,dP = \int_C X\,dP \quad\text{for all } C \in \mathcal C. \tag{11.1}
\]

If $X \ge 0$, then $X_0 \ge 0$ $P$-a.s.

Proof. First we treat the case $X \ge 0$. Denote $P_0 := P|_{\mathcal C}$ and $Q := (XP)|_{\mathcal C}$. Both $P_0$ and $Q$ are measures on $\mathcal C$; $P_0$ even is a probability measure. By definition

\[
Q(C) = \int_C X\,dP.
\]

Hence $Q(C) = 0$ for all $C \in \mathcal C$ with $P_0(C) = P(C) = 0$. Hence $Q \ll P_0$. By the theorem of Radon–Nikodym there is a $\mathcal C$-measurable function $X_0 \ge 0$ on $\Omega$ such that $Q = X_0 P_0$. Thus

\[
\int_C X_0\,dP_0 = \int_C X\,dP \quad\text{for all } C \in \mathcal C,
\]

and hence

\[
\int_C X_0\,dP = \int_C X\,dP \quad\text{for all } C \in \mathcal C.
\]

Hence $X_0$ satisfies (11.1). For any $\bar X_0$ that is $\mathcal C$-measurable and satisfies (11.1), the set $C = \{\bar X_0 < X_0\}$ is in $\mathcal C$ and $\int_C \bar X_0\,dP = \int_C X_0\,dP$, whence $P(C) = 0$. In the same way $P(\bar X_0 > X_0) = 0$. Therefore $\bar X_0$ is $P$-a.s. equal to $X_0$. The proof for arbitrary, integrable $X$ is left to the reader.

Exercise 11.3 Prove Theorem 11.2 for arbitrary, integrable $X : \Omega \to \mathbb R$.

Definition 11.4 Under the conditions of Theorem 11.2 the random variable $X_0$ (which is $P$-a.s. unique) is called the conditional expectation of $X$ given $\mathcal C$. It is denoted by

\[
X_0 =: E[X \mid \mathcal C] =: E^{\mathcal C}[X].
\]

If $\mathcal C$ is generated by a sequence of random variables $(Y_i)_{i\in I}$, such that $\mathcal C = \sigma(Y_i, i \in I)$, we write

\[
E\big[X \mid (Y_i)_{i\in I}\big] = E[X \mid \mathcal C].
\]

If $I = \{1, \ldots, n\}$ we also write $E[X \mid Y_1, \ldots, Y_n]$.

Note that, in order to check whether a $\mathcal C$-measurable random variable $Y$ is a conditional expectation of $X$ given the sub-$\sigma$-algebra $\mathcal C$, we need to check

\[
\int_C Y\,dP = \int_C X\,dP
\]

for all $C \in \mathcal C$. This determines $E[X \mid \mathcal C]$ only $P$-a.s. We therefore also speak about different versions of the conditional expectation.

Example 11.5 1. If $\mathcal C = \{\emptyset, \Omega\}$, then the constant random variable $EX$ is a version of $E[X \mid \mathcal C]$. Indeed, if $C = \emptyset$, then any integrable variable satisfies (11.1), and if $C = \Omega$,

\[
\int_\Omega X\,dP = EX = \int_\Omega EX\,dP.
\]

2. If $\mathcal C$ is generated by the family $(B_i)_{i\in I}$ of mutually disjoint sets (i.e. $B_i \cap B_j = \emptyset$ if $i \neq j$), where $I$ is countable, $B_i \in \mathcal A$ (the original space being $(\Omega, \mathcal A, P)$) and $P(B_i) > 0$, then

\[
E[X \mid \mathcal C] = \sum_{i\in I} \frac{1}{P(B_i)}\Big(\int_{B_i} X\,dP\Big)\,1_{B_i} \quad P\text{-a.s.},
\]

as will be checked in the following exercise.

Exercise 11.6 Show that the assertion of Example 11.5.2. is true.
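For a finite $\Omega$ the formula of Example 11.5.2 can be evaluated directly. In the following sketch the population, the probabilities, the function $X$ and the partition are all invented toy data; the assertions check the defining property (11.1) on the generating sets:

```python
from fractions import Fraction

# toy data: Omega = {0,...,5}, uniform P, X(w) = w (all invented)
omega = list(range(6))
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}
partition = [{0, 1, 2}, {3, 4, 5}]  # the atoms B_i generating C

def cond_exp(X, partition, P):
    """E[X|C](w) = (1/P(B_i)) int_{B_i} X dP on the atom B_i containing w."""
    X0 = {}
    for B in partition:
        pB = sum(P[w] for w in B)
        mean = sum(P[w] * X[w] for w in B) / pB
        for w in B:
            X0[w] = mean
    return X0

X0 = cond_exp(X, partition, P)
for B in partition:
    # defining property (11.1) on the generating sets
    assert sum(P[w] * X0[w] for w in B) == sum(P[w] * X[w] for w in B)
print([float(X0[w]) for w in omega])  # [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
```

Note that $E[X\mid\mathcal C]$ is constant on each atom, and averaging it gives back $EX$, in accordance with Exercise 11.7 below.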

Exercise 11.7 Show that the following assertions for the conditional expectation $E[X \mid \mathcal C]$ of random variables $X, Y : (\Omega, \mathcal A) \to (\mathbb R, \mathcal B^1)$ ($\mathcal C \subset \mathcal A$) are true:

1. $E[E[X \mid \mathcal C]] = EX$.

2. If $X$ is $\mathcal C$-measurable then $E[X \mid \mathcal C] = X$ $P$-a.s.

3. If $X = Y$ $P$-a.s., then $E[X \mid \mathcal C] = E[Y \mid \mathcal C]$ $P$-a.s.

4. If $X \equiv \alpha$, then $E[X \mid \mathcal C] = \alpha$ $P$-a.s.

5. $E[\alpha X + \beta Y \mid \mathcal C] = \alpha E[X \mid \mathcal C] + \beta E[Y \mid \mathcal C]$ $P$-a.s. Here $\alpha, \beta \in \mathbb R$.

6. $X \le Y$ $P$-a.s. implies $E[X \mid \mathcal C] \le E[Y \mid \mathcal C]$ $P$-a.s.

The following theorems have proofs that are almost identical with the proofs of the corresponding theorems for expectations:

Theorem 11.8 (monotone convergence) Let $(X_n)$ be an increasing sequence of positive random variables with $X = \sup_n X_n$, $X$ integrable. Then

\[
\sup_n E[X_n \mid \mathcal C] = \lim_{n\to\infty} E[X_n \mid \mathcal C] = E\big[\lim_{n\to\infty} X_n \mid \mathcal C\big] = E[X \mid \mathcal C].
\]

Theorem 11.9 (dominated convergence) Let $(X_n)$ be a sequence of random variables converging pointwise to an (integrable) random variable $X$, such that there is an integrable random variable $Y$ with $Y \ge |X_n|$ for all $n$. Then

\[
\lim_{n\to\infty} E[X_n \mid \mathcal C] = E[X \mid \mathcal C].
\]

Also Jensen's inequality has a generalization to conditional expectations:

Theorem 11.10 (Jensen's inequality) Let $X$ be an integrable random variable taking values in an open interval $I \subset \mathbb R$ and let

\[
q : I \to \mathbb R
\]

be a convex function. Then for each $\mathcal C \subset \mathcal A$ it holds

\[
E[X \mid \mathcal C] : \Omega \to I
\]

and

\[
q\big(E[X \mid \mathcal C]\big) \le E[q \circ X \mid \mathcal C].
\]

An immediate consequence of Theorem 11.10 is the following (for $p \ge 1$):

\[
\big|E[X \mid \mathcal C]\big|^p \le E\big[|X|^p \mid \mathcal C\big],
\]

which implies

\[
E\big(\big|E[X \mid \mathcal C]\big|^p\big) \le E\big(|X|^p\big).
\]

Denoting by

\[
N_p(f) = \Big(\int |f|^p\,dP\Big)^{1/p},
\]

this means

\[
N_p\big(E[X \mid \mathcal C]\big) \le N_p(X), \qquad X \in \mathcal L^p(P).
\]

This holds for $1 \le p < \infty$. $N_p(f)$ is called the $p$-norm of $f$. The case $p = \infty$, which means that if $X$ is bounded $P$-a.s. by some $M \ge 0$, then so is $E[X \mid \mathcal C]$, follows from Exercise 11.7. We slightly reformulate the definition of conditional expectation to discuss its further properties.

Lemma 11.11 Let $X$ be a positive integrable random variable and let $X_0 : (\Omega, \mathcal C) \to (\mathbb R, \mathcal B^1)$ be a positive, $\mathcal C$-measurable, integrable random variable that is a version of $E[X \mid \mathcal C]$. Then

\[
\int Z X_0\,dP = \int Z X\,dP \tag{11.2}
\]

for all $\mathcal C$-measurable, positive random variables $Z$.

Proof. From (11.1) we obtain (11.2) for step functions $Z$. The general result follows from monotone convergence.

We are now prepared to show a number of properties of conditional expectations which we will call smoothing properties.

Theorem 11.12 (Smoothing properties of conditional expectations) Let $(\Omega, \mathcal F, P)$ be a probability space and $X \in \mathcal L^p(P)$ and $Y \in \mathcal L^q(P)$, $1 \le p \le \infty$, $\frac1p + \frac1q = 1$.

1. If $\mathcal C \subseteq \mathcal F$ and $X$ is $\mathcal C$-measurable then

\[
E[XY \mid \mathcal C] = X\,E[Y \mid \mathcal C]. \tag{11.3}
\]

2. If $\mathcal C_1, \mathcal C_2 \subseteq \mathcal F$ with $\mathcal C_1 \subseteq \mathcal C_2$, then

\[
E\big[E[X \mid \mathcal C_2] \mid \mathcal C_1\big] = E\big[E[X \mid \mathcal C_1] \mid \mathcal C_2\big] = E[X \mid \mathcal C_1].
\]

Proof. 1. First assume that $X, Y \ge 0$. Let $X$ be $\mathcal C$-measurable and $C \in \mathcal C$. Then

\[
\int_C XY\,dP = \int 1_C XY\,dP = \int 1_C X\,E[Y \mid \mathcal C]\,dP = \int_C X\,E[Y \mid \mathcal C]\,dP.
\]

Indeed, the middle equality follows immediately from Lemma 11.11, since $1_C X$ is $\mathcal C$-measurable. On the other hand, we also have $XY \in \mathcal L^1(P)$ and

\[
\int_C XY\,dP = \int_C E[XY \mid \mathcal C]\,dP.
\]

Since $X\,E[Y \mid \mathcal C]$ is $\mathcal C$-measurable we obtain

\[
E[XY \mid \mathcal C] = X\,E[Y \mid \mathcal C] \quad P\text{-a.s.}
\]

In the case $X \in \mathcal L^p(P)$, $Y \in \mathcal L^q(P)$ we observe that then $XY \in \mathcal L^1(P)$ (by Hölder's inequality) and conclude as above.

2. Observe that, of course, $E[X \mid \mathcal C_1]$ is $\mathcal C_1$-measurable and, since $\mathcal C_1 \subseteq \mathcal C_2$, also $\mathcal C_2$-measurable. Property 2 in Exercise 11.7 then implies

\[
E\big[E[X \mid \mathcal C_1] \mid \mathcal C_2\big] = E[X \mid \mathcal C_1] \quad P\text{-a.s.}
\]

Moreover, for all $C \in \mathcal C_1$

\[
\int_C E[X \mid \mathcal C_1]\,dP = \int_C X\,dP.
\]

Hence for all $C \in \mathcal C_1$

\[
\int_C E[X \mid \mathcal C_1]\,dP = \int_C E[X \mid \mathcal C_2]\,dP.
\]

But this means $E\big[E[X \mid \mathcal C_2] \mid \mathcal C_1\big] = E[X \mid \mathcal C_1]$ $P$-a.s.

The previous theorem leads to yet another characterization of the conditional expectation. To this end take $X \in \mathcal L^2(P)$ and denote $X_0 := E[X \mid \mathcal C]$ for a $\mathcal C \subseteq \mathcal F$. Let $Z \in \mathcal L^2(P)$ be $\mathcal C$-measurable. Then $X_0 \in \mathcal L^2(P)$ and by (11.3)

\[
E\big[Z \cdot (X - X_0) \mid \mathcal C\big] = Z\,E[X - X_0 \mid \mathcal C] = Z\,\big(E[X \mid \mathcal C] - X_0\big) = Z\,(X_0 - X_0) = 0.
\]

Theorem 11.13 For all $X \in \mathcal L^2(P)$ and each $\mathcal C \subseteq \mathcal F$ the conditional expectation $E[X \mid \mathcal C]$ is (up to a.s. equality) the unique $\mathcal C$-measurable random variable $X_0 \in \mathcal L^2(P)$ with

\[
E\big((X - X_0)^2\big) = \min\big\{E\big((X - Y)^2\big) :\ Y \in \mathcal L^2(P),\ Y\ \mathcal C\text{-measurable}\big\}.
\]

Proof. Let $Y \in \mathcal L^2(P)$ be $\mathcal C$-measurable. Put $X_0 := E[X \mid \mathcal C]$. Then

\[
E\big((X-Y)^2\big) = E\big((X - X_0 + X_0 - Y)^2\big) = E\big((X-X_0)^2\big) + E\big((X_0-Y)^2\big) + 2\,E\big((X-X_0)(X_0-Y)\big).
\]

But $E\big((X - X_0)(X_0 - Y)\big) = 0$, since $X_0 - Y$ is $\mathcal C$-measurable. This gives

\[
E\big((X-Y)^2\big) - E\big((X-X_0)^2\big) = E\big((X_0-Y)^2\big). \tag{11.4}
\]

Due to positivity of squares we hence obtain

\[
E\big((X-X_0)^2\big) \le E\big((X-Y)^2\big).
\]

If, on the other hand,

\[
E\big((X-X_0)^2\big) = E\big((X-Y)^2\big),
\]

then

\[
E\big((X_0-Y)^2\big) = 0,
\]

which implies $Y = X_0 = E[X \mid \mathcal C]$ $P$-a.s.

The last theorem states that $E[X \mid \mathcal C]$ for $X \in \mathcal L^2(P)$ is the "best approximation" of $X$ in the space of $\mathcal C$-measurable functions "in the sense of a least squares approximation". It is the projection of $X$ onto the space of square integrable, $\mathcal C$-measurable functions.

Exercise 11.14 Prove that for $X \in \mathcal L^2(P)$, $\mu = E(X)$ is the number that minimizes $E((X - \mu)^2)$.

With the help of conditional expectation we can also give a new definition of conditional probability.

Definition 11.15 Let $(\Omega, \mathcal F, P)$ be a probability space and $\mathcal C \subset \mathcal F$ a sub-$\sigma$-algebra. For $A \in \mathcal F$,

\[
P[A \mid \mathcal C] := E[1_A \mid \mathcal C]
\]

is called the conditional probability of $A$ given $\mathcal C$.

Example 11.16 In the situation of Example 11.5.2 the conditional probability of $A \in \mathcal F$ is given by

\[
P(A \mid \mathcal C) = \sum_{i\in I} P(A \mid B_i)\,1_{B_i} := \sum_{i\in I} \frac{P(A \cap B_i)}{P(B_i)}\,1_{B_i}.
\]

In a last step we will introduce (but not prove) conditional expectations given events of probability zero. Of course, in general this will just give nonsense, but in the case of a conditional expectation $E[X \mid Y = y]$, where $X, Y$ are random variables such that $(X, Y)$ has a Lebesgue density, we can give this expression a meaning.

Theorem 11.17 Let $X, Y$ be real valued random variables such that $(X, Y)$ has a density $f : \mathbb R^2 \to \mathbb R_0^+$ with respect to two-dimensional Lebesgue measure $\lambda^2$. Assume that $X$ is integrable and that

\[
f_Y(y) := \int f(x, y)\,dx > 0 \quad\text{for all } y \in \mathbb R.
\]

Then there is a function, denoted by

\[
y \mapsto E(X \mid Y = y),
\]

such that

\[
E(X \mid Y = y) = \frac{1}{f_Y(y)} \int x\,f(x, y)\,dx \quad\text{for } P_Y\text{-a.e. } y \in \mathbb R.
\]

In particular

\[
E(X \mid Y) = \frac{1}{f_Y(Y)} \int x\,f(x, Y)\,dx \quad P\text{-a.s.}
\]

We will also need the following relationship between conditional expectation and independence, which is a generalization of Example 11.5, case 1.
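Theorem 11.17 can be checked numerically in a case where the conditional expectation is known in closed form: for a standard bivariate normal vector $(X, Y)$ with correlation $\rho$ one has $E(X \mid Y = y) = \rho y$. The grid-based quadrature below is only a sketch (step size and cut-off are ad-hoc choices), and only the joint density $f$ enters, exactly as in the theorem:

```python
import math

RHO = 0.6  # correlation; for this density E(X | Y = y) should equal RHO * y

def f(x, y, rho=RHO):
    """Standard bivariate normal density with correlation rho."""
    det = 1.0 - rho * rho
    q = (x * x - 2.0 * rho * x * y + y * y) / det
    return math.exp(-q / 2.0) / (2.0 * math.pi * math.sqrt(det))

def cond_exp_given(y, h=0.01, cut=10.0):
    """(1/f_Y(y)) * int x f(x, y) dx via a midpoint rule on [-cut, cut]."""
    n = int(2 * cut / h)
    xs = [-cut + (i + 0.5) * h for i in range(n)]
    fy = sum(f(x, y) for x in xs) * h        # f_Y(y) = int f(x, y) dx
    num = sum(x * f(x, y) for x in xs) * h   # int x f(x, y) dx
    return num / fy

print(cond_exp_given(1.0))   # close to RHO * 1.0 = 0.6
print(cond_exp_given(-2.0))  # close to RHO * (-2.0) = -1.2
```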

Lemma 11.18 Let $X$ be an integrable real valued random variable and $\mathcal C \subset \mathcal F$ a sub-$\sigma$-algebra such that $X$ is independent of $\mathcal C$, that is, $\sigma(X)$ and $\mathcal C$ are independent. Then

\[
E(X \mid \mathcal C) = E(X) \quad P\text{-a.s.}
\]

Proof. Suppose $X \ge 0$. Then an increasing sequence of step functions $X_n$ can be constructed by $X_n = \lfloor 2^n X \rfloor / 2^n$. Then $X_n$ converges monotonically to $X$. Notice that $X_n$ is a linear combination of indicator functions $1_A$ with $A \in \sigma(X)$, and for $C \in \mathcal C$

\[
\int_C 1_A\,dP = P(C \cap A) = P(C)\,P(A) = \int_C E(1_A)\,dP.
\]

Thus $E(1_A \mid \mathcal C) = E(1_A)$, by linearity $E(X_n \mid \mathcal C) = E(X_n)$, and by the monotone convergence theorem $E(X \mid \mathcal C) = E(X)$. The general case follows by linearity, $X = X^+ - X^-$.

Exercise 11.19 Let $X$ and $Y$ be as in Theorem 11.17, such that $X$ and $Y$ are independent. Then $X$ is independent of $\sigma(Y)$, and by Lemma 11.18 we have $E(X \mid Y) = E(X \mid \sigma(Y)) = E(X)$. Apply Theorem 11.17 to give an alternative derivation of this fact.

12 Martingales

In this section we are going to define a notion that will turn out to be of central interest in all of so-called stochastic analysis and mathematical finance. A key role in its definition will be taken by conditional expectation. In this section we will just give the definition and a couple of examples. There is a rich theory of martingales; parts of this theory we will meet in a course on Stochastic Calculus.

Definition 12.1 Let $(\Omega, \mathcal F, P)$ be a probability space and $I$ a linearly ordered set, i.e. for $s, t \in I$ either $s \le t$ or $t \le s$, where $s \le t$ and $t \le s$ implies $s = t$, and $s \le t$, $t \le u$ implies $s \le u$. For $t \in I$ let $\mathcal F_t \subset \mathcal F$ be a $\sigma$-algebra. $(\mathcal F_t)_{t\in I}$ is called a filtration if $s \le t$ implies $\mathcal F_s \subset \mathcal F_t$. A sequence of random variables $(X_t)_{t\in I}$ is called $(\mathcal F_t)_{t\in I}$-adapted if $X_t$ is $\mathcal F_t$-measurable for all $t \in I$.

Exercise 12.2 Construct a filtration on a probability space with $|I| \ge 3$.

Example 12.3 Let $(X_t)_{t\in I}$ be a family of random variables, and $I$ a linearly ordered set. Then

\[
\mathcal F_t = \sigma\{X_s,\ s \le t\}
\]

is a filtration and $(X_t)$ is adapted with respect to $(\mathcal F_t)$. $(\mathcal F_t)_{t\in I}$ is called the canonical filtration with respect to $(X_t)_{t\in I}$.

Definition 12.4 Let $(\Omega, \mathcal F, P)$ be a probability space and $I$ a linearly ordered set. Let $(\mathcal F_t)_{t\in I}$ be a filtration and $(X_t)_{t\in I}$ an $(\mathcal F_t)$-adapted sequence of integrable random variables. $(X_t)$ is called an $(\mathcal F_t)$-supermartingale if

\[
E[X_t \mid \mathcal F_s] \le X_s \quad P\text{-a.s.} \tag{12.1}
\]

for all $s \le t$. (12.1) is equivalent with

\[
\int_C X_t\,dP \le \int_C X_s\,dP \quad\text{for all } C \in \mathcal F_s. \tag{12.2}
\]

$(X_t)$ is called an $(\mathcal F_t)$-submartingale if $(-X_t)$ is an $(\mathcal F_t)$-supermartingale. Eventually $(X_t)$ is called a martingale if it is both a submartingale and a supermartingale. This means that

\[
E[X_t \mid \mathcal F_s] = X_s \quad P\text{-a.s.}
\]

for $s \le t$ or, equivalently,

\[
\int_C X_t\,dP = \int_C X_s\,dP, \qquad C \in \mathcal F_s.
\]

Exercise 12.5 Show that the conditions (12.1) and (12.2) are equivalent.

Remark 12.6 1. If $(\mathcal F_t)$ is the canonical filtration with respect to $(X_t)_{t\in I}$, then $(X_t)$ is often simply called a supermartingale, submartingale, or a martingale.

2. (12.1) and (12.2) are evidently correct for $s = t$ (with equality). Hence these properties only need to be checked for $s < t$.

3. Putting $C = \Omega$ in (12.2) we obtain for a supermartingale $(X_t)$:

\[
s \le t \;\Rightarrow\; E(X_s) \ge E(X_t).
\]

Hence for supermartingales $(E(X_s))_s$ is a decreasing sequence, while for a submartingale $(E(X_s))_s$ is an increasing sequence.

4. In particular, if each of the random variables $X_s$ is almost surely constant, e.g. if $\Omega$ is a singleton (a set with just one element), then $(X_s)$ is a decreasing sequence if $(X_s)$ is a supermartingale, and it is an increasing sequence if $(X_s)$ is a submartingale. Hence martingales are (in a certain sense) the stochastic generalization of constant sequences.

Exercise 12.7 Let $(X_t)$, $(Y_t)$ be adapted to the same filtration and $\alpha, \beta \in \mathbb R$. Show the following:

1. If $(X_t)$ and $(Y_t)$ are martingales, then $(\alpha X_t + \beta Y_t)$ is a martingale.

2. If $(X_t)$ and $(Y_t)$ are supermartingales, then so is $(X_t \wedge Y_t) = (\min(X_t, Y_t))$.

3. If $(X_t)$ is a submartingale, so is $(X_t^+, \mathcal F_t)$.

4. If $(X_t)$ is a martingale taking values in an open interval $J \subseteq \mathbb R$ and $q : J \to \mathbb R$ is convex, then $(q \circ X_t, \mathcal F_t)$ is a submartingale, if $q(X_t)$ is integrable for all $t$.

Of course, at first glance the definition of a martingale may look a bit weird. We will therefore give a couple of examples to show that it is not as strange as expected.

Example 12.8 Let $(X_n)$ be an i.i.d. sequence of integrable $\mathbb R$-valued random variables. Put $S_n = X_1 + \ldots + X_n$ and consider the canonical filtration $\mathcal F_n = \sigma(S_m, m \le n)$. By Lemma 11.18 we have

\[
E[X_{n+1} \mid S_1, \ldots, S_n] = E[X_{n+1}] \quad P\text{-a.s.}
\]

and by part 2 of Exercise 11.7

\[
E[X_i \mid S_1, \ldots, S_n] = X_i \quad P\text{-a.s.}
\]

for all $i = 1, \ldots, n$. Adding these $n + 1$ equations gives

\[
E[S_{n+1} \mid \mathcal F_n] = S_n + E[X_{n+1}] \quad P\text{-a.s.}
\]

If $EX_i = 0$ for all $i$, then

\[
E[S_{n+1} \mid \mathcal F_n] = S_n,
\]

i.e. $(S_n)$ is a martingale. If $E[X_i] \le 0$ then

\[
E[S_{n+1} \mid \mathcal F_n] \le S_n,
\]

i.e. $(S_n)$ is a supermartingale. In the same way $(S_n)$ is a submartingale if $EX_i \ge 0$.

Example 12.9 Consider the following game. For each $n \in \mathbb N$ a coin with probability $p$ for heads is tossed. If it shows heads ($X_n = +1$) our player receives money, otherwise ($X_n = -1$) he loses money. The way he wins or loses is determined in the following way. Before the game starts he determines a sequence $(\varrho_n)_n$ of functions

\[
\varrho_n : \{H, T\}^n \to \mathbb R^+.
\]

In round number $n + 1$ he plays for $\varrho_n(X_1, \ldots, X_n)$ Euros, depending on how the first $n$ games ended. If we denote by $S_n$ his capital at time $n$, then

\[
S_1 = X_1 \qquad\text{and}\qquad S_{n+1} = S_n + \varrho_n(X_1, \ldots, X_n)\,X_{n+1}.
\]

Hence

\[
E[S_{n+1} \mid X_1, \ldots, X_n]
= S_n + \varrho_n(X_1, \ldots, X_n) \cdot E[X_{n+1} \mid X_1, \ldots, X_n]
= S_n + \varrho_n(X_1, \ldots, X_n)\,E(X_{n+1})
= S_n + (2p-1)\,\varrho_n(X_1, \ldots, X_n),
\]

since $X_{n+1}$ is independent of $X_1, \ldots, X_n$ and $E(X_{n+1}) = 2p - 1$. Hence for $p = \frac12$

\[
E[S_{n+1} \mid X_1, \ldots, X_n] = S_n,
\]

so $(S_n)$ is a martingale, while for $p > \frac12$

\[
E[S_{n+1} \mid X_1, \ldots, X_n] \ge S_n,
\]

hence $(S_n)$ is a submartingale, and for $p < \frac12$

\[
E[S_{n+1} \mid X_1, \ldots, X_n] \le S_n,
\]

so $(S_n)$ is a supermartingale. This explains the idea that "martingales are generalizations of fair games".
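The trichotomy can be watched in a small simulation. The staking rule below is an arbitrary example of a predictable strategy $\varrho_n$ (trial counts and the seed are equally arbitrary): for $p = 1/2$ the average final capital stays near $0$ whatever the strategy, while for $p > 1/2$ it drifts upward:

```python
import random

def stake(history):
    """An arbitrary predictable staking rule rho_n: 2 after a loss, 1 otherwise."""
    return 2 if history and history[-1] == -1 else 1

def average_final_capital(p, rounds=10, trials=20000, seed=2):
    """Average of S_rounds when each round is won with probability p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        history, s = [], 0.0
        for _ in range(rounds):
            y = stake(history)            # may only use the past tosses
            x = 1 if rng.random() < p else -1
            s += y * x
            history.append(x)
        total += s
    return total / trials

print(average_final_capital(0.5))  # fair coin: martingale, average near 0
print(average_final_capital(0.6))  # p > 1/2: submartingale, average drifts up
```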

Exercise 12.10 Let $X_1, X_2, \ldots$ be a sequence of independent random variables with finite variances $V(X_n) = \sigma_n^2$. Then

\[
\Big(\sum_{i=1}^n (X_i - E(X_i))\Big)^2 - \sum_{i=1}^n \sigma_i^2
\]

is a martingale with respect to the filtration $\mathcal F_n = \sigma(X_1, \ldots, X_n)$.

Exercise 12.11 Consider the gambler's martingale. Consider an i.i.d. sequence $(X_n)_{n=1}^\infty$ of Bernoulli variables with values $-1$ and $1$, each with probability $1/2$. Consider the sequence $(Y_n)$ such that $Y_n = 2^{n-1}$ if $X_1 = \cdots = X_{n-1} = -1$, and $Y_n = 0$ if $X_i = 1$ for some $i \le n - 1$. Show that $S_n = \sum_{i=1}^n X_i Y_i$ is a martingale. Show that $S_n$ almost surely converges and determine its limit $S_\infty$. Observe that $S_n \neq E(S_\infty \mid \mathcal F_n)$.

Example 12.12 In a sense Example 12.8 is both a special case and a generalization of the following example. To this end let $X_1, \ldots, X_n, \ldots$ denote an i.i.d. sequence of $\mathbb R^d$-valued random variables. Assume

\[
P(X_i = +e_k) = P(X_i = -e_k) = \frac{1}{2d}
\]

for all $i = 1, 2, \ldots$ and all $k = 1, \ldots, d$. Here $e_k$ denotes the $k$-th unit vector. Define the process $(S_n)$ by $S_0 = 0$ and

\[
S_n = \sum_{i=1}^n X_i.
\]

This process is called a random walk in $d$ dimensions. Some of its properties will be discussed below. First we will see that indeed $(S_n)$ is a martingale. Indeed,

\[
E[S_{n+1} \mid X_1, \ldots, X_n] = E[X_{n+1}] + S_n = S_n.
\]

As a matter of fact, not only is $(S_n)$ a martingale but, in a certain sense, it is the discrete time martingale.

Since the random walk in $d$ dimensions is the model for a discrete time martingale (the standard model of a continuous time martingale will be introduced in the following section), it is worthwhile studying some of its properties. This has been done in thousands of research papers in the past 50 years. We will just mention one interesting property here, which reveals a dichotomy in the random walk's behavior between dimensions $d = 1, 2$ and $d \ge 3$.

Definition 12.13 Let $(S_n)$ be a stochastic process in $\mathbb Z^d$, i.e. for each $n \in \mathbb N$, $S_n$ is a random variable with values in $\mathbb Z^d$. $(S_n)$ is called recurrent in a state $x \in \mathbb Z^d$, if

P (Sn = x infinitely often in n) = 1.

It is called transient in x, if

P (Sn = x infinitely often in n) < 1.

$(S_n)$ is called recurrent (transient) if each $x \in \mathbb Z^d$ is recurrent (transient).

Proposition 12.14 In the situation of Example 12.12, if $x \in \mathbb Z^d$ is recurrent, then all $y \in \mathbb Z^d$ are recurrent.

Exercise 12.15 Show Proposition 12.14.

We will show a variant of the following

Theorem 12.16 The random walk $(S_n)$ introduced in Example 12.12 is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.

To prove a version of Theorem 12.16 we will first discuss the property of recurrence:

Lemma 12.17 Let $f_k$ denote the probability that the random walk returns to the origin after $k$ steps for the first time, and let $p_k$ denote the probability that the random walk is at the origin after $k$ steps. Then the random walk $(S_n)$ is recurrent if and only if $\sum_k f_k = 1$, and this is the case if and only if $\sum_k p_k = \infty$.

Proof. The first equivalence is easy. Denote by $\Omega_k$ the set of all realizations of the random walk returning to the origin for the first time after $k$ steps. If the random walk $(S_n)$ is recurrent, then with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \neq 0$ for all $0 < l < k$. Hence $\sum_k f_k = 1$. On the other hand, if $\sum_k f_k = 1$, then $P(\bigcup_k \Omega_k) = 1$. Hence with probability one there exists a $k > 0$ such that $S_k = 0$ and $S_l \neq 0$ for all $0 < l < k$. But then the situation at times $0$ and $k$ is completely the same, and hence there exists $k' > k$ such that $S_{k'} = 0$ and $S_l \neq 0$ for all $k < l < k'$. Iterating this gives that $S_k = 0$ for infinitely many $k$'s with probability one.

In order to relate fk and pk we derive the following recursion

\[
p_k = f_k + f_{k-1}\,p_1 + \cdots + f_0\,p_k \tag{12.3}
\]

(the last summand is just added for completeness; we have $f_0 = 0$). Indeed this is again easy to see. The left hand side is just the probability to be at the origin at time $k$. This event is the disjoint union of the events to be at $0$ for the first time after $l$ steps, $1 \le l \le k$, and to walk from zero to zero in the remaining $k - l$ steps. Hence we obtain

\[
p_k = \sum_{i=1}^k f_i\,p_{k-i} \quad\text{and}\quad p_0 = 1. \tag{12.4}
\]

Define the generating functions

\[
F(z) = \sum_{k \ge 0} f_k z^k \qquad\text{and}\qquad P(z) = \sum_{k \ge 0} p_k z^k.
\]

Multiplying the left and right sides in (12.4) with $z^k$ and summing them from $k = 0$ to infinity gives

\[
P(z) = 1 + P(z)F(z), \quad\text{i.e.}\quad F(z) = 1 - 1/P(z).
\]

By Abel's theorem

\[
\sum_{k=1}^\infty f_k = F(1) = \lim_{z \uparrow 1} F(z) = 1 - \lim_{z \uparrow 1} \frac{1}{P(z)}.
\]

First assume that $\sum_k p_k < \infty$. Then

\[
\lim_{z \uparrow 1} P(z) = P(1) = \sum_k p_k < \infty
\]

and thus

\[
\lim_{z \uparrow 1} \frac{1}{P(z)} = 1 \Big/ \sum_k p_k > 0.
\]

Hence $\sum_{k=1}^\infty f_k < 1$ and the random walk $(S_n)$ is transient. Next assume that $\sum_k p_k = \infty$ and fix $\varepsilon > 0$. Then we find $N$ such that

\[
\sum_{k=0}^N p_k \ge \frac{2}{\varepsilon}.
\]

Then for $z$ sufficiently close to one we have $\sum_{k=0}^N p_k z^k \ge \frac1\varepsilon$ and consequently for such $z$

\[
\frac{1}{P(z)} \le \frac{1}{\sum_{k=0}^N p_k z^k} \le \varepsilon.
\]

But this implies that

\[
\lim_{z \uparrow 1} \frac{1}{P(z)} = 1 \Big/ \sum_k p_k = 0
\]

and therefore $\sum_{k=1}^\infty f_k = 1$ and the random walk $(S_n)$ is recurrent.
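The renewal equation (12.4) also gives a practical way to compute the first-return probabilities $f_k$ from the return probabilities $p_k$. The sketch below does this for a walk with $d$ independent fair $\pm1$ coordinates (the walk $(R_n)$ considered below), for which the probability of being back at the origin after $2k$ steps is $q_k^d$ with $q_k = \binom{2k}{k}4^{-k}$; the truncation point $K$ is an ad-hoc choice. The partial sums $\sum_k f_k$ approach $1$ for $d = 1$ (only very slowly for $d = 2$) and stay bounded away from $1$ for $d = 3$:

```python
def total_first_return(d, K=1000):
    """Partial sum sum_{k=1..K} f_k, with f_k solved from the renewal
    equation p_k = sum_{i=1}^k f_i p_{k-i} (p_0 = 1, f_0 = 0).

    Time is counted in double steps: p[k] = P(walk is back at 0 after 2k
    steps) = q_k ** d, with q_k = binom(2k, k) / 4**k for one coordinate."""
    q = [1.0]
    for k in range(1, K + 1):
        q.append(q[-1] * (2 * k - 1) / (2 * k))  # ratio of central binomials
    p = [x ** d for x in q]
    f = [0.0] * (K + 1)
    for k in range(1, K + 1):
        f[k] = p[k] - sum(f[i] * p[k - i] for i in range(1, k))
    return sum(f[1:])

for d in (1, 2, 3):
    print(d, total_first_return(d))
```

For $d = 1$ this reproduces $f_1 = p_1 = 1/2$, the probability that the very first double step already returns to the origin.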

Exercise 12.18 What does the Borel–Cantelli Lemma 8.1 have to say in the above situation? Don't overlook that the events $\{S_n = x\}$ ($n \in \mathbb N$) may be dependent.

We will now apply this criterion to analyze recurrence and transience for a random walk similar to the one defined in Example 12.12. To this end define the following random walk $(R_n)$ in $d$ dimensions. For $k \in \mathbb N$ let $Y_1^k, \ldots, Y_d^k$ be i.i.d. random variables taking values in $\{-1, +1\}$ with $P(Y_1^k = 1) = P(Y_1^k = -1) = 1/2$. Let $X_k$ be the random vector $X_k = (Y_1^k, \ldots, Y_d^k)$. Define $R_0 \equiv 0$ and for $n \ge 1$

\[
R_n = \sum_{k=1}^n X_k.
\]

Theorem 12.19 The random walk $(R_n)$ defined above is recurrent in dimensions $d = 1, 2$ and transient for $d \ge 3$.

Proof. Consider a sequence of i.i.d. random variables $(Z_k)$ taking values in $\{-1, +1\}$ with $P(Z_k = 1) = P(Z_k = -1) = 1/2$. Write $q_k = P(\sum_{i=1}^{2k} Z_i = 0)$. Then we apply Stirling's formula,

\[
\lim_{n\to\infty} n! \big/ \big(\sqrt{2\pi}\,n^{n+1/2}e^{-n}\big) = 1,
\]

to obtain

\[
q_k = 2^{-2k}\binom{2k}{k} = 2^{-2k}\,\frac{(2k)!}{k!\,k!}
\sim 2^{-2k}\,\frac{\sqrt{4\pi k}\,\big(\frac{2k}{e}\big)^{2k}}{2\pi k\,\big(\frac{k}{e}\big)^{2k}}
= \frac{1}{\sqrt{\pi k}}.
\]

Hence the probability that a single coordinate of $R_{2n}$ is zero ($R_n$ cannot be zero if $n$ is odd) asymptotically behaves like $\sqrt{\frac{1}{\pi n}}$. Since the $d$ coordinates are independent,

\[
P(R_{2n} = 0) \sim \Big(\frac{1}{\pi n}\Big)^{d/2}.
\]

But

\[
\sum_n \Big(\frac{1}{\pi n}\Big)^{d/2} = \infty
\]

for $d = 1$ and $d = 2$, while

\[
\sum_n \Big(\frac{1}{\pi n}\Big)^{d/2} < \infty
\]

for $d \ge 3$. By Lemma 12.17 this proves the theorem.
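The dichotomy can also be seen in a direct simulation of $(R_n)$: the fraction of paths that revisit the origin within a fixed horizon is close to $1$ for $d = 1$ and clearly bounded away from $1$ for $d = 3$. Horizon, trial count and seed are arbitrary choices, and the finite horizon only yields a lower bound for the true return probability:

```python
import random

def return_fraction(d, horizon=500, trials=1000, seed=7):
    """Fraction of simulated paths of R_n that revisit the origin in time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos = [0] * d
        for _ in range(horizon):
            for i in range(d):
                pos[i] += rng.choice((-1, 1))  # each coordinate is a fair walk
            if all(c == 0 for c in pos):
                hits += 1
                break
    return hits / trials

print(return_fraction(1), return_fraction(3))
```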

We will give two results about martingales. The first one is inspired by the fact that, given a random variable $M$ and a filtration $(\mathcal F_t)$ of $\mathcal F$, the family of conditional expectations $E(M \mid \mathcal F_t)$ yields a martingale. Can a martingale always be described in this way? That is the content of the Martingale Limit Theorem.

Theorem 12.20 Suppose $(M_t)$ is a martingale with respect to the filtration $(\mathcal F_t)$ and that the martingale is (uniformly) square integrable, that is, $\limsup_t E(M_t^2) < \infty$. Then there is a square integrable random variable $M_\infty$ such that $M_t = E(M_\infty \mid \mathcal F_t)$ a.s. Moreover $\lim M_t = M_\infty$ in the $L^2$ sense.

Proof. The basic property that we will use is that the space of square integrable random variables $L^2(\Omega, \mathcal F, P)$ with the $L^2$ inner product is a complete vector space; in other words, a Cauchy sequence converges. Recall that for any $t < s$ it holds that $M_t = E(M_s \mid \mathcal F_t)$, and we have seen in Theorem 11.13 that then $M_s - M_t$ is perpendicular to $M_t$. In particular we have the Pythagoras formula

\[
E(M_s^2) = E(M_t^2) + E\big((M_s - M_t)^2\big).
\]

This implies that $E(M_s^2)$ is increasing in $s$, and therefore its limit exists and equals $\limsup_t E(M_t^2)$, which is finite. Therefore, given $\varepsilon > 0$ there is a $u$ such that for $t > s > u$ we have $E((M_s - M_t)^2) < \varepsilon$. That means that $(M_s)_s$ is a Cauchy sequence. Let $M_\infty$ be its limit; in particular $M_\infty$ is a square integrable random variable. Since orthogonal projection onto the subspace of $\mathcal F_t$-measurable functions is a continuous map, it holds that $E(M_\infty \mid \mathcal F_t) = \lim_s E(M_s \mid \mathcal F_t) = M_t$.

With some extra effort one may show that $M_\infty$ is also the limit in the sense of almost sure convergence. The Martingale Limit Theorem is valid under more general circumstances: for example, if only $\limsup_t E(|M_t|) < \infty$, the martingale still converges almost surely, and if the family $(M_t)$ is in addition uniformly integrable, $M_\infty$ is also the limit in the $L^1$ sense.

An important concept for random processes is the concept of a stopping time.

Definition 12.21 A stopping time is a random variable $\tau : \Omega \to I \cup \{\infty\}$ such that for all $t \in I$, $\{\omega :\ \tau(\omega) \le t\} \in \mathcal F_t$. Here $I \cup \{\infty\}$ is ordered such that $t < \infty$ for all $t \in I$. Given a process $(M_t)$, the stopped process $(M_{\tau\wedge t})$ is given by $M_{\tau\wedge t}(\omega) = M_s(\omega)$ where $s = \tau(\omega) \wedge t = \min(\tau(\omega), t)$.

Example 12.22 Given $A \in \mathcal F_u$, a stopping time is constructed by $\tau_A = \infty \cdot 1_{A^c} + u \cdot 1_A$, that is, $\tau_A(\omega) = \infty$ if $\omega \notin A$, and $\tau_A(\omega) = u$ if $\omega \in A$.

A very important property of martingales is the following Martingale Stopping Theorem. Theorem 12.24 Let M be a martingale, and τ a stopping time. Then the stopped { t}t process Mτ t t is a martingale. { ∧ } Proof. It is easy to see that an adapted process stopped at a stopping time is again an adapted process. We will give a proof of the martingale property for the simple stopping time τ given above. If s > t u, let B , then A ≥ ∈ Ft

\[
\int_B E(M_{\tau\wedge s} \mid \mathcal F_t)\,dP = \int_B M_{\tau\wedge s}\,dP
= \int_{B\cap A} M_{\tau\wedge s}\,dP + \int_{B\cap A^c} M_{\tau\wedge s}\,dP
\]
\[
= \int_{B\cap A} M_u\,dP + \int_{B\cap A^c} M_s\,dP
= \int_{B\cap A} M_u\,dP + \int_{B\cap A^c} E(M_s \mid \mathcal F_t)\,dP
\]
\[
= \int_{B\cap A} M_u\,dP + \int_{B\cap A^c} M_t\,dP = \int_B M_{\tau\wedge t}\,dP.
\]

Since $B \in \mathcal F_t$ was arbitrary, this shows $E(M_{\tau\wedge s} \mid \mathcal F_t) = M_{\tau\wedge t}$. If $u > t$ and $s > t$, then one can apply Theorem 11.12:

\[
E(M_{\tau\wedge s} \mid \mathcal F_t)
= E\big(E(M_{\tau\wedge s} \mid \mathcal F_u) \mid \mathcal F_t\big)
= E(M_{\tau\wedge u} \mid \mathcal F_t)
= E(M_u \mid \mathcal F_t) = M_t = M_{\tau\wedge t}.
\]

Here the second equality uses the first case (with $u$ in place of $t$) when $s > u$; for $t < s \le u$ we have $M_{\tau\wedge s} = M_s$ and the claim follows directly from the martingale property. Moreover $M_{\tau\wedge u} = M_u$ and $M_{\tau\wedge t} = M_t$, because $\tau_A$ only takes the values $u$ and $\infty$.

Exercise 12.25 Modify the proof of Theorem 12.24 to show that a stopped supermartingale is a supermartingale.

Exercise 12.26 Consider the roulette game. There are several possibilities for a bet, given by a number $p \in (0, 1)$ such that with probability $\frac{36}{37} \cdot p$ the return is $p^{-1}$ times the stake, and the return is zero with probability $1 - \frac{36}{37} \cdot p$. The probabilities $p$ are such that $p^{-1} \in \mathbb N$. Suppose you start with an initial fortune $X_0 \in \mathbb N$, and perform a sequence of bets until this fortune is reduced to zero. We are interested in the expected value of the total sum of stakes. To determine this, consider the sequence of subsequent fortunes $X_i$, and consider the sequence of stakes $Y_i$, meaning that the stake in bet $i$ is $Y_i = Y_i(X_1, \ldots, X_{i-1})$ ($Y_i \le X_{i-1}$). In particular, if for this stake the probability $p$ is chosen, either $X_i = X_{i-1} - Y_i + p^{-1} \cdot Y_i$ (with probability $\frac{36}{37} \cdot p$) or $X_i = X_{i-1} - Y_i$ (with probability $1 - \frac{36}{37} \cdot p$). Show that $(X_i + \frac{1}{37} \cdot \sum_{j=1}^i Y_j)$ is a martingale with respect to the filtration $\mathcal F_i = \sigma(X_1, \ldots, X_i)$. The stopping time $N$ is the first time $i$ such that $X_i = 0$. Show that $E(\sum_{j=1}^N Y_j) = 37 \cdot X_0$.
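The identity $E(\sum_{j=1}^N Y_j) = 37\,X_0$ can be made plausible by simulation for the simplest strategy: always stake $1$ on the $p = 1/2$ bet, so that with probability $18/37$ the return is twice the stake and the fortune rises by $1$, and otherwise it falls by $1$. Trial count and seed are arbitrary:

```python
import random

def mean_total_stakes(x0, trials=20000, seed=5):
    """Stake 1 per bet on the p = 1/2 bet until ruin; average total staked."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x, staked = x0, 0
        while x > 0:
            staked += 1
            # fortune rises 1 with probability 18/37, falls 1 otherwise
            x += 1 if rng.random() < 18 / 37 else -1
        total += staked
    return total / trials

print(mean_total_stakes(1))  # the martingale identity predicts 37 * x0 = 37
```

The sample mean fluctuates around $37 \cdot x_0$; the fluctuations are sizeable because the time to ruin has a heavy-tailed distribution.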

In this section we will construct the continuous time martingale, Brownian motion. Besides this, Brownian motion is also a building block of stochastic calculus and stochastic analysis. In stochastic analysis one studies random functions of one variable and various kinds of integrals and derivatives thereof. The argument of these functions is usually interpreted as ‘time’, so the functions themselves can be thought of as the path of a random process.

57 Here, like in other areas of mathematics, going from the discrete to the continuous yields a pay-off in simplicity and smoothness, at the price of a formally more complicated n 3 n 3 analysis. Compare, to make an analogy, the integral 0 x dx with the sum k=1 k . The integral requires a more refined analysis for its definition and its properties, but once this R has been done the integral is easier to calculate. Similarly, in stochastic analysisP you will become acquainted with a convenient differential calculus as a reward for some hard work in analysis. Stochastic analysis can be applied in a wide variety of situations. We sketch a few examples below.

1. Some differential equations become more realistic when we allow some randomness in their coefficients. Consider for example the following growth equation, used among other places in population biology: d S = (r + “N ”)S . (13.1) dt t t t

Here, S_t is the size of the population at time t, r is the average growth rate of the population, and the "noise" N_t models random fluctuations in the growth rate.

2. At time t = 0 an investor buys stocks and bonds on the financial market, i.e., he divides his initial capital C_0 into A_0 shares of stock and B_0 shares of bonds. The bonds will yield a guaranteed interest rate r_0. If we assume that the stock price S_t satisfies the growth equation (13.1), then his capital C_t at time t is

   C_t = A_t S_t + B_t e^{r_0 t},   (13.2)

where A_t and B_t are the amounts of stocks and bonds held at time t. With a keen eye on the market the investor sells stocks to buy bonds and vice versa. If his tradings are 'self-financing', then dC_t = A_t dS_t + B_t d(e^{r_0 t}). An interesting question is: what would he be prepared to pay for a so-called European call option, i.e., the right (bought at time 0) to purchase at time T > 0 a share of stock at a predetermined price K?

The rational answer, q say, was found by Black and Scholes (1973) through an analysis of the possible strategies leading from an initial investment q to a payoff CT . Their formula is being used on the stock markets all over the world.
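Although these notes do not develop numerics, the growth equation (13.1) can be illustrated by simulation. Interpreting it as the stochastic differential equation dS_t = r S_t dt + σ S_t dB_t (geometric Brownian motion, the model underlying the Black–Scholes analysis), a minimal Euler–Maruyama sketch looks as follows; the parameter values r = 0.05, σ = 0.2 and the step counts are arbitrary illustrations, not part of the notes:

```python
import math
import random

def euler_growth(s0, r, sigma, T, n, rng):
    """One Euler-Maruyama path of dS = r*S dt + sigma*S dB on [0, T]."""
    dt, s = T / n, s0
    for _ in range(n):
        dB = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment ~ N(0, dt)
        s += r * s * dt + sigma * s * dB
    return s

rng = random.Random(1)
r, sigma = 0.05, 0.2
paths = [euler_growth(1.0, r, sigma, 1.0, 100, rng) for _ in range(10_000)]
mean_ST = sum(paths) / len(paths)
print(mean_ST)  # close to exp(r) = E(S_1) for geometric Brownian motion
```

For geometric Brownian motion E(S_T) = S_0 e^{rT}, and the Monte Carlo average reproduces this within its sampling error.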

3. The Langevin equation describes the behaviour of a dust particle suspended in a fluid:

   m (d/dt) V_t = −η V_t + "N_t".   (13.3)

Here, V_t is the velocity at time t of the dust particle, −η V_t is the friction exerted on the particle due to the viscosity η of the fluid, and the "noise" N_t stands for the disturbance due to the thermal motion of the surrounding fluid molecules colliding with the particle.

4. The path of the dust particle in example 3 is observed with some inaccuracy. One measures the perturbed signal Z_t given by

   Z_t = V_t + "Ñ_t".   (13.4)

Here Ñ_t is again a "noise". One is interested in the best guess for the actual value of V_t, given the observations Z_s for 0 ≤ s ≤ t. This is called a filtering problem: how to filter away the noise Ñ_t. Kalman and Bucy (1961) found a linear algorithm, which was almost immediately applied in aerospace engineering. Filtering theory is now a flourishing and extremely useful discipline.
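The flavour of the Kalman–Bucy solution can be conveyed by its discrete-time analogue. The following sketch filters a scalar AR(1) state observed in noise; the model and all parameter values (a, q, r) are illustrative assumptions:

```python
import random

def kalman_1d(zs, a, q, r, v0=0.0, p0=1.0):
    """Scalar Kalman filter for the model v_k = a*v_{k-1} + N(0, q),
    observed as z_k = v_k + N(0, r). Returns the filtered estimates."""
    v, p, out = v0, p0, []
    for z in zs:
        v, p = a * v, a * a * p + q          # predict
        k = p / (p + r)                      # Kalman gain
        v, p = v + k * (z - v), (1 - k) * p  # update with the observation
        out.append(v)
    return out

rng = random.Random(2)
a, q, r = 0.95, 0.1, 1.0
truth, zs, v = [], [], 0.0
for _ in range(2000):
    v = a * v + rng.gauss(0.0, q ** 0.5)
    truth.append(v)
    zs.append(v + rng.gauss(0.0, r ** 0.5))
est = kalman_1d(zs, a, q, r)
mse_raw = sum((z - t) ** 2 for z, t in zip(zs, truth)) / len(zs)
mse_filt = sum((e - t) ** 2 for e, t in zip(est, truth)) / len(zs)
print(mse_raw, mse_filt)  # filtering reduces the mean squared error
```

The raw observations have mean squared error about r, while the filter settles near its (much smaller) steady-state error variance.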

5. Stochastic analysis can help solve boundary value problems such as the Dirichlet problem. If the value of a harmonic function f on the boundary of some bounded regular region D ⊂ R^n is known, then one can express the value of f in the interior of D as follows:

   E^x(f(B_τ)) = f(x),   (13.5)

where B_t := x + ∫_0^t N_s ds is an "integrated noise" or Brownian motion starting at x, and τ denotes the time when this Brownian motion first reaches the boundary. (A harmonic function f is a function satisfying ∆f = 0, with ∆ the Laplacian.)
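Formula (13.5) already suggests a Monte Carlo method: start a random process at x, wait until it hits the boundary, and average the boundary values of f. A one-dimensional sketch, with a symmetric random walk standing in for Brownian motion and with "harmonic" simply meaning linear, so f(x) = x on [0, 1] with boundary values f(0) = 0, f(1) = 1:

```python
import random

def exit_value(k, n, rng):
    """Symmetric walk on {0, 1, ..., n} started at k; returns the boundary
    value of f at the exit point, where f(0) = 0 and f(n) = 1."""
    while 0 < k < n:
        k += 1 if rng.random() < 0.5 else -1
    return k / n

rng = random.Random(3)
n, k0 = 20, 6                     # grid of mesh 1/20, start at x = 0.3
trials = 20_000
est = sum(exit_value(k0, n, rng) for _ in range(trials)) / trials
print(est)  # approximates the harmonic function f(x) = x at x = 0.3
```

The walk exits at the right endpoint with probability exactly k0/n, so the average boundary value converges to f(x) = x, in line with (13.5).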

The goal of the course Stochastic Analysis is to make sense of the above equations, and to work with them.

In all the above examples the unexplained symbol Nt occurs, which is to be thought of as a “completely random” function of t, in other words, the continuous time analogue of a sequence of independent identically distributed random variables. In a first attempt to catch this concept, let us formulate the following requirements:

1. N_t is independent of N_s for t ≠ s;

2. The random variables N_t (t ≥ 0) all have the same distribution µ;

3. E(N_t) = 0.

However, when taken literally these requirements do not produce what we want. This is seen by the following argument. By requirement 1 we have for every point in time an independent value of N_t. We shall show that such a "continuous i.i.d. sequence" N_t is not measurable in t, unless it is identically 0. Let µ denote the probability distribution of N_t, which by requirement 2 does not depend on t, i.e., µ([a, b]) := P(a ≤ N_t ≤ b). Divide R into two half lines, one extending from a to −∞ and the other extending from a to ∞. If N_t is not a constant function of t, then there must be a value of a such that each of the half lines has positive measure. So

   p := P(N_t ≤ a) = µ((−∞, a]) ∈ (0, 1).   (13.6)

Now consider the set of time points where the noise N_t is low: E := {t ≥ 0 : N_t ≤ a}. It can be shown that with probability 1 the set E is not Lebesgue measurable. Without giving a full proof we can understand this as follows. Let λ denote the Lebesgue measure on R. If E were measurable, then by requirement 1 and Eq. (13.6) it would be reasonable to expect its relative share in any interval (c, d) to be p, i.e.,

   λ(E ∩ (c, d)) = p (d − c).   (13.7)

On the other hand, it is known from measure theory that every measurable set E is arbitrarily thick somewhere with respect to the Lebesgue measure λ, i.e., for all α < 1 an interval (c, d) can be found such that

   λ(E ∩ (c, d)) > α (d − c)

(cf. Halmos (1974), Th. III.16.A). This clearly contradicts Eq. (13.7), so E is not measurable. This is a bad property of N_t: for, in view of (13.1), (13.3), (13.4) and (13.5), we would like to integrate N_t. For this reason, let us approach the problem from another angle. Instead of N_t, let us consider the integral of N_t, and give it a name:

   B_t := ∫_0^t N_s ds.

The three requirements on the evasive object N_t then translate into three quite sensible requirements for B_t.

BM1. For 0 = t_0 ≤ t_1 ≤ ··· ≤ t_n the random variables B_{t_{j+1}} − B_{t_j} (j = 0, . . . , n−1) are independent;

BM2. B_t has stationary increments, i.e., the joint probability distribution of

   (B_{t_1+s} − B_{u_1+s}, B_{t_2+s} − B_{u_2+s}, . . . , B_{t_n+s} − B_{u_n+s})

does not depend on s ≥ 0, where t_i > u_i for i = 1, 2, . . . , n are arbitrary.

BM3. E(B_t − B_0) = 0 for all t.

We add a normalisation:

BM4. B_0 = 0 and E(B_1²) = 1.

Still, these four requirements do not determine B_t. For example, the compensated Poisson jump process also satisfies them. Our fifth requirement fixes the process B_t uniquely:

BM5. t ↦ B_t is continuous a.s.

The object B_t so defined is called the Wiener process, or (by a slight abuse of physical terminology) Brownian motion. In the next section we shall give a rigorous and explicit construction of this process. Before we go into details we remark the following.

Exercise 13.1 Show that BM5, together with BM1 and BM2, implies the following: for any ε > 0

   n P(|B_{t+1/n} − B_t| > ε) → 0   (13.8)

as n → ∞. Hint: compare with inequality (8.6).

Exercise 13.1 helps us to specify the increments of Brownian motion in the following way².

Exercise 13.2 Suppose BM1, BM2, BM4 and (13.8) hold. Apply the Central Limit Theorem (Lindeberg's condition, page 36) to

   X_{n,k} := B_{kt/n} − B_{(k−1)t/n}

and conclude that B_{s+t} − B_s, t > 0, has a normal distribution with variance t, i.e.

   P(B_{s+t} − B_s ∈ A) = (1/√(2πt)) ∫_A e^{−x²/(2t)} dx.
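Once the increments are known to be N(0, 1/n) over time steps of length 1/n, the probability in (13.8) becomes explicit and its decay can be computed directly: P(|B_{t+1/n} − B_t| > ε) = erfc(ε√(n/2)). A small numerical illustration (the choice ε = 0.5 is arbitrary):

```python
import math

def n_times_tail(n, eps=0.5):
    """n * P(|B_{t+1/n} - B_t| > eps) for an N(0, 1/n) increment,
    computed exactly as n * erfc(eps * sqrt(n/2))."""
    return n * math.erfc(eps * math.sqrt(n / 2))

vals = [n_times_tail(n) for n in (1, 10, 100, 1000)]
print(vals)  # the Gaussian tail beats the factor n: the sequence dies off
```

The Gaussian tail e^{−ε²n/2} decays much faster than the factor n grows, which is exactly the content of (13.8).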

As a matter of fact, BM1 and BM5 already imply that the increments B_{s+t} − B_s are normally distributed³.

BM2'. If s ≥ 0 and t > 0, then

   P(B_{s+t} − B_s ∈ A) = (1/√(2πt)) ∫_A e^{−x²/(2t)} dx.

We can now define Brownian motion as follows.

Definition 13.3 A one-dimensional Brownian motion is a real-valued process B_t, t ≥ 0, with the properties BM1, BM2', and BM5.
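As a quick sanity check of this definition (not a construction — that follows in the next section), one can approximate B on a finite grid by summing independent N(0, dt) increments and verify BM1 and BM2' statistically; the grid size and sample counts below are arbitrary:

```python
import math
import random

def brownian_path(n, T, rng):
    """Discrete approximation of Brownian motion on [0, T]: cumulative
    sums of n independent N(0, T/n) increments (BM1 and BM2')."""
    dt, b, path = T / n, 0.0, [0.0]
    for _ in range(n):
        b += rng.gauss(0.0, math.sqrt(dt))
        path.append(b)
    return path

rng = random.Random(4)
paths = [brownian_path(100, 1.0, rng) for _ in range(10_000)]
# sample second moment of B_1 should be close to 1 (BM4 normalisation)
var_b1 = sum(p[-1] ** 2 for p in paths) / len(paths)
# increments over [0, 1/2] and [1/2, 1] should be uncorrelated (BM1)
cov = sum(p[50] * (p[100] - p[50]) for p in paths) / len(paths)
print(var_b1, cov)
```

Both estimates agree with the definition up to Monte Carlo error: Var(B_1) ≈ 1 and the sample covariance of disjoint increments is near 0.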

13.1 Construction of Brownian Motion

Whenever a stochastic process with certain properties is defined, the most natural question to ask is: does such a process exist? Of course, the answer is yes, otherwise these lecture notes would not have been written. In this section we shall construct Brownian motion on [0, T]. For the sake of simplicity we will take T = 1; the construction for general T can be carried out along the same lines, or by just concatenating independent Brownian motions. The construction we shall use was given by P. Lévy in 1948. Since we saw that the increments of Brownian motion are independent Gaussian random variables, the idea is to construct Brownian motion from these Gaussian increments.

²See R. Durrett (1991), Probability: Theory and Examples, Section 7.1, Exercise 1.1, p. 334. Unfortunately, there is something wrong with this exercise. See the 3rd edition (2005) for a correct treatment.
³See e.g. I. Gihman, A. Skorohod, The Theory of Stochastic Processes I, Ch. III, §5, Theorem 5, p. 189. For a high-tech approach, see N. Ikeda, S. Watanabe, Stochastic Differential Equations and Diffusion Processes, Ch. II, Theorem 6.1, p. 74.

More precisely, we start with the following observation. Suppose we had already constructed Brownian motion, say (B_t)_{0≤t≤T}. Take two times 0 ≤ s < t ≤ T, put θ := (s+t)/2, and let

   p(τ, x, y) := (1/√(2πτ)) e^{−(y−x)²/(2τ)},   τ > 0, x, y ∈ R,

be the Gaussian kernel centered in x with variance τ. Then, conditioned on B_s = x and B_t = z, the random variable B_θ is normal with mean µ := (x+z)/2 and variance σ² := (t−s)/4. Indeed, since B_s, B_θ − B_s, and B_t − B_θ are independent, we obtain

   P[B_s ∈ dx, B_θ ∈ dy, B_t ∈ dz] = p(s, 0, x) p((t−s)/2, x, y) p((t−s)/2, y, z) dx dy dz
                                    = p(s, 0, x) p(t−s, x, z) · (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)} dx dy dz

(which is just a bit of algebra). Dividing by

   P[B_s ∈ dx, B_t ∈ dz] = p(s, 0, x) p(t−s, x, z) dx dz

we obtain

   P[B_θ ∈ dy | B_s ∈ dx, B_t ∈ dz] = (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)} dy,

which is our claim. This suggests that we might be able to construct Brownian motion on [0, 1] by interpolation.

To carry out this program, we begin with a sequence {ξ_k^{(n)}, k ∈ I(n), n ∈ N_0} of independent, standard normal random variables on some probability space (Ω, F, P). Here

   I(n) := {k ∈ N : k ≤ 2^n, k = 2l + 1 for some l ∈ N_0}

denotes the set of odd, positive integers not exceeding 2^n. For each n ∈ N_0 we define a process B^{(n)} := {B_t^{(n)} : 0 ≤ t ≤ 1} by recursion and linear interpolation of the preceding process, as follows. For n ∈ N, B^{(n)}_{k/2^{n−1}} will agree with B^{(n−1)}_{k/2^{n−1}} for all k = 0, 1, . . . , 2^{n−1}. Thus for each n we only need to specify the values of B^{(n)}_{k/2^n} for k ∈ I(n). We start with

   B_0^{(0)} = 0 and B_1^{(0)} = ξ_1^{(0)}.

If the values of B^{(n−1)}_{k/2^{n−1}}, k = 0, 1, . . . , 2^{n−1}, have been defined (and thus B_t^{(n−1)}, for k/2^{n−1} ≤ t ≤ (k+1)/2^{n−1}, is the linear interpolation between B^{(n−1)}_{k/2^{n−1}} and B^{(n−1)}_{(k+1)/2^{n−1}}) and k ∈ I(n), we denote s = (k−1)/2^n, t = (k+1)/2^n, µ = (B_s^{(n−1)} + B_t^{(n−1)})/2 and σ² = (t−s)/4 = 2^{−(n+1)}, and set, in accordance with the above observations,

   B^{(n)}_{k/2^n} := B^{(n)}_{(s+t)/2} := µ + σ ξ_k^{(n)}.

We shall show that, almost surely, B_t^{(n)} converges uniformly in t to a continuous function B_t (as n → ∞) and that B_t is a Brownian motion.
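The recursion just described is easy to carry out numerically. The sketch below refines the dyadic values level by level, adding an independent centred Gaussian with variance σ² = 2^{−(n+1)} at each new midpoint, and then checks statistically that Var(B_{1/2}) ≈ 1/2, as it must be for Brownian motion:

```python
import math
import random

def levy_construction(levels, rng):
    """Levy's construction of Brownian motion on [0, 1]: each midpoint is
    the average of its dyadic neighbours plus an independent
    N(0, 2^{-(n+1)}) correction. Returns values at k/2^levels."""
    b = [0.0, rng.gauss(0.0, 1.0)]        # B_0 = 0, B_1 = xi_1^(0)
    for n in range(1, levels + 1):
        sigma = math.sqrt(2.0 ** -(n + 1))
        new = []
        for i in range(len(b) - 1):
            mid = 0.5 * (b[i] + b[i + 1]) + sigma * rng.gauss(0.0, 1.0)
            new += [b[i], mid]
        b = new + [b[-1]]
    return b

rng = random.Random(5)
samples = [levy_construction(6, rng) for _ in range(20_000)]
# the value at t = 1/2 should have variance 1/2
var_half = sum(p[len(p) // 2] ** 2 for p in samples) / len(samples)
print(var_half)
```

At the first refinement, for instance, B_{1/2} = (B_0 + B_1)/2 + σξ with σ² = 1/4, giving Var(B_{1/2}) = 1/4 + 1/4 = 1/2, exactly as the conditional-distribution computation above prescribes.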

We start by giving a more convenient representation of the processes B^{(n)}, n = 0, 1, . . .. We define the Haar functions by H_1^{(0)}(t) ≡ 1 and, for n ∈ N, k ∈ I(n),

   H_k^{(n)}(t) :=  2^{(n−1)/2}    if (k−1)/2^n ≤ t < k/2^n,
                   −2^{(n−1)/2}    if k/2^n ≤ t < (k+1)/2^n,
                    0              otherwise.

The Schauder functions are defined by

   S_k^{(n)}(t) := ∫_0^t H_k^{(n)}(u) du,   0 ≤ t ≤ 1, n ∈ N_0, k ∈ I(n).

Note that S_1^{(0)}(t) = t, and that for n ≥ 1 the graphs of S_k^{(n)} are little tents of height 2^{−(n+1)/2} centered at k/2^n, non-overlapping for different values of k ∈ I(n). Clearly, B_t^{(0)} = ξ_1^{(0)} S_1^{(0)}(t), and by induction on n it is readily verified that

   B_t^{(n)}(ω) = Σ_{m=0}^n Σ_{k∈I(m)} ξ_k^{(m)}(ω) S_k^{(m)}(t),   0 ≤ t ≤ 1, n ∈ N.   (13.9)

Lemma 13.4 As n → ∞, the sequence of functions {B_t^{(n)}(ω), 0 ≤ t ≤ 1}, n ∈ N_0, given by (13.9) converges uniformly in t to a continuous function {B_t(ω), 0 ≤ t ≤ 1} for almost every ω ∈ Ω.

Proof. Let b_n := max_{k∈I(n)} |ξ_k^{(n)}|. Observe that for x > 0 and each n, k

   P(|ξ_k^{(n)}| > x) = √(2/π) ∫_x^∞ e^{−u²/2} du
                      ≤ √(2/π) ∫_x^∞ (u/x) e^{−u²/2} du = √(2/π) (1/x) e^{−x²/2},

which gives

   P(b_n > n) = P(⋃_{k∈I(n)} {|ξ_k^{(n)}| > n}) ≤ 2^n P(|ξ_1^{(n)}| > n) ≤ √(2/π) (2^n/n) e^{−n²/2}

for all n ∈ N. Since

   Σ_n √(2/π) (2^n/n) e^{−n²/2} < ∞,

the Borel–Cantelli Lemma implies that there is a set Ω̃ with P(Ω̃) = 1 such that for ω ∈ Ω̃ there is an n_0(ω) such that for all n ≥ n_0(ω) it holds true that b_n(ω) ≤ n. But then

   Σ_{n≥n_0(ω)} Σ_{k∈I(n)} |ξ_k^{(n)}(ω) S_k^{(n)}(t)| ≤ Σ_{n≥n_0(ω)} n 2^{−(n+1)/2} < ∞;

so for ω ∈ Ω̃, B_t^{(n)}(ω) converges uniformly in t to a limit B_t. The uniformity of the convergence implies the continuity of the limit B_t.

The following exercise facilitates the construction of Brownian motion substantially:

Exercise 13.5 Check the following in a textbook of functional analysis: the inner product

   ⟨f, g⟩ := ∫_0^1 f(t) g(t) dt

turns L²[0, 1] into a Hilbert space, and the Haar functions {H_k^{(n)} : k ∈ I(n), n ∈ N_0} form a complete, orthonormal system. Thus the Parseval equality

   ⟨f, g⟩ = Σ_{n=0}^∞ Σ_{k∈I(n)} ⟨f, H_k^{(n)}⟩ ⟨g, H_k^{(n)}⟩   (13.10)

holds true.

Applying (13.10) to f = 1_{[0,t]} and g = 1_{[0,s]} yields

   Σ_{n=0}^∞ Σ_{k∈I(n)} S_k^{(n)}(t) S_k^{(n)}(s) = s ∧ t.   (13.11)

Now we are able to prove

Theorem 13.6 With the above notations

   B_t := lim_{n→∞} B_t^{(n)}

is a Brownian motion on [0, 1].

Proof. In view of our definition of Brownian motion it suffices to prove that for 0 = t_0 < t_1 < . . . < t_n ≤ 1, the increments (B_{t_j} − B_{t_{j−1}})_{j=1,...,n} are independent and normally distributed with mean zero and variance (t_j − t_{j−1}). For this we will show that the Fourier transforms satisfy the appropriate condition, namely that for λ_j ∈ R (and as usual i := √(−1))

   E[exp(i Σ_{j=1}^n λ_j (B_{t_j} − B_{t_{j−1}}))] = Π_{j=1}^n exp(−(1/2) λ_j² (t_j − t_{j−1})).   (13.12)

To derive (13.12) it is most natural to exploit the construction of B_t from Gaussian random variables. Set λ_{n+1} = 0 and use the independence and normality of the ξ_k^{(n)} to compute, for

M ∈ N,

   E[exp(−i Σ_{j=1}^n (λ_{j+1} − λ_j) B^{(M)}_{t_j})]
     = E[exp(−i Σ_{m=0}^M Σ_{k∈I(m)} ξ_k^{(m)} Σ_{j=1}^n (λ_{j+1} − λ_j) S_k^{(m)}(t_j))]
     = Π_{m=0}^M Π_{k∈I(m)} E[exp(−i ξ_k^{(m)} Σ_{j=1}^n (λ_{j+1} − λ_j) S_k^{(m)}(t_j))]
     = Π_{m=0}^M Π_{k∈I(m)} exp(−(1/2) (Σ_{j=1}^n (λ_{j+1} − λ_j) S_k^{(m)}(t_j))²)
     = exp(−(1/2) Σ_{j=1}^n Σ_{l=1}^n (λ_{j+1} − λ_j)(λ_{l+1} − λ_l) Σ_{m=0}^M Σ_{k∈I(m)} S_k^{(m)}(t_j) S_k^{(m)}(t_l)).

Now we send M → ∞ and apply (13.11) to obtain

   E[exp(i Σ_{j=1}^n λ_j (B_{t_j} − B_{t_{j−1}}))] = E[exp(−i Σ_{j=1}^n (λ_{j+1} − λ_j) B_{t_j})]
     = exp(−Σ_{j=1}^{n−1} Σ_{l=j+1}^n (λ_{j+1} − λ_j)(λ_{l+1} − λ_l) t_j − (1/2) Σ_{j=1}^n (λ_{j+1} − λ_j)² t_j)
     = exp(−Σ_{j=1}^{n−1} (λ_{j+1} − λ_j)(−λ_{j+1}) t_j − (1/2) Σ_{j=1}^n (λ_{j+1} − λ_j)² t_j)
     = exp((1/2) Σ_{j=1}^{n−1} (λ_{j+1}² − λ_j²) t_j − (1/2) λ_n² t_n)
     = Π_{j=1}^n exp(−(1/2) λ_j² (t_j − t_{j−1})).
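The identity (13.11), the heart of the proof above, lends itself to a direct numerical check. For fixed s and t only one Schauder tent per level is non-zero at each point, so the truncated double sum converges to s ∧ t geometrically fast (the truncation error at level N is at most 2^{−(N+1)}):

```python
def schauder(n, k, t):
    """Schauder function S_k^(n)(t): the tent of height 2^{-(n+1)/2}
    centred at k/2^n; S_1^(0)(t) = t for the zeroth level."""
    if n == 0:
        return t
    c, w = k / 2 ** n, 1.0 / 2 ** n
    return 2 ** ((n - 1) / 2) * max(0.0, w - abs(t - c))

def parseval_sum(s, t, levels):
    """Partial sum of (13.11): over n <= levels and odd k in I(n)."""
    total = schauder(0, 1, s) * schauder(0, 1, t)
    for n in range(1, levels + 1):
        for k in range(1, 2 ** n, 2):
            total += schauder(n, k, s) * schauder(n, k, t)
    return total

print(parseval_sum(0.3, 0.7, 12), min(0.3, 0.7))
```

With 12 levels the partial sum already agrees with s ∧ t to about four decimal places, in line with the 2^{−13} error bound.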

14 Appendix

Let Ω be a set and P(Ω) the collection of subsets of Ω.

Definition 14.1 A system of sets R ⊂ P(Ω) is called a ring if it satisfies

   ∅ ∈ R,
   A, B ∈ R ⇒ A \ B ∈ R,
   A, B ∈ R ⇒ A ∪ B ∈ R.

If additionally Ω ∈ R, then R is called an algebra. Note that for A, B ⊂ Ω their intersection satisfies A ∩ B = A \ (A \ B).

Definition 14.2 A system D ⊂ P(Ω) is called a Dynkin system if it satisfies

   Ω ∈ D,
   D ∈ D ⇒ D^c ∈ D,
   for every sequence (D_n)_{n∈N} of pairwise disjoint sets D_n ∈ D, their union ⋃_n D_n is also in D.

The following theorem holds:

Theorem 14.3 A Dynkin system D is a σ-algebra if and only if for any two A, B ∈ D we have A ∩ B ∈ D.

As in the case of σ-algebras, for every system of sets E ⊂ P(Ω) there is a smallest Dynkin system D(E) generated by (and containing) E. The importance of Dynkin systems is mainly due to the following

Theorem 14.4 For every E ⊂ P(Ω) with

   A, B ∈ E ⇒ A ∩ B ∈ E

we have D(E) = σ(E).

Definition 14.5 Let R be a ring. A function

   µ : R → [0, ∞]

is called a volume if it satisfies µ(∅) = 0

and

   µ(⋃_{i=1}^n A_i) = Σ_{i=1}^n µ(A_i)   (14.1)

for all pairwise disjoint sets A_1, . . . , A_n ∈ R and all n ∈ N. A volume µ is called a pre-measure if

   µ(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ(A_i)   (14.2)

for every pairwise disjoint sequence of sets (A_i)_{i∈N} ⊂ R. We will call (14.1) finite additivity and (14.2) σ-additivity.

A pre-measure µ on a σ-algebra A is called a measure.

Theorem 14.6 Let R be a ring and µ be a volume on R. If µ is a pre-measure, then it is ∅-continuous, i.e., for all (A_n)_n, A_n ∈ R, with µ(A_n) < ∞ and A_n ↓ ∅ it holds that lim_{n→∞} µ(A_n) = 0. If R is an algebra and µ(Ω) < ∞, then the reverse also holds: an ∅-continuous volume is a pre-measure.

Theorem 14.7 (Carathéodory) For every pre-measure µ on a ring R over Ω there is at least one way to extend µ to a measure on σ(R). In the case that R is an algebra and µ is σ-finite (i.e. Ω is the countable union of subsets of finite measure), this extension is unique.
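The remark after Definition 14.1 — that a ring is automatically closed under intersections because A ∩ B = A \ (A \ B) — can be verified by brute force over all subsets of a small Ω:

```python
from itertools import combinations

def subsets(s):
    """All subsets of the finite set s, as Python sets."""
    items = sorted(s)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

omega = {0, 1, 2, 3}
# check A & B == A - (A - B) for all 16 x 16 pairs of subsets
ok = all(a & b == a - (a - b)
         for a in subsets(omega) for b in subsets(omega))
print(ok)  # True
```

Since A \ B and then A \ (A \ B) stay inside any ring, this identity shows that rings are closed under finite intersections.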
