Advanced Probability Theory

Jiří Černý

June 1, 2016

Preface

These are lecture notes for the lecture ‘Advanced Probability Theory’ given at University of Vienna in SS 2014 and 2016. This is a preliminary version which will be updated regularly during the term. If you have questions, corrections or suggestions for improvements in the text, please let me know.

Contents

1 Introduction

2 Probability spaces, random variables, expectation
  2.1 Kolmogorov axioms
  2.2 Random variables
  2.3 Expectation of real-valued random variables

3 Independence
  3.1 Definitions
  3.2 Dynkin's lemma
  3.3 Elementary facts about independence
  3.4 Borel–Cantelli lemma
  3.5 Kolmogorov 0–1 law

4 Laws of large numbers
  4.1 Kolmogorov three-series theorem
  4.2 Weak law of large numbers
  4.3 Strong law of large numbers
  4.4 Law of large numbers for triangular arrays

5 Large deviations
  5.1 Sub-additive limit theorem
  5.2 Cramér's theorem

6 Weak convergence of probability measures
  6.1 Weak convergence on R
  6.2 Weak convergence on metric spaces
  6.3 Tightness on R
  6.4 Prokhorov's theorem*

7 Central limit theorem
  7.1 Characteristic functions
  7.2 Central limit theorem
  7.3 Some generalisations of the CLT*

8 Conditional expectation
  8.1 Regular conditional *

9 Martingales
  9.1 Definition and examples
  9.2 Martingale convergence, a.s. case
  9.3 Doob's inequality and L^p convergence
  9.4 L^2-martingales
  9.5 Azuma–Hoeffding inequality
  9.6 Convergence in L^1
  9.7 Optional stopping theorem
  9.8 Martingale central limit theorem*

10 Constructions of processes
  10.1 Semi-direct product
  10.2 Ionescu-Tulcea theorem
  10.3 Complement: Kolmogorov extension theorem

11 Markov chains
  11.1 Definition and first properties
  11.2 Invariant measures of Markov chains
  11.3 Convergence of Markov chains

12 Brownian motion and Donsker's theorem
  12.1 The space C([0,1])
  12.2 Brownian motion
  12.3 Donsker's theorem
  12.4 Some applications of Donsker's theorem

1 Introduction

The goal of this lecture is to present the most important concepts of probability theory in the context of infinite sequences $X_1, X_2, \dots$ of random variables, or, otherwise said, in the context of stochastic processes in discrete time. We will mostly be interested in the asymptotic behaviour of these sequences. The following examples cover some questions that will be answered in the lecture and introduce heuristically some concepts that we will develop in order to solve them.

Example 1.1 (Series with random coefficients). It is well known that
\[
X_n^{(1)} = \sum_{i=1}^{n} \frac{(-1)^i}{i} \xrightarrow{n\to\infty} -\log 2, \quad\text{but}\quad X_n^{(2)} = \sum_{i=1}^{n} \frac{1}{i} \xrightarrow{n\to\infty} \infty \quad\text{(no absolute convergence)}.
\]
One can then ask what happens if the signs are chosen randomly, that is, for independent random variables $Z_1, Z_2, \dots$ with $P[Z_i = +1] = P[Z_i = -1] = \frac{1}{2}$ one considers the sum
\[
X_n = \sum_{i=1}^{n} \frac{Z_i}{i}.
\]
Does this random(!) series converge or not? If yes, is the limit random or deterministic?

Example 1.2 (Sums of independent random variables). In the lecture ‘Probability and Statistics’ you studied the following problem. Let $Z_i$ be as in Example 1.1, that is, the $Z_i$'s are outcomes of independent throws of a fair coin, and
\[
S_n = \sum_{i=1}^{n} Z_i, \quad\text{and}\quad X_n = \frac{1}{n} S_n.
\]

By the weak law of large numbers, denoting by $EZ_i\,(= 0)$ the expectation of $Z_i$, we know that
\[
P(|X_n - EX_n| \ge \varepsilon) \xrightarrow{n\to\infty} 0 \quad\text{for every } \varepsilon > 0.
\]

Observe however that the last display says only that the probability that $|X_n|$ is far from zero decays with $n$. It says nothing about the convergence of $X_n$ for a single realisation of the coin throws. To address these (and many other) questions we will develop the formalism of probability theory, which is based on measure theory and the Kolmogorov axioms. In this formalism, we will show an improved version of the weak LLN, the so-called strong LLN:
\[
P\Big[\lim_{n\to\infty} X_n = 0\Big] = 1, \quad\text{or equivalently}\quad \lim_{n\to\infty} X_n = 0 \quad P\text{-a.s.}
\]
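The pathwise nature of the strong LLN can be made tangible by a quick simulation. The sketch below is illustrative only; the helper name `sample_path_average` is our choice, not from the lecture, and we use Python's standard `random` module:

```python
import random

def sample_path_average(n, seed=0):
    """Simulate n fair coin throws Z_i in {-1, +1} and return X_n = S_n / n."""
    rng = random.Random(seed)
    s = 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
    return s / n

# A single realisation of the coin throws: X_n should be close to 0
# for large n (strong LLN), typically of order 1/sqrt(n).
x_n = sample_path_average(100_000)
print(abs(x_n))
```

Running the same experiment with different seeds gives different paths, but almost every path has running average tending to 0.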

Example 1.3 (Random walk and Brownian motion). Continuing with Example 1.2, we can view $S_n$ as a function $S : \mathbb{N} \to \mathbb{R}$. By linear interpolation we can extend it to a function $S : \mathbb{R}_+ \to \mathbb{R}$ (see Figure 1.1). This is a random continuous function, i.e. a random element of the space $C(\mathbb{R}_+; \mathbb{R})$. As such a random object cannot be described by the elementary means of the ‘Probability and Statistics’ lecture, one of our goals is to develop a sound mathematical theory allowing for this.


Figure 1.1: A random walk and its scaling. Observe that on the second picture the x-axis is 100 times longer, but the y-axis only 10 times. The second picture “looks almost like” a Brownian motion.
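The scaling relation behind the caption ($E[S_n^2] = n$, so the vertical scale of the walk grows like $\sqrt{n}$) can be checked numerically. The following is a minimal sketch with helper names of our own choosing, using only the standard library:

```python
import random
import statistics

def walk_endpoint(n, seed):
    """Endpoint S_n of a simple random walk with n independent ±1 steps."""
    rng = random.Random(seed)
    return sum(1 if rng.random() < 0.5 else -1 for _ in range(n))

def rms_endpoint(n, trials=2000):
    """Root-mean-square of S_n over many independent walks."""
    return statistics.mean(walk_endpoint(n, seed) ** 2
                           for seed in range(trials)) ** 0.5

# E[S_n^2] = n: a 100-times longer time axis needs only a 10-times
# longer space axis, exactly as in Figure 1.1.
rms_20, rms_2000 = rms_endpoint(20), rms_endpoint(2000)
print(rms_20, rms_2000)  # roughly sqrt(20) ≈ 4.5 and sqrt(2000) ≈ 44.7
```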

We also want to discuss the convergence of such random objects. More precisely, recall that the central limit theorem says that

\[
\frac{1}{\sqrt{n}}\, S_n \xrightarrow[n\to\infty]{d} \mathcal{N}(0, 1), \tag{1.3}
\]
where $\mathcal{N}(0,1)$ stands for the standard normal distribution. The arrow notation in the previous display stands for convergence in distribution, which can formally be defined here e.g. by
\[
P\Big[\frac{1}{\sqrt{n}} S_n \le a\Big] \xrightarrow{n\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\, dx, \quad\text{for all } a \in \mathbb{R}.
\]
In view of (1.3), it seems not unreasonable to scale the function $S$ by $n^{-1}$ in the time direction and by $n^{-1/2}$ in the space direction, that is, to consider

\[
S^{(n)}(t) = n^{-1/2} S_{nt},
\]
and ask: ‘Does this sequence of random elements of $C(\mathbb{R}_+, \mathbb{R})$ converge? What is the limit object?’

We will see that the answer to the first question is ‘YES’, but to this end we need to introduce the right notion of convergence. Even more interesting is the limit object, the Brownian motion.

Apart from being very interesting objects in their own right, random walk and Brownian motion are prototypes of two important classes of processes, namely Markov chains/processes and martingales, that we are going to study.

We close this section by a few examples linking probability theory to other domains of mathematics. Some of them will be treated in the lecture in more detail.

Example 1.4 (Random walk and discrete Dirichlet problem, link to PDEs). Consider a simple random walk on $\mathbb{Z}^2$ started at $x \in \mathbb{Z}^2$, that is, a sequence of random variables $X_0, X_1, \dots$ determined by $X_0 = x$ and by the requirement that its increments $Z_i = X_i - X_{i-1}$, $i \ge 1$, are i.i.d. random variables satisfying

\[
P[Z_i = \pm e_1] = P[Z_i = \pm e_2] = 1/4.
\]
Here $e_1, e_2$ stand for the canonical basis vectors of $\mathbb{Z}^2$. See Figure 1.2 for a typical realisation.


Figure 1.2: Realisation of a random walk on $\mathbb{Z}^2$

Let $g : \mathbb{R}^2 \to \mathbb{R}$ be a continuous function and $O$ a large domain in $\mathbb{R}^2$. Let $Y$ be the random position of the exit point of the random walk from the domain $O$, i.e. $Y = X_T$ with $T = \inf\{k : X_k \notin O\}$; see the figure again. Define a function $u : \mathbb{Z}^2 \to \mathbb{R}$ by
\[
u(x) = E_x[g(Y)], \quad x \in \mathbb{Z}^2,
\]
where $E_x$ stands for the expectation for the random walk started at $x$. We will later show that $u$ solves a discrete Dirichlet problem
\[
\Delta_d u(x) = 0, \quad x \in \mathbb{Z}^2 \cap O, \qquad u(x) = g(x), \quad x \in \mathbb{Z}^2 \setminus O,
\]

where $\Delta_d$ is the discrete Laplace operator¹

\[
\Delta_d u(x) = \tfrac{1}{4}\{u(x + e_1) + u(x - e_1) + u(x + e_2) + u(x - e_2)\} - u(x).
\]

Example 1.5 (Lower bound on Ramsey numbers, a tiny link to graph theory). The Ramsey number $R(k)$ is the smallest number $n$ such that any colouring of the edges of the complete graph $K_n$ by two colours (red and blue, say) must contain at least one monochromatic (that is, completely blue or red) copy of $K_k$ as a subgraph. These numbers are rather famous in graph theory, not only because they are very hard to compute. Actually,² the only known values are $R(1) = 1$, $R(2) = 2$, $R(3) = 6$, $R(4) = 18$. For larger Ramsey numbers only estimates are known, e.g. $R(5) \in [43, 49]$ or $R(10) \in [798, 23556]$. It is thus essential to get good estimates on these numbers. We are going to use an easy probabilistic argument to find a lower bound on $R(k)$.

Lemma (taken from [AS08], Proposition 1.1.1). Assume that $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$. Then $R(k) > n$. In particular $R(k) \ge \lfloor 2^{k/2} \rfloor$ for all $k \ge 3$.

Proof. Consider a random two-coloring of the edges of Kn obtained by coloring each edge independently either red or blue, where each color is equally likely. For any fixed set R ⊂ {1, . . . , n} of k vertices, let AR be the event that the induced subgraph of Kn on R is monochromatic. Clearly,

\[
P[A_R] = 2 \cdot 2^{-\binom{k}{2}}.
\]

Since there are $\binom{n}{k}$ possible choices for $R$, the probability that at least one of the events $A_R$ occurs is at most $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$. Thus, with positive probability, no event $A_R$ occurs, and hence there must be a two-colouring of $K_n$ without a monochromatic $K_k$; that is, $R(k) > n$. The second claim follows by checking that the assumption holds for $n = \lfloor 2^{k/2} \rfloor$ (exercise).

The best present estimates on $R(k)$ are
\[
(1 + o(1)) \frac{\sqrt{2}\,k}{e}\, 2^{k/2} \le R(k) \le k^{-c \log k/\log\log k}\, 4^k,
\]
so the lower bound of the lemma is much better than one would expect given the simplicity of the proof.

Observe also that the proof is non-constructive. It only says that a colouring with the required properties exists (since it has positive probability), and can be found by throwing a coin for a sufficiently long period of time. Such probabilistic arguments can be used in many situations in graph and number theory; see the very nice book [AS08].
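The ‘coin throwing’ search can be carried out literally on a computer. In the sketch below (the helper names are ours), we take $k = 5$ and $n = 11$, for which the lemma's condition $\binom{11}{5} 2^{1-10} = 462/512 < 1$ holds, and find a good colouring by random sampling:

```python
import random
from itertools import combinations
from math import comb

def has_mono_clique(coloring, n, k):
    """coloring maps each edge (i, j) of K_n to 0 or 1.
    Returns True iff some k-vertex subset induces a monochromatic K_k."""
    for S in combinations(range(n), k):
        colours = {coloring[e] for e in combinations(S, 2)}
        if len(colours) == 1:
            return True
    return False

def find_good_coloring(n, k, seed=0, max_tries=1000):
    """Random search: colour each edge of K_n by a fair coin until a
    colouring without a monochromatic K_k appears."""
    rng = random.Random(seed)
    edges = list(combinations(range(n), 2))
    for _ in range(max_tries):
        col = {e: rng.randrange(2) for e in edges}
        if not has_mono_clique(col, n, k):
            return col
    return None

# The lemma's condition for n = 11, k = 5: C(11,5) * 2^(1 - C(5,2)) < 1,
# so a positive fraction of all colourings is good and the search succeeds.
assert comb(11, 5) * 2 ** (1 - comb(5, 2)) < 1
good = find_good_coloring(11, 5)
print(good is not None)
```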

Example 1.6 (A few other examples). TBD

¹ If you prefer the usual Laplace operator, you can replace the random walk by a Brownian motion started at $x \in \mathbb{R}^2$.
² Source: Wikipedia.

2 Probability spaces, random variables, expectation

The goal of this chapter is to quickly review the basic concepts and definitions which should be known from the previous lectures “Probability and Statistics” and “Measure Theory”.

2.1 Kolmogorov axioms

We start by describing how random experiments are mathematically formalised. An experiment is modelled by a probability space (Ω, A,P ) where

• Ω is a non-empty set containing all possible results of the experiment.

• $\mathcal{A}$ is a σ-algebra on Ω, that is, $\mathcal{A}$ is a collection of subsets of Ω satisfying the following three properties:

(σ1) $\emptyset \in \mathcal{A}$,

(σ2) $(A \in \mathcal{A}) \implies (A^c \in \mathcal{A})$,

(σ3) for every sequence $(A_i)_{i\ge1}$ with $A_i \in \mathcal{A}$, it holds that $\bigcup_{i\ge1} A_i \in \mathcal{A}$.

• $P$ is a probability measure on $(\Omega, \mathcal{A})$. That is, $P$ is a mapping from $\mathcal{A}$ to $[0, 1]$ satisfying

(m1) $P(\Omega) = 1$,

(m2) $P$ is σ-additive, that is, for every sequence $(A_i)_{i\ge1}$ of pairwise disjoint elements of $\mathcal{A}$ (i.e. $A_i \cap A_j = \emptyset$ if $i \ne j$), it holds that
\[
P\Big[\bigcup_{i\ge1} A_i\Big] = \sum_{i\ge1} P[A_i].
\]

In the language of measure theory, $P$ is a normalised measure on the measurable space $(\Omega, \mathcal{A})$. Any element $\omega \in \Omega$ is called a (possible) outcome of the experiment. The sets in $\mathcal{A}$ are called events. The σ-algebra $\mathcal{A}$ should be interpreted as the collection of all subsets of Ω that we want to measure (or sometimes that we can measure), that is, to which we want (or can) associate a probability. For $A \in \mathcal{A}$, the quantity $P[A]$ then gives the probability that the random outcome of the experiment falls into $A$.

This formalism includes the two important cases of probability distributions that were introduced in the elementary lecture, the ‘continuous’ and the ‘discrete’ distributions, as can be seen from the next two examples.

Example 2.1. Let $\Omega = \mathbb{R}$, $\mathcal{A} = \mathcal{B}(\mathbb{R})$ be the Borel σ-algebra on $\mathbb{R}$ (i.e. the smallest σ-algebra containing all open subsets of $\mathbb{R}$), and let
\[
P(A) = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{(x-m)^2}{2\sigma^2} \Big\}\, dx, \quad A \in \mathcal{B}(\mathbb{R}).
\]
This probability space describes an experiment whose outcome has the normal (or Gaussian) distribution with mean $m \in \mathbb{R}$ and variance $\sigma^2 > 0$, which we abbreviate by $\mathcal{N}(m, \sigma^2)$.
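As a quick numerical sanity check of this formula, one can compare a direct Riemann-sum integration of the Gaussian density with the closed form via the error function. The sketch below uses only the standard `math` module; the helper names are our own:

```python
from math import erf, exp, pi, sqrt

def normal_cdf(a, m=0.0, sigma=1.0):
    """P(X <= a) for X ~ N(m, sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((a - m) / (sigma * sqrt(2.0))))

def normal_prob_interval(a, b, m=0.0, sigma=1.0, steps=100_000):
    """P(a < X <= b) by midpoint-rule integration of the density."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += exp(-((x - m) ** 2) / (2 * sigma ** 2))
    return total * h / (sigma * sqrt(2 * pi))

# The two computations agree, e.g. for the standard normal on (-1, 1]:
p = normal_prob_interval(-1.0, 1.0)
print(round(p, 4))  # ≈ 0.6827, the familiar 68% rule
assert abs(p - (normal_cdf(1.0) - normal_cdf(-1.0))) < 1e-6
```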

Example 2.2. Let $\Omega = \mathbb{N} = \{0, 1, \dots\}$, $\mathcal{A} = \mathcal{P}(\mathbb{N})$ be the power set of $\mathbb{N}$, and let, for a $\lambda > 0$,
\[
P(A) = \sum_{k\in A} e^{-\lambda} \frac{\lambda^k}{k!}.
\]
This gives the Poisson distribution with parameter $\lambda$, $\operatorname{Pois}(\lambda)$. Random vectors are also easily covered by the formalism.

Example 2.3 (Product spaces). Let (Ω1, A1,P1), (Ω2, A2,P2) be two probability spaces. We can obtain another probability space by setting Ω = Ω1 × Ω2, A to be the product σ-algebra A1 ⊗ A2 (the smallest σ-algebra containing all rectangles A1 × A2, A1 ∈ A1, A2 ∈ A2), and P to be the product measure P1 ⊗ P2, which is determined by its values on rectangles

\[
P(A_1 \times A_2) = P_1(A_1) P_2(A_2) \quad\text{for all } A_1 \in \mathcal{A}_1,\ A_2 \in \mathcal{A}_2.
\]

The product of $n$ probability spaces $(\Omega_1, \mathcal{A}_1, P_1), \dots, (\Omega_n, \mathcal{A}_n, P_n)$ can be defined analogously. For example, taking $(\Omega_i, \mathcal{A}_i, P_i)$, $1 \le i \le n$, to be the probability space of Example 2.1 with $m = 0$, $\sigma = 1$, we obtain by taking their product the $n$-dimensional standard normal distribution: $\Omega = \mathbb{R}^n$, $\mathcal{A} = \mathcal{B}(\mathbb{R}^n) = \mathcal{B}(\mathbb{R})^{\otimes n}$, and
\[
P(A) = \int_A \frac{1}{(2\pi)^{n/2}} \exp\Big\{-\frac{x_1^2 + \cdots + x_n^2}{2}\Big\}\, dx_1 \cdots dx_n, \quad A \in \mathcal{A}. \tag{2.4}
\]

Example 2.5 (Infinite product spaces). The previous example can be extended further. Let $I$ be an infinite (possibly uncountable) index set and for every $\iota \in I$ let $(\Omega_\iota, \mathcal{A}_\iota, P_\iota)$ be a probability space. We set $\Omega = \times_{\iota\in I} \Omega_\iota$, the usual Cartesian product. A little more care is needed when defining the σ-algebra $\mathcal{A}$: it is the smallest σ-algebra on $\Omega$ containing all ‘finite-dimensional cylinders’, that is,
\[
\mathcal{A} = \sigma\big( (\times_{\iota\in J} A_\iota) \times (\times_{\iota\in I\setminus J} \Omega_\iota) : J \subset I \text{ finite},\ A_\iota \in \mathcal{A}_\iota\ \forall \iota \in J \big).
\]

$\mathcal{A}$ is called the cylinder σ-algebra. For finite-dimensional cylinders we then set
\[
P[(\times_{\iota\in J} A_\iota) \times (\times_{\iota\in I\setminus J} \Omega_\iota)] = \prod_{\iota\in J} P_\iota(A_\iota),
\]
which, as we will see, uniquely determines a probability measure on $\mathcal{A}$, the product measure.

2.2 Random variables

Another useful known concept is that of a random variable. Random variables serve to model various properties of the random experiment, and are to some extent more important than the probability space itself.¹

Definition 2.6. Let $(\Omega, \mathcal{A}, P)$ be a probability space and $(S, \mathcal{S})$ a measurable space.² A function $X : \Omega \to S$ is called an $S$-valued random variable if it is a measurable function from $(\Omega, \mathcal{A})$ to $(S, \mathcal{S})$, i.e. it holds that

\[
X^{-1}(B) \in \mathcal{A} \quad\text{for every } B \in \mathcal{S}. \tag{2.7}
\]

If $S = \mathbb{R}$, we usually tacitly assume that $\mathcal{S}$ is the corresponding Borel σ-algebra $\mathcal{B}(\mathbb{R})$. In this case $X$ is called a real-valued random variable, or simply a random variable.

Remark 2.8. Checking condition (2.7) of Definition 2.6 seems to be a tedious task, since the condition needs to be verified for all $B$ in the σ-algebra $\mathcal{S}$, which is typically rather large. Fortunately this is not the case: a subset $\mathcal{E}$ of the σ-algebra $\mathcal{S}$ is called a generator of $\mathcal{S}$ if $\mathcal{S} = \sigma(\mathcal{E})$, that is, $\mathcal{S}$ is the smallest σ-algebra containing $\mathcal{E}$. We now show that if $\mathcal{E}$ is a generator of $\mathcal{S}$, then (2.7) is equivalent to

\[
X^{-1}(B) \in \mathcal{A} \quad\text{for all } B \in \mathcal{E}. \tag{2.9}
\]

Proof of (2.9) $\iff$ (2.7). The collection of sets $\{B \subset S : X^{-1}(B) \in \mathcal{A}\}$ is a σ-algebra. Indeed, $X^{-1}(S) = \Omega$, $X^{-1}(B^c) = (X^{-1}(B))^c$, and $X^{-1}(\bigcup_{i\ge1} B_i) = \bigcup_{i\ge1} X^{-1}(B_i)$. This σ-algebra contains all elements of $\mathcal{E}$ and therefore it also contains $\sigma(\mathcal{E}) = \mathcal{S}$.

When $\mathcal{S}$ is the Borel σ-algebra on $\mathbb{R}$, the following generators work:

• The collection $\mathcal{O}$ of all open sets of $\mathbb{R}$.

• The collection $\mathcal{C}$ of all closed sets of $\mathbb{R}$.

• The collection of all open intervals $(a, b)$, $-\infty < a < b < \infty$.

¹ Reading the probability theory literature, you will quickly notice that it often assumes the existence of some abstract probability space $(\Omega, \mathcal{A}, P)$ on which one can define all random variables one wants to deal with, without really constructing it explicitly.
² That is, $S$ is a non-empty set and $\mathcal{S}$ is a σ-algebra of subsets of $S$.

• The collection of all closed intervals $[a, b]$, $-\infty < a \le b < \infty$.

• The collection $\mathcal{H}$ of all half-infinite intervals $(-\infty, a]$, $a \in \mathbb{R}$.

Hence, to check that $X : \Omega \to \mathbb{R}$ is a random variable it is e.g. sufficient to show that

\[
X^{-1}((-\infty, a]) =: \{X \le a\} \in \mathcal{A} \quad\text{for all } a \in \mathbb{R}.
\]

Definition 2.10. Let $X : (\Omega, \mathcal{A}) \to (S, \mathcal{S})$ be a random variable. The probability measure $\mu_X$ on $(S, \mathcal{S})$ defined by

\[
\mu_X(B) := P(X^{-1}(B)) = P(\{X \in B\}), \quad B \in \mathcal{S},
\]
is called the distribution of $X$. One sometimes writes $X_\# P$ (push-forward of the measure $P$ by the function $X$) or $P \circ X^{-1}$ to denote this distribution.

Exercise 2.11. Check that (S, S, µX ) is again a probability space.

Exercise 2.12. Let $(\Omega, \mathcal{A}, P)$ be as in (2.4) and $X : \mathbb{R}^n \to \mathbb{R}$, $(x_1, \dots, x_n) \mapsto x_1$. Then $X$ is a random variable (check!) and

\begin{align*}
\mu_X(B) = P(\{X \in B\}) &= \int_{B \times \mathbb{R}^{n-1}} \frac{1}{(2\pi)^{n/2}} \exp\Big\{-\frac{x_1^2 + \cdots + x_n^2}{2}\Big\}\, dx_1 \cdots dx_n \\
&= \int_B \frac{1}{(2\pi)^{1/2}} \exp\Big\{-\frac{x_1^2}{2}\Big\}\, dx_1.
\end{align*}

That is, $\mu_X$ is the standard normal distribution. We recall an important tool from elementary probability theory which allows one to deal with all (in particular both discrete and continuous) real-valued random variables in a unified manner.

Definition 2.13. Let $X$ be a real-valued random variable. The map $\mathbb{R} \ni a \mapsto F_X(a) = P(\{X \le a\})$ is called the distribution function of $X$.

Claim 2.14. Let $F$ be the distribution function of a random variable. Then

(i) $F(\cdot)$ is non-decreasing,

(ii) $\lim_{x\to\infty} F(x) = 1$, $\lim_{x\to-\infty} F(x) = 0$,

(iii) $F$ is right-continuous.

Proof. (i) follows directly from the definition of $F$. For (iii), let $x \in \mathbb{R}$ and $(x_n)_{n\ge0}$ be a sequence such that $x_n \downarrow x$. Then $\bigcap_{n\ge0}(-\infty, x_n] = (-\infty, x]$ and the sequence of intervals $(-\infty, x_n]$ is decreasing. Hence, by continuity of the measure $\mu_X$ from above, $F(x) = \mu_X((-\infty, x]) = \lim_{n\to\infty} \mu_X((-\infty, x_n]) = \lim_{n\to\infty} F(x_n)$. (ii) is proved similarly to (iii).

Distribution functions completely characterise probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, as can be seen from the next theorem.

Theorem 2.15 (Lebesgue-Stieltjes). Let F : R → [0, 1] satisfy the conditions (i)-(iii) of Claim 2.14. Then there exists a unique probability measure µ on (R, B(R)) such that

\[
F(x) = \mu((-\infty, x]) \quad\text{for all } x \in \mathbb{R}.
\]
Proof. Existence: Consider the probability space $\Omega = (0,1)$, $\mathcal{A} = \mathcal{B}((0,1))$, $P$ = Lebesgue measure on $(0,1)$, and define a map $X : (0,1) \to \mathbb{R}$ by

\[
X(\omega) = \sup\{y \in \mathbb{R} : F(y) < \omega\}, \quad \omega \in (0,1). \tag{2.16}
\]
$X$ should be thought of as an “inverse” of $F$. Formally, we will show that

\[
\{\omega : X(\omega) \le x\} = \{\omega : \omega \le F(x)\}. \tag{2.17}
\]

The existence of the measure $\mu$ follows from (2.17): indeed, by (2.17), $X$ is a random variable and its distribution function is $F$.

We now show (2.17). Assume first that $\omega \in (0,1)$ with $\omega \le F(x)$. Then $x \notin \{y : F(y) < \omega\}$ and thus $x \ge X(\omega)$; hence $\{\omega : \omega \le F(x)\} \subset \{\omega : X(\omega) \le x\}$. On the other hand, for $\omega \in (0,1)$ with $\omega > F(x)$, the right-continuity of $F$ yields the existence of $\varepsilon > 0$ such that $F(x + \varepsilon) < \omega$, that is $X(\omega) \ge x + \varepsilon > x$. It follows that $\{\omega : X(\omega) \le x\} \subset \{\omega : \omega \le F(x)\}$, completing the proof of (2.17).

Uniqueness: From the definition of the distribution function it follows that $\mu((a, b]) = F(b) - F(a)$ for every $a < b$. Therefore $\mu((a, b])$ is uniquely determined by $F$. Further, $\mu((a, b)) = \lim_{n\to\infty} \mu((a, b - \frac{1}{n}])$ is uniquely determined, and thus also $\mu(O)$ for every open set $O \subset \mathbb{R}$ ($O$ is a countable union of disjoint open intervals and thus $\mu(O)$ is determined by σ-additivity). From this the uniqueness follows.

We will later see (or you have already seen?) a general argument based on Dynkin's lemma which implies the uniqueness in the last proof.

Remark 2.18. The existence part of the proof is useful for simulating random variables with a given distribution $\mu$ on a computer: if your programming language/library provides you with a uniform random variable (which it usually does), (2.16) gives a recipe for transforming it into a $\mu$-distributed random variable.
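Remark 2.18 can be turned into a small sketch. Below, the generalised inverse (2.16) is computed by bisection, which works for any non-decreasing right-continuous $F$ whose quantile lies in the bracketing interval; as an illustration (the function names are ours) we sample from the Exp(1) distribution, $F(y) = 1 - e^{-y}$ for $y \ge 0$:

```python
import random
from math import exp

def generalized_inverse(F, omega, lo=-50.0, hi=50.0, iters=100):
    """X(omega) = sup{y : F(y) < omega} as in (2.16), found by bisection.
    Assumes the quantile lies inside [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < omega:
            lo = mid
        else:
            hi = mid
    return lo

def F_exp(y):
    """Distribution function of the Exp(1) distribution."""
    return 0.0 if y < 0 else 1.0 - exp(-y)

rng = random.Random(42)
samples = [generalized_inverse(F_exp, rng.random()) for _ in range(5000)]
m = sum(samples) / len(samples)
print(m)  # the mean of Exp(1) is 1, so this is ≈ 1
```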

Exercise 2.19. (a) Let $X_1, \dots, X_n$ be random variables on $(\Omega, \mathcal{F}, P)$ and $f : \mathbb{R}^n \to \mathbb{R}$ a measurable function. Show that $f(X_1, \dots, X_n)$ is a random variable.

(b) Let $X_1, X_2, \dots$ be a sequence of $[-\infty, +\infty]$-valued random variables on $(\Omega, \mathcal{A}, P)$. Show that $\sup_n X_n$, $\inf_n X_n$, $\liminf_n X_n$, $\limsup_n X_n$ are random variables. (A $[-\infty, \infty]$-valued function $X$ is a random variable iff e.g. $X^{-1}([a, \infty]) \in \mathcal{A}$ for every $a \in \mathbb{R} \cup \{-\infty\}$.)

2.3 Expectation of real-valued random variables

As you know from the elementary lecture, the expectation of a random variable corresponds to a “mean” of all its possible values. In the two special cases of discrete and continuous real-valued random variables it is defined by
\[
E[X] = \sum_{x \in R_X} x\, P[X = x] \quad (R_X = \text{set of all possible values of } X), \tag{2.20}
\]
\[
E[X] = \int_{\mathbb{R}} x f_X(x)\, dx \quad (f_X = \text{density of } X),
\]
whenever these expressions make sense. In the language of measure-theoretic probability theory, there is no need for two separate definitions:

Definition 2.21. Let $X : \Omega \to \mathbb{R}$ be a random variable. Assume that $X \ge 0$ or $X \in L^1(\Omega, \mathcal{A}, P)$ (i.e. $\int_\Omega |X|\, dP < \infty$). The expectation of $X$ is given by
\[
E(X) = \int_\Omega X(\omega)\, P(d\omega).
\]
When we want to point out over which probability measure we take the expectation, we write $E_P(X)$.

The expectation inherits many properties of the integral, which were shown in the Measure Theory course.

Linearity. Let $X, Y$ be random variables on Ω and $a, b \in \mathbb{R}$; then
\[
E[aX + bY] = aE[X] + bE[Y],
\]

whenever the right-hand side is well defined.

Monotonicity. Let X, Y be two random variables with well-defined expectation. Then

X ≥ Y (i.e. X(ω) ≥ Y (ω) for all ω ∈ Ω) =⇒ E[X] ≥ E[Y ].

Hölder inequality. For $p, q \in [1, \infty]$ with $\frac{1}{p} + \frac{1}{q} = 1$,

\[
E[|XY|] \le \|X\|_p \|Y\|_q,
\]

where the $L^p(\Omega, \mathcal{A}, P)$-norm of $X$ is defined by $\|X\|_p = (E[|X|^p])^{1/p}$ for $p \in [1, \infty)$, and by $\|X\|_\infty = \inf\{M : P[|X| > M] = 0\} =: \operatorname{ess\,sup}_P |X|$ for $p = \infty$.

The Cauchy–Schwarz inequality is the special case of the Hölder inequality for $p = q = 2$,

\[
E[|XY|] \le \|X\|_2 \|Y\|_2.
\]

Minkowski inequality. This is the triangle inequality for the $\|\cdot\|_p$-norm,

\[
\|X + Y\|_p \le \|X\|_p + \|Y\|_p, \quad p \in [1, \infty].
\]

Fatou's lemma. Let $(X_n)_{n\ge0}$ be a sequence of $[0, \infty]$-valued random variables. Then

\[
E\big[\liminf_{n\to\infty} X_n\big] \le \liminf_{n\to\infty} E[X_n].
\]

Monotone convergence theorem (MCT, Beppo Levi theorem). For random variables $X_n \ge 0$ with $X_n \nearrow X$ $P$-a.s. (i.e. $P[\{\omega : X_n(\omega) \le X_{n+1}(\omega)\ \forall n,\ \text{and } \lim X_n(\omega) = X(\omega)\}] = 1$), we have
\[
E[X_n] \nearrow E[X], \quad\text{as } n \to \infty.
\]

Dominated convergence theorem (DCT, Lebesgue theorem). Let $X, Y$ and $(X_n)_{n\ge0}$ be random variables such that

• $X_n$ converge pointwise to $X$ $P$-a.s., that is, $P\big[\lim_{n\to\infty} X_n(\omega) = X(\omega)\big] = 1$,

• $X_n$ are dominated by $Y$, that is, $|X_n| \le Y$ $P$-a.s. for all $n \ge 0$,

• $Y$ is $P$-integrable, that is, $Y \in L^1(\Omega, \mathcal{A}, P)$.

Then
\[
\lim_{n\to\infty} E[X_n] = E[X].
\]
Other properties of the expectation require $P$ to be normalised, that is, $P[\Omega] = 1$:

Jensen's inequality. Let $X \in L^1(\Omega, \mathcal{A}, P)$ and $\varphi : \mathbb{R} \to \mathbb{R}$ a convex function (i.e. $\varphi(\lambda x + (1-\lambda) y) \le \lambda \varphi(x) + (1-\lambda)\varphi(y)$ for every $x, y \in \mathbb{R}$ and $\lambda \in [0,1]$) such that $E[\varphi(X)] < \infty$. Then
\[
\varphi(E[X]) \le E[\varphi(X)].
\]

Proof. Since $\varphi$ is convex, there exists $a \in \mathbb{R}$ such that $a(x - E(X)) + \varphi(E(X)) \le \varphi(x)$ for all $x$ (a supporting line at $E(X)$). Replacing $x$ by $X$ in this inequality and taking expectations, Jensen's inequality follows.
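A minimal worked instance of the inequality (our own choice of example, not from the lecture): take $X = \pm 1$ with probability $\frac{1}{2}$ each and the convex function $\varphi(x) = e^x$:

```python
from math import exp

# X = +1 or -1, each with probability 1/2; phi(x) = e^x is convex.
phi = exp
E_X = 0.5 * 1 + 0.5 * (-1)               # E[X] = 0
E_phi_X = 0.5 * exp(1) + 0.5 * exp(-1)   # E[phi(X)] = cosh(1)

print(phi(E_X), E_phi_X)  # 1.0 vs ≈ 1.543: phi(E[X]) <= E[phi(X)]
assert phi(E_X) <= E_phi_X
```

The gap between the two sides reflects how strongly convex $\varphi$ is on the range of $X$.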

(Generalised) Chebyshev inequality. Let $\varphi : S \to [0, \infty]$ be a measurable function, $A \in \mathcal{S}$, and $X$ an $S$-valued random variable. Then
\[
\inf\{\varphi(y) : y \in A\}\, P[X \in A] \le \int_\Omega \varphi(X(\omega)) \mathbf{1}\{X \in A\}\, P(d\omega) =: E[\varphi(X); X \in A] \le E[\varphi(X)].
\]

Proof. Observe that $\inf\{\varphi(y) : y \in A\}\, \mathbf{1}\{x \in A\} \le \varphi(x)\mathbf{1}\{x \in A\} \le \varphi(x)$ (using the convention $0 \cdot \infty = 0$ usual in measure theory). Replacing $x$ by $X$, taking the expectation and applying monotonicity, the claim follows.

Markov inequality. Taking $\varphi(x) = |x|^p$, $p \ge 0$, in the Chebyshev inequality, we obtain for every $X \in L^p$ and $a > 0$
\[
P[|X| \ge a] \le \frac{1}{a^p} E[|X|^p].
\]

Chebyshev's inequality. Taking $X \in L^2$ and $\varphi(x) = (x - EX)^2$, we obtain the usual Chebyshev inequality
\[
P[|X - EX| \ge a] \le \frac{1}{a^2} \operatorname{Var} X := \frac{1}{a^2} E[(X - EX)^2].
\]

Finally, we recall the transformation (substitution) theorem.

Lemma 2.22. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $(S, \mathcal{S})$ a measurable space, $X : \Omega \to S$ an $S$-valued random variable, and $g : S \to \mathbb{R}$ a measurable function. Then $g \circ X =: g(X)$ is an $\mathbb{R}$-valued random variable, and denoting by $\mu_X = X_\# P = P \circ X^{-1}$ the distribution of $X$ on $(S, \mathcal{S})$ (cf. Definition 2.10 and Exercise 2.19) we have
\[
\int_S |g(s)|\, \mu_X(ds) < \infty \iff \int_\Omega |(g \circ X)(\omega)|\, P(d\omega) < \infty, \tag{2.23}
\]
and if these two equivalent conditions hold, then
\[
\mu_X(g) = E^{X_\# P}[g] = \int_S g(s)\, \mu_X(ds) \overset{!}{=} E^P[g(X)] = \int_\Omega (g \circ X)(\omega)\, P(d\omega). \tag{2.24}
\]
(The only real statement is the equality marked with ‘!’; the remaining ones introduce various notations for the same objects.)

Proof. This is a small exercise in measure theory that is good to recall: we prove the theorem in four steps, starting with simple functions $g$ and proceeding to more general ones.

(a) $g = \mathbf{1}_B$ is the indicator of a set $B \in \mathcal{S}$: in this case the conditions in (2.23) hold true since $g \le 1$. Moreover, $\mu_X(g) = (P \circ X^{-1})(B) = E^P[\mathbf{1}_B \circ X]$, that is, (2.24) holds.

(b) $g$ is a linear combination of indicators, $g = \sum_{i=1}^m c_i \mathbf{1}_{B_i}$ for some $c_i \ge 0$ and $B_i \in \mathcal{S}$: in this case (2.23) and (2.24) follow directly from (a) and the linearity of the expectation.

(c) $g \ge 0$ an arbitrary measurable function: set

\[
g_n = \sum_{k=0}^{n2^n - 1} \frac{k}{2^n}\, \mathbf{1}\Big\{\frac{k}{2^n} \le g < \frac{k+1}{2^n}\Big\} + n\, \mathbf{1}\{g \ge n\}.
\]

Then $g_n \nearrow g$, $g_n \circ X \nearrow g \circ X$, and the functions $g_n$ are as in (b). By the monotone convergence theorem, $E^{X_\# P}[g_n] \nearrow E^{X_\# P}[g]$ and $E^P[g_n(X)] \nearrow E^P[g(X)]$, and thus (2.23), (2.24) hold true for non-negative $g$.

(d) $g$ arbitrary measurable: set $g^+ = g \vee 0$, $g^- = (-g) \vee 0$. Then $g = g^+ - g^-$, and $g^+, g^-$ are as in (c). Further, $E^{X_\# P}[|g|] < \infty$ is equivalent to $E^{X_\# P}[g^+] < \infty$ and $E^{X_\# P}[g^-] < \infty$. Similar claims hold for the functions $g \circ X$, $g^+ \circ X$, and $g^- \circ X$ with respect to the expectation $E^P$. (2.23) then follows, and (2.24) is a consequence of linearity, that is, of $E^P[g(X)] = E^P[g^+(X)] - E^P[g^-(X)]$.
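The equality (2.24) is easy to probe by Monte Carlo (an illustrative sketch; the choice of $X$ and $g$ is ours): for $X \sim \mathcal{N}(0,1)$ and $g(x) = x^2$, the left-hand side $\int_{\mathbb{R}} x^2\, \mu_X(dx)$ equals 1, and averaging $g(X(\omega))$ over many draws of $\omega$ should reproduce it:

```python
import random

rng = random.Random(7)

def g(x):
    """The test function g(x) = x^2; its mu_X-integral is the
    second moment of N(0, 1), i.e. exactly 1."""
    return x * x

# E^P[g(X)]: average g over 200,000 independent standard normal draws.
n = 200_000
mc = sum(g(rng.gauss(0.0, 1.0)) for _ in range(n)) / n
print(mc)  # ≈ 1.0
```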

To finish this chapter we recall several elementary definitions.

Definition 2.25. Let X ∈ L1(Ω, A,P ). The variance of X is defined by

\[
\operatorname{Var}(X) = E[(X - EX)^2] \in [0, \infty]. \tag{2.26}
\]

Lemma 2.27. Let X ∈ L1(Ω, A,P ). Then

(i) $\operatorname{Var}(X) = E[X^2 - 2X\,EX + (EX)^2] = E(X^2) - (EX)^2$.

(ii) Var X < ∞ iff X ∈ L2(Ω, A,P ).

(iii) $\operatorname{Var} X = 0$ iff $X = EX$ $P$-a.s.

Proof. Obvious.

Definition 2.28. Let X,Y be two random variables in L1(Ω, A,P ) such that XY is also in L1(Ω, A,P ). The covariance of X and Y is given by

Cov(X,Y ) := E[(X − EX)(Y − EY )] = E(XY ) − E(X)E(Y ).

Definition 2.29. If $X \in L^p(\Omega, \mathcal{F}, P)$, then $E[X^p]$ is called the $p$-th moment of $X$.

Definition 2.30. Let $X = (X_1, X_2, \dots, X_n)$ be a random vector (i.e. an $\mathbb{R}^n$-valued random variable). We define $EX$ component-wise, $EX = (EX_1, \dots, EX_n) \in \mathbb{R}^n$, and its covariance matrix $\Sigma(X) = (\sigma_{ij}(X))_{i,j=1,\dots,n}$ by $\sigma_{ij}(X) = \operatorname{Cov}(X_i, X_j)$.

Exercise 2.31. (a) When $X, Y \in L^2(\Omega, \mathcal{A}, P)$, the covariance exists.

(b) Cov(X,X) = Var(X)

(c) For every random vector $X$, its covariance matrix $\Sigma(X)$ is symmetric and positive semi-definite, that is, $\sigma_{ij}(X) = \sigma_{ji}(X)$ and $\xi^T \Sigma(X)\xi \ge 0$ for every column vector $\xi \in \mathbb{R}^n$.
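Exercise 2.31(c) can be probed empirically. In the sketch below (helper names are ours) we estimate the covariance matrix of the vector $X = (Z_1, Z_1 + Z_2)$ for independent standard normals $Z_1, Z_2$, whose true covariance matrix has entries $\sigma_{11} = 1$, $\sigma_{12} = \sigma_{21} = 1$, $\sigma_{22} = 2$, and check symmetry and the quadratic form on one test vector:

```python
import random

def covariance_matrix(samples):
    """Empirical covariance matrix of a list of d-dimensional points."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[i] for x in samples) / n for i in range(d)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / n
             for j in range(d)] for i in range(d)]

rng = random.Random(3)
pts = []
for _ in range(50_000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    pts.append((z1, z1 + z2))  # the random vector X = (Z1, Z1 + Z2)
sigma = covariance_matrix(pts)

assert abs(sigma[0][1] - sigma[1][0]) < 1e-12   # symmetry
xi = (1.0, -1.0)
quad = sum(xi[i] * sigma[i][j] * xi[j] for i in range(2) for j in range(2))
print(quad)  # xi . X = -Z2, so this is ≈ Var(Z2) = 1 >= 0
```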

3 Independence

3.1 Definitions

We recall the elementary definition.

Definition 3.1. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Events $A, B \in \mathcal{A}$ are called independent when
\[
P[A \cap B] = P[A] \cdot P[B]. \tag{3.2}
\]
When $P[B] > 0$, this can be written using conditional probabilities as
\[
P[A \mid B] := \frac{P[A \cap B]}{P[B]} = P[A],
\]
i.e. the events $A, B$ are independent if the information about the occurrence of $B$ has no influence on the probability of $A$. We now extend this definition in several directions:

Definition 3.3. (a) Two σ-algebras $\mathcal{G}_1, \mathcal{G}_2 \subset \mathcal{A}$ are called independent if $P[A_1 \cap A_2] = P[A_1] P[A_2]$ for every $A_1 \in \mathcal{G}_1$, $A_2 \in \mathcal{G}_2$.

(b) Two random variables $X_1, X_2$ on $(\Omega, \mathcal{A}, P)$ are called independent if the σ-algebras $\sigma(X_1), \sigma(X_2)$ generated by these random variables are independent in the sense of (a). (Recall that, for an $(S, \mathcal{S})$-valued random variable $X$, $\sigma(X)$ is the smallest σ-algebra on Ω which makes $X$ measurable, $\sigma(X) = \{X^{-1}(A) : A \in \mathcal{S}\}$.)

Definition 3.4. (a) The σ-algebras $\mathcal{G}_1, \dots, \mathcal{G}_n \subset \mathcal{A}$ are independent if $P[A_1 \cap \dots \cap A_n] = P[A_1] \cdots P[A_n]$ for every $A_1 \in \mathcal{G}_1, \dots, A_n \in \mathcal{G}_n$.

(b) Events $A_1, \dots, A_n$ are independent if the σ-algebras $\{\emptyset, A_i, A_i^c, \Omega\}$, $i = 1, \dots, n$, are.

(c) Similarly, random variables $X_1, \dots, X_n$ on Ω are independent if the σ-algebras $\sigma(X_i)$, $i = 1, \dots, n$, are.

Remark 3.5. As one can take $A_i = \Omega$ in Definition 3.4(a), we see also that every subcollection of $\mathcal{G}_1, \dots, \mathcal{G}_n$ is independent. One also checks easily that Definition 3.4(b) is equivalent to the definition of independence which you know from the elementary lecture: the events $A_1, \dots, A_n \in \mathcal{A}$ are independent if
\[
P\Big[\bigcap_{i\in J} A_i\Big] = \prod_{i\in J} P[A_i] \quad\text{for every } J \subset \{1, \dots, n\}.
\]
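The role of the subsets $J$ in this display is well illustrated by the classical example of events that are pairwise independent but not independent as a triple (two fair coin flips; the construction is standard, the code is our own sketch):

```python
from itertools import product

# Omega = outcomes of two fair coin flips, P = uniform measure.
omega_space = list(product([0, 1], repeat=2))

def P(event):
    """Probability of an event, given as a predicate on outcomes."""
    return sum(1 for w in omega_space if event(w)) / len(omega_space)

def A(w): return w[0] == 1          # first flip is heads
def B(w): return w[1] == 1          # second flip is heads
def C(w): return w[0] != w[1]       # the two flips differ

# Every pair satisfies the product rule (all J of size 2) ...
assert P(lambda w: A(w) and B(w)) == P(A) * P(B)
assert P(lambda w: A(w) and C(w)) == P(A) * P(C)
assert P(lambda w: B(w) and C(w)) == P(B) * P(C)
# ... but J = {1,2,3} fails: A∩B∩C is empty, while P[A]P[B]P[C] = 1/8.
triple = P(lambda w: A(w) and B(w) and C(w))
print(triple, P(A) * P(B) * P(C))  # 0.0 0.125
```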

Checking the condition of Definition 3.4(a) for every $A_i \in \mathcal{G}_i$ is again a tedious task, cf. Remark 2.8. At this level, we may even wonder whether independent σ-algebras exist at all. We therefore make a small excursion into measure theory and settle this and similar issues once and for all.

14 3.2 Dynkin’s lemma

Definition 3.6. A family $\mathcal{D}$ of subsets of Ω is called a Dynkin system (or λ-system) when

(λ1) $\Omega \in \mathcal{D}$,

(λ2) $A \in \mathcal{D} \implies A^c \in \mathcal{D}$,

(λ3) for every sequence $(A_i)_{i\ge0}$ of pairwise disjoint elements of $\mathcal{D}$, we have $\bigcup_{i\ge0} A_i \in \mathcal{D}$.

Observe that condition (λ3) is different from condition (σ3) of the definition of a σ-algebra.

Definition 3.7. A family $\mathcal{C}$ of subsets of Ω is called a π-system if it is closed under intersections, i.e. $A, B \in \mathcal{C}$ implies that also $A \cap B \in \mathcal{C}$.

The following lemma is a useful tool in measure theory.

Lemma 3.8 (Dynkin). Let $\mathcal{D}$ be a Dynkin system and $\mathcal{C}$ a π-system on Ω with $\mathcal{C} \subset \mathcal{D}$. Then $\mathcal{D}$ contains the σ-algebra generated by $\mathcal{C}$,
\[
\mathcal{D} \supset \sigma(\mathcal{C}).
\]

Before proving the lemma, let us see some of its consequences. Lemma 3.9. Let P , Q be two probabilities on (Ω, A). Then the family

\[
\mathcal{D} = \{A \in \mathcal{A} : P(A) = Q(A)\}
\]
is a Dynkin system.

Proof. (λ1), (λ2) of Definition 3.6 are trivial to check. (λ3) follows from the σ-additivity of $P$ and $Q$.

As a corollary of Lemmas 3.8 and 3.9 we obtain:

Lemma 3.10. Let $P$ and $Q$ be as above, and let $\mathcal{C}$ be a π-system such that $P(C) = Q(C)$ for every $C \in \mathcal{C}$. Then $P(B) = Q(B)$ for every $B \in \sigma(\mathcal{C})$.

Remark 3.11. It is essential that $\mathcal{C}$ is a π-system, not an arbitrary generator. To see this consider $\Omega = \{1, 2, 3, 4\}$, $\mathcal{E} = \{\{1,2\}, \{2,3\}\}$ and $\mu = \frac{1}{2}(\delta_1 + \delta_3)$, $\nu = \frac{1}{2}(\delta_2 + \delta_4)$. Then it is easy to see that $\sigma(\mathcal{E}) = \mathcal{P}(\Omega)$ and that $\mu|_{\mathcal{E}\cup\{\Omega\}} = \nu|_{\mathcal{E}\cup\{\Omega\}}$. On the other hand, $\mu \ne \nu$ on $\mathcal{P}(\Omega)$.

Lemma 3.10 gives a powerful tool for checking the equality of measures, provided of course that we know generators which are also π-systems:

Exercise 3.12. (a) Go back to Remark 2.8 and check which of the generators of $\mathcal{B}(\mathbb{R})$ are also π-systems.

(b) Simplify the proof of the Lebesgue–Stieltjes theorem (Theorem 2.15) using the previous lemma.

Proof of Dynkin's lemma (*). We want to show $\mathcal{D} \supset \sigma(\mathcal{C})$. To this end we define the Dynkin system generated by $\mathcal{C}$,
\[
\mathcal{D}(\mathcal{C}) = \bigcap_{\substack{\mathcal{D}' \supset \mathcal{C} \\ \mathcal{D}'\ \text{Dynkin system}}} \mathcal{D}'.
\]
We will show that

\[
\mathcal{D}(\mathcal{C}) = \sigma(\mathcal{C}), \tag{3.13}
\]
which implies the lemma. The inclusion $\mathcal{D}(\mathcal{C}) \subset \sigma(\mathcal{C})$ is obvious, as every σ-algebra is a Dynkin system. The inclusion $\sigma(\mathcal{C}) \subset \mathcal{D}(\mathcal{C})$ will follow if we show that

(3.14) D(C) is a σ-algebra.

To see this, we first claim

(3.15) $\mathcal{D}(\mathcal{C})$ is closed under intersections,

and use this to prove (3.14). As a consequence of (3.15), $\mathcal{D}(\mathcal{C})$ is also closed under finite unions (just use $A \cup B = (A^c \cap B^c)^c$, (3.15) and part (λ2) of Definition 3.6). To see that $\mathcal{D}(\mathcal{C})$ is closed under countable unions, we fix a sequence $(A_n)_{n\ge1}$, $A_n \in \mathcal{D}(\mathcal{C})$, and write their union as
\[
\bigcup_{n\ge1} A_n = \bigcup_{n\ge1} (B_n \setminus B_{n-1}), \tag{3.16}
\]
where the sets $B_i$, $i \ge 0$, are given by
\[
B_0 = \emptyset \quad\text{and}\quad B_n = \bigcup_{i=1}^{n} A_i.
\]

As $\mathcal{D}(\mathcal{C})$ is closed under unions, $B_i \in \mathcal{D}(\mathcal{C})$ for every $i \ge 0$. Writing $B_n \setminus B_{n-1}$ as $(B_n^c \cup B_{n-1})^c$, it follows also that $B_n \setminus B_{n-1} \in \mathcal{D}(\mathcal{C})$ for every $n \ge 1$. Finally, using (3.16) and part (λ3) of Definition 3.6, we see that $\bigcup_{n\ge1} A_n \in \mathcal{D}(\mathcal{C})$ as claimed. Since $\Omega \in \mathcal{D}(\mathcal{C})$ and $\mathcal{D}(\mathcal{C})$ is closed under complements, by definition, we have proved (3.14). It remains to show (3.15).

Step 1. We first claim that

(3.17) A ∈ D(C),B ∈ C =⇒ A ∩ B ∈ D(C).

Indeed, to see this define for B ∈ C a family

DB = {A ⊂ Ω: A ∩ B ∈ D(C)}.

We claim that $\mathcal{D}_B$ is a Dynkin system. Indeed:

16 (1)Ω ∈ DB is obvious.

(2) $A \in \mathcal{D}_B$ implies that $A^c \cap B = B \setminus (A \cap B) \in \mathcal{D}(\mathcal{C})$, since $B \in \mathcal{C}$, $A \cap B \in \mathcal{D}(\mathcal{C})$, and $B \setminus (A \cap B)$ can be written as $(B^c \cup (A \cap B))^c$, which is the complement of a disjoint union of elements of $\mathcal{D}(\mathcal{C})$. It follows that $A^c \in \mathcal{D}_B$ as well.

(3) Let $A_i \in \mathcal{D}_B$, $i \ge 1$, be pairwise disjoint. Then $\big(\bigcup_{i\ge1} A_i\big) \cap B = \bigcup_{i\ge1} (A_i \cap B) \in \mathcal{D}(\mathcal{C})$ by the disjointness of the $A_i \cap B$. Hence $\bigcup_{i\ge1} A_i \in \mathcal{D}_B$.

Since $\mathcal{C} \subset \mathcal{D}_B$ and $\mathcal{D}_B$ is a Dynkin system, we see immediately that $\mathcal{D}_B \supset \mathcal{D}(\mathcal{C})$, which implies (3.17).

Step 2. For $A \in \mathcal{D}(\mathcal{C})$ we define, similarly to Step 1,

DA = {B ⊂ Ω: A ∩ B ∈ D(C)}.

By essentially the same reasoning as in Step 1, it can then be shown that $\mathcal{D}_A$ is a Dynkin system. Due to Step 1, $\mathcal{D}_A \supset \mathcal{C}$, and thus $\mathcal{D}_A \supset \mathcal{D}(\mathcal{C})$. Hence for every $A, B \in \mathcal{D}(\mathcal{C})$ also $A \cap B \in \mathcal{D}(\mathcal{C})$, that is, (3.15) holds. This completes the proof of Dynkin's lemma.

3.3 Elementary facts about independence

As a consequence of Dynkin's lemma we get:

Theorem 3.18. Let C1,..., Cn ⊂ A be π-systems with Ω ∈ Ci for all i ≤ n. Assume that (Ci)1≤i≤n are independent, i.e.

(3.19) P [C1 ∩ · · · ∩ Cn] = P [C1] ··· P [Cn] ∀Ci ∈ Ci.

Then, the σ-algebras σ(C1), . . . , σ(Cn) are independent.

Proof. The proof resembles the previous one. We first fix C2 ∈ C2,...,Cn ∈ Cn and consider a family D1 given by

D1 = {A ∈ A : P [A ∩ C2 ∩ · · · ∩ Cn] = P [A]P [C2] ··· P [Cn]}.

Then D_1 ⊃ C_1 due to (3.19), and D_1 is a Dynkin system. Indeed, as (λ1), (λ2) are obvious, one only needs to show (λ3). Take (A_i)_{i≥1} pairwise disjoint and A = ⋃_i A_i; then, using σ-additivity and A_i ∈ D_1,

P[A ∩ C_2 ∩ · · · ∩ C_n] = Σ_{i≥1} P[A_i ∩ C_2 ∩ · · · ∩ C_n]
                         = Σ_{i≥1} P[A_i] P[C_2] · · · P[C_n]
                         = P[A] P[C_2] · · · P[C_n],

which implies that A ∈ D_1. Dynkin's lemma now implies that D_1 ⊃ σ(C_1).

To continue, define

D2 = {A ∈ A : P [D ∩ A ∩ C3 ∩ · · · ∩ Cn] = P [D]P [A]P [C3] ··· P [Cn] for all D ∈ σ(C1)}.

Using the same reasoning as above, one shows that D_2 is a Dynkin system and D_2 ⊃ C_2. Thus D_2 ⊃ σ(C_2). The claim of the theorem then follows by repeating the same step n times.

Corollary 3.20. (a) Let A_{ij} ⊂ A, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), be independent σ-algebras. Then the σ-algebras G_i = σ(⋃_{j=1}^{m(i)} A_{ij}), i ≤ n, are also independent.

(b) Let X_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), be independent random variables and let f_i : R^{m(i)} → R be measurable functions. Then the Y_i = f_i(X_{i1}, . . . , X_{im(i)}), i ≤ n, are independent.

Proof. It is easy to see that (a) implies (b). Indeed, it is sufficient to observe that σ(Y_i) ⊂ G_i := σ(X_{i1}, . . . , X_{im(i)}) and that the G_i's are independent by (a).

To prove (a), let C_i be the family of subsets of Ω of the form ⋂_{j=1}^{m(i)} A_{ij} with A_{ij} ∈ A_{ij}. Then the C_i are independent π-systems (Exercise!) and the claim follows by Theorem 3.18.

Independent random variables are naturally related to product measures.

Proposition 3.21. Let X, Y be two independent real-valued random variables on a common probability space (Ω, A, P) with distributions µ_X, µ_Y. Then the image of the measure P under the map ϕ : ω ↦ (X(ω), Y(ω)) equals µ_X ⊗ µ_Y, and thus for h : (R², B(R²)) → (R, B(R)) measurable we have

E[|h(X, Y)|] < ∞ iff h is µ_X ⊗ µ_Y-integrable, and then

E[h(X, Y)] = ∫_{R²} h(x, y) µ_X(dx) µ_Y(dy).

In particular, E[XY] = E[X]E[Y] when E[|X|] < ∞, E[|Y|] < ∞, and Var[X + Y] = Var[X] + Var[Y].

Proof. We only need to show that µ_X ⊗ µ_Y = ϕ#P. The remaining claims of the proposition then follow using Lemma 2.22 with (X, Y), R², h playing the roles of X, S and g. To show µ_X ⊗ µ_Y = ϕ#P, let C = {A_1 × A_2 : A_1, A_2 ∈ B(R)}. C is a π-system, and for A = A_1 × A_2 ∈ C we have (µ_X ⊗ µ_Y)(A) = µ_X(A_1)µ_Y(A_2) and (ϕ#P)(A) = P[X ∈ A_1, Y ∈ A_2], which by independence equals P[X ∈ A_1]P[Y ∈ A_2] = µ_X(A_1)µ_Y(A_2). Since σ(C) = B(R²), the claim follows from Lemma 3.9.

Exercise 3.22. Modify the statement of Proposition 3.21 for

(a) X and Y taking values in some measurable spaces (S1, S1), (S2, S2) respectively. (b) more than two random variables.

Exercise 3.23. (a) Let (X_i)_{i∈I} be a family of real-valued random variables. For every finite set J = {i_1, . . . , i_{|J|}} ⊂ I and for every x ∈ R^{|J|} define the cumulative distribution function

F_J(x) = P[X_{i_1} ≤ x_1, . . . , X_{i_{|J|}} ≤ x_{|J|}].

Then (Xi)i∈I are independent iff for every such J and x

FJ (x) = F{i1}(x1) ··· F{i|J|}(x|J|).

(b) When the random variables (X_1, . . . , X_n) possess a joint density f : R^n → [0, ∞), then they are independent iff f factorises as f(x_1, . . . , x_n) = f_{X_1}(x_1) · · · f_{X_n}(x_n).

(c) Show that the identity E[XY] = E[X]E[Y] does not imply the independence of X and Y. (Random variables that satisfy E[XY] = E[X]E[Y], and thus also Var[X + Y] = Var[X] + Var[Y], are called uncorrelated.)
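A standard illustration of part (c), here only as a numerical sketch with parameters of our choosing: for X standard normal and Y = X², symmetry gives E[XY] = E[X³] = 0 = E[X]E[Y], so X and Y are uncorrelated, yet {Y < 1} = {|X| < 1}, so they are far from independent.

```python
import random

random.seed(0)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]                    # Y = X^2: uncorrelated with X, not independent

mean = lambda v: sum(v) / len(v)
cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
assert abs(cov) < 0.05                      # empirical covariance near E[X^3] = 0

# Not independent: the events {|X| < 1} and {Y < 1} coincide, so the joint
# probability is P[|X| < 1] ~ 0.68, far from the product ~ 0.68^2 ~ 0.47.
p_x = mean([1.0 if abs(x) < 1 else 0.0 for x in xs])
p_y = mean([1.0 if y < 1 else 0.0 for y in ys])
p_joint = mean([1.0 if (abs(x) < 1 and y < 1) else 0.0 for x, y in zip(xs, ys)])
assert abs(p_joint - p_x * p_y) > 0.15
```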

Exercise 3.24. Show the converse to Proposition 3.21: Let X, Y be two random vari- ables on (Ω, A,P ) so that the image of P under ω 7→ (X(ω),Y (ω)) is a product measure. Then X, Y are independent.

We finish this section by extending Definition 3.4 to an arbitrary number of σ-algebras.

Definition 3.25. Let I be an arbitrary index set. A family (A_i)_{i∈I} of σ-algebras on Ω is called independent if (A_i)_{i∈J} is independent in the sense of Definition 3.4 for every finite J ⊂ I.

With this definition, Theorem 3.18 and Corollary 3.20 easily generalise (the condition (3.19) must be replaced by P[⋂_{i∈J} C_i] = ∏_{i∈J} P[C_i] for all C_i ∈ C_i and J ⊂ I finite).

3.4 Borel-Cantelli lemma

In this and the following section we develop two techniques for proving that some 'asymptotic' events occur with probability one. We start with some notation. Let (A_i)_{i≥1} be a sequence of events on (Ω, A). We define

lim sup_{n→∞} A_n = ⋂_{n≥1} ⋃_{m≥n} A_m = {ω ∈ Ω : ω is contained in infinitely many of the A_i's},

lim inf_{n→∞} A_n = ⋃_{n≥1} ⋂_{m≥n} A_m = {ω ∈ Ω : only finitely many of the A_i's do not contain ω}.

Observe that 1_{lim sup_n A_n} = lim sup_n 1_{A_n}, and similarly for lim inf. The next two lemmas are indispensable tools in probability theory.

Lemma 3.26 (First Borel-Cantelli lemma). Let (A_i)_{i≥1} be a sequence of events on a probability space (Ω, A, P). Then

Σ_{i≥1} P[A_i] < ∞ implies P[lim sup_i A_i] = 0.

Proof. By the monotone convergence theorem,

E[Σ_{i=1}^∞ 1_{A_i}] = Σ_{i=1}^∞ P[A_i] < ∞.

Therefore Σ_{i=1}^∞ 1_{A_i} < ∞ P-a.s., and thus P[lim sup_i A_i] = 0.

The next lemma is a partial converse to Lemma 3.26.

Lemma 3.27 (Second Borel-Cantelli lemma). Let (A_i)_{i≥1} be a sequence of independent events on (Ω, A, P). Then Σ_{i≥1} P[A_i] = ∞ implies that P[lim sup_i A_i] = 1.

Remark 3.28. To see why the independence assumption is necessary, consider e.g. Ω = (0, 1), A = B((0, 1)), P the Lebesgue measure, and set A_n = (0, n^{−1}).

Proof of Lemma 3.27. We show that P[(lim sup A_i)^c] = P[lim inf A_i^c] = 0. Since 1 − x ≤ e^{−x}, the independence implies

P[⋂_{k=m}^M A_k^c] = ∏_{k=m}^M P[A_k^c] = ∏_{k=m}^M (1 − P[A_k]) ≤ exp{−Σ_{k=m}^M P[A_k]} → 0 as M → ∞,

due to the assumption. Hence P[⋂_{k≥m} A_k^c] = 0 for all m ≥ 1, and thus P[lim inf A_i^c] = 0 by countable sub-additivity, as required.
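The dichotomy between the two Borel-Cantelli lemmas is visible in a small simulation (a sketch only; the probabilities p_n and sample sizes are our choice): independent events with p_n = min(1, 2/n) have divergent Σ p_n and keep occurring, while p_n = 1/n² produces only a handful of occurrences however far one looks.

```python
import random

random.seed(1)
N = 20_000
# Independent events A_n occurring with probability p_n.
occ_div = sum(1 for n in range(1, N + 1) if random.random() < min(1.0, 2.0 / n))
occ_conv = sum(1 for n in range(1, N + 1) if random.random() < 1.0 / n**2)

# Divergent case: sum p_n ~ 2 log N, so roughly 20 occurrences here,
# and the count keeps growing with N (second Borel-Cantelli).
# Convergent case: sum p_n < 1 + pi^2/6, so only a few occurrences ever
# (first Borel-Cantelli).
assert occ_div > occ_conv
assert occ_conv < 25
assert occ_div > 5
```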

Example 3.29. Let X_n be independent N(0, σ²)-distributed random variables, σ > 0. The second Borel-Cantelli lemma implies (exercise!) that

lim sup_{n→∞} X_n = ∞, P-a.s.

We now prove a more precise statement:

(3.30) lim sup_{n→∞} X_n/(σ√(2 log n)) = 1.

We first need estimates on the tail of the normal distribution:

Claim 3.31. Let X be N(0, σ²)-distributed. Then for every x > 0,

(1/√(2π)) (x + 1/x)^{−1} e^{−x²/2} ≤ P[X ≥ xσ] = (1/√(2π)) ∫_x^∞ e^{−y²/2} dy ≤ (1/√(2π)) (1/x) e^{−x²/2}.

Proof. For the left-hand side observe that d/dy((1/y) e^{−y²/2}) = −(1 + 1/y²) e^{−y²/2}, and thus

(1/x) e^{−x²/2} = ∫_x^∞ (1 + 1/y²) e^{−y²/2} dy ≤ (1 + 1/x²) ∫_x^∞ e^{−y²/2} dy.

The right-hand side follows from

∫_x^∞ e^{−y²/2} dy ≤ ∫_x^∞ (y/x) e^{−y²/2} dy = (1/x) e^{−x²/2}.

This completes the proof of the claim.
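The sandwich of Claim 3.31 can be checked numerically, since the exact Gaussian tail is available through the complementary error function: P[X ≥ xσ] = ½ erfc(x/√2) for X ~ N(0, σ²). A small deterministic sketch:

```python
import math

def normal_tail(x):
    """P[Z >= x] for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in (0.5, 1.0, 2.0, 5.0):
    lower = (x + 1.0 / x) ** -1 * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    upper = (1.0 / x) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    assert lower <= normal_tail(x) <= upper     # Claim 3.31 for each x

# The bounds pinch together for large x: their ratio is x^2/(x^2 + 1) -> 1.
assert (5 + 1 / 5) ** -1 / (1 / 5) > 0.96
```

This also explains why either bound may be used in the proof of (3.30) below: for x of order √(2 log n) the two sides differ only by a factor 1 + O(1/log n).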

Proof of (3.30). Upper bound. We will apply the first Borel-Cantelli lemma. Let ε > 0 and define A_n = {X_n ≥ (1 + ε)σ√(2 log n)}, n ≥ 1. Then, using Claim 3.31,

P[A_n] ≤ (1/((1 + ε)√(2π)√(2 log n))) · e^{−(1+ε)² log n} = (1/((1 + ε)√(2π)√(2 log n))) · n^{−(1+ε)²},

and thus Σ_{n≥1} P[A_n] < ∞. It follows that P[lim sup_{n→∞} A_n] = 0, and thus P-a.s. for all n large X_n ≤ (1 + ε)σ√(2 log n), which is equivalent to

lim sup_{n→∞} X_n/(σ√(2 log n)) ≤ 1 + ε, P-a.s.

As ε is arbitrary, this yields the upper bound.

Lower bound. Let ε ∈ (0, 1) and set B_n = {X_n ≥ (1 − ε)σ√(2 log n)}. Then, using Claim 3.31 again,

P[B_n] ≥ (1/√(2π)) ((1 − ε)√(2 log n) + 1/((1 − ε)√(2 log n)))^{−1} e^{−(1−ε)² log n} ≥ 1/n^a for n ≥ n_0(a, ε) and (1 − ε)² < a < 1.

Hence Σ_{n≥1} P[B_n] = ∞. Since the B_n are independent, the second Borel-Cantelli lemma implies that P[lim sup B_n] = 1. This yields lim sup_{n→∞} X_n/(σ√(2 log n)) ≥ 1 − ε, P-a.s. Observing that ε is arbitrary, the lower bound follows.

Remark 3.32. Knowing (3.30), one may wonder how sup_{k≤n} X_k fluctuates around the value σ√(2 log n). The (elementary) extreme value theory gives the following 'convergence in distribution' result:

P[√(2 log n) (sup_{k≤n} X_k − σ√(2 log n) + σ(log log n + log 4π)/(2√(2 log n))) ≤ u] → e^{−e^{−u}} as n → ∞.

The function on the right-hand side is a distribution function of the so-called Gumbel distribution.

Exercise 3.33. (*) Try to prove the claim of the last remark. (It is actually not so difficult.)

3.5 Kolmogorov 0–1 law

Let (Xi)i≥1 be a sequence of random variables on (Ω, A,P ). For n ≥ 1 we define the σ-algebra Tn describing the ‘future after n of the sequence (Xi)i≥1’ by

Tn = σ(Xn,Xn+1,... )

= the smallest σ-algebra containing σ(Xi) for all i ≥ n.

Obviously, for p > n,

(3.34) Tp ⊂ Tn and σ(Xn,...,Xp) ⊂ Tn.

We further define the so-called tail σ-algebra, describing the 'far future' of the sequence (X_i)_{i≥1}, by

T = ⋂_{n≥1} T_n.

Even if this is not intuitively obvious at first sight, this σ-algebra contains many interesting events. For example, the convergence set of the series Σ_{i≥1} X_i is in T. Indeed, let

Ω_1 = {ω ∈ Ω : lim sup_{p→∞} Σ_{i=1}^p X_i(ω) = lim inf_{p→∞} Σ_{i=1}^p X_i(ω)}

be the convergence set of this series. Then, for every n ≥ 1, also

Ω_1 = {ω ∈ Ω : lim sup_{p→∞} Σ_{i=n}^p X_i(ω) = lim inf_{p→∞} Σ_{i=n}^p X_i(ω)}.

Hence Ω_1 ∈ T_n for every n ≥ 1, and thus Ω_1 ∈ T.

The definition of the convergence set Ω_1 allows for lim_{n→∞} Σ_{i=1}^n X_i(ω) = ±∞; however, a similar reasoning implies also that {ω : lim_{n→∞} Σ_{i=1}^n X_i(ω) = ∞} ∈ T, or

Ω_2 = {ω ∈ Ω : −∞ < lim inf_{n→∞} Σ_{i=1}^n X_i(ω) = lim sup_{n→∞} Σ_{i=1}^n X_i(ω) < +∞} ∈ T.

In the case of independent random variables, the tail σ-algebra is trivial:

Theorem 3.35 (Kolmogorov’s 0–1 law). Let (Xi)i≥1 be independent random variables on a probability space (Ω, A,P ). Then

P[A] ∈ {0, 1} for every A ∈ T.

Corollary 3.36. A sum Σ_{i≥1} X_i of independent random variables is either P-a.s. convergent or P-a.s. not convergent. Formally, P[Ω_1] ∈ {0, 1} and P[Ω_2] ∈ {0, 1} for Ω_1, Ω_2 ⊂ Ω as above.

Proof of Theorem 3.35. Step 1. For fixed n ≥ 1,

(3.37) P[A ∩ B] = P[A]P[B] for every A ∈ σ(X_1, . . . , X_{n−1}) =: A_{n−1} and B ∈ T_n,

that is, A_{n−1} and T_n are independent. Indeed, (3.37) holds for every A ∈ A_{n−1} and B ∈ σ(X_n, . . . , X_p), p > n (by the independence assumption and Corollary 3.20). Moreover, A_{n−1} and ⋃_{p≥n} σ(X_n, . . . , X_p) are π-systems, and σ(⋃_{p≥n} σ(X_n, . . . , X_p)) = T_n. The claim (3.37) then follows from Theorem 3.18.

Step 2.

(3.38) P[A ∩ B] = P[A]P[B] for every A ∈ σ(X_1, X_2, . . . ) and B ∈ T,

that is, T and σ(X_1, X_2, . . . ) are independent. Indeed, due to (3.37) and T ⊂ T_n, P[A ∩ B] = P[A]P[B] holds for every

A ∈ ⋃_{n≥1} A_n and B ∈ T.

The set systems ⋃_{n≥1} A_n and T are π-systems, and (3.38) follows from Theorem 3.18 and the fact that σ(X_1, X_2, . . . ) is the smallest σ-algebra containing ⋃_{n≥1} A_n.

Step 3. Let A ∈ T. As T ⊂ σ(X_1, X_2, . . . ), we see that A ∈ σ(X_1, X_2, . . . ) as well. Therefore, by Step 2,

(3.39) P[A] = P[A ∩ A] = P[A]P[A] = P[A]².

This yields P[A] ∈ {0, 1}.

Example 3.40 (Percolation). Percolation is one of the most beautiful and yet most challenging models in probability theory. It was introduced by Broadbent and Hammersley (1957) as a model of a disordered porous medium through which a fluid or gas can flow. Since then thousands of papers and many books have been devoted to it. It can be defined as follows. Let E_d be the edge set of Z^d,

(3.41) E_d = {{x, y} : x, y ∈ Z^d with |x − y| = 1}.

For a fixed p ∈ [0, 1], let Xe, e ∈ Ed be a family of i.i.d. random variables on some (Ω, A,Pp) satisfying

(3.42) Pp[Xe = 1] = 1 − Pp[Xe = 0] = p.

For a given realisation of the X_e's, we declare an edge e open if X_e = 1 and closed otherwise. A natural question to ask about this model is: 'Is there, for a P_p-typical configuration, an infinite self-avoiding path using only open edges?' We first need to ask whether this question makes sense, that is, to deal with the measurability. Let C_x be the connected open component containing the vertex x, x ∈ Z^d. Define the events

(3.43) J_x = {|C_x| = ∞}, I = {ω ∈ Ω : there exists an infinite open component in ω} = ⋃_{x∈Z^d} J_x.

It is easy to see that I and the J_x are events in A. Indeed, let B(n) = [−n, n]^d ∩ Z^d be the box in Z^d and let A_n = {0 is connected by an open path¹ to the set B(n)^c = Z^d \ B(n)}. A_n is an event, since it depends on the state of finitely many edges of E_{B(n)} only. Moreover, since an infinite cluster cannot exist in a finite box, J_0 = ⋂_{n≥0} A_n, and hence J_0 ∈ A. Analogously, J_x is an event for every x ∈ Z^d. Since I = ⋃_{x∈Z^d} J_x, I is an event too. Finally, we can come to the application of the 0–1 law.

Proposition 3.44. The probability Pp[I] equals 0 or 1. It has value 0 exactly when θ(p) := Pp[|C0| = ∞] = 0. Proof. We first show that

(3.45) I is σ(Xe : e ∈ Ed \ EB(n)) measurable for all n ≥ 1.

Indeed, let In be the event ‘the restriction of (Xe)e∈Ed to Ed \ EB(n) contains an infinite cluster’. Of course, In ∈ σ(Xe : e ∈ Ed \ EB(n)), by the same proof as for (3.43). The claim (3.45) then directly follows from the following equality:

(3.46) I = In.

To check this, observe first that I_n ⊂ I. Conversely, if ω ∈ I, let C be an infinite open component of ω. Consider the connected components of C \ B(n) induced by ω. If there are at least two such components, each of them must contain at least one vertex neighbouring B(n) (since it must be connected by an open path through B(n) to the remaining components). Hence C \ B(n) has finitely many connected components, and thus one of them must be infinite. Hence ω ∈ I_n. This shows I ⊂ I_n, and consequently (3.46) and (3.45). From (3.45) it follows that I is tail-measurable, that is,

(3.47) I ∈ T := ⋂_{E⊂E_d, E finite} σ(X_e : e ∈ E_d \ E).

From the Kolmogorov 0–1 law we then deduce that Pp(I) = 0 or 1.
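The qualitative distinction between small and large p in Example 3.40 is easy to observe numerically. The following sketch (illustrative only; the box size, the values of p and the helper name are our choice) samples bond percolation on a finite box of Z² and measures the largest open cluster by breadth-first search; p = 0.3 is subcritical and p = 0.7 supercritical for Z², where p_c = 1/2.

```python
import random
from collections import deque

def largest_cluster(L, p, seed):
    """Bond percolation on an L x L box of Z^2: every nearest-neighbour edge is
    open with probability p; returns the size of the largest open cluster."""
    rng = random.Random(seed)
    right = {(x, y): rng.random() < p for x in range(L - 1) for y in range(L)}
    up = {(x, y): rng.random() < p for x in range(L) for y in range(L - 1)}
    seen, best = set(), 0
    for start in ((x, y) for x in range(L) for y in range(L)):
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:                        # BFS along open edges only
            x, y = queue.popleft()
            size += 1
            nbrs = []
            if x + 1 < L and right[(x, y)]: nbrs.append((x + 1, y))
            if x > 0 and right[(x - 1, y)]: nbrs.append((x - 1, y))
            if y + 1 < L and up[(x, y)]:    nbrs.append((x, y + 1))
            if y > 0 and up[(x, y - 1)]:    nbrs.append((x, y - 1))
            for v in nbrs:
                if v not in seen:
                    seen.add(v); queue.append(v)
        best = max(best, size)
    return best

L = 50
sub = largest_cluster(L, 0.3, seed=2)     # subcritical: only small clusters
sup_ = largest_cluster(L, 0.7, seed=2)    # supercritical: a giant cluster
assert sub < 0.2 * L * L
assert sup_ > 0.5 * L * L
```

On a finite box nothing is literally infinite, of course; the giant cluster occupying a positive fraction of the box is the finite-volume shadow of θ(p) > 0.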

¹A path is a finite sequence of neighbouring vertices.

4 Laws of large numbers

4.1 Kolmogorov three series theorem

We consider a sequence (X_i)_{i≥1} of independent random variables on some probability space (Ω, A, P) and are interested in the convergence of the series Σ X_i. From the previous chapter, recall that the 'convergence set' of this series,

Ω_2 = {ω ∈ Ω : −∞ < lim inf_k Σ_{i=1}^k X_i(ω) = lim sup_k Σ_{i=1}^k X_i(ω) < +∞} ∈ T.

Therefore, by the Kolmogorov 0–1 law (Theorem 3.35),

P (Ω2) = 0 or 1.

We now search for criteria allowing us to decide which of the two possibilities holds. We define the partial sums

(4.1) S_0 = 0, S_i = Σ_{j=1}^i X_j, i ≥ 1.

The following lemma will be useful later.

Lemma 4.2 (Kolmogorov's inequality). Let (X_i)_{i≥1} be a sequence of independent random variables satisfying E[X_i] = 0 and E[X_i²] < ∞ for all i ≥ 1. Then, for every u > 0 and n ≥ 1,

P[max_{1≤k≤n} |S_k| ≥ u] ≤ u^{−2} Var(S_n).

Remark 4.3. Kolmogorov's inequality is a 'maximal inequality': the maximum max_{k≤n} |S_k| is controlled by the variance of the last partial sum only. Compare it with Chebyshev's inequality, which gives, under the same hypothesis, a control on the tail behaviour of one partial sum only: P[|S_n| ≥ u] ≤ u^{−2} Var(S_n).

Proof. We break down the event on the left-hand side according to the time¹ at which |S_k| exceeds u for the first time. Let

Ak = {|Sk| ≥ u, |Sj| < u for all j < k}.

¹We use the language of stochastic processes, and call the index n of S_n time.

Since the A_k's are disjoint²,

E[S_n²] ≥ Σ_{k=1}^n E[S_n²; A_k]
        = Σ_{k=1}^n E[S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²; A_k]
        ≥ Σ_{k=1}^n E[S_k²; A_k] + Σ_{k=1}^n E[2S_k 1_{A_k}(S_n − S_k)].

By the independence assumption, S_k 1_{A_k} and S_n − S_k are independent. (Indeed, S_k 1_{A_k} is σ(X_1, . . . , X_k)-measurable and S_n − S_k is σ(X_{k+1}, . . . , X_n)-measurable.) Moreover, E[S_n − S_k] = 0. Therefore,

E[2S_k 1_{A_k}(S_n − S_k)] = E[2S_k 1_{A_k}] E[S_n − S_k] = 0.

As the A_k are disjoint and |S_k| ≥ u on A_k, we thus have

E[S_n²] ≥ Σ_{k=1}^n E[S_k²; A_k] ≥ u² Σ_{k=1}^n P[A_k] = u² P[max_{1≤k≤n} |S_k| ≥ u].

This completes the proof.
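A quick Monte Carlo sanity check of Lemma 4.2 (a sketch with arbitrarily chosen parameters): for a simple random walk with ±1 steps, Var(S_n) = n, and the empirical frequency of {max_{k≤n} |S_k| ≥ u} indeed stays below n/u².

```python
import random

random.seed(3)
n, u, trials = 100, 15, 2000
hits = 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))       # independent steps, mean 0, variance 1
        m = max(m, abs(s))                # running maximum of |S_k|
    if m >= u:
        hits += 1

freq = hits / trials
bound = n / u**2                          # u^{-2} Var(S_n) = 100/225 ~ 0.44
assert 0 < freq <= bound
```

The bound is not sharp here (the true probability is considerably smaller), which is typical for Chebyshev-type estimates.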

Exercise 4.4. Assume that (X_i)_{i≥1} are i.i.d., E[X_i] = 0, Var X_i = c < ∞. Prove that S_n/n^p → 0 P-a.s. for every p > 1/2.

As an application of Kolmogorov's inequality (Lemma 4.2) we obtain:

Theorem 4.5. Let (X_i)_{i≥1} be independent with EX_i = 0 and Σ_{i=1}^∞ Var X_i < ∞. Then with probability one Σ_{i=1}^∞ X_i converges, i.e. P(Ω_2) = 1.

Remark 4.6. Since the variance of X_i is finite, we can view the random variables X_i as elements of the Hilbert space

L²(Ω, A, P) = {X : Ω → R : ‖X‖₂² := ∫_Ω X² dP < +∞},

with scalar product ⟨X, Y⟩ = ∫_Ω XY dP = E[XY]. Due to independence, for i ≠ j, ⟨X_i, X_j⟩ = E[X_iX_j] = E[X_i]E[X_j] = 0, that is, the random variables X_i are orthogonal in L²(Ω, A, P). It follows that Σ_k ‖X_k‖₂² = Σ_k Var X_k < ∞. Using Pythagoras' theorem, for m > n, ‖S_m − S_n‖₂² = Σ_{i=n+1}^m ‖X_i‖₂² → 0 as m, n → ∞, that is, S_n is a Cauchy sequence in L²(Ω, A, P) and thus converges in L²(Ω, A, P).

Example 4.7. In Example 1.1 we asked whether the series Σ_{k≥1} Z_k/k converges for Z_k i.i.d. with P[Z_k = ±1] = 1/2. We can now give the affirmative answer: this series converges P-a.s., since

Σ_{k≥1} Var(Z_k/k) = Σ_{k≥1} 1/k² < ∞.

²For an event B and a random variable Y we use E[Y; B] to denote E[Y 1_B] = ∫_B Y dP.
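The almost-sure convergence of the random harmonic series in Example 4.7 can be watched numerically (a sketch; the sample sizes and seeds are arbitrary): the partial sums settle down, and the fluctuation beyond index 5000 is of order (Σ_{k>5000} k^{−2})^{1/2} ≈ 0.014.

```python
import random

for seed in range(5):
    rng = random.Random(seed)
    s, s_mid = 0.0, 0.0
    for k in range(1, 10_001):
        s += rng.choice((-1, 1)) / k      # partial sums of the random harmonic series
        if k == 5_000:
            s_mid = s
    # The a.s. limit is finite: the variance is sum 1/k^2 < pi^2/6,
    # so the partial sums stay small...
    assert abs(s) < 10
    # ...and the tail S_10000 - S_5000 is tiny.
    assert abs(s - s_mid) < 1
```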

Proof of Theorem 4.5. Set W_M = sup_{m,n≥M} |S_m − S_n|, and observe that W_M ↘ W_∞ as M → ∞. We will show that

(4.8) P [W∞ = 0] = 1.

(4.8) implies that Sk is P -a.s. a Cauchy sequence, which implies directly the claim of the theorem. To show (4.8), fix ε > 0 and M ≥ 1. Then

sup_{m≥M} |S_m − S_M| ≤ ε implies sup_{m,n≥M} |S_m − S_n| ≤ 2ε.

Therefore,

P[W_M > 2ε] ≤ P[sup_{m≥M} |S_m − S_M| > ε]
            = lim_{N→∞} P[sup_{M≤m≤N} |S_m − S_M| > ε]   (monotonicity and regularity of P)
            ≤ lim_{N→∞} ε^{−2} Σ_{k=M+1}^N Var X_k   (Kolmogorov's inequality)
            = ε^{−2} Σ_{k>M} Var X_k.

Hence, since W_M ↘ W_∞,

P[W_∞ ≥ 2ε] ≤ P[W_M ≥ 2ε] ≤ ε^{−2} Σ_{k≥M} Var X_k → 0 as M → ∞.

From this (4.8) follows, and the proof is complete.

The final word on the convergence of series with independent increments is the following theorem.

Theorem 4.9 (Kolmogorov's three-series theorem). Assume that (X_i)_{i≥1} are independent random variables. For A > 0, let Y_i = X_i 1{|X_i| ≤ A}. Then Σ_{i≥1} X_i converges P-a.s. iff for some A > 0 the following three conditions hold:

(i) Σ_{n=1}^∞ P[|X_n| > A] < ∞,

(ii) Σ_{n=1}^∞ E[Y_n] converges,

(iii) Σ_{n=1}^∞ Var Y_n < ∞.

Exercise 4.10. Use this theorem to show that for X_k = Z_k/k^α with Z_k i.i.d., P[Z_k = ±1] = 1/2, and α > 0, the series Σ_{k>0} X_k converges iff α > 1/2. Observe that the series converges absolutely iff α > 1.

Proof. Sufficiency of (i)–(iii): Let Y'_k = Y_k − EY_k. Then EY'_k = 0 and Var Y'_k = Var Y_k. By (iii) and Theorem 4.5, Σ_{k≥1} Y'_k converges P-a.s. Due to (ii),

(4.11) Σ_{k≥1} Y_k = Σ_{k≥1} Y'_k + Σ_{k≥1} EY_k converges P-a.s.

By (i) and the Borel-Cantelli lemma, P[lim sup_{k→∞} |X_k| > A] = 0, that is, |X_k| > A only for finitely many k's, P-a.s. If this occurs, the sum Σ_{k≥1} X_k converges iff the sum Σ_{k≥1} Y_k converges. This implies the claim.

Necessity of (i)–(iii). Assume that Σ_{i≥1} X_i converges P-a.s. Then condition (i) must be satisfied for every A > 0, because otherwise, by the second Borel-Cantelli lemma, there would be some A > 0 such that {|X_n| > A} occurs infinitely often, P-a.s., which is incompatible with the convergence of Σ_{i≥1} X_i. Thus (i) holds with A = 1. It follows that also Σ_{i≥1} Y_i with Y_i := X_i 1{|X_i| ≤ 1} converges.

Suppose that we have verified condition (iii). Then by Theorem 4.5 the series Σ_{i≥1}(Y_i − EY_i) converges, which together with the convergence of Σ_{i≥1} Y_i implies (ii). Therefore, we only need to verify condition (iii), assuming that Σ_{i≥1} Y_i converges P-a.s., where the Y_i are independent and bounded by 1.

We claim that we can assume without loss of generality that EY_n = 0. Indeed, let (Y'_n)_{n≥1} be an independent copy of (Y_n)_{n≥1}. Then Z_n := Y_n − Y'_n is a sequence of independent random variables bounded by 2, EZ_n = 0, Var Z_n = 2 Var Y_n, and Σ_{i≥1} Z_i = Σ_{i≥1} Y_i − Σ_{i≥1} Y'_i converges P-a.s. Therefore, we only need to verify condition (iii) for (Z_n)_{n≥1}, which have mean zero and are bounded by 2.

To this end write S_n = Σ_{i=1}^n Z_i, σ_n² = Var Z_n. Fix L > 0 and let τ_L = min{n ≥ 0 :

|S_n| ≥ L}. By the convergence of S_n we see that lim_{L→∞} P[τ_L = ∞] = 1, and |S_{τ_L}| ≤ L + 2 if τ_L < ∞. Moreover,

E[S²_{n∧τ_L}] = Σ_{j=1}^n E[Z_j² 1{j ≤ τ_L}] + 2 Σ_{1≤i<j≤n} E[Z_i Z_j 1{j ≤ τ_L}].

Note now that the event {j ≤ τ_L} = {τ_L ≤ j − 1}^c is determined by Z_1, . . . , Z_{j−1} and hence independent of Z_j. Therefore,

(L + 2)² ≥ E[S²_{n∧τ_L}] = Σ_{j=1}^n σ_j² P[j ≤ τ_L] + 2 Σ_{1≤i<j≤n} E[Z_i 1{j ≤ τ_L}] E[Z_j] = Σ_{j=1}^n σ_j² P[j ≤ τ_L] ≥ P[τ_L = ∞] Σ_{j=1}^n σ_j²,

where the cross terms vanish because E[Z_j] = 0, and P[j ≤ τ_L] ≥ P[τ_L = ∞].

Choosing L large enough so that P[τ_L = ∞] > 0, and letting n → ∞, we obtain Σ_{i≥1} σ_i² < ∞, which proves the necessity of (iii).

Remark 4.12. The last part of the proof is a martingale-type argument. We are going to see more of them later.

4.2 Weak law of large numbers

In the following two sections we consider a sequence (X_i)_{i≥1} and study the conditions under which the normalised sum n^{−1}(X_1 + · · · + X_n) = n^{−1}S_n converges. Some theorems here should be known, partially with stronger assumptions, from the elementary lecture. We recall the following definition:

Definition 4.13. We say that a sequence (Y_n)_{n≥1} of random variables converges in probability to a random variable Y (notation Y_n →^P Y) when

lim_{n→∞} P(|Y_n − Y| > ε) = 0 for every ε > 0.

Exercise 4.14. Prove that convergence in probability is weaker than P-a.s. convergence and than convergence in L^p, p ∈ [1, ∞], that is,

Y_n → Y P-a.s. implies that Y_n →^P Y;
Y_n → Y in L^p(Ω, A, P) implies that Y_n →^P Y.

One distinguishes the 'weak' and the 'strong' law of large numbers. The weak one claims the convergence of n^{−1}S_n in probability, the strong one the P-a.s. convergence. The terminology should be understandable in view of the last exercise.

Theorem 4.15 (Weak law of large numbers in L²(Ω)). Let the X_i be identically distributed with EX_i² < ∞, and assume that

Cov(X_i, X_j) := E[(X_i − EX_i)(X_j − EX_j)] ≤ c_{|i−j|}

for some sequence c_n ↓ 0. Then n^{−1}S_n converges in L²(Ω, A, P), and thus in probability, to E[X_i].

Proof. By an easy computation,

E[(S_n/n − EX_1)²] = E[(n^{−1} Σ_{i=1}^n (X_i − EX_i))²]
  = n^{−2} Σ_{i,j=1}^n Cov(X_i, X_j) ≤ n^{−2} Σ_{i,j=1}^n c_{|i−j|} ≤ (2n/n²) Σ_{i=0}^n c_i → 0 as n → ∞,

where we used the fact that every c_i appears at most 2n times in the double sum, and the assumption c_n ↓ 0. This completes the proof.

4.3 Strong law of large numbers

We now turn our attention to strong laws of large numbers. We first prove one under a rather strong assumption, which makes the proof very short.

Theorem 4.16 (Strong LLN with the fourth moment). Let (X_i)_{i≥1} be i.i.d. random variables satisfying EX_i⁴ < ∞. Then

S_n/n → E[X_1] P-a.s.

Proof. With the help of the simple transformation X_i → X_i − EX_i, we may assume without loss of generality that EX_i = 0. Using Chebyshev's inequality,

(4.17) P[|S_n| ≥ nε] = P[S_n⁴ ≥ n⁴ε⁴] ≤ E[S_n⁴]/(n⁴ε⁴).

Further,

E[S_n⁴] = E[(Σ_{i=1}^n X_i)⁴] = Σ_{i,j,k,l=1}^n E[X_iX_jX_kX_l]
        = Σ'_{i,j,k,l} E[X_iX_jX_kX_l] + 6 Σ'_{i,j,k} E[X_iX_jX_k²] + 4 Σ'_{i,j} E[X_iX_j³] + 3 Σ'_{i,j} E[X_i²X_j²] + Σ_i E[X_i⁴],

where Σ' stands for a summation with all indices different. Using now the independence assumption and E[X_i] = 0, one observes that the first three terms on the right-hand side vanish. Moreover, E[X_i⁴] < ∞ implies also E[X_i²] < ∞, and thus

E[S_n⁴] ≤ 3n² E[X_1²]² + n E[X_1⁴] ≤ Cn².

Inserting this into (4.17) and applying the Borel-Cantelli lemma completes the proof.

Assuming a fourth moment is not necessary. We now prove an optimal result.

Theorem 4.18 (Strong LLN, Etemadi 1981). Let (X_i)_{i≥1} be identically distributed and pairwise independent with E[|X_i|] < ∞. Then

(4.19) S_n/n → E[X_1] P-a.s.

Proof. Step 1: Let X_i⁺ = X_i ∨ 0 and X_i⁻ = (−X_i) ∨ 0. The sequences (X_i⁺)_{i≥1}, respectively (X_i⁻)_{i≥1}, are again identically distributed, pairwise independent, and satisfy E|X_i^±| < ∞. When we show (4.19) for X_i^± in place of X_i, then (4.19) follows by linearity. We can thus, without loss of generality, assume that

X_k ≥ 0 for all k ≥ 1.

Step 2, truncation (unnecessary when E[X_i²] < ∞). Define truncated random variables Y_k by Y_k = X_k 1{X_k ≤ k}. Set A = lim inf_{k→∞} {X_k = Y_k}. Then P[A] = 1 − P[lim sup{X_k ≠ Y_k}]. In addition,

Σ_k P[X_k ≠ Y_k] = Σ_k P[X_k ≥ k] = Σ_k E[1{X_1 ≥ k}] = E[Σ_{k≥1} 1{X_1 ≥ k}] ≤ E[X_1] < ∞.

Thus, using the Borel-Cantelli lemma, we see that P[A] = 1. Moreover, on A,

(1/n)(X_1 + · · · + X_n) → EX_1 ⟺ (1/n)(Y_1 + · · · + Y_n) → EX_1 as n → ∞.

Hence, (4.19) will follow if we show

(4.20) T_n/n → EX_1 as n → ∞, with T_0 = 0, T_n = Σ_{k=1}^n Y_k.

Step 3: If for all α > 1,

(4.21) T_{[α^n]}/[α^n] → EX_1 as n → ∞, P-a.s.,

then (4.20) follows.

Indeed, let α_M = 1 + 1/M and define

Ω̃ = ⋂_{M≥1} Ω_M with Ω_M = {ω ∈ Ω : T_{[α_M^n]}(ω)/[α_M^n] → EX_1 as n → ∞}.

As P[Ω_M] = 1 for all M by assumption, we have P[Ω̃] = 1. For fixed M, let k_n = [α_M^n]. Then for any fixed ω ∈ Ω̃ and k_n ≤ m < k_{n+1}, using the non-negativity of the Y_k's,

T_{k_n}/k_{n+1} ≤ T_m/m ≤ T_{k_{n+1}}/k_n.

Trivially, lim_{n→∞} k_{n+1}/k_n = α_M. Combining the last two displays with the assumption, for ω ∈ Ω̃ and M ≥ 1,

(1/α_M) E[X_1] ≤ lim inf_{n→∞} T_n(ω)/n ≤ lim sup_{n→∞} T_n(ω)/n ≤ α_M EX_1,

and the claim of Step 3 follows by taking M → ∞.

Step 4, Proof of (4.21). Fix α > 1 and define k_n = [α^n]. Then, for ε > 0,

Σ_{n=1}^∞ P[|T_{k_n} − E[T_{k_n}]| > εk_n] ≤ (1/ε²) Σ_{n=1}^∞ Var T_{k_n}/k_n²   (Chebyshev)
(4.22)    = (1/ε²) Σ_{n=1}^∞ (1/k_n²) Σ_{m=1}^{k_n} Var Y_m   (pairwise independence)
          = (1/ε²) Σ_{m=1}^∞ Var Y_m Σ_{n: k_n≥m} 1/k_n².   (Fubini)

For every n ≥ 1, k_n ≥ α^n/2 (check this). Hence

(4.23) Σ_{n: k_n≥m} 1/k_n² ≤ 4 Σ_{n: α^n≥m} α^{−2n} = 4α^{−2n_0(m)}/(1 − α^{−2}) ≤ 4/((1 − α^{−2})m²),

where n_0(m) is the smallest integer n with α^n ≥ m. Hence, by (4.22) and (4.23),

(4.24) Σ_{n=1}^∞ P[|T_{k_n} − E[T_{k_n}]| ≥ εk_n] ≤ (4/(ε²(1 − α^{−2}))) Σ_{m=1}^∞ EY_m²/m².

Remark 4.25. When EX_1² < ∞, then

Σ_{m≥1} EY_m²/m² ≤ Σ_{m≥1} EX_1²/m² < ∞.

Hence, by the Borel-Cantelli lemma and (4.24),

lim sup |T_{k_n}/k_n − ET_{k_n}/k_n| ≤ ε P-a.s.

Actually, when EX_1² < ∞, it is not necessary to introduce the truncated random variables Y_i. Using the same steps as in (4.21)–(4.24) with X_i instead of Y_i, we find that

lim sup |S_{k_n}/k_n − ES_{k_n}/k_n| = lim sup |S_{k_n}/k_n − EX_1| ≤ ε P-a.s.,

and thus (4.21) holds for the X_i's. ♦

We continue with the proof of (4.21). Inspecting (4.24), it remains to show the following two claims:

(4.26) Σ_{m≥1} EY_m²/m² < ∞,

(4.27) n^{−1} ET_n → EX_1 as n → ∞.

Indeed, if these two hold, we obtain using the Borel-Cantelli lemma

P[lim sup_{n→∞} |T_{k_n}/k_n − ET_{k_n}/k_n| ≤ ε] = 1,

which together with (4.27) yields

P[lim sup_{n→∞} |T_{k_n}/k_n − EX_1| ≤ 2ε] = 1,

and thus also

P[⋂_{M≥1} {lim sup_{n→∞} |T_{k_n}/k_n − EX_1| ≤ 1/M}] = 1,

and (4.21) follows.

Step 5, Proof of (4.27). Using the definition of the Y_i's we get 0 ≤ EX_k − EY_k = E[X_1 1{X_1 > k}], which by the dominated convergence theorem converges to 0 as k → ∞. Hence EY_k → EX_1 as k → ∞. Finally, as n^{−1} ET_n = n^{−1}(EY_1 + · · · + EY_n), this implies that n^{−1} ET_n → EX_1, as required.

Step 6, Proof of (4.26). Using integration by parts and Fubini's theorem, we obtain

EY_m² = E[2 ∫_0^∞ y 1{Y_m ≥ y} dy] = 2 ∫_0^∞ y P[Y_m ≥ y] dy.

Since Y_m ≤ m and X_m ≥ Y_m, this can be bounded from above by

EY_m² ≤ 2 ∫_0^m y P[Y_m ≥ y] dy ≤ 2 ∫_0^m y P[X_m ≥ y] dy.

Therefore, using Fubini's and the monotone convergence theorem,

(4.28) Σ_{m≥1} EY_m²/m² ≤ 2 Σ_{m≥1} (1/m²) ∫_0^m y P[X_1 ≥ y] dy = 2 ∫_0^∞ y Σ_{m≥1} (1/m²) 1{y ≤ m} P[X_1 ≥ y] dy.

For y ≥ 2 we have the upper bound

(4.29) Σ_{m≥y} 1/m² ≤ Σ_{m≥y} ∫_{m−1}^m dx/x² = ∫_{y−1}^∞ dx/x² = 1/(y − 1),

and further Σ_{m≥1} 1/m² = 1 + Σ_{m≥2} 1/m² ≤ 1 + 1 = 2. Inserting these into (4.28),

Σ_{m≥1} EY_m²/m² ≤ 2 ∫_0^2 2y P[X_1 ≥ y] dy + 2 ∫_2^∞ (y/(y − 1)) P[X_1 ≥ y] dy ≤ 8 ∫_0^∞ P[X_1 ≥ y] dy = 8 E[∫_0^{X_1} dy] = 8 EX_1 < ∞,

where we used 2y ≤ 4 on [0, 2] and y/(y − 1) ≤ 2 for y ≥ 2, and (4.26) follows. This completes the proof of Theorem 4.18.
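The conclusion of Theorem 4.18 is easy to observe numerically (a sketch with an arbitrary choice of distribution and sample size): for i.i.d. Exp(1) variables, which have E|X_i| < ∞, the sample mean S_n/n approaches the true mean 1.

```python
import random

random.seed(4)
n = 200_000
s = 0.0
errs = []
for i in range(1, n + 1):
    s += random.expovariate(1.0)          # i.i.d. Exp(1), mean 1
    if i in (1_000, 10_000, 200_000):
        errs.append(abs(s / i - 1.0))     # |S_i/i - E[X_1]| at a few scales

# S_n/n -> 1 a.s.; at n = 200000 the deviation is tiny.
assert errs[-1] < 0.05
assert all(e < 0.5 for e in errs)
```

The same experiment with Cauchy variables fails, in line with Remark 4.32 below: there the sample means do not settle down at all.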

Remark 4.30. The original proof of the strong law of large numbers (due to Kolmogorov) is less general and only works for i.i.d. random variables. It is based on the three-series theorem. The connection between random series and the law of large numbers is provided by Kronecker's lemma, whose proof can be found e.g. in Durrett, page 65.

Lemma 4.31. Let (a_n)_{n≥1} and (x_n)_{n≥1} be two sequences of real numbers such that a_n ↗ ∞. Then, if Σ_{n≥1} x_n/a_n converges, also (1/a_n) Σ_{k=1}^n x_k → 0 as n → ∞.

Remark 4.32 (necessity of the integrability condition for the LLN). Consider the Cauchy random variable X, that is, a random variable with density f_X(x) = 1/(π(1 + x²)). The characteristic function of X is given by

E[e^{iλX}] = ∫_{−∞}^∞ e^{iλx} f_X(x) dx = e^{−|λ|}.

Hence, if the X_i are i.i.d. Cauchy, then the characteristic function of their normalised sum n^{−1}S_n = n^{−1}(X_1 + · · · + X_n) is

E[e^{iλS_n/n}] = (E[e^{iλX_1/n}])^n = (e^{−|λ|/n})^n = e^{−|λ|}.

Since the characteristic function uniquely determines the distribution (we will show this soon), n^{−1}S_n is Cauchy distributed for every n ≥ 1. Hence n^{−1}S_n does not converge to 0, although one might expect this, given that the density of the Cauchy distribution is even, so one is tempted to define EX = 0. This is however not correct: the expectation of the Cauchy distribution is not defined, because EX⁺ = EX⁻ = ∞.

Exercise 4.33. If you find the previous remark not convincing enough, you can try to prove the following statement, which shows that the integrability condition in the law of large numbers is not only sufficient but also necessary: If (X_i)_{i≥1} are i.i.d. with E|X_i| = ∞, then P[|X_n| ≥ n for infinitely many n] = 1. As a consequence, P[lim n^{−1}S_n exists and is finite] = 0. (Hint: Show that E|X_1| ≤ Σ_{n=0}^∞ P[|X_1| > n] and use the second Borel-Cantelli lemma.)

Example 4.34 (Renewal process). We now consider a problem where the strong law of large numbers can be applied. Let (X_i)_{i≥1} be i.i.d. with 0 < X_i < ∞ and set T_n = X_1 + · · · + X_n. Think of T_n as the time of the n-th occurrence of some event (e.g. the times of repairs of a machine). Let

Nt = sup{n : Tn ≤ t} be the number of events occurring before the time t.

Theorem 4.35. If EX_1 = µ < ∞, then

lim_{t→∞} N_t/t = 1/µ, P-a.s.

Proof. By the strong law of large numbers, n^{−1}T_n → µ as n → ∞, P-a.s. By the definition of N_t, T(N_t) ≤ t < T(N_t + 1), hence

T(N_t)/N_t ≤ t/N_t ≤ (T(N_t + 1)/(N_t + 1)) · ((N_t + 1)/N_t).

To take the limit, note that T_n < ∞ for all n, so that N_t → ∞ as t → ∞, P-a.s. Combining this with the strong law of large numbers, we thus obtain

T (N (ω))(ω) N (ω) + 1 t −−−→t→∞ µ and t −−−→t→∞ 1 for P -a.e. ω. Nt(ω) Nt(ω) This completes the proof. The last theorem is just a begining of the so-called ‘renewal theory’ which proves many finer result about the behaviour of the sequence of renewal times Tn. One of its important results is the Blackwell’s renewal theorem:

Theorem 4.36. Let the distribution of the X_i's be non-arithmetic (i.e. not supported on {0, δ, 2δ, 3δ, . . . } for any δ > 0). Then, for any h > 0, the expected number U(t, t + h) of renewal times in the interval [t, t + h],

U(t, t + h) = Σ_{n≥0} P[T_n ∈ [t, t + h]],

converges to h/µ as t tends to infinity.
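Returning to Theorem 4.35, its conclusion can be checked by a quick simulation (a sketch; the exponential inter-arrival distribution, the mean µ = 2 and the horizon are our choice): the renewal rate N_t/t settles near 1/µ.

```python
import random

random.seed(5)
mu, t_max = 2.0, 10_000.0
t, n_events = 0.0, 0
while True:
    t += random.expovariate(1.0 / mu)     # X_i ~ Exp(1/2), so E[X_i] = mu = 2
    if t > t_max:
        break
    n_events += 1                         # N_t counts renewals up to time t_max

rate = n_events / t_max
assert abs(rate - 1.0 / mu) < 0.05        # N_t/t -> 1/mu = 0.5
```

With exponential inter-arrival times N_t is in fact exactly Poisson(t/µ); for other positive distributions the same limit holds by Theorem 4.35, only the finite-t fluctuations differ.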

4.4 Law of large numbers for triangular arrays

In many situations one is confronted with sums of random variables whose distributions depend on the length of the considered sum. Formally, one is given a triangular array X_{nk}, 1 ≤ k ≤ n, of random variables and is interested in the behaviour of S_n = X_{n1} + · · · + X_{nn}.

Theorem 4.37 (Weak LLN for triangular arrays). For each n let (X_{nk})_{1≤k≤n} be independent. Assume that for some sequence b_n ↗ ∞ we have

(i) Σ_{k=1}^n P[|X_{nk}| > b_n] → 0 as n → ∞,

(ii) b_n^{−2} Σ_{k=1}^n E X̄_{nk}² → 0 as n → ∞, where X̄_{nk} = X_{nk} 1{|X_{nk}| ≤ b_n}.

Then, with S_n = X_{n1} + · · · + X_{nn} and a_n = Σ_{k=1}^n E X̄_{nk},

(S_n − a_n)/b_n → 0 in probability as n → ∞.

Remark 4.38. In typical applications, the rows of the array are defined on different probability spaces. In such a situation, it makes no sense to discuss a strong law of large numbers. However, if all random variables are defined on the same probability space, a strong law of large numbers can be obtained using the ideas of the proof of Theorem 4.18.

Proof of Theorem 4.37. Let S̄_n = X̄_{n1} + · · · + X̄_{nn}. Clearly,

P[|(S_n − a_n)/b_n| ≥ ε] ≤ P[S_n ≠ S̄_n] + P[|(S̄_n − a_n)/b_n| ≥ ε].

For the first term we note that

P[S_n ≠ S̄_n] ≤ P[⋃_{k=1}^n {X_{nk} ≠ X̄_{nk}}] ≤ Σ_{k=1}^n P[|X_{nk}| > b_n] → 0 as n → ∞,

by assumption (i). For the second term, using the Chebyshev inequality together with a_n = E S̄_n and Var X ≤ EX²,

P[|(S̄_n − a_n)/b_n| ≥ ε] ≤ ε^{−2} b_n^{−2} Var[S̄_n] = (b_n ε)^{−2} Σ_{k=1}^n Var X̄_{nk} ≤ (b_n ε)^{−2} Σ_{k=1}^n E X̄_{nk}² → 0 as n → ∞,

due to assumption (ii). This completes the proof.

Example 4.39 (Coupon collector's problem). Let (X_{ni})_{i≥1} be i.i.d. random variables uniformly distributed on the set {1, . . . , n}. We are interested in the first time at which all possible values 1, . . . , n have appeared in the sequence (X_{ni})_{i≥1}, that is, in

T_n = inf{m ≥ 1 : for every ℓ ∈ {1, . . . , n} there is i ≤ m such that X_{ni} = ℓ}

= inf{m ≥ 1 : |{Xn1,...,Xnm}| = n}.

To this end we define τn0 = 0 and

$$\tau_{nk} = \inf\{m \ge 1 : |\{X_{n1}, \dots, X_{nm}\}| = k\}$$
to be the first time when $k$ of the $n$ symbols have been observed; obviously $\tau_{nn} = T_n$. We further set $Z_{nk} = \tau_{nk} - \tau_{n,k-1}$, so $Z_{nk}$ is the time we must wait for the $k$-th new symbol after seeing the $(k-1)$-th one. Elementary considerations yield that $Z_{nk}$ has geometric distribution with parameter $p_{nk} = 1 - \frac{k-1}{n}$ and is independent of $Z_{nj}$, $j < k$. Hence $E[Z_{nk}] = p_{nk}^{-1}$ and $\operatorname{Var} Z_{nk} \le p_{nk}^{-2}$. It is then straightforward to check the assumptions of Theorem 4.37 with $b_n = n \log n$ (Exercise!) and to obtain that
$$\frac{T_n}{n \log n} \xrightarrow{n\to\infty} 1 \quad\text{in probability.} \tag{4.40}$$
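The convergence (4.40) is easy to observe numerically. The following is a minimal simulation sketch (the function name and all parameters are our own choices, not part of the text): it draws uniform symbols until all $n$ have been seen and compares $T_n$ with $n\log n$.

```python
import math
import random

def coupon_collector_time(n, rng):
    """Number of uniform draws from {1, ..., n} until every symbol has appeared."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # a uniform symbol from {0, ..., n-1}
        draws += 1
    return draws

rng = random.Random(0)
n = 2000
# T_n / (n log n) should concentrate near 1, in accordance with (4.40)
ratios = [coupon_collector_time(n, rng) / (n * math.log(n)) for _ in range(30)]
avg = sum(ratios) / len(ratios)
```

For finite $n$ the ratio sits slightly above $1$ (since $E T_n = n H_n \approx n(\log n + 0.577)$), but it approaches $1$ as $n$ grows.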

5 Large deviations

Let $X_1, X_2, \dots$ be an i.i.d. sequence of random variables and $S_n = X_1 + \cdots + X_n$. From the previous chapter we know that $n^{-1}S_n$ converges to $EX_1$, whenever the expectation exists. In this chapter we investigate the rate at which the probability of the 'unusual event' $n^{-1}S_n > u$, for a $u > EX_1$, decays to zero. We will see that when $E[e^{tX_1}] < \infty$ for some $t > 0$, this probability decays exponentially, and we will identify the exact exponential rate,
$$I(u) = -\lim_{n\to\infty} \frac{1}{n} \log P[S_n \ge nu]. \tag{5.1}$$
Remark 5.2. Formula (5.1) can be interpreted as

$$P[S_n \ge nu] = \exp\{-n(I(u) + a_n)\}$$
for some sequence $a_n$ converging to $0$ as $n \to \infty$.

5.1 Sub-additive limit theorem

We first develop a rather useful technique that allows us to show the existence of the limit in (5.1), and that applies in many other situations. Let $\pi_n = P[S_n \ge nu]$. Using the independence of the $X_i$'s, we see that

$$\pi_{m+n} = P[S_{m+n} \ge (m+n)u] \ge P[S_n \ge nu,\ S_{m+n} - S_n \ge mu] = P[S_n \ge nu]\,P[S_m \ge mu] = \pi_n \pi_m,$$
where we used that $S_{m+n} - S_n$ is independent of $S_n$ and has the same distribution as $S_m$.

Defining $\gamma_n = -\log \pi_n$, it follows that $\gamma_n \ge 0$ and
$$\gamma_{m+n} \le \gamma_m + \gamma_n \quad\text{for all } m, n \ge 0. \tag{5.3}$$

Sequences satisfying (5.3) are called subadditive. Their important property is that they converge (after normalisation):

Lemma 5.4 (Fekete's subadditive lemma). Let $\gamma_n \ge 0$ satisfy (5.3). Then
$$\lim_{n\to\infty} \frac{\gamma_n}{n} = \inf_{n\ge1} \frac{\gamma_n}{n}.$$

Proof. Since $\liminf_n \gamma_n/n \ge \inf_n \gamma_n/n$, we only need to show that for any $m$
$$\limsup_{n\to\infty} \frac{\gamma_n}{n} \le \frac{\gamma_m}{m}. \tag{5.5}$$
Writing $n = km + \ell$ with $0 \le \ell < m$ and $k \in \mathbb N$, and repeatedly using the subadditivity assumption, yields $\gamma_n \le k\gamma_m + \gamma_\ell$. Dividing by $n$ gives
$$\frac{\gamma_n}{n} \le \frac{km}{km+\ell}\,\frac{\gamma_m}{m} + \frac{\gamma_\ell}{n}.$$
Claim (5.5) then follows by taking $n \to \infty$, observing that $\ell < m$.

Coming back to our original problem, that is the behaviour of $P[S_n \ge nu]$, Lemma 5.4 and (5.3) easily imply that the limit in (5.1) exists.
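Fekete's lemma is easy to test on a concrete subadditive sequence. The following sketch (a toy example of our own, not taken from the text) uses $\gamma_n = 2n + \sqrt n$, which satisfies (5.3) because $\sqrt{m+n} \le \sqrt m + \sqrt n$ and the linear part is additive, and checks that $\gamma_n/n$ decreases towards $\inf_{n\ge1} \gamma_n/n = 2$.

```python
import math

def gamma(n):
    # toy subadditive sequence: an additive linear part plus a concave part
    return 2 * n + math.sqrt(n)

# verify subadditivity (5.3) on a grid of pairs (m, n)
subadditive = all(
    gamma(m + n) <= gamma(m) + gamma(n) + 1e-12
    for m in range(1, 50)
    for n in range(1, 50)
)

# gamma(n)/n decreases towards its infimum 2, as Lemma 5.4 predicts
ratios = [gamma(n) / n for n in (1, 10, 100, 10_000, 1_000_000)]
```

Here the infimum is approached but not attained, which the lemma allows: the limit equals the infimum over all $n \ge 1$.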

5.2 Cramér's theorem

We are now going to identify the limit in (5.1) (and also give another proof of its existence).

Theorem 5.6 (Cramér, Chernoff). Let $(X_i)_{i\ge1}$ be an i.i.d. sequence of random variables and $S_n = X_1 + \cdots + X_n$. Assume that the Laplace transform of $X_i$ is finite, that is

$$\varphi(t) := E[e^{tX_i}] < \infty \quad\text{for all } t \in \mathbb R, \tag{5.7}$$
and that $u > EX_1$. Then
$$\lim_{n\to\infty} \frac{1}{n} \log P[S_n \ge nu] = -I(u), \tag{5.8}$$
where $I$ is given by the Legendre transform of $\log \varphi$,
$$I(a) = \sup_{t\in\mathbb R}\,(ta - \log \varphi(t)).$$
Remark 5.9. Assumption (5.7) is not essential; it only simplifies the proof. It is not even necessary that $EX_1$ exists. In that case, however, the information provided by the theorem might not be very interesting.

Proof. Without loss of generality we may assume that $u = 0$ and $EX_1 < 0$. To see this, consider the random variables $Y_i = X_i - u$. Then $P[S_n \ge nu] = P[Y_1 + \cdots + Y_n \ge 0]$, and $\varphi_Y(t) := E[e^{tY_1}] = e^{-ut}\varphi(t)$, which implies that
$$I(u) = \sup_t\{-\log(\varphi(t)e^{-ut})\} = \sup_t\{-\log \varphi_Y(t)\} =: I_Y(0).$$

We will also assume that the $X_i$'s are non-degenerate (i.e. not a.s. constant), otherwise the claim is trivial. Set $\rho = \inf_t \varphi(t)$. Since $I(0) = -\log\rho$, we must show that
$$\lim_{n\to\infty} \frac{1}{n} \log P[S_n \ge 0] = \log \rho. \tag{5.10}$$

From the definition of $\varphi(t)$ and assumption (5.7), one can easily show that
$$\varphi \in C^\infty(\mathbb R), \qquad \varphi'(t) = E[X_1 e^{tX_1}], \qquad \varphi''(t) = E[X_1^2 e^{tX_1}].$$

It follows that $\varphi$ is strictly convex and $\varphi'(0) = E[X_1] < 0$. We now consider three cases:

(a) The case $P[X_1 < 0] = 1$. Then $\varphi$ is strictly decreasing and $\lim_{t\to\infty}\varphi(t) = 0 = \rho$. As $P[S_n \ge 0] = 0$, the claim (5.10) follows.

(b) The case $P[X_1 \le 0] = 1$ and $P[X_1 = 0] > 0$. In this case $\varphi$ is decreasing and $\lim_{t\to\infty}\varphi(t) = \rho = P[X_1 = 0]$. On the other hand, $P[S_n \ge 0] = P[S_n = 0] = \rho^n$, by independence. That is, (5.10) holds in this case.

(c) The most interesting case is $P[X_1 < 0] > 0$ and $P[X_1 > 0] > 0$. Then $\lim_{t\to\infty}\varphi(t) = \infty$ and there exists a unique $\tau > 0$ where $\varphi$ is minimised, that is $\varphi(\tau) = \rho$ and $\varphi'(\tau) = 0$.

For the upper bound in (5.10) we use Chebyshev's inequality. As $\tau > 0$ (we could use any $t \ge 0$ in place of $\tau$ here, but it can be shown that $\tau$ is optimal),

$$P[S_n \ge 0] = P[e^{\tau S_n} \ge 1] \le E e^{\tau S_n} = (\varphi(\tau))^n = \rho^n,$$
and thus
$$\limsup_{n\to\infty} \frac{1}{n}\log P[S_n \ge 0] \le \log\rho.$$
The lower bound in (5.10) is more delicate. In its proof we use a 'transformation of measure' trick which makes the event $\{S_n \ge 0\}$ 'typical'. We create a new i.i.d. sequence by 'tilting' (the Cramér transform of) the $X_i$'s. Let $\mu$ be the distribution of $X_i$ on $\mathbb R$. For $\hat\tau > \tau$ and $\hat\rho = \varphi(\hat\tau) = E[e^{\hat\tau X_1}]$, we define a new probability distribution $\hat\mu$ by
$$\hat\mu(A) = \frac{1}{\hat\rho}\int_A e^{\hat\tau x}\,\mu(dx).$$
(Check that $\hat\mu$ is really a probability distribution!) Let now $\hat X_i$ be i.i.d. random variables with distribution $\hat\mu$. We claim that
$$E\hat X_i > 0 \quad\text{and}\quad \operatorname{Var}\hat X_i \in (0, \infty). \tag{5.11}$$
Indeed, by the definition of $\hat X_i$,

$$\hat\varphi(t) := E e^{t\hat X_i} = \int e^{tx}\,\hat\mu(dx) = \frac{1}{\hat\rho}\int e^{tx}e^{\hat\tau x}\,\mu(dx) = \frac{\varphi(t + \hat\tau)}{\varphi(\hat\tau)} < \infty.$$

Then, as above, $E\hat X_1 = \hat\varphi'(0) = \varphi'(\hat\tau)/\varphi(\hat\tau) > 0$ since $\hat\tau > \tau$, and $\operatorname{Var}\hat X_i \le E\hat X_i^2 = \hat\varphi''(0) < \infty$.

We now rewrite the probability of the event in (5.10) with the help of the new random variables:
$$P[S_n \ge 0] = \int_{x_1+\cdots+x_n \ge 0} \mu(dx_1)\cdots\mu(dx_n) = \int_{x_1+\cdots+x_n \ge 0} \big(\hat\rho\, e^{-\hat\tau x_1}\hat\mu(dx_1)\big)\cdots\big(\hat\rho\, e^{-\hat\tau x_n}\hat\mu(dx_n)\big) = \hat\rho^{\,n}\, E\big[e^{-\hat\tau \hat S_n}\,1\{\hat S_n \ge 0\}\big], \tag{5.12}$$

where $\hat S_n = \hat X_1 + \cdots + \hat X_n$. To estimate the above expectation, for $\delta \in (0,1)$,

$$E\big[e^{-\hat\tau\hat S_n}1\{\hat S_n \ge 0\}\big] \ge E\big[e^{-\hat\tau\hat S_n}1\{|\hat S_n - E\hat S_n| \le \delta E\hat S_n\}\big] \ge e^{-\hat\tau(1+\delta)E\hat S_n}\, P\big[|\hat S_n - E\hat S_n| \le \delta E\hat S_n\big].$$

Obviously, $E\hat S_n = nE\hat X_1$ and, by the weak law of large numbers, $P\big[|\hat S_n - E\hat S_n| \le \delta E\hat S_n\big] \xrightarrow{n\to\infty} 1$. Hence, taking $\liminf \frac{1}{n}\log$ in the last display, and subsequently letting $\delta \to 0$, yields

$$\liminf_{n\to\infty} \frac{1}{n}\log E\big[e^{-\hat\tau\hat S_n}1\{\hat S_n \ge 0\}\big] \ge -\hat\tau E\hat X_1. \tag{5.13}$$

Finally, combining (5.12) with (5.13), $\liminf_{n\to\infty} n^{-1}\log P[S_n \ge 0] \ge \log\hat\rho - \hat\tau E\hat X_1$. Taking $\hat\tau \searrow \tau$, we see that $\hat\rho \to \rho$ and $E\hat X_1 = \varphi'(\hat\tau)/\varphi(\hat\tau) \to 0$, as $\varphi'(\tau) = 0$. This yields the required lower bound in (5.10), and completes the proof.
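For a concrete case the rate in Theorem 5.6 can be checked numerically. The sketch below is our own illustration (the grid-based `rate` function is a crude stand-in for the exact Legendre transform): for fair $\pm1$ steps one has $\varphi(t) = \cosh t$, and we compare $I(1/2)$ with the exactly computed $-n^{-1}\log P[S_n \ge n/2]$.

```python
import math

def rate(u, tmax=5.0, steps=5000):
    """I(u) = sup_t (t*u - log cosh t), evaluated on a finite grid (rough sketch)."""
    return max(
        (tmax * k / steps) * u - math.log(math.cosh(tmax * k / steps))
        for k in range(steps + 1)
    )

def log_tail(n, u):
    """Exact -(1/n) log P[S_n >= n*u] for a sum of n fair +-1 steps."""
    # S_n = 2*H - n with H ~ Binomial(n, 1/2), so S_n >= n*u iff H >= n*(1+u)/2
    kmin = math.ceil(n * (1 + u) / 2)
    p = sum(math.comb(n, k) for k in range(kmin, n + 1)) / 2 ** n
    return -math.log(p) / n

u = 0.5
I = rate(u)            # closed form gives I(1/2) = (3/4)log(3/2) + (1/4)log(1/2) ~ 0.1308
approx = log_tail(2000, u)
```

The finite-$n$ value `approx` exceeds $I(u)$ by a term of order $\log n / n$, the polynomial correction hidden in the sequence $a_n$ of Remark 5.2.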

6 Weak convergence of probability measures

In this chapter we discuss the weak convergence of probability measures on arbitrary metric spaces.

6.1 Weak convergence on R

We recall the definition from the elementary lecture:

Definition 6.1. A sequence $(\mu_n)_{n\in\mathbb N}$ of probability distributions on $(\mathbb R, \mathcal B(\mathbb R))$ converges weakly to the probability distribution $\mu$ when
$$F_{\mu_n}(y) \xrightarrow{n\to\infty} F_\mu(y) \quad\text{for all points of continuity of } F_\mu.$$
Here $F_{\mu_n}(y) = \mu_n((-\infty, y])$ and $F_\mu(y) = \mu((-\infty, y])$ denote the distribution functions of $\mu_n$ and $\mu$, respectively. We then write $\mu_n \xrightarrow{w} \mu$.

A sequence $(X_n)_{n\ge1}$ of real-valued random variables converges in distribution (or in law) to a random variable $X$ when the distributions $\mu_n$ of $X_n$ converge weakly to the distribution $\mu$ of $X$. We write $X_n \xrightarrow{d} X$.

Remark 6.2. The random variables Xn in the previous definition do not need to be defined on the same probability space.

Remark 6.3. To see why we do not require the convergence for all $y$, let $\mu_n$ be the centred Gaussian distribution with variance $1/n$. Obviously, $\mu_n \xrightarrow{w} \delta_0$, but $F_{\mu_n}(0) = 1/2 \ne 1 = F_{\delta_0}(0)$.

Example 6.4 (De Moivre-Laplace theorem). Consider a sequence $(X_i)_{i\ge1}$ of Bernoulli random variables, $P[X_i = \pm1] = 1/2$. Let as usual $S_n = X_1 + \cdots + X_n$. An easy combinatorial argument yields the formula
$$P[S_{2n} = 2k] = \binom{2n}{n+k} 2^{-2n}.$$
Analysing this using Stirling's formula $x! = x^x e^{-x}\sqrt{2\pi x}\,(1 + o(1))$ we find (Exercise!), for $k(n) = \lfloor x\sqrt{2n}/2 \rfloor$, $x \in \mathbb R$,
$$\lim_{n\to\infty} \frac{\sqrt{2n}}{2}\, P\Big[\frac{S_{2n}}{\sqrt{2n}} = \frac{2k(n)}{\sqrt{2n}}\Big] = \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{x^2}{2}\Big\}.$$

Using the Riemann approximation of an integral with some care, one deduces that

$$P\Big[\frac{S_{2n}}{\sqrt{2n}} \le y\Big] \xrightarrow{n\to\infty} \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-x^2/2}\,dx,$$
that is, $S_{2n}/\sqrt{2n}$ converges in distribution to a standard normal random variable.
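The convergence in Example 6.4 can be checked exactly for moderate $n$. A small sketch of our own (with $2n = 1000$) compares the distribution function of $S_{2n}/\sqrt{2n}$, computed from the binomial formula, with the normal limit.

```python
import math

def scaled_cdf(two_n, y):
    """P[S_{2n} / sqrt(2n) <= y] for a sum of 2n fair +-1 steps."""
    # S_{2n} = 2*H - 2n with H ~ Binomial(2n, 1/2)
    kmax = math.floor((y * math.sqrt(two_n) + two_n) / 2)
    return sum(math.comb(two_n, k) for k in range(kmax + 1)) / 2 ** two_n

def normal_cdf(y):
    # standard normal distribution function, via the error function
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

two_n = 1000
err = max(
    abs(scaled_cdf(two_n, y) - normal_cdf(y))
    for y in (-1.5, -0.5, 0.0, 0.5, 1.5)
)
```

The residual discrepancy is of the order of one lattice step of the discrete distribution, which is the point of Remark 6.11 below: the convergence here is genuinely weak, not setwise.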

Exercise 6.5 (Exponential and geometric distribution). Let $X_p$ be a geometric random variable with parameter $p$, that is $P[X_p = k] = (1-p)^{k-1}p$, $k \ge 1$. Prove that
$$pX_p \xrightarrow{d} X \quad\text{as } p \to 0,$$
where $X$ is an exponential random variable with parameter $1$, $P[X \ge y] = e^{-y}$.

Exercise 6.6 (Poisson approximation). Let $X_n$ be a binomial random variable with parameters $(n, p_n)$. Assume that $np_n \to \lambda \in (0,\infty)$ as $n \to \infty$. Prove that $X_n \xrightarrow{d} X$, where $X$ has Poisson distribution with parameter $\lambda$.

Example 6.7. Let $\mu_n(dy) = \frac12\delta_0(dy) + \frac12\delta_n(dy)$. Then
$$F_{\mu_n}(y) = \frac12\,1\{y \ge 0\} + \frac12\,1\{y \ge n\} \xrightarrow{n\to\infty} \frac12\,1\{y \ge 0\}.$$
Observe that the limit is not a distribution function of any probability distribution, so the $\mu_n$ do not converge weakly. As is easy to see, the problem lies in the fact that some mass of $\mu_n$ 'disappears to infinity'.

We now study properties of the weak convergence.

Proposition 6.8. The following statements are equivalent:

(i) $\mu_n \xrightarrow{w} \mu$.

(ii) There are random variables $Y$ and $Y_n$, $n \ge 1$, on a probability space $(\Omega, \mathcal A, P)$ such that $\mu_n$ is the distribution of $Y_n$, $\mu$ is the distribution of $Y$, and
$$Y_n \xrightarrow{n\to\infty} Y, \quad P\text{-a.s.}$$

(iii) For every bounded continuous function $f : \mathbb R \to \mathbb R$,
$$\int f\,d\mu_n \xrightarrow{n\to\infty} \int f\,d\mu.$$

Proof. (i) $\Rightarrow$ (ii). We take $\Omega = (0,1)$, $\mathcal A = \mathcal B((0,1))$ and $P$ the Lebesgue measure on $(0,1)$, and set for $\omega \in \Omega$
$$Y_n(\omega) = \sup\{y \in \mathbb R : F_{\mu_n}(y) < \omega\}, \qquad Y(\omega) = \sup\{y \in \mathbb R : F_\mu(y) < \omega\}.$$

Then, as in the proof of Theorem 2.15, $\mu_n$ is the distribution of $Y_n$ and $\mu$ of $Y$. The functions $\omega \mapsto Y(\omega)$ and $\omega \mapsto Y_n(\omega)$ are increasing. Let further
$$\tilde Y(\omega) = \inf\{y \in \mathbb R : F_\mu(y) > \omega\}, \qquad \Omega_0 = \{\omega \in \Omega : \tilde Y(\omega) = Y(\omega)\}.$$
Observe that for every $\omega < \omega'$ one has $Y(\omega) \le \tilde Y(\omega) \le Y(\omega') \le \tilde Y(\omega')$, and thus $\omega \mapsto Y(\omega)$ has a jump at every $\omega \in \Omega \setminus \Omega_0$. Since $Y$ is increasing, $\Omega \setminus \Omega_0$ is at most countable, and thus $P[\Omega_0] = 1$. It is thus sufficient to show that

$$\lim_{n\to\infty} Y_n(\omega) = Y(\omega) \quad\text{for every } \omega \in \Omega_0. \tag{6.9}$$
To this end, fix $\omega \in \Omega_0$. For every continuity point $y$ of $F_\mu$ satisfying $y < Y(\omega)$, and for $n$ large enough, $F_{\mu_n}(y) < \omega$ and thus $y \le Y_n(\omega)$. Hence $y \le \liminf_n Y_n(\omega)$. We can let $y \nearrow Y(\omega)$, as $F_\mu$ has at most countably many jumps, and obtain $Y(\omega) \le \liminf_n Y_n(\omega)$. On the other hand, if $y$ is a point of continuity of $F_\mu$ with $y > Y(\omega)$, then, recalling that $\omega \in \Omega_0$, $y > \tilde Y(\omega)$ and hence $F_\mu(y) > \omega$. Thus, for $n$ large enough, $F_{\mu_n}(y) > \omega$ and so $Y_n(\omega) \le y$. Letting $y \searrow Y(\omega)$ along continuity points yields $\limsup_n Y_n(\omega) \le Y(\omega)$, and (6.9) follows.

(ii) $\Rightarrow$ (iii). For a bounded continuous function $f$,

$$\int f\,d\mu_n = E[f(Y_n)] \xrightarrow{\text{DCT}} E[f(Y)] = \int f\,d\mu.$$

(iii) $\Rightarrow$ (i). Let $y \in \mathbb R$ be a point of continuity of $F_\mu$. Define $g_\varepsilon : \mathbb R \to [0,1]$ by
$$g_\varepsilon(x) = \begin{cases} 1, & x \le y,\\ 0, & x \ge y + \varepsilon,\\ \text{linear, continuous}, & x \in [y, y+\varepsilon]. \end{cases}$$
Obviously $g_\varepsilon$ is bounded and continuous. Moreover, using (iii),
$$F_\mu(y + \varepsilon) = \mu((-\infty, y+\varepsilon]) \ge \int g_\varepsilon(x)\,\mu(dx) = \lim_{n\to\infty} \int g_\varepsilon(x)\,\mu_n(dx) \ge \limsup_{n\to\infty} \mu_n((-\infty, y]) = \limsup_{n\to\infty} F_{\mu_n}(y),$$
and thus, letting $\varepsilon \to 0$,
$$\limsup_{n\to\infty} F_{\mu_n}(y) \le F_\mu(y).$$
Similarly, with $h_\varepsilon : \mathbb R \to [0,1]$ given by
$$h_\varepsilon(x) = \begin{cases} 1, & x \le y - \varepsilon,\\ 0, & x \ge y,\\ \text{linear, continuous}, & x \in [y-\varepsilon, y], \end{cases}$$
using the fact that $y$ is a continuity point of $F_\mu$, we obtain
$$F_\mu(y) \le \liminf_{n\to\infty} F_{\mu_n}(y),$$
and the proof is complete.
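The coupling in the proof of (i) $\Rightarrow$ (ii) is constructive and can be tried out directly. In the sketch below (our illustration; the distributions and all names are our choices) the generalised inverses are ordinary quantile functions, since normal distribution functions are continuous and strictly increasing, and the variables $Y_n$, $Y$ on $((0,1), \text{Lebesgue})$ converge pointwise.

```python
import random
from statistics import NormalDist

# mu_n = N(0, (1 + 1/n)^2) converges weakly to mu = N(0, 1), and
# Y_n(w) = sup{y : F_{mu_n}(y) < w} is here just the quantile function.
def Y_n(n, w):
    return NormalDist(0.0, 1.0 + 1.0 / n).inv_cdf(w)

def Y(w):
    return NormalDist(0.0, 1.0).inv_cdf(w)

rng = random.Random(1)
ws = [rng.random() for _ in range(300)]          # 'omega' drawn from (0,1)
max_gap = max(abs(Y_n(10_000, w) - Y(w)) for w in ws)  # small for large n
```

Here $Y_n(\omega) - Y(\omega) = \frac{1}{n}\Phi^{-1}(\omega)$, so the almost sure convergence is visible at rate $1/n$.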

6.2 Weak convergence on metric spaces

Definition 6.1 and Proposition 6.8 deal with the convergence of distributions on the real line only, but the equivalent statement (iii) of the proposition allows one to define the weak convergence of distributions on an arbitrary metric¹ space. Throughout the remaining part of this chapter we let $(S, d)$ stand for a complete metric space and $\mathcal S$ for the associated Borel $\sigma$-algebra, $\mathcal S = \mathcal B(S)$. We use $C_b(S)$ to denote the space of bounded continuous functions on $S$.

Definition 6.10. Let $\mu$ and $\mu_n$, $n \ge 1$, be probability distributions on $(S, \mathcal S)$. We say that $\mu_n$ converge weakly to $\mu$ if
$$\lim_{n\to\infty} \int_S f\,d\mu_n = \int_S f\,d\mu \quad\text{for every } f \in C_b(S).$$

Let $X$ and $X_n$, $n \ge 1$, be $S$-valued random variables with respective distributions $\mu$ and $\mu_n$. We say that $X_n$ converge in distribution to $X$ when $\mu_n \xrightarrow{w} \mu$.

Remark 6.11. There are other natural definitions of convergence of probability distributions on $(S, \mathcal S)$. One could, for example, require that

$$\lim_{n\to\infty} \mu_n(A) = \mu(A) \quad\text{for all } A \in \mathcal S,$$
or, even more restrictively, require that this convergence is uniform,
$$\|\mu_n - \mu\|_{TV} := \sup_{A\in\mathcal S} |\mu_n(A) - \mu(A)| \xrightarrow{n\to\infty} 0.$$
These modes of convergence are however too strong in practical situations (especially when $S$ is not finite or countable). To see this, consider a sequence $x_n$ of points in $S$ such that $\lim x_n = x$ but $x_n \ne x$ for all $n \ge 1$. Then the measures $\mu_n = \delta_{x_n}$ do not converge to $\delta_x$ in either of the previous two modes of convergence, but they converge weakly, as can be checked easily.

A similar problem occurs when considering the De Moivre-Laplace theorem, Example 6.4: the distribution of $n^{-1/2}S_n$ is supported on $n^{-1/2}\mathbb Z$, but the probability of this set is zero for the standard normal distribution.

The following theorem provides several useful conditions that are equivalent to the weak convergence.

Theorem 6.12 (Portmanteau). The following are equivalent:

(i) $\mu_n \xrightarrow{w} \mu$.

(ii) $\lim_{n\to\infty} \int_S f\,d\mu_n = \int_S f\,d\mu$ for every $f \in C_b(S)$ which is uniformly continuous.

(iii) $\limsup_{n\to\infty} \mu_n(F) \le \mu(F)$ for all closed sets $F \subset S$.

¹In fact, a topological space would be sufficient here.

(iv) $\liminf_{n\to\infty} \mu_n(G) \ge \mu(G)$ for all open sets $G \subset S$.

(v) $\lim_{n\to\infty} \mu_n(A) = \mu(A)$ for all $\mu$-continuity sets $A \in \mathcal S$, that is for all $A$ with $\mu(\bar A \setminus A^\circ) = \mu(\partial A) = 0$.

Proof. (i) $\Rightarrow$ (ii). Obvious.

(ii) $\Rightarrow$ (iii). Let $F$ be a closed set and $\varepsilon > 0$. Define $h(s) = \varphi(d(s, F))$, where $\varphi(x) := \max(1 - \varepsilon^{-1}x, 0)$. Then $h$ is uniformly continuous, bounded, $h \equiv 1$ on $F$ and $h \equiv 0$ on $U_\varepsilon(F)^c$, where $U_\varepsilon(F) = \{s \in S : d(s, F) < \varepsilon\}$ is the $\varepsilon$-neighbourhood of $F$. Hence $1_F \le h \le 1_{U_\varepsilon(F)}$. By (ii),
$$\limsup_{n\to\infty} \mu_n(F) \le \limsup_{n\to\infty} \int h\,d\mu_n = \int h\,d\mu \le \mu(U_\varepsilon(F)).$$
As $F$ is closed, $F = \bigcap_{k\ge1} U_{1/k}(F)$ and thus, using the regularity of $\mu$, $\mu(U_{1/k}(F)) \xrightarrow{k\to\infty} \mu(F)$. Inserting this into the last display, the claim (iii) follows.

(iii) $\Leftrightarrow$ (iv) follows by taking complements.

(iii) and (iv) $\Rightarrow$ (v). For a $\mu$-continuity set $A$, $\mu(A^\circ) = \mu(A) = \mu(\bar A)$, and thus

$$\mu(A) = \mu(A^\circ) \overset{\text{(iv)}}{\le} \liminf_{n\to\infty} \mu_n(A^\circ) \le \liminf_{n\to\infty} \mu_n(A) \le \limsup_{n\to\infty} \mu_n(A) \le \limsup_{n\to\infty} \mu_n(\bar A) \overset{\text{(iii)}}{\le} \mu(\bar A) = \mu(A).$$

(v) $\Rightarrow$ (i). Fix $f \in C_b(S)$ and decompose the range of $f$ so that
$$\Big[\inf_{s\in S} f(s), \sup_{s\in S} f(s)\Big] \subset \bigcup_{j=1}^{N} [c_j, c_{j+1})$$
with $\mu(\{f = c_j\}) = 0$ and $0 < c_{j+1} - c_j \le \varepsilon$ for all $1 \le j \le N$. This is possible since $f$ is bounded and the distribution function of $f$, $t \mapsto \mu(f \le t)$, has at most countably many discontinuities. The sets $A_j = \{c_j \le f < c_{j+1}\}$ are thus $\mu$-continuity sets. Defining $g = \sum_{j=1}^{N} c_j 1_{A_j}$, it follows from (v) that $\int g\,d\mu_n \xrightarrow{n\to\infty} \int g\,d\mu$. Moreover, by construction, $\sup_{s\in S}|g(s) - f(s)| \le \varepsilon$ and thus
$$\limsup_{n\to\infty} \Big|\int f\,d\mu_n - \int f\,d\mu\Big| \le 2\varepsilon.$$
Since $\varepsilon$ and $f$ are arbitrary, this implies $\mu_n \xrightarrow{w} \mu$.

Sometimes it is useful to prove the weak convergence by checking $\mu_n(A) \to \mu(A)$ for a particular class of sets $A$. (Beware, however, that checking this for all $A$ in some $\pi$-system generating $\mathcal S$ does not imply $\mu_n \xrightarrow{w} \mu$, cf. Lemma 3.10. We will see examples later.)

Proposition 6.13. Let $\mathcal S_0 \subset \mathcal S$ be a set system that is closed under finite intersections and such that every open set $G \subset S$ can be written as a countable union of sets in $\mathcal S_0$. Then $\lim_{n\to\infty} \mu_n(A) = \mu(A)$ for all $A \in \mathcal S_0$ implies $\mu_n \xrightarrow{w} \mu$.

[[TODO: check this proposition, it feels strange]]

Proof. For finite unions, by the inclusion-exclusion principle,

$$\mu_n\Big(\bigcup_{i=1}^{N} A_i\Big) = \sum_{k=1}^{N} (-1)^{k+1} \sum_{1\le i_1 < \cdots < i_k \le N} \mu_n\Big(\bigcap_{j=1}^{k} A_{i_j}\Big).$$

Hence, when $A_i \in \mathcal S_0$, as $\mathcal S_0$ is closed under intersections,

$$\mu_n\Big(\bigcup_{i=1}^{N} A_i\Big) \xrightarrow{n\to\infty} \mu\Big(\bigcup_{i=1}^{N} A_i\Big).$$
If $G$ is open, then $G = \bigcup_{i\ge1} A_i$ for some $A_i \in \mathcal S_0$. Thus, for $\varepsilon > 0$ and $N = N(\varepsilon)$ large,
$$\mu(G) - \varepsilon \le \mu\Big(\bigcup_{i=1}^{N} A_i\Big) = \lim_{n\to\infty} \mu_n\Big(\bigcup_{i=1}^{N} A_i\Big) \le \liminf_{n\to\infty} \mu_n(G).$$

As $\varepsilon$ is arbitrary, the weak convergence $\mu_n \xrightarrow{w} \mu$ follows from (iv) of Theorem 6.12.

Remark 6.14. When $S$ is separable, we can always construct such an $\mathcal S_0$; it can even be taken countable. Indeed, let $(s_n)_{n\ge1}$ be a sequence which is dense in $S$, and iteratively construct partitions $\mathcal Z_m = \{A_{m,k} : k \in \mathbb N\}$ of $S$ such that $\operatorname{diam}(A_{m,k}) \le \frac1m$ for all $k$ and $\mathcal Z_{m+1}$ refines $\mathcal Z_m$. (To do this, start with the sets $A_{m,k} \cap U_{1/(2(m+1))}(s_n)$ and make them pairwise disjoint.) Finally, take $\mathcal S_0 = \bigcup_{m\ge1} \mathcal Z_m$, which is closed under intersections by the construction. Moreover, $\mathcal S_0$ is countable and every open $G$ can be written as a union of elements of $\mathcal S_0$ (Exercise!).

Exercise 6.15. Use the previous remark to show the following statement: Let $(S, d)$ be a separable metric space. Then every probability measure $\mu$ on $(S, \mathcal S)$ can be approximated by discrete probability measures, that is, there are $\mu_n = \sum_{i\ge1} c_i^n \delta_{x_i^n}$ with $c_i^n \in [0,1]$, $x_i^n \in S$, such that $\mu_n \xrightarrow{w} \mu$.

Exercise 6.16. Let $(X_n)_{n\ge1}$ be an i.i.d. $\mu$-distributed sequence of $S$-valued random variables. Consider the empirical distribution of the first $n$ variables,
$$\mu_n(\omega) = \frac1n \sum_{k=1}^{n} \delta_{X_k(\omega)}, \quad \omega \in \Omega.$$
(Observe that $\mu_n$ is a random probability measure.) Show that
$$\mu_n(\omega) \xrightarrow{w} \mu \quad\text{for } P\text{-a.e. } \omega.$$
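For $S = \mathbb R$ the statement of Exercise 6.16 can be watched numerically: by Proposition 6.8 it suffices that the empirical distribution functions converge at continuity points. A minimal sketch of our own, with $\mu$ the uniform distribution on $[0,1]$, computes the sup-distance between the empirical and the true distribution function.

```python
import random

rng = random.Random(2)

def ecdf_sup_dist(n):
    """sup_t |F_{mu_n}(t) - t| for n i.i.d. uniform(0,1) samples.
    For the uniform law the supremum is attained at the order statistics."""
    xs = sorted(rng.random() for _ in range(n))
    return max(
        max(abs((i + 1) / n - x), abs(i / n - x))
        for i, x in enumerate(xs)
    )

d_small = ecdf_sup_dist(100)
d_large = ecdf_sup_dist(100_000)  # shrinks roughly like n^(-1/2)
```

The $n^{-1/2}$ rate visible here is the content of the Kolmogorov-Smirnov/DKW quantification of this exercise, which is not needed for the weak convergence itself.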

6.3 Tightness on R

We characterise sequentially compact sets under weak convergence. The first result concerns probability measures on $\mathbb R$ only.

Theorem 6.17 (Helly's selection theorem). Let $(F_n)_{n\ge1}$ be a sequence of distribution functions of probability measures on $\mathbb R$. Then there is a subsequence $F_{n(k)}$ and a right-continuous increasing function $F : \mathbb R \to [0,1]$ such that
$$F(y) = \lim_{k\to\infty} F_{n(k)}(y) \quad\text{for every point of continuity } y \text{ of } F.$$
Remark 6.18. $F$ might not be a distribution function in general, cf. Example 6.7. More precisely, in general it only satisfies $0 \le \lim_{x\to-\infty} F(x) \le \lim_{x\to\infty} F(x) \le 1$. Such an $F$ corresponds naturally to a sub-probability measure, i.e. a measure $\mu$ on $\mathbb R$ with $\mu(\mathbb R) \le 1$. For (sub-)probabilities there is another notion of convergence:

Definition 6.19. A sequence of sub-probability measures $\nu_n$ converges vaguely to a sub-probability measure $\nu$ if
$$\lim_{n\to\infty} \int f\,d\nu_n = \int f\,d\nu \quad\text{for every continuous function } f \text{ with compact support.}$$
It is not hard to see, by the same techniques as in Proposition 6.8, that Helly's theorem implies that the space of sub-probabilities on $\mathbb R$ is vaguely sequentially compact.

Proof. Let $Q = \{q_1, q_2, \dots\}$ be an enumeration of the rationals. Since $F_n(q_1) \in [0,1]$ for all $n$, there is a sequence $m_1(i)$, $i \ge 1$, such that $F_{m_1(i)}(q_1) \xrightarrow{i\to\infty} G(q_1)$. Similarly, $F_{m_1(i)}(q_2) \in [0,1]$ for all $i \ge 1$, so there is $m_2(i) = m_1(n_2(i))$ such that $F_{m_2(i)}(q_2) \xrightarrow{i\to\infty} G(q_2)$. Inductively, we obtain $m_k(i)$, $k \ge 1$, $i \ge 1$, such that $m_{k+1}(i) = m_k(n_{k+1}(i))$ with $n_{k+1}$ increasing (i.e. $m_{k+1}$ is a subsequence of $m_k$), and for all $k \ge 1$
$$\lim_{i\to\infty} F_{m_k(i)}(q_\ell) = G(q_\ell), \quad 1 \le \ell \le k.$$
Using the 'diagonal sequence' $F_{n(k)} := F_{m_k(k)}$ we obtain
$$\lim_{k\to\infty} F_{n(k)}(q_\ell) = G(q_\ell) \quad\text{for all } \ell \ge 1,$$
that is, $F_{n(k)}(t)$ converges for all $t \in Q$. The function $G : Q \to [0,1]$ is non-decreasing, hence $F : \mathbb R \to [0,1]$ defined by $F(x) = \inf\{G(q) : x < q \in Q\}$ is non-decreasing and right-continuous.

We show that $F_{n(k)}$ and $F$ satisfy the claim of the theorem. Let $y$ be a point of continuity of $F$, $\varepsilon > 0$, and fix $r_1, r_2, s \in Q$ so that $r_1 < r_2 < y < s$ and
$$F(y) - \varepsilon < F(r_1) \le F(r_2) \le F(y) \le F(s) < F(y) + \varepsilon.$$
Then $F_{n(k)}(r_2) \to G(r_2) \ge F(r_1)$ and $F_{n(k)}(s) \to G(s) \le F(s)$. Hence, for every $k$ large enough,
$$F(y) - \varepsilon \le F_{n(k)}(r_2) \le F_{n(k)}(y) \le F_{n(k)}(s) \le F(y) + \varepsilon,$$
and thus $F_{n(k)}(y) \to F(y)$, as required.

Our goal is to obtain conditions ensuring compactness for the weak, and not merely vague, convergence. To this end we must deal with the issue of mass escaping to infinity, cf. Remark 6.18.

Definition 6.20. A family $\mu_i$, $i \in I$, of probability measures on $(\mathbb R, \mathcal B)$ is called tight when for every $\varepsilon > 0$ there is $M = M(\varepsilon) < \infty$ such that
$$\sup_{i\in I} \mu_i([-M, M]^c) \le \varepsilon.$$

Theorem 6.21. Let (Fn)n≥1 be a sequence of distribution functions and µn, n ≥ 1, the associated probability measures. If (µn)n≥1 is tight, then every subsequential limit F of Fn is a distribution function.

Proof. Let $F$ be increasing, right-continuous, with $F(y) = \lim_{k\to\infty} F_{n(k)}(y)$ for every continuity point $y$ of $F$. Fix $\varepsilon > 0$ and $M < \infty$ so that $\mu_n([-M, M]^c) \le \varepsilon$ for all $n \ge 1$. Let $y_1 > M$ and $y_2 < -M$ be continuity points of $F$. Then
$$F(y_2) = \lim_k F_{n(k)}(y_2) \le \limsup_k \mu_{n(k)}((-\infty, -M)) \le \varepsilon,$$
$$F(y_1) = \lim_k F_{n(k)}(y_1) \ge \liminf_k \mu_{n(k)}((-\infty, M]) \ge 1 - \varepsilon.$$

Hence limy→∞ F (y) = 1 and limy→−∞ F (y) = 0, that is F is a distribution function.

Corollary 6.22. Every tight sequence of probability measures on (R, B(R)) has a weakly convergent subsequence.

6.4 Prokhorov’s theorem*

Surprisingly enough, almost all claims of the previous section generalise to arbitrary metric spaces. [[TODO: check]] Let $(S, d)$ be a metric space with Borel $\sigma$-field $\mathcal S$, and let $M_1(S)$ be the family of all probability measures on $(S, \mathcal S)$. We study the properties of $M_1(S)$ topologised by the weak convergence.

Remark 6.23. The topology corresponding to the weak convergence has the neighbourhood basis
$$U_{\varepsilon, h_1, \dots, h_n}(\mu) = \Big\{\nu \in M_1(S) : \Big|\int h_i\,d\nu - \int h_i\,d\mu\Big| < \varepsilon,\ i = 1, \dots, n\Big\}$$
with $n \in \mathbb N$, $\varepsilon > 0$ and $h_i \in C_b(S)$.

Definition 6.24. A family M ⊂ M1(S) is called relatively (weakly sequentially) com- pact if every sequence of elements of M contains a weakly converging subsequence.

The next definition generalises Definition 6.20.

Definition 6.25. A family $M \subset M_1(S)$ is called tight when for every $\varepsilon > 0$ there exists a compact set $K = K(\varepsilon) \subset S$ such that
$$\mu(K) \ge 1 - \varepsilon \quad\text{for all } \mu \in M.$$

As on R, the tightness and relative compactness are closely related:

Theorem 6.26 (Prokhorov). Let $M \subset M_1(S)$.

(a) If $M$ is tight, then it is relatively compact.

(b) If $S$ is complete and separable and $M$ is relatively compact, then $M$ is tight.

Remark 6.27 (Utility of the characterisation of relative compactness). Let $\mu_n$ be a sequence of probability measures on $(S, \mathcal S)$ and assume that for every set $A$ in some $\pi$-system $\mathcal C$ generating $\mathcal S$ we have $\mu_n(A) \to \mu(A)$. As we know, this is not sufficient to deduce the weak convergence of $\mu_n$. However, assume in addition that $(\mu_n)$ is relatively compact. Then it contains a subsequence $\mu_{n(k)}$ converging weakly to some $\nu \in M_1(S)$. Moreover, $\nu(A)$ must equal $\mu(A)$ for every $A \in \mathcal C$, which implies $\mu = \nu$, by Lemma 3.10. Therefore, all weak subsequential limits of $\mu_n$ agree, which, by standard arguments, implies that $\mu_n$ converges weakly to $\mu$.

Proof of Theorem 6.26. (b): Part (b) of the theorem is less useful, but its proof is simpler, so we start with it. Let $G_n$ be an arbitrary sequence of open sets such that $G_n \nearrow S$. We claim:

(6.28) For every $\varepsilon > 0$ there is $n$ such that $\mu(G_n) > 1 - \varepsilon$ for all $\mu \in M$.

Indeed, if this is not the case, then for every $n$ there is $\mu_n \in M$ with $\mu_n(G_n) \le 1 - \varepsilon$. Due to relative compactness there is a subsequence $\mu_{n(k)}$ converging weakly to some $\mu \in M_1(S)$. But then, as $G_n \nearrow S$,

$$\mu(S) = \lim_{n\to\infty} \mu(G_n),$$
and, using the Portmanteau theorem, for every fixed $n$,

$$\mu(G_n) \le \liminf_{k\to\infty} \mu_{n(k)}(G_n) \le \liminf_{k\to\infty} \mu_{n(k)}(G_{n(k)}) \le 1 - \varepsilon,$$
so $\mu(S) \le 1 - \varepsilon < 1$, which is a contradiction.

As $S$ is separable, there is for every $n \in \mathbb N$ a sequence $(A_{nk})_{k\ge1}$ of open balls of radius $n^{-1}$ covering $S$. Taking now $G_m = \bigcup_{i\le m} A_{ni}$, we see from (6.28) that there is $m_n$ such that
$$\mu\Big(\bigcup_{i\le m_n} A_{ni}\Big) > 1 - \varepsilon 2^{-n} \quad\text{for all } \mu \in M.$$

The set
$$A := \bigcap_{n\ge1} \bigcup_{i\le m_n} A_{ni}$$
is totally bounded (that is, coverable by finitely many balls of radius $\delta$ for every $\delta > 0$) by construction. Therefore $K = \bar A$ is compact. Moreover, by construction,
$$\mu(K^c) \le \mu(A^c) \le \sum_{n\ge1} \mu\Big(\Big(\bigcup_{i\le m_n} A_{ni}\Big)^c\Big) \le \sum_{n\ge1} \varepsilon 2^{-n} = \varepsilon$$
for all $\mu \in M$, that is, $M$ is tight.

(a): To prove (a) we need some preparation.

Proposition 6.29. If $S$ is compact, then $M_1(S)$ is sequentially compact. (It is in fact even compact, but we will not need this.)

Proof. If $S$ is compact, then $C(S) = C_b(S)$ is separable in the $\|\cdot\|_\infty$-topology. We can thus take a dense set $\{h_n : n \in \mathbb N\}$ in $C(S)$ and set
$$I = \prod_{n\in\mathbb N} [-\|h_n\|_\infty, \|h_n\|_\infty],$$
which is compact in the product topology by Tychonov's theorem, and since it is metrisable (take $\rho(u, v) = \sum_k |u_k - v_k|/(2^k \|h_k\|_\infty)$), it is also sequentially compact. Define $T : M_1(S) \to I$ by
$$T(\mu) = \Big(\int h_n\,d\mu\Big)_{n\in\mathbb N}.$$

This is a homeomorphism of $M_1(S)$ onto $T(M_1(S))$. Indeed, $T$ is surjective onto its image, and its injectivity follows from the fact that $T(\mu) = T(\nu)$ implies $\int h_n\,d\mu = \int h_n\,d\nu$ for all $n$, and since the $h_n$ are dense, also $\int h\,d\mu = \int h\,d\nu$ for all $h \in C(S)$, that is $\mu = \nu$. $T$ is continuous, as $\mu_n \xrightarrow{w} \mu$ implies $\int h_k\,d\mu_n \to \int h_k\,d\mu$ for all $k$ and thus $T(\mu_n) \to T(\mu)$. Finally, $T^{-1}$ is continuous, as $T(\mu_n) \to T(\mu)$ implies $\int h_k\,d\mu_n \to \int h_k\,d\mu$ for every $k$, and thus, as above, $\int h\,d\mu_n \to \int h\,d\mu$ for every $h \in C(S)$, that is $\mu_n \xrightarrow{w} \mu$.

Since $T$ is a bijection onto its image, and since $M_1(S)$ can be identified with the family $L_+(S)$ of all non-negative normed linear forms on $C(S)$ by Riesz' representation theorem (see measure theory), we can identify $T(M_1(S))$ with $L_+(S)$. The family $L_+(S)$ is closed under pointwise convergence, and thus $T(M_1(S))$ is closed in $I$ and hence sequentially compact. Finally, since $T(M_1(S))$ is sequentially compact and $T^{-1}$ is continuous, $M_1(S)$ is sequentially compact as well.

We now return to part (a) of Theorem 6.26. Let $M$ be tight. Then for every $n \in \mathbb N$ there is a compact $K_n \subset S$ such that $\mu(K_n) \ge 1 - n^{-1}$ for all $\mu \in M$. Set $S_0 = \bigcup_{n\ge1} K_n$. Then $\mu(S_0) = 1$ for every $\mu \in M$. Since the $K_n$ are compact, $S_0$ is a separable metric space (as a subspace of $S$). By the Urysohn embedding theorem, $S_0$ is homeomorphic to a measurable subset of a compact metric space, i.e. it can be viewed as a subset $S_0 \subset \bar S$ where $\bar S$ is a compact metric space.

We now consider $M \subset M_1(S_0)$ as a subset of $M_1(\bar S)$, which is by Proposition 6.29 sequentially compact. Every sequence in $M$ thus has an $M_1(\bar S)$-weakly convergent subsequence $\mu_{n(k)}$ with a limit $\bar\mu \in M_1(\bar S)$.

To finish the proof, we need to show that $\bar\mu(S_0) = 1$, since then there is $\mu \in M_1(S)$ agreeing with $\bar\mu$ on $S_0$ such that $\mu(S \setminus S_0) = 0$, and thus $\mu_{n(k)} \xrightarrow{w} \mu$ in $M_1(S)$. Indeed, by the Portmanteau theorem,
$$\bar\mu(S_0) \ge \bar\mu(K_N) \ge \limsup_{k\to\infty} \mu_{n(k)}(K_N) \ge 1 - N^{-1}, \tag{6.30}$$
and the claim follows by letting $N \to \infty$.

The topology of the weak convergence is even metrisable:

Theorem 6.31 (Prokhorov's metric). Let $S$ be a separable metric space and for $\mu, \nu \in M_1(S)$ define
$$\rho(\mu, \nu) = \inf\{\varepsilon > 0 : \nu(A) \le \varepsilon + \mu(U_\varepsilon(A)) \text{ and } \mu(A) \le \varepsilon + \nu(U_\varepsilon(A)) \text{ for all } A \in \mathcal S\}.$$
Then $\rho$ is a metric on $M_1(S)$ which is compatible with the weak convergence, and $(M_1(S), \rho)$ is a separable metric space. Moreover, if $S$ is complete, then $(M_1(S), \rho)$ is complete as well.

Proof. See e.g. Ethier-Kurtz (1986), Sections 3.1-3.3. [[TODO: possibly write the proof]]

7 Central limit theorem

This chapter mostly recalls, for the sake of completeness, results known from the elementary probability lecture.

7.1 Characteristic functions

Definition 7.1. Let $\mu$ be a probability measure on $\mathbb R$. The characteristic function of $\mu$ is the function $\hat\mu : \mathbb R \to \mathbb C$ given by
$$\hat\mu(t) = \int_{\mathbb R} e^{itx}\,\mu(dx), \quad t \in \mathbb R.$$
The characteristic function $\varphi_X$ of a random variable $X$ is the characteristic function of its distribution $\mu_X$, that is
$$\varphi_X(t) = E\big[e^{itX}\big] = \hat\mu_X(t).$$
Remark 7.2. In the case when $\mu$ possesses a density $f$, the characteristic function is nothing other than the Fourier transform of $f$, $\hat\mu(t) = \int f(x)e^{itx}\,dx$.

Lemma 7.3 (elementary properties). Let $\mu$ be a probability measure on $\mathbb R$. Then:

(i) $\hat\mu(0) = 1$;

(ii) $|\hat\mu(t)| \le 1$ for all $t \in \mathbb R$;

(iii) $\hat\mu(-t) = \overline{\hat\mu(t)}$ for all $t \in \mathbb R$;

(iv) $\hat\mu$ is uniformly continuous;

(v) when $X, Y$ are independent random variables, then $\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)$.

Proof. Left as an exercise!

Lemma 7.4 (Characteristic function and moments). Assume that $\mu$ has a finite $k$-th absolute moment, that is $\int |x|^k\,d\mu < \infty$, $k \ge 1$. Then $\hat\mu$ is $k$-times differentiable and
$$\frac{d^l}{dt^l}\hat\mu(t) = \int (ix)^l e^{itx}\,\mu(dx), \quad 0 \le l \le k.$$
In particular, if $X$ is $\mu$-distributed, for $l \le k$,
$$\frac{d^l}{dt^l}\hat\mu(0) = i^l E(X^l).$$

Proof. Exercise!

Exercise 7.5. Show that the characteristic function of the normal distribution with mean $m$ and variance $\sigma^2$ is given by
$$\varphi_{m,\sigma^2}(t) = \exp\Big\{imt - \frac{t^2\sigma^2}{2}\Big\}.$$
As for the Fourier transform, there is an inversion formula.
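Before turning to the inversion formula, the claim of Exercise 7.5 can be sanity-checked by simulation. The sketch below is our own illustration (parameters chosen arbitrarily): it compares a Monte Carlo average of $e^{itX}$ with the closed form.

```python
import cmath
import random

rng = random.Random(3)
m, sigma, t = 1.0, 2.0, 0.7

# Monte Carlo estimate of E[e^{itX}] for X ~ N(m, sigma^2)
n = 100_000
emp = sum(cmath.exp(1j * t * rng.gauss(m, sigma)) for _ in range(n)) / n
exact = cmath.exp(1j * m * t - t * t * sigma * sigma / 2)
```

The Monte Carlo error is of order $n^{-1/2}$, since $|e^{itX}| = 1$ gives bounded summands.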

Proposition 7.6. For every $a < b$,
$$\lim_{T\to\infty} (2\pi)^{-1} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\,\hat\mu(t)\,dt = \mu((a, b)) + \frac12\,\mu(\{a, b\}). \tag{7.7}$$
Proof. The left-hand side equals

$$(2\pi)^{-1} \int_{-T}^{T} \int \frac{e^{-ita} - e^{-itb}}{it}\, e^{itx}\,\mu(dx)\,dt.$$
It is not hard to see that the absolute value of the integrand is bounded by $b - a$. Since $\mu$ is a probability measure and $[-T, T]$ a bounded interval, we can thus apply Fubini's theorem to rewrite this as
$$(2\pi)^{-1} \int \Big\{\int_{-T}^{T} \frac{\sin(t(x-a))}{t}\,dt - \int_{-T}^{T} \frac{\sin(t(x-b))}{t}\,dt\Big\}\,\mu(dx).$$
Defining $J(\alpha, T) = \int_{-T}^{T} \sin(\alpha t)/t\,dt$, this equals
$$(2\pi)^{-1} \int \big(J(x - a, T) - J(x - b, T)\big)\,\mu(dx).$$

A little bit of analysis shows that
$$\lim_{T\to\infty} J(\alpha, T) = \pi \operatorname{sign} \alpha,$$
where $\operatorname{sign}\alpha$ is $1$ if $\alpha > 0$, $0$ if $\alpha = 0$, and $-1$ if $\alpha < 0$. Hence,
$$J(x - a, T) - J(x - b, T) \xrightarrow{T\to\infty} \begin{cases} 2\pi, & a < x < b,\\ \pi, & x \in \{a, b\},\\ 0, & \text{otherwise.} \end{cases}$$
Moreover, $\sup_{\alpha, T} |J(\alpha, T)| < \infty$, and the result follows by the dominated convergence theorem.
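The inversion formula (7.7) lends itself to a direct numerical check. In the sketch below (our illustration) $\mu$ is the standard normal, so $\hat\mu(t) = e^{-t^2/2}$ by Exercise 7.5; the integral is approximated with a midpoint rule whose grid is chosen so that $t = 0$ is never hit.

```python
import cmath
import math

def mu_hat(t):
    # characteristic function of the standard normal distribution
    return cmath.exp(-t * t / 2)

def inversion(a, b, T=40.0, steps=16_000):
    """Midpoint-rule approximation of the integral in (7.7)."""
    h = 2 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * h  # midpoints avoid the removable singularity at t = 0
        integrand = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += (integrand * mu_hat(t)).real
    return total * h / (2 * math.pi)

val = inversion(-1.0, 1.0)
target = math.erf(1 / math.sqrt(2))  # mu((-1, 1)) for the standard normal
```

Since the standard normal has no atoms, the right-hand side of (7.7) reduces to $\mu((-1,1)) \approx 0.6827$, and the truncation at $T = 40$ is harmless because of the Gaussian decay of $\hat\mu$.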

As a corollary we immediately see that the characteristic function determines the probability measure uniquely.

Theorem 7.8 (Uniqueness). Let $\mu, \nu$ be two probability measures on $(\mathbb R, \mathcal B(\mathbb R))$. Then $\hat\mu = \hat\nu$ implies $\mu = \nu$.

Exercise 7.9. Use Theorem 7.8 to show that the sum $X + Y$ of two independent Poisson random variables $X$, $Y$ with respective parameters $\lambda_X$, $\lambda_Y$ has Poisson distribution with parameter $\lambda_X + \lambda_Y$.

The following lemma will be useful for proving tightness.

Lemma 7.10. For every $u > 0$,
$$\mu\Big(\Big\{x : |x| \ge \frac{2}{u}\Big\}\Big) \le \frac{1}{u}\int_{-u}^{u} (1 - \hat\mu(t))\,dt.$$
Remark 7.11. This is a first statement showing that the tail behaviour of $\mu$ is determined by the behaviour of $\hat\mu$ near $0$.

Proof. For $x \ne 0$,
$$\int_{-u}^{u} (1 - e^{itx})\,dt = 2u - \frac{e^{iux} - e^{-iux}}{ix} = 2u\Big(1 - \frac{\sin ux}{ux}\Big) \ge 0.$$
Therefore,
$$\frac{1}{u}\int_{-u}^{u} (1 - \hat\mu(t))\,dt \overset{\text{Fubini}}{=} \int_{\mathbb R} \frac{1}{u}\int_{-u}^{u} (1 - e^{itx})\,dt\,\mu(dx) = 2\int_{\mathbb R} \Big(1 - \frac{\sin ux}{ux}\Big)\,\mu(dx) \ge 2\int_{|x|\ge 2/u} \Big(1 - \frac{1}{|ux|}\Big)\,\mu(dx) \ge \mu\Big(\Big\{x : |x| \ge \frac{2}{u}\Big\}\Big),$$
where we used that the integrand $1 - \frac{\sin ux}{ux}$ is non-negative everywhere and that $1 - \frac{1}{|ux|} \ge \frac12$ on $\{|x| \ge 2/u\}$. This completes the proof.

Theorem 7.12 (Lévy's continuity theorem). Let $\mu_n$, $n \ge 1$, be a sequence of probability measures on $\mathbb R$.

(i) If $\mu_n \xrightarrow{w} \mu$, then $\hat\mu_n(t) \to \hat\mu(t)$ for all $t \in \mathbb R$.

(ii) If $\hat\mu_n(t) \to \varphi(t)$ for every $t \in \mathbb R$ and the limit $\varphi(t)$ is continuous at $0$, then there is a probability measure $\mu$ such that $\hat\mu = \varphi$ and $\mu_n \xrightarrow{w} \mu$.

Remark 7.13. The condition '$\varphi$ is continuous at $0$' is essential. To see this, let $\mu_n$ be the centred normal distribution with variance $n^2$. Using Exercise 7.5, we obtain $\hat\mu_n(t) = \exp\{-t^2 n^2/2\} \xrightarrow{n\to\infty} 1_{\{0\}}(t)$. It is evident that the $\mu_n$ do not converge weakly to any distribution, but this does not contradict the theorem, as the limit of their characteristic functions is not continuous. The discontinuity at $0$ is related to mass escaping to infinity, cf. Example 6.7.

Proof. (i) $x \mapsto e^{itx}$ is bounded and continuous, so (i) follows from Proposition 6.8.

(ii) We first claim that the sequence $\mu_n$ is tight. Indeed, let $\varepsilon > 0$ and choose $u > 0$ such that
$$\frac{\varepsilon}{2} \ge \frac{1}{u}\int_{-u}^{u} (1 - \varphi(t))\,dt \overset{\text{DCT}}{=} \lim_{n\to\infty} \frac{1}{u}\int_{-u}^{u} (1 - \hat\mu_n(t))\,dt.$$
Therefore, for $n \ge n_0$, using Lemma 7.10,
$$\varepsilon > \frac{1}{u}\int_{-u}^{u} (1 - \hat\mu_n(t))\,dt \ge \mu_n\Big(\Big[-\frac{2}{u}, \frac{2}{u}\Big]^c\Big),$$
proving the tightness.

As $\mu_n$ is tight, it has a convergent subsequence $\mu_{n(k)} \xrightarrow{w} \nu$. Due to part (i), $\hat\nu = \varphi$; hence the limit does not depend on the chosen subsequence, and we can define $\mu := \nu$. Then, using standard arguments,¹ it follows that $\mu_n \xrightarrow{w} \mu$.

7.2 Central limit theorem

We now have all the tools to extend the De Moivre-Laplace theorem (Example 6.4) to a broad family of probability distributions.

Theorem 7.14 (Central limit theorem). Let $(X_i)_{i\ge1}$ be an i.i.d. sequence of random variables with $E(X_i^2) < \infty$. Set $m = EX_i$, $\sigma^2 = \operatorname{Var} X_i > 0$ and
$$Z_n := \frac{S_n - mn}{\sigma\sqrt n}.$$
Then, as $n \to \infty$, $Z_n$ converges in distribution to a standard normal random variable.

Proof. (We repeat the arguments from the introductory lecture.) Let $\tilde X_i = X_i - m$ and $\tilde S_n = \tilde X_1 + \cdots + \tilde X_n$. Then, using Lemma 7.3(v),

$$\varphi_{Z_n}(t) = E\Big[\exp\Big\{\frac{it}{\sigma\sqrt n}\tilde S_n\Big\}\Big] = \varphi_{\tilde X_1}\Big(\frac{t}{\sigma\sqrt n}\Big)^n.$$
Due to Lemma 7.4, $\varphi_{\tilde X_1}$ is twice differentiable and
$$\varphi_{\tilde X_1}'(0) = iE\tilde X_i = 0, \qquad \varphi_{\tilde X_1}''(0) = -E[\tilde X_i^2] = -\sigma^2.$$

Using the Taylor expansion
$$\varphi_{\tilde X_1}(u) = \varphi_{\tilde X_1}(0) - \frac{u^2}{2}\big(\sigma^2 + \varepsilon(u)\big)$$

¹Otherwise, one would find a continuity point $y$ of $F(\cdot) = \mu((-\infty, \cdot])$ and a subsequence $n(k)$ so that $|F_{n(k)}(y) - F(y)| \ge \varepsilon$ for all $k \ge 1$, contradicting the existence of a further subsequence $n(k_\ell)$ that converges to $\mu$.

with $\varepsilon(u) \to 0$ as $u \to 0$, we obtain
$$\varphi_{Z_n}(t) = \Big(1 - \frac{t^2}{2\sigma^2 n}\Big(\sigma^2 + \varepsilon\Big(\frac{t}{\sigma\sqrt n}\Big)\Big)\Big)^n = \exp\Big\{n \log\Big(1 - \frac{t^2}{2\sigma^2 n}\Big(\sigma^2 + \varepsilon\Big(\frac{t}{\sigma\sqrt n}\Big)\Big)\Big)\Big\} \xrightarrow{n\to\infty} \exp\Big\{-\frac{t^2}{2}\Big\}.$$
As $e^{-t^2/2}$ is continuous at $0$, the claim of the theorem follows from Theorem 7.12 and Exercise 7.5.

7.3 Some generalisations of the CLT*

We close this chapter by presenting, partly without proofs, several extensions of the central limit theorem, Theorem 7.14. In many situations of practical interest, one is confronted with random variables which are independent but not identically distributed, and whose law may depend on n. This is the same situation as in Theorem 4.37, where we proved a law of large numbers for such random variables. We now show the corresponding CLT.

Theorem 7.15 (The Lindeberg-Feller theorem). For each n, let Xn,k, 1 ≤ k ≤ n, be independent random variables with EXn,k = 0. Assume

(i) $\sum_{k=1}^n EX_{n,k}^2 \xrightarrow{n\to\infty} \sigma^2 > 0$,

(ii) for all $\varepsilon > 0$, $\lim_{n\to\infty} \sum_{k=1}^n E[|X_{n,k}|^2; |X_{n,k}| \ge \varepsilon] = 0$.

Then $S_n = X_{n,1} + \cdots + X_{n,n}$ converges as $n \to \infty$ in distribution to a centred normal variable with variance $\sigma^2$.
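A toy triangular array can illustrate the theorem numerically. The array below is an assumption of this sketch (it is not from the notes): $X_{n,k} = Y_k/\sqrt n$ with $Y_k = \pm1$ for even $k$ and $\pm2$ for odd $k$, each sign with probability $1/2$. Then condition (i) holds with $\sigma^2 = (1+4)/2 = 2.5$, and (ii) holds since $|X_{n,k}| \le 2/\sqrt n$ eventually falls below any $\varepsilon$:

```python
import math
import random

# Independent but non-identically distributed rows X_{n,k} = Y_k/sqrt(n):
# Y_k = +-1 for even k, +-2 for odd k. For n even, sum_k E[X_{n,k}^2] = 2.5
# exactly, and the Lindeberg condition (ii) is trivially satisfied.
random.seed(1)
n, trials = 1000, 4000
sigma2 = (1 + 4) / 2

def row_sum():
    total = 0
    for k in range(n):
        total += random.choice([-1, 1]) * (1 if k % 2 == 0 else 2)
    return total / math.sqrt(n)

samples = [row_sum() for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
```

The empirical mean and variance of the row sums should be close to $0$ and $2.5$, in line with the Lindeberg-Feller conclusion.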

Remark 7.16. This theorem actually implies Theorem 7.14. To see this, let $X_1, X_2, \ldots$ be i.i.d. with $EX_i = \mu$, $EX_i^2 = \sigma^2 < \infty$. Define $X_{n,m} = \frac{1}{\sqrt n}(X_m - \mu)$. Then $E(X_{n,m}^2) = \sigma^2/n$, so (i) holds trivially. For (ii) we need to check that

$$\sum_{m=1}^n E[X_{n,m}^2; |X_{n,m}| \ge \varepsilon] = n\,E\Big[\frac{(X_1-\mu)^2}{n}; |X_1-\mu| \ge \varepsilon\sqrt n\Big] = E[(X_1-\mu)^2; |X_1-\mu| \ge \varepsilon\sqrt n] \xrightarrow{n\to\infty} 0,$$

for any $\varepsilon > 0$. This follows by the DCT and the fact that $EX_1^2 < \infty$.

Proof of Theorem 7.15. The proof ultimately reduces to arguments similar to those in the CLT case, but we first have to impose some truncation. For this we note that (ii) implies that

$$\exists\,\varepsilon_n \downarrow 0 \ \text{ such that }\ K_n := \frac{1}{\varepsilon_n^2} \sum_{m=1}^n E[X_{n,m}^2; |X_{n,m}| \ge \varepsilon_n] \xrightarrow{n\to\infty} 0.$$

We now set $\bar X_{n,m} = X_{n,m}\mathbf 1\{|X_{n,m}| \le \varepsilon_n\}$ and define $\bar S_n = \sum_{m=1}^n \bar X_{n,m}$. Since, using the Chebyshev inequality,
$$P(|\bar S_n - S_n| \ge \varepsilon_n) \le P[\exists m : |X_{n,m}| > \varepsilon_n] \le \sum_{m=1}^n P[|X_{n,m}| > \varepsilon_n] \le \frac{1}{\varepsilon_n^2}\sum_{m=1}^n E[X_{n,m}^2; |X_{n,m}| > \varepsilon_n]$$
and $\varepsilon_n \to 0$, Lemma 12.12 implies that the random variables $S_n$ and $\bar S_n$ have the same distributional limit, provided that one of the limits exists. So we may focus on proving the theorem for $\bar S_n$ instead of $S_n$. [[TODO: move the lemma where it belongs]]
For later use observe that since $ES_n = 0$, the triangle and Chebyshev inequalities imply

$$|E\bar S_n| = |E[\bar S_n - S_n]| \le \sum_{m=1}^n E|X_{n,m} - \bar X_{n,m}| \le \frac{1}{\varepsilon_n}\sum_{m=1}^n E[X_{n,m}^2; |X_{n,m}| \ge \varepsilon_n],$$
and so, using also (i),

$$(7.17)\qquad E\bar S_n \xrightarrow{n\to\infty} 0 \quad\text{and}\quad \operatorname{Var}(\bar S_n) \xrightarrow{n\to\infty} \sigma^2.$$

We now introduce $\varphi_{n,m}(t) = E[e^{it(\bar X_{n,m} - E\bar X_{n,m})}]$. Note that

$$E[e^{it(\bar S_n - E\bar S_n)}] = \prod_{m=1}^n \varphi_{n,m}(t),$$
so, in view of the first part of (7.17), it suffices to show that this converges to $e^{-\sigma^2 t^2/2}$. A natural instinct is to take a logarithm and convert the product into a sum, but this is hindered by the fact that the $\varphi_{n,m}$ are complex-valued, and the complex logarithm is not particularly pleasant to work with. We thus replace $\varphi_{n,m}$ by its second-order expansion, which is real-valued, and estimate the error of this approximation. For the error estimate we need the following simple lemma.

Lemma 7.18. Let z1, . . . , zn, w1, . . . , wn ∈ {z ∈ C : |z| ≤ 1}. Then

$$\Big|\prod_{i=1}^n z_i - \prod_{i=1}^n w_i\Big| \le \sum_{i=1}^n |z_i - w_i|.$$
Proof. For $n = 1$ this holds trivially, and for $n > 1$ it follows from the induction step

$$\Big|\prod_{i=1}^n z_i - \prod_{i=1}^n w_i\Big| \le |z_1|\,\Big|\prod_{i=2}^n z_i - \prod_{i=2}^n w_i\Big| + |z_1 - w_1|\,\Big|\prod_{i=2}^n w_i\Big|,$$
by bounding both $|z_1|$ and the last product by one, using the assumption.

The error bound will then be provided by:

Lemma 7.19.
$$\sum_{m=1}^n \Big|\varphi_{n,m}(t) - \Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big)\Big| \xrightarrow{n\to\infty} 0.$$

Proof. To simplify the notation, write $\hat X_{n,m} = \bar X_{n,m} - E\bar X_{n,m}$. Note that $E\hat X_{n,m} = 0$ and $\operatorname{Var}\hat X_{n,m} = E\hat X_{n,m}^2$. By Taylor's theorem (with the remainder in integral form) we have
$$\varphi_{n,m}(t) = 1 + i\,0 - t^2 \int_0^1 u\,E[\hat X_{n,m}^2\, e^{it(1-u)\hat X_{n,m}}]\,du.$$
Hence

$$\varphi_{n,m}(t) - \Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big) = t^2 \int_0^1 u\,E[\hat X_{n,m}^2(1 - e^{it(1-u)\hat X_{n,m}})]\,du.$$
We now exploit the fact that the $\bar X_{n,m}$ are truncated (!), that is $|\hat X_{n,m}| \le 2\varepsilon_n$. This implies the uniform bound

$$|1 - e^{it(1-u)\hat X_{n,m}}| = 2\,|\sin(t(1-u)\hat X_{n,m}/2)| \le \max_{0\le x\le |t|\varepsilon_n} 2\sin(x),$$
which does not depend on $m$. Using this we obtain

$$\sum_{m=1}^n \Big|\varphi_{n,m}(t) - \Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big)\Big| \le \frac{t^2}{2}\,\Big(\max_{0\le x\le |t|\varepsilon_n} 2\sin(x)\Big) \sum_{m=1}^n E\hat X_{n,m}^2.$$

Since $E\hat X_{n,m}^2 \le E\bar X_{n,m}^2 \le EX_{n,m}^2$, condition (i) ensures that the sum is bounded in $n$. The maximum tends to zero, and so the result follows.

To conclude the proof of the theorem, we notice that $\operatorname{Var}\bar X_{n,m} \le \varepsilon_n^2$. Hence, by the last two lemmas,

$$\Big|\prod_{m=1}^n \varphi_{n,m}(t) - \prod_{m=1}^n \Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big)\Big| \xrightarrow{n\to\infty} 0.$$

Moreover, for $n$ large enough so that $\delta_n := \frac{t^2\varepsilon_n^2}{2} < 1$, we can write

$$\prod_{m=1}^n \Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big) = \exp\Big\{\sum_{m=1}^n \log\Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big)\Big\}.$$

Using the bound $|\log(1+z) - z| \le |z|^2/(1-|z|)$, valid for $|z| < 1$, then yields

$$\Big|\sum_{m=1}^n \log\Big(1 - \frac{t^2}{2}\operatorname{Var}\bar X_{n,m}\Big) + \frac{t^2}{2}\sum_{m=1}^n \operatorname{Var}\bar X_{n,m}\Big| \le \frac{\delta_n}{1-\delta_n}\,\frac{t^2}{2}\sum_{m=1}^n \operatorname{Var}\bar X_{n,m}.$$

The sum of variances is $\operatorname{Var}\bar S_n$, which tends to $\sigma^2$ by (7.17). Putting everything together, using $\delta_n \to 0$, then implies that

$$E[e^{it(\bar S_n - E\bar S_n)}] \xrightarrow{n\to\infty} e^{-\frac12\sigma^2 t^2}, \qquad t \in \mathbb R.$$
Since $E\bar S_n \to 0$, this completes the proof.

Theorem 7.14 does not give any information about the speed of convergence of the law of $S_n/(\sigma\sqrt n)$ to the standard normal distribution. Under some additional assumptions, this rate can be estimated. (For the proof see [Dur10], Theorem 3.4.9.)

Theorem 7.20 (Berry-Esseen). Let $(X_i)_{i\ge1}$ be i.i.d. with $EX_i = 0$, $EX_i^2 = \sigma^2$, and $E|X_i|^3 = \rho < \infty$. If $F_n$ is the distribution function of $S_n/(\sigma\sqrt n)$ and $\mathcal N$ the distribution function of the standard normal distribution, then

$$|F_n(x) - \mathcal N(x)| \le \frac{3\rho}{\sigma^3\sqrt n}, \qquad n \ge 1,\ x \in \mathbb R.$$
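The bound can be checked exactly in the simplest case. For coin tossing, $X_i = \pm1$ with probability $1/2$, one has $\sigma = 1$ and $\rho = 1$, and the law of $S_n/\sqrt n$ is an explicit function of the binomial distribution, so the Kolmogorov distance to the normal can be computed without simulation (an illustration, not part of the notes):

```python
import math

# Exact check of the Berry-Esseen bound for X_i = +-1 with probability 1/2,
# where sigma = 1 and rho = E|X_i|^3 = 1.
def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def kolmogorov_distance(n):
    # S_n = 2*B - n with B ~ Binomial(n, 1/2); walk through the atoms and
    # compare the normal cdf with F_n just before and just after each jump
    dist, cdf = 0.0, 0.0
    for b in range(n + 1):
        x = (2 * b - n) / math.sqrt(n)
        dist = max(dist, abs(cdf - normal_cdf(x)))
        cdf += math.comb(n, b) / 2 ** n
        dist = max(dist, abs(cdf - normal_cdf(x)))
    return dist

n = 100
d = kolmogorov_distance(n)
bound = 3 / math.sqrt(n)   # 3*rho/(sigma^3*sqrt(n))
```

For $n = 100$ the exact distance is well below the bound $3/\sqrt{100} = 0.3$, which also shows that the constant $3$ is far from sharp in this case.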

The central limit theorem allows us to compute (for large $n$) the probability that $S_n \in (a\sqrt n, b\sqrt n)$, $a < b$. It does not, however, give any control of the probability of $S_n$ being in intervals much smaller than $\sqrt n$. This control will be provided by the last theorem of this chapter.

Theorem 7.21 (local CLT). Assume that $(X_i)_{i\ge1}$ are i.i.d. with $EX_i = 0$, $EX_i^2 = \sigma^2 \in (0,\infty)$, and having a common characteristic function $\varphi(t)$ that satisfies $|\varphi(t)| < 1$ for all $t \neq 0$. Then, if $x_n/(\sigma\sqrt n) \to x$ and $a < b$,

$$\sqrt n\, P[S_n \in (x_n + a, x_n + b)] \xrightarrow{n\to\infty} (b-a)\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2}.$$

The condition $|\varphi(t)| < 1$ for all $t \neq 0$ ensures that the distribution of the $X_i$ is non-lattice, that is, there are no $c, d$ such that $P[X \in c\mathbb Z + d] = 1$. The non-lattice property is necessary for the local central limit theorem to hold: think about the case $b - a < c$. For the proof of Theorem 7.21, see [Dur10], Theorem 3.5.3.

8 Conditional expectation

We introduce here the concept of conditional expectation. As its definition is slightly abstract, we start with a few examples.

Example 8.1. Consider two independent random variables X, Y on a probability space (Ω, A,P ) having Poisson distribution with respective parameters λX and λY , and set S = X + Y . From the elementary probability theory recall that the conditional probability of an event A given an event B with P [B] > 0 is given by

$$P[A|B] = \frac{P[A \cap B]}{P[B]}.$$

Using this formula, it is easy to show that for every 0 ≤ k ≤ n,

$$P[X = k | S = n] = \binom{n}{k} p^k (1-p)^{n-k},$$
where $p = \lambda_X/(\lambda_X + \lambda_Y)$. Hence, given $S$, $X$ has the binomial distribution with parameters $(S, p)$, and thus
$$E[X|S = n] = \sum_{k\in\mathbb N} k\,P[X = k|S = n] = np.$$
The random variable $Z = pS$ can thus be viewed as the 'expectation of $X$ given $S$'. This can be written as $Z = E[X|S]$ or $Z = E[X|\sigma(S)]$. Observe that $Z$ is $\sigma(S)$-measurable. Moreover, for any event $C \in \sigma(S)$ we may use $Z$ to compute $E[\mathbf 1_C X]$ without knowing the joint distribution of $(X, Y)$. To see this, let $C = \{S \in A\} \in \sigma(S)$ for some $A \subset \mathbb N$. Then
$$E[\mathbf 1_C X] = E[\mathbf 1_{S\in A} X] = \sum_{n\in A} E[\mathbf 1_{S=n} X] = \sum_{n\in A} E[X|S=n]\,P[S=n] = \sum_{n\in A} pn\,P[S=n] = E[\mathbf 1_C Z].$$
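The binomial form of the conditional law in Example 8.1 can be verified by exact computation for a concrete choice of parameters (the values of $\lambda_X, \lambda_Y, n$ below are arbitrary; this is an illustration, not part of the notes):

```python
import math

# Exact check of Example 8.1: for independent X ~ Poisson(lam_x) and
# Y ~ Poisson(lam_y) with S = X + Y, the conditional law of X given S = n
# is Binomial(n, p) with p = lam_x/(lam_x + lam_y), and E[X | S = n] = n*p.
lam_x, lam_y = 2.0, 3.0
p = lam_x / (lam_x + lam_y)
n = 7

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# P[X = k, S = n] = P[X = k] * P[Y = n - k] by independence
joint = [poisson_pmf(lam_x, k) * poisson_pmf(lam_y, n - k) for k in range(n + 1)]
p_s = sum(joint)   # equals P[S = n], since S ~ Poisson(lam_x + lam_y)
conditional = [j / p_s for j in joint]
binomial = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
conditional_mean = sum(k * c for k, c in enumerate(conditional))
```

Up to floating-point rounding, `conditional` coincides with `binomial` and `conditional_mean` equals $np$.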

Example 8.2. Being just a little bit more abstract, let (Ω, A,P ) be a probability space, B = (B1,...,Bn) a measurable partition of Ω, and X an integrable random variable.

For every $B_i$ with $P[B_i] > 0$, the mapping $A \mapsto P[A|B_i]$, $A \in \mathcal A$, defines a probability measure on $\Omega$. Let $E[X|B_i]$ be the expectation of $X$ with respect to this measure (check that it is finite!). We may use the numbers $E[X|B_i]$ to define a random variable (!)
$$E[X|\mathcal B](\omega) := \sum_{i\le n:\,P[B_i]>0} E[X|B_i]\,\mathbf 1_{B_i}(\omega).$$
The value of this random variable is determined once we know which of the elements of the partition $\mathcal B$ is realised; otherwise said, $E[X|\mathcal B]$ is $\sigma(\mathcal B)$-measurable. A very similar calculation as in the previous example then shows that for any $\sigma(\mathcal B)$-measurable event $C$,
$$E[X\mathbf 1_C] = E\big[E[X|\mathcal B]\,\mathbf 1_C\big].$$

Example 8.3. Assume that random variables $X$ and $Y$ have a joint density $f(x,y) > 0$, $x, y \in \mathbb R$, and $E|X| < \infty$. It seems natural to define the conditional density of $X$ given $Y$ by
$$f(x|y) = \frac{f(x,y)}{\int_{\mathbb R} f(u,y)\,du}, \qquad x, y \in \mathbb R,$$
and the conditional expectation of $X$ given $Y$ as the random variable $Z = E[X|Y] = \varphi(Y)$ with $\varphi(y) = \int x f(x|y)\,dx$. Obviously, $Z \in \sigma(Y)$ again, and for any $C = \{Y \in A\} \in \sigma(Y)$,
$$E[X\mathbf 1_C] = \iint x\,\mathbf 1_A(y) f(x,y)\,dy\,dx = \int \varphi(y)\,\mathbf 1_A(y)\underbrace{\Big(\int f(u,y)\,du\Big)}_{\text{density of } Y}\,dy = E[\varphi(Y)\mathbf 1_A(Y)] = E[Z\mathbf 1_C].$$

In view of these three examples, it now seems completely natural to introduce:

Definition 8.4. Let $X$ be an integrable random variable on some probability space $(\Omega, \mathcal A, P)$ and let $\mathcal G \subset \mathcal A$ be a sub-$\sigma$-algebra of $\mathcal A$. The conditional expectation of $X$ given $\mathcal G$, denoted $E[X|\mathcal G]$, is any random variable $Y$ such that

(i) $Y$ is $\mathcal G$-measurable,

(ii) for every $A \in \mathcal G$, $E[X\mathbf 1_A] = E[Y\mathbf 1_A]$.

Since $E[X|\mathcal G]$ is defined through an integral equality, any random variable $Z$ with $Z = Y$ $P$-a.s. is necessarily also a conditional expectation of $X$ given $\mathcal G$. This is, however, the only 'non-uniqueness' issue:

Theorem 8.5. The conditional expectation $E[X|\mathcal G]$ exists and is unique up to $P$-null equivalence. Moreover, if $X \ge 0$, then $E[X|\mathcal G] \ge 0$ $P$-a.s.

Proof. Existence. We assume first that $X \ge 0$. From the lecture 'Measure Theory' recall the Radon-Nikodym theorem:

Theorem 8.6 (Radon-Nikodym). Let $\mu, \nu$ be $\sigma$-finite measures on $(\Omega, \mathcal A)$. If $\nu$ is absolutely continuous with respect to $\mu$ (that is, for every $A \in \mathcal A$, $\mu(A) = 0$ implies $\nu(A) = 0$; notation $\nu \ll \mu$), then there is an $\mathcal A$-measurable non-negative function $f$ such that
$$\nu(A) = \int_A f\,d\mu \qquad \text{for all } A \in \mathcal A.$$
The function $f$ is called the Radon-Nikodym density of $\nu$ with respect to $\mu$ and denoted by $f = \frac{d\nu}{d\mu}$.

To prove the existence of the conditional expectation, define a new measure $Q$ on $(\Omega, \mathcal A)$ by
$$Q(A) = \int_A X\,dP = E[X\mathbf 1_A], \qquad \text{that is } \frac{dQ}{dP} = X.$$
Let $\tilde Q$ and $\tilde P$ be the restrictions of $Q$ and $P$ to the $\sigma$-algebra $\mathcal G$. Since $X$ is integrable, $\tilde Q$ is a finite measure. Moreover, for $G \in \mathcal G$ with $\tilde P[G] = 0$ we have $P[G] = \tilde P[G] = 0$ and thus also $0 = Q[G] = \tilde Q[G]$. It follows that $\tilde Q \ll \tilde P$, and by the Radon-Nikodym theorem applied to $\tilde Q$, $\tilde P$ and $(\Omega, \mathcal G)$, there is a non-negative $\mathcal G$-measurable random variable $Z$ such that for all $G \in \mathcal G$
$$E[X\mathbf 1_G] = \tilde Q[G] = \int_G Z\,d\tilde P = \int_G Z\,dP = E[Z\mathbf 1_G].$$
Hence $Z$ satisfies the conditions of Definition 8.4 and thus is a conditional expectation of $X$ given $\mathcal G$. In addition, we have proved the last claim of the theorem.

For a general random variable $X$, we write $X = X^+ - X^-$ with $X^+ = \max(X, 0)$, $X^- = \max(-X, 0)$. Using the above construction, we obtain random variables $Z^+$ and $Z^-$ that are conditional expectations of $X^+$ and $X^-$ given $\mathcal G$, and set $Z = Z^+ - Z^-$. Using the linearity of the expectation, it can easily be checked that $Z$ is a conditional expectation of $X$ given $\mathcal G$.

Uniqueness. Let $Z^1$ and $Z^2$ both satisfy the conditions of Definition 8.4. Set $D = Z^1 - Z^2$. Then $D$ is $\mathcal G$-measurable and for every $G \in \mathcal G$, $E[D\mathbf 1_G] = E[Z^1\mathbf 1_G] - E[Z^2\mathbf 1_G] = E[X\mathbf 1_G] - E[X\mathbf 1_G] = 0$. Hence, $E[|D|] = E[D\mathbf 1_{D>0}] - E[D\mathbf 1_{D<0}] = 0$, so $D = 0$ $P$-a.s. and the claim follows.

Remark 8.7. The usual expectation $E[X]$ of a random variable $X$ can be interpreted as the 'best guess' of $X$ without having any information about the outcome of the random experiment. In a similar vein, the conditional expectation of $X$ given $\mathcal G$ can be interpreted as the best guess of $X$ given the 'information contained in the $\sigma$-algebra $\mathcal G$'. This is best understood from Example 8.2, where the information contained in $\mathcal G$ is simply the information about which element of the partition $\mathcal B$ is realised.

Example 8.8. (i) For $\mathcal G = \mathcal A$, we see that $E[X|\mathcal A] = X$. That is, if we know 'everything', then the best guess of $X$ is $X$ itself.

(ii) If $\mathcal G = \{\emptyset, \Omega\}$, then $E[X|\mathcal G] = E[X]$. The best guess of $X$ without any additional information is $E[X]$.
(iii) If $X$ and $\mathcal G$ are independent (i.e. $\sigma(X)$ and $\mathcal G$ are independent), then for $G \in \mathcal G$, $E[X\mathbf 1_G] = E[X]P[G]$, and thus again $E[X|\mathcal G] = E[X]$.

Lemma 8.9 (Simple properties of conditional expectation). Assume $E|X|, E|Y| < \infty$.

(i) Conditional expectation is linear,

$$E[aX + bY|\mathcal G] = aE[X|\mathcal G] + bE[Y|\mathcal G].$$

(ii) If X ≤ Y , then E[X|G] ≤ E[Y |G], P -a.s.

(iii) If $X_n \ge 0$ and $X_n \nearrow X$, then

$$E[X_n|\mathcal G] \nearrow E[X|\mathcal G], \qquad P\text{-a.s.}$$

Proof. The linearity is obvious from the definition. The monotonicity follows directly from the last claim of Theorem 8.5, using the linearity. Finally, for (iii), let $Y_n = X - X_n$. It then suffices to show that $E[Y_n|\mathcal G] \searrow 0$. Since $Y_n$ is decreasing, (ii) implies that $Z_n = E[Y_n|\mathcal G]$ is decreasing $P$-a.s., and thus a limit $Z_\infty$ exists, $P$-a.s. again. If $G \in \mathcal G$, then $E[Z_n\mathbf 1_G] = E[Y_n\mathbf 1_G]$. Since $0 \le Y_n \le X$, we see using the dominated convergence theorem that $E[Z_\infty\mathbf 1_G] = 0$. The same argument as in the uniqueness part of the proof of Theorem 8.5 implies that $Z_\infty = 0$ $P$-a.s., completing the proof.

Lemma 8.10 (Jensen’s inequality). Let ϕ : R → R be a convex function and X a random variable such that E|X| < ∞ and E|ϕ(X)| < ∞. Then

ϕ(E[X|G]) ≤ E[ϕ(X)|G],P -a.s.

Proof. For $\varphi(x) = ax + b$, the claim holds by linearity. A general convex function $\varphi$ can be written as $\varphi(x) = \sup_{n\ge0} \varphi_n(x)$ with $\varphi_n(x) = a_n x + b_n$ being 'tangents of $\varphi$ at rationals'. Hence, $P$-a.s., for every $n \ge 1$,

$$E[\varphi(X)|\mathcal G] \ge E[\varphi_n(X)|\mathcal G] = \varphi_n(E[X|\mathcal G]).$$

Taking the supremum over n, we obtain

$$E[\varphi(X)|\mathcal G] \ge \sup_n \varphi_n(E[X|\mathcal G]) = \varphi(E[X|\mathcal G]),$$
and the proof is complete.

Corollary 8.11. Let $X \in L^p(\Omega, \mathcal A, P)$, $p \in [1, \infty]$. Then $E[X|\mathcal G] \in L^p(\Omega, \mathcal G, P)$ and

$$\|E[X|\mathcal G]\|_p \le \|X\|_p,$$
that is, the conditional expectation is a contraction on $L^p$.

Proof. For $p \in [1, \infty)$ this follows easily from Jensen's inequality. For $p = \infty$, observe that $-\|X\|_\infty \le X \le \|X\|_\infty$ and use the monotonicity of the conditional expectation.

Proposition 8.12. If $X$ is $\mathcal G$-measurable and $E|X|, E|Y| < \infty$, then
$$E[XY|\mathcal G] = X\,E[Y|\mathcal G], \qquad P\text{-a.s.}$$

Proof. The right-hand side of the claim is $\mathcal G$-measurable, so we only need to check (ii) of Definition 8.4. Assume first that $Y \ge 0$ and $X = \mathbf 1_B$ for some $B \in \mathcal G$. Then for $G \in \mathcal G$
$$E\big[X E[Y|\mathcal G]\mathbf 1_G\big] = \int_{B\cap G} E[Y|\mathcal G]\,dP = \int_{B\cap G} Y\,dP = E[XY\mathbf 1_G],$$
and thus (ii) of Definition 8.4 holds for such $X$ and $Y$. We then continue as usual: for $X$ simple (i.e. for finite linear combinations of indicator functions), the claim extends by linearity; for general $X \ge 0$, by monotone convergence (Lemma 8.9(iii)). Finally, for general $X$ and $Y$ we write $X = X^+ - X^-$ and $Y = Y^+ - Y^-$ and apply the linearity.

Exercise 8.13. Use the proposition to verify Example 8.8(i).

Proposition 8.14 (the smaller σ-algebra wins). Let G1 ⊂ G2 ⊂ A be σ-algebras. Then, P -a.s.,

(i) E[E(X|G1)|G2] = E(X|G1).

(ii) E[E(X|G2)|G1] = E(X|G1).

Proof. (i) Since G1 ⊂ G2, the random variable E(X|G1) is G2-measurable and the claim follows from Proposition 8.12. (ii) Let G ∈ G1 ⊂ G2. Then

E[E[X|G2]1G] = E[X1G] = E[E[X|G1]1G].

Hence, $E[X|\mathcal G_1]$ satisfies the conditions of Definition 8.4 with $E[X|\mathcal G_2]$ in place of $X$.

Exercise 8.15. Use the proposition to verify Example 8.8(ii).

The next theorem provides another interpretation of the conditional expectation in the case of square-integrable random variables.

Theorem 8.16. Assume that $X \in L^2(\Omega, \mathcal A, P)$, $\mathcal G \subset \mathcal A$. Then $E[X|\mathcal G]$ is the orthogonal projection of $X$ from $L^2(\Omega, \mathcal A, P)$ onto $L^2(\Omega, \mathcal G, P)$.

Proof. Start by observing that if $Z \in L^2(\Omega, \mathcal G, P)$, then by Proposition 8.12, $Z\,E[X|\mathcal G] = E[ZX|\mathcal G]$ (the right-hand side is finite by the Cauchy-Schwarz inequality). Hence
$$(8.17)\qquad E\big[Z\,E[X|\mathcal G]\big] = E\big[E[ZX|\mathcal G]\big] = E[ZX],$$
using Proposition 8.14. Let now $Y \in L^2(\Omega, \mathcal G, P)$ and set $Z = Y - E[X|\mathcal G]$. Then
$$E[(X-Y)^2] = E\big[(X - E(X|\mathcal G) - Z)^2\big] = E\big[(X - E(X|\mathcal G))^2\big] + E[Z^2] - 2E\big[Z(X - E(X|\mathcal G))\big].$$
The last term vanishes due to (8.17), and thus $E[(X-Y)^2] \ge E[(X - E(X|\mathcal G))^2]$, with equality when $Z = 0$, that is $Y = E(X|\mathcal G)$. This completes the proof.
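On a finite probability space the projection property of Theorem 8.16 can be checked by direct enumeration. In the sketch below (an illustration with arbitrary made-up weights, not from the notes), $\mathcal G$ is generated by a three-block partition, the conditional expectation is the block-wise weighted average as in Example 8.2, and no other $\mathcal G$-measurable variable is $L^2$-closer to $X$:

```python
import itertools

# Omega = {0,...,5}; G is generated by the partition {0,1}, {2,3}, {4,5}.
prob = [0.1, 0.2, 0.15, 0.05, 0.3, 0.2]
x = [1.0, 4.0, -2.0, 0.0, 3.0, 5.0]
blocks = [(0, 1), (2, 3), (4, 5)]

def block_average(block):
    # E[X | B_i]: expectation of X under P[ . | B_i]
    mass = sum(prob[w] for w in block)
    return sum(prob[w] * x[w] for w in block) / mass

cond = [block_average(b) for b in blocks]   # E[X|G], one value per block

def l2_dist(values):   # values: one number per block, i.e. a G-measurable Y
    return sum(prob[w] * (x[w] - values[i]) ** 2
               for i, block in enumerate(blocks) for w in block)

best = l2_dist(cond)
# compare against a coarse grid of competing G-measurable variables
grid = [-3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
competitors = [l2_dist(y) for y in itertools.product(grid, repeat=3)]
```

Every competitor has an $L^2$ distance at least as large as that of the block-average variable, in accordance with the theorem.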

64 8.1 Regular conditional probabilities*

Consider a probability space $(\Omega, \mathcal A, P)$ together with a $\sigma$-algebra $\mathcal G \subset \mathcal A$ and a random variable $X$ taking values in some measurable space $(S, \mathcal S)$. For every $B \in \mathcal S$, the indicator $\mathbf 1\{X \in B\}$ is a bounded random variable, so that the conditional expectation
$$(8.18)\qquad E[\mathbf 1\{X \in B\}|\mathcal G] =: P[X \in B|\mathcal G]$$
exists and is a $[0,1]$-valued, $\mathcal G$-measurable function on $\Omega$. It is natural to view $P[X \in B|\mathcal G]$ as a 'map'
$$\Omega \times \mathcal S \to [0,1], \qquad (\omega, B) \mapsto P[X \in B|\mathcal G](\omega).$$
Beware that this map is not uniquely defined, since the conditional expectations in (8.18) are determined only $P$-a.s. Also, even if the notation suggests it, it is not a priori clear that for fixed $\omega \in \Omega$, the map $B \mapsto P[X \in B|\mathcal G](\omega)$ is a probability measure on $\mathcal S$. So the question is: Is there a version of $P[X \in \cdot\,|\mathcal G](\omega)$ which is a probability measure for every $\omega \in \Omega$?

To see that this is not a trivial issue, let us explain where the problem lies. If $A_1, A_2, \ldots$ are disjoint elements of $\mathcal S$, it is easy to see from Lemma 8.9(i,iii) that
$$P[X \in \cup_{n\ge1} A_n|\mathcal G] = \sum_{n\ge1} P[X \in A_n|\mathcal G], \qquad P\text{-a.s. (!)}$$
However, the null set where this relation fails may depend on the sequence $A_1, A_2, \ldots$. If we require $P[X \in \cdot\,|\mathcal G](\omega)$ to be a probability measure, then this relation should hold for all disjoint collections $A_1, A_2, \ldots$. But the space $\mathcal S$ typically contains uncountably many countable disjoint collections, so the exceptional sets may pile up. To deal with this problem we introduce:

Definition 8.19. Let $(S_1, \mathcal S_1)$, $(S_2, \mathcal S_2)$ be two measurable spaces. A map $\kappa : S_1 \times \mathcal S_2 \to [0,1]$ is called a stochastic kernel if

(i) for every $A \in \mathcal S_2$, the map $S_1 \ni x \mapsto \kappa(x, A)$ is $\mathcal S_1$-measurable,

(ii) for every $x \in S_1$, the map $\mathcal S_2 \ni A \mapsto \kappa(x, A)$ is a probability measure on $(S_2, \mathcal S_2)$.

We write $\int f(y)\,\kappa(x, dy)$ for the integral with respect to the probability measure $\kappa(x, \cdot)$. This definition allows us to formalise our requirements on the 'regularity' of the conditional distribution $P[X \in \cdot\,|\mathcal G]$.

Definition 8.20. A regular conditional distribution of $X$ given $\mathcal G$ is a stochastic kernel $\kappa$ from $(\Omega, \mathcal G)$ to $(S, \mathcal S)$ such that $\kappa(\cdot, B)$ is, for every $B \in \mathcal S$, a version of the conditional expectation $E[\mathbf 1\{X \in B\}|\mathcal G]$; that is, for every $B \in \mathcal S$,
$$P[X \in B|\mathcal G](\omega) = E[\mathbf 1\{X \in B\}|\mathcal G](\omega) = \kappa(\omega, B) \qquad \text{for } P\text{-a.e. } \omega.$$

We show that regular conditional distributions exist on 'nice' measurable spaces.

Definition 8.21. A Borel space is a measurable space $(S, \mathcal S)$ for which there is an $A \in \mathcal B(\mathbb R)$ and a bijection $\varphi : S \to A$ such that both $\varphi$ and $\varphi^{-1}$ are measurable. Most of the measurable spaces one encounters are Borel; in particular, every Polish space endowed with its Borel $\sigma$-algebra is.

Theorem 8.22. Let (S, S) be a Borel space, (Ω, A,P ) a probability space, X :Ω → S a random variable, and G ⊂ A a σ-algebra. Then there exists a regular conditional distribution of X given G.

Proof. See [Dur10], page 198. [[TODO: Type this?]]

9 Martingales

Up to now we have (mostly) studied properties of independent random variables and related convergence results. Dealing with dependent random variables is much harder. There are many ways in which random variables can be dependent, and therefore there is no general theory. Martingales are particular sequences of dependent random variables for which a general theory exists.

9.1 Definition and Examples

Definition 9.1. Let $(\Omega, \mathcal A, P)$ be a probability space. A non-decreasing sequence $(\mathcal F_n)_{n\ge0}$ of sub-$\sigma$-algebras, that is $\mathcal F_0 \subset \mathcal F_1 \subset \cdots \subset \mathcal A$, is called a filtration.

Definition 9.2. A sequence $(X_n)_{n\ge0}$ of random variables is said to be adapted to $(\mathcal F_n)$ if the random variable $X_n$ is $\mathcal F_n$-measurable for every $n \ge 0$.

Definition 9.3. An $\mathcal F_n$-adapted sequence $(X_n)_{n\ge0}$ of integrable random variables is called a martingale when for every $n \ge 0$
$$E[X_{n+1}|\mathcal F_n] = X_n \qquad P\text{-a.s.}$$
It is called a submartingale when for every $n \ge 0$
$$E[X_{n+1}|\mathcal F_n] \ge X_n \qquad P\text{-a.s.},$$
and a supermartingale when for every $n \ge 0$
$$E[X_{n+1}|\mathcal F_n] \le X_n \qquad P\text{-a.s.}$$

Example 9.4. Let (Xi)i≥1 be an i.i.d. sequence with E[Xi] = 0. Set Sn = X1 +···+Xn, F0 = {∅, Ω}, and Fn = σ(X1,...,Xn) for n ≥ 1. Then obviously Sn is Fn-adapted and for every n ≥ 0

$$(9.5)\qquad E[S_{n+1}|\mathcal F_n] = E[\underbrace{S_n}_{\in\mathcal F_n} + \underbrace{X_{n+1}}_{\text{indep. of }\mathcal F_n}|\mathcal F_n] = S_n + E[X_{n+1}] = S_n.$$

Hence, Sn is a martingale.

Example 9.6. Let $X_i$ and $S_n$ be as in the previous example and assume in addition that $EX_i^2 = \sigma^2 < \infty$. Set

$$M_n = S_n^2 - n\sigma^2, \qquad n \ge 0.$$

$M_n$ is $\mathcal F_n$-adapted and, since $S_n \in \mathcal F_n$ and $X_{n+1}$ is independent of $\mathcal F_n$,
$$E[M_{n+1} - M_n|\mathcal F_n] = E[S_{n+1}^2 - S_n^2 - \sigma^2|\mathcal F_n] = E[(S_n + X_{n+1})^2 - S_n^2 - \sigma^2|\mathcal F_n] = E[2S_n X_{n+1} + X_{n+1}^2 - \sigma^2|\mathcal F_n] = 2S_n E[X_{n+1}] + E[X_{n+1}^2] - \sigma^2 = 0.$$
Hence, $E[M_{n+1}|\mathcal F_n] = E[M_n|\mathcal F_n] = M_n$, that is, $M_n$ is a martingale. The same computation implies that $S_n^2$ is a submartingale.
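For $X_i = \pm1$ (so $\sigma^2 = 1$) the martingale property of Example 9.6 can even be checked exhaustively, since conditioning on $\mathcal F_n$ amounts to fixing the first $n$ steps (an illustrative sketch, not part of the notes):

```python
import itertools

# Exhaustive check: given any prefix (X_1,...,X_n) of +-1 steps, averaging
# M_{n+1} = S_{n+1}^2 - (n+1) over the two equally likely next steps gives
# back M_n = S_n^2 - n.
n = 6
checked = 0
for prefix in itertools.product([-1, 1], repeat=n):
    s = sum(prefix)
    m_now = s ** 2 - n
    m_next = sum((s + step) ** 2 - (n + 1) for step in (-1, 1)) / 2
    assert m_next == m_now
    checked += 1
```

This is exactly the algebra of the example: $E[(S_n + X)^2] = S_n^2 + 1$ when $X = \pm1$ with equal probability.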

Example 9.7 (Asymmetric random walk on $\mathbb Z$). Let $X_i$ be i.i.d. with $P[X_i = +1] = 1 - P[X_i = -1] = p$ and $p \neq \frac12$. Define $S_n$ and $\mathcal F_n$ as in the previous examples and set
$$M_n = \Big(\frac{1-p}{p}\Big)^{S_n}, \qquad n \ge 0.$$

Mn is again Fn-adapted and integrable (since |Sn| ≤ n) and

$$E[M_{n+1}|\mathcal F_n] = E\Big[M_n\Big(\frac{1-p}{p}\Big)^{X_{n+1}}\,\Big|\,\mathcal F_n\Big] = M_n\,E\Big[\Big(\frac{1-p}{p}\Big)^{X_1}\Big] = M_n\Big(\frac{1-p}{p}\,p + \frac{p}{1-p}\,(1-p)\Big) = M_n.$$

Observe that for $p > \frac12$, $\lim_{n\to\infty} S_n = \infty$, $P$-a.s. Since $\frac{1-p}{p} < 1$, this implies $\lim_{n\to\infty} M_n = 0$, $P$-a.s. On the other hand, $E[M_n] = 1$ for all $n$. This is an important example of a martingale which converges $P$-a.s. but not in $L^1(\Omega)$.
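Both phenomena of Example 9.7 show up immediately in code (an illustration with the arbitrary choice $p = 0.7$): the one-step identity forcing $E[M_n] = 1$ holds exactly, while along a simulated path $M_n$ collapses to $0$:

```python
import random

# One-step identity: E[((1-p)/p)^{X_1}] = ((1-p)/p)*p + (p/(1-p))*(1-p) = 1,
# so M_n is a martingale and E[M_n] = 1 for every n.
p = 0.7
r = (1 - p) / p
one_step = r * p + (1 / r) * (1 - p)

# Yet along a typical path S_n drifts to +infinity, so M_n = r^{S_n} -> 0.
random.seed(2)
s = 0
for _ in range(2000):
    s += 1 if random.random() < p else -1
m_final = r ** s
```

This is precisely the a.s.-but-not-$L^1$ convergence noted above: the path value `m_final` is astronomically small even though the expectation stays at $1$.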

Example 9.8 (Radon-Nikodym derivatives). Let $(\Omega, \mathcal A)$ be a measurable space and $\mathcal F_n$ a filtration. Further, let $\mu$ be a measure and $\nu$ a probability measure on $\Omega$. Define $\mu_n$ and $\nu_n$ to be the restrictions of $\mu$ and $\nu$ to $\mathcal F_n$ and assume that $\mu_n \ll \nu_n$ for every $n \ge 0$. By this assumption we can define

$$M_n(\omega) := \frac{d\mu_n}{d\nu_n}(\omega), \qquad n \ge 0.$$

$M_n$ is $\mathcal F_n$-adapted since, by the Radon-Nikodym theorem applied to $\mu_n$, $\nu_n$ and $(\Omega, \mathcal F_n)$, the Radon-Nikodym derivative $d\mu_n/d\nu_n$ is $\mathcal F_n$-measurable. $M_n$ is obviously integrable on $(\Omega, \mathcal A, \nu)$, and for $n \ge 0$, $A \in \mathcal F_n$,
$$E^\nu[M_{n+1}\mathbf 1_A] = \int_A M_{n+1}\,d\nu = \int_A M_{n+1}\,d\nu_{n+1} = \int_A \frac{d\mu_{n+1}}{d\nu_{n+1}}\,d\nu_{n+1} = \mu_{n+1}(A) = \mu_n(A) = \int_A \frac{d\mu_n}{d\nu_n}\,d\nu_n = \int_A M_n\,d\nu = E^\nu[M_n\mathbf 1_A].$$
By the definition of the conditional expectation, it follows that $E^\nu[M_{n+1}|\mathcal F_n] = M_n$, that is, $M_n$ is a martingale on the filtered probability space $(\Omega, \mathcal A, (\mathcal F_n), \nu)$.

Example 9.9 (Progressive conditioning). Let $X \in L^1$ and let $\mathcal F_n$ be any filtration. Set $M_n = E[X|\mathcal F_n]$. Then, by Proposition 8.14,

E[Mn+1|Fn] = E[E(X|Fn+1)|Fn] = E[X|Fn] = Mn.

Hence $M_n$ is an $\mathcal F_n$-martingale. This is actually the same martingale as in the previous example; it suffices to set $\mu(d\omega) = X(\omega)\,\nu(d\omega)$.

Example 9.10 (Galton-Watson branching process). In the mid-19th century, there was interest in developing a theory of family trees, particularly in connection with royal families. Two statisticians (by the standards of the time), Galton and Watson, devised a model that nowadays bears their name. In this model, there are generations containing a certain number of currently-alive individuals. The dynamics is such that, at each unit of time, each individual produces a certain number of offspring, sampled independently from a common law with probabilities $\{p(n) : n \ge 0\}$. (In particular, if there is no offspring, the lineage of that individual dies out.) We will define the problem as follows. Consider a family of i.i.d. integer-valued random variables $\{\xi_{n,k} : n, k \ge 0\}$ with law determined by $P[\xi_{n,k} = m] = p(m)$, $m \ge 0$. Define, inductively, random variables $\{S_n\}_{n\ge0}$ by $S_0 := 1$ and
$$S_{n+1} = \begin{cases} 0, & \text{if } S_n = 0,\\ \xi_{n+1,1} + \cdots + \xi_{n+1,S_n}, & \text{if } S_n > 0.\end{cases}$$

It is easy to verify that the dynamics does what we described verbally above. The additional assumption S0 = 1 means that there is one individual at time zero. Consider now the filtration Fn := σ(ξm,k : 0 ≤ m ≤ n, k ≥ 0) and let us compute

$$E[S_{n+1}|\mathcal F_n] = \sum_{k=0}^\infty E[S_{n+1}\mathbf 1\{S_n = k\}|\mathcal F_n] = \sum_{k=0}^\infty \mathbf 1\{S_n = k\}\,E[\xi_{n+1,1} + \cdots + \xi_{n+1,k}|\mathcal F_n] = \sum_{k=0}^\infty \mathbf 1\{S_n = k\}\,k\,E(\xi_{1,1}) = S_n\,E(\xi_{1,1}).$$

Thus, denoting $\mu := E(\xi_{1,1})$, we see that $M_n := \mu^{-n} S_n$ is a martingale. The value $\mu = 1$ is obviously special, because $S_n$ is then itself a martingale; this will play a role later.

Example 9.11 (Polya's urn). Consider an urn that initially contains $r$ red balls and $g$ green balls. We now proceed as follows: sample a ball from the urn and replace it, together with another ball of the same colour. Repeating this step, the question is what the status of the urn is in the long run. Let $R_n$ denote the number of red balls in the

urn at time $n$ and let $G_n$ denote the corresponding number of green balls. Obviously, $R_n + G_n = r + g + n$. Now consider the random variable
$$M_n := \frac{R_n}{R_n + G_n} = \frac{R_n}{r+g+n},$$
which is the fraction of red balls in the urn at time $n$. The dynamics described above can be encoded using a sequence $U_1, U_2, \ldots$ of i.i.d. uniform random variables on $[0,1]$ as follows:

Rn+1 := (Rn + 1)1{Un+1 ≤ Mn} + Rn1{Un+1 > Mn}.

We claim that $M_n$ is a martingale for the filtration $\mathcal F_n := \sigma(U_1, \ldots, U_n)$. Indeed, since $R_n$ is $\mathcal F_n$-measurable, we have
$$E[M_{n+1}|\mathcal F_n] = \frac{R_n+1}{r+g+n+1}\cdot\frac{R_n}{r+g+n} + \frac{R_n}{r+g+n+1}\cdot\frac{G_n}{r+g+n} = \frac{R_n}{r+g+n}\cdot\frac{R_n+1+G_n}{r+g+n+1} = \frac{R_n}{r+g+n} = M_n.$$
Notice that, unlike the earlier examples, this martingale is non-negative and bounded.
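The one-step computation for Polya's urn can be replayed in exact rational arithmetic over a small range of urn states (an illustrative sketch; the range of states below is arbitrary):

```python
from fractions import Fraction

# From any state with r_n red and g_n green balls, the conditional mean of
# the next red-ball fraction equals the current fraction M_n = r_n/(r_n+g_n).
def one_step_mean(r_n, g_n):
    t = r_n + g_n
    m_n = Fraction(r_n, t)
    red_drawn = Fraction(r_n + 1, t + 1)    # new fraction if a red ball is drawn
    green_drawn = Fraction(r_n, t + 1)      # new fraction if a green ball is drawn
    return m_n * red_drawn + (1 - m_n) * green_drawn

all_match = all(one_step_mean(r, g) == Fraction(r, r + g)
                for r in range(1, 6) for g in range(1, 6))
```

Using `Fraction` makes the check exact, with no floating-point tolerance needed.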

We now give a few elementary properties of martingales.

Proposition 9.12. If $X_n$ is a supermartingale, then for $n \ge m \ge 0$

(9.13) E[Xn|Fm] ≤ Xm,P -a.s.

If $X_n$ is a submartingale, then the reversed inequality holds, and if $X_n$ is a martingale, then

(9.14) E[Xn|Fm] = Xm,P -a.s.

Proof. The first claim follows from the definition by induction. The second claim can be proved from the first one by observing that if Xn is submartingale, then −Xn is a supermartingale. Moreover, any martingale is both sub- and supermartingale, yielding the last claim.

Proposition 9.15. If $X_n$ is a martingale and $\varphi$ is a convex function with $E[\varphi(X_n)] < \infty$ for all $n \ge 0$, then $\varphi(X_n)$ is a submartingale. In particular, if $\varphi(x) = |x|^p$ for $p \ge 1$ and $X_n$ is in $L^p$, then $|X_n|^p$ is a submartingale.

Proof. By Jensen's inequality and the definition,

E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) = ϕ(Xn), completing the proof.

70 Proposition 9.16. If Xn is a submartingale and ϕ is a non-decreasing convex function with E[ϕ(Xn)] < ∞ for all n ≥ 0, then ϕ(Xn) is again a submartingale. Proof. As above, by Jensen’s inequality and the definition,

E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) ≥ ϕ(Xn).

The monotonicity of ϕ is needed in the last inequality.

Corollary 9.17. If $X_n$ is a supermartingale and $a \in \mathbb R$, then $X_n \wedge a$ is a supermartingale. Similarly, for a submartingale $X_n$, $X_n \vee a$ is a submartingale.

Exercise 9.18. Find a supermartingale $X_n$ such that $X_n^2$ is a submartingale.

Definition 9.19. A sequence $(H_n)_{n\ge1}$ of random variables is called predictable if $H_n$ is $\mathcal F_{n-1}$-measurable for all $n \ge 1$.

Proposition 9.20 (Doob's decomposition). $X_n$ is an $\mathcal F_n$-submartingale iff $X_n$ can be written as $X_n = M_n + A_n$, where $M_n$ is a martingale and $0 = A_0 \le A_1 \le A_2 \le \ldots$ is a predictable sequence. This decomposition is unique up to $P$-null equivalence.

Proof. To show uniqueness, observe that for n ≥ 0 we must have

$$E[X_{n+1}|\mathcal F_n] - X_n = \underbrace{E[M_{n+1}|\mathcal F_n] - M_n}_{=0} + \underbrace{E[A_{n+1}|\mathcal F_n]}_{=A_{n+1}} - A_n.$$

Hence $A_0 = 0$ and $A_{n+1} - A_n = E[X_{n+1} - X_n|\mathcal F_n] \ge 0$ for all $n$, which uniquely determines the sequence $A_n$ and thus also $M_n = X_n - A_n$. The same computation also yields the existence of $A_n$ with the required properties. Finally, if $X_n = M_n + A_n$ with $M_n$, $A_n$ as in the statement, then it is trivial to check that $X_n$ is a submartingale.

Definition 9.21. Let Mn be adapted and Hn predictable. We define a discrete stochastic integral of Hn with respect to Mn by

$$(H \cdot M)_n = \sum_{i=1}^n H_i(M_i - M_{i-1}).$$
Remark 9.22. The previous definition is a discrete version of the stochastic integral $\int_0^t H_s\,dM_s$, which will be defined (for suitable processes $H_t$ and $M_t$, $t \in [0,\infty)$) in the lecture 'Stochastic analysis'.

Remark 9.23 (Interpretation as a gambling system). Consider a coin-tossing game where $X_i$ denotes the result of the $i$-th toss: $X_i = 1$ means heads, $X_i = -1$ means tails. Let $H_i$ be the amount that a player bets on the $i$-th toss being heads. This value should naturally be determined from the information available at time $i - 1$; hence it is natural to require that $H_n$ be predictable.

The game has the following rules: if the $i$-th toss is heads, the player wins double his bet; otherwise he loses it. The total amount the player has won at time $n$ is
$$H_1 X_1 + H_2 X_2 + \cdots + H_n X_n = \sum_{i=1}^n H_i(M_i - M_{i-1}) = (H \cdot M)_n,$$
where $M_n = X_1 + \cdots + X_n$ for $n \ge 1$ and $M_0 = 0$.

Remark 9.24 (historical). The famous gambling system or strategy called 'martingale' is defined by setting $H_1 = 1$ and, for $n \ge 2$, $H_n = 2H_{n-1}$ if $X_{n-1} = -1$, and $H_n = 1$ if $X_{n-1} = +1$. This strategy seems to provide a sure profit in this fair game, since $-1 - 2 - \cdots - 2^{k-1} + 2^k = 1$, but this is only an illusion, as can be seen from the following theorem.
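The illusion can also be dispelled by brute force. Playing the doubling strategy of Remark 9.24 for a bounded number of rounds and enumerating all equally likely coin sequences shows that the expected total gain $(H\cdot M)_k$ is exactly $0$ (the horizon $k = 8$ is an arbitrary illustrative choice):

```python
import itertools

# Play the doubling strategy for at most k rounds of a fair coin and average
# the gain over all 2^k equally likely outcomes: the many small wins are
# exactly cancelled by the rare loss of 2^k - 1 when every toss is tails.
k = 8
gains = []
for tosses in itertools.product([-1, 1], repeat=k):
    stake, total = 1, 0
    for x in tosses:
        total += stake * x                    # win the stake on heads, lose it on tails
        stake = 1 if x == 1 else 2 * stake    # the rule from Remark 9.24
    gains.append(total)

expected_gain = sum(gains) / len(gains)
worst_case = min(gains)
```

The exact average gain is $0$, and the worst case (all tails) is a loss of $2^k - 1$: the strategy merely trades many small wins for a rare catastrophic loss, exactly as the stopped-martingale theory predicts.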

Theorem 9.25. (a) If $(X_n)_{n\ge0}$ is a (sub-/super-)martingale and $H_n \ge 0$ is a predictable sequence which is bounded for every $n$, then $(H \cdot X)_n$ is also a (sub-/super-)martingale.
(b) If $(X_n)_{n\ge0}$ is a martingale and $H_n$ is a predictable sequence which is bounded for every $n$, then $(H \cdot X)_n$ is also a martingale.

Proof. Since $H$ is bounded, $(H \cdot X)_n \in L^1$ for every $n$. For (a), assume that $X_n$ is a submartingale (the remaining claims follow by replacing $X$ by $-X$). Then,

$$E[(H \cdot X)_{n+1} - (H \cdot X)_n|\mathcal F_n] = E[\underbrace{H_{n+1}}_{\in\mathcal F_n}(X_{n+1} - X_n)|\mathcal F_n] = \underbrace{H_{n+1}}_{\ge 0}\,\underbrace{E[X_{n+1} - X_n|\mathcal F_n]}_{\ge 0} \ge 0,$$
and thus $H \cdot X$ is a submartingale. The proof of (b) is analogous.

We now introduce an important concept:

Definition 9.26. An $\mathbb N \cup \{\infty\}$-valued random variable $T$ is called a stopping time (with respect to the filtration $\mathcal F_n$) when

$$(9.27)\qquad \{T = n\} \in \mathcal F_n \qquad \text{for all } n < \infty.$$
Think of $T$ as the time when the player stops playing. Of course, the event $\{T = n\}$ must be measurable with respect to the information available at time $n$.

Example 9.28. (a) Every deterministic time $T(\omega) = k$ is a stopping time, as $\{T = n\}$ is either $\emptyset$ or $\Omega$.
(b) Let $A \in \mathcal B(\mathbb R)$ and let $X_n$ be an adapted process. Define

$$H_A = \inf\{k \ge 0 : X_k \in A\}$$
to be the time of the first visit to $A$. Then $H_A$ is a stopping time. (Exercise!)
(c) On the other hand, let

$$L_A = \sup\{k : X_k \in A\}$$
be the time of the last visit to $A$. Show that $L_A$ is not a stopping time in general.

Corollary 9.29 (first stopping theorem). Let $X_n$ be an $\mathcal F_n$-(sub-/super-)martingale and let $T$ be an $\mathcal F_n$-stopping time. Then $X_{T\wedge n}$ is again an $\mathcal F_n$-(sub-/super-)martingale.

Proof. We assume without loss of generality that $X_n$ is a submartingale. Set $H_n = \mathbf 1\{T \ge n\}$. As $T$ is a stopping time and $\{T \ge n\} = \{T > n-1\} = \{T \le n-1\}^c \in \mathcal F_{n-1}$, the sequence $H_n$ is predictable. Moreover,

$$(H \cdot X)_n = \sum_{i=1}^n \mathbf 1\{T \ge i\}(X_i - X_{i-1}) = X_{T\wedge n} - X_0.$$

Therefore, XT ∧n = X0 + (H · X)n is a submartingale by Theorem 9.25.

9.2 Martingales convergence, a.s. case

In the next three sections we study the principal convergence theorems for martingales. We start with an important inequality, which will be useful to control the fluctuations of (sub-/super-)martingales. Let $a < b$ be two real numbers and define stopping times $T_i$, $i \ge 0$, by setting $T_0 = 0$ and then recursively, for $k \ge 1$,

T2k−1 = inf{i ≥ T2k−2 : Xi ≤ a},

T2k = inf{i ≥ T2k−1 : Xi ≥ b}.

If Ti = ∞ for some i, that is the set in the infimum is empty, we define Tj = ∞ for all j > i. Let Un = sup{k : T2k ≤ n} be the number of upcrossings of the interval (a, b) completed by time n.

Theorem 9.30 (Upcrossing inequality). If (Xm)m≥0 is a submartingale, then

$$(b-a)\,EU_n \le E[(X_n - a)^+] - E[(X_0 - a)^+].$$

Proof. Let $Y_n = a + (X_n - a)^+$. By Proposition 9.16, $Y_n$ is also a submartingale. When $X$ upcrosses $(a,b)$, then so does $Y$ and vice versa, therefore $U_n^X = U_n^Y$. We can thus consider $Y$ instead of $X$. We define the following sequence of random variables:
$$H_m = \begin{cases} 1, & \text{if } T_{2k-1} < m \le T_{2k} \text{ for some } k \in \mathbb N,\\ 0, & \text{otherwise.}\end{cases}$$

The value $H_m = 1$ indicates that the time interval $(m-1, m)$ is 'part of an upcrossing'. Observe that

$$\{T_{2k-1} < m \le T_{2k}\} = \{T_{2k-1} \le m-1\} \cap \{T_{2k} \le m-1\}^c \in \mathcal F_{m-1},$$

since the $T_i$ are stopping times. Hence $H_m$ is predictable. The process $H$ can be interpreted as an investment strategy: if $Y_n$ denotes the price of a stock, we buy one share when its price is at $a$, hold it up to the first time its price is above $b$, and then sell it, gaining at least $b - a$. From this interpretation it is easy to see that
$$(9.31)\qquad (b-a)\,U_n^Y \le (H \cdot Y)_n,$$
since every completed upcrossing generates a profit of at least $b - a$, and a possible incomplete upcrossing at time $n$ makes a non-negative contribution to the right-hand side (this is true for $Y$ but not for $X$). Write

(9.32) Yn − Y0 = (H · Y )n + ((1 − H) · Y )n.

By Theorem 9.25, $((1-H)\cdot Y)$ is a submartingale, and therefore $E((1-H)\cdot Y)_n \ge E((1-H)\cdot Y)_0 = 0$. Combining (9.31) and (9.32) then yields
$$EY_n - EY_0 \ge E(H\cdot Y)_n \ge (b-a)\,EU_n^Y = (b-a)\,EU_n^X,$$
which, after inserting the definition of $Y$, completes the proof.
Using the upcrossing inequality we easily get the first martingale convergence theorem.

Theorem 9.33 (A.s. martingale convergence theorem). Let $(X_n)_{n\ge0}$ be a submartingale with $\sup_n E X_n^+ < \infty$. Then, as $n \to \infty$, $X_n$ converges a.s. to a limit $X$ with $E|X| < \infty$.

Proof. Take $a < b$, $a, b \in \mathbb{Q}$. Due to Theorem 9.30 and the assumption,
$$E U_n^{a,b} \le \frac{E[(X_n - a)^+]}{b - a} \le \frac{\sup_m E[X_m^+] + |a|}{b - a} =: \mathrm{const}(a, b).$$

Defining $U^{a,b} = \lim_{n\to\infty} U_n^{a,b}$ to be the total number of upcrossings of $(a, b)$ by $X$, we conclude using the monotone convergence theorem that $E U^{a,b} \le \mathrm{const}(a, b) < \infty$, and thus $U^{a,b}$ is finite $P$-a.s. This conclusion holds for all pairs of rational numbers $a < b$, and thus
$$P\Big[\bigcup_{a,b\in\mathbb{Q},\, a<b} \big\{\liminf X_n < a < b < \limsup X_n\big\}\Big] = 0.$$

It follows that $\limsup X_n = \liminf X_n$ $P$-a.s., that is, the limit $X := \lim X_n$ exists $P$-a.s. Fatou's lemma guarantees that $E X^+ \le \liminf E X_n^+ < \infty$, so $X < \infty$ $P$-a.s. Further, using the submartingale property, $E X_n^- = E X_n^+ - E X_n \le E X_n^+ - E X_0$. Hence, by Fatou's lemma again,
$$E X^- \le \liminf E X_n^- \le \sup_n E X_n^+ - E X_0 < \infty.$$
Hence $E|X| = E X^+ + E X^- < \infty$, completing the proof.

As a special case we obtain

Corollary 9.34. Let Xn ≥ 0 be a supermartingale. Then, as n → ∞, Xn converges a.s. to a limit X and EX ≤ EX0.

Proof. $Y_n = -X_n$ is a submartingale with $E Y_n^+ = 0$, so the convergence follows from the previous theorem. The inequality is a consequence of Fatou's lemma and the supermartingale property $E X_n \le E X_0$.

Remark 9.35. Example 9.7 shows that the assumptions of Corollary 9.34 are not sufficient to guarantee the $L^1$ convergence of $X_n$ to $X$.

Example 9.36 (Example 9.8 continued). Let $\mu$, $\nu$ be two probability measures on $(\Omega, \mathcal{A})$, and assume that $\mathcal{F}_n$ is a filtration with $\lim \mathcal{F}_n := \sigma(\cup_{n\ge0} \mathcal{F}_n) = \mathcal{A}$. Let $\mu_n$, $\nu_n$ be the restrictions of $\mu$ and $\nu$ to $\mathcal{F}_n$. Assuming that $\mu_n \ll \nu_n$, we know that $M_n = \frac{d\mu_n}{d\nu_n}$ is a non-negative martingale on $(\Omega, \mathcal{A}, (\mathcal{F}_n), \nu)$. By Corollary 9.34, $M_n \to M$ $\nu$-a.s. Moreover, it is possible to show (see [Dur10, p. 242]) that
$$\mu(A) = \int_A M \, d\nu + \mu(A \cap \{M = \infty\}).$$
Assuming in addition that $\mu \ll \nu$, this simplifies to
$$\mu(A) = \int_A M \, d\nu,$$
that is, $M = \frac{d\mu}{d\nu}$.

Example 9.37 (Branching process continued). In Example 9.10 we proved that if $S_n$ is a Galton-Watson branching process with $E\xi_{1,1} = \mu$, then $M_n = S_n/\mu^n$ is a martingale. Since $M_n$ is non-negative, by Corollary 9.34, $M_n \to M$ $P$-a.s. as $n \to \infty$. We now show that $M = 0$ $P$-a.s. when $\mu \le 1$.

When $\mu < 1$, by the martingale property $E S_n = \mu^n E S_0 \to 0$. Since $S_n$ is integer valued, $P[S_n > 0] = P[S_n \ge 1] \le E[S_n] \to 0$. In the case $\mu = 1$, assuming in addition $P[\xi_{n,i} = 1] < 1$ to exclude degeneracy, $S_n$ is itself a martingale with values in $\mathbb{N}$. Since $S_n \to M$ a.s. and $S_n$ is integer valued, $S_n = M$ for all $n$ large; this is only possible if $M = 0$, as the non-degeneracy assumption excludes the remaining values. The case $\mu > 1$ is slightly more complicated; one can, e.g., show that in this case $P[S_n > 0 \text{ for all } n] > 0$.

Example 9.38 (Pólya's urn continued). In Example 9.11 we proved that the proportion of red balls in Pólya's urn is a martingale with values in $[0, 1]$. Theorem 9.33 implies that this proportion converges a.s. to a, in general random, limit.
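A quick simulation illustrates this almost sure convergence for Pólya's urn. Here we assume the standard scheme (start with one red and one black ball, return each drawn ball together with one extra ball of the same colour); this is an illustrative sketch, not part of the lecture.

```python
import random

def polya_path(n_draws, seed=0):
    """One path of the proportion of red balls in a Polya urn
    (start: 1 red, 1 black; each drawn ball is returned with one copy)."""
    rng = random.Random(seed)
    red, total = 1, 2
    props = []
    for _ in range(n_draws):
        if rng.random() < red / total:  # draw a red ball
            red += 1
        total += 1
        props.append(red / total)
    return props

props = polya_path(20000, seed=42)
# the proportion settles down: late fluctuations are tiny because the
# per-step change of the proportion is at most 1/(current number of balls)
tail_osc = max(props[-1000:]) - min(props[-1000:])
print(round(props[-1], 3), tail_osc)
```

Over many independent runs the limiting proportion is random; for the $(1,1)$ starting configuration it is in fact uniformly distributed on $[0, 1]$.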

9.3 Doob’s inequality and Lp convergence

We now discuss the convergence of martingales in Lp spaces. The following inequality extends Kolmogorov’s inequality (Lemma 4.2).

Theorem 9.39 (Doob's maximal inequality). Let $(X_n)_{n\ge0}$ be a submartingale. Then for any $\lambda \ge 0$,
$$\lambda\, P\big[\max_{0\le i\le n} X_i \ge \lambda\big] \le E\big[X_n;\ \max_{0\le i\le n} X_i \ge \lambda\big] \le E[X_n^+].$$

Proof. Set $T = \inf\{m \ge 0 : X_m \ge \lambda\}$ and $A = \{\max_{0\le i\le n} X_i \ge \lambda\} = \{T \le n\}$. Then, using the submartingale property and $\{T = k\} \in \mathcal{F}_k$,
$$E[X_{T\wedge n}] = E[X_n 1\{T > n\}] + \sum_{k=0}^n E[X_k 1\{T = k\}] \le E[X_n 1\{T > n\}] + \sum_{k=0}^n E[X_n 1\{T = k\}] = E[X_n].$$
Therefore,
$$E[X_{T\wedge n} 1_A] + E[X_{T\wedge n} 1_{A^c}] \le E[X_n 1_A] + E[X_n 1_{A^c}].$$
Since $X_{T\wedge n} \ge \lambda$ on $A$, and $T \wedge n = n$ on $A^c$, the last inequality yields
$$\lambda\, P[A] + E[X_n 1_{A^c}] \le E[X_n 1_A] + E[X_n 1_{A^c}],$$
and thus $\lambda P[A] \le E[X_n 1_A] \le E[X_n^+]$, as claimed.

Remark 9.40. To see why Kolmogorov's inequality follows from Doob's, consider the submartingale $X_n = S_n^2$, where $S_n = \xi_1 + \cdots + \xi_n$ is a sum of centred i.i.d. random variables.

Theorem 9.41 (Doob's $L^p$-inequality). Let $(X_n)_{n\ge0}$ be a submartingale, $p \in (1, \infty)$, and set $\bar X_n = \max_{0\le i\le n} X_i^+$. Then
$$\|\bar X_n\|_p \le \frac{p}{p-1}\, \|X_n\|_p.$$
In particular, $X_n \in L^p$ implies $\bar X_n \in L^p$.

Proof. Without loss of generality, due to Proposition 9.16, we may assume that $X_n = X_n^+ \ge 0$ and $X_n \in L^p$. Fix $M > 0$. Then
$$E[(\bar X_n \wedge M)^p] = \int_0^\infty p\lambda^{p-1}\, P[\bar X_n \wedge M > \lambda]\, d\lambda.$$
Observe that
$$\{\bar X_n \wedge M > \lambda\} = \begin{cases} \{\bar X_n > \lambda\}, & \lambda < M,\\ \emptyset, & \lambda \ge M.\end{cases}$$
Using Theorem 9.39, Fubini's theorem, and Hölder's inequality then yields
$$E[(\bar X_n \wedge M)^p] \le \int_0^\infty p\lambda^{p-2}\, E\big[X_n 1\{\bar X_n \wedge M > \lambda\}\big]\, d\lambda = \frac{p}{p-1}\, E\Big[X_n \int_0^{\bar X_n \wedge M} (p-1)\lambda^{p-2}\, d\lambda\Big] = \frac{p}{p-1}\, E\big[X_n (\bar X_n \wedge M)^{p-1}\big] \le \frac{p}{p-1}\, \|X_n\|_p\, E[(\bar X_n \wedge M)^p]^{(p-1)/p}.$$
Either $\|\bar X_n \wedge M\|_p = 0$, or we divide by $E[(\bar X_n \wedge M)^p]^{(p-1)/p}$ to obtain
$$\|\bar X_n \wedge M\|_p \le \frac{p}{p-1}\, \|X_n\|_p.$$
The theorem then follows by the monotone convergence theorem, sending $M \to \infty$.
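As a sanity check of Theorem 9.41 with $p = 2$, one can compare both sides of the inequality for the simple random walk martingale by Monte Carlo. This is an illustrative sketch (function name and parameters are ours): we estimate $E[(\bar X_n)^2]$ and $E[X_n^2]$ for $X_n = S_n$ a symmetric $\pm1$ walk and verify $E[(\bar X_n)^2] \le 4\, E[X_n^2]$.

```python
import random

def doob_l2_check(n_steps, n_paths, seed=1):
    """Monte Carlo check of Doob's L^2 inequality for the simple symmetric
    random walk martingale S_n: E[(max_{i<=n} S_i^+)^2] <= 4 E[S_n^2]."""
    rng = random.Random(1 if seed is None else seed)
    lhs = rhs = 0.0
    for _ in range(n_paths):
        s, running_max = 0, 0   # S_0 = 0, so max S_i^+ starts at 0
        for _ in range(n_steps):
            s += 1 if rng.random() < 0.5 else -1
            running_max = max(running_max, s)
        lhs += running_max ** 2
        rhs += s * s
    return lhs / n_paths, rhs / n_paths

lhs, rhs = doob_l2_check(200, 2000)
print(lhs <= 4 * rhs)  # (p/(p-1))^2 = 4 for p = 2; the bound holds
```

Here `rhs` should be close to $E S_n^2 = n = 200$, while the constant $4$ leaves a comfortable margin.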

Corollary 9.42. When $(X_n)_{n\ge0}$ is a martingale and $X_n^* = \max_{0\le i\le n} |X_i|$, then
$$\|X_n^*\|_p \le \frac{p}{p-1}\, \|X_n\|_p.$$

Theorem 9.43 ($L^p$-convergence theorem). Let $(X_n)_{n\ge0}$ be a martingale such that, for some $p \in (1, \infty)$, $\sup_n E|X_n|^p < \infty$. Then $X_n \to X$ as $n \to \infty$, $P$-a.s. and in $L^p$.

Proof. Since, by Jensen's inequality, $(E X_n^+)^p \le (E|X_n|)^p \le E[|X_n|^p]$, Theorem 9.33 implies that $X_n \to X$ $P$-a.s. Letting $n \to \infty$ in Corollary 9.42, using the monotone convergence theorem and the assumption, yields $\|\sup_n |X_n|\|_p < \infty$. Since $|X_n - X|^p \le (2\sup_n |X_n|)^p \in L^1$, we obtain that $X_n \to X$ in $L^p$ by the dominated convergence theorem.

9.4 L2-martingales

As usual, the most important special case of the above theorem is $p = 2$. The following two simple lemmas are useful when dealing with $L^2$-martingales.

Lemma 9.44 (Orthogonality of martingale increments). Let $(X_n)_{n\ge0}$ be a martingale with $E X_n^2 < \infty$ for all $n \ge 0$. Then, for every $m \le n$ and every $\mathcal{F}_m$-measurable random variable $Y$ with $E Y^2 < \infty$,
$$E[(X_n - X_m)Y] = 0.$$

Proof. By the Cauchy-Schwarz inequality, $E|(X_n - X_m)Y| < \infty$. We can thus write
$$E[(X_n - X_m)Y] = E\big[E[(X_n - X_m)Y \mid \mathcal{F}_m]\big] = E\big[Y\, \underbrace{E[X_n - X_m \mid \mathcal{F}_m]}_{=0}\big] = 0.$$
This completes the proof.

Lemma 9.45 (Conditional variance formula). Let $(X_n)_{n\ge0}$ be as in the last lemma, and let $m \le n$. Then
$$E\big[(X_n - E[X_n \mid \mathcal{F}_m])^2 \mid \mathcal{F}_m\big] = E[(X_n - X_m)^2 \mid \mathcal{F}_m] = E[X_n^2 \mid \mathcal{F}_m] - X_m^2.$$

Proof. Left as an exercise!

Example 9.46 (Branching process continued). Recall Example 9.37. Assume now in addition that the offspring distribution has a finite second moment, and set $\sigma^2 := \mathrm{Var}(\xi_{n,i}) < \infty$. Let $M_n = S_n/\mu^n$ be the martingale, as before. By Lemma 9.45,
$$E[M_n^2 \mid \mathcal{F}_{n-1}] = M_{n-1}^2 + E[(M_n - M_{n-1})^2 \mid \mathcal{F}_{n-1}] = M_{n-1}^2 + E\Big[\Big(\frac{S_n}{\mu^n} - \frac{S_{n-1}}{\mu^{n-1}}\Big)^2 \,\Big|\, \mathcal{F}_{n-1}\Big] = M_{n-1}^2 + \mu^{-2n}\, E[(S_n - \mu S_{n-1})^2 \mid \mathcal{F}_{n-1}].$$
Further,
$$E[(S_n - \mu S_{n-1})^2 \mid \mathcal{F}_{n-1}] = \sum_{l=0}^\infty E\big[(S_n - \mu S_{n-1})^2\, 1\{S_{n-1} = l\} \,\big|\, \mathcal{F}_{n-1}\big] = \sum_{l=0}^\infty 1\{S_{n-1} = l\}\, l\,\sigma^2 = S_{n-1}\, \sigma^2 = \mu^{n-1} M_{n-1}\, \sigma^2,$$
since, given $S_{n-1} = l$, $S_n$ is a sum of $l$ independent copies of $\xi$, each with variance $\sigma^2$. Thus $E[M_n^2 \mid \mathcal{F}_{n-1}] = M_{n-1}^2 + \sigma^2 \mu^{-n-1} M_{n-1}$. Taking expectations and using $E M_{n-1} = 1$, we obtain $E M_n^2 = E M_{n-1}^2 + \sigma^2 \mu^{-n-1}$, and iterating (with $E M_0^2 = 1$),
$$E M_n^2 = 1 + \sigma^2 \sum_{k=2}^{n+1} \mu^{-k}.$$
It follows that for $\mu > 1$, $\sup_n E M_n^2 < \infty$, and thus $M_n \to M$ in $L^2$.
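The formula for $E M_n^2$ can be checked by simulation. The sketch below uses a concrete offspring law of our choosing (not from the text): $P[\xi=0]=\tfrac14$, $P[\xi=1]=\tfrac14$, $P[\xi=2]=\tfrac12$, so $\mu = 1.25$ and $\sigma^2 = 0.6875$.

```python
import random

def offspring(rng):
    """One offspring count: P[0]=1/4, P[1]=1/4, P[2]=1/2."""
    u = rng.random()
    return 2 if u < 0.5 else (1 if u < 0.75 else 0)

def estimate_EM2(n_gen, n_paths, seed=2):
    """Monte Carlo estimate of E[M_n^2] for the normalised Galton-Watson
    martingale M_n = S_n / mu^n with S_0 = 1, mu = 1.25."""
    rng = random.Random(seed)
    mu = 1.25
    total = 0.0
    for _ in range(n_paths):
        s = 1
        for _ in range(n_gen):
            s = sum(offspring(rng) for _ in range(s))
        total += (s / mu ** n_gen) ** 2
    return total / n_paths

n = 8
est = estimate_EM2(n, 4000)
exact = 1 + 0.6875 * sum(1.25 ** (-k) for k in range(2, n + 2))
print(round(est, 2), round(exact, 2))  # the two should be close
```

For this law `exact` is about $2.83$, and the Monte Carlo estimate fluctuates around it.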

9.5 Azuma-Hoeffding inequality

Various inequalities from the previous sections become considerably stronger if the martingales under consideration have bounded increments, that is, in the $L^\infty$-case.

Theorem 9.47 (Azuma-Hoeffding inequality). Let $(X_n)$ be a supermartingale with $X_0 = 0$ such that, for some deterministic sequence $c_k$ of non-negative numbers,
$$|X_k - X_{k-1}| \le c_k, \qquad k = 1, \dots, n.$$
Then
$$P[X_n \ge \lambda] \le \exp\Big\{-\frac{\lambda^2}{2\sum_{k=1}^n c_k^2}\Big\}.$$

Proof. The proof is based on the following inequality:
$$(9.48)\qquad e^{ty} \le \cosh(tc) + \frac{y}{c}\,\sinh(tc), \qquad |y| \le c,\ t \in \mathbb{R}.$$
This follows from the fact that the right-hand side can be written as
$$\cosh(tc) + \frac{y}{c}\,\sinh(tc) = \frac{c+y}{2c}\, e^{tc} + \frac{c-y}{2c}\, e^{-tc}.$$

Under the condition $|y| \le c$, both $\frac{c+y}{2c}$ and $\frac{c-y}{2c}$ are non-negative and add up to one. So by the convexity of the exponential function, the right-hand side is at least
$$\exp\Big\{tc\,\frac{c+y}{2c} - tc\,\frac{c-y}{2c}\Big\} = \exp(ty),$$
which gives (9.48).

Consider now the random variable $e^{tX_n}$ for $t \ge 0$. Since $X_n$ is bounded, this is integrable, and so
$$E[e^{tX_n}] = E\big[e^{tX_{n-1}}\, E[e^{t(X_n - X_{n-1})} \mid \mathcal{F}_{n-1}]\big].$$
Moreover, (9.48) ensures that
$$E[e^{t(X_n - X_{n-1})} \mid \mathcal{F}_{n-1}] \le \cosh(tc_n) + \frac{1}{c_n}\,\sinh(tc_n)\, E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}],$$
which equals $\cosh(tc_n)$ for martingales and is $\le \cosh(tc_n)$ for supermartingales, since $\sinh(tc_n) \ge 0$ for $t \ge 0$. By induction we obtain
$$E[e^{tX_n}] \le \prod_{k=1}^n \cosh(tc_k) = \exp\Big\{\sum_{k=1}^n \log\cosh(tc_k)\Big\}, \qquad t \ge 0.$$
The second-order Taylor expansion of $x \mapsto \log\cosh(x)$ with the remainder in Lagrange form, together with $(\log\cosh)''(\xi) = \cosh^{-2}(\xi) \le 1$, yields $\log\cosh(x) \le \frac12 x^2$. Hence,
$$E[e^{tX_n}] \le \exp\Big\{\frac{t^2}{2}\sum_{k=1}^n c_k^2\Big\}, \qquad t \ge 0.$$
By the exponential Chebyshev inequality,
$$P[X_n \ge \lambda] \le e^{-t\lambda}\, E[e^{tX_n}], \qquad t \ge 0.$$
Combining the two bounds and minimising over $t$ shows that the optimal value is $t = \lambda\big(\sum_{k=1}^n c_k^2\big)^{-1}$, which yields the desired inequality.
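The Azuma-Hoeffding bound can be compared with the true tail in the simplest example, a $\pm1$ random walk (a martingale with $c_k = 1$). The sketch below is ours; the bound is not tight, but it dominates the empirical tail.

```python
import math
import random

def azuma_check(n, lam, n_paths, seed=3):
    """Compare the empirical tail P[X_n >= lam] for a +-1 random walk
    (martingale with |X_k - X_{k-1}| = c_k = 1) with the Azuma-Hoeffding
    bound exp(-lam^2 / (2 n))."""
    rng = random.Random(seed)
    hits = sum(
        sum(1 if rng.random() < 0.5 else -1 for _ in range(n)) >= lam
        for _ in range(n_paths))
    return hits / n_paths, math.exp(-lam ** 2 / (2 * n))

emp, bound = azuma_check(n=100, lam=20, n_paths=5000)
print(emp <= bound)  # the bound dominates the empirical tail
```

For $n = 100$, $\lambda = 20$ the bound is $e^{-2} \approx 0.135$, while the true probability is roughly $0.026$.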

9.6 Convergence in L1

We now study convergence in $L^1$. Recall that in this case Doob's inequality (Theorem 9.41) is of no use. Further, as we have seen in Example 9.7 and Remark 9.35, the condition $\sup_n E|M_n| < \infty$ does not imply $M_n \to X$ in $L^1$ for a martingale $M_n$; that is, Theorem 9.43 does not hold for $p = 1$. On the other hand, Theorem 9.33 clearly applies in this case, so $M_n \to M$ a.s.

We already know one way to deduce $L^1$ convergence from a.s. convergence, namely the dominated convergence theorem, whose application requires the existence of an $L^1$-dominating function. We now develop another condition allowing us to deduce $L^1$-convergence from a.s. convergence. We will see that this condition is not only sufficient but also necessary. It applies to general families of random variables, not only to martingales.

Definition 9.49. A collection $(X_i)_{i\in I}$ is said to be uniformly integrable (UI) if
$$\lim_{M\to\infty}\, \sup_{i\in I}\, E\big[|X_i|\, 1\{|X_i| > M\}\big] = 0.$$
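As a toy illustration of this definition (our example, not from the text), consider the family $X_i = i\, 1\{U < 1/i^2\}$ with $U$ uniform on $(0,1)$: here $E X_i^2 = 1$ for all $i$, and the tail expectations $E[|X_i| 1\{|X_i| > M\}]$ can be computed in closed form and vanish uniformly in $i$ as $M$ grows.

```python
def tail_expectation(i, M):
    """E[|X_i| 1{|X_i| > M}] for the family X_i = i * 1{U < 1/i^2},
    U uniform on (0, 1): X_i equals i with probability 1/i^2, else 0,
    so the tail expectation is i * (1/i^2) = 1/i whenever i > M."""
    return i * (1 / i ** 2) if i > M else 0.0

for M in (1, 10, 100):
    # the supremum over the whole family shrinks like 1/(M+1)
    worst = max(tail_expectation(i, M) for i in range(1, 10001))
    print(M, round(worst, 4))
```

The supremum over $i$ is attained at $i = M + 1$ and equals $1/(M+1)$, so the family is UI, in line with criterion (b) of the next example applied with $\varphi(x) = x^2$.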

Example 9.50. (a) When $|X_i| \le Y$ for all $i \in I$ and some $Y \in L^1$, that is, when there is an $L^1$-dominating function, then $(X_i)_{i\in I}$ is UI. (Exercise!)

(b) Let $\varphi \ge 0$ be a function such that $\lim_{x\to\infty} \varphi(x)/x = \infty$; examples are $\varphi(x) = x^p$, $p > 1$, or $\varphi(x) = x \log^+(x)$. If $\sup_{i\in I} E\varphi(|X_i|) < \infty$, then $(X_i)_{i\in I}$ is UI.

Indeed, let $A = \sup_i E\varphi(|X_i|)$. Choose $\varepsilon > 0$ and $M < \infty$ such that $\inf_{u\ge M} \frac{\varphi(u)}{u} \ge \frac{A}{\varepsilon}$. Then for all $i \in I$,
$$E\big[|X_i|\, 1\{|X_i| > M\}\big] \le \frac{\varepsilon}{A}\, E\big[\varphi(|X_i|)\, 1\{|X_i| > M\}\big] \le \frac{\varepsilon}{A}\, E[\varphi(|X_i|)] \le \varepsilon,$$
which implies the condition of Definition 9.49. ♦

Lemma 9.51. Let $X \in L^1(\Omega, \mathcal{A}, P)$. Then the family
$$\{E[X \mid \mathcal{G}] : \mathcal{G} \subset \mathcal{A} \text{ is a } \sigma\text{-algebra}\}$$
is UI.

Proof. We start with a technical claim.

Claim 9.52. If $X \in L^1$, then for every $\varepsilon > 0$ there exists $\delta > 0$ such that

P (A) < δ =⇒ E[|X|; A] ≤ ε.

Proof. Assume, by contradiction, that there is a sequence of events $A_n$ with $P(A_n) \le \frac1n$ and $E[|X|; A_n] \ge \varepsilon$. It follows that $|X| 1\{A_n\} \to 0$ in probability, and thus a.s. along a subsequence $(k_n)$. For such a subsequence, the dominated convergence theorem implies $E[|X|; A_{k_n}] \to 0$ as $n \to \infty$, a contradiction.

Fix now $\varepsilon$ and $\delta$ as in the claim and choose $M < \infty$ such that $E|X|/M \le \delta$. For a $\sigma$-algebra $\mathcal{G} \subset \mathcal{A}$, by Jensen's inequality,
$$(9.53)\qquad E\big[|E[X \mid \mathcal{G}]|;\ |E[X \mid \mathcal{G}]| \ge M\big] \le E\big[E[|X| \mid \mathcal{G}];\ E[|X| \mid \mathcal{G}] \ge M\big] = E\big[|X|;\ E[|X| \mid \mathcal{G}] \ge M\big],$$
where the equality follows from the definition of the conditional expectation, since $\{E[|X| \mid \mathcal{G}] \ge M\} \in \mathcal{G}$. In addition, by Chebyshev's inequality,
$$P\big[E[|X| \mid \mathcal{G}] \ge M\big] \le M^{-1}\, E\big[E[|X| \mid \mathcal{G}]\big] = M^{-1}\, E[|X|] \le \delta,$$
and thus, by Claim 9.52, the right-hand side of (9.53) is bounded by $\varepsilon$, proving the UI property.

The following theorem explains the usefulness of the UI property for dealing with $L^1$-convergence.

Theorem 9.54. If $X_n \to X$ in probability as $n \to \infty$, then the following are equivalent:

(i) $\{X_n : n \ge 1\}$ is UI,

(ii) $X_n \to X$ in $L^1$ as $n \to \infty$,

(iii) $E|X_n| \to E|X| < \infty$ as $n \to \infty$.

Proof. (i) $\Rightarrow$ (ii). For $M > 0$,
$$(9.55)\qquad E[|X_n - X|] \le E\big[|X_n - X|;\ |X_n| \le M,\, |X| \le M\big] + 3E\big[|X_n|;\ |X_n| > M\big] + 3E\big[|X|;\ |X| > M\big].$$
Indeed, outside the event $\{|X_n| \le M, |X| \le M\}$ at least one of $|X_n| > M$, $|X| > M$ holds, and then $|X_n - X| \le |X_n| + |X| \le 3\max(|X_n|, |X|) \le 3|X_n| 1\{|X_n| > M\} + 3|X| 1\{|X| > M\}$. For $\varepsilon \in (0, 1)$, (i) implies the existence of $M_0$ such that
$$\sup_n E\big[|X_n|;\ |X_n| \ge M\big] \le \frac{\varepsilon}{6} \qquad \text{for all } M \ge M_0.$$
By Fatou's lemma, $E[|X|] \le \liminf E[|X_n|] \le \frac{\varepsilon}{6} + M_0 < \infty$, so $X \in L^1$, and by the dominated convergence theorem we may enlarge $M \ge M_0$ so that also $E[|X|;\ |X| > M] \le \frac{\varepsilon}{6}$. Then, uniformly in $n$, the last two terms in (9.55) together are at most $\varepsilon$. Hence,

$$\limsup E|X_n - X| \le \limsup E\big[|X_n - X|;\ |X_n| \le M,\, |X| \le M\big] + \varepsilon = \varepsilon,$$
where the first term vanishes by the extended dominated convergence theorem (cf. Remark 9.56), since the integrand is bounded by $2M$ and tends to $0$ in probability. As $\varepsilon$ is arbitrary, (ii) follows.

(ii) $\Rightarrow$ (iii). By Jensen's inequality,

$$\big|E|X_n| - E|X|\big| \le E\big[\big||X_n| - |X|\big|\big] \le E[|X_n - X|] \to 0$$
by (ii), which implies (iii).

(iii) $\Rightarrow$ (i). Fix $\varepsilon > 0$. Let $\psi_M : \mathbb{R}_+ \to \mathbb{R}_+$ be the continuous function given by
$$\psi_M(x) = \begin{cases} x, & x \le M - 1,\\ \text{linear}, & x \in [M-1, M],\\ 0, & x \ge M.\end{cases}$$
By the dominated convergence theorem, for $M$ large enough, $E|X| - E\psi_M(|X|) \le \frac{\varepsilon}{2}$. Another application of the dominated convergence theorem implies that $E[\psi_M(|X_n|)] \to E[\psi_M(|X|)]$ as $n \to \infty$, so by (iii), for all $n$ larger than some $n_0$,
$$E\big[|X_n|;\ |X_n| \ge M\big] \le E[|X_n|] - E\psi_M(|X_n|) \le E[|X|] - E[\psi_M(|X|)] + \frac{\varepsilon}{2} < \varepsilon.$$
By increasing $M$, the last inequality can be made valid for all $n$; that is, $(X_n)_{n\ge1}$ is UI.

Remark 9.56. In the proof we used the following extended version of the dominated convergence theorem: assume that $X_n \to X$ in probability and that $|X_n| \le Y \in L^1$ for all $n$. Then $X_n \to X$ in $L^1$. Prove this as an exercise!

As a corollary of Theorem 9.54 we obtain an $L^1$-convergence theorem for submartingales.

Theorem 9.57. For a submartingale (Xn)n≥0 the following are equivalent

(i) (Xn)n≥0 is UI,

(ii) $X_n \to X$ in $L^1$ and $P$-a.s.,

(iii) $X_n \to X$ in $L^1$.

Proof. (i) $\Rightarrow$ (ii). The UI property implies $\sup_n E|X_n| < \infty$, so by Theorem 9.33, $X_n \to X$, $P$-a.s. Theorem 9.54 then implies that $X_n \to X$ in $L^1$.

(ii) $\Rightarrow$ (iii) is obvious.

(iii) $\Rightarrow$ (i). Since $X_n \to X$ in $L^1$, we also have $X_n \to X$ in probability. The claim then follows by another application of Theorem 9.54.

Theorem 9.58. For a martingale (Xn)n≥0 the following are equivalent

(i) (Xn)n≥0 is UI,

(ii) $X_n \to X_\infty$ in $L^1$ and $P$-a.s.,

(iii) $X_n \to X_\infty$ in $L^1$,

(iv) there is a random variable $X$ such that $X_n = E[X \mid \mathcal{F}_n]$.

Proof. (i) $\Leftrightarrow$ (ii) $\Leftrightarrow$ (iii) follows from Theorem 9.57.

(iii) $\Rightarrow$ (iv). Let $n < m$. Then, for every $A \in \mathcal{F}_n$,
$$E[X_n 1_A] = E[X_m 1_A] \xrightarrow{\,m\to\infty\,} E[X_\infty 1_A],$$
where the convergence uses (iii); that is, $X_n = E[X_\infty \mid \mathcal{F}_n]$ for all $n \ge 0$, so (iv) holds with $X = X_\infty$.

(iv) $\Rightarrow$ (i) is a direct consequence of Lemma 9.51.

Exercise 9.59. Let $\mathcal{F}_\infty = \sigma(\cup_n \mathcal{F}_n)$ and $X \in L^1(\Omega, \mathcal{A}, P)$. Show that
$$E[X \mid \mathcal{F}_n] \xrightarrow{\,n\to\infty\,} E[X \mid \mathcal{F}_\infty] \qquad \text{a.s. and in } L^1.$$
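Exercise 9.59 can be visualised in a concrete case (our choice of setup): take $X = U$ uniform on $[0, 1)$ and $\mathcal F_n$ the $\sigma$-algebra generated by the first $n$ binary digits of $U$. Then $E[X \mid \mathcal F_n]$ is the midpoint of the dyadic interval of length $2^{-n}$ containing $U$, and these midpoints converge to $U$.

```python
import random

def conditional_expectations(u, n_max):
    """For X = U uniform on [0,1) and F_n generated by the first n binary
    digits, E[X | F_n] is the midpoint of the dyadic interval of length
    2^-n containing u; these converge to u as n grows."""
    out = []
    for n in range(n_max + 1):
        k = int(u * 2 ** n)                      # index of the dyadic interval
        out.append(k / 2 ** n + 2 ** -(n + 1))   # its midpoint
    return out

u = random.Random(7).random()
approx = conditional_expectations(u, 30)
print(abs(approx[-1] - u) < 1e-8)  # the error is at most 2^{-31}: prints True
```

The error after $n$ levels is at most $2^{-(n+1)}$, the half-width of the dyadic interval.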

82 9.7 Optional stopping theorem

We explore the behaviour of martingales at stopping times. Recall from Corollary 9.29 that if $X_n$ is a martingale and $T$ a stopping time such that $P[T \le c] = 1$ for some $c < \infty$, then $E[X_T] = E[X_0]$. For unbounded stopping times such an equality fails in general (consider the simple random walk $(S_n)_{n\ge0}$ and the stopping time $T = \inf\{k : S_k = -1\}$), but it holds under additional assumptions.

Theorem 9.60. If (Xn)n≥0 is a UI submartingale, then for any stopping time T ≤ ∞,

$$E X_0 \le E X_T \le E X_\infty,$$
where $X_\infty = \lim_{n\to\infty} X_n$. To prove the theorem we need a simple lemma.

Lemma 9.61. If (Xn)n≥0 is a UI submartingale and T ≤ ∞ a stopping time, then the sequence (Xn∧T )n≥0 is UI.

Proof. By Proposition 9.16, $(X_n^+)_{n\ge0}$ is a submartingale, which is also UI. By Corollary 9.29, $E X_{T\wedge n}^+ \le E X_n^+$. Since $(X_n^+)_{n\ge0}$ is UI, we have
$$\sup_n E X_{T\wedge n}^+ \le \sup_n E X_n^+ < \infty,$$
so by the martingale convergence theorem (Theorem 9.33), $X_{T\wedge n} \to X_T$, $P$-a.s., and $E|X_T| < \infty$. Then,
$$E\big[|X_{T\wedge n}|;\ |X_{T\wedge n}| \ge M\big] \le E\big[|X_T|;\ |X_T| \ge M\big] + E\big[|X_n|;\ |X_n| \ge M\big],$$
and the claim follows from the UI property of $(X_n)_{n\ge0}$ and the fact that $E|X_T| < \infty$.

Proof of Theorem 9.60. By Corollary 9.29, $E X_0 \le E X_{T\wedge n} \le E X_n$. Letting $n \to \infty$, and observing that $X_n \to X_\infty$ in $L^1$ and $X_{T\wedge n} \to X_T$ in $L^1$ by Lemma 9.61 and Theorem 9.57, yields the desired result.

Corollary 9.62. If S ≤ T are two stopping times and X is a UI submartingale, then EXS ≤ EXT .

Proof. Use the inequality EYS ≤ EY∞ for Yn = XT ∧n.

Applications of the optional stopping theorem. Let $S_n = X_1 + \cdots + X_n$, with $X_i$ i.i.d. and $P[X_i = 1] = 1 - P[X_i = -1] = p$, be a random walk with drift. In Examples 9.4-9.7 we have seen several martingales related to this process. For $x \in \mathbb{Z}$, define $T_x = \inf\{k \ge 0 : S_k = x\}$, the hitting time of $x$. We can use the optional stopping theorem to study the exit of $S_n$ from an interval:

Claim 9.63. Let a < 0 < b ∈ Z. Then

$$P[T_a < T_b] = \begin{cases} \dfrac{b}{b-a}, & \text{if } p = \tfrac12,\\[1ex] \dfrac{\varphi(b) - \varphi(0)}{\varphi(b) - \varphi(a)}, & \text{if } p \ne \tfrac12,\end{cases}$$
where $\varphi(x) = \big(\frac{1-p}{p}\big)^x$, cf. Example 9.7.

Proof. We consider only the case $p > 1/2$; the proof for $p = 1/2$ is completely analogous, using the martingale from Example 9.4. Set $T = T_a \wedge T_b$ and consider the process $M_n = \varphi(S_{n\wedge T})$, which is a martingale by Example 9.7 and Corollary 9.29. This martingale is bounded and hence UI; it thus converges a.s. and in $L^1$. Moreover, one sees easily that the limit cannot lie in the set $\{\varphi(x) : a < x < b\}$, and thus $M_\infty \in \{\varphi(a), \varphi(b)\}$ and $T < \infty$ a.s. The optional stopping theorem then yields
$$\varphi(0) = E[M_\infty] = E[\varphi(S_T)]$$

= ϕ(a)P [ST = a] + ϕ(b)P [ST = b]

= ϕ(a)P [Ta < Tb] + ϕ(b)(1 − P [Ta < Tb]).

Solving for $P[T_a < T_b]$ yields the claim.

Exercise 9.64. Use the above method to show the following:

(a) If p > 1/2 and a < 0, then P [minn Sn ≤ a] = ϕ(−a).

(b) If $p > 1/2$ and $b > 0$, then $E T_b = \frac{b}{2p-1}$.

(c) Let $\psi(\theta) = \log E[\exp\{\theta X_1\}]$. Then $M_n^\theta = \exp\{\theta S_n - n\psi(\theta)\}$ is a martingale, and the generating function of $T_1$ is given by
$$E s^{T_1} = \frac{1 - \sqrt{1 - 4p(1-p)s^2}}{2(1-p)s}, \qquad s \in (0, 1].$$
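Claim 9.63 is easy to test by Monte Carlo. The sketch below (function names are ours) compares the formula with simulated exit probabilities for $p = 0.6$, $a = -3$, $b = 4$.

```python
import random

def hit_a_before_b(a, b, p, n_paths, seed=5):
    """Monte Carlo estimate of P[T_a < T_b] for the walk with
    P[step = +1] = p, started at 0, where a < 0 < b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        s = 0
        while a < s < b:
            s += 1 if rng.random() < p else -1
        hits += (s == a)
    return hits / n_paths

p, a, b = 0.6, -3, 4
phi = lambda x: ((1 - p) / p) ** x          # phi(x) = ((1-p)/p)^x
exact = (phi(b) - phi(0)) / (phi(b) - phi(a))
est = hit_a_before_b(a, b, p, 20000)
print(round(exact, 3), round(est, 3))       # the two should agree closely
```

For these parameters the formula gives roughly $0.253$, and the simulation reproduces it within Monte Carlo error.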

9.8 Martingale central limit theorem*

TBD

10 Constructions of processes

Up to now we have not paid much attention to how to explicitly construct the stochastic processes we are dealing with. In this chapter we present two general techniques that guarantee the existence of $(X_i)_{i\ge1}$ as a sequence of random variables on some probability space $(\Omega, \mathcal{F}, P)$.

10.1 Semi-direct product

[[TODO: This section should be expanded, see the handwritten notes]]

We have seen in Section 8.1 how stochastic kernels naturally arise when considering regular versions of conditional probabilities. We now go in the opposite direction and use stochastic kernels to construct measures on 'larger' spaces.

Consider a probability space $(\Omega_1, \mathcal{A}_1, P_1)$. Let $\kappa$ be a stochastic kernel (as in Definition 8.19) from $(\Omega_1, \mathcal{A}_1)$ to some other measurable space $(\Omega_2, \mathcal{A}_2)$. Let $\Omega = \Omega_1 \times \Omega_2$, $\mathcal{A} = \mathcal{A}_1 \otimes \mathcal{A}_2$, and let $X_i : \Omega \to \Omega_i$, $i = 1, 2$, be the canonical projections. The semi-direct product $P = P_1 \times \kappa$ is the probability measure on $(\Omega, \mathcal{A})$ uniquely determined by
$$(10.1)\qquad P(A_1 \times A_2) = \int_{A_1} \kappa(\omega_1, A_2)\, P_1(d\omega_1), \qquad A_1 \in \mathcal{A}_1,\ A_2 \in \mathcal{A}_2.$$
Then, for every $Y \in L^1(\Omega, P)$,
$$E^P[Y] = \int_{\Omega_1} \int_{\Omega_2} Y((\omega_1, \omega_2))\, \kappa(\omega_1, d\omega_2)\, P_1(d\omega_1).$$
The following lemma shows the relation to the construction of Section 8.1.

Lemma 10.2. Let $Y \in L^1(\Omega, P)$. Then
$$E^P[Y \mid \sigma(X_1)](\omega) = \int_{\Omega_2} Y(\omega_1, \omega')\, \kappa(\omega_1, d\omega'),$$
for $\omega = (\omega_1, \omega_2)$. In particular, $\kappa$ is the regular conditional distribution of $X_2$ given $\sigma(X_1)$.

Proof. The first claim is checked directly from (10.1); the second follows by taking $Y = 1\{\omega_2 \in A\}$ with $A \in \mathcal{A}_2$.

10.2 Ionescu-Tulcea Theorem

The basic idea of our first construction of a stochastic process is to specify, for every $n \ge 1$, a stochastic kernel $\kappa_n$ which describes the conditional distribution of $X_n$ given $X_0, \dots, X_{n-1}$, together with a starting distribution for the random variable $X_0$. We will see that these input data are sufficient to construct a stochastic process $(X_n)_{n\ge0}$ whose distribution is uniquely determined.

We work in a slightly more general setting, where the random variables $X_i$ need not take values in the same space. We thus consider a sequence $(S_i, \mathcal{S}_i)_{i\ge0}$ of measurable spaces and define
$$\Omega_0 = S_0, \quad \Omega_1 = S_0 \times S_1, \quad \dots, \quad \Omega_n = S_0 \times \cdots \times S_n,$$
$$\mathcal{F}_0 = \mathcal{S}_0, \quad \mathcal{F}_1 = \mathcal{S}_0 \otimes \mathcal{S}_1, \quad \dots, \quad \mathcal{F}_n = \mathcal{S}_0 \otimes \cdots \otimes \mathcal{S}_n.$$
The input data for the construction are

• a probability measure P0 on (S0, S0) viewed as the starting distribution,

• a sequence of stochastic kernels $\kappa_n$ from $(\Omega_{n-1}, \mathcal{F}_{n-1})$ to $(S_n, \mathcal{S}_n)$, giving the transition probabilities.

Using these ingredients and the semi-direct construction (10.1), we can define probabilities $Q_n$ on $(\Omega_n, \mathcal{F}_n)$ by

Q0 = P0

Q1 = P0 × κ1, ...

$Q_{n+1} = Q_n \times \kappa_{n+1}$.

The required stochastic process will be constructed with the help of a probability measure on the countably-infinite product space
$$(10.3)\qquad \Omega = \prod_{i\ge0} S_i = \{\omega = (x_0, x_1, \dots) : x_i \in S_i\ \forall i \ge 0\},$$
endowed with the canonical coordinates
$$(10.4)\qquad X_i(\omega) = x_i \in S_i \quad \text{for } \omega = (x_0, x_1, \dots) \in \Omega,$$
and the product $\sigma$-algebra $\mathcal{F}$,
$$\mathcal{F} = \sigma(X_n : n \ge 0) = \sigma\big(A_0 \times \cdots \times A_k \times S_{k+1} \times \cdots : k \ge 0,\ A_i \in \mathcal{S}_i,\ 0 \le i \le k\big) = \sigma\big(A \times S_{k+1} \times \cdots : k \ge 0,\ A \in \mathcal{F}_k\big).$$
We also define the canonical projections
$$(10.5)\qquad \pi_n : \Omega \to \Omega_n, \qquad \pi_n(\omega) = (x_0, \dots, x_n) \in \Omega_n \quad \text{for } \omega = (x_0, x_1, \dots) \in \Omega.$$

86 Theorem 10.6 (Ionescu-Tulcea). There is a unique probability measure Q on (Ω, F) so that for every n ≥ 0

(10.7) (πn)#Q = Qn

In particular, the conditional distribution of $X_n$ given $\sigma(X_0, \dots, X_{n-1})$ is given by $\kappa_n$, and thus for every bounded measurable function $f$ on $(\Omega_n, \mathcal{F}_n)$,
$$(10.8)\qquad E^Q[f(X_0, \dots, X_n)] = \int_{S_0} P_0(dx_0) \int_{S_1} \kappa_1(x_0, dx_1) \cdots \int_{S_n} \kappa_n(x_0, \dots, x_{n-1}, dx_n)\, f(x_0, x_1, \dots, x_n).$$

Proof. The second claim of the theorem is a direct consequence of the first one, using the semi-direct product construction (10.1). To show the uniqueness, observe that (10.8) uniquely determines the measure $Q$ on the collection $\mathcal{B} = \{A \times S_{k+1} \times \cdots : k \ge 0,\ A \in \mathcal{F}_k\}$. Since $\mathcal{B}$ is a $\pi$-system and $\sigma(\mathcal{B}) = \mathcal{F}$, the uniqueness of $Q$ follows by Dynkin's lemma.

It remains to show the existence of $Q$. In accordance with (10.7), we define $Q$ on $\mathcal{B}$ by
$$(10.9)\qquad Q(A \times S_{k+1} \times \cdots) = Q_k(A), \qquad \text{for every } k \ge 0,\ A \in \mathcal{F}_k.$$

We first check that (10.9) is well-defined on $\mathcal{B}$. To this end we need to verify that for $0 \le l \le n$ and $A \in \mathcal{F}_l$, $B \in \mathcal{F}_n$ with
$$A \times S_{l+1} \times \cdots = B \times S_{n+1} \times \cdots,$$
we have $Q_l(A) = Q_n(B)$. This is trivial when $n = l$, since then necessarily $A = B$. When $n > l$, then
$$Q_{l+1}(A \times S_{l+1}) = (Q_l \times \kappa_{l+1})(A \times S_{l+1}) = \int_A \kappa_{l+1}(x_0, \dots, x_l; S_{l+1})\, Q_l(dx_0, \dots, dx_l) = Q_l(A).$$
By induction, $Q_n(A \times S_{l+1} \times \cdots \times S_n) = Q_l(A)$. As $B = A \times S_{l+1} \times \cdots \times S_n$, the claim follows for $n > l$ as well.

By definition, the collection $\mathcal{B}$ is an algebra (i.e. $\Omega \in \mathcal{B}$; $B \in \mathcal{B} \Rightarrow B^c \in \mathcal{B}$; $B_i \in \mathcal{B}$, $i = 1, \dots, n \Rightarrow \cup_{i=1}^n B_i \in \mathcal{B}$), and $Q(\Omega) = 1$. The function $Q$ is also additive on $\mathcal{B}$, i.e. $Q(B_1 \cup B_2) = Q(B_1) + Q(B_2)$ for disjoint $B_1, B_2 \in \mathcal{B}$: for every pair $B_1, B_2 \in \mathcal{B}$ we can find $k \ge 0$ and $A_1, A_2 \in \mathcal{F}_k$ such that $B_i = A_i \times S_{k+1} \times \cdots$, $i = 1, 2$, and therefore $Q(B_1 \cup B_2) = Q_k(A_1 \cup A_2) = Q_k(A_1) + Q_k(A_2) = Q(B_1) + Q(B_2)$ by the additivity of $Q_k$. The existence of a probability measure $Q$ on $(\Omega, \mathcal{F})$ extending the additive set function $Q$ on $\mathcal{B}$ will then follow from the Carathéodory extension theorem, once we show that $Q$ is $\sigma$-additive on $\mathcal{B}$, that is:

Claim 10.10. When $B_i \in \mathcal{B}$, $i \in \mathbb{N}$, are pairwise disjoint and $B = \bigcup_{i\ge1} B_i \in \mathcal{B}$ (!), then $Q(B) = \sum_{i\ge1} Q(B_i)$.

Proof. The proof starts with two reductions. First, setting $\tilde B_n = B \setminus (\cup_{i=1}^n B_i)$, $n \ge 0$, we have $\tilde B_n \downarrow \emptyset$ and, by additivity, $Q(B) = Q(\tilde B_n) + \sum_{i=1}^n Q(B_i)$. Hence the claim will follow once we show:
$$(10.11)\qquad \text{for every decreasing sequence } \tilde B_n \in \mathcal{B} \text{ with } \tilde B_n \downarrow \emptyset, \quad \lim_{n\to\infty} Q(\tilde B_n) = 0.$$
Second, for any $\tilde B_n$ as in (10.11) we may construct another sequence $\bar B_k = A_k \times S_{k+1} \times \cdots$, $A_k \in \mathcal{F}_k$, with $\bar B_k \downarrow \emptyset$, such that $(\tilde B_n)$ is a subsequence of $(\bar B_k)$. (10.11) thus follows from:
$$(10.12)\qquad \text{for every } \bar B_k = A_k \times S_{k+1} \times \cdots,\ A_k \in \mathcal{F}_k,\ k \ge 1, \text{ with } \bar B_k \downarrow \emptyset, \quad \lim_{k\to\infty} Q(\bar B_k) = 0.$$
Assume now, by contradiction, that (10.12) fails for some sequence $(\bar B_k)_k$, that is,
$$(10.13)\qquad \inf_{k\ge1} Q(\bar B_k) > \varepsilon > 0.$$

Since the $\bar B_k$ are decreasing, $A_{k+1} \subset A_k \times S_{k+1}$ for $k \ge 0$, and
$$Q(\bar B_k) = Q_k(A_k) = \int_{S_0} P_0(dx_0)\, \underbrace{\int_{S_1} \kappa_1(x_0, dx_1) \cdots \int_{S_k} \kappa_k(x_0, \dots, x_{k-1}, dx_k)\, 1_{A_k}(x_0, \dots, x_k)}_{=: f_{0,k}(x_0)},$$
and similarly $Q(\bar B_{k+1}) = Q_{k+1}(A_{k+1}) = \int_{S_0} P_0(dx_0)\, f_{0,k+1}(x_0)$. Since $(\bar B_k)$ is decreasing, we have $1_{A_{k+1}}(x_0, \dots, x_{k+1}) \le 1_{A_k}(x_0, \dots, x_k)\, 1_{S_{k+1}}(x_{k+1})$, and thus $f_{0,k}(x_0) \ge f_{0,k+1}(x_0)$ for every $x_0 \in S_0$, $k \ge 1$.

From (10.13) and the monotone convergence theorem we see that there is $\bar x_0 \in S_0$ such that
$$(10.14)\qquad \inf_{k\ge1} f_{0,k}(\bar x_0) = \inf_{k\ge1} \int_{S_1} \kappa_1(\bar x_0, dx_1) \cdots \int_{S_k} \kappa_k(\bar x_0, x_1, \dots, x_{k-1}, dx_k)\, 1_{A_k}(\bar x_0, x_1, \dots, x_k) > 0.$$
Define now

$$f_{1,k}(x_0, x_1) = \int_{S_2} \kappa_2(x_0, x_1, dx_2) \cdots \int_{S_k} \kappa_k(x_0, \dots, x_{k-1}, dx_k)\, 1_{A_k}(x_0, \dots, x_k).$$

Using similar steps as in the last paragraph, assumption (10.14) implies that there is $\bar x_1 \in S_1$ such that
$$(10.15)\qquad \inf_{k\ge2} f_{1,k}(\bar x_0, \bar x_1) > 0.$$

By induction we may then construct a sequence $(\bar x_k)_{k\ge0}$, $\bar x_k \in S_k$, such that for every $l \ge 0$,
$$\inf_{k>l} \int_{S_{l+1}} \kappa_{l+1}(\bar x_0, \dots, \bar x_l, dx_{l+1}) \cdots \int_{S_k} \kappa_k(\bar x_0, \dots, \bar x_l, x_{l+1}, \dots, x_{k-1}, dx_k)\, 1_{A_k}(\bar x_0, \dots, \bar x_l, x_{l+1}, \dots, x_k) > 0.$$
In particular, for $k = l + 1$,
$$0 < \int_{S_{l+1}} \kappa_{l+1}(\bar x_0, \dots, \bar x_l, dx_{l+1})\, \underbrace{1_{A_{l+1}}(\bar x_0, \dots, \bar x_l, x_{l+1})}_{\le\, 1_{A_l}(\bar x_0, \dots, \bar x_l)} \le 1_{A_l}(\bar x_0, \dots, \bar x_l),$$
and thus $(\bar x_0, \dots, \bar x_l) \in A_l$ for every $l \ge 0$. It follows that $\bar\omega := (\bar x_0, \bar x_1, \dots) \in \bar B_l$ for every $l \ge 1$, and thus $\bar\omega \in \bigcap_{l\ge1} \bar B_l$, in contradiction with $\bar B_l \downarrow \emptyset$. This completes the proof of Theorem 10.6.

As a first consequence of the Ionescu-Tulcea theorem we may prove the existence of countable independent sequences.

Corollary 10.16 (Product measure on $\Omega = \prod S_i$). For $i \ge 0$, let $\nu_i$ be a probability measure on $(S_i, \mathcal{S}_i)$. Then there exists a unique probability measure $Q$ on $(\Omega, \mathcal{F})$ such that $(\pi_n)_\# Q = \nu_0 \otimes \cdots \otimes \nu_n$ for all $n \ge 0$. It is denoted by $Q = \bigotimes_{i\ge0} \nu_i$ and called the product measure.

Proof. It suffices to choose the stochastic kernels
$$\kappa_i(x_0, \dots, x_{i-1}; dx_i) = \nu_i(dx_i), \qquad i \ge 1,$$
and $P_0 = \nu_0$. Then $Q_n = \nu_0 \otimes \cdots \otimes \nu_n$, and the claim follows directly from Theorem 10.6.
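Operationally, the content of the Ionescu-Tulcea construction is that a process can be sampled sequentially: draw $X_0$ from $P_0$, then each $X_k$ from the kernel $\kappa_k$ given the whole past. A minimal sketch (the sampler interface and the toy kernel are our invention):

```python
import random

def sample_process(p0_sampler, kernels, n, rng):
    """Sample (X_0, ..., X_n) with law Q_n = P_0 x k_1 x ... x k_n as in
    the Ionescu-Tulcea construction: draw X_0 from P_0, then each X_k
    from the kernel k_k given the whole past (x_0, ..., x_{k-1})."""
    xs = [p0_sampler(rng)]
    for k in range(1, n + 1):
        xs.append(kernels[k - 1](tuple(xs), rng))
    return xs

# toy example: X_0 ~ Uniform{0, 1}; the kernel adds a Bernoulli increment
# to the last coordinate (it may in principle depend on the whole past)
rng = random.Random(11)
p0 = lambda r: r.randrange(2)
kernel = lambda past, r: past[-1] + (1 if r.random() < 0.5 else 0)
path = sample_process(p0, [kernel] * 10, 10, rng)
print(path)  # a non-decreasing integer path of length 11
```

The theorem guarantees that this sequential sampling scheme really does define a unique measure on the infinite product space.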

10.3 Complement: Kolmogorov extension theorem

The Ionescu-Tulcea theorem allows the construction of probability measures on countable product spaces from a sequence of stochastic kernels. We now give another construction of stochastic processes which works for arbitrary products, at the price of an additional assumption on the 'components' $(S_i, \mathcal{S}_i)$. We consider an arbitrary index set $I$ and a collection of measurable spaces $(S_i, \mathcal{S}_i)_{i\in I}$. We assume

(10.17) (Si, Si) is a Borel space for every i ∈ I, cf. Definition 8.21.

Similarly as before, we define the product spaces
$$(10.18)\qquad \Omega_J = \prod_{i\in J} S_i, \qquad \mathcal{F}_J = \bigotimes_{i\in J} \mathcal{S}_i, \qquad \text{for } J \subset I,$$
and write $\Omega := \Omega_I$, $\mathcal{F} := \mathcal{F}_I$. For $I \supset J \supset K$ we let $\pi_{J,K} : \Omega_J \to \Omega_K$ be the canonical projection, and set $\pi_J := \pi_{I,J}$. Finally, let $F(I)$, resp. $G(I)$, be the set of all finite, resp. countable, subsets of $I$. The starting data for our construction is a collection of finite-dimensional distributions, that is, a family of probability measures $Q_J$ on $(\Omega_J, \mathcal{F}_J)$, $J \in F(I)$. We want to find a measure $Q$ on $(\Omega, \mathcal{F})$ whose finite-dimensional marginals are the $Q_J$'s, that is,
$$(10.19)\qquad (\pi_J)_\# Q = Q_J, \qquad \text{for all } J \in F(I).$$

Of course, there needs to be a consistency requirement on the $Q_J$'s: after all, given $K \subset J$, $Q_K$ is already determined by $Q_J$:
$$(10.20)\qquad (\pi_{J,K})_\# Q_J = Q_K.$$

It turns out that this is all we need:

Theorem 10.21. Let I, (Si, Si)i∈I , (QJ )J∈F (I) satisfy (10.17), (10.20). Then there exists a unique probability measure Q on (Ω, F) such that (10.19) holds.

Remark 10.22. It can be shown that the assumption (10.17) is necessary for the validity of the theorem.

Proof. We first assume that $I$ is countable. Without loss of generality we may then assume that $I = \mathbb{N}$, and write $\Omega_n = \Omega_{\{1,\dots,n\}}$, $Q_n = Q_{\{1,\dots,n\}}$, $\mathcal{F}_n = \mathcal{F}_{\{1,\dots,n\}}$, $\pi_n = \pi_{\{1,\dots,n\}}$. A finite product of Borel spaces is again a Borel space. By Theorem 8.22, there is a regular conditional distribution $\kappa_n$ of the $n$-th coordinate under $Q_n$ given $\mathcal{F}_{n-1}$, that is,
$$\kappa_n((x_1, \dots, x_{n-1}); A) = Q_n(A \mid \sigma(\pi_{n-1})).$$
From condition (10.20) it follows that $Q_n = Q_{n-1} \times \kappa_n$, and thus $Q_n = Q_1 \times \kappa_2 \times \cdots \times \kappa_n$. The Ionescu-Tulcea theorem then yields the existence of the required measure $Q$ on $(\Omega, \mathcal{F})$.

Assume now that $I$ is uncountable. Recall that the product $\sigma$-algebra $\mathcal{F}$ can be written as
$$\mathcal{F} = \bigcup_{J\in G(I)} \pi_J^{-1}(\mathcal{F}_J).$$
(To see that the union on the right-hand side is indeed a $\sigma$-algebra, recall that a countable union of countable sets is again countable.) By the first step of the proof, we can construct for every $J \in G(I)$ a measure $Q_J$ on $(\Omega_J, \mathcal{F}_J)$ such that $(\pi_{J,K})_\# Q_J = Q_K$ for every $K \in F(I)$ with $K \subset J$. For $A \in \pi_J^{-1}(\mathcal{F}_J)$ with $J \in G(I)$ we may set

Q(A) = QJ (πJ (A)).

This is a well-defined function: when $A \in \pi_{J_1}^{-1}(\mathcal{F}_{J_1}) \cap \pi_{J_2}^{-1}(\mathcal{F}_{J_2})$ for $J_1, J_2 \in G(I)$, then (10.20) implies that $Q_{J_1}(\pi_{J_1}(A)) = Q_{J_2}(\pi_{J_2}(A))$ (exercise). It remains to show that $Q$ is a probability measure on $\mathcal{F}$. Obviously $0 \le Q \le 1$ and $Q(\Omega) = 1$. Given a sequence $A_n \in \mathcal{F}$ of disjoint sets, let $J_n \in G(I)$ be such that $A_n \in \pi_{J_n}^{-1}(\mathcal{F}_{J_n})$. Then $J = \bigcup_n J_n \in G(I)$ and $A_n \in \pi_J^{-1}(\mathcal{F}_J)$. Hence, by the $\sigma$-additivity of $Q_J$,
$$Q(\cup_{n\ge1} A_n) = Q_J(\cup_{n\ge1} \pi_J A_n) = \sum_{n\ge1} Q_J(\pi_J A_n) = \sum_{n\ge1} Q(A_n).$$
This completes the proof.

11 Markov chains

The second important family of dependent random variables that we will study in this lecture is that of Markov chains.

11.1 Definition and first properties

Definition 11.1. Let $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n\ge0}, P)$ be a probability space with a filtration and let $(S, \mathcal{S})$ be a measurable space. A sequence $(X_n)_{n\ge0}$ of $S$-valued random variables is called a Markov chain with respect to $(\mathcal{F}_n)$ if

(a) $(X_n)$ is $(\mathcal{F}_n)$-adapted, and

(b) for all $B \in \mathcal{S}$ and $n \ge 0$,
$$P[X_{n+1} \in B \mid \mathcal{F}_n] = P[X_{n+1} \in B \mid \sigma(X_n)].$$

As an application of the results of the previous chapter we show that Markov chains exist:

Proposition 11.2 (Existence of canonical Markov chains). Let $(\Omega, \mathcal{F}) = (S^{\mathbb{N}}, \mathcal{S}^{\otimes\mathbb{N}})$, let $X_i : \Omega \to S$ be the canonical coordinates, and let $\mathcal{F}_n = \sigma(X_0, \dots, X_n)$. Consider a sequence of stochastic kernels $(\kappa_i)_{i\ge1}$ from $(S, \mathcal{S})$ to $(S, \mathcal{S})$ and a probability measure $\mu$ on $(S, \mathcal{S})$. Then there is a unique probability measure $P_\mu$ on $(\Omega, \mathcal{F})$ under which $(X_n)_{n\ge0}$ forms a Markov chain w.r.t. $(\mathcal{F}_n)$ such that $X_0$ is $\mu$-distributed and
$$P_\mu[X_{n+1} \in B \mid \mathcal{F}_n](\omega) = \kappa_{n+1}(X_n(\omega), B), \qquad P_\mu\text{-a.s.}$$
In particular, for every bounded measurable $f : S^{n+1} \to \mathbb{R}$, $n \ge 0$,
$$(11.3)\qquad E^{P_\mu}[f(X_0, \dots, X_n)] = \int_S \mu(dx_0) \int_S \kappa_1(x_0, dx_1) \cdots \int_S \kappa_n(x_{n-1}, dx_n)\, f(x_0, \dots, x_n).$$

Proof. The claim follows directly from the Ionescu-Tulcea theorem by taking the kernel $\kappa_n(x_0, \dots, x_{n-1}; dx_n)$ of that theorem to be independent of $x_0, \dots, x_{n-2}$ and equal to $\kappa_n(x_{n-1}, dx_n)$.

When $\kappa_n = \kappa$ for some fixed probability kernel $\kappa$ on $(S, \mathcal{S})$, the Markov chain is called time-homogeneous; from now on we consider only this case. For $\mu = \delta_x$ we write $P_x$ instead of $P_{\delta_x}$.

Example 11.4. We have already seen many examples of Markov chains: random walks, the Galton-Watson process, the renewal process, Pólya's urn, . . . [[TODO: extend]]

Exercise 11.5. The Markov chains from the previous example are mostly non-canonical. How can you construct their canonical versions?

We now consider a time-homogeneous canonical Markov chain given by a transition kernel $\kappa$ and an initial distribution $\mu$, as constructed in Proposition 11.2. For $n \ge 0$, we introduce the shift operator $\theta_n : \Omega \to \Omega$ by
$$\theta_n((\omega_0, \omega_1, \dots)) = (\omega_n, \omega_{n+1}, \dots),$$
i.e. $\theta_n$ 'erases the past before time $n$'. The definition of a Markov chain requires that, conditionally on $X_n$, the 'near future', that is $X_{n+1}$, is independent of the past. We now extend this to the 'whole future':

Proposition 11.6 (Markov property). (a) The map $(x, B) \in S \times \mathcal{F} \mapsto P_x(B)$ is a stochastic kernel from $(S, \mathcal{S})$ to $(\Omega, \mathcal{F})$.

(b) For every $n \ge 0$ and every bounded random variable $Y$,
$$E^{P_\mu}[Y \circ \theta_n \mid \mathcal{F}_n](\omega) = E^{P_{X_n(\omega)}}[Y], \qquad P_\mu\text{-a.s.}$$

(c) In particular, for $C \in \sigma(X_n, X_{n+1}, \dots)$,
$$E^{P_\mu}[1_C \mid \mathcal{F}_n] = E^{P_\mu}[1_C \mid \sigma(X_n)], \qquad P_\mu\text{-a.s.}$$

Proof. (a) $B \mapsto P_x(B)$ is a probability measure for every $x \in S$ by construction. The measurability of $x \mapsto P_x(B)$ follows from formula (11.3) and Dynkin's argument.

(b) Let $A \in \mathcal{F}_n$ and let $f : S^{k+1} \to \mathbb{R}$ be a bounded measurable function. Then, by (11.3) and an easy computation,
$$E^{P_\mu}[1_A f(X_n, \dots, X_{n+k})] = E^{P_\mu}\big[1_A\, E^{P_{X_n}}[f(X_0, \dots, X_k)]\big].$$
In particular, for every $B \in \mathcal{B} = \bigcup_{k\ge0} \sigma(X_0, \dots, X_k)$,
$$E^{P_\mu}[1_A\, 1_B \circ \theta_n] = E^{P_\mu}\big[1_A\, P_{X_n}[B]\big].$$
$\mathcal{B}$ is a $\pi$-system generating $\mathcal{F}$, so the last display actually holds for all $B \in \mathcal{F}$. Hence, by the definition of the conditional expectation,
$$E^{P_\mu}[1_B \circ \theta_n \mid \mathcal{F}_n](\omega) = P_{X_n(\omega)}[B], \qquad P_\mu\text{-a.s.}$$
The claim (b) then follows by the usual approximation procedure.

(c) It suffices to write $C \in \sigma(X_n, \dots)$ as $C = \theta_n^{-1}(B)$ for some $B \in \mathcal{F}$ and to apply claim (b).

As an easy corollary we get

Proposition 11.7 (Chapman-Kolmogorov equation). For every n, m ∈ N, x ∈ S and A ∈ S, Z Px[Xm+n ∈ A] = Px(Xn ∈ dy)Py(Xm ∈ A). S

93 Proof. By Proposition 11.6(c),

Px Px(Xm+n ∈ A) = E [Px[Xm+n ∈ A|Fn]] = Ex[PXn (Xm ∈ A)] Z = Px(Xn ∈ Dy)Py(Xm ∈ A), S as claimed. Proposition 11.6 deals with the ’future’ of Markov chains after deterministic times. We now extend this proposition to certain random times. Recall that a N ∪ {∞}-valued random variable T is called Fn-stopping time when {T ≤ n} ∈ Fn for every n ≥ 0. Given a stopping time T we define σ-algebra FT by

FT = {A ∈ F : A ∩ {T = n} ∈ Fn for every n ≥ 0}.

F_T should be viewed as the σ-algebra describing the past relative to T.

Proposition 11.8 (strong Markov property). For every bounded random variable Y and every stopping time T,

E_µ[Y ∘ θ_T | F_T] = E_{X_T}[Y],   P_µ-a.s. on {T < ∞}.

In this proposition, Y ∘ θ_T should be understood as Y(θ_{T(ω)}(ω)) on {T < ∞}, and as zero otherwise. Similarly, E_{X_T}[Y] stands for E_{X_{T(ω)}(ω)}[Y] on {T < ∞}.

Proof. We verify the two defining properties of the conditional expectation. On {T = n}, E_{X_T}[Y] = E_{X_n}[Y], which is F_n-measurable; therefore E_{X_T}[Y] is F_T-measurable. Moreover, for A ∈ F_T,

E_µ[Y ∘ θ_T 1_{A∩{T<∞}}] = Σ_{n≥0} E_µ[Y ∘ θ_T 1_{A∩{T=n}}] = Σ_{n≥0} E_µ[Y ∘ θ_n 1_{A∩{T=n}}],

where A ∩ {T = n} ∈ F_n. By the Markov property this equals

Σ_{n≥0} E_µ[E_{X_n}[Y] 1_{A∩{T=n}}] = Σ_{n≥0} E_µ[E_{X_T}[Y] 1_{A∩{T=n}}] = E_µ[E_{X_T}[Y] 1_{A∩{T<∞}}].

This completes the proof.
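On a finite state space the Chapman-Kolmogorov equation is simply the matrix identity κ^{m+n} = κ^n κ^m for the transition matrix κ. A minimal numerical sketch (the 3-state matrix below is an arbitrary example, not one from the text):

```python
# Numerical check of the Chapman-Kolmogorov equation on a finite state
# space, where it reads kappa^(m+n) = kappa^n * kappa^m.  The transition
# matrix below is an arbitrary example.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, n):
    """n-th matrix power, n >= 1."""
    out = p
    for _ in range(n - 1):
        out = mat_mul(out, p)
    return out

# kappa[x][y] = P_x[X_1 = y]
kappa = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.4, 0.4, 0.2]]

n, m = 3, 4
lhs = mat_pow(kappa, n + m)                          # P_x[X_{n+m} = y]
rhs = mat_mul(mat_pow(kappa, n), mat_pow(kappa, m))  # integrate over the state at time n
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(3) for j in range(3))
```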

11.2 Invariant measures of Markov chains

We want to understand here the asymptotic behaviour of Markov chains. To simplify the matter, we assume that the state space S is at most countable and S = P(S). To make the situation even easier, we assume that the Markov chain (X_n) is irreducible, that is, for every x, y ∈ S there is n ≥ 1 such that P_x[X_n = y] > 0. From the lecture 'Stochastic processes' you know that this is not a very restrictive assumption: if X is not irreducible, it is possible to restrict it to certain subsets of S on which it is irreducible.

Definition 11.9. A state x ∈ S is called recurrent if

P_x[X_n = x for infinitely many n] = 1.

It is called transient if

P_x[X_n = x for infinitely many n] = 0.

We first show that there is no other possibility. To this end, let H_x and H̃_x be the hitting time, resp. the first return time, of x:

H_x = inf{n ≥ 0 : X_n = x},   H̃_x = inf{n ≥ 1 : X_n = x}.

Theorem 11.10. The following dichotomy holds:

(i) If P_x[H̃_x < ∞] = 1, then x is recurrent and Σ_{n≥0} P_x[X_n = x] = ∞.

(ii) If P_x[H̃_x < ∞] < 1, then x is transient and Σ_{n≥0} P_x[X_n = x] < ∞.

Proof. Let H^k_x, k ≥ 0, be the times of successive visits to x, defined by H^0_x = 0 and

(11.11)   H^{k+1}_x = H^k_x + H̃_x ∘ θ_{H^k_x} on {H^k_x < ∞},   H^{k+1}_x = ∞ otherwise.

The key observation of the proof is the fact that the increments H^n_x − H^{n−1}_x are essentially i.i.d. More precisely, define

S_n = H^n_x − H^{n−1}_x on {H^{n−1}_x < ∞},   S_n = 0 otherwise.

Then, conditionally on {H^{n−1}_x < ∞}, S_n is independent of F_{H^{n−1}_x}, and

(11.12)   P[S_n = k | H^{n−1}_x < ∞] = P_x[H̃_x = k],   k ∈ N ∪ {∞}.

Indeed, this follows from the strong Markov property, observing that S_n = H̃_x ∘ θ_{H^{n−1}_x} on {H^{n−1}_x < ∞}.

Let N_x be the total number of returns to x, N_x := Σ_{n≥1} 1{X_n = x}, and set f_x = P_x[H̃_x < ∞]. By (11.12),

P_x[N_x ≥ k + 1] = P_x[H^{k+1}_x < ∞] = P_x[H^k_x < ∞, S_{k+1} < ∞]
  = P_x[S_{k+1} < ∞ | H^k_x < ∞] P_x[H^k_x < ∞] = P_x[H̃_x < ∞] P_x[N_x ≥ k]
  = f_x P_x[N_x ≥ k].

By an easy induction argument it then follows that for f_x = 1 we have P_x[N_x = ∞] = 1, and thus x is recurrent. On the other hand, for f_x < 1, the number of returns N_x is geometrically distributed, P_x[N_x = k] = f_x^k (1 − f_x), and thus finite a.s.

Finally, the last parts of both claims follow from E_x[N_x] = Σ_{n≥1} P_x[X_n = x].
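For a concrete transient example, take the walk on Z that steps up with probability p = 0.7: a gambler's-ruin computation gives f_0 = P_0[H̃_0 < ∞] = 2(1 − p) = 0.6 < 1, so by the theorem the walk is transient. A Monte Carlo sketch (truncating excursions is an approximation of the event {H̃_0 = ∞}):

```python
# Monte Carlo estimate of f_0 = P_0[return to 0] for the biased walk on Z
# with up-probability p = 0.7.  Gambler's ruin gives f_0 = 2(1-p) = 0.6,
# so the walk is transient (Theorem 11.10(ii)).  Excursions are truncated
# at `cap` steps, which slightly underestimates f_0.
import random

random.seed(0)
p, cap, trials = 0.7, 200, 10000
returned = 0
for _ in range(trials):
    pos = 0
    for _ in range(cap):
        pos += 1 if random.random() < p else -1
        if pos == 0:
            returned += 1
            break
f0_hat = returned / trials
assert abs(f0_hat - 0.6) < 0.05
```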

For irreducible Markov chains the recurrence and transience are global properties:

Lemma 11.13. If (X_n) is irreducible and there is x ∈ S which is recurrent (resp. transient), then all y ∈ S are recurrent (resp. transient).

Proof. It is sufficient to show that 'x is recurrent' implies 'y is recurrent' for all x, y ∈ S. By irreducibility, there are k, l > 0 such that P_x[X_k = y] > 0 and P_y[X_l = x] > 0. Further, by the Chapman-Kolmogorov equation,

P_y[X_{l+n+k} = y] ≥ P_y[X_l = x] P_x[X_n = x] P_x[X_k = y].

Hence,

Σ_{n=k+l}^∞ P_y[X_n = y] ≥ P_y[X_l = x] P_x[X_k = y] Σ_{n≥0} P_x[X_n = x].

The sum on the right-hand side is infinite by Theorem 11.10(i), and thus the left-hand side is infinite as well. Another application of Theorem 11.10 then yields the claim.

[[TODO: Examples]]

In order to understand the asymptotic behaviour of Markov chains, the following object plays the key role.

Definition 11.14. A measure π on S is called invariant for the Markov chain (X_n) with transition kernel κ if πκ = π, where (πκ)(A) = ∫_S κ(x, A) π(dx). For countable S this is equivalent to

(11.15)   π(y) = Σ_{x∈S} π(x) P_x[X_1 = y].

When π is a probability measure, we call it an invariant distribution.

We are interested in the existence and uniqueness of invariant measures and distributions. The following proposition constructs invariant measures in the recurrent case.

Proposition 11.16. Let n_x(y) = E_x[Σ_{n=1}^{H̃_x} 1{X_n = y}] be the mean number of visits to y before returning to x. If x is recurrent, then

(i) n_x(x) = 1,

(ii) n_x(·) is an invariant measure for (X_n),

(iii) if X is irreducible, then n_x(y) ∈ (0, ∞) for all y ∈ S.

Proof. (i) is obvious from the definition. To show (ii) we write

n_x(y) = E_x[Σ_{n=1}^{H̃_x} 1{X_n = y}]
       = Σ_{n=1}^∞ E_x[1{X_n = y, n ≤ H̃_x}]
       = Σ_{n=1}^∞ Σ_{z∈S} P_x[X_n = y, X_{n−1} = z, n − 1 < H̃_x]
       = Σ_{z∈S} P_z[X_1 = y] Σ_{n=1}^∞ P_x[X_{n−1} = z, n − 1 < H̃_x]
       = Σ_{z∈S} P_z[X_1 = y] E_x[Σ_{m=0}^{H̃_x − 1} 1{X_m = z}],

where on the last line we made the trivial change of variables m = n − 1. Since x is recurrent, H̃_x is P_x-a.s. finite and P_x[X_0 = X_{H̃_x} = x] = 1. This implies that under P_x,

Σ_{m=0}^{H̃_x − 1} 1{X_m = z} = Σ_{m=1}^{H̃_x} 1{X_m = z}.

Inserting this into the previous computation we see that

n_x(y) = Σ_{z∈S} P_z[X_1 = y] E_x[Σ_{m=1}^{H̃_x} 1{X_m = z}] = Σ_{z∈S} P_z[X_1 = y] n_x(z).

This shows that n_x(·) is invariant.

To show (iii), observe first that applying (11.15) inductively shows that every invariant measure satisfies

(11.17)   π(y) = Σ_{z∈S} π(z) P_z[X_k = y],   k ≥ 1.

By irreducibility there are k, l > 0 such that P_x[X_k = y] > 0 and P_y[X_l = x] > 0. Hence, using (ii) and (11.17), n_x(y) = Σ_z n_x(z) P_z[X_k = y] ≥ n_x(x) P_x[X_k = y] > 0. On the other hand, 1 = n_x(x) = Σ_z n_x(z) P_z[X_l = x] ≥ n_x(y) P_y[X_l = x], which easily yields n_x(y) < ∞.

To show the uniqueness of invariant measures we will need:

Lemma 11.18. Assume that π is an invariant measure such that π(x) = 1. Then π(y) ≥ nx(y) for all y ∈ S.

Proof. To simplify the notation, write p_{xy} = P_x[X_1 = y]. Then, using the invariance of π repeatedly and splitting off the terms with y_i = x (recall π(x) = 1),

π(z) = Σ_{y_1∈S} π(y_1) p_{y_1 z} = Σ_{y_1≠x} π(y_1) p_{y_1 z} + p_{xz}
     = Σ_{y_1,y_2≠x} π(y_2) p_{y_2 y_1} p_{y_1 z} + Σ_{y_1≠x} p_{x y_1} p_{y_1 z} + p_{xz}
     = ··· = Σ_{y_1,...,y_k≠x} π(y_k) p_{y_k y_{k−1}} ··· p_{y_1 z} + Σ_{y_1,...,y_{k−1}≠x} p_{x y_{k−1}} ··· p_{y_1 z} + ··· + p_{xz}.

Using the fact that the first sum is always non-negative, this is bounded from below by

P_x[X_k = z, H̃_x ≥ k] + ··· + P_x[X_1 = z, H̃_x ≥ 1] = E_x[Σ_{l=1}^k 1{X_l = z, H̃_x ≥ l}].

Letting k tend to ∞ we obtain π(z) ≥ n_x(z), completing the proof.

Corollary 11.19. Assume that (X_n) is irreducible and recurrent and that π is an invariant measure with π(x) = 1. Then π(y) = n_x(y) for all y ∈ S.

Proof. Since π and n_x are invariant, the measure λ = π − n_x is invariant as well; moreover, λ ≥ 0 by Lemma 11.18 and λ(x) = 0. Fix y ∈ S. By (11.17), 0 = λ(x) = Σ_{z∈S} λ(z) P_z[X_l = x] ≥ λ(y) P_y[X_l = x]. Using the irreducibility, we can fix l such that the last probability is positive, which implies λ(y) = 0. Hence n_x = π, as required.

We now turn our attention to the existence and uniqueness of invariant distributions.

Definition 11.20. Let x be a recurrent state of a Markov chain (X_n). It is called positively recurrent when E_x[H̃_x] < ∞. Otherwise it is called null-recurrent.

Theorem 11.21. Let (X_n) be irreducible. Then the following are equivalent:

(i) There is x ∈ S which is positively recurrent.

(ii) All x ∈ S are positively recurrent.

(iii) (X_n) has an invariant distribution π.

Moreover, the π of (iii) is unique and given by π(y) = n_x(y) / E_x[H̃_x].

Proof. (ii) =⇒ (i) is trivial.

(i) =⇒ (iii): Since x is positively recurrent and thus recurrent, n_x is an invariant measure. Moreover, Σ_{y∈S} n_x(y) = Σ_{y∈S} E_x[Σ_{n=1}^{H̃_x} 1{X_n = y}] = E_x[H̃_x] < ∞. Therefore n_x(·)/E_x[H̃_x] is an invariant distribution.

(iii) =⇒ (ii): Let x ∈ S be arbitrary. Since π is a probability measure, there is y ∈ S with π(y) > 0. By irreducibility, P_y[X_n = x] > 0 for some n > 0, and thus π(x) ≥ π(y) P_y[X_n = x] > 0. Set λ(y) = π(y)/π(x). Then λ is invariant, λ(x) = 1, and thus λ ≥ n_x by Lemma 11.18. Hence,

E_x[H̃_x] = Σ_{y∈S} n_x(y) ≤ Σ_{y∈S} λ(y) = Σ_{y∈S} π(y)/π(x) = 1/π(x) < ∞.

This implies that x is positively recurrent.

It remains to show the uniqueness of π. Since x is recurrent, the inequality in the last display is an equality by Corollary 11.19, and thus

π(x) = 1 / E_x[H̃_x].

This completes the proof.

We close this section by a few examples and exercises illustrating the situation for null-recurrent and transient Markov chains.
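Before the examples, the identity π(x) = 1/E_x[H̃_x] lends itself to a direct numerical check: compute π by iterating (11.15) and estimate the mean return time by simulation. The 3-state chain below is an arbitrary example.

```python
# Check pi(x) = 1 / E_x[H~_x] (Theorem 11.21) for an arbitrary 3-state
# chain: pi via power iteration of the transition matrix, the mean return
# time E_0[H~_0] via Monte Carlo.
import random

random.seed(2)
kappa = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.4, 0.4, 0.2]]

def step(x):
    """Sample one transition from state x."""
    u, acc = random.random(), 0.0
    for y, p in enumerate(kappa[x]):
        acc += p
        if u < acc:
            return y
    return len(kappa) - 1

# invariant distribution by iterating pi <- pi * kappa
pi = [1 / 3] * 3
for _ in range(200):
    pi = [sum(pi[x] * kappa[x][y] for x in range(3)) for y in range(3)]

# mean return time to state 0 by simulation
trials, total = 50000, 0
for _ in range(trials):
    y, t = step(0), 1
    while y != 0:
        y, t = step(y), t + 1
    total += t
mean_return = total / trials
assert abs(mean_return - 1 / pi[0]) < 0.1   # 1/pi(0) = 3.45 for this matrix
```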

Example 11.22. Let (X_n) be the simple random walk on Z. (X_n) is null-recurrent. The measure π(x) ≡ 1 is invariant, and every other invariant measure must be a multiple of it, by Corollary 11.19. Hence there is no invariant distribution. The situation is analogous for every irreducible null-recurrent chain.

Example 11.23. Let (X_n) be a random walk with drift, P_x[X_1 = x+1] = 1 − P_x[X_1 = x−1] = p > 1/2. Show that π(x) = A + B (p/(1−p))^x is invariant for every A > 0 and B > 0, and that (X_n) is transient. In particular, a transient Markov chain may possess invariant measures, and they need not form a one-parameter family as in the recurrent case.

Exercise 11.24. Let S = N and consider the Markov chain on S given by P_x[X_1 = x+1] = 1 − 10^{−x} and P_x[X_1 = 0] = 10^{−x} for x ≥ 1, and P_0[X_1 = 1] = 1. Show that (X_n) is transient and has no non-trivial invariant measure.
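The excursion measure n_x of Proposition 11.16 is also easy to explore by simulation: average the visit counts over many excursions from x and check the invariance identity (11.15). The 3-state matrix below is an arbitrary example.

```python
# Monte Carlo check of Proposition 11.16: n_x(y), the mean number of
# visits to y during an excursion from x, satisfies n_x(x) = 1 and
# n_x(y) = sum_z n_x(z) * kappa[z][y].  Arbitrary example chain.
import random

random.seed(1)
kappa = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.4, 0.4, 0.2]]

def step(x):
    """Sample one transition from state x."""
    u, acc = random.random(), 0.0
    for y, p in enumerate(kappa[x]):
        acc += p
        if u < acc:
            return y
    return len(kappa) - 1

x0, excursions = 0, 50000
visits = [0.0, 0.0, 0.0]
for _ in range(excursions):
    y = step(x0)                 # this is X_1
    while True:
        visits[y] += 1           # count X_1, ..., X_{H~}
        if y == x0:              # the excursion ends at the return time
            break
        y = step(y)
n_x = [v / excursions for v in visits]

assert abs(n_x[x0] - 1.0) < 1e-9          # (i): n_x(x) = 1, exactly
for y in range(3):                         # (ii): invariance, up to MC error
    assert abs(n_x[y] - sum(n_x[z] * kappa[z][y] for z in range(3))) < 0.05
```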

11.3 Convergence of Markov chains

In this section we assume that (X_n) is an irreducible, positively recurrent Markov chain on an at most countable state space (S, S). By Theorem 11.21 it possesses a unique invariant distribution π. We now investigate the asymptotic behaviour of (X_n).

Lemma 11.25. For every x, y ∈ S,

lim_{n→∞} (1/n) Σ_{k=1}^n 1{X_k = y} = π(y),   P_x-a.s.

Proof. Writing H^k_y for the time of the k-th visit to y, defined as in (11.11),

(1/n) Σ_{k=1}^n 1{X_k = y} = (1/n) max{k ≥ 0 : H^k_y ≤ n}.

Therefore, the claim of the lemma is equivalent to

(11.26)   lim_{k→∞} H^k_y / k = 1/π(y),   P_x-a.s.

Using the same arguments as in the proof of Theorem 11.10, together with the recurrence of (X_n), we see that the random variables S_l = H^l_y − H^{l−1}_y are independent and, moreover, (S_l)_{l≥2} are i.i.d. with P_x[S_l = k] = P_y[H̃_y = k] for l ≥ 2. Therefore E[S_l] = E_y[H̃_y] = 1/π(y), by Theorem 11.21. Since H^k_y = Σ_{l=1}^k S_l, claim (11.26) follows from the strong law of large numbers.

Lemma 11.25 and dominated convergence imply

lim_{n→∞} E_x[(1/n) Σ_{k=1}^n 1{X_k = y}] = π(y).

We now strengthen this Cesàro-type convergence.

Definition 11.27. For x ∈ S let T (x) = {n ≥ 1 : Px[Xn = x] > 0} be the set of times when return to x is possible. The period of x is the greatest common divisor gcd T (x).

Exercise 11.28. If X is irreducible and x, y ∈ S, then gcd T (x) = gcd T (y).

Definition 11.29. A Markov chain (Xn) is called aperiodic when gcd T (x) = 1 for all x ∈ S.

Exercise 11.30. If (Xn) is irreducible and aperiodic then for every x, y ∈ S there is n ≥ 0 such that Px[Xm = y] > 0 for all m ≥ n. [[TODO: Provide a proof]]
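Before turning to the convergence theorem, note that the ergodic averages of Lemma 11.25 are easy to observe in simulation. The 3-state matrix below is an arbitrary example whose invariant distribution is proportional to (1, 1.6, 0.85):

```python
# Occupation frequencies along one long trajectory converge to the
# invariant distribution pi (Lemma 11.25).  Arbitrary example chain;
# for this matrix pi is proportional to (1, 1.6, 0.85).
import random

random.seed(3)
kappa = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.4, 0.4, 0.2]]

def step(x):
    """Sample one transition from state x."""
    u, acc = random.random(), 0.0
    for y, p in enumerate(kappa[x]):
        acc += p
        if u < acc:
            return y
    return len(kappa) - 1

n, counts, x = 200000, [0, 0, 0], 0
for _ in range(n):
    x = step(x)
    counts[x] += 1
freq = [c / n for c in counts]

pi = [1 / 3.45, 1.6 / 3.45, 0.85 / 3.45]
assert all(abs(f - p) < 0.01 for f, p in zip(freq, pi))
```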

Theorem 11.31. Let X be irreducible and aperiodic with invariant distribution π. Then for every x, y ∈ S,

lim_{n→∞} P_x[X_n = y] = π(y).

Proof. We use a coupling argument. On some probability space (Ω, A, P) define two independent Markov chains (X_n)_{n≥0} and (Y_n)_{n≥0} with respective distributions P_x and P_π. Since π is invariant, P[Y_n = y] = π(y) for every n.

Fix now z ∈ S and set T = inf{n ≥ 0 : X_n = Y_n = z}. We first claim that P[T < ∞] = 1. To see this, observe that W_n = (X_n, Y_n) is a Markov chain on S × S. Using Exercise 11.30, it is not difficult to see that (W_n) is again irreducible. Moreover, π̃(x, y) = π(x)π(y) is an invariant distribution for (W_n). Therefore W is positively recurrent, which implies the claim.

We now construct a new process (Z_n) by

Z_n = X_n for n ≤ T,   Z_n = Y_n for n > T.

Since X_T = Y_T, it is not hard to see that (Z_n) is P_x-distributed, just as (X_n). Hence,

P_x[X_n = y] = P[Z_n = y] = P[X_n = y, n ≤ T] + P[Y_n = y, n > T],

and thus

|P_x[X_n = y] − π(y)| ≤ 2P[T ≥ n] → 0 as n → ∞.

This completes the proof.

[[TODO: add more detail here: mixing times, examples]]
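The convergence in Theorem 11.31 can also be watched directly in the matrix powers: every row of κ^n, that is, every starting point, approaches π. A deterministic sketch with an arbitrary aperiodic example chain:

```python
# Every row of the n-step transition matrix converges to pi
# (Theorem 11.31).  Deterministic check on an arbitrary irreducible,
# aperiodic example chain whose invariant distribution is
# (1, 1.6, 0.85)/3.45.

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

kappa = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.4, 0.4, 0.2]]
pi = [1 / 3.45, 1.6 / 3.45, 0.85 / 3.45]

power = kappa
for _ in range(49):              # compute kappa^50
    power = mat_mul(power, kappa)

# all rows, i.e. P_x[X_50 = .] for every x, agree with pi
assert all(abs(power[i][j] - pi[j]) < 1e-10
           for i in range(3) for j in range(3))
```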

12 Brownian motion and Donsker's theorem

In this chapter we introduce one of the most important stochastic processes in continuous time, the Brownian motion. We construct it as a suitable limit of rescaled trajectories of the simple random walk.

12.1 Space C([0, 1])

In order to construct the Brownian motion, we shall understand the weak convergence of probability measures on the space C = C([0,1], R), endowed with the sup-norm ‖w‖ = sup_{t∈[0,1]} |w(t)| and the corresponding metric. It is well known that C is separable.

Lemma 12.1. The Borel σ-algebra B(C) is generated by the system Z of cylinder sets,

Z = {{w ∈ C : w(t_i) ∈ A_i, i = 1, ..., n} : n ∈ N, 0 ≤ t_1 ≤ ··· ≤ t_n, A_i ∈ B(R)}.

Moreover, Z is a π-system.

Proof. To see that B(C) ⊃ σ(Z), it is sufficient to observe that the map w ∈ C ↦ w(t) ∈ R is continuous for every t ∈ [0,1], and thus the sets of the form {w ∈ C : w(t) ∈ A} with A ∈ B(R), and hence all elements of Z, are contained in B(C).

To show that B(C) ⊂ σ(Z), recall first that, as C is separable, every open set is a countable union of open balls. Moreover, every open ball U_δ(w) ⊂ C, w ∈ C, δ > 0, can be written as ∪_{n∈N} Ū_{δ−1/n}(w), where Ū denotes a closed ball; so every open set is a countable union of closed balls. By continuity,

Ū_δ(w) = {w' ∈ C : ‖w − w'‖ ≤ δ} = ∩_{n∈N} {w' ∈ C : |w(i/n) − w'(i/n)| ≤ δ, i = 0, ..., n} ∈ σ(Z).

Hence every open set of C is contained in σ(Z), and thus B(C) ⊂ σ(Z). The last claim of the lemma is obvious.

We now consider the measurable space (C, B(C)). Due to Lemma 12.1 and Dynkin's lemma, every probability measure µ on it is determined by its finite-dimensional marginals (π_{t_1,...,t_n})_#µ, where π_{t_1,...,t_n} is the natural projection from C to R^n given by π_{t_1,...,t_n}(w) = (w_{t_1}, ..., w_{t_n}).

If (µ_n) is a sequence of probability measures on (C, B(C)) converging weakly to µ, then obviously the corresponding finite-dimensional marginals converge weakly as well: (π_{t_1,...,t_n})_#µ_n → (π_{t_1,...,t_n})_#µ for all 0 ≤ t_1 ≤ ··· ≤ t_n ≤ 1. The converse is false.

Example 12.2. Consider µ = δ_w, µ_n = δ_{w_n}, where w ≡ 0 and w_n is piecewise linear with w_n(0) = 0, w_n(1/n) = 1, w_n(2/n) = 0, w_n(1) = 0. Since w_n → w pointwise, it follows that (π_{t_1,...,t_k})_#µ_n → (π_{t_1,...,t_k})_#µ weakly for every choice of the t_i's. On the other hand, ‖w − w_n‖ = 1, and thus µ_n does not converge weakly to µ.

Recalling the theory of weak convergence, what the above example is missing is tightness. Indeed, as a direct consequence of Prokhorov's theorem (Theorem 6.26) and Remark 6.27 we obtain:

Theorem 12.3. Let µ_n and µ be probability measures on (C, B(C)). Then the following are equivalent:

(i) µ_n → µ weakly.

(ii) The sequence (µ_n) is tight and all finite-dimensional marginals of µ_n converge weakly to those of µ.

We thus need to develop a tightness criterion on C. Define for w ∈ C and δ > 0 the modulus of continuity of w,

ω(w, δ) := sup{|w(s) − w(t)| : s, t ∈ [0,1], |t − s| ≤ δ}.

Recall that for every w ∈ C, lim_{δ→0} ω(w, δ) = 0, and that w ↦ ω(w, δ) is continuous. The following well-known theorem characterises the relatively compact sets of C.

Theorem 12.4 (Arzelà-Ascoli). A set A ⊂ C is relatively compact iff

sup_{w∈A} |w(0)| < ∞   and   lim_{δ→0} sup_{w∈A} ω(w, δ) = 0.

We can now characterise tightness in C:

Theorem 12.5. A set M ⊂ M_1(C) is tight iff the following two conditions hold:

(a) The set {(π_0)_#µ : µ ∈ M} of 0-marginals is tight in R, that is,

lim_{K→∞} sup_{µ∈M} µ({w : |w(0)| ≥ K}) = 0.

(b) As δ → 0, the modulus of continuity converges to 0 in probability, uniformly over µ ∈ M, that is,

lim_{δ→0} sup_{µ∈M} µ({w : ω(w, δ) ≥ η}) = 0   for all η > 0.

Proof. Assume first that the two conditions of the theorem hold. Since {(π_0)_#µ : µ ∈ M} is tight, for every ε > 0 there is c(ε) < ∞ such that

inf_{µ∈M} ((π_0)_#µ)([−c(ε), c(ε)]) ≥ 1 − ε.

Moreover, using the second condition, we can find a sequence δ_k with

inf_{µ∈M} µ({w : ω(w, δ_k) ≤ 1/k}) ≥ 1 − ε 2^{−k}.

Set K_ε = {w : |w(0)| ≤ c(ε)} ∩ ∩_{k≥1} {w : ω(w, δ_k) ≤ 1/k}. Then, by the last two estimates, for every µ ∈ M,

µ(K_ε^c) ≤ ε + Σ_{k=1}^∞ ε 2^{−k} ≤ 2ε.

Since w ↦ ω(w, δ) is continuous, the set K_ε is closed, and by construction it satisfies the conditions of the Arzelà-Ascoli theorem. Hence K_ε is compact, and thus M is tight.

Assume now that M is tight. Then for every ε > 0 there is a compact K ⊂ C such that sup_{µ∈M} µ(K^c) ≤ ε. Let b = sup_{w∈K} |w(0)|, which is finite by the Arzelà-Ascoli theorem. Therefore inf_{µ∈M} ((π_0)_#µ)([−b, b]) ≥ inf_{µ∈M} µ(K) ≥ 1 − ε, that is, ((π_0)_#µ)_{µ∈M} is tight and the first condition holds. Further, since K is compact, by the Arzelà-Ascoli theorem again, sup_{w∈K} ω(w, δ) ≤ η for all δ small enough. Hence, for all such δ, sup_{µ∈M} µ({w : ω(w, δ) ≥ η}) ≤ sup_{µ∈M} µ(K^c) ≤ ε, and the second condition follows easily.

Exercise 12.6. Assume that M is a sequence (µ_n)_{n≥0}. To verify tightness it is sufficient to check

lim_{K→∞} lim sup_{n→∞} µ_n({w : |w(0)| ≥ K}) = 0

and

lim_{δ→0} lim sup_{n→∞} µ_n({w : ω(w, δ) ≥ η}) = 0   for all η > 0.

12.2 Brownian motion

Definition 12.7. A Brownian motion is an R-valued stochastic process (B_t)_{t≥0} on some probability space (Ω, A, P) such that

(i) B0 = 0, P -a.s.

(ii) For every n ∈ N and 0 = t0 < t1 < ··· < tn, the increments Bt1 − Bt0 ,...,

Btn − Btn−1 are independent random variables.

(iii) For all t ≥ 0 and s > 0, B_{t+s} − B_t is N(0, s)-distributed.

(iv) For P-a.e. ω, the trajectory of the process, t ↦ B_t(ω), is a continuous function.

There are many ways to construct a Brownian motion; in particular, different probability spaces can be used. On the other hand, as we will see now, the conditions (i)–(iv) determine the distribution of Brownian motion uniquely. One way to see this is to construct the so-called canonical Brownian motion. To this end, let C = C([0, ∞), R) be endowed with the canonical coordinates X_t : C → R, w ∈ C ↦ X_t(w) = w_t, and the canonical σ-algebra F = σ(X_t, t ≥ 0). Consider

now an arbitrary Brownian motion constructed on a probability space (Ω, A, P). By Definition 12.7(iv), we can find a P-negligible set N such that the trajectories of B are continuous on Ω \ N. Define now a map B from (Ω \ N, A ∩ (Ω \ N)) to (C, F) by

ω ↦ (t ↦ B_t(ω)).

Exercise 12.8. Check the measurability of this map.

Let W be the image of P, restricted to Ω \ N, under this map: W = B_#P.

Theorem 12.9 (uniqueness of Brownian motion). (a) The measure W on (C, F) is uniquely determined; it is called the Wiener measure.

(b) The process (X_t)_{t≥0} on (C, F, W) is also a Brownian motion, called the canonical Brownian motion.

Sketch of the proof. (a) Fix 0 = t_0 < t_1 < ··· < t_n and a bounded measurable function h : R^{n+1} → R. Then, by the definition of W and conditions (i)–(iii) of Definition 12.7, writing the values B_{t_i} as partial sums of the independent increments y_i = B_{t_i} − B_{t_{i−1}} ~ N(0, t_i − t_{i−1}),

E^W[h(X_{t_0}, ..., X_{t_n})] = E^P[h(B_{t_0}, ..., B_{t_n})]
  = ∫_{R^n} h(0, y_1, y_1 + y_2, ..., y_1 + ··· + y_n) ∏_{i=1}^n (2π(t_i − t_{i−1}))^{−1/2} e^{−y_i²/2(t_i − t_{i−1})} dy_i.

Taking D ∈ B(R^{n+1}) and h = 1_D, this determines W({(X_{t_0}, ..., X_{t_n}) ∈ D}). The class of sets of this form is a π-system generating F, so W is uniquely determined. This proves claim (a). Claim (b) is then obvious.
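Conditions (i)–(iii) of Definition 12.7 translate directly into a grid simulation: the values of B on a grid of mesh Δt are partial sums of independent N(0, Δt) increments, which in particular forces Var(B_1) = 1 and Cov(B_{1/2}, B_1) = 1/2. A Monte Carlo sketch of these two consequences:

```python
# Simulate Brownian motion on a grid from independent N(0, dt) increments
# (Definition 12.7(ii)-(iii)) and check, within Monte Carlo accuracy,
# Var(B_1) = 1 and Cov(B_{1/2}, B_1) = min(1/2, 1) = 1/2.
import math
import random

random.seed(4)
steps, dt, trials = 100, 0.01, 10000
b_half, b_one = [], []
for _ in range(trials):
    b = 0.0
    for k in range(steps):
        b += random.gauss(0.0, math.sqrt(dt))   # independent N(0, dt) increment
        if k == steps // 2 - 1:
            b_half.append(b)                    # value at t = 0.5
    b_one.append(b)                             # value at t = 1

mean1 = sum(b_one) / trials
var1 = sum(v * v for v in b_one) / trials - mean1 ** 2
mean_half = sum(b_half) / trials
cov = sum(u * v for u, v in zip(b_half, b_one)) / trials - mean_half * mean1

assert abs(var1 - 1.0) < 0.07
assert abs(cov - 0.5) < 0.05
```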

12.3 Donsker’s theorem

To show that a Brownian motion exists, we construct it as a weak limit of random walk trajectories. For the sake of simplicity we first restrict the time to the interval [0,1]; we comment on the extension to t ∈ R_+ later. Let C = C([0,1], R) be endowed with the canonical coordinates X_t, t ∈ [0,1], and the canonical σ-algebra F, constructed as previously.

Let (ξ_i)_{i≥1} be an i.i.d. sequence on a probability space (Ω, A, P) such that Eξ_i = 0, Eξ_i² = 1. Set S_0 = 0 and S_n = ξ_1 + ··· + ξ_n, n ≥ 1. For non-integer t ∈ [0, ∞) define S_t by polygonal interpolation, S_t = S_{⌊t⌋} + (t − ⌊t⌋) ξ_{⌊t⌋+1}, and consider its rescaling

(12.10)   B^n_t = (1/n) S_{tn²},   t ∈ [0,1], n ∈ N.

Obviously, for every ω ∈ Ω, t ↦ B^n_t(ω) is a random element of C. Let µ_n = (B^n_·)_#P be the distribution of B^n_· on (C, F).

Theorem 12.11 (Donsker). Under the above assumptions, the sequence of measures µ_n converges weakly on (C, ‖·‖_∞) to a measure µ which is the Wiener measure restricted to [0,1].

Eventually, we will apply Theorems 12.3 and 12.5 to prove this theorem. To check the convergence of finite-dimensional marginals we need the following general lemma.

Lemma 12.12. Let (U_n)_{n∈N} be a sequence of random variables on (Ω, A, P) with values in a normed separable vector space (S, ‖·‖) such that U_n → U in distribution as n → ∞.

(a) If (V_n)_{n∈N} is another sequence of S-valued random variables with V_n → 0 in probability, then U_n + V_n → U in distribution.

(b) If (C_n)_{n∈N} is a sequence of R-valued random variables with C_n → c in probability, where c is a constant, then C_n U_n → cU in distribution.

Proof. (a) Since S is separable, B(S × S) = B(S) ⊗ B(S), so U_n + V_n is a random variable. For h bounded and uniformly continuous,

|Eh(U_n + V_n) − Eh(U)| ≤ |Eh(U_n + V_n) − Eh(U_n)| + |Eh(U_n) − Eh(U)|.

The second term tends to 0 since U_n → U in distribution. For the first term, observe that for every δ > 0,

|Eh(U_n + V_n) − Eh(U_n)| ≤ sup_{x,z : ‖x−z‖≤δ} |h(x) − h(z)| + 2‖h‖_∞ P[‖V_n‖ ≥ δ].

Here, the first term can be made arbitrarily small by choosing δ small, since h is uniformly continuous, and the second term converges to 0 for every δ, since V_n converges to 0 in probability. The Portmanteau theorem (Theorem 6.12) then implies U_n + V_n → U in distribution. The proof of (b) is analogous and is left as an exercise.

We can now verify the convergence of finite-dimensional marginals, as required by Theorem 12.3.

Proposition 12.13. The finite-dimensional marginals of µ_n converge to the corresponding marginals of µ.

Remark 12.14. Observe that the statement does not rely on the existence of the Wiener measure, but only on that of its finite-dimensional marginals, which is obvious from Definition 12.7.

Proof of Proposition 12.13. Consider first the one-dimensional marginals. Fix t ∈ [0,1]. Then (X_t)_#µ is the normal distribution N(0, t), and (X_t)_#µ_n is the distribution of

B^n_t = (1/n)(S_{⌊tn²⌋} + (tn² − ⌊tn²⌋) ξ_{⌊tn²⌋+1})
      = (√⌊tn²⌋/n) · (S_{⌊tn²⌋}/√⌊tn²⌋) + ((tn² − ⌊tn²⌋)/n) ξ_{⌊tn²⌋+1}.

By the central limit theorem, the second fraction in the first term converges in distribution to a standard normal random variable, while √⌊tn²⌋/n → √t. Moreover, for every ε > 0,

P[((tn² − ⌊tn²⌋)/n) |ξ_{⌊tn²⌋+1}| > ε] ≤ P[|ξ_1| ≥ εn] → 0 as n → ∞,

that is, the second term converges to 0 in probability. Applying Lemma 12.12 several times then proves the convergence of the one-dimensional marginals.

For the general case, fix 0 = t_0 < ··· < t_k ≤ 1 and argue as previously for the vector (B^n_{t_i} − B^n_{t_{i−1}})_{1≤i≤k}, using the formula

B^n_{t_i} − B^n_{t_{i−1}} = (1/n) Σ_{j=⌊n²t_{i−1}⌋+1}^{⌊n²t_i⌋} ξ_j + (1/n)[(t_i n² − ⌊t_i n²⌋) ξ_{⌊t_i n²⌋+1} − (t_{i−1} n² − ⌊t_{i−1} n²⌋) ξ_{⌊t_{i−1} n²⌋+1}].

The components of the vector ((1/n) Σ_{j=⌊n²t_{i−1}⌋+1}^{⌊n²t_i⌋} ξ_j)_{1≤i≤k} are independent and converge, by the central limit theorem again, to normal variables with the corresponding variances. The second term in the last display is a small perturbation and can be treated as previously.

To check the tightness of the sequence µn we need another lemma.

Lemma 12.15 (Ottaviani). Let U_1, ..., U_N be centred independent random variables with Σ_{i=1}^N Var U_i ≤ c². Then, for Z_k = U_1 + ··· + U_k and α > c,

P[max_{k≤N} |Z_k| ≥ 2α] ≤ P[|Z_N| > α] / (1 − c²/α²).

Proof. Set T = inf{k ≤ N : |Z_k| ≥ 2α}. On {T ≤ N}, |Z_T| ≥ 2α. As {T = k} and Z_N − Z_k are independent,

P[|Z_N| > α] ≥ P[T ≤ N, |Z_N − Z_T| < α] = Σ_{k=1}^N P[T = k] P[|Z_N − Z_k| < α].

By the Chebyshev inequality, P[|Z_N − Z_k| ≥ α] ≤ α^{−2} Var(Z_N − Z_k) ≤ c²/α², so

P[|Z_N| > α] ≥ (1 − c²/α²) Σ_{k=1}^N P[T = k] = (1 − c²/α²) P[max_{k≤N} |Z_k| ≥ 2α].

This completes the proof.

Proposition 12.16. The sequence (µn) of Theorem 12.11 is tight.

Proof. As B^n_0 = 0, condition (a) of Theorem 12.5 holds automatically. To check condition (b) of the same theorem, we need to show that

(12.17)   lim_{δ→0} lim sup_{n→∞} P[sup_{s≤t≤s+δ} |B^n_t − B^n_s| ≥ η] = 0   for all η > 0,

where the supremum is also taken over s ∈ [0, 1 − δ].

For s ∈ [kδ, (k+1)δ) and s ≤ t ≤ s + δ, either t ∈ [kδ, (k+1)δ) or t ∈ [(k+1)δ, (k+2)δ). In the first case,

|B^n_t − B^n_s| ≤ |B^n_t − B^n_{kδ}| + |B^n_s − B^n_{kδ}|.

In the second case,

|B^n_t − B^n_s| ≤ |B^n_t − B^n_{(k+1)δ}| + |B^n_s − B^n_{kδ}| + |B^n_{(k+1)δ} − B^n_{kδ}|.

Putting the two together, we see that

sup_{s≤t≤s+δ} |B^n_t − B^n_s| ≤ 3 max_{k≤1/δ} sup_{t∈[kδ,(k+1)δ]} |B^n_t − B^n_{kδ}|,

and thus

P[sup_{s≤t≤s+δ} |B^n_t − B^n_s| ≥ η] ≤ P[max_{k≤1/δ} sup_{t∈[kδ,(k+1)δ]} |B^n_t − B^n_{kδ}| ≥ η/3]
  ≤ Σ_{k=0}^{⌊1/δ⌋} P[sup_{0≤t≤δ} |B^n_{kδ+t} − B^n_{kδ}| ≥ η/3].

Set j = ⌊kδn²⌋ and m = ⌊2δn²⌋, and take n so large that n² ≥ 2δ^{−1}. Since B^n is a polygonal interpolation of S, we have

sup_{0≤t≤δ} |B^n_{kδ+t} − B^n_{kδ}| ≤ (1/n) max_{l≤m} |S_{j+l} − S_j| ∨ (1/n) max_{l≤m} |S_{j+l} − S_{j+1}|.

Hence,

P[sup_{0≤t≤δ} |B^n_{kδ+t} − B^n_{kδ}| ≥ η/3] ≤ 2P[max_{l≤m} |S_l| ≥ ηn/3].

Using Lemma 12.15 with U_i = ξ_i/(n√(2δ)), Z_k = S_k/(n√(2δ)), α = η/(6√(2δ)) and c = 1, this can be bounded from above by

(2/(1 − 72δ/η²)) P[|S_m|/√m ≥ η/(6√(2δ))] → c(η) ∫_{η/(6√(2δ))}^∞ (1/√(2π)) e^{−y²/2} dy   as n → ∞,

by the central limit theorem. It follows that, for δ small, using the Gaussian tail estimate ∫_u^∞ e^{−y²/2} dy ≤ u^{−1} e^{−u²/2} ≤ e^{−u²/2} for u ≥ 1,

lim sup_{n→∞} P[sup_{s≤t≤s+δ} |B^n_t − B^n_s| ≥ η] ≤ (c(η)/δ) ∫_{η/(6√(2δ))}^∞ (1/√(2π)) e^{−y²/2} dy ≤ (c(η)/δ) e^{−η²/(144δ)} → 0   as δ → 0.

This proves (12.17) and completes the proof of the proposition.

Proof of Theorem 12.11. The theorem follows from Propositions 12.13 and 12.16, using Theorems 12.3 and 12.5.

[[TODO: extension to [0, ∞).]]

12.4 Some applications of Donsker's theorem

We will show how Donsker's theorem can be used to derive some properties of Brownian motion. As above, µ_n denotes the distribution of B^n on C[0,1] and µ is the Wiener measure restricted to this space. We start with a general lemma on weak convergence.

Lemma 12.18. Let ν, (ν_n)_{n≥1} be probability measures on a metric space (S, d) (endowed with the Borel σ-field) and let F be a continuous map from S to another metric space (S̄, d̄). If ν_n → ν weakly on (S, d), then F_#ν_n → F_#ν weakly.

Proof. Let f ∈ C_b(S̄). Then f ∘ F ∈ C_b(S), and thus

∫_{S̄} f d(F_#ν_n) = ∫_S (f ∘ F) dν_n → ∫_S (f ∘ F) dν = ∫_{S̄} f d(F_#ν),

proving the weak convergence.

We now use Donsker's theorem to determine the distribution of the supremum of Brownian motion, M = sup{B_t : t ∈ [0,1]}.

Proposition 12.19. P[M ≤ z] = P[|B_1| ≤ z] = 2Φ(z) − 1 for z ≥ 0, where Φ is the distribution function of the standard normal distribution.

Proof. Let F : (C[0,1], ‖·‖_∞) → (R, |·|) be the continuous map F(w) := sup{w(t) : t ∈ [0,1]}. By Lemma 12.18 and the Portmanteau theorem (every z is a continuity point of the limiting distribution function, as the result will show),

(12.20)   P[M ≤ z] = (F_#µ)((−∞, z]) = lim_{n→∞} (F_#µ_n)((−∞, z]) = lim_{n→∞} P[sup_{k≤n²} S_k/n ≤ z].

We evaluate the last probability in the case when S is the simple random walk:

Claim 12.21 (Reflection principle for the SRW). Let M_n = max_{0≤k≤n} S_k. Then, for r ≥ 0,

P[M_n ≥ r, S_n = v] = P[S_n = v] if v ≥ r,   P[M_n ≥ r, S_n = v] = P[S_n = 2r − v] if v < r.

In particular, P[M_n ≥ r] = 2P[S_n > r] + P[S_n = r].

Proof. The case v ≥ r ≥ 0 is obvious. For v < r we consider the map ϕ that "reflects the SRW trajectory after its first visit to r", see Figure 12.1. It is easy to see that ϕ is a bijection between {ω ∈ Ω : max_{i≤n} S_i ≥ r, S_n = v} and {ω ∈ Ω : S_n = 2r − v}. As every path of length n has the same probability 2^{−n}, we see that P[M_n ≥ r, S_n = v] = P[S_n = 2r − v]. The last claim follows by summing over all possible terminal values.

[[TODO: Figure]]

Figure 12.1: Illustration of reflection principle

Using Claim 12.21, (12.20) equals

lim_{n→∞} P[(1/n) M_{n²} ≤ z] = lim_{n→∞} (1 − 2P[S_{n²} > nz] − P[S_{n²} = nz]) = 2Φ(z) − 1,

by the central limit theorem. This completes the proof.
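Proposition 12.19 can also be checked by simulating the simple random walk directly: P[max_{k≤n²} S_k ≤ zn] should approach 2Φ(z) − 1. A Monte Carlo sketch (the discrete approximation has an error of order 1/n, hence the generous tolerance):

```python
# Monte Carlo check of Proposition 12.19 via the simple random walk:
# P[max_{k <= n^2} S_k <= z*n] is close to 2*Phi(z) - 1.  The
# discretisation bias is of order 1/n, so the tolerance is loose.
import math
import random

random.seed(5)

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, trials, z = 40, 2000, 1.0
hits = 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n * n):
        s += 1 if random.random() < 0.5 else -1
        if s > m:
            m = s
    if m <= z * n:
        hits += 1
est = hits / trials
assert abs(est - (2 * Phi(z) - 1)) < 0.08   # 2*Phi(1) - 1 ~ 0.6827
```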

Remark 12.22. Inverting the reasoning of (12.20), it can be shown that for arbitrary increments ξ_i as in Donsker's theorem, lim_{n→∞} P[(1/n) max_{1≤i≤n²} S_i ≤ z] = 2Φ(z) − 1, transforming the result for the simple random walk into an asymptotic result holding for an "arbitrary" random walk.

For further applications of Donsker's theorem we need to extend Lemma 12.18 to certain non-continuous functions.

Lemma 12.23. Let F : (S, d) → (S̄, d̄) be measurable, and let ν, (ν_n) be probability measures as in Lemma 12.18. Then:

(i) The set of discontinuity points of F is measurable, that is,

D_F := {x ∈ S : F is not continuous at x} ∈ B(S).

(ii) If ν_n → ν weakly and ν(D_F) = 0 (that is, if F is ν-a.s. continuous), then F_#ν_n → F_#ν weakly.

Proof. (i) Observe that D_F = ∪_{n≥1} ∩_{k≥1} A_{n,k}, where

A_{n,k} = {x ∈ S : ∃ y, z ∈ S with d(x, y) < k^{−1}, d(x, z) < k^{−1} and d̄(F(y), F(z)) ≥ n^{−1}}
        = ∪_{y,z∈S : d̄(F(y),F(z)) ≥ n^{−1}} (U_{1/k}(y) ∩ U_{1/k}(z)),

where U_δ(x) denotes the open δ-neighbourhood of x. Since any union of open sets is open, A_{n,k} is open and thus measurable. Therefore D_F ∈ B(S), as claimed.

(ii) Let A ⊂ S̄ be closed. Observe that the closure of F^{−1}(A) satisfies

cl(F^{−1}(A)) ⊂ F^{−1}(A) ∪ D_F.

By the Portmanteau theorem, since ν_n → ν weakly,

lim sup_{n→∞} (F_#ν_n)(A) ≤ lim sup_{n→∞} ν_n(cl(F^{−1}(A))) ≤ ν(cl(F^{−1}(A)))
  ≤ ν(F^{−1}(A)) + ν(D_F) = ν(F^{−1}(A)) = (F_#ν)(A).

The claim then follows using the Portmanteau theorem once more.

Proposition 12.24. Let L = sup{t ∈ [0,1] : B(t) = 0} ∨ 0 be the time of the last visit to 0 before time 1 by a Brownian motion. Then

P[L ≤ z] = (2/π) arcsin √z.

Proof. Let F : C[0,1] → R be given by F(w) = sup{t ≤ 1 : w(t) = 0} ∨ 0. F is not continuous, but:

Claim 12.25. F is µ-a.s. continuous on (C[0,1], ‖·‖_∞), with µ being the Wiener measure.

Proof. Observe that if F is discontinuous at w, then for some ε > 0 the function w must have the same sign on the intervals (F(w) − ε, F(w)) and (F(w), 1]. Define

(12.26)   A^± = {w ∈ C[0,1] : ±w > 0 on both (F(w) − ε, F(w)) and (F(w), 1] for some ε > 0}.

Then D_F ⊂ A^+ ∪ A^−. By choosing r ∈ Q "shortly before F(w)" and looking at Figure 12.2, we see that

[[TODO: Figure]]

Figure 12.2: XXX

A^− ⊂ ∪_{r∈[0,1]∩Q} {max_{r≤s≤1} (w(s) − w(r)) = −w(r)}.

For fixed r, under µ, (w(s) − w(r))_{s∈[r,1]} is a Brownian motion independent of w(r) (this is the Markov property of Brownian motion, see later). Proposition 12.19 implies that max_{r≤s≤1}(w(s) − w(r)) has an absolutely continuous distribution, so that µ[max_{r≤s≤1}(w(s) − w(r)) = −w(r)] = 0. Hence µ(A^−) = 0. A similar argument yields µ(A^+) = 0, and thus µ(D_F) = 0.

We now determine the distribution of L under µ_n, again in the simple random walk case. We proceed by three claims:

Claim 12.27. P[S_1 ≠ 0, ..., S_{2n} ≠ 0] = P[S_{2n} = 0].

Proof. It is not hard to see that

P[S_1 ≠ 0, ..., S_{2n} ≠ 0] = 2P[S_1 > 0, ..., S_{2n} > 0] = 2 Σ_{r≥1} P[S_1 > 0, ..., S_{2n−1} > 0, S_{2n} = 2r].

By a variant of the reflection principle, the number of paths from (1, 1) to (2n, 2r) intersecting the x-axis is the same as the number of paths from (1, −1) to (2n, 2r), see Figure 12.3.

[[TODO: Figure]]

Figure 12.3: XXX

We thus see that

P[S_1 ≠ 0, ..., S_{2n} ≠ 0]
  = 2 · (1/2) · Σ_{r≥1} 2^{−(2n−1)} #(paths from (1,1) to (2n, 2r) not touching the x-axis)
  = Σ_{r≥1} 2^{−(2n−1)} [#(paths from (1,1) to (2n, 2r)) − #(paths from (1,−1) to (2n, 2r))]
  = Σ_{r≥1} (P[S_{2n−1} = 2r − 1] − P[S_{2n−1} = 2r + 1])
  = P[S_{2n−1} = 1] = P[S_{2n} = 0],

completing the proof.
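Claim 12.27 is an exact combinatorial identity, so it can be verified by exhaustive counting for small n: a dynamic program over the walker's position counts the paths avoiding 0, and the count must equal the central binomial coefficient C(2n, n):

```python
# Exact check of Claim 12.27: the number of +-1 paths of length 2n with
# S_1 != 0, ..., S_2n != 0 equals C(2n, n), i.e.
# P[S_1 != 0, ..., S_2n != 0] = P[S_2n = 0].
from collections import defaultdict
from math import comb

def count_never_zero(steps):
    """Count +-1 paths of the given length that avoid 0 at times 1..steps."""
    dist = {1: 1, -1: 1}                 # position counts after the first step
    for _ in range(steps - 1):
        new = defaultdict(int)
        for pos, c in dist.items():
            for npos in (pos - 1, pos + 1):
                if npos != 0:            # forbid visits to 0
                    new[npos] += c
        dist = dict(new)
    return sum(dist.values())

for n in range(1, 11):
    assert count_never_zero(2 * n) == comb(2 * n, n)
```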

Claim 12.28. Let L_n = max{m ≤ n : S_m = 0} ∨ 0. Then P[L_{2n} = 2k] = u_{2k} u_{2n−2k}, where u_{2k} = P[S_{2k} = 0].

Proof. Use the Markov property and Claim 12.27.

Claim 12.29. For every z ∈ [0,1],

P[L_{2n}/(2n) ≤ z] → ∫_0^z π^{−1} (x(1−x))^{−1/2} dx   as n → ∞.

Proof. Easy combinatorics and the Stirling formula imply that u_{2k} = 2^{−2k} C(2k, k) ∼ 1/√(πk) as k → ∞. Hence, when k/n → x ∈ (0,1), then n P[L_{2n} = 2k] → π^{−1} (x(1−x))^{−1/2}. Thus, for 0 < a < b < 1,

P[a ≤ L_{2n}/(2n) ≤ b] = Σ_{k=⌈na⌉}^{⌊nb⌋} P[L_{2n} = 2k] → ∫_a^b π^{−1} (x(1−x))^{−1/2} dx,

by the dominated convergence theorem, and the claim follows.

Proposition 12.24 follows from the last three claims using Donsker's theorem and Lemma 12.23. We only need to observe that, substituting x = y²,

∫_0^z π^{−1} (x(1−x))^{−1/2} dx = (2/π) ∫_0^{√z} (1 − y²)^{−1/2} dy = (2/π) arcsin(√z).
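The change of variables at the end can be double-checked numerically: integrating the arcsine density with the midpoint rule (which tolerates the integrable singularity at x = 0, at the price of slow convergence) reproduces (2/π) arcsin √z:

```python
# Numerical check of the identity
#   int_0^z dx / (pi * sqrt(x(1-x))) = (2/pi) * arcsin(sqrt(z)),
# i.e. that the arcsine density integrates to the arcsine-law CDF.
import math

def arcsine_cdf(z, m=100000):
    """Midpoint rule for the arcsine density on [0, z]."""
    h = z / m
    return sum(h / (math.pi * math.sqrt((i + 0.5) * h * (1 - (i + 0.5) * h)))
               for i in range(m))

for z in (0.1, 0.5, 0.9):
    assert abs(arcsine_cdf(z) - (2 / math.pi) * math.asin(math.sqrt(z))) < 1e-3
```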

Corollary 12.30. As in Remark 12.22, we have for an "arbitrary" random walk,

lim_{n→∞} P[L_n ≤ zn] = (2/π) arcsin(√z),   z ∈ [0,1].

