
Notes on WI4430 Martingales and Brownian Motion Robbert Fokkink


TU Delft
E-mail address: [email protected]

Abstract. These notes accompany the course WI4430 on Martingales and Brownian Motion that I teach in the fall of 2016 at Delft University. Normally, Frank Redig teaches this course, but he has a sabbatical and I am stepping in this one time. The course is mainly based on chapter 10 of Gut's book Probability, a graduate course, which is available through the WI4430 homepage. Apart from Gut, Brzeźniak and Zastawniak's text on Basic Stochastic Processes is also recommended. It has many exercises and solutions. I wrote these notes to help you digest Gut. I have included exercises, but they are not meant to be very hard. Frank Redig has also prepared notes – you can find them under 'scribenotes' – and sets of more difficult exercises. This is your 'homework', which I include in between my notes. There is a homework session once every week in which you can seek assistance in solving these exercises. The difficult exercises are marked as challenges, to warn you. Sometimes a challenge offers a financial reward as an incentive. You can claim the reward by handing in a written solution and a transfer of copyright. These challenges are usually open research problems.

Contents

Chapter 1. A recap of measure theory
1.1. The Riemann integral ... and beyond
1.2. Putting the sigma into the algebra
1.3. A call for better integration
1.4. Swap until you drop
1.5. A final word

Chapter 2. E-learning
2.1. E-definitions
2.2. E-properties
2.3. E-laboration

Chapter 3. Meet the Martingales
3.1. Coming to terms with terminology
3.2. Thirteen examples
3.3. Linear Algebra Again

Chapter 4. Welcome to the California Casino
4.1. You can check out any time you like
4.2. But you can never lose

Chapter 5. A discourse on inequalities
5.1. Doob's optional stopping (or sampling) theorem
5.2. Doob's maximal inequality
5.3. Doob's L^p inequality
5.4. Doob's upcrossing inequality
5.5. And beyond

Chapter 6. Don't Stop Me Now
6.1. L^2 convergence
6.2. Almost sure convergence
6.3. L^p convergence
6.4. L^1 convergence
6.5. Wrapping it up

Chapter 7. Sum ergo cogito
7.1. All, or nothing at all
7.2. The domino effect
7.3. Show me the money, Jerry
7.4. Take it to the limit, one more time


Chapter 8. Get used to it
8.1. Are we there yet?
8.2. A deviation

Chapter 9. Walk the Line
9.1. Putting PDE's in your PC
9.2. Time's Arrow
9.3. Measuring motion on the atomic scale

Chapter 10. Get Real
10.1. A fishy frog
10.2. Get on up, get on the scene

Chapter 11. Meet the real Martingales
11.1. Continuous martingales
11.2. Crooked paths

Chapter 12. A tale of two thinkers
12.1. Cracking up crooked paths
12.2. The stochastic integral

Chapter 13.
13.1. Itô's rule
13.2. Properties of the Stochastic Integral
13.3. The Itô formula
13.4. EXERCISES CHAPTER 13

Chapter 14. The End is Here
14.1. A review of lecture 9
14.2. Building Bridges
14.3. A final word

CHAPTER 1

A recap of measure theory

This material is covered by Chapter 1 of Brzeźniak and Zastawniak's text on Basic Stochastic Processes.

Measure Theory was developed around the turn of the 20th century by the French mathematicians Émile Borel and Henri Lebesgue.¹ Most of you have already learned this theory, for instance in the courses TW2090 Real Analysis, TW3560 Advanced Probability, TW3570 Fourier Analysis, or if you have followed the Minor Finance. If you did not learn the theory yet, you need to catch up, because this here is only a partial recap to refresh your memory. Fortunately, there is plenty of material available on the internet. You could for instance consult the first part of the notes by Terry Tao, the world's most famous mathematician. You could also try Probability with Martingales, by David Williams, which is insightful but demanding. Finally, there are the very elegant notes on Probability by Varadhan.

1.1. The Riemann integral ... and beyond

You have received excessive training on calculating the integral

(1.1)    ∫_a^b f(x) dx

But what does the integral mean? You need to remember your definitions. In

¹Delft pride: the Dutch mathematician Thomas Stieltjes, who studied in Delft but failed to get his BSA, got some of these ideas first.

Figure 1. Upper and lower Riemann sum

particular, you need to remember your Riemann sums. Divide the interval [a, b] into a finite union of subintervals

[a, b] = [x_0, x_1) ∪ [x_1, x_2) ∪ ... ∪ [x_{n−1}, x_n]

with x_0 = a and x_n = b, and approximate the integral by

∫_a^b f(x) dx ≈ f(ξ_0)(x_1 − x_0) + f(ξ_1)(x_2 − x_1) + ··· + f(ξ_{n−1})(x_n − x_{n−1})

As the mesh of the subintervals gets smaller, the approximation gets better and better, and (hopefully) converges to the integral.
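This can be tried out numerically. Below is a minimal Python sketch (the function name and the test integrand are my own illustration, not from the notes) showing the Riemann sums closing in on ∫_0^1 x² dx = 1/3 as the mesh shrinks.

```python
def riemann_sum(f, a, b, n):
    """Riemann sum of f over [a, b] with n equal subintervals,
    taking xi_i as the left endpoint of each subinterval."""
    h = (b - a) / n
    return sum(f(a + i * h) * h for i in range(n))

# As the mesh shrinks, the sums converge to the integral: here ∫_0^1 x^2 dx = 1/3.
for n in (10, 100, 1000):
    print(n, riemann_sum(lambda x: x * x, 0.0, 1.0, n))
```

For a continuous integrand the choice of evaluation point within each subinterval washes out in the limit, which is exactly the point made below about the ξ_i.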

The summands f(ξ_i)(x_{i+1} − x_i) have two factors: a function value f(ξ_i) and a length x_{i+1} − x_i. Suppose that we change the integral as follows:

(1.2)    ∫_a^b f(x) d(x²)

Then the function values remain the same, but we have changed the lengths. The Riemann sum with d(x²) instead of dx is equal to

∫_a^b f(x) d(x²) ≈ f(ξ_0)(x_1² − x_0²) + f(ξ_1)(x_2² − x_1²) + ··· + f(ξ_{n−1})(x_n² − x_{n−1}²)

So here is what you need to observe: the definition of the integral involves the definition of the lengths of the intervals. There are many possible notions of length. We can define the length of [x, y] to be y − x or y² − x² or e^y − e^x, and in general we may consider ∫_a^b f(x) d(g(x)) where the length of [x, y] is equal to g(y) − g(x) for some monotonic function g. A measure is a more general notion of length.

I did not tell you yet what the ξ_i are. They are arbitrary points in [x_i, x_{i+1}). If f is continuous and if the interval is very small, then f(ξ_i) hardly varies and the choice of ξ_i is not really important. This is why ∫_a^b f(x) dx is well defined if f is continuous. But if f is discontinuous, then the Riemann sums may not converge. You probably remember this example:

(1.3)    ∫_a^b 1_Q(x) dx is not defined

where the indicator function 1_Q(x) is equal to 1 if x is rational and it is equal to 0 if x is irrational. However, we can fix this if we can decide what the length of Q is. The rational numbers are countable, so we can enumerate them

Q = {r_1, r_2, r_3, ...}

(r_1 − 0.01, r_1 + 0.01) ∪ (r_2 − 0.001, r_2 + 0.001) ∪ (r_3 − 0.0001, r_3 + 0.0001) ∪ ...

In other words, we cover each r_n by an interval of length 2·10^{−n−1}. The total length of these intervals adds up to 0.0222··· . We can do even better and cover Q by intervals with lengths that add up to nearly nothing. So the length of Q must be equal to zero and that is why ∫_a^b 1_Q(x) dx = 0. Or to put this in technical terms: the Lebesgue integral ∫_a^b 1_Q(x) dx is equal to zero.
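Returning to the d(x²) example in (1.2): a small Python sketch (names are mine) that computes Riemann–Stieltjes sums, where the "length" of a subinterval [x_i, x_{i+1}] is g(x_{i+1}) − g(x_i). With g(x) = x² it approximates ∫_0^1 x d(x²) = ∫_0^1 x · 2x dx = 2/3, while the ordinary choice g(x) = x recovers ∫_0^1 x dx = 1/2.

```python
def stieltjes_sum(f, g, a, b, n):
    """Riemann-Stieltjes sum: the 'length' of [x_i, x_{i+1}] is g(x_{i+1}) - g(x_i)."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    return sum(f(xs[i]) * (g(xs[i + 1]) - g(xs[i])) for i in range(n))

# Changing the length function g changes the value of the integral:
print(stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 2000))  # ≈ 2/3
print(stieltjes_sum(lambda x: x, lambda x: x, 0.0, 1.0, 2000))      # ≈ 1/2
```

A measure plays the role of g, but for sets rather than intervals.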

The advantages of the Lebesgue integral over the Riemann integral are:
• Lebesgue is an upgrade: it is well defined for many more functions and allows you to integrate over spaces that are much more general than R.
• Lebesgue is downward compatible: it produces the same value as Riemann, if Riemann is well defined.
• Lebesgue extends across platforms: it treats ∫ and Σ in the same manner, and it connects Analysis to Probability.
• Lebesgue has no ambiguity: there is no need for points ξ_i.
• Lebesgue has improved functionality: the swap between lim_{n→∞} ∫f_n and ∫ lim_{n→∞} f_n is handled in a much more transparent manner.

In short, the Lebesgue environment is more user friendly than the Riemann environment. We are talking about a genuine upgrade here, we are not going Microsoft. The Lebesgue integral hinges on the idea of a measure, which is some kind of length function for sets rather than intervals. We need to describe these sets first.

1.2. Putting the sigma into the algebra

Let Ω be any arbitrary set, but let it be [0, 1] in particular, or let it be R if you like. We would like to define the measure of all subsets of Ω. Unfortunately, this is impossible without running into fundamental difficulties, because there are so many subsets, more than you can ever describe. Surely, you can describe many sets, but it is impossible to describe each and every one of them. We simply lack the notation to do this. That is why in measure theory we restrict our attention to specific subsets that are called Borel subsets. Informally speaking, these are the subsets that we can describe with our current mathematical notation.

Definition 1.1. A family F of subsets of Ω is an algebra if:
(1) ∅ ∈ F
(2) If A ∈ F then also Ω \ A ∈ F
(3) If A, B ∈ F then also A ∪ B ∈ F.

A mathematical object that is closed under two operations is often called a ring or an algebra. The operations on F are ∪ and \. If the operations are invertible, then the mathematical object is called a field. Very confusingly, in measure theory one mixes this terminology, and one person says algebra while another person says field, or ring. Most people say algebra. Brzeźniak and Zastawniak say field.

Exercise. Suppose that F is an algebra and A, B ∈ F. Prove that A ∩ B ∈ F.

We would like to quantify the size (or length, or area, or volume) µ(A) of an element A ∈ F. Surely, if A and B are disjoint, then we would like that µ(A ∪ B) = µ(A) + µ(B). A measure is a function µ: F → [0, ∞] that has this property. To define it in full generality, we first extend the notion of algebra by allowing infinite unions (that is where the σ comes in):

Definition 1.2. A family F of subsets of Ω is a σ-algebra if:
(1) ∅ ∈ F
(2) If A ∈ F then also Ω \ A ∈ F
(3) If A_1, A_2, ··· ∈ F then also A_1 ∪ A_2 ∪ ··· ∈ F.

Of course, the family of all subsets of Ω is a σ-algebra. But we will be interested in smaller σ-algebras.

Exercise. Suppose that both F and G are σ-algebras of subsets of Ω. Then F ∩ G is also a σ-algebra.

This exercise shows that it makes sense to speak of the smallest σ-algebra that contains a given family of sets A. We say that such a σ-algebra is generated by A. The σ-algebra that is generated by all open intervals in the line is called the Borel σ-algebra of R. It is denoted by B. Its elements are called Borel sets, or are called Borel-measurable.

Exercise. A subset V ⊂ R is open if for each x ∈ V there exists an ε > 0 such that (x − ε, x + ε) ⊂ V. Prove that each open set is a countable union of open intervals. Conclude that an open set is a Borel set.
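When Ω is finite, the generated σ-algebra coincides with the generated algebra and can be computed by brute force, closing the seed family under complements and unions. A sketch under that finiteness assumption (the code and example are mine):

```python
def generated_algebra(omega, seeds):
    """Smallest family containing the seed sets that is closed under
    complement and (finite) union; on a finite omega this is a sigma-algebra."""
    omega = frozenset(omega)
    fam = {frozenset(), omega} | {frozenset(s) for s in seeds}
    while True:
        new = {omega - a for a in fam} | {a | b for a in fam for b in fam}
        if new <= fam:
            return fam
        fam |= new

alg = generated_algebra({1, 2, 3, 4}, [{1, 2}])
print(len(alg))  # 4: the sets are ∅, {1,2}, {3,4}, Ω
```

Note that the result is automatically closed under intersections as well, in line with the exercise on A ∩ B above.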

It turns out that B is not equal to the family of all subsets of R. There exist subsets that are not Borel, but it is hard to find them.

Challenge. The first student that is able to explicitly write down a subset of R that is not a Borel set will receive a 10 euros gift check.

Now that we have defined σ-algebras, we are ready to measure their elements.

Definition 1.3. A measure µ: F → [0, ∞] is a function such that µ(∅) = 0 and µ(A_1 ∪ A_2 ∪ ...) = µ(A_1) + µ(A_2) + ... if the sets A_1, A_2, ··· are pairwise disjoint.

In the special case that µ(Ω) = 1, we say that µ is a probability measure and (Ω, F, P) is called a probability space. In this case, the measure is denoted by P and the elements of F are called events. Trying to come up with a more general notion of length, we stumbled on a definition of probability.

It is easy to define what a measure is, but it is not so easy to prove that a measure exists. The Lebesgue measure λ on the Borel σ-algebra of R has the property that λ[a, b) = b − a. In particular, λ(R) = ∞. The Lebesgue measure on the Borel σ-algebra of [0, 1] is a probability measure. If g: R → R is any non-decreasing function such that lim_{x↓z} g(x) = g(z), then there exists a measure on the Borel σ-algebra such that µ[a, b] = g(b) − g(a). If g(x) = x then we get the standard Lebesgue measure, but we could also take g(x) = x³ or even

g(x) = 0 if x < 0,  g(x) = 1 if x ≥ 0.

This produces a probability measure such that P(A) = 1 if and only if 0 ∈ A. This is called a Dirac measure and it is denoted by δ_0.
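The recipe µ[a, b] = g(b) − g(a) is easy to experiment with. A hedged sketch (endpoint subtleties of half-open versus closed intervals are ignored, and the names are my own):

```python
def mu(g, a, b):
    """Measure of the interval [a, b] induced by a non-decreasing g,
    via mu[a, b] = g(b) - g(a). Endpoint subtleties are ignored here."""
    return g(b) - g(a)

cube = lambda x: x ** 3                    # a Stieltjes-type measure
dirac0 = lambda x: 0.0 if x < 0 else 1.0   # step function: the Dirac measure at 0

print(mu(cube, 0.0, 2.0))     # 8.0
print(mu(dirac0, -1.0, 1.0))  # 1.0, since the interval contains 0
print(mu(dirac0, 1.0, 2.0))   # 0.0, since the interval misses 0
```

The step function g assigns all of its mass to the single point 0, which is exactly the Dirac measure described above.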

Challenge. Consider the Dirac measure δ_π such that δ_π(A) = 1 if and only if π ∈ A. Otherwise the measure is zero. Let A be the set of all elements of R that have infinitely many nines in their decimal expansion (compare homework assignment 1.h). Determine δ_π(A). The first student that is able to determine the measure of A will receive a 10 euros gift check.

We will be moving from measurable sets to measurable functions in a little bit. To make this transition, we express a set A in terms of its indicator function 1_A(ω), which takes the value one if ω ∈ A and the value zero if ω ∉ A. So instead of an algebra of sets, we can also think of an algebra of functions, which is easier since functions take values and values we can add.
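The set–indicator dictionary can be made concrete: with mod-2 arithmetic, adding indicators corresponds to the symmetric difference of sets and multiplying corresponds to intersection. A small sketch (the example sets are my own):

```python
def indicator(A):
    return lambda w: 1 if w in A else 0

omega = range(6)
A, B = {0, 1, 2}, {2, 3}

# Mod-2 addition of indicators is the indicator of the symmetric difference,
# and multiplication is the indicator of the intersection.
sums = [(indicator(A)(w) + indicator(B)(w)) % 2 for w in omega]
prods = [indicator(A)(w) * indicator(B)(w) for w in omega]
print(sums == [indicator(A ^ B)(w) for w in omega])   # True (A ^ B: symmetric difference)
print(prods == [indicator(A & B)(w) for w in omega])  # True
```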

Exercise. Let F be an algebra as in Definition 1.1. Consider the family A of all indicator functions 1_A with A ∈ F. Define addition and multiplication modulo 2, so 1+1=0. Show that A is closed under addition and multiplication. The unit element is 1_Ω and the zero element is 1_∅. The family A is called a Boolean algebra,² which is what makes your computer tick.

A final word on independence. If F and G are two different σ-algebras with the same space Ω and the same probability measure P, then we say that they are independent if and only if P(A ∩ B) = P(A) · P(B) for all A ∈ F and B ∈ G. In everyday language, the word independence means the freedom from outside control or support. This notion requires a distinction between an inside and an outside, but mathematicians being mathematicians took it a little further and thought about being independent of oneself. Imagine a nation that produces a declaration of independence of itself!³ Such a notion can hardly be held to be self-evident, but homework exercise 7 offers an example of a σ-algebra that is independent of itself. Also compare B-Z exercise 4.9, and Gut chapter 10.13.

Exercise. Suppose that F is independent of itself. Prove that P(A) is either zero or one for all A ∈ F. The events in F are rather dull.

1.3. A call for better integration

As a mathematician and as a law abiding citizen, I support my government's call for better integration. And it is needless to say that by better integration I mean Lebesgue integration. If F is a σ-algebra with a measure µ, then for any A ∈ F we can define the integral of an indicator function

(1.4)    ∫ 1_A dµ = µ(A)

Please note that in this notation the dx is replaced by dµ, to signify that our old notion of length has been replaced by the new notion of measure. It is straightforward to extend this definition to finite sums of indicator functions. If A_1, ..., A_k are pairwise disjoint elements of F, and if c_1, ..., c_k are real numbers, then we say that

(1.5)    f(ω) = c_1 1_{A_1}(ω) + ··· + c_k 1_{A_k}(ω)

is a simple function. If f is a simple function then we define

(1.6)    ∫ f dµ = c_1 µ(A_1) + ··· + c_k µ(A_k)

The family of all simple functions has nice algebraic properties. If f and g are simple functions, then so are f + g and f · g and min{f, g} and max{f, g}. To define the Lebesgue integral, we use simple functions in much the same way as step functions in Riemann sums.

Definition 1.4. Let F be a σ-algebra on a set Ω. A function f: Ω → [−∞, ∞] is measurable if f^{−1}(−∞, t] ∈ F for any t ∈ R.
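The integral (1.6) of a simple function is a finite sum, so it can be computed exactly. Here is a sketch for a discrete measure on a four-point Ω (the measure and the coefficients are my own illustration):

```python
from fractions import Fraction

# A discrete probability measure on a finite Omega.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}

def integral_simple(coeffs_sets):
    """Integral of a simple function given as a list of (c_i, A_i),
    with the A_i pairwise disjoint, as in (1.5)-(1.6)."""
    return sum(c * sum(mu[w] for w in A) for c, A in coeffs_sets)

f = [(3, {"a"}), (5, {"b", "c"})]   # f = 3*1_{a} + 5*1_{b,c}
print(integral_simple(f))           # 3*(1/2) + 5*(3/8) = 27/8
```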

²After George Boole, an English mathematician who lived and worked in Cork – a beautiful town in Ireland, go visit – and died prematurely because his wife was a zealous homeopath who thought you could cure pneumonia by sleeping between wet blankets.
³An increasing number of European countries want to be independent of Europe.

All simple functions are measurable. In the setting of probability spaces, a measurable function f is called a random variable and instead of f one writes X, as in homework exercise 4. This is because a probabilist thinks of X as a random number, which comes out after you feed an ω ∈ Ω to it. The elements in Ω are known to God. Probabilists are only able to observe the X(ω). A simple function is a random number that can only take finitely many values. It is a discrete random variable. An indicator function 1_A is a Bernoulli random variable.

Measurable functions are a bit like Borel sets: it is not easy to find a function that is not measurable. Of course, one could cheat and take F to be the σ-algebra consisting of ∅ and Ω only. In this case it is very easy to write down a function that is not measurable. But if Ω is equal to [0, 1] and F is the Borel σ-algebra, then it is very hard to write down a non-measurable function.

Exercise. Suppose that f: R → R is continuous. Prove that f is Borel measurable.

Remark 1.5. If X is a random variable on (Ω, F, P) and if B is a Borel set, then X^{−1}(B) ∈ F. To see this, consider the family Φ of all subsets B such that X^{−1}(B) ∈ F. Observe that Φ contains the half-lines, since X is measurable. If B ∈ Φ, then so is its complement. And if B_n ∈ Φ then so is ∪B_n. So Φ is a σ-algebra which contains all half-lines, and therefore it contains B. The preimages X^{−1}(B) of Borel sets form a σ-algebra that is contained in F. It is called the σ-algebra that is generated by X.

To define the integral, we need to divide measurable functions into a negative and a positive part. If f is measurable, then so are f⁺ = max{f, 0} and f⁻ = min{f, 0}. Observe that f = f⁺ + f⁻, so we can break up the definition of ∫f dµ into ∫f⁺ dµ and ∫f⁻ dµ.

Definition 1.6. Let f: Ω → [0, ∞] be a non-negative measurable function. Then ∫f dµ is the supremum of ∫g dµ where g runs over all simple functions that are ≤ f. If ∫f dµ < ∞ then we say that f is integrable. If f takes positive and negative values, then we say that it is integrable if ∫|f| dµ is finite. Unless stated otherwise, a function f is measurable and integrable.

Exercise. If f(x) = ∞ if x ∈ Q and f(x) = 0 otherwise, then ∫f dλ = 0.

Here is one more word on notation. If X is a random variable on a probability space (Ω, F, P), then the integral ∫X dP is called the expectation of X. It is usually denoted by E[X]. This notation is really more convenient in this setting, when one thinks of X as a random number like a roll of the dice and E[X] is the average or expected outcome of this event or experiment. The random variable X is integrable if and only if E[|X|] is finite. In this case, we also say that X is in L¹(Ω, F, P) and that E[|X|] is its L¹ norm. If X² is integrable, then we say that X has finite second moment, or that X is in L²(Ω, F, P). The value (E[X²])^{1/2} is called the L² norm. This is comparable to the terminology in analysis, when one considers Borel measurable functions on R.

Exercise. If X is in L² then it is in L¹. This is unlike the case of the Lebesgue measure on R.
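For reference, the standard route to this exercise is the Cauchy–Schwarz inequality applied to |X| · 1 (a sketch, not a full solution):

```latex
\mathbb{E}\,|X| \;=\; \mathbb{E}\,\bigl[\,|X|\cdot 1\,\bigr]
\;\le\; \bigl(\mathbb{E}\,X^2\bigr)^{1/2}\bigl(\mathbb{E}\,1^2\bigr)^{1/2}
\;=\; \bigl(\mathbb{E}\,X^2\bigr)^{1/2} \;<\; \infty .
```

On a probability space E[1²] = 1, which makes the bound finite; on (R, λ) the constant function 1 is not integrable, which is why the inclusion fails there.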

REPEATED WARNING: I will always assume that a random variable is integrable without saying so. Non-integrable random variables are exceptional and if we encounter them, we will identify them as such.

Exercise. If A ⊇ B then P(A) ≥ P(B).

1.4. Swap until you drop

Suppose that we have a descending chain of events

A_1 ⊇ A_2 ⊇ ···

and let A_∞ = ∩_{n=1}^∞ A_n. If we think of events in terms of indicator functions, then 1_{A_∞} = lim_{n→∞} 1_{A_n} is well-defined, so we can think of A_∞ as the limit of the descending chain. Homework exercise 3 asks you to verify that it is allowed to swap limits and probabilities:

lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n)

This is Exercise 1.1 in Brzeźniak and Zastawniak, who supply full solutions to all exercises in their book.

A sequence of real numbers does not always converge, but its liminf and limsup always exist. If f_n is a sequence of functions, then lim sup f_n and lim inf f_n both exist. In the same way, using our connection between sets and functions, if

A_1, A_2, A_3, ... is a sequence of events, then we can define lim sup 1_{A_n}. The function values of this limsup are either zero or one. Therefore lim sup 1_{A_n} is another indicator function and there exists a set B such that 1_B = lim sup_{n→∞} 1_{A_n}. We say that B = lim sup A_n.

Exercise. Prove that (lim inf A_n)^c = lim sup A_n^c.

Look at homework exercise 4 and conclude that lim sup A_n is measurable. You can check yourself that ω ∈ lim sup A_n if and only if ω ∈ A_n for infinitely many n. Or, more formally,

lim sup A_n = ∩_{k=1}^∞ ( ∪_{n=k}^∞ A_n )

Suppose that we do not really like the events A_n, and in the long run we want them to stop. Say that they are wars. If we want events to stop in the long run, then we want that P(lim sup A_n) = 0. It is a consequence of Fatou's lemma – which we will learn in a minute – that

lim sup P(A_n) ≤ P(lim sup A_n)

So if P(lim sup A_n) = 0 then the sequence P(A_n) converges to zero. What if P(A_n) converges to zero, is it then true that P(lim sup A_n) = 0? No, that is not true. A stronger condition is required. Remember from your calculus class that Σ a_n < ∞ is a stronger condition than a_n → 0.

Lemma 1.7 (Borel-Cantelli). If Σ P(A_n) < ∞ then P(lim sup A_n) = 0.

I am sorry, but I cannot give a proof. You will have to do that yourself in homework exercise 5a. What you need to remember is that events with probabilities that are summable will eventually become a thing of the past. Wars are summable, hopefully.
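Lemma 1.7 can be illustrated by simulation. A sketch with the summable choice P(A_n) = 1/n² (the parameters and the seed are my own):

```python
import random

random.seed(0)
# Independent events with P(A_n) = 1/n^2: the probabilities are summable,
# so by Borel-Cantelli only finitely many A_n occur, almost surely.
N, trials = 2000, 100
last_occurrence = []
for _ in range(trials):
    hits = [n for n in range(1, N + 1) if random.random() < 1.0 / n ** 2]
    last_occurrence.append(hits[-1] if hits else 0)
print(max(last_occurrence))  # typically tiny compared to N
```

In each realisation the list of occurring events is short and concentrated at small n, as the lemma predicts.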

Now what about the converse. Suppose that P(lim sup A_n) = 0, can we conclude that the probabilities P(A_n) are summable? No, that is not true. However, it is true if we require that the events are independent. Some care is required in the definition of independence, by the way. See Definition 1.12 of Brzeźniak and Zastawniak.

Lemma 1.8 (Borel-Cantelli, part II). Let A_n be a sequence of independent events. If P(lim sup A_n) = 0 then Σ P(A_n) < ∞.

It is a consequence of homework exercise 7 that the limsup of independent events either has probability zero or one. If independent events are not summable, then they keep happening all the time (Murphy's law). Somehow, I find it difficult to come up with an optimistic statement to illustrate this, but fortunately Wikipedia goes overboard. You need to prove this part of the BC lemma in homework exercise 5e, or you can check if there is a proof on the internet.

You received thorough training in analysis and are aware of many forms of convergence: pointwise convergence, uniform convergence, perhaps even convergence on compacta, or convergence in L^p. Numerical mathematicians taught you about the speed of the convergence. Convergence is a bit of a zoo. In the Lebesgue environment things are no different. We will limit our attention to three notions of convergence.

Definition 1.9. A sequence X_n of random variables on a probability space (Ω, F, P) can converge to X_∞ in several ways:
(a) in probability: For every ε > 0 the events A_n = {ω ∈ Ω : |X_n(ω) − X_∞(ω)| > ε} satisfy lim_{n→∞} P(A_n) = 0.
(b) almost surely: The event {ω ∈ Ω : lim_{n→∞} X_n(ω) = X_∞(ω)} has probability one.
(c) in the mean: lim_{n→∞} E[|X_n − X_∞|] = 0.

Both (b) and (c) imply (a), but there is no hierarchy between (b) and (c). Notion (b) is the most comfortable to work with. Translated in terms of relationships: (a) is easy so nobody wants him, (b) is who you marry, and (c) is who you dream of even though he is not better than (b).

Exercise. Property (a) may not imply (b) but it does imply (b) for a subsequence. Suppose that X_n → X in probability. Let ε_1 > ε_2 > ··· be such that Σ_{i=1}^∞ ε_i < ∞. Define A_i = {ω ∈ Ω : |X_{n_i}(ω) − X(ω)| > ε_i}. Apply Borel-Cantelli to prove that X_{n_i} → X almost surely.

And now it is time to file a complaint about the terminology of probability,⁴ in particular the use of the expression almost sure as in (b). When a probabilist says she is almost sure she actually means that she is sure, because you cannot get surer than probability one. Almost sure simply does not cut it. Suppose you are looking at a second-hand car and a salesman walks up to you to tell you that this is almost surely a good car. Then you will not buy that car.

⁴Probability is the mathematical maybe. Proof is the mathematical certainty. Both words are derived from the Latin verb probare, which means that something is so sure that you can test it. Probare eroded to probably through overuse by Roman car dealers.

In probability theory, it is customary to mention all the time that an event A is almost sure, or a.s., if it has probability one. I will not do that. So if I write X < Y then I mean that X is smaller than Y with probability one.

Since we do not care about events that have zero probability, we usually extend a σ-algebra F to its completion without even saying. It is the smallest σ-algebra that contains F as well as all sets that have measure zero. This is convenient, since then the almost sure limit of a sequence of measurable X_n is measurable, which follows from homework exercise 4b.

The theorems below also apply to infinite measures,⁵ such as Lebesgue measure on R. So I will use the analysis notation of f instead of X. Almost sure convergence is the probabilist's version of pointwise convergence. You probably remember that pointwise convergence is a weak notion and that you have to be careful to avoid mistakes like

∫_0^1 lim_{n→∞} f_n(x) dx = lim_{n→∞} ∫_0^1 f_n(x) dx

The typical counterexample to this mistake is the sequence f_n such that the graph is an isosceles triangle with base [0, 2^{−n}] and height 2^{n+1}. The triangles all have area equal to one, and so ∫_0^1 f_n(x) dx = 1 for all n, but the sequence converges pointwise to the all zero function. In this example 0 = ∫_0^1 lim_{n→∞} f_n(x) dx < lim_{n→∞} ∫_0^1 f_n(x) dx = 1. This is the bad news. The good news is that the opposite inequality cannot occur.

Lemma 1.10 (Fatou's Lemma). Suppose that f_n is a sequence of non-negative measurable functions on a measure space (Ω, F, µ). Then

lim inf_{n→∞} ∫ f_n dµ ≥ ∫ lim inf_{n→∞} f_n dµ

Suppose that the sequence f_n is monotone in the sense that f_m ≤ f_n if m < n; then f = lim_{n→∞} f_n is well defined (if we allow the value ∞) and denoted by f_n ↑ f. Since f ≥ f_n, the converse inequality of Fatou's lemma holds automatically, and we conclude the following.

Theorem 1.11 (Monotone Convergence). Suppose that f_n is a sequence of non-negative measurable functions on a measure space (Ω, F, µ). If f_n ↑ f, then

lim_{n→∞} ∫ f_n dµ = ∫ f dµ

One can also derive Fatou's lemma from the monotone convergence theorem, since the sequence g_n = inf{f_k : k ≥ n} is monotonically increasing. By monotone convergence lim ∫ g_n dµ = ∫ lim inf f_n dµ and Fatou's lemma follows since f_n ≥ g_n. So the two results are equivalent.

Sketch of proof of Theorem 1.11. The inequality

∫ f_n dµ ≤ ∫ f dµ

is immediate, so we only need to prove its converse. Now ∫ f dµ is defined as a supremum of ∫ g dµ for simple functions g. We can surpass any given simple function g by an f_n for large enough n, except perhaps on a set of very small measure (the pain in the proof is here). It follows that lim ∫ f_n dµ ≥ ∫ g dµ for all simple functions. And the reverse inequality follows. □

If there happens to be an integrable function g such that |f_n| < g for all indices n, then we can reverse the inequality in Fatou's lemma by applying it to the sequence g − f_n. Which gives us another convergence result:

Theorem 1.12 (Dominated Convergence). Suppose that f_n is a sequence of arbitrary measurable functions that converges to f almost surely on a measure space (Ω, F, µ). If there exists an integrable g such that |f_n| ≤ g for all n, then

⁵I should say σ-finite measures.

lim_{n→∞} ∫ f_n dµ = ∫ f dµ

I did not have space to exhibit Fubini's theorem on swapping the order of integration. You have to look that up yourself. A useful version is Tonelli's theorem, which says that you can swap if the function is non-negative. Blaise Pascal thought that negative numbers were the work of the devil. This is a bit extreme, of course, but the interference of negative numbers and positive numbers can cause non-convergence, so perhaps there is some truth to Blaise's bizarre opinion.
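The triangle counterexample from Section 1.4 also shows why the domination hypothesis in Theorem 1.12 matters: any g dominating every f_n must exceed the peaks 2^{n+1} near 0, so it cannot be integrable. A numerical sketch (midpoint rule; the constants are the ones from the counterexample, the code is mine):

```python
def f(n, x):
    """Triangle of area one: base [0, 2**-n], peak 2**(n+1) at the midpoint."""
    base = 2.0 ** -n
    if not 0.0 <= x <= base:
        return 0.0
    mid = base / 2.0
    return 2.0 ** (n + 1) * (1.0 - abs(x - mid) / mid)

def integral(n, steps=100_000):
    """Midpoint-rule approximation of the integral of f_n over [0, 1]."""
    base = 2.0 ** -n
    h = base / steps
    return sum(f(n, (i + 0.5) * h) * h for i in range(steps))

print([round(integral(n), 4) for n in range(4)])  # each ≈ 1.0
print([f(n, 0.3) for n in (1, 2, 5)])             # nonzero, then 0 once the base shrinks past 0.3
```

The integrals stay at 1 while the pointwise limit is 0, so Fatou's inequality is strict here and dominated convergence does not apply.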

1.5. A final word

Measure theory allows one to deal with finite sums, infinite sums, and integrals in one go. We will be dealing mainly with integrals and infinite σ-algebras, but even finite algebras can be interesting. Here is a variation on homework exercise 2:

Exercise. Suppose that F is a finite algebra. Prove that every element of Ω is contained in exactly half of the elements of F.

As I mentioned before, an algebra is called an algebra because it admits two operations: taking complements and taking unions. A family L is called a lattice if it is closed under taking unions only. More precisely, it satisfies conditions (1) and (3) of Definition 1.1.

Challenge. Find a finite lattice L such that each ω ∈ Ω is contained in less than half of the elements of L. The first student who finds such a lattice – or proves that it does not exist – will receive a 20 euros gift check. Do not let the extravagance of this bonus lead you into temptation! People have been trying to settle this problem for years.

Non-graded Homework 1: Exercises recap measure theory and probability

1. Show that the following subsets of R are Borel measurable. We define here the Borel σ-algebra as the smallest σ-algebra containing all open intervals.

a) A singleton {a} (a ∈ R).
b) A closed interval [a, b].
c) An open subset of R.
d) A closed subset of R.
e) The Cantor set.
f) The set of irrational numbers.
g) The set of transcendental numbers (i.e., numbers which are not a solution of a polynomial equation with integer coefficients).
h) The set of those numbers in [0, 1] whose decimal expansion does not contain 9.

2. Let Ω be a set and F a collection of subsets of Ω. We call F an algebra if it is closed under taking finite unions and complements.

a) Show that an algebra is closed under taking finite intersections and set differences. Show also that an algebra always contains the sets ∅ and Ω.
b) Show that the intersection of an arbitrary collection of algebras is an algebra. Conclude from that that for F a collection of subsets of Ω, the minimal algebra containing F is well-defined.
c) For an algebra F, a set A ∈ F is called an atom if B ∈ F, B ≠ ∅ and B ⊂ A implies B = A. Show that for a finite algebra F, the set of atoms forms a partition of Ω. Show that the algebra generated by this partition coincides with F.
d) Give two examples of an algebra which is not a σ-algebra.

3. Let (Ω, A) be a measurable space (i.e., Ω is a set, and A is a σ-algebra of subsets of Ω), and P: A → [0, 1] a finitely additive set function, i.e., if A, B ∈ A, A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B), and assume P(Ω) = 1. Show that the following three statements are equivalent:

(i) P is σ-additive.
(ii) For all A_n, n ∈ N, increasing sets contained in A (i.e., A_n ⊂ A_{n+1} for all n) such that A = ∪_n A_n, it holds that

lim_{n→∞} P(A_n) = sup_{n∈N} P(A_n) = P(A)

(iii) For all A_n, n ∈ N, decreasing sets contained in A (i.e., A_n ⊃ A_{n+1} for all n) such that A = ∩_n A_n, it holds that

lim_{n→∞} P(A_n) = inf_{n∈N} P(A_n) = P(A)

4. Let X_n, n ∈ N, be a sequence of (real-valued) random variables defined on a probability space (Ω, A, P). Prove that the following are random variables (i.e., measurable).

a) X_1 + X_2, max(X_1, X_2).

b) lim inf_{n→∞} X_n, lim sup_{n→∞} X_n.
c) Σ_{n=0}^∞ |X_n|, sup_{i≥1} X_i.
d) The indicator that infinitely many X_n are non-zero.
e) lim inf_{n→∞} (1/n) Σ_{i=1}^n X_i

5. Let (Ω, A, P) be a probability space and consider A_n, n ∈ N, a sequence of events. In this exercise we prove the so-called Borel-Cantelli lemmas.

a) Prove that A_∞ := ∩_{n∈N} ∪_{k≥n} A_k is an event (i.e., a measurable set). The interpretation of this event is: "the event that infinitely many A_n occur".
b) Argue that

P(A_∞) = lim_{n→∞} P(∪_{k≥n} A_k) ≤ lim_{n→∞} Σ_{k=n}^∞ P(A_k)

c) Use item b) to prove that Σ_{k=1}^∞ P(A_k) < ∞ implies that P(A_∞) = 0.
d) Assume now that the events A_n, n ∈ N, are independent. Prove that in that case

P(∩_{i=n}^∞ A_i^c) = lim_{m→∞} Π_{i=n}^m P(A_i^c)

Taking the logarithm, prove that (using log(1 − x) ≤ −x for all x ∈ (0, 1))

c ∞ c ∞ log (A ) lim log (A ) lim (Ai) P n P i n P ∞ ≤ →∞ ≤ − →∞ Xi=n Xi=n

e) Use item e) to conclude that if the events An, n N are indepen- ∈ dent, then k∞=1 P (Ak) < is equivalent with P(A ) = 0. ∞ ∞ P 2 f) Use the Borel Cantelli lemma to prove that in an independent fair sequence of zeros and ones, every finite pattern of zeros and ones occurs infinitely many times with probability one.

6. We now use the Borel-Cantelli lemma to find a sequence of random variables which converges in probability but not almost surely. Consider X_n, n ∈ N, a sequence of independent random variables taking the values zero or one, with P(X_n = 1) = a_n.

a) Prove that X_n → 0 in probability if and only if a_n → 0.
b) Prove that X_n → 0 almost surely if and only if ∑_n a_n < ∞.
c) Find a choice of a_n such that X_n → 0 in probability but not almost surely.
d) Find an explicit subsequence of the example of item c) which converges almost surely.
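These criteria become concrete with a quick computation. The choice a_n = 1/n below is only a candidate to test against the criteria (it is my suggestion, not part of the exercise): the terms tend to zero, the partial sums diverge, and along the squares the probabilities are summable.

```python
# Testing the candidate a_n = 1/n against the criteria of exercise 6.
# (The choice of a_n is a suggestion for illustration, not from the notes.)

N = 100_000
a = [1.0 / n for n in range(1, N + 1)]

# criterion for convergence in probability: a_n -> 0
tail_sup = max(a[50_000:])           # sup of a_n over the tail n > 50000
print(tail_sup)                      # tiny, so X_n -> 0 in probability

# criterion for almost sure convergence: sum a_n < infinity.
# The harmonic series diverges, so X_n = 1 infinitely often (Borel-Cantelli).
harmonic = sum(a)
print(harmonic)                      # ~ log(N), keeps growing with N

# along the subsequence n = k^2 the probabilities are summable,
# so that subsequence converges almost surely (item d)
sub_sum = sum(1.0 / (k * k) for k in range(1, 1_000))
print(sub_sum)                       # bounded by pi^2/6
```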

7. In this exercise we prove the Kolmogorov zero-one law in a particular case. Consider the sequence space
$$\Omega = \{0, 1\}^{\mathbb{N}}$$
endowed with the "fair coin" measure P, i.e., under P all symbols ω_i, i ∈ N, are independent Bernoulli distributed with success probability 1/2. We call a subset A ⊂ Ω a cylinder if it is a finite union of sets of the form
$$A = \{\omega \in \Omega : \omega_{i_1} = \alpha_{i_1}, \ldots, \omega_{i_k} = \alpha_{i_k}\}$$
for a collection i_1, …, i_k ∈ N and a fixed sequence α ∈ Ω. In words, a cylinder is a set of sequences in which a finite number of symbols is fixed.

a) Show that the set of all cylinders is an algebra but not a σ-algebra.
b) We call A the σ-algebra generated by all cylinders. Show that the following sets belong to A:
$$A = \Big\{\omega \in \Omega : \sum_{i=1}^{\infty} \omega_i = \infty\Big\}$$
$$B = \Big\{\omega \in \Omega : \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \omega_i = \frac{1}{2}\Big\}$$
c) We also define A_{≥n} to be the σ-algebra generated by all cylinders of the form
$$\{\omega \in \Omega : \omega_{i_1} = \alpha_{i_1}, \ldots, \omega_{i_k} = \alpha_{i_k}\}$$
for a collection i_1, …, i_k ≥ n and a fixed sequence α ∈ Ω. In other words, these are the events depending on the symbols ω_i, i ≥ n. We then define the tail σ-algebra
$$\mathcal{A}_\infty = \bigcap_{n\in\mathbb{N}} \mathcal{A}_{\ge n}.$$
In words, these are the events that do not depend on any finite number of symbols. Show that the events A, B of item b) are elements of A_∞.
d) We also define A_{≤n}, the σ-algebra generated by all events depending on the symbols ω_i, i ≤ n. Show that for n < m, A_{≤n} and A_{≥m} are independent.
e) Fix A ∈ A_∞ and consider
$$\mathcal{F}_A := \{B \in \mathcal{A} : B \text{ is independent of } A\}$$
Show that F_A contains all cylinders and is closed under complements and increasing unions. By the monotone class theorem (see e.g. https://en.wikipedia.org/wiki/Monotone_class_theorem) conclude that F_A ⊃ A.
f) Conclude from item e) that for A ∈ A_∞, A ∈ F_A and therefore
$$P(A) = P(A \cap A) = P(A)^2$$
which implies P(A) ∈ {0, 1}. This is called the Kolmogorov zero-one law.
g) Apply item f) to conclude that for all a ∈ [0, 1] the set
$$\Big\{\omega : \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n}\omega_i = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n}\omega_i = a\Big\}$$
has measure zero or one.

CHAPTER 2

E-learning

This material is covered by Chapter 2 of Brzeźniak and Zastawniak's text "Basic Stochastic Processes" and by section 1 of chapter 10 of Gut's text "Probability: a graduate course". The exposition in B-Z is more detailed; Gut covers more ground.

‘TU Delft wants to be a learning organisation that continually adapts to changing environmental factors. This requires staff and managers to engage in continuous self-development. TU Delft invests in its staff by offering proven quality education and training for the purpose of personal and professional development. This education and training is delivered at different places within TU Delft .....’ And so, an old guy like me gets to take a course on new teaching methods every once in a while. Usually those courses on teaching are taught by teachers without any teaching experience. It is a bit like learning how to drive from somebody without a driver's licence. But every once in a while there is a useful course. I recently took one on E-learning and I must say that I am all for it. By E of course I mean the conditional expectation.

2.1. E-definitions Conditional probability was invented to deal with new information. Suppose I have two bears, a white bear and a black bear, and for some weird reason I am interested in the probability that both are male.

(1) A priori, knowing nothing further, the probability that both bears are male is equal to 1/4. (2) Suppose I get to know that one of the bears is male, then the probability that both are male goes up to 1/3. (3) Suppose I get to know that the white bear is male, then the probability goes up to 1/2. (4) Exercise Suppose I get to know that one of the bears is born on a Tuesday and is male. So now what is the probability that both bears are male? I know more than in (2) but I know less than in (3), and if the system


makes any sense, then the probability should be between 1/3 and 1/2. But what is it? You can find the surprising answer here.

In the example of the bears, we compute varying numbers P(A | B) for a fixed event A and varying B. Varying input, varying output. This smells like a function. Is there perhaps a single random variable Z such that its average value on each set B is equal to P(A | B)? In other words,
$$P(A \mid B) = \frac{1}{P(B)} \int_{\omega \in B} Z(\omega)\, dP$$
for all sets B in our σ-algebra? We can clean this up a little bit.

$$P(A \cap B) = \int_{\omega \in B} Z(\omega)\, dP$$
We can also replace the left-hand side by an integral:

$$\int_{\omega \in B} 1_A(\omega)\, dP = \int_{\omega \in B} Z(\omega)\, dP$$
And so we find a trivial answer: Z is equal to 1_A. But what if A is not in our σ-algebra? Then 1_A is not measurable and we need to find another function.

Definition 2.1. Let X be a random variable on a probability space (Ω, F, P) and let G ⊂ F be a sub-σ-algebra. Suppose Z is a G-measurable function such that for any B ∈ G
$$\int_{\omega \in B} X(\omega)\, dP = \int_{\omega \in B} Z(\omega)\, dP$$
Then Z is the conditional expectation of X, denoted by Z = E[X | G].

If you compare this to Gut's definition 1.1, then you see that he is more careful. Gut requires that X is integrable, but we have agreed that all our random variables are integrable unless stated otherwise. He also talks about the equivalence class of random variables: Z and Z′ are equivalent if they agree on a set of probability one. But we have agreed to write Z = Z′ if they are almost surely equal. Now two questions remain: does there exist such a Z, and is it unique? Yes and yes. Of course, we need to prove this. There is a useful inequality in probability theory, which says that for every ε > 0

(2.1)
$$\varepsilon\, P(Z \ge \varepsilon) \le \int_{Z \ge \varepsilon} Z\, dP$$
The inequality follows from the observation that ε·1_{Z≥ε} is a simple function ≤ Z. We will meet this inequality again and again. It is named after the great Andrey Markov. You can use it to solve the following exercise.

Exercise Let Y be a G-measurable random variable such that ∫_B Y dP = 0 for all B ∈ G. Prove that Y = 0.

If Z and Z′ are two different conditional expectations of the same X, then Y = Z − Z′ is equal to zero by this exercise. So the conditional expectation is unique. But does it exist? We need another result from measure theory.

Theorem 2.2 (Radon-Nikodym¹). Suppose that (Ω, F, µ) is a measure space and that γ is another measure with the property that µ(A) = 0 implies γ(A) = 0. Then there exists a non-negative measurable function f : Ω → [0, ∞) such that
$$\int_B g(\omega)\, d\gamma = \int_B g(\omega) f(\omega)\, d\mu$$
The function f is called the Radon-Nikodym derivative and it is denoted by dγ/dµ.²

Measure theory is an upgrade from calculus. In your old calculus lingo, the Radon-Nikodym theorem is called the substitution rule. Suppose you replace the Lebesgue measure dx by the measure dx² that we looked at in Lecture 1. Then the Radon-Nikodym derivative is f(x) = 2x.

Theorem 2.3 (Thm 1.2 in Gut). For any F-measurable random variable X, the conditional expectation E[X | G] exists.

Proof. When we integrate, we always deal with the positive and negative parts separately. We split X into a positive part and a negative part, X = X⁺ − X⁻, where X⁺ = max{X, 0} and X⁻ = max{−X, 0}. Both X⁺ and X⁻ are non-negative random variables. We assemble the conditional expectation from these two non-negative parts.

Assume X is non-negative. Then γ(B) = ∫_B X dP defines a measure on G such that γ(B) = 0 if P(B) = 0. The Radon-Nikodym theorem delivers a G-measurable random variable Z such that γ(B) = ∫_B Z dP. This is our conditional expectation. For an arbitrary X, we define the conditional expectations Z⁺ of X⁺ and Z⁻ of X⁻ and assemble Z = Z⁺ − Z⁻. □

Exercise Suppose Z = E[X | G]. Prove that E[Z] = E[X].
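On a finite probability space the Radon-Nikodym machinery reduces to something you can compute by hand: on each atom of G, the conditional expectation Z is the P-weighted average of X over that atom. The sample points, weights and partition below are invented for illustration.

```python
from fractions import Fraction as F

# Conditional expectation on a finite space: Z = E[X|G] is constant on
# each atom of G and equals the P-weighted average of X there.
# All numbers below are made up for the sake of the example.
P     = {'a': F(1, 4), 'b': F(1, 4), 'c': F(1, 3), 'd': F(1, 6)}
X     = {'a': 2, 'b': 6, 'c': 1, 'd': 7}
atoms = [{'a', 'b'}, {'c', 'd'}]          # the partition generating G

Z = {}
for atom in atoms:
    mass = sum(P[w] for w in atom)
    avg  = sum(P[w] * X[w] for w in atom) / mass
    for w in atom:
        Z[w] = avg

# defining property: the integrals of X and Z agree on every B in G
for atom in atoms:
    assert sum(P[w] * X[w] for w in atom) == sum(P[w] * Z[w] for w in atom)

# the exercise above: E[Z] = E[X]
EZ = sum(P[w] * Z[w] for w in P)
EX = sum(P[w] * X[w] for w in P)
print(Z, EZ, EX)
```

Using exact fractions avoids any floating-point noise, so the defining property holds with equality rather than up to tolerance.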

2.2. E-properties

Now that we have defined the conditional expectation, we have to learn how to manipulate it. Here are your four commandments. Gut supplies a longer list and, come to think of it, so does God. Both of them are being a bit repetitive.

(a) linearity: E[aX + bY | G] = a E[X | G] + b E[Y | G]
(b) monotonicity: if X is non-negative, then so is E[X | G]
(c) taking out what is known: if Y is G-measurable, then E[YX | G] = Y E[X | G]
(d) the tower of expectations: if H ⊂ G then E[X | H] = E[E[X | G] | H]

1 As in your course on Advanced Probability, see Rosenthal's book A rigorous course in probability, page 144; or look up page 342 in your Real Analysis book, by Aliprantis and Burkinshaw.
2 Dutch pride: as you can check on Wikipedia, the theorem was proved by Radon in 1913, generalized by Nikodym in 1930, and extended by the great Hans Freudenthal in 1936.

You spent many hours manipulating integrals using the substitution rule, or partial integration, and you have gotten so good at it that you can do this effortlessly, even in the most extreme situations if you had to.³ You will need your four commandments, just like you needed your old bag of calculus tricks. Of course, we still need to verify the validity of these commandments.

Proof. For brevity, denote the conditional expectation E[X | G] by Z.
(a) Follows directly from the definitions. Check it yourself.
(b) We need to show that {Z < 0} has probability zero. By the defining property of conditional expectation, we have that

$$\int_{Z<0} Z\, dP = \int_{Z<0} X\, dP \ge 0$$
since X is non-negative. So the non-positive function Z·1_{Z<0} has a non-negative integral. That can only happen if this function is equal to zero (please check!).
(c) This one is a bit more delicate; let's prove it for characteristic functions first. If Y = 1_A for A ∈ G, then we can take it out, because

$$\int_B 1_A Z\, dP = \int_{A \cap B} Z\, dP = \int_{A \cap B} X\, dP = \int_B 1_A X\, dP$$
By linearity, we can also take out simple functions. But taking out a general Y is more intricate. Let's do that later.
(d) Follows directly from the definitions. Check it yourself. By the way, Gut calls this 'smoothing', but most people say 'tower property'. □

Some remarks are in order. Property (b) implies that conditional expectation respects the order: if Y ≤ X then E[Y | G] ≤ E[X | G]. In property (c) we require that YX is integrable, which does not necessarily follow from the fact that X and Y are both integrable. In many ways, conditional expectation behaves like the ordinary expectation. If G is equal to the trivial algebra {∅, Ω} then E[X | G] = E[X]. We will now derive some properties of conditional expectation which we already know for ordinary expectation. The monotone convergence theorem says that E[X_n] ↑ E[X] if X_n ↑ X, and for conditional expectation we have

Lemma 2.4. If X_n ↑ X then E[X_n | G] ↑ E[X | G].

Proof. By the monotonicity property (b), if the sequence X_n is monotone, then so is E[X_n | G]. Now define Z = lim E[X_n | G]. By the monotone convergence theorem we may swap limits and integrals, and we like that so much that we swap them twice:

$$\int_B Z\, dP = \int_B \lim E[X_n \mid \mathcal{G}]\, dP = \lim \int_B E[X_n \mid \mathcal{G}]\, dP = \lim \int_B X_n\, dP = \int_B \lim X_n\, dP = \int_B X\, dP$$
Therefore, Z is indeed the conditional expectation of X. □

3 Jean Victor Poncelet was an officer in Napoleon's army who got incarcerated in Russia. Poncelet had excelled in mathematics in school. He practised it and developed it further in prison, just to survive the harsh conditions. He eventually returned to France and became a famous mathematician. If your studies are not going so well, try spending some time in jail.

We can now finish the proof of property (c), which I give for non-negative X, leaving the general case to you and Gut. Approximate Y by a sequence Y_n of simple functions. We already know that we can take out simple functions: E[Y_n X | G] = Y_n E[X | G]. By monotone convergence E[Y_n X | G] ↑ E[YX | G] and since Y_n ↑ Y we also have that Y_n E[X | G] ↑ Y E[X | G]. We conclude that we can take out Y.

Remember from your elementary probability class that E[X²] ≥ E[X]². This extends to conditional expectation.

Lemma 2.5. E[X² | G] ≥ E[X | G]².

Proof. For brevity, we write Z = E[X | G]. By monotonicity
$$E[(X - Z)^2 \mid \mathcal{G}] \ge 0$$
Expanding this, we find by linearity that
$$E[X^2 \mid \mathcal{G}] - 2E[XZ \mid \mathcal{G}] + E[Z^2 \mid \mathcal{G}] \ge 0$$
Taking out what is known,
$$E[X^2 \mid \mathcal{G}] - 2Z\,E[X \mid \mathcal{G}] + Z^2 \ge 0$$
which is equivalent to
$$E[X^2 \mid \mathcal{G}] \ge Z^2$$
and this we had to prove. □

You may remember that this inequality is a special case of Jensen's inequality E[φ(X)] ≥ φ(E[X]) for a convex function φ. Again, this extends to conditional expectation.

Theorem 2.6 (see Thm 2.2 in B-Z and Thm 1.6 in Gut). If φ is a convex function, then E[φ(X) | G] ≥ φ(E[X | G]).

Proof. A convex function is an envelope of linear functions. In other words,

φ(x) = sup{L(x) : L a linear function ≤ φ}. For careful readers, I need to point out that we achieve this with a countable family of functions L. Since φ ≥ L we have by monotonicity
$$E[\varphi(X) \mid \mathcal{G}] \ge \sup\{E[L(X) \mid \mathcal{G}] : L \le \varphi\}$$
By linearity of conditional expectation E[L(X) | G] = L(E[X | G]). Therefore
$$E[\varphi(X) \mid \mathcal{G}] \ge \sup\{L(E[X \mid \mathcal{G}]) : L \le \varphi\} = \varphi(E[X \mid \mathcal{G}])$$
□

Exercise Prove that E[|X| | G] ≥ |E[X | G]|.

You now have enough ammunition to have a go at homework exercise 1, in which you encounter the notation E[Y | X] for random variables X, Y. I have to explain what that means. For any random variable X there exists a smallest σ-algebra, denoted by σ(X), such that X is σ(X)-measurable. It is the σ-algebra generated by the sets X⁻¹(−∞, t] for arbitrary t ∈ R. Instead of E[Y | σ(X)] it is customary to simply write E[Y | X]. According to the Doob-Dynkin lemma, see B-Z page 4, E[Y | X] is of the form f(X). If Y and X are independent, then you may guess that E[Y | X] = E[Y]. More about that below. Since we are mathematicians, we generalize this to two or three or more random variables without even blinking. We write σ(X_1, …, X_n) for the σ-algebra generated by the sets X_i⁻¹(−∞, t], and E[Y | X_1, …, X_n] is of the form f(X_1, …, X_n). Now solve homework exercise 1 by taking out what is known. Check your answers against the Doob-Dynkin lemma.

2.3. E-laboration

Most of you learned your Linear Algebra from the great De Groot. The usefulness of this course can hardly be overestimated. Much if not all of mathematics takes the shape of Linear Algebra if you look at it properly. Measure theory is no different. We can add measurable functions and we can multiply them by real numbers, so they form a vector space M. An integral is a linear map – also called a functional – ∫ : M → R, provided that we only allow measurable functions that are integrable. Which goes without saying.

We defined ∫ first for indicator functions 1_A. We then extended it to the linear span of all indicator functions c_1 1_{A_1} + ⋯ + c_n 1_{A_n}. Finally we took limits to extend ∫ to the entire space M. This is a good mental picture, so let's dwell on it.

The random variable Z is equal to the conditional expectation E[X | G] if and only if

$$\int (X - Z)\, 1_B \, dP = 0$$
for all B ∈ G. In other words,
$$E[(X - Z)\, 1_B] = 0$$
Now think of E[XY] as an inner product ⟨X, Y⟩. Then this equation says that X − Z is orthogonal to the set of all indicator functions 1_B. We have to exercise some care here. The inner product E[XY] may not be defined, since XY need not be integrable. Remember the Cauchy-Schwarz inequality
$$(E[XY])^2 \le E[X^2] \cdot E[Y^2]$$
and conclude that the inner product ⟨X, Y⟩ = E[XY] is well defined on the Hilbert space L²(Ω, F, P) of square-integrable random variables. Linear Algebra now provides us with a new interpretation of conditional expectation. Since X − Z is orthogonal to the indicator functions, it is also orthogonal to their linear span. Taking limits, X − Z is orthogonal to the linear subspace L²(Ω, G, P) ⊂ L²(Ω, F, P). Now we can use our geometric imagination to picture the conditional expectation: it is the orthogonal projection of X onto L²(Ω, G, P). This is the content of Theorem 1.4 in Gut and homework exercise 2.

Figure 1. Observe Pythagoras: E[X²] = E[Z²] + E[(X − Z)²]
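The projection picture, including the Pythagoras identity of Figure 1, can be verified on a small example. The finite space, the measure and the partition generating G are invented for illustration.

```python
from fractions import Fraction as F

# Z = E[X|G] as orthogonal projection: X - Z is orthogonal to every
# G-measurable indicator, and E[X^2] = E[Z^2] + E[(X-Z)^2].
# The space, measure and partition are invented for illustration.
P     = {'a': F(1, 4), 'b': F(1, 4), 'c': F(1, 4), 'd': F(1, 4)}
X     = {'a': 1, 'b': 5, 'c': 2, 'd': 8}
atoms = [{'a', 'b'}, {'c', 'd'}]

def E(f):
    return sum(P[w] * f[w] for w in P)

Z = {}
for atom in atoms:
    avg = sum(P[w] * X[w] for w in atom) / sum(P[w] for w in atom)
    Z.update({w: avg for w in atom})

# orthogonality of X - Z against the indicators 1_B, B an atom of G
for atom in atoms:
    inner = sum(P[w] * (X[w] - Z[w]) for w in atom)
    assert inner == 0

lhs = E({w: X[w] ** 2 for w in P})
rhs = E({w: Z[w] ** 2 for w in P}) + E({w: (X[w] - Z[w]) ** 2 for w in P})
print(lhs, rhs)    # Pythagoras: the two sides agree
```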

In statistics, one is always looking for an estimator Y of an unknown parameter θ. Skillful statisticians use unbiased estimators, E[Y − θ] = 0, and reduce Var(Y − θ) as much as possible. In other words, skillful statisticians reduce the L²-norm of Y − θ. Now we have just learned that conditional expectations are orthogonal projections, so they reduce the norm. However, if we do not know θ, then how can we ever determine the orthogonal projection of Y − θ? In some special cases, it turns out that we can. Perhaps you remember the Poisson process. It describes the arrival of patients in a hospital or customers in a bank. If you know that the seventh customer arrives after two hours, but you have no information on the arrival times X_1, …, X_6 of the six previous customers, then these are uniformly distributed random variables on [0, 2]. This is an example of a sufficient statistic T. Given T, the distribution of the data set does not depend on the unknown parameter, which is the rate of the Poisson process in this case. For sufficient statistics, the conditional expectation E[Y | T] does not depend on θ and we can use it to reduce the variance. This is the content of the Rao-Blackwell theorem, as described by Gut. We will not use this theorem. Gut includes it to provide a statistical application at the end of chapter 10, but I doubt if we will ever get there. Ends can be elusive, as you will find out once you get to meet the martingales.

There is one more useful property that we have not discussed yet. Recall that X and Y are independent if P(X < a, Y < b) = P(X < a)P(Y < b) for all a, b. Or equivalently, expressing this in terms of measure theory: for all A ∈ σ(X) and B ∈ σ(Y) we have that P(A ∩ B) = P(A)P(B). Here is your fifth commandment:

(e) independence: if X and Y are independent, then E[X | Y] = E[X].

Proof. We need to show that ∫_B E[X] dP = ∫_B X dP for all B ∈ σ(Y). In other words, we need to show that P(B)E[X] = ∫_B X dP. We are going to show a little more and prove that if U is any σ(X)-measurable random variable, then
$$P(B)\,E[U] = \int_B U\, dP$$

First we prove this for an indicator function 1_A with A ∈ σ(X). Then we have indeed that
$$P(B)\,E[1_A] = P(B)P(A) = P(A \cap B) = \int_B 1_A\, dP$$
By linearity, this extends to all simple functions that are σ(X)-measurable. By monotone convergence, it extends to the random variables U. □

Exercise Look up the proof of the Cauchy-Schwarz inequality. Now prove
$$(E[XY \mid \mathcal{G}])^2 \le E[X^2 \mid \mathcal{G}] \cdot E[Y^2 \mid \mathcal{G}]$$

Homework on conditional expectation

1. Let X_i, i = 1, 2, … be independent random variables taking the values ±1 with probability P(X_i = 1) = P(X_i = −1) = 1/2. We denote by F_n the σ-algebra generated by X_1, …, X_n. Further we denote S_n = ∑_{i=1}^n X_i. Finally, τ is a random variable taking values in N, with E(τ²) < ∞, and τ is independent of all X_i, i ∈ N. Compute the following conditional expectations.

a) E(e^{X_1 + X_2 − X_3} | X_1, X_2)
b) E(S_n | F_{n−1})
c) E(S_n² − n | F_{n−1})
d) E(e^{λS_n} | F_{n−1})
e) E(S_n² | S_{n−1})
f) $E\Big[\Big(\sum_{i=1}^{\tau} X_i\Big)^2 \,\Big|\, \tau\Big]$
g) Using item f), show that E(S_τ² − τ) = 0.

2. We consider a probability space (Ω, A, P) and F ⊂ A a σ-algebra. We consider two random variables X, Y such that Y is F-measurable, and both X and Y are square integrable, i.e., in L²(Ω, A, P).

a) Show that E(E(X | F) | F) = E(X | F).
b) Show that
$$E\big((X - E(X \mid \mathcal{F}))\,Y\big) = 0,$$
and also
$$E\big[(X - E(X \mid \mathcal{F}))\,E(X \mid \mathcal{F})\big] = 0.$$
This shows that the decomposition X = E(X | F) + (X − E(X | F)) is an orthogonal decomposition.

c) Show that
$$\mathrm{Var}(X) = E(\mathrm{Var}(X \mid \mathcal{F})) + \mathrm{Var}(E(X \mid \mathcal{F}))$$
where by definition
$$\mathrm{Var}(X \mid \mathcal{F}) = E(X^2 \mid \mathcal{F}) - E(X \mid \mathcal{F})^2$$
One calls this formula the variance decomposition formula.

CHAPTER 3

Meet the Martingales

I follow Gut's sections 2-5 quite closely. A part of this material is also covered by Sections 3.1-3.4 of Brzeźniak-Zastawniak, which is always the easier read.

I take my wife to the casino. She is a conservative and consistent player. Every round she bets one euro on black. Her winnings M_0, M_1, M_2, … start at M_0 = 0 and they go up or down by one euro per round, with probability one half. Let X_n be my wife's gain in round n; then X_n = ±1, each with probability one half. Her total gain M_n is the sum of the gains per round:

$$M_n = X_1 + \cdots + X_n$$
This is the simple random walk. Of course, you know it already. It is omnipresent in probability theory. It has the property

$$E[M_{n+1} \mid M_n] = M_n$$
Now that you know that I am a gambling man, you lure me away from my wife by producing a deck of cards. You shuffle it and you turn the cards over one by one. Before you turn the next card over, I can say RED and if it is red, I get one euro. If it is black, I have to give you one euro. I am getting all scientific over this and denote the expected return after n cards have been turned over by M_n. Then M_0 = 0 and M_1 = 1/51 if the first card was black, etc. Again, the M_n have the property that
$$E[M_{n+1} \mid M_n] = M_n$$
As a gambling man, I am forever trying to beat the system. I could wait and say RED only if there are more red cards than black cards in the deck. Can I make some money out of this game? No, I cannot. The great Peter Winkler has an ingenious way to explain this: every time I say RED, you may just as well turn over the card that is at the bottom of the deck. Whatever strategy I adopt, the expected earning will always be zero.

Exercise The rule of this game is that I have to say RED at some point of time. I cannot pass fifty-two times. Suppose that I could; would the game become beneficial? How about the following game: I am allowed to say either BLACK or RED. How much do you charge me for letting me play this game?

Let's consider more serious stuff and turn our attention towards family planning. To curb the exponential growth of their population, the Chinese government introduced the one-child policy.¹ Of course, such a policy would eventually eliminate the entire population and that is why Chinese families may now have two children. This is comparable to affluent countries, where the average number of children in

1Delft pride: the Delft mathematician Geert Jan Olsder played a key role in Chinese family planning.

every family is two. Mathematically, we can think of the number of children per family as a random number: everybody tosses a coin twice, and produces one child for every toss that came up heads. Then we get a sequence M_0, M_1, … where M_0 is the number of the first generation, M_1 is the number of the second generation, etc. Everybody gets one child on average. This is an example of a branching process. Note that in mathematics there is no need for a mother and a father; people can just clone themselves. Anyway, again we have that E[M_{n+1} | M_n] = M_n, where M_n is the number of people in the n-th generation. What happens in the long run?²

Challenge Start with one fair coin. Flip it. If a tail comes up, take the coin away and the experiment stops. If a head comes up, add an additional fair coin and continue with two coins. Flip them both. Take tails away, add a coin for each head. Etc. Let M_n be the number of coins after n tosses. So M_0 = 1; M_1 is 0 or 2; M_2 is 0, 2, or 4. Etc. Prove that E[M_{n+1} | M_n] = M_n. Observe that P(M_n = 0) increases with n. Be a smartypants and compute what happens in the long run!
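Without giving the challenge away, you can at least tabulate P(M_n = 0). Each coin is replaced by 0 or 2 coins with probability 1/2 each, so conditioning on the first flip suggests the recursion p_{n+1} = (1 + p_n²)/2 for p_n = P(M_n = 0). This recursion is my own derivation, not from the notes; verify it before trusting the table.

```python
# Tabulating p_n = P(M_n = 0) for the coin challenge via the recursion
# p_{n+1} = (1 + p_n^2)/2: the experiment dies immediately (probability
# 1/2) or branches into two independent copies that must both die
# (probability 1/2 * p_n^2). A sketch of my own; double-check it.
probs = [0.0]
for _ in range(2000):
    p = probs[-1]
    probs.append((1.0 + p * p) / 2.0)

print(probs[1], probs[2], probs[10], probs[2000])
# the sequence increases, as the challenge says -- where is it heading?
```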

3.1. Coming to terms with terminology

Finally, it is time to meet the martingales. In probability theory, a sequence of random variables X_1, X_2, X_3, … is called a stochastic process. Ordinary people would say that this is a stochastic sequence, but probabilists like processions a lot. A process lives on a probability space (Ω, F, P) and we think of the random variables X_i as signals that come by. We want to learn from these signals and predict what will happen in the future. Does my sequence converge? Here is where measure theory comes in: it provides us with σ-algebras to describe the flow of information and hands us the tools to swap limits and expectations. Along with each process we consider an increasing sequence of σ-algebras F_0 ⊂ F_1 ⊂ F_2 ⊂ ⋯. An increasing sequence of σ-algebras is called a filtration. Unless stated otherwise, we start with the trivial F_0 = {∅, Ω}, which means that initially we know nothing. The σ-algebra F_1 is generated by X_1 and describes what we can learn from X_1. Its successor F_2 is generated by X_1 and X_2, while F_3 is generated by X_1, X_2 and X_3, and so forth. The σ-algebra F_∞ is generated by the entire sequence of signals and represents the knowledge that we will have in the long run. It is required that each X_n is F_n-measurable, and we say that the stochastic process X_n is adapted to the filtration F_n. It is customary to define F_n as the σ-algebra that is generated by X_1, …, X_n. This is called the natural filtration. Instead of the natural filtration, one may also consider other filtrations, as long as X_n remains adapted.

Definition 3.1. A stochastic process M_n is called a martingale if
$$E[M_{n+1} \mid \mathcal{F}_n] = M_n$$
For a martingale, once you have received signal M_n you do not know what is going to happen next, but you do know that on average things will remain the same.

Exercise Suppose that M_n is a martingale. Use the tower property to prove that (a) E[M_n | F_m] = M_m for any m ≤ n, and (b) E[M_{n+1} − M_n] = 0.

2 In the long run, we are all dead.

The process M_{n+1} − M_n is called the martingale difference process. Martingale differences are so important that they deserve their own notation:
$$dM_n = M_{n+1} - M_n$$
You should think of martingale differences as the outcomes of consecutive gambles. The martingale keeps track of the overall score. In the previous lecture, we learned that the conditional expectation is an orthogonal projection in L². Pythagoras now gives us
$$E[(M_{n+1} - M_k)^2] = E[(dM_k)^2] + \ldots + E[(dM_n)^2]$$
for k ≤ n. We can also express this in terms of the variance:
$$\mathrm{Var}(M_{n+1} - M_k) = \mathrm{Var}(dM_k) + \ldots + \mathrm{Var}(dM_n)$$
If k is fixed and n increases, then the variance of M_n − M_k increases. If M_n is further away, it gets harder to predict. But if n is fixed and k increases, then the variance of M_n − M_k decreases. If we get closer to M_n, it gets easier to predict. Can we predict what will happen in the end? Or mathematically speaking, can we compute lim M_n? That will be a central question for the upcoming lectures.

Note that lim M_n is the same as ∑ dM_n. Here is an exercise on sums:

Exercise If ∑ a_n converges then a_n → 0 but, as you know, the converse does not hold. It does not even suffice to require that all partial sums ∑_{n=0}^k a_n are bounded and that a_n → 0 (think of an example!). Remember that a Hilbert space is a linear space with an inner product that is complete with respect to the norm. Suppose a_n is an orthogonal sequence in a Hilbert space, ⟨a_m, a_n⟩ = 0 if m ≠ n. Suppose ∑_{n=0}^k a_n is bounded for all k and that ‖a_n‖ → 0; prove that ∑ a_n converges.

If we think of martingale differences as consecutive gambles, then why not vary the stakes? We could consider upping the stakes by a factor T_{n+1} for the next gamble dM_n, where we use the index n + 1 in T_{n+1} since it involves a bet on round n + 1. A process T_{n+1} is called predictable if E[T_{n+1} | F_n] = T_{n+1}. In other words, T_{n+1} is F_n-measurable. If we stake T_k in round k, then our score after n + 1 rounds is

$$T_1\, dM_0 + T_2\, dM_1 + \ldots + T_{n+1}\, dM_n$$
A conservative player like my wife stakes one euro in every round, which just produces M_n. Meanwhile, I abandoned my wife and secretly started playing on the other side of the table. In every round, I prudently vary my stakes according to my wife's current score, which in mathematical terms means that T_{n+1} = f(M_n) for some measurable function f. My score proceeds like

$$f(M_0)\, dM_0 + f(M_1)\, dM_1 + \ldots + f(M_n)\, dM_n$$
which looks a bit like an integral, doesn't it? We are taking our first steps in stochastic calculus. But it will take more lectures before we learn how to walk. Gut quotes a passage from a book by the Hungarian writer Imre Kertész, to illustrate that life is a martingale, which it is not. You can take this old man's word for it: life is a supermartingale.
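Winkler's claim that a predictable staking rule cannot change the expected score can be checked by brute force on a short game: enumerate all equally likely ±1 sequences and average. The staking rule f below is an arbitrary invention; any function of the current score gives the same conclusion.

```python
from fractions import Fraction as F
from itertools import product

# Brute-force check: with a predictable stake T_{n+1} = f(M_n), the
# expected final score of the martingale transform is still zero.
# The rule f is an arbitrary example (bet more when behind).
def f(m):
    return 2 if m < 0 else 1

N = 6                                    # rounds: 2^N equally likely paths
total = F(0)
for path in product((-1, 1), repeat=N):
    m, score = 0, F(0)
    for dx in path:                      # dx plays the role of dM_n
        score += f(m) * dx               # stake decided before the gamble
        m += dx
    total += score

expected_score = total / 2 ** N
print(expected_score)
```

The enumeration is exact, so the zero is exact: conditioned on any prefix, the stake is already fixed and the next gamble averages to zero.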

Definition 3.2. An adapted process M_n is a supermartingale if E[M_{n+1} | F_n] ≤ M_n. It is a submartingale if E[M_{n+1} | F_n] ≥ M_n.

For supermartingales, the expected returns go down. For submartingales, they go up. This is counterintuitive: super is bad and sub is good. This is like a Party For Freedom that calls for restrictions on freedom and a removal of basic human rights. Of course, there is always a good reason for such twisted terminology. Martingales turn out to be related to harmonic functions, which you may have encountered when you studied PDEs or Fourier Analysis.³ Anyway, there are also superharmonic and subharmonic functions. These are related to supermartingales and submartingales.

Exercise Suppose that M_n is a martingale. Apply Jensen to conclude that M_n² is a submartingale.

Martingale is an odd name. You may wonder about the origin of this word. To me, martingale sounds like a pop singer – marvingaye – but it has been derived from Martigues, a peninsula close to Marseille. The good people of Martigues, the Martingales, are fond of gambling, bless them. They may have picked it up in Monte Carlo, which is a few miles down the coast. Gamblers who take extreme risks at the roulette table are known as Martingales, which is not very flattering. In the south of France, martingale is another word for prostitute, which is Latin for being in front of a building. If the lady escorts you inside, she becomes an institute. In a rather strange turn of events, martingales are now highly respected within financial institutions.

3.2. Thirteen examples

In section 10.3, Gut gives thirteen examples, and five more in theorem 3.2. To check that these are martingales, you need to compute conditional expectations, using your five commandments. It is a good exercise. Here is an outline:

Examples 1 and 2 concern a random walk S_n = X_1 + ⋯ + X_n for independent random variables X_i such that E[X_i] = 0. This is a more general version of the simple random walk of my wife in the casino, but the computations are similar.

Example 3 considers the square of S_n, but you already know that this is a submartingale because you just solved an exercise. The conditional expectation of a submartingale increases with n. You could say it has a drift. Gut shows how to remove the drift and make it a martingale, which is a prelude to Doob's decomposition, which we will meet below.

Example 4 is the multiplicative analogue of the random walk. Instead of adding independent X_n that have zero mean, we could also multiply by independent X_n that have expectation one. One could say that the sequence X_1, X_1X_2, X_1X_2X_3, … is a multiplicative martingale. The standard model of the asset price in financial mathematics is a multiplicative martingale.

Example 5. Double or nothing. This is the typical behavior of a Martingale at the roulette table. Martingales keep on betting on black, but unlike my wife they go all out. They start at M_0 = 1 and double or nothing in every round. If you look at this properly, you see that it is a multiplicative martingale. The factor X_n is either 0 or 2. Multiplicative martingales usually are a recipe for disaster (asset prices are a disaster? oh no!). This one is no different. Please check that lim M_n = 0, almost surely. You should compare this to the example in the first lecture of a function f_n such that ∫_0^1 f_n dλ = 1 but ∫_0^1 lim f_n(x) dλ = 0. Here we have the same example, but in a probabilistic setting: E[M_n] = 1 but E[lim M_n] = 0.
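For example 5 the distribution can be written down exactly: after n rounds, M_n = 2^n with probability 2^{-n} and M_n = 0 otherwise. This exhibits E[M_n] = 1 versus E[lim M_n] = 0 in a few lines.

```python
from fractions import Fraction as F

# Double or nothing, exactly: M_n = 2^n with probability 2^-n, else 0.
# So E[M_n] = 1 for every n, while P(M_n = 0) -> 1 (and lim M_n = 0 a.s.).
for n in range(11):
    p_alive = F(1, 2 ** n)
    mean = p_alive * 2 ** n              # the M_n = 0 branch contributes nothing
    assert mean == 1
print(n, p_alive, float(1 - p_alive))    # P(M_10 = 0) is already about 0.999
```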
The multiplicative martingale in example 6 appears in a homework exercise.

Example 7 is useful in statistics. Suppose Y1, Y2, ... is a sequence of i.i.d. random

3In Delft we use Korevaar's book on Fourier Analysis, check page 64

variables. Statisticians are forever confused about their distribution. It could be this, or it could be that. If it is this, then the probability density is f, but if it is that, then the probability density is g. Now define the i.i.d. sequence of likelihood ratios

Xn = g(Yn) / f(Yn)

If the distribution is this, then E[Xn] = 1 (please check!) and we can construct a multiplicative martingale Mn = X1 ··· Xn. Now the statistician can apply martingale theory, which we are about to develop, to test if the distribution really is this. If it is that, then Mn is not a martingale.

Example 8 is the Galton-Watson process, which is another example of a branching process. Observe that the process is a sum, like a random walk, but that the number of terms in this sum does not increase one by one. It varies stochastically. We will be concerned with lim Mn.

Example 9 argues the other way around. Suppose we already know the limit, can we construct a martingale that converges to it? Yes we can. Let X be the limit and define Xn = E[X | Fn]. Then Xn is a martingale. But does it converge to X? More about this later.

Example 10 demonstrates how to convert a non-martingale into a martingale. Suppose we have any process Zn and denote the differences dZn by Yn+1. For a martingale, we need that differences are orthogonal, so let's do that. Define Xn+1 = Yn+1 − E[Yn+1 | Fn]. By the linearity and tower commandments, the new differences Xn+1 are orthogonal and so the summed process Mn = X1 + ... + Xn is a martingale.

Example 11 just remarks that by the independence commandment, the previous example simplifies if the Yn are independent. One notable example is a random walk with a drift. If P(Yn = 1) = p and P(Yn = −1) = 1 − p, then we can replace the increments Yn by Xn = Yn + 1 − 2p to remove the drift.

Example 12 is me gambling in the casino while keeping an eye on my wife.
The process T1dM1 + ... + TndMn is called a martingale transform. It is a first step towards a stochastic integral. Gut denotes Tn by vn and dMn by Un. Why, I don't know.

Example 13 is a slight extension of examples 1 and 2.

We have reached the end of the list. Gut gives some more constructions in propositions 3.1-4 and theorems 3.1 and 3.2. These are all applications of our five commandments, and Jensen's inequality. Now I did not call Jensen's inequality a commandment and perhaps I should have, but then we would have had six commandments. Six is such a bad number that God took a day off simply to put seven days in a week.

Exercise It is standard notation in financial mathematics to write x^+ = max{0, x} and we adopt that. Suppose Mn is a martingale. Apply Jensen and prove that Mn^+ is a submartingale.

3.3. Linear Algebra Again

In L^2 we can think of martingale differences as orthogonal increments, which is nice. The price that we have to pay for this is that we need to assume that our random variables are square integrable. In other words, they have finite variance.

Exercise A Pareto random variable X is omnipresent in economics. It takes only values ≥ 1 (minimum wealth is one unit) and it satisfies P(X > x) = (1/x)^α. Verify that E[X] < ∞ if α > 1 and that E[X^2] < ∞ if α > 2. As the economist Thomas Piketty can tell you, in recent years the α has been going down (and what, I ask you, has not been going down lately?), increasing the social tension.
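The tail condition is easy to explore numerically. Below is a small sketch of my own (not part of the exercise) that samples Pareto variables by inverse-transform sampling: if U is uniform on (0,1), then X = U^(−1/α) satisfies P(X > x) = x^(−α). For α ≤ 1 the running mean of such a sample never settles down, which is the infinite-expectation phenomenon in action.

```python
import random

def pareto_sample(alpha, n, seed=0):
    """Draw n Pareto(alpha) samples on [1, oo) by inverse transform:
    if U is uniform on (0,1) then X = U**(-1/alpha) has
    P(X > x) = x**(-alpha) for x >= 1."""
    rng = random.Random(seed)
    return [rng.random() ** (-1.0 / alpha) for _ in range(n)]

samples = pareto_sample(alpha=3.0, n=100_000)
tail = sum(1 for x in samples if x > 2) / len(samples)
print(tail)   # should be close to 2**-3 = 0.125
mean = sum(samples) / len(samples)
print(mean)   # alpha/(alpha - 1) = 1.5 for alpha = 3
```

With α = 3 both E[X] and E[X^2] are finite, so the empirical mean and tail frequency stabilize quickly; try α = 0.9 and watch the mean drift.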

I already remarked above that for a martingale Var(Mk) increases with k but Var(Mn − Mk) decreases with k. Now I am an old man and a bit forgetful, let me remark that again in a lemma, which is equivalent to Gut's lemma 4.1.

Lemma 3.3. Martingale differences dMn are mutually orthogonal. Pythagoras says that

E[Mn^2] = E[dM1^2] + ... + E[dMn−1^2]

Proof. We need to prove that E[dMm dMn] = 0 if m ≠ n. Suppose m > n, condition on Fn+1 and take out what is known:

E[dMm dMn | Fn+1] = dMn E[dMm | Fn+1]

By the tower commandment, E[dMm | Fm] = 0 implies that E[dMm | Fn+1] = 0. We are done. □

For submartingales E[dMn | Fn] ≥ 0. If we apply the computation in the proof above to submartingales, then we find E[dMm dMn | Fn+1] = dMn E[dMm | Fn+1]. Since E[dMm | Fn+1] ≥ 0 by the tower commandment, inner products of submartingale differences are non-negative. Pythagoras says that for submartingales

E[Mn^2] ≥ E[dM1^2] + ... + E[dMn−1^2]

Or perhaps this is not Pythagoras, because we use the rule of the cosine and Pythagoras was unaware of cosines because he was too busy drowning people who ratted out that the root of two was not a rational. The rule of the cosine is proposition twelve in Euclid's Elements.
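Lemma 3.3 invites a brute-force check. The sketch below (my own illustration) enumerates all 2^n equally likely sign paths of a simple random walk Mk = X1 + ... + Xk and verifies Pythagoras: E[Mn^2] equals the sum of the E[dMk^2], which is n since every increment has variance one.

```python
from itertools import product

def pythagoras_check(n):
    """Enumerate all 2**n equally likely sign paths of a simple random
    walk and return (E[M_n**2], sum over k of E[dM_k**2])."""
    paths = list(product([-1, 1], repeat=n))
    p = 1 / len(paths)  # each path has probability 2**-n
    e_mn2 = sum(p * sum(x) ** 2 for x in paths)
    e_increments = sum(sum(p * x[k] ** 2 for x in paths) for k in range(n))
    return e_mn2, e_increments

print(pythagoras_check(6))  # both numbers equal 6.0
```

Both totals come out to n, as the orthogonality of the increments demands.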

Linear algebra appeals to our geometric imagination. It makes us see the math. If we move from M0 to Mn then each increment takes an orthogonal turn into a completely different dimension, which gets a bit hard to visualize after three or four steps, but never mind that. We get the picture. We move on to section 5 of Gut where, and this is a bit worrying because we have barely begun, submartingales are starting to get decomposed.

Theorem 3.4 (Doob decomposition). Suppose Yn is a submartingale. Then there exists a martingale Mn and a predictable process An such that

Yn = Mn + An

This decomposition is unique and An is increasing.

We will proceed with the proof of this theorem in a minute. And now, for something completely the same: Gram-Schmidt orthogonalization. The Gram-Schmidt process4 takes a set of independent vectors {v1, v2, v3, ...} and replaces them by a set of orthogonal vectors {u1, u2, u3, ...}. The Gram-Schmidt process replaces the vi one by one, putting

u1 = v1
u2 = v2 − proj_{u1}(v2)
u3 = v3 − proj_{u1,u2}(v3)
...

In this notation, proj_{u1}(v2) is the vector that is in the span of u1 and which is closest to v2. Similarly, proj_{u1,u2}(v3) is the projection onto the span of u1 and u2, and so forth. In our stochastic world, a projection is a conditional expectation. We can now proceed with the proof. You may want to flip back to Gut's example 10 and compare it.

Proof. Apply Gram-Schmidt orthogonalization:

U1 = Y1
U2 = Y2 − E[Y2 | F1]
U3 = Y3 − E[Y3 | F2]
...

The random variables Un have the property that E[Un | Fn−1] = 0, so they form a martingale difference process and therefore U1, U1 + U2, U1 + U2 + U3, ... is a martingale. Let's write Mn = U1 + ... + Un and let's look at the difference between Mn and Yn.

Y1 − M1 = 0
Y2 − M2 = E[Y2 | F1] − Y1
Y3 − M3 = E[Y3 | F2] − Y2 + E[Y2 | F1] − Y1
...

Since the Yi are a submartingale, each E[Yn | Fn−1] − Yn−1 is non-negative and that is why the difference An = Yn − Mn is forever increasing. Note that An is Fn−1-measurable, i.e., predictable.

A word on uniqueness. There really is only one way to go about the Gram-Schmidt process. If you had been asked to invent such a process, then you would have come up with the exact same thing. Doob's decomposition, which is just Gram-Schmidt, is unique. When the great Joe Doob developed the theory of martingales – during the Great Depression all he could find was a job as a statistician – he was the first one to think about this decomposition (and much more), so it has been named after him. □
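For the submartingale Yn = Sn^2 of example 3 the decomposition can be computed explicitly: E[Yn | Fn−1] − Yn−1 = E[(Sn−1 + Xn)^2 | Fn−1] − Sn−1^2 = 1, so in the indexing of the proof An = n − 1 and Mn = Sn^2 − An. The sketch below (my own illustration) runs this recipe for a simple random walk, where the conditional expectations can be computed exactly, and checks a necessary consequence of the martingale property by enumerating all paths.

```python
from itertools import product

def doob_decomposition(n):
    """Doob decomposition of the submartingale Y_k = S_k**2 for a simple
    random walk S_k = X_1 + ... + X_k with P(X_i = +/-1) = 1/2.
    Here E[Y_k | F_{k-1}] - Y_{k-1} = 1, so the predictable increasing
    part is A_k = k - 1 and M_k = Y_k - A_k should be a martingale.
    As a sanity check we verify that E[M_k] is constant in k,
    a necessary condition, by exact enumeration of all paths."""
    A = [k - 1 for k in range(1, n + 1)]
    for k in range(1, n + 1):
        paths = list(product([-1, 1], repeat=k))
        e_mk = sum(sum(x) ** 2 - A[k - 1] for x in paths) / len(paths)
        assert e_mk == 1.0  # E[M_k] = E[S_1**2] - A_1 = 1 for every k
    return A

print(doob_decomposition(6))  # [0, 1, 2, 3, 4, 5]
```

The increasing part is deterministic here because the walk is homogeneous; for a general submartingale An is genuinely random, but still predictable.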

4Listen up probabilists: to have a process, things need to be processed, churned in and churned out, as in the Gram-Schmidt process.

Our final result is the Krickeberg decomposition. It is only used at the end of Gut's chapter 10, where it is combined with the Blackwell-Rao theorem. I already told you, we are probably never going to reach the end. But I include a proof anyway for completeness sake. I don't think we are going to need it.

Theorem 3.5 (Krickeberg decomposition). For any martingale Mn such that sup{E[Mn^+] : n ≥ 1} < ∞ there exist non-negative martingales Ln and Nn such that

Mn = Ln − Nn

Note: this decomposition is not unique.

Exercise Choose any non-negative martingale Mn such that M0 = 0. Krickeberg decompose it into Ln − Nn which are both non-negative and have L0 = N0 = 1.

Proof. By the conditional Jensen inequality, Mn^+ is a non-negative submartingale since φ(x) = x^+ is a convex function. Fix an index k and define Zn = E[Mn^+ | Fk]. By the tower property and by monotonicity

Zn+1 = E[Mn+1^+ | Fk] = E[E[Mn+1^+ | Fn] | Fk] ≥ E[Mn^+ | Fk] = Zn

and so the sequence Zn is non-decreasing. It converges to Z. By monotone convergence E[Z] = lim E[Zn] = lim E[Mn^+]. Here we use our assumption that E[Mn^+] is bounded to conclude that E[Z] is finite. In other words, Z is integrable. This limit Z can be constructed for every index k. Let's call these limits Lk. Then

E[Lk+1 | Fk] = E[lim_n E[Mn^+ | Fk+1] | Fk]

which we may swap by monotone convergence

lim_n E[E[Mn^+ | Fk+1] | Fk] = lim_n E[Mn^+ | Fk] = Lk

Therefore Lk is a martingale. The exact same argument also applies to Mn^−. You need to observe that E[Mn] = E[Mn^+] − E[Mn^−] and that for a martingale E[Mn] is constant. Therefore, our assumption that E[Mn^+] is bounded implies that E[Mn^−] is bounded as well. So we may apply the same construction to Yn = E[Mn^− | Fk] to find another martingale Nk. Since Mk = Zn − Yn, the same holds for the limits and we conclude that Mk = Lk − Nk. □

Homework: Martingales and martingale convergence

1. The random variables Xi, i ∈ N are independent and equal to ±1 with equal probability. We call Fn the filtration generated by the Xi, i.e., Fn = σ{X1, ..., Xn}. We put Sn = Σ_{i=1}^n Xi. Determine for the following processes {Zn, n ∈ N} whether they are martingale, supermartingale or submartingale (w.r.t. the filtration {Fn, n ∈ N}), or none of these. Take care to verify that Zn is Fn measurable and integrable.

a) Zn = e^{λSn} / (cosh(λ))^n
b) Zn = Sn^2 − n
c) Zn = Σ_{i=1}^n 2^i Xi
d) Zn = log(2n + Sn)
e) Zn = Σ_{i=1}^n (1/2)^{X1+...+Xi−1} Xi (where X0 = 0 by definition)
f) Zn = nSn

2. We consider {Xn^(1), n ∈ N} and {Xn^(2), n ∈ N}, both martingales w.r.t. a filtration {Fn, n ∈ N}. Determine whether the following are martingale, submartingale or supermartingale.

a) Zn = min{Xn^(1), Xn^(2)}
b) Zn = Xn^(1) − Xn^(2)
c) Zn = max{(Xn^(1))^2, Xn^(2)}
d) Zn = e^{Xn^(1) + Xn^(2)}

3. This exercise introduces you to the predictable quadratic variation process of a martingale. Let {Mn, n ≥ 1} be a martingale, define M0 = 0 and define

∆n = E[(Mn − Mn−1)^2 | Fn−1]

for n ≥ 1.

a) Show that ∆n ≥ 0 and show that also

∆n = E[Mn^2 − Mn−1^2 | Fn−1]

b) Define An = Σ_{i=1}^n ∆i. Show that

Zn = Mn^2 − An

is a martingale. An is called the predictable quadratic variation process of {Mn, n ≥ 1}. Predictable means that An depends only on Fn−1.

c) Assume now that there exists a predictable increasing process {Bn, n ≥ 1}, i.e., Bn depends only on Fn−1 and is increasing (as a function of n), such that

Mn^2 − Bn = Z'n

is a martingale. Show that in that case Bn = An. Hint: condition on Fn−1.
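To build intuition for exercises like these, it can help to estimate drifts numerically before proving anything. The sketch below is my own tool, not part of the homework: it simulates many ±1 paths and estimates E[Zn+1 − Zn] for a candidate process. A martingale should show an estimate near zero, a submartingale a positive one. It only gives a hint, of course: you still have to prove it.

```python
import random

def estimate_drift(process, n=5, trials=200_000, seed=1):
    """Monte Carlo estimate of E[Z_{n+1} - Z_n] for a process Z that is
    a function of a +/-1 random walk path.  `process` maps a list of
    X's to the value of Z at time len(path).  A near-zero estimate
    hints at a martingale, a clearly positive one at a submartingale."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.choice((-1, 1)) for _ in range(n + 1)]
        total += process(xs) - process(xs[:n])
    return total / trials

# sanity check on S_n**2 - n, which we already know is a martingale
drift = estimate_drift(lambda xs: sum(xs) ** 2 - len(xs))
print(drift)  # close to 0
```

Feed it the other candidate processes and compare the sign of the estimate with your pen-and-paper answer.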

CHAPTER 4

Welcome to the California Casino

I cover Gut's sections 6, 7, and 8, more or less. This material is also covered by sections 3.5 and 3.6 of Brzeźniak-Zastawniak.

George Gamow and Marvin Stern wrote Puzzle-Math, a book of seemingly simple puzzles that contain profound mathematics. One of their puzzles is on family planning. I will describe it here in the words of the great Martin Gardner, in his book Entertaining Mathematical Puzzles. Please buy it.1

The great Sultan Ibn-al-Trump proclaimed a law to increase the ratio of women in his country, so that men could have larger harems: As soon as a woman gives birth to her first son, she is forbidden to have any more children. In this way, some families would have many girls and one boy, but no family would have more than one boy. Surely, it would not be long until females greatly outnumber the males.

We will learn in this lecture that there is no way that the Sultan can beat the system. The ratio of men and women will remain a steady fifty percent, no matter what law the great Sultan passes.

4.1. You can check out any time you like

As you know, a martingale is the mathematician's model of a fair casino. In a fair casino, you can quit any time you like. A stopping time is a random variable τ that takes values 0, 1, 2, ..., or ∞. It represents the round after which the gambler stops placing bets. Decent people do not enter a casino, τ = 0. Winners never quit,

1I bought it at Waltman’s bookstore, Binnenwatersloot. The price is very decent. I am a bit careful with money, you know.


τ = ∞. A man of habits always quits at the same time, τ = k for some fixed k. A sophisticated gambler has an intricate strategy and a complicated τ. But a gambler can only decide to stop after round n based on the information that is available up to that time:

Definition 4.1. A random variable τ is a stopping time if

{τ = n} ∈ Fn

for all non-negative integers n. The n is even allowed to be ∞.

Exercise Either τ = 0 or τ ≥ 1.

Exercise τ is a stopping time if and only if {τ ≤ n} ∈ Fn for all n.

The great Sultan is trying to influence a standard random walk: Xn = 1 for a girl and Xn = −1 for a boy. The random walk Sn = X1 + ... + Xn measures the excess of girls over boys. The law of the great Sultan says τ = min{n : Xn = −1}, which is a stopping time (please check!). If the great Sultan proclaims that you are to stop having children before the first time you have a boy, which would certainly promote the number of women for no boys would be born, then a mother would have to know the sex of her next child before conception. This is not a stopping time. Abortion is not an option.
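The Sultan's stopping time is easy to simulate. Here is a minimal sketch of my own: each family has children until the first boy, so τ is geometric with E[τ] = 2, and the overall sex ratio stays at fifty percent no matter what the law says.

```python
import random

def simulate_families(families=100_000, seed=2):
    """Each family has children until the first boy (the Sultan's law).
    Returns (average number of children, fraction of girls overall)."""
    rng = random.Random(seed)
    children = girls = 0
    for _ in range(families):
        while True:
            children += 1
            if rng.random() < 0.5:   # a boy: this family must stop
                break
            girls += 1               # a girl: the family may continue
    return children / families, girls / children

avg_tau, girl_fraction = simulate_families()
print(avg_tau)        # close to 2
print(girl_fraction)  # close to 0.5
```

The law changes nothing about the coin: half of all children born are girls, however the families are cut up.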

Exercise Determine the distribution of the Sultan's τ. Note that Sτ = τ − 2 and compute that E[Sτ] = 0.

Stopping times can be intricate. If B ⊂ R is a Borel set and if τ represents the first time that Mn takes a value in B – the time of first entry – then τ is a stopping time:

{τ = n} = (⋂_{k=1}^{n−1} Mk^{−1}(B^c)) ∩ Mn^{−1}(B)

Here you need to remember that Mk is Fk-measurable, which means that Mk^{−1}(A) ∈ Fk for any Borel set A. The Sultan's law is a time of first entry into the set of boys.

Exercise The time of second entry is a stopping time also.

Once you stop gambling, you receive no further information on the casino's proceedings. If you are a decent man, then you gather no information, F0 = {∅, Ω}. If you are a winner, then you gather all the information F∞. The information of a gambler who quits gambling at τ is somewhere in between these two extremes:

Definition 4.2. The σ-algebra Fτ is equal to

(4.1) Fτ = {A ∈ F∞ : A ∩ {τ = n} ∈ Fn for all n}

It is not hard to verify that this is a σ-algebra, see proposition 6.1 in Gut. Note that if the gambler is a man of habits and τ = k, then Fτ is equal to Fk. You must picture the σ-algebra as follows: the sets {τ = n} partition the space Ω. On {τ = n} the algebras Fτ and Fn are the same. Observe that {τ = n} ∈ Fτ for all n and so τ is Fτ-measurable. This makes sense, since Fτ represents what the gambler knows. Surely, the gambler knows when he quits. Of course, he also knows his final earnings:

Exercise Mτ is Fτ-measurable.

It is sometimes more convenient to use the equivalent definition

(4.2) Fτ = {A ∈ F∞ : A ∩ {τ ≤ n} ∈ Fn for all n}

It is not hard to see that the two definitions are equivalent. If A satisfies 4.1 then the intersections A ∩ {τ = k} for k = 0, ..., n are all in Fn, and so is their union A ∩ {τ ≤ n}. If A satisfies 4.2 then the intersections A ∩ {τ ≤ n} and A ∩ {τ ≤ n − 1} are both in Fn, and so is their difference A ∩ {τ = n}.

Lemma 4.3. If τ1, τ2 are stopping times and if τ1 ≤ τ2 then Fτ1 ⊆ Fτ2.

Proof. The definition of 4.2 is more convenient. Suppose A ∈ Fτ1. We need to check that A ∩ {τ2 ≤ n} ∈ Fn. Now

A ∩ {τ2 ≤ n} = A ∩ ({τ1 ≤ n} ∩ {τ2 ≤ n}) = (A ∩ {τ1 ≤ n}) ∩ {τ2 ≤ n}

Since A ∈ Fτ1 and τ2 is a stopping time, both A ∩ {τ1 ≤ n} and {τ2 ≤ n} are in Fn. Hence, so is their intersection. □

Exercise If τ1 and τ2 are stopping times, then {τ2 ≤ τ1} ∈ Fτ1.

The symbols ∧ and ∨ are shorthand notation for min and max. If τ1 and τ2 are stopping times, then τ1 ∧ τ2 is the minimum of the two. Think of this as Jack and John, who came to the casino along with their mother. The moment one of the two boys wants to quit, mama tells the other boy that it is time to go home. This is a stopping time since

{τ1 ∧ τ2 ≤ n} = {τ1 ≤ n} ∪ {τ2 ≤ n} ∈ Fn

If mum is a bit of a gambler herself - imagine a mother bringing her children to a casino - then she will ignore the first boy and tell him to wait for his brother. In this case we get τ1 ∨ τ2, which is also a stopping time. Sometimes dad takes the boys to the casino, he is an incorrigible man and mum divorced him a long time ago. If Jack quits at τ1 = n and John quits at τ2 = m, then dad continues until τ1 + τ2 = n + m, wasting his sons' profits at the roulette table. Again, this is a stopping time and quite a predictable one. By the time they are going home (bye dad), Jack and John already know when the old man is going to quit.

Exercise Let G be the σ-algebra that is generated by Fτ1 and Fτ2. It represents the combined knowledge of Jack and John. Prove that τ1 + τ2 is G-measurable.

The family of stopping times is closed under sums, minima and maxima. However, subtractions are not allowed. If τ is a stopping time, then τ − 1 need not be a stopping time, because in that case you would be able to predict what the gambler does before he does it. That would be like the great Sultan insisting on an abortion if you have a boy. That is not a stopping time. Mathematicians are Pro Life. Of course, τ + 1 is allowed (please mum, let's play one more round), but no sensible mother will fall for that.

The process Mτ∧n for fixed τ and n = 0, 1, 2, ... is called the stopped process. By definition Mτ∧n is equal to Mn on {τ ≥ n} and it is equal to Mτ on {τ < n}. The process freezes once the gambler decides to stop. A gambler can check out any time he likes, but he can never leave. Any reasonable gambler will stop at some time and will forever stay in the casino with his final earnings Mτ.2
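The freezing is easy to see on a concrete path, and it also matches the martingale transform of section 3.2. A minimal sketch of my own, with the conventions dM0 = M0 and αk = 1 exactly when τ ≥ k: the transform Σ αk dMk reproduces Mτ∧n step by step.

```python
def stopped_vs_transform(path, tau):
    """Check M_{tau ^ n} == sum over k of alpha_k * dM_k on one path.
    Conventions: dM_0 = M_0, dM_k = M_k - M_{k-1},
    and alpha_k = 1 if tau >= k, else 0."""
    for n in range(len(path)):
        stopped = path[min(tau, n)]                  # the frozen value
        transform = sum(
            path[k] - (path[k - 1] if k > 0 else 0)  # the increment dM_k
            for k in range(n + 1)
            if tau >= k                              # bet only while playing
        )
        assert stopped == transform
    return True

# a sample path of a gambler's fortune, stopped after round 3
print(stopped_vs_transform([1, 2, 1, 0, 1, 2], tau=3))  # True
```

The sum telescopes up to time τ ∧ n and then picks up nothing, which is exactly the frozen process.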

2Except for winners

Now I am getting a bit sick of definitions and so I quit. There are some more examples in Gut's proposition 6.3, but they are all variations on the same theme. You can read them at your own leisure. Let me end this section with a few lemmas that I find useful.

Lemma 4.4. E[Mn | Fτ] = Mτ∧n.

You should think of this as follows: if a gambler stops at τ, then the σ-algebra of all information is no longer F∞ but it is Fτ. If we condition the process Mn on this σ-algebra, then we get the gambler's process.

Proof. You know that Mτ∧n is Fτ∧n-measurable: this follows from the exercise you just solved, applied to the stopping time τ ∧ n. So Mτ∧n is Fτ-measurable, since this is a larger algebra. We need to show that ∫_A Mτ∧n dP = ∫_A Mn dP for every A ∈ Fτ. It suffices to show the equality for A ∩ {τ = k} for k = 0, 1, ..., ∞, since every A is a countable union of these intersections. If k ≥ n then Mτ∧n = Mn and there is nothing to prove. So we may suppose that k < n. Now A ∩ {τ = k} ∈ Fk and since E[Mn | Fk] = Mk we have that

∫_{A∩{τ=k}} Mk dP = ∫_{A∩{τ=k}} Mn dP.

Since k < n we have that Mτ∧n = Mk on {τ = k} and we are done. □

Lemma 4.5. Fτ∧n = Fτ ∩ Fn.

Proof. First note that τ ∧ n is a stopping time which assumes values in {0, ..., n}. By Lemma 4.3, Fτ∧n ⊆ Fτ ∩ Fn. Conversely, suppose that A ∈ Fτ ∩ Fn. We need to show that

A ∩ {(τ ∧ n) = k} ∈ Fk

for k = 0, ..., n. If k < n then {(τ ∧ n) = k} is equal to {τ = k} and we need to verify that A ∩ {τ = k} ∈ Fk, which follows immediately from A ∈ Fτ. Now suppose k = n. Then we have that A and {(τ ∧ n) = n} = {τ ≥ n} are both in Fn. Hence, so is their intersection. □

Challenge Prove that Fσ∧τ = Fσ ∩ Fτ. Hand in your solution by e-mail before the next class. One of the solvers gets my copy of Gardner's puzzle book.

Lemma 4.6. Suppose, as usual, that Fn is the σ-algebra generated by M0, ..., Mn. Then Fτ∧n is generated by Mτ∧0, ..., Mτ∧n.

Proof. Divide Ω into the sets {(τ ∧ n) = k} for k = 0, ..., n. On {(τ ∧ n) = k}, the σ-algebras Fτ∧n and Fk coincide. The σ-algebra Fk is generated by M0, ..., Mk. But if τ ∧ n = k, then the stopped process Mτ∧0, ..., Mτ∧n consists of the same random variables. So, we may as well say that they generate the σ-algebra. □

Lemma 4.6 says that if we replace a martingale Mn by a stopped martingale Mτ∧n, then we can replace the filtration F0 ⊂ F1 ⊂ F2 ⊂ ··· by Fτ∧0 ⊂ Fτ∧1 ⊂ Fτ∧2 ⊂ ···. Lemma 4.5 says that this is the same as intersecting the original filtration with Fτ.

4.2. But you can never lose

We turn to section 7 of Gut's chapter 10 where we learn the first great theorem of martingales: Doob's optional sampling theorem. It says that no matter what law the great Sultan passes, the ratio of women will always be 50 percent. Now before I continue, I have to look around for a second, since there may be a smartypants Sultan in the audience who stops me by saying:

But sir, what if the sultan issues a law which says that a woman needs to keep having babies until she has delivered more girls than boys?

This is a valid stopping time, since we can define {τ = n} in terms of Fn:

{τ = n} = {Sn = 1 and Sm ≠ 1 if m < n}

Under the new law of Sultan Smartypants, every family will eventually have more girls than boys. So you may think that this new Sultan found a way to beat the system, but did he? To show that he did not, I will take the liberty to consider the random variables Girl and Boy, which I do not define precisely. Consider the first child in each family in the great land of Al-Trump. Half of them are girls and half of them are boys, the number is equal. The mothers that delivered a girl stop, and the mothers that delivered a boy continue. Now consider the second child in each family that had a boy as a first child. Half of them are girls and half of them are boys. Again, the number is equal. Continuing this way, we find that

E[Girl] = E[Boy]

However, if we consider the numbers per family, then we find

E[Girl] = E[Boy] + 1

These seemingly conflicting equations can only hold if both expectations are infinite. The population explodes! Houses get overcrowded, food gets scarce. Famine and plague sweep through the once great land of Al-Trump. In his attempt to beat the system, the Sultan failed and goes down in history as Smartypants the Terrible.

Exercise We counted the number of boys and girls this way and that way. It is comparable to counting a matrix this way and that way, as follows. For natural numbers i, j define xij = j − i if |j − i| = 1 and xij = 0 otherwise. Put the numbers in an infinite matrix

 0  1  0  0  0  0 ···
−1  0  1  0  0  0 ···
 0 −1  0  1  0  0 ···
 0  0 −1  0  1  0 ···
 0  0  0 −1  0  1 ···
 0  0  0  0 −1  0 ···
 ⋮  ⋮  ⋮  ⋮  ⋮  ⋮ ⋱

Verify that the row sums are all zero, except for the first row, which adds up to 1. Verify that the column sums are all zero, except for the first column, which adds up to -1. So Fubini does not hold. What went wrong?
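A finite truncation makes the bookkeeping explicit. The sketch below (my own) computes the row sums and column sums of the N×N corner of the matrix: in every finite corner the compensating −1 and +1 are still visible in the last row and column, but in the infinite double sum each escapes to infinity in its own direction, so summing rows first gives 1 while summing columns first gives −1.

```python
def truncated_row_col_sums(N):
    """Row sums and column sums of the N x N corner of the matrix
    x[i][j] = j - i if |j - i| == 1 else 0."""
    x = [[j - i if abs(j - i) == 1 else 0 for j in range(N)]
         for i in range(N)]
    rows = [sum(row) for row in x]
    cols = [sum(x[i][j] for i in range(N)) for j in range(N)]
    return rows, cols

rows, cols = truncated_row_col_sums(6)
print(rows)  # [1, 0, 0, 0, 0, -1]
print(cols)  # [-1, 0, 0, 0, 0, 1]
```

Fubini needs the sum of the absolute values to be finite; here that sum is infinite, and the order of summation genuinely matters.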

Lemma 4.7. The stopped process Mτ∧n is a martingale with respect to Fτ∧n.

This follows from lemmas 4.4–4.6. But here is another proof which has more of a gambling flavor.

Proof. You already met my wife. She bets one euro in every round. I did not tell you yet that my wife quits at some point of time τ, although that was more or less implicit when I told you that she is a conservative player. Consider the martingale transform

Yn = α0 dM0 + ··· + αn dMn

for the process αn = 1 if τ ≥ n and αn = 0 if τ < n. Please verify that αn is Fn−1-measurable and conclude that the process is predictable.3 The martingale transform is equal to the stopped martingale for this particular zero-one process αn. You know that a martingale transform is a martingale. Therefore, the stopped martingale is a martingale. □

Consider a gambler who quits at time τ. The lemma implies that after round n, his expected earnings are E[Mτ∧n] = E[M0]. Even if n is equal to the end of time, which may be near but may also take a while,4

then still the gambler's expected earning is equal to what he started out with. But mathematicians never quit and they want to know if equality continues to hold, even after the end of time. In other words, mathematicians want to know if

(4.3) E[Mτ] = E[lim_{n→∞} Mτ∧n] =? lim_{n→∞} E[Mτ∧n] = E[M0]

Is the swap of limit and expectation allowed? This is the mother of all measure theory problems. We saw in Lecture 1 that the swap is only allowed under special conditions. To see that there is a problem with this swap, let's reconsider Smartypants' stopping time. Mathematically, a mother in Al-Trump is equivalent to a coin toss: Heads for Boys and Tails for Girls. A mother who produces a string of children

HTHHTHTTHTHHTHTTT ···

quits as soon as the Tails exceed the Heads by one. For this particular string of H and T, the mother stops delivering after having seventeen babies, just like my maternal great grandmother. Let Bn be the number of boys after n tosses and Gn be the number of girls. Then Mn = Gn − Bn is a martingale.5 Smartypants' stopping time τ has the property Mτ = 1, which shows that the swap is not allowed here, because M0 = 0. Now the two ways of counting that we saw above for Boys and Girls can be made more precise. In the first count, we found that E[Gτ∧n] = E[Bτ∧n]. In the second count we found that E[Gτ] = E[Bτ] + 1. By monotone convergence, lim_{n→∞} E[Gτ∧n] = E[Gτ] and lim_{n→∞} E[Bτ∧n] = E[Bτ]

3Please don't tell my wife that she is predictable
4Hindu mystics predict the arrival of Kalki - when the universe will collapse - at around 429000 AD
5This decomposition of the random walk looks a bit like a Krickeberg decomposition, but it is different. I exercise my priority rights and say that I Fokkink decomposed it

so the two seemingly conflicting equations that we found earlier indeed hold. The expected number of children must be infinite.

Exercise Prove that E[τ] = ∞ for Smartypants' stopping time.

Exercise Prove that if the Sultan decrees a law such that E[τ] < ∞, then E[Bτ] = E[Gτ] < ∞.

So when is the swap allowed? Doob's optional sampling theorem addresses just that. There are several different conditions under which a swap is allowed, and the theorem is a bit of a mixed bag. If you compare the optional sampling theorems in B-Z and in Gut (theorems 7.1 and 8.2), then you will see that they are different. Wikipedia presents the most common form of the theorem, which gives three different conditions, the second of which is equivalent to the exercise you just solved. All these versions of Doob's theorem can be summarized informally: you can only beat the casino if you have unlimited time or unlimited money. Nobody has that. Not even God.

Theorem 4.8 (Doob's optional sampling theorem, following Gut thm 7.1). Suppose that the martingale is of the special form Mn = E[Z | Fn] for some random variable Z and some filtration F0 ⊂ F1 ⊂ ···. Let τ be a finite stopping time, i.e., τ < ∞. Then Mτ = lim_{n→∞} Mτ∧n is well-defined and satisfies Mτ = E[Z | Fτ]. In particular, E[Mτ] = E[M0] = E[Z].

Proof. Since τ is finite, lim_{n→∞} Mτ∧n is well-defined. All Mτ∧n are Fτ-measurable and a limit of measurable functions is measurable. So Mτ is Fτ-measurable. We need to prove that ∫_A Mτ dP = ∫_A Z dP for all A in Fτ. But since τ < ∞ we can partition Ω into {τ = k} for k = 0, 1, ..., excluding ∞. In particular, we may consider A ∩ {τ = k} for a fixed and finite k. On this intersection Mτ coincides with Mk and we get that ∫_{A∩{τ=k}} Mk dP = ∫_{A∩{τ=k}} Z dP, which holds since Mk is a conditional expectation of Z. □

Gut lists some corollaries of this theorem, but we have taken a slightly different route and we already know these corollaries. Then there follow three theorems, the first of which Gut proves, so read that, and the other two he leaves to you. So let's handle them here. But before we do that, it is time to point out something that I glossed over. Gut is very careful and restricts to finite stopping times. I should have also done that in Equation 4.3, because otherwise lim_{n→∞} Mτ∧n may not be well-defined. What happens to the winner that never quits? Does the sequence still converge? This will be the topic of our next lecture. But since I also glossed over this when I computed the limit of Smartypants' stopping time, perhaps I should point out why τ < ∞ in this case. Get ready to meet the gambler's ruin problem.6

Gambler's ruin. The Sultan decrees a new law: a woman is allowed to stop if the number of boys that she delivered exceeds the number of girls by thousand and one. Mathematically, the Sultan's new law decrees that Sτ is either 1 or −1001 for the simple random walk. It is not so hard to prove that τ < ∞ for this new law. I will get to that in a minute. First observe that

E[Sτ] = −1001 · P(Sτ = −1001) + 1 · P(Sτ = 1)

where we have that P(Sτ = −1001) = 1 − P(Sτ = 1) since τ < ∞. The process does stop and can only do so in two places. Furthermore, the process |Sτ∧n| is bounded by 1001, so we may swap limit and expectation by the dominated convergence theorem:

E[Sτ] = −1001 · P(Sτ = −1001) + 1 · P(Sτ = 1) = 0

Now you can compute that the process stops at 1 with probability 1001/1002. The same argument applies if we replace 1001 by any N: the process stops in finite time and with probability N/(N + 1) at 1. If we remove the stop at −N, then anyway the process stops at one in finite time with probability ≥ N/(N + 1) for any N. In other words, Smartypants' original law stops in finite time. Now I still had to argue that the new law is finite. Observe that under the modified law, a mother cannot deliver 1002 boys in a row. As you know from your elementary probability course: if you let a monkey go at it on a typewriter, it will eventually produce one of Shakespeare's masterpieces. Or less prosaically: if a mother in Al-Trump keeps delivering children, she will eventually produce 1002 boys in a row. But before she can do that, she is stopped by the law.

6Dutch pride: this problem was solved by the great Christiaan Huygens. He lived in Hofwijck, which is a museum now. It is a 15 minutes bike ride from Delft, go visit!
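The ruin probability N/(N + 1) is easy to confirm by simulation for a modest N (the sketch and the function name are mine). With N = 10, the walk reaches +1 before −10 in about ten out of eleven runs.

```python
import random

def hit_one_before(minus_N, trials=100_000, seed=3):
    """Fraction of simple random walks started at 0 that reach +1
    before -minus_N.  The optional stopping argument in the text
    predicts this fraction to be minus_N / (minus_N + 1)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        s = 0
        while -minus_N < s < 1:
            s += rng.choice((-1, 1))
        wins += (s == 1)
    return wins / trials

p = hit_one_before(10)
print(p)  # close to 10/11 = 0.909...
```

Letting N grow pushes the probability up to 1, which is the simulation-side shadow of the monkey-and-typewriter argument above.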

Theorem 4.9 (Gut, thm 7.3). Again we derive our martingale from Z. Consider a filtration and let 0 = τ0 ≤ τ1 ≤ τ2 ≤ ··· be an increasing sequence of stopping times. Then Mn = E[Z | Fτn] is a martingale.

Proof. The sequence Fτ0 ⊂ Fτ1 ⊂ Fτ2 ⊂ ··· is a filtration, so the observation that Mn is a martingale comes down to example 9 from the previous lecture. □

Theorem 4.10 (Gut, thm 7.4). Suppose that Mn is a submartingale and that τ is a stopping time. Then the stopped process Mτ∧n is a submartingale.7

All with respect to the same filtration. But that goes without saying.

Proof. We are not in a fair casino anymore but in a very generous one. I like to think of this as Bernie's casino.8 The longer you stay, the higher your earnings. If you are at Bernie's, you never quit.9 To prove that a stopped submartingale is a submartingale – this really is a no-brainer, even if you quit you still had the privilege to play – we simply adapt the proof of Lemma 4.7. Think of a stopping time as a gambler who keeps betting 1 euro per round, until round τ. More generally, we check that for a non-negative predictable process αn the transform

Yn = α0 dM0 + ··· + αn dMn

is a submartingale. We take out what is known:

E[Yn | Fn−1] = E[Yn−1 + αn dMn | Fn−1] = Yn−1 + αn E[dMn | Fn−1] ≥ Yn−1 □

As a final word on the sampling theorem, we return to the fair casino of Martigues. Martingales always bet on black. They start with one euro and if they lose,

they simply double the stakes and play again. They quit once they win, and even if they have to wait for ten rounds, then still their final earnings are

−1 − 2 − 4 − 8 − 16 − 32 − 64 − 128 − 256 − 512 + 1024 = 1 euro

Therefore, Martingales always leave the casino with an extra euro. Unlike Smartypants, the Martingales have a stopping time of finite expectation: E[τ] = 2. That sounds too good to be true. So what goes wrong? Well, do you remember that other Sultan who agreed to double the grains on a chessboard? Somehow, Sultans always seem to pop up in probability problems. Now compute that P(τ = k) = 2^{−k} and bear in mind that a Martingale needs 2^k − 1 euros to compensate for expenses, before coming out on top with one precious euro in round k. The expected expense is Σ_k (2^k − 1)/2^k = ∞. You need very deep pockets in Martigues. You need bottomless pits.

7 I think there is a typo in Gut's 2nd formula, which should be E[X_{τ∧n}] ≤ E[X_n].
8 Sounds better than Madoff's casino, doesn't it? We love Bernie.
9 Applies to both Bernies.

Meet more Martingales

Bloomberg, September 2016: Wells Fargo managers were accused of fueling the creation of bogus accounts in what may be the first lawsuit by fired or demoted employees since the bank was called out by regulators. Wells Fargo fired 5,300 employees that it blamed for opening accounts without client approval. The fired employees who sued the bank on Thursday said their dishonest practices were orchestrated by CEO John Stumpf.

The New Yorker, August 2016: Deutsche Bank’s actions are now under investigation by the U.S. Department of Justice, the New York State Department of Financial Services, and financial regulators in the U.K. and in Germany. In an internal report, Deutsche Bank has admitted that, until April, 2015, when three members of its Russian equities desk were suspended for their role in mirror trades, about ten billion dollars was spirited out of Russia. The lingering question is whose money was moved, and why.

The Guardian, April 2015: Germany's Deutsche Bank has been fined a record 2.5 billion dollars for rigging Libor, ordered to fire seven employees and accused of being obstructive towards regulators in their investigations into the global manipulation of the benchmark rate.

Bloomberg, April 2014: Less than 5 percent of unauthorized financial trading cases may have been reported, said Toshihide Iguchi, whose trading losses led to the 1995 shutdown of Daiwa Bank's U.S. operations. Allowing traders to walk away with a clean slate instead of being fired for losses would reduce their incentives to engage in unauthorized trades, the now 63-year-old Kobe, Japan native, who confessed to amassing 1.1 billion dollars of losses in 1995, said in an interview in Hong Kong yesterday.

Reuters, October 2013: U.S. and European regulators have fined Dutch lender Rabobank 1 billion dollars for rigging benchmark interest rates, making it the fifth bank punished in a scandal that has helped to shred faith in the industry. Rabobank said on Tuesday it would pay 774 million euros to U.S., British and Dutch regulators after 30 staff were involved in "inappropriate conduct" in a scam to manipulate the London Interbank Offered Rate (Libor) and its Euribor cousin - benchmarks for more than 300 trillion dollars of financial assets.

Bloomberg, May 2013: Stichting Vestia Groep, a Dutch affordable-housing provider that nearly collapsed as a result of losses on derivatives, asked a London court to void 700 million euros in interest-rate swaps after it was sued by Credit Suisse Group AG. Credit Suisse should have known that the derivatives were inappropriate trades conducted by a Vestia employee who breached his duties to the company, Vestia said in court documents.

The Independent, May 2012: A millionaire City trader known as the ”London Whale” for the vast size of his bets on the markets, has emerged as the culprit behind a 2 billion dollar trading loss at JP Morgan.

Statement by UBS Bank, September 2011: We regret to inform you that yesterday we uncovered a case of unauthorized trading by a trader in the Investment Bank. We have reported it to the markets in line with regulatory disclosure obligations. The matter is still being investigated, but we currently estimate the loss on the trades to be around 2 billion US dollars. It is possible that this could lead UBS to report a loss for the third quarter of 2011. No client positions were affected.

Reuters, March 2009: Madoff pleads guilty, is jailed for 65 billion dollar fraud.

NRC Handelsblad, October 2008: The Dutch government bought Fortis Bank Netherlands and ABN Amro Bank for a total of 16.8 billion euros. This was announced yesterday evening by the Prime Minister, the Finance Minister, and the President of De Nederlandsche Bank.

ABC news, September 2008: Lehman Bros. filed for bankruptcy protection Monday, becoming the second major Wall Street firm to disintegrate under the weight of the deepening credit crunch. Based on its assets at the time of filing, Lehman surpassed WorldCom as the largest U.S. bankruptcy. Lehman had about 639 billion dollars in assets at the time of filing, while WorldCom had about 107 billion dollars when it filed for Chapter 11 in 2002.

The Guardian, January 2008: A rogue trader has cost French bank Société Générale 4.9 billion euros in the biggest fraud in financial history. News of the fraud, which will virtually wipe out 2007 profits at France's second-largest bank, sent shockwaves through European markets, already battered by the escalating credit crisis.

Morgan Stanley 4th Quarter Earnings Conference Call, December 2007. Comments on a world record trading loss of 9.2 billion dollars: This was the result of an error of judgement incurred on one desk in our Fixed Income area.

Pittsburgh Post Gazette, September 2006: One of the mistakes that led to Amaranth Advisors' multibillion-dollar losses on natural-gas investments is a common one in fast-shifting energy markets. The hedge fund's chief energy trader misgauged when to take his chips off the table, losing roughly 5 billion dollars in a week for a hedge fund that boasted of world-class risk-management systems. ABC news, July 2006: In Melbourne, two rogue foreign exchange traders, who dragged the National Australia Bank into a spectacular 360 million dollar loss, have been sent to jail. The Brussels Journal, March 2006: A major banking scandal is rocking the Austrian political elites left and right. A bank owned by the Socialist trade union loses billions in shady hedge deals while the union strike fund evaporates in the Caribbean and the bank gets implicated in a corruption case in Israel. Wall Street Journal, May 2005: In early 2003 two analysts at hedge fund Perry Capital struck out on their own to form Bailey Coates Asset Management. Though they hadn't directly managed Perry's successful European fund themselves, the two attracted about 1.3 billion dollar in less than two years for their new venture. Those assets in just the past few weeks have halved, according to people familiar with the firm. New York Times, February 2003: The accounting scandal at Royal Ahold has left thousands of its Dutch employees in debt to the company. Ahold, the global grocery company, created a program that encouraged workers in the Netherlands not only to buy stock but allowed them to borrow money to do so. Some 3,500 workers at Ahold and its Dutch subsidiaries, including the supermarket chain Albert Heijn, took out company loans in the last decade to buy shares of a fund that invested in Ahold stock, debt and other obligations. Since the company disclosed accounting problems on Monday, the company's stock and bonds have plunged in value.
Wall Street Journal, February 2002: Allied Irish Banks PLC said a rogue trader at its U.S. unit incurred losses of 750 million dollar through unauthorized trading, sparking an inquiry by the Federal Bureau of Investigation and a scramble by the big European bank to figure out what went wrong. New York Times, August 2000: Hedge fund manager Michael Berger, who acknowledged earlier this year that he misrepresented investment returns, has been criminally charged with securities fraud in connection with the collapse of the Manhattan Investment Fund and the loss of more than 400 million dollar by at least 300 investors. CNN Money 1998 on the collapse of hedge fund Long Term Capital Management: Founded in 1991 by former Salomon Brothers bond guru John Meriwether, LTCM initially specialized in high-volume arbitrage trades between closely linked fixed-income securities. It ran into trouble by spreading into equities and emerging markets when bond yields declined amid low inflation in most developed countries. The hedge fund was highly leveraged and on the brink of collapse until the group of financial institutions pulled together a 3.6 billion dollar equity rescue package. New York Times, November 1996: The Sumitomo Corporation announced a loss of 1.9 billion dollars today for the first six months of its fiscal year – the first loss in the giant trading company's 77 years – as a result of huge losses from unauthorized copper trading. New York Times, February 1995: At the epicenter of Sunday night's spectacular collapse of a leading British investment bank stands a 28-year-old trader who in the course of a month put 29 billion dollars of the firm's money on the line and lost more than a billion of it. Wall Street Journal, January 1994: Metallgesellschaft AG, the big German engineering and metals conglomerate, stunned its shareholders last week by announcing potential losses in energy derivatives of nearly 1 billion.
Like falling dominoes, the company’s losses in these complex, risky instruments have precipitated a wider financial crisis that has pushed the group to the edge of insolvency in a matter of weeks.

Let's end this list of tragic losses with a positive note. Money never gets lost. For every loser, there is a winner and even more so, because the amount of money increases with time. Theoretically, the stock exchange is a submartingale. Ask Bernie!

Stochastic processes: First graded homework. Deadline: Friday 28th October 2016. Secretariat, 6th floor.

1. Let X_i be i.i.d. random variables which can take the values −1, +1, with P(X_i = 1) = 1/2 = P(X_i = −1). We further denote S_n = Σ_{i=1}^n X_i and F_n, n ∈ N, the filtration generated by X_1, ..., X_n. If we talk about martingales in this and the next item, we always mean w.r.t. this filtration. Compute the following (conditional) expectations. (Note: if in this question you use the martingale stopping theorem, justify why you can.)

a) E(X_1 e^{X_1+X_2} X_3 | X_1, X_2).
b) E[(X_1 + X_2)^{10} | X_1].
c) E(Σ_{i=1}^τ I(X_i = −1)) where τ = inf{k ∈ N : X_k = 1 for the second time} (here I(·) denotes the indicator function).

2. Same context as in the previous question.
a) Let a < 0 < b be two given integers. By stopping the martingale S_n, n ∈ N, at time

τ := inf{k ≥ 1 : S_k ∈ {a, b}},

compute the distribution of S_τ.
b) By stopping the martingale S_n² − n at the same time τ, compute E(τ).
c) Determine the function f : R → R such that for all λ ∈ R,

e^{λS_n − nf(λ)}

is a martingale.
d) Use the martingale constructed in the previous item, and stop it at time τ. Compute from this the moment generating function E(e^{μτ}) for values of μ small enough.

3. Let X_i be i.i.d. random variables which can take the values −1, +1, with P(X_i = 1) = p, P(X_i = −1) = q = 1 − p, where p < 1/2. Denote S_n = Σ_{i=1}^n X_i, and by F_n, n ∈ N, the filtration generated by X_i, i = 1, ..., n.
a) Show that for all x ∈ N

Z_n = ((1 − p)/p)^{x+S_n}

is a martingale.

b) Let x, N > 0 be natural numbers with 0 < x < N. Show that

τ = inf{k ≥ 1 : x + S_k ∈ {0, N}}

is a stopping time. This time is interpreted as the time it takes to have gathered fortune N or be bankrupt (fortune 0) when the initial capital is x.

c) Use the martingale stopping theorem for the martingale Z_n to compute P(x + S_τ = N). Hint: use that

E(Z_τ) = E(Z_1) = ((1 − p)/p)^x

You can assume here that the conditions of the martingale stopping theorem are satisfied, i.e., you do not have to verify these conditions.
d) Prove that Z'_n = S_n − n(2p − 1) is also a martingale. Use once more the martingale stopping theorem for Z'_n in order to evaluate E(τ) for the stopping time of question b).

4. Let F_n, n ∈ N, be a filtration with F_0 = {∅, Ω} the trivial σ-algebra. Let A_n, n = 0, 1, 2, ... be a predictable sequence, i.e., A_n is F_{n−1}-measurable for all n = 0, 1, 2, .... Let Z_n, n = 0, 1, 2, ... be an adapted sequence, i.e., Z_n is F_n-measurable for all n ∈ N.
a) Show that if A_n, n = 0, 1, 2, ... is an F_n, n = 1, 2, ... martingale, then it is constant with probability one.

b) Prove, using item a), that a decomposition of the type Z_n = M_n + A_n with M_n a martingale and A_n predictable (both w.r.t. F_n) is unique.

CHAPTER 5

A discourse on inequalities

We will encounter a lot of inequalities. They can be found in Gut sections 9 and 10. This material is also covered by sections 4.1 and 4.2 of Brzeźniak-Zastawniak. Always the easier read.

We will now study martingale convergence. What happens in the end? But first, here is a word from our sponsor: please return to Definition 1.9, where we listed different notions for X_n → X_∞. You will need to know these notions and the three convergence theorems that come right after it. Get ready to swap until you drop.

Exercise What's fair about a fair game? Let X_1, X_2, ... be independent random variables such that

X_n = n² − 1 with probability n^{−2}, and X_n = −1 with probability 1 − n^{−2}.

In other words, you are a Martingale who is going for progressively higher stakes in a fair casino. Observe that X_n is a martingale difference sequence, so M_n = X_1 + ··· + X_n is a martingale. Apply Borel-Cantelli to conclude that M_n → −∞ almost surely. This seems like a profitable casino! Perhaps Sultan Smartypants wants to buy it?
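Before you prove it, you can watch it happen. The rough Monte Carlo sketch below is my own illustration, not part of the exercise: although every single bet is fair, almost every path of M_n ends deep in the red.

```python
import random

def fair_game_path(n_steps, rng):
    """One path of M_n = X_1 + ... + X_n, where X_n = n^2 - 1 with
    probability 1/n^2 and X_n = -1 otherwise, so that E[X_n] = 0."""
    m = 0
    for n in range(1, n_steps + 1):
        m += n * n - 1 if rng.random() < 1.0 / (n * n) else -1
    return m

rng = random.Random(2)
final = [fair_game_path(10_000, rng) for _ in range(200)]
neg = sum(1 for f in final if f < 0)
print(neg, "of", len(final), "paths are in the red after 10000 fair rounds")
```

Only a path that happens to score a late win (worth n² − 1 for a large n) can escape the steady drift of −1 per round, and such wins become rarer and rarer: that is exactly the Borel-Cantelli argument you are asked to make precise.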

5.1. Doob's optional stopping (or sampling) theorem

We already encountered this theorem in the previous lecture, but I did not yet state it nor prove it in full generality. It is our first martingale convergence theorem and deserves a special place.

Theorem 5.1. Let M_n be a martingale. Let τ be a finite stopping time, which implies that M_{τ∧n} converges to M_τ almost surely. Then E[M_τ] = E[M_0] if one of the following four conditions is satisfied:
(a) The stopping time is bounded: τ ≤ N for some N.
(b) The stopped process is bounded: |M_{τ∧n}| ≤ K for some K.
(c) M_n is a regular martingale.
(d) The stopping time is integrable, E[τ] < ∞, and the martingale differences are bounded: |dM_n| ≤ K for some K.

Proof. This theorem is all about the swap

E[lim_{n→∞} M_{τ∧n}] = lim_{n→∞} E[M_{τ∧n}]

Is the swap allowed? On the left hand side of the equation we have E[Mτ ]. On the right hand side we have a constant sequence of E[M0], since a stopped martingale is a martingale.


In condition (a) the stopped process M_{τ∧n} has surely stopped at N. The sequence is finite, there is no limit, and so there is nothing to swap: E[M_τ] = E[M_{τ∧N}] = E[M_0]. In condition (b) the stopped process is bounded by a constant. The dominated convergence theorem allows the swap and we conclude that E[M_τ] = E[M_0]. Condition (c) is as in Gut's theorem 7.2: the martingale is of the form M_n = E[Z | F_n]. I already gave a proof of that in the previous chapter. Condition (d) still requires an argument. A martingale is a sum of martingale differences

M_n = dM_0 + ··· + dM_{n−1}

and by the assumption in (d) we have |dM_n| ≤ K for all n. By the triangle inequality |M_n| is bounded by Kn, and so |M_τ| is bounded by Kτ, which is integrable by our assumptions. The dominated convergence theorem allows us to carry out the swap. □
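Here is condition (a) in action, in a quick sketch of my own. Capping the hitting time of {−3, 5} at 100 rounds makes τ bounded, so the theorem promises E[S_τ] = E[S_0] = 0 with no further checks.

```python
import random

def mean_stopped_value(trials, rng, a=-3, b=5, cap=100):
    """E[S_tau] for the fair +/-1 walk, where tau is the hitting time of
    {a, b} truncated at `cap`. Since tau <= cap, condition (a) applies."""
    total = 0
    for _ in range(trials):
        s = 0
        for _ in range(cap):
            if s == a or s == b:
                break
            s += rng.choice((-1, 1))
        total += s
    return total / trials

rng = random.Random(3)
m = mean_stopped_value(100_000, rng)
print(m)   # close to E[S_0] = 0
```

Compare this with Smartypants and the Martingales of Martigues, where the relevant conditions fail and the naive swap gives the wrong answer.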

Please consider the minimal effort that is involved in proving such a theorem. Measure theory is a truly powerful tool. You are a mathematician and one of the happy few to understand it, which is why Wall Street wants to hire you. There are many different versions of the stopping theorem. I reserved an improvement of (d) for your homework. You should note that the proof of Doob’s stopping theorem is only concerned with the swap. It does not only apply to martingales, but also to submartingales and supermartingales, provided that we use the proper inequality.

5.2. Doob's maximal inequality

Now it is time to consider inequalities. In chapter two we already met the mother of all stochastic inequalities:

Markov's inequality If λ > 0 then P(X ≥ λ) ≤ (1/λ) ∫_{X≥λ} X dP.

Doob's maximal inequality is a considerable improvement on Markov's inequality. For any stochastic process X_1, X_2, ... define

X_n* = max{X_1, ..., X_n}

i.e., the maximum result up to that moment. Let's call it the record at time n. Sports lovers, pay special attention! Doob's inequality applies to records.

Theorem 5.2 (Doob's maximal inequality). If X_n is a nonnegative submartingale and λ > 0 then

P(X_n* ≥ λ) ≤ (1/λ) ∫_{X_n* ≥ λ} X_n dP

If n = 1 then X_1* = X_1 and we have Markov's inequality. This is theorem 9.1 in Gut's chapter 10. In Gut's chapter 3, which we did not cover, Gut considered inequalities for sums of i.i.d. random variables of zero expectation, which are special forms of martingales. The inequalities in Gut's chapter 10 generalize these and therefore Gut refers to chapter 3 a lot.

Proof. Think of the X_i as your personal scores in a competition that is played over rounds. Your primary aim is to reach a level λ. As soon as you reach it, you record that memorable round as τ. In mathematical terms, τ = min{k : X_k ≥ λ}. Now observe that

P(X_n* ≥ λ) = P(τ ≤ n)

because X_n* is your personal best in round n. It exceeds λ if and only if τ ≤ n. Rewriting the probability as an integral, we get

λ P(τ ≤ n) = ∫_{τ≤n} λ dP = Σ_{k=1}^n ∫_{τ=k} λ dP

If τ = k, then this is the first round in which X_k ≥ λ, and so

Σ_{k=1}^n ∫_{τ=k} λ dP ≤ Σ_{k=1}^n ∫_{τ=k} X_k dP ≤ Σ_{k=1}^n ∫_{τ=k} E[X_n | F_k] dP

where we have used that X_n is a submartingale. By the defining property of conditional expectation, we have that

Σ_{k=1}^n ∫_{τ=k} E[X_n | F_k] dP = Σ_{k=1}^n ∫_{τ=k} X_n dP = ∫_{τ≤n} X_n dP

Combining all of this, we get

λ P(X_n* ≥ λ) ≤ ∫_{X_n* ≥ λ} X_n dP □

Again, you should consider the minimal effort that is required to obtain this inequality. If you would use ordinary calculus, the notation would run overboard. Measure theory is a wonderful thing.

Corollary 5.3 (see Gut example 9.1). Suppose that M_n is a martingale with E[M_n] = 0. Then

P(|M_n|* ≥ λ) ≤ (1/λ²) Var(M_n)

Please compare this to Chebyshev's inequality.

Proof. If M_n is a martingale, then M_n² is a submartingale and we can apply Doob's maximal inequality.

P((M_n²)* ≥ λ²) ≤ (1/λ²) ∫_{(M_n²)* ≥ λ²} M_n² dP

Or, equivalently,

P(|M_n|* ≥ λ) ≤ (1/λ²) ∫_{|M_n|* ≥ λ} M_n² dP

On the right hand side, we have an integral over a part of Ω. If we integrate M_n² over the entire Ω, the integral can only go up. So, we may replace the bound by E[M_n²], which is equal to the variance of M_n, since the expectation of M_n is zero. □
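The corollary is easy to test numerically. The sketch below, my own illustration, estimates P(|S_k|* ≥ λ) for a fair random walk and compares it with the bound Var(S_n)/λ².

```python
import random

def estimate_max_prob(n, lam, trials, rng):
    """Estimate P(max_{k<=n} |S_k| >= lam) for the fair +/-1 walk S."""
    count = 0
    for _ in range(trials):
        s, peak = 0, 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            peak = max(peak, abs(s))
        count += peak >= lam
    return count / trials

rng = random.Random(4)
n, lam = 100, 25
estimate = estimate_max_prob(n, lam, 20_000, rng)
bound = n / lam**2          # Var(S_n) / lambda^2 = 100/625 = 0.16
print(estimate, "<=", bound)
```

The empirical record probability comes out well below the bound; like Chebyshev's inequality, Doob's bound is crude but completely general.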

Doob's inequality for a non-negative submartingale M_n is usually simplified to

P(M_n* > λ) ≤ (1/λ) E[M_n].

Sports lovers should pay special attention here, because sports records are always non-negative. Let's look at the figure of the world's best 100 meters times over the last forty years below. I have converted the results to 10.1 minus the best time, to make it non-negative, and I assume that athletes are getting better and better, so we are looking at a submartingale. I added a trendline, which is a statistical indication of E[M_n], which increases, as it should. For the year 2016, the trendline gives the value E[M_n] = 0.4. The world record over the last forty years is 0.52. According to Doob's inequality, this world record has a probability of at most an unremarkable 77 percent.

It is nothing extraordinary, which is nice to know, since this indicates that no foul play was involved. This is all shoot-from-the-hip statistics, not very rigorous. But it does illustrate the statistical significance of Doob’s inequality.

5.3. Doob's L^p inequality

Theorem 5.4 (Doob's L^p inequality). Suppose that X_n is a non-negative submartingale. Then for p > 1

E[(X_n*)^p] ≤ (p/(p−1))^p E[X_n^p]

Let's first try to figure out what goes on here. For a non-negative submartingale, Doob's inequality says P(X_n* ≥ λ) ≤ (1/λ) E[X_n]. Now you need to remember from your old probability course¹ that the expected value of any non-negative random variable Y is equal to

E[Y] = ∫_0^∞ P(Y ≥ λ) dλ

Applying this to Doob's maximal inequality, we find

E[X_n*] ≤ ∫_0^∞ (1/λ) E[X_n] dλ

Unfortunately, this is not good enough to produce a bound on E[X_n*], because ∫ (1/λ) dλ is infinite. But it is almost good enough, because a logarithm is almost finite (even if you circle the globe, the graph of the logarithm hardly increases). If we can find a slight improvement, say replace ∫ (1/λ) dλ by ∫ (1/λ^p) dλ for p > 1, then we will be able to bound E[X_n*]. That is what Doob's L^p inequality does. The proof of it uses Hölder's inequality, which you have to prove yourself in the following exercise.

Exercise Hölder's inequality for non-negative random variables:

E[XY] ≤ E[X^p]^{1/p} E[Y^q]^{1/q} for 1/p + 1/q = 1

1 If you do not remember this from your old probability course, then you need to look it up on the internet. Or ask yourself: is it true for an indicator function? For a simple function? For an integrable function?

A If we replace Y by Y·1_{X>0}, setting Y zero where X is zero, this makes the inequality tighter.
B The inequality holds for X and Y if and only if it holds for aX and Y/a for some a > 0.
C By part A we may assume that {X > 0} = {Y > 0}. By part B we may assume that E[X^p] = 1. Recall Radon-Nikodym's way to define a measure dQ = X^p dP. Jensen's inequality says that (∫ Z dQ)^q ≤ ∫ Z^q dQ for non-negative Z. Apply Jensen to Z = Y X^{1−p} and find Hölder's inequality.

Proof of Doob's L^p inequality. Abbreviate notation, and write X for X_n and X* for X_n*.

E[(X*)^p] = ∫_λ P((X*)^p ≥ λ) dλ
= ∫_λ P((X*)^p ≥ λ^p) dλ^p
= ∫_λ p λ^{p−1} P(X* ≥ λ) dλ
≤ ∫_λ p λ^{p−1} (1/λ) ∫_{X*≥λ} X dP dλ

We used Doob's maximal inequality in the final step. Now we swap the order of integration, which is allowed by Fubini since every function that is involved is non-negative.

∫_λ ∫_{X*≥λ} p λ^{p−2} X dP dλ = ∫_Ω ∫_{λ≤X*} p λ^{p−2} X dλ dP

which is equal to

(p/(p−1)) ∫_Ω X (X*)^{p−1} dP = (p/(p−1)) E[X (X*)^{p−1}]

To finish the proof, apply Hölder's inequality which you just derived in your exercise above. It follows that

E[(X*)^p] ≤ (p/(p−1)) E[X (X*)^{p−1}] ≤ (p/(p−1)) E[X^p]^{1/p} E[(X*)^{(p−1)q}]^{1/q}

Now simplify this a bit to find

E[(X*)^p]^{1/p} ≤ (p/(p−1)) E[X^p]^{1/p} □

There are many more inequalities in Gut's section 9, but we do not really need them. Of course, it won't hurt you to glance through them. NB Gut calls Doob's maximal inequality the Kolmogorov-Doob inequality. He calls the L^p inequality the maximal inequality.
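For p = 2 the constant (p/(p−1))^p equals 4, and the inequality reads E[(X_n*)²] ≤ 4 E[X_n²]. A quick Monte Carlo sketch of my own, applied to the non-negative submartingale |S_k| built from a fair random walk:

```python
import random

def doob_l2_check(n, trials, rng):
    """Monte Carlo for E[(max_k |S_k|)^2] versus 4 E[S_n^2], where |S_k| is
    the non-negative submartingale built from a fair +/-1 walk (p = 2)."""
    lhs = rhs = 0.0
    for _ in range(trials):
        s, peak = 0, 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            peak = max(peak, abs(s))
        lhs += peak ** 2
        rhs += s ** 2
    return lhs / trials, 4.0 * rhs / trials

rng = random.Random(5)
lhs, rhs = doob_l2_check(200, 20_000, rng)
print(lhs, "<=", rhs)    # (p/(p-1))^p = 4 for p = 2
```

The simulation shows comfortable slack: the second moment of the running record is a small multiple of the second moment of the endpoint, not four times it.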

5.4. Doob's upcrossing inequality

Imagine that you are a trader and you are managing a portfolio, which some of you will actually do one day. Many of our students end up in a financial environment. As a trader, you want to buy low and you want to sell high, check Investopedia. In terms of a casino: if M_n is low, then you keep betting one euro every round until it is high. In terms of the stock exchange: once M_n is low you buy it and you hold onto it until its value is high and you sell it.

Now what is low and what is high depends on your personal opinion. Let's say that a is low and b is high for some a < b. Our a-b strategy is to bet one euro per round if M_n is below a and to stop doing that once it is above b.

Lemma 5.5. If you apply the a-b strategy to a supermartingale M_n, then your expected earning Y_n after round n is at most zero.

That is a no-brainer. Your mama told you not to bet in a supermartingale casino.

Proof. You bet one euro once M_n ≤ a and you continue until M_n ≥ b, after which you stop betting. You wait until M_n comes down to ≤ a again and start all over again. It is a bit difficult to write this down formally, so I don't. But it is clear that this strategy is a predictable process α_1, α_2, ... in which each α is either zero or one. So the (super)martingale transform

T_n = α_1(M_1 − M_0) + ··· + α_n(M_n − M_{n−1})

is a supermartingale. In particular E[T_n] ≤ E[T_0] = 0. □

For any sequence x_n we say that the number of upcrossings is ≥ k if there exist i_1 < j_1 < ··· < i_k < j_k such that the x_i's are ≤ a while the x_j's are ≥ b. The number of upcrossings U_N is defined to be the maximum k with j_k ≤ N. It is equal to the number of times that you bought M_n low and sold it high.

Lemma 5.6 (Doob's upcrossing lemma). Adopting the notation x⁻ = max{−x, 0},

(b − a) E[U_N] ≤ E[(M_N − a)⁻]

Proof. By the previous lemma E[T_N] ≤ 0. If you apply the a-b strategy, then the number of upcrossings that have come and gone in M_0(ω), M_1(ω), ..., M_N(ω) is equal to the number of times that you bought low and sold high. You made U_N trades and each time you earned at least b − a. So you made at least U_N(b − a). However, at time N you may be running at a loss if you bought low and could not sell high yet. The current value of your asset is M_N and you did not pay more than a for it, so your loss at time N is at most (M_N(ω) − a)⁻. It follows that your gain T_N at time N is

T_N ≥ (b − a)U_N − (M_N − a)⁻

Now compute the expectation. □
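Counting upcrossings is mechanical, and the lemma can be sanity-checked by simulation. The sketch below is my own illustration; a fair random walk is in particular a supermartingale, so Lemma 5.6 applies to it.

```python
import random

def upcrossings(path, a, b):
    """Count completed upcrossings of [a, b] by the sequence `path`."""
    count, below = 0, False
    for x in path:
        if x <= a:
            below = True          # bought low: waiting to sell high
        elif x >= b and below:
            count += 1            # sold high: one upcrossing completed
            below = False
    return count

rng = random.Random(6)
a, b, N, trials = -2, 3, 500, 5_000
lhs = rhs = 0.0
for _ in range(trials):
    s, path = 0, []
    for _ in range(N):
        s += rng.choice((-1, 1))
        path.append(s)
    lhs += (b - a) * upcrossings(path, a, b)
    rhs += max(a - path[-1], 0)       # (M_N - a)^-
print(lhs / trials, "<=", rhs / trials)
```

Perhaps surprisingly, the inequality is rather tight here: a fair walk wanders far from the band [a, b], so completing an upcrossing takes a long time and U_N grows slowly.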

Exercise State and prove the downcrossing lemma for submartingales (buying high and selling low in Bernie’s casino – a very strange strategy – but can you make a loss in Bernie’s casino?).

Lemma 5.7 (Doob's upcrossing lemma, part II). Suppose that M_n is a supermartingale that is bounded in L¹, i.e., the sequence E[|M_n|] is bounded. Define U_∞ = lim U_N. Then P(U_∞ = ∞) = 0.

Proof. By our assumptions, the sequence E[|M_n|] is bounded by some K. Therefore

(b − a) E[U_N] ≤ E[(M_N − a)⁻] ≤ K + |a|

By the monotone convergence theorem (b − a) E[U_∞] ≤ K + |a|. Since U_∞ is integrable, the event {U_∞ = ∞} has zero probability. □

5.5. And beyond

Let me end with a few remarks, the first of which is an improvement of the sampling theorem, which you can find on Wikipedia. I leave that to you as a swapping exercise.

Exercise Consider a martingale M_n that starts at M_0 = 0, as usual. Suppose there is a constant K such that E[|dM_n| | F_n] ≤ K for all n and that E[τ] < ∞. Now prove the following:
A. Z_n = |dM_0| + ··· + |dM_{n−1}| is a submartingale such that |M_n| ≤ Z_n.
B. Z_{τ∧n} = |dM_0|·1_{τ>0} + |dM_1|·1_{τ>1} + ··· + |dM_{n−1}|·1_{τ>n−1}
C. E[|dM_n|·1_{τ>n} | F_n] ≤ K·1_{τ>n} for all n.
D. E[|dM_n|·1_{τ>n}] ≤ K·P(τ > n) for all n.
E. E[Z_τ] = E[lim_{n→∞} Z_{τ∧n}] = lim_{n→∞} E[Z_{τ∧n}].
F. lim_{n→∞} E[Z_{τ∧n}] = Σ_{n=0}^∞ E[|dM_n|·1_{τ>n}] ≤ Σ_{n=0}^∞ K·P(τ > n).
G. Prove that Σ_{n=0}^∞ P(τ > n) = E[τ]. Conclude that Z_τ is integrable.
H. Prove that Doob's sampling theorem holds for this martingale and this stopping time.

The study of stochastic inequalities is a field in itself, known as the theory of large deviations. There are still many open problems on stochastic inequalities, one of them is Samuels' conjecture. I communicate this conjecture to you in terms of a competition:

Challenge Construct independent non-negative random variables X_n with E[X_n] = n such that P(X_1 + X_2 + X_3 + X_4 ≥ 11) is as large as possible. Send in your contributions by e-mail. Winner receives a prize. NB Markov's inequality says that you can hope for at most 10/11, but Samuels' conjecture is more pessimistic than that.

CHAPTER 6

Don’t Stop Me Now

This material can be found in Gut's sections 10 and 11, leading up to Theorems 12.1, 12.2, and 12.3, which are the main results on martingales that you get to see in your course on Martingales and Brownian Motion. Alternatively, you can consult sections 4.2 and 4.3 of Brzeźniak-Zastawniak.

Doob's optional sampling theorem concerns a stopped process, which freezes in time and so it surely converges. What happens to a martingale if we never stop it, but let it run to the end? Does it converge and, if so, can we swap limit and expectation? Doob's convergence theorem addresses these questions. As always, there are many different forms of this theorem. Convergence is a bit of a zoo. Furthermore, we always need to require something extra to guarantee convergence in all possible ways. In section 3 we require that the martingale is bounded in L^p. In section 4 we require that it is uniformly integrable.

But first, here is a note from our sponsor: if we are considering convergence X_n(ω) → X_∞(ω) then we always mean almost sure convergence unless stated otherwise. And it takes place on the infinite line [−∞, +∞]. Infinite limits are allowed.

6.1. L² convergence

We first prove the convergence theorem in L², as a warm up. Martingales are easiest to understand in the L² environment. We will derive the stronger L^p theorem below, without using the argument in this section. So if you are impatient, you can move on to the next section.

Remember from your elementary analysis class that a convergent sequence x_n is bounded, but a bounded sequence is not necessarily convergent. It turns out that martingales in L² are much better behaved. If they are bounded, they satisfy all concepts of convergence in the gray box of Chapter 1. The reason for this is orthogonality. If you are locked up in a box in finite dimensional space, you can walk around forever without ever coming to a standstill. If you are locked up in a box in Hilbert space and you have to take orthogonal steps, then you will come to a standstill.

Theorem 6.1. Suppose that M_n is a martingale and the sequence E[M_n²] is bounded. Then there exists an M_∞ ∈ L¹ such that M_n → M_∞ almost surely and in the mean. Even more so, lim_{n→∞} E[(M_n − M_∞)²] = 0.

Proof. In L² the martingale differences are orthogonal and by Pythagoras

E[M_n²] = Σ_{k=0}^{n−1} E[dM_k²]

It follows that the sequence E[M_n²] increases, and it is bounded by our assumption. Therefore

Σ_{k=0}^∞ E[dM_k²]

is finite. For every ε > 0 we can choose a sufficiently large m such that

Σ_{k=m}^∞ E[dM_k²] < ε

Now consider the submartingale (M_n − M_m)² for a fixed m and index n ≥ m. By Doob's inequality we have

P(|M_n − M_m|* > δ) = P(((M_n − M_m)²)* > δ²) ≤ (1/δ²) E[(M_n − M_m)²] = (1/δ²) Σ_{k=m}^n E[dM_k²] < ε/δ²

By taking m sufficiently large, we can make ε arbitrarily small. If lim sup M_n − lim inf M_n > δ then there are arbitrarily large m, n such that |M_n − M_m| > δ. It follows that we can make P(lim sup M_n − lim inf M_n > δ) arbitrarily small, which implies that P(lim sup M_n − lim inf M_n > δ) = 0 for every δ > 0. In other words,

P(lim sup M_n = lim inf M_n) = 1

If the limsup and liminf are equal, then the limit of M_n exists. We denote it by M_∞. Now apply Fatou's lemma:

E[(M_∞ − M_m)²] = E[lim_{n→∞} (M_n − M_m)²] ≤ lim inf_{n→∞} E[(M_n − M_m)²]

We already saw that E[(M_n − M_m)²] can be made arbitrarily small by taking m arbitrarily large. It follows that lim_{m→∞} E[(M_∞ − M_m)²] = 0, i.e., M_m converges to M_∞ in the L² norm. But then it also converges in the mean (or as an analyst would say: in L¹). □

Gut carries the proof of the L² convergence much further and derives the convergence theorems from this argument in section 10.1. I will only give Gut's second proof, which you can find in his section 10.2. The proof starts now.
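A concrete L²-bounded martingale, and a sketch of my own to watch Theorem 6.1 at work: the random harmonic series Σ ε_k/k with independent random signs has E[M_n²] = Σ 1/k² ≤ π²/6, so its paths must settle down.

```python
import random

def random_harmonic(n, rng):
    """Path of M_n = sum_{k<=n} eps_k / k with independent signs eps_k = +/-1.
    E[M_n^2] = sum 1/k^2 is bounded, so Theorem 6.1 promises convergence."""
    m, out = 0.0, []
    for k in range(1, n + 1):
        m += rng.choice((-1, 1)) / k
        out.append(m)
    return out

rng = random.Random(7)
path = random_harmonic(100_000, rng)
print(path[99], path[9_999], path[99_999])   # the path settles down
tail = abs(path[-1] - path[50_000])
print(tail)                                  # the late oscillation is tiny
```

The tail oscillation is of the order sqrt(Σ_{k>m} 1/k²) ≈ 1/sqrt(m), exactly the quantity ε that drives the proof above.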

6.2. Almost sure convergence

Having dealt with L² bounded martingales, we now move on to L¹ bounded martingales. L¹ boundedness is a weaker condition and we already saw that the double-or-nothing martingale from Martigues, which is L¹ bounded, does not converge in the mean. It does converge almost surely, though, and this is the general picture:

Theorem 6.2. Suppose that M_n is a supermartingale which is L¹ bounded, i.e., the sequence E[|M_n|] is bounded. Then there exists an M_∞ ∈ L¹ such that M_n → M_∞ almost surely.

Proof. It suffices to prove that

\[ P(\limsup M_n > \liminf M_n) = 0. \]

Doob's upcrossing lemma, part II, tells us that for fixed $a < b$ the event $\{U_\infty = \infty\}$ is a nullset. Let's denote this event by $U_{a,b}$. Now let $a < b$ run over all rational numbers. The union of all nullsets $\bigcup_{a<b} U_{a,b}$ is again a nullset. If $\limsup M_n(\omega) > \liminf M_n(\omega)$ then there are rationals $a < b$ such that $\limsup M_n(\omega) > b > a > \liminf M_n(\omega)$. And so

\[ \{\limsup M_n > \liminf M_n\} \subset \bigcup_{a<b} U_{a,b}. \]
The set on the right is a countable union of nullsets, so $\{\limsup > \liminf\}$ is a nullset, and $M_n$ converges almost surely. To wrap up the proof, we apply Fatou's lemma:

\[ E[|M_\infty|] = E[\lim |M_n|] = E[\liminf |M_n|] \leq \liminf E[|M_n|]. \]
Since $E[|M_n|]$ is bounded, we conclude that $M_\infty$ is in $L^1$, i.e., it is integrable. $\square$

We considered supermartingales since we needed the upcrossing lemma, which holds for supermartingales. How about submartingales? Well, $M_n$ is a submartingale if and only if $-M_n$ is a supermartingale. Also, $M_n$ is bounded in $L^1$ if and only if $-M_n$ is bounded. So nothing changes.

Corollary 6.3. Suppose that $M_n$ is a submartingale and the sequence $E[|M_n|]$ is bounded. Then there exists an $M_\infty \in L^1$ such that $M_n \to M_\infty$ almost surely.

6.3. $L^p$ convergence

Theorem 6.4. Suppose that $M_n$ is a submartingale and the sequence $E[|M_n|^p]$ is bounded for some $p > 1$. Then there exists an $M_\infty \in L^p$ such that $M_n \to M_\infty$ almost surely and in the mean. Even more so, $\lim_{n\to\infty} E[|M_n - M_\infty|^p] = 0$.

Proof. If a submartingale is $L^p$ bounded, then it is $L^1$ bounded and so $M_n \to M_\infty$ almost surely. By Doob's inequality
\[ E[(|M_n|^*)^p] \leq \left(\frac{p}{p-1}\right)^p E[|M_n|^p]. \]
The process $|M_n|^*$ keeps track of the world records and increases with $n$. Therefore $|M_\infty| \leq \lim_{n\to\infty} |M_n|^*$. By monotone convergence
\[ E[|M_\infty|^p] \leq E[(\lim_{n\to\infty} |M_n|^*)^p] = \lim_{n\to\infty} E[(|M_n|^*)^p] \]
is bounded. Repeating the argument for the submartingale $M_n - M_m$ for $n = m, m+1, \ldots$ and a large enough $m$ gives that the bound on $E[|M_\infty - M_m|^p]$ can be made arbitrarily small. In other words, $M_n$ converges to $M_\infty$ in $L^p$. $\square$

Again, the theorem also holds for a supermartingale $M_n$, since $-M_n$ is a submartingale.

This proof has length next to nothing. How can it ever be important? You got to be kidding me.

6.4. $L^1$ convergence

Now how about $L^1$ convergence? If $p > 1$ then $L^p$ boundedness implies that all concepts of convergence of our gray box in Chapter 1 apply. Even $p = 1.000001$ is enough. But how about $p = 1$? The double-or-nothing martingale shows that we need another condition to ensure convergence in the mean. This condition is called uniform integrability.¹

¹Uniform sounds gray. Integrable sounds gray. Uniformly Integrable sounds like North Korea.

58 6. DON'T STOP ME NOW

Definition 6.5. A class $\mathcal{C}$ of random variables is UI, or uniformly integrable, if for every $\epsilon > 0$ there exists a $K$ such that

\[ \int_{\{|X| > K\}} |X| \, dP < \epsilon \quad \text{for every } X \in \mathcal{C}. \]
The double-or-nothing martingale is not UI, as you can easily check for yourself.

Lemma 6.6. Any random variable $X$ is UI. In other words, if $\mathcal{C}$ is a singleton, or a finite set for that matter, then it is UI.

Proof. By our standing assumption $X$ is integrable and so $|X| < \infty$ almost surely. Therefore the sequence $Y_K = |X| \cdot 1_{\{|X| \geq K\}}$ for $K = 1, 2, \ldots$ converges to zero, almost surely. Since $Y_K \leq |X|$, we may swap

\[ \lim_{K\to\infty} \int Y_K \, dP = \int \lim_{K\to\infty} Y_K \, dP = 0 \]
by dominated convergence. In other words, $E[Y_K] \to 0$, so for every $\epsilon$ there exists a $K$ such that $E[Y_K] < \epsilon$. This is what we had to prove. $\square$

Exercise Let $\mathcal{C}$ be a family of random variables. Suppose there exists an integrable $Y$ such that $|X| < Y$ for all $X \in \mathcal{C}$. Prove that $\mathcal{C}$ is UI.

Exercise If both $\mathcal{C}$ and $\mathcal{D}$ are UI, then so is the collection of $X + Y$ with $X \in \mathcal{C}$ and $Y \in \mathcal{D}$.

Lemma 6.7. $X_n \to Z$ in the mean if and only if $X_n \to Z$ in probability and the $X_n$ are UI.

Proof. We might as well consider the sequence $Y_n = |X_n - Z|$. We need to show that $Y_n$ converges to zero in the mean if and only if it converges to zero in probability and is UI.
($\Rightarrow$) Suppose $Y_n$ converges to zero in the mean. We need to show that it is UI. For large enough $N$ we have $E[Y_n] < \epsilon$ if $n > N$, so we only need to deal with a finite number $Y_1, \ldots, Y_N$ and that we can do by Lemma 6.6.
($\Leftarrow$) Suppose $Y_n$ converges to zero in probability and is UI. We need to show that $E[Y_n] \to 0$. In other words, we need to show that for every $\epsilon$ there is an $N$ such that $E[Y_n] < \epsilon$ for $n > N$. Since $Y_n$ is UI, there exists a $K$ such that $\int_{\{Y_n > K\}} Y_n \, dP < \epsilon$. Since $Y_n$ converges in probability, there exists an $N$ such that $P(Y_n > \epsilon) < \epsilon/K$ once $n > N$. If we combine all this, then we are done:

\[ E[Y_n] = \int Y_n \, dP = \int_{\{Y_n \leq \epsilon\}} Y_n \, dP + \int_{\{\epsilon < Y_n \leq K\}} Y_n \, dP + \int_{\{Y_n > K\}} Y_n \, dP \]
\[ \leq \int_{\{Y_n \leq \epsilon\}} \epsilon \, dP + \int_{\{\epsilon < Y_n \leq K\}} K \, dP + \int_{\{Y_n > K\}} Y_n \, dP \]
\[ \leq \epsilon + K \cdot P(\epsilon < Y_n) + \epsilon < 3\epsilon. \quad \square \]
In particular, UI + almost sure convergence implies convergence in the mean. As an immediate result we get

Theorem 6.8. Suppose that $M_n$ is a UI submartingale and the sequence $E[|M_n|]$ is bounded. Then there exists an $M_\infty \in L^1$ such that $M_n \to M_\infty$ almost surely and in the mean.
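The double-or-nothing martingale, with $M_n = 2^n$ with probability $2^{-n}$ and $M_n = 0$ otherwise as in the earlier chapters, is the example to test Definition 6.5 against. A small exact computation (a sketch, not from Gut):

```python
from fractions import Fraction

# Double-or-nothing: M_n = 2^n with probability 2^(-n), else M_n = 0.
def tail_mass(n, K):
    # E[|M_n| 1_{|M_n| > K}] = 2^n * 2^(-n) = 1 whenever 2^n > K, else 0.
    return Fraction(1) if 2 ** n > K else Fraction(0)

# However large K is chosen, some member of the family still carries
# expectation 1 beyond level K, so no single K works for all n: not UI.
for K in [10, 1000, 10 ** 6]:
    worst = max(tail_mass(n, K) for n in range(60))
    print(K, worst)  # worst equals 1 for every K
```

This is exactly why Theorem 6.8 does not apply to double-or-nothing, even though it is $L^1$ bounded.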

We have now proved Theorem 12.3 in Gut and we have also almost proved Theorems 12.1 and 12.2 in Gut. We still need to prove (d) in 12.1 and (a) in 12.2.

6.5. Wrapping it up Let’s first settle (d) in Theorem 12.1. In fact, that is a theorem in itself.

Theorem 6.9 (Lévy's upward theorem). Suppose that $M_n$ is a non-negative martingale which is UI and $L^1$ bounded. If the martingale is adapted to $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$, then $M_n = E[M_\infty \mid \mathcal{F}_n]$.

Proof. By the convergence theorem, $M_n$ converges to $M_\infty$ in the mean:

\[ \lim_{n\to\infty} \int |M_n - M_\infty| \, dP = 0. \]

If we integrate over $A \in \mathcal{F}_k$, then this only reduces the integral and so

\[ \lim_{n\to\infty} \int_A |M_n - M_\infty| \, dP = 0. \]
By the triangle inequality, we have

\[ \lim_{n\to\infty} \int_A (M_n - M_\infty) \, dP = 0 \]

which is equivalent to

\[ \lim_{n\to\infty} \int_A M_n \, dP = \int_A M_\infty \, dP. \]

Since $E[M_n \mid \mathcal{F}_k] = M_k$ and $A \in \mathcal{F}_k$, we can replace the $M_n$ by $M_k$:

\[ \int_A M_k \, dP = \int_A M_\infty \, dP. \]

In other words, $E[M_\infty \mid \mathcal{F}_k] = M_k$. $\square$

This theorem is a special version of Lévy's upward theorem, named after the great French mathematician Paul Lévy. A martingale of the form $M_n = E[Z \mid \mathcal{F}_n]$ is called a regular martingale. Let's look at a very simple example. In graph theory, one is interested in the chromatic number, which is the number of colors that you need to paint the graph such that neighbors get different colors. In random graph theory, one puts all the graphs in a bag and picks one at random. So then, the chromatic number becomes a random number. Let's put all graphs on three vertices in a bag and pick one at random. What do you expect the chromatic number to be? I illustrate this with a picture that I stole from the classic text on the probabilistic method, by the great Noga Alon and Joel Spencer.
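Before we unpack the picture, you can let the machine empty the bag. A brute-force sketch in Python (my own encoding of graphs as edge subsets, not from the text):

```python
from fractions import Fraction
from itertools import combinations, product

VERTICES = range(3)
EDGE_SLOTS = list(combinations(VERTICES, 2))  # the 3 possible edges

def chromatic_number(edges):
    # smallest c such that some colouring with c colours makes every
    # edge bichromatic; brute force is fine on 3 vertices
    for c in range(1, 4):
        for colouring in product(range(c), repeat=3):
            if all(colouring[u] != colouring[v] for u, v in edges):
                return c

# the bag: all 2^3 = 8 graphs on 3 labelled vertices, picked uniformly
graphs = [tuple(e for e, present in zip(EDGE_SLOTS, mask) if present)
          for mask in product([0, 1], repeat=3)]
expected = Fraction(sum(chromatic_number(g) for g in graphs), len(graphs))
print(expected)  # 2
```

The expected chromatic number is exactly 2, which is the root value $X_0$ of the martingale that we now build by hand.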

The bag of all 8 graphs on 3 vertices is on the top line. Their chromatic numbers are recorded below on the line $X_3$. We work our way downwards and record all conditional expectations $E[X_{i+1} \mid \mathcal{F}_i]$. This gives our regular martingale. Now we work our way back up, tracing the martingale as a process in which we pick pairs of vertices and inspect if they have an edge between them. Initially, you know nothing: there is a single $X_0$. In step one, you inspect the bottom pair. If they are connected, step to the right. Otherwise, step to the left. You get $X_1(L)$ and $X_1(R)$. Now inspect the left and the top vertex. If they are connected, step to the right. Otherwise, step to the left. We get $X_2(LL), X_2(LR), X_2(RL), X_2(RR)$. Finally, inspect the remaining pair of right vertex and top vertex. Now fill up the values for these random variables $X_0, X_1, X_2, X_3$. This is the way a regular martingale works. You do not take $Z$ out of the bag in one go, you construct it step by step. This can be very useful. A couple of years ago, I taught the course on probability models for finance in our Minor Finance. It started with computing this slowly-getting-the-cat-out-of-the-bag martingale and after 14 weeks, we were still computing it. Everybody seemed to be very happy.

And finally, let's settle (a) of Theorem 12.2.

Lemma 6.10. Suppose that a submartingale $M_n$ is bounded in $L^p$ for some $p > 1$. Then the submartingale is UI. Even more so, there exists a random variable $Y \in L^p$ such that $|M_n| < Y$ for all $n$.

Proof. Just a repeat of the $L^p$ argument. But I have Alzheimer's and love giving the same argument twice. If $M_n$ is a submartingale, then so is $|M_n|$ by the conditional Jensen inequality. Define
\[ |M_n|^* = \sup_{0 \leq k \leq n} |M_k| \]
and observe that $|M_n|^*$ increases monotonically with $n$, so $|M|^* = \lim_{n\to\infty} |M_n|^*$ is well defined. Monotone convergence allows the swap
\[ E[(|M|^*)^p] = \lim_{n\to\infty} E[(|M_n|^*)^p]. \]
Doob's maximal inequality provides the bound
\[ E[(|M_n|^*)^p] \leq \left(\frac{p}{p-1}\right)^p E[|M_n|^p] \]

which is bounded by some $C$ since the $M_n$ are bounded in $L^p$. It follows that $(|M|^*)^p$ is integrable, and therefore so is $|M|^*$. All the $|M_n|$ are bounded by $|M|^*$. You already had to figure out yourself in one of the exercises above that this implies that the $M_n$ are UI. $\square$

Here is an aside for all of you who like a sophisticated approach. We have been looking at convergence of martingales. In analysis, convergence always takes place in some space and some metric. Convergence in the mean is convergence in the space $L^1$. We know this well from analysis. Almost sure convergence is equivalent to pointwise convergence. We know this well from analysis also. What the hell is convergence in probability? Does this come with a metric? Yes it does. For random variables $X, Y$ we can define
\[ d(X, Y) = \inf\{\epsilon : P(|X - Y| > \epsilon) < \epsilon\}. \]
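To see the definition in action, here is a small computational sketch (my own, not in Gut): the distance of two empirical samples, computed by bisection, using that $\epsilon \mapsto P(|X - Y| > \epsilon) - \epsilon$ is non-increasing.

```python
def ky_fan(x, y, tol=1e-9):
    # d(X,Y) = inf{eps : P(|X - Y| > eps) < eps}, for the empirical
    # joint distribution of two equally long samples; bisection works
    # because eps -> P(|X - Y| > eps) - eps is non-increasing
    d = [abs(a - b) for a, b in zip(x, y)]
    n = len(d)
    def excess(eps):
        return sum(1 for v in d if v > eps) / n - eps
    lo, hi = 0.0, max(d) + 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            hi = mid
        else:
            lo = mid
    return hi

# identical samples are at distance 0, and the distance stays small
# when the samples differ rarely, however large the difference is
x = [0.0] * 100
y = [0.0] * 99 + [5.0]   # differ on 1% of the outcomes
print(ky_fan(x, x))      # essentially 0
print(ky_fan(x, y))      # essentially 0.01
```

The second value illustrates why this metrizes convergence in probability: a big discrepancy on a small event barely counts.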

Challenge This defines a metric. It was invented by the great Ky Fan.

Non graded homework on martingale convergence; deadline 24 November 2015

1. $Y_i$, $i \in \mathbb{N}$ are independent random variables with $E(Y_i) = \mu_i$, $V(Y_i) = \sigma_i^2 < \infty$. We consider the filtration $\mathcal{F}_n = \sigma\{Y_1, \ldots, Y_n\}$.
a) Show that
\[ M_n = \sum_{i=1}^n \frac{Y_i - \mu_i}{i \sigma_i} \]
is an $L^2$ martingale.
b) Verify that it satisfies the conditions of the $L^2$ martingale convergence theorem, i.e., that

\[ \sup_n E(M_n^2) < \infty \]

c) Use Kronecker’s lemma (see wikipedia) to conclude that

\[ \frac{1}{n} \sum_{i=1}^n \frac{Y_i - \mu_i}{\sigma_i} \]
converges almost surely to zero.

2. Let $X_n$ denote the number of individuals after $n$ generations in a branching process, i.e., we have $X_0 = 1$,

\[ X_{n+1} = \sum_{i=1}^{X_n} Y_{i,n} \]

where all $Y_{i,l}$ are i.i.d., integer-valued, non-negative and $E(Y_i) = m < \infty$. ($Y_{i,n}$ represents the random number of children of the $i$-th individual in the $n$-th generation.)

a) Show that $M_n = X_n/m^n$ is a martingale.

b) Conclude from the martingale convergence theorem that $M_n$ converges with probability one to a limiting random variable $M_\infty$.
c) Conclude from item b) that when $m < 1$, then $X_n \to 0$ almost surely, i.e., the population gets extinct with probability one. Conclude also that with strictly positive probability, $X_n$ grows exponentially when $m > 1$.

d) Can you conclude anything about the convergence of $E(X_n)$ to the expectation of its limit as $n \to \infty$ in the case $m = 1$?

e) Let $m > 1$, and assume that $V(Y_{i,l}) = \sigma^2$ is finite. Using the formula
\[ V(X) = E(V(X \mid Z)) + V(E(X \mid Z)) \]
(where we recall the definition of conditional variance: $V(X \mid Z) = E(X^2 \mid Z) - (E(X \mid Z))^2$), derive the recursion
\[ V(X_{n+1}) = \sigma^2 m^n + m^2 V(X_n) \]

and solve it to obtain
\[ V(X_n) = \frac{\sigma^2 m^{n-1}(m^n - 1)}{m - 1} \]
f) Compute $E(M_n^2)$ where $M_n$ is the martingale of item a) and show that it satisfies the conditions of the $L^2$ convergence theorem.
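If you want to check your closed form before handing it in, you can compute the law of $X_n$ exactly. A sketch in Python; the offspring law $P(Y=0) = 1/4$, $P(Y=1) = 1/4$, $P(Y=2) = 1/2$ is my own choice, giving $m = 5/4$ and $\sigma^2 = 11/16$.

```python
import numpy as np

# offspring pmf on {0, 1, 2}: m = E[Y] = 5/4, sigma^2 = 9/4 - 25/16 = 11/16
offspring = np.array([0.25, 0.25, 0.5])
m, sigma2 = 1.25, 11 / 16

def next_generation(pmf):
    # pmf of X_{n+1} = sum_{i <= X_n} Y_i: mix the k-fold convolutions
    # of the offspring law with weights P(X_n = k)
    out = np.zeros((len(pmf) - 1) * (len(offspring) - 1) + 1)
    conv = np.array([1.0])            # pmf of a sum of 0 children
    for k, pk in enumerate(pmf):
        out[: len(conv)] += pk * conv
        conv = np.convolve(conv, offspring)
    return out

pmf = np.array([0.0, 1.0])            # X_0 = 1
for n in range(1, 7):
    pmf = next_generation(pmf)
    support = np.arange(len(pmf))
    mean = (pmf * support).sum()
    var = (pmf * support ** 2).sum() - mean ** 2
    closed_form = sigma2 * m ** (n - 1) * (m ** n - 1) / (m - 1)
    print(n, round(var, 6), round(closed_form, 6))  # the two columns agree
```

The mean column (not printed) equals $m^n$, as item a) predicts, and the exact variance matches the closed form of item e) in every generation.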


CHAPTER 7

Sum ergo cogito

We reached Theorems 12.1-3 in Gut last week. They were our first destination. Now, before we move on to our second destination, it is time to take a break and get philosophical. Please read Zen and the Art of Motorcycle Maintenance, by the great Robert Pirsig. The picture below, taken by Rosco, says it all. Rosco, I don’t know you man, but that is a hell of a picture. Enjoy the road, my friend.

William Thurston, one of the greatest geometers of our time, said: The product of mathematics is clarity and understanding. Not theorems, by themselves. I wish I could show you some of the beautiful math that Thurston produced, but it does not fit in this course. So let’s stick to his advice. Have we reached clarity and understanding? Clarity and understanding are very personal things that rest inside our heads. Let’s be engineers and try to quantify what is inside our heads: how many theorems can we shoot down with martingales?

7.1. All, or nothing at all

You already met Kolmogorov's zero-one law in your first set of homework exercises. It goes like this. Suppose $X_1, X_2, X_3, \ldots$ is a sequence of mutually independent random variables on a probability space. Let $\mathcal{T}_n$ be the $\sigma$-algebra generated by $X_n, X_{n+1}, X_{n+2}, \ldots$. Obviously, these algebras form a descending chain $\mathcal{T}_1 \supset \mathcal{T}_2 \supset \mathcal{T}_3 \supset \cdots$ and the tail $\sigma$-algebra is equal to the intersection $\mathcal{T} = \bigcap_n \mathcal{T}_n$.

Theorem 7.1 (Kolmogorov's zero-one law). For any $A \in \mathcal{T}$ we have either $P(A) = 0$ or $P(A) = 1$.

Proof. Suppose $A \in \mathcal{T}$. Let $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ be the usual filtration, where $\mathcal{F}_n$ is generated by the first $n$ random variables. Then $M_n = E[1_A \mid \mathcal{F}_n]$ is a UI martingale that is bounded in $L^1$. Since $A \in \mathcal{T} \subset \mathcal{T}_{n+1}$ and since $\mathcal{F}_n$ and $\mathcal{T}_{n+1}$ are independent, we have that $E[1_A \mid \mathcal{F}_n] = E[1_A] = P(A)$. So the martingale $M_n$ is the sequence of constant functions $P(A), P(A), P(A), \ldots$. Now Gut's Theorem 12.4 tells you that for regular martingales $M_n = E[Z \mid \mathcal{F}_n]$ it is true that $M_n \to Z$ almost surely. In our case, this implies that the constant function $P(A)$ converges to $1_A$ almost surely. It follows that $P(A)$ is either zero or one. Now I did not prove Theorem 12.4 yet, you can read it in Gut's text or solve it yourself in the exercise below. On second thought, Gut says that Thm 12.4 follows from standard approximation arguments,¹ so I guess you have to solve the exercise. $\square$

Exercise Total recall.
A. Recall the exercise on page 16 in chapter 2 to see that $1_A = P(A)$ almost surely if $\int_B 1_A \, dP = \int_B P(A) \, dP$ for all $B \in \mathcal{F}_\infty$.
B. Recall from page 17 that $m(B) = \int_B X \, dP$ defines a measure if $X$ is non-negative.
C. Recall the definition of conditional expectation and observe that $\int_B 1_A \, dP = \int_B P(A) \, dP$ for all $B \in \mathcal{F}_n$.
D. Recall that if two measures $m_1$ and $m_2$ coincide on an algebra $\mathcal{A}$, then they coincide on the smallest $\sigma$-algebra $\sigma(\mathcal{A})$ that contains $\mathcal{A}$.²
E. Recall that $\mathcal{F}_\infty$ is the smallest $\sigma$-algebra that contains the algebra $\bigcup_{n\in\mathbb{N}} \mathcal{F}_n$.
Combine all this to conclude that $1_A = P(A)$ almost surely.

The zero-one law is a powerful result. A limit event – something that happens in the long run – will either happen with probability one or probability zero. Or, to put this in a more Markovian terminology, events are either recurrent or transient.

Exercise Let $X_n$ be a sequence of independent random variables. Prove that $\limsup X_n$ is tail-algebra-measurable. Deduce that there exists a $t_0$ such that $\{\limsup X_n \geq t\}$ has probability one for $t < t_0$ and probability zero for $t > t_0$. Physicists would say that this is a phase transition, but social scientists call $t_0$ a tipping point. Deduce that $P(\limsup X_n = t_0) = 1$.

7.2. The domino effect

In Chapter 3 I gave you a challenge: Flip a fair coin. If a tail comes up, take the coin away. If a head comes up, add an additional fair coin and continue with two coins. We say that this is the second generation. Flip both coins in the second generation, take tails away, add a coin for each head. So we get the third generation. Flip all coins in the third generation. Flip again and again and again. Let $M_n$ be the number of coins after $n$ rounds of tosses. What happens in the long run? In the mean time, we have learned that $M_n$ is a martingale. It is non-negative and $L^1$ bounded. Now we know that $M_n$ converges almost surely. Since it only takes integer values, it can only converge by freezing at a constant value. It is not so hard to see that an event like $\{M_n = M_{n+1} = M_{n+2} = \cdots = 2\}$ has probability zero. If the process freezes, it freezes at ground zero: $M_n \to 0$. Here we have another example of a martingale that converges almost surely but not in $L^1$.

¹sometimes I hate Gut's guts
²Look it up in your Real Analysis book by Aliprantis and Burkinshaw, p 153.
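You can watch the freezing at zero happen without any simulation. The coin's offspring law (0 or 2 children, each with probability $1/2$) has probability generating function $f(s) = \frac12 + \frac{s^2}{2}$, and the standard branching recursion gives $P(M_n = 0) = q_n$ with $q_{n+1} = f(q_n)$, $q_0 = 0$. A few lines of Python (a sketch) iterate it:

```python
# q_{n+1} = f(q_n) with f(s) = 1/2 + s^2/2: the probability that the
# coin process is extinct after n rounds
q = 0.0
for n in range(1_000_000):
    q = 0.5 + q * q / 2
print(q)   # creeps up to 1, roughly like 1 - 2/n: extinction is certain
```

The slow $1 - 2/n$ approach is typical of the critical case $m = 1$: extinction is sure, but it takes its time.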

This coin tossing process is called a Galton-Watson process. It is an example of a branching process. One standard example is the persistence of family names. Will the great name of Fokkink persist throughout the 21st century? Well, that depends on whether the production of offspring is subcritical or supercritical in the Fokkink family. Take a look at your homework of last week in exercise 2c: if $m < 1$ it is subcritical and if $m > 1$ it is supercritical. The critical case is $m = 1$ in exercise 2d. Think of $M_n$ as the number of reacting particles in the $n$-th generation. If $m > 1$ then KABOOM we have a chain reaction. If $m < 1$ then PFFfff the reaction fizzles out. If $m = 1$ then $M_n$ is a non-negative $L^1$ bounded martingale. Therefore $M_n$ converges almost surely, but Fatou says that $E[M_\infty] \leq M_0$. This reaction is under control. We found ourselves another tipping point.

7.3. Show me the money, Jerry

In 1973 the Chicago Board Options Exchange opened up the trade of derivatives, financial products that derive their value from an underlying asset (usually stocks). A typical derivative is a European Call Option, which puts a threshold value $K$ at some future time $T$. If the asset value $S_T$ ends above the threshold, then the value of the derivative at time $T$ is $S_T - K$. If the asset ends below the threshold, the value of the derivative is zero. We can write this concisely as $(S_T - K)^+$. If $S_n$ would be a martingale, then you recognize that $(S_T - K)^+$ is a submartingale.

Just before the Exchange opened up, two economists from Chicago University – Black and Scholes – figured out that you can build a derivative from a portfolio that contains assets only. In financial terms, you can replicate a derivative. If you can compute the price of the replicating portfolio, then you can compute the price of the derivative. That is what Black and Scholes did, and they published their method to compute option prices in 1973. However, not many people picked up on this, and Black and Scholes made a lot of money from option trading before people caught on. Today, it is common knowledge. Every trader has a pocket calculator with a Black-Scholes price button on it.

An asset price is a stochastic process $S_n$. One standard model of Financial Mathematics, called the binomial model, uses coin tosses to generate this process $S_0, S_1, S_2, \ldots$ of asset prices

\[ S_0, S_1(H), S_2(HT), S_3(HTT), S_4(HTTH), S_5(HTTHH), \ldots \]

I have included a picture which I stole from the textbook on derivative prices that we use in our financial mathematics programme. It is written by the great Steven Shreve.

The tree in this picture contains all possible paths of an asset price. I have traced out one path, which bounces up and down along the tree of all possible paths. Let's consider a European call option on this asset. The strike $K$ is equal to 4 and the end of contract is $T = 3$. We are at time zero and these values of the derivative are in the future. We have $V_3(HHH) = 28$, $V_3(HHT) = V_3(HTH) = V_3(THH) = 4$. All other values are zero. Now we declare that the asset price is a martingale – we declare that the stock exchange is a fair casino. In our model, the price depends on a coin toss. If it is $H$ then the price goes up by a factor 2, and if it is $T$ then it goes down by a factor 2. To make this a martingale, we have to declare the probability of $H$ to be $1/3$ and the probability of $T$ to be $2/3$. Now we can fill up the tree of asset paths with Black-Scholes values in the cat-out-of-the-bag way in which we produced the chromatic number of a graph in the previous chapter. We work our way down from the leaves of the tree at time $T$ to the root of the tree at time zero. The result is below, including another asset price path that is bouncing up and down the tree. The red line indicates the threshold of 4.

The Black-Scholes price $V_0$ for this particular derivative is $52/27$. And you can find the other Black-Scholes prices $V_1, V_2, V_3$ in the tree. Black and Scholes tell you to take this amount to the stock exchange, which we have just declared to be a fair casino. In round number one you bet on $S_1$, but your stake on this gamble is such that you get $44/9$ if heads come up, and $4/9$ for tails. To achieve this, your stake needs to be $20/27$ of the asset. More precisely, you buy $20/27$ of the asset – the stock exchange allows you to buy parts of assets – which costs you $80/27$ and leaves you $28/27$ in debt. Let's see what happens if the asset follows the price path that I indicated. Heads come up in round one, so our $20/27$ of the asset is now worth $160/27$ and we are $28/27$ in debt, for a total of

$44/9$. Which is exactly equal to the Black-Scholes price $V_1(H)$. Now we place our stake for the next round, in which we gamble on $S_2$, which will be equal to 16 for heads and 4 for tails. We have $44/9$ now and we buy $8/9$ of the asset, which costs us $64/9$ and leaves us $20/9$ in debt. Unfortunately, the coin comes up tails and the asset value goes down. Our worth is $8/9$ times 4 for the asset and $-20/9$ for our debt, which equals $4/3$. This is exactly equal to the Black-Scholes price at time 2 if $HT$ came up. Now we place the final stake. We buy $2/3$ of the asset, which leaves us $4/3$ in debt. In the final round, heads come up. Our portfolio is worth $2/3$ times 8 minus $4/3$. This is equal to 4, which is the value of the option at the time of expiry $T$. Replication does this for you. A few remarks are in order:

The original transition probabilities of $S_n$ do not play any role. They are replaced, so that $S_n$ becomes a martingale. We need to compute how we can replicate the derivative, i.e., buy and sell (parts of) the asset. The computation is simplified if we assume that $S_n$ is a martingale. The probabilities that do that are called risk neutral.

I wrote that $(S_n - K)^+$ would be a submartingale if $S_n$ would be a martingale. In reality, $S_n$ is not a martingale. It is only a martingale under our risk neutral probabilities. Even then, we do not want to speculate on $(S_N - K)^+$, we want to replicate it. Our portfolio is a martingale transform under the risk neutral probabilities. So, derivatives are martingales.³

We could have computed $V_0$ immediately as the expectation of $V_3$ under the risk neutral probabilities: $28 \cdot \left(\frac13\right)^3 + 4 \cdot 3 \cdot \left(\frac13\right)^2 \cdot \frac23 = \frac{52}{27}$.

We did not take inflation into account, but we could have. Then we would need to add bonds to our replicating portfolio. The principle remains the same.

This is just a model. Using random walks to model asset price movement – this is called the random-walk hypothesis – remains controversial. Some economists support it, others don't. The fact of the matter is that derivatives have become a major economic factor. Global derivatives trading exceeds the global GDP by a factor ten, according to Investopedia. Next time you hear somebody say that things did not affect the real economy, bear in mind that the unreal economy can comfortably crunch the real economy.
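The whole worked example of the previous pages fits in a few lines of exact rational arithmetic. A sketch in Python (the function names are mine):

```python
from fractions import Fraction

# the 3-period binomial tree from the text: S0 = 4, up factor 2,
# down factor 1/2, strike K = 4, risk-neutral P(H) = 1/3, P(T) = 2/3
S0, u, d, K, N = Fraction(4), Fraction(2), Fraction(1, 2), Fraction(4), 3
p = Fraction(1, 3)

def S(path):                 # asset price after a path like "HT"
    s = S0
    for toss in path:
        s *= u if toss == "H" else d
    return s

def V(path):                 # Black-Scholes value by backward induction
    if len(path) == N:
        return max(S(path) - K, Fraction(0))
    return p * V(path + "H") + (1 - p) * V(path + "T")

def delta(path):             # replicating stake: fraction of the asset held
    return (V(path + "H") - V(path + "T")) / (S(path + "H") - S(path + "T"))

print(V(""))                 # 52/27, the price from the text
print(delta(""))             # 20/27, the first stake from the text

# run the replication along the path HTH from the text
wealth = V("")
for i, toss in enumerate("HTH"):
    path = "HTH"[:i]
    D = delta(path)
    wealth = D * S(path + toss) + (wealth - D * S(path))  # trade, then move
print(wealth)                # 4 = (S3 - K)^+ on HTH: the option is replicated
```

Every number in the story – $52/27$, $20/27$, $44/9$, $4/3$, and the final 4 – drops out of these three functions.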

7.4. Take it to the limit, one more time

The two limit theorems that underlie all of statistics are the Law of Large Numbers and the Central Limit Theorem. Does martingale convergence shed some light on them? Sure it does! If you have solved exercise 1 of last week's homework, then you have derived the Strong Law of Large Numbers from martingale convergence. Now before we get into that, here is some Dutch Pride. The great Dutch mathematician Bartel van der Waerden wrote an immortal classic on Algebra that has

remained the standard text for almost a hundred years now.⁴ He also wrote a text on Statistics and stated that the strong law of large numbers scarcely plays a role in statistics. Meanwhile, the great William Feller wrote an immortal classic on Probability Theory and stated that the weak law of large numbers is of very limited interest and should be replaced by the more precise and useful strong law. There was a bit of a controversy. That is the nice thing about mathematicians, they always disagree on their foundations. Is a zebra a black horse with white stripes, or is it a white horse with black stripes? It's a tantalizing conundrum.

³A long time ago I taught financial math and there was a student who would always be the first to respond if I asked the class a question. Invariably, his answer would be wrong: he was always trying to make money out of a martingale. He was a smart guy, though. In the end he came up with a regression analysis spreadsheet of stock price movement, which he sold to some Swiss bank for half a million euros.

The law of large numbers addresses the convergence of a random walk $S_n = X_1 + \cdots + X_n$ with i.i.d. increments with $E[X] = \mu$, rescaled by a factor $n$:
\[ \lim_{n\to\infty} \frac{S_n}{n} = \mu. \]
As you learned in Chapter 1, you can interpret this limit in three different ways. It could be convergence in probability (weak law), almost sure convergence (strong law), or convergence in the mean (no name for it). We have a slight technical problem, because $S_n$ is not a martingale, but that is very easy to fix because $S_n - n\mu$ is a martingale and the law of large numbers says
\[ \lim_{n\to\infty} \frac{S_n - n\mu}{n} = 0. \]
The technical problems have not gone away. Now we have that $S_n - n\mu$ is a martingale, but $(S_n - n\mu)/n$ is not. Fortunately, there is a martingale that looks a lot like $(S_n - n\mu)/n$, namely
\[ Z_n = \frac{X_1 - \mu}{1} + \frac{X_2 - \mu}{2} + \cdots + \frac{X_n - \mu}{n}. \]
It satisfies the $L^2$ convergence theorem, and therefore it converges weakly, strongly, and in the squared mean. Instead of this martingale, let's use the submartingale

\[ \widetilde{Z}_n = \frac{|X_1 - \mu|}{1} + \frac{|X_2 - \mu|}{2} + \cdots + \frac{|X_n - \mu|}{n} \]
and roll out our most powerful weapon, the $L^p$ convergence theorem. It says that $\widetilde{Z}_n$ converges in all the three ways known to us, and its limit $\widetilde{Z}_\infty$ dominates

\[ \frac{S_n - n\mu}{n} = \frac{X_1 - \mu}{n} + \frac{X_2 - \mu}{n} + \cdots + \frac{X_n - \mu}{n}. \]
We would like to conclude from this that $(S_n - n\mu)/n$ converges to zero, but we run into another technicality:

\[ \text{Does } \sum \frac{|a_n|}{n} < \infty \text{ imply } \frac{a_1 + \cdots + a_n}{n} \to 0 \,? \]
You can call on a lemma of Kronecker to answer this question affirmatively. You can also handle it hands on (think before you read on, I am sure you can find an argument). Take $N$ so large that the tail of the series $\sum_{n>N} \frac{|a_n|}{n}$ is less than $\epsilon$. If $M > N$ is sufficiently large, then $\frac{a_1 + \cdots + a_{N-1}}{M}$ is less than $\epsilon$. Now break up
\[ \frac{a_1 + \cdots + a_M}{M} = \frac{a_1 + \cdots + a_{N-1}}{M} + \sum_{n=N}^M \frac{a_n}{M} < 2\epsilon. \]

⁴He also solved Newton's problem on the kissing number of a sphere, proved a theorem on patterns in numbers that is of fundamental use to study the primes, and much more.
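A numerical illustration of the hands-on argument (the example is my own): take $a_n = \sqrt{n}$ when $n$ is a power of two and $a_n = 0$ otherwise, so the terms are unbounded but $\sum |a_n|/n = \sum_k 2^{-k/2}$ converges.

```python
import math

# a_n = sqrt(n) if n is a power of two, else 0: the nonzero terms are
# unbounded, yet sum |a_n|/n converges, so the Cesaro averages vanish
def a(n):
    return math.sqrt(n) if (n & (n - 1)) == 0 else 0.0

N = 2 ** 20
tail, running = 0.0, 0.0
for n in range(1, N + 1):
    tail += a(n) / n
    running += a(n)
avg = running / N
print("sum |a_n|/n up to 2^20:", tail)   # about 3.41, finite
print("average (a_1+...+a_N)/N:", avg)   # about 0.003, heading to 0
```

Doubling $N$ only multiplies the numerator by roughly $\sqrt{2}$ while the denominator doubles, which is the hands-on argument in action.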

The $L^p$ convergence theorem allows you to handle
\[ \frac{S_n - n\mu}{n^q} \to 0 \]
in the same way for $q > 1/2$, using a sum of $(X_n - \mu)/n^q$. The power $q = 1/2$ remains out of reach of our martingale convergence theorems.⁵ The central limit theorem says
\[ \frac{S_n - n\mu}{\sqrt{n}} \to \mathcal{N}(0, \sigma). \]
The convergence here is convergence in distribution, which is even weaker than convergence in probability. The central limit theorem is a bit too delicate for our martingale convergence theorems. And on this philosophical note we arrive at the end of our philosophical chapter.

⁵Kolmogorov's zero-one law says that there exist $s_0 \leq t_0$ such that
\[ P(\limsup (S_n - n\mu)/n^q = t_0) = 1 \quad \text{and} \quad P(\liminf (S_n - n\mu)/n^q = s_0) = 1. \]
If $q > 1/2$ then $s_0 = t_0 = 0$, but if $q = 1/2$ then $s_0 = -\infty$ and $t_0 = \infty$. All of this becomes intuitively clear if you compute $\mathrm{Var}((S_n - n\mu)/n^q)$.

Exam Stochastic Processes WI 4202.

30 january 2015, 14:00-17:00 EWI lecture hall F/G.

No books or notes allowed.

Responsible of the course: Prof. F. Redig

Second reader exam: Dr. Ludolf Meester

————————————————————————————————

a) The exam consists of two theory questions, each on 10 points, followed by exercises. The exercises consist of 10 small questions each on 2 points.

b) The end score is computed as explained on the blackboard page. Course grade is the final exam grade $f$ or $0.6f + 0.4h$ (with $h$ the average homework grade), whichever is larger, provided $f \geq 5$.
————————————————————————————————-

1 Theory Questions.

1) State and prove the martingale convergence theorem. If you give the $L^2$ proof (the proof from the book), then also prove Kolmogorov's maximal inequality. If you give the proof based on upcrossings, then also prove Doob's upcrossing inequality.

2) a) Give the definition of Brownian motion. b) Derive the explicit formula for the probability density of the first hitting time of a > 0 for Brownian motion.

2 Exercises.

1) $\{X_i, i \in \mathbb{N}\}$ are independent and identically distributed random variables with a standard normal distribution (i.e., normally distributed with mean zero and variance 1). Furthermore, let $\{a_n, n \in \mathbb{N}\}$ be a sequence of real numbers. In the whole exercise you are allowed to use the expression for the moment generating function of a normal random variable $Y$ with mean $\mu$ and variance $\sigma^2$:

\[ E(e^{tY}) = e^{\mu t + \frac{\sigma^2 t^2}{2}}. \]
a) Compute the conditional expectation

\[ E\left( X_1 + X_2 + e^{X_1 + X_2 + X_3} \mid X_1, X_2 \right) \]
b) Show that $\{M_n, n \geq 1\}$ defined via
\[ M_n = \sum_{i=1}^n a_i X_i \]
is a martingale.
c) Show that if $\sum_{i=1}^\infty a_i^2 < \infty$ then the martingale of item b) satisfies the conditions of the martingale convergence theorem. Conclude that in that case the series

\[ \sum_{i=1}^\infty a_i X_i \]
converges with probability 1.
d) Show that $\{Z_n, n \geq 1\}$ defined via
\[ Z_n = e^{\sum_{i=1}^n a_i X_i - \frac{1}{2} \sum_{i=1}^n a_i^2} \]
is a martingale. Does the martingale convergence theorem apply to this martingale?

e) Let $Z_n$ be as in item d). Define $n$ new random variables $Y_1, \ldots, Y_n$ via $E(f(Y_1, \ldots, Y_n)) = E(Z_n f(X_1, \ldots, X_n))$ for all $f$ such that the expectations in the right hand side exist. Show that $Y_1, \ldots, Y_n$ thus defined are independent and normally distributed with mean $E(Y_i) = a_i$ and variance $\mathrm{Var}(Y_i) = 1$. Hint: it is sufficient to show that $(Y_1, \ldots, Y_n)$ has the correct multivariate moment generating function, i.e., that

\[ E\left(e^{\sum_{i=1}^n t_i Y_i}\right) = e^{\sum_{i=1}^n \left(t_i a_i + \frac{t_i^2}{2}\right)} \]
for all $t_1, \ldots, t_n \in \mathbb{R}$.

2) Let $\{W_t, t \geq 0\}$ denote Brownian motion, and let $\{N_t, t \geq 0\}$ denote a rate one Poisson process (i.e., $N_t$ is Poisson with parameter $t$) which is furthermore independent from $\{W_t, t \geq 0\}$. You are allowed to use that for a Poisson random variable $N$ with parameter $\lambda$ one has $E(N) = \lambda$, $\mathrm{Var}(N) = \lambda$, $E(e^{sN}) = e^{\lambda(e^s - 1)}$, $s \in \mathbb{R}$.
a) Show that
\[ E(W_{N_t}^2) = t \]
b) Show that $\{Z_t, t \geq 0\}$ defined by
\[ Z_t = W_t^2 - t \]
is a martingale.
c) Define the exit time of the interval $[-a, a]$ by
\[ \tau = \inf\{t \geq 0 : |W_t| > a\} \]
Show that $\tau$ is a finite stopping time.
d) Let $\tau$ be as in item c). In this item you are allowed to use 1) that $E(\tau)$ and $E(\tau^2)$ are both finite, 2) the martingale of item b), and 3) that $M_t = W_t^4 - 6tW_t^2 + 3t^2$ is a martingale (i.e., you do not have to prove this). Show then that
\[ E(\tau) = a^2, \qquad E(\tau^2) = \frac{5a^4}{3} \]
If you use the martingale stopping theorem, you should argue why you are allowed to use it.
e) In this item you are allowed to use that $N_t - t$ and $(N_t - t)^2 - t$ are martingales. Define

\[ \tau_n = \inf\{t \geq 0 : N_t \geq n\} \]
In this item, you do not have to verify the conditions of the martingale stopping theorem, i.e., you can assume that they are satisfied. Prove the following equalities using martingale stopping:

\[ E((\tau_n - n)^2) = E(\tau_n) = n \]
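A quick sanity check on this last pair of identities (a simulation sketch, not part of the exam): $\tau_n$, the time of the $n$-th Poisson arrival, is a sum of $n$ independent rate-one exponentials, so both moments can be estimated directly.

```python
import numpy as np

# tau_n = sum of n independent rate-one exponential interarrival times;
# martingale stopping predicts E[tau_n] = n and E[(tau_n - n)^2] = n
rng = np.random.default_rng(seed=0)
n, runs = 50, 100_000
tau = rng.exponential(scale=1.0, size=(runs, n)).sum(axis=1)

print("E[tau_n]       ~", tau.mean())              # close to 50
print("E[(tau_n-n)^2] ~", ((tau - n) ** 2).mean()) # close to 50
```

Both estimates sit near $n = 50$, as the two stopped martingales predict.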

CHAPTER 8

Get used to it

In mathematics you do not understand things. You just get used to them. John von Neumann

Two major events reshaped mathematics in the 20th century: WWII and the invention of the computer. John von Neumann was heavily involved in both events.¹ His down-to-earth quote marks the end of our philosophical foray. We are back on the road again. WWII was a military operation on a scale that could not be supported by simple bookkeeping anymore. The assistance of computers was required to plan things properly. And so a new mathematical discipline was born: Operations Research, or simply OR. It is the mathematics of converting logistics and decision making into computations that can be fed to the machine. We have reached the end of Gut's chapter 10. I select one result, which is due to Abraham Wald, one of the founding fathers of OR.

8.1. Are we there yet?

The great Abraham Wald wrote a PhD thesis on the foundations of mathematics before emigrating to the USA. He decided to make himself useful and became a statistician. During WWII he helped the US army build safer planes. He died in a plane crash in the Nilgiri mountains in India.

Theorem 8.1 (Wald's equations, Gut thm 14.3). Consider a random walk $S_n = X_1 + \cdots + X_n$ where as usual $X_1, X_2, \ldots$ is an i.i.d. sequence of mean $\mu$ and

1Von Neumann was involved in the design of the nuclear bomb and the design of the computer. He was a master of supplying solid mathematical foundations underneath existing theories: quantum mechanics, economics, computing, set theory, and much more. No other scientist has ever been so influential in so many fields. I took a tour of the Jewish quarter in Budapest a couple of months ago, and I asked my tour guide – she was excellent, by the way – where John von Neumann was born. She said: Johnny Neumann? I don’t know any Johnny Neumann. And that is a sad truth. Not many people have heard of this scientific giant, who ranks on the scale of Archimedes, Newton, and Gauss.

variance σ². Suppose that τ is an integrable stopping time, i.e., E[τ] < ∞. Then
E[S_τ] = µE[τ]
and
E[(S_τ − µτ)²] = σ²E[τ]
Observe that µ = 0 is useless in equation one, but rather handy in equation two.

Exercise Roll a fair six-sided die and stop once the sum reaches 21 or more. You know

S_τ approximately, but not exactly. By Wald’s equation, you now know E[τ] approximately, but not exactly:
A. Show that the expected number of rolls is at least 6.
B. Show that the expected number of rolls is at most 52/7.
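You can also watch Wald’s first equation at work numerically. The sketch below is my own illustration, not part of the exercise: it simulates the die game, and by Wald E[τ] = E[S_τ]/3.5 must land between 6 and 52/7 ≈ 7.43.

```python
import random

random.seed(2)
trials = 100_000

total_rolls = 0
for _ in range(trials):
    s, n = 0, 0
    while s < 21:                  # stop once the sum reaches 21 or more
        s += random.randint(1, 6)  # one roll of a fair six-sided die
        n += 1
    total_rolls += n

est = total_rolls / trials         # empirical E[tau], between 6 and 52/7
```

In the simulation S_τ overshoots 21 a little, which is exactly why E[τ] sits strictly between the two bounds of the exercise.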

Proof. Wald’s equations concern stopped processes, so it makes sense to use the sampling theorem to obtain these equations. We know that M_n = S_n − nµ is a martingale which starts at zero. If the sampling theorem holds, then E[M_τ] = 0. This is equivalent to Wald’s first equation. We also know that M_n² − nσ² is a martingale (this follows from exercise 3b from your homework at the end of chapter 3, since E[(dM_n)²] = σ²). If the sampling theorem holds, then E[M_τ² − τ·σ²] = 0. This is equivalent to Wald’s second equation. Now the big question is: does the sampling theorem hold? It does. The version of the sampling theorem that I left as a final exercise in chapter 5 applies immediately to M_n. Taking care of M_n² − nσ² requires more work. Fortunately, there is no need for all this. Gut gives a very short direct argument that does not use the sampling theorem but the L² convergence theorem.
The stopped process M_{τ∧n} = S_{τ∧n} − (τ∧n)µ is a martingale and therefore
E[S_{τ∧n}] = µE[τ∧n]
Now the question is: can we swap lim and E to replace τ∧n by τ? The monotone convergence theorem says that we may do so on the right hand side. We may do the same on the left hand side if all X_i are non-negative. We conclude from this that S_τ is integrable if E[τ] < ∞ and all increments are non-negative. For a general random walk, we have that |S_n| is bounded by Z_n = |X_1| + ··· + |X_n| and we have just concluded that Z_τ is integrable. It dominates S_{τ∧n} and therefore we can also swap limit and expectation on the left hand side for arbitrary increments X_i. This takes care of the first equation.
The stopped process M²_{τ∧n} − (τ∧n)σ² is a martingale and therefore
E[M²_{τ∧n}] = σ²E[τ∧n]
And again the question is: can we swap? We can on the right hand side, by monotone convergence. To deal with the left hand side, notice that
E[M²_{τ∧n}] = σ²E[τ∧n] ≤ σ²E[τ]
and conclude from this that M_{τ∧n} is an L²-bounded martingale. Therefore, it converges in the square mean. We may swap limit and expectation on the left hand side, obtaining Wald’s second equation
E[M_τ²] = σ²E[τ]  □

People who are trying to control randomness, like gamblers and managers, are primarily concerned with their targets. In mathematical terms, they know their S_τ. But how long does it take to get there? That is where Wald’s equations come in. If you know your final destination – and who does not? – then Wald tells you how to compute your expected travel time E[τ]. To give one example. Suppose S_n is the standard simple walk, the drunkard’s amble of +1 for heads and −1 for tails. Our drunkard takes off and would like to know how long it will take him to get home again

τ = min{n > 0 : S_n = 0}
In this case S_τ = 0 and Wald’s second equation says E[τ] = 0, which obviously cannot be true. Therefore τ cannot be integrable: E[τ] = ∞. If Sultan Smartypants ordains that women need to produce children, but they may stop once they have produced an equal number of boys and girls, the population will explode.
Exercise If you flip a coin until H, you flip twice on average. Now flip a coin until you get heads twice in a row, HH. How long does that take on average? Convert this to a martingale in a coin-flipping casino. Every round you bet one additional euro on H. In that way, the number of rounds is equal to the money that you give to the casino. You stop on HH. That means that you leave the casino with six euros (think!). Now here is the question: how many times do you need to flip the coin on average until HTH? (beware, it is a little tricky, use an HTH strategy).
Consider a random walk with bounded non-negative increments – think of the X_i as times – given by

T_n = X_1 + ··· + X_n
These processes often pop up in OR problems. For instance, T_n could be the time that the n-th hairline crack appears on the wing of an airplane. For proper maintenance, you have to determine a time period t when the plane has to be called in for inspection. To decide on a proper t, you will be interested in the number of cracks at the time of inspection

N(t) = sup{n : T_n ≤ t}
Define τ as the first entry of T_n into the semi-interval (t, ∞). Remember that first entry times are stopping times. Wald’s first equation says

E[T_τ] = µE[τ]

By definition, T_τ > t and τ = N(t) + 1 and so we find that
t ≤ µE[N(t) + 1]
On the other hand, we know that T_{τ−1} ≤ t and so E[T_τ − t] ≤ E[T_τ − T_{τ−1}] = E[X_τ]. In other words, E[T_τ] ≤ t + C where C is the bound on the increments of the random walk. And so
t + C ≥ µE[N(t) + 1]
Combining these two inequalities results in the elementary renewal theorem:
Theorem 8.2.
lim_{t→∞} E[N(t)]/t = 1/µ

Proof. From the inequalities above, we know that
1/µ ≤ lim inf_{t→∞} E[N(t)+1]/t ≤ lim sup_{t→∞} E[N(t)+1]/t ≤ 1/µ
The term of +1 does not play any role and we can safely omit it. □
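A quick numerical illustration of the elementary renewal theorem, with Uniform(0,1) increments, so bounded, non-negative, and µ = 1/2. The setup and names below are mine.

```python
import random

random.seed(3)
mu = 0.5            # mean of Uniform(0,1) increments
t = 200.0
trials = 2_000

count_total = 0
for _ in range(trials):
    s, n = 0.0, 0
    while s <= t:            # N(t) = sup{n : T_n <= t}
        s += random.random()
        n += 1
    count_total += n - 1     # the last increment overshot t

est = count_total / trials / t   # empirical E[N(t)]/t, should approach 1/mu = 2
```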

You may wonder why I did not write E[Xτ ] = µ. This has to do with the inspection paradox. If you put all inspection intervals in a bag and draw one at random, then its average size is µ. However, if you draw a random t and then select the interval that contains it, your draw is biased: long intervals have a higher chance of being selected. To illustrate this point, consider a frog that jumps over the integers, starting at zero and almost always taking jumps of size one, but every once in a million jumps it braces itself and takes a giant leap of one million.2 The average size of a jump is two (minus a bit, but who cares). If you consider the first M jumps of the frog for some very large number M, then the frog is somewhere around 2M. There can be no more than M integers that are covered by a jump of size one, the other half is covered by giant leaps. If you pick a random number, then the expected size of the jump that covers it, is half a million. This is much larger than the average size of a jump, which is two. You should keep this in mind when you think about the Elfstedentocht. Historically, it takes place about once every seven years, but if you just became a member of the Elfstedenvereniging, bear in mind that the expected waiting time between consecutive events exceeds seven years for new members. I am not even taking the Greenhouse effect into account here. Of course, we have all been waiting for nineteen years now, I knew this computation would work out before I checked the data.
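The frog can be put on a computer; to keep the run small I shrink the leap to 100 and raise its frequency to one in a hundred (my numbers, not the text’s). The point survives the rescaling: the jump covering a uniformly chosen point is size-biased, with mean E[X²]/E[X] ≈ 50 rather than E[X] ≈ 2.

```python
import bisect
import random

random.seed(4)
# A scaled-down frog: jumps of size 1, but one in a hundred is a leap of 100.
jumps = [100 if random.random() < 0.01 else 1 for _ in range(100_000)]
mean_jump = sum(jumps) / len(jumps)      # close to 1.99

# Cumulative landing positions: jump i covers (positions[i-1], positions[i]].
positions = []
s = 0
for j in jumps:
    s += j
    positions.append(s)

# Drop random points on the covered interval and record the covering jump.
covered = []
for _ in range(10_000):
    u = random.uniform(0, s)
    i = min(bisect.bisect_right(positions, u), len(jumps) - 1)
    covered.append(jumps[i])

size_biased_mean = sum(covered) / len(covered)
# Much larger than mean_jump: long jumps cover more ground, so they are
# more likely to contain a randomly drawn point.
```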

8.2. A deviation
The computer has greatly improved our collective memory. We can easily put personal information of all people on the planet in a single PC. The NSA stores metadata of all internet users to keep track of us, which is a frightening thought. Fortunately, there are also good people at work. Social scientists use computer models to understand the way our society works. This is called the study of Large Networks, and it is quickly becoming a science in itself.3 We are all vertices in a large graph sharing edges between friends. This is called the social network. It is the backbone that carries everything that is important to us: gossip, money, disease.4
You may wonder what this has to do with martingales. You need to remember that random graphs are cat-out-of-the-bag martingales. Think of your personal ringtone. It has to be unique among friends, to avoid confusion. How many ringtones does the telephone company need? At least as many as the chromatic number of the social network. But what is this number? The social network is a random graph and the chromatic number is a random number. It is not a fixed quantity. However, it does not fluctuate much, since it belongs to a martingale:

2I could also write X_i is equal to one or a million, and define T_n more abstractly. I just happen to like this six-million dollar frog.
3Mathematicians now start to interact with social scientists and try to exchange ideas. It is a merger between the social and the anti-social sciences.
4If you are at ease, then you are comfortable. If you are not completely comfortable, you have a disease. That is a stiff upper lip for you.

Theorem 8.3 (Azuma). Suppose that M_n is a martingale with bounded increments which starts at zero. If |dM_n| ≤ 1 for all n, then
P(M_n/√n > γ) ≤ e^{−γ²/2}

We will need the inequality
e^{λx} ≤ (e^λ + e^{−λ})/2 + ((e^λ − e^{−λ})/2) · x
for |x| ≤ 1. On the left hand side of the inequality, we have a convex function e^{λx}. On the right hand side, we have a line h(x) which intersects the graph of e^{λx} in x = −1 and x = 1. We will also need that
(e^λ + e^{−λ})/2 ≤ e^{λ²/2}
To see this, inspect the Taylor expansions on both sides of the equation. On the left hand side, we get terms λ^{2n}/(2n)! and on the right hand side we get terms λ^{2n}/(2^n · n!). The terms on the right are larger.

Proof. By monotonicity and linearity we have
E[e^{λ·dM_n} | F_{n−1}] ≤ E[h(dM_n) | F_{n−1}] = h(E[dM_n | F_{n−1}]) = h(0)
Using this and taking out what is known gives
E[e^{λM_n} | F_{n−1}] = E[e^{λ(M_{n−1} + dM_n)} | F_{n−1}] ≤ e^{λM_{n−1}} · h(0)
and this we can condition on F_{n−2} in the same way: E[e^{λM_{n−1}} · h(0) | F_{n−2}] ≤ e^{λM_{n−2}} · h(0)². We can repeat this all the way until n is zero
E[e^{λM_n}] ≤ h(0)^n = ((e^λ + e^{−λ})/2)^n ≤ e^{nλ²/2}
Combine this with Markov’s inequality
P(M_n/√n > γ) = P(e^{λM_n} > e^{λγ√n}) ≤ e^{−λγ√n} · E[e^{λM_n}] ≤ e^{−λγ√n} · e^{nλ²/2}
This is true for any λ > 0 and we choose λ = γ/√n, which gives
P(M_n/√n > γ) ≤ e^{−γ²/2}  □

Remember that we could not reach the central limit theorem with our martingale convergence theorems. But Azuma’s inequality is almost as good. It says that the tail probability of M_n/√n decays like a standard normal distribution, if the increments of M_n are bounded.
A random graph G has random edges between n vertices. The most common way to construct graphs at random is to flip a coin – could be biased – for each of its (n choose 2) possible edges. Put an edge if heads comes up. You all know what the chromatic number χ(G) of a graph is. It is a function χ: Ω → ℕ on the set Ω of all graphs on n vertices. We have a probability measure on Ω since we select graphs in a coin tossing way, so we are dealing with a probability space and χ is a random variable.
Theorem 8.4 (Shamir and Spencer).

P( |χ(G) − E[χ]| / √n > γ ) ≤ 2e^{−γ²/2}

Proof. We toss a coin N = (n choose 2) times. The set of all graphs Ω is equivalent to the product {0,1}^N. The k-th coin toss for an edge is the projection X_k: {0,1}^N → {0,1} on the k-th coordinate.5 Let F_k be the σ-algebra that is generated by X_1, ..., X_k. If we define M_k = E[χ | F_k], then we have the cat-out-of-the-bag martingale for the chromatic number that we discussed earlier. The theorem follows from Azuma’s inequality once we establish that |dM_k| ≤ 1. This inequality requires that the martingale starts at zero, so we need to replace M_k by M_k − E[χ(G)], but this does not affect the increments.
After k tosses, we know the first k edges of our graph G. There are 2^{N−k} possibilities for G. Enumerate these by Γ_1, ..., Γ_{2^{N−k}}. If we toss the coin once more, then we know one more edge and the number of possibilities is halved to 2^{N−k−1}. Let’s enumerate the graphs Γ_i in such a way that the first 2^{N−k−1} are those in which the next coin toss comes up heads. All of these contain edge k+1. Let’s write Γ̄_i for the graph which is identical to Γ_i, but for the fact that it does not contain edge k+1. A coloring of Γ_i is a coloring of Γ̄_i as well, so χ(Γ̄_i) ≤ χ(Γ_i). On the other hand, a coloring of Γ̄_i is a coloring of Γ_i if we are allowed one extra color, to deal with the extra edge of Γ_i. Therefore |χ(Γ_i) − χ(Γ̄_i)| ≤ 1. If we move from M_k to M_{k+1} then the (k+1)-th coin toss becomes known. With probability p it is heads and with probability 1−p it is tails. Let’s write this informally as
M_k = p·M_{k+1}(H) + (1−p)·M_{k+1}(T)
Here M_{k+1}(H) is the expected value of χ(Γ_i) over the first 2^{N−k−1} indices, and M_{k+1}(T) is the expected value over the χ(Γ̄_i). They differ at most by 1. Therefore dM_k(H) = (1−p)(M_{k+1}(H) − M_{k+1}(T)) and dM_k(T) = p(M_{k+1}(T) − M_{k+1}(H)) are both bounded by one. □
Shamir and Spencer’s theorem does not tell you what E[χ] is yet, but you could establish that by Monte Carlo simulation, since Shamir and Spencer’s theorem says that the random number χ(G) does not fluctuate very much. Or you could look it up in the literature. The great Béla Bollobás computed E[χ]. His proof was one of the first applications of martingale theory to Large Networks.
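Here is what such a Monte Carlo simulation could look like at a toy size n = 8, where the chromatic number can still be computed exactly by backtracking. This is my own illustration, not Bollobás’ computation; at such a small n the theorem’s bound is crude, but the concentration is already visible in the sample spread.

```python
import random
from itertools import combinations

random.seed(6)

def chromatic_number(n, edges):
    """Smallest k admitting a proper k-coloring, found by backtracking."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def colorable(k):
        colors = [-1] * n
        def bt(v):
            if v == n:
                return True
            for c in range(k):
                if all(colors[u] != c for u in adj[v]):
                    colors[v] = c
                    if bt(v + 1):
                        return True
            colors[v] = -1
            return False
        return bt(0)

    k = 1
    while not colorable(k):
        k += 1
    return k

n = 8
samples = []
for _ in range(200):
    # G(n, 1/2): a fair coin flip for each of the (n choose 2) possible edges
    edges = [e for e in combinations(range(n), 2) if random.random() < 0.5]
    samples.append(chromatic_number(n, edges))

mean_chi = sum(samples) / len(samples)
spread = max(samples) - min(samples)   # chi barely fluctuates around its mean
```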

5the σ is not necessary, it is all finite
Exercise on Azuma-Hoeffding inequality and concentration of measure.

1. Let X_i, i ∈ ℕ be i.i.d. Bernoulli random variables with success probability p = 1/2. Let f: {0,1}^n → ℝ and define, for i ∈ {1, ..., n}
δ_i(f) = sup_{x ∈ {0,1}^n} |f(x^i) − f(x)|
where x^i denotes the sequence x flipped at i, i.e.,

x^i(j) = 1 − x(i) if j = i, and x^i(j) = x(j) if j ≠ i.

We call δ_i f the maximal influence on f caused by flipping the i-th symbol.

a) Define Z_n = f(X_1,...,X_n). Define E(Z_n | F_0) = E(f(X_1,...,X_n)). Show that for i = 1, ..., n:

|E(Z_n | F_i) − E(Z_n | F_{i−1})| ≤ δ_i f

Hint: use that by the i.i.d. property of the X_i:

E(f(X_1,...,X_n) | F_i) = ∑_{x_{i+1},...,x_n ∈ {0,1}} f(X_1,...,X_i, x_{i+1},...,x_n) ∏_{r=i+1}^{n} P(X_1 = x_r)
b) Show that
Z_k = E(f(X_1,...,X_n) | F_k)
is a martingale.
c) Combine now a) and b) with the Azuma-Hoeffding inequality to conclude that

P( |f(X_1,...,X_n) − E(f(X_1,...,X_n))| > ε ) ≤ 2 exp( −ε² / (2 ∑_{k=1}^{n} (δ_k f)²) )
2. We now can apply what we found in the previous exercise to show an important concentration of measure result. For a set A ⊂ {0,1}^n such that P((X_1,...,X_n) ∈ A) ≥ 1/2 and x ∈ {0,1}^n we define
d(x, A) = inf_{y ∈ A} d(x, y)
where
d(x, y) = (1/n) ∑_{i=1}^{n} |x_i − y_i|
denotes the so-called Hamming distance.

a) Define f(X_1,...,X_n) = d((X_1,...,X_n), A). Show that for all k ∈ {1,...,n}, δ_k f ≤ 1/n.
b) Conclude from a) and the previous exercise that

P( |f(X_1,...,X_n) − E(f(X_1,...,X_n))| > ε ) ≤ 2e^{−nε²/2}
c) Put α = E(f(X_1,...,X_n)). Show that, using P((X_1,...,X_n) ∈ A) ≥ 1/2,

1/2 ≤ P( |f(X_1,...,X_n) − E(f(X_1,...,X_n))| ≥ α ) ≤ 2e^{−nα²/2}
Derive from this the upper bound

α ≤ √(2 log 4 / n)
d) Conclude then that

P( d((X_1,...,X_n), A) > √(2 log 4 / n) + ε ) ≤ 2e^{−nε²/2}
e) Finally define the ε-blow-up of A by

A_ε = {x ∈ {0,1}^n : d(x, A) ≤ ε}
Conclude from d) that for n large enough

P( (X_1,...,X_n) ∈ A_ε ) ≥ 1 − e^{−nε²/3}

i.e., nearly all realizations of (X_1,...,X_n) are in the blow-up of a measure-1/2 set! This is called the “concentration of measure” phenomenon: a set of positive measure, slightly blown up in high dimension, has near to full measure.
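The blow-up phenomenon in e) can be seen directly in a simulation. Take A to be the set of strings with at most n/2 ones, whose measure is at least 1/2 by symmetry; for this A the distance is explicit, d(x, A) = max(0, k − n/2)/n where k is the number of ones (flip the surplus ones to zeros). The parameters below are my own choice.

```python
import random

random.seed(7)
n = 1000
eps = 0.06
trials = 5_000

# A = {x : at most n/2 ones}; d(x, A) = max(0, k - n/2)/n with k = #ones(x)
inside_blowup = 0
for _ in range(trials):
    k = sum(random.randint(0, 1) for _ in range(n))
    d = max(0, k - n // 2) / n
    if d <= eps:
        inside_blowup += 1

fraction = inside_blowup / trials   # should be very close to 1
```

A slim ε = 0.06 blow-up of a half-measure set already swallows essentially every sample: that is concentration of measure.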

CHAPTER 9

Walk the Line

We move on to Brownian Motion, which in modern mathematical terminology is a martingale in continuous time. Historically BM was discovered long before mathematicians invented martingales. I will take a historical approach to BM through random walks. We leave Gut behind, but B-Z remains useful. You can find some information on random walks in Chapter 5.

In 1841 the industrial revolution had started. James Watt had introduced his steam engine more than fifty years earlier. John Stevens had been perfecting locomotives, running them around Hoboken, New Jersey. The USA was going through a railway building boom. Yet, rather miraculously, the scientists of those days seemed to have been blissfully unaware of what was going on around them. None of them had hit on the idea that heat was equivalent to mechanical energy. The first person ever to realize this was a German ship doctor, Robert Mayer.1 During a heavy storm, when big waves splashed across the ship, the captain told Mayer that water heats up during a storm. Mayer concluded that the mechanical energy of falling water gets converted to heat and so it must be possible to compute the potential energy that is equivalent to 1 calorie. When he returned home to Heilbronn in 1841, he worked out the details and, with great difficulty, managed to publish his results in the Annalen der Physik und Chemie. You have to admire Mayer’s perseverance. Imagine a captain swaying up to you during a storm, when you are holding onto the mast for dear life, and telling you: Now cheer up matey, it is not all bad you know, water gets nice and warm during a storm. To derive the principle of conservation of energy from such an unreliable source takes a strong and sharp mind. That same captain – a staunch alcoholic no doubt – must have also told Mayer about nude women that lure sailors to their deaths, ghost ships that are doomed to sail the seas forever, and giant squid that crush entire ships.2

1Dutch pride: Mayer was employed by the Dutch East India Company.
2Giant squid were found in 2013; the hunt for artistic nude female cannibals continues.


Anyway, Mayer was the first to discover the first law of thermodynamics, and soon many others got the same idea. Within a few decades physicists had developed thermodynamics – which is Greek for heat driven motion – into a full blown theory.

9.1. Putting PDE’s in your PC
Joseph Fourier was a busy man. He was a great orator and got involved in the French Revolution as a young man. In 1801, in his early thirties, he became governor of the alpine region of Isère, where he oversaw the development of infrastructure and education. In his spare time he conducted experiments with heat. In 1807 he derived the heat equation
∂u/∂t = ∂²u/∂x²
The computer had not yet been invented and so Fourier had to solve his equation – also known as diffusion equation or parabolic PDE – by hand. He separated the variables and decomposed the PDE into two ODE’s, which he settled by what now is called Fourier theory.
Exercise Show that u(x, t) = (1/√t)·e^{−x²/2t} solves the diffusion equation ∂u/∂t = (1/2)·∂²u/∂x². Recognize the density of the normal distribution. Heat is related to probability!
Time has moved on, it always does, and in our day and age we have computers to solve our PDE’s. Set the numerical grid at size ∆x, ∆t. Convert the function into a vector by evaluating it at the grid
u_i^j = u(i∆x, j∆t)
Discretize the PDE by expanding u into a Taylor series and truncating at the second order
(9.1)  (u_i^{j+1} − u_i^j)/∆t = (u_{i+1}^j − 2u_i^j + u_{i−1}^j)/∆x²
Numerical mathematicians have a very neat way to visualize equations like these by a stencil

The stencil is an outlay of a random walk. From grid point (i, j+1) you either move to (i−1, j) or (i, j) or (i+1, j). You can also see this if you rewrite the equation
u_i^{j+1} = (∆t/∆x²)·u_{i+1}^j + (1 − 2∆t/∆x²)·u_i^j + (∆t/∆x²)·u_{i−1}^j
Think of u_i^{j+1} as an expected value at time step j+1 that depends on the three earlier values u_{i−1}^j, u_i^j, u_{i+1}^j with transition probabilities ∆t/∆x², 1 − 2∆t/∆x², ∆t/∆x². Of course, these are probabilities only if
∆t/∆x² ≤ 1/2

Now remember your scientific computing: this is the condition for numerical stability of the FTCS scheme of the heat equation. You may also remember that there are other possible schemes, like Crank–Nicolson’s or BTCS. You can discretize a PDE in several ways. For instance, instead of the upwind in equation 9.1 we could have also chosen the downwind
(9.2)  (u_i^j − u_i^{j−1})/∆t = (u_{i+1}^j − 2u_i^j + u_{i−1}^j)/∆x²
To simplify the notation, write ν = ∆t/∆x² and rewrite
u_i^j = (1/(1+2ν))·u_i^{j−1} + (ν/(1+2ν))·u_{i−1}^j + (ν/(1+2ν))·u_{i+1}^j
which is a random walk – here I can speak freely without any reservations on the grid size – with stencil

Now that we know that a numerical solution of a PDE comes down to a random walk, we have to decide where it ends. Analysts say that the end of a random walk is a boundary condition. Fourier’s equation concerns the heat distribution over time of a metallic rod with sources of heat at the two end points. Its boundary conditions are: the initial temperature of the rod and the constant temperatures at the end points. Schematically, our numerical computation looks like this:

Computation of u_3^5 for the heat equation with grid ν = 1 through the BTCS scheme, i.e., a random walk which equiprobably moves Left, Right or Down. Note that ∆t = ∆x², so I should have drawn much smaller steps in the y-direction than in the x-direction. My random number generator said DRRLDLDRLLDLL and then it hit the boundary.
Let’s reflect on what we found. We started with Fourier’s equation, which we discretized so we could feed it to the machine. We ran into a connection between

u_i^j and random walks. In your scientific computing class, you will learn that the discretized solution converges to u if ∆x, ∆t → 0, provided that the scheme is numerically stable. Now where does the random walk converge to if the mesh of the grid goes to zero? In the limit, the grid becomes a two dimensional plane and the random walk becomes a crinkly wrinkly hairline crack known as a Brownian Motion. In 1949, the great Richard Feynman and Mark Kac found that the solution u(x, t) of the heat equation equals the expected value of a Brownian Motion that starts at (x, t) and ends at the boundary. A hundred and forty years have passed since the beginning of this section. I am getting ahead of myself. Let’s go back in time, assuming that this is possible.
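For completeness, here is the FTCS scheme of this section in a few lines, run on a rod with endpoints held at zero and a sin(πx) initial temperature; for this data the heat equation ∂u/∂t = ∂²u/∂x² has the exact solution e^{−π²t}·sin(πx), so we can check the error. The grid sizes are my choice, picked so that ν = ∆t/∆x² ≤ 1/2.

```python
import math

# FTCS: u_i^{j+1} = nu*u_{i+1}^j + (1 - 2*nu)*u_i^j + nu*u_{i-1}^j
# on the rod [0,1] with u(0,t) = u(1,t) = 0 and u(x,0) = sin(pi x).
dx = 0.02
dt = 0.0001
nu = dt / dx**2            # 0.25 <= 1/2, so the three weights are probabilities
assert nu <= 0.5

xs = [i * dx for i in range(51)]
u = [math.sin(math.pi * x) for x in xs]

t_end = 0.05
for _ in range(round(t_end / dt)):
    new = [0.0] * len(u)   # boundary values stay at zero
    for i in range(1, len(u) - 1):
        new[i] = nu * u[i + 1] + (1 - 2 * nu) * u[i] + nu * u[i - 1]
    u = new

exact = [math.exp(-math.pi**2 * t_end) * math.sin(math.pi * x) for x in xs]
max_err = max(abs(a - b) for a, b in zip(u, exact))
```

Violate the stability condition (say dt = 0.0003, so ν = 0.75) and the middle weight goes negative: the probabilistic reading breaks down and so does the computation.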

9.2. Time’s Arrow
No one really understands entropy. John von Neumann

According to the second law of thermodynamics, entropy never decreases in an isolated system. However, according to Poincaré’s recurrence theorem, if you wait long enough any isolated system will return to its original state.3 Or at least, arbitrarily close to it. There seems to be a contradiction here: if entropy increases once you leave your initial state, how can you ever go home again? This caused a heated debate on thermodynamics a hundred years ago. It was resolved in 1907 by Paul and Tatiana Ehrenfest, by means of a random walk.4

Imagine a flock of sheep, some are white (A) and others are black (B). The A sheep bleat Aaah and the B sheep bleat Beeh. If an A sheep bleats Aaah, it changes color and becomes B. If a B sheep bleats Beeh, it changes color and becomes A.

3If you wait long enough you will be young again.
4Dutch pride: Paul Ehrenfest worked in Leiden. He was a great physicist and a great teacher, who produced many great students.

Each time step, one and only one of the sheep is selected uniformly at random. It bleats, and changes color. If there is an excess of A sheep, then an Aaah is more likely, and the number of A’s goes down. If there is an excess of B’s, then Beehs are more likely. This is a random process that tends to regress to the middle. The population will split evenly into A’s and B’s and if these were human beings, they would divide their territory sharply and beat the crap out of each other. But these are sheep. And sheep are more civilized.
You need to remember your Markov chains here, they will also show up in your homework.5 There are finitely many sheep, say N. The current number a of white sheep is the current state of the system. Whatever happens next only depends on a and not on the previous time steps. That is why these bleating sheep form a Markov chain. It turns out that the chain is recurrent, which means that if you wait long enough then the consecutive states a_0, a_1, a_2, a_3, ... will show all numbers between 0 and N infinitely often. However, the numbers around the middle will show up far more frequently than the extremes. The chain is also reversible, which means that if I wait for some time and show you the numbers a_m, a_{m+1}, ..., a_{m+k} as well as a_{m+k}, a_{m+k−1}, ..., a_m, then you will not be able to make out which of the two runs forward and which runs backward in time. In thermodynamics one lines up the sheep to get a sequence of N labels ABBABABABB···. There are 2^N possible sequences and once the sheep start bleating they will move from one sequence to the next and in the end all sequences are visited equally often. The probability that the flock is in state a is equal to
(N choose a) · 2^{−N}
This is the equilibrium distribution of the bleating sheep Markov chain. If you start out from this distribution, then the probability distribution at the next time step remains the same (please check!).
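The bleating sheep fit in a few lines of simulation (an Ehrenfest urn; the parameters are mine). The time average of the visited states should reproduce the binomial equilibrium distribution:

```python
import random
from math import comb

random.seed(9)
N = 10                 # number of sheep
steps = 500_000

a = 0                  # start with no A sheep at all
counts = [0] * (N + 1)
for _ in range(steps):
    # pick a sheep uniformly: it is an A (and flips to B) with probability a/N
    if random.random() < a / N:
        a -= 1
    else:
        a += 1
    counts[a] += 1

empirical = [c / steps for c in counts]
binomial = [comb(N, k) / 2**N for k in range(N + 1)]
tv = 0.5 * sum(abs(e - b) for e, b in zip(empirical, binomial))   # small
```

Note the chain alternates parity at every step, so the one-step distributions never settle down; it is the long-run time average that matches the binomial law.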
The great Ludwig Boltzmann, the stern Karl Marx look-alike who just passed you, defined entropy as the logarithm of the probability. You can see it engraved on his tomb stone. The S is for entropy, the W is for probability and k is Boltzmann’s constant which links heat to temperature. Boltzmann proved the H theorem – and now H stands for entropy, this is all very confusing – which says that entropy increases over time. We will get to that later.
If you think a little bit then you will see that the bleating sheep correspond to a random walk, stepping from a to a ± 1 with a bias for the step towards the middle. We just learned that random walks and PDE’s are related. For the flock

5Take your first year probability book by Grimmett and Welsh. Inspect chapter 12. For those of you who took Advanced Probability (Voortgezette Kansrekening), refresh your memory and consult chapter 7 of Rosenthal’s book.
of bleating sheep there is a relation with the Fokker-Planck equation.6
∂u/∂t = ∂(ux)/∂x + ∂²u/∂x²
Now we have to go through a few computations, sorry. Discretizing centrally in space and forward in time
(u_i^{j+1} − u_i^j)/∆t = (u_{i+1}^j·(i+1)∆x − u_{i−1}^j·(i−1)∆x)/(2∆x) + (u_{i+1}^j − 2u_i^j + u_{i−1}^j)/∆x²
Rearranging the terms and writing ν = ∆t/∆x² as before
u_i^{j+1} = (ν − (i−1)∆t/2)·u_{i−1}^j + (1 − 2ν)·u_i^j + (ν + (i+1)∆t/2)·u_{i+1}^j
Now choose ν = 1/2 to make the middle term (1 − 2ν)·u_i^j disappear
u_i^{j+1} = (1/2 − (i−1)∆t/2)·u_{i−1}^j + (1/2 + (i+1)∆t/2)·u_{i+1}^j
This is a random walk. The sum of the numbers exceeds one by a bit, but that is because we are looking back at a random walk that is going forward in time. We now choose a time step ∆t = 1/M for some large number M.
u_i^{j+1} = ((M − (i−1))/2M)·u_{i−1}^j + ((M + (i+1))/2M)·u_{i+1}^j
Now I claim that these are a flock of 2M sheep in disguise. Let π_a^j be the probability that the state is equal to a at time step j. If the state is a, then at the previous time step, there were either a − 1 sheep and a B sheep bleated. This happens with probability (2M − (a−1))/2M. Or there were a + 1 sheep and an A sheep bleated, which happened with probability (a+1)/2M. In other words,
π_a^{j+1} = ((2M − (a−1))/2M)·π_{a−1}^j + ((a+1)/2M)·π_{a+1}^j
And now we see that if we put a = i + M then these two equations are the same. The index i measures the deviation between the state a and the middle M. The Fokker-Planck equation keeps track of the probability distribution of the sheep over time.
Returning to the Ehrenfests. They argued that this Markov chain is recurrent and that you cannot distinguish the arrow of time by observing the states. Boltzmann’s H-theorem concerns the evolution of π^j over time.
The entropy of a probability distribution π on {0,...,N} is defined by
−∑_{j=0}^{N} π(j) log(π(j))
It is possible to compute that this increases with j, but we will not go into that, and simply take the Ehrenfests’ word for it. The Fokker-Planck equation is like the diffusion equation, it evens out the initial differences of π^j as j increases. That is

6It is always nice to meet another Fokker. Delft pride: Adriaan Fokker studied mining engineering in Delft before getting his doctorate with Lorentz in Leiden. During WWII he concentrated on music to convince the nazis that he was not of any use and invented the Fokker organ. It is now being used in the Music Hall ’t IJ in Amsterdam, where it is kept in the Bam hall. Imagine that, playing the Fokker organ in the Bam hall.
why entropy increases and the arrow of time is visible in the probability distribution. However, the number of A sheep and B sheep will continue to zigzag forever. No arrow of time is visible here.
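A clean quantitative form of this H-theorem is that the relative entropy (Kullback-Leibler divergence) of π^j with respect to the binomial equilibrium never increases; that monotonicity holds for any Markov chain and its stationary distribution. The deterministic check below on the sheep chain is my own sketch, with my own parameters.

```python
from math import comb, log

N = 10

# One step of the chain on distributions:
# from state a, go down with probability a/N, up with probability (N-a)/N.
def step(pi):
    new = [0.0] * (N + 1)
    for a, p in enumerate(pi):
        if p == 0.0:
            continue
        if a > 0:
            new[a - 1] += p * a / N
        if a < N:
            new[a + 1] += p * (N - a) / N
    return new

equilibrium = [comb(N, a) / 2**N for a in range(N + 1)]

def rel_entropy(pi):
    # Kullback-Leibler divergence D(pi || equilibrium)
    return sum(p * log(p / q) for p, q in zip(pi, equilibrium) if p > 0)

pi = [0.0] * (N + 1)
pi[0] = 1.0                    # all sheep start black
divergences = []
for _ in range(40):
    divergences.append(rel_entropy(pi))
    pi = step(pi)

# H-theorem in relative-entropy form: the divergence never increases.
monotone = all(d2 <= d1 + 1e-12 for d1, d2 in zip(divergences, divergences[1:]))
```

Plain Shannon entropy need not grow monotonically for this chain because of the parity flip, which is why the relative-entropy formulation is the safer one.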

9.3. Measuring motion on the atomic scale
In the year that Fourier became governor of Isère, Matthew Flinders embarked on a scientific expedition to Australia. It was still called Van Diemen’s Land at that time and claimed by the Dutch, but France and England were at war and the English had temporarily taken control of the Dutch colonies.7 Van Diemen’s Land is not a very catchy name. Flinders came up with Australia and the world thanks him for that. One of the members of Flinders’ expedition was a young botanist called Robert Brown. He managed to find thousands of new plants in Australia and spent years cataloguing them all. Eventually, he became the head of the botany department at the British Museum. A most splendid institution that must have had very good microscopes. In 1827 Brown discovered the Brownian motion and in 1831 he discovered the cell’s nucleus. You can argue that the three fundamental wisdoms of science today are that life is made from cells, matter is made from atoms, and forces are made from elementary particles. Brown’s name is connected with two of those three wisdoms.
In 1905 Albert Einstein derived a diffusion equation that describes Brownian motion. Let B_t be the position of a particle on the real line at time t. Einstein considers a time step τ after which the particle moves to a new position B_{t+τ} with an increment dB = B_{t+τ} − B_t that does not depend on t and has probability density φ(∆). If f(x, t) is the probability density of B_t, then B_{t+τ} = B_t + dB has as its probability density the convolution product

f(x, t+τ) = ∫ f(x+∆, t) φ(−∆) d∆

Einstein writes φ(∆) instead of φ(−∆) since the increment has a symmetry: dB and −dB are identically distributed. Expanding into Taylor series and ignoring terms of higher order, Einstein replaces the left hand side of the equation by
f(x, t) + (∂f/∂t)(x, t) · τ

7If – God forbid – you ever have to choose between colonizers, never go Dutch. During their brief time in Indonesia, the Brits tried to improve the harsh labour conditions of the Indonesian farmers and uncovered the Borobudur-Prambanan temple complex. When the Dutch returned, they were delighted with this archeological find. They removed the statues and put them in their gardens. Fortunately, whatever remains still is very impressive.
and the right hand side by
∫ ( f(x, t) + (∂f/∂x)(x, t)·∆ + (1/2)(∂²f/∂x²)(x, t)·∆² ) φ(∆) d∆
This is a sum of three integrals
∫ f(x, t)φ(∆)d∆ + ∫ (∂f/∂x)(x, t)·∆φ(∆)d∆ + ∫ (1/2)(∂²f/∂x²)(x, t)·∆²φ(∆)d∆
or equivalently
f(x, t)·∫ φ(∆)d∆ + (∂f/∂x)(x, t)·∫ ∆φ(∆)d∆ + (1/2)(∂²f/∂x²)(x, t)·∫ ∆²φ(∆)d∆
The first of these three integrals is the integral over the probability density, so equal to one. The second is the expectation of dB, which is zero since the particle just zigzags. The third integral is its variance, which is equal to some positive constant Kτ. Note that the variance of dB depends linearly on τ. If we had selected a half time step τ/2 then we would have had a half increment dB′ such that dB = dB′ + dB″, which you should read as a sum of two iid random variables. Equating the left hand side and the right hand side gives the diffusion equation
(∂f/∂t)(x, t) = (Kτ/2τ)·(∂²f/∂x²)(x, t)
The constant K/2 is a diffusion coefficient D, and Einstein concludes that

This equation has label (1) but it is one of the final equations in Einstein’s paper. Before getting to (1), Einstein derived an expression for the diffusion coefficient D. Einstein then concludes the paper by devising an experiment to determine the atom’s size from Brownian motion.
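Einstein's argument above is easy to test by simulation. A minimal Python sketch (not from the notes; the step density φ is taken Gaussian for convenience, and the constants K and τ are arbitrary choices): a particle that makes many independent zero-mean steps of variance Kτ ends up, at time t, with variance Kt, linear in time, exactly as the derivation predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 0.5        # variance of the increment per unit time, so D = K/2
tau = 0.01     # time step
steps = 1000   # total time t = steps * tau = 10
walkers = 20_000

# independent increments dB with mean zero and variance K*tau
dB = rng.normal(0.0, np.sqrt(K * tau), size=(walkers, steps))
B = dB.sum(axis=1)   # positions B_t at time t

t = steps * tau
print(B.mean())      # close to 0: the particle just zigzags
print(B.var())       # close to K * t = 5: the variance grows linearly in t
```

Halving τ while doubling the number of steps leaves the variance at time t unchanged, which is the dB = dB′ + dB″ remark in the text.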

All the quantities on the right could be measured in one way or another. It was now left to the experimenters to determine the diffusion coefficient of the Brownian motion, to see if Einstein's expression worked out. The final line of the paper reads

The experimenters were as fast as Einstein wished. His findings were verified by Jean Perrin in 1909, who used several methods to determine the constant N that is in Einstein's expression of D. Perrin proposed to call it Avogadro's constant, and the name stuck. You must have heard about it in school many years ago. Einstein's description of the Brownian Motion gives the probability distribution of the particle's location. He did not describe the individual zigzag motion of the Brownian paths. In terms of sheep, Einstein determined the probability distribution of A's and B's over time, but not the fluctuation of the A's and B's in a single flock. That all had to wait for almost two decades, when a brilliant young mathematician entered the scene.

1. In this exercise you explore the connection between Markov chains and martingales. We consider an irreducible Markov chain {X_n, n ≥ 0} on the finite state space Ω = {0, 1, ..., N} with transition matrix π, i.e., π_ij = P(X_1 = j | X_0 = i). For a function f: Ω → R we define Pf by Pf(i) = Σ_{j=0}^N π_ij f(j). We call this operator the transition operator. Another way of writing it is

Pf(i) = E(f(X_1) | X_0 = i)

a) Show that for any function f,

Z_n = f(X_n) − Σ_{i=0}^{n−1} (P − I)f(X_i)

is a martingale. Here I denotes the identity, i.e., I(f) = f.

b) Show that if Pf = f, then Z_n = f(X_n) is a martingale. Such a function is called harmonic.
c) A function is called subharmonic if f ≤ Pf and superharmonic if f ≥ Pf. Show that for a sub(super)harmonic function, Z_n = f(X_n) is a sub(super)martingale.
d) Use the martingale convergence theorem to show that every harmonic function f: Ω → R is constant. Hint: because the state space is finite and because f is harmonic, Z_n = f(X_n) is a bounded martingale, hence converges; use then that the Markov chain is recurrent.
e) Consider now an irreducible and recurrent Markov chain on a countable state space such as {0, 1, 2, ...}. Let f be a non-negative subharmonic function. Use Kolmogorov's maximal inequality to obtain the bound

P(max_{1≤i≤n} f(X_i) ≥ λ) ≤ E f(X_n) / λ

f) Show that a bounded function f: Z → R such that f(x) = ½(f(x−1) + f(x+1)) for all x ∈ Z is constant. Hint: find a recurrent Markov chain such that f(X_n) is a martingale.
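Part d) can be checked numerically: on a finite irreducible chain, the harmonic functions are exactly the right eigenvectors of the transition matrix for the eigenvalue 1, and they come out constant. A Python sketch (the three-state transition matrix below is an arbitrary choice, not part of the exercise):

```python
import numpy as np

# an arbitrary irreducible transition matrix on the states {0, 1, 2}
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.5, 0.0]])

# harmonic functions satisfy P f = f, i.e. f is a right eigenvector
# of P for the eigenvalue 1
vals, vecs = np.linalg.eig(P)
harmonic = vecs[:, np.isclose(vals, 1.0)].real
f = harmonic[:, 0]

# every harmonic f is constant: all entries of the eigenvector agree
print(np.allclose(f, f[0]))   # True
```

For part a) one can also verify the martingale property directly: E[Z_{n+1} − Z_n | X_n = i] = Pf(i) − f(i) − (P − I)f(i) = 0 for any f.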

CHAPTER 10

Get Real

We now define Brownian Motion mathematically. The material in this lecture corresponds to chapter 6 of B-Z. You can also consult the scribenotes folder on Blackboard.

Order, unity, and continuity are human inventions
Bertrand Russell

So far, we have studied stochastic processes X_0, X_1, X_2, ... in discrete time. We now move on to stochastic processes X_t with a time parameter t ∈ [0, ∞). All definitions that we considered earlier extend to this setting. A process X_t now comes along with a filtration F_t of σ-algebras, such that F_s ⊂ F_t if s ≤ t. The standard filtration has σ-algebras F_t = σ{X_s : s ≤ t} that are generated by all X_s up to time t. If we select integer times t ∈ N, then we are back at the processes that we considered so far.

A stochastic process is a martingale if E[X_t | F_s] = X_s. Think of it in a cat-out-of-the-bag way.1 Let the American president shake up the probability space and grab an ω at random. Along with the president's ω comes a function f(t) = X_t(ω), which is called a path of the process. So, our cat is a function and our bag is a function space. Think of stochastic processes in real time as random functions. Let the function come out of the bag slowly by tracing it in time. Picture this as a sliding window of market updates on an asset price.

1A little cat is called a pussy.


10.1. A fishy frog

Remember the renewal process from lecture 8. We started with a random walk X_1, X_1 + X_2, X_1 + X_2 + X_3, ... with non-negative increments and we defined

N(t) = sup{n : T_n ≤ t}

The number N(t) depends on ω. It is a random variable N(t): Ω → {0, 1, 2, ...}. It is perhaps more appropriate to write N_t instead of N(t), and there we have it, our first stochastic process in real time. Its paths are jump functions and that is why it is called a jump process. Think of a jumping frog on the real line. N_t counts the number of jumps up to t. You must have already met the most common jump process – the Poisson process – because it is taught to all TU Delft students in their first probability course.2 Let's refresh our memory and consider this process first. Our frog is Poisson. It is a fishy frog.

Our random walk X_1 + ... + X_n takes exponential steps: P(X > t) = e^{−λt} for some parameter λ. Remember that the exponential random variable is memoryless:

P(X > t | X > s) = P(X > t − s)

if t > s. In case you forgot this, I am sure that you can check that it is true. Here is a way to think of it. You are standing at s and your frog has jumped from 0. You see that it is still up in the air. You can either let it fly, or you can intercept it, and make it jump again from s. Either way, the probability that its jump exceeds t remains the same. Intercepting the frog does not change the distribution of its landing point.
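The memoryless identity is easy to check by Monte Carlo; a quick Python sketch (the values of λ, s, t are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, s, t = 2.0, 0.3, 0.8

X = rng.exponential(1 / lam, size=1_000_000)

# P(X > t | X > s) versus P(X > t - s): both equal exp(-lam*(t-s))
lhs = (X > t).sum() / (X > s).sum()
rhs = (X > t - s).mean()
print(lhs, rhs)   # both close to exp(-1) ≈ 0.3679
```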

Lemma 10.1. P(N_t = 0 | N_s = 0) = P(N_{t−s} = 0) for t > s.

Proof. This is just another way to phrase memorylessness: N_s = 0 corresponds to X_1 > s, N_t = 0 corresponds to X_1 > t, and N_{t−s} = 0 corresponds to X_1 > t − s. □

Think of this lemma as a Formula 1 pit stop: at s we haul the frog in, change its tires, and set it back on its flight. It does not change the distribution of its landing point. It turns out that much more is true.

Lemma 10.2. P(N_t = k | N_s = j) = P(N_{t−s} = k − j) for t > s.

2Consult chapter 11 of Grimmett and Welsh – or chapter 12 of Dekking et al.

We will prove this in a minute. Observe that it implies that the increment N_t − N_s is independent of N_s and is equal to N_{t−s} in distribution. This already indicates a martingale-like property: the increment between s and t does not depend on the past, only on the length t − s. Its expected value is not zero, however; the process has a drift.

Let's return to our frog. The exact location s of our pit stop does not play a role, so we might as well pick a random location.

Lemma 10.3. Let X, Y be independent and exponentially distributed with parameter λ. Let S be a non-negative independent random variable. Define

X̃(ω) = X(ω) if X(ω) ≤ S(ω), and X̃(ω) = S(ω) + Y(ω) if X(ω) > S(ω).

Then X̃ is also exponentially distributed with parameter λ.

Proof. This is an exercise in joint distributions. It is one of those statements that is immediately obvious if you think about it, but gets muddled when you write down the equations. So here we go. We need to prove that P(X̃ > t) = e^{−λt}. By our definition,

P(X̃ > t) = P(X > t, X ≤ S) + P(X > S, S > t) + P(X > S, S ≤ t, S + Y > t)

The first two of these three probabilities combine nicely:

P(X > t, X ≤ S) + P(X > S, S > t) = P(X > t, S > t)

Let f(s) be the density of S; then the third probability is equal to

∫_0^t ∫_s^∞ ∫_{t−s}^∞ f(s) λ² e^{−λx−λy} dy dx ds

which works out to

∫_0^t f(s) e^{−λt} ds = e^{−λt} ∫_0^t f(s) ds = P(X > t, S ≤ t)

Hence, the three probabilities add up to P(X > t), which is what we had to prove. Needless to say, I had to write down my integrals ten times before I got it right. □

Proof of Lemma 10.2. We need to prove that P(N_s = j, N_t = k) = P(N_s = j) P(N_{t−s} = k − j). We apply the previous lemma with S = max{s − X_1 − ⋯ − X_j, 0}. It is independent of X_{j+1} since all jumps are independent. By the previous lemma, we can substitute X̃_{j+1} for X_{j+1} without changing the distribution of the process. The event N_t = k corresponds to

X_1 + ⋯ + X_j + X_{j+1} + ⋯ + X_k ≤ t < X_1 + ⋯ + X_j + X_{j+1} + ⋯ + X_k + X_{k+1}

Our substitution replaces it by

X_1 + ⋯ + X_j + X̃_{j+1} + ⋯ + X_k ≤ t < X_1 + ⋯ + X_j + X̃_{j+1} + ⋯ + X_k + X_{k+1}

Intersecting with N_s = j we get that s − X_1 − ⋯ − X_j ≤ X_{j+1}, so we have that X̃_{j+1} = s − X_1 − ⋯ − X_j + Y for an independent Y. And so {N_t = k, N_s = j} is replaced by

s + Y + X_{j+2} + ⋯ + X_k ≤ t < s + Y + X_{j+2} + ⋯ + X_k + X_{k+1}  and  N_s = j

These two events are independent and since Y is exponentially distributed, the probability of this intersection is equal to P(N_{t−s} = k − j) P(N_s = j). We managed to uncouple the future from the past by inserting a pit stop at s. It did not change the frog's jumping distribution, it only changed its tires. □

The proof of this lemma gives us more than what we stated. We uncoupled the future and the past. The value of the increment N_t − N_s is independent of any N_r for r ≤ s. So, in fact we proved that

E[N_t | F_s] = N_s + E[N_{t−s}]

and we have found our first martingale in real time:

Theorem 10.4. N_t − E[N_t] is a martingale.

Proof. That is just another way to write the equality, because E[N_{t−s}] = E[N_t] − E[N_s]. □
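A quick simulation of the jump process, built from exponential gaps exactly as in the text, shows what the compensation subtracts off (a Python sketch rather than matlab; the intensity and horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t = 3.0, 2.0
runs = 200_000

# N_t = number of partial sums X_1 + ... + X_n that fall below t;
# 30 exponential gaps are plenty for lam * t = 6 expected jumps
gaps = rng.exponential(1 / lam, size=(runs, 30))
arrivals = gaps.cumsum(axis=1)
N_t = (arrivals <= t).sum(axis=1)

print(N_t.mean())   # close to lam * t = 6, so N_t - E[N_t] hovers around 0
print(N_t.var())    # also close to lam * t
```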

Finally, we compute the distribution of N_t.

Theorem 10.5. N_t is Poisson distributed with parameter λt.

Proof. We need to prove that P(N_t = k) = (λt)^k/k! · e^{−λt}. We use induction on k. The statement is true for k = 0 and all t since P(N_t = 0) = P(X_1 > t). Now assume that the statement is true for all integers up to k − 1 and for all t. We need to prove it for k. In fact, it is more convenient to prove the equivalent statement

P(N_t ≤ k) = Σ_{j=0}^k (λt)^j/j! · e^{−λt}

It is equivalent to

P(X_1 + ⋯ + X_{k+1} > t) = Σ_{j=0}^k (λt)^j/j! · e^{−λt}

If we write S_k = X_1 + ⋯ + X_k then we get

P(S_k + X_{k+1} > t) = Σ_{j=0}^k (λt)^j/j! · e^{−λt}

The random variables S_k and X_{k+1} are independent. If f(s) is the density of S_k, then the joint density is f(s, x) = f(s) λe^{−λx}. We can now compute P(S_k + X_{k+1} > t) by a good old fashioned integral. In fact, it is a little easier to compute the complementary probability

P(S_k + X_{k+1} ≤ t) = ∫_0^t ∫_0^{t−s} f(s) λe^{−λx} dx ds
= ∫_0^t f(s) (1 − e^{−λ(t−s)}) ds
= ∫_0^t (1 − e^{−λ(t−s)}) dF(s)

where F(s) is the probability distribution function of S_k. By partial integration, this integral reduces to

∫_0^t F(s) λe^{−λ(t−s)} ds

By definition F(s) = P(S_k ≤ s) = P(N_s ≥ k) = 1 − P(N_s ≤ k − 1). By our inductive assumption, this is equal to 1 − Σ_{j=0}^{k−1} (λs)^j/j! · e^{−λs}. Plug this into the integral:

∫_0^t λe^{−λ(t−s)} ds − Σ_{j=0}^{k−1} ∫_0^t (λs)^j/j! · λe^{−λt} ds

which is equal to

1 − e^{−λt} − e^{−λt} Σ_{j=1}^k (λt)^j/j!

We computed the complementary probability, and so we find that P(N_t ≤ k) = Σ_{j=0}^k (λt)^j/j! · e^{−λt}, as required. □

10.2. Get on up, get on the scene

Norbert Wiener arrived in Cambridge in 1913. He was only 19 years old at the time but had already finished his PhD in logic and he was going to visit Bertrand Russell, who was the greatest logician of the day. They were opposites in many ways. Wiener was a son of immigrants. His parents had moved from Poland to the USA and worked their way up. Russell was the 3rd Earl Russell, a descendant of the highest aristocracy, and a true libertarian who believed that only complete freedom could make people happy. Wiener, on the other hand, believed in order and rules and regulations. The two men did not interact very much and when the war broke out in 1914, Russell, a militant pacifist, campaigned for peace and did not have much time for young Norbert. He did however tell him to work on a new thing that the physicists had come up with, called Brownian Motion. In Russell's opinion, this theory required a solid mathematical foundation. You do not need logic to study Brownian Motion, you need analysis. In 1914, Cambridge housed the two greatest analysts of the day: Hardy and Littlewood. Norbert Wiener took classes with G.H. Hardy, a brilliant lecturer, and quickly caught up on harmonic analysis, the branch of mathematics that grew out of Fourier's solution of the heat equation. Wiener only turned to Brownian Motion after he left Cambridge and had returned to MIT.3 In 1923, he rigorously defined BM and proved that it exists, using the same tools that Fourier used to solve his PDE.

Definition 10.6. A stochastic process W_t with 0 ≤ t < ∞ is called Brownian Motion or a Wiener process if

(1) W_0 = 0 almost surely
(2) For all 0 = t_0 < t_1 < ⋯ < t_m the increments
W_{t_1} − W_{t_0}, W_{t_2} − W_{t_1}, ..., W_{t_m} − W_{t_{m−1}}
are independent
(3) For each s < t the increment is normally distributed with
E[W_t − W_s] = 0,  Var(W_t − W_s) = t − s
(4) Sample paths t ↦ W_t(ω) are almost surely continuous

3which also happens to be in Cambridge

Now we have defined it, we need to prove that it exists. Wiener's original proof uses a random Fourier expansion. Every continuous function f: [0, 2π] → R admits a Fourier expansion f(t) = c_0 + Σ_{k=1}^∞ (a_k sin(kt) + b_k cos(kt)) and Wiener's idea was that a random continuous function requires random Fourier coefficients a_k and b_k. We will leave that idea for later. The proof of Wiener's theorem below is due to Lévy.

It is very easy to generate BM with your computer. I used a small matlab program to illustrate that:

randn('state',100)
clf
%%%%%%%%% Problem parameters %%%%%%%%%%%
L = 1e4; T = 1; dt = T/L; M = 10;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
tvals = [0:dt:T];
Wvals = cumsum(sqrt(dt)*randn(M,L),2);
Wvals = [zeros(M,1) Wvals];
plot(tvals,Wvals)
title('10 Brownian paths')
xlabel('t'), ylabel('W(t)')

The matlab code does not generate continuous paths of course. It just generates a random walk with normal increments and interpolates linearly in between. According to Lévy's proof, this converges to BM as ∆t → 0. Before we turn to that proof, you have to remember some results in probability. First of all, you need to be aware of the Gaussian miracle: if X, Y are independent normal random variables, then aX + bY is also normal for any constants a, b. Furthermore, aX + bY and cX + dY are independent if and only if their covariance is zero. You also need to be aware of the Azuma-like bound on the tail probability of a standard normal random variable.

Lemma 10.7. If X is standard normal, and a > 0, then

P(X > a) ≤ 1/√(2πa²) · e^{−a²/2}

Proof.

P(X > a) = ∫_a^∞ 1/√(2π) · e^{−x²/2} dx ≤ ∫_a^∞ x/a · 1/√(2π) · e^{−x²/2} dx

and we have our result. □

If we reduce the time step dt in our program by a factor two, replacing L=1e4 by L=2e4, then we effectively insert an additional grid point between each consecutive pair of grid points. If we keep the random numbers that we already generated for the grid 0, dt, 2dt, ... then how do we insert the additional numbers in between? The following lemma shows that inserting normal random variables is easy.

Lemma 10.8. Suppose that X and Y are standard normal. Then (X + Y)/2 and (X − Y)/2 are independent normal random variables with mean zero and variance 1/2.

Proof. The covariance of (X + Y)/2 and (X − Y)/2 is zero, so they are independent by the Gaussian miracle. □

Now read the lemma as follows. Let X be the increment between consecutive grid points. Half way between grid points, the increment is X/2. If we modify this half way point and add Y/2, then we get a piecewise linear function, with increment (X + Y)/2 on the first half and (X − Y)/2 on the second half between the two original grid points. The increments on the refined grid remain normal. Of course, we need to scale this with the mesh of the grid. Converted to matlab code, Lévy's proof says that the following code produces BM as dt → 0:

randn('state',100)
clf
%%%%%%%%% Problem parameters %%%%%%%%%%%
T = 1; Start = 0; End = randn; N = 1;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Wvals = [Start End];
for L=1:N
  dt=T/2^(L-1);
  Halfway=(Wvals(1,1:2^(L-1))+[Wvals(1,2:2^(L-1)) End])/2;
  Insert=Halfway+sqrt(dt)/2*randn(1,2^(L-1));
  Wvals=reshape([Wvals(1,1:2^(L-1));Insert],[1,2^L]);
end
Wvals = [Wvals End];
dt=T/2^L;
tvals=[0:dt:1];
plot(tvals,Wvals)
title('Inserted Brownian path')
xlabel('t'), ylabel('W(t)')

You can copy and paste this code in matlab and run it, to check that it works.
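The tail bound of Lemma 10.7 can be compared against the exact Gaussian tail, which the Python standard library exposes through the complementary error function (a quick check; the grid of a-values is an arbitrary choice):

```python
import math

def tail(a):
    # exact P(X > a) for a standard normal X
    return 0.5 * math.erfc(a / math.sqrt(2))

def bound(a):
    # the bound of Lemma 10.7
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi * a * a)

for a in [1.0, 2.0, 3.0]:
    print(a, tail(a), bound(a))   # the bound always sits above the tail
```

Already at a = 2 the two numbers agree to within about twenty percent, which is all the Borel–Cantelli argument below needs.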

Theorem 10.9 (Wiener, 1923). Brownian Motion exists on the unit time interval t ∈ [0, 1].

Proof. Brownian Motion W_t(ω) involves two variables: time t and sample point ω. You can think of BM as a function of two variables f(t, ω) = W_t(ω) and that is how we construct it. We take the numerical mathematician's approach, we discretize time and construct BM on the dyadic grid

D_n = {k/2^n : 0 ≤ k ≤ 2^n}

Let Z_0 be standard normal. Starting with the smallest grid D_0, we define

f_0(ω, t) = 0 for t = 0, Z_0(ω) for t = 1, linear in between.

The grid D_1 has one additional point in 1/2 and we insert a value there. Let Z_1 be a standard normal random variable that is independent of Z_0. Now define

g_1(ω, t) = Z_1(ω)/2 for t = 1/2, 0 for t = 0, 1, linear in between

and if we take f_1 = f_0 + g_1, then the interpolation lemma above shows that the random variables ω ↦ f_1(ω, 1/2) and ω ↦ f_1(ω, 1) − f_1(ω, 1/2) are independent normal random variables with variance 1/2. The first few steps are indicated in the figure above for f_1, f_2, f_3, to illustrate how we proceed in general. If we move on to D_n then we need to insert the grid points k/2^n for odd k. There are 2^{n−1} such points and we use independent standard normals Z_j for j = 2^{n−1}, 2^{n−1} + 1, ..., 2^n − 1. Now define

g_n(ω, t) = 2^{−(n+1)/2} Z_j(ω) for the odd k grid points t = k/2^n, 0 for the even k grid points t = k/2^n, linear in between

and take f_n = f_{n−1} + g_n = f_0 + g_1 + g_2 + ... + g_n. By the Gaussian miracle, for all fixed t, the random variable ω ↦ f_n(t, ω) is normal with mean zero. Once f_n is defined on the grid points of D_n, its values remain the same for all f_m with m > n. I leave it to you to check that ω ↦ f_n(t, ω) has mean zero and variance t if t ∈ D_n. Moreover, if k/2^n, (k+1)/2^n, (k+2)/2^n are three consecutive grid points, then the two increments between the first and the second, and the second and the third grid point, are independent. I leave it to you to check that too.

Let's fix ω and look at a path t ↦ f_n(t, ω). It is a piecewise linear function. Let's simply write it as f_n and keep in mind that ω is fixed. We need to prove that f_n converges to a continuous function f as n goes to infinity. For this, we need to prove that f_n converges uniformly. We defined f_n = f_0 + g_1 + ⋯ + g_n. If we can prove that eventually |g_n| ≤ 2√n/2^{(n+1)/2}, then we are done because Σ 2√n/2^{(n+1)/2} < ∞. So, we need to prove that eventually |Z_j| ≤ 2√n. This is a Borel–Cantelli statement: if the probabilities are summable, then only finitely many events will happen. We need to prove that

Σ_j P(|Z_j| > 2√n) < ∞

You need to remember that the variables Z_j for g_n have index j = 2^{n−1}, ..., 2^n − 1. This implies that log₂(j) < n, so it suffices to prove that

Σ_{j≥2} P(|Z_j| > 2√(log₂ j)) < ∞

Now we need the estimate on the tail probability from Lemma 10.7:

Σ_{j≥2} P(|Z_j| > 2√(log₂ j)) ≤ Σ_{j≥2} 2/√(2π · 4 log₂ j) · e^{−2 log₂ j}

and I leave it to you to check that this series converges. You need to observe that e^{−2 log₂ j} = (1/j)^x for a real number x > 1; indeed x = 2/ln 2.

Now we know that f(ω, t) = lim_{n→∞} f_n(ω, t) is well defined for almost every ω. We also know that the random variables W_t(ω) = f(ω, t) are normal with mean zero and variance t for dyadic grid points t = k/2^n. Even more so, the increments between dyadic grid points s, t are independent of mean zero and variance t − s. To finish the proof, we need to show that this also applies to non-grid points. Any non-grid point t is a limit of an increasing sequence of grid points s_0, s_1, s_2, ....

The increments dW_n = W_{s_n} − W_{s_{n−1}} are independent normal random variables and the sums

W_{s_n} = W_{s_0} + dW_1 + ... + dW_n

form an L²-bounded random walk. By the martingale convergence theorem, we know that the limit W_t = lim_{n→∞} W_{s_n} exists in probability, almost surely, and in the mean. To see that W_t is normal with mean zero and variance t, we need to verify that F(x) = P(W_t < x) is a normal distribution. As always, this involves a swap. First, rewrite P(W_t < x) as an expectation E[1_{W_t<x}]. The indicators 1_{W_{s_n}<x} converge to 1_{W_t<x} almost surely (apart from the null event W_t = x), so by dominated convergence we may swap limit and expectation:

P(W_t < x) = E[1_{W_t<x}] = lim_{n→∞} E[1_{W_{s_n}<x}] = lim_{n→∞} P(W_{s_n} < x)

So our distribution function F(x) for W_t is a limit of normal distribution functions and it all works out. Finally, to deal with independence between arbitrary increments W_t − W_s we need to write these as limits of increments between dyadic s_n, t_n and observe that if sequences of random variables U_n and V_n are independent, then so are their limits. I left a lot for you to verify already, so I hope you don't mind that I leave this final statement to you as well. □
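The midpoint refinement in the proof is the same construction as the matlab snippet above; here is a Python sketch of it (the number of levels is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)

def levy_bm(levels):
    # Brownian motion on the dyadic grid D_levels of [0, 1]
    W = np.array([0.0, rng.normal()])         # f_0: values at t = 0 and t = 1
    for n in range(1, levels + 1):
        dt = 2.0 ** -(n - 1)                  # spacing of the current grid
        mid = (W[:-1] + W[1:]) / 2            # linear interpolation at midpoints
        mid += (np.sqrt(dt) / 2) * rng.normal(size=mid.size)
        # interleave the old grid values with the refined midpoints
        out = np.empty(W.size + mid.size)
        out[0::2], out[1::2] = W, mid
        W = out
    return W

W = levy_bm(10)   # values on the grid D_10, i.e. 2**10 + 1 points
print(W.size)     # 1025
print(W[0])       # 0.0
```

Each refinement halves the mesh and adds an independent centred normal of standard deviation √dt/2 at every midpoint, exactly as Lemma 10.8 prescribes, so the increments on the final grid are iid normal with variance equal to the mesh.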

Exercise. Let X_n and Y_n be sequences of random variables on a probability space with almost sure limits X = lim_{n→∞} X_n and Y = lim_{n→∞} Y_n. Suppose that X_n and Y_n are independent for each n, i.e., P(X_n < a, Y_n < b) = P(X_n < a) P(Y_n < b) for all a, b. Apply limits, swap, and prove that X and Y are independent.

Non-graded Homework on the Poisson process and its martingales

Let {N_t, t ≥ 0} denote the Poisson process with a given intensity λ > 0. You are allowed to use that

a) The distribution of Nt is Poisson with parameter λt:

P(N_t = n) = e^{−λt} (λt)^n / n!

b) The process {N_t, t ≥ 0} has stationary and independent increments, i.e., for 0 < t_1 < t_2 < ... < t_n the increments N_{t_i} − N_{t_{i−1}} are mutually independent (for different i), and N_t − N_s has the same distribution as N_{t−s} for all 0 < s < t.

We denote by F_t the σ-algebra generated by N_s : 0 ≤ s ≤ t.

1. Compute the following (conditional) expectations

a) For s < t: E(N_t | F_s)
b) For s < t: E((N_t − λt)² | F_s)
c) For s < t: E(e^{aN_t} | F_s)

2. Show that for all bounded continuous functions ϕ: R → R we have

E(ϕ(N_t) | F_s) = E(ϕ(N_t) | N_s)

This is called the "Markov" property of the Poisson process: conditioning on the whole past between 0 and s is the same as conditioning on the last point of the history, i.e., N_s; the further past plays no role.

3. Show that N_t − λt and (N_t − λt)² − λt are F_t-martingales.

4. We call τ a stopping time if for all t ≥ 0 the event {τ ≤ t} is in F_t. You are allowed to use that if M_t is a martingale and τ is a stopping time, then the stopped martingale M_{τ∧t} is also a martingale. Next, under conditions such as dominated convergence we can conclude from this fact that

E(M_τ) = E(lim_{t→∞} M_{τ∧t}) = lim_{t→∞} E(M_{τ∧t}) = E(M_0)

Show that the hitting time

τ = inf{t ≥ 0 : N_t = k}

is a stopping time, and using the martingales from item 3, compute E(τ) and E(τ²).

5. By stopping the martingale e^{aN_t − (e^a − 1)λt}, compute

E(e^{−uτ})

for u > 0 and conclude that τ is indeed the sum of k independent exponentials with parameter λ.
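Item 5 identifies τ as a sum of k independent exponentials, so the moments asked for in item 4 can be sanity-checked by simulating that sum (a Python sketch; k and λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, k = 2.0, 5
runs = 400_000

# the hitting time tau of level k is the k-th arrival time, i.e. the
# sum of k independent exponential inter-arrival times
tau = rng.exponential(1 / lam, size=(runs, k)).sum(axis=1)

print(tau.mean())          # E(tau) = k / lam = 2.5
print((tau ** 2).mean())   # E(tau^2) = k * (k + 1) / lam**2 = 7.5
```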


CHAPTER 11

Meet the real Martingales

This material is also covered in B-Z section 6.3, which – as always – has many worked out exercises.

If you’re walking down the right path and you’re willing to keep walking, eventually you’ll make progress. Barack Obama My wife is a modern woman and very much up to date. I took her to my casino once and she thought it was boring, so she showed me this much cooler online casino on the internet. Here you do not have to wait for the roulette wheel to spin and come to a standstill. You can place your bets all the time any time, any second of the day, 24/7. Of course, the casino that my wife showed me is called the stock exchange. My wife is very busy. She has all these IT projects that she has to oversee and they require a lot of time. Whenever she does not want to be disturbed by me complaining about my job, she shuts me up by saying something like: Oh you think you are so clever, now go construct a differentiable function that is nowhere monotonic. Which usually works, because I first have to figure out what she means and then I have to figure out how to solve it, by which time my wife has finished two projects in three different time zones and cooked dinner and done all the laundry. And then she’ll tell me my solution is wrong.


I did try to convince my wife once that my old time casino involves some nice mathematical paradoxes like Smartypants' stopping rule which makes you wait forever. However, waiting forever is not in my wife's dictionary. She prefers finite time windows. So here's another paradox. Imagine a casino where you can bet on Brownian Motion. It starts at zero, so you can participate for free and by time t you have made yourself W_t. That is a fair but risky business, because W_t can be negative. Everybody knows that gambling is not a good way to make money, but still there are many gamblers. Now, pay strict attention to what I say, because I choose my words carefully and I never repeat myself: instead of making money from gambling, you can make money from gamblers. You stand at the door of the casino and you shout:

Gambling insurance for sale! Insurance for sale! Gamble up to time T . If your loss exceeds 1 euro, then I will take care of your additional loss.

This is a good product. It will sell like hotcakes because gamblers hate losing. Now, how much are you going to charge for it? Ten cents? You think ten cents is a decent price? I think so too! Let's partner up, charge the poor buggers ten cents and make ourselves a fortune! Once you sold an insurance to a customer, I will accompany that person inside. That is why you need a partner, I am your inside man. I sit at his table and as soon as his W_t hits minus one, I bet against him. That means, I bet on −W_t, which is allowed in modern casinos. You can place whatever stakes you like, even negative stakes.1 It will cost me one euro, because that is the value of −W_t when I step in. As long as W_t remains below −1, I hold on to my bet, but the moment it hits −1 again, I let go of −W_t and regain my one euro. Buy low, sell low, that is my motto. I keep stepping in and out of −W_t until the big moment T arrives. Two things may happen. Either W_T ≥ −1, in which case we do not have to pay up for the insurance. Or W_T < −1, in which case my bet against W_t pays off nicely and earns me −1 − W_T. The −1 is there, because that is what I had to pay to enter the gamble. My earning of −1 − W_T equals the gambler's loss. So we can match the money that we need to pay on the insurance. Everything evens out nicely. Whatever happens, we have made our 10 cents. We might as well hand out insurances for free.

This seems too good to be true and of course it is. The paths of Brownian Motion are continuous, but very irregular and this strategy of stepping in and out is just not possible. We will get to that later. What you need to know for now is that we have defined W_t, but we have not properly defined how to place stakes on it. For a discrete martingale that was easy: place bets on the increments, this is a martingale transform. For Brownian Motion this is harder, since the increments are not clearly defined anymore. Time is continuous, it does not come step by step2 and how should we define dW_t? To answer this question, we first need to extend our notion of martingale to continuous time.

1Our lives would be so much easier if we also had euro coins and notes with negative numbers on them.
2Or does it? Space has a Planck scale, does time have one too? Tic toc tic toc. Of course, such silly questions have been asked and answered already.

11.1. Continuous martingales

The main message is: all results for discrete time martingales extend to real time martingales, if the paths are continuous. The terminology remains the same. A family of σ-algebras {F_t : t ∈ [0, ∞)} is a filtration if F_s ⊆ F_t for s ≤ t. If the random variables X_t are F_t-measurable, then they are adapted to the filtration. An adapted process is a continuous martingale if

(1) E[X_t | F_s] = X_s if s ≤ t.
(2) Paths t ↦ X_t(ω) are continuous for a.e. ω.

It goes without saying that the X_t are integrable, i.e., E[|X_t|] < ∞. It also goes without saying that a martingale starts at zero. If this does not happen, then it will be stated explicitly. The usual filtration has an initial F_0 that contains all events that have either probability zero or probability one. The σ-algebra F_t is the smallest σ-algebra that contains F_0 and all {X_s ≤ r} for all s ≤ t and all real r.

The nice thing about continuous functions is that you only need to get to know them at the rational points.

Exercise. Suppose that f and g are continuous functions such that f(q) = g(q) for all q ∈ Q. Then f and g are equal.

And so, continuous martingales are determined by their values X_q at rational times. Instead of using {X_s ≤ r} for all s ≤ t to define F_t we might as well restrict our definition to rational times q ≤ t.

We say that a subset D ⊂ [0, ∞) is a grid if the intersection D ∩ [0, t] is finite for every t. The mesh of a grid D is the minimal number ∆ such that every interval [x, x + ∆] contains at least one grid point. Numerical mathematicians use grids to convert math problems into finite problems. We use grids to convert real time martingales into discrete time martingales, and transfer our results. For instance, theorem 5.2 in real time is

Theorem 11.1 (Doob's maximal inequality). If X_t is a continuous nonnegative submartingale and λ > 0 then

P(X*_t > λ) ≤ (1/λ) ∫_{X*_t ≥ λ} X_t dP

Proof. In short: use a grid to make the martingale discrete – apply the original discrete result – take the limit. Now in long. Define grids D_n on [0, t] with mesh 2^{−n}, so that each D_{n+1} refines the previous D_n. Let X*_{n,t} be the world record on the grid D_n:

X*_{n,t} = max{X_d : d ∈ D_n}

As the mesh decreases, the world record increases, because we add ever more grid points. Doob's maximal inequality says

P(X*_{n,t} ≥ λ) ≤ (1/λ) ∫_{X*_{n,t} ≥ λ} X_t dP

The events A_n = {X*_{n,t} ≥ λ} form an increasing chain and then we can swap limit and expectation by monotone convergence. We can do that on both sides of the equation. Denoting A_∞ = ∪A_n we find

P(A_∞) ≤ (1/λ) ∫_{A_∞} X_t dP

If X*_t > λ then by continuity, there has to be an n such that X*_{n,t} ≥ λ. Observe that I sneakily changed my inequality sign from ≥ to >, because it may happen that the world record X*_t occurs at a time s that is not in any grid. In particular, {X*_t > λ} ⊆ A_∞ ⊆ {X*_t ≥ λ}. It follows that

P(X*_t > λ) ≤ (1/λ) ∫_{X*_t ≥ λ} X_t dP

because our submartingale is non-negative. Increasing the set over which we integrate increases the integral. □

You need to realize that all the inequalities and limit theorems that we found extend to continuous martingales. The proof always is: use a grid – apply the discrete time result – take limits. This may involve a lot of swapping, and it is easier said than done, but it always works.

Theorem 11.2 (Doob's L^p inequality, see 5.4 and Thm 6.7 in B-Z). Suppose that X_t is a continuous non-negative submartingale. Then for p > 1

E[(X*_t)^p] ≤ (p/(p−1))^p E[X_t^p]

Proof. Apply the grid and find that

E[(X*_{n,t})^p] ≤ (p/(p−1))^p E[X_t^p]

As n → ∞ the mesh goes to zero and X*_{n,t} ↑ X*_t. The result follows by monotone convergence. □
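Doob's maximal inequality can be illustrated on simulated Brownian paths, using the nonnegative submartingale |W_t| (a Python sketch; the grid, horizon and λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
steps, paths, t, lam = 1000, 50_000, 1.0, 1.5

dW = rng.normal(0.0, np.sqrt(t / steps), size=(paths, steps))
W = dW.cumsum(axis=1)

X = np.abs(W)            # |W_t| is a nonnegative submartingale
X_star = X.max(axis=1)   # the world record on the grid
X_t = X[:, -1]

lhs = (X_star > lam).mean()
rhs = (X_t * (X_star >= lam)).mean() / lam
print(lhs, rhs)          # lhs should not exceed rhs, up to Monte Carlo error
```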

If W_t is Brownian motion, then |W_t| and e^{W_t} are submartingales by Jensen's inequality and Doob's inequalities apply. However, if W_t is BM, then the reflection principle can be applied to study W*_t very effectively. You will encounter this in your homework. Now you know that all inequalities of chapter 5 carry over. I could go on to derive the martingale convergence theorems from chapter 6 and the sampling (or stopping) theorem for continuous martingales. But we have to move on. You need to know that in real time τ is a stopping time if {τ ≤ t} ∈ F_t for all t. The σ-algebra F_τ contains all events A such that A ∩ {τ ≤ t} ∈ F_t. One can prove that the stopped process M_{τ∧t} is equal to E[M_t | F_{τ∧t}], which shows that it is a martingale.

11.2. Crooked paths

The New York Stock Exchange, which has the nice acronym NYSE, opened up in 1817. It got a big boost thanks to the civil war but it went through a depression in 1873, in the aftermath of the Franco-Prussian war. This depression lasted for years and at the time it was even called the great depression. Of course, nowadays we know that in 1929 NYSE went through another depression, which is currently called the great depression. NYSE bounced back after WWII and today, despite a few recent hiccups, it is bigger than ever. It handles over 100 billion dollars in trades each day. The point of this story is that prices at the stock exchange can be highly irregular. Just like Brownian paths, which makes sense, because that is what they are.

Paley, Wiener and Zygmund proved in 1932 that – almost surely – a Brownian path $t \mapsto W_t(\omega)$ is not differentiable. This is not an easy theorem. Let's start with a result that is almost as good, and easier to prove.

Theorem 11.3. Almost surely, a Brownian path is nowhere monotone.

Before we prove the theorem, we have to understand what it means. A function $f(t)$ is non-monotone on $[a,b]$ if there exist $a < c < d < b$ such that the increments $f(c) - f(a)$ and $f(d) - f(c)$ have opposite signs. If a function is monotone, then all its increments need to have the same sign. For a Brownian path, this is like flipping infinitely many coins and Heads comes up for all of them. Not very likely.

Proof. Pick your rational $a$ and $b$. Put a grid on $[a,b]$ that has $2^n$ points. If a path is monotone, then all $2^n - 1$ increments $W_{d'} - W_d$ between consecutive grid points have equal sign. This event, let's call it $E_n^{a,b}$, happens with probability $2^{2-2^n}$. If a path is monotone between $a$ and $b$, then it is contained in $E^{a,b} = \bigcap_n E_n^{a,b}$, which has probability $\lim 2^{2-2^n} = 0$. If a path is somewhere monotone, then it has to be contained in $E^{a,b}$ for some rational $a < b$. A countable union of nullsets is a nullset and there are only countably many rationals. $\square$

Up to now, I have avoided making precise what our probability space is. If you pick a path, then you pick an $\omega$ from $\Omega$. But what is this? It is best to think of $\Omega$ as the space of all continuous functions. Think of $\omega$ as a random path. The probability of picking a path $f$ such that $f(s) \in [a,b]$ is equal to the probability that a normal random variable with mean 0 and variance $s$ is in $[a,b]$. Extend by independent increments. The probability that $f(t) - f(s) \in [c,d]$ for $s < t$ is equal to the probability that a normal random variable with mean 0 and variance $t-s$ is in $[c,d]$. And so forth.
This is called the Wiener measure on the space of continuous functions, which is how Wiener constructed BM originally and it is how B-Z specify BM in their condition (3) in Definition 6.9.

Let $f \colon [0,T] \to \mathbb{R}$ be any continuous function. Think of $f(t)$ as the position of a walker on the line at time $t$. To measure the total distance that is covered by the walker, put the standard grid $\mathcal{D}_n$ on $[0,T]$ with mesh $\Delta = T/2^n$ and add the increments
$$\sum_{\text{consecutive } d,d' \in \mathcal{D}_n} |f(d') - f(d)| \approx \sum_{d \in \mathcal{D}_n} |f'(d)| \, \Delta \approx \int_0^T |f'(t)| \, dt$$
If $f$ is differentiable, then the walker covers a finite distance. But if a walker suffers from the severe tremors that come with Brownian motion, then he covers an infinite distance. The sum of the increments $|f(d') - f(d)|$ is called the variation of the path. Or more precisely, the limit of this sum as the mesh goes to zero is called the variation. This limit exists because we keep refining the grid by inserting intermediate points $d''$ and
$$|f(d') - f(d'')| + |f(d'') - f(d)| \ge |f(d') - f(d)|$$
Brownian paths have infinite variation, see B-Z Theorem 6.5. The length of a Brownian path needs to be measured in a different way. The quadratic variation of a path is defined as the limit of the sums
$$\sum_{\text{consecutive } d,d'} |f(d') - f(d)|^2$$
as the mesh goes to zero. It is not immediately clear that this limit exists, but it is clear that it is smaller than the variation, since $|f(d') - f(d)|$ is small and squaring small numbers reduces them.

I denote the consecutive increments $f(d') - f(d)$ by
$$dW_{n,1}, dW_{n,2}, \ldots$$
where the index $n$ in $dW_{n,j}$ denotes the grid and the index $j$ denotes that it is the $j$-th increment. Of course, the increment $dW_{n,j}$ depends on the path – it depends on $\omega$. The quadratic variation of a path is
$$[W,W] = \lim_{n \to \infty} \sum_{j=1}^{2^n} |dW_{n,j}|^2$$
Instead of computing the quadratic variation of a single path, we compute the expected quadratic variation of all paths.
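You can watch the two notions of variation diverge on a computer. The Python sketch below (grid sizes are illustrative choices of mine) samples one Brownian path on a fine binary grid and coarse-grains it: the variation keeps growing as the mesh shrinks, while the quadratic variation settles near $T$:

```python
import numpy as np

# One Brownian path on a fine binary grid; coarser grids by subsampling.
# The variation grows without bound as the mesh shrinks, while the
# quadratic variation settles near T.
rng = np.random.default_rng(1)
T, n_fine = 1.0, 2 ** 14
steps = rng.normal(0.0, np.sqrt(T / n_fine), n_fine)
W = np.concatenate([[0.0], np.cumsum(steps)])     # the path on the finest grid

variations, quad_variations = [], []
for n in (2 ** 8, 2 ** 10, 2 ** 12, 2 ** 14):
    dW = np.diff(W[:: n_fine // n])               # increments over mesh T/n
    variations.append(np.abs(dW).sum())           # sum |f(d') - f(d)|
    quad_variations.append((dW ** 2).sum())       # sum |f(d') - f(d)|^2
print(variations)        # grows roughly like sqrt(n)
print(quad_variations)   # stays near T = 1
```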
You may want to turn back some pages and revisit exercise 3 of your Chapter 3 homework. There it says that the quadratic variation is a sum of $E[M_{n+1}^2 - M_n^2 \mid \mathcal{F}_n]$ and in (a) you need to check that this equals the sum of $E[dM_n^2 \mid \mathcal{F}_n]$. For BM we do not need to condition since the increments are independent.

Theorem 11.4. $[W,W]$ has expectation $T$ and variance zero.

Remember that $E[X^2] = 1$ and $E[X^4] = 3$ for a standard normal. Therefore, $X^2$ has mean 1 and variance 2. In case $X \sim N(0,\sigma)$ these values for $X^2$ are mean $\sigma$ and variance $2\sigma^2$.
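Before reading the proof you can check Theorem 11.4 empirically. A Python sketch (sample sizes are my own illustrative choices): over the grid with $2^n$ increments, $[W,W]_n$ should have mean $T$ and variance $2T^2/2^n$:

```python
import numpy as np

# Monte Carlo check: over the grid with 2^n increments, the quadratic
# variation [W,W]_n has mean T and variance 2 T^2 / 2^n.
rng = np.random.default_rng(2)
T, n, n_paths = 1.0, 10, 10_000
N = 2 ** n
dW = rng.normal(0.0, np.sqrt(T / N), size=(n_paths, N))
qv = (dW ** 2).sum(axis=1)           # [W,W]_n, one value per path
print(qv.mean())                     # close to T = 1
print(qv.var())                      # close to 2/2**10, about 0.002
```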

Proof. Let's denote the quadratic variation over the grid $\mathcal{D}_n$ by $[W,W]_n$. It is a sum of $2^n$ squared increments. Each increment has mean zero and variance $\Delta = T/2^n$. Therefore, each squared increment has mean $T/2^n$ and variance $2T^2/2^{2n}$. The random variable $[W,W]_n$ is a sum of $2^n$ independent terms. It has mean $T$ and variance $2T^2/2^n$, which vanishes as $n$ goes to infinity. Technically speaking, we have now proved that $[W,W]_n$ converges in distribution to the deterministic value of $T$. This is a weak convergence result. Let's step up our effort and prove that $[W,W]_n$ converges to $T$ in the squared mean (or in $L^2$, as an analyst would say, compare B-Z exercise 6.29). We need to prove that
$$\lim_{n \to \infty} E\left[([W,W]_n - T)^2\right] = 0$$
Now you need to observe that $[W,W]_n$ is a sum of $2^n$ random variables $dW^2$. So we can break up $T$ into parts $T/2^n$. There are $2^n$ parts and if we subtract $T/2^n$ from each of the $2^n$ random variables $dW^2$ then we have that
$$[W,W]_n - T = \sum_{j=1}^{2^n} \left(dW_{n,j}^2 - T/2^n\right)$$
which is a sum of $2^n$ independent random variables, with mean zero and variance $2T^2/2^{2n}$. In other words, $[W,W]_n - T$ has mean zero and variance $2T^2/2^n$. The variance vanishes as $n$ goes to infinity, which is what we had to prove.

As you remember from the first chapter, there are three notions of convergence: in probability, almost sure, and in the mean (or in $L^1$, as an analyst would say). Convergence in probability is the weakest notion. We have just proved that $[W,W]_n$ converges in the squared mean, which implies that it converges in the mean. Does

$[W,W]_n$ converge almost surely? It does, but that is more subtle. I will leave that for now. $\square$

We can now conclude that Brownian paths are not differentiable, because if they were, then their quadratic variation would be zero. See also B-Z exercise 6.30. We have taken our first steps to explore Brownian paths and continuous martingales. We are ready to move on to the stochastic calculus that was developed in 1941 by a junior employee of the Japanese Bureau of Statistics who had just gotten a BSc in mathematics.

Non-graded homework: Brownian motion basics

We denote by $\{W_t, t \ge 0\}$ Brownian motion, and by $\mathcal{F}_t = \sigma\{W_s : 0 \le s \le t\}$ its natural filtration. Compare B-Z exercises 6.20, 6.26, 6.35. Look up the moment generating function of the normal distribution on Wikipedia.

1. Compute the following expectations. Use conditional expectation to simplify the work.

a) $E(W_t W_s)$, $\mathrm{cov}(W_t, W_s)$.

b) $E((W_t - W_s)(W_r - W_u))$ with $0 < u < s < r < t$.

c) $E(e^{\lambda W_t - \frac{1}{2}\lambda^2 t})$

d) $E((W_t^2 - t)(W_s^2 - s))$

2. In this exercise you are allowed to use that if $\tau$ is an integrable stopping time, then $W_{\tau \wedge t}$ and $W^2_{\tau \wedge t}$ are both dominated by an integrable random variable, uniformly in $t$. Moreover, you are allowed to use the continuous-time analogue of the stopped martingale theorem, i.e., if $M_t$ is a martingale then so is $M_{\tau \wedge t}$. Prove that for an integrable stopping time $\tau$ we have
$$E W_\tau = 0, \qquad E(W_\tau^2) = E(\tau)$$
Apply this to show that for the first hitting time $T_a = \inf\{t > 0 : W_t = a\}$ we must have
$$E(T_a) = \infty$$
and for the first exit time of the interval $[-a,a]$ (with $a > 0$) $T_{-a,a} = \inf\{t > 0 : W_t \notin [-a,a]\}$ we have
$$P\left(W_{T_{-a,a}} = -a\right) = \frac{1}{2}$$
and
$$E(T_{-a,a}) = a^2$$
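If you want to see the two claims of exercise 2 before proving them, here is a Python simulation (step size, path count and time horizon are illustrative choices of mine; the discretization introduces a small upward bias in the exit time):

```python
import numpy as np

# Simulate the first exit time T of Brownian motion from [-a, a] and check
# the claims of exercise 2: P(W_T = -a) = 1/2 and E(T) = a^2.
rng = np.random.default_rng(3)
a, dt, n_paths, max_steps = 1.0, 1e-3, 2_000, 10_000
exit_times, exit_sides = [], []
for _ in range(n_paths):
    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), max_steps))
    k = np.argmax(np.abs(W) >= a)    # first grid index outside (-a, a)
    exit_times.append((k + 1) * dt)
    exit_sides.append(np.sign(W[k]))
print(np.mean(exit_times))                        # near a^2 = 1
print(np.mean(np.array(exit_sides) == -1.0))      # near 1/2
```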

3. Denote $M_t = \max_{0 \le s \le t} W_s$. Let $a > 0$, $b < a$.

a) Prove that $P(M_t > a, W_t < b) = P(M_t > a, W_t > 2a - b) = P(W_t > 2a - b)$. Hint: use the reflection principle.

b) Derive from item a) that $M_t$ and $|W_t|$ have the same distribution. Hint: use

$$P(M_t > a) = P(M_t > a, W_t < a) + P(M_t > a, W_t > a)$$

then use item a) to conclude $P(M_t > a, W_t < a) = P(W_t > a)$ and use also that $P(M_t > a, W_t > a) = P(W_t > a)$.

c) Show that the joint probability density of $(M_t, W_t)$ is given by

$$f(a, b) = -2 \left.\frac{\partial}{\partial x} p_t(0, x)\right|_{x = 2a - b}$$

with $p_t(0,x) = \frac{e^{-x^2/2t}}{\sqrt{2\pi t}}$ the probability density of $W_t$.

d) Use item c) to prove that $M_t - W_t$ and $M_t$ have the same distribution. Hint: use that the probability density of $M_t - W_t$ at a point $x \in \mathbb{R}$ is given by
$$\int_{-\infty}^{\infty} f(b+x, b) \, db = -2 \int_{-x}^{\infty} p_t'(0, b+2x) \, db = 2 p_t(0, x)$$
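Exercise 3 lends itself to a numerical check as well. The Python sketch below (sizes illustrative; the discrete-time maximum slightly undershoots $M_t$) compares $P(M_t > a)$, $2P(W_t > a)$ and $P(|W_t| > a)$:

```python
import numpy as np

# Reflection principle check: P(M_t > a) = 2 P(W_t > a) = P(|W_t| > a),
# so M_t and |W_t| have the same distribution (exercise 3b).
rng = np.random.default_rng(4)
t, a, n_paths, n_steps = 1.0, 0.8, 5_000, 1_000
W = np.cumsum(rng.normal(0.0, np.sqrt(t / n_steps), (n_paths, n_steps)), axis=1)
M = W.max(axis=1)                      # discrete running maximum per path
p_max = np.mean(M > a)                 # P(M_t > a)
p_ref = 2.0 * np.mean(W[:, -1] > a)    # 2 P(W_t > a)
p_abs = np.mean(np.abs(W[:, -1]) > a)  # P(|W_t| > a)
print(p_max, p_ref, p_abs)             # the three estimates nearly coincide
```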


CHAPTER 12

A tale of two thinkers

Read chapter 7 sections 1-2 in B-Z

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, etc, etc. Charles Dickens

June 1940. Two young mathematicians, both in their 25th year, located in different places in the world, living under different circumstances, work on the same problem: define the stochastic integral
$$I_t = \int_0^t f(s, W_s) \, dW_s$$
You should think of $I_t$ as your cumulative gain in an online casino where you place a stake $f$ that depends on current time $s$ and current value $W_s$. Since the casino is fair, if there is any sense in the system, $I_t$ should be a martingale

$$E[I_t \mid \mathcal{F}_s] = I_s$$
This elementary observation already dictates that stochastic integration differs from standard integration. It cannot be true that $\int 2W_s \, dW_s = W_t^2$ because $W_t^2$ is not a martingale.

The first moment of the stochastic integral is $E[I_t] = 0$. Now how about the second moment? Think of an integral as a sum of infinitesimal increments $f(s, W_s)\,dW_s$. Since the variance of $W_s$ grows linearly with time, the variance of the infinitesimal increment $dW_s$ should be equal to $ds$, and therefore the variance of $f(s, W_s)\,dW_s$ should be equal to $E[f^2(s, W_s)]\,ds$. If there is any sense in the system
$$E[I_t^2] = \int_0^t E[f^2(s, W_s)] \, ds$$

If $f(s, W_s)$ is equal to $2W_s$, then you can work out yourself that $E[I_t^2] = 2t^2$. This happens to be equal to the variance of $W_t^2$. The two random variables $\int 2W_s\,dW_s$ and $W_t^2$ cannot be equal, but they do have the same variance. Remember that in your homework last week you had to show that $W_t^2 - t$ is a martingale. It turns out that
$$\int_0^t 2W_s \, dW_s = W_t^2 - t$$
Stochastic integration is similar to standard integration, but there is an extra term involved. It is called the Itô term, after Kiyosi Itô, the young Japanese mathematician who figured out how to compute stochastic integrals during the 1940's. At that time, he was an employee of the Japanese Bureau of Statistics who – in complete isolation – studied the work of Paul Lévy whenever he did not have to write statistical reports. The calculus that comes with stochastic integration and differentiation is now called Itô calculus.

Meanwhile, Wolfgang Doeblin, a student of Lévy, also figured out how to handle this calculus. Doeblin had volunteered for the army and was located near the Maginot line when the war broke out. He continued to work on stochastic integration and reported his findings to the French Academy of Science in a sealed letter. When the French army capitulated in June 1940, Doeblin committed suicide. He was a German citizen, and joining the French army was treason. The sealed letter was kept in the Academy archives and for a long time, it was forgotten. Bernard Bru, a mathematician and historian, eventually found the letter and opened it in 2000. It contained the rules of the Itô calculus.
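The claim $\int_0^t 2W_s\,dW_s = W_t^2 - t$ can already be tested numerically, even before we construct the integral: approximate it by the left-endpoint Riemann sum. A Python sketch (grid size is an illustrative choice):

```python
import numpy as np

# Pathwise check that the left-endpoint sum for int_0^T 2 W dW approaches
# W_T^2 - T as the grid is refined.
rng = np.random.default_rng(5)
T, N = 1.0, 2 ** 16
dW = rng.normal(0.0, np.sqrt(T / N), N)
W = np.concatenate([[0.0], np.cumsum(dW)])
ito_sum = np.sum(2.0 * W[:-1] * dW)  # stake 2 W_{t_i} on the increment dW_i
print(ito_sum, W[-1] ** 2 - T)       # the two numbers nearly coincide
```

The difference between the two sides is exactly $\sum (dW_i)^2 - T$, the quadratic-variation error from chapter 11, which vanishes as the mesh goes to zero.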

12.1. Cracking up crooked paths

BM contains lots of symmetry. It is a malleable phenomenon and that is why it can be integrated so neatly. Before we move on to the stochastic calculus, let's consider some of the symmetries of BM.

Example 1. BM is scale independent, provided that you rescale properly:
$$\frac{1}{a} W_{a^2 t} \text{ is a Brownian motion.}$$

It is straightforward to verify that $W_{a^2 t}/a$ satisfies the conditions of Definition (10.6). The $a$ is not necessarily positive, you may also take $a = -1$. Observe that, as in the heat equation, there is a linear relation between $\Delta x^2$ and $\Delta t$. The symmetry $(t, x) \mapsto (a^2 t, x/a)$ maps Brownian paths to Brownian paths.

Exercise. Find the continuous function that is invariant under this symmetry: $f \colon [0,1] \to \mathbb{R}$ with $f(0) = 0$ and $f(1) = 1$ such that $f(a^2 t) = a f(t)$. Determine its derivative in zero.

Example 2. BM can be stopped and restarted. For any fixed time $r$

$$W_{t+r} - W_r \text{ is a Brownian motion.}$$
This allows you to reset the clock, so to speak. We learned how to speed up the clock in Example 1, and now we can reset it. This can be combined. For instance, you can let the clock move normally until time $r$ and then decide to speed it up by a factor $a^2$, provided that you rescale space by $1/a$. If you take $a = -1$ then you don't speed up time but you simply reverse the increments of the Brownian Motion from time $r$ onwards. The result remains BM. In stochastic integral notation, what we have in this case is
$$\int_0^t f(s) \, dW_s$$
with $f(s) = 1$ if $s < r$ and $f(s) = -1$ if $s \ge r$. These integrals, in which the stake $f(s, W_s)$ is a function of time only, are relatively easy to handle. Think of $f(s)$ as a factor that speeds up or slows down the rate at which the variance of the process increases. In other words, it speeds up or slows down time.

Example 3. BM can be reflected. Instead of restarting at a fixed time, as in the previous example, we may also restart at a stopping time $\tau$. Consider $\tau = \inf\{t : W_t = a\}$ for some $a > 0$. The set $\{t : W_t = a\}$ is closed and if it is non-empty, it has a well defined minimum. If we reverse the path at $\tau$ then we consider the stochastic integral

$$\int_0^t f(s, W_s) \, dW_s$$
in which $f(s, W_s) = 1$ if $W_s^* < a$ and $f(s, W_s) = -1$ if $W_s^* \ge a$. Here you see that my notation gets clumsy, because the stake $f$ depends on the history of $W_s$ rather than its current value. As soon as the Brownian motion hits level $a$, we reverse its direction. This is called the reflected path. Unlike an ordinary reflection of a ray of light, this does not really affect a random walker, because she has no sense of direction. She can go through the looking glass. If I did not tell you that I reversed the direction at level $a$, you would never know: this reflected process is BM. In exercise 3 of your Chapter 11 homework you learn how to use this observation to derive that $W_t^*$ is equal to $|W_t|$ in distribution. You should know, by the way, that while it is intuitively clear that reflected BM is BM, it is a bit intricate to prove this.

Exercise. Doob's $L^2$ inequality implies that $E[(W_t^*)^2] \le 4 E[W_t^2]$. Your Chapter 11 homework implies that, remarkably, you do not need the 4. Conclude that the variance of $W_t^*$ is bounded by $t$. You can even compute the variance explicitly if you know that $|W_t|$ is half normal.

Example 4. Mean reverting paths. If we speed up the clock by a linear factor $a^2$, then we can still retrieve BM by rescaling space by $1/a$. If we speed up the clock quadratically then this is not possible anymore:
$$\frac{1}{\sqrt{t}} W_{t^2} \text{ is not Brownian motion.}$$
The paths are continuous, but the process is not a martingale. The usual filtration for this process is $\mathcal{F}_{s^2}$ and conditioning produces

$$E\left[\frac{1}{\sqrt{t}} W_{t^2} \,\Big|\, \mathcal{F}_{s^2}\right] = \frac{1}{\sqrt{t}} W_{s^2}$$

If $W_{s^2}$ is positive, then this conditional expectation falls short of its value at time $s$. If $W_{s^2}$ is negative, it falls long. This process has the nostalgic tendency to regress to the mean, which is a common phenomenon in physics and in statistics. Such processes are known as Ornstein-Uhlenbeck diffusions.¹

Example 5. Multiplicative continuous martingales. Let's go to the stock exchange once again. For asset prices, it's not about increments $dY$, it's about returns $dY/Y$. We have to find ourselves a proper process $Y$ to describe this. It has to be continuous, because asset prices are continuous,² and $dY/Y$ needs to be a martingale such as $dW$. So, we encounter our first stochastic differential equation, or SDE:
$$dY = Y \, dW$$

My first guess of a solution is $Y_t = e^{W_t}$, because the classical differential produces $de^W = e^W dW$. However, this is not classical calculus, it is Itô calculus. My first guess is wrong. If you want to know the right solution, you can read ahead and check out B-Z Example 7.6, but we are not ready for this yet.

Example 6. Turning back the clock. BM is a martingale: $E[W_t \mid \mathcal{F}_s] = W_s$. We can use the past $\mathcal{F}_s$ to predict the future. But what if we are historians? If we know the future, can we predict the past? Let $\mathcal{G}_s$ be the $\sigma$-algebra generated by the random variables $W_t$ for $t \ge s$ and suppose that $r < s$. Can we compute $E[W_r \mid \mathcal{G}_s]$? I leave that to you.

Challenge. Compute $E[W_r \mid \mathcal{G}_s]$. Hand in your solution before the next class. Winner receives a math book. Here are some clues. If $r = 0$ then $W_r = 0$ and we have the answer. Secondly, by the Markov property, you expect the answer to depend on $W_s$ only. And finally, if $X$ and $Y$ are independent standard normal, then it must be true (why?) that $E[X \mid X+Y] = E[Y \mid X+Y]$.

A final word. BM is a Gaussian process and if I challenge you to compute $E[W_s \mid W_t]$ – you may have noticed that I just did that – then I am asking for a multiple integral filled with Gaussian densities. As a rule, I avoid computing such integrals and prefer using the five commandments instead. B-Z are analysts and they are not afraid of big formulas. They compute things like $E[W_s \mid W_t]$ straight up from Gaussian densities. For instance, in solution 6.20 they compute

$$E[W_s W_t] = \iint xy \, p(s, 0, x) \, p(t-s, x, y) \, dx \, dy$$
In this integral $xy$ represents the outcome of $W_s W_t$, $p(s,0,x)$ is the density of $W_s$, and $p(t-s,x,y)$ is the density of $W_t$ conditional on the outcome $x$ of $W_s$. I guess such integrals are a comfortable read for analysts. If you are a more simple minded person like me, you just apply the five commandments. Take out what is known in $E[W_s W_t \mid \mathcal{F}_s]$. Then apply the tower of expectations. That is the standard way to do it, and it is completely equivalent to the computation in B-Z.

¹Dutch pride: mean reverting random walks were studied by the Ehrenfests. Mean reverting continuous random walks were already studied by the Dutch physicists Leonard Ornstein and George Uhlenbeck before martingales were invented.
²Of course they are not. The proper model for asset prices has a Brownian part $W_t$ and a jump part $N_t$. The standard financial math textbooks need an upgrade.

12.2. The stochastic integral It is time to define the stochastic integral

$$I_T = \int_0^T f(s, \omega) \, dW_s$$
in which $W_s$ is the Brownian motion on $[0,T] \times \Omega$ and $f$ is a random variable on $[0,T] \times \Omega$ which we think of as the stakes on BM in an online casino. Let's approach the stochastic integral in a Riemann manner. Select a grid $0 = t_0 \le t_1 \le \cdots \le t_n = T$, sum up

$$\sum_{i=0}^{n-1} f(t_i, \omega)\big(W(t_{i+1}, \omega) - W(t_i, \omega)\big)$$
keep refining the grid, let the mesh go to zero. Basically, that is what we will do, but it is easier said than done. Remember that computing the quadratic variation already required an effort.

We need to define the stakes $f(s, \omega)$ properly. Notice that the random variables $\omega \mapsto f(s, \omega)$ form a stochastic process. Since we cannot use future information, we require that the random variables $f_s$ are $\mathcal{F}_s$ adapted.

The simplest integrand $f(s, \omega)$ is constant. If $f$ is equal to one (non-random) then the integral is equal to $W_T$ (random). Let's move on to more intricate integrands. We say that $f$ is a random step process if each path $t \mapsto f(t, \omega)$ is a step function, in which the steps occur at fixed times $0 = t_0 \le t_1 \le \cdots \le t_n = T$ that do not depend on $\omega$. B-Z say random step process, most books refer to this as a simple process. In such processes, the gambler places a stake $\eta_i$ at a fixed time $t_i$ and leaves it on the table during the time interval $[t_i, t_{i+1}]$:

$$f(t, \omega) = \eta_i(\omega) \quad \text{if } t_i \le t < t_{i+1}$$
The stakes are random, the times are deterministic. In financial terminology, a trader acquires an amount $\eta_i$ of the asset $W_t$ at time $t_i$ and maintains that position until time $t_{i+1}$. Buying the asset at time $t_i$ costs $\eta_i W_{t_i}$ and selling it at time $t_{i+1}$ brings in $\eta_i W_{t_{i+1}}$. At the end of the time interval, the profit (possibly negative) is equal to

$$\eta_i \big(W_{t_{i+1}} - W_{t_i}\big)$$
Mathematicians often voice their contempt of financial math, but you have to admit that thinking about stochastic integrals in financial terms helps to get the picture.

Definition 12.1. If $f(s, \omega)$ is a random step process, then

$$I_T(f) = \int_0^T f(s, \omega) \, dW_s = \sum_{i=0}^{n-1} \eta_i \big(W_{t_{i+1}} - W_{t_i}\big)$$
Instead of integrating over the entire time interval, we may integrate the random step process up to any $t \le T$. If $t \in [t_k, t_{k+1}]$ then
$$I_t(f) = \int_0^t f(s, \omega) \, dW_s = \left(\sum_{i=0}^{k-1} \eta_i \big(W_{t_{i+1}} - W_{t_i}\big)\right) + \eta_k \big(W_t - W_{t_k}\big)$$

Now you need to recognize a transform of the discrete martingale $W_{t_0}, W_{t_1}, \ldots, W_{t_k}, W_t$. Previously, we would have denoted this $I_t$ by

$$\sum_{i=0}^{k-1} \eta_i \, dW_{t_i}$$
A transformed martingale is a martingale and the increments are orthogonal in $L^2$. Therefore, the expectation of $I_t(f)$ is zero and its variance is the sum of the variances of the increments $\eta_i \, dW_{t_i}$.

Lemma 12.2 (Itô isometry). If $f$ is a random step process, then
$$E[I_t(f)^2] = \int_0^t E[f(t, \omega)^2] \, dt = E\left[\int_0^t f(t, \omega)^2 \, dt\right]$$

Proof. $I_t$ is a martingale transform, hence a martingale, hence $E[I_t] = 0$. Its variance $E[I_t^2]$ is equal to the sum of the variances of the increments $\eta_i \, dW_{t_i}$. As always, apply the tower of expectations and take out what is known
$$E\left[\eta_i^2 (W_{t_{i+1}} - W_{t_i})^2 \mid \mathcal{F}_{t_i}\right] = \eta_i^2 \, E\left[(W_{t_{i+1}} - W_{t_i})^2 \mid \mathcal{F}_{t_i}\right] = \eta_i^2 (t_{i+1} - t_i)$$
It follows that the variance of $\eta_i \, dW_{t_i}$ is equal to $E[\eta_i^2](t_{i+1} - t_i)$. This gives the first equation, which is an ordinary integral of a step function, so we could have also written this as an ordinary sum. Swapping the order of integration is allowed, since everything is positive, so we may also place the $E$ sign in front of the integral. This gives the second equation. $\square$

At the beginning of the chapter, I bluntly wrote down the stochastic integral, claimed it was a martingale and that its variance grows with $E[f(t, \omega)^2]\,dt$. This blunt statement is now proved if $f$ is a random step process. Now, how do we extend the proof to more general integrands like $f(t, \omega) = W_t(\omega)$?

The set of all random step processes is closed under addition. If both $f$ and $g$ are random step processes, then so is $f + g$. The stochastic integral $\int_0^t f + g \, dW$ is the combined return on two betting strategies, hence $I_t(f+g) = I_t(f) + I_t(g)$. The conversion of a betting strategy $f$ into the online casino's payoff $I_t(f)$ is linear. It is an operator between the space of random step processes and the space of $L^2$ bounded martingales on $[0,T]$. To see what goes on, I rewrite the final expectation in the lemma above as
$$E\left[\int_0^t f(t, \omega)^2 \, dt\right] = \int_\Omega \int_0^t f^2 \, dt \, dP$$
so you see that this is the square of the $L^2$ norm of $f \colon \Omega \times [0,t] \to \mathbb{R}$. The lemma says that the $L^2$ norm of $f$ is equal to the $L^2$ norm of $I_t(f)$. That is why the statement in this lemma is known as the Itô isometry. The nice thing about isometries is that they can be extended. If $f_n$ is a Cauchy sequence of random step processes, then $I_t(f_n)$ is Cauchy as well. Now you need to remember your old course in Mathematical Structures or in Real Analysis or in Fourier Analysis: extend by completion. The stochastic integral $I_t(f)$ can be defined for all $f$ that are in the completion (or the closure) of the set of all random step processes. B-Z denote this completion by $\mathcal{M}_T^2$. They even allow $T$ to be equal to $\infty$, but let's limit our attention to finite $T$.
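The Itô isometry can be watched in action for a non-step integrand by freezing it at the left grid points, exactly as in the extension argument. A Python sketch with $f = W_s$ (sizes are illustrative choices of mine), where the isometry predicts $E[I_T(f)^2] = \int_0^T E[W_s^2]\,ds = T^2/2$:

```python
import numpy as np

# Ito isometry check for f(s, omega) = W_s, approximated by the random step
# process that freezes W at the left endpoint of each grid interval:
# E[I_T(f)^2] should approach int_0^T E[W_s^2] ds = T^2/2.
rng = np.random.default_rng(6)
T, N, n_paths = 1.0, 256, 10_000
dW = rng.normal(0.0, np.sqrt(T / N), (n_paths, N))
W = np.cumsum(dW, axis=1)
W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])  # W at left endpoints
I_T = np.sum(W_left * dW, axis=1)    # the step-process stochastic integral
print(np.mean(I_T))                  # near 0: the integral is centered
print(np.mean(I_T ** 2))             # near T^2/2 = 0.5
```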
The 2 L 2 × → lemma says that the norm of f is equal to the norm of It(f). That is why the statement in thisL lemma is known as the Itˆoisometry.L The nice thing about isometries is that they can be extended. If fn is a Cauchy sequence of random step processes, then It(fn) is Cauchy as well. Now you need to remember your old course in Mathematical Structures or in Real Analysis or in Fourier Analysis: extend by completion. The stochastic integral It(f) can be defined for all f that are in the completion (or the closure) of the set of all random step processes. B-Z 2 denote this completion by T . They even allow T to be equal to , but let’s limit our attention to finite T . M ∞ 12.2. THE STOCHASTIC INTEGRAL 121

The next question is, which integrands $f$ are in $\mathcal{M}_T^2$? Is $f(\omega, t) = W_t$ in it? Is the insurance policy that you and I sell for ten cents

$$f(\omega, t) = \begin{cases} 0 & W_t > -1 \\ 1 & W_t \le -1 \end{cases}$$
in it?

Theorem 12.3 (Thm 7.1 in B-Z). If $f \in L^2(\Omega \times [0,T])$ is an adapted process with continuous paths, then $f \in \mathcal{M}_T^2$.

Proof. We need to construct a sequence of random step processes $f_n$ that converges to $f$ in $L^2$. The idea is, of course, to put a grid on $[0,T]$, define a step process on it, refine and let the mesh go to zero. Prove that the refined grids produce a sequence that converges to $f$. Get ready to swap.

As always, let $\mathcal{D}_n$ be the binary grid on $[0,T]$ with mesh $T/2^n$. Define $f_n(\omega, t)$ by equating it to $f(\omega, t_k)$ for $t \in [t_k, t_{k+1})$. In other words, the process $f_n$ is defined from $f$ by rounding time down to the nearest grid point. Obviously, $f_n$ is a step process and it is adapted since $f$ is adapted. By our assumption of continuity of the paths, we have $\lim_{n\to\infty} f_n(\omega, t) = f(\omega, t)$. We have almost sure convergence, but we need convergence in the square mean to prove that $f \in \mathcal{M}_T^2$.

Fix an $\epsilon > 0$. Remember that a continuous function on a closed interval $[0,T]$ is uniformly continuous. Therefore, for each path $t \mapsto f(\omega, t)$ we have that $|f_n(\omega, t) - f(\omega, t)| < \epsilon$ if $n$ is sufficiently large. It follows that for almost every $\omega$
$$\lim_{n\to\infty} \int_0^T |f_n(\omega, t) - f(\omega, t)|^2 \, dt = 0$$
Putting an expectation around it
$$E\left[\lim_{n\to\infty} \int_0^T |f_n(\omega, t) - f(\omega, t)|^2 \, dt\right] = 0$$
If we can swap, then we are done. So can we swap? Of course we can. If you compare my $f_n$ to the one in B-Z, you will see that they smoothen the $f_n$ a bit so that $\|f_n\|_2 \le \|f\|_2$ and they can apply dominated convergence. I was lazy and just rounded down like a numerical mathematician, so now I need to do a little bit of extra work. If $f$ is bounded by a constant $M$, then we can apply dominated convergence because the constant function $c(\omega, t) = M$ is in $L^2$ and it dominates $f$ and $f_n$. Therefore, if $f$ is adapted and has continuous paths and is bounded, then it is in $\mathcal{M}_T^2$ and we are done. For an arbitrary $f$, we have that $f \wedge M = \min\{f, M\}$ is bounded and has continuous paths.
Therefore it is in $\mathcal{M}_T^2$. Since $\mathcal{M}_T^2$ is complete and since $f \wedge M$ converges to $f$ as $M \to \infty$, we also have that $f \in \mathcal{M}_T^2$. $\square$

We have come a long way. We proved that BM exists. We have proved that the stochastic integral exists, provided that the integrand has continuous paths. We still do not know if our ten cents insurance policy can be integrated, but we were a bit skeptical about this financial product anyway. We have done all this, but can we now actually compute an integral? Wouldn't you love to take your pen and paper and compute lots of stochastic integrals, just like you did with ordinary integrals? Isn't that proper mathematics? No it is not. But we are going to do it anyway. Next week we will proceed with Itô's calculus. Check out B-Z section 7.2 for your first integral.

Exercises.

1) $\{X_i, i \in \mathbb{N}\}$ are independent and identically distributed discrete random variables with $P(X_i = 1) = p$, $P(X_i = -1) = 1 - p$, where $0 < p < 1$. Furthermore we denote for $n \in \mathbb{N}$
$$S_n = \sum_{i=1}^n X_i$$
and $S_0 = 0$.

a) Compute the conditional expectation $E(S_n^2 \mid X_1, \ldots, X_{n-1})$.
b) For which constant $a$ is $M_n := S_n - an$ a martingale?
c) Use the martingale

$$Z_n := \left(\frac{1-p}{p}\right)^{S_n}$$
to compute the probability that 20 is hit before $-10$ by the process $\{S_n, n \in \mathbb{N}\}$. You do not have to prove that $Z_n$ is a martingale but you are asked to prove that the martingale stopping theorem can be applied for a suitable stopping time.
d) Use the martingale convergence theorem to prove that

$$Z'_n := \sum_{k=1}^n \frac{1}{k}\big(X_k - (2p-1)\big)$$
converges almost surely as $n \to \infty$.

2) Let $\{X_t : t \ge 0\}$ be Brownian motion with drift $\mu > 0$.
a) Compute $E(X_t^2 + X_t)$.
b) Show that $M_t := (X_t - \mu t)^2 - t$ is a martingale.
c) Let $a > 0$, and denote by $\tau$ the first hitting time of $a$, i.e.,

$$\tau := \inf\{t \ge 0 : X_t = a\}$$
Use the martingale

$$Z_t := e^{\alpha X_t - \alpha\mu t - \frac{1}{2}\alpha^2 t}$$

(you can use that $Z_t$ is a martingale, i.e., you do not have to show this) to show that for $\lambda > 0$

$$E e^{-\lambda\tau} = e^{(\mu - \sqrt{\mu^2 + 2\lambda})\,a} \qquad (1)$$

d) By taking the limit $\lambda \to 0$ in formula (1) conclude that $\tau < \infty$ with probability one, i.e., $a$ is hit with probability one by the Brownian motion with drift $\mu$. Is the same true for $a < 0$?

3) Are the following statements right or wrong? Justify your answer.

a) For $\{B_t, t \ge 0\}$ Brownian motion, $Z_t = e^{-B_t}$ is a submartingale.
b) A process $\{X_n, n \in \mathbb{N}\}$ of which the expectation is $E(X_n) = n$ cannot be a martingale.


CHAPTER 13

Stochastic Calculus

Read chapter 7 sections 3-5 in B-Z. It covers a lot of ground. What I am mainly interested in here is giving you the rules of stochastic calculus. In mathematics, it is important that you know how to manipulate the symbols.

Mathematics should be a tool for increasing one's thinking power, but for many children it is just a set of rather pointless rules for manipulating symbols.

Many universities have entrance math tests in which students have to solve boring exercises without a calculator. In my opinion, this is ridiculous. Making students solve exercises without a computer is like making gardeners shovel with their bare hands. The computer is the mathematician's tool. Having said that, let's now solve some differential equations, or DE's. The-mother-of-all-DE's is
$$y' = -y$$
It returns to equilibrium no matter what the initial condition: $\lim_{t\to\infty} y(t) = 0$. If your mind had been magically erased and you would have to solve this DE from scratch, then you would put it in your computer. You rewrite the DE as
$$dy = -y \, dt$$
and you feed it to the machine:

M=2^10; T=10; dt=T/M;
hold on
for j=-20:10:20
    y=j; f=[y];
    for i=1:M
        y=y-y*dt; f=[f y];
    end
    plot([0:dt:T],f)
end


This produces a picture of exponentials descending down to the x-axis no matter what the initial condition:

Now what if I add a stochastic part to the-mother-of-all-DE's? I will have to write $Y$ instead of $y$ now, since this makes our $y$ stochastic:
$$dY = -Y \, dt + dW$$
This is an SDE. We discretize it and feed it to the machine

M=2^10; T=10; dt=T/M;
hold on
for j=-20:10:20
    y=j^3; f=[y];
    for i=1:M
        y=y-y*dt+sqrt(dt)*randn; f=[f y];
    end
    plot([0:dt:T],f)
end

If you run this code, you get a BM kind of exponential decay, which is typical for the Ornstein-Uhlenbeck diffusion. The result looks like a noisy solution of the DE:

Remember that the Ornstein-Uhlenbeck diffusion is related to the bleating sheep of the Ehrenfests, which has to do with reversibility: if you wait long enough, you will be young again. You can see that on your computer if you increase T from 10 to 1000, but I leave that experiment to you. Instead of adding $dW$ to the mother-of-all-DE's, I can also replace $dt$ by $dW$:
$$dY = -Y \, dW$$

This changes the picture completely.

M=2^10; T=10; dt=T/M;
hold on
for j=-20:10:20
    y=1; f=[y];
    for i=1:M
        y=y-y*sqrt(dt)*randn; f=[f y];
    end
    plot([0:dt:T],f)
end

Since $Y_t = Y_0 - \int_0^t Y \, dW$ is a constant plus a stochastic integral, it is a martingale. Our computer suggests that the martingale is non-negative. You will be able to verify this once you have learned how to solve the SDE with pen and paper. $Y_t$ displays the behaviour of a double-or-nothing martingale, which converges to zero almost surely. You should also be able to verify this property. It will follow from the fact that $\lim_{t\to\infty} W_t/t = 0$ almost surely. Of course, it is not surprising that $dY = -Y\,dW$ is very similar to double-or-nothing. You are at time $t$. You bet $Y_t$, which is everything you have, on the next increment $dW_t$. That is double or nothing!

We will now get to the rules of stochastic calculus which allow you to solve SDE's with pen and paper. Or rather, I will give you the rule. There is only one thing that we need to change in the ordinary calculus. But you should never forget that it is quite helpful to solve SDE's with your computer.
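Reading ahead to the next section, Itô's rule will give the closed-form solution $Y_t = e^{-W_t - t/2}$ for $dY = -Y\,dW$ with $Y_0 = 1$ (the sign-flipped cousin of the martingale $e^{W_t - t/2}$ from your homework). A Python sketch comparing it with the Euler discretization used in the code above (grid size is an illustrative choice of mine):

```python
import numpy as np

# Euler scheme for dY = -Y dW started at Y_0 = 1 versus the closed-form
# solution Y_t = exp(-W_t - t/2) along the same Brownian increments.
rng = np.random.default_rng(7)
T, N = 1.0, 2 ** 16
dW = rng.normal(0.0, np.sqrt(T / N), N)
W_T = dW.sum()                       # terminal value of the Brownian path
Y_euler = np.prod(1.0 - dW)          # y -> y - y*dW, compounded over the grid
Y_exact = np.exp(-W_T - 0.5 * T)     # closed form, strictly positive
print(Y_euler, Y_exact)              # the two values are close
```

Note that the closed form is strictly positive, which matches what the computer suggested about the martingale being non-negative.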

13.1. Itô's rule

If you obey all the rules you miss all the fun. Katharine Hepburn

Before you get to play a game, you have to learn the rules, which is a pain. Stochastic integration is the name of our game and we play by Itô's rule. We start with the easiest integral, integrating a step function that does not even step:
$$\int_0^t 1 \, dW_s = W_t$$
This follows immediately from the definition of integrating a step process. A step function is a step process that is not random. If the function has a single step:
$$\int_{t_0}^t 1 \, dW_s = W_t - W_{t_0}$$
and that is all I have to say about step functions. We move on to integrating arbitrary non-random functions:
$$\int_0^t s \, dW_s = \widetilde{W}_{t^3/3}$$
Observe the strange tilde over the BM on the right. That is not a typo. If we approximate the integrand $f(s) = s$ by a step process, then that is not a process really, it is a step function. Put a grid on it, take limits, denote the integral by $I_t(s)$

I_t(s) = lim_{n→∞} Σ_{s_i ∈ D_n} s_i (W_{s_{i+1}} − W_{s_i})

The summands s_i (W_{s_{i+1}} − W_{s_i}) are independent normals with mean zero and variance s_i²(s_{i+1} − s_i). Therefore, the sum of the integrands is normal with mean zero and variance

Σ s_i²(s_{i+1} − s_i) ≈ ∫_0^t s² ds = t³/3

In the limit, as the mesh of the grid goes to zero, we get that I_t(s) is normally distributed with mean zero and variance t³/3. A stochastic integral is a process with continuous paths. This particular I_t has independent increments which are normal with mean zero. If we rescale time so that the variance grows linearly, then we get BM. In general, for a non-random function f the stochastic integral I_t(f) is a time-changed BM. That is why I write W̃ for this integral. B-Z just leave it as it is, they are analysts.¹

Moving on to stochastic integrands. In ordinary calculus, df = f_x dx + f_t dt for a differentiable function in two variables x, t. A more accurate approximation would be df = f_x dx + f_t dt + ½ f_xx dx² + f_xt dx dt + ½ f_tt dt², but we never need to keep the quadratic differentials. Itô's rule says that in stochastic calculus dW dW = dt and all other quadratic differentials vanish. If we apply Itô's rule to functions f(W_t, t) in which f(x, t) is differentiable, then we get

(13.1)  df = f_x dW + f_t dt + ½ f_xx dt

In the previous lecture, we tried to find a primitive function for ∫_0^t 2W dW. Itô's rule says that we need to find an f such that f_x = 2x, f_t + ½ f_xx = 0. That is an ordinary calculus problem and its solution is f(x, t) = x² − t + c. The process starts at zero and so c is equal to zero. That is why ∫_0^t 2W dW = W_t² − t. Moving on to the SDE in example 7.6 in B-Z

dY = Y dW

Now we need to find an f such that f_x = f and f_t + ½ f_xx = 0. Again an ordinary calculus problem. The solution is f(x, t) = e^{x − t/2} if the process starts at one. You may remember from your homework exercises that f(W_t, t) = e^{W_t − t/2} is a martingale. You should now be able to conclude this from its increments dY = Y dW.
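Before trusting the rule, you can let the computer confirm that ∫_0^t 2W dW = W_t² − t: approximate the stochastic integral by a left-point Riemann sum along one path and compare. A Python sketch; the seed and grid size are arbitrary choices.

```python
import math
import random

random.seed(7)
N, T = 100_000, 1.0
dt = T / N

w, integral = 0.0, 0.0
for _ in range(N):
    dw = math.sqrt(dt) * random.gauss(0.0, 1.0)
    integral += 2 * w * dw           # left-point sum: evaluate 2W before the step
    w += dw

# on a fine grid the sum should be close to W_T^2 - T
error = integral - (w * w - T)
```

Note that the left endpoint matters: evaluating 2W after the step would produce the Stratonovich answer W_T² instead.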

¹What does one analyst kid tell the other analyst kid? My dad's formulas are bigger than yours.

Exercise  Solve dZ = −Z dW. Remember that −W_t is BM. Conclude that dZ = −Z dW is essentially the same as dY = Y dW.

Unfortunately, it does not always work this way. We already considered ∫ t dW. If we try to solve it by f(W_t, t) then Itô's rule gives f_x = t, f_t + ½ f_xx = 0. You can check yourself that this has no solution. The reason is that ∫ t dW is not of the form f(W_t, t). We need to consider processes which are more general than that.

Definition 13.1 (B-Z definition 7.8). An Itô process f : Ω × [0, T] → ℝ is of the form

df = a(ω, t) dt + b(ω, t) dW

for adapted processes a and b in M_T². The term a dt is called the drift. If it is equal to zero, then the Itô process is a martingale. In fact, the condition on a can be relaxed a bit as you can see in B-Z. It is easier to integrate dt than it is to integrate dW.

I write all of this in differential form, because that is shorter and easier to read. Just like ordinary differentials, stochastic differentials only make sense once you integrate them. If you like bigger formulas, then you can also write an Itô process as

f(ω, T) − f(ω, 0) = ∫_0^T a(ω, t) dt + ∫_0^T b(ω, t) dW

The conditions on a and b ensure that they can be integrated almost surely. Proving this requires some care. I am imposing the rules here. You are not allowed to question them. We know an Itô process once we know the a and the b. This explains why B-Z leave the integral ∫ t dW as it is: the a is zero and b(ω, t) = t. In an SDE like the Ornstein-Uhlenbeck diffusion

dY = −Y dt + dW

we do not know the a and b yet, because the Y is unknown. We already solved this SDE by computer, but how do we solve it with pen and paper? Now you need to remember that a DE can be solved by an integrating factor.² This also works for the Ornstein-Uhlenbeck diffusion.
Multiply by e^t and apply Itô's rule:

d(e^t Y) = Y de^t + e^t dY = e^t Y dt + e^t dY = e^t dW

If we multiply the Ornstein-Uhlenbeck process Y_t by e^t, then we get a time-changed BM:

Y_t = e^{−t} W̃_{e^{2t}/2}

Please compare this to exercise 7.16 in B-Z, with α = 1. You should have objected to my last formula, because I said Itô's rule, but that only applies to f(W_t, t). Instead, I applied it to f(Y_t, t) for an Itô process Y, which is different from (13.1). That equation only applies if Y is equal to W. The general equation is (compare B-Z theorem 7.6)

(13.2)  df(Y, t) = f_x dY + f_t dt + ½ f_xx dY dY
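The integrating factor also tells you that E[Y_t] = Y_0 e^{−t}, since e^t Y_t = Y_0 + ∫_0^t e^s dW_s and the stochastic integral has mean zero: the process reverts to its mean. A Monte Carlo sketch in Python; the path count, step size and starting value Y_0 = 1 are arbitrary choices.

```python
import math
import random

random.seed(3)
paths, N, T = 4000, 200, 1.0
dt = T / N

total = 0.0
for _ in range(paths):
    y = 1.0                          # start every path at Y_0 = 1
    for _ in range(N):
        dw = math.sqrt(dt) * random.gauss(0.0, 1.0)
        y += -y * dt + dw            # Euler step for dY = -Y dt + dW
    total += y

mean_estimate = total / paths        # the theory predicts exp(-1) at T = 1
```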

²If dy + f(t)y dt = 0 then multiply by e^{F(t)}, rewrite this into d(y e^{F(t)}) = 0 and solve.

I apologize for writing f_x and Y_t; perhaps I should have written X_t for an Itô process, just like B-Z. According to Itô's rule, dY dY = a² dt dt + 2ab dt dW + b² dW dW = b² dt. And this is all I have to say about Itô's rule.

13.2. Properties of the Stochastic Integral

We defined I_t(f) by extending it from step processes f to limits of step processes in M_T², using the Itô isometry. Some of the properties of I_t(f) that hold for step processes carry over to f ∈ M_T² almost automatically. B-Z combine this all in theorem 7.3, but I split it up.

Theorem 13.2. I_t(f) is a martingale.

Proof. If f is a step process, then I_t(f) essentially is a martingale transform of a discrete martingale. Therefore, it is a martingale:

E[I_t(f) | F_s] = I_s(f)

Every f ∈ M_T² is a limit of step processes lim f_n = f. By definition I_t(f) = lim I_t(f_n) for all t ∈ [0, T]. Proving that I_t(f) is a martingale comes down to a swap, as always:

E[I_t(f) | F_s] = E[lim I_t(f_n) | F_s] ?= lim E[I_t(f_n) | F_s] = lim I_s(f_n) = I_s(f)

In this case, the swap is a continuity property of the conditional expectation. You need to remember your linear algebra: conditional expectation is a projection. It contracts the L² distance between random variables. As any contraction, the conditional expectation is a continuous map and for every continuous map h(lim x) = lim h(x). The swap is allowed. □

Theorem 13.3 (Itô isometry). The stochastic integral is an isometry:

||f||₂ = ||I_T(f)||₂

for f ∈ M_T² and I_T(f) ∈ L²(Ω). I do not like the long-hand notation ||f||₂² = ∫∫ f(ω, t)² dt dP = E[∫_0^T f² dt].

Proof. This is true if f is a step process. We extend by taking limits f = lim f_n and I_T(f) = lim I_T(f_n). If a sequence x_n converges to x in a normed space, then ||x_n|| converges to ||x||. Therefore ||f||₂ = lim ||f_n||₂ and ||I_T(f)||₂ = lim ||I_T(f_n)||₂. □

13.3. The Itô formula

Finally, it is time to supply the proof for the rule. Now I have been switching between f(x, t) and f(t, x) (sorry!) and since I am copying B-Z from here, it is time to switch back to f(t, x).

Theorem 13.4 (B-Z theorem 7.5). Let f : ℝ² → ℝ have continuous partial derivatives f_t, f_x, f_xx. Then f(t, W_t) is an Itô process in M_T² and

f(T, W_T) − f(0, W_0) = ∫_0^T (f_t + ½ f_xx) dt + ∫_0^T f_x dW

for almost every ω. I copy the proof of B-Z.
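Going back to Theorem 13.3: for the deterministic integrand f(s) = s the isometry predicts E[I_T(f)²] = ∫_0^T s² ds = T³/3, the same variance we found for the time-changed BM at the start of the chapter. A Monte Carlo sketch; the sample sizes are arbitrary choices.

```python
import math
import random

random.seed(5)
paths, N, T = 5000, 100, 1.0
dt = T / N

second_moment = 0.0
for _ in range(paths):
    integral = 0.0
    for i in range(N):
        dw = math.sqrt(dt) * random.gauss(0.0, 1.0)
        integral += (i * dt) * dw    # left-point sum for int_0^T s dW_s
    second_moment += integral ** 2

estimate = second_moment / paths     # the isometry predicts T**3 / 3
```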


13.4. EXERCISES CHAPTER 13

Exercise 1  Compute the differential of the Itô process X_t = W_t³ − 3tW_t. Conclude that X_t is a martingale.

Exercise 2  Compute the differential of the Itô process Y_t = e^{W_t}. It has a drift term. Compute E[Y_t] by removing this drift.

Exercise 3  Get out a large piece of paper and compute that the Itô process

M_t = e^{W_t²/(1+2t)} / √(1+2t)

is a non-negative martingale.

Exercise 4  To motivate the horrible differential in the previous exercise, use it to prove that

lim_{t→∞} P( W_t / √(t log t) > 1 + ε ) = 0

by Markov's inequality. If you scale BM by √t, then it is unbounded, but if you scale it by √(t log t) then it is almost surely bounded.

Exercise 5  The Itô process

J_t = ∫_0^t W_s ds

is not a stochastic integral. The integrand is continuous and so the paths of J_t are nice and differentiable. Determine the mean and the variance of J_t.

Exercise 6  Partial integration! Show that the Itô process of the previous exercise satisfies

J_t = t W_t − ∫_0^t s dW_s

J_t has differentiable paths, but on the opposite side of the equation we find ∫_0^t s dW_s, which is a time-changed BM.
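For Exercise 5 you can check your pen-and-paper answer numerically first. The sketch below estimates the mean and variance of J_1 from simulated paths; it is no substitute for the computation, and the sample sizes are arbitrary choices.

```python
import math
import random

random.seed(11)
paths, N = 5000, 200
dt = 1.0 / N

vals = []
for _ in range(paths):
    w, j = 0.0, 0.0
    for _ in range(N):
        w += math.sqrt(dt) * random.gauss(0.0, 1.0)
        j += w * dt                  # Riemann sum for J_1 = int_0^1 W_s ds
    vals.append(j)

mean = sum(vals) / paths
var = sum((v - mean) ** 2 for v in vals) / paths
```

The estimates should be consistent with whatever mean and variance you compute by hand.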

CHAPTER 14

The End is Here

Every new beginning comes from some other beginning's end.
Seneca

We have reached the end of our course at the beginning of 2017. Happy New Year! Looking back on the semester, we proceeded as follows. The first 2 lectures contained measure theory essentials. Lectures 3-8 dealt with discrete time martingales. Lectures 9-13 dealt with continuous time martingales, in particular the Brownian Motion. Continuous time martingales are much richer than discrete time martingales. They make you see the connection between stochastic processes, PDE's, and statistical physics. Five lectures do not suffice to develop the full theory. If you want to know more, you should follow the courses wi4129 Stochastic Differential Equations and wi4225 Interacting Particle Systems. Stochastic calculus is the foundation of financial math and almost all courses in the financial engineering track follow up on wi4430. Some of these courses are relevant even if you are not interested in finance. You will learn more about Itô calculus in wi4079 Financial Mathematics and you will learn more about putting SDE's in your pc in wi4154 Computational Finance. On the pc side, there also is the very interesting in4337 Randomized Algorithms which has been taught by the computer science department for a long time. This is my personal favourite! Sadly, it will be discontinued after this year (too mathematical for computer science?) so sign up for the next quarter, as this is the last opportunity to take it. Finally, there is a strong connection between martingales and statistics, but I did not say much about that. I will try to make up for that below, but first let me give a quick recap of the final five lectures, by looking back at lecture 9.

14.1. A review of lecture 9

Lecture 9 led you through the main notions of stochastic calculus by example. In 1905, Einstein derived that the location W_t of a Brownian particle has a probability density which satisfies Fourier's PDE

∂f/∂t = ½ ∂²f/∂x²

More specifically, if F(x, t) = P(W_t ≤ x) then the probability density at time t is f(x, t) = F_x(x, t) and it satisfies this PDE. One can derive other PDE's for the densities of other stochastic processes. For instance, one can consider mean reverting particles as in the Ornstein-Uhlenbeck process,¹ which as you know now is BM with a time-change X_t = e^{−t} W_{e^{2t}}. In that case F(x, t) = P(X_t ≤ x) =


P(W_{e^{2t}} ≤ x e^t) and if you work out the PDE for the density, you get the Fokker-Planck equation from lecture 9. In fact, I should have called it a Fokker-Planck equation, because different Itô processes X_t produce different PDE's, which are all of the advection-diffusion type. Now that you are aware of the Itô calculus, you will recognize that the factor ½ in Fourier's equation comes from Itô's term. You should also compare Einstein's derivation of the PDE, which I copied literally at the end of lecture 9, to the proof of Itô's formula. Einstein's proof gives three integrals and omits the rest. In the proof of Itô's formula there are five sums, of which Einstein's three remain and the other two vanish. Despite the gap of forty years between these proofs, they are very similar.

The Fokker-Planck equation is fifty percent Dutch. Einstein's derivation of BM was picked up in Leiden and I included some history out of sheer patriotism. Sorry. Mathematicians prefer to say Kolmogorov's forward equation instead of FP. Already in 1931, Kolmogorov developed a general method to derive PDE's from stochastic processes, which is much more general than FP.² Naturally, if there is a forward equation, then there also is a backward equation. For BM, the forward equation and the backward equation are the same, but for general Itô processes they are different. The alternative name for the backward equation is the Feynman-Kac integral and that is what I called it in lecture 9. Again, I was being patriotic because Mark Kac was a visiting professor in Leiden and the Dutch probability seminar is named after him. In Kolmogorov's backward equation, you know a payoff h(X_T) of an Itô process at a future time T and you want to compute its expectation E_{x,t}[h(X_T)] at time t < T when the process has value x. You can do that either by simulating the Itô process in the computer or by solving the Kolmogorov backward equation by analytical methods.
In financial math, the backward equation is called the Black-Scholes equation. To simulate the forward equation, you run the Itˆoprocess from a grid point with known initial value, as in the bleating sheep example in lecture 9.3 For the backward equation you run the Itˆoprocess from a grid point until it hits a known final value, as in the random walk on the stencil in lecture 9.
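As a quick consistency check of this review, you can verify Fourier's PDE directly on the density of W_t, which is the Gaussian f(x, t) = e^{−x²/(2t)}/√(2πt), using finite differences. The evaluation point and step h below are arbitrary choices.

```python
import math

def f(x, t):
    # density of W_t: the heat kernel
    return math.exp(-x * x / (2 * t)) / math.sqrt(2 * math.pi * t)

x, t, h = 0.7, 1.3, 1e-3
ft = (f(x, t + h) - f(x, t - h)) / (2 * h)                  # df/dt
fxx = (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / (h * h)   # d2f/dx2

residual = ft - 0.5 * fxx            # should vanish for the heat kernel
```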

14.2. Building Bridges

Let's now turn to statistics. The Brownian Bridge B_t is a stochastic process that starts at zero and ends at B_T = 0. It consists of all Brownian paths such that W_T = 0. That is why it is called a bridge: it starts at zero and ends at zero and the path B_t forms a bridge between these two anchor points. It is a process that is often considered in statistics, because it is related to the empirical distribution function. Sadly, I could not find it in your textbook for wi4450. However, I did find the Blackwell-Rao theorem there, which we encountered once in passing at the end of lecture 3.

²The great Andrey Kolmogorov is the founding father of probability theory. He left his mark on almost every subject in mathematics. If you read Gut, you will find that Doob's inequalities for martingales extend earlier work of Kolmogorov for random walks.
³Bleating sheep are a bit of an anomaly, it is customary to call these urn models. These are widely used in credit risk, as you may learn in wi4228, which is a special course in the financial engineering track. By the way, you should follow another special course: wi4156 Game Theory. In my opinion, the teacher of this course is brilliant.

Here are four ways to construct BB. Solve the two exercises in between. BB is an important process in statistics, but I mainly include it because this is a nice process to test your understanding of the theory!

Construction 1  As always, use a computer first. A Brownian Bridge is a special type of Brownian Motion, and just like BM you can produce BB numerically without any problem. In Lévy's construction of BM, you start with W_0 = 0 and W_1 = randn and you keep inserting random numbers. For BB, you do the same, but you start with B_0 = B_1 = 0.

randn('state',100)
clf
%%%%%%%%% Problem parameters %%%%%%%%%%%
T = 1; Start = 0; End = 0; N = 8;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Wvals = [Start End];
for L=1:N
    dt=T/2^(L-1);
    Halfway=(Wvals(1,1:2^(L-1))+[Wvals(1,2:2^(L-1)) End])/2;
    Insert=Halfway+sqrt(dt)/2*randn(1,2^(L-1));
    Wvals=reshape([Wvals(1,1:2^(L-1));Insert],[1,2^L]);
end
Wvals = [Wvals End];
dt=T/2^L;
tvals=[0:dt:1];
plot(tvals,Wvals)
title('Inserted Brownian Bridge')
xlabel('t'), ylabel('W(t)')

You can copy and paste this code in matlab and run it, to check that it works.
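If you do not have matlab at hand, the same insertion idea fits in a few lines of Python (a rough translation, not a line-by-line port): repeatedly halve the grid and insert a midpoint with conditional standard deviation √dt/2, where dt is the current spacing.

```python
import math
import random

random.seed(100)
T, levels = 1.0, 8

bvals = [0.0, 0.0]                   # the bridge is pinned: B_0 = B_T = 0
for L in range(levels):
    dt = T / 2 ** L                  # current spacing between grid points
    new = [bvals[0]]
    for left, right in zip(bvals, bvals[1:]):
        mid = (left + right) / 2 + math.sqrt(dt) / 2 * random.gauss(0.0, 1.0)
        new += [mid, right]          # insert the midpoint between the anchors
    bvals = new

# bvals now holds a bridge sample on a grid of 2**levels + 1 points
```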

Construction 2  We now turn to the statistician's way of constructing BB. If W_t is BM, then

B_t = W_t − (t/T) W_T

is BB. In words, this says that you select a Brownian path, let it run to time T and then tilt the path so that at time T it is equal to zero. To illustrate this, I adjust my matlab code so that it says that End=1 instead of End=0. Then I tilt the path as in the figure below and I get my original BB. By the way, the reason why this works has to do with the challenge that I left for you in the previous lecture: if W_T is known, then the historian's estimate of W_t is equal to

E[W_t | W_T] = (t/T) W_T

You know that BM is a Gaussian process. Construction 2 shows that BB also is a Gaussian process, since it is the difference of Wt and (t/T )WT , both of which are Gaussian. By now, you have become an expert in computing expectations and the following exercise should not be hard.

Exercise  Let B_t be a Brownian Bridge and suppose that r < s < T. Compute that

E[B_r] = 0,  E[B_r B_s] = r(T − s)/T

by using B_t = W_t − (t/T) W_T and your knowledge of BM.

A Gaussian process is determined by the expectations E[X_t] and the E[X_s X_t]. Therefore, BB is the Gaussian process B_t with mean zero and covariance s(T − t)/T for 0 < s < t < T.

Construction 3  The previous constructions of the bridge require knowledge of W_T, which is in the future. First we fixed it at zero (construction 1), then we let it loose (construction 2), but we still needed future knowledge to tilt the path. Can we construct BB without using future knowledge? Yes we can! To keep the equation clean, I take T equal to 1. Apply the following time change to BM:
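Before doing the computation, you can guess the covariance by Monte Carlo, with T = 1 for simplicity. The sketch below tilts simulated Brownian paths and averages B_r B_s at r = 1/4, s = 1/2; the sample sizes and grid are arbitrary choices.

```python
import math
import random

random.seed(4)
paths, N = 10_000, 100
dt = 1.0 / N
r_idx, s_idx = 25, 50                # grid indices of r = 0.25 and s = 0.5

acc = 0.0
for _ in range(paths):
    w = [0.0]
    for _ in range(N):
        w.append(w[-1] + math.sqrt(dt) * random.gauss(0.0, 1.0))
    br = w[r_idx] - 0.25 * w[-1]     # tilt the path: B_t = W_t - t W_1
    bs = w[s_idx] - 0.50 * w[-1]
    acc += br * bs

cov_estimate = acc / paths           # compare with your formula at r=1/4, s=1/2
```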

B_t = (1 − t) W_{t/(1−t)}

This is the third way to define BB. To see that this Gaussian process is BB, you need to verify that it has the right mean and covariance, and that it ends at zero. It is immediately clear that E[B_t] = 0 in this construction. It is also easy to compute for r < s < 1 that

E[B_r B_s] = (1 − r)(1 − s) E[W_{r/(1−r)} W_{s/(1−s)}] = (1 − r)(1 − s) min( r/(1−r), s/(1−s) ) = r(1 − s)

This is the right covariance (remember T = 1) and I leave the final verification to you.

Exercise  Recall exercise 4 of the previous lecture. Prove that almost surely

lim_{t↑1} (1 − t) W_{t/(1−t)} = 0

Construction 4  We managed to get rid of future knowledge, but in construction 3 we needed BM that extends all the way to infinity. Fortunately, we can apply stochastic integration to rescale to finite time. This is the best construction of BB of the lot. First consider the Itô process

X_t = ∫_0^t 1/(1 − s) dW_s

This is an integral of a deterministic function and so it is a time-changed BM. According to the Itô isometry, Var(X_t) is equal to

∫_0^t 1/(1 − s)² ds = t/(1 − t)

Therefore, X_t is a time-changed Brownian motion X_t = W̃_{t/(1−t)}. Construction 3 tells us that BB is equal to (1 − t)X_t. What is the SDE of this Itô process? We need to apply Itô's formula (B-Z theorem 7.6) to the function f(t, x) = (1 − t)x and the process X_t with differential dX_t = 1/(1 − t) dW_t:

df(t, X_t) = ( f_t(t, X_t) + f_x(t, X_t) a(t) + ½ f_xx(t, X_t) b(t)² ) dt + f_x(t, X_t) b(t) dW_t

Since a(t) = 0 and b(t) = 1/(1 − t) we find

df(t, X_t) = −X_t dt + dW_t

Now remember that BB is B_t = (1 − t)X_t = f(t, X_t), which leads to the SDE

dB_t = −B_t/(1 − t) dt + dW_t

This is easy to implement on the computer, but I leave that to you. In the previous lecture, we noticed that it helps to understand an SDE if you look at the deterministic part. For this SDE, the corresponding DE is equal to

dy = −y/(1 − t) dt

which is separable. The general solution is y(t) = c(1 − t) for an arbitrary c. In other words, the general solution is a line through y(1) = 0, no matter what initial condition. If we add the stochastic term dW_t to the DE, then the paths of the Itô process will still go through B_1 = 0, no matter where we start the process. So you see, to understand SDE's, it is still useful to know your classical DE's! You can learn more about DE's in wi4019 Nonlinear Differential Equations.
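If you want to check your implementation, here is a Python sketch of the Euler scheme for the bridge SDE. Because of the singularity at t = 1 the loop stops one step short of the endpoint, where the drift has already pinned the path near zero; the step size is an arbitrary choice.

```python
import math
import random

random.seed(9)
N = 1000
dt = 1.0 / N

b, t = 0.0, 0.0
for _ in range(N - 1):               # stop one step short of the singularity at t = 1
    b += -b / (1.0 - t) * dt + math.sqrt(dt) * random.gauss(0.0, 1.0)
    t += dt

# b approximates B at time 1 - dt, which should already be close to 0
```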

And so we have learned many ways to build bridges. All over the world, communities seem to be falling apart into two camps right now. The use of building bridges to overcome walls should not be underestimated.

14.3. A final word

We just integrated dB_t = −B_t/(1 − t) dt + dW_t, which has a singularity at t = 1. Strictly speaking, according to what we know now, the integral is not defined at t = 1. Nevertheless, it can still be integrated up to t = 1 and beyond. Much more is possible and our restrictions on the integrands can be relaxed quite a bit. In lecture 11, we thought about partnering up to make a fortune out of gamblers. There I tried to integrate 1_{W_t ≤ −1} dW_t, which is not an Itô process since the function f(t, x) = 1_{x ≤ −1} is discontinuous at x = −1. However, such integrands can still be managed and you can learn how to do this in wi4129, which is taught by Marc Veraar. He is a first class specialist in the theory as well as a great teacher. Take that course!


