Probability Theory
Matthias Löwe
Academic Year 2001/2002

Contents
1 Introduction
2 Basics, Random Variables
3 Expectation, Moments, and Jensen's inequality
4 Convergence of random variables
5 Independence
6 Products and Sums of Independent Random Variables
7 Infinite product probability spaces
8 Zero-One Laws
9 Laws of Large Numbers
10 The Central Limit Theorem
11 Conditional Expectation
12 Martingales
13 Brownian motion
   13.1 Construction of Brownian Motion
14 Appendix

1 Introduction
Chance, luck, and fortune have been at the centre of the human mind ever since people started to think. At the latest when people began to play cards or roll dice for money, there arose also a desire for a mathematical description and understanding of chance. Experience tells us that even in a situation that is governed purely by chance there seems to be some regularity. For example, in a long sequence of fair coin tosses we will typically see "about half of the time" heads and "about half of the time" tails. This was formulated already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not much later the French mathematician de Moivre analyzed how much the typical number of heads in a series of $n$ fair coin tosses fluctuates around $\frac{n}{2}$. He thereby discovered the first form of what nowadays has become known as the Central Limit Theorem.
However, a mathematically "clean" description of these results could not be given by either of the two authors. The problem with such a description is that, of course, the average number of heads in a sequence of fair coin tosses will only typically converge to $\frac{1}{2}$. One could, e.g., imagine that we are rather unlucky and toss only heads. So the principal question was: "What is the probability of an event?" The standard idea in the early days was to define it as the limiting relative frequency of occurrences of the event in a long row of typical experiments. But here we are back at the original problem of what a typical sequence is. Even if we can define the concept of typical sequences, they are very difficult to work with. This is why Hilbert, in his famous talk at the International Congress of Mathematicians 1900 in Paris, mentioned the axiomatic foundation of probability theory as the sixth of his 23 open problems in mathematics. This problem was solved by A. N. Kolmogorov in 1933 by ingeniously making use of the newly developed field of measure theory: a probability is understood as a measure on the space of all outcomes of the random experiment. This measure is chosen in such a way that it has total mass one.
From this starting point the whole framework of probability theory was developed: from the laws of large numbers over the Central Limit Theorem to the very new field of mathematical finance (and many, many others). In this course we will meet the most important highlights of probability theory and then turn in the direction of stochastic processes, which are basic for mathematical finance.
2 Basics, Random Variables
As mentioned in the introduction, Kolmogorov's concept understands a probability on a space of outcomes as a measure with total mass one on this space. So in the framework of probability theory we will always consider a triple $(\Omega, \mathcal{F}, P)$ and call it a probability space:

Definition 2.1 A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

• $\Omega$ is a set; $\omega \in \Omega$ is called a state of the world.
• $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$; $A \in \mathcal{F}$ is called an event.
• $P$ is a measure on $\mathcal{F}$ with $P(\Omega) = 1$; $P(A)$ is called the probability of the event $A$.

A probability space can be considered as an experiment we perform. Of course, in an experiment in physics one is not always interested in measuring everything one could in principle measure. Such a measurement in probability theory is called a random variable.
Definition 2.2 A random variable $X$ is a mapping
$$X : \Omega \to \mathbb{R}^d$$
that is measurable. Here we endow $\mathbb{R}^d$ with its Borel $\sigma$-algebra $\mathcal{B}^d$.
The important fact about random variables is that the underlying probability space $(\Omega, \mathcal{F}, P)$ does not really matter. For example, consider the two experiments
$$\Omega_1 = \{0, 1\}, \quad \mathcal{F}_1 = \mathcal{P}(\Omega_1), \quad P_1(\{0\}) = P_1(\{1\}) = \frac{1}{2}$$
and
$$\Omega_2 = \{1, 2, \ldots, 6\}, \quad \mathcal{F}_2 = \mathcal{P}(\Omega_2), \quad P_2(\{i\}) = \frac{1}{6}, \ i \in \Omega_2.$$
Consider the random variables
$$X_1 : \Omega_1 \to \mathbb{R}, \quad \omega \mapsto \omega$$
and
$$X_2 : \Omega_2 \to \mathbb{R}, \quad i \mapsto \begin{cases} 0 & i \text{ is even} \\ 1 & i \text{ is odd.} \end{cases}$$
Then
$$P_1(X_1 = 0) = P_2(X_2 = 0) = \frac{1}{2}$$
and therefore $X_1$ and $X_2$ have the same behavior even though they are defined on completely different spaces. What we learn from this example is that what really matters for a random variable is its "distribution" $P \circ X^{-1}$:

Definition 2.3 The distribution $P_X$ of a random variable $X : \Omega \to \mathbb{R}^d$ is the following probability measure on $(\mathbb{R}^d, \mathcal{B}^d)$:
$$P_X(A) := P(X \in A) = P(X^{-1}(A)), \quad A \in \mathcal{B}^d.$$
So the distribution of a random variable is its image measure on $\mathbb{R}^d$.
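This point is easy to see in simulation. The following sketch (our own Python/NumPy illustration, not part of the original notes) builds the two random variables above on their different sample spaces and shows that their empirical distributions agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Experiment 1: fair coin, Omega_1 = {0, 1}, X_1(omega) = omega.
x1 = rng.integers(0, 2, size=n)

# Experiment 2: fair die, Omega_2 = {1, ..., 6}, X_2(i) = 0 if i is even, 1 if odd.
die = rng.integers(1, 7, size=n)
x2 = die % 2

# Both empirical distributions are close to P(X = 0) = P(X = 1) = 1/2.
print("P(X1 = 0) ≈", np.mean(x1 == 0))
print("P(X2 = 0) ≈", np.mean(x2 == 0))
```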
Example 2.4 Important distributions of random variables $X$ (which we have already met in introductory courses in probability and statistics) are:

• The Binomial distribution with parameters $n$ and $p$, i.e. a random variable $X$ is binomially distributed with parameters $n$ and $p$ ($B(n, p)$-distributed) if
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad 0 \le k \le n.$$
The binomial distribution is the distribution of the number of 1's in $n$ independent coin tosses with success probability $p$.

• The Normal distribution with parameters $\mu$ and $\sigma^2$, i.e. a random variable $X$ is normally distributed with parameters $\mu$ and $\sigma^2$ ($\mathcal{N}(\mu, \sigma^2)$-distributed) if
$$P(X \le a) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{a} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx.$$
Here $a \in \mathbb{R}$.

• The Dirac distribution with atom in $a \in \mathbb{R}$, i.e. a random variable $X$ is Dirac distributed with parameter $a$ if
$$P(X = b) = \delta_a(b) = \begin{cases} 1 & \text{if } b = a \\ 0 & \text{otherwise.} \end{cases}$$
Here $b \in \mathbb{R}$.

• The Poisson distribution with parameter $\lambda > 0$, i.e. a random variable $X$ is Poisson distributed with parameter $\lambda$ ($\mathcal{P}(\lambda)$-distributed) if
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \mathbb{N}_0 = \mathbb{N} \cup \{0\}.$$

• The Multivariate Normal distribution with parameters $\mu \in \mathbb{R}^d$ and $\Sigma$, i.e. a random variable $X$ is normally distributed in dimension $d$ with parameters $\mu$ and $\Sigma$ ($\mathcal{N}(\mu, \Sigma)$-distributed) if $\mu \in \mathbb{R}^d$, $\Sigma$ is a symmetric, positive definite $d \times d$ matrix, and for $A = (-\infty, a_1] \times \ldots \times (-\infty, a_d]$
$$P(X \in A) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \int_{-\infty}^{a_1} \ldots \int_{-\infty}^{a_d} \exp\left(-\frac{1}{2}\left\langle \Sigma^{-1}(x - \mu), (x - \mu) \right\rangle\right) dx_1 \ldots dx_d.$$
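As a quick illustration (our own sketch, not part of the notes), each of these distributions can be sampled directly with NumPy; the parameter values below are arbitrary choices, and we compare one empirical mean per distribution with its theoretical value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

b = rng.binomial(n=10, p=0.3, size=n)       # B(10, 0.3), mean 10 * 0.3 = 3
g = rng.normal(loc=2.0, scale=1.5, size=n)  # N(2, 1.5^2), mean 2
p = rng.poisson(lam=4.0, size=n)            # Poisson(4), mean 4
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # symmetric, positive definite
m = rng.multivariate_normal(mu, Sigma, size=n)

print("binomial mean:", b.mean())        # ≈ 3
print("normal mean:  ", g.mean())        # ≈ 2
print("poisson mean: ", p.mean())        # ≈ 4
print("mvn mean:     ", m.mean(axis=0))  # ≈ [0, 1]
```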
3 Expectation, Moments, and Jensen's inequality

We are now going to consider important characteristics of random variables.

Definition 3.1 The expectation of a random variable is defined as
$$E(X) := E_P(X) := \int X \, dP = \int X(\omega) \, dP(\omega)$$
if this integral is well defined.
Notice that for $A \in \mathcal{F}$ one has $E(1_A) = \int 1_A \, dP = P(A)$. Quite often one may want to integrate $f(X)$ for a function $f : \mathbb{R}^d \to \mathbb{R}^m$. How does this work?

Proposition 3.2 Let $X : \Omega \to \mathbb{R}^d$ be a random variable and $f : \mathbb{R}^d \to \mathbb{R}$ be a measurable function. Then
$$\int f \circ X \, dP = E[f \circ X] = \int f \, dP_X. \quad (3.1)$$

Proof. If $f = 1_A$, $A \in \mathcal{B}^d$, we have
$$E[f \circ X] = E(1_A \circ X) = P(X \in A) = P_X(A) = \int 1_A \, dP_X.$$
Hence (3.1) holds true for step functions $f = \sum_{i=1}^n \alpha_i 1_{A_i}$, $\alpha_i \in \mathbb{R}$, $A_i \in \mathcal{B}^d$. The standard approximation techniques yield (3.1) for general integrable $f$.

In particular, Proposition 3.2 yields that
$$E[X] = \int x \, dP_X(x).$$

Exercise 3.3 If $X : \Omega \to \mathbb{N}_0$, then
$$EX = \sum_{n=0}^{\infty} n \, P(X = n) = \sum_{n=1}^{\infty} P(X \ge n).$$

Exercise 3.4 If $X : \Omega \to \mathbb{R}$, then
$$\sum_{n=1}^{\infty} P(|X| \ge n) \le E(|X|) \le \sum_{n=0}^{\infty} P(|X| \ge n).$$
Thus $X$ has an expectation if and only if $\sum_{n=0}^{\infty} P(|X| \ge n)$ converges.
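The tail-sum formula of Exercise 3.3 is easy to check numerically. The sketch below (our own Python illustration; the Poisson distribution and truncation point are arbitrary choices) compares both sides for a Poisson random variable.

```python
import numpy as np
from math import exp, factorial

lam = 4.0
K = 200  # truncation point; the Poisson(4) tail beyond 200 is negligible

# pmf of Poisson(lambda) on {0, ..., K-1}
pmf = np.array([exp(-lam) * lam**k / factorial(k) for k in range(K)])

ex_direct = sum(k * pmf[k] for k in range(K))          # sum over n of n P(X = n)
tail = np.array([pmf[n:].sum() for n in range(1, K)])  # P(X >= n) for n >= 1
ex_tail = tail.sum()                                   # sum over n of P(X >= n)

print(ex_direct, ex_tail)  # both ≈ 4.0 = E(X)
```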
Further characteristics of random variables are the $p$-th moments.

Definition 3.5
1. For $p \ge 1$, the $p$-th moment of a random variable is defined as $E(X^p)$.
2. The centered $p$-th moment of a random variable is defined as $E[(X - EX)^p]$.
3. The variance of a random variable $X$ is its centered second moment, hence
$$V(X) := E\left[(X - EX)^2\right].$$
Its standard deviation $\sigma$ is defined as
$$\sigma := \sqrt{V(X)}.$$
Proposition 3.6 $V(X) < \infty$ if and only if $X \in \mathcal{L}^2(P)$. In this case
$$V(X) = E(X^2) - (E(X))^2 \quad (3.2)$$
as well as
$$V(X) \le E(X^2) \quad (3.3)$$
and
$$(EX)^2 \le E(X^2). \quad (3.4)$$

Proof. If $V(X) < \infty$, then $X - EX \in \mathcal{L}^2(P)$. But $\mathcal{L}^2(P)$ is a vector space and the constant $EX \in \mathcal{L}^2(P)$, hence
$$X = (X - EX) + EX \in \mathcal{L}^2(P).$$
On the other hand, if $X \in \mathcal{L}^2(P)$ then $X \in \mathcal{L}^1(P)$. Hence $EX$ exists and is a constant. Therefore also $X - EX \in \mathcal{L}^2(P)$. By linearity of expectation then:
$$V(X) = E(X - EX)^2 = EX^2 - 2(EX)^2 + (EX)^2 = EX^2 - (EX)^2.$$
This immediately implies (3.3) and (3.4).
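A quick numerical sanity check of identity (3.2) (our own sketch; the exponential distribution is an arbitrary choice of a square-integrable random variable):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)  # some X in L^2(P)

# The two sides of V(X) = E(X^2) - (EX)^2 agree up to sampling error.
print(np.mean((x - x.mean())**2))   # centered second moment
print(np.mean(x**2) - x.mean()**2)  # E(X^2) - (EX)^2
```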
Exercise 3.7 Show that $X$ and $X - EX$ have the same variance.
It will turn out in the next step that (3.4) is a special case of a much more general principle. To this end recall the concept of a convex function. A function $\varphi : \mathbb{R} \to \mathbb{R}$ is convex if for all $\alpha \in (0, 1)$ and all $x, y \in \mathbb{R}$ we have
$$\varphi(\alpha x + (1 - \alpha) y) \le \alpha \varphi(x) + (1 - \alpha) \varphi(y).$$
In a first course in analysis one learns that convexity is implied by $\varphi'' \ge 0$. On the other hand, convex functions need not be differentiable. The following exercise shows that they are close to being differentiable.

Exercise 3.8 Let $I$ be an interval and $\varphi : I \to \mathbb{R}$ be a convex function. Show that the right derivative $\varphi'_+(x)$ and the left derivative $\varphi'_-(x)$ exist for all $x \in \mathring{I}$ (the interior of $I$). Hence $\varphi$ is continuous on $\mathring{I}$. Show moreover that $\varphi'_+$ is monotonically increasing on $\mathring{I}$ and that
$$\varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x)$$
for $x \in \mathring{I}$, $y \in I$.
Applying Exercise 3.8 yields the generalization of (3.4) mentioned above.
Theorem 3.9 (Jensen's inequality) Let $X : \Omega \to \mathbb{R}$ be a random variable on $(\Omega, \mathcal{F}, P)$ and assume $X$ is $P$-integrable and takes values in an open interval $I \subseteq \mathbb{R}$. Then $EX \in I$, and for every convex $\varphi : I \to \mathbb{R}$ the composition $\varphi \circ X$ is a random variable. If this random variable $\varphi \circ X$ is $P$-integrable, it holds that
$$\varphi(E(X)) \le E(\varphi \circ X).$$

Proof. Assume $I = (\alpha, \beta)$. Thus $X(\omega) < \beta$ for all $\omega \in \Omega$, hence $E(X) \le \beta$; but then also $E(X) < \beta$. Indeed, $E(X) = \beta$ implies that the strictly positive random variable $\beta - X$ equals $0$ on a set of $P$-measure one, i.e. $P$-almost surely. This is a contradiction. Analogously: $EX > \alpha$. According to Exercise 3.8, $\varphi$ is continuous on $\mathring{I} = I$, hence Borel-measurable. Now we know
$$\varphi(y) \ge \varphi(x) + \varphi'_+(x)(y - x) \quad (3.5)$$
for all $x, y \in I$, with equality for $x = y$. Hence
$$\varphi(y) = \sup_{x \in I} \left[\varphi(x) + \varphi'_+(x)(y - x)\right] \quad (3.6)$$
for all $y \in I$. Putting $y = X(\omega)$ in (3.5) yields
$$\varphi \circ X \ge \varphi(x) + \varphi'_+(x)(X - x)$$
and by integration
$$E(\varphi \circ X) \ge \varphi(x) + \varphi'_+(x)(E(X) - x)$$
for all $x \in I$. Together with (3.6) this gives
$$E(\varphi \circ X) \ge \sup_{x \in I} \left[\varphi(x) + \varphi'_+(x)(E(X) - x)\right] = \varphi(E(X)).$$
This is the assertion of Jensen's inequality.
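A Monte Carlo illustration of Jensen's inequality (our own sketch; the convex function $\varphi(x) = e^x$ and the standard normal $X$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

lhs = np.exp(x.mean())    # phi(E X) with phi = exp; ≈ e^0 = 1
rhs = np.mean(np.exp(x))  # E phi(X); for N(0,1) this is e^{1/2} ≈ 1.65

print(lhs, "<=", rhs)     # Jensen: phi(E X) <= E phi(X)
```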
Corollary 3.10 Let $X \in \mathcal{L}^p(P)$ for some $p \ge 1$. Then
$$|E(X)|^p \le E(|X|^p).$$
Exercise 3.11 Let $I$ be an open interval and $\varphi : I \to \mathbb{R}$ be convex. For $x_1, \ldots, x_n \in I$ and $\lambda_1, \ldots, \lambda_n \in \mathbb{R}_+$ with $\sum_{i=1}^n \lambda_i = 1$, show that
$$\varphi\left(\sum_{i=1}^n \lambda_i x_i\right) \le \sum_{i=1}^n \lambda_i \varphi(x_i).$$
4 Convergence of random variables
Already in the course on measure theory we met three different types of convergence:

Definition 4.1 Let $(X_n)$ be a sequence of random variables.
1. $X_n$ converges stochastically (or in probability) to a random variable $X$ if for each $\varepsilon > 0$
$$\lim_{n \to \infty} P(|X_n - X| \ge \varepsilon) = 0.$$
2. $X_n$ converges almost surely to a random variable $X$ if for each $\varepsilon > 0$
$$P\left(\limsup_{n \to \infty} |X_n - X| \ge \varepsilon\right) = 0.$$
3. $X_n$ converges to a random variable $X$ in $\mathcal{L}^p$ (or in $p$-norm) if
$$\lim_{n \to \infty} E(|X_n - X|^p) = 0.$$
Already in measure theory we proved:

Theorem 4.2
1. If $X_n$ converges to $X$ almost surely or in $\mathcal{L}^p$, then it also converges stochastically. Neither converse is true.
2. Almost sure convergence does not imply convergence in $\mathcal{L}^p$, and vice versa.

Definition 4.3 Let $(\Omega, \mathcal{F})$ be a topological space endowed with its Borel $\sigma$-algebra $\mathcal{F}$; this means $\mathcal{F}$ is generated by the topology on $\Omega$. Moreover, for each $n \in \mathbb{N}$ let $\mu_n$ and $\mu$ be probability measures on $(\Omega, \mathcal{F})$. We say that $\mu_n$ converges weakly to $\mu$ if for each bounded, continuous, real-valued function $f : \Omega \to \mathbb{R}$ (we write $\mathcal{C}_b(\Omega)$ for the space of all such functions) it holds that
$$\mu_n(f) := \int f \, d\mu_n \to \int f \, d\mu =: \mu(f) \quad \text{as } n \to \infty. \quad (4.1)$$
Theorem 4.4 Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of real-valued random variables on a space $(\Omega, \mathcal{F}, P)$. Assume $(X_n)$ converges to a random variable $X$ stochastically. Then $P_{X_n}$ (the sequence of distributions) converges weakly to $P_X$, i.e.
$$\lim_{n \to \infty} \int f \, dP_{X_n} = \int f \, dP_X,$$
or equivalently
$$\lim_{n \to \infty} E(f \circ X_n) = E(f \circ X),$$
for all $f \in \mathcal{C}_b(\mathbb{R})$.
If $X$ is constant $P$-a.s., the converse also holds true.
Proof. First assume $f \in \mathcal{C}_b(\mathbb{R})$ is uniformly continuous. Then for $\varepsilon > 0$ there is a $\delta > 0$ such that for any $x', x'' \in \mathbb{R}$
$$|x' - x''| < \delta \Rightarrow |f(x') - f(x'')| < \varepsilon.$$
Define $A_n := \{|X_n - X| \ge \delta\}$, $n \in \mathbb{N}$. Then
$$\left|\int f \, dP_{X_n} - \int f \, dP_X\right| = |E(f \circ X_n) - E(f \circ X)| \le E(|f \circ X_n - f \circ X|)$$
$$= E(|f \circ X_n - f \circ X| \, 1_{A_n}) + E(|f \circ X_n - f \circ X| \, 1_{A_n^c}) \le 2\|f\| \, P(A_n) + \varepsilon \, P(A_n^c) \le 2\|f\| \, P(A_n) + \varepsilon.$$
Here we used the notation
$$\|f\| := \sup\{|f(x)| : x \in \mathbb{R}\}$$
and that $|f \circ X_n - f \circ X| \le |f \circ X_n| + |f \circ X| \le 2\|f\|$. But since $X_n \to X$ stochastically, $P(A_n) \to 0$ as $n \to \infty$, so for $n$ large enough $P(A_n) \le \frac{\varepsilon}{2\|f\|}$. Thus for such $n$
$$\left|\int f \, dP_{X_n} - \int f \, dP_X\right| \le 2\varepsilon.$$
Now let $f \in \mathcal{C}_b(\mathbb{R})$ be arbitrary and denote $I_n := [-n, n]$. Since $I_n \uparrow \mathbb{R}$ as $n \to \infty$ we have $P_X(I_n) \uparrow 1$ as $n \to \infty$. Thus for all $\varepsilon > 0$ there is an $n_0(\varepsilon) =: n_0$ such that
$$1 - P_X(I_{n_0}) = P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon.$$
We choose the continuous function $u_{n_0}$ in the following way:
$$u_{n_0}(x) = \begin{cases} 1 & x \in I_{n_0} \\ 0 & |x| \ge n_0 + 1 \\ -x + n_0 + 1 & n_0 < x < n_0 + 1 \\ x + n_0 + 1 & -n_0 - 1 < x < -n_0. \end{cases}$$
Eventually put $\bar{f} := u_{n_0} \cdot f$. Since $\bar{f} \equiv 0$ outside the compact set $[-n_0 - 1, n_0 + 1]$, the function $\bar{f}$ is uniformly continuous (and so is $u_{n_0}$) and hence
$$\lim_{n \to \infty} \int \bar{f} \, dP_{X_n} = \int \bar{f} \, dP_X$$
as well as
$$\lim_{n \to \infty} \int u_{n_0} \, dP_{X_n} = \int u_{n_0} \, dP_X;$$
thus also
$$\lim_{n \to \infty} \int (1 - u_{n_0}) \, dP_{X_n} = \int (1 - u_{n_0}) \, dP_X.$$
By the triangle inequality
$$\left|\int f \, dP_{X_n} - \int f \, dP_X\right| \le \int |f - \bar{f}| \, dP_{X_n} + \left|\int \bar{f} \, dP_{X_n} - \int \bar{f} \, dP_X\right| + \int |f - \bar{f}| \, dP_X. \quad (4.2)$$
For large $n \ge n_1(\varepsilon)$ we have $\left|\int \bar{f} \, dP_{X_n} - \int \bar{f} \, dP_X\right| \le \varepsilon$. Furthermore, from $0 \le 1 - u_{n_0} \le 1_{\mathbb{R} \setminus I_{n_0}}$ we obtain
$$\int (1 - u_{n_0}) \, dP_X \le P_X(\mathbb{R} \setminus I_{n_0}) < \varepsilon,$$
so that for all $n \ge n_2(\varepsilon)$ also
$$\int (1 - u_{n_0}) \, dP_{X_n} < \varepsilon.$$
This yields
$$\int |f - \bar{f}| \, dP_X = \int |f| (1 - u_{n_0}) \, dP_X \le \|f\| \varepsilon$$
on the one hand, and
$$\int |f - \bar{f}| \, dP_{X_n} \le \|f\| \varepsilon$$
for all $n \ge n_2(\varepsilon)$ on the other. Hence we obtain from (4.2) for large $n$:
$$\left|\int f \, dP_{X_n} - \int f \, dP_X\right| \le 2\|f\|\varepsilon + \varepsilon.$$
This proves weak convergence of the distributions.
For the converse let $\eta \in \mathbb{R}$ and assume $X \equiv \eta$ ($X$ is identically equal to $\eta$ $P$-almost surely). This means $P_X = \delta_\eta$, where $\delta_\eta$ is the Dirac measure concentrated in $\eta$. For the open interval $I = (\eta - \varepsilon, \eta + \varepsilon)$, $\varepsilon > 0$, we may find $f \in \mathcal{C}_b(\mathbb{R})$ with $f \le 1_I$ and $f(\eta) = 1$. Then
$$\int f \, dP_{X_n} \le P_{X_n}(I) = P(X_n \in I) \le 1.$$
Since we assumed weak convergence of $P_{X_n}$ to $P_X$ we know that
$$\int f \, dP_{X_n} \to \int f \, dP_X = f(\eta) = 1$$
as $n \to \infty$. Since $\int f \, dP_{X_n} \le P(X_n \in I) \le 1$, this implies
$$P(X_n \in I) \to 1$$
as $n \to \infty$. But
$$\{X_n \in I\} = \{|X_n - \eta| < \varepsilon\}$$
and thus
$$P(|X_n - X| \ge \varepsilon) = P(|X_n - \eta| \ge \varepsilon) = 1 - P(|X_n - \eta| < \varepsilon) \to 0$$
for all $\varepsilon > 0$. This means $X_n$ converges stochastically to $X$.

Definition 4.5 Let $X_n, X$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $P_{X_n}$ converges weakly to $P_X$ we also say that $X_n$ converges to $X$ in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are $\mathbb{R}^d$-valued and the $f : \mathbb{R}^d \to \mathbb{R}$ belong to $\mathcal{C}_b(\mathbb{R}^d)$, the statement of the theorem stays valid.
Example 4.7 For each sequence $\sigma_n > 0$ with $\lim \sigma_n = 0$ we have
$$\lim_{n \to \infty} \mathcal{N}(0, \sigma_n^2) = \delta_0 \quad \text{(weakly)}.$$
Here $\delta_0$ denotes the Dirac measure concentrated in $0 \in \mathbb{R}$. Indeed, substituting $x = \sigma y$ we obtain
$$\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} f(x) \, e^{-\frac{x^2}{2\sigma^2}} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}} f(\sigma y) \, dy.$$
Now for all $y \in \mathbb{R}$
$$\left|\frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}} f(\sigma y)\right| \le \|f\| \, e^{-\frac{y^2}{2}},$$
which is integrable. Thus by dominated convergence
$$\lim_{n \to \infty} \frac{1}{\sqrt{2\pi\sigma_n^2}} \int_{-\infty}^{\infty} e^{-\frac{x^2}{2\sigma_n^2}} f(x) \, dx = \lim_{n \to \infty} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}} f(\sigma_n y) \, dy = f(0) = \int f \, d\delta_0.$$
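This weak convergence can be watched numerically. The sketch below (our own illustration; the bounded continuous test function $f = \cos$ is an arbitrary choice) evaluates $\int f \, d\mathcal{N}(0, \sigma^2)$ by Monte Carlo as $\sigma \to 0$ and sees it approach $f(0) = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.cos  # a bounded continuous test function with f(0) = 1

for sigma in [1.0, 0.3, 0.1, 0.01]:
    x = rng.normal(0.0, sigma, size=200_000)
    # E f(X) for X ~ N(0, sigma^2) tends to f(0) = 1 as sigma -> 0
    print(sigma, np.mean(f(x)))
```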
Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true if we drop the assumption that $X \equiv \eta$ $P$-a.s.
Exercise 4.9 Assume that for a sequence $(X_n)$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ the following holds: for each given $\varepsilon > 0$,
$$P(|X_n| > \varepsilon) < \varepsilon$$
for all $n$ large enough (larger than some $n(\varepsilon)$). Is this equivalent to stochastic convergence of $X_n$ to $0$?
Exercise 4.10 For a sequence of Poisson distributions $(\pi_{\alpha_n})$ with parameters $\alpha_n > 0$ show that
$$\lim_{n \to \infty} \pi_{\alpha_n} = \delta_0 \quad \text{(weakly)}$$
if $\lim_{n \to \infty} \alpha_n = 0$. Is there a probability measure $\mu$ on $\mathcal{B}^1$ with
$$\lim_{n \to \infty} \pi_{\alpha_n} = \mu \quad \text{(weakly)}$$
if $\lim \alpha_n = \infty$?
5 Independence
The concept of independence of events is one of the most essential in probability theory. It is met already in the introductory courses. Its background is the following: Assume we are given a probability space $(\Omega, \mathcal{F}, P)$. For two events $A, B \in \mathcal{F}$ with $P(B) > 0$ one may ask how the probability of the event $A$ changes if we already know that $B$ has happened; we then only need to consider the probability of $A \cap B$. To obtain a probability again we normalize by $P(B)$ and get the conditional probability of $A$ given $B$:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Now we would call $A$ and $B$ independent if the knowledge that $B$ has happened does not change the probability that $A$ will happen or not, i.e. if $P(A \mid B)$ and $P(A)$ are the same:
$$P(A \mid B) = P(A).$$
In other words, this means $A$ and $B$ are independent if
$$P(A \cap B) = P(A) \, P(B).$$
More generally we define
Definition 5.1 A family $(A_i)_{i \in I}$ of events in $\mathcal{F}$ is called independent if for each choice of different indices $i_1, \ldots, i_n \in I$
$$P(A_{i_1} \cap \ldots \cap A_{i_n}) = P(A_{i_1}) \cdot \ldots \cdot P(A_{i_n}). \quad (5.1)$$

Exercise 5.2 Give an example of a sequence of events that are pairwise independent (i.e. each pair of events from this sequence is independent) but not independent (i.e. all events together are not independent).
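Condition (5.1) is easy to test empirically. The following sketch (our own Python illustration; the three coin events are our choice) checks the factorization for events defined from independent fair coin tosses; pairs and the triple both factorize up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
c = rng.integers(0, 2, size=(n, 3))  # three independent fair coins

A = c[:, 0] == 1
B = c[:, 1] == 1
C = c[:, 2] == 1

# pairwise check: P(A ∩ B) vs P(A) P(B)
print(np.mean(A & B), np.mean(A) * np.mean(B))          # both ≈ 0.25
# (5.1) for the triple: P(A ∩ B ∩ C) vs P(A) P(B) P(C)
print(np.mean(A & B & C), np.mean(A) * np.mean(B) * np.mean(C))  # both ≈ 0.125
```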
We generalize Definition 5.1 to set systems in the following way:
Definition 5.3 For each $i \in I$ let $\mathcal{E}_i \subseteq \mathcal{F}$ be a collection of events. $(\mathcal{E}_i)_{i \in I}$ are called independent if (5.1) holds for each choice of $i_1, \ldots, i_n \in I$, each $n \in \mathbb{N}$ and each $A_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$.
Exercise 5.4 Show the following
1. A family $(\mathcal{E}_i)_{i \in I}$ is independent if and only if every finite sub-family is independent.
2. Independence of $(\mathcal{E}_i)_{i \in I}$ is maintained if we reduce the families $(\mathcal{E}_i)$. More precisely, let $(\mathcal{E}_i)$ be independent and $\mathcal{E}'_i \subseteq \mathcal{E}_i$; then the families $(\mathcal{E}'_i)$ are also independent.
3. If for all $n \in \mathbb{N}$ the families $(\mathcal{E}_i^n)_{i \in I}$ are independent and for all $n \in \mathbb{N}$ and $i \in I$ we have $\mathcal{E}_i^n \subseteq \mathcal{E}_i^{n+1}$, then $\left(\bigcup_n \mathcal{E}_i^n\right)_{i \in I}$ are independent.

Exercise 5.5 If $(\mathcal{E}_i)_{i \in I}$ are independent, then so are the Dynkin systems $(\mathcal{D}(\mathcal{E}_i))_{i \in I}$. Here $\mathcal{D}(\mathcal{A})$ is the Dynkin system generated by $\mathcal{A}$; it coincides with the intersection of all Dynkin systems containing $\mathcal{A}$. (See the Appendix for definitions.)
Corollary 5.6 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$. Then the families $(\sigma(\mathcal{E}_i))_{i \in I}$ of the $\sigma$-algebras generated by the $\mathcal{E}_i$ are also independent.
Theorem 5.7 Let $(\mathcal{E}_i)_{i \in I}$ be an independent family of $\cap$-stable sets $\mathcal{E}_i \subseteq \mathcal{F}$ and
$$I = \dot{\bigcup}_{j \in J} I_j$$
with $I_i \cap I_j = \emptyset$, $i \ne j$. Let $\mathcal{A}_j := \sigma\left(\bigcup_{i \in I_j} \mathcal{E}_i\right)$. Then $(\mathcal{A}_j)_{j \in J}$ is also independent.

Proof. For $j \in J$ let $\tilde{\mathcal{E}}_j$ be the system of all sets of the form
$$E_{i_1} \cap \ldots \cap E_{i_n},$$
where $\emptyset \ne \{i_1, \ldots, i_n\} \subseteq I_j$ and $E_{i_\nu} \in \mathcal{E}_{i_\nu}$, $\nu = 1, \ldots, n$, are arbitrary. Then $\tilde{\mathcal{E}}_j$ is $\cap$-stable. As an immediate consequence of the independence of the $(\mathcal{E}_i)_{i \in I}$, the $(\tilde{\mathcal{E}}_j)_{j \in J}$ are also independent. Eventually $\mathcal{A}_j = \sigma(\tilde{\mathcal{E}}_j)$. Thus the assertion follows from Corollary 5.6.

In the next step we want to show that events that depend on all but finitely many $\sigma$-algebras of a countable family of independent $\sigma$-algebras can only have probability zero or one. To this end we need the following definition.
Definition 5.8 Let $(\mathcal{A}_n)_n$ be a sequence of $\sigma$-algebras from $\mathcal{F}$ and
$$\mathcal{T}_n := \sigma\left(\bigcup_{m=n}^{\infty} \mathcal{A}_m\right)$$
the $\sigma$-algebra generated by $\mathcal{A}_n, \mathcal{A}_{n+1}, \ldots$. Then
$$\mathcal{T}_\infty := \bigcap_{n=1}^{\infty} \mathcal{T}_n$$
is called the $\sigma$-algebra of the tail events.
Exercise 5.9 Why is $\mathcal{T}_\infty$ a $\sigma$-algebra?
This is now the result announced above:
Theorem 5.10 (Kolmogorov's Zero-One-Law) Let $(\mathcal{A}_n)$ be an independent sequence of $\sigma$-algebras $\mathcal{A}_n \subseteq \mathcal{F}$. Then for every tail event $A \in \mathcal{T}_\infty$ it holds that
$$P(A) = 0 \quad \text{or} \quad P(A) = 1.$$

Proof. Let $A \in \mathcal{T}_\infty$ and let $\mathcal{D}$ be the system of all sets $D \in \mathcal{F}$ that are independent of $A$. We want to show that $A \in \mathcal{D}$. In the exercise below we show that $\mathcal{D}$ is a Dynkin system. By Theorem 5.7 the $\sigma$-algebra $\mathcal{T}_{n+1}$ is independent of the $\sigma$-algebra
$$\mathcal{A}^n := \sigma(\mathcal{A}_1 \cup \ldots \cup \mathcal{A}_n).$$
Since $A \in \mathcal{T}_{n+1}$ we know that $\mathcal{A}^n \subseteq \mathcal{D}$ for every $n \in \mathbb{N}$. Thus
$$\mathcal{A} := \bigcup_{n=1}^{\infty} \mathcal{A}^n \subseteq \mathcal{D}.$$
Obviously $(\mathcal{A}^n)$ is increasing. For $E, F \in \mathcal{A}$ there thus exists $n$ with $E, F \in \mathcal{A}^n$ and hence $E \cap F \in \mathcal{A}^n$, and thus $E \cap F \in \mathcal{A}$. This means that $\mathcal{A}$ is $\cap$-stable. Since $\mathcal{A} \subseteq \mathcal{D}$, i.e. $(\mathcal{A}, \{A\})$ is independent, from Exercise 5.5 we conclude that $(\mathcal{D}(\mathcal{A}), \{A\})$ is independent, so that $\sigma(\mathcal{A}) = \mathcal{D}(\mathcal{A}) \subseteq \mathcal{D}$. Moreover $\mathcal{A}_n \subseteq \mathcal{A}$ for all $n$, hence $\mathcal{T}_1 = \sigma\left(\bigcup \mathcal{A}_n\right) \subseteq \sigma(\mathcal{A})$. Therefore
$$A \in \mathcal{T}_\infty \subseteq \mathcal{T}_1 \subseteq \sigma(\mathcal{A}) \subseteq \mathcal{D}.$$
Therefore $A$ is independent of $A$, i.e. it holds that
$$P(A) = P(A \cap A) = P(A) \cdot P(A) = (P(A))^2.$$
Hence $P(A) \in \{0, 1\}$, as asserted.

Exercise 5.11 Show that $\mathcal{D}$ from the proof of Theorem 5.10 is a Dynkin system.

As an immediate consequence of the Kolmogorov Zero-One-Law (Theorem 5.10) we obtain
Theorem 5.12 (Borel's Zero-One-Law) For each independent sequence $(A_n)_n$ of events $A_n \in \mathcal{F}$ we have
$$P(A_n \text{ for infinitely many } n) = 0 \quad \text{or} \quad P(A_n \text{ for infinitely many } n) = 1,$$
i.e. $P(\limsup A_n) \in \{0, 1\}$.

Proof. Let $\mathcal{A}_n = \sigma(A_n)$, i.e. $\mathcal{A}_n = \{\emptyset, \Omega, A_n, A_n^c\}$. It follows that $(\mathcal{A}_n)_n$ is independent. For $Q_n := \bigcup_{m=n}^{\infty} A_m$ we have $Q_n \in \mathcal{T}_n$. Since $(\mathcal{T}_n)_n$ is decreasing we even have $Q_m \in \mathcal{T}_n$ for all $m \ge n$, $n \in \mathbb{N}$. Since $(Q_n)$ is decreasing we obtain
$$\limsup_{n \to \infty} A_n = \bigcap_{k=1}^{\infty} Q_k = \bigcap_{k=j}^{\infty} Q_k \in \mathcal{T}_j$$
for all $j \in \mathbb{N}$. Hence $\limsup A_n \in \mathcal{T}_\infty$, and the assertion follows from Kolmogorov's zero-one law.
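The dichotomy can be watched in simulation. In the sketch below (our own illustration; the two choices of $P(A_n)$ are ours, and the summable case is in fact forced to the probability-zero alternative by the Borel–Cantelli lemma, which these notes have not yet stated), independent events with $P(A_n) = 2^{-n}$ essentially stop occurring, while events with $P(A_n) = 1/2$ keep occurring in every run.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2000  # number of events A_1, ..., A_N per run

for label, p in [("P(A_n) = 1/2 ", np.full(N, 0.5)),
                 ("P(A_n) = 2^-n", 0.5 ** np.arange(1, N + 1))]:
    hits = rng.random((20, N)) < p  # 20 independent runs of the sequence (A_n)
    # index of the last A_n that occurred in each run: stays tiny when
    # sum P(A_n) < infinity, is always near N in the non-summable case
    last = [int(np.nonzero(h)[0].max()) + 1 if h.any() else 0 for h in hits]
    print(label, "last occurrence per run:", last[:10])
```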
Exercise 5.13 In every $\mathcal{F}$ the pairs $(A, B)$ (any $A, B \in \mathcal{F}$ with $P(A) = 0$ or $P(A) = 1$ or $P(B) = 0$ or $P(B) = 1$) are pairs of independent sets. If these are the only pairs of independent sets we call $\mathcal{F}$ independence-free. Show that the following space is independence-free:
$$\Omega = \mathbb{N}, \quad \mathcal{F} = \mathcal{P}(\Omega), \quad P(\{k\}) = 2^{-k!}$$
for each $k \ge 2$, and $P(\{1\}) = 1 - \sum_{k=2}^{\infty} P(\{k\})$. (Hint: by passing to complements if necessary, you may assume that $1 \notin A$ and $1 \notin B$.)
A special case of the above abstract setting is the concept of independent random variables. This will be introduced next. Again we work over a probability space $(\Omega, \mathcal{F}, P)$.

Definition 5.14 A family of random variables $(X_i)_i$ is called independent if the $\sigma$-algebras $(\sigma(X_i))_i$ generated by them are independent.
For finite families there is another criterion (which is important, since by definition of independence we only need to check the independence of finite families).
Theorem 5.15 Let $X_1, \ldots, X_n$ be a sequence of $n$ random variables with values in measurable spaces $(\Omega_i, \mathcal{A}_i)$ with $\cap$-stable generators $\mathcal{E}_i$ of $\mathcal{A}_i$. Then $X_1, \ldots, X_n$ are independent if and only if
$$P(X_1 \in E_1, \ldots, X_n \in E_n) = \prod_{i=1}^n P(X_i \in E_i)$$
for all $E_i \in \mathcal{E}_i$, $i = 1, \ldots, n$.

Proof. Put
$$\mathcal{G}_i := \{X_i^{-1}(E_i) : E_i \in \mathcal{E}_i\}.$$
Then $\mathcal{G}_i$ generates $\sigma(X_i)$; $\mathcal{G}_i$ is $\cap$-stable and $\Omega \in \mathcal{G}_i$. According to Corollary 5.6 we need to show the independence of $(\mathcal{G}_i)_{i=1,\ldots,n}$, which is equivalent to
$$P(G_1 \cap \cdots \cap G_n) = P(G_1) \cdots P(G_n)$$
for all choices of $G_i \in \mathcal{G}_i$. Sufficiency is evident, since we may choose $G_i = \Omega$ for appropriate $i$.
Exercise 5.16 Random variables $X_1, \ldots, X_{n+1}$ with values in $(\Omega_i, \mathcal{F}_i)$ are independent if and only if $X_1, \ldots, X_n$ are independent and $X_{n+1}$ is independent of $\sigma(X_1, \ldots, X_n)$.
The following theorem states that a measurable deformation of an independent family of random variables stays independent:
Theorem 5.17 Let $(X_i)_{i \in I}$ be a family of independent random variables $X_i$ with values in $(\Omega_i, \mathcal{A}_i)$ and let
$$f_i : (\Omega_i, \mathcal{A}_i) \to (\Omega'_i, \mathcal{A}'_i)$$
be measurable. Then $(f_i(X_i))_{i \in I}$ is independent.

Proof. Let $i_1, \ldots, i_n \in I$. Then
$$P\left(f_{i_1}(X_{i_1}) \in A'_{i_1}, \ldots, f_{i_n}(X_{i_n}) \in A'_{i_n}\right) = P\left(X_{i_1} \in f_{i_1}^{-1}(A'_{i_1}), \ldots, X_{i_n} \in f_{i_n}^{-1}(A'_{i_n})\right)$$
$$= \prod_{\nu=1}^n P\left(X_{i_\nu} \in f_{i_\nu}^{-1}(A'_{i_\nu})\right) = \prod_{\nu=1}^n P\left(f_{i_\nu}(X_{i_\nu}) \in A'_{i_\nu}\right)$$
by the independence of $(X_i)_{i \in I}$. Here the $A'_{i_\nu} \in \mathcal{A}'_{i_\nu}$ were arbitrary.

Already Theorem 5.15 gives rise to the idea that independence of random variables may be somehow related to product measures. This is made more precise in the following theorem. To this end let $X_1, \ldots, X_n$ be random variables such that
$$X_i : (\Omega, \mathcal{F}) \to (\Omega_i, \mathcal{A}_i).$$
Define
$$Y := X_1 \otimes \cdots \otimes X_n : \Omega \to \Omega_1 \times \cdots \times \Omega_n.$$
Then the distribution of $Y$, which we denote by $P_Y$, can be computed as $P_Y = P_{X_1 \otimes \cdots \otimes X_n}$. Note that $P_Y$ is a probability measure on $\otimes_{i=1}^n \mathcal{A}_i$.
Theorem 5.18 The random variables $X_1, \ldots, X_n$ are independent if and only if their joint distribution is the product measure of the individual distributions, i.e. if
$$P_{X_1 \otimes \ldots \otimes X_n} = P_{X_1} \otimes \cdots \otimes P_{X_n}.$$

Proof. Let $A_i \in \mathcal{A}_i$, $i = 1, \ldots, n$. Then with $Y = X_1 \otimes \cdots \otimes X_n$:
$$P_Y\left(\prod_{i=1}^n A_i\right) = P\left(Y \in \prod_{i=1}^n A_i\right) = P(X_1 \in A_1, \ldots, X_n \in A_n)$$
as well as
$$P_{X_i}(A_i) = P(X_i \in A_i), \quad i = 1, \ldots, n.$$
Now $P_Y$ is the product measure of the $P_{X_i}$ if and only if
$$P_Y(A_1 \times \ldots \times A_n) = P_{X_1}(A_1) \cdot \ldots \cdot P_{X_n}(A_n).$$
But this is identical with
$$P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i).$$
According to Theorem 5.15 this is equivalent to the independence of the $X_i$.
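Theorem 5.18 can be checked empirically on a finite example. The sketch below (our own Python illustration; the two dice are our choice) estimates the joint distribution of two independent die rolls and compares it with the product of the marginals.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
x1 = rng.integers(1, 7, size=n)  # first die
x2 = rng.integers(1, 7, size=n)  # second, independent die

# empirical joint distribution of (X1, X2) on {1, ..., 6}^2
joint = np.histogram2d(x1, x2, bins=[np.arange(1, 8)] * 2)[0] / n
# product of the empirical marginals
prod = np.outer(np.bincount(x1, minlength=7)[1:] / n,
                np.bincount(x2, minlength=7)[1:] / n)

print(np.abs(joint - prod).max())  # ≈ 0: joint = product of the marginals
```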
6 Products and Sums of Independent Random Variables
In this section we will study independent random variables in greater detail.
Theorem 6.1 Let $X_1, \ldots, X_n$ be independent, real-valued random variables. Then
$$E\left(\prod_{i=1}^n X_i\right) = \prod_{i=1}^n E(X_i) \quad (6.1)$$
if $EX_i$ is well defined (and finite) for all $i$. (6.1) shows that then $E\left(\prod_{i=1}^n X_i\right)$ is also well defined.

Proof. We know that $Q := \otimes_{i=1}^n P_{X_i}$ is the joint distribution of $X_1, \ldots, X_n$. By Proposition 3.2 and Fubini's theorem
$$E\left(\prod_{i=1}^n |X_i|\right) = \int |x_1| \cdots |x_n| \, dQ(x_1, \ldots, x_n) = \int \ldots \int |x_1| \cdots |x_n| \, dP_{X_1}(x_1) \ldots dP_{X_n}(x_n)$$
$$= \int |x_1| \, dP_{X_1}(x_1) \cdots \int |x_n| \, dP_{X_n}(x_n).$$
This shows that integrability of the $X_i$ implies integrability of $\prod_{i=1}^n X_i$. In this case the equalities are also true without absolute values. This proves the result.

Exercise 6.2 For any two integrable random variables $X, Y$, Theorem 6.1 tells us that independence of $X, Y$ implies
$$E(X \cdot Y) = E(X) \, E(Y).$$
Show that the converse is not true.

Definition 6.3 For any two random variables $X, Y$ that are integrable and have an integrable product we define the covariance of $X$ and $Y$ to be
$$\mathrm{cov}(X, Y) = E[(X - EX)(Y - EY)] = E(XY) - EX \, EY.$$
$X$ and $Y$ are called uncorrelated if $\mathrm{cov}(X, Y) = 0$.

Remark 6.4 If $X, Y$ are independent, then $\mathrm{cov}(X, Y) = 0$.
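In the spirit of Exercise 6.2, the converse of Remark 6.4 fails. The sketch below (our own choice of a standard counterexample) shows numerically a dependent pair whose covariance is nevertheless $\approx 0$.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=1_000_000)
y = x**2  # Y is a deterministic function of X, hence strongly dependent on X

# cov(X, Y) = E(XY) - EX EY = E(X^3) = 0 for X ~ N(0, 1)
print(np.mean(x * y) - np.mean(x) * np.mean(y))  # ≈ 0, yet X and Y are dependent
```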
Theorem 6.5 Let $X_1, \ldots, X_n$ be square integrable random variables. Then
$$V\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n V(X_i) + \sum_{i \ne j} \mathrm{cov}(X_i, X_j). \quad (6.2)$$
In particular, if $X_1, \ldots, X_n$ are uncorrelated,
$$V\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n V(X_i). \quad (6.3)$$
Proof. We have
$$V\left(\sum_{i=1}^n X_i\right) = E\left(\sum_{i=1}^n (X_i - EX_i)\right)^2 = E\left[\sum_{i=1}^n (X_i - EX_i)^2 + \sum_{i \ne j} (X_i - EX_i)(X_j - EX_j)\right]$$
$$= \sum_{i=1}^n V(X_i) + \sum_{i \ne j} \mathrm{cov}(X_i, X_j).$$
This proves (6.2). For (6.3) just note that for uncorrelated random variables $X, Y$ one has $\mathrm{cov}(X, Y) = 0$.
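A quick numerical check of the additivity (6.3) for uncorrelated summands (our own sketch; the exponential columns are an arbitrary choice of i.i.d., hence uncorrelated, random variables):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 1_000_000, 5
x = rng.exponential(scale=1.0, size=(n, k))  # k i.i.d. Exp(1) columns, each with variance 1

# V(sum of the X_i) vs sum of the V(X_i), as in (6.3)
print(np.var(x.sum(axis=1)))    # ≈ 5
print(np.var(x, axis=0).sum())  # ≈ 5
```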
Eventually we turn to determining the distribution of the sum of independent random variables.

Theorem 6.6 Let $X_1, \ldots, X_n$ be independent $\mathbb{R}^d$-valued random variables. Then the distribution of the sum $S_n := X_1 + \ldots + X_n$ is given by the convolution product of the distributions of the $X_i$, i.e.
$$P_{S_n} := P_{X_1} * P_{X_2} * \cdots * P_{X_n}.$$

Proof. Again let $Y := X_1 \otimes \ldots \otimes X_n : \Omega \to (\mathbb{R}^d)^n$, and let
$$A_n : \mathbb{R}^d \times \ldots \times \mathbb{R}^d \to \mathbb{R}^d$$
denote vector addition. Then $S_n = A_n \circ Y$, hence a random variable. Now $P_{S_n}$ is the image measure of $P$ under $A_n \circ Y$, which we denote by $(A_n \circ Y)(P)$. Thus
$$P_{S_n} = (A_n \circ Y)(P) = A_n(P_Y).$$
Now $P_Y = \otimes P_{X_i}$. So by the definition of the convolution product
$$P_{X_1} * \ldots * P_{X_n} = A_n(P_Y) = P_{S_n}.$$
More explicitly, in the case $d = 1$, let $g(x_1, \ldots, x_n) = 1$ for $x_1 + \cdots + x_n \le s$ and $g(x_1, \ldots, x_n) = 0$ otherwise. Then an application of Fubini's theorem yields
$$P(S_n \le s) = E(g(X_1, \ldots, X_n)) = \int \int g(x_1, x_2, \ldots, x_n) \, dP_{X_1}(x_1) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n)$$
$$= \int P(X_1 \le s - x_2 - \cdots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n).$$
In the case that $X_1$ has a density $f_{X_1}$ with respect to Lebesgue measure, and the order of differentiation with respect to $s$ and integration can be exchanged, it follows that $S_n$ has a density $f_{S_n}$ with
$$f_{S_n}(s) = \int f_{X_1}(s - x_2 - \cdots - x_n) \, dP_{(X_2, \ldots, X_n)}(x_2, \ldots, x_n).$$
The same formula holds in the case that $X_1, \ldots, X_n$ have a discrete distribution (that is, almost surely assume values in a fixed countable subset of $\mathbb{R}$) if densities are taken with respect to the counting measure.
Example 6.7
1. As we learned in Introduction to Statistics, the convolution of a Binomial distribution with parameters $n$ and $p$, $B(n, p)$, and a Binomial distribution $B(m, p)$ is a $B(n + m, p)$ distribution:
$$B(n, p) * B(m, p) = B(n + m, p).$$
2. As we learned in Introduction to Statistics, the convolution of a $\mathcal{P}(\lambda)$-distribution (a Poisson distribution with parameter $\lambda$) with a $\mathcal{P}(\mu)$-distribution is a $\mathcal{P}(\mu + \lambda)$-distribution:
$$\mathcal{P}(\lambda) * \mathcal{P}(\mu) = \mathcal{P}(\lambda + \mu).$$
3. As has been communicated in Introduction to Probability:
$$\mathcal{N}(\mu, \sigma^2) * \mathcal{N}(\nu, \tau^2) = \mathcal{N}(\mu + \nu, \sigma^2 + \tau^2).$$
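The first identity is easy to verify by convolving the probability mass functions directly; a short sketch of our own, using NumPy's discrete convolution (the parameter values are arbitrary):

```python
import numpy as np
from math import comb

def binom_pmf(n, p):
    # pmf of B(n, p) on {0, ..., n}
    return np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

n, m, p = 7, 5, 0.3
conv = np.convolve(binom_pmf(n, p), binom_pmf(m, p))  # pmf of the convolution

# B(n, p) * B(m, p) = B(n + m, p), up to floating-point error
print(np.abs(conv - binom_pmf(n + m, p)).max())  # ≈ 0
```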
7 Infinite product probability spaces

Many theorems in probability theory start with: "Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables." Some of the most important theorems of probability, such as the Weak Law of Large Numbers or the Central Limit Theorem, only assume a finite family "Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables"; others, such as the Strong Law of Large Numbers, ask for the behavior of an entire infinite sequence of independent and identically distributed (i.i.d.) random variables. The natural first question to ask is: Does such a sequence exist at all? This will be shown in this section. In the same way as the existence of a finite sequence of i.i.d. random variables is related to finite product measures, the answer to the above question is related to infinite product measures. So, we assume that we are given a sequence of measure spaces $(\Omega_n, \mathcal{A}_n, \mu_n)$ of which we moreover assume that
$$\mu_n(\Omega_n) = 1 \quad \text{for all } n.$$
We construct $(\Omega, \mathcal{A})$ as follows: We want each $\omega \in \Omega$ to be a sequence $(\omega_n)_n$ where $\omega_n \in \Omega_n$. So we put
$$\Omega := \prod_{n=1}^{\infty} \Omega_n.$$
Moreover we have the idea that a probability measure on $\Omega$ should be determined by what happens on the first $n$ coordinates, $n \in \mathbb{N}$. So for $A_1 \in \mathcal{A}_1, A_2 \in \mathcal{A}_2, \ldots, A_n \in \mathcal{A}_n$, $n \in \mathbb{N}$, we want
$$A := A_1 \times \ldots \times A_n \times \Omega_{n+1} \times \Omega_{n+2} \times \ldots \quad (7.1)$$
to be in $\mathcal{A}$. By independence we want to define a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns to $A$ defined in (7.1) the mass
$$\mu(A) = \mu_1(A_1) \cdot \ldots \cdot \mu_n(A_n).$$
We will solve this problem in greater generality. Let $I$ be an index set and $(\Omega_i, \mathcal{A}_i, \mu_i)_{i \in I}$ be measure spaces with $\mu_i(\Omega_i) = 1$. For $\emptyset \ne K \subseteq I$ define
$$\Omega_K := \prod_{i \in K} \Omega_i, \quad (7.2)$$
in particular $\Omega := \Omega_I$. Let $p_J^K$ for $J \subseteq K$ denote the canonical projection from $\Omega_K$ to $\Omega_J$. For $J = \{i\}$ we will also write $p_i^K$ instead of $p_{\{i\}}^K$ and $p_i$ in place of $p_i^I$. Obviously
$$p_J^L = p_J^K \circ p_K^L \quad (J \subseteq K \subseteq L) \quad (7.3)$$
and
$$p_J := p_J^I = p_J^K \circ p_K \quad (J \subseteq K). \quad (7.4)$$
Moreover denote by
$$\mathcal{H}(I) := \{J \subseteq I : J \ne \emptyset, \ |J| \text{ is finite}\}.$$
For $J \in \mathcal{H}(I)$ the $\sigma$-algebras and measures
$$\mathcal{A}_J := \otimes_{i \in J} \mathcal{A}_i \quad \text{and} \quad \mu_J := \otimes_{i \in J} \mu_i$$
are defined by Fubini's theorem in measure theory. In analogy to the finite dimensional case we define

Definition 7.1 The product $\sigma$-algebra $\otimes_{i \in I} \mathcal{A}_i$ of the $\sigma$-algebras $(\mathcal{A}_i)_{i \in I}$ is defined as the smallest $\sigma$-algebra $\mathcal{A}$ on $\Omega$ such that all projections $p_i : \Omega \to \Omega_i$ are $(\mathcal{A}, \mathcal{A}_i)$-measurable. Hence
$$\otimes_{i \in I} \mathcal{A}_i := \sigma(p_i, \ i \in I). \quad (7.5)$$

Exercise 7.2 Show that
$$\otimes_{i \in I} \mathcal{A}_i = \sigma(p_J, \ J \in \mathcal{H}(I)). \quad (7.6)$$

According to the above we are now looking for a measure $\mu$ on $(\Omega, \mathcal{A})$ that assigns mass $\mu_1(A_1) \cdot \ldots \cdot \mu_n(A_n)$ to each $A$ as defined in (7.1). In other words,
$$\mu\left(p_J^{-1}\left(\prod_{i \in J} A_i\right)\right) = \mu_J\left(\prod_{i \in J} A_i\right).$$
The question whether such a measure exists is solved in
Theorem 7.3 On $\mathcal{A} := \otimes_{i \in I} \mathcal{A}_i$ there is a unique measure $\mu$ with
$$p_J(\mu) := \mu \circ p_J^{-1} = \mu_J \quad (7.7)$$
for all $J \in \mathcal{H}(I)$. It holds that $\mu(\Omega) = 1$.

Proof. We may assume $|I| = \infty$, since otherwise the result is known from Fubini's theorem. We start with some preparatory considerations: In Exercise 7.4 below it will be shown that $p_J^K$ is $(\mathcal{A}_K, \mathcal{A}_J)$-measurable for $J \subseteq K$ and that
$$p_J^K(\mu_K) = \mu_J \quad (J \subseteq K, \ J, K \in \mathcal{H}(I)).$$
Hence, if we introduce the $\sigma$-algebra of the $J$-cylinder sets
$$\mathcal{Z}_J := p_J^{-1}(\mathcal{A}_J) \quad (J \in \mathcal{H}(I)), \quad (7.8)$$
the measurability of $p_J^K$ implies $\left(p_J^K\right)^{-1}(\mathcal{A}_J) \subseteq \mathcal{A}_K$ and thus
$$\mathcal{Z}_J \subseteq \mathcal{Z}_K \quad (J \subseteq K, \ J, K \in \mathcal{H}(I)). \quad (7.9)$$
Eventually we introduce the system of all cylinder sets
$$\mathcal{Z} := \bigcup_{J \in \mathcal{H}(I)} \mathcal{Z}_J.$$
Note that due to (7.9), for $Z_1, Z_2 \in \mathcal{Z}$ we have $Z_1, Z_2 \in \mathcal{Z}_J$ for a suitably chosen $J \in \mathcal{H}(I)$. Hence $\mathcal{Z}$ is an algebra (but generally not a $\sigma$-algebra). From (7.5) and (7.6) it follows that
$$\mathcal{A} = \sigma(\mathcal{Z}).$$
Now we come to the main part of the proof. This will be divided into four parts.

1. Assume $\mathcal{Z} \ni Z = p_J^{-1}(A)$, $J \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$. According to (7.7), $Z$ must get mass $\mu(Z) = \mu_J(A)$. We have to show that this is well defined. So let
$$Z = p_J^{-1}(A) = p_K^{-1}(B)$$
for $J, K \in \mathcal{H}(I)$, $A \in \mathcal{A}_J$, $B \in \mathcal{A}_K$. If $J \subseteq K$ we obtain
$$p_J^{-1}(A) = p_K^{-1}\left(\left(p_J^K\right)^{-1}(A)\right)$$
and thus
$$p_K^{-1}(B) = p_K^{-1}(B') \quad \text{with} \quad B' := \left(p_J^K\right)^{-1}(A).$$
Since $p_K(\Omega) = \Omega_K$ we obtain
$$B = B' = \left(p_J^K\right)^{-1}(A).$$
Thus by the introductory considerations
$$\mu_K(B) = \mu_J(A).$$
For arbitrary $J, K$ define $L := J \cup K$. Since $J, K \subseteq L$, (7.9) implies the existence of $C \in \mathcal{A}_L$ with $p_L^{-1}(C) = p_J^{-1}(A) = p_K^{-1}(B)$. Therefore, from what we have just seen, $\mu_L(C) = \mu_J(A)$ and $\mu_L(C) = \mu_K(B)$. Hence $\mu_J(A) = \mu_K(B)$. Thus the function
$$\mu_0\left(p_J^{-1}(A)\right) := \mu_J(A) \quad (J \in \mathcal{H}(I), \ A \in \mathcal{A}_J) \quad (7.10)$$
is well defined on $\mathcal{Z}$.

2. Now we show that $\mu_0$ as defined in (7.10) is a volume on $\mathcal{Z}$. Trivially it holds that $\mu_0 \ge 0$ and $\mu_0(\emptyset) = 0$. Moreover, as shown above, for $Y, Z \in \mathcal{Z}$ with $Y \cap Z = \emptyset$ there are $J \in \mathcal{H}(I)$ and $A, B \in \mathcal{A}_J$ such that $Y = p_J^{-1}(A)$, $Z = p_J^{-1}(B)$. Now $Y \cap Z = \emptyset$ implies $A \cap B = \emptyset$, and due to
$$Y \cup Z = p_J^{-1}(A \cup B)$$
we obtain
$$\mu_0(Y \cup Z) = \mu_J(A \cup B) = \mu_J(A) + \mu_J(B) = \mu_0(Y) + \mu_0(Z),$$
hence the finite additivity of $\mu_0$.

It remains to show that $\mu_0$ is also $\sigma$-additive. Then the general principles from measure theory yield that $\mu_0$ can be uniquely extended to a measure $\mu$ on $\sigma(\mathcal{Z}) = \mathcal{A}$. $\mu$ is also a probability measure, because $\Omega = p_J^{-1}(\Omega_J)$ for all $J \in \mathcal{H}(I)$ and therefore
$$\mu(\Omega) = \mu_0(\Omega) = \mu_J(\Omega_J) = 1.$$
To prove the $\sigma$-additivity of $\mu_0$ we first show:

3. Let $Z \in \mathcal{Z}$ and $J \in \mathcal{H}(I)$. Then for all $\omega_J \in \Omega_J$ the set
$$Z^{\omega_J} := \{\omega \in \Omega : (\omega_J, p_{I \setminus J}(\omega)) \in Z\}$$
is a cylinder set. This set consists of all $\omega \in \Omega$ with the following property: if we replace the coordinates $\omega_i$ with $i \in J$ by the corresponding coordinates of $\omega_J$, we obtain a point in $Z$. Moreover
$$\mu_0(Z) = \int \mu_0(Z^{\omega_J}) \, d\mu_J(\omega_J). \quad (7.11)$$
This is shown by the following consideration. For $Z \in \mathcal{Z}$ there are $K \in \mathcal{H}(I)$ and $A \in \mathcal{A}_K$ such that $Z = p_K^{-1}(A)$; this means that $\mu_0(Z) = \mu_K(A)$. Since $I$ is infinite we may assume $J \subset K$ and $J \ne K$. For the $\omega_J$-section of $A$ in $\Omega_K$, which we call $A^{\omega_J}$, i.e. for the set of all $\omega' \in \Omega_{K \setminus J}$ with $(\omega_J, \omega') \in A$, it holds that
$$Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J}).$$
By Fubini's theorem $A^{\omega_J} \in \mathcal{A}_{K \setminus J}$, and hence $Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J})$ is a $(K \setminus J)$-cylinder set. Since $\mu_K = \mu_J \otimes \mu_{K \setminus J}$, Fubini's theorem implies
$$\mu_0(Z) = \mu_K(A) = \int \mu_{K \setminus J}(A^{\omega_J}) \, d\mu_J(\omega_J). \quad (7.12)$$
But this is (7.11), since
$$\mu_0(Z^{\omega_J}) = \mu_{K \setminus J}(A^{\omega_J})$$
(because of $Z^{\omega_J} = p_{K \setminus J}^{-1}(A^{\omega_J})$).
4. Eventually we show that $\mu_0$ is $\emptyset$-continuous and thus $\sigma$-additive. To this end let $(Z_n)$ be a decreasing family of cylinder sets in $\mathcal{Z}$ with $\alpha := \inf_n \mu_0(Z_n) > 0$. We will show that
$$\bigcap_{n=1}^{\infty} Z_n \ne \emptyset. \quad (7.13)$$
Now each $Z_n$ is of the form $Z_n = p_{J_n}^{-1}(A_n)$, $J_n \in \mathcal{H}(I)$, $A_n \in \mathcal{A}_{J_n}$. Due to (7.9) we may assume $J_1 \subseteq J_2 \subseteq J_3 \subseteq \ldots$. We apply the result proved in 3. to $J = J_1$ and $Z = Z_n$. As $\omega_{J_1} \mapsto \mu_0(Z_n^{\omega_{J_1}})$ is $\mathcal{A}_{J_1}$-measurable