INTRODUCTION TO PROBABILITY

ARIEL YADIN

Course: 201.1.8001 Fall 2013-2014 Lecture notes updated: December 24, 2014

Contents

Lecture 1
Lecture 2
Lecture 3
Lecture 4
Lecture 5
Lecture 6
Lecture 7
Lecture 8
Lecture 9
Lecture 10
Lecture 11
Lecture 12
Lecture 13
Lecture 14
Lecture 15
Lecture 16
Lecture 17
Lecture 18
Lecture 19
Lecture 20
Lecture 21
Lecture 22
Lecture 23
Lecture 24

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 1

1.1. Example: Bertrand’s Paradox

We begin with an example; this is known as Bertrand's paradox, after Joseph Louis François Bertrand (1822–1900).

Question 1.1. Consider a circle of radius 1, and an equilateral triangle inscribed in the circle, say ABC. (The side length of such a triangle is √3.) Let M be a randomly chosen chord of the circle. What is the probability that the length of M is greater than the length of a side of the triangle (i.e. greater than √3)?

Solution 1. How do we choose a random chord M? One way is to choose a random angle, let r be the radius at that angle, and let M be the unique chord perpendicular to r. Let x be the intersection point of M with r. (See Figure 1, left.) Because of symmetry, we can rotate the triangle so that the side AB is perpendicular to r. Since the sides of the triangle intersect the radius at distance 1/2 from the center 0, M is longer than AB if and only if x is at distance at most 1/2 from 0. Since r has length 1, the probability that x is at distance at most 1/2 is 1/2. □

Solution 2. Another way: choose two random points x, y on the circle, and let M be the chord connecting them. (See Figure 1, right.) Because of symmetry, we can rotate the triangle so that x coincides with the vertex A of the triangle. Then y falls in the arc BC of the circle if and only if M is longer than a side of the triangle.


Figure 1. How to choose a random chord.

The probability of this is the length of the arc BC over 2π; that is, 1/3 (the arc BC is one-third of the circle). □

Solution 3. A different way to choose a random chord M: choose a random point x in the circle, and let r be the radius through x. Then choose M to be the chord through x perpendicular to r. (See Figure 1, left.) Again we can rotate the triangle so that r is perpendicular to the side AB. Then, M will be longer than AB if and only if x lands inside the triangle; that is, if and only if the distance of x to 0 is at most 1/2. Since the area of a disc of radius 1/2 is 1/4 of the area of the disc of radius 1, this happens with probability 1/4. □

How did we reach this paradox? The original question was ill-posed: we did not define precisely what a random chord is. The different solutions come from different ways of choosing a random chord, and these are not the same. We now turn to precisely defining what we mean by "random", "probability", etc.
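The three chord models above can be compared empirically. The following Monte Carlo sketch (not part of the original notes; all function names are ours) samples each model and estimates the probability that the chord is longer than √3:

```python
import math
import random

def chord_endpoints(rng):
    # Solution 2: chord between two uniform points on the circle
    a, b = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))

def chord_radius(rng):
    # Solution 1: uniform point x on a fixed radius, chord through x perpendicular to it
    d = rng.random()
    return 2 * math.sqrt(1 - d * d)

def chord_point(rng):
    # Solution 3: uniform point in the disc, chord perpendicular to the radius through it
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return 2 * math.sqrt(1 - (x * x + y * y))

rng = random.Random(0)
n = 200_000
s3 = math.sqrt(3)
probs = {name: sum(f(rng) > s3 for _ in range(n)) / n
         for name, f in [("endpoints", chord_endpoints),
                         ("radius", chord_radius),
                         ("point", chord_point)]}
# probs should be close to {"endpoints": 1/3, "radius": 1/2, "point": 1/4}
```

The three estimates land near 1/3, 1/2 and 1/4 respectively, matching the three solutions: the "paradox" is simply three different distributions on chords.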

1.2. Sample spaces

When we do some experiment, we first want to collect all possible outcomes of the experiment. In mathematics, a collection of objects is just a set. The set of all possible outcomes of an experiment is called a sample space. Usually we denote the sample space by Ω, and its elements by ω.

Example 1.2.

• A coin toss. Ω = {H, T}. Actually, it is the set {heads, tails}, but H, T are easier to write.
• Tossing a coin twice. Ω = {H, T}². What about tossing a coin k times?
• Throwing two dice. Ω is the set of pairs of die faces; it is probably easier to use {1, 2, 3, 4, 5, 6}².
• The lifetime of a person. Ω = R₊. What if we only count the years? What if we want years and months?
• Bows and arrows: shooting an arrow into a target of radius 1/2 meters. Ω = {(x, y) : x² + y² ≤ 1/4}. What about Ω = {(r, θ) : r ∈ [0, 1/2], θ ∈ [0, 2π)}? What if the arrow tip has a thickness of 1cm, so we do not care about the point up to radius 1cm? What about missing the target altogether?
• A random real-valued continuously differentiable function on [0, 1]. Ω = C¹([0, 1]).
• Noga tosses a coin. If it comes out heads she goes to the probability lecture, and either takes notes or sleeps throughout. If the coin comes out tails, she goes running, and she records the distance she ran in meters and the time it took. Ω = ({H} × {notes, sleep}) ∪ ({T} × N × R₊). △

1.3. Events

Suppose we toss a die. So the sample space is, say, Ω = {1, 2, 3, 4, 5, 6}. We want to encode the outcome "the number on the die is even".

This can be done by collecting together all the relevant possible outcomes of this event; that is, a sub-collection of outcomes, or a subset of Ω. The event "the number on the die is even" corresponds to the subset {2, 4, 6} ⊂ Ω.

What do we want to require from events? We want the following properties:

• "Everything" is an event; that is, Ω is an event.
• If we can ask whether an event occurred, then we can also ask whether it did not occur; that is, if A is an event, then so is Aᶜ.
• If we have many events of which we can ask whether they have occurred, then we can also ask whether one of them has occurred; that is, if (Aₙ)_{n∈N} is a sequence of events, then so is ⋃ₙ Aₙ.

A word on notation: Aᶜ = Ω \ A. One needs to be careful in which space we are taking the complement. Some books use Ā.

Thus, if we want to say what events are, we have: if ℱ is the collection of events on a sample space Ω, then ℱ has the following properties:

• The elements of ℱ are subsets of Ω (i.e. ℱ is contained in the power set 2^Ω).
• Ω ∈ ℱ.
• If A ∈ ℱ then Aᶜ ∈ ℱ.
• If (Aₙ)_{n∈N} is a sequence of elements of ℱ, then ⋃ₙ Aₙ ∈ ℱ.

Definition 1.3. ℱ with the properties above is called a σ-algebra (or σ-field).

(Exercise: explain the names "algebra" and "σ-algebra".)

Example 1.4. Let Ω be any set (sample space). Then,

• ℱ = {∅, Ω}
• G = 2^Ω, the set of all subsets of Ω
are both σ-algebras on Ω. △

When Ω is a countable sample space, we will always take the full σ-algebra 2^Ω. (We will worry about the uncountable case in the future.)

To sum up: for a countable sample space Ω, any subset A ⊂ Ω is an event.

Example 1.5.

• The experiment is tossing three coins. The event A is "the second toss was heads". Ω = {H, T}³. A = {(x, H, y) : x, y ∈ {H, T}}.
• The experiment is "how many minutes passed until a goal is scored in the Manchester derby". The event A is "the first goal was scored after minute 23 and before minute 45". Ω = {0, 1, ..., 90} (maybe extra-time?). A = {23, 24, ..., 44}.

△

Example 1.6. We are given a computer code of 4 letters. Ai is the event that the i-th letter is ‘L’. What are the following events?

• A₁ ∩ A₂ ∩ A₃ ∩ A₄
• A₁ ∪ A₂ ∪ A₃ ∪ A₄
• A₁ᶜ ∩ A₂ᶜ ∩ A₃ᶜ ∩ A₄ᶜ
• A₁ᶜ ∪ A₂ᶜ ∪ A₃ᶜ ∪ A₄ᶜ
• A₃ ∩ A₄
• A₁ ∩ A₂ᶜ

What are the following events in terms of the Aᵢ's?
• There are at least 3 L's.
• There is exactly one L.
• There are no two L's in a row.

△
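Events as subsets can be manipulated directly as sets. A small sketch of the example above (ours; the 3-letter alphabet "KLM" is a hypothetical stand-in, since the original does not specify an alphabet):

```python
from itertools import product

alphabet = "KLM"  # hypothetical stand-in alphabet; any alphabet containing 'L' works
omega = set(product(alphabet, repeat=4))                    # all 4-letter codes
A = [{w for w in omega if w[i] == "L"} for i in range(4)]   # A[i]: i-th letter is 'L'

all_L = A[0] & A[1] & A[2] & A[3]    # every letter is 'L'
some_L = A[0] | A[1] | A[2] | A[3]   # at least one letter is 'L'
no_L = omega - some_L                # complement of the union

# de Morgan: "no L anywhere" equals the intersection of the complements
assert no_L == (omega - A[0]) & (omega - A[1]) & (omega - A[2]) & (omega - A[3])

at_least_3 = {w for w in omega if sum(c == "L" for c in w) >= 3}
exactly_1 = {w for w in omega if sum(c == "L" for c in w) == 1}
```

With this alphabet, |Ω| = 3⁴ = 81, there are 9 codes with at least 3 L's and 32 with exactly one.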

Example 1.7. Every day, products on a production line are chosen and labeled '−' if damaged or '+' if good. What are the complements of the following events?

• A = at least two products are damaged
• B = at most 3 products are damaged
• C = there are at least 3 more good products than damaged products
• D = there are no damaged products
• E = most of the products are damaged

To which of the above events does the string + + + + − − − − − − belong?
What are the events A ∩ B, A ∪ B, C ∩ E, B ∩ D, B ∪ D? △


Lecture 2

2.1. Probability

Remark 2.1. Two sets A, B are called disjoint if A ∩ B = ∅. For a sequence (Aₙ)_{n∈N}, we say that (Aₙ) are mutually disjoint if any two are disjoint; i.e. for any n ≠ m, Aₙ ∩ Aₘ = ∅.

What is probability? It is the assignment of a value to each event, between 0 and 1, saying how likely this event is. That is, if ℱ is the collection of all events on some sample space, then probability is a function from the events to [0, 1]; i.e. P : ℱ → [0, 1]. Of course, not every such assignment is good, and we would like some consistency.

What would we like of P? For one thing, we would like the likelihood of everything to be 100%; that is, P(Ω) = 1. Also, we would like that if A, B ∈ ℱ are two events that are disjoint, i.e. A ∩ B = ∅, then the likelihood that one of A, B occurs is the sum of their likelihoods; i.e. P(A ∪ B) = P(A) + P(B).

This leads to the following definition:

Definition 2.2 (Probability measure; the axioms are due to Andrey Kolmogorov, 1903–1987). Let Ω be a sample space, and ℱ a σ-algebra on Ω. A probability measure on (Ω, ℱ) is a function P : ℱ → [0, 1] such that
• P(Ω) = 1.
• If (Aₙ) is a sequence of mutually disjoint events in ℱ, then
P(⋃ₙ Aₙ) = Σₙ P(Aₙ).

(Ω, ℱ, P) is called a probability space.

Example 2.3.

• A fair coin toss: Ω = {H, T}, ℱ = 2^Ω = {∅, {H}, {T}, Ω}. P(∅) = 0, P({H}) = P({T}) = 1/2, P(Ω) = 1.
• An unfair coin toss: P(∅) = 0, P({H}) = 3/4, P({T}) = 1/4, P(Ω) = 1.
• A fair die: Ω = {1, 2, ..., 6}, ℱ = 2^Ω. For A ⊂ Ω let P(A) = |A|/6.
• Let Ω be any countable set. Let ℱ = 2^Ω. Let p : Ω → [0, 1] be a function such that Σ_{ω∈Ω} p(ω) = 1. Then, for any A ⊂ Ω define
P(A) = Σ_{ω∈A} p(ω).
P is a probability measure on (Ω, ℱ).
• Let γ = Σ_{n≥1} n⁻². For A ⊂ {1, 2, ...} define
P(A) = (1/γ) · Σ_{a∈A} a⁻².
P is a probability measure.

△

Some properties of probability measures:

Proposition 2.4. Let (Ω, ℱ, P) be a probability space. Then,
(1) ∅ ∈ ℱ and P(∅) = 0.
(2) If A₁, ..., Aₙ are a finite number of mutually disjoint events, then A := ⋃_{j=1}^n Aⱼ ∈ ℱ and P(A) = Σ_{j=1}^n P(Aⱼ).
(3) For any event A ∈ ℱ, P(Aᶜ) = 1 − P(A).
(4) If A ⊂ B are events, then P(B \ A) = P(B) − P(A).
(5) If A ⊂ B are events, then P(A) ≤ P(B).
(6) (Inclusion-Exclusion Principle.) For events A, B ∈ ℱ,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Let A, B, (Aₙ)_{n∈N} be events in ℱ.

(1) ∅ = Ωᶜ ∈ ℱ. Set Aⱼ = ∅ for all j ∈ N. This is a sequence of mutually disjoint events. Since ∅ = ⋃ⱼ Aⱼ, we have that P(∅) = Σⱼ P(∅), implying that P(∅) = 0.
(2) For j > n let Aⱼ = ∅. Then (Aⱼ)_{j∈N} is a sequence of mutually disjoint events, so
P(⋃_{j=1}^n Aⱼ) = P(⋃ⱼ Aⱼ) = Σⱼ P(Aⱼ) = Σ_{j=1}^n P(Aⱼ).
(3) A ∪ Aᶜ = Ω and A ∩ Aᶜ = ∅. So the additivity of P on disjoint sets gives that P(A) + P(Aᶜ) = P(Ω) = 1.
(4) If A ⊂ B, then B = (B \ A) ∪ A, and (B \ A) ∩ A = ∅. Thus, P(B) = P(B \ A) + P(A).
(5) Since A ⊂ B, we have that P(B) − P(A) = P(B \ A) ≥ 0.
(6) Let C = A ∩ B. Note that A ∪ B = (A \ C) ∪ (B \ C) ∪ C, where all sets in the union are mutually disjoint. Since C ⊂ A and C ⊂ B, we get

P(A ∪ B) = P(A \ C) + P(B \ C) + P(C) = P(A) − P(C) + P(B) − P(C) + P(C) = P(A) + P(B) − P(C). □
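These properties are easy to check numerically on a small uniform space. A sketch (ours), using a fair die and exact rational arithmetic:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}        # fair die, uniform measure
def P(E):
    return Fraction(len(E), len(omega))

A = {2, 4, 6}                     # "even"
B = {4, 5, 6}                     # "at least 4"

assert P(omega - A) == 1 - P(A)               # property (3), complement rule
assert P(B - A) == P(B) - P(A & B)            # property (4), applied with A∩B ⊂ B
assert P(A | B) == P(A) + P(B) - P(A & B)     # property (6), inclusion-exclusion
```

Of course this is only a sanity check on one finite space, not a proof; the proof above covers the general case.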

Proposition 2.5 (Boole's inequality; after George Boole, 1815–1864). Let (Ω, ℱ, P) be a probability space. If (Aₙ)_{n∈N} are events, then

P(⋃_{n∈N} Aₙ) ≤ Σ_{n∈N} P(Aₙ).

Proof. For every n let B₀ = ∅, and
Bₙ = ⋃_{j=1}^n Aⱼ and Cₙ = Aₙ \ Bₙ₋₁.

Claim 1. For any n > m ≥ 1, we have that Cₙ ∩ Cₘ = ∅.
Proof. Since m ≤ n − 1, we have that Aₘ ⊂ Bₙ₋₁. Since Cₘ ⊂ Aₘ,

Cₙ ∩ Cₘ ⊂ (Aₙ \ Bₙ₋₁) ∩ Bₙ₋₁ = ∅.

Claim 2. ⋃ₙ Aₙ ⊂ ⋃ₙ Cₙ.
Proof. Let ω ∈ ⋃ₙ Aₙ. So there exists n such that ω ∈ Aₙ. Let m be the smallest integer such that ω ∈ Aₘ (if ω ∈ A₁ let m = 1). Then, ω ∈ Aₘ, and ω ∉ Aₖ for all 1 ≤ k < m. In other words: ω ∈ Aₘ and ω ∉ ⋃_{k=1}^{m−1} Aₖ = Bₘ₋₁. Or: ω ∈ Aₘ \ Bₘ₋₁ = Cₘ. Thus, there exists some m ≥ 1 such that ω ∈ Cₘ; i.e. ω ∈ ⋃ₙ Cₙ.

We conclude that (Cₙ)_{n≥1} is a collection of events that are pairwise disjoint, and

⋃ₙ Aₙ ⊂ ⋃ₙ Cₙ. Using (5) from Proposition 2.4,
P(⋃ₙ Aₙ) ≤ P(⋃ₙ Cₙ) = Σₙ P(Cₙ) ≤ Σₙ P(Aₙ). □

Example 2.6 (Monty Hall paradox). In a famous game show, the contestant is given three boxes; two of them contain nothing, but one contains the keys to a new car. All key placements are equally likely. The contestant chooses a box. Then the host reveals one of the other two boxes that does not contain anything, and the contestant is given a choice whether to switch her choice or not. What should she do?
A = {an empty box is chosen in the first choice}. B = {the keys are in the other box (she should switch)}. One checks that B ∩ Aᶜ = ∅ and A ∩ Bᶜ = ∅, so A = B. Since all key placements are equally likely, the probability that the keys are in the chosen box is 1/3. That is, P(Aᶜ) = 1/3, and so P(B) = P(A) = 2/3. △

2.2. Discrete Probability Spaces

Discrete probability spaces are those for which the sample space is countable. We have already seen that in this case we can take all subsets to be events, so ℱ = 2^Ω. We have also implicitly seen that, due to additivity on disjoint sets, the probability measure P is completely determined by its value on singletons.

That is, let (Ω, 2^Ω, P) be a probability space where Ω is countable. If we denote p(ω) = P({ω}) for all ω ∈ Ω, then for any event A ⊂ Ω we have that A = ⋃_{ω∈A} {ω}, and this is a countable disjoint union. Thus,

P(A) = Σ_{ω∈A} P({ω}) = Σ_{ω∈A} p(ω).

Exercise 2.7. Let Ω = {1, 2, ...}, and define a probability measure on (Ω, 2^Ω) by
(1) P({ω}) = 1/((e − 1) · ω!).
(2) P′({ω}) = (1/3) · (3/4)^ω.
(Extend using additivity on disjoint unions.) Show that both uniquely define probability measures.

Solution. We need to show that P(Ω) = 1, and that P is additive on disjoint unions. Indeed,
P(Ω) = Σ_{ω=1}^∞ P({ω}) = Σ_{ω=1}^∞ 1/((e − 1) · ω!) = 1,
and
P′(Ω) = Σ_{ω=1}^∞ (1/3) · (3/4)^ω = 1.
Now let (Aₙ) be a sequence of mutually disjoint events and let A = ⋃ₙ Aₙ. Using the fact that any subset is the disjoint union of the singletons composing it, we have that ω ∈ A if and only if there exists a unique n(ω) such that ω ∈ Aₙ. Thus,
P(A) = Σ_{ω∈A} P({ω}) = Σₙ Σ_{ω∈Aₙ} P({ω}) = Σₙ P(Aₙ).
The same holds for P′. □
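The two normalizations can be checked numerically (a sketch; the truncation points below are arbitrary but make the tails negligible):

```python
import math

# P({ω}) = 1/((e − 1)·ω!): since Σ_{ω≥1} 1/ω! = e − 1, the series sums to 1
s1 = sum(1 / ((math.e - 1) * math.factorial(w)) for w in range(1, 40))

# P'({ω}) = (1/3)·(3/4)^ω: geometric series, Σ_{ω≥1} (3/4)^ω = 3, so the sum is 1
s2 = sum((1 / 3) * (3 / 4) ** w for w in range(1, 500))
```

Both partial sums agree with 1 to within floating-point error.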

Proposition 2.8. Let Ω be a countable set. Let p : Ω → [0, 1] be such that Σ_{ω∈Ω} p(ω) = 1. Then, there exists a unique probability measure P on (Ω, 2^Ω) such that P({ω}) = p(ω) for all ω ∈ Ω. (p as above is sometimes called the density of P.)

Proof. For A ⊂ Ω define
P(A) = Σ_{ω∈A} p(ω).
We have by the assumption on p that P(Ω) = Σ_{ω∈Ω} p(ω) = 1. Also, if (Aₙ) is a sequence of events that are mutually disjoint, and A = ⋃ₙ Aₙ is their union, then for any ω ∈ A there exists n such that ω ∈ Aₙ. Moreover, this n is unique, since Aₙ ∩ Aₘ = ∅ for all m ≠ n, so ω ∉ Aₘ for all m ≠ n. So
P(A) = Σ_{ω∈A} p(ω) = Σₙ Σ_{ω∈Aₙ} p(ω) = Σₙ P(Aₙ).
This shows that P is a probability measure on (Ω, 2^Ω).
For uniqueness, let P′ : 2^Ω → [0, 1] be a probability measure such that P′({ω}) = p(ω) for all ω ∈ Ω. Then, for any A ⊂ Ω,
P′(A) = Σ_{ω∈A} P′({ω}) = Σ_{ω∈A} p(ω) = P(A). □

(1) A simple but important example is for finite sample spaces. Let Ω be a finite set. Suppose we assume that all outcomes are equally likely; that is, P({ω}) = 1/|Ω| for all ω ∈ Ω, and so, for any event A ⊂ Ω we have that P(A) = |A|/|Ω|. It is simple to verify that this is a probability measure. This is known as the uniform measure on Ω.
(2) We throw two fair dice. What is the probability the sum is 7? What is the probability the sum is 6?

Solution. The sample space here is Ω = {1, 2, ..., 6}² = {(i, j) : 1 ≤ i, j ≤ 6}. Since we assume the dice are fair, all outcomes are equally likely, and so the probability measure is the uniform measure on Ω. Now, the event that the sum is 7 is

A = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}.
So P(A) = |A|/|Ω| = 6/36 = 1/6.

The event that the sum is 6 is

B = {(i, j) : 1 ≤ i, j ≤ 6, i + j = 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
So P(B) = 5/36.

(3) There are 15 balls in a jar, 7 black balls and 8 white balls. When removing a ball from the jar, any ball is equally likely to be removed. Shir puts her hand in the jar and takes out two balls, one after the other. What is the probability that Shir removes one black ball and one white ball?

Solution. First, we can think of the different balls as being represented by the numbers 1, 2, ..., 15. So the sample space is Ω = {(i, j) : 1 ≤ i ≠ j ≤ 15}. Note that the size of Ω is |Ω| = 15 · 14. Since it is equally likely to remove any ball, the probability measure is the uniform measure on Ω.
Let A be the event that one ball is black and one is white. How can we compute P(A) = |A|/|Ω|?
We can split A into two disjoint events, and use additivity of probability measures: let B be the event that the first ball is white and the second ball is black, and let B′ be the event that the first ball is black and the second ball is white. It is immediate that B and B′ are disjoint. Also, A = B ∪ B′. Thus, P(A) = P(B) + P(B′), and we only need to compute the sizes of B, B′. Note that the number of pairs (i, j) such that ball i is black and ball j is white is 7 · 8. So |B′| = 7 · 8. Similarly, |B| = 8 · 7. All in all,
P(A) = P(B) + P(B′) = (8 · 7)/(15 · 14) + (7 · 8)/(15 · 14) = 8/15.

(4) A deck of 52 cards is shuffled, so that any order is equally likely. The top 10 cards are distributed among 2 players, 5 to each one. What is the probability that at least one player has a royal flush (10-J-Q-K-A of the same suit)?
Solution. Each player receives a subset of the cards of size 5, so the sample space is
Ω = {(S₁, S₂) : S₁, S₂ ⊂ C, |S₁| = |S₂| = 5, S₁ ∩ S₂ = ∅},
where C is the set of all cards (i.e. C = {A♠, 2♠, ..., A♦, 2♦, ...}). There are (52 choose 5) ways to choose S₁ and (47 choose 5) ways to choose S₂, so |Ω| = (52 choose 5) · (47 choose 5). Also, every choice is equally likely, so P is the uniform measure.

Let Aᵢ be the event that player i has a royal flush (i = 1, 2). For s ∈ {♠, ♦, ♥, ♣}, let B(i, s) be the event that player i has a royal flush of the suit s. So Aᵢ = ⋃ₛ B(i, s). B(i, s) is the event that Sᵢ is a specific set of 5 cards, so |B(i, s)| = (47 choose 5) for any choice of i, s. Thus,
P(Aᵢ) = Σₛ |B(i, s)|/|Ω| = 4/(52 choose 5).
Now we use the inclusion-exclusion principle:

P(A) = P(A₁ ∪ A₂) = P(A₁) + P(A₂) − P(A₁ ∩ A₂).

The event A₁ ∩ A₂ is the event that both players have a royal flush, so |A₁ ∩ A₂| = 4 · 3, since there are 4 options for the first player's suit, and then 3 for the second. Altogether,

P(A) = 4/(52 choose 5) + 4/(52 choose 5) − (4 · 3)/((52 choose 5) · (47 choose 5)) = (8/(52 choose 5)) · (1 − 3/(2 · (47 choose 5))).

△
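The royal-flush computation can be verified exactly with binomial coefficients (a sketch using Python's `math.comb` and exact rational arithmetic):

```python
from fractions import Fraction
from math import comb

n_omega = comb(52, 5) * comb(47, 5)    # |Ω|: ordered pair of disjoint 5-card hands
p_one = Fraction(4, comb(52, 5))       # P(A_i): 4 suits, the other hand is free
p_both = Fraction(4 * 3, n_omega)      # |A_1 ∩ A_2| = 4·3 choices of the two suits
p = p_one + p_one - p_both             # inclusion-exclusion

# the closed form from the text
closed = Fraction(8, comb(52, 5)) * (1 - Fraction(3, 2 * comb(47, 5)))
```

The two expressions agree exactly; numerically the probability is about 3.1 · 10⁻⁶.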

Exercise 2.10. Prove that there is no uniform probability measure on N; that is, there is no probability measure on N such that P({i}) = P({j}) for all i, j.

Solution. Assume such a probability measure exists, with common value p = P({j}). By the defining properties of probability measures,
1 = P(N) = Σ_{j=0}^∞ P({j}) = Σ_{j=0}^∞ p,
which is ∞ if p > 0 and 0 if p = 0. A contradiction. □

Exercise 2.11. A deck of 52 cards is ordered randomly, all orders equally likely. What is the probability that the 14th card is an ace? What is the probability that the first ace is at the 14th card?

Solution. The sample space is all the possible orderings of the set of cards C = {A♠, 2♠, ..., A♦, 2♦, ...}. So Ω is the set of all 1-1 functions f : {1, ..., 52} → C, where f(1) is the first card, f(2) the second, and so on. So |Ω| = 52!. The measure is the uniform one.

Let A be the event that the 14th card is an ace, and let B be the event that the first ace is at the 14th card. A is the event that f(14) is an ace, so A = {f ∈ Ω : f(14) is an ace}; there are 4 choices for this ace, so |A| = 4 · 51!. Thus, P(A) = 4/52 = 1/13.
B is the event that f(14) is an ace, but also f(j) is not an ace for all j < 14. Thus, B = {f ∈ Ω : f(14) is an ace, and f(j) is not an ace for all j < 14}. So |B| = 48 · 47 ⋯ 36 · 4 · 38!, and
P(B) = (4 · 38 · 37 · 36)/(52 · 51 · 50 · 49) ≈ 0.0312. □
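Both answers can be checked by exact counting (a sketch with `fractions.Fraction`):

```python
from fractions import Fraction
from math import factorial

# P(A): the 14th card is an ace — 4 choices of ace times 51! orders of the rest
p_A = Fraction(4 * factorial(51), factorial(52))

# P(B): positions 1..13 hold non-aces (48·47···36 choices), position 14 an ace (4),
# and the remaining 38 cards in any order (38!)
non_aces = 1
for k in range(48, 35, -1):
    non_aces *= k
p_B = Fraction(non_aces * 4 * factorial(38), factorial(52))
```

The huge factorials cancel to the small closed forms given in the solution.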


Lecture 3

3.1. Some set theory

Recall the inclusion-exclusion principle:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This can be demonstrated by the Venn diagram in Figure 2.

Figure 2. The inclusion-exclusion principle, illustrated by a Venn diagram (after John Venn, 1834–1923).

This diagram also illustrates the intuitive meaning of A ∪ B; namely, A ∪ B is the event that one of A or B occurs. Similarly, A ∩ B is the event that both A and B occur.

Let (Aₙ)_{n=1}^∞ be a sequence of events in some probability space. We have

⋃_{n≥k} Aₙ = {ω ∈ Ω : ω ∈ Aₙ for at least one n ≥ k} = {ω ∈ Ω : there exists n ≥ k such that ω ∈ Aₙ}.

So ⋂_{k=1}^∞ ⋃_{n≥k} Aₙ is the set of ω ∈ Ω such that for any k, there exists n ≥ k such that ω ∈ Aₙ; i.e.

⋂_{k=1}^∞ ⋃_{n≥k} Aₙ = {ω ∈ Ω : ω ∈ Aₙ for infinitely many n}.

Definition 3.1. Define the set lim sup An as

lim sup Aₙ := ⋂_{k=1}^∞ ⋃_{n≥k} Aₙ.

The intuitive meaning of lim supn An is that infinitely many of the An’s occur. Similarly, we have that

⋃_{k=1}^∞ ⋂_{n≥k} Aₙ = {ω ∈ Ω : there exists n₀ such that for all n > n₀, ω ∈ Aₙ} = {ω ∈ Ω : ω ∈ Aₙ for all large enough n}.

Definition 3.2. Define
lim inf Aₙ := ⋃_{k=1}^∞ ⋂_{n≥k} Aₙ.

lim inf An means that An will occur from some large enough n onward; that is, even- tually all An occur. It is easy to see that

lim inf Aₙ ⊆ lim sup Aₙ.

This also fits the intuition, as if all An occur eventually, then they occur infinitely many times.

∞ Definition 3.3. If (An)n=1 is a sequence of events such that lim inf An = lim sup An (as sets) then we define

lim An := lim inf An = lim sup An.

Example 3.4. Consider Ω = N, and the sequence

Aₙ = {n^j : j = 0, 1, 2, ...}.

If m < n and m ∈ Aₙ, then it must be that m = n⁰ = 1. Thus, ⋂_{n≥k} Aₙ = {1} and

lim inf Aₙ = ⋃_{k=1}^∞ ⋂_{n≥k} Aₙ = {1}.

Also, if m < k ≤ n and m ∈ Aₙ, then again m = 1, so ⋃_{n≥k} Aₙ does not contain any m with 1 < m < k; i.e. ⋃_{n≥k} Aₙ ⊂ {1, k, k + 1, ...}. Hence,

lim sup Aₙ = ⋂_{k=1}^∞ ⋃_{n≥k} Aₙ = {1}.

Thus, the limit exists and limₙ Aₙ = {1}. △
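A finite truncation can illustrate the computation in Example 3.4 (a sketch, ours; only powers below a cutoff are kept, so this suggests rather than proves the limit):

```python
from functools import reduce

CUTOFF = 10_000

def A(n):
    # A_n = {n^j : j = 0, 1, 2, ...}, truncated to values below CUTOFF (use n >= 2)
    out, v = set(), 1
    while v < CUTOFF:
        out.add(v)
        v *= n
    return out

# the intersection over n = k..10 is already {1}, matching ⋂_{n≥k} A_n = {1}
for k in range(2, 6):
    inter = reduce(set.intersection, (A(n) for n in range(k, 11)))
    assert inter == {1}
```

Since ⋂_{n≥k} Aₙ = {1} for every k, both lim inf and lim sup are {1}, as derived above.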

Definition 3.5. A sequence of events is called increasing if Aₙ ⊂ Aₙ₊₁ for all n, and decreasing if Aₙ₊₁ ⊂ Aₙ for all n.

Proposition 3.6. Let (An) be a sequence of events. Then,

(lim inf Aₙ)ᶜ = lim sup Aₙᶜ.

Moreover, if (Aₙ) is increasing (resp. decreasing), then lim Aₙ = ⋃ₙ Aₙ (resp. lim Aₙ = ⋂ₙ Aₙ).

Proof. The first assertion is de Morgan.

For the second assertion, note that if (An) is increasing, then

⋃_{n≥k} Aₙ = ⋃_{n≥1} Aₙ and ⋂_{n≥k} Aₙ = Aₖ.

So

lim sup Aₙ = ⋂ₖ ⋃_{n≥k} Aₙ = ⋂ₖ ⋃ₙ Aₙ = ⋃ₙ Aₙ,
lim inf Aₙ = ⋃ₖ ⋂_{n≥k} Aₙ = ⋃ₖ Aₖ = ⋃ₙ Aₙ.

If (Aₙ) is decreasing then (Aₙᶜ) is increasing. So

lim sup Aₙ = (lim inf Aₙᶜ)ᶜ = (⋃ₙ Aₙᶜ)ᶜ = (lim sup Aₙᶜ)ᶜ = lim inf Aₙ,

and lim Aₙ = (⋃ₙ Aₙᶜ)ᶜ = ⋂ₙ Aₙ. □

Exercise 3.7. Show that if ℱ is a σ-algebra, and (Aₙ) is a sequence of events in ℱ, then lim inf Aₙ ∈ ℱ and lim sup Aₙ ∈ ℱ.

3.2. Continuity of Probability Measures

The goal of this section is to prove

Theorem 3.8 (Continuity of Probability). Let (Ω, , P) be a probability space. Let (An) F be a sequence of events such that lim An exists. Then,

P(lim Aₙ) = lim_{n→∞} P(Aₙ).

We start with a restricted version:

Proposition 3.9. Let (Ω, ℱ, P) be a probability space. Let (Aₙ) be a sequence of increasing (resp. decreasing) events. Then,

P(lim Aₙ) = lim_{n→∞} P(Aₙ).

Proof. We start with the increasing case. Let A = ⋃ₙ Aₙ = lim Aₙ. Define A₀ = ∅, and Bₖ = Aₖ \ Aₖ₋₁ for all k ≥ 1. So (Bₖ) are mutually disjoint and

⊎_{k=1}^n Bₖ = ⋃_{k=1}^n Aₖ = Aₙ,

so ⊎ₙ Bₙ = A. Thus,

P(lim Aₙ) = P(⊎ₙ Bₙ) = Σₙ P(Bₙ) = limₙ Σ_{k=1}^n P(Bₖ) = limₙ P(⊎_{k=1}^n Bₖ) = limₙ P(Aₙ).

The decreasing case follows from noting that if (Aₙ) is decreasing, then (Aₙᶜ) is increasing, so

P(lim Aₙ) = P(⋂ₙ Aₙ) = 1 − P(⋃ₙ Aₙᶜ) = 1 − limₙ P(Aₙᶜ) = limₙ P(Aₙ). □

Lemma 3.10 (Fatou's Lemma; after Pierre Fatou, 1878–1929). Let (Ω, ℱ, P) be a probability space. Let (Aₙ) be a sequence of events (that may not have a limit). Then,

P(lim inf Aₙ) ≤ lim infₙ P(Aₙ) and lim supₙ P(Aₙ) ≤ P(lim sup Aₙ).

Proof. For all k let Bₖ = ⋂_{n≥k} Aₙ. So lim infₙ Aₙ = ⋃ₖ Bₖ. Note that (Bₖ) is an increasing sequence of events (Bₖ = Bₖ₊₁ ∩ Aₖ). Also, for any n ≥ k, Bₖ ⊂ Aₙ, so P(Bₖ) ≤ inf_{n≥k} P(Aₙ). Thus,

P(lim inf Aₙ) = P(⋃ₙ Bₙ) = limₙ P(Bₙ) ≤ limₙ inf_{k≥n} P(Aₖ) = lim infₙ P(Aₙ).

For the lim sup,

lim supₙ P(Aₙ) = lim supₙ (1 − P(Aₙᶜ)) = 1 − lim infₙ P(Aₙᶜ) ≤ 1 − P(lim inf Aₙᶜ) = 1 − P((lim sup Aₙ)ᶜ) = P(lim sup Aₙ). □

Fatou's Lemma immediately proves Theorem 3.8.

Proof of Theorem 3.8. Just note that

lim supₙ P(Aₙ) ≤ P(lim sup Aₙ) = P(lim Aₙ) = P(lim inf Aₙ) ≤ lim infₙ P(Aₙ),

so equality holds, and the limit exists. □

As a consequence we get:

Lemma 3.11 (First Borel–Cantelli Lemma; after Émile Borel, 1871–1956, and Francesco Cantelli, 1875–1966). Let (Ω, ℱ, P) be a probability space. Let (Aₙ) be a sequence of events. If Σₙ P(Aₙ) < ∞, then

P(Aₙ occurs for infinitely many n) = P(lim sup Aₙ) = 0.

Proof. Let Bₖ = ⋃_{n≥k} Aₙ. So (Bₖ) is decreasing, and the decreasing sequence P(Bₖ) converges to P(lim sup Aₙ). Thus, for all k,

P(lim sup Aₙ) ≤ P(Bₖ) ≤ Σ_{n≥k} P(Aₙ).

Since the right hand side converges to 0 as k → ∞ by the assumption that the series is convergent, we get P(lim sup Aₙ) ≤ 0. □

Example 3.12. We have a bunch of bacteria in a petri dish. Every second, the bacteria give off offspring randomly, and then the parents die out. Suppose that for any n, the probability that there are no bacteria left by time n is 1 − exp(−f(n)). What is the probability that the bacteria eventually die out if:
• f(n) = log n.
• f(n) = (n² − 7)/(2n² + 3n + 5).

Let Aₙ be the event that the bacteria die out by time n. So P(Aₙ) = 1 − exp(−f(n)) for n ≥ 1. Note that the event that the bacteria eventually die out is the event that there exists n such that Aₙ occurs; i.e. the event ⋃ₙ Aₙ. Since (Aₙ)ₙ is an increasing sequence, we have that P(⋃ₙ Aₙ) = limₙ P(Aₙ).
In the first case this is 1. In the second case this is
lim_{n→∞} 1 − exp(−(n² − 7)/(2n² + 3n + 5)) = 1 − e^{−1/2}. △

Example 3.13. What if we take the previous example, and the information is that the probability that the bacteria at generation n do not die out without offspring is at most exp(−2 log n)?
Then, if Aₙ is the event that the n-th generation has offspring, we have that P(Aₙ) ≤ n⁻². Since Σₙ P(Aₙ) < ∞, Borel–Cantelli tells us that
P(Aₙ occurs for infinitely many n) = P(lim sup Aₙ) = 0.

That is,
P(∃ k : ∀ n ≥ k : Aₙᶜ) = P(lim inf Aₙᶜ) = 1.
So a.s. there exists k such that all generations n ≥ k do not have offspring, implying that the bacteria die out with probability 1. △
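The limits in Example 3.12 can be checked numerically (a sketch; a large n stands in for the limit):

```python
import math

def p_die_out(n, f):
    # P(A_n) = 1 − exp(−f(n)): probability of no bacteria left by time n
    return 1 - math.exp(-f(n))

f1 = lambda n: math.log(n)
f2 = lambda n: (n * n - 7) / (2 * n * n + 3 * n + 5)

# (A_n) is increasing, so P(eventually die out) = lim_n P(A_n)
limit1 = p_die_out(10 ** 9, f1)                 # tends to 1
limit2 = p_die_out(10 ** 9, f2)                 # tends to 1 − e^{−1/2}
```

At n = 10⁹ the first probability is within 10⁻⁸ of 1, and the second within 10⁻⁶ of 1 − e^{−1/2} ≈ 0.393, matching the two limits computed above.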


Lecture 4

4.1. Conditional Probabilities

We start with a simple observation.

Proposition 4.1. Let (Ω, ℱ, P) be a probability space. Let F ∈ ℱ be an event such that P(F) > 0. Define

ℱ|_F = {A ∩ F : A ∈ ℱ},

and define P_F : ℱ|_F → [0, 1] by P_F(B) = P(B)/P(F) for all B ∈ ℱ|_F, and Q : ℱ → [0, 1] by Q(A) = P(A ∩ F)/P(F) for all A ∈ ℱ. Then, ℱ|_F is a σ-algebra on F, P_F is a probability measure on (F, ℱ|_F), and Q is a probability measure on (Ω, ℱ).

Notation. The probability measure Q above is usually denoted P(· | F).

Proof. The σ-algebra part just follows from F = Ω ∩ F and ⋃ₙ(F ∩ Aₙ) = F ∩ ⋃ₙ Aₙ.
P_F is a probability measure since P_F(F) = P(F)/P(F) = 1, and if (Bₙ) is a sequence of mutually disjoint events in ℱ|_F, then
P_F(⋃ₙ Bₙ) = P(⋃ₙ Bₙ)/P(F) = (1/P(F)) Σₙ P(Bₙ) = Σₙ P_F(Bₙ).
Q is a probability measure since Q(Ω) = P(Ω ∩ F)/P(F) = 1, and if (Aₙ) is a sequence of disjoint events in ℱ, then
Q(⋃ₙ Aₙ) = (1/P(F)) P(⋃ₙ(Aₙ ∩ F)) = Σₙ Q(Aₙ). □

What is the intuitive meaning of the above measure Q? What we did is restrict all events to their intersection with F, and look at them only in F, normalized to have

Definition 4.2 (Conditional Probability). Let (Ω, , P) be a probability space, and let F F be an event such that P(F ) > 0. ∈ F For any event A , the quantity P(A F )/ P(F ) is called the conditional prob- ∈ F ∩ ability of A given F . The probability measure P( F ) := P( F )/ P(F ) is called the conditional proba- ·| · ∩ bility given F .

Warning: one must be careful with conditional probabilities; intuition here is often misleading.

Example 4.3. An urn contains 10 white balls, 5 yellow balls and 10 black balls. Shir takes a ball out of the urn, all balls equally likely.

• What is the probability that the ball removed is yellow?
• What is the probability that the ball removed is yellow, given that it is not black?

Solution. Let A = the ball is yellow, and B = the ball is not black. Then P(A) = 5/25 = 1/5 and P(B) = 15/25 = 3/5. Since A ⊂ B, we have P(A ∩ B) = P(A) = 1/5. So P(A | B) = (1/5)/(3/5) = 1/3. □

Example 4.4. Noga has the heart of an artist and the mind of a mathematician. Given that she takes a course in probability, she will pass with probability 1/3. Given that she takes a course in contemporary art, she will pass with probability 1/2. She chooses which course to take using a fair coin toss: heads for probability and tails for art. What is the probability that Noga takes probability and passes? What is the probability she passes, no matter which course she takes? △

Solution. The sample space here is

Ω = {H, T} × {pass, fail}.

The events we are interested in are A = "Noga passes the course" = {H, T} × {pass} and B = "Noga takes probability" = {H} × {pass, fail}, so that Bᶜ = "Noga takes art" = {T} × {pass, fail}. The event that Noga takes and passes probability is A ∩ B.
The information in the question tells us that P(A | B) = 1/3 and P(A | Bᶜ) = 1/2. Thus,

P(A ∩ B) = P(A | B) P(B) = 1/6,

and

P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = 1/6 + 1/4 = 5/12. □

Example 4.5. An urn contains 8 black balls and 4 white balls. Two balls are removed one after the other, any ball being equally likely. What is the probability that the second ball is black, given that the first is black? What is the probability that the first is black, given that the second is black? △

Solution. A = first ball is black, B = second ball is black. We know that P(A) = 8/12, and that P(A ∩ B) = (8 · 7)/(12 · 11). So (as is intuitive) P(B | A) = 7/11. However, maybe somewhat less intuitive is P(A | B). Note that P(B ∩ Aᶜ) = (4 · 8)/(12 · 11). So

P(B) = P(B ∩ A) + P(B ∩ Aᶜ) = (8 · (4 + 7))/(12 · 11) = 8/12.

Thus, P(A | B) = P(A ∩ B)/P(B) = 7/11. □
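The two conditional probabilities can be verified by enumerating all ordered pairs of distinct balls (a sketch; balls are numbered, with the first 8 black):

```python
from fractions import Fraction

colors = ["b"] * 8 + ["w"] * 4                          # 8 black, 4 white
pairs = [(i, j) for i in range(12) for j in range(12) if i != j]

def P(E):
    return Fraction(len(E), len(pairs))

A = {(i, j) for (i, j) in pairs if colors[i] == "b"}    # first ball black
B = {(i, j) for (i, j) in pairs if colors[j] == "b"}    # second ball black

p_B_given_A = P(A & B) / P(A)
p_A_given_B = P(A & B) / P(B)
```

Both conditional probabilities come out exactly 7/11, and P(B) = 8/12 = P(A): by symmetry, the second ball is black exactly as often as the first.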

Exercise 4.6. Prove that for all events A₁, ..., Aₙ such that P(A₁ ∩ ⋯ ∩ Aₙ) > 0,

P(A₁ ∩ ⋯ ∩ Aₙ) = P(Aₙ | A₁ ∩ ⋯ ∩ Aₙ₋₁) · P(Aₙ₋₁ | A₁ ∩ ⋯ ∩ Aₙ₋₂) ⋯ P(A₂ | A₁) · P(A₁)
= P(A₁) · Π_{j=1}^{n−1} P(Aⱼ₊₁ | A₁ ∩ ⋯ ∩ Aⱼ).

Solution. By induction. The base is simple. The induction step follows from

P(A₁ ∩ ⋯ ∩ Aₙ₊₁) = P(Aₙ₊₁ | A₁ ∩ ⋯ ∩ Aₙ) · P(A₁ ∩ ⋯ ∩ Aₙ). □

4.2. Bayes’ Rule

Let (Ω, ℱ, P) be a probability space. Let (Aₙ)ₙ be a partition of Ω; that is, (Aₙ) are mutually disjoint and ⋃ₙ Aₙ = Ω. Assume further that P(Aₙ) > 0 for all n. Let A, B be events of positive probability. Then, since (B ∩ Aₙ)ₙ are mutually disjoint, by additivity we have

P(B) = Σₙ P(B ∩ Aₙ) = Σₙ P(B | Aₙ) P(Aₙ).

This is called the law of total probability.

Figure 3. The law of total probability. In this case Ω = A₁ ⊎ ⋯ ⊎ A₁₂, and

P(B) = P(B | A₁) P(A₁) + P(B | A₂) P(A₂) + P(B | A₅) P(A₅) + P(B | A₆) P(A₆) + P(B | A₇) P(A₇) + P(B | A₉) P(A₉) + P(B | A₁₀) P(A₁₀).

Another observation is Bayes' rule (Thomas Bayes, 1701–1761):

P(B | A) = P(A | B) · P(B)/P(A).

Combining the two, we have that

P(Aₙ | B) = P(B | Aₙ) · P(Aₙ) / Σₘ P(B | Aₘ) P(Aₘ).

Exercise 4.7. A new machine is invented that determines whether a student will become a millionaire in the next ten years or not. Given that a student will become a millionaire in ten years, the machine succeeds in predicting this 95% of the time. Given that the student will not become a millionaire in the next ten years, the machine predicts (wrongly) that he will become a millionaire in 1% of the cases (this is known as a false positive). Only 0.5% of the students will become millionaires in the next ten years. Given that Zuckerberg has been predicted to become a millionaire by the machine, what is the probability that Zuckerberg will actually become one?

Solution. A = {Zuckerberg will become a millionaire in the next ten years}. B = {Zuckerberg is predicted to become a millionaire by the machine}. The information is:

P(A) = 0.005,   P(B | A) = 0.95,   P(B | Ac) = 0.01.

So,

P(A | B) = P(B | A) · P(A) / ( P(B | A) P(A) + P(B | Ac) P(Ac) ) = (0.95 · 0.005) / (0.95 · 0.005 + 0.01 · 0.995)

= 95 / (95 + 199) ≈ 0.323 < 1/3.

Not such a great machine after all... □
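The computation above is easy to automate. A small Python sketch (the function name `posterior` is ours) applying Bayes' rule with the law of total probability in the denominator:

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Bayes' rule: P(A | B) = P(B | A) P(A) / P(B), with
    P(B) expanded by the law of total probability."""
    evidence = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    return p_pos_given_true * prior / evidence

p = posterior(prior=0.005, p_pos_given_true=0.95, p_pos_given_false=0.01)
```

Despite the machine's 95% success rate on true positives, the posterior stays below 1/3 because millionaires are so rare.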

Exercise 4.8 (Pólya's Urn) [George Pólya (1887–1985)]. An urn contains one black ball and one white ball. At every time step, a ball is chosen randomly from the urn, and returned together with another new ball of the same color. (So at time t there are a total of t + 2 balls.) Calculate

p(t, b) = P(there are b black balls at time t).

Solution. First let's see how p(t, b) develops when we change t and b by 1. If at time t there are b black balls out of a total of t + 2, then at time t + 1 there are b + 1 black balls with probability b/(t + 2), and b black balls with probability 1 − b/(t + 2). Thus,

p(t + 1, b) = ((b − 1)/(t + 2)) · p(t, b − 1) + (1 − b/(t + 2)) · p(t, b).

Also, the initial condition is p(0, 1) = 1. Let q(t, b) = (t + 1)! · p(t, b). Then q(0, 1) = 1 and for t > 0,

q(t, b) = (b − 1) q(t − 1, b − 1) + (t + 1 − b) q(t − 1, b).

Check that q(t, b) = t! solves this. So p(t, b) = 1/(t + 1). □

Example 4.9. The probability that a family has n children is pn, where Σ_{n=0}^∞ pn = 1. Given that a family has n children, all possibilities for the sexes of these children are equally likely. What is the probability that a family has one child, given that the family has no girls?

Solution. The natural sample space to take here is

Ω = {(s1, s2, . . . , sn) : sj ∈ {boy, girl}, n ≥ 1} ∪ {no children}.

An = {n children}, for n ≥ 0. B = {no girls}. The information in the question is:

for all n ≥ 1,   P({(s1, . . . , sn)} | An) = 2^{−n},

for any combination of sexes s1, . . . , sn ∈ {boy, girl}. Thus, P(B | An) = 2^{−n} for all n ≥ 1. Also, P(B | A0) = 1 = 2^{−0}. The law of total probability gives

P(B) = Σn P(B | An) P(An) = Σ_{n=0}^∞ 2^{−n} pn.

Using Bayes' rule,

P(A1 | B) = P(B | A1) · P(A1) / P(B) = (1/2) · p1 / Σ_{n=0}^∞ 2^{−n} pn.

□

Example 4.10. There are 6 machines M1, . . . , M6 in the basement of the math department. When pressing the red button on machine Mj, the machine outputs a random number in N, with the probability that the number is k being (j/(j + 1)) · (j + 1)^{−k}.

A bored mathematician has access to these machines. She tosses a die, and uses machine Mj if the outcome of the die is j, and writes down the resulting number. What is the probability that the resulting number is greater than 0?

What is the probability that the outcome of the die is 1, given that the resulting number is greater than 3?

Solution. Aj = {the die shows j}, j = 1, . . . , 6. Bk = {the resulting number is greater than k}. Ck = {the resulting number is k}. The (Ck) are mutually disjoint, and Bk = ⊎_{n>k} Cn. The information is:

P(Bk | Aj) = Σ_{n>k} P(Cn | Aj) = Σ_{n>k} (j/(j + 1)) · (j + 1)^{−n} = (j/(j + 1)) · (j + 1)^{−(k+1)} · Σ_{n=0}^∞ (j + 1)^{−n} = (j + 1)^{−(k+1)}.

Also, P(Aj) = 1/6 for all j. By the law of total probability,

P(Bk) = Σ_{j=1}^6 (1/6) · (j + 1)^{−(k+1)} = (1/6) · (2^{−(k+1)} + 3^{−(k+1)} + · · · + 7^{−(k+1)}).

So P(B0) = (1/6) · (1/2 + 1/3 + · · · + 1/7).

Now, by Bayes,

P(A1 | B3) = P(B3 | A1) · P(A1) / P(B3) = 2^{−4} / (2^{−4} + 3^{−4} + · · · + 7^{−4}).

□
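These sums can be verified with exact rational arithmetic. A Python sketch (function names are ours), using the closed form of the geometric tail Σ_{n>k} r^n = r^{k+1}/(1 − r):

```python
from fractions import Fraction

def p_exceeds_given_machine(j, k):
    # P(Bk | Aj) = sum over n > k of j/(j+1) · (j+1)^(-n),
    # computed via the closed-form geometric tail.
    r = Fraction(1, j + 1)
    tail = r ** (k + 1) / (1 - r)
    return Fraction(j, j + 1) * tail

def p_exceeds(k):
    # Law of total probability over the fair die.
    return sum(Fraction(1, 6) * p_exceeds_given_machine(j, k) for j in range(1, 7))

# Bayes: P(A1 | B3).
posterior_1 = p_exceeds_given_machine(1, 3) * Fraction(1, 6) / p_exceeds(3)
```

The sketch reproduces the simplification P(Bk | Aj) = (j + 1)^{−(k+1)} exactly.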

4.3. The Erdős–Ko–Rado Theorem

Consider n elements, say {0, 1, . . . , n − 1} without loss of generality. We want to "collect" subsets of size k ≤ n/2 so that any two of the collected subsets intersect one another. We want to collect as many such subsets as possible.

One strategy is as follows. Suppose w.l.o.g. A = {0, 1, . . . , k − 1} is our first set. Any other set that we collect must intersect it, and we want to give ourselves as much freedom as possible. So let's take all subsets of {0, 1, . . . , n − 1} of size k that contain 0. These all must intersect pairwise.

How many sets did we collect? \binom{n−1}{k−1}, because for every such subset we are actually just choosing k − 1 elements out of {1, . . . , n − 1}.

Is this the best possible? The Erdős–Ko–Rado Theorem tells us that it is.

Theorem 4.11 (Erdős–Ko–Rado). Let n ≥ 2k. If A ⊂ 2^{{0,...,n−1}} is a collection of subsets such that for all A, B ∈ A we have |A| = k and A ∩ B ≠ ∅, then there are at most \binom{n−1}{k−1} sets in A; that is, |A| ≤ \binom{n−1}{k−1}.

The following elementary proof is due to Katona.

Proof. Let A be as in the theorem.

Let us choose a uniform element σ from Sn, the set of all permutations on {0, 1, . . . , n − 1}, and also a uniform index I from {0, 1, . . . , n − 1}; that is, P[σ = s, I = i] = (1/n!) · (1/n). Set A = {σ(I), σ(I + 1), . . . , σ(I + k − 1)}, where addition is modulo n.

For any given set B = {b0, . . . , bk−1} ∈ 2^{{0,...,n−1}} with |B| = k, by the law of total probability,

P[A = B, I = j] = Σ_{π∈Sn} P[σ = π, I = j, {π(j), π(j + 1), . . . , π(j + k − 1)} = B]

= (1/(n! · n)) · #{π ∈ Sn : {π(j), π(j + 1), . . . , π(j + k − 1)} = B}.

This last set is of size k!(n − k)!, because there are k choices for π(j), then k − 1 choices for π(j + 1), etc., until only 1 choice for π(j + k − 1); and then n − k choices for π(j + k), n − k − 1 choices for π(j + k + 1), and so on for all of j + k, . . . , j + n − 1. Thus,

P[A = B] = Σ_{j=0}^{n−1} P[A = B, I = j] = Σ_{j=0}^{n−1} \binom{n}{k}^{−1} · (1/n) = \binom{n}{k}^{−1}.

We have found an alternative description of choosing a subset of {0, . . . , n − 1} of size exactly k uniformly.

Now the second part. For π ∈ Sn and an integer s, consider the subsets X_{π,s} := {π(s), π(s + 1), . . . , π(s + k − 1)}, where as usual addition is modulo n.

Note that if 0 ≤ m ≤ k − 1 then X_{π,m} ∩ X_{π,m+k} = ∅ (here we use the assumption that 2k ≤ n). Indeed, if x ∈ X_{π,m} ∩ X_{π,m+k} then there exist 0 ≤ i, j ≤ k − 1 such that π(m + j) = x = π(m + k + i). Since π is a permutation, this implies that j = k + i modulo n, and this is impossible because 0 ≤ i, j ≤ k − 1 and k ≤ n/2.

Now, for given π, s, the subsets X_{π,m} that intersect X_{π,s} are (X_{π,m} : m = s − k + 1, . . . , s + k − 1). Without the case m = s, these can be written in k − 1 pairs: (X_{π,s−m}, X_{π,s−m+k})_{m=1}^{k−1}. For each pair, since X_{π,s−m} ∩ X_{π,s−m+k} = ∅, only one of the pair can be in the intersecting family A.

We conclude the following (if no j has X_{π,j} ∈ A, the bound below is trivial; otherwise fix s with X_{π,s} ∈ A). For any 0 ≤ j ≤ n − 1, if X_{π,j} ∈ A then X_{π,s} ∩ X_{π,j} ≠ ∅, so s − k + 1 ≤ j ≤ s + k − 1; and X_{π,s−m} ∈ A only if X_{π,s−m+k} ∉ A. Thus,

Σ_{j=0}^{n−1} 1{X_{π,j} ∈ A} ≤ 1{X_{π,s} ∈ A} + Σ_{m=1}^{k−1} ( 1{X_{π,s−m} ∈ A} + 1{X_{π,s−m+k} ∈ A} ) ≤ 1 + (k − 1) = k.

Since all subsets of A must intersect one another, this gives that for any π ∈ Sn,

#{0 ≤ j ≤ n − 1 : X_{π,j} ∈ A} ≤ k.

Now, if σ = π and A ∈ A, then it must be that I = j for some j such that X_{π,j} ∈ A. So, by Boole's inequality,

P[A ∈ A, σ = π] ≤ P[σ = π, I ∈ {0 ≤ j ≤ n − 1 : X_{π,j} ∈ A}] ≤ Σ_{j=0}^{n−1} 1{X_{π,j} ∈ A} · P[σ = π, I = j] ≤ k / (n! · n).

Summing over all π ∈ Sn, by the law of total probability,

P[A ∈ A] ≤ k/n.

We now combine this bound with the fact that A is uniform over all k-subsets of {0, . . . , n − 1}. Thus,

|A| / \binom{n}{k} = P[A ∈ A] ≤ k/n.

This is exactly |A| ≤ \binom{n}{k} · (k/n) = \binom{n−1}{k−1}. □
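For very small parameters the theorem can be checked by exhaustive search. A Python sketch (the brute force is exponential in the number of k-subsets, so it is only feasible for tiny n):

```python
from itertools import combinations

def max_intersecting_family(n, k):
    """Largest family of k-subsets of {0,...,n-1} that pairwise intersect,
    found by exhaustive search over all families (tiny n only)."""
    subsets = [frozenset(c) for c in combinations(range(n), k)]
    best = 0
    for mask in range(1 << len(subsets)):
        fam = [s for i, s in enumerate(subsets) if (mask >> i) & 1]
        if all(a & b for a in fam for b in fam):
            best = max(best, len(fam))
    return best
```

For n = 5, k = 2 the search returns 4 = \binom{4}{1}, matching the Erdős–Ko–Rado bound.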

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 5

5.1. Independence

Until now, we have only used the measure-theoretic properties of probability measures. The main difference between probability and measure theory is the notion of independence. This is perhaps the most important definition in probability.

Definition 5.1. Let A, B be events in a probability space (Ω, F, P). We say that A and B are independent if P(A ∩ B) = P(A) · P(B).

If (An)n is a collection of events, we say that the (An) are mutually independent if for any finite number of events in the collection, An1, An2, . . . , Ank, we have that

P(An1 ∩ An2 ∩ · · · ∩ Ank) = P(An1) · P(An2) · · · P(Ank).

Example 5.2. A card is chosen randomly out of a deck of 52, all cards equally likely. A is the event that the card is an ace. B is the event that the card is a spade. C is the event that the card is a jack of diamonds or a 2 of hearts. D is the event that the card is an even number. Then A, B are independent, since

P(A ∩ B) = 1/52 = (4/52) · (13/52) = P(A) · P(B).

C, D are not independent, since

P(C ∩ D) = 1/52 ≠ P(C) · P(D) = (2/52) · (20/52).

Example 5.3. Two dice are tossed, all outcomes equally likely. A = {the first die is 4}. B = {the sum of the dice is 6}. C = {the sum of the dice is 7}.

A, B are not independent:

P(A ∩ B) = 1/36 ≠ P(A) · P(B) = (1/6) · (5/36).

However, A, C are independent:

P(A ∩ C) = 1/36 = (1/6) · (6/36) = P(A) · P(C).
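Such independence checks reduce to counting, so they are easy to confirm by enumerating all 36 outcomes (a Python sketch; the helper names are ours):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

def prob(event):
    # P(event) under the uniform measure on omega.
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 4            # first die is 4
B = lambda w: w[0] + w[1] == 6     # sum is 6
C = lambda w: w[0] + w[1] == 7     # sum is 7
```

The enumeration confirms that A, C are independent while A, B are not.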

Proposition 5.4. Let (Ω, F, P) be a probability space.

(1) Any event A is independent of Ω and of ∅.
(2) If A, B are independent, then Ac, B are independent.
(3) If A, B are independent and P(B) > 0, then P(A | B) = P(A).

Proof. If A, B are independent,

P(B ∩ Ac) = P(B) − P(B ∩ A) = P(B)(1 − P(A)) = P(B) P(Ac).

If also P(B) > 0, then P(A | B) = P(A) P(B) / P(B) = P(A). □

Exercise 5.5. Prove that if A, B, C are mutually independent, then

(1) A and B ∩ C are independent.
(2) A and B \ C are independent.

Solution.

P(A ∩ B ∩ C) = P(A) P(B) P(C) = P(A) P(B ∩ C).

This proves the first assertion. For the second assertion, since B \ C = B ∩ Cc, and since A ∩ B is independent of C, also A ∩ B and Cc are independent. So,

P(A ∩ (B \ C)) = P(A ∩ B ∩ Cc) = P(A ∩ B) P(Cc) = P(A) P(B)(1 − P(C)).

Finally,

P(B \ C) = P(B \ (B ∩ C)) = P(B) − P(B ∩ C) = P(B)(1 − P(C)),

since B, C are independent. □

Exercise 5.6. Show that if B is an event with P[B] = 0 then for any event A, A and B are independent.

Solution. 0 ≤ P[B ∩ A] ≤ P[B] = 0 and P[B] · P[A] = 0. □

Example 5.7. Consider the uniform probability measure on Ω = {0, 1}². Let

A = {(x, y) ∈ Ω : x = 1},   B = {(x, y) ∈ Ω : y = 1},

and

C = {(x, y) ∈ Ω : x ≠ y}.

Any two of these events are independent; indeed,

P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4,

P(A) = P(B) = P(C) = 1/2

(because C = {(0, 1), (1, 0)}). However, it is not the case that (A, B, C) are mutually independent, since

P(A ∩ B ∩ C) = 0 ≠ 1/8 = P(A) P(B) P(C).
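The distinction between pairwise and mutual independence can be verified over the four outcomes (Python sketch; helper names are ours):

```python
from fractions import Fraction
from itertools import product

omega = list(product([0, 1], repeat=2))  # uniform measure on {0,1}^2

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 1
B = lambda w: w[1] == 1
C = lambda w: w[0] != w[1]

# Every pair is independent ...
pairwise_ok = all(
    prob(lambda w, e=e, f=f: e(w) and f(w)) == prob(e) * prob(f)
    for e, f in [(A, B), (A, C), (B, C)]
)
# ... but the triple intersection is empty, so mutual independence fails.
triple = prob(lambda w: A(w) and B(w) and C(w))
```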

5.2. An example using independence: Riemann zeta function and Euler’s formula

Some number theory via probability:

Let 1 < s ∈ R be a parameter. Let [Bernhard Riemann (1826–1866)]

ζ(s) = Σ_{n=1}^∞ n^{−s}.

Let X be a random variable with range {1, 2, . . .} and distribution given by

P(X = n) = (1/ζ(s)) · n^{−s}.

Let Dm = mZ ∩ N, the set of positive integers divisible by m. So

P_X(Dm) = Σn (mn)^{−s} / Σn n^{−s} = m^{−s}.

Now, let p1, p2, . . . , pk be k different prime numbers. Then,

Dp1 ∩ Dp2 ∩ · · · ∩ Dpk = D_{p1 p2 ··· pk}.

So

P_X(Dp1 ∩ Dp2 ∩ · · · ∩ Dpk) = (p1 p2 · · · pk)^{−s} = P_X(Dp1) · P_X(Dp2) · · · P_X(Dpk).

Since this is true for any finite number of primes, we have that the collection (Dp)_{p∈P} is mutually independent, where P is the set of primes.

Since the only positive integer that is not divisible by any prime is 1, we have that {1} = ∩_{p∈P} Dp^c, and so

P(X = 1) = P_X({1}) = ∏_{p∈P} (1 − P_X(Dp)) = ∏_{p∈P} (1 − p^{−s}).

Since also P(X = 1) = (1/ζ(s)) · 1^{−s} = ζ(s)^{−1}, we conclude Euler's formula: for any s > 1,

Σ_{n=1}^∞ n^{−s} = ∏_{p∈P} (1 − p^{−s})^{−1}.

[Leonhard Euler (1707–1783)] We can also let s → 1 from above, and we get

0 = lim_{s→1} ζ(s)^{−1} = ∏_{p∈P} (1 − p^{−1}).

Taking logs,

Σ_{p∈P} log(1 − p^{−1}) = −∞.

Since log(1 − x) ≥ −3x/2 for 0 < x ≤ 1/2, we have that

−∞ ≥ −(3/2) Σ_{p∈P} p^{−1},

and so

Σ_{p∈P} 1/p = ∞.

This is a non-trivial result.
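Euler's formula can also be checked numerically. A Python sketch comparing a truncated Dirichlet series with a truncated Euler product at s = 2 (the truncation points are arbitrary choices of ours):

```python
def zeta_partial(s, n_max):
    # Truncated Dirichlet series: sum of n^{-s} for n <= n_max.
    return sum(n ** -s for n in range(1, n_max + 1))

def primes_up_to(n):
    # Sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, ok in enumerate(sieve) if ok]

def euler_product(s, n_max):
    # Truncated product over primes p <= n_max of (1 - p^{-s})^{-1}.
    prod = 1.0
    for p in primes_up_to(n_max):
        prod *= 1.0 / (1.0 - p ** -s)
    return prod
```

Both truncations approach ζ(2) = π²/6 ≈ 1.6449 as the cutoff grows.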

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 6

6.1. Uncountable Sample Spaces

Up to now we have dealt with the case of countable sample spaces, but there is no reason to restrict to those. For example, what if we wish to experiment by throwing a dart at a target? The collection of possible outcomes is the collection of points in the target, which is uncountable. The basic theory used to define such spaces is measure theory.

Consider the sample space Ω = [0, 1]. Why not take a probability measure on all subsets of [0, 1]? Well, for example, we would like to consider the experiment of sampling a number from [0, 1] such that it is uniform over the interval; that is, such that for any 0 ≤ a < b ≤ 1 the probability that the number is in (a, b) is b − a.

How can we do this? The following proposition shows that this cannot be done using all subsets of [0, 1] (!).

Proposition 6.1. Let Ω = [0, 1]. Let P be a probability measure on (Ω, F), where F is some σ-algebra on Ω, such that for any x ∈ [0, 1] and A ∈ F, A + x ∈ F and P(A) = P(A + x), where A + x = {a + x (mod 1) : a ∈ A}. Then, there exists a subset S ⊂ Ω such that S ∉ F.

Proof. Define the equivalence relation x ∼ y iff x − y ∈ Q. Choose a representative from each equivalence class (using the axiom of choice!). Let R be the set of these representatives.

Define Rq = R + q = {r + q (mod 1) : r ∈ R}, for all q ∈ Q ∩ [0, 1). For q ≠ p ∈ Q ∩ [0, 1) we have that Rq ∩ Rp = ∅: if x ∈ Rp ∩ Rq then x = r + q = r′ + p (mod 1) for r, r′ ∈ R, and this implies that r ∼ r′ with r ≠ r′, a contradiction. Thus, (Rq)_{q∈Q∩[0,1)} is a countable collection of mutually disjoint sets whose union is [0, 1). If R ∈ F, then all the Rq are events, so

1 = P([0, 1)) = P(∪q Rq) = Σq P(Rq) = Σq P(R) ∈ {0, ∞},

which is impossible. (Here P([0, 1)) = 1 since, by translation invariance, all singletons have equal probability, hence probability 0.) So S = R ∉ F. □

Thus, we have to develop a theory of which sets we allow to be events, so that everything is well defined.

6.2. σ-algebras Revisited

Recall the definition of a σ-algebra:

Definition 6.2. Let Ω be a sample space. A collection of subsets F ⊂ 2^Ω is called a σ-algebra if
• Ω ∈ F.
• If A ∈ F then Ac ∈ F.
• If (An) is a collection of events in F then ∪n An ∈ F.

These were the most basic properties we wanted of events: everything can occur; if something can occur, it can also not occur; and given many possible events, we can ask whether at least one of them has occurred.

Exercise 6.3. Show that F ⊂ 2^Ω is a σ-algebra if and only if it has the following properties.
• ∅ ∈ F.
• If A ∈ F then Ac ∈ F.
• If (An) is a collection of events in F then ∩n An ∈ F.

Solution. Use de Morgan. □

Suppose we have a collection of subsets of Ω. We want the smallest σ-algebra containing these subsets.

Proposition 6.4. Let Ω be a sample space, and let K ⊂ 2^Ω be a collection of subsets. There exists a unique σ-algebra on Ω, denoted by σ(K), such that
• K ⊂ σ(K).
• If F is a σ-algebra such that K ⊂ F, then σ(K) ⊂ F.

So σ(K) is the smallest σ-algebra containing K. Moreover, σ(K) = K if and only if K is itself a σ-algebra.

Proof. Let

Γ(K) = {F : F is a σ-algebra such that K ⊂ F},

and define σ(K) := ∩ Γ(K) (the intersection is over a non-empty family, since 2^Ω ∈ Γ(K)). Of course the second property holds. σ(K) is a σ-algebra: Ω ∈ F for all F ∈ Γ(K); if A ∈ σ(K) then Ac ∈ F for all F ∈ Γ(K), so Ac ∈ σ(K); and if (An) is a collection of events in σ(K), then An ∈ F for all n and all F ∈ Γ(K), thus ∪n An ∈ F for every F ∈ Γ(K), and so ∪n An ∈ σ(K).

As for uniqueness, if G is another σ-algebra with the above properties, then G ⊂ σ(K) and σ(K) ⊂ G, so G = σ(K). □

Definition 6.5. σ(K) as above is called the σ-algebra generated by K.

σ(K) is intuitively all the information carried by the events in K.

Example 6.6. Let Ω be a sample space and A ⊂ Ω. Then σ({A}) = {∅, A, Ac, Ω}.

Definition 6.7. Let (Ω, F, P) be a probability space, and let G ⊂ F be a sub-σ-algebra of F. We say that an event A ∈ F is independent of G if for any B ∈ G, we have that A, B are independent.

Let (Bn)n be a collection of events. We say that A is independent of (Bn)n if A is independent of the σ-algebra σ((Bn)n).

Example 6.8. If A, B are independent, then we have already seen that A, Bc are also independent. Since A, ∅ are always independent, and also A, Ω are always independent, we have that A is independent of σ(B) = {∅, B, Bc, Ω}.

6.3. Borel σ-algebra

Let E be a topological space. We can consider the σ-algebra generated by all open sets. This is called the Borel σ-algebra on E, and an event in this σ-algebra is called a Borel set.

A special case is E = R^d, and more specifically E = R. In this case we denote the Borel σ-algebra by B = B(R), and it can be shown that B = σ((a, b) : a < b ∈ R); that is, B is generated by all open intervals.

Exercise 6.9. Show that

B(R) = σ((a, b] : a < b ∈ R) = σ([a, b) : a < b ∈ R) = σ([a, b] : a < b ∈ R).

Solution. Let a < b ∈ R. Since (a, b) = ∪n (a, b − 1/n], we get that (a, b) ∈ σ((a, b] : a < b ∈ R). Since this holds for all a < b ∈ R, we have that σ((a, b) : a < b ∈ R) ⊂ σ((a, b] : a < b ∈ R).

The other inclusions are similar, so all the mentioned σ-algebras are equal. □

6.4. Lebesgue Measure

The following basic theorem, due to Lebesgue, is shown in measure theory.

Theorem 6.10 (Lebesgue) [Henri Lebesgue (1875–1941)]. Let Ω = [0, 1], and let B be the Borel σ-algebra on Ω. Let F : [0, 1] → [0, 1] be a right-continuous, non-decreasing function such that F(1) = 1 and F(0) = 0. Then, there exists a unique probability measure, denoted P = dF, on (Ω, B) such that for any 0 ≤ a < b ≤ 1, P((a, b]) = F(b) − F(a).

We will not go into the proof of this theorem, but it gives us many new probability measures on non-discrete sample spaces.

Example 6.11. Take F(x) = x in the above theorem. The resulting probability measure, sometimes denoted P = dx, has the property that for any interval 0 ≤ a < b ≤ 1, P((a, b]) = b − a. One can also check that this measure is translation invariant; i.e. for any x ∈ [0, 1],

P(x + (a, b] (mod 1)) = P({x + y (mod 1) : a < y ≤ b}) = b − a.

This is also sometimes called the uniform measure on [0, 1].

Exercise 6.12. Show that for the uniform measure on [0, 1], for any a ∈ [0, 1], dx({a}) = 0.

Deduce that for all 0 ≤ a ≤ b ≤ 1,

dx((a, b)) = dx([a, b]) = dx([a, b)) = dx((a, b]) = b − a.

Solution. Let a ∈ [0, 1]. For all n, let An = (a − 1/n, a]. So P(An) ≤ 1/n. Since (An) is a decreasing sequence of events, and since {a} = ∩n An, we have by the continuity of probability measures

P({a}) = lim_n P(An) = 0.

We can now note that

P([a, b]) = P((a, b) ⊎ {a} ⊎ {b}) = P((a, b)) + P({a}) + P({b}) = P((a, b)).

Similarly for the other types of intervals. □

Example 6.13. Let α < β ∈ R. Let Ω = [α, β]. For a set A ∈ B([0, 1]) let

(β − α)A + α = {(β − α)a + α : a ∈ A}.

Define F = {(β − α)A + α : A ∈ B([0, 1])}. We have that (β − α)[0, 1] + α = Ω, ((β − α)A + α)c = (β − α)Ac + α, and

∪n ((β − α)An + α) = (β − α)(∪n An) + α.

So F is a σ-algebra on Ω. Also, if dx is the uniform measure on [0, 1], then

P((β − α)A + α) := dx(A)

defines a probability measure on (Ω, F), because A ∩ B = ∅ if and only if ((β − α)A + α) ∩ ((β − α)B + α) = ∅.

In this case, note that for any α ≤ a < b ≤ β,

P((a, b]) = P((β − α) · ((a − α)/(β − α), (b − α)/(β − α)] + α) = dx(((a − α)/(β − α), (b − α)/(β − α)]) = (b − a)/(β − α).

Example 6.14. Let Ω be a sample space, and let f : Ω → [0, 1] be a 1-1 function. Let F = {f^{−1}(B) : B ∈ B([0, 1])}. This is a σ-algebra on Ω, since f^{−1}([0, 1]) = Ω and

∪n f^{−1}(Bn) = f^{−1}(∪n Bn).

For A ∈ F, since B = f(f^{−1}(B)), we have that f(A) ∈ B, and we can define P(A) = dx(f(A)). This results in a probability measure on (Ω, F), since P(Ω) = 1 and if (An) are mutually disjoint events then so are (f(An)), and

P(∪n An) = dx(f(∪n An)) = dx(∪n f(An)) = Σn dx(f(An)) = Σn P(An).

Example 6.15. A dart is thrown at a target of radius 1 meter. Points are awarded according to the distance to the center of the target. The probability of being between distance a and distance b is b² − a².

What is the probability of hitting the inner circle of radius 1/2? What about radius 1/4?

Solution. Let F : [0, 1] → [0, 1] be F(x) = x². Then dF is a probability measure on ([0, 1], B) such that dF((a, b)) = F(b) − F(a) = b² − a². Thus, we can model the dart-throwing experiment by the probability space ([0, 1], B, dF), where the outcome ω ∈ [0, 1] is the distance to the center.

Now, the probability of hitting the inner circle of radius 1/2 is dF([0, 1/2]) = (1/2)² = 1/4. Similarly, the probability of hitting the circle of radius 1/4 is 1/16. □
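Measures of the form dF can be sampled by inverse-transform sampling: if U is uniform on [0, 1], then F^{−1}(U) has distribution dF. For the dart example, F(x) = x², so F^{−1}(u) = √u. A Monte Carlo sketch (the sample size and seed are arbitrary choices of ours):

```python
import random

def sample_dart_distance(rng):
    # Inverse-transform sampling: F(x) = x^2 on [0,1], so F^{-1}(u) = sqrt(u).
    return rng.random() ** 0.5

rng = random.Random(0)
n = 200_000
hits_half = sum(sample_dart_distance(rng) <= 0.5 for _ in range(n)) / n
```

The empirical frequency should be close to dF([0, 1/2]) = 1/4, not to 1/2: distances near the rim are more likely.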

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 7

7.1. Random Variables

Up until now we dealt with abstract spaces (such as dice, cards, etc.). In order to say things about objects, we prefer to map them to numbers (or vectors) so that we can perform calculations. For example, maybe some biological experiment is being carried out in the lab, but in the end the biologist may only be interested in the number of bacteria in the dish, or other such measurements. Measurements of experiments are carried out via random variables.

Suppose we have an experiment described by a probability space (Ω, F, P). A measurement is just a function taking outcomes to numbers; i.e. a function X : Ω → R. The natural way to define a probability space on R would now be to take the natural σ-algebra on R, i.e. the Borel σ-algebra B, and then define the measure of a Borel set B to be the probability that the outcome ω is mapped into B:

∀ B ∈ B,   P_X(B) := P({ω : X(ω) ∈ B}) = P(X^{−1}(B)).

But wait! What if the set X^{−1}(B) is not in F? Then P is not defined on that set. This is a technical detail, but one that needs to be addressed; otherwise we will run into paradoxes, as in the case of the non-measurable sets on [0, 1].

Definition 7.1. Let (Ω, F), (Ω′, F′) be measurable spaces. A function X : Ω → Ω′ is called a measurable map from (Ω, F) to (Ω′, F′) if for all A′ ∈ F′, X^{−1}(A′) ∈ F.

We usually denote this by X : (Ω, F) → (Ω′, F′) to stress the dependence on F and F′.

Exercise 7.2. Show that if X : (Ω, F) → (Ω′, F′) and Y : (Ω′, F′) → (Ω″, F″) are measurable functions, then Y ∘ X : Ω → Ω″ is measurable with respect to F, F″.

To determine whether a function X : (Ω, F) → (Ω′, F′) is measurable, it is enough to check on a family of sets generating F′:

Proposition 7.3. Let (Ω, F), (Ω′, F′) be two measurable spaces, such that F′ = σ(Π′) for some Π′ ⊂ 2^{Ω′}. Then X : Ω → Ω′ is measurable if and only if for any A′ ∈ Π′, X^{−1}(A′) ∈ F.

Proof. The "only if" part is clear. For the "if" part, consider

G′ = {A′ ⊂ Ω′ : X^{−1}(A′) ∈ F}.

We are given that Π′ ⊂ G′. If we show that G′ is a σ-algebra, then F′ = σ(Π′) ⊂ G′, so for any A′ ∈ F′ we have that X^{−1}(A′) ∈ F, and X is measurable.

So we only need to show that G′ is a σ-algebra. Indeed, since X^{−1}(∅) = ∅ ∈ F, we have that ∅ ∈ G′.

If A′ ∈ G′ then X^{−1}(A′) ∈ F, so also X^{−1}(Ω′ \ A′) = Ω \ X^{−1}(A′) ∈ F, and thus Ω′ \ A′ ∈ G′. So G′ is closed under complements.

If (A′n)n is a sequence in G′, then for every n we have that X^{−1}(A′n) ∈ F, and so X^{−1}(∪n A′n) = ∪n X^{−1}(A′n) ∈ F. Thus ∪n A′n ∈ G′, and G′ is closed under countable unions. □

Corollary 7.4. If Ω, Ω′ are topological spaces, and B(Ω), B(Ω′) are the Borel σ-algebras, then any continuous map f : Ω → Ω′ is measurable.

Proof. f is continuous implies that for any open set O′ ⊂ Ω′, f^{−1}(O′) is an open set in Ω. Thus, since B(Ω′) = σ(O′), where O′ is the collection of open sets in Ω′, we have that for any O′ ∈ O′, f^{−1}(O′) ∈ B(Ω). So f is measurable. □

Definition 7.5. A (real-valued) random variable on a probability space (Ω, F, P) is a measurable function X : (Ω, F) → (R, B).

The function P_X(B) = P(X^{−1}(B)) is a probability measure on (R, B), and is called the distribution of X.

Example 7.6. Let (Ω, F, P) be a probability space. Let A ∈ F be an event. A is not a random variable, but there is a natural random variable connected to A: define the function

I_A(ω) = 1 if ω ∈ A,  and I_A(ω) = 0 if ω ∉ A.

Note that for any B ∈ B: if 1 ∉ B and 0 ∉ B then I_A^{−1}(B) = ∅; if 1 ∈ B and 0 ∉ B then I_A^{−1}(B) = A; if 1 ∉ B and 0 ∈ B then I_A^{−1}(B) = Ac; if 1 ∈ B and 0 ∈ B then I_A^{−1}(B) = Ω. So I_A is measurable.

I_A is known as the indicator of the event A. Indicators are always measurable, and they are the simplest kind of random variables, taking only the two values 0, 1. We denote the indicator of an event A by 1_A.

Exercise 7.7. Let (Ω, F, P) be a probability space. Prove that if X : (Ω, F) → R is a measurable function, then indeed P_X(B) = P(X^{−1}(B)) is a probability measure on (R, B).

Proof. P_X(R) = P(X^{−1}(R)) = P(Ω) = 1. If (Bn) is a sequence of mutually disjoint events, then the (X^{−1}(Bn)) are also mutually disjoint, and

P_X(∪n Bn) = P(∪n X^{−1}(Bn)) = Σn P_X(Bn). □

Notation. To simplify notation, we will omit the ω's; e.g. instead of writing P({ω : X(ω) ∈ A}) we write P(X ∈ A).

Example 7.8. Three balls are removed from an urn containing 20 balls numbered 1 to 20. All possibilities are equally likely. What is the probability that at least one has a number 17 or higher?

The sample space here is Ω = {S ⊂ {1, 2, . . . , 20} : |S| = 3}, and the measure is the uniform measure on this set. We define the random variable X : Ω → R by X({i, j, k}) = max{i, j, k}.

Why is X a measurable function? For any Borel set B ∈ B, we have that X^{−1}(B) is a subset of Ω, and the σ-algebra here is 2^Ω.

What is the probability that there exists a ball numbered 17 or higher? This is

P({ω ∈ Ω : X(ω) ≥ 17}) = P(X ≥ 17) = 1 − P(X < 17).

Since the number of S ∈ Ω with max S < 17 is \binom{16}{3}, we have that P(X < 17) = (16 · 15 · 14)/(20 · 19 · 18) = (4 · 7)/(3 · 19).
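The count can be confirmed by enumerating all \binom{20}{3} = 1140 outcomes (Python sketch):

```python
from fractions import Fraction
from itertools import combinations

outcomes = list(combinations(range(1, 21), 3))  # all 3-subsets of {1,...,20}
p_max_ge_17 = Fraction(sum(1 for s in outcomes if max(s) >= 17), len(outcomes))
```

The enumeration gives 1 − (4 · 7)/(3 · 19) = 29/57 exactly.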

Example 7.9. Two dice are thrown. So Ω = {(i, j) : 1 ≤ i, j ≤ 6}, and P is the uniform measure on this set. Define the random variable X(i, j) = i + j. Note that 2 ≤ X(i, j) ≤ 12 for any i, j, so

P(X ∈ [2, 12] ∩ N) = 1,

and X is a discrete random variable. What is the probability that the sum is 7?

P(X = 7) = 6/36 = 1/6.

7.2. Distribution Function

Proposition 7.10. Let (Ω, F) be a measurable space, and let X : Ω → R. Then X is measurable if and only if

{X ≤ r} = X^{−1}((−∞, r]) ∈ F   ∀ r ∈ R.

Proof. The "only if" part is clear.

For the "if" part: since {X > r} = {X ≤ r}c ∈ F, we have that for any a < b ∈ R,

X^{−1}((a, b]) = X^{−1}((−∞, b] \ (−∞, a]) = {a < X ≤ b} = {X ≤ b} ∩ {X > a} ∈ F.

Let G = {B ∈ B : X^{−1}(B) ∈ F}.

Since X^{−1}(R) ∈ F, X^{−1}(Bc) = (X^{−1}(B))c, and X^{−1}(∪n Bn) = ∪n X^{−1}(Bn), we get that G is a σ-algebra.

From the above, (a, b] ∈ G for all a < b ∈ R. Thus, B = σ((a, b] : a < b ∈ R) ⊂ G. So for any B ∈ B, we have that B ∈ G, and so X^{−1}(B) ∈ F. □

Definition 7.11. Let (Ω, , P) be a probability space, and let X be a random variable. F The (cumulative) distribution function of X is the function

F_X : R → [0, 1],   F_X(t) = P(X ≤ t).

Example 7.12. Let X be the random variable that is the sum of two dice. What is F_X?

F_X(t) = 0 for t < 2, and F_X(t) = 1 for t ≥ 12. F_X(t) = 1/36 for t ∈ [2, 3), and F_X(t) = 3/36 for t ∈ [3, 4). In general, for 2 ≤ k ≤ 7 we have P(X = k) = (k − 1)/36, so F_X(t) = Σ_{m=2}^{k} (m − 1)/36 = k(k − 1)/72 for t ∈ [k, k + 1) and k ≤ 7.
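The distribution function of the sum of two dice can be computed by enumeration (Python sketch; the function name is ours):

```python
from fractions import Fraction
from itertools import product

def cdf_two_dice(t):
    # F_X(t) = P(X <= t) for X = sum of two fair dice, by counting outcomes.
    outcomes = list(product(range(1, 7), repeat=2))
    return Fraction(sum(1 for i, j in outcomes if i + j <= t), len(outcomes))
```

Note that F_X is a step function: constant on each interval [k, k + 1) and jumping by P(X = k) at each integer k.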

Proposition 7.13. Let X be a random variable, and FX its distribution function. Then,

• F_X is non-decreasing and right-continuous.
• F_X(t) tends to 0 as t → −∞.
• F_X(t) tends to 1 as t → ∞.

Proof. Since for a ≤ b we have that (−∞, a] ⊂ (−∞, b], we get

F_X(a) = P_X((−∞, a]) ≤ P_X((−∞, b]) = F_X(b).

For right-continuity, let hn → 0 monotonely from the right, so ((−∞, a + hn])n is a decreasing sequence and

F_X(a) = P_X(∩n (−∞, a + hn]) = lim_n P_X((−∞, a + hn]) = lim_n F_X(a + hn).

The limits at ∞ and −∞ are treated similarly. If an → ∞ then lim_n (−∞, an] = (−∞, ∞), so by the continuity of probability measures, P_X((−∞, an]) → P_X(R) = 1. Similarly for an → −∞. □

Exercise 7.14. Give an example of a random variable X whose distribution function

FX is not left-continuous.

Solution. Let (Ω, F, P) be a probability space, and let A ∈ F. Let X = 1_A. What is F_X?

F_X(t) = 0 for t < 0;  F_X(t) = P(Ac) = 1 − P(A) for 0 ≤ t < 1;  F_X(t) = 1 for t ≥ 1.

So if 0 < P(A) < 1 then

lim_{s→0−} F_X(s) = 0 and F_X(0) = 1 − P(A) > 0,

lim_{s→1−} F_X(s) = 1 − P(A) and F_X(1) = 1 > 1 − P(A).

So both 0 and 1 are points where F_X is not left-continuous. □

Of course, one sees that the right-continuity stems from the fact that we define F_X(t) = P(X ≤ t). We could have defined P(X ≥ t), which would have been left-continuous. This is not important; however, we stick to the classical distribution function.

Exercise 7.15. Let X be a random variable with distribution function F_X. Show that for any a < b ∈ R,
• P(a < X ≤ b) = F_X(b) − F_X(a).
• P(a < X < b) = F_X(b−) − F_X(a).
• P(X = x) = F_X(x) − F_X(x−). Deduce that F_X is continuous at x if and only if P(X = x) = 0.

Solution.
• Since (a, b] = (−∞, b] \ (−∞, a], and (−∞, a] ⊂ (−∞, b],

F_X(b) − F_X(a) = P_X((−∞, b]) − P_X((−∞, a]) = P_X((a, b]) = P(a < X ≤ b).

• The events (a, b − 1/n] satisfy lim_n (a, b − 1/n] = (a, b), and by continuity of probability,

P(a < X < b) = P_X(lim_n (a, b − 1/n]) = lim_n P_X((a, b − 1/n]) = F_X(b−) − F_X(a).

• Similarly, {x} = lim_n (x − 1/n, x], so

P(X = x) = P_X({x}) = F_X(x) − lim_n F_X(x − 1/n) = F_X(x) − F_X(x−).

So F_X is left-continuous at x if and only if F_X(x−) = F_X(x), which is if and only if P(X = x) = 0; since F_X is always right-continuous, this is exactly continuity at x. □

Since segments of the form (a, b], (a, b), (−∞, a] and (−∞, a) are enough to generate all sets in B(R), it can be shown that the distribution function uniquely determines the probability measure of all Borel sets; thus it determines the distribution of X.

In the same way as Lebesgue measure is shown to exist, it is shown in measure theory that there is a 1-1 correspondence between distribution functions and probability measures on (R, B(R)). That is:

Theorem 7.16 (Lebesgue–Stieltjes) [Thomas Stieltjes (1856–1894)]. Every function F : R → [0, 1] that is right-continuous everywhere, non-decreasing, and satisfies lim_{t→∞} F(t) = 1 and lim_{t→−∞} F(t) = 0 gives rise to a unique probability measure P_F on (R, B(R)) satisfying P_F((a, b]) = F(b) − F(a) for all a < b ∈ R. (That is, for such F there is always a random variable X such that F = F_X.)

Conversely, we have already seen that if X is a random variable, then P_X((a, b]) = F_X(b) − F_X(a) for all a < b ∈ R, and F_X has the properties mentioned above.

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 8

8.1. Discrete Distributions

Definition 8.1. A random variable X is called discrete if there exists a countable set R ⊂ R such that P(X ∈ R) = 1.

If X is such a random variable, and we specify P(X = r) for all r ∈ R, then the distribution of X can be calculated by

F_X(t) = P(X ≤ t) = Σ_{R ∋ r ≤ t} P(X = r).

The set {r ∈ R : P(X = r) > 0} is sometimes called the range of X, and the function f_X(x) = P(X = x) is called the density of X (sometimes: probability density function). Note that

F_X(t) = Σ_{R ∋ x ≤ t} f_X(x).

8.1.1. Bernoulli Distribution. Let 0 < p < 1. X has the Bernoulli-p distribution [Jacob Bernoulli (1654–1705)], denoted X ∼ Ber(p), if

F_X(t) = 0 for t < 0;  F_X(t) = 1 − p for 0 ≤ t < 1;  F_X(t) = 1 for 1 ≤ t.

That is, P(X ∈ {0, 1}) = 1, and P(X = 0) = 1 − p, P(X = 1) = p.

The Bernoulli distribution models a toss of a biased coin: probability p to fall heads (or 1) and 1 − p to fall tails (or 0).

8.1.2. Binomial Distribution. Suppose we have a biased coin with probability p to fall heads (or 1). What if we toss that coin n times in a row, all tosses mutually independent? The sample space would be Ω = {0, 1}^n. Because of independence, the measure would be

P({ω}) = p^{number of ones in ω} · (1 − p)^{number of zeros in ω}.

So if X : Ω → R is the number of ones, then since for 0 ≤ k ≤ n there are \binom{n}{k} different ω ∈ Ω with exactly k ones,

f_X(k) = P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k},   0 ≤ k ≤ n.

Such an X is said to have the Binomial-(n, p) distribution, denoted X ∼ Bin(n, p).

What is F_X in this case?

F_X(t) = Σ_{k ≤ t, k = 0, 1, ..., n} \binom{n}{k} p^k (1 − p)^{n−k}.

So F_X(t) = 0 for t < 0 and F_X(t) = 1 for t ≥ n.

Example 8.2. A mathematician sits at the bar. For the next hour, every 5 minutes he orders a beer with probability 2/3 and drinks it. With the remaining probability of 1/3 he does nothing for those 5 minutes. All orders of beer are mutually independent. What is the probability that he drinks no more than one beer, so he can still drive home (legally)?

Solution. If X is the number of beers drunk, then X Bin(12, 2/3), since every beer is ∼ drunk with probability 2/3 independently, in 12 = 60/5 trials. So

12 0 12 12 1 11 P(X 1) = P(X = 0) + P(X = 1) = (2/3) (1/3) + (2/3) (1/3) ≤ 0 1     25 1 = (1/3)12 + 12 2 (1/3)12 = < . · · 312 20, 000

ut 53
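The arithmetic in this solution is easy to check mechanically. The following is a small sketch (not part of the original notes) using only Python's standard library; `binom_pmf` is a helper name chosen here, and exact rational arithmetic avoids floating-point error.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Bin(n, p), computed exactly with fractions."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = Fraction(2, 3)
prob = binom_pmf(12, 0, p) + binom_pmf(12, 1, p)  # P(X <= 1)
print(prob)                        # 25/531441, i.e. 25 / 3**12
print(prob < Fraction(1, 20000))   # the bound claimed in the example
```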

8.1.3. Geometric Distribution.

Example 8.3. A student is watching television. Every minute she tosses a biased coin: if it comes up heads she goes to study for the probability exam; if tails she continues watching TV. All coin tosses are mutually independent. What is the probability she will continue watching TV forever? What is the probability she watches for exactly $k$ minutes and then starts studying?

The sample space here is $\Omega = \mathbb{N} \cup \{\infty\}$. Let $T$ be the number of minutes it takes to go study. The event $T = k$ is the event that the first $k-1$ coins are tails and the $k$-th coin is heads, where all coins are independent. So
\[
\mathbb{P}(T = k) = (1-p)^{k-1} p.
\]
Note that
\[
\mathbb{P}(T = \infty) = 1 - \mathbb{P}(T < \infty) = 1 - \sum_{k=1}^{\infty} \mathbb{P}(T = k) = 1 - p \sum_{k=1}^{\infty} (1-p)^{k-1} = 0.
\]
So the range of $T$ is $\mathbb{N}$. ⋄

The distribution of $T$ above is called the Geometric-$p$ distribution; that is, $X$ has the Geometric-$p$ distribution, denoted $X \sim \mathrm{Geo}(p)$, if
\[
f_X(k) = \mathbb{P}(X = k) = \begin{cases} (1-p)^{k-1} p & k \in \{1, 2, \dots\} \\ 0 & \text{otherwise.} \end{cases}
\]
What is $F_X$?
\[
F_X(t) = \sum_{k \in \mathbb{N} \,:\, 1 \le k \le t} (1-p)^{k-1} p = 1 - (1-p)^{\lfloor t \rfloor}.
\]
So the Geometric distribution is the number of trials of independent identical experiments until the first success.

Example 8.4. An urn contains $b$ black balls and $w$ white balls. We randomly remove a ball, all balls equally likely, and then return that ball. Let $T$ be the number of trials until we see a black ball. What is the distribution of $T$?

Well, every time we try there is independently a probability of $p := b/(b+w)$ of removing a black ball. So the distribution of $T$ is Geometric-$p$. ⋄

Proposition 8.5 (Memoryless property for the geometric distribution). Let $X$ be a Geometric-$p$ random variable. For any $k > m \ge 1$,
\[
\mathbb{P}(X = k \mid X > m) = \mathbb{P}(X = k - m).
\]

Proof. By definition,
\[
\mathbb{P}(X = k \mid X > m) = \frac{\mathbb{P}(X = k, X > m)}{\mathbb{P}(X > m)} = \frac{(1-p)^{k-1} p}{(1-p)^m} = (1-p)^{k-m-1} p = \mathbb{P}(X = k - m).
\]
∎

That is, if we are given that we have not succeeded up to time $m$, the probability of succeeding exactly $k - m$ steps later is the same as the unconditional probability of succeeding at step $k - m$.
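Proposition 8.5 can also be checked mechanically for small cases. This is an illustrative sketch (the helper names are ours, not the notes'), using exact rational arithmetic:

```python
from fractions import Fraction

def geo_pmf(k, p):
    """P(X = k) for X ~ Geo(p): first success on trial k."""
    return (1 - p) ** (k - 1) * p

def geo_tail(m, p):
    """P(X > m) = (1 - p)^m."""
    return (1 - p) ** m

p = Fraction(1, 3)
for m in range(1, 6):
    for k in range(m + 1, 12):
        # P(X = k | X > m) should equal P(X = k - m)
        assert geo_pmf(k, p) / geo_tail(m, p) == geo_pmf(k - m, p)
print("memoryless identity holds for all tested k > m")
```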

8.1.4. Poisson Distribution. A random variable $X$ has the Poisson-$\lambda$ distribution ($0 < \lambda \in \mathbb{R}$) if its range is $\mathbb{N}$ and
\[
f_X(k) = \mathbb{P}(X = k) = e^{-\lambda} \cdot \frac{\lambda^k}{k!}.
\]
Thus,
\[
F_X(t) = \sum_{k=0}^{\lfloor t \rfloor} e^{-\lambda} \cdot \frac{\lambda^k}{k!}.
\]

[Siméon Poisson (1781–1840)] The importance of the Poisson distribution is that it seems to occur naturally in many practical situations (the number of phone calls arriving at a call center per minute, the number of goals per game in football, the number of mutations in a given stretch of DNA after a certain amount of radiation, the number of requests sent to a printer from a network...).

The reason this may happen is that the Poisson distribution is the limit of Binomial distributions when the number of trials goes to infinity: suppose $X_n \sim \mathrm{Bin}(n, \lambda/n)$. Then,
\[
\mathbb{P}(X_n = k) = \binom{n}{k} \left( \frac{\lambda}{n} \right)^k \left( 1 - \frac{\lambda}{n} \right)^{n-k}
= \frac{\lambda^k}{k!} \cdot \left( 1 - \tfrac{1}{n} \right) \left( 1 - \tfrac{2}{n} \right) \cdots \left( 1 - \tfrac{k-1}{n} \right) \cdot \left( 1 - \frac{\lambda}{n} \right)^{n-k}
\longrightarrow \frac{\lambda^k}{k!} e^{-\lambda}.
\]

So the Poisson distribution is like doing infinitely many trials of an experiment that has a small probability of success, and asking how many successes are observed.
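The convergence of $\mathrm{Bin}(n, \lambda/n)$ to Poisson-$\lambda$ can be seen numerically. A sketch (the parameter values $\lambda = 3$, $k = 4$ are arbitrary choices for illustration):

```python
from math import comb, exp, factorial

def binom_pmf(n, k, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam, k = 3.0, 4
for n in (10, 100, 10000):
    print(n, binom_pmf(n, k, lam / n))  # approaches the Poisson value
print("poisson limit:", poisson_pmf(k, lam))
```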

Example 8.6. Suppose the number of people at a checkout counter at the supermarket has Poisson distribution: checkout A has Poisson-5 and checkout B has Poisson-2, and the two counters are independent. Noga arrives at the counters and chooses the shorter line (choosing A if they are tied). What is the probability she chooses B? What is the probability both checkouts are empty?

Solution. Let $X$ be the number of people at A, and $Y$ the number at B. Noga chooses B exactly when $Y < X$, so we want the probability of $Y < X$. The information we have is that
\[
\mathbb{P}(Y = k, X = m) = \mathbb{P}(Y = k) \, \mathbb{P}(X = m) \quad \text{for all } k, m \in \mathbb{N}.
\]
So
\[
\mathbb{P}(Y < X) = \sum_{m=0}^{\infty} \mathbb{P}(X = m) \sum_{k < m} \mathbb{P}(Y = k)
= \sum_{m=0}^{\infty} e^{-5} \frac{5^m}{m!} \sum_{k=0}^{m-1} e^{-2} \frac{2^k}{k!}.
\]
Also,
\[
\mathbb{P}(X = 0, Y = 0) = \mathbb{P}(X = 0) \, \mathbb{P}(Y = 0) = e^{-5} e^{-2} = e^{-7}.
\]
∎
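The double sum above has no simple closed form, but it is easy to evaluate numerically. A sketch (the truncation point is an arbitrary choice; Poisson(5) mass beyond it is negligible):

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

M = 80  # truncation point for the infinite sums
p_Y_less = sum(pois_pmf(m, 5) * sum(pois_pmf(k, 2) for k in range(m)) for m in range(M))
p_equal  = sum(pois_pmf(m, 5) * pois_pmf(m, 2) for m in range(M))
p_Y_more = sum(pois_pmf(m, 2) * sum(pois_pmf(k, 5) for k in range(m)) for m in range(M))

print("P(choose B) = P(Y < X) ≈", round(p_Y_less, 4))
print("P(both empty) = e^-7 ≈", exp(-7))
```

As a sanity check, the three cases $Y < X$, $Y = X$, $Y > X$ should account for all the probability.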

8.1.5. Hypergeometric. Suppose we have an urn with $N \ge 1$ balls, of which $0 \le m \le N$ are black and the rest are white. If we choose $1 \le n \le N$ balls from the urn randomly and uniformly, what is the number of black balls we get? The space here is the uniform measure on
\[
\Omega = \left\{ \omega \subset \{1, \dots, N\} : |\omega| = n \right\} = \binom{\{1, \dots, N\}}{n}.
\]
We ask for the random variable $X(\omega) := |\omega \cap \{1, \dots, m\}|$.

A simple calculation reveals that for any $0 \le k \le \min\{n, m\}$,
\[
\mathbb{P}[X = k] = \frac{\binom{m}{k} \binom{N-m}{n-k}}{\binom{N}{n}}.
\]
(We interpret $\binom{a}{b}$ for $b > a$ as 0, so actually $\max\{0, m+n-N\} \le k \le \min\{m, n\}$.) Such a random variable $X$ is said to have the hypergeometric distribution with parameters $N, m, n$, and we write $X \sim H(N, m, n)$.

Example 8.7. Chagai chooses 5 cards from a deck, all possibilities equally likely. How much more likely is it for him to choose 3 aces than to choose 4 aces? How much more likely is it to choose 4 diamonds than to choose 5 diamonds?

Let $X$ = the number of aces chosen. The size of the deck is $N = 52$, the number of cards chosen is $n = 5$, and the number of "special" cards is $m = 4$. So
\[
\frac{\mathbb{P}[X = 3]}{\mathbb{P}[X = 4]} = \frac{\binom{m}{3} \binom{N-m}{n-3}}{\binom{m}{4} \binom{N-m}{n-4}}
= \frac{4 \cdot (N - m - n + 4)}{(m-3) \cdot (n-3)} = \frac{4 \cdot 47}{1 \cdot 2} = 94.
\]
For $Y$ = the number of diamonds chosen, we get $N = 52$, $n = 5$, $m = 13$. So
\[
\frac{\mathbb{P}[Y = 4]}{\mathbb{P}[Y = 5]} = \frac{\binom{m}{4} \binom{N-m}{n-4}}{\binom{m}{5} \binom{N-m}{n-5}}
= \frac{5 \cdot (N - m - n + 5)}{(m-4) \cdot (n-4)} = \frac{5 \cdot 39}{9 \cdot 1} = \frac{5 \cdot 13}{3} = 21\tfrac{2}{3}.
\]
⋄

8.1.6. Negative Binomial. We say that $X \sim \mathrm{NB}(m, p)$, or $X$ has the negative binomial distribution (or Pascal distribution), if $X$ is the number of trials until the $m$-th success, each trial succeeding independently with probability $p$.

[Blaise Pascal (1623–1662)] The event $X = k$ is the event that there are $m-1$ successes in the first $k-1$ trials, and then a success in the $k$-th trial. So
\[
\mathbb{P}[X = k] = \binom{k-1}{m-1} p^m (1-p)^{(k-1)-(m-1)} = \binom{k-1}{m-1} p^m (1-p)^{k-m}, \qquad k \ge m.
\]

Example 8.8. Let $X \sim \mathrm{NB}(1, p)$. Then $X \sim \mathrm{Geo}(p)$. Indeed, for any $k \ge 1$,
\[
\mathbb{P}[X = k] = \binom{k-1}{0} p (1-p)^{k-1} = (1-p)^{k-1} p.
\]
⋄
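Both ratios in Example 8.7 can be verified directly from the hypergeometric pmf. A sketch with exact fractions (`hyper_pmf` is our helper name):

```python
from fractions import Fraction
from math import comb

def hyper_pmf(N, m, n, k):
    """P(X = k) for X ~ H(N, m, n)."""
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))

# aces: N = 52, m = 4, n = 5
r_aces = hyper_pmf(52, 4, 5, 3) / hyper_pmf(52, 4, 5, 4)
# diamonds: N = 52, m = 13, n = 5
r_diam = hyper_pmf(52, 13, 5, 4) / hyper_pmf(52, 13, 5, 5)
print(r_aces)  # 94
print(r_diam)  # 65/3, i.e. 21 + 2/3
```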


Lecture 9

9.1. Continuous Random Variables

Definition 9.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A random variable $X : \Omega \to \mathbb{R}$ is called continuous if the distribution function $F_X$ is continuous at all points.

Remark 9.2. Recall that we showed that for all $x \in \mathbb{R}$,
\[
\mathbb{P}(X = x) = \lim_{n \to \infty} \mathbb{P}_X((x - 1/n, x]) = F_X(x) - \lim_{n \to \infty} F_X(x - 1/n) = F_X(x) - F_X(x^-).
\]
Thus, if $F_X$ is continuous at $x$, then $\mathbb{P}(X = x) = 0$. So, if $X$ is a continuous random variable then $\mathbb{P}(X = x) = 0$ for all $x \in \mathbb{R}$. This is very different from the discrete case!

Continuous random variables can be very pathological, and we need measure theory to deal with some of their aspects. We restrict ourselves in this course to a "nicer" family of random variables.

Definition 9.3. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A random variable $X : \Omega \to \mathbb{R}$ is called absolutely continuous if there exists an integrable non-negative function $f_X : \mathbb{R} \to \mathbb{R}^+$ such that for all $x \in \mathbb{R}$,
\[
\mathbb{P}(X \le x) = F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt.
\]
Such a function $f_X$ is called the probability density function (or just density function, or PDF) of $X$.

Remark 9.4. Recall that the fundamental theorem of calculus says that if $f_X$ is continuous, then $F_X$ is differentiable and $F_X' = f_X$. Also, a result of calculus tells us that the function $x \mapsto \int_{-\infty}^{x} f(t)\,dt$ is a continuous function of $x$. So any absolutely continuous random variable is also continuous.

Under the carpet: We don't know how to prove that an integrable function $f_X : \mathbb{R} \to \mathbb{R}^+$ uniquely defines a probability measure $\mathbb{P}_X$ on $(\mathbb{R}, \mathcal{B})$, since we would need some measure theory for this (actually we could use the Lebesgue–Stieltjes Theorem). However, given that such a measure is uniquely defined, we have the following properties:

Proposition 9.5. Let $X$ be an absolutely continuous random variable with PDF $f_X$.

(1) By definition,
\[
\mathbb{P}_X((a, b]) = F_X(b) - F_X(a) = \int_{-\infty}^{b} f_X(t)\,dt - \int_{-\infty}^{a} f_X(t)\,dt = \int_{a}^{b} f_X(t)\,dt.
\]

(2) We don't always know over which sets we can integrate, but if the integral exists we have
\[
\mathbb{P}_X(A) = \int_{A} f_X(t)\,dt.
\]
For example, this holds if $A = \biguplus_{j=1}^{n} (a_j, b_j]$.

(3) Continuity of probability gives us that $\mathbb{P}_X(\{a\}) = 0$ for all $a$ (because $X$ is a continuous random variable, so $F_X$ is continuous at all points). Thus,
\[
\mathbb{P}_X((a, b]) = \mathbb{P}_X([a, b)) = \mathbb{P}_X((a, b)) = \mathbb{P}_X([a, b]).
\]

(4) We always have
\[
\int_{\mathbb{R}} f_X(t)\,dt = \mathbb{P}_X(\mathbb{R}) = 1.
\]

Proof. Just calculus of the Riemann integral, and continuity of probability measures. ∎

Example 9.6. Let $X$ be a random variable with PDF
\[
f_X(t) = \begin{cases} C(2t - t^2) & 0 \le t \le 2 \\ 0 & \text{otherwise.} \end{cases}
\]
What is $C$? What is the probability that $X > 1$?

To compute $C$ we use the fact that $\int_{\mathbb{R}} f_X(t)\,dt = 1$. So
\[
1 = \int_{-\infty}^{\infty} f_X(t)\,dt = C \int_{0}^{2} (2t - t^2)\,dt = C \left( t^2 - \frac{t^3}{3} \right)\Big|_{0}^{2} = C(4 - 8/3) = C \cdot \frac{4}{3}.
\]
So $C = 3/4$. The probability that $X > 1$ is
\[
\mathbb{P}_X((1, \infty)) = 1 - \mathbb{P}_X((-\infty, 1]) = 1 - C \int_{0}^{1} (2t - t^2)\,dt = 1 - \frac{3}{4} \cdot \frac{2}{3} = \frac{1}{2}.
\]
⋄
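Example 9.6 is easy to verify with crude numerical integration. A sketch using a midpoint rule (the step count is an arbitrary choice):

```python
def f(t, C=0.75):
    """the PDF of Example 9.6, with C = 3/4"""
    return C * (2 * t - t * t) if 0 <= t <= 2 else 0.0

def integrate(g, a, b, n=100000):
    """midpoint-rule numerical integration of g over [a, b]"""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0, 2)   # should be 1
p_gt_1 = integrate(f, 1, 2)  # should be P(X > 1) = 1/2
print(round(total, 6), round(p_gt_1, 6))
```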

9.1.1. Uniform Distribution. An absolutely continuous random variable $X$ has the uniform distribution on $[a,b]$, denoted $X \sim U([a,b])$, if $X$ has PDF
\[
f_X(t) = \begin{cases} \frac{1}{b-a} & a \le t \le b \\ 0 & \text{otherwise.} \end{cases}
\]
Indeed $\int_{\mathbb{R}} f_X(t)\,dt = 1$.

Example 9.7. A bus arrives at the station every 15 minutes, starting at 5:00. A passenger arrives at a time uniformly distributed between 7:00 and 7:30. What is the probability that he waits no more than 5 minutes for a bus?

If we take $X$ = the time (in minutes after 7:00) at which the passenger arrives, then $X$ has the uniform distribution on $[0, 30]$. The event that the passenger waits at most 5 minutes is the disjoint union
\[
\{X = 0\} \uplus \{10 \le X \le 15\} \uplus \{25 \le X \le 30\}.
\]
Thus the probability is
\[
\mathbb{P}(X = 0) + \mathbb{P}(10 \le X \le 15) + \mathbb{P}(25 \le X \le 30) = 0 + \frac{5}{30} + \frac{5}{30} = \frac{1}{3}.
\]
⋄

The uniform distribution is very symmetric in the following sense:

Exercise 9.8. Let $X \sim U([a,b])$. Show that $Y = X + c$ has the uniform distribution on $[a+c, b+c]$.

Solution. We need to show that
\[
F_Y(x) = \mathbb{P}(\{ \omega : Y(\omega) \le x \}) = \int_{-\infty}^{x} \frac{1}{(b+c) - (a+c)} \cdot \mathbf{1}_{[a+c,\,b+c]}(t)\,dt.
\]
Indeed, since $\{ \omega : Y(\omega) \le x \} = \{ \omega : X(\omega) \le x - c \}$, we have that
\[
F_Y(x) = F_X(x - c) = \int_{-\infty}^{x-c} \frac{1}{b-a} \cdot \mathbf{1}_{[a,b]}(t)\,dt
= \int_{-\infty}^{x} \frac{1}{(b+c) - (a+c)} \cdot \mathbf{1}_{[a+c,\,b+c]}(u)\,du,
\]
using the substitution $u = t + c$. ∎

Exercise 9.9. What is the distribution function of a $U([a,b])$ random variable?

Solution. For $a \le x \le b$ we have
\[
F_X(x) = \int_{a}^{x} \frac{1}{b-a}\,dt = \frac{x-a}{b-a}.
\]
For $x < a$, $F_X(x) = 0$, and for $x > b$, $F_X(x) = 1$. ∎

9.1.2. Exponential Distribution. Let $0 < \lambda \in \mathbb{R}$. An absolutely continuous random variable $X$ is said to have the exponential distribution with parameter $\lambda$, denoted $X \sim \mathrm{Exp}(\lambda)$, if it has PDF
\[
f_X(t) = \mathbf{1}_{\{t \ge 0\}} \cdot \lambda e^{-\lambda t}.
\]
What is the distribution function in this case? For $x \ge 0$,
\[
F_X(x) = \lambda \int_{0}^{x} e^{-\lambda t}\,dt = \left( -e^{-\lambda t} \right)\Big|_{0}^{x} = 1 - e^{-\lambda x},
\]
and for $x < 0$, $F_X(x) = 0$.

Example 9.10. The duration of the wait in line at Kupat Cholim is exponentially distributed with parameter 1/5 (in minutes). What is the probability that the wait is longer than 10 minutes? Suppose we are given that the wait is longer than 20 minutes; what is the probability, given this information, that the wait is longer than 30 minutes?

Let $X$ = waiting time, so $X \sim \mathrm{Exp}(1/5)$. Then
\[
\mathbb{P}(X > 10) = \int_{10}^{\infty} \frac{1}{5} e^{-t/5}\,dt = \left( -e^{-t/5} \right)\Big|_{10}^{\infty} = e^{-2}.
\]
Now, using the definition of conditional probability,
\[
\mathbb{P}(X > 30 \mid X > 20) = \frac{\mathbb{P}(X > 30)}{\mathbb{P}(X > 20)}.
\]
Since
\[
\mathbb{P}(X > x) = \left( -e^{-t/5} \right)\Big|_{x}^{\infty} = e^{-x/5},
\]
we get that
\[
\mathbb{P}(X > 30 \mid X > 20) = e^{-6} / e^{-4} = e^{-2}.
\]
⋄

Is it a coincidence that we got the same answer? No. Recall that the geometric distribution has the memoryless property; that is, a geometric random variable $X$ satisfies
\[
\mathbb{P}(X > m + n \mid X > m) = \mathbb{P}(X > n).
\]
Similarly, the exponential distribution has the memoryless property:

Proposition 9.11. Let $X \sim \mathrm{Exp}(\lambda)$. Then for any $x, y \in \mathbb{R}^+$,
\[
\mathbb{P}(X > x + y \mid X > x) = \mathbb{P}(X > y).
\]
That is, knowing that we have already waited for some time gives no information about how much longer we have to wait.

Proof. First,
\[
\mathbb{P}(X > t) = 1 - F_X(t) = e^{-\lambda t}.
\]
So
\[
\mathbb{P}(X > x + y \mid X > x) = \frac{\mathbb{P}(X > x + y)}{\mathbb{P}(X > x)} = \frac{e^{-\lambda(x+y)}}{e^{-\lambda x}} = e^{-\lambda y} = \mathbb{P}(X > y).
\]
∎

However, it turns out that the exponential distribution is the only continuous distribution with the memoryless property:
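The memoryless property can also be observed empirically. A Monte Carlo sketch (the sample size and seed are arbitrary choices):

```python
import random

random.seed(0)
lam, n = 1.0, 200000
samples = [random.expovariate(lam) for _ in range(n)]

# P(X > 1) and P(X > 2 | X > 1) should both be close to e^{-1} ≈ 0.368
p_gt_1 = sum(s > 1 for s in samples) / n
over_1 = [s for s in samples if s > 1]
p_gt_2_given_1 = sum(s > 2 for s in over_1) / len(over_1)
print(round(p_gt_1, 3), round(p_gt_2_given_1, 3))
```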

Proposition 9.12. Let $X$ be a continuous random variable such that for all $x, y \in \mathbb{R}^+$,
\[
\mathbb{P}(X > x + y \mid X > x) = \mathbb{P}(X > y).
\]
Then $X \sim \mathrm{Exp}(\lambda)$ for some $\lambda > 0$.

Proof. $X$ is continuous, so $F_X$ is a continuous function. Let $g(t) = 1 - F_X(t) = \mathbb{P}(X > t)$; then $g$ is a continuous function as well. Now, for any $x, y \in \mathbb{R}^+$,
\[
g(x + y) = g(x) g(y).
\]
Let $\lambda \ge 0$ be such that $g(1) = e^{-\lambda}$ (recall that $g : \mathbb{R} \to [0,1]$). For any $0 < n \in \mathbb{N}$ we have that
\[
g(n) = g(n-1) g(1) = \cdots = g(1)^n.
\]
Since $g(n) \to 0$, we have that $g(1) < 1$ and $\lambda > 0$. For any $0 < m \in \mathbb{N}$ we have that
\[
g(1) = g(\tfrac{1}{m}) g(1 - \tfrac{1}{m}) = \cdots = g(\tfrac{1}{m})^m,
\]
so $g(1/m) = g(1)^{1/m}$. In a similar way, we get that for any rational number $n/m$,
\[
g(\tfrac{n}{m}) = g(1)^{n/m} = e^{-\lambda n/m}.
\]
Since $g$ is continuous and coincides with $e^{-\lambda q}$ for all $q \in \mathbb{Q}^+$, we get that $g(t) = e^{-\lambda t}$. ∎

9.1.3. Normal Distribution.

Some calculus. Let us look at the following integral:
\[
I = \int_{-\infty}^{\infty} e^{-t^2}\,dt.
\]
How can we compute this value? We use two-dimensional calculus. Note that
\[
I^2 = \int_{\mathbb{R}^2} e^{-(x^2 + y^2)}\,dx\,dy.
\]
Setting $x = r \cos\theta$ and $y = r \sin\theta$ (or $r = \sqrt{x^2 + y^2}$ and $\theta = \arctan(y/x)$), the Jacobian matrix is
\[
J = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\[2pt] \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix}
= \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix},
\]
so $\det(J) = r$. Thus,
\[
I^2 = \int_{0}^{2\pi} \int_{0}^{\infty} e^{-r^2} r\,dr\,d\theta.
\]
Substituting $s = r^2$, so that $ds = 2r\,dr$, we have that
\[
I^2 = \int_{0}^{2\pi} \int_{0}^{\infty} \tfrac{1}{2} e^{-s}\,ds\,d\theta = \pi.
\]
So $I = \sqrt{\pi}$. Now, we can use another trick to compute
\[
I' = \int_{-\infty}^{\infty} \exp\left( -\frac{(t - \mu)^2}{2\sigma^2} \right) dt.
\]
Letting $x = \frac{t - \mu}{\sqrt{2}\,\sigma}$, we have that $dt = \sqrt{2}\,\sigma\,dx$. Thus,
\[
I' = \sqrt{2}\,\sigma \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{2\pi}\,\sigma.
\]
We are ready to define the normal distribution. Let $\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}^+$. An absolutely continuous random variable has the normal-$(\mu, \sigma)$ distribution, denoted $X \sim N(\mu, \sigma)$, if it has PDF
\[
f_X(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t - \mu)^2}{2\sigma^2} \right).
\]
[Carl Friedrich Gauss (1777–1855)] A normal-$(0,1)$ random variable is called a standard normal random variable.

Example 9.13. The lifetime of people in Oceania is distributed as a normal-(75, 8) random variable. What is the probability a person lives more than 100 years? Show that it is less than $e^{-\xi^2/2} \cdot \frac{1}{\sqrt{2\pi}}$, where $\xi = 25/8$. [George Orwell (1903–1950)]

If $X$ = lifetime, then $X \sim N(75, 8)$. So
\[
\mathbb{P}(X > 100) = \int_{100}^{\infty} \frac{1}{\sqrt{2\pi} \cdot 8} e^{-(t-75)^2 / (2 \cdot 8^2)}\,dt.
\]
Taking $s = (t - 75)/8$, we have that $dt = 8\,ds$, so
\[
\mathbb{P}(X > 100) = \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds.
\]
Since on $(\xi, \infty)$ we have that $1 < s$, we get that
\[
\mathbb{P}(X > 100) < \frac{1}{\sqrt{2\pi}} \int_{\xi}^{\infty} s e^{-s^2/2}\,ds.
\]
Substituting $u = s^2/2$, so $du = s\,ds$, we have that
\[
\mathbb{P}(X > 100) < \frac{1}{\sqrt{2\pi}} \int_{\xi^2/2}^{\infty} e^{-u}\,du = e^{-\xi^2/2} \cdot \frac{1}{\sqrt{2\pi}}.
\]
⋄
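The bound in Example 9.13 can be compared with the exact Gaussian tail, which the standard library exposes through `math.erfc`:

```python
from math import erfc, exp, pi, sqrt

xi = 25 / 8
# exact: P(X > 100) = P(Z > xi) for a standard normal Z
tail = erfc(xi / sqrt(2)) / 2
# the bound derived in the example
bound = exp(-xi * xi / 2) / sqrt(2 * pi)
print(tail, bound, tail < bound)
```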

A nice property of the normal distribution is the following:

Exercise 9.14. Let $X \sim N(\mu, \sigma)$ and let $a > 0$, $b \in \mathbb{R}$. Define $Y(\omega) = a X(\omega) + b$. Show that $Y \sim N(a\mu + b, a\sigma)$.

Solution. We need to prove that for any $y \in \mathbb{R}$,
\[
\mathbb{P}(Y \le y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi} \cdot a\sigma} \exp\left( -\frac{(t - a\mu - b)^2}{2 a^2 \sigma^2} \right) dt.
\]
Since
\[
\{ \omega : Y(\omega) \le y \} = \{ \omega : X(\omega) \le (y - b)/a \},
\]
we have that
\[
\mathbb{P}(Y \le y) = \mathbb{P}(X \le (y-b)/a) = \int_{-\infty}^{(y-b)/a} \frac{1}{\sqrt{2\pi} \cdot \sigma} \exp\left( -\frac{(t - \mu)^2}{2\sigma^2} \right) dt.
\]
Substituting $t - \mu = \frac{s - a\mu - b}{a}$ (or $t = \frac{s - b}{a}$), we have that $dt = \frac{1}{a}\,ds$ and
\[
\mathbb{P}(Y \le y) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi} \cdot a\sigma} \exp\left( -\frac{(s - a\mu - b)^2}{2 a^2 \sigma^2} \right) ds.
\]
∎

Corollary 9.15. If $X$ is a normal-$(\mu, \sigma)$ random variable, then $\frac{X - \mu}{\sigma}$ is a standard normal random variable.

Example 9.16. Let $X \sim N(\mu, \sigma)$. Show that $F_X(t) > 0$ for all $t$.

Indeed,
\[
F_X(t) = \int_{-\infty}^{t} f_X(s)\,ds \ge \int_{t-1}^{t} f_X(s)\,ds \ge \inf_{s \in [t-1,\,t]} f_X(s).
\]
Since $f_X$ is increasing up to $\mu$ and decreasing after $\mu$, the infimum on $[t-1, t]$ is attained at one of the endpoints. So
\[
F_X(t) \ge \min\{ f_X(t-1), f_X(t) \} > 0.
\]
⋄

Example 9.17. Let $X \sim N(0,1)$. Show that $X^2$ is absolutely continuous. Is $X^2$ normally distributed?

First, to show that $X^2$ is absolutely continuous we need to show that
\[
F_{X^2}(t) = \int_{-\infty}^{t} f(s)\,ds
\]
for some function $f : \mathbb{R} \to \mathbb{R}^+$. Indeed, if $t < 0$ then $\{ X^2 \le t \} = \emptyset$, so $F_{X^2}(t) = 0$. If $t \ge 0$,
\[
F_{X^2}(t) = \mathbb{P}[X^2 \le t] = \mathbb{P}[-\sqrt{t} \le X \le \sqrt{t}] = \int_{-\sqrt{t}}^{\sqrt{t}} \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds.
\]
Since the integrand is even (symmetric around 0), we have that
\[
F_{X^2}(t) = 2 \int_{0}^{\sqrt{t}} \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds = \int_{0}^{t} \frac{1}{\sqrt{2\pi x}} e^{-x/2}\,dx,
\]
where we have used the change of variables $x = s^2$, so $ds = \frac{1}{2\sqrt{x}}\,dx$. Now, if we define
\[
f(s) = \begin{cases} \frac{1}{\sqrt{2\pi s}} e^{-s/2} & s > 0 \\ 0 & s \le 0, \end{cases}
\]
then we get from all the above that
\[
F_{X^2}(t) = \int_{-\infty}^{t} f(s)\,ds.
\]
So $X^2$ is absolutely continuous with density $f_{X^2} = f$.

Is $X^2$ normal? No. This can be seen since $F_{X^2}(0) = 0$, but $F_N(0) > 0$ for any $N \sim N(\mu, \sigma)$. ⋄
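The density derived in Example 9.17 can be sanity-checked numerically: integrating it from 0 to 1 should give $\mathbb{P}(-1 \le X \le 1) = \operatorname{erf}(1/\sqrt{2})$ for $X \sim N(0,1)$. A sketch (the midpoint rule converges slowly near the $1/\sqrt{s}$ singularity, so only a loose tolerance is claimed):

```python
from math import erf, exp, pi, sqrt

def chi2_density(s):
    """the density of X^2 from Example 9.17"""
    return exp(-s / 2) / sqrt(2 * pi * s) if s > 0 else 0.0

def integrate(g, a, b, n=100000):
    """midpoint-rule numerical integration of g over [a, b]"""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

lhs = integrate(chi2_density, 0.0, 1.0)  # F_{X^2}(1)
rhs = erf(1 / sqrt(2))                   # P(-1 <= X <= 1), about 0.6827
print(round(lhs, 3), round(rhs, 3))
```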


Lecture 10

10.1. Independence Revisited

10.1.1. Some reminders. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Given a collection of subsets $\mathcal{K} \subset \mathcal{F}$, recall that the σ-algebra generated by $\mathcal{K}$ is
\[
\sigma(\mathcal{K}) = \bigcap \left\{ \mathcal{G} : \mathcal{G} \text{ is a } \sigma\text{-algebra},\ \mathcal{K} \subset \mathcal{G} \right\},
\]
and this σ-algebra is the smallest σ-algebra containing $\mathcal{K}$. $\sigma(\mathcal{K})$ can be thought of as all possible information that can be generated by the sets in $\mathcal{K}$.

Recall that a collection of events $(A_n)_n$ are mutually independent if for any finite number of these events $A_1, \dots, A_n$ we have that
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n) = \mathbb{P}(A_1) \cdots \mathbb{P}(A_n).
\]

We can also define the independence of families of events:

Definition 10.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $(\mathcal{K}_n)_n$ be a collection of families of events in $\mathcal{F}$. Then $(\mathcal{K}_n)_n$ are mutually independent if for any finite number of families $\mathcal{K}_1, \dots, \mathcal{K}_n$ from this collection and any events $A_1 \in \mathcal{K}_1, A_2 \in \mathcal{K}_2, \dots, A_n \in \mathcal{K}_n$, we have that
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n) = \mathbb{P}(A_1) \cdots \mathbb{P}(A_n).
\]

For example, we would like to think that $(A_n)$ are mutually independent if all the information from each event in the sequence is independent of the information from the other events in the sequence. This is the content of the next proposition.

Proposition 10.2. Let $(A_n)$ be a sequence of events on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then $(A_n)$ are mutually independent if and only if the σ-algebras $(\sigma(A_n))_n$ are mutually independent.

It is not difficult to prove this by induction:

Proof. By induction on $n$ we show that $(\sigma(A_1), \dots, \sigma(A_n), \{A_{n+1}\}, \{A_{n+2}\}, \dots)$ are mutually independent. The base $n = 0$ is just the assumption.

So assume that $(\sigma(A_1), \dots, \sigma(A_{n-1}), \{A_n\}, \{A_{n+1}\}, \dots)$ are mutually independent. Let $n_1 < n_2 < \cdots < n_k < n_{k+1} < \cdots < n_m$ be a finite number of indices such that $n_k < n < n_{k+1}$. Let $B_j \in \sigma(A_{n_j})$ for $j = 1, \dots, k$, and $B_j = A_{n_j}$ for $j = k+1, \dots, m$. Let $B \in \sigma(A_n)$.

If $B = \emptyset$ then $\mathbb{P}(B_1 \cap \cdots \cap B_m \cap B) = 0 = \mathbb{P}(B_1) \cdots \mathbb{P}(B_m) \mathbb{P}(B)$. If $B = \Omega$ then $\mathbb{P}(B_1 \cap \cdots \cap B_m \cap B) = \mathbb{P}(B_1 \cap \cdots \cap B_m) = \mathbb{P}(B_1) \cdots \mathbb{P}(B_m)$, by the induction hypothesis. If $B = A_n$ then this also holds by the induction hypothesis. So we only have to deal with the case $B = A_n^c$. In this case, for $X = B_1 \cap \cdots \cap B_m$,
\[
\mathbb{P}(X \cap B) = \mathbb{P}(X \setminus (X \cap A_n)) = \mathbb{P}(X) - \mathbb{P}(X \cap A_n) = \mathbb{P}(X)(1 - \mathbb{P}(A_n)) = \mathbb{P}(X)\,\mathbb{P}(B),
\]
where we have used the induction hypothesis to say that $X$ and $A_n$ are independent. Since this holds for any choice of a finite number of events, we have the induction step. ∎

However, we will take a more winding road to get stronger results... (Corollary 10.8)

10.2. π-systems and Independence

Definition 10.3. Let $\mathcal{K}$ be a family of subsets of $\Omega$.

• We say that $\mathcal{K}$ is a π-system if $\emptyset \in \mathcal{K}$ and $\mathcal{K}$ is closed under intersections; that is, for all $A, B \in \mathcal{K}$, $A \cap B \in \mathcal{K}$.

• We say that $\mathcal{K}$ is a Dynkin system (or λ-system) if $\mathcal{K}$ is closed under set complements and countable disjoint unions; that is, for any $A \in \mathcal{K}$ and any sequence $(A_n)$ in $\mathcal{K}$ of pairwise disjoint subsets, we have that $A^c \in \mathcal{K}$ and $\biguplus_n A_n \in \mathcal{K}$.

The main goal now is to show that probability measures are uniquely defined once they are defined on a π-system. This is the content of Theorem 10.7.

Proposition 10.4. If $\mathcal{F}$ is a Dynkin system on $\Omega$ and is also a π-system on $\Omega$, then $\mathcal{F}$ is a σ-algebra.

Proof. Since $\mathcal{F}$ is a Dynkin system, it is closed under complements, and $\Omega = \emptyset^c \in \mathcal{F}$. Let $(A_n)$ be a sequence of subsets in $\mathcal{F}$. Set $B_1 = A_1$, $C_1 = \emptyset$, and for $n > 1$,
\[
C_n = \bigcup_{j=1}^{n-1} A_j \quad \text{and} \quad B_n = A_n \setminus C_n.
\]
Since $\mathcal{F}$ is a π-system and closed under complements, $C_n^c = \bigcap_{j=1}^{n-1} A_j^c \in \mathcal{F}$, and so $B_n = A_n \cap C_n^c \in \mathcal{F}$. Since $(B_n)$ is a sequence of pairwise disjoint sets, and since $\mathcal{F}$ is a Dynkin system,
\[
\bigcup_n A_n = \biguplus_n B_n \in \mathcal{F}.
\]
∎

Proposition 10.5. If $(\mathcal{D}_\alpha)_\alpha$ is a collection of Dynkin systems (not necessarily countable), then $\mathcal{D} = \bigcap_\alpha \mathcal{D}_\alpha$ is a Dynkin system.

Proof. Since $\emptyset \in \mathcal{D}_\alpha$ for all $\alpha$, we have that $\emptyset \in \mathcal{D}$. If $A \in \mathcal{D}$, then $A \in \mathcal{D}_\alpha$ for all $\alpha$. So $A^c \in \mathcal{D}_\alpha$ for all $\alpha$, and thus $A^c \in \mathcal{D}$. If $(A_n)_n$ is a countable sequence of pairwise disjoint sets in $\mathcal{D}$, then $A_n \in \mathcal{D}_\alpha$ for all $\alpha$ and all $n$. Thus, for any $\alpha$, $\biguplus_n A_n \in \mathcal{D}_\alpha$. So $\biguplus_n A_n \in \mathcal{D}$. ∎

[Eugene Dynkin (1924–)] Lemma 10.6 (Dynkin's Lemma). If a Dynkin system $\mathcal{D}$ contains a π-system $\mathcal{K}$, then $\sigma(\mathcal{K}) \subset \mathcal{D}$.

Proof. Let
\[
\mathcal{F} = \bigcap \left\{ \mathcal{D}' : \mathcal{D}' \text{ is a Dynkin system containing } \mathcal{K} \right\}.
\]
So $\mathcal{F}$ is a Dynkin system and $\mathcal{K} \subset \mathcal{F} \subset \mathcal{D}$. We will show that $\mathcal{F}$ is a σ-algebra, so that $\sigma(\mathcal{K}) \subset \mathcal{F} \subset \mathcal{D}$.

Suppose we knew that $\mathcal{F}$ is closed under intersections (which is Claim 3 below). Since $\emptyset \in \mathcal{K} \subset \mathcal{F}$, we would then have that $\mathcal{F}$ is a π-system; being both a Dynkin system and a π-system, $\mathcal{F}$ would be a σ-algebra. Thus, to show that $\mathcal{F}$ is a σ-algebra, it suffices to show that $\mathcal{F}$ is closed under intersections. Note that $\mathcal{F}$ is closed under complements (because all Dynkin systems are).

Claim 1. If $A \subset B$ are subsets in $\mathcal{F}$, then $B \setminus A \in \mathcal{F}$.

Proof. If $A, B \in \mathcal{F}$, then since $\mathcal{F}$ is a Dynkin system, also $B^c \in \mathcal{F}$. Since $A \subset B$, the sets $A, B^c$ are disjoint, so $A \uplus B^c \in \mathcal{F}$, and so $B \setminus A = A^c \cap B = (A \uplus B^c)^c \in \mathcal{F}$. ∎

Claim 2. For any $K \in \mathcal{K}$, if $A \in \mathcal{F}$ then $A \cap K \in \mathcal{F}$.

Proof. Let $\mathcal{E} = \{ A : A \cap K \in \mathcal{F} \}$. Let $A \in \mathcal{E}$ and let $(A_n)$ be a sequence of pairwise disjoint subsets in $\mathcal{E}$. Since $K \in \mathcal{F}$ and $A \cap K \in \mathcal{F}$, by Claim 1 we have that $A^c \cap K = K \setminus (A \cap K) \in \mathcal{F}$, so $A^c \in \mathcal{E}$. Since $(A_n \cap K)_n$ is a sequence of pairwise disjoint subsets in $\mathcal{F}$, we get that
\[
\left( \biguplus_n A_n \right) \cap K = \biguplus_n (A_n \cap K) \in \mathcal{F},
\]
so $\biguplus_n A_n \in \mathcal{E}$. We conclude that $\mathcal{E}$ is a Dynkin system. Since $\mathcal{K}$ is closed under intersections, $\mathcal{E}$ contains $\mathcal{K}$. Thus, by definition, $\mathcal{F} \subset \mathcal{E}$. So for any $A \in \mathcal{F}$ we have that $A \in \mathcal{E}$, i.e. $A \cap K \in \mathcal{F}$. ∎

Claim 3. For any $B \in \mathcal{F}$, if $A \in \mathcal{F}$ then $A \cap B \in \mathcal{F}$.

Proof. Let $\mathcal{E} = \{ A : A \cap B \in \mathcal{F} \}$. Let $A \in \mathcal{E}$ and let $(A_n)$ be a sequence of pairwise disjoint subsets in $\mathcal{E}$. Since $B \in \mathcal{F}$ and $A \cap B \in \mathcal{F}$, by Claim 1 we have that $A^c \cap B = B \setminus (A \cap B) \in \mathcal{F}$, so $A^c \in \mathcal{E}$. Since $(A_n \cap B)_n$ is a sequence of pairwise disjoint subsets in $\mathcal{F}$, we get that
\[
\left( \biguplus_n A_n \right) \cap B = \biguplus_n (A_n \cap B) \in \mathcal{F},
\]
so $\biguplus_n A_n \in \mathcal{E}$. We conclude that $\mathcal{E}$ is a Dynkin system. By Claim 2, $\mathcal{K}$ is contained in $\mathcal{E}$. So, by definition, $\mathcal{F} \subset \mathcal{E}$. ∎

Since $\mathcal{F}$ is closed under intersections, this completes the proof. ∎

The next theorem tells us that a probability measure on $(\Omega, \mathcal{F})$ is determined by its values on a π-system generating $\mathcal{F}$.

Theorem 10.7 (Uniqueness of Extension). Let $\mathcal{K}$ be a π-system on $\Omega$, and let $\mathcal{F} = \sigma(\mathcal{K})$ be the σ-algebra generated by $\mathcal{K}$. Let $P, Q$ be two probability measures on $(\Omega, \mathcal{F})$ such that $P(A) = Q(A)$ for all $A \in \mathcal{K}$. Then $P(B) = Q(B)$ for any $B \in \mathcal{F}$.

Proof. Let $\mathcal{D} = \{ A \in \mathcal{F} : P(A) = Q(A) \}$. So $\mathcal{K} \subset \mathcal{D}$. We will show that $\mathcal{D}$ is a Dynkin system; since it contains $\mathcal{K}$, by Dynkin's Lemma it must then contain $\mathcal{F} = \sigma(\mathcal{K})$.

If $A \in \mathcal{D}$, then $P(A^c) = 1 - P(A) = 1 - Q(A) = Q(A^c)$, so $A^c \in \mathcal{D}$. Let $(A_n)$ be a sequence of pairwise disjoint sets in $\mathcal{D}$. Then,
\[
P\left( \biguplus_n A_n \right) = \sum_n P(A_n) = \sum_n Q(A_n) = Q\left( \biguplus_n A_n \right),
\]
so $\biguplus_n A_n \in \mathcal{D}$. ∎

Corollary 10.8. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $(\Pi_n)_n$ be a sequence of π-systems, and let $\mathcal{F}_n = \sigma(\Pi_n)$. Then $(\Pi_n)_n$ are mutually independent if and only if $(\mathcal{F}_n)_n$ are mutually independent.

Proof. We will prove by induction on $n$ that for any $n \ge 0$, the collection $(\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_n, \Pi_{n+1}, \Pi_{n+2}, \dots)$ are mutually independent.

For $n = 0$ this is the assumption. For $n \ge 1$, let $n_1 < n_2 < \cdots < n_k < n_{k+1} < \cdots < n_m$ be a finite number of indices such that $n_k < n < n_{k+1}$. Let $A_j \in \mathcal{F}_{n_j}$ for $j = 1, \dots, k$, and $A_j \in \Pi_{n_j}$ for $j = k+1, \dots, m$.

For any $A \in \mathcal{F}_n$: if $\mathbb{P}(A_1 \cap \cdots \cap A_m) = 0$ then $A$ is independent of $A_1 \cap \cdots \cap A_m$. So assume that $\mathbb{P}(A_1 \cap \cdots \cap A_m) > 0$, and define the probability measure
\[
P(A) := \mathbb{P}(A \mid A_1 \cap \cdots \cap A_m).
\]
By induction, the collection $(\mathcal{F}_1, \dots, \mathcal{F}_{n-1}, \Pi_n, \Pi_{n+1}, \dots)$ are mutually independent, so $P(A) = \mathbb{P}(A)$ for any $A \in \Pi_n$. Since $\Pi_n$ is a π-system generating $\mathcal{F}_n$, we have by Theorem 10.7 that $P(A) = \mathbb{P}(A)$ for any $A \in \mathcal{F}_n$. Since this holds for any choice of a finite number of events $A_1, \dots, A_m$, we get that the collection $(\mathcal{F}_1, \dots, \mathcal{F}_{n-1}, \mathcal{F}_n, \Pi_{n+1}, \dots)$ are mutually independent. ∎

Corollary 10.9. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $(A_n)_n$ be a sequence of mutually independent events. Then the σ-algebras $(\sigma(A_n))_n$ are mutually independent.

10.3. Second Borel–Cantelli Lemma

We conclude this lecture with

Lemma 10.10 (Second Borel–Cantelli Lemma). Let $(A_n)$ be a sequence of mutually independent events. If $\sum_n \mathbb{P}(A_n) = \infty$, then $\mathbb{P}(\limsup_n A_n) = \mathbb{P}(A_n \text{ i.o.}) = 1$.

Proof. It suffices to show that $\mathbb{P}(\liminf_n A_n^c) = 0$. For fixed $m > n$, note that by independence
\[
\mathbb{P}\left( \bigcap_{k=n}^{m} A_k^c \right) = \prod_{k=n}^{m} (1 - \mathbb{P}(A_k)).
\]
Since $\lim_{m} \bigcap_{k=n}^{m} A_k^c = \bigcap_{k \ge n} A_k^c$, we have that
\[
\mathbb{P}\left( \bigcap_{k \ge n} A_k^c \right) = \lim_{m \to \infty} \prod_{k=n}^{m} (1 - \mathbb{P}(A_k)) = \prod_{k \ge n} (1 - \mathbb{P}(A_k)) \le \exp\left( -\sum_{k \ge n} \mathbb{P}(A_k) \right),
\]
where we have used $1 - x \le e^{-x}$. Since $\sum_n \mathbb{P}(A_n) = \infty$, we have that $\sum_{k \ge n} \mathbb{P}(A_k) = \infty$ for all fixed $n$. Thus $\mathbb{P}(\bigcap_{k \ge n} A_k^c) = 0$ for all fixed $n$, and
\[
\mathbb{P}(\liminf_n A_n^c) = \mathbb{P}\left( \bigcup_n \bigcap_{k \ge n} A_k^c \right) = \lim_{n \to \infty} \mathbb{P}\left( \bigcap_{k \ge n} A_k^c \right) = 0,
\]
by continuity of probability (the sets $\bigcap_{k \ge n} A_k^c$ increase in $n$). ∎

Example 10.11. A biased coin is tossed infinitely many times, all tosses mutually independent. What is the probability that the sequence 01010 appears infinitely many times?

Let $p$ be the probability that the coin's outcome is 1, and let $a_n$ be the outcome of the $n$-th toss. Let $A_n$ be the event that $a_n = 0$, $a_{n+1} = 1$, $a_{n+2} = 0$, $a_{n+3} = 1$, $a_{n+4} = 0$. Since the tosses are mutually independent, the sequence of events $(A_{5n+1})_{n \ge 0}$ are mutually independent. Since $\mathbb{P}(A_{5n+1}) = p^2 (1-p)^3 > 0$, we have that $\sum_{n=0}^{\infty} \mathbb{P}(A_{5n+1}) = \infty$. So the Borel–Cantelli Lemma tells us that $\mathbb{P}(A_{5n+1} \text{ i.o.}) = 1$. Since $\{A_{5n+1} \text{ i.o.}\}$ implies $\{A_n \text{ i.o.}\}$, we are done. ⋄
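Example 10.11 can be illustrated by simulation: in a long run of coin tosses the pattern 01010 indeed keeps appearing. A sketch (the seed, run length, and $p = 1/2$ are arbitrary choices):

```python
import random

random.seed(1)
p, n = 0.5, 100000
tosses = "".join("1" if random.random() < p else "0" for _ in range(n))

pattern, count = "01010", 0
for i in range(n - 4):          # count occurrences, overlaps allowed
    if tosses[i:i + 5] == pattern:
        count += 1
print(count)  # expectation is about n * p**2 * (1-p)**3 ≈ 3125
```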


Lecture 11

11.1. Independent Random Variables

Let $X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$ be a random variable. Recall that in the definition of a random variable we require that $X$ is a measurable function; i.e. for any Borel set $B \in \mathcal{B}$ we have $X^{-1}(B) \in \mathcal{F}$. We want to define a σ-algebra that captures all the possible information that can be inferred from $X$.

Definition 11.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $(X_\alpha)_{\alpha \in I} : \Omega \to \mathbb{R}$ be a collection of random variables. The σ-algebra generated by $(X_\alpha)_{\alpha \in I}$ is defined as
\[
\sigma(X_\alpha : \alpha \in I) := \sigma\left( \left\{ X_\alpha^{-1}(B) : \alpha \in I,\ B \in \mathcal{B} \right\} \right).
\]
Note that for a single random variable $X : \Omega \to \mathbb{R}$, the collection $\{ X^{-1}(B) : B \in \mathcal{B} \}$ is itself a σ-algebra, so $\sigma(X) = \{ X^{-1}(B) : B \in \mathcal{B} \}$.

We now define independent random variables.

Definition 11.2. Let $(X_n)_n$, $(Y_n)_n$ be collections of random variables on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We say that $(X_n)_n$ are mutually independent if the σ-algebras $(\sigma(X_n))_n$ are mutually independent. We say that $(X_n)_n$ are independent of $(Y_n)_n$ if the σ-algebras $\sigma((X_n)_n)$ and $\sigma((Y_n)_n)$ are independent.

An important property (a consequence of the π-system arguments above):

Proposition 11.3. Let $(X_n)_n$ be a collection of random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. $(X_n)$ are mutually independent if for any finite number of random variables from the collection, $X_1, \dots, X_n$, and any real numbers $a_1, a_2, \dots, a_n$, we have that
\[
\mathbb{P}[X_1 \le a_1, X_2 \le a_2, \dots, X_n \le a_n] = \mathbb{P}[X_1 \le a_1] \cdot \mathbb{P}[X_2 \le a_2] \cdots \mathbb{P}[X_n \le a_n].
\]

Proof. Let $X_1, \dots, X_n$ be a finite number of random variables. Define two probability measures on the Borel sets of $\mathbb{R}^n$: for any $B \in \mathcal{B}^n$ let
\[
P_1(B) = \mathbb{P}((X_1, \dots, X_n) \in B) \quad \text{and} \quad P_2(B) = \mathbb{P}(X_1 \in \pi_1 B) \cdots \mathbb{P}(X_n \in \pi_n B),
\]
where $\pi_j$ is the projection onto the $j$-th coordinate. Since these two measures are identical on the π-system
\[
\left\{ (-\infty, a_1] \times \cdots \times (-\infty, a_n] : a_1, a_2, \dots, a_n \in \mathbb{R} \right\},
\]
and since this π-system generates all Borel sets on $\mathbb{R}^n$, we get that $P_1 = P_2$ on all Borel sets of $\mathbb{R}^n$.

Thus, if $A_j \in \sigma(X_j)$, $j = 1, \dots, n$, then for all $j$, $A_j = X_j^{-1}(B_j)$ for some Borel set $B_j$ on $\mathbb{R}$, so
\[
\mathbb{P}(A_1 \cap \cdots \cap A_n) = \mathbb{P}((X_1, \dots, X_n) \in B_1 \times \cdots \times B_n) = \mathbb{P}(X_1 \in B_1) \cdots \mathbb{P}(X_n \in B_n) = \mathbb{P}(A_1) \cdots \mathbb{P}(A_n).
\]
∎

Example 11.4. We toss two dice. $Y$ is the number on the first die and $X$ is the sum of both dice. Here the probability space is the uniform measure on all pairs of dice results, so $\Omega = \{1, 2, \dots, 6\}^2$. What is $\sigma(X)$?
\[
\sigma(X) = \left\{ \{ (x, y) \in \Omega : x + y \in A \} : A \subset \{2, \dots, 12\} \right\}.
\]
(We have to know that all subsets of $\{2, \dots, 12\}$ are Borel sets. This follows from the fact that $\{r\} = \bigcap_n (r - 1/n, r]$.) On the other hand,
\[
\sigma(Y) = \left\{ \{ (x, y) \in \Omega : x \in A \} : A \subset \{1, \dots, 6\} \right\}.
\]
Now note that
\[
\mathbb{P}[X = 7, Y = 3] = \frac{1}{36} = \mathbb{P}[X = 7] \cdot \mathbb{P}[Y = 3].
\]
This is mainly due to $\mathbb{P}[X = 7] = 1/6$. Similarly, for any $y \in \{1, \dots, 6\}$, the events $\{X = 7\}$ and $\{Y = y\}$ are independent.

Are $X$ and $Y$ independent random variables? NO! For example,
\[
\mathbb{P}[X = 6, Y = 3] = \frac{1}{36} \ne \frac{5}{36} \cdot \frac{1}{6} = \mathbb{P}[X = 6] \cdot \mathbb{P}[Y = 3].
\]
For independence we need all information from $X$ to be independent of the information from $Y$! ⋄
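Example 11.4 is small enough to check by exhaustive enumeration of the 36 outcomes. A sketch with exact fractions:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # uniform measure on 36 outcomes

def P(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

Y = lambda w: w[0]          # first die
X = lambda w: w[0] + w[1]   # sum of both dice

# {X = 7} is independent of each {Y = y} ...
for y in range(1, 7):
    assert P(lambda w: X(w) == 7 and Y(w) == y) == P(lambda w: X(w) == 7) * P(lambda w: Y(w) == y)

# ... but X and Y are not independent random variables:
lhs = P(lambda w: X(w) == 6 and Y(w) == 3)
rhs = P(lambda w: X(w) == 6) * P(lambda w: Y(w) == 3)
print(lhs, rhs)  # 1/36 versus 5/216
assert lhs != rhs
```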

Example 11.5. Let $(X_n)_n$ be independent random variables such that $X_n \sim \mathrm{Exp}(1)$ for all $n$. Then, for any $\alpha > 0$,
\[
\mathbb{P}[X_n > \alpha \log n] = e^{-\alpha \log n} = n^{-\alpha}.
\]
Note that in this case,
\[
\sum_{n \ge 1} \mathbb{P}[X_n > \alpha \log n] \begin{cases} = \infty & \text{if } \alpha \le 1 \\ < \infty & \text{if } \alpha > 1. \end{cases}
\]
For every $n$, the event $\{X_n > \alpha \log n\} \in \sigma(X_n)$. So the sequence of events $(\{X_n > \alpha \log n\})_n$ is independent. By the Borel–Cantelli Lemma (both parts),
\[
\mathbb{P}[X_n > \alpha \log n \text{ i.o.}] = \begin{cases} 1 & \text{if } \alpha \le 1 \\ 0 & \text{if } \alpha > 1. \end{cases}
\]
⋄


Lecture 12

12.1. Joint Distributions

In the same way that we constructed the Borel sets on $\mathbb{R}$, we can consider the σ-algebra of subsets of $\mathbb{R}^d$ generated by sets of the form $(a_1, b_1] \times (a_2, b_2] \times \cdots \times (a_d, b_d]$. This is the Borel σ-algebra on $\mathbb{R}^d$, usually denoted by $\mathcal{B}^d$ or $\mathcal{B}(\mathbb{R}^d)$.

Suppose now that we have $d$ random variables $X_1, \dots, X_d$ on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then we can define a function $X : \Omega \to \mathbb{R}^d$ by $X(\omega) = (X_1(\omega), X_2(\omega), \dots, X_d(\omega))$. Suppose we are told that this function is measurable from $(\Omega, \mathcal{F})$ to $(\mathbb{R}^d, \mathcal{B}^d)$. In this case $X$ is called an $\mathbb{R}^d$-valued random variable, and we say that $X_1, \dots, X_d$ have a joint distribution on $(\Omega, \mathcal{F}, \mathbb{P})$.

Proposition 12.1. $X_1, \dots, X_d$ are random variables if and only if $X = (X_1, \dots, X_d)$ is an $\mathbb{R}^d$-valued random variable.

Proof. Note that given a measurable function $X : (\Omega, \mathcal{F}) \to (\mathbb{R}^d, \mathcal{B}^d)$, we can always write $X = (X_1, \dots, X_d)$, and each $X_j$ is a random variable, since for any $B \in \mathcal{B}$,
\[
X_j^{-1}(B) = X^{-1}(\mathbb{R} \times \cdots \times B \times \cdots \times \mathbb{R}) \in \mathcal{F},
\]
where $B$ appears in the $j$-th coordinate.

Now suppose $X_1, \dots, X_d$ are random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. We need to show that $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}^d$. Since
\[
\mathcal{B}^d = \sigma\left( \left\{ (a_1, b_1] \times (a_2, b_2] \times \cdots \times (a_d, b_d] : a_j < b_j,\ j = 1, \dots, d \right\} \right),
\]
by Proposition 7.3 it suffices to prove that
\[
X^{-1}((a_1, b_1] \times (a_2, b_2] \times \cdots \times (a_d, b_d]) \in \mathcal{F}
\]
for all $a_j < b_j$, $j = 1, 2, \dots, d$. This follows from the fact that
\[
X^{-1}((a_1, b_1] \times \cdots \times (a_d, b_d]) = \bigcap_{j=1}^{d} X_j^{-1}((a_j, b_j]) \in \mathcal{F}.
\]
∎

The probability measure $\mathbb{P}_X$ on $(\mathbb{R}^d, \mathcal{B}^d)$ defined by $\mathbb{P}_X(B) = \mathbb{P}(X \in B) = \mathbb{P}(X^{-1}(B))$ is called the distribution of $X$. We can also define the joint distribution function of $X_1, \dots, X_d$:
\[
F_{(X_1, \dots, X_d)}(t_1, \dots, t_d) = F_X(\vec{t}\,) = \mathbb{P}(X_1 \le t_1, X_2 \le t_2, \dots, X_d \le t_d).
\]

A restatement of Proposition 11.3 is:

Proposition 12.2. Let $X_1, \dots, X_d$ be $d$ random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with a joint distribution. Then $X_1, \dots, X_d$ are mutually independent if and only if
\[
\forall\, (t_1, \dots, t_d) \in \mathbb{R}^d, \qquad F_X(t_1, \dots, t_d) = F_{X_1}(t_1) \cdots F_{X_d}(t_d).
\]

Example 12.3. An urn contains 3 red balls, 4 white balls and 5 blue balls. We remove three balls from the urn, all balls equally likely. Let $X$ be the number of red balls removed and $Y$ the number of white balls removed. What is the joint distribution of $(X,Y)$?

A natural probability space here is the uniform measure on $\Omega = \{ S : |S| = 3,\ S \subset \{1,\dots,12\} \}$. Suppose $f : \{1,\dots,12\} \to \{\text{red}, \text{white}, \text{blue}\}$ is the function assigning a color to each ball in the urn. So
$$(X,Y)(S) = \big( \#\{ s \in S : f(s) = \text{red} \},\ \#\{ s \in S : f(s) = \text{white} \} \big).$$
So
$$\mathbb{P}_{(X,Y)}(\{(0,0)\}) = \mathbb{P}(\{ S \in \Omega : f(s) = \text{blue}\ \forall\, s \in S \}) = \binom{5}{3} \Big/ \binom{12}{3}.$$

For any $S \in \Omega$ we can write $S = S_r \uplus S_w \uplus S_b$, where $S_r = \{ s \in S : f(s) = \text{red} \}$ and similarly for white and blue. So $|S| = |S_r| + |S_w| + |S_b|$.

If $(X,Y)(S) = (0,1)$ then $|S_r| = 0$, $|S_w| = 1$ and $|S_b| = 2$. So
$$\mathbb{P}_{(X,Y)}(\{(0,1)\}) = \binom{3}{0}\binom{4}{1}\binom{5}{2} \Big/ \binom{12}{3}.$$

Similarly, if $(X,Y)(S) = (1,1)$ then $|S_r| = 1$, $|S_w| = 1$ and $|S_b| = 1$. So
$$\mathbb{P}_{(X,Y)}(\{(1,1)\}) = \binom{3}{1}\binom{4}{1}\binom{5}{1} \Big/ \binom{12}{3}.$$
Generally, for integers $x, y \ge 0$ such that $x + y \le 3$,
$$\mathbb{P}_{(X,Y)}(\{(x,y)\}) = \binom{3}{x}\binom{4}{y}\binom{5}{3-x-y} \Big/ \binom{12}{3}.$$
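The joint density above is easy to tabulate numerically. A minimal sketch (plain Python; the function name `joint_pmf` is an assumption of this sketch) computes it, checks that it sums to 1, and checks that the marginal of $X$ is the hypergeometric density:

```python
from math import comb

def joint_pmf(x, y):
    # P((X, Y) = (x, y)) for the urn with 3 red, 4 white, 5 blue balls
    if x < 0 or y < 0 or x + y > 3:
        return 0.0
    return comb(3, x) * comb(4, y) * comb(5, 3 - x - y) / comb(12, 3)

total = sum(joint_pmf(x, y) for x in range(4) for y in range(4))
assert abs(total - 1) < 1e-12  # the joint density sums to 1

# the marginal of X agrees with the hypergeometric density (3 special of 12)
marginal_X = [sum(joint_pmf(x, y) for y in range(4)) for x in range(4)]
hypergeom = [comb(3, x) * comb(9, 3 - x) / comb(12, 3) for x in range(4)]
assert all(abs(a - b) < 1e-12 for a, b in zip(marginal_X, hypergeom))
```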

Of course, $F_{(X,Y)}$ can be calculated by summing these. ♦

Example 12.4. Two fair coins are tossed, their outcomes independent. $X \in \{0,1\}$ is the outcome of the first coin, $Y \in \{0,1\}$ is the outcome of the second coin, and $Z = X + Y \pmod 2$. Show that $X, Z$ are independent.

First, $\mathbb{P}(X=0) = \mathbb{P}(X=1) = 1/2$. Also,
$$\mathbb{P}(Z=0) = \mathbb{P}(X=Y=0) + \mathbb{P}(X=Y=1) = \tfrac14 + \tfrac14 = \tfrac12.$$
Similarly, $\mathbb{P}(Z=1) = 1/2$. Now,
$$\mathbb{P}(X=0, Z=0) = \mathbb{P}(X=0, Y=0) = \tfrac14 = \mathbb{P}(X=0)\,\mathbb{P}(Z=0).$$
Similarly for the other values. ♦

As in the case of 1-dimensional random variables, we would like to define the information a random variable carries as some $\sigma$-algebra. Thus, for a $d$-dimensional random variable $X$ we define
$$\sigma(X) := \big\{ X^{-1}(B) : B \in \mathcal{B}^d \big\}.$$
This is a $\sigma$-algebra. Recall that if $X = (X_1,\dots,X_d)$ then
$$\sigma(X_1,\dots,X_d) = \sigma\big( \{ X_j^{-1}(B) : B \in \mathcal{B},\ 1 \le j \le d \} \big) = \sigma\big( \{ X^{-1}(\mathbb{R} \times \cdots \times B \times \cdots \times \mathbb{R}) : B \in \mathcal{B} \} \big).$$

The following proposition shows that these notations agree.

Proposition 12.5. Let $X = (X_1,\dots,X_d)$ be a $d$-dimensional random variable on some probability space. Then $\sigma(X) = \sigma(X_1,\dots,X_d)$.

Proof. Defining $\mathcal{G} = \{ B \in \mathcal{B}^d : X^{-1}(B) \in \sigma(X_1,\dots,X_d) \}$, we get that $\mathcal{G}$ is a $\sigma$-algebra, and that the set of rectangles $\mathcal{K} = \{ (a_1,b_1] \times \cdots \times (a_d,b_d] : a_j < b_j \in \mathbb{R} \}$ is contained in $\mathcal{G}$. Since $\mathcal{B}^d$ is generated by $\mathcal{K}$, we get that $\mathcal{B}^d \subset \mathcal{G}$, and so $\sigma(X) \subset \sigma(X_1,\dots,X_d)$.

On the other hand, if $B \in \mathcal{B}$ then $X_j^{-1}(B) = X^{-1}(\mathbb{R} \times \cdots \times B \times \cdots \times \mathbb{R}) \in \sigma(X)$. So $\sigma(X_1,\dots,X_d) \subset \sigma(X)$ and they are equal. $\square$

12.2. Marginals

Lemma 12.6. Let $X = (X_1,\dots,X_d)$ be a $d$-dimensional random variable on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Let $Y = (X_1,\dots,X_{d-1})$ and let $Z = (X_2,\dots,X_d)$ (which are $(d-1)$-dimensional). Then, for any $\vec{t} \in \mathbb{R}^{d-1}$,
$$F_Y(\vec{t}\,) = \lim_{t\to\infty} F_X(\vec{t}, t) \qquad \text{and} \qquad F_Z(\vec{t}\,) = \lim_{t\to\infty} F_X(t, \vec{t}\,).$$

Proof. Fix $\vec{t} = (t_1,\dots,t_{d-1}) \in \mathbb{R}^{d-1}$. Let $A = \{ X_1 \le t_1, \dots, X_{d-1} \le t_{d-1} \}$. Let $s_n \nearrow \infty$, and set $A_n = \{ X_d \le s_n \}$. Since $(s_n)$ is an increasing sequence, we have that $A_n \subset A_{n+1}$, so $(A_n)_n$ is an increasing sequence, and thus $(A \cap A_n)_n$ is also an increasing sequence. Note that
$$\lim_n A_n = \bigcup_n A_n = \{ X_d < \infty \} = \Omega,$$
so $\lim_n (A \cap A_n) = A$. Thus, using continuity of probability,
$$F_Y(\vec{t}\,) = \mathbb{P}[A] = \lim_n \mathbb{P}[A \cap A_n] = \lim_{n\to\infty} F_X(\vec{t}, s_n).$$
Since this holds for any sequence $s_n \nearrow \infty$, we have the first assertion. The proof of the second assertion is almost identical, switching the roles of the first and last coordinates. $\square$

Corollary 12.7. Let $X = (X_1,\dots,X_d)$ be a $d$-dimensional random variable on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Then, for any $t \in \mathbb{R}$,
$$F_{X_j}(t) = \lim_{t_i \to \infty,\ i \ne j} F_X(t_1, \dots, t_{j-1}, t, t_{j+1}, \dots, t_d).$$

$F_{X_j}$ above is called the marginal distribution function of $X_j$.


Lecture 13

13.1. Discrete Joint Distributions

Recall that a random variable $X$ is discrete if there exists a countable set $R$ such that $\mathbb{P}(X \in R) = 1$. $R$ is the range of $X$, and the function $f_X(r) = \mathbb{P}(X = r)$ is the density of $X$.

Suppose that $X, Y$ are discrete random variables, with ranges $R_X, R_Y$ and densities $f_X, f_Y$. Note that $R := R_X \cup R_Y$ is also a range for both $X$ and $Y$. Now we can define the joint density of $X$ and $Y$ by
$$f_{X,Y}(x,y) = \mathbb{P}(X = x, Y = y) = \mathbb{P}((X,Y) = (x,y)).$$
Note that
$$\mathbb{P}((X,Y) \notin R^2) = \mathbb{P}(\{X \notin R\} \cup \{Y \notin R\}) \le \mathbb{P}(X \notin R) + \mathbb{P}(Y \notin R) = 0.$$
So $\mathbb{P}((X,Y) \in R^2) = 1$, and we can think of $(X,Y)$ as an $\mathbb{R}^2$-valued discrete random variable.

This of course can be generalized: if $X_1, \dots, X_n$ are discrete random variables, then $(X_1,\dots,X_n)$ is an $\mathbb{R}^n$-valued random variable, supported on some countable set in $\mathbb{R}^n$, and the density of $(X_1,\dots,X_n)$ is defined to be
$$f_{(X_1,\dots,X_n)}(x_1,\dots,x_n) = \mathbb{P}(X_1 = x_1, \dots, X_n = x_n).$$

◦ in exercise

Exercise 13.1. Let $X_1,\dots,X_d$ be random variables. Show that $X = (X_1,\dots,X_d)$ is a jointly discrete random variable if and only if $X_1,\dots,X_d$ are each discrete.

Assume that $X_1,\dots,X_d$ are discrete random variables. Show that $X_1,\dots,X_d$ are mutually independent if and only if
$$\forall\, (t_1,\dots,t_d) \in \mathbb{R}^d \qquad f_{(X_1,\dots,X_d)}(t_1,\dots,t_d) = f_{X_1}(t_1)\, f_{X_2}(t_2) \cdots f_{X_d}(t_d).$$

13.2. Discrete Marginal Densities

Proposition 13.2. If $X = (X_1,\dots,X_d)$ is a discrete joint random variable with range $R^d$ and density $f_X$, then the density of $X_j$ is given by
$$f_{X_j}(x) = \sum_{x_i \in R,\ i \ne j} f_X(x_1,\dots,x_{j-1},x,x_{j+1},\dots,x_d).$$

Proof. This follows from
$$F_X(t_1,\dots,t_d) = \sum_{R \ni r_j \le t_j,\ j=1,\dots,d} f_X(r_1,\dots,r_d).$$
So
$$\lim_{t_i\to\infty,\ i \ne j} F_X(t_1,\dots,t_{j-1},t,t_{j+1},\dots,t_d) = \sum_{R \ni r_j \le t}\ \sum_{r_i \in R,\ i \ne j} f_X(r_1,\dots,r_d).$$
Thus,
$$f_{X_j}(r) = F_{X_j}(r) - F_{X_j}(r^-) = \sum_{R \ni r_j \le r}\ \sum_{r_i \in R,\ i \ne j} f_X(r_1,\dots,r_d) - \sum_{R \ni r_j < r}\ \sum_{r_i \in R,\ i \ne j} f_X(r_1,\dots,r_d) = \sum_{r_i \in R,\ i \ne j} f_X(r_1,\dots,r_{j-1},r,r_{j+1},\dots,r_d). \qquad \square$$

The above density is called the marginal density of Xj.

Example 13.3. Noga and Shir come home from school and each decides whether to do homework or watch TV. Let $X = 1$ if Noga does homework and $X = 0$ if she watches TV. Let $Y = 1$ if Shir does homework and $Y = 0$ if she watches TV. We are given that
$$\mathbb{P}((X,Y) = (1,1)) = \tfrac14, \qquad \mathbb{P}((X,Y) = (0,0)) = \tfrac12, \qquad \mathbb{P}((X,Y) = (1,0)) = \tfrac{3}{16}.$$
What are the marginal densities of $X$ and $Y$? Who is more likely to watch TV?

Note that since the density must sum to 1, $\mathbb{P}((X,Y) = (0,1)) = \tfrac{1}{16}$. Well, let's calculate the marginal densities:

$$f_X(1) = \mathbb{P}((X,Y) = (1,0)) + \mathbb{P}((X,Y) = (1,1)) = \tfrac{7}{16}.$$
So it must be that $f_X(0) = \tfrac{9}{16}$ (why?). Similarly,
$$f_Y(1) = \mathbb{P}((X,Y) = (0,1)) + \mathbb{P}((X,Y) = (1,1)) = \tfrac{5}{16},$$
so $f_Y(0) = \tfrac{11}{16}$. We see that Shir is more likely to watch TV. ♦
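The two marginal computations can be read straight off the joint density; a tiny sketch (plain Python, representing the joint density as a dict, which is an assumption of this sketch):

```python
# joint density of (X, Y) from Example 13.3
joint = {(1, 1): 1/4, (0, 0): 1/2, (1, 0): 3/16, (0, 1): 1/16}

# marginal densities: sum the joint density over the other coordinate
f_X = {x: sum(p for (a, b), p in joint.items() if a == x) for x in (0, 1)}
f_Y = {y: sum(p for (a, b), p in joint.items() if b == y) for y in (0, 1)}

assert abs(f_X[1] - 7/16) < 1e-12 and abs(f_X[0] - 9/16) < 1e-12
assert abs(f_Y[1] - 5/16) < 1e-12 and abs(f_Y[0] - 11/16) < 1e-12
assert f_Y[0] > f_X[0]  # Shir is more likely to watch TV
```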

13.3. Conditional Distributions (Discrete)

Example 13.4. After mating season sea turtles land on the beach and lay their eggs. After hatching, the baby turtles must make it back to the sea on their own; however, there are many dangers and predators awaiting. Each female hatches a Poisson-$\lambda$ number of eggs. After hatching, each baby turtle has probability $p$ of making it back to the sea, all turtles mutually independent. What is the probability that exactly $k$ turtles make it back to sea? ♦

How do we solve this? We would like to use conditional probabilities: conditioned on there being $n$ eggs, the number of surviving turtles is Binomial-$(n,p)$. So we want to define distributions conditioned on the results of other distributions.

Definition 13.5. Let $X, Y$ be discrete random variables defined on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. For $y$ such that $f_Y(y) > 0$, define the random variable $X \mid Y = y$ as the random variable whose density is
$$f_{X|Y}(x \mid y) = f_{X|Y=y}(x) = \mathbb{P}(X = x \mid Y = y).$$

Note that the correct way to do this would be to define a different density $f_{X|y}(\cdot)$ for every $y$. This is captured in the notation $f_{X|Y}(\cdot \mid y)$.

Example 13.6. Let the joint density of $X, Y$ be given by:

    X \ Y     0     1
      0      0.4   0.1
      1      0.2   0.3

So $\mathbb{P}(Y=1) = 0.4$ and $\mathbb{P}(Y=0) = 0.6$. Thus,
$$f_{X|Y}(1 \mid 1) = \mathbb{P}(X=1 \mid Y=1) = 0.3/0.4 = 3/4,$$
$$f_{X|Y}(0 \mid 1) = \mathbb{P}(X=0 \mid Y=1) = 0.1/0.4 = 1/4,$$
$$f_{X|Y}(1 \mid 0) = 0.2/0.6 = 1/3,$$
$$f_{X|Y}(0 \mid 0) = 0.4/0.6 = 2/3. \qquad ♦$$

Let's return to solve the example above:

Solution of Example 13.4. Let $X$ be the number of eggs, and let $Y$ be the number of turtles that survive. What we are given is that $X \sim \mathrm{Poi}(\lambda)$ and that $Y \mid X = n \sim \mathrm{Bin}(n,p)$. That is, for all $k \le n$,
$$\mathbb{P}(Y = k \mid X = n) = \binom{n}{k} p^k (1-p)^{n-k}.$$
Thus,
$$\mathbb{P}(Y = k) = \sum_{n=0}^{\infty} \mathbb{P}(Y = k \mid X = n)\, \mathbb{P}(X = n) = \sum_{n=k}^{\infty} \binom{n}{k} p^k (1-p)^{n-k} \cdot e^{-\lambda} \frac{\lambda^n}{n!} = e^{-\lambda} \frac{p^k \lambda^k}{k!} \sum_{n=k}^{\infty} \frac{\lambda^{n-k}(1-p)^{n-k}}{(n-k)!} = e^{-\lambda} e^{\lambda(1-p)} \frac{(p\lambda)^k}{k!} = e^{-\lambda p} \frac{(\lambda p)^k}{k!}.$$
That is, the number of surviving turtles has the distribution of a Poisson-$\lambda p$ random variable. $\square$
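This "Poisson thinning" conclusion can be sanity-checked by simulation. A sketch (plain Python, standard library only; the sampler `sample_poisson` and the parameter values are assumptions of this sketch, the sampler counts unit-rate exponential arrivals up to time $\lambda$):

```python
import random

random.seed(1)
lam, p, n_trials = 6.0, 0.3, 100_000

def sample_poisson(lam):
    # count unit-rate exponential arrivals with cumulative time <= lam
    t, k = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > lam:
            return k
        k += 1

total = 0
for _ in range(n_trials):
    eggs = sample_poisson(lam)                            # X ~ Poi(lambda)
    total += sum(random.random() < p for _ in range(eggs))  # thin each egg with prob p
mean = total / n_trials

# if Y ~ Poi(lambda * p), then E[Y] = lambda * p
assert abs(mean - lam * p) < 0.05
```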

Exercise 13.7. Let $X, Y$ be discrete random variables. Show that
$$f_{X,Y}(x,y) = f_{X|Y}(x \mid y)\, f_Y(y)$$
for all $y$ such that $f_Y(y) > 0$.

◦ in exercise

13.4. Sums of Independent Random Variables (Discrete Case)

Suppose that $X, Y$ are random variables on some probability space. One can define the random variable $Z(\omega) = X(\omega) + Y(\omega)$, or in short $Z = X + Y$. (This is always a random variable, because $Z = \phi(X,Y)$, where $\phi(x,y) = x + y$, and since $\phi$ is continuous it is measurable.)

Suppose $X, Y$ are independent discrete random variables, with densities $f_X, f_Y$. First, from independence, we have that $\mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\,\mathbb{P}(Y = y)$ for all $x, y$. Thus, $f_{(X,Y)}(x,y) = f_X(x) f_Y(y)$. Now, what is the density of $X + Y$? Well,
$$f_{X+Y}(z) = \mathbb{P}(X+Y = z) = \mathbb{P}\Big( \biguplus_y \{ X+Y = z, Y = y \} \Big) = \sum_y \mathbb{P}(X+Y = z, Y = y) = \sum_y \mathbb{P}(X = z-y)\,\mathbb{P}(Y = y) = \sum_y f_X(z-y) f_Y(y).$$
This leads to the following definition:

Definition 13.8. Let $f, g$ be two real-valued functions supported on a countable set $R \subset \mathbb{R}$. The (discrete) convolution of $f, g$, denoted $f * g$, is the real-valued function $f * g : \mathbb{R} \to \mathbb{R}$,
$$(f * g)(z) = \sum_{y \in R \cup (z - R)} f(z-y)\, g(y).$$
Note that $f * g = g * f$ by a change of variables.

We conclude with the observation:

Proposition 13.9. Let $X, Y$ be two independent discrete random variables on some probability space. Let $Z = X + Y$. Then $f_Z = f_X * f_Y$.
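Proposition 13.9 is easy to check numerically. A minimal sketch (plain Python; the dict-based representation of a density is an assumption of this sketch):

```python
def convolve(f, g):
    # discrete convolution of two densities given as {value: probability} dicts
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] = h.get(x + y, 0.0) + px * py
    return h

# density of one fair die; the sum of two dice is the convolution
die = {k: 1/6 for k in range(1, 7)}
two_dice = convolve(die, die)

assert abs(two_dice[7] - 6/36) < 1e-12       # 7 is the most likely sum
assert abs(sum(two_dice.values()) - 1) < 1e-12
```

Applying `convolve` repeatedly gives the density of a sum of any number of independent discrete variables.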

Example 13.10. Consider $X_1,\dots,X_n$ mutually independent random variables where $X_j \sim \mathrm{Poi}(\lambda_j)$ for some $\lambda_1,\dots,\lambda_n > 0$. Let $Z = X_1 + \cdots + X_n$. What is the distribution of $Z$?

Well, since $f_{X_1}(x) = 0$ for $x < 0$,
$$f_{X_1+X_2}(z) = f_{X_1} * f_{X_2}(z) = \sum_{x=0}^{\infty} f_{X_1}(z-x) f_{X_2}(x) = \sum_{x=0}^{z} e^{-\lambda_1} \frac{\lambda_1^{z-x}}{(z-x)!} \cdot e^{-\lambda_2} \frac{\lambda_2^{x}}{x!} = e^{-(\lambda_1+\lambda_2)} \frac{1}{z!} \sum_{x=0}^{z} \binom{z}{x} \lambda_1^{z-x} \lambda_2^{x} = e^{-(\lambda_1+\lambda_2)} \cdot \frac{(\lambda_1+\lambda_2)^z}{z!}.$$
This is the density of the Poisson distribution! So $X_1 + X_2 \sim \mathrm{Poi}(\lambda_1 + \lambda_2)$. Continuing inductively, we conclude that $Z \sim \mathrm{Poi}(\lambda_1 + \lambda_2 + \cdots + \lambda_n)$. That is: the sum of independent Poisson random variables is again a Poisson random variable. ♦

Example 13.11. Let $X \sim \mathrm{Poi}(\alpha)$ and $Y \sim \mathrm{Poi}(\beta)$. Assume $X, Y$ are independent. What is the distribution of $X \mid X + Y = n$?

Let's calculate explicitly. Recall that $X + Y \sim \mathrm{Poi}(\alpha+\beta)$. For $k \le n$,
$$\mathbb{P}(X = k \mid X+Y = n) = \frac{\mathbb{P}(X=k)\,\mathbb{P}(Y=n-k)}{\mathbb{P}(X+Y=n)} = e^{-\alpha}\frac{\alpha^k}{k!} \cdot e^{-\beta}\frac{\beta^{n-k}}{(n-k)!} \cdot \left( e^{-(\alpha+\beta)} \frac{(\alpha+\beta)^n}{n!} \right)^{-1} = \binom{n}{k} \frac{\alpha^k \beta^{n-k}}{(\alpha+\beta)^n} = \binom{n}{k} \left( \frac{\alpha}{\alpha+\beta} \right)^k \left( 1 - \frac{\alpha}{\alpha+\beta} \right)^{n-k}.$$
So $X$ conditioned on $X + Y = n$ has the Binomial-$(n, \alpha/(\alpha+\beta))$ distribution. ♦


Lecture 14

14.1. Continuous Joint Distributions

Definition 14.1. Let $X_1,\dots,X_d$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. We say that $X_1,\dots,X_d$ have an absolutely continuous joint distribution (or that $X = (X_1,\dots,X_d)$ is an $\mathbb{R}^d$-valued absolutely continuous random variable) if there exists a non-negative integrable function $f_X = f_{X_1,\dots,X_d} : \mathbb{R}^d \to \mathbb{R}^+$ such that for all $\vec{t} = (t_1,\dots,t_d)$,
$$F_X(\vec{t}\,) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_d} f_X(\vec{s}\,)\, d\vec{s}.$$

Marginal distributions are as in the discrete case: if $X = (X_1,\dots,X_d)$ is a joint random variable then the marginal distribution of $X_j$ is
$$F_{X_j}(t) = \lim_{t_k\to\infty,\ k\ne j} F_X(t_1,\dots,t_{j-1},t,t_{j+1},\dots,t_d).$$
Specifically, if $X$ is absolutely continuous, then
$$f_{X_j}(t) = \int \cdots \int f_X(t_1,\dots,t_{j-1},t,t_{j+1},\dots,t_d)\, dt_1 \cdots dt_{j-1}\, dt_{j+1} \cdots dt_d.$$

!!! It is not true that if $X_1,\dots,X_d$ are absolutely continuous, then necessarily $X = (X_1,\dots,X_d)$ is absolutely continuous as an $\mathbb{R}^d$-valued random variable.

Example 14.2. If $X = Y$ are the same absolutely continuous random variable, then $(X,Y)$ is not absolutely continuous, since if it were, then for the Borel set $A = \{ (x,y) \in \mathbb{R}^2 : x = y \}$,
$$1 = \iint f_{X,Y}(x,y)\, dx\, dy = \iint_A f_{X,Y}(x,y)\, dx\, dy = 0,$$
since $f_{X,Y}(x,y) = 0$ if $(x,y) \notin A$, and since $A$ has measure 0 in the plane.

A more elementary way to see this (in the scope of these notes) is the following. Suppose $f_{X,Y}$ is the density of a pair $(X,Y)$ such that $X = Y$ is an absolutely continuous random variable. Let $g_s(x) = \int_{-\infty}^{s} f_{X,Y}(x,y)\, dy$. Then, since $X = Y$,
$$F_{X,Y}(\infty, s) = \mathbb{P}[X \le \infty, Y \le s] = \mathbb{P}[X \le s, Y \le s] = F_{X,Y}(s,s).$$
So
$$\int_{-\infty}^{\infty} g_s(x)\, dx = F_{X,Y}(\infty, s) = F_{X,Y}(s,s) = \int_{-\infty}^{s} g_s(x)\, dx,$$
which implies that
$$\int_{s}^{\infty} g_s(x)\, dx = 0.$$
Thus $g_s(x) = 0$ for all (actually a.e.) $x \ge s$ (since $g_s$ is a non-negative function). Now, if $x \ge s$ then
$$0 = g_s(x) = \int_{-\infty}^{s} f_{X,Y}(x,y)\, dy,$$
so $f_{X,Y}(x,y) = 0$ for all (actually a.e.) $x \ge s$, $y \le s$. Since this holds for all $s$, we have that $f_{X,Y} = 0$ for all (or a.e.) $x \ge y$. Exchanging the roles of $X, Y$ we get that $f_{X,Y} = 0$ (a.e.), which is not a density function, a contradiction! ♦

However, we have the following criterion for independence:

Proposition 14.3. $X_1,\dots,X_n$ are mutually independent absolutely continuous random variables if and only if $X = (X_1,\dots,X_n)$ is an $\mathbb{R}^n$-valued absolutely continuous random variable with density $f_X(t_1,\dots,t_n) = f_{X_1}(t_1) \cdots f_{X_n}(t_n)$.

Proof. The "if" part follows from Proposition 11.3.

For the "only if" part: let $f(t_1,\dots,t_n) = f_{X_1}(t_1) \cdots f_{X_n}(t_n)$. We need to show that
$$\mathbb{P}[X_1 \le t_1, \dots, X_n \le t_n] = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_n} f(s_1,\dots,s_n)\, ds_1 \cdots ds_n.$$
Indeed, this follows from independence. $\square$

14.2. Conditional Distributions (Continuous)

Recall that if $X, Y$ are discrete random variables, then
$$f_{X|Y}(x \mid y) = \mathbb{P}[X = x \mid Y = y].$$
For continuous random variables this is not defined, since $\mathbb{P}[Y = y] = 0$. However, we can still define the following.

Let $X, Y$ be absolutely continuous random variables. Let $y \in \mathbb{R}$ be such that $f_Y(y) > 0$. Define a function $f_{X|Y}(\cdot \mid y) : \mathbb{R} \to \mathbb{R}^+$ by
$$f_{X|Y}(x \mid y) := \frac{f_{(X,Y)}(x,y)}{f_Y(y)}.$$
Then, for every $y$ such that $f_Y(y) > 0$, this function is a density; indeed, recall that $f_Y(y) = \int f_{(X,Y)}(x,y)\, dx$, so
$$\int_{-\infty}^{\infty} f_{X|Y}(x \mid y)\, dx = \int \frac{f_{(X,Y)}(x,y)}{f_Y(y)}\, dx = 1.$$
Thus, $f_{X|Y}(\cdot \mid y)$ defines an absolutely continuous random variable, denoted $X \mid Y = y$, and called $X$ conditioned on $Y = y$. $f_{X|Y}(\cdot \mid y)$ is the conditional density of $X$ conditioned on $Y = y$. We have the identity
$$f_{(X,Y)}(x,y) = f_{X|Y}(x \mid y) \cdot f_Y(y).$$
For any Borel set $B \in \mathcal{B}$, we define
$$\mathbb{P}[X \in B \mid Y = y] := \mathbb{P}[(X \mid Y = y) \in B] = \int_B f_{X|Y}(x \mid y)\, dx.$$
In many cases, the information is given as conditional densities, rather than joint densities.

Example 14.4. Let $X, Y$ have joint density
$$f_{X,Y}(x,y) = \begin{cases} \frac1y e^{-x/y} e^{-y} & x, y \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$
Let's compute the density of $X \mid Y$. First, the marginal density of $Y$ is
$$f_Y(y) = \int_0^{\infty} \frac1y e^{-x/y} e^{-y}\, dx = e^{-y}$$
for $y \ge 0$, and 0 otherwise. For $x, y \ge 0$,
$$f_{X|Y}(x \mid y) = \frac{\frac1y e^{-x/y} e^{-y}}{e^{-y}} = \frac1y e^{-x/y},$$
and $f_{X|Y}(x \mid y) = 0$ for $x < 0$. So $X \mid Y = y \sim \mathrm{Exp}(1/y)$. Thus, for example,
$$\mathbb{P}[X > 1 \mid Y = y] = \int_1^{\infty} f_{X|Y}(x \mid y)\, dx = e^{-1/y}. \qquad ♦$$

14.3. Sums (Continuous)

Let $X, Y$ be independent absolutely continuous random variables, and let $Z = X + Y$. What is $F_Z$?
$$\mathbb{P}[Z \le z] = \mathbb{P}[X+Y \le z] = \iint_{x+y\le z} f_{X,Y}(x,y)\, dx\, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{z-y} f_X(x) f_Y(y)\, dx\, dy = \int_{-\infty}^{\infty} F_X(z-y) f_Y(y)\, dy = F_X * f_Y(z).$$
Now, note that
$$F_Z(z) = \int_{-\infty}^{\infty} \int_{-\infty}^{z-y} f_X(x) f_Y(y)\, dx\, dy = \int_{-\infty}^{z} \int_{-\infty}^{\infty} f_X(s-y) f_Y(y)\, dy\, ds = \int_{-\infty}^{z} f_X * f_Y(s)\, ds.$$
Thus, $Z = X + Y$ is an absolutely continuous random variable with density $f_X * f_Y$.

Example 14.5. What is the distribution of $X + Y$, for independent $X, Y$, when:
• $X \sim N(0,1)$ and $Y \sim N(0,1)$;
• $X \sim \mathrm{Exp}(\lambda)$ and $Y \sim \mathrm{Exp}(\alpha)$.

In the normal case we have that $\frac{(x-y)^2 + y^2}{2} = \frac{x^2}{4} + \left(y - \frac{x}{2}\right)^2$. So
$$f_{X+Y}(x) = \frac{1}{2\pi} e^{-x^2/4} \int_{-\infty}^{\infty} e^{-(y-x/2)^2}\, dy = \frac{1}{2\pi} e^{-x^2/4} \cdot \sqrt{\pi} = \frac{1}{\sqrt{2\pi}\cdot\sqrt2}\, e^{-x^2/4}.$$
So $X + Y \sim N(0, \sqrt2)$ (normal with standard deviation $\sqrt2$, i.e. variance 2).

In the exponential case (assuming $\alpha \ne \lambda$), for $x > 0$,
$$f_{X+Y}(x) = f_X * f_Y(x) = \int_{-\infty}^{\infty} \alpha\lambda\, e^{-\lambda x}\, \mathbb{1}_{[0,\infty)}(x-y)\, \mathbb{1}_{[0,\infty)}(y)\, e^{-(\alpha-\lambda)y}\, dy = \frac{\alpha\lambda}{\alpha-\lambda} e^{-\lambda x}\big(1 - e^{-(\alpha-\lambda)x}\big) = \frac{\alpha\lambda}{\alpha-\lambda}\big(e^{-\lambda x} - e^{-\alpha x}\big).$$
Since $\mathbb{P}[X+Y < 0] = 0$, this is the density of $X + Y$. ♦
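The exponential convolution formula can be sanity-checked by simulation against a numerical integral of the derived density. A sketch (plain Python, standard library only; the parameter values, sample size, and tolerance are arbitrary choices of this sketch):

```python
import math
import random

random.seed(2)
lam, alpha = 1.0, 3.0
n = 200_000
# samples of X + Y with X ~ Exp(lam), Y ~ Exp(alpha), independent
samples = [random.expovariate(lam) + random.expovariate(alpha) for _ in range(n)]

def density(x):
    # f_{X+Y}(x) = alpha*lam/(alpha - lam) * (exp(-lam x) - exp(-alpha x))
    return alpha * lam / (alpha - lam) * (math.exp(-lam * x) - math.exp(-alpha * x))

# compare P[X + Y <= 1]: empirical frequency vs a Riemann sum of the density
emp = sum(s <= 1.0 for s in samples) / n
num = sum(density(i / 10_000) for i in range(10_000)) / 10_000
assert abs(emp - num) < 0.01
```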


Lecture 15

In this lecture we want to define the average value of a random variable $X$. If $X$ is discrete, it would be intuitive to define the average as $\sum_x \mathbb{P}[X = x]\, x$. If $X$ is absolutely continuous, perhaps one would also think of defining the average analogously as $\int x f_X(x)\, dx$. But what about the general case?

We will follow Lebesgue integration theory for this task.

15.1. Preliminaries ◦ For the TA

Proposition 15.1. Let $X = (X_1,\dots,X_n)$ and $Y = (Y_1,\dots,Y_m)$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Then $X$ is independent of $Y$ if and only if for any measurable functions $g : \mathbb{R}^n \to \mathbb{R}$ and $h : \mathbb{R}^m \to \mathbb{R}$ we have that $g(X), h(Y)$ are independent.

Proof. For the "only if" direction: let $g, h$ be measurable functions as in the proposition. Let $A_1 \in \sigma(g(X))$, $A_2 \in \sigma(h(Y))$. Note that there exist Borel sets $B_1, B_2$ such that $A_1 = X^{-1}(g^{-1}(B_1))$ and $A_2 = Y^{-1}(h^{-1}(B_2))$. Since $g$ is measurable, $g^{-1}(B_1) \in \mathcal{B}^n$, and similarly $h^{-1}(B_2) \in \mathcal{B}^m$. Thus $A_1 \in \sigma(X)$ and $A_2 \in \sigma(Y)$, so $A_1, A_2$ are independent. Since this holds for all $A_1, A_2$, we have that $g(X), h(Y)$ are independent.

For the "if" direction: let $A_1 \in \sigma(X)$, $A_2 \in \sigma(Y)$. So $A_1 = X^{-1}(B_1)$ and $A_2 = Y^{-1}(B_2)$ for some $B_1 \in \mathcal{B}^n$ and $B_2 \in \mathcal{B}^m$. Define $g(\vec x) = \mathbb{1}_{\{\vec x \in B_1\}}$ and $h(\vec y) = \mathbb{1}_{\{\vec y \in B_2\}}$. Since these are measurable functions, and since $A_1 \cap A_2 = \{ X \in B_1, Y \in B_2 \}$, we have that
$$\mathbb{P}[A_1 \cap A_2] = \mathbb{P}[g(X) = 1, h(Y) = 1] = \mathbb{P}[g(X) = 1]\, \mathbb{P}[h(Y) = 1] = \mathbb{P}[A_1]\, \mathbb{P}[A_2].$$
Since this holds for all $A_1, A_2$, we have that $X$ is independent of $Y$. $\square$

Example 15.2. The function $g(x) = \lfloor x \rfloor$ is measurable, since for any $b \in \mathbb{R}$,
$$g^{-1}(-\infty, b] = \{ x : \lfloor x \rfloor \le b \} = (-\infty, \lfloor b \rfloor + 1) \in \mathcal{B}.$$
The function $g(x) = c \cdot x$ is measurable, since it is continuous. The function $g(x) = \max\{x, 0\}$ is measurable, since $g^{-1}(-\infty, b] = (-\infty, b]$ if $b \ge 0$ and $g^{-1}(-\infty, b] = \emptyset$ if $b < 0$. In a similar way, $g(x) = \max\{-x, 0\}$ is measurable.

Thus, if $X, Y$ are independent random variables, then $X^+ := \max\{X, 0\}$, $Y^+ := \max\{Y, 0\}$ are independent, and $X_n^+ = 2^{-n}\lfloor 2^n X^+ \rfloor$ is independent of $Y_n^+ = 2^{-n}\lfloor 2^n Y^+ \rfloor$. ♦

15.2. Simple Random Variables

◦ This dfn is important: all ω ∈ Ω are sent to some non-negative value

Let us start with simple random variables:

Definition 15.3. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. $X$ is a simple random variable if there exists a finite set of positive numbers $A = \{a_1,\dots,a_m\} \subset \mathbb{R}$ such that
$$\Omega = \{X = a_1\} \uplus \cdots \uplus \{X = a_m\} \uplus \{X = 0\}.$$

Specifically, such an $X$ is a discrete random variable. In this case it is intuitive that the average of $X$ should be $\sum_{k=1}^{m} a_k\, \mathbb{P}[X = a_k]$. Note that
$$X = \sum_{k=1}^{m} a_k \mathbb{1}_{\{X = a_k\}},$$
where all $a_k > 0$ and the events $\{X = a_k\}$ are pairwise disjoint.

We want to say that simple random variables somehow approximate all random variables, so that we can define the average on these. This is the content of the following definitions and proposition.

Definition 15.4. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. A sequence of random variables $(X_n)_n$ is monotone non-decreasing if for all $\omega \in \Omega$ and all $n$, $X_n(\omega) \le X_{n+1}(\omega)$.

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, and let $(X_n)_n$ be a sequence of random variables. Suppose that for every $\omega \in \Omega$ the limit $X(\omega) := \lim_{n\to\infty} X_n(\omega)$ exists. Then the function $X$ is a random variable, since $X = \limsup X_n = \liminf X_n$, and since for any Borel set $B$,
$$X^{-1}(B) = \Big\{ \omega : \liminf_{n\to\infty} X_n(\omega) \in B \Big\} = \bigcup_n \bigcap_{k \ge n} \{ \omega : X_k(\omega) \in B \} = \liminf_n X_n^{-1}(B) \in \mathcal{F}.$$
(In fact, $\lim_n X_n^{-1}(B) = X^{-1}(B)$.)

Definition 15.5. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, and let $(X_n)_n$ be a sequence of random variables. Suppose that for every $\omega \in \Omega$ the limit $X(\omega) := \lim_{n\to\infty} X_n(\omega)$ exists. Then the function $X$ is a random variable, and in this case we say that $X_n$ converges everywhere to $X$.

If in addition $(X_n)_n$ is a monotone non-decreasing sequence, we say that $X_n$ converges everywhere monotonely (from below) to $X$, and denote this by $X_n \nearrow X$.

Finally, a non-negative random variable $X$ is a random variable satisfying $X(\omega) \ge 0$ for all $\omega \in \Omega$. We write this as $X \ge 0$. In general, we write $X \le Y$ if $X(\omega) \le Y(\omega)$ for all $\omega \in \Omega$. (So, for example, a sequence $(X_n)_n$ is monotone non-decreasing if $X_n \le X_{n+1}$ for all $n$.)

Proposition 15.6. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, and $X$ a random variable. If $X$ is non-negative, then there exists a sequence $(X_n)_n$ of simple random variables such that $X_n \nearrow X$.

Proof. For all $n, k$ define $A_{n,k} := X^{-1}[k 2^{-n}, (k+1) 2^{-n})$, which is of course an event. Note that for all $n$,
$$\Omega = \biguplus_{k=0}^{2^n n - 1} A_{n,k} \uplus X^{-1}[n, \infty).$$
For all $n \ge 1$ define
$$X_n = \sum_{k=0}^{2^n n - 1} 2^{-n} k\, \mathbb{1}_{A_{n,k}} + n\, \mathbb{1}_{X^{-1}[n,\infty)}.$$
[That is, we divide the interval $[0,n)$ into intervals of length $2^{-n}$, and replace $X$ with the leftmost value of its interval, cutting this off at $n$. See Figure 4.]

So $(X_n)_n$ is a sequence of simple random variables. Note that $X_n(\omega) = \min\{ n, 2^{-n}\lfloor 2^n X(\omega) \rfloor \}$. If $X(\omega) < n$ then
$$X_n(\omega) = 2^{-n}\lfloor 2^n X(\omega) \rfloor = 2^{-n}\big\lfloor \tfrac12 \cdot 2^{n+1} X(\omega) \big\rfloor \le 2^{-(n+1)}\lfloor 2^{n+1} X(\omega) \rfloor = X_{n+1}(\omega).$$
(For all $x$: $\lfloor x/2 \rfloor \le x/2$, so $2\lfloor x/2 \rfloor \le x$.) If $n \le X(\omega) < n+1$ then $2^{n+1} n \le 2^{n+1} X(\omega)$, so
$$X_n(\omega) = n \le 2^{-(n+1)}\lfloor 2^{n+1} X(\omega) \rfloor = X_{n+1}(\omega).$$
Finally, if $X(\omega) \ge n+1$, then $X_n(\omega) = n \le n+1 = X_{n+1}(\omega)$. So $X_n(\omega) \le X_{n+1}(\omega)$ for all $\omega$ and all $n$, and $(X_n)_n$ is a non-decreasing sequence.

Now to show that $\lim_n X_n(\omega) = X(\omega)$: note that $|X(\omega) - 2^{-n}\lfloor 2^n X(\omega) \rfloor| \le 2^{-n}$. Thus, for all $\omega \in \Omega$, $\lim_{n\to\infty} X_n(\omega) = X(\omega)$. $\square$

[Figure 4. $X_n$ and $X_{n+1}$ on $[0, n+1)$.]
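The dyadic approximation $X_n = \min\{n, 2^{-n}\lfloor 2^n X\rfloor\}$ of Proposition 15.6 is concrete enough to check numerically at a single sample point. A small sketch (plain Python; the choice of $\pi$ as the sample value $X(\omega)$ is arbitrary):

```python
import math

def approx(x, n):
    # X_n = min(n, 2^{-n} * floor(2^n * x)) for a non-negative value x
    return min(n, math.floor(2**n * x) / 2**n)

x = math.pi  # an arbitrary non-negative sample value X(omega)
values = [approx(x, n) for n in range(1, 30)]

# the sequence is monotone non-decreasing in n ...
assert all(a <= b for a, b in zip(values, values[1:]))
# ... and within 2^{-n} of x once n exceeds x
assert all(abs(x - approx(x, n)) <= 2**-n for n in range(5, 30))
```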

15.3. Expectation of Non-Negative Random Variables

We want to define the expectation of a random variable by approximating it with simple random variables as in Proposition 15.6, and taking the limit. For this we have to make sure the limit exists and makes sense. Let us start by defining the expectation of a simple random variable.

Definition 15.7. If $X = \sum_{k=1}^{m} a_k \mathbb{1}_{X^{-1}(a_k)}$ is a simple random variable, define
$$\mathbb{E}[X] := \sum_{k=1}^{m} a_k\, \mathbb{P}[X = a_k].$$

Proposition 15.8. The above definition does not depend on the choice of the partition of $\Omega$; that is, if $X = \sum_{j=1}^{n} b_j \mathbb{1}_{B_j}$ where $\Omega = \biguplus_{j=1}^{n} B_j$ is a partition of $\Omega$, then $\mathbb{E}[X] = \sum_{j=1}^{n} b_j\, \mathbb{P}[B_j]$.

Proof. Suppose that
$$X = \sum_{k=1}^{m} a_k \mathbb{1}_{X^{-1}(a_k)} = \sum_{j=1}^{n} b_j \mathbb{1}_{B_j},$$
where $\Omega = \biguplus_{j=1}^{n} B_j$ is a partition of $\Omega$ as in the statement of the proposition. We assume without loss of generality that $a_m = 0$, so $\Omega = \biguplus_{k=1}^{m} \{X = a_k\}$. Then, for any $k, j$: if $\omega \in \{X = a_k\} \cap B_j$ then $a_k = X(\omega) = b_j$; otherwise, if $\{X = a_k\} \cap B_j = \emptyset$, then $\mathbb{P}[X = a_k, B_j] = 0$. Thus, for all $k, j$,
$$a_k\, \mathbb{P}[X = a_k, B_j] = b_j\, \mathbb{P}[X = a_k, B_j].$$
Since $\Omega = \biguplus_{j=1}^{n} B_j = \biguplus_{k=1}^{m} \{X = a_k\}$, by the law of total probability we obtain
$$\mathbb{E}[X] = \sum_{k=1}^{m} a_k\, \mathbb{P}[X = a_k] = \sum_{k=1}^{m} \sum_{j=1}^{n} a_k\, \mathbb{P}[X = a_k, B_j] = \sum_{j=1}^{n} \sum_{k=1}^{m} b_j\, \mathbb{P}[X = a_k, B_j] = \sum_{j=1}^{n} b_j\, \mathbb{P}[B_j]. \qquad \square$$

Proposition 15.9. If $X, Y$ are simple random variables, then:

• If $X \le Y$ then $\mathbb{E}[X] \le \mathbb{E}[Y]$.
• For all $a \ge 0$, $\mathbb{E}[aX + Y] = a\,\mathbb{E}[X] + \mathbb{E}[Y]$.

Specifically, we have for a simple random variable $X$,
$$\mathbb{E}[X] = \sup\{ \mathbb{E}[S] : 0 \le S \le X,\ S \text{ is simple} \}.$$

Proof. Assume that $X = \sum_k a_k \mathbb{1}_{\{X=a_k\}}$ and $Y = \sum_j b_j \mathbb{1}_{\{Y=b_j\}}$. If $\omega \in \Omega$ is such that $X(\omega) = a_k$ and $Y(\omega) = b_j$, then $a_k \le b_j$. So $a_k\,\mathbb{P}[X=a_k, Y=b_j] \le b_j\,\mathbb{P}[X=a_k, Y=b_j]$. Thus,
$$\mathbb{E}[X] = \sum_{k,j} a_k\, \mathbb{P}[X=a_k, Y=b_j] \le \sum_{k,j} b_j\, \mathbb{P}[X=a_k, Y=b_j] = \mathbb{E}[Y].$$
Now, note that if $Z = aX + Y$ for some $a \ge 0$, then
$$Z = \sum_{k,j} (a\, a_k + b_j)\, \mathbb{1}_{\{X=a_k, Y=b_j\}},$$
since $\Omega = \biguplus_k \{X=a_k\} \uplus \{X=0\}$ and $\Omega = \biguplus_j \{Y=b_j\} \uplus \{Y=0\}$. So
$$\mathbb{E}[Z] = \sum_{k,j} (a\, a_k + b_j)\, \mathbb{P}[X=a_k, Y=b_j] = a\,\mathbb{E}[X] + \mathbb{E}[Y]. \qquad \square$$

Definition 15.10. If $X$ is a general non-negative random variable, define
$$\mathbb{E}[X] := \sup\{ \mathbb{E}[S] : 0 \le S \le X,\ S \text{ is simple} \}$$
(where we allow the value $\mathbb{E}[X] = \infty$, and in this case we say $X$ has infinite expectation).

This definition coincides with the previous definition for simple random variables by the proposition above. ◦ important for what follows

Note that the definition immediately implies that
$$0 \le X \le Y \quad \Rightarrow \quad \mathbb{E}[X] \le \mathbb{E}[Y],$$
because the supremum for $Y$ is over a larger set.

15.4. Monotone Convergence

Theorem 15.11 (Monotone Convergence). Let $(X_n)_n$ be a non-decreasing sequence of non-negative random variables. If $X_n \nearrow X$ then $\mathbb{E}[X_n] \nearrow \mathbb{E}[X]$.

Proof. The proof has four cases:

• Case 1: $X_n$ simple, $X = \mathbb{1}_A$. For this case, let $\varepsilon > 0$, and set $A_n := \{ X_n > 1 - \varepsilon \}$. If $\omega \in A$ then $X(\omega) = 1$, and since $X_n(\omega) \nearrow X(\omega)$, we have $X_n(\omega) > 1 - \varepsilon$ for some $n$. On the other hand, if $X_n(\omega) > 1 - \varepsilon$, then $X(\omega) > 1 - \varepsilon$, and so $X(\omega) = 1$ and $\omega \in A$. Thus $(A_n)_n$ is an increasing sequence of events such that $\lim_n A_n = \bigcup_n A_n = A$. Since the $X_n$ are non-negative, $X_n \ge (1-\varepsilon)\mathbb{1}_{A_n}$. Hence $(1-\varepsilon)\,\mathbb{P}[A_n] \le \mathbb{E}[X_n] \le \mathbb{E}[X] = \mathbb{P}[A]$. Continuity of probability finishes this case.

• Case 2: $X_n, X$ simple. Assume $X = \sum_{k=0}^{m} a_k \mathbb{1}_{\{X=a_k\}}$, where $0 = a_0 < a_1 < \cdots < a_m$ and $\Omega = \{X=a_0\} \uplus \{X=a_1\} \uplus \cdots \uplus \{X=a_m\}$. Fix $k > 0$ and consider the sequence $Y_n := a_k^{-1} \mathbb{1}_{\{X=a_k\}} X_n$. $X_n \nearrow X$ implies that $Y_n \nearrow \mathbb{1}_{\{X=a_k\}}$. By Case 1, since the $Y_n$ are simple, we have that $\mathbb{E}[Y_n] \nearrow \mathbb{P}[X=a_k]$. Since $X_n \le X$ we have that $0 \le X_n \mathbb{1}_{\{X=a_0\}} \le X \mathbb{1}_{\{X=0\}} = 0$. Thus
$$X_n = X_n \cdot \sum_{k=0}^{m} \mathbb{1}_{\{X=a_k\}} = \sum_{k=1}^{m} a_k \cdot a_k^{-1} \mathbb{1}_{\{X=a_k\}} X_n.$$
By linearity of expectation for simple random variables,
$$\mathbb{E}[X_n] = \sum_k a_k\, \mathbb{E}[a_k^{-1} \mathbb{1}_{\{X=a_k\}} X_n] \nearrow \sum_k a_k\, \mathbb{P}[X=a_k] = \mathbb{E}[X].$$

• Case 3: $X_n$ simple, $X$ general (non-negative). Let $0 \le S \le X$ be a simple random variable, and $Y_n := \min\{X_n, S\}$. $X_n \nearrow X$ implies that $Y_n \nearrow S$. So Case 2 and monotonicity of expectation for simple random variables give that $\mathbb{E}[X_n] \ge \mathbb{E}[Y_n] \nearrow \mathbb{E}[S]$. Thus $\lim_n \mathbb{E}[X_n] \ge \mathbb{E}[S]$ for all simple $S \le X$. Taking the supremum, we are done.

• Case 4: $X_n, X$ general. Define $Y_n := \min\{ 2^{-n}\lfloor 2^n X_n \rfloor, n \}$. So the $Y_n$ are all simple, and $Y_n \le X_n \le X$, and so $\mathbb{E}[Y_n] \le \mathbb{E}[X_n] \le \mathbb{E}[X]$. If we show that $Y_n \nearrow X$ then we are done by Case 3. As in Proposition 15.6, $Y_n \le \min\{ 2^{-(n+1)}\lfloor 2^{n+1} X_n \rfloor, n+1 \} \le Y_{n+1}$, so $(Y_n)_n$ is a non-decreasing sequence. For all $\omega$ and $n$ such that $X_n(\omega) < n$, we have that $|Y_n(\omega) - X_n(\omega)| \le 2^{-n}$. So, for any $\omega$ and any $\varepsilon > 0$, if $n > n_0(\omega, \varepsilon)$ is large enough, then $|Y_n(\omega) - X(\omega)| < \varepsilon$. That is, $Y_n \nearrow X$. $\square$

15.5. Linearity

Lemma 15.12. Let $X, Y$ be non-negative random variables. Then, for any $a > 0$:
• $X \le Y$ implies $\mathbb{E}[X] \le \mathbb{E}[Y]$.
• $\mathbb{E}[aX + Y] = a\,\mathbb{E}[X] + \mathbb{E}[Y]$.

Proof. The first assertion is immediate from the definition, since if $0 \le S \le X$ is simple then $0 \le S \le Y$, so we are taking a supremum over a larger set.

For the second assertion, let $X_n \nearrow X$ and $Y_n \nearrow Y$ be sequences of simple random variables converging everywhere monotonely to $X, Y$ respectively, as in Proposition 15.6. So $aX_n + Y_n \nearrow aX + Y$. Monotone convergence tells us that
$$\mathbb{E}[X_n] \nearrow \mathbb{E}[X], \qquad \mathbb{E}[Y_n] \nearrow \mathbb{E}[Y], \qquad \mathbb{E}[aX_n + Y_n] \nearrow \mathbb{E}[aX + Y].$$
Since for all $n$, $\mathbb{E}[aX_n + Y_n] = a\,\mathbb{E}[X_n] + \mathbb{E}[Y_n]$, we are done. $\square$

15.6. Expectation

Finally, let us define the expectation of a general random variable. First, some notation: let $X$ be a random variable. Define $X^+ := \max\{X, 0\}$ and $X^- := \max\{-X, 0\}$. Note that $X^+, X^-$ are non-negative random variables, and that $X = X^+ - X^-$ and $|X| = X^+ + X^-$.

Definition 15.13. Let $X$ be a random variable. If $\mathbb{E}[X^+] = \mathbb{E}[X^-] = \infty$, we say that the expectation of $X$ is not defined (or $X$ does not have expectation). If at least one of $\mathbb{E}[X^+], \mathbb{E}[X^-]$ is finite, then define the expectation of $X$ by
$$\mathbb{E}[X] := \mathbb{E}[X^+] - \mathbb{E}[X^-].$$

15.6.1. Properties of Expectation. The most important property of expectation is linearity:

Theorem 15.14. Let $X, Y$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that $\mathbb{E}[X], \mathbb{E}[Y]$ exist. Then:
• For all $a \in \mathbb{R}$, $\mathbb{E}[aX + Y] = a\,\mathbb{E}[X] + \mathbb{E}[Y]$.
• $X \le Y$ implies that $\mathbb{E}[X] \le \mathbb{E}[Y]$.
• $|\mathbb{E}[X]| \le \mathbb{E}[|X|]$.

Proof. For linearity, note that $(X+Y)^+ - (X+Y)^- = X + Y = X^+ - X^- + Y^+ - Y^-$, so
$$(X+Y)^+ + X^- + Y^- = (X+Y)^- + X^+ + Y^+.$$
Since both sides are linear combinations of non-negative random variables,
$$\mathbb{E}(X+Y)^+ + \mathbb{E} X^- + \mathbb{E} Y^- = \mathbb{E}(X+Y)^- + \mathbb{E} X^+ + \mathbb{E} Y^+,$$
which implies that
$$\mathbb{E}[X+Y] = \mathbb{E}(X+Y)^+ - \mathbb{E}(X+Y)^- = \mathbb{E} X^+ - \mathbb{E} X^- + \mathbb{E} Y^+ - \mathbb{E} Y^- = \mathbb{E} X + \mathbb{E} Y.$$
Also, for any $a > 0$, $(aX)^+ = aX^+$ and $(aX)^- = aX^-$; if $a < 0$ then $(aX)^+ = -aX^-$ and $(aX)^- = -aX^+$. Thus, in both cases $\mathbb{E}[aX] = a\,\mathbb{E}[X]$. This proves linearity.

For the second assertion, let $Z = Y - X$. So $Z \ge 0$, and thus $\mathbb{E}[Z] \ge 0$. Since $\mathbb{E}[Z] = \mathbb{E}[Y] - \mathbb{E}[X]$ by linearity, we are done.

Finally, for the last assertion,
$$|\mathbb{E}[X]| = |\mathbb{E}[X^+] - \mathbb{E}[X^-]| \le \mathbb{E}[X^+] + \mathbb{E}[X^-] = \mathbb{E}[|X|],$$
where we have used the fact that $X^+ \ge 0$ and $X^- \ge 0$. $\square$


Lecture 16

16.1. Expectation - Discrete Case

◦ Expectation for discrete RVs

Proposition 16.1. Let $X$ be a discrete random variable, with range $R$ and density $f_X$. Then,
$$\mathbb{E}[X] = \sum_{r \in R} r f_X(r).$$

Proof. For all $N$, let
$$X_N^+ := \sum_{r \in R \cap [0,N]} \mathbb{1}_{\{X=r\}}\, r \qquad \text{and} \qquad X_N^- := -\sum_{r \in R \cap [-N,0]} \mathbb{1}_{\{X=r\}}\, r.$$
Note that $X_N^+ \nearrow X^+$ and $X_N^- \nearrow X^-$. Moreover, by linearity,
$$\mathbb{E}[X_N^+] = \sum_{r \in R \cap [0,N]} \mathbb{P}[X=r]\, r \qquad \text{and} \qquad \mathbb{E}[X_N^-] = -\sum_{r \in R \cap [-N,0]} \mathbb{P}[X=r]\, r.$$
Using monotone convergence we get that
$$\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-] = \lim_{N\to\infty} \big( \mathbb{E}[X_N^+] - \mathbb{E}[X_N^-] \big) = \sum_{r \in R} f_X(r)\, r. \qquad \square$$

◦ Examples: Ber, Bin, Poi, Geo

Example 16.2. Let us calculate the expectations of different discrete random variables:

• If $X \sim \mathrm{Ber}(p)$ then
$$\mathbb{E}[X] = 1 \cdot \mathbb{P}[X=1] + 0 \cdot \mathbb{P}[X=0] = p.$$

• If $X \sim \mathrm{Bin}(n,p)$ then, since $\binom{n}{k} \cdot k = n \cdot \binom{n-1}{k-1}$ for $1 \le k \le n$,
$$\mathbb{E}[X] = \sum_{k=0}^{n} f_X(k) \cdot k = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}\, k = np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} = np.$$
An easier way would be to note that $X = \sum_{k=1}^{n} X_k$ where $X_k \sim \mathrm{Ber}(p)$ (and in fact $X_1,\dots,X_n$ are independent). Thus, by linearity, $\mathbb{E}[X] = \sum_{k=1}^{n} \mathbb{E}[X_k] = np$.

• For $X \sim \mathrm{Poi}(\lambda)$:
$$\mathbb{E}[X] = \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, k = \lambda \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^{k-1}}{(k-1)!} = \lambda.$$

• For $X \sim \mathrm{Geo}(p)$: note that for $g(x) = (1-x)^k$ we have $\frac{\partial}{\partial x} g(x) = -k(1-x)^{k-1}$. So
$$\mathbb{E}[X] = \sum_{k=1}^{\infty} f_X(k)\, k = \sum_{k=1}^{\infty} (1-p)^{k-1} p\, k = -p \sum_{k=1}^{\infty} \frac{\partial}{\partial p} (1-p)^k = -p \cdot \frac{\partial}{\partial p} \left( \sum_{k=1}^{\infty} (1-p)^k \right) = -p \cdot \frac{\partial}{\partial p} \left( \frac1p - 1 \right) = -p \cdot \left( -\frac{1}{p^2} \right) = \frac1p.$$

◦ Do this one

Another way: let $E = \mathbb{E}[X]$. Then,
$$E = p + \sum_{k=2}^{\infty} (1-p)^{k-1} p\, k = p + \sum_{k=2}^{\infty} (1-p)^{k-1} p\, (k-1) + \sum_{k=2}^{\infty} (1-p)^{k-1} p = p + (1-p)E + 1 - p.$$
So $E = 1 + (1-p)E$, i.e. $E = 1/p$. ♦
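The identity $\mathbb{E}[\mathrm{Geo}(p)] = 1/p$ is easy to check by simulation. A sketch (plain Python; sampling by counting Bernoulli trials until the first success, with arbitrary parameter choices):

```python
import random

random.seed(3)
p, n = 0.25, 200_000

def geometric(p):
    # number of Bernoulli(p) trials up to and including the first success
    k = 1
    while random.random() >= p:
        k += 1
    return k

mean = sum(geometric(p) for _ in range(n)) / n
assert abs(mean - 1 / p) < 0.05  # 1/p = 4
```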

Example 16.3. A pair of independent fair dice is tossed. What is the expected number of tosses needed to see Shesh-Besh (a 6 and a 5)? Note that each toss of the dice is an independent trial, such that the probability of seeing Shesh-Besh is $2/36 = 1/18$. So if $X$ is the number of tosses until Shesh-Besh, then $X \sim \mathrm{Geo}(1/18)$. Thus, $\mathbb{E}[X] = 18$. ♦

16.1.1. Function of a random variable. Let $g : \mathbb{R}^d \to \mathbb{R}$ be a measurable function. Let $(X_1,\dots,X_d)$ be a joint distribution of $d$ discrete random variables, each with range $R$. Then $Y = g(X_1,\dots,X_d)$ is a random variable. What is its expectation? Well, first we need the density of $Y$: for any $y \in \mathbb{R}$ we have that
$$\mathbb{P}[Y = y] = \mathbb{P}[(X_1,\dots,X_d) \in g^{-1}(\{y\})] = \sum_{(r_1,\dots,r_d) \in R^d \cap g^{-1}(\{y\})} \mathbb{P}[(X_1,\dots,X_d) = (r_1,\dots,r_d)].$$
So, if $R_Y := g(R^d)$, then since
$$R^d = \biguplus_{y \in R_Y} R^d \cap g^{-1}(\{y\}),$$
and since $Y$ is discrete, we get that
$$\mathbb{E}[g(X_1,\dots,X_d)] = \sum_{y \in R_Y} y\, \mathbb{P}[Y = y] = \sum_{y \in R_Y} \sum_{\vec r \in R^d \cap g^{-1}(\{y\})} f_{(X_1,\dots,X_d)}(\vec r\,)\, g(\vec r\,) = \sum_{\vec r \in R^d} f_{(X_1,\dots,X_d)}(\vec r\,)\, g(\vec r\,).$$
Specifically:

Proposition 16.4. Let $g : \mathbb{R}^d \to \mathbb{R}$ be a measurable function, and $X = (X_1,\dots,X_d)$ a discrete joint random variable with range $R^d$. Then,
$$\mathbb{E}[g(X)] = \sum_{\vec r \in R^d} f_X(\vec r\,)\, g(\vec r\,).$$

Example 16.5. Manchester United plans to earn some money selling Wayne Rooney jerseys. Each jersey costs the club $x$ pounds, and is sold for $y > x$ pounds. Suppose that the number of people who want to buy jerseys is a discrete random variable $X$ with range $\mathbb{N}$. What is the expected profit if the club orders $N$ jerseys? How many jerseys should be ordered in order to maximize this profit?

Solution. Let $p_k = f_X(k) = \mathbb{P}[X = k]$. Let $g_N : \mathbb{N} \to \mathbb{R}$ be the function that takes the number of people who want to buy jerseys and gives the profit when the club has ordered $N$ jerseys. That is,
$$g_N(k) = \begin{cases} ky - Nx & k \le N, \\ N(y-x) & k > N. \end{cases}$$
The expected profit is then
$$\mathbb{E}[g_N(X)] = \sum_{k=0}^{\infty} p_k\, g_N(k) = \sum_{k=0}^{\infty} p_k N(y-x) - \sum_{k=0}^{N} p_k (N-k) y = N(y-x) - \sum_{k=0}^{N} p_k (N-k) y.$$
Call this $G(N) := N(y-x) - \sum_{k=0}^{N} p_k (N-k) y$. We want to maximize this as a function of $N$. Note that
$$G(N+1) - G(N) = y - x - \sum_{k=0}^{N} y\, p_k = y - x - y\, \mathbb{P}[X \le N].$$
So $G(N+1) > G(N)$ as long as $\mathbb{P}[X \le N] < \frac{y-x}{y}$, so the club should order $N+1$ jerseys for the largest $N$ such that $\mathbb{P}[X \le N] < \frac{y-x}{y}$. ♦

Example 16.6. Some more calculations:

• $X \sim H(N, m, n)$ (recall that $X$ is the number of "special" objects chosen, when choosing $n$ objects uniformly out of $N$ objects, with a total of $m$ special objects). We use the facts that
$$k \binom{m}{k} = m \binom{m-1}{k-1} \qquad \text{and} \qquad \binom{N}{n} = \frac{N}{n} \binom{N-1}{n-1}.$$
Then
$$\mathbb{E}[X] = \sum_{k=0}^{n \wedge m} \binom{N}{n}^{-1} \binom{m}{k} \binom{N-m}{n-k}\, k = \frac{n}{N}\, m \sum_{k=1}^{n \wedge m} \binom{N-1}{n-1}^{-1} \binom{m-1}{k-1} \binom{(N-1)-(m-1)}{(n-1)-(k-1)} = \frac{nm}{N}.$$

• $X \sim \mathrm{NB}(m, p)$ (the number of trials until the $m$-th success). Using
$$k \binom{k-1}{m-1} = m \binom{k}{m},$$
with $j = k+1$ and $n = m+1$,
$$\mathbb{E}[X] = \sum_{k=m}^{\infty} k \binom{k-1}{m-1} p^m (1-p)^{k-m} = \frac{m}{p} \sum_{j=n}^{\infty} \binom{j-1}{n-1} p^n (1-p)^{j-n} = \frac{m}{p}. \qquad ♦$$

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 17

17.1. Expectation – Continuous Case

Our goal is now to prove the following theorem:

Theorem 17.1. Let $X=(X_1,\dots,X_d)$ be an absolutely continuous random variable, and let $g:\mathbb{R}^d\to\mathbb{R}$ be a measurable function. Then,
$$\mathbb{E}[g(X)] = \int_{\mathbb{R}^d} g(\vec x)\,f_X(\vec x)\,d\vec x.$$

The main lemma here is

Lemma 17.2. Let $X=(X_1,\dots,X_d)$ be an absolutely continuous random variable. Then, for any Borel set $B\in\mathcal{B}_d$,
$$\mathbb{P}[X\in B] = \int_B f_X(\vec x)\,d\vec x.$$

Proof. Let $Q(B) = \int_B f_X(\vec x)\,d\vec x$. Then $Q$ is a probability measure on $(\mathbb{R}^d,\mathcal{B}_d)$ that coincides with $\mathbb{P}_X$ on the $\pi$-system of rectangles $(-\infty,b_1]\times\cdots\times(-\infty,b_d]$. Thus $\mathbb{P}[X\in B] = \mathbb{P}_X(B) = Q(B)$ for all $B\in\mathcal{B}_d$. □

Remark 17.3. We have not really defined the integral $\int_B f_X(\vec x)\,d\vec x$. However, for our purposes, we can define it as $\mathbb{P}[X\in B]$, and note that this coincides with the Riemann integral on intervals.

Proof of Theorem 17.1. First assume that $g\ge 0$, so $\mathbb{R}^d = g^{-1}[0,\infty)$. For all $n$ define $Y_n = 2^{-n}\lfloor 2^n g(X)\rfloor$, which are discrete non-negative random variables.

First, we show that
$$\mathbb{E}[Y_n] = \int_{\mathbb{R}^d} 2^{-n}\lfloor 2^n g(\vec x)\rfloor f_X(\vec x)\,d\vec x.$$
Indeed, for $n,k\ge 0$ let $B_{n,k} = g^{-1}[2^{-n}k,\,2^{-n}(k+1))\in\mathcal{B}_d$. Note that
$$Y_n = \sum_{k=0}^\infty 2^{-n}k\,\mathbf{1}_{\{X\in B_{n,k}\}} \qquad\text{and}\qquad \mathbb{E}[Y_n] = \sum_{k=0}^\infty 2^{-n}k\,\mathbb{P}[X\in B_{n,k}].$$
Now, since
$$\mathbb{R}^d = g^{-1}[0,\infty) = \biguplus_{k=0}^\infty g^{-1}[2^{-n}k,\,2^{-n}(k+1)) = \biguplus_{k=0}^\infty B_{n,k},$$
and since
$$\mathbf{1}_{B_{n,k}}\cdot 2^{-n}\lfloor 2^n g\rfloor f_X = \mathbf{1}_{B_{n,k}}\cdot 2^{-n}k\,f_X,$$
we have by the lemma above that
$$\int_{\mathbb{R}^d} 2^{-n}\lfloor 2^n g(\vec x)\rfloor f_X(\vec x)\,d\vec x = \sum_{k=0}^\infty\int_{B_{n,k}} 2^{-n}k\,f_X(\vec x)\,d\vec x = \sum_{k=0}^\infty 2^{-n}k\,\mathbb{P}[X\in B_{n,k}] = \mathbb{E}[Y_n].$$
Here we have used the fact that
$$\sum_{k=0}^N \mathbf{1}_{B_{n,k}}\,2^{-n}k \nearrow \sum_{k=0}^\infty \mathbf{1}_{B_{n,k}}\,2^{-n}k.$$
Since $|2^{-n}\lfloor 2^n g\rfloor - g|\le 2^{-n}$, we get that $|Y_n - g(X)|\le 2^{-n}$. Thus,
$$\Big|\mathbb{E}[g(X)] - \int_{\mathbb{R}^d} g(\vec x)f_X(\vec x)\,d\vec x\Big| \le \big|\mathbb{E}[g(X)] - \mathbb{E}[Y_n]\big| + \Big|\int_{\mathbb{R}^d} g(\vec x)f_X(\vec x)\,d\vec x - \int_{\mathbb{R}^d} 2^{-n}\lfloor 2^n g(\vec x)\rfloor f_X(\vec x)\,d\vec x\Big| \le 2\cdot 2^{-n}\to 0.$$
This proves the theorem for non-negative functions $g$.

Now, if $g$ is a general measurable function, consider $g = g^+ - g^-$. Since $g^+,g^-$ are non-negative, we have that
$$\mathbb{E}[g(X)] = \mathbb{E}[g^+(X) - g^-(X)] = \mathbb{E}[g^+(X)] - \mathbb{E}[g^-(X)] = \int_{\mathbb{R}^d}\big(g^+(\vec x) - g^-(\vec x)\big)f_X(\vec x)\,d\vec x. \qquad □$$

Corollary 17.4. Let $X$ be an absolutely continuous random variable with density $f_X$. Then,
$$\mathbb{E}[X] = \int_{-\infty}^\infty t\,f_X(t)\,dt.$$

Compare $\int_{\mathbb{R}} t f_X(t)\,dt$ to $\sum_r r\,\mathbb{P}[X=r]$ in the discrete case. This is another place where $f_X$ is "like" $\mathbb{P}[X=\cdot]$ (although the latter is identically $0$ in the continuous case, as we have seen).

Example 17.5. Expectations of some absolutely continuous random variables:

• $X\sim U[0,1]$: $\mathbb{E}[X] = \int_0^1 t\,dt = 1/2$.

• More generally, $X\sim U[a,b]$:
$$\mathbb{E}[X] = \int_a^b t\cdot\frac{1}{b-a}\,dt = \frac{b^2-a^2}{2(b-a)} = \frac{b+a}{2}.$$

• $X\sim\mathrm{Exp}(\lambda)$: We use integration by parts, since $\int e^{-\lambda t}\,dt = -\lambda^{-1}e^{-\lambda t}$:
$$\mathbb{E}[X] = \int_0^\infty t\cdot\lambda e^{-\lambda t}\,dt = -te^{-\lambda t}\Big|_0^\infty + \int_0^\infty e^{-\lambda t}\,dt = \frac1\lambda.$$

• $X\sim N(\mu,\sigma)$:
$$\mathbb{E}[X] = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^\infty t\exp\Big(-\frac{(t-\mu)^2}{2\sigma^2}\Big)\,dt.$$
Change $u = t-\mu$, so $du = dt$:
$$\mathbb{E}[X] = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^\infty u\exp\Big(-\frac{u^2}{2\sigma^2}\Big)\,du + \mu\int_{-\infty}^\infty f_X(t)\,dt.$$
Since the function $u\mapsto u\exp(-\frac{u^2}{2\sigma^2})$ is odd, its integral is $0$, so
$$\mathbb{E}[X] = \mu\int_{-\infty}^\infty f_X(t)\,dt = \mu.$$
A simpler way is to notice that $\frac{X-\mu}{\sigma}\sim N(0,1)$, so
$$\mathbb{E}\Big[\frac{X-\mu}{\sigma}\Big] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty t e^{-t^2/2}\,dt = 0,$$
as an integral of an odd function. $\mathbb{E}[X]=\mu$ follows from linearity of expectation.

△
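The three expectations in Example 17.5 can be checked by numerically integrating $\int t f_X(t)\,dt$. The parameter values below ($U[2,5]$, $\mathrm{Exp}(2)$, $N(1,3)$) are arbitrary choices for illustration; the exponential and normal integrals are truncated where the tails are negligible.

```python
import math

def riemann(f, a, b, n=100000):
    # plain midpoint Riemann sum of f on [a, b]
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# E[X] for X ~ U[2, 5]: integral of t / (b - a) over [2, 5]; exact value 3.5
e_unif = riemann(lambda t: t / 3.0, 2.0, 5.0)

# E[X] for X ~ Exp(2): integral of t * 2 e^{-2t}, truncated at t = 30; exact value 0.5
e_exp = riemann(lambda t: t * 2.0 * math.exp(-2.0 * t), 0.0, 30.0)

# E[X] for X ~ N(1, 3): integral of t * density over +-30 sd; exact value mu = 1
mu, sigma = 1.0, 3.0
e_norm = riemann(lambda t: t * math.exp(-(t - mu) ** 2 / (2 * sigma ** 2))
                 / (sigma * math.sqrt(2 * math.pi)),
                 mu - 30 * sigma, mu + 30 * sigma)
print(e_unif, e_exp, e_norm)
```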

Proposition 17.6. Let $X$ be an absolutely continuous random variable such that $\mathbb{E}[X]$ exists. Then,
$$\mathbb{E}[X] = \int_0^\infty\big(\mathbb{P}[X>t] - \mathbb{P}[X\le -t]\big)\,dt = \int_0^\infty\big(1 - F_X(t) - F_X(-t)\big)\,dt.$$

Proof. Note that
$$\int_0^\infty\mathbb{P}[X>t]\,dt = \int_0^\infty\int_t^\infty f_X(s)\,ds\,dt = \int_0^\infty\int_0^s dt\,f_X(s)\,ds = \int_0^\infty s f_X(s)\,ds.$$
Similarly,
$$\int_0^\infty\mathbb{P}[X\le -t]\,dt = \int_0^\infty\int_{-\infty}^{-t} f_X(s)\,ds\,dt = \int_{-\infty}^0\int_0^{-s}dt\,f_X(s)\,ds = -\int_{-\infty}^0 s f_X(s)\,ds.$$
Subtracting the two, we have the result. □

In a similar way we can prove (in the exercises):

Exercise 17.7. Let $X$ be a discrete random variable with range $\mathbb{Z}$ such that $\mathbb{E}[X]$ exists. Then,
$$\mathbb{E}[X] = \sum_{k=0}^\infty\big(\mathbb{P}[X>k] - \mathbb{P}[X<-k]\big).$$

Example 17.8. Let $X\sim N(0,1)$. Compute $\mathbb{E}[X^2]$. By the above,
$$\mathbb{E}[X^2] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} x^2 e^{-x^2/2}\,dx = -\frac{1}{\sqrt{2\pi}}\,xe^{-x^2/2}\Big|_{-\infty}^\infty + \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-x^2/2}\,dx = 1,$$
where we have used integration by parts, with $\frac{\partial}{\partial x}e^{-x^2/2} = -xe^{-x^2/2}$. △

17.2. Examples Using Linearity

Example 17.9 (Coupon Collector). Gilad collects "super-goal" cards. There are $N$ cards to collect altogether. Each time he buys a card, he gets one of the $N$ uniformly at random, independently. What is the expected number of cards Gilad needs to buy in order to collect all cards?

For $k = 0,1,\dots,N-1$, let $T_k$ be the number of cards bought after getting the $k$-th new card, until getting the $(k+1)$-th new card. That is, when Gilad has $k$ different cards, he buys $T_k$ more cards until he has $k+1$ different cards.

If Gilad has $k$ different cards, then with probability $\frac{N-k}{N}$ he buys a card he does not already have. So $T_k\sim\mathrm{Geo}\big(\frac{N-k}{N}\big)$.

Since the total number of cards Gilad buys until getting all $N$ cards is $T = T_0 + T_1 + T_2 + \cdots + T_{N-1}$, using linearity of expectation,
$$\mathbb{E}[T] = \mathbb{E}[T_0] + \mathbb{E}[T_1] + \cdots + \mathbb{E}[T_{N-1}] = 1 + \frac{N}{N-1} + \cdots + N = N\sum_{k=1}^N\frac1k. \qquad △$$
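The formula $\mathbb{E}[T] = N\sum_{k=1}^N 1/k$ can be checked by direct simulation; the choice $N=20$ below is arbitrary.

```python
import random

random.seed(0)

def collect(N):
    # number of uniform purchases until all N distinct cards are seen
    seen, count = set(), 0
    while len(seen) < N:
        seen.add(random.randrange(N))
        count += 1
    return count

N, trials = 20, 20000
avg = sum(collect(N) for _ in range(trials)) / trials
harmonic = N * sum(1.0 / k for k in range(1, N + 1))
print(avg, harmonic)  # both close to 20 * H_20, roughly 72
```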

Example 17.10 (Permutation fixed points). $n$ soldiers receive packages from their families. The packages get mixed up at the post office, so that each ordering is equally likely. Let $X$ be the number of soldiers that receive the correct package from their own family. What is the expectation of $X$?

Let us solve this by writing $X$ as a sum. Let $X_i$ be the indicator of the event that soldier $i$ receives the correct package. What is the probability that $X_i = 1$? There are $(n-1)!$ possible orderings for which this may happen, so $\mathbb{E}[X_i] = \frac{(n-1)!}{n!} = \frac1n$. Note that $X = \sum_{i=1}^n X_i$, so by linearity, $\mathbb{E}[X] = n\cdot\frac1n = 1$.

Note also that the $X_i$'s are not independent; for example, if $i\ne j$,
$$\mathbb{P}[X_i = 1, X_j = 1] = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)} \ne \frac{1}{n^2}.$$
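The striking conclusion of Example 17.10 — that $\mathbb{E}[X] = 1$ no matter how many soldiers there are — is easy to confirm by simulation; $n = 10$ below is an arbitrary choice.

```python
import random

random.seed(1)

n, trials = 10, 50000
total = 0
for _ in range(trials):
    perm = list(range(n))
    random.shuffle(perm)            # a uniformly random assignment of packages
    total += sum(1 for i, p in enumerate(perm) if i == p)  # count fixed points
avg = total / trials
print(avg)  # close to 1, independently of n
```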

△

Example 17.11. We toss a die 100 times. What is the expected sum of all tosses?

Here it really begs to use linearity. If $X_k$ is the outcome of the $k$-th toss, and $X = \sum_{k=1}^{100}X_k$, then
$$\mathbb{E}[X] = \sum_{k=1}^{100}\mathbb{E}[X_k] = 100\cdot\frac72 = 350. \qquad △$$

Example 17.12. 200 random numbers are output by a computer, each distributed uniformly on $[0,1]$. What is their expected sum?
$$\mathbb{E}[X] = 200\cdot\mathbb{E}[U[0,1]] = 200\cdot\frac12 = 100.$$

△

Example 17.13. Let $X_n\sim U[0,2^{-n}]$ for $n\ge 0$, and let $S_N = \sum_{k=0}^N X_k$. What is the expectation of $S_N$?

Linearity of expectation gives
$$\mathbb{E}[S_N] = \sum_{k=0}^N\mathbb{E}[X_k] = \sum_{k=0}^N 2^{-(k+1)} = 1 - 2^{-(N+1)}.$$
Note that if $S_\infty = \sum_{k=0}^\infty X_k$, then $S_N\nearrow S_\infty$, so monotone convergence gives that $\mathbb{E}[S_\infty] = 1$. △

17.3. A formula for general random variables

Theorem 17.14. Let $X$ be a random variable. Then
$$\mathbb{E}[X^+] = \int_0^\infty\big(1 - F_X(t)\big)\,dt \qquad\text{and}\qquad \mathbb{E}[X^-] = \int_0^\infty F_X(-t)\,dt.$$
Thus, if $\mathbb{E}[X]$ exists then
$$\mathbb{E}[X] = \int_0^\infty\big(1 - F_X(t) - F_X(-t)\big)\,dt.$$

Proof. First suppose that $X\ge 0$ and discrete. Then, if the range of $X$ is $R_X = \{0 = r_0 < r_1 < r_2 < \cdots\}$, we have for any $t\in[r_j,r_{j+1})$,
$$\mathbb{P}[X>t] = \mathbb{P}[X\ge r_{j+1}] = \sum_{n=j+1}^\infty\mathbb{P}[X=r_n].$$
Thus,
$$\int_{r_j}^{r_{j+1}}\big(1-F_X(t)\big)\,dt = (r_{j+1}-r_j)\sum_{n=j+1}^\infty\mathbb{P}[X=r_n].$$
Summing over $j$ and interchanging the sums, we have
$$\int_0^\infty\big(1-F_X(t)\big)\,dt = \sum_{j=0}^\infty\int_{r_j}^{r_{j+1}}\big(1-F_X(t)\big)\,dt = \sum_{j=0}^\infty(r_{j+1}-r_j)\sum_{n=j+1}^\infty\mathbb{P}[X=r_n] = \sum_{n=1}^\infty\mathbb{P}[X=r_n]\sum_{j=0}^{n-1}(r_{j+1}-r_j) = \sum_{n=1}^\infty r_n\,\mathbb{P}[X=r_n] = \mathbb{E}[X].$$
Since for $t>0$ the positivity of $X$ tells us that $F_X(-t) = 0$, we have the theorem for discrete non-negative random variables.

We also have the formula: for $t\in(r_j,r_{j+1}]$,
$$\mathbb{P}[X\ge t] = \mathbb{P}[X\ge r_{j+1}] = \sum_{n=j+1}^\infty\mathbb{P}[X=r_n].$$
Thus,
$$\int_{r_j}^{r_{j+1}}\mathbb{P}[X\ge t]\,dt = (r_{j+1}-r_j)\sum_{n=j+1}^\infty\mathbb{P}[X=r_n] = \int_{r_j}^{r_{j+1}}\mathbb{P}[X>t]\,dt,$$
so
$$(17.1)\qquad \mathbb{E}[X] = \int_0^\infty\mathbb{P}[X>t]\,dt = \int_0^\infty\mathbb{P}[X\ge t]\,dt.$$
Now, if $X\ge 0$ is any general non-negative random variable, let $X_n := 2^{-n}\lfloor 2^n X\rfloor$. So $X_n$ is discrete, non-negative, and $|X_n - X|\le 2^{-n}$. Also, because $X_n\le X$, for any $t>0$,
$$|F_{X_n}(t) - F_X(t)| = \big|\mathbb{P}[X_n\le t] - \mathbb{P}[X\le t]\big| = \mathbb{P}[X_n\le t,\, X>t] \le \mathbb{P}[t<X\le t+2^{-n}] = F_X(t+2^{-n}) - F_X(t)\to 0$$
as $n\to\infty$, by right continuity. Since $X_n\le X_{n+1}$ we get that $1 - F_{X_n}(t)\nearrow 1 - F_X(t)$, and by monotone convergence,
$$\mathbb{E}[X_n] = \int_0^\infty\big(1-F_{X_n}(t)\big)\,dt \nearrow \int_0^\infty\big(1-F_X(t)\big)\,dt.$$
However, since $|\mathbb{E}[X_n] - \mathbb{E}[X]|\le 2^{-n}$, we obtain the theorem for general non-negative random variables.

Finally, if $X = X^+ - X^-$ is any random variable, we have for all $t\ge 0$ that $F_{X^+}(t) = \mathbb{P}[X^+\le t] = \mathbb{P}[0\le X\le t] + \mathbb{P}[X<0] = \mathbb{P}[X\le t] = F_X(t)$, and since $X^+\ge 0$,
$$\mathbb{E}[X^+] = \int_0^\infty\big(1-F_X(t)\big)\,dt.$$
Also, since $X^- = (-X)^+$, for all $t\ge 0$ we have $\mathbb{P}[X^-\ge t] = \mathbb{P}[-X\ge t] = F_X(-t)$, so by (17.1),
$$\mathbb{E}[X^-] = \int_0^\infty\mathbb{P}[X^-\ge t]\,dt = \int_0^\infty F_X(-t)\,dt. \qquad □$$

Example 17.15. Let $X$ be a random variable with distribution function
$$F_X(t) = \begin{cases} 0 & t<0 \\ \frac{t+1}{2} & 0\le t<\frac12 \\ \frac78 & \frac12\le t<10 \\ 1 & 10\le t. \end{cases}$$
Is $X$ discrete? Is $X$ absolutely continuous?

$X$ cannot be absolutely continuous, because $F_X$ is not even continuous: $0,\frac12,10$ are discontinuity points of $F_X$.

$X$ cannot be discrete, because if $r$ is such that $\mathbb{P}[X=r]>0$ then $r$ is a discontinuity point of $F_X$, so these can only be $0,\frac12,10$. However,
$$\mathbb{P}\big[X\notin\{0,\tfrac12,10\}\big] \ge \mathbb{P}[\tfrac18<X\le\tfrac14] = F_X(\tfrac14) - F_X(\tfrac18) = \frac58 - \frac{9}{16} = \frac{1}{16} > 0.$$
Finally, let us compute the expectation of $X$. Since $X\ge 0$, we have that $X = X^+$, so
$$\mathbb{E}[X] = \int_0^\infty\big(1-F_X(t)\big)\,dt = \int_0^{1/2}\frac{1-t}{2}\,dt + \int_{1/2}^{10}\frac18\,dt = \frac12\Big(t-\frac{t^2}{2}\Big)\Big|_0^{1/2} + \frac18\cdot\frac{19}{2} = \frac14 - \frac{1}{16} + \frac{19}{16} = \frac{11}{8}.$$

△
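The tail-integral formula $\mathbb{E}[X] = \int_0^\infty(1-F_X(t))\,dt$ used in Example 17.15 is easy to check numerically. The sketch below re-implements $F_X$ from that example; the integrand vanishes for $t\ge 10$, so a Riemann sum over $[0,10]$ suffices.

```python
def F(t):
    # the distribution function of Example 17.15 (note X >= 0)
    if t < 0:
        return 0.0
    if t < 0.5:
        return (t + 1) / 2
    if t < 10:
        return 7 / 8
    return 1.0

# E[X] = integral over [0, 10] of (1 - F(t)) dt, by a midpoint Riemann sum
n = 200000
h = 10.0 / n
e = sum(1.0 - F((i + 0.5) * h) for i in range(n)) * h
print(e)  # close to 11/8 = 1.375
```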

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 18

18.1. Jensen’s Inequality

Recall that a function $g:\mathbb{R}\to\mathbb{R}$ is convex on $[a,b]$ if for any $0<\lambda<1$ and any $x,y\in[a,b]$,
$$g(\lambda x + (1-\lambda)y) \le \lambda g(x) + (1-\lambda)g(y).$$
For example, any twice differentiable function such that $g''\ge 0$ is convex, e.g. $x^2$ and $-\log x$.

Convex functions are continuous, and so always measurable.

First, a small lemma regarding convex functions: a convex function lies above its tangents.

Lemma 18.1. Let $a<m<b$, and let $g:[a,b]\to\mathbb{R}$ be convex. Then there exist $A,B\in\mathbb{R}$ such that $g(m) = Am+B$ and for all $x\in[a,b]$, $g(x)\ge Ax+B$.

Proof. For all $a\le x<m<y\le b$ we have $m = \lambda x + (1-\lambda)y$ where $\lambda = \frac{y-m}{y-x}\in(0,1)$. Thus $g(m)\le\lambda g(x) + (1-\lambda)g(y)$, which implies that
$$\frac{g(m)-g(x)}{m-x} \le \frac{g(y)-g(m)}{y-m}.$$
Thus, choosing any $A$ with
$$\sup_{x<m}\frac{g(m)-g(x)}{m-x} \le A \le \inf_{y>m}\frac{g(y)-g(m)}{y-m},$$
and setting $B = g(m)-Am$, we have for all $x<m$ that $g(x)\ge g(m) - A(m-x) = Ax+B$, and for all $y>m$ that $g(y)\ge g(m) + A(y-m) = Ay+B$. □

Theorem 18.2 (Jensen's Inequality). Let $g:\mathbb{R}\to\mathbb{R}$ be convex on $[a,b]$. Let $X$ be a random variable such that $\mathbb{P}[a\le X\le b] = 1$. Then,
$$g(\mathbb{E}[X]) \le \mathbb{E}[g(X)].$$

Johan Jensen (1859–1925)

Proof. If $X$ is constant, then there is nothing to prove, so assume $X$ is not constant. Thus $a<\mathbb{E}[X]<b$. Let $m = \mathbb{E}[X]$. Then, by Lemma 18.1, $g(X)\ge AX+B$ and $g(\mathbb{E}[X]) = A\,\mathbb{E}[X]+B$

Figure 5. A convex function lies above some linear function at an inner point $m$.

for some $A,B\in\mathbb{R}$. Specifically, since $|g(X)|\le|A||X|+|B|\le|A|\max\{|a|,|b|\}+|B|$, we get that $\mathbb{E}[|g(X)|]<\infty$, so $\mathbb{E}[g(X)]$ is well defined.

Now, since $AX+B\le g(X)$,
$$g(\mathbb{E}[X]) = A\,\mathbb{E}[X]+B = \mathbb{E}[AX+B] \le \mathbb{E}[g(X)].$$

□

Example 18.3 (Arithmetic-geometric mean inequality). Let $a_1,\dots,a_n$ be $n$ positive numbers. Let $X$ be a discrete random variable with density $f_X(a_k) = \frac1n$. Note that the function $g(x) = -\log x$ is convex on $(0,\infty)$ (since $g''(x) = x^{-2}>0$). So Jensen's inequality gives that
$$-\log\Big(\frac1n\sum_{k=1}^n a_k\Big) = -\log(\mathbb{E}[X]) \le \mathbb{E}[-\log(X)] = -\frac1n\sum_{k=1}^n\log a_k.$$
Exponentiating, we get
$$\frac1n\sum_{k=1}^n a_k \ge \Big(\prod_{k=1}^n a_k\Big)^{1/n}.$$
This is the arithmetic-geometric mean inequality. △

Example 18.4 ($L^p\subseteq L^q$ for $p>q$). Let $p\ge 1$. The function $g(x) = |x|^p$ is convex, because away from $0$, $g''(x) = p(p-1)|x|^{p-2}\ge 0$, and at $0$ we have $g(0) = 0\le\lambda g(x)+(1-\lambda)g(y)$ for any $0<\lambda<1$ and $x,y$.

Thus, $\mathbb{E}[|X|^p]\ge|\mathbb{E}[X]|^p$ for any random variable $X$ such that $\mathbb{E}[X]$ is defined.

Furthermore, for $1\le q<p$, if $\mathbb{E}[|X|^p]<\infty$, then
$$\mathbb{E}[|X|^q] = \big(\mathbb{E}[|X|^q]^{p/q}\big)^{q/p} \le \big(\mathbb{E}[|X|^p]\big)^{q/p} < \infty,$$
since the function $x\mapsto|x|^{p/q}$ is convex. △

18.2. Moments

Definition 18.5. Let $X$ be a random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$. The $n$-th moment of $X$ is defined to be $\mathbb{E}[X^n]$, if this exists and is finite. If this expectation does not exist (or is infinite), then we say that $X$ does not have an $n$-th moment.

Definition 18.6. Let $X$ be a random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that $\mathbb{E}[|X|]<\infty$. The variance of $X$ is defined to be
$$\mathrm{Var}[X] := \mathbb{E}\big[|X-\mathbb{E}[X]|^2\big].$$
The standard deviation of $X$ is defined to be $\sqrt{\mathrm{Var}[X]}$.

Proposition 18.7. $X$ has a second moment if and only if $\mathbb{E}[|X|]<\infty$ and $\mathrm{Var}[X]<\infty$. In fact, $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

Proof. Using linearity of expectation,
$$\mathbb{E}[(X-a)^2] = \mathbb{E}[X^2] - 2a\,\mathbb{E}[X] + a^2.$$
So $\mathbb{E}[(X-a)^2]<\infty$ if and only if $-\infty<\mathbb{E}[X]<\infty$ and $\mathbb{E}[X^2]<\infty$. □

Example 18.8. Let us calculate some second moments and variances:

• $X\sim\mathrm{Ber}(p)$:
$$\mathbb{E}[X^2] = \mathbb{P}[X=1]\cdot 1^2 + \mathbb{P}[X=0]\cdot 0^2 = p.$$
So $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1-p)$.

• $X\sim\mathrm{Bin}(n,p)$: We use the identity $k(k-1)\binom nk = n(n-1)\binom{n-2}{k-2}$, valid for all $2\le k\le n$:
$$\mathbb{E}[X(X-1)] = \sum_{k=0}^n k(k-1)\,\mathbb{P}[X=k] = n(n-1)\sum_{k=2}^n\binom{n-2}{k-2}p^2 p^{k-2}(1-p)^{n-k} = p^2 n(n-1)\sum_{k=0}^{n-2}\binom{n-2}{k}p^k(1-p)^{n-2-k} = p^2n^2 - p^2n.$$
By linearity of expectation,
$$p^2 n(n-1) = \mathbb{E}[X^2] - \mathbb{E}[X] = \mathbb{E}[X^2] - np.$$
So $\mathbb{E}[X^2] = p^2n(n-1) + np$ and $\mathrm{Var}[X] = np - p^2n = np(1-p)$.

• $X\sim\mathrm{Geo}(p)$:
$$\mathbb{E}[X^2] = \sum_{k=1}^\infty k^2(1-p)^{k-1}p = p + \sum_{k=2}^\infty(k-1+1)^2(1-p)^{k-2}p(1-p) = p + (1-p)\sum_{k=1}^\infty(k+1)^2(1-p)^{k-1}p = p + (1-p)\,\mathbb{E}[(X+1)^2]$$
$$= p + (1-p)\,\mathbb{E}[X^2] + (1-p) + 2(1-p)\,\mathbb{E}[X] = 1 + \frac{2(1-p)}{p} + (1-p)\,\mathbb{E}[X^2].$$
So
$$\mathbb{E}[X^2] = \frac1p + \frac{2(1-p)}{p^2} = \frac{2-p}{p^2}.$$
Thus
$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \frac{1-p}{p^2}.$$

• $X\sim\mathrm{Poi}(\lambda)$:
$$\mathbb{E}[X^2] = \sum_{k=0}^\infty k^2 e^{-\lambda}\frac{\lambda^k}{k!} = \lambda\sum_{k=1}^\infty(k-1+1)e^{-\lambda}\frac{\lambda^{k-1}}{(k-1)!} = \lambda\sum_{k=0}^\infty(k+1)e^{-\lambda}\frac{\lambda^k}{k!} = \lambda\,\mathbb{E}[X+1] = \lambda(\lambda+1).$$
Thus, $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \lambda$.

• $X\sim\mathrm{Exp}(\lambda)$: Use integration by parts:
$$\mathbb{E}[X^2] = \int_0^\infty t^2\lambda e^{-\lambda t}\,dt = -t^2 e^{-\lambda t}\Big|_0^\infty + \int_0^\infty 2te^{-\lambda t}\,dt = \frac2\lambda\,\mathbb{E}[X] = \frac{2}{\lambda^2}.$$
So $\mathrm{Var}[X] = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$. △

Exercise 18.9. Let $X$ be a random variable with $\mathrm{Var}[X]<\infty$. Show that for any $a,b\in\mathbb{R}$,
$$\mathrm{Var}[aX+b] = a^2\,\mathrm{Var}[X].$$

Solution. Let $\mu = \mathbb{E}[X]$. So $\mathbb{E}[aX+b] = a\mu+b$. Then,
$$\mathrm{Var}[aX+b] = \mathbb{E}\big[(aX+b-(a\mu+b))^2\big] = \mathbb{E}[(aX-a\mu)^2] = a^2\,\mathrm{Var}[X]. \qquad □$$

Example 18.10.

• $X\sim U[a,b]$: $Y = \frac{X-a}{b-a}\sim U[0,1]$. (Prove this!) So
$$\mathbb{E}[Y^2] = \int_0^1 s^2\,ds = \frac13.$$
Thus $\mathrm{Var}[Y] = \frac13 - \frac14 = \frac{1}{12}$. We have that
$$\mathrm{Var}[X] = \mathrm{Var}[(b-a)Y+a] = (b-a)^2\,\mathrm{Var}[Y] = \frac{(b-a)^2}{12}.$$

• $X\sim N(\mu,\sigma)$: Recall that $Y := (X-\mu)/\sigma$ has $N(0,1)$ distribution. Thus, since $X = \sigma Y + \mu$, we have that $\mathrm{Var}[X] = \sigma^2\,\mathrm{Var}[Y]$. Now, using integration by parts, since $\frac{\partial}{\partial t}e^{-t^2/2} = -te^{-t^2/2}$,
$$\mathrm{Var}[Y] = \mathbb{E}[Y^2] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty t^2 e^{-t^2/2}\,dt = -\frac{1}{\sqrt{2\pi}}\,te^{-t^2/2}\Big|_{-\infty}^\infty + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-t^2/2}\,dt = 1.$$
So $\mathrm{Var}[X] = \sigma^2$. (Note that this implies that $\mathbb{E}[X^2] = \sigma^2+\mu^2$.)

△
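The variance formulas of Examples 18.8 can be checked numerically by summing $k^2 f_X(k)$ against the closed forms; the parameter values below are arbitrary, and the geometric and Poisson sums are truncated far into their tails.

```python
import math

def var_from_pmf(pmf_pairs):
    # Var[X] = E[X^2] - E[X]^2 computed from (value, probability) pairs
    m1 = sum(k * p for k, p in pmf_pairs)
    m2 = sum(k * k * p for k, p in pmf_pairs)
    return m2 - m1 * m1

# Bin(12, 0.3): Var = n p (1 - p) = 2.52
n, p = 12, 0.3
binom = [(k, math.comb(n, k) * p ** k * (1 - p) ** (n - k)) for k in range(n + 1)]

# Geo(0.2) on {1, 2, ...}, truncated at 400: Var = (1 - p) / p^2 = 20
q = 0.2
geo = [(k, (1 - q) ** (k - 1) * q) for k in range(1, 400)]

# Poi(4), truncated at 100: Var = lambda = 4
lam = 4.0
poi = [(k, math.exp(-lam) * lam ** k / math.factorial(k)) for k in range(100)]

print(var_from_pmf(binom), var_from_pmf(geo), var_from_pmf(poi))
```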

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 19

19.1. Covariance

Definition 19.1. Let $X,Y$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that $\mathbb{E}[X],\mathbb{E}[Y],\mathbb{E}[XY]$ are finite. The covariance of $X$ and $Y$ is defined to be
$$\mathrm{Cov}[X,Y] := \mathbb{E}\big[(X-\mathbb{E}[X])\cdot(Y-\mathbb{E}[Y])\big].$$

The following are immediate:

Proposition 19.2. Let $X,Y,Z$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Then,

• $\mathrm{Cov}[X,Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$.
• $\mathrm{Cov}[X,Y] = \mathrm{Cov}[Y,X]$.
• $\mathrm{Cov}[aX+Y,Z] = a\,\mathrm{Cov}[X,Z] + \mathrm{Cov}[Y,Z]$.
• $\mathrm{Cov}[X,X] = \mathrm{Var}[X]\ge 0$.

Each equality holds whenever the relevant covariances are defined.

Note that $\mathrm{Cov}$ is almost an inner product on the vector space of random variables on $(\Omega,\mathcal{F},\mathbb{P})$ with finite expectation.

19.2. Inner Products and Cauchy-Schwarz

Reminder:

Let $V$ be a vector space over $\mathbb{R}$. An inner product on $V$ is a function $\langle\cdot,\cdot\rangle: V\times V\to\mathbb{R}$ such that

• Symmetry: For all $v,u\in V$: $\langle u,v\rangle = \langle v,u\rangle$.
• Linearity: For all $\alpha\in\mathbb{R}$ and all $v,u,w\in V$: $\langle\alpha u+v,w\rangle = \alpha\langle u,w\rangle + \langle v,w\rangle$.
• Positivity: For all $v\in V$: $\langle v,v\rangle\ge 0$.
• Definiteness: For all $v\in V$: $\langle v,v\rangle = 0$ if and only if $v=0$.

For our purposes, we would like to replace the last condition by

• Definiteness': For all $v\in V$: $\langle v,v\rangle = 0$ if and only if $\langle v,u\rangle = 0$ for all $u\in V$.

A basic, but super important and fundamental result is the Cauchy-Schwarz inequality. (Denote $\|v\| = \sqrt{\langle v,v\rangle}$.)

Augustin-Louis Cauchy (1789–1857)

Theorem 19.3 (Cauchy-Schwarz Inequality). Let $\langle\cdot,\cdot\rangle$ be an inner product on $V$ with the above modified definiteness condition. For all $v,u\in V$ we have
$$|\langle v,u\rangle| \le \|v\|\cdot\|u\|.$$

Proof. If $\|u\| = 0$ then both sides are $0$ (because of the modified definiteness condition), so we can assume without loss of generality that $\|u\|>0$. Set $\lambda = \|u\|^{-2}\langle v,u\rangle$. Then,
$$0 \le \langle v-\lambda u,\, v-\lambda u\rangle = \langle v,v\rangle + \lambda^2\langle u,u\rangle - 2\lambda\langle v,u\rangle = \|v\|^2 - \|u\|^{-2}\langle v,u\rangle^2.$$
Multiplying by $\|u\|^2$ completes the proof. □

Hermann Schwarz (1843–1921)

19.2.1. Two Inner Products. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. Consider the following vector space over $\mathbb{R}$: let $L^2(\Omega,\mathcal{F},\mathbb{P})$ be the space of all random variables $X$ with finite second moment.

In the exercises we show that if $\mathrm{Cov}[X,X] = 0$ then $\mathrm{Cov}[X,Y] = 0$ for all $Y$, and if $\mathbb{E}[X^2] = 0$ then $\mathbb{E}[XY] = 0$ for all $Y$. Thus, both $\mathrm{Cov}[X,Y]$ and $\mathbb{E}[XY]$ form (modified) inner products on $L^2(\Omega,\mathcal{F},\mathbb{P})$.

(In order to get honest-to-goodness inner products, one needs to take random variables modulo measure $0$ for $\mathbb{E}[XY]$, and modulo measure $0$ and additive constants for $\mathrm{Cov}$.)

We conclude:

Theorem 19.4 (Cauchy-Schwarz inequality). Let $X,Y$ be random variables with finite second moment. Then,
$$|\mathrm{Cov}[X,Y]|^2 \le \mathrm{Var}[X]\cdot\mathrm{Var}[Y] \qquad\text{and}\qquad |\mathbb{E}[XY]|^2 \le \mathbb{E}[X^2]\cdot\mathbb{E}[Y^2].$$

19.3. Correlation

Definition 19.5. If $X,Y$ are random variables such that $\mathrm{Cov}[X,Y] = 0$ we say that $X,Y$ are uncorrelated. $X_1,\dots,X_n$ are said to be pairwise uncorrelated if for any $k\ne j$, $X_k,X_j$ are uncorrelated.

We will soon see that if $X,Y$ are independent, they are uncorrelated. The opposite, however, is not true. Indeed,

Example 19.6. Let $X,Y$ have a joint distribution $f_{X,Y}$ given by

            Y = -1   Y = 0   Y = 1
    X = 0:   1/3      0       1/3
    X = 1:    0      1/3       0

So $f_X(1) = 1/3$ and $f_Y(1) = f_Y(-1) = 1/3$. Hence,
$$\mathbb{E}[XY] = \mathbb{P}[X=1,Y=1] - \mathbb{P}[X=1,Y=-1] = 0,$$
$\mathbb{E}[X] = 1/3$ and $\mathbb{E}[Y] = 1/3 - 1/3 = 0$. So $\mathrm{Cov}[X,Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0$. But $X,Y$ are not independent, as
$$\mathbb{P}[X=1,Y=1] = 0 \ne \frac13\cdot\frac13 = \mathbb{P}[X=1]\cdot\mathbb{P}[Y=1].$$

△

We now show that independence is equivalent to the property that any functions of $X$ and $Y$ are uncorrelated.

Theorem 19.7. Let $X,Y$ be random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. Then $X,Y$ are independent if and only if for any measurable functions $g,h$ such that $\mathrm{Cov}[g(X),h(Y)]$ is defined, we have that $\mathrm{Cov}[g(X),h(Y)] = 0$.

Lemma 19.8. Let $X,Y$ be random variables with finite second moment. If $X,Y$ are independent then they are uncorrelated.

Proof. It suffices to show that if $X,Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.

If $X,Y$ are discrete random variables with range $R$,
$$\mathbb{E}[XY] = \sum_{r,r'\in R}\mathbb{P}[X=r,\,Y=r']\,rr' = \sum_{r,r'\in R}\mathbb{P}[X=r]\,r\cdot\mathbb{P}[Y=r']\,r' = \mathbb{E}[X]\,\mathbb{E}[Y].$$
In general, define $X_n^+ := 2^{-n}\lfloor 2^n X^+\rfloor$ and $Y_n^+ := 2^{-n}\lfloor 2^n Y^+\rfloor$. Since $X_n^+,Y_n^+$ are independent and discrete, we have that
$$\mathbb{E}[X_n^+Y_n^+] = \mathbb{E}[X_n^+]\,\mathbb{E}[Y_n^+].$$
Since $X_n^+\nearrow X^+$ and $Y_n^+\nearrow Y^+$, and these are non-negative, we get by monotone convergence that
$$\mathbb{E}[X^+Y^+] = \lim_{n\to\infty}\mathbb{E}[X_n^+Y_n^+] = \lim_{n\to\infty}\mathbb{E}[X_n^+]\,\mathbb{E}[Y_n^+] = \mathbb{E}[X^+]\,\mathbb{E}[Y^+].$$
In a similar way, for $X_n^- := 2^{-n}\lfloor 2^n X^-\rfloor$ and $Y_n^- := 2^{-n}\lfloor 2^n Y^-\rfloor$, we have that $X_n^-,Y_n^-$ are independent, $X_n^+,Y_n^-$ are independent, and $X_n^-,Y_n^+$ are independent. Thus $\mathbb{E}[X_n^\xi Y_n^\zeta] = \mathbb{E}[X_n^\xi]\,\mathbb{E}[Y_n^\zeta]$ for any choice of $\xi,\zeta\in\{+,-\}$, and taking limits we get that
$$\mathbb{E}[X^\xi Y^\zeta] = \mathbb{E}[X^\xi]\,\mathbb{E}[Y^\zeta].$$
Altogether,
$$\mathbb{E}[XY] = \mathbb{E}[(X^+-X^-)(Y^+-Y^-)] = \mathbb{E}[X^+Y^+] - \mathbb{E}[X^+Y^-] - \mathbb{E}[X^-Y^+] + \mathbb{E}[X^-Y^-]$$
$$= \mathbb{E}[X^+]\,\mathbb{E}[Y^+] - \mathbb{E}[X^+]\,\mathbb{E}[Y^-] - \mathbb{E}[X^-]\,\mathbb{E}[Y^+] + \mathbb{E}[X^-]\,\mathbb{E}[Y^-] = \big(\mathbb{E}[X^+]-\mathbb{E}[X^-]\big)\cdot\big(\mathbb{E}[Y^+]-\mathbb{E}[Y^-]\big) = \mathbb{E}[X]\,\mathbb{E}[Y]. \qquad □$$

Proof of Theorem 19.7. We repeatedly use the fact that $\mathrm{Cov}[X,Y] = 0$ if and only if $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.

For the "if" direction: Let $A_1\in\sigma(X)$, $A_2\in\sigma(Y)$. So $A_1 = X^{-1}(B_1)$, $A_2 = Y^{-1}(B_2)$ for Borel sets $B_1,B_2$. For the functions $g(x) = \mathbf{1}_{\{x\in B_1\}}$ and $h(x) = \mathbf{1}_{\{x\in B_2\}}$ we have that
$$\mathbb{P}[A_1\cap A_2] = \mathbb{E}[\mathbf{1}_{\{X\in B_1,\,Y\in B_2\}}] = \mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)] = \mathbb{P}[A_1]\,\mathbb{P}[A_2].$$
This was the "easy" direction.

Now, for the "only if" direction: We want to show that if $X,Y$ are independent, then $\mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]$ for any measurable functions such that the above expectations exist. Since $g(X),h(Y)$ are independent, the theorem follows from Lemma 19.8. □

Exercise 19.9 (Pythagorean Theorem). Let $X_1,\dots,X_n$ be random variables with finite second moment. Show that
$$\mathrm{Var}[X_1+\cdots+X_n] = \sum_{k=1}^n\mathrm{Var}[X_k] + 2\sum_{j<k}\mathrm{Cov}[X_j,X_k].$$
Conclude that if $X_1,\dots,X_n$ are pairwise uncorrelated, then
$$\mathrm{Var}[X_1+\cdots+X_n] = \sum_{k=1}^n\mathrm{Var}[X_k].$$

The Pythagorean Theorem lets us calculate very easily the variance of a binomial random variable:

Example 19.10. Let $X\sim\mathrm{Bin}(n,p)$. We know that $X = \sum_{k=1}^n Y_k$, where $Y_1,\dots,Y_n$ are independent $\mathrm{Ber}(p)$ random variables. So
$$\mathrm{Var}[X] = \sum_{k=1}^n\mathrm{Var}[Y_k] = np(1-p).$$
The straightforward calculation would be more difficult. △

19.4. Examples

Example 19.11. Let $X\sim U[-1,1]$ and $Y = X^2$.

What is the distribution of $Y$?
$$\mathbb{P}[Y\le t] = \mathbb{P}\big[X\in[-\sqrt t,\sqrt t]\big] = \begin{cases} 0 & t<0 \\ \sqrt t & 0\le t\le 1 \\ 1 & t\ge 1. \end{cases}$$
Note that for
$$f_Y(t) = \begin{cases} \frac12 t^{-1/2} & t\in[0,1] \\ 0 & t\notin[0,1] \end{cases}$$
we have, for any $t\in[0,1]$,
$$\int_{-\infty}^t f_Y(s)\,ds = \int_0^t\frac12 s^{-1/2}\,ds = \sqrt t = F_Y(t),$$
so $Y$ is absolutely continuous with density $f_Y$.

Are $X,Y$ independent? Of course not; for example, the event $\{X\in[0,1/2],\,Y\in[1/2,1]\}$ is empty (as $Y\ge 1/2$ forces $|X|\ge 2^{-1/2}>1/2$), so
$$\mathbb{P}[X\in[0,1/2],\,Y\in[1/2,1]] = 0 \ne \frac14\cdot(1-2^{-1/2}) = \mathbb{P}[X\in[0,1/2]]\cdot\mathbb{P}[Y\in[1/2,1]].$$
What is $\mathrm{Cov}(X,Y)$? Well,
$$\mathbb{E}[XY] = \mathbb{E}[X^3] = \int_{-1}^1 t^3\cdot\frac12\,dt = 0.$$
Also $\mathbb{E}[X] = 0$, so
$$\mathrm{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0.$$
That is, $X,Y$ are uncorrelated.

What about $Z = X^{2n}$ for general $n$? In this case,
$$\mathbb{E}[XZ] = \mathbb{E}[X^{2n+1}] = \int_{-1}^1 t^{2n+1}\cdot\frac12\,dt = \frac{t^{2n+2}}{2(2n+2)}\Big|_{-1}^1 = 0.$$
So $X$ and $X^{2n}$ are uncorrelated for any $n$. As for odd powers, if $W = X^{2n+1}$, then
$$\mathbb{E}[XW] = \int_{-1}^1 t^{2n+2}\cdot\frac12\,dt = \frac{2}{2(2n+3)} = \frac{1}{2n+3},$$
so $X,W$ are correlated: since $\mathbb{E}[W] = 0$, we get $\mathrm{Cov}(X,W) = \frac{1}{2n+3}$. △

Example 19.12. Let $a\in(-1,1)$ and
$$A = \begin{pmatrix} 1 & a \\ a & 1 \end{pmatrix}.$$
So $|A| = 1-a^2$ and
$$A^{-1} = \frac{1}{1-a^2}\begin{pmatrix} 1 & -a \\ -a & 1 \end{pmatrix}.$$
Let $(X,Y)$ be a two-dimensional absolutely continuous random variable with density
$$f_{X,Y}(t,s) = \frac{1}{2\pi\sqrt{|A|}}\exp\Big(-\frac12\big\langle A^{-1}(t,s),(t,s)\big\rangle\Big) = \frac{1}{2\pi\sqrt{1-a^2}}\exp\Big(-\frac{1}{2(1-a^2)}\big(t^2+s^2-2ats\big)\Big).$$
First, let us show that $f_{X,Y}$ is indeed a density. We will use the fact that $t^2+s^2-2ats = (t-as)^2 + s^2 - a^2s^2$:
$$\int_{-\infty}^\infty f_{X,Y}(t,s)\,dt = \frac{1}{2\pi\sqrt{1-a^2}}\exp\Big(-\frac{s^2(1-a^2)}{2(1-a^2)}\Big)\int_{-\infty}^\infty\exp\Big(-\frac{(t-as)^2}{2(1-a^2)}\Big)\,dt = \frac{1}{2\pi\sqrt{1-a^2}}\,e^{-s^2/2}\sqrt{2\pi(1-a^2)} = \frac{1}{\sqrt{2\pi}}e^{-s^2/2}.$$
We have used that $\frac{1}{\sqrt{2\pi(1-a^2)}}\exp\big(-\frac{(t-as)^2}{2(1-a^2)}\big)$ is the density of an $N(as,\sqrt{1-a^2})$ random variable. So $Y\sim N(0,1)$; symmetrically, $X\sim N(0,1)$. Thus $\iint f_{X,Y}\,dt\,ds = 1$ and $f_{X,Y}$ is indeed a density. Moreover, we have calculated the marginal densities of $X$ and $Y$.

Let us calculate
$$\mathrm{Cov}(X,Y) = \mathbb{E}[XY] = \iint f_{X,Y}(t,s)\,ts\,dt\,ds = \int_{-\infty}^\infty s\cdot\frac{1}{\sqrt{2\pi}}e^{-s^2/2}\int_{-\infty}^\infty t\cdot\frac{1}{\sqrt{2\pi(1-a^2)}}\exp\Big(-\frac{(t-as)^2}{2(1-a^2)}\Big)\,dt\,ds = \int_{-\infty}^\infty as\cdot s f_Y(s)\,ds = \mathbb{E}[aY^2] = a,$$
where the inner integral is the mean $as$ of the $N(as,\sqrt{1-a^2})$ density. So now we also have that
$$\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X,Y) = 1+1+2a = 2(1+a).$$
Note that $X,Y$ are independent if and only if $\mathrm{Cov}(X,Y) = 0$. △

Example 19.13. A particle moves in the NE (north-east) quadrant of the plane, $[0,\infty)^2$. The direction $\theta\sim U[0,\pi/2]$ and the velocity $V\sim U[0,1]$, independently. What is the covariance of $X(t)$ and $V$, where the particle's position at time $t$ is given by $(X(t),Y(t))$?

Note that for any $t$, $X(t) = Vt\cos(\theta)$. So we want to calculate
$$\mathbb{E}[X(t)V] - \mathbb{E}[X(t)]\,\mathbb{E}[V] = t\,\mathbb{E}[V^2]\,\mathbb{E}[\cos(\theta)] - t\,\mathbb{E}[V]^2\,\mathbb{E}[\cos(\theta)] = t\,\mathrm{Var}[V]\,\mathbb{E}[\cos(\theta)].$$
We know that $\mathrm{Var}[V] = 1/12$. Also,
$$\mathbb{E}[\cos\theta] = \frac2\pi\int_0^{\pi/2}\cos t\,dt = \frac2\pi.$$
So $\mathrm{Cov}(X(t),V) = \frac{t}{6\pi}$. △

Example 19.14. Suppose $Y\sim\mathrm{Exp}(\lambda)$ and for any $s>0$, $X\mid Y=s\sim U[0,s]$. What is $\mathrm{Cov}(X,Y)$?

We can calculate
$$\mathbb{E}[XY] = \iint ts\,f_{X,Y}(t,s)\,dt\,ds = \int_0^\infty s f_Y(s)\int t f_{X|Y}(t\mid s)\,dt\,ds = \int_0^\infty s f_Y(s)\cdot\frac s2\,ds = \frac12\,\mathbb{E}[Y^2] = \lambda^{-2}.$$
Also, $\mathbb{E}[Y] = \lambda^{-1}$. If we consider the function $(t,s)\mapsto t$ we get
$$\mathbb{E}[X] = \iint t f_{X,Y}(t,s)\,dt\,ds = \int_0^\infty f_Y(s)\int_0^s t f_{X|Y}(t\mid s)\,dt\,ds = \int_0^\infty\mathbb{E}[X\mid Y=s]\,f_Y(s)\,ds = \int_0^\infty\frac s2 f_Y(s)\,ds = \frac12\,\mathbb{E}[Y] = \frac{1}{2\lambda}.$$
So altogether,
$$\mathrm{Cov}(X,Y) = \frac{1}{\lambda^2} - \frac{1}{2\lambda}\cdot\frac1\lambda = \frac{1}{2\lambda^2}.$$

△

19.5. Markov and Chebyshev inequalities

Theorem 19.15 (Markov's inequality). Let $X\ge 0$ be a non-negative random variable. Then, for any $a>0$,
$$\mathbb{P}[X\ge a] \le \frac{\mathbb{E}[X]}{a}.$$

Andrey Markov (1856–1922)

Proof. Consider the random variable $Y = a\mathbf{1}_{\{X\ge a\}}$. Note that $Y\le X$ (here we use the non-negativity of $X$). So
$$\mathbb{E}[X] \ge \mathbb{E}[Y] = a\,\mathbb{P}[X\ge a]. \qquad □$$

Theorem 19.16 (Chebyshev's inequality). Let $X$ be a random variable with finite second moment. Then, for any $a>0$,
$$\mathbb{P}\big[|X-\mathbb{E}[X]|\ge a\big] \le \frac{\mathrm{Var}[X]}{a^2}.$$

Pafnuty Chebyshev (1821–1894)

Proof. Apply Markov's inequality to the non-negative random variable $Y = (X-\mathbb{E}[X])^2$, and note that $\{|X-\mathbb{E}[X]|\ge a\} = \{Y\ge a^2\}$. □

If $X$ is a random variable with $\mathbb{E}[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$, then Chebyshev's inequality tells us that the probability that $X$ deviates from its mean by $k\sigma$ (i.e. $k$ standard deviations) is at most $1/k^2$.

Example 19.17. The average wage in Eurasia is 1000 a month.

• A person is chosen at random. What is the probability that this person earns at least 10,000 a month? By Markov's inequality, if $X$ is the random person's salary, then
$$\mathbb{P}[X\ge 10{,}000] \le \frac{\mathbb{E}[X]}{10{,}000} = \frac{1}{10}.$$

• The average wage for an Inner Party member is 10,000 a month, and the standard deviation of the wage is 100. What is the probability that a randomly chosen Inner Party member earns at least 11,000? Here we have that $\mathbb{E}[X] = 10{,}000$ and $\mathrm{Var}[X] = 10{,}000$. So
$$\mathbb{P}[X\ge 11{,}000] \le \mathbb{P}\big[|X-\mathbb{E}[X]|\ge 1{,}000\big] \le \frac{\mathrm{Var}[X]}{1{,}000^2} = \frac{1}{100}.$$

△
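Chebyshev's inequality can be seen in action with a quick simulation. The sketch below uses an $\mathrm{Exp}(1)$ random variable (mean $1$, variance $1$) as an arbitrary test case and checks that the empirical deviation probabilities sit below the $1/k^2$ bounds.

```python
import random

random.seed(2)

trials = 100000
samples = [random.expovariate(1.0) for _ in range(trials)]
mean, var = 1.0, 1.0  # exact moments of Exp(1)

results = {}
for k in (2.0, 3.0, 4.0):
    # empirical P[|X - E[X]| >= k] versus the Chebyshev bound Var[X]/k^2
    emp = sum(1 for s in samples if abs(s - mean) >= k) / trials
    results[k] = (emp, var / k ** 2)
    print(k, results[k])
```

For the exponential distribution the true tail decays exponentially, so the Chebyshev bound holds with plenty of room to spare — the same phenomenon Example 19.18 below exhibits for the normal distribution.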

Example 19.18. Let $X\sim N(0,\sigma)$. Chebyshev's inequality tells us that $\mathbb{P}[|X|>k\sigma]\le k^{-2}$.

However, because $\sigma^{-1}X\sim N(0,1)$, we can calculate, for $a = \frac t\sigma$,
$$\mathbb{P}[|X|>t] = \mathbb{P}[\sigma^{-1}X>a] + \mathbb{P}[\sigma^{-1}X<-a] = \int_{-\infty}^{-a}f_{\sigma^{-1}X}(s)\,ds + \int_a^\infty f_{\sigma^{-1}X}(s)\,ds$$
$$= 2\int_a^\infty\frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds \le 2\int_a^\infty\frac sa\cdot\frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds = \frac{2}{a\sqrt{2\pi}}\big(-e^{-s^2/2}\big)\Big|_a^\infty = \frac{2}{a\sqrt{2\pi}}\,e^{-a^2/2}.$$
If we plug in $t = k\sigma$ we get
$$\mathbb{P}[|X|>k\sigma] \le \frac{2}{\sqrt{2\pi}\,k}\,e^{-k^2/2} \ll k^{-2},$$
which is a much better bound. △

19.6. The Weierstrass Approximation Theorem

Here is a probabilistic proof, due to Bernstein, of the famous Weierstrass Approximation Theorem.

Karl Weierstrass (1815–1897)

Theorem 19.19 (Weierstrass Approximation Theorem). Let $f:[0,1]\to\mathbb{R}$ be continuous. For every $\varepsilon>0$ there exists a polynomial $p(x)$ such that $\sup_{x\in[0,1]}|p(x)-f(x)|<\varepsilon$. (That is, the polynomials are dense in $L^\infty([0,1])$.)

Proof. By classical analysis, $f$ is uniformly continuous and bounded on $[0,1]$; that is, there exists $M>0$ such that $\sup_{x\in[0,1]}|f(x)|\le M$, and for every $\varepsilon>0$ there exists $\delta = \delta(\varepsilon)>0$ such that if $|x-y|\le\delta$ then $|f(x)-f(y)|<\frac\varepsilon2$.

Fix $\varepsilon>0$. Let $\delta = \delta(\varepsilon)$. Let $x\in(0,1)$. Consider $B\sim\mathrm{Bin}(n,x)$. Chebyshev's inequality guarantees that
$$\mathbb{P}\big[|B-nx|\ge n^{2/3}\big] \le \mathrm{Var}[B]\cdot n^{-4/3} = x(1-x)\,n^{-1/3} \le \frac14 n^{-1/3}.$$
Thus, we may choose $n = n(\varepsilon,M)$ large enough so that this probability is at most $\frac{\varepsilon}{4M}$ for any $x$. Assume also that $n^{-1/3}<\delta$.

Define the polynomial
$$p(x) := \sum_{j=0}^n\binom nj x^j(1-x)^{n-j}f\big(\tfrac jn\big).$$
That is, $p(x) = \mathbb{E}[f(B/n)]$ where $B\sim\mathrm{Bin}(n,x)$.

Let $A = \{|\frac Bn - x|>\delta\}$. Since $n\delta>n\cdot n^{-1/3} = n^{2/3}$,
$$\mathbb{P}[A] \le \mathbb{P}\big[|B-nx|>n^{2/3}\big] \le \frac{\varepsilon}{4M}.$$
Thus,
$$|p(x)-f(x)| = \big|\mathbb{E}[f(B/n)-f(x)]\big| \le \mathbb{E}\big[|f(B/n)-f(x)|\mathbf{1}_A\big] + \mathbb{E}\big[|f(B/n)-f(x)|\mathbf{1}_{A^c}\big] \le 2M\,\mathbb{P}[A] + \frac\varepsilon2\,\mathbb{P}[A^c] \le \frac\varepsilon2 + \frac\varepsilon2 = \varepsilon.$$
This bound is independent of $x$. □

19.7. The Paley-Zygmund Inequality

Raymond Paley (1907–1933), Antoni Zygmund (1900–1992)

Theorem 19.20 (Paley-Zygmund Inequality). Let $X\ge 0$ be a non-negative random variable with finite variance. Then, for any $\alpha\in[0,1]$,
$$\mathbb{P}\big[X>\alpha\,\mathbb{E}[X]\big] \ge (1-\alpha)^2\cdot\frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}.$$
Specifically,
$$\mathbb{P}[X>0] \ge \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}.$$

Proof. Note that $X = X\mathbf{1}_{\{X\le\alpha\mathbb{E}[X]\}} + X\mathbf{1}_{\{X>\alpha\mathbb{E}[X]\}}$.

Since $X\ge 0$ we have that $\mathbb{E}[X\mathbf{1}_{\{X\le\alpha\mathbb{E}[X]\}}]\le\alpha\,\mathbb{E}[X]$, and by Cauchy-Schwarz,
$$\big(\mathbb{E}[X\mathbf{1}_{\{X>\alpha\mathbb{E}[X]\}}]\big)^2 \le \mathbb{E}[X^2]\cdot\mathbb{P}\big[X>\alpha\,\mathbb{E}[X]\big].$$
Combining these we have
$$\mathbb{E}[X] \le \alpha\,\mathbb{E}[X] + \sqrt{\mathbb{E}[X^2]\cdot\mathbb{P}[X>\alpha\,\mathbb{E}[X]]},$$
which implies that
$$(1-\alpha)^2(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]\cdot\mathbb{P}\big[X>\alpha\,\mathbb{E}[X]\big]. \qquad □$$

Example 19.21 (Erdős–Rényi random graph). Suppose that $n$ students are on Facebook. Every two different students, say $x,y$, become friends with probability $p$, all pairs of students being independent. We say that students $x$ and $y$ are connected if there is some sequence of students $x = x_1,\dots,x_k = y$ such that $x_j$ and $x_{j+1}$ are friends for every $j$. What is the probability that some student has no friends?

Paul Erdős (1913–1996), Alfréd Rényi (1921–1970)

Suppose $\{1,2,\dots,n\}$ is the set of students. Let $I_{x,y}$ be the indicator of the event that $x$ and $y$ are friends. So $(I_{x,y})_{x\ne y}$ are $\binom n2$ independent Bernoulli-$p$ random variables.

Let $X$ be the number of students with no friends; the isolated students. Let us begin by calculating the expected number of isolated students: a student $x$ is isolated if $I_{x,y} = 0$ for all $y\ne x$. So this happens with probability $(1-p)^{n-1}$. Thus, by linearity,
$$\mathbb{E}[X] = \sum_x\mathbb{P}[x\text{ is isolated}] = n(1-p)^{n-1} =: \mu.$$
Let us also calculate the second moment of $X$: For $x\ne y$, if both are isolated then $I_{x,z} = 0$ and $I_{y,z} = 0$ for all $z\notin\{x,y\}$, and also $I_{x,y} = 0$. So,
$$\mathbb{P}[x\text{ is isolated},\,y\text{ is isolated}] = (1-p)^{2(n-2)+1} = (1-p)^{2(n-1)}(1-p)^{-1}.$$
Thus,
$$\mathbb{E}[X^2] = \mathbb{E}\Big[\sum_{x,y}\mathbf{1}_{\{x\text{ isolated},\,y\text{ isolated}\}}\Big] = \sum_x\mathbb{P}[x\text{ isolated}] + \sum_{x\ne y}\mathbb{P}[x\text{ isolated},\,y\text{ isolated}] = \mu + n(n-1)(1-p)^{2(n-1)}(1-p)^{-1} = \mu + \mu^2(1-p)^{-1}\big(1-\tfrac1n\big).$$
Specifically,
$$\frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]} = \Big(\mu^{-1} + (1-p)^{-1}\big(1-\tfrac1n\big)\Big)^{-1}.$$
We will make use of the inequalities $e^{-\frac{p}{1-p}}\le 1-p\le e^{-p}$, valid for all $p\le\frac12$.

Now, if $p\ge\frac{(1+\varepsilon)\log n}{n-1}$, then
$$\mathbb{E}[X] \le ne^{-p(n-1)} = n^{-\varepsilon}\to 0,$$
so by Markov's inequality,
$$\mathbb{P}[X\ge 1] \le \mathbb{E}[X]\to 0.$$
That is, with high probability there is no isolated student (in fact, with high probability all students are connected, although that requires more work).

On the other hand, if $p\le\frac{(1-\varepsilon)\log n}{n-1}$, then
$$\mu \ge n\exp\Big(-\frac{1}{1-p}(1-\varepsilon)\log n\Big)\to\infty$$
(because $\frac{1-\varepsilon}{1-p}<1$ for large enough $n$). Thus, by the Paley-Zygmund inequality,
$$\mathbb{P}[X>0] \ge \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]} = \Big(\mu^{-1} + (1-p)^{-1}\big(1-\tfrac1n\big)\Big)^{-1}\to 1.$$
That is, with high probability there exists a student that has no friends. △

Introduction to Probability

201.1.8001

Ariel Yadin

Lecture 20

20.1. Convergence of Random Variables

As in the case of sequences of numbers, we would like to talk about convergence of random variables. There are many ways to say that a sequence of random variables approximates some limiting random variable. We will talk about 4 such possibilities.

20.2. a.s. and Lp Convergence

We have already seen one type of convergence; namely, $X_n\to X$ if $X_n(\omega)\to X(\omega)$ for all $\omega\in\Omega$. This is a strong type of convergence; since we are usually not interested in changes that have probability $0$, we define the following type of convergence.

Let $(X_n)_n$ be a sequence of random variables on some probability space $(\Omega,\mathcal{F},\mathbb{P})$. We have already seen that $\liminf X_n$ and $\limsup X_n$ are both random variables. Thus, the set $\{\omega : \liminf X_n(\omega) = \limsup X_n(\omega)\}$ is an event in $\mathcal{F}$. Hence we can define:

Definition 20.1. Let $(X_n)_n$ be a sequence of random variables on $(\Omega,\mathcal{F},\mathbb{P})$. Let $X$ be a random variable. We say that $(X_n)$ converges to $X$ almost surely, or a.s., denoted $X_n\xrightarrow{\text{a.s.}}X$, if $\mathbb{P}[\lim X_n = X] = 1$; that is, if
$$\mathbb{P}\Big[\big\{\omega : \lim_{n\to\infty}X_n(\omega) = X(\omega)\big\}\Big] = 1.$$

Another type of convergence is a type of average convergence:

Definition 20.2. Let $p>0$. Let $(X_n)_n$ be a sequence of random variables on $(\Omega,\mathcal{F},\mathbb{P})$ such that $\mathbb{E}[|X_n|^p]<\infty$ for all $n$. Let $X$ be a random variable such that $\mathbb{E}[|X|^p]<\infty$. We say that $(X_n)$ converges to $X$ in $L^p$, denoted $X_n\xrightarrow{L^p}X$, if
$$\lim_{n\to\infty}\mathbb{E}\big[|X_n-X|^p\big] = 0.$$

In the exercises we will show that for $0<p<q$, $L^q$ convergence implies $L^p$ convergence. We will also show that the limit is unique; that is:

Exercise 20.3. Suppose that $X_n\xrightarrow{\text{a.s.}}X$ and $X_n\xrightarrow{L^p}Y$. Show that $\mathbb{P}[X=Y] = 1$. Suppose that $X_n\xrightarrow{L^p}X$ and $X_n\xrightarrow{L^q}Y$. Show that $\mathbb{P}[X=Y] = 1$.

Example 20.4. Fix $p > 0$. Let $(\Omega, \mathcal{F}, P)$ be the uniform measure on $[0,1]$. For all $n$ let $X_n$ be the discrete random variable
$$X_n(\omega) = \begin{cases} n^{1/p} & \omega \in [0, 1/n] \\ 0 & \text{otherwise.} \end{cases}$$
So $X_n$ has density
$$f_{X_n}(s) = \begin{cases} \frac1n & s = n^{1/p} \\ 1 - \frac1n & s = 0 \\ 0 & \text{otherwise.} \end{cases}$$
Note that
$$E[|X_n|^q] = \frac1n \cdot n^{q/p} = n^{q/p - 1}.$$
Thus, if $0 < q < p$, then $E[|X_n - 0|^q] \to 0$, so $X_n \xrightarrow{L^q} 0$. However, for $q \geq p$ we have that $E[|X_n - 0|^q] \geq 1$ for all $n$, so $X_n \not\xrightarrow{L^q} 0$ (and thus $X_n \not\xrightarrow{L^q} X$ for any $X$, since the limit must be unique).

We claim that $X_n \xrightarrow{a.s.} 0$. Indeed, let $\omega \in [0,1]$. If $\omega > 0$ then for all $n > \frac1\omega$ we get that $X_n(\omega) = 0$, and thus $X_n(\omega) \to 0$. Thus,
$$P[\{\omega : X_n(\omega) \not\to 0\}] \leq P[\{0\}] = 0,$$
so $X_n \xrightarrow{a.s.} 0$.

This example has shown that:
• We can have $L^q$ convergence without $L^p$ convergence, for $p > q$.
• We can have a.s. convergence without $L^p$ convergence. △
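The moment computation in Example 20.4 is easy to check numerically. Below is a small Monte Carlo sketch (with the hypothetical parameter choices $p = 2$, $n = 100$ and an arbitrary seed); the exact values are $E[|X_n|^q] = n^{q/p-1}$.

```python
import random

# Monte Carlo sketch of Example 20.4 with p = 2: X_n = n^{1/2} on [0, 1/n],
# and X_n = 0 otherwise.  The exact moment is E[|X_n|^q] = n^{q/p - 1}.
def moment_estimate(n, q, p=2.0, trials=200_000, rng=random.Random(0)):
    total = 0.0
    for _ in range(trials):
        omega = rng.random()                     # uniform point of [0, 1]
        x = n ** (1.0 / p) if omega <= 1.0 / n else 0.0
        total += abs(x) ** q
    return total / trials

# q = 1 < p = 2: E[|X_n|] = n^{-1/2} -> 0, giving L^1 convergence to 0.
# q = 2 = p:     E[|X_n|^2] = 1 for every n, so no L^2 convergence.
print(moment_estimate(100, 1))
print(moment_estimate(100, 2))
```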

Example 20.5. For all $n$ let $X_n$ be mutually independent $\mathrm{Ber}(1/n)$ random variables. Thus, for all $n$ and all $p > 0$,
$$E[|X_n|^p] = \frac1n \to 0, \qquad \text{so } X_n \xrightarrow{L^p} 0.$$
On the other hand, let $A_n = \{|X_n| > 1/2\}$. So $(A_n)_n$ is a sequence of mutually independent events, and $P[A_n] = 1/n$. Thus, since $\sum_n P[A_n] = \infty$, using the second Borel-Cantelli Lemma,
$$P[\limsup_n A_n] = 1.$$
That is, with probability 1, there exist infinitely many $n$ such that $X_n > 1/2$. Thus,
$$P[\limsup_n X_n \geq 1/2] = 1.$$
So we cannot have that $X_n \xrightarrow{a.s.} 0$. Since the limit must be unique, we cannot have $X_n \xrightarrow{a.s.} X$ for any $X$.

We conclude that it is possible for $(X_n)$ to converge in $L^p$ for all $p$, but not to converge a.s. △
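A quick simulation illustrates Example 20.5: the individual probabilities $1/n$ vanish, yet successes keep occurring, because the harmonic series diverges. (The seed and sample sizes below are arbitrary illustrative choices.)

```python
import random

# Sketch of Example 20.5: X_n ~ Ber(1/n), independent.  The expected number
# of successes among n <= N is sum_{n<=N} 1/n ~ log N, so by Borel-Cantelli
# the event {X_n = 1} keeps happening even though P[X_n = 1] -> 0.
def successes_up_to(N, rng):
    return sum(1 for n in range(1, N + 1) if rng.random() < 1.0 / n)

rng = random.Random(1)
print([successes_up_to(10 ** k, rng) for k in (2, 3, 4)])  # slowly growing
```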

20.3. Convergence in Probability

Definition 20.6 (convergence in probability). Let $(X_n)_n$ be a sequence of random variables on $(\Omega, \mathcal{F}, P)$. We say that $(X_n)$ converges in probability to $X$, denoted $X_n \xrightarrow{P} X$, if for all $\varepsilon > 0$,
$$\lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0.$$

Example 20.7. Suppose that $X_n \xrightarrow{a.s.} X$. That is,
$$P[|X_n - X| \to 0] = 1.$$
Fix $\varepsilon > 0$. For any $n$ let $A_n = \{|X_n - X| > \varepsilon\}$. Note that if $A_n$ occurs for infinitely many $n$, then $\limsup_n |X_n - X| \geq \varepsilon > 0$. Thus, by Fatou's Lemma,
$$0 \leq \limsup_n P[A_n] \leq P[\limsup_n A_n] = P[A_n \text{ i.o.}] \leq P[\limsup_n |X_n - X| > 0] = 0.$$
Thus,
$$\lim_n P[|X_n - X| > \varepsilon] = \limsup_n P[A_n] = 0,$$
and so $X_n \xrightarrow{P} X$. That is: a.s. convergence implies convergence in probability. △

Example 20.8. Assume that $X_n \xrightarrow{L^p} X$. Then, for any $\varepsilon > 0$, by Markov's inequality,
$$P[|X_n - X| > \varepsilon] \leq \frac{E[|X_n - X|^p]}{\varepsilon^p} \to 0.$$
Thus, $X_n \xrightarrow{P} X$. That is, convergence in $L^p$ implies convergence in probability. △

The following is in the exercises:

Exercise 20.9. Let $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{P} Y$. Show that $P[X = Y] = 1$.


Lecture 21

21.1. The Law of Large Numbers

In this section we will prove

Theorem 21.1 (Strong Law of Large Numbers). Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of mutually independent random variables, such that $E[X_n] = 0$ and $\sup_n E[X_n^2] < \infty$. For each $N$ let
$$S_N = \sum_{n=1}^N X_n.$$
Then,
$$\frac{S_N}{N} \xrightarrow{a.s.} 0.$$

21.1.1. Kolmogorov’s inequality.

Lemma 21.2 (Kolmogorov's inequality). Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of mutually independent random variables, such that $E[X_n] = 0$ and $E[X_n^2] < \infty$. For each $N$ let
$$S_N = \sum_{n=1}^N X_n \qquad \text{and} \qquad M_N = \max_{1\leq n\leq N} |S_n|.$$
Then, for any $\lambda > 0$,
$$P[M_N \geq \lambda] \leq \frac{\sum_{n=1}^N E[X_n^2]}{\lambda^2}.$$

Proof. Fix $\lambda > 0$ and let
$$A_n = \{|S_1| < \lambda, |S_2| < \lambda, \ldots, |S_{n-1}| < \lambda, |S_n| \geq \lambda\}.$$
We will make a few observations.

First note that since $(X_n)$ are mutually independent, they are uncorrelated, so
$$E[S_N] = 0 \qquad \text{and} \qquad \mathrm{Var}[S_N] = E[S_N^2] = \sum_{n=1}^N E[X_n^2].$$
Next, note that $M_N \geq \lambda$ if and only if there exists $1 \leq n \leq N$ such that $|S_n| \geq \lambda$ and $|S_k| < \lambda$ for all $1 \leq k < n$. That is, $M_N \geq \lambda$ implies that there exists $1 \leq n \leq N$ such that $S_n^2 \cdot \mathbb{1}_{A_n} \geq \lambda^2$.

Since $(X_1, \ldots, X_n)$ are independent of $(X_{n+1}, \ldots, X_N)$, the random variable $S_n \cdot \mathbb{1}_{A_n}$ is independent of $X_{n+1} + X_{n+2} + \cdots + X_N = S_N - S_n$. Thus, for any $1 \leq n \leq N$ we have that
$$E[S_N^2 \mathbb{1}_{A_n}] = E[(S_N - S_n + S_n)^2 \mathbb{1}_{A_n}] = E[(S_N - S_n)^2 \mathbb{1}_{A_n}] + E[S_n^2 \mathbb{1}_{A_n}] + 2E[(S_N - S_n)\cdot S_n \mathbb{1}_{A_n}] \geq E[S_n^2 \mathbb{1}_{A_n}],$$
where we have used that
$$E[(S_N - S_n)\cdot S_n \mathbb{1}_{A_n}] = E[S_N - S_n]\cdot E[S_n \mathbb{1}_{A_n}] = 0$$
by independence. We now have that
$$E[S_N^2] \geq E[S_N^2 \mathbb{1}_{\{M_N \geq \lambda\}}] = \sum_{n=1}^N E[\mathbb{1}_{A_n} S_N^2] \geq \sum_{n=1}^N E[S_n^2 \mathbb{1}_{A_n}].$$
Note that by Boole's inequality, since $M_N \geq \lambda$ implies that there exists $1 \leq n \leq N$ such that $S_n^2 \cdot \mathbb{1}_{A_n} \geq \lambda^2$,
$$P[M_N \geq \lambda] \leq \sum_{n=1}^N P[S_n^2 \mathbb{1}_{A_n} \geq \lambda^2] \leq \frac{1}{\lambda^2}\sum_{n=1}^N E[S_n^2 \mathbb{1}_{A_n}] \leq \frac{1}{\lambda^2} E[S_N^2]. \qquad \square$$

21.1.2. Kronecker's Criterion.

Lemma 21.3 (Kronecker's Criterion). Suppose that $x_1, \ldots, x_n, \ldots$ is a sequence of numbers such that $\sum_{n=1}^\infty \frac{x_n}{n}$ converges. Then,
$$\lim_{N\to\infty} \frac1N \sum_{n=1}^N x_n = 0.$$

Proof. Let $s_N = \sum_{n=1}^N \frac{x_n}{n}$. So $s_N \to s := \sum_{n=1}^\infty \frac{x_n}{n}$. Thus, also $\frac1N\sum_{n=1}^N s_n \to s$. Note that for all $n$, $x_n = n(s_n - s_{n-1})$ (where we set $s_0 = 0$). Thus,
$$\frac1N\sum_{n=1}^N x_n = \frac1N\Big(Ns_N + ((N-1)-N)s_{N-1} + ((N-2)-(N-1))s_{N-2} + \cdots + (1-2)s_1\Big) = s_N - \frac{N-1}{N}\cdot\frac{1}{N-1}\sum_{n=1}^{N-1} s_n \to s - s = 0. \qquad \square$$
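Kronecker's Criterion is easy to test numerically on a concrete sequence. Here is a sketch with the (arbitrary) choice $x_n = (-1)^n\,n/(n+1)$, for which $\sum x_n/n = \sum (-1)^n/(n+1)$ converges by the alternating series test:

```python
# Sketch of Lemma 21.3 with x_n = (-1)^n * n / (n+1): the series
# sum x_n / n converges, so the Cesaro averages (1/N) sum_{n<=N} x_n -> 0.
def cesaro(N):
    return sum((-1) ** n * n / (n + 1) for n in range(1, N + 1)) / N

for N in (10, 1_000, 100_000):
    print(N, cesaro(N))   # tends to 0
```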

21.1.3. Proof of Law of Large Numbers.

Lemma 21.4. Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of mutually independent random variables such that $E[X_n] = 0$ for all $n$. If $\sum_{k=1}^\infty \mathrm{Var}[X_k] < \infty$ then
$$P\Big[\sum_{k=1}^\infty X_k \text{ converges}\Big] = 1.$$

Proof. Let $S_n = \sum_{k=1}^n X_k$. For every $\omega \in \Omega$, $\sum_{k=1}^\infty X_k(\omega)$ converges if and only if $(S_n(\omega))_n$ converges, which is if and only if $(S_n(\omega))_n$ forms a Cauchy sequence. Thus, it suffices to show that
$$P[(S_n)_n \text{ is a Cauchy sequence}] = 1.$$

Let $n > 0$, and consider
$$S_{n+m} - S_n = X_{n+1} + X_{n+2} + \cdots + X_{n+m}.$$

Kolmogorov's inequality gives that
$$P\big[\max_{1\leq m\leq N} |S_{n+m} - S_n| \geq \lambda\big] \leq \frac{1}{\lambda^2}\cdot\sum_{k=1}^N E[X_{n+k}^2] \leq \frac{1}{\lambda^2}\cdot\sum_{k\geq1} E[X_{n+k}^2].$$
Taking $N \to \infty$ on the left-hand side, using continuity of probability for the increasing sequence of events,
$$P\big[\sup_{m\geq1} |S_{n+m} - S_n| \geq \lambda\big] \leq \frac{1}{\lambda^2}\cdot\sum_{k\geq1} E[X_{n+k}^2].$$

This last term is the tail of the convergent series $\sum_k E[X_k^2]$. Thus, for any $r > 0$ there exists $n(r)$ so that $\sum_{k\geq n(r)} E[X_k^2] < r^{-3}$, which implies that
$$P\big[\inf_n \sup_{m\geq1} |S_{n+m} - S_n| > r^{-1}\big] \leq P\big[\sup_{m\geq1} |S_{n(r)+m} - S_{n(r)}| > r^{-1}\big] < r^{-1}.$$

Now, the events $\{\inf_n \sup_{m\geq1} |S_{n+m} - S_n| > r^{-1}\}$ are increasing in $r$, so taking $r \to \infty$ with continuity of probability,
$$P\big[\inf_n \sup_{m\geq1} |S_{n+m} - S_n| > 0\big] \leq 0.$$
That is,
$$P\big[\inf_n \sup_{m\geq1} |S_{n+m} - S_n| = 0\big] = 1.$$

Now let $\omega \in \Omega$ be such that $\inf_n \sup_{m\geq1} |S_{n+m}(\omega) - S_n(\omega)| = 0$. This implies that for all $\varepsilon > 0$ there exists $N = N(\varepsilon, \omega)$ such that
$$\sup_{m\geq1} |S_{N+m}(\omega) - S_N(\omega)| < \varepsilon/2.$$
Thus, for any $n > N(\varepsilon, \omega)$ and any $m \geq 1$,
$$|S_{n+m}(\omega) - S_n(\omega)| \leq |S_{n+m}(\omega) - S_N(\omega)| + |S_n(\omega) - S_N(\omega)| < \varepsilon.$$
That is, for any $\varepsilon > 0$ there exists $N = N(\varepsilon, \omega)$ such that for all $n > N(\varepsilon, \omega)$ and all $m \geq 1$, $|S_{n+m}(\omega) - S_n(\omega)| < \varepsilon$; that is, for this $\omega$, $(S_n(\omega))_n$ is a Cauchy sequence. We have shown that
$$\big\{\omega : \inf_n \sup_{m\geq1} |S_{n+m}(\omega) - S_n(\omega)| = 0\big\} \subset \big\{\omega : (S_n(\omega))_n \text{ is a Cauchy sequence}\big\}.$$
Since the first event has probability 1, so does the second event, and we are done. $\square$

The proof of the law of large numbers is now immediate:

Proof of Theorem 21.1. By Kronecker's Criterion it suffices to show that
$$P\Big[\sum_{n=1}^\infty \frac{X_n}{n} \text{ converges}\Big] = 1.$$
By the above Lemma it suffices to show that
$$\sum_{n=1}^\infty \mathrm{Var}[X_n/n] < \infty.$$
Since $\mathrm{Var}[X_n/n] = \frac{\mathrm{Var}[X_n]}{n^2} \leq \frac{\sup_k E[X_k^2]}{n^2}$, this is immediate. $\square$

Example 21.5. The Central Bureau of Statistics wants to assess how many citizens use the train system. They poll $N$ randomly and independently chosen citizens, and set $X_j = 1$ if the $j$-th citizen uses the train, and $X_j = 0$ otherwise. Any citizen is equally likely to be chosen, and all choices are independent. As $N \to \infty$ the sample mean
$$\frac1N\sum_{j=1}^N X_j$$
converges to $E[X_j] = E[X_1]$, which is the probability that a randomly chosen citizen uses the train system. Since we choose each citizen uniformly, this probability is exactly the number of citizens that use the train divided by the total number of citizens. Thus, for large $N$,
$$C\cdot\frac1N\sum_{j=1}^N X_j$$
is a good approximation for the number of citizens that use the train system, where $C$ is the total number of citizens. △
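Example 21.5 is essentially the law of large numbers for Bernoulli indicators, and it can be sketched in a few lines (the proportion 0.37 and the seed below are hypothetical choices):

```python
import random

# LLN sketch for Example 21.5: the sample mean of i.i.d. Ber(p) indicators
# approaches the unknown proportion p as the poll size N grows.
def sample_mean(p, N, rng):
    return sum(rng.random() < p for _ in range(N)) / N

rng = random.Random(3)
p = 0.37                                  # hypothetical true proportion
for N in (100, 10_000, 1_000_000):
    print(N, sample_mean(p, N, rng))      # approaches 0.37
```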


Lecture 22

22.1. Convergence in Distribution

Previously we talked about types of convergence that required the sequence and the limit to be defined on the same probability space. We now look at a type of convergence which does not have this requirement.

Definition 22.1. Let (Xn)n be a sequence of random variables. We say that (Xn) converges in distribution to X, denoted

$$X_n \xrightarrow{D} X,$$
if for all $t \in \mathbb{R}$ such that $F_X$ is continuous at $t$,
$$\lim_{n\to\infty} F_{X_n}(t) = F_X(t).$$

That is, the distribution functions of $X_n$ converge pointwise to the distribution function of $X$.

Remark 22.2. Note that the convergence is required only at continuity points of $F_X$.

Example 22.3. Let $X_n \sim U[0, 1/n]$. Then $X_n \xrightarrow{D} 0$. Indeed, the distribution function of $0$ is $F_0(t) = 1$ for $t \geq 0$ and $F_0(t) = 0$ for $t < 0$. Note that
$$F_{X_n}(t) = \begin{cases} 0 & t < 0 \\ nt & 0 \leq t \leq 1/n \\ 1 & t > 1/n. \end{cases}$$
Since $0$ is not a continuity point of $F_0$, we don't care about it. For $t < 0$ we have that $F_{X_n}(t) = 0 = F_0(t)$. For $t > 0$ we have, for large enough $n > 1/t$, that $F_{X_n}(t) = 1 = F_0(t)$. △

Note that for the other types of convergence we needed all random variables to live on the same space. For convergence in distribution, this is not required. However, this is the weakest kind of convergence, as can be seen by the following proposition.

Proposition 22.4. Convergence in probability implies convergence in distribution.

Proof. Let $X_n \xrightarrow{P} X$. Fix $t \in \mathbb{R}$. First, for all $\varepsilon > 0$,
$$P[X_n \leq t] = P[X_n \leq t, |X_n - X| \leq \varepsilon] + P[X_n \leq t, |X_n - X| > \varepsilon] \leq P[X \leq t+\varepsilon] + P[|X_n - X| > \varepsilon] \to P[X \leq t+\varepsilon].$$
So $\limsup_n F_{X_n}(t) \leq F_X(t+\varepsilon)$ for all $\varepsilon$. Since $F_X$ is right-continuous, we get that $\limsup_n F_{X_n}(t) \leq F_X(t)$.

On the other hand, for any $\varepsilon > 0$,
$$P[X \leq t-\varepsilon] = P[X \leq t-\varepsilon, |X_n - X| \leq \varepsilon] + P[X \leq t-\varepsilon, |X_n - X| > \varepsilon] \leq P[X_n \leq t] + P[|X_n - X| > \varepsilon].$$
Taking $\liminf$ we get that $F_X(t-\varepsilon) \leq \liminf_n F_{X_n}(t)$. Now, if $F_X$ is continuous at $t$, then taking $\varepsilon \to 0$ gives
$$F_X(t) \leq \liminf_n F_{X_n}(t) \leq \limsup_n F_{X_n}(t) \leq F_X(t).$$
So $X_n \xrightarrow{D} X$. $\square$

22.2. Approximating by Smooth Functions

Lemma 22.5. Let $(X_n)_n$ be a sequence of random variables. Let $C_b^3$ be the space of all three times continuously differentiable functions on $\mathbb{R}$ with bounded derivatives. If for every $\varphi \in C_b^3$,
$$E[\varphi(X_n)] \to E[\varphi(X)],$$
then $X_n \xrightarrow{D} X$.

Proof. We want to choose a $C^3$ function $\psi$ that will approximate the step function $\mathbb{1}_{(-\infty,0]}$. For this we choose a $C^3$ function that is 1 on $(-\infty,0]$, decreasing on $[0,1]$ and 0 on $[1,\infty)$. Call this function $\psi$. (An explicit construction can be found in Section 22.8 below.)

For every $t \in \mathbb{R}$ and $\varepsilon > 0$ define
$$\psi_{t,\varepsilon}(x) = \psi(\varepsilon^{-1}(x - t)).$$

Note that for every $t \in \mathbb{R}$ and $\varepsilon > 0$ we have
$$\psi_{t-\varepsilon,\varepsilon}(x) \leq \mathbb{1}_{(-\infty,t]}(x) \leq \psi_{t,\varepsilon}(x).$$

For convergence in distribution we need to show that for every $t \in \mathbb{R}$ such that $F_X$ is continuous at $t$,
$$\lim_{n\to\infty} E[\mathbb{1}_{(-\infty,t]}(X_n)] = E[\mathbb{1}_{(-\infty,t]}(X)].$$
Fix $t \in \mathbb{R}$. For any $\varepsilon > 0$ we have that
$$E[\mathbb{1}_{(-\infty,t]}(X_n)] - E[\mathbb{1}_{(-\infty,t]}(X)] \leq E[\psi_{t,\varepsilon}(X_n)] - E[\mathbb{1}_{(-\infty,t]}(X)] \to E[\psi_{t,\varepsilon}(X)] - E[\mathbb{1}_{(-\infty,t]}(X)] \leq E[\mathbb{1}_{(-\infty,t+\varepsilon]}(X)] - E[\mathbb{1}_{(-\infty,t]}(X)] = P[X \in (t, t+\varepsilon]] = F_X(t+\varepsilon) - F_X(t).$$
Similarly,
$$E[\mathbb{1}_{(-\infty,t]}(X)] - E[\mathbb{1}_{(-\infty,t]}(X_n)] \leq E[\mathbb{1}_{(-\infty,t]}(X)] - E[\psi_{t-\varepsilon,\varepsilon}(X_n)] \to E[\mathbb{1}_{(-\infty,t]}(X)] - E[\psi_{t-\varepsilon,\varepsilon}(X)] \leq E[\mathbb{1}_{(-\infty,t]}(X)] - E[\mathbb{1}_{(-\infty,t-\varepsilon]}(X)] = P[X \in (t-\varepsilon, t]] = F_X(t) - F_X(t-\varepsilon).$$
We conclude that
$$\limsup_{n\to\infty}\big|E[\mathbb{1}_{(-\infty,t]}(X_n)] - E[\mathbb{1}_{(-\infty,t]}(X)]\big| \leq |F_X(t+\varepsilon) - F_X(t)| + |F_X(t) - F_X(t-\varepsilon)|$$
for all $\varepsilon > 0$.

For any $t$ that is a continuity point of $F_X$ the above tends to $0$ as $\varepsilon \to 0$. $\square$

22.3. Central Limit Theorem

We will now build the tools for the following very important approximation theorem:

Theorem 22.6 (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n, \ldots$ be mutually independent identically distributed random variables, such that $E[X_n] = 0$ and $E[X_n^2] = 1$. Let
$$S_N = \sum_{n=1}^N X_n.$$
Then,
$$\frac{S_N}{\sqrt N} \xrightarrow{D} N(0,1).$$
That is, for all $t \in \mathbb{R}$,
$$P[S_N \leq t\sqrt N] \to \frac{1}{\sqrt{2\pi}}\int_{-\infty}^t e^{-s^2/2}\,ds.$$
In the exercises we will show the following:

Exercise 22.7. Let $X_1, X_2, \ldots, X_n, \ldots$ be mutually independent identically distributed random variables, such that $E[X_n] = \mu$ and $\mathrm{Var}[X_n] = \sigma^2$. Let $S_N = \sum_{n=1}^N X_n$. Then,
$$\frac{S_N - N\mu}{\sqrt N\,\sigma} \xrightarrow{D} N(0,1).$$

That is, no matter what distribution we start with, we can approximate the sum of many independent trials by a normal random variable.

22.4. Uses for the CLT

Example 22.8. Every student's grade in a course has expectation 80 and standard deviation 20. Give an approximation of the average grade for 100 students, if all students are independent. What is a good estimate for the probability of the average grade to be below 70?

If $X_k$ is the grade of the $k$-th student, then the average grade is $M := \frac{1}{100}\sum_{k=1}^{100} X_k$.

The central limit theorem tells us that $\frac{1}{10\cdot20}(100M - 100\cdot80) = \frac12(M - 80)$ is very close to a $N(0,1)$ random variable. Thus,
$$P[M < 70] = P\big[\tfrac12(M - 80) < -5\big] \approx P[N(0,1) < -5] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-5} e^{-s^2/2}\,ds = \ldots$$
which can be calculated using a standard normal distribution table. △
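Instead of a table, the value can be computed with the error function, using the identity $\Phi(t) = \frac12\big(1 + \mathrm{erf}(t/\sqrt2)\big)$:

```python
import math

# Standard normal CDF via the error function:
# Phi(t) = (1 + erf(t / sqrt(2))) / 2.
def phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

print(phi(-5.0))   # P[N(0,1) < -5], roughly 2.9e-7
```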

Example 22.9. $X_1, X_2, \ldots$ are independent random variables with the same distribution, such that $E[X_n] = 60$ and $\mathrm{Var}[X_n] = 25$.

Show that we can take $N$ large enough so that the empirical average $\frac1N\sum_{k=1}^N X_k$ is between 55 and 65 with probability greater than 0.99. How large should we take $N$ if we want to ensure this?

Let $S_N = \sum_{k=1}^N X_k$. The central limit theorem tells us that the distribution of $\frac{1}{5\sqrt N}(S_N - 60N)$ converges to the distribution of a $N(0,1)$ random variable. Thus,
$$P\Big[55 < \frac{S_N}{N} < 65\Big] = P\Big[-\sqrt N < \frac{S_N - 60N}{5\sqrt N} < \sqrt N\Big] \approx P[-\sqrt N < N(0,1) < \sqrt N].$$
The right-hand side goes to 1 as $N \to \infty$, so there is some large enough $N$ for which it is greater than 0.99. △

Example 22.10. $X_1, X_2, \ldots$ are independent random variables with the same distribution, such that $E[X_n] = \mu$ and $\mathrm{Var}[X_n] = 1$.

Show that there exist $N_0$ and $c > 0$ such that for all $N > N_0$,
$$P\big[|S_N - N\mu| \leq \tfrac12\sqrt N\big] \geq c,$$
where $S_N = \sum_{k=1}^N X_k$.

Note that we cannot use Kolmogorov here, because that would only give
$$P\big[\max_{1\leq n\leq N} |S_n - n\mu| > \tfrac12\sqrt N\big] \leq 4.$$
However, the central limit theorem tells us that
$$P\big[|S_N - N\mu| \leq \tfrac12\sqrt N\big] \to P\big[|N(0,1)| \leq \tfrac12\big].$$
If we take $c = \frac12 P[|N(0,1)| \leq \frac12]$, we get that for all large enough $N$,
$$P\big[|S_N - N\mu| \leq \tfrac12\sqrt N\big] > c.$$

△

Example 22.11. The government decides to test whether the lab workers at the central lab of Kupat Cholim are adequate. They are given an exam, and graded between 0 and 100. The exam is such that the expectation of a random worker's grade is 80 and the standard deviation is 20. If the lab has 10000 workers, give a good approximation of the probability that the average grade in the lab is smaller than 70.

In this case, we can let $X_1, \ldots, X_{10000}$ be the grades of the workers, so $E[X_j] = 80$ and $\mathrm{Var}[X_j] = 400$. If $A = \frac{1}{10000}\sum_{j=1}^{10000} X_j$ is the average grade, the central limit theorem tells us that
$$P[A \leq 70] = P[100(A - 80) \leq -10\cdot100] = P\Big[\frac{1}{100\cdot20}\sum_{j=1}^{10000}(X_j - 80) \leq -50\Big]$$
is very close to
$$P[N(0,1) \leq -50] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-50} e^{-s^2/2}\,ds. \qquad △$$

Example 22.12. Suppose there are 1.2 million people that will actually vote in the elections. We want to forecast how many mandates party A will receive in the elections. Each mandate is worth $1.2\cdot10^6/120 = 10^4$ people. There are $a$ people that will actually vote for party A, so party A will get $10^{-4}a$ mandates. Suppose we choose a voter uniformly at random. The probability that she votes for party A is then $p = \frac{a}{1.2\cdot10^6}$, and the variance is $p(1-p)$. If we repeat this experiment 900 times independently, the average number of polled people who said they would vote for party A should be close to $p$.

So if $X_j$ is the indicator of the event that the $j$-th voter polled said she would vote for party A, we get that
$$120\cdot\frac{1}{900}\sum_{j=1}^{900} X_j$$
is close to $120p = 10^{-4}a$, which is the number of mandates party A gets. How good is this approximation? What is the number of mandates so that this is close to the real number with 95% confidence? For any $k$, we have that
$$P\Big[\Big|120\cdot\frac{1}{900}\sum_{j=1}^{900}X_j - 10^{-4}a\Big| > k\Big] = P\Big[\Big|\frac{1}{900}\sum_{j=1}^{900}(X_j - p)\Big| > \frac{k}{120}\Big] = P\Big[\Big|\frac{1}{30}\sum_{j=1}^{900}(X_j - p)\Big| > \frac{k}{4}\Big],$$
which is close to
$$P\Big[|N(0,1)| > \frac{k}{4\sqrt{p(1-p)}}\Big]$$
by the central limit theorem. If we take $k \geq 4\sqrt{p(1-p)}\,t$ then this is at most
$$2\int_t^\infty \frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds \leq \frac2t\int_t^\infty s\frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds = \frac{\sqrt2}{t\sqrt\pi}e^{-t^2/2}.$$
For $t = 2.05$ this is smaller than 0.05. Since $p(1-p) \leq 1/4$ we can take $k = 5 > 2\cdot2.05 \geq 4\sqrt{p(1-p)}\cdot2.05$, and then
$$P\Big[\Big|120\cdot\frac{1}{900}\sum_{j=1}^{900}X_j - 10^{-4}a\Big| > k\Big] < 0.05.$$

△
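The tail estimate used in Example 22.12 is simple to evaluate; a sketch:

```python
import math

# The Gaussian tail bound from Example 22.12:
# P[|N(0,1)| > t] <= sqrt(2/pi) * exp(-t^2/2) / t.
def tail_bound(t):
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0) / t

print(tail_bound(2.05))   # below 0.05, so k = 5 mandates suffices
```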

Example 22.13. Let $(X_n)_n$ be independent identically distributed random variables, and let $(Y_n)_n$ be independent identically distributed random variables, such that $E[X_n] = E[Y_n]$ for all $n$, and for all $n$:
$$\mathrm{Var}[X_n] = \mathrm{Var}[Y_n] = 1 \qquad \text{and} \qquad \mathrm{Cov}(X_n, Y_n) = \tfrac12.$$
Let $S = \sum_{j=1}^{10^4} X_j$ and $T = \sum_{j=1}^{10^4} Y_j$. Give a good approximation for $P[S > T]$.

We have that
$$S - T = \sum_{j=1}^{10^4}(X_j - Y_j),$$
where $(X_n - Y_n)_n$ are independent with mean 0 and
$$\mathrm{Var}[X_j - Y_j] = \mathrm{Var}[X_j] + \mathrm{Var}[Y_j] - 2\,\mathrm{Cov}(X_j, Y_j) = 1.$$
So
$$P[S > T] = P[S - T > 0] = P\big[\tfrac{1}{100}(S - T) > 0\big] \approx P[N(0,1) > 0] = \tfrac12.$$

△

Example 22.14. A gambler plays a game repeatedly, where at each game he earns $\mathrm{Geo}(0.1)$ Shekel and loses $\mathrm{Poi}(10)$ Shekel, all wins and losses independent.

• What is a good approximation for the probability that he has more than 20 Shekel after 100 games?
• What is a good approximation for the probability that he has exactly 0 Shekel?

The amount of money after 100 games is
$$M := \sum_{j=1}^{100}(X_j - Y_j)$$
where $X_j \sim \mathrm{Geo}(0.1)$ and $Y_j \sim \mathrm{Poi}(10)$, all independent. Since $E[X_j - Y_j] = 10 - 10 = 0$ and since
$$\mathrm{Var}[X_j - Y_j] = \mathrm{Var}[X_j] + \mathrm{Var}[Y_j] = 100\cdot0.9 + 10 = 100,$$
we have, using the central limit theorem,
$$P[M > 20] = P\big[\tfrac{1}{100}M > 0.2\big] \approx P[N(0,1) > 0.2].$$
For $P[M = 0]$ we cannot use the approximation $P[N = 0] = 0$! So we use the following trick: since $M$ takes on only integer values, $M = 0$ if and only if $M \in [-1/2, 1/2]$. So,
$$P[M = 0] = P\big[\tfrac{1}{100}M \in [-\tfrac{1}{200}, \tfrac{1}{200}]\big] \approx P\big[-\tfrac{1}{200} \leq N(0,1) \leq \tfrac{1}{200}\big]. \qquad △$$
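Both normal approximations in Example 22.14 evaluate easily with the error function (a sketch; `phi` is the standard normal CDF):

```python
import math

# Normal approximations from Example 22.14, with the continuity correction
# for the point probability P[M = 0].
def phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

print(1.0 - phi(0.2))             # P[M > 20]  ~ P[N(0,1) > 0.2]
print(phi(0.005) - phi(-0.005))   # P[M = 0], continuity-corrected
```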

22.5. Lindeberg’s Method

A "hands-on" proof of the CLT (that does not use the Lévy Continuity Theorem) was given by Jarl Lindeberg (1876–1932). The main idea is: if $Z_1, \ldots, Z_n$ are independent $N(0,1)$ random variables, then $n^{-1/2}(Z_1 + \cdots + Z_n) \sim N(0,1)$. So we can change $X_1, X_2, \ldots, X_n$ into $Z_1, \ldots, Z_n$ one by one, each time comparing the sums. Hopefully each of the $n$ differences will not be large. First, a technical lemma regarding $C^3$ functions.

Lemma 22.15. For any $C^3$ function $\varphi$ and any $x, y \in \mathbb{R}$,
$$\Big|\varphi(x+y) - \varphi(x) - \varphi'(x)y - \tfrac12\varphi''(x)y^2\Big| \leq \min\big\{\|\varphi''\|_\infty\,y^2,\ \|\varphi'''\|_\infty\,|y|^3\big\}.$$

Proof. For any $y > 0$, by Taylor's theorem (Brook Taylor, 1685–1731), expanding around $x$,
$$\Big|\varphi(x+y) - \varphi(x) - \varphi'(x)y - \tfrac12\varphi''(x)y^2\Big| \leq \tfrac16\|\varphi'''\|_\infty\cdot|y|^3.$$
The second order Taylor expansion
$$\big|\varphi(x+y) - \varphi(x) - \varphi'(x)y\big| \leq \tfrac12\|\varphi''\|_\infty\cdot y^2,$$
with the triangle inequality gives that
$$\Big|\varphi(x+y) - \varphi(x) - \varphi'(x)y - \tfrac12\varphi''(x)y^2\Big| \leq \big|\varphi(x+y) - \varphi(x) - \varphi'(x)y\big| + \tfrac12\|\varphi''\|_\infty\cdot y^2 \leq \|\varphi''\|_\infty\cdot y^2.$$
If $y < 0$, apply the previous argument to the function $\varphi(-x)$, and note that the absolute values of the derivatives of the two functions are always the same. $\square$

Lemma 22.16. Let $A, X, Z$ be independent random variables with $E[A] = E[X] = E[Z] = 0$ and $E[X^2] = E[Z^2]$. Then for any $C^3$ function $\varphi$ and any $\varepsilon, \delta > 0$,
$$\big|E[\varphi(A + \delta X)] - E[\varphi(A + \delta Z)]\big| \leq M_\varphi\delta^2\cdot\Big(\varepsilon\,E[X^2] + \varepsilon\,E[Z^2] + E[Z^2\mathbb{1}_{\{|Z|>\varepsilon\delta^{-1}\}}] + E[X^2\mathbb{1}_{\{|X|>\varepsilon\delta^{-1}\}}]\Big),$$
where $M_\varphi = \max\{\|\varphi''\|_\infty, \|\varphi'''\|_\infty\}$. (We gain a factor $\varepsilon$ in the first terms, and the indicator $\mathbb{1}_{\{|X|>\varepsilon\delta^{-1}\}}$ in the second ones.)

Proof. Set
$$Y = \Big(\varphi(A+\delta X) - \varphi(A) - \varphi'(A)\cdot\delta X - \tfrac12\varphi''(A)\cdot\delta^2X^2\Big) - \Big(\varphi(A+\delta Z) - \varphi(A) - \varphi'(A)\cdot\delta Z - \tfrac12\varphi''(A)\cdot\delta^2Z^2\Big).$$
Since $E[X] = E[Z]$ and $E[X^2] = E[Z^2]$, we get that
$$E[Y] = E[\varphi(A+\delta X)] - E[\varphi(A+\delta Z)].$$
Also, for $Y_X = \varphi(A+\delta X) - \varphi(A) - \varphi'(A)\cdot\delta X - \tfrac12\varphi''(A)\cdot\delta^2X^2$, by Lemma 22.15,
$$|E[Y_X]| \leq \big|E[Y_X\mathbb{1}_{\{|X|>\varepsilon\delta^{-1}\}}]\big| + \big|E[Y_X\mathbb{1}_{\{|X|\leq\varepsilon\delta^{-1}\}}]\big| \leq \|\varphi''\|_\infty\delta^2\,E[X^2\mathbb{1}_{\{|X|>\varepsilon\delta^{-1}\}}] + \|\varphi'''\|_\infty\delta^3\,E[|X|^3\mathbb{1}_{\{|X|\leq\varepsilon\delta^{-1}\}}] \leq \|\varphi''\|_\infty\delta^2\,E[X^2\mathbb{1}_{\{|X|>\varepsilon\delta^{-1}\}}] + \varepsilon\|\varphi'''\|_\infty\delta^2\,E[X^2].$$
The same bound holds for the analogous $Y_Z$, and since $E[Y] = E[Y_X] - E[Y_Z]$, summing the two bounds gives the lemma. $\square$

We are now ready to prove the Central Limit Theorem.

Theorem 22.17 (Central Limit Theorem). Let $(X_n)_n$ be independent identically distributed random variables with $E[X_k] = 0$ and $E[X_k^2] = 1$. Let $S_n = \sum_{k=1}^n X_k$ and $Z \sim N(0,1)$. Then, for any $C_b^3$ function $\varphi$,
$$E[\varphi(n^{-1/2}S_n)] \to E[\varphi(Z)],$$
so consequently, $n^{-1/2}S_n \xrightarrow{D} Z$.

Proof. Let $(Z_n)_n$ be independent $N(0,1)$ random variables, also independent of $(X_n)_n$. Note that $n^{-1/2}\sum_{k=1}^n Z_k \sim N(0,1)$.

Fix $n$ and let $0 \leq k \leq n$. Define
$$A_k = X_1 + \cdots + X_k + Z_{k+1} + \cdots + Z_n.$$
So $S_n = A_n$ and $n^{-1/2}A_0 \sim N(0,1)$. We can write the telescopic sum
$$E[\varphi(Z)] - E[\varphi(n^{-1/2}S_n)] = \sum_{k=0}^{n-1}\big(E[\varphi(n^{-1/2}A_k)] - E[\varphi(n^{-1/2}A_{k+1})]\big).$$
So it suffices to bound the increments $\big|E[\varphi(n^{-1/2}A_k)] - E[\varphi(n^{-1/2}A_{k+1})]\big|$.

For $0 \leq k \leq n-1$ let $A = n^{-1/2}(X_1 + \cdots + X_k + Z_{k+2} + \cdots + Z_n)$. So $n^{-1/2}A_k = A + n^{-1/2}Z_{k+1}$ and $n^{-1/2}A_{k+1} = A + n^{-1/2}X_{k+1}$. Thus, by Lemma 22.16 with $\delta = n^{-1/2}$, for any $\varepsilon > 0$,
$$\big|E[\varphi(n^{-1/2}A_k)] - E[\varphi(n^{-1/2}A_{k+1})]\big| \leq M_\varphi\cdot n^{-1}\cdot\Big(2\varepsilon + E[Z_{k+1}^2\mathbb{1}_{\{|Z_{k+1}|>\varepsilon\sqrt n\}}] + E[X_{k+1}^2\mathbb{1}_{\{|X_{k+1}|>\varepsilon\sqrt n\}}]\Big).$$

Summing over $k$ we get for any $\varepsilon > 0$,
$$\big|E[\varphi(Z)] - E[\varphi(n^{-1/2}S_n)]\big| \leq M_\varphi\cdot\Big(2\varepsilon + E[Z^2\mathbb{1}_{\{|Z|>\varepsilon\sqrt n\}}] + E[X^2\mathbb{1}_{\{|X|>\varepsilon\sqrt n\}}]\Big).$$

Taking $n \to \infty$, and recalling that $\varepsilon > 0$ was arbitrary, it suffices to show that for any random variable $X$ with finite variance, the quantity $E[X^2\mathbb{1}_{\{|X|>\varepsilon\sqrt n\}}] \to 0$. Indeed, the sequence $(X^2\mathbb{1}_{\{|X|\leq\varepsilon\sqrt n\}})_n$ is monotonically increasing to $X^2$. So monotone convergence implies that
$$E[X^2\mathbb{1}_{\{|X|>\varepsilon\sqrt n\}}] = E[X^2] - E[X^2\mathbb{1}_{\{|X|\leq\varepsilon\sqrt n\}}] \to 0. \qquad \square$$
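For one concrete test function, the convergence $E[\varphi(n^{-1/2}S_n)] \to E[\varphi(Z)]$ can even be computed exactly. With $\varphi = \cos$ (which is $C_b^3$) and $\pm1$ steps, independence gives $E[\cos(S_n/\sqrt n)] = \cos(1/\sqrt n)^n$, while $E[\cos(Z)] = \varphi_Z(1) = e^{-1/2}$; a sketch:

```python
import math

# Exact evaluation of E[phi(S_n / sqrt(n))] for phi = cos and +-1 steps:
# E[cos(t S_n)] = Re E[e^{i t S_n}] = cos(t)^n, with t = 1/sqrt(n).
def smoothed_sum(n):
    return math.cos(1.0 / math.sqrt(n)) ** n

for n in (10, 100, 10_000):
    print(n, smoothed_sum(n))
print(math.exp(-0.5))   # the limit E[cos(Z)], about 0.6065
```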

22.6. Characteristic Function

The characteristic function is essentially just the Fourier transform of the distribution of a random variable.

22.6.1. Preliminaries. Let $(X,Y)$ be a two-dimensional random variable. We can think of $(X,Y)$ as a complex valued random variable and write $(X,Y) = X + iY$. Define the expectation of a complex valued random variable to be $E[X + iY] = E[X] + iE[Y]$, as long as these expectations are all finite.

Note that if $g : \mathbb{R} \to \mathbb{C}$ is a measurable function, then we can write $g = \mathrm{Re}\,g + i\,\mathrm{Im}\,g$, and $\mathrm{Re}\,g, \mathrm{Im}\,g$ are also measurable. Moreover, if $X, Y$ are independent, and $g, h : \mathbb{R} \to \mathbb{C}$ are measurable, then $g(X), h(Y)$ are independent, since $\mathrm{Re}\,g(X), \mathrm{Im}\,g(X)$ are independent of $\mathrm{Re}\,h(Y), \mathrm{Im}\,h(Y)$.

Definition 22.18. Let $X$ be a random variable. Define a function $\varphi_X : \mathbb{R} \to \mathbb{C}$ by
$$\varphi_X(t) := E[e^{itX}] = E[\cos(tX)] + iE[\sin(tX)].$$
$\varphi_X$ is called the characteristic function of $X$.

Note that $\varphi_X$ is always defined, since $E[|\cos(tX)|] \leq 1$ and $E[|\sin(tX)|] \leq 1$. Moreover, we get that (as a complex number) $|\varphi_X(t)| \leq 1$.

Note that if $X$ is discrete with density $f_X$ then
$$\varphi_X(t) = \sum_r f_X(r)e^{itr}.$$
If $X$ is absolutely continuous with density $f_X$ then
$$\varphi_X(t) = \int_{-\infty}^\infty e^{its}f_X(s)\,ds.$$
(This latter object is the Fourier transform of $f_X$.)

Example 22.19. Let’s calculate some characteristic functions:

• $X \sim \mathrm{Ber}(p)$:
$$\varphi_X(t) = pe^{it} + (1-p) = 1 - p(1 - e^{it}).$$
• $Z = X + Y$ for $X, Y$ independent:
$$\varphi_Z(t) = E[e^{itX}e^{itY}] = E[e^{itX}]\,E[e^{itY}] = \varphi_X(t)\cdot\varphi_Y(t).$$
• $Y = cX$ for some real $c \in \mathbb{R}$:
$$\varphi_Y(t) = E[e^{itcX}] = \varphi_X(tc).$$
• $X \sim \mathrm{Bin}(n,p)$: We can write $X = \sum_{k=1}^n X_k$ where $(X_k)_{k=1}^n$ are independent $\mathrm{Ber}(p)$. Thus,
$$\varphi_X(t) = \prod_{k=1}^n \varphi_{X_k}(t) = \big(1 - p(1 - e^{it})\big)^n.$$
• $X \sim \mathrm{Geo}(p)$:
$$\varphi_X(t) = \sum_{k=1}^\infty (1-p)^{k-1}pe^{itk} = \frac{pe^{it}}{1 - (1-p)e^{it}}.$$
• $X \sim \mathrm{Poi}(\lambda)$:
$$\varphi_X(t) = \sum_{k=0}^\infty e^{-\lambda}\frac{\lambda^k}{k!}e^{itk} = e^{-\lambda(1 - e^{it})}.$$
• $X \sim U[a,b]$:
$$\varphi_X(t) = \int_{-\infty}^\infty e^{its}f_X(s)\,ds = \frac{1}{b-a}\int_a^b e^{its}\,ds = \frac{1}{it(b-a)}\big(e^{itb} - e^{ita}\big).$$
• $X \sim \mathrm{Exp}(\lambda)$:
$$\varphi_X(t) = \int_0^\infty \lambda e^{-\lambda s}e^{its}\,ds = \frac{\lambda}{\lambda - it}.$$
• $X \sim N(\mu,\sigma)$: Write $Y = (X - \mu)/\sigma$. So $Y \sim N(0,1)$. We use the fact that $s \mapsto \sin(ts)e^{-s^2/2}$ is an odd function, so its integral over $\mathbb{R}$ is 0. Since $e^{its} = \cos(ts) + i\sin(ts)$ we get that
$$\varphi_Y(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-s^2/2}e^{its}\,ds = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-s^2/2}\cos(ts)\,ds.$$
Let $\chi_s(t) = e^{-s^2/2}\cos(ts)$. Since $|\chi_s(t)| \leq 1$; since $\chi_s'(t) = -e^{-s^2/2}s\sin(ts)$, which is continuous in $t$ for every $s$; since for any $t$
$$\int_{\mathbb{R}}\int_{-\varepsilon}^\varepsilon |\chi_s'(t+\eta)|\,d\eta\,ds \leq \int_{\mathbb{R}} 2\varepsilon|s|e^{-s^2/2}\,ds < \infty;$$
and since
$$\int_{\mathbb{R}}\chi_s'(t)\,ds = -\int_{-\infty}^\infty \sin(ts)\cdot se^{-s^2/2}\,ds = \Big[\sin(ts)e^{-s^2/2}\Big]_{-\infty}^\infty - \int_{-\infty}^\infty t\cos(ts)e^{-s^2/2}\,ds = -t\int_{\mathbb{R}}\chi_s(t)\,ds,$$
which is continuous in $t$, we can differentiate $\varphi_Y$ under the integral to get
$$\varphi_Y'(t) = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\chi_s'(t)\,ds = -t\cdot\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}\chi_s(t)\,ds = -t\varphi_Y(t).$$
Dividing by $\varphi_Y$ we get that
$$\frac{d}{dt}\big(\log\varphi_Y(t)\big) = \frac{\varphi_Y'(t)}{\varphi_Y(t)} = -t.$$
Integrating, $\log\varphi_Y(t) = -\frac{t^2}{2} + C$, and since $\varphi_Y(0) = 1$ we get that $C = 0$ and $\varphi_Y(t) = e^{-t^2/2}$. This completes the calculation for $Y \sim N(0,1)$. For $X = \sigma Y + \mu$ we get
$$\varphi_X(t) = E[e^{it\sigma Y}]e^{it\mu} = e^{it\mu}\varphi_{\sigma Y}(t) = e^{it\mu}\varphi_Y(\sigma t) = e^{it\mu}e^{-\sigma^2t^2/2}.$$

△
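The closed forms in Example 22.19 can be sanity-checked numerically, for instance for the Poisson case, by summing the defining series directly (the parameters below are arbitrary):

```python
import cmath, math

# Check of phi_X(t) = exp(-lam (1 - e^{it})) for X ~ Poi(lam), by summing
# the series sum_k e^{-lam} lam^k / k! e^{itk} term by term.
def phi_poisson_series(lam, t, terms=100):
    total, term = 0j, math.exp(-lam)          # term = e^{-lam} lam^k / k!
    for k in range(terms):
        total += term * cmath.exp(1j * t * k)
        term *= lam / (k + 1)
    return total

lam, t = 10.0, 0.7
closed_form = cmath.exp(-lam * (1.0 - cmath.exp(1j * t)))
print(abs(phi_poisson_series(lam, t) - closed_form))   # essentially 0
```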

Lemma 22.20. Let $X$ be a random variable with $E[X] = 0$ and $E[X^2] = 1$. Then, as $t \to 0$,
$$\varphi_X(t) = 1 - \frac{t^2}{2} + o(t^2).$$

Proof. As in Lemma 22.15 we get that for any $x \in \mathbb{R}$ and $t > 0$,
$$\Big|e^{itx} - 1 - itx + \frac{t^2}{2}x^2\Big| \leq \min\big\{t^2x^2,\ t^3|x|^3\big\}.$$

Thus, for any $\varepsilon > 0$,
$$\Big|E[e^{itX}] - 1 + \frac{t^2}{2}\Big| \leq E\big[\min\{t^2X^2, t^3|X|^3\}\big] \leq t^2\,E[X^2\mathbb{1}_{\{|X|>\varepsilon t^{-1}\}}] + \varepsilon t^2.$$
We have already seen that for $0 < \eta < 1$, as $t \to 0$, $E[X^2\mathbb{1}_{\{|X|>t^{\eta-1}\}}] \to 0$ by monotone convergence. So choosing $\varepsilon = t^\eta$, as $t \to 0$,
$$\varphi_X(t) = 1 - \frac{t^2}{2} + o(t^2). \qquad \square$$

22.7. Lévy's Continuity Theorem

Perhaps the main theorem using characteristic functions is a theorem of Paul Lévy (1886–1971), giving a necessary and sufficient condition for convergence in distribution.

Theorem 22.21 (Lévy's Continuity Theorem). Let $(X_n)_n$ be a sequence of random variables, and let $X$ be another random variable. Then, $X_n \xrightarrow{D} X$ if and only if
$$\varphi_{X_n}(t) \to \varphi_X(t) \qquad \text{for all } t \in \mathbb{R}.$$

22.7.1. Proof of CLT. We can use Lévy's Continuity Theorem to give a different proof of the Central Limit Theorem.

Proof of Theorem 22.6. Let $\varphi(t) = \varphi_{X_n}(t)$ (the $X_n$ all share the same distribution). Note that
$$\varphi_{S_N/\sqrt N}(t) = \varphi(t/\sqrt N)^N.$$
By Lemma 22.20, as $N \to \infty$,
$$\varphi(t/\sqrt N) = 1 - \frac{t^2}{2N} + o(t^2N^{-1}).$$
Thus,
$$\lim_{N\to\infty}\varphi(t/\sqrt N)^N = e^{-t^2/2}.$$
So $\varphi_{S_N/\sqrt N}(t) \to e^{-t^2/2}$, which is the characteristic function of a $N(0,1)$ random variable. By Lévy's Continuity Theorem we get that $S_N/\sqrt N \xrightarrow{D} N(0,1)$. $\square$

22.8. An Explicit Approximation of the Step Function

In this section we construct a function $\psi$ which is $C_b^3$, takes the value 1 on $(-\infty,0]$, is decreasing on $[0,1]$, and takes the value 0 on $[1,\infty)$.

Some consideration leads us to understand that we would like $\psi' < 0$ on $(0,1)$, and so we search for a function such that $\psi''(x) = -\psi''(1-x)$ on $[0,1]$, vanishes at the endpoints $0, 1$, and obtains a local minimum or maximum at the endpoints $0, 1$. Some thought leads to the function $g(x) = x^2(1-x)^2(1-2x)$. Integrating and taking a derivative we have that for
$$h(x) = \frac{x^4}{12} - \frac{x^5}{5} + \frac{x^6}{6} - \frac{x^7}{21},$$
the derivatives on $(0,1)$ satisfy $h'(x) = \tfrac13x^3(1-x)^3$, $h''(x) = g(x)$ and $h'''(x) = g'(x) = 2x - 12x^2 + 20x^3 - 10x^4$.

Set $A = 420 = 7\cdot5\cdot3\cdot4$. Because $h(1) = A^{-1}$ and $h(0) = 0$ we take
$$\psi(x) = \begin{cases} 1 - A\cdot h(x) & x \in [0,1] \\ 1 & x < 0 \\ 0 & x > 1. \end{cases}$$
We have that on $(0,1)$, $\psi' = -Ah' < 0$, also $\psi'' = -Ah''$ and $\psi''' = -Ah'''$. These three derivatives all vanish at $0$ and $1$, so $\psi$ is $C_b^3$.
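The claimed properties of $\psi$ can be verified directly; a numerical sketch:

```python
# Check of the construction in Section 22.8: psi(x) = 1 - 420 h(x) on [0, 1]
# interpolates smoothly between 1 and 0 and is monotone decreasing.
def h(x):
    return x**4 / 12 - x**5 / 5 + x**6 / 6 - x**7 / 21

def psi(x):
    if x < 0:
        return 1.0
    if x > 1:
        return 0.0
    return 1.0 - 420.0 * h(x)

vals = [psi(k / 100.0) for k in range(101)]
print(abs(psi(0.0) - 1.0) < 1e-9, abs(psi(1.0)) < 1e-9)   # endpoint values
print(all(vals[i] >= vals[i + 1] for i in range(100)))    # decreasing
```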

Figure 6. The second derivative $-20\cdot g(x)$ and the function $1 - A\cdot h(x)$.


Lecture 23

23.1. Concentration of Measure

Suppose $(X_n)_n$ is a sequence of i.i.d. random variables with $P[X_n = 1] = P[X_n = -1] = 1/2$. Let
$$S_n = \sum_{k=1}^n X_k.$$
So $E[S_n] = 0$ and $\mathrm{Var}[S_n] = E[S_n^2] = n$. Chebyshev's inequality tells us that
$$P[|S_n| \geq \lambda\sqrt n] \leq \lambda^{-2}.$$
However, the central limit theorem gives that for large $n$,
$$P[|S_n| \geq \lambda\sqrt n] \approx P[|N(0,1)| \geq \lambda] = \frac{1}{\sqrt{2\pi}}\int_\lambda^\infty e^{-s^2/2}\,ds + \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-\lambda} e^{-s^2/2}\,ds \leq \sqrt{2/\pi}\,\lambda^{-1}\int_\lambda^\infty se^{-s^2/2}\,ds = \sqrt{2/\pi}\,\lambda^{-1}\Big(-e^{-s^2/2}\Big)\Big|_\lambda^\infty = \sqrt{2/\pi}\,\lambda^{-1}e^{-\lambda^2/2}.$$
This decays much faster than $\lambda^{-2}$. However, we don't know how good our approximation by a standard normal is.

Because $(X_n)_n$ are independent, we can obtain a very nice concentration result using a smart trick of Bernstein.

Theorem 23.1 (Bernstein's Inequality). Let $(X_n)_n$ be independent random variables such that for all $n$, $|X_n| \leq 1$ a.s. and $E[X_n] = 0$. Let $S_n = \sum_{k=1}^n X_k$. Then for any $\lambda > 0$,
$$P[S_n \geq \lambda] \leq \exp\Big(-\frac{\lambda^2}{2n}\Big),$$
and consequently,
$$P[|S_n| \geq \lambda] \leq 2\exp\Big(-\frac{\lambda^2}{2n}\Big).$$

Proof. There are two main ideas.

The first idea is that for a random variable $X$ with $E[X] = 0$ and $|X| \leq 1$ a.s. one has $E[e^{\alpha X}] \leq e^{\alpha^2/2}$. Indeed, $g(x) = e^{\alpha x}$ is a convex function, so for $|x| \leq 1$ we can write $x = \beta\cdot1 + (1-\beta)\cdot(-1)$, where $\beta = \frac{x+1}{2}$, and
$$e^{\alpha x} \leq \beta e^\alpha + (1-\beta)e^{-\alpha} = \cosh(\alpha) + x\sinh(\alpha).$$
(Here $2\cosh(\alpha) = e^\alpha + e^{-\alpha}$ and $2\sinh(\alpha) = e^\alpha - e^{-\alpha}$.) Thus, because $E[X] = 0$, and using $(2k)! \geq 2^kk!$,
$$E[e^{\alpha X}] \leq \cosh(\alpha) + E[X]\sinh(\alpha) = \cosh(\alpha) = \sum_{k=0}^\infty \frac{\alpha^{2k}}{(2k)!} \leq \sum_{k=0}^\infty \frac{\alpha^{2k}}{2^kk!} = e^{\alpha^2/2}.$$
The second idea, due to Sergei Bernstein (1880–1968), is to apply the Chebyshev / Markov inequality to the non-negative random variable $e^{\alpha X}$, and then to optimize over $\alpha$. In our case, using independence,
$$E[e^{\alpha S_n}] = \prod_{k=1}^n E[e^{\alpha X_k}] \leq e^{\alpha^2n/2}.$$
Now apply Markov's inequality to the non-negative random variable $e^{\alpha S_n}$ to get for any $\alpha > 0$,
$$P[S_n \geq \lambda] = P[e^{\alpha S_n} \geq e^{\alpha\lambda}] \leq \exp\Big(\frac12n\alpha^2 - \alpha\lambda\Big).$$
Optimizing over $\alpha$, we take $\alpha = \lambda/n$ and get
$$P[S_n \geq \lambda] \leq \exp\Big(-\frac{\lambda^2}{2n}\Big). \qquad \square$$
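Bernstein's bound can be compared against simulation for $\pm1$ steps (the trial count and seed below are arbitrary choices):

```python
import math, random

# Empirical tail P[S_n >= lam] for +-1 steps versus Bernstein's bound
# exp(-lam^2 / (2n)).
def empirical_tail(n, lam, trials=20_000, rng=random.Random(4)):
    hits = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        hits += s >= lam
    return hits / trials

n, lam = 100, 30
print(empirical_tail(n, lam), "<=", math.exp(-lam ** 2 / (2 * n)))
```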

Example 23.2. Suppose a gambler plays a game repeatedly and independently, such that with probability $p < 1/2$ he earns one dollar, and he loses one dollar with probability $1-p$.

For every $n$ let $X_n$ be the amount he won in the $n$-th game. So $S_n = \sum_{k=1}^n X_k$ is the total amount he wins after $n$ games.

Note that $E[X_k] = 2p - 1 < 0$, so $Y_k = \frac12\big(X_k - (2p-1)\big)$ has expectation 0 and $|Y_k| \leq 1$. By Bernstein's inequality,
$$P[S_n > 0] = P\Big[\sum_{k=1}^n Y_k > \frac n2(1-2p)\Big] \leq \exp\Big(-\frac{n(1-2p)^2}{8}\Big).$$
So the probability of winning is very small, even if the house has only a very small advantage.

△


Lecture 24

24.1. Review

Exercise 24.1. We have $N$ urns. Urn number $k$ contains $k$ white balls and $N-k$ black balls. An urn is chosen randomly, all urns equally likely. Then we start removing balls from that urn, all balls equally likely, all removals independent, returning each removed ball to the urn after noting its color. Let $A$ be the event that the first $n$ balls removed are white. Let $B$ be the event that the $(n+1)$-th ball removed is black. Calculate $P[A]$, $P[B \cap A]$, $P[B \mid A]$.

Now calculate the same if the process is that each time a random urn is chosen independently, and then a ball is removed from that urn, its color noted, and returned.

Solution to Exercise 24.1. For the first scenario, let $C_k$ be the event that the $k$-th urn was chosen. Independence gives us that
$$P[A \mid C_k] = \Big(\frac kN\Big)^n \qquad \text{and} \qquad P[B \cap A \mid C_k] = \Big(\frac kN\Big)^n\cdot\frac{N-k}{N}.$$
Thus by the law of total probability,
$$P[A] = \frac{1}{N^{n+1}}\sum_{k=1}^N k^n \qquad \text{and} \qquad P[B \cap A] = \frac{1}{N^{n+2}}\sum_{k=1}^N k^n(N-k).$$
Thus,
$$P[B \mid A] = 1 - \frac1N\cdot\frac{\sum_{k=1}^N k^{n+1}}{\sum_{k=1}^N k^n}.$$
In the second scenario, the probability that a white ball is chosen each time is
$$\sum_{k=1}^N \frac1N\cdot\frac kN = \frac{1}{N^2}\cdot\frac{N(N+1)}{2} = \frac{N+1}{2N} := p.$$
All removals are independent, so
$$P[A] = p^n, \qquad P[B \cap A] = p^n(1-p), \qquad P[B \mid A] = 1-p. \qquad \square$$

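The first-scenario answer can be double-checked with exact rational arithmetic: the closed form $1 - \frac1N\sum_k k^{n+1}\big/\sum_k k^n$ must agree with $P[B \cap A]/P[A]$ computed directly from the urn sums.

```python
from fractions import Fraction

# Exact check of Exercise 24.1 (first scenario): the closed form for
# P[B | A] agrees with P[B and A] / P[A] computed from the urn sums.
def p_b_given_a(N, n):
    num = sum(Fraction(k) ** (n + 1) for k in range(1, N + 1))
    den = sum(Fraction(k) ** n for k in range(1, N + 1))
    return 1 - Fraction(1, N) * num / den

def p_b_given_a_direct(N, n):
    pa  = sum(Fraction(k, N) ** n for k in range(1, N + 1)) / N
    pab = sum(Fraction(k, N) ** n * Fraction(N - k, N)
              for k in range(1, N + 1)) / N
    return pab / pa

print(p_b_given_a(10, 3) == p_b_given_a_direct(10, 3))   # True
```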

Exercise 24.2. Random variables $X, Y$ are equally distributed if $F_X = F_Y$. Show that if $X, Y$ are equally distributed then $E[X] = E[Y]$ if the expectations exist, and $E[X]$ does not exist if and only if $E[Y]$ does not exist. (Hint: Start with simple, go through non-negative, continue to general. First show that $P_X = P_Y$.)

Solution to Exercise 24.2. The first observation is that P[X B] = P[Y B] for any ∈ ∈ Borel set B . Indeed, the probability measures PX , PY agree on the π-system ∈ B ( , t]: t R by definition, and this π-system generates , so PX , PY agree on all { −∞ ∈ } B . B Now, note that if g : R R is a measurable function, then if X,Y are equally → distributed, then so are g(X), g(Y ). This is because

−1 −1 P[g(X) t] = P[X g ( , t]] = P[Y g ( , t]] = P[g(Y ) t]. ≤ ∈ −∞ ∈ −∞ ≤ The next step is to prove the claim for simple random variables: If X,Y are simple and equally distributed then

− − P[X = r] = FX (r) FX (r ) = FY (r) FY (r ) = P[Y = r]. − − Let R be a range for X and Y . Linearity of expectation now gives us that

E[X] = r P[X = r] = r P[Y = r] = E[Y ]. r∈R r∈R X X −n n Now assume that X,Y 0. Let gn(r) = min 2 2 r , n . So 0 gn(X) X and ≥ { b c } ≤ % 0 gn(Y ) Y . Monotone convergence gives that ≤ %

[X] = lim [gn(X)] = lim [gn(Y )] = [Y ], E n E n E E where we have used the fact that gn(X), gn(Y ) are equally distributed simple random variables for all n. Note that the above also holds if E[X] = E[Y ] = . ∞ 161

Finally, for general equally distributed X, Y: the positive parts X⁺, Y⁺ are equally distributed and non-negative, so E[X⁺] = E[Y⁺]; similarly, E[X⁻] = E[Y⁻]. Thus,

\[ E[X] = E[X^+] - E[X^-] = E[Y^+] - E[Y^-] = E[Y] , \]

provided at least one of E[X⁺] = E[Y⁺] or E[X⁻] = E[Y⁻] is finite. If both are infinite, then neither E[X] nor E[Y] exists. □

Exercise 24.3. X ∼ U[−1, 1]. Calculate:
• P[|X| > 1/2].
• P[sin(πX/2) > 1/2].

Solution to Exercise 24.3. Note that {|X| > 1/2} = {X > 1/2} ⊎ {X < −1/2}. So

\[ P[|X| > 1/2] = P[X > 1/2] + P[X < -1/2] = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} . \]

Since sin is increasing on [−π/2, π/2] and sin(π/6) = 1/2, we get

\[ P[\sin(\pi X/2) > 1/2] = P[\pi X/2 > \pi/6] = P[X > 1/3] = 1/3 . \]
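Both probabilities in Exercise 24.3 are easy to confirm with a short simulation; a minimal sketch (sample size and seed are arbitrary):

```python
import math
import random

random.seed(7)
trials = 200_000
xs = [random.uniform(-1, 1) for _ in range(trials)]   # X ~ U[-1, 1]

p_abs = sum(abs(x) > 0.5 for x in xs) / trials                     # estimates P[|X| > 1/2] = 1/2
p_sin = sum(math.sin(math.pi * x / 2) > 0.5 for x in xs) / trials  # estimates P[sin(pi X/2) > 1/2] = 1/3
```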

□

Exercise 24.4. B, C are independent, with B ∼ Exp(λ) and C ∼ U[0, 1]. What is the probability that the equation x² + 2Bx + C = 0 has two distinct real solutions?

Solution to Exercise 24.4. Two distinct real solutions exist exactly when the discriminant is positive, so we are looking for

\[ P[4B^2 - 4C > 0] = P[B^2 > C] . \]

For any t > 0, substituting u = s²,

\[ P[B^2 \le t] = P[B \le \sqrt{t}] = \int_0^{\sqrt{t}} \lambda e^{-\lambda s} \, ds = \int_0^t \lambda e^{-\lambda \sqrt{u}} \frac{1}{2\sqrt{u}} \, du , \]

so B² is absolutely continuous with this density. Note that for all s > 0,

\[ \int_s^\infty f_{B^2}(u) \, du = \int_s^\infty \lambda e^{-\lambda \sqrt{u}} \frac{1}{2\sqrt{u}} \, du = \int_{\sqrt{s}}^\infty \lambda e^{-\lambda t} \, dt = e^{-\lambda \sqrt{s}} . \]

Since B², C are independent, we have that

\[ f_{B^2 \mid C}(t \mid s) = f_{B^2}(t) . \]

Thus, substituting s = u² in the last step,

\[ P[B^2 > C] = \int_0^1 \int_s^\infty f_{B^2,C}(t, s) \, dt \, ds = \int_0^1 \int_s^\infty f_{B^2 \mid C}(t \mid s) \, dt \, ds = \int_0^1 e^{-\lambda \sqrt{s}} \, ds = \int_0^1 2u e^{-\lambda u} \, du \]
\[ = \left[ -\frac{2u}{\lambda} e^{-\lambda u} \right]_0^1 + \frac{2}{\lambda} \int_0^1 e^{-\lambda u} \, du = -\frac{2}{\lambda} e^{-\lambda} - \frac{2}{\lambda^2} \left( e^{-\lambda} - 1 \right) = \frac{2}{\lambda^2} \left( 1 - (1 + \lambda) e^{-\lambda} \right) . \]

We have now proved a non-trivial inequality:

\[ \frac{2}{\lambda^2} \left( 1 - (1 + \lambda) e^{-\lambda} \right) \le 1 , \]

which is equivalent to

\[ e^\lambda - (1 + \lambda) \le \frac{\lambda^2}{2} \cdot e^\lambda . \]

This indeed holds, as 2 · k! ≤ (k + 2)! for all k and

\[ e^\lambda - (1 + \lambda) = \sum_{k=0}^\infty \frac{\lambda^{k+2}}{(k+2)!} \le \frac{\lambda^2}{2} \sum_{k=0}^\infty \frac{\lambda^k}{k!} . \]

□
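The closed form for P[B² > C] can be checked directly by simulation; a hedged sketch (the choice λ = 1 is arbitrary):

```python
import math
import random

def discriminant_prob(lam, trials=200_000, seed=3):
    """Estimate P[x^2 + 2Bx + C = 0 has two distinct real roots] = P[B^2 > C]
    with B ~ Exp(lam) and C ~ U[0, 1] independent."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        b = random.expovariate(lam)   # B ~ Exp(lam), rate lam
        c = random.random()           # C ~ U[0, 1]
        if b * b > c:
            hits += 1
    return hits / trials

def exact_prob(lam):
    """The value (2 / lam^2) * (1 - (1 + lam) * e^(-lam)) derived above."""
    return 2 / lam ** 2 * (1 - (1 + lam) * math.exp(-lam))
```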

Exercise 24.5. Let (X, Y) be jointly absolutely continuous with

\[ f_Y(s) = \begin{cases} C (s^3 - s - 1) & s \in [2, 4] \\ 0 & \text{otherwise,} \end{cases} \]

and, for all s ∈ [2, 4],

\[ f_{X \mid Y}(t \mid s) = \begin{cases} C(s) \cos\left( \frac{\pi}{2s} t \right) & t \in [0, s] \\ 0 & \text{otherwise.} \end{cases} \]

Calculate:
• C and C(s) for all s ∈ [2, 4].
• E[Y], E[X | Y = s], E[X].
• Var[Y], Var[X].
• Cov(X, Y) and Var[X + Y].

Solution to Exercise 24.5. For C:

\[ 1 = C \int_2^4 (s^3 - s - 1) \, ds = C \left( \frac{4^4 - 2^4}{4} - \frac{4^2 - 2^2}{2} - (4 - 2) \right) = C (60 - 6 - 2) = C \cdot 52 . \]

For C(s):

\[ 1 = C(s) \int_0^s \cos\left( \frac{\pi}{2s} t \right) dt = C(s) \cdot \frac{2s}{\pi} \left[ \sin\left( \frac{\pi}{2s} t \right) \right]_0^s = C(s) \cdot \frac{2s}{\pi} , \]

so C(s) = π/(2s). We also have

\[ E[Y] = 52^{-1} \int_2^4 (s^4 - s^2 - s) \, ds = \frac{4^5 - 2^5}{5 \cdot 52} - \frac{4^3 - 2^3}{3 \cdot 52} - \frac{4^2 - 2^2}{2 \cdot 52} = \frac{5212}{30 \cdot 52} . \]

For s ∈ [2, 4], using the fact that (x sin x + cos x)′ = sin x + x cos x − sin x = x cos x, we have

\[ E[X \mid Y = s] = C(s) \int_0^s t \cos(C(s) t) \, dt = C(s)^{-1} \int_0^{\pi/2} t \cos t \, dt = C(s)^{-1} \cdot \left( \frac{\pi}{2} - 1 \right) = \left( 1 - \frac{2}{\pi} \right) \cdot s . \]

To calculate E[X], we use the fact that f_{X,Y}(t, s) = f_Y(s) f_{X|Y}(t | s) for all s ∈ [2, 4] and f_{X,Y}(t, s) = 0 otherwise. Thus, with the function (t, s) ↦ t,

\[ E[X] = \int_2^4 f_Y(s) \int_0^s t f_{X \mid Y}(t \mid s) \, dt \, ds = \left( 1 - \frac{2}{\pi} \right) \int_2^4 s f_Y(s) \, ds = \left( 1 - \frac{2}{\pi} \right) \cdot E[Y] . \]

Now, similarly,

\[ E[XY] = \int_2^4 s f_Y(s) \int_0^s t f_{X \mid Y}(t \mid s) \, dt \, ds = \left( 1 - \frac{2}{\pi} \right) \cdot E[Y^2] , \]

so we get that

\[ \mathrm{Cov}(X, Y) = \left( 1 - \frac{2}{\pi} \right) \cdot E[Y^2] - \left( 1 - \frac{2}{\pi} \right) \cdot E[Y] \cdot E[Y] = \left( 1 - \frac{2}{\pi} \right) \cdot \mathrm{Var}[Y] . \]

□
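The constants computed above can be verified with simple numerical integration (a sketch using a midpoint rule; the check point s = 3 is an arbitrary choice in [2, 4]):

```python
import math

def integrate(f, a, b, n=50_000):
    """Midpoint-rule numerical integration of f on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# C = 1/52 from the normalization of f_Y
C = 1 / integrate(lambda s: s ** 3 - s - 1, 2, 4)

# C(s) = pi / (2s): check the normalization of f_{X|Y}(. | s) at s = 3
s = 3.0
Cs = math.pi / (2 * s)
norm = integrate(lambda t: Cs * math.cos(math.pi / (2 * s) * t), 0, s)

# E[Y] = 5212 / (30 * 52) and E[X | Y = s] = (1 - 2/pi) * s
EY = integrate(lambda u: u * C * (u ** 3 - u - 1), 2, 4)
EX_given_s = integrate(lambda t: t * Cs * math.cos(math.pi / (2 * s) * t), 0, s)
```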

Exercise 24.6. A chicken lays an egg of random size. The size of the egg has the distribution of |N|, where N ∼ N(0, 5). For any s > 0, given that an egg is of size s, the time until the egg hatches is distributed Exp(s^{−2}). What is the expected time for an egg to hatch?
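A direct simulation of this two-stage experiment (sample the size, then the conditional hatching time) gives a numerical estimate to compare with the analytic answer worked out in the solution. This is a sketch assuming N(0, 5) denotes standard deviation 5, as the density e^{−t²/50} in the solution indicates:

```python
import random

random.seed(11)
trials = 200_000
total = 0.0
for _ in range(trials):
    s = abs(random.gauss(0, 5))            # egg size S = |N(0, 5)|, sigma = 5
    total += random.expovariate(s ** -2)   # T | S = s ~ Exp(s^-2): rate s^-2, mean s^2
estimate = total / trials                  # estimates E[T]
```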

Solution to Exercise 24.6. Let S be the size and T the time. We are given that S ∼ |N(0, 5)|; that is, for any s > 0,

\[ P[S \le s] = P[N(0, 5) \in [-s, s]] = \int_{-s}^{s} \frac{1}{5\sqrt{2\pi}} e^{-t^2/50} \, dt = 2 \int_0^s \frac{1}{5\sqrt{2\pi}} e^{-t^2/50} \, dt . \]

So S is absolutely continuous with density

\[ f_S(s) = \begin{cases} \frac{\sqrt{2}}{5\sqrt{\pi}} e^{-s^2/50} & s > 0 \\ 0 & s \le 0 . \end{cases} \]

So we can calculate the expectation of T:

\[ E[T] = \int \int t f_{T,S}(t, s) \, dt \, ds = \int_0^\infty f_S(s) \int_0^\infty t s^{-2} e^{-s^{-2} t} \, dt \, ds = \int_0^\infty s^2 f_S(s) \, ds = E[S^2] = E[N(0, 5)^2] = 25 . \]

□

Exercise 24.7. Let X be a discrete random variable with range ℚ. Suppose that E[X] = μ, E[X²] = σ and E[X³] = ρ. Let Y be a discrete random variable defined by Y | X = q ∼ Poi(q²). Calculate Cov(X, Y).

Solution to Exercise 24.7. Note that for any q ∈ ℚ,

\[ q^2 = E[Y \mid X = q] = \sum_{k=0}^\infty k \, e^{-q^2} \frac{q^{2k}}{k!} . \]

Note that for q ∈ ℚ and k ∈ ℕ,

\[ f_{Y,X}(k, q) = f_X(q) \, e^{-q^2} \frac{q^{2k}}{k!} . \]

So

\[ E[XY] = \sum_{q \in \mathbb{Q}} \sum_{k=0}^\infty q k \, f_{Y,X}(k, q) = \sum_{q \in \mathbb{Q}} q^3 f_X(q) = \rho , \]

and

\[ E[Y] = \sum_{q \in \mathbb{Q}} \sum_{k=0}^\infty k \, f_{Y,X}(k, q) = \sum_{q \in \mathbb{Q}} q^2 f_X(q) = \sigma . \]

Thus, Cov(X, Y) = E[XY] − E[X] · E[Y] = ρ − σμ. □
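The identity Cov(X, Y) = ρ − σμ can be spot-checked with any concrete rational-valued X. The sketch below takes X uniform on {1, 2, 3} (a hypothetical choice, not from the exercise) and samples Y | X = q ∼ Poi(q²) with Knuth's multiplication algorithm:

```python
import math
import random

def poisson(lam, rng):
    """Sample Poisson(lam) via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(5)
support = [1, 2, 3]                        # hypothetical X: uniform on {1, 2, 3}
mu = sum(q for q in support) / 3           # E[X]   = 2
sigma = sum(q ** 2 for q in support) / 3   # E[X^2] = 14/3
rho = sum(q ** 3 for q in support) / 3     # E[X^3] = 12

trials = 200_000
sum_x = sum_y = sum_xy = 0.0
for _ in range(trials):
    x = rng.choice(support)
    y = poisson(x * x, rng)                # Y | X = q ~ Poi(q^2)
    sum_x += x
    sum_y += y
    sum_xy += x * y
cov_estimate = sum_xy / trials - (sum_x / trials) * (sum_y / trials)
```

For this choice of X, the formula gives Cov(X, Y) = ρ − σμ = 12 − (14/3) · 2 = 8/3, which the estimate matches.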