CS 109 Lecture 19, May 9th, 2016

Midterm Distribution
[Histogram of midterm scores, binned 0–120]
E[X] = 87, √Var(X) = 15, median = 90

Midterm Distribution (with grade cutoffs: A+, A, A-, B+, B, B-, Cs)
[Same histogram, annotated with letter-grade boundaries]
E[X] = 87, √Var(X) = 15, median = 90

Midterm Distribution
[Histogram with "Core Understanding" and "Advanced Understanding" series]
E[X] = 87, √Var(X) = 15, median = 90

Midterm Beta

[Histogram of midterm scores with a fitted Beta density overlaid]
E[X] = 87, √Var(X) = 15, median = 90
α = 7.1, β = 2.7

Midterm Beta
[CS109 Midterm CDF — you can interpret this as a percentile function]
[CS109 Midterm PDF — the derivative of percentile per point]
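The Beta fit above can be reproduced numerically. A minimal sketch, assuming scores are rescaled to [0, 1]; the `scores` array below is placeholder data, and the α = 7.1, β = 2.7 values come from the slide, not from this code:

```python
import numpy as np
from scipy import stats

# Hypothetical midterm scores out of 120 (placeholder data, not the real class scores)
scores = np.array([87, 95, 72, 104, 90, 66, 101, 85, 93, 78], dtype=float)

# Rescale to (0, 1) so a Beta distribution is a sensible model
x = scores / 120.0

# Fit a Beta distribution with its support fixed to [0, 1]
alpha, beta_param, loc, scale = stats.beta.fit(x, floc=0, fscale=1)
print(f"alpha = {alpha:.2f}, beta = {beta_param:.2f}")

# The fitted CDF acts as a smoothed percentile function
print("Estimated percentile of a 90:", stats.beta.cdf(90 / 120, alpha, beta_param))
```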

Midterm Question Correlations
[Score histograms for Problems 1–6]

Midterm Question Correlations
[Correlation graph over sub-questions (1.a–1.d, 2.a–2.c, 3.a–3.c, 4.a–4.c, 5, 6.a–6.c); edge weights shown range from 0.22 to 0.60]
*Correlations below 0.22 are not shown
**Almost all correlations were positive

Joint PMF (Joint Probability Mass Function) for Questions 6.a, 6.c

[Scatter of points on 6.a (x-axis) vs. points on 6.c (y-axis); best-fit line y = 0.46x + 2.6]

Joint PMF (Joint Probability Mass Function) for Questions 4, 6

[Scatter of total points on 6 (x-axis) vs. total points on 4 (y-axis); best-fit line y = 0.2x + 11]

Conditional Expectation (Questions 4, 6)

• Let X be your score on problem 4.
• Let Y be your score on problem 6.
• E[X | Y]?
  (Standard error is used here because the expectation is estimated from a sample.)

[Plot of E[X | Y = y] against y, with standard-error bars; see the sketch below]
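A minimal sketch of how such a conditional expectation curve could be estimated from paired scores; the `scores_4` / `scores_6` arrays are hypothetical placeholders, not the actual midterm data:

```python
import numpy as np

# Hypothetical paired scores: (problem 4 total, problem 6 total) for each student
scores_4 = np.array([18, 12, 20, 15, 9, 17, 14, 20, 11, 16])
scores_6 = np.array([20, 10, 22, 16, 8, 18, 12, 24, 10, 14])

# Estimate E[X | Y = y]: average problem-4 score among students with problem-6 score y
for y in np.unique(scores_6):
    mask = scores_6 == y
    est = scores_4[mask].mean()
    # Standard error of the estimate, since E[X | Y = y] is computed from a sample
    se = scores_4[mask].std(ddof=1) / np.sqrt(mask.sum()) if mask.sum() > 1 else float("nan")
    print(f"E[X | Y = {y:2d}] ≈ {est:5.2f}  (SE ≈ {se:.2f}, n = {mask.sum()})")
```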

Four Prototypical Trajectories

1. Inequalities

Inequality, Probability and Joviality

• In many cases, we don’t know the true form of a probability distribution

§ E.g., Midterm scores

§ But, we know the mean

§ May also have other measures/properties

o Variance

o Non-negativity

o Etc.

§ Inequalities and bounds still allow us to say something about the probability distribution in such cases

o May be imprecise compared to knowing true distribution!

Markov’s Inequality

• Say X is a non-negative random variable. Then:

  P(X ≥ a) ≤ E[X]/a, for all a > 0

• Proof:

§ Let I = 1 if X ≥ a, and 0 otherwise
§ Since X ≥ 0, I ≤ X/a
§ Taking expectations: E[I] = P(X ≥ a) ≤ E[X/a] = E[X]/a
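A quick numerical sanity check of Markov’s Inequality; this is a sketch using an arbitrary non-negative distribution (Exponential with mean 1), not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # non-negative samples, E[X] = 1

for a in [1, 2, 5, 10]:
    empirical = (x >= a).mean()
    markov_bound = x.mean() / a  # E[X] / a
    print(f"a = {a:2d}: P(X >= a) ≈ {empirical:.4f}  <=  Markov bound {markov_bound:.4f}")
```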

• Andrey Andreyevich Markov (1856-1922) was a Russian mathematician

§ Markov’s Inequality is named after him
§ He also invented Markov Chains…
o …which are the basis for Google’s PageRank algorithm

Markov and the Midterm

• Consider scores from a previous quarter’s CS109 midterm

§ X = midterm score

§ Using sample mean X̄ = 86.7 ≈ E[X]

§ What is P(X ≥ 100)?

  P(X ≥ 100) ≤ E[X]/100 = 86.7/100 ≈ 0.867

§ Markov bound: ≤ 86.7% of class scored 100 or greater

§ In fact, 20.1% of class scored 100 or greater

o Markov inequality can be a very loose bound

o But, it made no assumption at all about the form of the distribution!

Chebyshev’s Inequality

• X is a random variable with E[X] = µ, Var(X) = σ²

  P(|X − µ| ≥ k) ≤ σ²/k², for all k > 0

• Proof:
§ Since (X − µ)² is a non-negative random variable, apply Markov’s Inequality with a = k²:

  P((X − µ)² ≥ k²) ≤ E[(X − µ)²]/k² = σ²/k²

§ Note that (X − µ)² ≥ k² ⇔ |X − µ| ≥ k, yielding:

  P(|X − µ| ≥ k) ≤ σ²/k²
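A numerical sanity check of Chebyshev’s Inequality, in the same spirit as the Markov check above; again a sketch on an arbitrary distribution (Uniform(0, 10)), not lecture data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1_000_000)
mu, var = x.mean(), x.var()

for k in [1, 2, 4, 6]:
    empirical = (np.abs(x - mu) >= k).mean()
    chebyshev_bound = var / k**2  # sigma^2 / k^2 (can exceed 1 for small k)
    print(f"k = {k}: P(|X - mu| >= k) ≈ {empirical:.4f}  <=  bound {chebyshev_bound:.4f}")
```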

Pafnuty Chebyshev

• Pafnuty Lvovich Chebyshev (1821-1894) was also a Russian mathematician

§ Chebyshev’s Inequality is named after him

o But actually formulated by his colleague Irénée-Jules Bienaymé
§ He was Markov’s doctoral advisor
o And sometimes credited with first deriving Markov’s Inequality
§ There is a crater on the moon named in his honor

Four Prototypical Trajectories

2. Weak Law of Large Numbers

• Consider I.I.D. random variables X1, X2, ...
§ Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
§ Let X̄ = (X1 + X2 + ... + Xn)/n
§ For any ε > 0: P(|X̄ − µ| ≥ ε) → 0 as n → ∞

• Proof:
§ E[X̄] = E[(X1 + X2 + ... + Xn)/n] = µ
§ Var(X̄) = Var((X1 + X2 + ... + Xn)/n) = σ²/n
§ By Chebyshev’s inequality: P(|X̄ − µ| ≥ ε) ≤ Var(X̄)/ε² = σ²/(nε²) → 0 as n → ∞
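A small simulation illustrating the Weak Law: the probability that the sample mean strays from µ by more than ε shrinks as n grows. A sketch using Uniform(0, 1) variables (so µ = 0.5), not tied to anything in the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, trials = 0.5, 0.05, 2_000

for n in [10, 100, 1_000, 5_000]:
    # For each trial, draw n Uniform(0, 1) samples and take their mean
    sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
    prob_far = (np.abs(sample_means - mu) >= eps).mean()
    print(f"n = {n:5d}: P(|X_bar - mu| >= {eps}) ≈ {prob_far:.4f}")
```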

Strong Law of Large Numbers

• Consider I.I.D. random variables X1, X2, ...
§ Xi have distribution F with E[Xi] = µ
§ Let X̄ = (X1 + X2 + ... + Xn)/n

  P( lim n→∞ (X1 + X2 + ... + Xn)/n = µ ) = 1

§ Strong Law ⇒ Weak Law, but not vice versa
§ Strong Law implies that for any ε > 0, there are only a finite number of values of n for which the Weak Law condition |X̄ − µ| ≥ ε holds

Intuitions and Misconceptions of LLN

• Say we have repeated trials of an experiment
§ Let event E = some outcome of the experiment

§ Let Xi = 1 if E occurs on trial i, 0 otherwise
§ Strong Law of Large Numbers (Strong LLN) yields:

  (X1 + X2 + ... + Xn)/n → E[Xi] = P(E)

§ Recall first week of class: P(E) = lim n→∞ n(E)/n
§ Strong LLN justifies the “frequency” notion of probability

§ Misconception arising from LLN:

o Gambler’s fallacy: “I’m due for a win”

o Consider being “due for a win” with repeated coin flips... (see the simulation sketch below)
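A sketch of the coin-flip point: even after a streak of tails, the chance of heads on the next flip is still 1/2, so nobody is ever “due”. Hypothetical simulation, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=200_000)   # 1 = heads, 0 = tails

streak = 3                                 # look at flips right after 3 tails in a row
after_streak = [flips[i] for i in range(streak, len(flips))
                if not flips[i - streak:i].any()]

print("Overall P(heads)            ≈", flips.mean())
print("P(heads | 3 tails in a row) ≈", float(np.mean(after_streak)))   # still ≈ 0.5
```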

La Loi des Grands Nombres

• History of the Law of Large Numbers
§ 1713: Weak LLN described by Jacob Bernoulli

§ 1835: Poisson calls it “La Loi des Grands Nombres”

o That would be “Law of Large Numbers” in French

§ 1909: Émile Borel develops Strong LLN for Bernoulli random variables

§ 1928: Andrei Nikolaevich Kolmogorov proves Strong LLN in the general case

Silence!!

And now a moment of silence...

...before we present...

...the greatest result of probability theory!

Four Prototypical Trajectories

3. The Central Limit Theorem

• Consider I.I.D. random variables X1, X2, ...
§ Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
§ Let X̄ = (X1 + X2 + ... + Xn)/n

§ Central Limit Theorem:

  X̄ ~ N(µ, σ²/n) as n → ∞

http://onlinestatbook.com/stat_sim/sampling_dist/

The Central Limit Theorem

• Consider I.I.D. random variables X1, X2, ...
§ Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
§ Let X̄ = (X1 + X2 + ... + Xn)/n, so X̄ ~ N(µ, σ²/n) as n → ∞

§ Recall Z = (X̄ − µ)/√(σ²/n), where Z ~ N(0, 1):

  X̄ ~ N(µ, σ²/n)  ⇔  Z = (X̄ − µ)/(σ/√n) = ((X1 + X2 + ... + Xn)/n − µ) · (√n/σ) = (X1 + X2 + ... + Xn − nµ)/(σ√n)

  (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞

Another form of the Central Limit Theorem
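A minimal simulation of this form of the CLT: sums of I.I.D. variables, standardized by nµ and σ√n, look increasingly like N(0, 1). A sketch using Exponential(1) summands (any distribution with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.0          # mean and std of Exponential(1)
n, num_sums = 50, 100_000

# Each row is one realization of X1 + ... + Xn
sums = rng.exponential(scale=1.0, size=(num_sums, n)).sum(axis=1)
z = (sums - n * mu) / (sigma * np.sqrt(n))   # standardized sums

# If the CLT holds, z should be roughly standard normal
print("mean(z) ≈", round(float(z.mean()), 3), " var(z) ≈", round(float(z.var()), 3))
print("P(z <= 1.96) ≈", (z <= 1.96).mean(), " (Φ(1.96) ≈ 0.975)")
```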

Once Upon a Time… Abraham de Moivre, 1733

• History of the Central Limit Theorem
§ 1733: CLT for X ~ Ber(1/2) postulated by Abraham de Moivre

§ 1823: Pierre-Simon Laplace extends de Moivre’s work to approximating Bin(n, p) with Normal

§ 1901: Aleksandr Lyapunov provides precise definition and rigorous proof of CLT

§ 2003: Charlie Sheen stars in television series “Two and a Half Men”

o By end of the 7th (final) season, there were 161 episodes

o Mean quality of subsamples of episodes is Normally distributed (thanks to the Central Limit Theorem)

Central Limit Theorem in the Real World

• CLT is why many things in the “real world” appear Normally distributed
§ Many quantities are the sum of independent variables
§ Exam scores

o Sum of individual problems on the SAT

o Why does the CLT not apply to our midterm?
§ Election polling

o Ask 100 people if they will vote for candidate X (p1 = # “yes”/100)

o Repeat this process with different groups to get p1, ... , pn

o Will have a normal distribution over pi

o Can produce a “confidence interval”
• How likely is it that the estimate for the true p is correct? (see the sketch below)
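A sketch of the polling confidence interval the bullet describes, using the usual Normal-approximation interval p̂ ± 1.96·√(p̂(1−p̂)/n); the poll counts below are made up:

```python
import math

n = 100          # people asked (hypothetical poll)
yes = 56         # hypothetical number answering "yes" for candidate X
p_hat = yes / n

# CLT-based 95% confidence interval for the true p
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat = {p_hat:.2f}, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```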

Binomial Approximation

• Consider I.I.D. Bernoulli variables X1, X2, ... with probability p
§ Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
§ Let X̄ = (X1 + X2 + ... + Xn)/n, and let Y = nX̄

  X̄ ~ N(µ, σ²/n) as n → ∞
  Y ~ N(nµ, n²·σ²/n) = N(nµ, nσ²)
  Y ~ N(np, np(1 − p))
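A quick check of this Normal approximation against the exact Binomial, using scipy; the n and p values are arbitrary:

```python
from scipy import stats

n, p = 100, 0.3
binom = stats.binom(n, p)
normal = stats.norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)  # N(np, np(1-p))

# Compare P(Y <= 35) exactly vs. with the Normal approximation (continuity correction)
print("Exact Binomial:      ", binom.cdf(35))
print("Normal approximation:", normal.cdf(35.5))
```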

Your Midterm on the Central Limit Theorem

• Start with 370 midterm scores: X1, X2, ..., X370

§ E[Xi] = 87 and Var(Xi) = 225

§ Created 50 samples (Yi ) of size n = 10

§ Prediction by CLT: Ȳi ~ N(87, 22.5)
§ Standardize each sample mean: Zi = (Ȳi − E[Xi])/√(σ²/n) = (Ȳi − 87)/√22.5
§ Over the 50 samples, the observed variance of the Zi was 0.997 (close to the 1 the CLT predicts)

[Histogram of the 50 standardized Zi values, spanning roughly −3 to 3]
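A sketch reproducing this demo with simulated scores (the real 370 scores aren’t in the notes, so a Normal-ish placeholder population with mean 87 and variance 225 is used):

```python
import numpy as np

rng = np.random.default_rng(5)
# Placeholder population standing in for the 370 real midterm scores
population = rng.normal(loc=87, scale=15, size=370)

n, num_samples = 10, 50
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

# CLT prediction: sample means ~ N(87, 225/10) = N(87, 22.5)
z = (sample_means - 87) / np.sqrt(225 / 10)
print("Variance of the standardized sample means ≈", round(float(z.var(ddof=1)), 3))
```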

Estimating Clock Running Time

• Have new algorithm to test for running time
§ Mean (clock) running time: µ = t sec.
§ Variance of running time: σ² = 4 sec²
§ Run algorithm repeatedly (I.I.D. trials), measure time

o How many trials n so that the estimated time is within t ± 0.5 with 95% certainty?

o Xi = running time of i-th run (for 1 ≤ i ≤ n)

  0.95 = P(−0.5 ≤ (X1 + X2 + ... + Xn)/n − t ≤ 0.5)

o By the Central Limit Theorem, Zn ~ N(0, 1), where:

  Zn = (X1 + X2 + ... + Xn − nµ)/(σ√n) = (X1 + X2 + ... + Xn − nt)/(2√n)

o Rewriting the event in terms of Zn (multiply through by n and divide by 2√n):

  0.95 = P(−0.5√n/2 ≤ Zn ≤ 0.5√n/2) = P(−√n/4 ≤ Zn ≤ √n/4) = Φ(√n/4) − Φ(−√n/4) = 2Φ(√n/4) − 1

o So Φ(√n/4) = 0.975, giving √n/4 = Φ⁻¹(0.975) = 1.96, and n ≈ 61.4 (so run 62 trials)
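The same n can be computed directly; a sketch using scipy’s Normal inverse CDF (ppf):

```python
import math
from scipy import stats

sigma = 2.0        # std dev of a single run (sqrt of 4 sec^2)
half_width = 0.5   # want the estimate within ±0.5 sec
confidence = 0.95

z = stats.norm.ppf(1 - (1 - confidence) / 2)   # Φ⁻¹(0.975) ≈ 1.96
n = (z * sigma / half_width) ** 2              # n = (zσ / 0.5)² ≈ 61.4
print(f"z = {z:.2f}, n ≈ {n:.1f}  →  run {math.ceil(n)} trials")
```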

William Sealy Gosset (aka Student)

Estimating Time With Chebyshev

• Have new algorithm to test for running time
§ Mean (clock) running time: µ = t sec.
§ Variance of running time: σ² = 4 sec²
§ Run algorithm repeatedly (I.I.D. trials), measure time

o How many trials n so that the estimated time is within t ± 0.5 with 95% certainty?
o Xi = running time of i-th run (for 1 ≤ i ≤ n), and XS = (X1 + X2 + ... + Xn)/n
§ What would Chebyshev say?  P(|XS − µS| ≥ k) ≤ σS²/k²

  µS = E[(X1 + X2 + ... + Xn)/n] = t
  σS² = Var((X1 + X2 + ... + Xn)/n) = ∑ Var(Xi/n) = nσ²/n² = 4/n

  P(|(X1 + X2 + ... + Xn)/n − t| ≥ 0.5) ≤ (4/n)/(0.5)² = 16/n = 0.05  ⇒  n ≥ 320

Thanks for playing, Pafnuty!
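The corresponding computation for the Chebyshev bound, for comparison with the ~62 trials from the CLT above (a sketch; the constants mirror the slide):

```python
sigma_sq = 4.0       # variance of a single run (sec^2)
half_width = 0.5     # want the estimate within ±0.5 sec
failure_prob = 0.05  # 1 - 0.95

# Chebyshev: P(|mean - t| >= 0.5) <= (sigma_sq / n) / 0.5^2 = 16/n, want this <= 0.05
n = (sigma_sq / half_width**2) / failure_prob
print("Chebyshev needs n >=", n)   # 320, vs. ~62 from the CLT
```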

Crashing Your Website

• Number of visitors to web site/minute: X ~ Poi(100)
§ Server crashes if ≥ 120 requests/minute
§ What is P(crash in next minute)?
§ Exact solution: P(X ≥ 120) = ∑_{i=120}^∞ e^(−100)·(100)^i / i! ≈ 0.0282
§ Use CLT, where Poi(100) ~ ∑_{i=1}^n Poi(100/n) (all I.I.D.), approximating X by Y ~ N(100, 100):

  P(X ≥ 120) = P(Y ≥ 119.5) = P( (Y − 100)/√100 ≥ (119.5 − 100)/√100 ) = 1 − Φ(1.95) ≈ 0.0256

o Note: Normal can be used to approximate Poisson
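A check of both numbers with scipy (exact Poisson tail vs. Normal approximation with continuity correction):

```python
from scipy import stats

lam = 100
exact = 1 - stats.poisson.cdf(119, mu=lam)                    # P(X >= 120) exactly
approx = 1 - stats.norm.cdf(119.5, loc=lam, scale=lam**0.5)   # continuity-corrected Normal

print("Exact Poisson:        ", round(float(exact), 4))   # ≈ 0.0282
print("Normal approximation: ", round(float(approx), 4))  # ≈ 0.0256
```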

It’s play time!

Sum of Dice

• You will roll 10 6-sided dice (X1, X2, …, X10)

§ X = total value of all 10 dice = X1 + X2 + … + X10
§ Win if: X ≤ 25 or X ≥ 45
§ Roll!

• And now the truth (according to the CLT)…

Sum of Dice

• You will roll 10 6-sided dice (X1, X2, …, X10)

§ X = total value of all 10 dice = X1 + X2 + … + X10
§ Win if: X ≤ 25 or X ≥ 45

• Recall CLT: (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞

§ Determine P(X ≤ 25 or X ≥ 45) using CLT:

  µ = E[Xi] = 3.5, σ² = Var(Xi) = 35/12

  1 − P(25.5 ≤ X ≤ 44.5) = 1 − P( (25.5 − 10(3.5))/√(35/12 · 10) ≤ (X − 10(3.5))/√(35/12 · 10) ≤ (44.5 − 10(3.5))/√(35/12 · 10) )

  ≈ 1 − (2Φ(1.76) − 1) ≈ 2(1 − 0.9608) = 0.0784
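A simulation sketch checking this answer (the exact win probability could also be computed by enumerating dice sums, but a Monte Carlo estimate is enough to see the CLT answer of ≈ 0.078 is in the right ballpark):

```python
import numpy as np

rng = np.random.default_rng(6)
rolls = rng.integers(1, 7, size=(1_000_000, 10))   # 1,000,000 games of 10 dice
totals = rolls.sum(axis=1)

win = (totals <= 25) | (totals >= 45)
print("Simulated P(win) ≈", win.mean())            # close to the CLT estimate 0.0784
```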

Wonderful Form of Cosmic Order

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "[Central Limit Theorem]". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

- Sir Francis Galton

[Closing slide: midterm distribution recap with grade cutoffs (A+ through Cs), Core vs. Advanced Understanding, and the Beta fit (α = 7.1, β = 2.7); E[X] = 87, √Var(X) = 15, median = 90]