
Joint Distributions
Chris Piech
CS109, Stanford University

Four Prototypical Trajectories

Review

The Normal

• X is a Normal Random Variable: X ~ N(µ, σ²)
  § Density (PDF): f(x) = 1/(σ√(2π)) · e^(−(x−µ)²/(2σ²)), where −∞ < x < ∞
  § E[X] = µ
  § Var(X) = σ²
  § Also called “Gaussian”
  § Note: f(x) is symmetric about µ

Simplicity is Humble

[Figure: the normal PDF, a bell curve centered at µ with spread σ²]

* A Gaussian maximizes entropy for a given mean and variance

Anatomy of a beautiful equation

f(x) = 1/(σ√(2π)) · e^(−(x−µ)²/(2σ²))

  § (x − µ)²: the squared distance to the mean
  § e^(−...): the “exponential” that turns distance from the mean into probability density at x
  § 1/(σ√(2π)): a normalizing constant; note that sigma shows up twice

And here we are

CDF of the Standard Normal: Φ(z), a function that has been solved for numerically.
The cumulative distribution function (CDF) of any normal X ~ N(µ, σ²):

F(x) = Φ((x − µ)/σ)

Table of Φ(z) values in textbook, p. 201 and handout

Symmetry of Φ

Φ(−a) = 1 − Φ(a)

[Figure: standard normal (µ = 0); the area to the left of −a equals the area to the right of a]

Interval of Phi

P(c < Z ≤ d) = Φ(d) − Φ(c)

[Figure: standard normal (µ = 0); the area between c and d is Φ(d) − Φ(c)]

Great questions!

68% Rule: only for Gaussians?

What is the probability that a normal variable X ~ N(µ, σ²) has a value within one σ of its mean?

P(µ − σ < X < µ + σ) = Φ(1) − Φ(−1) = 2Φ(1) − 1 ≈ 0.68

Counter example: Uniform, X ~ Uni(α, β)

Var(X) = (β − α)²/12, so σ = √Var(X) = (β − α)/√12

P(µ − σ < X < µ + σ) = 2σ/(β − α) = 2/√12 ≈ 0.58  (not 0.68)
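As a quick numeric check (a sketch using only Python's standard library, not part of the original slides), the within-one-σ probability comes out near 0.68 for any normal but only near 0.58 for a uniform:

```python
from statistics import NormalDist

# Normal: P(mu - sigma < X < mu + sigma) is the same for every mu, sigma,
# so it suffices to compute it for the standard normal Z.
Z = NormalDist()                   # mu = 0, sigma = 1
p_normal = Z.cdf(1) - Z.cdf(-1)    # Phi(1) - Phi(-1), about 0.6827

# Uniform(alpha, beta): sigma = (beta - alpha) / sqrt(12), so the
# one-sigma interval covers a fraction 2 / sqrt(12) of the full range.
p_uniform = 2 / 12 ** 0.5          # about 0.5774
```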

How does python sample from a Gaussian?

from random import gauss

mean = 5
std = 1
for i in range(10):
    sample = gauss(mean, std)
    print(sample)

How does this work?

3.79317794179
5.19104589315
4.209360629
5.39633891584
7.10044176511
6.72655475942
5.51485158841
4.94570606131
6.14724644482
4.73774184354

How Does a Computer Sample Normal?

Inverse Transform Sampling

[Figure: CDF of the Standard Normal, Φ(x), plotted from x = −5 to 5, rising from 0 to 1]

Step 1: Pick a uniform number y between 0 and 1.
Step 2: Find the x such that Φ(x) = y.
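The two steps above can be sketched in Python, where `statistics.NormalDist.inv_cdf` (Python 3.8+) plays the role of "find the x such that Φ(x) = y":

```python
import random
from statistics import NormalDist

def sample_normal(mu=0.0, sigma=1.0):
    # Step 1: pick a uniform number y between 0 and 1
    y = random.random()
    # Step 2: find the x such that Phi(x) = y, i.e. apply the inverse CDF
    return NormalDist(mu, sigma).inv_cdf(y)

random.seed(0)
samples = [sample_normal(5, 1) for _ in range(10_000)]
mean = sum(samples) / len(samples)   # should be close to 5
```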

Further reading: Box–Muller transform

End Review

Normal Approximates Binomial

There is a deep reason for the Binomial/Normal similarity…

Let’s invent it

Website Testing

• 100 people are given a new website design
  § X = # people whose time on site increases
  § CEO will endorse new design if X ≥ 65
  § What is P(CEO endorses change | it has no effect)?
  § X ~ Bin(100, 0.5). Want to calculate P(X ≥ 65)

What is the variance?

E[X] = np = 50
Var(X) = np(1 − p) = 25
√(np(1 − p)) = 5

§ Use Normal approximation: Y ~ N(50, 25)

P(Y ≥ 65) = P((Y − 50)/5 ≥ (65 − 50)/5) = P(Z ≥ 3) = 1 − Φ(3) ≈ 0.002

  § Using Binomial: P(X ≥ 65) ≈ 0.0018

Website Testing

P(X ≥ 65): but what about 64.9? X is discrete, so approximate with the midpoint:

P(X ≥ 65) ≈ P(Y ≥ 64.5) ≈ 0.0018
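Both numbers above can be reproduced with the standard library (a sketch; `math.comb` requires Python 3.8+):

```python
from math import comb
from statistics import NormalDist

n, p = 100, 0.5

# Exact binomial tail: P(X >= 65)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(65, n + 1))

# Normal approximation Y ~ N(50, 25) with continuity correction: P(Y >= 64.5)
Y = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5)
approx = 1 - Y.cdf(64.5)

# Both come out around 0.0018
```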

[Figure: P(X = k) for X ~ Bin(100, 0.5) overlaid on the approximating normal density, when n = 100, p = 0.5]

Normal Approximation of Binomial

• X ~ Bin(n, p)
  § E[X] = np, Var(X) = np(1 − p)
  § Poisson approx. good: n large (> 20), p small (< 0.05)
  § For large n: X ≈ Y ~ N(E[X], Var(X)) = N(np, np(1 − p))
  § Normal approx. good: Var(X) = np(1 − p) ≥ 10

P(X = k) ≈ P(k − 1/2 < Y < k + 1/2) = Φ((k − np + 0.5)/√(np(1 − p))) − Φ((k − np − 0.5)/√(np(1 − p)))

“Continuity correction”

Continuity Correction

Discrete (e.g. Binomial)      Continuous (Normal)
probability question          probability question
x = 6                         5.5 < x < 6.5
x ≥ 6                         x > 5.5
x > 6                         x > 6.5
x < 6                         x < 5.5
x ≤ 6                         x < 6.5

Stanford Admissions

• Stanford accepts 2050 students this year
  § Each accepted student has 84% chance of attending
  § X = # students who will attend. X ~ Bin(2050, 0.84)
  § What is P(X > 1745)?
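A Python sketch of this tail probability, mirroring the continuity-corrected normal approximation worked out below (not part of the original slides):

```python
from statistics import NormalDist

n, p = 2050, 0.84
mu = n * p                         # 1722
sigma = (n * p * (1 - p)) ** 0.5   # about 16.6

# Normal approximation with continuity correction: P(X > 1745) ~ P(Y > 1745.5)
Y = NormalDist(mu, sigma)
approx = 1 - Y.cdf(1745.5)         # about 0.08
```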

np = 1722    np(1 − p) = 276    √(np(1 − p)) ≈ 16.6

  § Use Normal approximation: Y ~ N(1722, 276)

P(X > 1745) ≈ P(Y > 1745.5) = P((Y − 1722)/16.6 > (1745.5 − 1722)/16.6) = P(Z > 1.4) ≈ 0.08

Changes in Stanford Admissions

Class of 2020 Admit Rates Lowest in University History

“Fewer students were admitted to the Class of 2020 than the Class of 2019, due to the increase in Stanford’s yield rate which has increased over 5 percent in the past four years, according to Colleen Lim M.A. ’80, Director of Undergraduate Admission.”

Yield rate: 68% ten years ago, 84% last year.

More practice with continuous random variables: Exponential

Exponential

• X is an Exponential RV: X ~ Exp(λ), with rate λ > 0
  § Probability Density Function (PDF):
    f(x) = λe^(−λx) if x ≥ 0,  f(x) = 0 if x < 0
  § E[X] = 1/λ
  § Var(X) = 1/λ²
  § Cumulative distribution function (CDF), F(x) = P(X ≤ x):
    F(x) = 1 − e^(−λx), for x ≥ 0
  § Represents the time until some event occurs
    o Earthquake, request to a web server, end of a cell phone contract, etc.

Visits to a Website

• Say a visitor to your web site leaves after X minutes
  § On average, visitors leave the site after 5 minutes
  § Assume length of stay is Exponentially distributed
  § X ~ Exp(λ = 1/5), since E[X] = 1/λ = 5
  § What is P(X > 10)?
    P(X > 10) = 1 − F(10) = 1 − (1 − e^(−10λ)) = e^(−2) ≈ 0.1353
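The closed-form answer above can be cross-checked against a Monte Carlo estimate using `random.expovariate` (a sketch, not from the slides; the seed and sample count are arbitrary):

```python
import math
import random

lam = 1 / 5
F = lambda x: 1 - math.exp(-lam * x)   # Exponential CDF

p_exact = 1 - F(10)                    # e^{-2}, about 0.1353

# Monte Carlo check: fraction of simulated visitors staying over 10 minutes
random.seed(0)
samples = [random.expovariate(lam) for _ in range(100_000)]
p_mc = sum(s > 10 for s in samples) / len(samples)
```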

  § What is P(10 < X < 20)?
    P(10 < X < 20) = F(20) − F(10) = (1 − e^(−4)) − (1 − e^(−2)) = e^(−2) − e^(−4) ≈ 0.1170

Replacing Your Laptop

• X = # hours of use until your laptop dies
  § On average, laptops die after 5000 hours of use
  § X ~ Exp(λ = 1/5000), since E[X] = 1/λ = 5000
  § You use your laptop 5 hours/day
  § What is P(your laptop lasts 4 years)? That is: P(X > (5)(365)(4) = 7300)
    P(X > 7300) = 1 − F(7300) = 1 − (1 − e^(−7300/5000)) = e^(−1.46) ≈ 0.2322

  § Better plan ahead... especially if you are coterming:
    P(X > 9125) = 1 − F(9125) = e^(−1.825) ≈ 0.1612   (5-year plan)
    P(X > 10950) = 1 − F(10950) = e^(−2.19) ≈ 0.1119   (6-year plan)

Exponential is Memoryless

• X = time until some event occurs
  § X ~ Exp(λ)
  § What is P(X > s + t | X > s)?

P(X > s + t | X > s) = P(X > s + t and X > s) / P(X > s) = P(X > s + t) / P(X > s)

P(X > s + t) / P(X > s) = (1 − F(s + t)) / (1 − F(s)) = e^(−λ(s+t)) / e^(−λs) = e^(−λt) = 1 − F(t) = P(X > t)
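The algebra can also be confirmed by simulation (a sketch; the rate, times, and seed are arbitrary choices):

```python
import math
import random

random.seed(1)
lam, s, t = 0.5, 2.0, 3.0
samples = [random.expovariate(lam) for _ in range(200_000)]

# Conditional: among samples that exceeded s, the fraction exceeding s + t
survivors = [x for x in samples if x > s]
p_cond = sum(x > s + t for x in survivors) / len(survivors)

# Unconditional: P(X > t)
p_uncond = sum(x > t for x in samples) / len(samples)

# Both estimates should be close to e^{-lam * t} = e^{-1.5}
```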

So, P(X > s + t | X > s) = P(X > t)
  § After an initial period of time s, the probability of waiting another t units of time until the event is the same as it was at the start
  § “Memoryless” = no impact from the preceding period s

Continuous Random Variables

Uniform Random Variable, X ~ Uni(α, β)
All values of x between alpha and beta are equally likely.

Normal Random Variable, X ~ N(µ, σ²)
Aka Gaussian. Defined by mean and variance. Goldilocks distribution.

Exponential Random Variable, X ~ Exp(λ)
Time until an event happens. Parameterized by lambda (same as Poisson).

Beta Random Variable
How mysterious and curious. You must wait a few classes :).

Joint Distributions

Events occur with other events

Probability Table

• States all possible outcomes of several discrete variables
• Often is not “parametric”
• If # variables is > 2, you can still have a probability table, but you can’t draw it on a slide

[Table: rows are all values of B (0, 1, 2), columns are all values of A (0, 1, 2); every outcome falls into a bucket, e.g. the (A = 1, B = 1) cell holds P(A = 1, B = 1)]

Remember: “,” means “and”

It’s Complicated Demo

Go to this URL: https://goo.gl/jCMY18

Discrete Joint Mass Function

• For two discrete random variables X and Y, the Joint Probability Mass Function is:

pX ,Y (a,b) = P(X = a,Y = b)

• Marginal distributions:
    pX(a) = P(X = a) = Σy pX,Y(a, y)
    pY(b) = P(Y = b) = Σx pX,Y(x, b)

• Example: X = value of die D1, Y = value of die D2
    P(X = 1) = Σ(y=1..6) pX,Y(1, y) = Σ(y=1..6) 1/36 = 6/36 = 1/6

A Computer (or Three) In Every House

• Consider households in Silicon Valley
  § A household has C computers: C = X Macs + Y PCs
  § Assume each computer is equally likely to be a Mac or a PC

P(C = c) = 0.16 (c = 0), 0.24 (c = 1), 0.28 (c = 2), 0.32 (c = 3)

Joint PMF pX,Y (rows: Y, columns: X):

          X=0     X=1     X=2     X=3
  Y=0     0.16    0.12     ?      0.04
  Y=1     0.12    0.14    0.12    0
  Y=2     0.07    0.12    0       0
  Y=3     0.04    0       0       0

The missing entry is P(X = 2, Y = 0) = 0.07: households with C = 2 have probability 0.28, and each computer is independently a Mac or a PC, so (X, Y) = (2, 0), (1, 1), (0, 2) get 0.28 · (1/4, 1/2, 1/4) = 0.07, 0.14, 0.07.

Summing each row and column of the joint table gives the marginal distributions:

          X=0     X=1     X=2     X=3   | pY(y)
  Y=0     0.16    0.12    0.07    0.04  | 0.39
  Y=1     0.12    0.14    0.12    0     | 0.38
  Y=2     0.07    0.12    0       0     | 0.19
  Y=3     0.04    0       0       0     | 0.04
  pX(x)   0.39    0.38    0.19    0.04  | 1.00
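Marginalizing is just summing out the other variable; here is a sketch over the household table above (not from the slides):

```python
# Joint PMF p_{X,Y} from the household example, as a dict {(x, y): prob}
joint = {
    (0, 0): 0.16, (1, 0): 0.12, (2, 0): 0.07, (3, 0): 0.04,
    (0, 1): 0.12, (1, 1): 0.14, (2, 1): 0.12, (3, 1): 0.00,
    (0, 2): 0.07, (1, 2): 0.12, (2, 2): 0.00, (3, 2): 0.00,
    (0, 3): 0.04, (1, 3): 0.00, (2, 3): 0.00, (3, 3): 0.00,
}

# Marginal of X: p_X(x) = sum over y of p_{X,Y}(x, y), and symmetrically for Y
p_X = {x: sum(joint[(x, y)] for y in range(4)) for x in range(4)}
p_Y = {y: sum(joint[(x, y)] for x in range(4)) for y in range(4)}
# p_X is approximately {0: 0.39, 1: 0.38, 2: 0.19, 3: 0.04}
```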

• This is a joint

• A joint is not a mathematician § It did not start doing mathematics at an early age § It is not the reason we have “joint distributions” § And, no, Charlie Sheen does not look like a joint

o But he does have them…

    o He also has joint custody of his children with Denise Richards

What about the continuous world?

Jointly Continuous

• Random variables X and Y are Jointly Continuous if there exists a PDF fX,Y(x, y) defined over −∞ < x, y < ∞ such that:

    P(a1 < X ≤ a2, b1 < Y ≤ b2) = ∫(a1..a2) ∫(b1..b2) fX,Y(x, y) dy dx

Let’s look at one:

[Figure: surface plot of fX,Y(x, y); P(a1 < X ≤ a2, b1 < Y ≤ b2) is the volume under the surface over the rectangle [a1, a2] × [b1, b2]]

• Cumulative Distribution Function (CDF):

    FX,Y(a, b) = ∫(−∞..a) ∫(−∞..b) fX,Y(x, y) dy dx        fX,Y(a, b) = ∂²/(∂a ∂b) FX,Y(a, b)

• Marginal density functions:

    fX(a) = ∫(−∞..∞) fX,Y(a, y) dy        fY(b) = ∫(−∞..∞) fX,Y(x, b) dx

Continuous Joint Distribution Functions


• Marginal distributions:

FX(a) = P(X ≤ a) = P(X ≤ a, Y < ∞) = FX,Y(a, ∞)

FY(b) = P(Y ≤ b) = P(X < ∞, Y ≤ b) = FX,Y(∞, b)

Joint Dart Distribution

[Figure: darts thrown at a 900 × 900 pixel board, with the X-pixel and Y-pixel marginal histograms]

X ~ N(900/2, 900/3)        Y ~ N(900/2, 900/5)

Multiple Integrals Without Tears

• Let X and Y be two continuous random variables
  § where 0 ≤ X ≤ 1 and 0 ≤ Y ≤ 2
• We want to integrate g(x, y) = xy w.r.t. X and Y:
  § First, do the “innermost” integral (treat y as a constant):

∫(y=0..2) ∫(x=0..1) xy dx dy = ∫(y=0..2) ( ∫(x=0..1) xy dx ) dy = ∫(y=0..2) y·[x²/2](x=0..1) dy = ∫(y=0..2) (y/2) dy

  § Then, evaluate the remaining (single) integral:

    ∫(y=0..2) (y/2) dy = [y²/4](y=0..2) = 1 − 0 = 1

Computing Joint
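The double integral just evaluated can be sanity-checked numerically with a midpoint Riemann sum (a sketch, not from the slides; the grid size is arbitrary):

```python
# Numerically approximate the double integral of g(x, y) = x * y
# over 0 <= x <= 1, 0 <= y <= 2; the answer worked out above is 1.
n = 400
dx, dy = 1 / n, 2 / n
total = 0.0
for i in range(n):
    for j in range(n):
        x = (i + 0.5) * dx        # midpoint of grid cell in x
        y = (j + 0.5) * dy        # midpoint of grid cell in y
        total += x * y * dx * dy
# total is approximately 1.0
```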

Let FX,Y (x, y) be joint CDF for X and Y

P(X > a, Y > b) = 1 − P((X > a, Y > b)ᶜ)
               = 1 − P((X > a)ᶜ ∪ (Y > b)ᶜ)
               = 1 − P((X ≤ a) ∪ (Y ≤ b))
               = 1 − (P(X ≤ a) + P(Y ≤ b) − P(X ≤ a, Y ≤ b))

               = 1 − FX(a) − FY(b) + FX,Y(a, b)

The General Rule Given Joint CDF

Let FX,Y(x, y) be joint CDF for X and Y

P(a1 < X ≤ a2 , b1 < Y ≤ b2 )

= F(a2 ,b2 ) − F(a1,b2 ) + F(a1,b1) − F(a2 ,b1)

[Figure: the rectangle a1 < x ≤ a2, b1 < y ≤ b2 in the plane]

Lovely Lemma

• Y is a non-negative continuous random variable

  § Probability Density Function: fY(y)
  § Already knew that:
    E[Y] = ∫(−∞..∞) y fY(y) dy
  § But, did you know that:
    E[Y] = ∫(0..∞) P(Y > y) dy  ?!?

  § Analogously, in the discrete case, where X = 1, 2, …, n:
    E[X] = Σ(i=1..n) P(X ≥ i)

How this lemma was made: in the discrete case, expand Σ(i=1..n) P(X ≥ i), one row per value of i:

  P(X = 1) + P(X = 2) + P(X = 3) + … + P(X = n)
            + P(X = 2) + P(X = 3) + … + P(X = n)
                       + P(X = 3) + … + P(X = n)
                                        …
                                      + P(X = n)

Reading down the columns, P(X = k) appears exactly k times, so the sum equals

  1·P(X = 1) + 2·P(X = 2) + … + n·P(X = n) = E[X]
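For a concrete instance of the discrete version of the lemma, take X to be a fair die (a sketch using exact fractions, not from the slides):

```python
from fractions import Fraction

# Fair die: X in {1, ..., 6}, P(X = k) = 1/6
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Usual definition of expectation: E[X] = sum of k * P(X = k)
expected = sum(k * pmf[k] for k in pmf)                    # 7/2

# Lemma: E[X] = sum over i of P(X >= i)
tail_sum = sum(sum(pmf[k] for k in pmf if k >= i) for i in range(1, 7))
# expected == tail_sum == 7/2
```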

Life gives you lemmas, make lemmanade!

Imperfections on a Disk

• Disk surface is a circle of radius R
  § A single point imperfection is uniformly distributed on the disk:

    fX,Y(x, y) = 1/(πR²) if x² + y² ≤ R², and 0 if x² + y² > R²    (where −∞ < x, y < ∞)

  § Marginal of X (only integrate over the range where the density is nonzero):

    fX(x) = ∫(−∞..∞) fX,Y(x, y) dy = 1/(πR²) · ∫(−√(R²−x²)..√(R²−x²)) dy = 2√(R² − x²)/(πR²)

  § Marginal of Y is the same by symmetry

• Disk surface is a circle of radius R
  § A single point imperfection uniformly distributed on disk
  § Distance to origin: D = √(X² + Y²)
  § What is E[D]?

  Because of equally likely outcomes: P(D ≤ a) = (πa²)/(πR²) = a²/R², for 0 ≤ a ≤ R

  Using the lemma:
  E[D] = ∫(0..R) P(D > a) da = ∫(0..R) (1 − P(D ≤ a)) da = ∫(0..R) (1 − a²/R²) da
       = [a − a³/(3R²)](0..R) = R − R/3 = 2R/3
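A Monte Carlo check of E[D] = 2R/3, rejection-sampling uniform points in the disk (a sketch; R, the seed, and the sample count are arbitrary choices):

```python
import random

random.seed(0)
R = 1.0
dists = []
while len(dists) < 200_000:
    # Propose a uniform point in the bounding square; keep it if inside the disk
    x, y = random.uniform(-R, R), random.uniform(-R, R)
    if x * x + y * y <= R * R:
        dists.append((x * x + y * y) ** 0.5)

mean_d = sum(dists) / len(dists)
# mean_d is approximately 2 * R / 3
```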