Chapter 5

Joint Distributions and Random Samples

5.1 Discrete joint distributions §5.1

Purpose: to predict or control one variable from another (e.g., cases of flu last week and this week; income and education).

Joint event: The joint event

$$\{(X, Y) = (x, y)\} = \{\omega: X(\omega) = x,\ Y(\omega) = y\} = \{X = x\} \cap \{Y = y\}.$$


Range: If X has possible values $x_1, \cdots, x_k$ and Y has possible values $y_1, \cdots, y_l$, then the range of (X, Y) is the set of all ordered pairs
$$\{(x_i, y_j):\ i = 1, \cdots, k;\ j = 1, \cdots, l\}.$$

Joint dist: Describes how probability is distributed over the ordered pairs:
$$p(x_i, y_j) = P(X = x_i, Y = y_j), \qquad \sum_i \sum_j p(x_i, y_j) = 1,$$
called the joint mass function.

Example 5.1 Discrete joint distribution

Let X = deductible amount, in $, on the auto policy and Y = deductible amount on the homeowner's policy. Assume that their distribution over a population is

Table 5.1: Joint distribution of deductible amounts

p(x, y)               y
                 0      100     200
  x     100     .20     .10     .20
        200     .05     .15     .30

(a) (Marginal) What is the (marginal) distribution of X?
$$P(X = 100) = \sum_y p(100, y) = .20 + .10 + .20 = .50,$$
$$P(X = 200) = \sum_y p(200, y) = .05 + .15 + .30 = .50.$$

(b) (Marginal)¹ What is the marginal distribution of Y?

$$P(Y = y) = \sum_x P(X = x, Y = y) = \sum_x p(x, y);$$
$$P(Y = 0) = .25, \qquad P(Y = 100) = .25, \qquad P(Y = 200) = .50.$$

¹ Small type indicates material that is less important; it is nonetheless required and assigned as reading.

(c) (Functions of two rv's) Let $Z = g(X, Y)$. Then
$$P(Z = z) = \sum_{(x,y):\ g(x,y) = z} p(x, y).$$
E.g., for $Z = X + Y$ the range is $\{100, 200, 300, 400\}$.

$$P(Z = 100) = P(X = 100, Y = 0) = .20,$$
$$P(Z = 200) = P(X = 100, Y = 100) + P(X = 200, Y = 0) = .10 + .05 = .15,$$
$$P(Z = 300) = P(X = 100, Y = 200) + P(X = 200, Y = 100) = .20 + .15 = .35,$$
$$P(Z = 400) = P(X = 200, Y = 200) = .30.$$

(d) (Expectation): $E\, g(X, Y) = \sum_x \sum_y g(x, y)\, p(x, y)$.

(e) (Additivity): $E(g_1(X) + g_2(Y)) = E g_1(X) + E g_2(Y)$. E.g.,
$$E(X + Y) = .20 \times 100 + .15 \times 200 + .35 \times 300 + .30 \times 400 = \$275,$$

while $EX = \$150$ and $EY = \$125$.
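The computations in (a)–(e) are easy to verify numerically. A minimal R sketch (the matrix p below simply encodes Table 5.1; nothing else is assumed):

> p = matrix(c(.20, .10, .20, .05, .15, .30), nrow = 2, byrow = TRUE)
> x = c(100, 200); y = c(0, 100, 200)        # row and column labels
> rowSums(p)                                 # marginal of X: 0.50 0.50
> colSums(p)                                 # marginal of Y: 0.25 0.25 0.50
> sum(outer(x, y, "+") * p)                  # E(X + Y) = 275
> sum(x * rowSums(p)); sum(y * colSums(p))   # EX = 150; EY = 125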

Example 5.2 Multinomial distribution and machine learning

Figure 5.1: Classification and the multinomial distribution (n draws fall into categories $1, 2, \ldots, r$ with probabilities $p_1, p_2, \ldots, p_r$).

Make n draws from a population with r categories. Let $X_i$ be the number of "successes" in the $i$th category. Then its pmf is given by
$$p(x_1, \cdots, x_r) = \binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{n - x_1 - \cdots - x_{r-1}}{x_r}\, p_1^{x_1} \cdots p_r^{x_r} = \frac{n!}{x_1! \cdots x_r!}\, p_1^{x_1} \cdots p_r^{x_r},$$
where $x_1 + \cdots + x_r = n$.
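In R this pmf is available as dmultinom; a quick check of the formula, with arbitrarily chosen numbers (n = 5, r = 3):

> dmultinom(c(2, 2, 1), size = 5, prob = c(0.2, 0.3, 0.5))
[1] 0.054
> factorial(5) / (factorial(2) * factorial(2) * factorial(1)) * 0.2^2 * 0.3^2 * 0.5^1
[1] 0.054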

Q1: Are $X_i$ and $X_{i+1}$ independent?

Q2: What is the distribution if 100 people are sampled as in Example 5.1 and classified according to their insurance policies?

Figure 5.2: What object is in the image?

Example 5.3 Trinomial distribution and population genetics.

In genetics, when sampling from an equilibrium population with respect to a gene with two alleles A and a, three genotypes can be observed, with proportions given below (Figure 5.3 illustrates the Hardy-Weinberg formula):

Table 5.2: Hardy-Weinberg formula

                 aa                aA                 AA
              p₁ = θ²         p₂ = 2θ(1 − θ)      p₃ = (1 − θ)²
  θ = 0.5       0.25               0.5                0.25

Probability question: If θ = 0.5 and n = 10, what is the probability that $X_1 = 2$, $X_2 = 5$ and $X_3 = 3$?
$$\text{probability} = \frac{10!}{2!\, 5!\, 3!} (.25)^2 (.5)^5 (.25)^3 = 0.0769.$$

> X = rmultinom(1000, 1, prob = c(0.25, 0.5, 0.25))   # sample 1000 from trinomial
> X[, 1:10]                                           # first 10 realizations
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    0    0    1    0    0    0    0    0     1
[2,]    0    1    0    0    0    1    1    1    0     0
[3,]    0    0    1    0    1    0    0    0    1     0
> apply(X, 1, mean)                                   # calculate relative frequencies
[1] 0.265 0.492 0.243
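The same probability can be read off from R's built-in multinomial pmf:

> dmultinom(c(2, 5, 3), size = 10, prob = c(0.25, 0.5, 0.25))
[1] 0.0769043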

Statistical question: If we observe $x_1 = 2$, $x_2 = 5$ and $x_3 = 3$, are the data consistent with the H-W formula? If so, what is θ?

Addition rule for expectation (holds without any conditions):

$$E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n).$$

Independence: X and Y are independent if $P(X = x, Y = y) = P(X = x) P(Y = y)$, i.e. $p(x, y) = p_X(x)\, p_Y(y)$, for all x and y.

Expectation of product: If X and Y are independent, then $E(XY) = E(X) E(Y)$, since
$$E(XY) = \sum_x \sum_y x y\, p_X(x)\, p_Y(y) = \sum_x x\, p_X(x) \sum_y y\, p_Y(y) = EX \cdot EY.$$
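A quick Monte Carlo check of the product rule, with X and Y generated independently (the two distributions are arbitrary choices, so only approximate equality is expected):

> x = rbinom(1e5, 10, 0.3); y = rpois(1e5, 2)   # independent; EX = 3, EY = 2
> mean(x * y)                                   # ≈ 6
> mean(x) * mean(y)                             # ≈ 6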

5.2 Joint densities §5.1

Let X and Y be continuous random variables. Then the joint density function f(x, y) is the one such that
$$P((X, Y) \in A) = \iint_A f(x, y)\, dx\, dy.$$
It indicates the likelihood of getting values near (x, y), per unit area:
$$f(x, y) \approx \frac{P(X \in x \pm \Delta x,\ Y \in y \pm \Delta y)}{4\, \Delta x\, \Delta y}.$$

Figure 5.4: The joint density function of two rv's is such that probability equals the volume under its surface (A = shaded rectangle).

Marginal density functions can be obtained by
$$f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\, dy, \qquad f_Y(y) = \int_{-\infty}^{+\infty} f(x, y)\, dx.$$

Example 5.4 Uniform distribution on a set A

The uniform distribution on a set A has the joint density

$$f(x, y) = I_A(x, y)/\text{Area}(A),$$
where $I_A(x, y)$ is the indicator function of the set A:
$$I_A(x, y) = \begin{cases} 1, & \text{if } (x, y) \in A, \\ 0, & \text{otherwise}. \end{cases}$$

If $A = \{(x, y): x^2 + y^2 \le 1\}$ is the unit disk, then
$$f_X(x) = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\, dy = \frac{2}{\pi} \sqrt{1 - x^2}, \qquad \text{for } -1 \le x \le 1.$$
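As a sanity check, this marginal density integrates to 1; numerically in R:

> integrate(function(x) (2/pi) * sqrt(1 - x^2), -1, 1)$value   # ≈ 1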

By symmetry, $f_Y(y) = f_X(y)$. Thus X and Y have identical distributions (but $P(X = Y) = 0$ in this case).

Expectation: $E\, g(X, Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(x, y)\, f(x, y)\, dx\, dy$.

Independence: X and Y are independent if

$$f(x, y) = f_X(x)\, f_Y(y), \qquad \text{for all } x \text{ and } y.$$

Rule of product: If X and Y are independent, then

E(XY) = E(X)E(Y)

since $\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x y\, f_X(x) f_Y(y)\, dx\, dy = \int_{-\infty}^{+\infty} x f_X(x)\, dx \int_{-\infty}^{+\infty} y f_Y(y)\, dy$.

Example 5.5 Independence and uncorrelatedness

If (X, Y) is uniformly distributed on the unit disk, then X and Y are not independent. But the rule of product still holds:

EXY = 0 = E(X)E(Y ).

— termed uncorrelated in the next section.
— uncorrelated but not independent (independent ⟹ uncorrelated, but not conversely).

Now suppose (X, Y) is uniformly distributed on the unit square. Then
$$f_X(x) = \int_0^1 dy = 1, \qquad \text{for } 0 \le x \le 1.$$

Similarly, $f_Y(y) = I_{[0,1]}(y)$. It is now easy to see that $f(x, y) = I_{[0,1] \times [0,1]}(x, y) = f_X(x) f_Y(y)$, i.e. X and Y are independent. Hence,

E(XY ) = E(X)E(Y ) = 1/4.
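Simulation makes the contrast between the disk and the square vivid. A sketch, generating uniform points on the disk by rejection from the square (the ≈ values indicate what to expect, not exact output):

> u = runif(1e5, -1, 1); v = runif(1e5, -1, 1)   # uniform on the square
> keep = u^2 + v^2 <= 1                          # rejection: keep points in the disk
> x = u[keep]; y = v[keep]
> cor(x, y)                                      # ≈ 0: uncorrelated on the disk
> cor(x^2, y^2)                                  # clearly negative: hence dependent
> cor(u, v)                                      # ≈ 0 on the square, where they ARE independent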

5.3 Covariance and correlation §5.2

Purpose:
• Correlation measures the strength of the linear association between two variables.
• Covariance helps compute var(X + Y):
$$\text{var}(X + Y) = E(X + Y - \mu_X - \mu_Y)^2 = \text{var}(X) + \text{var}(Y) + 2 \underbrace{E(X - \mu_X)(Y - \mu_Y)}_{\text{covariance}}.$$

Covariance: $\text{cov}(X, Y) = E(X - \mu_X)(Y - \mu_Y)$. It is equal to

cov(X,Y ) = E(XY ) − (EX)(EY ).

If X and Y are independent, then cov(X, Y) = 0.

Correlation is the covariance of the standardized variables:
$$\rho = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right) \times \left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right] = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y}.$$
When ρ = 0, X and Y are said to be uncorrelated.
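As a concrete discrete illustration, the deductibles of Example 5.1 are positively correlated; reusing p, x, y from the R sketch there:

> EXY = sum(outer(x, y) * p)                  # E(XY) = 20000
> covXY = EXY - 150 * 125                     # cov(X, Y) = 1250
> sdX = sqrt(sum(x^2 * rowSums(p)) - 150^2)   # sigma_X = 50
> sdY = sqrt(sum(y^2 * colSums(p)) - 125^2)   # sigma_Y ≈ 82.9
> covXY / (sdX * sdY)                         # rho ≈ 0.30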

Properties: ρ measures the strength of linear association.

♠ $-1 \le \rho \le 1$;

♠ The larger the |ρ|, the stronger the linear association.

♠ ρ > 0 means positive association: large x tends to associate with large y and small x tends to associate with small y. ρ < 0 means negative association.

♠ ρ is independent of units: corr(1.8X + 32, 1.5Y ) = corr(X,Y ).

Example 5.6 Bivariate normal distribution

Figure 5.5: Simulated data with sample correlations 0.026, -0.595, -0.990, 0.241, 0.970, 0.853.

" #! 1 1 x − µ 2 x − µ y − µ y − µ 2 f(x, y) = exp − 1 − 2ρ 1 2 + 2 . p 2 2 2σ1σ2π 1 − ρ 2(1 − ρ ) σ1 σ1 σ2 σ2 If X and Y are standardized, then 1  1  f(x, y) = exp − x2 − 2ρxy + y2 . 2πp1 − ρ2 2(1 − ρ2) It has the following properties:

• (Marginal dist) $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$.

• (Correlation) $\text{corr}(X, Y) = \rho$.

• (Linearity) $aX + bY \sim N(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + 2ab\rho\sigma_1\sigma_2 + b^2\sigma_2^2)$.

• (Independence) ρ = 0 if and only if X and Y are independent.
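A minimal simulation check of these properties, assuming the standard MASS package for mvrnorm:

> library(MASS)
> rho = 0.7
> Z = mvrnorm(1e5, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
> mean(Z[, 1]); sd(Z[, 1])      # marginal ≈ N(0, 1): ≈ 0 and ≈ 1
> cor(Z[, 1], Z[, 2])           # ≈ 0.7
> var(2*Z[, 1] + 3*Z[, 2])      # linearity: ≈ 2^2 + 2*2*3*0.7 + 3^2 = 21.4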

Example 5.7 A “Normal” appointment.

A man and a woman make an appointment at 12:00pm at the Palma Square. The man's arrival time is normally distributed with mean 11:55 and standard deviation 10 minutes, and the woman's arrival time is normally distributed with mean 12:05 and an SD of 8 minutes. If they can wait for each other for at most 15 minutes, what is the chance that they actually meet?

Let X = the man's arrival time relative to 12:00pm and Y = the woman's arrival time relative to 12:00pm. Then $X \sim N(-5, 10^2)$ and $Y \sim N(5, 8^2)$. Assuming X and Y are independent, $W = X - Y \sim N(-5 - 5,\ 10^2 + 8^2) = N(-10, 164)$. Then the chance is

$$P(|X - Y| \le 15) = P(-15 \le W \le 15) = \Phi\left(\frac{15 + 10}{12.81}\right) - \Phi\left(\frac{-15 + 10}{12.81}\right) = 97.45\% - 34.81\% = 62.64\%.$$
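The same number comes from one line of R, using W ∼ N(−10, 164):

> pnorm(15, mean = -10, sd = sqrt(164)) - pnorm(-15, mean = -10, sd = sqrt(164))   # ≈ 0.6264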

Addition rule for variance: If $X_1, \cdots, X_n$ are pairwise uncorrelated, then

$$\text{var}(X_1 + \cdots + X_n) = \text{var}(X_1) + \cdots + \text{var}(X_n).$$

Example 5.8 Variance of the binomial distribution

Note $X = Y_1 + \cdots + Y_n$, where $Y_1, \cdots, Y_n$ are i.i.d. Bernoulli(p). Then

$$\text{var}(X) = \text{var}(Y_1) + \cdots + \text{var}(Y_n) = npq, \qquad q = 1 - p.$$
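A one-line simulation check (n = 20 and p = 0.3 are arbitrary choices):

> var(rbinom(1e5, 20, 0.3))   # ≈ npq = 20 * 0.3 * 0.7 = 4.2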

5.4∗ Multivariate random variables §5.1

Discrete: The joint pmf of discrete rv’s (X1,X2, ··· ,Xn) is

p(x1, ··· , xn) = P (X1 = x1, ··· ,Xn = xn).

Continuous: The joint pdf f(x1, ··· , xn) is such that

$$P(a_1 \le X_1 \le b_1, \cdots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \cdots, x_n)\, dx_1 \cdots dx_n.$$

Independence: X1, ··· ,Xn are said to be independent if their pmf or pdf is equal to the product of the marginal ones:

$$P(X_1 = x_1, \cdots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n)$$
for all $x_1, \cdots, x_n$ in the ranges, or

$$f(x_1, \cdots, x_n) = f_{X_1}(x_1) \cdots f_{X_n}(x_n).$$

Conditional distribution: gives the distribution of Y given the associated characteristics $(x_1, \cdots, x_p)$:

$$p(Y = y \mid X_1 = x_1, \cdots, X_p = x_p) = \frac{P(Y = y, X_1 = x_1, \cdots, X_p = x_p)}{P(X_1 = x_1, \cdots, X_p = x_p)},$$
for all y in the range. For the continuous case,

$$f_{Y|X}(y \mid x_1, \cdots, x_p) = \frac{f(y, x_1, \cdots, x_p)}{f(x_1, \cdots, x_p)}.$$

5.5 Square root law §5.3

A random sample of size n means that $X_1, \cdots, X_n$ are independent and identically distributed (iid) rv's, like the outcomes of drawing tickets from a box with replacement. If the population has mean µ and variance σ², then

$$E(S_n) = n\mu, \qquad \text{var}(S_n) = n\sigma^2, \qquad \text{where } S_n = X_1 + \cdots + X_n.$$

Figure 5.6: Outcomes of random samples $X_1, \ldots, X_n$ from the income box; the population density is simulated as rgamma(10000, 3, 1/15).

Square root law:
$$SD(S_n) = \sqrt{n}\, \sigma, \qquad SD(\bar{X}_n) = \sigma/\sqrt{n}.$$

— Adding n measurement errors inflates the error size by $\sqrt{n}$.
— Averaging n measurement errors decreases the error size by $\sqrt{n}$.
— Sampling error depends on the sample size n, not the population size.

Figure 5.7: The statement in the quote is wrong!
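A quick simulation of the square root law, using the Gamma income box of Figure 5.6 (rate 1/15, i.e. scale 15, so σ ≈ 26):

> xbar = replicate(2000, mean(rgamma(100, 3, 1/15)))   # 2000 sample averages, n = 100
> sd(xbar)                                             # ≈ σ/√n = 26/10 = 2.6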

Example 5.9 Illustration of sampling variability.

We take 6 random samples of size n = 10 from the income box depicted in Figure 5.6. The population, simulated from $X \sim \text{Gamma}(3, 15)$, has mean $45K and SD $26K: $\mu = EX = 3 \times 15 = 45$ and $\sigma = SD(X) = \sqrt{3} \times 15 \approx 26$.

Table 5.3: An illustration of sampling variability

                                                                               ave†     SD‡
sample 1   26.49  19.43  50.46  14.49  16.26  47.86  54.07  54.10  28.15  72.34   38.36   19.85
sample 2    6.92  47.18  25.38  27.72  13.59  26.90  25.45  47.11 100.65  36.24   35.71   26.14
sample 3   21.60 130.56  64.06 115.58  83.08  71.67  19.07  25.78  89.37  40.21   66.10   39.38
sample 4   43.71  49.27 161.65  28.49  18.42  29.60  53.33  35.66  28.18 102.11   55.04   44.14
sample 5   41.36  21.56  23.57  52.37 100.38  59.69  62.44  33.25  38.69  90.12   52.34   26.54
sample 6   49.87  40.12  38.64  51.48  64.87  37.44 107.41  67.06  53.05  59.25   56.92   20.58
average                                                                            50.75   29.44ᵃ
SD                                                                                 11.61ᵇ  10.05

† This column should be distributed around µ.
‡ This column should be approximately σ.

ᵃ The average of the 6 sample SDs; should be close to σ.
ᵇ The SD of the 6 sample averages; approximately $\sigma/\sqrt{n}$.

Law of large numbers: For any given ε > 0, as n → ∞,

$$P(|\bar{X}_n - \mu| > \varepsilon) \to 0, \qquad \text{since } \text{var}(\bar{X}_n) \to 0$$

— sample average converges to population average.

5.6 Central Limit Theorem

Central Limit Theorem: No matter which population is sampled from (i.e., whatever the content of the box), the distribution of the sample average (or sum) closely follows the normal curve when n is sufficiently large:
$$P\Big\{\underbrace{\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}}_{\text{standardization}} \le x\Big\} = P\big\{\sqrt{n}(\bar{X}_n - \mu)/\sigma \le x\big\} \approx \Phi(x).$$

This can be demonstrated by simulation (homework); a minimal sketch follows.
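A minimal sketch of such a simulation (the Exp(1) population is an arbitrary choice; it has µ = σ = 1):

> z = replicate(5000, sqrt(50) * (mean(rexp(50)) - 1))   # standardized sample averages, n = 50
> hist(z, freq = FALSE)                                  # should resemble the normal curve
> curve(dnorm(x), add = TRUE)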

Simple case: $X_1, \cdots, X_n$ are i.i.d. Bernoulli(p), i.e. draws from a box with tickets 1 (probability p) and 0 (probability 1 − p). Then $S_n \sim \text{Binom}(n, p)$; Fig. 4.1 shows that it closely follows the normal curve.

Normal approximation to the binomial: The approximation is good when $\sqrt{npq} \ge 2.5$ (an alternative criterion is $n \min(p, q) \ge 10$).

• Normal approximation with continuity correction:

$$P(a \le X \le b) = P(a - 0.5 \le X \le b + 0.5) \approx \Phi\left(\frac{b + 0.5 - np}{\sqrt{npq}}\right) - \Phi\left(\frac{a - 0.5 - np}{\sqrt{npq}}\right).$$

Figure 5.8: The rule of continuity correction always makes the interval wider.

• Without continuity correction: the 0.5 terms are not used; not accurate!

Example 5.10 Let X = # heads in 100 tosses, so $X \sim \text{Bin}(100, 0.5)$.

(a) Note that $EX = 50$ and $SD(X) = \sqrt{100 \times 0.5 \times 0.5} = 5$. Thus,

$$P(X = 50) = P(49.5 \le X \le 50.5) \approx P(-0.1 \le Z \le 0.1) = 0.5398 - 0.4602 = 7.96\%.$$

(b) $P(X \ge 60) = P(X \ge 59.5) \approx 1 - \Phi(9.5/5) = 2.87\%$.

(c) $P(45 \le X \le 60) = P(44.5 \le X \le 60.5) \approx \Phi(2.1) - \Phi(-1.1) = 84.64\%$.
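For comparison, the exact binomial probabilities in R show how accurate the corrected approximation is:

> dbinom(50, 100, 0.5)                           # exact P(X = 50) ≈ 0.0796
> 1 - pbinom(59, 100, 0.5)                       # exact P(X >= 60) ≈ 0.0284
> pbinom(60, 100, 0.5) - pbinom(44, 100, 0.5)    # exact P(45 <= X <= 60) ≈ 0.8468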

Example 5.11 Normal approximation.

In a large class of size 100, students are asked to simulate drawing 400 tickets with replacement from the box

| 0 | 2 | 5 | 8 | 10 |   (five equally likely tickets)

(a) What percentage of students will get a sum of 2100 or more?

By the LLN, the percentage is approximately the probability of getting a sum ≥ 2100.

Let $X_1, \cdots, X_{400}$ be the outcomes of the 1st, 2nd, $\cdots$, 400th draws, respectively. Then $X_1, \cdots, X_{400}$ are i.i.d. with
$$P(X_1 = 0) = \frac{1}{5}, \quad P(X_1 = 2) = \frac{1}{5}, \quad \cdots, \quad P(X_1 = 10) = \frac{1}{5}.$$
We want to find
$$p = P\Big\{\sum_{i=1}^{400} X_i \ge 2100\Big\}.$$
To use the CLT, we need to compute the population mean and SD:
$$\mu = EX_i = \frac{1}{5} \times 0 + \frac{1}{5} \times 2 + \cdots + \frac{1}{5} \times 10 = 5,$$
$$EX_i^2 = \frac{1}{5}(0^2 + 2^2 + \cdots + 10^2) = 38.6.$$
Hence,

$$\text{var}(X_i) = EX_i^2 - \mu^2 = 38.6 - 5^2 = 13.6, \qquad \sigma = \sqrt{13.6} \approx 3.7.$$

Consequently,
$$p = 1 - \Phi\left(\frac{2100 - 400 \times 5}{\sqrt{400} \times 3.7}\right) = 1 - \Phi(1.35) \approx 9\%.$$

Fact: The mean and SD of an rv obtained by drawing from a box are the same as the mean and SD of the box.
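As a check on part (a), the normal approximation is one line in R (2000 = 400 × 5):

> 1 - pnorm(2100, mean = 2000, sd = sqrt(400 * 13.6))   # ≈ 0.088, i.e. about 9%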

(b) What is the likely size of the error of the estimate?

(Classifying and counting) Let $Y_i = 1$ if the i-th student gets a sum ≥ 2100, and 0 otherwise. Then $Y_1, \cdots, Y_{100}$ are i.i.d. Bernoulli(p). Let $\hat{p} = \bar{Y}$ be the proportion of students who get a sum ≥ 2100. It follows that
$$E\hat{p} = E\bar{Y} = p = 0.09,$$
and, by the square root law,
$$SD(\hat{p}) = SD(\bar{Y}) = \frac{\sqrt{pq}}{\sqrt{100}} = 2.86\%,$$
i.e. the estimation (chance) error is about 2.86%.

Example 5.12 Classifying and Counting

In a city, the average family income is $40K with an SD of $30K, and 20% of families have income ≥ $80K. A survey organization takes a random sample of 900 families.

(a) What is the chance that the sample average falls between $38K and $42K?

Let $\bar{X}$ be the sample average. By the square root law,
$$E\bar{X} = 40, \qquad SD(\bar{X}) = 30/\sqrt{900} = 1.$$

Hence, by the CLT,
$$P(38 \le \bar{X} \le 42) = \Phi\left(\frac{42 - 40}{1}\right) - \Phi\left(\frac{38 - 40}{1}\right) = \Phi(2) - \Phi(-2) \approx 95\%$$
— very likely.

(b) What is the chance that between 18% and 22% of the selected families have an income ≥ $80K (i.e. poll error ≤ 2%)?

Let $Y_i = 1$ if the i-th draw has income ≥ $80K. Then $\hat{p} = \bar{Y}$ is the proportion of selected families having income ≥ $80K, and
$$\text{prob} = P\{.18 \le \hat{p} \le .22\} = P\Big\{162 \le \sum_{i=1}^{900} Y_i \le 198\Big\}.$$
Now $EY_i = 0.2$ and $SD(Y_i) = \sqrt{0.2 \times 0.8} = 0.4$. Hence,
$$ES_{900} = 900 \times .2 = 180, \qquad SD(S_{900}) = \sqrt{900 \times .2 \times .8} = 12.$$

Using the normal approximation with continuity correction,
$$\text{prob} = \Phi\left(\frac{198.5 - 180}{12}\right) - \Phi\left(\frac{161.5 - 180}{12}\right) = 87.68\%.$$
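A quick check of this number in R:

> pnorm(198.5, mean = 180, sd = 12) - pnorm(161.5, mean = 180, sd = 12)   # ≈ 0.8768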

This method is preferable (more accurate) for this specific problem.

Conclusions:

• Chance errors or poll errors depend on the sample size n, not the population size N.

• Random sampling has the advantages of avoiding biases and of allowing chance errors (variance) to be calculated.

• An arbitrary sample is NOT a random sample: it creates unintentional biases and makes it impossible to quantify the errors.

• The statistical lesson is to avoid biases.