Probability 2: Mean and Variance of Linear Combinations

Cal Poly A.P. Statistics Workshop – February 16, 2013

Matt Carlton

Activity A: Some “miner” combinations

Two brothers, Jack and Mike, are both miners. Jack mines limestone at a nearby quarry, while Mike mines sandstone at a different quarry.

The amount of limestone Jack can mine in one day follows a Normal distribution, with a mean of 25 thousand pounds and a standard deviation of 4 thousand pounds.

The amount of sandstone Mike can mine in one day follows a Normal distribution, with a mean of 20 thousand pounds and a standard deviation of 3 thousand pounds.

Because Jack and Mike work at two different quarries, these two variables are independent.

1. Simulate 100 days of mining for both Jack and Mike, using the randNorm command.

randNorm(25,4,100) → L1

randNorm(20,3,100) → L2

2. Find the mean and standard deviation for each of these two lists of simulated values. Are they exactly equal to the parameters you specified? Are they close?

3. Create a new variable: the total amount of stone Jack and Mike mine in a day. To simulate this new variable, add your two lists together.

L1 + L2 → L3

4. Before collecting any statistics on this new variable, make some predictions: what do you think its mean will be? What do you think its standard deviation will be?

5. Find the sample mean and standard deviation for your 100 simulated values of the new variable. Were you right?

6. Find the sample variance of each of the three lists by squaring the standard deviations. What do you notice about the sample variances of your three lists?

We’ve discovered two rules: If we add two independent random variables X and Y,

E(X + Y) = E(X) + E(Y)

Var(X + Y) = Var(X) + Var(Y)

Fun fact: the mean property is true even if X and Y aren’t independent, but the variance property can fail without independence.
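
For reference, the fun fact follows from a standard identity that this handout does not otherwise use. The covariance Cov(X, Y) is background notation introduced here only for illustration:

```latex
\begin{align*}
E(X + Y) &= E(X) + E(Y) \\
\mathrm{Var}(X + Y) &= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)
\end{align*}
```

When X and Y are independent, Cov(X, Y) = 0 and the variance rule above is recovered. In the extreme dependent case Y = X, Var(X + X) = 4 Var(X), not 2 Var(X).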

7. Write a formula for the standard deviation of X + Y in terms of SD(X) and SD(Y).

8. What are the theoretical mean and standard deviation for the total amount of stone Jack and Mike mine in a day?

Jack is very competitive with his brother, so every day he keeps track of how much more stone he has mined than Mike. In other words, he records the variable

(Amount of limestone Jack mines) minus (Amount of sandstone Mike mines)

9. Create simulated values of this new variable, using your original two lists. Look at the 100 values of Jack’s new variable: What does it mean if a number is positive? What does it mean if a number is negative?

L1-L2 → L4

10. Before collecting any stats on Jack’s new variable, make some predictions: what do you think its mean will be? What do you think its standard deviation will be?

11. Find the mean and standard deviation of the 100 simulated values for Jack’s new variable. Were you right?

12. Compare the standard deviations of the two new variables you’ve created: the brothers’ total daily amount mined, and the difference between Jack and Mike’s daily amounts. What do you discover?

We’ve discovered two more rules: If we subtract two independent random variables X and Y,

E(X – Y) = E(X) – E(Y)

Var(X – Y) = Var(X + Y) = Var(X) + Var(Y)

Surprising, eh?
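
For readers without a TI calculator, the simulation in this activity can be sketched in Python. This is only an illustrative stand-in for the randNorm steps: the function names, the seed, and the use of the standard library's random and statistics modules are assumptions, not part of the workshop materials.

```python
import random
import statistics

def simulate_miners(n_days=100, seed=2013):
    """Simulate daily output (thousand pounds) for Jack and Mike, plus sum and difference."""
    rng = random.Random(seed)  # the seed is arbitrary; fixed only for reproducibility
    jack = [rng.gauss(25, 4) for _ in range(n_days)]  # plays the role of L1
    mike = [rng.gauss(20, 3) for _ in range(n_days)]  # plays the role of L2
    total = [j + m for j, m in zip(jack, mike)]       # L1 + L2 -> L3
    diff = [j - m for j, m in zip(jack, mike)]        # L1 - L2 -> L4
    return jack, mike, total, diff

def summarize(values):
    """Return (sample mean, sample SD, sample variance)."""
    return statistics.fmean(values), statistics.stdev(values), statistics.variance(values)

if __name__ == "__main__":
    names = ("Jack (L1)", "Mike (L2)", "Total (L3)", "Difference (L4)")
    for name, values in zip(names, simulate_miners()):
        mean, sd, var = summarize(values)
        print(f"{name:16s} mean={mean:6.2f}  SD={sd:5.2f}  variance={var:6.2f}")
```

With only 100 simulated days the sample statistics will wobble around the theoretical values; raising n_days makes the SDs of the total and the difference settle near the same number.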

13. Write a formula for the standard deviation of X – Y in terms of SD(X) and SD(Y).

14. What are the theoretical mean and standard deviation for the difference between the amounts of stone Jack and Mike mine in a day?

Jack is paid $4 for every thousand pounds of limestone he mines. Mike is paid $10 for every thousand pounds of sandstone he mines. Now we’re going to investigate the total amount of money the brothers make in a day.

15. Use your original two lists to simulate the total amount of money Jack and Mike make in a day.

4*L1+10*L2→L5

16. What do you think the mean of this new list will be? Check and see if you are right.

17. Hopefully, you already know that if you multiply a variable X by a positive constant a, the standard deviation is also multiplied by a: SD(aX) = a SD(X). In the formula for SD(X + Y) that you developed in Question 7, replace X with aX and Y with bY. Come up with a formula for the standard deviation of aX + bY when X and Y are independent random variables and a and b are positive constants.

18. Find the expected total amount of money the brothers make in a day. Then, use the formula you just derived to determine the theoretical standard deviation of the total amount of money the brothers make in a day. Is it close to the simulation answer?

Our final set of rules: For two independent random variables X and Y,

E(aX + bY) = a E(X) + b E(Y)

\(\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)\)

Fun fact: these rules work even if the constants a and/or b are negative.
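
The final rules can be wrapped in a small helper, sketched here in Python. The function name linear_combination is made up for illustration; the numbers are the Jack and Mike parameters from this activity.

```python
import math

def linear_combination(a, mean_x, sd_x, b, mean_y, sd_y):
    """Theoretical mean and SD of aX + bY for independent X and Y."""
    mean = a * mean_x + b * mean_y
    variance = a**2 * sd_x**2 + b**2 * sd_y**2  # variances add; SDs do not
    return mean, math.sqrt(variance)

# Jack: mean 25, SD 4; Mike: mean 20, SD 3 (thousand pounds per day)
print(linear_combination(1, 25, 4, 1, 20, 3))    # total stone, X + Y      -> (45, 5.0)
print(linear_combination(1, 25, 4, -1, 20, 3))   # difference, X - Y       -> (5, 5.0)
print(linear_combination(4, 25, 4, 10, 20, 3))   # daily pay ($), 4X + 10Y -> (300, 34.0)
```

Because the constants enter the variance squared, a negative b changes the sign of the mean contribution but not the SD, which is why the sum and the difference share the same SD.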

19. Confirm that the rule for variance in the box above is consistent with the standard deviation formula you created in Question 17.

20. The earlier rules about means and variances are just special cases of this final set of rules. What constants would a and b equal if you were interested in X – Y? Plug those constants into the final rules and make sure you get back the earlier rule for differences.

Activity B: Deriving the two-sample t statistic

Suppose we have a random sample of size n1 from a population with mean µ1 and standard deviation σ1. We also have an independent random sample of size n2 from a population with mean µ2 and standard deviation σ2.

Let \(\bar{X}\) denote the sample average from the first sample and \(\bar{Y}\) denote the sample average from the second sample. Toward comparing the two groups, consider the quantity \(\bar{X} - \bar{Y}\).

1. What are the mean and standard deviation of the sampling distribution of \(\bar{X}\)?

2. What are the mean and standard deviation of the sampling distribution of \(\bar{Y}\)?

3. What is the mean of the sampling distribution of \(\bar{X} - \bar{Y}\)?

4. What is the standard deviation of the sampling distribution of \(\bar{X} - \bar{Y}\)?

5. If both population distributions are Normal, then the sampling distribution of \(\bar{X} - \bar{Y}\) is also Normal. Use your earlier answers to construct a standard Normal random variable.

6. Suppose you had to estimate the population standard deviations, σ1 and σ2, with sample standard deviations s1 and s2. What would your standardized variable from Question 5 become? Since you estimated σ with s (twice, in fact!), what distribution does this new standardized variable have?

Activity C: Travel abroad

[Adapted from 2008 Free Response Question #4]

Michelle, a freshman at St. Joseph’s High School, surveyed 40 randomly selected freshmen and asked each person whether s/he had traveled outside the United States. Elizabeth, a junior at St. Joseph’s, surveyed 50 randomly selected juniors and asked the same question. These two surveys were conducted independently.

Let \(p_{Fr}\) denote the proportion of all St. Joseph’s freshmen that have traveled outside the U.S., and let \(\hat{p}_{Fr}\) denote the proportion of Michelle’s sample of freshmen that have done so. Make similar definitions for \(p_{Jr}\) and \(\hat{p}_{Jr}\).

1. What are the mean and standard deviation of the sampling distribution of \(\hat{p}_{Fr}\)?

2. What are the mean and standard deviation of the sampling distribution of \(\hat{p}_{Jr}\)?

Devon, a sophomore, wants to investigate the same question but about his class (the sophomores at St. Joseph’s). Rather than conduct his own survey, however, he decides it’s reasonable just to take the average of Michelle’s and Elizabeth’s estimates:

\(\hat{p}_{Devon} = \dfrac{\hat{p}_{Fr} + \hat{p}_{Jr}}{2}\)

3. What is the mean of the sampling distribution of Devon’s estimator?

4. What is the standard deviation of the sampling distribution of Devon’s estimator?

5. Suppose that 18 out of 40 freshmen in Michelle’s sample and 35 out of 50 juniors in Elizabeth’s sample had traveled outside the United States. Calculate Devon’s estimate for the proportion of St. Joseph’s sophomores that have traveled outside the U.S., and come up with an estimate of the corresponding standard deviation.

Daniel, Devon’s twin brother, is lazy like Devon (he doesn’t want to take a survey) but has a different idea about how to estimate the proportion of all sophomores at St. Joseph’s that have traveled outside the U.S. Daniel’s idea is to combine the two surveys and take the sample proportion from the combined data set:

\(\hat{p}_{Daniel} = \dfrac{X_{Fr} + X_{Jr}}{n_{Total}}\)

where \(X_{Fr}\) is the number of freshmen in Michelle’s survey who said “yes,” \(X_{Jr}\) is the number of juniors in Elizabeth’s survey who said “yes,” and \(n_{Total}\) is the combined number of students the two girls surveyed.

6. What kind of random variable is XFr? State the name of the distribution and the numerical values of its parameters.

7. What kind of random variable is XJr? State the name of the distribution and the numerical values of its parameters.

8. What is the mean of the sampling distribution of Daniel’s estimator?

9. What is the standard deviation of Daniel’s estimator?

10. Using the same data as in Question 5, calculate Daniel’s estimate for the proportion of all St. Joseph’s sophomores that have traveled outside the U.S., and come up with an estimate of the corresponding standard deviation.

Activity D: Show me the money

[Adapted from 2012 Free Response Question #6]

The Alumni Center at Cal Poly wants to estimate the salaries of graduates that majored in what are called “stem” disciplines (science, technology, engineering, and mathematics). At Cal Poly, those students belong to two colleges, either the College of Engineering (CENG) or the College of Science and Mathematics (CSM). The Alumni Center surveyed graduates from across these two colleges: independent random samples of 120 CENG alumni and 80 CSM alumni.

Because institutional records show that CENG is twice as large as CSM, and because the salary distributions of the two colleges are liable to differ, the Alumni Center plans to use the following estimator for the mean salary of stem graduates:

\(\bar{X}_{stem} = \frac{2}{3}\,\bar{X}_{CENG} + \frac{1}{3}\,\bar{X}_{CSM}\)

Let µCENG denote the population mean salary of all CENG graduates and let σCENG denote the population standard deviation of those salaries. Define µCSM and σCSM similarly.

1. What are the mean and standard deviation of the sampling distribution of \(\bar{X}_{CENG}\)?

2. What are the mean and standard deviation of the sampling distribution of \(\bar{X}_{CSM}\)?

3. Find an expression for the mean of the sampling distribution of the estimator \(\bar{X}_{stem}\).

4. Find an expression for the standard deviation of the sampling distribution of \(\bar{X}_{stem}\).

The results of the Alumni Center survey are summarized in the table below.

College    n      Mean       SD
CENG       120    $62,150    $14,190
CSM        80     $53,800    $11,550

5. Calculate the Alumni Center’s estimate of the mean salary of Cal Poly “stem” alumni.

6. Come up with an estimated standard deviation for their statistic.

Solutions

Activity A: Some “miner” combinations

1. Commands are randNorm(25,4,100)→L1 and randNorm(20,3,100)→L2

2. answers will vary, but statistics should be “close” to the parameters

3. L1+L2→L3

4. predictions will vary; students are liable to guess the mean correctly but not sd (many will guess that the SDs add)

5. see 4

6. students should find that the variance of L3 is close to the sum of the variances of L1 and L2

7. Var(X + Y) = Var(X) + Var(Y) implies \(\mathrm{SD}(X + Y) = \sqrt{\mathrm{SD}(X)^2 + \mathrm{SD}(Y)^2}\). This is sometimes called the Pythagorean Theorem in statistics.

8. E(X + Y) = E(X) + E(Y) = 25 + 20 = 45 thousand pounds; \(\mathrm{SD}(X + Y) = \sqrt{\mathrm{SD}(X)^2 + \mathrm{SD}(Y)^2} = \sqrt{4^2 + 3^2} = 5\) thousand pounds

9. L1-L2→L4; a single number in L4 is positive if, on that day, Jack mined more stone than Mike; a single number in L4 is negative if, on that day, Jack mined less stone than Mike

10. predictions will vary; again, students tend to get the mean right but still get the sd wrong (in light of the first exercises, students might think to subtract the squares of the SDs)

11. see 10

12. students should discover that the standard deviations of L3 and L4 are quite close

13. Var(X – Y) = Var(X) + Var(Y) implies \(\mathrm{SD}(X - Y) = \sqrt{\mathrm{SD}(X)^2 + \mathrm{SD}(Y)^2}\)

14. E(X – Y) = E(X) – E(Y) = 25–20 = 5 thousand pounds; SD(X – Y) = SD(X + Y) = 5 thousand pounds

15. 4*L1+10*L2→L5

16. students will often get this right—this attunes them a little to linear combinations

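17. \(\mathrm{SD}(aX + bY) = \sqrt{\mathrm{SD}(aX)^2 + \mathrm{SD}(bY)^2} = \sqrt{[a\,\mathrm{SD}(X)]^2 + [b\,\mathrm{SD}(Y)]^2} = \sqrt{a^2\,\mathrm{SD}(X)^2 + b^2\,\mathrm{SD}(Y)^2}\)

18. E(4X + 10Y) = 4E(X) + 10E(Y) = 4(25) + 10(20) = $300; \(\mathrm{SD}(4X + 10Y) = \sqrt{4^2\,\mathrm{SD}(X)^2 + 10^2\,\mathrm{SD}(Y)^2} = \sqrt{16(4)^2 + 100(3)^2} = \sqrt{1156}\) = $34

19. From \(\mathrm{SD}(aX + bY) = \sqrt{a^2\,\mathrm{SD}(X)^2 + b^2\,\mathrm{SD}(Y)^2}\), squaring both sides gives \(\mathrm{Var}(aX + bY) = a^2\,\mathrm{SD}(X)^2 + b^2\,\mathrm{SD}(Y)^2 = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)\)

20. a = 1, b = –1; E(X – Y) = 1E(X) + (–1)E(Y) = E(X) – E(Y); \(\mathrm{SD}(X - Y) = \sqrt{(1)^2\,\mathrm{SD}(X)^2 + (-1)^2\,\mathrm{SD}(Y)^2} = \sqrt{\mathrm{SD}(X)^2 + \mathrm{SD}(Y)^2}\)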

Activity B: Deriving the two-sample t statistic

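1. \(E(\bar{X}) = \mu_1\), \(\mathrm{SD}(\bar{X}) = \sigma_1/\sqrt{n_1}\)

2. \(E(\bar{Y}) = \mu_2\), \(\mathrm{SD}(\bar{Y}) = \sigma_2/\sqrt{n_2}\)

3. \(E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_1 - \mu_2\)

4. \(\mathrm{SD}(\bar{X} - \bar{Y}) = \sqrt{\mathrm{SD}(\bar{X})^2 + \mathrm{SD}(\bar{Y})^2} = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}\)

5. \(Z = \dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}\)

6. Replace σ with s to get the quantity \(\dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}\); by that replacement, the result has (approximately) a t distribution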
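
As a rough illustration of answer 6, the statistic can be computed from summary statistics. The sketch below assumes Python; the function name two_sample_t, the hypothesized_diff parameter (standing in for µ1 – µ2), and the numbers in the example are all made up for illustration.

```python
import math

def two_sample_t(xbar, s1, n1, ybar, s2, n2, hypothesized_diff=0.0):
    """Unpooled two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)  # estimated SD of xbar - ybar
    return ((xbar - ybar) - hypothesized_diff) / standard_error

# Made-up summary statistics, for illustration only
t = two_sample_t(xbar=10.0, s1=3.0, n1=9, ybar=8.0, s2=4.0, n2=16)
print(round(t, 3))  # 2 / sqrt(9/9 + 16/16) = 2 / sqrt(2) -> 1.414
```

The sketch stops at the statistic itself; choosing degrees of freedom for the approximate t distribution is a separate step not covered here.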

Activity C: Travel abroad

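1. \(E[\hat{p}_{Fr}] = p_{Fr}\); \(\mathrm{SD}(\hat{p}_{Fr}) = \sqrt{p_{Fr}(1 - p_{Fr})/n_{Fr}} = \sqrt{p_{Fr}(1 - p_{Fr})/40}\)

2. \(E[\hat{p}_{Jr}] = p_{Jr}\); \(\mathrm{SD}(\hat{p}_{Jr}) = \sqrt{p_{Jr}(1 - p_{Jr})/n_{Jr}} = \sqrt{p_{Jr}(1 - p_{Jr})/50}\)

3. Rewrite as \(\hat{p}_{Devon} = \tfrac{1}{2}\hat{p}_{Fr} + \tfrac{1}{2}\hat{p}_{Jr}\); then \(E[\hat{p}_{Devon}] = \tfrac{1}{2}E[\hat{p}_{Fr}] + \tfrac{1}{2}E[\hat{p}_{Jr}] = \tfrac{1}{2}p_{Fr} + \tfrac{1}{2}p_{Jr}\), or \((p_{Fr} + p_{Jr})/2\)

4. \(\mathrm{SD}(\hat{p}_{Devon}) = \sqrt{(1/2)^2\,\mathrm{SD}(\hat{p}_{Fr})^2 + (1/2)^2\,\mathrm{SD}(\hat{p}_{Jr})^2} = \sqrt{p_{Fr}(1 - p_{Fr})/160 + p_{Jr}(1 - p_{Jr})/200}\), or \(\tfrac{1}{2}\sqrt{p_{Fr}(1 - p_{Fr})/40 + p_{Jr}(1 - p_{Jr})/50}\)

5. With \(\hat{p}_{Fr}\) = 18/40 and \(\hat{p}_{Jr}\) = 35/50, \(\hat{p}_{Devon}\) = (.45+.70)/2 = .575; replacing the p parameters with \(\hat{p}\) statistics in the answer to 4, an estimate of the corresponding SD is \(\tfrac{1}{2}\sqrt{\hat{p}_{Fr}(1 - \hat{p}_{Fr})/40 + \hat{p}_{Jr}(1 - \hat{p}_{Jr})/50} = \tfrac{1}{2}\sqrt{.45(.55)/40 + .70(.30)/50} = .051\)

6. \(X_{Fr}\) is Binomial, with n = 40 and p = \(p_{Fr}\)

7. \(X_{Jr}\) is Binomial, with n = 50 and p = \(p_{Jr}\)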

8. Rewrite as \(\hat{p}_{Daniel} = \tfrac{1}{90}X_{Fr} + \tfrac{1}{90}X_{Jr}\); then \(E[\hat{p}_{Daniel}] = \tfrac{1}{90}E[X_{Fr}] + \tfrac{1}{90}E[X_{Jr}] = \tfrac{1}{90}(40\,p_{Fr}) + \tfrac{1}{90}(50\,p_{Jr}) = (40\,p_{Fr} + 50\,p_{Jr})/90\) [notice that \(n_{Total}\) = 40+50 = 90; the mean of a binomial is np]

9. \(\mathrm{SD}(\hat{p}_{Daniel}) = \sqrt{(1/90)^2\,\mathrm{SD}(X_{Fr})^2 + (1/90)^2\,\mathrm{SD}(X_{Jr})^2} = \tfrac{1}{90}\sqrt{40\,p_{Fr}q_{Fr} + 50\,p_{Jr}q_{Jr}}\) [the SD squared, or variance, of binomial is npq]

10. \(\hat{p}_{Daniel}\) = (18 + 35)/90 = .5889; replacing p with \(\hat{p}\) in the answer to 9, an estimated standard error is \(\tfrac{1}{90}\sqrt{40(.45)(.55) + 50(.70)(.30)} = .0502\)
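
The arithmetic in answers 5 and 10 can be checked with a short script. This is a sketch in Python with hypothetical function names (devon_estimate, daniel_estimate); it simply plugs the sample proportions into the formulas from answers 4 and 9.

```python
import math

def devon_estimate(x_fr, n_fr, x_jr, n_jr):
    """Average of the two sample proportions, with its estimated SD."""
    p_fr, p_jr = x_fr / n_fr, x_jr / n_jr
    estimate = (p_fr + p_jr) / 2
    se = 0.5 * math.sqrt(p_fr * (1 - p_fr) / n_fr + p_jr * (1 - p_jr) / n_jr)
    return estimate, se

def daniel_estimate(x_fr, n_fr, x_jr, n_jr):
    """Sample proportion from the combined data, with its estimated SD."""
    p_fr, p_jr = x_fr / n_fr, x_jr / n_jr
    n_total = n_fr + n_jr
    estimate = (x_fr + x_jr) / n_total
    # binomial variance is n*p*(1-p); dividing the count by n_total scales the SD by 1/n_total
    se = math.sqrt(n_fr * p_fr * (1 - p_fr) + n_jr * p_jr * (1 - p_jr)) / n_total
    return estimate, se

for name, estimator in (("Devon", devon_estimate), ("Daniel", daniel_estimate)):
    estimate, se = estimator(18, 40, 35, 50)
    print(f"{name}: estimate={estimate:.4f}, estimated SD={se:.4f}")
```

The two estimators weight the classes differently (1/2 and 1/2 for Devon, 40/90 and 50/90 for Daniel), which is why both the estimates and the estimated SDs differ slightly.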

Activity D: Show me the money

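1. \(E(\bar{X}_{CENG}) = \mu_{CENG}\); \(\mathrm{SD}(\bar{X}_{CENG}) = \sigma_{CENG}/\sqrt{120}\)

2. \(E(\bar{X}_{CSM}) = \mu_{CSM}\); \(\mathrm{SD}(\bar{X}_{CSM}) = \sigma_{CSM}/\sqrt{80}\)

3. \(E(\bar{X}_{stem}) = \tfrac{2}{3}E(\bar{X}_{CENG}) + \tfrac{1}{3}E(\bar{X}_{CSM}) = \tfrac{2}{3}\mu_{CENG} + \tfrac{1}{3}\mu_{CSM}\)

4. \(\mathrm{SD}(\bar{X}_{stem}) = \sqrt{(2/3)^2\,\mathrm{SD}(\bar{X}_{CENG})^2 + (1/3)^2\,\mathrm{SD}(\bar{X}_{CSM})^2} = \sqrt{(4/9)\,\sigma_{CENG}^2/120 + (1/9)\,\sigma_{CSM}^2/80}\), or \(\sqrt{\sigma_{CENG}^2/270 + \sigma_{CSM}^2/720}\)

5. \(\bar{X}_{stem}\) = 2/3($62,150) + 1/3($53,800) = $59,367

6. Replacing every σ with s in the answer to 4, an estimated standard error for \(\bar{X}_{stem}\) is \(\sqrt{s_{CENG}^2/270 + s_{CSM}^2/720} = \sqrt{14190^2/270 + 11550^2/720}\) = $965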
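
Answers 5 and 6 follow the same pattern as the earlier activities: a weighted sum of independent sample means. A minimal Python sketch is below; the function name weighted_mean_estimate and its tuple-based interface are hypothetical, chosen only to show the pattern.

```python
import math

def weighted_mean_estimate(weights, means, sds, sizes):
    """Estimate and estimated SE of sum(w_i * xbar_i) for independent samples."""
    estimate = sum(w * m for w, m in zip(weights, means))
    # each sample mean has variance s^2 / n, and each weight enters squared
    variance = sum(w**2 * s**2 / n for w, s, n in zip(weights, sds, sizes))
    return estimate, math.sqrt(variance)

estimate, se = weighted_mean_estimate(
    weights=(2 / 3, 1 / 3),   # CENG is twice as large as CSM
    means=(62150, 53800),     # sample mean salaries: CENG, CSM
    sds=(14190, 11550),       # sample SDs: CENG, CSM
    sizes=(120, 80),          # sample sizes: CENG, CSM
)
print(f"estimate = ${estimate:,.0f}, estimated SE = ${se:,.0f}")
```

The weights come from the population sizes of the two colleges, not from the sample sizes, so the 120 and 80 only enter through the standard error.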
