
8.0 Lesson Plan

• Answer Questions
• Box Models
• The Central Limit Theorem
• Simple Random Samples

8.1 Box Models

Recall that a Box Model describes a random process in terms of making repeated draws from a box containing numbers. For continuous distributions, the box contains an uncountably infinite number of numbers.

In a box model, draws are made at random with replacement, and the outcomes in a series of draws are independent. The value on the first draw does not affect the value on the second.

Box models describe: rolls of a die, tosses of a coin (fair or unfair), and (approximately) drawing a sample of U.S. citizens by randomly chosen SSNs and asking for whom they plan to vote.

For a box model, the expected value is the average of the numbers in the box. This is equivalent to:
• ∫_{−∞}^{∞} x f(x) dx for continuous distributions
• Σ_{all x} x p(x) for discrete distributions.

For categorical data, such as H or T in coin-tossing, one averages zeroes and ones. The result is just a proportion: the number of, say, heads divided by the number of tosses.
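As a quick illustration of "the expected value is the average of the numbers in the box," here is a minimal Python sketch; the die and coin boxes are illustrative choices, not examples worked in these notes:

```python
import numpy as np

# Box model for a fair die: the expected value is just the average of the tickets.
die_box = np.array([1, 2, 3, 4, 5, 6])
print(die_box.mean())                              # 3.5

# Categorical box for coin tossing: label heads 1 and tails 0.
# Averaging the tickets gives the proportion of heads in the box.
coin_box = np.array([1, 0])
print(coin_box.mean())                             # 0.5

# The discrete formula, the sum over all x of x * p(x), gives the same answer.
values, counts = np.unique(die_box, return_counts=True)
print(np.sum(values * (counts / counts.sum())))    # 3.5
```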

The Law of Averages says that if one makes many draws from the box and averages the results, that average will converge to the expected value of the box.

But the Law of Averages says nothing about the outcome on the next draw.

The standard error for the average of n draws from a box (with replacement) is:

se = σ/√n

where σ is the standard deviation of the distribution in the box.

The standard error is the standard deviation among the averages obtained in infinitely many samples of size n. The standard error is the likely size of the difference between the average of n draws from the box and the expected value of the box.
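A small simulation makes the σ/√n formula concrete. The sketch below is a minimal illustration (using a die box as an assumed example, not part of the original notes): it draws many samples of size n and compares the standard deviation of the sample averages to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
box = np.array([1, 2, 3, 4, 5, 6])   # fair-die box
sigma = box.std()                    # sd of the box
n = 100                              # draws per sample
reps = 10_000                        # number of repeated samples

# Average of n draws (with replacement), repeated many times.
averages = rng.choice(box, size=(reps, n)).mean(axis=1)

print(averages.std())                # empirical sd of the averages
print(sigma / np.sqrt(n))            # theoretical standard error, sigma / sqrt(n)
```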

Note that as n → ∞, the standard error goes to zero. This is a partial statement of the Law of Averages. It means that the sample average converges as you take more and more draws from the box. In particular, it converges to the expected value.

The standard error for the sum of n draws (with replacement) is:

se = √nσ.

This is the standard deviation among the sums obtained in infinitely many samples of size n.

Analogously to the standard error for averages, the standard error of the sum is the likely size of the difference between the sum of n draws from the box and n times the expected value of the box.

Note that as n → ∞, this standard error does not go to zero. This means that as the number of draws increases, the likely difference between the sum of the draws and nµ gets larger, rather than smaller.
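To see the contrast with averages numerically, here is a minimal sketch (again using an assumed die box for illustration) showing the standard error of the sum growing like √n σ:

```python
import numpy as np

rng = np.random.default_rng(1)
box = np.array([1, 2, 3, 4, 5, 6])
sigma = box.std()

for n in (25, 100, 400):
    sums = rng.choice(box, size=(10_000, n)).sum(axis=1)
    # Empirical sd of the sums vs. the theoretical sqrt(n) * sigma.
    print(n, round(sums.std(), 2), round(np.sqrt(n) * sigma, 2))
```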

This concern about the sum, rather than the average, arises in contexts such as investment or gambling, where the total return from multiple trials is important, not the average return.

8.2 The Central Limit Theorem

The Central Limit Theorem is one of the high-water marks of mathematical thinking. It was worked on by James Bernoulli, Abraham de Moivre, and Alan Turing. Over the centuries, the theory improved from special cases to a very general rule.

Essentially, the Central Limit Theorem allows one to describe how accurately the Law of Averages works. Most people have a good intuitive understanding of the Law of Averages, but often it is important to determine whether a particular size of deviation between the sample average and the (usually unknown) expected value is probable or improbable. That is, what is the chance that the sample average is more than d away from the true µ?

Formally, the Central Limit Theorem for averages says:

X̄ ∼ N(µ, σ/√n)   or   (X̄ − µ)/(σ/√n) ∼ N(0, 1)

where X̄ is the average of n draws, µ is the expected value of the box, and σ is the standard deviation of the box. This notation means that the left-hand side is a random number whose distribution is approximately the one shown on the right-hand side.

Note: The approximation gets better as n gets larger.

Modifications of this formula hold for many other situations, e.g., when there is a little dependence between the draws. A version of the Central Limit Theorem holds for sums:

sum ∼ N(nµ, √n σ)   or   (sum − nµ)/(√n σ) ∼ N(0, 1),

again approximately.

Note that sum = nX̄ is just the sum of the draws from the box. (This should be obvious to everyone. From this, you can show that the CLT for averages and sums is actually the same thing.)
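As a rough check on the CLT for averages, one can compare a simulated distribution of X̄ to the N(µ, σ/√n) approximation. This is a minimal sketch; the die box and the deviation d = 0.3 are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
box = np.array([1, 2, 3, 4, 5, 6])
mu, sigma = box.mean(), box.std()
n = 100

# Simulate many sample averages of n draws with replacement.
xbar = rng.choice(box, size=(20_000, n)).mean(axis=1)

# Compare a tail probability with the normal approximation from the CLT.
d = 0.3
print((xbar > mu + d).mean())                   # simulated P[Xbar > mu + d]
print(stats.norm.sf(d / (sigma / np.sqrt(n))))  # CLT approximation P[Z > d/(sigma/sqrt(n))]
```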

This formula is useful when calculating the chance of winning a given amount of money when gambling, or getting more than a specific score on a test.

With these two central limit formulas, one can answer all sorts of practical questions.

Problem 1a: You want to estimate the average income of people in Durham. Suppose the true mean income is $42,000 with sd of $10,000. You draw a random sample of 100 households. What is the probability that your sample estimate is over the true value by $500 or more?

• What is the box model for this problem?
• What is the expected value of the box?
• What is the sd of the box?

Note that in order to solve this, we made the unreasonable assumption that we knew the true mean of the box and the sd. Later we shall relax this assumption.

P[X̄ > 42,500] = P[X̄ − µ > 500]
              = P[(X̄ − µ)/(σ/√n) > 500/(σ/√n)]
              ≈ P[Z > 500/(10,000/√100)]
              = P[Z > 0.5].

The CLT is used in the penultimate step.

From the standard normal table, we know this has chance 0.309. There is about a 31% chance of being too high by $500 or more.
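For reference, the normal-table lookup can be reproduced in Python; this is just a minimal check of the arithmetic above, using scipy's standard normal tail function.

```python
from math import sqrt
from scipy import stats

sigma, n, d = 10_000, 100, 500
z = d / (sigma / sqrt(n))    # 500 / (10,000 / 10) = 0.5
print(stats.norm.sf(z))      # P[Z > 0.5], about 0.309
```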

But this is not really the question one wants to ask in practice, nor is it the kind of information that one really has from a sample.

Problem 1b: You want to estimate the average income of people in Durham. You draw a random sample of 100 households and find the sample mean is $42,500 and the sd of your sample is $10,000. What is the approximate probability that you have overestimated the true average income of Durham by $500 or more?

First, assume that the sd of your sample equals the sd of the box. This is an approximation, and later we shall see a way to improve this.

We want to find

P[X̄ − µ > 500] = P[(X̄ − µ)/se > 500/se]
              ≈ P[Z > 500/(10,000/√100)]
              = P[Z > 0.5].

The probability is 0.309 that $42,500 is $500 too high (or more).

Problem 2: You are playing Red and Black in roulette. (A roulette wheel has 38 pockets; 18 are red, 18 are black, and 2 are green; the house takes all the money on green.) You pick either red or black; if the ball lands in the color you pick, you win a dollar. Otherwise you lose a dollar.

Suppose you make 100 plays. What is the chance that you lose $10 or more?

What is the box model? There are 38 tickets: 18 are labeled 1 and the other 20 are labeled −1.

So the expected value of the box is

µ = (1/38)[1 + 1 + ··· + 1 + (−1) + (−1) + ··· + (−1)]
  = (1/38)(−2)
  = −1/19.

The standard deviation of the box is

σ = √((1/38) Σ_{i=1}^{38} x_i² − µ²)
  = √(1 − (−1/19)²)
  ≈ 0.9986.

The probability of losing $10 or more in 100 plays is

P[sum < −10] = P[sum − nµ < −10 − nµ]
             = P[(sum − nµ)/(√n σ) < (−10 − nµ)/(√n σ)]
             ≈ P[Z < (−10 − nµ)/(√n σ)]
             = P[Z < (−10 − (100)(−1/19))/(10 × 0.998614)]
             = P[Z < −0.47434].

From the standard normal table, the chance of being below -0.47434 is about 0.319.
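The same answer can be checked numerically. This is a minimal sketch; the exact binomial calculation is included only for comparison, and it relates to the continuity correction mentioned in the note that follows.

```python
from math import sqrt
from scipy import stats

n = 100
mu = -1 / 19                     # expected value of the box
sigma = sqrt(1 - (1 / 19) ** 2)  # sd of the box, about 0.9986

# Normal approximation from the CLT for sums.
z = (-10 - n * mu) / (sqrt(n) * sigma)
print(stats.norm.cdf(z))         # about 0.32, matching the table lookup above

# Exact answer: losing $10 or more means winning at most 45 of the 100 plays.
print(stats.binom.cdf(45, n, 18 / 38))   # about 0.35
```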

Note: It would have been appropriate to use the continuity correction.

8.3 Simple Random Samples

In a simple random sample of n units from a finite population,
• each unit is equally likely to be chosen,
• each pair of units is equally likely to be chosen,
• and so forth, until each n-tuplet is equally likely to be chosen.
Sampling is done without replacement.

Survey sampling is done in order to learn about a population parameter, such as a mean, proportion, or standard deviation. One wants to get both:
• an estimate of the unknown parameter, and
• a numerical description of the uncertainty in the estimate.

The uncertainty (or total survey error) in the estimate has two parts:
• bias, which is the systematic deviation of the estimate from the true value, and
• variance, which is the deviation due to chance error.

The variance portion of the total survey error is due to sampling. Sometimes it is costly to draw a simple random sample, so professional pollsters use more complicated sampling plans, e.g., multistage sampling.

If the sample size n is small compared to the population size M, one can ignore the distinction between sampling with replacement and without replacement. Then the standard error of an average is σ/√n.
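A quick simulation shows how close the two standard errors are when n is small relative to M; the exact ratio between them is the finite population correction factor given next. This is an illustrative sketch only, with a made-up income population.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, reps = 10_000, 100, 5_000
population = rng.normal(42_000, 10_000, size=M)   # hypothetical incomes

with_repl = [rng.choice(population, n, replace=True).mean() for _ in range(reps)]
without_repl = [rng.choice(population, n, replace=False).mean() for _ in range(reps)]

# The two standard errors are nearly identical when n << M.
print(np.std(with_repl), np.std(without_repl))
```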

If n is large relative to the population size M, then use the finite population correction factor to find the standard error of the average as

(σ/√n) × √((M − n)/(M − 1)).

Survey researchers have identified many kinds of bias:
• Response Bias. The phrasing of the query can affect the response. For example: "Do you feel that liberal tax-and-spend policies and irresponsible Washington insiders have hurt your long-term financial security?"
• Nonresponse Bias. People who are missed in a survey often tend to be systematically different from those who are found, and this systematic difference can be reflected in their opinions. For example: People who are not at home to answer survey phone calls may be working three jobs to make ends meet, and their view of Jeb!'s economic policies might differ from those with more leisure time.
• Household Bias. Often pollsters only talk to one person per home, so large households are underrepresented. Consider what would happen at a fraternity, for example.
• Interviewer Bias. The interviewer may not follow the complex survey instructions. (These instructions can be difficult, and perhaps the interviewer feels pressure to have many short interviews rather than one lengthy one.)
• Nonrespondent Bias. People who refuse to participate in the survey may be different from those who do. For example, consider a survey of sexual behavior in Utah.
• Liars. Some people misrepresent themselves. For example, consider the poll results in the David Duke campaign.

Pollsters and survey experts try to correct for known biases. Techniques include:
• upweighting responses from large households;
• upweighting data from poor people or those with little formal education;
• using a model, such as the continuum of nonresponse theory.

The continuum of nonresponse theory says that people who are hard to catch or unwilling to respond can have their responses estimated as a function of the difficulty in finding people or the cost of convincing them to respond.

For example, the people who are never home to answer their phone may have opinions more like those of the people whom the pollsters had to call 15 times before making contact, and less like those of the people who answered on the first call.