
8.0 Lesson Plan

• Answer Questions
• Box Models
• The Central Limit Theorem
• Simple Random Samples

8.1 Box Models

Recall that a Box Model describes a random process in terms of making repeated draws from a box containing numbers. For continuous distributions, the box contains an uncountably infinite number of numbers.

In a box model, draws are made at random with replacement, and the outcomes in a series of draws are independent. The value on the first draw does not affect the value on the second.

Box models describe: rolls of a die, tosses of a coin (fair or unfair), and (approximately) drawing a sample of U.S. citizens by randomly chosen SSNs and asking for whom they plan to vote.

For a box model, the expected value is the average of the numbers in the box. This is equivalent to:
• ∫_{−∞}^{∞} x f(x) dx for continuous distributions
• Σ_{all x} x p(x) for discrete distributions.

For categorical data, such as H or T in coin-tossing, one averages zeroes and ones. The result is just a proportion: the number of, say, heads divided by the number of tosses.
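As a quick illustration of "the expected value is the average of the numbers in the box," here is a minimal Python sketch; the die and coin boxes are illustrative choices, not examples worked in these notes:

```python
import numpy as np

# Box model for a fair die: the expected value is just the average of the tickets.
die_box = np.array([1, 2, 3, 4, 5, 6])
print(die_box.mean())                              # 3.5

# Categorical box for coin tossing: label heads 1 and tails 0.
# Averaging the tickets gives the proportion of heads in the box.
coin_box = np.array([1, 0])
print(coin_box.mean())                             # 0.5

# The discrete formula, the sum over all x of x * p(x), gives the same answer.
values, counts = np.unique(die_box, return_counts=True)
print(np.sum(values * (counts / counts.sum())))    # 3.5
```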

The Law of Averages says that if one makes many draws from the box and averages the results, that average will converge to the expected value of the box.

But the Law of Averages says nothing about the outcome on the next draw.

The standard error for the average of n draws from a box (with replacement) is:

se = σ/√n

where σ is the standard deviation of the distribution in the box.

The standard error is the standard deviation among the averages obtained in infinitely many samples of size n. The standard error is the likely size of the difference between the average of n draws from the box and the expected value of the box.
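A small simulation makes the σ/√n formula concrete. The sketch below is a minimal illustration (using a die box as an assumed example, not part of the original notes): it draws many samples of size n and compares the standard deviation of the sample averages to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
box = np.array([1, 2, 3, 4, 5, 6])   # fair-die box
sigma = box.std()                    # sd of the box
n = 100                              # draws per sample
reps = 10_000                        # number of repeated samples

# Average of n draws (with replacement), repeated many times.
averages = rng.choice(box, size=(reps, n)).mean(axis=1)

print(averages.std())                # empirical sd of the averages
print(sigma / np.sqrt(n))            # theoretical standard error, sigma / sqrt(n)
```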

Note that as n → ∞, the standard error goes to zero. This is a partial statement of the Law of Averages. It means that the sample average converges as you take more and more draws from the box. In particular, it converges to the expected value.

The standard error for the sum of n draws (with replacement) is:

se = √nσ.

This is the standard deviation among the sums obtained in infinitely many samples of size n.

Analogously to the standard error for averages, the standard error of the sum is the likely size of the difference between the sum of n draws from the box and n times the expected value of the box.

Note that as n → ∞, this standard error does not go to zero. This means that as the number of draws increases, the likely difference between the sum of the draws and nµ gets larger, rather than smaller.
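To see the contrast with averages numerically, here is a minimal sketch (again using an assumed die box for illustration) showing the standard error of the sum growing like √n σ:

```python
import numpy as np

rng = np.random.default_rng(1)
box = np.array([1, 2, 3, 4, 5, 6])
sigma = box.std()

for n in (25, 100, 400):
    sums = rng.choice(box, size=(10_000, n)).sum(axis=1)
    # Empirical sd of the sums vs. the theoretical sqrt(n) * sigma.
    print(n, round(sums.std(), 2), round(np.sqrt(n) * sigma, 2))
```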

This concern about the sum, rather than the average, arises in contexts such as investment or gambling, where the total return from multiple trials is important, not the average return.

8.2 The Central Limit Theorem

The Central Limit Theorem is one of the high-water marks of mathematical thinking. It was worked on by James Bernoulli, Abraham de Moivre, and Alan Turing. Over the centuries, the theory improved from special cases to a very general rule.

Essentially, the Central Limit Theorem allows one to describe how accurately the Law of Averages works. Most people have a good intuitive understanding of the Law of Averages, but often it is important to determine whether a particular size of deviation between the sample average and the (usually unknown) expected value is probable or improbable. That is, what is the chance that the sample average is more than d away from the true µ?

Formally, the Central Limit Theorem for averages says:

X̄ ∼ N(µ, σ/√n)   or   (X̄ − µ)/(σ/√n) ∼ N(0, 1)

where X̄ is the average of n draws, µ is the expected value of the box, and σ is the standard deviation of the box. This notation means that the left-hand side is a random number whose distribution is approximately the one shown on the right-hand side.

Note: The approximation gets better as n gets larger.

Modifications of this formula hold for many other situations, e.g., when there is a little dependence between the draws. A version of the Central Limit Theorem holds for sums:

sum ∼ N(nµ, √n σ)   or   (sum − nµ)/(√n σ) ∼ N(0, 1),

again approximately.

Note that sum = nX̄ is just the sum of the draws from the box. (This should be obvious to everyone. From this, you can show that the CLT for averages and sums is actually the same thing.)
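As a rough check on the CLT for averages, one can compare a simulated distribution of X̄ to the N(µ, σ/√n) approximation. This is a minimal sketch; the die box and the deviation d = 0.3 are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
box = np.array([1, 2, 3, 4, 5, 6])
mu, sigma = box.mean(), box.std()
n = 100

# Simulate many sample averages of n draws with replacement.
xbar = rng.choice(box, size=(20_000, n)).mean(axis=1)

# Compare a tail probability with the normal approximation from the CLT.
d = 0.3
print((xbar > mu + d).mean())                   # simulated P[Xbar > mu + d]
print(stats.norm.sf(d / (sigma / np.sqrt(n))))  # CLT approximation P[Z > d/(sigma/sqrt(n))]
```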

This formula is useful when calculating the chance of winning a given amount of money when gambling, or getting more than a specific score on a test.

With these two central limit formulas, one can answer all sorts of practical questions.

Problem 1a: You want to estimate the average income of people in Durham. Suppose the true mean income is $42,000 with sd of $10,000. You draw a random sample of 100 households. What is the probability that your sample estimate is over the true value by $500 or more?

• What is the box model for this problem?
• What is the expected value of the box?
• What is the sd of the box?

Note that in order to solve this, we made the unreasonable assumption that we knew the true mean of the box and the sd. Later we shall relax this assumption.

P[X̄ > 42,500] = P[X̄ − µ > 500]
              = P[(X̄ − µ)/(σ/√n) > 500/(σ/√n)]
              ≈ P[Z > 500/(10,000/√100)]
              = P[Z > 0.5].

The CLT is used in the penultimate step.

From the standard normal table, we know this has chance 0.309. There is about a 31% chance of being too high by $500 or more.
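For reference, the normal-table lookup can be reproduced in Python; this is just a minimal check of the arithmetic above, using scipy's standard normal tail function.

```python
from math import sqrt
from scipy import stats

sigma, n, d = 10_000, 100, 500
z = d / (sigma / sqrt(n))    # 500 / (10,000 / 10) = 0.5
print(stats.norm.sf(z))      # P[Z > 0.5], about 0.309
```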

But this is not really the question one wants to ask in practice, nor is it the kind of information that one really has from a sample.

Problem 1b: You want to estimate the average income of people in Durham. You draw a random sample of 100 households and find the sample mean is $42,500 and the sd of your sample is $10,000. What is the approximate probability that you have overestimated the true average income of Durham by $500 or more?

First, assume that the sd of your sample equals the sd of the box. This is an approximation, and later we shall see a way to improve this.

We want to find

P[X̄ − µ > 500] = P[(X̄ − µ)/se > 500/se]
              ≈ P[Z > 500/(10,000/√100)]
              = P[Z > 0.5].

The probability is 0.309 that $42,500 is $500 too high (or more).

Problem 2: You are playing Red and Black in roulette. (A roulette wheel has 38 pockets; 18 are red, 18 are black, and 2 are green; the house takes all the money on green.) You pick either red or black; if the ball lands in the color you pick, you win a dollar. Otherwise you lose a dollar.

Suppose you make 100 plays. What is the chance that you lose $10 or more?

What is the box model? There are 38 tickets: 18 are labeled 1 and the other 20 are labeled −1.

So the expected value of the box is

µ = (1/38)[1 + 1 + ··· + 1 + (−1) + (−1) + ··· + (−1)]
  = (1/38)(−2)
  = −1/19.

The standard deviation of the box is

σ = √((1/38) Σ_{i=1}^{38} x_i² − µ²)
  = √(1 − (−1/19)²)
  ≈ 0.9986.

The probability of losing $10 or more in 100 plays is

P[sum < −10] = P[sum − nµ < −10 − nµ]
             = P[(sum − nµ)/(√n σ) < (−10 − nµ)/(√n σ)]
             ≈ P[Z < (−10 − nµ)/(√n σ)]
             = P[Z < (−10 − (100)(−1/19))/(10 × 0.998614)]
             = P[Z < −0.47434].

From the standard normal table, the chance of being below -0.47434 is about 0.319.
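The same answer can be checked numerically. This is a minimal sketch; the exact binomial calculation is included only for comparison, and it relates to the continuity correction mentioned in the note that follows.

```python
from math import sqrt
from scipy import stats

n = 100
mu = -1 / 19                     # expected value of the box
sigma = sqrt(1 - (1 / 19) ** 2)  # sd of the box, about 0.9986

# Normal approximation from the CLT for sums.
z = (-10 - n * mu) / (sqrt(n) * sigma)
print(stats.norm.cdf(z))         # about 0.32, matching the table lookup above

# Exact answer: losing $10 or more means winning at most 45 of the 100 plays.
print(stats.binom.cdf(45, n, 18 / 38))   # about 0.35
```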

Note: It would have been appropriate to use the continuity correction.

8.3 Simple Random Samples

In a simple random sample of n units from a finite population,
• each unit is equally likely to be chosen,
• each pair of units is equally likely to be chosen,
• and so forth, until each n-tuplet is equally likely to be chosen.
Sampling is done without replacement.

Survey sampling is done in order to learn about a population parameter, such as a mean, proportion, or standard deviation. One wants to get both:
• an estimate of the unknown parameter, and
• a numerical description of the uncertainty in the estimate.

The uncertainty (or total survey error) in the estimate has two parts:
• bias, which is the systematic deviation of the estimate from the true value, and
• variance, which is the deviation due to chance error.

The variance portion of the total survey error is due to sampling. Sometimes it is costly to draw a simple random sample, so professional pollsters use more complicated sampling plans, e.g., multistage sampling.

If the sample size n is small compared to the population size M, one can ignore the distinction between sampling with replacement and without replacement. Then the standard error of an average is σ/√n.
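A quick simulation shows how close the two standard errors are when n is small relative to M; the exact ratio between them is the finite population correction factor given next. This is an illustrative sketch only, with a made-up income population.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, reps = 10_000, 100, 5_000
population = rng.normal(42_000, 10_000, size=M)   # hypothetical incomes

with_repl = [rng.choice(population, n, replace=True).mean() for _ in range(reps)]
without_repl = [rng.choice(population, n, replace=False).mean() for _ in range(reps)]

# The two standard errors are nearly identical when n << M.
print(np.std(with_repl), np.std(without_repl))
```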

If n is large relative to the population size M, then use the finite population correction factor to find the standard error of the average as

(σ/√n) × √((M − n)/(M − 1)).

Survey researchers have identified many kinds of bias:
• Response Bias. The phrasing of the query can affect the response. For example: "Do you feel that liberal tax-and-spend policies and irresponsible Washington insiders have hurt your long-term financial security?"
• Nonresponse Bias. People who are missed in a survey often tend to be systematically different from those who are found, and this systematic difference can be reflected in their opinions. For example: People who are not at home to answer survey phone calls may be working three jobs to make ends meet, and their view of Jeb!'s economic policies might differ from those with more leisure time.
• Household Bias. Often pollsters only talk to one person per home, so large households are underrepresented. Consider what would happen at a fraternity, for example.
• Interviewer Bias. The interviewer may not follow the complex survey instructions. (These instructions can be difficult, and perhaps the interviewer feels pressure to have many short interviews rather than one lengthy one.)
• Nonrespondent Bias. People who refuse to participate in the survey may be different from those who do. For example, consider a survey of sexual behavior in Utah.
• Liars. Some people misrepresent themselves. For example, consider the poll results in the David Duke campaign.

Pollsters and survey experts try to correct for known biases. Techniques include:
• upweighting responses from large households;
• upweighting data from poor people or those with little formal education;
• using a model, such as the continuum of nonresponse theory.

The continuum of nonresponse theory says that people who are hard to catch or unwilling to respond can have their responses estimated as a function of the difficulty in finding people or the cost of convincing them to respond.

For example, the people who are never home to answer their phone may have opinions more like those of the people whom the pollsters had to call 15 times before making contact, and less like those of the people who answered on the first call.