
Sampling Distributions


B6014: Managerial Statistics, Fall 2001, Professor Paul Glasserman, 403 Uris Hall

Sampled Data

1. So far, in all our probability calculations we have assumed that we know all quantities needed to solve the problem:

• Portfolio problems: To find the expected return and standard deviation of a portfolio, we assumed we knew the means and standard deviations of the returns of the underlying stocks.
• Potato chip example: To find the proportion of bags below the 8-ounce minimum, we assumed we knew the mean and standard deviation of the weight of chips in each bag.

In practice, these types of parameters are not given to us; we must estimate them from data.

2. Statistical analysis usually proceeds along the following lines:

(a) Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty; e.g., assume that a certain quantity follows a normal distribution.
(b) Use data to estimate the unknown parameters in the model.
(c) Plug the estimated parameters into the model in order to make predictions from the model.

The first step, picking a model, must be based on an understanding of the situation to be modeled. Which assumptions are plausible? Which are not? These questions are answered by judgement, not by precise statistical techniques.

3. Examples:

• We might assume that daily changes in a stock price follow a normal distribution. We might then use historical data to estimate the mean and standard deviation. Once we have estimates, we might use the model to predict future price ranges or to value an option on the stock.
• We might assume that demand for a fashion item is normally distributed. We might then use historical data to estimate the mean and standard deviation. Once we have estimates, we might use the model to set production levels.

4. The first step in understanding the process of estimation is understanding basic properties of sampled data and statistics, since these are the basis of estimation.

5. When we talk about sampling it is always in the context of a fixed underlying population:

• If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50 from the population of all daily changes in IBM stock.
• If we ask 150 shoppers whether or not they buy corn flakes, we have a sample of size 150 from all possible shoppers.

If the population is very large (as in these examples), we generally treat it as though it were infinite; this simplifies matters. Thus, we are primarily concerned with finite samples from infinite populations.

6. A single sample from a population is a random variable. Its distribution is the population distribution; e.g.,

• the distribution of a randomly selected daily change in IBM stock is the distribution over all daily changes;
• the probability that a randomly selected shopper buys corn flakes is the proportion of the entire population that buys corn flakes.

7. A random sample from a population is a set of randomly selected observations from that population. If {X1,...,Xn} are a random sample, then

• they are independent;
• they are identically distributed, all with the distribution of the underlying population.

8. A sample statistic is any quantity calculated from a random sample. The most familiar example of a sample statistic is the sample mean X̄, given by

X̄ = (X1 + X2 + ··· + Xn)/n.

The sample mean gives an estimate of the population mean µ = E[Xi].
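As a small illustration in Python (the population parameters 8.2 and 0.3, the sample size, and the seed are invented for the example):

```python
import random

random.seed(7)

# Hypothetical population: normal with mean 8.2 and standard
# deviation 0.3 (illustrative values only).
sample = [random.gauss(8.2, 0.3) for _ in range(100)]

# Sample mean: (X1 + X2 + ... + Xn) / n
x_bar = sum(sample) / len(sample)
print(x_bar)
```

With a sample of size 100, the estimate should land close to the population mean of 8.2.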

Distribution of the Sample Mean

1. Every sample statistic is a random variable. Randomness is introduced through the sampling mechanism.

2. As noted above, the sample mean X̄ of a random sample {X1, X2, ..., Xn} is an estimate of the population mean µ = E[Xi]. How good an estimate is it? How can we assess the uncertainty in the estimate? To answer these questions, we need to examine the sampling distribution of the sample mean; that is, the distribution of the random variable X̄.

3. We begin by assuming that the underlying population is normal with mean µ and variance σ². This means that Xi ∼ N(µ, σ²) for all i. Moreover, the Xi's are independent, since we assume we have a random sample.

4. Fact: The sum of independent normal random variables is normally distributed. The usual rules for means and variances apply: the expected value of the sum is the sum of the expected values; the variance of the sum is the sum of the variances (by independence).

5. Earlier Fact: Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.

6. Using these two facts, we find that if Xi ∼ N(µ, σ²), i = 1, ..., n, then

(X1 + X2 + ... + Xn) ∼ N(nµ, nσ²),

and

X̄ = (X1 + X2 + ... + Xn)/n ∼ N(µ, σ²/n).

We conclude that the sample mean from a normal population has a normal distribution; specifically, the sample mean from a random sample of size n from a population N(µ, σ²) has distribution N(µ, σ²/n).

7. First consequence: E[X̄] = µ. In other words, the expected value of the sample mean is the population mean; "on average" the sample mean correctly estimates the underlying mean.
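This result can be checked by simulation. Below is a sketch in Python; the values µ = 10, σ = 4, n = 25, the seed, and the trial count are arbitrary choices:

```python
import math
import random
import statistics

random.seed(1)

mu, sigma, n = 10.0, 4.0, 25   # illustrative parameters
trials = 20000

# Draw many independent random samples of size n from N(mu, sigma^2)
# and record the sample mean of each one.
means = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

# Theory says the sample mean is N(mu, sigma^2/n), so the recorded means
# should average about mu = 10 with standard deviation sigma/sqrt(n) = 0.8.
print(statistics.mean(means))
print(statistics.stdev(means))
print(sigma / math.sqrt(n))
```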

8. The standard deviation of a sample statistic is called its standard error. Thus, we have shown that the standard error of the sample mean is σ/√n, where σ is the underlying standard deviation and n is the sample size.

9. Second consequence: Because the standard error of X̄ is σ/√n, the uncertainty in this estimate decreases as the sample size n increases. (That's good.) However, the uncertainty (as measured by the standard deviation) decreases rather slowly: to cut the standard deviation in half, we need to collect four times as much data, because of the square root. (That's not so good, but that's life.)

10. Example: Suppose the population standard deviation is σ = 50. If we have a sample of size 100, the standard error is σ/√n = 50/√100 = 5. Suppose we would like to reduce the standard error to 2.5 by collecting more data. This would reduce the uncertainty in our estimate. We would need a total of 400 data points to accomplish this, because 50/√400 = 50/20 = 2.5.
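The standard-error arithmetic in the example is easy to verify directly (a minimal Python sketch, standard library only):

```python
import math

def std_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 50.0                  # population standard deviation from the example

print(std_error(sigma, 100))  # 5.0
print(std_error(sigma, 400))  # 2.5 -- halving the error took 4x the data
```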

11. Example: Suppose the number of miles driven each week by US car owners is normally distributed with a standard deviation of σ = 75 miles. Suppose we plan to estimate the population mean number of miles driven per week by US car owners using a random sample of size n = 100. What is the probability that our estimate will differ from the true value by more than 10 miles?

Denote the population mean by µ and the sample mean by X̄. We need to find P(|X̄ − µ| > 10). By symmetry of the normal distribution, this is 2 × P(X̄ − µ > 10). Standardizing, we find that this is

2 × P((X̄ − µ)/(σ/√n) > 10/(σ/√n)),

which is

2P(Z > 10/(75/√100)) = 2P(Z > 1.33) = 2 × .0918 = .1836.

Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
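The 18.36% answer can be double-checked without a normal table. The Python sketch below uses the identity P(Z > z) = erfc(z/√2)/2 from the standard library, plus a brute-force simulation (the seed and trial count are arbitrary):

```python
import math
import random

sigma, n, tol = 75.0, 100, 10.0
se = sigma / math.sqrt(n)                 # standard error = 7.5 miles

# Normal-tail calculation: P(|Xbar - mu| > 10) = 2 * P(Z > 10/7.5).
z = tol / se
p_two_sided = math.erfc(z / math.sqrt(2))  # = 2 * P(Z > z)
print(p_two_sided)                         # close to the table answer .1836

# Simulation check: take mu = 0 without loss of generality and count
# how often the sample mean of n = 100 draws misses by more than 10.
random.seed(2)
trials = 20000
hits = sum(
    abs(sum(random.gauss(0.0, sigma) for _ in range(n)) / n) > tol
    for _ in range(trials)
)
print(hits / trials)
```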

12. We now drop the assumption that the underlying population is normal, and just assume that it has mean µ and variance σ². It is still true that the sample mean X̄ has

E[X̄] = µ

and

Std Error = Std Dev(X̄) = σ/√n.

These properties do not use normality; they follow from basic properties of means and variances.

13. By the Central Limit Theorem, regardless of the underlying population, the distribution of X̄ tends towards N(µ, σ²/n) as n becomes large. In particular, for a sufficiently large sample size n,

(X̄ − µ)/(σ/√n) ≈ N(0, 1);

i.e., the standardized quantity on the left tends toward the standard normal distribution. We will use this approximation repeatedly to assess the error in X̄ as an estimate of µ.
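The Central Limit Theorem can be seen in a small simulation with a clearly non-normal population. This Python sketch (uniform population; sample size 30 and trial count are arbitrary) standardizes the sample means and compares them with N(0, 1):

```python
import math
import random
import statistics

random.seed(3)

# Population: Uniform(0, 1), which is flat, not bell-shaped.
# Its mean is 0.5 and its standard deviation is 1/sqrt(12).
mu = 0.5
sigma = 1 / math.sqrt(12)
n = 30                      # the rule-of-thumb sample size
trials = 20000

# Standardized sample means: (Xbar - mu) / (sigma / sqrt(n)).
z_values = []
for _ in range(trials):
    x_bar = sum(random.random() for _ in range(n)) / n
    z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))

# If the approximation holds, these should look like N(0, 1):
print(statistics.mean(z_values))    # near 0
print(statistics.stdev(z_values))   # near 1
# ... and about 5% should fall beyond +/- 1.96:
frac = sum(abs(z) > 1.96 for z in z_values) / trials
print(frac)
```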

14. A consequence of this is that in the example above concerning the mean number of miles driven per week, we don't need to assume that the number of miles driven per week is normally distributed (as long as our sample size n is large).

15. How large should n be for the normal approximation to be accurate? There is no simple answer (it depends on the underlying distribution), but n ≥ 30 is a reasonable rule of thumb.

16. If the underlying population is finite of size N, and if the sample size n is not a small proportion of N, we use the following small sample correction to the standard error:

Std Error(X̄) = (σ/√n) × √((N − n)/(N − 1)).

In this course, we generally assume that the underlying population is infinite or else much larger than the sample size so that the small sample correction is not needed.
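As a sketch (Python; σ and the population and sample sizes are invented for illustration), the correction can be wrapped in a small helper:

```python
import math

def std_error_corrected(sigma, n, N):
    """Standard error of the sample mean with the small sample
    correction: (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))."""
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

sigma = 50.0

# Sampling n = 100 from a population of only N = 500: the correction
# noticeably shrinks the uncorrected value of sigma/sqrt(n) = 5.0.
print(std_error_corrected(sigma, 100, 500))

# Sampling n = 100 from N = 1,000,000: the correction factor is
# essentially 1, so the result is essentially 5.0.
print(std_error_corrected(sigma, 100, 1_000_000))
```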

Sampling Distribution of the Sample Proportion

1. Consider estimating any of the following quantities:

• Proportion of voters who will vote for a third-party candidate in the next election.
• Proportion of visits to a web site that result in a sale.
• Proportion of shoppers who prefer crunchy over creamy.

In each of these examples, we are trying to estimate a population proportion. We denote a generic population proportion using the symbol p.

2. We estimate a population proportion using a sample proportion. For example, if a poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party candidate, then the sample proportion is 8.5%. The population proportion is what the poll would find if it could ask every voter in the population.

3. We denote the sample proportion using the symbol p̂, which is read "p-hat." Once we have collected a random sample, the sample proportion p̂ is known. We use it to estimate the true, unknown population proportion p.

4. The problem of estimating a proportion can be formulated as a special case of estimating a population mean. To see this, consider again the example of a poll of 1000 voters. Imagine encoding responses to a question about third-party candidates as follows: for the ith person polled,

Xi = 1, if the ith person plans to vote for a third-party candidate;
Xi = 0, otherwise.

Thus, our random sample consists of X1,...,X1000. If 85 respondents indicated that they would vote for a third-party candidate, then

X1 + X2 + ... + X1000 = 85,

because 85 of the Xi’s are equal to 1 and all the rest are equal to 0. Moreover,

X̄ = (X1 + X2 + ... + X1000)/1000 = 85/1000 = 8.5%.

Thus, the sample proportion is just a special case of the sample mean.

5. We can summarize this example as follows: the sample proportion p̂ is a special case of the sample mean X̄ when the data consists of 1's and 0's.
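The identity is immediate in code (a Python sketch mirroring the 1000-voter example above):

```python
# Encode the 1000 responses as 1's and 0's: 85 planned third-party
# votes and 915 others, as in the example.
responses = [1] * 85 + [0] * 915

# Sample mean of the 0/1 data ...
x_bar = sum(responses) / len(responses)

# ... equals the sample proportion p-hat.
p_hat = responses.count(1) / len(responses)

print(x_bar, p_hat)   # both are 0.085, i.e. 8.5%
```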

6. How good an estimate of the population proportion p is the sample proportion p̂? For example, how effective are polls and surveys? These are issues we will be examining.

7. Recall from our discussion above of sample means that E[X̄] = µ, indicating that the sample mean is correct "on average" in estimating the population mean. Similarly, E[p̂] = p, indicating that the sample proportion is also correct "on average."

8. By how much is the sample proportion p̂ likely to deviate from the true population proportion p? This is measured by the standard deviation of p̂, also called the standard error of p̂. This standard error is given by

StdErr[p̂] = √(p(1 − p)/n).

9. This standard error is greatest when p = 0.5 (the most uncertain case) and becomes 0 when p = 0 or p = 1 (i.e., when there is no uncertainty at all).

10. We could alternatively calculate the standard error by taking the standard deviation of the underlying 1's and 0's (the Xi's in our sample). The formula above is just a shortcut to the same answer.

11. If the sample size n is large, the error in our estimate is approximately normally distributed. The error is just the difference p̂ − p between the sample and population proportion. For large n,

p̂ − p ≈ N(0, p(1 − p)/n).

We sometimes write this in the alternative form

(p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1).

12. Example: Suppose that the true, unknown proportion p of voters who will vote for a third-party candidate in the next election is 9%. What is the probability that a poll of 1000 voters will find a sample proportion p̂ that differs from the true proportion by more than 2%?

We need to find P(|p̂ − p| > .02). We will use the normal approximation, so by symmetry this is 2 × P(p̂ − p > .02). Now we standardize to write this as

2 × P((p̂ − p)/√(p(1 − p)/n) > .02/√(p(1 − p)/n)),

i.e., as

2 × P(Z > .02/√(p(1 − p)/n))

with Z a standard normal. Now we plug in numbers to write this as

2 × P(Z > .02/√(.09(1 − .09)/1000)) = 2 × P(Z > 2.21).

From the normal table, we find that this is 2 × .0135 = .027. We conclude that the probability that the poll will be off by more than two percentage points is .027.
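Both the z-value and the final probability can be reproduced without a table. This Python sketch uses P(Z > z) = erfc(z/√2)/2 from the standard library; the seed and trial count are arbitrary, and since the simulation samples actual 0/1 responses rather than using the normal approximation, its answer should be close to, but not identical with, .027:

```python
import math
import random

p, n, tol = 0.09, 1000, 0.02

# Standard error of p-hat: sqrt(p(1 - p)/n).
se = math.sqrt(p * (1 - p) / n)

# Normal approximation: P(|p-hat - p| > .02) = 2 * P(Z > .02/se).
z = tol / se
p_approx = math.erfc(z / math.sqrt(2))   # = 2 * P(Z > z)
print(z)                                  # about 2.21, as in the example
print(p_approx)                           # about .027

# Simulation check: run many polls of 1000 voters and count how often
# the sample proportion misses the truth by more than 2 points.
random.seed(4)
trials = 5000
misses = sum(
    abs(sum(random.random() < p for _ in range(n)) / n - p) > tol
    for _ in range(trials)
)
print(misses / trials)
```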
