
Contents

• Sampling
• Simple random sample
• Sample mean

Sampling

• Sampling is the branch of Statistics concerned with drawing inferences about a group of objects (the population) by examining (interviewing, testing, observing, screening) a subset (the sample) of the population. In contrast, a census refers to an exhaustive examination of all objects in the population.
• Why sampling? Pros: speed; accuracy (less chance of making mistakes, as there are fewer objects to deal with); and preserving material, as some sampling is destructive by nature (e.g., testing the life of a computer card). Cons: inference from the sample carries an error, called the sampling error, because not all objects are in the sample.
• How to select a sample? Selection based on mood, convenience and so on: haphazard sampling. Selection based on probability (randomization): random sample.
• Haphazard sampling will often lead to bias, that is, a systematic error: the estimate tends to be too high or too low. More importantly, there is no way to quantify the error from such a scheme, so haphazard sampling should be avoided.
• Random sampling has the major advantage that it is usually unbiased: there is no systematic error. More importantly, because the selection is based on randomization, we can quantify the sampling error using probability.
• We shall do an activity to compare the haphazard sampling and the random sampling schemes.

Simple random sample

• Randomly draw an object from the population so that all objects are equally likely to be selected. Note the value of the object, and call it $X_1$. Replace the object and repeat the preceding procedure to get $X_2$, and so on until we have $n$ X’s; $n$ is the sample size.
• Then $X_1, X_2, \cdots, X_n$ are jointly independent.
• What is the distribution of $X_1$?
• Suppose that there are 100 figures (rectangles): 20 of them have unit area, 50 have area 2, 20 have area 3, and 10 have area 4. Hence, the frequency distribution of the area of the rectangles is given below:

  area                 1     2     3     4
  relative frequency   0.2   0.5   0.2   0.1

2S39: Class Notes/ October 11, 2000

• The above frequency distribution is called the population distribution (of the area of the 100 rectangles).
• Clearly, $X_1$, the area of the first randomly selected rectangle, must be one of the four numbers $\{1, 2, 3, 4\}$.
• Because all rectangles are equally likely to be selected, $P(X_1 = 1) = 20/100 = 0.2$, the relative frequency of 1 in the population. Similarly, $P(X_1 = x)$ equals the relative frequency of the rectangles with area equal to $x$.
• Hence, the distribution of $X_1$ is the same as the population distribution!
• What is the distribution of $X_2$? Same as the population distribution, and the same holds for all other X’s.
• The simple random sample yields $X_1, X_2, \cdots, X_n$, which are jointly independent and identically distributed (i.i.d., or iid) as the population distribution.
• From now on, if we write $X_1, X_2, \cdots, X_n$ as iid, then the common probability distribution is called the population distribution, and the X’s may be thought of as arising from simple random sampling.
• The object of sampling is to learn the population distribution! Often it suffices to know some characteristics of the population distribution, such as its mean and variance: the population mean $\mu_X = \mu$ and the population variance $\sigma_X^2 = \sigma^2$, where $X$ denotes the value of an object randomly chosen from the population.
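The claim that each draw follows the population distribution can be checked numerically. The sketch below is only an illustration, not part of the class activity: it builds the 100-rectangle population from the notes, computes the population mean and variance from the frequency table, and simulates many samples of size 2 drawn with replacement. The function names, the seed and the number of repetitions (20000) are arbitrary choices made for this example.

```python
import random
from collections import Counter

# Population from the notes: 100 rectangles with areas 1, 2, 3 and 4.
population = [1] * 20 + [2] * 50 + [3] * 20 + [4] * 10


def relative_frequencies(values):
    """Map each distinct value to its relative frequency in `values`."""
    counts = Counter(values)
    return {x: counts[x] / len(values) for x in sorted(counts)}


def simple_random_sample(pop, n, rng):
    """Draw n objects with replacement; every object is equally likely on each draw."""
    return [rng.choice(pop) for _ in range(n)]


# Population distribution, population mean and population variance.
pop_dist = relative_frequencies(population)
mu = sum(x * p for x, p in pop_dist.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pop_dist.items())

# Repeat the sampling many times and tabulate the first and second draws.
rng = random.Random(2000)  # arbitrary seed, used only for reproducibility
samples = [simple_random_sample(population, 2, rng) for _ in range(20000)]
x1_dist = relative_frequencies([s[0] for s in samples])
x2_dist = relative_frequencies([s[1] for s in samples])

print("population:    ", pop_dist)
print("X1 (simulated):", x1_dist)
print("X2 (simulated):", x2_dist)
print(f"mu = {mu:.2f}, sigma^2 = {sigma2:.2f}")
```

The simulated relative frequencies of both draws should be close to the population row of the table. For this population the mean is 2.2 and the variance is 0.76.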

Statistics

• Based on the sample $X_1, X_2, \cdots, X_n$, we may estimate the population mean $\mu$ by the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
• A function of a random sample is called a statistic. So the sample mean is a statistic. What other statistics do you know?
• A statistic is a random variable, and hence has a pdf and the associated (probability) distribution function (also known as the sampling distribution).
• The sampling distribution of a statistic may be derived analytically or by simulation (Monte Carlo study).
• Simulation works as follows: draw a random sample, compute the statistic, and repeat the procedure, say, 1000 times. The histogram of the 1000 statistics will be close to the sampling distribution of the statistic.
• The following web site illustrates the simulation approach: http://www.ruf.rice.edu/~lane/rvls.html
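The simulation recipe above can also be tried directly. The following is a minimal sketch, assuming the rectangle population from the simple random sample section and taking the sample mean as the statistic. The sample sizes (5, 20, 80), the seed and the helper name `simulate_sample_means` are choices made for this illustration only.

```python
import random
import statistics

# Rectangle population from the notes (areas 1, 2, 3, 4).
population = [1] * 20 + [2] * 50 + [3] * 20 + [4] * 10


def simulate_sample_means(pop, n, reps, rng):
    """Return `reps` sample means, each from a simple random sample of size n."""
    # rng.choices draws with replacement, matching simple random sampling.
    return [statistics.fmean(rng.choices(pop, k=n)) for _ in range(reps)]


rng = random.Random(39)  # arbitrary seed, used only for reproducibility
results = {}
for n in (5, 20, 80):
    means = simulate_sample_means(population, n, 1000, rng)
    # Center and spread of the simulated sampling distribution.
    results[n] = (statistics.fmean(means), statistics.pvariance(means))
    print(f"n = {n:3d}: mean of sample means = {results[n][0]:.3f}, "
          f"variance of sample means = {results[n][1]:.4f}")
```

A histogram of each list `means` approximates the sampling distribution of the sample mean for that sample size. The printed summaries give its center and spread, so the effect of changing `n` can be compared across the three runs.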

Sample mean

• From the preceding simulation exercise, it appears that the sample mean centers at the population mean, that is, $E(\bar{X}) = \mu_X$, and that the sampling distribution is increasingly concentrated with increasing sample size.
• These observations can be quantified as follows: $E(\bar{X}) = \mu_{\bar{X}} = \mu_X$ and $\operatorname{var}(\bar{X}) = \sigma_{\bar{X}}^2 = \sigma_X^2/n$, where $n$ is the sample size.
• Let the population variance be 20. To ensure that the sample mean has a variance equal to 2, we need to choose the sample size $n = 10$, because $\sigma_{\bar{X}}^2 = \sigma_X^2/n = 20/10 = 2$.
• But $n = 10$ does not make the standard deviation of the sample mean equal to 2, because $\sigma_{\bar{X}} = \sigma_X/\sqrt{n} = \sqrt{20}/\sqrt{10} = \sqrt{2} \approx 1.41 \neq 2$. A standard deviation of 2 corresponds to a variance of 4, which requires $n = 20/4 = 5$.
• What should the sample size $n$ be to make $\sigma_{\bar{X}} = 1$ if $\sigma_X = 10$?
• In order to prove the results that, for a random sample $X_1, X_2, \cdots, X_n$, $E(\bar{X}) = \mu_X$ and $\operatorname{var}(\bar{X}) = \sigma_X^2/n$, we need to consider two general results and their generalization.
• Two results: Let $c_1$ and $c_2$ be two constants. Then $E(c_1 X_1 + c_2 X_2) = c_1 E(X_1) + c_2 E(X_2)$. This result is true whether or not $X_1$ and $X_2$ are independent.
• Proof:
$$
\begin{aligned}
E(c_1 X_1 + c_2 X_2) &= \sum_{x_1} \sum_{x_2} (c_1 x_1 + c_2 x_2)\, f_{X_1,X_2}(x_1, x_2) \\
&= \sum_{x_1} \sum_{x_2} c_1 x_1\, f_{X_1,X_2}(x_1, x_2) + \sum_{x_1} \sum_{x_2} c_2 x_2\, f_{X_1,X_2}(x_1, x_2) \\
&= \sum_{x_1} c_1 x_1 \sum_{x_2} f_{X_1,X_2}(x_1, x_2) + \sum_{x_2} c_2 x_2 \sum_{x_1} f_{X_1,X_2}(x_1, x_2) \\
&= \sum_{x_1} c_1 x_1\, f_{X_1}(x_1) + \sum_{x_2} c_2 x_2\, f_{X_2}(x_2) \\
&= c_1 E(X_1) + c_2 E(X_2).
\end{aligned}
$$
• Example: if $X_1$ and $X_2$ have means 5 and 9, then $E(5X_1 - X_2) = 5E(X_1) - E(X_2) = 5 \cdot 5 - 9 = 16$.
• Recall the result that if $X_1$ and $X_2$ are independent of each other, then $\operatorname{var}(c_1 X_1 + c_2 X_2) = c_1^2 \operatorname{var}(X_1) + c_2^2 \operatorname{var}(X_2)$.
• Example: if $X_1$ and $X_2$ are independent with identical mean $\mu_X = 5$ and variance $\sigma_X^2 = 10$, then $Y = (X_1 + X_2)/2$ has mean $(5 + 5)/2 = 5 = \mu_X$ and variance $\operatorname{var}(X_1 + X_2)/4 = [\operatorname{var}(X_1) + \operatorname{var}(X_2)]/4 = 20/4 = 5 = \sigma_X^2/2$.
• The sum $c_1 X_1 + c_2 X_2$ is called a linear combination of $X_1$ and $X_2$.
• These two results can be extended to the case of more than two random variables. In words, the expectation of a sum of random variables is the sum of the expectations. If the random variables are independent, then the variance of the sum is the sum of the variances.
• Using this generalization, we get the two main results: the sample mean centers at the population mean, and the variance of the (sampling distribution of the) sample mean equals the population variance divided by the sample size.
• Furthermore, if $X_1, X_2, \ldots, X_n$ are independent and normally distributed, but not necessarily identically distributed, then any linear combination of the X’s is normally distributed.
• In particular, if the population distribution is $N(\mu_X, \sigma_X^2)$, then the sample mean based on a random sample of size $n$ from the population is $N(\mu_X, \sigma_X^2/n)$.
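The step from the generalization to the two main results is not written out in these notes. One way to fill it in, assuming $X_1, \ldots, X_n$ are iid with mean $\mu_X$ and variance $\sigma_X^2$ and taking every constant in the linear combination to be $c_i = 1/n$, is sketched below.

```latex
\begin{align*}
E(\bar{X})
  &= E\Bigl(\sum_{i=1}^{n} \tfrac{1}{n} X_i\Bigr)
   = \sum_{i=1}^{n} \tfrac{1}{n}\, E(X_i)
   = \tfrac{1}{n} \cdot n \mu_X
   = \mu_X, \\
\operatorname{var}(\bar{X})
  &= \operatorname{var}\Bigl(\sum_{i=1}^{n} \tfrac{1}{n} X_i\Bigr)
   = \sum_{i=1}^{n} \tfrac{1}{n^2}\, \operatorname{var}(X_i)
   = \tfrac{1}{n^2} \cdot n \sigma_X^2
   = \frac{\sigma_X^2}{n}.
\end{align*}
```

The first line uses only the expectation result, so it does not need independence. The second line uses the variance result and therefore relies on the independence of the X’s. Taking square roots gives $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$, so, for instance, $\sigma_X = 10$ and $\sigma_{\bar{X}} = 1$ correspond to $\sqrt{n} = 10$, that is, $n = 100$.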
