Samples Can Vary - Standard Error
- Stratified Samples
- Systematic Samples
- Samples can vary
- Standard Error

From last time: A sample is a small collection we observe and assume is representative of a larger population.

Example: You haven’t seen Vancouver; you’ve only seen a small part of it. It would be infeasible to see all of Vancouver. When someone asks you ‘how is Vancouver?’, you infer to the whole population of Vancouver places using your sample.

From last time: A sample is random if every member of the population has an equal chance of being in the sample. Your Vancouver sample is not random. You’re more likely to have seen Production Station than 93rd St. in Surrey.

From last time: A simple random sample (SRS) is one where the chances of being in the sample are independent. Your Vancouver sample is not an SRS because if you’ve seen 93rd St., you’re more likely to have also seen 94th St.

A common sampling method that is random but not SRS is stratified sampling. To stratify something means to divide it into groups (geologically, into layers).

To do stratified sampling, first split the population into different groups, or strata. Often this is done naturally. Possible strata: sections of a course, gender, income level, grads/undergrads, or any category like that.

Then randomly select some of the strata. Unless you’re doing something fancy like multiple layers, the strata are selected using SRS. Within each stratum, select members of the population using SRS. If the strata are different sizes, select samples from them in proportion to their sizes.

Example: Quality testing of milk. A government agency wants to check whether the milk from a company is up to code. Several trucks are leaving the plant today, and each truck is a stratum (the singular of strata). The agency selects some of the trucks with SRS. Each truck is carrying many jugs of milk, and some jugs from each truck are selected by SRS.
One of the trucks is twice as big as the others, so twice as many jugs are sampled from that one. Therefore every jug has an equal chance of being sampled. Say they tested 50 jugs of milk from a total of 5 trucks. That’s a lot easier than stopping 50 trucks and testing 1 jug each. This is part of the appeal of stratified sampling.

Another appeal is that you can choose every stratum. (A stratum’s chance of being picked by SRS becomes 1.)

Example: Employment survey. A large company wants information about its workforce of 1000 full-time employees and 500 part-time employees. The company chooses both strata and uses SRS to select 80 from the full-time stratum and 40 from the part-time stratum. 8% of each stratum is sampled this way.

Samples can vary. Not every sample will be the same. My Vancouver sample is different from yours, which will be different from that of the person sitting next to you. You’ve all seen different parts of the city; you’ve all observed a different set of members of the population.

If samples are different, then their means are going to be different too. But no matter how many times you take a sample, it’s always from the same population. So the sample mean can change, but the population mean is always the same (unknown) number.

The sample mean $\bar{x}$, on average, is going to be the population mean μ. (The average of $\bar{x}$ is μ.) The standard deviation of $\bar{x}$ is the standard error $\sigma_{\bar{x}}$:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Here σ is the standard deviation of a single value from the population. The typical amount by which sample means differ from the true mean is the standard error. Technically, it’s the standard error of the mean, because you can have standard errors of other things too, but we’ll only look at the standard error of the mean.

The standard error is our main tool for reducing the uncertainty of our sample mean. n is the sample size. The larger n gets, the smaller $\sigma_{\bar{x}}$ gets. In other words, a bigger sample gives you a better estimate of the population mean. This should be intuitive: if you take a bigger sample, you have more information about the population.
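The idea that sample means scatter around μ with a spread of σ/√n can be checked with a small simulation. The sketch below is only an illustration and not part of the original slides: it assumes a normal population with made-up values μ = 50 and σ = 10, and the function name `sample_mean_spread` is hypothetical.

```python
import random
import statistics


def sample_mean_spread(mu, sigma, n, repeats=20000, seed=0):
    """Draw `repeats` samples of size n from a normal population and
    return the standard deviation of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


# Hypothetical population values, chosen only for illustration.
mu, sigma = 50.0, 10.0

for n in (1, 4, 25):
    observed = sample_mean_spread(mu, sigma, n)
    predicted = sigma / n ** 0.5  # the standard error formula
    print(f"n={n:2d}  observed spread={observed:.3f}  sigma/sqrt(n)={predicted:.3f}")
```

For each sample size the observed spread of the simulated sample means should land close to σ/√n, and it shrinks as n grows, while the population standard deviation σ stays fixed.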
This is important because choosing the sample size gives us some measure of control over the statistics we get. We can’t do that with the standard deviation.

Say the government agency from before knows that in regular milk, the amount of calcium is normally distributed with mean 20 mg/L and standard deviation 5 mg/L. If it samples one 1 L bottle of regular milk, the calcium concentration will have a standard error of 5 mg/L. If it samples four 1 L bottles, the mean calcium concentration will have a standard error of $5/\sqrt{4} = 2.5$ mg/L.

If the agency samples 25 one-litre bottles, the average calcium per bottle is going to be a lot closer to the true mean of 20 mg/L than it was with 4 bottles. The sample mean of 25 bottles will have a standard error of $5/\sqrt{25} = 1$ mg/L, even though the standard deviation of a single bottle is 5 mg/L.

Why does this happen? Consider: which is more likely, one bottle being above the mean, or a whole lot of bottles? In a large sample, the bottles above the mean are going to balance out with the bottles below the mean. As you get more and more bottles, the closer to a 50-50 balance you would expect. As we get closer to that 50-50 balance, the sample mean will tend to be closer and closer to the true mean. Since we become more sure of where the sample mean will be, we say it becomes less variable.

It’s why elevators can post capacity limits: the limit works out to 68 kg/person, and lots of people weigh more than 68 kg. But how often will you get a group of 26 averaging more than 68 kg/person?

Practice example: Suppose the average age when smokers begin is 17 years old, with a standard deviation of 2 years. What’s the standard error of the mean from a sample of 16 smokers? What’s the standard error of the mean of 100 smokers?

The center of the sample mean’s distribution never changes with sample size; it’s always centered around the true mean of 17.

We can expand our definition of z-score from something that pertains to single values to something that pertains to sample means.
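The standard errors in the calcium and smoker examples can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard error of the mean is the population standard deviation divided by the square root of the sample size; the helper name `standard_error` is made up for illustration.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sigma / math.sqrt(n)


# Calcium in milk: a single bottle has standard deviation 5 mg/L.
for n in (1, 4, 25):
    print(f"{n:3d} bottles: standard error = {standard_error(5, n):.2f} mg/L")

# Practice example: starting age of smokers has standard deviation 2 years.
for n in (16, 100):
    print(f"{n:3d} smokers: standard error = {standard_error(2, n):.2f} years")
```

Under that assumption the practice example works out to 2/√16 = 0.5 years for 16 smokers and 2/√100 = 0.2 years for 100 smokers.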
The z-score of a sample mean is still (value minus mean) / (standard deviation of the value). But since the value is a sample mean instead of a single value, it has a different standard deviation: the standard error.

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Consider again the smokers, who start at a mean age of 17 years with a standard deviation of 2 years. What’s the z-score of a single smoker if he starts at 18 years? What’s the z-score of a sample of 16 smokers if their mean is 18 years?

Instead of finding the standard error first, we can put it all into one question. (Just another option.) What’s the chance of getting a sample of 100 smokers who started at an average of 18 years or older?

Common question: How do I know which z-score formula to use? This one or the one from chapter 5?

Answer: Look for an indication that you’re dealing with a sample. If the question gives you an n (sample size), use it.

Pro-tip: Use this new one by default. If you can’t find n, you probably have a sample of size 1, so use n = 1. When you use a sample of size 1, the standard-error z-score becomes the standard-deviation z-score. When n = 1, $\sqrt{n} = 1$, so

$$z = \frac{x - \mu}{\sigma / \sqrt{1}} = \frac{x - \mu}{\sigma}$$

In other terms: use the formula with square root n when you have an n. Use the original z-score formula when it’s just a single value. If you don’t know, use the square root n formula, because you’ll still get the right answer; you’ll just waste some effort.

Finally… Why would we ever deal with standard error? Parameters are usually unknown. In less contrived situations, we wouldn’t know what the true mean was, but the larger our sample, the better our idea of that true mean.

On Monday
- More standard error, now with proportion data!
- Law of large numbers
- End of Midterm 1 exam material.
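The smoker z-score questions above can be worked through numerically. The sketch below is one possible way to do it, not the slides' own solution: it assumes the sample mean is approximately normal, uses the square-root-n formula with n = 1 as the default for a single value, and the names `z_score` and `upper_tail` are hypothetical.

```python
import math


def z_score(value, mu, sigma, n=1):
    """z-score of a single value (n=1) or of a sample mean (n>1)."""
    return (value - mu) / (sigma / math.sqrt(n))


def upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


mu, sigma = 17, 2  # smokers: mean starting age and standard deviation

z_single = z_score(18, mu, sigma)       # one smoker starting at 18
z_16 = z_score(18, mu, sigma, n=16)     # mean of 16 smokers is 18
z_100 = z_score(18, mu, sigma, n=100)   # mean of 100 smokers is 18

print(f"single smoker:   z = {z_single:.2f}")
print(f"sample of 16:    z = {z_16:.2f}")
print(f"sample of 100:   z = {z_100:.2f}")
print(f"P(mean of 100 smokers >= 18) = {upper_tail(z_100):.2e}")
```

Under these assumptions the three z-scores come out to 0.5, 2 and 5, and the chance of a sample of 100 averaging 18 or older is roughly 3 in 10 million. The same age of 18 is unremarkable for one smoker but extremely unusual as the mean of 100.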