04 Donovan Pages 201 CB

4 CENTRAL LIMIT THEOREM

Objectives
• Set up a spreadsheet model to examine the properties of the central limit theorem.
• Develop frequency distributions and sampling distributions, and differentiate between the two.
• Develop a bootstrap analysis of the mean for various sample sizes.
• Evaluate the relationship between standard error and sample size, and standard deviation and sample size.

Suggested Preliminary Exercise: Statistical Distribution

INTRODUCTION

You have probably come across the term “population” in your studies of biology. In the biological sense, the term “population” refers to a group of organisms that occupy a defined space and that can potentially interact with one another. The Hardy-Weinberg equilibrium principle is an example of a population-level study.

In statistics the term population has a slightly different meaning. A statistical population is the totality of individual observations about which inferences are made, existing anywhere in the world, or at least within a specified sampling area limited in space and time (Sokal and Rohlf 1995). Suppose you want to make a statement about the average height of humans on earth. Your statistical population would include all the individuals that currently occupy the planet earth. Usually, statistical populations are smaller than that, and the researcher determines the size of the statistical population. For example, if you want to make a statement about the length of dandelion stems in your hometown, your statistical population consists of all of the dandelions currently occurring within the boundaries of your hometown. Other examples of statistical populations include a population of all the record cards kept in a filing system, of trees in a county park, or motor vehicles in the state of Vermont.

In practice, it would be very difficult to measure the heights of all the individuals on earth, or even to measure all the dandelions in your hometown. So we take a sample from the population.
A sample is a subset of the population that we can deal with and measure. The goal of sampling is to make scientific statements about the greater population based on the information we obtain in the sample. Quantities gathered from samples are called statistics.

“How many samples should I take?” and “How should I choose my samples?” are very important questions that any investigator should ask before starting a scientific study. In this exercise, we’ll consider simple random sampling. If you sample 10 dandelions in your hometown with the intent of making scientific statements about all of the dandelions that occupy your town, then each and every individual in the population must have the same chance of being selected as part of the sample. In other words, a simple random sample is a sample selected by a process that gives every possible sample (of that size from that population) the same chance of being selected.

Let’s imagine that you use a simple random sampling scheme to sample the stem lengths of 10 dandelions in your hometown. And let’s further imagine that the actual average stem length of the dandelion population in your hometown is µ = 10 mm; you are trying to estimate this parameter through sampling. You carefully measure the stem length of each of the 10 sampled dandelions, and then calculate and record the mean of the sample on your computer spreadsheet. The mean you have calculated is called an estimator, usually designated as x̄ (x-bar), which estimates the true population mean, µ (which in this case is 10 mm).

If you plot your raw data on a graph, your graph is called a frequency distribution. This is a pictorial description of how frequent or common different values (in this case stem lengths) appear in the population. A frequency distribution reveals many things about the nature of your samples, including the sample size, the mean, the shape of the distribution (normal, skewed, etc.), the range of values, and modality of the data (Figure 1).
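
The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population values are made up (500 stem lengths centered near 10 mm), the seed is arbitrary, and the names `draw_simple_random_sample` and `frequency_distribution` are hypothetical helpers, not part of the exercise.

```python
import random
from collections import Counter
from statistics import mean

def draw_simple_random_sample(population, n, rng):
    """Draw n individuals without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def frequency_distribution(values):
    """Count how many individuals fall on each whole-millimeter stem length."""
    return Counter(round(v) for v in values)

rng = random.Random(1)  # fixed seed so the sketch is repeatable (assumption)

# Hypothetical population of 500 stem lengths (mm), centered near 10 mm.
population = [rng.gauss(10, 2) for _ in range(500)]

sample = draw_simple_random_sample(population, 10, rng)
x_bar = mean(sample)                   # the estimator of mu
freq = frequency_distribution(sample)  # raw data tallied by stem length

# Print a text histogram: one asterisk per sampled individual.
for length in sorted(freq):
    print(f"{length:>3} mm: {'*' * freq[length]}")
print(f"sample mean = {x_bar:.1f} mm")
```

Rerunning the sketch with a different seed gives a different sample and a different x̄, which is exactly the problem the rest of the introduction addresses.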
Figure 1. Frequency distribution of 10 dandelion stem lengths (x-axis: stem length in mm; y-axis: number of individuals).

In the example in Figure 1, our sample of 10 dandelions had a mean value of 9.4 mm. How do you know how close your estimator is to the true mean, µ, if you can’t actually measure µ? The central paradox of sampling is that it is impossible to know, based on a single sample, how well the sample mean estimates µ. If you obtain another sample of 10 dandelions, and calculate a mean, you will now have two estimates of the population mean, µ. What if they are different? How will you know which is the “best” estimator?

Here is where the central limit theorem comes into play. If you repeat this sampling process and obtain a set of estimators (say, for example, 10 estimators in total, each based on a sample size of 10 dandelions), you now have a sampling distribution of the sample average (note the difference between the sampling distribution and the frequency distribution). The sampling distribution shows the possible values that the estimator can take and the frequency with which they occur. The standard deviation of a sampling distribution is called the standard error.

Figure 2. Sampling distribution of 10 mean estimators (x-axis: estimator; y-axis: number of samples).

The central limit theorem, one of the most important statistical concepts you will encounter, states that in a finite population with a mean µ and variance σ², the sampling distribution of the means approaches a normal distribution with a sampling mean µ and a sampling variance σ²/N as N (N = number of individuals in the sample) increases. In Figure 2, 4 of our 10 samples had a mean of 10 mm, 5 samples had a mean of 11 mm, and 1 sample had a mean of 12 mm. The central limit theorem says that this sampling distribution will become more and more “normal” (a bell-shaped curve on a graph) as the sample size increases.
The central limit theorem also says that the mean of the sampling distribution is an unbiased estimator of µ, and that the variance of the estimators is σ²/N.

In this exercise, you will set up two populations that have the same mean, µ, of 50 mm. You will try to estimate this parameter through sampling. Both populations contain 500 individuals. The stem lengths of Population 1 follow a normal distribution. Population 2 has a somewhat funky, bimodal distribution in which individuals have stem lengths of either 0 or 100. We will obtain samples from each population, from which we will estimate the mean of each population.

The method by which we will sample is called the bootstrap method, a very common resampling method in statistics (Efron 1982). The bootstrap involves repeated reestimation of a parameter (such as a mean) using random samples with replacement from the original data. Because the sampling is with replacement, some items in the data set are selected two or more times and others are not selected at all. We will do a bootstrap analysis of the mean when sample sizes of 5, 10, 15, and 20 are drawn (with replacement) from each population. When the procedure is repeated a hundred or a thousand times, we get “pseudosamples” that behave similarly to the underlying distribution of the data. In turn, you can evaluate how biased your estimator is (whether your estimator gives a good estimate of µ or not), the confidence intervals of the estimator, and the bootstrap standard error of your estimator. All of this will become more clear as you work through the exercise.

As always, save your work frequently to disk.

INSTRUCTIONS

A. Set up the spreadsheet.

1. Open a new spreadsheet and set up column headings as shown in Figure 3.

      A                               B        C
  1   Central Limit Theorem Exercise
  2
  3   Population Mean =>              µ        50
  4   Population Std =>               σ        10
  5
  6   Individual                      Pop 1    Pop 2

Figure 3

2. In cells A7–A506, assign a number to each individual in the populations, starting with 1 in cell A7 and ending with 500 in cell A506.

ANNOTATION: Enter 1 in cell A7. Enter =A7+1 in cell A8. Copy this formula down to cell A506 to designate the 500 individuals.

3. Enter a population mean of 50 in cell C3.

ANNOTATION: We will compare two populations of dandelions (actual statistical populations), each consisting of 500 individuals. Both populations, Population 1 and Population 2, have an actual mean stem length (µ) of 50 mm, which is designated in cell C3.

4. Enter the standard deviation for Population 1 in cell C4.

ANNOTATION: Population 1 will consist of 500 individuals that have a mean, µ, of 50 mm and a standard deviation of 10 mm. We’ll assume that Population 1 is normally distributed. Thus, the raw data are distributed in a bell-shaped curve that is completely symmetrical and has tails that approach but never touch the x-axis. The shape and position of the normal curve is determined by µ and σ: µ sets the position of the curve while σ determines the spread of the curve. Figure 4 shows two normal curves. They have different means (µ) but have the same σ, thus they are similar in shape but are positioned in different locations along the x-axis.

Figure 4. Normal curves with standard deviation 10, one with mean = 30 and one with mean = 50 (x-axis from 0 to 80).

A property of normal curves is that the total area under the curve is equal to 1.
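
The population setup in steps 1–4 and the bootstrap described in the introduction can also be sketched outside the spreadsheet. The Python below is a rough illustration, not the exercise's own procedure. The seed, the 1,000 resamples per sample size, the even 250/250 split of Population 2 between 0 and 100, and the helper name `bootstrap_means` are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed for repeatability (assumption)
MU, SIGMA, POP_SIZE = 50, 10, 500

# Population 1: 500 stem lengths drawn from a normal distribution (mu=50, sigma=10).
pop1 = [rng.gauss(MU, SIGMA) for _ in range(POP_SIZE)]
# Population 2: bimodal, half 0 mm and half 100 mm (the even split is an assumption).
pop2 = [0.0] * (POP_SIZE // 2) + [100.0] * (POP_SIZE // 2)

def bootstrap_means(population, n, trials, rng):
    """Return `trials` sample means, each computed from n draws with replacement."""
    return [mean(rng.choices(population, k=n)) for _ in range(trials)]

results = {}
for name, pop in (("Pop 1", pop1), ("Pop 2", pop2)):
    sd = pstdev(pop)  # standard deviation of the full population
    for n in (5, 10, 15, 20):
        means = bootstrap_means(pop, n, 1000, rng)
        se = pstdev(means)          # spread of the estimators = standard error
        expected = sd / n ** 0.5    # value suggested by the central limit theorem
        results[(name, n)] = (mean(means), se, expected)
        print(f"{name} n={n:>2}: mean of means={mean(means):6.2f}  "
              f"SE={se:5.2f}  sigma/sqrt(n)={expected:5.2f}")
```

If the central limit theorem behaves as described, the SE column should track σ/√N and shrink as the sample size grows for both populations, even though Population 2 is far from normal, while each population's own standard deviation stays fixed.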
