M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 62
4 CHAPTER Normal Distributions
Some of the chapters in this book can be thought of as relatively self-contained units in the study of statistics. However, others, such as this chapter and the previous one— and, to a lesser degree, Chapters 9 and 10—are so interrelated that it could be argued (and it has been by reviewers of earlier editions) that one is so essential to an under- standing of the other that they should be read and studied together. One could also make the case (as some have) that the content in Chapter 3 should follow Chapter 4. There is a compelling logic to this position. Why? As we shall see in this chapter, measurements such as the mean and standard deviation (Chapter 3) are more than numbers derived from a formula: They take on added meaning when we introduce the concept of the normal distribution. However, in support of the current sequence of chapters, we believe it is difficult if not impossible to comprehend the characteristics of the standard normal distribution if one does not know a median from a mean or has no idea how a standard deviation is derived for a given data set. Thus, it seems that one sequence makes about as much sense as the other. At this point, the reader (especially if you are a bit math-phobic) may be a little confused. However, we hope that after the discussion that follows things will “click in,” and that concepts like the mean and standard deviation will be more than just numbers used to describe what is a typical case value or how much variability is present in a data set. What follows should help to complete the picture of the distribu- tion of a variable and to put a given case value (a raw score) in better perspective.
62 M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 63
4 / Normal Distributions 63
SKEWNESS
A frequency polygon can assume a variety of shapes depending on where the values of an interval level or ratio level variable tend to cluster. Some data sets contain a dispro- portionately large number of low values of a variable and thus, when displayed using a frequency polygon, a large percentage of the area of the polygon is on the left side, where lower values of the variable are displayed. Other distributions reflect the exact opposite pattern. Suppose, for example, a hospital administrator, Sue, wished to study changes in admission diagnoses over a 6-year period. She wanted to substantiate her impression that the hospital was experiencing a decline in some diagnoses and an increase in others. The data might look like those in Table 4.1 for a diagnosis such as emphysema. Just by glancing at the data contained in Table 4.1, it is easy to see that the number of emphysema patients admitted to the hospital over the 6-year period has declined over time. This trend is even more apparent when the data are placed in a histogram, such as the one in Figure 4.1. The midpoints of the bars in the histogram in Figure 4.1 can be connected via a line. The continuous line joining them to form a frequency polygon is called a curve. Distributions like the one shown in Table 4.1 and reflected in the frequency polygon in Figure 4.1 are referred to as skewed, a term we introduced in the previous chapter when we were discussing the presence of outliers (i.e., extreme values) in the distribu- tion of a variable. As we noted then, a skewed distribution is misshapen or asymmetri- cal; that is, its ends do not taper off in a similar manner in both directions. Note that the frequency polygon in Figure 4.1 has a “tail” on the right side (where, normally, the largest case values are displayed). A curve like the one in Figure 4.1, which has a tail to the right, is called a positively skewed distribution since its outliers are among the higest values of the distribution. Trends in admissions of HIV-positive cases over the same 6-year period in the hospital might reflect the exact opposite pattern from those with emphysema admis- sions. Table 4.2 and Figure 4.2 illustrate this point.
TABLE 4.1 Cumulative Frequency Distribution: Emphysema Patients Admitted to XYZ Hospital by Year (N 210) Year Absolute Frequency Cumulative Frequency
2004 60 60 2005 50 110 2006 40 150 2007 30 180 2008 20 200 2009 10 210 M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 64
64 4 / Normal Distributions
60
50
40
30 Frequencies 20
10
2004 2005 2006 2007 2008 2009 Year
FIGURE 4.1 Positively Skewed Frequency Polygon: Emphysema Patients Admitted to XYZ Hospital by Year (From Table 4.1, N 210)
The distribution in Figure 4.2 is also skewed, but this time the outliers and thus the tail of the frequency distribution are on the left. A curve that is skewed to the left, where the smallest case values are displayed, is called a negatively skewed distribution.
KURTOSIS
Skewness is the degree to which a distribution of a variable (and the frequency polygon portraying it) are not symmetrical. But suppose a distribution of a variable is symmet- rical. A second way to describe the distribution of a variable, kurtosis, is still needed to complete its description. Kurtosis is the degree to which a distribution is peaked,
TABLE 4.2 Cumulative Frequency Distribution: HIV-Positive Patients Admitted to XYZ Hospital by Year (N 210) Absolute Cumulative Year Frequency Frequency
2004 10 10 2005 20 30 2006 30 60 2007 40 100 2008 50 150 2009 60 210 M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 65
4 / Normal Distributions 65
60
50
40
30 Frequencies 20
10
2004 2005 2006 2007 2008 2009 Year
FIGURE 4.2 Negatively Skewed Frequency Polygon: HIV-Positive Patients Admitted to XYZ Hospital by Year (From Table 4.2, N 210)
as opposed to relatively flat. Or, to describe it another way, it is the degree to which measurements cluster around the center, as opposed to being more heavily concen- trated in its end points (tails). A distribution that has a high percentage of case values that cluster around its center (the mean), thus giving the appearance of peakness, is described as leptokurtic. Figure 4.3 portrays a frequency polygon that is leptokurtic. Notice how it differs from Figure 4.4, which contains case values that are more heavily concentrated in its tails. It is described as platykurtic, or more flat.
10 15 20 25 30 35 40 45 50 55 60
FIGURE 4.3 A Leptokurtic Distribution of Caseloads in Agency A M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 66
66 4 / Normal Distributions
10 15 20 25 30 35 40 45 50 55 60
FIGURE 4.4 A Platykurtic Distribution of Caseloads in Agency B
THE NORMAL CURVE
Some distributions of interval or ratio level variables are symmetrical and contain no or relatively few outliers. Measurements closest to the mean are the most common. But the frequency of measurements of the variable “taper off” in a consistent, gradual pattern among values as they get farther and farther away from the mean (either above or below it). These distributions are neither leptokurtic or platykurtic; values would be neither bunched in the middle or disproportionately clustered in its tails. They are bell shaped or mesokurtic. When such a distribution occurs, it can be referred to as a normal distribution. In a frequency polygon reflecting it (e.g., Figure 4.5), the curved line of the polygon is referred to as the normal curve. When graphed, the values of many variables (e.g., the variable height among either men or women) tend naturally to approximate a normal curve. Other variables have been made to form a normal distribution. For example, the distribution of scores on a standardized test are supposed to reflect the pattern of a normal distribution; that is, if we were to graph the scores of all people who take the test as a frequency polygon, the curved line of the polygon would be a normal curve. (Standardizing a test or measurement scale means revising the test as much as is necessary until the scores of people completing it would form a normal distribution.) Distributions of all interval or ratio level variables that tend to be normally distrib- uted share the same properties. In addition to being symmetrical and bell shaped, in a normal curve, the mode, median, and mean all occur at the highest point and in the center of the distribution, as in Figure 4.6. Note that in skewed curves, the mode, median, and mean occur at different points, as in Figure 4.7(a) and (b). The ends of the normal curve extend toward infinity—they approach the horizon- tal axis (x axis) but never quite touch it. This property represents the possibility that, Frequencies FIGURE 4.5 The Normal Curve M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 67
4 / Normal Distributions 67 Frequencies
Mode Median Mean FIGURE 4.6 The Normal Distribution
although the normal curve contains virtually all values of a variable, a very small number of values may exist that reflect extremely large or extremely small measure- ments (or values) of the variable (outliers). It also reflects the fact that at a higher level of abstraction, a total population of cases (or the universe) is never static because it is always subject to change as cases are added or deleted over time. Thus, populations are always evolving. The horizontal axis of a normal curve can be divided into six equal units—three units between the mean and the place where the curve approaches the axis on the left side, and three units between the mean and the place where it approaches the axis on the right side. These six units collectively reflect the amount of variation that exists within virtually all values of a normally distributed interval or ratio level variable. In a normal distribution of any variable, virtually all values (except for .26%) fall within these six units of variation. Each unit corresponds to exactly one standard deviation. Thus, exactly how much variation a unit represents within a frequency polygon portraying the distribution of measurements (i.e., values) of a given variable is deter- mined by using the standard deviation formula presented in the previous chapter. Figure 4.8 displays what is called the standard normal distribution. It is a normal curve with three equal units to the left of the mean and three equal units to the right of the mean. Note that the units are labeled, as is the mean, which in the standard normal distribution is labeled 0 (zero). Each unit reflects the number of standard devi- ations (SD) that it falls from the mean. Units to the left of the mean (where values are smaller than the mean) use the negative sign (i.e.,-1SD, -2SD, -3SD ), and units to the right of the mean (where values are larger than the mean) use the positive sign (i.e., +1SD, +2SD, +3SD) or no sign at all.
(a) A positively skewed distribution (b) A negatively skewed distribution Frequencies Frequencies
Mode Median Mean Mean Median Mode
FIGURE 4.7 Skewed Distributions M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 68
68 4 / Normal Distributions Frequencies
FIGURE 4.8 The Normal Distribution 2SD 0 2SD 3SD 1SD 1SD 3SD Showing Standard Deviations
The standard normal distribution in Figure 4.8 can be thought of as a template. If a variable is believed to be normally distributed, we can replace the numbers and signs along the horizontal line with actual numbers derived from a data set to reflect the actual distribution of a variable. We would do this by taking the values of a variable found in a research sample or population (the raw data) and using the formulas for the mean and standard deviation used in Chapter 3 or (more frequently) by deriving them through computer data analysis. Then we would substitute the mean of the variable for the data set for 0 (zero) and then either add or subtract the standard deviation of the variable within the data to replace the other notations on the horizontal line. Suppose that in our data set the mean of a variable was computed to be 7 and the stan- dard deviation was 2. We can now substitute 7 for 0 in the standard normal distribu- tion. We can substitute 9 (7+ 2= 9) for + 1; 11 (7+ 2(2)) for + 2, and so forth. For the negative values in the polygon we would subtract one computed standard devia- tion for each unit. The term standard deviation can be a little misleading; there is really little that is standard about it, since it varies depending on the values in a group of measurements of a variable, such as in a data set. Standard refers to the fact that once the standard deviation for measurements of a variable has been computed, it becomes a standard unit that reflects the amount of variability that was found to exist. We saw this in the previous chapter. Remember, larger standard deviations mean more variability in a data set, and vice versa. But remember too, a standard deviation (or a mean) is specific to a given variable within the given data set from which it was derived. Standard devi- ations (and means) differ from one variable to another and from one data set to another (even when the same variable has been measured). They differ based on the measure- ments that were taken and how much they vary from each other. The mean and stan- dard deviation for test scores (just like heights) for males, for example, might be different from the mean and standard deviation for test scores of females who take the same examination. Or, the mean and standard deviation computed from test scores of people who took the examination in 2009 might be different from the mean and stan- dard deviation computed from scores of people who took the examination in 2008. Normal distributions derived from different data sets therefore tend to have different means and different standard deviations. Figure 4.9(a), (b), and (c) demon- strates how this occurs by comparing three pairs of normal curves. The figure also demonstrates the fact that normal curves can be drawn in a way that they reflect the distribution of data; that is, curves may be high and narrow, low and wide, or anything M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 69
4 / Normal Distributions 69
(a) Equal means, unequal standard deviations (b) Unequal means, equal standard deviationsa Frequencies Frequencies
Means Means
(c) Unequal means, unequal standard deviations Frequencies
Means
FIGURE 4.9 Variations in Normal Distributions
in between. They can be drawn to suggest the degree of variation present in the distri- bution of an interval or ratio level variable—that is, the size of its standard deviation. Flatter curves suggest relatively large standard deviations and higher ones reflect rela- tively small standard deviations; however, they still retain what is essentially a “bell shape.” Not surprisingly (because it is symmetrical), 50 percent of the total area of the frequency polygon in a normal distribution falls below the mean and 50 percent falls above it. Other segments of the frequency polygon similarly reflect standard percent- ages of its total area. Figure 4.10 displays the percentage of the normal curve that falls between the mean and the point referred to as +1 standard deviation, between +1 standard deviation and +2 standard deviation, and so forth. By looking at Figure 4.10, we can see that the area of a normal curve between a point on the horizontal axis (e.g.,-2SD ) and the mean is equivalent to the area of the curve between the comparable point on the other side of the mean (e.g.,+2SD ) and the mean. This makes sense because, as we have already noted, a normal curve is symmetrical. If we add up all the percentages within each of the segments of the frequency polygon between -3SD and +3SD, they equal 99.74 percent of the curve. M04_WEIN0000_08_SE_C04.qxd 6/20/09 3:12 PM Page 70
70 4 / Normal Distributions
50% 49.87%
47.72%
34.13% 34.13%
13.59% 13.59% 2.15% 2.15% .13% .13%