Nonparametric Statistics
Total Page:16
File Type:pdf, Size:1020Kb
14 Nonparametric Statistics • These methods required that the data be normally CONTENTS distributed or that the sampling distributions of the 14.1 Introduction: Distribution-Free Tests relevant statistics be normally distributed. 14.2 Single-Population Inferences Where We're Going 14.3 Comparing Two Populations: • Develop the need for inferential techniques that Independent Samples require fewer or less stringent assumptions than the 14.4 Comparing Two Populations: Paired methods of Chapters 7 – 10 and 11 (14.1) Difference Experiment • Introduce nonparametric tests that are based on ranks 14.5 Comparing Three or More Populations: (i.e., on an ordering of the sample measurements Completely Randomized Design according to their relative magnitudes) (14.2–14.7) • Present a nonparametric test about the central 14.6 Comparing Three or More Populations: tendency of a single population (14.2) Randomized Block Design • Present a nonparametric test for comparing two 14.7 Rank Correlation populations with independent samples (14.3) • Present a nonparametric test for comparing two Where We've Been populations with paired samples (14.4) • Presented methods for making inferences about • Present a nonparametric test for comparing three or means ( Chapters 7 – 10 ) and for making inferences more populations using a designed experiment about the correlation between two quantitative (14.5–14.6) variables ( Chapter 11 ) • Present a nonparametric test for rank correlation (14.7) 14-1 Copyright (c) 2013 Pearson Education, Inc M14_MCCL6947_12_AIE_C14.indd 14-1 10/1/11 9:59 AM Statistics IN Action How Vulnerable Are New Hampshire Wells to Groundwater Contamination? Methyl tert- butyl ether (commonly known as MTBE) is a the measuring device, the MTBE val- volatile, flammable, colorless liquid manufactured by the chemi- ues for these wells are recorded as .2 cal reaction of methanol and isobutylene. MTBE was first rather than 0.) The other produced in the United States as a lead fuel additive (octane variables in the data set are booster) in 1979 and then as an oxygenate in reformulated described in Table SIA14.1 . fuel in the 1990s. Unfortunately, MTBE was introduced into How contaminated are water-supply aquifers by leaking underground storage tanks these New Hampshire wells? at gasoline stations, thus contaminating the drinking water. Is the level of MTBE con- Consequently, by late 2006 most (but not all) American gasoline tamination different for the two classes of wells? For the two retailers had ceased using MTBE as an oxygenate, and accord- types of aquifers? What environmental factors are related to the ingly, U.S. production has declined. Despite the reduction in pro- MTBE level of a groundwater well? These are just a few of the duction, there is no federal standard for MTBE in public water research questions addressed in the study. supplies; therefore, the chemical remains a dangerous pollutant, The researchers applied several nonparametric meth- especially in states like New Hampshire that mandate the use of ods to the data in order to answer the research questions. We reformulated gasoline. demonstrate the use of this methodology in four Statistics in A study published in Environmental Science & Technol- Action Revisited examples. ogy (Jan. 2005) investigated the risk of exposure to MTBE through drinking water in New Hampshire. In particular, the Statistics IN Action Revisited study reported on the factors related to MTBE contamina- • Testing the Median MTBE Level of Groundwater Wells tion in public and private New Hampshire wells. Data were (p. 14-7) collected on a sample of 223 wells. These data are saved in the • Comparing the MTBE Levels of Different Types of MTBE file (part of which you analyzed in Exercise 2.19). One Groundwater Wells (p. 14-14) of the variables measured was MTBE level (micrograms per liter) in the well water. An MTBE value exceeding .2 micro- • Comparing the MTBE Levels of Different Types of gram per liter on the measuring instrument is a detectable level Groundwater Wells (continued) (p. 14-30) of MTBE. Of the 223 wells, 70 had detectable levels of MTBE. • Testing the Correlation of MTBE Level with Other Envi- (Although the other wells are below the detection limit of ronmental Factors (p. 14-44) Table SIA14.1 Variables Measured in the MTBE Contamination Study Variable Name Type Description Units of Measurement, or Levels CLASS QL Class of well Public or Private AQUIFER QL Type of aquifer Bedrock or Unconsolidated DETECTION QL MTBE detection status Below limit or Detect MTBE QN MTBE level micrograms per liter PH QN pH level standard pH unit DISSOXY QN Dissolved oxygen milligrams per liter DEPTH QN Well depth meters DISTANCE QN Distance to underground storage tank meters INDUSTRY QN Industries in proximity Percent of industrial land within 500 meters of well Data Set: MTBE 14.1 Introduction: Distribution-Free Tests The confidence interval and testing procedures developed in Chapters 7 – 10 all involve making inferences about population parameters. Consequently, they are often referred to as parametric statistical tests. Many of these parametric methods (e.g., the small-sample t -test of Chapter 8 or the ANOVA F-test of Chapter 10 ) rely on the assumption that the data are sampled from a normally distributed population. When the data are normal, these tests are most powerful . That is, the use of such parametric tests maximizes power— 14-2 the probability of the researcher correctly rejecting the null hypothesis. Copyright (c) 2013 Pearson Education, Inc M14_MCCL6947_12_AIE_C14.indd 14-2 10/1/11 9:59 AM SECTION 14.1 Introduction: Distribution-Free Tests 14-3 Consider a population of data that is decidedly nonnormal. For example, the dis- tribution might be flat, peaked, or strongly skewed to the right or left. (See Figure 14.1 .) Applying the small-sample t -test to such a data set may lead to serious consequences. Since the normality assumption is clearly violated, the results of the t -test are unreli- able. Specifically, (1) the probability of a Type I error (i.e., rejecting H0 when it is true) may be larger than the value of a selected, and (2) the power of the test, 1 - b, is not maximized. Figure 14.1 Some nonnormal distributions for which the t -statistic is invalid a. Flat distribution b. Peaked distribution c. Skewed distribution A number of nonparametric techniques are available for analyzing data that do not follow a normal distribution. Nonparametric tests do not depend on the distribution of the sampled population; thus, they are called distribution-free tests . Also, nonpara- metric methods focus on the location of the probability distribution of the population, rather than on specific parameters of the population, such as the mean (hence the name “nonparametric”). Distribution-free tests are statistical tests that do not rely on any underlying assump- tions about the probability distribution of the sampled population. The branch of inferential statistics devoted to distribution-free tests is called non- parametrics . Nonparametric tests are also appropriate when the data are nonnumerical in nature, but can be ranked. * For example, when taste-testing foods or in other types of consumer product evaluations, we can say that we like product A better than product B, and B better than C, but we cannot obtain exact quantitative values for the respec- tive measurements. Nonparametric tests based on the ranks of measurements are called rank tests . Nonparametric statistics (or tests) based on the ranks of measurements are called rank statistics (or rank tests ). Ethics IN Statistics Consider a sampling problem where the assumptions required In this chapter, we present several useful nonparametric methods. Keep in mind for the valid application of a that these nonparametric tests are more powerful than their corresponding parametric parametric procedure (e.g., a t -test counterparts in those situations where either the data are nonnormal or the data are for a population mean) are clearly ranked. violated. Also, suppose the results In Section 14.2, we develop a test for making inferences about the central tendency of the parametric test lead you to a of a single population. In Sections 14.3 and 14.5, we present rank statistics for comparing different inference about the target two or more probability distributions using independent samples. In Sections 14.4 and population than the corresponding 14.6, the matched-pairs and randomized block designs are used to make nonparametric nonparametric method. Intentional comparisons of populations. Finally, in Section 14.7, we present a nonparametric mea- reporting of only the parametric sure of correlation between two variables. test results is considered unethical statistical practice . * Qualitative data that can be ranked in order of magnitude are called ordinal data. Copyright (c) 2013 Pearson Education, Inc M14_MCCL6947_12_AIE_C14.indd 14-3 10/1/11 9:59 AM 14-4 CHAPTER 14 Nonparametric Statistics 14.2 Single-Population Inferences In Chapter 8 , we utilized the z - and t -statistics for testing hypotheses about a population mean. The z -statistic is appropriate for large random samples selected from “general” populations—that is, samples with few limitations on the probability distribution of the underlying population. The t -statistic was developed for small-sample tests in which the sample is selected at random from a normal distribution. The question is, How can we conduct a test of hypothesis when we have a small sample from a nonnormal distribution? The sign test is a relatively simple nonparametric procedure for testing hypotheses about the central tendency of a nonnormal probability distribution. Note that we used the phrase central tendency rather than population mean . This is because the sign test, like many nonparametric procedures, provides inferences about the population median rather than the population mean m. Denoting the population median by the Greek letter h, we know ( Chapter 2 ) that h is the 50th per- centile of the distribution ( Figure 14.2 ) and, as such, Area = .5 is less affected by the skewness of the distribution and the presence of outliers (extreme observations).