

PART III. DETECTION MONITORING TESTS

This third part of the Unified Guidance presents core procedures recommended for formal detection monitoring at RCRA-regulated facilities. Chapter 16 describes two-sample tests appropriate for some small facilities, facilities in interim status, or for periodic updating of background. These tests include two varieties of the t-test and two non-parametric versions: the Wilcoxon rank-sum and Tarone-Ware procedures. Chapter 17 discusses one-way analysis of variance [ANOVA], tolerance limits, and the application of trend tests during detection monitoring. Chapter 18 is a primer on several kinds of prediction limits, which are combined with retesting strategies in Chapter 19 to address the statistical necessity of performing multiple comparisons during RCRA statistical evaluations. Retesting is also discussed in Chapter 20, which presents control charts as an alternative to prediction limits.

As discussed in Section 7.5, any of these detection-level tests may also be applied to compliance/assessment and corrective action monitoring, where a background groundwater protection standard [GWPS] is defined as a critical limit using two- or multiple-sample comparison tests. Caveats and limitations discussed for detection monitoring tests are also relevant to this situation. To maintain continuity of presentation, this additional application is presumed but not repeated in the following specific test and procedure discussions.

Although other users and programs may find these statistical tests of benefit due to their wider applicability to other environmental media and types of data, the methods described in Parts III and IV are primarily tailored to the RCRA setting and designed to address formal RCRA monitoring requirements. In particular, the series of prediction limit tests found in Chapter 18 is designed to address the range of interpretations of the rules in §264.97(g), §264.98(d) and §258.54. Further, all of the regulatory tests listed in §264.97(i) and §258.53(h) are discussed, as well as the Student’s t-test requirements of §265.93(b).

Taken as a whole, the set of detection monitoring methods presented in the Unified Guidance should be appropriate for almost all the situations likely to be encountered in practice. Professional statistical consultation is recommended for the rest.

March 2009 PART III. DETECTION MONITORING TESTS Unified Guidance



CHAPTER 16. TWO-SAMPLE TESTS

16.1 PARAMETRIC T-TESTS ............................................. 16-1
     16.1.1 Pooled Variance T-Test ................................. 16-4
     16.1.2 Welch’s T-Test ......................................... 16-7
     16.1.3 Welch’s T-Test and Lognormal Data ...................... 16-10
16.2 WILCOXON RANK-SUM TEST ........................................ 16-14
16.3 TARONE-WARE TWO-SAMPLE TEST FOR CENSORED DATA ................. 16-20

This chapter describes statistical tests between two groups of data, known as two-sample tests. These tests may be appropriate for the smallest of RCRA sites performing upgradient-to-downgradient comparisons on a very limited number of wells and constituents. They may also be required for certain facilities in interim status, and can be more generally used to compare older versus newer data when updating background.

Two versions of the classic Student’s t-test are first discussed: the pooled variance t-test and Welch’s t-test. Since both these tests expect approximately normally-distributed data as input, two non-parametric alternatives to the t-test are also described: the Wilcoxon rank-sum test (also known as the Mann-Whitney) and the Tarone-Ware test. The latter is particularly helpful when the sample data exhibit a moderate to large fraction of non-detects and/or multiple detection/reporting limits.

16.1 PARAMETRIC T-TESTS

BACKGROUND AND PURPOSE

A statistical comparison between two sets of data is known as a two-sample test. While several varieties of two-sample tests exist, the most common is the parametric t-test. This test compares two distinct statistical populations. The goal of the two-sample t-test is to determine whether there is any statistically significant difference between the mean of the first population and the mean of the second population, based on the results observed in the two respective samples.

In groundwater monitoring, the typical hypothesis at issue is whether the average concentration at a compliance point is the same as (or less than) the average concentration in background, or whether the compliance point mean is larger than the background mean, as represented in equation [16.1] below:

H0: µC ≤ µBG   vs.   HA: µC > µBG   [16.1]

A natural statistic for comparing two population means is the difference between the sample means, ( x̄C − x̄BG ). When this difference is small, a real difference between the respective population means is considered unlikely. However, when the sample mean difference is large, the null hypothesis is rejected, since in that case a real difference between the populations seems plausible. Note that an observed difference between the sample means does not automatically imply a true population difference. Sample means can vary for many reasons even if the two underlying parent populations are identical. Indeed, the Student’s t-test was invented precisely to determine when an observed sample difference should be considered significant (i.e., more than a chance fluctuation), especially when the sizes of the two samples tend to be small, as is the usual case in groundwater monitoring.

Although the null hypothesis (H0) represented in equation [16.1] allows for a true compliance point mean to be less than background, the behavior of the t-test statistic is assessed at the point where H0 is most difficult to verify — that is, when H0 is true and the two population means are identical. Under the assumption of equal population means, the test statistic in any t-test will tend to follow a Student’s t-distribution. This fact allows the selection of critical points for the t-test based on a pre-specified Type I error or false positive rate (α). Unlike the similarly symmetric normal distribution, however, the Student’s t-distribution also depends on the number of independent sample values used in the test, represented by the degrees of freedom [df].

The number of degrees of freedom impacts the shape of the t-distribution, and consequently the magnitude of the critical (percentage) points selected from the t-distribution to provide a basis of comparison against the t-statistic (see Figure 16-1). In general, the larger the sample sizes of the two groups being compared, the larger the corresponding degrees of freedom, and the smaller the critical points (in absolute value) drawn from the Student’s t-distribution. In a one-sided hypothesis test of whether compliance point concentrations exceed background concentrations, a smaller critical point corresponds to a more powerful test. Therefore, all other things being equal, the larger the sample sizes used in the two-sample t-test, the more protective the test will be of human health and the environment.
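This shrinking of the one-sided critical point with increasing degrees of freedom can be checked numerically. The following sketch uses scipy (not part of the guidance) to tabulate the upper 95th percentage points for the same df values shown in Figure 16-1:

```python
from scipy import stats

# Upper 95th percentage points (one-sided critical points) of Student's
# t-distribution for the degrees of freedom plotted in Figure 16-1.
crit_points = {df: stats.t.ppf(0.95, df) for df in (1, 3, 7, 25)}
for df, t_crit in crit_points.items():
    print(f"df = {df:2d}: t_crit = {t_crit:.3f}")
# df =  1: t_crit = 6.314
# df =  3: t_crit = 2.353
# df =  7: t_crit = 1.895
# df = 25: t_crit = 1.708
```

As the degrees of freedom grow, the critical point approaches the standard normal value of roughly 1.645, which is why larger samples yield a more powerful one-sided test.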

Figure 16-1. Student’s t-Distribution for Varying Degrees of Freedom

[Figure 16-1 plots the Student’s t density for 1, 3, 7, and 25 degrees of freedom over t-values from –5.0 to 5.0.]

In groundwater monitoring, t-tests can be useful in at least two ways. First, a t-test can be employed to compare background data from one or more upgradient wells against a single compliance well. If more than one background well is involved, all the upgradient data would be pooled into a single group or sample before applying the test.

Second, a t-test can be used to assess whether updating of background data is appropriate (see Chapter 5 for further discussion). Specifically, the two-sample t-test can be utilized to check whether the more recently collected data is consistent with the earlier data assigned initially as the background data pool. If the t-test is non-significant, both the initial background and more recent observations may be considered part of the same statistical population, allowing the overall background data set to grow and to provide more accurate information about the characteristics of the background population.

The Unified Guidance describes two versions of the parametric t-test, the pooled variance Student’s t-test and a modification to the Student’s t-test known as Welch’s t-test. This guidance prefers either of these tests to Cochran’s Approximation to the Behrens-Fisher (CABF) Student’s t-test. Initially codified in the 1982 RCRA regulations, the CABF t-test is no longer explicitly cited in the 1988 revision to those regulations. Both the pooled variance and Welch’s t-tests are more standard in statistical usage than the CABF t-test. When the parametric assumptions of the two-sample t-test are violated, the Wilcoxon rank-sum or the Tarone-Ware tests are recommended as non-parametric alternatives.

REQUIREMENTS AND ASSUMPTIONS

The two-sample t-test has been widely used and carefully studied as a statistical procedure. Correct application of the Student’s t-test depends on certain key assumptions. First, every t-test assumes that the observations in each data set or group are statistically independent. This assumption can be difficult to check in practice (see Chapter 14 for further discussion of statistical independence), especially if only a handful of measurements are available for testing. As noted in Chapter 5 in discussing data mixtures, lab replicates or field duplicates are not statistically independent and should not be treated as independent water quality samples. That section discussed the limited conditions under which certain replicate data might be applicable for t-testing. Incorrect usage of replicate data was one of the concerns that arose in the application of the CABF t-test.

Second, all t-tests assume that the underlying data are approximately normal in distribution. Checks of this assumption can be made using one of the tests of normality described in Chapter 10. The t-test is a reasonably robust statistical procedure, meaning that it will usually provide accurate results even if the assumption of normality is partially violated. This robustness of the t-test provides some insurance against incorrect test results if the underlying populations are non-normal. However, the robustness assumption is dubious when the parent population is heavily skewed. For data that are lognormal and positively skewed, the two-sample t-test can give misleading results unless the data are first log-transformed. Similarly, a transformation may be needed to first normalize data from other non-normal distributions.

Another assumption particularly relevant to the use of t-tests in groundwater monitoring is that the population means need to be stable or stationary over the time of sampling and testing. As discussed in Part II of the guidance, many commonly monitored groundwater parameters exhibit mean changes in both space and time. Consequently, correct application of the t-test in groundwater requires an implicit assumption that the two populations being sampled (e.g., a background well and a

compliance point well) have average concentrations that are not trending with time. Time series plots and diagnostic trend tests (Chapter 14) can sometimes be used to check this assumption.

The t-test does an excellent job of identifying a stable mean level difference between two populations. However, if one or both populations have trends observable in the sample measurements, the t-test may have difficulty correctly identifying a difference between the two groups. For instance, if earlier samples in a compliance well were uncontaminated but later samples are increasing with time, the t-test may still provide a non-significant result. With compliance point concentrations increasing relative to background, the t-test may not be the appropriate method for identifying this change. Some form of trend testing will provide a better evaluation.

Another concern in applying the t-test to upgradient-downgradient interwell comparisons is that the null hypothesis is assumed to be true unless the downgradient well becomes contaminated. Absent such an impact, the population means are implicitly assumed to be identical. Spatial variability in background and compliance well groundwater concentrations for certain monitoring constituents does not allow clear conditions for comparisons intended to identify a release at a downgradient compliance well. Natural or pre-existing synthetic mean differences among background wells will be confused with a potential release. In such cases, neither the two-sample t-test nor any interwell procedure comparing upgradient against downgradient measurements is likely to give a correct conclusion.

One final requirement for running any t-test is that each group should have an adequate sample size. The t-test will have minimal statistical power to identify any but the largest of concentration differences if the sample size in each group is less than four. Four measurements per group should be considered a minimum requirement, and much greater power will accrue from larger sample sizes. Of course, the attractiveness of larger data sets must be weighed against the need to have statistically independent samples and the practical limitation of semi-annual or annual statistical evaluations. These latter requirements often constrain the frequency of sampling so that it may be impractical to secure more than 4 to 6 or possibly 8 samples during any annual period.

16.1.1 POOLED VARIANCE T-TEST

BACKGROUND AND PURPOSE

In the case of two independent samples from normal populations with common variance, the Student’s t-test statistic is expressed by the following equation:

t = ( x̄C − x̄BG ) / √{ [ ((nBG − 1)s²BG + (nC − 1)s²C) / (nBG + nC − 2) ] × ( 1/nBG + 1/nC ) }   [16.2]

The first bracketed quantity in the denominator is known as the pooled variance, a weighted average of the two sample variances. The entire denominator of equation [16.2] is labeled the standard error of the difference (SEdiff). It represents the probable chance fluctuation likely to be observed between the background and compliance point sample means when the null hypothesis in equation [16.1] is true. Note that the formula for SEdiff depends on both the pooled variance and the sample size of each group.


When the null hypothesis (H0) is satisfied and the two populations are truly identical, the test statistic in equation [16.2] behaves according to an exact Student’s t-distribution. This fact enables critical points for the t-test to be selected based on a pre-specified Type I error rate (α) and an appropriate degrees of freedom. In equation [16.2], the joint degrees of freedom is equal to (nBG + nC − 2), the sum of the background and compliance point sample sizes less two degrees of freedom (one for each mean estimate).

REQUIREMENTS AND ASSUMPTIONS

Along with the general requirements for t-tests, the pooled variance version of the test assumes that the population variances are equal in both groups. Since only the sample variances will be known, this assumption requires a formal statistical test of its own, such as Levene’s test described in Chapter 11. An easier, descriptive method is to construct side-by-side box plots of both data sets. If the population variances are equal, the interquartile ranges represented by the box lengths should also be comparable. If the population variances are distinctly different, on the other hand, the box lengths should also tend to be different, with one box much shorter than the other.

When variances are unequal, the Unified Guidance recommends Welch’s t-test be run instead. Welch’s t-test does not require the assumption of equal variances across population groups. Furthermore, the performance of Welch’s t-test is almost always equal or superior to that of the usual Student’s t-test. Therefore, one may be able to skip the test of equal variances altogether before running Welch’s t-test.

All t-tests require approximately normally-distributed data. If a common variance (σ2) exists between the background and compliance point data sets, normality in the pooled variance t-test can be assessed by examining the combined set of background and compliance point residuals. A residual can be defined as the difference between any individual value and its sample group mean (e.g., xi − x̄BG for background values xi). Not only will the combined set of residuals allow for a more powerful test of normality than if the two samples are checked separately, but it also avoids a difficulty that can occur if the sample measurements are naively evaluated with the Shapiro-Wilk multiple group test. The multiple group test allows for populations with different means and different variances. If an equal variance check has not already been made, the multiple group test could register both populations as being normal even though the two population variances are distinctly different. The latter would violate a key assumption of the pooled variance t-test. To avoid this potential problem, either always check explicitly for equal variances before running the pooled variance t-test, or consider running Welch’s t-test instead.
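The residual approach can be sketched in a few lines. This illustration uses the sulfate data from Example 16-1 and scipy's ordinary Shapiro-Wilk test in place of the multiple group version described in Chapter 10:

```python
import numpy as np
from scipy import stats

# Sulfate data from Example 16-1 (ppm).
bg = np.array([560., 530., 570., 490., 510., 550., 550., 530.])
dg = np.array([600., 590., 590., 630., 610., 630.])

# Residual = individual value minus its own group mean; pooling the two
# sets of residuals gives one larger sample for a single normality check.
residuals = np.concatenate([bg - bg.mean(), dg - dg.mean()])

sw_stat, p_value = stats.shapiro(residuals)
print(f"SW = {sw_stat:.3f}, p = {p_value:.3f}")
```

A non-significant p-value suggests the normality assumption is plausible for the combined residuals; a significant one points toward a transformation or a non-parametric alternative.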

PROCEDURE

Step 1. To conduct the two-sample Student’s t-test at an α-level of significance, first compute the sample mean ( x̄ ) and standard deviation (s) of each group. Check for equal variances using a test from Chapter 11. If there is no evidence of unequal variances, check normality in both samples, perhaps by calculating the residuals from each group and running a normality test on the combined data set.


Step 2. Once the key assumptions have been checked, calculate the two-sample t-statistic in equation [16.2], making use of the sample mean, sample standard deviation, and sample size of each group.

Step 3. Set the degrees of freedom to df = nBG + nC − 2, and look up the (1 − α) × 100th percentage point from the t-distribution in Table 16-1 in Appendix D. Compare this α-level critical point against the t-statistic. If the t-statistic does not exceed the critical point, conclude there is insufficient evidence of a significant difference between the two population means. If, however, the t-statistic is greater than the critical point, conclude that the compliance point population mean is significantly greater than the background mean.

►EXAMPLE 16-1

Consider the quarterly sulfate data in the table below collected from one upgradient and one downgradient well during 1995-96. Use the Student’s t-test to determine if the downgradient sulfate measurements are significantly higher than the background values at an α = 0.01 significance level.

Sulfate Concentrations (ppm)

Quarter   Background   Downgradient   Background   Downgradient
            (ppm)         (ppm)        Residuals    Residuals

 1/95        560            --           23.75         --
 4/95        530            --           –6.25         --
 7/95        570           600           33.75        –8.33
10/95        490           590          –46.25       –18.33
 1/96        510           590          –26.25       –18.33
 4/96        550           630           13.75        21.67
 7/96        550           610           13.75         1.67
10/96        530           630           –6.25        21.67

Mean        536.25        608.33
SD          26.6927       18.3485

SOLUTION

Step 1. Compute the sample mean and standard deviation in each well, as listed in the table above. Then compute the sulfate residuals by subtracting the well mean from each individual value. These differences are also listed above. Comparison of the sample variances shows no evidence that the population variances are unequal. Further, a probability plot of the combined set of residuals (Figure 16-2) indicates that the normal distribution appears to provide a reasonable fit to these data.


Figure 16-2. Probability Plot of Combined Sulfate Residuals

[Figure 16-2 is a normal probability plot of the combined sulfate residuals: x-axis, sulfate residuals (ppm) from –50 to 50; y-axis, normal quantiles from –2 to 2.]

Step 2. Compute the two-sample t-statistic on the raw sulfate measurements using equation [16.2]. Note that the background sample size is nBG = 8 and the downgradient sample size is nC = 6.

t = ( 608.33 − 536.25 ) / √{ [ (7(26.6927)² + 5(18.3485)²) / (8 + 6 − 2) ] × ( 1/8 + 1/6 ) } = 5.66

Step 3. Compute the degrees of freedom as df = 8 + 6 – 2 = 12. Since α = .01, the critical point for the test is the upper 99th percentage point of the t-distribution with 12 df. Table 16-1 in Appendix D then gives the value tcp = 2.681. Since the t-statistic is clearly larger than the critical point, conclude the downgradient sulfate population mean is significantly larger than the background population mean at the 0.01 level. ◄
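The hand calculation in this example can be verified in a few lines. This sketch (using numpy/scipy, which are not part of the guidance) applies equation [16.2] to the sulfate values from the table and cross-checks the critical point:

```python
import numpy as np
from scipy import stats

bg = np.array([560., 530., 570., 490., 510., 550., 550., 530.])  # background sulfate (ppm)
dg = np.array([600., 590., 590., 630., 610., 630.])              # downgradient sulfate (ppm)

n_bg, n_c = len(bg), len(dg)

# Pooled variance: weighted average of the two sample variances (equation [16.2]).
pooled_var = ((n_bg - 1) * bg.var(ddof=1) + (n_c - 1) * dg.var(ddof=1)) / (n_bg + n_c - 2)
se_diff = np.sqrt(pooled_var * (1 / n_bg + 1 / n_c))
t_stat = (dg.mean() - bg.mean()) / se_diff

df = n_bg + n_c - 2
t_crit = stats.t.ppf(0.99, df)  # alpha = 0.01, one-sided

print(f"t = {t_stat:.2f}, df = {df}, critical point = {t_crit:.3f}")
# t = 5.66, df = 12, critical point = 2.681
```

The same statistic is returned by scipy's built-in pooled-variance test, stats.ttest_ind(dg, bg, equal_var=True).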

16.1.2 WELCH’S T-TEST

BACKGROUND AND PURPOSE

The pooled variance Student’s t-test in Section 16.1.1 makes the explicit assumption that both populations have a common variance, σ2. For many wells and monitoring constituents, local geochemical conditions can result in both different well means and variances. A contamination pattern at a compliance well can have very different variability than its background counterpart.

Welch’s t-test was designed as a modification to the Student’s t-test when the population variances might differ between the two groups. The Welch’s t-test statistic is defined by the following equation:


t = ( x̄C − x̄BG ) / √( s²BG/nBG + s²C/nC )   [16.3]

The denominator of equation [16.3] is also called the standard error of the difference (SEdiff), similar to the pooled variance t-test. But it is a different weighted estimate based on the respective sample variances and sample sizes, reflecting the fact that the two population variances may not be the same.

The most difficult part of Welch’s t-test is deriving the correct degrees of freedom. Under the assumption of a common variance, the pooled variance estimate incorporated into the usual Student’s t-test has df = (nBG + nC − 2) degrees of freedom, representing the number of independent “bits” of sample information included in the variance estimate. In Welch’s t-test, the derivation of the degrees of freedom is more complicated, but can be approximately computed with the following equation:

df̂ = ( s²BG/nBG + s²C/nC )² / [ (s²BG/nBG)² / (nBG − 1) + (s²C/nC)² / (nC − 1) ]   [16.4]
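Equation [16.4] (the Welch-Satterthwaite approximation) translates directly into code; a small sketch:

```python
def welch_df(var_bg, n_bg, var_c, n_c):
    """Approximate degrees of freedom for Welch's t-test (equation [16.4])."""
    a = var_bg / n_bg   # squared standard error contribution, background
    b = var_c / n_c     # squared standard error contribution, compliance point
    return (a + b) ** 2 / (a ** 2 / (n_bg - 1) + b ** 2 / (n_c - 1))

# With equal variances and equal sample sizes, the formula recovers the
# pooled-variance df of n_bg + n_c - 2:
print(welch_df(4.0, 10, 4.0, 10))  # 18.0
```

In practice the result is usually fractional and is rounded to the nearest integer before looking up the critical point, as described in the procedure below.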

Despite its lengthier calculations, Welch’s t-test has several practical advantages. Best and Rayner (1987) found that among statistical tests specifically designed to compare two populations with different variances, Welch’s t-test exhibited comparable statistical power (for df ≥ 5) and was much easier to implement in practice than other tests they examined. Moser and Stevens (1992) compared Welch’s t-test against the usual pooled variance t-test and determined that Welch’s procedure was the more appropriate in almost every case. The only advantage registered by the usual Student’s t-test in their study was in the case where the sample sizes in the two groups were unequal and the population variances were known to be essentially the same. In practice, the population variances will almost never be known in advance, so it appears reasonable to use Welch’s t-test in the majority of cases where a two-sample t-test is warranted.

REQUIREMENTS AND ASSUMPTIONS

Welch’s t-test is also a reasonably robust statistical procedure, and will usually provide accurate results even if the assumption of normality is partially violated. This robustness of the t-test provides some insurance against incorrect test results if the underlying populations are non-normal. But heavily skewed distributions do require normalizing transformations. Certain limitations apply when using transformed data, discussed in the following section.

Unlike the pooled variance t-test, Welch’s procedure does not require that the population variances be equal in both groups. Other general requirements of t-tests, however, such as statistical independence of the sample data, lack of spatial variability when conducting an interwell test, and stationarity over time, are applicable to Welch’s t-test and need to be checked prior to running the procedure.

Because the variances of the tested populations may not be equal, an assessment of normality cannot be made under Welch’s t-test by combining the residuals (as with the pooled variance t-test), unless an explicit check for equal variances is first conducted. The reason is that the combined residuals from normal populations with different variances may not test as normal, precisely because of the

heteroscedasticity. Since this latter variance check is not required for Welch’s test, it may be easier to input the sample data directly into the multiple group test of normality described in Chapter 10.

PROCEDURE

Step 1. To run the two-sample Welch’s t-test, first compute the sample mean ( x̄ ), standard deviation (s), and variance (s²) in each of the background (BG) and compliance point (C) data sets.

Step 2. Compute Welch’s t-statistic with equation [16.3].

Step 3. Compute the approximate degrees of freedom in equation [16.4] using the sample variance and sample size from each group. Since this quantity often results in a fractional amount, round the approximate df̂ to the nearest integer.

Step 4. Depending on the α significance level of the test, look up an appropriate critical point (tcp) in Table 16-1 in Appendix D. This entails finding the upper (1 − α) × 100th percentage point of the Student’s t-distribution with df degrees of freedom.

Step 5. Compare the t-statistic against the critical point. If t ≤ tcp, conclude there is no statistically significant difference between the background and compliance point population means. If, however, t > tcp, conclude that the compliance point population mean is significantly greater than the background mean at the α level of significance.

►EXAMPLE 16-2

Consider the following series of monthly benzene measurements (in ppb) collected over 8 months from one upgradient and one downgradient well. What significant difference, if any, does Welch’s t-test find between these populations at the α = .05 significance level?

Benzene (ppb)

Month       BG        DG
Jan         0.5       0.5
Feb         0.8       0.7
Mar         1.6       4.6
Apr         1.8       2.0
May         1.1      16.7
Jun        16.1      12.5
Jul         1.6      26.3
Aug         0.6     186.0

N             8         8
Mean          3.0      31.2
SD            5.31     63.22
Variance     28.204  3997.131


Step 1. Compute the sample mean, standard deviation, and variance of each group as in the table above.

Step 2. Use equation [16.3] to compute Welch’s t-statistic:

t = ( 31.2 − 3.0 ) / √( 28.204/8 + 3997.131/8 ) = 1.257

Step 3. Compute the approximate degrees of freedom using equation [16.4]:

df̂ = ( 28.204/8 + 3997.131/8 )² / [ (28.204/8)²/7 + (3997.131/8)²/7 ] = 7.1 ≈ 7

Step 4. Using Table 16-1 in Appendix D and given α = .05, the upper 95% critical point of the Student’s t-distribution with 7 df is equal to 1.895.

Step 5. Compare the t-statistic against the critical point, tcp. Since t < tcp, the test on the raw concentrations provides insufficient evidence of a true difference in the population means. However, given the order of magnitude difference in the sample means and the fact that several of the downgradient measurements are substantially larger than almost all the background values, we might suspect that one or more of the t-test assumptions was violated, possibly invalidating the result. ◄
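scipy implements Welch's procedure directly via the equal_var=False option. This sketch reproduces the example; small differences from the hand value of 1.257 reflect rounding of the intermediate means:

```python
import numpy as np
from scipy import stats

bg = np.array([0.5, 0.8, 1.6, 1.8, 1.1, 16.1, 1.6, 0.6])      # upgradient benzene (ppb)
dg = np.array([0.5, 0.7, 4.6, 2.0, 16.7, 12.5, 26.3, 186.0])  # downgradient benzene (ppb)

# equal_var=False requests Welch's t-test; alternative='greater' matches the
# one-sided hypothesis that the downgradient mean exceeds background.
result = stats.ttest_ind(dg, bg, equal_var=False, alternative='greater')
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

The p-value is well above 0.05, agreeing with the non-significant conclusion reached in Step 5.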

16.1.3 WELCH’S T-TEST AND LOGNORMAL DATA

Users should recall that if the underlying populations are lognormal instead of normal and Welch’s t-test is run on the logged data, the procedure is not a comparison of arithmetic means but rather between the population geometric means. In the case of a lognormal distribution, the geometric means are equivalent to the population medians. In effect, a test of the log-means is equivalent to a test of the medians in terms of the raw concentrations. Both the population geometric mean and the lognormal median can be estimated from the logged measurements as exp( ȳ ), where y = log x represents a logged value and ȳ is the log-mean. On the other hand, the (arithmetic) lognormal mean on the concentration scale would be estimated as exp( ȳ + s²y/2 ), a quantity larger than the geometric mean or median due to the presence of the term involving s²y, the log-variance.

Although a t-test conducted in the logarithmic domain is not a direct comparison of the arithmetic means, there are situations where that comparison can be inferred from the test results. For instance, consider using the pooled variance two-sample Student’s t-test on logged data with a common (i.e., equal) population log-variance (σ²y) in each group. In that case, finding a larger geometric mean or median in a compliance well population when compared to background also implies that the compliance point arithmetic mean is larger than the background arithmetic mean. However, when using Welch’s t-test, the assumption of equal variances is not required. Because of this, on rare occasions one might find

a larger compliance point geometric mean or median when testing the log-transformed data, even though the compliance point population arithmetic mean is smaller than the background arithmetic mean.

Fortunately, such a reversal can only occur in the unlikely situation that the background population log-variance is distinctly larger than the compliance point log-variance. Factors contributing to an increase in the log-mean concentration level in lognormal populations often serve, if anything, to also increase the log-variance, and almost never to decrease it. Consequently, t-test results indicating a compliance point geometric mean higher than background should very rarely imply a less-than- background compliance point log-variance. This in turn will generally ensure that the compliance point arithmetic mean is also larger than the background arithmetic mean, so that a test of the log-transformed measurements can be used to infer whether a difference exists in the population concentration means.

One caution in this discussion is for cases where Welch’s t-test is not significant on the log-transformed measurements. Because the log-variances (σ²y) are not required to be equal in the two populations when running Welch’s t-test, yet the arithmetic lognormal mean depends on both the population log-mean (µy) and the log-variance through the quantity exp( µy + σ²y/2 ), it should not be inferred that a non-significant comparison on the log-scale between a compliance point and background is equivalent to finding no difference between the lognormal arithmetic means. If the log-variances differ but the log-means do not, the lognormal arithmetic means will still be different even though the lognormal medians might be identical.

Therefore, if a comparison of arithmetic means is required, but the statistical populations are lognormal, care must be taken in interpreting the results of Welch’s t-test. Two possible remedies would include: 1) only running a t-test on lognormal data if the log-variances can be shown to be approximately equivalent (this would allow use of the pooled variance t-test); and 2) using a non-parametric two-sample bootstrap procedure on the original (non-logged) measurements to compare the arithmetic means directly. Consultation with a professional statistician may be required in this second case.
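The second remedy can be sketched as follows. This is only an illustration of the bootstrap idea, resampling under the null hypothesis by recentering both groups on the pooled mean, and is not a substitute for professional statistical consultation:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_mean_test(compliance, background, n_boot=10_000):
    """One-sided bootstrap test of H0: mean(compliance) <= mean(background).
    Both samples are recentered on the pooled mean so that resampling
    occurs under the null hypothesis of equal arithmetic means."""
    compliance = np.asarray(compliance, dtype=float)
    background = np.asarray(background, dtype=float)
    observed = compliance.mean() - background.mean()

    pooled_mean = np.concatenate([compliance, background]).mean()
    c0 = compliance - compliance.mean() + pooled_mean
    b0 = background - background.mean() + pooled_mean

    # Resample each recentered group with replacement and record the
    # difference of resampled means.
    diffs = np.array([
        rng.choice(c0, size=c0.size).mean() - rng.choice(b0, size=b0.size).mean()
        for _ in range(n_boot)
    ])
    p_value = (diffs >= observed).mean()
    return observed, p_value

# Raw (untransformed) benzene data from Example 16-2:
bg = [0.5, 0.8, 1.6, 1.8, 1.1, 16.1, 1.6, 0.6]
dg = [0.5, 0.7, 4.6, 2.0, 16.7, 12.5, 26.3, 186.0]
diff, p = bootstrap_mean_test(dg, bg)
print(f"observed mean difference = {diff:.2f}, bootstrap p = {p:.3f}")
```

With the small, heavily skewed samples of Example 16-2, such a bootstrap p-value should be interpreted cautiously.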

►EXAMPLE 16-3

The benzene data from Example 16-2 indicated no significant upgradient-to-downgradient difference in population means when tested on the raw measurement scale. Check to see whether the same data more closely approximate a lognormal distribution and conduct Welch’s t-test under that assumption.


            Benzene (ppb)        Log(Benzene) log(ppb)
Month       BG        DG          BG        DG
Jan         0.5       0.5       –0.693    –0.693
Feb         0.8       0.7       –0.223    –0.357
Mar         1.6       4.6        0.470     1.526
Apr         1.8       2.0        0.588     0.693
May         1.1      16.7        0.095     2.815
Jun        16.1      12.5        2.779     2.526
Jul         1.6      26.3        0.470     3.270
Aug         0.6     186.0       –0.511     5.226

N             8         8          8         8
Mean          3.0      31.2       0.372     1.876
SD            5.31     63.22      1.0825    1.9847
Variance     28.204  3997.131     1.1719    3.9392

SOLUTION

Step 1. First check normality of the original measurements. To do this, compute the Shapiro-Wilk statistic (SW) separately for each well. SW = 0.505 for the background data, and SW = 0.544 for the downgradient well. Combining these two values using the equations in Section 10.7, the multiple group Shapiro-Wilk statistic becomes G = –6.671, which is significantly less than the 5% critical point of –1.645 from the standard normal distribution.1 Thus, the assumption of normality was violated in Example 16-2.

Step 2. Compute the log-mean, log-standard deviation, and log-variance of each group, as listed above. Then compute the multiple group Shapiro-Wilk test to check for (joint) normality on the log-scale. The respective SW statistics now increase to 0.818 for the background data and 0.964 for the downgradient well. Combining these into an overall test, the multiple group Shapiro-Wilk statistic becomes –0.512 which now exceeds the α = 0.05 standard normal critical point. A log transformation adequately normalizes the benzene data — suggesting that the underlying populations are lognormal in distribution — so that Welch’s t-test can be run on the logged data.
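The per-well Shapiro-Wilk statistics can be reproduced approximately with scipy (the multiple-group combination of Section 10.7 is not a standard library routine, so only the individual SW statistics are computed here; the variable names are mine):

```python
import numpy as np
from scipy import stats

bg = np.array([0.5, 0.8, 1.6, 1.8, 1.1, 16.1, 1.6, 0.6])
dg = np.array([0.5, 0.7, 4.6, 2.0, 16.7, 12.5, 26.3, 186.0])

# Shapiro-Wilk W per group, on the raw and on the log scale
sw_raw = [stats.shapiro(x).statistic for x in (bg, dg)]
sw_log = [stats.shapiro(np.log(x)).statistic for x in (bg, dg)]
# the log-scale statistics should be markedly closer to 1
```

The increase in W after the log transformation mirrors the improvement reported in Step 2.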

Step 2. Using the logged measurements and equation [16.3], the t-statistic becomes:

t = (1.876 − 0.372) / √(1.1719/8 + 3.9392/8) = 1.88

1 Note that α = 5% is used in this example because the total sample size (BG and DG) is n = 16. Nevertheless, the test would also fail at α = 1% or just about any significance level one might choose.


Step 3. Again using the log-variances and equation [16.4], the approximate df works out to:

df = (1.1719/8 + 3.9392/8)² / [ (1.1719/8)²/7 + (3.9392/8)²/7 ] = 10.8 ≈ 11

Note that the approximate df in Welch's t-test is somewhat less than the value that would be computed for the two-sample pooled variance Student's t-test. In that case, with 8 samples per data set, the df would have been 14 instead of 11. The reduction in degrees of freedom is due primarily to the apparent difference in variance between the two groups.

Step 4. Using Table 16-1 in Appendix D and given α = .05, the upper 95% critical point of the Student’s t-distribution with 11 df is equal to 1.796.

Step 5. Comparing t against tcp, we find that 1.88 exceeds 1.796, suggesting a statistically significant difference between the background and downgradient population log-means, at least at the 5% level of significance. This means that the downgradient geometric mean concentration — and equivalently for lognormal populations, the median concentration — is statistically greater than the same statistical measure in background. Further, since the downgradient sample log-variance is over three times the magnitude of the background log-variance, it is also probable that the downgradient arithmetic mean is larger than the background arithmetic mean.
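The hand calculations in Steps 2 through 5 can be cross-checked in Python. `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's t-test, and the Welch-Satterthwaite df is computed directly from the form of equation [16.4] (the variable names below are illustrative):

```python
import numpy as np
from scipy import stats

bg = np.log([0.5, 0.8, 1.6, 1.8, 1.1, 16.1, 1.6, 0.6])
dg = np.log([0.5, 0.7, 4.6, 2.0, 16.7, 12.5, 26.3, 186.0])

# Welch's t-test (scipy reports a two-sided p-value; halve it for one-sided)
t, p_two_sided = stats.ttest_ind(dg, bg, equal_var=False)

# Welch-Satterthwaite approximate degrees of freedom
v1, v2 = bg.var(ddof=1), dg.var(ddof=1)
n1, n2 = len(bg), len(dg)
df = (v1/n1 + v2/n2)**2 / ((v1/n1)**2/(n1 - 1) + (v2/n2)**2/(n2 - 1))
```

Both t ≈ 1.88 and df ≈ 10.8 should agree with the worked example.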

Figure 16-3. Benzene Time Series Plot

[Plot of the BG and DG benzene measurements (ppb, 0–200) against month (0–10); the image itself is not reproduced here.]


A note of caution in this example is that the same test run at the α = 0.01 level would yield a non-significant result, since the upper 99% Student’s t critical point in that case would be 2.718. The fact that the conclusion differs based on a small change to the significance level ought to prompt review of other t-test assumptions. A check of the downgradient sample measurements indicates an upward (non-stationary) trend over the sample collection period (Figure 16-3). This reinforces the fact that the t-test can be ill-suited for measuring differences between populations when trends over time cause instability in the underlying population means. It might be necessary to either perform a formal test of trend at the downgradient well or to limit the compliance data included in the evaluation only to those most representative of current conditions at the downgradient well (e.g., the last four measurements). ◄

16.2 WILCOXON RANK-SUM TEST

BACKGROUND AND PURPOSE

When the underlying distribution of a data set is unknown and cannot be readily identified as normal or normalized via a transformation, a non-parametric alternative to the two-sample t-test is recommended. Probably the best and most practical substitute is the Wilcoxon rank-sum test (Lehmann, 1975; also known as the two-sample Mann-Whitney U test), which can be used to compare a single compliance well or data group against background. Like many non-parametric methods, the Wilcoxon rank-sum test is based on the ranks of the sample measurements rather than the actual concentrations. Some statistical information contained in the original data is lost when switching to the Wilcoxon test, since it only uses the relative magnitudes of data values.

The benefit is that the ranks can be used to conduct a statistical test even when the underlying population has an unusual form and is non-normal. The parametric t-test depends on the population being at least approximately normal; when this is not the case, the critical points of the t-test can be highly inaccurate. The Wilcoxon rank-sum test is also a statistically efficient procedure. That is, when compared to the t-test on normally-distributed data, it performs nearly as well as the t-test, especially for larger sample sizes. Because of this fact, some authors (e.g., Helsel and Hirsch, 2002) have recommended routine use of the Wilcoxon rank-sum even when the parametric t-test might be appropriate.

Although a reasonable strategy for larger data sets, one should be careful about automatically preferring the Wilcoxon over the t-test on samples as small as those often available in groundwater monitoring. For instance, a Wilcoxon rank-sum test of four samples in each of a background and compliance well at an α = 0.01 level of significance can never identify a significant difference between the two populations. This is true no matter what the sample concentrations are, even if all four compliance measurements are larger than any of the background measurements. The Wilcoxon test in this setting requires either at least five samples in one of the groups, or a higher level of significance (say α = 0.05 or 0.10).
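The four-and-four limitation follows from a simple counting argument: with no ties, the most extreme possible outcome (every compliance rank above every background rank) occurs with probability 1/C(m+n, n) under the null hypothesis. A quick check:

```python
from math import comb

# smallest attainable one-sided p-value for the Wilcoxon rank-sum test
p_min_4_4 = 1 / comb(8, 4)   # m = n = 4: 1/70, about 0.0143 (never attains alpha = 0.01)
p_min_5_4 = 1 / comb(9, 4)   # m = 5, n = 4: 1/126, about 0.0079 (attainable at alpha = 0.01)
```

Since 1/70 exceeds 0.01, no arrangement of four-and-four data can be significant at the 1% level, exactly as stated above.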

The Wilcoxon test statistic (W) consists of the sum of the ranks of the compliance well measurements. The rationale of the test is that if the ranks of the compliance data are quite large relative to the background ranks, then the hypothesis that the compliance and background values came from the

same population ought to be rejected. Large values of the W statistic give evidence of possible contamination in the compliance well. Small values of W, on the other hand, suggest there is little difference between the background and compliance well measurements.

REQUIREMENTS AND ASSUMPTIONS

The Wilcoxon rank-sum test assumes that both populations being compared follow a common, though unknown, parent distribution under the null hypothesis (Hollander and Wolfe, 1999). Such an assumption is akin to that used in the two-sample pooled variance Student's t-test, although the form of the common distribution need not be normal. The Wilcoxon test assumes that both population variances are equal, unlike Welch's t-test. Side-by-side box plots of the two data groups can be compared (Chapter 9) to examine whether or not the level of variability appears to be approximately equal in both samples. Levene's test (Chapter 11) can also be applied as a formal test of heteroscedasticity given its relative robustness to non-normality. If there is a substantial difference in variance between the background and compliance point populations, one remedy is the Fligner-Policello test (Hollander and Wolfe, 1999), a more complicated rank-based procedure.

The Wilcoxon procedure as described in the Unified Guidance is generally used as an interwell test, meaning that it should be avoided under conditions of significant natural spatial variability. Otherwise, differences between background and compliance point wells identified by the test may be mistakenly attributed to possible contamination, instead of natural differences in geochemistry, etc. At small sites, the Wilcoxon procedure can be adapted for use as an intrawell test, involving a comparison between intrawell background and more recent measurements from the same well. However, the per-comparison false positive rate in this case should be raised to either α = 0.05 or α = 0.10. More generally, a significance level of at least 0.05 should be adopted whenever the sample size of either group is no greater than n = 4.

In addition to spatial stationarity (i.e., lack of natural spatial variability), the Wilcoxon rank-sum test assumes that the tested populations are stationary over time, so that mean levels are not trending upward or downward. As with the t-test, if trends are evident in time series plots of the sample data, a formal trend test might need to be employed instead of the Wilcoxon rank-sum, or the scope of the sample may need to be limited to only include data representative of current groundwater conditions.

HANDLING TIES

When ties are present in a combined data set, adjustments need to be made to the usual Wilcoxon test statistic. Ties will occur in two situations: 1) detected measurements reported with the same numerical value and 2) non-detect measurements with a common RL. Non-detects are considered ties because the actual concentrations are unknown; presumably, every non-detect has a concentration somewhere between zero and the quantitation limit [QL]. Since these measurements cannot be ordered and ranked explicitly, the approximate remedy in the Wilcoxon rank-sum procedure is to treat such values as ties.

One may be able to partially rank the set of non-detects by making use of laboratory-supplied analytical qualifiers. As discussed in Section 6.3, there are probable concentration differences between measurements labeled as undetected (i.e., given a “U” qualifier), non-detect (usually reported without a qualifier), or as estimated concentrations (usually labeled with “J” or “E”). One reasonable strategy is to

group all U values as the lowest set of ties, other non-detects as a higher set of ties, and to rank all J and/or E values according to their estimated concentrations. In situations where estimated values for J and E samples are not provided, treat these measurements as the highest group of tied non-detects. Always give the highest ranks to explicitly quantified or estimated concentration measurements. In this way, a more detailed partial ranking of the data will be possible.

Tied observations in the Wilcoxon rank-sum test are handled as follows. All tied observations in a particular group should receive the same rank. This rank, called the midrank (Lehmann, 1975), is computed as the average of the ranks that would be assigned to a group of ties if the tied values actually differed by a tiny amount and could be ranked uniquely. For example, if the first four ordered observations are all the same, the midrank given to each of these samples would be equal to (1 + 2 + 3 + 4)/4 = 2.5. If the next highest measurement is a unique value, its rank would be 5, and so on until all observations are appropriately ranked. A more detailed example is illustrated in Figure 16-4.

Figure 16-4. Computation of Midranks for Groups of Tied Values

Order   Concentration   Midrank
1       <1              1.5   = (1 + 2)/2
2       <1              1.5
3       1.2             3
4       1.3             5     = (4 + 5 + 6)/3
5       1.3             5
6       1.3             5
7       1.5             7.5   = (7 + 8)/2
8       1.5             7.5
9       1.6             9
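The midrank scheme in Figure 16-4 matches the default "average" method of `scipy.stats.rankdata`. A quick check, with the two <1 values entered as equal placeholders below every detect (an assumption purely for illustration):

```python
from scipy.stats import rankdata

# the two <1 values entered as 0.99 so they tie below all detected values
values = [0.99, 0.99, 1.2, 1.3, 1.3, 1.3, 1.5, 1.5, 1.6]
midranks = rankdata(values, method='average')
# -> [1.5, 1.5, 3.0, 5.0, 5.0, 5.0, 7.5, 7.5, 9.0]
```

The placeholder value is arbitrary so long as it sorts below the smallest detect; only the relative ordering matters to the ranks.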

HANDLING NON-DETECTS

If either of the samples contains a substantial fraction of non-detect measurements (say more than 20-30%), identification of an appropriate distributional model (e.g., normality) may be difficult, effectively ruling out the use of parametric tests like the t-test. Even when a normal or other parametric model can be fit to such left-censored data, a t-test cannot be run without imputing estimated values for each non-detect. Past guidance has recommended the Wilcoxon rank-sum test as an alternative to the t-test in the presence of non-detects, with all non-detects at a common RL being treated as tied values.

If the combined data set contains a single, common RL, that limit is smaller than any of the detected/quantified values, and the proportion of censored data is small (say no more than 10-15% of the total), it may be reasonable to treat the non-detects as a set of tied values and to apply the Wilcoxon rank-sum test adjusted for ties (described below). More generally, however, the statistical behavior of the Wilcoxon statistic depends on a full and accurate ranking of all the measurements. Groups of left-censored values cannot be ranked with certainty, even if each such measurement possesses a common RL. The problem is compounded in the presence of multiple RLs and/or quantified values less than the RL(s). What is the relative ranking, for instance, of the pair of measurements (<1, <5)? A higher RL does not guarantee that the second observation is larger in magnitude than the first. A similar uncertainty plagues the pair of values (4, <10). And there is no guarantee either that the pair (<2, <2) is actually tied. One may be able to partially rank the set of non-detects by making use of laboratory-supplied analytical qualifiers as described in the previous section.

Because non-detects generally prevent a complete ranking of the measurements, the Wilcoxon rank-sum test is not recommended for most censored data sets. Instead, a modified version of the Tarone-Ware test (Hollander and Wolfe, 1999) is presented in Section 16.3. The Tarone-Ware test is essentially a generalization of the Wilcoxon test specifically designed to accommodate censored values.

PROCEDURE

Step 1. To conduct the Wilcoxon rank-sum test, first combine the compliance and background data into a single data set. Sort the combined values from smallest to largest, and — if there are no tied values or non-detects with a common RL — rank the ordered values from 1 to N. Assume there are n compliance well samples and m background samples so that N = m + n. Denote the ranks of the compliance samples by Ci and the ranks of the background samples by Bi.

Step 2. If there are groups of tied values (including non-detects with a common RL), form the midranks of the combined data set by assigning to each set of ties the average of the potential ranks the tied members would have been given if they could be uniquely ranked.

Step 3. Sum the ranks of the compliance samples to get the Wilcoxon statistic W:

W = Σ(i=1 to n) Ci    [16.5]

Step 4. Find the α-level critical point of the Wilcoxon test, making use of the fact that the sampling distribution of W under the null hypothesis, H0, can be approximated by a normal curve. By standardizing the statistic W (i.e., subtracting off its mean or expected value and dividing by its standard deviation), the standardized statistic or z-score, Z, can be approximated by a standard normal distribution. Then an appropriate critical point (zcp) can be determined as the upper (1–α) × 100th percentage point of the standard normal distribution, listed in Table 10-1 in Appendix D.
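The standard normal critical points of Table 10-1 can also be obtained from any statistical library; for instance, using scipy:

```python
from scipy.stats import norm

z_95 = norm.ppf(0.95)   # upper 95th percentage point, about 1.645
z_99 = norm.ppf(0.99)   # upper 99th percentage point, about 2.326
```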

Step 5. To compute Z when there are no ties, first compute the expected value and standard deviation of W, given respectively by the following equations:

E(W) = n(N + 1)/2    [16.6]

SD(W) = √[ mn(N + 1)/12 ]    [16.7]

Then compute the approximate z-score for the Wilcoxon rank-sum test as:


Z = [ W − E(W) − 1/2 ] / SD(W)    [16.8]

The factor of 1/2 in the numerator serves as a continuity correction, since the discrete distribution of the Wilcoxon statistic W is being approximated by a continuous normal distribution.

Step 6. If there are tied values, compute the expected value of W using [16.6] and the standard deviation of W adjusted for the presence of ties with the equation:

 3  * mn(N + 1) g ti − ti SD (W ) = 1− Σi=1  [16.9] 12  N 3 − N 

where g equals the number of different groups of tied observations and ti represents the number of tied values in the ith group.

Then compute the approximate z-score for the Wilcoxon rank-sum test as:

Z = [ W − E(W) − 1/2 ] / SD*(W)    (16.10)

Step 7. Compare the approximate z-score against the critical point, zcp. If Z exceeds zcp, conclude that the compliance well concentrations are significantly greater than background at the α level of significance. If not, conclude that the null hypothesis of equivalent background and compliance point distributions cannot be rejected.

►EXAMPLE 16-4

The table below contains copper concentrations (ppb) found in groundwater samples at a Western monitoring facility. Wells 1 and 2 denote background wells while Well 3 is a single downgradient well suspected of being contaminated. Calculate the Wilcoxon rank-sum test on these data at the α = .01 level of significance.


Copper Concentration (ppb)

         Background           Compliance
Month    Well 1     Well 2    Well 3
1        4.2        5.2       9.4
2        5.8        6.4       10.1
3        11.3       11.3      14.5
4        7.0        11.5      16.1
5        7.0        10.1      21.5
6        8.2        9.7       17.6

SOLUTION

Step 1. Sort the N = 18 observations from least to greatest. Since there are 3 pairs of tied values, compute the midranks as in the table below. Note that m = 12 and n = 6.

Step 2. Compute the Wilcoxon statistic by summing the compliance well ranks, so that W = 84.5.

Step 3. Using α = .01, find the upper 99th percentage point of the standard normal distribution in Table 10-1 of Appendix D. This gives a critical value of zcp = 2.326.

Midranks of Copper Concentrations

         Background           Compliance
Month    Well 1     Well 2    Well 3
1        1          2         8
2        3          4         10.5
3        12.5       12.5      15
4        5.5        14        16
5        5.5        10.5      18
6        7          9         17

Step 4. Compute the expected value and adjusted standard deviation of W using equations [16.6] and [16.9], recognizing there are 3 groups of ties with ti = 2 measurements in each group:

E(W) = (1/2)(6)(19) = 57

SD*(W) = √{ [ (12)(6)(18 + 1)/12 ] · [ 1 − 3(2³ − 2)/(18³ − 18) ] } = √113.647 = 10.661


Then compute the standardized statistic or z-score, Z, using equation (16.10):

Z = (84.5 − 57 − 0.5) / 10.661 = 2.533

Step 5. Compare the observed z-score against the critical point zcp. Since Z = 2.533 > 2.326 = z.99, there is statistically significant evidence of possible contamination in the compliance well at the α = .01 significance level. ◄
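The computations in Example 16-4 can be reproduced with a short script implementing equations [16.5] through [16.9] and (16.10); `scipy.stats.rankdata` supplies the midranks. The function below is a sketch whose name and interface are not from the guidance:

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_rank_sum(background, compliance):
    """Upper-tailed Wilcoxon rank-sum test using midranks and the
    tie-adjusted standard deviation of equation [16.9]; returns (W, Z)."""
    bg = np.asarray(background, float)
    cp = np.asarray(compliance, float)
    combined = np.concatenate([bg, cp])
    ranks = rankdata(combined)            # midranks for tied values
    m, n = len(bg), len(cp)
    N = m + n
    W = ranks[m:].sum()                   # sum of the compliance ranks
    EW = n * (N + 1) / 2                  # [16.6]
    _, counts = np.unique(combined, return_counts=True)
    tie_term = (counts**3 - counts).sum() / (N**3 - N)
    SD = np.sqrt(m * n * (N + 1) / 12 * (1 - tie_term))   # [16.9]
    Z = (W - EW - 0.5) / SD               # (16.10)
    return W, Z

# copper data of Example 16-4 (Wells 1 and 2 pooled as background)
bg = [4.2, 5.8, 11.3, 7.0, 7.0, 8.2, 5.2, 6.4, 11.3, 11.5, 10.1, 9.7]
cp = [9.4, 10.1, 14.5, 16.1, 21.5, 17.6]
W, Z = wilcoxon_rank_sum(bg, cp)
# -> W = 84.5, Z = 2.533
```

With no ties, the tie correction term is zero and [16.9] reduces to [16.7], so the same function covers both cases.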

16.3 TARONE-WARE TWO-SAMPLE TEST FOR CENSORED DATA

BACKGROUND

In statistical terms, non-detect measurements represent left-censored values, in which the ‘true’ magnitude is known only to exist somewhere between zero and the RL, i.e., within the concentration interval [0, RL). The uncertainty introduced by non-detects impacts the applicability of other two-sample comparisons like the t-test and Wilcoxon rank-sum test. Because the Student’s t-test cannot be run unless a specific magnitude is assigned to each observation, estimated or imputed values need to be assigned to the non-detects. The Wilcoxon procedure requires that every observation be ranked in relation to other values in the combined sample, even though non-detects allow at best only a partial ranking, as discussed in Section 16.2.

The Tarone-Ware two-sample test can be utilized to overcome these limitations for many groundwater data sets with substantial fractions of non-detects along with multiple RLs. Tarone and Ware (1977) actually proposed a family of tests to analyze censored data. One variant of this family is the logrank test, frequently used in survival analysis for right-censored data. Another variant is known as Gehan’s generalized Wilcoxon test (Gehan, 1965). The Unified Guidance presents the variant recommended by Tarone and Ware, slightly modified to account for left-censored measurements.

The key benefit of the Tarone-Ware procedure is that it is designed to provide a valid statistical test, even with a large fraction of censored data. As a non-parametric test, it does not require normally- distributed observations. In addition, non-detects do not have to be imputed or even fully ranked. Instead, for each detected concentration (c), a simple count needs to be made within each sample of the number of detects and non-detects no greater in magnitude than c. These counts are then combined to form the Tarone-Ware statistic.

REQUIREMENTS AND ASSUMPTIONS

The null hypothesis (H0) under the Tarone-Ware procedure assumes that the populations in background and the compliance well being tested are identical. This implies that the variances in the two distributions are the same, thus necessitating a check of equal variances. With many non-detect data sets, it can be very difficult to formally test for heteroscedasticity. Often the best remedy is to make an informal, visual check of variability using side-by-side box plots (Chapter 9), setting each non-detect to half its RL.


The Tarone-Ware test will typically be used as an interwell test, meaning that it should be avoided under conditions of significant natural spatial variability. In addition, the tested populations should be stationary over time, so that mean levels are not trending upward or downward. Both assumptions can be more difficult to verify with censored data. Spatial variation can sometimes be checked with a non-parametric Kruskal-Wallis test (Chapter 17). Trends with censored data can be identified with the Mann-Kendall test (Chapter 14).

As with other two-sample tests, if a trend is identified in one or both samples, a formal trend test may be needed instead of the Tarone-Ware, or the scope of the sample may need to be limited to only include data representative of current groundwater conditions.

Because the Tarone-Ware test presented in the Unified Guidance depends on counts of observations with magnitudes no greater than each detected concentration, and in that sense generalizes the ranking process used by the Wilcoxon rank-sum procedure, it is recommended that estimated concentrations (i.e., sample measurements assigned unique magnitudes but labeled with qualifiers “J” or “E”) be treated as detections for the purpose of computing the Tarone-Ware statistic. Such observations provide valuable statistical information about the relative ranking of each censored sample, even if estimated concentrations possess larger measurement uncertainty than fully quantified values.

PROCEDURE

Step 1. To compare a background data set against a compliance well using the Tarone-Ware test, first combine the two samples. Locate and sort the k distinct detected values and label these as:

w(1) < w(2) < … < w(k)

Note that the set of w’s will not include any RLs from non-detects. Also, if two or more detects are tied, k will be less than the total number of detected measurements.

Step 2. For the combined sample, count the number of observations (described by Tarone & Ware as ‘at risk’) for each distinct detected concentration. That is, for i = 1,…,k, let ni = the number of detected values no greater than w(i) plus the number of non-detects with RLs no greater than w(i). Also let di = the number of detects with concentration equal to w(i). This value will equal 1 unless there are multiple detected values with the same reported concentration.

Step 3. For the compliance sample, count the observations ‘at risk’, much as in Step 2. For i = 1 to k, let ni2 = the number of detected compliance values no greater than w(i) plus the number of compliance point non-detects with RLs no greater than w(i). Also let di2 = the number of compliance point detects with concentration equal to w(i). Note that di2 = 0 if w(i) represents a detected value from background. Also compute ni1, the number ‘at risk’ in the background sample.

Step 4. For i = 1 to k, compute the expected number of compliance point detections using the formula:

Ei2 = di ni2 / ni    (16.11)

Also compute the variance of the number of compliance point detections, using the equation:


Vi2 = di (ni − di) ni1 ni2 / [ ni² (ni − 1) ]    (16.12)

Note in equation (16.12) that if ni = 1 for the smallest detected value, the numerator of Vi2 will necessarily equal zero (since di = 1 in that case), so compute Vi2 = 0.

Step 5. Construct the Tarone-Ware statistic (TW) with the equation:

TW = Σ(i=1 to k) √ni (di2 − Ei2) / √[ Σ(i=1 to k) ni Vi2 ]    (16.13)

Step 6. Find the α-level critical point of the Tarone-Ware test, making use of the fact that the sampling distribution of TW under the null hypothesis, H0, is designed to approximately follow a standard normal distribution. An appropriate critical point (zcp) can be determined as the upper (1–α) × 100th percentage point of the standard normal distribution, listed in Table 10-1 of Appendix D.

Step 7. Compare TW against the critical point, zcp. If TW exceeds zcp, conclude that the compliance well concentrations are significantly greater than background at the α level of significance. If not, conclude that the null hypothesis of equivalent background and compliance point distributions cannot be rejected.

►EXAMPLE 16-5

A heavily industrial site has been historically contaminated with tetrachloroethylene [PCE]. Using the Tarone-Ware procedure at an α = .05 significance level, test the following PCE measurements collected from one background and one compliance well.

PCE (ppb)

Background   Compliance
<4           6.4
1.5          10.9
<2           7
8.7          14.3
5.1          1.9
<5           10.0
6.8          <5

SOLUTION

Step 1. Combine the background and compliance point samples. List and sort the distinct detected values (as in the table below), giving k = 10. Note that the 4 non-detects comprise 28% of the combined data.

Step 2. Compute the number of measurements (ni) in the combined sample ‘at risk’ for each distinct detected value (w(i)), indexed from i = 1,…, 10, by adding the number of detects and non-detects no greater than w(i), as listed in column 6 of the table below. Also list in column 3 the number of detected values (di) exactly equal to w(i).

Step 3. For the compliance point sample, compute the number (ni2) ‘at risk’ for each distinct detected value, as listed in column 5 below. Also compute the number (ni1) ‘at risk’ for the background sample (column 4) and the number of compliance point measurements exactly equal to w(i) (column 2).

Step 4. Use equations (16.11) and (16.12) to compute the expected value (Ei2) and variance (Vi2) of the number of compliance point detections at each w(i) (columns 7 and 8 below).

w(i)    di2   di    ni1   ni2   ni    Ei2      Vi2
1.5     0     1     1     0     1     0        0
1.9     1     1     1     1     2     0.5      0.25
5.1     0     1     5     2     7     0.2857   0.2041
6.4     1     1     5     3     8     0.375    0.2344
6.8     1     1     5     4     9     0.4444   0.2469
7.0     1     1     5     5     10    0.5      0.25
8.7     0     1     6     5     11    0.4545   0.2479
10.0    1     1     6     6     12    0.5      0.25
10.9    1     1     6     7     13    0.5385   0.2485
14.3    1     1     6     8     14    0.5714   0.2449

Step 5. Calculate the Tarone-Ware statistic (TW) using equation (16.13):

TW = [ √1(0 − 0) + √2(1 − 0.5) + √7(0 − 0.2857) + … + √14(1 − 0.5714) ] / √[ 1(0) + 2(0.25) + 7(0.2041) + … + 14(0.2449) ] = 1.85

Step 6. Determine the 0.05 level critical point from Table 10-1 in Appendix D as the upper 95th percentage point from a standard normal distribution. This gives zcp = 1.645.

Step 7. Compare the Tarone-Ware statistic against the critical point. Since TW = 1.85 > 1.645 = zcp, conclude that the PCE concentrations are significantly greater at the compliance well than in background at the 5% significance level. ◄


CHAPTER 17. ANOVA, TOLERANCE LIMITS, AND TREND TESTS

17.1 ANALYSIS OF VARIANCE [ANOVA] ...... 17-1
  17.1.1 One-Way Parametric F-Test ...... 17-1
  17.1.2 Kruskal-Wallis Test ...... 17-9
17.2 TOLERANCE LIMITS ...... 17-14
  17.2.1 Parametric Tolerance Limits ...... 17-15
  17.2.2 Non-Parametric Tolerance Intervals ...... 17-18
17.3 TREND TESTS ...... 17-21
  17.3.1 Linear Regression ...... 17-23
  17.3.2 Mann-Kendall Trend Test ...... 17-30
  17.3.3 Theil-Sen Trend Line ...... 17-34

This chapter describes two statistical procedures — analysis of variance [ANOVA] and tolerance limits — explicitly allowed within §264.97(h) and §258.53(g) for use in groundwater monitoring. The Unified Guidance does not generally recommend either technique for formally making regulatory decisions about compliance wells or regulated units, instead focusing on prediction limits, control charts, and confidence intervals. But both ANOVA and tolerance tests are standard statistical procedures that can be adapted for a variety of uses. ANOVA is particularly helpful in both identifying on-site spatial variation and in sometimes aiding the computation of more effective and statistically powerful intrawell prediction limits (see Chapters 6 and 13 for further discussion).

This chapter also presents selected trend tests as an alternative statistical method that can be quite useful in groundwater detection monitoring, particularly when groundwater populations are not stationary over time. Although trend tests are not explicitly listed within the RCRA regulations, they possess advantages in certain situations and can meet the performance requirements of §264.97(i) and §258.53(h). They can also be helpful during diagnostic evaluation and establishment of historical background (Chapter 5) and in verifying key statistical assumptions (Chapter 14).

17.1 ANALYSIS OF VARIANCE [ANOVA]

17.1.1 ONE-WAY PARAMETRIC F-TEST

BACKGROUND AND PURPOSE

The parametric one-way ANOVA is a statistical procedure to determine whether there are statistically significant differences in mean concentrations among a set of wells. In groundwater applications, the question of interest is whether there is potential contamination at one or more compliance wells when compared to background. By finding a significant difference in means and specifically higher average concentrations at one or more compliance wells, ANOVA results can sometimes be used to identify unacceptably high contaminant levels in the absence of natural spatial variability.

Like the two-sample t-test, the one-way ANOVA is a comparison of population means. However, the one-way parametric ANOVA is a comparison of several populations, not just two: one set of background data versus at least two compliance wells. The F-statistic that forms the heart of the ANOVA procedure is actually an extension of the t-statistic; an F-statistic formed in a comparison of only two datasets reduces to the square of the usual pooled variance Student’s t-statistic. Like the t-statistic, the F-statistic is a ratio of two quantities. The numerator is a measure of the average squared difference observed between the pairs of sample means, while the denominator represents the average variability found in each well group.
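The relationship between the F-statistic and the pooled t-statistic is easy to verify numerically; the simulated data below are purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=8)   # e.g., a background sample
b = rng.normal(12.0, 2.0, size=8)   # e.g., one compliance well

F, p_f = stats.f_oneway(a, b)       # one-way ANOVA with exactly two groups
t, p_t = stats.ttest_ind(a, b)      # pooled-variance Student's t-test

# with exactly two groups, F equals t squared and the p-values coincide
```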

Under the null hypothesis that all the wells or groups have the same population mean, the F-statistic follows the F-distribution. Unlike the t-distribution, which has a single degrees of freedom [df] quantity, there are two df quantities associated with F: one for the numerator and the other for the denominator. When critical points are needed from the F-distribution, both degrees of freedom values must be specified.

Computation of the F-statistic is only the first step of the full ANOVA procedure when used as a formal compliance test. It can only determine whether a significant mean difference exists among the wells or data groups, not which specific compliance wells differ from background. To accomplish this latter task when a significant F-test is registered, individual tests between each compliance well and background need to be conducted, known as individual post-hoc comparisons or contrasts. These individual tests are a specially constructed series of t-tests, with critical points chosen to limit the test-wise or experiment-wise false positive rate.

REQUIREMENTS AND ASSUMPTIONS

The parametric ANOVA assumes that the data groups are normally-distributed with constant variance. This means that the group residuals should be tested for normality (Chapter 10) and that the groups have to be tested for equality of variance, perhaps with Levene's test (Chapter 11). Since the F-test used in the one-way ANOVA is reasonably robust to small departures from normality, the first of these assumptions turns out to be less critical than the second. Research (Milliken and Johnson, 1984) has shown that the statistical power of the F-test is strongly affected by inequality in the population variances. A noticeable drop in power is seen whenever the ratio of the largest to smallest group variance is at least 4. A severe drop in power is found whenever the ratio of the largest to smallest group variance is at least a factor of 10. These ratios imply that the F-test will lose some statistical power if any of the group population standard deviations is at least twice the size of any other group's standard deviation, and that the power will be greatly curtailed if any standard deviation is at least 3 times as large as any other group's.
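The two assumption checks just described can be sketched with SciPy: a normality test on the group residuals and Levene's test for equal variances. The well labels, measurements, and 0.01 cutoff below are hypothetical illustrations, not values from the guidance.

```python
# Sketch of the two parametric ANOVA assumption checks: normality of the
# group residuals and equality of variance across groups. Data are hypothetical.
import numpy as np
from scipy import stats

groups = {
    "BG":   [4.1, 3.9, 3.8, 4.2, 4.0, 3.7, 4.3, 3.9],
    "CW-1": [4.6, 5.1, 3.7, 4.0],
    "CW-2": [4.4, 5.2, 5.3, 5.1],
}

# 1) Normality of the group residuals (each observation minus its group mean)
residuals = np.concatenate([np.asarray(x) - np.mean(x) for x in groups.values()])
sw_stat, sw_p = stats.shapiro(residuals)

# 2) Equality of variance across groups (Levene's test)
lev_stat, lev_p = stats.levene(*groups.values())

# Rough screen on the variance-ratio rule of thumb mentioned in the text
variances = [np.var(x, ddof=1) for x in groups.values()]
ratio = max(variances) / min(variances)

print(sw_p > 0.01, lev_p > 0.01, round(ratio, 2))
```

If either check fails, the text's advice applies: try a transformation, or fall back to the Kruskal-Wallis test of Section 17.1.2.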

If the hypothesis of equal variances is rejected or if the group residuals are found to violate an assumption of normality (especially at the .01 significance level or less), one should consider a transformation of the data, followed by testing of the ANOVA assumptions on the transformed scale. If the residuals from the transformed data still do not satisfy normality or if there are too many non-detect measurements to adequately test the assumptions, a non-parametric ANOVA (called the Kruskal-Wallis test) using the ranks of the observations is recommended instead (see Section 17.1.2).

Since ANOVA is inherently an interwell statistical method, a critical point in using ANOVA for compliance testing is that the well field should exhibit minimal spatial variability. Interwell tests also require the groundwater well populations to be spatially stationary, so that absent a release the population well means are stable over time. Because spatial variation is frequently observed in many groundwater constituents, especially for common inorganic constituents and some metals, ANOVA may not be usable as a compliance testing tool. Yet it can be utilized on the same data sets to help identify the presence of spatial variability. In this capacity, the same procedure and formulas are utilized as described below (with the exception of the post-hoc contrasts, which are unnecessary for assessing spatial variation). The results are then employed to guide the appropriate choice of a compliance test (e.g., intrawell or interwell prediction limits).

For formal ANOVA testing under §264.97(i) and §258.53(h), the experiment-wise or test-wise false positive rate (α) needs to be at least 5% during any statistical evaluation for each constituent tested. Furthermore, the individual post-hoc contrasts used to test single compliance wells against background need to be run at a significance level of at least α* = 1% per well. Combined, these regulatory constraints imply that if there are more than five post-hoc contrasts to be tested (i.e., more than 5 compliance wells are included in the ANOVA test), the overall, maximal false positive rate of the procedure will tend to be greater than 5%, perhaps substantially so. Also, since α = 5% is the minimum significance level per monitoring constituent, running multiple ANOVA procedures to accommodate a list of constituents will lead to a minimum site-wide false positive rate [SWFPR] greater than the Unified Guidance recommended target of 10% per statistical evaluation.
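The cumulative effect of running many tests at α = 5% follows from the standard formula for independent comparisons, 1 − (1 − α)^k. A quick sketch, using a hypothetical count of ten monitored constituents:

```python
# If each of k independent comparisons is run at significance level alpha,
# the chance of at least one false positive is 1 - (1 - alpha)^k.
# The constituent count below is hypothetical.
alpha = 0.05
n_constituents = 10          # one ANOVA per monitored constituent

swfpr = 1 - (1 - alpha) ** n_constituents
print(round(swfpr, 3))       # cumulative false positive rate across constituents
```

Even ten constituents at α = 5% apiece push the cumulative rate to roughly 40%, well above the 10% SWFPR target, which is why the guidance steers multi-well, multi-constituent programs toward the retesting strategies of Chapters 18 through 20.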

In addition, if a contaminated compliance well exists but too many uncontaminated wells are also included in the same ANOVA, the F-statistic may result in a non-significant outcome. Performing ANOVA with more than 10 to 15 well groups can “swamp” the procedure, causing it to lose substantial power. It therefore will be necessary to consider one of the retesting strategies described in Chapters 18 and 20 as an alternative to ANOVA in the event that either the expected false positive rate is too large, or if more than a small number of wells need to be tested.

Another drawback to the one-way ANOVA is that the F-test accounts for all possible paired comparisons among the well groups. In some cases, the F-statistic may be significant even though all of the contrasts between compliance wells and background are non-significant. This does not mean that the F-test has necessarily registered a false positive. Rather, it may be that two of the compliance wells significantly differ from each other, but neither differs from background. This could happen, for instance, if a compliance well has a lower mean concentration than background while other compliance wells have near background means. The F-test looks for all possible differences between pairs of well groups, not just those comparisons against background.

In order to run a valid one-way F-test, a minimum number of observations are needed. Denoting the number of data groups by p, at least p > 2 groups must be compared (e.g., two or more compliance wells versus background). Each group should have at least three to four statistically independent observations and the total sample size, N, should be large enough so that N–p > 5. As long as p ≥ 3 and there are at least 3 observations per well, this last requirement will always be met. But the statistical power of an ANOVA to identify differences in population means tends to be minimal unless there are at least 4 or more observations per data group. It is also helpful to have at least 8 measurements in background for the test.
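The minimum-data rules just listed can be collected into a small screening function. This is only an illustrative sketch; the function name and the example group sizes are hypothetical.

```python
# Minimal screen of the sample-size requirements described in the text:
# p > 2 groups, at least 3-4 observations per group, and N - p > 5.
def anova_data_adequate(group_sizes, min_per_group=4):
    p = len(group_sizes)          # number of data groups (background first)
    N = sum(group_sizes)          # total sample size
    return (p > 2
            and all(n >= min_per_group for n in group_sizes)
            and N - p > 5)

print(anova_data_adequate([8, 4, 4, 4, 4]))   # background of 8 plus four wells
print(anova_data_adequate([4, 4]))            # only two groups: not enough
```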

Similarly to the two-sample t-test, it may be very difficult to verify that the measurements are statistically independent with only a handful of observations per well. One should additionally ensure that the samples are collected far enough apart in time to avoid significant autocorrelation (see Chapter 14 for further discussion). A periodic check of statistical independence in each well may be possible after a few testing periods, when enough data have been collected to enable a statistical assessment of this assumption.

PROCEDURE

Step 1. Combine all the relevant background data collected from multiple wells into one group. These wells should have insignificant mean differences under prior ANOVA testing. If the regulated unit has (p–1) compliance wells, there will then be a total of p data groups. Because there may be different numbers of observations per well, denote the sample size of the ith group by ni and the total number of data points across all groups by N.

Step 2. Denote the observations in the ith well group by xij for i = 1 to p and j = 1 to ni. The first subscript designates the well, while the second denotes the jth value in the ith well. Then compute the mean of each well group along with the overall (grand) mean of the combined dataset using the following formulas:

\bar{x}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}     [17.1]

\bar{x}_{\bullet\bullet} = \frac{1}{N} \sum_{i=1}^{p} \sum_{j=1}^{n_i} x_{ij}     [17.2]

Step 3. Compute the sum of squares of differences between the well group means and the grand mean, denoted SSwells:

SS_{wells} = \sum_{i=1}^{p} n_i \left( \bar{x}_{i\bullet} - \bar{x}_{\bullet\bullet} \right)^2 = \sum_{i=1}^{p} n_i \bar{x}_{i\bullet}^2 - N \bar{x}_{\bullet\bullet}^2     [17.3]

The formula on the far right is usually the most convenient for calculation. This sum of squares has (p–1) degrees of freedom associated with it and is a measure of the variability between wells. It constitutes the numerator of the F-statistic.

Step 4. Compute the corrected total sum of squares, denoted by SStotal:

SS_{total} = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_{\bullet\bullet} \right)^2 = \sum_{i=1}^{p} \sum_{j=1}^{n_i} x_{ij}^2 - N \bar{x}_{\bullet\bullet}^2     [17.4]

The far right equation is convenient for calculation. This sum of squares has (N–1) degrees of freedom associated with it and is a measure of the variability in the entire dataset. In fact, if SStotal is divided by (N–1), one gets the overall sample variance.

Step 5. Compute the sum of squares of differences between the observations and the well group means. This is known as the within-wells component of the total sum of squares or, equivalently, as the sum of squares due to error. It is easiest to obtain by subtraction using the far right side of equation [17.5] and is denoted SSerror:


SS_{error} = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_{i\bullet} \right)^2 = SS_{total} - SS_{wells}     [17.5]

SSerror is associated with (N–p) degrees of freedom and is a measure of the variability within well groups. This quantity goes into the denominator of the F-statistic.

Step 6. Compute the mean sum of squares for both the between-wells and within-wells components of the total sum of squares, denoted by MSwells and MSerror. These quantities are simply obtained by dividing each sum of squares by its corresponding degrees of freedom:

MS_{wells} = SS_{wells} / \left( p - 1 \right)     [17.6]

MS_{error} = SS_{error} / \left( N - p \right)     [17.7]

Step 7. Compute the F-statistic by forming the ratio between the mean sum of squares for wells and the mean sum of squares due to error, as in Figure 17-1. This layout is known as a one-way parametric ANOVA table and illustrates the sum of squares contribution to the total variability, along with the corresponding degrees of freedom, the mean squares components, and the final F-statistic calculated as F = MSwells/MSerror. Note that the first two rows of the one-way table sum to the last row.

Figure 17-1. One-Way Parametric ANOVA Table

Source of Variation     Sums of Squares   Degrees of Freedom   Mean Squares               F-Statistic
Between Wells           SSwells           p–1                  MSwells = SSwells/(p–1)    F = MSwells/MSerror
Error (within wells)    SSerror           N–p                  MSerror = SSerror/(N–p)
Total                   SStotal           N–1
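Steps 1 through 7 can be sketched directly from equations [17.1] through [17.7]. The function below is an illustrative implementation, and the input measurements are hypothetical.

```python
# One-way ANOVA table assembled from equations [17.1]-[17.7].
# Input data are hypothetical; background is the first group.
import numpy as np

def anova_f(groups):
    """Return (F, numerator df, denominator df) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    p = len(groups)
    N = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()                              # [17.2]
    ss_wells = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # [17.3]
    ss_total = sum(((g - grand) ** 2).sum() for g in groups)           # [17.4]
    ss_error = ss_total - ss_wells                                     # [17.5]
    ms_wells = ss_wells / (p - 1)                                      # [17.6]
    ms_error = ss_error / (N - p)                                      # [17.7]
    return ms_wells / ms_error, p - 1, N - p

F, df1, df2 = anova_f([[4.1, 3.9, 3.8, 4.2], [4.6, 5.1, 4.8], [5.2, 5.3, 5.1]])
print(round(F, 2), df1, df2)
```

The returned degrees of freedom, (p–1) and (N–p), are exactly the two values needed to look up the F critical point in Step 8.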

Step 8. To test the hypothesis of equal means for all p wells, compare the F-statistic in Figure 17-1 to the α-level critical point found from the F-distribution with (p–1) and (N–p) degrees of freedom in Table 17-1 of Appendix D. α is usually set at 5%, so that the needed comparison value equals the upper 95th percentage point of the F-distribution. The numerator (p-1) and denominator (N-p) degrees of freedom for the F-statistic are obtained from the above table. If the observed F-statistic exceeds the critical point (F.95, p–1, N–p), reject the hypothesis of equal well group population means. Otherwise, conclude that there is insufficient evidence of a significant difference between the concentrations at the p well groups and thus no evidence of potential contamination in any of the compliance wells.

Step 9. In the case of a significant F-statistic that exceeds the critical point in Step 8, determine which compliance wells have elevated concentrations compared to background. This is done by comparing each compliance well individually against the background measurements. Tests to assess concentration differences between a pair of well groups are called contrasts in a multiple comparisons ANOVA framework. Since the contrasts are a series of individual statistical tests, each run at a fixed significance level α*, the Type I error accumulates across the tests as the number of contrasts increases.

To keep the overall false positive rate close to the targeted rate of 5%, the individual contrasts should be set up as follows: Given (p–1) separate background-compliance contrasts, if (p–1) ≤ 5, run each contrast at a significance level equal to α* = .05/(p–1). However, if (p–1) > 5, run each contrast at a significance level equal to α* = .01. Note that when there are more than 5 compliance wells, this last provision will tend to raise the overall false positive rate above 5%.

Step 10. Denote the background data set as the first well group, so that the number of background samples is equal to nb. Then for each of the remaining (p–1) well groups, compute the standard error of the difference between each compliance well and background:

   11  i MSSE error  +⋅=  [17.8]  nn ib 

Note that MSerror is taken from the one-way ANOVA table in Figure 17-1. The standard error here is an extension of the standard error of the difference involving the pooled variance in the Student’s t-test of Chapter 16.

Step 11. Treat the background data as the first well group with the average background concentration equal to \bar{x}_b. Compute the Bonferroni t-statistic for each of the (p–1) compliance wells from i = 2 to p, dividing the standard error in Step 10 into the difference between the average concentration at the compliance well and the background average, as shown below:

t_i = \left( \bar{x}_i - \bar{x}_b \right) / SE_i     [17.9]

Step 12. The Bonferroni t-statistic in equation [17.9] is a type of t-test. Since the estimate of variability used in equation [17.8] has (N–p) degrees of freedom, the critical point can be determined from the Student's t-distribution in Table 16-1 of Appendix D. Let the Bonferroni critical point (tcp) be equal to the upper (1–α*) × 100th percentage point of the t-distribution with (N–p) degrees of freedom.

Step 13. If any of the Bonferroni t-statistics (ti) exceed the critical point tcp, conclude that these compliance wells have population mean concentrations significantly greater than background and thus exhibit evidence of possible contamination. Compliance wells for which the Bonferroni t-statistic does not exceed tcp should be regarded as similar to background in mean concentration level.
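Steps 9 through 13 can be sketched as a single routine, assuming MS_error and its degrees of freedom are already available from the ANOVA table. The well names, data, and MS_error value below are hypothetical.

```python
# Bonferroni post-hoc contrasts from equations [17.8]-[17.9].
# ms_error and df_error come from a prior one-way ANOVA; data are hypothetical.
import math
from scipy import stats

def bonferroni_contrasts(bg, compliance_wells, ms_error, df_error, alpha=0.05):
    """Return the names of wells whose mean significantly exceeds background."""
    n_b = len(bg)
    xbar_b = sum(bg) / n_b
    k = len(compliance_wells)                    # number of contrasts, (p-1)
    a_star = alpha / k if k <= 5 else 0.01       # per-contrast level (Step 9)
    t_cp = stats.t.ppf(1 - a_star, df_error)     # Bonferroni critical point
    flagged = []
    for name, well in compliance_wells.items():
        n_i = len(well)
        se_i = math.sqrt(ms_error * (1.0 / n_b + 1.0 / n_i))   # [17.8]
        t_i = (sum(well) / n_i - xbar_b) / se_i                # [17.9]
        if t_i > t_cp:
            flagged.append(name)
    return flagged

bg = [3.9, 4.0, 3.8, 4.1, 3.9, 4.0, 3.8, 4.0]
wells = {"CW-1": [4.0, 4.1, 3.9, 4.0], "CW-2": [4.8, 4.9, 5.0, 4.7]}
print(bonferroni_contrasts(bg, wells, ms_error=0.05, df_error=19))
```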

►EXAMPLE 17-1

Lead concentrations in ground water at two background and four compliance wells were tested for normality and equality of variance. These data were found to be best fit by a lognormal distribution with approximately equal variances. The two background wells also indicated insignificant log mean differences. The natural logarithms of these lead values are shown in the table below. Use the one-way parametric ANOVA to determine whether there are any significant concentration increases over background in any of the compliance wells.

17-6 March 2009 Chapter 17. ANOVA, Tolerance Limits & Trend Tests Unified Guidance

Log(Lead) log(ppb)

              Background         Compliance
Date          Well 1    Well 2   Well 3    Well 4    Well 5    Well 6

Jan 1995      4.06      3.83     4.61      3.53      4.11      4.42
Apr 1995      3.99      4.34     5.14      4.54      4.29      5.21
Jul 1995      3.40      3.47     3.67      4.26      5.50      5.29
Oct 1995      3.83      4.22     3.97      4.42      5.31      5.08

Well Mean     3.82      3.96     4.35      4.19      4.80      5.00
Well SD       0.296     0.395    0.658     0.453     0.704     0.143

\bar{x}_{BG} = 3.89     s_{BG} = 0.333     Grand Mean = 4.35

SOLUTION

Step 1. Combine the two background wells into one group, so that the background average becomes 3.89 log(ppb). Then nb = 8, while ni = 4 for each of the other four well groups. Note that the total sample size is N = 24 and p = 5.

Step 2. Compute the (overall) grand mean and the sample mean concentrations in each of the well groups using equations [17.1] and [17.2]. These values are listed (along with each group’s standard deviation) in the above table.

Step 3. Compute the sum of squares due to well-to-well differences using equation [17.3]:

SS_{wells} = \left[ 8 \left( 3.89 \right)^2 + 4 \left( 4.35 \right)^2 + \cdots + 4 \left( 5.00 \right)^2 \right] - 24 \left( 4.35 \right)^2 = 4.289

This quantity has (5–1) = 4 degrees of freedom.

Step 4. Compute the corrected total sum of squares using equation [17.4] with (N–1) = 23 df:

SS_{total} = \left[ \left( 4.06 \right)^2 + \cdots + \left( 5.08 \right)^2 \right] - 24 \left( 4.35 \right)^2 = 8.934

Step 5. Obtain the within-well or error sum of squares by subtraction using equation [17.5]:

SS_{error} = 8.934 - 4.289 = 4.646

This quantity has (N–p) = 24–5 = 19 degrees of freedom.

Step 6. Compute the mean sums of squares using equations [17.6] and [17.7]:

MS_{wells} = 4.289 / 4 = 1.072

MS_{error} = 4.646 / 19 = 0.245


Step 7. Construct the F-statistic and the one-way ANOVA table, using Figure 17-1 as a guide:

Source of Variation     Sums of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Wells           4.289             4                    1.072          F = 1.072/0.245 = 4.39
Error (within wells)    4.646             19                   0.245
Total                   8.934             23

Step 8. Compare the observed F-statistic of 4.39 against the critical point taken as the upper 95th percentage point from the F-distribution with 4 and 19 degrees of freedom. Using Table 17-1, this gives a value of F.95,4,19 = 2.90. Since the F-statistic exceeds the critical point, the hypothesis of equal well means is rejected, and post-hoc Bonferroni t-test comparisons should be conducted.

Step 9. Determine the number of individual contrasts needed. With four compliance wells, (p–1) = 4 comparisons need to be made against background. Therefore, run each Bonferroni t-test at the α* = .05/4 = .0125 level of significance.

Step 10. Compute the standard error of the difference between each compliance well average and the background mean using equation [17.8]. Since the number of observations is the same in each compliance well, the standard error in all four cases will be equal to:

SE_i = \sqrt{ 0.245 \cdot \left( \frac{1}{8} + \frac{1}{4} \right) } = 0.303

Step 11. Compute the Bonferroni t-statistic for each compliance well using equation [17.9]:

Well 3: t_2 = \left( 4.35 - 3.89 \right) / 0.303 = 1.52

Well 4: t_3 = \left( 4.19 - 3.89 \right) / 0.303 = 0.99

Well 5: t_4 = \left( 4.80 - 3.89 \right) / 0.303 = 3.00

Well 6: t_5 = \left( 5.00 - 3.89 \right) / 0.303 = 3.66

Note that because Wells 1 and 2 jointly constitute background, the subscripts above correspond to the well groups and not the actual well numbers. Thus, subscript 2 in the Bonferroni t-statistic corresponds to Well 3, subscript 3 corresponds to Well 4, and so forth.

Step 12. Look up the critical point from the t-distribution in Table 16-1 of Appendix D using a significance level of α* = .0125 and (N–p) = 19 df. This gives tcp = 2.433.

Step 13. Compare each Bonferroni t-statistic from Step 11 against the critical point from Step 12. Because the t-statistics at compliance wells 5 and 6 both exceed 2.433, while those at wells 3 and 4 do not, conclude that the population averages in compliance wells 5 and 6 are significantly higher than background. ◄
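The F-statistic in this example can be cross-checked with scipy.stats.f_oneway, which performs the same one-way F calculation (though not the Bonferroni post-hoc contrasts):

```python
# Cross-check of Example 17-1's F-statistic using scipy.stats.f_oneway.
from scipy import stats

background = [4.06, 3.99, 3.40, 3.83,        # Well 1
              3.83, 4.34, 3.47, 4.22]        # Well 2
well3 = [4.61, 5.14, 3.67, 3.97]
well4 = [3.53, 4.54, 4.26, 4.42]
well5 = [4.11, 4.29, 5.50, 5.31]
well6 = [4.42, 5.21, 5.29, 5.08]

F, p_value = stats.f_oneway(background, well3, well4, well5, well6)
print(round(F, 2))   # matches the 4.39 assembled by hand in Step 7
```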


17.1.2 KRUSKAL-WALLIS TEST

BACKGROUND AND PURPOSE

The parametric one-way ANOVA makes a key assumption that the data residuals are normally-distributed. If this assumption is inappropriate or cannot be tested because of a large fraction of non-detects, a non-parametric ANOVA can be conducted using the ranks of the observations rather than the original observations. In Chapter 16, the Wilcoxon rank-sum test is presented as a non-parametric alternative to the Student's t-test when comparing two groups. The Kruskal-Wallis test is offered as a non-parametric alternative to the one-way F-test when several groups need to be simultaneously compared, for instance when assessing patterns of spatial variability. Instead of a test of means, the Kruskal-Wallis procedure tests for differences among average population ranks, roughly equivalent to a comparison of medians.

The Kruskal-Wallis test statistic, H, does not have the intuitive form of the Student's t-test. Under the null hypothesis that all the sample measurements come from identical parent populations, the Kruskal-Wallis statistic follows the well-known chi-square statistical distribution. Critical points for the Kruskal-Wallis test can be found as upper percentage points of the chi-square distribution ( \chi^2_{1-\alpha, df} ) in Table 17-2 of Appendix D.

If H indicates a significant difference between the populations, individual post-hoc comparisons between each compliance well and background need to be conducted if the Kruskal-Wallis is being used for formal compliance testing. Post-hoc contrasts are not generally necessary for identifying spatial variability. Rather than Bonferroni t-statistics, contrasts are based on the data ranks and approximately follow a standard normal distribution. The critical points for these contrasts can be obtained from the standard normal distribution in Table 10-1 of Appendix D.

REQUIREMENTS AND ASSUMPTIONS

While the Kruskal-Wallis test does not require the underlying populations to be normally-distributed, statistical independence of the data is still assumed. Under the null hypothesis of no difference among the groups, the observations are assumed to arise from identical distributions with equal population variances (Hollander and Wolfe, 1999). However, the form of the distribution need not be specified.

A non-parametric ANOVA can be used in any situation where the parametric ANOVA can be used. The minimum data requirements are similar: the sample size for each group in the Kruskal-Wallis procedure should generally be at least four to five observations per group. Despite this similarity, it is often true that non-parametric tests require larger sample sizes than their parametric counterparts to ensure a similar level of statistical power or efficiency. Non-parametric tests make fewer assumptions concerning the underlying data distribution, and so more observations are often needed to make the same judgment that would be rendered by a parametric test. However, the greater efficiency of parametric tests is only achieved when the parent population follows certain known statistical distributions. When the distribution is unknown, non-parametric tests may have much greater power than their parametric counterparts.


Even when a known statistical distribution is considered, rank-based non-parametric tests like the Wilcoxon rank-sum and Kruskal-Wallis often perform reasonably well compared to the t-test and ANOVA. The relative efficiency of two procedures is defined as the ratio of the sample sizes needed by each to achieve a certain level of power against a specified alternative hypothesis. As sample sizes get larger, the efficiency of the Kruskal-Wallis test relative to the parametric ANOVA approaches a limit that depends on the underlying distribution of the data, but is always at least 86 percent. This means roughly that, in the worst case, if 86 measurements are available for a parametric ANOVA, 100 sample values are needed to have an equivalently powerful Kruskal-Wallis test. In many cases, the increase in sample size necessary to match the power of a parametric ANOVA is much smaller or not needed at all. The efficiency of the Kruskal-Wallis test is 95% if the underlying data are really normal, and can be much larger than 100% in other cases (e.g., it is 150% if the data residuals follow a distribution called the double exponential). When the efficiency exceeds 100%, the Kruskal-Wallis actually needs fewer observations than the parametric ANOVA to achieve a certain power.

These results imply that the Kruskal-Wallis test is reasonably powerful for detecting concentration differences despite the fact that the original data have been replaced by their ranks. The test can be used with fair success even when the data are normally-distributed and the Kruskal-Wallis is not needed. When the data are not normal or a normalizing transformation cannot be found, the Kruskal-Wallis procedure tends to be more powerful for detecting differences than the usual parametric approach.

ADJUSTING FOR TIED OBSERVATIONS

The Kruskal-Wallis procedure will frequently be used when the sample data contain a significant fraction of non-detects. However, the presence of non-detects prevents a unique and complete ranking of the concentration values since the exact values of non-detects are unknown.

To address this problem, two steps are necessary. Since they cannot be uniquely ranked, all non-detects are to be treated statistically as 'tied' values. This is an imperfect remedy, since non-detects represent left-censored values and are not necessarily tied. Unfortunately, there is no straightforward, easily implemented alternative to the Kruskal-Wallis for comparing three or more groups containing left-censored observations, unlike the Tarone-Ware alternative to the Wilcoxon rank-sum test discussed in Chapter 16. So in the presence of ties (e.g., non-detects or quantified concentrations rounded to the same value), all tied observations should receive the same midrank (discussed in Section 16.3). This rank is computed as the average of the ranks that would be given to each group of ties if the tied values actually differed by a tiny amount and could be ranked.

To account for multiple reporting limits, all non-detects should be treated as if censored at the highest reporting limit [RL] in the overall sample. Thus, a non-detect reported as <5 would be treated as 'tied' with a non-detect reported as <1, due to the impossibility of knowing which value is actually larger. The only exception to this strategy is when laboratory qualifiers can be used to rank some non-detects as probably greater in magnitude than others. A reasonable strategy discussed in Section 16.3 is to group all "U" values as the lowest set of ties, other non-detects as a higher set of ties, and to rank all "J" and/or "E" values according to their estimated concentrations. In situations where estimated values for J and E samples are not provided, treat these measurements as the highest group of tied non-detects. Always give the highest ranks to explicitly quantified or estimated concentration measurements.


The second step for handling ties is to compute the Kruskal-Wallis statistic as described below, using for each tied value its corresponding midrank. Then an adjustment to the Kruskal-Wallis statistic needs to be made to account for the presence of ties. This adjustment requires computation of the formula:

H^{*} = H \bigg/ \left[ 1 - \sum_{i=1}^{g} \frac{t_i^3 - t_i}{N^3 - N} \right]     [17.10]

where g equals the number of distinct groups of tied observations, N is the total sample size across all groups, and ti is the number of observations in the ith tied group. Unless there are a substantial number of ties in the overall dataset, the adjustment in equation [17.10] will tend to be small. Still, it is important to properly account for the presence of tied values.
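Putting the tie adjustment of equation [17.10] together with the H-statistic of equation [17.13] below, the adjusted statistic can be sketched as follows. The small dataset, the hypothetical reporting limit of 5, and the substitute value of 2.5 for non-detects are illustrative only; the ranks depend only on the non-detects sharing a common tied value.

```python
# Tie-adjusted Kruskal-Wallis H built from midranks, per equations
# [17.10] and [17.13]. Data (with non-detects entered as 2.5) are hypothetical.
from collections import Counter
import numpy as np
from scipy.stats import rankdata

def kruskal_wallis_adjusted(groups):
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    N = len(pooled)
    ranks = rankdata(pooled)          # default method assigns midranks to ties
    # split the pooled ranks back into their groups
    rank_groups, start = [], 0
    for g in groups:
        rank_groups.append(ranks[start:start + len(g)])
        start += len(g)
    H = 12.0 / (N * (N + 1)) * sum(r.sum() ** 2 / len(r)
                                   for r in rank_groups) - 3 * (N + 1)   # [17.13]
    ties = Counter(pooled.tolist())
    adj = 1 - sum(t ** 3 - t for t in ties.values()) / (N ** 3 - N)      # [17.10]
    return H / adj

H_star = kruskal_wallis_adjusted([[2.5, 2.5, 7.5],
                                  [2.5, 12.5, 8.0],
                                  [13.7, 15.3, 20.2]])
print(round(H_star, 3))
```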

PROCEDURE

Step 1. To run the Kruskal-Wallis test, denote the total sample size across all well groups by N. Temporarily combine all the data into one group and rank the observations from smallest to largest. Treat all non-detects as tied at the lowest possible concentration value, unless using lab qualifiers to distinguish between 'undetected' and other non-detects. Combine all background wells into a single group where appropriate. Denote this set of background data as group 1. Then let Rij denote the jth rank from the ith well group, and let k equal the total number of groups (i.e., one group of background values and (k–1) groups of compliance wells).

Step 2. Compute the sum of the ranks and the average rank in each well group, letting ni equal the sample size in the ith group and using the following formulas:

R_{i\bullet} = \sum_{j=1}^{n_i} R_{ij}     [17.11]

\bar{R}_{i\bullet} = R_{i\bullet} / n_i     [17.12]

Step 3. Calculate the Kruskal-Wallis test statistic H and the adjustment for ties, if necessary, using equation [17.10], where H is given by:

H = \frac{12}{N \left( N + 1 \right)} \sum_{i=1}^{k} \frac{R_{i\bullet}^2}{n_i} - 3 \left( N + 1 \right)     [17.13]

Step 4. Given the level of significance (α), determine the Kruskal-Wallis critical point ( \chi^2_{cp} ) as the upper (1–α) × 100th percentage point from the chi-square distribution with (k–1) degrees of freedom (Table 17-2 in Appendix D). Usually α is set equal to 0.05, so that the upper 95th percentage point of the chi-square distribution is needed.


Step 5. Compare the Kruskal-Wallis test statistic, H, against the critical point \chi^2_{cp}. If H is no greater than the critical point, conclude there is insufficient evidence of significant differences between any of the well group populations. If H > \chi^2_{cp}, however, conclude there is a significant difference between at least one pair of the well groups. Post-hoc comparisons are then necessary to determine whether any of the compliance wells significantly exceeds background (note that post-hoc comparisons are not necessary if using the Kruskal-Wallis test to merely identify spatial variability).

Step 6. In the case of a significant H-statistic that exceeds the critical point in Step 5, determine which compliance wells have elevated concentrations compared to background. This is done by comparing each compliance well against background, using a set of contrasts (as described for the parametric one-way ANOVA in Section 17.1.1).

To keep the test-wise or experiment-wise false positive rate close to the targeted (i.e., nominal) rate of 5%, the individual contrasts should be set up as follows: Given (k–1) separate background-compliance contrasts, if (k–1) ≤ 5, run each contrast at a significance level equal to α* = .05/(k–1). However, if (k–1) > 5, run each contrast at a significance level equal to α* = .01. Note that when there are more than 5 downgradient wells, this last provision will tend to raise the overall false positive rate above 5%.

Step 7. Since the background data is the first well group, the number of background observations is equal to n1. For each of the remaining (k–1) well groups, compute the approximate rank-based standard error of the difference between each compliance well and background using equation [17.14]:

SE_i = \sqrt{ \frac{N \left( N + 1 \right)}{12} \left( \frac{1}{n_1} + \frac{1}{n_i} \right) }     [17.14]

Step 8. Let the average background rank be identified as \bar{R}_b. Compute the post-hoc Z-statistic for each of the (k–1) compliance wells for i = 2 to k, dividing the standard error in Step 7 into the difference between the average rank at the compliance well and the background rank average, as shown below:

Z_i = \left( \bar{R}_i - \bar{R}_b \right) / SE_i     [17.15]

Step 9. The Z-statistic in equation [17.15] has an approximate standard normal distribution under the null hypothesis that the ith compliance well is identical in distribution to background. The critical point (zcp) can be found as the upper (1–α*) × 100th percentage point of the normal distribution in Table 10-1 of Appendix D.

Step 10. Compare the post-hoc Z-statistics for each of the (k–1) compliance wells against the critical point (zcp). Any Z-statistic that exceeds the critical point provides significant evidence of an elevation over background in that compliance well at the α* level of significance.
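Steps 7 through 10 can be sketched as a small function of the mean ranks. The sample sizes and mean ranks below are illustrative values, and the function name is a hypothetical label.

```python
# Rank-based post-hoc contrast from equations [17.14]-[17.15]:
# a compliance well's mean rank versus the background mean rank.
# The mean ranks and sample sizes used here are illustrative.
import math
from scipy.stats import norm

def rank_contrast_z(N, n_b, rbar_b, n_i, rbar_i):
    se_i = math.sqrt(N * (N + 1) / 12.0 * (1.0 / n_b + 1.0 / n_i))   # [17.14]
    return (rbar_i - rbar_b) / se_i                                  # [17.15]

N, n_b, rbar_b = 25, 10, 7.9            # total N, background size, background mean rank
z = rank_contrast_z(N, n_b, rbar_b, n_i=5, rbar_i=19.3)
z_cp = norm.ppf(1 - 0.0167)             # critical point at alpha* = .0167
print(round(z, 2), z > z_cp)
```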


►EXAMPLE 17-2

Use the non-parametric Kruskal-Wallis test on the following data to determine whether there is evidence of possible toluene contamination at a significance level of α = 0.05.

Toluene Concentration (ppb)

             Background Wells     Compliance Wells
Month        Well 1    Well 2    Well 3    Well 4    Well 5

1            <5        <5        <5        <5        <5
2            7.5       <5        12.5      13.7      20.1
3            <5        <5        8.0       15.3      35.0
4            <5        <5        <5        20.2      28.2
5            6.4       <5        11.2      25.1      19.0

SOLUTION

Step 1. Since non-detects account for 48% of these data, it would be very difficult to verify the assumptions of normality and equal variance necessary for a parametric ANOVA. Use the Kruskal-Wallis test instead, pooling both background wells into one group and treating each compliance well as a separate group. Note that N = 25 and k = 4.

Compute ranks for all the data including tied observations (e.g., non-detects) as in the following table. Note that each non-detect is given the same midrank, equal to the average of the first 12 unique ranks.

Toluene Ranks

             Background Wells     Compliance Wells
Month        Well 1    Well 2    Well 3    Well 4    Well 5

1            6.5       6.5       6.5       6.5       6.5
2            14        6.5       17        18        21
3            6.5       6.5       15        19        25
4            6.5       6.5       6.5       22        24
5            13        6.5       16        23        20

Group Size   n1 = 10             n2 = 5    n3 = 5    n4 = 5
Rank Sum     R1• = 79            R2• = 61  R3• = 88.5  R4• = 96.5
Rank Mean    \bar{R}1• = 7.9     \bar{R}2• = 12.2  \bar{R}3• = 17.7  \bar{R}4• = 19.3

Step 2. Calculate the sum and average of the ranks in each group using equations [17.11] and [17.12]. These results are given in the above table.

Step 3. Compute the Kruskal-Wallis statistic H using equation [17.13]:

2 2 2 2 12 79 61 88.5 96.5  H =  + + +  − ()3⋅ 26 = 10.56 25 ⋅ 26 10 5 5 5    17-13 March 2009 Chapter 17. ANOVA, Tolerance Limits & Trend Tests Unified Guidance

Also compute the adjustment for ties with equation [17.10]. There is only one group of distinct tied observations — the non-detects — containing 12 samples. Thus, the adjusted Kruskal-Wallis statistic is given by:

  3   * 12 − 12 H = 10.56 1−    = 11.87   253 − 25   

Step 4. Determine the critical point of the Kruskal-Wallis test: with α = .05, the upper 95th percentage point of the chi-square distribution with (k–1) = 4–1 = 3 degrees of freedom [df] is needed. Table 17-2 of Appendix D gives χ²cp = χ²(.95, 3) = 7.81.

Step 5. Since the observed Kruskal-Wallis statistic of 11.87 is greater than the chi-square critical point, there is evidence of significant differences between the well groups. Therefore, post-hoc pairwise comparisons are necessary.

Step 6. To determine the significance level appropriate for post-hoc comparisons, note there are three compliance wells that need to be tested against background. Therefore, each of these contrasts should be run at the α* = 0.05/3 = 0.0167 significance level.

Step 7. Calculate the standard error of the difference for the three contrasts using equation [17.14]. Since the sample size at each compliance well is five, the SE will be identical for each comparison, namely,

SEi = √[(25 × 26/12) × (1/10 + 1/5)] = 4.031

Step 8. Form the post-hoc Z-statistic for each contrast using equation [17.15]:

Well 3: Z2 = (12.2 − 7.9)/4.031 = 1.07

Well 4: Z3 = (17.7 − 7.9)/4.031 = 2.43

Well 5: Z4 = (19.3 − 7.9)/4.031 = 2.83

Step 9. Find the upper (1–α*) × 100th percentage point from the standard normal distribution in Table 10-1 in Appendix D. With α* = .0167, this gives a critical point (by linear interpolation) of zcp = z.9833 = 2.127.

Step 10. Since the Z-statistics at Wells 4 and 5 exceed the critical point, there is significant evidence of increased concentration levels at Wells 4 and 5, but not at Well 3. ◄

17.2 TOLERANCE LIMITS

A tolerance interval is a concentration range designed to contain a pre-specified proportion of the underlying population from which the statistical sample is drawn (e.g., 95 percent of all possible population measurements). Since the interval is constructed from random sample data, a tolerance interval is expected to contain the specified population proportion only with a certain level of statistical

confidence. Two coefficients are thus associated with any tolerance interval. One is the population proportion that the interval is supposed to contain, called the coverage (γ). The second is the degree of confidence with which the interval reaches the specified coverage. This is sometimes known as the tolerance coefficient or, more simply, the confidence level (1–α). A tolerance interval with 95% coverage and a tolerance coefficient of 90 percent is constructed to contain, on average, 95% of the distribution of all possible population measurements with a confidence probability of 90%.
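The interplay between coverage and confidence can be illustrated with a small Monte Carlo sketch (not from the guidance; the factor κ = 3.187 for n = 8 anticipates Table 17-3, cited later in this chapter). Across many simulated background samples, the one-sided limit should cover at least 95% of the population in about 95% of trials.

```python
import random
import statistics

random.seed(1)
KAPPA = 3.187          # one-sided tolerance factor for n = 8, 95%/95%
TRUE_P95 = statistics.NormalDist().inv_cdf(0.95)   # 95th pct of N(0,1)

hits = 0
trials = 5000
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(8)]
    tl = statistics.mean(sample) + KAPPA * statistics.stdev(sample)
    # the limit "succeeds" when it covers at least 95% of the population,
    # i.e., when it lies at or above the true 95th percentile
    if tl >= TRUE_P95:
        hits += 1

confidence = hits / trials   # should land near 0.95
```

The simulated fraction is an estimate only; it fluctuates around 0.95 with the random seed and number of trials.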

A tolerance limit is a one-sided tolerance interval. The upper limit is typically of most interest in groundwater monitoring. Tolerance limits are a standard statistical method that can be useful in groundwater data analysis, especially as an alternative to t-tests or ANOVA for interwell testing. The RCRA regulations allow greater flexibility in the choice of α when using tolerance and prediction limits and control charts, so a larger variety of data configurations may be amenable to one of these approaches. The Unified Guidance still recommends prediction limits or control charts over tolerance limits for formal compliance testing in detection monitoring, and confidence intervals over tolerance limits in compliance/assessment monitoring when a background standard is needed.

An interwell tolerance limit constructed on background data is designed to cover all but a small percentage of the background population measurements. Hence background observations should rarely exceed the upper tolerance limit. By the same token, when testing a null hypothesis (H0) that the compliance point population is identical to background, compliance point measurements also should rarely exceed the upper tolerance limit, unless H0 is false. The upper tolerance limit thus gauges whether or not concentration measurements sampled from compliance point wells are too extreme relative to background.

17.2.1 PARAMETRIC TOLERANCE LIMITS

BACKGROUND AND PURPOSE

To test the null hypothesis (H0) that a compliance point population is identical to that of background, an upper tolerance limit with high coverage (γ) can be constructed on the sample background data. Coverage of 95% is usually recommended. In this case, random observations from a distribution identical to background should exceed the upper tolerance limit less than 5% of the time. Similarly, a tolerance coefficient or confidence level of at least 95% is recommended. This gives 95% confidence that the (upper) tolerance limit will contain at least 95% of the distribution of observations in background or in any distribution similar to background. Note that a tolerance coefficient of 95% corresponds to choosing a significance level (α) equal to 5%. Hence, as with a one-way ANOVA, the overall false positive rate for a tolerance interval is set to approximately 5%.

Once the limit is constructed on background, each compliance point observation (perhaps from several different wells) is compared to the upper tolerance limit. This is different from the comparison of sample means in an ANOVA test. If any compliance point measurement exceeds the limit, the well from which it was drawn is flagged as showing a significant increase over background. Note that the factors κ used to adjust the width of the tolerance interval (Table 17-3 in Appendix D) are designed to provide at least 95% coverage of the parent population. Applied over many data sets, the average coverage of these intervals will often be close to 98% or more (see Guttman, 1970). Therefore, it would be unusual to find

more than 2 or 3 samples out of every 100 exceeding the tolerance limit under the null hypothesis. This fits with the purpose behind the use of a tolerance interval, which is to establish an upper limit on background that will rarely be exceeded, unless some change in the groundwater causes concentration levels to rise significantly at one or more compliance points.

Testing a large number of compliance point samples against such a background tolerance limit practically ensures, even when no release has occurred, that a few measurements will occasionally exceed the limit. The Unified Guidance therefore recommends that tolerance limits be used in conjunction with verification resampling of those wells suspected of possible contamination, in order to either confirm or disconfirm the initial round of sampling and to avoid false positive results.

REQUIREMENTS AND ASSUMPTIONS

Standard parametric tolerance limits assume normality of the sample background data used to construct the limit. This assumption is critical to the statistical validity of the method, since a tolerance limit with high coverage can be viewed as an estimate of a quantile or percentile associated with the tail probability of the underlying distribution. If the background sample is non-normal, a normalizing transformation should be sought. If a suitable transformation is found, the limit should be constructed on the transformed measurements and can then be back-transformed to the raw concentration scale prior to comparison against individual compliance point values.

If no transformation will work, a non-parametric tolerance limit should be considered instead. Unfortunately, non-parametric tolerance limits generally require a much larger number of observations to provide the same levels of coverage and confidence as a parametric limit. It is recommended that a parametric model be fit to the data if at all possible.

A tolerance limit can be computed with as few as three observations from background. However, doing so results in a high upper tolerance limit with limited statistical power for detecting increases over background. Usually, a background sample size of at least eight measurements will be needed to generate an adequate tolerance limit. If multiple background wells are screened in equivalent hydrostratigraphic positions and the data can reasonably be combined (Chapter 5), one should consider using pooled background data from multiple wells to increase the background sample size.

Like many tests described in the Unified Guidance, tolerance limits as applied to groundwater monitoring assume stationarity of the well field populations both temporally (i.e., over time) and spatially. The data also need to be statistically independent. Since an adequately-sized background sample will have to be amassed over time (in part to maintain enough temporal spacing between observations so that independence can be assumed), the background data should be checked for apparent trends or seasonal effects. As long as the background mean is stable over time, the amassed data from a longer span of sampling will provide a better statistical description of the underlying background population.

As a primarily interwell technique, tolerance limits should only be utilized when there is minimal spatial variability. Explicit checks for spatial variation should be conducted using box plots and/or ANOVA.

In the usual test setting, one new compliance point observation from each distinct well is compared against the tolerance limit during each statistical evaluation. Under the null hypothesis of identical populations, the compliance point measurements are assumed to follow the same distribution as background. Further, the compliance data are assumed to be mutually statistically independent. Such assumptions are almost impossible to check with only one new value per compliance well. However, periodic checks of the key assumptions are recommended after accumulating several sampling rounds of compliance data.

PROCEDURE

Step 1. Calculate the mean x̄ and the standard deviation s from the background sample.

Step 2. Construct the one-sided upper tolerance limit as

TL = x̄ + κ(n, γ, 1−α) ⋅ s   [17.16]

where κ(n,γ,1−α) is the one-sided normal tolerance factor found in Table 17-3 of Appendix D associated with a sample size of n, coverage coefficient of γ, and confidence level of (1−α).

Equation [17.16] applies to normal data. If a transformation is needed to normalize the sample, the tolerance limit needs to be constructed on the transformed measurements and the limit back-transformed to the original concentration scale. If the limit was constructed, for example, on the logarithms of the original observations, where ȳ and sy are the log-mean and log-standard deviation, the tolerance limit can be back-transformed to the concentration scale by exponentiating the limit. The tolerance limit is computed as:

TL = exp[ ȳ + κ(n, γ, 1−α) ⋅ sy ]   [17.17]

Step 3. Compare each observation from the compliance well(s) to the upper tolerance limit found in Step 2. If any observation exceeds the tolerance limit, there is statistically significant evidence that the compliance well concentrations are elevated above background. Verification resampling should be conducted to verify or disconfirm the initial result.
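The construction in Steps 1-3 can be sketched as follows. The exact factor κ should be taken from Table 17-3 of Appendix D; the closed-form approximation below is a standard textbook formula, not from the guidance, and reproduces the tabled values only approximately (about 3.14 versus the tabled 3.187 for n = 8 at 95%/95%). Function names are illustrative.

```python
import math
import statistics

def kappa_approx(n, coverage=0.95, confidence=0.95):
    """Approximate one-sided normal tolerance factor kappa(n, gamma, 1-alpha).
    A common textbook approximation; exact values come from Table 17-3."""
    z_g = statistics.NormalDist().inv_cdf(coverage)    # z for coverage gamma
    z_c = statistics.NormalDist().inv_cdf(confidence)  # z for confidence 1-alpha
    a = 1.0 - z_c ** 2 / (2.0 * (n - 1))
    b = z_g ** 2 - z_c ** 2 / n
    return (z_g + math.sqrt(z_g ** 2 - a * b)) / a

def upper_tolerance_limit(data, kappa, logscale=False):
    """TL = mean + kappa*s per [17.16]; if built on logged data,
    back-transform per [17.17]."""
    vals = [math.log(x) for x in data] if logscale else list(data)
    tl = statistics.mean(vals) + kappa * statistics.stdev(vals)
    return math.exp(tl) if logscale else tl

k8 = kappa_approx(8)   # roughly 3.14; Table 17-3 gives 3.187
```

In practice the tabulated κ would be passed to `upper_tolerance_limit` directly, with the approximation used only as a rough check when the table is unavailable.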

►EXAMPLE 17-3

The table below consists of chrysene concentration data (ppb) found in water samples obtained from two background wells (Wells 1 and 2) and three compliance wells (Wells 3, 4, and 5). Compute the upper tolerance limit on background for coverage of 95% with 95% confidence and determine whether there is evidence of possible contamination at any of the compliance wells.

Chrysene Concentration (ppb)

Month      Well 1     Well 2     Well 3     Well 4     Well 5
  1         19.7       10.2       68.0       26.8       47.0
  2         39.2        7.2       48.9       17.7       30.5
  3          7.8       16.1       30.1       31.9       15.0
  4         12.8        5.7       38.1       22.2       23.4

Mean       19.88       9.80      46.28      24.65      28.98
SD         13.78       4.60      16.40       6.10      13.58


SOLUTION Step 1. A Shapiro-Wilk test of normality on the pooled set of eight background measurements gives W = 0.7978 on the original scale and W = 0.9560 after log-transforming the data, suggesting that the data are better fit by a lognormal distribution. Therefore, construct the tolerance limit on the logged observations, listed below along with the log-means and log-standard deviations.

Log Chrysene log(ppb)

Month      Well 1     Well 2     Well 3     Well 4     Well 5
  1         2.981      2.322      4.220      3.288      3.850
  2         3.669      1.974      3.890      2.874      3.418
  3         2.054      2.779      3.405      3.463      2.708
  4         2.549      1.740      3.640      3.100      3.153

Mean        2.813      2.204      3.789      3.181      3.282
SD          0.685      0.452      0.349      0.253      0.479

BG Mean = 2.509      BG SD = 0.628

Step 2. Compute the upper tolerance limit on the pooled background data using the logged chrysene concentration data. The tolerance factor for a one-sided upper normal tolerance limit with 95% coverage and 95% probability and n = 8 observations is equal to (from Table 17-3 of Appendix D) κ = 3.187. Therefore, the upper tolerance limit is computed using equation [17.17] as:

TL = exp 2.509 + 3.187 × 0.628 = 90.96 ppb   Step 3. Compare the measurements at each compliance well to the upper tolerance limit, that is TL = 90.96 ppb. Since none of the original chrysene concentrations exceeds the upper TL, there is insufficient evidence of chrysene contamination in these data. ◄

17.2.2 NON-PARAMETRIC TOLERANCE INTERVALS

BACKGROUND AND PURPOSE

When an assumption of normality cannot be justified, especially when the data contain a significant portion of non-detect observations, the use of non-parametric tolerance intervals should be considered. The upper tolerance limit in a non-parametric setting is usually chosen as an order statistic of the sample data (Guttman, 1970), commonly the maximum value or perhaps the second or third largest value observed.

Because the maximum observed background value is often taken as the upper tolerance limit, non- parametric tolerance intervals are easy to construct and use. The sample data needs to be ordered, but no

ranks need be assigned to the concentration values other than to determine the largest measurements. This also means that non-detects do not have to be uniquely ordered or handled in any special manner.

One advantage to using a maximum concentration instead of assigning ranks to the data (as in the Wilcoxon rank-sum or Kruskal-Wallis tests) is that non-parametric tolerance intervals reflect actual concentration magnitudes. Another advantage is that unless all the background data are non-detect, the maximum value will be a detected concentration, leading to a well-defined upper tolerance limit. If all the sample data are non-detect, an RL (e.g., the lowest achievable quantitation limit [QL]) may serve as an approximate upper tolerance limit.

REQUIREMENTS AND ASSUMPTIONS

Unlike parametric tolerance intervals, the desired coverage (γ) or confidence level (1–α) cannot be pre-specified using a non-parametric limit. Instead, the achieved coverage and/or confidence level depends entirely on the background sample size (n) and the order statistic chosen as the upper tolerance limit (e.g., the maximum value). Guttman (1970) has shown that the coverage of the limit follows a beta probability density with cumulative distribution:

It(n − m + 1, m) = [Γ(n + 1) / (Γ(n − m + 1) Γ(m))] ∫0..t u^(n−m) (1 − u)^(m−1) du   [17.18]

where n = sample size and m = (n + 1) − (rank of the upper tolerance limit value). If the background maximum is selected as the tolerance limit, its rank is equal to n and so m = 1. If the second largest value is chosen as the limit, its rank would be equal to (n−1), giving m = 2.

As a non-parametric procedure, no distributional model must be fit to the background measurements. It is assumed, however, that the compliance point data follow the same distribution as background — even if unknown — under the null hypothesis. Even though no distributional model is assumed, order statistics of any random sample follow certain probability laws as noted above. Since the beta distribution is closely related to the more familiar binomial distribution, Guttman showed that in order to construct a non-parametric tolerance interval with at least γ coverage and (1–α) confidence probability, the number of (background) samples should be chosen such that:

∑ (n choose t) (1 − γ)^t γ^(n−t) ≥ 1 − α,  summing from t = m to n   [17.19]

If the background maximum is selected as the upper tolerance limit, so that m = 1, this inequality reduces to the simpler form

1 − γ^n ≥ 1 − α   [17.20]

Table 17-4 in Appendix D provides minimum coverage levels with 95% confidence for various choices of n, using either the maximum sample value or the second largest measurement as the tolerance limit. As an example, with n = 16 background measurements, the minimum coverage is γ = 83% if the background maximum is designated as the upper tolerance limit and γ = 74% if the tolerance limit is

taken as the second largest background value. In general, Table 17-4 of Appendix D illustrates that if the underlying distribution is unknown, more background samples are needed compared to the parametric setting in order to construct a tolerance interval with sufficiently high coverage. Parametric tolerance intervals do not require as many background samples precisely because the form of the underlying distribution is assumed to be known.

An alternate way to construct an appropriate tolerance limit is to calculate the maximum confidence level for various choices of n guaranteeing at least 95% coverage. With n = 8 background measurements, the approximate confidence level is at most 34% when the largest value is taken as the tolerance limit and only 6% if the second largest value is taken as the tolerance limit. Clearly, it is advantageous to fit a parametric distributional model to the data if at all possible unless n is fairly large.

Although non-parametric tolerance limits do not require an assumption of normality, other assumptions of tolerance limits apply equally to the parametric and non-parametric versions. Specifically, the sample data should be statistically independent and show no evidence of autocorrelation, trends, or seasonal effects in background. Applied as an interwell test, there should also be minimal to no natural on-site spatial variation.

By construction, outliers in background can be a particular problem for non-parametric tolerance limits, especially if the background maximum is chosen as the upper limit. A limit based on a large, extreme outlier will result in a test having little power to detect increases in compliance wells. Consequently, the background sample should be screened ahead of time for possible outliers (Chapter 12). Confirmed outliers should be removed from the data set before setting the tolerance limit.

An important caveat to this advice is that almost all statistical outlier tests depend crucially on the ability to fit the remaining data (minus the suspected outlier(s)) to a known statistical distribution. In those cases where a non-parametric tolerance limit is selected because of a large fraction of non-detects, fitting the data to a distributional model may be difficult or impossible, negating formal outlier tests. As an alternative, the non-parametric upper tolerance limit could be set to a different order statistic in background (i.e., other than the maximum), to provide some insurance against possible large outliers. This strategy will work provided there are enough background measurements to allow for adequately high coverage and confidence in the resulting limit.

PROCEDURE

Step 1. Sort the set of background data into ascending order and choose either the largest or second largest measurement as the upper TL. Use Table 17-4 in Appendix D to determine the coverage γ associated with 95% or 99% confidence. Note also that if the largest or second largest measurement is a non-detect, the upper tolerance limit should be set to the RL most appropriate to the data (e.g., the lowest achievable practical quantitation limit [PQL]).

Step 2. Compare each compliance point measurement against the upper tolerance limit. Identify significant evidence of possible contamination at any compliance well in which one or more measurements exceed the upper tolerance limit. If the upper tolerance limit equals the RL, a violation should be flagged anytime a detected value is quantified above the RL.

Step 3. Because the risk of false positive errors is greatly increased if either the confidence level or coverage drop substantially below 95%, both of these parameters should be routinely reported

and noted as being below the target levels. One should also strongly consider comparing one or more verification resamples against the upper tolerance limit before identifying a clear violation.
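A minimal sketch of Steps 1 and 2, assuming non-detects are represented as None and the background maximum is the chosen order statistic (names are illustrative, not from the guidance):

```python
def np_tolerance_screen(background, compliance_wells):
    """Set the upper TL to the largest detected background value and flag
    any compliance well with a detected result above it."""
    detected = [v for v in background if v is not None]
    # if background were entirely non-detect, an RL would serve as the
    # limit instead (not handled in this sketch)
    tl = max(detected)
    flags = {well: any(v is not None and v > tl for v in vals)
             for well, vals in compliance_wells.items()}
    return tl, flags
```

Applied to the copper data of Example 17-4 below, this yields TL = 9.2 ppb and flags only Well 4; a flagged well would then be resampled for verification per Step 3.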

►EXAMPLE 17-4

Use the following copper background data to establish a non-parametric upper tolerance limit and determine if either compliance well shows evidence of copper contamination.

Copper Concentration (ppb)

            Background Wells        Compliance Wells
Month      Well 1     Well 2     Well 3     Well 4     Well 5
  1          <5        9.2         <5
  2          <5         <5        5.4
  3         7.5         <5        6.7
  4          <5        6.1         <5
  5          <5        8.0         <5        6.2         <5
  6          <5        5.9         <5         <5         <5
  7         6.4         <5         <5        7.8        5.6
  8         6.0         <5         <5       10.4         <5

SOLUTION Step 1. The pooled background data in Wells 1, 2, and 3 have a maximum observed value of 9.2 ppb. Set the 95% confidence upper tolerance limit equal to this value. Because 24 background samples are available, Table 17-4 in Appendix D indicates that the minimum coverage is equal to 88%. To increase either the coverage or the confidence level, more background samples would have to be collected.

Step 2. Compare each sample in compliance Wells 4 and 5 to the upper tolerance limit. Since none of the measurements at Well 5 is above 9.2 ppb, while one sample from Well 4 is above the limit, conclude that there may be significant evidence of copper contamination at Well 4 but not Well 5.

Step 3. Note that with only 88% coverage and 24 background samples, the risk of a false positive result is more than 10%. Well 4 should be resampled to determine whether the exceedance is replicated. ◄

17.3 TREND TESTS

The Unified Guidance recommends trend testing as an intrawell alternative to prediction limits or control charts when those methods are not suitable. Prediction limits and control charts (as well as t-tests and ANOVA) all involve a comparison of compliance and background populations under the key assumption that the underlying concentration distributions are stationary over time. That is, the populations are presumed to have stable (i.e., roughly constant) means over the period of sampling prior to statistical evaluation.


Unfortunately, there is no guarantee that groundwater populations will remain stable during long-term monitoring. Because sampling at many sites is done on a quarterly, semi-annual, or annual basis, it will often take one to two years or more to collect enough background data to run the statistical tests discussed in the Unified Guidance. Over this length of time, the statistical characteristics of groundwater may or may not change in significant ways.

If background groundwater conditions are in a state of flux, trend tests provide a significant advantage over both intrawell prediction limits and control charts. Both of the latter methods involve a designation of some portion of the historical sampling record as the intrawell background for a given compliance well. Ideally, this intrawell background should consist of measurements known to be uncontaminated and which represent a random sample from a stable underlying population, just as with t-tests and ANOVA. If the mean and/or standard deviation of the underlying population changes while intrawell background is being compiled, results of either prediction limit or control chart tests against more recently collected data can be severely biased or altogether inaccurate.

One drawback to the Shewhart-CUSUM control charts presented in Chapter 20 is that they are somewhat sensitive to the parametric assumption of underlying normality. If the measurements are lognormal rather than normal, for instance, the nominal performance characteristics (i.e., Type I error rate and statistical power) of control charts are significantly affected. By the same token, control charts are impacted if the intrawell background contains a large fraction of non-detects. Non-detect adjustments can sometimes be made to the baseline data via methods discussed in Chapter 15, but if a normalizing transformation or adjustment is not successful, no straightforward non-parametric control chart exists.

Consequently, neither prediction limits nor control charts are appropriate for every circumstance where an intrawell comparison may be warranted or necessary. Thus, the Unified Guidance recommends that users consider trend testing as an alternative to prediction limits or control charts when those methods are not suitable as intrawell techniques. Tests for trend are specifically designed to identify those groundwater populations whose mean concentrations are not stationary over time, but rather are increasing (or decreasing) by measurable amounts. Ultimately, the goal of any reasonable detection or compliance/assessment monitoring program is to determine whether or not the concentration levels of key contaminants or indicator parameters have significantly increased during the period of monitoring and, if so, whether the increase is attributable to facility waste management practices.

The detection of trends is a complex subject. Whole textbooks are devoted to the more general topic of time series analysis, including the identification and modeling of time trends — step functions, linear and quadratic trends, exponential growth, etc. The Unified Guidance only attempts to identify the simplest kind of linear increases, not the specification or testing of more complex models. The methods described below are all designed to effectively test for (increasing) linear trends, though they will also identify simple increases over time when a trend is present but does not follow a strictly linear pattern.

The Unified Guidance recommends using trend tests in detection monitoring to measure the extent and nature of an apparent concentration increase, especially to determine whether or not the increase occurs consistently over time. Two questions are of particular interest: 1) is there a statistically significant (positive) trend over the period of monitoring? and 2) what is the nature (i.e., slope and intercept) of the trend? By identifying a positive trend, one can show that contaminant levels have gotten worse compared to early measurements from the well being tested. Furthermore, by measuring the nature

of the trend, including the average rate of increase per unit of time, one can estimate how rapidly concentration levels are increasing and the current mean- or median-level magnitude of contamination. Such information can provide an invaluable portrait of the changes occurring on-site and probably offers the most compelling evidence — under these conditions — for demonstrating that the basic null hypothesis of detection monitoring has been violated.

17.3.1 LINEAR REGRESSION

BACKGROUND AND PURPOSE The most common way to measure a linear trend is to compute a linear regression of concentration data when plotted against the time or date of sample collection. By way of interpretation, each point along a linear regression trend line is an estimate of the true mean concentration at that point in time. Thus, a linear regression can be used to assess whether or not the population mean at a compliance well has significantly increased or decreased.

Linear regression is a standard technique in statistics textbooks and many data analysis software packages. It is more generally applicable to linear relationships between any pair of random variables and not simply to time trends. Good references for performing linear regression and for checking and verifying its assumptions include Draper and Smith (1998) and Cook and Weisberg (1999).

Unlike prediction limits or control charts which are constructed using only the background data, trend tests including linear regression are computed with all available earlier and more recent data at the compliance well of interest. One then might incorrectly assume that a comparison against intrawell background is not being conducted. But an intrawell comparison does occur with a trend test. Statistical identification of a structured pattern of increase from the first portion of the sampling record to more recent data indicates that concentration levels are no longer similar to intrawell background, but have risen more than expected by chance.

Statistical identification of a positive trend involves testing the estimated slope coefficient from the linear regression trend line. A specially constructed t-test is used to make this determination, as described below. If this test is significant, the slope is judged to be different from zero, indicating that a change in concentration levels has occurred over the period of sampling represented by the data set.

REQUIREMENTS AND ASSUMPTIONS

Linear regression as a parametric statistical technique makes a number of underlying assumptions. Among the most important of these are that the regression residuals (i.e., the difference between each concentration measurement and its predicted value from the regression equation) are approximately normal in distribution, homoscedastic (i.e., equal in variance at different times and for different mean concentration levels), and statistically independent. Significant skewness or the presence of outliers can bias or invalidate the results of a trend test based on linear regression. Furthermore, standard linear regression methods do not account for non-detects or missing values at selected sampling events.

Because the key assumptions for linear regression depend not on the original measurements but rather on the regression residuals, a tentative trend line needs to first be constructed before its

assumptions can be checked. Once a linear regression on time is fitted to the data, the residuals around the trend line need to be computed and then tested for normality, apparent skewness, and equal variance over time. This last assumption is particularly important to testing whether the slope of an apparent trend is statistically different from zero (a zero slope indicating that well concentrations have not changed over time).

Inferences around a linear regression are generally appropriate when three conditions hold: 1) the residuals from the regression are approximately normal or at least reasonably symmetric in distribution; 2) a scatter plot of residuals versus concentrations indicates a scatter cloud of essentially uniform vertical thickness or width (i.e., the scatter cloud does not tend to increase in width with the level of concentration, which would suggest a proportional effect between the underlying population mean and variance); and 3) a scatter plot of residuals versus time also exhibits a uniformly thick scatter cloud. If the thickness or width is substantially different at distinct time points, the assumption of equal variances over time may not be true.

If any of these conditions is substantially violated, it may indicate that the basic trend is either non-linear or that the magnitude of the variance is not independent of the mean concentration level and/or the time of sampling. One possible remedy is to try a transformation of the concentration data and re-estimate the linear regression. This will change the interpretation of the estimated regression from a linear trend of the form y = a + bt, where y and t represent concentration and time respectively, to a non-linear pattern. As an example, if the concentration data are log-transformed, the regression equation will have the form log y = a + bt. Back-transformed to the original concentration scale, the trend function will then have the form y = exp(a + bt).

In transforming the regression data this way, the estimated trend in the concentration domain (after back-transforming) no longer represents the original mean. Rather, the transformation induces a bias when converted back to the raw-scale data. If a log transformation is used, for instance, the back-transformed trend line will represent the raw-scale geometric mean and not the arithmetic mean. As with Student's t-tests on lognormal data (Chapter 16), demonstrating that the geometric mean is increasing also implies that the arithmetic mean has risen, so long as the regression residuals are homoscedastic.
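The back-transformation described above can be sketched numerically. The following minimal Python sketch uses synthetic, hypothetical data: the trend is fit on the log scale and then exponentiated, and the back-transformed line passes through the geometric mean of the data at the mean sampling time, never exceeding the arithmetic mean.

```python
import numpy as np

# Sketch: regression on log-transformed concentrations, then back-transformation.
# All data here are synthetic (a hypothetical lognormal series with an upward trend).
rng = np.random.default_rng(1)
t = np.arange(20, dtype=float)                           # sampling times
y = np.exp(0.5 + 0.05 * t + rng.normal(0.0, 0.3, 20))    # lognormal-like concentrations

slope, intercept = np.polyfit(t, np.log(y), 1)           # fit: log y = a + b*t
geo_trend = np.exp(intercept + slope * t)                # back-transformed trend line

# At t-bar the back-transformed line equals exp(mean(log y)), the sample
# geometric mean, which can never exceed the arithmetic mean of y.
```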

A minimum of 8 to 10 measurements is generally necessary to compute a linear regression, especially to estimate the variance around the trend line (known as the mean squared error or MSE). The regression residuals should be statistically independent, an assumption that can be approximately verified via one of the autocorrelation tests of Chapter 14.

One last assumption is that there should be few if any non-detects when computing a linear regression. As a matter of common sense, a significant increasing or decreasing trend should be based on reliably quantified measurements. If this is not the case, the user should check to see whether the “trend” may be an artifact induced by changes in detection and/or quantitation limits over time. The concentration levels of a series of non-detects may appear to be decreasing, for instance, simply because analytical methods have improved over the years leading to lower RLs. Such artifacts of plotting and data reporting should not be considered real trends.

When the assumptions of linear regression cannot be verified at least approximately, a non-parametric trend method should be considered instead. Sections 17.3.2 and 17.3.3 discuss the Mann-Kendall test for trend and the Theil-Sen trend line. These methods can be particularly valuable when constructing trends on data sets containing non-detects.

PROCEDURE

Step 1. Construct a time series plot of the compliance point measurements. If a discernible trend is evident, compute a linear regression of concentration against sampling date (time), letting x_i denote the ith concentration value and t_i denote the ith sampling date. Estimate the linear slope b̂ with the formula:

b̂ = [ Σ_{i=1}^{n} (t_i − t̄) · x_i ] / [ (n − 1) · s_t² ]        [17.21]

This estimate then leads to the regression equation, given by:

x̂_t = x̄ + b̂ · (t − t̄)        [17.22]

where t̄ denotes the mean sampling date, s_t² is the variance of the sampling dates, x̄ is the mean concentration level, and x̂_t represents the estimated mean concentration at time t.

Note: though the variable t above represents time, it could just as easily signify another variable, perhaps a second constituent for which an association with x is estimated.

Step 2. Compute the regression residual at each sampling event i with equation [17.23]:

r_i = x_i − x̂_i        [17.23]

Check the set of residuals for lack of normality and significant skewness using the techniques in Chapter 10. Also, plot the residuals against the estimated regression values (x̂_i) to check for non-uniform vertical thickness in the scatter cloud. Make a similar check by plotting the residuals against the sampling dates (t_i). If the residuals are non-normal and substantially skewed and/or the scatter clouds appear to have a definite pattern (e.g., funnel-shaped; "U"-shaped; or residuals mostly positive on one end of the graph and mostly negative on the other end, instead of randomly scattered around the horizontal line r = 0), repeat Steps 1 and 2 after first attempting a normalizing transformation.

Step 3. Calculate the estimated variance around the regression line (also known as the mean squared error [MSE]) with equation [17.24]:

s_e² = [ 1 / (n − 2) ] · Σ_{i=1}^{n} r_i²        [17.24]

Step 4. Compute the standard error of the linear regression slope coefficient using the s_e² result from Step 3 in equation [17.25]:


se(b̂) = sqrt[ s_e² / Σ_{i=1}^{n} (t_i − t̄)² ]        [17.25]

Step 5. Test whether the trend is significantly different from zero by forming the t-statistic ratio in equation [17.26]:

t_b = b̂ / se(b̂)        [17.26]

This t-statistic (t_b) has n–2 degrees of freedom [df]. Given a level of significance (α), choose the critical point (t_cp) for the test as the (1–α) × 100th percentage point of the Student's t-distribution with (n–2) df, or t_cp = t_{1–α, n–2}. Compare t_b against the critical point. If t_b > t_cp, conclude that the slope of the trend is both positive and significantly different from zero at the α-level of significance. If t_b < –t_cp, conclude there is a significant decreasing trend. If neither condition holds, there is insufficient evidence of an increasing or decreasing trend.

►EXAMPLE 17-5

The following groundwater chloride measurements (n = 19) were collected over a five-year period at a solid waste landfill. Test for a significant trend at the α = 0.01 level using linear regression.

Sample Date   Chloride (ppm)   Elapsed Days   Residuals

2002-03-18        11.5              76          –0.25
2002-05-14        12.6             133           0.67
2002-08-22        13.8             233           1.56
2003-02-12        12.3             407          –0.48
2003-05-29        12.8             513          –0.30
2003-08-18        13.2             594          –0.15
2003-11-20        14.1             688           0.45
2004-02-19        13.3             779          –0.63
2004-04-26        13.1             846          –1.04
2004-07-29        13.2             940          –1.23
2004-11-09        15.3            1043           0.56
2005-02-24        15.0            1150          –0.08
2005-06-14        15.2            1260          –0.22
2005-08-23        15.8            1330           0.17
2005-10-17        16.1            1385           0.30
2006-02-08        15.1            1499          –1.06
2006-04-27        16.4            1577           0.00
2006-08-10        17.7            1682           0.98
2006-10-26        17.7            1759           0.74

SOLUTION

Step 1. Check for an apparent trend on a time series plot (Figure 17-2). Since the chloride values are increasing in reasonably linear fashion, compute the tentative regression line using equations [17.21] and [17.22]. To compute the slope estimate, first convert the sample dates to elapsed days using a starting date prior to the first event. In this case, choose an arbitrary starting date of 2002-01-01 as zero and compute the elapsed days as listed in the table above.

Using elapsed days as the time variable, compute the sample mean and variance to get:


t̄ = 941.79 days

s_t² = 279374.3 days²

Then compute the tentative slope as:

b̂ = [ (76 − 941.79)(11.5) + … + (1759 − 941.79)(17.7) ] / [ (19 − 1)(279374.3) ] = 0.0031

and the regression line itself as:

x̂_t = x̄ + b̂ · (t − t̄) = 14.432 + 0.0031 · (t − 941.79)

where the mean chloride value is x̄ = 14.432 ppm. The regression line is overlaid on the scatter plot in Figure 17-2.

Figure 17-2. Time Series Plot of Chloride (ppm) Overlaid With Linear Regression

Step 2. Calculate the regression residual at each sampling event using equation [17.23]. This involves computing an estimated concentration along the regression line for each sampled time (t) and then subtracting this estimate from the observed concentration. For example, the residual at t = 407 is

r = x − x̂_t = 12.3 − 12.78 = −0.48

All the residuals are listed in the table above. Then check the residuals for normality, homoscedasticity, and lack of association with the predicted values from the regression line.


Figure 17-3 is a probability plot of the residuals, indicating good agreement with normality. Figure 17-4 is a scatter plot of the residuals versus sampling date and Figure 17-5 is a scatter plot of the residuals versus predicted values from the trend line. Both of these last plots do not exhibit any particular trends or patterns with sampling date or the trend line predicted values; the residuals are fairly randomly scattered.

Step 3. Compute the MSE of the regression using the squared residuals in equation [17.24] to get

s_e² = [ 1 / (19 − 2) ] · [ (−0.25)² + (0.67)² + … + (0.74)² ] = 0.5628

Step 4. Calculate the standard error of the regression slope coefficient using equation [17.25]:

se(b̂) = sqrt{ 0.5628 / [ (76 − 941.79)² + … + (1759 − 941.79)² ] } = 0.00033

Step 5. Form the t-statistic ratio with formula [17.26] to get:

t_b = b̂ / se(b̂) = 0.0031 / 0.00033 = 9.39

Since α = 0.01, compare this value to a critical point equal to the 99th percentile of a Student's t-distribution with (n–2) = 17 degrees of freedom, that is, t_cp = t_{.99,17} = 2.567. Since the t-statistic is substantially larger than the critical point, conclude the upward trend is significant at the 1% α-level. ◄
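The worked example can be cross-checked with a short script. The minimal sketch below (numpy only; variable names are illustrative) implements equations [17.21] through [17.26] on the chloride data. Small differences from the reported t-statistic of 9.39 arise because the example rounds b̂ and se(b̂) before dividing; the full-precision computation lands near 9.3.

```python
import numpy as np

# Chloride data from Example 17-5: elapsed days (t) and concentrations (x)
t = np.array([76, 133, 233, 407, 513, 594, 688, 779, 846, 940, 1043,
              1150, 1260, 1330, 1385, 1499, 1577, 1682, 1759], dtype=float)
x = np.array([11.5, 12.6, 13.8, 12.3, 12.8, 13.2, 14.1, 13.3, 13.1, 13.2,
              15.3, 15.0, 15.2, 15.8, 16.1, 15.1, 16.4, 17.7, 17.7])
n = len(x)

b_hat = np.sum((t - t.mean()) * x) / ((n - 1) * t.var(ddof=1))   # slope [17.21]
x_fit = x.mean() + b_hat * (t - t.mean())                        # trend line [17.22]
r = x - x_fit                                                    # residuals [17.23]
mse = np.sum(r**2) / (n - 2)                                     # mean squared error [17.24]
se_b = np.sqrt(mse / np.sum((t - t.mean())**2))                  # slope std. error [17.25]
t_b = b_hat / se_b                                               # t-statistic [17.26]
# Compare t_b to t_{0.99,17} = 2.567 for the alpha = 0.01 test
```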


Figure 17-3. Probability Plot of Chloride Regression Residuals

Figure 17-4. Scatter Plot of Chloride Residuals vs. Sampling Date


Figure 17-5. Scatter Plot of Chloride Residuals vs. Predicted Regression Fits

17.3.2 MANN-KENDALL TREND TEST

BACKGROUND AND PURPOSE

The Mann-Kendall test (Gilbert, 1987) is a non-parametric test for linear trend, based on the idea that a lack of trend should correspond to a time series plot fluctuating randomly about a constant mean level, with no visually apparent upward or downward pattern. If an increasing trend really exists, the sample taken first from any randomly selected pair of measurements should on average have a lower concentration than the measurement collected at a later point. The Mann-Kendall statistic is computed by examining all possible pairs of measurements in the data set and scoring each pair as follows. An earlier measurement less in magnitude than a later one is assigned a value of 1. If an earlier value is greater in magnitude than a later sample, the pair is tallied as –1; two identical measurement values are assigned 0.

After scoring each pair in this way and adding up the total to get the Mann-Kendall statistic (S), a positive value of S implies that a majority of the differences between earlier and later measurements are positive, suggestive of an upward trend over time. Likewise, a negative value for S implies that a majority of the differences between earlier and later values are negative, suggestive of a decreasing trend. A value near zero indicates a roughly equal number of positive and negative differences. This would be expected if the measurements were randomly fluctuating about a constant mean with no apparent trend.

To account for chance fluctuations and the inherent variability in the sample, the Mann-Kendall test is based on the critical ranges of the statistic S likely to occur under stationary conditions. The larger the absolute value of S, the stronger the evidence for a real increasing or decreasing trend. The critical points for identifying a trend get larger as the level of significance (α) drops. Only if the absolute value of the test statistic (S) is larger than the critical point is a statistically significant increasing or decreasing trend indicated.

REQUIREMENTS AND ASSUMPTIONS

As a non-parametric procedure, the Mann-Kendall test does not require the underlying data to follow a specific distribution. Ranks of the data are not explicitly used in forming the test statistic, as they are with the Wilcoxon rank-sum. Only the relative magnitudes of the concentration values are needed to compute S, not the actual concentrations themselves. Non-detects can be treated by assigning them a common value lower than any of the detected measurements. Any pair of tied values or any pair of non-detects is simply given a score of 0 in the calculation of the Mann-Kendall statistic S.

This treatment of non-detects is an imperfect remedy, since it is usually impossible to know whether censored values are actually tied in magnitude. Further complications are introduced when there are multiple RLs and/or an intermingling of detected values and RLs. Lab qualifiers may be used to aid the scoring of pairs that involve non-detects or estimated concentrations. Instead of treating all non-detects as tied, consider 'undetected' or 'U' values as the lowest in magnitude, and other non-detects as higher in magnitude than U's but lower than estimated concentrations ('J' or 'E' values). In this way, a richer scoring of the sample pairs may be possible.

When the sample size n becomes large, exact critical values for the statistic S are not readily available. However, as a sum of identically-distributed random quantities, the behavior of S for larger n tends to approximate the normal distribution by the Central Limit Theorem. Therefore a normal approximation to S can be used for n > 10.¹ In this case, a standardized Z-statistic is formed by first computing the expected mean value and standard deviation of S. From the discussion above, when no trend is present, positive differences in randomly selected pairs of measurements should balance negative differences, so the expected mean value of S under the null hypothesis of no trend is simply zero. The standard deviation of S can be computed using equation [17.27]:

SD[S] = sqrt{ (1/18) · [ n(n − 1)(2n + 5) − Σ_{j=1}^{g} t_j (t_j − 1)(2t_j + 5) ] }        [17.27]

where n is the sample size, g represents the number of groups of ties in the data set (if any), and tj is the number of ties in the jth group of ties. If no ties or non-detects are present, equation [17.27] reduces to the simpler form:

1 SD S  = n() n − 1 ()2n + 5 [17.28] 18

¹ Guidance Table 17-5 contains exact confidence levels up to n = 10. Exact confidence levels for n < 20 have been developed in Hollander & Wolfe (1999), Table A.30. These might be preferentially used if sample sizes are fairly small and the data contain non-detect values.


Once the standard deviation of S has been derived, the standardized Z-statistic for an increasing (or decreasing) trend is formed using the equation:

Z = S − 1 SD S  [17.29] ( )   Note that although the expected mean value of S is zero, applying the continuous normal to the discrete S distribution is an approximation. Therefore, a continuity correction is made to Z by first subtracting 1 from the absolute value of S. The final Z-statistic can then be compared to an α-level critical point taken from Table 10-1 in Appendix D to complete the test.

PROCEDURE

Step 1. Order the data set by sampling event or time of collection, x1, x2, to xn. Then consider all possible differences between distinct pairs of measurements, (xj – xi) for j > i. For each pair, compute the sign of the difference, defined by:

sgn(x_j − x_i) =   +1   if (x_j − x_i) > 0
                    0   if (x_j − x_i) = 0
                   −1   if (x_j − x_i) < 0        [17.30]

Pairs of tied values, including non-detects, will receive scores of zero using equation [17.30].

Step 2. Compute the Mann-Kendall statistic S using equation [17.31]:

S = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} sgn(x_j − x_i)        [17.31]

In equation [17.31] the summation starts with a comparison of the very first sampling event against each of the subsequent measurements. Then the second event is compared with each of the samples taken after it (i.e., the third, fourth, fifth, etc.). Following this pattern is probably the most convenient way to ensure that all distinct pairs are tallied in forming S. For a sample of size n, there will be n(n − 1)/2 distinct pairs.

Step 3. If n ≤ 10, and given the level of significance (α), determine the critical point s_cp from Table 17-5 of Appendix D. If S > 0 and |S| > s_cp, conclude there is statistically significant evidence of an increasing trend at the α significance level. If S < 0 and |S| > s_cp, conclude there is statistically significant evidence of a decreasing trend. If |S| ≤ s_cp, conclude there is insufficient evidence to identify a significant trend.

Step 4. If n > 10, determine the number of groups of ties (g) and the number of tied values in each group of ties (tj). Then use equation [17.27] to compute the standard deviation of S and equation [17.29] in turn to compute the standardized Z-statistic.


Step 5. Given the significance level (α), determine the critical point zcp from the standard normal distribution in Table 10-1 in Appendix D. Compare Z against this critical point. If Z > zcp, conclude there is statistically significant evidence at the α-level of an increasing trend. If Z < – zcp, conclude there is statistically significant evidence of a decreasing trend. If neither exists, conclude that the sample evidence is insufficient to identify a trend.

►EXAMPLE 17-6

Test for a significant upward trend using the Mann-Kendall procedure in the following set of sulfate measurements (ppm) collected over several years.

Sample No.   Sampling Date (yr.mon)   Sulfate Conc. (ppm)      Sample No.   Sampling Date (yr.mon)   Sulfate Conc. (ppm)

 1           89.6                     480                      13           93.1                     590
 2           89.8                     450                      14           93.6                     550
 3           90.1                     490                      15           94.1                     600
 4           90.3                     520                      16           94.6                     700
 5           90.6                     485                      17           95.1                     570
 6           90.8                     510                      18           95.6                     610
 7           91.1                     510                      19           95.8                     650
 8           91.3                     530                      20           96.1                     620
 9           91.6                     510                      21           96.3                     830
10           91.8                     560                      22           96.6                     720
11           92.1                     560                      23           96.8                     590
12           92.6                     540

SOLUTION

Step 1. Construct a time series plot of the sulfate observations to check for a possible trend as in Figure 17-6. A clearly rising concentration pattern is seen, although the variability in the measurements appears greater toward the end of the sampling record than at the beginning.

Step 2. Compute the difference between each distinct pair of measurements and determine the sign of the difference, using equation [17.30]. Then sum up the signs with equation [17.31]. Note that to make sure all the distinct pairs have been summed, begin with the first listed observation and compare it to each of the values below it. Then take the second listed value and compare it to each of the remaining ones below it, etc. The Mann-Kendall statistic becomes:

S = sgn(450 − 480) + sgn(490 − 480) + … + sgn(590 − 720) = 194

Step 3. Since the sample size n = 23 > 10, form the normal approximation to the Mann-Kendall statistic. Because there are some ties in the data, use equation [17.27] to compute the approximate standard deviation. Among the sulfate measurements, there are three groups of ties with 3, 2, and 2 tied values in each set respectively (at values 510, 560, and 590). The adjusted standard deviation is then:

SD[S] = sqrt{ (1/18) · [ 23(23 − 1)(2·23 + 5) − ( 3(3 − 1)(2·3 + 5) + 2(2 − 1)(2·2 + 5) + 2(2 − 1)(2·2 + 5) ) ] } = 37.79

Finally, using equation [17.29], the normalized Mann-Kendall statistic is:


Z = (194 − 1) / 37.79 = 5.11

Step 4. The Z statistic can be compared to a critical point from the standard normal distribution in Table 10-1 in Appendix D. As large as it is, the test statistic is bigger than the critical point for any usual significance level, suggesting that the trend appears to be real and not just a chance artifact of the sample. ◄
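The Mann-Kendall computation in this example can be verified with a brief script. The minimal sketch below (numpy only) implements equations [17.27], [17.29], and [17.31] on the sulfate data, including the tie correction for the three groups of repeated values.

```python
import numpy as np

# Sulfate data from Example 17-6 (n = 23)
x = np.array([480, 450, 490, 520, 485, 510, 510, 530, 510, 560, 560, 540,
              590, 550, 600, 700, 570, 610, 650, 620, 830, 720, 590])
n = len(x)

# Mann-Kendall statistic S: sum of sgn(x_j - x_i) over all distinct pairs [17.31]
S = int(sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n)))

# Tie correction: group sizes of repeated values enter equation [17.27]
_, counts = np.unique(x, return_counts=True)
ties = counts[counts > 1]
sd_S = np.sqrt((n * (n - 1) * (2 * n + 5)
                - np.sum(ties * (ties - 1) * (2 * ties + 5))) / 18.0)

# Continuity-corrected normal approximation [17.29]
Z = (S - 1) / sd_S
```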

Figure 17-6. Time Series Plot of Sulfate Concentrations (ppm)


17.3.3 THEIL-SEN TREND LINE

BACKGROUND AND PURPOSE

The Mann-Kendall procedure is a non-parametric test for a significant slope in a linear regression of the concentration values plotted against time of sampling. But the Mann-Kendall statistic S does not indicate the magnitude of the slope or estimate the trend line itself even when a trend is present. This is slightly different from parametric linear regression, where a test for a significant slope follows naturally from the estimate of the trend line. Even a relatively modest slope can be statistically distinguished from zero with a large enough sample. It is best to first identify whether or not a trend exists, and then, for a significant trend, determine how steeply the concentration levels are changing over time. The Theil-Sen trend line (Helsel, 2005) is a non-parametric alternative to linear regression which can be used in conjunction with the Mann-Kendall test.

The Theil-Sen method handles non-detects in almost exactly the same manner as the Mann-Kendall test. It assigns each non-detect a common value less than any other detected measurement (e.g., half the RL). Unlike the Mann-Kendall test, however, the actual concentration values are important in computing the slope estimate in the Theil-Sen procedure. The essential idea is that if a simple slope estimate is computed for every pair of distinct measurements in the sample (known as the set of pairwise slopes), the average of this series of slope values should approximate the true slope. The Theil-Sen method is non-parametric because instead of taking an arithmetic average of the pairwise slopes, the median slope value is determined. By taking the median pairwise slope instead of the mean, extreme pairwise slopes (perhaps due to one or more outliers or other errors) are ignored and have little if any impact on the final slope estimator.

The Theil-Sen trend line is also non-parametric because the median pairwise slope is combined with the median concentration value and the median sample date to construct the final trend line. As a consequence of this construction, the Theil-Sen line estimates the change in median concentration over time and not the mean as in linear regression.

REQUIREMENTS AND ASSUMPTIONS

The Theil-Sen procedure does not require normally-distributed trend residuals as in linear regression. It is also not critical that the residuals be homoscedastic (i.e., having equal variance over time and with increasing average concentration level). It is important to have at least 4 and preferably 8 or more observations on which to construct the trend. But trend residuals are assumed to be statistically independent. Approximate checks of this assumption can be made using the techniques of Chapter 14, once the estimated trend has been removed and if the number of non-detect data is limited. Sampling events should also be spaced far enough apart relative to the site-specific groundwater velocity so that an assumption of physical independence of consecutive sample volumes is reasonable.

A more difficult problem is encountered when a large fraction of the data is non-detect. As long as less than half the measurements are non-detects occurring in the lower part of the observed concentration range, the median concentration value will be quantified and the median pairwise slope will generally be associated with a pair of detects. Larger proportions of non-detect data make computation of the Theil-Sen trend line more difficult and uncertain. The reason is that each time a non-detect is paired with a quantified measurement, the pairwise slope is known only within a range of values. One end of the range results from supposing the true non-detect concentration is equal to zero; the other when the non-detect concentration is equal to the RL.

PROCEDURE

Step 1. Order the data set by sampling event or time of collection, x1, x2, to xn. Then consider all possible distinct pairs of measurements, (xi, xj) for j > i. For each pair, compute the simple pairwise slope estimate:

m_ij = (x_j − x_i) / (t_j − t_i)        [17.32]

With a sample size of n, there should be a total of N = n(n–1)/2 such pairwise estimates m_ij. If a given observation is a non-detect, use half the RL as its estimated concentration.

Step 2. Order the N pairwise slope estimates (mij) from least to greatest and rename them as m(1), m(2), …, m(N). Then determine the Theil-Sen estimate of slope (Q) as the median value of this list.


Finding this value will depend on whether N is even or odd, but the following equation can be used:

Q =   m_((N+1)/2)                       if N is odd
      [ m_(N/2) + m_((N+2)/2) ] / 2     if N is even        [17.33]

Step 3. Order the sample by concentration magnitude from least to greatest, x(1), x(2), to x(n). Determine the median concentration with the formula:

x̃ =   x_((n+1)/2)                       if n is odd
      [ x_(n/2) + x_((n+2)/2) ] / 2     if n is even        [17.34]

Again, replace each non-detect by half its RL during this calculation. Also find the median sampling date (t̃) using the ordered times t_1, t_2, to t_n by a similar computation.

Step 4. Compute the Theil-Sen trend line with the equation:

x = x̃ + Q · (t − t̃) = (x̃ − Q·t̃) + Q·t        [17.35]

Using equation [17.35], an estimate can be made at any time (t) of the expected median concentration (x).

►EXAMPLE 17-7

Use the following sodium measurements to compute a Theil-Sen trend line. Note that the sample dates are recorded as the year of collection (2-digit format) plus a fractional part indicating when during the year the sample was collected. This allows an annual slope estimate, since 1 unit = 1 year.

Sampling Date (yr)   Sodium Conc. (ppm)

89.6                 56
90.1                 53
90.8                 51
91.1                 55
92.1                 52
93.1                 60
94.1                 62
95.6                 59
96.1                 61
96.3                 63

SOLUTION

Step 1. Compute the pairwise slopes for each distinct pair of measurements using equation [17.32]. With n = 10 observations, there will be a total of 10(9)/2 = 45 such pairs. The first few are listed below:


m_12 = (53 − 56) / (90.1 − 89.6) = −6

m_13 = (51 − 56) / (90.8 − 89.6) = −4.17

m_14 = (55 − 56) / (91.1 − 89.6) = −0.667

Step 2. Since the total number of distinct pairs is odd (N = 45), sort the list of pairwise slopes as in the table below and let Sen's estimated slope equal the middle or 23rd largest value in this list. This gives an estimate of Q = 1.33 ppm increase per year, an estimate in line with the time series plot of Figure 17-7.

Step 3. Compute the median concentration value x̃ = 57.5 and the median sample date t̃ = 92.6 from the table above. Then calculate the Theil-Sen trend line using the slope estimate from Step 2:

x = 57.5 + 1.333 · (t − 92.6) = −65.97 + 1.333 · t

This trend line can be used to estimate the predicted median concentration (x) at any desired time in years (t). For example, at the beginning of 1998 (t = 98), the trend line would predict a median sodium concentration of approximately x = 64.7 ppm. ◄

Rank   Pairwise Slope      Rank   Pairwise Slope

 1     −6                  24     1.538
 2     −4.167              25     1.613
 3     −3                  26     1.667
 4     −2.857              27     1.887
 5     −2                  28     2
 6     −1.6                29     2
 7     −0.667              30     2
 8     −0.5                31     2.182
 9     −0.5                32     2.25
10     −0.4                33     2.25
11     0.333               34     2.333
12     0.455               35     2.333
13     0.5                 36     2.5
14     0.769               37     2.619
15     0.769               38     3.333
16     0.889               39     3.913
17     0.938               40     4
18     1.045               41     5
19     1.091               42     5.714
20     1.143               43     8
21     1.2                 44     10
22     1.333               45     13.333
23     1.333
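The Theil-Sen computation can likewise be checked in a few lines. The minimal sketch below (numpy only) carries out Steps 1 through 4 on the sodium data; for odd N, np.median of the 45 pairwise slopes returns the 23rd ordered value, matching equation [17.33].

```python
import numpy as np

# Sodium data from Example 17-7: dates in fractional years and concentrations
t = np.array([89.6, 90.1, 90.8, 91.1, 92.1, 93.1, 94.1, 95.6, 96.1, 96.3])
x = np.array([56, 53, 51, 55, 52, 60, 62, 59, 61, 63], dtype=float)
n = len(x)

# Step 1: pairwise slopes for every distinct pair [17.32]
slopes = [(x[j] - x[i]) / (t[j] - t[i])
          for i in range(n - 1) for j in range(i + 1, n)]

Q = np.median(slopes)                 # Step 2: Theil-Sen slope [17.33]
x_med = np.median(x)                  # Step 3: median concentration [17.34]
t_med = np.median(t)                  # Step 3: median sampling date
intercept = x_med - Q * t_med         # Step 4: trend line x = intercept + Q*t [17.35]
```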


Figure 17-7. Time Series Plot of Sodium Concentrations (ppm)


CHAPTER 18. PREDICTION LIMIT PRIMER

18.1 INTRODUCTION TO PREDICTION LIMITS .......................................... 18-1
     18.1.1 Basic Requirements for Prediction Limits ............................ 18-4
     18.1.2 Prediction Limits With Censored Data ................................ 18-6
18.2 PARAMETRIC PREDICTION LIMITS ............................................... 18-7
     18.2.1 Prediction Limit for m Future Values ................................ 18-7
     18.2.2 Prediction Limit for a Future Mean .................................. 18-11
18.3 NON-PARAMETRIC PREDICTION LIMITS ........................................... 18-16
     18.3.1 Prediction Limit for m Future Values ................................ 18-17
     18.3.2 Prediction Limit for a Future Median ................................ 18-20

This chapter introduces the concept of statistical intervals and focuses on several types of prediction limits useful for detection monitoring. The requirements and common assumptions of such limits are explained, as well as specific descriptions of:

• Prediction limits for m future values (Section 18.2.1)
• Prediction limits for future means (Section 18.2.2)
• Non-parametric prediction limits for m future values (Section 18.3.1)
• Non-parametric prediction limits for a future median (Section 18.3.2)

18.1 INTRODUCTION TO PREDICTION LIMITS

First discussed in Chapter 6, prediction limits belong to a class of methods known as statistical intervals. Statistical intervals represent concentration or measurement ranges computed from a sample that are designed to estimate one or more characteristics of the parent population. In groundwater monitoring, statistical intervals offer a convenient and statistically valid way to test for significant differences between background versus compliance point groundwater measurements.

The statistical interval accounts for variability inherent not only in future measurements, but also additional uncertainty in the prediction limit itself. The latter is derived from a relatively small background sample with an associated level of variability in estimating the true characteristics of the underlying groundwater population.

Prediction limits are generally easy to construct and have a straightforward interpretation. Background data are used to construct a concentration limit PL, which is then compared to one or more observations from a compliance point population. The acceptable range of concentrations includes all values no greater than the prediction limit. The appropriate interval will generally have the form [0, PL], with the upper limit PL as the comparison of importance. Unless pH or a similar parameter is being monitored, a one-sided upper prediction limit is used in detection monitoring.
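For a normally distributed background sample, the one-sided upper prediction limit for a single future value (developed in Section 18.2.1) takes the standard form x̄ + t_{1−α, n−1} · s · √(1 + 1/n). A minimal sketch with hypothetical background data follows; the t critical value is hard-coded from a t-table rather than computed.

```python
import numpy as np

# Hypothetical background sample (n = 8); units are arbitrary concentration
bg = np.array([4.1, 3.8, 5.0, 4.6, 4.3, 3.9, 4.8, 4.4])
n = len(bg)
t_crit = 1.895   # t_{0.95, 7}: one-sided 95th percentile, 7 df (from a t-table)

# Upper prediction limit for one future observation at alpha = 0.05
pl = bg.mean() + t_crit * bg.std(ddof=1) * np.sqrt(1.0 + 1.0 / n)
# A new compliance point measurement above pl would be flagged for evaluation
```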

A significant advantage to prediction limits is their flexibility, which can accommodate a wide variety of groundwater monitoring networks. Prediction limits can be constructed so that as few as one new measurement per compliance well may suffice for a test. Prediction limits may be based on a comparison of means, medians, or individual compliance point measurements, depending on the characteristics of the monitoring network and the constituents being tested.

Prediction limits can also be designed to accommodate a wide range of multiple statistical comparisons or tests. Each periodic statistical evaluation (e.g., semi-annually) under RCRA and other regulations involves separate tests at all compliance well locations for each monitoring constituent. Often, the number of separate statistical tests can be quite sizeable. Prediction limits can be constructed to precisely account for the number of tests to be conducted, so as to limit the site-wide false positive rate [SWFPR] and ensure an adequate level of statistical power (see discussion in Chapter 6).

This and the following chapter present basic concepts and procedures for using prediction limits as detection monitoring tests. The intent is to provide a relatively simple framework for using prediction limits in RCRA or CERCLA groundwater monitoring. Chapter 18 describes the construction of prediction limits for tests involving a single constituent at one well. It describes the basic mechanics of each type of prediction limit and how they differ from one another.

Chapter 19 expands this discussion to cover multiple simultaneous prediction limit tests (i.e., all occurring during a single statistical evaluation or during a single year of monitoring). Cumulative SWFPRs and statistical power are considered, including how these criteria impact the expected performance of a given prediction limit strategy. Examples are provided to illustrate these procedures, as well as explanations of associated tables and software.

Specific strategies in Chapter 19 apply the concept of retesting. Generally speaking, almost any prediction limit procedure in detection monitoring should be combined with an appropriate retesting strategy. The reason is that when testing a large number of compliance point samples, it is almost guaranteed that one or more measurements will exceed an upper prediction limit. Resampling of those wells where an exceedance has occurred can either verify the initial evidence of a release or disconfirm it, while avoiding unnecessary false positives.

Chapter 6 introduced a number of key terms used in the Unified Guidance, especially for prediction limit and control chart tests. The guidance applies the term comparison to individual future measurements or sample statistics evaluated against a prediction limit (or control chart limit), and the term test to represent a series of future data comparisons that ultimately result in a statistical decision. A 1-of-m retesting procedure (described below), for instance, might involve comparison of up to m distinct sample measurements against the prediction limit. Each of these individual samples involves a comparison, but only after all the necessary individual comparisons have been made is the test complete. This distinction becomes particularly important when properly determining SWFPRs, a subject discussed both in Chapter 6 and Chapter 19.

One or more future observations are collected for purposes of testing compliance well data, as distinct from the background sample from which the prediction limit is constructed. Background data can be obtained from upgradient wells or in combination with historical, uncontaminated compliance well data. In intrawell testing, data from an individual compliance well constitute both the background and future samples. The two data sets need to be distinct and may not overlap, even if the historical

background data are periodically updated with previously evaluated future samples. The key idea is that at any given point in time, background and future data sets are clearly distinguished.

Formally, prediction limits are constructed to contain one or more future observations or sample statistics generated from the background population with a specified probability equal to (1–α). The probability (1–α) is known as the confidence level of the limit. It represents the chance — over repeated applications of the limit to many similar data sets — that the prediction limit will contain future observations or statistics drawn from its background population.

A sample of n background measurements is used to construct the prediction limit. Under the null hypothesis that the compliance point population is identical to background, a set of m independent compliance point observations or a statistic like the mean based on those observations (i.e., the future data) is then compared against the prediction limit. For the prediction limit to serve as a valid statistical test, the future observations are initially presumed to follow the same distribution as background.

Only background values are used to construct the prediction limit. But the probability that the limit contains all m future observations or sample statistics derived from those future data does not depend solely on the observed background. It is also based on the number of future measurements or sample statistics used in the comparison and how the individual comparisons are conducted. To underscore this point, consider the general equation for a prediction limit based on normal or transformably normal populations, given by

PL = x̄ + κs   [18.1]

where x̄ is the sample mean in background, s is the background standard deviation, and κ is a multiplier depending on the type of prediction limit under construction. The simplest type of prediction limit test compares a specific number of individual future observations to the limit (PL). For example, do all three compliance measurements collected during a 6-month period fall within the prediction interval? The multiplier κ, and hence the prediction limit itself, changes depending on whether one, two, or three compliance observations will be compared against PL. More generally, the κ-multiplier is selected to account not only for the number of future comparisons, but also for the rules of the comparison strategy and the number of simultaneous tests to be conducted (e.g., the number of monitoring constituents times the number of compliance wells).

In the simplest case of a successive comparison of m individual future measurements against PL, the test is labeled as an m-of-m prediction limit. All m of the future observations need to fall within the prediction interval for the test to 'pass' — that is, be no greater than PL. If any one or more of the future values exceed the PL, the test fails and the well is deemed to have a statistically significant increase [SSI] or constitute an exceedance.

The κ-multiplier appropriate for an m-of-m prediction limit test is different from the multipliers that would be computed for other kinds of comparison rules. Another simple type is a comparison of a single future mean of order p. Here, p future measurements are collected and averaged before comparing the result against PL. If the order-p mean is no greater than PL, the test passes; otherwise, it fails. A test following this rule is labeled a 1-of-1 prediction limit on a future mean. The important thing to remember is that the κ-multiplier, and thus the prediction limit, will differ depending on whether the p future values are first averaged or instead compared against PL one-by-one. The choice of one rule versus the other impacts the magnitude of the prediction limit and ultimately its expected statistical power and false positive rate.
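The dependence of the false positive rate on the number of future comparisons can be illustrated with a short simulation. The sketch below is not part of the guidance; it assumes Python with NumPy and SciPy, builds a normal prediction limit sized for exactly one future observation, and shows what happens when that same limit is reused for three comparisons:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, alpha, trials = 8, 0.05, 20_000

# kappa sized for exactly ONE future comparison (equation [18.3] with m = 1)
k1 = stats.t.ppf(1 - alpha, n - 1) * np.sqrt(1 + 1 / n)

false_pos_1 = false_pos_3 = 0
for _ in range(trials):
    bg = rng.standard_normal(n)            # background sample
    pl = bg.mean() + k1 * bg.std(ddof=1)   # PL = xbar + kappa * s
    future = rng.standard_normal(3)        # future data from the SAME population
    false_pos_1 += future[0] > pl          # proper single-comparison use
    false_pos_3 += (future > pl).any()     # same limit misused for 3 comparisons

print(false_pos_1 / trials)   # near the nominal alpha = 0.05
print(false_pos_3 / trials)   # noticeably larger than 0.05
```

The inflated second rate is precisely why the κ-multiplier must be enlarged when more comparisons, or more simultaneous tests, are planned (see the tables described in Chapter 19).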

Another comparison rule of substantial benefit in groundwater monitoring is the 1-of-m prediction limit on future observations or on a statistic such as the mean or median. This test requires at least one of m successive observations or statistics to fall within the prediction interval in order to pass. Operationally, this means that if an initial compliance well measurement is no greater than PL, the test is complete and no further sampling need be done. If the initial value exceeds PL, one or more of (m–1) resamples need to be obtained. Since these additional measurements are collected sequentially over sufficiently long time periods to maintain approximate statistical independence (Chapter 3), the first resample to fall within the prediction interval also ends the test as 'inbounds' or passing, frequently obviating the need to gather all m measurements.
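To make the operational sequence concrete, here is a minimal sketch of the 1-of-m decision logic in Python (the function name and inputs are illustrative, not from the guidance):

```python
def one_of_m_test(pl, initial, resamples, m):
    """1-of-m prediction limit rule: the test passes as soon as any of up to m
    sequential measurements (the initial value plus up to m-1 resamples) falls
    at or below the upper prediction limit pl."""
    if initial <= pl:
        return "pass"            # initial value inbounds; no resampling needed
    for value in resamples[: m - 1]:
        if value <= pl:
            return "pass"        # first inbounds resample ends the test
    return "fail"                # all m comparisons exceeded PL
```

For example, with PL = 73.7 an initial value of 80.1 followed by a resample of 65.2 passes a 1-of-3 test: `one_of_m_test(73.7, 80.1, [65.2], m=3)` returns `"pass"`, and in practice the second resample would never need to be collected.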

Another comparison rule of some use is known as the California strategy, first developed for the State of California RCRA program. The California strategy can be construed as a conditional rule: if an initial future observation is no greater than PL, further comparisons are not needed and the test passes. However, if the initial observation exceeds the PL, 2-of-2 or 3-of-3 resamples all need to not exceed the PL in order for the well to remain in compliance. A slight modification to this rule termed the modified California approach has better statistical power and false positive rate characteristics than the original California strategies, and is therefore included as a potential prediction limit test.

18.1.1 BASIC REQUIREMENTS FOR PREDICTION LIMITS

All prediction limits share certain basic assumptions when applied as tests of groundwater. Further, parametric prediction limits as presented in the Unified Guidance require the sample data to be either normally-distributed or normalized via a transformation. The key points can be summarized as follows:

1. background and future sample measurements need to be identically and independently distributed (the i.i.d. presumption; see Chapter 3);

2. sample data do not exhibit temporal non-stationarity in the form of trends, autocorrelation, or other seasonal or cyclic variation;

3. for interwell tests (e.g., upgradient-to-downgradient comparisons), sample data do not exhibit non-stationary distributions in the form of significant natural spatial variability;

4. background data do not include statistical outliers (a form of non-identical distributions);

5. for parametric prediction limits, background data are normal or can be normalized using a transformation; and

6. a minimum of 8 background measurements is available; more for non-parametric limits or when accounting for multiple, simultaneous prediction limit tests.

The first assumption implies that background data are randomly drawn from a single common parent population, especially if aggregated from more than one source well. As discussed in Chapter 5, analysis of variance [ANOVA] can be used to determine the appropriateness of pooling data from different background wells. There is also a presumption that the compliance point measurements follow the same distribution as background in the absence of a release.

The second assumption is a corollary to the first, and requires that the background data be stationary over time (Chapter 3). This can be evaluated with one or more techniques described in Chapter 14 on temporal variability. These account for trends, autocorrelation, or other variation, perhaps by utilizing data residuals instead of the raw measurements. If the background residuals meet the basic points above, they can be used to construct an adjusted prediction limit. Residuals of the future observations would also need to be computed and compared against the adjusted prediction limit to ensure a valid and consistent test.

The second assumption also requires that there be only a single source of variation in the data, when using the usual sample standard deviation (s) to compute the prediction limit. If there are other sources of variation such as seasonal patterns or temporal variation in lab analytical performance, these should be included in the estimate of variability. Otherwise s is likely to be biased. One method to accomplish this is by use of an appropriate ANOVA model to include temporal factors affecting the variability (Chapter 14). Determination of the components of variance in more complicated models is beyond the scope of this guidance and may require consultation with a professional statistician.

The third assumption requires that background and compliance point populations be identical in distribution, absent a release, for interwell tests. Spatial variation violates this assumption since the well population means (µ) will be different, making it impossible to know whether an apparent upgradient-to-downgradient difference is attributable to a release or simply to variations in natural groundwater concentration levels. The assumption also requires that each population share a common variance (σ²). Tests of equal variance (i.e., homoscedasticity) when using prediction limits may be possible either by examining groups of historical background and compliance point data or by performing periodic tests once enough compliance point measurements have been accumulated to make a diagnostic test possible.

The fourth assumption implies that background data should be screened for outliers using the techniques in Chapter 12. Statistical outliers can potentially inflate a prediction limit and severely limit its statistical power and accuracy by over-inflating both the sample background mean (x̄) and especially the background standard deviation (s). The Unified Guidance discourages automated removal of outliers from background samples, but all possible outliers should be examined to determine whether a cause can be identified (see discussion in Chapter 6). In some cases, an apparent outlier may represent a valid portion of the underlying background population that has not yet been sampled or observed. It also could represent evidence that conditions in background have changed or are changing.

The fifth assumption of normality for parametric prediction limits can be evaluated using the diagnostic techniques described in Part II of the guidance. If skewed background data can be normalized via a transformation (e.g., the natural logarithm), the prediction limit should be constructed on the transformed background values. The resulting limit should either be: 1) back-transformed to the concentration domain (e.g., by exponentiation) when comparing future individual compliance observations; or 2) left in the transformed scale when compared to future mean compliance data also based on the same transformation. In the latter case, use of a logarithmic transformation results in evaluating population medians or geometric means and not the arithmetic means.


When normality cannot be justified, a non-parametric prediction limit should be considered instead. A non-parametric limit assumes only that all the data come from the same, usually unknown, continuous population. Non-parametric prediction limits generally require a much larger number of background observations in order to provide the same level of confidence (1–α) as a comparable parametric limit. Consequently, the Unified Guidance recommends that a parametric model be fit to the data if at all possible.

The last assumption concerns sufficient background sample sizes. A prediction interval can be computed with as few as three observations from background. However, this can result in an unacceptably large upper prediction limit and a test with very limited statistical power. A sample size of eight or more is generally needed to derive an adequate parametric prediction limit, especially if a retesting strategy is not employed. The exact requirements depend on the number of simultaneous tests (i.e., number of wells times number of constituents per well) to be made against the prediction limit and the type of retesting strategy adopted (see Chapter 19 for more discussion of retesting strategies).

If a minimum schedule of quarterly sampling is being followed and there is only one background well, at least two years of data will be needed before constructing the prediction limit.1 If data from multiple background wells screened in comparable hydrologic conditions can reasonably be combined (see Chapter 5), pooling background data to increase background sample sizes is encouraged.

18.1.2 PREDICTION LIMITS WITH CENSORED DATA

When a sample contains a substantial fraction of non-detects or left-censored measurements, it may be impossible to even approximately normalize the data. A sample data set may originate from a normal or transformable-to-normal population, but the uncertainty surrounding both the censored values and the consequent shape of the lower tail of the distribution prevents a clear identification. If the apparent underlying distribution is not normal or transformable to normality, a non-parametric prediction limit (Section 18.3) should be used.

Given that non-parametric prediction limits typically have much steeper background data requirements than their parametric counterparts, one remedy is to attempt a fit to normality by using censored probability plots (Chapter 15) in conjunction with either the Kaplan-Meier or regression on order statistics [ROS] techniques (Chapter 15) for left-censored data. Censored observations prevent a full and complete ordering of the sample, making it difficult to assess normality with standard probability plots (Chapter 9). Censored probability plots, on the other hand, graph only the detected values, but do so based on a partial ordering and ranking of the sample. Data that appear distinctly non-normal on a standard probability plot (where non-detects are perhaps replaced by half their reporting limits [RLs] to allow plotting) can sometimes appear reasonably normal on a censored probability plot. Transformations can also be applied and the censored probability plot reconstructed to see if the data can be normalized in that fashion.

1 The Unified Guidance does not recommend that only one background well be used in any kind of interwell or upgradient-to-downgradient comparison. Multiple background wells are always preferred so that tests for spatial variability may be made and the exact nature of background better understood.


If the censored probability plot is close to linear and the sample approximately normalized, an estimated mean and standard deviation should be computed. These estimates will not be the same as those obtained if each non-detect were replaced by half its RL and the sample mean calculated from the resulting imputed sample. To properly account for the censoring, the estimated mean (denoted as μ̂) and the estimated standard deviation (σ̂) need to be derived as parameters of the normal distribution providing the closest fit to a partial ordering of the sample (as on a censored probability plot). The Unified Guidance describes two slightly different techniques for accomplishing this task.

Once the μ̂ and σ̂ estimates have been computed, an adjusted parametric prediction limit is constructed by substituting μ̂ for x̄ and σ̂ for s in the equations of Section 18.2 or Chapter 19. For example, the adjusted equation for a general parametric prediction limit becomes:

PL = μ̂ + κ·σ̂   [18.2]

Another potential difference between the adjusted prediction limit in equation [18.2] and the unadjusted prediction limit in equation [18.1] is the number of degrees of freedom [df] used in selecting the κ-multiplier. Absent any censored measurements, a background sample of size n would normally have (n–1) df. With censoring, there is greater statistical uncertainty surrounding each non-detect than surrounding the detected values. Because of this, the actual degrees of freedom lie somewhere between d (the number of detects) and (n–1) (the total sample size minus one). Unfortunately, there is no straightforward, general method to determine the true df. To be conservative, the df should be set equal to d, since the value of each detect is known with reasonable certainty. Setting a lower df tends to raise the κ-multiplier, and thus the prediction limit, over what would be selected with an uncensored sample of the same size. This is consistent with the greater uncertainty associated with non-detect measurements. However, it is at best an approximate remedy. Further consultation with a professional statistician may be warranted to arrive at a better choice of the degrees of freedom.
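As a sketch of this conservative adjustment, the fragment below (illustrative only: the function name and the Kaplan-Meier estimates are hypothetical, and the κ-multiplier used is the simple t-based one from the single-test equation of Section 18.2) computes an adjusted limit with df = d and contrasts it with the df = n − 1 version:

```python
from math import sqrt
from scipy import stats

def censored_adjusted_pl(mu_hat, sigma_hat, n, d, alpha, m=1):
    """Adjusted upper prediction limit per equation [18.2], using a t-based
    kappa with the conservative df = d (number of detects) in place of n - 1."""
    t = stats.t.ppf(1 - alpha / m, df=d)
    return mu_hat + t * sigma_hat * sqrt(1 + 1 / n)

# hypothetical Kaplan-Meier estimates from n = 12 samples with d = 8 detects
pl_censored = censored_adjusted_pl(10.0, 4.0, n=12, d=8, alpha=0.05)
pl_full = censored_adjusted_pl(10.0, 4.0, n=12, d=11, alpha=0.05)
# a lower df raises the t-quantile, so pl_censored > pl_full
```

The gap between the two limits shrinks as the detection frequency rises, which is consistent with the approximate nature of the remedy described above.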

18.2 PARAMETRIC PREDICTION LIMITS

18.2.1 PREDICTION LIMIT FOR M FUTURE VALUES

BACKGROUND AND PURPOSE

A prediction limit test for m future values is constructed so that m compliance point observations are evaluated by determining whether or not they fall within a prediction interval derived from background. As mentioned in Chapter 2, some State programs may require up to 4 successive sampling events per evaluation period for testing, which can be addressed by the prediction limit approach described below.

If the distributions of background and compliance point data are identical as assumed under the null hypothesis H0, all m of the compliance point observations should be no greater than the upper prediction limit [PL]. If any of the future observations exceeds PL, there is statistical evidence that the

compliance data do not come from the same distribution as background, but instead are elevated above background.2

With intrawell comparisons, a prediction limit can be computed on historical data or intrawell background to contain a specified number (m) of future (i.e., more recent) observations from the same well. If any of the future values exceeds the upper prediction limit, there is evidence of recent contamination at the well.

REQUIREMENTS AND ASSUMPTIONS

As noted in Section 18.1, the prediction limit test on m future values is designated as an m-of-m test. Each of the m individual future observations needs to be compared to the prediction limit [PL]. All should be no greater than PL for the test to pass. The number of future observations to be collected (m) needs to be specified in advance in order to correctly compute the κ-multiplier from equation [18.1]. Consequently, if compliance data are collected on a regular schedule, the prediction interval can be constructed to cover a specified time period of future sampling. Usually this period will coincide with the time between statistical evaluations specified in the site permit (e.g., on a semi-annual or annual basis). Keep in mind also that m denotes the number of consecutive sampling events being compared to the prediction limit at a given well for a given constituent.

As discussed in more detail in Chapter 6, a new prediction limit should be constructed prior to each statistical evaluation for interwell tests, when additional background data have been collected along with the new compliance point measurements. Unless there is evidence of characteristic changes within background groundwater quality (e.g., as demonstrated by observable trends in background), background data should be amassed or accumulated over time. Earlier background measurements need not be discarded, both to maintain an adequate background sample size and also because a larger span of sampling results will provide a better statistical description of the underlying background population. The revised prediction limit will then reflect a larger background sample size, n, but possibly the same number, m, of future values to be predicted at the next statistical evaluation.

For intrawell tests, the prediction limits should be revised only after intrawell background has been updated (Chapter 5). Such updating may not coincide with the regular schedule of statistical evaluations if done, for instance, every two years or so. In that case, the same intrawell prediction limit might be used for multiple evaluations before being revised.

PROCEDURE

Step 1. Calculate the sample mean x , and standard deviation s, from the set of n background measurements.

Step 2. Specify the number of individual future observations (m) from the compliance well to be included in the prediction interval for an m-of-m test. For an upper prediction limit with an overall (1−α) confidence test level for the m comparisons, use the equation:

2 In the context of the Unified Guidance, m represents the number of consecutive samples being compared in the prediction limit test for a given well and constituent.


PL = x̄ + t(1−α/m, n−1) · s · √(1 + 1/n)   [18.3]

It is assumed that exactly m consecutive sample values from the compliance point will be compared against the upper PL. Note that the quantile from a Student's t-distribution used in equation [18.3] has two parameters: the degrees of freedom (n–1) and a joint comparison confidence level (1 − α/m). Most Student's t-quantiles can be found directly or approximated through interpolation by looking in Table 16-1 of Appendix D.

Note: equation [18.3] assumes the prediction limit is applied to only one constituent at a single well. If multiple tests need to be performed (e.g., on multiple wells and/or multiple constituents), the prediction limit takes the form:

PL = x̄ + κs   [18.4]

where the κ-multiplier is determined using one of the strategies described in Chapter 19.

If a log transformation is applied to the data to bring about approximate normality, the upper PL should be constructed using the log-mean (ȳ) and log-standard deviation (sy), using the equation:

 1  PL = exp y + t s 1+  [18.5]  1−α m,n−1 y n 

If multiple tests must be conducted and a log transformation has been applied to the data, the upper PL will have the form:

PL = exp(ȳ + κ·sy)   [18.6]

Note: other transformations besides the natural logarithm are handled in a similar manner; compute the prediction limit on the transformed data, then back-transform the limit to the original concentration scale prior to comparison with any future observations.

Step 3. Once the prediction limit (PL) has been calculated, compare each of m compliance point future values against PL. If all of these measurements are no greater than PL, the test passes and the well is deemed to be in compliance. If, however, any compliance point concentration exceeds PL, there is statistically significant evidence of an increase over background.

►EXAMPLE 18-1

The data in the table below represent quarterly arsenic concentrations measured in a single well at a solid waste landfill. Calculate an intrawell upper prediction limit for 4 future samples with 95% confidence and determine whether there is evidence at the annual statistical evaluation of a possible release during Year 4 of monitoring.


Intrawell Background                    Compliance Data
Sampling Period   Arsenic (ppb)         Sampling Period   Arsenic (ppb)

Year 1                 12.6             Year 4                 48.0
                       30.8                                    30.3
                       52.0                                    42.5
                       28.1                                    15.0
Year 2                 33.3
                       44.0
                        3.0
                       12.8
Year 3                 58.1
                       12.6
                       17.6
                       25.3

n = 12    Mean = 27.52    SD = 17.10

SOLUTION

Step 1. First check the sample data for the key points identified in Section 18.1.1. As an example, a Shapiro-Wilk test on the background data gives a test statistic of SW = 0.947. The critical point at the α = .05 level for the Shapiro-Wilk test on n = 12 observations is 0.859. Since the test statistic exceeds the critical point, there is insufficient evidence to reject an assumption of normality.

Step 2. Compute the prediction interval using the raw background data. The sample mean and standard deviation of the 12 background samples are 27.52 ppb and 17.10 ppb, respectively.

Step 3. A single future year of compliance data is then compared to the prediction limit, leading to a test of m = 4 individual measurements. Setting the overall confidence level to (1−α) = 95%, the probability used to determine an appropriate Student's t-quantile needs to be set to (1 − α/m) = 1 − .05/4 = .9875. The t-distribution with probability .9875 and (n–1) = 11 degrees of freedom in Table 16-1 of Appendix D results in a t-quantile of 2.593. Using equation [18.3], the upper prediction limit can be computed as:

PL = 27.52 + t(.9875, 11) · (17.10) · √(1 + 1/12) = 27.52 + 2.593 (17.10) √1.0833 = 73.67 ppb

Step 4. Compare the upper PL to each compliance measurement in Year 4. None of the four observations exceeds 73.67 ppb. Consequently, there is no statistically significant evidence of arsenic contamination during that year. ◄
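The arithmetic of this example can be reproduced with a few lines of Python (a verification sketch, assuming SciPy is available for the t-quantile; not part of the original guidance):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

background = [12.6, 30.8, 52.0, 28.1,   # Year 1
              33.3, 44.0,  3.0, 12.8,   # Year 2
              58.1, 12.6, 17.6, 25.3]   # Year 3
year4 = [48.0, 30.3, 42.5, 15.0]

n, m, alpha = len(background), 4, 0.05
xbar, s = mean(background), stdev(background)   # about 27.52 and 17.10
t = stats.t.ppf(1 - alpha / m, n - 1)           # about 2.593
pl = xbar + t * s * sqrt(1 + 1 / n)             # about 73.67 ppb

exceedances = [x for x in year4 if x > pl]      # empty list: the test passes
```

Using the table value 2.593 rather than the exact quantile changes the limit only in the second decimal place.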


18.2.2 PREDICTION LIMIT FOR A FUTURE MEAN

BACKGROUND AND PURPOSE

Although prediction limits are often constructed as bounds on extreme individual measurements, they can also be formulated to predict an acceptable range of concentrations for the mean of p future values. The comparison rule for the test is then different: instead of requiring all of a set of m individual values to fall within the prediction interval for the test to pass, only the average of the p future values must not exceed the prediction limit.

In this setting, the prediction limit for a future mean is more nearly akin to a t-test or parametric ANOVA, since the mean of the compliance point well is compared to a limit based on the background mean. The principal differences in using a prediction limit as opposed to those tests are: first, the variability of the compliance point population is assumed to be identical to that in background. With a t-test or ANOVA, each distinct well group contributes to the overall estimate of variability, not merely the background values. Secondly, t-tests and especially ANOVA are typically utilized as interwell tests, whereas prediction limits for a future mean can be constructed for either interwell or intrawell testing.

The hypothesis being tested when using a prediction limit for a future mean in detection monitoring is exactly the same as that posited for a prediction limit for m future values, namely, H0: the background population is identical to the compliance population (implying µC ≤ µBG) vs. HA: the compliance mean is greater than the background mean (i.e., µC > µBG). However, the statistical properties of the two prediction interval formulations are somewhat different.

For the same background sample size (n), false positive rate (α), and number of future samples where p = m, the power of the prediction limit for a future mean of order p with normally-distributed data is generally higher than for a prediction limit of the next m individual future observations. This suggests that when feasible and appropriately implemented, a prediction limit strategy based on future means may be more environmentally protective than a strategy based on predicting individual future measurements. A few examples of the power differences are presented in Figures 18-1 and 18-2.


Figure 18-1. Comparison of Prediction limits (BG = 8, α = .01, 1 test)

Even when a retesting strategy is employed, such as the 1-of-m schemes for prediction limits on individual values described in Chapter 19, the statistical power at best matches that of a prediction limit on a single future mean with no retesting, when the same numbers of background and compliance point measurements are used. As Figure 18-2 illustrates, in some cases the 1-of-m power is comparatively lower. Under background conditions, 1-of-m strategies provide an earlier indication of uncontaminated groundwater, since a single inbounds observation can complete the test. By contrast, all p = m individual samples need to be collected to form a mean of order p = m when using a prediction limit test for a single future mean. With a groundwater release, no such potential time savings exists: in that case, all p or m samples need to be collected with either type of prediction limit.


Figure 18-2. Comparison of Prediction limits (BG = 20, α = .05, 1 test)

REQUIREMENTS AND ASSUMPTIONS

Although a prediction limit for a future mean is generally preferable in terms of statistical power for identifying potential contamination, it is not always practical to implement. To accommodate the large number of statistical tests that all but the smallest sites must contend with, the Unified Guidance recommends that almost any prediction limit be implemented in conjunction with a retesting strategy (Chapter 19). Otherwise, the prediction limit formulations provided in this chapter will likely fall short of providing an adequate balance between false negative and false positive decision errors. Retesting with a prediction limit for a future mean will necessitate the collection of p additional measurements to form the resampled mean whenever the initial future mean exceeds the prediction limit. Since all prediction limit tests assume that both the background and compliance data are statistically independent, there generally needs to be enough temporal spacing between sampling events to avoid introducing significant autocorrelation in the series of compliance point values.


If semi-annual evaluation of groundwater quality is required, and depending on data characteristics (see Chapter 14 discussions on temporal variability), there may not be sufficient time for collecting at least 4 independent groundwater measurements from a given well over a six-month period. This would be the minimum needed to form an initial mean and potentially a resample mean of order 2. To avoid this dilemma, the guidance discusses an alternate approach in Chapter 19 for using 1-of-1 prediction limit tests on means.

Like the parametric prediction limit for m future values, the prediction limit on a future mean assumes that the background data used to construct the limit are either normally-distributed or can be normalized. If a transformation is used (e.g., the natural logarithm) and the limit built on the transformed values, the prediction limit should not be back-transformed before comparing to the compliance point data. Rather, because of transformation bias in the mean, the compliance point data should also be transformed, and the future mean computed from the transformed compliance measurements. Then the mean of the transformed values (e.g., log-mean) should be compared to the prediction limit in the transformed domain. As previously mentioned, the prediction limit in the logarithmic domain is not a test of the arithmetic mean, but rather of the geometric mean or median (also see Chapter 16). In most situations, a decision that the lognormal median at the compliance point exceeds background will also imply that the lognormal arithmetic mean exceeds background.

PROCEDURE

Step 1. Calculate the sample mean, x̄, and the standard deviation, s, from the set of n background measurements.

Step 2. Specify the order (p) of the mean to be predicted (i.e., the number of individual compliance observations to be averaged). If the background data are approximately normal and an upper prediction limit with confidence level (1−α) is desired, use the equation:

PL = \bar{x} + t_{1-\alpha,\,n-1} \cdot s \cdot \sqrt{1/p + 1/n} \qquad [18.7]

where it is assumed that an average of p consecutive sample values from the compliance point will be compared against PL. Note that the Student’s t-quantile used in the equation has two parameters: the degrees of freedom (n–1) and the cumulative probability (1−α). Most Student’s t-quantile values can be found directly or approximated through interpolation by using Table 16-1 in Appendix D.
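The computation in equation [18.7] can be sketched in a few lines of Python. The Student's t-quantile is supplied by the caller (e.g., looked up in Table 16-1 of Appendix D), so only the standard library is needed; the function and argument names below are illustrative, not part of the guidance.

```python
import math

def mean_prediction_limit(xbar, s, n, p, t_quantile):
    """Upper prediction limit for a future mean of order p, per [18.7]:
    PL = xbar + t * s * sqrt(1/p + 1/n)."""
    return xbar + t_quantile * s * math.sqrt(1.0 / p + 1.0 / n)

# Log-scale inputs from Example 18-2: n = 8 background values with
# log-mean 2.553 and log-SD 0.706; p = 4; t(.99, 7 df) = 2.998.
pl = mean_prediction_limit(2.553, 0.706, n=8, p=4, t_quantile=2.998)
```

With these inputs the limit works out to approximately 3.85 log(ppb), matching the result of Example 18-2 below.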

Note also that equation [18.7] assumes that the prediction limit is applied to only one constituent at a single well. If multiple tests are to be conducted and a retesting procedure is employed, the prediction limit will take the form of equation [18.4] where the κ-multiplier is determined using the tables described in Chapter 19.

Step 3. If a log transformation is applied to normalize the background sample, the upper PL on the log-scale should be constructed using the log-mean (ȳ) and log-standard deviation (sy), using equation [18.8]:


PL = \bar{y} + t_{1-\alpha,\,n-1} \cdot s_y \cdot \sqrt{1/p + 1/n} \qquad [18.8]

Note that unlike the lognormal prediction limit for future values, the limit in equation [18.8] is not exponentiated back to the concentration domain. Also, equation [18.8] only applies to a single test (i.e., one constituent at a single well). If multiple tests are to be performed, the prediction limit will have the form:

PL = \bar{y} + \kappa \, s_y \qquad [18.9]

where the κ-multiplier is again determined from the tables described in Chapter 19.

Other transformations are handled similarly: construct the prediction limit on the transformed background, but do not back-transform the limit.

Step 4. Once the limit has been computed, compare the compliance point mean against the prediction limit. If the compliance point mean is below the upper PL, the test passes. If the mean exceeds the PL, there is statistically significant evidence of an increase over background.

►EXAMPLE 18-2

The table below contains chrysene concentration data found in water samples obtained from two background wells (Wells 1 and 2) and a compliance well (Well 3). Compute the upper prediction limit for a future mean of order 4 with 99% confidence and determine whether there is evidence of possible chrysene contamination.

Chrysene (ppb)

              Background                  Compliance
Month     Well 1   Well 2   Joint        Well 3

  1         6.9     15.1                  68.0
  2        27.3      7.2                  48.9
  3        10.8     48.4                  30.1
  4         8.9      7.8                  38.1

Mean      13.47    19.62    16.55        46.28
SD         9.35    19.52    14.54        16.40

Log-mean   2.451    2.656    2.553        3.789
Log-SD     0.599    0.881    0.706        0.349

SOLUTION

Step 1. Before constructing the prediction limit, check the key assumptions. Assuming there is no substantial natural spatial variability and it is appropriate to combine the background wells into a single data pool, the algorithm for a parametric prediction limit presumes that the background data jointly originate from a single normal population. Running the Shapiro-Wilk test on the pooled set of eight background measurements gives SW = 0.7289 on the original scale and SW = 0.8544 after log-transforming the data. Since the critical point for the test at the α = .10 level of significance is sw.10,8 = 0.851 (from Table 10-3 of Appendix D), the results suggest that the data should be fit to a lognormal model. The log-transformed statistics for the joint background and compliance well are also found in the above table.

Step 2. Construct the prediction limit on the pooled and logged background observations. Then n = 8, the log-mean is 2.553, and the log-standard deviation is 0.706. Since there are 4 observations in the compliance well, take p = 4 as the order of the mean to be predicted. Then setting (1–α) = .99, the Student’s t-quantile with (n–1) = 7 degrees of freedom and cumulative probability of .99 is found from Table 16-1 in Appendix D to be 2.998. Using equation [18.8], the upper prediction limit on the log-scale is computed as:

PL = 2.553 + (2.998)(0.706)\sqrt{1/4 + 1/8} = 3.85 \ \log(\text{ppb})

Step 3. Compare the log-mean of the chrysene measurements at Well 3 against the upper prediction limit. Since the Well 3 log-mean of 3.789 is less than the limit of 3.85 log(ppb), there is insufficient evidence of chrysene contamination at this well at the α = 0.01 significance level. ◄

18.3 NON-PARAMETRIC PREDICTION LIMITS

Two basic remedies are available when a data set cannot be even approximately normalized, often due to the presence of a significant fraction of non-detects. If the sample includes left-censored data (e.g., non-detects), a fit to normality can be attempted using censored probability plots in conjunction with either the Kaplan-Meier or Robust Regression on Order Statistics [Robust ROS] techniques (Chapter 15). If a reasonable normality fit can be found, a parametric prediction limit can be applied. Otherwise, a non-parametric prediction limit can be considered. A non-parametric upper prediction limit is constructed by setting the limit as a large order statistic selected from background (e.g., the maximum or second-largest background value).

As with their parametric counterparts, non-parametric prediction limits have an associated confidence level (1–α), indicating the probability that the prediction interval [0, PL] will contain all of a set of m future values over repeated application to many similar data sets. Unlike parametric limits, the confidence level for non-parametric limits is not adjustable. Although such a limit is easily constructed, once the background sample size and the number of comparisons are fixed, the confidence level associated with any order statistic used as the prediction limit is also fixed. To increase the confidence level, the primary choices are to decrease the number of future values to be predicted or to increase the number of background observations.

If existing background can be supplemented with data collected from other background wells (e.g., in interwell testing), a non-parametric test confidence level can be increased. Larger samples also provide a better characterization of site spatial variability. Unfortunately, it may not always be possible to supplement background. In these cases, another option to achieve a desired confidence level and correspondingly control the false positive rate is to incorporate a retesting strategy as outlined in Chapter 19.

Although non-parametric prediction limits do not require a presumption of normality, other assumptions apply equally to both parametric and non-parametric limits. Checks should be made of statistical independence, identical distributions (under the null hypothesis), and stationarity over time and space as discussed in Chapter 3 and Part II of the guidance. One particular caution for non-parametric limits is that background should ideally be screened ahead of time for possible outliers, since the upper prediction limit may be set to the background maximum or second highest observed value. Unfortunately, this often cannot be accomplished with a formal statistical test. Outlier tests are rather sensitive to the underlying distribution of the data. If this distribution cannot be adequately determined due to the presence of non-detects, an outlier test is not likely to give reliable results.

Instead of a formal test, it may be possible to screen for outliers using box plots (Chapter 12). Even with non-detects, the ‘whiskers’ delineating the concentration range associated with possible outliers are computed from the sample lower and upper quartiles (i.e., the 25th and 75th percentiles), which may or may not be impacted by data censoring, or perhaps only mildly so when computing the lower quartile. For large fractions of non-detects, the best that can usually be done is to identify a suspected outlier through close examination of laboratory results and chain-of-custody reports.

One of two steps can be taken in the event a possible outlier is flagged. If an error has occurred, it should be corrected before constructing the prediction limit. If an error is merely suspected but cannot be proven, the prediction limit can be constructed as another order statistic from background instead of the maximum (e.g., the second largest value). This will prevent the suspected outlier from being adopted as the upper prediction limit without ignoring the possibility that it may be a real measurement.

18.3.1 PREDICTION LIMIT FOR M FUTURE VALUES

BACKGROUND AND PURPOSE

Given n background measurements and a desired confidence level (1−α), a non-parametric prediction limit test for m future values is an m-of-m comparison rule: the test passes only if none of the m future samples exceeds the upper prediction limit. Thus the procedure is an exact parallel to the parametric prediction limit for future values. Because the method is non-parametric, no distributional model needs to be fit to the background measurements. It is assumed that the compliance point data follow the same distribution as background under the null hypothesis, even if this distribution is unknown. Although no distributional model is assumed, order statistics of any random sample follow certain probability laws which allow the statistical properties of the non-parametric prediction limit to be determined.

Once an order statistic of the sample data (e.g., the maximum value) is selected as the upper prediction limit, Guttman (1970) has shown that the statistical coverage of the interval — that is, the fraction of the background population actually contained within the prediction interval — when constructed repeatedly over many data sets, has a beta probability density with cumulative distribution equal to


I_t(j,\, n-j+1) = \int_0^t \frac{\Gamma(n+1)}{\Gamma(n-j+1)\,\Gamma(j)}\, u^{j-1}(1-u)^{n-j}\, du \qquad [18.10]

where n = sample size, j = rank of the value used as the prediction limit, and Γ(n) = (n−1)! = (n−1) × (n−2) × ··· × 2 × 1 denotes the gamma function. If the maximum is selected as the prediction limit, its rank is equal to n and so j = n. If the second largest value is chosen as the limit, its rank would be equal to (n−1) and so j = (n−1). The confidence probability for predicting that one future observation (i.e., m = 1) from a compliance well does not exceed the prediction limit is equal to the expected or average coverage of the non-parametric prediction limit.

Because of these properties, the confidence probability for a prediction limit on one future measurement can be shown to equal (1–α) = j/(n+1). If the background maximum is taken as the upper prediction limit, the confidence level thus becomes n/(n+1). Gibbons (1991a) has shown that the probability of having m future samples all not exceed such a limit is (1–α) = n/(n+m). More generally, the same probability when the jth order statistic is taken as the upper prediction limit becomes (Davis and McNichols, 1999):

1 - \alpha = \frac{j\,(j+1) \cdots (j+m-1)}{(n+1)\,(n+2) \cdots (n+m)} \qquad [18.11]

Table 18-1 in Appendix D lists these confidence levels for various choices of j, n, and m. The false positive rate (α) associated with a given prediction limit can be computed as one minus the confidence level. As this table illustrates, the penalty for not knowing the form of the underlying distribution can be severe. If a non-parametric prediction limit is to be used, more background observations are needed compared to the parametric setting in order to construct a prediction interval with sufficiently high confidence. As an example, to predict m = 2 future samples with 95% confidence, at least 38 background samples are needed. Parametric prediction intervals do not require as many background measurements precisely because the form of the underlying distribution is assumed to be known.
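The confidence level in equation [18.11] is straightforward to evaluate directly, as a check on tabulated values. The sketch below uses exact rational arithmetic from the standard library; the function name is illustrative.

```python
from fractions import Fraction

def np_pl_confidence(n, j, m):
    """Confidence (1 - alpha) that all m future values fall at or below
    the j-th order statistic of an n-point background sample [18.11]:
    (1 - alpha) = [j (j+1) ... (j+m-1)] / [(n+1)(n+2) ... (n+m)]."""
    conf = Fraction(1)
    for i in range(m):
        conf *= Fraction(j + i, n + 1 + i)
    return conf

# With the background maximum as the limit (j = n), the product
# telescopes to n / (n + m); e.g., n = 18, m = 4 gives 18/22.
conf_18 = np_pl_confidence(n=18, j=18, m=4)
```

As a check on the claim in the text, predicting m = 2 future samples off the background maximum first reaches 95% confidence at n = 38, since 38/40 = 0.95.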

It is possible to create an approximate non-parametric limit with background data containing all non-detects, by using the RL (often a quantitation limit) as the PL. A quantified value above the PL would constitute an exceedance. A superior procedure is recommended in this guidance, using the Double Quantification Rule described in Chapter 6.

PROCEDURE

Step 1. Sort the background data into ascending order and set the prediction limit equal to the maximum, the second-largest observed value, or another large background order statistic. Then use Table 18-1 of Appendix D to determine the confidence level (1–α) associated with predicting the next m future samples.

Step 2. Compare each of the m compliance point measurements to the upper prediction limit [PL]. Identify significant evidence of possible contamination at the compliance well if one or more measurements exceed the PL.


Step 3. Because the risk of false positive decision errors is greatly increased if the confidence level drops substantially below a target rate of at least 90% to 95%, the actual confidence level (as identified by equation [18.11]) needs to be routinely reported and noted whenever it is below the target level.

Note that equation [18.11] assumes the prediction limit is applied to only one constituent at a single well. If multiple tests must be conducted and a retesting procedure is employed, the confidence level of the prediction limit must be determined using the tables described in Chapter 19.

►EXAMPLE 18-3

Use the following trichloroethylene data to compute a non-parametric upper prediction limit for the next m = 4 monthly measurements from a downgradient well and determine the level of confidence associated with the prediction limit.

Trichloroethylene Concentrations (ppb)

             Background Wells           Compliance
Month    BW-1     BW-2     BW-3         CW-4

  1       <5        7       <5
  2       <5       6.5      <5
  3        8       <5      10.5          7.5
  4       <5        6       <5           <5
  5        9       12       <5            8
  6       10       <5        9           14

SOLUTION

Step 1. Determine the background maximum and use this value to estimate the non-parametric prediction limit. In this case, the maximum value of the n = 18 pooled background observations is 12 ppb. Set PL = 12 ppb.

Step 2. Compare each of the downgradient measurements against the prediction limit. Since the value of 14 ppb for Month 6 exceeds PL, conclude that there is statistically significant evidence of an increase over background at CW-4.

Step 3. Compute the confidence level and false positive rate associated with the prediction limit. Since four future samples are being predicted and n = 18, the confidence level equals n/(n + m) = 18/22 = 82%. Consequently, the Type I error or false positive rate is at most (1 − 0.82) = 18% and the test is significant at the α = 0.18 level. This means there is nearly a one in five chance that the test has been falsely triggered. Only additional background data and/or use of a retesting strategy would lower the false positive rate. ◄
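The steps of Example 18-3 can be reproduced in a few lines of Python. Non-detects are entered at their reporting limit of 5 ppb (an illustrative convention that does not affect the background maximum here).

```python
from fractions import Fraction

# Pooled background (n = 18): three wells over six months; "<5" entered as 5.
background = [5, 5, 8, 5, 9, 10,       # BW-1
              7, 6.5, 5, 6, 12, 5,     # BW-2
              5, 5, 10.5, 5, 5, 9]     # BW-3
future = [7.5, 5, 8, 14]               # CW-4, months 3-6 (m = 4)

pl = max(background)                   # non-parametric upper prediction limit
exceedances = [x for x in future if x > pl]
confidence = Fraction(len(background), len(background) + len(future))
```

Here pl is 12 ppb, the 14 ppb result is flagged as an exceedance, and the confidence level is 18/22, or about 82%, as in the example.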


18.3.2 PREDICTION LIMIT FOR A FUTURE MEDIAN

BACKGROUND AND PURPOSE

A prediction limit for a future median is a non-parametric alternative to a parametric prediction limit for a future mean (Section 18.2.2) when the sample cannot be normalized. In groundwater monitoring, the most practical application for this kind of limit is for medians of order 3 (i.e., the median of three consecutive measurement values), although the same procedure could theoretically be employed for medians of any odd order (e.g., 5, 7, etc.). The comparison rule in this case is that the test passes only if the median of a set of 3 compliance point measurements does not exceed the upper prediction limit. Note that this is also the same as a 2-of-3 test, whereby the well is deemed in compliance if at least 2 of 3 consecutive observations fall within the prediction interval. Therefore, only 2 independent observations will generally be needed to complete the test at uncontaminated wells. The third measurement will be irrelevant if the first two pass and so will not need to be collected.

Given n background measurements and a desired confidence level (1−α), a non-parametric prediction limit for a future median involves a confidence probability that the median of the next p future observations will not exceed the limit. As noted in Section 18.3.1, order statistics of any random sample follow certain probability laws. In particular, the statistical coverage (C) of a prediction limit estimated by the jth order statistic (that is, the jth largest value) in background will follow a beta distribution with parameters j and (n+1–j). Following the notation of Davis and McNichols (1987), the conditional probability that the median of 3 independent future values will not exceed the non-parametric prediction limit can be shown to equal

\Pr\{\text{Future median inbounds} \mid X_{j:n}\} = 3C^2 - 2C^3 \qquad [18.12]

where Xj:n denotes that the prediction limit equals the jth largest order statistic in a sample of n observations, and a conditional probability denotes the chance that an event will occur given the observance of another event (in this case, after having observed Xj:n). The (unconditional) confidence probability (1–α) can then be derived by taking the expected value of equation [18.12] with respect to C. Using standard properties of the beta distribution, this probability becomes:

1 - \alpha = \frac{j\,(j+1)\,(3n - 2j + 5)}{(n+1)(n+2)(n+3)} \qquad [18.13]

Thus the confidence level associated with a prediction limit for a future median of order 3 depends simply on the sample size of background (n) and the order statistic selected as the upper prediction limit (j). Table 18-2 in Appendix D provides values of the confidence level for various n and choices of the order statistic. Like the non-parametric prediction limit for m future values, ease of construction comes with a price. More background measurements are required to achieve the same levels of confidence attainable via a parametric prediction limit for a future mean. For instance, to achieve 99% confidence in predicting a median of order 3 in a single test, at least 22 background observations are needed if the maximum is selected as the upper prediction limit, and at least 40 background observations are needed if the prediction limit is set to the second largest measurement. Parametric prediction intervals do not require as many background samples precisely because the form of the underlying distribution is assumed to be known.
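Equation [18.13] can be evaluated directly to verify the sample-size claims above. The short sketch below is a plain translation of the formula; the function name is illustrative.

```python
def median3_confidence(n, j):
    """Confidence level (1 - alpha) for a non-parametric prediction limit
    on a future median of order 3, with the j-th order statistic of an
    n-point background sample used as the limit [18.13]."""
    return j * (j + 1) * (3 * n - 2 * j + 5) / ((n + 1) * (n + 2) * (n + 3))

# Checks against the text: 99% confidence first requires n = 22 with the
# background maximum (j = n), or n = 40 with the second-largest (j = n - 1).
```

For instance, median3_confidence(22, 22) returns exactly 0.99, and the Example 18-4 case of n = 24 with the maximum gives about 99.1%.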

REQUIREMENTS AND ASSUMPTIONS Once an order statistic (of rank j) is selected as the upper prediction limit, the confidence level is fixed by the number of background samples (n). The confidence level can only be increased by enlarging background. However, equation [18.13] is only applicable for the case of predicting a future median of a single constituent at a single well. To account for multiple tests and to incorporate a retesting strategy (both of which are usually needed), the specific strategies and tables of confidence levels presented in Chapter 19 should be consulted.

PROCEDURE Step 1. Sort the background data into ascending order and set the upper prediction limit [PL] equal to one of the following: the background maximum, the second largest value, or another large order statistic in background. If the largest background measurement is a non-detect, set an approximate upper prediction limit as the RL most appropriate to the data (usually the lowest achievable quantitation limit [QL]).

Step 2. Compute the median of the next three consecutive compliance point measurements. Compare this statistic to the upper prediction limit. Identify significant evidence of possible contamination at the compliance well if the median exceeds PL. If PL equals the RL, identify an exceedance if the median is quantified above the reporting limit.

Step 3. Based on the background sample size (n), use Table 18-2 of Appendix D to determine the confidence level (1–α) associated with predicting the median of the next p = 3 future measurements. Because the risk of false positive errors is greatly increased if the confidence level drops much below a targeted rate of at least 90% to 95%, the actual confidence level (as identified in equation [18.13]) should be routinely reported and noted whenever it is below the target level.

Note that equation [18.13] assumes the prediction limit is applied to only one constituent at a single well. If multiple tests are conducted and a retesting procedure is employed, the confidence level of the prediction limit needs to be determined using the tables described in Chapter 19.

►EXAMPLE 18-4

Use the following xylene background data to establish a non-parametric upper prediction limit for a future median of order 3. Then determine if the compliance well shows evidence of excessive xylene contamination.


Xylene Concentrations (ppb)

               Background                Compliance
Month    Well 1   Well 2   Well 3        Well 4

  1       <5       9.2      <5
  2       <5       <5       5.4
  3       7.5      <5       6.7
  4       <5       6.1      <5
  5       <5       8.0      <5
  6       <5       5.9      <5            <5
  7       6.4      <5       <5            7.8
  8       6.0      <5       <5           10.4

SOLUTION

Step 1. The maximum value in the set of pooled background measurements is 9.2. Assign this value as the non-parametric upper prediction limit, PL = 9.2.

Step 2. Compute the median of the three compliance measurements. This statistic equals 7.8 ppb. Since the median does not exceed PL, there is insufficient evidence of xylene contamination at Well 4, despite the fact that the maximum at Well 4 is larger than the maximum observed in background.

Step 3. Compute the confidence level and false positive rate associated with this prediction limit. Given that n = 24 and the order statistic selected is the maximum (i.e., j = n), use Table 18-2 in Appendix D to determine that the confidence level for predicting a future median of order 3 equals 99.1% and therefore the Type I error or false positive rate is at most 0.9%. ◄


CHAPTER 19. PREDICTION LIMIT STRATEGIES WITH RETESTING

19.1 RETESTING STRATEGIES ............................................... 19-1
19.2 COMPUTING SITE-WIDE FALSE POSITIVE RATES [SWFPR] ................... 19-4
     19.2.1 Basic Subdivision Principle ................................. 19-7
19.3 PARAMETRIC PREDICTION LIMITS WITH RETESTING ........................ 19-11
     19.3.1 Testing Individual Future Values ............................ 19-15
     19.3.2 Testing Future Means ........................................ 19-20
19.4 NON-PARAMETRIC PREDICTION LIMITS WITH RETESTING .................... 19-26
     19.4.1 Testing Individual Future Values ............................ 19-30
     19.4.2 Testing Future Medians ...................................... 19-31

This chapter is a core part of the recommended statistical approach to detection monitoring. Even the smallest of facilities will perform enough statistical tests on an annual basis to justify use of a retesting strategy. Such strategies are described in detail in this chapter in conjunction with prediction limits. First, the Unified Guidance considers the concept and computation of site-wide false positive rates [SWFPR]. Then different retesting strategies useful for groundwater monitoring are presented, including:

 Parametric prediction limits with retesting (Section 19.3), and
 Non-parametric prediction limits with retesting (Section 19.4)

19.1 RETESTING STRATEGIES

Retesting is a statistical strategy designed to efficiently solve the problem of multiple comparisons (i.e., multiple, simultaneous statistical tests). An introduction to multiple comparisons is presented in Chapter 6. At first glance, formal retesting seems little different from a repackaged form of verification resampling, a practical technique used for years to double-check or verify the results of initial groundwater sampling. Indeed, all retesting schemes are predicated on the idea that when the initial groundwater results indicate the presence of potentially contaminated groundwater, one or more additional groundwater samples should be collected and tested to determine whether or not the first results were accurate.

The difference between formal retesting schemes and verification resampling found in the regulations is that the former explicitly incorporates the resample(s) into the calculation of the statistical properties of the overall test. A statistical “test” then needs to be redefined to include not only the statistical manipulation of the initial groundwater sampling results, but also that for any further resamples. Both the initial samples and the resamples are integral components of any retesting method.

The principal advantage of retesting is that very large monitoring networks can be statistically tested without necessarily sacrificing either an acceptable false positive rate or adequately high effective power. Data requirements for a typical retesting scheme are often less onerous than those required for an analysis of variance (ANOVA). Instead of having to sample each well perhaps four times during any given evaluation period, many of the retesting strategies discussed below involve a minimum of one new sample at each compliance well. Resamples are collected only at wells where the initial results exceed a limit, and no explicit post-hoc testing of individual wells is necessary as with ANOVA in order to identify a contaminated well.

Since a statistical test utilizing retesting is not complete until all necessary resamples have been evaluated, it is important to outline the formal decision rules or scheme associated with each retesting strategy. Retesting schemes presented in the Unified Guidance fall into two types: 1-of-m and the modified California approach. The 1-of-m approach was initially suggested by Davis and McNichols (1987) as part of a broader method termed “p-of-m.” The 1-of-m scheme assumes that as many as m samples might be collected for a particular constituent at a given well, including the initial groundwater sample and up to (m–1) resamples.

1-of-m schemes are particularly attractive as retesting strategies. If the initial groundwater observation is inbounds, the test is complete and no resamples need to be collected. Only when the first value exceeds the background prediction limit does additional sampling come into play. For practical reasons, only 1-of-m schemes with m no greater than 4 are considered in the Unified Guidance. A 1-of-4 retesting plan implies that up to 4 groundwater measurements may have to be collected at each compliance well, including the initial observation and 3 possible resamples. For the test to be valid, all of these sample measurements need to be statistically independent. This generally requires that sufficient time elapse between resample collection events so that the assumption of statistical independence or lack of autocorrelation is reasonable (see the discussion in Chapter 14). Because many groundwater evaluations are conducted on a semi-annual basis, three will generally be a practical upper bound on the number of independent resamples that might be collected. Thus the 1-of-2, 1-of-3, and 1-of-4 retesting schemes are included below.

The second type of retesting scheme is known as the modified California approach. The decision rules for this test are slightly different from the 1-of-m schemes, although the test passes as before if the initial groundwater measurement is inbounds. If it exceeds the background limit, two of the three resamples need to be inbounds for the test to pass. The modified California strategy thus requires a majority of the resamples to be inbounds for a compliance well test to be deemed inbounds. A 1-of-4 scheme could have both the initial value and the first two resamples be out-of-bounds, yet pass the test with an inbounds result from the third resample. Although the modified California test appears to be more stringent, the prediction limit for a 1-of-4 test under the same input conditions will be lower and hence more likely to trigger single comparison exceedances. With the prediction limits correctly defined, both will have identical false positive errors for any specific monitoring design. The guidance also provides the same four non-parametric versions of the 1-of-m and modified California tests for future values.
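The decision rules for the 1-of-m and modified California schemes can be made concrete with a short sketch. The function names are illustrative; each returns True when the well passes the test against a given background limit, and "resamples" holds only the resamples actually collected (up to m−1 for a 1-of-m plan, three for modified California).

```python
def passes_1_of_m(initial, resamples, limit):
    """1-of-m rule: pass if the initial sample, or any available
    resample, is at or below the background limit."""
    if initial <= limit:
        return True                       # no resamples needed
    return any(r <= limit for r in resamples)

def passes_modified_california(initial, resamples, limit):
    """Modified California rule: pass on an inbounds initial sample,
    or if at least 2 of the 3 resamples are inbounds."""
    if initial <= limit:
        return True
    return sum(r <= limit for r in resamples) >= 2

# A 1-of-4 test can pass on the third resample even when the first three
# results all exceed the limit; modified California cannot.
```

This makes the contrast in the text explicit: with an initial value and first two resamples out-of-bounds but a third resample inbounds, the 1-of-4 rule passes while the modified California rule fails.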

A useful variation on the 1-of-m retesting scheme for individual measurements is the 1-of-m strategy for means or medians. Instead of testing a series of individual values, a series of means or medians of order p is tested. The order of the mean or median refers to the number of individual measurements used to compute the statistic. For example, 1-of-2 retesting with means of order 2 requires that a pair of initial observations be averaged and the resulting mean compared against the background limit. If that initial mean is out-of-bounds, a second pair of observations (i.e., two resamples) would be collected and averaged to form the resample mean. The test would fail only if both the initial mean and the resample mean exceeded the background limit.

Retesting schemes for means or medians have steeper data requirements than retesting strategies for individual measurements and may not be practical at many sites. Nevertheless, the statistical properties (e.g., power and false positive rate) associated with the testing of means and medians are superior to comparable plans on individual observations. The Unified Guidance provides five mean retesting plans: 1-of-1, 1-of-2, and 1-of-3 for means of order 2; and 1-of-1 and 1-of-2 for means of order 3. The guidance also provides 1-of-1 and 1-of-2 tests of medians of size 3 as non-parametric options.

These plans were chosen to limit the maximum possible number of distinct and independent sampling measurements per compliance well during a single evaluation period to six. In fact, the data requirements vary substantially by scheme. With means of order 2, the 1-of-1 plan requires a maximum of two new sample measurements; the 1-of-2 plan requires as many as four; while only the 1-of-3 plan might need a total of six. For means of order 3, the 1-of-1 plan requires three new measurements to form the single mean; the 1-of-2 plan might require up to six. But for higher order 1-of-m mean or median tests, only the initial samples may be needed to identify a 'passing' test outcome under most background conditions.

The three 1-of-1 mean and median plans provided in the guidance are technically not retesting schemes. The decision rule for these plans merely requires a comparison of a single mean or median against the background limit. If the initial mean or median comparison is inbounds, the test passes. If not, the test fails. The fact that each average is computed from multiple individual measurements implies that an implicit retest or verification resampling is built into these strategies. The statistical properties of the 1-of-1 plans can often be better than comparable 1-of-m schemes for individual values, with fairly similar sampling requirements.

The Unified Guidance provides 1-of-1 and 1-of-2 non-parametric prediction limit tests for future medians of order 3. By ‘median of order 3’ is meant the median or ‘middle value’ of the measurements from a set of three consecutive sampling events. In the 1-of-2 case, the test passes if either the initial median is inbounds or, if not, the resample median is inbounds. The 1-of-1 scheme does not involve any resampling, but does require at least two distinct sampling measurements to determine whether the initial median is inbounds.1

As discussed in Chapter 6, proper design of a groundwater detection monitoring program will generally require an initial choice of a retesting scheme before future or compliance sampling data have been collected. As a practical matter, sample collection should be spaced far enough apart in time to ensure that any potentially needed resamples are statistically independent. Thus, the maximum number of resamples needs to be known in advance in order to structure a feasible sampling plan for a particular retesting strategy. Each retesting scheme also involves a different set of decision rules for evaluating the status of any given compliance well, and those rules determine how the background limit will be computed. Given the same background sample and group of compliance wells, different retesting schemes will therefore lead to different background limits.

1 As noted in Chapter 18, the 1-of-1 retesting scheme for medians of order 3 is equivalent as a decision rule to a 2-of-3 scheme for individual measurements.


If parametric prediction limits are used, the general formula for the limit introduced in Chapter 18 is x̄ + κs. The κ-multiplier, and thus the prediction limit, will vary depending on which 1-of-m or modified California plan is chosen. The κ-multipliers also depend on the monitoring evaluation schedule in place at the facility. In typical applications, the background sample used in statistical evaluations in any given year will either be static or will substantially overlap from one evaluation to the next, even as newer background data are added to the existing pool. Since at least a subset of the background measurements will be commonly employed in all the evaluations, there will be a statistical dependence between distinct evaluations (see Section 19.2 below). The number of evaluations per year against a common background will affect the correct identification of prediction limits. Consequently, the evaluation schedule (i.e., annual, semi-annual, quarterly) also needs to be known or specified in advance.2

19.2 COMPUTING SITE-WIDE FALSE POSITIVE RATES [SWFPR]

As discussed in Chapter 6, the fundamental purpose of detection monitoring is to accurately identify a significant change in groundwater relative to background conditions. To meet this objective, statistical monitoring programs should be designed with the twin goals of ensuring adequate statistical power to flag well-constituent pairs elevated above background levels and limiting the risk of falsely flagging uncontaminated wells across an entire facility. The latter is accomplished by addressing the site-wide false positive rate [SWFPR]. Both goals contribute to accurate evaluation of groundwater and to the validity of statistical groundwater monitoring programs.

Retesting significantly aids this process of meeting both criteria. However, it can be much easier to design and implement an appropriate retesting scheme if one understands how the SWFPR is derived. The SWFPR is based on the assumptions that no contamination is actually present at on-site monitoring wells, and that each well-constituent pair in the network behaves independently of other constituents and wells from a statistical standpoint. If Q denotes the probability that a particular well-constituent pair will be falsely declared an exceedance (a false positive error), the probability of at least one such false positive error among r independent tests is given by:

α = SWFPR = 1 − (1 − Q)^r [19.1]

Here (1 − Q) equals the chance that the test will correctly identify the well-constituent pair as ‘inbounds.’ The value of Q itself will depend on the type of retesting scheme being used.
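Equation [19.1] is straightforward to compute. The short sketch below is only an illustration (the function name is ours, not the guidance's); it shows how quickly small per-test error rates accumulate across many independent tests:

```python
def swfpr(q: float, r: int) -> float:
    """Site-wide false positive rate for r independent tests,
    each with per-test false positive probability q (equation 19.1)."""
    return 1.0 - (1.0 - q) ** r

# Even a 0.02% per-test rate yields a noticeable site-wide rate
# across 500 independent well-constituent tests (~9.5%).
rate = swfpr(0.0002, 500)
```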

2 The Unified Guidance distinguishes between the statistical evaluation (or testing) schedule and the sampling schedule. Regularly scheduled sampling events might occur quarterly, even though a statistical evaluation of the data only occurs semi-annually or annually. Further, resamples do not constitute regular sampling events, since they are only collected at wells with initial exceedances, but they are associated with the data for a particular evaluation. By separately identifying the evaluation schedule, there is 1) less confusion about the role of resamples in the testing process, and 2) an opportunity to design monitoring programs so as to allow multiple individual observations to be collected prior to each evaluation.


Consider a 1-of-3 retesting plan for future observations. A false positive at a given well-constituent pair will be registered only if all three observations — the initial groundwater measurement and two resamples — exceed the background prediction or control limit. If ω represents the probability that one of these observations exceeds the background limit, Q can be calculated as ω × ω × ω (since the initial measurement and resamples are statistically independent) and the SWFPR as:

α = SWFPR = 1 − (1 − ω³)^r [19.2]

By setting the target site-wide α equal to 0.10 and solving for ω, one could potentially compute the individual comparison false positive rate (αcomp = ω) associated with any single comparison against the background limit. This would identify the individual per-comparison confidence level (1 − αcomp) necessary to compute the background limit in the first place.3 If the background limit is computed as a prediction limit for the next single future measurement (i.e., m = 1 in a 1-of-m scheme), then ω equals the probability that a single new observation (independent of background) exceeds the prediction limit, and (1–ω) equals the confidence level of that prediction limit. Further, since ω can be obtained from equation [19.2] as:

ω = [1 − (1 − α)^(1/r)]^(1/3) [19.3]

the upper prediction limit for a site involving 500 tests (for instance, 50 wells and 10 constituents per well) and 20 background samples could be computed using an individual, per-comparison confidence level of

1 − ω = 1 − [1 − (1 − .10)^(1/500)]^(1/3) = 1 − .0595 = 94.0%

leading to a final prediction limit of

PL = x̄ + t.94,19 s √(1 + 1/20)

where x̄ and s are the background sample mean and standard deviation.
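The 500-test example above can be reproduced with a few lines of code. In the sketch below, the background mean (10.0), standard deviation (2.0), and t-quantile (approximately 1.63 for t.94,19) are hypothetical stand-ins introduced for illustration; an exact quantile would come from a t-table or a routine such as scipy.stats.t.ppf:

```python
import math

def per_comparison_confidence(alpha: float, r: int, m: int) -> float:
    """Per-comparison confidence level (1 - omega) for a 1-of-m plan
    across r tests, per equations [19.2]-[19.3]; ignores the
    background-dependence caveat discussed in the text."""
    omega = (1.0 - (1.0 - alpha) ** (1.0 / r)) ** (1.0 / m)
    return 1.0 - omega

def prediction_limit(xbar: float, s: float, n: int, t: float) -> float:
    """Upper prediction limit for the next single measurement:
    PL = xbar + t * s * sqrt(1 + 1/n)."""
    return xbar + t * s * math.sqrt(1.0 + 1.0 / n)

conf = per_comparison_confidence(alpha=0.10, r=500, m=3)   # ~0.940
# Hypothetical background of n = 20 samples with mean 10.0 and SD 2.0;
# t = 1.63 is an approximate t.94,19 value, not an exact quantile.
pl = prediction_limit(xbar=10.0, s=2.0, n=20, t=1.63)
```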

Unfortunately, certain statistical dependencies render the foregoing calculations somewhat inaccurate. Whether or not a resample exceeds the background limit for any constituent depends partly on whether the initial observation for that test also eclipsed the limit. This is because the same background data are used in the comparison of both the initial groundwater measurement and the resamples. This creates a statistical dependence between the comparisons, even when the compliance point observations themselves are statistically independent. If the background data sample mean happens to be low relative to the true population mean, the background limit will tend to be low. Each of the compliance point observations (whether the first measurement or subsequent resamples) will have a

3 Note that αcomp does not represent the false positive rate for the complete 1-of-3 test, but is being treated for the sake of argument as one of a series of 3 individual and independent tests.

greater than expected chance of exceeding it. Likewise, if the background sample mean is substantially higher than the population mean, the background limit will tend to be high, resulting in a lower-than-expected chance of exceedance for each of the compliance measurements.

A similar dependence occurs for each well-constituent pair tested against a single background across evaluation periods (see discussions in Chapter 5 and Section 19.1). A further dependence occurs when well-constituent pairs from many compliance wells are compared to a common interwell background. The tests during each statistical evaluation again share a common (or nearly common) background, thus impacting the individual test false positive rate (αtest) and the SWFPR (α) in turn. Three common evaluation strategies are considered in the Unified Guidance: quarterly, semi-annual, and annual. The SWFPR is computed on a cumulative, annual basis, with the assumption that background and the associated background limit will not be updated or recomputed (especially for intrawell tests) more often than every one to two years.4

These dependencies between successive comparisons and tests against the background limit during retesting mean that the derivation above will generally not result in a background limit with the targeted annual SWFPR of 10%. The actual false positive rate (α) will be somewhat higher, and can be substantially higher if the background sample size (n) is small to moderate (say, less than 50 samples). In part, this is because the correlation between successive comparisons against a common background limit is on the order of 1/(1+n). That is, the smaller the background size, the greater the correlation between the resamples and test comparisons. The impact on the SWFPR is also greater if this dependence is ignored.

Fortunately, as Gibbons (1994) has noted, the solution suggested in the previous example will be approximately valid for large background data sets (say n > 50), since then the correlation between successive resamples and/or tests is minimal. In fact, an approximate solution for the modified California and more general 1-of-m retesting schemes can also be derived. In the case of 1-of-m schemes, the probability Q of a false positive (for m = 1 to 4) is ω^m, leading to a SWFPR of:

α = SWFPR = 1 − (1 − ω^m)^r [19.4]

Solving for ω in equation [19.4] leads to an approximate individual comparison false positive rate (αcomp = ω) of:

ω = [1 − (1 − α)^(1/r)]^(1/m) [19.5]

For the modified California plan, a false positive for a given well-constituent pair during a single evaluation will be registered only if both the initial measurement and at least two of three resamples are

4 Even with these assumptions, not all the statistical dependence will be accounted for at every site or for all constituents. Even when background is updated with new measurements, some of the already existing background values are likely to be used in re-computing the background limit. Some well-constituent pairs may be correlated, contradicting the assumption of independence between tests at the same well or for the same constituent at different wells. The Unified Guidance also does not presume to compute the SWFPR for other multi-year periods or for the life of the facility.

out-of-bounds (i.e., exceed the background limit). Consequently, the probability Q of a false positive for that pair may be expressed as:

Q = ω 3ω 2 1− ω + ω 3  = ω 3 4 − 3ω [19.6]  ( )  ( )

As before, ω represents the probability of any single observation exceeding the background limit. Both the initial and any resample comparisons against the limit are assumed to be statistically independent. Given Q, the approximate overall false positive rate then becomes:

r α = SWFPR = 1− 1− ω 3 4 − 3ω  [19.7]  ( )

Since ω will always be small in practice, one can usually ignore the term ω⁴ when expanding the right-hand side of equation [19.7]. Then the approximate SWFPR becomes:

α ≈ 1 − [1 − 4ω³]^r [19.8]

Leading to a solution for ω:

ω ≈ {[1 − (1 − α)^(1/r)] / 4}^(1/3) [19.9]

which can again be used to construct a background limit for a single new observation.

As an example, if the target SWFPR is 10% and one must test r = 200 comparisons using the modified California plan, ω would become:

ω ≈ {[1 − .90^(1/200)] / 4}^(1/3) = .0508 = 5.1%

If the background limit is a prediction limit for the next future value, a confidence level of approximately 94.9% would be needed to achieve the desired overall false positive rate of 10%. This assumes that the background sample size is sufficiently large (say n > 50) to make the correlation between retests negligible. In similar fashion, the single comparison error rates for the 1-of-2 through 1-of-4 tests of future observations in this example would respectively be: ω = .0229, .0808, and .1515.
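Equations [19.5] and [19.9] are easy to check numerically. The sketch below (function names are illustrative) reproduces the per-comparison rates quoted above for r = 200 tests and a 10% SWFPR; like the text, it assumes independent comparisons and a large background sample:

```python
def omega_1_of_m(alpha: float, r: int, m: int) -> float:
    # Equation [19.5]: per-comparison rate for a 1-of-m plan.
    return (1.0 - (1.0 - alpha) ** (1.0 / r)) ** (1.0 / m)

def omega_mod_cal(alpha: float, r: int) -> float:
    # Equation [19.9]: modified California plan, ignoring the omega**4 term.
    return ((1.0 - (1.0 - alpha) ** (1.0 / r)) / 4.0) ** (1.0 / 3.0)

# r = 200 tests, target SWFPR = 10% (the example above):
rates = [omega_1_of_m(0.10, 200, m) for m in (2, 3, 4)]  # ~.0229, .0808, .1515
mod_cal = omega_mod_cal(0.10, 200)                        # ~.0508
```

Substituting mod_cal back into the approximation [19.8] recovers the 10% target, which is a useful internal consistency check.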

19.2.1 BASIC SUBDIVISION PRINCIPLE

The previous section highlighted certain dependencies in statistical tests due to comparisons of one or more samples or sample sets against a common background. In the sitewide design of a facility detection monitoring system, the overall target design SWFPR is proportionately divided among all relevant tests conducted in an annual period. Depending on the type of testing (e.g., interwell versus

intrawell, or parametric versus non-parametric), the target error rates for a portion of the total set of potential tests may need to be calculated.

Identifying false positive target rates is important when considering non-parametric prediction limit tests. The cumulative target error rate for a group of annual tests against a single constituent is needed for comparison with the achievable levels in Tables 19-19 through 19-24 in Appendix D; the latter achievable rates take the dependencies previously discussed into account. The κ-multiplier Tables 19-1 to 19-18 in Appendix D for parametric prediction limit tests already incorporate target false positive rate calculations, so such calculations are generally not needed to identify the appropriate multipliers. The various dependencies against a common background are accounted for in the κ-multiplier tables to meet the nominal target rates. R-script software for certain parametric prediction limit tests, discussed in a following section and in Appendix C, also uses a target per-test false positive error rate as input.

In assigning target rates, the Unified Guidance uses a basic subdivision principle which makes certain assumptions. First and foremost, it is assumed that the total suite of tests can be subdivided into mutually exclusive, independent5 tests. Each relevant annual statistical test is assigned the same single test error rate (αtest). Using the properties of the Binomial distribution, the target single test error rate can be obtained using equation [19.10] for r total annual tests. The total number of annual tests r is the product of the number of compliance wells (w), the number of valid constituents (c), and the number of evaluations per year (nE) or r = w × c × nE, with α = SWFPR:

αtest = 1 − (1 − α)^(1/r) [19.10]

Then a cumulative false positive rate can be assessed for any appropriate subset of tests. This principle would apply, for instance, if there is more than one regulated unit at a site and each regulated unit can be treated independently. A consistent portion of the overall targeted false positive rate α would be assigned to each regulated unit (αunit), using a rearrangement of equation [19.10]. If a facility with three units B, C, and D had 120 total annual tests (b + c + d = 120 = r), the cumulative target error rate for Unit B would be: αUnitB = 1 − (1 − αtest)^b, and similarly for Units C and D. These three cumulative error rates will approximately (but not exactly) sum to a total sitewide value close to the SWFPR. Considered jointly as independent tests, however, the annual SWFPR is in fact exactly 10%. The Bonferroni approximation makes use of the approximate linearity of such error rates for SWFPR calculations (discussed below).
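The unit-subdivision arithmetic can be illustrated numerically. In the sketch below, the 40/50/30 split of the 120 annual tests among Units B, C, and D is hypothetical (the text specifies only the total); the point is that the per-unit rates sum only approximately to the SWFPR, while the joint rate recovers it exactly:

```python
def alpha_test(alpha: float, r: int) -> float:
    """Single-test error rate from equation [19.10]."""
    return 1.0 - (1.0 - alpha) ** (1.0 / r)

def alpha_subset(a_test: float, k: int) -> float:
    """Cumulative rate for a subset of k tests sharing rate a_test."""
    return 1.0 - (1.0 - a_test) ** k

a_t = alpha_test(0.10, 120)
units = {"B": 40, "C": 50, "D": 30}          # hypothetical split of 120 tests
rates = {u: alpha_subset(a_t, k) for u, k in units.items()}

# The unit rates sum only approximately to 0.10 (~.1035 here), but the
# joint rate 1 - prod(1 - rate) recovers the SWFPR exactly.
approx_sum = sum(rates.values())
joint = 1.0 - (1.0 - rates["B"]) * (1.0 - rates["C"]) * (1.0 - rates["D"])
```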

The ways in which the overall SWFPR might be partitioned will vary with each site, considering units, types of tests, number of wells, constituents and evaluations per year. If unit-specific cumulative false positive rates were established, the group of tests associated with each monitoring constituent within each unit might be separately considered. Each group might potentially be further subdivided into intrawell versus interwell tests, or prediction limits versus control charts, etc., assuming a mixture of statistical methods is employed. By using the subdivision principle in a consistent way, the targeted SWFPR can be accurately maintained.

5 The Unified Guidance does not presume that every statistical test is in fact independent. Tests or groups are treated as if independent, however, to allow the computation of nominal target false positive rates and/or to be consistent with regulatory constraints (e.g., all constituents must be tested separately).


One important use when calculating SWFPR rates is to account for multiple constituents. In particular, non-parametric test theory is applied to only a single constituent at a time. Since each constituent has its own set of background data, and presuming the constituents behave independently of one another, the dependence caused by using a common background pertains only to those comparisons made against the background for that constituent. To clarify this concept, suppose a total set of r tests consists of c separate chemicals each monitored at w wells annually (i.e., r = c × w × nE and nE = 1). For each constituent, the dependence caused by a common background only applies to the w comparisons (one for each well) made for that monitoring parameter. This means that the overall target α = SWFPR needs to be apportioned into a fraction for each constituent, called the per-constituent false positive rate or αc. This can be done using the Binomial formula based on the single test error rate for w wells as: αc = 1 − (1 − αtest)^(w·nE), or by partitioning the overall α to each constituent c:

αc = 1 − (1 − α)^(1/c)

The two calculations are equivalent under these conditions, with the latter equation somewhat easier to use.

A similar situation occurs at sites requiring a combination of interwell and intrawell tests. Computation of the SWFPR can be appropriately handled using the basic subdivision principle. For interwell tests, measurements collected at each compliance well are compared against a common interwell background, creating a degree of statistical dependence not only between successive individual test comparisons (i.e., initial sample and any resamples) at a given well, but also between tests at different compliance wells. With intrawell tests, each well supplies its own background. This implies that the component of between-well test dependence is eliminated, changing the way κ-multipliers for intrawell background limits with retesting are computed.

For a given set of r well-constituent pairs, with l tests to be conducted on an interwell basis and the remaining (r − l) tests conducted as intrawell, two cumulative false positive rates need to be computed. The single test false positive error rate (αtest) approach can be used: αinter = 1 − (1 − αtest)^l for the subset of l interwell tests, and αintra = 1 − (1 − αtest)^(r−l) for the subset of r − l intrawell tests, in order to correctly maintain the SWFPR equal to α. A somewhat more direct approach can also be used: αinter = 1 − (1 − α)^(l/r) for the interwell tests and αintra = 1 − (1 − α)^((r−l)/r) for the intrawell tests. The two sets of equations are consistent.

In general, the subdivision principle works as follows. If a group of r tests with targeted false positive rate, α, is divided into s distinct and mutually exclusive independent subsets, the false positive rate for each subset (αsub) can be computed as:

αsub = 1 − (1 − α)^(1/s) [19.11]


The basic subdivision principle does not guarantee that the resulting detection monitoring program will have sufficient effective power to match the EPA reference power curve (ERPC). The foregoing calculations merely point to the correct overall false positive rate.

As discussed in Section 6.2.2 of Chapter 6, a simpler approach is to partition the overall SWFPR among a facility's annual number of tests using the Bonferroni approximation. With the low false positive rates characteristic of detection monitoring design, the total SWFPR can simply be divided by the number of annual tests for any of the various combinations of constituents, separate units, or interwell versus intrawell tests. The Bonferroni approach results in slightly different false positive values than directly using the Binomial formula, as described above.

As an overall example, assume a facility with w = 20 wells monitored twice per year (nE = 2) for c = 8 constituents. Further, assume that 5 of the constituents can be monitored interwell and 3 need to be handled as intrawell comparisons. Non-parametric prediction limits will be considered for all tests. Calculate the target cumulative false positive error rates for interwell and intrawell comparisons, with the SWFPR = α = .1.

This site has a total of r = w×c×nE = 20×8×2 = 320 tests per year. For the five interwell constituents, there are 20×2×5 = 200 tests, with 20×2×3 = 120 intrawell tests. Each of the 5 interwell constituents will have 20×2 = 40 tests against a common background, while 2 semi-annual sample tests will be made against each of the 20×3 = 60 intrawell backgrounds.

1 r From equation [19.10], the single test false positive error rate is:α test (11 −−= α ) = ( ) 3201 =−− 0003292.1.11 . Each set of interwell constituent tests will have a cumulative false positive α 1 c 81 error rate c for the 40 annual tests as: α c (11 −−= α ) = ( ) =−− 01308.1.11 . Note that all 8 constituents are used in the equation, since the same false positive error rate is uniformly applied to all distinct subgroup tests. The result can be obtained using the single test error rate equation:

αc = 1 − (1 − αtest)^(w·nE) = 1 − (1 − .0003292)^40 = .01308. This target value would be used to compare with achievable non-parametric test error rates for the same input conditions. The cumulative interwell error rate for all five constituents can be calculated as: αinter = 1 − (1 − αc)^5 = 1 − (1 − .01308)^5 = .06371.

For the intrawell tests, the simplest approach uses the single test error rate for two tests:

α2-intra = 1 − (1 − αtest)^(1·2) = 1 − (1 − .0003292)^2 = .0006583. This would be the cumulative error rate to consider with non-parametric intrawell tests. The overall intrawell cumulative error rate for the sixty tests would then be: α60-intra = 1 − (1 − α2-intra)^60 = 1 − (1 − .0006583)^60 = .03873.

If the two overall interwell and intrawell cumulative error rates are added, the sum is .1024, quite close to the nominal 10% SWFPR; considered jointly, the rate is exactly that value. By comparison, the single test error rate using the Bonferroni approximation would be .1/320 = .0003125, while the exact Binomial value is .0003292. The estimated interwell cumulative error for a single constituent would be 40 times the single test value, or .0125 (versus the calculated .01308). For many non-parametric test considerations, these differences are relatively minor.
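The arithmetic of this worked example can be verified with a short script (function and variable names below are illustrative, not from the guidance):

```python
def alpha_test(alpha: float, r: int) -> float:
    # Single-test false positive rate, equation [19.10].
    return 1.0 - (1.0 - alpha) ** (1.0 / r)

w, c, n_e, alpha = 20, 8, 2, 0.10
r = w * c * n_e                                  # 320 annual tests
a_test = alpha_test(alpha, r)                    # ~.0003292

# Interwell: 5 constituents, each with 20 wells x 2 evaluations = 40 tests.
a_const = 1.0 - (1.0 - a_test) ** (w * n_e)      # ~.01308 per constituent
a_inter = 1.0 - (1.0 - a_const) ** 5             # ~.0637 cumulative

# Intrawell: 60 well-constituent backgrounds, 2 semi-annual tests each.
a_2intra = 1.0 - (1.0 - a_test) ** 2             # ~.0006583
a_intra = 1.0 - (1.0 - a_2intra) ** 60           # ~.0387 cumulative
```

The sum a_inter + a_intra comes to about .1025, while the joint combination 1 − (1 − a_inter)(1 − a_intra) equals the 10% SWFPR exactly, matching the text.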


19.3 PARAMETRIC PREDICTION LIMITS WITH RETESTING

BACKGROUND AND PURPOSE

Upper prediction limits for m future observations and for future means were described in Chapter 18. Applied to a network of statistical comparisons in detection monitoring, these procedures can be considered an extension of Dunnett’s multiple comparison with control [MCC] procedure (Dunnett, 1955). These procedures explicitly incorporate retesting and are applicable to a wider variety of cases than those addressed by Dunnett.

Retesting can be incorporated with either interwell or intrawell prediction limits. Depending on which approach is adopted, there is a distinct difference in the κ-multipliers of the general prediction limit formula. In an interwell retesting strategy, there are at least two forms of statistical dependence that impact the SWFPR. One is that each initial measurement or resample at a given compliance well is compared against the same background. A second is the dependence among compliance wells and number of annual evaluations, all of which are compared against a common upgradient background. In intrawell retesting, this second form of dependence is either essentially eliminated if there is only one annual statistical evaluation or else substantially reduced in the event of multiple evaluations.6 The remaining dependence is among successive resamples at each well.

To account for the basic differences between interwell and intrawell prediction limit tests, an extensive series of tables is provided in Appendix D covering a wide range of background sample sizes, numbers of wells, numbers of constituents, and distinctions between interwell and intrawell tests. In conjunction with an evaluation schedule (i.e., annual, semi-annual, or quarterly), these tables can be used to design and implement the specific parametric retesting strategies described in this chapter. All of the κ-multiplier tables for parametric prediction limits are structured to meet an annual SWFPR of 10% per year and to accommodate groundwater networks ranging in size from one to 8,000 total statistical tests per year. The Unified Guidance tables are more extensive than similar tables in Gibbons (1994b). Further, each table is designed to indicate the effective power of the κ-multiplier entries.

If a particular network configuration is not directly covered in the Appendix D tables, two basic options are available. First, bilinear interpolation can be used to derive an approximate κ-multiplier (see below for guidance on table interpolation). Second, the free-of-charge, open-source, and widely available R statistical programming package (www.r-project.org) can be employed to compute an exact κ-multiplier. Further instructions and the two template codes used to compute the Unified Guidance κ-multiplier tables are provided in Appendix C. After installing the R package, these template codes can be run by supplying specific parameters for the network of interest (e.g., number of wells, constituents, background sample size, etc.). Some familiarity with properly installing a program like R is helpful; Appendix C explains how to execute a pre-batched set of commands, and no other technical programming experience is needed.

6 If multiple evaluations occur each year, new compliance samples each evaluation period are tested against the common intrawell background.


REQUIREMENTS AND ASSUMPTIONS

The basic assumptions of parametric prediction limits were described in Chapter 18. These include data that are normal or can be normalized (via a transformation), lack of outliers, homogeneity of variance between the background and compliance point populations, absence of trends over time, stationarity, and statistical independence of the observations.

The Unified Guidance provides separate κ-tables for interwell and intrawell limits. One of these approaches should be justified before computing prediction limits. To use interwell prediction limits, there should be no significant natural spatial variation among the mean concentrations at different well locations. Otherwise, a prediction limit test could give meaningless results, since average downgradient levels might naturally be higher than background even in the absence of a contaminant release. The presence or absence of spatial variability should therefore be checked using the methods in Chapter 13.

While intrawell testing eliminates the problem of natural spatial variability, intrawell background often is developed using the first n samples from each compliance point well. Since historical data from compliance wells must be utilized for this purpose, those groundwater measurements should be uncontaminated. The number of intrawell background samples available may also be rather limited: n will tend to be small initially, prior to any updating of background. Such constraints will limit the intrawell retesting schemes that can both minimize the SWFPR and maintain effective power similar to the ERPCs.

One possible way to overcome this limitation is to estimate a pooled standard deviation across many wells along the lines suggested by Davis (1998). Such a calculation is no more difficult than a one- way ANOVA (Chapter 13) for identifying on-site spatial variability. The mean squared error [MSE] component of the F-statistic in ANOVA gives an estimate of the average per-well variability. To the extent that mean levels vary by well location but the population standard deviation does not, a one-way ANOVA can be run on a collection of wells (both background and compliance) to estimate the average within-well variance, and hence, the common intrawell standard deviation (see Chapter 13 for further details and examples).

Instead of a standard deviation estimate based solely on intrawell background at a single well with its attendant limits in size and degrees of freedom, the mean concentration level can be estimated on a well-specific basis, while the standard deviation is estimated utilizing a collection of wells leading to much larger degrees of freedom. Although the intrawell background size for a given well might be small (e.g., n = 4 or 8), the κ-multiplier used to construct the prediction limit is based on both the effective sample size (i.e., degrees of freedom plus one) and the intrawell sample size (n).
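The pooling idea can be sketched numerically. The code below computes a root-MSE (pooled within-well) standard deviation from hypothetical intrawell backgrounds; it mirrors the mean squared error component of a one-way ANOVA rather than reproducing the exact Davis (1998) procedure, and the data are invented for illustration:

```python
from statistics import mean

def pooled_within_well_sd(wells):
    """Pooled (root-MSE) standard deviation across wells: the square
    root of the df-weighted average of within-well variances, with
    total degrees of freedom sum(n_i - 1)."""
    ss, df = 0.0, 0
    for obs in wells:
        m = mean(obs)
        ss += sum((x - m) ** 2 for x in obs)
        df += len(obs) - 1
    return (ss / df) ** 0.5, df

# Hypothetical intrawell backgrounds (n = 4 each); mean levels differ
# by well, but the spread is similar, so pooling is plausible here.
wells = [
    [3.1, 2.8, 3.4, 3.0],
    [5.2, 5.6, 4.9, 5.5],
    [1.9, 2.3, 2.1, 2.4],
]
sd_pool, df = pooled_within_well_sd(wells)   # df = 9, vs. 3 per well alone
```

The gain is in the degrees of freedom: each well alone would offer only n − 1 = 3, while the pooled estimate carries 9, which is what drives the smaller κ-multipliers.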

The pooled standard deviation for intrawell comparisons can be utilized if the population standard deviation is approximately constant across wells. Many data sets may not appear so initially; however, any transformation to normality must first be taken into account. The standard deviation is only assumed to be constant on the transformed scale. Furthermore, once any transformation is applied, the collection of wells should explicitly be tested for homogeneity of variance using the tools in Chapter 11. Only if the assumption of equal variances across wells seems reasonable should the pooled standard deviation estimate be used.


With little or no spatial variability among well locations, an interwell test might be considered. However, the sample standard deviation (s) computed from background may not adequately estimate true background variability. This can happen when there is a temporal component to the variability affecting all wells at a site or regulated unit in parallel fashion, or when there is a significant degree of autocorrelation between successive samples.

A random, temporal component to the variability can result from changes to the laboratory analytical method or field sampling methodology, periodic re-calibration of lab instruments, or other sample handling or preparation artifacts that tend to impact all observations collected during a given sampling event. Such a temporal component can sometimes be identified through the use of parallel time series plots (Section 14.2.1) or through a one-way ANOVA using time-of-sampling as the factor (Section 14.2.2). Results of the ANOVA can be used to derive a better estimate of the background population standard deviation (σ), along with adjusted degrees of freedom for use in constructing the upper prediction limit (see Chapter 14 for further details and an example).

When autocorrelation is present, methods to adjust the standard deviation estimate and degrees of freedom entail possibly modeling the autocorrelation function. This issue is beyond the scope of the Unified Guidance and consultation with a professional statistician is recommended. The most practical way to avoid significant autocorrelation between samples is to allow enough time to lapse between sampling events. Precisely how much time will vary from site to site, but Gibbons (1994a) and others (for instance, American Society for Testing and Materials, 2005) recommend that the frequency of sampling be no more frequent than quarterly. Alternatively, a pilot study can be run on two or three wells with the sample autocorrelation function estimated from the results (Sections 14.3.1 and 14.2.3). The minimum lag (i.e., time) between sampling events at which the autocorrelation is effectively zero can be used as an appropriate sampling interval.
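The sample autocorrelation mentioned above can be estimated with the standard lag-k estimator; a pilot study would apply this to real series from two or three wells. A minimal sketch (the input series is fabricated for illustration):

```python
def lag_autocorr(x, lag=1):
    """Standard sample autocorrelation estimator at the given lag
    (a sketch of the approach referenced in Sections 14.3.1 and 14.2.3)."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag))
    return num / denom

# Illustrative (fabricated) series: a strong trend produces high lag-1
# autocorrelation, signaling that samples are too closely spaced.
r1 = lag_autocorr([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

A series like the one above yields a high lag-1 autocorrelation; the smallest lag at which the estimate is effectively zero suggests an adequate sampling interval.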

APPENDIX TABLES FOR PARAMETRIC RETESTING PLANS

The Unified Guidance provides tables of κ-multipliers for both interwell and intrawell prediction limits with retesting. It also provides separate tables for predicting individual future values versus future means. Four distinct retesting schemes are presented in the case of prediction limits for individual values: 1-of-2, 1-of-3, 1-of-4, and the modified California plan. Five distinct schemes are presented for the case of future means: 1-of-1, 1-of-2, and 1-of-3 for means of order 2, and 1-of-1 and 1-of-2 for means of order 3.

Both the Appendix D interwell retesting tables (Tables 19-1 through 19-9) and the intrawell retesting tables (Tables 19-10 through 19-18) are similarly structured. Separate sub-tables are provided for a range of possible monitoring constituents (c = 1 to 40) and for each of the retesting schemes mentioned above. Each table is divided into three parallel sections, one section applicable to annual statistical evaluations, one to semi-annual evaluations, and one to quarterly evaluations. Within each section, κ-multipliers are listed for all combinations of background sample size (from n = 4 to 150) and number of wells (from w = 1 to 200). These κ-multipliers are computed to meet a target annual SWFPR of 10%, as discussed in Chapter 6.

The Appendix tables also list those κ-multipliers which achieve adequate effective power compared to the ERPCs. The κ-multipliers are bolded when the effective power consistently exceeds the appropriate ERPC for mean level increases above background of 3 or more standard deviations (designated as ‘good’ power). The multipliers are italicized and shaded when the effective power is somewhat less, but still consistently exceeds the ERPC at mean level increases of 4 or more standard deviations above background (designated as ‘acceptable’ power). Non-bolded, non-italicized entries achieve the target SWFPR, but have low power.

To use the tables, certain key statistical parameters should be known or identified. These include whether the prediction limit tests are interwell or intrawell, the evaluation schedule (annual, semi-annual, or quarterly), the number of constituents (c), the size of the background sample (n), and the number of compliance wells to be tested (w). In the interwell case, it is presumed that there are n (upgradient) background measurements for each constituent (c). The listed κ-multiplier would then be applied to each of c prediction limits, one for each monitoring constituent. The intrawell case presumes that there are n well-specific background measurements designated at each well-constituent pair, thus giving w × c separate sets of intrawell background. Here, the κ-multiplier would be applied to each of w × c distinct prediction limits.

In situations where a mixture of test types is needed (e.g., intrawell testing for some constituents, interwell for others), the Unified Guidance tables can still be employed. The κ-multipliers are computed to apportion an equal share of the overall cumulative SWFPR to each of the w × c tests that need to be run during a given statistical evaluation. Because of this fact, if r of the constituents are analyzed using interwell tests, but (c – r) of the constituents are handled using intrawell limits, correct prediction limits can be developed by first selecting an interwell κ-multiplier based on all c constituents, and then selecting an intrawell κ-multiplier also based on c constituents. This will ensure that the target SWFPR is met, although each multiplier is respectively applied only to a subset of the monitoring list.

Some background samples might be of different sizes, either for different constituents or at distinct wells (e.g., when using intrawell background). Again the Unified Guidance tables can be inspected to select a different κ-multiplier for each distinct n. However, each multiplier should be chosen as if the background sample sizes were equal for all w × c tests. Thus, while a multiplier based on n1 background observations is applied only to those tests involving that sample size, it should be selected from the Appendix D tables as if it will be applied to all the tests.

For network configurations not listed in Tables 19-1 to 19-18 in Appendix D, an appropriate κ-multiplier can be estimated using bilinear interpolation. Such interpolation will be fairly accurate as long as adjacent table entries are used, representing the closest values to the desired combination of number of wells (w) and background sample size (n).

In general, to calculate κw*,n*, where the desired inputs w* and n* lie between the closest table entries (w1 < w* < w2 and n1 < n* < n2), first calculate the fractional terms:

fw = (w* − w1) / (w2 − w1)   and   fn = (n* − n1) / (n2 − n1)

The interpolated κ-multiplier can then be computed as:

κw*,n* = (1 − fw)(1 − fn)·κw1,n1 + (1 − fw)·fn·κw1,n2 + fw·(1 − fn)·κw2,n1 + fw·fn·κw2,n2  [19.12]


For example, suppose a κ-multiplier is needed for a 1-of-3 interwell prediction limit test for individual values using an annual evaluation schedule. Assume the monitoring network consists of c = 5 constituents monitored at w = 28 compliance wells, using n = 17 upgradient background measurements on which to base the prediction limit. From Table 19-2 in Appendix D, the closest table entries κw,n to the desired combination are κ20,16 = 1.59, κ30,16 = 1.70, κ20,20 = 1.52, and κ30,20 = 1.62. The interpolated value, κ28,17, can then be found using the equations in [19.12]:

fw = (28 − 20) / (30 − 20) = .8   and   fn = (17 − 16) / (20 − 16) = .25

κ28,17 = (1 − .8)(1 − .25)·κ20,16 + (1 − .8)(.25)·κ20,20 + (.8)(1 − .25)·κ30,16 + (.8)(.25)·κ30,20

= .15 × 1.59 + .05 × 1.52 + .60 × 1.70 + .20 × 1.62 = 1.659
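The bilinear interpolation of equation [19.12] is simple to implement. The following sketch reproduces the worked example above (w* = 28, n* = 17, with the four κ-values from Table 19-2); the function name is invented for illustration.

```python
def interp_kappa(w_star, n_star, w1, w2, n1, n2, k11, k12, k21, k22):
    """Bilinear interpolation of a κ-multiplier per equation [19.12].

    k11 = κ(w1, n1), k12 = κ(w1, n2), k21 = κ(w2, n1), k22 = κ(w2, n2)."""
    fw = (w_star - w1) / (w2 - w1)
    fn = (n_star - n1) / (n2 - n1)
    return ((1 - fw) * (1 - fn) * k11 + (1 - fw) * fn * k12
            + fw * (1 - fn) * k21 + fw * fn * k22)

# Worked example: w* = 28 wells, n* = 17 background samples,
# bracketed by table entries at w = 20, 30 and n = 16, 20.
kappa = interp_kappa(28, 17, 20, 30, 16, 20, 1.59, 1.52, 1.70, 1.62)
```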

Important considerations in designing a reasonable retesting scheme for detection monitoring are discussed in Chapter 6. Given a background sample and a particular network configuration and size, parametric 1-of-m plans tend to increase in statistical power as the order of m increases. All of the schemes have greater power with larger background sample sizes (n). Furthermore, plans involving prediction limits for future means tend to be more powerful than similar plans using prediction limits for individual observations. So if the κ-multiplier for a particular plan is not bolded or italicized, another plan can be sought to achieve sufficient effective power using more resamples or perhaps changing to a mean prediction limit. Alternatively, the background sample size might need to be augmented if feasible, prior to implementing the retesting procedure.

19.3.1 TESTING INDIVIDUAL FUTURE VALUES

The advantages to using a prediction limit for future individual values include: 1) the ability to explicitly control the SWFPR across a series of well-constituent pairs; and 2) greater flexibility than that provided by prediction limits for future means (Section 19.3.2) in handling temporal autocorrelation. In those cases when the sampling frequency needs to be reduced to maximize statistical independence of the observations, the method can be applied to evaluations of a single new measurement (plus possible resamples) at each compliance point well.

To properly implement a prediction limit strategy for future values with retesting, it needs to be feasible to collect 2 to 4 independent measurements at each compliance well during a given evaluation period. All initial and any resamples are assumed to be statistically independent and thus should exhibit no autocorrelation.

If statistical evaluations are done annually, it may be possible to collect data on a quarterly basis and meet the minimal sampling requirements of any of the resampling schemes discussed in the Unified Guidance. However, more frequent evaluations (say semi-annual or quarterly) will require that new samples be collected perhaps monthly or every six weeks. In these cases, explicit tests for autocorrelation may need to be conducted before adopting a 1-of-m retesting scheme with m > 2 or a modified California plan. If significant autocorrelation is identified, the sampling frequency may need to be reduced and/or an alternate strategy utilizing fewer resamples may need to be adopted instead.

PROCEDURE

Step 1. Identify the overall targeted annual false positive rate (SWFPR = α = 0.10). Determine the number of wells (w) to be monitored and the number of constituents (c) to be sampled at each well. Also determine whether the evaluation schedule at the unit or facility is annual, semi-annual or quarterly.

Step 2. Decide on the number of observations (m) to be predicted. To incorporate retesting, it must be feasible to collect up to two independent measurements from every compliance well during each evaluation period for a 1-of-2 retesting scheme, up to three independent measurements for a 1-of-3 plan, and up to four independent measurements for either a 1-of-4 plan or a modified California plan.

Step 3. For interwell prediction limits given a background sample of n measurements, compute the background sample mean ( x ) and standard deviation (s) for each constituent. Then, based on the evaluation schedule (annual, semi-annual or quarterly), c, n, w, and the specific retesting scheme chosen, use Tables 19-1 to 19-4 in Appendix D to determine a κ-multiplier possessing acceptable statistical power. Interpolate within the tables to find the closest multiplier if an exact value is not available.

For intrawell prediction limits, designate n early measurements as intrawell background for each well-constituent pair; compute the intrawell background mean ( x ) and standard deviation (s) for each case. Given the evaluation schedule, c, n, w, and the chosen retesting scheme, use Tables 19-10 to 19-13 in Appendix D to determine an acceptably powerful κ-multiplier. Note: if the intrawell background sample size varies by well, a series of κ-multipliers should be computed, one for each distinct n.

For each κ-multiplier, calculate the upper prediction limit with (1 − α) confidence as:

PL(1−α) = x + κs  [19.13]

If data were transformed prior to constructing the prediction interval, back-transform the prediction limit before making comparisons against the compliance point data. Unlike a prediction limit for future means, the formula for predicting m future values does not involve any transformation bias if the comparison is made in the original measurement domain.

Step 4. Collect an initial measurement from each well-constituent pair being tested. Compare each value against either 1) the upper prediction limit based on upgradient background in the interwell case or 2) the intrawell prediction limit specific to that well-constituent pair. Depending on the retesting scheme chosen, if any initial compliance point concentration exceeds the limit, collect 1 to 3 additional resamples at that well. If feasible, analyze only for those constituents which exhibited initial exceedances. Compare these values sequentially against the upper prediction limit. If the test ‘passes’ prior to collection of all the scheduled resamples, the remaining resamples do not need to be gathered or compared against PL.


Step 5. Decide that the test at a given well passes (i.e., the well is in compliance) if any one of the resamples does not exceed PL when using a 1-of-m scheme, or when at least two of the three resamples do not exceed PL when using the modified California scheme. Identify the well as failing when either 1) all resamples using a 1-of-m plan also exceed the prediction limit, or 2) at least two of the three resamples using a modified California plan exceed PL.
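The decision rules in Steps 4 and 5 can be expressed as two small functions. This is an illustrative rendering, not code from the guidance; the function names are invented, measurements are assumed independent and already transformed if background was transformed, and for the 1-of-m rule the resample list may be shorter than scheduled because testing stops as soon as a resample passes.

```python
def passes_one_of_m(initial, resamples, pl):
    """1-of-m plan: the well passes if the initial measurement or any
    resample does not exceed the prediction limit PL. Resamples are
    evaluated sequentially, so the list may be shorter than m - 1 if
    an early resample already passed."""
    return initial <= pl or any(r <= pl for r in resamples)

def passes_modified_california(initial, resamples, pl):
    """Modified California plan: the well passes if the initial
    measurement does not exceed PL, or if at least 2 of the 3
    resamples do not exceed it (all 3 resamples supplied here)."""
    if initial <= pl:
        return True
    return sum(1 for r in resamples if r <= pl) >= 2
```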

►EXAMPLE 19-1

A large hazardous waste facility with 50 compliance wells is to monitor 10 naturally-occurring inorganic parameters in addition to 30 non-naturally occurring volatile organic compounds that have never been detected on-site. Groundwater evaluations are performed on a semi-annual basis. If the regulating authority will allow up to two resamples per exceedance of the background concentration limit, construct an interwell prediction limit with adequate statistical power and false positive rate control on the following pooled set (n = 25) of background sulfate measurements.

BG Well   Sampling Date   Sulfate (mg/l)   Log Sulfate log(mg/l)
GW-01     07-08-99        63               4.143
          09-12-99        51               3.932
          10-16-99        60               4.094
          11-02-99        86               4.454
GW-04     07-09-99        104              4.644
          09-14-99        102              4.625
          10-12-99        84               4.431
          11-15-99        72               4.277
GW-08     10-12-97        31               3.434
          11-16-97        84               4.431
          01-28-98        65               4.174
          04-20-99        41               3.714
          06-04-02        51.8             3.947
          09-16-02        57.5             4.052
          12-02-02        66.8             4.202
          03-24-03        87.1             4.467
GW-09     10-16-97        59               4.078
          01-28-98        85               4.443
          04-12-98        75               4.317
          07-12-98        99               4.595
          01-30-00        75.8             4.328
          04-24-00        82.5             4.413
          10-24-00        85.5             4.449
          12-01-02        188              5.236
          03-24-03        150              5.011

SOLUTION

Step 1. Assume for purposes of the example that there are no significant spatial differences among the well locations, either upgradient or downgradient. A check of normality of the pooled background sulfate measurements indicates that the interwell prediction limit should be constructed on the logged sulfate measurements rather than the raw concentrations.


Step 2. Groundwater evaluations must be conducted semi-annually (S). By excluding never-detected organic chemicals from the SWFPR calculation, the number of constituents that are to be considered is c = 10 at each of w = 50 wells.

Step 3. Since a maximum of two resamples will be allowed during any given evaluation period, neither the 1-of-4 nor the modified California retesting plan is an option. Consequently, only a 1-of-2 or 1-of-3 retesting strategy is appropriate. With n = 25 background measurements, Tables 19-1 and 19-2 in Appendix D should be examined for a semi-annual evaluation schedule to determine κ-multipliers with adequate power. The multiplier of κ = 2.75 for a 1-of-2 plan has ‘acceptable’ power compared to the semi-annual ERPC, but the multiplier of κ = 2.00 for a 1-of-3 plan has ‘good’ power. Use the latter value to construct the interwell prediction limit.

Step 4. The sample log-mean and log-standard deviation of the sulfate background measurements are y = 4.32 and sy = 0.376, respectively. Use these values and the κ-multiplier to compute the prediction limit on the log scale as

PL = y + κsy = 4.32 + 2.00 × 0.376 = 5.072

Then exponentiate the limit to back-transform it to the original measurement domain, for a final sulfate prediction limit of PL = e^5.072 = 159.5 mg/l.

Step 5. Compare the final prediction limit against one new sulfate measurement from each of the 50 compliance point wells. For any exceedance, compare the first of two resamples to the prediction limit. If the limit is still exceeded, test the second resample. If all three measurements (initial plus two resamples) are above the prediction limit at any specific well, declare that a statistically significant exceedance for sulfate has been identified. If, however, either of the resamples does not exceed the limit, judge the evidence to be insufficient to declare the well to be out-of-compliance. ◄
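The arithmetic in Steps 3 and 4 can be checked directly from the pooled background data. A Python sketch (natural logs, matching the table; carrying full precision instead of the rounded 4.32 and 0.376 gives a limit of about 159 mg/l, consistent with the example's 159.5 mg/l within rounding):

```python
import math

# Pooled background sulfate measurements (mg/l) from Example 19-1
sulfate = [63, 51, 60, 86, 104, 102, 84, 72,
           31, 84, 65, 41, 51.8, 57.5, 66.8, 87.1,
           59, 85, 75, 99, 75.8, 82.5, 85.5, 188, 150]

logs = [math.log(v) for v in sulfate]
n = len(logs)
y_bar = sum(logs) / n                                           # log-mean
s_y = math.sqrt(sum((v - y_bar) ** 2 for v in logs) / (n - 1))  # log-SD

kappa = 2.00                        # 1-of-3 plan multiplier from Table 19-2
pl = math.exp(y_bar + kappa * s_y)  # back-transformed prediction limit (mg/l)
```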

►EXAMPLE 19-2

Due to significant natural spatial variability, an intrawell testing scheme needs to be adopted at a solid waste landfill that monitors for 5 inorganic constituents at each of 10 compliance wells. If only one year’s worth of quarterly sampling data is available at each well, but no recent contamination is suspected, develop an appropriate modified California intrawell retesting plan for the following chloride measurements. Assume that one statistical evaluation must be conducted each year.


Well ID   Chloride (mg/l)          Well Mean ± SD (mg/l)
GW-09     22, 18.4, 39.9, 33.7     28.5 ± 10.021
GW-12     78, 70, 61, 65.8         68.7 ± 7.208
GW-13     75.1, 65.6, 67, 55.3     65.75 ± 8.128
GW-14     59.2, 57.1, 41.1, 47.7   51.28 ± 8.427
GW-15     35, 56.8, 69.8, 41.3     50.72 ± 15.672
GW-16     31, 34.6, 60.1, 48.7     43.6 ± 13.392
GW-24     23.4, 36.4, 31.1, 45     33.98 ± 9.083
GW-25     33.5, 30.2, 23.1, 38.7   31.38 ± 6.533
GW-26     79.8, 61.3, 57.8, 44.8   60.92 ± 14.447
GW-28     37.7, 26.6, 45.7, 42     38.0 ± 8.273

SOLUTION

Step 1. With c = 5 constituents, w = 10 wells, one annual evaluation, and an intrawell background size for each well of only n = 4, Table 19-13 in Appendix D can be examined to locate a possible κ-multiplier, leading to an interpolated κ = 4.33. Although this multiplier will adequately control the annual SWFPR to 10% or less, it yields low power for identifying contamination. As an alternative, try computing a pooled standard deviation across the compliance wells for chloride.

Step 2. Side-by-side box plots (Section 11.1) of the chloride values exhibit no obvious differences in spread or variation. The F-statistic for Levene’s test (Section 11.2) is also non-significant (F = 1.0673) at the α = 5% level, indicating no significant evidence of unequal variances, so a pooled standard deviation can be appropriately formed.

Step 3. Conduct a one-way ANOVA on all chloride measurements from the 10 compliance wells, using Wells as the main factor (Section 13.2.2). The ANOVA table is presented below.

Source of Variation    Sums of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Wells          7585.25           9                    842.81         7.55
Error (within wells)   3350.37           30                   111.68
Total                  10935.62          39


Step 4. Compute the square root of the Error Mean Squares (also called the root mean squared error or RMSE) component in the ANOVA table to derive an estimate of the pooled intrawell standard deviation of sp = 10.568. This estimate of the average intrawell variation has 30 degrees of freedom [df], computed by multiplying (4–1) = 3 degrees of freedom per well times the number of wells, or df = 3 × 10 = 30.

Step 5. The Appendix D tables are not used to derive κ-multipliers when a pooled standard deviation estimate is used for intrawell prediction limits; instead, the R script listed in Appendix C is used (see Section 13.3). For a modified California retesting strategy with n = 4 and df = 30, the κ-multiplier becomes κ = 1.98.7 This value not only controls the SWFPR but also has good statistical power. So use this multiplier along with the pooled intrawell standard deviation to compute an intrawell prediction limit for each compliance well. As an example, since the mean for chloride at well GW-09 is 28.5, the intrawell prediction limit would be:

PL = 28.5 + 1.98 × 10.568 = 49.4 mg/l

Prediction limits for the other compliance wells would be computed similarly. ◄
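Steps 2 through 5 can be reproduced from the raw chloride measurements: the within-well (error) sum of squares from the one-way ANOVA gives the pooled intrawell standard deviation, which is combined with the κ = 1.98 multiplier quoted in Step 5. A minimal Python sketch:

```python
import math

# Quarterly chloride measurements (mg/l) transcribed from Example 19-2
wells = {
    "GW-09": [22, 18.4, 39.9, 33.7],
    "GW-12": [78, 70, 61, 65.8],
    "GW-13": [75.1, 65.6, 67, 55.3],
    "GW-14": [59.2, 57.1, 41.1, 47.7],
    "GW-15": [35, 56.8, 69.8, 41.3],
    "GW-16": [31, 34.6, 60.1, 48.7],
    "GW-24": [23.4, 36.4, 31.1, 45],
    "GW-25": [33.5, 30.2, 23.1, 38.7],
    "GW-26": [79.8, 61.3, 57.8, 44.8],
    "GW-28": [37.7, 26.6, 45.7, 42],
}

def within_ss(g):
    """Sum of squared deviations about the well's own mean."""
    m = sum(g) / len(g)
    return sum((x - m) ** 2 for x in g)

# Error (within-well) component of the one-way ANOVA
ss_error = sum(within_ss(g) for g in wells.values())
df_error = sum(len(g) - 1 for g in wells.values())
s_p = math.sqrt(ss_error / df_error)   # pooled intrawell SD (root mean squared error)

kappa = 1.98   # modified California multiplier for n = 4, df = 30 (Step 5)
pl_gw09 = sum(wells["GW-09"]) / 4 + kappa * s_p   # intrawell limit for GW-09
```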

19.3.2 TESTING FUTURE MEANS

BACKGROUND AND REQUIREMENTS

The background, requirements, and assumptions for a prediction limit on future means of order p are essentially identical to those for prediction limits for future values (Section 19.3). For a comparable level of sampling effort, predicting a future mean offers increased effective power compared to a strategy that uses prediction limits for individual future values. To properly implement a prediction limit strategy for future means with retesting, it must be feasible to collect 2 to 6 independent measurements at each compliance well during a given evaluation period. All initial and resample measurements are assumed to be statistically independent.

To include explicit retesting, it should be feasible to collect either 2p or 3p independent measurements per well during each evaluation. The initial p observations are used to form the initial mean, while the remaining values are used to form either one or two resample means. If statistical evaluations are done annually, it may be possible to collect quarterly data and meet the minimal sampling requirements for p = 2 and a 1-of-2 retesting scheme. For more frequent semi-annual or quarterly evaluations, a larger order p or a retesting scheme entailing two resample means will require that new samples be collected perhaps monthly or every six weeks. An explicit test for autocorrelation should be made before adopting the strategy presented here. If significant autocorrelation exists, the frequency of sampling may need to be reduced and alternate prediction limit strategies considered such as a 1-of-1 plan for a future mean (see Section 19.1) or individual future values (Section 19.3.1).

7 The EPA Region 8 approximation equation described in Chapter 13, Section 13.3 provides a κ-multiplier estimate of 1.99 for individual wells at n = 4. The annual κ-factor for w = 10, c = 5, and n = 31 in Table 19-13 of Appendix D is interpolated as κ = 1.508. Using the appropriate A, b & c coefficients from Chapter 13, Note 2 for the modified California plan, results are quite close to those generated from the R script.


An important difference between testing means versus individual values is that in some cases it may not be necessary to implement a retest at all. As noted above, for the same degree of sampling effort, a prediction limit for a mean of two or more observations can provide greater effective power than a prediction limit for the same number of individual values, even if a resampled mean is not collected. In other words, when a 1-of-2 retesting plan for individual observations is compared to a 1-of-1 plan for means of order 2, the 1-of-1 mean-based scheme generally has greater power for identifying real concentration increases if background sample sizes are n > 10 (compare κ-multiplier power ratings at higher n, c, and w in Tables 19-1 and 19-5 of Appendix D). A similar comparison holds between a 1-of-3 retesting plan for individual observations and a 1-of-1 plan for a mean of order 3 (Table 19-2 versus Table 19-8 in Appendix D).

Even more powerful prediction limits for future means are possible when explicit retesting is added to the procedure. However, the minimum sampling increases substantially. With a 1-of-2 retesting plan for means of order 2, as many as four independent groundwater measurements need to be collected and analyzed per evaluation period. With a 1-of-3 plan for means of order 2 or a 1-of-2 plan for means of order 3, the sampling increases to as many as six independent observations per period. The latter plans may only be feasible for a single annual evaluation.

A problem common to all future mean prediction limits arises if the data have to be normalized via a transformation. In this case, all comparisons need to be made on the transformed data in order to avoid a transformation bias. As a consequence, the procedure is not a direct test of the background and compliance point arithmetic means. The test is still valid as a measure of significant mean differences in the transformed domain (e.g., a test of geometric mean differences for logarithmic data). To the extent that the populations being compared share a common variance in the transformed domain, a significant difference on the transformed scale may also correspond to a significant difference in the arithmetic means of the original populations.

A final potential drawback is that although a 1-of-m plan for future observations and a 1-of-1 plan for means of order p = m seem to require the same total sampling effort, a prediction limit for observations can actually entail less sampling. For a future mean test of order p = m, m individual measurements will always need to be collected and analyzed. With a prediction limit for individual observations, the first sample is analyzed and compared to the limit. If it passes (i.e., does not exceed the limit) there is no need to test the second or subsequent observations. Any subsequent resample that passes, also indicates that no further resample comparisons are needed for that test.

Under typical conditions at a site where most or all tested well-constituent pairs are likely to be at background conditions, there is a substantial savings in the number of samples for future observations versus means of the same size. It can also be noted that the same principle is true for a 1-of-2 test of a mean of order 2. Under background conditions, the two initial mean samples may be all that is required. When groundwater is contaminated, both the 1-of-m retesting plan for observations and the 1-of-1 plan for a mean of order p = m require exactly the same amount of sampling and analysis to identify a significant exceedance.


PROCEDURE

Step 1. Identify the number of wells (w) to be monitored and the number of constituents (c) to be sampled at each well. Also identify the evaluation schedule as annual (A), semi-annual (S), or quarterly (Q).

Step 2. Decide on the order (p) of the future mean to be predicted. To incorporate retesting, it needs to be possible to collect 2p independent samples during each evaluation period to use a 1-of-2 retesting scheme, or 3p independent samples if a 1-of-3 plan is desired.

Step 3. If an interwell prediction limit is needed, use the common sample of n (upgradient) background measurements to compute the background sample mean ( x ) and standard deviation (s). Given the n background measurements, w, c, p, and the evaluation schedule (annual, semi-annual or quarterly), use Tables 19-5 to 19-9 in Appendix D to determine a κ-multiplier possessing acceptable statistical power. Calculate the upper prediction limit on background as:

PL = x + κs  [19.14]

If intrawell prediction limits are needed, designate n early measurements at each compliance well as intrawell background. Compute the background sample mean ( x ) and standard deviation (s) for each well. Then, based on n, w, c, p, and the number of evaluations per year, use Tables 19-14 to 19-18 in Appendix D to determine an adequately powerful κ-multiplier. Compute an intrawell prediction limit for each compliance well using equation [19.14]. Note: if the intrawell background sample sizes vary by well, a series of κ-multipliers will need to be identified in these Appendix D tables, one for each distinct n.

If the background data were transformed prior to constructing the prediction limit, also transform any compliance point data before making comparisons against the prediction limit. In particular, compute the comparison mean of order p using the transformed values, rather than transforming the sample mean of the raw concentrations.

Step 4. Collect p initial measurements from each compliance well. Compute the mean of order p for each well, first transforming the data if necessary using the same function applied to background. Then compare each mean against the upper prediction limit. If retesting is desired, for any compliance point mean that exceeds the limit, collect either p or 2p additional resamples at that well, depending on the retesting scheme chosen. Form either one or two resample means of order p from these data; compare these means sequentially to the upper prediction limit.

Step 5. Identify the well as potentially contaminated when either 1) the initial mean of order p exceeds the limit in a 1-of-1 plan, or 2) the initial mean and all resample means using a 1-of-2 or 1-of-3 plan also exceed the prediction limit. Deem the well to be in compliance if either 1) the initial mean does not exceed the prediction limit, or 2) any of the resample means does not exceed the limit.
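The collect-and-compare logic of the final two procedure steps for a 1-of-2 plan on future means can be sketched as a small function. This is an illustration, not code from the guidance; the function name and return labels are invented, and observations are assumed already transformed if background was transformed.

```python
def mean_test_1_of_2(initial_obs, resample_obs, pl):
    """1-of-2 retesting plan for a future mean of order p (sketch).

    initial_obs: the first p compliance measurements;
    resample_obs: the p resample measurements, or None if not yet
    collected (sequential testing: the resample mean is only formed
    when the initial mean exceeds the prediction limit PL)."""
    initial_mean = sum(initial_obs) / len(initial_obs)
    if initial_mean <= pl:
        return "in compliance"
    if resample_obs is None:
        return "resample required"
    resample_mean = sum(resample_obs) / len(resample_obs)
    return "in compliance" if resample_mean <= pl else "potentially contaminated"
```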


►EXAMPLE 19-3

Suppose a large facility with minimal natural spatial variation is to monitor for 20 separate naturally-occurring inorganic chemicals along with a number of other never-detected organic constituents. If 100 compliance wells are to be tested every six months and 25 background sample measurements are available, which resampling plans can control the SWFPR while providing acceptable statistical power? Assume that the data for each inorganic compound can be normalized and that the temporal autocorrelation between successive samples at the same well is minimal, provided that no more than four samples are collected during any semi-annual period.

SOLUTION

Step 1. The frequency of statistical evaluations is semi-annual (S). Excluding never-detected compounds from the SWFPR calculation leaves c = 20 constituents that need to be explicitly tested at each of w = 100 wells. For each of these constituents, since the data can be normalized, assume that an interwell prediction limit can be constructed using n = 25 background measurements.

Step 2. Determine κ-multipliers and power ratings for seven possible prediction limit retesting plans, excluding the 1-of-3 mean order 2 and the 1-of-2 mean order 3 tests. Use the sub-tables identified as "20 COCs, Semi-Annual" for n = 25 and w = 100 in interwell Tables 19-1 through 19-9 in Appendix D, to obtain the following:

Prediction Limit Plan           κ-Multiplier   Power        Total Samples
1-of-2, observations            3.13           Low          2
1-of-3, observations            2.31           Good         3
1-of-4, observations            1.81           Good         4
Mod. California, observations   2.54           Good         4
1-of-1, mean order 2            3.56           Acceptable   2
1-of-2, mean order 2            2.29           Good         4
1-of-1, mean order 3            2.95           Good         3

Step 3. Compare the various plans in terms of statistical power and typical sampling effort. The only plan with low power is the 1-of-2 scheme for observations. The 1-of-1 mean order 2 plan has acceptable power. The other plans all have good power (i.e., they consistently meet or better the ERPC for mean-level increases above background of 3 or more standard deviations), but potentially require either 2 or 3 resamples.

Restricting attention to those with good power, the least potential sampling effort is required by the 1-of-1 plan for a mean of order 3 or a 1-of-3 plan for observations. These two plans would require less total sampling than the 1-of-4 plan for observations or the 1-of-2 mean order 2 plan, and the same or less sampling than the modified California plan for observations, in identifying a contaminant release.

If groundwater is not contaminated, the 1-of-m plans for observations require a minimum of 1 measurement to demonstrate that the well is in-bounds (i.e., when the initial measurement does not exceed the background limit), as does the modified California plan. The 1-of-2 plan for a mean of order 2 requires a minimum of 2 measurements, and the 1-of-1 plan for a mean of order 3 requires a minimum of 3 measurements. On balance, the 1-of-3 plan for individual observations or the 1-of-2 plan for a mean of order 2 may provide the best compromise between minimizing sampling effort and offering a higher probability of identifying contaminated groundwater. ◄

►EXAMPLE 19-4

Use the chloride data of Example 19-2 to compute and contrast prediction limits for a future mean of order 2, with and without explicit retesting. Assume as before that 10 wells are monitored for 5 inorganic constituents, and evaluated on an annual basis.

SOLUTION

Step 1. The chloride data in Example 19-2 showed significant spatial variability, suggesting the use of intrawell prediction limits. Furthermore, a one-way ANOVA evaluation of the w = 10 compliance wells indicated that a pooled standard deviation estimate of sp = 10.568 with 30 degrees of freedom could be used to build intrawell prediction limits, instead of using individual variance estimates from each compliance well.

Step 2. With c = 5 constituents, w = 10 wells to be monitored, one annual evaluation (A), and a pooled degrees of freedom of df = 30, the R script in Appendix C can be repeatedly run to determine κ-multipliers for each retesting scheme for prediction limits on means of order 2. Since the sample size for each of the 10 wells is the same n = 4, the following multipliers were generated from the R script for the 1-of-1 to 1-of-3 tests of mean order 2: κ = 2.68, 1.88 and 1.51, respectively.8 The prediction limits can then be constructed using equation [19.15], as shown for the first five compliance wells in the table below.

PL = x̄ + κ·sp  [19.15]

Step 3. While the power of each retesting plan is rated ‘good’ compared to the annual-evaluation ERPC, the prediction limits are obviously higher when less (or no) explicit retesting is conducted. Depending on conditions at the site, the range of approximately 13 mg/l of chloride in the well-specific prediction limits may or may not be important in deciding which strategy to use. The 1-of-1 plan for a mean of order 2 requires fewer total samples than the other plans. In some situations, the higher initial limits may be outweighed by the savings in sampling cost.

On the other hand, the ERPC provides a minimal standard for assessing statistical power. There can be a range of power curves even among plans all rated as ‘good’, as seen in Figure 19-1 below, where the full effective power curves for these three strategies are presented. Clearly, the 1-of-2 and 1-of-3 plans for means of order 2 have visibly higher power than the 1-of-1 retesting scheme. If site conditions permit, it may be beneficial to incorporate the 1-of-2 plan as a reasonable compromise between the gain in statistical power versus the increase in sampling requirements (for contaminated wells). ◄

8 Using the Region 8 approximation equation in Chapter 13, the corresponding κ-multipliers were 2.69, 1.89 and 1.52, respectively, based on tabular values at n = 31 of 2.258, 1.364 and .946 and using the appropriate A, b & c coefficients for each test. Results are very comparable to the R-script values.


Well ID   Retesting Plan   κ Multiplier   Power Rating   Well Mean (mg/l)   Prediction Limit
GW-09     1-of-1           2.68           Good           28.50              56.82
          1-of-2           1.88           Good           28.50              48.37
          1-of-3           1.51           Good           28.50              44.46
GW-12     1-of-1           2.68           Good           68.70              97.02
          1-of-2           1.88           Good           68.70              88.57
          1-of-3           1.51           Good           68.70              84.66
GW-13     1-of-1           2.68           Good           65.75              94.07
          1-of-2           1.88           Good           65.75              85.62
          1-of-3           1.51           Good           65.75              81.71
GW-14     1-of-1           2.68           Good           51.28              79.60
          1-of-2           1.88           Good           51.28              71.15
          1-of-3           1.51           Good           51.28              67.24
GW-15     1-of-1           2.68           Good           50.72              79.04
          1-of-2           1.88           Good           50.72              70.59
          1-of-3           1.51           Good           50.72              66.68
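The limits in this table follow directly from equation [19.15]. As a check on the arithmetic, the computation can be sketched in a few lines; Python is used here purely for illustration (the guidance itself relies on an R script), and the function name is not from the guidance.

```python
# Recompute intrawell prediction limits via PL = x-bar + kappa * s_p,
# using the pooled standard deviation and kappa-multipliers from Step 2.
SP = 10.568  # pooled standard deviation (30 df) from Example 19-2

kappas = {"1-of-1": 2.68, "1-of-2": 1.88, "1-of-3": 1.51}
well_means = {"GW-09": 28.50, "GW-12": 68.70, "GW-13": 65.75,
              "GW-14": 51.28, "GW-15": 50.72}

def prediction_limit(mean, kappa, sp=SP):
    """Parametric prediction limit for a future mean: PL = x-bar + kappa*s_p."""
    return round(mean + kappa * sp, 2)

for well, mean in well_means.items():
    limits = {plan: prediction_limit(mean, k) for plan, k in kappas.items()}
    print(well, limits)
```

Note that the roughly 13 mg/l spread between the 1-of-1 and 1-of-3 limits at each well comes entirely from the difference in κ (2.68 versus 1.51) times sp.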

Figure 19-1. Comparison of Power Curves for 1-of-m Plans for Mean of Order 2


19.4 NON-PARAMETRIC PREDICTION LIMITS WITH RETESTING

BACKGROUND AND PURPOSE

When parametric prediction limits are not appropriate, either due to a large fraction of non-detects or data that cannot be normalized, retesting can be conducted using non-parametric prediction limits. The Unified Guidance discusses retesting schemes for both individual future values and for future medians (in parallel to the parametric options discussed in Section 19.3). Tests on individual observations include the three 1-of-m plans and the modified California plan. Tests on future medians include the 1-of-1 and 1-of-2 plans for medians of order 3. The basic strategy is to establish a non-parametric prediction limit for each monitoring constituent based on background measurements so that it accounts for the number of well-constituent tests in the overall network. Instead of determining a κ-multiplier, a non-parametric limit is computed as an order statistic from the background sample. The term order statistic refers to one of the values in a sorted (or ordered) data set.

In order to maintain adequate statistical power while minimizing the overall false positive rate, retesting will almost always be needed as part of the detection monitoring system design. As in the parametric case, a specific number of additional, independent resamples will potentially need to be collected for each compliance well test. The initial and subsequent resamples are then compared against the non-parametric prediction limit.

The largest or second-largest value in background is often selected as a non-parametric limit, representing the nth or (n–1)th order statistics. With higher level 1-of-m tests of observations, an even lower order statistic may be more appropriate in achieving an optimal balance between the desired SWFPR and adequate statistical power. This can be particularly true if the background sample size is large, but depends on the overall network design requirements. Although the Unified Guidance provides tables of non-parametric limits only for the largest and second-largest order statistics, EPA Region 8 has released software written in Visual Basic® labeled the Optimal Rank Values Calculator that computes the optimal choice of order statistic for 1-of-m retesting plans for m = 1 to 4. The program also provides approximate statistical power estimates based on user inputs of a target cumulative false positive rate, background sample size, and number of simultaneous tests to be conducted. The software and explanatory narrative will be provided on the EPA website.9
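For the common case where the limit is set to the background maximum, the single-test false positive rate of a 1-of-m plan has a simple closed form under exchangeability: the probability that all m future values exceed the maximum of n background values is 1/C(n+m, m). The sketch below, in Python for illustration, also aggregates this rate across w wells by treating the tests as independent; this is only an approximation, since the w tests share one background sample, so exact table entries differ slightly.

```python
from math import comb

def alpha_single_test(n, m):
    """Exact probability, under exchangeability, that all m independent
    future values exceed the maximum of n background values: 1/C(n+m, m)."""
    return 1 / comb(n + m, m)

def alpha_per_constituent(n, m, w):
    """Approximate per-constituent false positive rate across w wells,
    treating the w tests as independent (they share background, so the
    exact tabled values differ slightly)."""
    p = alpha_single_test(n, m)
    return 1 - (1 - p) ** w
```

For n = 20 and w = 10 (the configuration of Example 19-5), a 1-of-3 plan against the background maximum yields roughly 0.0056 per constituent, close to the tabled value of 0.0055.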

REQUIREMENTS AND ASSUMPTIONS

When more independent data are added to the testing procedure, retesting with non-parametric prediction limits leads to more powerful and more accurate assessments of possible contamination. As with parametric retesting schemes, a balance must be struck between 1) quick identification and confirmation of contaminated groundwater and 2) statistical independence of successive resamples. All retesting strategies depend on the assumption of statistical independence between successive resamples. This trade-off is typically resolved by allowing enough time between resamples to allow both the well to

9 The calculator, an accompanying narrative, fact sheet and this guidance will be located on the EPA website: http://www.epa.gov/hazard/correctiveaction/resources/guidance/sitechar/gwstats/index.htm. If the calculator cannot be accessed, contact Mike Gansecki for assistance (e-mail: [email protected]; phone: 303-312-6150).

recharge and additional groundwater to flow past the well screen, and by limiting the number of possible resamples to 2 or 3.

Non-parametric retesting schemes offer somewhat less flexibility than their parametric counterparts. As with other non-parametric statistical intervals, the same SWFPR control afforded by a parametric interval based on a small n cannot usually be attained in a non-parametric interval; larger sample sizes are almost always necessary. κ-multipliers for parametric prediction limits are continuous statistical parameters that can be adjusted to match a desired false positive rate for even the smallest sample sizes. By contrast, the bounds of non-parametric intervals are restricted to values in the observed background sample. For a given sample size and number of tests to be run, any order statistic selected from background as the non-parametric prediction limit results in a discrete probability of false positive error. Altering the prediction limit by selecting a different order statistic changes the false positive rate only in discrete probability steps, providing a less efficient means of controlling the SWFPR.

The non-parametric prediction limit tests provided in the Unified Guidance do not require the underlying distribution to be normal. One potentially attractive application is for background data sets containing higher percentages of non-detects which cannot be normalized. For some constituent data sets, it may be possible to pool data from several upgradient and historical compliance wells to generate much larger total background sizes. A non-parametric Kruskal-Wallis test of medians can establish that these data are appropriate for pooling.

Since larger background sample sizes are needed when no distributional model is posited, the non-parametric testing schemes are most applicable to interwell comparisons. Small intrawell background sample sizes make it difficult for any of the non-parametric test options to meet the SWFPR cumulative false positive design objective. Unlike parametric intrawell tests, effective sample sizes cannot be expanded by estimating a common pooled standard deviation across a number of wells. This conclusion holds no matter what order statistic is used as the non-parametric prediction limit. There are, however, other considerations which might allow intrawell testing using non-parametric alternatives. For a given background sample size, target false positive rate, a fixed maximum as the limit, and number of total tests, the higher-order 1-of-m tests of future observations will have lower achievable false positive errors, with the 1-of-4 test the lowest. If the background sample size is increased through periodic additions, this false positive rate will continue to drop. The power of these tests using the maximum with small sample sizes is almost always greater than the EPA reference levels. A temporary strategy might be to utilize the highest-order 1-of-m test for intrawell purposes until larger sample sizes are available, although the target cumulative false positive rate may not initially be met. With larger sample sizes, it may also be possible to decrease the m of the test and still achieve the target false positive rate.

Even interwell comparisons between upgradient and downgradient wells are acceptable only if the degree of spatial variability is insignificant. Fortunately, spatial variability may be less of a problem in those cases where a non-parametric retesting scheme might be implemented, i.e., when the detection rate of the chemical being monitored is fairly low. High constituent non-detect rates tend to result in more uniform spatial distribution across site wells, allowing for similar median concentrations.


APPENDIX TABLES FOR NON-PARAMETRIC PREDICTION LIMITS

To design appropriate non-parametric prediction limits with retesting, the Unified Guidance provides separate tables for predicting individual future values versus future medians. Four distinct retesting schemes are presented in the case of prediction limits for individual values: the 1-of-2, 1-of-3, 1-of-4, and modified California plans. Two distinct schemes are presented for the case of future medians: 1-of-1 and 1-of-2 for medians of order 3.

Unlike the tables for parametric prediction limits discussed in Section 19.3, non-parametric prediction limits do not involve κ-multipliers. Instead, the entries in Tables 19-19 to 19-24 of Appendix D consist of per-constituent significance levels. These levels represent the achievable false positive rate (αconst) associated with each tested constituent for a given retesting scheme, choice of non-parametric prediction limit, and network configuration (i.e., number of wells [w] and background sample size [n]).10 The non-parametric prediction limit can be estimated via any order statistic from the background sample. However, the most practical limits are usually either the maximum observed background value or the second-highest value. Consequently, the Unified Guidance provides tables for these two options.

Each table for the six specific non-parametric tests contains two sub-tables: one uses a limit based on the background maximum, and the other the second-highest background value. All the tables are otherwise similarly structured. Within each sub-table, per-constituent significance levels are given for all combinations of background sample size (n = 4 to 200) and number of wells (w = 1 to 200). These significance levels can be used to meet a target annual SWFPR of 10%, discussed in Chapter 6.

Correct use of these tables involves a few important considerations. First, if an interwell prediction limit is desired, the target per-constituent false positive rate (αconst) needs to be computed. Any prediction limit strategy selected should have a table entry no greater than αconst in order to ensure that the annual SWFPR is no greater than 10%. To compute this target rate, use the formula:

αconst = 1 − (1 − α)^(1/c)  [19.16]

where c equals the number of monitoring constituents and α is the SWFPR = 0.10.
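Equation [19.16] translates directly into code. A minimal sketch follows, in Python for illustration; the function name is not from the guidance.

```python
def alpha_const(c, alpha=0.10):
    """Target per-constituent false positive rate from equation [19.16]:
    alpha_const = 1 - (1 - alpha)**(1/c), with alpha the annual SWFPR."""
    return 1 - (1 - alpha) ** (1 / c)

# With c = 5 monitoring constituents and the default SWFPR of 0.10,
# the target rate is about 0.021, as used in Example 19-5.
```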

Unlike the tables for parametric prediction limits, separate tables are not provided for each of the three most common evaluation schedules (i.e., annual, semi-annual, and quarterly). Instead, the number of 'wells' in each non-parametric table must be regarded as the actual number of compliance wells (w) times the number of annual statistical evaluations (nE = 1, 2, or 4). To use these tables, let w* = w × nE. This adjustment is necessary because on each evaluation, all w wells are compared against a prediction limit computed from a common interwell background. A site with w* wells tested annually is statistically equivalent to a site having w distinct well locations tested nE times per year (w × nE tests).

10 Per-constituent rates instead of network-wide false positive rates are given in these tables and those of Davis and McNichols (1994; 1999) for computational reasons. Although the mathematical algorithm is exact, it is difficult to compute with accuracy for a large number of tests (r). Hence the decomposition of r into constituents (c) times wells (w). By calculating the per-constituent false positive rate, only the number of wells (w) need be varied.


Once w* is computed in this way, the table entry corresponding to w* and n represents the achievable annual false positive rate per constituent. As noted, this rate should not exceed the target rate (αconst) in order to meet the overall SWFPR. If αconst is exceeded for a given choice of retesting scheme and choice of non-parametric prediction limit, a different limit or scheme should be considered. In general, selecting a 1-of-m retesting scheme with larger m will lead to a lower achieved false positive rate. Also, per-constituent significance levels for the modified California approach are generally larger than those for the 1-of-m plans.

If intrawell prediction limits are needed, a somewhat different method needs to be employed to correctly use the per-constituent significance levels in Tables 19-19 through 19-24 of Appendix D. In this case, a target per well-constituent pair false positive rate (αw·c) needs to be first computed using the equation:

αw·c = 1 − (1 − α)^(1/(w·c))  [19.17]

where α is the SWFPR, w equals the actual number of compliance wells and c is the number of monitoring constituents. Then the placeholder w* for the non-parametric tables is equated with the number of annual statistical evaluations (w* = nE = 1, 2, or 4). w* represents the number of times per year that the common intrawell background at any given well-constituent pair will be compared against new compliance measurements from that well. The table entry corresponding to w* and the intrawell background sample size n may be regarded as the achievable false positive rate per well-constituent pair. This rate should not exceed the target rate, αw·c, if the overall SWFPR is to be met.
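The intrawell target in equation [19.17] differs from the interwell case only in that the SWFPR is apportioned over w·c well-constituent pairs rather than c constituents. An illustrative sketch (function name not from the guidance):

```python
def alpha_well_const(w, c, alpha=0.10):
    """Target per well-constituent pair false positive rate from
    equation [19.17]: 1 - (1 - alpha)**(1/(w*c))."""
    return 1 - (1 - alpha) ** (1 / (w * c))

# With w = 10 wells and c = 5 constituents, the target rate per
# well-constituent pair is about 0.0021; the table look-up then uses
# w* = nE (the number of annual evaluations), not w.
```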

The same approach presented in Section 19.3 is used if a mixture of test methods is needed (e.g., parametric prediction limits for some constituents, and non-parametric limits for other constituents). By construction, the target SWFPR is evenly proportioned across the list of monitored constituents. As long as the significance level per constituent (interwell case) or per well-constituent pair (intrawell case) is computed using all c constituents and not just those for which a non-parametric prediction limit test will be applied, the SWFPR will not exceed α = 0.10 on an annual basis.

Tables 19-19 through 19-24 in Appendix D use the same bold, italicized, or plain text to identify 'good', 'acceptable', and 'low' power ratings (based on the ERPC 3 and 4 standard deviation reference criteria) as the parametric prediction limit tables.

As a final technical note about these tables, the significance levels listed as table entries are presented using a short-hand notation in order to compactly present a wide range of false positive rates. In this notation, the first four non-zero digits of the significance level are given, followed, if necessary, by the symbol –d. The value d represents the number of leading zeros to the right of the decimal point. This is equivalent to taking the non-zero portion of the entry and multiplying it by 10^–d to get the actual significance level. As an example, if the entry is .4251–4, the equivalent significance level is .00004251. Entries without the –d symbol are the actual fractional significance levels, and no adjustment is needed.
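The shorthand convention can be decoded mechanically. A small illustrative helper (not from the guidance), which also normalizes the en dash used in the printed tables to a plain hyphen:

```python
def decode_entry(entry):
    """Convert a shorthand table entry such as '.4251-4' (mantissa .4251
    with 4 leading zeros after the decimal point) to the actual
    significance level. Entries without '-d' are returned unchanged."""
    entry = entry.replace("\u2013", "-")  # en dash -> hyphen
    if "-" in entry:
        mantissa, d = entry.split("-")
        return float(mantissa) * 10 ** (-int(d))
    return float(entry)
```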

For network configurations (number of wells [w] and background sample size [n]) not listed in Tables 19-19 through 19-24 in Appendix D, bilinear interpolation can be used to approximate the significance level associated with the desired configuration. As discussed in Section 19.3, interpolation should be restricted to the closest four adjacent table entries. The shorthand significance level notations in the tables should first be converted to actual fractions before interpolating.
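Bilinear interpolation among the four closest entries can be sketched as follows. The table mapping in the usage comment is hypothetical; real entries would come from Tables 19-19 through 19-24 after converting the shorthand notation to actual fractions.

```python
def bilinear(n, w, n_lo, n_hi, w_lo, w_hi, table):
    """Bilinear interpolation among the four closest table entries.
    `table` maps (n, w) pairs to significance levels expressed as
    actual fractions (shorthand entries already converted)."""
    tn = (n - n_lo) / (n_hi - n_lo)
    tw = (w - w_lo) / (w_hi - w_lo)
    # interpolate along n at each bracketing w, then along w
    at_w_lo = (1 - tn) * table[(n_lo, w_lo)] + tn * table[(n_hi, w_lo)]
    at_w_hi = (1 - tn) * table[(n_lo, w_hi)] + tn * table[(n_hi, w_hi)]
    return (1 - tw) * at_w_lo + tw * at_w_hi

# Hypothetical entries for (n, w) combinations bracketing n = 25, w = 15:
entries = {(20, 10): 0.004, (30, 10): 0.006, (20, 20): 0.002, (30, 20): 0.004}
approx = bilinear(25, 15, 20, 30, 10, 20, entries)
```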

19.4.1 TESTING INDIVIDUAL FUTURE VALUES

BACKGROUND AND REQUIREMENTS

The Unified Guidance recommends two variations of non-parametric prediction limits for use in groundwater detection monitoring. The first is the prediction limit for individual future values, introduced in Section 18.3.1. The other is the prediction limit for future medians, detailed in Section 18.3.2. Basic requirements for non-parametric prediction limits are outlined in those sections.

The main advantage to a prediction limit for future values is its overall flexibility and ease of implementation. Fewer data from each compliance well are needed to implement the test compared to a prediction limit for a future median. Only an initial observation from each compliance point may be needed to identify a well-constituent pair 'in-bounds'; initial exceedances can be followed by up to a maximum of three additional individual resamples. Once the non-parametric upper prediction limit has been selected from background as a large order statistic (often the maximum or second-largest value), each compliance point measurement is compared directly against this upper limit.

The user should decide which retesting scheme to use and how many resamples per well are feasible, given that the measurements from any well during a given evaluation period need to be statistically independent. Tables 19-19 through 19-22 in Appendix D can be employed to compare the achievable false positive rates of different schemes and to determine whether they exhibit adequate effective power. The user can also explore EPA Region VIII’s Optimal Rank Values Calculator software to consider order statistics other than the maximum or second-largest.

PROCEDURE

Step 1. For an interwell test, use the number of monitoring constituents (c) in equation [19.16] to determine the target per-constituent false positive rate (αconst). Also multiply the number of yearly statistical evaluations (nE) by the actual number of compliance wells (w) to determine the look-up table entry, w*. Then, depending on the background sample size n and w*, choose a type of non-parametric prediction limit (i.e., maximum or 2nd highest value in background) and a retesting scheme for individual observations using Tables 19-19 through 19-22 in Appendix D. The final plan should have an achieved significance level no greater than αconst and also should be labeled with ‘acceptable’ or ‘good’ power in the Appendix tables.

Step 2. For an intrawell test, use the number of constituents (c) and the actual number of compliance wells (w) in equation [19.17] to compute the target significance level per well-constituent pair (αw-c). Set w* in the look-up table equal to the number of yearly evaluations, nE. Based on w* = nE and the intrawell background sample size n, choose a non-parametric prediction limit and retesting scheme so that the achieved well-constituent pair significance level (i.e., the selected table entry) does not exceed the target significance level, αw-c, and also is labeled with ‘acceptable’ or ‘good’ statistical power.


Step 3. Sort the background data into ascending order and set the upper prediction limit equal to an appropriate order statistic of the data (e.g., the maximum or the second-largest observed value). If all constituent measurements in a background sample are non-detect, use the Double Quantification rule in Chapter 6. The constituent should not be included in calculations for identifying the target false positive rate.

Step 4. Collect one initial measurement per compliance well. Then compare each initial measurement against the upper prediction limit. Depending on the retesting scheme chosen, for any compliance point value that exceeds the limit, collect one to three additional resamples from that well. Again compare the resamples against the upper prediction limit.

Step 5. Identify any well with an initial exceedance as potentially contaminated when either (1) all resamples using a 1-of-2, 1-of-3, or 1-of-4 plan also exceed the prediction limit, or (2) at least two resamples exceed the limit using a modified California retesting scheme. Conversely, declare a well to have ‘passed’ the test if either (1) the initial measurement does not exceed the prediction limit, (2) any resample from a 1-of-m scheme does not exceed the limit, or (3) at least 2 of 3 resamples from a modified California approach do not exceed the limit.
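The pass/fail logic of Steps 4 and 5 can be expressed compactly. The sketch below is illustrative (function names are not from the guidance); the modified California version assumes all three resamples are in hand, whereas in practice sampling may stop as soon as the outcome is decided.

```python
def one_of_m_result(initial, resamples, limit):
    """1-of-m plan: a well passes if the initial value or any resample is
    at or below the prediction limit; it fails only if all exceed it."""
    if initial <= limit:
        return "pass"
    return "pass" if any(r <= limit for r in resamples) else "fail"

def mod_cal_result(initial, resamples, limit):
    """Modified California plan: after an initial exceedance, the well
    passes only if at least 2 of 3 resamples are at or below the limit."""
    if initial <= limit:
        return "pass"
    in_bounds = sum(r <= limit for r in resamples)
    return "pass" if in_bounds >= 2 else "fail"
```

Using the mercury values from Example 19-5 (limit .28 ppb), CW-2's initial .36 followed by resamples .41 and .28 passes a 1-of-3 plan, while resamples .41, .28, .45 fail the modified California plan, matching the results tabulated there.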

19.4.2 TESTING FUTURE MEDIANS

BACKGROUND AND REQUIREMENTS

Prediction limits for a future median, based on either a single test or one with a repeat (1-of-1 or 1-of-2 tests), are the two non-parametric procedures recommended as retesting methods in the Unified Guidance. Compared to a prediction limit for future individual values, the prediction of a median (Chapter 18) often requires more data to be collected from each compliance well, particularly if resampling is included. Slightly greater statistical manipulation is also needed once the data are in hand. For the 1-of-1 test, the initial median to be predicted requires at least two initial observations from each compliance point, and any resample medians will require additional sets of up to three measurements, all of which need to be statistically independent.

Given equal amounts of data and the same input conditions, a prediction limit for a future median tends to be more statistically powerful than a prediction limit for individual values. This is true whether one uses a fixed order statistic or selects across a range of order statistics to form the prediction limit. Because of this and the fact that both spatial variability and autocorrelation may be less of a problem (or at least less easily assessed) when the detection rate is low and a non-parametric strategy is needed, the Unified Guidance includes Appendix D tables for both a 1-of-1 scheme and a 1-of-2 scheme to predict medians of order 3. The 1-of-2 median test will have a lower achievable false positive rate than the 1-of-1 version, with all other conditions equal.

Depending on the number of annual evaluations and the test configuration, care needs to be taken that potentially needed samples are far enough apart in time. The series of observations from any well is assumed to be uncorrelated. If autocorrelation is a problem, a prediction limit for future values (Section 19.4.1) should be considered in which the per-well sampling requirements with explicit retesting are more modest.


PROCEDURE

Step 1. For an interwell test, use the number of monitoring constituents (c) in equation [19.16] to determine the target per-constituent false positive rate (αconst). Also multiply the number of yearly statistical evaluations (nE) by the actual number of compliance wells (w) to determine the look-up table margin value, w*. Then, depending on the background sample size n and w*, choose a type of non-parametric prediction limit (i.e., maximum or 2nd highest value in background) and a retesting scheme for future medians using Tables 19-23 to 19-24 in Appendix D. The final plan should have an achieved significance level no greater than αconst, and also should be labeled with ‘acceptable’ or ‘good’ power in the Appendix tables.

Step 2. For an intrawell test, use the number of constituents (c) and the actual number of compliance wells (w) in equation [19.17] to compute the target significance level per well-constituent pair (αw·c). Set w* in the look-up table margin equal to the number of yearly evaluations, nE. Based on w* = nE and the intrawell background sample size (n), choose a non-parametric prediction limit and retesting scheme for future medians so that the achieved well-constituent pair significance level (i.e., the selected table entry) does not exceed the target significance level, αw·c, and also is labeled with ‘acceptable’ or ‘good’ statistical power.

Step 3. Sort background into ascending order and set the upper prediction limit equal to a large background order statistic (e.g., the maximum or second-largest value). If all constituent measurements in a background sample are non-detect, use the Double Quantification rule in Chapter 6. The constituent should not be included in calculations identifying the target false positive rate.

Step 4. Collect two initial measurements per compliance well. If neither exceeds the upper prediction limit, the test passes, since the median of order 3 will also not exceed the limit; there is no need to collect the third initial observation or any resamples. If both exceed the prediction limit, the median will also exceed the limit, and there is no need to collect the third initial measurement. If using a 1-of-1 plan, move to Step 5. Otherwise, collect up to three resamples in order to assess the resample median.

If one initial measurement is above and one below the limit, collect a third observation to determine the position of the median relative to the prediction limit. In all cases, if two or more of the compliance point observations are non-detect, set the median equal to the quantification level (QL).

Step 5. Compare the median value for each compliance well against the upper prediction limit. If a 1-of-2 retesting scheme is selected and any compliance point median exceeds the limit, collect up to three additional resamples from that well. Compute the resample median and compare this value to the upper prediction limit.

Identify a compliance well as potentially contaminated when either the initial median exceeds the upper prediction limit for a 1-of-1 plan, or both the initial median and the resample median exceed the prediction limit in a 1-of-2 plan. Conversely, declare a well to have passed the test if the initial median does not exceed the prediction limit, or the resample median in a 1-of-2 scheme does not exceed it.
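The sequential logic of Step 4, where the third observation is collected only when the first two disagree, can be sketched as follows (illustrative helper, not from the guidance):

```python
def median_vs_limit(first, second, third, limit):
    """Position of a median of order 3 relative to the prediction limit,
    per Step 4: the third value is consulted only when the first two
    measurements fall on opposite sides of the limit."""
    if first <= limit and second <= limit:
        return "below"   # median of 3 cannot exceed the limit
    if first > limit and second > limit:
        return "above"   # median of 3 must exceed the limit
    # one above and one below: the third observation decides
    return "above" if third > limit else "below"
```

With the Example 19-5 limit of .28 ppb, CW-1's first two values (.22, .20) settle the comparison without a third sample, while CW-2's (.36, .41) immediately place the median above the limit.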


►EXAMPLE 19-5

The following trace mercury data have been collected in the past year from a site with four background wells and 10 compliance wells (two of which are shown below). The facility must monitor for five constituents, including mercury. Assuming that the percentage of non-detects in background is too high to make a parametric analysis appropriate or feasible, compare interwell non-parametric prediction limits for both observations and medians at the annual statistical evaluation, and determine whether either compliance well indicates significant evidence of mercury contamination. Further assume that the sequentially reported compliance well data below are obtained as needed for the different test comparisons.

Mercury Concentrations (ppb)

Event   BG-1   BG-2   BG-3   BG-4   CW-1   CW-2
1       .21    <.2    <.2    <.2    .22    .36
2       <.2    <.2    .23    .25    .20    .41
3       <.2    <.2    <.2    .28    <.2    .28
4       <.2    .21    .23    <.2    .25    .45
5       <.2    <.2    .24    <.2    .24    .43
6                                   <.2    .54

SOLUTION

Step 1. Using a target SWFPR of 10%, compute the target per-constituent false positive rate, noting that the monitoring list consists of five parameters. Using equation [19.16], this implies that αconst = 1 − (1 − .1)^(1/5) = .021. Since the detection rate in background is only 35%, it is reasonable to consider non-parametric prediction limits with retesting. The background sample size of n = 20 is to be used to construct an interwell prediction limit for all w = 10 compliance wells. Since there is only one annual evaluation (nE = 1), the look-up table margin value of w* equals w × nE = 10.

Step 2. Determine potentially applicable retesting plans. First consider non-parametric prediction limits for individual observations with n = 20 and w* = 10. Consulting Tables 19-19 through 19-22 in Appendix D, only the 1-of-3, 1-of-4, and modified California plans meet (i.e., do not exceed) the target false positive rate of 2.1%. To use the 1-of-3 or modified California plans, the prediction limit needs to be set to the maximum background measurement. In the 1-of-4 plan, the prediction limit can be set to either the maximum or second-highest value in background using the Appendix D tables. A final 1-of-4 plan determined with the Optimal Rank Values Calculator allows the use of the 3rd highest value. All of these plans boast good power compared to the annual ERPC. Both the 1-of-4 and modified California schemes may require as many as 3 separate and independent resamples in addition to the initial observation.

Consider tests for future medians of order 3 in Tables 19-23 and 19-24 in Appendix D. Only the 1-of-2 plan using the maximum background value as the prediction limit meets the αconst target. It also has good power, but requires 3 initial measurements and up to 3 additional individual resamples.


Step 3. Sort the combined background data and compute the possible prediction limits as PL(n) = .28 ppb, PL(n–1) = .25 ppb, and PL(n–2) = .24 ppb, respectively representing the maximum, second-largest, and third-largest background values.
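These order statistics can be recomputed directly from the pooled background of the four BG wells. In the illustrative sketch below, non-detects (<.2 ppb) are represented at the 0.2 quantification level, which has no effect on the upper order statistics used as limits.

```python
# Pooled mercury background (ppb) from Example 19-5; <.2 entries set to 0.2.
background = [0.21, 0.2, 0.2, 0.2, 0.2,      # BG-1, events 1-5
              0.2, 0.2, 0.2, 0.21, 0.2,      # BG-2
              0.2, 0.23, 0.2, 0.23, 0.24,    # BG-3
              0.2, 0.25, 0.28, 0.2, 0.2]     # BG-4

ordered = sorted(background, reverse=True)
pl_max, pl_2nd, pl_3rd = ordered[0], ordered[1], ordered[2]
print(pl_max, pl_2nd, pl_3rd)  # prints: 0.28 0.25 0.24
```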

Step 4. Determine the test outcomes at each compliance well using the various retesting plans, as shown in the table below. For the prediction limits on individual observations, the first sample collected during Event 1 is used as the initial screen to determine if any resampling is necessary. For the median test, the first 3 measurements at each compliance well are used to form the initial comparison. The median at CW-1 is .20 ppb, while that at CW-2 is .36 ppb.

Compliance   Retesting     Achieved   # Initial   Resamples   BG      Result
Well         Plan          α          Samples     Required    Limit
CW-1         1-of-3        .0055      1           0           .28     Pass
             1-of-4, Max   .0009      1           0           .28     Pass
             1-of-4, 2nd   .0046      1           0           .25     Pass
             1-of-4, 3rd   .0135      1           0           .24     Pass
             Mod-Cal       .0140      1           0           .28     Pass
             1-of-2, Med   .0060      3           0           .28     Pass
CW-2         1-of-3        .0055      1           2           .28     Pass
             1-of-4, Max   .0009      1           2           .28     Pass
             1-of-4, 2nd   .0046      1           3           .25     Fail
             1-of-4, 3rd   .0135      1           3           .24     Fail
             Mod-Cal       .0140      1           3           .28     Fail
             1-of-2, Med   .0060      3           3           .28     Fail

All of the acceptable plans indicate that CW-1 is not statistically different from background, although more initial sampling is required for the 1-of-2 retesting plan with medians. For CW-2, the results are more problematic. The 1-of-3 and 1-of-4 plans based on the maximum background value allow the well to pass, while the other four plans indicate a significant difference from background. The least degree of sampling is required by the 1-of-3 plan; at some facilities, greater sampling efforts may not be feasible. When a well is likely to be contaminated, the number of samples required to actually make a decision about the well is similar across the plans, with the exception of the 1-of-2 prediction limit on a median.
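The pass/fail logic of the 1-of-m plans on individual observations can be sketched as follows. This is an illustration only; the numeric resample values in the usage line are hypothetical, not the actual CW-2 data:

```python
def one_of_m_result(initial, resamples, limit, m):
    """Return ('Pass' or 'Fail', number of resamples used) for a 1-of-m plan:
    the well passes if the initial sample, or any of up to m-1 independent
    resamples, falls at or below the background prediction limit."""
    if initial <= limit:              # initial screen passes outright
        return "Pass", 0
    used = 0
    for x in resamples[: m - 1]:      # up to m-1 resamples allowed
        used += 1
        if x <= limit:
            return "Pass", used
    return "Fail", used

# Hypothetical: initial .36 ppb exceeds PL = .28 ppb; the second resample passes
print(one_of_m_result(0.36, [0.30, 0.25], 0.28, 3))  # ('Pass', 2)
```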

A further consideration is that although the power of each plan exceeds the annual ERPC when additional resampling is possible, it is helpful to compare the full power curves of multiple plans to determine whether a particular plan offers greater power than the rest. Figure 19-2 displays an overlay of the six power curves associated with the retesting plans in this example. For these inputs, the 1-of-2 retesting plan for a median of order 3 using the background maximum and the 1-of-4 plan on individual observations using the 3rd highest background value achieve the best overall power (shown as a single curve on Figure 19-2).


Figure 19-2. Comparison of Full Power Curves


As seen in Figure 19-2, the two plans that pass the second compliance well have visibly lower power — especially in the range of 2 to 3.5 standard deviations above background — than the four plans that failed CW-2. In such a situation, the user needs to carefully balance the risks and benefits of each acceptable resampling plan. In some cases, the cost of greater amounts of resampling may be outweighed by the added sensitivity of the test to evidence of groundwater contamination. ◄


CHAPTER 20. MULTIPLE COMPARISONS USING CONTROL CHARTS

20.1 INTRODUCTION TO CONTROL CHARTS ........................................ 20-1
20.2 BASIC PROCEDURE ....................................................... 20-2
20.3 CONTROL CHART REQUIREMENTS AND ASSUMPTIONS ............................ 20-6
  20.3.1 Statistical Independence and Stationarity ......................... 20-6
  20.3.2 Sample Sizes and Updating Background .............................. 20-8
  20.3.3 Normality and Non-Detect Data ..................................... 20-9
20.4 CONTROL CHART PERFORMANCE CRITERIA .................................... 20-11
  20.4.1 Control Charts with Multiple Comparisons .......................... 20-12
  20.4.2 Retesting in Control Charts ....................................... 20-14

This chapter describes control charts, a second recommended core strategy for detection monitoring. Control charts are a useful and powerful alternative to prediction limits. The Unified Guidance is the first EPA document to discuss retesting and simultaneous testing of multiple wells and/or constituents as they relate to control charts. Research on these topics is still ongoing.

20.1 INTRODUCTION TO CONTROL CHARTS

Control charts are a viable alternative to parametric prediction limits for testing groundwater in detection monitoring. They are similar to prediction limits for future observations in that a control chart limit is estimated from background and then compared to a sequence of compliance point measurements. If any of these values exceeds the control limit, there is initial evidence that the compliance point concentrations exceed background.

Control charts can be constructed as either interwell or intrawell tests. The main difference is how background is defined and what measurements are utilized to build the control limit. Interwell control charts establish the control limit from designated upgradient and potentially other background wells. Intrawell control charts, on the other hand, employ historical measurements from a compliance point well as background. Intrawell tests can only be appropriately applied if the historical compliance well background is uncontaminated.

An advantage of control charts over prediction limits is that a control chart graphs the compliance data over time. Certain varieties can also evaluate gradual increases above background over the period of monitoring. Trends and changes in concentration levels can be easily seen since the sample observations are consecutively plotted on the chart. This provides the analyst an historical overview of the pattern of measurement levels. Prediction limits are typically constructed to allow only point-in-time comparisons between the most recent compliance data and background, making long-term trends more difficult to identify.1

1 Long-term results from repeated application of a prediction limit can be plotted over time, creating a graph similar in nature to a control chart. But this has been infrequently done in practice.


As a well-established statistical methodology, there are many kinds of control charts. Historically, control charts have been put to great use in quality engineering and manufacturing, but have more recently been adapted for use in groundwater monitoring. The specific control chart recommended in the Unified Guidance is known as a combined Shewhart-CUSUM control chart (Lucas, 1982). It is a ‘combined’ chart because it simultaneously utilizes two separate control chart evaluation procedures. The Shewhart portion is almost identical to a prediction limit in that compliance measurements are individually compared against a background limit. The cumulative sum [CUSUM] portion sequentially analyzes each new measurement with prior compliance data. Both portions are used to assess the similarity of compliance data to background in detection monitoring.

The Shewhart-CUSUM control chart works as follows. Appropriate background data are first collected from the specific compliance well for intrawell comparisons or from separate background wells for interwell tests. The baseline parameters for the chart, estimates of the mean and standard deviation, are obtained from these background data. These baseline measurements characterize the expected background concentrations at compliance wells.

As future compliance observations are collected, the baseline parameters are used to standardize the newly gathered data. After these measurements are standardized and plotted, a control chart is declared out-of-control if future concentrations exceed the baseline control limit. This is indicated on the control chart when either the Shewhart or CUSUM plot traces begins to exceed a control limit. The limit is based on the rationale that if the well remains uncontaminated as it was during the baseline period, new standardized observations should not deviate substantially from the baseline mean. If a release occurs, the standardized values will deviate significantly from baseline and tend to exceed the control limit. The historical baseline parameters then no longer accurately represent current well concentration levels.

Combined Shewhart-CUSUM control charts initially featured two control limits, one for testing the Shewhart portion of the chart, one for testing the CUSUM portion of the chart. Later research on control charts (Davis, 1999; Gibbons, 1999) indicated that having separate control limits for the Shewhart and CUSUM procedures is generally not important. Both control chart traces can instead be compared to a single control limit. This modification not only makes the control chart method slightly easier to apply, but also aids in measuring the statistical performance of control charts over a variety of monitoring networks.

20.2 BASIC PROCEDURE

The basic procedure for constructing a control chart is presented below. Requirements and assumptions for control charts are discussed in later sections.

Step 1. Given n background measurements (x_jB), estimate the baseline parameters by computing the sample mean (x̄_B) and standard deviation (s_B).

Step 2. For a compliance point measurement (x_i) collected on sampling event T_i, compute the standardized concentration Z_i:


Z_i = (x_i − x̄_B) / s_B          [20.1]

Step 3. For each sampling event T_i, use the standardized concentrations from Step 2 to compute the standardized CUSUM S_i. Set S_0 = 0 when computing the first CUSUM S_1.

S_i = max[0, (Z_i − k) + S_{i−1}]          [20.2]

The notation max[A, B] in equation [20.2] refers to picking the maximum of quantities A and B. Furthermore, the parameter k designates half the displacement or shift in standard deviations that should be quickly detected on a control chart. Often k is set equal to 1, meaning that the control chart will be designed to rapidly detect upward concentration shifts of at least two standard deviations. Since Z_i is standardized by the estimated baseline standard deviation, an increase of r units in Z_i corresponds to an increase of r standard deviations above the baseline mean in the domain of concentrations x_i.

Step 4. To plot the control chart in concentration units, compute the non-standardized CUSUMs S_i^c with the equation:

S_i^c = x̄_B + S_i · s_B          [20.3]

Step 5. Calculate the non-standardized control limit used to assess compliance of both future measurements (x_i) and non-standardized CUSUMs (S_i^c). Traditionally, two parameters were used to compute standardized limits: the decision interval value (h) and the Shewhart Control Limit (SCL). The Unified Guidance instead recommends only one standardized control limit (h). Compute the non-standardized control limit (h_c) as:

h_c = x̄_B + h · s_B          [20.4]

Step 6. Construct the control chart by plotting both the compliance measurements (x_i) and the non-standardized CUSUMs (S_i^c) on the y-axis against the sampling events T_i along the x-axis. Also draw a horizontal line at the concentration value equal to the control limit, h_c.

Step 7. Moving forward in time from the first plotted sampling event T_1, declare the control chart to be potentially out-of-control if either of two situations occurs: 1) the trace of compliance concentrations exceeds h_c; or 2) the CUSUMs become too large, exceeding h_c. The first case signifies a rapid increase in concentration level among the most recent sample data. The second can represent either a sudden rise in concentration levels or a gradual increase over time. A gradual increase or trend is particularly indicated if the CUSUM exceeds the control limit but the compliance concentrations do not. The reason is that several consecutive small increases in x_i will not trigger the Shewhart comparison, but may accumulate into a large enough increase in the CUSUM. As such, a control chart can indicate the onset of either sudden or gradual contamination at the compliance point.
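As a minimal sketch, Steps 1 through 7 can be collected into one routine. The function and variable names are illustrative, not prescribed by the guidance, and the data are assumed to be normal or already normalized:

```python
from statistics import mean, stdev

def shewhart_cusum(background, compliance, h=5.0, k=1.0):
    """Combined Shewhart-CUSUM chart values per equations [20.1]-[20.4]."""
    x_bar, s = mean(background), stdev(background)   # Step 1: baseline parameters
    h_c = x_bar + h * s                              # Step 5: control limit, eq. [20.4]
    S = 0.0                                          # Step 3: S_0 = 0
    cusums, out_of_control = [], []
    for x in compliance:
        z = (x - x_bar) / s                          # Step 2: standardize, eq. [20.1]
        S = max(0.0, (z - k) + S)                    # Step 3: CUSUM update, eq. [20.2]
        s_c = x_bar + S * s                          # Step 4: de-standardize, eq. [20.3]
        cusums.append(s_c)
        out_of_control.append(x > h_c or s_c > h_c)  # Step 7: compare both traces
    return h_c, cusums, out_of_control
```

Plotting the compliance measurements and the returned CUSUMs against the sampling events, with a horizontal line at h_c, completes Step 6.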


EXAMPLE 20-1

For background nickel data collected during 8 months in 1995 shown below, construct an intrawell control chart and compare it with the first 8 months of the compliance period (1996):

Nickel Concentration (ppb)

Month   Baseline Period (1995)   Compliance Period (1996)
1       32.8                     19.0
2       15.2                     34.5
3       13.5                     17.8
4       39.6                     23.6
5       37.1                     34.8
6       10.4                     28.8
7       31.9                     43.7
8       20.6                     81.8

SOLUTION

Step 1. As discussed in Section 20.3.3, control charts are a parametric procedure requiring normal or normalized data. Test the n = 8 baseline measurements for normality. A probability plot of these data provided in Figure 20-1 exhibits a mostly linear trend. The Shapiro-Wilk test statistic computed for these data is W = 0.896. Compared to the α = .10 level critical point of w_.10,8 = 0.851 (Table 10-3 of Appendix D), the Shapiro-Wilk test indicates that the baseline data are approximately normal. Construct the control chart using the original nickel measurements.

Figure 20-1. Probability Plot of Baseline Nickel Data

(Probability plot: normal quantiles, −2 to 2, on the y-axis versus nickel concentration, 10 to 40 ppb, on the x-axis.)


Step 2. Use the 1995 baseline nickel data to compute the sample mean and standard deviation: x̄_B = 25.14 ppb and s_B = 11.518 ppb. Then compute the standardized concentration Z_i for each 1996 compliance period sampling event using equation [20.1]. These values are listed in the fourth column of the table below.

Month   T_i   Nickel (ppb)   Z_i     Z_i − k   S_i    S_i^c
1       1     19.0           −0.53   −1.53     0.00   25.14
2       2     34.5           0.81    −0.19     0.00   25.14
3       3     17.8           −0.64   −1.64     0.00   25.14
4       4     23.6           −0.13   −1.13     0.00   25.14
5       5     34.8           0.84    −0.16     0.00   25.14
6       6     28.8           0.32    −0.68     0.00   25.14
7       7     43.7           1.61    0.61      0.61   32.16
8       8     81.8           4.92    3.92      4.53   77.31

Step 3. Compute the standardized CUSUMs as follows. First let the shift displacement parameter k = 1 and set S_0 = 0. After subtracting k from each Z_i, calculate the CUSUM using equation [20.2]. Note that none of the CUSUMs is positive until the first occurrence of a positive quantity (Z_i − k). As shown in the sixth column above, the standardized CUSUMs for the 6th, 7th and 8th events are calculated as:

S_6 = max[0, (0.32 − 1) + 0] = 0
S_7 = max[0, (1.61 − 1) + 0] = 0.61
S_8 = max[0, (4.92 − 1) + 0.61] = 4.53

Step 4. Calculate the non-standardized CUSUMs (S_i^c) using the individual CUSUMs S_i and the baseline mean and standard deviation parameters in equation [20.3]. These values are listed in the last column of the table above. For the 8th sampling event, this calculation gives:

S_8^c = 25.14 + 11.518 (4.53) = 77.31 ppb

Step 5. Compute the non-standardized control limit using equation [20.4]. For purposes of this example, set h = 5; the non-standardized limit becomes:

h_c = 25.14 + 11.518 (5) = 82.73 ppb

Step 6. Using the compliance period nickel concentrations and the non-standardized CUSUMs, plot the control chart as in Figure 20-2. The combined chart indicates there is insufficient evidence of groundwater contamination in 1996, because neither the nickel concentrations nor the CUSUM statistics exceed the control limit for the months examined. However, both traces nearly exceed h_c, and conceivably might do so in future sampling events if the apparent trend continues. If that were to happen, retesting can be performed to better determine whether the increase was one or a series of chance fluctuations or an actual mean-level change in nickel concentrations. ◄

Figure 20-2. Shewhart-CUSUM Control Chart for Nickel Measurements

(Control chart: nickel and CUSUM traces in ppb, 0 to 100, on the y-axis versus sampling events 1 to 8 on the x-axis; the h-limit is drawn at 82.73 ppb.)

20.3 CONTROL CHART REQUIREMENTS AND ASSUMPTIONS

As with other statistical methods, control charts are based on certain assumptions about the sample data. There are also some minimum requirements for constructing them. None of the assumptions or requirements are unique to control charts, although there are some special issues.

20.3.1 STATISTICAL INDEPENDENCE AND STATIONARITY

The methodology for control charts assumes that the sample data are statistically independent. A control chart can give misleading results if consecutive sample measurements are serially correlated (i.e., autocorrelated). For this reason, it is important to design a sampling plan so that distinct volumes of groundwater are analyzed at each sampling event (Section 14.3.1). Duplicate laboratory analyses (i.e., aliquot or field splits) should also not be treated as independent observations when constructing a control chart. Gibbons (1999) recommends that control chart observations be collected no more frequently than quarterly. Since physical independence does not generally guarantee statistical independence (Section 14.1), a test of autocorrelation using the sample autocorrelation function or rank von Neumann ratio tests (Section 14.2) should be performed to determine whether the current sampling interval affords uncorrelated measurements.

If the background data exhibit a clear seasonal cyclical pattern, the values should be deseasonalized before computing the control chart baseline parameters. For a seasonal pattern at a single well, the method of Section 14.3.3.1 can be used to create adjusted measurements having a stable mean. For a group of wells exhibiting a common seasonal pattern, the adjusted values can be computed using a one-way analysis of variance [ANOVA] for temporal effects (Section 14.3.3.2). When baseline data are deseasonalized, it is essential that newly collected compliance measurements also be deseasonalized in the same manner. It is presumed that the same pattern or physical cause will impact future data in the same manner as for the baseline measurements.

To deseasonalize compliance point measurements, simply use the seasonal and grand means estimated from background in computing the adjusted compliance point values. If the control chart remains in control following deseasonalization, the existing background can be updated with the newer measurements. However, the revised background set should be checked again for seasonality, and the seasonal and grand means re-computed, in order to more accurately adjust future measurements.
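As a sketch of that adjustment, assuming the standard form of Section 14.3.3.1 (adjusted value = measurement − seasonal mean + grand mean); the quarterly mean values shown are hypothetical:

```python
def deseasonalize(x, season, seasonal_means, grand_mean):
    """Adjust a new measurement using the seasonal and grand means
    estimated from background (x_adj = x - seasonal mean + grand mean)."""
    return x - seasonal_means[season] + grand_mean

# Hypothetical background estimates (ppb) for four quarterly seasons
seasonal_means = {"winter": 8.0, "spring": 12.0, "summer": 14.0, "fall": 10.0}
grand_mean = 11.0

# A 15.0 ppb summer measurement adjusts downward, removing the seasonal high
print(deseasonalize(15.0, "summer", seasonal_means, grand_mean))  # 12.0
```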

Control charts also assume that the background mean is stationary over time. This means there should be no apparent upward or downward trend in the background measurements. A trend imparts greater-than-expected variation to the background data, increasing the baseline standard deviation and ultimately the control limit. The net result is a control chart that has less power to identify groundwater contamination. Tests for trend described in Chapter 17 can be used to check the assumption of no background trends. Should an upward or downward trend be verified, the background data should not be de-trended. While it is possible to construct and use a control chart with de-trended background and future data, the assumption that the trend will continue indefinitely is very problematic. The trend should first be investigated to ensure that background has been properly designated. Other monitoring wells should be checked to see if the same trend is occurring, indicating either evidence of an earlier release or possibly a sitewide change in the aquifer. In any case, a switch should be made to a trend test rather than a control chart.

As noted, control charts can be employed as either interwell or intrawell tests. However, interwell control charts require a spatially stationary mean across the monitoring network. If spatial variability exists among background wells for certain constituents, interwell control charts will be no more interpretable than prediction limits. A related problem can plague intrawell control charts if there is prior spatial variability (i.e., some compliance wells are already contaminated prior to selection of intrawell backgrounds). Historical observations should be used as baseline data in intrawell tests only if the compliance wells are known to be unaffected by a release from the monitored unit. Otherwise, the control limit based on the greater-than-expected background values may be set too high to identify current contamination.


20.3.2 SAMPLE SIZES AND UPDATING BACKGROUND

Both background mean and standard deviation estimates are needed to construct a control chart limit. The Unified Guidance recommends at least n = 8 measurements for defining the baseline, particularly to ensure an accurate standard deviation estimate. Baseline observations are traditionally not plotted on the chart, although it may be visually helpful to include background values on the plot using a distinct symbol (e.g., hollow instead of filled symbols).

Whether baseline observations are obtained from upgradient background wells for interwell testing or from individual compliance well historical data for intrawell use, these data are only small random samples used to estimate the true background population characteristics. Any particular sample set may not be adequately representative. Because of this likelihood, the background sample size requirements suggested above for constructing a control chart should be regarded as a minimum. More background observations should preferably be added to the initial set to improve the characterization of the background distribution.

For interwell control charts, periodic updating of background (Chapter 5) poses no difficulty. New observations should be collected at background wells on each sampling event. Then, every 1-2 years, the newly collected background should be added to the existing background pool after testing/checking for statistical similarity. The revised background can be used to re-compute the baseline parameters and, in turn, the control limit.

Updating background for intrawell control charts depends on the control chart remaining ‘in-control’ for several consecutive sampling events. As long as a confirmed exceedance does not occur, the in-control compliance measurements collected since the last background update can be tested against the existing background for statistical similarity using a Student's t- or Wilcoxon rank-sum test (Section 5.3). ASTM Standard D6312-98 (1999) recommends testing the newly revised background set for trends, using trend tests such as those in Chapter 17. The ASTM methodology is intended to avoid incorporating a subtle trend into the control chart background, which would influence the re-computed baseline parameters and weaken the statistical power of the control chart to identify contaminant releases.

If the comparison of recent in-control measurements against existing background indicates a statistically significant difference, it may reflect changes in natural groundwater conditions unrelated to contamination events. In these circumstances, it is possible to update background by creating a ‘moving window.’ The background sample size n remains fixed, with only the most recent n measurements included as background for computing baseline parameters. Earlier sampling events are excluded. The overriding goal is to ensure that background reflects the most current and representative groundwater conditions (Chapter 3).

Despite the apparent benefits, the statistical performance of control charts is only partially known when background is periodically updated. Davis (1999) has performed the most extensive simulations of this question. He suggests that substantially different simulation results occur with the CUSUM portion when background is periodically updated (especially early on) and combined with either a small maximum run length or a ‘warm-up’ period or both (see Section 20.4.1).


Two other issues affect both control charts and prediction limits when updating intrawell background. First, if background is periodically augmented by adding new measurements (either from upgradient background wells or from recent in-control compliance measurements), the overall background sample size is increased. This in turn should cause the prediction or control chart limit to decrease.

For instance, prediction limit tables in Chapter 19 demonstrate that as the background sample size increases, lower prediction limit κ−multipliers are appropriate. The expanded background sample is used to re-compute the prediction limit, provided that the measurements added to background do not indicate an adverse change in groundwater quality. New compliance measurements are then tested against the revised prediction limit. But the same cannot be done with control charts unless the CUSUM is reset to zero. The reason is that the CUSUM will have already been affected by those compliance measurements now being added to intrawell background. An independent comparison between compliance point values and background is thus precluded. Consequently, the Unified Guidance recommends that the CUSUM portion of the control chart be reset after each periodic update of intrawell background.2

The second issue is how to update intrawell background when an initial measurement has exceeded the control or prediction limit, but one or more resamples disconfirm the exceedance. Routine detection monitoring continues in this situation. No confirmed exceedance is registered for a prediction limit test and the control chart remains in-control. Should the initial exceedance be included or excluded when later updating intrawell background?

The Unified Guidance recommends a strategy parallel to the handling of outliers (Chapter 12). If the exceedance can be shown to be a measurement in error or a confirmed outlier, it should be excluded from the revised background. Otherwise, any disconfirmed exceedances (including any resamples that exceed the background limit but are disconfirmed by other resamples) should probably be included when updating the background. The reason is that background limits designed to incorporate retesting are computed as low as possible to ensure adequate statistical power. The trade-off is that compliance measurements legitimately similar to background but drawn from the upper tail of the distribution sometimes exceed the limit and have to be disconfirmed with a resample. Any exceedance not documented as an error or outlier is most likely representative of some portion of the background population that previously had gone unsampled or unobserved.

20.3.3 NORMALITY AND NON-DETECT DATA

The combined Shewhart-CUSUM control chart is a parametric procedure. This implies that background used to estimate the baseline parameters should either be normal or normalized via a transformation. Normality can be tested on either the raw measurement or transformed scale using one of the goodness-of-fit techniques described in Chapter 10. If the hypothesis of normality is accepted,

2 The same ‘overlapping’ dependence between the CUSUM and revised background will also be true when background is updated using a ‘moving window’ approach. The CUSUM should therefore be reset in these cases too. However, since the background sample size is kept fixed, the standardized control limit (h) will not decrease as it does when background is augmented.

construct the control chart on the raw measurements. If it is rejected, try a transformation and retest the transformed data for normality. If the transformation works to normalize background, construct the control chart on the transformed measurements, being sure to use the same transformation on both background and the compliance values to be plotted.

Unlike prediction limits, no non-parametric version of the combined Shewhart-CUSUM control chart exists. If the background sample cannot be normalized perhaps due to a large fraction of non- detects, a non-parametric prediction limit should be considered (Section 19.4). Control charts will be most appropriate for those constituents with a reasonably high detection frequency. These include many inorganic constituents (e.g., certain trace elements, indicators and geochemical monitoring parameters) that occur naturally in groundwater, or for other persistently detected, site-specific organic chemicals.

If no more than 10-15% of the data are non-detect, it may be possible to normalize the data via simple substitution (Section 15.2) of half the reporting limit [RL] for each background non-detect. A normalizing transformation can sometimes be found using a censored probability plot (Chapter 15) for background data containing a substantial fraction of non-detects, up to 50%. A censored estimation technique such as Kaplan-Meier or Robust Regression on Order Statistics [Robust ROS] (Chapter 15) can then be used to compute estimates of the baseline mean (μ̂_B) and standard deviation (σ̂_B) that account for the left-censored measurements. These adjusted estimates should replace the background sample mean (x̄_B) and standard deviation (s_B) in the control chart equations of Section 20.2. The Unified Guidance differs somewhat from the recommended approach in ASTM Standard D6312-98 (ASTM, 1999), which is to set all non-detects identically to zero.

No matter how background non-detects are treated, control charts require an additional step for future observations that is not needed with prediction limits. Each new compliance point measurement must be added to the CUSUM associated with previous sampling events. If the new observation is a non-detect, some value (typically a fraction of the RL) needs to be imputed for the censored measurement in order to update the CUSUM. The Unified Guidance recommends that half the RL be substituted for these measurements.3

3 If an intrawell control chart is constructed and it remains ‘in-control’ until the next background update, any non-detects observed in the meantime should be treated as left-censored measurements for purposes of updating the baseline mean and standard deviation estimates. In other words, the simple substitution of RL/2 should only apply temporarily to compute an updated CUSUM.
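As an illustrative sketch (function and variable names are ours), the CUSUM-update step with the recommended RL/2 substitution might look like:

```python
def update_cusum(value, nondetect, rl, x_bar_b, s_b, s_prev, k=1.0):
    """One CUSUM update per equation [20.2], imputing half the reporting
    limit (RL/2) when the new compliance measurement is a non-detect."""
    x = rl / 2.0 if nondetect else value  # guidance-recommended substitution
    z = (x - x_bar_b) / s_b               # standardize against baseline
    return max(0.0, (z - k) + s_prev)

# A non-detect with RL = 10 ppb against the Example 20-1 baseline
# (mean 25.14 ppb, sd 11.518 ppb) drives the CUSUM back to zero:
s_new = update_cusum(None, True, 10.0, 25.14, 11.518, 0.5)
print(s_new)  # 0.0
```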


20.4 CONTROL CHART PERFORMANCE CRITERIA

A significant difference exists between control charts and prediction limits in setting statistical performance criteria. Standard equations described in previous chapters allow the user to generate an exact confidence level (1 − α) for prediction limits. Obtaining similar confidence levels for Shewhart-CUSUM control charts must be done experimentally, by varying the control limit (h) and the displacement parameter (k), as well as the retesting options. The control chart parameter limits in the two previous EPA RCRA statistical guidance documents were based on work by Lucas (1982), Hockman & Lucas (1987), and Starks (1988). Monte Carlo simulations for various combinations of control chart parameters (without retests) were used to develop the overall recommendations in their papers.

The specific parameter choices were not fixed, but appeared to work best in simulations at a single well. Starks (1988) recommended setting h = 5 and k = 1 for standardized measurements, especially in the early stages of monitoring. He further suggested that after 12 consecutive in-control measurements, the baseline mean and standard deviation be updated to include more recent sampling measurements. The values of k and SCL (the separate Shewhart control limit) could then be reduced to k = 0.75 and SCL = 4.0. In effect, this tightens the control chart limits to reflect that additional data are available to better characterize the baseline population.

More recent research (notably Gibbons, 1999) has demonstrated that control charts from the literature do not account for several important characteristics of groundwater monitoring networks. The most important is the problem of multiple comparisons (i.e., the need to simultaneously conduct testing of many well-constituent pairs during an evaluation period described in Chapter 6). Control chart performance is typically assessed on an individual well basis, rather than over a network of simultaneous tests. The recommended control limits have no obvious connection to the expected false positive rate (α), nor is the traditional control limit adjustable like the κ-factor in prediction limits. There is a need to account for differences in background sample sizes, a desired false positive rate, and the number of monitoring network tests in similar fashion to prediction limits. Moreover, early research and guidance did not address the issue of retesting in control charts. Retesting provides substantial improvements in prediction limit performance, and its potential needs to be evaluated for control charts.

It is standard practice to discuss the performance of prediction limits in terms of statistical power and false positive rates. However, statistical performance of control charts is usually measured via the average run length [ARL]. The ARL is the average number of sampling events before the control limit is first exceeded, identifying an ‘out-of-control’ process. Ideally, the ARL should be large when the mean concentration of the tested constituent is at or near the baseline average, but become increasingly small as the true mean is shifted above baseline.

Put in standard statistical terms, the control chart should not easily or quickly signal false evidence of a release when none has occurred. To maintain a low false positive rate when the null hypothesis of no contamination is true, the chart should stay ‘in-control’ for a long time, as indicated by a large ARL. Conversely, the statistical power for detecting a release when one occurs should be as high as possible; a short ARL then indicates that the chart is quickly determined to be out-of-control.

20-11 March 2009 Chapter 20. Control Charts Unified Guidance

False positive rates (α) for CUSUM control charts cannot be equated precisely with ARLs. But it has been found that the ARLs closely follow a geometric distribution pattern with a mean equal to (1/α). Thus, a control chart with an ARL of 100 would have an associated false positive rate of roughly 1%. The relationship is not exact, especially for combined Shewhart-CUSUM control charts. It is also affected by the randomness in the background data used to establish the control chart baseline.
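The approximate ARL ≈ 1/α relationship can be illustrated with a short simulation: if each sampling event independently carries probability α of a false exceedance, the run length follows a geometric distribution with mean 1/α. The code below is a schematic illustration of that geometric pattern only, not a simulation of an actual Shewhart-CUSUM chart:

```python
import random

random.seed(0)
alpha = 0.01  # assumed per-event false positive probability

def run_length():
    """Number of events until the first false exceedance (geometric)."""
    n = 1
    while random.random() > alpha:
        n += 1
    return n

# Average over many simulated runs; the mean run length is near 1/alpha.
arl = sum(run_length() for _ in range(20000)) / 20000
print(arl)  # close to 1/alpha = 100
```

As the text notes, for combined Shewhart-CUSUM charts the true relationship departs from this idealized independence assumption, and the randomness in the estimated baseline further loosens the correspondence.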

Thus, the Unified Guidance offers a new framework for measuring control chart statistical performance. It is suggested that false positive rates for control charts be measured over an established time frame or run length of interest, specifically a period of one year. A false positive is counted if the chart has a confirmed exceedance at some point during the year, under the assumption of no contaminant release. Statistical power is similarly evaluated over a fixed time interval (e.g., one year) by measuring the proportion of run lengths with confirmed exceedances during that interval. In this way, both the false positive rate and power are tied to a specific one-year time frame.

This framework is consistent with the guidance recommendations that prediction limit performance be measured according to an annual, cumulative 10% site-wide false positive rate [SWFPR] and that cumulative, annual effective power be comparable to the EPA reference power curves [ERPC]. The suggested framework for control charts allows a direct comparison with prediction limits when designing alternate statistical approaches.

20.4.1 CONTROL CHARTS WITH MULTIPLE COMPARISONS

Until recently, control charts were not designed to address the SWFPR when testing multiple well-constituent pairs. Furthermore, it was not clear how a user should adjust for multiple tests using fixed control limits (SCL, k and h). Because of these problems, Gibbons (1999) performed a series of Monte Carlo simulations to gauge intrawell control chart performance for up to 500 simultaneous tests. Gibbons also examined the outcomes when the single Shewhart and CUSUM decision limit was allowed to vary among h = {4.5, 5.0, 5.5, and 6.0}. He found that control charts could be designed with both high power and a low SWFPR, as long as retesting was incorporated into the methodology.

Additional Monte Carlo simulation work was performed by Davis (1999). He found that control charts perform similarly to prediction limits when both use retests. But he also noted that certain favorable outcomes in Gibbons (1999) were the result of combining frequent updating of background and a ‘warm-up’ period for the chart. In the latter period, any control limit exceedances were ignored. The simulations were based on small maximum run lengths.

Other researchers have noted (for instance, Luceño and Puig-Pey, 2000) that the run length distribution of CUSUM control charts is often close to geometric. This implies that even when the ARL is large, there can be significant probability of an early failure. The difficulty in a real-life setting is that one will not know whether an early exceedance of the control limit is due to contaminated groundwater or simply a false positive exceedance for an otherwise in-control chart. This guidance recommends against the use of ‘warm-up’ periods when implementing or assessing the performance of Shewhart-CUSUM control charts.


Gibbons (1999) provides results for a number of control chart limit options, but does not determine limits which can provide exact false positive rate control. A number of commonly applied retesting strategies are also not evaluated. In contrast, both Gibbons (1994) and the Unified Guidance (Chapter 19) do provide such control for prediction limits using a wider array of retesting strategies.

Facilities may need to conduct their own specific Monte Carlo simulations if the published literature options cannot be applied at their site. Simulations might be needed for intrawell control charts, interwell control charts, or both. Overall methodologies for Monte Carlo simulations are provided below. The first step for either type of test is a simulation of the cumulative annual false positive rate. A second simulation then measures the cumulative, annual statistical power.

To perform an intrawell simulation, repeat the following steps for a large number of simulations (e.g., Nsim = 10,000):

1. Determine the total number of well-constituent pairs for which statistical testing is required, as well as the number of pairs at which intrawell control charts will be constructed. Use the basic subdivision principle (Section 19.2.1) to determine the per-test false positive rate (α_test) associated with each control chart that meets the target SWFPR.

2. Determine the intrawell background sample size (n). Generate n standard normal measurements. Then form baseline estimates by computing the sample mean (x̄_B) and standard deviation (s_B).

3. Pick a set of possible standardized control limits (h). Choose a maximum run length (M), based on the number of sampling events conducted each year (e.g., M = 4 for quarterly sampling).

4. For each potential control limit (h), compute the non-standardized control limit using equation [20.4]. Then simulate the behavior of the control chart from sampling event 1 to sampling event M by generating standard normal compliance measurements for each event. Generate enough random measurements to account for resamples potentially needed with a selected retesting strategy.

5. Test the initial measurement associated with each sampling event against the non-standardized control limit. Also form the CUSUM for events 1 to M using equations [20.2] and [20.3]. Compare the non-standardized CUSUM against the control limit.

6. If either the initial measurement or the CUSUM exceeds h_c, use the resample(s) for that sampling event to perform a retest (see below). If the retest confirms the initial exceedance, record a false positive for that particular simulation (out of Nsim).

7. After all Nsim runs have been conducted, compute the observed false positive rate (α_h) associated with each possible standardized control limit (h) by dividing the number of observed false positives by Nsim. Set the final control limit equal to that value of h for which α_h is closest to α_test.
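The steps above can be sketched in code. The sketch below assumes a 1-of-2 retesting plan, k = 1, standardized comparisons against h, and the option of resetting the CUSUM to its prior value after an unconfirmed exceedance; the function name and default parameters are illustrative, not prescribed by the guidance:

```python
import random
import statistics

def simulate_fpr(h, n=8, M=4, k=1.0, nsim=2000, seed=42):
    """Cumulative annual false positive rate for one intrawell
    Shewhart-CUSUM chart, assuming in-control N(0,1) data and a
    1-of-2 retest of any initial exceedance (a sketch of steps 2-6)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        bg = [rng.gauss(0.0, 1.0) for _ in range(n)]   # step 2: baseline
        m, s = statistics.mean(bg), statistics.stdev(bg)
        cusum = 0.0
        for _ in range(M):                             # steps 4-5: M events
            z = (rng.gauss(0.0, 1.0) - m) / s          # standardized event
            prev = cusum
            cusum = max(0.0, (z - k) + cusum)
            if z > h or cusum > h:                     # initial exceedance
                zr = (rng.gauss(0.0, 1.0) - m) / s     # step 6: resample
                if zr > h or max(0.0, (zr - k) + prev) > h:
                    hits += 1                          # confirmed: false positive
                    break
                cusum = prev  # unconfirmed: reset CUSUM to its prior value
    return hits / nsim

# Step 7: search a candidate grid for the h whose rate is nearest the target.
rates = {h: simulate_fpr(h) for h in (4.0, 4.5, 5.0, 5.5, 6.0)}
```

A larger h lowers the simulated rate, so the final control limit is the candidate whose observed rate falls closest to the per-test target.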

The simulation for an interwell control chart is similar to the intrawell case, with a few key differences. First, instead of a per-test false positive rate, the basic subdivision principle must be used to compute a per-constituent false positive rate (α_const). The reason is that the same background measurements for a given constituent are used to test each of the compliance wells in the network.


Secondly, when generating standard normal compliance point measurements in Step 4 of the intrawell simulation, a set of such random observations needs to be generated for each of the w wells in the network. The behavior of w control charts must be simulated using a common set of background data and single control limit for each one.

Once a control limit meeting the target SWFPR has been established, a second Monte Carlo simulation is run to determine the statistical power of the control chart. Since effective power is defined as the ability to flag a single contaminated well-constituent pair, the basic steps are the same for either interwell or intrawell control charts. Repeat the following over a large number of simulations (Nsim).

1. Determine the background sample size (n). Generate n standard normal measurements. From these, form baseline estimates by computing the sample mean (x̄_B) and standard deviation (s_B).

2. Using the standardized control limit (h) chosen in the first Monte Carlo simulation, compute a non-standardized control limit using equation [20.4]. Then simulate the behavior of the control chart from sampling event 1 to sampling event M by generating sets of normal N(Δ,1) compliance measurements for each event, where Δ varies from 1 to 5 by unit steps. Generate enough random measurements in each set to account for resamples potentially needed with a selected retesting strategy.

3. For each set of successively higher-valued compliance measurements, test the initial measurement associated with each sampling event against the non-standardized control limit. Also form the CUSUM for events 1 to M using equations [20.2] and [20.3]. Compare the non-standardized CUSUM against the control limit.

4. If either the initial measurement or the CUSUM exceeds h_c, use the resample(s) for that sampling event to perform a retest (see below). If the retest confirms the initial exceedance, record a true detection for that particular mean level Δ and simulation (out of Nsim).

5. After all Nsim runs have been conducted, compute the observed power (1 – β) associated with each true mean level (Δ) by dividing the number of observed detections by Nsim. The simulated effective power curve for standardized control limit (h) is a plot of (1 – β) versus Δ for Δ = 1 to 5.
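The power simulation uses the same machinery, with the compliance data shifted upward by Δ standard deviations. As before, this sketch assumes a 1-of-2 retest, k = 1, and standardized comparisons; names and defaults are illustrative only:

```python
import random
import statistics

def simulate_power(h, delta, n=8, M=4, k=1.0, nsim=2000, seed=7):
    """Fraction of simulated years in which a true mean shift of
    delta baseline standard deviations is flagged and confirmed
    (a sketch of steps 1-4)."""
    rng = random.Random(seed)
    detections = 0
    for _ in range(nsim):
        bg = [rng.gauss(0.0, 1.0) for _ in range(n)]   # step 1: baseline
        m, s = statistics.mean(bg), statistics.stdev(bg)
        cusum = 0.0
        for _ in range(M):                             # steps 2-3: M events
            z = (rng.gauss(delta, 1.0) - m) / s        # shifted compliance data
            prev = cusum
            cusum = max(0.0, (z - k) + cusum)
            if z > h or cusum > h:                     # step 4: retest
                zr = (rng.gauss(delta, 1.0) - m) / s
                if zr > h or max(0.0, (zr - k) + prev) > h:
                    detections += 1                    # confirmed detection
                    break
                cusum = prev
    return detections / nsim

# Step 5: the simulated effective power curve over unit steps of delta.
curve = [simulate_power(5.0, d) for d in range(1, 6)]
```

The resulting curve can then be compared against the appropriate EPA reference power curve for the site's sampling frequency.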

If the standardized control limit identified during Monte Carlo simulation has effective power comparable to the appropriate ERPC (matching the site-specific sampling frequency to one of the three curves in Chapter 6: quarterly, semi-annual, or annual), h can be used to form site-specific control limits. For interwell limits, compute the (upgradient) background mean and standard deviation for each monitoring constituent and use equation [20.4] to form the final, non-standardized control limits. For intrawell limits, use the same equation but with intrawell background at each well-constituent pair.

20.4.2 RETESTING IN CONTROL CHARTS

Control chart and prediction limit tests are only practical for most monitoring networks if retesting is part of the procedure, as demonstrated by both Gibbons (1999) and Davis (1999). A key issue is deciding how control chart retesting should be conducted. Practical retesting strategies for prediction limits on future observations are described in Section 19.1, including both 1-of-m (for m = 2, 3, 4) and modified California plans.

ASTM Standard D6312-98 (1999) recommends a 1-of-2 retesting strategy: whenever an exceedance of the control limit occurs on a given sampling event, the next quarterly sampling event is used as the resample. Furthermore, if the exceedance is not confirmed by the resample, the ASTM standard recommends that the initial exceedance be replaced in the CUSUM by the follow-up sampling event, thus implicitly assuming that the initial observation was an error.

Gibbons (1999) considers the performance of other retesting plans, including 1-of-2, 1-of-3, and the original Cal-3 plan (see Section 19.1 and Appendix B). For each plan, resampling is triggered when the most recent observation either by itself exceeds or causes the CUSUM to exceed the limit. Then, each resample (if more than one) is compared against h. The initial exceedance measurement is removed from the CUSUM computation, replaced by the resample, and then re-compared to the control limit. A statistically significant increase [SSI] is declared only if the resample verifies the initial exceedance (or both resamples for a 1-of-3 plan).

Gibbons' study and ASTM Standard D6312-98 raise an important concern: what is the most statistically powerful treatment of the CUSUM when an initial exceedance is not confirmed by retesting? A second concern addresses when resamples should be collected.

The Unified Guidance suggests two practical possibilities to address the first concern. The initial exceedance can be removed from the CUSUM altogether, resetting the CUSUM to its value from the previous sampling event. As noted above, this essentially assumes the first sampling event was in error. Another option is to replace the initial exceedance with the first resample which disconfirms the exceedance, and then re-compute the CUSUM with that resample.
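The two options can be stated compactly in code. The function names are hypothetical, and the recursion is assumed to follow the standard Shewhart-CUSUM form of equations [20.2] and [20.3]:

```python
def update_cusum(prev, z, k=1.0):
    """Standard CUSUM recursion: S_i = max(0, (z_i - k) + S_{i-1})."""
    return max(0.0, (z - k) + prev)

def cusum_after_failed_retest(prev, z_resample, mode="discard", k=1.0):
    """The two guidance options for an unconfirmed exceedance.

    'discard': reset the CUSUM to its value before the flagged event,
               treating the initial measurement as an error.
    'replace': recompute the CUSUM using the disconfirming resample
               in place of the initial exceedance.
    """
    return prev if mode == "discard" else update_cusum(prev, z_resample, k)

# Suppose a flagged event raised the CUSUM from 1.5, and the resample
# (z = 0.4) does not confirm the exceedance.  The two options give:
print(cusum_after_failed_retest(1.5, 0.4, "discard"))  # 1.5
print(cusum_after_failed_retest(1.5, 0.4, "replace"))  # max(0, (0.4 - 1) + 1.5) = 0.9
```

The 'replace' option retains some of the resample's information in the CUSUM, while 'discard' returns the chart exactly to its pre-exceedance state.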

In either strategy, the effects on statistical power and accuracy should be simulated when constructing site-specific control limits as in the procedure outlined above. Both the false positive rate and power depend on a faithful simulation of all aspects of the control chart testing procedure. This includes background sample size, the number of well-constituent pairs evaluated, the retesting strategy and how the CUSUM is adjusted for resampling.

The second issue concerns when resamples should be collected. The Unified Guidance does not recommend using the next scheduled sampling event as a resample. If the exceedance were due to a laboratory analytical error or calculation mistake, a more quickly retrieved resample can resolve the discrepancy without waiting until the next quarterly or semi-annual monitoring event.

Where multiple resamples are used (a 1-of-3 plan, for instance), one would have to wait two additional sampling rounds simply to collect the resamples. These in turn could not be plotted on the control chart as regular sampling events without intermingling the roles of resamples and non-resamples, thereby complicating the interpretation and assessment of control chart performance. The common guidance recommendation is to identify an intermediate period or periods for resampling between regularly scheduled evaluations for both control charts and prediction limits.


