Guidance for Data Quality Assessment

Guidance for Data Quality Assessment: Practical Methods for Data Analysis
United States Environmental Protection Agency, Office of Environmental Information, Washington, DC 20460
EPA/600/R-96/084, EPA QA/G-9, QA00 Update, July 2000

FOREWORD

This document is the 2000 (QA00) version of the Guidance for Data Quality Assessment, which provides general guidance to organizations on assessing data quality criteria and performance specifications for decision making. The Environmental Protection Agency (EPA) has developed the Data Quality Assessment (DQA) Process so that project managers and planners can determine whether the type, quantity, and quality of data needed to support Agency decisions have been achieved. This guidance is the culmination of experience in the design and statistical analysis of environmental data in different Program Offices at the EPA. Many elements of prior guidance, statistics, and scientific planning have been incorporated into this document.

This document is distinctly different from other guidance documents; it is not intended to be read in a linear or continuous fashion. Rather, it is meant to be used as a "tool-box" of useful techniques for assessing the quality of data. Its overall structure enables the analyst to investigate many different problems using a systematic methodology.

This document is one of a series of quality management guidance documents that the EPA Quality Staff has prepared to assist users in implementing the Agency-wide Quality System. Other related documents include:

EPA QA/G-4 Guidance for the Data Quality Objectives Process
EPA QA/G-4D DEFT Software for the Data Quality Objectives Process
EPA QA/G-4HW Guidance for the Data Quality Objectives Process for Hazardous Waste Site Investigations
EPA QA/G-9D Data Quality Evaluation Statistical Toolbox (DataQUEST)

This document is intended to be a "living document" that will be updated periodically to incorporate new topics and revisions or refinements to existing procedures. Comments received on this 2000 version will be considered for inclusion in subsequent versions. Please send your written comments on Guidance for Data Quality Assessment to:

Quality Staff (2811R)
Office of Environmental Information
U.S. Environmental Protection Agency
1200 Pennsylvania Avenue, NW
Washington, DC 20460
Phone: (202) 564-6830
Fax: (202) 565-2441
E-mail: [email protected]

TABLE OF CONTENTS

INTRODUCTION
0.1 PURPOSE AND OVERVIEW
0.2 DQA AND THE DATA LIFE CYCLE
0.3 THE 5 STEPS OF DQA
0.4 INTENDED AUDIENCE
0.5 ORGANIZATION
0.6 SUPPLEMENTAL SOURCES

STEP 1: REVIEW DQOs AND THE SAMPLING DESIGN
1.1 OVERVIEW AND ACTIVITIES
1.1.1 Review Study Objectives
1.1.2 Translate Objectives into Statistical Hypotheses
1.1.3 Develop Limits on Decision Errors
1.1.4 Review Sampling Design
1.2 DEVELOPING THE STATEMENT OF HYPOTHESES
1.3 DESIGNS FOR SAMPLING ENVIRONMENTAL MEDIA
1.3.1 Authoritative Sampling
1.3.2 Probability Sampling
1.3.2.1 Simple Random Sampling
1.3.2.2 Sequential Random Sampling
1.3.2.3 Systematic Samples
1.3.2.4 Stratified Samples
1.3.2.5 Compositing Physical Samples
1.3.2.6 Other Sampling Designs

STEP 2: CONDUCT A PRELIMINARY DATA REVIEW
2.1 OVERVIEW AND ACTIVITIES
2.1.1 Review Quality Assurance Reports
2.1.2 Calculate Basic Statistical Quantities
2.1.3 Graph the Data
2.2 STATISTICAL QUANTITIES
2.2.1 Measures of Relative Standing
2.2.2 Measures of Central Tendency
2.2.3 Measures of Dispersion
2.2.4 Measures of Association
2.2.4.1 Pearson's Correlation Coefficient
2.2.4.2 Spearman's Rank Correlation Coefficient
2.2.4.3 Serial Correlation Coefficient
2.3 GRAPHICAL REPRESENTATIONS
2.3.1 Histogram/Frequency Plots
2.3.2 Stem-and-Leaf Plot
2.3.3 Box and Whisker Plot
2.3.4 Ranked Data Plot
2.3.5 Quantile Plot
2.3.6 Normal Probability Plot (Quantile-Quantile Plot)
2.3.7 Plots for Two or More Variables
2.3.7.1 Plots for Individual Data Points
2.3.7.2 Scatter Plot
2.3.7.3 Extensions of the Scatter Plot
2.3.7.4 Empirical Quantile-Quantile Plot
2.3.8 Plots for Temporal Data
2.3.8.1 Time Plot
2.3.8.2 Plot of the Autocorrelation Function (Correlogram)
2.3.8.3 Multiple Observations Per Time Period
2.3.9 Plots for Spatial Data
2.3.9.1 Posting Plots
2.3.9.2 Symbol Plots
2.3.9.3 Other Spatial Graphical Representations
2.4 PROBABILITY DISTRIBUTIONS
2.4.1 The Normal Distribution
2.4.2 The t-Distribution
2.4.3 The Lognormal Distribution
2.4.4 Central Limit Theorem

STEP 3: SELECT THE STATISTICAL TEST
3.1 OVERVIEW AND ACTIVITIES
3.1.1 Select Statistical Hypothesis Test
3.1.2 Identify Assumptions Underlying the Statistical Test
3.2 TESTS OF HYPOTHESES ABOUT A SINGLE POPULATION
3.2.1 Tests for a Mean
3.2.1.1 The One-Sample t-Test
3.2.1.2 The Wilcoxon Signed Rank (One-Sample) Test
3.2.1.3 The Chen Test
3.2.2 Tests for a Proportion or Percentile
3.2.2.1 The One-Sample Proportion Test
3.2.3 Tests for a Median
3.2.4 Confidence Intervals
3.3 TESTS FOR COMPARING TWO POPULATIONS
3.3.1 Comparing Two Means
3.3.1.1 Student's Two-Sample t-Test (Equal Variances)
3.3.1.2 Satterthwaite's Two-Sample t-Test (Unequal Variances)
3.3.2 Comparing Two Proportions or Percentiles
3.3.2.1 Two-Sample Test for Proportions
3.3.3 Nonparametric Comparisons of Two Populations
3.3.3.1 The Wilcoxon Rank Sum Test
3.3.3.2 The Quantile Test
3.3.4 Comparing Two Medians
3.4 TESTS FOR COMPARING SEVERAL POPULATIONS
3.4.1 Tests for Comparing Several Means
3.4.1.1 Dunnett's Test

STEP 4: VERIFY THE ASSUMPTIONS OF THE STATISTICAL TEST
4.1 OVERVIEW AND ACTIVITIES
4.1.1 Determine Approach for Verifying Assumptions
4.1.2 Perform Tests of Assumptions
4.1.3 Determine Corrective Actions
4.2 TESTS FOR DISTRIBUTIONAL ASSUMPTIONS
4.2.1 Graphical Methods
4.2.2 Shapiro-Wilk Test for Normality (the W test)
4.2.3 Extensions of the Shapiro-Wilk Test (Filliben's Statistic)
4.2.4 Coefficient of Variation
4.2.5 Coefficient of Skewness/Coefficient of Kurtosis Tests
4.2.6 Range Tests
4.2.7 Goodness-of-Fit
Recommended publications
  • Data Quality: Letting Data Speak for Itself Within the Enterprise Data Strategy
    Data Quality: Letting Data Speak for Itself within the Enterprise Data Strategy
    Collaboration & Transformation (C&T) Shared Interest Group (SIG), Financial Management Committee, DATA Act – Transparency in Federal Financials Project
    Date Released: September 2015

    SYNOPSIS

    Data quality and information management across the enterprise is challenging. Today's information systems consist of multiple, interdependent platforms that are interfaced across the enterprise. Each of these systems, and the business processes they support, often uses and defines the same piece of data differently. How the information is used across the enterprise becomes a critical part of the equation when managing data quality. Letting the data speak for itself is the process of analyzing how data and information are used across the enterprise: how the data is defined, which systems and processes are responsible for authoring and maintaining it, how the information is used to support the agency, and what needs to be done to support data quality and governance. This information also plays a critical role in the agency's capacity to meet the requirements of the DATA Act.

    The American Council for Technology (ACT) is a non-profit educational organization established in 1979 to improve government through the efficient and innovative application of information technology.
  • Here Is an Example Where I Analyze the Lags Needed to Analyze Okun's Law
    Here is an example where I analyze the lags needed to analyze Okun's Law.

    open okun
    gnuplot g u --with-lines --time-series

    These look stationary. Next, we'll take a look at the first 15 autocorrelations for the two series.

    corrgm diff(u) 15
    corrgm g 15

    These are autocorrelated and so are these. I would expect that a simple regression will not be sufficient to model this relationship. Let's try anyway. Run a simple regression and look at the residuals.

    ols diff(u) const g
    series ehat=$uhat
    gnuplot ehat --with-lines --time-series

    These look positively autocorrelated to me. Notice how the residuals fall below zero for a while, then rise above for a while, and repeat this pattern. Take a look at the correlogram. The first autocorrelation is significant. More lags of D.u or g will probably fix this. An LM test of the residuals:

    smpl 1985.2 2009.3
    ols diff(u) const g
    modeltab add
    modtest 1 --autocorr --quiet
    modtest 2 --autocorr --quiet
    modtest 3 --autocorr --quiet
    modtest 4 --autocorr --quiet

    Breusch-Godfrey test for first-order autocorrelation
    Alternative statistic: TR^2 = 6.576105, with p-value = P(Chi-square(1) > 6.5761) = 0.0103
    Breusch-Godfrey test for autocorrelation up to order 2
    Alternative statistic: TR^2 = 7.922218, with p-value = P(Chi-square(2) > 7.92222) = 0.019
    Breusch-Godfrey test for autocorrelation up to order 3
    Alternative statistic: TR^2 = 11.200978, with p-value = P(Chi-square(3) > 11.201) = 0.0107
    Breusch-Godfrey test for autocorrelation up to order 4
    Alternative statistic: TR^2 = 11.220956, with p-value = P(Chi-square(4) > 11.221) = 0.0242

    Yep, that's not too good.
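    The same residual diagnostics can be reproduced outside gretl. Here is a minimal Python sketch using statsmodels' Breusch-Godfrey test; the two series are simulated stand-ins for the okun data (an assumption, since the dataset itself is not reproduced here), so the statistics will differ from the gretl output above.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import acorr_breusch_godfrey

    # Simulated stand-ins for GDP growth (g) and the change in unemployment (du);
    # an AR(1) error term injects the serial correlation the test should detect.
    rng = np.random.default_rng(0)
    n = 120
    g = rng.normal(3.0, 2.0, n)
    e = np.empty(n)
    e[0] = rng.normal(scale=0.2)
    for t in range(1, n):
        e[t] = 0.5 * e[t - 1] + rng.normal(scale=0.2)
    du = -0.03 * g + e

    res = sm.OLS(du, sm.add_constant(g)).fit()
    for lags in range(1, 5):
        lm, lm_pval, _, _ = acorr_breusch_godfrey(res, nlags=lags)
        print(f"BG order {lags}: TR^2 = {lm:.3f}, p-value = {lm_pval:.4f}")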
  • 12.6 Sign Test (Web)
    12.6 Sign Test (Web)

    12.6.1 Introduction

    The Sign Test is a one-sample test that compares the median of a data set with a specific target value. The Sign Test performs the same function as the One-Sample Wilcoxon Test, but it does not rank data values. Without the information provided by ranking, the Sign Test is less powerful (9.4.5) than the Wilcoxon Test. However, the Sign Test is useful for problems involving 'binomial' data, where observed data values are either above or below a target value.

    12.6.2 Sign Test

    The Sign Test aims to test whether the observed data sample has been drawn from a population that has a median value, m, that is significantly different from (or greater/less than) a specific value, m0. The hypotheses are the same as in 12.1.2. We will illustrate the use of the Sign Test in Example 12.14 by using the same data as in Example 12.2.

    Example 12.14: The generation times, t, (5.2.4) of ten cultures of the same micro-organism were recorded.

    Time, t (hr): 6.3  4.8  7.2  5.0  6.3  4.2  8.9  4.4  5.6  9.3

    The microbiologist wishes to test whether the generation time for this micro-organism is significantly greater than a specific value of 5.0 hours.

    In performing the Sign Test, each data value, ti, is compared with the target value, m0, using the following conditions:

    ti < m0: replace the data value with '-'
    ti = m0: exclude the data value
    ti > m0: replace the data value with '+'

    This gives the signs of the differences in Table 12.14 (the tied value, 5.0, is excluded):

    Sign: +  -  +  (excluded)  +  -  +  -  +  +

    We now calculate the values of:

    r+ = number of '+' values: r+ = 6 in Example 12.14
    n = total number of data values not excluded: n = 9 in Example 12.14

    Decision making in this test is based on the probability of observing particular values of r+ for a given value of n.
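    Because each retained value falls above or below the target with probability 1/2 under the null hypothesis, r+ follows a Binomial(n, 0.5) distribution and the p-value can be computed directly. A minimal Python sketch of Example 12.14 follows; using scipy's binomtest here is a tooling assumption, not part of the original text.

    from scipy.stats import binomtest

    target = 5.0
    times = [6.3, 4.8, 7.2, 5.0, 6.3, 4.2, 8.9, 4.4, 5.6, 9.3]

    diffs = [t - target for t in times if t != target]   # the tie (5.0) is excluded
    r_plus = sum(1 for d in diffs if d > 0)              # 6 '+' values
    n = len(diffs)                                       # 9 retained values

    # One-sided test: is the population median greater than 5.0 hours?
    result = binomtest(r_plus, n, p=0.5, alternative="greater")
    print(r_plus, n, result.pvalue)   # p = P(X >= 6 | n = 9) = 130/512 ~ 0.254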
  • SUGI 28: The Value of ETL and Data Quality
    SUGI 28, Data Warehousing and Enterprise Solutions, Paper 161-28
    The Value of ETL and Data Quality
    Tho Nguyen, SAS Institute Inc., Cary, North Carolina

    ABSTRACT

    The adage "garbage in, garbage out" becomes an unfortunate reality when data quality is not addressed. This is the information age and we base our decisions on insights gained from data. If inaccurate data is entered without subsequent data quality checks, only inaccurate information will prevail. Bad data can affect businesses in varying degrees, ranging from simple misrepresentation of information to multimillion-dollar mistakes. In fact, numerous research studies have concluded that poor data quality is the culprit behind many failed data warehousing or Customer Relationship Management (CRM) projects. With the price tags on these high-profile initiatives and the importance of accurate information to business intelligence, improving data quality has become a top management priority.

    Information collection is increasing more than tenfold each year, with the Internet a major driver in this trend. As more and more data is collected, the reality of a multi-channel world that includes e-business, direct sales, call centers, and existing systems sets in. Bad data (i.e., inconsistent, incomplete, duplicate, or redundant data) is affecting companies at an alarming rate, and the dilemma is how to ensure optimal use of corporate data within every application, system, and database throughout the enterprise. Take into consideration the director of data warehousing at a large electronic component manufacturer who realized there was a problem linking information between an inventory database and a customer order database.
  • A Machine Learning Approach to Outlier Detection and Imputation of Missing Data
    Ninth IFC Conference on "Are post-crisis statistical initiatives completed?", Basel, 30-31 August 2018

    A machine learning approach to outlier detection and imputation of missing data
    Nicola Benatti, European Central Bank

    This paper was prepared for the meeting. The views expressed are those of the authors and do not necessarily reflect the views of the BIS, the IFC or the central banks and other institutions represented at the meeting.

    In the era of ready-to-go analysis of high-dimensional datasets, data quality is essential for economists to guarantee robust results. Traditional techniques for outlier detection tend to exclude the tails of distributions and ignore the data generation processes of specific datasets. At the same time, multiple imputation of missing values is traditionally an iterative process based on linear estimations, implying the use of simplified data generation models. In this paper I propose the use of common machine learning algorithms (i.e. boosted trees, cross-validation and cluster analysis) to determine the data generation models of a firm-level dataset in order to detect outliers and impute missing values.

    Keywords: machine learning, outlier detection, imputation, firm data
    JEL classification: C81, C55, C53, D22
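    The abstract gives no code, but the general pattern it describes is easy to sketch: fit a boosted-tree model to the observed part of a variable, flag large out-of-fold residuals as outliers, then use the fitted model to impute the missing entries. Everything below (data, model choice, the 3-sigma threshold) is an illustrative assumption, not the paper's actual specification.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_predict

    # Toy firm-level data: three predictors and a target with ~10% missing values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
    y[rng.random(500) < 0.10] = np.nan

    obs = ~np.isnan(y)
    model = GradientBoostingRegressor(random_state=0)

    # Outlier detection: cross-validated predictions keep each point out of its own fit.
    resid = y[obs] - cross_val_predict(model, X[obs], y[obs], cv=5)
    outliers = np.abs(resid) > 3 * resid.std()   # flag residuals beyond 3 sigma

    # Imputation: fit on the observed rows and predict the missing ones.
    model.fit(X[obs], y[obs])
    y_imputed = y.copy()
    y_imputed[~obs] = model.predict(X[~obs])
    print(outliers.sum(), np.isnan(y_imputed).sum())   # outliers flagged; no NaNs left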
  • Alternative Tests for Time Series Dependence Based on Autocorrelation Coefficients
    Alternative Tests for Time Series Dependence Based on Autocorrelation Coefficients
    Richard M. Levich and Rosario C. Rizzo*
    Current Draft: December 1998

    Abstract: When autocorrelation is small, existing statistical techniques may not be powerful enough to reject the hypothesis that a series is free of autocorrelation. We propose two new and simple statistical tests (RHO and PHI) based on the unweighted sum of autocorrelation and partial autocorrelation coefficients. We analyze a set of simulated data to show the higher power of RHO and PHI in comparison to conventional tests for autocorrelation, especially in the presence of small but persistent autocorrelation. We show an application of our tests to data on currency futures to demonstrate their practical use. Finally, we indicate how our methodology could be used for a new class of time series models (the Generalized Autoregressive, or GAR models) that take into account the presence of small but persistent autocorrelation.

    An earlier version of this paper was presented at the Symposium on Global Integration and Competition, sponsored by the Center for Japan-U.S. Business and Economic Studies, Stern School of Business, New York University, March 27-28, 1997. We thank Paul Samuelson and the other participants at that conference for useful comments.
    * Stern School of Business, New York University and Research Department, Bank of Italy, respectively.

    1. Introduction

    Economic time series are often characterized by positive
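    For intuition, a RHO-style statistic — the unweighted sum of the first m sample autocorrelations — can be sketched in a few lines. The standardization below uses the usual large-sample approximation that each coefficient is roughly N(0, 1/n) under independence; it illustrates the idea only and is not the authors' exact construction or critical values.

    import numpy as np

    def sample_acf(x, m):
        """First m sample autocorrelation coefficients of a series."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        denom = np.dot(x, x)
        return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, m + 1)])

    def rho_stat(x, m):
        """Unweighted sum of the first m autocorrelations, with a rough z-score."""
        n = len(x)
        rho = sample_acf(x, m).sum()
        z = rho / np.sqrt(m / n)   # var of the sum ~ m/n if each r_k ~ N(0, 1/n)
        return rho, z

    # A weakly autocorrelated AR(1) series: individual r_k are small, but their
    # sum accumulates the evidence of dependence.
    rng = np.random.default_rng(0)
    e = rng.normal(size=2000)
    x = np.empty(2000)
    x[0] = e[0]
    for t in range(1, 2000):
        x[t] = 0.2 * x[t - 1] + e[t]
    print(rho_stat(x, 10))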
  • Statistical Characterization of Tissue Images for Detection and Classification of Cervical Precancers
    Statistical characterization of tissue images for detection and classification of cervical precancers
    Jaidip Jagtap (1), Nishigandha Patil (2), Chayanika Kala (3), Kiran Pandey (3), Asha Agarwal (3) and Asima Pradhan (1, 2)*
    (1) Department of Physics, IIT Kanpur, U.P. 208016
    (2) Centre for Laser Technology, IIT Kanpur, U.P. 208016
    (3) G.S.V.M. Medical College, Kanpur, U.P. 208016
    * Corresponding author: [email protected], Phone: +91 512 259 7971, Fax: +91 512 259 0914

    Abstract

    Microscopic images from the biopsy samples of cervical cancer, the current "gold standard" for histopathology analysis, are found to be segregated into differing classes in their correlation properties. Correlation domains clearly indicate increasing cellular clustering in different grades of pre-cancer as compared to their normal counterparts. This trend manifests in the probabilities of pixel value distribution of the corresponding tissue images. Gradual changes in epithelium cell density are reflected well through the physically realizable extinction coefficients. Robust statistical parameters in the form of moments, characterizing these distributions, are shown to unambiguously distinguish tissue types. These parameters can effectively improve the diagnosis and classify quantitatively normal and precancerous tissue sections with a very high degree of sensitivity and specificity.

    Key words: cervical cancer; dysplasia; skewness; kurtosis; entropy; extinction coefficient.

    1. Introduction

    Cancer is a leading cause of death worldwide, with cervical cancer being the fifth most common cancer in women [1-2]. It originates as a few abnormal cells in the initial stage and then spreads rapidly. Treatment of cancer is often ineffective in the later stages, which makes early detection the key to survival. Pre-cancerous cells can sometimes take 10-15 years to develop into cancer, so regular tests such as the pap smear are recommended.
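    As a toy illustration of the distributional features the paper relies on, the sketch below computes moment-based statistics (skewness, kurtosis, entropy) from the pixel-value histogram of an image array. The random "image" and the exact feature set are assumptions for demonstration; the paper's preprocessing and classification steps are not reproduced.

    import numpy as np
    from scipy.stats import entropy, kurtosis, skew

    def moment_features(img, bins=256):
        """Moment-based summary of an 8-bit image's pixel-value distribution."""
        v = np.ravel(img).astype(float)
        counts, _ = np.histogram(v, bins=bins, range=(0, 256))
        p = counts / counts.sum()            # empirical pixel-value distribution
        return {
            "mean": v.mean(),
            "variance": v.var(),
            "skewness": skew(v),             # asymmetry of the distribution
            "kurtosis": kurtosis(v),         # excess kurtosis (0 for a normal)
            "entropy": entropy(p, base=2),   # in bits; higher = more spread out
        }

    # Toy stand-in for a grayscale tissue image.
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(128, 128))
    print(moment_features(img))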
  • Outlier Detection for Improved Data Quality and Diversity in Dialog Systems
    Outlier Detection for Improved Data Quality and Diversity in Dialog Systems
    Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, Jason Mars
    Clinc, Inc., Ann Arbor, MI, USA

    Abstract

    In a corpus of data, outliers are either errors: mistakes in the data that are counterproductive, or are unique: informative samples that improve model robustness. Identifying outliers can lead to better datasets by (1) removing noise in datasets and (2) guiding collection of additional data to fill gaps. However, the problem of detecting both outlier types has received relatively little attention in NLP, particularly for dialog systems. We introduce a simple and effective technique for detecting both erroneous and unique samples in a corpus of short texts using neural sentence embeddings combined with distance-based outlier detection.

    Identifying examples in a dataset that are atypical provides a means of approaching the questions of correctness and diversity, but has mainly been studied at the document level (Guthrie et al., 2008; Zhuang et al., 2017), whereas texts in dialog systems are often no more than a few sentences in length. We propose a novel approach that uses sentence embeddings to detect outliers in a corpus of short texts. We rank samples based on their distance from the mean embedding of the corpus and consider the samples farthest from the mean outliers. Outliers come in two varieties: (1) errors, sentences that have been mislabeled, whose inclusion in the dataset would be detrimental to model performance
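    The paper's core ranking step is straightforward to sketch. Below, random vectors stand in for neural sentence embeddings (an assumption; the authors use a trained sentence encoder), and items are ranked by Euclidean distance from the corpus mean, farthest first.

    import numpy as np

    def rank_by_distance_from_mean(embeddings):
        """Return item indices sorted farthest-to-nearest from the mean embedding."""
        E = np.asarray(embeddings, dtype=float)
        dist = np.linalg.norm(E - E.mean(axis=0), axis=1)
        return np.argsort(dist)[::-1], dist

    # Toy corpus: 100 "sentences" as 300-dim random embeddings, one planted outlier.
    rng = np.random.default_rng(0)
    E = rng.normal(size=(100, 300))
    E[42] += 5.0                       # shift one sample far from the rest
    order, dist = rank_by_distance_from_mean(E)
    print(order[0])                    # 42: the planted outlier ranks first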
  • 3 Autocorrelation
    3 Autocorrelation

    Autocorrelation refers to the correlation of a time series with its own past and future values. Autocorrelation is also sometimes called "lagged correlation" or "serial correlation", which refers to the correlation between members of a series of numbers arranged in time. Positive autocorrelation might be considered a specific form of "persistence", a tendency for a system to remain in the same state from one observation to the next. For example, the likelihood of tomorrow being rainy is greater if today is rainy than if today is dry. Geophysical time series are frequently autocorrelated because of inertia or carryover processes in the physical system. For example, the slowly evolving and moving low pressure systems in the atmosphere might impart persistence to daily rainfall. Or the slow drainage of groundwater reserves might impart correlation to successive annual flows of a river. Or stored photosynthates might impart correlation to successive annual values of tree-ring indices.

    Autocorrelation complicates the application of statistical tests by reducing the effective sample size. Autocorrelation can also complicate the identification of significant covariance or correlation between time series (e.g., precipitation with a tree-ring series). Autocorrelation implies that a time series is predictable, probabilistically, as future values are correlated with current and past values. Three tools for assessing the autocorrelation of a time series are (1) the time series plot, (2) the lagged scatterplot, and (3) the autocorrelation function.

    3.1 Time series plot

    Positively autocorrelated series are sometimes referred to as persistent because positive departures from the mean tend to be followed by positive departures from the mean, and negative departures from the mean tend to be followed by negative departures (Figure 3.1).
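    To make the notion of persistence concrete, this sketch (simulated data, not from the text) compares the lag-1 autocorrelation of white noise with that of an AR(1) series in which each value carries over 70% of the previous one.

    import numpy as np

    def lag1_autocorr(x):
        """Sample lag-1 autocorrelation coefficient."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        return np.dot(x[:-1], x[1:]) / np.dot(x, x)

    rng = np.random.default_rng(0)
    e = rng.normal(size=1000)
    ar1 = np.empty(1000)               # persistent series: x_t = 0.7 x_{t-1} + e_t
    ar1[0] = e[0]
    for t in range(1, 1000):
        ar1[t] = 0.7 * ar1[t - 1] + e[t]

    print(lag1_autocorr(e))            # near 0 for white noise
    print(lag1_autocorr(ar1))          # near 0.7 for the persistent series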
  • Case Study Applications of Statistics in Institutional Research
    CASE STUDY APPLICATIONS OF STATISTICS IN INSTITUTIONAL RESEARCH
    By Mary Ann Coughlin and Marian Pagano

    Number Ten, Resources in Institutional Research. A joint publication of the Association for Institutional Research and the North East Association for Institutional Research.

    © 1997 Association for Institutional Research, 114 Stone Building, Florida State University, Tallahassee, Florida 32306-3038. All rights reserved. No portion of this book may be reproduced by any process, stored in a retrieval system, or transmitted in any form, or by any means, without the express written permission of the publisher. Printed in the United States.

    To order additional copies, contact: AIR, 114 Stone Building, Florida State University, Tallahassee, FL 32306-3038. Tel: 904/644-4470; Fax: 904/644-8824; E-Mail: [email protected]. ISBN 1-882393-06-6

    Table of Contents

    Acknowledgments
    Introduction
    Chapter 1: Basic Concepts
      Characteristics of Variables and Levels of Measurement
      Descriptive Statistics
      Probability, Sampling Theory and the Normal Distribution
    Chapter 2: Comparing Group Means: Are There Real Differences Between Average Faculty Salaries Across Departments?
  • The InStat Guide to Choosing and Interpreting Statistical Tests
    Statistics for biologists
    The InStat Guide to Choosing and Interpreting Statistical Tests
    GraphPad InStat Version 3 for Macintosh
    By GraphPad Software, Inc.
    © 1990-2001 GraphPad Software, Inc. All rights reserved.

    Program design, manual and help screens: Dr. Harvey J. Motulsky, Paige Searle
    Programming: Mike Platt, John Pilkington, Harvey Motulsky

    Macintosh conversion by Software MacKiev, www.mackiev.com
    Project Manager: Dennis Radushev
    Programmers: Alexander Bezkorovainy, Dmitry Farina
    Quality Assurance: Lena Filimonihina
    Help and Manual: Pavel Noga, Andrew Yeremenko

    InStat and GraphPad Prism are registered trademarks of GraphPad Software, Inc. All rights reserved. Manufactured in the United States of America. Except as permitted under the United States copyright law of 1976, no part of this publication may be reproduced or distributed in any form or by any means without the prior written permission of the publisher. Use of the software is subject to the restrictions contained in the accompanying software license agreement.

    How to reach GraphPad: Phone: (US) 858-457-3909; Fax: (US) 858-457-8141; Email: [email protected] or [email protected]; Web: www.graphpad.com; Mail: GraphPad Software, Inc., 5755 Oberlin Drive #110, San Diego, CA 92121 USA. The entire text of this manual is available on-line at www.graphpad.com.

    Contents
    Welcome to InStat
    The InStat approach
  • Statistical Tool for Soil Biology: 11. Autocorrelogram and Mantel Test
    Eur. J. Soil Biol., 1996, 32 (4), 195-203

    Statistical tool for soil biology. XI. Autocorrelogram and Mantel test
    Jean-Pierre Rossi
    Laboratoire d'Ecologie des Sols Tropicaux, ORSTOM/Université Paris 6, 32 av. Varagnat, 93143 Bondy Cedex, France. E-mail: [email protected].
    Received October 23, 1996; accepted March 24, 1997.

    Abstract

    Autocorrelation analysis by Moran's I and Geary's c coefficients is described and illustrated by the analysis of the spatial pattern of the tropical earthworm Chuniodrilus zielae (Eudrilidae). Simple and partial Mantel tests are presented and illustrated through the analysis of various data sets. The interest of these methods for soil ecology is discussed.

    Keywords: Autocorrelation, correlogram, Mantel test, geostatistics, spatial distribution, earthworm, plant-parasitic nematode.

    INTRODUCTION

    Spatial heterogeneity is an inherent feature of soil faunal communities with significant functional … the variance, that appear to be related by a simple power law. The obvious interest of that method is that samples from various sites can be included in the analysis.
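    Moran's I itself is compact enough to sketch. Here is a minimal Python version with an assumed binary neighbour matrix and invented data; an autocorrelogram like the paper's would repeat the computation with weights defined for successive distance classes.

    import numpy as np

    def morans_i(x, w):
        """Moran's I for values x under a spatial weight matrix w (zero diagonal)."""
        x = np.asarray(x, dtype=float)
        z = x - x.mean()
        # (n / sum of weights) * (weighted cross-products) / (total variance)
        return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

    # Five sites along a transect with binary adjacency (neighbour) weights.
    x = np.array([1.0, 2.0, 2.5, 4.0, 5.0])
    w = np.zeros((5, 5))
    for i in range(4):
        w[i, i + 1] = w[i + 1, i] = 1.0
    print(morans_i(x, w))   # positive: neighbouring sites hold similar values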