Statistical Methods for Practitioners (EPA QA/G-9S)
United States Environmental Protection Agency
Office of Environmental Information
Washington, DC 20460
EPA/240/B-06/003
February 2006

Data Quality Assessment: Statistical Methods for Practitioners
EPA QA/G-9S

FOREWORD

This document, Data Quality Assessment: Statistical Methods for Practitioners, provides general guidance to organizations on assessing data quality criteria and performance specifications. The Environmental Protection Agency (EPA) has developed the Data Quality Assessment (DQA) Process for project managers and planners to determine whether the type, quantity, and quality of data needed to support Agency decision making have been achieved. This guidance is the culmination of experiences in the design and statistical analyses of environmental data in different Program Offices at the EPA. Many elements of prior guidance, statistics, and scientific planning have been incorporated into this document.

This document is one of a series of quality management guidance documents that the EPA Quality Staff has prepared to assist users in implementing the Agency-wide Quality System. Other related documents include:

    EPA QA/G-4    Guidance on Systematic Planning Using the Data Quality Objectives Process
    EPA QA/G-5S   Guidance on Choosing a Sampling Design for Environmental Data Collection
    EPA QA/G-9R   Data Quality Assessment: A Reviewer’s Guide

This document provides guidance to EPA program managers and planning teams as well as to the general public as appropriate. It does not impose legally binding requirements and may not apply to a particular situation based on the circumstances. EPA retains the discretion to adopt approaches on a case-by-case basis that differ from this guidance where appropriate.

This guidance is one of the U.S. Environmental Protection Agency Quality System Series documents. These documents describe the EPA policies and procedures for planning, implementing, and assessing the effectiveness of the Quality System.
These documents are updated periodically to incorporate new topics and revisions or refinements to existing procedures. Comments received on this, the 2006 version, will be considered for inclusion in subsequent versions. Please send your comments to:

    Quality Staff (2811R)
    U.S. Environmental Protection Agency
    1200 Pennsylvania Avenue, NW
    Washington, DC 20460
    Phone: (202) 564-6830
    Fax: (202) 565-2441
    E-mail: [email protected]

Copies of the EPA Quality System documents may be downloaded from the Quality Staff Home Page: www.epa.gov/quality.

EPA QA/G-9S iii February 2006

PREFACE

Data Quality Assessment: Statistical Methods for Practitioners describes the statistical methods used in Data Quality Assessment (DQA) to evaluate environmental data sets. DQA is the scientific and statistical evaluation of environmental data to determine if they meet the planning objectives of the project, and thus are of the right type, quality, and quantity to support their intended use. This guidance applies DQA to environmental decision-making (e.g., compliance determinations) and also addresses DQA in environmental estimation (e.g., monitoring programs).

This document is distinctly different from other guidance documents in that it is not intended to be read in a linear or continuous fashion. Instead, it is intended to be used as a "tool-box" of useful techniques for assessing the quality of data. The overall structure of the document will enable the analyst to investigate many problems using a systematic methodology. Each statistical technique examined in the text is presented separately as a series of steps to be taken. The technique is then illustrated with a practical example that follows those steps.

This document is intended for all EPA and extramural organizations that have quality systems based on EPA policies and specifications, and that may periodically assess these quality systems (or have them assessed by EPA) for compliance with the specifications.
In addition, this guidance may be used by other organizations that assess quality systems applied to specific environmental programs.

The guidance provided herein is non-mandatory and is intended to help personnel who have minimal experience with statistical terminology to understand how a technique works and how it may be applied to a problem. An explanation of DQA in plain English may be found in the companion guidance document, Data Quality Assessment: A Reviewer’s Guide (EPA QA/G-9R) (U.S. EPA, 2006b).

TABLE OF CONTENTS

INTRODUCTION

STEP 1: REVIEW DQOs AND THE SAMPLING DESIGN
    1.1 OVERVIEW AND ACTIVITIES

STEP 2: CONDUCT A PRELIMINARY DATA REVIEW
    2.1 OVERVIEW AND ACTIVITIES
    2.2 STATISTICAL QUANTITIES
        2.2.1 Measures of Relative Standing
        2.2.2 Measures of Central Tendency
        2.2.3 Measures of Dispersion
        2.2.4 Measures of Association
            2.2.4.1 Pearson’s Correlation Coefficient
            2.2.4.2 Spearman’s Rank Correlation
            2.2.4.3 Serial Correlation
    2.3 GRAPHICAL REPRESENTATIONS
        2.3.1 Histogram
        2.3.2 Stem-and-Leaf Plot
        2.3.3 Box-and-Whiskers Plot
        2.3.4 Quantile Plot and Ranked Data Plots
        2.3.5 Quantile-Quantile Plots and Probability Plots
        2.3.6 Plots for Two or More Variables
            2.3.6.1 Scatterplot
            2.3.6.2 Extensions of the Scatterplot
            2.3.6.3 Empirical Quantile-Quantile Plot
        2.3.7 Plots for Temporal Data
            2.3.7.1 Time Plot
            2.3.7.2 Lag Plot
            2.3.7.3 Plot of the Autocorrelation Function (Correlogram)
            2.3.7.4 Multiple Observations in a Time Period
            2.3.7.5 Four-Plot
        2.3.8 Plots for Spatial Data
            2.3.8.1 Posting Plots
            2.3.8.2 Symbol Plots and Bubble Plots
            2.3.8.3 Other Spatial Graphical Representations
    2.4 PROBABILITY DISTRIBUTIONS
        2.4.1 The Normal Distribution
        2.4.2 The t-Distribution
        2.4.3 The Lognormal Distribution
        2.4.4 Central Limit Theorem

STEP 3: SELECT THE STATISTICAL METHOD
    3.1 OVERVIEW AND ACTIVITIES
    3.2 METHODS FOR A SINGLE POPULATION
        3.2.1 Parametric Methods
            3.2.1.1 The One-Sample t-test and Confidence Interval or Limit
            3.2.1.2 The One-Sample Tolerance Interval or Limit
            3.2.1.3 Stratified Random Sampling
            3.2.1.4 The Chen Test
            3.2.1.5 Land’s Method for Lognormally Distributed Data
            3.2.1.6 The One-Sample Proportion Test and Confidence Interval
        3.2.2 Nonparametric Methods
            3.2.2.1 The Sign Test
            3.2.2.2