An Introduction to the Data Editing Process


AN INTRODUCTION TO THE DATA EDITING PROCESS

by Dania P. Ferguson
United States Department of Agriculture
National Agricultural Statistics Service

Abstract: A primer on the various data editing methodologies, the impact of their usage, available supporting software, and considerations when developing new software.

I. INTRODUCTION

The intention of this paper is to promote a better understanding of the various data editing methods and the impact of their usage as well as the software available to support the use of the various methodologies.

It is hoped that this paper will serve as:
- a brief overview of the most commonly used editing methodologies,
- an aid to organize further reading about data editing systems,
- a general description of the data editing process for use by designers and developers of generalized data editing systems.

II. REVIEW OF THE PROCESS

Data editing is defined as the process involving the review and adjustment of collected survey data. The purpose is to control the quality of the collected data. This process is divided into four (4) major sub-process areas. These areas are:
- Survey Management
- Data Capture
- Data Review
- Data Adjustment

The rest of this section will describe each of the four (4) sub-processes.

1. Survey Management

The survey management functions are: a) completeness checking, and b) quality control including audit trails and the gathering of cost data. These functions are administrative in nature.

Completeness checking occurs at both survey and questionnaire levels.

At survey level, completeness checking ensures that all survey data have been collected. It is vitally important to account for all samples because sample counts are used in the data expansion procedures that take place during Summary. Therefore, changes in the sample count impact the expansion.
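As a rough illustration of survey-level completeness checking, the sketch below compares the identifiers of sampled units with the identifiers of the questionnaires on file. The function name, the report fields, and the use of string identifiers are assumptions made for this example; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessReport:
    """Result of a survey-level completeness check (hypothetical layout)."""
    sample_count: int
    questionnaire_count: int
    missing_ids: frozenset     # sampled units with no questionnaire on file
    unexpected_ids: frozenset  # questionnaires that match no sampled unit

    @property
    def complete(self) -> bool:
        # The survey is complete only when every sampled unit is accounted for
        # and no questionnaire falls outside the sample.
        return not self.missing_ids and not self.unexpected_ids


def check_survey_completeness(sample_ids, questionnaire_ids) -> CompletenessReport:
    """Compare the sample list with the questionnaires actually on file."""
    sample = frozenset(sample_ids)
    received = frozenset(questionnaire_ids)
    return CompletenessReport(
        sample_count=len(sample),
        questionnaire_count=len(received),
        missing_ids=sample - received,
        unexpected_ids=received - sample,
    )


# Made-up identifiers: unit A03 was sampled but has no questionnaire on file.
report = check_survey_completeness(
    sample_ids=["A01", "A02", "A03", "A04"],
    questionnaire_ids=["A01", "A02", "A04"],
)
print(report.complete, sorted(report.missing_ids))
```

Reporting the identifiers rather than only the two counts is a design choice of this sketch: equal counts could hide one missing unit offset by one stray questionnaire.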
A minimal completeness check compares the sample count to the questionnaire count to ensure that all samples are accounted for, even if no data were collected. In the case of a Census, the number of returned questionnaires is compared to the number of distributed questionnaires or to an estimated number of questionnaires expected to be returned.

Questionnaire-level completeness checking ensures that routing instructions have been followed. Questionnaires should be coded to specify whether the respondent was inaccessible or has refused; this information can be used in verification procedures.

Survey management includes quality control of the data collection process and measures of the impact on the data by the data adjustment that occurs in sub-process No. 4 below. Survey management is a step in the quality control process that assures that the underlying statistical assumptions of a survey are not violated. "Long after methods of calculation are forgotten, the meaning of the principal statistical measures and the assumptions which condition their use should be maintained" (Neiswanger, 1947).

Survey management functions are not data editing functions per se, but many of the functions require accounting and auditing information to be captured during the editing process. Thus, survey management must be integrated in the design of data editing systems.

2. Data Capture

Data capture is the conversion of data to electronic media. The data may be key entered in either a heads-down or heads-up mode.

a. Heads-down data entry refers to data entry with no error detection occurring at the time of entry. High-speed data-entry personnel are used to key data in a "heads-down" mode. Data entered in a heads-down mode are often verified by re-keying the questionnaire and comparing the two keyed copies of the same questionnaire.

b. Heads-up data entry refers to data entry with a review at time of entry.
Heads-up data entry requires subject matter knowledge by the individuals entering the data. Data entry is slower, but data review/adjustment is reduced since simple inconsistencies in responses are found earlier in the survey process. This mode is especially effective when the interviewer or respondent enters data during the interview. This is known as Computer-Assisted Interviewing, which is explained in more detail below.

Data may be captured by many automated methods without traditional key entry. As technology advances, many more tools will become available for data capture. One popular tool is the touch-tone telephone key-pad with synthesized voice computer-administered interview. Optical Character Readers (OCR) may be used to scan questionnaires into electronic form. The use of electronic calipers and other analog measuring devices for Agricultural and Industrial surveys is becoming more commonplace.

The choice of data-entry mode and data adjustment method has the greatest impact on the type of personnel that will be required and on their training.

3. Data Review

Data review consists of both error detection and data analysis.

a. Manual data review may occur prior to data entry. The data may be reviewed and prepared/corrected prior to key-entry. This procedure is more typically followed when heads-down data entry is used.

b. Automated data review may occur in a batch or interactive fashion. It is important to note that data entered in a heads-down fashion may later be corrected in either a batch or an interactive data review process.

- Batch data review occurs after data entry and consists of a review of many questionnaires in one batch. It generally results in a file of error messages. This file may be printed for use in preparing corrections. The data records may be split into two files: one containing the 'good' records and one containing data records with errors. The latter file may be corrected using an interactive process.
- Interactive data review involves immediate review of the questionnaire after adjustments are made. The results of the review are shown on a video display terminal and the data editor is prompted to adjust the data or override the error flag. This process continues until the questionnaire is considered acceptable by the automated review process. Then the results of the automated review of the next questionnaire are presented. A desirable feature of interactive data editing software is to present only questionnaires requiring adjustments.

Computer-Assisted Interviewing (CAI) combines interactive data review with interactive data editing while the respondent is an available source for data adjustment. An added benefit is that data capture (key-entry) occurs at interview time. This method may be used during telephone interviewing and with portable data-entry devices for on-site data collection. CAI assists the interviewer in the wording of questions and tailors succeeding questions based on previous responses. It is a tool to speed the interview and assist less experienced interviewers. CAI has mainly been used in Computer-Assisted Telephone Interviews (CATI), but as technological advances are made in miniaturization of personal computers, more applications will be found in Computer-Assisted Personal Interviewing (CAPI).

c. Data review (error detection) may occur at many levels.

- Item level - Validations at this level are generally named "range checking", since items are validated against a range.
  Example: age must be > 0 and < 120.
  In more complex range checks the range may vary by strata or some other identifier.
  Example: If strata = "large farm operation" the acres must be greater than 500.

- Questionnaire level - This level involves across-item checking within a questionnaire.
  Example 1: If married = 'yes' then age must be greater than 14.
  Example 2: Sum of field acres must equal total acres in farm.
- Hierarchical - This level involves checking items in related sub-questionnaires. Data relationships of this type are known as "hierarchical data" and include situations such as questions about an individual within a household. In this example, the common household information is on one questionnaire and each individual's information is on a separate questionnaire. Checks are made to ensure that the sum of the individuals' data for an item does not exceed the total reported for the household.

d. Across Questionnaire level edits involve calculating valid ranges for each item from the survey data distributions or from historic data for use in outlier detection. Data analysis routines that are usually run at summary time may easily be incorporated into data review at this level. In this way, summary level errors are detected early enough to be corrected during the usual error correction procedures. The across questionnaire checks should identify the specific questionnaire that contains the questionable data.

Across questionnaire level edits are generally grouped into two types: statistical edits and macro edits.

- Statistical Edits use the distributions of the data to detect possible errors. These procedures use current data from many or all questionnaires or historic data of the statistical unit to generate feasible limits for the current survey data. Outliers may be identified in reference to the feasible limits. Research has begun in the more complicated process of identifying inliers (Mazur, 1990). Inliers are data falling within feasible limits, but identified as suspect due to a lack of change over time. A measurable degree of change is assumed in random variables. If the value is too consistent then the value might have simply been carried forward from a prior questionnaire rather than newly reported. The test therefore consists of comparison to the double root residual of a sample unit over time.
If the test fails then the change is not sufficiently random and the questionnaire should be investigated. At USDA-NASS this test is applied to slaughter weight data. The assumption is that the head count of slaughtered hogs may not vary by much from week to week, but the total weight of all slaughtered hogs is a random variable and should show a measurable degree of change each week.

- Macro Edits are a review of the data at an aggregate level.
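The review levels described in section 3 can be sketched as small rule functions. The sketch below is illustrative only: the dictionary record layout, the field names (age, married, strata, field_acres, total_acres), and the function names are assumptions made for this example. The thresholds in the first three functions come from the examples in the text. The paper does not say how feasible limits are generated for statistical edits, so the quartile-fence rule in flag_outliers is a stand-in chosen for this sketch, not the method used at USDA-NASS.

```python
from statistics import quantiles


def check_item(record):
    """Item-level range checks, using the examples given in the text."""
    errors = []
    if not 0 < record["age"] < 120:
        errors.append("age must be > 0 and < 120")
    # Range that varies by strata: large farm operations must exceed 500 acres.
    if record["strata"] == "large farm operation" and record["total_acres"] <= 500:
        errors.append("large farm operation must report more than 500 acres")
    return errors


def check_questionnaire(record):
    """Across-item checks within one questionnaire."""
    errors = []
    if record["married"] == "yes" and record["age"] <= 14:
        errors.append("married respondent must be older than 14")
    if sum(record["field_acres"]) != record["total_acres"]:
        errors.append("field acres do not sum to total acres in farm")
    return errors


def check_household(household_total, individual_values):
    """Hierarchical check: individuals' sum must not exceed the household total."""
    if sum(individual_values) > household_total:
        return ["sum of individual values exceeds household total"]
    return []


def flag_outliers(values_by_id, k=1.5):
    """Statistical edit: flag questionnaires outside assumed quartile fences.

    The limits are q1 - k * (q3 - q1) and q3 + k * (q3 - q1); this rule is an
    assumption of the sketch. Returns the ids of the flagged questionnaires.
    """
    q1, _, q3 = quantiles(values_by_id.values(), n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return sorted(qid for qid, value in values_by_id.items()
                  if not low <= value <= high)


# Made-up records for demonstration.
record = {
    "age": 13,
    "married": "yes",
    "strata": "large farm operation",
    "field_acres": [200, 150],
    "total_acres": 400,
}
reported = {"Q1": 100, "Q2": 110, "Q3": 105, "Q4": 95, "Q5": 102, "Q6": 900}

print(check_item(record))
print(check_questionnaire(record))
print(check_household(50000, [30000, 25000]))
print(flag_outliers(reported))
```

Each function returns a list of messages or identifiers rather than raising an exception, so a batch review can collect every failure for a questionnaire and an across-questionnaire check can name the specific questionnaire that holds the questionable value.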