An Introduction to the Data Editing Process
by Dania P. Ferguson
United States Department of Agriculture
National Agricultural Statistics Service

Abstract: A primer on the various data editing methodologies, the impact of their usage, available supporting software, and considerations when developing new software.

I. INTRODUCTION

The intention of this paper is to promote a better understanding of the various data editing methods and the impact of their usage, as well as the software available to support the use of the various methodologies. It is hoped that this paper will serve as:
- a brief overview of the most commonly used editing methodologies,
- an aid to organize further reading about data editing systems,
- a general description of the data editing process for use by designers and developers of generalized data editing systems.

II. REVIEW OF THE PROCESS

Data editing is defined as the process involving the review and adjustment of collected survey data. The purpose is to control the quality of the collected data. This process is divided into four (4) major sub-process areas. These areas are:
- Survey Management
- Data Capture
- Data Review
- Data Adjustment

The rest of this section will describe each of the four (4) sub-processes.

1. Survey Management

The survey management functions are: a) completeness checking, and b) quality control including audit trails and the gathering of cost data. These functions are administrative in nature.

Completeness checking occurs at both survey and questionnaire levels. At survey level, completeness checking ensures that all survey data have been collected. It is vitally important to account for all samples because sample counts are used in the data expansion procedures that take place during summary. Therefore, changes in the sample count impact the expansion. A minimal completeness check compares the sample count to the questionnaire count to ensure that all samples are accounted for, even if no data were collected.
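The minimal survey-level completeness check described above can be sketched as follows. This is an illustrative Python sketch, not part of the original paper. It assumes each sample unit and each returned questionnaire carries a shared identifier; the function name `completeness_report` and the identifiers shown are hypothetical.

```python
def completeness_report(sample_ids, questionnaire_ids):
    """Compare the sample to the questionnaires on file.

    Every sample unit should have a questionnaire, even if that
    questionnaire only records that no data were collected.
    """
    sample = set(sample_ids)
    received = set(questionnaire_ids)
    return {
        "sample_count": len(sample),
        "questionnaire_count": len(received),
        # Sample units with no questionnaire on file.
        "missing": sorted(sample - received),
        # Questionnaires that do not match any sample unit.
        "unexpected": sorted(received - sample),
        "complete": sample == received,
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(completeness_report(["S1", "S2", "S3"], ["S1", "S3"]))
```

Reporting the missing identifiers, rather than only comparing the two counts, lets the survey manager follow up on specific sample units.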
In the case of a Census, the number of returned questionnaires is compared to the number of distributed questionnaires or to an estimated number of questionnaires expected to be returned.

Questionnaire level completeness checking ensures that routing instructions have been followed. Questionnaires should be coded to specify whether the respondent was inaccessible or has refused; this information can be used in verification procedures.

Survey management includes quality control of the data collection process and measures of the impact on the data of the data adjustment that occurs in the Data Adjustment sub-process (No. 4 below). Survey management is a step in the quality control process that assures that the underlying statistical assumptions of a survey are not violated. "Long after methods of calculation are forgotten, the meaning of the principal statistical measures and the assumptions which condition their use should be maintained" (Neiswanger, 1947).

Survey management functions are not data editing functions per se, but many of the functions require accounting and auditing information to be captured during the editing process. Thus, survey management must be integrated in the design of data editing systems.

2. Data Capture

Data capture is the conversion of data to electronic media. The data may be key entered in either a heads down or heads up mode.

a. Heads down data entry refers to data entry with no error detection occurring at the time of entry. High-speed data-entry personnel are used to key data in a "heads down" mode. Data entered in a heads down mode are often verified by re-keying the questionnaire and comparing the two keyed copies of the same questionnaire.

b. Heads up data entry refers to data entry with a review at time of entry. Heads up data entry requires subject matter knowledge by the individuals entering the data. Data entry is slower, but data review/adjustment is reduced since simple inconsistencies in responses are found earlier in the survey process.
This mode is especially effective when the interviewer or respondent enters data during the interview. This is known as Computer-Assisted Interviewing, which is explained in more detail below.

Data may be captured by many automated methods without traditional key entry. As technology advances, many more tools will become available for data capture. One popular tool is the touch-tone telephone keypad combined with a computer-administered interview that uses a synthesized voice. Optical Character Readers (OCR) may be used to scan questionnaires into electronic form. The use of electronic calipers and other analog measuring devices for agricultural and industrial surveys is becoming more commonplace.

The choices of data-entry mode and data adjustment method have the greatest impact on the type of personnel that will be required and on their training.

3. Data Review

Data review consists of both error detection and data analysis.

a. Manual data review may occur prior to data entry. The data may be reviewed and prepared/corrected prior to key-entry. This procedure is more typically followed when heads-down data entry is used.

b. Automated data review may occur in a batch or interactive fashion. It is important to note that data entered in a heads-down fashion may later be corrected in either a batch or an interactive data review process.
- Batch data review occurs after data entry and consists of a review of many questionnaires in one batch. It generally results in a file of error messages. This file may be printed for use in preparing corrections. The data records may be split into two files: one containing the 'good' records and one containing the data records with errors. The latter file may be corrected using an interactive process.
- Interactive data review involves immediate review of the questionnaire after adjustments are made. The results of the review are shown on a video display terminal and the data editor is prompted to adjust the data or override the error flag.
This process continues until the questionnaire is considered acceptable by the automated review process. Then the results of the automated review of the next questionnaire are presented. A desirable feature of interactive data editing software is to present only the questionnaires requiring adjustments.

Computer-Assisted Interviewing (CAI) combines interactive data review with interactive data editing while the respondent is an available source for data adjustment. An added benefit is that data capture (key-entry) occurs at interview time. This method may be used during telephone interviewing and with portable data-entry devices for on-site data collection. CAI assists the interviewer in the wording of questions and tailors succeeding questions based on previous responses. It is a tool to speed the interview and assist less experienced interviewers. CAI has mainly been used in Computer-Assisted Telephone Interviews (CATI), but as technological advances are made in the miniaturization of personal computers, more applications will be found in Computer-Assisted Personal Interviewing (CAPI).

c. Data review (error detection) may occur at many levels.
- Item level - Validations at this level are generally named "range checking", since items are validated against a range. Example: age must be > 0 and < 120. In more complex range checks the range may vary by stratum or some other identifier. Example: If strata = "large farm operation" the acres must be greater than 500.
- Questionnaire level - This level involves cross-item checking within a questionnaire. Example 1: If married = 'yes' then age must be greater than 14. Example 2: Sum of field acres must equal total acres in farm.
- Hierarchical - This level involves checking items in related sub-questionnaires. Data relationships of this type are known as "hierarchical data" and include situations such as questions about an individual within a household.
In this example, the common household information is on one questionnaire and each individual's information is on a separate questionnaire. Checks are made to ensure that the sum of the individuals' data for an item does not exceed the total reported for the household.

d. Across Questionnaire level edits involve calculating valid ranges for each item from the survey data distributions or from historic data for use in outlier detection. Data analysis routines that are usually run at summary time may easily be incorporated into data review at this level. In this way, summary level errors are detected early enough to be corrected during the usual error correction procedures. The across questionnaire checks should identify the specific questionnaire that contains the questionable data. Across questionnaire level edits are generally grouped into two types: statistical edits and macro edits.
- Statistical Edits use the distributions of the data to detect possible errors. These procedures use current data from many or all questionnaires, or historic data of the statistical unit, to generate feasible limits for the current survey data. Outliers may be identified in reference to the feasible limits. Research has begun on the more complicated process of identifying inliers (Mazur, 1990). Inliers are data falling within feasible limits, but identified as suspect due to a lack of change over time. A measurable degree of change is assumed in random variables. If the value is too consistent, then the value might have simply been carried forward from a prior questionnaire rather than newly reported. The test therefore consists of a comparison to the double root residual of a sample unit over time. If the test fails, then the change is not sufficiently random and the questionnaire should be investigated. At USDA-NASS this test is applied to slaughter weight data. The assumption is that the head count of slaughtered hogs may not vary by much from week to week.
However, the total weight of all slaughtered hogs is a random variable and should show a measurable degree of change each week.
- Macro Edits are a review of the data at an aggregate level.
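The error-detection levels described in this section can be illustrated with a short Python sketch, which is not part of the original paper. The item-level and questionnaire-level rules are the examples given in the text. The record layout (`id`, `age`, `married`, `field_acres`, `total_acres`) and all function names are hypothetical. The paper does not prescribe a formula for feasible limits, so the sketch substitutes quartile fences (first and third quartile widened by a multiple of the interquartile range) as one common choice.

```python
from statistics import quantiles


def item_level_errors(record):
    """Item level: range check on a single item (age must be > 0 and < 120)."""
    errors = []
    if not 0 < record["age"] < 120:
        errors.append("age out of range")
    return errors


def questionnaire_level_errors(record):
    """Questionnaire level: consistency checks across items of one questionnaire."""
    errors = []
    if record["married"] == "yes" and record["age"] <= 14:
        errors.append("married respondent must be older than 14")
    # Acres are assumed to be whole numbers here, so exact equality is safe.
    if sum(record["field_acres"]) != record["total_acres"]:
        errors.append("field acres do not sum to total acres")
    return errors


def feasible_limits(values, k=1.5):
    """Across questionnaires: derive limits from the current data distribution.

    Assumption: quartile fences (q1 - k*IQR, q3 + k*IQR); other rules,
    or limits built from historic data, are equally possible.
    """
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def outlier_ids(records, item, k=1.5):
    """Return the ids of the specific questionnaires whose item lies outside the limits."""
    low, high = feasible_limits([r[item] for r in records], k)
    return [r["id"] for r in records if not low <= r[item] <= high]
```

Returning questionnaire identifiers from `outlier_ids` reflects the requirement that across questionnaire checks point to the specific questionnaire containing the questionable data. Flagged values are candidates for review, not confirmed errors.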