Dealing with messy data

Cleaning data

A. The data cleaning process

It is mandatory for the overall quality of an assessment to ensure that its primary and secondary data are of sufficient quality. "Messy data" refers to data riddled with inconsistencies, whether because of human error, poorly designed recording systems, or simply incomplete control over the format and type of data imported from external sources such as a database, text file, or web page. A column that contains country names may, for instance, contain "Burma", "Myanmar" or "Myanma". Such inconsistencies will impede data processing. Care should be taken to ensure data is as accurate and consistent as possible (e.g. consistent spellings, to allow aggregation). Inconsistencies can wreak havoc when trying to perform analysis with the data, so they have to be addressed before starting the analysis.

Data cleaning deals mainly with data problems once they have occurred. Error-prevention strategies (see the data quality control procedures later in this document) can reduce many problems but cannot eliminate them. Many data errors are detected incidentally during activities other than data cleaning, e.g.:

- When collecting or entering data
- When transforming/extracting/transferring data
- When exploring or analysing data
- When submitting the draft report to peer review

It is more efficient, however, to detect errors by actively searching for them in a planned way. Data cleaning involves repeated cycles of screening, diagnosing, and treatment.
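The country-name example above can be handled with a small normalisation pass before analysis starts. The following is a minimal Python sketch, assuming a hand-built alias table; the mapping and function name are invented for illustration and would in practice be built by reviewing the distinct values actually present in the column.

```python
# Hypothetical alias table: every known variant spelling mapped to one
# canonical form, so that aggregation by country works correctly.
ALIASES = {
    "burma": "Myanmar",
    "myanma": "Myanmar",
    "myanmar": "Myanmar",
}

def normalise_country(raw: str) -> str:
    """Map a raw country entry to its canonical spelling (if known)."""
    cleaned = raw.strip().lower()
    return ALIASES.get(cleaned, raw.strip())

countries = ["Burma", "Myanmar ", "Myanma", "Kenya"]
print([normalise_country(c) for c in countries])
# -> ['Myanmar', 'Myanmar', 'Myanmar', 'Kenya']
```

Values not found in the alias table are passed through unchanged, so unexpected spellings surface during later screening rather than being silently altered.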
Used mainly when dealing with large volumes of data stored in a database, the terms data cleansing, data cleaning or data scrubbing refer to the process of detecting, correcting, replacing, modifying or removing incomplete, incorrect, irrelevant, corrupt or inaccurate records from a record set, table, or database.

[Figure: The data cleaning cycle. Screening (lack of data, excess of data, outliers or inconsistencies, strange patterns, suspect analysis results) leads to Diagnosis (missing data, errors, true extremes, true normal, no diagnosis but still suspect), which leads to Treatment (leave unchanged, correction, deletion). Adapted from Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005)]

This document provides guidance for data analysts to find the right data cleaning strategy when dealing with needs assessment data, either primary or secondary. It covers situations where:

- Raw data is being produced by assessment teams using a questionnaire and is entered into a centralized database.
- Data is obtained from secondary sources (displacement monitoring systems, food security data, census data, etc.) and is integrated, compared or merged with the data obtained from the field assessment to complement the final analysis.

Screening involves systematically looking for suspect features in assessment questionnaires, databases, or analysis datasets (in small assessments, with the analysts closely involved at all stages, there may be little or no distinction between a database and an analysis dataset). The diagnostic (identifying the nature of the defective data) and treatment (deleting, editing, or leaving the data as it is) phases of data cleaning require insight into the sources and types of errors at all stages of the assessment.
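The screening, diagnosis, and treatment cycle can be sketched as a loop over flagged records. The field name (`hh_size`, for household size), the plausibility range, and the treatment rules below are assumptions made for this example only; a real assessment would define them from its own questionnaire.

```python
def screen(records):
    """Screening: flag suspect records (here, missing or implausible household size)."""
    flags = []
    for i, rec in enumerate(records):
        size = rec.get("hh_size")
        if size is None:
            flags.append((i, "missing"))
        elif size <= 0 or size > 30:  # assumed plausibility range
            flags.append((i, "outlier"))
    return flags

def diagnose(record, problem):
    """Diagnosis: decide whether a flagged value is missing data, an error,
    or a true extreme. Here anything outside the range is a suspected error."""
    if problem == "missing":
        return "missing data"
    return "suspected error"

def treat(record, diagnosis):
    """Treatment: leave unchanged, correct, or delete. Here suspected errors
    are blanked out so they can be followed up with the field team."""
    if diagnosis == "suspected error":
        record = dict(record, hh_size=None)
    return record

records = [{"hh_size": 5}, {"hh_size": 250}, {"hh_size": None}]
for idx, problem in screen(records):
    records[idx] = treat(records[idx], diagnose(records[idx], problem))
print(records)
# -> [{'hh_size': 5}, {'hh_size': None}, {'hh_size': None}]
```

Note that the cycle is meant to be repeated: blanking a suspect value creates new "missing data" flags on the next screening pass, which keeps unresolved problems visible rather than hiding them.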
After measurement, data are the object of a sequence of typical activities: they are entered into databases, extracted, transferred to other tables, edited, selected, transformed, summarized, and presented. It is important to realize that errors can occur at any stage of the data flow, including during data cleaning itself.

This document complements the ACAPS technical note How to approach a dataset, which specifically details data cleaning operations for primary data entered into an Excel spreadsheet during rapid assessments.

B. Sources of errors

Many of the sources of error in databases fall into one or more of the following categories:

Measurement errors: Data is generally intended to measure some physical process, subject or object, e.g. the waiting time at the water point, the size of a population, or the incidence of diseases. In some cases these measurements are undertaken by human processes that can have systematic or random errors in their design (e.g. improper sampling strategies) and execution (e.g. misuse of instruments, bias, etc.). Identifying and solving such errors goes beyond the scope of this document; it is recommended to refer to the ACAPS Technical Brief How sure are you? to get an empirical understanding of how to deal with measurement errors in general.

Processing errors: In many settings, raw data are pre-processed before being entered into a database. This processing is done for a variety of reasons: to reduce the complexity or noise in the raw data, to emphasize aggregate properties of the raw data (often with some editorial bias), and in some cases simply to reduce the volume of data being stored. All these processes have the potential to produce errors.

Data entry errors: "Data entry" is the process of transferring information from the medium that records the response (traditionally answers written on printed questionnaires) to a computer application. Data entry is generally done by humans, who typically extract information from speech (e.g. key informant interviews) or from written or printed secondary sources (e.g. health statistics from health centres). Under time pressure, or for lack of proper supervision, data is often corrupted at entry time. The main error types include:

- Erroneous entries: e.g. an age mistyped as 26 instead of 25.
- Extraneous entries: correct but unwanted information is added, e.g. name and title in a name-only field.
- Incorrectly derived values: a function was incorrectly calculated for a derived field, e.g. an error in the age derived from the date of birth.
- Inconsistencies across tables or files: e.g. the number of visited sites in the province table and the number of visited sites in the total sample table do not match.

Data integration errors: It is actually quite rare for a database of significant size and age to contain data from a single source, collected and entered in the same way over time. Very often, a database contains information collected from multiple sources via multiple methods over time (e.g. tracking of affected population numbers over the course of a crisis, where the definition of "affected" is refined or changed over time). Moreover, in practice, many databases evolve by merging in other pre-existing databases; this merging almost always requires some attempt to resolve inconsistencies across the databases, involving different data units, measurement periods, formats, and so on. Any procedure that integrates data from multiple sources can lead to errors. The merging of two or more databases will both identify errors (where there are differences between the two databases) and create new errors (e.g. duplicate records).
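Two of the entry-error types above, incorrectly derived values and inconsistencies across tables, lend themselves to simple automated checks. The Python sketch below illustrates both; the field names, dates, and site counts are invented for the example.

```python
from datetime import date

def expected_age(dob: date, on: date) -> int:
    """Age in completed years on a reference date."""
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))

# Derived-value check: recompute the age and compare with the stored value.
row = {"dob": date(1990, 6, 1), "age": 26}   # invented record
reference = date(2016, 3, 15)                # invented reference date
derived_ok = row["age"] == expected_age(row["dob"], reference)
print(derived_ok)  # False: the stored age (26) disagrees with the date of birth (25)

# Cross-table check: per-province site counts should sum to the total-sample figure.
province_sites = {"North": 12, "South": 9}   # invented province table
total_sample_sites = 22                      # invented total-sample table
print(sum(province_sites.values()) == total_sample_sites)  # False: 21 != 22
```

Checks like these only flag the mismatch; deciding which of the two values is wrong is part of the diagnosis phase and usually requires going back to the source forms.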
Adapted from Kim et al., 2003; Aldo Benini, 2013.

Table 1 below illustrates some of the possible sources and types of errors in a large assessment, at three basic levels: when filling in the questionnaire, when entering data into the database, and when performing the analysis.

Table 1: Sources of data error

Stage: Questionnaire
- Lack or excess of data: form missing; form duplicated or collected repeatedly; answering box or options left blank; more than one option selected when not allowed.
- Outliers and inconsistencies: correct value filled out in the wrong box; not readable; writing error; answer given is out of the expected (conditional) range.

Stage: Database (data transferred from the questionnaire)
- Lack or excess of data: form or field not entered; value entered in the wrong field; inadvertent deletion or duplication during database handling.
- Outliers and inconsistencies: inconsistencies carried over from the questionnaire; value incorrectly entered (misspelling); value incorrectly changed during previous data cleaning; transformation (programming) error.

Stage: Analysis (data extracted from the database)
- Lack or excess of data: data extraction, coding or transfer error; deletions or duplications by the analyst.
- Outliers and inconsistencies: inconsistencies carried over from the database; data extraction, coding or transfer error; sorting errors (spreadsheets); data-cleaning errors.

Adapted from Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005)

C. First things first

The first thing to do is to make a copy of the original data in a separate workbook, naming the sheets appropriately, or save it as a new file. ALWAYS keep the source files in a separate folder and change their attribute to READ-ONLY, to avoid modifying any of those files, even when they are opened for reference.

D. Screening data

No matter how data are collected (face-to-face interviews, telephone interviews, self-administered questionnaires, etc.), there will be some level of error, including a number of inconsistencies. While some of these will be legitimate, reflecting variation in the context, others will likely reflect a data collection error.

Examine data for the following possible errors:

- Spelling and formatting irregularities: Are categorical variables written incorrectly? Are date formats consistent? Etc.
- Lack of data: Do some questions have far fewer answers than surrounding questions?
- Excess of data: Are there duplicate entries? Are there more answers than originally allowed?

Inaccuracy of a single measurement and data point may be acceptable, and related to the inherent technical error of the measurement instrument. Hence, data cleaning
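The three screening checks listed above (formatting irregularities, lack of data, excess of data) can each be expressed as a small test over the records. The Python sketch below assumes invented column names and an expected YYYY-MM-DD date format; real screening would use the actual codebook of the assessment.

```python
import re
from collections import Counter

# Invented sample records standing in for rows of an assessment database.
rows = [
    {"id": "A1", "visit_date": "2016-03-01", "water_source": "borehole"},
    {"id": "A2", "visit_date": "01/03/2016", "water_source": ""},
    {"id": "A1", "visit_date": "2016-03-02", "water_source": "river"},
]

# Formatting irregularities: dates not matching the expected YYYY-MM-DD pattern.
date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
bad_dates = [r["id"] for r in rows if not date_pattern.match(r["visit_date"])]

# Lack of data: blank answers in a column that should always be filled.
blank = [r["id"] for r in rows if not r["water_source"].strip()]

# Excess of data: duplicate identifiers, suggesting a form entered twice.
dupes = [k for k, n in Counter(r["id"] for r in rows).items() if n > 1]

print(bad_dates, blank, dupes)  # ['A2'] ['A2'] ['A1']
```

Each check produces a list of record identifiers to investigate, which feeds directly into the diagnosis and treatment steps described earlier.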