Statistical Distortion: Consequences of Data Cleaning
Tamraparni Dasu
AT&T Labs Research
180 Park Avenue
Florham Park, NJ 07932
[email protected]

Ji Meng Loh
AT&T Labs Research
180 Park Avenue
Florham Park, NJ 07932
[email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th-31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 11. Copyright 2012 VLDB Endowment 2150-8097/12/07... $10.00.

ABSTRACT

We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real world data, with a comprehensive suite of experiments and analyses.

1. INTRODUCTION

Measuring the effectiveness of data cleaning strategies is almost as difficult as devising cleaning strategies. Like data quality, cleaning metrics are highly context dependent, and are a function of the domain, application, and user. To the best of our knowledge, statistical distortion caused by data cleaning strategies has not been studied in any systematic fashion, and is not explicitly considered a data quality metric. Existing approaches focus on data quality metrics that measure the effectiveness of cleaning strategies in terms of constraint satisfaction, glitch improvement and cost. However, data cleaning often has a serious impact on the statistical properties of the underlying data distribution. If data are over-treated, they no longer represent the real world process that generates the data. Therefore, cleaner data do not necessarily imply more useful or usable data.

Any change to the data could result in a change in the distributional properties of the underlying data. While there are situations where changing the distribution is not of concern, in many cases changing the underlying distribution has an impact on decision making. For instance, a change in the distribution could result in an increase in false positives or false negatives because the distribution is now out of sync with thresholds already in place, such as those used for alerting on network loads.

There is considerable research in the database community on varied aspects of data quality, data repair and data cleaning. Recent work includes the use of machine learning for guiding database repair [14]; inferring and imputing missing values in databases [10]; resolving inconsistencies using functional dependencies [6]; and data fusion [8].

From an enterprise perspective, data and information quality assessment has been an active area of research as well. The paper [12] describes subjective and objective measures for assessing the quality of a corporation's data. An overview of well-known data quality techniques and their comparative uses and benefits is provided in [2]. A DIMACS/CICCADA workshop on data quality metrics (http://dimacs.rutgers.edu/Workshops/Metrics/) featured a mix of speakers from the database and statistics communities. It covered a wide array of topics, from schema mapping, graphs and detailed case studies to data privacy.

Probability and uncertainty have been used in identifying and developing cleaning strategies for a wide range of tasks such as missing value imputation, inconsistency and duplicate resolution, and the management of uncertain data. However, to the best of our knowledge, there is no known data quality metric that measures the impact of cleaning strategies on the statistical properties of the underlying data.

1.1 Data Distributions Matter

As a simple though egregious example, consider the case where there are two distributions with the same standard deviation σ. A blind data cleaning strategy is used to clean the data: apply 3-σ limits to identify outliers, where any data point outside the limits is considered an outlier; "repair" the outliers by setting them to the closest acceptable value, a process known as Winsorization in statistics.

In Figure 1, the schematic histogram (top left) represents the distribution assumed for developing the outlier rule and repair strategy. Outliers are shown in dark brown at the two extreme tails of the distribution. The schematic histogram at top right represents the actual data. Suspicious data, in the form of potential density-based outliers in a low density region, are shown in orange. When we apply the outlier rule to the actual data (bottom left), the two bars at either extreme are identified as outlying data. On repairing the data using Winsorization, we end up with the data distribution shown in the bottom right histogram.

There are several concerns with this approach. First, legitimate "clean" data were changed, as shown schematically by the dark blue bars in the bottom right histogram in Figure 1. What is worse, legitimate values were moved to a
Figure 1: Outliers (dark brown) are identified using 3-σ limits assuming a symmetric distribution; they are repaired by attributing the closest acceptable (non-outlying) value, called Winsorization. This approach results in errors of commission (legitimate values are changed, dark blue) and omission (outliers are ignored, orange). The data repairs change the fundamental nature of the data distribution because they were formulated ignoring the statistical properties of the data set.

potentially outlying region (above the orange bar). Therefore this data cleaning strategy, developed independently of the data distribution of the dirty data set, resulted in a fundamental change in the data, moving it distributionally far from the original. Such a radical change is unacceptable in many real world applications because it can lead to incorrect decisions. For example, in the cleaned data, a significant portion of the data lie in a low-likelihood region of the original, adding to the number of existing outliers. If the low-likelihood region corresponds to forbidden values (the orange values might correspond to data entry errors), our blind data repair has actually introduced new glitches and made the data dirtier.

Our goal is to repair the data to an ideal level of cleanliness while remaining statistically close to the original, in order to stay faithful to the real life process that generated the data. Hence, we measure distortion against the original, but calibrate cleanliness with respect to the ideal. To achieve this goal, we introduce a three-dimensional data quality metric based on glitch improvement, statistical distortion and cost. The metric is the foundation of an experimental framework for analyzing, evaluating and comparing candidate cleaning strategies.

2. OUR APPROACH

We detail below our contributions, and provide an intuitive outline of our approach and methodology. We defer a rigorous formulation to Section 3.

2.1 Our Contribution

We make several original contributions: (1) we propose statistical distortion, a novel measure for quantifying the statistical impact of data cleaning strategies on data. It measures the fidelity of the cleaned data to the original data, which we consider a proxy for the real world phenomenon that generates the data; (2) we develop a general, flexible experimental framework for analyzing and evaluating cleaning strategies based on a three-dimensional data quality metric consisting of statistical distortion, glitch improvement and cost. The framework allows the user to control the scale of experiments with a view to cost and accuracy; and (3) our framework helps the user identify viable data cleaning strategies, and choose the most suitable from among them.

In general, the problem of data cleaning can be specified as follows: Given a data set D, we want to apply cleaning strategies such that the resulting clean data set D_C is as clean as an ideal data set D_I. The ideal data set could be either pre-specified in terms of a tolerance bound on glitches (e.g. "no missing values, less than 5% outliers"), or in terms of a gold standard it must resemble, e.g., historically gathered data. Usually, strategies are chosen based on cost and glitch improvement. However, cleaning strategies introduce distributional changes that distort the statistical properties of the original data set D, as discussed in our hypothetical example of Section 1.1. We propose a novel data quality metric called statistical distortion that takes this into account.

Our focus in this paper is on measuring statistical distortion, and not on evaluating sampling strategies for large databases.
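To make the blind repair of Section 1.1 concrete, the 3-σ Winsorization strategy can be sketched in a few lines. This is our illustrative reconstruction, not code from the paper; the mixture data set used for the demonstration is an assumption standing in for the "actual data" of Figure 1, and NumPy is assumed:

```python
import numpy as np

def winsorize_3sigma(x):
    """Blind repair from Section 1.1: flag points outside mean +/- 3*sigma
    as outliers, then clip them to the nearest acceptable value."""
    mu, sigma = x.mean(), x.std()
    lo, hi = mu - 3 * sigma, mu + 3 * sigma
    outliers = (x < lo) | (x > hi)
    return np.clip(x, lo, hi), outliers

# Hypothetical "actual data": a main mode plus a small suspicious cluster,
# loosely mimicking the schematic histogram at top right of Figure 1.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 990), rng.normal(8.0, 0.5, 10)])
cleaned, flagged = winsorize_3sigma(data)
```

The repaired values pile up exactly at the 3-σ limits, so the shape of the cleaned distribution differs from the original even where the data were legitimate, which is the distributional change the paper's metric is designed to expose.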
Please see [11] for random sampling from databases, [4] for the bottom-k sketches sampling scheme, and [5] for the use of priority sampling for computing subset sums.

2.1.2 Ideal Data Set

There are two possible scenarios for identifying D_I, the ideal data set. In the first one, there is a known ideal distribution or data set, denoted by D_I. This could be a theoretically specified distribution ("multivariate Gaussian"), or a data set constituted from historically observed good data.