Statistical Distortion: Consequences of Data Cleaning Tamraparni Dasu Ji Meng Loh AT&T Labs Research AT&T Labs Research 180 Park Avenue 180 Park Avenue Florham Park, NJ 07932 Florham Park, NJ 07932
[email protected] [email protected] ABSTRACT There is considerable research in the database community We introduce the notion of statistical distortion as an essen- on varied aspects of data quality, data repair and data clean- tial metric for measuring the effectiveness of data cleaning ing. Recent work includes the use of machine learning for strategies. We use this metric to propose a widely applica- guiding database repair [14]; inferring and imputing miss- ble yet scalable experimental framework for evaluating data ing values in databases [10] ; resolving inconsistencies using cleaning strategies along three dimensions: glitch improve- functional dependencies [6]; and for data fusion [8]. ment, statistical distortion and cost-related criteria. Exist- From an enterprise perspective, data and information qual- ing metrics focus on glitch improvement and cost, but not ity assessment has been an active area of research as well. on the statistical impact of data cleaning strategies. We The paper [12] describes subjective and objective measures illustrate our framework on real world data, with a compre- for assessing the quality of a corporation’s data. An overview hensive suite of experiments and analyses. of well known data quality techniques and their comparative uses and benefits is provided in [2]. A DIMACS/CICCADA 1 workshop on data quality metrics featured a mix of speak- 1. INTRODUCTION ers from database and statistics communities. It covered a Measuring effectiveness of data cleaning strategies is al- wide array of topics from schema mapping, graphs, detailed most as difficult as devising cleaning strategies.