Preventing Data Errors with Continuous Testing

Kıvanç Muşlu
Computer Science & Engineering
University of Washington
Seattle, WA, USA 98195-2350

Yuriy Brun, Alexandra Meliou
College of Information and Computer Science
University of Massachusetts
Amherst, MA, USA 01003-9264
[email protected] {brun, ameli}@cs.umass.edu

ABSTRACT

Today, software systems that rely on data are ubiquitous, and ensuring the data’s quality is an increasingly important challenge as data errors result in annual multi-billion dollar losses. While software debugging and testing have received heavy research attention, less effort has been devoted to data debugging: identifying system errors caused by well-formed but incorrect data. We present continuous data testing (CDT), a low-overhead, delay-free technique that quickly identifies likely data errors. CDT continuously executes domain-specific test queries; when a test fails, CDT unobtrusively warns the user or administrator.

Data errors can arise in a variety of ways, including during data entry (e.g., typographical errors and transcription errors from illegible text), measurement (e.g., the data source may be faulty or corrupted), and data integration [33]. Data entry errors have been found hard to detect and to significantly alter conclusions drawn on that data [3]. The ubiquity of data errors [21, 33, 75, 76] has resulted in significant research on data cleaning [13, 20, 38, 67, 69, 77] and data debugging [4, 37]. Such work, focusing on identifying likely data errors, inconsistencies, and statistical outliers can help improve the quality of a noisy or faulty data set. However, with few notable exceptions [15, 47], research has not focused on the problem of