BigDansing: A System for Big Data Cleansing Zuhair Khayyaty∗ Ihab F. Ilyas‡∗ Alekh Jindal] Samuel Madden] Mourad Ouzzanix Paolo Papottix Jorge-Arnulfo Quiané-Ruizx Nan Tangx Si Yinx xQatar Computing Research Institute ]CSAIL, MIT yKing Abdullah University of Science and Technology (KAUST) zUniversity of Waterloo
[email protected] [email protected] {alekh,madden}@csail.mit.edu {mouzzani,ppapotti,jquianeruiz,ntang,siyin}@qf.org.qa ABSTRACT name zipcode city state salary rate t1 Annie 10001 NY NY 24000 15 Data cleansing approaches have usually focused on detect- t2 Laure 90210 LA CA 25000 10 ing and fixing errors with little attention to scaling to big t3 John 60601 CH IL 40000 25 datasets. This presents a serious impediment since data t4 Mark 90210 SF CA 88000 28 cleansing often involves costly computations such as enu- t5 Robert 60827 CH IL 15000 15 t Mary 90210 LA CA 81000 28 merating pairs of tuples, handling inequality joins, and deal- 6 Table 1: Dataset D with tax data records ing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle ef- have a lower tax rate; and (r3) two tuples refer to the same ficiency, scalability, and ease-of-use issues in data cleans- individual if they have similar names, and their cities are ing. The system can run on top of most common general inside the same county. We define these rules as follows: purpose data processing platforms, ranging from DBMSs (r1) φF : D(zipcode ! city) to MapReduce-like frameworks. A user-friendly program- (r2) φD : 8t1; t2 2 D; :(t1:rate > t2:rate ^ t1:salary < t2:salary) (r3) φU : 8t1; t2 2 D; :(simF(t1:name; t2:name)^ ming interface allows users to express data quality rules both getCounty(t :city) = getCounty(t :city)) declaratively and procedurally, with no requirement of being 1 2 aware of the underlying distributed platform.