Interaction between Record Matching and Data Repairing Wenfei Fan1;2 Jianzhong Li2 Shuai Ma3 Nan Tang1 Wenyuan Yu1 1University of Edinburgh 2Harbin Institute of Technology 3Beihang University fwenfei@inf., ntang@inf.,
[email protected] [email protected] [email protected] Abstract cost: it costs us businesses 600 billion dollars each year [16]. Central to a data cleaning system are record matching and With this comes the need for data cleaning systems. As data repairing. Matching aims to identify tuples that re- an example, data cleaning tools deliver \an overall business fer to the same real-world object, and repairing is to make a value of more than 600 million GBP" each year at BT [31]. In database consistent by fixing errors in the data by using con- light of this, the market for data cleaning systems is grow- straints. These are treated as separate processes in current ing at 17% annually, which substantially outpaces the 7% data cleaning systems, based on heuristic solutions. This pa- average of other IT segments [22]. per studies a new problem, namely, the interaction between There are two central issues about data cleaning: ◦ record matching and data repairing. We show that repair- Recording matching is to identify tuples that refer to ing can effectively help us identify matches, and vice versa. the same real-world entity [17, 26]. ◦ To capture the interaction, we propose a uniform frame- Data repairing is to find another database (a candidate work that seamlessly unifies repairing and matching oper- repair) that is consistent and minimally differs from ations, to clean a database based on integrity constraints, the original data, by fixing errors in the data [4, 21].