(12) United States Patent (10) Patent No.: US 8,782,016 B2 Kaldas Et Al

USOO8782O16B2 (12) United States Patent (10) Patent No.: US 8,782,016 B2 Kaldas et al. (45) Date of Patent: Jul. 15, 2014 (54) DATABASE RECORD REPAIR 2005/0262044 A1* 11/2005 Chaudhuri et al. ............... 707/1 2006, O155743 A1* 7, 2006 Bohannon et al. 707/102 2007. O168334 A1* 7, 2007 Julien et al. ..... 707 3 (75) Inventors: Ihab Francis Ilyas Kaldas, Waterloo 2008/0243967 A1* 10, 2008 Bhatia et al. .... 707/2O6 (CA); Mohamed Yakout, Doha (QA); 2008/0288482 A1* 11/2008 Chaudhuri et al. 707/5 Ahmed K. Elmagarmid, Doha (QA) 2008/0306945 A1 12/2008 Chaudhuri et al. ............... 707/6 2009,0006302 A1 1/2009 Fan et al. ........................ T06/48 2010/0005048 A1* 1/2010 Bodapati et al. 706/47 (73) Assignee: Qatar Foundation, Doha (QA) 2010.0318499 A1* 12/2010 Arasu et al. ................... 707/693 (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 FOREIGN PATENT DOCUMENTS U.S.C. 154(b) by 120 days. GB 2493963 A 2, 2013 WO 2013O298.17 A1 3, 2013 (21) Appl. No.: 13/218,789 OTHER PUBLICATIONS (22) Filed: Aug. 26, 2011 Lee et al. “Cleansing Data for Mining and Warehousing”. 1999. In Proceedings of the 10th International Conference on Database and (65) Prior Publication Data Expert Systems Applications (DEXA '99). Springer-Verlag, London, US 2013/OO54539 A1 Feb. 28, 2013 UK, pp. 751-760.* Muller, H; et al., “Problems, Methods. and Challenges in Compre hensive Data Cleansing”; XP0079208.63 (2003; Retrieved from the (51) Int. Cl. Internet: URL:http://www.dbis.informatikhu-berlin. G06F 7/30 (2006.01) defileadminjresearchpapersjtechreportsj2003-hub ib 164-mueller. GO6F 7/OO (2006.01) pdfretrieved on Jul. 24, 2012). (52) U.S. Cl. USPC ............................ 707/691; 707/690; 707/692 (Continued) (58) Field of Classification Search None Primary Examiner — Robert Beausoliel, Jr. See application file for complete search history. Assistant Examiner — Susan F Rayyan (74) Attorney, Agent, or Firm — Mossman Kumar & Tyler (56) References Cited PC U.S. PATENT DOCUMENTS (57) ABSTRACT 5,504,890 A * 4, 1996 Sanford ................................ 1f1 A computer implemented method for repairing records of a 6,968,348 B1 * 1 1/2005 Carone et al. ................. 707,696 database, comprises determining a first set of records of the 7,092,956 B2 * 8/2006 Ruediger ....... 707,602 database which violate a functional dependency of the data 7.562,067 B2 * 7/2009 Chaudhuri et al. ................... 1f1 7,756,873 B2 * 7/2010 Gould et al. ... 707/737 base, modifying records in the first set to make them consis 8,041,668 B2 * 10/2011 Fan et al. ... ... 706/48 tent with the functional dependency to provide an output 8,200,640 B2 * 6/2012 Arasu et al. ... 707,692 consistent database instance, determining a second set of 8,204.866 B2 * 6/2012 Chaudhuri et al. 707,692 records of the output consistent database instance comprising 8,346,736 B2 * 1/2013 Haustein et al. ... 707,692 duplicate records, merging duplicate records in the second set 8,356,017 B2 * 1/2013 Akirav et al. ................. 707,692 8,639,667 B2 * 1/2014 Golab et al. .................. 707,690 in dependence on the functional dependencies of the records 2004/0003005 A1 1/2004 Chaudhuri et al. 707/200 to provide a modified database instance. 2004/0107386 A1* 6/2004 Burdick et al. .... ... 714,38 2004/O181527 A1* 9, 2004 Burdick et al. ................... 707/6 15 Claims, 4 Drawing Sheets Eicheatly 2600-109s cases 23 RBaily 2600-losswise, S. Manhate 100 PH tasady assoo-Dawsas. Marate assis a ckade: sexes 883 & C, are serie see at a state Robote;saily 60-98:a::, les: QQ23 Eois was ast. Biscatea cois wa is assary Eocess craciass. Plewark 2 to beads: sociocasses sease as a : a so-calcastlew out rose royersey also cooradrid sette sesswa, are scree 3. cologicaiss: low suk 1002 fly ak, cooperate seasesses was bone pixes set city 2 seats : rasady 2 co-owsts. Some sesswa Ready Elsa.scies S. rew or does , Raysately 2 EO-100Fsaad a sease is w, Besa dely 21so-ord as see seas was US 8,782,016 B2 Page 2 (56) References Cited Retrieved from the Internet: URL:http://www.cs.uwaterloo. ca, -gbeskale Beskales VLDB2010.pdf. International Search Report in GB1114744.4, dated Nov. 24, 2011. OTHER PUBLICATIONS International Search Report and Written Opinion in International Application PCT/EP2012/060026, dated Aug. 3, 2012. Beskales, G. et al., Modeling and Querying Possible Repairs in Kaewbuadee, Kollayut et al., “Data Cleaning Using FD From Data Duplicate Detection, Aug. 2009, XP002680770, Retrieved from the Mining Process”; May 16, 2006. XP007920865. Retrieved from the Internet: URL:http://www.cs.uwaterloo.ca/ -gbeskale? Internet: URL:http://www.iadis.org/Multi2006/Papers/16, F026 BeskalesVLDB2009.pdf. DS.pdfretrieved on Jul. 24, 2012). Beskales, G. et al., “Sampling the Repairs of Functional Dependency Violations under Hard Constraints”, Sep. 2010, XP002680771, * cited by examiner U.S. Patent Jul. 15, 2014 Sheet 1 of 4 US 8,782,016 B2 U.S. Patent Jul. 15, 2014 Sheet 3 of 4 US 8,782,016 B2 Ssssssssss-XXXXXSW'WWYYYYYYYYYYW U.S. Patent Jul. 15, 2014 US 8,782,016 B2 Sawww. ******* mki '****************************••••••••-------•••••?? exsyssessex-rassrs saa-ra-'sa-ra-a-rarer US 8,782,016 B2 1. 2 DATABASE RECORD REPAIR executed by a processor, implement a method for updating a database comprising determining a first set of records of the The present invention relates to database record repair. database which violate a functional dependency of the data base, modifying records in the first set to make them consis BACKGROUND tent with the functional dependency to provide an output consistent database instance, determining a second set of A database is a collection of information arranged in an records of the output consistent database instance comprising organized manner. A typical database might include medical, duplicate records, merging duplicate records in the second set financial or accounting information, demographics and mar in dependence on the functional dependencies of the records ket Survey data, bibliographic or archival data, personnel and 10 to provide a modified database instance. organizational information, public governmental records, Duplicate records can be determined using a duplication private business or customer data Such as addresses and phone mechanism to group duplicate records into respective clus numbers, etc. ters, wherein records within respective ones of the clusters Such information is usually contained in computer files represent the same entity. A duplicate identification attribute arranged in a pre-selected database format, and the data con 15 can be assigned to records in the second set of records. In an tents within them can be maintained for convenient access on example, a schema for the database includes the duplicate magnetic media, both for storage and for updating the file identification attribute. Preferably, the modified database contents as needed. instance does not include the duplicate identification Poor data quality can have undesirable implications for the attribute. effectiveness of a business or other organization or entity. For According to an aspect of the present invention, there is example, in healthcare, where incorrect information about provided a computer program embedded on a non-transitory patients in an Electronic Health Record (EHR) may lead to tangible computer readable storage medium, the computer wrong treatments and prescriptions, ensuring the accuracy of program including machine readable instructions that, when database entries is of prime importance. executed by a processor, implement a method for updating a A large variety of computational procedures for cleaning or 25 database comprising performing a functional-dependency repairing erroneous or duplicate entries in databases have aware deduplication of records of the database. been proposed. Typically, such procedures can automatically According to an aspect of the present invention, there is or semi-automatically identify errors and, when possible, provided an apparatus for modifying records in a database, correct them. Typically, however, these approaches have sev comprising a functional definition detection engine operable eral limitations relating to the introduction of new database 30 to detect functional definition violations for records in the errors as a result of changes that have been made. For database to provide a first set of records, a functional defini example, a repair in order correct a functional dependency tion repair engine operable to repair functional definition problem may lead to duplication errors. Similarly, dedupli violations of records in the first set to provide a consistent cation can lead to functional dependency violations within a database instance, a duplicate detection engine operable to database. 35 detect duplicate record entries in the consistent database instance, and a consistency aware repair engine operable to SUMMARY merge duplicate records in respective clusters of duplicate records of the consistent database instance to provide a modi According to an aspect of the present invention, there is fied database instance. The functional definition repair engine provided a computer implemented method for repairing 40 can be further operable to repair functional definition viola records of a database, comprising determining a first set of tions arising as a result of a previous functional definition records of the database which violate a functional depen repair operation. dency of the database, modifying records in the first set to make them consistent with the functional dependency to pro BRIEF DESCRIPTION OF THE DRAWINGS vide an output consistent database instance, determining a 45 second set of records of the output consistent database FIG. 1 is a schematic representation of a small database instance comprising duplicate records, merging duplicate instance; records in the second set in dependence on the functional FIG. 2 is a schematic block diagram of a method according dependencies of the records to provide a modified database to an example; instance.

Load more