United States Patent US 8,782,016 B2 (Kaldas et al.): Database Record Repair

Total Pages: 16

File Type: pdf, Size: 1020 KB

US008782016B2

(12) United States Patent: Kaldas et al.
(10) Patent No.: US 8,782,016 B2
(45) Date of Patent: Jul. 15, 2014

(54) DATABASE RECORD REPAIR

(75) Inventors: Ihab Francis Ilyas Kaldas, Waterloo (CA); Mohamed Yakout, Doha (QA); Ahmed K. Elmagarmid, Doha (QA)

(73) Assignee: Qatar Foundation, Doha (QA)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 120 days.

(21) Appl. No.: 13/218,789

(22) Filed: Aug. 26, 2011

(65) Prior Publication Data: US 2013/0054539 A1, Feb. 28, 2013

(51) Int. Cl.: G06F 17/30 (2006.01); G06F 7/00 (2006.01)

(52) U.S. Cl.: USPC 707/691; 707/690; 707/692

(58) Field of Classification Search: None. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
5,504,890 A * 4/1996 Sanford
6,968,348 B1 * 11/2005 Carone et al. 707/696
7,092,956 B2 * 8/2006 Ruediger 707/602
7,562,067 B2 * 7/2009 Chaudhuri et al.
7,756,873 B2 * 7/2010 Gould et al. 707/737
8,041,668 B2 * 10/2011 Fan et al. 706/48
8,200,640 B2 * 6/2012 Arasu et al. 707/692
8,204,866 B2 * 6/2012 Chaudhuri et al. 707/692
8,346,736 B2 * 1/2013 Haustein et al. 707/692
8,356,017 B2 * 1/2013 Akirav et al. 707/692
8,639,667 B2 * 1/2014 Golab et al. 707/690
2004/0003005 A1 * 1/2004 Chaudhuri et al. 707/200
2004/0107386 A1 * 6/2004 Burdick et al. 714/38
2004/0181527 A1 * 9/2004 Burdick et al. 707/6
2005/0262044 A1 * 11/2005 Chaudhuri et al. 707/1
2006/0155743 A1 * 7/2006 Bohannon et al. 707/102
2007/0168334 A1 * 7/2007 Julien et al. 707/3
2008/0243967 A1 * 10/2008 Bhatia et al. 707/206
2008/0288482 A1 * 11/2008 Chaudhuri et al. 707/5
2008/0306945 A1 * 12/2008 Chaudhuri et al. 707/6
2009/0006302 A1 * 1/2009 Fan et al. 706/48
2010/0005048 A1 * 1/2010 Bodapati et al. 706/47
2010/0318499 A1 * 12/2010 Arasu et al. 707/693

FOREIGN PATENT DOCUMENTS
GB 2493963 A 2/2013
WO 2013029817 A1 3/2013

OTHER PUBLICATIONS
Lee et al., "Cleansing Data for Mining and Warehousing", 1999. In Proceedings of the 10th International Conference on Database and Expert Systems Applications (DEXA '99). Springer-Verlag, London, UK, pp. 751-760.
Muller, H. et al., "Problems, Methods, and Challenges in Comprehensive Data Cleansing", XP007920863 (2003). Retrieved from the Internet: URL: http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/techreports/2003-hub_ib_164-mueller.pdf (retrieved on Jul. 24, 2012).
(Continued)

Primary Examiner — Robert Beausoliel, Jr.
Assistant Examiner — Susan F. Rayyan
(74) Attorney, Agent, or Firm — Mossman Kumar & Tyler PC

(57) ABSTRACT

A computer implemented method for repairing records of a database comprises determining a first set of records of the database which violate a functional dependency of the database, modifying records in the first set to make them consistent with the functional dependency to provide an output consistent database instance, determining a second set of records of the output consistent database instance comprising duplicate records, and merging duplicate records in the second set in dependence on the functional dependencies of the records to provide a modified database instance.

15 Claims, 4 Drawing Sheets
[FIG. 1 drawing: the OCR residue of the small database instance (names, phone numbers, cities) is unrecoverable and has been omitted.]

US 8,782,016 B2, Page 2

(56) References Cited: OTHER PUBLICATIONS (continued)

Beskales, G. et al., "Modeling and Querying Possible Repairs in Duplicate Detection", Aug. 2009, XP002680770. Retrieved from the Internet: URL: http://www.cs.uwaterloo.ca/~gbeskale/BeskalesVLDB2009.pdf.
Beskales, G. et al., "Sampling the Repairs of Functional Dependency Violations under Hard Constraints", Sep. 2010, XP002680771. Retrieved from the Internet: URL: http://www.cs.uwaterloo.ca/~gbeskale/BeskalesVLDB2010.pdf.
Kaewbuadee, Kollayut et al., "Data Cleaning Using FD From Data Mining Process", May 16, 2006, XP007920865. Retrieved from the Internet: URL: http://www.iadis.org/Multi2006/Papers/16/F026_DS.pdf (retrieved on Jul. 24, 2012).
International Search Report in GB1114744.4, dated Nov. 24, 2011.
International Search Report and Written Opinion in International Application PCT/EP2012/060026, dated Aug. 3, 2012.
* cited by examiner

[Drawing sheets 1-4 (U.S. Patent, Jul. 15, 2014, US 8,782,016 B2): figures are images; their OCR residue has been omitted.]

DATABASE RECORD REPAIR

The present invention relates to database record repair.

BACKGROUND

A database is a collection of information arranged in an organized manner. A typical database might include medical, financial or accounting information, demographics and market survey data, bibliographic or archival data, personnel and organizational information, public governmental records, private business or customer data such as addresses and phone numbers, etc.

Such information is usually contained in computer files arranged in a pre-selected database format, and the data contents within them can be maintained for convenient access on magnetic media, both for storage and for updating the file contents as needed.

Poor data quality can have undesirable implications for the effectiveness of a business or other organization or entity. For example, in healthcare, where incorrect information about patients in an Electronic Health Record (EHR) may lead to wrong treatments and prescriptions, ensuring the accuracy of database entries is of prime importance.

A large variety of computational procedures for cleaning or repairing erroneous or duplicate entries in databases have been proposed. Typically, such procedures can automatically or semi-automatically identify errors and, when possible, correct them. Typically, however, these approaches have several limitations relating to the introduction of new database errors as a result of changes that have been made. For example, a repair in order to correct a functional dependency problem may lead to duplication errors. Similarly, deduplication can lead to functional dependency violations within a database.

SUMMARY

According to an aspect of the present invention, there is provided a computer implemented method for repairing records of a database, comprising determining a first set of records of the database which violate a functional dependency of the database, modifying records in the first set to make them consistent with the functional dependency to provide an output consistent database instance, determining a second set of records of the output consistent database instance comprising duplicate records, and merging duplicate records in the second set in dependence on the functional dependencies of the records to provide a modified database instance.

According to an aspect of the present invention, there is provided a computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for updating a database comprising determining a first set of records of the database which violate a functional dependency of the database, modifying records in the first set to make them consistent with the functional dependency to provide an output consistent database instance, determining a second set of records of the output consistent database instance comprising duplicate records, and merging duplicate records in the second set in dependence on the functional dependencies of the records to provide a modified database instance.

Duplicate records can be determined using a duplication mechanism to group duplicate records into respective clusters, wherein records within respective ones of the clusters represent the same entity. A duplicate identification attribute can be assigned to records in the second set of records. In an example, a schema for the database includes the duplicate identification attribute. Preferably, the modified database instance does not include the duplicate identification attribute.

According to an aspect of the present invention, there is provided a computer program embedded on a non-transitory tangible computer readable storage medium, the computer program including machine readable instructions that, when executed by a processor, implement a method for updating a database comprising performing a functional-dependency aware deduplication of records of the database.

According to an aspect of the present invention, there is provided an apparatus for modifying records in a database, comprising a functional definition detection engine operable to detect functional definition violations for records in the database to provide a first set of records, a functional definition repair engine operable to repair functional definition violations of records in the first set to provide a consistent database instance, a duplicate detection engine operable to detect duplicate record entries in the consistent database instance, and a consistency aware repair engine operable to merge duplicate records in respective clusters of duplicate records of the consistent database instance to provide a modified database instance. The functional definition repair engine can be further operable to repair functional definition violations arising as a result of a previous functional definition repair operation.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a small database instance;

FIG. 2 is a schematic block diagram of a method according to an example;
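The claimed pipeline (repair functional dependency violations first, then deduplicate the now-consistent instance) is easier to see in miniature. Below is a minimal Python sketch, assuming a table is a list of dicts and a single functional dependency LHS → RHS; the function names, the majority-vote repair policy, and the exact-match duplicate test are illustrative assumptions, not the patent's prescribed mechanisms.

```python
from collections import defaultdict

def fd_violations(records, lhs, rhs):
    """Return groups of records that share an LHS value but disagree on RHS."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[lhs]].append(rec)
    return [grp for grp in groups.values()
            if len({r[rhs] for r in grp}) > 1]

def repair_fd(records, lhs, rhs):
    """Make records consistent: within each LHS group, adopt the most
    frequent RHS value (one simple repair policy among many)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[lhs]].append(rec)
    for grp in groups.values():
        values = [r[rhs] for r in grp]
        majority = max(set(values), key=values.count)
        for r in grp:
            r[rhs] = majority
    return records

def dedupe(records, key_fields):
    """Merge records that agree on key_fields, keeping the first of each
    cluster (a stand-in for a real duplicate-detection mechanism)."""
    seen = {}
    for rec in records:
        seen.setdefault(tuple(rec[f] for f in key_fields), rec)
    return list(seen.values())

db = [
    {"name": "R. Daily", "zip": "10001", "city": "New York"},
    {"name": "R. Daily", "zip": "10001", "city": "Newark"},    # FD violation
    {"name": "R. Daily", "zip": "10001", "city": "New York"},  # duplicate
]
assert fd_violations(db, "zip", "city")
consistent = repair_fd(db, "zip", "city")
merged = dedupe(consistent, ("name", "zip", "city"))
print(merged)  # one repaired, de-duplicated record
```

Note how the order matters, as the patent's background observes: repairing the FD first turns the near-duplicates into exact duplicates, so the subsequent merge cannot reintroduce an FD violation.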
Recommended publications
  • A Study over Importance of Data Cleansing in Data Warehouse
    International Journal of Advanced Research in Computer Science and Software Engineering, Volume 6, Issue 4, April 2016. ISSN: 2277 128X. Research paper available online at: www.ijarcsse.com. A Study over Importance of Data Cleansing in Data Warehouse. Shivangi Rana (Research Scholar, CSE Department), Er. Gagan Prakesh Negi (Assistant Professor, CSE Department), Kapil Kapoor (Associate Professor, ECE Department); Abhilashi Group of Institutions (School of Engg.), Chail Chowk, Mandi, India. Abstract: Cleansing data from impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intending to enhance the accuracy, and thereby the usability, of existing data. These days, many organizations use a Data Warehouse to support their decision-making processes, achieve their goals, and satisfy their customers; it enables executives to access the information they need in a timely manner to make the right decisions. A Decision Support System (DSS) is one of the means applied in data mining, and its robust decisions depend on an important and conclusive factor called Data Quality (DQ). High data quality is obtained using Data Scrubbing (DS), one of the Extraction, Transformation and Loading (ETL) tools; data scrubbing is very important and necessary in the Data Warehouse (DW). This paper presents a survey of sources of error in data, data quality challenges and approaches, data cleaning types and techniques, and an overview of the ETL process.
  • Oracle PaaS and IaaS Universal Credits Service Descriptions
    Oracle PaaS and IaaS Universal Credits Service Descriptions. Effective Date: 10-September-2021. Oracle UCM 091021. Table of Contents (excerpt, page numbers omitted): Metrics; Oracle PaaS and IaaS Universal Credit. 1. Available Services: a. Eligible Oracle PaaS Cloud Services; b. Eligible Oracle IaaS Cloud Services; c. Additional Services; d. Always Free Cloud Services. 2. Activation, Usage and Billing: a. Introduction; i. Annual Universal Credit (overage; replenishment of account at end of services period; additional services); ii. Monthly Universal Credit (subject to Oracle approval) (overage; orders placed via a partner; replenishment of account at end of services period); iii. Pay as You Go; iv. Funded Allocation Model (overage; additional services; replenishment of account at end of services period). 3. Included Services: i. Developer Cloud Service; ii. Oracle Identity Foundation Cloud Service; b. Additional Licenses and Oracle Linux Technical Support; c. Oracle Cloud Infrastructure Data Catalog; d. Oracle Cloud Infrastructure Data Transfer Disk (your obligations/responsibilities and project assumptions; export); e. Oracle Cloud Infrastructure Application Migration; f. Oracle Cloud Infrastructure Console; g. Oracle Cloud Infrastructure Cloud Shell (access and usage). 4. Services Available via the Oracle Cloud Marketplace: a. Oracle Cloud Services Delivered via the Oracle Cloud Marketplace; b. Third Party
  • BigDansing: A System for Big Data Cleansing
    BigDansing: A System for Big Data Cleansing. Zuhair Khayyat (King Abdullah University of Science and Technology (KAUST)), Ihab F. Ilyas (University of Waterloo), Alekh Jindal and Samuel Madden (CSAIL, MIT), Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang and Si Yin (Qatar Computing Research Institute). [email protected]; [email protected]; {alekh,madden}@csail.mit.edu; {mouzzani,ppapotti,jquianeruiz,ntang,siyin}@qf.org.qa

    ABSTRACT: Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform.

    Table 1: Dataset D with tax data records
        name    zipcode  city  state  salary  rate
    t1  Annie   10001    NY    NY     24000   15
    t2  Laure   90210    LA    CA     25000   10
    t3  John    60601    CH    IL     40000   25
    t4  Mark    90210    SF    CA     88000   28
    t5  Robert  60827    CH    IL     15000   15
    t6  Mary    90210    LA    CA     81000   28

    Three example rules over D: (r1) a zipcode uniquely determines a city; (r2) an individual earning a lower salary must have a lower tax rate; and (r3) two tuples refer to the same individual if they have similar names, and their cities are inside the same county. We define these rules as follows:
    (r1) φF: D(zipcode → city)
    (r2) φD: ∀t1, t2 ∈ D, ¬(t1.rate > t2.rate ∧ t1.salary < t2.salary)
    (r3) φU: ∀t1, t2 ∈ D, ¬(simF(t1.name, t2.name) ∧ getCounty(t1.city) = getCounty(t2.city))
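The first two rules can be checked directly against Table 1. A minimal Python sketch follows; r3 is omitted because simF and getCounty are user-defined functions in the paper, and everything here (pairwise enumeration, function names) is an illustrative assumption rather than BigDansing's actual execution strategy:

```python
from itertools import combinations

D = [
    {"id": "t1", "name": "Annie",  "zipcode": "10001", "city": "NY", "salary": 24000, "rate": 15},
    {"id": "t2", "name": "Laure",  "zipcode": "90210", "city": "LA", "salary": 25000, "rate": 10},
    {"id": "t3", "name": "John",   "zipcode": "60601", "city": "CH", "salary": 40000, "rate": 25},
    {"id": "t4", "name": "Mark",   "zipcode": "90210", "city": "SF", "salary": 88000, "rate": 28},
    {"id": "t5", "name": "Robert", "zipcode": "60827", "city": "CH", "salary": 15000, "rate": 15},
    {"id": "t6", "name": "Mary",   "zipcode": "90210", "city": "LA", "salary": 81000, "rate": 28},
]

def violates_r1(t1, t2):
    # r1: functional dependency zipcode -> city.
    return t1["zipcode"] == t2["zipcode"] and t1["city"] != t2["city"]

def violates_r2(t1, t2):
    # r2: denial constraint - nobody may earn less yet be taxed at a higher rate.
    return t1["rate"] > t2["rate"] and t1["salary"] < t2["salary"]

for t1, t2 in combinations(D, 2):
    if violates_r1(t1, t2):
        print("r1 violation:", t1["id"], t2["id"])  # e.g. t2/t4: 90210 maps to LA and SF
    if violates_r2(t1, t2) or violates_r2(t2, t1):
        print("r2 violation:", t1["id"], t2["id"])  # e.g. t1/t2: lower salary, higher rate
```

The quadratic pairwise loop is exactly the cost the abstract warns about, which is why BigDansing pushes such rules onto distributed platforms instead.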
  • Data Profiling and Data Cleansing: Introduction
    Data Profiling and Data Cleansing: Introduction. Felix Naumann, 9.4.2013. Overview: introduction to the research group; lecture organisation; (big) data (data sources, profiling, cleansing); overview of the semester. Information Systems Team (DFG, IBM): Prof. Felix Naumann, Arvid Heise, Katrin Heinrich, Dustin Lange, Johannes Lorey, Christoph Böhm, Anja Jentzsch, Ziawasch Abedjan, Tobias Vogel, Toni Grütze, Dr. Gjergji Kasneci, Zhe Zuo, Maximilian Jenders. Projects and topics: project DuDe, duplicate detection, data fusion, data profiling, project Stratosphere, entity search, information integration, data scrubbing, project GovWILD, data as a service, data cleansing, information quality, web data, Linked Open Data, RDF data mining, dependency detection, ETL management, service-oriented systems, entity recognition, opinion mining, HPI Research School. Other courses in this semester: lectures DBS I (Bachelor) and Data Profiling and Data Cleansing; Master seminars Large Scale Duplicate Detection and Advanced Recommendation Techniques; Bachelor project VIP 2.0: Celebrity Exploration. Seminar: Advanced Recommendation Techniques. Goal: cross-platform recommendation for posts on the Web. Given a post on a website, find relevant (i.e., similar) posts from other websites; analyze post, author, and website features; implement and compare different state-of-the-art recommendation techniques; calculate the similarity sim(p, q) between posts p and q; recommend the top-k posts (see the sketch after this excerpt). Dates and exercises: lectures Tuesdays 9:15-10:45 and Thursdays 9:15-10:45, exercises in parallel; first lecture 9.4.2013. Exam: oral, 30 minutes, probably in the first week after lectures. Prerequisites to participate: background in databases (e.g.
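The seminar's recommendation loop (score sim(p, q) pairwise, return the top-k) can be sketched with a plain bag-of-words cosine similarity. This is a hypothetical stand-in, not the seminar's prescribed technique:

```python
from collections import Counter
from math import sqrt

def cosine(p, q):
    # Bag-of-words cosine similarity between two post texts.
    a, b = Counter(p.lower().split()), Counter(q.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, posts, k=3):
    # Rank candidate posts by similarity to the query post.
    return sorted(posts, key=lambda p: cosine(query, p), reverse=True)[:k]

posts = ["data cleansing in warehouses", "duplicate detection at scale",
         "cleansing dirty warehouse data", "celebrity exploration"]
print(top_k("warehouse data cleansing", posts, k=2))
```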
  • Big Data Analytics for Large Scale Wireless Networks: Challenges and Opportunities
    Big Data Analytics for Large Scale Wireless Networks: Challenges and Opportunities. Hong-Ning Dai (Macau University of Science and Technology), Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology (HKUST)), Hao Wang (Norwegian University of Science and Technology), Zibin Zheng (Sun Yat-sen University), Athanasios V. Vasilakos (Lulea University of Technology). The wide proliferation of various wireless communication systems and wireless devices has led to the arrival of the big data era in large scale wireless networks. Big data of large scale wireless networks has the key features of wide variety, high volume, real-time velocity and huge value, leading to unique research challenges that are different from those of existing computing systems. In this paper, we present a survey of the state-of-the-art big data analytics (BDA) approaches for large scale wireless networks. In particular, we categorize the life cycle of BDA into four consecutive stages: Data Acquisition, Data Preprocessing, Data Storage and Data Analytics. We then present a detailed survey of the technical solutions to the challenges in BDA for large scale wireless networks according to each stage in the life cycle of BDA. Moreover, we discuss the open research issues and outline the future directions in this promising area. Additional Key Words and Phrases: Big Data; Data Mining; Wireless Networks. ACM Reference format: Hong-Ning Dai, Raymond Chi-Wing Wong, Hao Wang, Zibin Zheng, and Athanasios V. Vasilakos. 2019. Big Data Analytics for Large Scale Wireless Networks: Challenges and Opportunities. 1, 1, Article 1 (May 2019), 46 pages. DOI: 10.1145/nnnnnnn.nnnnnnn. 1 INTRODUCTION: In recent years, we have seen the proliferation of wireless communication technologies, which are widely used today across the globe to fulfill the communication needs of an extremely large number of end users.
  • A Review of Data Cleaning Algorithms for Data Warehouse Systems
    Rajashree Y. Patil et al., (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (5), 2012, pp. 5212-5214. A Review of Data Cleaning Algorithms for Data Warehouse Systems. Rajashree Y. Patil (Vivekanand College, Kolhapur, Maharashtra, India); Dr. R. V. Kulkarni (Shahu Institute of Business and Research (SIBER), Kolhapur, Maharashtra, India). Abstract: In today's competitive environment, there is a need for more precise information for better decision making, yet inconsistency in the submitted data makes it difficult to aggregate data and analyze results, which may delay or compromise the reporting of results. The purpose of this article is to study the different algorithms available to clean data, to meet the growing demands of industry and the need for more standardised data. Data cleaning algorithms can increase the quality of data while at the same time reducing the overall effort of data collection. Keywords: ETL, FD, SNM-IN, SNM-OUT, ERACER. I. INTRODUCTION: A data cleaning algorithm is a process used to determine inaccurate, incomplete, or unreasonable data and then improve the quality through correction of detected errors and omissions. The process may include format checks, completeness checks, reasonableness checks, limit checks, review of the data to identify outliers (geographic, statistical, temporal or environmental) or other errors, and assessment of data by subject area experts (e.g. taxonomic specialists). These processes usually result in flagging, documenting and subsequently checking and correcting suspect records; validation checks may also involve checking for compliance against applicable standards, rules and conventions. A sketch of such checks appears below.
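A minimal sketch of those check types in Python, assuming simple record dicts; the field names, the email pattern, and the thresholds are illustrative assumptions, not from the paper:

```python
import re
from statistics import mean, stdev

records = [
    {"id": "1", "email": "alice@example.com", "age": 34},
    {"id": "2", "email": "not-an-email",      "age": 29},
    {"id": "3", "email": None,                "age": 210},  # missing value + implausible age
]

def format_check(rec):
    # Format check: the email field must look like an address.
    return bool(rec["email"] and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", rec["email"]))

def completeness_check(rec):
    # Completeness check: no required field may be absent.
    return all(rec.get(f) is not None for f in ("id", "email", "age"))

def limit_check(rec, lo=0, hi=120):
    # Limit (reasonableness) check: age must fall in a plausible range.
    return lo <= rec["age"] <= hi

def outlier_flags(values, z=1.0):
    # Statistical outlier review; the loose z threshold suits this tiny sample.
    m, s = mean(values), stdev(values)
    return [abs(v - m) > z * s for v in values]

for rec in records:
    failed = [name for name, ok in (
        ("format", format_check(rec)),
        ("completeness", completeness_check(rec)),
        ("limit", limit_check(rec)),
    ) if not ok]
    if failed:
        print("record", rec["id"], "failed:", ", ".join(failed))

ages = [r["age"] for r in records]
print("age outliers:", [r["id"] for r, flag in zip(records, outlier_flags(ages)) if flag])
```

Flagged records would then go to documentation and expert review, as the introduction describes, rather than being silently corrected.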
  • Study of Data Cleaning & Comparison of Data Cleaning Tools
    Sapna Devi et al., International Journal of Computer Science and Mobile Computing, Vol. 4, Issue 3, March 2015, pp. 360-370. Available online at www.ijcsmc.com. ISSN 2320-088X. Research article: Study of Data Cleaning & Comparison of Data Cleaning Tools. Sapna Devi and Dr. Arvind Kalia (Himachal Pradesh University, Shimla, India). 1 [email protected]; 2 [email protected]. Abstract: Data cleaning is a major issue. Data mining requires clean, consistent and noise-free data; incorrect or inconsistent data can lead to false conclusions and misdirect investments on both public and private scales. Data comes from various systems and in many different forms; it may be incomplete, yet it is the raw material for data mining. This research paper provides an overview of data cleaning problems, data quality, cleaning approaches, and a comparison of data cleaning tools. Keywords: Data Cleaning, Data Quality, Data Preprocessing. I. INTRODUCTION: Data cleaning is part of data preprocessing before data mining; prior to the process of mining information in a data warehouse, data cleaning is crucial because of the garbage-in, garbage-out principle [1]. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. The main objective of data cleaning is to reduce the time and complexity of the mining process and increase the quality of data in the data warehouse.
  • Data Quality Measures and Data Cleansing for Research Information Systems
    Data Quality Measures and Data Cleansing for Research Information Systems. Otmane Azeroual, Gunter Saake, Mohammad Abuosba. Journal of Digital Information Management, Digital Information Research Foundation, 2018, 16 (1), pp. 12-21. HAL Id: hal-01972675, https://hal.archives-ouvertes.fr/hal-01972675. Submitted on 16 Jan 2019. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Affiliations: 1 German Center for Higher Education Research and Science Studies (DZHW), 10117 Berlin, Germany; 2 Otto-von-Guericke University Magdeburg, Department of Computer Science, Institute for Technical and Business Information Systems, Database Research Group, P.O. Box 4120, 39106 Magdeburg, Germany; 3 University of Applied Sciences HTW Berlin, Study Program Computational
  • Problems, Methods, and Challenges in Comprehensive Data Cleansing
    Problems, Methods, and Challenges in Comprehensive Data Cleansing. Heiko Müller, Johann-Christoph Freytag. Humboldt-Universität zu Berlin, 10099 Berlin, Germany. {hmueller, freytag}@dbis.informatik.hu-berlin.de. Abstract: Cleansing data from impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intending to enhance the accuracy, and thereby the usability, of existing data. This paper presents a survey of data cleansing problems, approaches, and methods. We classify the various types of anomalies occurring in data that have to be eliminated, and we define a set of quality criteria that comprehensively cleansed data has to accomplish. Based on this classification we evaluate and compare existing approaches for data cleansing with respect to the types of anomalies handled and eliminated by them. We also describe in general the different steps in data cleansing, specify the methods used within the cleansing process, and give an outlook on research directions that complement the existing systems. Contents (excerpt): 1 Introduction; 2 Motivation; 3 Data Anomalies; 3.1 Data Model
  • Data Cleansing and Analysis for Benchmarking
    Weatherization and Intergovernmental Programs Office. Benchmarking Data Cleansing: A Rite of Passage Along the Benchmarking Journey. Mona Khalil, U.S. DOE; Shankar Earni, LBNL. April 30, 2015. DOE's State and Local Technical Assistance. Resources: general education (fact sheets, 101s); implementation models (case studies); research and tools for decision-making; protocols (how-to guides, model documents); webinars. Peer exchange and trainings: conferences and in-person trainings; Better Buildings project teams; accelerators. Direct assistance: on a limited basis; level of effort will vary; in-depth efforts will focus on high-impact efforts, opportunities for replicability, and filling gaps in the technical assistance marketplace. How to tap into these and other TAP offerings: visit the State and Local Solution Center (http://energy.gov/eere/slsc/state-and-local-solution-center); sign up for TAP Alerts by emailing [email protected]. Course outline: course objectives; building benchmarking; bad data (what is it? types, common issues); data cleansing (what it is, why do it); data cleansing process (identify/fix incorrect data types; identify/fix missing or erroneous values; identify/fix outliers and other inconsistencies; check and fix to ensure internal consistency); data cleansing on a sample data set (see the sketch after this excerpt). Course objectives. Intended audience: cities, communities, and states that have implemented or are considering implementing an internal or community-wide benchmarking and/or disclosure program or policy
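The four cleansing steps in the course outline map naturally onto a few lines of pandas. A minimal sketch on hypothetical benchmarking data; the column names, values, and thresholds are assumptions, not course materials:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "building_id": ["B1", "B2", "B3", "B4"],
    "floor_area_sqft": ["12000", "8500", "n/a", "15000"],  # wrong dtype + bad value
    "annual_kwh": [240000.0, np.nan, 90000.0, 9.9e9],      # missing value + outlier
})

# 1. Identify/fix incorrect data types (unparseable entries become NaN).
df["floor_area_sqft"] = pd.to_numeric(df["floor_area_sqft"], errors="coerce")

# 2. Identify/fix missing or erroneous values (here: impute the column median).
df["annual_kwh"] = df["annual_kwh"].fillna(df["annual_kwh"].median())

# 3. Identify outliers and other inconsistencies (simple IQR rule).
q1, q3 = df["annual_kwh"].quantile([0.25, 0.75])
iqr = q3 - q1
df["kwh_outlier"] = ~df["annual_kwh"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4. Check internal consistency (derived energy intensity should be plausible).
df["kwh_per_sqft"] = df["annual_kwh"] / df["floor_area_sqft"]
print(df)
```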
  • Data and Information Quality Research: Its Evolution and Future
    Chapter 16: Data and Information Quality Research: Its Evolution and Future. Hongwei Zhu (University of Massachusetts), Stuart E. Madnick (Massachusetts Institute of Technology), Yang W. Lee (Northeastern University), Richard Y. Wang (Massachusetts Institute of Technology). Contents: 16.1 Introduction; 16.2 Evolution of Data Quality Research (TDQM framework as the foundation; establishment and growth of the field and profession); 16.3 Framework for Characterizing Data Quality Research; 16.4 Research Topics (data quality impact; database-related technical solutions for data quality; data quality in the context of computer science and information technology; data quality in curation); 16.5 Research Methods (action research; artificial intelligence; case study; data mining; design science; econometrics; empirical; experimental; mathematical modeling; qualitative; quantitative; statistical analysis; system design and implementation; survey; theory and formal proofs); 16.6 Challenges and Conclusion; References. 16.1 Introduction: Organizations have increasingly invested in technology and human
  • Extract Transform Load (ETL) Process in Distributed Database Academic Data Warehouse
    APTIKOM Journal on Computer Science and Information Technologies, Vol. 4, No. 2, 2019, pp. 61-68. ISSN: 2528-2417. DOI: 10.11591/APTIKOM.J.CSIT.36. Extract Transform Load (ETL) Process in Distributed Database Academic Data Warehouse. Ardhian Agung Yulianto, Industrial Engineering Department, Engineering Faculty, Andalas University, Kampus Limau Manis, Kota Padang, West Sumatera, Indonesia, 25163. Corresponding author, e-mail: ardhian.ay@eng.unand.ac.id. Abstract: While a data warehouse is designed to support the decision-making function, the most time-consuming part is the Extract Transform Load (ETL) process. In the case of an academic data warehouse whose data sources come from the faculties' distributed databases, integration is not necessarily easier even though the faculties share a typical database design. This paper presents an ETL process for a distributed-database academic data warehouse. Following the Data Flow Thread process in the data staging area, a deep analysis is performed to identify all tables in each data source, including content profiling. Then the cleaning, conforming, and data delivery steps pour the different data sources into the data warehouse (DW). Since the DW development uses Kimball's bottom-up multidimensional approach, we found three types of extraction activities from data source tables: merge, merge-union, and union. Results for the cleaning and conforming steps are set by creating conformed dimensions from data source analysis, refinement, and hierarchy structure. The final ETL step is loading into integrated dimension and fact tables with generation of a surrogate key. Those processes run gradually from each distributed database data source until it is incorporated.
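The three extraction activities and the surrogate-key step can be illustrated compactly. Below is a minimal Python sketch under assumed table shapes (rows as dicts keyed by a student-ID column called "nim" here); the helper names, the column names, and this reading of merge/merge-union/union are illustrative assumptions, not the paper's implementation:

```python
def union(*sources):
    # union: same-schema tables from each faculty database are concatenated.
    return [row for src in sources for row in src]

def merge(*sources, key="nim"):
    # merge: rows describing the same entity (matched on a source key) are
    # collapsed, with later sources filling in attributes missing so far.
    out = {}
    for src in sources:
        for row in src:
            merged = out.setdefault(row[key], dict(row))
            for col, val in row.items():
                if merged.get(col) is None:
                    merged[col] = val
    return list(out.values())

def merge_union(shared, *specific, key="nim"):
    # merge-union: faculty-specific tables are unioned first, then merged
    # with a shared table.
    return merge(shared, union(*specific), key=key)

def load_dimension(rows):
    # Loading assigns a warehouse-generated surrogate key, independent of
    # any source-system key.
    return [{"sk": i, **row} for i, row in enumerate(rows, start=1)]

eng = [{"nim": "1710932005", "name": "Budi", "dept": "Industrial Engineering"}]
med = [{"nim": "1710312011", "name": "Sari", "dept": "Medicine"}]
print(load_dimension(union(eng, med)))
```

Keeping the surrogate key separate from the source keys is what lets rows from differently administered faculty databases coexist in one conformed dimension.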