Data Fusion – Resolving Data Conflicts for Integration
Tutorial proposal, intended length 1.5 hours

Xin (Luna) Dong
AT&T Labs Inc.
Florham Park, NJ, USA

Felix Naumann
Hasso Plattner Institute (HPI)
Potsdam, Germany

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. MOTIVATION

The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [13, 15].

Data integration systems face two kinds of challenges. First, data from disparate sources are often heterogeneous. Heterogeneity can exist at the schema level, where different data sources often describe the same domain using different schemas; it can also exist at the instance level, where different sources can represent the same real-world entity in different ways. There has been a rich body of work on resolving heterogeneity in data, including, at the schema level, schema mapping and matching [17], model management [1], answering queries using views [14], and data exchange [10], and, at the instance level, record linkage (a.k.a. entity resolution, object matching, reference linkage, etc.) [9, 18] and string similarity comparison [6].

Second, different sources can provide conflicting data. Conflicts can arise because of incomplete data, erroneous data, and out-of-date data. Returning incorrect data in a query result can be misleading and even harmful: one may contact a person by an out-of-date phone number, visit a clinic at a wrong address, carry wrong knowledge of the real world, and even make poor business decisions. It is thus critical for data integration systems to resolve conflicts from various sources and identify true values from false ones. This problem becomes especially prominent with the ease of publishing and spreading false information on the Web and has recently received increasing attention.

This tutorial focuses on data fusion, which addresses the second challenge by fusing records on the same real-world entity into a single record and resolving possible conflicts from different data sources. Data fusion plays an important role in data integration systems: it detects and removes dirty data and increases correctness of the integrated data.

Objectives and Coverage. The main objective of the proposed tutorial is to gather models, techniques, and systems of the wide but as yet unconsolidated field of data fusion and present them in a concise and consolidated manner. In the 1.5-hour tutorial we will provide an overview of the causes and challenges of data fusion. We will cover a wide set of both simple and advanced techniques to resolve data conflicts in different types of settings and systems. Finally, we provide a classification of existing information management systems with respect to their ability to perform data fusion.

Intended audience. Data fusion touches many aspects of the very basics of data integration. Thus, we expect the tutorial to appeal to a large portion of the VLDB community:

• Researchers in the fields of data integration, data cleansing, data consolidation, data extraction, data mining, and Web information management.

• Practitioners developing and distributing products in the data integration, data cleansing, ETL & data warehousing, and master data management areas.

We expect that attendees will take home from this seminar (i) an understanding of the causes and challenges of conflicting data along with different application scenarios, (ii) knowledge about concrete methods to resolve data conflicts both within relational DBMS and through dedicated applications, and (iii) an overview of existing tools and systems to perform data fusion.

Assumed background. Apart from a basic understanding of database technology and data integration, there are no prerequisites for this tutorial.

The proposed tutorial is based on a recent survey on data fusion [4] and various techniques proposed for truth discovery (including, but not limited to, [2, 7, 8, 19, 21]). We acknowledge the great contributions of authors of relevant papers.

2. TUTORIAL OUTLINE

Our tutorial starts with an overview of the importance of data fusion in data integration and possible reasons for data conflicts. We then present a classification of existing data fusion techniques and introduce relational operations for conflict resolution. After that, we describe several advanced techniques for finding the best (true) values in the presence of data conflicts. We end our tutorial by surveying data fusion techniques in existing data integration systems and suggesting future research directions.

2.1 Overview

Data integration has three broad goals: increasing the completeness, conciseness, and correctness of data that is available to users and applications. Completeness measures the amount of data, in terms of both the number of tuples and the number of attributes. Conciseness measures the uniqueness of object representations in the integrated data, in terms of both the number of unique objects and the number of unique attributes of the objects. Finally, correctness measures whether the data conform to the real world.

Whereas high completeness can be obtained by adding more data sources to the system, achieving the other two goals is non-trivial. To meet these requirements, a data integration system needs to perform three levels of tasks (Fig. 1):

1. Schema mapping: First, a data integration system needs to resolve heterogeneity at the schema level by establishing semantic mappings between contents of disparate data sources.

2. Duplicate detection: Second, a data integration system needs to resolve heterogeneity at the instance level by detecting records that refer to the same real-world entity.

3. Data fusion: Third, a data integration system needs to combine records that refer to the same real-world entity by fusing them into a single representation and resolving possible conflicts from different data sources.

Figure 1: Three tasks in data integration (from [16]). Layers, top to bottom: Application; Phase 3: Data Fusion; Phase 2: Duplicate Detection; Phase 1: Schema Mapping; Data sources.

Among these three tasks, schema mapping and duplicate detection (record linkage) aim at removing redundancy and increasing conciseness of the data. Data fusion, which is the focus of this tutorial, aims at resolving conflicts in the data and increasing correctness of data.

We distinguish two kinds of data conflicts: uncertainty and contradiction. Uncertainty is a conflict between a null value and one or more non-null values that are all used to describe the same property of the same entity. Contradiction is caused by different sources providing different values for the same attribute of a real-world entity.

Table 1: Motivating example: five data sources provide information on the affiliations of five researchers. Only S1 provides complete and correct information.

             S1      S2        S3     S4     S5
Stonebraker  MIT     Berkeley  MIT    MIT    null
Dewitt       MSR     MSR       UWisc  UWisc  null
Bernstein    MSR     MSR       MSR    MSR    null
Carey        UCI     AT&T      BEA    BEA    BEA
Halevy       Google  Google    UW     UW     UW

Example 2.1. Consider the five data sources in Table 1. There exists uncertainty on the affiliations of Stonebraker, Dewitt and Bernstein because of the null values provided by source S5, and contradiction on the affiliations of Stonebraker, Dewitt, Carey and Halevy.

There are two key issues in data fusion. First, how to find the best values among conflicting values? Second, how to do so efficiently? We next survey existing work on solving these problems.

2.2 Conflict resolution and data merging

Data conflicts, in the form of uncertainties or contradictions, can be resolved in numerous ways. After introducing a broad classification we turn to the relational algebra, which already provides several possibilities. More elaborate data integration systems and their fusion capabilities are analyzed in Section 2.4.

Conflict resolution strategies. There are many different data integration and fusion systems, each with its own solution. Fig. 2 classifies existing strategies to approach data conflicts and Table 2 lists some of the strategies and their classification. In particular, conflict ignoring strategies are not aware of conflicts, perform no resolution, and thus may produce inconsistent results. Conflict avoiding strategies are aware of conflicts but do not perform individual resolution for each conflict. Rather, a single decision is made, e.g., preference of a source, and applied to all conflicts. Finally, conflict resolving strategies provide the means for individual fusion decisions for each conflict.

Such decisions can be instance-based, i.e., they regard the actual conflicting data values, or they can be metadata-based, i.e., they choose values based on metadata, such as freshness of data or the reliability of a source. Finally, strategies can be classified by the result they are able to produce: deciding strategies choose a preferred value among the existing values, while mediating strategies can produce an entirely new value, such as the average of a set of conflicting numbers.

Relational operations.
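The conflict kinds of Section 2.1 and the strategy classes of Section 2.2 can be illustrated with a minimal Python sketch over the data of Table 1. It is not taken from the tutorial: the function names (conflict_kinds, prefer_source, majority_vote, mediate_average) are hypothetical, majority voting is our own choice of a simple instance-based deciding strategy, and the numbers passed to the averaging function are invented purely for illustration.

```python
from collections import Counter
from statistics import mean

# Data of Table 1; None stands for a null value. Column order is S1..S5.
SOURCES = ["S1", "S2", "S3", "S4", "S5"]
TABLE_1 = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", None],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", None],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", None],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def conflict_kinds(values):
    """Return the kinds of conflict present among the values for one entity."""
    non_null = [v for v in values if v is not None]
    kinds = set()
    # Uncertainty: some sources give a value while others give null.
    if non_null and len(non_null) < len(values):
        kinds.add("uncertainty")
    # Contradiction: at least two different non-null values.
    if len(set(non_null)) > 1:
        kinds.add("contradiction")
    return kinds


def prefer_source(values, source="S1"):
    """Conflict avoiding (hypothetical): one decision, always trust one source."""
    return values[SOURCES.index(source)]


def majority_vote(values):
    """Deciding, instance-based (our assumption): pick the most frequent
    non-null value; ties are broken by first occurrence."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None


def mediate_average(numbers):
    """Mediating: produce a new value, the average of the non-null numbers."""
    present = [n for n in numbers if n is not None]
    return mean(present) if present else None


if __name__ == "__main__":
    for name, values in TABLE_1.items():
        print(name, sorted(conflict_kinds(values)),
              prefer_source(values), majority_vote(values))
```

On the Table 1 data, conflict_kinds reproduces the classification of Example 2.1. For Carey, majority_vote returns BEA whereas prefer_source returns UCI, the value of S1, which the caption of Table 1 states is the only fully correct source; a simple vote over existing values is therefore not guaranteed to find the true value.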