
Complement for Data Integration

Jens Bleiholder #1, Sascha Szott +2, Melanie Herschel ∗3, Felix Naumann #4
# Hasso-Plattner-Institut, Universität Potsdam, Germany
1,4 {jens.bleiholder | felix.naumann}@hpi.uni-potsdam.de
+ Konrad-Zuse-Zentrum für Informationstechnik Berlin, Germany
2 [email protected]
∗ Universität Tübingen, Germany
3 [email protected]

Abstract— A data integration process consists of mapping source data into a target representation (schema mapping [1]), identifying multiple representations of the same real-world object (duplicate detection [2]), and finally combining these representations into a single consistent representation (data fusion [3]). Clearly, as multiple representations of an object are generally not exactly equal, during data fusion, we have to take special care in handling data conflicts. This paper focuses on the definition and implementation of complement union, an operator that defines a new semantics for data fusion.

I. INTRODUCTION

Complement union aims at resolving a special type of data conflict, called uncertainty. Intuitively, the operator considers cases where a concrete value in one tuple conflicts with a NULL value (⊥) of another tuple, and replaces the NULL value with the NON-NULL value when merging the tuples. Complement union combines tuple sets by first computing the outer union and then combining tuples that complement each other. This modular specification gives us more flexibility during integration; complement can also be used to clean source databases before the integration process and it allows us to potentially use more sophisticated schema mapping algorithms than the outer union.

Example. Consider a disaster management scenario. We integrate two databases Police and Hospital (excerpts shown in Fig. 1 (a) and (b)). Note that tid is not a relational attribute but serves as a reference to tuples. The goal is to detect missing persons that have been admitted to a hospital and to contact their relatives when an address is available.

Fig. 1 (c) shows the result of applying complement union to the two sources. Complement union is the result of applying an outer union followed by complementation. The schema of the outer union of Police and Hospital is the union of schemas of both tables, e.g., (Name, DOB, Sex, Address, Blood), and each tuple from the sources is padded with nulls in the result of outer union. That is, tuple 1 becomes (Miller, 7/7/59, m, 12 Main, ⊥) and tuple 5 becomes (Miller, 7/7/59, m, ⊥, O). These have the same values for Name, DOB, and Sex. Tuple 1 contributes Address whereas tuple 5 contributes Blood to the final tuple during complementation.

(a) Police
tid  Name    DOB     Sex  Address
1    Miller  7/7/59  m    12 Main
2    Peter   1/1/53  m    34 First

(b) Hospital
tid  Name    DOB     Sex  Blood
3    Peter   1/1/53  ⊥    AB
4    Miller  ⊥       f    B
5    Miller  7/7/59  m    O

(c) Complement union of Police and Hospital
tid  Name    DOB     Sex  Address   Blood
1+5  Miller  7/7/59  m    12 Main   O
2+3  Peter   1/1/53  m    34 First  AB
4    Miller  ⊥       f    ⊥         B

Fig. 1. Complement union of Police and Hospital databases

Now, consider tuples 1 and 4. Assume that according to duplicate detection, they represent the same real-world object, i.e., the same person. They store different non-null values for Sex. Complement union cannot solve this type of conflict, for which specialized methods exist (see [3] for a survey on fusion methods).

To the best of our knowledge, this work is the first that considers data fusion with the general semantics of complement union. A known operator that is able to combine complementing tuples from two sources is the full disjunction [4]. However, it cannot combine complementing tuples from the same source, nor can it combine tuples where overlapping/join attributes contain NULL values.

Real-world applications. In our experience with real-world data, we observe the usefulness of complementation. For instance, in a sample of the Internet Movie Database (IMDB)¹, we observe that complementation occurs with a frequency of up to 20 times higher than subsumption, the state-of-the-art data fusion operator to handle uncertainty. Also, we find numerous examples on the web, e.g., Fig. 2 shows excerpts of two complementing descriptions of the same CD.

¹ IMDB: http://www.imdb.com

[Figure 2: two Amazon.com product pages for the audio CD "Tu Vas Pas Mourir De Rire" by Mickey 3D, listed under ASIN B000GDI9ZE and ASIN B0012D6H10, with complementing product details.]

Fig. 2. Complementing entries at Amazon.com, as of Oct. 2nd, 2009. ASIN is not a real-world key like ISBN, but a generated product identifier. During data integration, we project it out and assign a new identifier afterwards.

As complementation combines information that was not already combined in one of the sources, we obtain a more complete description of an object, albeit at the risk of creating inaccurate combinations. This behavior is adequate in scenarios where occasional incorrect combinations have little negative impact on the application and the combined information is valuable, e.g., in our disaster management scenario.

Contributions & Structure. We introduce complementation as a database primitive (Sec. II). We contribute (1) efficient algorithms to compute complementation (Sec. III); (2) a study of how complement can be moved in operator trees for query optimization (Sec. IV), and (3) a comparative experimental evaluation on both artificial and real-world data (Sec. V), before we conclude in Sec. VI.

II. DEFINITIONS

Definition 1 (Tuple Complementation): Two tuples t1 and t2 complement each other (t1 ≷ t2) if (1) t1 and t2 have the same schema, (2) values of corresponding attributes in t1 and t2 are either equal or one of them is NULL, (3) t1 and t2 are neither equal nor do they subsume one another, and (4) t1 and t2 have at least one NON-NULL attribute value combination in common.

We introduce condition (3) to strictly separate equality, subsumption and complementation, and introduce condition (4) to assure that tuples are not completely unrelated. Complementing tuples t1 and t2 are combined into a tuple tc, the complement of t1 and t2. Tuple tc is created by coalescing the values of each attribute a from t1 and t2, i.e.,

$$ t_c.a := \begin{cases} t_1.a, & t_2.a \text{ is NULL or } t_1.a = t_2.a \\ t_2.a, & t_1.a \text{ is NULL} \end{cases} \qquad (1) $$

Similarly, we construct the complement of more than two tuples. Tuple complementation is a symmetric relationship. It is neither reflexive nor transitive. In the following, we say that a tuple t1 complements t2 if t1 ≷ t2 holds.

Definition 2 (Maximal complementing set): A complementing set CS is a set of tuples of relation R where, for each pair ti, tj of tuples from CS it holds that ti ≷ tj. A complementing set S1 is a maximal complementing set MCS if there exists no other complementing set S2 s.t. S1 ⊂ S2.

A tuple of R that does not complement any other tuple forms a maximal complementing set of size one. A tuple can be in more than one MCS and in general there are multiple MCS for a relation R. A relation R can be uniquely divided into a set of MCSs. All tuples of a maximal complementing set can be combined into one tuple, the complement tc.

Definition 3 (Complementation Operator): The unary complementation operator κ replaces each existing maximal complementing set MCSi of a relation R by the single complement tuple tci of all tuples in MCSi.

Definition 4 (Complement Union): The complement union of two relations A and B is the composition of outer union (⊎) and complementation (κ), i.e., κ(A ⊎ B); complementing tuples in the result of the outer union of the two input relations A and B are replaced by their complement.

Complement union is commutative but not associative. The operator forms complements of complementary tuples from the same source and from different sources.

III. IMPLEMENTING COMPLEMENTATION

We present two algorithms for computing the complementation operator (κ), namely the partitioning complement (PC) and the null pattern complement (NPC) algorithm.

The main idea of partitioning complement (PC) is based on the maximal complementing sets defined in Sec. II. In a first pass the algorithm goes through all input tuples and groups tuples that complement each other, thus building all sets of complementing tuples. In a second pass, the algorithm goes through all these sets and outputs the complement of all the tuples of all maximal complementing sets.

In Fig. 3 we show the complement relationships among some tuples numbered 1 to 7 (tid). An edge represents a complement relationship. In such a graph representation, a maximal complementing set corresponds to a maximal clique.

[Figure 3: graph of the complement relationships between tuples 1 to 7, and the tree built by the algorithm. Steps (1) to (5) create the root and add tuples 1, 2, 3, and 4; Steps (6+7) add tuples 5 and 6; Step (8) adds tuple 7 and yields the final result.]

Fig. 3. Example for partitioning complement algorithm (PC)

To compute and store the sets of complementing tuples, we insert the tuples of the relation one after another into a tree data structure. We collect the sets of complementing tuples that originate in a particular tuple in the subtree below it.

Consider the example run of our algorithm in Fig. 3: The algorithm starts with an empty tree and inserts the first three tuples (Steps 1 to 4). Tuples are inserted as nodes in depth-first order, from left to right, generally checking if the tuple that is inserted complements one of the already inserted tuples. As tuples 1 to 3 do not complement each other, all are independently inserted. At Step 5 we add tuple 4. As tuple 4 complements tuple 3, it is inserted below 3, forming the set of complementing tuples {3, 4}. It is subsequently also inserted as a child of the root node. However, it is marked (italic and grey color in Fig. 3), because it has already been added to a larger set (i.e., {3, 4}). As we see further on, marking is used to distinguish maximal complementing sets from other sets. Then, tuples 5 and 6 are inserted. Tuple 6 is inserted below tuple 1 and below tuple 2 as it complements both. It is further inserted below the root (marked). Finally, tuple 7 is inserted in Step 8. It is inserted below the combination of tuples 2 and 6 (path root – 2 – 6), due to the fact that it complements the complement of 2 and 6: If a tuple t complements two other tuples r and s then it also complements the complement of r and s. As tuple 7 complements both tuples 2 and 6 alone, it is also inserted as a child of both of them. Due to the depth-first order, in both cases it is marked. It is also inserted as child of the root node, also marked.
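The definitions of Sec. II and the clique view of Fig. 3 can be illustrated with a small Python sketch. It is not the tree-based PC algorithm of this section: it builds the complement graph explicitly and enumerates maximal cliques with a plain Bron-Kerbosch recursion, which is a swapped-in textbook technique. All function names are hypothetical, NULL is modelled as Python's None, and condition (4) is read as "at least one attribute on which both tuples carry the same NON-NULL value", which is an assumption.

```python
from itertools import combinations

NULL = None  # stands in for the NULL value of the paper

def complements(t1, t2):
    """Definition 1 for two tuples over the same schema (condition (1) assumed)."""
    pairs = list(zip(t1, t2))
    # (2) corresponding values are equal or one of them is NULL
    if any(a is not NULL and b is not NULL and a != b for a, b in pairs):
        return False
    # (3) neither equal nor subsuming: each tuple adds a value the other lacks
    t1_adds = any(a is not NULL and b is NULL for a, b in pairs)
    t2_adds = any(b is not NULL and a is NULL for a, b in pairs)
    # (4) assumed reading: one shared NON-NULL value on some attribute
    shared = any(a is not NULL and a == b for a, b in pairs)
    return t1_adds and t2_adds and shared

def coalesce(tuples):
    """Equation (1), applied to all tuples of a complementing set."""
    return tuple(next((v for v in col if v is not NULL), NULL)
                 for col in zip(*tuples))

def maximal_cliques(nodes, adj):
    """Bron-Kerbosch without pivoting; adj maps a node to its neighbour set."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(nodes), set())
    return found

def complementation(relation):
    """Operator kappa: one complement tuple per maximal complementing set."""
    ids = range(len(relation))
    adj = {i: set() for i in ids}
    for i, j in combinations(ids, 2):
        if complements(relation[i], relation[j]):
            adj[i].add(j)
            adj[j].add(i)
    return {coalesce([relation[i] for i in mcs])
            for mcs in maximal_cliques(ids, adj)}

def outer_union(rel_a, schema_a, rel_b, schema_b):
    """Pad every tuple with NULLs to the union of both schemas."""
    schema = list(schema_a) + [c for c in schema_b if c not in schema_a]
    def pad(t, s):
        row = dict(zip(s, t))
        return tuple(row.get(c, NULL) for c in schema)
    return [pad(t, schema_a) for t in rel_a] + [pad(t, schema_b) for t in rel_b]

# Data of Fig. 1 (a) and (b)
police = [("Miller", "7/7/59", "m", "12 Main"), ("Peter", "1/1/53", "m", "34 First")]
hospital = [("Peter", "1/1/53", NULL, "AB"), ("Miller", NULL, "f", "B"),
            ("Miller", "7/7/59", "m", "O")]
result = complementation(outer_union(
    police, ("Name", "DOB", "Sex", "Address"),
    hospital, ("Name", "DOB", "Sex", "Blood")))
for row in sorted(result, key=str):
    print(row)
```

On the Fig. 1 data the sketch merges tuples 1 and 5 as well as tuples 2 and 3, and leaves tuple 4 unchanged. The explicit graph needs a quadratic number of pairwise comparisons, so it only serves to make the definitions concrete.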

In a second pass, we traverse the whole tree and identify maximal complementing sets as the sets formed by all tuples that lie on a root-to-leaf path that does not contain a marked leaf node. For instance, in Fig. 3, the maximal complementing sets are those marked by a filled rectangle (below the last step). The participating original tuples are printed below. Once all maximal complementing sets have been identified, the algorithm computes and outputs the complement of each maximal complementing set. For instance, we return four tuples in our example that correspond to the complements of the sets {1, 6}, {2, 6, 7}, {3, 4}, and {5}.

The worst case time complexity of the algorithm is dominated by the second pass, because the entire tree structure is traversed to visit, check and, where necessary, output the leaf nodes (together with their paths from the root node). The runtime complexity of this algorithm (including the time needed for enumerating all MCSs and building the complements) is then exponential in the size of the largest subtree (which is equal to the size of the largest MCS and usually much smaller than the number of tuples, n), i.e., O(n 2^k) where k is the size of the largest MCS. We refer to this version as the simple complement algorithm.

To considerably improve runtime when computing complement, we partition the relation by the different values of a fixed column c and apply the algorithm to each partition individually. The NULL partition containing all tuples with a NULL value in attribute c needs to be handled differently, because complementation is not transitive. For each NON-NULL partition we add the tuples from the NULL partition to it and then replace complementing tuples. We then split the resulting set of tuples into two groups: (1) the set of all tuples that came from the NULL partition and have not been complemented and (2) all other tuples. Eventually, based on this division, we identify all tuples from the NULL partition that have not been used in any of the partitions and add these to the final output. We refer to this improved version as the partitioning complement algorithm.

We also implemented a second, different complementation algorithm, the null pattern complement algorithm. This algorithm partitions the input relation according to the patterns of tuples' NULL values and compares only tuples with complementing NULL patterns. It is a modification of an algorithm for removing subsumed tuples [5]. Due to space constraints we only report on experimental results.

IV. COMPLEMENTATION IN QUERY PLANS

We now examine how the complementation operator can be moved in logical query plans to potentially increase efficiency. We summarize the discussed rewrite rules in Table I.

Combinations with (Outer) Union. Complement and union are not exchangeable in general, due to the fact that complementation is not a transitive relationship. However, if there are no NULL values in common attributes (e.g., keys), complementation and outer union are exchangeable (Rule (1)).

Combinations with Selection. If selection is applied on a column without NULL values (e.g., key, NOT NULL column), we can push selection through complementation (Rule (2)). If a column allows NULL values, pushing selection through complementation alters the result only if a tuple complementing another tuple is removed from the input. We can however push the selection partially down (Rule (3)).

Combinations with Join. In general, complementation and the Cartesian product cannot be exchanged: applying the Cartesian product introduces complementing tuples if the relations contain subsumed tuples (Rules (4) and (5)). A general rule for combining join and complement cannot be devised, so joins and complementation are not exchangeable.

Other Combinations. Two complementation operators can be combined into one (Rule (6)). In general, complementation and grouping/aggregation are not exchangeable as complementation may remove tuples that are essential for the computation of the aggregate. However, there are cases in which we can remove the complementation operator and leave only the grouping/aggregation (Rules (7), (8), (9), and (10)).

TABLE I
REWRITE RULES FOR COMPLEMENTATION

(1)  κ(A ⊎ B) = κ(A) ⊎ κ(B), only if ∄ t ∈ A, t' ∈ B : t ≷ t'
(2)  κ(σ_c(A)) = σ_c(κ(A)), if all columns involved in c do not contain NULL values (e.g., a key)
(3)  σ_c(κ(σ_{c ∨ c_null}(A))) = σ_c(κ(A)), if columns involved in c may contain NULL values and c is not of the following form: x IS NULL, x IS NOT NULL, x being an attribute. c ∨ c_null is the original condition appended by the test for NULL values (IS NULL) for all the columns x involved in c.
(4)  κ(A × B) = κ(A) × κ(B), if the relations A and B do not contain subsumed tuples, and
(5)  κ(A × B) = κ(κ(A) × κ(B)), in all other cases
(6)  κ(κ(A)) = κ(A)
(7)  γ_{f(c)}(κ(A)) = γ_{f(c)}(A), for any column c and aggregation function f ∈ {max, min}
(8)  κ(γ_{f(c)}(A)) = γ_{f(c)}(A), for column c and any f
(9)  γ_{c,f(d)}(κ(A)) = γ_{c,f(d)}(A), for columns c, d with c as grouping column not containing NULL values and aggregation function f ∈ {max, min}
(10) κ(γ_{c,f(d)}(A)) = γ_{c,f(d)}(A), for columns c, d with c as grouping column not containing NULL values and aggregation function f ∈ {max, min}

V. EXPERIMENTS

To evaluate our algorithms, we experimented both with synthetic and real-world data. The synthetic datasets consist of data in six columns. To validate our approaches on real-world data, we used data integrated from IMDB, Filmdienst², and three other smaller movie sources. We discuss results obtained on Actor and Movie relations taken from this integrated data set. We also report on results obtained on a sample of the FreeDB database³.

A 2.3GHz dual processor quad core server with 16GB of main memory was used to store the data in a relational database (IBM DB2 v9.5) and to perform the experiments that study the runtime of different algorithms on varying datasets. All algorithms are implemented in Java SE 6 and the reported runtimes are median values over 5 runs.

² Data kindly provided by Filmdienst: film-dienst.kim-info.de
³ FreeDB: www.freedb.org
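The selection rules of Table I (Rules (2) and (3)) can be checked on the outer union of the Fig. 1 relations. The following Python sketch is only an illustration under stated assumptions: kappa is a brute-force enumeration of maximal complementing sets (exponential, suitable for toy inputs only), NULL is modelled as None, condition (4) of Definition 1 is read as one shared NON-NULL value, and all names are hypothetical. The selection condition is Sex = 'm'; since Sex contains NULL values, Rule (2) does not apply.

```python
from itertools import combinations

NULL = None  # stands in for the NULL value of the paper
SEX = 2      # position of Sex in (Name, DOB, Sex, Address, Blood)

def complements(t1, t2):
    """Definition 1; condition (4) read as one shared NON-NULL value (assumption)."""
    pairs = list(zip(t1, t2))
    if any(a is not NULL and b is not NULL and a != b for a, b in pairs):
        return False
    t1_adds = any(a is not NULL and b is NULL for a, b in pairs)
    t2_adds = any(b is not NULL and a is NULL for a, b in pairs)
    shared = any(a is not NULL and a == b for a, b in pairs)
    return t1_adds and t2_adds and shared

def coalesce(tuples):
    """Take the first NON-NULL value per attribute, as in Equation (1)."""
    return tuple(next((v for v in col if v is not NULL), NULL)
                 for col in zip(*tuples))

def kappa(relation):
    """Brute-force complementation: enumerate all complementing sets, keep maximal ones."""
    rows = list(relation)
    groups = [g for n in range(1, len(rows) + 1)
              for g in combinations(rows, n)
              if all(complements(a, b) for a, b in combinations(g, 2))]
    maximal = [g for g in groups if not any(set(g) < set(h) for h in groups)]
    return {coalesce(g) for g in maximal}

def select(relation, predicate):
    """Selection sigma; a comparison with NULL never satisfies the predicate."""
    return {t for t in relation if predicate(t)}

# Outer union of Police and Hospital (tuples 1 to 5 of Fig. 1)
A = {
    ("Miller", "7/7/59", "m", "12 Main", NULL),
    ("Peter", "1/1/53", "m", "34 First", NULL),
    ("Peter", "1/1/53", NULL, NULL, "AB"),
    ("Miller", NULL, "f", NULL, "B"),
    ("Miller", "7/7/59", "m", NULL, "O"),
}

def sex_is_m(t):
    return t[SEX] == "m"

def sex_is_m_or_null(t):  # the condition c OR c_null of Rule (3)
    return t[SEX] == "m" or t[SEX] is NULL

reference = select(kappa(A), sex_is_m)                    # sigma_c(kappa(A))
pushed_fully = kappa(select(A, sex_is_m))                 # not covered by Rule (2)
pushed_partially = select(kappa(select(A, sex_is_m_or_null)), sex_is_m)  # Rule (3)

print(pushed_fully == reference)      # False: tuple 3 was filtered out too early
print(pushed_partially == reference)  # True on this example
```

Pushing the selection fully below κ removes tuple 3, whose Sex is NULL, before it can contribute the blood type to tuple 2. The weakened condition of Rule (3) keeps tuple 3 in the input, so the result agrees with the reference plan on this data.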

We evaluate the algorithms discussed in this paper, i.e., Simple, Partitioning Complement (PC), and Null-Pattern Complement (NPC), together with algorithms adapted from the literature. Replacing complementing tuples by their complement in a relation is equivalent to finding all maximal cliques [6] in a graph that has been constructed by creating one node per tuple and an edge between nodes if the corresponding tuples complement each other. We also evaluate the performance of an adaptation of an algorithm for the dual problem of finding independent sets [7], which we refer to as the Johnson algorithm.

Experiment 1: Algorithm Comparison. We generate tables of varying size that contain a varying percentage of complementing tuples (1%, 5%, and 10%). Figure 4 shows the runtime for Simple Complement, PC, and Dynamic, the approach from [6], for 1% and 10% of complementing tuples.

[Figure 4: runtime (s) over data set size (#tuples), both axes logarithmic. Series: Simple 1%, Simple 10%, Dynamic 1%, Dynamic 10%, PC 1%, PC 10%.]

Fig. 4. Runtimes for different percentages of complementing tuples.

Discussion. We observe that Dynamic is considerably faster than the Simple Complement. However, while the runtime of Simple Complement decreases with increasing percentage of complementing tuples, the runtime of Dynamic increases. This is mainly due to the data structures used and the runtime of Simple Complement being mainly dominated by the size of the largest clique. The algorithm from [7] (not shown) performs very poorly, even for only 100 tuples (several seconds), due to its computational complexity. We also observe that partitioning pays off, as Partitioning Complement is the fastest among the shown algorithms. However, applying partitioning to all other approaches results in a comparable runtime, even for [7]. All partitioning algorithms scale well for relations of up to 1 million tuples. Varying the column for partitioning shows significant differences in runtime for all algorithms, so care must be taken in choosing a partitioning column.

Experiment 2: Complementation on real-world data. We now consider the real-world datasets Actor, Movie, and FreeDB. For each of these datasets, we take samples of different sizes and measure the runtime of the same algorithms as in the previous experiment (Fig. 5).

[Figure 5: runtime (s) over data set size (#tuples) for the Movie (left) and Actor (right) data sets, both axes logarithmic. Series: Simple, Dynamic, NPC, Partitioning Johnson, Partitioning Dynamic, PC.]

Fig. 5. Runtimes for complement algorithms for real-world datasets.

Discussion. For the FreeDB dataset, which consists of 10,000 tuples, NPC takes 0.16 seconds and thereby outperforms PC (0.34 sec. with best partitioning). Results on the Actor and Movie datasets are also represented in Fig. 5. We see that for both Actors and Movies, the PC algorithm outperforms NPC. This is mainly due to the beneficial partitioning, although the number of columns (12 vs. 27 columns) also has some influence: PC copes better with large schemata. Interestingly, NPC seems better suited for datasets with fewer NULL values (Actors) whereas results show that the opposite is the case for PC: this algorithm performs slightly better on the dataset with more overall NULL values (Movies). However, in this case we cannot completely rule out the influence of the larger number of attributes of Movies (27 columns).

We further investigated the runtime of related algorithms, as in the previous experiments (see Fig. 5). The superiority of the approach from [6] (approx. 34s for 20k tuples) over simple complement (although being worse than NPC and PC) and the corresponding partitioning version over all other algorithms is mainly due to the small percentage of complementing tuples in the data sets (below 1%). In contrast to the generated data sets, the partitioning variants show differences in runtime; in particular, [7] performs worse than for the generated datasets (nearly 40s for 10k tuples).

VI. CONCLUSION AND OUTLOOK

This paper addressed the problem of combining multiple tuples to a more concise representation in the context of data integration. We defined the complement union operator and presented several algorithms as physical implementations of it. In the future, we plan to look more closely at different types of partitioning to further improve the algorithms. We also plan to study the interaction between the complementation and the subsumption operator, to potentially devise a combined algorithm for efficient overall data fusion.

ACKNOWLEDGMENT

This research was partly supported by the German Research Society (DFG grant no. NA 432).

REFERENCES

[1] M. A. Hernández, L. Popa, Y. Velegrakis, R. J. Miller, F. Naumann, and C.-T. Ho, "Mapping XML and relational schemas with Clio," in Proc. of ICDE, 2002.
[2] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, 2007.
[3] J. Bleiholder and F. Naumann, "Data fusion," ACM Computing Surveys, vol. 41, no. 1, 2008.
[4] S. Cohen and Y. Sagiv, "An incremental algorithm for computing ranked full disjunctions," in Proc. of PODS. New York, NY, USA: ACM Press, 2005.
[5] J. Bleiholder, S. Szott, M. Herschel, F. Kaufer, and F. Naumann, "Subsumption and complementation as data fusion operators," in Proc. of EDBT, Lausanne, Switzerland, 2010.
[6] V. Stix, "Finding all maximal cliques in dynamic graphs," Comput. Optim. Appl., vol. 27, no. 2, 2004.
[7] D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis, "On generating all maximal independent sets," Inf. Process. Lett., vol. 27, no. 3, 1988.
