Quick viewing(Text Mode)

Complement Union for Data Integration

Complement Union for Data Integration

Complement for Data Integration

Jens Bleiholder #1, Sascha Szott +2, Melanie Herschel ∗3, Felix Naumann #4 #Hasso-Plattner-Institut, Universitat¨ Potsdam, Germany 1,4{jens.bleiholder | felix.naumann}@hpi.uni-potsdam.de +Konrad-Zuse-Zentrum fur¨ Informationstechnik Berlin, Germany [email protected] ∗Universitat¨ Tubingen,¨ Germany [email protected]

I.INTRODUCTION of both tables, e.g., (Name, DOB, Sex, Address, Blood) and A data integration process consists of mapping source data each from the sources is padded with nulls in the result into a target representation (schema mapping [1]), identifying of outer union. That is, tuple 1 becomes (Miller, 7/7/59, m, multiple representations of the same real-word object (dupli- 12 Main, ⊥) and tuple 5 becomes (Miller, 7/7/59, m, ⊥, O). cate detection [2]), and finally combining these representa- These have same values for Name, DOB, and Sex. tions into a single consistent representation (data fusion [3]). Tuple 1 contributes Address whereas tuple 5 contributes Blood Clearly, as multiple representations of an object are generally to the final tuple during complementation. not exactly equal, during data fusion, we have to take special Now, consider tuples 1 and 4. Assume that according to care in handling data conflicts. This paper focuses on the def- duplicate detection, they represent the same real-world object, inition and implementation of complement union, an operator i.e., the same person. They both store different non-null values that defines a new semantics for data fusion. for Sex. Complement union cannot solve this type of conflict, Complement union aims at resolving a special type of data for which specialized methods exist (see [3] for a survey on conflict, called uncertainty. Intuitively, the operator considers fusion methods). cases where a concrete value in one tuple conflicts with a To the best of our knowledge, this work is the first that NULL value (⊥) of another tuple, and replaces the NULL considers data fusion with the general semantics of comple- value with the NON-NULL value when merging the tuples. ment union. A known operator that is able to combine comp- Complement union combines tuple sets by first computing lementing tuples from two sources is the full disjunction [4]. the outer union and then combining tuples that complement However, it cannot combine complementing tuples from the each other. This modular specification gives us more flexibility same source, nor can it combine tuples where overlapping/join attributes contain NULL values. during integration; complement can also be used to clean Amazon.com: Tu Vas Pas Mourir De Rire: Mickey 3D: Music 10/2/09 11:06 AM Real-world applications. In our experience with real-world source databases before the integration process and it allows Amazon.com: Tu Vas Pas Mourir De Rire: Mickey 3d: Music 10/2/09 11:09 AM us to potentially use more sophisticated schema mapping data, we observeHello. theSign in to usefulnessget personalized recommendations of complementation.. New customer? Start here. Kindle: Amazon's For Wireless Reading Device Your Amazon.com | Today's Deals | Gifts & Wish Lists | Gift Cards 1 Your Account | Help instance, in a sample of the Internet Movie DatabaseHello. Sign (IMDB) in to get personalized, recommendations. New customer? Start here. Kindle: Amazon's Wireless Reading Device algorithms than the outer union. Shop All Departments Search Music Cart Wish List Your Amazon.com | Today's Deals | Gifts & Wish Lists | Gift Cards Your Account | Help Example. Consider a disaster management scenario. We in- weMusic observe thatAdvanced complementation Search Browse Genres New Releases occursBestsellers withMusic Deals aMusic frequency You Should Hear Music Essentials MP3 Downloads of up to 20 times higher than subsumption,Shop All Departments theSearch state-of-theMusic Cart Wish List tegrate two databases Police and Hospital (excerpts shown in Tu Vas Pas MourirMusic De Rire [IMPORT]Advanced Search Browse Genres New Releases Bestsellers Music Deals Music You Should Hear Music Essentials MP3 Downloads Get it for less! Mickey 3D (Artist)

Fig. 1 (a) and (b)). Note that tid is not a relational attribute but art data fusion operatorNo customer to reviews handle yet. Be the uncertainty. first. | More about this product Also, we find Tu Vas PasHave Mourir one to sell? De Rire [IMPORT] Quantity: 1 serves as a reference to tuples. The goal is to detect missing numerous examples on the web, e.g., Fig. 2 shows excerptsMickey 3d (Artist) of This item has been discontinued by the manufacturer.No customer reviews yet. Be the first. | More about this product persons that have been admitted to a hospital and to contact two complementing descriptions of the same CD. List Price: $35.99 or Share with Friends Sign in to turn on 1-Click ordering. Price: $35.99 & ships FREE with Super Saver Shipping or As complementation combines information that was not or their relatives when an address is available. FREE Two-Day Shipping with a free trial of Amazon Prime. Details Amazon.com: Tu Vas Pasalready Mourir De Rire: Mickey combined 3D: Music in one of the sources, we10/2/09 obtain 11:05 AM a more Special Offers Available Fig. 1 (c) shows the result of applying complement union to Amazon.com: Tu Vas Pas Mourir De Rire: Mickey 3d: Music 10/2/09 11:08 AM Amazon Prime Free Trial the two sources. Complement union is the result of applying required. Sign up when you Kindle: Amazon's Wireless Reading Device Temporarily out of stock. check out. Learn More 1 Hello. Sign in to get personalized recommendations. New customer? Start here. Kindle: Amazon's Wireless Reading Device CustomersIMDB: http://www.imdb.comViewing This Page May Be InterestedHello. in SignThese in to get Sponsoredpersonalized recommendations LinksOrder. New now (What's customer? and this?we'll Start) deliver here. when available. We'll e-mail you with an Your Amazon.com | Today's Deals | Gifts & Wish Lists | Gift Cards Your Account | Help an outer union followed by complementation. The schema of Your Amazon.com | Today's Deals | Giftsestimated & Wish Listsdelivery | Gift date Cards as soon as we have more information.Your Account |Your Help account 3d Laserinnengravuren will only be charged when we ship the item.

Shop All Departments www.crystal-indesign.comSearch Music - Innovative Ideen aus dem Bereich der LasergravurCart in KristallglasWish List Shop All Departments Search Music Ships from and sold by Amazon.com . Gift-wrapCart available.Wish List the outer union of Police and Hospital is the union of schemas See larger image Music AdvancedFree 3D Search 3DSBrowse Models Genres New Releases Bestsellers Music Deals Music You ShouldAdvanced Hear SearchMusicBrowse Essentials Genres MP3New Downloads Releases Bestsellers Music Deals Music You Should Hear Music Essentials MP3 Downloads Music More Buying Choices www.3dvia.com/ - Find, search and share the best 3D 3DS Sharemodels your online own customer images 2 new from $35.98 2 new from $35.98 Disney-WinkelTu Vas Pas Mourir De Rire [IMPORT] Tu Vas Pas Mourir De Rire [IMPORT] Get it for less! Quantity: 1 Mickey 3D (Artist) Mickey 3d (Artist) www.Disney-Winkel.nl - Ruim 3000 Disney Artikelen Direct op voorraad! Have one to sell? No customer reviews yet. Be the first. | More about this product No customer reviews yet. Be the first. | More about this product Special OffersHave one and to sell? Product Promotions tid Name DOB Sex Address Get $1 Listworth Price: of$35.99 MP3 downloads from Amazon MP3 after you order your item.or Here's how See a problemThis with item these has advertisements?been discontinued Let usby knowthe manufacturer. Advertise on Amazon 1 Miller 7/7/59 m 12 Main Sign in to turn on 1-Click ordering. Share with Friends (restrictionsPrice: apply)$35.99 & ships FREE with Super Saver Shipping or or 2 Peter 1/1/53 m 34 First tid Name DOB Sex Address Blood FREE Two-Day Shipping with a free trial of Amazon Product Details SharePrime. with Details Friends (a) Police 1+5 Miller 7/7/59 m 12 Main O Audio CD (December 25, 2007) Product Details Special Offers Available Audio CD (December 25, 2007) Amazon Prime Free Trial 2+3 Peter 1/1/53 m 34 First AB Number of Discs: 2 required. Sign up when you Temporarily out of stock. check out. Learn More ⊥ ⊥ Number ofOrder Discs: now and 2 we'll deliver when available. We'll e-mail you with an tid Name DOB Sex Blood 4 Miller f B Format: Import estimated delivery date as soon as we have more information. Your account Format: Importwill only be charged when we ship the item.

⊥ (c) complement union of Label: Phantom Sound & Vision Ships from and sold by Amazon.com. Gift-wrap available. 3 Peter 1/1/53 AB Customers Viewing This Page May Be Interested in These Sponsored Links (What's this?) ASIN: B000GDI9ZE ASIN: B0012D6H10 See larger image 4 Miller ⊥ f B Police and Hospital 3d Laserinnengravuren More Buying Choices Share your own customerAverage images Customer2 new from $35.98 Review: No customer reviews yet. Be the first. www.crystal-indesign.com Average- Innovative Customer Ideen aus demReview: Bereich No der customer Lasergravur reviews in Kristallglas yet. Be the first. 5 Miller 7/7/59 m O 2 new from $35.98 Free 3D 3DS Models Would you like to update product info or give feedback on imagesHave? one to sell? (b) Hospital www.3dvia.com/ - Find, Wouldsearch andyou share like tothe updatebest 3D 3DSproduct models info Specialonline or give Offers feedback and Product on images Promotions?

Disney-WinkelFig. 2. Complementing entries atGet Amazon.com, $1 worth of MP3 downloads as from of Amazon Oct. MP3 2nd, after 2009.you order ASINyour item. Here's how Share with Friends Fig. 1. Complement union of Police and Hospital databases www.Disney-Winkel.nl - Ruim 3000 Disney Artikelen Direct op voorraad! (restrictions apply) is not a real-world key like ISBN, but aCustomers generated Viewing product This identifier. Page May During Be Interested in These Sponsored Links (What's this?) Tag this product (What's this?) Search Products Tagged with See a problem with these advertisements? Let us know Product Details Advertise on Amazon dataThink integration,of a tag as a keyword we projector label you it consider out and is strongly assign related 3d a Laserinnengravuren newto this identifier afterwards. Audio CD (December 25, 2007) www.crystal-indesign.com - Innovative Ideen aus dem Bereich der Lasergravur in Kristallglas Product Details product. Tags will help all customers organize and findNumber favorite of Discs:items. 2 Free 3D 3DS Models Audio CD (December 25, 2007) Format: Import Your tags: Add your first tag www.3dvia.com/ - Find, search and share the best 3D 3DS models online Number of Discs: 2 ASIN: B000GDI9ZE Format: Import Average Customer Review: Disney-Winkel No customer reviews yet. Be the first. Label: Phantom SoundCustomer & Vision Reviews www.Disney-Winkel.nl - Ruim 3000 Disney Artikelen Direct op voorraad! ASIN: B0012D6H10 Would you like to update product info or give feedback on images? Average Customer Review:There Noare customer no customer reviews yet. reviews Be the yet.first. See a problem with these advertisements? Let us know Advertise on Amazon

Customers Viewing This Page May Be Interested in These Sponsored Links (What's this?) Would you like to update product info or give feedback on images? Search Products Tagged with 3d LaserinnengravurenTag this product (What's this?) www.crystal-indesign.comThink -of Innovative a tag as Ideen a keyword aus dem or Bereich label der you Lasergravur consider inis Kristallglas strongly related to this Tag this product (What's this?) Free 3DSearch 3DS Modelsproduct. Products Tagged with Think of a tag as a keyword or label you consider is strongly related to this www.3dvia.com/ -Tags Find, will search help and all share customers the best organize3D 3DS models and findonline favorite items. product. Disney-Winkel Your tags: Add your first tag Tags will help all customers organize and find favorite items. www.Disney-Winkel.nl - Ruim 3000 Disney Artikelen Direct op voorraad! Your tags: Add your first tag See a problem withCustomer these advertisements? Reviews Let us know Advertise on Amazon Customer Reviews Search Products Tagged with Tag this product (What'sThere this? )are no customer reviews yet.

There are no customer reviews yet. Think of a tag as a keyword or label you consider is strongly related to this product. Tags will help all customers organize and find favorite items.

Your tags: Add your first tag

Customer ReviewsVideo reviews

http://www.amazon.com/Tu-Vas-Pas-Mourir-Rire/dp/B0012D6H10/ref=sr_1_17?ie=UTF8&s=music&qid=1254474291&sr=1-17There are no customer reviews yet. Page 1 of 3

http://www.amazon.com/Tu-Vas-Pas-Mourir-Rire/dp/B000GDI9ZE/ref=sr_1_12?ie=UTF8&s=music&qid=1254474299&sr=1-12 Page 1 of 3

Video reviews

http://www.amazon.com/Tu-Vas-Pas-Mourir-Rire/dp/B000GDI9ZE/ref=sr_1_12?ie=UTF8&s=music&qid=1254474299&sr=1-12 Page 1 of 3

http://www.amazon.com/Tu-Vas-Pas-Mourir-Rire/dp/B0012D6H10/ref=sr_1_17?ie=UTF8&s=music&qid=1254474291&sr=1-17 Page 1 of 3 complete description of an object, albeit at the risk of cre- III.IMPLEMENTING COMPLEMENTATION ating inaccurate combinations. This behavior is adequate in We present two algorithms for computing the complemen- scenarios where occasional incorrect combinations have little tation operator (κ), namely the partitioning complement (PC) negative impact on the application and the combined informa- and the null pattern complement (NPC) algorithm. tion is valuable, e.g., in our disaster management scenario. The main idea of partitioning complement (PC) is based on Contributions & Structure. We define complementation as the maximal complementing sets defined in Sec. II. In a first a database primitive, as defined in Sec. II. We contribute pass the algorithm goes through all input tuples and groups (1) efficient algorithms to compute complementation (Sec. III); tuples that complement each other, thus building all sets of (2) a study of how complement can be moved in operator complementing tuples. In a second pass, the algorithm goes trees for query optimization (Sec. IV), and (3) a comparative through all these sets and outputs the complement of all the experimental evaluation on both artificial and real-world data tuples of all maximal complementing sets. (Sec. V), before we conclude in Sec. VI. In Fig. 3 we show the complement relationships among II.DEFINITIONS some tuples numbered 1 to 7 (tid). An edge represents a complement relationship. In such a graph representation, a Definition 1 (Tuple Complementation): Two tuples t1 and maximal complementing corresponds to a maximal clique. t2 complement each other (t1 ≷ t2) if (1) t1 and t2 have the same schema, (2) values of corresponding attributes in t1 and Complement ‐ Steps (1) to (5) create root Step (8) add tuple 7 t2 are either equal or one of them is NULL, (3) t1 and t2 are ship between tuples 1‐7 and add tuples 1, 2, 3, 4 and final result neither equal nor do they subsume one another, and (4) t1 and 1 2 3 4 t2 have at least one NON-NULL attribute value combination in 5 3 1 2 3 4 5 6 7 common.  4 4 6 6 7 4 7 We introduce condition (3) to strictly separate equality, 6 Steps (6+7) add tuple 5 and 6 subsumption and complementation, and introduce condition 2 7 X X X X (4) to assure that tuples are not completely unrelated. Comp- 1 7 1 2 3 4 5 6 1 2 3 5 + + + lementing tuples t1 and t2 are combined into a tuple tc, the 6 6 4 6 6 4 + 7 complement of t1 and t2. Tuple tc is created by coalescing the values of each attribute a from t and t , i.e., 1 2 Fig. 3. Example for partitioning complement algorithm (PC)  t1.a, t2.a is NULL or t1.a = t2.a tc.a := (1) t2.a, t1.a is NULL. To compute and store the sets of complementing tuples, we Similarly, we construct the complement of more than two subsequently insert each tuple of the relation into a tree data tuples. Tuple complementation is a symmetric relationship. It structure. We collect the sets of complementing tuples that is neither reflexive nor transitive. In the following, we say that originate in that particular tuple in the subtree below it. a tuple t1 complements t2 if t1 ≷ t2 holds. Regard the example run of our algorithm in Fig. 3: The Definition 2 (Maximal complementing set): A complemen- algorithm starts with an empty tree and inserts the first three ting set CS is a of tuples of relation R where, for tuples (Steps 1 to 4). Tuples are inserted as nodes in depth-first each pair ti, tj of tuples from CS it holds that ti ≷ tj.A order, from left to right, generally checking if the tuple that complementing set S1 is a maximal complementing set MCS is inserted complements one of the already inserted tuples. if there exist no other complementing set S2 s.t. S1 ⊂ S2.  As tuples 1 to 3 do not complement each other, all are A tuple of R that does not complement any other tuple independently inserted. At Step 5 we add tuple 4. As tuple 4 forms a maximal complementing set of size one. A tuple can complements tuple 3, it is inserted below 3, forming the set of be in more than one MCS and in general there are multiple complementing tuples {3, 4}. It is subsequently also inserted MCS for a relation R. A relation R can be uniquely divided as a child of the root node. However, it is marked (italic and into a set of MCSs. All tuples of a maximal complementing grey color in Fig. 3), because it already has been added to a set can be combined into one tuple, the complement tc. larger set (i.e., {3, 4}). As we see further on, marking is used Definition 3 (Complementation Operator): The unary to distinguish maximal complementing sets from other sets. complementation operator κ replaces each existing maximal Then, tuples 5 and 6 are inserted. Tuple 6 is inserted below complementing set MCSi of a relation R by the single tuple 1 and below tuple 2 as it complements both. It is further inserted below the root (marked). Finally, tuple 7 is inserted complement tuple tci of all tuples in MCSi.  Definition 4 (Complement Union): The complement union in Step 8. It is inserted below the combination of tuples 2 operator () is the composition of outer union (]) and comp- and 6 (path root– 2 – 6 ), due to the fact that it complements lementation (κ) s. t. A  B =κ (A ] B); complementing the complement of 2 and 6: If a tuple t complements two other tuples in the result of the outer union of the two input relations tuples r and s then it also complements the complement of r A and B are replaced by their complement.  and s. As tuple 7 complements both tuples 2 and 6 alone, it is Complement union is commutative but not associative. The also inserted as a child of both of them. Due to the depth-first operator forms complements of complementary tuples from order, in both cases it is marked. It is also inserted as child of the same source and from different sources. the root node, also marked. 0 0 In a second pass, we traverse the whole tree and identify (1) κ (A] B) = κ (A) ] κ (B), only if @t ∈ A, t ∈ B : t ≷ t (2) κ (σc (A)) = σc (κ (A)), if all columns involved in c do maximal complementing sets as the sets formed by all tuples not contain NULL values (e.g., a key) that lie on a root-to-leaf path that does not contain a marked (3) σc (κ (σc∨cnull (A))) = σc (κ (A)), if columns involved in c may contain NULL values and c is not of the following leaf node. For instance, in Fig. 3, the maximal complementing form: x IS NULL, x IS NOT NULL, x being an attribute. sets are those marked by a filled rectangle (below the last c ∨ cnull is the original condition appended by the test for step). The participating original tuples are printed below. NULL values (IS NULL) for all the columns x involved in c. Once all maximal complementing sets have been identified, (4) κ (A×B) = κ (A) ×κ (B), if the relations A and B do not contain subsumed tuples, and the algorithm computes and outputs the complement of each (5) κ (A×B) = κ (κ (A) ×κ (B)), in all other cases maximal complementing set. For instance, we return four (6) κ (κ (A)) = κ (A) tuples in our example that correspond to the complements of (7) γf(c) (κ (A)) = γf(c) (A), for any column c and aggrega- tion function f ∈ {max, min} the sets {1, 6} , {2, 6, 7} , {3, 4} , and {5}. (8) κ (γf(c) (A)) = γf(c) (A), for column c and any f The worst case time complexity of the algorithm is domi- (9) γc,f(d) (κ (A)) = γc,f(d) (A), for columns c, d with c as nated by the second pass, because the entire tree structure is grouping column not containing NULL values and aggregation function f ∈ {max, min} traversed to visit, check and—where necessary—output the (10) κ (γc,f(d) (A)) = γc,f(d) (A), for columns c, d with c as leaf nodes (together with their paths from the root node). grouping column not containing NULL values and aggregation The runtime complexity of this algorithm (including the time function f ∈ {max, min} needed for enumerating all MCSs and building the comple- TABLE I ments) is then exponential in the size of the largest subtree REWRITE RULESFOR COMPLEMENTATION (which is equal to the size of the largest MCS and usually much smaller than the number of tuples, n), i.e., O(n2k) we can push selection through complementation (Rule (2)). where k is the size of the largest MCS. We refer to this If a column allows NULL values, pushing selection through version as the simple complement algorithm. complementation alters the result only if a tuple complemen- To considerably improve runtime when computing com- ting another tuple is removed from the input. We can however plement, we partition the relation by the different values of push the selection partially down (Rule (3)). a fixed column c and apply the algorithm to each partition Combinations with Join. In general, complementation and individually. The NULL partition containing all tuples with a the cannot be exchanged: applying the NULL value in attribute c needs to be handled differently, Cartesian product introduces complementing tuples if the re- because complementation is not transitive. For each NON- lations contain subsumed tuples (Rules (4) and (5)). A general NULL partition we add the tuples from the NULL partition rule for combining join and complement cannot be devised, to it and then replace complementing tuples. We then split so joins and complementation are not exchangeable. the resulting set of tuples into two groups: (1) the set of all Other Combinations. Two complementation operators can be tuples that came from the null partition and have not been combined into one (Rule (6)). In general, complementation and complemented and (2) all other tuples. Eventually, based on grouping/aggregation are not exchangeable as complementa- this division, we identify all tuples from the null partition that tion may remove tuples that are essential for the computation have not been used in any of the partitions and add these of the aggregate. However, there are cases in which we can to the final output. We refer to this improved version as the remove the complementation operator and leave only the partitioning complement algorithm. grouping/aggregation (Rules (7), (8), (9), and (10)). We also implemented a second, different complementation algorithm, the null pattern complement algorithm. This algo- V. EXPERIMENTS rithm partitions the input relation according to the patterns of To evaluate our algorithms, we experimented both with tuples’ NULL values and compares only tuples with comple- synthetic and real-world data. The synthetic datasets consist menting NULL patterns. It is a modification of an algorithm of data in six columns. To validate our approaches for removing subsumed tuples [5]. Due to space constraints on real-world data, we used data integrated from IMDB, we only report on experimental results. Filmdienst2, and three other smaller movie sources. We discuss IV. COMPLEMENTATION IN QUERY PLANS results obtained on Actor and Movie relations taken from this integrated data set. We also report on results obtained on a We now examine how the complementation operator can be sample of the FreeDB database3. moved in logical query plans to potentially increase efficiency. A 2.3GHz dual processor quad core server with 16GB We summarize the discussed rewrite rules in Table I. of main memory was used to store the data in a relational Combinations with (Outer) Union. Complement and union database (IBM DB2 v9.5) and to perform the experiments are not exchangeable in general, due to the fact that comp- that study the runtime of different algorithms on varying lementation is not a transitive relationship. However, if there datasets. All algorithms are implemented in Java SE 6 and are no NULL values in common attributes (e.g., keys), comp- the reported runtimes are median values over 5 runs. We lementation and outer union are exchangeable (Rule (1)). Combinations with Selection. If selection is applied on a 2Data kindly provided by Filmdienst: film-dienst.kim-info.de column without NULL values (e.g., key, NOT NULL column), 3FreeDB: www.freedb.org 10000

1000

100

10

1

0 10000 100000 1000000

10000.0 Simple evaluate the algorithms discussed in this paper, i.e., Simple, Dynamic NPC 1000.0 Partitioning Johnson Partitioning Complement (PC), and Null-Pattern Complement Partitioning Dynamic PC (NPC) with algorithms adapted from the literature. Replacing 100.0 complementing tuples by their complement in a relation is 10.0 equivalent to finding all maximal cliques [6] in a graph that has Runtime (s) been constructed by creating one node per tuple and an edge 1.0 between nodes if the corresponding tuples complement each 0.1 another. We also evaluate the performance of an adaptation of 10000 100000 1000000 10000 100000 1000000 10000000 an algorithm, to which we refer to as the Johnson algorithm, Data set size (#tuples) for Movie (left) and Actor (right) data sets for the dual problem of finding independent sets [7]. Experiment 1: Algorithm Comparison. We generate ta- Fig. 5. Runtimes for complement algorithms for real-world datasets. bles of varying size that contain a varying percentage of complementing tuples (1%, 5%, and 10%). Figure 4 shows NPC seems better suited for datasets with less NULL values the runtime for Simple Complement, PC, and Dynamic, the (Actors) whereas results show that the opposite is the case for approach from [6] for 1% and 10% of complementing tuples. PC: this algorithms performs slightly better on the dataset with more overall NULL values (Movies). However, in this case we 10000 Simple, 1% cannot completely rule out the influence of the larger number Simple, 10% 1000 Dynamic, 1% of attributes of Movies (27 columns). Dynamic, 10% 100 PC, 10% We further investigated the runtime of related algorithms, PC, 1% as in the previous experiments (see Fig. 5). The superiority of 10 the approach from [6] (approx. 34s for 20k tuples) over simple Runtime (s) 1 complement (although being worse than NPC and PC) and the

0 corresponding partitioning version over all other algorithms is

0 mainly due to the small percentage of complementing tuples 1000 10000 100000 1000000 in the data sets (below 1%). In contrast to the generated data Data set size (#tuples) sets, the partitioning variants show differences in runtime, Fig. 4. Runtimes for different percentages of complementing tuples. especially [7] performs worse than for the generated datasets Discussion. We observe that Dynamic is considerably faster (nearly 40s for 10k tuples). than the Simple Complement. However, as the runtime of VI.CONCLUSIONANDOUTLOOK Simple Complement decreases with increasing percentage of complementing tuples, the runtime of Dynamic increases. This This paper addressed the problem of combining multiple is mainly due to the data structures used and the runtime of tuples to a more concise representation in the context of data Simple Complement being mainly dominated by the size of the integration. We defined the complement union operator and largest clique. The algorithm from [7] (not shown) performs presented several algorithms as physical implementations of very poorly, even for only 100 tuples (several seconds), due to it. In the future, we plan to look more closely at different its computational complexity. We also observe that partitioning types of partitioning to further improve the algorithms. We pays off, as Partitioning Complement is the fastest among also plan to study the interaction between the complementation the shown algorithms. However, applying partitioning to all and the subsumption operator, to potentially devise a combined other approaches results in a comparable runtime, even for algorithm for efficient overall data fusion. [7]. All partitioning algorithms scale well for relations of up Acknowledgment. This research was partly supported by the to 1 million tuples. Varying the column for partitioning shows German Research Society (DFG grant no. NA 432). significant differences in runtime for all algorithms, so care REFERENCES must be taken in choosing a partitioning column. [1] M. A. Hernandez,´ L. Popa, Y. Velegrakis, R. J. Miller, F. Naumann, and Experiment 2: Complementation on real-world data. We C.-T. Ho, “Mapping XML and relational schemas with Clio,” in Proc. of now consider the real-world datasets Actor, Movie, and ICDE, 2002. FreeDB. For each of these datasets, we take samples of [2] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A survey,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, 2007. different sizes and the runtime of the same algorithms [3] J. Bleiholder and F. Naumann, “Data fusion,” ACM Computing Survey, as in the previous experiment (Fig. 5). vol. 41, no. 1, 2008. Discussion. For the FreeDB dataset, which consists of 10,000 [4] S. Cohen and Y. Sagiv, “An incremental algorithm for computing ranked full disjunctions,” in Proc. of PODS. New York, NY, USA: ACM Press, tuples, NPC takes 0.16 seconds and thereby outperforms 2005. PC (0.34 sec. with best partitioning). Results on Actor an [5] J. Bleiholder, S. Szott, M. Herschel, F. Kaufer, and F. Naumann, “Sub- Movie datasets are also represented in Fig. 5. We see that for sumption and complementation as data fusion operators,” in Proc. of EDBT, Lausanne, Switzerland, 2010. both Actors and Movies, the PC algorithm outperforms NPC. [6] V. Stix, “Finding all maximal cliques in dynamic graphs,” Comput. Optim. This is mainly due to the beneficial partitioning, although Appl., vol. 27, no. 2, 2004. the number of columns (12 vs. 27 columns) also has some [7] D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis, “On generating Inf. Process. Lett. influence: PC better copes with large schemata. Interestingly, all maximal independent sets,” , vol. 27, no. 3, 1988.