Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery

Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery Johann BirnickF, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, Martin Schirneck FETH Zurich, Zurich, Switzerland Hasso Plattner Institute, University of Potsdam, Germany fi[email protected] ABSTRACT that fulfill the necessary properties of a key, whereby null Unique column combinations (UCCs) are a fundamental con- values require special consideration, they serve to define key cept in relational databases. They identify entities in the constraints and are, hence, also called key candidates. data and support various data management activities. Still, In practice, key discovery is a recurring activity as keys are UCCs are usually not explicitly defined and need to be dis- often missing for various reasons: on freshly recorded data, covered. State-of-the-art data profiling algorithms are able keys may have never been defined, on archived and transmit- to efficiently discover UCCs in moderately sized datasets, ted data, keys sometimes are lost due to format restrictions, but they tend to fail on large and, in particular, on wide on evolving data, keys may be outdated if constraint en- datasets due to run time and memory limitations. forcement is lacking, and on fused and integrated data, keys In this paper, we introduce HPIValid, a novel UCC discov- often invalidate in the presence of key collisions. Because ery algorithm that implements a faster and more resource- UCCs are also important for many data management tasks saving search strategy. HPIValid models the metadata dis- other than key discovery, such as anomaly detection, data covery as a hitting set enumeration problem in hypergraphs. integration, data modeling, query optimization, and index- In this way, it combines efficient discovery techniques from ing, the ability to discover them efficiently is crucial. data profiling research with the most recent theoretical in- Data profiling algorithms automatically discover meta- sights into enumeration algorithms. Our evaluation shows data, like unique column combinations, from raw data and that HPIValid is not only orders of magnitude faster than provide this metadata to any downstream data management related work, it also has a much smaller memory footprint. task. Research in this area has led to various UCC discovery algorithms, such as GORDIAN [34], HCA [2], DUCC [22], Swan [3], PVLDB Reference Format: and HyUCC [33]. Out of these algorithms, only HyUCC can Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Nau- process datasets of multiple gigabytes in size in reasonable mann, Thorsten Papenbrock, Martin Schirneck. Hitting Set Enu- time, because it combines the strengths of all previous algo- meration with Partial Information for Unique Column Combina- tion Discovery. PVLDB, 13(11): 2270-2283, 2020. rithms (lattice search, candidate inference, and paralleliza- DOI: https://doi.org/10.14778/3407790.3407824 tion). At some point, however, even HyUCC fails to process certain datasets, because the approach needs to maintain an exponentially growing search space in main memory, which 1. INTRODUCTION eventually either exhausts the available memory or, due to Keys are among the most fundamental type of constraint in search space maintenance, the execution time. relational database theory. A key is a set of attributes whose There is a close connection between the UCC discovery values uniquely identify every record in a given relational in- problem and the enumeration of hitting sets in hypergraphs. stance. This uniqueness property is necessary to determine Hitting set-based techniques were among the first tried for entities in the data. Keys serve to query, link, and merge UCC discovery [29] and still play a role today in the form of entities. A unique column combination (UCC) is the obser- row-based algorithms. To utilize the hitting set connection vation that in a given relational instance a certain set of at- directly, one needs to feed the algorithm with information on tributes S does not contain any duplicate entries. In other all pairs of records of the database [1, 35]. This is prohibitive words, the multiset projection of schema R on S contains for larger datasets due to the quadratic scaling. only unique entries. Because UCCs describe attribute sets In this paper, we propose a novel approach that automatically discovers all unique column combinations in any given relational dataset. The algorithm is based on the connec- This work is licensed under the Creative Commons Attribution- tion to hitting set enumeration, but additionally uses the NonCommercial-NoDerivatives 4.0 International License. To view a copy key insight that most of the information in the record pairs of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For is redundant and that the following strategy of utilizing the any use beyond those covered by this license, obtain permission by emailing state-of-the-art hitting set enumeration algorithm MMCS [31] [email protected]. Copyright is held by the owner/author(s). Publication rights finds the relevant information much more efficiently. We licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 13, No. 11 start the discovery with only partial (or even no) informa- ISSN 2150-8097. tion about the database. Due to this lack of information, the DOI: https://doi.org/10.14778/3407790.3407824 2270 resulting solution candidates might not actually be UCCs. A exists a key of cardinality less than a given value in a given validation then checks whether the hitting set is a true UCC. relational instance is an NP-complete problem [5]. More- If not, the validation additionally provides a reason for its over, the discovery of a key of minimum size in the database incorrectness, pointing directly to the part of the database is also likely not fixed-parameter tractable as it is W [2]- that MMCS needs to avoid the mistake. Moreover, we show complete when parameterized by the size [18]. Finding all that all previous decisions of MMCS would have been exactly keys or unique column combinations is computationally even the same, even if MMCS had full information about all row harder. For this reason, only few automatic data profiling pairs of the database. Thus, our new approach can include algorithms exist that can discover all unique column combi- the new information on the fly and simply resume the enu- nations. The discovery problem is closely related to the enumeration where it was before. Instead of providing full and meration of hitting sets (a.k.a. the transversal hypergraph probably redundant information about the database to the problem, frequent itemset mining, or monotone dualization, enumeration subroutine, the enumeration decides for itself see [15]). Many data profiling algorithms, including the first which information is necessary to make the right decisions. UCC discovery algorithms, rely on the hitting sets of certain Due to its hitting set-based nature, we named our algo- hypergraphs; the process of deriving complete hypergraphs, rithm Hitting set enumeration with Partial Information and however, is a bottleneck in the computation [30]. Validation, HPIValid for short. Our approach adopts state- In modern data profiling, one usually distinguishes two of-the-art techniques from both data profiling and hitting types of profiling algorithms: row and column-based ap- set enumeration bringing together the two research commu- proaches. For a comprehensive overview and detailed dis- nities. We add the following contributions: cussions of the different approaches, we refer to [1] and [28]. (1) UCC discovery. We introduce HPIValid, a novel UCC Row-based algorithms, such as GORDIAN [34], advance the discovery algorithm that outperforms state-of-the-art al- initial hitting set idea. They compare all records in the gorithms in terms of run time and memory consumption. input dataset pair-wise to derive all valid, minimal UCCs. (2) Hitting set enumeration with partial information. We Column-based algorithms, such as HCA [2], DUCC [22], and prove that the hitting set enumeration algorithm MMCS Swan [3], in contrast, systematically enumerate and test in- remains correct when run on a partial hypergraph, pro- dividual UCC candidates while using intermediate results vided that there is a validation procedure for candidate to prune the search space. The algorithms vary mostly in solutions. We believe that this insight can be key to their traversal strategies, which are breadth-first bottom-up also solve similar task like the discovery of functional for HCA and a depth-first random-walk for DUCC. Both row dependencies or denial constraints. and column-based approaches have their strengths: record (3) Subset closedness. We introduce the concept of subset comparisons scale well with the number of attributes, and closedness of hitting set enumeration algorithms, and systematic candidate tests scale well with the number of prove that this property is sufficient for enumeration records. Hybrid algorithms aim to combine both aspects. with partial information to succeed. This makes it easy The one proposed in [25] exploits the duality between mini- to replace MMCS with a different enumeration procedure, mal difference sets and minimal UCCs to mutually grow as long as it is also subset-closed. the available information about the search and the solution (4) Sampling strategy. We propose a robust strategy how space. HyUCC [33] switches between column and row-based to extract the relevant information from the database parts heuristically whenever the progress of the current ap- efficiently in the presence of redundancies. proach is low. Hyb [35] is a hybrid algorithm for the discovery (5) Exhaustive evaluation. We evaluate our algorithm on of embedded uniqueness constraints (eUCs), an extension of dozens of real-world datasets and compare it to the cur- UCCs to incomplete data. It proposes special ideas tailored rent state-of-the-art UCC discovery algorithm HyUCC. to incomplete data, but is based on the same discovery ap- Besides being able to solve instances that were previously proach as HyUCC.

Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery

The Matroid Theorem We First Review Our Definitions: a Subset System Is A

Useful Relations in Permutations and Combination 1. Useful Relations

Primitive Recursive Functions Are Recursively Enumerable

Polya Enumeration Theorem

Cardinality of Sets

Combinatorics

Inclusion‒Exclusion Principle

Enumeration of Finite Automata 1 Z(A) = 1

The Axiom of Choice and Its Implications

17 Axiom of Choice

STRUCTURE ENUMERATION and SAMPLING Chemical Structure Enumeration and Sampling Have Been Studied by Mathematicians, Computer

Enumeration of Structure-Sensitive Graphical Subsets: Calculations (Graph Theory/Combinatorics) R