Hitting Set Enumeration with Partial Information for Unique Column Combination Discovery Johann BirnickF, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, Martin Schirneck FETH Zurich, Zurich, Switzerland Hasso Plattner Institute, University of Potsdam, Germany fi
[email protected] ABSTRACT that fulfill the necessary properties of a key, whereby null Unique column combinations (UCCs) are a fundamental con- values require special consideration, they serve to define key cept in relational databases. They identify entities in the constraints and are, hence, also called key candidates. data and support various data management activities. Still, In practice, key discovery is a recurring activity as keys are UCCs are usually not explicitly defined and need to be dis- often missing for various reasons: on freshly recorded data, covered. State-of-the-art data profiling algorithms are able keys may have never been defined, on archived and transmit- to efficiently discover UCCs in moderately sized datasets, ted data, keys sometimes are lost due to format restrictions, but they tend to fail on large and, in particular, on wide on evolving data, keys may be outdated if constraint en- datasets due to run time and memory limitations. forcement is lacking, and on fused and integrated data, keys In this paper, we introduce HPIValid, a novel UCC discov- often invalidate in the presence of key collisions. Because ery algorithm that implements a faster and more resource- UCCs are also important for many data management tasks saving search strategy. HPIValid models the metadata dis- other than key discovery, such as anomaly detection, data covery as a hitting set enumeration problem in hypergraphs.