
Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling


THOMAS BLÄSIUS, Hasso Plattner Institute, University of Potsdam, Germany
TOBIAS FRIEDRICH, Hasso Plattner Institute, University of Potsdam, Germany
JULIUS LISCHEID, University of Cambridge, United Kingdom
KITTY MEEKS, University of Glasgow, United Kingdom
MARTIN SCHIRNECK, Hasso Plattner Institute, University of Potsdam, Germany

We devise a method to enumerate the inclusion-wise minimal hitting sets of a hypergraph. The algorithm has delay O(m^{k*+1} n^2) on n-vertex, m-edge hypergraphs, where k* is the rank of the transversal hypergraph, i.e., the cardinality of the largest minimal solution. In particular, on classes of hypergraphs for which k* is bounded, the delay is polynomial. The algorithm uses space linear in the input size only. The method solves the extension problem for minimal hitting sets as a subroutine. We show that this problem, parameterised by the cardinality of the set which is to be extended, is one of the first natural W[3]-complete problems. We give an algorithm for the subroutine that is optimal under the assumption that W[2] ≠ FPT or the exponential time hypothesis, respectively. Despite the hardness of the extension problem, we provide empirical evidence indicating that the enumeration outperforms its theoretical worst-case guarantee on hypergraphs arising in the profiling of relational databases, namely, in the detection of unique column combinations. Our analysis suggests that these hypergraphs exhibit structure that allows the subroutine to be fast on average. The code and data is available at hpi.de/friedrich/research/enumdat.

CCS Concepts: • Theory of computation → Backtracking; W hierarchy; • Information systems → Relational model; • General and reference → General literature.

Additional Key Words and Phrases: candidate keys, data profiling, enumeration algorithms, experimental algorithms, minimal hitting sets, transversal hypergraph, unique column combinations, W[3]-completeness.

ACKNOWLEDGMENTS
Kitty Meeks is supported by a Personal Research Fellowship from the Royal Society of Edinburgh, funded by the Scottish Government. The authors thank Felix Naumann and Thorsten Papenbrock for the many fruitful discussions about data profiling.

arXiv:1805.01310v2 [cs.DS] 12 Feb 2020

1 INTRODUCTION
A recurring computational task in the design and profiling of relational databases is the discovery of hidden dependencies between attributes. This metadata helps to organise the dataset and subsequently enables further cleansing and normalisation [2]. For example, unique column combinations (a.k.a. candidate keys) are subsets of attributes (columns) whose values completely identify every record (row) in the database. A minimal combination can thus serve as a small fingerprint for

Parts of this work have been presented at the 21st Meeting on Algorithm Engineering and Experiments (ALENEX 2019) [10].

This work originated while Julius Lischeid was a member of the Hasso Plattner Institute.
Authors' addresses: Thomas Bläsius, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany; Tobias Friedrich, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany; Julius Lischeid, University of Cambridge, Cambridge, United Kingdom; Kitty Meeks, University of Glasgow, School of Computing Science, Glasgow, United Kingdom; Martin Schirneck, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany.

the data as a whole, making it a primary key. Unfortunately, unique column combinations are equivalent to hitting sets (transversals), which renders their discovery both NP-hard and W[2]-hard when parameterised by their size [20, 32, 47]. Moreover, it is usually not enough to decide the existence of a single, isolated occurrence; instead, one is interested in compiling a comprehensive list of all dependencies [45]. One thus has to solve the transversal hypergraph problem. This problem comes in two variants, enumeration and recognition. To enumerate the transversal hypergraph means computing the list of all inclusion-wise minimal hitting sets of an input hypergraph. If necessary, the remaining non-minimal solutions can be produced by arbitrarily adding more vertices. In the recognition variant, one is given a pair of hypergraphs and has to decide whether one comprises exactly the minimal transversals of the other. The two variants are intimately connected. For any class of hypergraphs, there is an output-polynomial algorithm (incremental-polynomial even) for the enumeration variant if and only if the transversal hypergraph can be recognised in polynomial time for this class [7]. It is a long-standing open question whether this can be solved efficiently for arbitrary inputs.
The transversal hypergraph problem also emerges in many fields beyond data profiling such as artificial intelligence [36], machine learning [22], distributed systems [34], integer linear programming [12], and monotone logic [26]. While there is currently no efficient method for general inputs, it is worth exploring the characteristics of the concrete applications in order to find tractable cases and develop new techniques. For databases, there are two prominent traits. First, the largest minimal unique column combination is often significantly smaller than the total number of attributes. As an example, the call_a_bike database (see Section 5 for details) spans 16 columns and a hundred thousand rows; nevertheless, the largest minimal unique column combination is of size 4. This is consistent with the findings of other researchers, see also [45, 54]. Although one can expect the solutions to be small, there is generally no a priori guarantee on their maximum cardinality. One thus aims for an algorithm that is suitable for all hypergraphs and particularly fast on those with small transversals. Secondly, a typical use case for the enumeration of unique column combinations is to present them to a human user for inspection. This makes it possible to incorporate domain knowledge that might otherwise be inaccessible. The human-computer interaction sparks new algorithmic constraints: the first output should be available quickly and subsequent ones should follow in regular intervals. Even if the dependencies are compiled to feed them instead into another automated tool in a profiling pipeline, the overall process benefits from solutions arriving early. In both cases, an algorithm with bounded delay is preferred over a merely output-efficient one. We devise an algorithm for the enumeration of minimal hitting sets. We give a theoretical worst-case guarantee on its delay between consecutive outputs.
On hypergraphs for which the size of the largest minimal transversal is bounded, the guarantee is polynomial. Moreover, our experiments show that the enumeration is fast on hypergraphs arising in the discovery of unique column combinations in relational databases. Our enumeration method has a low memory footprint and is trivially parallelisable.

1.1 Contribution and Outline
Below, in Section 1.2, we discuss how our findings relate to the literature on the transversal hypergraph problem. Then, after reviewing some algorithmic and complexity-theoretic concepts and notation in Section 2, we give the general outline of our enumeration method in Section 3. The algorithm consists of a backtracking search that is heavily pruned by an extension oracle for minimal hitting sets. The underlying extension problem is discussed in detail in Section 4. It is known to be NP-complete in general, but efficiently solvable if the set to be extended is small [13]. In practical applications, it is surprisingly common to reduce the problem at hand to an NP-hard one, despite the complexity-theoretic ramifications, cf. [6]. In order to interpolate between the extremes of general hardness and tractable special cases of the extension problem, we employ techniques commonly dubbed as fixed-parameter tractability [23]. In particular, we prove that the problem is W[3]-complete when parameterised by the cardinality |X| of the set to be extended. To the best of our knowledge, there are currently only four natural, practically motivated problems known to have this property. The first one was given by Chen and Zhang [16] in the context of supply chain management; Bläsius, Friedrich, and Schirneck [9] added the discovery of inclusion dependencies in relational data. We prove here that the extension problem for minimal hitting sets belongs to this list. Finally, Casel et al. [14] recently applied our techniques to show that the extension problem is already W[3]-complete for the special case of minimal dominating sets in bipartite graphs. Beyond the W[3]-hardness, we also derive a lower bound based on the exponential time hypothesis, and give an algorithm for the extension oracle running in time O(|H|^{|X|+1} · |V|), matching this lower bound.
Combined with the backtracking technique, this gives a method to enumerate all minimal hitting sets in lexicographic order with delay O(|H|^{k*+1} · |V|^2), where k* denotes the rank of the transversal hypergraph. Beyond these theoretical bounds, we introduce different techniques for improving the run time in practice. The empirical evaluation in Section 5 shows that our algorithm is capable of efficiently enumerating all unique column combinations of real-world databases when given the hypergraph of difference sets as input. We in particular observe that the oracle calls (the only part of our algorithm with super-polynomial run time) are quite fast in practice. Compared to a simple brute-force enumeration, we achieve huge speed-ups, in particular for larger instances. This work is completed by some concluding remarks in Section 6.

1.2 Related Work
The topic of enumerating and recognising the transversal hypergraph efficiently is covered in numerous papers over the last three decades, both from a theoretical and an empirical standpoint. For a detailed overview, we direct the reader to the survey by Eiter, Makino, and Gottlob [27] and the textbook by Hagen [37]. Demetrovics and Thi [21] as well as Mannila and Räihä [48], independently, are usually credited with raising the issue of generating all hitting sets. The existence of an output-polynomial algorithm is unknown. However, Fredman and Khachiyan [31] gave a quasi-polynomial algorithm. The exponent of their worst-case run time bound has only a sub-logarithmic dependence on the combined input and output size. Besides this general-purpose procedure, several tractable special cases have been identified. Those cases are usually solved by exploiting structural properties such as bounds on the edge size or degree [11, 22], conformality [43], or acyclicity [25, 26]. There is an imbalance in the theoretical literature on the generation variant of the transversal hypergraph problem: most of the results focus on the properties of the input hypergraph instead of the output. A notable exception is the work of Eiter and Gottlob [25]. They show that the recognition problem is polynomial-time solvable if the rank of the transversal hypergraph is bounded. Via an equivalence result by Bioch and Ibaraki [7], this implies an incremental-polynomial algorithm for the enumeration of such hypergraphs. However, their algorithm necessarily holds the full output in memory during the computation. We improve upon this result by showing that there exists an enumeration method that has polynomial delay (if the transversal hypergraph has bounded rank), outputs the hitting sets in lexicographical order, and uses space linear in the size of the input only.
Our method is similar to the backtracking technique proposed by Elbassioni, Hagen, and Rauf [28], but achieves the same coverage of the search space with a less expensive branching scheme. Computing all solutions of a combinatorial problem in some natural order has been a point of interest beyond hitting sets. Lawler [46] was the first to study this topic rigorously. His ideas were subsequently developed into applications in areas as diverse as the enumeration of maximal independent sets [41], cuts in a graph [60], answers to keyword queries [44], and models of Boolean constraint satisfaction problems [18], respectively ranked by some weight function. In the theoretical analysis of our enumeration algorithm, we use tools from fixed-parameter tractability. Although they were originally developed to understand decision problems [19, 23, 52], in recent years, these tools proved to be very effective for enumeration as well. Creignou et al. [17] give a general overview of the field of parameterised enumeration. The parameterised complexity of extension problems in particular has been treated by Meeks [50] and Casel et al. [14]. In recent years, the conjectured hardness of the Boolean satisfiability problem encapsulated in the exponential time hypothesis [39, 40] (and several stronger assumptions) has become a fruitful source of conditional lower bounds on the complexity of computational problems in the exponential [59], polynomial [1], as well as the parameterised domain [19]. Complementing the theoretical point of view, there are several surveys reviewing the practical behaviour of enumeration algorithms for the transversal hypergraph. Murakami and Uno [51], for example, compared several methods on artificial and real-world instances, mostly from frequent itemset mining, highlighting different use cases.
Gainer-Dewar and Vera-Licona [33] conducted an exhaustive study involving more than 20 algorithms on hypergraphs stemming from various real-world applications. However, it seems that there is no empirical work focussing on the structures arising in data profiling, although the connection to the hitting set problem is well-known also in practice [2, 20]. There are some articles on the performance of discovery algorithms for unique column combinations, but they do not discuss the topic in the context of hypergraphs or hitting sets. One example is the presentation of the DUCC algorithm by Heise et al. [38], where they test their algorithm on the ncvoter_allc and uniprot datasets. These are databases we also use in our own evaluation in Section 5. However, the running times reported here are not directly comparable since we are only focussing on the enumeration part and ignore the extraction of minimal difference sets from the database. Notwithstanding, it is worth noting that Heise et al. were unable to scale their algorithm beyond 60 columns for the ncvoter_allc dataset due to a timeout after 10 000 s (2 h 47 min), while our enumeration method can handle the full 88 columns. The more recent HyUCC algorithm by Papenbrock and Naumann [56], which is based on HyFD for functional dependencies [55], also has some overlap with our work regarding the testing datasets. Comparing their results with our findings suggests that the run times of our implementation, working on arbitrary hypergraphs, are at least competitive with the current domain-specific state of the art. This is particularly interesting since they report the space consumption as a major bottleneck of HyUCC, which is not an issue for our algorithm.

2 PRELIMINARIES
2.1 Hypergraphs and Hitting Sets
A hypergraph is a finite vertex set V together with a system of subsets H ⊆ P(V), the (hyper-)edges. We usually identify a hypergraph with its edges H. Unlike some authors, e.g. Berge [5], we do not exclude special cases of this definition like the empty graph, H = ∅, an empty edge, ∅ ∈ H, or isolated vertices, V ⊋ ⋃_{E∈H} E. However, we do assume that there is always at least one vertex, V ≠ ∅. The symbol n = |V| denotes the number of vertices, and m = |H| the number of edges. The rank of a hypergraph H is the maximum cardinality of its edges, rank(H) = max_{E∈H} |E|. H is a Sperner hypergraph if no edge is contained in another. The minimisation of H is the subsystem of inclusion-wise minimal edges, min(H) = {E ∈ H | ∀E′ ∈ H : E′ ⊆ E ⇒ E′ = E}. Obviously, min(H) is a Sperner hypergraph for any H.

A transversal or hitting set for H is a subset H ⊆ V of vertices such that H has a non-empty intersection with every edge E ∈ H. A transversal is (inclusion-wise) minimal if it does not properly contain any other transversal. The minimal hitting sets of H form a Sperner hypergraph on the vertex set V, the transversal hypergraph Tr(H). Regarding the transversals, it does not make a difference whether the full hypergraph H is considered or its minimisation, Tr(min(H)) = Tr(H). We denote the number of minimal transversals by Nmin = |Tr(H)|. In the proofs below, we make extensive use of the following well-known minimality criterion.

Observation 2.1 ([5, 53]). Let (V, H) be a hypergraph and H ⊆ V a hitting set for H. H is minimal if and only if for every x ∈ H, there is an edge Ex ∈ H such that Ex ∩ H = {x}.¹

The edge Ex above is a witness for x. If some element of H has no witness, it is expendable. Any total ordering ≽ of the vertex set V induces a lexicographical order on P(V).
We say a subset S ⊆ V is lexicographically smaller than another subset T, whenever the ≽-first element in which S and T differ is in S, cf. [41]. More formally, S ≼lex T iff S = T or min≽(S △ T) ∈ S. Here, S △ T = (S\T) ∪ (T\S) denotes the symmetrical difference. We call a hypergraph on a totally ordered vertex set an ordered hypergraph.
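The witness criterion of Observation 2.1 lends itself to a direct computational check. The following Python sketch (all function and variable names are ours, not taken from the paper's implementation) tests minimality by scanning for a witness edge for every vertex of a candidate hitting set:

```python
# Brute-force check of the witness criterion (Observation 2.1).
# All names here are illustrative, not from the paper's code.

def is_minimal_hitting_set(edges, H):
    """True iff H hits every edge and every x in H has a witness
    edge Ex with Ex ∩ H = {x}; otherwise some x is expendable."""
    H = set(H)
    if not all(H & set(E) for E in edges):
        return False  # H is not even a hitting set
    return all(any(H & set(E) == {x} for E in edges) for x in H)

edges = [{1, 2}, {2, 3}, {3, 4}]
assert is_minimal_hitting_set(edges, {2, 3})         # witnesses {1,2} and {3,4}
assert not is_minimal_hitting_set(edges, {2, 3, 4})  # 3 and 4 lack witnesses
```

Running this check on all subsets of V reproduces Tr(H), albeit in exponential time; the point of the algorithm developed below is precisely to avoid this blow-up.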

2.2 Relational Databases and Unique Column Combinations
To describe relational data, we fix a finite set R, the (relational) schema; its elements are the attributes or columns. Records or rows are tuples whose entries are indexed by R. We use the symbols ri and rj for them. For any subset X ⊆ R of columns, we let ri[X] denote the subtuple of ri consisting only of the entries indexed by X. A finite set r of rows is a relation or relational database. Given some relation r, a set X of columns is a unique column combination (UCC), or simply unique, if for any two distinct tuples ri, rj in r, we have ri[X] ≠ rj[X]. That means, the combination of values in the entries for X fully identifies any tuple of the relation r. Otherwise, set X is non-unique. Any superset of a unique is unique and any subset of a non-unique is again non-unique. A UCC is (inclusion-wise) minimal if it does not contain another UCC. There is a one-to-one correspondence between UCCs and transversals; see the beginning of Section 5 for more details.
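The definition of uniqueness can be tested directly by comparing subtuples. The sketch below (rows modelled as dictionaries, attribute names illustrative) is a naive check of the definition, not the paper's method of extracting difference sets:

```python
# Test whether a column combination X is unique in a relation,
# straight from the definition: the subtuples r[X] must be pairwise
# distinct. Rows are dicts indexed by the schema; names are ours.

def is_unique(rows, X):
    seen = set()
    for r in rows:
        key = tuple(r[c] for c in sorted(X))  # the subtuple r[X]
        if key in seen:
            return False  # two rows agree on X, so X is non-unique
        seen.add(key)
    return True

rows = [
    {"id": 1, "name": "ada", "city": "london"},
    {"id": 2, "name": "ada", "city": "paris"},
]
assert is_unique(rows, {"id"})
assert is_unique(rows, {"name", "city"})
assert not is_unique(rows, {"name"})
```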

2.3 Parameterised Complexity and the Exponential Time Hypothesis
Parameterised complexity theory has proved to be a powerful tool to tackle computationally hard problems, for an overview see the textbooks [19, 23, 52]. We generally consider all decision problems to be parameterised. Given an instance I to such a problem and a parameter k ∈ N+, (I, k) is then an instance of the corresponding parameterised problem. A parameterised problem is fixed-parameter tractable (FPT) if there exists a polynomial p and a computable function f such that any instance can be solved in time f(k)·p(|I|). The class of all fixed-parameter tractable problems is denoted by FPT. Any algorithm adhering to such a worst-case bound is said to run in FPT-time. The intuition is that if the parameter is bounded by a constant, then the running time of that algorithm is polynomial, and the degree of the polynomial is independent of k. Let Π and Π′ be two parameterised problems. A parameterised reduction from Π to Π′ is a function computable in FPT-time that maps an instance (I, k) of Π to an equivalent instance (I′, k′) of Π′ such that the parameter k′ depends on k but not on |I|, meaning there is a computable function g such that k′ ≤ g(k). If there is also a parameterised reduction in the opposite direction, Π and Π′ are said to be FPT-equivalent. We mainly use a restricted form of reducibility, namely, polynomial reductions that preserve the parameter. These are reductions between parameterised problems computable in time p(|I|), independently of the parameter k, such that the resulting instance of Π′ has parameter value k′ = k. This way, both the parameterised and the classical (polynomial) complexity is transferred. Polynomial parameter-preserving reductions form a subclass of linear FPT-reductions, i.e., parameterised reductions with k′ = O(k), as introduced by Chen et al. [15].

¹ Witnesses are sometimes also called critical hyperedges [51].
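To make the notion of FPT-time concrete, consider the rank-2 special case of hitting set, i.e., vertex cover: a bounded search tree that branches on the two endpoints of an uncovered edge decides the existence of a cover of size at most k in O(2^k · m) time, a bound of the form f(k)·p(|I|). This example is ours, added for illustration only; it is not used in the paper.

```python
# Bounded search tree for vertex cover (hitting set of a rank-2
# hypergraph): an illustration of an f(k)·poly(|I|) FPT algorithm.
# Branch on the two endpoints of any uncovered edge.

def has_vertex_cover(edges, k):
    edge = next(iter(edges), None)
    if edge is None:
        return True   # every edge is covered
    if k == 0:
        return False  # an edge remains but no budget is left
    u, v = tuple(edge)
    return (has_vertex_cover([E for E in edges if u not in E], k - 1)
            or has_vertex_cover([E for E in edges if v not in E], k - 1))

# The path 1-2-3-4 has a vertex cover of size 2 ({2, 3}) but none of size 1.
edges = [{1, 2}, {2, 3}, {3, 4}]
assert has_vertex_cover(edges, 2)
assert not has_vertex_cover(edges, 1)
```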

The notion of parameterised reduction gives rise to a hierarchy of complexity classes, the W-hierarchy. It is defined in terms of Boolean circuits, i.e., directed acyclic graphs whose vertex set consists of input nodes, NOT-, AND-, and OR-gates (with the obvious semantics) as well as a single output node. AND- and OR-gates have potentially unbounded fan-in. The depth of a circuit is the maximum length of a path from an input to the output node. The weft is the maximum number of large gates, i.e., gates with fan-in larger than 2, on any path. The Weighted Circuit Satisfiability problem is to decide whether a circuit has a satisfying assignment of weight k, i.e., with k inputs set to true, where k is taken as the parameter. For every positive integer t, W[t] is the class of all parameterised problems that admit a parameterised reduction to Weighted Circuit Satisfiability restricted to circuits of constant depth and weft at most t. The classes FPT ⊆ W[1] ⊆ W[2] ⊆ ... form an ascending chain. All inclusions are believed to be proper, which is however still unproven [23]. The higher a problem ranks in the hierarchy, the lower we consider the chances of finding an FPT-algorithm to solve it. As an example, the problem to decide the existence of a hitting set/UCC of cardinality (at most) k, parameterised by k, is a classical W[2]-complete problem [19, 32]. The W-hierarchy in turn is contained in the class (uniform) XP consisting of all problems that allow for an algorithm running in time O(|I|^{f(k)}). Again, if k = O(1), the running time is polynomial, but here the degree depends on k. It is known that the classes FPT ⊊ XP are different [23], unconditionally. Another source of conditional lower bounds is the exponential time hypothesis (ETH) by Impagliazzo, Paturi, and Zane [39, 40]. They conjectured that there is no algorithm solving 3-SAT on Boolean formulas with n variables in time 2^{o(n)}.
Sub-exponential reductions transfer this lower bound to several other computational problems, including parameterised ones, cf. [19].
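The XP bound O(|I|^{f(k)}) can be illustrated on the W[2]-complete k-hitting-set problem mentioned above: exhaustively testing all size-k vertex subsets takes n^{O(k)} time, polynomial for every fixed k but with the degree of the polynomial growing with k. The sketch below is our illustration, not part of the paper.

```python
# XP-style brute force for k-hitting-set: try all C(n, k) candidate
# sets. The n^O(k) running time is polynomial for each fixed k, but
# the degree of the polynomial depends on k. Illustrative only.
from itertools import combinations

def has_hitting_set_of_size(vertices, edges, k):
    return any(all(set(c) & E for E in edges)
               for c in combinations(vertices, k))

edges = [{1, 2}, {2, 3}, {3, 4}]
assert has_hitting_set_of_size([1, 2, 3, 4], edges, 2)   # e.g. {2, 3}
assert not has_hitting_set_of_size([1, 2, 3, 4], edges, 1)
```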

2.4 Enumeration Complexity
In this article, decision problems are only a means to analyse enumeration algorithms. Enumeration is the process of outputting all solutions to a problem without repetition. In the case of hitting sets, there exists a class of hypergraphs such that the number Nmin of minimal transversals grows exponentially in both n = |V| and m = |H| [7]. This rules out any polynomial-time algorithm for the enumeration problem. Instead, one could ask for an output-polynomial algorithm running in time polynomial in both the input and output size. That means, there exists a polynomial p such that the enumeration terminates within p(n, m, Nmin) steps. A stronger requirement is an incremental polynomial algorithm, generating the solutions in such a way that the i-th delay, the time between the (i−1)-st and i-th output, is polynomial in the input size and in i. This includes the preprocessing time until the first solution arrives as well as the postprocessing time between the last solution and termination. The strongest form of output-efficiency is that of polynomial delay, where the delay is universally bounded by a polynomial in the input size only. One can also restrict the memory consumption; ideally, the algorithm uses only space polynomial in the input.
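The exponential blow-up invoked above already occurs on a very simple family: the hypergraph consisting of m pairwise disjoint two-element edges has exactly 2^m minimal transversals, one for each way of picking an endpoint per edge. A brute-force check of this count (our illustration, not from the paper):

```python
# The hypergraph {{a_i, b_i} : i < m} of m disjoint 2-element edges
# has exactly 2^m minimal hitting sets (choose one endpoint per edge),
# so Nmin can be exponential in both n and m. Brute-force check.
from itertools import combinations

def minimal_hitting_sets(vertices, edges):
    hs = [set(c) for r in range(len(vertices) + 1)
          for c in combinations(vertices, r)
          if all(set(c) & E for E in edges)]
    # keep only the inclusion-wise minimal hitting sets
    return [H for H in hs if not any(G < H for G in hs)]

m = 3
vertices = [f"a{i}" for i in range(m)] + [f"b{i}" for i in range(m)]
edges = [{f"a{i}", f"b{i}"} for i in range(m)]
assert len(minimal_hitting_sets(vertices, edges)) == 2 ** m  # 8 solutions
```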

3 BACKTRACKING ENUMERATION USING A DECISION TREE
It is a common pattern in the design of enumeration algorithms to base them on a so-called extension oracle or, equivalently, an interval oracle [8, 57]. The technique was conceived in the early 70s by Lawler [46] to compute the K best solutions of a combinatorial problem (ranked by some cost measure). It can easily be adapted to listing all solutions and is now a standard technique that produced breakthrough results like a polynomial-delay algorithm for maximal independent sets [41, 58]. Recently, it has enjoyed some renewed interest for problems with certain closure properties [42, 49]. In newer works, oracle enumeration has been combined with techniques from parameterized complexity [50] and algorithms that are highly engineered for their application domain [8]. There is also an algorithm for the transversal hypergraph problem based on oracles [28].

In the general setting, the extension oracle for some combinatorial problem is given a collection of elements of some underlying universe and decides the existence of a solution containing these elements. Lawler's original technique involved fixing certain elements in a partial solution and then extending it to the optimum among all objects sharing the fixed elements. During the computation the new candidates are stored in a priority queue and are treated in the order of increasing cost. The main bottleneck is the space demand of the queue. For every partial solution, the number of newly introduced candidates can be equal to the size of the universe, resulting in an exponential growth. Thus, modifications are necessary for practical use. We make this concrete for the transversal hypergraph problem, where the goal is to enumerate all inclusion-wise minimal hitting sets. Here, the extension oracle receives a collection of vertices of the input hypergraph and answers the question whether there exists a minimal hitting set containing those vertices (and possibly avoiding some other excluded set). This naturally leads to the following parameterised decision problem.

Extension MinHS
Input: A hypergraph (V, H) and disjoint sets X, Y ⊆ V, X ∩ Y = ∅.
Output: true iff there is a minimal hitting set H ∈ Tr(H) with X ⊆ H ⊆ V \ Y.
Parameter: |X|.

The extension problem is known to be NP-complete, but polynomial-time solvable if the cardinality |X| of the set to be extended is bounded by a constant [13]. This makes |X| a prime choice as a parameter for this problem. Moreover, the design of our enumeration algorithm as described below will ensure that we only need to solve the problem in cases where |X| is smaller than the size of the largest minimal solution. It has been observed that the latter quantity is comparatively small in hypergraphs arising in data profiling, cf. [45, 54]. For our own findings in that regard, see Section 5. We now describe how the oracle can be used to enumerate minimal hitting sets. Later in Section 4, we give an algorithm implementing the oracle. To avoid the space demand of the priority queue, we instead use the extension oracle to prune a decision tree. This is known in the literature as the backtracking method [57] or flashlight technique [49]. Given a pair (X, Y) of disjoint sets of vertices, we want to enumerate the minimal hitting sets that comprise all of X but contain no vertex from Y. If X ∪ Y = V, this can only be X itself. Otherwise, we recursively compute the solutions given by the pairs (X ∪ {v}, Y) and (X, Y ∪ {v}), where v is some vertex not previously contained in X or Y. This recursion leads to the following binary decision tree. Every node in the tree is labelled with the respective pair (X, Y); every level in the tree corresponds to some new vertex v. The algorithm branches, in each node (X, Y) of level v, on the decision whether to add v to X or to Y, in the latter case excluding it from the future search. This is where the vertex order (V, ≽) comes into play. By always choosing v as the ≽-minimal vertex of V \ (X ∪ Y), we reach a universal branching order without the need of additional communication between the nodes. In the absence of pruning this produces the full binary tree with leaves (X, V \ X) for every possible subset X ⊆ V.
However, we only need to enter a subtree if one of its leaves contains a minimal transversal. If the root of the subtree is (X, Y), this is the case iff X is extendable to a minimal hitting set without Y. Therefore, the oracle tells us for every node whether we need to expand it. When given access to the ordered hypergraph (V, ≽, H), Algorithm 1 recursively traverses the resulting decision tree in a depth-first, pre-order manner. Subroutine extendable(X, Y) needs to implement the extension oracle. The procedure enumerate receives the sets X and Y as well as the remaining vertices R = V \ (X ∪ Y) as input and checks whether a leaf has been reached. If not,

R is non-empty and the algorithm branches on the assignment of the vertex min≽ R. The initial call of the recursion is enumerate(∅, ∅, V).

ALGORITHM 1: Recursive algorithm for the transversal hypergraph problem. The initial call is enumerate(∅, ∅, V).
Data: Ordered hypergraph (V, ≽, H).
Input: Partition (X, Y, R) of the vertex set V.
Output: All minimal hitting sets H ∈ Tr(H) such that X ⊆ H ⊆ V \ Y.

1 Procedure enumerate(X, Y, R):
2     if R = ∅ then return X;
3     v ← min≽ R;
4     if extendable(X ∪ {v}, Y) then enumerate(X ∪ {v}, Y, R \ {v});
5     if extendable(X, Y ∪ {v}) then enumerate(X, Y ∪ {v}, R \ {v});
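For concreteness, the recursion of Algorithm 1 can be transcribed into Python as follows. The subroutine extendable is realised here by a naive brute-force oracle that tries all supersets; the paper's actual oracle (developed in Section 4) is far more efficient, and all names in this sketch are ours.

```python
# A direct Python transcription of Algorithm 1. The `extendable`
# subroutine is a naive brute-force oracle, standing in for the
# refined oracle of Section 4; names are illustrative.
from itertools import combinations

def enumerate_minimal_hitting_sets(vertices, edges):
    """Yield all minimal hitting sets in the lexicographic order
    induced by the order of `vertices`."""
    def is_minimal_hs(H):
        return (all(H & E for E in edges)
                and all(any(H & E == {x} for E in edges) for x in H))

    def extendable(X, Y):
        # is there a minimal hitting set H with X ⊆ H ⊆ V \ Y?
        free = [v for v in vertices if v not in X and v not in Y]
        return any(is_minimal_hs(X | set(c))
                   for r in range(len(free) + 1)
                   for c in combinations(free, r))

    def rec(X, Y, R):
        if not R:            # leaf reached: X is a minimal hitting set
            yield X
            return
        v, rest = R[0], R[1:]  # the ≽-minimal remaining vertex
        if extendable(X | {v}, Y):
            yield from rec(X | {v}, Y, rest)
        if extendable(X, Y | {v}):
            yield from rec(X, Y | {v}, rest)

    yield from rec(set(), set(), list(vertices))

edges = [{1, 2}, {2, 3}, {3, 4}]
assert list(enumerate_minimal_hitting_sets([1, 2, 3, 4], edges)) == \
    [{1, 3}, {2, 3}, {2, 4}]
```

On this example the solutions indeed appear in the lexicographic order induced by the vertex order, as Lemma 3.1 below guarantees.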

Lemma 3.1. If subroutine extendable solves Extension MinHS, then Algorithm 1 on input (V, ≽, H) enumerates Tr(H) (i.e., it outputs every minimal hitting set of H exactly once). Moreover, the edges of Tr(H) are output in the lexicographical order induced by ≽.

Proof. The correctness is almost immediate from the above discussion. Note additionally that in the extreme case of H having not a single hitting set (i.e., ∅ ∈ H) both checks in lines 4 and 5 fail already in the root node. The algorithm then returns immediately without an output. Here, we use the assumption that there is at least one vertex and thus R = V ≠ ∅ holds in the root. We are left to prove that the algorithm outputs the solutions in lexicographical order. Let a, b be two leaves with a being left of b in the tree so that the algorithm visits a first. Suppose they are labelled (Xa, Ya) and (Xb, Yb), respectively. Further, let vertex v be such that the lowest common ancestor of a and b branches on the assignment of v, i.e., v = min≽(Xa △ Xb). Algorithm 1 first adds v to the current partial solution X (left child, line 4). Hence, v is contained in Xa and Xa ≼lex Xb. □

Algorithm 1 is similar to the backtracking method by Elbassioni, Hagen, and Rauf [28, Figure 1]. The main difference between the two is the search for new solutions. In our algorithm, the nodes in the decision tree maintain a set Y of vertices that have already been excluded, besides the partial solution X. The somewhat arbitrarily chosen vertex v is added to one of the two sets depending on whether X or X ∪ {v} is extendable. In contrast, the algorithm in [28] works only on the partial solution itself, there denoted by S, and explicitly searches for a new vertex extending S, which is computationally expensive. Also, their check whether S already constitutes a minimal transversal is redundant. This information can be obtained as a by-product of a careful implementation of the oracle subroutine as we show in Lemma 4.11 below.
The lexicographic order of the output can be useful in the application domain. For example, in the context of databases, we can apply it to list "interesting" UCCs first. Suppose the attributes are ranked by importance, then the enumeration naturally starts with those UCCs that contain many important attributes. On the other hand, the order raises some complexity-theoretic issues. Computing the lexicographically smallest minimal hitting set is known to be NP-hard [24, 41]. Therefore, Algorithm 1 having polynomial delay on all ordered hypergraphs would imply P = NP. We verify below that we achieve polynomial delay at least on a restricted class of inputs. Additionally, it is an interesting question to evaluate the impact of the order on the empirical run time on real-world databases, see Section 5.3.

4 THE EXTENSION PROBLEM FOR MINIMAL HITTING SETS
We have seen how an extension oracle for minimal hitting sets can be utilised to enumerate the transversal hypergraph. In this section, we discuss some complexity-theoretic aspects of the underlying decision problem, which in turn leads to an optimal algorithm for the oracle. We first show that Extension MinHS is W[3]-complete. In fact, it is one of the first natural parameterised problems having this property. From this, we derive some lower bounds on the run time of any algorithm for the extension problem, conditional on the assumption that certain collapses in the W-hierarchy do not occur. We then give an algorithm matching these lower bounds and describe a strategy to improve the run time behaviour of the oracle when combining it with the decision tree for the enumeration. Finally, we derive a bound on the delay of Algorithm 1.

4.1 Hardness of the Extension Problem
Boros, Gurvich, and Hammer showed that Extension MinHS is NP-complete [13]. In the same article, they reduced it to a certain covering problem in hypergraphs (see below for a formal definition). We extend this result by proving that the extension and the covering problem are in fact equivalent under both polynomial and parameterised reductions. We use this equivalence to establish the W[3]-completeness of Extension MinHS when parameterised by the cardinality |X| of the set that is to be extended. To this end, we define the parameterised variant of the covering problem, which we call the Multicoloured Independent Family problem. It formalises the following computational task: given k lists of sets (each list representing a colour) as well as an additional collection of forbidden sets, one has to select one set of each colour such that their union does not completely cover any forbidden set.

Multicoloured Independent Family

Input: A (k+1)-tuple (S1, ..., Sk, T) of finite systems of finite sets.
Output: true iff there are sets S1, ..., Sk with Si ∈ Si such that for all T ∈ T: T ⊈ S1 ∪ · · · ∪ Sk.
Parameter: k.

It will be convenient to also have a single-coloured variant of the problem featuring a single list.

Independent Family

Input: Two finite systems S, T of finite sets and a positive integer k.
Output: true iff there are k distinct sets S1, ..., Sk ∈ S such that for all T ∈ T: T ⊈ S1 ∪ · · · ∪ Sk.
Parameter: k.

Both problems generalise the Independent Set problem in hypergraphs, which asks for k vertices such that they do not cover any hyperedge, parameterised by k, cf. [19, 29]. For Independent Family, we select sets of vertices instead such that their union is independent w.r.t. hypergraph T. Independent Set is then the special case of system S consisting entirely of singletons. S1, ..., Sk, S, and T each fit the definition of a hypergraph (over vertex set U). However, we use the term set system in this context to avoid confusion with the other hypergraphs considered here. Recall that set Y in the formulation of Extension MinHS contains the vertices excluded from the search. We first state an easy observation implying that those vertices can be safely dropped from the hypergraph, which simplifies some of the proofs below. The two lemmas that follow then establish the equivalence of both variants of the covering problem with the extension problem.

Observation 4.1. Let (V, H) be a hypergraph and Y, T ⊆ V, Y ∩ T = ∅, two disjoint sets of vertices. T is a transversal of H if and only if it is one of the truncated hypergraph (V \ Y, {E \ Y | E ∈ H}).

Lemma 4.2. Extension MinHS and Multicoloured Independent Family are equivalent under polynomial reductions that preserve the parameter. In particular, they are FPT-equivalent.

Proof. The existence of a reduction from Extension MinHS to Multicoloured Independent Family has been proven in [13]. We reprove this part both for completeness and for later use in this paper. By Observation 4.1, we can assume the set Y to be empty without losing generality. Consequently, we drop Y from the notation. Let (V, H, X) be an instance of Extension MinHS. First, suppose X is empty. Then, the reduction outputs some fixed true-instance of Multicoloured Independent Family iff ∅ ∉ H; otherwise, a false-instance. In the following, we assume X ≠ ∅. Recall the definition of witnesses for minimal transversals. If hitting set H is minimal, every x ∈ H has an edge E ∈ H such that E ∩ H = {x} holds. So for X to have an extension to a minimal solution, there must be edges E observing E ∩ X = {x} for every x ∈ X. (For the latter, we take the intersection with X instead of H.) If some element x has no such edge, a false-instance of Multicoloured Independent Family is returned. Otherwise, the set systems Sx = {E ∈ H | E ∩ X = {x}} of potential witnesses are non-empty for all x. Next, we characterise the conditions under which a selection of potential witnesses actually implies the existence of a minimal extension of X. We collect in system T those edges of H that share no element with X. Edges that intersect X in more than one vertex can be cast aside. We claim that X can be extended to a minimal transversal of H if and only if there is a selection (Ex)x∈X of potential witnesses, Ex ∈ Sx, such that T ⊈ ⋃x∈X Ex holds for all T ∈ T. This is the case iff the instance ((Sx)x∈X, T) of Multicoloured Independent Family evaluates to true. Note that the reduction trivially preserves the parameter k = |X|. To prove the claim, let H ⊇ X be a minimal hitting set.
There exists a witness Ex ∈ H for every x ∈ H; if x is contained in the subset X, Ex is in fact a member of Sx. Define Z = V \ ⋃x∈X (Ex \ {x}). Since H intersects each Ex in x only, we have Z ⊇ H. This makes Z itself a hitting set (not necessarily a minimal one). Recall that the set system T ⊆ H contains the edges disjoint from X. If one of those forbidden sets were contained in the union ⋃x∈X Ex, it would also be contained in ⋃x∈X (Ex \ {x}). Z would thus not hit that edge of H, a contradiction. Therefore, selection (Ex)x∈X solves the instance ((Sx)x∈X, T). Conversely, suppose (Ex)x∈X, Ex ∈ Sx, is an admissible selection such that no member of T is covered by ⋃x∈X Ex. Then, Z = V \ ⋃x∈X (Ex \ {x}) = (V \ ⋃x∈X Ex) ∪ X is a transversal: any edge of H that has a non-empty intersection with X is trivially hit; the remaining ones (in T, if any) all have some element that is not contained in any of the edges Ex by assumption. Let H ⊆ Z be any minimal transversal. H must still contain all of X, otherwise some Ex is not hit. We now treat the converse reduction from Multicoloured Independent Family to Extension MinHS. First, observe that the above transformation remains valid if we redefine the set systems Sx via punctured edges instead, i.e., Sx = {E \ {x} | E ∩ X = {x}}. To complete the proof, it suffices to demonstrate that this modified reduction is capable of producing any given instance J = (S1, ..., Sk, T) of the Multicoloured Independent Family problem. To that end, let x1 to xk be k objects not previously contained in the universe U, i.e., the union of all sets occurring in S1, ..., Sk and T. We construct a preimage instance I = (V, H, X) of Extension MinHS. Set X = {x1, ..., xk} and take V = U ∪ X as the vertex set. The edges of H consist of the system T together with all sets of the form Si ∪ {xi} for Si ∈ Si. Clearly, if the modified reduction is applied to instance I, the output is J. □
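The converse reduction can be sanity-checked mechanically. The following Python sketch (our own, illustrative encoding: the fresh colour vertices are tagged `('x', i)`) builds the Extension MinHS instance and compares both sides by brute force on a tiny example.

```python
from itertools import combinations, product

def mif_to_extension(S_list, T_list):
    """Converse reduction of Lemma 4.2: encode an instance
    (S_1, ..., S_k, T) of Multicoloured Independent Family as an
    instance (V, H, X) of Extension MinHS."""
    X = {('x', i) for i in range(len(S_list))}
    H = [frozenset(T) for T in T_list]
    H += [frozenset(S) | {('x', i)} for i, Si in enumerate(S_list) for S in Si]
    V = X | {u for E in H for u in E}
    return V, H, X

def mif_brute(S_list, T_list):
    """Ground truth: some choice of one set per colour covers no forbidden set."""
    return any(all(not set(T) <= set().union(*map(set, sel)) for T in T_list)
               for sel in product(*S_list))

def extendable_brute(V, H, X):
    """Ground truth: some minimal hitting set of H contains all of X."""
    def hits(S):
        return all(S & set(E) for E in H)
    for r in range(len(V) + 1):
        for c in combinations(list(V), r):
            S = set(c)
            if X <= S and hits(S) and all(not hits(S - {v}) for v in S):
                return True
    return False
```

For example, with S1 = [{1,2}, {3}], S2 = [{2,3}] and the forbidden set {1,3}, choosing {3} and {2,3} leaves element 1 uncovered, and accordingly X extends to a minimal hitting set of the constructed hypergraph; with forbidden set {3} every choice is covered and the extension instance is a no-instance.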

The proof of the above lemma gives a characterisation of partial solutions that are extendable to minimal hitting sets. The result is of independent interest; it appears implicitly also in [13].

Corollary 4.3. Let (V, H) be a hypergraph and X ⊆ V a set of vertices. There exists a minimal transversal T ∈ Tr(H) with T ⊇ X if and only if there is a family {Ex}x∈X ⊆ H of edges such that (i) ∀x ∈ X: Ex ∩ X = {x} and (ii) ∀E ∈ H: E ⊆ ⋃x∈X Ex ⇒ E ∩ X ≠ ∅ both hold.

Fig. 1. Illustration of the proof of Lemma 4.5. On the left side is an instance of Independent Family, on the right is the resulting circuit of weft 3. All edges are directed downwards. Selecting {a, c}, {c, e}, and {c, g} from S solves the instance for parameter k = 3; any other combination of three sets covers a member of T.

In the next step, we show that the multi- and single-coloured variants of the independent family problem are essentially the same, both from a classical and a parameterised point of view.

Lemma 4.4. Multicoloured Independent Family and Independent Family are equivalent under polynomial reductions that preserve the parameter.

Proof. In order to reduce Multicoloured Independent Family to its single-coloured variant, it is enough to enforce that selecting two sets of the same colour is never a correct solution. Hence, they must always cover some forbidden set. Let (S1, ..., Sk, T) be an instance of Multicoloured Independent Family. For every 1 ≤ i ≤ k and S ∈ Si, we introduce a new element x[S,i]. The sets are augmented with their respective elements, S ∪ {x[S,i]}, and the results are collected in the single system S. Adding the pair {x[S,i], x[S′,i]} to T for each i and S ≠ S′ ∈ Si invalidates all unwanted selections. It is easy to check that no solution of the original instance is destroyed. For the other direction, we make k copies of system S and ensure that no two copies of the same set are selected together. To that end, we introduce element x[S,i] for each S ∈ S and 1 ≤ i ≤ k, define Si = {S ∪ {x[S,i]} | S ∈ S}, and add the sets {x[S,i], x[S,j]} for i ≠ j to T. □

Finally, we discuss the parameterised hardness of all three problems. To show membership of Independent Family in the class W[3], we construct a polynomial parameter-preserving reduction which outputs a circuit of bounded depth and weft 3.

Lemma 4.5. There is a polynomial parameter-preserving reduction from Independent Family to the Weighted Circuit Satisfiability problem restricted to circuits of bounded depth and weft 3.

Proof. Given an instance I = (S, T, k) of Independent Family, we build a Boolean circuit C of weft 3 that has a satisfying assignment of weight k iff I evaluates to true. C has one input node for every set S ∈ S. We introduce a large OR-gate for each element in the common universe U = ⋃S∈S S ∪ ⋃T∈T T. An input node S is wired to such an OR-gate u ∈ U whenever u ∈ S. Next, we construct a layer of large AND-gates, one for each forbidden set T ∈ T. Gate u is connected to T iff u ∈ T. Finally, the outputs of all AND-gates lead to a single large OR-gate whose output is then negated. This negation serves as the output of the whole circuit. Figure 1 shows an example instance and the resulting circuit.

Fig. 2. Illustration of the proof of Lemma 4.6. On the left side is an antimonotone 3-normalised formula, on the right is the resulting instance of Independent Family. The positions marked by the grey boxes are indexed by the respective sets: H for the DNF subformulas, I1 for the terms of the first subformula, and J(1,1) for the first term of the first subformula. The formula allows for a satisfying assignment of weight 4 by setting x4, x5, x7, and x8 to true. Equivalently, the union of the sets Sx4, Sx5, Sx7, and Sx8 does not cover any forbidden set in T. No assignment of Hamming weight at least 5 is satisfying.

C can be constructed from I in time polynomial in the size of the instance. It has constant depth and weft 3 as every path from an input node to the output passes through exactly one large gate of each layer. We claim that the circuit is satisfied by setting the input nodes S1, ..., Sk to true if and only if the union S1 ∪ · · · ∪ Sk of the corresponding sets contains no member of T. Let S1 through Sk be a selection of k distinct sets from S. Assigning true to the input nodes corresponding to these sets (and false to the others) satisfies exactly the OR-gates in the first layer that represent the elements in S1 ∪ · · · ∪ Sk. An AND-gate in the second layer, standing for some forbidden set T ∈ T, is satisfied if and only if all its feeding OR-gates are satisfied, i.e., iff T ⊆ S1 ∪ · · · ∪ Sk. The results for all T are collected by the large OR-gate in the third layer and subsequently negated. Circuit C is satisfied if and only if no forbidden set is contained in the union of the selected sets. □
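Evaluating the constructed circuit amounts to the following check (a Python sketch; the concrete instance is transcribed from Figure 1 and its caption, so treat it as illustrative):

```python
def weft3_value(S_list, T_list, selected):
    """Evaluate the weft-3 circuit of Lemma 4.5 on an input assignment.
    selected[i] is the truth value of the input node for set S_list[i]."""
    # first layer: one large OR-gate per universe element u,
    # satisfied iff u lies in some selected set
    covered = {u for S, on in zip(S_list, selected) if on for u in S}
    # second layer: one large AND-gate per forbidden set T, satisfied iff
    # all its feeding OR-gates are, i.e. iff T is fully covered;
    # third layer: a large OR over the AND-gates, then the single NOT-gate
    return not any(set(T) <= covered for T in T_list)

# the instance from Figure 1
S = [{'a', 'b'}, {'a', 'c'}, {'a', 'f', 'g'}, {'b', 'c', 'd'}, {'b', 'd', 'f'},
     {'c', 'e'}, {'c', 'g'}, {'d', 'e', 'f', 'g'}, {'e', 'f'}]
T = [{'a', 'c', 'f'}, {'b', 'c'}, {'b', 'g'}, {'d', 'e', 'f'}, {'f', 'g'}]

good = [Si in ({'a', 'c'}, {'c', 'e'}, {'c', 'g'}) for Si in S]  # weight 3
bad = [Si in ({'a', 'b'}, {'a', 'c'}, {'c', 'e'}) for Si in S]   # covers {b,c}
```

The weight-3 assignment selecting {a, c}, {c, e}, and {c, g} satisfies the circuit, since their union {a, c, e, g} contains no forbidden set, while any selection whose union covers some member of T falsifies it.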

To show hardness for this class, we instead reduce from a problem on Boolean formulas. Such a formula is called antimonotone 3-normalised if it is a conjunction of propositional sentences in disjunctive normal form (DNF) with only negative literals. The Weighted Antimonotone 3-normalised Satisfiability (WA3NS) problem is to decide for such a formula whether it has a satisfying assignment of Hamming weight (at least) k, i.e., with k variables set to true. Parameterised by the weight k, this satisfiability problem is known to be W[3]-complete [23]. The intuition behind the W[3]-hardness proof is as follows. The circuit constructed in the proof of Lemma 4.5 has a single NOT-gate immediately prior to the output node. Moving this negation all the way up to the input nodes, using De Morgan's laws, results in an antimonotone formula which is in fact 3-normalised. We claim that this is not an artefact of the reduction, but due to a characteristic property of the problem itself. Namely, every antimonotone 3-normalised formula can be encoded in an instance of Independent Family.

Lemma 4.6. There exists a polynomial reduction from WA3NS to Independent Family that preserves the parameter.

Proof. First, we fix some notation regarding 3-normalised formulas. Suppose φ is a Boolean formula on the variable set Vars(φ). It is antimonotone 3-normalised iff it can be written as

φ = ⋀_{h ∈ H} ⋁_{i ∈ Ih} ⋀_{j ∈ J(h,i)} ¬x(h,i,j)

for an index hierarchy H, {Ih}h∈H, {J(h,i)}h∈H, i∈Ih, and variables x(h,i,j) ∈ Vars(φ). Here, ¬x denotes the negative literal of variable x. This way, h ranges over the constituting DNF subformulas, i indexes their conjunctive terms, and j the negative literals in these terms. Of course, variables may appear multiple times in the formula, so different triples (h, i, j) may point to the same variable x. In the following, we construct an instance (S, T, k) of Independent Family which evaluates to true if and only if φ has a weight-k satisfying assignment. As the common universe, we take the terms of the DNF subformulas, U = {J(h,i) | h ∈ H, i ∈ Ih}. We introduce a set Sx ∈ S for each variable x ∈ Vars(φ). Sx contains all terms in which x occurs, Sx = {J(h,i) | ∃j: x(h,i,j) = x}. The subformulas themselves are represented in system T by setting Th = {J(h,i) | i ∈ Ih} and T = {Th}h∈H. The construction is illustrated in Figure 2. Suppose there is a satisfying assignment to φ which sets the variables x1 to xk to true. Then, Sx1 ∪ · · · ∪ Sxk contains exactly the terms that are not satisfied. If this union were to cover one forbidden set in T, the corresponding subformula (and hence φ) would also be unsatisfied, a contradiction.

This implies that instance (S, T, k) evaluates to true. Conversely, let Sx1 through Sxk be a selection of sets from S that cover no element of T. In other words, each subformula has at least one term in which none of the variables x1 to xk occurs. Setting these variables to true surely satisfies φ. □

Combining the reductions of the lemmas in this section finishes the proof that the investigated problems with their respective parameterisations are complete for the class W[3].

Theorem 4.7. Extension MinHS and (Multicoloured) Independent Family are W[3]-complete.

The classical Hitting Set problem is to decide, for a hypergraph H and a positive integer k, whether there is a transversal for H that has cardinality k. When parameterised by k, it is one of the standard problems complete for the class W[2], cp. [23]. An equivalent formulation of this problem is to ask whether the empty set X = ∅ can be extended to a minimal hitting set of size at most k. It is curious that giving the set X explicitly in the input, lifting the cardinality constraint, and instead parameterising the problem by |X| raises the hardness of the problem by one level in the W-hierarchy. Based on the results above, Casel et al. have recently shown that the extension problem for minimal dominating sets in graphs is W[3]-complete as well [14]. This can be seen as a special case of Extension MinHS and exhibits the same jump in the W-hierarchy, as the cardinality-bounded version of Dominating Set is W[2]-complete [23]. Besides the two extension problems, there are currently only two other natural problems known to be complete for W[3]. Neither the one from supply-chain management [16] nor the detection of inclusion dependencies [9] seems to have any immediate connection to the extension of hitting sets. Interestingly though, the latter problem also stems from the field of data profiling.
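The construction in the proof of Lemma 4.6 can be sketched and cross-checked by brute force. The encoding of a formula as nested lists of terms is our own and purely illustrative; in the chosen example all sets Sx are distinct, so the requirement of k distinct sets causes no friction.

```python
from itertools import combinations

def wa3ns_to_if(phi):
    """The construction of Lemma 4.6. `phi` encodes an antimonotone
    3-normalised formula as a list of DNF subformulas; each subformula
    is a list of terms; each term is the set of (negated) variables it
    mentions. Universe elements are the term identifiers (h, i)."""
    variables = {x for sub in phi for term in sub for x in term}
    S = {x: frozenset((h, i) for h, sub in enumerate(phi)
                      for i, term in enumerate(sub) if x in term)
         for x in variables}
    T = [frozenset((h, i) for i in range(len(sub)))
         for h, sub in enumerate(phi)]
    return S, T

def wa3ns_brute(phi, k):
    """Ground truth: does phi have a satisfying assignment of weight k?
    (For antimonotone formulas, weight >= k and weight exactly k coincide.)"""
    variables = sorted({x for sub in phi for term in sub for x in term})
    return any(all(any(not (term & set(tv)) for term in sub) for sub in phi)
               for tv in combinations(variables, k))

def if_brute(S, T, k):
    """Independent Family by brute force over k distinct sets from S."""
    distinct = set(S.values())
    return any(all(not t <= frozenset().union(*sel) for t in T)
               for sel in combinations(distinct, k))

# example: (not x1 and not x2, or not x3) and (not x2 and not x4, or not x1 and not x3)
phi = [[{1, 2}, {3}], [{2, 4}, {1, 3}]]
```

For this formula, both sides of the reduction agree for every weight k; the maximum weight of a satisfying assignment is 2 (setting x2 and x4 to true).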
Theorem 4.7 excludes the existence of any FPT-algorithm for the extension problem, unless the unlikely collapse W[3] = FPT occurs in the W-hierarchy. Instead, the best we can hope for is an XP-algorithm featuring a run time of the form (n + m)^f(|X|) for some computable function f. Assuming that 3-SAT does not admit a subexponential algorithm, i.e., the exponential time hypothesis (ETH), Chen et al. [15] showed that Independent Set on graphs with n vertices, m edges, and budget k cannot be solved in time f(k) · (n + m)^o(k). In particular, the exponent of the run time must have at least a linear dependence on the parameter. The bound transfers immediately to the Independent Family problem but has the caveat of relying on a comparatively strong assumption. As an illustration, the fact that Independent Set is W[1]-complete shows that ETH implies W[1] ≠ FPT and, in turn, W[t] ≠ FPT for all t ≥ 1. However, since all our reductions above preserve the parameter, we can derive the same lower bounds from weaker assumptions, involving levels higher up in the W-hierarchy. For this, we use the following proposition, which is a corollary of a more general result² in [15]. Recall that WA3NS is the satisfiability problem for antimonotone 3-normalised formulas, parameterised by the Hamming weight k of the assignment.

Proposition 4.8 (Chen et al. [15]). Let f be any computable function and c > 0 a constant. If WA3NS on formulas of size m with n variables can be solved in time f(k) · n^o(k) · m^c, then W[2] = FPT. Moreover, if there is a polynomial parameter-preserving reduction from WA3NS to a parameterised problem Π, and any instance (I, k′) ∈ Π can be solved in time f(k′) · |I|^o(k′), then W[2] = FPT.

Lemmas 4.2, 4.4, and 4.6, together with this proposition, yield the following lower bounds.

Theorem 4.9. Let f be any computable function and c > 0 a constant.

(i) Unless W[3] = FPT, Extension MinHS cannot be solved in time f(|X|) · (n+m)^c.
(ii) Unless W[2] = FPT, Extension MinHS cannot be solved in time f(|X|) · (n+m)^o(|X|).
(iii) Unless ETH fails, Extension MinHS cannot be solved in time f(|X|) · (n+m)^o(|X|).

The same bounds also hold for Independent Family and Multicoloured Independent Family if |X| is replaced by k, and (n+m) is replaced by (|U| + |T| + |S|) or (|U| + |T| + |S1| + · · · + |Sk|), respectively.

4.2 Solving the Extension Problem
Despite these hardness results, to finish the enumeration algorithm, we still need to solve the extension problem. That is, we have to decide, given two disjoint sets of vertices X and Y, whether X can be extended to a minimal hitting set without using any element of Y. Justified by Lemma 4.2, we approach this via the Multicoloured Independent Family problem, leading to Algorithm 2 below. First, the algorithm treats the special case of an empty set X = ∅. Then, it constructs an instance of Multicoloured Independent Family (lines 4–8). Two sanity checks (lines 9 and 10) assess whether the instance at hand can immediately be decided. We will see later that in practice many calls to the oracle already conclude here. If the checks are inconclusive, the instance is solved by brute force (lines 11–14). Observe that the existence of a minimal extension is decided without explicitly computing one. As shown in Section 3, this is enough for the enumeration. We now show that the worst-case run time of the algorithm matches the conditional lower bound given in the second and third clause of Theorem 4.9. Recall that n = |V| denotes the number of vertices and m = |H| the number of edges of the hypergraph.

2 The original result [15, Theorem 4.2] relates the satisfiability problem of Πt-circuits, for arbitrary t ≥ 2, to the class W[t−1]. The authors mention that one can assume the circuits to be layered. For t = 3, the layer structure coincides with that of antimonotone 3-normalised formulas. The statement also holds for the more general class of linear FPT-reductions.

ALGORITHM 2: Algorithm for the Extension MinHS oracle.

Input: Hypergraph (V, H) and disjoint sets X = {x1, ..., x|X|}, Y ⊆ V.
Output: true iff there is H ∈ Tr(H) with X ⊆ H ⊆ V \ Y; false otherwise.

1 if X = ∅ then
2   if V \ Y is a hitting set then return true;
3   else return false;

4 initialise set system T = ∅;
5 foreach x ∈ X do initialise set system Sx = ∅;
6 foreach E ∈ H do
7   if E ∩ X = {x} for some x ∈ X then add E \ Y to Sx;
8   if E ∩ X = ∅ then add E \ Y to T;

9 if ∃x ∈ X : Sx = ∅ then return false;
10 if T = ∅ then return true;

11 foreach (Ex1, ..., Ex|X|) ∈ Sx1 × · · · × Sx|X| do
12   W ← Ex1 ∪ · · · ∪ Ex|X|;
13   if ∀T ∈ T : T ⊈ W then return true;
14 return false;

Lemma 4.10. Algorithm 2 on input (V, H, X, Y) solves the Extension MinHS problem in time O((m/|X|)^|X| · mn) and uses linear space.

Proof. The first part up to line 10 of the algorithm computes the reduction from Extension MinHS to Multicoloured Independent Family for the truncated hypergraph (V \ Y, {E \ Y}E∈H). This is equivalent to reducing the original instance by Observation 4.1. The sanity checks in lines 1, 9, and 10 filter out trivial instances. The foreach-loop starting in line 11 is then brute-forcing the result of the reduction, checking all tuples in the Cartesian product Sx1 × · · · × Sx|X|. Regarding the run time, we assume that all set operations (membership, product, union, intersection, and difference) are implemented such that they take time proportional to the total number of elements contained in the sets. Checking whether V \ Y is a hitting set and computing the systems Sx1, ..., Sx|X|, and T can thus be done in O(mn). The run time is dominated by the brute-force phase. The cardinality of the Cartesian product is maximum if all systems have the same number of sets; there are thus at most (m/|X|)^|X| many tuples. For each of them, the algorithm computes the union W in O(|X|n) and checks all forbidden sets in T in O(mn). Observe that the fact that every element in X has an edge of its own (as verified in line 9) implies |X| ≤ m and O(|X|n + mn) = O(mn). This gives the claimed bound. Regarding the space requirement, note that the set systems Sx and T are all disjoint subsets of the truncated hypergraph, using at most as much space as the input. □
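For concreteness, here is a direct Python transcription of Algorithm 2 (a sketch, not the paper's C++ implementation; the line references in the comments point to the pseudocode above):

```python
from itertools import product

def extendable(V, H, X, Y):
    """Sketch of Algorithm 2: true iff some minimal hitting set H* of
    (V, H) contains all of X and avoids all of Y."""
    edges = [set(E) - Y for E in H]          # truncate by Y (Observation 4.1)
    if not X:                                # lines 1-3: is V - Y a hitting set?
        return all(E for E in edges)
    S = {x: [E for E in edges if E & X == {x}] for x in X}   # lines 4-8
    T = [E for E in edges if not E & X]
    if any(not Sx for Sx in S.values()):     # line 9: some x has no witness
        return False
    if not T:                                # line 10: X is itself minimal
        return True
    for sel in product(*S.values()):         # lines 11-14: brute force
        W = set().union(*sel)
        if all(not t <= W for t in T):
            return True
    return False
```

As the text notes, on practical inputs most calls are expected to resolve in the sanity checks of lines 9 and 10 before the brute-force loop is ever reached.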

4.3 Improved Enumeration
So far, we are solving the extension problem in a vacuum. We can improve upon this by also taking into consideration the interplay of the oracle with the decision tree. We are looking for the minimal transversals, but the original formulation of the Extension MinHS problem is indifferent to the distinction between sets that are themselves minimal solutions and those that still need extending vertices. As it turns out, we can discriminate the two cases.

Lemma 4.11. Let H ≠ ∅ be non-empty and X, Y ⊆ V two disjoint sets of vertices. Then, X is a minimal hitting set if and only if Algorithm 2 on input (V, H, X, Y) returns true from line 10.

Proof. Let X ∈ Tr(H) be a minimal hitting set. X is non-empty since H ≠ ∅. Set X thus fails the check in line 1. Every x ∈ X has a witness, and the truncated witness is contained in Sx (as checked in line 9). As X is also a hitting set, T = ∅ is empty and Algorithm 2 breaks in line 10. Conversely, to reach and pass the test in line 10, X must be a non-empty hitting set. Assume X is not minimal and let x ∈ X be expendable, i.e., there is no edge E ∈ H such that E ∩ X = {x}, whence the system Sx = ∅ is empty. This would have been detected already a line earlier, a contradiction. □
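The witness criterion underlying Lemma 4.11 can also be stated directly (a Python sketch under the assumption H ≠ ∅; names are our own):

```python
def minimal_via_witnesses(H, X):
    """For non-empty H, a set X is a minimal hitting set iff every x in X
    has a private witness edge (E ∩ X = {x}) and no edge of H is disjoint
    from X."""
    X = set(X)
    every_x_witnessed = all(any(set(E) & X == {x} for E in H) for x in X)
    every_edge_hit = all(set(E) & X for E in H)
    return every_x_witnessed and every_edge_hit
```

This is exactly the information that lines 9 and 10 of Algorithm 2 compute anyway, which is why minimality detection comes for free as a by-product of the oracle call.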

We can trivially check whether H is empty, which would make the empty set the only minimal transversal, before invoking the enumeration algorithm. Therefore, we assume H ≠ ∅ in what follows. Consider the search tree of Algorithm 1 with the extension oracle being implemented by the subroutine in Algorithm 2. The subroutine breaking in line 10 flags the fact that we have found a minimal solution. If so, we know the outcome of every oracle call further down in the subtree rooted at the current node: adding more vertices to X can never lead to a set that is extendable; excluding vertices by adding them to Y instead does not change the answer of the oracle. In summary, the subtree rooted at (X, Y, V \ (X ∪ Y)) has exactly one leaf that holds a minimal solution, namely, (X, V \ X, ∅). We can thus safely shortcut the recursion and immediately output solution X. Another possible improvement stems from the observation that the two extension checks of the enumeration procedure, lines 4 and 5 of Algorithm 1, cannot both fail at the same time. We only enter a node once we have verified that at least one leaf of its subtree holds a minimal solution. That leaf is a descendant of either the left or the right child. If the first check fails, we therefore do not need to execute the second one. Observe that this does not interfere with the flagging mechanism above. We entered the current node because X is extendable but not yet a minimal solution. The call extendable(X, Y ∪ {v}) has the same first argument, and thus Algorithm 2 on that input would also not break in line 10. Finally, we prove a guarantee on the maximum delay between consecutive outputs of Algorithm 1. This hinges on the fact that the oracle is only called a linear number of times between two outputs. Moreover, we show that the size |X| of the sets we try to extend during the computation is bounded by the rank of the transversal hypergraph rank(Tr(H)), which we denote by k∗. For bounded k∗, we achieve polynomial delay.
Note that the rank k∗ is not known to the algorithm; the input consists only of the hypergraph itself. In fact, it has been shown recently by Bazgan et al. that computing k∗ is W[1]-hard (parameterised by k∗); also, it cannot be approximated within a factor of n^(1−ε) for any ε > 0, unless P = NP [4]. The upper bound on the delay holds regardless of the knowledge of k∗.

Theorem 4.12. Algorithm 1 combined with Algorithm 2 as oracle, on input (V, ≽, H), enumerates Tr(H) with delay O(m^(k∗+1) · n²), where k∗ = rank(Tr(H)). The method uses space linear in the input.

Proof of Theorem 4.12. The correctness of Algorithms 1 and 2 is treated in Lemmas 3.1 and 4.10, respectively. We are left with the bounds on the delay and space. The height of the decision tree is |V| = n. After exiting a leaf, the pre-order traversal meets at most 2n − 1 inner nodes before arriving at the next leaf. In the worst case, method extendable is invoked in each of them (even with the shortcut described above). The O(m^(|X|+1) · n) oracle call dominates the time spent in each node. We claim that any set X appearing as the first argument of extendable is of cardinality at most k∗. To reach a contradiction, assume a node (X, Y, R) is visited during the execution for which |X| > k∗. Clearly, this cannot be the root since X is non-empty. Thus, the node has been entered after an invocation of method extendable(X, Y). X is neither a minimal solution nor can it be extended to one since its cardinality is already larger than the size k∗ of the largest minimal transversal. The check returned false and the child node is never entered, a contradiction. In summary, the delay is at most (2n − 1) · O(m^(k∗+1) · n) = O(m^(k∗+1) · n²). Regarding the space requirement, we need to store the input and the label of the current node to govern the recursive calls. By Lemma 4.10, any call of Algorithm 2 needs only linear space. After its return, the constructed set systems (Sx)x∈X and T can be discarded. □

Beyond the bounded delay, there are two aspects that make this enumeration method particularly applicable in practice. As soon as the tree search branches at a node, the resulting subtrees are completely independent. Algorithm 1 is thus trivially parallelisable. Also, most hitting set enumeration algorithms, including the current state of the art, incur a space requirement that is proportional to the size of the output [2, 38, 54, 56]. In contrast, our branching scheme makes sure that no potential solution occurs more than once, obviating the need for any additional bookkeeping. Hence, the whole computation needs only space linear in the size of the input.

5 ENUMERATING UNIQUE COLUMN COMBINATIONS
We apply our enumeration algorithm to hypergraphs arising in the profiling of relational databases. Specifically, we want to enumerate all minimal unique column combinations (UCCs). Recall that a UCC for some database r over a relational schema R is a set X ⊆ R of attributes such that evaluating any row ri ∈ r at the entries of X, i.e., the subtuple ri[X], already identifies the full tuple ri. There is a folklore polynomial reduction from the detection of UCCs to the hitting set problem, see e.g. [20, 47]. Intuitively, for any two rows ri, rj ∈ r, any UCC must contain at least one attribute in which ri and rj disagree; otherwise, these rows are indistinguishable. That is to say, any UCC must hit every difference set of pairs of rows in the database. The reduction is parsimonious in that it establishes a one-to-one correspondence between UCCs and hitting sets. Additionally, it preserves set inclusions, matching minimal UCCs to minimal hitting sets. Bläsius et al. showed that there is also a parsimonious, inclusion-preserving reduction in the opposite direction [9]. As a result, enumerating minimal UCCs is exactly as hard as the general transversal hypergraph problem. The reduction sketched above implies a two-phased approach for the enumeration. First, generate the hypergraph of minimal difference sets. Second, list its minimal transversals. The first phase takes time polynomial in the size of the database. The second phase, which is the focus of this paper, has exponential complexity in general. In the following, we thus assume that the Sperner hypergraph of minimal difference sets is given as input.
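The first phase of this reduction can be sketched as follows (illustrative Python, assuming rows are given as equal-length tuples; column indices stand in for attributes):

```python
from itertools import combinations

def minimal_difference_sets(rows):
    """Folklore reduction sketch: for every pair of rows, collect the set
    of column indices on which they disagree, then keep only the
    inclusion-wise minimal such sets. The minimal hitting sets of the
    resulting hypergraph are exactly the minimal UCCs of the relation."""
    diffs = {frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b)
             for r, s in combinations(rows, 2)}
    return {d for d in diffs if not any(e < d for e in diffs)}

rows = [(1, 'a', 0),
        (1, 'b', 0),
        (2, 'a', 1)]
print(minimal_difference_sets(rows))
```

In this toy relation, rows 1 and 2 disagree only in column 1 and rows 1 and 3 disagree in columns 0 and 2, so every UCC must contain column 1 together with column 0 or 2; the pairwise difference set {0, 1, 2} is discarded as non-minimal.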

5.1 Data and Experimental Setup We evaluate our enumeration algorithm on a total of twelve databases. Ten of them are publicly available. These are the abalone, echocardiogram, hepatitis, and horse datasets from the UCI Machine Learning Repository3; uniprot from the Universal Protein Resource4; civil_service5, ncvoter_allc6 and flight_1k7 provided by the respective authorities of the City of New York, the state of North Carolina, and the federal government of the US; call_a_bike of the German railway company Deutsche Bahn8, as well as amalgam1 from the Database Lab of the University of Toronto9. They are complemented by two randomly generated datasets fd_reduced_15 and

³ archive.ics.uci.edu/ml
⁴ uniprot.org
⁵ opendata.cityofnewyork.us
⁶ ncsbe.gov
⁷ transtats.bts.gov
⁸ data.deutschebahn.com
⁹ dblab.cs.toronto.edu/~miller/amalgam

dataset          columns     rows   |V|   |H|   k∗     Nmin
call_a_bike           16  100 000    13     6    4       23
abalone                9     4177     9    30    6       29
echocardiogram        12      132    12    30    5       72
civil_service         20  100 000    14    19    7       81
horse                 28      300    25    39   11      253
uniprot               40   19 999    37    28   10      310
hepatitis             20      155    20    54    9      348
fd_reduced_15         15  100 000    15    75    3      416
amalgam1              87       50    87    70    4     2737
fd_reduced_30         30  100 000    30   224    3     3436
flight_1k             70     1000    53   161    8   26 652
ncvoter_allc          88  100 000    82   448   15  200 907

Table 1. The databases used in the evaluation, ordered by the number Nmin of minimal UCCs. Columns and rows denote the respective dimensions of the database; |V| and |H| refer to the resulting hypergraph of minimal difference sets; k∗ is the cardinality of the largest minimal UCC.

fd_reduced_30 using the dbtesma data generator¹⁰. Databases with more than 100k rows are cut by choosing 100k rows uniformly at random. Table 1 lists the data together with some basic features such as the number of columns and rows in the database, the number of vertices and edges of the resulting hypergraph, as well as the number of minimal UCCs. After computing the minimal difference sets, we removed vertices not appearing in any edge, as these are not relevant for the enumeration. Thus, the number of vertices |V| of the hypergraph can be smaller than the number of columns in the database. The number of difference sets of a database with |r| rows is |r|(|r|−1)/2 in the worst case. However, comparing the "rows" column in the table above with the number |H| of minimal difference sets reveals that in practice the latter tends to be smaller than the number of rows, let alone quadratic in |r|. The hypergraph perspective thus provides a very compact representation of the UCC problem. Moreover, the maximum cardinality k∗ of minimal UCCs is small in practice. In particular, there does not appear to be any relationship between k∗ and the total input size. The table is sorted by the number Nmin of minimal hitting sets/UCCs, i.e., the solution size. All plots below use this order. The algorithms are implemented in C++ and run on an Ubuntu 16.04 machine with two Intel Xeon E5-2690 v3 2.60 GHz CPUs and 256 GB RAM. The code together with the hypergraphs for the call_a_bike, civil_service, fd_reduced_30, and ncvoter_allc databases is available at hpi.de/friedrich/research/enumdat. In some of the experiments, we collect run times of intermediate steps, for example the oracle calls in Section 5.4. To avoid interference with the overall run time measurements, we use separate runs for these. Also, we average over multiple runs to reduce the noise of the measurements. See the corresponding sections for details.

5.2 Comparison with Brute Force

We use a brute-force search for minimal UCCs as a baseline for comparison, namely, the Apriori algorithm. Originally developed for frequent itemset mining [3], it is well known that it can also be used to find UCCs and hitting sets in general [2, 35, 47]. Apriori follows a level-based bottom-up approach: in level i, it maintains a collection of candidate sets of vertices, all of cardinality i. The candidates of the (i+1)-st level are generated by combining any two i-candidates that differ in

¹⁰ sourceforge.net/projects/dbtesma

exactly one vertex. The new candidates are then tested to determine whether they are hitting sets of the input hypergraph. If so, they are output and discarded from the candidate collection. To avoid generating unnecessary candidates, i.e., those that are supersets of smaller transversals, the new candidates are checked against the collection of solutions found so far. Traversing the power lattice in a bottom-up fashion ensures that once a hitting set is found for the first time, it is necessarily minimal. After finding all solutions, Apriori may take some additional time to terminate, as it needs to verify that the search space of future candidates is in fact empty. It is generally agreed that this behaviour is hard to avoid, as finding an efficient stopping criterion here would immediately yield faster algorithms to compute the transversal rank k∗; see [4, 30] for a more detailed discussion. To make this transparent in our experiments, we measure two run times for the brute-force algorithm: the enumeration time is the time it took to find all minimal UCCs, and the completion time is the total run time until termination.

Fig. 3. Comparison of the run time of Algorithm 1 with the brute-force baseline (Apriori). For the latter, we report the enumeration time and the completion time separately since it has a non-negligible overhead after the last output until termination. The completion times for uniprot and amalgam1 are not shown as Apriori exceeds the memory limit. The enumeration and completion times of Apriori on flight_1k and ncvoter_allc exceed the time limit of 12 h.

Figure 3 shows the run times of our algorithm in comparison to the enumeration/completion times of Apriori on a log scale. For the smaller instances, both algorithms are fast. For larger instances, the brute-force baseline is clearly outperformed. The only exceptions to this trend are the two artificially generated instances fd_reduced_15 and fd_reduced_30.
These are also the two instances with the smallest transversal rank of k∗ = 3; this exceptionally small value benefits a bottom-up brute-force technique. Beyond the mere run time, Apriori also requires a lot of memory if the transversal hypergraph is large since it needs to store all output solutions to avoid unnecessary candidates. We thus set a memory limit of 128 GB and a time limit of 12 h. On flight_1k and ncvoter_allc, Apriori finds 98.6% and 14.6% of the minimal solutions, respectively, before reaching the time limit. For all other instances, brute force enumerates all solutions within the limits. However, for uniprot and amalgam1, the memory limit was exceeded between the output

Fig. 4. Run times of our enumeration algorithm. The box plots show the run times for 1000 random vertex orders. The filled circles correspond to the heuristic vertex order.

of the last solution and termination. Still, even if we only take the enumeration time into account, we achieved speed-up factors of 4723 and 14 for these instances. This comparison already reveals an important practical advantage of our backtracking enumeration method: it only uses memory proportional to the input size (cf. Theorem 4.12) as opposed to the size of the output. An exuberant memory consumption is not specific to the brute-force algorithm but also occurs in several modern hitting set and UCC enumerators. For more details, see the surveys [2, 33, 54].
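For reference, the level-wise scheme of the Apriori baseline can be sketched as follows (a minimal Python illustration under our own naming, not the optimized implementation used in the experiments). Note that the loop only terminates once the candidate collection runs empty, which is precisely the completion overhead discussed above:

```python
from itertools import combinations

def apriori_minimal_hitting_sets(vertices, edges):
    """Level-wise bottom-up search: level i holds candidate vertex sets of
    size i. A candidate hitting every edge is output (it is necessarily
    minimal, as all smaller hitting sets were found on earlier levels) and
    discarded; the survivors are merged pairwise to form the next level."""
    edges = [set(e) for e in edges]
    solutions = []
    candidates = {frozenset([v]) for v in vertices}
    while candidates:
        survivors = []
        for c in candidates:
            if all(c & e for e in edges):
                solutions.append(c)   # minimal hitting set found
            else:
                survivors.append(c)
        # combine any two i-candidates that differ in exactly one vertex,
        # pruning supersets of already-found solutions
        next_cands = set()
        for a, b in combinations(survivors, 2):
            u = a | b
            if len(u) == len(a) + 1 and not any(s <= u for s in solutions):
                next_cands.add(u)
        candidates = next_cands
    return solutions

edges = [{1}, {0, 2}]
print(sorted(sorted(s) for s in apriori_minimal_hitting_sets([0, 1, 2], edges)))
# [[0, 1], [1, 2]]
```

Even on this tiny example, the non-hitting survivor {0, 2} keeps the search alive for one extra level after the last solution is output, mirroring the gap between enumeration and completion time.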

5.3 Vertex Order, Run Time, and Delay

Recall that our enumeration algorithm (Algorithm 1) branches on the vertices of the hypergraph in a certain order. Although the order does not matter for our asymptotic bounds, it affects the shape of the explored decision tree, which in turn impacts the practical run time. Even on a theoretical level, it has been shown that there exist vertex orders that make finding the (lexicographically) first solution, let alone all of them, into an NP-hard search problem [24]. As a heuristic, we sort the vertices descendingly by the number of different values that appear in the corresponding column of the original database. The intuition is that columns with many values have a higher separative power on the pairs of rows and thus are more likely to appear in many minimal UCCs. Including an expressive vertex makes many other vertices obsolete, which leads to early pruning of the tree. Conversely, excluding such vertices (adding them to the set Y in Algorithm 2) makes it likely that only a few hitting sets survive, which also prunes the tree early. Note that reducing the size of the decision tree, and thus the number of oracle calls, does not automatically reduce the run time. The remaining oracle calls may have a larger average return time. We discuss this in more detail in Section 5.4. As a side note, sorting the vertices by their hypergraph degree instead (i.e., the number of minimal difference sets in which they appear) results in similar but slightly worse run times. Besides using our heuristic order, we also evaluate the algorithm on 1000 random vertex orders per dataset. The only exception is the ncvoter_allc instance, where the higher run times do not permit that many orders. We report on ncvoter_allc separately. The run times, averaged over 10 measurements for each data point, are shown in Figure 4. In that box plot, as well as the ones below, the instances are sorted by Nmin, the number of minimal UCCs. The boxes range from the first to the third quartile of the samples, with the median indicated as a horizontal line. The whiskers represent

Fig. 5. Delays between consecutive outputs of minimal hitting sets for the heuristic vertex order.

the lowest datum within 1.5 interquartile range (IQR) of the lower quartile and the highest datum within 1.5 IQR of the upper quartile. Everything beyond that we count as an outlier (small crosses). The median run time generally scales with the solution size, which is not surprising. The only exceptions are, again, the generated instances fd_reduced_15 and fd_reduced_30. For most of the instances, the order of the vertices had only little impact. Our heuristic outperformed the median random order for all instances, indicating that it is a solid choice in practice. On some instances, the heuristic even led to lower run times than any of the random orders. For ncvoter_allc, however, the influence of the vertex order was much larger. Using the heuristic, the enumeration completed in 26 min. For comparison, on four out of eleven random orders, the process only finished after 59.7 h, 105.3 h, 113.7 h, and 167.7 h, respectively. The other seven runs exceeded the time limit of 168 h (one week). The box plot in Figure 5 shows the delays between consecutive solutions when using the heuristic order. Each data point was obtained by averaging over 100 runs. While there is a high variance in the delays on some instances, e.g., between 10⁻¹ ms and 10³ ms for ncvoter_allc, the maximum delay is always less than 2 s, which is reasonably low. In the following section, we investigate the delays more closely by looking at the run time distribution of the oracle calls.
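For illustration, the order heuristic described at the beginning of this section can be computed directly from the database; heuristic_vertex_order is a hypothetical helper sketched here in Python, not part of our C++ implementation:

```python
def heuristic_vertex_order(rows, vertices):
    """Sort the vertices descendingly by the number of distinct values in the
    corresponding database column (ties keep a stable order). Columns with
    many distinct values separate more row pairs and tend to occur in many
    minimal UCCs, which prunes the decision tree early."""
    distinct = {v: len({row[v] for row in rows}) for v in vertices}
    return sorted(vertices, key=lambda v: -distinct[v])

# Example: column 0 has three distinct values, column 1 two, column 2 one.
rows = [("a", 1, "x"), ("b", 2, "x"), ("c", 1, "x")]
print(heuristic_vertex_order(rows, [0, 1, 2]))  # [0, 1, 2]
```

The degree-based alternative mentioned above would merely replace the distinct-value count by the number of minimal difference sets containing each vertex.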

5.4 Oracle Calls

In Section 4.3, we showed that the number of oracle calls between two consecutive outputs is linear in the number of vertices in the hypergraph. Thus, the only part potentially leading to a super-polynomial delay are the oracle calls themselves. It is thus interesting to see how many oracle calls we need and how long these calls actually take in practice. For our heuristic vertex order, we measured the run times of each individual oracle call, averaged over 100 runs to reduce the noise. Figure 6 shows the complementary cumulative frequencies (CCF) of the oracle run times, i.e., for each time t (x-axis), it shows the number of oracle calls with run time at least t (y-axis). We excluded the artificial instances fd_reduced_15 and fd_reduced_30; they are discussed separately. First, note the horizontal lines on the left. They stem from the few trivial oracle calls with X = ∅. Those are much faster than all other calls since they do not need to construct the Independent Family instance. Next, we examine the impact of the total number of oracle calls on the run time. The legend of Figure 6 is ordered by increasing enumeration time from left to right; for the real-world databases, this is the same as the output order used throughout (only the artificially generated ones differ from that). For comparison, the total number of oracle calls is the y-value of the left endpoint of each curve. The two orders are almost the same. An interesting exception is the hepatitis dataset. It has fewer calls than horse and uniprot, but these calls take

Fig. 6. (a) The complementary cumulative frequencies of all oracle run times over the enumeration process. (b) The complementary cumulative frequencies of run times for those calls that enter the brute-force loop in line 11 of Algorithm 2. The heuristic vertex order was used in both plots.

more time on average, leading to a higher overall run time. Instance amalgam1 needs even more calls, which then outweighs the smaller average. Similarly, the oracle calls for horse take more time than those for uniprot, but the higher number of calls in the latter case causes the longer run time. In preliminary experiments, we observed these effects also when comparing different vertex orders of the same dataset. In general, aiming for the smallest number of oracle calls is a good strategy, although there are cases in which a higher number of easier calls leads to better run times. In the non-trivial cases of X ≠ ∅, the extension oracle (Algorithm 2) first checks whether the resulting Independent Family instance can already be decided by the sanity checks in lines 9 and 10. This way, a significant portion of the calls can be solved in polynomial time. Over all instances, slightly more than half of the oracle calls are solved like that. In fact, for the three instances with the most oracle calls, namely, amalgam1, flight_1k, and ncvoter_allc, no more than 32% of the calls entered the brute-force loop in line 11, which is the only part that actually requires super-polynomial run time. It is thus interesting to see specifically the run times of those calls; their CCFs are shown in Figure 6b. The differences between Figures 6a and 6b in the lower parts of call_a_bike and uniprot are artifacts of the separate measurements to create these plots. The run times of the oracle calls that enter the brute-force phase are heterogeneously distributed with many fast invocations and only few slow ones. Exemplarily, we investigate the oracle calls of the flight_1k instance.
The database has 70 columns, 1000 rows, and 26 652 minimal UCCs. During the enumeration process, the extension oracle is called 242 449 times: 22 (0.009%) of the calls are trivial, the vast majority of 165 767 (68.4%) are decided by the sanity checks, and the remaining 76 660 (31.6%) enter the loop. Of the brute-force calls, 41 353 (53.9%) take only a single iteration to find a combination of potential witnesses verifying that the respective input set X is indeed extendable to a minimal solution (line 13). However, there are also two calls that need the maximum of 74 880 iterations, which corresponds to a run time of 16 ms. In those two cases, all possible combinations of potential witnesses had to be tested, only to conclude that the set is not extendable

Fig. 7. Complementary cumulative frequencies of oracle run times over the enumeration process for the artificial instances. The heuristic was used for the vertex order.

(line 14). It is inherent to the hardness of Extension MinHS that the inputs that are not extendable because all combinations of potential witnesses cover at least one unhit set incur the highest number of iterations and the longest oracle run times. Fortunately, those occasions were rare in our experiments. In the case of flight_1k, a mere 622 calls take more than 10 000 iterations; they make up 0.8% of the brute-force calls and only 0.2% of all oracle invocations. The run time distributions of the other real-world databases are similar to the one of flight_1k, see Figure 6a. There is always a non-vanishing chance that a given oracle call incurs a high run time, which is not surprising for a worst-case exponential algorithm, but even the slowest calls are reasonably fast in practice. However, the majority is far away from the worst case, leading to a very low run time on average. The heterogeneous behaviour of the brute-force calls in particular also shows in their CCFs (Figure 6b). They resemble a power-law distribution (straight lines in a log-log plot), albeit their tendency towards very small run times (concavity of the plots) is stronger than one would expect for a pure power law. Finally, the two artificial instances fd_reduced_15 and fd_reduced_30 behave very differently from the real-world databases. Figure 7 shows the CCF of their oracle calls. The staircase shape indicates that there are only five different types of oracle calls, with roughly the same run times for calls of the same type.
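The CCF curves themselves are straightforward to compute from the measured call times; a small sketch under our own naming, assuming times in milliseconds:

```python
def ccf(times):
    """Complementary cumulative frequencies: for each distinct run time t,
    the number of measurements with run time at least t. Plotting these
    points on log-log axes yields curves like those in Figures 6 and 7."""
    points = []
    for t in sorted(set(times)):
        points.append((t, sum(1 for s in times if s >= t)))
    return points

print(ccf([0.1, 0.1, 2.0, 16.0]))  # [(0.1, 4), (2.0, 2), (16.0, 1)]
```

In particular, the y-value at the smallest time is the total number of calls, which is how the left endpoints of the curves in Figure 6a were read off above.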

6 CONCLUSION

In this work, we devised a backtracking method to enumerate all minimal hitting sets of a hypergraph. We showed that it achieves polynomial delay on inputs whose transversal hypergraphs have bounded rank. The underlying tree search is based on a subroutine solving the extension problem for minimal hitting sets. We identified the extension problem as a natural complete problem for the parameterised complexity class W[3]. Our algorithm for the subroutine has an optimal worst-case behaviour when assuming the exponential time hypothesis or W[2] ≠ FPT. The properties of our enumeration method make it particularly suitable for the task of computing all minimal unique column combinations of a relational database, which is a domain where the solutions are expected to be small. However, the degree of the worst-case run time bound is dependent on the size of the largest solution. This bears the danger that the running times may still be, although output-polynomial, prohibitively large in practice. To check whether our algorithm succeeds within reasonable time frames, we evaluated it on a collection of publicly available datasets. The experiments showed that our method outperforms the brute-force baseline as soon as the solution space is non-trivial. As the empirical run time depends on the branching order of the vertices, we gave a heuristic based on the original database that achieved good results throughout by reducing the number of oracle calls. Notwithstanding, the main reason for the low overall run times is not only the small number of calls but the fact that the oracle calls are very fast on average. In particular, they regularly avoid the worst case in practice, which was the basis for the large theoretical run time bound. The tree-based computation additionally obviates the need for expensive post-processing. Our algorithm terminates almost immediately after outputting the last solution.
Another important feature of our method is the low memory consumption. We showed that the algorithm requires only space linear in the input size, independently of the number of solutions. Approaching the enumeration of unique column combinations from a hitting set perspective and employing an extension oracle naturally forgoes the necessity of holding previous solutions in memory. This, however, seems to be a common problem even for current state-of-the-art discovery algorithms in data profiling such as DUCC and HyUCC. Papenbrock and Naumann, the authors of HyUCC, posed the following challenge in their conclusion [56].

"For future work, we suggest to find novel techniques to deal with the often huge amount of results. Currently, HyUCC limits its results if these exceed main memory capacity [...]."

We believe that this is a problem that can be solved by viewing data profiling from a hitting-set perspective, as we did in this work. While the pure enumeration times of our algorithm are already competitive, they seem not to be the true bottleneck of the computation, despite the fact that this is the NP-hard core of the problem. Instead, the polynomial preprocessing step of preparing the minimal difference sets of the database as input for the experiments regularly took longer than actually computing all transversals. Here, careful engineering has the potential of huge speed-ups on practical instances. Combining this with the natural advantages of our enumeration algorithm might yield the novel technique we are looking for.

REFERENCES

[1] Amir E. Abboud. 2017. Hardness in P. Ph.D. Dissertation. Stanford University, Stanford, CA, USA. https://searchworks.stanford.edu/view/12137896
[2] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. The VLDB Journal 24 (2015), 557–581. https://doi.org/10.1007/s00778-015-0389-y
[3] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). 487–499. http://dl.acm.org/citation.cfm?id=645920.672836
[4] Cristina Bazgan, Ljiljana Brankovic, Katrin Casel, Henning Fernau, Klaus Jansen, Kim-Manuel Klein, Michael Lampis, Mathieu Liedloff, Jérôme Monnot, and Vangelis Th. Paschos. 2018. The Many Facets of Upper Domination. Theoretical Computer Science 717 (2018), 2–25. https://doi.org/10.1016/j.tcs.2017.05.042
[5] Claude Berge. 1989. Hypergraphs - Combinatorics of Finite Sets. North-Holland Mathematical Library, Vol. 45. North-Holland, Amsterdam, Netherlands.
[6] Armin Biere, Marijn Heule, Hans van Maaren, and Toby Walsh (Eds.). 2009. Handbook of Satisfiability. Frontiers in Artificial Intelligence and Applications, Vol. 185. IOS Press, Amsterdam, The Netherlands.
[7] Jan C. Bioch and Toshihide Ibaraki. 1995. Complexity of Identification and Dualization of Positive Boolean Functions. Information and Computation 123 (1995), 50–63. https://doi.org/10.1006/inco.1995.1157
[8] Andreas Björklund, Petteri Kaski, Łukasz Kowalik, and Juho Lauri. 2015. Engineering Motif Search for Large Graphs. In Proceedings of the 17th Meeting on Algorithm Engineering and Experiments (ALENEX). 104–118. https://doi.org/10.1137/1.9781611973754.10
[9] Thomas Bläsius, Tobias Friedrich, and Martin Schirneck. 2016. The Parameterized Complexity of Dependency Detection in Relational Databases. In Proceedings of the 11th International Symposium on Parameterized and Exact Computation (IPEC). 6:1–6:13.
https://doi.org/10.4230/LIPIcs.IPEC.2016.6
[10] Thomas Bläsius, Tobias Friedrich, Julius Lischeid, Kitty Meeks, and Martin Schirneck. 2019. Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling. In Proceedings of the 21st Meeting on Algorithm Engineering and Experiments (ALENEX). 130–143. https://doi.org/10.1137/1.9781611975499.11

[11] Endre Boros, Khaled M. Elbassioni, Vladimir Gurvich, and Leonid Khachiyan. 2000. An Efficient Incremental Algorithm for Generating All Maximal Independent Sets in Hypergraphs of Bounded Dimension. Parallel Processing Letters 10 (2000), 253–266. https://doi.org/10.1142/S0129626400000251
[12] Endre Boros, Khaled M. Elbassioni, Vladimir Gurvich, Leonid Khachiyan, and Kazuhisa Makino. 2002. Dual-Bounded Generating Problems: All Minimal Integer Solutions for a Monotone System of Linear Inequalities. SIAM J. Comput. 31 (2002), 1624–1643. https://doi.org/10.1137/S0097539701388768
[13] Endre Boros, Vladimir Gurvich, and Peter L. Hammer. 1998. Dual Subimplicants of Positive Boolean Functions. Optimization Methods and Software 10 (1998), 147–156. https://doi.org/10.1080/10556789808805708
[14] Katrin Casel, Henning Fernau, Mehdi Khosravian Ghadikolaei, Jérôme Monnot, and Florian Sikora. 2018. On the Complexity of Solution Extension of Optimization Problems. ArXiv e-prints (2018). arXiv:cs.CC/1810.04553 https://arxiv.org/abs/1810.04553
[15] Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. 2006. Strong Computational Lower Bounds via Parameterized Complexity. J. Comput. System Sci. 72 (2006), 1346–1367. https://doi.org/10.1016/j.jcss.2006.04.007
[16] Jianer Chen and Fenghui Zhang. 2006. On Product Covering in 3-Tier Supply Chain Models: Natural Complete Problems for W[3] and W[4]. Theoretical Computer Science 363 (2006), 278–288. https://doi.org/10.1016/j.tcs.2006.07.016
[17] Nadia Creignou, Arne Meier, Julian-Steffen Müller, Johannes Schmidt, and Heribert Vollmer. 2017. Paradigms for Parameterized Enumeration. Theory of Computing Systems 60 (2017), 737–758. https://doi.org/10.1007/s00224-016-9702-4
[18] Nadia Creignou, Frédéric Olive, and Johannes Schmidt. 2011. Enumerating All Solutions of a Boolean CSP by Non-decreasing Weight. In Proceedings of the 14th International Conference on Theory and Applications of Satisfiability Testing (SAT). 120–133.
https://doi.org/10.1007/978-3-642-21581-0_11
[19] Marek Cygan, Fedor V. Fomin, Łukasz Kowalik, Dániel Lokshtanov, Daniel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. 2015. Parameterized Algorithms. Springer, Heidelberg, Germany, and New York City, NY, USA.
[20] Christopher John Date. 2003. An Introduction to Database Systems (8th ed.). Addison-Wesley Longman Publishing, Boston, MA, USA.
[21] János Demetrovics and Vu Duc Thi. 1987. Keys, Antikeys and Prime Attributes. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae Sectio Computatorica 8 (1987), 35–52.
[22] Carlos Domingo, Nina Mishra, and Leonard Pitt. 1999. Efficient Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership Queries. Machine Learning 37 (1999), 89–110. https://doi.org/10.1023/A:1007627028578
[23] Rodney G. Downey and Michael R. Fellows. 2013. Fundamentals of Parameterized Complexity. Springer, Heidelberg, Germany, and London, UK. https://doi.org/10.1007/978-1-4471-5559-1
[24] Thomas Eiter. 1994. Exact Transversal Hypergraphs and Application to Boolean µ-Functions. Journal of Symbolic Computation 17 (1994), 215–225. https://doi.org/10.1006/jsco.1994.1013
[25] Thomas Eiter and Georg Gottlob. 1995. Identifying the Minimal Transversals of a Hypergraph and Related Problems. SIAM J. Comput. 24 (1995), 1278–1304. https://doi.org/10.1137/S0097539793250299
[26] Thomas Eiter, Georg Gottlob, and Kazuhisa Makino. 2003. New Results on Monotone Dualization and Generating Hypergraph Transversals. SIAM J. Comput. 32 (2003), 514–537. https://doi.org/10.1137/S009753970240639X
[27] Thomas Eiter, Kazuhisa Makino, and Georg Gottlob. 2008. Computational Aspects of Monotone Dualization: A Brief Survey. Discrete Applied Mathematics 156 (2008), 2035–2049. https://doi.org/10.1016/j.dam.2007.04.017
[28] Khaled M. Elbassioni, Matthias Hagen, and Imran Rauf. 2008. Some Fixed-Parameter Tractable Classes of Hypergraph Duality and Related Problems.
In Proceedings of the 3rd International Symposium on Parameterized and Exact Computation (IPEC). 91–102. https://doi.org/10.1007/978-3-540-79723-4_10
[29] Jörg Flum and Martin Grohe. 2006. Parameterized Complexity Theory. Springer, Heidelberg, Germany, and New York City, NY, USA.
[30] Fedor V. Fomin, Fabrizio Grandoni, Artem V. Pyatkin, and Alexey A. Stepanov. 2008. Combinatorial Bounds via Measure and Conquer: Bounding Minimal Dominating Sets and Applications. ACM Transactions on Algorithms 5 (2008), 9:1–9:17. https://doi.org/10.1145/1435375.1435384
[31] Michael L. Fredman and Leonid Khachiyan. 1996. On the Complexity of Dualization of Monotone Disjunctive Normal Forms. Journal of Algorithms 21 (1996), 618–628. https://doi.org/10.1006/jagm.1996.0062
[32] Vincent Froese, René van Bevern, Rolf Niedermeier, and Manuel Sorge. 2016. Exploiting Hidden Structure in Selecting Dimensions That Distinguish Vectors. J. Comput. System Sci. 82 (2016), 521–535. https://doi.org/10.1016/j.jcss.2015.11.011
[33] Andrew Gainer-Dewar and Paola Vera-Licona. 2017. The Minimal Hitting Set Generation Problem: Algorithms and Computation. SIAM Journal on Discrete Mathematics 31 (2017), 63–100. https://doi.org/10.1137/15M1055024
[34] Hector Garcia-Molina and Daniel Barbara. 1985. How to Assign Votes in a Distributed System. J. ACM 32 (1985), 841–860. https://doi.org/10.1145/4221.4223
[35] Chris Giannella and Catharine M. Wyss. 1999. Finding Minimal Keys in a Relation Instance. Online document: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7086.

[36] Goran Gogic, Christos H. Papadimitriou, and Martha Sideri. 1998. Incremental Recompilation of Knowledge. Journal of Artificial Intelligence Research 8 (1998), 23–37. https://doi.org/10.1613/jair.380
[37] Matthias Hagen. 2008. Algorithmic and Computational Complexity Issues of MONET. Cuvillier, Göttingen, Germany. https://cuvillier.de/de/shop/publications/1237-algorithmic-and-computational-complexity-issues-of-monet
[38] Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Anja Jentzsch, and Felix Naumann. 2013. Scalable Discovery of Unique Column Combinations. Proceedings of the VLDB Endowment 7 (2013), 301–312. https://doi.org/10.14778/2732240.2732248
[39] Russell Impagliazzo and Ramamohan Paturi. 2001. On the Complexity of k-SAT. J. Comput. System Sci. 62 (2001), 367–375. https://doi.org/10.1006/jcss.2000.1727
[40] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. 2001. Which Problems Have Strongly Exponential Complexity? J. Comput. System Sci. 63 (2001), 512–530. https://doi.org/10.1006/jcss.2001.1774
[41] David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. 1988. On Generating All Maximal Independent Sets. Inform. Process. Lett. 27 (1988), 119–123. https://doi.org/10.1016/0020-0190(88)90065-8
[42] Mamadou M. Kanté, Vincent Limouzy, Arnaud Mary, Lhouari Nourine, and Takeaki Uno. 2015. A Polynomial Delay Algorithm for Enumerating Minimal Dominating Sets in Chordal Graphs. In Proceedings of the 41st International Workshop on Graph-Theoretic Concepts in Computer Science (WG). 138–153. https://doi.org/10.1007/978-3-662-53174-7_11
[43] Leonid Khachiyan, Endre Boros, Khaled M. Elbassioni, and Vladimir Gurvich. 2007. A Global Parallel Algorithm for the Hypergraph Transversal Problem. Inform. Process. Lett. 101 (2007), 148–155. https://doi.org/10.1016/j.ipl.2006.09.006
[44] Benny Kimelfeld and Yehoshua Sagiv. 2008. Efficiently Enumerating Results of Keyword Search over Data Graphs.
Information Systems 33 (2008), 335–359. https://doi.org/10.1016/j.is.2008.01.002
[45] Sebastian Kruse, Thorsten Papenbrock, Hazar Harmouch, and Felix Naumann. 2016. Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin 39 (2016), 8–20. http://sites.computer.org/debull/A16june/p8.pdf
[46] Eugene L. Lawler. 1972. A Procedure for Computing the K Best Solutions to Discrete Optimization Problems and Its Application to the Shortest Path Problem. Management Science 18 (1972), 401–405. http://www.jstor.org/stable/2629357
[47] David Maier. 1983. The Theory of Relational Databases. Computer Science Press, Rockville, MD, USA.
[48] Heikki Mannila and Kari-Jouko Räihä. 1987. Dependency Inference. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB). 155–158. http://dl.acm.org/citation.cfm?id=645914.671482
[49] Arnaud Mary and Yann Strozecki. 2019. Efficient Enumeration of Solutions Produced by Closure Operations. Discrete Mathematics & Theoretical Computer Science 21 (2019). https://doi.org/10.23638/DMTCS-21-3-22
[50] Kitty Meeks. 2019. Randomised Enumeration of Small Witnesses Using a Decision Oracle. Algorithmica 81 (2019), 519–540. https://doi.org/10.1007/s00453-018-0404-y
[51] Keisuke Murakami and Takeaki Uno. 2014. Efficient Algorithms for Dualizing Large-Scale Hypergraphs. Discrete Applied Mathematics 170 (2014), 83–94. https://doi.org/10.1016/j.dam.2014.01.012
[52] Rolf Niedermeier. 2006. Invitation to Fixed-Parameter Algorithms. Oxford University Press, Oxford, UK.
[53] Øystein Ore. 1962. Theory of Graphs. Colloquium Publications - American Mathematical Society, Vol. 38. American Mathematical Society, Providence, RI, USA.
[54] Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment 8 (2015), 1082–1093.
https://doi.org/10.14778/2794367.2794377
[55] Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD). 821–833. https://doi.org/10.1145/2882903.2915203
[56] Thorsten Papenbrock and Felix Naumann. 2017. A Hybrid Approach for Efficient Unique Column Combination Discovery. In Proceedings of the 17th Datenbanksysteme für Business, Technologie und Web (BTW). 195–204. https://dl.gi.de/20.500.12116/628
[57] Ronald C. Read and Robert E. Tarjan. 1975. Bounds on Backtrack Algorithms for Listing Cycles, Paths, and Spanning Trees. Networks 5 (1975), 237–252. https://doi.org/10.1002/net.1975.5.3.237
[58] Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and Isao Shirakawa. 1977. A New Algorithm for Generating All the Maximal Independent Sets. SIAM J. Comput. 6 (1977), 505–517. https://doi.org/10.1137/0206036
[59] Joshua R. Wang and Ryan Williams. 2016. Exact Algorithms and Strong Exponential Time Hypothesis. In Encyclopedia of Algorithms, Ming-Yang Kao (Ed.). Springer, New York, NY, USA, 657–661. https://doi.org/10.1007/978-1-4939-2864-4_519
[60] Li-Pu Yeh, Biing-Feng Wang, and Hsin-Hao Su. 2010. Efficient Algorithms for the Problems of Enumerating Cuts by Non-decreasing Weights. Algorithmica 56 (2010), 297–312. https://doi.org/10.1007/s00453-009-9284-5