
Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling


THOMAS BLÄSIUS, Hasso Plattner Institute, University of Potsdam, Germany
TOBIAS FRIEDRICH, Hasso Plattner Institute, University of Potsdam, Germany
JULIUS LISCHEID, University of Cambridge, United Kingdom
KITTY MEEKS, University of Glasgow, United Kingdom
MARTIN SCHIRNECK, Hasso Plattner Institute, University of Potsdam, Germany

We devise a method to enumerate the inclusion-wise minimal hitting sets of a hypergraph. The algorithm has delay O(m^{k*+1} n^2) on n-vertex, m-edge hypergraphs, where k* is the rank of the transversal hypergraph, i.e., the cardinality of the largest minimal solution. In particular, on classes of hypergraphs for which k* is bounded, the delay is polynomial. The algorithm uses space linear in the input size only. The method solves the extension problem for minimal hitting sets as a subroutine. We show that this problem, parameterised by the cardinality of the set which is to be extended, is one of the first natural W[3]-complete problems. We give an algorithm for the subroutine that is optimal under the assumption that W[2] ≠ FPT or the exponential time hypothesis, respectively. Despite the hardness of the extension problem, we provide empirical evidence indicating that the enumeration outperforms its theoretical worst-case guarantee on hypergraphs arising in the profiling of relational databases, namely, in the detection of unique column combinations. Our analysis suggests that these hypergraphs exhibit structure that allows the subroutine to be fast on average. The code and data is available at hpi.de/friedrich/research/enumdat.

CCS Concepts: • Theory of computation → Backtracking; W hierarchy; • Information systems → Relational model; • General and reference → General literature.

Additional Key Words and Phrases: candidate keys, data profiling, enumeration algorithms, experimental algorithms, minimal hitting sets, transversal hypergraph, unique column combinations, W[3]-completeness.

ACKNOWLEDGMENTS
Kitty Meeks is supported by a Personal Research Fellowship from the Royal Society of Edinburgh, funded by the Scottish Government. The authors thank Felix Naumann and Thorsten Papenbrock for the many fruitful discussions about data profiling.

arXiv:1805.01310v2 [cs.DS] 12 Feb 2020

1 INTRODUCTION
A recurring computational task in the design and profiling of relational databases is the discovery of hidden dependencies between attributes. This metadata helps to organise the dataset and subsequently enables further cleansing and normalisation [2]. For example, unique column combinations (a.k.a. candidate keys) are subsets of attributes (columns) whose values completely identify every record (row) in the database. A minimal combination can thus serve as a small fingerprint for

Parts of this work have been presented at the 21st Meeting on Algorithm Engineering and Experiments (ALENEX 2019) [10].

This work originated while Julius Lischeid was a member of the Hasso Plattner Institute.
Authors' addresses: Thomas Bläsius, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany; Tobias Friedrich, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany; Julius Lischeid, University of Cambridge, Cambridge, United Kingdom; Kitty Meeks, University of Glasgow, School of Computing Science, Glasgow, United Kingdom; Martin Schirneck, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany.

the data as a whole, making it a primary key. Unfortunately, unique column combinations are equivalent to hitting sets (transversals), which renders their discovery both NP-hard and W[2]-hard when parameterised by their size [20, 32, 47]. Moreover, it is usually not enough to decide the existence of a single, isolated occurrence; instead, one is interested in compiling a comprehensive list of all dependencies [45]. One thus has to solve the transversal hypergraph problem. This problem comes in two variants, enumeration and recognition. To enumerate the transversal hypergraph means computing the list of all inclusion-wise minimal hitting sets of an input hypergraph. If necessary, the remaining non-minimal solutions can be produced by arbitrarily adding more vertices. In the recognition variant, one is given a pair of hypergraphs and has to decide whether one comprises exactly the minimal transversals of the other. The two variants are intimately connected. For any class of hypergraphs, there is an output-polynomial algorithm (incremental-polynomial even) for the enumeration variant if and only if the transversal hypergraph can be recognised in polynomial time for this class [7]. It is a long-standing open question whether this can be solved efficiently for arbitrary inputs.
The transversal hypergraph problem also emerges in many fields beyond data profiling such as artificial intelligence [36], machine learning [22], distributed systems [34], integer linear programming [12], and monotone logic [26]. While there is currently no efficient method for general inputs, it is worth exploring the characteristics of the concrete applications in order to find tractable cases and develop new techniques. For databases, there are two prominent traits. First, the largest minimal unique column combination is often significantly smaller than the total number of attributes. As an example, the call_a_bike database (see Section 5 for details) spans 16 columns and a hundred thousand rows; nevertheless, the largest minimal unique column combination is of size 4. This is consistent with the findings of other researchers, see also [45, 54]. Although one can expect the solutions to be small, there is generally no a priori guarantee on their maximum cardinality. One thus aims for an algorithm that is suitable for all hypergraphs and particularly fast on those with small transversals. Secondly, a typical use case for the enumeration of unique column combinations is to present them to a human user for inspection. This makes it possible to incorporate domain knowledge that might otherwise be inaccessible. The human-computer interaction sparks new algorithmic constraints: the first output should be available quickly and subsequent ones should follow in regular intervals. Even if the dependencies are compiled to feed them instead into another automated tool in a profiling pipeline, the overall process benefits from solutions arriving early. In both cases, an algorithm with bounded delay is preferred over a merely output-efficient one. We devise an algorithm for the enumeration of minimal hitting sets. We give a theoretical worst-case guarantee on its delay between consecutive outputs.
On hypergraphs for which the size of the largest minimal transversal is bounded, the guarantee is polynomial. Moreover, our experiments show that the enumeration is fast on hypergraphs arising in the discovery of unique column combinations in relational databases. Our enumeration method has a low memory footprint and is trivially parallelisable.

1.1 Contribution and Outline
Below, in Section 1.2, we discuss how our findings relate to the literature on the transversal hypergraph problem. Then, after reviewing some algorithmic and complexity-theoretic concepts and notation in Section 2, we give the general outline of our enumeration method in Section 3. The algorithm consists of a backtracking search that is heavily pruned by an extension oracle for minimal hitting sets. The underlying extension problem is discussed in detail in Section 4. It is known to be NP-complete in general, but efficiently solvable if the set to be extended is small [13]. In practical applications, it is surprisingly common to reduce the problem at hand to an NP-hard one, despite the complexity-theoretic ramifications, cf. [6]. In order to interpolate between the extremes of general hardness and tractable special cases of the extension problem, we employ techniques commonly dubbed as fixed-parameter tractability [23]. In particular, we prove that the problem is W[3]-complete when parameterised by the cardinality |X| of the set to be extended. To the best of our knowledge, there are currently only four natural, practically motivated problems known to have this property. The first one was given by Chen and Zhang [16] in the context of supply chain management; Bläsius, Friedrich, and Schirneck [9] added the discovery of inclusion dependencies in relational data. We prove here that the extension problem for minimal hitting sets belongs to this list. Finally, Casel et al. [14] recently applied our techniques to show that the extension problem is already W[3]-complete for the special case of minimal dominating sets in bipartite graphs. Beyond the W[3]-hardness, we also derive a lower bound based on the exponential time hypothesis, and give an algorithm for the extension oracle running in time O(|H|^{|X|+1} · |V|), matching this lower bound.
Combined with the backtracking technique, this gives a method to enumerate all minimal hitting sets in lexicographic order with delay O(|H|^{k*+1} · |V|^2), where k* denotes the rank of the transversal hypergraph. Beyond these theoretical bounds, we introduce different techniques for improving the run time in practice. The empirical evaluation in Section 5 shows that our algorithm is capable of efficiently enumerating all unique column combinations of real-world databases when given the hypergraph of difference sets as input. We in particular observe that the oracle calls (the only part of our algorithm with super-polynomial run time) are quite fast in practice. Compared to a simple brute-force enumeration, we achieve huge speed-ups, in particular for larger instances. This work is completed by some concluding remarks in Section 6.

1.2 Related Work
The topic of enumerating and recognising the transversal hypergraph efficiently is covered in numerous papers over the last three decades, both from a theoretical and an empirical standpoint. For a detailed overview, we direct the reader to the survey by Eiter, Makino, and Gottlob [27] and the textbook by Hagen [37]. Demetrovics and Thi [21] as well as Mannila and Räihä [48], independently, are usually credited with raising the issue of generating all hitting sets. The existence of an output-polynomial algorithm is unknown. However, Fredman and Khachiyan [31] gave a quasi-polynomial algorithm. The exponent of their worst-case run time bound has only a sub-logarithmic dependence on the combined input and output size. Besides this general-purpose procedure, several tractable special cases have been identified. Those cases are usually solved by exploiting structural properties such as bounds on the edge size or degree [11, 22], conformality [43], or acyclicity [25, 26]. There is an imbalance in the theoretical literature on the generation variant of the transversal hypergraph problem: most of the results focus on the properties of the input hypergraph instead of the output. A notable exception is the work of Eiter and Gottlob [25]. They show that the recognition problem is polynomial-time solvable if the rank of the transversal hypergraph is bounded. Via an equivalence result by Bioch and Ibaraki [7], this implies an incremental-polynomial algorithm for the enumeration of such hypergraphs. However, their algorithm necessarily holds the full output in memory during the computation. We improve upon this result by showing that there exists an enumeration method that has polynomial delay (if the transversal hypergraph has bounded rank), outputs the hitting sets in lexicographical order, and uses space linear in the size of the input only.
Our method is similar to the backtracking technique proposed by Elbassioni, Hagen, and Rauf [28], but achieves the same coverage of the search space with a less expensive branching scheme. Computing all solutions of a combinatorial problem in some natural order has been a point of interest beyond hitting sets. Lawler [46] was the first to study this topic rigorously. His ideas were subsequently developed into applications in areas as diverse as the enumeration of maximal independent sets [41], cuts in a graph [60], answers to keyword queries [44], and models of Boolean constraint satisfaction problems [18], respectively ranked by some weight function. In the theoretical analysis of our enumeration algorithm, we use tools from fixed-parameter tractability. Although they were originally developed to understand decision problems [19, 23, 52], in recent years, these tools proved to be very effective for enumeration as well. Creignou et al. [17] give a general overview of the field of parameterised enumeration. The parameterised complexity of extension problems in particular has been treated by Meeks [50] and Casel et al. [14]. In recent years, the conjectured hardness of the Boolean satisfiability problem encapsulated in the exponential time hypothesis [39, 40] (and several stronger assumptions) has become a fruitful source of conditional lower bounds on the complexity of computational problems in the exponential [59], polynomial [1], as well as the parameterised domain [19]. Complementing the theoretical point of view, there are several surveys reviewing the practical behaviour of enumeration algorithms for the transversal hypergraph. Murakami and Uno [51], for example, compared several methods on artificial and real-world instances, mostly from frequent itemset mining, highlighting different use cases.
Gainer-Dewar and Vera-Licona [33] conducted an exhaustive study involving more than 20 algorithms on hypergraphs stemming from various real-world applications. However, it seems that there is no empirical work focussing on the structures arising in data profiling, although the connection to the hitting set problem is well-known also in practice [2, 20]. There are some articles on the performance of discovery algorithms for unique column combinations, but they do not discuss the topic in the context of hypergraphs or hitting sets. One example is the presentation of the DUCC algorithm by Heise et al. [38], where they test their algorithm on the ncvoter_allc and uniprot datasets. These are databases we also use in our own evaluation in Section 5. However, the running times reported here are not directly comparable since we are only focussing on the enumeration part and ignore the extraction of minimal difference sets from the database. Notwithstanding, it is worth noting that Heise et al. were unable to scale their algorithm beyond 60 columns for the ncvoter_allc dataset due to a timeout after 10 000 s (2 h 47 min), while our enumeration method can handle the full 88 columns. The more recent HyUCC algorithm by Papenbrock and Naumann [56], which is based on HyFD for functional dependencies [55], also has some overlap with our work regarding the testing datasets. Comparing their results with our findings suggests that the run times of our implementation, working on arbitrary hypergraphs, are at least competitive with the current domain-specific state of the art. This is particularly interesting since they report the space consumption as a major bottleneck of HyUCC, which is not an issue for our algorithm.

2 PRELIMINARIES
2.1 Hypergraphs and Hitting Sets
A hypergraph is a finite vertex set V together with a system of subsets H ⊆ P(V), the (hyper-)edges. We usually identify a hypergraph with its edges H. Unlike some authors, e.g. Berge [5], we do not exclude special cases of this definition like the empty graph, H = ∅, an empty edge, ∅ ∈ H, or isolated vertices, V ⊋ ⋃_{E∈H} E. However, we do assume that there is always at least one vertex, V ≠ ∅. The symbol n = |V| denotes the number of vertices, and m = |H| the number of edges. The rank of a hypergraph H is the maximum cardinality of its edges, rank(H) = max_{E∈H} |E|. H is a Sperner hypergraph if no edge is contained in another. The minimisation of H is the subsystem of inclusion-wise minimal edges, min(H) = {E ∈ H | ∀E′ ∈ H : E′ ⊆ E ⇒ E′ = E}. Obviously, min(H) is a Sperner hypergraph for any H.

A transversal or hitting set for H is a subset H ⊆ V of vertices such that H has a non-empty intersection with every edge E ∈ H. A transversal is (inclusion-wise) minimal if it does not properly contain any other transversal. The minimal hitting sets of H form a Sperner hypergraph on the vertex set V, the transversal hypergraph Tr(H). Regarding the transversals, it does not make a difference whether the full hypergraph H is considered or its minimisation, Tr(min(H)) = Tr(H). We denote the number of minimal transversals by Nmin = |Tr(H)|. In the proofs below, we make extensive use of the following well-known minimality criterion.

Observation 2.1 ([5, 53]). Let (V, H) be a hypergraph and H ⊆ V a hitting set for H. H is minimal if and only if for every x ∈ H, there is an edge Ex ∈ H such that Ex ∩ H = {x}.¹

The edge Ex above is a witness for x. If some element of H has no witness, it is expendable. Any total ordering ≽ of the vertex set V induces a lexicographical order on P(V).
We say a subset S ⊆ V is lexicographically smaller than another subset T, whenever the ≽-first element in which S and T differ is in S, cf. [41]. More formally, S ≼lex T iff S = T or min≽(S △ T) ∈ S. Here, S △ T = (S\T) ∪ (T\S) denotes the symmetrical difference. We call a hypergraph on a totally ordered vertex set an ordered hypergraph.
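The witness criterion of Observation 2.1 lends itself to a direct computational check. The following Python sketch (all function and variable names are ours, not taken from the paper's implementation) tests minimality by scanning for a witness edge for every vertex of a candidate hitting set:

```python
# Brute-force check of the witness criterion (Observation 2.1).
# All names here are illustrative, not from the paper's code.

def is_minimal_hitting_set(edges, H):
    """True iff H hits every edge and every x in H has a witness
    edge Ex with Ex ∩ H = {x}; otherwise some x is expendable."""
    H = set(H)
    if not all(H & set(E) for E in edges):
        return False  # H is not even a hitting set
    return all(any(H & set(E) == {x} for E in edges) for x in H)

edges = [{1, 2}, {2, 3}, {3, 4}]
assert is_minimal_hitting_set(edges, {2, 3})         # witnesses {1,2} and {3,4}
assert not is_minimal_hitting_set(edges, {2, 3, 4})  # 3 and 4 lack witnesses
```

Running this check on all subsets of V reproduces Tr(H), albeit in exponential time; the point of the algorithm developed below is precisely to avoid this blow-up.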

2.2 Relational Databases and Unique Column Combinations
To describe relational data, we fix a finite set R, the (relational) schema; its elements are the attributes or columns. Records or rows are tuples whose entries are indexed by R. We use the symbols ri and rj for them. For any subset X ⊆ R of columns, we let ri[X] denote the subtuple of ri consisting only of the entries indexed by X. A finite set r of rows is a relation or relational database. Given some relation r, a set X of columns is a unique column combination (UCC), or simply unique, if for any two distinct tuples ri, rj in r, we have ri[X] ≠ rj[X]. That means, the combination of values in the entries for X fully identifies any tuple of the relation r. Otherwise, set X is non-unique. Any superset of a unique is unique and any subset of a non-unique is again non-unique. A UCC is (inclusion-wise) minimal if it does not contain another UCC. There is a one-to-one correspondence between UCCs and transversals; see the beginning of Section 5 for more details.
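The definition of uniqueness can be tested directly by comparing subtuples. The sketch below (rows modelled as dictionaries, attribute names illustrative) is a naive check of the definition, not the paper's method of extracting difference sets:

```python
# Test whether a column combination X is unique in a relation,
# straight from the definition: the subtuples r[X] must be pairwise
# distinct. Rows are dicts indexed by the schema; names are ours.

def is_unique(rows, X):
    seen = set()
    for r in rows:
        key = tuple(r[c] for c in sorted(X))  # the subtuple r[X]
        if key in seen:
            return False  # two rows agree on X, so X is non-unique
        seen.add(key)
    return True

rows = [
    {"id": 1, "name": "ada", "city": "london"},
    {"id": 2, "name": "ada", "city": "paris"},
]
assert is_unique(rows, {"id"})
assert is_unique(rows, {"name", "city"})
assert not is_unique(rows, {"name"})
```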

2.3 Parameterised Complexity and the Exponential Time Hypothesis
Parameterised complexity theory has proved to be a powerful tool to tackle computationally hard problems, for an overview see the textbooks [19, 23, 52]. We generally consider all decision problems to be parameterised. Given an instance I to such a problem and a parameter k ∈ N+, (I, k) is then an instance of the corresponding parameterised problem. A parameterised problem is fixed-parameter tractable (FPT) if there exists a polynomial p and a computable function f such that any instance can be solved in time f(k)·p(|I|). The class of all fixed-parameter tractable problems is denoted by FPT. Any algorithm adhering to such a worst-case bound is said to run in FPT-time. The intuition is that if the parameter is bounded by a constant, then the running time of that algorithm is polynomial, and the degree of the polynomial is independent of k. Let Π and Π′ be two parameterised problems. A parameterised reduction from Π to Π′ is a function computable in FPT-time that maps an instance (I, k) of Π to an equivalent instance (I′, k′) of Π′ such that the parameter k′ depends on k but not on |I|, meaning there is a computable function g such that k′ ≤ g(k). If there is also a parameterised reduction in the opposite direction, Π and Π′ are said to be FPT-equivalent. We mainly use a restricted form of reducibility, namely, polynomial reductions that preserve the parameter. These are reductions between parameterised problems computable in time p(|I|), independently of the parameter k, such that the resulting instance of Π′ has parameter value k′ = k. This way, both the parameterised and the classical (polynomial) complexity is transferred. Polynomial parameter-preserving reductions form a subclass of linear FPT-reductions, i.e., parameterised reductions with k′ = O(k), as introduced by Chen et al. [15].

¹ Witnesses are sometimes also called critical hyperedges [51].
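To make the notion of FPT-time concrete, consider the rank-2 special case of hitting set, i.e., vertex cover: a bounded search tree that branches on the two endpoints of an uncovered edge decides the existence of a cover of size at most k in O(2^k · m) time, a bound of the form f(k)·p(|I|). This example is ours, added for illustration only; it is not used in the paper.

```python
# Bounded search tree for vertex cover (hitting set of a rank-2
# hypergraph): an illustration of an f(k)·poly(|I|) FPT algorithm.
# Branch on the two endpoints of any uncovered edge.

def has_vertex_cover(edges, k):
    edge = next(iter(edges), None)
    if edge is None:
        return True   # every edge is covered
    if k == 0:
        return False  # an edge remains but no budget is left
    u, v = tuple(edge)
    return (has_vertex_cover([E for E in edges if u not in E], k - 1)
            or has_vertex_cover([E for E in edges if v not in E], k - 1))

# The path 1-2-3-4 has a vertex cover of size 2 ({2, 3}) but none of size 1.
edges = [{1, 2}, {2, 3}, {3, 4}]
assert has_vertex_cover(edges, 2)
assert not has_vertex_cover(edges, 1)
```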

The notion of parameterised reduction gives rise to a hierarchy of complexity classes, the W-hierarchy. It is defined in terms of Boolean circuits, i.e., directed acyclic graphs whose vertex set consists of input nodes, NOT-, AND-, and OR-gates (with the obvious semantics) as well as a single output node. AND- and OR-gates have potentially unbounded fan-in. The depth of a circuit is the maximum length of a path from an input to the output node. The weft is the maximum number of large gates, i.e., gates with fan-in larger than 2, on any path. The Weighted Circuit Satisfiability problem is to decide whether a circuit has a satisfying assignment of weight k, i.e., with k inputs set to true, where k is taken as the parameter. For every positive integer t, W[t] is the class of all parameterised problems that admit a parameterised reduction to Weighted Circuit Satisfiability restricted to circuits of constant depth and weft at most t. The classes FPT ⊆ W[1] ⊆ W[2] ⊆ ... form an ascending chain. All inclusions are believed to be proper, which is however still unproven [23]. The higher a problem ranks in the hierarchy, the lower we consider the chances of finding an FPT-algorithm to solve it. As an example, the problem to decide the existence of a hitting set/UCC of cardinality (at most) k, parameterised by k, is a classical W[2]-complete problem [19, 32]. The W-hierarchy in turn is contained in the class (uniform) XP consisting of all problems that allow for an algorithm running in time O(|I|^{f(k)}). Again, if k = O(1), the running time is polynomial, but here the degree depends on k. It is known that the classes FPT ⊊ XP are different [23], unconditionally. Another source of conditional lower bounds is the exponential time hypothesis (ETH) by Impagliazzo, Paturi, and Zane [39, 40]. They conjectured that there is no algorithm solving 3-SAT on Boolean formulas with n variables in time 2^{o(n)}.
Sub-exponential reductions transfer this lower bound to several other computational problems, including parameterised ones, cf. [19].
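The XP bound O(|I|^{f(k)}) can be illustrated on the W[2]-complete k-hitting-set problem mentioned above: exhaustively testing all size-k vertex subsets takes n^{O(k)} time, polynomial for every fixed k but with the degree of the polynomial growing with k. The sketch below is our illustration, not part of the paper.

```python
# XP-style brute force for k-hitting-set: try all C(n, k) candidate
# sets. The n^O(k) running time is polynomial for each fixed k, but
# the degree of the polynomial depends on k. Illustrative only.
from itertools import combinations

def has_hitting_set_of_size(vertices, edges, k):
    return any(all(set(c) & E for E in edges)
               for c in combinations(vertices, k))

edges = [{1, 2}, {2, 3}, {3, 4}]
assert has_hitting_set_of_size([1, 2, 3, 4], edges, 2)   # e.g. {2, 3}
assert not has_hitting_set_of_size([1, 2, 3, 4], edges, 1)
```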

2.4 Enumeration Complexity
In this article, decision problems are only a means to analyse enumeration algorithms. Enumeration is the process of outputting all solutions to a problem without repetition. In the case of hitting sets, there exists a class of hypergraphs such that the number Nmin of minimal transversals grows exponentially in both n = |V| and m = |H| [7]. This rules out any polynomial-time algorithm for the enumeration problem. Instead, one could ask for an output-polynomial algorithm running in time polynomial in both the input and output size. That means, there exists a polynomial p such that the enumeration terminates within p(n, m, Nmin) steps. A stronger requirement is an incremental polynomial algorithm, generating the solutions in such a way that the i-th delay, the time between the (i−1)-st and i-th output, is polynomial in the input size and in i. This includes the preprocessing time until the first solution arrives as well as the postprocessing time between the last solution and termination. The strongest form of output-efficiency is that of polynomial delay, where the delay is universally bounded by a polynomial in the input size only. One can also restrict the memory consumption; ideally, the algorithm uses only space polynomial in the input.
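The exponential blow-up invoked above already occurs on a very simple family: the hypergraph consisting of m pairwise disjoint two-element edges has exactly 2^m minimal transversals, one for each way of picking an endpoint per edge. A brute-force check of this count (our illustration, not from the paper):

```python
# The hypergraph {{a_i, b_i} : i < m} of m disjoint 2-element edges
# has exactly 2^m minimal hitting sets (choose one endpoint per edge),
# so Nmin can be exponential in both n and m. Brute-force check.
from itertools import combinations

def minimal_hitting_sets(vertices, edges):
    hs = [set(c) for r in range(len(vertices) + 1)
          for c in combinations(vertices, r)
          if all(set(c) & E for E in edges)]
    # keep only the inclusion-wise minimal hitting sets
    return [H for H in hs if not any(G < H for G in hs)]

m = 3
vertices = [f"a{i}" for i in range(m)] + [f"b{i}" for i in range(m)]
edges = [{f"a{i}", f"b{i}"} for i in range(m)]
assert len(minimal_hitting_sets(vertices, edges)) == 2 ** m  # 8 solutions
```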

3 BACKTRACKING ENUMERATION USING A DECISION TREE
It is a common pattern in the design of enumeration algorithms to base them on a so-called extension oracle or, equivalently, an interval oracle [8, 57]. The technique was conceived in the early 70s by Lawler [46] to compute the K best solutions of a combinatorial problem (ranked by some cost measure). It can easily be adapted to listing all solutions and is now a standard technique that produced breakthrough results like a polynomial-delay algorithm for maximal independent sets [41, 58]. Recently, it has enjoyed some renewed interest for problems with certain closure properties [42, 49]. In newer works, oracle enumeration has been combined with techniques from parameterized complexity [50] and algorithms that are highly engineered for their application domain [8]. There is also an algorithm for the transversal hypergraph problem based on oracles [28].

In the general setting, the extension oracle for some combinatorial problem is given a collection of elements of some underlying universe and decides the existence of a solution containing these elements. Lawler's original technique involved fixing certain elements in a partial solution and then extending it to the optimum among all objects sharing the fixed elements. During the computation the new candidates are stored in a priority queue and are treated in the order of increasing cost. The main bottleneck is the space demand of the queue. For every partial solution, the number of newly introduced candidates can be equal to the size of the universe, resulting in an exponential growth. Thus, modifications are necessary for practical use. We make this concrete for the transversal hypergraph problem, where the goal is to enumerate all inclusion-wise minimal hitting sets. Here, the extension oracle receives a collection of vertices of the input hypergraph and answers the question whether there exists a minimal hitting set containing those vertices (and possibly avoiding some other excluded set). This naturally leads to the following parameterised decision problem.

Extension MinHS
Input: A hypergraph (V, H) and disjoint sets X, Y ⊆ V, X ∩ Y = ∅.
Output: true iff there is a minimal hitting set H ∈ Tr(H) with X ⊆ H ⊆ V \ Y.
Parameter: |X|.

The extension problem is known to be NP-complete, but polynomial-time solvable if the cardinality |X| of the set to be extended is bounded by a constant [13]. This makes |X| a prime choice as a parameter for this problem. Moreover, the design of our enumeration algorithm as described below will ensure that we only need to solve the problem in cases where |X| is smaller than the size of the largest minimal solution. It has been observed that the latter quantity is comparatively small in hypergraphs arising in data profiling, cf. [45, 54]. For our own findings in that regard, see Section 5. We now describe how the oracle can be used to enumerate minimal hitting sets. Later in Section 4, we give an algorithm implementing the oracle. To avoid the space demand of the priority queue, we instead use the extension oracle to prune a decision tree. This is known in the literature as the backtracking method [57] or flashlight technique [49]. Given a pair (X, Y) of disjoint sets of vertices, we want to enumerate the minimal hitting sets that comprise all of X but contain no vertex from Y. If X ∪ Y = V, this can only be X itself. Otherwise, we recursively compute the solutions given by the pairs (X ∪ {v}, Y) and (X, Y ∪ {v}), where v is some vertex not previously contained in X or Y. This recursion leads to the following binary decision tree. Every node in the tree is labelled with the respective pair (X, Y); every level in the tree corresponds to some new vertex v. The algorithm branches, in each node (X, Y) of level v, on the decision whether to add v to X or to Y, in the latter case excluding it from the future search. This is where the vertex order (V, ≽) comes into play. By always choosing v as the ≽-minimal vertex of V \ (X ∪ Y), we reach a universal branching order without the need of additional communication between the nodes. In the absence of pruning this produces the full binary tree with leaves (X, V \ X) for every possible subset X ⊆ V.
However, we only need to enter a subtree if one of its leaves contains a minimal transversal. If the root of the subtree is (X, Y), this is the case iff X is extendable to a minimal hitting set without Y. Therefore, the oracle tells us for every node whether we need to expand it. When given access to the ordered hypergraph (V, ≽, H), Algorithm 1 recursively traverses the resulting decision tree in a depth-first, pre-order manner. Subroutine extendable(X, Y) needs to implement the extension oracle. The procedure enumerate receives the sets X and Y as well as the remaining vertices R = V \ (X ∪ Y) as input and checks whether a leaf has been reached. If not,

R is non-empty and the algorithm branches on the assignment of the vertex min≽ R. The initial call of the recursion is enumerate(∅, ∅, V).

ALGORITHM 1: Recursive algorithm for the transversal hypergraph problem. The initial call is enumerate(∅, ∅, V).
Data: Ordered hypergraph (V, ≽, H).
Input: Partition (X, Y, R) of the vertex set V.
Output: All minimal hitting sets H ∈ Tr(H) such that X ⊆ H ⊆ V \ Y.

1 Procedure enumerate(X, Y, R):
2     if R = ∅ then return X;
3     v ← min≽ R;
4     if extendable(X ∪ {v}, Y) then enumerate(X ∪ {v}, Y, R \ {v});
5     if extendable(X, Y ∪ {v}) then enumerate(X, Y ∪ {v}, R \ {v});
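For concreteness, the recursion of Algorithm 1 can be transcribed into Python as follows. The subroutine extendable is realised here by a naive brute-force oracle that tries all supersets; the paper's actual oracle (developed in Section 4) is far more efficient, and all names in this sketch are ours.

```python
# A direct Python transcription of Algorithm 1. The `extendable`
# subroutine is a naive brute-force oracle, standing in for the
# refined oracle of Section 4; names are illustrative.
from itertools import combinations

def enumerate_minimal_hitting_sets(vertices, edges):
    """Yield all minimal hitting sets in the lexicographic order
    induced by the order of `vertices`."""
    def is_minimal_hs(H):
        return (all(H & E for E in edges)
                and all(any(H & E == {x} for E in edges) for x in H))

    def extendable(X, Y):
        # is there a minimal hitting set H with X ⊆ H ⊆ V \ Y?
        free = [v for v in vertices if v not in X and v not in Y]
        return any(is_minimal_hs(X | set(c))
                   for r in range(len(free) + 1)
                   for c in combinations(free, r))

    def rec(X, Y, R):
        if not R:            # leaf reached: X is a minimal hitting set
            yield X
            return
        v, rest = R[0], R[1:]  # the ≽-minimal remaining vertex
        if extendable(X | {v}, Y):
            yield from rec(X | {v}, Y, rest)
        if extendable(X, Y | {v}):
            yield from rec(X, Y | {v}, rest)

    yield from rec(set(), set(), list(vertices))

edges = [{1, 2}, {2, 3}, {3, 4}]
assert list(enumerate_minimal_hitting_sets([1, 2, 3, 4], edges)) == \
    [{1, 3}, {2, 3}, {2, 4}]
```

On this example the solutions indeed appear in the lexicographic order induced by the vertex order, as Lemma 3.1 below guarantees.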

Lemma 3.1. If subroutine extendable solves Extension MinHS, then Algorithm 1 on input (V, ≽, H) enumerates Tr(H) (i.e., it outputs every minimal hitting set of H exactly once). Moreover, the edges of Tr(H) are output in the lexicographical order induced by ≽.

Proof. The correctness is almost immediate from the above discussion. Note additionally that in the extreme case of H having not a single hitting set (i.e., ∅ ∈ H) both checks in lines 4 and 5 fail already in the root node. The algorithm then returns immediately without an output. Here, we use the assumption that there is at least one vertex and thus R = V ≠ ∅ holds in the root. We are left to prove that the algorithm outputs the solutions in lexicographical order. Let a, b be two leaves with a being left of b in the tree so that the algorithm visits a first. Suppose they are labelled (Xa, Ya) and (Xb, Yb), respectively. Further, let vertex v be such that the lowest common ancestor of a and b branches on the assignment of v, i.e., v = min≽(Xa △ Xb). Algorithm 1 first adds v to the current partial solution X (left child, line 4). Hence, v is contained in Xa and Xa ≼lex Xb. □

Algorithm 1 is similar to the backtracking method by Elbassioni, Hagen, and Rauf [28, Figure 1]. The main difference between the two is the search for new solutions. In our algorithm, the nodes in the decision tree maintain a set Y of vertices that have already been excluded, besides the partial solution X. The somewhat arbitrarily chosen vertex v is added to one of the two sets depending on whether X or X ∪ {v} is extendable. In contrast, the algorithm in [28] works only on the partial solution itself, there denoted by S, and explicitly searches for a new vertex extending S, which is computationally expensive. Also, their check whether S already constitutes a minimal transversal is redundant. This information can be obtained as a by-product of a careful implementation of the oracle subroutine as we show in Lemma 4.11 below.
The lexicographic order of the output can be useful in the application domain. For example, in the context of databases, we can apply it to list "interesting" UCCs first. Suppose the attributes are ranked by importance, then the enumeration naturally starts with those UCCs that contain many important attributes. On the other hand, the order raises some complexity-theoretic issues. Computing the lexicographically smallest minimal hitting set is known to be NP-hard [24, 41]. Therefore, Algorithm 1 having polynomial delay on all ordered hypergraphs would imply P = NP. We verify below that we achieve polynomial delay at least on a restricted class of inputs. Additionally, it is an interesting question to evaluate the impact of the order on the empirical run time on real-world databases, see Section 5.3.

4 THE EXTENSION PROBLEM FOR MINIMAL HITTING SETS
We have seen how an extension oracle for minimal hitting sets can be utilised to enumerate the transversal hypergraph. In this section, we discuss some complexity-theoretic aspects of the underlying decision problem, which in turn leads to an optimal algorithm for the oracle. We first show that Extension MinHS is W[3]-complete. In fact, it is one of the first natural parameterised problems having this property. From this, we derive some lower bounds on the run time of any algorithm for the extension problem, conditional on the assumption that certain collapses in the W-hierarchy do not occur. We then give an algorithm matching these lower bounds and describe a strategy to improve the run time behaviour of the oracle when combining it with the decision tree for the enumeration. Finally, we derive a bound on the delay of Algorithm 1.

4.1 Hardness of the Extension Problem
Boros, Gurvich, and Hammer showed that Extension MinHS is NP-complete [13]. In the same article, they reduced it to a certain covering problem in hypergraphs (see below for a formal definition). We extend this result by proving that the extension and the covering problem are in fact equivalent under both polynomial and parameterised reductions. We use this equivalence to establish the W[3]-completeness of Extension MinHS when parameterised by the cardinality |X| of the set that is to be extended. To this end, we define the parameterised variant of the covering problem, which we call the Multicoloured Independent Family problem. It formalises the following computational task: given k lists of sets (each list representing a colour) as well as an additional collection of forbidden sets, one has to select one set of each colour such that their union does not completely cover any forbidden set.

Multicoloured Independent Family

Input: A (k+1)-tuple (S1, ..., Sk, T) of finite systems of finite sets.
Output: true iff there are sets S1, ..., Sk with Si ∈ Si such that for all T ∈ T: T ⊈ S1 ∪ · · · ∪ Sk.
Parameter: k.

It will be convenient to also have a single-coloured variant of the problem featuring a single list.

Independent Family

Input: Two finite systems S, T of finite sets and a positive integer k.
Output: true iff there are k distinct sets S1, ..., Sk ∈ S such that for all T ∈ T: T ⊈ S1 ∪ · · · ∪ Sk.
Parameter: k.

Both problems generalise the Independent Set problem in hypergraphs, which asks for k vertices such that they do not cover any hyperedge, parameterised by k, cf. [19, 29]. For Independent Family, we select sets of vertices instead such that their union is independent w.r.t. hypergraph T. Independent Set is then the special case of system S consisting entirely of singletons. S1, ..., Sk, S, and T each fit the definition of a hypergraph (over vertex set U). However, we use the term set system in this context to avoid confusion with the other hypergraphs considered here. Recall that set Y in the formulation of Extension MinHS contains the vertices excluded from the search. We first state an easy observation implying that those vertices can be safely dropped from the hypergraph, which simplifies some of the proofs below. The two lemmas that follow then establish the equivalence of both variants of the covering problem with the extension problem.

Observation 4.1. Let (V, H) be a hypergraph and Y, T ⊆ V, Y ∩ T = ∅, two disjoint sets of vertices. T is a transversal of H if and only if it is one of the truncated hypergraph (V \ Y, {E \ Y | E ∈ H}).

Lemma 4.2. Extension MinHS and Multicoloured Independent Family are equivalent under polynomial reductions that preserve the parameter. In particular, they are FPT-equivalent.

Proof. The existence of a reduction from Extension MinHS to Multicoloured Independent Family has been proven in [13]. We reprove this part both for completeness and for later use in this paper. By Observation 4.1, we can assume the set Y to be empty without losing generality. Consequently, we drop Y from the notation. Let (V, H, X) be an instance of Extension MinHS. First, suppose X is empty. Then, the reduction outputs some fixed true-instance of Multicoloured Independent Family iff ∅ ∉ H; otherwise, a false-instance. In the following, we assume X ≠ ∅. Recall the definition of witnesses for minimal transversals. If hitting set H is minimal, every x ∈ H has an edge E ∈ H such that E ∩ H = {x} holds. So for X to have an extension to a minimal solution, there must be edges E observing E ∩ X = {x} for every x ∈ X. (For the latter, we take the intersection with X instead of H.) If some element x has no such edge, a false-instance of Multicoloured Independent Family is returned. Otherwise, the set systems Sx = {E ∈ H | E ∩ X = {x}} of potential witnesses are non-empty for all x. Next, we characterise the conditions under which a selection of potential witnesses actually implies the existence of a minimal extension of X. We collect in system T those edges of H that share no element with X. Edges that intersect X in more than one vertex can be cast aside. We claim that X can be extended to a minimal transversal of H if and only if there is a selection (Ex)x∈X of potential witnesses, Ex ∈ Sx, such that T ⊈ ⋃x∈X Ex holds for all T ∈ T. This is the case iff the instance ((Sx)x∈X, T) of Multicoloured Independent Family evaluates to true. Note that the reduction trivially preserves the parameter k = |X|. To prove the claim, let H ⊇ X be a minimal hitting set.
There exists a witness Ex ∈ H for every x ∈ H; if x is contained in the subset X, Ex is in fact a member of Sx. Define Z = V \ ⋃x∈X (Ex \ {x}). Since H intersects each Ex in x only, we have Z ⊇ H. This makes Z itself a hitting set (not necessarily a minimal one). Recall that the set system T ⊆ H contains the edges disjoint from X. If one of those forbidden sets were contained in the union ⋃x∈X Ex, it would also be contained in ⋃x∈X (Ex \ {x}). Z would thus not hit that edge of H, a contradiction. Therefore, selection (Ex)x∈X solves the instance ((Sx)x∈X, T). Conversely, suppose (Ex)x∈X, Ex ∈ Sx, is an admissible selection such that no member of T is covered by ⋃x∈X Ex. Then, Z = V \ ⋃x∈X (Ex \ {x}) = (V \ ⋃x∈X Ex) ∪ X is a transversal: any edge of H that has a non-empty intersection with X is trivially hit; the remaining ones (in T, if any) all have some element that is not contained in any of the edges Ex by assumption. Let H ⊆ Z be any minimal transversal. H must still contain all of X, otherwise some Ex is not hit. We now treat the converse reduction from Multicoloured Independent Family to Extension MinHS. First, observe that the above transformation remains valid if we redefine the set systems Sx via punctured edges instead, i.e., Sx = {E \ {x} | E ∩ X = {x}}. To complete the proof, it suffices to demonstrate that this modified reduction is capable of producing any given instance J = (S1, ..., Sk, T) of the Multicoloured Independent Family problem. To that end, let x1 to xk be k objects not previously contained in the universe U, i.e., the union of all sets occurring in S1, ..., Sk and T. We construct a preimage instance I = (V, H, X) of Extension MinHS. Set X = {x1, ..., xk} and take V = U ∪ X as the vertex set. The edges of H consist of the system T together with all sets of the form Si ∪ {xi} for Si ∈ Si. Clearly, if the modified reduction is applied to instance I, the output is J. □
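The converse reduction can be sanity-checked mechanically. The following Python sketch (our own, illustrative encoding: the fresh colour vertices are tagged `('x', i)`) builds the Extension MinHS instance and compares both sides by brute force on a tiny example.

```python
from itertools import combinations, product

def mif_to_extension(S_list, T_list):
    """Converse reduction of Lemma 4.2: encode an instance
    (S_1, ..., S_k, T) of Multicoloured Independent Family as an
    instance (V, H, X) of Extension MinHS."""
    X = {('x', i) for i in range(len(S_list))}
    H = [frozenset(T) for T in T_list]
    H += [frozenset(S) | {('x', i)} for i, Si in enumerate(S_list) for S in Si]
    V = X | {u for E in H for u in E}
    return V, H, X

def mif_brute(S_list, T_list):
    """Ground truth: some choice of one set per colour covers no forbidden set."""
    return any(all(not set(T) <= set().union(*map(set, sel)) for T in T_list)
               for sel in product(*S_list))

def extendable_brute(V, H, X):
    """Ground truth: some minimal hitting set of H contains all of X."""
    def hits(S):
        return all(S & set(E) for E in H)
    for r in range(len(V) + 1):
        for c in combinations(list(V), r):
            S = set(c)
            if X <= S and hits(S) and all(not hits(S - {v}) for v in S):
                return True
    return False
```

For example, with S1 = [{1,2}, {3}], S2 = [{2,3}] and the forbidden set {1,3}, choosing {3} and {2,3} leaves element 1 uncovered, and accordingly X extends to a minimal hitting set of the constructed hypergraph; with forbidden set {3} every choice is covered and the extension instance is a no-instance.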

The proof of the above lemma gives a characterisation of partial solutions that are extendable to minimal hitting sets. The result is of independent interest; it appears implicitly also in [13].

Corollary 4.3. Let (V, H) be a hypergraph and X ⊆ V a set of vertices. There exists a minimal transversal T ∈ Tr(H) with T ⊇ X if and only if there is a family {Ex}x∈X ⊆ H of edges such that (i) ∀x ∈ X: Ex ∩ X = {x} and (ii) ∀E ∈ H: E ⊆ ⋃x∈X Ex ⇒ E ∩ X ≠ ∅ both hold.

Fig. 1. Illustration of the proof of Lemma 4.5. On the left side is an instance of Independent Family, on the right is the resulting circuit of weft 3. All edges are directed downwards. Selecting {a, c}, {c, e}, and {c, g} from S solves the instance for parameter k = 3; any other combination of three sets covers a member of T.

In the next step, we show that the multi- and single-coloured variants of the independent family problem are essentially the same, both from a classical and a parameterised point of view.

Lemma 4.4. Multicoloured Independent Family and Independent Family are equivalent under polynomial reductions that preserve the parameter.

Proof. In order to reduce Multicoloured Independent Family to its single-coloured variant, it is enough to enforce that selecting two sets of the same colour is never a correct solution. Hence, they must always cover some forbidden set. Let (S1, ..., Sk, T) be an instance of Multicoloured Independent Family. For every 1 ≤ i ≤ k and S ∈ Si, we introduce a new element x[S,i]. The sets are augmented with their respective elements, S ∪ {x[S,i]}, and the results are collected in the single system S. Adding the pair {x[S,i], x[S′,i]} to T for each i and S ≠ S′ ∈ Si invalidates all unwanted selections. It is easy to check that no solution of the original instance is destroyed. For the other direction, we make k copies of system S and ensure that no two copies of the same set are selected together. To that end, we introduce element x[S,i] for each S ∈ S and 1 ≤ i ≤ k, define Si = {S ∪ {x[S,i]} | S ∈ S}, and add the sets {x[S,i], x[S,j]} for i ≠ j to T. □

Finally, we discuss the parameterised hardness of all three problems. To show membership of Independent Family in the class W[3], we construct a polynomial parameter-preserving reduction which outputs a circuit of bounded depth and weft 3.

Lemma 4.5. There is a polynomial parameter-preserving reduction from Independent Family to the Weighted Circuit Satisfiability problem restricted to circuits of bounded depth and weft 3.

Proof. Given an instance I = (S, T, k) of Independent Family, we build a Boolean circuit C of weft 3 that has a satisfying assignment of weight k iff I evaluates to true. C has one input node for every set S ∈ S. We introduce a large OR-gate for each element in the common universe U = ⋃S∈S S ∪ ⋃T∈T T. An input node S is wired to such an OR-gate u ∈ U whenever u ∈ S. Next, we construct a layer of large AND-gates, one for each forbidden set T ∈ T. Gate u is connected to T iff u ∈ T. Finally, the outputs of all AND-gates lead to a single large OR-gate whose output is then negated. This negation serves as the output of the whole circuit. Figure 1 shows an example instance and the resulting circuit.

Fig. 2. Illustration of the proof of Lemma 4.6. On the left side is an antimonotone 3-normalised formula, on the right is the resulting instance of Independent Family. The positions marked by the grey boxes are indexed by the respective sets: H for the DNF subformulas, I1 for the terms of the first subformula, and J(1,1) for the first term of the first subformula. The formula allows for a satisfying assignment of weight 4 by setting x4, x5, x7, and x8 to true. Equivalently, the union of the sets Sx4, Sx5, Sx7, and Sx8 does not cover any forbidden set in T. No assignment of Hamming weight at least 5 is satisfying.

C can be constructed from I in time polynomial in the size of the instance. It has constant depth and weft 3 as every path from an input node to the output passes through exactly one large gate of each layer. We claim that the circuit is satisfied by setting the input nodes S1, ..., Sk to true if and only if the union S1 ∪ · · · ∪ Sk of the corresponding sets contains no member of T. Let S1 through Sk be a selection of k distinct sets from S. Assigning true to the input nodes corresponding to these sets (and false to the others) satisfies exactly the OR-gates in the first layer that represent the elements in S1 ∪ · · · ∪ Sk. An AND-gate in the second layer, standing for some forbidden set T ∈ T, is satisfied if and only if all its feeding OR-gates are satisfied, i.e., iff T ⊆ S1 ∪ · · · ∪ Sk. The results for all T are collected by the large OR-gate in the third layer and subsequently negated. Circuit C is satisfied if and only if no forbidden set is contained in the union of the selected sets. □
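Evaluating the constructed circuit amounts to the following check (a Python sketch; the concrete instance is transcribed from Figure 1 and its caption, so treat it as illustrative):

```python
def weft3_value(S_list, T_list, selected):
    """Evaluate the weft-3 circuit of Lemma 4.5 on an input assignment.
    selected[i] is the truth value of the input node for set S_list[i]."""
    # first layer: one large OR-gate per universe element u,
    # satisfied iff u lies in some selected set
    covered = {u for S, on in zip(S_list, selected) if on for u in S}
    # second layer: one large AND-gate per forbidden set T, satisfied iff
    # all its feeding OR-gates are, i.e. iff T is fully covered;
    # third layer: a large OR over the AND-gates, then the single NOT-gate
    return not any(set(T) <= covered for T in T_list)

# the instance from Figure 1
S = [{'a', 'b'}, {'a', 'c'}, {'a', 'f', 'g'}, {'b', 'c', 'd'}, {'b', 'd', 'f'},
     {'c', 'e'}, {'c', 'g'}, {'d', 'e', 'f', 'g'}, {'e', 'f'}]
T = [{'a', 'c', 'f'}, {'b', 'c'}, {'b', 'g'}, {'d', 'e', 'f'}, {'f', 'g'}]

good = [Si in ({'a', 'c'}, {'c', 'e'}, {'c', 'g'}) for Si in S]  # weight 3
bad = [Si in ({'a', 'b'}, {'a', 'c'}, {'c', 'e'}) for Si in S]   # covers {b,c}
```

The weight-3 assignment selecting {a, c}, {c, e}, and {c, g} satisfies the circuit, since their union {a, c, e, g} contains no forbidden set, while any selection whose union covers some member of T falsifies it.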

To show hardness for this class, we instead reduce from a problem on Boolean formulas. Such a formula is called antimonotone 3-normalised if it is a conjunction of propositional sentences in disjunctive normal form (DNF) with only negative literals. The Weighted Antimonotone 3-normalised Satisfiability (WA3NS) problem is to decide for such a formula whether it has a satisfying assignment of Hamming weight (at least) k, i.e., with k variables set to true. Parameterised by the weight k, this satisfiability problem is known to be W[3]-complete [23]. The intuition behind the W[3]-hardness proof is as follows. The circuit constructed in the proof of Lemma 4.5 has a single NOT-gate immediately prior to the output node. Moving this negation all the way up to the input nodes, using De Morgan's laws, results in an antimonotone formula which is in fact 3-normalised. We claim that this is not an artefact of the reduction, but due to a characteristic property of the problem itself. Namely, every antimonotone 3-normalised formula can be encoded in an instance of Independent Family.

Lemma 4.6. There exists a polynomial reduction from WA3NS to Independent Family that preserves the parameter.

Proof. First, we fix some notation regarding 3-normalised formulas. Suppose φ is a Boolean formula on the variable set Vars(φ). It is antimonotone 3-normalised iff it can be written as

φ = ⋀_{h ∈ H} ⋁_{i ∈ Ih} ⋀_{j ∈ J(h,i)} ¬x(h,i,j)

for an index hierarchy H, {Ih}h∈H, {J(h,i)}h∈H, i∈Ih, and variables x(h,i,j) ∈ Vars(φ). Here, ¬x denotes the negative literal of variable x. This way, h ranges over the constituting DNF subformulas, i indexes their conjunctive terms, and j the negative literals in these terms. Of course, variables may appear multiple times in the formula, so different triples (h, i, j) may point to the same variable x. In the following, we construct an instance (S, T, k) of Independent Family which evaluates to true if and only if φ has a weight-k satisfying assignment. As the common universe, we take the terms of the DNF subformulas, U = {J(h,i) | h ∈ H, i ∈ Ih}. We introduce a set Sx ∈ S for each variable x ∈ Vars(φ). Sx contains all terms in which x occurs, Sx = {J(h,i) | ∃j: x(h,i,j) = x}. The subformulas themselves are represented in system T by setting Th = {J(h,i) | i ∈ Ih} and T = {Th}h∈H. The construction is illustrated in Figure 2. Suppose there is a satisfying assignment to φ which sets the variables x1 to xk to true. Then, Sx1 ∪ · · · ∪ Sxk contains exactly the terms that are not satisfied. If this union were to cover one forbidden set in T, the corresponding subformula (and hence φ) would also be unsatisfied, a contradiction.

This implies that instance (S, T, k) evaluates to true. Conversely, let Sx1 through Sxk be a selection of sets from S that cover no element of T. In other words, each subformula has at least one term in which none of the variables x1 to xk occurs. Setting these variables to true surely satisfies φ. □

Combining the reductions of the lemmas in this section finishes the proof that the investigated problems with their respective parameterisations are complete for the class W[3].

Theorem 4.7. Extension MinHS and (Multicoloured) Independent Family are W[3]-complete.

The classical Hitting Set problem is to decide, for a hypergraph H and a positive integer k, whether there is a transversal for H that has cardinality k. When parameterised by k, it is one of the standard problems complete for the class W[2], cp. [23]. An equivalent formulation of this problem is to ask whether the empty set X = ∅ can be extended to a minimal hitting set of size at most k. It is curious that giving the set X explicitly in the input, lifting the cardinality constraint, and instead parameterising the problem by |X| raises the hardness of the problem by one level in the W-hierarchy. Based on the results above, Casel et al. have recently shown that the extension problem for minimal dominating sets in graphs is W[3]-complete as well [14]. This can be seen as a special case of Extension MinHS and exhibits the same jump in the W-hierarchy, as the cardinality-bounded version of Dominating Set is W[2]-complete [23]. Besides the two extension problems, there are currently only two other natural problems known to be complete for W[3]. Neither the one from supply-chain management [16] nor the detection of inclusion dependencies [9] seems to have any immediate connection to the extension of hitting sets. Interestingly though, the latter problem also stems from the field of data profiling.
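The construction in the proof of Lemma 4.6 can be sketched and cross-checked by brute force. The encoding of a formula as nested lists of terms is our own and purely illustrative; in the chosen example all sets Sx are distinct, so the requirement of k distinct sets causes no friction.

```python
from itertools import combinations

def wa3ns_to_if(phi):
    """The construction of Lemma 4.6. `phi` encodes an antimonotone
    3-normalised formula as a list of DNF subformulas; each subformula
    is a list of terms; each term is the set of (negated) variables it
    mentions. Universe elements are the term identifiers (h, i)."""
    variables = {x for sub in phi for term in sub for x in term}
    S = {x: frozenset((h, i) for h, sub in enumerate(phi)
                      for i, term in enumerate(sub) if x in term)
         for x in variables}
    T = [frozenset((h, i) for i in range(len(sub)))
         for h, sub in enumerate(phi)]
    return S, T

def wa3ns_brute(phi, k):
    """Ground truth: does phi have a satisfying assignment of weight k?
    (For antimonotone formulas, weight >= k and weight exactly k coincide.)"""
    variables = sorted({x for sub in phi for term in sub for x in term})
    return any(all(any(not (term & set(tv)) for term in sub) for sub in phi)
               for tv in combinations(variables, k))

def if_brute(S, T, k):
    """Independent Family by brute force over k distinct sets from S."""
    distinct = set(S.values())
    return any(all(not t <= frozenset().union(*sel) for t in T)
               for sel in combinations(distinct, k))

# example: (not x1 and not x2, or not x3) and (not x2 and not x4, or not x1 and not x3)
phi = [[{1, 2}, {3}], [{2, 4}, {1, 3}]]
```

For this formula, both sides of the reduction agree for every weight k; the maximum weight of a satisfying assignment is 2 (setting x2 and x4 to true).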
Theorem 4.7 excludes the existence of any FPT-algorithm for the extension problem, unless the unlikely collapse W[3] = FPT occurs in the W-hierarchy. Instead, the best we can hope for is an XP-algorithm featuring a run time of the form (n + m)^f(|X|) for some computable function f. Assuming that 3-SAT does not admit a subexponential algorithm, i.e., the exponential time hypothesis (ETH), Chen et al. [15] showed that Independent Set on graphs with n vertices, m edges, and budget k cannot be solved in time f(k) · (n + m)^o(k). In particular, the exponent of the run time must have at least a linear dependence on the parameter. The bound transfers immediately to the Independent Family problem but has the caveat of relying on a comparatively strong assumption. As an illustration, the fact that Independent Set is W[1]-complete shows that ETH implies W[1] ≠ FPT and, in turn, W[t] ≠ FPT for all t ≥ 1. However, since all our reductions above preserve the parameter, we can derive the same lower bounds from weaker assumptions, involving levels higher up in the W-hierarchy. For this, we use the following proposition, which is a corollary of a more general result² in [15]. Recall that WA3NS is the satisfiability problem for antimonotone 3-normalised formulas, parameterised by the Hamming weight k of the assignment.

Proposition 4.8 (Chen et al. [15]). Let f be any computable function and c > 0 a constant. If WA3NS on formulas of size m with n variables can be solved in time f(k) · n^o(k) · m^c, then W[2] = FPT. Moreover, if there is a polynomial parameter-preserving reduction from WA3NS to a parameterised problem Π, and any instance (I, k′) ∈ Π can be solved in time f(k′) · |I|^o(k′), then W[2] = FPT.

Lemmas 4.2, 4.4, and 4.6, together with this proposition, yield the following lower bounds.

Theorem 4.9. Let f be any computable function and c > 0 a constant.

(i) Unless W[3] = FPT, Extension MinHS cannot be solved in time f(|X|) · (n+m)^c.
(ii) Unless W[2] = FPT, Extension MinHS cannot be solved in time f(|X|) · (n+m)^o(|X|).
(iii) Unless ETH fails, Extension MinHS cannot be solved in time f(|X|) · (n+m)^o(|X|).

The same bounds also hold for Independent Family and Multicoloured Independent Family if |X| is replaced by k, and (n+m) is replaced by (|U| + |T| + |S|) or (|U| + |T| + |S1| + · · · + |Sk|), respectively.

4.2 Solving the Extension Problem
Despite these hardness results, to finish the enumeration algorithm, we still need to solve the extension problem. That is, we have to decide, given two disjoint sets of vertices X and Y, whether X can be extended to a minimal hitting set without using any element of Y. Justified by Lemma 4.2, we approach this via the Multicoloured Independent Family problem, leading to Algorithm 2 below. First, the algorithm treats the special case of an empty set X = ∅. Then, it constructs an instance of Multicoloured Independent Family (lines 4–8). Two sanity checks (lines 9 and 10) assess whether the instance at hand can immediately be decided. We will see later that in practice many calls to the oracle already conclude here. If the checks are inconclusive, the instance is solved by brute force (lines 11–14). Observe that the existence of a minimal extension is decided without explicitly computing one. As shown in Section 3, this is enough for the enumeration. We now show that the worst-case run time of the algorithm matches the conditional lower bound given in the second and third clause of Theorem 4.9. Recall that n = |V| denotes the number of vertices and m = |H| the number of edges of the hypergraph.

2 The original result [15, Theorem 4.2] relates the satisfiability problem of Πt-circuits, for arbitrary t ≥ 2, to the class W[t−1]. The authors mention that one can assume the circuits to be layered. For t = 3, the layer structure coincides with that of antimonotone 3-normalised formulas. The statement also holds for the more general class of linear FPT-reductions.

ALGORITHM 2: Algorithm for the Extension MinHS oracle.

Input: Hypergraph (V, H) and disjoint sets X = {x1, ..., x|X|}, Y ⊆ V.
Output: true iff there is H ∈ Tr(H) with X ⊆ H ⊆ V \ Y; false otherwise.

1 if X = ∅ then
2   if V \ Y is a hitting set then return true;
3   else return false;

4 initialise set system T = ∅;
5 foreach x ∈ X do initialise set system Sx = ∅;
6 foreach E ∈ H do
7   if E ∩ X = {x} for some x ∈ X then add E \ Y to Sx;
8   if E ∩ X = ∅ then add E \ Y to T;

9 if ∃x ∈ X : Sx = ∅ then return false;
10 if T = ∅ then return true;

11 foreach (Ex1, ..., Ex|X|) ∈ Sx1 × · · · × Sx|X| do
12   W ← Ex1 ∪ · · · ∪ Ex|X|;
13   if ∀T ∈ T : T ⊈ W then return true;
14 return false;

Lemma 4.10. Algorithm 2 on input (V, H, X, Y) solves the Extension MinHS problem in time O((m/|X|)^|X| · mn) and uses linear space.

Proof. The first part up to line 10 of the algorithm computes the reduction from Extension MinHS to Multicoloured Independent Family for the truncated hypergraph (V \ Y, {E \ Y}E∈H). This is equivalent to reducing the original instance by Observation 4.1. The sanity checks in lines 1, 9, and 10 filter out trivial instances. The foreach-loop starting in line 11 is then brute-forcing the result of the reduction, checking all tuples in the Cartesian product Sx1 × · · · × Sx|X|. Regarding the run time, we assume that all set operations (membership, product, union, intersection, and difference) are implemented such that they take time proportional to the total number of elements contained in the sets. Checking whether V \ Y is a hitting set and computing the systems Sx1, ..., Sx|X|, and T can thus be done in O(mn). The run time is dominated by the brute-force phase. The cardinality of the Cartesian product is maximum if all systems have the same number of sets; there are thus at most (m/|X|)^|X| many tuples. For each of them, the algorithm computes the union W in O(|X|n) and checks all forbidden sets in T in O(mn). Observe that the fact that every element in X has an edge of its own (as verified in line 9) implies |X| ≤ m and O(|X|n + mn) = O(mn). This gives the claimed bound. Regarding the space requirement, note that the set systems Sx and T are all disjoint subsets of the truncated hypergraph, using at most as much space as the input. □
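For concreteness, here is a direct Python transcription of Algorithm 2 (a sketch, not the paper's C++ implementation; the line references in the comments point to the pseudocode above):

```python
from itertools import product

def extendable(V, H, X, Y):
    """Sketch of Algorithm 2: true iff some minimal hitting set H* of
    (V, H) contains all of X and avoids all of Y."""
    edges = [set(E) - Y for E in H]          # truncate by Y (Observation 4.1)
    if not X:                                # lines 1-3: is V - Y a hitting set?
        return all(E for E in edges)
    S = {x: [E for E in edges if E & X == {x}] for x in X}   # lines 4-8
    T = [E for E in edges if not E & X]
    if any(not Sx for Sx in S.values()):     # line 9: some x has no witness
        return False
    if not T:                                # line 10: X is itself minimal
        return True
    for sel in product(*S.values()):         # lines 11-14: brute force
        W = set().union(*sel)
        if all(not t <= W for t in T):
            return True
    return False
```

As the text notes, on practical inputs most calls are expected to resolve in the sanity checks of lines 9 and 10 before the brute-force loop is ever reached.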

4.3 Improved Enumeration
So far, we are solving the extension problem in a vacuum. We can improve upon this by also taking into consideration the interplay of the oracle with the decision tree. We are looking for the minimal transversals, but the original formulation of the Extension MinHS problem is indifferent to the distinction between sets that are themselves minimal solutions and those that still need extending vertices. As it turns out, we can discriminate the two cases.

Lemma 4.11. Let H ≠ ∅ be non-empty and X, Y ⊆ V two disjoint sets of vertices. Then, X is a minimal hitting set if and only if Algorithm 2 on input (V, H, X, Y) returns true from line 10.

Proof. Let X ∈ Tr(H) be a minimal hitting set. X is non-empty since H ≠ ∅. Set X thus fails the check in line 1. Every x ∈ X has a witness, and the truncated witness is contained in Sx (as checked in line 9). As X is also a hitting set, T = ∅ is empty and Algorithm 2 breaks in line 10. Conversely, to reach and pass the test in line 10, X must be a non-empty hitting set. Assume X is not minimal and let x ∈ X be expendable, i.e., there is no edge E ∈ H such that E ∩ X = {x}, whence the system Sx = ∅ is empty. This would have been detected already a line earlier, a contradiction. □
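The witness criterion underlying Lemma 4.11 can also be stated directly (a Python sketch under the assumption H ≠ ∅; names are our own):

```python
def minimal_via_witnesses(H, X):
    """For non-empty H, a set X is a minimal hitting set iff every x in X
    has a private witness edge (E ∩ X = {x}) and no edge of H is disjoint
    from X."""
    X = set(X)
    every_x_witnessed = all(any(set(E) & X == {x} for E in H) for x in X)
    every_edge_hit = all(set(E) & X for E in H)
    return every_x_witnessed and every_edge_hit
```

This is exactly the information that lines 9 and 10 of Algorithm 2 compute anyway, which is why minimality detection comes for free as a by-product of the oracle call.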

We can trivially check whether H is empty, which would make the empty set the only minimal transversal, before invoking the enumeration algorithm. Therefore, we assume H ≠ ∅ in what follows. Consider the search tree of Algorithm 1 with the extension oracle being implemented by the subroutine in Algorithm 2. The subroutine breaking in line 10 flags the fact that we have found a minimal solution. If so, we know the outcome of every oracle call further down in the subtree rooted at the current node: adding more vertices to X can never lead to a set that is extendable; excluding vertices by adding them to Y instead does not change the answer of the oracle. In summary, the subtree rooted at (X, Y, V \ (X ∪ Y)) has exactly one leaf that holds a minimal solution, namely, (X, V \ X, ∅). We can thus safely shortcut the recursion and immediately output solution X. Another possible improvement stems from the observation that the two extension checks of the enumeration procedure, lines 4 and 5 of Algorithm 1, cannot both fail at the same time. We only enter a node once we have verified that at least one leaf of its subtree holds a minimal solution. That leaf is a descendant of either the left or the right child. If the first check fails, we therefore do not need to execute the second one. Observe that this does not interfere with the flagging mechanism above. We entered the current node because X is extendable but not yet a minimal solution. The call extendable(X, Y ∪ {v}) has the same first argument, and thus Algorithm 2 on that input would also not break in line 10. Finally, we prove a guarantee on the maximum delay between consecutive outputs of Algorithm 1. This hinges on the fact that the oracle is only called a linear number of times between two outputs. Moreover, we show that the size |X| of the sets we try to extend during the computation is bounded by the rank of the transversal hypergraph rank(Tr(H)), which we denote by k∗. For bounded k∗, we achieve polynomial delay.
Note that the rank k∗ is not known to the algorithm; the input consists only of the hypergraph itself. In fact, it has been shown recently by Bazgan et al. that computing k∗ is W[1]-hard (parameterised by k∗); also, it cannot be approximated within a factor of n^(1−ε) for any ε > 0, unless P = NP [4]. The upper bound on the delay holds regardless of the knowledge of k∗.

Theorem 4.12. Algorithm 1 combined with Algorithm 2 as oracle, on input (V, ≽, H), enumerates Tr(H) with delay O(m^(k∗+1) · n²), where k∗ = rank(Tr(H)). The method uses space linear in the input.

Proof of Theorem 4.12. The correctness of Algorithms 1 and 2 is treated in Lemmas 3.1 and 4.10, respectively. We are left with the bounds on the delay and space. The height of the decision tree is |V| = n. After exiting a leaf, the pre-order traversal meets at most 2n − 1 inner nodes before arriving at the next leaf. In the worst case, method extendable is invoked in each of them (even with the shortcut described above). The O(m^(|X|+1) · n) oracle call dominates the time spent in each node. We claim that any set X appearing as the first argument of extendable is of cardinality at most k∗. To reach a contradiction, assume a node (X, Y, R) is visited during the execution for which |X| > k∗. Clearly, this cannot be the root since X is non-empty. Thus, the node has been entered after an invocation of method extendable(X, Y). X is neither a minimal solution nor can it be extended to one since its cardinality is already larger than the size k∗ of the largest minimal transversal. The check returned false and the child node is never entered, a contradiction. In summary, the delay is at most (2n − 1) · O(m^(k∗+1) · n) = O(m^(k∗+1) · n²). Regarding the space requirement, we need to store the input and the label of the current node to govern the recursive calls. By Lemma 4.10, any call of Algorithm 2 needs only linear space. After its return, the constructed set systems (Sx)x∈X and T can be discarded. □

Beyond the bounded delay, there are two aspects that make this enumeration method particularly applicable in practice. As soon as the tree search branches at a node, the resulting subtrees are completely independent. Algorithm 1 is thus trivially parallelisable. Also, most hitting set enumeration algorithms, including the current state of the art, incur a space requirement that is proportional to the size of the output [2, 38, 54, 56]. In contrast, our branching scheme makes sure that no potential solution occurs more than once, obviating the need for any additional bookkeeping. Hence, the whole computation needs only space linear in the size of the input.

5 ENUMERATING UNIQUE COLUMN COMBINATIONS
We apply our enumeration algorithm to hypergraphs arising in the profiling of relational databases. Specifically, we want to enumerate all minimal unique column combinations (UCCs). Recall that a UCC for some database r over a relational schema R is a set X ⊆ R of attributes such that evaluating any row ri ∈ r at the entries of X, i.e., the subtuple ri[X], already identifies the full tuple ri. There is a folklore polynomial reduction from the detection of UCCs to the hitting set problem, see e.g. [20, 47]. Intuitively, for any two rows ri, rj ∈ r, any UCC must contain at least one attribute in which ri and rj disagree; otherwise, these rows are indistinguishable. That is to say, any UCC must hit every difference set of pairs of rows in the database. The reduction is parsimonious in that it establishes a one-to-one correspondence between UCCs and hitting sets. Additionally, it preserves set inclusions, matching minimal UCCs to minimal hitting sets. Bläsius et al. showed that there is also a parsimonious, inclusion-preserving reduction in the opposite direction [9]. As a result, enumerating minimal UCCs is exactly as hard as the general transversal hypergraph problem. The reduction sketched above implies a two-phased approach for the enumeration. First, generate the hypergraph of minimal difference sets. Second, list its minimal transversals. The first phase takes time polynomial in the size of the database. The second phase, which is the focus of this paper, has exponential complexity in general. In the following, we thus assume that the Sperner hypergraph of minimal difference sets is given as input.
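The first phase of this reduction can be sketched as follows (illustrative Python, assuming rows are given as equal-length tuples; column indices stand in for attributes):

```python
from itertools import combinations

def minimal_difference_sets(rows):
    """Folklore reduction sketch: for every pair of rows, collect the set
    of column indices on which they disagree, then keep only the
    inclusion-wise minimal such sets. The minimal hitting sets of the
    resulting hypergraph are exactly the minimal UCCs of the relation."""
    diffs = {frozenset(i for i, (a, b) in enumerate(zip(r, s)) if a != b)
             for r, s in combinations(rows, 2)}
    return {d for d in diffs if not any(e < d for e in diffs)}

rows = [(1, 'a', 0),
        (1, 'b', 0),
        (2, 'a', 1)]
print(minimal_difference_sets(rows))
```

In this toy relation, rows 1 and 2 disagree only in column 1 and rows 1 and 3 disagree in columns 0 and 2, so every UCC must contain column 1 together with column 0 or 2; the pairwise difference set {0, 1, 2} is discarded as non-minimal.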

5.1 Data and Experimental Setup We evaluate our enumeration algorithm on a total of twelve databases. Ten of them are publicly available. These are the abalone, echocardiogram, hepatitis, and horse datasets from the UCI Machine Learning Repository3; uniprot from the Universal Protein Resource4; civil_service5, ncvoter_allc6 and flight_1k7 provided by the respective authorities of the City of New York, the state of North Carolina, and the federal government of the US; call_a_bike of the German railway company Deutsche Bahn8, as well as amalgam1 from the Database Lab of the University of Toronto9. They are complemented by two randomly generated datasets fd_reduced_15 and

³ archive.ics.uci.edu/ml
⁴ uniprot.org
⁵ opendata.cityofnewyork.us
⁶ ncsbe.gov
⁷ transtats.bts.gov
⁸ data.deutschebahn.com
⁹ dblab.cs.toronto.edu/~miller/amalgam

dataset          columns     rows   |V|   |H|   k∗     Nmin
call_a_bike           16  100 000    13     6    4       23
abalone                9     4177     9    30    6       29
echocardiogram        12      132    12    30    5       72
civil_service         20  100 000    14    19    7       81
horse                 28      300    25    39   11      253
uniprot               40   19 999    37    28   10      310
hepatitis             20      155    20    54    9      348
fd_reduced_15         15  100 000    15    75    3      416
amalgam1              87       50    87    70    4     2737
fd_reduced_30         30  100 000    30   224    3     3436
flight_1k             70     1000    53   161    8   26 652
ncvoter_allc          88  100 000    82   448   15  200 907

Table 1. The databases used in the evaluation, ordered by the number Nmin of minimal UCCs. Columns and rows denote the respective dimensions of the database; |V| and |H| refer to the resulting hypergraph of minimal difference sets; k∗ is the cardinality of the largest minimal UCC.

fd_reduced_30 using the dbtesma data generator¹⁰. Databases with more than 100k rows are cut by choosing 100k rows uniformly at random. Table 1 lists the data together with some basic features such as the number of columns and rows in the database, the number of vertices and edges of the resulting hypergraph, as well as the number of minimal UCCs. After computing the minimal difference sets, we removed vertices not appearing in any edge, as these are not relevant for the enumeration. Thus, the number of vertices |V| of the hypergraph can be smaller than the number of columns in the database. The number of difference sets of a database with |r| rows is |r|(|r|−1)/2 in the worst case. However, comparing the "rows" column in the table above with the number |H| of minimal difference sets reveals that in practice the latter tends to be smaller than the number of rows, let alone quadratic in |r|. The hypergraph perspective thus provides a very compact representation of the UCC problem. Moreover, the maximum cardinality k∗ of minimal UCCs is small in practice. In particular, there does not appear to be any relationship between k∗ and the total input size. The table is sorted by the number Nmin of minimal hitting sets/UCCs, i.e., the solution size. All plots below use this order. The algorithms are implemented in C++ and run on an Ubuntu 16.04 machine with two Intel Xeon E5-2690 v3 2.60 GHz CPUs and 256 GB RAM. The code together with the hypergraphs for the call_a_bike, civil_service, fd_reduced_30, and ncvoter_allc databases is available at hpi.de/friedrich/research/enumdat. In some of the experiments, we collect run times of intermediate steps, for example the oracle calls in Section 5.4. To avoid interference with the overall run time measurements, we use separate runs for these. Also, we average over multiple runs to reduce the noise of the measurements. See the corresponding sections for details.

5.2 Comparison with Brute Force

We use a brute-force search for minimal UCCs as a baseline for comparison, namely, the Apriori algorithm. Originally developed for frequent itemset mining [3], it is well known that it can also be used to find UCCs and hitting sets in general [2, 35, 47]. Apriori follows a level-based bottom-up approach: in level i, it maintains a collection of candidate sets of vertices, all of cardinality i. The candidates of the (i+1)-st level are generated by combining any two i-candidates that differ in

¹⁰ sourceforge.net/projects/dbtesma

exactly one vertex. The new candidates are then tested to determine whether they are hitting sets of the input hypergraph. If so, they are output and discarded from the candidate collection. To avoid generating unnecessary candidates, i.e., those that are supersets of smaller transversals, the new candidates are checked against the collection of solutions found so far. Traversing the power lattice in a bottom-up fashion ensures that once a hitting set is found for the first time, it is necessarily minimal. After finding all solutions, Apriori may take some additional time to terminate, as it needs to verify that the search space of future candidates is in fact empty. It is generally agreed that this behaviour is hard to avoid, as finding an efficient stopping criterion here would immediately yield faster algorithms to compute the transversal rank k∗; see [4, 30] for a more detailed discussion. To make this transparent in our experiments, we measure two run times for the brute-force algorithm: the enumeration time is the time it took to find all minimal UCCs, and the completion time is the total run time until termination.

Fig. 3. Comparison of the run time of Algorithm 1 with the brute-force baseline (Apriori). For the latter, we report the enumeration time and the completion time separately since it has a non-negligible overhead after the last output until termination. The completion times for uniprot and amalgam1 are not shown as Apriori exceeds the memory limit. The enumeration and completion times of Apriori on flight_1k and ncvoter_allc exceed the time limit of 12 h.

Figure 3 shows the run times of our algorithm in comparison to the enumeration/completion times of Apriori on a log scale. For the smaller instances, both algorithms are fast. For larger instances, the brute-force baseline is clearly outperformed. The only exceptions to this trend are the two artificially generated instances fd_reduced_15 and fd_reduced_30.
These are also the two instances with the smallest transversal rank of k∗ = 3; this exceptionally small value benefits a bottom-up brute-force technique. Beyond the mere run time, Apriori also requires a lot of memory if the transversal hypergraph is large since it needs to store all output solutions to avoid unnecessary candidates. We thus set a memory limit of 128 GB and a time limit of 12 h. On flight_1k and ncvoter_allc, Apriori finds 98.6% and 14.6% of the minimal solutions, respectively, before reaching the time limit. For all other instances, brute force enumerates all solutions within the limits. However, for uniprot and amalgam1, the memory limit was exceeded between the output

Fig. 4. Run times of our enumeration algorithm. The box plots show the run times for 1000 random vertex orders. The filled circles correspond to the heuristic vertex order.

of the last solution and termination. Still, even if we only take the enumeration time into account, we achieved speed-up factors of 4723 and 14 for these instances. This comparison already reveals an important practical advantage of our backtracking enumeration method: it only uses memory proportional to the input size (cf. Theorem 4.12) as opposed to the size of the output. An exuberant memory consumption is not specific to the brute-force algorithm but also occurs in several modern hitting set and UCC enumerators. For more details, see the surveys [2, 33, 54].
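For reference, the level-wise scheme of the Apriori baseline can be sketched as follows (a minimal Python illustration under our own naming, not the optimized implementation used in the experiments). Note that the loop only terminates once the candidate collection runs empty, which is precisely the completion overhead discussed above:

```python
from itertools import combinations

def apriori_minimal_hitting_sets(vertices, edges):
    """Level-wise bottom-up search: level i holds candidate vertex sets of
    size i. A candidate hitting every edge is output (it is necessarily
    minimal, as all smaller hitting sets were found on earlier levels) and
    discarded; the survivors are merged pairwise to form the next level."""
    edges = [set(e) for e in edges]
    solutions = []
    candidates = {frozenset([v]) for v in vertices}
    while candidates:
        survivors = []
        for c in candidates:
            if all(c & e for e in edges):
                solutions.append(c)   # minimal hitting set found
            else:
                survivors.append(c)
        # combine any two i-candidates that differ in exactly one vertex,
        # pruning supersets of already-found solutions
        next_cands = set()
        for a, b in combinations(survivors, 2):
            u = a | b
            if len(u) == len(a) + 1 and not any(s <= u for s in solutions):
                next_cands.add(u)
        candidates = next_cands
    return solutions

edges = [{1}, {0, 2}]
print(sorted(sorted(s) for s in apriori_minimal_hitting_sets([0, 1, 2], edges)))
# [[0, 1], [1, 2]]
```

Even on this tiny example, the non-hitting survivor {0, 2} keeps the search alive for one extra level after the last solution is output, mirroring the gap between enumeration and completion time.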

5.3 Vertex Order, Run Time, and Delay

Recall that our enumeration algorithm (Algorithm 1) branches on the vertices of the hypergraph in a certain order. Although the order does not matter for our asymptotic bounds, it affects the shape of the explored decision tree, which in turn impacts the practical run time. Even on a theoretical level, it has been shown that there exist vertex orders that make finding the (lexicographically) first solution, let alone all of them, into an NP-hard search problem [24]. As a heuristic, we sort the vertices descendingly by the number of different values that appear in the corresponding column of the original database. The intuition is that columns with many values have a higher separative power on the pairs of rows and thus are more likely to appear in many minimal UCCs. Including an expressive vertex makes many other vertices obsolete, which leads to early pruning of the tree. Conversely, excluding such vertices (adding them to the set Y in Algorithm 2) makes it likely that only a few hitting sets survive, which also prunes the tree early. Note that reducing the size of the decision tree, and thus the number of oracle calls, does not automatically reduce the run time. The remaining oracle calls may have a larger average return time. We discuss this in more detail in Section 5.4. As a side note, sorting the vertices by their hypergraph degree instead (i.e., the number of minimal difference sets in which they appear) results in similar but slightly worse run times. Besides using our heuristic order, we also evaluate the algorithm on 1000 random vertex orders per dataset. The only exception is the ncvoter_allc instance, where the higher run times do not permit that many orders. We report on ncvoter_allc separately. The run times, averaged over 10 measurements for each data point, are shown in Figure 4. In that box plot, as well as the ones below, the instances are sorted by Nmin, the number of minimal UCCs. The boxes range from the first to the third quartile of the samples, with the median indicated as a horizontal line. The whiskers represent

Fig. 5. Delays between consecutive outputs of minimal hitting sets for the heuristic vertex order.

the lowest datum within 1.5 interquartile range (IQR) of the lower quartile and the highest datum within 1.5 IQR of the upper quartile. Everything beyond that we count as an outlier (small crosses). The median run time generally scales with the solution size, which is not surprising. The only exceptions are, again, the generated instances fd_reduced_15 and fd_reduced_30. For most of the instances, the order of the vertices had only little impact. Our heuristic outperformed the median random order for all instances, indicating that it is a solid choice in practice. On some instances, the heuristic even led to lower run times than any of the random orders. For ncvoter_allc, however, the influence of the vertex order was much larger. Using the heuristic, the enumeration completed in 26 min. For comparison, on four out of eleven random orders, the process only finished after 59.7 h, 105.3 h, 113.7 h, and 167.7 h, respectively. The other seven runs exceeded the time limit of 168 h (one week). The box plot in Figure 5 shows the delays between consecutive solutions when using the heuristic order. Each data point was obtained by averaging over 100 runs. While there is a high variance in the delays on some instances, e.g., between 10⁻¹ ms and 10³ ms for ncvoter_allc, the maximum delay is always less than 2 s, which is reasonably low. In the following section, we investigate the delays more closely by looking at the run time distribution of the oracle calls.
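For illustration, the order heuristic described at the beginning of this section can be computed directly from the database; heuristic_vertex_order is a hypothetical helper sketched here in Python, not part of our C++ implementation:

```python
def heuristic_vertex_order(rows, vertices):
    """Sort the vertices descendingly by the number of distinct values in the
    corresponding database column (ties keep a stable order). Columns with
    many distinct values separate more row pairs and tend to occur in many
    minimal UCCs, which prunes the decision tree early."""
    distinct = {v: len({row[v] for row in rows}) for v in vertices}
    return sorted(vertices, key=lambda v: -distinct[v])

# Example: column 0 has three distinct values, column 1 two, column 2 one.
rows = [("a", 1, "x"), ("b", 2, "x"), ("c", 1, "x")]
print(heuristic_vertex_order(rows, [0, 1, 2]))  # [0, 1, 2]
```

The degree-based alternative mentioned above would merely replace the distinct-value count by the number of minimal difference sets containing each vertex.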

5.4 Oracle Calls

In Section 4.3, we showed that the number of oracle calls between two consecutive outputs is linear in the number of vertices in the hypergraph. Thus, the only part potentially leading to a super-polynomial delay are the oracle calls themselves. It is thus interesting to see how many oracle calls we need and how long these calls actually take in practice. For our heuristic vertex order, we measured the run times of each individual oracle call, averaged over 100 runs to reduce the noise. Figure 6 shows the complementary cumulative frequencies (CCF) of the oracle run times, i.e., for each time t (x-axis), it shows the number of oracle calls with run time at least t (y-axis). We excluded the artificial instances fd_reduced_15 and fd_reduced_30; they are discussed separately. First, note the horizontal lines on the left. They stem from the few trivial oracle calls with X = ∅. Those are much faster than all other calls since they do not need to construct the Independent Family instance. Next, we examine the impact of the total number of oracle calls on the run time. The legend of Figure 6 is ordered by increasing enumeration time from left to right; for the real-world databases, this is the same as the output order used throughout (only the artificially generated ones differ from that). For comparison, the total number of oracle calls is the y-value of the left endpoint of each curve. The two orders are almost the same. An interesting exception is the hepatitis dataset. It has fewer calls than horse and uniprot, but these calls take

Fig. 6. (a) The complementary cumulative frequencies of all oracle run times over the enumeration process. (b) The complementary cumulative frequencies of run times for those calls that enter the brute-force loop in line 11 of Algorithm 2. The heuristic vertex order was used in both plots.

more time on average, leading to a higher overall run time. Instance amalgam1 needs even more calls, which then outweighs the smaller average. Similarly, the oracle calls for horse take more time than those for uniprot, but the higher number of calls in the latter case causes the longer run time. In preliminary experiments, we observed these effects also when comparing different vertex orders of the same dataset. In general, aiming for the smallest number of oracle calls is a good strategy, although there are cases in which a higher number of easier calls leads to better run times. In the non-trivial cases of X ≠ ∅, the extension oracle (Algorithm 2) first checks whether the resulting Independent Family instance can already be decided by the sanity checks in lines 9 and 10. This way, a significant portion of the calls can be solved in polynomial time. Over all instances, slightly more than half of the oracle calls are solved like that. In fact, for the three instances with the most oracle calls, namely, amalgam1, flight_1k, and ncvoter_allc, no more than 32% of the calls entered the brute-force loop in line 11, which is the only part that actually requires super-polynomial run time. It is thus interesting to see specifically the run times of those calls; their CCFs are shown in Figure 6b. The differences between Figures 6a and 6b in the lower parts of call_a_bike and uniprot are artifacts of the separate measurements to create these plots. The run times of the oracle calls that enter the brute-force phase are heterogeneously distributed with many fast invocations and only few slow ones. Exemplarily, we investigate the oracle calls of the flight_1k instance.
The database has 70 columns, 1000 rows, and 26 652 minimal UCCs. During the enumeration process, the extension oracle is called 242 449 times: 22 (0.009%) of the calls are trivial, the vast majority of 165 767 (68.4%) are decided by the sanity checks, and the remaining 76 660 (31.6%) enter the loop. Of the brute-force calls, 41 353 (53.9%) take only a single iteration to find a combination of potential witnesses verifying that the respective input set X is indeed extendable to a minimal solution (line 13). However, there are also two calls that need the maximum of 74 880 iterations, which corresponds to a run time of 16 ms. In those two cases, all possible combinations of potential witnesses had to be tested, only to conclude that the set is not extendable

Fig. 7. Complementary cumulative frequencies of oracle run times over the enumeration process for the artificial instances. The heuristic was used for the vertex order.

(line 14). It is inherent to the hardness of Extension MinHS that the inputs that are not extendable because all combinations of potential witnesses cover at least one unhit set incur the highest number of iterations and the longest oracle run times. Fortunately, those occasions were rare in our experiments. In the case of flight_1k, a mere 622 calls take more than 10 000 iterations; they make up 0.8% of the brute-force calls and only 0.2% of all oracle invocations. The run time distributions of the other real-world databases are similar to the one of flight_1k, see Figure 6a. There is always a non-vanishing chance that a given oracle call incurs a high run time, which is not surprising for a worst-case exponential algorithm, but even the slowest calls are reasonably fast in practice. However, the majority is far away from the worst case, leading to a very low run time on average. The heterogeneous behaviour of the brute-force calls in particular also shows in their CCFs (Figure 6b). They resemble a power-law distribution (straight lines in a log-log plot), albeit their tendency towards very small run times (concavity of the plots) is stronger than one would expect for a pure power law. Finally, the two artificial instances fd_reduced_15 and fd_reduced_30 behave very differently from the real-world databases. Figure 7 shows the CCF of their oracle calls. The staircase shape indicates that there are only five different types of oracle calls, with roughly the same run times for calls of the same type.
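The CCF curves themselves are straightforward to compute from the measured call times; a small sketch under our own naming, assuming times in milliseconds:

```python
def ccf(times):
    """Complementary cumulative frequencies: for each distinct run time t,
    the number of measurements with run time at least t. Plotting these
    points on log-log axes yields curves like those in Figures 6 and 7."""
    points = []
    for t in sorted(set(times)):
        points.append((t, sum(1 for s in times if s >= t)))
    return points

print(ccf([0.1, 0.1, 2.0, 16.0]))  # [(0.1, 4), (2.0, 2), (16.0, 1)]
```

In particular, the y-value at the smallest time is the total number of calls, which is how the left endpoints of the curves in Figure 6a were read off above.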

6 CONCLUSION

In this work, we devised a backtracking method to enumerate all minimal hitting sets of a hypergraph. We showed that it achieves polynomial delay on inputs whose transversal hypergraphs have bounded rank. The underlying tree search is based on a subroutine solving the extension problem for minimal hitting sets. We identified the extension problem as a natural complete problem for the parameterised complexity class W[3]. Our algorithm for the subroutine has an optimal worst-case behaviour when assuming the exponential time hypothesis or W[2] ≠ FPT. The properties of our enumeration method make it particularly suitable for the task of computing all minimal unique column combinations of a relational database, which is a domain where the solutions are expected to be small. However, the degree of the worst-case run time bound is dependent on the size of the largest solution. This bears the danger that the running times may still be, although output-polynomial, prohibitively large in practice. To check whether our algorithm succeeds within reasonable time frames, we evaluated it on a collection of publicly available datasets. The experiments showed that our method outperforms the brute-force baseline as soon as the solution space is non-trivial. As the empirical run time depends on the branching order of the vertices, we gave a heuristic based on the original database that achieved good results throughout by reducing the number of oracle calls. Notwithstanding, the main reason for the low overall run times is not only the small number of calls but the fact that the oracle calls are very fast on average. In particular, they regularly avoid the worst case in practice, which was the basis for the large theoretical run time bound. The tree-based computation additionally obviates the need for expensive post-processing. Our algorithm terminates almost immediately after outputting the last solution.
Another important feature of our method is the low memory consumption. We showed that the algorithm requires only space linear in the input size, independently of the number of solutions. Approaching the enumeration of unique column combinations from a hitting set perspective and employing an extension oracle naturally forgoes the necessity of holding previous solutions in memory. This, however, seems to be a common problem even for current state-of-the-art discovery algorithms in data profiling such as DUCC and HyUCC. Papenbrock and Naumann, the authors of HyUCC, posed the following challenge in their conclusion [56].

"For future work, we suggest to find novel techniques to deal with the often huge amount of results. Currently, HyUCC limits its results if these exceed main memory capacity [...]."

We believe that this is a problem that can be solved by viewing data profiling from a hitting-set perspective, as we did in this work. While the pure enumeration times of our algorithm are already competitive, they seem not to be the true bottleneck of the computation, despite the fact that this is the NP-hard core of the problem. Instead, the polynomial preprocessing step of preparing the minimal difference sets of the database as input for the experiments regularly took longer than actually computing all transversals. Here, careful engineering has the potential of huge speed-ups on practical instances. Combining this with the natural advantages of our enumeration algorithm might yield the novel technique we are looking for.

REFERENCES

[1] Amir E. Abboud. 2017. Hardness in P. Ph.D. Dissertation. Stanford University, Stanford, CA, USA. https://searchworks.stanford.edu/view/12137896
[2] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. The VLDB Journal 24 (2015), 557–581. https://doi.org/10.1007/s00778-015-0389-y
[3] Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). 487–499. http://dl.acm.org/citation.cfm?id=645920.672836
[4] Cristina Bazgan, Ljiljana Brankovic, Katrin Casel, Henning Fernau, Klaus Jansen, Kim-Manuel Klein, Michael Lampis, Mathieu Liedloff, Jérôme Monnot, and Vangelis Th. Paschos. 2018. The Many Facets of Upper Domination. Theoretical Computer Science 717 (2018), 2–25. https://doi.org/10.1016/j.tcs.2017.05.042
[5] Claude Berge. 1989. Hypergraphs - Combinatorics of Finite Sets. North-Holland Mathematical Library, Vol. 45. North-Holland, Amsterdam, Netherlands.
[6] Armin Biere, Marijn Heule, Hans van Maaren, and Toby Walsh (Eds.). 2009. Handbook of Satisfiability. Frontiers in Artificial Intelligence and Applications, Vol. 185. IOS Press, Amsterdam, The Netherlands.
[7] Jan C. Bioch and Toshihide Ibaraki. 1995. Complexity of Identification and Dualization of Positive Boolean Functions. Information and Computation 123 (1995), 50–63. https://doi.org/10.1006/inco.1995.1157
[8] Andreas Björklund, Petteri Kaski, Łukasz Kowalik, and Juho Lauri. 2015. Engineering Motif Search for Large Graphs. In Proceedings of the 17th Meeting on Algorithm Engineering and Experiments (ALENEX). 104–118. https://doi.org/10.1137/1.9781611973754.10
[9] Thomas Bläsius, Tobias Friedrich, and Martin Schirneck. 2016. The Parameterized Complexity of Dependency Detection in Relational Databases. In Proceedings of the 11th International Symposium on Parameterized and Exact Computation (IPEC). 6:1–6:13.
https://doi.org/10.4230/LIPIcs.IPEC.2016.6
[10] Thomas Bläsius, Tobias Friedrich, Julius Lischeid, Kitty Meeks, and Martin Schirneck. 2019. Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling. In Proceedings of the 21st Meeting on Algorithm Engineering and Experiments (ALENEX). 130–143. https://doi.org/10.1137/1.9781611975499.11

[11] Endre Boros, Khaled M. Elbassioni, Vladimir Gurvich, and Leonid Khachiyan. 2000. An Efficient Incremental Algorithm for Generating All Maximal Independent Sets in Hypergraphs of Bounded Dimension. Parallel Processing Letters 10 (2000), 253–266. https://doi.org/10.1142/S0129626400000251
[12] Endre Boros, Khaled M. Elbassioni, Vladimir Gurvich, Leonid Khachiyan, and Kazuhisa Makino. 2002. Dual-Bounded Generating Problems: All Minimal Integer Solutions for a Monotone System of Linear Inequalities. SIAM J. Comput. 31 (2002), 1624–1643. https://doi.org/10.1137/S0097539701388768
[13] Endre Boros, Vladimir Gurvich, and Peter L. Hammer. 1998. Dual Subimplicants of Positive Boolean Functions. Optimization Methods and Software 10 (1998), 147–156. https://doi.org/10.1080/10556789808805708
[14] Katrin Casel, Henning Fernau, Mehdi Khosravian Ghadikolaei, Jérôme Monnot, and Florian Sikora. 2018. On the Complexity of Solution Extension of Optimization Problems. ArXiv e-prints (2018). arXiv:cs.CC/1810.04553 https://arxiv.org/abs/1810.04553
[15] Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. 2006. Strong Computational Lower Bounds via Parameterized Complexity. J. Comput. System Sci. 72 (2006), 1346–1367. https://doi.org/10.1016/j.jcss.2006.04.007
[16] Jianer Chen and Fenghui Zhang. 2006. On Product Covering in 3-Tier Supply Chain Models: Natural Complete Problems for W[3] and W[4]. Theoretical Computer Science 363 (2006), 278–288. https://doi.org/10.1016/j.tcs.2006.07.016
[17] Nadia Creignou, Arne Meier, Julian-Steffen Müller, Johannes Schmidt, and Heribert Vollmer. 2017. Paradigms for Parameterized Enumeration. Theory of Computing Systems 60 (2017), 737–758. https://doi.org/10.1007/s00224-016-9702-4
[18] Nadia Creignou, Frédéric Olive, and Johannes Schmidt. 2011. Enumerating All Solutions of a Boolean CSP by Non-decreasing Weight. In Proceedings of the 14th International Conference on Theory and Applications of Satisfiability Testing (SAT). 120–133.
https://doi.org/10.1007/978-3-642-21581-0_11
[19] Marek Cygan, Fedor V. Fomin, Łukasz Kowalik, Dániel Lokshtanov, Daniel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. 2015. Parameterized Algorithms. Springer, Heidelberg, Germany, and New York City, NY, USA.
[20] Christopher John Date. 2003. An Introduction to Database Systems (8th ed.). Addison-Wesley Longman Publishing, Boston, MA, USA.
[21] János Demetrovics and Vu Duc Thi. 1987. Keys, Antikeys and Prime Attributes. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae Sectio Computatorica 8 (1987), 35–52.
[22] Carlos Domingo, Nina Mishra, and Leonard Pitt. 1999. Efficient Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership Queries. Machine Learning 37 (1999), 89–110. https://doi.org/10.1023/A:1007627028578
[23] Rodney G. Downey and Michael R. Fellows. 2013. Fundamentals of Parameterized Complexity. Springer, Heidelberg, Germany, and London, UK. https://doi.org/10.1007/978-1-4471-5559-1
[24] Thomas Eiter. 1994. Exact Transversal Hypergraphs and Application to Boolean µ-Functions. Journal of Symbolic Computation 17 (1994), 215–225. https://doi.org/10.1006/jsco.1994.1013
[25] Thomas Eiter and Georg Gottlob. 1995. Identifying the Minimal Transversals of a Hypergraph and Related Problems. SIAM J. Comput. 24 (1995), 1278–1304. https://doi.org/10.1137/S0097539793250299
[26] Thomas Eiter, Georg Gottlob, and Kazuhisa Makino. 2003. New Results on Monotone Dualization and Generating Hypergraph Transversals. SIAM J. Comput. 32 (2003), 514–537. https://doi.org/10.1137/S009753970240639X
[27] Thomas Eiter, Kazuhisa Makino, and Georg Gottlob. 2008. Computational Aspects of Monotone Dualization: A Brief Survey. Discrete Applied Mathematics 156 (2008), 2035–2049. https://doi.org/10.1016/j.dam.2007.04.017
[28] Khaled M. Elbassioni, Matthias Hagen, and Imran Rauf. 2008. Some Fixed-Parameter Tractable Classes of Hypergraph Duality and Related Problems.
In Proceedings of the 3rd International Symposium on Parameterized and Exact Computation (IPEC). 91–102. https://doi.org/10.1007/978-3-540-79723-4_10
[29] Jörg Flum and Martin Grohe. 2006. Parameterized Complexity Theory. Springer, Heidelberg, Germany, and New York City, NY, USA.
[30] Fedor V. Fomin, Fabrizio Grandoni, Artem V. Pyatkin, and Alexey A. Stepanov. 2008. Combinatorial Bounds via Measure and Conquer: Bounding Minimal Dominating Sets and Applications. ACM Transactions on Algorithms 5 (2008), 9:1–9:17. https://doi.org/10.1145/1435375.1435384
[31] Michael L. Fredman and Leonid Khachiyan. 1996. On the Complexity of Dualization of Monotone Disjunctive Normal Forms. Journal of Algorithms 21 (1996), 618–628. https://doi.org/10.1006/jagm.1996.0062
[32] Vincent Froese, René van Bevern, Rolf Niedermeier, and Manuel Sorge. 2016. Exploiting Hidden Structure in Selecting Dimensions That Distinguish Vectors. J. Comput. System Sci. 82 (2016), 521–535. https://doi.org/10.1016/j.jcss.2015.11.011
[33] Andrew Gainer-Dewar and Paola Vera-Licona. 2017. The Minimal Hitting Set Generation Problem: Algorithms and Computation. SIAM Journal on Discrete Mathematics 31 (2017), 63–100. https://doi.org/10.1137/15M1055024
[34] Hector Garcia-Molina and Daniel Barbara. 1985. How to Assign Votes in a Distributed System. J. ACM 32 (1985), 841–860. https://doi.org/10.1145/4221.4223
[35] Chris Giannella and Catharine M. Wyss. 1999. Finding Minimal Keys in a Relation Instance. Online document: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7086.

[36] Goran Gogic, Christos H. Papadimitriou, and Martha Sideri. 1998. Incremental Recompilation of Knowledge. Journal of Artificial Intelligence Research 8 (1998), 23–37. https://doi.org/10.1613/jair.380
[37] Matthias Hagen. 2008. Algorithmic and Computational Complexity Issues of MONET. Cuvillier, Göttingen, Germany. https://cuvillier.de/de/shop/publications/1237-algorithmic-and-computational-complexity-issues-of-monet
[38] Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Anja Jentzsch, and Felix Naumann. 2013. Scalable Discovery of Unique Column Combinations. Proceedings of the VLDB Endowment 7 (2013), 301–312. https://doi.org/10.14778/2732240.2732248
[39] Russell Impagliazzo and Ramamohan Paturi. 2001. On the Complexity of k-SAT. J. Comput. System Sci. 62 (2001), 367–375. https://doi.org/10.1006/jcss.2000.1727
[40] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. 2001. Which Problems Have Strongly Exponential Complexity? J. Comput. System Sci. 63 (2001), 512–530. https://doi.org/10.1006/jcss.2001.1774
[41] David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. 1988. On Generating All Maximal Independent Sets. Inform. Process. Lett. 27 (1988), 119–123. https://doi.org/10.1016/0020-0190(88)90065-8
[42] Mamadou M. Kanté, Vincent Limouzy, Arnaud Mary, Lhouari Nourine, and Takeaki Uno. 2015. A Polynomial Delay Algorithm for Enumerating Minimal Dominating Sets in Chordal Graphs. In Proceedings of the 41st International Workshop on Graph-Theoretic Concepts in Computer Science (WG). 138–153. https://doi.org/10.1007/978-3-662-53174-7_11
[43] Leonid Khachiyan, Endre Boros, Khaled M. Elbassioni, and Vladimir Gurvich. 2007. A Global Parallel Algorithm for the Hypergraph Transversal Problem. Inform. Process. Lett. 101 (2007), 148–155. https://doi.org/10.1016/j.ipl.2006.09.006
[44] Benny Kimelfeld and Yehoshua Sagiv. 2008. Efficiently Enumerating Results of Keyword Search over Data Graphs.
Information Systems 33 (2008), 335–359. https://doi.org/10.1016/j.is.2008.01.002
[45] Sebastian Kruse, Thorsten Papenbrock, Hazar Harmouch, and Felix Naumann. 2016. Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin 39 (2016), 8–20. http://sites.computer.org/debull/A16june/p8.pdf
[46] Eugene L. Lawler. 1972. A Procedure for Computing the K Best Solutions to Discrete Optimization Problems and Its Application to the Shortest Path Problem. Management Science 18 (1972), 401–405. http://www.jstor.org/stable/2629357
[47] David Maier. 1983. The Theory of Relational Databases. Computer Science Press, Rockville, MD, USA.
[48] Heikki Mannila and Kari-Jouko Räihä. 1987. Dependency Inference. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB). 155–158. http://dl.acm.org/citation.cfm?id=645914.671482
[49] Arnaud Mary and Yann Strozecki. 2019. Efficient Enumeration of Solutions Produced by Closure Operations. Discrete Mathematics & Theoretical Computer Science 21 (2019). https://doi.org/10.23638/DMTCS-21-3-22
[50] Kitty Meeks. 2019. Randomised Enumeration of Small Witnesses Using a Decision Oracle. Algorithmica 81 (2019), 519–540. https://doi.org/10.1007/s00453-018-0404-y
[51] Keisuke Murakami and Takeaki Uno. 2014. Efficient Algorithms for Dualizing Large-Scale Hypergraphs. Discrete Applied Mathematics 170 (2014), 83–94. https://doi.org/10.1016/j.dam.2014.01.012
[52] Rolf Niedermeier. 2006. Invitation to Fixed-Parameter Algorithms. Oxford University Press, Oxford, UK.
[53] Øystein Ore. 1962. Theory of Graphs. Colloquium Publications - American Mathematical Society, Vol. 38. American Mathematical Society, Providence, RI, USA.
[54] Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment 8 (2015), 1082–1093.
https://doi.org/10.14778/2794367.2794377
[55] Thorsten Papenbrock and Felix Naumann. 2016. A Hybrid Approach to Functional Dependency Discovery. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD). 821–833. https://doi.org/10.1145/2882903.2915203
[56] Thorsten Papenbrock and Felix Naumann. 2017. A Hybrid Approach for Efficient Unique Column Combination Discovery. In Proceedings of the 17th Datenbanksysteme für Business, Technologie und Web (BTW). 195–204. https://dl.gi.de/20.500.12116/628
[57] Ronald C. Read and Robert E. Tarjan. 1975. Bounds on Backtrack Algorithms for Listing Cycles, Paths, and Spanning Trees. Networks 5 (1975), 237–252. https://doi.org/10.1002/net.1975.5.3.237
[58] Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and Isao Shirakawa. 1977. A New Algorithm for Generating All the Maximal Independent Sets. SIAM J. Comput. 6 (1977), 505–517. https://doi.org/10.1137/0206036
[59] Joshua R. Wang and Ryan Williams. 2016. Exact Algorithms and Strong Exponential Time Hypothesis. In Encyclopedia of Algorithms, Ming-Yang Kao (Ed.). Springer, New York, NY, USA, 657–661. https://doi.org/10.1007/978-1-4939-2864-4_519
[60] Li-Pu Yeh, Biing-Feng Wang, and Hsin-Hao Su. 2010. Efficient Algorithms for the Problems of Enumerating Cuts by Non-decreasing Weights. Algorithmica 56 (2010), 297–312. https://doi.org/10.1007/s00453-009-9284-5