Counter Strike: Generic Top-Down Join Enumeration for Hypergraphs
Pit Fender Guido Moerkotte University of Mannheim University of Mannheim Mannheim, Germany Mannheim, Germany [email protected] [email protected]
ABSTRACT CUTCONSERVATIVE), two competitive top-down join enumeration Finding the optimal execution order of join operations is a crucial strategies[4,7,5]. task of today’s cost-based query optimizers. There are two ap- However, all four algorithms (DPCCP,TDMINCUTLAZY,TD- proaches to identify the best plan: bottom-up and top-down join MINCUTBRANCH,TDMINCUTCONSERVATIVE) are not ready enumeration. But only the top-down approach allows for branch- yet to be used in real-world scenarios because there exist two se- and-bound pruning, which can improve compile time by several vere deficiencies in all of them. First, as has been argued in several orders of magnitude while still preserving optimality. For both op- places, hypergraphs must be handled by any plan generator [2, 17, timization strategies, efficient enumeration algorithms have been 19]. Second, plan generators have to deal with outer joins and anti- published. However, there are two severe limitations for the top- joins [8, 17]. In general, these operators are not freely reorderable: down approach: The published algorithms can handle only (1) sim- some orderings produce wrong results. The non-inner join reorder- ple (binary) join predicates and (2) inner joins. Since real queries ing problem can be correctly reduced to hypergraphs [2, 13, 17]. may contain complex join predicates involving more than two rela- Consequently, Moerkotte and Neumann [13] extended DPCCP to tions, and outer joins as well as other non-inner joins, efficient top- DPHYP to handle hypergraphs. Since DPHYP is a bottom-up join down join enumeration cannot be used in practice yet. We develop enumeration algorithm, it cannot benefit from branch-and-bound a novel top-down join enumeration algorithm that overcomes these pruning. On the other hand, branch-and-bound pruning can signifi- two limitations. Furthermore, we show that our new algorithm is cantly speed up plan generation [3, 7], while still guaranteeing plan competitive when compared to the state of the art in bottom-up pro- optimality. cessing even without playing out its advantage by making use of its In this paper, we present a novel generic framework that can be branch-and-bound pruning capabilities. used by any existing partitioning algorithm for top-down join enu- meration to efficiently handle hypergraphs. The central idea is to smartly convert hypergraphs to simple graphs and introduce effec- 1. INTRODUCTION tive means to avoid inefficiencies. This way, any existing parti- For a DBMS that provides support for a declarative query lan- tioning algorithm for simple graphs can be used. We show that guage like SQL, the query optimizer is a crucial piece of software. TDMCBHYP, resulting from instantiating our framework with the The declarative nature of a query allows it to be translated into partitioning algorithm MINCUTBRANCH, is more efficient than ex- many equivalent evaluation plans. Essential for the execution costs isting partitioning algorithms for hypergraphs and as efficient as of a plan is the order of join operations, since the runtime of plans DPHYP even without pruning. With pruning, TDMCBHYP out- with different join orders can vary by several orders of magnitude. performs DPHYP by a factor of 1.1 − 11.5. The search space considered here consists of all bushy join trees This paper is organized as follows. Sec. 2 recalls some pre- without cross products [16]. liminaries. Sec. 3 shows a naive approach called TDBASICHYP In principle, there are two approaches to find an optimal join for handling hyperedges. Sec. 4 presents our generic framework. order: bottom-up join enumeration via dynamic programming and Sec. 5 contains the experimental evaluation, and Sec. 6 concludes top-down join enumeration through memoization. Both approaches the paper. face the same challenge: to efficiently find for a given set of rela- tions all partitions into two subsets, such that both induce connected 2. PRELIMINARIES subgraphs and there exists an edge connecting the two subgraphs. Before we give the formal definitions necessary for our algo- Currently, the following algorithms have been proposed: DPC- rithm, let us demonstrate by means of a very simple example why CP, an efficient dynamic programming-based algorithm [12], TD- hypergraphs (apart from the case where join predicates span more MINCUTLAZY [3], as well as TDMINCUTBRANCH and TDMIN- than two relations) are necessary when reordering more than plain joins. Consider the query Permission to make digital or hard copies of all or part of this work for select * from (R0 left outer join R1 on R0.A = R1.B) personal or classroom use is granted without fee provided that copies are full outer join R2 on R1.C = R2.D not made or distributed for profit or commercial advantage and that copies In a first step, it is translated into an initial operator tree: bear this notice and the full citation on the first page. To copy otherwise, to (R0 = R1) = R2 republish, to post on servers or to redistribute to lists, requires prior specific R0.A R1.B R1.C R2.D permission and/or a fee. Articles from this volume were invited to present For this query, no valid reordering is possible. To prevent reorder- their results at The 39th International Conference on Very Large Data Bases, ing, conflicts need to be detected and represented. At the core of August 26th - 30th 2013, Riva del Garda, Trento, Italy. TES Proceedings of the VLDB Endowment, Vol. 6, No. 14 every conflict presentation is a set of relations, called , associ- Copyright 2013 VLDB Endowment 2150-8097/13/14... $ 10.00. ated with each operator in the initial operator tree [13, 17, 11]. To
1822 describe the calculation of TES is beyond the current paper, but the DEFINITION 4. Let H =(V,E) be a connected hypergraph, intuition behind it is rather simple: before an operator can be ap- then we call a hyperedge (v, w) an articulation hyperedge if re- plied to join two subplans, all relations in the TES must be present in moving (v, w) from E would disconnect the graph H. the two subplans. For our example, we have TES( R0.A=R1.B )= All edges of the hypergraph shown in Fig. 5(a) are articula- {R0,R1} and TES( R1.C=R2.D)={R0,R1,R2}. For non- commutative operators, it is important to distinguish between the tion hyperedges. The graph given in Fig. 4(a) has no complex ({ } { }) relations contained in the left and right branch of the initial operator articulation hyperedges, but two simple ones: R0 , R2 and ({ } { }) tree 1. Intersection of the TES with the set of relations contained in R1 , R2 . The graph in Fig. 4(b) has none. We observe that the left and right branch of the operator tree gives a pair (L-TES,R- an articulation hyperedge cannot be part of any cycle. Hence, we call a hypergraph whose complex hyperedges are all articulation TES)ofsetsofrelations.For p1,2 ,thispairis({R0,R1}, {R2}). As we will see, this is a complex hyperedge. hyperedges a complex cycle-free hypergraph. Example cycle-free graphs are shown in Fig. 4(c) and Fig. 16(a). 2.1 Hypergraphs 2.2 Connected Subgraph and Its Complement Let us begin with the definition of hypergraphs. Pairs DEFINITION 1. A hypergraph is a pair H =(V,E) such that Our focus is on determining an optimal join order for a given query. The execution order of join operations is specified by an 1. V is a non-empty set of nodes, and operator tree of the physical algebra. For our purposes, we want to abstract from that representation and give the notion of a join 2. E is a set of hyperedges, where a hyperedge is an unordered tree pair (v, w) of non-empty subsets of V (v ⊂ V and w ⊂ V ) . A join tree is a binary tree where the leaf nodes specify the with the additional condition that v ∩ w = ∅. relations referenced in a query, and the inner nodes specify the two- way join operations. The edges of the join tree represent sets of We call any non-empty subset of V a hypernode. We assume that joined relations. Two input sets of relations that qualify for a join so the nodes in V are totally ordered via an (arbitrary) relation ≺. that no cross products need to be considered are called a connected A hyperedge (v, w) is simple if |v| = |w| =1. A hypergraph subgraph and its complement pair (ccp) [12]. is simple if all its hyperedges are simple. We call all non-simple hyperedges complex hyperedges and all non-simple hypergraphs DEFINITION 5. Let H =(V,E) be a connected hypergraph, complex hypergraphs. (S1,S2) is a connected subgraph and its complement pair (or ccp for short) if the following holds: Take a look at the complex hypergraph in Fig. 4(a) with V = • 1 1 ⊂ {R0,R1,R2,R3}. Here, we have two simple edges ({R0}, S with S V induces a connected graph H|S1 , { }) ({ } { })) ({ } • ⊂ R2 , R1 , R2 and two complex hyperedges R0,R2 , S2 with S2 V induces a connected graph H|S2 , {R3}), ({R1,R2}, {R3}). Fig. 4(b) depicts a simple hypergraph. • S1 ∩ S2 = ∅, and To decompose a join ordering problem represented as a hyper- •∃( ) ∈ | ⊆ ∧ ⊆ graph into smaller problems, we need the notion of subgraph. More v, w E v S1 w S2. specifically, we only deal with node-induced subgraphs. The set of all possible ccps is denoted by Pccp. We introduce the DEFINITION 2. Let H =(V,E) be a hypergraph and V ⊆ notion of ccp for a set to specify all those pairs of input sets that result in the same output set, if joined. V a subset of nodes. The node-induced subgraph H|V of H is defined as H|V =(V ,E ) with E = {(v, w)|(v, w) ∈ E,v ⊆ DEFINITION 6. Let H =(V,E) be a connected hypergraph V ,w ⊆ V }. The node ordering on V is the restriction of the and S a set with S ⊆ V that induces a connected subgraph H| . node ordering of V . S For S1,S2 ⊂ V , (S1,S2) is called a ccp for S if (S1,S2) is a ccp ∪ = Next, we define connectedness. and S1 S2 S holds. ( ) ( ) DEFINITION 3. Let H =(V,E) be a hypergraph. H is con- Note that if S1,S2 is a ccp for S,then S2,S1 is one as well. ( ) nected if |V | =1or if there exists a partitioning V ,V of V and We call them symmetric pairs. By Pccp S , we denote the set of ({ }) a hyperedge (v, w) ∈ E such that v ⊆ V , w ⊆ V , and both all ccps for S. Pccp R0,R1,R2,R3 for the hypergraph of Fig. 4(a) consists of 6 ccps: {({R0}, {R1,R2,R3}), ({R0,R1,R2}, H|V and H|V are connected. {R3}), ({R0,R2,R3}, {R1})} (symmetric counter pairs left out). | The node-induced subgraph H {R0,R2,R3} gained from the hyper- graph of Fig. 4(a) with E = {({R0}, {R2}), ({R0,R2}, {R3}) is 2.3 Neighborhood | = ∅ connected, whereas H {R0,R1,R3} with E is not. The main idea to generate ccps is to incrementally expand con- If H =(V,E) is a hypergraph and V ⊆ V is a subset of nected subgraphs by considering new nodes in the neighborhood the nodes such that the node-induced subgraph H|V is connected, of a subgraph. we call V a connected subgraph or csg for short. The number We start with the definition of a simple neighborhood that relies of connected subgraphs is important: it directly corresponds to the only on simple edges and returns one set of vertices. number of entries in the memotable. DEFINITION 7. Let H =(V,E) be a connected hypergraph We assume that all hypergraphs used here are connected. If not, and C be a subset of V . Then, the simple neighborhood of C ⊆ V we introduce complex hyperedges with a selectivity of one that join is defined as: two disconnected subgraphs at a time. N ( )={ | ∈ ∧ ( ) ∈ ∧ ⊂ ∧ For our framework, the notion of an articulation hyperedge is s C x x w v, w E v C ⊂ ( \ ) ∧|| =1 ∧| | =1} essential. We give its definition. w V C v w . We now give the definition of neighborhood for all edges, in- 1For commutative operators it does not harm. cluding hyperedges.
1823 DEFINITION 8. Let H =(V,E) be a connected hypergraph, recursive routine TDPGSUB for the whole set V of nodes. TDPG- S a set of nodes (S ⊆ V ) such that H|S is connected. Then the SUB checks for the presence of an already derived best plan for any neighborhood of C ⊂ S is defined as: input set of nodes S. If such a plan does not exist, TDPGSUB it- N (S, C)={w | (v, w) ∈ E ∧ v ⊆ C ∧ w ⊆ S \ C}. erates over all ccps (S1,S2) of S. If an operator is applicable (see below), the subroutine BUILDTREE generates the according plans For the hypergraph of Fig. 4(a) with S = V = {R0,R1,R2,R3}, and adds them to the lookup table (Fig. 2). BUILDTREE also con- N (S, {R0,R2})={{R1}, {R3}} and Ns({R0,R2})={R1} siders interesting orders the usual way [15, 14]. The applicability holds. Furthermore, N (S, {R3})={{R0,R2}, {R1,R2}} and test (Line 4) includes L-TES ⊆ S1 ∧ R-TES ⊆ S2 and ensures cor- Ns({R3})=∅ holds. rectness of the generated plan [11]. Whereas Line 2 declaratively specifies the set of ccps to be con- 2.4 Path, Cycles and Compound Relations sidered, any real implementation must provide a procedure to gen- The following two definitions are important for the description erate them explicitly. This is the third and exchangeable part of of the structure of a simple hypergraph. TDPLANGENHYP. One possibility is the naive partioning algo- rithm (Fig. 3). In its Line 1, all 2|S| − 2 possible non-empty DEFINITION 9. Let G =(V,E) be a simple hypergraph, then and proper subsets of S are enumerated (see [20] for the efficient ∗ a path x → y with the length l between vertices x and y is defined enumeration). Three conditions have to be met so that a partition = as a sequence of vertices z0,z1,z2, ..., zl in V such that x z0 (C, S \ C) is a ccp. We check the connectivity of H|C and H|S\C and y = zl and ({zi−1}, {zi}) ∈ E for i =1, 2, ...l. in line 2 (for a connection test see [6]). The third condition that C needs to be connected to S \ C is implied by the requirement that With the definition of a path, we can define cycles. the (sub)graph handed over as input is connected. The frequent failure of this test is the main source of inefficiency of the basic =( ) DEFINITION 10. Let H V,E be a simple hypergraph. partitioning algorithm. Then a cycle is a path z0,z1,z2, ..., zl with ∀0≤i≤lzi ∈ V where TDPLANGENHYP(H, O) z0 = zl holds. £ Input: connected H =(V,E), O set of operators £ Output: an optimal join tree for H DEFINITION 11. Let H =(V,E) be a simple hypergraph. A 1 for i ← 1 to |V | node a ∈ V is an articulation vertex iff there exist two vertices ∗ 2 BestTree[{Ri}] ← Ri x ∈ V and y ∈ V , such that every path x → y in V contains a. 3 return TDPGSUB(V )
TDPGSUB(S, O) The articulation vertices of a connected simple hypergraph H = £ (V,E) are important when determining the biconnected compo- Input: H|S connected £ nents of a simple hypergraph. Output: an optimal join tree for H|S 1 if BestTree[S]=NULL 2 for all (S1,S2) ∈ P (S) DEFINITION 12. Let H =(V,E) be a connected simple hy- ccp 3 for ◦∈O pergraph. A is a connected subgraph biconnected component 4 if APPLICABLE(◦,S1,S2) BCC =( ) = { | ( = ∨ = ) ∧ Hi Vi,Ei of H with Hi v v u v w 5 BUILDTREE(◦, TDPGSUB(S1), TDPGSUB(S2)) (v, w) ∈ Ei}, where the set of edges Ei ⊆ E is maximal such that 6 return BestTree[S] any two distinct edges (u, w) ∈ Ei and (x, y) ∈ Ei lie on a cycle Figure 1: Pseudocode for TDPLANGENHYP = ∧ = ∧ = ∧ = ∧ v0,v1,v2, ..., vl ,whereu v0 u vl w v1 x vj−1 4. GRAPH-BASED JOIN ENUMERATION y = v ∧ 0