
Symmetric Relations and Cardinality-Bounded Multisets in Database Systems

Kenneth A. Ross Julia Stoyanovich

Columbia University∗

Abstract

In a binary symmetric relationship, A is related to B if and only if B is related to A. Symmetric relationships between k participating entities can be represented as multisets of cardinality k. Cardinality-bounded multisets are natural in several real-world applications. Conventional representations in relational databases suffer from several consistency and performance problems. We argue that the database system itself should provide native support for cardinality-bounded multisets. We provide techniques to be implemented by the database engine that avoid the drawbacks, and allow a schema designer to simply declare a table to be symmetric in certain attributes. We describe a compact data structure, and update methods for the structure. We describe an algebraic symmetric closure operator, and show how it can be moved around in a query plan during query optimization in order to improve performance. We describe indexing methods that allow efficient lookups on the symmetric columns. We show how to perform database normalization in the presence of symmetric relations. We provide techniques for inferring that a view is symmetric. We also describe a syntactic SQL extension that allows the succinct formulation of queries over symmetric relations.

∗ This research was supported by NSF grants IIS-0120939 and IIS-0121239.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

1 Introduction

A relation $R$ is symmetric in its first two attributes if $R(x_1, x_2, \ldots, x_n)$ holds if and only if $R(x_2, x_1, \ldots, x_n)$ holds. We call $R(x_2, x_1, \ldots, x_n)$ the symmetric complement of $R(x_1, x_2, \ldots, x_n)$. Symmetric relations come up naturally in several contexts when the real-world relationship being modeled is itself symmetric.

Example 1.1 In a law-enforcement database recording meetings between pairs of individuals under investigation, the “meets” relationship is symmetric. □

Example 1.2 Consider a database of web pages. The relationship “X is linked to Y” (by either a forward or backward link) between pairs of web pages is symmetric. This relationship is neither reflexive nor antireflexive, i.e., “X is linked to X” is neither universally true nor universally false. While an underlying relation representing the direction of the links would normally be maintained, a view defining the “is linked to” relation would be useful, allowing the succinct specification of queries involving a sequence of undirected links. □

Example 1.3 Views that relate entities sharing a common property, such as pairs of people living in the same city, will generally define a symmetric relation between those entities. □

Example 1.4 Example 1.1 can be generalized to allow meetings of up to k people. The k-ary meeting relationship would be symmetric in the sense that if $P = (p_1, \ldots, p_k)$ is in the relationship, then so is any column-permutation of $P$. □

Example 1.5 Consider a database recording what television channel various viewers watch most during the 24 hourly timeslots of the day.¹ For performance reasons,² the database uses a table $V(ID, ViewDate, C_1, \ldots, C_{24})$ to record the viewer (identified by ID), the date, and the twenty-four channels most watched, one channel for each hour of the day. This table $V$ is not symmetric, because $C_i$ is not interchangeable with $C_j$: $C_i$ reflects what the viewer was watching at timeslot number $i$. Nevertheless, there are interesting queries that could be posed for which this semantic difference is unimportant. An example might be “Find viewers who have watched channels 2 and 4, but not channel 5.” For these queries, it could be beneficial to treat $V$ as a symmetric relation in order to have access to query plans that are specialized to symmetric relations. □

¹ This example is based on a real-world application developed by one of the authors, in which there were actually 96 fifteen-minute slots.

² A conventional representation as a set of slots would require a 24-way join to reconstruct $V$.

There is a natural correspondence between symmetric relationships among k entities, and k-element multisets.³ We phrase our results in terms of “symmetric relations” to emphasize the column-oriented nature of the data representation in which columns are interchangeable. Nevertheless, our results are equally valid if expressed in terms of “bounded-cardinality multisets”.

³ A multiset is a set except that duplicates are allowed.

Sets and multisets have a wide range of uses for representing information in databases. Bounded-cardinality multisets would be useful for applications in which there is a natural limit to the size of multisets. This limit could be implicit in the application (e.g., the number of players in a baseball team), or defined as a conservative bound (e.g., the number of children belonging to a parent). We will demonstrate performance advantages for bounded-element multisets compared with conventional relational representations of (unbounded) multisets.

Storing a symmetric relation in a conventional database system can be done in a number of possible ways. Storing the full symmetric relation induces some redundancy in the database: more space is required (up to a factor of k! for k-ary relationships), and integrity constraints need to be enforced to ensure consistency of updates. Updates need to be aware of the symmetry of the table, and to add the various column permutations to all insertions and deletions. Queries need to perform I/O for tuples and their permutations, increasing the time needed for query processing.

Alternatively, a database schema designer could recognize that the relation was symmetric and code database procedures to store only one representative for each group of permuted tuples. A view can then be defined to present the symmetric closure of the stored relation for query processing. The update problem remains, because updates through this view would be ambiguous. Updates to the underlying table would need to be aware of the symmetry, to avoid storing multiple permutations of a tuple, and to perform a deletion correctly. For symmetric relations over k columns, just defining the view (using standard SQL) requires a query of length proportional to k(k!).

For both of the above proposals, indexed access to an underlying symmetric relationship would require multiple index lookups, one for each symmetric column.

A third alternative is to model a symmetric relation as a set [3] or multiset. Instead of recording both $R(a, b, c, d, e)$ and $R(b, a, c, d, e)$, one could record $R'(q, c, d, e)$, $S(a, q)$, and $S(b, q)$, where $q$ is a new surrogate identifier, and $R'$ and $S$ are new tables. The intuition here is that $q$ represents a multiset, of which $a$ and $b$ are members according to table $S$. Distinct members of the multiset can be substituted for the first two arguments of $R$. To represent tuples that are their own symmetric complements, such as $R(a, a, c, d, e)$, one inserts $S(a, q)$ twice. This representation uses slightly more space than the previous proposal, while not resolving the issue of keeping the representation consistent under updates. Further, reconstructing the original symmetric relation requires joins.

We argue that none of these solutions is ideal, and that the database system should be responsible for providing a “symmetric” table type. There are numerous advantages to such a scheme:

1. The database system could choose a compact representation (such as storing one member of each pair of symmetric tuples) and take advantage of this compactness in reducing the amount of I/O required. This representation can be used both for base tables that are identified as symmetric, and for materialized views that can be proven to be symmetric.

2. The database system could go even further, and add a symmetric-closure operator to the query algebra. A query plan over a symmetric relation could then be manipulated using algebraic identities so that the symmetric closure is applied as late as possible. That way, intermediate results will be smaller, and queries will be processed more efficiently.

3. Integrity would be checked by the database system. Single-row updates would be automatically propagated to the other column permutations if necessary. Inconsistencies would be avoided, and schema designers would not have to re-implement special functionality for each symmetric table in the database.

4. The database system could index the multiple columns of a symmetric relation in a single index structure. As a result, only one index traversal is necessary to locate tuples with a given value for some symmetric column.

In this paper, we propose techniques to enable such a “symmetric relation” table type. We provide:

• An underlying abstract data type to store the kernel of a symmetric relation, i.e., a particular nonredundant subset of the relation. We show how updates on this data type would be handled by the database system. We describe how relational normalization techniques should take account of symmetric relations during database design. Both normalization and the proposed representation of symmetric relations aim to remove redundancy, so combining these two approaches should be beneficial.

• An extension of the relational algebra with a symmetric closure operator $\gamma$. We show how to translate a query over a symmetric relation into a query involving $\gamma$ applied to the kernel of the relation. We provide algebraic equivalences that allow the rewriting of queries so that work can be saved by applying $\gamma$ as late as possible.

• A method for inferring when a view is guaranteed to be symmetric. By using this method, the database system has the flexibility to store a materialized view using the more compact representation.

• A syntactic extension to SQL that allows the succinct expression of queries over symmetric relations.

Related Work

Surprisingly, there has been little past work on specialized implementations of symmetric relations (or bounded-cardinality set/multisets) within the database system. The only literature we are aware of

The expressive power of cardinality-bounded sets has been previously studied in the context of an object-based data model [4, 5].

2 The Kernel

Definition 2.1 $\gamma_{XY}(R)$ denotes the symmetric closure operator over symmetric attributes $X$ and $Y$⁴ of relation $R(X, Y, Z_1, \ldots, Z_n)$. $(x, y, z_1, \ldots, z_n) \in \gamma_{XY}(R)$ if and only if either $(x, y, z_1, \ldots, z_n) \in R$ or $(y, x, z_1, \ldots, z_n) \in R$. □

If $R$ is symmetric with respect to $X$ and $Y$, then we aim to determine a minimal relation $M$ such that $R = \gamma_{XY}(M)$. By choosing a minimal $M$, we can represent $R$ compactly. Several minimal relations $M$ satisfy this constraint. Each such $M$ chooses a particular element from each pair of complementary tuples.

While the choice of minimal relation $M$ does not matter in terms of space consumption, we shall see that certain algebraic equivalences (such as Lemma 3.4 below) hold only if there is a consistent single choice of $M$ for all tables. Thus, we impose a total order (which may be arbitrary) on the domain of $X$ and $Y$, and insist that the representative tuple chosen has $X \le Y$ according to this order. The resulting relation is unique, and is denoted by $\ker_{XY}(R)$, or just $\ker(R)$ when $X$ and $Y$ are clear from context. $\ker_{XY}(R) = \sigma_{X \le Y}(R)$.

We propose that the database stores $\ker(R)$ as the internal representation of $R$. Assuming a set semantics (as opposed to a multiset or bag semantics) for symmetric relations, updates are handled as follows:

    Insert ( R(X,Y,Z1,...,Zn) )
    {
      If (Y
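As an illustration of the kernel idea, here is a minimal Python sketch of a stored kernel under set semantics. It assumes that tuples are plain Python tuples whose first two positions are the symmetric attributes X and Y, and that Python's built-in ordering plays the role of the imposed order. The function names (`normalize`, `ker`, `closure`, `insert`, `delete`) are illustrative choices for this sketch, not names taken from the paper, and the update logic is one plausible reading of set-semantics updates rather than the authors' procedure.

```python
def normalize(row):
    """Return the representative of a tuple and its complement (the one with X <= Y)."""
    x, y, *rest = row
    return (min(x, y), max(x, y), *rest)

def ker(rows):
    """Kernel of a symmetric relation: select the tuples with X <= Y."""
    return {row for row in rows if row[0] <= row[1]}

def closure(kernel):
    """Symmetric closure: each kernel tuple plus its complement (set semantics)."""
    out = set()
    for x, y, *rest in kernel:
        out.add((x, y, *rest))
        out.add((y, x, *rest))  # same tuple when x == y, so no duplicate arises
    return out

def insert(kernel, row):
    """Insert into the stored kernel; either column order of the input is accepted."""
    kernel.add(normalize(row))

def delete(kernel, row):
    """Delete from the stored kernel; either column order of the input is accepted."""
    kernel.discard(normalize(row))
```

With this sketch, inserting `(7, 3, 'cafe')` and then `(3, 7, 'cafe')` leaves a single stored tuple, and `ker(closure(K)) == K` holds for any kernel `K` built through `insert`.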

The formalism above allows multiple disjoint pairs of symmetric attributes. Thus, if $R$ is symmetric in $X, Y$ and also symmetric in $V, W$, it makes sense to talk about $\ker_{XY}(R)$, $\ker_{VW}(R)$, and $\ker_{XY}(\ker_{VW}(R)) = \ker_{VW}(\ker_{XY}(R))$. We can also generalize symmetry to more than two attributes.

Definition 2.2 A relation $R(Z_1, \ldots, Z_n)$ is symmetric in $Z_1, \ldots, Z_k$ when $R(Z_1, \ldots, Z_n)$ holds if and only if for every permutation $P$ of $Z_1, \ldots, Z_k$, $R(P(Z_1), \ldots, P(Z_k), Z_{k+1}, \ldots, Z_n)$ holds. Each such $R(P(Z_1), \ldots, P(Z_k), Z_{k+1}, \ldots, Z_n)$ is a symmetric complement of $R(Z_1, \ldots, Z_n)$. We define $\ker_{Z_1, \ldots, Z_k}(R)$ to include only those tuples from $R$ with $Z_1 \le \ldots \le Z_k$. □

Indexing

Indexing of all symmetric attributes in $\ker(R)$ should be done in a single index structure, so that a single index lookup suffices to find tuples with some symmetric attribute equal to a given probe value.

2.1 Normalization

Database normalization and the proposed kernel representation both aim to remove redundancy. However, normalization may be hampered by the presence of symmetry in the data.

Example 2.1 Consider a database describing meetings of pairs of people that take place in certain locations at certain times, as in Example 1.1. Suppose that the initial database design has the schema $U(P_1, P_2, L, D, T, A)$, where $P_1$ and $P_2$ are the parties, $L$ is the location, $D$ is the date, and $T$ is the time. $A$ is a law-enforcement agent assigned to monitor the meeting, and multiple agents can be assigned to a single meeting. The schema designer is aware that the database system provides facilities for symmetric relations, and wishes to take advantage of these facilities by declaring $U$ to be symmetric in $P_1, P_2$.

Suppose that there can be only one meeting that takes place in a given location on a given date and time. The symmetric redundancy prevents the expression of functional dependencies having $LDT$ on the left hand side. As a result, the “obvious” normalization of the table into the meets relation $M(P_1, P_2, L, D, T)$ and the monitors relation $S(L, D, T, A)$ is missed. □

The solution to the problem identified in Example 2.1 is to apply the kernel first, and then try to normalize the result using standard normalization techniques. In Example 2.1, it is possible to identify the functional dependency $LDT \to P_1 P_2$ in $\ker_{P_1 P_2}(U)$. This functional dependency allows the normalization of $\ker_{P_1 P_2}(U)$ into $\ker_{P_1 P_2}(M)$ and $S$; $M$ is represented as a symmetric relation.

2.2 Implementation

It is straightforward to implement the $\gamma$ operator. For each input tuple, output that tuple in addition to tuples formed by permuting the symmetric attributes (but don't output a tuple twice if two permutations generate the same tuple). However, in a practical database system, the mapping from algebraic operators to implementations is not necessarily a direct one. For example, it is common to implement a scan operator with predicates, so that the getnext method returns the next row satisfying the predicates. This choice allows the scan operator to choose an appropriate access structure, such as an index if one exists.

In a similar way, the natural implementation of symmetric closure should also incorporate predicates on the symmetric attributes. The predicates allow for the efficient use of available access methods, and may avoid the generation of permutations that will be immediately filtered out. The predicates may come from selection operators or from join operators.

Example 2.2 Consider again the $M$ table from Example 2.1, in which the $P_i$ attributes store the identifiers of persons involved in a pairwise meeting. Suppose that $\sigma_{P_2=456}(M)$ is a subexpression of a query to be evaluated. Let $K = \ker(M)$ be stored by the database, so that the subexpression can be evaluated as $\sigma_{P_2=456}(\gamma_{P_1 P_2}(K))$. Suppose also that we store a single index structure for the columns $P_1$ and $P_2$. For simplicity of presentation, assume that the database knows that for all rows of $M$, $P_1 \ne P_2$.

Then by implementing an operator for the combined selection and symmetric closure, we can directly look up tuples in $K$ having 456 for either of the symmetric attributes, and for each match return the permutation with $P_2 = 456$. The alternative permutation is never generated.

If we implemented symmetric closure as a stand-alone operator, then the best we could do would be to rewrite $\sigma_{P_2=456}(\gamma_{P_1 P_2}(K))$ as $\sigma_{P_2=456}(\gamma_{P_1 P_2}(\sigma_{P_1=456 \lor P_2=456}(K)))$. (See Lemma 3.2 below.) The pushed selection conditions allow the use of the index on $K$. However, both permutations of each matching row in $K$ are generated, one of which will be filtered by the outer selection condition. □

In the general case for Example 2.2, it is possible that $P_1 = P_2$. A limited form of duplicate elimination would then be needed to avoid generating an output row twice from a single input row. Also observe that the problems highlighted by Example 2.2 become worse for symmetric relations over more than two attributes.

For a fixed number of symmetric columns, the symmetric closure operator can be expressed in relational algebra in terms of the union and attribute-renaming operators. Thus neither $\gamma$ nor the kernel operator add

to the expressive power of relational algebra. Nevertheless, by abstracting the $\gamma$ operator one can derive implementations directly for $\gamma$ (or $\gamma$ together with selection). The situation is analogous to the join operation which, though expressible in terms of selection and cartesian product, is best implemented directly.

3 Query Optimization

Given a query that mentions a symmetric relation $R$, we assume that we have physically stored just $K = \ker(R)$. In an algebraic expression for a query that accesses $R$, we use $\gamma(K)$ in place of $R$.

In order to minimize the size of intermediate results, it would be beneficial to push other operators inside the symmetric closure operator $\gamma$, where possible. To support such an endeavor, we now describe algebraic equivalences that can form the basis of such rewriting rules. For simplicity of presentation, we phrase these rules for binary symmetric relations. Generalizations to higher symmetric arity are possible.

For the following results, we assume that $S_1$ and $S_2$ are arbitrary relations with attributes including $X$ and $Y$, such that all rows satisfy $X \le Y$. $T$ represents an arbitrary relation that does not have attributes $X$ or $Y$. Except for Lemma 3.6, the equivalences hold under both a set semantics and a multiset semantics (in which duplicate rows are permitted) for relations.

Definition 3.1 Let $\theta$ be a condition on $X$ and $Y$, and (possibly) other attributes. Let $\theta'$ be formed from $\theta$ by substituting $X$ for $Y$ and vice versa. We say that $\theta$ is a symmetric condition on $X$ and $Y$ if $\theta \equiv \theta'$. Given a nonsymmetric condition $\theta$, we call the condition $\theta \lor \theta'$ the symmetric closure of $\theta$, which we denote by $\hat{\theta}$ when the attributes $X$ and $Y$ are clear from context. □

Example 3.1 Symmetric selection conditions on $X$ and $Y$ include $X = Y$, $X^2 + Y^2 = 1$, and any condition that mentions neither $X$ nor $Y$. Symmetric join conditions on $R.X$ and $R.Y$ include $R.X = S.A \land R.Y = S.A$, $R.X^2 + R.Y^2 = S.A^2$, and conditions that do not mention $R.X$ or $R.Y$. The condition $R.X - R.Y > 7$ is not symmetric; its symmetric closure is $R.X - R.Y > 7 \lor R.Y - R.X > 7$. □

Symmetric conditions can be pushed below the symmetric closure.

Lemma 3.1 If $\theta$ is a symmetric condition, then

• $\sigma_\theta(\gamma_{XY}(S_1)) = \gamma_{XY}(\sigma_\theta(S_1))$

• $\gamma_{XY}(S_1) \bowtie_\theta T = \gamma_{XY}(S_1 \bowtie_\theta T)$

□

Because the symmetric closure of a condition is always symmetric, Lemma 3.1 implies the following result, which allows us to push down partial information from selections on the symmetric attributes.

Lemma 3.2 For an arbitrary condition $\theta$,

• $\sigma_\theta(\gamma_{XY}(S_1)) = \sigma_\theta(\gamma_{XY}(\sigma_{\hat{\theta}}(S_1)))$

• $\gamma_{XY}(S_1) \bowtie_\theta T = \sigma_\theta(\gamma_{XY}(S_1 \bowtie_{\hat{\theta}} T))$

□

Lemma 3.3 Suppose $\theta$ is a condition that implies $X \le Y$. Then $\sigma_\theta(\gamma_{XY}(S_1)) = \sigma_\theta(S_1)$. □

Lemma 3.4

• $\gamma_{XY}(S_1) \cup \gamma_{XY}(S_2) = \gamma_{XY}(S_1 \cup S_2)$

• $\gamma_{XY}(S_1) \cap \gamma_{XY}(S_2) = \gamma_{XY}(S_1 \cap S_2)$

• $\gamma_{XY}(S_1) - \gamma_{XY}(S_2) = \gamma_{XY}(S_1 - S_2)$

• $\gamma_{XY}(S_1) \times T = \gamma_{XY}(S_1 \times T)$

□

Lemma 3.5 If attribute list $G$ includes both $X$ and $Y$, then $\pi_G(\gamma_{XY}(S_1)) = \gamma_{XY}(\pi_G(S_1))$. □

Lemma 3.6 Under a set semantics: (a) If attribute list $G$ includes neither $X$ nor $Y$, then $\pi_G(\gamma_{XY}(S_1)) = \pi_G(S_1)$. (b) If attribute list $G$ includes $X$ but not $Y$, and if $G'$ is the same as $G$ except that $X$ is replaced by $Y$, then $\pi_G(\gamma_{XY}(S_1)) = \pi_G(S_1) \cup \pi_{G'}(S_1)$. □

Definition 3.2 Let $A^{\vec{f}}_G(R)$ denote the aggregate of relation $R$, grouped by the columns in the list $G$, computing the aggregate functions $\vec{f}$. □

Lemma 3.7 If grouping attributes $G$ include both $X$ and $Y$, then $A^{\vec{f}}_G(\gamma_{XY}(S_1)) = \gamma_{XY}(A^{\vec{f}}_G(S_1))$. □

Aggregates grouping by $X$ alone or $Y$ alone can use Lemma 3.7 to first compute the aggregate grouped by $X$ and $Y$. Assuming that the aggregate functions are incrementally computable, the coarser aggregates can then be computed in a subsequent operation.

Lemma 3.8 Let $G$ be grouping attributes other than $X$ and $Y$, and let $\vec{f}$ contain just idempotent aggregates such as min and max. Then $A^{\vec{f}}_G(\gamma_{XY}(S_1)) = A^{\vec{f}}_G(S_1)$. □

It is tempting to think of analogous equivalences to those of Lemma 3.8 for other aggregates. However, a row in the kernel maps to either one or two rows in the symmetric closure, depending on whether the symmetric attributes have equal values. To take account of this difference, one can split the kernel into two fragments.

Lemma 3.9 Let $G$ be grouping attributes other than $X$ and $Y$, and let $\vec{f}$ contain just linear aggregates such as sum and count. Let $2\vec{f}$ denote the aggregate that computes double the aggregate functions $\vec{f}$. Then

916 f~ 2f~ If K = ker (M), then the query can be rewritten • AG(γXY (σX

need a special way of identifying the common member, then it pays (in terms of query execution time) to use the formulation Q2. The trick is to formulate the query using properties of the set of symmetric attributes where possible, because such conditions are always symmetric and can be better optimized.

Finally, note that the $overlap_{R,S} \ge 1$ test can be expressed alternatively as $r.X_1 = s.Y_1 \lor r.X_1 = s.Y_2 \lor r.X_2 = s.Y_1 \lor r.X_2 = s.Y_2$. Nevertheless, we advocate the use of a specialized overlap function, because (a) it simplifies the job of identifying symmetric conditions for the query compiler, (b) for symmetric relations of higher arity the equivalent logical expressions become unwieldy, and (c) the overlap function can be used to test other kinds of set-oriented relationships, such as disjointness and subset relationships.

The arguments in favor of a special overlap function for join conditions extend also to selection conditions. Definition 3.1 (the symmetric closure of a condition) can be extended to k-ary conditions by taking the disjunction of all expressions formed by permuting the symmetric columns. Thus, in a symmetric relation of arity k with symmetric columns $P_1, \ldots, P_k$, the closure of $P_1 = 123$ is $P_1 = 123 \lor \cdots \lor P_k = 123$. The closure of $P_1 = 123 \land P_2 = 456$ has $k(k-1)$ disjuncts. These expressions are unwieldy, and are likely to hide optimization alternatives from a realistic query optimizer. Instead, we propose to represent the symmetric closure of $P_1 = 123$ as “{123} Among {P1, . . . , Pk}” and the closure of $P_1 = 123 \land P_2 = 456$ as “{123, 456} Among {P1, . . . , Pk}”. This representation is compact, and can represent the closure of common conditions that equate an attribute with a constant. It also allows for easier recognition of efficient plans, such as using a common index on $\{P_1, \ldots, P_k\}$, by the query compiler.

4 Inferring Symmetry

Being able to infer that a subexpression is symmetric enables additional options for query optimization. Also, if we can infer that a materialized view is guaranteed to be symmetric, then we can choose to store it in the more compact form, saving space and query processing time.

To formulate the inference problem, we use the notion of a conjunctive query [9] to represent a view. An ordinary subgoal employs a table predicate, while a built-in subgoal employs an interpreted predicate, such as equality or “<”. An ordinary subgoal is symmetric if its predicate is a table that is marked as symmetric in the database schema. For ease of presentation we shall assume that we are dealing with binary symmetric relations whose symmetric attributes are the leftmost attributes as written.

Definition 4.1 Let

    $Q(X, Y, \vec{Z})$ :- $B(X, Y, \vec{Z}, \vec{W})$

be a conjunctive query, where $B$ is a conjunction of subgoals. $Q^T(X, Y, \vec{Z})$ is the conjunctive query defined by

    $Q^T(X, Y, \vec{Z})$ :- $B(Y, X, \vec{Z}, \vec{W})$

with $X$ and $Y$ interchanged in $B$. □

Note that $(Q^T)^T = Q$, and that containment mappings from $Q$ to $Q^T$ are isomorphic to containment mappings from $Q^T$ to $Q$.

Lemma 4.1 Let $Q$ be a conjunctive query containing nonsymmetric ordinary subgoals, and no built-in subgoals. $Q$ is symmetric if and only if there exists a containment mapping from $Q$ to $Q^T$. □

Example 4.1 Let $E(K, M)$ represent an employee relation, where $K$ is the unique key of the employee, and $M$ is the employee's manager. Let $Q$ be the conjunctive query

    $Q(K_1, K_2)$ :- $E(K_1, M), E(K_2, M)$.

$Q^T$ is then

    $Q^T(K_1, K_2)$ :- $E(K_2, M), E(K_1, M)$

and the identity mapping is a containment mapping from $Q$ to $Q^T$. We therefore conclude that $Q$ is symmetric in $K_1$ and $K_2$. □

When subgoals may themselves be symmetric, a simple containment mapping is not sufficient, as illustrated by the following example.

Example 4.2 Consider again table $E$ from Example 4.1. Let $S$ be a symmetric relation; think of $S(M_1, M_2)$ as meaning that $M_1$ and $M_2$ are siblings. Let $Q$ be the conjunctive query

    $Q(K_1, K_2)$ :- $E(K_1, M_1), S(M_1, M_2), E(K_2, M_2)$.

$Q$ is indeed symmetric. However, $Q^T$ is

    $Q^T(K_1, K_2)$ :- $E(K_2, M_1), S(M_1, M_2), E(K_1, M_2)$

and the identity mapping is not a containment mapping from $Q$ to $Q^T$. The mapping that interchanges $M_1$ and $M_2$ is not a containment mapping, because $S(M_1, M_2)$ in $Q$ maps to $S(M_2, M_1)$, which is not a subgoal of $Q^T$. □

Definition 4.2 Let $h$ be a symbol mapping from a conjunctive query $Q$ :- $B$ to a conjunctive query $Q'$ :- $B'$. We say that $h$ is a symmetric containment mapping from $Q$ to $Q'$ if $h(Q) = Q'$, and for every subgoal $S$ in $B$, either (a) $h(S)$ appears in $B'$, or (b) $S$ is symmetric and the symmetric complement of $h(S)$ appears in $B'$, or (c) $S$ is a built-in subgoal, and $h(S)$ is equivalent to a subgoal of $B'$. □
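For small queries, parts (a) and (b) of Definition 4.2 can be checked by exhaustive search. The following Python sketch is only an illustration under stated assumptions: a query is encoded as a pair `(head, body)`, where `head` is a tuple of variable names and `body` is a list of `(predicate, args)` atoms; built-in subgoals (part (c)) are not handled; and the names `complement`, `transpose`, and `find_mapping` are hypothetical helpers introduced here, not part of the paper.

```python
from itertools import product

def complement(atom):
    """Symmetric complement: swap the two leftmost arguments of a subgoal."""
    pred, args = atom
    return (pred, (args[1], args[0]) + tuple(args[2:]))

def transpose(head, body, x, y):
    """Build Q^T: keep the head, interchange x and y throughout the body."""
    swap = {x: y, y: x}
    return head, [(p, tuple(swap.get(a, a) for a in args)) for p, args in body]

def find_mapping(q, q2, symmetric=frozenset()):
    """Search for a symbol mapping h from q to q2 that satisfies parts (a)
    and (b) of Definition 4.2. Returns the mapping as a dict, or None."""
    (head, body), (head2, body2) = q, q2
    variables = sorted(set(head) | {a for _, args in body for a in args})
    targets = sorted(set(head2) | {a for _, args in body2 for a in args})
    target_atoms = set(body2)
    for image in product(targets, repeat=len(variables)):
        h = dict(zip(variables, image))
        if tuple(h[v] for v in head) != tuple(head2):
            continue  # h must map the head of q onto the head of q2
        for pred, args in body:
            mapped = (pred, tuple(h[a] for a in args))
            if mapped in target_atoms:
                continue  # part (a)
            if pred in symmetric and complement(mapped) in target_atoms:
                continue  # part (b)
            break
        else:
            return h
    return None

# Example 4.2: E is an ordinary table, S is declared symmetric.
head = ("K1", "K2")
body = [("E", ("K1", "M1")), ("S", ("M1", "M2")), ("E", ("K2", "M2"))]
q = (head, body)
qt = transpose(head, body, "K1", "K2")
```

On this encoding of Example 4.2, `find_mapping(q, qt)` finds nothing when `S` is treated as an ordinary table, while `find_mapping(q, qt, symmetric={"S"})` returns the mapping that interchanges `M1` and `M2`. The search is exponential in the number of variables, so it is suitable only for illustrating the definition.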

Unlike parts (a) and (b), part (c) of Definition 4.2 is not syntactic identity; it depends on the proof system available to demonstrate equivalence. Part (c) allows us to identify symmetric conditions (Definition 3.1) in a logic-based formalism.

Lemma 4.2 Let $Q$ and $Q'$ be conjunctive queries with ordinary subgoals that may be symmetric, and no built-in subgoals. $Q$ is contained in $Q'$ if and only if there exists a symmetric containment mapping from $Q'$ to $Q$. □

Lemma 4.3 Let $Q$ be a conjunctive query containing ordinary subgoals that may be symmetric, and no built-in subgoals. $Q$ is symmetric if and only if there exists a symmetric containment mapping from $Q$ to $Q^T$. □

Lemma 4.3 resolves the difficulty of Example 4.2, because the mapping that interchanges $M_1$ and $M_2$ is a symmetric containment mapping. As in the nonsymmetric case [9], when built-in subgoals are allowed we lose the “only-if” part of Lemma 4.3.

Lemma 4.4 Let $Q$ be a conjunctive query containing ordinary subgoals that may be symmetric, and built-in subgoals. $Q$ is symmetric if there exists a symmetric containment mapping from $Q$ to $Q^T$. □

4.1 Optimization using Inference

Suppose we can infer that a query subexpression is guaranteed to be symmetric. Then we can deliberately insert a “kernelization” operation paired with a symmetric closure operation, and move the predicates around to minimize the size of intermediate results. Thus we can benefit from the proposed query optimization techniques of Section 3 even if we do not have any stored kernels in the database.

Example 4.3 Consider again table $E$ from Example 4.1. We write $E_1$ and $E_2$ to distinguish two instances of $E$ in a single query, and we similarly subscript the attributes of $E$. Consider a query

    $(E_1 \bowtie_{M_1=M_2} E_2) \bowtie_{\theta_1} R_1 \cdots \bowtie_{\theta_m} R_m$.

Suppose that none of the $\theta_i$ conditions mention $K_1$ or $K_2$. We begin by inferring that the subexpression $(E_1 \bowtie_{M_1=M_2} E_2)$ is symmetric in $K_1$ and $K_2$; see Example 4.1. We can therefore rewrite the query as

    $\gamma_{K_1,K_2}(\sigma_{K_1 \le K_2}(E_1 \bowtie_{M_1=M_2} E_2)) \bowtie_{\theta_1} R_1 \cdots \bowtie_{\theta_m} R_m$.

By repeatedly applying Lemma 3.1, this is equivalent to

    $\gamma_{K_1,K_2}(\sigma_{K_1 \le K_2}(E_1 \bowtie_{M_1=M_2} E_2) \bowtie_{\theta_1} R_1 \cdots \bowtie_{\theta_m} R_m)$

which is more efficient than the original expression because the intermediate joins are smaller. □

5 Extending SQL

In this section, we extend SQL with features that allow the expression of bounded-cardinality multisets as database columns. Our extended SQL can be translated into the algebra described previously. The proposed syntactic constructs enable the succinct expression of queries that manipulate bounded multisets. Further, specialized syntax for commonly used operations can help the database system choose efficient query processing algorithms to execute the query [2, 11].

When creating a table, one may declare k columns of the same type to be a named multiset. This declaration serves two purposes. It provides a name for the group of attributes that can be used in writing queries. It also gives a hint to the database system to create an index on the union of all columns in the group. The multiset may optionally be declared to be symmetric, in which case the database system is free to permute the columns (e.g., to store the kernel) to make integrity constraint checking and query processing more efficient.

Example 5.1 In Example 1.4, a multiset Persons would be declared for the columns containing the (integer) identifiers of persons participating in the meeting, and Persons would be declared symmetric.

    Create Table M ( Meeting-id ,
                     Symmetric Multiset Persons
                       { P1, ..., Pk } integer,
                     ... )

In Example 1.5, a multiset Slots would be declared for the columns C1 through C24. Slots would not be declared symmetric.

    Create Table V ( ID integer,
                     ViewDate date,
                     Multiset Slots
                       { C1, ..., C24 } integer,
                     ... )

In these examples, users may query the attributes Pi and Ci directly as regular attributes, using standard SQL syntax. □

We introduce new “column variables” that are allowed to take values from any one of a set of columns. The original columns of a table are not permuted. This choice allows us to access a symmetric base table T directly in the conventional way, without forcing the query “Select * from T” to have k! copies of each tuple representing a k-element multiset. The scope of a column variable is defined using the Among keyword in the Where clause.⁶

⁶ The occurrences of column variables must be safe in the sense of [10].

Example 5.2 Consider Example 1.5 together with the sample query “Find all individuals who, on the given date, have watched channels 2 and 4, but not channel 5.” We would write this query as

    Select ID, ViewDate
    From V
    Where {X1,X2} Among Slots and X1=2 and X2=4
      and not ({5} Among Slots)

There is one row per ID and ViewDate in the output, even though there may be many possible combinations of slots satisfying the conditions in the Where clause. □

When we write {X1,X2} Among Slots it is implicit that X1 and X2 correspond to different columns within Slots. If X is a column variable, we use the syntax X.name to denote the column name of the column actually bound to X in the query. One can use the Among keyword for groups not explicitly defined as multisets by explicitly listing the columns, as in “{X1,X2} Among {Jan,Feb,Mar,Apr}”.

Example 5.3 Continuing Example 5.2, suppose that we include a column variable in the Select clause.

    Select ID, ViewDate, X1.name, X1
    From V
    Where {X1,X2} Among Slots and X1=2 and X2=4
      and not ({5} Among Slots)

Unlike before, there are multiple rows per ID/ViewDate in the output, one for each binding of X1 to a column whose value (together with some X2 value) satisfies the conditions of the Where clause. The column variables in the Select clause implicitly control duplicate elimination. Since only X1 is mentioned in the Select clause, there is one value output for each X1 column binding, irrespective of how many valid X2 values are present. □

Example 5.3 shows how to “unpivot” a k-element multiset from a column-based representation into a more traditional row-based representation. One could use variants of Example 5.3 to define views over which traditional SQL methods of set manipulation can be expressed. As a result, none of SQL’s expressive power for set manipulation has been lost by using a column-wise representation. We emphasize that since the unpivoted table is just a view, queries over the unpivoted table could be translated into queries over the original (pivoted) table, which may be more efficient because joins are not required.

Example 5.4 Consider Example 1.4 in which we have a meeting table M with k attributes P1, . . . , Pk grouped into a multiset called Persons. We wish to find all pairs of people X and Y at three degrees of separation. In other words, we need three meetings M1, M2, M3 such that X attended M1, Y attended M3, M1 and M2 have overlapping membership, and M2 and M3 have overlapping membership. We can write this query as

    Select X, Y
    From M M1, M M2, M M3
    Where {X} Among M1.Persons
      and {W} Among M1.Persons
      and {W} Among M2.Persons
      and {Z} Among M2.Persons
      and {Z} Among M3.Persons
      and {Y} Among M3.Persons

{W} Among M2.Persons and {Z} Among M2.Persons are written separately, meaning that W and Z may bind to the same column. Had we written {W,Z} Among M2.Persons, they would have to be different columns. One could also formulate the query succinctly using an “overlap” method, as discussed in Section 3.1:

    Select X, Y
    From M M1, M M2, M M3
    Where {X} Among M1.Persons
      and Overlap(M1.Persons,M2.Persons) >= 1
      and Overlap(M2.Persons,M3.Persons) >= 1
      and {Y} Among M3.Persons

□

Without the Among syntax, there would be no way to output values from multiple columns in a single select statement. One would need to form the union of k^2 select statements to express Example 5.4. A conventional set representation would require a six-way join to express Example 5.4.

When a symmetric multiset has fewer elements than the cardinality bound, the remaining columns are padded with NULLs. Column variables cannot be bound to NULL values.

We also advocate additional syntactic elements for directly expressing multisets formed as the intersection or difference of other multisets. (Note that the union of two k-bounded multisets is not necessarily a k-bounded multiset.)

The translation of the extended SQL into the extended algebra is relatively straightforward. When symmetric attributes are referenced using the Among keyword, the underlying relation has its symmetric columns copied into new columns. Some of these new columns correspond to the column variables. The symmetric closure operator is applied to the new columns to find combinations of values satisfying the conditions on column variables in the Where clause. An algebraic duplicate-elimination step is also needed, as is special handling for NULL values. After the query has been translated, it can be optimized and executed as outlined in Sections 2.2 and 3.
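The binding rules for column variables described in Examples 5.2 and 5.3 can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's implementation: the slot column names S1 to S4, the sample rows, and the helper names among_bindings, example_5_2 and example_5_3 are all hypothetical. It models three rules from the text: variables in one Among group bind to distinct columns, NULL columns are never bound, and the column variables in the Select clause determine how many rows are output.

```python
from itertools import permutations

# Hypothetical column names for the multiset "Slots"; the paper does not name them here.
SLOTS = ["S1", "S2", "S3", "S4"]

# Hypothetical sample data; None plays the role of NULL padding.
V = [
    {"ID": 1, "ViewDate": "d1", "S1": 2, "S2": 4, "S3": 2, "S4": None},
    {"ID": 2, "ViewDate": "d1", "S1": 2, "S2": 4, "S3": 5, "S4": None},
    {"ID": 3, "ViewDate": "d1", "S1": 4, "S2": 7, "S3": None, "S4": None},
]

def among_bindings(row, group, n_vars):
    """Yield every assignment of n_vars column variables to distinct,
    non-NULL columns of the group (the '{X1,...} Among group' condition)."""
    non_null = [c for c in group if row[c] is not None]
    yield from permutations(non_null, n_vars)

def has_five(row):
    """Evaluate '{5} Among Slots' for one row."""
    return any(row[x] == 5 for (x,) in among_bindings(row, SLOTS, 1))

def example_5_2(rows):
    """One output row per (ID, ViewDate), however many bindings qualify."""
    out = []
    for r in rows:
        ok = any(r[x1] == 2 and r[x2] == 4
                 for x1, x2 in among_bindings(r, SLOTS, 2))
        if ok and not has_five(r):
            out.append((r["ID"], r["ViewDate"]))
    return out

def example_5_3(rows):
    """One output row per qualifying binding of X1; X2 bindings are
    collapsed because X2 does not appear in the Select clause."""
    out = []
    for r in rows:
        if has_five(r):
            continue
        seen_x1 = set()
        for x1, x2 in among_bindings(r, SLOTS, 2):
            if r[x1] == 2 and r[x2] == 4 and x1 not in seen_x1:
                seen_x1.add(x1)
                out.append((r["ID"], r["ViewDate"], x1, r[x1]))
    return out

print(example_5_2(V))
print(example_5_3(V))
```

In this sample, the first row qualifies once under the rule of Example 5.2 but twice under Example 5.3, because the value 2 occurs in two different columns. A real engine would presumably obtain the same result through the symmetric closure operator rather than by enumerating bindings row by row.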

6 Experimental Evaluation

In this section we describe an experimental evaluation of various representations of multisets on a state-of-the-art commercial database system. We wish to demonstrate the qualitative performance characteristics of various representations. A comprehensive performance evaluation is beyond the scope of this paper.

We consider a database of randomly generated 3-element multisets, where each element is a string chosen uniformly from a set of about 8,000 English words. The schema of the kernel table K is (X1, X2, X3, Y) where X1, X2, X3 are the set elements. We construct K so that X1 ≤ X2 ≤ X3, and create an index on X1, an index on X2, and an index on X3. We store 500,000 such sets in the database.

We define a view V over K as the union of all six permutations (each expressed using a select statement) of X1, X2, X3 from K.

We also store a conventional set-based representation of the same data in which a new set-identifier attribute ID is defined. We create one table S(ID, Y), and another M(ID, X) containing the unpivoted sets. An index on X in M is created.

We consider four variants of a query that finds sets with all three members specified by constants. In the first variant Q1, we query K for some permutation of attributes.⁷

    Select X1,X2,X3,Y
    From K
    Where (X1='foo' and X2='bar' and X3='baz')
       or (X1='foo' and X3='bar' and X2='baz')
       or (X2='foo' and X1='bar' and X3='baz')
       or (X2='foo' and X3='bar' and X1='baz')
       or (X3='foo' and X1='bar' and X2='baz')
       or (X3='foo' and X2='bar' and X1='baz')

In the second variant Q2 of the query, we write the query in terms of V.

    Select X1,X2,X3,Y
    From V
    Where (X1='foo' and X2='bar' and X3='baz')

In the third variant Q3, we query the conventional set-based representation.

    Select M1.X, M2.X, M3.X, S.Y
    From S, M M1, M M2, M M3
    Where (M1.X='foo' and M2.X='bar' and M3.X='baz')
      and M1.ID=S.ID and M2.ID=S.ID and M3.ID=S.ID

Our extended syntax for the query would be

    Select X1,X2,X3,Y
    From K
    Where {'foo','bar','baz'} Among {X1,X2,X3}

Our proposed access plan (use a combined index on all set columns) is not directly supported by the database system. Thus, the best we can do is to construct a query Q4 whose performance is likely to be comparable to our intended query plan. (We need to verify that the chosen plan for Q4 is similar to our intended plan.) Q4 is

    Select X1,X2,X3,Y
    From K
    Where (X1='bar' and X2='baz' and X3='foo')

in which the constants are selected in alphabetical order.

We ran each of these queries using a commercial database system on a 1.4GHz Intel Centrino machine under Windows XP. We record the optimization time and execution time as reported by the database system. These are elapsed-time measurements. Each query was run on a cold database that had just been started. The numbers below reflect the average of five runs for each query. In each run a different combination of constants was used, and the combinations were chosen so that there was always a match in the database. The database system also reported the plan chosen to execute each query.

The plan chosen for Q4 uses the indexes on X1, X2 and X3 to find matching row identifiers, intersects the sets of identifiers, and finds rows from K for the matching identifiers. Our intended plan would do the same operations, but using a single common index for X1, X2 and X3. While the number of row identifiers being intersected may be higher with our proposed method, the performance of Q4 should roughly approximate the performance of our proposed method.

The plans chosen for Q1 and Q2 are similar to each other, consisting of the union of 6 subplans of the form mentioned for Q4, one subplan per permutation of the attributes.

The plan chosen for Q3 was a tree of three index-nested-loops joins. The innermost (i.e., leftmost) table is M accessed using an index lookup based on the X column. The other three index lookups are on the ID attributes of M (twice) and S.

Figure 1 shows the actual execution time for each of the four queries as reported by the database system. Figure 1 does not include the query optimization time, which is shown separately in Figure 2.

Figure 1 shows that the execution cost of Q4 is smallest, with Q1 and Q2 having comparable execution cost. The cost of Q3 is about 35 times higher than Q4.

⁷ In general we cannot take advantage of the order of constants mentioned in the query since the constants we’re looking for may be bound at query time, and since we may be querying on just a subset of the available columns.

Figure 2 shows that the optimization cost of Q1, Q3 and Q4 is comparable, while Q2 has a noticeably lower optimization cost. This lower optimization

cost is probably just an artifact of a smaller search space of plans within the query optimizer, and not something intrinsic to the query itself. (Note the importance of separating the optimization time from the execution time in interpreting these results. Had we just reported the total elapsed time, Q2 would have been the winner.)

[Figure 1: Execution time of the four queries. Bar chart, vertical axis Time (sec). Bar labels: Q1 0.316, Q2 0.296, Q3 8.354, Q4 0.234.]

[Figure 2: Optimization time of the four queries. Bar chart, vertical axis Time (sec). Bar labels: 0.632, 0.608, 0.592, 0.212.]

[Figure 3: Growth of Q1 and Q2 with the multiset cardinality k. Vertical axis Query/View Size (bytes), logarithmic scale from 1 to 1e+16; horizontal axis k from 0 to 16.]

Of the four solutions (Q1, Q2, Q3, and our proposed method), only our method scales with the number of attributes. Q3 does not scale because it requires a k-way join for multisets containing k elements. As one can see in Figure 1, even for k = 3 the performance of Q3 is more than an order of magnitude worse than competing approaches.

Suppose that writing a basic condition (of the form table.attribute=value) takes 10 bytes of memory. If we try to generalize Q1 and Q2 to k-element multisets, then they require either a query or view definition whose size is approximately 10k(k!) bytes. The impact of this rate of growth is shown in Figure 3; note the logarithmic vertical scale. Q1 and Q2 quickly become impractical: with k = 11 the space for the query/view definition alone is roughly four gigabytes.

In contrast, our query specification has size linear in k, and it can be evaluated without joins.

7 Conclusions

We provide techniques that enable a database engine to support a symmetric table type. The techniques include

• A nonredundant data structure with update methods and specialized indexes.
• Methods for normalization in the presence of symmetric tables.
• An algebraic symmetric closure operator, together with algebraic equivalences useful for query optimization.
• Inference methods to determine when a query/view is guaranteed to be symmetric.
• A syntactic SQL extension to enable compact query expression.

A symmetric table type allows database schema designers to model symmetric relationships without having to worry about integrity, redundancy, consistency of updates, query efficiency, or suboptimal physical design.

One could go even further and implement different kinds of symmetric table. For example, the class of antireflexive symmetric relations (i.e., k-element sets rather than multisets) satisfies simpler algebraic rules, and some duplicate elimination steps can be omitted in the implementation of the γ operator (see Example 2.2).

We have argued that our approach is applicable when there is a natural cardinality bound in the application. One could extend our approach to general multisets by using a combined structure, i.e., a bounded cardinality multiset for an initial subset of elements, and a conventional set representation for additional elements.⁸ An appropriate syntax could hide this division from the user, presenting a single multiset abstraction. When small sets are typical, such an approach would have performance benefits, even in the absence of a strict cardinality bound.

⁸ This idea was suggested to us by Wisam Dakka.

References

[1] Oracle Database Application Developer’s Guide - Object-Relational Features, 2004. 10g Release 1 (10.1), Part Number B10799-01.
[2] D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295–306, 1996.
[3] C. J. Date. On various types of relations. Database Debunkings, April 20, 2003. Available at http://www.dbdebunk.com/.
[4] Jan Van den Bussche and Dirk Van Gucht. A hierarchy of faithful set creation in pure OODB’s. In Joachim Biskup and Richard Hull, editors, Database Theory - ICDT’92, 4th International Conference, Berlin, Germany, October 14-16, 1992, Proceedings, volume 646 of Lecture Notes in Computer Science, pages 326–340. Springer, 1992.
[5] Jan Van den Bussche and Dirk Van Gucht. The expressive power of cardinality-bounded set values in object-based data models. Theoretical Computer Science, 149(1):49–66, 1995.
[6] Sven Helmer and Guido Moerkotte. Evaluation of main memory join algorithms for joins with set comparison join predicates. In International Conference on Very Large Databases, pages 386–395, 1997.
[7] N. Mamoulis. Efficient processing of joins on set-valued attributes. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 157–168, 2003.
[8] Karthikeyan Ramasamy, Jignesh M. Patel, Raghav Kaushik, and Jeffrey F. Naughton. Set containment joins: The good, the bad and the ugly. In International Conference on Very Large Databases, pages 351–362, 2000.
[9] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, volume 2. Computer Science Press, 1989.
[10] Allen Van Gelder and Rodney W. Topor. Safety and translation of relational calculus. ACM Trans. Database Syst., 16(2):235–278, 1991.
[11] Andrew Witkowski, Srikanth Bellamkonda, Tolga Bozkaya, Gregory Dorman, Nathan Folkert, Abhinav Gupta, Lei Shen, and Sankar Subramanian. Spreadsheets in RDBMS for OLAP. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 52–63. ACM Press, 2003.
