
Symmetric Relations and Cardinality-Bounded Multisets in Database Systems

Kenneth A. Ross Julia Stoyanovich

Columbia University∗

Abstract

In a binary symmetric relationship, A is related to B if and only if B is related to A. Symmetric relationships between k participating entities can be represented as multisets of cardinality k. Cardinality-bounded multisets are natural in several real-world applications. Conventional representations in relational databases suffer from several consistency and performance problems. We argue that the database system itself should provide native support for cardinality-bounded multisets. We provide techniques to be implemented by the database engine that avoid the drawbacks, and allow a schema designer to simply declare a table to be symmetric in certain attributes. We describe a compact data structure, and update methods for the structure. We describe an algebraic symmetric closure operator, and show how it can be moved around in a query plan during query optimization in order to improve performance. We describe indexing methods that allow efficient lookups on the symmetric columns. We show how to perform database normalization in the presence of symmetric relations. We provide techniques for inferring that a view is symmetric. We also describe a syntactic SQL extension that allows the succinct formulation of queries over symmetric relations.

∗ This research was supported by NSF grants IIS-0120939 and IIS-0121239.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

1 Introduction

A relation R is symmetric in its first two attributes if $R(x_1, x_2, \ldots, x_n)$ holds if and only if $R(x_2, x_1, \ldots, x_n)$ holds. We call $R(x_2, x_1, \ldots, x_n)$ the symmetric complement of $R(x_1, x_2, \ldots, x_n)$. Symmetric relations come up naturally in several contexts when the real-world relationship being modeled is itself symmetric.

Example 1.1 In a law-enforcement database recording meetings between pairs of individuals under investigation, the "meets" relationship is symmetric. □

Example 1.2 Consider a database of web pages. The relationship "X is linked to Y" (by either a forward or backward link) between pairs of web pages is symmetric. This relationship is neither reflexive nor antireflexive, i.e., "X is linked to X" is neither universally true nor universally false. While an underlying relation representing the direction of the links would normally be maintained, a view defining the "is linked to" relation would be useful, allowing the succinct specification of queries involving a sequence of undirected links. □

Example 1.3 Views that relate entities sharing a common property, such as pairs of people living in the same city, will generally define a symmetric relation between those entities. □

Example 1.4 Example 1.1 can be generalized to allow meetings of up to k people. The k-ary meeting relationship would be symmetric in the sense that if $P = (p_1, \ldots, p_k)$ is in the relationship, then so is any column-permutation of P. □

Example 1.5 Consider a database recording what television channel various viewers watch most during the 24 hourly timeslots of the day.¹ For performance reasons,² the database uses a table $V(ID, ViewDate, C_1, \ldots, C_{24})$ to record the viewer (identified by ID), the date, and the twenty-four channels most watched, one channel for each hour of the day. This table V is not symmetric, because $C_i$ is not interchangeable with $C_j$: $C_i$ reflects what the viewer was watching at timeslot number i. Nevertheless, there are interesting queries that could be posed for which this semantic difference is unimportant. An example might be "Find viewers who have watched channels 2 and 4, but not channel 5." For these queries, it could be beneficial to treat V as a symmetric relation in order to have access to query plans that are specialized to symmetric relations. □

¹ This example is based on a real-world application developed by one of the authors, in which there were actually 96 fifteen-minute slots.
² A conventional representation as a set of slots would require a 24-way join to reconstruct V.

There is a natural correspondence between symmetric relationships among k entities, and k-element multisets.³ We phrase our results in terms of "symmetric relations" to emphasize the column-oriented nature of the data representation in which columns are interchangeable. Nevertheless, our results are equally valid if expressed in terms of "bounded-cardinality multisets".

³ A multiset is a set except that duplicates are allowed.

Sets and multisets have a wide range of uses for representing information in databases. Bounded cardinality multisets would be useful for applications in which there is a natural limit to the size of multisets. This limit could be implicit in the application (e.g., the number of players in a baseball team), or defined as a conservative bound (e.g., the number of children belonging to a parent). We will demonstrate performance advantages for bounded-element multisets compared with conventional relational representations of (unbounded) multisets.

Storing a symmetric relation in a conventional database system can be done in a number of possible ways. Storing the full symmetric relation induces some redundancy in the database: more space is required (up to a factor of k! for k-ary relationships), and integrity constraints need to be enforced to ensure consistency of updates. Updates need to be aware of the symmetry of the table, and to add the various column permutations to all insertions and deletions. Queries need to perform I/O for tuples and their permutations, increasing the time needed for query processing.

Alternatively, a database schema designer could recognize that the relation was symmetric and code database procedures to store only one representative tuple for each group of permuted tuples. A view can then be defined to present the symmetric closure of the stored relation for query processing. The update problem remains, because updates through this view would be ambiguous. Updates to the underlying table would need to be aware of the symmetry, to avoid storing multiple permutations of a tuple, and to perform a deletion correctly. For symmetric relations over k columns, just defining the view (using standard SQL) requires a query of length proportional to k(k!).

For both of the above proposals, indexed access to an underlying symmetric relationship would require multiple index lookups, one for each symmetric column.

A third alternative is to model a symmetric relation as a set [3] or multiset. Instead of recording both R(a, b, c, d, e) and R(b, a, c, d, e), one could record R′(q, c, d, e), S(a, q), and S(b, q), where q is a new surrogate identifier, and R′ and S are new tables. The intuition here is that q represents a multiset, of which a and b are members according to table S. Distinct members of the multiset can be substituted for the first two arguments of R. To represent tuples that are their own symmetric complement, such as R(a, a, c, d, e), one inserts S(a, q) twice. This representation uses slightly more space than the previous proposal, while not resolving the issue of keeping the representation consistent under updates. Further, reconstructing the original symmetric relation requires joins.

We argue that none of these solutions is ideal, and that the database system should be responsible for providing a "symmetric" table type. There are numerous advantages to such a scheme:

1. The database system could choose a compact representation (such as storing one member of each pair of symmetric tuples) and take advantage of this compactness in reducing the amount of I/O required. This representation can be used both for base tables that are identified as symmetric, and for materialized views that can be proven to be symmetric.

2. The database system could go even further, and add a symmetric-closure operator to the query algebra. A query plan over a symmetric relation could then be manipulated using algebraic identities so that the symmetric closure is applied as late as possible. That way, intermediate results will be smaller, and queries will be processed more efficiently.

3. Integrity would be checked by the database system. Single-row updates would be automatically propagated to the other column permutations if necessary. Inconsistencies would be avoided, and schema designers would not have to re-implement special functionality for each symmetric table in the database.

4. The database system could index the multiple columns of a symmetric relation in a single index structure. As a result, only one index traversal is necessary to locate tuples with a given value for some symmetric column.

In this paper, we propose techniques to enable such a "symmetric relation" table type. We provide:

• An underlying abstract data type to store the kernel of a symmetric relation, i.e., a particular nonredundant subset of the relation. We show how updates on this data type would be handled by the database system. We describe how relational normalization techniques should take account of symmetric relations during database design. Both normalization and the proposed representation of symmetric relations aim to remove redundancy, so combining these two approaches should be beneficial.

• An extension of the relational algebra with a symmetric closure operator γ. We show how to translate a query over a symmetric relation into a query involving γ applied to the kernel of the relation. We provide algebraic equivalences that allow the rewriting of queries so that work can be saved by applying γ as late as possible.

• A method for inferring when a view is guaranteed to be symmetric. By using this method, the database system has the flexibility to store a materialized view using the more compact representation.

• A syntactic extension to SQL that allows the succinct expression of queries over symmetric relations.

Related Work

Surprisingly, there has been little past work on specialized implementations of symmetric relations (or bounded-cardinality set/multisets) within the database system. The only literature we are aware of [...]

The expressive power of cardinality-bounded sets has been previously studied in the context of an object-based data model [4, 5].

2 The Kernel

Definition 2.1 $\gamma_{XY}(R)$ denotes the symmetric closure operator over symmetric attributes X and Y of relation $R(X, Y, Z_1, \ldots, Z_n)$. $(x, y, z_1, \ldots, z_n) \in \gamma_{XY}(R)$ if and only if either $(x, y, z_1, \ldots, z_n) \in R$ or $(y, x, z_1, \ldots, z_n) \in R$. □

If R is symmetric with respect to X and Y, then we aim to determine a minimal relation M such that $R = \gamma_{XY}(M)$. By choosing a minimal M, we can represent R compactly. Several minimal relations M satisfy this constraint. Each such M chooses a particular element from each pair of complementary tuples.

While the choice of minimal relation M does not matter in terms of space consumption, we shall see that certain algebraic equivalences (such as Lemma 3.4 below) hold only if there is a consistent single choice of M for all tables. Thus, we impose a total order (which may be arbitrary) on the domain of X and Y, and insist that the representative tuple chosen has X ≤ Y according to this order. The resulting relation is unique, and is denoted by $\ker_{XY}(R)$, or just ker(R) when X and Y are clear from context.

$\ker_{XY}(R) = \sigma_{X \le Y}(R)$.

We propose that the database stores ker(R) as the internal representation of R. Assuming a set semantics (as opposed to a multiset or bag semantics) for symmetric relations, updates are handled as follows:

Insert ( R(X,Y,Z1,...,Zn) )
{
  If (Y
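The definitions of the symmetric closure and the kernel can be tried out on small in-memory relations. The following is a minimal sketch in Python, not the paper's implementation: relations are modeled as sets of tuples whose first two positions are the symmetric attributes, and the helper names (`swap`, `kernel`, `closure`, `insert`, `delete`) are hypothetical. Because the update pseudocode above is cut off in this copy, the `insert` and `delete` functions show only one plausible reading under set semantics: normalize the tuple so that X ≤ Y, then add or remove it.

```python
def swap(row):
    """Return the symmetric complement: exchange the first two attributes."""
    return (row[1], row[0]) + tuple(row[2:])


def kernel(rows):
    """ker_XY(R) = sigma_{X <= Y}(R): keep the representative with X <= Y."""
    return {row for row in rows if row[0] <= row[1]}


def closure(rows):
    """gamma_XY(R): every row of R together with its symmetric complement."""
    rows = set(rows)
    return rows | {swap(row) for row in rows}


def normalize(row):
    """Pick the representative of a complementary pair (the one with X <= Y)."""
    return row if row[0] <= row[1] else swap(row)


def insert(stored_kernel, row):
    """Assumed insert under set semantics: store only the representative."""
    stored_kernel.add(normalize(row))


def delete(stored_kernel, row):
    """Assumed delete: removing either permutation removes the stored row."""
    stored_kernel.discard(normalize(row))
```

With these helpers, `closure(kernel(R)) == R` holds for any relation `R` that is symmetric in its first two attributes, which is the property that lets the kernel stand in for the full relation.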

Database normalization and the proposed kernel representation both aim to remove redundancy. However, normalization may be hampered by the presence of symmetry in the data.

Example 2.1 Consider a database describing meetings of pairs of people that take place in certain locations at certain times, as in Example 1.1. Suppose that the initial database design has the schema U(P1, P2, L, D, T, A), where P1 and P2 are the parties, L is the location, D is the date, and T is the time. A is a law-enforcement agent assigned to monitor the meeting, and multiple agents can be assigned to a single meeting. The schema designer is aware that the database system provides facilities for symmetric relations, and wishes to take advantage of these facilities by declaring U to be symmetric in P1, P2.

Suppose that there can be only one meeting that takes place in a given location on a given date and time. The symmetric redundancy prevents the expression of functional dependencies having LDT on the left hand side. As a result, the "obvious" normalization of the table into the meets relation M(P1, P2, L, D, T) and the monitors relation S(L, D, T, A) is missed. □

The solution to the problem identified in Example 2.1 is to apply the kernel first, and then try to normalize the result using standard normalization techniques. In Example 2.1, it is possible to identify the functional dependency $LDT \rightarrow P_1P_2$ in $\ker_{P_1P_2}(U)$. This functional dependency allows the normalization of $\ker_{P_1P_2}(U)$ into $\ker_{P_1P_2}(M)$ and S; M is represented as a symmetric relation.

[...] Suppose that $\sigma_{P_2=456}(M)$ is a subexpression of a query to be evaluated. Let K = ker(M) be stored by the database, so that the subexpression can be evaluated as $\sigma_{P_2=456}(\gamma_{P_1P_2}(K))$. Suppose also that we store a single index structure for the columns P1 and P2. For simplicity of presentation, assume that the database knows that for all rows of M, P1 ≠ P2.

Then by implementing an operator for the combined selection and symmetric closure, we can directly look up tuples in K having 456 for either of the symmetric attributes, and for each match return the permutation with P2 = 456. The alternative permutation is never generated.

If we implemented symmetric closure as a stand-alone operator, then the best we could do would be to rewrite $\sigma_{P_2=456}(\gamma_{P_1P_2}(K))$ as $\sigma_{P_2=456}(\gamma_{P_1P_2}(\sigma_{P_1=456 \vee P_2=456}(K)))$. (See Lemma 3.2 below.) The pushed selection conditions allow the use of the index on K. However, both permutations of each matching row in K are generated, one of which will be filtered by the outer selection condition. □

In the general case for Example 2.2, it is possible that P1 = P2. A limited form of duplicate elimination would then be needed to avoid generating an output row twice from a single input row. Also observe that the problems highlighted by Example 2.2 become worse for symmetric relations over more than two attributes.

For a fixed number of symmetric columns, the symmetric closure operator can be expressed in relational algebra in terms of the union and attribute-renaming operators. Thus neither γ nor the kernel operator add to the expressive power of relational algebra. Nevertheless, by abstracting the γ operator one can derive implementations directly for γ (or γ together with selection). The situation is analogous to the join operation which, though expressible in terms of selection and cartesian product, is best implemented directly. □

3 Query Optimization

Given a query that mentions a symmetric relation R, we assume that we have physically stored just K = ker(R). In an algebraic expression for a query that accesses R, we use γ(K) in place of R.

In order to minimize the size of intermediate results, it would be beneficial to push other operators inside the symmetric closure operator γ, where possible. To support such an endeavor, we now describe algebraic equivalences that can form the basis of such rewriting rules. For simplicity of presentation, we phrase these rules for binary symmetric relations. Generalizations to higher symmetric arity are possible.

For the following results, we assume that $S_1$ and $S_2$ are arbitrary relations with attributes including X and Y, such that all rows satisfy X ≤ Y. T represents an arbitrary relation that does not have attributes X or Y. Except for Lemma 3.6, the equivalences hold under both a set semantics and a multiset semantics (in which duplicate rows are permitted) for relations.

Definition 3.1 Let θ be a condition on X and Y, and (possibly) other attributes. Let θ′ be formed from θ by substituting X for Y and vice versa. We say that θ is a symmetric condition on X and Y if θ ≡ θ′. Given a nonsymmetric condition θ, we call the condition θ ∨ θ′ the symmetric closure of θ, which we denote by $\hat{\theta}$ when the attributes X and Y are clear from context. □

Example 3.1 Symmetric selection conditions on X and Y include $X = Y$, $X^2 + Y^2 = 1$, and any condition that mentions neither X nor Y. Symmetric join conditions on R.X and R.Y include $R.X = S.A \wedge R.Y = S.A$, $R.X^2 + R.Y^2 = S.A^2$, and conditions that do not mention R.X or R.Y. The condition $R.X - R.Y > 7$ is not symmetric; its symmetric closure is $R.X - R.Y > 7 \vee R.Y - R.X > 7$. □

Symmetric conditions can be pushed below the symmetric closure.

Lemma 3.1 If θ is a symmetric condition, then
• $\sigma_\theta(\gamma_{XY}(S_1)) = \gamma_{XY}(\sigma_\theta(S_1))$
• $\gamma_{XY}(S_1) \bowtie_\theta T = \gamma_{XY}(S_1 \bowtie_\theta T)$
□

Because the symmetric closure of a condition is always symmetric, Lemma 3.1 implies the following result, which allows us to push down partial information from selections on the symmetric attributes.

Lemma 3.2 For an arbitrary condition θ,
• $\sigma_\theta(\gamma_{XY}(S_1)) = \sigma_\theta(\gamma_{XY}(\sigma_{\hat{\theta}}(S_1)))$
• $\gamma_{XY}(S_1) \bowtie_\theta T = \sigma_\theta(\gamma_{XY}(S_1 \bowtie_{\hat{\theta}} T))$
□

Lemma 3.3 Suppose θ is a condition that implies X ≤ Y. Then $\sigma_\theta(\gamma_{XY}(S_1)) = \sigma_\theta(S_1)$. □

Lemma 3.4
• $\gamma_{XY}(S_1) \cup \gamma_{XY}(S_2) = \gamma_{XY}(S_1 \cup S_2)$
• $\gamma_{XY}(S_1) \cap \gamma_{XY}(S_2) = \gamma_{XY}(S_1 \cap S_2)$
• $\gamma_{XY}(S_1) - \gamma_{XY}(S_2) = \gamma_{XY}(S_1 - S_2)$
• $\gamma_{XY}(S_1) \times T = \gamma_{XY}(S_1 \times T)$
□

Lemma 3.5 If attribute list G includes both X and Y, then $\pi_G(\gamma_{XY}(S_1)) = \gamma_{XY}(\pi_G(S_1))$. □

Lemma 3.6 Under a set semantics: (a) If attribute list G includes neither X nor Y, then $\pi_G(\gamma_{XY}(S_1)) = \pi_G(S_1)$. (b) If attribute list G includes X but not Y, and if G′ is the same as G except that X is replaced by Y, then $\pi_G(\gamma_{XY}(S_1)) = \pi_G(S_1) \cup \pi_{G'}(S_1)$. □

Definition 3.2 Let $A_G^{\vec{f}}(R)$ denote the aggregate of relation R, grouped by the columns in the list G, computing the aggregate functions $\vec{f}$. □

Lemma 3.7 If grouping attributes G include both X and Y, then $A_G^{\vec{f}}(\gamma_{XY}(S_1)) = \gamma_{XY}(A_G^{\vec{f}}(S_1))$. □

Aggregates grouping by X alone or Y alone can use Lemma 3.7 to first compute the aggregate grouped by X and Y. Assuming that the aggregate functions are incrementally computable, the coarser aggregates can then be computed in a subsequent operation.

Lemma 3.8 Let G be grouping attributes other than X and Y, and let $\vec{f}$ contain just idempotent aggregates such as min and max. Then $A_G^{\vec{f}}(\gamma_{XY}(S_1)) = A_G^{\vec{f}}(S_1)$. □

It is tempting to think of analogous equivalences to those of Lemma 3.8 for other aggregates. However, a row in the kernel maps to either one or two rows in the symmetric closure, depending on whether the symmetric attributes have equal values. To take account of this difference, one can split the kernel into two fragments.
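Two of the points above lend themselves to a quick check on toy data: the selection pushdown of Lemma 3.2, and the observation that a kernel row contributes one or two rows to the symmetric closure. The following is a minimal sketch under set semantics, with relations modeled as Python sets of (X, Y, Z) tuples; the data and helper names are made up for illustration and are not taken from the paper.

```python
def swap(row):
    """Exchange the two symmetric attributes (positions 0 and 1)."""
    return (row[1], row[0]) + tuple(row[2:])


def closure(rows):
    """gamma_XY under set semantics: rows plus their symmetric complements."""
    rows = set(rows)
    return rows | {swap(row) for row in rows}


# A toy kernel: every row satisfies X <= Y.
K = {(1, 456, 10), (456, 789, 20), (2, 3, 30), (456, 456, 5)}


def theta(row):
    """The nonsymmetric condition Y = 456."""
    return row[1] == 456


def theta_hat(row):
    """Symmetric closure of theta: X = 456 or Y = 456."""
    return row[0] == 456 or row[1] == 456


# Lemma 3.2, first equivalence: filter with theta_hat below the closure,
# then re-apply theta above it.
direct = {row for row in closure(K) if theta(row)}
pushed = {row for row in closure({r for r in K if theta_hat(r)}) if theta(row)}

# One-or-two-rows observation: rows with X < Y appear twice in the closure,
# rows with X = Y appear once, so a sum over the closure can be computed
# from the two kernel fragments without materializing the closure.
strict = [row for row in K if row[0] < row[1]]
equal = [row for row in K if row[0] == row[1]]
sum_over_closure = sum(row[2] for row in closure(K))
sum_from_fragments = 2 * sum(r[2] for r in strict) + sum(r[2] for r in equal)
```

On this data `direct` and `pushed` contain the same three rows, and both sums come to 125. The pushed variant only ever touches kernel rows that mention 456, which is what makes an index on the symmetric columns usable.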

Lemma 3.9 Let G be grouping attributes other than X and Y, and let $\vec{f}$ contain just linear aggregates such as sum and count. Let $2\vec{f}$ denote the aggregate that computes double the aggregate functions $\vec{f}$. Then

• $A_G^{\vec{f}}(\gamma_{XY}(\sigma_{X}$ [...]

If K = ker(M), then the query can be rewritten [...]

[...]

  Select ID, ViewDate, X1.name, X1
  From V
  Where {X1,X2} Among Slots and X1=2 and X2=4
        and not ({5} Among Slots)

Unlike before, there are multiple rows per ID/ViewDate in the output, one for each binding of X1 to a column whose value (together with some X2 value) satisfies the conditions of the Where clause. The column-variables in the select clause implicitly control duplicate elimination. Since only X1 is mentioned in the select clause, there is one value output for each X1 column binding, irrespective of how many valid X2 values are present. □

Example 5.3 shows how to "unpivot" a k-element multiset from a column-based representation into a more traditional row-based representation. One could use variants of Example 5.3 to define views over which traditional SQL methods of set manipulation can be expressed. As a result, none of SQL's expressive power for set manipulation has been lost by using a column-wise representation. We emphasize that since the unpivoted table is just a view, queries over the unpivoted table could be translated into queries over the original (pivoted) table, which may be more efficient because joins are not required.

Example 5.4 Consider Example 1.4 in which we have a meeting table M with k attributes $P_1, \ldots, P_k$ grouped into a multiset called Persons. We wish to find all pairs of people X and Y at three degrees of separation. In other words, we need three meetings [...]

  [...] >= 1
    and Overlap(M2.Persons,M3.Persons) >= 1
    and {Y} Among M3.Persons
□

Without the Among syntax, there would be no way to output values from multiple columns in a single select statement. One would need to form the union of $k^2$ select statements to express Example 5.4.

A conventional set representation would require a six-way join to express Example 5.4.

When a symmetric multiset has fewer elements than the cardinality bound, the remaining columns are padded with NULLs. Column-variables cannot be bound to NULL values.

We also advocate additional syntactic elements for directly expressing multisets formed as the intersection or difference of other multisets. (Note that the union of two k-bounded multisets is not necessarily a k-bounded multiset.)

The translation of the extended SQL into the extended algebra is relatively straightforward. When symmetric attributes are referenced using the Among keyword, the underlying relation has its symmetric columns copied into new columns. Some of these new columns correspond to the column variables. The symmetric closure operator is applied to the new columns to find combinations of values satisfying the conditions on column variables in the Where clause. An algebraic duplicate-elimination step is also needed, as is special handling for NULL values. After the query has been translated, it can be optimized and executed as outlined in Sections 2.2 and 3.

6 Experimental Evaluation

In this section we describe an experimental evaluation of various representations of multisets on a state-of-the-art commercial database system. We wish to demonstrate the qualitative performance characteristics of various representations. A comprehensive performance evaluation is beyond the scope of this paper.

We consider a database of randomly generated 3-element multisets, where each element is a string chosen uniformly from a set of about 8,000 English words. The schema of the kernel table K is (X1, X2, X3, Y) where X1, X2, X3 are the set elements. We construct K so that X1 ≤ X2 ≤ X3, and create an index on X1, an index on X2, and an index on X3. We store 500,000 such sets in the database.

We define a view V over K as the union of all six permutations (each expressed using a select statement) of X1, X2, X3 from K.

We also store a conventional set-based representation of the same data in which a new set-identifier attribute ID is defined. We create one table S(ID, Y), and another M(ID, X) containing the unpivoted sets. An index on X in M is created.

We consider four variants of a query that finds sets with all three members specified by constants. In the first variant Q1, we query K for some combination of attributes.⁷

  Select X1,X2,X3,Y
  From K
  Where (X1='foo' and X2='bar' and X3='baz')
     or (X1='foo' and X3='bar' and X2='baz')
     or (X2='foo' and X1='bar' and X3='baz')
     or (X2='foo' and X3='bar' and X1='baz')
     or (X3='foo' and X1='bar' and X2='baz')
     or (X3='foo' and X2='bar' and X1='baz')

In the second variant Q2 of the query, we write the query in terms of V.

  Select X1,X2,X3,Y
  From V
  Where (X1='foo' and X2='bar' and X3='baz')

In the third variant Q3, we query the conventional set-based representation.

  Select M1.X, M2.X, M3.X, S.Y
  From S, M M1, M M2, M M3
  Where (M1.X='foo' and M2.X='bar'
         and M3.X='baz') and
        M1.ID=S.ID and M2.ID=S.ID and M3.ID=S.ID

Our extended syntax for the query would be

  Select X1,X2,X3,Y
  From K
  Where {'foo','bar','baz'} Among {X1,X2,X3}

Our proposed access plan (use a combined index on all set columns) is not directly supported by the database system. Thus, the best we can do is to construct a query Q4 whose performance is likely to be comparable to our intended query plan. (We need to verify that the chosen plan for Q4 is similar to our intended plan.) Q4 is

  Select X1,X2,X3,Y
  From K
  Where (X1='bar' and X2='baz' and X3='foo')

in which the constants are selected in alphabetical order.

We ran each of these queries using a commercial database system on a 1.4GHz Intel Centrino machine under Windows XP. We record the optimization time and execution time as reported by the database system. These are elapsed-time measurements. Each query was run on a cold database that had just been started. The numbers below reflect the average of five runs for each query. In each run a different combination of constants was used, and the combinations were chosen so that there was always a match in the database. The database system also reported the plan chosen to execute each query.

The plan chosen for Q4 uses the indexes on X1, X2 and X3 to find matching row identifiers, intersects the set of identifiers, and finds rows from K for the matching identifiers. Our intended plan would do the same operations, but using a single common index for X1, X2 and X3. While the number of row identifiers being intersected may be higher with our proposed method, the performance of Q4 should roughly approximate the performance of our proposed method.

The plans chosen for Q1 and Q2 are similar to each other, consisting of the union of 6 subplans of the form mentioned for Q4, one subplan per permutation of the attributes.

The plan chosen for Q3 was a tree of three index-nested-loops joins. The innermost (i.e., leftmost) table is M accessed using an index lookup based on the X column. The other three index lookups are on the ID attributes of M (twice) and S.

Figure 1 shows the actual execution time for each of the four queries as reported by the database system. Figure 1 does not include the query optimization time, which is shown separately in Figure 2.

Figure 1 shows that the execution cost of Q4 is smallest, with Q1 and Q2 having comparable execution cost. The cost of Q3 is about 35 times higher than Q4.
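The structure of Q1 is mechanical: one parenthesized conjunction per assignment of the constants to the columns. The following is a minimal sketch that generates a Q1-style query for any number of columns and runs it on a tiny stand-in for K. It uses Python's built-in `sqlite3` module purely for illustration (the paper's experiments used an unnamed commercial system), the function names `permuted_where` and `q1` are hypothetical, and the constants are spliced into the SQL text only to mirror the queries as printed; real code would bind parameters instead.

```python
import sqlite3
from itertools import permutations


def permuted_where(columns, constants):
    """One '(c1=v1 and c2=v2 ...)' disjunct per permutation of the columns."""
    disjuncts = []
    for cols in permutations(columns):
        conds = [f"{col}='{val}'" for col, val in zip(cols, constants)]
        disjuncts.append("(" + " and ".join(conds) + ")")
    return "\n   or ".join(disjuncts)


def q1(columns, constants, table="K", extra=("Y",)):
    """Assemble a Q1-style query over the kernel table."""
    select_list = ",".join(list(columns) + list(extra))
    where = permuted_where(columns, constants)
    return f"Select {select_list}\nFrom {table}\nWhere {where}"


query = q1(["X1", "X2", "X3"], ["foo", "bar", "baz"])

# Tiny stand-in for K: each row is stored with X1 <= X2 <= X3.
conn = sqlite3.connect(":memory:")
conn.execute("create table K (X1 text, X2 text, X3 text, Y integer)")
conn.executemany(
    "insert into K values (?, ?, ?, ?)",
    [("bar", "baz", "foo", 1), ("bar", "bar", "foo", 2)],
)
rows = conn.execute(query).fetchall()
conn.close()
```

For k columns the generated predicate has k! disjuncts and k·k! basic conditions, so the query text grows factorially with the multiset cardinality even though only one of the disjuncts can match a row stored in kernel order.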
Figure 2 shows that the optimization cost of three of the queries is comparable, while Q2 has a noticeably lower optimization cost. This lower optimization cost is probably just an artifact of a smaller search space of plans within the query optimizer, and not something intrinsic to the query itself. (Note the importance of separating the optimization time from the execution time in interpreting these results. Had we just reported the total elapsed time, Q2 would have been the winner.)

Figure 1: Execution time of the four queries. [Bar chart; vertical axis: Time (sec); bars for Q1, Q2, Q3, Q4; bar labels 8.354, 0.316, 0.296, 0.234.]

Figure 2: Optimization time of the four queries. [Bar chart; vertical axis: Time (sec); bars for Q1, Q2, Q3, Q4; bar labels 0.632, 0.608, 0.592, 0.212.]

Figure 3: Growth of Q1 and Q2 with the multiset cardinality k. [Plot; horizontal axis: k, from 0 to 16; vertical axis: Query/View Size (bytes), logarithmic, from 1 to 1e+16.]

Of the four solutions (Q1, Q2, Q3, and our proposed method), only our method scales with the number of attributes. Q3 does not scale because it requires a k-way join for multisets containing k elements. As one can see in Figure 1, even for k = 3 the performance of Q3 is more than an order of magnitude worse than competing approaches.

Suppose that writing a basic condition (of the form table.attribute=value) takes 10 bytes of memory. If we try to generalize Q1 and Q2 to k-element multisets, then they require either a query or view definition whose size is approximately 10k(k!) bytes. The impact of this rate of growth is shown in Figure 3; note the logarithmic vertical scale. Q1 and Q2 quickly become impractical: with k = 11 the space for the query/view definition alone is four gigabytes.

In contrast, our query specification has size linear in k, and it can be evaluated without joins.

7 Conclusions

We provide techniques that enable a database engine to support a symmetric table type. The techniques include

• A nonredundant data structure with update methods and specialized indexes.

• Methods for normalization in the presence of symmetric tables.

• An algebraic symmetric closure operator, together with algebraic equivalences useful for query optimization.

• Inference methods to determine when a query/view is guaranteed to be symmetric.

• A syntactic SQL extension to enable compact query expression.

A symmetric table type allows database schema designers to model symmetric relationships without having to worry about integrity, redundancy, consistency of updates, query efficiency, or suboptimal physical design.

One could go even further and implement different kinds of symmetric table. For example, the class of antireflexive symmetric relations (i.e., k-element sets rather than multisets) satisfies simpler algebraic rules, and some duplicate elimination steps can be omitted in the implementation of the γ operator (see Example 2.2).

We have argued that our approach is applicable when there is a natural cardinality bound in the application. One could extend our approach to general multisets by using a combined structure, i.e., a bounded cardinality multiset for an initial subset of elements, and a conventional set representation for additional elements.⁸ An appropriate syntax could hide this division from the user, presenting a single multiset abstraction. When small sets are typical, such an approach would have performance benefits, even in the absence of a strict cardinality bound.

⁷ In general we cannot take advantage of the order of constants mentioned in the query since the constants we're looking for may be bound at query time, and since we may be querying on just a subset of the available columns.
⁸ This idea was suggested to us by Wisam Dakka.

References

[1] Oracle Database Application Developer's Guide - Object-Relational Features, 2004. 10g Release 1 (10.1), Part Number B10799-01.

[2] D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295–306, 1996.

[3] C. J. Date. On various types of relations. Database Debunkings, April 20, 2003. Available at http://www.dbdebunk.com/.

[4] Jan Van den Bussche and Dirk Van Gucht. A hierarchy of faithful set creation in pure OODB's. In Joachim Biskup and Richard Hull, editors, Database Theory - ICDT'92, 4th International Conference, Berlin, Germany, October 14-16, 1992, Proceedings, volume 646 of Lecture Notes in Computer Science, pages 326–340. Springer, 1992.

[5] Jan Van den Bussche and Dirk Van Gucht. The expressive power of cardinality-bounded set values in object-based data models. Theoretical Computer Science, 149(1):49–66, 1995.

[6] Sven Helmer and Guido Moerkotte. Evaluation of main memory join algorithms for joins with set comparison join predicates. In International Conference on Very Large Databases, pages 386–395, 1997.

[7] N. Mamoulis. Efficient processing of joins on set-valued attributes. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 157–168, 2003.

[8] Karthikeyan Ramasamy, Jignesh M. Patel, Raghav Kaushik, and Jeffrey F. Naughton. Set containment joins: The good, the bad and the ugly. In International Conference on Very Large Databases, pages 351–362, 2000.

[9] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, volume 2. Computer Science Press, 1989.

[10] Allen Van Gelder and Rodney W. Topor. Safety and translation of relational calculus. ACM Trans. Database Syst., 16(2):235–278, 1991.

[11] Andrew Witkowski, Srikanth Bellamkonda, Tolga Bozkaya, Gregory Dorman, Nathan Folkert, Abhinav Gupta, Lei Shen, and Sankar Subramanian. Spreadsheets in RDBMS for OLAP. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 52–63. ACM Press, 2003.