Symmetric Relations and Cardinality-Bounded Multisets in Database Systems
Kenneth A. Ross Julia Stoyanovich
Columbia University∗ [email protected], [email protected]
Abstract 1 Introduction A relation R is symmetric in its first two attributes if R(x , x , . . . , x ) holds if and only if R(x , x , . . . , x ) In a binary symmetric relationship, A is re- 1 2 n 2 1 n holds. We call R(x , x , . . . , x ) the symmetric com- lated to B if and only if B is related to A. 2 1 n plement of R(x , x , . . . , x ). Symmetric relations Symmetric relationships between k participat- 1 2 n come up naturally in several contexts when the real- ing entities can be represented as multisets world relationship being modeled is itself symmetric. of cardinality k. Cardinality-bounded mul- tisets are natural in several real-world appli- Example 1.1 In a law-enforcement database record- cations. Conventional representations in re- ing meetings between pairs of individuals under inves- lational databases suffer from several consis- tigation, the “meets” relationship is symmetric. 2 tency and performance problems. We argue that the database system itself should pro- Example 1.2 Consider a database of web pages. The vide native support for cardinality-bounded relationship “X is linked to Y ” (by either a forward or multisets. We provide techniques to be im- backward link) between pairs of web pages is symmet- plemented by the database engine that avoid ric. This relationship is neither reflexive nor antire- the drawbacks, and allow a schema designer to flexive, i.e., “X is linked to X” is neither universally simply declare a table to be symmetric in cer- true nor universally false. While an underlying rela- tain attributes. We describe a compact data tion representing the direction of the links would nor- structure, and update methods for the struc- mally be maintained, a view defining the “is linked to” ture. We describe an algebraic symmetric clo- relation would be useful, allowing the succinct speci- sure operator, and show how it can be moved fication of queries involving a sequence of undirected around in a query plan during query optimiza- links. 2 tion in order to improve performance. We de- scribe indexing methods that allow efficient Example 1.3 Views that relate entities sharing a lookups on the symmetric columns. We show common property, such as pairs of people living in the how to perform database normalization in the same city, will generally define a symmetric relation presence of symmetric relations. We provide between those entities. 2 techniques for inferring that a view is sym- metric. We also describe a syntactic SQL ex- Example 1.4 Example 1.1 can be generalized to al- tension that allows the succinct formulation of low meetings of up to k people. The k-ary meeting queries over symmetric relations. relationship would be symmetric in the sense that if P = (p1, . . . , pk) is in the relationship, then so is any column-permutation of P . 2 This∗ research was supported by NSF grants IIS-0120939 and IIS-0121239. Example 1.5 Consider a database recording what Permission to copy without fee all or part of this material is television channel various viewers watch most dur- granted provided that the copies are not made or distributed for ing the 24 hourly timeslots of the day.1 For direct commercial advantage, the VLDB copyright notice and 2 the title of the publication and its date appear, and notice is performance reasons, the database uses a table given that copying is by permission of the Very Large Data Base 1This example is based on a real-world application developed Endowment. To copy otherwise, or to republish, requires a fee by one of the authors, in which there were actually 96 fifteen- and/or special permission from the Endowment. minute slots. Proceedings of the 30th VLDB Conference, 2A conventional representation as a set of slots would require Toronto, Canada, 2004 a 24-way join to reconstruct V . V (ID, V iewDate, C1, . . . , C24) to record the viewer For both of the above proposals, indexed access to (identified by ID), the date, and the twenty-four chan- an underlying symmetric relationship would require nels most watched, one channel for each hour of the multiple index lookups, one for each symmetric col- day. This table V is not symmetric, because Ci is not umn. interchangeable with Cj: Ci reflects what the viewer A third alternative is to model a symmetric re- was watching at timeslot number i. Nevertheless, there lation as a set [3] or multiset. Instead of record- are interesting queries that could be posed for which ing both R(a, b, c, d, e) and R(b, a, c, d, e), one could this semantic difference is unimportant. An example record R0(q, c, d, e), S(a, q), and S(b, q), where q is a might be “Find viewers who have watched channels 2 new surrogate identifier, and R0 and S are new ta- and 4, but not channel 5.” For these queries, it could bles. The intuition here is that q represents a multi- be beneficial to treat V as a symmetric relation in or- set, of which a and b are members according to table der to have access to query plans that are specialized S. Distinct members of the multiset can be substi- to symmetric relations. 2 tuted for the first two arguments of R. To represent tuples that are their own symmetric complement, such There is a natural isomorphism between symmetric as R(a, a, c, d, e), one inserts S(a, q) twice. This rep- relationships among k entities, and k-element multi- 3 resentation uses slightly more space than the previous sets. We phrase our results in terms of “symmetric proposal, while not resolving the issue of keeping the relations” to emphasize the column-oriented nature of representation consistent under updates. Further, re- the data representation in which columns are inter- constructing the original symmetric relation requires changeable. Nevertheless, our results are equally valid joins. if expressed in terms of “bounded-cardinality multi- We argue that none of these solutions is ideal, and sets”. that the database system should be responsible for pro- Sets and multisets have a wide range of uses for viding a “symmetric” table type. There are numerous representing information in databases. Bounded car- advantages to such a scheme: dinality multisets would be useful for applications in which there is a natural limit to the size of multisets. 1. The database system could choose a compact rep- This limit could be implicit in the application (e.g., resentation (such as storing one member of each the number of players in a baseball team), or defined pair of symmetric tuples) and take advantage of as a conservative bound (e.g., the number of children this compactness in reducing the amount of I/O belonging to a parent). We will demonstrate perfor- required. This representation can be used both mance advantages for bounded-element multisets com- for base tables that are identified as symmetric, pared with conventional relational representations of and for materialized views that can be proven to (unbounded) multisets. be symmetric. Storing a symmetric relation in a conventional database system can be done in a number of possi- 2. The database system could go even further, and ble ways. Storing the full symmetric relation induces add a symmetric-closure operator to the query al- some redundancy in the database: more space is re- gebra. A query plan over a symmetric relation quired (up to a factor of k! for k-ary relationships), and could then be manipulated using algebraic iden- integrity constraints need to be enforced to ensure con- tities so that the symmetric closure is applied as sistency of updates. Updates need to be aware of the late as possible. That way, intermediate results symmetry of the table, and to add the various column will be smaller, and queries will be processed more permutations to all insertions and deletions. Queries efficiently. need to perform I/O for tuples and their permutations, increasing the time needed for query processing. 3. Integrity would be checked by the database sys- Alternatively, a database schema designer could tem. Single-row updates would be automatically recognize that the relation was symmetric and code propagated to the other column permutations if database procedures to store only one representative necessary. Inconsistencies would be avoided, and tuple for each group of permuted tuples. A view can schema designers would not have to re-implement then be defined to present the symmetric closure of special functionality for each symmetric table in the stored relation for query processing. The update the database. problem remains, because updates through this view 4. The database system could index the multiple would be ambiguous. Updates to the underlying table columns of a symmetric relation in a single index would need to be aware of the symmetry, to avoid stor- structure. As a result, only one index traversal is ing multiple permutations of a tuple, and to perform necessary to locate tuples with a given value for a deletion correctly. For symmetric relations over k some symmetric column. columns, just defining the view (using standard SQL) requires a query of length proportional to k(k!). In this paper, we propose techniques to enable such 3A multiset is a set except that duplicates are allowed. a “symmetric relation” table type. We provide: • An underlying abstract data type to store the The expressive power of cardinality-bounded sets kernel of a symmetric relation, i.e., a particular has been previously studied in the context of an object- nonredundant subset of the relation. We show based data model [4, 5]. how updates on this data type would be handled by the database system. We describe how rela- 2 The Kernel tional normalization techniques should take ac- count of symmetric relations during database de- Definition 2.1 γXY (R) denotes the symmetric clo- sure operator over symmetric attributes X and Y sign. Both normalization and the proposed rep- 4 resentation of symmetric relations aim to remove of relation R(X, Y, Z1, . . . , Zn). (x, y, z1, . . . , zn) ∈ γXY (R) if and only if either (x, y, z1, . . . , zn) ∈ R or redundancy, so combining these two approaches 2 should be beneficial. (y, x, z1, . . . , zn) ∈ R. If R is symmetric with respect to X and Y , then we • An extension of the relational algebra with a sym- aim to determine a minimal relation M such that R = metric closure operator γ. We show how to trans- γ (M). By choosing a minimal M, we can represent late a query over a symmetric relation into a query XY R compactly. Several minimal relations M satisfy this involving γ applied to the kernel of the relation. constraint. Each such M chooses a particular element We provide algebraic equivalences that allow the from each pair of complementary tuples. rewriting of queries so that work can be saved by While the choice of minimal relation M does not applying γ as late as possible. matter in terms of space consumption, we shall see that certain algebraic equivalences (such as Lemma 3.4 • A method for inferring when a view is guaran- below) hold only if there is a consistent single choice teed to be symmetric. By using this method, the of M for all tables. Thus, we impose a total order database system has the flexibility to store a ma- (which may be arbitrary) on the domain of X and terialized view using the more compact represen- Y , and insist that the representative tuple chosen has tation. X ≤ Y according to this order. The resulting relation is unique, and is denoted by ker (R), or just ker(R) • A syntactic extension to SQL that allows the suc- XY when X and Y are clear from context. ker (R) = cinct expression of queries over symmetric rela- XY σ (R). tions. X≤Y We propose that the database stores ker(R) as the internal representation of R. Assuming a set seman- Related Work tics (as opposed to a multiset or bag semantics) for Surprisingly, there has been little past work on symmetric relations, updates are handled as follows: specialized implementations of symmetric relations Insert ( R(X,Y,Z1,...,Zn) ) (or bounded-cardinality set/multisets) within the { database system. The only literature we are aware of If (Y Database normalization and the proposed kernel rep- pose that σP2=456(M) is a subexpression of a query resentation both aim to remove redundancy. However, to be evaluated. Let K = ker(M) be stored by the normalization may be hampered by the presence of database, so that the subexpression can be evaluated as symmetry in the data. σP2=456(γP1P2 (K)). Suppose also that we store a single index structure for the columns P1 and P2. For sim- Example 2.1 Consider a database describing meet- plicity of presentation, assume that the database knows ings of pairs of people that take place in certain lo- that for all rows of M, P1 6= P2. cations at certain times, as in Example 1.1. Sup- Then by implementing an operator for the combined pose that the initial database design has the schema selection and symmetric closure, we can directly look U(P1, P2, L, D, T, A), where P1 and P2 are the parties, up tuples in K having 456 for either of the symmetric L is the location, D is the date, and T is the time. attributes, and for each match return the permutation A is a law-enforcement agent assigned to monitor the with P2 = 456. The alternative permutation is never meeting, and multiple agents can be assigned to a sin- generated. gle meeting. The schema designer is aware that the If we implemented symmetric closure as a database system provides facilities for symmetric rela- stand-alone operator, then the best we could tions, and wishes to take advantage of these facilities do would be to rewrite σP2=456(γP1P2 (K)) as by declaring U to be symmetric in P1, P2. σP2=456(γP1P2 (σP1=456∨P2=456(K))). (See Lemma 3.2 Suppose that there can be only one meeting that below.) The pushed selection conditions allow the use takes place in a given location on a given date and of the index on K. However, both permutations of time. The symmetric redundancy prevents the expres- each matching row in K are generated, one of which sion of functional dependencies having LDT on the left will be filtered by the outer selection condition. 2 hand side. As a result, the “obvious” normalization of the table into the meets relation M(P1, P2, L, D, T ) In the general case for Example 2.2, it is possible 2 and the monitors relation S(L, D, T, A) is missed. that P1 = P2. A limited form of duplicate elimina- tion would then be needed to avoid generating an out- The solution to the problem identified in Exam- put row twice from a single input row. Also observe ple 2.1 is to apply the kernel first, and then try to nor- that the problems highlighted by Example 2.2 become malize the result using standard normalization tech- worse for symmetric relations over more than two at- niques. In Example 2.1, it is possible to identify the tributes. functional dependency LDT → P1P2 in kerP1P2 (U). For a fixed number of symmetric columns, the sym- This functional dependency allows the normalization metric closure operator can be expressed in relational of kerP1P2 (U) into kerP1P2 (M) and S; M is repre- algebra in terms of the union and attribute-renaming sented as a symmetric relation. operators. Thus neither γ nor the kernel operator add to the expressive power of relational algebra. Never- Lemma 3.2 For an arbitrary condition θ, theless, by abstracting the γ operator one can derive σ γ S σ γ σ S implementations directly for γ (or γ together with se- • θ( XY ( 1)) = θ( XY ( θˆ( 1))) lection). The situation is analogous to the join oper- • γXY (S1) 1θ T = σθ(γXY (S1 1ˆ T )) ation which, though expressible in terms of selection θ and cartesian product, is best implemented directly. 2 3 Query Optimization Lemma 3.3 Suppose θ is a condition that implies X ≤ Y . Then σθ(γXY (S1)) = σθ(S1). 2 Given a query that mentions a symmetric relation R, we assume that we have physically stored just K = Lemma 3.4 ker(R). In an algebraic expression for a query that • γXY (S1) ∪ γXY (S2) = γXY (S1 ∪ S2) accesses R, we use γ(K) in place of R. In order to minimize the size of intermediate results, • γXY (S1) ∩ γXY (S2) = γXY (S1 ∩ S2) it would be beneficial to push other operators inside • γ (S ) − γ (S ) = γ (S − S ) the symmetric closure operator γ, where possible. To XY 1 XY 2 XY 1 2 support such an endeavor, we now describe algebraic • γXY (S1) × T = γXY (S1 × T ) equivalences that can form the basis of such rewriting 2 rules. For simplicity of presentation, we phrase these rules for binary symmetric relations. Generalizations Lemma 3.5 If attribute list G includes both X and to higher symmetric arity are possible. Y , then πG(γXY (S1)) = γXY (πG(S1)). 2 For the following results, we assume that S1 and S2 are arbitrary relations with attributes including X Lemma 3.6 Under a set semantics: (a) If attribute and Y , such that all rows satisfy X ≤ Y . T represents list G includes neither X nor Y , then πG(γXY (S1)) = an arbitrary relation that does not have attributes X πG(S1). (b) If attribute list G includes X but not Y , or Y . Except for Lemma 3.6, the equivalences hold and if G0 is the same as G except that X is replaced under both a set semantics and a multiset semantics by Y , then πG(γXY (S1)) = πG(S1) ∪ πG0 (S1). 2 (in which duplicate rows are permitted) for relations. f~ Definition 3.1 Let θ be a condition on X and Y , and Definition 3.2 Let AG(R) denote the aggregate of re- (possibly) other attributes. Let θ0 be formed from θ by lation R, grouped by the columns in the list G, com- ~ substituting X for Y and vice versa. We say that θ is puting the aggregate functions f. 2 a symmetric condition on X and Y if θ ≡ θ0. Given a Lemma 3.7 G X nonsymmetric condition θ, we call the condition θ ∨ θ0 If grouping attributes include both f~ f~ 2 the symmetric closure of θ, which we denote by θˆ when and Y , then AG(γXY (S1)) = γXY (AG(S1)) the attributes X and Y are clear from context. 2 Aggregates grouping by X alone or Y alone can use Example 3.1 Symmetric selection conditions on X Lemma 3.7 to first compute the aggregate grouped by and Y include X = Y , X2 + Y 2 = 1, and any X and Y . Assuming that the aggregate functions are condition that mentions neither X nor Y . Symmet- incrementally computable, the coarser aggregates can ric join conditions on R.X and R.Y include R.X = then be computed in a subsequent operation. S.A ∧ R.Y = S.A, R.X2 + R.Y 2 = S.A2, and condi- Lemma 3.8 tions that do not mention R.X or R.Y . The condition Let G be grouping attributes other than ~ R.X − R.Y > 7 is not symmetric; its symmetric clo- X and Y , and let f contain just idempotent aggregates 2 f~ f~ sure is R.X − R.Y > 7 ∨ R.Y − R.X > 7. such as min and max. Then AG(γXY (S1)) = AG(S1). 2 Symmetric conditions can be pushed below the sym- metric closure. It is tempting to think of analogous equivalences to those of Lemma 3.8 for other aggregates. However, Lemma 3.1 If θ is a symmetric condition, then a row in the kernel maps to either one or two rows • σθ(γXY (S1)) = γXY (σθ(S1)) in the symmetric closure, depending on whether the symmetric attributes have equal values. To take ac- 1 1 • γXY (S1) θ T = γXY (S1 θ T ) count of this difference, one can split the kernel into 2 two fragments. Because the symmetric closure of a condition is al- Lemma 3.9 Let G be grouping attributes other than ~ ways symmetric, Lemma 3.1 implies the following re- X and Y , and let f contain just linear aggregates such sult, which allows us to push down partial information as sum and count. Let 2f~ denote the aggregate that from selections on the symmetric attributes. computes double the aggregate functions f~. Then f~ 2f~ If K = ker (M), then the query can be rewritten • AG(γXY (σX 8 1e+14 7 1e+12 6 1e+10 ) c 5 e s ( 1e+08 e m i T 4 1e+06 3 10000 2 Query/View Size (bytes) 1 100 0.316 0.296 0.234 0 1 Q1 Q2 Q3 Q4 0 2 4 6 8 10 12 14 16 k Figure 1: Execution time of the four queries. Figure 3: Growth of Q1 and Q2 with the multiset car- dinality k. Optimization Time 0.7 7 Conclusions 0.632 0.608 0.592 0.6 We provide techniques that enable a database engine 0.5 to support a symmetric table type. The techniques include 0.4 ) c e s ( e • A nonredundant data structure with update m i T 0.3 methods and specialized indexes. 0.212 0.2 • Methods for normalization in the presence of sym- 0.1 metric tables. 0 Q1 Q2 Q3 Q4 • An algebraic symmetric closure operator, together with algebraic equivalences useful for query opti- mization. Figure 2: Optimization time of the four queries. • Inference methods to determine when a cost is probably just an artifact of a smaller search query/view is guaranteed to be symmetric. space of plans within the query optimizer, and not something intrinsic to the query itself. (Note the im- • A syntactic SQL extension to enable compact portance of separating the optimization time from the query expression. execution time in interpreting these results. Had we A symmetric table type allows database schema de- just reported the total elapsed time, Q2 would have been the winner.) signers to model symmetric relationships without hav- ing to worry about integrity, redundancy, consistency Of the four solutions (Q1, Q2, Q3, and our proposed method), only our method scales with the number of of updates, query efficiency, or suboptimal physical de- sign. attributes. Q3 does not scale because it requires a k- way join for multisets containing k elements. As one One could go even further and implement different can see in Figure 1, even for k = 3 the performance kinds of symmetric table. For example, the class of antireflexive symmetric relations (i.e., k-element sets of Q3 is more than an order of magnitude worse than competing approaches. rather than multisets) satisfies simpler algebraic rules, and some duplicate elimination steps can be omitted Suppose that writing a basic condition (of the form in the implementation of the γ operator (see Exam- table.attribute=value) takes 10 bytes of memory. ple 2.2). If we try to generalize Q1 and Q2 to k-element multi- sets, then they require either a query or view definition We have argued that our approach is applicable whose size is approximately 10k(k!) bytes. The impact when there is a natural cardinality bound in the appli- of this rate of growth is shown in Figure 3; note the cation. One could extend our approach to general mul- tisets by using a combined structure, i.e., a bounded logarithmic vertical scale. Q1 and Q2 quickly become impractical: with k = 11 the space for the query/view cardinality multiset for an initial subset of elements, definition alone is four gigabytes. and a conventional set representation for additional elements.8 An appropriate syntax could hide this di- In contrast, our query specification has size linear in k, and it can be evaluated without joins. 8This idea was suggested to us by Wisam Dakka. vision from the user, presenting a single multiset ab- Spreadsheets in RDBMS for OLAP. In Proceed- straction. When small sets are typical, such an ap- ings of the 2003 ACM SIGMOD international proach would have performance benefits, even in the conference on on Management of data, pages 52– absence of a strict cardinality bound. 63. ACM Press, 2003. References [1] Oracle Database Application Developer’s Guide — Object-Relational Features, 2004. 10g Release 1 (10.1), Part Number B10799-01. [2] D. Chatziantoniou and K. A. Ross. Querying mul- tiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295–306, 1996. [3] C. J. Date. On various types of relations. Database Debunkings, April 20, 2003. Available at http://www.dbdebunk.com/. [4] Jan Van den Bussche and Dirk Van Gucht. A hi- erarchy of faithful set creation in pure OODB’s. In Joachim Biskup and Richard Hull, editors, Database Theory - ICDT’92, 4th International Conference, Berlin, Germany, October 14-16, 1992, Proceedings, volume 646 of Lecture Notes in Computer Science, pages 326–340. Springer, 1992. [5] Jan Van den Bussche and Dirk Van Gucht. The expressive power of cardinality-bounded set val- ues in object-based data models. Theoretical Computer Science, 149(1):49–66, 1995. [6] Sven Helmer and Guido Moerkotte. Evaluation of main memory join algorithms for joins with set comparison join predicates. In International Conference on Very Large Databases, pages 386– 395, 1997. [7] N. Mamoulis. Efficient processing of joins on set- valued attributes. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pages 157–168, 2003. [8] Karthikeyan Ramasamy, Jignesh M Patel, Raghav Kaushik, and Jeffrey F Naughton. Set containment joins: The good, the bad and the ugly. In International Conference on Very Large Databases, pages 351–362, 2000. [9] D. Jeffrey Ullman. Principles of Database and Knowledge-Base Systems, volume 2. Computer Science Press, 1989. [10] Allen Van Gelder and Rodney W. Topor. Safety and translation of relational calculus. ACM Trans. Database Syst., 16(2):235–278, 1991. [11] Andrew Witkowski, Srikanth Bellamkonda, Tolga Bozkaya, Gregory Dorman, Nathan Folkert, Ab- hinav Gupta, Lei Shen, and Sankar Subramanian.