Symmetric Relations and Cardinality-Bounded Multisets in Database Systems
Kenneth A. Ross, Julia Stoyanovich
Columbia University*

Abstract

In a binary symmetric relationship, A is related to B if and only if B is related to A. Symmetric relationships between k participating entities can be represented as multisets of cardinality k. Cardinality-bounded multisets are natural in several real-world applications. Conventional representations in relational databases suffer from several consistency and performance problems. We argue that the database system itself should provide native support for cardinality-bounded multisets. We provide techniques to be implemented by the database engine that avoid the drawbacks, and allow a schema designer to simply declare a table to be symmetric in certain attributes. We describe a compact data structure, and update methods for the structure. We describe an algebraic symmetric closure operator, and show how it can be moved around in a query plan during query optimization in order to improve performance. We describe indexing methods that allow efficient lookups on the symmetric columns. We show how to perform database normalization in the presence of symmetric relations. We provide techniques for inferring that a view is symmetric. We also describe a syntactic SQL extension that allows the succinct formulation of queries over symmetric relations.

* This research was supported by NSF grants IIS-0120939 and IIS-0121239.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

1 Introduction

A relation $R$ is symmetric in its first two attributes if $R(x_1, x_2, \ldots, x_n)$ holds if and only if $R(x_2, x_1, \ldots, x_n)$ holds. We call $R(x_2, x_1, \ldots, x_n)$ the symmetric complement of $R(x_1, x_2, \ldots, x_n)$. Symmetric relations come up naturally in several contexts when the real-world relationship being modeled is itself symmetric.

Example 1.1 In a law-enforcement database recording meetings between pairs of individuals under investigation, the "meets" relationship is symmetric. □

Example 1.2 Consider a database of web pages. The relationship "X is linked to Y" (by either a forward or backward link) between pairs of web pages is symmetric. This relationship is neither reflexive nor antireflexive, i.e., "X is linked to X" is neither universally true nor universally false. While an underlying relation representing the direction of the links would normally be maintained, a view defining the "is linked to" relation would be useful, allowing the succinct specification of queries involving a sequence of undirected links. □

Example 1.3 Views that relate entities sharing a common property, such as pairs of people living in the same city, will generally define a symmetric relation between those entities. □

Example 1.4 Example 1.1 can be generalized to allow meetings of up to $k$ people. The $k$-ary meeting relationship would be symmetric in the sense that if $P = (p_1, \ldots, p_k)$ is in the relationship, then so is any column-permutation of $P$. □

Example 1.5 Consider a database recording what television channel various viewers watch most during the 24 hourly timeslots of the day.^1 For performance reasons,^2 the database uses a table $V(\mathit{ID}, \mathit{ViewDate}, C_1, \ldots, C_{24})$ to record the viewer (identified by $\mathit{ID}$), the date, and the twenty-four channels most watched, one channel for each hour of the day. This table $V$ is not symmetric, because $C_i$ is not interchangeable with $C_j$: $C_i$ reflects what the viewer was watching at timeslot number $i$. Nevertheless, there are interesting queries that could be posed for which this semantic difference is unimportant. An example might be "Find viewers who have watched channels 2 and 4, but not channel 5." For these queries, it could be beneficial to treat $V$ as a symmetric relation in order to have access to query plans that are specialized to symmetric relations. □

^1 This example is based on a real-world application developed by one of the authors, in which there were actually 96 fifteen-minute slots.
^2 A conventional representation as a set of slots would require a 24-way join to reconstruct $V$.

There is a natural isomorphism between symmetric relationships among $k$ entities, and $k$-element multisets.^3 We phrase our results in terms of "symmetric relations" to emphasize the column-oriented nature of the data representation in which columns are interchangeable. Nevertheless, our results are equally valid if expressed in terms of "bounded-cardinality multisets".

^3 A multiset is a set except that duplicates are allowed.

Sets and multisets have a wide range of uses for representing information in databases. Bounded cardinality multisets would be useful for applications in which there is a natural limit to the size of multisets. This limit could be implicit in the application (e.g., the number of players in a baseball team), or defined as a conservative bound (e.g., the number of children belonging to a parent). We will demonstrate performance advantages for bounded-element multisets compared with conventional relational representations of (unbounded) multisets.

Storing a symmetric relation in a conventional database system can be done in a number of possible ways. Storing the full symmetric relation induces some redundancy in the database: more space is required (up to a factor of $k!$ for $k$-ary relationships), and integrity constraints need to be enforced to ensure consistency of updates. Updates need to be aware of the symmetry of the table, and to add the various column permutations to all insertions and deletions. Queries need to perform I/O for tuples and their permutations, increasing the time needed for query processing.

Alternatively, a database schema designer could recognize that the relation was symmetric and code database procedures to store only one representative tuple for each group of permuted tuples. A view can then be defined to present the symmetric closure of the stored relation for query processing. The update problem remains, because updates through this view would be ambiguous. Updates to the underlying table would need to be aware of the symmetry, to avoid storing multiple permutations of a tuple, and to perform a deletion correctly. For symmetric relations over $k$ columns, just defining the view (using standard SQL) requires a query of length proportional to $k(k!)$.

For both of the above proposals, indexed access to an underlying symmetric relationship would require multiple index lookups, one for each symmetric column.

A third alternative is to model a symmetric relation as a set [3] or multiset. Instead of recording both $R(a, b, c, d, e)$ and $R(b, a, c, d, e)$, one could record $R'(q, c, d, e)$, $S(a, q)$, and $S(b, q)$, where $q$ is a new surrogate identifier, and $R'$ and $S$ are new tables. The intuition here is that $q$ represents a multiset, of which $a$ and $b$ are members according to table $S$. Distinct members of the multiset can be substituted for the first two arguments of $R$. To represent tuples that are their own symmetric complement, such as $R(a, a, c, d, e)$, one inserts $S(a, q)$ twice. This representation uses slightly more space than the previous proposal, while not resolving the issue of keeping the representation consistent under updates. Further, reconstructing the original symmetric relation requires joins.

We argue that none of these solutions is ideal, and that the database system should be responsible for providing a "symmetric" table type. There are numerous advantages to such a scheme:

1. The database system could choose a compact representation (such as storing one member of each pair of symmetric tuples) and take advantage of this compactness in reducing the amount of I/O required. This representation can be used both for base tables that are identified as symmetric, and for materialized views that can be proven to be symmetric.

2. The database system could go even further, and add a symmetric-closure operator to the query algebra. A query plan over a symmetric relation could then be manipulated using algebraic identities so that the symmetric closure is applied as late as possible. That way, intermediate results will be smaller, and queries will be processed more efficiently.

3. Integrity would be checked by the database system. Single-row updates would be automatically propagated to the other column permutations if necessary. Inconsistencies would be avoided, and schema designers would not have to re-implement special functionality for each symmetric table in the database.

4. The database system could index the multiple columns of a symmetric relation in a single index structure. As a result, only one index traversal is necessary to locate tuples with a given value for some symmetric column.

The expressive power of cardinality-bounded sets has been previously studied in the context of an object-

In this paper, we propose techniques to enable such a "symmetric relation" table type. We provide:

• An underlying abstract data type to store the kernel of a symmetric relation, i.e., a particular nonredundant subset of the relation.
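
To make the notions of kernel and symmetric closure concrete, here is a minimal sketch in Python. It assumes that the kernel keeps, for each group of permuted tuples, the representative whose symmetric columns are in sorted order; the text only requires "a particular nonredundant subset", so this choice is an illustrative assumption. The function names `canonical`, `kernel`, and `symmetric_closure` and the sample data are likewise hypothetical, not the data structure described in the paper.

```python
from itertools import permutations


def canonical(row, k):
    """Representative of a row: its first k (symmetric) columns in sorted order.

    Sorting is an assumed way of picking one tuple per permutation group.
    """
    return tuple(sorted(row[:k])) + tuple(row[k:])


def kernel(rows, k):
    """Nonredundant subset: one representative per group of permuted tuples."""
    return {canonical(row, k) for row in rows}


def symmetric_closure(rows, k):
    """All tuples obtained by permuting the first k columns of each row."""
    closed = set()
    for row in rows:
        for perm in permutations(row[:k]):
            closed.add(tuple(perm) + tuple(row[k:]))
    return closed


# A "meets" relation in the style of Example 1.1, symmetric in its first
# two columns; the third column is an ordinary (non-symmetric) attribute.
full = {
    ("ann", "bob", "2004-01-05"),
    ("bob", "ann", "2004-01-05"),
    ("cy", "cy", "2004-02-01"),  # its own symmetric complement
}

stored = kernel(full, 2)                 # what would be kept on disk
restored = symmetric_closure(stored, 2)  # what a query would see
```

Under these assumptions, `stored` holds two tuples while `restored` equals the original three-tuple relation. A row whose k symmetric columns are all distinct expands to k! tuples under the closure, which is the space factor mentioned above for storing the full symmetric relation.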