ABCDEFFB ABCDEFFA BBEFECBCBBFEF ECEBFEF

arXiv:1208.0084v1 [cs.DB] 1 Aug 2012

ABSTRACT

Dependencies have played a significant role in database design for many years. They have also been shown to be useful in query optimization. In this paper, we discuss dependencies between lexicographically ordered sets of tuples. We formally introduce the concept of order dependency and present a set of axioms (inference rules) for them. We show how query rewrites based on these axioms can be used for query optimization. We present several interesting theorems that can be derived using the inference rules. We prove that functional dependencies are subsumed by order dependencies and that our set of axioms for order dependencies is sound and complete.

1. INTRODUCTION

Consider the following SQL query (in Example 1).

EXAMPLE 1.
ABCDEABBCFC EFAAA BFACA BAA AABA BEABCD.quarterCF BBABCD.quarterCF

In the schema, one table is a date dimension table with a row per day, and the other is a very large fact table recording all individual sales. Each has a surrogate-valued date key column, which is the primary key for the date dimension table. In the date dimension table, each row describes a given day with explicit columns such as year, quarter, month, and day that describe the natural date values (and additional columns that qualify that day, such as whether it is a weekend day or holiday).

Assume we have a tree index for the date table on year, month, day. This index cannot help in a query plan, however, to accomplish the groupby because quarter intercedes. Of course, quarter is logically redundant here, as month (which follows it in the groupby) functionally determines quarter. (First quarter encompasses the months of January, February, and March, second quarter, the months of April, May, and June, and so forth.) The query's author could not leave quarter out of the groupby, even if he realizes it would be better to, because it is stated in the query. The query optimizer could, however, use an index scan to have the tuple stream in year, month order to accomplish the group by on year, quarter, month, if it recognizes that year, month and year, quarter, month offer the same partition. This is done by query optimizers today by rewrite, given that the functional dependency (FD) information that month → quarter is available to the optimizer.

For the query above, the rewrite might still not be applied, since the query specifies the answers to be ordered by year, quarter, month. The FD that month → quarter is not logically sufficient to eliminate quarter from the orderby, as it was to eliminate it from the groupby. Since a query plan must guarantee the orderby, it likely will include a sort operator for year, quarter, month, after all.

To see that the FD does not suffice to eliminate quarter from the orderby, imagine the values for quarter were the strings first, second, third, and fourth. Data would be lexicographically ordered as first, fourth, second, then third! Of course, we intend that values of quarter are, say, 1, 2, 3, and 4, so the data would order naturally as by date. It is unfortunate, then, that quarter is, in fact, redundant (in this query) in the orderby also, but that the optimizer does not have the means to eliminate it.

What is missing is the semantic information that month orders quarter, which is more than just that month functionally determines quarter. This states that as values ascend from one tuple to another on month, they must ascend, or stay the same, from the one tuple to the other on quarter (that is, the values do not descend from the one tuple to the other on quarter). These have been called order dependencies (ODs), in contrast to functional dependencies. Our objective is to bring reasoning about order dependencies into the query optimizer. A query plan for the query above could then eliminate quarter from both the orderby and the groupby clauses, and the index on year, month, day might then provide for an efficient plan with no need for a sort operator.

The notion of order dependencies can be greatly generalized, and the potential use of them in query optimization shown to be vast.
The relationships between ordered sets have been explored in the past and several different notions have been considered. In this work, we consider just lexicographical ordering of tuples, as by the orderby operator in SQL, because this is the notion of order used in SQL and within query optimization for tuple streams.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. Proceedings of the VLDB Endowment, Vol. 5, No. 11 Copyright 2012 VLDB Endowment 2150-8097/12/07... $ 10.00.

The contribution of this paper is to present an axiomatization for order dependencies, analogous to Armstrong's axiomatization for FDs [1]. This provides a formal framework for reasoning about ODs. There are two reasons for one to pursue an axiomatization.

1. The axioms provide insight into how dependencies behave, and patterns for how dependencies logically follow from others, that are not easily evident reasoning from first principles.
2. A sound and complete axiomatization is the first necessary step to designing an efficient inference procedure.

Our axioms for ODs help us explore beneficial query rewrites. We show how they can be cast as a new type of integrity constraint to be used in query optimization. We derive theorems based on our axioms, which illustrate surprising inferences and equivalences over ODs, and which can provide for powerful query rewrites. While ODs have been explored before, we present the first general axiomatization for them. We prove the soundness of the axioms. We demonstrate that Armstrong's axiomatization for FDs is subsumed within our axiomatization for ODs. (In this sense, ODs can be thought of as a generalization of FDs.) We then prove the completeness of the set of axioms.

Working with ODs is more involved than with FDs because the order of the attributes matters. Thus, we must work with lists of attributes instead of with sets. This necessarily complicates our axioms, compared with Armstrong's axioms for FDs, and the proofs of our theorems.

In Section 2, we present order dependencies (ODs) formally. We provide background, our notational conventions, and definitions for ODs (Section 2.1). We show from where ODs in databases naturally arise (Section 2.2). We demonstrate a number of effective ways ODs may be used in query optimization (Section 2.3). We discuss a query optimization technique with ODs that we have implemented as a prototype in IBM DB2 [18], and our ongoing work with these techniques. In Section 3, we introduce the axiomatization for ODs (Section 3.1), and we prove the soundness of the axioms (Section 3.2). We derive a collection of theorems using our axioms, which we use in the proof of completeness and which illustrate the utility of our axioms (Section 3.3). In Section 4, we prove the completeness of the axiomatization. We sketch our proof of completeness (Section 4.1). We demonstrate how FDs are subsumed within order dependencies (Section 4.2). With the requisite pieces in place, we present the formal proof of completeness of the axiomatization (Section 4.3). In Section 5, we discuss related work. In Section 6, we present plans for future work and make concluding remarks. This work, we feel, opens exciting avenues for future work to develop a powerful new family of query optimization techniques in database systems.

CDDBDB

We first set out formal definitions for order dependencies that we need later in proofs. Next, we illustrate ODs in databases and how they arise. We then show the use-case scenarios for ODs for query optimization.

We adopt the notational conventions in Table 1. We consider a relation R with a schema set of attributes. Let r be an arbitrary table instance over R, thus a set of tuples under R's schema. We limit table instances to sets in our definitions, to keep our definitions simpler and easier to follow. However, this could be changed to multisets easily, with no consequences to our axiomatization.

Table 1. Notational conventions.

Relations
• A capital letter in bold italics represents a relation: R.
• A small letter in bold represents a relation instance (table): r.
• We use capital letters to represent single attributes: A, B, C.
• Tuples are marked with small letters in italics: s, t.

Sets
• Calligraphic letters stand for sets of attributes: 𝒳, 𝒴, 𝒵.
• We use proximity for union of sets: 𝒳𝒴 is shorthand for 𝒳 ∪ 𝒴. Likewise, A𝒳 or 𝒳A, where 𝒳 is a set of attributes and A a single attribute, stands for 𝒳 ∪ {A}.
• Also t_𝒳 denotes the projection of the tuple t on the attributes of 𝒳, while t_A is the shorthand for t_{A}.

Lists
• Bold letters stand for lists of attributes: X, Y, Z. Note a list could be the empty list, [].
• We use square brackets to denote a list: [A, B, C]. The notation [A | T] denotes that A is the head of the list, and T is the tail of the list, the remaining list with the first element removed.
• Proximity is used for concatenation of lists of attributes: XY is shorthand for X ∘ Y. Likewise, AX and XA stand respectively for [A] ∘ X and X ∘ [A], where X is a list of attributes and A is a single attribute. AB denotes [A, B].
• X′ denotes some other permutation of the elements of list X.

DEFINITION 1. (operator ≼) Let X be a list of attributes, and s and t be two tuples in relation instance r. Operator ≼ is defined as follows:
s ≼_X t, where X = [A | T],
if (s_A < t_A)
or if ((s_A = t_A) and (T = [] or s ≼_T t))

In this paper, we assume ascending (asc) order in the lexicographical ordering. (This is SQL's default.) We do not consider descending (desc) orders, mixing of asc and desc [19], or use of functions in the order directives.

DEFINITION 2. (operator ≺) Let X be a list of attributes, and s and t be two tuples in relation instance r. The operator ≺ is defined as follows: s ≺_X t iff s ≼_X t and t ⋠_X s.

DEFINITION 3. (s = t) Let X be a list of attributes, and s and t be two tuples in relation instance r. Then s =_X t iff s ≼_X t and t ≼_X s.

DEFINITION 4. (order dependency) Let X and Y be lists of attributes. Call X ↦ Y an order dependency (OD) over the relation R if, for every pair of tuples s and t in relation instance r over R, s ≼_X t implies s ≼_Y t.

Whenever X ↦ Y, we say that X orders Y. X and Y are order equivalent iff X ↦ Y and Y ↦ X. We denote this by X ↔ Y.

A B C D E F
3 2 0 4 7 9
3 2 1 3 8 9
Figure 1. Relation instance r.
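As a minimal sketch of Definitions 1 and 4, the following Python checks by brute force over all pairs of tuples whether a table instance satisfies an OD. The function names are illustrative and not from the paper, tuples are assumed to be dictionaries keyed by attribute name, and treating the empty list as "always in order" is a convention adopted here. The sample data are the two rows of Figure 1, and the last helper tests order compatibility through the pair of ODs XY ↦ YX and YX ↦ XY.

```python
def precedes_or_equal(s, t, attrs):
    """Return True when s comes no later than t on the attribute list attrs
    (the recursive lexicographic comparison of Definition 1)."""
    if not attrs:
        # Assumption: on the empty list every pair of tuples is in order.
        return True
    head, tail = attrs[0], attrs[1:]
    if s[head] < t[head]:
        return True
    return s[head] == t[head] and precedes_or_equal(s, t, tail)


def od_holds(rows, lhs, rhs):
    """Check the OD lhs -> rhs on a table instance (Definition 4):
    whenever s precedes-or-equals t on lhs, it must also do so on rhs."""
    return all(
        precedes_or_equal(s, t, rhs)
        for s in rows
        for t in rows
        if precedes_or_equal(s, t, lhs)
    )


def order_compatible(rows, x, y):
    """x ~ y: the concatenations x+y and y+x order each other."""
    return od_holds(rows, x + y, y + x) and od_holds(rows, y + x, x + y)


# The two rows of Figure 1.
FIGURE_1 = [
    dict(A=3, B=2, C=0, D=4, E=7, F=9),
    dict(A=3, B=2, C=1, D=3, E=8, F=9),
]

print(od_holds(FIGURE_1, ["A", "B", "C"], ["F", "E", "D"]))  # True
print(od_holds(FIGURE_1, ["A", "B", "C"], ["F", "D", "E"]))  # False
print(order_compatible(FIGURE_1, ["A", "B"], ["F", "C"]))    # True
print(order_compatible(FIGURE_1, ["A", "C"], ["F", "D"]))    # False
```

On the Figure 1 rows these results agree with Examples 2 and 3. A pairwise check like this only tests one instance; it says nothing about whether an OD is a constraint on the relation.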

EXAMPLE 2. Note that [A, B, C] ↦ [F, E, D] is consistent with r, but [A, B, C] ↦ [F, D, E] is falsified by r in Figure 1.

The OD X ↦ Y means that Y's values are monotonically non-decreasing with respect to X's values. Thus, if a list of tuples is ordered by X, then it is also necessarily ordered by Y, but not necessarily vice versa. That is to say, if one knows X ↦ Y, then one knows that any ordering of the tuples of r, for any r, that satisfies order by X also satisfies order by Y.

There is a clear relationship between ODs and FDs. Any OD implies an FD (modulo lists and sets), but not vice versa.

LEMMA 1. (relationship between ODs and FDs) For every relation instance r over R, if X ↦ Y holds, then the FD set(X) → set(Y) holds.

PROOF. Let s, t ∈ r, such that s_X = t_X. Therefore, s ≼_X t and t ≼_X s. By the definition of OD, s ≼_Y t and t ≼_Y s, hence s_Y = t_Y. □

DEFINITION 5. (order compatible) Two lists X and Y are order compatible, denoted as X ~ Y, iff XY ↔ YX.

EXAMPLE 3. Note that [A, B] ~ [F, C] is consistent with r, but [A, C] ~ [F, D] is falsified by r in Figure 1.

The concept of functional dependencies has come to have profound importance in databases, especially in schema design. While functional dependencies are a simple notion in some ways, reasoning over them is, somewhat surprisingly, not nearly as simple. To gain insight into how sets of FDs behave, and to simplify the reasoning process over them, Armstrong provided an axiomatization for them [1]. Beyond layout and indexes, FDs play additional important roles in query optimization. Knowledge about prescribed FDs on the schema is used in the query-rewrite phase of optimization potentially to eliminate predicates. They are used in the cost-based phase to do better cardinality estimation. They are used also to recognize partitioning equivalences of tuple streams within query plans.

We have introduced ODs in analogy to FDs: functional dependencies are to groupby as order dependencies are to orderby. On the one hand, order is not important in the pure relational model on the logical side of the fence. Relational instances are sets of tuples. (Implemented systems allow for multisets of tuples, but again, there is no notion of order.) A schema is a set of attributes. SQL concedes a single orderby clause to be appended to a query to order the result set, as a convenience, given that people often want to see the results sorted in a given way. (This said, there are many places where order is meaningful. Certain extensions to the relational model make order a part of the model. For other data models such as XML, and XQuery over it, order is an integral part of the model.)

On the other hand, order plays pivotal roles on the physical side, in the physical database and in query optimization. Data is often stored sorted by a clustered (tree) index's key. In a query plan, an operator that takes as input the output stream of another operator can benefit in cases when the stream is sorted in a particular way. Aggregation queries (groupby) can be evaluated on the fly if the stream is ordered already in a way compatible with the requested group by partition, rather than needing to do a partitioning operation that could involve heavy I/O expense.

Given X ↦ Y, if one has an SQL query with order by Y, one can rewrite the query with order by X instead, and meet the intent of the original query. However, the rewritten query is not semantically equivalent to the original (unless X ↔ Y)! One could not legally rewrite the query with order by X with order by Y instead. Strengthening the orderby conditions is permitted, but weakening them is not. (This is true, too, inside query plans for ordered tuple streams.)

One does not need order equivalence then to accomplish useful query rewrites. Directional order dependencies (e.g., X ↦ Y, but not Y ↦ X) suffice. This makes ODs that much more versatile for rewrites. Notice this differs from the use of FDs for query rewrites, for instance, to simplify groupby's. To replace year, quarter, month by year, month in the groupby for the query in the example in Section 1, one should know the two are functionally equivalent. One could not replace it by year, month, day, for example, even though {year, month, day} → {year, quarter, month}.

Within query plans, groupby (partitions) can be accomplished either by a partition operation (such as by use of a hash index), or by the use of an ordered tuple stream (as provided by a tree-index scan or by a sort operation). When rewriting the partition criteria, if a partition operation is employed, the criteria must be equivalent. However, when an ordering operation is employed instead, then one has the same flexibility as noted for ODs. Strengthening the criteria suffices. For instance, sorting by year, month, day would suffice to accomplish the groupby on year, quarter, month. (Group divisions can be found on the fly in the stream.)

An OD can be declared as an integrity constraint to prescribe which instances are admissible. (We have introduced this new type of constraint in a prototype branch of IBM DB2. See Section 2.3.) One can reason over ODs on relations in a similar way one now reasons about FDs over relations. Some order dependencies are trivial [20]. That is, they are (trivially) satisfied by every table instance. For example, consider XY ↦ X. Others are not trivial. If one knows a collection of order dependencies, ℳ, declared as integrity constraints over relation R, one might soundly infer additional order dependencies that must be true for R. For example, if X ↦ Y and Y ↦ Z are true, then X ↦ Z is true also. (That is, ODs are transitive.)

While order is not part of the relational model, per se, ordered value domains are of key importance for most databases, and most queries. Many types of ODs are apparent in the semantics of databases (even though these ODs are not declared explicitly). Perhaps the most important of these ordered domains in practice is time. Time and date (time at a coarser granularity) are richly supported in the SQL standards. The common benchmark TPC-DS has 99 queries. Of these, 85 involve date operators and predicates (and five involve time operators and predicates). This is common for data warehouses. Even if we were just limited to ODs over the date/time domain, we could derive great benefits in query optimization.

Figure 2 represents possible ODs, in which the left-hand side of a dependency is time and the right-hand side is one of the paths through the diagram. Each node is an equivalence class of the list of attributes leading up to it, with respect to the starting point. Theorem 10 proves that any list appearing on the left side can be suffixed by attributes appearing along an equivalent path. This is shown in Example 4.

EXAMPLE 4. As [F] ↦ [AC EB] holds and [A] ↦ [ABCFCA], it follows (from Theorem 10 below) that [F] ↦ [ACFCEB].

Order dependencies are not just limited to the time domain, however. They arise naturally in many other domains from the real-world semantics associated with given data. All that is required is that the values of a column (or list of columns) are monotonically non-decreasing with respect to the values of another column (or list of columns). This property is fairly common when columns are functionally related.

EXAMPLE 5. Consider a table Taxes that includes columns for taxable income, tax bracket, and taxes on the income. The tax brackets are based on the level of income (and so rise with income level). Assume taxes go up with income. Then, [income] ↦ [bracket] and [income] ↦ [taxes]. It follows (from Theorem 2 below) that [income] ↦ [bracket, taxes].

Assume the table has a tree index on income. Given a query on the table with an orderby on bracket, taxes, with the OD above, it could be evaluated using the index on income. Instead of being columns with explicit data, bracket and taxes could be derived by functions or case expressions (say, if Taxes were a view) or generated columns in the table. In these cases, it would be possible for the database system to derive the order-dependency constraints above automatically. In [12], it was shown how to derive such monotonicity "constraints" from generated columns via algebraic expressions (in IBM DB2). Of course, one could prescribe the set of order dependencies as check constraints directly to benefit by this technique.

Such monotonic dependencies can be derived from built-in SQL functions, from user-defined functions (to some degree), and from case expressions. The SQL function year, for example, extracts the year component of a datestamp. Thus, given a datestamp column d, [d] ↦ [year(d)].

In [17], the authors expounded on the important role of interesting orders in query optimization. They demonstrated numerous examples of how better reasoning over interesting orders in the query optimizer could lead to significantly better performing query plans. They introduced query rewrites in IBM DB2 that could replace one labeled interesting order by another, when it is known the two order in the same way (that is, are order equivalent, as we have defined it). They showed how these rewrites could allow the optimizer to consider additional query plans that process join, orderby, groupby, and distinct operators more efficiently. By recognizing that a tuple stream ordered with respect to some criteria is equivalently ordered with respect to other criteria, a sort on input can be removed for a sort-merge join. Orderby and groupby operators can be satisfied with no need for a sorting or partitioning operation more often, as with our Example 1. Likewise, as the distinct operator is exchangeable with groupby, the need for a sorting or partitioning operation to satisfy distinct can be lessened.

Our work builds upon this work. Their rewrites rely on functional dependency information available to the optimizer, but do not exploit any order semantics, as defined by us. Our work permits a greater range of rewrites. For example, they could reduce an orderby year, month, quarter to an orderby year, month, based upon the FD month → quarter. (Likewise, they could reduce the equivalent groupby.) However, they could not reduce the orderby year, quarter, month to year, month, as we did in Example 1, since their techniques do not employ the idea of ODs. (It is Theorem 8 below, called Left Eliminate, which follows from our axiomatization, which justifies this rewrite.)

In [17], they introduced a rewrite algorithm for orderby called Reduce Order. It sweeps the orderby attribute list from right to left, seeking to eliminate attributes. Each iteration through the list, the prefix set with respect to the current attribute (that is, the set of attributes to the left of the current) is checked to see whether it functionally determines the current attribute. If so, the attribute is dropped from the list.

We can augment that algorithm to do an additional step. Each iteration through the list, it can additionally be checked whether any postfix list with respect to the current attribute (that is, a list of attributes to the right of the current) orders the current attribute. If so, the attribute is dropped from the list. Given the OD [month] ↦ [quarter], both orderby year, month, quarter and year, quarter, month would be reduced to year, month.

Order dependencies are in terms of lists of attributes, not sets as for functional dependencies. This makes matching in rewrites using ODs more complex generally, but also increases the possibilities for matches. Consider D ↦ B. Then ABD could be reduced to AD. However, ABCD cannot be! The attribute C intervening between the B and D invalidates the rewrite. For the rewrite by Theorem 8 to apply, the list on the right-hand side of the OD must directly precede the list on the left-hand side. If we knew D ↦ BC, then ABCD could be reduced to AD.

A major part of our continued work with order dependencies is to develop a number of efficient rewrite rules for the query optimizer, as they did in [17], to exploit ODs effectively. Our OD axiomatization provides us the means now to pursue this. The axioms and related theorems as in Section 3.3 provide us with insight into the types of rewrites that are possible. In [18], we developed query rewrites in a prototype branch of the IBM DB2 9.7 codebase that demonstrates the effectiveness of rewrites using order equivalences. In data warehouses, date is often represented explicitly as a dimension table of its own, with the primary key of the date table made as a surrogate key [11]. While this design can have compelling advantages, the surrogate key can cause problems for efficiently evaluating queries.
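The orderby reduction discussed in this section can be sketched as follows, assuming ODs are supplied as (left-hand list, right-hand list) pairs. Step 1 mirrors the right-to-left sweep attributed to [17], relying on the fact that an OD implies the corresponding FD (Lemma 1). Step 2 is the additional OD-based check: a list is dropped when it is directly followed by a list that orders it. The function name and data layout are hypothetical, and the sketch matches the declared ODs literally rather than reasoning over their closure.

```python
def reduce_order_by(order_by, ods):
    """Shorten an orderby attribute list using declared ODs.

    order_by: list of attribute names.
    ods: list of (lhs, rhs) pairs of attribute lists, each read as lhs orders rhs.
    """
    attrs = list(order_by)

    # Step 1: sweep right to left; drop an attribute when the set of
    # attributes to its left functionally determines it. Each OD is used
    # here only as the FD set(lhs) -> set(rhs) that it implies.
    for i in range(len(attrs) - 1, -1, -1):
        prefix = set(attrs[:i])
        if any(set(lhs) <= prefix and attrs[i] in rhs for lhs, rhs in ods):
            del attrs[i]

    # Step 2: drop a sublist rhs when it is immediately followed by lhs.
    # An intervening attribute blocks the match, as with ABCD and D -> B.
    changed = True
    while changed:
        changed = False
        for lhs, rhs in ods:
            pattern = list(rhs) + list(lhs)
            for i in range(len(attrs) - len(pattern) + 1):
                if attrs[i:i + len(pattern)] == pattern:
                    del attrs[i:i + len(rhs)]
                    changed = True
                    break
            if changed:
                break
    return attrs


month_orders_quarter = [(["month"], ["quarter"])]
print(reduce_order_by(["year", "quarter", "month"], month_orders_quarter))
print(reduce_order_by(["year", "month", "quarter"], month_orders_quarter))
print(reduce_order_by(list("ABD"), [(["D"], ["B"])]))
print(reduce_order_by(list("ABCD"), [(["D"], ["B"])]))
print(reduce_order_by(list("ABCD"), [(["D"], ["B", "C"])]))
```

With [month] ↦ [quarter], both date lists reduce to year, month. With D ↦ B, ABD reduces to AD while ABCD is left unchanged, and with D ↦ BC, ABCD reduces to AD, matching the cases described in the text.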

A majority of queries in a data warehouse are over the fact table. A query often uses natural date values in predicates. However, date in the fact table is recorded by the surrogate key. This necessitates a potentially quite expensive join between the fact table and the date dimension table when the query is evaluated. There is an additional problem when a fact table has been partitioned by date in order to accommodate a very large table (e.g., in distributed systems). Since the date range (surrogate values) over the fact table cannot be determined from the query (natural values), all partitions of the fact table must be scanned. We optimize such queries involving dates by removing this join, and choosing just the relevant partitions of the fact table when the table is distributed.

A number of queries in the TPC-DS benchmark have this condition. Fortunately, we have a guarantee (an OD) that the surrogate (date) keys in the date dimension table are ordered in the same way as natural date values in the dimension table. Thus, the query plan can make two probes into the dimension table to calculate the range of the surrogate keys from the fact table. These two probes into the date table find the minimum and maximum surrogate key values. These two surrogate key values replace the range predicate, which allows the index on the date column in the fact table to be used.

The OD axioms (the inference rules OD1–OD6 of Definition 7):
1. Reflexivity: XY ↦ X
2. Prefix: if X ↦ Y, then ZX ↦ ZY
3. Normalization: WXYXV ↔ WXYV
4. Transitivity: if X ↦ Y and Y ↦ Z, then X ↦ Z
5. Suffix: if X ↦ Y, then X ↔ YX
6. Chain: if X ~ Y_1, and ∀i ∈ [1, n−1] Y_i ~ Y_{i+1}, and Y_n ~ Z, and ∀i ∈ [1, n] Y_i X ~ Y_i Z, then X ~ Z

Two of our axioms generate trivial dependencies [20]: Reflexivity and Normalization. We define the closure of the set of ODs ℳ, denoted ℳ⁺, to be the set of ODs that are logically implied by ℳ.

DEFINITION 8. (closure of ℳ using ℐ) Let ℐ = {OD1–OD6}, then ℳ⁺ = { X ↦ Y | ℳ ⊢ℐ X ↦ Y }.

DEFINITION 9. (equivalent sets of ODs) Let ℳ and ℳ' be sets of ODs. We say that ℳ and ℳ' are equivalent iff

The details of when and how this rewrite can be performed in a { ↦ | ℳ ⊨ ↦ } = { ↦ | ℳ′ ⊨ ↦ }. general way are provided in [18]. We built a prototype implementing such rewrites in IBM DB2 V.9.7 and performed F experiments over TPCDS to demonstrate the efficiency of the In this subsection, we address the problem of showing that our approach. Thirteen of TPCDS`s queries matched the conditions OD axioms are sound. This is to say, they lead only to true for this rewrite. Every one of these thirteen benefited, with an conclusions. average performance gain of 48%. Since this work reported in [18], we have continued work on the prototype. We have added a DEFINITION 10 (soundness of OD axioms) Let ℐ be a set of new type of check constraint which expresses an OD. We have inference rules {OD1–OD6}. Then ℐ is sound for logical implemented more OD rewrite rules which now rewrite eighteen implication of ODs if ↦ is deduced from ℳ (ℳ ⊢ℐ ↦ ) of TPCDS’s queries with performance gain. Consider our using axioms ℐ, then ↦ is true in any relation in which the technique from [18] combined with an OD rewrite of the orderby dependencies of ℳ are true ℳ⊨ ↦ . for our query in Example 1. If we have the OD that Let be a relation over R. The following Lemmas are true. [A] ↦ [AB, F], the orderby and groupby operators in a query plan could be accomplished by an index scan LEMMA 2. (soundness of Reflexivity)DAAFACC over the index for , the fact table, on , then A A PROOF. Let C F ∈ rrr , such that ≼. From the joining the results against the dimension table A. recursiveness of Definition 1 of operator ≼ it follows that (1) = and ≼ or (2) ≺ . (1) and (2) imply that ≼ , ACAACB therefore ∀ . ↦ . □ A key concern in dependency theory is developing the algorithms for testing logical implication. Developing inference rules is an LEMMA 3.soundness of Prefix DAACC approach to show logical implication between dependencies. PROOF. Let C F ∈ rrr , such that ≼. This implies (1) ≺ or (2) = and ≼. 
For (1) ≼ holds as ≺. In the second scenario (2), ≼ implies ≼ DEFINITION 6. (A proof of OD θ from ℳ) Let ℳ be a set of ( ↦ is given). Hence, as = it is true that ≼ . prescribed ODs. A proof of OD from ℳ with the set of ∀ . ↦ implies ↦ . □ inference rules ℐ is a sequence = ,…, ( ≥ 1) such that LEMMA 4. (soundness of Normalization) BEAEFA AC for ∈ [1,] either ∈ ℳ, or there exists a substitution for C some rule ∈ ℐ , such that is consequence of , and such that for each order dependency in the predecessor of the PROOF. A Let CF∈ rrrr, such that ≼. This implies corresponding order dependency is in the set { ∣ 1 ≤ < }. that: (1) = and ≼ or (2) ≺. In (1) = as = . Therefore we can suffix by list The OD is provable from ℳ using axioms ℐ (relative to set of and = holds. Hence, ≼ as we know attributes ) denoted ℳ ⊢ℐ , if there is a proof of from ℳ that ≼. Scenario (2), as ≺ implies that we can using ℐ. We now introduce axioms (inference rules) for ODs. suffix list by and ≼ holds.

DEFINITION 7. (OD axioms) The inference rules for ODs are as CBA Let CF∈ rrrr, such that ≼. This implies follows. that: (1) = and ≼ or (2) ≺. In (1) = as =. Hence, ≼ as we know that

1224 ≼. Therefore, ≼. Scenario (2), as ≺ THEOREM 4. PROOF. implies that we can suffix list by and ≼ (Shift) 3 ↦ [Aug(1)] . holds. ∀ . ↔ □ 1 ↔ 4 ↦ [Pref(3)] 5 ↔ [Norm] LEMMA 5.soundness of Transitivity ECAFAAFACC 2 ↦ ↦ 6 ↦ [Tran(4,5)]

PROOF.Let CF∈ rrrr, such that ≼. By ↦ which is given 7 ↔ [Suf(6)] ≼ which implies ≼ and it ends the proof. ∀ . ↦ ∧ 8 ↔ [Norm] ↦ implies ↦ . □ 9 ↔ [Tran(7,8)] LEMMA 6.Soundness of Suffix DDAACC 10 ↦ [Aug(1) 11 ↦ [Suf(10)] PROOF. A Let CF∈ rrrr, such that ≼. Therefore ≼ as 12 [Tran(9,11)] ↦ is given, which implies that ≼ ( ↦ ). ↦ 13 ↦ [Pref(2)] CB A Let C F ∈ rrr , such that ≼ . Therefore (1) ↦ [Tra(12,13)] □ = and ≼ or (2) ≺ is true. Scenario (1) directly implies that ≼ ( ↦ ). Scenario (2) where ≺ THEOREM 5. PROOF. Decomposition) implies that ≺. This is because ⊀ implies ≺ 2 ↦ [Ref] which implies ≺ . Hence ⊀ . This ends the proof as 1 ↦ ↦ [Tran(1,2)] □

≼ ( ↦ ). ∀ . ↔ □ ↦

LEMMA 7. (Soundness of Chain)EAACC The following theorem is helpful to prove the Eliminate, Left PROOF. Without loss of generality, assume that the lists in the Eliminate and Drop. axiom are single attributes. Let = A, = B,…, = B and THEOREM 6. PROOF. =C This simplification makes it easier to extend the rule to (Replace) lists. The proof is by contradiction. Assume that A and C are order 2 ↦ [Ref] incompatible. Then there are two tuples for which there is a swap 1 ↔ 3 ↦ B [Shift(1,2)] (The notion of swap is formalized in Definition 14) of the values ↔ B 4 B↦ [Shift(1,2)] between A and C. Also the two tuples disagree on attribute B for 5 ↦ B [Pref(3)] all A. Otherwise condition number 4 would not be true. As A ~ B, 6 B↦ [Pref(4)] the values for Bfollow A, so does the rest of attributes B because ↔ B [Tran(5,6)]□ of the condition (2). This means the two rows look in the PROOF. following way: THEOREM 7. (Eliminate) 2 ↔ [Suf(1)] A B B … B C 1 3 ↦ [Pref(2)] 0 0 0 0 0 1 ↦ 1 1 1 1 1 0 4 ↔ [Norm] ↔ FA C 5 ↔ [Norm 6 ↔ [Tran(35)] But then B is order incompatible with C, which we assumed not to be the case. We conclude with contradiction. □ 7 ↔ [Rep(6)] 8 ↔ [Norm] THEOREM 1. (soundness) EABC E C D 9 [Rep(6) AEABAEFADC ↔ ↔ [Tr(79)] □ PROOF. In order to prove the soundness of ℐ we have to prove that each of the rules is sound. This is Lemma 2 – Lemma 7. □ THEOREM 8. PROOF. (Left Eliminate) 2 ↔ [Suf(1)] □ 1 ↦ ↦ [Rep(1,2)] □

We introduce additional inference rules as they will be used ↔ throughout the paper. PROOF. THEOREM 9. THEOREM 2 PROOF. (Drop) 3 ↦ [Rep(2)] (Union) 3 ↦ [Pref(2)] 1 ↦ 4 ↦ [Tran(1,3)] 1 ↦ 4 ↦ [Suf(1)] 2 ↔ 5 ↦ [Dec(4)] 2 ↦ ↦ [Tran(3, 4)] □ ↦ 6 ↦ [Aug(5)] ↦ 7 ↔ [Suf(6)] THEOREM 3. PROOF. 8 ↔ [Norm] (Augmentation) 2 ↦ [Ref] 9 ↔ [Tran(7,8)] 1 ↦ ↦ [Tran(1,2)] □

↦ 10 ↦ [Tran(4,9)] 011 ↦ [Rep(2)] 12 ↦ [Tran(10,11 ↦ [Dec(12)])] □

1225 THEOREM 10. PROOF. ↦ is in ℳ . (So ↦ not in ℳ by Theorem 15 appear) (Path) 3 ↦ [Dec(1) Thus, is complete for ℳ. 1 ↦ 4 ↦ [Tran(2,3)] DEFINITION 15. (split(ℳ)) Split(ℳ) is a table that demonstrates 2 ↔ B 5 ↦ [Union(3,4)] for each ↦ which is not in ℳ that ↦ is falsified by

↦ split (and so, falsifies ↦ , too). 6 ↔ [Norm] 7 ↦ [Trans(5,6)] DEFINITION 16. (swap(ℳ))Swap(ℳ) is a table that demonstrates for each ↔ which is not in ℳ that 8 ↦ [Elim(2,7)] ↔ is falsified by split (and so, falsifies ↦ , too). 9 ↦ [Union(1,8)] 10 ↔ [Norm] Chain axiom is used to prove following two theorems. PROOF. ↦ [Tr(1,10)] □ HEOREM T 11.

(Partition) 4 ↔ [Suf(1)] □ 5 ↔ [Ref] CB 1 ↦ 2 ↦ 6 ↦ [Union(1,5)] In Section 4.1, we sketch the important elements of the proof for 3 set() = set() 7 ↔ [Suf(6) completeness of our OD axiomatization. We establish that ODs ↔ 8 ↔ [Norm(7)] subsume FDs in Section 4.2, followed by the formal completeness 9 ↔ [Tran(4,8)] proof of our axiomatization in Section 4.3. 10 ↔ [2,49] C 11 ~ [(9)] Our proof is constructive. To prove the axiomatization is 12 ~ [(10)] complete, it suffices to demonstrate, for any set of ODs ℳ, a 13 ↔ [Ref] table can be constructed that CEFACDAC (Lemma 14), and is 14 ↔ [Elim(1,2,13)] BF (Lemma 15) with respect to, ℳ using ℐ, the 15 ~ [Chain(1114)] axiomatization. 16 ↔ [(15)] ↔ [Norm(3,16)]□ DEFINITION 11. (a table satisfies ℳ) A table CEFACDAC ℳ ADD PROOF. no OD that is derivable over ℳ using ℐ (thus, in ℳ ) is falsified THEOREM 12. by the table . (Downward Closure) 2 ↦ [Ref] 3 [Tran(1,2)] DEFINITION 12. (a table is complete with respect to ℳ) A 1 ~ ↦ table is BF with respect to ℳ ADD every OD that is ~ 4 ↦ [Ref] constructible over the attributes that appear in ℳ that is not 5 ↦ [Union(3,4)] derivable over ℳ using ℐ (thus, is not in ℳ) is falsified by the 6 ↦ [Union(3,4)] table . ~ [Part(5,6)]□

In Section 3.2 in Theorem 1, we proved the soundness of ℐ. Thus, any table that satisfies each OD in ℳ satisfies ℳ, and no table that satisfies ℳ can falsify any OD in ℳ⁺. An OD X ↦ Y can be falsified in just two ways by a table. (See Theorem 15.) We name these two ways split and swap.

DEFINITION 13. (split) A split with respect to an OD X ↦ XY is a pair of tuples s and t in table r such that s_X = t_X but s_Y ≠ t_Y; that is, s and t have the same value for X but different values for Y. Thus, the split from r falsifies X ↦ XY. (Consequently, X ↦ Y is falsified, too.) This just says that set(X) does not functionally determine set(Y).

DEFINITION 14. (swap) A swap with respect to an OD XY ↔ YX is a pair of tuples s and t in table r such that s ≺_X t, but t ≺_Y s; i.e., s comes before t in any stream satisfying order by X, but t comes before s in any stream satisfying order by Y. Thus, the swap from r falsifies XY ↔ YX. (Consequently, X ↦ Y is falsified, too.)

The table r that we construct for the set of order dependencies ℳ will consist of two parts: split(ℳ) and swap(ℳ). We shall construct these two parts of r – the first half of the table, split(ℳ), and the second half, swap(ℳ) – in such a way that r satisfies ℳ. The purpose of split(ℳ) will be to falsify every OD of the form X ↦ XY not in ℳ⁺. The purpose of swap(ℳ) will be to falsify every OD of the form X ↦ Y, XY ↔ YX not in ℳ⁺ but for which X ↦ XY is in ℳ⁺.

In the table that we construct, we shall use integer values for the cells. (A cell is a given column entry of a given row.) We construct the table by adding splits and swaps. We have to make sure that these pieces, combined together, do not interfere. That is why we formalize the notion of append. When we append two tables, we shall ensure that the resulting table cannot introduce any splits (except [] ↦ X) or swaps beyond those that appear in each of the two tables alone (Lemma 9).

DEFINITION 17. (append) Appending two subtables, a first and a second, is accomplished by the following steps:
1. Find the minimum value, x, over all cells of the first subtable. Subtract x from all cells in it. (Now its minimum value is zero.) Do the same for the second subtable.
2. Find the maximum value, y, over all cells of the first subtable. Add y + 1 to all cells in the second subtable.
The resulting table of the append is the union of the two subtables as adjusted in steps 1 and 2.

First subtable    Second subtable    Result of append
A B C D           A B C D            A B C D
0 0 0 0           0 1 0 0            0 0 0 0
0 0 1 1           1 0 0 0            0 0 1 1
                                     2 3 2 2
                                     3 2 2 2
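The append steps of Definition 17 can be sketched in a few lines of Python and checked against the four-column example that accompanies the definition. This is an illustrative sketch only: the function names are assumptions, tables are taken to be non-empty lists of integer tuples, and `swapped_pairs` checks swaps only between two single columns rather than between arbitrary attribute lists.

```python
from itertools import combinations


def normalize(table):
    """Step 1: shift all cells so the minimum cell value becomes zero."""
    low = min(min(row) for row in table)
    return [tuple(v - low for v in row) for row in table]


def append(first, second):
    """Steps 1 and 2: normalize both tables, then lift the second above the first."""
    first, second = normalize(first), normalize(second)
    high = max(max(row) for row in first)
    lifted = [tuple(v + high + 1 for v in row) for row in second]
    return first + lifted


def swapped_pairs(table, a, b):
    """Return index pairs of rows ordered one way on column a and the other way on b."""
    found = []
    for (i, s), (j, t) in combinations(enumerate(table), 2):
        if (s[a] - t[a]) * (s[b] - t[b]) < 0:
            found.append((i, j))
    return found


# The two subtables from the example (columns A, B, C, D).
first = [(0, 0, 0, 0), (0, 0, 1, 1)]
second = [(0, 1, 0, 0), (1, 0, 0, 0)]
result = append(first, second)
```

In this example the only swap between columns A and B in the result lies inside the rows contributed by the second subtable, which is the behavior Lemma 9 relies on: every cell of the lifted subtable exceeds every cell of the first, so no swap can straddle the two.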

1226 The table we construct will be split(ℳ) append swap(ℳ) Clearly, ' falsifies each ↦ that does not mention Z that

(which we call splitswap form). We shall construct split(ℳ) in a falsifies. For any ↦ that mentions Z, it is equivalent to some way that is analogous to the construction in Ullman’s proof of the OD that does not mention Z by the Replace rule, which has completeness of Armstrong’s axiomatization for FDs in [20]. This already been established. Thus ' satisfies, and is complete with proves our axiomatization for ODs is sound and complete over respect to, ℳ ∪ {[] ↦ Z}. □ FDs. In the second case, we may assume ℳ contains no constant

We shall construct swap(ℳ) in a way to falsify each OD ↦ attributes. When considering the pair A and B, if we find they not in ℳ (but for which ↦ is in ℳ ). This construction require a swap in nonempty context F, we can "freeze" the will be more complex than for split(ℳ). For each pair of attributes of F to a single value. This is true, for any table that attributes A and B from ℳ, we determine whether there needs to satisfies ℳ = ℳ ∪ {[] ↦ X, … , [] ↦ X}, where F = be a swap between A and B – a pair of tuples Cand Fsuch that {X,…,X}. Now, we have an instance with or fewer non ≺, but ≺ – and, if so, the FF in which swap constants attributes. By our induction hypothesis, there exists a between A and B need to occur. table in splitswap form that satisfies and is complete with respect to ℳ. Note that ℳ′ ⊇ ℳ. Thus, does not falsify DEFINITION 18. (constant) An attribute A is called a CFEF any ODs in ℳ. We append to the table that we are with respect toℳADD [] ↦ A is in ℳ. Call an attribute a constructing. (Appending these is safe, since ℳ has no constants.) CFEF, otherwise. Our table swap(ℳ) therefore is a recursive appending of If an attribute is a constant, it means in any table that satisfies ℳ, (sub)tables. it can have only a single value occurring in the table. There is the case of attributes A and B such that ℳ dictates they DEFINITION 19. (context) A set of nonconstant attributes F must have a swap, but in the empty context {}. This time, we with respect to ℳ is a FF of a swapF, CADDF =FWe say cannot use the induction hypothesis to construct the tuples for us swap F, Cis AFFFD F iff F =F. (Note that a context ( ), that do the job. 
For this case, however, we can construct two for a swap F, Cis not unique) tuples directly that introduce a swap for A and B, but that do not introduce swaps between any other pair of attributes that would By identifying the right contexts for swaps for each pair A and B, falsify any OD in ℳ .(The soundness of this step is established swap(ℳ) will falsify each ↦ not in ℳ (but with ↦ in in Lemma 12.) ℳ), while not falsifying anything in ℳ(Lemma 13). This step is the cornerstone of our proof for completeness. For the latter, we must show that, for each ↦ not in ℳ such that ↦ is in ℳ , some subtable in swap(ℳ) by our Constructing table swap(ℳ) is not straightforward. We are able to construction does falsify it. This is done by proving there always simplify the construction via structural induction. The hypothesis is an attribute A in , an attribute B in , and a swap between A is as follows. and B in some context , which falsifies ↦ . (This is part of HYPOTHESIS 1FCAC. CBDAAFDE Lemma 15.) That completes the proof. These pieces are formally proved in the next two sections. CFDCℳBCEFFAFC{,…,}FACFCE FE A CAFCE DB FEF CEFACDAC E AC BF AF CFFℳ CDFFD In this section we show completeness of our axiomatization over We prove the base case of this for ≤2 (in Lemma 11). We FDs. This result is then used toward showing completeness over hypothesize this is true for any ℳ with a fixed number of ODs. attributes. We then prove that for any ℳ with +1 attributes THEOREM 13. (FD and OD correspondence) ACFE that the hypothesis remains true (Theorem 17). Proof of the R induction hypothesis in essence completes the overall proof. DEFA F → ADD ↦ DEACFCFEFF EFFAFCDFEEACFCAACD Induction provides us with a powerful mechanism within the PROOF.AIf ↦ holds by Lemma 1 F → F is true. By proof. Consider any ℳ with +1 attributes. In the first case, if Armstrong axiom, Reflexivity F → holds. 
Therefore by any of the attributes are constants with respect to ℳ, we can Armstrong axiom, Transitivity F → is true. reduce the problem. We effectively F F those constant attributes from ℳ. This means we simply remove all occurrences CBAIf ↦ does not hold, there exists CF∈ rrrr, such of the attributes in the ODs. For example, if we are projecting out that ≼ but ⋠. This implies that = and ≺. B and E, ABC ↦ DEF becomes AC ↦ DF. Call the result ℳ'. Therefore ≠ and F =F and F → is not true. □ Then, ℳ' is over or fewer attributes. By the induction hypothesis there is a table ' which it satisfies, and is complete THEOREM 14. PROOF. Let = [Y, Y, … , Y], ∀ ∈ [1, ] (Permutation) with respect to, ℳ'. We can show easily how to construct a table 2 ↦Y … Y [Dec(1)] from ' which must satisfy, and be complete with respect to, ℳ. 1 ↦ 3 ′ ↦′ Y … Y [Pref(2)] This is established by Lemma 8. ′↦ ′′ 4 ′↔′ [Norm] LEMMA 8. F EFEFEFCEFACDACEACBFAF 5 ′ Y … Y ↔′ Y … Y [Norm] CFF ℳ. FZEEFFAFFA ℳ. CFFFE' EC AFEFEB Z, EFCEBCAED Z A 6 ′ ↦′ Y … Y [Tran(35)] E ' CEFACDAC E AC BF AF CF F, 7 ′↔′ [Ref]

ℳ ∪ {[] ↦ Z}.

PROOF. It is straightforward that r' satisfies ℳ ∪ {[] ↦ Z}, because Z is a constant in r' and Z does not appear in ℳ. □

8 ′ ↦ ′Y [Drop(6,7)]
′ ↦ ′′ [Union(8)] □

1227 An OD ↦ can be falsified in two ways by a table (Theorem Let F [20], the closure of F (with respect to ℱ) be the set of 15). That is why we introduced split and swap (Section 4). attributes A such that F → A can be deduced from ℱ by Armstrong's axioms. We consider the relational instance with THEOREM 15. (order dependency) ↦ CADD ↦ E the two rows shown in figure below. ↔ Fattributes Other attributes PROOF. A If ↦ holds then Suffix rule tells us, that

↔ . ↦ follows from Reflexivity, therefore ↦ by 0 0 … 0 0 0 … 0

Union and ↔ by Replace, Suffix and Normalization. 0 0 … 0 1 1 … 1
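The two-row instance used in the Ullman-style argument can be built mechanically once the attribute closure is known. The sketch below uses the standard textbook fixpoint computation of the closure, which this paper does not spell out. The function names and the example FDs over attributes A to D are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) attribute collections."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def two_row_table(attrs, fds, all_attrs):
    """Two rows that agree (value 0) on the closure of attrs and differ elsewhere."""
    closed = closure(attrs, fds)
    first = {a: 0 for a in all_attrs}
    second = {a: (0 if a in closed else 1) for a in all_attrs}
    return [first, second]


def fd_holds(table, lhs, rhs):
    """Return True if rows that agree on lhs always agree on rhs."""
    for s in table:
        for t in table:
            if all(s[a] == t[a] for a in lhs) and any(s[b] != t[b] for b in rhs):
                return False
    return True


# Hypothetical FDs: A -> B and B -> C over attributes A, B, C, D.
fds = [("A", "B"), ("B", "C")]
table = two_row_table("A", fds, "ABCD")
```

Here the closure of A is {A, B, C}, so the two rows differ only on D. Both given FDs hold in the table, while A → D, which is not derivable, is falsified.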

F ⊭ ↦ CB A Suppose ↔ and ↦ hold. Hence, by

Transitivity ↦ , which by Reflexivity and Transitivity tell us Based on Ullman’s [20] proof of soundness and completes of that ↦ . □ Armstrong’s axioms, relation instance shows that if ℱ is the THEOREM 16. (ODs subsume FDs). A F CF D Cℳ given set of dependencies, and F → cannot be proved by EABCECEBFDFAEAC Armstrong, then is a relation in which the dependencies of ℱ hold but F → does not. That is, ℱ does not logically imply F → PROOF. Soundness is by Theorem 1, because of the . This means the inference rules are sound and complete over ℱ. correspondence between FDs and ODs (Theorem 13). The As there is no swaps in rrrr, we do not falsify anything in ℳ′, remaining step is to prove completeness over FDs, if ℳ ⊨ therefore ℳ, too. This ends the soundness and completeness proof F → then ℳ ⊢ℐ F → . This is equivalent to say if ℳ ⊨ for FDs over set of ℳ. □ ↦ F ℳ ⊢ℐ ↦ for all lists that order the attributes ofFand all lists likewise for by Theorem 13 and CD Permutation. As discussed in Section 4 an OD can be falsified by a split or a swap. Using this, our proof for completeness is by case. If Firstly, we show that axioms for ODs imply Armstrong's axioms for FDs. We can do it because of soundness of axioms. ↦ is not in ℳ , there will be a split in the subtable split(ℳ) that we construct that falsifies ↦ , and so that D ⊆ F impliesF → . falsifies ↦ also. If ↦ is not in ℳ , but ↦ is, 1. We are given that is a subset of F. there will be a swap in subtable swap(ℳ) that falsifies ↦ . 2. Therefore, the normalization rule implies that an order EMMA dependency ↔ holds, for some list that order L 9. ACCAFAEFEFACFC

the attributes ofFand some list likewise for. DBECFACAC[] ↦ DEAC 3. Hence, Permutation and Theorem 13 implies that CE A E FEF ACF C DB E

FD F → holds. CFA DFF → implies F → . PROOF. Let F be a tuple in and Cbe a tuple in . Since all

1. Since we are given F → , Theorem 13 tells us ↦ values in F are less than all values in C, it is impossible for there to

, for all lists that order the attributes ofFand all be a split (except [] ↦ ) or swap introduced between and lists likewise for. within E (Definition 17). □

2. By Reflexivity we can interfere ↔ , for all list that We construct table to satisfy, and to be complete with respect to, order the attributes of . Hence, by Prefix ℳ. Table will be split(ℳ) append swap(ℳ), as introduced rule ↦ holds. above. Note that by Theorem 15 these are the only two scenarios. 3. By Suffix ↔ . may be normalized

( ↔ ). Table split(ℳ) is constructed by appending two rows to the table,

4. By transitivity ↦ .Therefore by Permutation as in Figure 7 for each subset of attributes of F from ℳ.

and Theorem 13 FD F → holds. LEMMA 10. (split(ℳ) satisfies ℳ). E ℳ AF D F → and → implies F → . CFEFC CAFℳCFDECADEA ℳ 1. We are given F → , and F → , so we may get ROOF ↦ and ↦ for all lists that order the P . The relational instance split(ℳ) we have constructed

attributes ofFand all lists likewise for and some contains splits, but no swaps. Therefore ↦ could be only

list likewise for by Theorem 13. falsified by split. (Consequently, ↦ is falsified, too.) But we

2. It follows by Reflexivity that ↔ , so by Prefix rule know that we are sound and complete over set over FDs by Theorem 16 and by Lemma 9 appending of the tables does not we can infer that ↦ .

introduce additional splits (except [] ↦ ) or swaps, therefore this 3. Since ↔ follows by Suffix, Normalization and is not possible. □ Transitivity, ↦ follows from Transitivity. 4. Hence by Decomposition, Permutation and Theorem 13 Table split(ℳ) is based on table we constructed for ℳ in the FD F → is true. proof of Theorem 16, which establishes that ODs subsume FDs; However, this proves that axiom system comprising of inference that is, split(ℳ) satisfies ℳ and it is complete with respect to the rules ℐ is sound and complete for the set of FDs ℱ. We would like OD of the form ↦ which are equivalent to FD statement to show it is true for set of ODs ℳ. (Theorem 13) – in that it falsifies each ↦ not in ℳ but which is composable over the attributes in ℳ. As constructed, Let ℳ′= { ↦ , ↔ | ↦ ∈ ℳ}. Based on Theorem split(ℳ) introduces no CEC. 15 ℳ and ℳ′ are equivalent (Definition 9). Also let For swap(ℳ) a natural approach would seem to be to construct ℱ= {F → | ↦ ∈ ℳ′}. Based on Permutation rule and Theorem 13 we know that any relation instance satisfying the table incrementally, to falsify each OD not in ℳ, in turn, dependencies in ℱ satisfies dependencies in ℳ′ and vice versa. while ensuring we do not also falsify any OD in ℳ, in each

1228 step. This would be similar to how we constructed split(ℳ). PROOF. We construct a tworow swap with values 0 and 1 that However, how to do this by a straightforward construction is not falsifies A ~ B but cannot falsify anything in ℳ as shown in apparent. When considering how to falsify ↦ , which Figure 9. For the latter, it suffices to prove that the swap does not attributes from and from , respectively, should have a swap falsify any C ~ D in ℳ. For A and B, they have opposite values appear in the table? And how do we ensure that this swap does not in each row in the swap. For any C such that A ~ C is in ℳ, C falsify any OD in ℳ? Instead, we consider every pair of must have the same value as A in each row. (Otherwise, A and C attributes, A and B, from the set of attributes in ℳ. We would have swapped values – 0 and 1 – between the two rows.) determine the relevant contexts, if any, in which a swap with Likewise for B. And for any D such that C ~ D is in ℳ, D must respect to A and B must occur in swap(ℳ). have the same value as C (and so the same as A) in each row. And so forth. Of course, it would be impossible to construct our two The set(XYXYXYXY)is a context for A, B with respect to ℳ ADDA ~ rows if there is a chain connecting A and B through order and ~ B are in ℳ , but A ~ B is not in ℳ . If there compatibility: A ~ E ~…~ E ~ B. If there were, we would need exists such a context for A, B, this indicates there should be a to set the value of each E~ … ~ E the same as A’s value Ethe swap between A and B (to falsify A ~ B). It also indicates the same as B’s value in each row. But A’s and B’s values differ. The "context" of the swap, as the swap must not falsify A ~ or Chain axiom schema (OD6) ensures there is no such chain from A ~ B. One could imagine constructing a swap – a pair of rows F to B. EA ~ EB is in ℳ , for each E, since the maximal context and C for this – by having =. That way, the swap F, C for A, B is []. 
If there were a chain A ~ E~ … ~ E ~ B such that would not falsify A ~ or ~ B. But what should the values A ~ E is in ℳ, E ~ E is in ℳ for each Aon 1,..,−1, and of Fand Cbe outside of ? We cannot construct Fand Csimply, E ~ B is in ℳ, then A ~ B is in ℳ also, by the Chain axiom. ensuring the swap CFdoes not falsify anything in ℳ. Instead, Since we know that A ~ B is not in ℳ, there is no such Chain. we use structural induction. Consider for now that is non Thus, our two rows are constructable. We can partition the empty. If we added [] ↦ to ℳ – call the result ℳ – can attributes into three groups: those that must have the same values only have a single value in any table that satisfies ℳ . Recall the as A , those the same as B, and those for which it does no matter. hypothesis from Hypothesis 1in Section 4. We adopt this as our Figure 9 shows the construction. induction hypothesis. Assume our present ℳ contains +1 attributes. Then ℳ contains or fewer attributes since [] ↦ . A B A’s group B’s group Remaining attributes By our induction hypothesis, there is a table (see Figure 8) that 0 1 0 … 0 1 1 1 0 0 … 0 satisfies, and is complete with respect to ℳ . As A ~ B is not 1 0 1 … 1 0 0 0 1 1 … 1 in ℳ, it is not in ℳ either. Thus falsifies A ~ B. F Attributes of Other attributes For attributes that do not match A or B, it is important we do not 0 0 … 0 , , … , introduce swaps between them, as this could falsify something in … … … … … … … … ℳ. It suffices to use the same value for these in each row. Call the tworow swap in Figure 9 . Table satisfies ℳ. Assume 0 0 … 0 , , … , otherwise: for ↦ ∊ ℳ, r r falsifies it. Let ↦be over non FF r r constants attributes, without loss of generality. Let E be the first Which context for A, B should we do this for? Not for all of them. element of , and F of . If both E and F are from A, A’s group It is the maximal contexts that are relevant. 
, is a maximal or the remaining group attributes (as in Figure 9), or they are both context for A, B ADD it is a context for A, B and there is no other from B or B’s group attributes, then and order the two tuples context , such that () ⊃ (). of the same way. Therefore, E must be from one group, and F from the other. Since ↦ ∊ ℳ , ~ ∊ ℳ by Theorem 15. Since we use induction in the proof, we need to prove a base case By the Downward Closure rule E ~ F ∊ ℳ . Contradiction. □ of the induction hypothesis. We prove it for the cases of ℳ with 0, 1, and 2 nonconstant attributes in the following Lemma. Our proof obligation for swap(ℳ), that it does not falsify any OD in ℳis proved in the following Lemma. LEMMA 11. (Induction base, ≤2). EF BCF ≤2 EFFAFC F ACFC E FE t A CAFCE DB FEF CEFACDAC LEMMA 13. CEℳ CEFACDAC ℳ CCBA Hypothesis 1 EACBFAFCFFℳ DEℳDDnonconstantsEFFAFCCEℳC FDECADEAℳ. PROOF This can be directly shown by enumerating through all the possibilities. □ PROOF. Hypothesis 1 is the key in proving that A, B do not falsify any OD in ℳ. When we consider pair A and B which We have assumed so far that the (maximal) contexts, if any, for A, requires a swap in nonempty context F we obtain ℳ = ℳ ∪ B are nonempty. There is the case where A, B has a single {[] ↦ X ,…,[] ↦ X }, where F = {X ,…,X }. By our maximal context {}, the BF context. In this case, we cannot hypothesis, there exists a table in splitswap form that is appeal to the induction hypothesis. Fortunately, such pair A, B satisfied and complete with respect to ℳ. As ℳ′ ⊇ ℳ, will have special properties by virtue of the fact they have therefore any ODs in ℳis not falsified. swapped orders only in the empty context. In fact, our sixth axiom schema speaks directly to this very case. 
(We likely would never None of the subtables falsifies any OD in ℳ, by the hypothesis have had the insight for the sixth axiom (schema) EA had we in nonempty context and soundness of base cases (empty context not encountered this case while attempting to prove and ≤2). As the table swap(ℳ) is appendnormalized, completeness.) In this case, we will be able to construct a tworow swap(ℳ) does not falsify any OD in ℳ. □ swap for A, B directly that does not falsify anything in ℳ. LEMMA 14. (Satisfies). FEFACAEAFCF LEMMA 12. (Empty context). ACFCECEDAF FFEABEFAEFAℳACFDECADAFFE . FBFBEABEFFFEFCEFACDACℳADECADA

1229 PROOF. The subtables split(ℳ) and swap(ℳ), as we construct them, are satisfied with respect to ℳ (Lemma 10 and Lemma 13 ℳ contains no constants attributes. Lemma 15 establishes there respectively). If neither split(ℳ) nor swap(ℳ) falsifies any OD exists an that satisfies, an is complete with respect to, ℳ.□ in ℳ, then as split(ℳ) append swap(ℳ) cannot falsify any OD in ℳ either (See Lemma 9). □ DC LEMMA 15. (complete). CCBA Hypothesis 1 D E ℳ Ordered sets and lattices have been a subject of research in CFEFDEFFAFCAEℳCFF mathematics [5]. In fact, our concept of OD is equivalent to EFFAFC E AC E CFEF AF CF F ℳ CABEA between two ordered sets. The work in DAAFA 18 F FE CAFℳ E CEℳ AC mathematics has concentrated on investigating properties of, and BFAFCFFℳ relationships between, ordered sets rather than among the mappings. To the best of our knowledge, no inference system for PROOF. Assume ↦ over only nonconstant attributes, is in describing relationships between mappings has been proposed. the complement of ℳ ( ↦ ∉ ℳ ). Theorem 15 tells us that order dependency ↦ holds iff ↦ and ↔ . Order dependencies were introduced for the first time in the context of database systems in [7]. However, the type of orders, . hence the dependencies defined over them, were different from

the ones we presented here. A dependency F ↝ holds if order ↦ ∉ ℳ . We have already proven that for the scenario with over the values of E attribute in F implies an order over the ↦ (FD) we are always complete (Theorem 16). values of E attribute of . (For simplicity, we use the arrow ↝ . for different type of orders.) In other words, the dependency is defined over the sets of attributes rather than lists. The distinction ↦ ∉ ℳ , but ↦ ∊ ℳ . By Theorem 15 ~ ∉ℳ . Find longest A prefixing such that: between these two types of dependencies was later [13] aptly described as pointwise versus lexicographical order dependency. 1. ~ ∊ℳ Formally, an instance satisfies a pointwise order dependency 2. A ~ ∉ℳ F ↝ if, for all tuples C and Ffor every attribute A in F, Find longest A prefixing such that: implies that for every attribute B in , , where 3. A ~ ∊ℳ ∈ {<, =, >, ≤, ≥}. In [8] a sound and complete set of axioms for 4. A ~ B ∉ℳ such dependencies is defined together with an analysis of the 5. ~ ∊ℳ [Downward Closure (1)] complexity of determining logical implication. An application of 6. ~ B ∊ℳ [Downward Closure (1)] the dependencies for an improved index design is presented in [6]. 7. AB ↔ AB ∊ℳ [Shift(3, [B ↔ B])] Dependencies defined over lexicographically ordered domains 8. AB ↔ AB ∊ℳ [Replace(5)] were introduced in [13] under the name AEAE 9. BA ↔ BA ∊ℳ [Shift(6, [A ↔ A]] DFAEAC. Two other papers [14], [15] by the same 10. AB ↔ BA ∉ℳ [(4)] author develop a theory behind both lexicographical as well as 11. AB ↔ BA ∉ℳ [Transitivity(8,9,10] pointwise dependencies (the latter were somewhat simpler than 12. A ~ B ∉ℳ [11] the dependencies defined in [7]). A set of axioms (proved to be sound and complete) is introduced for pointwise dependencies, A and B have a swap within the context, = set(). In but – interestingly – not for lexicographical dependencies. Only a constructing swap(ℳ), we considered all maximal contexts for chase procedure is defined for the latter. 
An extension of A, B for which a swap is needed. Hence, we considered some relational algebra to ordered domains is presented in [15]. superset ⊇ . If ≠ [], a subtable that satisfies, and is Sorting is at the heart of many database operations: sortmerge complete with respect to ℳ ∪ {[] ↦ V , … , [] ↦ V }, where join, index generation, duplicate elimination, ordering the output = {V,…,V} is appended in swap(ℳ). This falsifies A ~ B, for all lists that order the attributes of (thus, falsifies through the orderby operator, etc. The importance of sorted sets for query optimization and processing has been ↦ ). Else if = [], we appended a swap CF as in Figure 9 which falsifies A ~ B ([]A ~ []B). □ recognized very early on. Right from the start, the query optimizer of System R [16] paid particular attention to AFCFAC by THEOREM 17. (soundness and completeness). CFDF keeping track of all such ordered sets throughout the process of EABCℐ =–ACCEBF query optimization. In more recent research, [8] and [10] explored PROOF. the use of sorted sets for executing nested queries. The importance of sorted sets has prompted the researchers to look the sets ℳ with ≤2 attributes proved by Lemma 11. that have been explicitly generated. Thus, [12] shows how to Assume Hypothesis 1 for all ℳ composed over or fewer discover sorted sets created as generated columns via algebraic attributes. expressions. (In DB2, a generated column is a column that can be AF Consider an ℳ over +1 attributes. computed from other columns in the schema.)

. For example, if column A is sorted, so is the generated column G defined as G = A/100 + A−3 (that is, A ↝ G). We show in ℳ contains constants attributes (Definition 18). Let ℳ' be ℳ [18] how to use relationships between sorted attributes discovered with these constants attributes removed. ℳ' has or fewer by reasoning over the physical schema. The axiomatization attributes. By the induction hypothesis (Hypothesis 1), there is ' presented here provides a formal way of reasoning (discovering) which satisfies, and is complete with respect to, ℳ'. Lemma 8 previously unknown (or hidden) sorted sets. Based on this work, guarantee we can construct from ' that satisfies, and is complete many other optimization techniques can also be adapted. with respect to, ℳ.
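The generated-column example can be checked empirically on sample values. The sketch below assumes that A/100 is evaluated with integer (floor) division; the text does not say how the division is evaluated, and the function names and sample values are illustrative. A check on a handful of values is evidence only, not a proof of monotonicity.

```python
def generated(a):
    """G = A/100 + A - 3, with integer division assumed for A/100."""
    return a // 100 + a - 3


def order_preserved(values, fn):
    """Return True if sorting by the input also sorts the derived values."""
    ordered = sorted(values)
    images = [fn(v) for v in ordered]
    return all(x <= y for x, y in zip(images, images[1:]))


samples = [20120801, 19991231, 20000101, 5, 250, 99, 100]
g_is_ordered = order_preserved(samples, generated)

# A non-monotone expression for contrast: the last two digits of A.
last_two_ordered = order_preserved(samples, lambda a: a % 100)
```

The contrast case fails because 99 sorts before 100 while their last two digits, 99 and 0, sort the other way.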

CONCLUSIONS
Ordering permeates databases, to such an extent that we take it for granted. It appears in many queries and is relatively expensive to perform. The goal of this paper was to develop a theory behind dependencies over lexicographically ordered sets. To the best of our knowledge, this is the first attempt at an axiomatization for such dependencies. We have shown that ODs subsume FDs. We have also shown that our inference rules for ODs are sound and complete.

Though now we conclude, the story of order dependencies is far from over. There is much more that can be done, and should be. Future work in this area should pursue two lines of research: on the one hand, further investigation of the theoretical questions; on the other hand, applications of the theoretical framework in a practical database setting. These are further things we plan to do in future work.

• One of the major practical applications which we are currently working on is a theorem prover [20]. Given a set of order dependencies ℳ and an arbitrary dependency X ↦ Y, we would like to decide whether ℳ logically implies X ↦ Y. Such a theorem prover would be a useful tool in query optimization.
• Integrity constraints have been widely used in query optimization. For example, functional dependencies have been shown to be useful in simplifying queries with order by and group by operations [17], whereas inclusion dependencies can be used to remove certain joins over primary and foreign keys [4]. We believe that ODs can be used in similar ways to simplify queries with order by operations.
• We are exploring the use of ODs for database design [2]. FDs are by far the most common integrity constraints in the real world. The notion of the key derived from a given set of FDs is fundamental to the relational model. The determination of ODs might be an important part of designing databases in the relational model, too. It can be used in normalization and denormalization. Order dependencies can reveal redundancies that cannot be detected using functional dependencies alone. It would be an interesting research topic to extend the results obtained there to the design of relational databases.

ACKNOWLEDGMENTS
We thank Calisto Zuzarte and Wenbin Ma from IBM laboratory in Toronto for their encouragement and many helpful suggestions throughout the project.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and services names might be trademarks of IBM or other companies. TPC-DS is the trademark of The Transaction Processing Performance Council.

REFERENCES
[1] W. W. Armstrong. Dependency structures of data base relationships. In , pages 580-583, 1974.
[2] P. Bernstein. Synthesizing third normal form relations. , 1(4): 277-298, 1976.
[3] C. Beeri and P. A. Bernstein. Computational problems related to the design of normal form relational schemas. , 4(1): 30-59, 1979.
[4] Q. Cheng, J. Gryz, F. Koo, T. Y. C. Leung, L. Liu, X. Qian and K. B. Schiefer. Implementation of two semantic query optimization techniques in DB2 Universal Database. In , pages 687-698, 1999.
[5] B. A. Davey and H. A. Priestley. Introduction to lattices and order, vol. 2. , pages 1-298, 2002.
[6] J. Dong and R. Hull. Applying approximate order dependency to reduce indexing space. In , pages 119-127, 1982.
[7] S. Ginsburg and R. Hull. Ordered attribute domains in the relational model. In , 1981.
[8] S. Ginsburg and R. Hull. Order dependency in the relational model. , 26(1): 149-195, 1983.
[9] G. Graefe. Executing nested queries. In , pages 58-77, 2003.
[10] R. Guravannavar, H. S. Ramanujam, and S. Sudarshan. Optimizing nested queries with parameter sort orders. In , pages 481-492, 2005.
[11] R. Kimball and M. Ross. The data warehouse toolkit. The complete guide to dimensional modeling, vol. 2. , pages 219-241, 2002.
[12] M. Malkemus, S. Padmanabhan, B. Bhattacharjee and L. Cranston. Predicate derivation and monotonicity detection in DB2 UDB. In , pages 939-947, 2005.
[13] W. Ng. Lexicographically ordered functional dependencies and their application to temporal relations. In , pages 279-287, 1999.
[14] W. Ng. Ordered functional dependencies in relational databases. , 24(7): 535-554, 1999.
[15] W. Ng. An extension of the relational data model to incorporate ordered domains. , 26(3): 344-383, 2001.
[16] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie and T. G. Price. Access path selection in a relational database management system. In , pages 23-34, 1979.
[17] D. E. Simmen, E. J. Shekita and T. Malkemus. Fundamental techniques for order optimization. In , pages 57-67, 1996.
[18] J. Szlichta, P. Godfrey, J. Gryz, W. Ma, P. Pawluk and C. Zuzarte. Queries on dates: fast yet not blind. In , pages 497-502, 2011.
[19] J. Szlichta, P. Godfrey and J. Gryz. Chasing polarized order dependencies. In , pages 168-179, 2012.
[20] J. D. Ullman. Principles of database and knowledge-base systems, vol. 1. , pages 378-379, 1988.
