Axiomatic Foundations and Algorithms for Deciding Semantic Equivalences of SQL Queries
Shumo Chu, Brendan Murphy, Jared Roesch, Alvin Cheung, Dan Suciu
Paul G. Allen School of Computer Science and Engineering
University of Washington

ABSTRACT
Deciding the equivalence of SQL queries is a fundamental problem in data management. As prior work has mainly focused on studying the theoretical limitations of the problem, very few implementations for checking such equivalences exist. In this paper, we present a new formalism and implementation for reasoning about the equivalences of SQL queries. Our formalism, U-semiring, extends SQL's semiring semantics with unbounded summation and duplicate elimination. U-semiring is defined using only a few axioms and can thus be easily implemented using proof assistants such as Lean for automated query reasoning. Yet these axioms are sufficient to enable us to reason about sophisticated SQL queries that are evaluated over bags and sets, along with various integrity constraints. To evaluate the effectiveness of U-semiring, we have used it to formally verify 68 equivalent queries and rewrite rules from both classical data management research papers and real-world SQL engines, many of which have never been proven correct before.

PVLDB Reference Format:
Shumo Chu, Brendan Murphy, Jared Roesch, Alvin Cheung, Dan Suciu. Axiomatic Foundations and Algorithms for Deciding Semantic Equivalences of SQL Queries. PVLDB, 11 (11): 1482-1495, 2018.
DOI: https://doi.org/10.14778/3236187.3236200

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Proceedings of the VLDB Endowment, Vol. 11, No. 11
Copyright 2018 VLDB Endowment 2150-8097/18/07.

1. INTRODUCTION
All modern relational database management systems (DBMSs) contain a query optimizer that chooses the best means to execute an input query. At the heart of an optimizer is a rule-based rewrite system that uses rewrite rules to transform the input SQL query into another (hopefully more efficient) query to execute. A key challenge in query optimization is how to ensure that the rewritten query is indeed semantically equivalent to the input, i.e., that the original and rewritten queries return the same results when executed on all possible input database instances. History suggests that the lack of tools to establish such equivalences has caused long-standing, latent bugs in major database systems [32], and new bugs continue to arise [7, 10]. All such errors lead to incorrect results being returned, which can have dire consequences.

In the past, query optimizers were developed by only a small number of commercial database vendors with dedicated teams to handle customer-reported bugs. Such teams effectively acted as "manual solvers" for query equivalences. The explosive growth of new data analytic systems in recent decades (e.g., Spark SQL [12], Hive [4], Dremel [42], Myria [39, 55], among many others) has unfortunately exacerbated this problem significantly, as many such new systems are developed by teams that simply lack the dedicated resources to check every single rewrite implemented in their system. For example, the Calcite open source query processing engine [1] is mainly developed by open-source contributors. While it contains more than 232 rewrite rules in its optimization engine, none of them has been formally validated. The same holds true for similar engines.

On the other hand, deciding whether two arbitrary SQL queries are semantically equivalent is a well-studied problem in the data management research community that has been shown to be undecidable in general [11]. Most subsequent research has been directed at identifying fragments of SQL where equivalence is decidable, under set semantics [17, 47] or bag semantics [24]. This line of work, focused on theoretical aspects of the problem [51, 47, 41], led to very few implementations, most of which were restricted to applying the chase procedure to conjunctive queries [14].

An alternative approach to query equivalence was recently proposed by the COSETTE system [23] based on a new SQL semantics. It interprets SQL relations as K-relations [35]. A K-relation maps each tuple t to a value R(t) in a semiring K that denotes the multiplicity of the tuple in the relation. Normally, the multiplicity is an integer, thus the semiring K is the semiring of natural numbers N, but in order to support some of the SQL constructs (such as projections, with or without DISTINCT), COSETTE resorts to using a much more complex semiring, namely that of all univalent types in Homotopy Type Theory (HoTT) [50]. Equivalence of two SQL queries is then proven in COSETTE using the Coq proof assistant extended with the HoTT library [36]. While the system can handle many features of SQL, its reliance on HoTT makes it difficult to extend, as HoTT is both a Coq library under development and an active research area itself. As a result, COSETTE has a number of shortcomings. For example, COSETTE does not support foreign keys as they are difficult to model using the HoTT library in Coq.

In this paper, we propose a new algebraic structure, called the unbounded semiring, or U-semiring. We define the U-semiring by extending the standard commutative semiring with a few simple constructs and axioms. This new algebraic structure serves as a foundation for our new formalism that models the semantics of SQL. To prove the semantic equivalence of two SQL queries, we first convert them into U-semiring expressions, i.e., U-expressions. Deciding the equivalence of queries then becomes determining the equivalence of two U-expressions which, as we will show, is much easier compared to classical approaches.

Our core contribution is identifying the minimal set of axioms for U-semirings that are sufficient to prove sophisticated SQL query equivalences. This is important as the number of axioms determines the size of the trusted code base in any proof system. As we will see, the few axioms we designed for U-semirings are surprisingly simple as they are just identities between two U-expressions. Yet, they are sufficient to prove various advanced query optimization rules that arise in real-world optimizers. Furthermore, we show how integrity constraints (ICs) such as keys and foreign keys can be expressed as U-expression identities as well, and this leads to a single framework that can model many different features of SQL and the relational data model. This allows us to devise different algorithms for deciding the equivalences of various types of SQL queries, including those that involve views, or leverage ICs to rewrite the input SQL query as part of the proof.

To that end, we have developed a new algorithm, UDP, to automatically decide the equivalence of two arbitrary SQL queries that are evaluated under mixed set/bag semantics,¹ and also in the presence of indexes, views, and other ICs. At a high level, our algorithm performs rewrites reminiscent of the chase/back-chase procedure [45] but uses U-expressions rather than first-order logic sentences. It performs a number of tests for isomorphisms or homomorphisms to capture the mixed set/bag semantics of SQL. Our algorithm is sound in general and is complete in two restricted cases: when the two queries are Unions of Conjunctive Queries (UCQ) under set semantics, or UCQ under bag semantics. We implement UDP on top of the Lean proof assistant [26]. The implementation includes the modeling of relations and queries as U-expressions, axiomatic representations of ICs as simple U-expression identities, and the algorithm for checking the equivalence of U-expressions.

We evaluate UDP using various optimization rules from classi-

Equivalent SQL Queries

    SELECT * FROM R t WHERE t.a >= 12           -- Q1

    SELECT t2.* FROM I t1, R t2                 -- Q2
    WHERE t1.k = t2.k AND t1.a >= 12

where k is a key of R, and I is an index on R defined as:

    I := SELECT t3.k AS k, t3.a AS a FROM R t3

Corresponding Equivalence in Semirings

    Q_1(t) = \lambda t.\; \llbracket R \rrbracket(t) \times [t.a \ge 12]

    Q_2(t) = \lambda t.\; \sum_{t_1, t_2, t_3} [t_2 = t] \times [t_1.k = t_2.k] \times [t_1.a \ge 12] \times [t_3.k = t_1.k] \times [t_3.a = t_1.a] \times \llbracket R \rrbracket(t_3) \times \llbracket R \rrbracket(t_2)

Figure 1: Proving that a query is equivalent to a rewrite using an index I requires proving a subtle identity in a semiring.

that this algorithm is sound for general SQL queries and is complete for Unions of Conjunctive Queries under set and bag semantics (Sec. 5).

• We have implemented UDP using the Lean proof assistant, and have evaluated UDP by collecting 69 real-world rewrite rules as benchmarks, both from prior research work done in the database research community, and from Apache Calcite [1], an open-source relational query optimizer. To our knowledge, the majority of these rules have never been proven correct before, while UDP can automatically prove 62 of the 68 correct ones and fails as intended on the 1 buggy one (Sec. 6).

2. OVERVIEW
We motivate our new semantics using an example query rewrite. As shown in Fig.