arXiv:2012.13198v1 [cs.DB] 24 Dec 2020

Handling SQL Nulls with Two-Valued Logic

Leonid Libkin
Univ. Edinburgh / ENS-Paris, PSL / Neo4j
[email protected]

Liat Peterfreund
ENS-Paris, PSL
[email protected]

ABSTRACT
The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness, but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes.

We show that, contrary to the widely held view, SQL could have been designed based on the standard Boolean logic, without any loss of expressiveness and without giving up nulls. The approach itself follows SQL's evaluation, which only retains tuples for which conditions in the WHERE clause evaluate to true. We show that conflating unknown, resulting from nulls, with false leads to an equally expressive version of SQL that does not use the third truth value. Queries written under the two-valued semantics can be efficiently translated into the standard SQL and thus executed on any existing RDBMS. These results cover the core of the SQL 1999 Standard, including SELECT-FROM-WHERE-GROUP BY-HAVING queries extended with subqueries and IN/EXISTS/ANY/ALL conditions, and recursive queries. We provide two extensions of this result showing that no other way of converting 3VL into Boolean logic, nor any other many-valued logic for treating nulls, could have possibly led to a more expressive language.

These results not only present small modifications of SQL that eliminate the source of many programmer errors without the need to reimplement database internals, but they also strongly suggest that new query languages for various data models do not have to follow the much criticized SQL's three-valued approach.

ACM Reference Format:
Leonid Libkin and Liat Peterfreund. 2020. Handling SQL Nulls with Two-Valued Logic. In . ACM, New York, NY, USA, 13 pages. https://doi.org/...

© 2020 Association for Computing Machinery.

1 INTRODUCTION
To process data with nulls, SQL uses a three-valued logic (3VL). This is one of the most often criticized aspects of the language, and one that is very confusing to programmers. Database texts are full of damning statements about the treatment of nulls, such as the inability to explain them in a "comprehensible" manner [19], their tendency to "ruin everything" [9], and outright recommendations to "avoid nulls" [17]. The latter, however, is often not possible: in large volumes of data, incompleteness is hard to avoid.

Issues related to null handling stem from the fact that we do not naturally think in terms of a three-valued logic; rather, we try to categorize facts as true or false. Once the third truth value (unknown, in the case of SQL) enters the picture, our usual logic often proves faulty, leading to errors and unexpected behavior. We illustrate this by two commonly assumed query rewriting rules.

The first of the rules is the translation of IN subqueries into EXISTS queries, described in multiple textbooks. For example,

    ($Q_1$): SELECT R.A FROM R WHERE R.A NOT IN
             ( SELECT S.A FROM S )

would be translated into

    ($Q_2$): SELECT R.A FROM R WHERE NOT EXISTS
             ( SELECT S.A FROM S WHERE S.A=R.A )

(see, e.g., [41] explaining in detail many translations presented in database texts). This, however, is not an equivalent rewriting: if $R = \{1, \mathrm{NULL}\}$ and $S = \{\mathrm{NULL}\}$ then $Q_1$ returns no rows while $Q_2$ returns $R$ itself. This presumed, but incorrect, equivalence is known to be a trap many SQL programmers are not aware of, see [7, 9].

Next, consider two queries presented as an illustration of the HoTTSQL prover for showing equivalences among queries [12]:

    ($Q_3$): SELECT DISTINCT X.A FROM R X, R Y WHERE X.A=Y.A
    ($Q_4$): SELECT DISTINCT R.A FROM R

While claimed to be equivalent in [12], $Q_3$ and $Q_4$ are different: if $R = \{\mathrm{NULL}\}$ then the first query produces the empty table while the latter returns a single row with a NULL in it. In fairness to [12], it does not consider databases with nulls, but it is nonetheless telling that the illustrative example they chose is that of two non-equivalent queries on the simplest possible database containing NULL.

Over the years two lines of thought emerged for dealing with these problems. One is to provide a more complex logic for handling nulls, accounting for more varied types of those [8, 13, 18, 24, 34, 44]. These however did not take off, as the resulting logic is even harder for the programmer than the one SQL presents. An alternative is to produce a language with no nulls at all, and thus resort to the usual two-valued logic. This proposal found more success, for example, in the "3rd manifesto" [16] and the Tutorial D language, and in the LogicBlox system [3], which used the 6th normal form to eliminate nulls. But nulls do occur in many scenarios and need to be handled; the world is not yet ready to dismiss them altogether.

What is missing in this picture is a different line of thought: namely, a language that handles nulls but, in doing so, uses the familiar two-valued Boolean logic rather than a many-valued logic. In this proof-of-concept paper we show that SQL indeed could have been designed along these lines. To have a language that uses nulls and handles them with the familiar two-valued logic, we would need to fulfill the following criteria.

(1) On databases without nulls queries would be written exactly as before, and return the same results (do not make changes unless necessary).
(2) The version of SQL with nulls and two-valued logic will have exactly the same expressiveness as its version based on 3VL (do not lose any queries; do not invent new ones).
(3) For each query currently expressible in SQL, the size of the equivalent query in the two-valued language should be of the same order, e.g., at most linear (do not make queries overly complicated).

Why do we think this is achievable? After all, many years of SQL practice taught generations of programmers that one needs a 3VL to handle nulls. The main reason to believe that this is not so is two recent results that made steps in the right direction, albeit for simpler languages. First, [28] showed that in the most basic fragment of SQL corresponding to relational algebra (selection-projection-join-union-difference), the truth value unknown can be eliminated from conditions in WHERE. Essentially, it rewrote conditions by adding IS NULL or IS NOT NULL, in a way that they could never evaluate to unknown. Following that, [15] considered many-valued first-order predicate calculi under set semantics, and showed that no many-valued logic gives us extra power over the Boolean logic.

Our goal is to see if these purely theoretical results are applicable to the language that programmers actually use, SQL [footnote 1]. We start by looking at the core of it formalized in the 1992 version of the Standard, which includes the following features:
• full relational algebra;
• arithmetic functions and comparison predicates ($+$, $\cdot$, $\le$ etc.);
• aggregate functions and GROUP BY;
• comparisons involving aggregates (HAVING);
• comparisons involving subqueries (IN, EXISTS, ALL, ANY);
• set operations (UNION, INTERSECT, EXCEPT, with or without the ALL keyword);
• conditional statements.
A significant extension of the 1999 version of the Standard, also known as SQL3, added recursion in the form of
• WITH RECURSIVE clause.

Our main result is as follows: for these languages (SQL 1992 or SQL 1999) our key goals 1-3 above are achievable, and SQL's 3VL can be eliminated in favor of the usual Boolean logic.

Footnote 1: A note from the authors to the reviewers: this is the reason we chose category 3 in the rather loose classification of papers into three categories, that is likely to be revamped or abolished soon.

The core idea. To explain it, consider where unknown appears in SQL query evaluation. This happens when one evaluates a predicate, such as R.A=S.A, and one or more arguments are NULL. Thus, if we were to move to the Boolean logic, a minimal change is to assign one of the Boolean truth values to such comparisons. And SQL already does so, in a way. In fact, SQL conflates unknown with false upon exiting the WHERE clause. Indeed, only tuples for which the condition in WHERE is true are selected, and thus at the end of evaluating the condition, unknown is merged with false; in other words, 3VL only exists while the WHERE clause is evaluated, and afterwards it is all back to the standard Boolean logic.

Our main result says that changing the evaluation of conditions in this way leads to a two-valued version of SQL that satisfies our desiderata. Specifically, a language with the support for the usual relational algebra operators, plus aggregation, plus recursion, remains equally expressive to SQL based on the three-valued logic; the conversions of queries are easy, and none are necessary on databases without nulls.

We go even further and prove a number of extensions of this result. First, we show that other ways of changing the evaluation of conditions give us equally expressive languages. One of them is of particular note as it is used in SQL. This is the syntactic equality, in which NULL = NULL results in true. It is used in set operations and grouping. For example, {1, NULL} EXCEPT {1} produces {NULL}, and on a relation $T = \{(\mathrm{NULL}, 2), (\mathrm{NULL}, 3)\}$, the query SELECT A, SUM(B) FROM T GROUP BY A results in {(NULL, 5)}, applying the syntactic equality to nulls.

Another result that we prove stems from the question whether a different many-valued logic could have given us a more expressive version of SQL. We provide a negative answer to this, further strengthening the argument for using the familiar Boolean logic for handling nulls.

Applicability of the results. While the main conclusion is that SQL could have been designed without the recourse to a many-valued logic, we would like to use it as a proof of concept showing that:
• future languages can be designed using the familiar two-valued logic; and
• they need not do it by eliminating nulls altogether.
There are multiple languages under design at all times, SQL itself included, as it constantly changes and each release of the Standard adds features. There is much activity in the field of graph databases [2, 20, 23, 42], with a new unifying standard called GQL emerging [43]. This could be a good testbed, as could the addition of nulls to languages that deliberately omitted them in order to avoid the abnormalities of the three-valued logic [3, 16].

Crucially, for RDBMSs, the changes we propose do not necessitate changes to the underlying implementation. A user can write a query under a two-valued Boolean semantics. Then it is translated into an equivalent query under the standard SQL semantics, which any of the existing engines can evaluate. The translated query itself only marginally exceeds the size of the original in the worst case, and in many cases, as we shall see, no changes are even needed.

Another potential application is in the design and verification of query optimizers. As already mentioned, 3VL invalidates optimization rules that one takes for granted in research papers. Thus, one needs to build tools for verifying actual optimizers, to ensure their rules are correct. This need is well recognized [11, 12]. However, most existing verification tools are based on Boolean logic, and it thus appears that such tools would be better suited for verifying a query language itself based on a two-valued logic. In addition, we shall see that two-valued languages are actually better behaved in terms of query equivalences: some "false equivalences" (i.e., those true only on databases without nulls) become true on all databases when we pass from 3VL to two-valued logic. This is an additional argument for using the logic programmers and DBMS implementors are more familiar with.
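The behavior described in this section (the two non-equivalent rewritings from the introduction, and the syntactic equality that set operations and grouping apply to nulls) can be observed on an existing engine. The following is a minimal sketch, assuming Python's standard `sqlite3` module with in-memory databases; the helper `rows` and the variable names are illustrative choices and not part of the paper.

```python
import sqlite3

def rows(conn, sql):
    """Run a query and return its rows in a deterministic order."""
    return sorted(conn.execute(sql).fetchall(), key=repr)

# Database 1: R = {1, NULL}, S = {NULL}
db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE R (A INTEGER)")
db1.execute("CREATE TABLE S (A INTEGER)")
db1.executemany("INSERT INTO R VALUES (?)", [(1,), (None,)])
db1.executemany("INSERT INTO S VALUES (?)", [(None,)])

# Q1 (NOT IN) and its presumed rewriting Q2 (NOT EXISTS)
q1 = rows(db1, "SELECT R.A FROM R WHERE R.A NOT IN (SELECT S.A FROM S)")
q2 = rows(db1, "SELECT R.A FROM R WHERE NOT EXISTS "
               "(SELECT S.A FROM S WHERE S.A = R.A)")

# Set operations use syntactic equality: {1, NULL} EXCEPT {1}
except_rows = rows(db1, "SELECT A FROM R EXCEPT SELECT 1")

# Database 2: R = {NULL}, T = {(NULL, 2), (NULL, 3)}
db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE R (A INTEGER)")
db2.execute("CREATE TABLE T (A INTEGER, B INTEGER)")
db2.executemany("INSERT INTO R VALUES (?)", [(None,)])
db2.executemany("INSERT INTO T VALUES (?, ?)", [(None, 2), (None, 3)])

# Q3 (self-join on equality) and Q4 (plain DISTINCT projection)
q3 = rows(db2, "SELECT DISTINCT X.A FROM R X, R Y WHERE X.A = Y.A")
q4 = rows(db2, "SELECT DISTINCT R.A FROM R")

# Grouping also uses syntactic equality: both NULLs land in one group
grouped = rows(db2, "SELECT A, SUM(B) FROM T GROUP BY A")

print("Q1:", q1, "Q2:", q2)
print("Q3:", q3, "Q4:", q4)
print("EXCEPT:", except_rows, "GROUP BY:", grouped)
```

Under these assumptions, Q1 and Q3 come back empty while Q2 returns both rows of R and Q4 returns the single NULL row; the EXCEPT and GROUP BY queries treat the nulls as equal to each other.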

The choice of language. We need to prove results formally, and thus we need a language as closely resembling SQL as possible and yet having a formal compositional semantics one can reason about. For this purpose, we choose an extended relational algebra that resembles very much the algebra into which SQL is translated in RDBMS implementations. It expands the standard textbook operations of relational algebra with several features. First, it is interpreted under bag semantics, and duplicate elimination is added. Second, selection conditions are expanded significantly. They introduce testing for nulls, use SQL 3VL and SQL's rules for the introduction of unknown, and have conditions of the form $\bar t \in E$ and $\mathrm{empty}(E)$ for directly encoding IN and EXISTS subqueries, as well as conditions $t = \mathrm{any}(E)$ and $t = \mathrm{all}(E)$ (and likewise for other comparisons) for encoding ANY and ALL subqueries. Third, the algebra has aggregate functions and a grouping operation. Fourth, it allows function application to mimic expressions in the SELECT clauses. And finally, it has an iteration operation with the semantics of SQL recursive queries. For someone familiar with SQL, it should be clear that the language faithfully captures SQL's semantics while allowing us to prove results formally.

Related work. The idea of using Boolean logic for nulls predates SQL; it actually appeared in QUEL [39] (see details in the latest manual [32]). Afterwards however the main direction was in making the logic of nulls more rather than less complicated, with proposals ranging from three to six values [13, 18, 24, 34, 44] or producing more complex classifications of nulls, e.g., [8, 45]. Elaborate many-valued logics for handling incomplete and inconsistent data were also considered in AI literature, see for example [4, 22, 25]. Proposals for eliminating nulls have appeared in [3, 16].

There is a large body of work on achieving correctness of query results on databases with nulls, where correctness assumes the standard notion of certain answers [31]. Among such works are [21, 26, 27, 35]. They assumed either SQL's 3VL, or the Boolean logic of marked nulls [31], and showed how query evaluation could be modified to achieve correctness, but they did not question the underlying logic of nulls. To the contrary, we are concerned with finding a logic that makes it more natural for programmers to write queries; once this is achieved, one will need to modify evaluation schemes to produce subsets of certain answers if one so desires. For the connection between theoretical models such as marked or Codd nulls used in much of such work and real SQL nulls, see [29].

Some papers looked into handling nulls and incomplete data in bag-based data models as employed by SQL [14, 30, 37], but none concentrated on the underlying logic of nulls.

Organization. In Section 2 we present the syntax and the semantics of the language. In Section 3 we explain how to eliminate unknown to achieve our desiderata. Section 4 looks into other two-valued semantics, while Section 5 shows that no other many-valued logic could have achieved additional expressiveness. Conclusions are in Section 6. Complete proofs are in the appendix.

2 QUERY LANGUAGE: RAsql
We now settle on the language. Given the idiosyncrasies of SQL's syntax, it is not the ideal language, syntactically, to reason about. We know however that its queries are all translatable into an extended relational algebra; indeed, this is what is done inside every RDBMS, and multiple such translations are described in the literature [5, 10, 28, 36, 41].

Thus, we shall work with relational algebra, but not the textbook version of it. Rather we look at the version of the language called RAsql that is very close to what real-life SQL queries are translated into. In particular it will be a typed relational algebra, as we need to distinguish numerical attributes over which aggregation is performed. It will include constructs for grouping and computing aggregates, as well as comparing aggregates. Its conditions will include IN and EXISTS for direct expression of subqueries rather than translating them via joins. This will cover the essential SQL 1992 features. Then we add an iteration operation that works in the same way as SQL's WITH RECURSIVE, added in SQL 1999.

2.1 Data Model
The usual presentation of relational algebra assumes a countably infinite domain of values. Since we handle languages with aggregations, we need to distinguish columns of numerical types. So as not to over-complicate the model, we assume two types: a numerical and a non-numerical one (we call it the ordinary type as it corresponds to the presentation of relational algebra one ordinarily finds in textbooks). This will be without any loss of generality, as the treatment of nulls as values of all types is the same, except for the numerical type, as nulls behave differently with respect to aggregation.

Towards that goal, we assume the following pairwise disjoint countably infinite sets:
• Name of attribute names, and
• Num of numerical values, and
• Val of (ordinary) values.
Each of these has a type whose value is either $\mathsf{o}$ (ordinary) or $\mathsf{n}$ (numerical). If $N \in \mathrm{Name}$, then $\mathrm{type}(N)$ indicates whether columns contain elements of ordinary type or numerical type; one can think of this as the usual declarations in CREATE TABLE statements in this simple type system. Furthermore, $\mathrm{type}(e) = \mathsf{n}$ if $e \in \mathrm{Num}$ and $\mathrm{type}(e) = \mathsf{o}$ if $e \in \mathrm{Val}$.

We use the fresh symbol NULL to denote the null value.

Typed records and relations are defined as follows. Let $\tau := \tau_1 \cdots \tau_n$ be a word over the alphabet $\{\mathsf{o}, \mathsf{n}\}$. A $\tau$-record $\bar a$ with arity $n$ is a tuple $(a_1, \dots, a_n)$ where
• $a_i \in \mathrm{Num} \cup \{\mathrm{NULL}\}$ whenever $\tau_i = \mathsf{n}$, and
• $a_i \in \mathrm{Val} \cup \{\mathrm{NULL}\}$ otherwise (whenever $\tau_i = \mathsf{o}$).

Each $n$-ary relation symbol $R$ in the schema has an associated sequence $\ell(R) = N_1 \cdots N_n \in \mathrm{Name}^n$ of its attribute names. The type of $R$ is then the sequence $\mathrm{type}(R) = \mathrm{type}(N_1) \cdots \mathrm{type}(N_n)$. A relation $\mathbf{R}$ over $R$ is then a bag of $\mathrm{type}(R)$-records, i.e., records compatible with the type of $R$. As in SQL, we use bag semantics, i.e., a record may appear more than once in a relation. We also speak of bags of $\tau$-records as $\tau$-relations.

For a $\tau$-relation $\mathbf{R}$ write $\bar a \in_k \mathbf{R}$ if $\bar a$ occurs $k$ times in $\mathbf{R}$. In particular, $\bar a \in_0 \mathbf{R}$ means that $\bar a$ does not occur in $\mathbf{R}$.

A relation schema $\mathcal{S}$ is a set of relation symbols and their types, i.e., a set of pairs $(R, \mathrm{type}(R))$. A database $D$ over a relation schema $\mathcal{S}$ associates with each $(R, \mathrm{type}(R)) \in \mathcal{S}$ a relation of $\mathrm{type}(R)$-records.

The duplicate eliminating operator $\varepsilon(\mathbf{R})$ turns $\mathbf{R}$ into a set that contains every $\tau$-record in $\mathbf{R}$ once. Formally, $\bar a \in_1 \varepsilon(\mathbf{R})$ iff $\bar a \in_k \mathbf{R}$ for some $k > 0$. The cardinality $\mathrm{Card}(\mathbf{R})$ of $\mathbf{R}$ is defined as the sum of

the number of occurrences of different $\tau$-records in it. Formally, $\mathrm{Card}(\mathbf{R}) := \sum_{\bar a \in_k \mathbf{R}} k$. In what follows, we omit the $\tau$ from $\tau$-record or $\tau$-relation if it is not significant or clear from the context.

Since we are dealing with bags rather than sets, we interpret the operators union $\cup$, intersection $\cap$, difference $-$, and Cartesian product $\times$ with the standard bag semantics:
• Union: $\bar a \in_k \mathbf{R} \cup \mathbf{S}$ iff $\bar a \in_n \mathbf{R}$ and $\bar a \in_m \mathbf{S}$ and $k = n + m$;
• Intersection: $\bar a \in_k \mathbf{R} \cap \mathbf{S}$ iff $\bar a \in_n \mathbf{R}$ and $\bar a \in_m \mathbf{S}$ and $k = \min(n, m)$;
• Difference: $\bar a \in_k \mathbf{R} - \mathbf{S}$ iff $\bar a \in_n \mathbf{R}$ and $\bar a \in_m \mathbf{S}$ and $k = \max(n - m, 0)$;
• Cartesian Product: $(\bar a, \bar b) \in_k \mathbf{R} \times \mathbf{S}$ iff $\bar a \in_n \mathbf{R}$ and $\bar b \in_m \mathbf{S}$ and $k = n \cdot m$.
Note that the first three correspond to SQL's UNION ALL, INTERSECT ALL, and EXCEPT ALL; without ALL, these are followed by applying duplicate elimination.

2.2 Syntax
To define the syntax of our relational algebra, we define a term as either a numerical value in Num, or an ordinary value in Val, or NULL, or a name in Name, or an element of the form $f(t_1, \dots, t_k)$ where $f$ is a $k$-ary numerical function, that is, $f : \mathrm{Num}^k \to \mathrm{Num}$, and $t_1, \dots, t_k$ are terms. For example, addition and multiplication are binary numerical functions. As we shall see in the description of the semantics, it will be well-defined if the argument terms $t_1, \dots, t_k$ evaluate to values of the numerical type.

A $k$-ary numerical predicate is a relation symbol whose type is $\mathsf{n}^k$ for some positive integer $k$. For example, $\le$ is a binary numerical predicate. An aggregate function $F$ is a function that maps bags of numerical values into a numerical value (i.e., it maps bags whose elements are from Num into a single element in Num). For example, SQL's aggregates COUNT, AVG, SUM, MIN, MAX are such.

Relational algebra considered here is parameterized by a collection $\Omega$ of numerical predicates, functions, and aggregate functions. We assume that the standard comparison predicates $=, \ne, <, >, \le, \ge$ are always present over numbers. Our results on translation will be true for every possible collection of predicates and functions.

Given a schema $\mathcal{S}$ and such a collection $\Omega$, the syntax of RAsql expressions and conditions over $\mathcal{S} \cup \Omega$ is given in Fig. 1, where each $t_i$ is a term, each $N_i, N'_i$ is a name, each $\bar N$ is a tuple of names, and each $F_i$ is an aggregate function. In the generalized projection and in the grouping/aggregation, the parts in the square brackets (i.e., $[N_i]$ and $[N'_i]$) are optional renamings.

Terms:
    $t := n \mid c \mid \mathrm{NULL} \mid N \mid f(t_1, \dots, t_k)$    where $n \in \mathrm{Num}$, $c \in \mathrm{Val}$, $N \in \mathrm{Name}$, $f \in \Omega$
Expressions:
    $E := R$    (base relation)
    $\mid \pi_{t_1[N'_1], \dots, t_m[N'_m]}(E)$    (generalized projection with optional renaming)
    $\mid \sigma_\theta(E)$    (selection)
    $\mid E \times E$    (product)
    $\mid E \cup E$    (union)
    $\mid E \cap E$    (intersection)
    $\mid E - E$    (difference)
    $\mid \varepsilon(E)$    (duplicate elimination)
    $\mid \mathrm{Group}_{\bar N}\langle F_1(N_1)[N'_1], \dots, F_m(N_m)[N'_m]\rangle(E)$    (grouping/aggregation with optional renaming)
Atomic conditions:
    $ac := \mathbf{t} \mid \mathbf{f} \mid \mathrm{isnull}(t) \mid \bar t = \bar t' \mid \bar t \in E \mid \mathrm{empty}(E) \mid P(\bar t) \mid t\ \omega\ \mathrm{any}(E) \mid t\ \omega\ \mathrm{all}(E)$    where $P \in \Omega$, $\omega \in \{=, \ne, <, >, \le, \ge\}$
Conditions:
    $\theta := ac \mid \theta \vee \theta \mid \neg\theta \mid \theta \wedge \theta$

Figure 1: Syntax of RAsql

The size of an expression is defined in the standard way as the size of its parse tree.

In what follows, we restrict our attention to expressions with well-defined semantics (e.g., we forbid aggregation over non-numerical columns or functions applied to arguments of wrong types). The next two sections present the semantics: first at the intuitive level, and then formally.

2.3 Informal Semantics
We now offer an informal explanation of the semantics, with the formal semantics presented in the next section. In RAsql expressions, $R$ ranges over relation symbols in $\mathcal{S}$.

Terms are either constants of numerical or ordinary type, or NULL, or an attribute name, or function application. For example, A, 2 are terms, as are A+2 and A∗2.

Generalized projection corresponds to SQL's SELECT clause. In generalized projections, each term $t_i$ is evaluated and added as a column to the result. Such terms may refer to names from Name that are among attributes of the result of the expression $E$. Optional renaming allows us to rename such columns, simulating AS in SQL. To see a concrete example, to express

    SELECT A, B, A+2, A∗B FROM R

for a relation $R$ with attributes $A, B$ we would use

    $\pi_{A, B, \mathrm{add2}(A), \mathrm{mult}(A,B)}(R)$

where $\mathrm{add2}(x) = x + 2$ and $\mathrm{mult}(x, y) = x \cdot y$.

Names of columns are unique and can be specified explicitly by $N'_i$ in case the optional $[N'_i]$ part appears. The content of these square brackets corresponds to what comes right after SQL's renaming keyword AS. (Names can also be derived implicitly by the function Name that we discuss later.) For example

    SELECT A, B, A+2 AS C, A∗B AS D FROM R

would be translated into

    $\pi_{A, B, \mathrm{add2}(A)[C], \mathrm{mult}(A,B)[D]}(R)$

Projection follows SQL's bag semantics. That is, for each tuple $(a_1, \dots, a_n)$ in the result of $E$, it computes the values of terms $t_1, \dots, t_m$ and outputs them as values of (optionally renamed) attributes.

Selection, as usual, evaluates the condition $\theta$ for each tuple, and only keeps tuples for which the condition is true (i.e., not false or unknown). Operations of generalized projection and selection correspond to sequential scans in query plans (with filtering in the case of selection).

Other operations have the standard meaning under the bag semantics: for union, intersection, difference, and Cartesian product, it was described above. The operation $\varepsilon$ eliminates duplicates and keeps one copy of each record.

We follow SQL's semantics of functions: if one of its arguments is NULL, then the result is NULL. For example, 3 + 2 gives 5, but NULL + 2 gives NULL.

Before looking at grouping/aggregation, we deal with the conditions. For each predicate $P \in \Omega$ we assume its meaning is well defined when its arguments are not NULL (e.g., $\le$ on numbers). Then this is the meaning that is used when all arguments are not NULL, and if one is NULL, then the value is unknown ($\mathbf{u}$). A special case of this is equality, in fact equality of tuples of terms $(t_1, \dots, t_m) = (t'_1, \dots, t'_m)$, which is the conjunction of $t_i = t'_i$ for all $i \le m$. The condition $\mathrm{isnull}(t)$ tests if the value of term $t$ is NULL.

The condition $\bar t \in E$, not typically included in relational algebra, tests whether a tuple belongs to the result of a query, and corresponds to SQL's IN subqueries. The condition $\mathrm{empty}(E)$ checks if the result of $E$ is empty, and corresponds to SQL's EXISTS subqueries (NOT EXISTS is expressed as $\mathrm{empty}(E)$, and EXISTS as its negation).

One might be tempted to say that these are expressible via joins in traditional relational algebra. There is a good reason to include them directly. First, we want to stay as close to SQL as possible. Even more importantly, these conditions behave differently in the presence of nulls, and their expressibility via joins would require complex conditions checking which attribute values are NULL. Indeed, as we shall see, they behave differently with respect to treatment of nulls; in fact EXISTS subqueries follow the two-valued logic, while IN subqueries are based on the three-valued logic.

Other predicates not typically included in relational algebra presentation, though included here for direct correspondence with SQL, are ALL and ANY comparisons. Let $E$ be an expression that returns a table with a single numerical column, $t$ a term, and $\omega$ a comparison. Then $t\ \omega\ \mathrm{any}(E)$ means that there exists a value $t'$ in $E$ so that $t\ \omega\ t'$ holds, and $t\ \omega\ \mathrm{all}(E)$ means that $t\ \omega\ t'$ holds for each value $t'$ in $E$ (in particular, if $E$ returns no tuples, this condition is true). If $\omega$ is $=$ or $\ne$, conditions with any and all are applicable at either ordinary or numerical type; if $\omega$ is one of $<, \le, >, \ge$, then $t$ and $E$ must be of numerical type.

Finally, we describe the operator $\mathrm{Group}_{\bar N}\langle F_1(N_1), \dots, F_m(N_m)\rangle(E)$. The tuple $\bar N$ lists attributes in GROUP BY, and $F_i(N_i)$ are aggregate functions $F_i$ over numerical columns $N_i$, possibly named $N'_i$ (when $[N'_i]$ is specified). For example, to express

    SELECT A, COUNT(B) AS C, SUM(B) FROM R GROUP BY A

we would use

    $\mathrm{Group}_{A}\langle F_{\mathrm{count}}(B)[C], F_{\mathrm{sum}}(B)\rangle(R)$

where $F_{\mathrm{count}}(\{a_1, \dots, a_n\}) = n$ and $F_{\mathrm{sum}}(\{a_1, \dots, a_n\}) = a_1 + \cdots + a_n$. Note that $\bar N$ could be empty; this corresponds to computing aggregates over the entire table, without grouping, for example, as in SELECT COUNT(B), SUM(B) FROM R.

Example 1. We start by showing how queries $Q_1$-$Q_4$ from the introduction are expressible in RAsql:

    $Q_1 = \sigma_{\neg(R.A \in S)}(R)$
    $Q_2 = \sigma_{\mathrm{empty}(\sigma_{R.A = S.A}(S))}(R)$
    $Q_3 = \varepsilon\big(\pi_{X.A}(\sigma_{X.A = Y.A}(\rho_{R.A \to X.A}(R) \times \rho_{R.A \to Y.A}(R)))\big)$
    $Q_4 = \varepsilon(\pi_{R.A}(R))$

As a more complex example we look at query $Q_5$, which is a slightly simplified (to fit in one column) query 22 from the TPC-H benchmark [40]:

    SELECT c_nationkey, COUNT(c_custkey)
    FROM customer
    WHERE c_acctbal >
        ( SELECT avg(c_acctbal)
          FROM customer
          WHERE c_acctbal > 0.0 AND
                c_custkey NOT IN ( SELECT o_custkey FROM orders ) )
    GROUP BY c_nationkey

In translations below, we use abbreviations $C$ for customer and $O$ for orders, and abbreviations for attributes like $c\_n$ for c_nationkey etc. The NOT IN condition in the subquery is then translated as $\neg(c\_c \in \pi_{o\_c}(O))$, the whole condition is translated as $\theta := (c\_a > 0) \wedge \neg(c\_c \in \pi_{o\_c}(O))$, and the aggregate subquery becomes

    $Q_{agg} = \mathrm{Group}_{\emptyset}\langle \mathrm{avg}(c\_a)\rangle\big(\pi_{c\_a}(\sigma_\theta(C))\big)$.

Notice that there is no grouping for this aggregate, hence the empty set of grouping attributes. Then the condition in the WHERE clause of the query is $\theta' := c\_a > \mathrm{any}(Q_{agg})$, which is then applied to $C$, i.e., $\sigma_{c\_a > \mathrm{any}(Q_{agg})}(C)$, and finally grouping by $c\_n$ and counting of $c\_c$ are performed over it, giving us

    $\mathrm{Group}_{c\_n}\langle \mathrm{count}(c\_c)\rangle\big(\sigma_{c\_a > \mathrm{any}(Q_{agg})}(C)\big)$.

Putting everything together, we arrive at the final translation into $\mathrm{RA}_{\mathrm{sql}}$:

$$\mathrm{Group}_{c\_n}\langle \mathrm{count}(c\_c)\rangle\Big(\sigma_{c\_a \,>\, \mathrm{any}\big(\mathrm{Group}_{\emptyset}\langle \mathrm{avg}(c\_a)\rangle(\pi_{c\_a}(\sigma_{\theta}(C)))\big)}(C)\Big)$$

2.4 Formal Semantics

We next define the formal semantics of $\mathrm{RA}_{\mathrm{sql}}$ expressions. This is done in the spirit of [5, 12, 28] and is necessary for formally proving the results about eliminating three-valued logic. That said, the reader who wants to understand the result and is willing to rely on the informal explanation of the semantics given in the previous section can skip these technical details. We define the semantic function

$$\llbracket E \rrbracket_{D,\eta}$$

which is the result of evaluating expression $E$ on database $D$ under the environment $\eta$. The environment provides values of parameters of the query. Indeed, to give a semantics of a query with subqueries, we need to define the semantics of subqueries as well. Consider for example the subquery SELECT S.A FROM S WHERE S.A=R.A of query $Q_2$ from the Introduction. Here R.A is a parameter, and to compute the query we need to provide its value. Thus, an environment $\eta$ is a partial mapping from the set Name of names to the union $\mathrm{Val} \cup \mathrm{Num} \cup \{\mathrm{NULL}\}$.

Recall that every relation $R$ is associated with a sequence $\ell(R)$ of attribute names. Just as SQL queries do, every $\mathrm{RA}_{\mathrm{sql}}$ expression $E$ produces a table whose attributes similarly have names. We start by defining those in Fig. 2a. We make the assumption that names do not repeat; this is easy to enforce with renaming. This differs from SQL, where names in query results can repeat, a point that was discussed rather extensively in [28]. However, the treatment of repeated names in the definition of the semantics of SQL queries is completely orthogonal to the treatment of nulls, and thus we can make this assumption without loss of generality, so as not to clutter the description of our translations with the complexities coming from treating repeated attributes.

Next, we define the semantics of terms: it is given by the environment, see Fig. 2b.

After that we give the semantics of predicates $P \in \Omega$ and equality in Fig. 2c. We follow SQL's three-valued logic with truth values true ($\mathbf{t}$), false ($\mathbf{f}$) and unknown ($\mathbf{u}$). The usual SQL rule is: evaluate a predicate normally if no attributes are nulls; otherwise return $\mathbf{u}$. That is, for each predicate $P$ such as $<$, we have its interpretation $\mathbf{P}$ over $\mathrm{Val} \cup \mathrm{Num}$.

To provide the formal semantics of $\mathrm{RA}_{\mathrm{sql}}$ expressions, we need some extra notation. We assume that there is a one-to-one function Name that maps terms into (unique) names (i.e., elements in Name). Given $\alpha \in \mathrm{Name}^*$ and $\bar N, \bar N' \in \mathrm{Name}^*$, the sequence $\alpha_{\bar N \to \bar N'}$ is obtained from $\alpha$ by replacing each $N_i$ with $N'_i$, where $\bar N := (N_1, \dots, N_m)$ and $\bar N' := (N'_1, \dots, N'_m)$.

Next, if $\bar a := (a_1, \dots, a_m)$ is a tuple of values over $\mathrm{Num} \cup \mathrm{Val} \cup \{\mathrm{NULL}\}$ and $\bar N := (N_1, \dots, N_m)$ a tuple of names over Name, we denote by $\eta^{\bar a}_{\bar N}$ the environment that maps each name $N_i$ into the value $a_i$. We say that $\bar a$ is consistent with $\bar N$ if $\mathrm{type}(N_i) = \mathbf{n}$ implies $a_i \in \mathrm{Num} \cup \{\mathrm{NULL}\}$ and $\mathrm{type}(N_i) = \mathbf{o}$ implies $a_i \in \mathrm{Val} \cup \{\mathrm{NULL}\}$, for each $i$. For two environments $\eta$ and $\eta'$, by $\eta;\eta'$ we mean $\eta$ overridden by $\eta'$. That is, $\eta;\eta'(N) = \eta(N)$ if $\eta$ is defined on $N$ and $\eta'$ is not; otherwise $\eta;\eta'(N) = \eta'(N)$.

For a bag $B$, let $B_{\neq \mathrm{NULL}}$ be the same as $B$ but with occurrences of NULL removed. A tuple is called null-free if none of its components is NULL. With these, the semantics of expressions is defined in Fig. 2e. Note that we omit the optional parts in the generalized projection and grouping/aggregation, as they do not affect the semantics and are reflected only in the names, as shown in Fig. 2a.

Now, given an expression $E$ of $\mathrm{RA}_{\mathrm{sql}}$ and a database $D$, the value of $E$ in $D$ is defined as $\llbracket E \rrbracket_{D,\emptyset}$, where $\emptyset$ is the empty mapping (i.e., the top-level expression has no free variables).

2.5 Recursive queries

We now incorporate recursive queries, a feature added in the SQL 1999 standard. Extensions of relational algebra with various kinds of recursion exist (e.g., with transitive closure [1] or a fixed-point operator [33]). We follow the same approach but stay closer to SQL as it is, which uses a special type of iteration to define recursive queries; in fact there are two kinds, depending on the syntactic shape of the query (see [38] as well as the explanation below).

Syntax of $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$. Recall that $\cup$ stands for bag union, i.e., multiplicities of tuples are added up, as in SQL's UNION ALL. We also need the operation $E_1 \sqcup E_2$ defined as $\varepsilon(E_1 \cup E_2)$, i.e., union in which a single copy of each tuple is kept. This corresponds to SQL's UNION.

An $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expression is defined with the grammar of $\mathrm{RA}_{\mathrm{sql}}$ (in Fig. 1) with the addition of the following constructor:

$$\mu R.E$$

where $R$ is a fresh relation symbol (i.e., not in the schema) and $E$ is an expression of the form $E_1 \cup E_2$ or $E_1 \sqcup E_2$, where both $E_1$ and $E_2$ are $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expressions and $E_2$ may contain a reference to $R$.

Note that in SQL, various restrictions are imposed on query $E_2$. These typically include linearity of recursion (at most one reference to $R$ within $E_2$), restrictions on the use of the recursively defined relation in subqueries, restrictions on the use of aggregation, etc. The reason for such restrictions is to eliminate some of the common cases of non-terminating queries.

We shall not impose such restrictions, as our result is more general: passing from 3VL to two-valued logic is possible even if such restrictions are not in place. Note that different RDBMSs use different restrictions on recursive queries (and sometimes even different syntax); hence showing this more general result ensures that it applies to all of them.

Semantics of $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$. Similarly to the syntactic definition, we distinguish between the two cases.

For $\mu R.E_1 \cup E_2$, the semantics $\llbracket \mu R.E_1 \cup E_2 \rrbracket_{D,\eta}$ is defined by the following iterative process:

(1) $RES_0, R_0 := \llbracket E_1 \rrbracket_{D,\eta}$
(2) $R_{i+1} := \llbracket E_2 \rrbracket_{D \cup R_i,\eta}$, $RES_{i+1} := RES_i \cup R_{i+1}$

with the condition that if $R_i = \emptyset$, then the iteration stops and $RES_i$ is returned.
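The iterative process for $\mu R.E_1 \cup E_2$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: bags are modelled as `collections.Counter` objects mapping tuples to multiplicities, the function `step` is a hypothetical stand-in for evaluating $E_2$ with $R$ bound to $R_i$, and the `max_rounds` guard is our addition (the definition itself allows non-termination).

```python
from collections import Counter

def mu_union_all(base, step, max_rounds=10_000):
    """Iterate RES_0 = R_0 = base; R_{i+1} = step(R_i); RES_{i+1} = RES_i + R_{i+1}.

    `base` plays the role of [[E1]], `step` the role of E2 evaluated with
    R := R_i. Both names are illustrative, not taken from the paper.
    """
    r = +Counter(base)         # R_0 (unary + drops non-positive counts)
    res = Counter(r)           # RES_0
    for _ in range(max_rounds):
        if not r:              # R_i is empty: stop and return RES_i
            return res
        r = +Counter(step(r))  # R_{i+1}
        res = res + r          # bag union adds multiplicities (UNION ALL)
    raise RuntimeError("no empty R_i reached within max_rounds")

def next_numbers(r):
    """Hypothetical E2: SELECT n + 1 FROM R WHERE n < 5."""
    out = Counter()
    for (n,), k in r.items():
        if n < 5:
            out[(n + 1,)] += k
    return out

result = mu_union_all(Counter({(1,): 1}), next_numbers)
```

Here `result` contains the tuples (1,) through (5,), each with multiplicity one; starting from a base bag with duplicates, the duplicates are carried through every round, which is the behaviour of UNION ALL.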

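(a) Names

$$\ell(\pi_{t_1[N_1],\dots,t_m[N_m]}(E)) := \tilde N_1 \cdots \tilde N_m \quad\text{where}\quad \tilde N_i := \begin{cases} N_i & \text{if } [N_i] \\ \mathrm{Name}(t_i) & \text{otherwise} \end{cases}$$
$$\ell(\sigma_\theta(E)) := \ell(E)$$
$$\ell(E_1 \times E_2) := \ell(E_1) \cdot \ell(E_2)$$
$$\ell(E_1 \;\mathrm{op}\; E_2) := \ell(E_1) \quad\text{for } \mathrm{op} \in \{\cup, \cap, -\}$$
$$\ell(\varepsilon(E)) := \ell(E)$$
$$\ell(\mathrm{Group}_{\bar N}\langle F_1(N_1)[N'_1], \dots, F_m(N_m)[N'_m]\rangle(E)) := \bar N \cdot \tilde N_1 \cdots \tilde N_m \quad\text{where}\quad \tilde N_i := \begin{cases} N'_i & \text{if } [N'_i] \\ \mathrm{Name}(F_i(N_i)) & \text{otherwise} \end{cases}$$

(b) Terms

$$\llbracket t \rrbracket_\eta := \begin{cases} \eta(t) & t \in \mathrm{Name} \\ t & t \in \mathrm{Val} \cup \mathrm{Num} \cup \{\mathrm{NULL}\} \end{cases}$$
$$\llbracket (t_1, \dots, t_m) \rrbracket_\eta := (\llbracket t_1 \rrbracket_\eta, \dots, \llbracket t_m \rrbracket_\eta)$$
$$\llbracket f(t_1, \dots, t_m) \rrbracket_\eta := \begin{cases} \mathrm{NULL} & \exists i : \llbracket t_i \rrbracket_\eta = \mathrm{NULL} \\ f(\llbracket t_1 \rrbracket_\eta, \dots, \llbracket t_m \rrbracket_\eta) & \text{otherwise} \end{cases}$$

(c) Predicates

$$\llbracket P(t_1, \dots, t_n) \rrbracket_{D,\eta} := \begin{cases} \mathbf{t} & \forall i : \llbracket t_i \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket (t_1, \dots, t_n) \rrbracket_\eta \in \mathbf{P} \\ \mathbf{f} & \forall i : \llbracket t_i \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket (t_1, \dots, t_n) \rrbracket_\eta \notin \mathbf{P} \\ \mathbf{u} & \exists i : \llbracket t_i \rrbracket_\eta = \mathrm{NULL} \end{cases}$$
$$\llbracket t \doteq t' \rrbracket_{D,\eta} := \begin{cases} \mathbf{t} & \llbracket t \rrbracket_\eta, \llbracket t' \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket t \rrbracket_\eta = \llbracket t' \rrbracket_\eta \\ \mathbf{f} & \llbracket t \rrbracket_\eta, \llbracket t' \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket t \rrbracket_\eta \neq \llbracket t' \rrbracket_\eta \\ \mathbf{u} & \llbracket t \rrbracket_\eta = \mathrm{NULL} \text{ or } \llbracket t' \rrbracket_\eta = \mathrm{NULL} \end{cases}$$

(d) Conditions

Basic conditions:

$$\llbracket \mathbf{t} \rrbracket_{D,\eta} := \mathbf{t} \qquad \llbracket \mathbf{f} \rrbracket_{D,\eta} := \mathbf{f}$$
$$\llbracket \mathrm{isnull}(t) \rrbracket_{D,\eta} := \begin{cases} \mathbf{t} & \llbracket t \rrbracket_\eta = \mathrm{NULL} \\ \mathbf{f} & \text{otherwise} \end{cases}$$
$$\llbracket \bar t \doteq \bar t' \rrbracket_{D,\eta} := \bigwedge_{i=1}^{n} \llbracket t_i \doteq t'_i \rrbracket_{\eta} \quad\text{where } \bar t := (t_1, \dots, t_n),\ \bar t' := (t'_1, \dots, t'_n)$$
$$\llbracket \bar t \in E \rrbracket_{D,\eta} := \bigvee_{\bar t' \in \llbracket E \rrbracket_{D,\eta}} \llbracket \bar t \doteq \bar t' \rrbracket_{\eta}$$
$$\llbracket t \;\omega\; \mathrm{any}(E) \rrbracket_{D,\eta} := \bigvee_{t' \in \llbracket E \rrbracket_{D,\eta}} \llbracket t \,\omega\, t' \rrbracket_{\eta}$$
$$\llbracket t \;\omega\; \mathrm{all}(E) \rrbracket_{D,\eta} := \bigwedge_{t' \in \llbracket E \rrbracket_{D,\eta}} \llbracket t \,\omega\, t' \rrbracket_{\eta}$$
$$\llbracket \mathrm{empty}(E) \rrbracket_{D,\eta} := \begin{cases} \mathbf{t} & \llbracket E \rrbracket_{D,\eta} = \emptyset \\ \mathbf{f} & \text{otherwise} \end{cases}$$

Composite conditions:

$$\llbracket \theta_1 \vee \theta_2 \rrbracket_{D,\eta} := \llbracket \theta_1 \rrbracket_{D,\eta} \vee \llbracket \theta_2 \rrbracket_{D,\eta} \qquad \llbracket \theta_1 \wedge \theta_2 \rrbracket_{D,\eta} := \llbracket \theta_1 \rrbracket_{D,\eta} \wedge \llbracket \theta_2 \rrbracket_{D,\eta} \qquad \llbracket \neg\theta \rrbracket_{D,\eta} := \neg \llbracket \theta \rrbracket_{D,\eta}$$

Three-valued logic rules:

| $\wedge$ | t | f | u |     | $\vee$ | t | f | u |     | $\neg$ |   |
|---|---|---|---|---|---|---|---|---|---|---|---|
| t | t | f | u |     | t | t | t | t |     | t | f |
| f | f | f | f |     | f | t | f | u |     | f | t |
| u | u | f | u |     | u | t | u | u |     | u | u |

(e) Expressions

$$\bar a \in_k \llbracket R \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k R^D$$
$$\bar a = (a_1, \dots, a_m) \in_k \llbracket \pi_{t_1,\dots,t_m}(E) \rrbracket_{D,\eta} \quad\text{if } k = \sum_{j} k_j \text{ where } \bar c_j \in_{k_j} \llbracket E \rrbracket_{D,\eta} \text{ and for every } 1 \le i \le m:\ a_i = \llbracket t_i \rrbracket_{\eta;\eta^{\bar c_j}_{\ell(E)}}$$
$$\llbracket E_1 \;\mathrm{op}\; E_2 \rrbracket_{D,\eta} := \llbracket E_1 \rrbracket_{D,\eta} \;\mathrm{op}\; \llbracket E_2 \rrbracket_{D,\eta} \quad\text{for } \mathrm{op} \in \{\cup, \cap, -, \times\}$$
$$\bar a \in_k \llbracket \sigma_\theta(E) \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k \llbracket E \rrbracket_{D,\eta} \text{ and } \llbracket \theta \rrbracket_{D,\eta;\eta^{\bar a}_{\ell(E)}} = \mathbf{t}$$
$$\bar a \in_1 \llbracket \varepsilon(E) \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k \llbracket E \rrbracket_{D,\eta} \text{ and } k > 0$$
$$(\bar a, \bar a') \in_1 \llbracket \mathrm{Group}_{\bar M}\langle F_1(N_1), \dots, F_m(N_m)\rangle(E) \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k \llbracket \pi_{\bar M}(E) \rrbracket_{D,\eta},\ \bar a' = \big(\llbracket \langle F_1(N_1)\rangle(E') \rrbracket_{D,\eta;\eta^{\bar a}_{\ell(E)}}, \dots, \llbracket \langle F_m(N_m)\rangle(E') \rrbracket_{D,\eta;\eta^{\bar a}_{\ell(E)}}\big)$$
$$\text{where } E' = \llbracket \pi_{N_1,\dots,N_m}(\sigma_{\bar M \doteq \bar a}(E)) \rrbracket_{D,\eta}$$
$$(\bar a, n) \in_k \llbracket \langle \mathrm{Count}(\star)\rangle(E) \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k \llbracket E \rrbracket_{D,\eta},\ n = \mathrm{Card}(\llbracket E \rrbracket_{D,\eta})$$
$$(\bar a, n) \in_k \llbracket \langle F(N)\rangle(E) \rrbracket_{D,\eta} \quad\text{if } \bar a \in_k \llbracket E \rrbracket_{D,\eta},\ n = F\big((\llbracket \pi_N(E) \rrbracket_{D,\eta})_{\neq \mathrm{NULL}}\big)$$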

Figure 2: Names of Expressions and Semantics of Terms, Predicates, Conditions, and Expressions
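To make the three-valued rules of Fig. 2c and 2d concrete, here is a small Python sketch. It assumes the following modelling choices, none of which come from the paper: truth values are the strings "t", "f", "u", NULL is Python's `None`, and `is_in` covers only the unary case of $\bar t \in E$.

```python
import operator

T, F, U = "t", "f", "u"   # the three truth values

def k_not(x):
    """Negation: swaps t and f, leaves u unchanged."""
    return {T: F, F: T, U: U}[x]

def k_and(x, y):
    """Conjunction: f dominates, then u, otherwise t."""
    if F in (x, y):
        return F
    return U if U in (x, y) else T

def k_or(x, y):
    """Disjunction: t dominates, then u, otherwise f."""
    if T in (x, y):
        return T
    return U if U in (x, y) else F

def pred(p, *args):
    """Predicate rule: u if some argument is NULL (None), else the usual value."""
    if any(a is None for a in args):
        return U
    return T if p(*args) else F

def is_in(t, bag):
    """Unary case of t IN E: the disjunction of t = t' over all t' in the bag."""
    result = F
    for s in bag:
        result = k_or(result, pred(operator.eq, t, s))
    return result
```

With these definitions, `k_not(is_in(1, [None]))` evaluates to u, which is why a NOT IN condition against a subquery returning only NULL filters out every row in a WHERE clause.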

For $\mu R.E_1 \sqcup E_2$, the semantics is defined by the following iterative process:

(1) $RES_0, R_0 := \llbracket \varepsilon(E_1) \rrbracket_{D,\eta}$
(2) $R_{i+1} := \llbracket \varepsilon(E_2) \rrbracket_{D \cup R_i,\eta} - RES_i$, $RES_{i+1} := RES_i \cup R_{i+1}$

with the same stopping condition as before. Note that since $R_{i+1}$ does not contain any tuples from $RES_i$, we have $RES_i \cup R_{i+1} = RES_i \sqcup R_{i+1}$ above. That is, either $\cup$ or $\sqcup$ could be used in rule (2) of this iterative process.

3 ELIMINATING UNKNOWN: CONFLATING UNKNOWN AND FALSE

To eliminate the use of SQL's three-valued semantics, we need to replace the unknown truth value, and to do so, we need to see where this value arises. In SQL, unknown arises in the WHERE clause; in $\mathrm{RA}_{\mathrm{sql}}$, in conditions. Specifically, it appears as the result of evaluation of predicates such as equality or $<$. It also arises in the evaluation of IN subqueries, but checking $\bar t \in E$ in $\mathrm{RA}_{\mathrm{sql}}$ boils down to checking equalities, i.e., a disjunction of $\bar t \doteq \bar t'$ as $\bar t'$ ranges over tuples in $E$.

In applying basic predicates, unknown appears by following the rule that if one parameter is NULL, then the value of the predicate is $\mathbf{u}$. Thus, we need to specify what we do in this case. SQL's existing features guide us in choosing our options. One is to conflate $\mathbf{u}$ and $\mathbf{f}$, which is what is done after computing conditions in WHERE, as only those tuples for which the condition is $\mathbf{t}$ are kept. We shall now discuss this semantics, but our main result on the equivalence of query evaluation under 3VL and Boolean logic extends not only to this but also to many other ways of eliminating unknown, see Section 4.

The new semantics, denoted by $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$, produces $\mathbf{f}$ in every case where $\mathbf{u}$ would be produced. Since $\mathbf{u}$ only arises in evaluating predicates, it means that we have the following two rules:

$$\llbracket P(\bar t) \rrbracket^{\mathrm{2VL}}_{D,\eta} := \begin{cases} \mathbf{t} & \forall i : \llbracket t_i \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket (t_1, \dots, t_n) \rrbracket_\eta \in \mathbf{P} \\ \mathbf{f} & \text{otherwise} \end{cases}$$
$$\llbracket t \doteq t' \rrbracket^{\mathrm{2VL}}_{D,\eta} := \begin{cases} \mathbf{t} & \llbracket t \rrbracket_\eta, \llbracket t' \rrbracket_\eta \neq \mathrm{NULL},\ \llbracket t \rrbracket_\eta = \llbracket t' \rrbracket_\eta \\ \mathbf{f} & \text{otherwise} \end{cases}$$

The rest of the semantics of expressions and conditions is exactly as in Fig. 2. Note that the truth value $\mathbf{u}$ never arises, and the rules of SQL's 3VL are exactly the rules of Boolean two-valued logic when restricted to $\mathbf{t}$ and $\mathbf{f}$.

Let us return to the example from the introduction of two queries $Q_1$ and $Q_2$ expressing the difference of two relations $R$ and $S$ using NOT IN and NOT EXISTS subqueries. On the database with $R = \{1, \mathrm{NULL}\}$ and $S = \{\mathrm{NULL}\}$ they produced different results: the empty table for $Q_1$ and $\{1, \mathrm{NULL}\}$ for $Q_2$. Under the new $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ semantics, both $Q_1$ and $Q_2$ return the same answer $\{1, \mathrm{NULL}\}$. The reason both elements 1 and NULL are returned is that the comparisons $1 \doteq \mathrm{NULL}$ and $\mathrm{NULL} \doteq \mathrm{NULL}$ now evaluate to $\mathbf{f}$, and then under the negation NOT they become $\mathbf{t}$, and hence both elements are returned. Previously, only the query with NOT EXISTS did so (since EXISTS does not resort to 3VL).

3.1 Capturing SQL with $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$

We now show that the two-valued semantics based on conflating $\mathbf{u}$ with $\mathbf{f}$ fulfills our desiderata for a two-valued version of SQL. Recall that we postulated three requirements: (1) that no expressiveness be gained or lost compared to standard SQL; (2) that over databases without nulls no changes be required; and (3) that when changes are required in the presence of nulls, they are small and do not significantly affect the size of the query.

We now summarize, in the following definition, the conditions under which an alternative semantics satisfies our desiderata.

Definition 1. Given a query language $L$ over relational databases with nulls, and two semantics of it, $\llbracket\cdot\rrbracket$ and $\llbracket\cdot\rrbracket'$, we say that $\llbracket\cdot\rrbracket'$ captures the semantics $\llbracket\cdot\rrbracket$ of $L$ if the following conditions are satisfied:

(1) for every expression $E$ of $L$ there exists another expression $F$ such that $\llbracket E \rrbracket'_D = \llbracket F \rrbracket_D$ for every database $D$;
(2) for every expression $E$ of $L$ there exists another expression $G$ such that $\llbracket E \rrbracket_D = \llbracket G \rrbracket'_D$ for every database $D$; and
(3) for every expression $E$ of $L$, $\llbracket E \rrbracket_D = \llbracket E \rrbracket'_D$ for every database $D$ without nulls.

Furthermore, $\llbracket\cdot\rrbracket'$ captures $\llbracket\cdot\rrbracket$ efficiently if the size of expressions $F$ and $G$ in items 1 and 2 above is at most linear in the size of expression $E$.

Our main result is that the two-valued semantics of SQL captures its standard semantics efficiently.

Theorem 2. The $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ semantics of $\mathrm{RA}_{\mathrm{sql}}$ expressions, and of $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expressions, captures their SQL semantics $\llbracket\cdot\rrbracket$ efficiently.

Note that the capture statement for $\mathrm{RA}_{\mathrm{sql}}$ is not a corollary of the statement for $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$, as the capture definition states that the equivalent query must come from the same language. Thus, applying the statement about $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ to an $\mathrm{RA}_{\mathrm{sql}}$ expression $E$ would only yield expressions $F$ and $G$ of $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$.

In the absence of nulls the definitions of $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ and $\llbracket\cdot\rrbracket$ coincide. Thus, condition (3) in the definition is satisfied. In the rest of this section we give part of the proof outline: we present the translation schemes for (1) along with some examples. All the translations are defined by mutual induction on expressions and conditions. The key element is mimicking the conditions of one semantics under the other. That is, for a condition $\theta$ and a truth value $\tau$, we have a condition $\theta^\tau$ so that $\llbracket \theta \rrbracket^{\mathrm{2VL}} = \tau$ if and only if $\llbracket \theta^\tau \rrbracket = \mathbf{t}$. We inductively propagate these changes through the query, as our conditions can involve subqueries, e.g., $\mathrm{empty}(E)$.

3.2 Translations from $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ to $\llbracket\cdot\rrbracket$

Given an expression $E$ of $\mathrm{RA}_{\mathrm{sql}}$, we describe how to construct an expression $F$ such that $\llbracket E \rrbracket^{\mathrm{2VL}}_D = \llbracket F \rrbracket_D$ for every database $D$. To do so, as explained earlier, we define three translations by mutual induction:

• from conditions $\theta$ to $\theta^{\mathbf{t}}$ and $\theta^{\mathbf{f}}$ such that:

$$\llbracket \theta \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{t} \text{ if and only if } \llbracket \theta^{\mathbf{t}} \rrbracket_{D,\eta} = \mathbf{t}$$
$$\llbracket \theta \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{f} \text{ if and only if } \llbracket \theta^{\mathbf{f}} \rrbracket_{D,\eta} = \mathbf{t}$$

(note that $\llbracket \theta \rrbracket^{\mathrm{2VL}}_{D,\eta}$ can only produce the values $\mathbf{t}$ and $\mathbf{f}$);
• from expression $E$ to $F$ by inductively replacing each condition $\theta$ with $\theta^{\mathbf{t}}$.

The full details of these constructions are presented in Figure 3. We note that the size of the resulting $F$ is indeed linear in $E$.

Example 2. We now look at queries $Q_1$–$Q_5$ of Example 1 and provide their translations. That is, we assume these queries have been written assuming the two-valued 2VL semantics, and we show how they would then look in conventional SQL. To start with, queries $Q_2$, $Q_3$, and $Q_4$ remain unchanged by the translation.

Query $Q_1$ is translated into $Q'_1$ given by

$$Q'_1 = \sigma_{\mathrm{isnull}(R.A) \,\vee\, \neg(R.A \in \sigma_{\neg\mathrm{isnull}(S.A)}(S))}(R)$$

In SQL, this corresponds to

    SELECT R.A FROM R WHERE R.A IS NULL
        OR R.A NOT IN
        ( SELECT S.A FROM S WHERE S.A IS NOT NULL )

In the translation $Q'_5$ of query $Q_5$, the condition $(c\_a > 0) \wedge \neg(c\_c \in \pi_{o\_c}(O))$ in the subquery is translated as

$$\theta^{\mathbf{t}} := (c\_a > 0) \wedge \big(\mathrm{isnull}(c\_c) \vee \neg(c\_c \in \sigma_{\neg\mathrm{isnull}(o\_c)}(\pi_{o\_c}(O)))\big)$$

which is then used in the aggregate subquery $Q_{agg}$; the rest of the query does not change. In terms of SQL, in general these would be translated into additional IS NULL and IS NOT NULL conditions in the aggregate query as follows:

    SELECT AVG( c_acctbal )
    FROM customer
    WHERE c_acctbal > 0.0 AND
        ( c_custkey IS NULL OR
          c_custkey NOT IN
          ( SELECT o_custkey FROM orders
            WHERE o_custkey IS NOT NULL ))

This is a correct translation of $Q_5$ that makes no extra assumptions about the schema. Having such additional information (e.g., the fact that c_custkey is the key of customer) can simplify the translation even further (e.g., by removing the IS NULL condition).

3.3 Query equivalence under two-valued semantics

Recall queries $Q_1$ and $Q_2$ from the introduction. Intuitively, one expects them to be equivalent: indeed, if we remove the NOT from both of them, then they are actually equivalent. And it seems that if $\theta_1$ and $\theta_2$ are equivalent, then so must be $\neg\theta_1$ and $\neg\theta_2$. So what is going on there?

Recall that the effect of the WHERE clause is to keep tuples for which the condition evaluates to $\mathbf{t}$. So equivalence of conditions $\theta_1$ and $\theta_2$, from SQL's point of view, means

$$\llbracket \theta_1 \rrbracket_{D,\eta} = \mathbf{t} \;\Leftrightarrow\; \llbracket \theta_2 \rrbracket_{D,\eta} = \mathbf{t}.$$

Thus, to state that the equivalence of $\theta_1$ and $\theta_2$ implies the equivalence of $\neg\theta_1$ and $\neg\theta_2$ we need the following condition:

$$\big(\llbracket \theta_1 \rrbracket = \mathbf{t} \Leftrightarrow \llbracket \theta_2 \rrbracket = \mathbf{t}\big) \;\Rightarrow\; \big(\llbracket \neg\theta_1 \rrbracket = \mathbf{t} \Leftrightarrow \llbracket \neg\theta_2 \rrbracket = \mathbf{t}\big) \qquad (1)$$

If (1) were true, we could conclude from the equivalence of queries

    SELECT ... FROM ... WHERE θ1    and    SELECT ... FROM ... WHERE θ2

that queries

    SELECT ... FROM ... WHERE ¬θ1    and    SELECT ... FROM ... WHERE ¬θ2

are equivalent. And this brings us to the reason why the two-valued semantics restores the expected equivalence of queries $Q_1$ and $Q_2$.

Proposition 1. Implication (1) does not hold for SQL's semantics but holds for $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$.

The reason behind it is that in 3VL, the law of excluded middle, $\theta \vee \neg\theta \leftrightarrow \mathbf{t}$, does not hold, and that is why (1) is invalidated. In two-valued logic, on the other hand, this law does hold, which gives us (1).

With the law of excluded middle we restore many more equivalences that are often taken for granted, because one thinks in terms of Boolean logic and yet programs in 3VL (perhaps accounting for some of the typical programmer mistakes in SQL [7, 9]).

Proposition 2. The following equivalences hold:

(1) $\llbracket \sigma_\theta(E) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \llbracket E - \sigma_{\neg\theta}(E) \rrbracket^{\mathrm{2VL}}_{D,\eta}$
(2) $\bar t \in \llbracket E \rrbracket^{\mathrm{2VL}}_{D,\eta}$ if and only if $\llbracket \mathrm{empty}(\sigma_{\bar t \doteq \ell(E)}(E)) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{f}$
(3) $\llbracket t \;\omega\; \mathrm{any}(E) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{t}$ if and only if $\llbracket \mathrm{empty}(\sigma_{t \,\omega\, \ell(E)}(E)) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{f}$
(4) $\llbracket t \;\omega\; \mathrm{all}(E) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{t}$ if and only if $\llbracket \mathrm{empty}(\sigma_{\neg(t \,\omega\, \ell(E))}(E)) \rrbracket^{\mathrm{2VL}}_{D,\eta} = \mathbf{t}$

for any $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expression $E$, tuple $\bar t$, and condition $\theta$. None of these is true in general for SQL's three-valued semantics $\llbracket\cdot\rrbracket$.

4 OTHER TWO-VALUED SEMANTICS

We now show that the result on the equivalence of two-valued semantics with the usual SQL semantics is very robust. That is, many other two-valued semantics could be used in place of $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$, and with each of them we recover the equivalence with SQL's 3VL semantics.

What other two-valued semantics could there be? For a start, there is the syntactic equality semantics, already used in SQL in connection with the GROUP BY operation as well as set operations, which treat NULL syntactically. In other words, for those operations, NULL equals itself. Formally, the syntactic equality semantics $\llbracket\cdot\rrbracket^{=}$ is defined by changing the semantics of equality to

$$\llbracket t \doteq t' \rrbracket^{=}_{D,\eta} := \begin{cases} \mathbf{t} & \llbracket t \rrbracket_\eta = \llbracket t' \rrbracket_\eta \\ \mathbf{f} & \text{otherwise} \end{cases}$$

and keeping the rest as in the definition of $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$. The only difference is that under this semantics, $\mathrm{NULL} \doteq \mathrm{NULL}$ evaluates to $\mathbf{t}$.
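The behaviour of $Q_1$, $Q_2$ and the translated query $Q'_1$ on $R = \{1, \mathrm{NULL}\}$ and $S = \{\mathrm{NULL}\}$ can be checked against an actual engine. The sketch below uses Python's built-in `sqlite3` module; it assumes that SQLite follows the standard three-valued treatment of NOT IN and NOT EXISTS and treats NULLs as equal in EXCEPT, and the helper `run` is our own illustrative name. Results are collected into sets, so multiplicities are ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (A INTEGER);
    CREATE TABLE S (A INTEGER);
    INSERT INTO R VALUES (1), (NULL);
    INSERT INTO S VALUES (NULL);
""")

def run(sql):
    """Return the single-column result of a query as a set (None stands for NULL)."""
    return {row[0] for row in conn.execute(sql)}

# Q1 and Q2 as written by the programmer, evaluated under SQL's 3VL.
q1 = run("SELECT R.A FROM R WHERE R.A NOT IN (SELECT S.A FROM S)")
q2 = run("SELECT R.A FROM R WHERE NOT EXISTS "
         "(SELECT S.A FROM S WHERE S.A = R.A)")

# Q1' from Example 2: what Q1 means under 2VL, expressed in standard SQL.
q1_translated = run("SELECT R.A FROM R WHERE R.A IS NULL OR R.A NOT IN "
                    "(SELECT S.A FROM S WHERE S.A IS NOT NULL)")

# Difference based on syntactic equality of NULLs.
q_except = run("SELECT A FROM R EXCEPT SELECT A FROM S")
```

Under these assumptions `q1` is empty, `q2` and `q1_translated` both contain 1 and NULL, and `q_except` contains only 1, matching the three answers discussed in the text.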

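Basic conditions

$$(\mathbf{t})^{\mathbf{t}} := \mathbf{t} \qquad (\mathbf{t})^{\mathbf{f}} := \mathbf{f} \qquad (\mathbf{f})^{\mathbf{t}} := \mathbf{f} \qquad (\mathbf{f})^{\mathbf{f}} := \mathbf{t}$$
$$\mathrm{isnull}(t)^{\mathbf{t}} := \mathrm{isnull}(t) \qquad \mathrm{isnull}(t)^{\mathbf{f}} := \neg\mathrm{isnull}(t)$$
$$(\bar t \doteq \bar t')^{\mathbf{t}} := \bar t \doteq \bar t' \qquad (\bar t \doteq \bar t')^{\mathbf{f}} := \bigvee_{i=1}^{n} \big(\neg(t_i \doteq t'_i) \vee \mathrm{isnull}(t_i) \vee \mathrm{isnull}(t'_i)\big)$$
$$P(\bar t)^{\mathbf{t}} := P(\bar t) \qquad P(\bar t)^{\mathbf{f}} := \neg P(\bar t) \vee \bigvee_{i=1}^{n} \mathrm{isnull}(t_i)$$
$$(\bar t \in E)^{\mathbf{t}} := \bar t \in E \qquad (\bar t \in E)^{\mathbf{f}} := \bigvee_{i=1}^{n} \mathrm{isnull}(t_i) \vee \neg\big(\bar t \in \sigma_{\neg\mathrm{isnull}(N_1) \wedge \cdots \wedge \neg\mathrm{isnull}(N_n)}(E)\big)$$

where $\bar t := (t_1, \dots, t_n)$, $\bar t' := (t'_1, \dots, t'_n)$, and $\ell(E) := (N_1, \dots, N_n)$

$$\mathrm{empty}(E)^{\mathbf{t}} := \mathrm{empty}(E) \qquad \mathrm{empty}(E)^{\mathbf{f}} := \neg\mathrm{empty}(E)$$
$$(t \;\omega\; \mathrm{any}(E))^{\mathbf{t}} := t \;\omega\; \mathrm{any}(E) \qquad (t \;\omega\; \mathrm{any}(E))^{\mathbf{f}} := \mathrm{empty}(\sigma_{\neg\theta}(E))$$
$$(t \;\omega\; \mathrm{all}(E))^{\mathbf{t}} := t \;\omega\; \mathrm{all}(E) \qquad (t \;\omega\; \mathrm{all}(E))^{\mathbf{f}} := \neg\mathrm{empty}(\sigma_{\theta}(E))$$

where $\ell(E) := N$ and $\theta := \mathrm{isnull}(t) \vee \mathrm{isnull}(N) \vee (\neg\mathrm{isnull}(t) \wedge \neg\mathrm{isnull}(N) \wedge \neg\, t \,\omega\, N)$

Composite conditions

$$(\theta_1 \vee \theta_2)^{\mathbf{t}} := (\theta_1)^{\mathbf{t}} \vee (\theta_2)^{\mathbf{t}} \qquad (\theta_1 \vee \theta_2)^{\mathbf{f}} := (\theta_1)^{\mathbf{f}} \wedge (\theta_2)^{\mathbf{f}}$$
$$(\theta_1 \wedge \theta_2)^{\mathbf{t}} := (\theta_1)^{\mathbf{t}} \wedge (\theta_2)^{\mathbf{t}} \qquad (\theta_1 \wedge \theta_2)^{\mathbf{f}} := (\theta_1)^{\mathbf{f}} \vee (\theta_2)^{\mathbf{f}}$$
$$(\neg\theta)^{\mathbf{t}} := \theta^{\mathbf{f}} \qquad (\neg\theta)^{\mathbf{f}} := \theta^{\mathbf{t}}$$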

Figure 3: 2VL semantics to SQL semantics
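The condition translation of Figure 3 can be prototyped on a small condition language. The following Python sketch is an illustration under our own assumptions, not the authors' implementation: conditions are nested tuples, only binary predicates over two attributes and the connectives are covered (no subqueries), NULL is `None`, and unknown is also modelled as `None`. The function `agrees` checks exhaustively, over a small domain, that $\theta^{\mathbf{t}}$ and $\theta^{\mathbf{f}}$ evaluated under 3VL mimic $\theta$ evaluated under 2VL.

```python
import itertools
import operator

OPS = {"=": operator.eq, "<": operator.lt}

def tr(c, want):
    """Return c^t if want is True, else c^f (cases of Figure 3 without subqueries)."""
    kind = c[0]
    if kind == "const":
        return ("const", c[1] == want)
    if kind == "isnull":
        return c if want else ("not", c)
    if kind == "pred":                       # ("pred", op, attr1, attr2)
        if want:
            return c
        nulls = ("or", ("isnull", c[2]), ("isnull", c[3]))
        return ("or", ("not", c), nulls)
    if kind == "not":
        return tr(c[1], not want)
    op = kind if want else {"or": "and", "and": "or"}[kind]
    return (op, tr(c[1], want), tr(c[2], want))

def ev(c, row, two_valued):
    """Evaluate c on a row; under 3VL, None is unknown; under 2VL, never None."""
    kind = c[0]
    if kind == "const":
        return c[1]
    if kind == "isnull":
        return row[c[1]] is None
    if kind == "pred":
        a, b = row[c[2]], row[c[3]]
        if a is None or b is None:
            return False if two_valued else None
        return OPS[c[1]](a, b)
    if kind == "not":
        v = ev(c[1], row, two_valued)
        return None if v is None else not v
    l, r = ev(c[1], row, two_valued), ev(c[2], row, two_valued)
    if kind == "and":
        if l is False or r is False:
            return False
        return None if l is None or r is None else True
    if l is True or r is True:               # kind == "or"
        return True
    return None if l is None or r is None else False

def agrees(c):
    """Check [[c]]^2VL = t iff [[c^t]] = t, and [[c]]^2VL = f iff [[c^f]] = t."""
    for a, b in itertools.product([None, 0, 1], repeat=2):
        row = {"A": a, "B": b}
        v = ev(c, row, True)
        if (ev(tr(c, True), row, False) is True) != (v is True):
            return False
        if (ev(tr(c, False), row, False) is True) != (v is False):
            return False
    return True

theta = ("not", ("or", ("pred", "=", "A", "B"),
                 ("and", ("pred", "<", "A", "B"), ("isnull", "B"))))
```

Negation is handled purely by switching between the two translations, so the output never grows by more than a constant factor per node, in line with the linear size bound stated for the translation.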

Coming back to the example of queries $Q_1$ and $Q_2$ from the introduction, under this semantics on $R = \{1, \mathrm{NULL}\}$ and $S = \{\mathrm{NULL}\}$ they produce $\{1\}$ as the answer: in this case, it is the same as we would have by applying R EXCEPT S based on syntactic equality. The semantics $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ and $\llbracket\cdot\rrbracket^{=}$ are, as expected, different, but note that both of them make queries $Q_1$ and $Q_2$ produce the same result, as they both satisfy condition (1) from Section 3.3. This eliminates the potential confusion of writing (supposedly) the difference query in ways that produce different results.

Is this the only possible other two-valued semantics? Of course not. Consider for example the predicate $\le$. In both $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ and $\llbracket\cdot\rrbracket^{=}$, $\mathrm{NULL} \le \mathrm{NULL}$ is $\mathbf{f}$, but under the syntactic equality interpretation it is not unreasonable to say that $\mathrm{NULL} \le \mathrm{NULL}$ is true, as $\le$ subsumes equality. This gives a general idea of how different semantics can be obtained: when some arguments of a predicate are NULL, we can decide what the truth value is based on the other values. Just for the sake of example, we could say that $n \le \mathrm{NULL}$ is $\mathbf{f}$ for $n < 0$ and $\mathbf{t}$ for $n \ge 0$.

To define such alternative semantics in a general way, we simply state, for each predicate, what happens when some of the arguments are NULL. Formally, to define such semantics, we introduce the notion of grounding of predicates in $\Omega$, as well as equality and comparison. It is a function gr that takes an $n$-ary predicate $P$ of type $\tau$ and a non-empty sequence $I = \langle i_1, \dots, i_k \rangle$ of indices $1 \le i_1 < \cdots < i_k \le n$, and produces a relation $\mathrm{gr}(P, I)$ that contains $\tau$-records $(t_1, \dots, t_n)$ where $t_i = \mathrm{NULL}$ for every $i \in I$, and $t_i \neq \mathrm{NULL}$ for every $i \notin I$. In the above example, if $P$ is $\le$, then $\mathrm{gr}(P, \langle 2 \rangle) = \{(n, \mathrm{NULL}) \mid n \ge 0\}$; one could additionally set, say, $\mathrm{gr}(P, \langle 1 \rangle) = \{(\mathrm{NULL}, n) \mid n \ge 0\}$ and $\mathrm{gr}(P, \langle 1, 2 \rangle) = \{(\mathrm{NULL}, \mathrm{NULL})\}$.

The semantics $\llbracket\cdot\rrbracket^{\mathrm{gr}}$ based on such a grounding is given by redefining the truth value of $\llbracket P(t_1, \dots, t_n) \rrbracket^{\mathrm{gr}}_{D,\eta}$ as that of $\bar t \in \mathrm{gr}(P, I)$, where $I$ is the list of indices $i$ such that $\llbracket t_i \rrbracket_\eta = \mathrm{NULL}$.

Such semantics generalize $\llbracket\cdot\rrbracket^{\mathrm{2VL}}$ and $\llbracket\cdot\rrbracket^{=}$. The former is obtained by setting $\mathrm{gr}(P, I) = \emptyset$ for each nonempty $I$; the latter is the same except that $\mathrm{gr}(=, \langle 1, 2 \rangle)$ contains the tuple $(\mathrm{NULL}, \mathrm{NULL})$.

We say that a grounding is expressible if for each $P \in \Omega$ and each $I$ there is a condition $\theta_{P,I}$ such that $\bar t \in \mathrm{gr}(P, I)$ iff $\pi_{\bar I}(\bar t)$ satisfies $\theta_{P,I}$. Note that the projection on the complement $\bar I$ of $I$ only contains non-null elements. The semantics seen previously satisfy this condition.

Theorem 3. For every expressible grounding gr, the $\llbracket\cdot\rrbracket^{\mathrm{gr}}$ semantics of $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ or $\mathrm{RA}_{\mathrm{sql}}$ expressions captures their SQL semantics.

5 OTHER MANY-VALUED LOGICS

To further support the claim that two-valued logic is a very natural alternative to 3VL in SQL, we now show that no other many-valued logic could have been used in its place in a way that would have altered the expressiveness of the language. Thus, of all the logics that give us equal expressive power it makes sense to choose the simplest and the most familiar one.

We build upon a result of [15] which proved such an equivalence between many-valued first-order logics under set semantics, and also under restrictions on the behavior of the many-valued connectives. We strengthen this by going from first-order logic to full SQL, and by eliminating the restrictions the result of [15] required.

We first define extensions of $\mathrm{RA}_{\mathrm{sql}}$ based on different many-valued logics and state the equivalence result, and then give a hint of the proof.

5.1 Many-valued Interpretations

SQL's 3VL is an example of a many-valued logic, known well before SQL as Kleene's logic [6]. It is not the only logic proposed to deal with null values; there were others with 3, 4, 5, and even 6 values [13, 24, 34, 44]; see the discussion in the introduction. Could using one of them give us a more expressive language? We now give a negative answer.

Recall that a many-valued (propositional) logic MVL is given by a finite collection $\mathbf{T}$ of truth values with $\mathbf{t}, \mathbf{f} \in \mathbf{T}$, and a finite set $\Gamma$ of logical connectives $\gamma : \mathbf{T}^{\mathrm{ar}(\gamma)} \to \mathbf{T}$. We refer to $\mathrm{ar}(\gamma)$ as the arity of $\gamma$, and assume that any many-valued logic includes at least the unary connective $\neg$ for NOT, and the binary connectives $\vee$ and $\wedge$ for OR and AND, whose restriction to the basic truth values $\{\mathbf{t}, \mathbf{f}\}$ follows the rules of Boolean logic.

The only condition we impose on MVL is that $\vee$ and $\wedge$ be associative and commutative; without those we cannot write conditions like $\theta_1$ OR $\cdots$ OR $\theta_k$ (and likewise for AND) in WHERE without worrying about the order of conditions and parentheses around them. Not having commutativity and associativity would also break many optimizations. The ability to write conditions like that is taken for granted by SQL programmers and is therefore a requirement one cannot waive.

A semantics $\llbracket\cdot\rrbracket^{\mathrm{MVL}}$ of $\mathrm{RA}_{\mathrm{sql}}$ or $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expressions is given by defining the semantics of atomic conditions $P(\bar t)$ and equality, and then following the connectives of MVL for complex conditions. Such a semantics of atomic conditions is expressible if it does not deviate from the standard semantics in the absence of nulls (i.e., $=$ is equality, $\le$ is less-than-or-equal, etc.), and if for each atomic condition $P$, the fact that $P(\bar t)$ evaluates to a truth value $\tau$ can be expressed by a condition under SQL's semantics $\llbracket\cdot\rrbracket$. This excludes pathological situations when conditions like $1 \le 2$ evaluate to truth values other than $\mathbf{t}$, $\mathbf{f}$, or when conditions like "$\mathrm{NULL} \doteq n$ evaluates to $\tau$" are not expressible in SQL (say, having a truth value $\tau$ so that $\mathrm{NULL} \doteq n$ is $\tau$ if $n$ is an encoding of a halting Turing machine). Anything reasonable is permitted by being expressible. Formally, a semantics $\llbracket\cdot\rrbracket^{\mathrm{MVL}}$ is SQL-expressible for atomic predicates if:

(1) in the absence of nulls, the semantics of atomic conditions is the same as SQL's semantics $\llbracket\cdot\rrbracket$;
(2) for each truth value $\tau \in \mathbf{T}$ and each atomic predicate $P$ there is a condition $\theta_{P,\tau}$ such that $\llbracket P(\bar t) \rrbracket^{\mathrm{MVL}}_{D,\eta} = \tau$ if and only if $\llbracket \theta_{P,\tau} \rrbracket_{D,\eta} = \mathbf{t}$, for all $D$ and $\eta$.

Example 3. We consider a 4-valued logic from [13]. It has truth values $\mathbf{t}$, $\mathbf{f}$, $\mathbf{u}$ just as SQL's 3VL, and also a new value $\mathbf{s}$. This value means "sometimes": under some interpretation of nulls the condition is true, but under some it is false. The semantics is then defined in the same way as SQL's semantics except that $\mathbf{u}$ is replaced by $\mathbf{s}$.

In this logic the unknown $\mathbf{u}$ appears when one uses complex conditions. Suppose conditions $\theta_1$ and $\theta_2$ both evaluate to $\mathbf{s}$. Then there are interpretations of nulls where each one of them is true, and where each one of them is false, but we cannot conclude that there is an interpretation where both are true. Hence $\theta_1 \wedge \theta_2$ evaluates to $\mathbf{u}$ rather than $\mathbf{s}$. For full truth tables of this 4VL, see [13].

The previously known equivalence result [15] considered first-order logic under set semantics, and many-valued logics that are idempotent ($\tau \vee \tau = \tau$) or weakly idempotent ($\tau \vee \tau \vee \tau = \tau \vee \tau$). For example, the 4-valued logic from Example 3 is not idempotent but is weakly idempotent.

We now show a much stronger result. As the query language we take either $\mathrm{RA}_{\mathrm{sql}}$ or $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$. The many-valued logic is not subject to idempotency restrictions. Even in this general setting, the resulting semantics does not give any extra expressiveness compared to the standard SQL semantics.

Theorem 4. For a many-valued logic MVL in which $\wedge$ and $\vee$ are associative and commutative, let $\llbracket\cdot\rrbracket^{\mathrm{MVL}}$ be a semantics of $\mathrm{RA}_{\mathrm{sql}}$ or $\mathrm{RA}^{\mathrm{rec}}_{\mathrm{sql}}$ expressions based on MVL. Assume that this semantics is SQL-expressible for atomic predicates. Then it captures the SQL semantics.

The translation from MVL semantics into SQL semantics may add an application of the COUNT aggregate and basic arithmetic ($+$, $*$). The idea of the translation is discussed below.

5.2 $\llbracket\cdot\rrbracket^{\mathrm{MVL}}$ to SQL semantics

The main difficulty that needs to be addressed is that we assume practically nothing about the many-valued logic, which makes the evaluation of conditions such as $\bar t \in E$ complicated. This is the disjunction of truth values of conditions $\bar t' \doteq \bar t$ as $\bar t'$ ranges over tuples in the evaluation of $E$, and if we assume nothing about the truth tables, how do we compute this? With idempotency, this is easy: no matter how many times a truth value $\tau$ occurs in this disjunction, it collapses to one occurrence. With weak idempotency, it collapses to one or two occurrences. But without these conditions, we need to look for other mechanisms.

Continuing with this example, our goal is to construct a new expression $F$, for a truth value $\tau$, such that $\llbracket \bar t \in E \rrbracket^{\mathrm{MVL}} = \tau$ if and only if $\llbracket F \rrbracket = \mathbf{t}$. The most straightforward way to do this consists of two steps. First, we compute for each tuple $\bar t' \in E$ the truth value $\llbracket \bar t' \doteq \bar t \rrbracket^{\mathrm{MVL}}$. Second, we aggregate $\vee$ over these truth values. As our many-valued logic is expressible, we can assume that the first step is expressible, by means of a condition denoted by $\theta_{\doteq \bar t}$. Assume that we have an aggregate $\bigvee$ that takes a bag of truth values $\{\tau_1, \dots, \tau_m\}$ to $\tau_1 \vee \cdots \vee \tau_m$. With that, $F$ could be defined as follows, assuming $\bar N = \ell(E)$:

$$\mathrm{Group}_{\emptyset}\Big\langle \bigvee(\cdot) \Big\rangle\big(\pi_{\theta_{\doteq \bar t}(\bar N)}(E)\big)$$

The problem is that we do not have such an aggregate function in general; it would be unrealistic to expect it to exist for every MVL. For the usual Boolean logic, it can be implemented as MAX by associating $\mathbf{t}$ with 1 and $\mathbf{f}$ with 0, but in general we have no such recourse to numerical aggregates. Thus, we need a different approach, explained below.

Commutativity and associativity of disjunction allow us to change the order of the disjuncts as wanted without affecting the result. In fact, this implies that the result of the disjunction is determined only by the number of disjuncts for each truth value, and hence that we can view disjunction as a function $f_\vee$ that maps vectors of integers, of arity equal to the number of truth values of our many-valued logic, into a single truth value. More precisely, for a logic with truth values $\tau_1, \dots, \tau_m$, we have $f_\vee(k_1, \dots, k_m) = \tau$ if

$$\tau = \underbrace{(\tau_1 \vee \cdots \vee \tau_1)}_{k_1 \text{ times}} \vee \cdots \vee \underbrace{(\tau_m \vee \cdots \vee \tau_m)}_{k_m \text{ times}}.$$

Note that by expressibility of conditions in SQL's semantics, the number $k_i$ of times that $\bar t' \doteq \bar t$ evaluates to $\tau_i$ can be expressed in $\mathrm{RA}_{\mathrm{sql}}$ with the help of the COUNT aggregate.

The last missing bit is to calculate the disjunction of $k$ copies of $\tau$. Since $\mathbf{T}$ is of fixed size, we know that the sequence of truth values $\tau,\ \tau \vee \tau,\ \tau \vee \tau \vee \tau,\ \dots$ eventually exhibits periodic behavior. One can calculate the period and explicitly list the truth values of such disjunctions up to the point from which the periodic behavior starts (that would be at most $|\mathbf{T}| + 1$). With this, and simple arithmetic operations, one can calculate the value of the disjunction of $k$ copies of $\tau$, for each given $k$ that was computed by a counting aggregate. This explains the main idea behind the proof; full details are in the appendix.

6 CONCLUSIONS

We have demonstrated that one of the most criticized aspects of SQL, and one that is the source of confusion for numerous SQL programmers, namely the use of three-valued logic, was not really necessary, and that perfectly reasonable two-valued semantics exist that achieve exactly the same expressiveness as the original three-valued design. Of course there is so much existing SQL code that operates under the three-valued semantics that changing that aspect of the language as if it were never there is not realistic. Thus, the questions are: what can we do with this now, and what can we do in the future.

Regarding now, there are two directions. First, since we provided equivalence results, it is entirely feasible to let programmers use two-valued SQL without changing the underlying DBMS implementation. One simply translates a query written under the two-valued semantics into standard SQL and runs it. Implementing such a translation is one of our immediate goals. Thus, no new implementation of query evaluation is necessary; we can reuse all the existing technology while at the same time getting rid of one of its most problematic aspects.

Second, as we explained in the introduction, this new point of view may well be of interest to designers of new languages. This activity is especially visible in the area of graph databases [43], and an alternative to SQL's confusing 3VL might well be considered.

For the future, the key direction to pursue is to make the translation more practical by taking into account additional semantic information. Such information is most likely to come in the form of constraints such as keys, foreign keys, and NOT NULL constraints. For now, the translations we presented do not take them into account, but we already saw in one of the examples that they could be very useful. We also plan to adapt works like [21, 27] to produce evaluation schemes that return results with certainty guarantees, under the two-valued approach.

Acknowledgments. Part of this work was done when the second author was at the University of Edinburgh and with FSMP (Foundation Sciences Mathématiques de Paris) hosted by IRIF and ENS. We acknowledge support of EPSRC grants N023056 and S003800, and grants from FSMP and Neo4j.

REFERENCES

[1] R. Agrawal. Alpha: An extension of relational algebra to express a class of recursive queries. IEEE Trans. Software Eng., 14(7):879–885, 1988.
[2] R. Angles, M. Arenas, P. Barceló, P. Boncz, G. Fletcher, C. Gutiérrez, T. Lindaaker, M. Paradies, S. Plantikow, J. Sequeda, O. van Rest, and H. Voigt. G-CORE: A Core for Future Graph Query Languages. In ACM SIGMOD, 2018.
[3] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, and G. Washburn. Design and implementation of the LogicBlox system. In SIGMOD, pages 1371–1382, 2015.
[4] O. Arieli, A. Avron, and A. Zamansky. What is an ideal logic for reasoning with inconsistency? In IJCAI, pages 706–711, 2011.
[5] V. Benzaken and E. Contejean. A Coq mechanised formal semantics for realistic SQL queries: formally reconciling SQL and bag relational algebra. In Proceedings of the 8th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2019, pages 249–261. ACM, 2019.
[6] L. Bolc and P. Borowik. Many-Valued Logics: Theoretical Foundations. Springer, 1992.
[7] S. Brass and C. Goldberg. Semantic errors in SQL queries: A quite complete list. J. Syst. Softw., 79(5):630–644, 2006.
[8] K. S. Candan, J. Grant, and V. S. Subrahmanian. A unified treatment of null values using constraints. Inf. Sci., 98(1-4):99–156, 1997.
[9] J. Celko. SQL for Smarties: Advanced SQL Programming. Morgan Kaufmann, 2005.
[10] S. Ceri and G. Gottlob. Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries. IEEE Trans. Software Eng., 11(4):324–345, 1985.
[11] S. Chu, B. Murphy, J. Roesch, A. Cheung, and D. Suciu. Axiomatic foundations and algorithms for deciding semantic equivalences of SQL queries. Proc. VLDB Endow., 11(11):1482–1495, 2018.
[12] S. Chu, K. Weitz, A. Cheung, and D. Suciu. Hottsql: proving query rewrites with univalent SQL semantics. In PLDI, pages 510–524. ACM, 2017.
[13] M. Console, P. Guagliardo, and L. Libkin. Approximations and refinements of certain answers via many-valued logics. In KR, pages 349–358. AAAI Press, 2016.
[14] M. Console, P. Guagliardo, and L. Libkin. On querying incomplete information in databases under bag semantics. In IJCAI, pages 993–999, 2017.
[15] M. Console, P. Guagliardo, and L. Libkin. Propositional and predicate logics of incomplete information. In KR, pages 592–601. AAAI Press, 2018.
[16] H. Darwen and C. J. Date. The third manifesto. SIGMOD Record, 24(1):39–49, 1995.
[17] C. J. Date. Database in Depth - Relational Theory for Practitioners. O'Reilly, 2005.
[18] C. J. Date. A critique of Claude Rubinson's paper nulls, three-valued logic, and ambiguity in SQL: critiquing Date's critique. SIGMOD Record, 37(3):20–22, 2008.
[19] C. J. Date and H. Darwen. A Guide to the SQL Standard. Addison-Wesley, 1996.
[20] A. Deutsch, Y. Xu, M. Wu, and V. E. Lee. Aggregation support for modern graph analytics in TigerGraph. In SIGMOD, pages 377–392. ACM, 2020.
[21] S. Feng, A. Huber, B. Glavic, and O. Kennedy. Uncertainty annotated databases - A lightweight approach for approximating certain answers. In SIGMOD, pages 1313–1330. ACM, 2019.
[22] M. Fitting. Kleene's logic, generalized. J. Log. Comput., 1(6):797–810, 1991.
[23] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor. Cypher: An evolving query language for property graphs. In SIGMOD, pages 1433–1445. ACM, 2018.
[24] G. H. Gessert. Four valued logic for relational database systems. SIGMOD Record, 19(1):29–35, 1990.
[25] M. L. Ginsberg. Multivalued logics: a uniform approach to reasoning in artificial intelligence. Computational Intelligence, 4:265–316, 1988.
[26] S. Greco, C. Molinaro, and I. Trubitsyna. Approximation algorithms for querying incomplete databases. Inf. Syst., 86:28–45, 2019.
[27] P. Guagliardo and L. Libkin. Making SQL queries correct on incomplete databases: A feasibility study. In PODS, pages 211–223. ACM, 2016.
[28] P. Guagliardo and L. Libkin. A formal semantics of SQL queries, its validation, and applications. Proc. VLDB Endow., 11(1):27–39, 2017.
[29] P. Guagliardo and L. Libkin. On the Codd semantics of SQL nulls. Information Systems, 86:46–60, 2019.
[30] A. Hernich and P. G. Kolaitis. Foundations of information integration under bag semantics. In LICS, pages 1–12. IEEE Computer Society, 2017.
[31] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, 1984.
[32] Ingres 9.3. QUEL Reference Guide, 2009.
[33] L. Jachiet, P. Genevès, N. Gesbert, and N. Layaïda. On the optimization of recursive relational queries: Application to graph queries. In SIGMOD, pages 681–697. ACM, 2020.

[34] Y. Jia, Z. Feng, and M. Miller. A multivalued approach to handle nulls in RDB. In Future Databases, volume 3 of Advanced Database Research and Development Series, pages 71–76. World Scientific, Singapore, 1992.
[35] L. Libkin. SQL’s three-valued logic and certain answers. ACM Trans. Database Syst., 41(1):1:1–1:28, 2016.
[36] M. Negri, G. Pelagatti, and L. Sbattella. Formal semantics of SQL queries. ACM Trans. Database Syst., 16(3):513–534, 1991.
[37] C. Nikolaou, E. V. Kostylev, G. Konstantinidis, M. Kaminski, B. C. Grau, and I. Horrocks. The bag semantics of ontology-based data access. In IJCAI, pages 1224–1230, 2017.
[38] PostgreSQL Documentation, Version 9.6.1. www.postgresql.org/docs/manuals, 2016.
[39] M. Stonebraker, E. Wong, P. Kreps, and G. Held. The design and implementation of INGRES. ACM Trans. Database Syst., 1(3):189–222, 1976.
[40] Transaction Processing Performance Council. TPC Benchmark™ H Standard Specification, 2014. Revision 2.17.1.
[41] J. Van den Bussche and S. Vansummeren. Translating SQL into the relational algebra. Course notes, Hasselt University and Université Libre de Bruxelles, 2009.
[42] O. van Rest, S. Hong, J. Kim, X. Meng, and H. Chafi. PGQL: a property graph query language. In GRADES, page 7. ACM, 2016.
[43] Wikipedia contributors. GQL graph query language, 2020.
[44] K. Yue. A more general model for handling missing information in relational databases using a 3-valued logic. SIGMOD Record, 20(3):43–49, 1991.
[45] C. Zaniolo. Database relations with null values. JCSS, 28(1):142–166, 1984.