Polystore Mathematics of Relational Algebra II

Associative Array Model of SQL, NoSQL, and NewSQL Databases

Jeremy Kepner1,2,3, Vijay Gadepally1,2, Dylan Hutchison4, Hayden Jananthan3,5, Timothy Mattson6, Siddharth Samsi1, Albert Reuther1 1MIT Lincoln Laboratory, 2MIT Computer Science & AI Laboratory, 3MIT Mathematics Department, 4University of Washington Computer Science Department, 5Vanderbilt University Mathematics Department, 6Intel Corporation

Abstract—The success of SQL, NoSQL, and NewSQL databases SQL, NoSQL, and NewSQL would enable their is a reflection of their ability to provide significant functionality interoperabilityThe recognition. Such of “one a mathematical size does not modelfit all” is [ Stonebraker the primary and their corresponding operations are defined in terms of and performance benefits for specific domains, such as financial &goal Çetintemel of this paper. 2005 ] has led to the need for polystore databases, associative arrays. Fifth, the mathematical properties required transactions, internet search, and data analysis. The BigDAWG such as BigDAWG [Duggan 2015, Elmore 2015], that can by graph algorithms and matrix mathematics are confirmed for polystore seeks to provide a mechanism to allow applications to contextualizeSQL Era queriesNoSQL and Era cast data betweenNewSQL Era multiplePolystore databases Era relational operations. Finally, performance results illustrating transparently achieve the benefits of diverse databases while so that a user can employ the best database for a particular task the impact of these properties are presented and discussed.

insulating applications from the details of these databases. NoSQL (see Figure 3). To achieve this goal, polystore databases need Associative arrays provide a common approach to the Relational Model NewSQL Polystore Mathematics of Relational Algebra II. RELATIONS mathematics found in different databases: sets (SQL), graphs to bridge[Codd 1970] SQL, NoSQL, and NewSQL[Cattell 2010] databases. The Polystore SQL (NoSQL), and matrices (NewSQL). This work presents the SQL Dynamic Distributed Dimensional Data Model (D4M) The relational model, based on set theory, is a key 1,2 2 2 Google4 BigTable 2,3 2 relationalHayden model in Jananthan terms of associati, Ziqive Zhouarrays ,and Vijay identifies Gadepally the ,technology Dylan Hutchison [Kepner[Chang 2012, Suna 2006]] was Kim developed, Jeremy to provide Kepner a linear [Elmore2015] mathematical foundation for SQL databases. The relational algebraic interface to graphs stored in NoSQL NewSQL databases [ByunBigDAWG key mathematical properties that1 are preserved within2 SQL3. 4 model effectively consists of relational algebra, relational These properties include Vanderbiltassociativity, University, commutativityMIT, , CalTech,2012,Figure 1.UniversityKepner Evolut ion 2013 of of Washington ]. SQL, Subsequently, NoSQL, NewSQL, D4M andhas polystore been calculus, and the structured query language (SQL) that balance distributivity, identities, annihilators, and inverses. Performance succdatabases.essfully used Each with class both of databaseSQL [Wu delivered 2014, Gadepally new mathematics, 2015] the theoretical, implementation, and systems design aspects of measurements on distributivity and associativity show the impact andfunctionality, NewSQL and [Samsi performance 2016] focused databases. on new The application effectiveness areas. of databases. The relational model is well covered in the theseAbstract properties—Financial can have transactions, on associative internet array operations search, and. These data D4Mdatabases to seamlessly to analyze interact data on with the these internet diverse [12]–[14]. databases NewSQL rests literature [Maier 1983, Codd 1990, Abiteboul 1995]; only the resultsanalysis demonstrate are all placing that increasing associative demands arrays oncould databases. provide SQL, a ondatabases itsSQL, associative supportNoSQL, new array and analytics NewSQL algebra capabilities [Kepner databases &within are Chaidez designed a database. 2013, for most relevant aspects of the model are reviewed here. Some of mathematicalNoSQL, and model NewSQL for polystores databases haveto optimize been developed the exchange to meet of KepnerInspecific hybrid & applications, Chaidez processing 2014 systems have, Kep distinctner like & Jansen Apache data models,2016] Pig [15],that and provides Apache rely on the more significant mathematical contributions of the datathese and demands execution and queries. each offers unique benefits. SQL, NoSQL, adifferent mathematics underlying that spansmathematics sets, graphs, (see Figure and matrices.2). Because The of relational model to databases include and NewSQL databases also rely on different underlying math- Spark [16], and HaLoop [17], SQL, NoSQL, and NewSQL abilityconceptstheir differenceof D4M have been(ands, each Myria blended. database [Halperin has 2014 unique]) to strengths bridge multiple that are Keywoematicalrds-Associative models. Polystores Array Algebra seek to; SQL provide; NoSQL; a mechanism NewSQL; to databaseswell suited has for laid particular the foundation workloads. for It the is now polystore recognized database that allow applications to transparently achieve the benefits of di- Polystore databases, such as BigDAWG [18]–[22] and (R1) Relations: a mathematical definition of database tables Set Theory; Graph Theory; Matrices; Linear Algebra concept.special- purpose databases can be 100x faster for a particular verse databases while insulating applications from the details of sufficient for their representation without constraining Myriaapplication [23], than were a createdgeneral to-purpose make usedatabase of the [Kepner varied specialties 2014]. In these databases. IntegratingI. INTRODUCTION the underlying mathematics of these their implementation; ofaddition, the aforementioned the availabilitydatabase of high types performance [24]. One data inherent analysis diverse databases can be an important enabler for polystores as it Visualizations Applications enablesRelational effective or SQL reasoning (Structured across differentQuery Language) databases. database Associatives challengeplatforms, issuch that as SQL, the MIT NoSQL, SuperCloud and NewSQL [Reuther databases 2013, Prout use (R2) Query semantics: a mathematical definition of operations [Coddarrays 1970, provide Stonebraker a common 1976] approach such as forPostgr theeSQL, mathematics MySQL, of different2015], allows data modelshigh performance and make use databases of different to sha mathematicalre the same on relations sufficient for proving the correctness of BigDAWG Polystore Common Interface andpolystores Oracle have by encompassing been the de thefacto mathematics interface to founddatabases in different since tools,hardware as illustratedplatform without by Figure sacrificing 1 and Figure performance 2. . database queries; thedatabases: 1980s (see sets (SQL), Figure graphs 1) and (NoSQL), are the and bedrock matrices of electronic(NewSQL). (R3) Proof of the equivalence of declarative and procedural transactionsPrior work around presented the world. the SQL More relational recently, model key-value in terms stores of Analytic Analytic Analytic syntaxes over the above definitions that has enabled the Translator Translator Translator (NoSQLassociative databases) arrays andsuch identified as Google key BigTable mathematical [Chang properties 2008], SQL SQL NoSQLNoSQL NewSQLNewSQL PolystoreFuture use of declarative semantics for database users and Apachethat are Accumulo preserved within[Wall SQL. 2015] This, and work MongoDB provides [ theChodorow rigorous Example PostgreSQL Accumulo SciDB BigDAWG procedural semantics for database builders [Codd 1972]. mathematical definitions, lemmas, and theorems underlying these 2013] have been developed for representing large sparse tables Application Transactions Data Search DataAnalysis All properties. Specifically, SQL Relational Algebra deals primarily RelationalTranslator Key-Value TranslatorSparse Associative to aid in the analysis of data for Internet search. As a result, the Data Model with relations – multisets of tuples – and operations on and SQLSQLTables PairsNoSQLNoSQL MatricesNewSQLNewSQL Arrays Of these results, (R3) has been enormously important, but majority of the data on the Internet is now analyzed using key- between those relations. These relations can be modeled as Set Graph Linear Associative would not be possible without (R1) and (R2). (R3) has been FigureMath 3. BigDAWG polystore database architecture. Analytic value stores [DeCandia et al 2007, Lakshman & Malik 2010, Theory Theory Algebra Algebra associative arrays by treating tuples as non-zero rows in an translators contextualize queries to specific databases. Data critical to the success of SQL databases that follow the George 2011]. In response to similar performance challenges, Consistency array. Operations in relational algebra are built as compositions translators cast data between databases. relational model. (R3) has enabled the successful coexistence theof relational standard operations database community on associative has arrays developed which a mirror new class their Volume of separate interfaces and languages for users and ofmatrix databases counterparts. (NewSQL) These such constructions as C-Store [Stonebraker provide insight 2005] into, VelocityMathematics is one of the most important differences implementers, with the confidence that neither would create a how relational algebra can be handled via array operations. Variety H-Store [Kallman 2008], SciDB [Balazinska 2009], VoltDB among SQL, NoSQL, and NewSQL databases (see Figure 4). fundamental mathematical contradiction for the other. [StonebrakerAs an example 2013] application,, and Graphulo the composition[Hutchison 2015 of two] to projection support Analytics The relational algebra found in SQL databases is based upon The relational model is based on balancing mathematical newoperations analytics is shown capabilities to also within be a projection, a database and. theThe projection SQL, Usability selection, union, and intersection of special sets called rigor with implementation practicality. Too much NoSQL,of a union and is NewSQL shown to be concepts equal to have the union also ofbeen the blended projections. in Figure 2. Focus areas of SQL, NoSQL, NewSQL, and Polystore relations. NoSQL is designed for analyzing sparse mathematical rigor burdens a database implementation with hybrid processing systems, such as Apache Pig [Olston 2008], Fig.databases 1. Focus. areas of SQL, NoSQL, NewSQL, and Polystore databases. Each I.INTRODUCTION relationships among data and relies on graph theory and graph unnecessary mathematics. Too little mathematical rigor makes Apache Spark [Zaharia 2010], and HaLoop [Bu 2010]. An class of database has distinct strengths and relies on a different mathematics. analysis.Polystores provideNewSQL a way data tobases unify theseuse matrices databases andand their linear mathematics. algebra to it is difficult to know if a database implementation will work. effectiveThe successmathematical of SQL, model NoSQL, that encompasses and NewSQL the concepts databases of is a reflection of their ability to provide significant function- look for patterns in numeric data. As with all good compromises, there have been advocates for Thisality material and is performancebased upon work benefits supported by for the specific National Science domains, Foundation such under Grant No. DMS-1312831. Any opinions, findings, and conclusions or improvement on both sides. As cited earlier, many new recommendations expressed in this material are those of the author(s) and do not necessarily reflectSQL the views of the NationaNoSQL l Science Foundation. NewSQL databases under the names of NoSQL and NewSQL differ from as financial transactions, internet search, data analysis, and, Set Operations Graph Operations Matrix Operations AT! v ATv the relational model to meet new performance and analysis increasingly, machine learning. Polystore databases seek to bob out edge in bob demands. Likewise, there is extensive mathematical work on provide a mechanism to allow applications to transparently vertex link vertex cited 001 alice cited bob carl modifications to the relational model to increase its alice à achieve the benefits of diverse databases while insulating 1 002 bob cited alice mathematical rigor [Imieliński 1984a, Imieliński 1984b, 003 alice cited carl cited applications from the details of these databases. Polystores carl Kanellakis 1989, Tsalenko 1992, Plotkin 1998, Priss 2006, van arXiv:1712.00802v1 [cs.DB] 3 Dec 2017 SELECT ∗ WHERE must support a wide range of databases with different iter- out vertex=alice alice Emden 2006, Litak 2014, Hutchison 2016]. One motivation faces. Among these interfaces are the standard Relational or Figure 4. Mathematics of breadth-first search for SQL, NoSQL, and for increasing the mathematical rigor [Kelly 2012] is to align NewSQLFig. 2. Breadth-first databases search. for SQL, NoSQL, and NewSQL databases in their relations with well-established Zermelo-Fraenkel Choice (ZFC) SQL (Structured Query Language) databases [1], [2] such as native data models. In each case, the operation being performed is finding the MySQL, PostgreSQL, and Oracle; key-value stores/NoSQL nearest neighbors of alice who are bob and carl. set theory [Zermelo 1908, Fraenkel 1922] that is the foundation The approach to developing an associative array model of for a number of branches of mathematics. databases such as Google BigTable [3], Apache Accumulo the above databases is as follows. First, the relevant aspects of [4], and MongoDB [5]; NewSQL databases such as C-Store relationsIntegrating are summarized the mathematics. Second of, thesethe sparse diverse matrix databases operations is an The emerging diversity of databases has initiated a dialogue [6], H-Store [7], SciDB [8], VoltDB [9], and Graphulo [10], thatimportant encompass enabler graph for algorithmspolystores as and it matrix allows mathematicsreasoning across are regarding the traditional relational model and the newer graph [11]. given.different Third databases., the associativeThe mathematical array model foundations that describes for these and matrix models. This dialogue is akin to the earlier declarative and procedural conversation that culminated in the NoSQL databases were developed to represent large sparse NoSQLdatabases and include NewSQL sets databases (SQL), graphsis described. (NoSQL), Fourth and, r matriceselations tables, contributing to the widespread adoption of NoSQL (NewSQL). LARA [25] is one branch of work that reduces the mathematics of sets (Relational Algebra) and matrices (Linear This material is based in part upon work supported by the NSF under Algebra) to a common basis of 3 operators, emphasizing their grant number DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do practical realization in data processing systems. This paper 2 not necessarily reflect the views of the National Science Foundation. focuses on associative arrays as a common approach for the mathematics of polystores by situating Relational Algebra onto in which f −1(a) is a non-empty finite set for each a ∈ A. the same foundations as Linear Algebra. The size of f −1(a) represents how many copies of a are in The associative array approach to bridging NoSQL and the multiset. Two sequences NewSQL databases has been demonstrated by the Dynamic f : I → A and g : J → B Distributed Dimensional Data Model (D4M) technology [26] which provides a linear algebraic interface to SQL, NoSQL, are said to define the same multiset if A = B and there exists and NewSQL databases [27]–[31]. The key object of D4M a bijection is the associative array, which generalizes the notion of a h : I → J matrix to allow for more general value sets and indexing sets. This more general structure makes associative arrays better such that equipped to deal with graphs and relations in a more direct g ◦ h = f fashion than matrices [32]–[35]. This definition of equality of multisets captures the fact that Relations form the basis of the mathematical foundation the specific indexing set I used does not matter, only the sizes for SQL databases [36]–[38]. In set theory, they are realized of f −1(a) (which is invariant under equality of multisets). as multisets of tuples. Associative arrays can be used to This makes sense of something like {a, b, a, c, e, a, b} as the realize multisets of tuples of values which support a notion of multiset “addition” and “multiplication”. Associative array analogues of traditional matrix operations can be defined, and this leads f : {1,..., 7} → {a, b, c, e} to the question of whether the gamut of relational algebra where f(1) = a, f(2) = b, f(3) = a, f(4) = c, f(5) = e, operations can be likewise realized in terms of the associative f(6) = a, f(7) = b, In this notation, {a, a, a, b, b, c, e} defines array operations. the same multiset. Our prior work [39] presented the SQL relational model A slight variation on this definition is to allow A to in terms of associative arrays and identified key mathematical contain non-elements, i.e. an a ∈ A such that f −1(a) is properties that are preserved within SQL. This work provides empty. Accepting this variation makes the equivalance with the rigorous mathematical definitions and proofs of some of associative arrays technically simpler. these properties. Specifically, SQL Relational Algebra deals In relational algebra, a relation is a multiset of tuples. These primarily with relations – multisets of tuples – and opera- tuples conform to a schema of n attributes, where n is the arity tions on and between those relations. These relations can be of the relation. For example, an arity-3 relation might contain modeled as associative arrays by treating tuples as non-zero the tuple (7, “Hayden”, 20). We assume all relations contain rows in an array. Operations in relational algebra can be built a primary key attribute; if not, then a primary key can be as compositions of standard operations on associative arrays appended to the relation as an additional attribute, as many which mirror their matrix counterparts. These constructions databases do in practice under the hood. If J refers to the provide insight into how relational algebra can be handled via relation’s primary key and V refers to (the cross product of) array operations. its other attributes’ domains, then we can define a relation as This paper gives some technical background of multisets and associative arrays to explain the motivation for identifying f : J → V relations and associative arrays, defines some major relational which maps the primary key value of a tuple to its other algebra operations in terms of standard associative array attributes’ values. For example, such a mapping might contain operations, and uses these definitions to prove fundamental the entry 7 → (“Hayden”, 20) Primary keys outside the relation properties of some of the relational algebra operations. map to all-null rows, as discussed in the next section. Note that while both multisets and tuples are of the form II.MATHEMATICAL PRELIMINARIESAND DEFINITIONS f : I → A Understanding relations in terms of associative arrays be- gins with the careful definition of the relevant mathematical they differ in that the specific set I is inconsequential in the properties of relations and associative arrays. definition of a multiset, while it is important in the case of tuples.

A. Relations and Multisets in Set Theory B. Associative Arrays Relations form the basis of the mathematical foundation for In practice, the values in a tuple can range from alphanu- SQL databases [36]–[38]. In set theory, they are realized as meric strings to real numbers to sets. Moreover, the kinds of multisets of tuples. operations defined on those values need not be the traditional One approach to defining multisets in set theory that addition and multiplication of real numbers. However, in order matches intuition closely is to define a multiset as a sequence to define analogues of the standard matrix operations, there need be some “addition” ⊕ and some “multiplication” ⊗, and f : I → A these should satisfy some minimum set of properties to ensure that these array-analogues have a minimum set of desireable Definition II.5 (Element-Wise Product). Suppose properties. A, B : K × K → Generalizing the notion of a matrix to allow for more 1 2 V general value sets equipped with more general operations are two associative arrays. Their array element-wise product produces the notion of an associative array. C = A ⊗ B : K1 × K2 → V Definition II.1 (Semiring). [40], [41] A semiring is a set V is defined by equipped with two binary operations ⊕ and ⊗ such that 1) ⊕ is associative and commutative and has an identity C(k1, k2) = (A ⊗ B)(k1, k2) = A(k1, k2) ⊗ B(k1, k2) element 0 ∈ , V Array element-wise product is associative, as well as com- 2) ⊗ is associative with an identity element 1 ∈ , V mutative if ⊗ is commutative. 3) ⊗ distributes over ⊕, and 4) 0 is an annihilator for ⊗. Definition II.6 (Element-Wise Identity). Given a row key set K and a column key set K , denote by 1 the associative All rings and fields are semirings. The set of natural num- 1 2 K1,K2 array K1 × K2 → V with bers N = {0, 1, 2,...} is a semiring under standard addition 1 and multiplication. The set of non-negative real numbers is a K1,K2 (k1, k2) = 1 semiring under standard addition and multiplication. The set of The element-wise identity 1 provides an identity element extended real numbers ∪ {−∞, ∞} with semiring addition K1,K2 R for array element-wise product when restricted to associative ⊕ = max and semiring multiplication ⊗ = min is a semiring arrays A : K1 × K2 → V. called the max-min algebra. R ∪ {−∞, ∞} with ⊕ = max and ⊗ = + is a semiring called the max-plus algebra. The Definition II.7 (Array Multiplication). Suppose set of alphanumeric strings ordered lexicographically along A : K × K → with a formal maximum ∞ is a semiring with ⊕ = min and 1 2 V ⊗ = concatenation. and The convention that null = 0 is used here. Note that if a B : K2 × K3 → V formal null is added to with the properties V are two associative arrays. Then their array product null ⊕ v = v ⊕ null = v C = AB = A ⊕.⊗ B : K1 × K3 → V null ⊗ v = v ⊗ null = null is defined by for every v ∈ V∪{null}, then V∪{null} would be a semiring M with a new additive identity null. Thus, nothing is lost by C(k1, k3) = A(k1, k2) ⊗ B(k2, k3) examining only the case where the convention null = 0 is k2∈K2 used. Array multiplication is associative, but in general need not be commutative even if ⊗ is commutative. An is a Definition II.2 (Associative Array). associative array For brevity, A ⊕.⊗ B is denoted AB, except when it map is important to be explicit about the operations being used A : K1 × K2 → V (particularly when they are not semiring ⊕ and ⊗). where is a semiring, such that A(k , k ) 6= 0 for only V 1 2 Definition II.8 (Array Identity). Given a row key set K1, a finitely-many pairs (k , k ). Elements of K are called row 1 2 1 column key set K2, and a partial function keys and elements of K2 are called column keys. f : K1 →| K2 Definition II.3 (Array Addition). Suppose meaning f is a function defined on a subset dom f ⊂ K1. A, B : K1 × K2 → V Denote by are two associative arrays. Their array addition IK1,K2,f the associative array K × K → with C = A ⊕ B : K × K → 1 2 V 1 2 V ( 1 if k1 ∈ dom f and k2 = f(k1) is defined by IK1,K2,f (k1, k2) = 0 otherwise C(k1, k2) = (A ⊕ B)(k1, k2) = A(k1, k2) ⊕ B(k1, k2) IK1,K2 means IK1,K2,f where Array addition is both associative and commutative. dom f = K1 ∩ K2 Definition II.4 (Zero Array). 0 is the zero array, in which and f acts as the identity on K ∩ K . If K = K = K, every entry is 0. 1 2 1 2 then

The zero array provides an identity element for array addition. IK1,K2 = IK Finally, if K1,K2, f are understood, then write Deﬁnition II.12 (Array Kronecker Product). Suppose

IK1,K2,f = I A : K1 × K2 → V

The array identity IK1,K2,f does not in general act as an and identity element for array multiplication. IK , however, is an B : K3 × K4 → V identity for array multiplication when restricted to associative are associative arrays. Then their array Kronecker product arrays A : K × K → V. When performing operations between associative arrays C = A ⊗ C whose row and column key sets are not “compatible” (i.e. satisfy the hypotheses of the above definitions), this can be is the associative array solved by zero padding. (K1 × K3) × (K2 × K4) → V Definition II.9 (Zero Padding). If defined by A : K1 × K2 → V C((k1, k3), (k2, k4)) = A(k1, k2) ⊗ B(k3, k4) is an associative array and K0 ,K0 are arbitrary sets, then 1 2 The array Kronecker product allows associative arrays opera- 0 0 pad 0 0 (A):(K1 ∪ K ) × (K2 ∪ K ) → tions to handle dimensions higher than 2 dimensions. K1×K2 1 2 V is defined by III.RELATIONS AS ASSOCIATIVE ARRAYS ( A(i, j) if i ∈ K1, j ∈ K2 Motivated by the definition of a relation as a multiset padK0 ×K0 (A)(i, j) = (sequence) of tuples, define a relation to be an associative 1 2 0 otherwise array with the intuition that the rows of an associative array are By padding arrays prior to carrying out an operation, opera- the relevant tuples which are indexed by the column indices. tions can be defined in general. Explicitly, given The row indices are only meant to differentiate the rows. In practice, the sequence number identifying the distinct rows in A : K1 × K2 → V an SQL table serve a similar purpose. and Definition III.1 (Row and Row Equality). If B : K3 × K4 → V A : K1 × K2 → V then is an associative array with i ∈ K1, then the i-th row is the A ⊕ B = pad (A) (K1∪K3)×(K2∪K4) tuple

⊕ pad(K1∪K3)×(K2∪K4)(B) A(i, :) : K2 → V A ⊗ B = pad (A) (K1∪K3)×(K2∪K4) sending j to A(i, j). Such a row is non-zero if it not identically

⊗ pad(K1∪K3)×(K2∪K4)(B) zero. If B : K3 × K4 → V A ⊕.⊗ B = padK1×(K2∪K3)(A) 0 0 ⊕.⊗ pad(K2∪K3)×K4 (B) and i ∈ K3, then the i-th row of A is equal to the i -th row of B if A(i, j) and B(i0, j) are both defined and equal whenever Likewise, equality of two arrays is done up to zero padding: one of them is non-zero. A−1(i, :) denotes the subset of I A = B if and only if A containing the indices of rows in A which are equal to A(i, :). pad (A) = pad (B) (K1∪K3)×(K2∪K4) (K1∪K3)×(K2∪K4) Definition III.2 (Weak Equivalence). The associative arrays Definition II.10 (Row Support). For an associative array A : K1 × K2 → V A : K1 × K2 → V and the row support IA is the set of row keys associated with B : K3 × K4 → V non-zero rows of A. are weakly equivalent Definition II.11 (Transpose). Suppose A ∼ B

A : K1 × K2 → V if for each non-zero row of A there is an equal row in B, and vice-a-versa. is an associative array. Then its transpose A| is the associative array K2 × K1 → V deﬁned by In terms of multisets, two arrays are weakly equivalent if their underlying sets of tuples are equal. A|(k2, k1) = A(k1, k2) Lemma III.3. Given associative arrays Lemma III.7. A ≈ B if and only if there exists a bijection

A : K1 × K2 → V f : IA → IB and such that B : K3 × K4 → V A = IIA,IB,f B Deﬁne the array The array P constructed in III.3 and III.6 can be computed P : K1 × K3 → V using the following array operation. by ( Lemma III.8. For associative arrays 1 if A(k1, :) = B(k3, :) P(k1, k3) = 0 otherwise A : K1 × K2 → V Then the following are equivalent and 1) A ∼ B. B : K3 × K4 → V 2) If A(k1, :) is a non-zero row, so is the row P(k1, :), and deﬁne the array if B(k3, :) is a non-zero row, so is the column P(:, k3). P : K1 × K3 → V Lemma III.4. A ∼ B if and only if there exist functions by f : I → I ( A B 1 if A(k1, :) = B(k3, :) P(k1, k3) = and 0 otherwise g : I → I B A Then | such that P = (IIA A) ∧.δ (IIB B)

A = IIA,IB,f B and B = IIB,IA,g A where ( ( Definition III.5 (Strong Equivalence). Two associative arrays 1 if v, w 6= 0 1 if v = w v ∧ w = δ(v, w) = 0 otherwise 0 otherwise A : K1 × K2 → V and IV. RELATIONAL ALGEBRA OPERATIONS B : K3 × K4 → V There are many operations defined on relations. This is a re- are strongly equivalent sult of of Codd’s Theorem [42], which states that relational algebra and relational calculus queries have the same expressive A ≈ B power. In other words, to carry out a wide range of relational if for each non-zero row of A, there are exactly as many copies queries, it is enough to implement the basic relational algebra of that row in B as in A, and vice-a-versa. operations. By implementing these operations with associative algebra operations, this reduces SQL queries to linear algebra. In terms of multisets, two arrays are strongly equivalent if they Equipped with the necessary array operations and notions are equal as multisets. of equivalence for arrays viewed as relations, it is possible to Lemma III.6. For associative arrays define several of the standard operations in relational algebra in terms of array operations. A : K1 × K2 → V Definition IV.1 (Project). Suppose J is a set of column keys. and Then define the projection operation ΠJ by B : K3 × K4 → V ΠJ (A) = A IJ define the array which removes all columns not in J. In SQL syntax, it is P : K1 × K3 → V by SELECT J(1),...,J(n) FROM A ( 1 if A(k1, :) = B(k3, :) P(k1, k3) = Definition IV.2 (Rename). Suppose J1,J2 are two sets of 0 otherwise column keys with a bijection f : J1 → J2. Then define the

Then the following are equivalent: rename operation ρJ1/J2,f by 1) A ≈ B ρJ1/J2,f (A) = A IJ1,J2,f 2) A ∼ B and if P(k1, k3) 6= 0, then the number of non- zero entries of the row P(k1, :) is equal to the number where of non-zero entries of the column P(:,K3). A : K1 × K2 → V This operation selects the columns to be renamed, and then Then Subn (A, B) can be taken to be the first n elements of renames them according to f. In SQL syntax, it is (A × {1}) ∪ (B × {2}) SELECT J1(1),...,J1(n) AS J2(1),...,J2(n) FROM A with respect to the above ordering. The function f can act trivially on some columns. This allows Definition IV.4 (Intersection). The intersection operation is the rename operation to keep fixed the columns that aren’t defined by being renamed to something new. A ∩ B = IS (A ∪ B) Definition IV.3 (Union). For two arrays where [ A : K × K → S = Sub A−1(i , :), B−1(i , :) 1 2 V min(mi1 ,ni2 ) 1 2 i1∈IA and i2∈IB A(i1,:)=B(i2,:) B : K3 × K4 → V where their union is −1 mi1 = |A (i1, :)|

A ∪ B = (IIA×{1},IA A) ⊕ (IIB×{2},IB B) and n = |B−1(i , :)| This operation effectively adds the counts of rows together. In i2 2 SQL syntax, it is This selects from A ∪ B the minimum of the count of each row from A and B. In SQL syntax, it is SELECT ∗ FROM A UNION ALL SELECT ∗ FROM B SELECT ∗ FROM A INTERSECT SELECT ∗ FROM B A. Operations Involving Choices of Representative Rows Definition IV.5 (Multiset Difference). The multiset difference In the definition of a multiset intersection operation, the operation is defined by number of times a row appears in A ∩ B should be the minimum of the number of times that row appears in both A \ B = IS A A and B. Similarly, in the definition of a multiset difference where operation, the number of times a row appears in A\B should   be the number of times it appears in A minus the number of   times it appears in B (showing up zero times if this difference  [ −1  S = IA \ π1  Subpi ,i A (i1, :), ∅  is negative). This suggests a difficult arising due to needed  1 2   i1∈IA  to select the relevant number of rows, which arises due to i2∈IB A(i ,:)=B(i ,:) explictly having these rows indexed in a way that may not 1 2 offer an unambiguous way of making this choice. Assume and there is a function −1 −1 −1 pi1,i2 = |A (i1, :)| − max 0, |A (i1, :)| − |B (i2, :)| Sub (A, B) n This removes from A as many copies of a row as are present which assigns to any sets of row keys A and B and a non- in B (up to all of the copies of that row in A). In SQL syntax, negative integer it is 0 ≤ n ≤ |A| + |B| SELECT ∗ FROM A EXCEPT SELECT ∗ FROM B a fixed subset of The use of π1, projection onto the first coordinate, in the above (A × {1}) ∪ (B × {2}) expression of S, is intended to correct for the fact that the set −1 Subp A (i1, :), ∅ of size n. i1,i2 If there is a fixed, explicit total ordering of the row keys, is technically a subset of IA × {1}. Taking π1 makes S a then subset of IA instead. (A × {1}) ∪ (B × {2}) The notion of Subn (A, B) allows for the notion of a set difference of A and B to be defined as well, being defined as has a canonical ordering coming from the ordering of the rows; in Definition IV.5 except with pi ,i = 0. if 1 2 The notion of Subn (A, B) also allows for all duplicate A = {a1, . . . , an} elements of an associative array to be removed, effectively allowing for set semantics, by taking with a1 < ··· < an and Set(A) = ISA B = {b1, . . . , bm} where with b1 < ··· < bm then " # [ −1 S = π1 Sub1 A (i, :), ∅ (a1, 1) < ··· < (an, 1) < (b1, 2) < ··· < (bm, 2) i∈IA B. Operations Involving Functions of Certain Entries in a (This “undefined” value only shows up to be removed upon

Row use of the selection σθ(J1,J2), so there is no need not worry Definition IV.6. Suppose J is a set of column keys. If A is about the effect of it on the algebra.) an array, then the J-column indexed entries of a row A(i, :) Definition IV.9 (Extended Projection). Suppose ϕ is a function are the entries A(i, j) where j ∈ J. of the J-column indexed entries of a row and j0 is a column 0 Definition IV.7 (Select). Suppose ϕ is a boolean-valued func- key. Define the extended projection (determined by ϕ and j ) by tion (so taking values in {0, 1} ⊂ V) of the J-column indexed entries of a row, whose values we denote as ϕ(A(k1,J)). Then j0 Πϕ(J)(A) = ρ{1},{j0} ϕ(ΠJ (A)(:,J)) we define the select operation (determined by ϕ) by j0 |   σϕ(J)(A) = [ϕ(A(:,J)) ϕ(A(:,J)) ] ⊗ IIA A i1 ϕ(A(i1,J)) .  .  where ϕ(A(:,J)) is the column vector = .  .  i ϕ(A(i ,J)) 1 n n   i1 ϕ(A(i1,J)) and IA = {i1, . . . , in}. This replaces the J-indexed columns . . with a single column j0 whose entries are computed by ϕ. In ϕ(A(:,J)) = .  .    SQL syntax, it is in ϕ(A(in,J)) 0 This operation selects those rows of A which evaluate to true SELECT ϕ(A·J(1),...,J(n)) AS j FROM A under ϕ. In SQL syntax, it is Taking liberties with the notation, it is typical to write

SELECT ∗ FROM A where ϕ(A·J(1),...,J(n)) ϕ(ΠJ (A)(:,J)) Definition IV.8 (Theta Join). Suppose θ is a boolean-valued as function on the J1-column-indexed entries of a first row and ϕ(ΠJ (A)(:,J)) = ΠJ (A) ϕ.⊗ 1J,{j0} the J2-column-indexed entries of a second row. Then define the theta join operation (determined by θ) by since the entries are computed as 1 A ./θ(J1,J2) B = σθ(J1,J2) [A ⊗ K3,{1}] ϕ(A(i, j1),..., A(i, jm)) = ϕ A(i, j1)⊗1,..., A(i, jm)⊗1 1 which is remarkably close to ⊕ ρ{2}×K4,K4×{2},f [ K1,{2} ⊗ B] M where f : (2, k) 7→ (k, 2). (A(i, j) ⊗ 1) This operation selects pairs of rows from A and B which j∈J evaluate to true under θ. In SQL syntax, it is In fact, if ϕ is an iterated (commutative, associative) binary operation ∗, then this is the same as SELECT ∗ FROM A, B WHERE

Π (A) ∗.⊗ 1 0 θ(A·J1(1),...,J1(n), B·J2(1),...,J2(n)) J J,{j } The theta join operation creates new column indices for the Definition IV.10 (Aggregation). Suppose j and j0 are column resulting rows by “tagging” them with 1 and 2; this ensures keys and f is a function of finitely-supported tuples of elements that there is no conflict between them when performing the in V (i.e. all but finitely-many elements are 0) taking values array addition. in V. Define the aggregation (determined by j, j0 and f) by If needed, it can be assumed that whenever a theta join G 0 (A) = P f.⊗ A operation is performed, the non-zero column indices of the first j f(j ) and second array are distinct, and the column indices K2 ×{1} 1 and K ×{2} can be identified with the corresponding column  0 0  4 i1 f (P(i1, i1) ⊗ A(i1, j ),..., P(i1, in) ⊗ A(in, j )) indices of K and K , respectively. 2 4 = .  .  Dealing with the case where the non-zero column indices of .  .  0 0 the two arrays are not necessarily distinct, it can be required in f (P(in, i1) ⊗ A(i1, j ),..., P(in, in) ⊗ A(in, j )) that θ evaluates to true exactly when the values at those indices where agree and are defined. To achieve this, the array addition ⊕ | can be replaced with a new operation ⊕= for which P = [IIA ⊕.⊗ A(:, j)] ⊕.δ [IIA ⊕.⊗ A(:, j)] v if w = 0 This operation applies the function f (the aggregate function)  0 w if v = 0 on all the values of column j in A that share a common value v ⊕= w = in column j. In SQL syntax, it is v if v = w  undefined otherwise SELECT fj0 FROM A GROUP BY j Unlike in the cases of selection, theta join, and extended 8) If ϕ ≡ 0, then σϕ sends every array to 0. projection, where the domains of the relevant functions (ϕ in 9) If θ ≡ 0, then A ./θ(J,J 0) B = 0. 0 the case of selection and extended projection, θ in the case of 10) If J = {j } and ϕ(v) = v, then j0 Πϕ(J) acts as the theta join) are explicitly given, the domain of f is not explicitly identity map. If ϕ ≡ 0, then j0 Πϕ(J) sends every array given. (We’ve said it is a function of finitely-supported tuples to 0. of elements in V, but without restricting the possible indices, 11) If J1,J2 are sets of column indices, then there are too many of these to even form a set.) Π ◦ Π = Π In all practical considerations, the set of possible column J1 J2 J1∩J2 keys can be assumed to be finite; in this case, consider all where ◦ is function composition. finitely-supported tuples of elements in V indexed by those 12) ΠJ preserves ∪, ∩, \. possible column keys. 13) If J1,J2,J3,J4 are sets of column indices and Another practical consideration is that f is symmetric, in that permuting those indexing column keys does not affect the f : J1 → J2 result. In this case, take f to simply be a function of any finite and multiset of non-zero values in , passing on the set-theoretic V g : J → J difficulties to that of multisets. 3 4 If the values A(i, j) and A(i0, j) are equal and non-zero, are bijections, then it need not be the case that 0 then the aggregate jGf(j0)(A) will also have its i-th and i -th ρ ◦ ρ = ρ ◦ ρ entries equal. Moreover, the row keys of the aggregate are the J1/J2,f J3/J4,g J3/J4,g J1/J2,f same as those of A (at least, the non-zero rows). even up to strong or weak equivalence, where ◦ is By additionally (array) multiplying | on the left, IIA,V,f function composition. where f(i) = A(i, j), every row of the aggregate represents 14) ρJ1/J2,f preserves ∪, ∩, \. unique information with row keys equal to the value A(i, j) 15) Both ∪ and ∩ are commutative and associative (up to that was used to select the values being aggregated. both weak and strong equivalence, and up to a canonical 00 Finally, to use a column key j in place of the default 1, renaming of column keys). (array) multiply I{1},{j00} on the right. 16) Both ∪ and ∩ distribute over one-another (up to both weak and strong equivalence, and up to a canonical V. PROPERTIES OF RELATIONAL ALGEBRA OPERATIONS renaming of column keys). Since relations are defined by associative arrays with either 17) \ is neither commutative nor associative (even up to strong or weak equivalence, to ensure that these operations are strong or weak equivalence). defined on relations, they must be invariant under strong and weak equivalence. The proofs of all of the above proposition is beyond the space limitations of this work. However, they are straightfor- Proposition V.1. Each operation is invariant under strong ward given the definitions. As an example proof using array equivalence. Each operation (except for multiset difference) is algebra, consider the proof that invariant under weak equivalence. ΠJ ◦ ΠJ = ΠJ ∩J The fact that multiset difference (Definition IV.5) is not 1 2 1 2 invariant under weak equivalence is not a random occurrence – and that this is due to the fact that the definition of multiset difference ΠJ (A ∪ B) = ΠJ (A) ∪ ΠJ (B) seeks to remove only a certain number of instances of a row. If, instead, every instance of a row was removed, then this Proof. new operation would be invariant under both strong and weak By Definition IV.1, equivalence. ΠJ1 (ΠJ2 (A)) = (A IJ2 ) IJ1 Many of the desirable properties of each of the relational = (A ⊕.⊗ ) ⊕.⊗ algebra operations can be proven using array algebra with the IJ2 IJ1 above definitions of those operations. = A ⊕.⊗ (IJ2 ⊕.⊗ IJ1 ) Proposition V.2. Thus, it suffices to show that

1) If there is a ﬁxed set K of column keys, then ΠK acts IJ2 ⊕.⊗ IJ1 = IJ1∩J2 as the identity map. 2) Π∅ sends every array to the zero array 0. By Deﬁnition II.7, 3) ρJ/J,id acts as the identity map. M J ( ⊕.⊗ )(i, j) = (i, k) ⊗ (k, j) 4) 0 is an identity under ∪. IJ2 IJ1 IJ2 IJ1 5) 0 is an annihilator under ∩. k∈J1∪J2 6) 0 is a right identity and left annihilator under \. The term

7) If ϕ ≡ 1, then σϕ acts as the identity map. IJ2 (i, k) ⊗ IJ1 (k, j) is 1 if and only if i = k ∈ J2 and k = j ∈ J1, and 0 ACKNOWLEDGMENT otherwise. This only occurs when i = k = j ∈ J1 ∩ J2, and The authors wish to acknowledge the following individuals this contributes the only possible non-zero term of the sum. for their contributions: Michael Stonebraker, Sam Madden, This shows that Bill Howe, David Maier, Alan Edelman, Dave Martinez, ( 1 if i = j ∈ J1 ∩ J2 Sterling Foster, Paul Burkhardt, Victor Roytburd, Bill Arcand, (IJ2 ⊕.⊗ IJ1 )(i, j) = 0 otherwise Bill Bergeron, David Bestor, Chansup Byun, Mike Houle, Matt Hubbell, Mike Jones, Anna Klein, Pete Michaleas, Lauren = IJ1∩J2 (i, j) Milechin, Julie Mullen, Andy Prout, Tony Rosa, Sid Samsi, For preservation of union, Definition IV.3 gives and Chuck Yee. REFERENCES ΠJ (A ∪ B) = ΠJ (IIA×{1},IA A) ⊕ (IIB×{2},IB B) [1] E. F. Codd, “A relational model of data for large shared data banks,” = (IIA×{1},IA A) ⊕ (IIB×{2},IB B) IJ Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970. = (( A) ) ⊕ (( B) ) [2] M. Stonebraker, G. Held, E. Wong, and P. Kreps, “The design and IIA×{1},IA IJ IIB×{2},IB IJ implementation of ingres,” ACM Transactions on Database Systems (TODS), vol. 1, no. 3, pp. 189–222, 1976. = (IIA×{1},IA (A IJ )) ⊕ (IIB×{2},IB (B IJ )) [3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Bur- = IIA×{1},IA ΠJ (A) ⊕ IIB×{2},IB ΠJ (B) rows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Now, this is nearly in the form ΠJ (A) ∪ ΠJ (B), with the Systems (TOCS), vol. 26, no. 2, p. 4, 2008. only issue being that wherever a IA or IB show up, there [4] A. Cordova, B. Rinaldi, and M. Wall, Accumulo: Application Devel- should instead be I or I , respectively. However, opment, Table Design, and Best Practices. ” O’Reilly Media, Inc.”, ΠJ (A) ΠJ (B) 2015. this replacement can be made due to the fact that taking a [5] K. Chodorow, MongoDB: The Definitive Guide: Powerful and Scalable projection can only make IA (resp. IB) smaller in size. Indeed, Data Storage. ” O’Reilly Media, Inc.”, 2013. [6] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, f : J →| J J ⊂ J I ⊂ if 1 2 is a partial injection, 3 1, and C M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil et al., “C-store: f[J3] = {f(j) | j ∈ J3}, then a column-oriented dbms,” in Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 2005, pp. C = C 553–564. IJ1,J2,f IJ3,J2,f|J3 [7] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. Jones, S. Madden, M. Stonebraker, Y. Zhang et al., “H-store: a high- performance, distributed main memory transaction processing system,” One benefit of proving these properties of the relational Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1496–1499, algebra operations as defined via the array algebra operations 2008. [8] M. Balazinska, J. Becla, D. Heath, D. Maier, M. Stonebraker, and is that it gives a better understanding of how these operations S. Zdonik, “A demonstration of scidb: A science-oriented dbms,” Cell, work without resorting to equality up to strong or weak vol. 1, no. a2, 2009. equivalence; in many cases, the relation is outright equality, [9] M. Stonebraker and A. Weisberg, “The voltdb main memory dbms.” IEEE Data Eng. Bull., vol. 36, no. 2, pp. 21–27, 2013. or at least equality up to strong or weak equivalence where [10] D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs, “Graphulo the relabeling is in some sense “canonical”. implementation of server-side sparse matrix multiply in the accu- Another benefit is in performance [39]; thanks to the mulo database,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE. IEEE, 2015, pp. 1–7. fact that associativity, commutativity, and distributivity of [11] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and ⊕, ⊗, ⊕.⊗ lead to performance increases since the relational J. Kepner, “Graphulo: Linear algebra graph kernels for nosql databases,” algebra operations can be built up by the associative algebra in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, 2015, pp. 822–830. operations. [12] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: VI.CONCLUSION amazon’s highly available key-value store,” ACM SIGOPS operating SQL, NoSQL, and NewSQL databases are specialized to systems review, vol. 41, no. 6, pp. 205–220, 2007. [13] A. Lakshman and P. Malik, “Cassandra: a decentralized structured deal with certain domains, and all three can be useful in a storage system,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, single context. For this reason, polystore databases have been pp. 35–40, 2010. developed to bridge these three concepts. [14] L. George, HBase: the definitive guide: random access to your planet- size data. ” O’Reilly Media, Inc.”, 2011. Associative arrays provide a mathematical framework [15] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig through which the mathematical cores of SQL, NoSQL, and latin: a not-so-foreign language for data processing,” in Proceedings of NewSQL can be reduced, allowing for polystore databases like the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1099–1110. BigDAWG to translate between the three data types inherent to [16] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, these databases – sets (SQL), graphs (NoSQL), and matrices “Spark: Cluster computing with working sets.” HotCloud, vol. 10, no. (NewSQL). 10-10, p. 95, 2010. [17] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “Haloop: Efficient Future work will focus on exploring additional properties iterative data processing on large clusters,” Proceedings of the VLDB that the associative array perspective provides with regards Endowment, vol. 3, no. 1-2, pp. 285–296, 2010. to relational algebra, providing analysis of optimizations, and [18] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik, “The the potential application of quantifying uncertainly in database bigdawg polystore system,” ACM Sigmod Record, vol. 44, no. 2, pp. queries. 11–16, 2015. [19] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, newsql databases,” in High Performance Extreme Computing Conference V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska et al., “A (HPEC), 2016 IEEE. IEEE, 2016, pp. 1–9. demonstration of the bigdawg polystore system,” Proceedings of the [40] M. Gondran and M. Minoux, “Dio¨ıds and semirings: Links to fuzzy sets VLDB Endowment, vol. 8, no. 12, pp. 1908–1911, 2015. and other applications,” Fuzzy Sets and Systems, vol. 158, no. 12, pp. [20] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, 1273–1294, 2007. S. Madden, T. Mattson, and M. Stonebraker, “The bigdawg polystore [41] J. S. Golan, Semirings and their Applications. Springer Science & system and architecture,” in High Performance Extreme Computing Business Media, 2013. Conference (HPEC), 2016 IEEE. IEEE, 2016, pp. 1–6. [42] E. F. Codd, Relational completeness of data base sublanguages. IBM [21] V. Gadepally, K. OBrien, A. Dziedzic, A. Elmore, J. Kepner, S. Madden, Corporation, 1972. T. Mattson, J. Rogers, Z. She, and M. Stonebraker, “Version 0.1 of the bigdawg polystore system,” in High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 2017. [22] K. OBrien, V. Gadepally, J. Duggan, A. Dziedzic, A. Elmore, J. Kepner, S. Madden, T. Mattson, Z. She, and M. Stonebraker, “Bigdawg polystore release and demonstration,” in High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 2017. [23] J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu, “The Myria big data management and analytics system and cloud service,” in Conference on Innovative Data Systems Research (CIDR), 1 2017. [Online]. Available: https://homes.cs.washington.edu/∼magda/papers/wang-cidr17.pdf [24] M. Stonebraker and U. Cetintemel, “” one size fits all”: an idea whose time has come and gone,” in Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005, pp. 2– 11. [25] D. Hutchison, B. Howe, and D. Suciu, “LaraDB: A minimalist kernel for linear and relational algebra computation,” in SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR). ACM, 5 2017. [26] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz et al., “Dynamic distributed dimensional data model (d4m) database and computation system,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE Inter- national Conference on. IEEE, 2012, pp. 5349–5352. [27] C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, J. Kepner, A. McCabe, P. Michaleas, J. Mullen, D. O’Gwynn et al., “Driving big data with big compute,” in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on. IEEE, 2012, pp. 1–6. [28] J. Kepner, C. Anderson, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, D. O’Gwynn et al., “D4m 2.0 schema: A general purpose high performance schema for the accumulo database,” in High Performance Extreme Computing Conference (HPEC), 2013 IEEE. IEEE, 2013, pp. 1–6. [29] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen et al., “D4m: Bringing associative arrays to database engines,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE. IEEE, 2015, pp. 1–6. [30] A. Chen, A. Edelman, J. Kepner, V. Gadepally, and D. Hutchison, “Julia implementation of the dynamic distributed dimensional data model,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 2016, pp. 1–7. [31] L. Milechin, V. Gadepally, S. Samsi, J. Kepner, A. Chen, and D. Hutchi- son, “D4m 3.0: Extended database and language capabilities,” in High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 2017. [32] J. Kepner and J. Chaidez, “The abstract algebra of big data,” in Union College Mathematics Conference, 2013. [33] ——, “The abstract algebra of big data and associative arrays,” in SIAM Meeting on Discrete Math, 2014. [34] H. Jananthan, K. Dibert, and J. Kepner, “Constructing adjacency arrays from incidence arrays,” in IPDPS GABB Workshop, 2017 IEEE. IEEE, 2017. [35] J. Kepner and H. Jananthan, Mathematics of Big Data. MIT Press, 2018. [36] D. Maier, Theory of relational databases. Computer Science Pr, 1983. [37] E. F. Codd, The relational model for database management: version 2. Addison-Wesley Longman Publishing Co., Inc., 1990. [38] S. Abiteboul, R. Hull, and V. Vianu, Eds., Foundations of Databases: The Logical Level, 1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995. [39] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther, “Associative array model of sql, nosql, and