Associative Array Model of SQL, Nosql, and Newsql Databases
Total Page:16
File Type:pdf, Size:1020Kb
Associative Array Model of SQL, NoSQL, and NewSQL Databases Jeremy Kepner1,2,3, Vijay Gadepally1,2, Dylan Hutchison4, Hayden Jananthan3,5, Timothy Mattson6, Siddharth Samsi1, Albert Reuther1 1MIT Lincoln Laboratory, 2MIT Computer Science & AI Laboratory, 3MIT Mathematics Department, 4University of Washington Computer Science Department, 5Vanderbilt University Mathematics Department, 6Intel Corporation Abstract—The success of SQL, NoSQL, and NewSQL databases SQL, NoSQL, and NewSQL would enable their is a reflection of their ability to provide significant functionality interoperability. Such a mathematical model is the primary and performance benefits for specific domains, such as financial goal of this paper. transactions, internet search, and data analysis. The BigDAWG polystore seeks to provide a mechanism to allow applications to transparently achieve the benefits of diverse databases while SQL Era NoSQL Era NewSQL Era Polystore Era insulating applications from the details of these databases. NoSQL Associative arrays provide a common approach to the Relational Model NewSQL mathematics found in different databases: sets (SQL), graphs [Codd 1970] [Cattell 2010] Polystore SQL (NoSQL), and matrices (NewSQL). This work presents the SQL Google BigTable relational model in terms of associative arrays and identifies the [Chang 2006] [Elmore 2015] NewSQL BigDAWG key mathematical properties that are preserved within SQL. These properties include associativity, commutativity, Figure 1. Evolution of SQL, NoSQL, NewSQL, and polystore distributivity, identities, annihilators, and inverses. Performance databases. Each class of database delivered new mathematics, measurements on distributivity and associativity show the impact functionality, and performance focused on new application areas. these properties can have on associative array operations. These results demonstrate that associative arrays could provide a SQL, NoSQL, and NewSQL databases are designed for mathematical model for polystores to optimize the exchange of specific applications, have distinct data models, and rely on data and execution queries. different underlying mathematics (see Figure 2). Because of their differences, each database has unique strengths that are Keywords-Associative Array Algebra; SQL; NoSQL; NewSQL; well suited for particular workloads. It is now recognized that Set Theory; Graph Theory; Matrices; Linear Algebra special-purpose databases can be 100x faster for a particular application than a general-purpose database [Kepner 2014]. In I. INTRODUCTION addition, the availability of high performance data analysis Relational or SQL (Structured Query Language) databases platforms, such as the MIT SuperCloud [Reuther 2013, Prout [Codd 1970, Stonebraker 1976] such as PostgreSQL, MySQL, 2015], allows high performance databases to share the same and Oracle have been the de facto interface to databases since hardware platform without sacrificing performance. the 1980s (see Figure 1) and are the bedrock of electronic transactions around the world. More recently, key-value stores (NoSQL databases) such as Google BigTable [Chang 2008], SQLSQL NoSQLNoSQL NewSQLNewSQL PolystoreFuture Apache Accumulo [Wall 2015], and MongoDB [Chodorow Example PostgreSQL Accumulo SciDB BigDAWG 2013] have been developed for representing large sparse tables Application Transactions Search Analysis All Relational Key-Value Sparse Associative to aid in the analysis of data for Internet search. As a result, the Data Model Tables Pairs Matrices Arrays majority of the data on the Internet is now analyzed using key- Set Graph Linear Associative Math value stores [DeCandia et al 2007, Lakshman & Malik 2010, Theory Theory Algebra Algebra George 2011]. In response to similar performance challenges, Consistency the relational database community has developed a new class Volume of databases (NewSQL) such as C-Store [Stonebraker 2005], Velocity H-Store [Kallman 2008], SciDB [Balazinska 2009], VoltDB Variety [Stonebraker 2013], and Graphulo [Hutchison 2015] to support Analytics new analytics capabilities within a database. The SQL, Usability NoSQL, and NewSQL concepts have also been blended in Figure 2. Focus areas of SQL, NoSQL, NewSQL, and Polystore hybrid processing systems, such as Apache Pig [Olston 2008], databases. Apache Spark [Zaharia 2010], and HaLoop [Bu 2010]. An effective mathematical model that encompasses the concepts of This material is based upon work supported by the National Science Foundation under Grant No. DMS-1312831. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. 1 The recognition of “one size does not fit all” [Stonebraker and their corresponding operations are defined in terms of & Çetintemel 2005] has led to the need for polystore databases, associative arrays. Fifth, the mathematical properties required such as BigDAWG [Duggan 2015, Elmore 2015], that can by graph algorithms and matrix mathematics are confirmed for contextualize queries and cast data between multiple databases relational operations. Finally, performance results illustrating so that a user can employ the best database for a particular task the impact of these properties are presented and discussed. (see Figure 3). To achieve this goal, polystore databases need to bridge SQL, NoSQL, and NewSQL databases. The II. RELATIONS Dynamic Distributed Dimensional Data Model (D4M) The relational model, based on set theory, is a key technology [Kepner 2012] was developed to provide a linear mathematical foundation for SQL databases. The relational algebraic interface to graphs stored in NoSQL databases [Byun model effectively consists of relational algebra, relational 2012, Kepner 2013]. Subsequently, D4M has been calculus, and the structured query language (SQL) that balance successfully used with both SQL [Wu 2014, Gadepally 2015] the theoretical, implementation, and systems design aspects of and NewSQL [Samsi 2016] databases. The effectiveness of databases. The relational model is well covered in the D4M to seamlessly interact with these diverse databases rests literature [Maier 1983, Codd 1990, Abiteboul 1995]; only the on its associative array algebra [Kepner & Chaidez 2013, most relevant aspects of the model are reviewed here. Some of Kepner & Chaidez 2014, Kepner & Jansen 2016] that provides the more significant mathematical contributions of the a mathematics that spans sets, graphs, and matrices. The relational model to databases include ability of D4M (and Myria [Halperin 2014]) to bridge multiple databases has laid the foundation for the polystore database (R1) Relations: a mathematical definition of database tables concept. sufficient for their representation without constraining their implementation; Visualizations Applications (R2) Query semantics: a mathematical definition of operations on relations sufficient for proving the correctness of BigDAWG Polystore Common Interface database queries; (R3) Proof of the equivalence of declarative and procedural Analytic Analytic Analytic syntaxes over the above definitions that has enabled the Translator Translator Translator use of declarative semantics for database users and procedural semantics for database builders [Codd 1972]. Data Data Translator Translator SQLSQL NoSQLNoSQL NewSQLNewSQL Of these results, (R3) has been enormously important, but Figure 3. BigDAWG polystore database architecture. Analytic would not be possible without (R1) and (R2). (R3) has been translators contextualize queries to specific databases. Data critical to the success of SQL databases that follow the translators cast data between databases. relational model. (R3) has enabled the successful coexistence of separate interfaces and languages for users and Mathematics is one of the most important differences implementers, with the confidence that neither would create a among SQL, NoSQL, and NewSQL databases (see Figure 4). fundamental mathematical contradiction for the other. The relational algebra found in SQL databases is based upon The relational model is based on balancing mathematical selection, union, and intersection of special sets called rigor with implementation practicality. Too much relations. NoSQL is designed for analyzing sparse mathematical rigor burdens a database implementation with relationships among data and relies on graph theory and graph unnecessary mathematics. Too little mathematical rigor makes analysis. NewSQL databases use matrices and linear algebra to it is difficult to know if a database implementation will work. look for patterns in numeric data. As with all good compromises, there have been advocates for improvement on both sides. As cited earlier, many new SQL NoSQL NewSQL databases under the names of NoSQL and NewSQL differ from Set Operations Graph Operations Matrix Operations AT! v ATv the relational model to meet new performance and analysis bob out edge in bob vertex link vertex cited demands. Likewise, there is extensive mathematical work on 001 alice cited bob carl modifications to the relational model to increase its alice à 002 bob cited alice mathematical rigor [Imieliński 1984a, Imieliński 1984b, 003 alice cited carl cited SELECT ∗ WHERE carl Kanellakis 1989, Tsalenko 1992, Plotkin 1998, Priss 2006, van out vertex=alice alice Emden 2006, Litak 2014, Hutchison 2016]. One motivation Figure 4. Mathematics of breadth-first search for SQL, NoSQL, and for increasing the mathematical rigor [Kelly 2012]