The SQL++ Unifying Semi-Structured Query Language, and an Expressiveness Benchmark of SQL-On-Hadoop, Nosql and Newsql Databases

Noname manuscript No. (will be inserted by the editor) The SQL++ Unifying Semi-structured Query Language, and an Expressiveness Benchmark of SQL-on-Hadoop, NoSQL and NewSQL Databases Kian Win Ong · Yannis Papakonstantinou · Romain Vernoux Received: date / Accepted: date Abstract SQL-on-Hadoop, NewSQL and NoSQL Since SQL itself is a subset of SQL++, the SQL-aware databases provide semi-structured data models (typ- reader will easily identify in which ways each of the ically JSON based) and respective query languages. surveyed databases provides more or less than SQL. Lack of formal syntax and semantics, idiomatic (non- The eleven databases are Hive, Jaql, Pig, Cassandra, SQL) language constructs and large variations in syn- JSONiq, MongoDB, Couchbase, SQL, AsterixDB, Big- tax, semantics and actual capabilities pose problems Query and UnityJDBC. They were selected due to their even to database experts: It is hard to understand, com- market adoption or because they present cutting edge, pare and use these languages. It is especially tedious to advanced query language abilities. write software that interoperates between two of them Finally, we briefly discuss the use of SQL++ as or an SQL database and one of them. the query language of the FORWARD virtual database Towards solving these problems, first we formally query processor, which executes SQL++ queries over specify the syntax and semantics of SQL++. It con- SQL and non-SQL databases and the use of SQL++ in sists of a semi-structured data model (which extends the FORWARD application framework, which enables both JSON and the relational data model) and a query rapid development of live reports and interactive appli- language that is fully backwards compatible with SQL. cations on SQL and non-SQL databases. FORWARD SQL++ is “unifying” in the sense that it is explicitly provides a proof-of-concept of SQL++’s applicability designed to encompass the data model and query lan- as a unifying data model and query language. guage capabilities of current SQL-on-Hadoop, NoSQL and NewSQL databases. Then, we itemize fifteen SQL++ data model and query language features and benchmark eleven 1 Introduction databases on their support of the multiple options as- sociated with each feature, leading to feature matri- Numerous databases marketed as SQL-on-Hadoop, ces and commentary. Each feature matrix is the re- NewSQL [30] and NoSQL have emerged to catalyze Big sult of empirical validation through sample queries. Data applications. These databases generally support the 3Vs [11]. (i) Volume: amount of data (ii) Velocity: This work was supported by NSF DC 0910820, NSF III 1219263, NSF speed of data in and out (iii) Variety: semi-structured IIS 1237174 and Informatica grants. The grants' PI is Prof. and heterogeneous data. As a result of differing use Papakonstantinou who is a shareholder of an entity that commercializes cases and design considerations around the Variety re- some results mentioned in this research. quirement, these new databases have adopted semi- Kian Win Ong structured data models that vary among each other. E-mail: [email protected] Their query languages have even more variations. Some Yannis Papakonstantinou∗ variations are due to superficial syntactic differences. E-mail: [email protected] Some variations arise from the data model differences. Romain Vernoux Finally, other variations are genuine differences in query E-mail: [email protected] capabilities. 2 Kian Win Ong et al. In this setting, even researchers and practitioners that push the agenda on query languages for JSON or with many years of SQL database experience face prob- JSON-like data, as these databases gravitate towards lems in two areas: more sophisticated and complete query capabilities. The benchmark’s results are presented through 1. Comprehension: Significant effort is needed to un- fifteen feature matrices and additional analy- derstand, compare and contrast the semi-structured sis/commentary that classify each database’s data data models and query languages of these novel model and query language capabilities as a subset of databases. The informal (and often underspecified) SQL++. The matrices further decompose each feature syntax and semantics of the provided query languages into as many as eleven constituent sub-features and make comprehension even harder or impossible, as it options, in order to facilitate fine-grained comparisons becomes apparent from the avalanche of syntax and across different data models and languages. Besides semantics questions in online forums. providing information on supported and unsupported 2. Development: It is difficult to write software that re- features and options, the matrices also qualify capa- trieves data from multiple such databases, given the bility differences that cut across individual features, different data models, different query syntaxes and such as the composability of various query language the (often subtly) different query semantics. These features with each other. For readability, we interleave interoperability issues occur frequently in practice, the SQL++ specification sections with the respective for example, whenever an organization adopts one benchmarking (capability classification) sections. of these new databases and then builds applications The approach of outlining the differences between that need integrated access to data stored in the new the various databases using SQL++ achieves two ben- database and in its existing SQL databases. efits: First, SQL++ offers the reader a formal specification of the discussed features and capabilities. Second, Towards solving the above problems, we formally by understanding each database’s capabilities in terms specify the syntax and semantics of SQL++, which is of SQL++, the reader can focus on the fundamental a unifying semi-structured data model and query lan- differences of the databases without being confused by guage that is designed to encompass the data model syntactic idiosyncracies of various query languages and and query language capabilities of NoSQL, NewSQL superficial differences in the documented descriptions and SQL-on-Hadoop databases. The SQL++ semantics of their semantics. stands on the shoulders of the extensive past work from The relatively immature state of query language the database R&D community in non-relational data documentation of the surveyed databases leaves many models and query languages: OQL [2], the nested rela- questions unanswered. We dealt with this problem us- tional model and query languages [15,28,1] and XQuery ing a hands-on approach: Each feature matrix has been (and other XML-based query languages) [27,10,5]. empirically validated by executing sample queries on SQL++ is an extension to SQL and is backwards- the surveyed databases. A benchmark comprising sam- compatible with SQL. This choice was made in order ple queries, empirical observations, as well as links to to facilitate the SQL-aware audience in two aspects: supporting documentation and bug reports is available First, since many surveyed databases do not support at . the entirety of standard SQL capabilities, the provided http://forward.ucsd.edu/sqlpp The feature matrices of this survey paper classify comparisons explain the extent to which each surveyed many capabilities of semi-structured data models and database supports the SQL capabilities. Second and query languages. The most prominent capabilities are: most importantly, the reader will understand in what ways semi-structured data models and query languages – What kinds of data values are supported by each extend SQL’s capabilities, understand in which ways database? these extensions may relate to each other, and obtain – What kind of schemas and constraints are supported? an overview on which surveyed databases support these extensions. – How does the query language access and construct Then we itemize fifteen SQL++ data model nested data? and query language features and benchmark eleven – How is missing information represented and handled? databases on their support of the multiple options as- – What are the options and semantics for equality on sociated with each feature. For this benchmark, we non-scalar and heterogeneous values? cover the most popular SQL-on-Hadoop, NoSQL and NewSQL databases from DB-Engines [8] (a popularity – What are the options and semantics for ordering on tracker for database engines) and industry surveys [12, non-scalar and heterogeneous values? 17]. We have also selected research-oriented databases – Is aggregation supported? The SQL++ Semi-structured Query Language 3 – Is join supported? <=;#4/& – Are extensions (such as UDFs) provided to circum- !"#$$% !"#$$% vent limitations? :"#$;#*& $#*"=/*& &'()*(+%,-../01230& We expect that some of the results listed in the feature matrices will change in the next years as the !"#$$%!"#$%&'$()#**($& surveyed databases will release newer and better im- plementations. The arXiv/CoRR version of this paper !"#$$&@;$/".=&-./.0.*#& [24] and the benchmark will be updated to reflect these changes. +!,??& +!,??& +!,??& +!,??& +!,??& @;$/".=& @;$/".=& @;$/".=& @;$/".=& @;$/".=& Despite the forthcoming changes, we expect SQL++ @;#2*& @;#2*& @;#2*& @;#2*& @;#2*& and the comparison methodology followed by this sur- +!,3(43 +!,& 1#2+!,& 1(+!,& 5.6((7& >>>& vey to remain a standing contribution. Besides its value A$.77#$& A$.77#$& A$.77#$& A$.77#$& to developers, SQL++ can also assist the query language designers in the NoSQL, NewSQL and SQL-on- 1.89#& 1.89#& Hadoop space towards (1) producing formal versions

The SQL++ Unifying Semi-Structured Query Language, and an Expressiveness Benchmark of SQL-On-Hadoop, Nosql and Newsql Databases

Towards an Analytics Query Engine *

Evaluation of Xpath Queries on XML Streams with Networks of Early Nested Word Automata Tom Sebastian

QUERYING JSON and XML Performance Evaluation of Querying Tools for Offline-Enabled Web Applications

Programming a Parallel Future

Declarative Data Analytics: a Survey

A Platform-Independent Framework for Managing Geo-Referenced JSON Data Sets

XML Prague 2012

Big Data Analytic Approaches Classification

Download the Result from the Cluster

A Dataflow Language for Large Scale Processing of RDF Data

Recent Trends in JSON Filters Atul Jain*, Dr

Extracting Data from Nosql Databases a Step Towards Interactive Visual Analysis of Nosql Data Master of Science Thesis