Sinew: A SQL System for Multi-Structured Data

Daniel Tahara (Yale University) [email protected]
Thaddeus Diamond (Hadapt) [email protected]
Daniel J. Abadi (Yale University and Hadapt) [email protected]

ABSTRACT

As applications are becoming increasingly dynamic, the notion that a schema can be created in advance for an application and remain relatively stable is becoming increasingly unrealistic. This has pushed application developers away from traditional relational systems and away from the SQL interface, despite their many well-established benefits. Instead, developers often prefer self-describing data models such as JSON, and NoSQL systems designed specifically for their relaxed semantics.

In this paper, we discuss the design of a system that enables developers to continue to represent their data using self-describing formats without moving away from SQL and traditional relational database systems. Our system stores arbitrary documents of key-value pairs inside physical and virtual columns of a traditional relational database system, and adds a layer above the database system that automatically provides a dynamic relational view to the user, against which fully standard SQL queries can be issued. We demonstrate that our design can achieve an order of magnitude improvement in performance over alternative solutions, including existing relational database JSON extensions, MongoDB, and shredding systems that store flattened key-value data inside a relational database.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems

Keywords

dynamic schema; SQL; RDBMS; NoSQL; JSON; MongoDB

SIGMOD'14, June 22-27, 2014, Snowbird, UT, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2376-5/14/06 ... $15.00.
http://dx.doi.org/10.1145/2588555.2612183

1. INTRODUCTION

A major appeal of NoSQL database systems such as MongoDB, CouchDB, Riak, Cassandra, or HBase as the storage backend for modern applications is their flexibility to load, store, and access data without having to define a schema ahead of time. By removing this up-front data management effort, developers can more quickly get their application up and running without having to worry in advance about which attributes will exist in their datasets or about their domains, types, and dependencies.

Some argue that this 'immediate gratification' style of application development will result in long-term problems of code maintenance, sharing, and performance. However, proponents of the NoSQL approach argue that for rapidly evolving datasets, the costs of maintaining a schema are simply too high. Whichever side of the debate one falls on, in practice, the amount of production data represented using key-value and other semi-structured data formats is increasing, and as the volume and strategic importance of this data increase, so too does the requirement to analyze it.

Some NoSQL database systems support primitives that enable the stored data to be analyzed. For example, MongoDB currently provides a series of aggregation primitives as well as a proprietary MapReduce framework to analyze data, whereas other NoSQL databases, such as Cassandra and HBase, connect to Hadoop directly, leveraging Hadoop MapReduce and other execution frameworks such as Apache Tez to analyze data.

Unfortunately, there are significant drawbacks to using any of the options provided by the NoSQL databases. Local primitives are a step away from the SQL standard, which renders a large number of third-party analysis and business intelligence tools (such as SAP Business Objects, IBM Cognos, MicroStrategy, and Tableau) unusable. Meanwhile, while there are several projects in the Hadoop ecosystem that provide a SQL interface to data stored in Hadoop (such as Hadapt, Hive, and Impala), they require the user to create a schema before the data can be analyzed via SQL, which eliminates a major reason why the NoSQL database was used in the first place. In some cases, a schema can be added after the fact fairly easily, but in many other cases significant processing, ETL, and cleaning work must be performed in order to make the data fit into a usable schema.

This paper describes the design of a layer above a traditional relational database system that enables standard SQL-compliant queries to be issued over multi-structured data (relational, key-value, or other types of semi-structured data) without having to define a schema at any point in the analytics pipeline. The basic idea is to give the user a logical view of a universal relation [13, 15], where a logical table exists that contains one column for each unique key that exists in the dataset. Nested data is flattened into separate columns, which can result in the logical table having potentially hundreds or thousands of columns.

Since physically storing data in this way is impractical, the physical representation differs from the logical representation. Data is stored in a relational database system (RDBMS), but only a subset of the logical columns are materialized as actual columns in the RDBMS, while the remaining columns are serialized into a single binary column in the database system. These columns are then transparently rendered to the user by internally extracting the corresponding values from the serialized data.

The primary contribution of our work is a complete system architecture that enables a practical implementation of this vision. This architecture contains the following components: a relational storage layer, catalog, schema analyzer, column materializer, loader, query rewriter, and inverted text index. This paper describes the design, implementation, and interaction of these components.

We also build a prototype of our proposed system, and compare it against several alternatives for storing and analyzing multi-structured data: (1) MongoDB, (2) a JSON extension of Postgres, and (3) storing key-value pairs in entity-attribute-value 'triple' format in a table in an RDBMS. We find that our prototype consistently outperforms these alternatives (often by as much as an order of magnitude) while providing a more standard SQL interface.

2. RELATED WORK

Analytical systems that do not require a user to define a schema when data is loaded into the system generally fall into two categories: (1) systems that are specialized for non-relational data models and that do not have a SQL interface; and (2) more general systems that offer a SQL or SQL-like interface to the data at the cost of requiring the user to define the structure of the underlying data before data can be queried.

In exchange for programming convenience or flexibility, systems in the first category often require that the consumer of the data sacrifice performance on some subset of processing tasks and/or learn a custom query language. For example, Jaql offers a modular data processing platform through the use of higher-order functions, but despite its extensibility, it is not optimized for performing relational operations [6]. MongoDB, although it accepts any data representable as JSON, requires that the user learn its JavaScript-based query language, which does not natively support operations such as relational joins.^1

In the second category are systems that allow a user to define a schema on arbitrary, external data and query that data using a relational query engine. Most database systems support this functionality through external tables, with foreign data wrappers transforming data from a non-relational format to a relational format as it is read [17]. These external tables can also be indexed to improve performance or transparently loaded into the database system [4, 2]. Many SQL projects within the Hadoop ecosystem (such as Hive and Impala) use a similar concept, providing a relational query engine without a separate storage module. Data remains in Hadoop's filesystem (HDFS), and the user registers a schema for stored HDFS data with the SQL-on-Hadoop solution. After registering this schema, the user may issue queries, and data is read from HDFS and processed by the SQL-on-Hadoop solution according to this schema.

Although these systems are more flexible than systems that require a user to define a schema at load time, they are still limited by the requirement that, in order to issue SQL queries, the user must pre-define a target schema over which to execute those queries. Sinew drops this requirement, and automatically presents a logical view of the data to the user based on data rather than user input.

Google Tenzing [10], Google Dremel [16], and Apache Drill^2 offer the ability to query data through SQL without first defining a schema. Tenzing, which is a SQL-on-MapReduce system similar to Hive, infers a relational schema from the underlying data but can only do so for flat structures that can be trivially mapped into a relational schema. In contrast, Dremel and Drill (and Sinew) support nested data. However, the design of Drill and Dremel differs from Sinew's in that Drill and Dremel are only query execution engines, designed purely for analysis. In contrast, Sinew is designed as an extension of a traditional RDBMS, adding support for semi-structured and other key-value data on top of existing relational support. With this design, Sinew is able to support transactional updates, storage-integrated access control, and read/write concurrency control. Furthermore, since it integrates with an RDBMS, Sinew can also benefit from the RDBMS's statistics gathering and cost-based query optimization capabilities. This makes Sinew similar to Pathfinder [8], a processing stack designed to convert from XML and XQuery to relational tuples and SQL, but Sinew differs in that it is more broadly purposed to support any form of multi-structured data and explicitly attempts to provide a SQL interface to that data.

In addition to transparently providing a fully-featured relational interface to multi-structured data, Sinew introduces a novel approach to storing multi-structured data inside an RDBMS. One common approach, historically used by XML databases, is to create a 'shredder' that transforms documents into RDBMS records [5, 7, 14], often relying on some variation on the edge model in order to map documents into relational tuples [5]. However, even when the underlying relational schema is optimized for relational primitives such as projection and selection (e.g. by partitioning data into tables by attributes [14]), the performance of these systems is limited by the fact that reconstructing any single record requires multiple joins.

More recently, work examining the storage of large datasets generated by e-commerce and other online applications has suggested using a row-per-object data mapping. This mapping requires that objects be flattened into tuples corresponding to every possible key in the collection and uses wide (i.e. having many columns) tables for storing such data [1, 11]. Often, however, the data in question contain a significant number of sparse keys [3, 18], so the wide-table approach requires an RDBMS for which null values do not cause excessive space utilization or performance reduction. Column-oriented RDBMSs satisfy this criterion, but they run into difficulty in reconstructing nested objects because the objects themselves are not explicitly represented, only the sets of keys that appear in them. We will discuss the sparsity problem more extensively (and present a solution) in Section 3.1.1.
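To make the record-reconstruction cost of shredding concrete, the following sketch (our own illustration; the table layout and names are hypothetical, not taken from any of the cited systems) stores two documents as entity-attribute-value triples and reassembles a two-attribute record, which already requires a self-join on the triples table:

```python
import sqlite3

# Hypothetical edge-model shredding: one (doc_id, key, value) triple per
# key-value pair, across all documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (doc_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(1, "url", "www.sample-site.com"), (1, "hits", "22"),
     (2, "url", "www.sample-site2.com"), (2, "owner", "John P. Smith")],
)

# Reconstructing even a two-attribute record takes one instance of the
# triples table per attribute, joined on the document ID.
row = conn.execute(
    """
    SELECT a.doc_id, a.value AS url, b.value AS hits
    FROM triples a JOIN triples b ON a.doc_id = b.doc_id
    WHERE a.key = 'url' AND b.key = 'hits'
    """
).fetchone()
print(row)  # (1, 'www.sample-site.com', '22')
```

Each additional projected attribute adds another self-join, which is the per-record cost the wide-table and reservoir-based designs discussed here try to avoid.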

^1 http://docs.mongodb.org/manual/reference/
^2 https://cwiki.apache.org/confluence/x/sDnVAQ

Figure 1: System Architecture [figure not reproduced in this extraction]

Figure 2: Example key-value data [figure not reproduced; judging from the view in Figure 3, the input is two documents along the lines of {"url": "www.sample-site.com", "hits": 22, "avg_site_visit": 128.5, "country": "pl"} and {"url": "www.sample-site2.com", "hits": 15, "date": "8/19/13", "ip": "123.45.67.89", "owner": "John P. Smith"}]

Figure 3: User view of data from Figure 2

url                  | hits | avg_site_visit | country | date    | ip           | owner
---------------------|------|----------------|---------|---------|--------------|---------------
www.sample-site.com  | 22   | 128.5          | pl      |         |              |
www.sample-site2.com | 15   |                |         | 8/19/13 | 123.45.67.89 | John P. Smith
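The mapping from Figure 2 to Figure 3 can be sketched as follows (a minimal illustration of our own, not Sinew code; the documents mirror the values visible in Figure 3). The logical schema is simply the union of all keys observed so far, and keys absent from a document surface as NULLs in its row:

```python
# Two input documents, as in Figure 2 (reconstructed values).
docs = [
    {"url": "www.sample-site.com", "hits": 22, "avg_site_visit": 128.5,
     "country": "pl"},
    {"url": "www.sample-site2.com", "hits": 15, "date": "8/19/13",
     "ip": "123.45.67.89", "owner": "John P. Smith"},
]

columns = []                      # the evolving logical schema
for doc in docs:
    for key in doc:
        if key not in columns:
            columns.append(key)   # new keys extend the view automatically

# One row per document; None stands in for SQL NULL where a key is absent.
rows = [tuple(doc.get(col) for col in columns) for doc in docs]
print(columns)
print(rows)
```

No schema is declared anywhere: loading a document with a previously unseen key simply widens the logical view, which is the behavior Sinew preserves while storing the data very differently underneath.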

3. SYSTEM ARCHITECTURE

Fundamentally, Sinew differs from previous work in that it is a system that can both query and manipulate multi-structured data without requiring the user to define a schema at any point in the query process. Furthermore, it offers performant reads and writes by leveraging existing database functionality and delegating query optimization and execution to an underlying RDBMS whenever possible.

Although we rely heavily on an RDBMS in our design, we constrain the design to not require changes to the RDBMS code. This greatly expands the applicability of Sinew, allowing it to be used in concert with most existing relational database systems (preferably object-relational systems) instead of requiring a transition to a new database system. Sinew therefore should be thought of as a layer that sits above a database system.

The architecture of Sinew is shown in Figure 1. At the highest level, the system comprises the following components, which will be detailed in the remainder of this section:

• An RDBMS with a 'hybrid' physical schema (Section 3.1.1)
• A catalog (Section 3.1.2)
• A database schema analyzer (Section 3.1.3)
• A column materializer (Section 3.1.4)
• A loader (Section 3.2.1)
• A query rewriter (Section 3.2.2)
• An (optional) text index (Section 4.3)

For ease of discussion we will assume that data is input to Sinew in JSON format. In general, any data that is represented as a combination of required, optional, nested, and repeated fields is supported, even if the types vary across fields of the same name.

3.1 Storage Layer

3.1.1 Hybrid Schema

Given a collection of input data, which takes the form of documents comprising a set of potentially nested key-value pairs, Sinew automatically generates a logical, user-facing view and maintains it separately from the underlying schema of the data in RDBMS relations. We first discuss the logical view (an evolving schema that is created based on the data loaded thus far, against which the user can issue SQL queries) and then discuss its physical manifestation.

In the logical view, Sinew presents a universal relation where each document corresponds to one row. Each unique top-level key found in any document is exposed to the client as a column in the universal schema with the traditional modes of access provided by SQL. Therefore, for each key-value pair from a document, the value will be (logically) stored in the row corresponding to the document and the (logical) column corresponding to the key. If the document contains a nested object, its subkeys are also referenceable as distinct columns using a dot-delimited name with the subkey preceded by the key of the parent object. As with other database primitives such as strings and integers, the nested object remains referenceable by the original key. (Arrays are less straightforward and are discussed in Section 4.2.)

For example, given the dataset shown in Figure 2, the user would have the view shown in Figure 3, and the following query:

SELECT url FROM webrequests WHERE hits > 20;

would return the set of all values associated with appearances of the key 'url' in objects with more than 20 'hits'.

There are two ways of storing attributes of the logical schema in the RDBMS, and we use the terms physical and virtual to describe the two cases. A physical column is any column in the logical view that is also stored as a physical column in the database system. In contrast, a virtual column is a column in the logical view that does not exist as a physical column in the database system; instead, it is accessed by runtime extraction from a serialized representation (described in Section 4.1) of the raw key-value pairs.

Accordingly, there are two extremes for mapping the logical schema to a physical schema: (1) storing all columns as physical columns or (2) storing all columns virtually. In (1), we have a physical database schema identical to the logical schema described above (i.e. a wide table containing one column for every unique key that exists in the data set), whereas in (2), we have a single-column table with the key-value pairs for each object serialized (as text or binary) and stored in that column (one object per row).

The all-physical-column option offers a simpler system design but can run into problems handling datasets with many, potentially sparse attributes or those with large nested objects. For example, row-oriented RDBMSs allocate at least 1 bit of storage space for each schema attribute for every row in the table. This space is reserved either in the tuple header or in the body of the tuple with the rest of the data. This preallocated space per attribute (whether or not a non-null value for the attribute exists) can lead to storage bloat for sparse data, which can significantly degrade performance. To better understand the problems of an all-physical approach, consider the present version of the Twitter API.^3

^3 https://dev.twitter.com/docs/platform-objects/tweets
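The dot-delimited naming scheme described above can be sketched as follows (our own illustration, not Sinew's actual code; the sample tweet is hypothetical). Subkeys of a nested object become columns such as "user.id", while the parent object itself remains referenceable under its original key:

```python
# Flatten a (possibly nested) document into the logical column namespace.
# Nested objects contribute both a column for the object itself and one
# dot-delimited column per subkey, recursively.
def flatten(doc, prefix=""):
    cols = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            cols[name] = value                      # parent stays referenceable
            cols.update(flatten(value, name + "."))  # plus "parent.child" columns
        else:
            cols[name] = value
    return cols

tweet = {"id": 1, "user": {"id": 7, "lang": "msa"}}
flat = flatten(tweet)
print(flat)
# {'id': 1, 'user': {'id': 7, 'lang': 'msa'}, 'user.id': 7, 'user.lang': 'msa'}
```

This is why queries later in the paper can reference columns such as "user.id" and "user.screen_name" directly, even though the input documents nest the user object inside each tweet.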

Table 1: Twitter Queries

1  SELECT DISTINCT "user.id" FROM tweets;

2  SELECT SUM(retweet_count) FROM tweets GROUP BY "user.id";

3  SELECT "user.id" FROM tweets t1, deletes d1, deletes d2
   WHERE t1.id_str = d1."delete.status.id_str" AND
     d1."delete.status.user_id" = d2."delete.status.user_id" AND
     t1."user.lang" = 'msa';

4  SELECT t1."user.screen_name", t2."user.screen_name"
   FROM tweets t1, tweets t2, tweets t3
   WHERE t1."user.screen_name" = t3."user.screen_name" AND
     t1."user.screen_name" = t2.in_reply_to_screen_name AND
     t2."user.screen_name" = t3.in_reply_to_screen_name;

Table 2: Effect of Virtual Columns on Query Plans

# | Column           | With Virtual Column        | With Physical Column
1 | user.id          | HashAggregate              | Unique
2 | user.id          | HashAggregate              | GroupAggregate
3 | user.lang        | 1. Merge join: d1 = d2     | 1. Filter
  |                  | 2. Filter                  | 2. Merge join: t1 = d1
  |                  | 3. Merge join: t1 = d1     | 3. Merge join: d1 = d2
4 | user.screen_name | 1. Merge join: t2 = t3     | 1. Merge join: t2 = t3
  |                  | 2. Merge join: t1 = t2     | 2. Hash join: t1 = t3

Figure 4: Example Catalog

(a) Global attribute table:

_id | key_name       | key_type
1   | url            | text
2   | hits           | integer
3   | avg_site_visit | real
4   | country        | text
5   | ip             | text
6   | owner          | text

(b) Per-table catalog:

_id | count | materialized | dirty
1   | 2     | t            | f
2   | 2     | t            | f
3   | 1     | f            | t
4   | 1     | f            | t
5   | 1     | f            | t
6   | 1     | f            | t

The API specifies that tweets have 13 nullable, top-level attributes, which expand into 23 keys when fully flattened. Adding in nested user objects (which can optionally contain a tweet), hashtags, symbols, urls, user mentions, and media, the flattened version of the original tweet can contain upwards of 150 optional attributes. If we attempted to store this representation in InnoDB, a popular storage engine for MySQL, this would amount to at least 300 bytes of additional storage overhead per record (InnoDB headers include 2 bytes per attribute^4). For a minimal tweet (just text with no additional entities or metadata), this header overhead can actually be larger than the size of the data itself. Even in RDBMSs with efficient NULL representations (such as Postgres, which uses a bitmap to indicate the presence of NULLs), there is a non-negligible system cost to keeping track of sparse data. Therefore, as a practical limitation, and to simplify the design of the catalog, most row-oriented RDBMSs place a hard limit on the number of columns that a schema may declare.

Column-oriented database systems do not have the same storage bloat problem for wide, sparse data [1] and can flatten data to a larger degree than row-stores. However, a fully flattened physical representation is often not optimal. Referring again to the Twitter API, a common query pattern might be to retrieve the 'user' who posted a given tweet (Twitter nests an entire user object as an attribute of a tweet). In a fully flattened representation, parent objects of nested keys no longer exist explicitly; rather, they are the result of merging all columns whose attribute names are an extension of the desired parent object. In order to return the parent object, the system must first compute the proper set of columns and then pay the cost of merging them together. For nested keys that are frequently accessed together, it is better to store them as a single collection than as individual elements.

Given the sparsity overhead of the "all-physical-column" approach, one may be tempted to go to the other extreme, the "all-virtual-column" approach described above. Although the cost of key-value extraction from the column containing the serialized data can be kept small (see Section 4.1 and Appendix B), storing all attributes serialized within a single column degrades the ability of the RDBMS optimizer to produce efficient query plans. This is because, given our stipulation of not modifying underlying database code, the optimizer cannot maintain statistics on an attribute level. As far as the optimizer is concerned, virtual columns do not exist.

To demonstrate the potential differences in query plans, we performed the queries listed in Table 1 over a set of 10 million tweets from Twitter. Each tweet has the attributes described above, the sparsity of which varies from less than 1% all the way up to 100%. (See Section 6 for our experimental setup.) The query plans generated by the optimizer in both conditions are presented in Table 2. These plans contain differences in the operators used for the UNIQUE and GROUP BY queries, and also differences in the JOIN order for both join queries. The differences can be attributed to the fact that the optimizer assumes a fixed selectivity for queries over virtual columns (200 rows out of 10 million in these experiments). In cases when the selectivity is in fact much lower (i.e. more tuples match a given predicate), the resulting query plan will be suboptimal, which can have significant performance implications depending on the dataset and system configuration. For example, the self-join saw an order of magnitude improvement when querying over a physical column versus over a virtual one, with an originally 50-minute query completing in just over 4 minutes.

In order to take advantage of the performance benefits of the column-per-attribute mapping as well as the space efficiency and extensibility of single-column serialization, we opt for a combination of the two mappings. Under our hybrid schema, we create columns for some attributes and store the remainder in a special serialized column that we now refer to as the column reservoir. This allows us to attain the benefits of leveraging physical columns in the database system when they are most helpful, while keeping the sparsest and least frequently accessed keys as virtual columns.

3.1.2 Catalog

In order to maintain a correct mapping between the logical and physical schemas and facilitate optimizations in the rest of the system, Sinew carefully documents attribute names, types, and methods of storage (physical or virtual column). This metadata is kept in a catalog, which records the following information:

• What keys have been observed
• Key type information that has been derived from the data
• The number of occurrences of each key

^4 http://dev.mysql.com/doc/refman/5.5/en/innodb-table-and-index.html#-physical-record

• Whether the column is physical or virtual
• A 'dirty' flag (discussed in Section 3.1.4)

In practice, this catalog is divided into two parts. The first, as shown in the example in Figure 4(a), contains a global list of attributes appearing in any document across all relations as a set of id, name, type triples. This global table aids data serialization (described in Section 4.1) by serving as the dictionary that maps every attribute to an ID, thereby providing a compact key representation whenever a particular attribute must be referred to inside the storage layer. The second part of the catalog (see Figure 4(b)) is maintained on a per-table basis (instead of globally across tables) and contains the rest of the information mentioned above.

With this information, Sinew is able to identify both the logical schema and current physical schema, enabling the query transformer to resolve references to virtual columns into statements that match the underlying RDBMS schema. The statistics (both the number of unique keys and the number of occurrences of each key) are used by the schema analyzer (described in Section 3.1.3) to dynamically make decisions about which columns to materialize and which to leave in the column reservoir. We discuss this interaction below.

3.1.3 Schema Analyzer

In order to adapt to evolving data models and query patterns, a schema analyzer periodically evaluates the current storage schema defined in the catalog in order to decide the proper distribution of physical and virtual columns. The primary goal in selecting which columns to materialize as physical columns is to minimize the overall system cost of materialization and the associated system overhead of maintaining tables with many attributes, while maximizing the corresponding increase in system performance.

Dense (i.e. frequently appearing) attributes and those with a cardinality that significantly differs from the RDBMS optimizer's default assumption are good candidates for materialization. A simple threshold is used for both cases. Attributes with a density above the first threshold or with a cardinality difference above the second threshold are materialized as physical columns, while the remaining attributes are left as virtual columns.

As new data is loaded into the system, the density and cardinality characteristics of columns may change. Therefore, the schema analyzer also checks already materialized columns to see if they are still above threshold. If not, they are marked for dematerialization.

3.1.4 Column Materializer

The column materializer is responsible for maintaining the dynamic physical schema by moving data from the column reservoir to physical columns (or vice-versa). Our goal in the design of the materializer is for it to be a background process that is running only when there are spare resources available in the system. A critical requirement necessary to achieve this goal is for materialization to be an incremental process that can stop when other queries are running and pick up where it left off when they finish and resources become free. Therefore, our design does not force a column to be materialized in its entirety; rather, some values for a key may exist in the reservoir while others exist in the corresponding physical column. We call such a column dirty and ensure that the dirty bit in the catalog is set for that column in this situation. When the dirty bit is set, both the physical column and the reservoir must be checked for values for a particular key for any query that accesses that key (this is done via the COALESCE function; see Section 3.2.2).

The materializer works as follows. Whenever the schema analyzer decides to turn a virtual column into a physical column or vice-versa, it sets the dirty bit for that column to true inside the catalog in anticipation of data movement. Periodically, the materializer polls the catalog for columns marked as dirty, and for any such column, it checks the catalog to see if the column is now supposed to be physical or virtual (this determines the direction of data movement). Then, it iterates row-by-row, and for any row where it finds data in the reservoir when it is supposed to be in a physical column (or vice-versa) it performs an atomic update of that row (and only that row) to move the value to its correct location. The materializer and loader are not allowed to run concurrently (which we implement via a latch in the catalog), so when the iteration reaches the end of the table, it can be guaranteed that all data is now in its correct location. The materializer then sets the dirty bit to false, and the process is complete.

As mentioned above, the important feature of this design is that although each row-update is atomic, the entire materialization process is not. At any point, the materializer can be stopped and queries processed against the partially materialized column. These queries run slightly slower than queries against non-dirty columns, due to the need to add the COALESCE function to query processing. The precise slowdown is dependent on how the underlying database system implements COALESCE. In our PostgreSQL-based implementation (see Section 5), we observed a maximum slowdown of 10% for queries that access columns that must be coalesced. For disk-bandwidth-limited workloads, we observed no slowdown at all.

3.2 User Layer

3.2.1 Loader

A bulk load is completed in two steps: serialization and insertion. In the first step, the loader parses each document to ensure that its syntax is valid and then serializes it into the format described in Section 4.1. As the serialization takes place, the loader aggregates information about the presence, type, and sparsity of the keys appearing in the dataset and adds that information to the catalog. More precisely, for every key-value pair that is loaded, the loader infers the data type and looks up the resulting key and type (the combination of which we call an attribute) in the catalog to get its attribute ID. If the attribute does not exist in the catalog, the serializer inserts it into the catalog, receives a newly generated ID, and serializes the key-value pair into the column reservoir along with the rest of the data. Thus, virtual columns for new keys are created automatically at load time during serialization, and the cost of adding a new attribute to the schema is just the cost to insert the new attribute into the catalog during serialization the first time it appears in the dataset (an invisible cost to the user).

On insertion, all of the serialized data gets placed into the column reservoir regardless of the current schema of the underlying physical relation. Sinew then sets the dirty flag in the catalog to true for all affected columns, and their data is eventually moved to the appropriate physical columns when the column materializer notices the dirty bit and materializes the newly loaded data. This design decision is motivated by the desire to keep the system components as modular as possible. By always loading into the column reservoir, the loader does not need to be aware of the physical schema, and does not need to interact with the schema analyzer and column materializer components of the system.

3.2.2 Query Rewriter

Sinew's hybrid storage solution necessitates that queries over the user-facing, logical schema be transformed to match the underlying physical schema. Therefore, Sinew has a query rewriter that modifies queries before sending them to the storage layer for execution. Specifically, after converting a given query into an abstract syntax tree, the rewriter validates all column references against the information in the catalog. Any column reference that cannot be resolved, whether because it refers to a virtual column or because it refers to a dirty physical column, gets rewritten. For example, given the query:

SELECT url, owner
FROM webrequests
WHERE ip IS NOT NULL;

the reference to the virtual column, 'owner,' will be transformed to a function that extracts the key from the column reservoir based on the serialization format chosen by the system (we give a sample implementation of one such function in Section 4.1):

SELECT url, extract_key_txt(data, 'owner')
FROM webrequests
WHERE ip IS NOT NULL;

In the case when 'owner' is dirty (i.e. not fully materialized), the column reference will be transformed instead into a SQL COALESCE over the physical column and key extraction:

SELECT url, COALESCE(owner, extract_key_txt(data, 'owner'))
FROM webrequests
WHERE ip IS NOT NULL;

In addition to the desired key, the extraction function takes a type argument (the above is syntactic sugar for passing 'text' as an argument), which is determined dynamically by the query rewriter based on type constraints present in the semantics of the original query. The extraction function then applies to only those values of the correct type. This behavior allows Sinew to elegantly handle situations where the same key corresponds to values of multiple types. Rather than throwing an exception for type mismatches (e.g. if the value is an argument to a function that expects an integer), it will instead selectively extract the integer values and return NULL for strings, booleans, or values of other types. On the other hand, in the common case where the expected type of an attribute cannot be determined from the query semantics (e.g. the case of a projection), the function will simply return the value downcast to a string type.

4.

Figure 5: Serialization Format. The header consists of the number of attributes, the sorted attribute IDs (aid_0 ... aid_n-1), the corresponding value offsets (offs_0 ... offs_n-1), and the total data length, followed by the data itself:

[ # attributes | aid_0 | aid_1 | ... | aid_n-1 | offs_0 | offs_1 | ... | offs_n-1 | len | data ]

... performed prior to the load), but manipulating the data is expensive because the system must convert it to a logical representation before performing any computation.

An alternative approach is to use a serialization format such as Apache Avro or Google Protocol Buffers, which represent objects as blocks of binary, rather than text. In general, both formats eliminate the syntactic redundancy of human-readable text representations and offer faster iteration over the individual keys, in particular by memoizing the schema of the serialized data. However, both formats, like JSON, are 'sequential', meaning that random reads on the original data are not supported. In order to extract a single key from a given datum, the application must either (1) deserialize the entire datum into a logical form and then dereference the appropriate attribute or (2) read through the serialized datum one attribute at a time until reaching the desired attribute or the end of the datum (if the attribute does not exist).

Since attribute extraction (projection in relational algebra) is an extremely common operation for SQL queries, we decided to use a custom serialization format instead of either of the above, which are focused more on data transfer and platform independence than analytics. In particular, our format reduces the cost of extracting attributes stored in a serialized object by allowing random reads on the data. We explore the exact performance differences between Sinew's format, Avro, and Protocol Buffers in Appendix A.

Much like a standard RDBMS tuple, our serialization format has a header that includes object metadata and a body that holds the actual data. Specifically, the header, the structure of which is shown in Figure 5, is composed of an integer indicating the number of attributes present in the record, followed by a sorted sequence of integers corresponding to the attribute IDs (as specified by the catalog) of the keys present. After the list of attribute IDs is a second series of integers indicating the byte offset of each attribute's value within the data. The body contains the binary representation of the actual data.

By separating the document structure from its data, our serialization format enables Sinew to quickly locate a key or identify its absence without reading the entirety of the serialized data. To find a key, Sinew simply needs to check the list of attribute IDs, and if the search ends up empty, it concludes that the key does not exist in the document. If it finds the key, it will look up the offset in the list of offsets, and jump to the correct location in the data to retrieve the typed value. We chose to separate the list of keys from
ENHANCEMENTS the list of offsets (rather than including offset information right next to the key) in order to maximize cache locality 4.1 Custom Serialization Format for binary searches for attribute IDs within the header. There are a number of options for storing serialized ob- Key extraction is straightforward. Given a desired key, the ject data, but most are not well suited to performing com- extraction module retrieves the corresponding attribute ID mon RDBMS operations. One approach is to keep the data and type information from the dictionary in the catalog. For in its original string form and store it a single text field. each record, it then performs a binary search on the attribute This makes loading trivial (no transformation needs to be ID list in the header and, if the ID is present, retrieves its

offset from the list that follows the attribute IDs. Combined with the offset of the next attribute, this information lets the module compute the attribute length and retrieve the value in its appropriate type. The cost of doing a read is O(log n) in the number of attributes present in a given datum, since it does a binary search in the header for a reference to the offset and length of the desired value. Because of this, it will perform significantly better than the other aforementioned serialization formats, which have a worst-case cost of O(n). The attribute search also has better constant factors due to the cache benefits of storing all attribute IDs consecutively.

It is worth noting that column-oriented serialization formats such as RCFile or Parquet permit random access and could therefore be used to serialize the data in the column reservoir. However, given the hybrid design of our storage solution, where some attributes are stored in physical columns and others are serialized in a reservoir column, the reservoir should match the orientation of the physical columns. Hence, if Sinew's underlying database system is a column-store, RCFile or Parquet may be used instead of our custom serialization format. However, if the underlying database system is a row-store, a column-oriented data format does not integrate well with a per-row reservoir, and our custom, object-centric serialization format should be used.

4.2 Nested Objects and Arrays

In our hybrid storage schema, nested objects and arrays can cause performance bottlenecks even if they are materialized, since the contents of the collections remain opaque to the optimizer. We explore a few techniques for improving the physical representation of nested collections in this section.

For nested objects, there are a few alternatives. Although Sinew will catalog the sub-attributes of any nested object that is materialized and mark them for materialization if necessary, a fully flattened data representation is not necessarily optimal (as discussed in Section 3.1.1). Therefore, while Sinew defaults to a single table containing one document per row (using a universal relation schema), the system does allow some relaxation of this extreme. If there are logical groups that can be formed within the set of documents (e.g., in the case of nested objects), the user can specify that these be put in separate tables and joined together at query time.

In the case of arrays, the user can opt for one of a variety of options depending on the significance of the array syntax (e.g., unordered set, ordered list, etc.). By default, the system stores the array as an RDBMS array datatype, but if the number of elements in the array is fixed (and small), it can instead store each position in the array as a separate column (as suggested by Deutsch et al. [12]). This mechanism can offer significant performance improvements for array containment and other predicates, since the predicates reduce to trivial filters over the external table.

Alternatively, if the array is intended to be an unordered collection or if it comprises a list of nested objects, the user can specify that the array elements be stored in a separate table as tuples of the form (parent object id, index, element). Maintaining a separate table not only decreases the complexity of cataloging, but also ensures that Sinew maintains aggregate statistics on the collection of array elements rather than segmenting those statistics by position in the array. Furthermore, when the array is a collection of nested objects, the 'element' can be divided into separate columns, one for each attribute within the nested object. For situations in which these nested objects are homogeneous, this helps Sinew to create a more optimal physical schema and, in turn, offer better query performance.

4.3 Inverted Index and Text Search

Since most implementations of text indexes include mechanisms to offer performant range queries, partial matching, and fuzzy matching, we can further enhance Sinew's performance and the expressivity of its queries by including an external text index over the data stored in the RDBMS. Inverted indexes are particularly useful for queries over virtual columns. At a high level, an inverted text index tokenizes the input data and compiles a vector of terms together with a list of IDs corresponding to the records that contain that term. Additionally, it can give the option of faceting its term vectors by strongly typed fields. Sinew leverages this functionality by associating a field with each attribute in its catalog and rewriting predicates over virtual columns into queries of the text index. The results of the search (a set of matching record IDs) can then be applied as a filter over the original relation.

Although our main motivation for incorporating inverted indexes was to speed up evaluation of queries containing standard SQL WHERE clause predicates on virtual columns, Sinew also uses them to support text search over the entire data set. Users can search for text that may appear in any column (physical or virtual), and Sinew leverages the inverted indexes to find the set of rows that contain the text. Not only does the full text search capability enable Sinew to offer a more expressive set of predicates on semi-structured and relational data, but it also allows Sinew to handle completely unstructured data alongside that data by simply storing the unstructured data in a generic text column and providing access to it through the text indexes. However, because unstructured data has no analog for relational attributes (unlike semi-structured data, whose keys can correspond to attributes), the interface to this unstructured data is not 'pure' SQL; rather, Sinew includes a special function which can be invoked in the WHERE clause of a SQL statement and takes two parameters: (1) the keys over which the search should be performed ('*' means all keys) and (2) the search string. A sample query is shown below:

  SELECT *
  FROM webrequests
  WHERE matches('*', "full text query or regex");

5. IMPLEMENTATION

Although Sinew is RDBMS-agnostic, we use Postgres as the underlying RDBMS for our experiments, since it has a history of usage in the literature and is therefore a good reference point for a comparative evaluation. Furthermore, as mentioned in Section 3.1.1, Postgres's efficient handling of null values (5) makes it particularly well-suited for the task of storing sparse data.

(5) Postgres uses a bitmap to indicate which attributes are null for a given tuple, so a null value occupies just a single bit rather than the entire width of a column.

Each tuple stored by Postgres varies in size based on the attributes actually present, rather than
being stored in a fixed, pre-allocated block based on the schema.

Both the schema analyzer and column materializer are implemented as Postgres background processes. The management of the background processes is delegated entirely to the Postgres server backend, which simplifies the implementation but does not impact functionality.

The data serialization is implemented through a set of user-defined functions (UDFs) to convert to and from JSON, as well as functions to extract an individual value corresponding to a given key (see Section 3.2.2). Implementing the serialization using UDFs does not impose a high performance cost (we quantify this cost in Appendix B) and allows Sinew to push down query logic completely into the RDBMS. Although the ability to define UDFs is a feature of most modern RDBMSs, some systems do not support UDFs. For those systems, Sinew can perform serialization and key extraction completely outside of the RDBMS (at reduced performance). Therefore, the use of UDFs is an implementation detail and does not reduce the generality of our overall design.

For the external text index, we use Apache Solr, a search index that is highly optimized for full text search and data discovery. The Solr API is exposed to Postgres using a UDF that returns the Postgres row IDs corresponding to the records that match a given Solr query in a specific table. The result set is then applied as a filter on the original table as part of the execution of the query.

6. EXPERIMENTS

Our experiments compare the performance of Sinew versus popular NoSQL solutions and RDBMS-based alternatives for querying and manipulating multi-structured data. We ran all 11 queries from the NoBench NoSQL benchmark suite developed at the University of Wisconsin [9] (6). We chose this suite as opposed to developing our own benchmark because it comprises a fairly diverse analytics workload, and it is therefore a good reference point for future systems. However, NoBench does not include update tasks, so we added a random update task to the NoBench suite (see Section 6.6) in order to evaluate the full read-write capabilities of the benchmarked systems.

For our experiments, we used NoBench to generate two datasets of 16 million and 64 million records (10 GB and 40 GB, respectively). Each record has approximately fifteen keys, ten of which are randomly selected from a pool of 1000 possible keys, and the remainder of which are either a string, integer, boolean, nested array, or nested document. Two dynamically typed columns, dyn1 and dyn2, take either a string, integer, or boolean value based on a distribution determined during data generation.

Our benchmarking system has a 3.6 GHz, quad-core Intel Xeon E5-1620 processor with 32 GB of memory and 128 GB of solid-state storage. We observed read speeds of 250-300 MB/s. We executed each of the 12 queries (11 from NoBench plus the update task) 4 times and took the average of the results. All queries were performed with warmed caches to simulate a typical analytics environment where analysis is run continuously; therefore, queries over the small data set (16 million records) are never I/O bottlenecked (since the data fits entirely in memory). The larger data set (64 million records) is larger than memory, and queries over this dataset can potentially be I/O limited if the I/O cost of bringing in the table from storage outweighs the processing CPU cost.

6.1 Benchmarked Systems

We evaluated the performance of Sinew versus three alternatives: MongoDB, a shredding system using the Entity-Attribute-Value model, and Postgres with JSON support.

Sinew: Our experimental version of Sinew is built on top of Postgres version 9.3 beta 2, and our installation preserves the default configuration parameters, including a 128 MB limit on shared memory usage. As described in Section 5, we installed the remaining features of Sinew as Postgres extensions. Although our integration of Solr into Sinew is complete, and Sinew therefore supports the full range of text indexing described in Section 4.3, we chose not to use text indexes for this benchmark since we are primarily interested in evaluating queries that do not involve full text search. The column materialization policy was simple: a column was marked for materialization if it was present in at least 60% of objects and had a cardinality greater than 200. This policy resulted in materialization for str1, num, nested array, nested object (itself a serialized data column), and thousandth. The other ten keys, including the dynamic and sparse keys, remained as virtual columns.

MongoDB: MongoDB is the most popular NoSQL database system (7), deployed in hundreds of well-known production systems (8). For our benchmarks we ran MongoDB 2.4.7 with a default configuration. Notably, MongoDB does not restrict its memory usage, so it was not uncommon to see upwards of 90% memory usage during execution of our queries.

Entity-Attribute-Value (EAV): A common target for systems that shred XML, key-value, or other semi-structured data and store them in an RDBMS is the EAV model [3]. Under this model, each object is flattened into sets of individual key-value pairs, with the object id added in front of each key-value pair to produce a series of (object id, key, value) triples (the object id is referred to as an 'entity' and the key as an 'attribute'). It is therefore fairly straightforward to store multi-structured data in a relational database using the EAV model [9]. By adding a mapping layer on top of an RDBMS, we can translate queries over specific attributes into queries over the underlying table, which, in the case of our implementation, is a 5-column relation of object id, key name, and key value (with one column for each primitive type: string, numerical, and boolean). As with Sinew, the EAV prototype runs on Postgres 9.3 beta 2 with the same system configuration.

Postgres JSON: Starting with Postgres 9.3, Postgres includes support for a JSON datatype, including built-in functions for performing key dereferences and other data manipulations. Given that this is built-in functionality and that we built Sinew on top of Postgres, we felt this was an important point of comparison to demonstrate what effect our architecture and optimizations would have on performance. Additionally, Postgres JSON is representative of commercial systems such as IBM DB2, which recently added support for JSON as an extension to its core RDBMS functionality (9). The installation was identical to the installation for Sinew.

(6) The full set of queries can be found on page 9 of the extended paper: http://pages.cs.wisc.edu/~chasseur/argo-long.pdf
(7) http://db-engines.com/en/ranking
(8) http://www.mongodb.org/about/production-deployments/
(9) http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1
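The materialization policy just described (presence in at least 60% of objects and cardinality above 200) is simple enough to state directly. The following is an illustrative sketch in Python, not Sinew's actual code; the names KeyStats and should_materialize are ours:

```python
from dataclasses import dataclass

@dataclass
class KeyStats:
    present: int   # number of objects containing the key
    distinct: int  # cardinality (number of distinct values observed)

def should_materialize(stats: KeyStats, total_objects: int,
                       presence_threshold: float = 0.60,
                       cardinality_threshold: int = 200) -> bool:
    """Materialize a key as a physical column when it is common enough
    and selective enough, per the policy used in the benchmark."""
    presence = stats.present / total_objects
    return (presence >= presence_threshold
            and stats.distinct > cardinality_threshold)

# A key present in 95% of objects with many distinct values qualifies;
# a sparse key present in 1% of objects does not.
```

The thresholds are parameters here, so a schema analyzer could in principle tune them per workload rather than hard-coding the 60%/200 values used in the experiments.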

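For contrast, the EAV shredding performed by the second benchmarked system (Section 6.1) reduces each object to rows of the 5-column relation. A minimal sketch, with a hypothetical shred helper and one typed value column per primitive type:

```python
def shred(object_id, obj):
    """Flatten one object into (object_id, key, string_value,
    numeric_value, boolean_value) rows, one row per key."""
    rows = []
    for key, value in obj.items():
        s = n = b = None
        if isinstance(value, bool):            # check bool before int
            b = value
        elif isinstance(value, (int, float)):
            n = value
        else:
            s = str(value)
        rows.append((object_id, key, s, n, b))
    return rows

rows = shred(42, {"url": "http://a.com", "bytes": 512, "cached": True})
```

Because every key becomes its own tuple, reconstructing an object at query time requires grouping and self-joining on object_id, which is the overhead the EAV system pays in the query results that follow.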
[Figure 6: NoBench Query Performance (Q1-Q10). (a) 16 Million Records; (b) 64 Million Records. Execution time (s) for MongoDB, Sinew, EAV, and PG JSON on Queries 1-10.]

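To make the workload concrete, a NoBench-style record as described in Section 6 can be approximated as follows. This is an illustrative sketch, not the actual generator; the key names str1, num, thousandth, dyn1, dyn2, and the sparse_NNN pattern appear in the text, while the remaining names are ours:

```python
import random

def make_record(rng: random.Random) -> dict:
    """Rough approximation of a NoBench record: a handful of common
    keys, two dynamically typed keys, and ten sparse keys drawn from
    a pool of 1000."""
    record = {
        "str1": "hello",
        "num": rng.randrange(10_000),
        "nested_arr": [1, 2, 3],
        "nested_obj": {"str": "world"},
        "thousandth": rng.randrange(1000),
        # dyn1/dyn2 take a string, integer, or boolean value
        "dyn1": rng.choice(["s", 7, False]),
        "dyn2": rng.choice(["t", 8, True]),
    }
    for key_id in rng.sample(range(1000), 10):  # 10 of 1000 sparse keys
        record[f"sparse_{key_id}"] = "GBRDCMBQGA======"
    return record

rec = make_record(random.Random(0))
```

Any two records share few sparse keys, which is what makes a fixed up-front relational schema for this data so unattractive.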
6.2 Load and Data Size

Before executing the NoBench queries, we measured both the load time and database size for each system. Our results are summarized in Table 3.

              16 million records       64 million records
  System      Load (s)   Size (GB)     Load (s)   Size (GB)
  MongoDB      522.24      10.1         2170.13     40.9
  Sinew        527.79       9.2         2155.12     33.0
  EAV         1835.18      22.0         9910.87     87.0
  PG JSON      284.11      10.2         1420.86     42.0
  Original        -        10.5            -        38.1

  Table 3: Load Time and Storage Size

Postgres JSON loads faster than the three other systems primarily because it only does simple syntax validation during the load process, whereas each of the other three systems requires some sort of data transformation: for MongoDB, this transformation is to BSON; for Sinew, to our custom serialization; and for the EAV system, to the 5-column schema described above (which required an average of over 20 new tuples per record).

For data size, Sinew's data representation is the most compact, since the replacement of keys with key identifiers in the header serves as a type of dictionary encoding. Postgres JSON does not transform the JSON data, and so is approximately the same size as the input. MongoDB states in its specification that its BSON serialization may in fact increase data size because it adds additional type information into its serialization, and we observed this on our 64 million record data set. The EAV system is significantly larger than either of the previous three, since its representation requires one tuple per flattened key, with not only the key name and value, but also a reference to a parent object. For our two NoBench datasets of 16 and 64 million objects, this resulted in 360 million and 1.44 billion tuples in the underlying table. It should be noted that neither Postgres nor MongoDB supports traditional database compression without additional configuration. Therefore, the data sizes reported in Table 3 are only the consequence of the data transformations and no further compression. It is reasonable to expect that compressing data would reduce the data size of all four systems.

6.3 Projections

Queries 1 through 4 are basic column projections over common top-level keys, common nested keys, and sparse keys. As Figure 6a shows, on the 16 million record dataset, Sinew outperforms both the Postgres JSON-based solution and the Entity-Attribute-Value data transformation by an order of magnitude. Postgres JSON stores JSON data as raw text. Therefore, it must execute a significant amount of code in order to extract the projected attributes from the string representation, including parsing and string manipulation. In fact, the CPU cost of dereferencing a JSON key in the Postgres JSON implementation is so high that these projection queries (and the selection queries in the next section), which should be I/O bound due to their simplicity, are in fact CPU bound. This was verified when we performed the same query with cold caches and observed an identical execution time. In contrast, Sinew's binary representation of data (as described in Section 4.1 and further explored in Appendices A and B) is optimized for random attribute access and requires less extraction effort at query time. The EAV system performs poorly because it adds a join on top of the original projection operation in order to reconstruct the objects from the set of flattened EAV tuples.

Sinew also performs these projection operations faster than MongoDB. The difference is an order of magnitude for queries 1 and 2 (projections over keys present in every object) and significant, but smaller, for queries 3 and 4 (projections over sparse keys appearing in about 1% of objects). From these results, we draw two conclusions. First, despite the fact that BSON is a binary serialization, there is still a significant CPU cost to extracting an individual key or set of keys from a BSON object. For queries 1 and 2, this extraction cost must be paid for every object, but for queries 3 and 4, this cost must be paid only for the 1% of objects that actually contain the key. Second, checking whether or not a key exists in BSON is significantly faster than extracting the key (hence MongoDB's improved relative performance for projection over sparse columns), but is still slower than the equivalent operations over Sinew's storage.

The results are similar for the larger, 64 million record (40 GB) dataset, when the data can no longer fit into main memory. Although the speedups for Sinew are no longer an order of magnitude, it is clear that projection operations in the three other systems have significant CPU costs, while Sinew queries become I/O bound. Whereas the query time for Sinew increases by about a factor of 10, the others saw only an approximately linear increase in execution time relative to the number of additional records.
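The extraction-cost asymmetry discussed above follows from the header layout of Section 4.1. The following is a minimal sketch of that idea in Python, ours rather than Sinew's actual on-disk format (which also tracks per-attribute types via the catalog): a sorted attribute-ID list plus an offset list supports O(log n) random access to a value and cheap absence checks that never touch the body.

```python
import bisect
import struct

def serialize(values):
    """Pack {attribute_id: bytes} into a header (count, sorted IDs,
    offsets, total length) followed by the concatenated value bytes."""
    ids = sorted(values)
    body = b"".join(values[i] for i in ids)
    offsets, pos = [], 0
    for i in ids:
        offsets.append(pos)
        pos += len(values[i])
    n = len(ids)
    header = struct.pack(f"<{2 * n + 2}I", n, *ids, *offsets, pos)
    return header + body

def extract(datum, attribute_id):
    """Binary-search the sorted ID list, then jump straight to the
    value via its offset: O(log n) per datum."""
    (n,) = struct.unpack_from("<I", datum, 0)
    ids = struct.unpack_from(f"<{n}I", datum, 4)
    offsets = struct.unpack_from(f"<{n + 1}I", datum, 4 + 4 * n)
    i = bisect.bisect_left(ids, attribute_id)
    if i == n or ids[i] != attribute_id:
        return None  # absence detected without reading the body
    body_start = 4 * (2 * n + 2)
    return datum[body_start + offsets[i]:body_start + offsets[i + 1]]

datum = serialize({3: b"foo", 7: b"barbaz"})
```

A sequential format must scan attributes until it finds the key (or reaches the end of the datum), which is the worst-case O(n) behavior the projection results above expose for BSON and text JSON.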

[Figure 7: Join (NoBench Q11) Performance. Execution time (s) for MongoDB, Sinew, EAV, and PG JSON on 16 and 64 million records.]

[Figure 8: Random Update Performance. Execution time (s) for MongoDB, Sinew, EAV, and PG JSON on 16 and 64 million records.]

6.4 Selections

Queries 5 through 9 each select objects from the database on the basis of either an equality, range, or array containment predicate. Once again, we see similar performance differences among the systems, with more than an order of magnitude improvement in performance for Sinew and MongoDB when compared to the Postgres JSON and EAV systems, and with Sinew outperforming MongoDB by between 40 and 75% (with one exception explained below).

There are two interesting results for Query 7 (range predicate on the multi-typed key 'dyn1'). First, for the smaller of the two datasets, MongoDB outperforms Sinew by about 40%. Whereas Postgres rewrites the BETWEEN predicate as two comparisons (>=, <=) without precomputing the value and substituting the result into both comparisons, MongoDB appears to precompute the value before applying the comparison operators. This saves the cost of one deserialization per record. For the larger dataset, both Sinew and MongoDB are I/O bound, and since Sinew's data representation is about 25% more compact than BSON, it takes less time to read the data from disk and thus less time to perform the query.

The second notable aspect of Query 7 is that it cannot be executed in the Postgres JSON system, because extracting a key corresponding to multiple types is not valid within Postgres's grammar. That is, since the JSON extraction operator in Postgres returns a datum of the 'JSON' datatype rather than a string, integer, float, or boolean, the datum must be type-cast before being used in another function or operator. Since Postgres raises an error if it encounters a malformed string representation for a given type (e.g., 'twenty' for an integer), the query will never complete if a key maps to values of two or more distinct types (except for projection, which simply returns the result). Although it is technically possible to return the values in Postgres's generic JSON-text datatype and then apply a function to filter out values of the desired type after the fact, the operation not only requires additional user code, but also digging into the rewrite phase of Postgres's abstract syntax tree generation in order to provide the function with information about the type expected.

Neither Query 8 nor Query 9 completed on the EAV system because each ran out of disk space when attempting to execute the query. The times shown are the points at which the system terminated execution.

6.5 Joins and Aggregations

Queries 10 and 11 evaluated the performance of a GROUP BY and a JOIN operation across the four systems, respectively. The results for Query 10 are alongside the results for Queries 1-9 in Figures 6a and 6b, and the results for Query 11 are given separately (for reasons discussed below) in Figure 7. The results for Query 10 resemble the results for the previous 9 queries with the exception of Postgres JSON, which lags behind even the EAV system. Despite JSON being a built-in type, the Postgres optimizer has no understanding of, or statistics on, individual keys stored within the JSON objects. Therefore, when it produces a query plan for the GROUP BY, it miscalculates the selectivity of the GROUP BY predicate and produces a sub-optimal query plan (a HashAggregate instead of a sort). Sinew avoids this pitfall by selectively and automatically materializing columns, which provides the optimizer with a more correct view of the overall data.

For Query 11, Sinew is again the fastest of the SQL options. However, unlike the previous queries, MongoDB lags far behind each of the other three systems and is an order of magnitude slower than Sinew. MongoDB has no native join support, so the query must be implemented in user code using a custom JavaScript extension combined with multiple explicitly defined intermediate collections. The execution is thus not only slow, but also uses a significant amount of disk. In the case of the 64 million record dataset, the query required so much intermediate storage that it could not complete on our benchmark systems. The EAV system also could not complete the join for lack of adequate disk space. As for Query 9 (Section 6.4), the times shown in the graph are the points at which the system terminated execution, which make it clear that, even discounting the lack of disk space, MongoDB and the EAV system lag significantly behind Sinew's performance.

6.6 Updates

As mentioned above, we added a random update task to the NoBench benchmark in order to evaluate the full read-write capabilities of our system. In particular, we ran the following statement, which affects approximately 1 in 10000 records and updates one of the sparse keys generated by NoBench:

  UPDATE test
  SET sparse_588 = 'DUMMY'
  WHERE sparse_589 = 'GBRDCMBQGA======';

Figure 8 shows the results of this experiment. MongoDB does not provide transactional semantics, and therefore has to keep fewer guarantees relative to the PostgreSQL-based systems. Hence, we expected MongoDB to perform the best for this task. However, as explained above, MongoDB's predicate evaluation is 40% slower than Sinew's. The additional overhead of the predicate evaluation associated with this update task outweighed Sinew's overhead to maintain transactional guarantees, and therefore Sinew ended up outperforming MongoDB for this task.

Of the RDBMS-based solutions (which all share the same transactional overhead), the performance characteristics follow fairly naturally from the object mappings. Postgres JSON is slower than Sinew because, despite a fairly similar execution path, the CPU overhead of serializing and deserializing text-formatted JSON is large when compared to Sinew's customized key-value format. The EAV system lags even further behind because any query that touches multiple keys from a single record requires a self-join on the ID of the containing object.

6.7 Discussion

Although MongoDB offers high performance for single-object reads by object ID, it falls short in a number of other areas that severely limit its usefulness as an analytics solution. The lack of native join support can lead to a massive headache, as user-generated joins take nearly an order of magnitude longer than an RDBMS join and require large amounts of scratch space for intermediate results. Thus, despite its utility in systems needing high throughput write operations and single-key object lookups, it is not an ideal platform for analysis of that data.

Of the RDBMS-based alternatives to Sinew, each has significant drawbacks. The EAV system, despite its conceptual elegance, requires large amounts of extra logic in order to provide a transparent SQL experience to the user, and also requires more storage space and self joins during query execution (which reduces performance).

Postgres JSON requires no additional user logic, but in exchange, it has a number of deficiencies that prevent its usage as an analytics system for multi-structured data. Distinct keys that correspond to values of multiple types can lead to runtime exceptions. Array predicates are inexpressible, since JSON array syntax and Postgres array syntax are mutually incompatible, and to our knowledge, Postgres does not provide a built-in mechanism for converting between the two (for our experiments, we used the approximate, but technically incorrect, LIKE predicate over the text representation of the array). Although these deficiencies may be remedied with Postgres's recent announcement of jsonb (a novel binary format), a more systemic deficiency is the opaqueness of the JSON type to the optimizer, which renders the system incapable of producing efficient query plans without significant modifications to the Postgres optimizer.

7. CONCLUSION

In this paper, we have described the architecture and sample implementation of Sinew, a SQL system for storage and analytics of multi-structured data. In order to accommodate evolving data models, Sinew maintains a separate logical and physical schema, with its dynamic physical schema maintained by ongoing schema analysis and an invisible column materialization process. As a system built around an RDBMS, Sinew can take advantage of native database operations and decades of work optimizing complex queries such as joins, in addition to interacting transparently with structured data already stored in the RDBMS. We have built a prototype version of Sinew that outperforms a range of existing systems on both read and update tasks and demonstrates that Sinew offers a promising set of architectural principles.

8. ACKNOWLEDGMENTS

We would like to thank Torsten Grust, Patrick Toole, and the three anonymous reviewers for their thorough and insightful feedback. We would also like to thank Craig Chasseur at the University of Wisconsin-Madison for sharing his NoBench code. This work was sponsored by the NSF under grant IIS-0845643 and by a Sloan Research Fellowship.

9. REFERENCES

[1] D. J. Abadi. Column Stores for Wide and Sparse Data. In Proc. of CIDR, 2007.
[2] A. Abouzied, D. J. Abadi, and A. Silberschatz. Invisible Loading: Access-driven Data Transfer from Raw Files into Database Systems. In Proc. of EDBT, pages 1-10, 2013.
[3] R. Agrawal, A. Somani, and Y. Xu. Storage and Querying of E-Commerce Data. In Proc. of VLDB, 2001.
[4] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In Proc. of SIGMOD, pages 241-252, 2012.
[5] S. Amer-Yahia, F. Du, and J. Freire. A comprehensive solution to the XML-to-relational mapping problem. In Proc. of WIDM, pages 31-38, 2004.
[6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis. PVLDB, 4(12):1272-1283, 2011.
[7] T. Böhme and E. Rahm. Supporting Efficient Streaming and Insertion of XML Data in RDBMS. In Proc. 3rd Int. Workshop on Data Integration over the Web (DIWeb), pages 70-81, 2004.
[8] P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner. Pathfinder: XQuery—the Relational Way. In Proc. of VLDB, pages 1322-1325, 2005.
[9] C. Chasseur, Y. Li, and J. M. Patel. Enabling JSON Document Stores in Relational Systems. In Proc. of WebDB, pages 1-6, 2013.
[10] B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: A SQL Implementation On The MapReduce Framework. In Proc. of VLDB, pages 1318-1327, 2011.
[11] E. Chu, J. Beckmann, and J. Naughton. The case for a wide-table approach to manage sparse relational data sets. In Proc. of SIGMOD, pages 821-832, 2007.
[12] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proc. of SIGMOD, pages 431-442, 1999.
[13] R. Fagin, A. O. Mendelzon, and J. D. Ullman. A simplified universal relation assumption and its properties. ACM Trans. Database Syst., 7(3):343-360, Sept. 1982.
[14] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDBMS. Bulletin of the Technical Committee on Data Engineering, 22(3):27-34, 1999.
[15] D. Maier, J. D. Ullman, and M. Y. Vardi. On the Foundations of the Universal Relation Model. ACM Trans. Database Syst., 9(2):283-308, June 1984.
[16] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-scale Datasets. Commun. ACM, 54(6):114-123, June 2011.
[17] J. Melton, J.-E. Michels, V. Josifovski, K. Kulkarni, P. Schwarz, and K. Zeidenstein. SQL and Management of External Data. SIGMOD Rec., 30(1):70-77, Mar. 2001.
[18] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In Proc. of VLDB, pages 302-314, 1999.

APPENDIX

A. SERIALIZATION COMPARISON

In Section 4.1, we highlighted a number of shortcomings of existing serialization formats, and in this section, we verify our claims by comparing Sinew's serialization format against two increasingly popular serialization formats, Avro and Protocol Buffers. We compare their performance for serialization, deserialization (reassembling the input string), single-key extraction, and multiple-key extraction. For completeness, we also compare the size of the serialized data without any compression algorithms applied to the serialization output. We performed our experiments on a dataset of 1.6 million NoBench objects (using the same data and configuration as the benchmarks in Section 6).

Our results are shown in Table 4.

Task                    | Sinew | Protocol Buffers | Avro    | Original
------------------------|-------|------------------|---------|---------
Serialization (s)       | 39.83 | 83.68            | 394.24  |
Deserialization (s)     | 32.56 | 45.01            | 1101.26 |
Extraction, 1 key (s)   |  0.90 | 17.11            | 108.89  |
Extraction, 10 keys (s) |  8.40 | 21.03            | 112.91  |
Size (GB)               |  0.57 |  0.47            | 1.93    | 0.90

Table 4: Comparison of Serialization Formats

In brief, Sinew's data format outperforms both Avro and Protocol Buffers on all tasks except size, where Protocol Buffers achieve a slightly smaller data representation due to more aggressive bit-packing.

Avro's poor performance is explained by the fact that Avro, unlike Protocol Buffers, has no primitive notion of 'optional' attributes. Instead, Avro relies on unions to represent optional attributes (e.g. [NULL, int] would represent an optional integer value). This requires that Avro store NULLs explicitly (since it expects a value for every key), which bloats its serialization size and destroys performance.

Protocol Buffers, on the other hand, present a more viable alternative to Sinew's format, but they also fall short on the tasks of deserialization and key extraction. In the case of deserialization, we see that Sinew outperforms Protocol Buffers by approximately 50%. We can attribute this to the fact that whereas Sinew performs all of its operations directly on the serialized binary data, Protocol Buffers operate on an intermediate logical representation of their data. Although one could easily modify Protocol Buffers to perform deserialization directly from the physical, binary object, their performance on key extractions is fundamentally bounded by the fact that random attribute reads are not possible due to the sequential nature of their format. Although Protocol Buffers store attributes in a particular order (so they can 'short-circuit' a lookup of a nonexistent key once the deserializer has passed the key's expected location), they still must traverse keys serially up to that point. Sinew's format, on the other hand, is hyper-optimized for random key reads, since it includes a per-record header with attribute IDs and offset information. This is the same reason we see the relative performance gap fall as we extract a greater number of keys at a time (1 vs. 10). Whereas Sinew's performance is linear in the number of keys extracted (until a threshold is reached at which it becomes more performant to use an intermediate logical representation similar to Protocol Buffers'), Protocol Buffers have already paid the up-front cost of reading the binary, and further key extractions are a simple matter of a single pointer lookup.

B. VIRTUAL COLUMN OVERHEAD

One concern with serializing all virtual columns into a single column reservoir is that extracting relevant data from the reservoir may be more expensive than extracting data from physical columns. We therefore compared the performance of queries over data stored in virtual columns using our serialization format against the same queries over the same data, where the relevant attributes for the queries are stored in physical DBMS columns instead of as key-value pairs inside the column reservoir (the non-relevant attributes are still stored in the reservoir). As in Section 3.1.1, our benchmark dataset comprised 10 million tweets, and our benchmark system was the same as in our experiments in Section 6.

Our results, summarized in Table 5, show that our object serialization introduces very little execution overhead.

Query                                                    | Virtual | Physical
---------------------------------------------------------|---------|---------
SELECT "user.id" FROM tweets;                            | 14.40   | 13.57
SELECT * FROM tweets WHERE "user.lang" = 'en';           | 63.59   | 63.37
SELECT * FROM tweets ORDER BY "user.friends_count" DESC; | 74.59   | 73.55

Table 5: Virtual vs. Physical Column Performance

For each query, we saw less than a 5% reduction in performance when the query involved a reference to a virtual column instead of an equivalent physical column. For the query involving a single projection, the costs of processing the query are identical (whether "user.id" is stored in a physical column or a virtual column) except for actually retrieving the value of the "user.id" attribute for each row that is scanned. If "user.id" is stored as a physical column, it requires one memory dereference to locate the attribute data within the tuple. If it is instead stored as a virtual column (i.e. serialized in the column reservoir), it requires one memory dereference to locate the attribute corresponding to the column reservoir, followed by a (cache-efficient) search within the header to find the attribute and its offset, and one memory dereference to locate the attribute value within the object reservoir. Thus, accessing a virtual column involves just one additional memory dereference, a binary search within the header, and the overhead of calling the extract UDF described in Section 3.2.2. These additional costs are small relative to the shared fixed costs of query processing (e.g. row iteration and query result collection).

As the fixed costs of query processing increase, the relative overhead of accessing data from virtual columns instead of physical columns decreases, since virtual column extraction is amortized over an increased execution time. Hence, the performance difference for the selection and ORDER BY queries was smaller (<2%) than that of the column projection. Although this experiment indicates that the overhead of virtual column extraction is small (especially relative to the rest of query processing), this overhead is still noticeable, and it increases with the number of attribute extractions per query. Furthermore, recall from Section 3.1.1 that a large issue with storing data in virtual columns is that attribute statistics are hidden from the underlying database system (given our requirement that we not modify the underlying DBMS code), and this can result in poor optimization of queries and reduced performance. Thus, while the small overhead of our custom serialization format is promising, a hybrid architecture that materializes certain attributes (see Section 3.1.4) into physical columns is still necessary.
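The random-access layout described above, a per-record header of attribute IDs and offsets followed by the raw attribute bytes, can be sketched as follows. This is a minimal illustration under our own assumptions about field widths, helper names, and encoding; it is not Sinew's actual on-disk format.

```python
import struct
from bisect import bisect_left

# Illustrative record layout (not Sinew's real wire format):
#   [count: u32] [count x (attr_id: u32, offset: u32, length: u32)] [attribute bytes]
# Header entries are sorted by attribute ID so lookups can binary-search.

def serialize(record, attr_ids):
    """record: dict of attribute name -> bytes; attr_ids: name -> integer ID."""
    entries = sorted((attr_ids[k], v) for k, v in record.items())
    body = b"".join(v for _, v in entries)
    parts = [struct.pack("<I", len(entries))]
    offset = 0
    for attr_id, v in entries:
        parts.append(struct.pack("<III", attr_id, offset, len(v)))
        offset += len(v)
    return b"".join(parts) + body

def extract(buf, attr_id):
    """Random-access read of one attribute via binary search on the header."""
    (n,) = struct.unpack_from("<I", buf, 0)
    ids = [struct.unpack_from("<III", buf, 4 + 12 * i)[0] for i in range(n)]
    i = bisect_left(ids, attr_id)
    if i == n or ids[i] != attr_id:
        return None  # key absent: no explicit NULL is stored for it
    _, off, length = struct.unpack_from("<III", buf, 4 + 12 * i)
    body_start = 4 + 12 * n
    return buf[body_start + off : body_start + off + length]
```

Because the header is sorted by attribute ID, `extract` can locate any single key without scanning the record body, which is what makes single-key extraction cheap relative to sequential formats.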

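For contrast, the sequential, tag-ordered traversal attributed to Protocol Buffers above can be illustrated with a plain tag-length-value stream. This is a deliberately simplified sketch, not Protocol Buffers' real wire format; it shows why a lookup must walk earlier fields and can only short-circuit once it has passed the requested tag.

```python
import struct

# Illustrative sequential encoding (not real Protocol Buffers):
#   repeated [tag: u32] [length: u32] [value bytes], written in ascending tag order.
# A lookup walks fields serially and short-circuits once tags exceed the target.

def serialize_sequential(record, tags):
    """record: dict of field name -> bytes; tags: name -> integer tag."""
    out = []
    for tag, value in sorted((tags[k], v) for k, v in record.items()):
        out.append(struct.pack("<II", tag, len(value)) + value)
    return b"".join(out)

def extract_sequential(buf, tag):
    """Linear scan: O(number of preceding fields) per lookup."""
    pos = 0
    while pos < len(buf):
        t, length = struct.unpack_from("<II", buf, pos)
        pos += 8
        if t == tag:
            return buf[pos : pos + length]
        if t > tag:
            return None  # short-circuit: tags are ascending, so the key is absent
        pos += length
    return None
```

The contrast with a header-based layout is the cost model: here each extraction pays for every field stored before the target, whereas a sorted per-record header supports direct random reads.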