The Bigdawg Polystore System and Architecture

The BigDAWG Polystore System and Architecture Vijay Gadepally∗y Peinan Cheny Jennie Dugganz Aaron Elmorex Brandon Haynesk Jeremy Kepner∗y Samuel Maddeny Tim Mattson{ Michael Stonebrakery ∗MIT Lincoln Laboratory yMIT CSAIL zNorthwestern University xUniversity of Chicago {Intel Corporation kUniversity of Washington Abstract—Organizations are often faced with the challenge leveraged based on the data engine that provides the highest of providing data management solutions for large, heteroge- performance response to a particular query. nous datasets that may have different underlying data and Such analytics on complex datasets call for the develop- programming models. For example, a medical dataset may have unstructured text, relational data, time series waveforms and ment of a new generation of federated databases that support imagery. Trying to fit such datasets in a single data management seamless access to the different data models of database or system can have adverse performance and efficiency effects. storage engines. We refer to such a system as a polystore in As a part of the Intel Science and Technology Center on Big order to distinguish it from traditional federated databases that Data, we are developing a polystore system designed for such largely supported access to multiple engines using the same problems. BigDAWG (short for the Big Data Analytics Working Group) is a polystore system designed to work on complex data model. problems that naturally span across different processing or As a part of the Intel Science and Technology Center (ISTC) storage engines. BigDAWG provides an architecture that supports on Big Data, we are developing the BigDAWG, short for Big diverse database systems working with different data models, Data Analytics Working Group, polystore system. The Big- support for the competing notions of location transparency and DAWG stack is designed to support multiple data models, real- semantic completeness via islands and a middleware that provides a uniform multi–island interface. Initial results from a prototype time streaming analytics, visualization interfaces, and multi- of the BigDAWG system applied to a medical dataset validate ple databases. The current version of BigDAWG [2] shows polystore concepts. In this article, we will describe polystore significant promise and has been used to develop a series of databases, the current BigDAWG architecture and its application applications for the MIMIC II dataset. The BigDAWG system on the MIMIC II medical dataset, initial performance results and supports multiple data stores; provides an abstraction of data our future development plans. and programming models through “islands”; a middleware and API that can be used for query planning, optimization I. INTRODUCTION and execution; and support for applications, visualization and Enterprises today encounter many types of databases, data, clients. Initial results of applying the BigDAWG system to and storage models. Developing analytics and applications that diverse data such as medical imagery or clinical records has work across these different modalities is often limited by the shown the value of a polystore system in developing new incompatibility of systems or the difficulty of creating new solutions for complex data management. connectors and translators between each one. For example, The remainder of the article is organized as follows: Sec- consider the MIMIC II dataset [1] which contains deidentified tion II expands on the concept of a polystore databases and the health data collected from thousands of critical care patients in execution of polystore queries. Section III describes the current an Intensive Care Unit (ICU). This publicly available dataset BigDAWG architecture and its application to the MIMIC II (http://mimic.physionet.org/) contains structured data such as dataset. Section IV describes performance results on an initial demographics and medications; unstructured text such as doc- BigDAWG implementation. Finally, we conclude and discuss tor and nurse reports; and time–series data of physiological future work in Section V. signals such as vital signs and electrocardiogram (ECG). Each II. POLYSTORE DATABASES of these components of the dataset can be efficiently organized into database engines supporting different data models. For With the increased interest in developing storage and man- example, the structured data in a relational database, the text agement solutions for disparate data sources coupled with notes in a key-value or graph database and the time–series our belief that “one size does not fit all” [3], there is a data in an array database. Analytics of the future will cross renewed interest in developing database management systems the boundaries of a single data modality, such as correlating (DBMSs) that can support multiple query languages and information from a doctor’s note against the physiological complete functionality of underlying database systems. Prior measurements collected from a particular sensor. Further, the work on federated databases such as Garlic [4], IBM DB2 [5] same dataset may be stored in different data engines and and others [6] have demonstrated the ability to provide a single interface to disparate DBMSs. Other related work in paral- The corresponding author, Vijay Gadepally, can be reached at lel databases [7] and computing [8], [9] have demonstrated vijayg [at] ll.mit.edu the high performance that can be achieved by making use Visualizations Clients Applications Time taken for opera9ons in SciDB and PostGRES 100000 PostGRES - Count Entries BigDAWG Common Interface/API SciDB - Count Entries 10000 Relational Island Array Island Island … PostGRES - Discrete Entries SciDB - Discrete Entries 1000 Shim Shim Shim Shim Shim 100 Time Taken (milliseconds) Cast Cast 10 1000 10000 100000 1000000 10000000 Rela,onal SQL NoSQLArray NewSQL… Number of Database Entries DB DB DB Fig. 2: The BigDAWG polystore architecture consists of four Fig. 1: Time taken for various database operations in dif- layers - engines, islands, middleware/API and applications. ference database engines. The dashed lines correspond to a count operation in SciDB and PostGRES and the solid lines correspond finding the number of discrete entries in SciDB perform a matrix multiplication operation) may benefit from and PostGRES. For the count operation, SciDB outperforms performing part of the operation in PostGRES (extracting PostGRES whereas PostGRES outperforms SciDB for finding discrete entries) and the remaining part (matrix multiplication) the number of discrete entries. in SciDB. Extending the concept of federated and parallel databases, we propose a “polystore” database. Polystore databases can of replication, partitioning and horizontally scaled hardware. harness the relative strengths of underlying DBMSs. Unlike Many of the federated database technologies concentrated on federated or parallel databases, polystore databases are de- relational data. With the influx of different data sources such as signed to simultaneously work with disparate database en- text, imagery, and video, such relational data models may not gines and programming/data models while supporting com- support high performance ingest and query for these new data plete functionality of underlying DBMSs. In fact, a polystore modalities. Further, supporting the types of analytics that users solution may include federated and/or parallel databases as wish to perform (for example, a combination of convolution of a part of the overall solution stack. In a polystore solution, time series data, gaussian filtering of imagery, topic modeling different components of an overall dataset can be stored in the of text,etc.) is difficult within a single programming or data engine(s) that will best support high performance ingest, query model. and analysis. For example, a dataset with structured, text and Consider the simple performance curve of Figure 1 which time-series data may simultaneously leverage relational, key- describes an experiment where we performed two basic oper- value and array databases. Incoming queries may leverage one ations – counting the number of entries and extracting discrete or more of the underlying systems based on the characteristics entries – on a varying number of elements. As shown in the of the query. For example, performing a linear algebraic op- figure, for counting the number of entries, SciDB outperforms eration on time-series data may utilize just an array database; PostGRES by nearly an order of magnitude. We see the performing a join between time-series data and structured data relative performance reversed in the case of extracting discrete may leverage array and relational databases respectively. entries. In order to support such expansive functionality, the Big- Many time-series, image or video storage systems are most DAWG polystore system (Figure 2) utilizes a number of fea- efficient when using an array data model [10] which provides tures. “Islands” provide users with a number of programming a natural organization and representation of data. Analytics on and data model choices; “Shim” operations allow translation these data are often developed using linear algebraic operations of one data model to another; and “Cast” operations allow for such as matrix multiplication. In a simple experiment in which the migration of data from one engine or island to another. we performed matrix multiplication in PostGRES and SciDB, We go into greater depth of the BigDAWG architecture in we observed nearly three orders

Load more