Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 4th, 2018

Low Latency Query Responses over Heterogeneous Data Systems

Manoj Muniswamaiah, Dr. Tilak Agerwala and Charles C. Tappert
Seidenberg School of CSIS, Pace University, White Plains, New York

Abstract—Organizations often need to maintain large heterogeneous systems which have different programming models, and the datasets stored in each of them vary. Financial data can be stored in relational databases, user sessions in a key-value store for faster lookup, recommendation data in a graph database and analytical data in columnar stores for read-heavy queries. Trying to fit all datasets in a single database could have adverse performance effects: “One Size Does Not Fit All”. The main focus of our research is finding an adequate solution using materialized views to improve the response time of queries across different data systems by leveraging a common materialization storage technique.

Index Terms—Polystore, Data Systems, Analytical Query, Materialization.

I. INTRODUCTION

There has been a rapid and continuous increase in the volume, variety, and velocity of data being used by organizations for decision making and improved value capture. In recent years many different data management systems with different data models have been introduced into the marketplace: columnar databases designed for read-heavy analytical queries; OLTP databases becoming more main-memory oriented; numerous NoSQL data stores for horizontal scaling; HTAP systems for both online transactional and analytical capabilities; and NewSQL data systems with SQL interfaces and the scalability of NoSQL.

Data stores such as key-value, graph, document and columnar stores have been designed for specific needs. Specialized engines offer better performance on the workloads they target, which implies that “one size does not fit all”. No one database performs well on all kinds of data. A relational database works fine on structured data but its performance decreases on other kinds of datasets. Hence curated data is stored in different databases: structured data in a relational database, historical data in an array database, relationship data in a graph database and semi-structured data in a document data store. The term polyglot persistence summarizes this dynamic and is used to mean that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application [15]. Also, one needs to learn the different query languages which are used for these data stores. Methods such as trying to fit all data into a single storage system are not effective solutions. Queries that work across different datasets are often limited by the incompatibility of the systems or the difficulty of translating data from one system into another. Analytical queries can cross boundaries between different data stores [1]. Finally, having different database systems results in having connectors across different systems, leading developers to do a lot of work and adding to the cost of the organization.

The above considerations led to the development of polystore systems, which are built on top of different, heterogeneous and integrated storage systems [2].

A polystore consists of multiple systems and differs from a distributed relational database, which consists of replicas. In a polystore, multiple engines are accessed separately through a common interface. Federated relational databases are managed by individual administration teams, whereas a polystore is managed as a single integrated unit [2]. Along with polystore systems, there is a need for a unifying framework that supports the functionality of the underlying data stores and provides quicker query responses.

II. LITERATURE REVIEW

MATERIALIZATION: There are several techniques to improve query response time, such as indexing, partitioning of data and materialization. An index is a data structure which improves the performance of data retrieval in read-intensive queries. Indexes can be used on one or more fields of the database and are like a dictionary for the lookup of the data. One of the most widely used index structures is the B-tree, which keeps data sorted and allows sequential access [3].

The use of materialized views, derived from the base tables, is one of the most effective ways to improve query response time. Materialized views pre-compute and store the aggregated results from the base tables. Consistency and freshness are maintained by updating the views whenever the base table changes. Appropriate views need to be selected for

materialization, so that queries have reduced response time. A materialized view can be the join of two tables or a set of complex aggregate functions, and it consumes storage space. Freshness of the views needs to be maintained once they are created. Materialization requires disk storage, which leads to a spatial cost. When materialized views are created, storage, query and maintenance costs need to be considered [4].

When a materialized view is created, the database scans the entire base table, executes the query and creates a copy of the result in a temporary table which is persisted to disk. When we query, the materialized view data is read from disk like a table; if it contains the query result computed ahead of time, the result is returned immediately. A few databases update these materialized views automatically, some require a manual refresh and some do not support materialized views.

Materialized views are used in Oracle databases where performance and quicker query response are critical and complex SQL queries are executed against large tables. Queries are rewritten to execute against pre-aggregated tables rather than the base table, which speeds up the query response [16].

[Figure: Query rewrite for Materialized Views]

DATABASE TECHNOLOGIES. Spark SQL is a module in Apache Spark that integrates with relational databases and lets users run complex analytical queries. It uses DataFrames, which perform relational operations on external data sources. DataFrames are collections of structured records which can be manipulated and materialized in memory, but Spark SQL lacks a mechanism to persist them on disk and keep them up to date [5].

Apache Calcite is a unifying framework for parsing and planning queries on different datasets. It allows for querying data which is resident in non-traditional databases through a SQL interface. It includes many features similar to typical databases but lacks some functionality, such as storage of data, a repository to store metadata information and a common materialized view for all data stores. Multiple Calcite adaptors have their own notion of materialized views [6].

BigDAWG is a polystore designed to support multiple databases. It consists of database engines, islands, middleware and an interface for visualization applications. It provides location independence (a query can be re-routed to the desired engine) and semantic completeness (a query can make use of all underlying database features). An island in BigDAWG consists of data models and operations which provide location independence with its associated databases. A shim acts as a connector, translating queries defined by operations in an island into the native language of the respective storage engines. One of the key aspects of the polystore system is to process data on the storage engine for which it is best suited. To achieve this, BigDAWG has implemented the “cast” feature, by which data can be converted into different formats. BigDAWG does not support materialized views [7].

Myria is a data management and analytical system which focuses on usability and efficiency. Myria has its own execution engine called MyriaX. It also generates query plans for different backend engines. Users can query using MyriaL, which is a relational query language. Myria can be operated as a cloud service. However, Myria does not provide support for materialized views [8].

Apache Kylin is an open source analytical engine that provides an SQL interface and analytics on Hadoop for large datasets. Kylin executes queries on pre-calculated cubes which are built offline. Kylin uses a map-reduce process to build the cubes from source data. Recently Kylin has also been speeding up the cube building process using Apache Spark. Kylin is tightly coupled with Hive as a data source and HBase as data storage. Requests originate from a SQL tool or from third-party API services. Kylin’s RESTful service intercepts the requests and accesses the query engine. If the target data which the query processes can be served from the pre-built cubes then the results are returned quickly; otherwise the query is routed to execute on Hadoop. Kylin supports neither relational and NoSQL source data stores for building the cubes nor the creation and maintenance of materialized views [9].

Apache Hive is used for data querying, analysis and summarization in a data warehouse. It converts SQL-like queries to map-reduce jobs for the processing of large volumes of data. Hive is best suited for batch jobs rather than real-time data processing. HiveQL is the query language used in Hive. HiveQL supports map-reduce scripts which can be plugged into the queries. HiveQL does not support Online Transaction Processing or materialized views [10].

Apache Lens integrates Apache Hadoop with traditional data warehouses so that they appear as one layer. It provides a unified layer for analytics. It provides a high-level SQL-like language called

CubeQL which queries data sets in the cubes. It uses a REST server to query data and make schema changes. Apache Lens does not support the ability to create materialized views. However, creating data cubes from materialized views would reduce data cube build time [11].

Database                    Native View Support    Native Materialized Views Support
Altibase                    Yes                    No
Apache Derby                Yes                    No
ClusterixDB                 Yes                    No
DB2                         Yes                    Yes
EXASolution                 Yes                    No
Firebird                    Yes                    No
H2                          Yes                    No
Informix Dynamic Server     Yes                    No
Ingres                      Yes                    No
InterBase                   Yes                    No
Linter SQL RDBMS            Yes                    Yes
MariaDB                     Yes                    No
MaxDB                       Yes                    No
Microsoft SQL Server        Yes                    Yes
MonetDB                     Yes                    No
MySQL                       Yes                    No
OpenBase SQL                Yes                    Yes
Oracle                      Yes                    Yes
Oracle Rdb                  Yes                    Yes
OpenLink Virtuoso           Yes                    Yes
PostgreSQL                  Yes                    Yes
Raima Database Manager      Yes                    No
SolidDB                     Yes                    No

Table 1: Databases supporting Materialized View [17]

From the above table it can be inferred that materialized views are not supported by all databases. When organizations use databases which do not support materialized views, there is a need for a framework that provides quicker query responses.

III. PROBLEM STATEMENT

Multiple databases are used within an organization for various specialized tasks, and each requires database-specific query optimization. Materialized views are one such optimization; they are used for faster query response but are not supported in all databases. Hence there is a need for a unifying framework in polyglot persistence which supports a common and persistent materialized view for all databases for lower latency query responses.

Queries which return aggregates or summaries are frequently used in user applications. Some of these analytical queries are not fast enough. Query results are often cached, but the cache needs to be invalidated and populated again. The native materialized view support offered in PostgreSQL provides good query response for analytical queries [19].

Kodiak is a distributed analytical data platform which uses materialization to serve analytical queries. It consists of many materialized views over petabytes of data. Kodiak reports query latency more than three orders of magnitude lower than executing the same queries on base tables, while using fewer resources to run the same workload [18].

Within an organization, using different databases for various tasks is a common paradigm. Some of the databases in use might have native support for materialized views while others may not. Materialization is a common technique to improve query response.

Our problem statement is, “In a polystore environment, establish the feasibility of having a unifying framework that routes queries to materialized views which are stored, maintained, updated, statically and dynamically, in a common persistent data store, to achieve low latency query responses”.

IV. KEY IDEA

Materialized views are pre-computed data in databases. Instead of computing a query from scratch from the base tables, the database uses the results that have already been computed and stored. This reduces query latency, which benefits various applications such as analytics, data mining and web database caching.

Analytical queries help us answer questions on which business decisions can be based, such as: “What is the hourly pick-up, drop-off of passengers?” “How do ridesharing providers like Uber affect taxis in certain places?” “Do passengers at the airport use more Uber or taxis?” “How long does it take to get to the airport from certain places?” “What is the average travel time between two places?” “How does weather affect taxis and Uber ridership?” “Do passengers pay with credit card or cash?” “Is Uber overtaking taxis in certain places?” [14]. We will demonstrate feasibility by implementing the architecture described in the Architecture section and creating a solution that can answer these questions faster than they can be addressed without materialization technologies.

ARCHITECTURE
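As a rough illustration of the idea stated in the problem statement (answer a query from a common materialized view store when a matching view exists, and fall back to the source data store otherwise), the following Python sketch uses two in-memory SQLite databases. It is not the implementation proposed in this paper: the table names `trips` and `mv_avg_minutes`, the functions `refresh_view` and `avg_minutes`, and the sample rows are all hypothetical, and SQLite merely stands in for the query engine and the data stores.

```python
import sqlite3

# Hypothetical base data store with a few taxi-style trips (illustrative values only).
base = sqlite3.connect(":memory:")
base.execute("CREATE TABLE trips (pickup TEXT, dropoff TEXT, minutes REAL)")
base.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("Midtown", "JFK", 50.0), ("Midtown", "JFK", 70.0), ("Harlem", "LGA", 30.0)],
)

# Hypothetical common store that holds materialized views for all data stores.
common = sqlite3.connect(":memory:")
common.execute(
    "CREATE TABLE mv_avg_minutes (pickup TEXT, dropoff TEXT, avg_minutes REAL)"
)


def refresh_view():
    """Rebuild the materialized view from the base table (full refresh)."""
    rows = base.execute(
        "SELECT pickup, dropoff, AVG(minutes) FROM trips GROUP BY pickup, dropoff"
    ).fetchall()
    common.execute("DELETE FROM mv_avg_minutes")
    common.executemany("INSERT INTO mv_avg_minutes VALUES (?, ?, ?)", rows)


def avg_minutes(pickup, dropoff):
    """Return (average, source): use the view if it covers the pair, else the base table."""
    hit = common.execute(
        "SELECT avg_minutes FROM mv_avg_minutes WHERE pickup = ? AND dropoff = ?",
        (pickup, dropoff),
    ).fetchone()
    if hit is not None:
        return hit[0], "materialized view"
    # No matching view row: compute the aggregate directly on the base table.
    row = base.execute(
        "SELECT AVG(minutes) FROM trips WHERE pickup = ? AND dropoff = ?",
        (pickup, dropoff),
    ).fetchone()
    return row[0], "base table"
```

Calling `avg_minutes("Midtown", "JFK")` before `refresh_view()` is answered from the base table; after `refresh_view()` the same call is served from the view. A full refresh is only the simplest maintenance policy; keeping views fresh at scale would likely require incremental maintenance.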

Data requests originate from SQL tools. The query engine intercepts each request and redirects it to the common storage where the materialized views are stored; if the datasets accessed in the query match, the result is returned immediately. Otherwise the query is re-routed to execute on the source data stores.

SQL-Based Tools: Includes any third-party tool which uses a JDBC driver to connect to the engine.
R: Data manipulation can be done using R for statistics and graphics. It helps in analyzing data at a fine-grained level.
Query Engine: An extension to the Apache Calcite open source framework that includes SQL parsing and query planning.
Metadata: Consists of the schema, subschema and table information of the underlying data stores.
Materialized View Builder: Creates materialized views on the specified base table. It also keeps the materialized views up to date whenever the base table changes.
Common Storage for Materialized Views: The common storage where all the materialized views created by the builder are stored. It contains the materialized views from all the participating databases in a polyglot environment.
RDBMS: Includes any relational database, with or without built-in support for materialized views.
NoSQL: Includes different flavors of NoSQL such as key-value, document or columnar data stores.
Hive: A SQL-like interface built on top of Apache Hadoop for data summarization and analysis.

In this thesis we use datasets provided by the NYC Taxi and Limousine Commission (TLC) under the authorization of the Taxicab & Livery Passenger Enhancement Programs. They include the data collected for yellow and green taxi trips, which provides pick-up and drop-off locations, fares, distance covered, payment types and passenger counts [12]. Uber has also released some of its trip data [13].

We will use Hive, which uses the Hadoop file system, to store the table which contains detailed trip data; NoSQL and RDBMS stores will contain lookup and transaction-related data. NewSQL will be used to hold the materialized views. This thesis will extend the Apache Calcite framework to support querying various data systems and provide a common storage data system where materialized views can be stored. Submitted queries will first look up the materialized tables for a response; if the materialized views do not meet the query criteria, the queries will be executed against the base table. We will also execute statistical and data manipulation analytical queries directly on the materialized views to get quicker responses and display the results graphically using R.

V. CONCLUSION

This research helps in achieving lower query response times by using the materialized view optimization technique. It is currently in the implementation stage. When complete, we will run various types of queries against the original base table and the materialized table to compare the response times.

REFERENCES

[1] http://wp.sigmod.org/?p=1629
[2] https://bigdawg.mit.edu
[3] https://en.wikipedia.org/wiki/B-tree
[4] Bernardino, Jorge, and Henrique Madeira. "Data warehousing and OLAP: improving query performance using distributed computing." Proceedings of CAiSE’00, Sweden (2000)
[5] Armbrust, Michael, et al. "Spark sql: Relational data processing in spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015
[6] Begoli, Edmon, et al. "Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources." arXiv preprint arXiv:1802.10233 (2018)
[7] Gadepally, Vijay, et al. "The bigdawg polystore system and architecture." High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 2016
[8] http://myria.cs.washington.edu
[9] http://kylin.apache.org
[10] https://hive.apache.org/
[11] https://lens.apache.org/
[12] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
[13] https://movement.uber.com/cities?lang=en-US
[14] http://toddwschneider.com/posts/taxi-uber-lyft-usage-new-york-city/
[15] http://www.jamesserra.com/archive/2015/07/what-is-polyglot-persistence
[16] http://www.dba-oracle.com/art_mv.htm
[17] https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
[18] Liu, Shaosu, et al. "Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data." Proceedings of the VLDB Endowment 9.13 (2016): 1269-1280
[19] https://hashrocket.com/blog/posts/materialized-view-strategies-using-
