Low Latency Query Responses Over Heterogeneous Data Systems

Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 4th, 2018

Manoj Muniswamaiah, Dr. Tilak Agerwala and Charles C. Tappert
Seidenberg School of CSIS, Pace University, White Plains, New York

Abstract: Organizations often need to maintain large heterogeneous database systems that have different programming models, and the datasets stored in each of them vary. Financial data can be stored in relational databases, user sessions in a key-value store for faster lookup, recommendation data in a graph database and analytical data in columnar stores for read-heavy queries. Trying to fit all datasets in a single database could have adverse performance effects: “One Size Does Not Fit All”. The main focus of our research is finding an adequate solution using materialized views to improve the response time of queries across different data systems by leveraging a common materialization storage technique.

Index Terms: Polystore, Data Systems, Analytical Query, Materialization.

I. INTRODUCTION

There has been a rapid and continuous increase in the volume, variety, and velocity of data being used by organizations for decision making and improved value capture. In recent years many different data management systems with different data models have been introduced into the marketplace: columnar databases designed for read-heavy analytical queries; OLTP databases becoming more main-memory oriented; numerous NoSQL data stores for horizontal scaling; HTAP systems for both online transactional and analytical capabilities; and NewSQL data systems with SQL interfaces and the scalability of NoSQL.

Data stores such as key-value, graph, document and columnar stores have been designed for specific needs. Specialized engines offer performance advantages, which implies that “one size does not fit all”. No single database performs well on all kinds of data. A relational database works well on structured data, but its performance decreases on other kinds of datasets. Hence curated data is stored in different databases: structured data in a relational database, historical data in an array database, relationship data in a graph database and semi-structured data in a document data store. The term polyglot persistence summarizes this dynamic and is used to mean that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application [15]. Also, one needs to learn the different query languages used by these data stores. Data integration methods like replication or trying to fit all data into a single store are not effective solutions. Queries that work across different datasets are often limited by the incompatibility of the systems or the difficulty of translating data from one system into another. Analytical queries can cross boundaries between different data stores [1]. Finally, having different database systems results in having connectors across different systems, creating a lot of work for developers and adding to the cost of the organization.

The above considerations led to the development of polystore systems, which are built on top of different, heterogeneous and integrated storage systems [2].

A polystore system consists of multiple systems and is different from distributed relational databases, which consist of replicas. In a polystore, multiple engines are accessed separately through a common interface. Federated relational databases are managed by individual administration teams, whereas a polystore is managed as a single integrated unit [2]. Along with polystore systems, there is a need for a unifying framework that supports the functionality of the underlying data stores and provides quicker query responses.

II. LITERATURE REVIEW

MATERIALIZATION: There are several techniques to improve query response time, such as indexing, partitioning of data and materialization. An index is a data structure that improves the performance of data retrieval in read-intensive queries. Indexes can be defined on one or more fields of the database and act like a dictionary for looking up data. One of the most widely used index structures is the B-tree, which keeps data sorted and allows sequential access [3].

The use of materialized views, derived from the base table, is the most effective way to improve query response time. Materialized views pre-compute and store the aggregated results from the base tables. Consistency and freshness are maintained by updating the view whenever the base table changes. Appropriate views need to be selected for materialization for queries to have reduced response time. A materialized view can be the join of two tables or a complex aggregate function, and it consumes storage space. The freshness of the views needs to be maintained once they are created. Materialization requires disk storage, which leads to a spatial cost. When materialized views are created, storage, query and maintenance costs need to be considered [4].

When a materialized view is created, the database scans the entire base table, executes the query and creates a copy of the result in a temporary table which is persisted to disk. When we query, the materialized view data is read from disk like a table; if it contains the query result computed ahead of time, the result is returned immediately. A few databases update these materialized views automatically, some require a manual refresh and some do not support materialized views.

Materialized views are used in Oracle databases where performance and quick query response are critical and complex SQL queries are executed against large tables. Queries are rewritten to execute against pre-aggregated tables rather than the base table, which speeds up the query response [16].

[Figure: Query rewrite for Materialized Views]

DATABASE TECHNOLOGIES. Spark SQL is a module in Apache Spark that integrates relational processing with Spark's functional programming API and lets users run complex analytical queries. It uses DataFrames, which perform relational operations on external data sources. DataFrames are collections of structured records which can be manipulated and materialized in memory, but Spark SQL lacks a mechanism to persist them on disk and keep them up to date [5].

Apache Calcite is a unifying framework for parsing and planning queries on different datasets. It allows querying data that resides in non-traditional databases through a SQL interface. It includes many features of typical databases but lacks some functionality, such as storage of data, a repository to store metadata and a common materialized view for all data stores. Multiple Calcite adapters have their own notion of materialized views [6].

BigDAWG is a polystore designed to support multiple databases. It consists of database engines, islands, middleware and an interface for visualization applications. It provides location independence (a query can be re-routed to the desired engine) and semantic completeness (a query can make use of all underlying database features). An island in BigDAWG consists of data models and operations which provide location independence with its associated databases. A shim acts as a connector, translating queries defined by operations in an island into the native language of the respective storage engines. One of the key aspects of a polystore system is to process data on the storage engine best suited to it. To achieve this, BigDAWG has implemented the “cast” feature, by which data can be converted into different formats. BigDAWG does not support materialized views [7].

Myria is a data management and analytics system which focuses on usability and efficiency. Myria has its own execution engine called MyriaX. It also generates query plans for different backend engines. Users can query using MyriaL, which is a relational query language. Myria can be operated as a cloud service. However, Myria does not provide support for materialized views [8].

Apache Kylin is an open source analytical engine that provides a SQL interface and analytics on Hadoop for large datasets. Kylin executes queries on pre-calculated cubes which are built offline. Kylin uses a MapReduce process to build the cubes from source data. More recently, Kylin has also sped up cube building using Apache Spark. Kylin is tightly coupled with Hive as a data source and HBase as data storage. Requests originate from a SQL tool or from third-party API services. Kylin’s RESTful service intercepts the requests and accesses the query engine. If the query can be answered from the pre-built cubes, the results are returned quickly; otherwise the query is routed to execute on Hadoop. Kylin supports neither relational and NoSQL source data stores for building the cubes nor the creation and maintenance of materialized views [9].

Apache Hive is used for data querying, analysis and summarization in a data warehouse. It converts SQL-like queries to MapReduce jobs for processing large volumes of data. Hive is best suited for batch jobs rather than real-time data processing. HiveQL is the query language used in Hive. HiveQL supports MapReduce scripts which can be plugged into the queries. HiveQL does not support online transaction processing or materialized views [10].

Apache Lens integrates Apache Hadoop with traditional data warehouses to appear as one layer. It provides a unified layer for analytics.
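
The materialized-view lifecycle described under MATERIALIZATION (scan the base table, persist the aggregated result, read it back like a table, refresh it when the base table changes) can be illustrated with a minimal sketch. SQLite has no native materialized views, so this example emulates one with an ordinary table and a manual refresh. The table names `sales` and `mv_sales_by_region` and the helpers `build_view` and `read_view` are hypothetical and not taken from the paper.

```python
import sqlite3


def build_view(conn):
    """(Re)build the emulated materialized view by scanning the base table."""
    conn.execute("DROP TABLE IF EXISTS mv_sales_by_region")
    conn.execute(
        "CREATE TABLE mv_sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


def read_view(conn):
    """Answer the aggregate query from the stored result, not the base table."""
    rows = conn.execute(
        "SELECT region, total FROM mv_sales_by_region ORDER BY region"
    ).fetchall()
    return dict(rows)


# Hypothetical base table with a handful of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

build_view(conn)
fresh = read_view(conn)

# The base table changes; the stored copy is now stale until refreshed.
conn.execute("INSERT INTO sales VALUES ('west', 1.0)")
stale = read_view(conn)

# Manual refresh, as required by databases without automatic maintenance.
build_view(conn)
refreshed = read_view(conn)
```

The full rebuild on refresh is the simplest maintenance policy and makes the maintenance cost explicit. Incremental maintenance would lower that cost but is beyond this sketch.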

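
The routing behaviour described for query rewrite and for Kylin's pre-built cubes (answer from a pre-aggregated result when one covers the query, otherwise fall back to the base data) can be sketched as a toy example. This is not Kylin's or Oracle's implementation; `BASE_ROWS`, `PREBUILT`, `aggregate` and `answer` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical base data standing in for a large fact table.
BASE_ROWS = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 5},
    {"region": "west", "product": "a", "amount": 7},
]


def aggregate(rows, dimension):
    """Sum `amount` per value of `dimension` by scanning every row."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)


# Pre-aggregated results built ahead of time, keyed by grouping dimension.
PREBUILT = {"region": aggregate(BASE_ROWS, "region")}


def answer(dimension):
    """Return (source, result), using the pre-built aggregate when one exists."""
    if dimension in PREBUILT:
        return "prebuilt", PREBUILT[dimension]
    # No stored aggregate covers this query: fall back to the base data.
    return "base", aggregate(BASE_ROWS, dimension)


hit = answer("region")    # served from the stored aggregate
miss = answer("product")  # computed by scanning the base rows
```

Returning the source alongside the result makes it easy to see which queries benefit from pre-aggregation and which still pay the full scan cost.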