Comparing Dbmss and Alternative Big Data Systems: Scale up Versus Scale Out

Noname manuscript No. (will be inserted by the editor) Comparing DBMSs and Alternative Big Data Systems: Scale Up versus Scale Out Carlos Ordonez, David Sergio Matusevich, Sastry Isukapalli the date of receipt and acceptance should be inserted later Abstract CARLOS: Relational DBMSs remain the main data management technology, despite the big data analytics and no-SQL waves. On the other hand, for data analytics in a broad sense, there are plenty of non-DBMS tools including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data, going beyond SQL queries, acting just as a reliable and fast data repository. In this survey, we argue that is not the case, explaining important research that has enabled analytics on large databases inside a DBMS. However, we also argue DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, each technology will keep influencing each other. We conclude with a proposal of long-term research issues, considering the \big data analytics" trend. 1 DAVID: Introduction DBMS are capable of efficiently updating, storing and querying large tables. Relational DBMSs have been the dominating technology for managing and querying structured data. In the recent past, there has been an explosion of data on the Internet, commonly referred to as \big data", from collections of web pages, documents, transaction logs and records, etc., which are largely unstructured. In the DBMS and big data worlds the main challenge continues to be to efficiently manage, process, and analyze large data sets. Research on this area is extensive, and has resulted in many efficient algorithms, data structures, and optimizations for analyzing large data sets. This analysis process spans a range of activities, from querying and reporting, exploratory data University of Houston, ATT 2 C. Ordonez, D.S. Matusevich, S. Isukapalli analysis, to predictive modeling via machine learning. The process of knowl- edge discovery, represented by models and patterns is referred to here simply as data mining. There has been substantial progress in DBMS and non-DBMS technologies in the domain of analytics, but there remain many cases where there is a need for a hybrid approach that utilizes these tools in a complementary manner, that also utilizes the advances in integration of statistical methods and matrix computations with database systems. We present a survey of system infrastructure, programming mechanisms, algorithms and optimizations that enable analytics on large data sets. The focus of this survey is primarily on data ingestion, processing, querying, analytics, and data mining, mainly in relation to non-textual, array-like data. Aspects of complex data elements such as graphs, time-series, etc., transac- tional data, data security, and data privacy are not surveyed here. In the big data world, there have been major advances on efficient data mining algorithms [?,?,?], where the data are managed outside of a dedicated DMBS (e.g. sets of flat files on a distributed file system). However, when there are large data sets within a The cost of moving data from one system Previous research has shown it is slow to move large volumes of data due to double I/O, change of file format and network speed [45]. Exporting data sets is, therefore, a major bottleneck to analyze large data sets outside a DBMS [45]. Moreover, the data set storage layout has a close relationship to algorithm performance. Reducing the number of passes over a large data set resident on secondary storage, developing incremental algorithms with faster convergence, sampling for approximation and creating efficient data structures for indexing or summarization remain fundamental algorithmic optimizations. DBMSs [32] and Hadoop-type systems (based on MapReduce or variants such as Spark; and distributed streaming engines) [62]1 are currently the two main competing technologies for data mining on large data sets, both based on automatic data parallelism on a shared-nothing architecture. DBMSs can efficiently process transactions and SQL queries on relational tables. There exist parallel DBMSs for data warehousing and large-scale analytic applications, but they are difficult to expand and they are mostly proprietary and expensive. On the other hand, distributed Hadoop-type systems have been primarily open source and are designed from ground-up to scale to large, het- erogeneous clusters. These systems provide means to automatically distribute analytic operations across a cluster of computers and often work with on top of a distributed file system (on disk or in memory across the cluster), providing great flexibility, and expandability at low cost.2 Despite the advantages of the Hadoop-type systems, there are still many where there are benefits in using a centralized DMBS, especially when the end-user can perform data mining inside a DBMS and reduce the risk of data 1 SASTRY: Need to remove or update citation 2 efficiency may not be true { for simple data intensive tasks, it is reasonable, but for CPU intensive tasks, scaling up via GPUs or more cores/more memory may be a reasonable alternative. DBMSs and Alternative Systems: Scale up vs scale out 3 : :#%:C1 7 : :#%:C1 7 VIVJ V]:1` : : VJQÌ:C1<: 1QJ 1JJ1J$5:I]C1J$5QR1J$5VH:C1J$5 `:J`QÌ: 1QJ $$`V$: 1QJ %CC:C%VI]% : 1QJ5 6]CQ`: Q`7 1%:C1<: 1QJ5 RR"QH#%V`1V %GV J:C71 ]`V:R.VV : 1 1H:C V$`V1QJ5C% V`1J$5'5C:1`1V`5 %QRVC 1IVV`1V %QRVC QI]% : 1QJ QH1: 1QJ ': V`JV:`H. %CV5 )`:]. V_%VJHV %QRVC HQ`1J$ %QRVC6H.:J$V V`1QJ1J$ %:J:$VIVJ Fig. 1 Data mining process: phases and tasks. inconsistency, security, etc., and also mostly avoiding the export bottleneck if data originates from an OLTP database or a data warehouse. The problem of integrating statistical and machine learning methods and models with a DBMS has received moderate attention [32,8,46,57,58,45]. In general, such problem is considered difficult due to the relational DBMS architecture, the matrix-oriented nature of statistical techniques, lack of access to the DBMS source code, and the comprehensive set of statistical and numerical libraries already available in mathematical packages such as Matlab, statistical languages such as R, and additional high-performance numerical libraries like LAPACK. As a consequence, in modern database environments, users gener- ally export data sets to a statistical or parallel system, iteratively build several models outside the DBMS, and finally deploy the best model. Even though many DBMS offer access to data mining algorithms via procedural language interfaces, data mining in the DMBS has been mainly been limited to exploratory analysis using OLAP cubes. Most of the computation of statistical models (e.g. regression, PCA, decision trees, Bayesian classifiers, neural nets) and pattern search (e.g. association rules) continues to be performed outside the DBMS. The importance of pushing statistical and data mining computations into a DBMS is recognized in [14]; this work emphasizes exporting large tables outside the DBMS is a bottleneck and it identifies SQL queries and MapReduce [53] as two complementary mechanisms to analyze large data sets. 4 C. Ordonez, D.S. Matusevich, S. Isukapalli : V`J %GV J`QÌ: 1QJ J:C7 1H `:]. V `1V0:C `QGCVI :H.1JV : 1 1H V:`J1J$ QRVC Fig. 2 Big data analytics. 1.1 Big Data Analytics Processing We begin by giving an overview of the entire analytic process going from data preparation to actually deploying and managing models. Figure 1 shows main phases and tasks in any data mining project.3 Given the integration of many diverse data sources, the first major phase is data quality assessment and repair, which involves itself exploratory data analysis, explained below. Data quality involves detecting and solving issues with incomplete and inconsistent data. In general, the most time consuming phase is to process, transform and clean data in order to build a data set. Basic database data transformations are either denormalization or aggregation. Denormalization includes joins, pivot- ing [16], horizontal aggregations [48,42]. Common data transformation include coding variables (e.g. into binary variables), binning (discretizing numeric variables to compute histograms), sampling [12,11] (to tackle data set size or class imbalance), rescaling and null value imputation, among other transformations. Figure 2 illustrates the most common4 techniques applied for data analysis. In this survey we study techniques and methods that can be used with DBMS technology, in which data sets are structured or semi-structured (e.g. tables [homogeneous or heterogenous] and matrices [sparse and dense]), putting less emphasis on data with weaker, dynamic and more variable structure (e.g.., web pages, documents with varying degrees of completeness). 3 SASTRY: We may need to change this figure a little bit because the arrows towards the right indicate a sequence, but they are independent steps (e.g. coding, imputation, etc., do not necessarily follow aggregation/denormalization). Change the arrows to dotted lines without arrowheads? 4 SASTRY: These are general data analytic techniques and not necessarily big data; suggest removing the figure completely DBMSs and Alternative Systems: Scale up vs scale out 5 After a data set has been created and refined, the next task is to explore its statistical properties, in the form of counts, sums, averages and variances. Exploratory analysis includes ad-hoc queries 5, data quality assessment [18, 47] (i.e. to detect and solve data inconsistency and missing information), and OLAP cubes. 6. In general, at this stage no models are computed.

Comparing Dbmss and Alternative Big Data Systems: Scale up Versus Scale Out

CS145 Lecture Notes #15 Introduction to OQL

EYEDB Object Query Language

SOQL and SOSL Reference

Query Service Specification

Sciql, a Query Language for Science Applications

Object Persistence

Object Database Management Systems: Concepts and Features

1 Query Languages This Section Lists All the Query Languages That the Query Language Investigation Team Determined Were Relevant Possibilities

Apache Calcite for Enabling SQL Access to Nosql Data Systems Such As Apache Geode Christian Tzolov Whoami Christian Tzolov

ODL, OQL and SQL3 Object Definition Language

A Query Language for Analyzing Information Networks

002.02 Comparing the Object and the Relational Data Model