Comparing DBMSs and Alternative Big Data Systems: Scale Up versus Scale Out
Carlos Ordonez, David Sergio Matusevich, Sastry Isukapalli
University of Houston, ATT
Abstract Relational DBMSs remain the main data management technology, despite the big data analytics and NoSQL waves. On the other hand, for data analytics in a broad sense, there are plenty of non-DBMS tools, including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, the latter being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data beyond SQL queries, acting just as a reliable and fast data repository. In this survey, we argue that this is not the case, explaining important research that has enabled analytics on large databases inside a DBMS. However, we also argue that DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, each technology will keep influencing the other. We conclude with a proposal of long-term research issues, considering the “big data analytics” trend.
1 Introduction
DBMSs are capable of efficiently updating, storing and querying large tables. Relational DBMSs have been the dominating technology for managing and querying structured data. In the recent past, there has been an explosion of data on the Internet, commonly referred to as “big data”, from collections of web pages, documents, transaction logs and records, and so on, which are largely unstructured. In both the DBMS and big data worlds the main challenge continues to be to efficiently manage, process, and analyze large data sets. Research in this area is extensive, and has resulted in many efficient algorithms, data structures, and optimizations for analyzing large data sets. This analysis process spans a range of activities, from querying and reporting, exploratory data
analysis, to predictive modeling via machine learning. The process of knowledge discovery, represented by models and patterns, is referred to here simply as data mining. There has been substantial progress in both DBMS and non-DBMS technologies in the domain of analytics, but there remain many cases that call for a hybrid approach exploiting these tools in a complementary manner, one that also leverages the advances in integrating statistical methods and matrix computations with database systems.

We present a survey of system infrastructure, programming mechanisms, algorithms and optimizations that enable analytics on large data sets. The focus of this survey is primarily on data ingestion, processing, querying, analytics, and data mining, mainly in relation to non-textual, array-like data. Aspects of complex data elements such as graphs and time series, transactional data, data security, and data privacy are not surveyed here. In the big data world, there have been major advances on efficient data mining algorithms [?,?,?], where the data are managed outside of a dedicated DBMS (e.g. as sets of flat files on a distributed file system). However, when large data sets already reside inside a DBMS, moving them to another system is costly: previous research has shown that it is slow to move large volumes of data due to double I/O, change of file format and network speed [45]. Exporting data sets is, therefore, a major bottleneck when analyzing large data sets outside a DBMS [45]. Moreover, the data set storage layout has a close relationship to algorithm performance. Reducing the number of passes over a large data set resident on secondary storage, developing incremental algorithms with faster convergence, sampling for approximation and creating efficient data structures for indexing or summarization remain fundamental algorithmic optimizations.

DBMSs [32] and Hadoop-type systems (based on MapReduce or variants such as Spark, as well as distributed streaming engines) [62] are currently the two main competing technologies for data mining on large data sets, both based on automatic data parallelism on a shared-nothing architecture. DBMSs can efficiently process transactions and SQL queries on relational tables. There exist parallel DBMSs for data warehousing and large-scale analytic applications, but they are difficult to expand and they are mostly proprietary and expensive. On the other hand, distributed Hadoop-type systems have been primarily open source and are designed from the ground up to scale to large, heterogeneous clusters. These systems provide means to automatically distribute analytic operations across a cluster of computers and often work on top of a distributed file system (on disk or in memory across the cluster), providing great flexibility and expandability at low cost for data-intensive tasks; for CPU-intensive tasks, scaling up via GPUs or more cores and memory can be a reasonable alternative. Despite the advantages of Hadoop-type systems, there are still many cases where there are benefits in using a centralized DBMS, especially when the end-user can perform data mining inside the DBMS, reducing the risk of data
Fig. 1 Data mining process: phases and tasks.
inconsistency and security issues, while also mostly avoiding the export bottleneck when the data originates in an OLTP database or a data warehouse.
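As a concrete illustration of the sampling optimization mentioned above, a row sample can be drawn entirely inside the DBMS, with no export step. The following is a minimal sketch assuming a PostgreSQL-style TABLESAMPLE clause and a hypothetical table big_table; the exact syntax varies across DBMSs:

-- Materialize an approximately 1% Bernoulli row sample inside the DBMS.
-- big_table is hypothetical; TABLESAMPLE syntax is PostgreSQL-style.
CREATE TABLE big_table_sample AS
SELECT *
FROM big_table TABLESAMPLE BERNOULLI (1);

The sample can then be used to prototype transformations or to compute approximate models before touching the full table.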
The problem of integrating statistical and machine learning methods and models with a DBMS has received moderate attention [32,8,46,57,58,45]. In general, this problem is considered difficult due to the relational DBMS architecture, the matrix-oriented nature of statistical techniques, lack of access to the DBMS source code, and the comprehensive set of statistical and numerical libraries already available in mathematical packages such as Matlab, statistical languages such as R, and additional high-performance numerical libraries like LAPACK. As a consequence, in modern database environments, users generally export data sets to a statistical or parallel system, iteratively build several models outside the DBMS, and finally deploy the best model. Even though many DBMSs offer access to data mining algorithms via procedural language interfaces, data mining inside the DBMS has mainly been limited to exploratory analysis using OLAP cubes. Most of the computation of statistical models (e.g. regression, PCA, decision trees, Bayesian classifiers, neural nets) and pattern search (e.g. association rules) continues to be performed outside the DBMS.
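To make the in-DBMS alternative concrete, consider simple linear regression: its sufficient statistics are plain SQL aggregations, so the model can be computed in a single table scan without exporting the data. This is a minimal sketch, assuming a hypothetical table points(x, y) with floating-point columns:

-- One scan computes n, Sum(x), Sum(y), Sum(x*x), Sum(x*y);
-- the closed-form least-squares solution for y = intercept + slope * x follows.
WITH stats AS (
  SELECT COUNT(*)   AS n,
         SUM(x)     AS sx,
         SUM(y)     AS sy,
         SUM(x * x) AS sxx,
         SUM(x * y) AS sxy
  FROM points
)
SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx)   AS slope,
       (sy * sxx - sx * sxy) / (n * sxx - sx * sx) AS intercept
FROM stats;

The same pattern, aggregating once and then solving a small system on the summaries, extends to multivariate regression and PCA via sums of squares and cross-products.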
The importance of pushing statistical and data mining computations into a DBMS is recognized in [14]; this work emphasizes that exporting large tables outside the DBMS is a bottleneck, and it identifies SQL queries and MapReduce [53] as two complementary mechanisms to analyze large data sets.
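This complementarity is easy to see on a canonical example: a MapReduce word count corresponds to a relational GROUP BY aggregation. The sketch below assumes a hypothetical table tokens(doc_id, word) with one row per word occurrence:

-- GROUP BY plays the role of map (emit key) plus reduce (aggregate per key);
-- a parallel DBMS distributes the groups across processing nodes automatically.
SELECT word, COUNT(*) AS frequency
FROM tokens
GROUP BY word;

Both mechanisms exploit the same key-based data parallelism; they differ in where the data lives and in how the computation is expressed.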
Fig. 2 Big data analytics.
1.1 Big Data Analytics Processing
We begin by giving an overview of the entire analytic process, going from data preparation to actually deploying and managing models. Figure 1 shows the main phases and tasks in any data mining project. Given the integration of many diverse data sources, the first major phase is data quality assessment and repair, which itself involves exploratory data analysis, explained below. Data quality involves detecting and solving issues with incomplete and inconsistent data. In general, the most time-consuming phase is to process, transform and clean data in order to build a data set. Basic database data transformations are either denormalization or aggregation. Denormalization includes joins, pivoting [16] and horizontal aggregations [48,42]. Common data transformations include coding variables (e.g. into binary variables), binning (discretizing numeric variables to compute histograms), sampling [12,11] (to tackle data set size or class imbalance), rescaling and null value imputation, among other transformations. Figure 2 illustrates the most common techniques applied for data analysis. In this survey we study techniques and methods that can be used with DBMS technology, in which data sets are structured or semi-structured (e.g. tables [homogeneous or heterogeneous] and matrices [sparse and dense]), putting less emphasis on data with weaker, dynamic and more variable structure (e.g. web pages, documents with varying degrees of completeness).
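The sketch below illustrates the two basic transformations named above, denormalization and aggregation, combined with binary coding and null value imputation; the sales and customer tables and all column names are hypothetical:

-- Denormalize (join) and aggregate to one row per customer,
-- coding a binary variable and imputing null values along the way.
CREATE TABLE dataset AS
SELECT c.customer_id,
       c.region,
       COUNT(*)                     AS num_orders,
       SUM(s.amount)                AS total_amount,
       COALESCE(AVG(s.discount), 0) AS avg_discount, -- null value imputation
       CASE WHEN SUM(s.amount) > 1000
            THEN 1 ELSE 0 END       AS high_value    -- binary coding
FROM sales s
JOIN customer c ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.region;

The result has the classical data mining layout: one observation per row, one variable per column.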
After a data set has been created and refined, the next task is to explore its statistical properties, in the form of counts, sums, averages and variances. Exploratory analysis includes ad-hoc queries (interactive exploration where each query and visualization suggests the next one), data quality assessment [18,47] (i.e. to detect and solve data inconsistency and missing information), and OLAP cubes. In general, at this stage no models are computed. Thus exploratory analysis reveals the “what”, “where” and “when”, and sometimes hints at possible correlations, but not necessarily the “how” or “why”. The next phase is computing statistical models or discovering patterns (i.e. data mining) hidden in the data set. Statistical and machine learning models include both descriptive and predictive models such as regression (linear, logistic), clustering (K-means, EM, hierarchical, spatial), classification, dimensionality reduction (PCA, factor analysis), statistical tests, time series trend analysis and, in some cases, estimation of parameters for a model structure defined a priori. Additionally, pattern recognition and search can be performed via cubes [26], association rules [3], and their generalizations, like sequences. Last but not least, we need to consider model management, a subtask which follows model computation, and includes scoring (model deployment, inference [31]), model exchange (with other tools, using model description languages like PMML [30]) and versioning (i.e., maintaining a history of data sets and models).
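The exploratory summaries described at the start of this phase map directly to SQL. The sketch below computes basic statistics in one scan and then an OLAP-style cube over two dimensions of the hypothetical dataset table used earlier; VAR_SAMP and GROUP BY CUBE are standard SQL but not supported by every DBMS:

-- Global counts, sums, averages and variances in a single scan.
SELECT COUNT(*)               AS n,
       SUM(total_amount)      AS total,
       AVG(total_amount)      AS mean,
       VAR_SAMP(total_amount) AS variance
FROM dataset;

-- Aggregates for every combination of region and high_value,
-- including subtotals and the grand total.
SELECT region, high_value, COUNT(*) AS n, AVG(total_amount) AS mean
FROM dataset
GROUP BY CUBE (region, high_value);

Queries like these reveal the “what” and “where” before any model is computed.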
2 Classification of Big Data Analytic Systems
We begin our discussion of database systems by attempting to classify the many ways data can be stored and analyzed for data mining purposes. The first main division is related to system functionality. To that effect we divide the universe of data systems into two categories: Database Management Systems (DBMSs) and alternative systems, which we will term non-relational (also known as NoSQL [62]). Figure 3 provides our full classification, explained below. We start our discussion with DBMSs, whose main characteristic is the ability to handle large amounts of structured data, seamlessly and efficiently moving data between primary and secondary storage, while dealing with all other issues, such as indexing, concurrent access, data security and recovery. Given the current state of the art, database systems can be subdivided into two main categories: Relational and Non-Relational (among the latter we briefly cover hierarchical databases, but not network or object-oriented ones). Commonly, the first sub-