Noname manuscript No. (will be inserted by the editor)

Comparing DBMSs and Alternative Big Data Systems: Scale Up versus Scale Out

Carlos Ordonez, David Sergio Matusevich, Sastry Isukapalli

the date of receipt and acceptance should be inserted later

Abstract Relational DBMSs remain the main data management technology, despite the big data analytics and NoSQL waves. On the other hand, for data analytics in a broad sense there are plenty of non-DBMS tools, including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, the last being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data beyond SQL queries, acting just as a reliable and fast data repository. In this survey we argue that this is not the case, explaining important research that has enabled analytics on large data sets inside a DBMS. However, we also argue that DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, the two technologies will keep influencing each other. We conclude with a proposal of long-term research issues, considering the "big data analytics" trend.

1 Introduction

DBMSs are capable of efficiently updating, storing and querying large tables. Relational DBMSs have been the dominant technology for managing and querying structured data. In the recent past there has been an explosion of data on the Internet, commonly referred to as "big data": collections of web pages, documents, transaction logs and records, and so on, which are largely unstructured. In both the DBMS and big data worlds the main challenge continues to be to efficiently manage, process and analyze large data sets. Research in this area is extensive and has resulted in many efficient algorithms, data structures and optimizations for analyzing large data sets. This analysis process spans a range of activities, from querying, reporting and exploratory data analysis to predictive modeling via machine learning.

University of Houston, ATT

The process of knowledge discovery, represented by models and patterns, is referred to here simply as data mining. There has been substantial progress in DBMS and non-DBMS technologies in the domain of analytics, but there remain many cases that need a hybrid approach, one that uses these tools in a complementary manner and that exploits the advances in integrating statistical methods and matrix computations with such systems.

We present a survey of system infrastructure, programming mechanisms, algorithms and optimizations that enable analytics on large data sets. The focus of this survey is primarily on data ingestion, processing, querying, analytics and data mining, mainly in relation to non-textual, array-like data. Aspects of complex data elements such as graphs and time series, transactional data, data security and data privacy are not surveyed here.

In the big data world there have been major advances on efficient data mining algorithms [?,?,?], where the data are managed outside of a dedicated DBMS (e.g. as sets of flat files on a distributed file system). However, when large data sets already reside inside a DBMS, the cost of moving them to another system becomes a major concern. Previous research has shown that moving large volumes of data is slow due to double I/O, change of file format and network speed [45]. Exporting data sets is, therefore, a major bottleneck to analyzing large data sets outside a DBMS [45]. Moreover, the data set storage layout has a close relationship to algorithm performance. Reducing the number of passes over a large data set resident on secondary storage, developing incremental algorithms with faster convergence, sampling for approximation and creating efficient data structures for indexing or summarization remain fundamental algorithmic optimizations.

DBMSs [32] and Hadoop-type systems (based on MapReduce or variants such as Spark, as well as distributed streaming engines) [62] are currently the two main competing technologies for data mining on large data sets, both based on automatic data parallelism on a shared-nothing architecture. DBMSs can efficiently process transactions and SQL queries on relational tables. There exist parallel DBMSs for data warehousing and large-scale analytic applications, but they are difficult to expand and they are mostly proprietary and expensive. On the other hand, distributed Hadoop-type systems have been primarily open source and are designed from the ground up to scale to large, heterogeneous clusters. These systems provide means to automatically distribute analytic operations across a cluster of computers and often work on top of a distributed file system (on disk or in memory across the cluster), providing great flexibility and expandability at low cost; for CPU-intensive tasks, however, scaling up via GPUs or more cores and memory on fewer nodes may be a reasonable alternative. Despite the advantages of Hadoop-type systems, there are still many cases where there are benefits in using a centralized DBMS, especially when the end user can perform data mining inside the DBMS, reducing the risk of data inconsistency and security issues, while also largely avoiding the export bottleneck when the data originates in an OLTP database or a data warehouse.


Fig. 1 Data mining process: phases and tasks.


The problem of integrating statistical and machine learning methods and models with a DBMS has received moderate attention [32,8,46,57,58,45]. In general, this problem is considered difficult due to the relational DBMS architecture, the matrix-oriented nature of statistical techniques, the lack of access to the DBMS source code, and the comprehensive set of statistical and numerical libraries already available in mathematical packages such as Matlab, statistical languages such as R, and additional high-performance numerical libraries like LAPACK. As a consequence, in modern database environments users generally export data sets to a statistical or parallel system, iteratively build several models outside the DBMS, and finally deploy the best model. Even though many DBMSs offer access to data mining algorithms via procedural language interfaces, data mining in the DBMS has mainly been limited to exploratory analysis using OLAP cubes. Most of the computation of statistical models (e.g. regression, PCA, decision trees, Bayesian classifiers, neural nets) and pattern search (e.g. association rules) continues to be performed outside the DBMS.

The importance of pushing statistical and data mining computations into a DBMS is recognized in [14]; this work emphasizes that exporting large tables outside the DBMS is a bottleneck, and it identifies SQL queries and MapReduce [53] as two complementary mechanisms to analyze large data sets.

Fig. 2 Big data analytics.

1.1 Big Data Analytics Processing

We begin by giving an overview of the entire analytic process, going from data preparation to actually deploying and managing models. Figure 1 shows the main phases and tasks in a data mining project. Given the integration of many diverse data sources, the first major phase is data quality assessment and repair, which itself involves exploratory data analysis, explained below. Data quality involves detecting and solving issues with incomplete and inconsistent data. In general, the most time consuming phase is to process, transform and clean data in order to build a data set. Basic database data transformations are either denormalization or aggregation. Denormalization includes joins, pivoting [16] and horizontal aggregations [48,42]. Common data transformations include coding variables (e.g. into binary variables), binning (discretizing numeric variables to compute histograms), sampling [12,11] (to tackle data set size or class imbalance), rescaling and null value imputation, among other transformations (a small SQL sketch of such transformations is given below).

Figure 2 illustrates the most common techniques applied for data analysis. In this survey we study techniques and methods that can be used with DBMS technology, in which data sets are structured or semi-structured (e.g. tables, homogeneous or heterogeneous, and matrices, sparse or dense), putting less emphasis on data with weaker, dynamic and more variable structure (e.g. web pages, or documents with varying degrees of completeness).
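To make the transformation phase concrete, the following is a minimal SQL sketch, with hypothetical tables customer and sales, that denormalizes with a join and then codes, imputes and bins variables with CASE expressions; comparable statements can be generated automatically by a transformation tool.

  -- Hypothetical source tables: customer(cust_id, gender, income), sales(cust_id, amount)
  CREATE TABLE ds AS
  SELECT c.cust_id,
         CASE WHEN c.gender = 'F' THEN 1 ELSE 0 END AS is_female,       -- coding into a binary variable
         CASE WHEN c.income IS NULL THEN 0 ELSE c.income END AS income,  -- null value imputation
         CASE WHEN c.income < 30000 THEN 1                               -- binning a numeric variable
              WHEN c.income < 80000 THEN 2
              ELSE 3 END AS income_bin,
         SUM(s.amount) AS total_amount                                   -- aggregation after denormalization
  FROM customer c LEFT JOIN sales s ON s.cust_id = c.cust_id
  GROUP BY c.cust_id, c.gender, c.income;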


After a data set has been created and refined, the next task is to explore its statistical properties, in the form of counts, sums, averages and variances. Exploratory analysis includes ad-hoc queries (interactive exploration where each query and visualization suggests further queries), data quality assessment [18,47] (i.e. detecting and solving data inconsistency and missing information), and OLAP cubes (a SQL sketch of such exploratory aggregations closes this subsection). In general, at this stage no models are computed. Thus exploratory analysis reveals the "what", "where" and "when", and sometimes hints at possible correlations, but not necessarily the "how" or "why". The next phase is computing statistical models or discovering patterns (i.e. data mining) hidden in the data set. Statistical and machine learning models include both descriptive and predictive models, such as regression (linear, logistic), clustering (K-means, EM, hierarchical, spatial), classification, dimensionality reduction (PCA, factor analysis), statistical tests, time series trend analysis and, in some cases, estimation of parameters for a model structure defined a priori. Additionally, pattern recognition and search can be performed via cubes [26], association rules [3] and their generalizations, such as sequences. Last but not least, we need to consider model management, a subtask that follows model computation and includes scoring (model deployment, inference [31]), model exchange (with other tools, using model description languages like PMML [30]) and versioning (i.e., maintaining a history of data sets and models).
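As an illustration, descriptive statistics and a small OLAP cube can both be obtained with standard SQL aggregations; the sketch below assumes a hypothetical data set ds with columns region, product and amount (GROUP BY CUBE is SQL:1999 syntax supported by most, but not all, DBMSs).

  -- Basic descriptive statistics in one pass over the data set
  SELECT COUNT(*) AS n, AVG(amount) AS mean,
         VAR_SAMP(amount) AS variance, MIN(amount) AS lo, MAX(amount) AS hi
  FROM ds;

  -- A two-dimensional cube: totals by region, by product, and overall
  SELECT region, product, SUM(amount) AS total
  FROM ds
  GROUP BY CUBE (region, product);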

2 Classification of Big Data Analytic Systems

We begin our discussion of database systems by attempting to classify the many ways data can be stored and analyzed for data mining purposes. The first main division is related to system functionality. To that effect we divide the universe of data systems into two categories: Database Management Systems (DBMSs) and alternative systems, which we will term non-relational (also known as NoSQL [62]). Figure 3 provides our full classification, explained below. We start our discussion with DBMSs, whose main characteristic is the ability to handle large amounts of structured data, seamlessly and efficiently moving data between primary and secondary storage while dealing with all other issues, such as indexing, concurrent access, data security and recovery. Given the current state of the art, database systems can be subdivided into two main categories: relational and non-relational.




Fig. 3 Classification of Big Data Analytics Systems.

Commonly, the first subgroup comprises those systems that have a database model and data strictly organized as relations (tables). According to physical storage, relational DBMSs can be further subdivided into row-based (e.g. Postgres, IBM DB2, Oracle, Teradata, Greenplum) and column-based (e.g. C-Store, MonetDB, Hana). Older DBMS technology relied on row storage to accelerate transaction processing [65] and optimize simple SPJ queries [22]. Column stores [63] are more efficient when SPJA queries involve mostly small subsets of columns (i.e. projections). From a practical standpoint we could also divide them into two categories: those designed from the ground up to work as part of a parallel system with multiple storage and computing nodes, and those that use a single processing unit (usually older systems). Given the wide diversity of file systems used by DBMSs, we do not elaborate on them, but the common characteristic is the ability to perform random reads and random writes on logical blocks [22]. In almost all RDBMSs the database engine runs as a server and parts of the data are cached in memory. Exceptions include engines such as SQLite, which can be created on demand, but such engines handle relatively smaller amounts of data. Non-relational systems include those that work with file-based storage, such as packages in R [50], Matlab, SAS, or custom-developed code (in C++, Java or an equivalent high-level language).

Some of these utilize local file systems (local disk or network mounted), and some use distributed file systems such as the open-source Hadoop Distributed File System (Apache HDFS), inspired by the Google file system [34]. A significant fraction of Hadoop analytic implementations also use the MapReduce programming model, originally introduced by Google [19] as well. In the MapReduce model there is no underlying database process that manages the data; instead, all the relevant processes are created at run time. Data in the flat files can be simple text, complex disk-based data structures such as B-trees and arrays, or hierarchical data sets based on HDF (Hierarchical Data Format).

Hadoop/HDFS has recently become a system of choice for the distributed storage and analysis of semi-structured information, initially for documents and web pages, but gradually for various types of data, sometimes even structured data. The primary advantage of Hadoop and HDFS is that, for problems where operations on a large data set can be split into parallel, independent operations on small chunks of the data, one can utilize a distributed cluster of machines to process the data. Additionally, the data-local computation property of MapReduce (where the execution code is deployed to the nodes where the data reside) allows the aggregation of heterogeneous computers on a wide area network with a mix of network speeds and processing power. Although it originally supported only the MapReduce processing paradigm, Hadoop gained traction as the main platform in most large-scale data analytics systems outside the DBMS realm. Cheaper and faster hardware has further stimulated HDFS adoption. Hadoop has also gained traction in processing structured data, competing with RDBMSs [17,54,62], initially by providing SQL-like interfaces for underlying MapReduce processing and more recently via modules for data formats (columnar formats with indexes, custom data readers, etc.). Different columnar storage systems have been developed on top of HDFS, including HBase and Accumulo (both modeled after Google's BigTable), which take advantage of the underlying HDFS file system and provide a distributed, DBMS-like interface. Furthermore, Hadoop currently supports multiple programming paradigms via the YARN scheduler, thus providing the ability to run iterative programs, long-running containers, and so on.

Non-relational databases arose from the need to analyze data that is very difficult (or even impractical) to fit into the relational model. Non-relational systems, while less refined and with less functionality than their relational counterparts, have the advantage that they require less preparation of data before loading, and they also support dynamic reconfiguration of the database. This has caused a major shift in the database usage paradigm, where users prefer to trade the computational power of a DBMS for a simplified and faster setup, especially because of the scale-out support of these systems on commodity hardware. According to their storage model, we propose to classify them into array, stream and graph based systems. Array databases are primarily oriented towards the analysis of multidimensional data, such as scientific, financial and bioinformatics data.
Well known systems include SciDB [64], Rasdaman [5] and ArrayStore [61]. Stream database systems such as Gigascope [15] are used to analyze data streams (non-static data), such as network data or video, without even storing the particular data records on secondary storage. Stream database systems support "continuous" queries, such as finding the number of users of a certain service over a sliding time window, returning results in near real time from data ingestion to reporting. Graph database systems are characterized as systems that provide index-free adjacency list access; that is, each element of the database contains pointers to its neighbors without any index lookup. Graph databases make relationships between elements explicit and intuitive, and they are crucial when analyzing social networks and the Internet (as a collection of interconnected nodes). Prominent examples include Pregel [38], Giraph [4], Neo4J [37], GraphX [?], Titan [?] and Urika [33]. It is worth mentioning that these divisions are not absolute: in recent years there has been a push by developers to integrate the best qualities of one system into another, such as providing native support for R and Python packages in the DBMS (for example, parallel R in the Vertica DBMS and PL/Python in Postgres).

3 System Infrastructure

While the differences between DBMS and non-DBMS tools and approaches are not absolute, they can be compared with respect to their underlying resource demands, system configuration and infrastructure. These are affected by the approaches used for storage, indexing, parallel processing and distributed processing.

3.1 Storage and Indexing

Storage in traditional DBMSs is built around blocks of rows or columns of mainly simple data types (e.g. numbers and strings), with similar sizes for elements across columns. All cells within a column are of the same type and usually of similar size. Additionally, DBMSs provide mechanisms to store large objects in character or binary format. Irrespective of the underlying data types, these columns need to be fully specified either at the time of creating the table or via alterations to the table prior to the insertion of data. On the other hand, non-DBMS solutions range widely in terms of how they organize the underlying data in rows and columns. The major types of storage layouts used by NoSQL databases are (a) key-value stores, which are in essence distributed dictionaries; (b) document-oriented databases, where columns correspond to different attributes of a document; (c) columnar stores, where each column is stored separately, either completely or at the level of a chunk; (d) hybrid columnar stores, which are row-oriented but support a large number of dynamic columns; and (e) chunked storage, which is appropriate for structured data with a regular structure.

Different analytical tasks require different storage layouts for I/O efficiency, as each approach has different strengths.

– Row storage: This is the most common storage layout, in which X is stored in a tabular format with n rows, each having d columns. Several rows are generally stored together in a single block. When d is relatively small, the structure is close to that of a traditional DBMS.
– Document storage: This is a newer layout tailored to analytical query processing [65]. In this case the data set is stored by column across several blocks, which enables efficient projection of subsets of columns, since each column is stored separately. When there is separate storage for each column, each holding the primary key (or subscript) of each record and the corresponding column value, this type of storage is close to a document-oriented store. In document-oriented databases, the number of columns corresponding to different attributes of the documents may sometimes be higher than the number of documents in the database. This may manifest as tables with millions of rows and millions of columns, i.e., a large, moderately dense matrix. In some cases these columns are stored separately at the level of individual chunks of a table (e.g. for every 10,000 rows); this approach is employed in the ORC format used by Hive.
– Hybrid columnar storage: In this case the data set is stored by rows, but the columns can be arbitrarily many (typically millions of columns), with each row containing a small subset of the columns. The contents of a single row are stored together (either for the entire row or, at a minimum, for a "column family"). This storage can support queries across row keys and subsequent fetching of cells based on column names.
– Key-value storage (note that this is not the default storage in HDFS): This is in essence a distributed dictionary, where records are accessed through a key lookup and subsequent processing of the values. The sizes of these databases are usually billions of rows and two columns. This structure supports queries that involve a small set of "get" and "put" operations for reading and writing values corresponding to a key. In the relational world this layout is the so-called transaction data set, stored in a normalized table with one attribute value per row (a SQL sketch contrasting this long layout with row storage follows this list). In the transaction layout, for efficiency purposes, only a subset of dimensions for each record is actually stored (e.g. excluding binary dimensions equal to zero).
– Chunk storage: This is a novel mechanism to store multidimensional arrays in large blocks, pioneered by SciDB [64]. Chunk storage is ideal for dense and sparse matrices, as well as spatial data.
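The contrast between a wide row layout and the long key-value (transaction) layout can be sketched in SQL as follows; the table names are hypothetical, and the long form stores only non-zero entries.

  -- Row (horizontal) layout: one row per record, d columns
  CREATE TABLE X_horizontal (i INT PRIMARY KEY, x1 FLOAT, x2 FLOAT, x3 FLOAT);

  -- Key-value / transaction layout: one attribute value per row, zeros omitted
  CREATE TABLE X_long (i INT, attr VARCHAR(10), val FLOAT);

  -- Rebuilding the horizontal layout from the long layout (a pivot via CASE)
  SELECT i,
         SUM(CASE WHEN attr = 'x1' THEN val ELSE 0 END) AS x1,
         SUM(CASE WHEN attr = 'x2' THEN val ELSE 0 END) AS x2,
         SUM(CASE WHEN attr = 'x3' THEN val ELSE 0 END) AS x3
  FROM X_long
  GROUP BY i;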

We must note two other major approaches to large scale data storage. The first relates to the use of formatted files on a distributed file system (e.g. Hive ORC or Parquet formats) that provide block- or chunk-level indexing without requiring any data management service. The second relates to the use of hierarchical formats (e.g. NetCDF and HDF) that provide random access to distributed data through the file system interface; this second approach is primarily used for scientific data sets, which are typically dense. The choice of a specific layout mechanism impacts the indexing scheme that can be used in querying the data, and also impacts the storage and processing costs associated with maintaining the index when the database is updated with heavy writes. In traditional DBMSs, indexes are maintained for important simple columns, while indexes over individual elements of a complex binary structure field are impractical to maintain. In NoSQL databases, the underlying row keys or document IDs are sorted and/or indexed (often at the granularity of a chunk). In hybrid columnar databases the columns themselves are sorted by column name, which allows quickly locating a cell based on a row key and a column name; in contrast, RDBMSs typically maintain the declared order of the columns in a table. Recent work in this area suggests dynamic hybrid solutions that match the data schema and the analysis needs. For example, the Dual Table approach recommends using HBase for current data (which can change substantially at high speed, with heavy writes) and Hive ORC tables for slightly aged data (e.g. data older than one week, which may have infrequent changes).
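As a sketch of the formatted-file approach just mentioned, a Hive table can be declared over HDFS with the ORC columnar format, which maintains chunk-level statistics and optional Bloom filters that act as lightweight indexes; the table name and the property values below are illustrative, and defaults vary across Hive versions and distributions.

  -- HiveQL (SQL-like) sketch: hypothetical table of slightly aged events
  CREATE TABLE events_orc (
    event_id BIGINT, user_id BIGINT, event_ts TIMESTAMP, amount DOUBLE)
  STORED AS ORC
  TBLPROPERTIES ('orc.create.index' = 'true',             -- row-group (chunk) level min/max indexes
                 'orc.bloom.filter.columns' = 'user_id'); -- Bloom filter to skip chunks on point lookups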

3.2 Data Retrieval and Mathematical Processing

In the context of large scale data retrieval and analysis, matrix manipulation can be considered a reference problem, because a large number of mathematical modeling problems and machine learning models require some form of matrix manipulation. Given a matrix, three row storage layouts stand out: a standard horizontal layout having one row per input record, a long pivoted table with one attribute value per row, or a transposed table having records as columns (a SQL sketch of the long layout, and of a matrix product over it, closes this subsection). Each layout provides tradeoffs among I/O speed, space and programming flexibility, depending on the required mathematical computation, the data set dimensionality and the sparsity of the matrices. Column stores provide limited flexibility in layouts to store matrices. On the other hand, a vertical (long) layout is the most common format for key-value stores.

There has been a surge in array database systems, like SciDB [9] and ArrayStore [61], with storage and query evaluation mechanisms different from SQL. Such systems are making some impact in the academic world, complementing R and Matlab, but their future impact on corporate databases is uncertain.

The main differences between the various storage and retrieval strategies result in differences in performance under different system configurations. For instance, a traditional DBMS may scale well on multiple cores of a single machine or on a small set of multicore servers, each with large memory. In this case the DBMS can exploit its "share-all" properties and optimize memory usage for faster query processing; we refer to this scenario as "scale-up". On the other hand, when distributed processing is performed using a cluster with a large number of relatively low-end commodity servers, there is limited flexibility in devoting all available CPU and memory to one specific query, because these clusters are usually tuned to work in a "share-nothing" configuration; we refer to this scenario as "scale-out", and it is most common in Hadoop MapReduce and its variants.

In a scale-up scenario, the underlying system can allocate resources towards executing a given query in full and can mostly provide tight time ranges for completing a task. In a scale-out scenario with a shared-nothing architecture, each worker processes the data independently and the underlying system cannot guarantee tight bounds on the completion time of all the workers. In fact, it is not unusual for Hadoop jobs to exhibit unusually long tails in execution time: a small set of mappers/reducers (or, in general, YARN containers) may take a long time to complete, delaying the entire job. In a multicore (scale-up) setup, the underlying system can provide additional resources (CPU, memory, etc.) to expedite job completion. In a shared-nothing environment, most clusters can at best run multiple copies of the last set of processes, without substantial changes to the amount of memory or CPU available, so there are no guaranteed tight ranges for job completion.
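Referring back to the layouts above, a minimal SQL sketch (with hypothetical tables A and B) shows the long (i, j, v) layout and how a matrix product reduces to a join followed by a GROUP BY aggregation, which a parallel DBMS can evaluate with its ordinary operators.

  -- Sparse matrices stored with one non-zero entry per row
  CREATE TABLE A (i INT, j INT, v FLOAT);
  CREATE TABLE B (i INT, j INT, v FLOAT);

  -- C = A * B expressed as a join on the shared index plus aggregation
  SELECT A.i, B.j, SUM(A.v * B.v) AS v
  FROM A JOIN B ON A.j = B.i
  GROUP BY A.i, B.j;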

3.3 Parallel Processing

Currently, the preferred parallel architecture for DBMSs and non-relational systems (prominently represented by Hadoop) is shared-nothing, having N processing units. These processing units can be N threads on a single CPU server (including multicore CPUs sharing RAM, but with different partitions on disk) or N physical nodes (each with its own RAM and secondary storage). A cluster of N nodes is strictly shared-nothing. On the other hand, given modern multicore CPU architectures, there is some level of shared memory among threads in the L1 or L2 caches, but in general L1 is automatically managed by the CPU via hyper-threading [36]. In row stores, tables are horizontally partitioned into sets of rows for parallel processing; similarly, in a column store each projection is horizontally partitioned using an implicit row id. Thus there exist sequential and parallel versions of the database physical operators: table scan, external merge sort and joins. In column stores, columns are further partitioned into blocks, which are in turn distributed among processing units. MapReduce [19,62] trails behind DBMSs in evaluating such parallel physical operators on equivalent hardware, although the gap has been gradually shrinking with RAM-based caching and processing [70]. Table 1 compares the architecture of single-node and multiple-node systems, including the now obsolete shared-everything architecture (a Greenplum-style declaration of a horizontally partitioned table is sketched after the table).


Table 1 Comparing parallel architectures.

                          Single node   N-node shared-nothing   N-node shared-everything
  Cache                   Shared        Distributed             Local
  RAM                     Shared        Distributed             Shared
  Disk                    Shared        Distributed             Shared
  File system             Unique        Distributed             Shared
  Non-local data access   N/A           Message passing         Locking
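As a sketch of how such horizontal partitioning is declared, Greenplum-style SQL specifies a distribution column at table creation; the exact clause differs across parallel DBMSs (Teradata, for instance, uses a primary index), and the table below is hypothetical.

  -- Greenplum-style syntax: rows are hash-partitioned on cust_id across the segment nodes
  CREATE TABLE sales_fact (
    cust_id BIGINT, store_id INT, amount DECIMAL(10,2))
  DISTRIBUTED BY (cust_id);

  -- A per-customer aggregation can then run locally on each node, exchanging only
  -- partial results, since all rows of one customer live on one node.
  SELECT cust_id, SUM(amount) FROM sales_fact GROUP BY cust_id;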

3.4 Data Manipulation and Query Languages

Query languages are computer languages used to retrieve information from databases and other information systems. SQL (Structured Query Language) is the standard language used in relational databases, since it adheres quite closely (although not completely) to the relational model [13]. While SQL was standardized by the American National Standards Institute in 1986, it is not completely portable from one DBMS to another, due to ambiguities in the standard that lead to a measure of vendor lock-in. While SQL is widely used in industry, several alternatives have been proposed, such as OQL (Object Query Language), JPQL (Java Persistence Query Language) and HTSQL (Hyper Text SQL).

The particular needs of non-relational database systems gave rise to a family of systems and query languages commonly referred to as NoSQL (Not Only SQL). NoSQL systems sometimes sacrifice consistency in favor of availability, and may also sacrifice ACID properties, though these systems are evolving towards supporting ACID. Major NoSQL systems provide BASE (Basically Available, Soft state, Eventually consistent) as an alternative to ACID, sacrificing consistency for partition tolerance; the responsibility for handling stale data and for guaranteeing consistency then often lies with the application.

The rise of distributed file systems such as HDFS and the node-local computing paradigm of MapReduce and its variants resulted in the development of SQL-like interfaces and languages on top of MapReduce. These languages are not SQL compliant, but they provide some of the functionality of SQL (typically select, insert/overwrite, join, filter, etc.). Apache Hive, developed originally by Facebook and currently an Apache project, provides HiveQL, an SQL-like interface. Pig Latin, developed originally by Yahoo and currently an Apache project, provides some SQL functionality via a data flow description language. Both Pig and Hive transform a given query or set of data processing commands into MapReduce tasks (and onto in-memory distributed systems such as Spark); a short HiveQL sketch is given at the end of this subsection. There are various other query languages and interfaces for accessing different types of structured and unstructured data sets on a distributed cluster. Notable among them are Impala and Presto (Hive-like engines with dedicated processes and services), Phoenix (an SQL layer for HBase with substantial SQL compatibility and ACID properties), Hive/Stinger (approximating in-memory MapReduce via "warm" containers) and Spark SQL, which runs on Spark.

Finally, we must mention Google's Pregel, a programming model oriented towards the evaluation of queries on graphs. Pregel is a model for parallel processing of graphs with added fault tolerance. It uses a master/slave model where each slave is assigned a subset of the graph's vertices; the master's job is to coordinate and partition the graph between the slaves, as well as to assign sections of the input. While Pregel is proprietary, there are open source alternatives, such as GPS (Graph Processing System), Apache Giraph and TitanDB.
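For illustration, the HiveQL sketch below (the file path and table name are hypothetical) declares an external table over flat files in HDFS and runs an aggregation that Hive compiles into distributed MapReduce (or Tez/Spark) jobs; only a subset of SQL is supported.

  -- HiveQL: schema-on-read over existing HDFS files
  CREATE EXTERNAL TABLE weblog (ip STRING, url STRING, ts TIMESTAMP)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/weblog/';

  -- Compiled into a distributed job; no DBMS server process manages the files
  SELECT url, COUNT(*) AS hits
  FROM weblog
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;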

3.5 Programming Alternatives

We distinguish three major approaches to programming data mining algorithms for analyzing large data sets: (1) inside an RDBMS, with SQL queries and UDFs, without modifying the RDBMS source code; (2) extending the capabilities of the RDBMS through a systems language like C++, by modifying the RDBMS source code; and (3) outside an RDBMS, with programming languages in external systems, including parallel systems.

The first alternative allows processing data "in situ", avoiding exports outside the DBMS. The second is feasible for open-source DBMSs and for developers of industrial DBMSs. The last alternative is the most flexible, allowing many customized optimizations as well as easily adding ad-hoc code or scripts into the pipeline; it also has the shortest development time and has therefore been the most popular for large-scale data analysis.

Performing data mining with SQL queries and UDFs has received moderate attention. A major SQL limitation is its inability to manipulate non-tabular data structures requiring pointer manipulation. However, for specific domains such as spatial data analysis, stored procedures have been used to provide significant functionality for "in situ" analysis of spatial and temporal trajectories.

When data mining computations are evaluated with SQL queries, a program automatically creates queries combining joins, aggregations and mathematical expressions. The methods to generate SQL code are based on the input data set, the algorithm parameters and tuning query optimizations. SQL code generation exploits relational query optimizations, including forcing hash joins [46], clustered storage for aggregation [46,45], denormalization to avoid joins and reduce I/O cost [43], pushing aggregation before joins to compress tables, and balancing row distribution among processing units.

This alternative covers the scenario where most processing on large tables is done inside the DBMS, and the remaining partial processing is done on relatively smaller tables exported outside the DBMS [57] to an external mathematical library or statistical program.


Given DBMS overhead and SQL's lack of flexibility to develop algorithmic optimizations, SQL is generally slower than a systems language like C++ for straightforward cases. Overall, SQL has been enriched with object-relational extensions like User-Defined Functions (UDFs), stored procedures, table-valued functions and User-Defined Types, with aggregate UDFs being the most powerful for data mining. UDFs represent a compromise between modifying the DBMS source code, extending SQL with new syntax and generating SQL code. They are an important extensibility feature, now widely available (although not standardized) in most DBMSs. UDFs include scalar, aggregate and table functions, with aggregate UDFs providing a simplified parallel programming interface; a PostgreSQL-style sketch of an aggregate UDF is given below. UDFs can take full advantage of the DBMS processing power when running in the same address space as the query processing engine (i.e. unfenced [57] or unprotected mode). UDFs do not increase the computational power of SQL, but they make programming easier and many optimizations feasible, since they are basically code fragments or modules natively supported by the system.

Data structures, incremental computations and matrix operations are certainly difficult to program with SQL and UDFs, due to the relational DBMS architecture, the limited flexibility of the SQL language and the overhead of DBMS subsystems, but they can be optimized and they can achieve a tight integration with the DBMS. SQL can achieve linear scalability in data set size and dimensionality to compute several models. It has been shown that SQL is somewhat slower (not an order of magnitude slower) than C++ (or a similar language) [43], with relative performance getting better as data set size increases (because overhead decreases). There is research on adapting and optimizing numerical methods to work with large disk-resident matrices [52,69], but only a few works are in the DBMS realm [45,32].

Developing fast algorithms in C++ or Java was the most popular alternative in the past. This was no coincidence, because a high-level programming language provides the freedom to manipulate RAM and minimize I/O. Modifying and extending the DBMS source code (open source or proprietary) yields algorithms tightly coupled to the DBMS. This alternative has been the most common for commercial DBMSs and it includes extending SQL with new syntax. It also includes low-level APIs to directly access tables through the DBMS file system (cursors [57], blocks of records with OLE DB [41]). This alternative provides fast processing, but it also has disadvantages: given the complex DBMS architecture and the lack of access to DBMS source code, it is difficult for end users and researchers to incorporate new algorithms or change existing ones. Nowadays, this alternative is not popular for parallel processing.
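As a minimal sketch of the aggregate UDF mechanism, PostgreSQL-style SQL lets a user register a transition function and an aggregate without touching the DBMS source code; the function, table and column names here are illustrative.

  -- Transition function: accumulates a sum of squares one row at a time
  CREATE FUNCTION sumsq_step(acc DOUBLE PRECISION, x DOUBLE PRECISION)
  RETURNS DOUBLE PRECISION AS
  $$ SELECT acc + x * x $$ LANGUAGE SQL;

  -- Aggregate UDF built on the transition function
  CREATE AGGREGATE sumsq (DOUBLE PRECISION) (
    SFUNC = sumsq_step,
    STYPE = DOUBLE PRECISION,
    INITCOND = '0');

  -- Used like any built-in aggregate; the DBMS handles scanning and grouping
  SELECT cluster_id, sumsq(x1) FROM points GROUP BY cluster_id;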


The Hadoop ecosystem provides many means to process large data sets in parallel outside a DBMS, the most common of which is currently MapReduce [62], either as direct MapReduce code or via higher-level frameworks such as Hive [70] and Pig. A cloud computing model is often used, where the cloud offers hardware and software as a service, although many Hadoop installations remain local and in-house. Even though some DBMSs have achieved some level of integration between SQL and MapReduce (e.g., Greenplum [32], HadoopDB [2], Teradata Aster Data [24]), in most cases data must be moved from the DBMS storage to the MapReduce file system. Some recent systems that have enabled SQL capabilities on top of the distributed Hadoop file system, integrating some information retrieval capabilities into SQL, include ODYS [67] and AsterixDB [6].

In statistics the norm is to develop programs in non-systems languages like R, Matlab and SAS, which do not scale to large data sets (large-scale numerical analysis, in contrast, typically relies on MPI-based programs with data stored in formats such as HDF or NetCDF). Thus analysts generally split computations between such packages and MapReduce [17], pushing demanding computations with heavy I/O on large data sets into MapReduce. In general, the direct integration of mathematical non-systems languages with Hadoop remains limited; however, Hadoop MapReduce provides a "streaming" interface through which custom scripts or programs can be called within a MapReduce job. The RIOT [71] system introduces a middleware solution to provide transparent I/O for the R statistical package with large matrices (i.e. the user does not need to modify R programs to analyze large matrices). RIOT addresses the scalability problem, but outside the DBMS; it cannot efficiently analyze data inside a DBMS.

3.6 Algorithm Design and Optimizations

Efficient algorithms are central to database systems. Their efficiency depends on the data set layout, iterative behavior, numerical method convergence and how much primary storage (RAM) is available. There are two major goals: decreasing I/O and accelerating convergence. Most statistical algorithms have iterative behavior, requiring one or more passes over the data set per iteration [45]. The most common approach to accelerate an algorithm is to reduce the number of passes scanning the large data set, without decreasing (or even while increasing) model quality (e.g. log-likelihood, squared error, node purity). The rationale is to decrease I/O cost, but also to reduce CPU time. Many algorithms attempt to reduce the number of passes to one, processing the data set in sufficiently large blocks.

Nevertheless, additional iterations over the entire data set are generally needed to guarantee convergence to a stable solution. Reducing the number of passes is highly related to incremental algorithms, which can update the model as more records (points) are processed. Incremental algorithms with matrices on secondary storage are an important solution. Algorithms that update the model online based on gradient-descent methods have been shown to benefit several common models [32], and they can be integrated as UDFs into the DBMS architecture. Stream processing [28,25,66] represents an important variation of incremental algorithms, taken to the extreme: each point is analyzed at most once, or even skipped when the stream arrives at a high rate. Building data set summaries that preserve statistical properties is an essential optimization to accelerate convergence. Most algorithms build summaries based on sufficient statistics, most commonly for the Gaussian density and the Bernoulli distribution; a one-pass SQL sketch is given at the end of this subsection. Sufficient statistics have been extended and generalized considering independence and multiple subsets of the data set, benefiting a large family of linear models. Data set squashing keeps higher-order moments (sufficient statistics beyond power 2) on binned attributes to compress a data set in a lossy manner; statistical algorithms on squashed data need to be modified to consider weighted points.

Sampling (including randomized algorithms and Monte Carlo methods) is another mechanism that can accelerate processing by computing approximate solutions (only when an efficient sampling mechanism exists, with time complexity linear in sample size), but it always requires additional samples (e.g. bootstrapping) or extra passes over the entire data set to reduce approximation error: skewed probability distributions and extreme values impact sampling quality. Data structures tailored to minimize I/O cost while using RAM efficiently are another important mechanism. The most common data structures are balanced trees and tries, followed by hash tables. Trees are used to index transactions (FP-tree), to index itemsets (with a hash-tree combination) or to summarize points.
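As a concrete example, the sufficient statistics n, L (linear sum) and Q (sum of squares) for each group can be gathered in a single scan with one SQL aggregation; the data set and column names below are hypothetical, and cross-product terms would be added for correlated dimensions.

  -- One pass over the data set: n, L and Q per cluster for two dimensions
  SELECT k,
         COUNT(*)     AS n,
         SUM(x1)      AS L1, SUM(x2)      AS L2,   -- linear sums
         SUM(x1 * x1) AS Q1, SUM(x2 * x2) AS Q2    -- quadratic sums
  FROM points
  GROUP BY k;
  -- Means and variances are then derived from n, L, Q without rescanning the data.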

4 Examples of Important Systems

Table 2 summarizes important systems, including their storage, query language and parallelism. Table 3 lists more recent systems. Table 4, complementing Table 2, outlines extensibility programming mechanisms, operating systems and the finest level of physical storage.


Table 2 Important systems: storage type, query language, parallelism.

  Name                  Storage type                            Query language                                       Shared-nothing, N nodes
  DB2                   Row store                               SQL:2011                                             Y
  MySQL                 Row store                               SQL:1999                                             Y, sharding
  PostgreSQL            Row store                               SQL:2008                                             N
  SQLServer             Row store                               Transact SQL                                         N (Y in PDW)
  Oracle                Row store                               SQL:2008                                             Y
  Greenplum             Row store                               SQL:2008                                             Y
  Teradata              Row store                               SQL:1999 and T-SQL                                   Y
  Hana                  Column store                            SQL:92                                               Y
  InfiniDB              Column store                            SQL:92                                               Y
  MonetDB               Column store                            SQL:2008, SciQL                                      N
  Vertica               Column store                            SQL:1999                                             Y
  Accumulo              Column-oriented storage on top of HDFS  Java API; Thrift interface                           Y
  Hbase                 Column-oriented storage on top of HDFS  (J)Ruby IRB shell; Java; Thrift and REST interfaces  Y
  Impala                Multiple formats on top of HDFS         SQL:92                                               Y
  AllegroGraph          Graph DB                                SPARQL 1.1, Prolog, RDFS++                           N
  InfiniteGraph         Graph DB                                -                                                    N, multithreaded
  Neo4j                 Graph DB                                Cypher, Gremlin, SPARQL                              N
  Oracle Spatial/Graph  Graph DB                                Apache Jena, PL/SQL                                  N, shared-everything
  GeoRaster             Array DB                                PL/SQL                                               N, shared-everything
  PostGIS               Array DB                                SQL:2008                                             N
  Rasdaman              Array DB                                RaSQL                                                N
  SciDB                 Array DB                                AQL                                                  Y
  InfoSphere Streams    Stream DB                               SPL                                                  N
  Samza                 Stream DB                               Stream processing API                                N, task-parallel
  Storm                 Stream DB                               Stream processing API                                Y

Table 3 More recent systems

  Name            Storage type and querying interface                                        Notes
  Presto          Multiple formats on top of HDFS; SQL-like interface                        Similar to Impala
  TitanDB         Graph DB; HDFS (via HBase or Cassandra); queried via the Gremlin language  Leverages HDFS for graph problems
  Google Dremel   Proprietary                                                                Interface for rapid queries
  Google Spanner  Proprietary                                                                Globally distributed database
  Google F1       Proprietary                                                                Hybrid database with NoSQL + SQL features
  Apache Drill    -                                                                          Open source alternative to Dremel
  Pivotal HAWQ    HDFS, queried with an SQL variant                                          RDBMS on top of HDFS
  Hadapt          -                                                                          RDBMS on top of HDFS
  Apache Phoenix  SQL                                                                        SQL layer over HBase
  Elasticsearch   Indexing of documents and data on HDFS                                     -

[Unnumbered comparison table, not reconstructed here: it contrasts data management approaches (RDBMS, key-value stores, document-oriented stores, columnar stores, hybrid tables and distributed file systems) in terms of main structure, typical data dimensions, storage format, prevalent storage structure, scaling (up/out), workloads they are optimal or poor for, and example systems such as Oracle, DB2, Teradata, MySQL, Postgres, Memcached, Redis, MongoDB, HBase, Accumulo, BigTable and Hive.]

4.1 DBMSs

Relational DBMSs

Relational DBMSs supporting SQL, ACID properties and parallel processing remain the dominant technology to process transactions and ad-hoc querying. By far, row DBMSs still dominate the analytic market, but column DBMSs are gaining traction. Since column DBMSs have proven to be much faster than row stores for exploratory analysis, most row DBMSs have incorporated some level of support for column storage; however, their performance cannot match a column DBMS designed and developed from scratch [1]. In the future, column DBMSs are expected to gain wider adoption.

Non-Relational Systems: Stream Based Databases

As previously mentioned, stream-based databases perform continuous analytics on non-static data. Stream systems are designed to enable aggressive analysis of information and knowledge: they process very large volumes of data from diverse sources with low latency in order to extract information, respond in real time to events and changing requirements, analyze data continuously, adapt to changing data types, and deliver real-time computing functions in order to learn patterns and predict outcomes, while providing information security and confidentiality.

Streams are used in the field of communications, where they commonly deal with the enormous amount of data generated by cell phone communications [59]; in financial services [10,20], where they are used for the fast analysis of large volumes of data to deliver near real-time business decisions; and in transportation, for intelligent traffic management [7], to name a few. Notable examples of stream databases are IBM's InfoSphere Streams [55], ATT Labs' Gigascope [15] and Borealis from MIT [65].
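The flavor of such continuous queries can be approximated in plain SQL; a stream engine evaluates a query like the sketch below incrementally over a sliding window instead of rescanning stored records (the stream and column names are hypothetical, and real streaming engines use dedicated window syntax).

  -- Number of distinct users per service over the last five minutes
  SELECT service, COUNT(DISTINCT user_id) AS active_users
  FROM call_events
  WHERE event_ts > CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
  GROUP BY service;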

Non-Relational DBMSs: Graph Analytics

Graphs are an important field of research in discrete mathematics and theoretical computer science, and they allow complex problems to be represented in an intuitive way. Most systems to analyze graphs are purpose-built, using an adjacency matrix or an adjacency list and managing the whole graph in memory. However, large graphs, such as those representing social networks or the world wide web, with node counts ranging from hundreds of millions to billions, require data management mechanisms for storage and querying. Graph database systems have been proposed to deal with this problem [60]. Pregel [38] was pioneered by Google to mine graphs, inspired by the Bulk Synchronous Parallel model [23]. Apache Giraph, another iterative system, was developed as an open source counterpart to Pregel [4]. Neo4j was developed by Neo Technology as an open source graph database incorporating full ACID transaction processing; it stores data in nodes connected by directed relationships. Finally, we can mention Urika, a complete hardware/software solution developed by Cray, using a scalable shared-memory architecture that aims to solve the problem of partitioning graphs to take advantage of parallel computation [33]. Several of these systems are built on top of Hadoop storage and share most of its characteristics and limitations.

Table 4 Important systems: extensibility programming mechanisms, operating systems and physical storage.

  DB2: scalar and table UDFs in C++, Java and Basic. OS: Linux, Unix, MS Windows. Storage: pages of 4, 8, 16 or 32 kB; mixed page sizes are allowed.
  MySQL: aggregate UDFs written in C and C++. OS: FreeBSD, Linux, Solaris, Mac OSX, Windows. Storage: page size of 16 kB by default, changeable to 4, 8, 32 or 64 kB.
  PostgreSQL: user-defined operators and UDFs (aggregate and window) in PL/pgSQL and C. OS: Linux, Unix, MS Windows. Storage: data page size of 8 kB.
  SQLServer: full support for scalar and aggregate UDFs and TVFs written in C# and Visual Basic. OS: Windows. Storage: data page size of 8 kB.
  Teradata: TVFs and UDFs written in C. OS: Windows. Storage: variable block size up to 128 kB.
  Hana: scalar and table UDFs in SQLScript. OS: SUSE Linux Enterprise Server. Storage: block size from 4 kB to 16 MB.
  InfiniDB: no support for aggregate UDFs yet. OS: Linux, MS Windows. Storage: data arranged in extents (continuous allocations of space), typically between 8 and 64 MB.
  MonetDB: limited support for scalar and column UDFs in C. OS: Linux, Mac OSX, MS Windows. Storage: block size of 512 kB.
  Vertica: limited UDF support. OS: Linux (Ubuntu and RHEL). Storage: block size of 1 MB.
  Accumulo: MapReduce support. OS: Linux. Storage: block size of 64 MB or 128 MB (HDFS standard).
  Hbase: MapReduce support. OS: Linux. Storage: block size of 64 MB or 128 MB (HDFS standard).
  Impala: scalar and aggregate UDFs in C++ or Java. OS: Linux. Storage: block size of 64 MB or 128 MB (HDFS standard).
  AllegroGraph: stored procedures in LISP and Java. OS: FreeBSD, Linux, OSX, Windows. Storage: N/A.
  InfiniteGraph: Java procedures. OS: OSX, SUSE Linux, Ubuntu, Windows. Storage: page sizes range from 1 to 16 kB.
  Neo4j: unmanaged extensions in Java. OS: Linux, HP-UX, OSX, Windows. Storage: default page size of 1 MB.
  Oracle Spatial and Graph: UDFs in PL/SQL, Java and C++. OS: Linux, HP-UX, Solaris, Windows. Storage: block size of 2 to 8 kB.
  GeoRaster: UDFs in PL/SQL, Java and C++. OS: Linux, HP-UX, Solaris, Windows. Storage: variable block size, up to 4 GB.
  PostGIS: user-defined operators and UDFs (aggregate and window) in PL/pgSQL and C. OS: Linux, OSX, Windows. Storage: 8 kB page size.
  Rasdaman: user-defined functions. OS: Linux, HP-UX, Solaris, Windows. Storage: tile size ranges from 1 to 4 MB.
  SciDB: user-defined types and UDFs in C++. OS: Ubuntu, RHEL, CentOS. Storage: chunk size of 4 to 20 MB for optimal performance.
  InfoSphere Streams: Streams Studio applications as well as Java and C++ extensions. OS: Red Hat Linux. Storage: adjustable.
  Samza: stream processing API. OS: Linux, Solaris, Unix. Storage: built on top of LevelDB, with block sizes from 64 to 256 kB.
  Storm: stream processing API. OS: Linux, Solaris, Unix. Storage: no storage option available.

Table 5 Mathematical packages.

  Julia: RAM not limited. Parallelism: supports parallel and cloud computing, but no single-node parallelism. Language: Julia. JIT compiled.
  LAPACK: matrices up to 2^32 x 2^32 elements. Parallelism: single-node superscalar architectures, and multiple-node parallelism via ScaLAPACK. Language: library of functions called from FORTRAN and C++. Compiled.
  MatLab: up to 8 TB of memory on 64-bit systems. Parallelism: single-node and N-node parallelism via the Parallel Computing Toolbox. Language: MatLab. Interpreted.
  Octave: limited by physical memory. Parallelism: available via toolkits. Language: Matlab (reduced). Interpreted.
  pbdR: RAM not limited. Parallelism: implicit and explicit parallelism on single nodes and HPC clusters. Language: R. Interpreted.
  R: limited to 2 GB. Parallelism: single-node parallelism. Language: R. Interpreted.
  Revolution: RAM not limited. Parallelism: single and multiple node clusters. Language: R. Interpreted.
  SAS: limited by the physical memory size and paging space available when SAS is started. Parallelism: single-node parallelism. Language: SAS. Both compiled and interpreted.

4.2 Generic HDFS systems, not based on SQL

As previously mentioned, in many cases it is not convenient to store data in a DBMS. Reasons for this include the need for quick prototyping of data mining or machine learning code, which can be handled more efficiently using flat files, or data whose nature is too "unstructured" to easily fit any of the categories mentioned in Section 4. Table 5 summarizes important mathematical packages used today.

The R package and Matlab

The R language was developed in 1993 by Ihaka and Gentleman [35] as an interactive tool for their introductory data-analysis course. Since then R has evolved into a full-fledged programming language for processing large and complex data sets. As an open-source project, it is maintained and developed by the statistics community. This has resulted in a package loaded with features, especially oriented to the quick generation of charts and graphs for easy interpretation of data.

Furthermore, most machine learning and statistics algorithms have been incorporated into the package. These features make it a tool of choice for quick data analysis. Its main disadvantage is that it is not optimized for large data sets such as those commonly found in today's problems. This drawback, however, is being addressed by tools like RRE from Revolution Analytics and pbdR [50,51]. The developers of pbdR (Programming with Big Data in R) aim to provide distributed data parallelism in R in order to utilize large HPC platforms. This is achieved using high-performance libraries (MPI, ScaLAPACK and NetCDF4) while preserving R's syntax, with new scalable code underneath.

Matlab combines a high-level language with an interactive environment oriented towards visualization and model computation. Matlab has a large built-in library of mathematical functions to solve engineering and scientific problems, as well as to explore and model data. However, like R, Matlab is not designed to be used on data sets that exceed the available memory of the computer.

SASTRY: Hadoop Distributed File System (HDFS)

HDFS and MapReduce were the two major components of the original Hadoop (version 1.x), and they remain complementary: HDFS handles storage while MapReduce handles large-scale processing. At its core, the Hadoop Distributed File System (HDFS) is a distributed file system, written in Java, that stores large files across multiple machines, with added reliability achieved by replicating the data at multiple nodes; hence there is no need for RAID storage on the host machines. Data nodes can communicate with each other in order to balance the load. In addition to the data nodes, HDFS includes a primary NameNode, the main metadata server, and a secondary NameNode where snapshots of the primary NameNode are stored. The NameNode can become a bottleneck if many small files are being accessed or when the cluster size increases substantially. This is being addressed by allowing multiple namespaces to be handled by different NameNodes using HDFS Federation. As the Hadoop ecosystem evolved, many tools have been developed to support various programming paradigms and to provide different interfaces; commercial distributions such as Cloudera, Hortonworks and MapR deliver complete distributions of Apache Hadoop, coupled with features like improved security and user interfaces. The major drawbacks of HDFS are that it is not suitable for systems requiring concurrent write access to data, and that accessing data outside the file system can be difficult, since existing operating systems cannot mount it directly; all file accesses must go through the native Java API.


MapR [27] addresses these problems by replacing HDFS with a full random-access file system.

5 Patterns, Cubes, Graphs and Statistical Models

We now discuss the most difficult part of big data analytics, after data (tables, files, text) have been cleaned, integrated, transformed and explored with queries or simple programs. Fundamental research problems on analyzing big data can be broadly classified into the following categories:

1. Patterns, which involve exploratory search on a discrete combinatorial space (association rules, itemsets, sequences). Such patterns are commonly discovered by search algorithms exploring a lattice.
2. Cubes, which underlie decision support queries.
3. Graphs, a formal mathematical model of objects and their relationships.
4. Statistical models, which involve matrices, vectors and histograms. Such models are computed by a combination of numerical and statistical methods.

5.1 Patterns: Cubes and Itemsets

In the case of cubes, the data set is a fact table F with d discrete attributes (called dimensions) to aggregate by and m numeric attributes called measures. We also introduce k, a common additional size specified in models (e.g. k clusters, k components). Pattern search is a different problem from statistical models because algorithms work in a combinatorial manner on a discrete space. Patterns include cubes and itemsets (association rules) [3], as well as their generalization to sequences (considering an order on items) and graphs. Given their generality and close relationship to social networks, mining large graphs is currently the most important topic.

Cubes represent a mature research topic, but one that is widely used in practice in BI tools. Most approaches avoid exploring the entire cube by pruning the search space. Methods increase the performance of multidimensional aggregations by combining data structures for sparse data with cell precomputation at different aggregation levels of the lattice. The authors of [29] put forward the idea of creating smaller, indexed summary tables from the original large input table to speed up aggregation queries. In [56] the authors explore tools that guide the user to interesting regions of large data cubes in order to highlight anomalous behavior. Large lattices are explored with greedy algorithms.
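As a small illustration of multidimensional aggregation, the sketch below computes a cube over a hypothetical fact table F(store, product, month, amount) with the standard GROUP BY CUBE clause; the table and column names are ours, and the works cited above rely on summary tables and lattice pruning rather than a single query.

-- Hypothetical fact table F with d = 3 dimensions and one measure (amount).
-- GROUP BY CUBE aggregates over all 2^d = 8 subsets of the dimensions,
-- i.e., every aggregation level of the lattice.
SELECT store, product, month, SUM(amount) AS total_amount
FROM F
GROUP BY CUBE (store, product, month);

DBMSs without CUBE support can emulate it with GROUPING SETS or a union of plain GROUP BY queries, at the cost of additional scans.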
Association rules were by far the most popular data mining topic in the past. Frequent itemset mining is generally regarded as the fundamental problem in association rule mining, solved by the well-known level-wise A-priori algorithm (exploiting the downward closure property). The common approach to mine association rules in a DBMS is to view frequent itemset search as optimizing multiple GROUP-BYs. From a DBMS perspective, exploiting SQL queries, UDFs and local caching is explored in [57]. This work studies how to write efficient SQL queries with k-way joins and nested sub-queries to get frequent itemsets, but it did not explore UDFs in depth, as they were more limited back then. SQL is shown to be competitive, but the most efficient solution (local caching) is based on creating a temporary file inside the DBMS, apart from relational tables. We believe the slightly slower solution in pure SQL is more valuable in the long term. Recursive queries can search for frequent itemsets, yielding a compact piece of SQL code, but unfortunately this solution is slower than k-way joins. Relational division and set containment can also solve frequent itemset search, providing a complementary solution to joins; the key issue is that relational division requires different query optimization techniques compared to SPJ operators. There have been attempts to adapt data structures, like the FP-tree, to work in SQL, but experimental evidence over previous approaches is inconclusive. Mining frequent itemsets over sliding windows has been achieved with aggregate UDFs interfacing with external programs. Table UDFs can update an entire cube in RAM in one pass for low-dimensional data sets; thus the dimension (or itemset) lattice can be maintained in main memory. The drawback is that TVFs require a cursor interface that disables parallel processing. The lattice is the main hindrance to processing with SQL queries because it cannot be easily represented as a tabular structure and because SQL does not have pointers and requires joins instead. There exists work on mining graphs (which represent a more general problem) with SQL, but it solves narrow issues. Instead, most work has focused on studying graph algorithms with MapReduce. For the most part, mining graphs with queries or UDFs remains unexplored.
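To make the k-way join approach concrete, the following sketch finds frequent 1-itemsets and 2-itemsets over a hypothetical transaction table T(tid, item); :minsup is a placeholder support threshold, and the actual queries in [57] include further optimizations.

-- Frequent 1-itemsets: items appearing in at least :minsup transactions.
CREATE TABLE F1 AS
SELECT item, COUNT(*) AS supp
FROM T
GROUP BY item
HAVING COUNT(*) >= :minsup;

-- Frequent 2-itemsets via a 2-way self-join on the transaction identifier,
-- restricted to frequent items (A-priori pruning via downward closure).
CREATE TABLE F2 AS
SELECT T1.item AS item1, T2.item AS item2, COUNT(*) AS supp
FROM T T1 JOIN T T2 ON T1.tid = T2.tid AND T1.item < T2.item
WHERE T1.item IN (SELECT item FROM F1)
  AND T2.item IN (SELECT item FROM F1)
GROUP BY T1.item, T2.item
HAVING COUNT(*) >= :minsup;

Frequent k-itemsets follow the same pattern with k-way joins, which is why the depth of the itemset lattice dominates the cost of this approach.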

5.2 DAVID: Graphs

The input is one large directed graph G = (V, E), where V is the set of vertices (nodes), E the set of edges, n = |V| and m = |E|. The goal is to find interesting patterns in the graph, which include paths and subgraphs. Even though the problems and algorithms explained below have been studied for a long time in graph theory, operations research and theoretical computer science, big data analytics has made it necessary to revisit them because G cannot fit in RAM and G must be processed in parallel. The shape of the graph and the existence of cycles make the problem harder [44]. The simplest graphs to analyze include trees, lists and graphs with small disconnected components. Harder graphs have cliques and cycles. The hardest graphs are similar to the complete graph Kn, but they are not common in social networks or the Internet.

The most important graph algorithms can be broadly classified into non-recursive and recursive (given their most common definition). Non-recursive algorithms with wide applicability include: shortest paths (Dijkstra), node centrality, edge density per group, clustering coefficient, and discovering frequent subgraphs up to subgraph isomorphism. Recursive algorithms can be broadly subclassified into those that involve a linear recursion and those with a (harder) non-linear recursion. Linear recursion problems include transitive closure, power laws (fractals), iterative spectral clustering (on the Laplacian of G), and discovering frequent subgraphs. Non-linear recursion is prominently represented by clique enumeration (i.e. finding all cliques) and detecting isomorphic subgraphs.
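For example, a linear recursive query for reachability (a restricted form of transitive closure, in the spirit of the queries studied in [44]) can be written with standard SQL recursion; the edge table E(src, dst) and the source vertex are hypothetical.

-- Vertices reachable from vertex 1 in a directed graph stored as E(src, dst).
-- UNION (rather than UNION ALL) discards duplicate rows, so the recursion
-- terminates even when G contains cycles.
WITH RECURSIVE Reach(v) AS (
  SELECT dst FROM E WHERE src = 1
  UNION
  SELECT E.dst
  FROM Reach JOIN E ON Reach.v = E.src
)
SELECT v FROM Reach;

The full transitive closure seeds the recursion with all edges instead of a single source, which is where parallel evaluation and the optimizations of [44] become essential.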

5.3 SASTRY: Statistical Models

Here we present models in two major groups: unsupervised models (clustering, dimensionality reduction) and supervised models (classification, regression, feature selection, variable selection).

Definitions: Let X = {x1, . . . , xn} be a data set having n records, where each point has d attributes. When all d attributes are numeric, X can be considered a matrix and each record is a point in d-dimensional space. For predictive models, X has an additional “target” attribute, which can be a numeric dimension Y (used for linear regression) or a categorical attribute G (used for classification) with m distinct values.

Clustering: This is one of the best-known unsupervised data mining techniques. In this approach the data set is partitioned into subsets, such that points in the same subset are similar to each other. Clustering is commonly used as a building block for more complex models. Techniques that have been incorporated internally in a DBMS include K-means and hierarchical clustering. Scalable K-means can work incrementally in one pass, performing iterations on weighted points in main memory. This algorithm is also based on sufficient statistics, and it relies on low-level direct access to the file system, which allows processing the data set in blocks. SQL has been successfully used to program clustering algorithms such as K-means [43] and the more sophisticated EM algorithm [46]. Distance is the most demanding computation in these algorithms. Distance computation in K-means [43] is evaluated with joins between a table holding the centroids and the data set, computing aggregations on squared dimension differences. Sufficient statistics, denormalization (avoiding joins) and parallel query evaluation (with carefully balanced row distribution) accelerate K-means computation. It is feasible to develop an incremental version [43] with SQL queries that requires only a few passes (even two) on the data set; this approach is somewhat similar to iterated sampling (bootstrapping), with the basic difference that each record is visited at least once. UDFs further accelerate distance computation [45] by avoiding joins altogether. The EM algorithm represents an important maximum likelihood estimation method. Reference [46] represents a first attempt to program EM clustering (solving a Gaussian mixture problem) in SQL, proving it was feasible to program a fairly complex algorithm entirely with SQL code generation. In [46] several alternatives for distance computation are explored, pivoting the data set and distances to reduce I/O cost. Mixture parameters are updated with queries combining aggregations and joins. This SQL solution requires multiple iterations and two passes per iteration. Deriving incremental algorithms, exploiting sufficient statistics and accelerating EM processing with UDFs are open problems. Subspace clustering is a grid-based approach, which incorporates several optimizations similar to frequent itemset mining (discussed in Section 5.1). The data set is binned with user-defined parameters, which is easily and efficiently accomplished with SQL. SQL queries traverse the dimension lattice in a level-wise manner, making one pass per level. Acceleration is gained with aggressive pruning of dimension combinations.
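For illustration, one K-means iteration of the kind described above can be sketched in plain SQL, assuming hypothetical tables X(i, x1, x2) for the n points and C(j, c1, c2) for the k centroids (d = 2); the queries in [43] add sufficient statistics, denormalization and other optimizations omitted here.

-- Squared Euclidean distance from every point to every centroid (d = 2).
CREATE TABLE XD AS
SELECT X.i, C.j,
       (X.x1 - C.c1) * (X.x1 - C.c1) +
       (X.x2 - C.c2) * (X.x2 - C.c2) AS dist
FROM X CROSS JOIN C;

-- Assignment step: each point goes to its nearest centroid
-- (a tie-breaker would be needed in practice).
CREATE TABLE XN AS
SELECT XD.i, XD.j
FROM XD
JOIN (SELECT i, MIN(dist) AS mindist FROM XD GROUP BY i) M
  ON XD.i = M.i AND XD.dist = M.mindist;

-- Update step: recompute each centroid as the average of its assigned points.
SELECT XN.j, AVG(X.x1) AS c1, AVG(X.x2) AS c2
FROM XN JOIN X ON XN.i = X.i
GROUP BY XN.j;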
Principal Components Analysis: This is an unsupervised technique and the most popular dimensionality reduction method. PCA computation can be pushed into a DBMS by computing the sufficient statistics for the correlation matrix with SQL or UDFs in one table scan. It is possible to solve PCA with a complex numerical method like SVD entirely in SQL; such an approach enables fast analysis of very high dimensional data sets (e.g. microarray data). Moreover, this solution is competitive with a state-of-the-art statistical system (the R package). PCA has also been solved by summarizing the data with an aggregate UDF and computing the SVD with the LAPACK library [49], yielding a performance improvement of two orders of magnitude.

We now turn our attention to supervised techniques.

Decision Trees: This is a fundamental technique that recursively partitions a data set to induce predictive rules. Sufficient statistics for decision trees are important to reduce the number of passes: numeric attributes are binned and row counts per bin or categorical value are computed. The splitting phase is generally accelerated using a condensed representation of the data set. Table-Valued Functions have been exploited to implement primitive decision tree operations. UDFs implement primitives to perform attribute splits, reducing the number of passes over the data set; this work also proposes a prediction join, with new SQL syntax, to score new records. Reducing the number of passes on the data set to two or one using SQL mechanisms is an open problem. Exploiting MapReduce [21] to learn an ensemble of decision trees on large data sets using a cluster of computers is studied in [53]; this work introduces optimizations to reduce the number of passes on the data set.

Regression Models: We now discuss regression models, which are popular in statistics. Sufficient statistics for linear models (including linear regression) have been generalized and implemented as a primitive function using aggregate UDFs, producing a tightly coupled solution. Basically, the sufficient statistics for clustering (as used by K-means) are extended with dimension cross-products to estimate correlations and covariances. Matrix equations are mathematically transformed into equivalent equations with sufficient statistics as factors. The key idea is to compute those sufficient statistics in one pass, with SQL queries or, even faster, with aggregate UDFs. The equations solving each model can then be efficiently evaluated with a math library, exporting small matrices whose size is independent of the data set size. Linear regression has been integrated with the DBMS with UDFs, exploiting parallel execution of the UDF in unfenced mode. Moreover, it is feasible to perform variable selection exploiting sufficient statistics [45]. Bayesian statistics are particularly challenging because MCMC methods require thousands of iterations. Recent research [40] shows it is feasible to accelerate a Gibbs sampler to select variables in linear regression under a Bayesian approach, exploiting data summarization. Database systems have also been extended with model-based views in order to manage and analyze databases with incomplete or inconsistent content and to incrementally build regression and interpolation models, extending SQL with syntax to define model views.

Support Vector Machines: We close our discussion with Support Vector Machines (SVMs), a mathematically sophisticated classifier. SVMs map points to a higher dimensional space where the classes can be separated, but training has quadratic time complexity.
Two fundamental SVM aspects need to be considered: kernel functions (which map points to a higher dimensional space) and quadratic programming, a mathematically difficult optimization problem. Kernel functions have been integrated with a DBMS [39], for instance. SVMs have also been accelerated with hierarchical clustering and incremental matrix factorization.
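Returning to the sufficient statistics used by several of the models above (clustering, PCA and linear regression), the sketch below gathers n, the linear sums L and the squared/cross-product sums Q in a single table scan, assuming a hypothetical table X(x1, x2, y) with two explanatory attributes and a target y; the aggregate-UDF variants in [45] compute the same summaries faster.

-- One-pass summarization: the result is a single small row, independent of n,
-- that can be exported to a math library to solve least-squares regression
-- or to build the correlation matrix required by PCA.
SELECT COUNT(*)     AS n,
       SUM(x1)      AS l1,  SUM(x2)      AS l2,  SUM(y)       AS ly,
       SUM(x1 * x1) AS q11, SUM(x1 * x2) AS q12, SUM(x2 * x2) AS q22,
       SUM(x1 * y)  AS q1y, SUM(x2 * y)  AS q2y, SUM(y * y)   AS qyy
FROM X;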

6 The Future: Research Issues

DAVID: Analyzing rapidly growing volumes of relational tables and diverse files together (i.e., variety in big data analytics) adds a new level of difficulty and thus represents a major research direction today. The default storage mechanism in most DBMSs will likely remain row-based, but column stores [65] show promise to accelerate statistical processing, especially for feature selection, dimensionality reduction and cubes. Solid state memory may be an alternative for small volumes of data, but it will likely remain behind disks in price and capacity. Sharing RAM among multiple nodes over a fast network can help load large data sets in main memory. Shared disk is not a promising alternative due to the I/O bottleneck.

DAVID: In parallel processing the major goal will remain the efficient evaluation of mathematical equations independently on each table partition, balancing the workload and minimizing row redistribution. With the advances in fast network communication, larger RAM, multicore CPUs and larger disks, the future default system will probably remain a shared-nothing parallel architecture with N processing units, each one with its own main memory and secondary storage (disk or solid state memory). Matrices are used extensively in mathematical models and are computed with parallel algorithms. Thus we believe it is necessary to extend a DBMS with matrix data types, matrix operators and linear algebra numerical methods. In particular, it is necessary to study efficient mechanisms to integrate SQL with mathematical libraries like LAPACK. Array databases go in that direction, but they are incompatible with SQL and traditional relational storage mechanisms, resulting in multiple copies of the same data. Most likely, SQL will incorporate array storage mechanisms with a new data type and primitive operators.

SASTRY: Some common, simple data mining algorithms may make their way into DBMS source code, whereas others may keep growing as SQL queries or UDFs; most new data mining and statistical algorithms, however, will be developed in languages like C++ or R. Given the similarity between MapReduce jobs and aggregate UDFs, it is likely DBMSs will support automated translation from one into the other. Also, there will be faster interfaces to move data from one platform to the other. Extending SQL with new clauses is not a promising alternative, because SQL syntax is already overwhelming and modifying DBMS source code is required (many such proposals have simply been forgotten). Combining SQL code generation and UDFs needs further study to program and optimize complex statistical and numerical methods. State-of-the-art data mining algorithms generally have, or are forced to achieve, linear time complexity in data set size. Can SQL queries calling UDFs match such time complexity by exploiting relational physical operators (scan, join, sort)? There is evidence they can.

SASTRY: Future research will keep studying incremental model computation, block-based processing, data summarization, sampling and data structures for secondary storage under sequential and parallel processing. Developing incremental algorithms beyond clustering and decision trees is definitely challenging due to the need to reconcile parallel processing and incremental learning. Incorporating sampling as a primitive acceleration mechanism is a promising alternative to analyze large data sets, but the approximation error must be controlled.
Model selection is about selecting the best model out of several competing models; this task requires building several models with slightly different parameters or exploring several models concurrently. From the mathematical side the list of open problems is extensive. Overall, models involving time remain difficult to compute with existing DBMS technology (they are better suited to mathematical packages); Hidden Markov Models (HMMs) are a prominent example. Non-linear models (e.g. non-linear regression, neural networks) also deserve attention, but their impact is more limited. We anticipate recursive queries and aggregate UDFs will play a central role in analyzing large graphs inside a DBMS, whereas the state of the art for large graphs in social networks will remain MapReduce.

References

1. D.J. Abadi, S. Madden, and N. Hachem. Column-stores vs. row-stores: how different are they really? In Proc. ACM SIGMOD Conference, pages 967–980, 2008.
2. A. Abouzied, K. Bajda-Pawlikowski, J. Huang, D.J. Abadi, and A. Silberschatz. HadoopDB in action: building real world applications. In Proc. ACM SIGMOD Conference, pages 1111–1114, 2010.
3. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference, pages 207–216, 1993.
4. Ching Avery. Giraph: large-scale graph processing infrastructure on Hadoop. In Proceedings of the Hadoop Summit, Santa Clara, USA, 2011.

5. Peter Baumann, Alex Mircea Dumitru, and Vlad Merticariu. The array database that is not a database: file based array query answering in rasdaman. In Advances in Spatial and Temporal Databases, pages 478–483. Springer, 2013.
6. A. Behm, V.R. Borkar, M.J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V.J. Tsotras. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases (DAPD), 29(3):185–216, 2011.
7. Alain Biem, Eric Bouillet, Hanhua Feng, Anand Ranganathan, Anton Riabov, Olivier Verscheure, Haris Koutsopoulos, and Carlos Moran. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. In Proc. ACM SIGMOD Conference, pages 1093–1104, 2010.
8. P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. ACM KDD Conference, pages 9–15, 1998.
9. P.G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proc. ACM SIGMOD Conference, pages 963–968, 2010.
10. Badrish Chandramouli, Mohamed Ali, Jonathan Goldstein, Beysim Sezgin, and Balan Sethu Raman. Data stream management systems for computational finance. Computer, 43(12):45–52, 2010.
11. S. Chaudhuri, G. Das, and U. Srivastava. Effective use of block-level sampling in statistics estimation. In Proc. ACM SIGMOD Conference, pages 287–298, 2004.
12. J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425–429, 1999.
13. E.F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
14. J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481–1492, 2009.
15. C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: a stream database for network applications. In Proc. ACM SIGMOD Conference, pages 647–651, 2003.
16. C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998–1009, 2004.
17. S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson. RICARDO: integrating R and Hadoop. In Proc. ACM SIGMOD Conference, pages 987–998, 2010.
18. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In ACM SIGMOD Conference, pages 240–251, 2002.
19. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
20. Nihal Dindar, Baris Güç, Patrick Lau, Asli Ozal, Merve Soner, and Nesime Tatbul. DejaVu: declarative pattern matching over live and archived streams of events. In Proc. ACM SIGMOD Conference, pages 1023–1026, 2009.
21. E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow., 2(2):1402–1413, 2009.
22. H. Garcia-Molina, J.D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd edition, 2008.
23. Louis Gesbert, Zhenjiang Hu, Frédéric Loulergue, Kiminori Matsuzaki, and Julien Tesson. Systematic development of correct bulk synchronous parallel programs. In Parallel and Distributed Computing, Applications and Technologies (PDCAT), pages 334–340. IEEE, 2010.
24. A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H. Jacobsen. BigBench: towards an industry standard benchmark for big data analytics. In Proc. ACM SIGMOD Conference, pages 1197–1208, 2013.
25. L. Golab and M.T. Özsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5–14, 2003.

26. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. In Proc. IEEE ICDE Conference, pages 152–159, 1996.
27. Mike Gualtieri, Noel Yuhanna, Holger Kisker, and David Murphy. The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012, 2012.
28. S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In FOCS Conference, page 359, 2000.
29. H. Gupta, V. Harinarayan, A. Rajaraman, and J.D. Ullman. Index selection for OLAP. In Proc. IEEE ICDE Conference, 1997.
30. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.
31. T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
32. J. Hellerstein, C. Re, F. Schoppmann, D.Z. Wang, et al. The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endow., 5(12):1700–1711, 2012.
33. Phillip Howard. Urika: An InDetail paper, 2012. [Online; accessed 1-April-2014].
34. W. Hsieh, J. Madhavan, and R. Pike. Data management projects at Google. In ACM SIGMOD Conference, pages 725–726, 2006.
35. Ross Ihaka and Robert Gentleman. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.
36. Intel. Intel Math Kernel Library documentation. http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/, 2012.
37. Patrik Larsson. Analyzing and adapting graph algorithms for large persistent graphs. Technical report, Linköpings Universitet, Sweden, 2008.
38. Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proc. ACM SIGMOD Conference, pages 135–146, 2010.
39. B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle Database 10g: Removing the barriers to widespread adoption of support vector machines. In Proc. VLDB Conference, 2005.
40. M. Navas, C. Ordonez, and V. Baladandayuthapani. On the computation of stochastic search variable selection in linear regression with UDFs. In Proc. IEEE ICDM Conference, pages 941–946, 2010.
41. A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379–387, 2001.
42. C. Ordonez. Vertical and horizontal percentage aggregations. In Proc. ACM SIGMOD Conference, pages 866–871, 2004.
43. C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188–201, 2006.
44. C. Ordonez. Optimization of linear recursive queries in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(2):264–277, 2010.
45. C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752–1765, 2010.
46. C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In Proc. ACM SIGMOD Conference, pages 559–570, 2000.
47. C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495–508, 2008.
48. C. Ordonez, J. García-García, and Z. Chen. Dynamic optimization of generalized SQL queries with horizontal aggregations. In Proc. ACM SIGMOD Conference, pages 637–640, 2012.
49. C. Ordonez, N. Mohanam, C. Garcia-Alvarado, P.T. Tosic, and E. Martinez. Fast PCA computation in a DBMS with aggregate UDFs and LAPACK. In Proc. ACM CIKM Conference, 2012.
50. G. Ostrouchov, W.-C. Chen, D. Schmidt, and P. Patel. Programming with big data in R, 2012.

51. George Ostrouchov, Drew Schmidt, Wei-Chen Chen, and Pragneshkumar Patel. Combining R with scalable libraries to get the best of both for big data. Technical report, Oak Ridge National Laboratory (ORNL), Center for Computational Sciences, 2013.
52. F. Pan, X. Zhang, and W. Wang. CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In Proc. ACM SIGMOD Conference, pages 173–184, 2008.
53. B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. In Proc. VLDB Conference, pages 1426–1437, 2009.
54. S. Pitchaimalai, C. Ordonez, and C. Garcia-Alvarado. Comparing SQL and MapReduce to compute Naive Bayes in a single table scan. In Proc. ACM CloudDB, pages 9–16, 2010.
55. Roger Rea and Krishna Mamidipaka. IBM InfoSphere Streams: redefining real time analytics. IBM Software Group, 2010.
56. S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. EDBT Conference, pages 168–182. Springer-Verlag, 1998.
57. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In Proc. ACM SIGMOD Conference, pages 343–354, 1998.
58. K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In Proc. ACM CIKM Conference, pages 379–386, 2001.
59. Vyas Sekar, Michael K. Reiter, and Hui Zhang. Revisiting the case for a minimalist approach for network flow monitoring. In Proc. ACM SIGCOMM Internet Measurement Conference, pages 328–341, 2010.
60. Bin Shao, Haixun Wang, and Yanghua Xiao. Managing and mining large graphs: systems and implementations. In Proc. ACM SIGMOD Conference, pages 589–592, 2012.
61. E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A storage manager for complex parallel array processing. In Proc. ACM SIGMOD Conference, pages 253–264, 2011.
62. M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64–71, 2010.
63. M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J. O’Neil, P.E. O’Neil, A. Rasin, N. Tran, and S.B. Zdonik. C-Store: A column-oriented DBMS. In Proc. VLDB Conference, pages 553–564, 2005.
64. M. Stonebraker, J. Becla, D.J. DeWitt, K.T. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik. Requirements for science data bases and SciDB. In Proc. CIDR Conference, 2009.
65. M. Stonebraker, S. Madden, D.J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it’s time for a complete rewrite). In Proc. VLDB Conference, pages 1150–1160, 2007.
66. H. Wang, C. Zaniolo, and C.R. Luo. ATLaS: A small but complete SQL extension for data mining and data streams. In Proc. VLDB Conference, pages 1113–1116, 2003.
67. K.Y. Whang, T.S. Yun, Y.M. Yeo, I.Y. Song, H.Y. Kwon, and I.J. Kim. ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality. In Proc. ACM SIGMOD Conference, pages 313–324, 2013.
68. A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, and S. Subramanian. Spreadsheets in RDBMS for OLAP. In Proc. ACM SIGMOD Conference, pages 52–63, 2003.
69. G. Wu, E. Chang, Y.K. Chen, and C. Hughes. Incremental approximate matrix factorization for speeding up support vector machines. In Proc. ACM KDD Conference, pages 760–766, 2006.
70. R.S. Xin, J. Rosen, M. Zaharia, M.J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In Proc. ACM SIGMOD Conference, pages 13–24, 2013.
71. Y. Zhang, W. Zhang, and J. Yang. I/O-efficient statistical computing with RIOT. In Proc. IEEE ICDE Conference, 2010.