Towards an Analytics Query Engine ∗

Nantia Makrynioti
Athens University of Economics and Business
Athens, Greece

Vasilis Vassalos
Athens University of Economics and Business
Athens, Greece

∗ This work was co-funded by the European Union and the General Secretariat of Research and Technology, Ministry of Education, Research and Religious Affairs under the project Asset Management Optimization & Risk Management Software (AMOR) of the Bilateral R&T Cooperation Program Greece - Israel 2013-2015.

© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

ABSTRACT

This vision paper presents new challenges and opportunities in the area of distributed data analytics, at the core of which are data mining and machine learning. We first provide an overview of the current state of the art in the area and then analyse two aspects of data analytics systems, semantics and optimization. We argue that these aspects will emerge as important issues for the data management community in the coming years and propose promising research directions for solving them.

Keywords

Data analytics, Declarative machine learning, Distributed processing

1. INTRODUCTION

With the rapid growth of the World Wide Web (WWW) and the development of social networks, the available amount of data has exploded. This availability has encouraged many companies and organizations in recent years to collect and analyse data, in order to extract information and gain valuable knowledge. At the same time hardware cost has decreased, so storage and processing of big data is not prohibitive even for smaller companies. Topic classification, sentiment analysis, spam filtering, fraud and anomaly detection are only a few analytics tasks that gained considerable popularity over the past few years, along with more traditional warehouse queries that gather statistics from data. Hence, making the deployment of solutions for such tasks less tedious adds value to the services provided by these companies and organizations, and encourages more people to enhance their work using information from data.

It is clear that the areas of data mining and machine learning are at the core of data analysis tasks. However, developing such algorithms needs not only expertise in software engineering, but also a solid mathematical background in order to translate the mathematical computations correctly and efficiently into a program. Even when experimenting with black box libraries, evaluating various algorithms for a task and tuning their parameters, in order to produce an effective model, is a time-consuming process. Things become even more complicated when we want to leverage parallelization on clusters of independent computers for processing big data. Details concerning load balancing, scheduling or fault tolerance can be quite overwhelming even for an experienced software engineer.

Research in the data management domain recently started tackling the above issues by developing systems for large-scale analytics that aim at providing higher-level primitives for building data mining and machine learning algorithms, as well as hiding low-level details of distributed execution. MapReduce [12] and Dryad [16] were the first frameworks that paved the way. However, these initial efforts suffered from low usability, as they offered expressive but at the same time low-level languages to program data analysis tasks. Soon the need for higher-level programming languages on top of these frameworks became apparent. Systems such as Hive [23], Pig Latin [20], DryadLINQ [25] and Scope [9] offer higher-level languages that enable developers to write part of their programs in declarative style. Then, these programs are automatically translated into MapReduce jobs or Dryad vertices that form a directed acyclic graph (DAG), which is optimized for efficient distributed execution. This paradigm is adopted by other systems, too. Stratosphere [5], Tupleware [11] and MLbase [17] also aim at hiding details of distributed execution, such as load balancing and scheduling, to give the user the impression that she is developing code for a single machine.

Apart from the programming model, optimization techniques are another important issue that these systems address. Big data need fast and efficient solutions and as a consequence frameworks should leverage any opportunity for code optimization. Query rewriting, exploitation of data locality and vectorization are concepts already known and widely explored in databases and compilers. These techniques are also applied in the aforementioned analytics systems along with more recent ideas.

In this paper, we study systems for large-scale data analysis in the context of these two directions: semantics and optimization techniques. We present an overview of the immense development of systems for large-scale data analysis that made their appearance over the past few years and propose research topics that we argue will attract the interest of the data management community in the near future. The rest of the paper is organized as follows. Section 2 describes classes of systems for distributed data analytics and section 3 analyses open challenges concerning the semantics and the optimization methods used in such systems. Finally, section 4 concludes the paper.

2. CURRENT STATE OF THE ART

Current work in the area of big data analytics focuses on proposing programming models for data mining and machine learning tasks, and developing optimizations that result in efficient execution of users' programs on a distributed execution engine. So, large-scale analytics frameworks can be divided into two main categories: libraries of data mining/machine learning algorithms or sets of primitives to develop such algorithms.

Libraries provide implementations of algorithms commonly used in data analysis tasks, such as SVMs and K-means, targeted to a specific distributed execution platform. The user is able to use these algorithms in her code by calling them as functions, in order to load data from files or databases, transform data or use machine learning algorithms to analyse them. The programming paradigm is the same as that of a developer writing code for a single machine and calling functions from a third-party library. The difference is that this code is then automatically compiled by the system in order to be executed on a distributed platform. Such libraries are available for all popular distributed execution engines. Apache Mahout [2] was initially implemented on Hadoop [1] and is gradually being extended to Spark [26]. A similar example is MLlib [19], a scalable machine learning library on top of Spark, whereas MADlib [10] is a library of SQL-based machine learning and data analysis algorithms, which run on database engines. They include algorithms for classification, clustering, collaborative filtering, dimensionality reduction and other useful preprocessing tasks.

Systems that belong to the second category provide a set of primitives to the users, in order to simplify the development of distributed data analysis algorithms. These primitives can also be combined with UDFs (User Defined Functions) in many cases to allow for custom code in data analysis tasks. In this class of systems, algorithms are not ready to call, but the user can use these primitives to develop machine learning or data mining algorithms that run on a specific distributed execution engine.

[...] Similar to Stratosphere's Sopremo layer, Jaql and Pig Latin define each workflow as a sequence of steps, but with each step performing a high-level transformation, such as SQL and ETL operations. Users are able to integrate custom code in their data analysis tasks by writing their own UDFs either in external languages, such as Java and Python, or by using operators of the system. Eventually, Jaql and Pig Latin scripts are automatically compiled to MapReduce jobs. A variant of Pig Latin, called MyriaL, extends its programming model with looping constructs and is used in Myria [15], a big data management service. Iterative processes are also a limitation for Jaql and Sopremo, as the developer is not able to define in these programming models that a given workflow will be repeated for a number of iterations or as long as a condition holds.

Moving on to the incorporation of relational operators into a high-level language, we find DryadLINQ, which transforms LINQ programs into distributed processes running on an execution engine called Dryad. The LINQ model incorporates constructs for manipulating data items into a host language, such as C# and other .NET languages. Iteration is expressed using loop constructs of the host language. Spark exposes a similar functional programming interface implemented in Scala, but provides greater support for iterative processes and expression of shared state among iterations. Finally, Tupleware follows the same approach as Stratosphere's PACT model by proposing operators that are second-order functions and take as argument a user-defined first-order function. However, Tupleware's set of operators is more extensive than the one provided by PACT. These operators are used to define workflows inside a host language. Then, the system transforms the user's code into a distributed program, which is deployed and executed on a cluster of machines.

On the other hand, MLI API [22] and SystemML [13], which run on Spark and MapReduce respectively, aim to imitate the style of statistical computing languages, such as R and MATLAB, which are very popular among machine learning researchers and help them build software prototypes quickly. As a result, MLI includes interfaces for three main concepts, Optimizers, Algorithms and Models, as well as APIs for two data structures, MLTable and LocalMatrix, which supports
