AIDA - Abstraction for Advanced In-Database Analytics

AIDA - Abstraction for Advanced In-Database Analytics Joseph Vinish D’silva Florestan De Moor Bettina Kemme [email protected] fl[email protected] [email protected] School of Computer Science, McGill University Montreal,´ Canada ABSTRACT pandas [31], NumPy [55], theano [52], etc., meant to aug- With the tremendous growth in data science and machine ment a general purpose language like Python with linear learning, it has become increasingly clear that traditional re- algebra support. Should the data to be used reside in an lational database management systems (RDBMS) are lack- RDBMS, the first step in these programs is to retrieve the ing appropriate support for the programming paradigms re- data from the RDBMS and store them in user space. From quired by such applications, whose developers prefer tools there, all computation is done at the user end. Needless that perform the computation outside the database system. to say, user systems do not possess huge memory capacity While the database community has attempted to integrate nor processing power unlike servers running an RDBMS po- some of these tools in the RDBMS, this has not swayed the tentially forcing them to use smaller data sets. Thus, users trend as existing solutions are often not convenient for the sometimes resort to big data frameworks such as Spark [58] incremental, iterative development approach used in these that load the data into a compute cluster and support dis- fields. In this paper, we propose AIDA - an abstraction for tributed computation. But even when enough resources are advanced in-database analytics. AIDA emulates the syntax available, users might choose smaller data sets, in particular and semantics of popular data science packages but trans- during the exploration phase, as transfer costs and latencies parently executes the required transformations and compu- to retrieve the data from the database system can be huge. tations inside the RDBMS. In particular, AIDA works with This data sub-setting can be counterproductive, as having a regular Python interpreter as a client to connect to the a larger data set can reduce algorithm complexity and in- database. Furthermore, it supports the seamless use of both crease accuracy [12]. Additionally, once the data is taken relational and linear algebra operations using a unified about of the RDBMS, all further data selection and filtering, straction. AIDA relies on the RDBMS engine to efficiently which is often crucial in the feature engineering phase of a execute relational operations and on an embedded Python learning problem, needs to be performed within the statisti- interpreter and NumPy to perform linear algebra operations. cal package [18]. Therefore, several statistical systems have Data reformatting is done transparently and avoids data been enhanced with some relational functionality, such as copy whenever possible. AIDA does not require changes to the DataFrame concepts in pandas [31], Spark [3], and R. statistical packages or the RDBMS facilitating portability. This has the database community pondering about the opportunities and the role that it needs to play in the grand PVLDB Reference Format: scheme of things [47, 56]. In fact, many approaches have Joseph Vinish D'silva, Florestan De Moor, Bettina Kemme. AIDA been proposed hoping to encourage data science users to - Abstraction for Advanced In-Database Analytics. PVLDB, perform their computations in the database. One such ap- 11(11): 1400-1413, 2018. DOI: https://doi.org/10.14778/3236187.3236194 proach is to integrate linear algebra operators into a conven- tional RDBMS by extending the SQL syntax to allow linear algebra operations such as matrix multiplication [60, 27, 2]. 1. INTRODUCTION However, the syntactical extension to SQL that is required, The tremendous growth in advanced analytical fields such is quite cumbersome compared to what is provided by sta- as data science and machine learning has shaken up the tistical packages, and thus, has not yet been well adopted. data processing landscape. The most common current ap- Both the mainstream commercial [37, 57] and open source proach to develop machine learning and data science appli- [38, 44, 51] RDBMS have been cautious of making any hasty cations is to use one of the many statistical languages such changes in their internal implementations { tuned for rela- as R [21], MATLAB, Octave [13], etc., or packages such as tional workloads { to include native support for linear algebra. Instead, most have been content with just embedding Permission to make digital or hard copies of all or part of this work for host programming language interpreters in the RDBMS en- personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies gine and providing paradigms such as user defined functions bear this notice and the full citation on the first page. To copy otherwise, to (UDFs) to interact with them. However, the use of UDFs to republish, to post on servers or to redistribute to lists, requires prior specific perform linear algebraic computation has been aptly noted permission and/or a fee. Articles from this volume were invited to present as not a convenient abstraction for users to work with [44]. their results at The 44th International Conference on Very Large Data Bases, In fact, a recent discussion on the topic of the intersection August 2018, Rio de Janeiro, Brazil. of data management and learning systems [24], points out Proceedings of the VLDB Endowment, Vol. 11, No. 11 Copyright 2018 VLDB Endowment 2150-8097/18/07. that even among the most sophisticated implementations DOI: https://doi.org/10.14778/3236187.3236194 1400 which try to tightly couple RDBMS and machine learning 2. BACKGROUND paradigms, the human-in-the-loop is not well factored in. In this section, we briefly cover the attempts made by Among the machine learning community, there is already a statistical packages to offer declarative query support and concern that research papers are focusing solely on accuracy by RDBMS to facilitate statistical computations. We do and performance, ignoring the human effort required [12]. not cover specialized DBMS implementations intended for Further, [12] also points out that human cycles required is scientific computing such as [5, 50, 28, 7] as our starting the biggest bottleneck in a machine learning project and point is that data resides in a traditional RDBMS. while difficult to measure, easiness and efficiency in experi- menting with solutions is a critical aspect. 2.1 Relational Influence in Statistical Systems Therefore, unsurprisingly, despite efforts from the database In their most basic form, statistical systems load all data community, studies have consistently shown Python and they need into their memory space. Assuming that the data R being the top favorites among the data science commu- resides in an RDBMS, this results in an initial data transfer nity [40, 10]. In light of these observations, in this paper, from the RDBMS to the client space. we propose an abstraction akin to how data scientists use However, incremental data exploration is often the norm Python and R statistical packages today, but one that can in machine learning projects. For instance, determining perform computations inside the database. By providing what attributes will contribute to useful features can be a an intuitive interface that facilitates incremental, iterative trial and error approach [12]. Therefore, while statistical development of solutions where the user is agnostic of a systems are primarily optimized for linear algebra, many of database back-end, we argue that we can woo data science them also provide basic relational primitives including se- users to bring their computation into the database. lection and join. For example, the DataFrame objects in We propose AIDA - abstraction for Advanced in-database R [42], pandas [31] package for Python, and Spark support analytics, a framework that emulates the syntax and seman- some form of relational join. Although not as optimized tics of popular Python statistical packages, but transpar- as an RDBMS or as expressive as SQL, the fact that lin- ently shifts the computation to the RDBMS. AIDA can work ear algebraic operations usually occupy a lion's share of the with both linear algebra and relational operations simul- computation-time makes this an acceptable compromise. taneously, moving computation between the RDBMS and There has been some work done in pushing relational op- the embedded statistical system transparently. Most impor- erations into the database, such as in Spark [9]. However, tantly, AIDA maintains and manages intermediate results while this is useful at the beginning, when the first data is from the computation steps as elegantly as current procedu- retrieved from the database, it is less powerful later in the ral language based packages, without the hassles and restric- exploration process where users might have already loaded tions that UDFs and custom SQL-extension based solutions significant amount of data into the statistical framework. In place. Further, we demonstrate how AIDA is implemented such situation, it might be more beneficial to just load the without making modifications to either the statistical sys- additional data into the framework and perform, say a join tem or the RDBMS engine. We implement AIDA with Mon- inside the framework rather than performing a join in the etDB as the RDBMS back-end. We show that AIDA has a database and reloading the entire data result, which can be significant performance advantage compared to approaches even more time consuming should the result set be large [45]. that transfer data out of the RDBMS. It also avoids the As a summary, users typically either fetch more attributes hassles of refreshing data sets in the presence of updates. than would be necessary and then do relational computa- In short, we believe that AIDA is a step in the right direc- tions locally, or they have to perform multiple data trans- tion towards democratizing advanced in-database analytics.

Load more