Multi-layer Optimizations for End-to-End Data Analytics
Amir Shaikhha, Maximilian Schleich, Alexandru Ghita, Dan Olteanu
University of Oxford, United Kingdom

arXiv:2001.03541v1 [cs.PL] 10 Jan 2020

Abstract

We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data.

We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.

CCS Concepts • Computing methodologies → Supervised learning by regression; • Information systems → Database management system engines; • Software and its engineering → Domain specific languages.

Keywords In-Database Machine Learning, Multi-Query Optimization, Query Compilation.

1 Introduction

The mainstream approach in supervised machine learning over relational data is to specify the construction of the data matrix followed by the learning task in scripting languages such as Python, MATLAB, Julia, or R, using software environments such as Jupyter Notebook. These environments call libraries for query processing, e.g., Pandas [34] or SparkSQL [4], or database systems, e.g., PostgreSQL [37]. The materialized training dataset then becomes the input to a statistical package, e.g., mlpack [14], scikit-learn [41], TensorFlow [1], or PyTorch [40], that learns the model. There are clear advantages to this approach: it is easy to use, allows for quick prototyping, and does not require substantial programming knowledge. Although it uses libraries that provide efficient implementations for their functions (usually by binding to efficient low-level C code), it misses optimization opportunities for the end-to-end relational learning pipeline. In particular, such libraries fail to exploit the relational structure of the underlying data, which was removed by the materialization of the training dataset. The effect is that the efficiency on one machine is often severely limited, with common deployments having to rely on expensive and energy-consuming clusters of machines to perform the learning task.

The thesis of this paper is that this performance limitation can be overcome by systematically optimizing the end-to-end relational learning pipeline as a whole. We introduce Iterative Functional Aggregate Queries, or IFAQ for short, a framework that is designed to uniformly process and automatically optimize the various phases of the relational learning pipeline. IFAQ takes as input a program that fully specifies both the data matrix construction and the model training in a dynamically-typed language called D-IFAQ. This language captures a fragment of scripting languages such as Python that is used for rapid prototyping of machine learning models. This is possible thanks to the iterative and collection-oriented constructs provided by D-IFAQ.

An IFAQ program is optimized by multiple compilation layers, whose optimizations are inspired by techniques developed by the data management, programming languages, and high-performance computing communities. IFAQ performs automatic memoization, which identifies code fragments that can be expressed as batches of aggregate queries. This optimization further enables non-trivial loop-invariant code motion opportunities. The program is then compiled down to a statically-typed language, called S-IFAQ, which benefits from further optimization stages, including loop transformations such as loop fusion and code motion. IFAQ investigates optimization opportunities at the interface between the data matrix construction and learning steps, and interleaves the code for both steps by pushing the data-intensive computation from the second step past the joins of the first step, inspired by recent query processing techniques [2, 38]. The outcome is a highly optimized code with no separation between query processing and machine learning. Finally, IFAQ compiles the optimized program into low-level C++ code that further exploits data-layout optimizations. IFAQ can outperform equivalent solutions by orders of magnitude.

[Figure 1 diagram: a Relational Learning Task enters IFAQ and flows through D-IFAQ and S-IFAQ down to C/C++.]
Figure 1. A relational learning task is expressed in D-IFAQ, transformed into an optimized S-IFAQ expression, and compiled to efficient C++ code.
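For concreteness, the following is a minimal sketch in plain Python of the staged pipeline described above: a feature extraction join first materializes the training dataset, and only then does a gradient-descent loop learn a linear regression model over it. All table contents and column meanings here are synthetic and ours, not taken from the paper; the point is only to make the two-step structure, and the separation IFAQ eliminates, concrete.

```python
# Staged relational-learning pipeline of the kind described above.
# All data is synthetic and illustrative.

# Two input relations: sales(store, units) and stores(store, size).
sales = [(1, 3.0), (1, 5.0), (2, 4.0)]
stores = {1: 0.5, 2: 1.5}

# Step 1: feature extraction query -- join the relations and
# materialize the training dataset as (feature vector, label) pairs.
# The label is a synthetic linear function of the features.
dataset = [((units, stores[store]), 2.0 * units + stores[store])
           for (store, units) in sales]

# Step 2: learn y = w[0]*x[0] + w[1]*x[1] by batch gradient descent,
# oblivious to the relational structure that produced `dataset`.
w, lr = [0.0, 0.0], 0.05
for _ in range(2000):
    g = [0.0, 0.0]
    for (x, y) in dataset:
        err = w[0] * x[0] + w[1] * x[1] - y
        g[0] += err * x[0]
        g[1] += err * x[1]
    w = [w[j] - lr * g[j] / len(dataset) for j in range(2)]
# w converges to roughly [2.0, 1.0] on this synthetic data.
```

In IFAQ's setting, both steps are expressed in one program, so the compiler can push the per-row gradient aggregates of Step 2 past the join of Step 1 instead of first materializing `dataset`.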
p ::= e | x̄ = ē while(e){ x̄ = ē } x
e ::= e + e | e ∗ e | −e | uop(e) | e bop e | c
    | Σ_{x ∈ ē} e | λ_{x ∈ ē} e | {{ē → ē}} | [[ ē ]] | dom(e) | e(e)
    | {x̄ = ē} | <x = e> | e.x | e[e]
    | let x = e in e | x | if e then e else e
c ::= ‘id‘ | "id" | n | r | false | true
T ::= S | {x̄ : T̄} | <x̄ : T̄> | Map[T, T] | Set[T]
S ::= B | C
C ::= string | enum | bool | Field
B ::= Z | R | R^n_C

Figure 2. Grammar of the IFAQ core language.

The contributions of this paper are as follows:
• Section 2 introduces the IFAQ framework, which comprises languages and compiler optimizations that are designed for efficient end-to-end relational learning pipelines.
• As proof of concept, Section 3 demonstrates IFAQ for two popular models: linear regression and regression tree.
• Section 4 shows how to systematically optimize an IFAQ program in several stages.
• Section 5 benchmarks IFAQ, mlpack, TensorFlow, and scikit-learn. It shows that IFAQ can train linear regression and regression tree models over two real datasets orders of magnitude faster than its competitors. In particular, IFAQ learns the models faster than it takes the competitors to materialize the training dataset. Section 5 further shows the performance impact of individual optimization layers.

2 Overview

Figure 1 depicts the IFAQ workflow. Data scientists use software environments such as Jupyter Notebook to specify relational learning programs. In our setting, such programs are written in a dynamically-typed language called D-IFAQ and subject to high-level optimizations (Section 4.1). Given the schema information of the dataset, the optimized program is translated into a statically-typed language called S-IFAQ (Section 4.2). If there are type errors, they are reported to the user. IFAQ performs several optimizations inspired by database query processing techniques (Section 4.3). Finally, it synthesizes appropriate data layouts, resulting in efficient low-level C/C++ code (Section 4.4).

2.1 IFAQ Core Language

The grammar of the IFAQ core language is given in Figure 2. This functional language supports the following data types.

The first category consists of numeric types as well as categorical types (e.g., boolean values, string values, and other custom enum types). Furthermore, for categorical types the other alternative is to one-hot encode them; the type of the encoding is represented as R^n_{T1}. This type represents an array of n real numbers, each one corresponding to a particular value in the domain of T1. In the one-hot encoding of a value of type T1, only the i-th value is 1 and the rest are zero. However, a value of type R^n_{T1}, denoting the one-hot encoding of T1, can take arbitrary real numbers at each position.

The second category consists of record types, which are similar to structs in C; the record values contain various fields of possibly different data types. Partial records, where some fields have no values, are referred to as variants.

The final category consists of collection data types such as (ordered) sets and dictionaries. Database relations are represented as dictionaries (mapping elements to their multiplicities). In D-IFAQ, the elements of a collection can have different types. However, in S-IFAQ the elements of a collection must all have the same data type. In order to distinguish variables of collection data types from other variables, we denote them as x̄ and x, respectively.

The top-level program consists of several initialization expressions, followed by a loop. This enables IFAQ to express iterative algorithms such as gradient descent optimization algorithms. The rest of the constructs of this language are as follows: (i) various binary and unary operations on both scalar and collection data structures,¹ (ii) let binding for assigning the value of an expression to a variable, (iii) conditional expressions, (iv) Σ_{x ∈ ē1} e2: the summation operator, which iterates over the elements of a collection (ē1) and performs a stateful computation using a particular addition operator (e.g., numerical addition, set union, bag union),² (v) λ_{x ∈ ē1} e2: constructing a dictionary whose key domain is ē1 and whose value for each key x is constructed using e2, (vi) constructing sets ([[ē]]) and dictionaries ({{ē → ē}}) given a list of elements, (vii) dom(e): getting the domain of a dictionary, (viii) e0(e1): retrieving the value associated with the given key in a dictionary, (ix) constructing records ({x̄ = ē}) and variants (<x = e>), and (x) statically (e.f) and dynamically (e[f]) accessing the field of a record or a variant.

¹ More specifically, ring-based operations.
² This operator can be any monoid-based operator. Thus, even computing the minimum of two numbers is an addition operator.