Accelerating Database Queries for Advanced Data Analytics Hanfeng Chen, Joseph V

Short Paper HorsePower: Accelerating Database Queries for Advanced Data Analytics Hanfeng Chen, Joseph V. D’silva, Laurie Hendren, Bettina Kemme McGill University [email protected],[email protected],[email protected],[email protected] ABSTRACT database columns, HorseIR follows conceptually the data The rising popularity of data science has resulted in a chal- model of column-based DBS, which has been proven to be lenging interplay between traditional declarative queries effective for data analytics tasks. and numerical computations on the data. In this paper, we HorsePower extends the idea to a full-fledged execution present and evaluate the advanced analytical system Horse- environment for data analytics. Additionally to support- Power that is able to combine and optimize both program- ing plain SQL queries, HorsePower also supports functions ming styles in a holistic manner. It can execute traditional written in MATLAB, a popular high-level array language SQL-based database queries, programs written in the statis- widely used in the field of statistics and engineering. Horse- tical language MATLAB, as well as a mix of both by support- Power can take stand-alone functions written in MATLAB ing user-defined functions within database queries. Horse- and translate them to HorseIR, of have these functions be Power exploits HorseIR, an array-based intermediate rep- embedded in SQL queries and then translate all into a sin- resentation (IR), to which source programs are translated, gle HorseIR program, before optimizing and compiling the allowing to combine query optimization and compiler opti- code in a holistic manner. mization techniques at an intermediate level of abstraction. As such HorsePower avoids the overhead of inter-system data movements as it has a single execution environment, and eliminates the barriers between SQL queries and an- 1 INTRODUCTION alytical functions allowing optimizations across both the Complex data analytics has become the cornerstone of our declarative and functional parts of the query. data-driven society. Although the amount of data stored The contributions of this paper are thus as follows: in traditional relational database systems (DBS) has been • We present HorsePower, an advanced analytical system, growing rapidly, the by far most common current approach that extends the approach proposed in [5] to not only offer is to take the data first out of the DBS and load it into a compiler-based execution environment for SQL queries, stand-alone analytical tools, which are based on languages but also for programs written in the array-based language such as Python, or the statistical languages MATLAB [1], MATLAB and for SQL queries with embedded UDFs. and R [3]. However, as the size of the data increases, the • HorsePower uses a holistic approach of exploiting array- expensive data movement between DBS and analytics tools based compiler optimization techniques for both SQL and can become a severe bottleneck. MATLAB taking advantage of the conceptual similarities Integrating analytical capabilities into the DBS avoids of columns and arrays. such expensive data exchange. A common approach is to • The performance of HorsePower is shown through an ex- use user-defined functions (UDFs) that are embedded in tensive set of experiments on programs written in MAT- SQL queries [13]. For example, MonetDB supports UDFs LAB, and SQL queries with embedded UDFs. written in Python, that are executed by a Python language interpreter that is embedded inside the DBS engine. While no data transfer is needed with this approach, there 2 BACKGROUND are still two separate execution environments, one being the 2.1 HorseIR: an Array-based IR for SQL SQL execution engine, the other the programming language Recent years have seen the development of modern query execution environment. This can lead to costly data format compilers that translate an SQL query into an intermediate conversion. Furthermore, the SQL and the UDF compo- representation (IR) before target code is generated from the nents of the query are each individually optimized by their IR, making it possible to leverage any existing code opti- respective execution environments, without the considera- mizations available within the IR platform. tion of any holistic optimization across the entire task. In this context, HorseIR [5] was developed as a high-level To address these issues, we propose HorsePower, an ad- IR specifically for database applications [7]. Being an array- vanced analytical SQL system, which provides a holistic based IR, it is relatively straightforward to generate basic solution to integrate UDFs in SQL queries. The system is HorseIR code following the execution plans developed by based on HorseIR [5], an array-based intermediate represen- column-based DBS, as the operators executing on entire tation (IR) language which was developed to explore the columns can be translated to functions executing on vectors usage of compiler optimizations for query execution. Chen in HorseIR. In fact, Chen et al. [5] took the execution plans et al. [5] translated the execution plans of standard SQL generated by the column-based database system HyPer [11], queries into HorseIR and compiled the generated HorseIR that incorporate a wide range of traditional DBS optimiza- code using various compiler optimization strategies devel- tions, as the input for generating HorseIR programs. oped for array-based languages. Using arrays to represent In this regard, HorseIR provides a rich set of array-based © 2021 Copyright held by the owner/author(s). Published in Proceed- built-in functions to which one can map the standard data- ings of the 24th International Conference on Extending Database base operations. Moreover, the HorseIR compiler provides Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 vital optimizations over these array-based operations. For on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Cre- example, loop fusion merges multiple loops into one loop, ative Commons license CC-by-nc-nd 4.0. allowing for an intuitive merge of chained operations and Series ISSN: 2367-2005 361 10.5441/002/edbt.2021.35 1 SELECT SUM(l_price * l_discount) AS RevenueChange 1 FUNCTION RevChangeSclr(price,discount) 2 FROM lineitem WHERE l_discount >= 0.05; 2 RETURN price * discount; 3 END 1 module ExampleQuery{ 2 def main(): table{ 1 SELECT SUM(RevChangeSclr(l_price,l_discount)) AS RevChange 3 ... 2 FROM lineitem WHERE l_discount >= 0.05; 4 // assume t1 , t2 are references to l_price/l_discount columns 5 t3:bool = @geq(t2, 0.05); 6 t4:f64 = @compress(t3, t1); 7 t5:f64 = @compress(t3, t2); Figure 2: Rewriting the example query with a scalar UDF 8 t6:f64 = @mul(t4, t5); 9 t7:f64 = @sum(t6); 10 ...}} these are the most commonly employed types of UDFs and also the ones supported presently in HorsePower. Figure 1: Example query and its HorseIR program A scalar UDF returns a single value per row (which could be a vector) and can be therefore essentially used wherever a regular table column is used, such as the SELECT or the thus, avoiding intermediate results. Thus, optimizations de- WHERE clause of SQL queries. Figure 2 shows a scalar UDF veloped for array-based programming languages can be ex- which performs the multiplication that was originally part ploited to improve query performance. of the SELECT clause in Figure 1. In a column-based data- Example The top of Figure 1 shows a simplified version base system, the execution of such a query first evaluates of Query 6 of the TPC-H benchmark [16] computing the the WHERE clause on l_discount, returning a boolean vec- change in total revenue given prices and discounts from tor. Then, the database applies the corresponding boolean the table lineitem. A basic translation into a HorseIR pro- selection on columns l_discount and l_extendedprice, re- gram prior to performing any optimizations is shown at turning compressed vectors containing the rows where the the bottom of Figure 1, outlining only the part of the code boolean vector was true. These columns are then given to that performs the actual relational operators. We assume the UDF as arrays, and the UDF performs an element-wise that arrays t1 and t2 represent the price and discount multiplication on them and produces a result array. This is columns. The program computes the WHERE condition then the input to the SUM operator. Thus, the UDF is only (@geq), which returns a boolean vector of the same length called a single time and works on entire arrays. as t2 with true values in all rows that fulfill the condition. A table UDF returns a table-like data structure, and thus, The function @compress then extracts from both t1 and t2 is typically called within the FROM clause of an SQL state- the rows for which the boolean vector has a true value. The ment, similar to regular database tables. For an example of output are “compressed” vectors with relevant rows, over a table UDF, we refer to a technical report [6]. which then the aggregation is performed in two steps. Introducing UDFs into queries can bring performance is- HorseIR Optimizations As can be seen, such an approach sues. If the data types used by the two execution environments are different, this can introduce a conversion over- can generate a fair amount of intermediate results (arrays t3 head when exchanging data. Further, as UDF languages to t6 in the example). If lines 5 to 9 are translated to lower- are typically black-boxes to the database engine, cross op- level code independently, each of them generates its own for loop over the corresponding arrays. However, array-based timization attempts are minimal, resulting in sub-optimal optimization techniques, including loop fusion, and some execution plans. pattern-based optimizations developed specifically for the operator sequences found in SQL statements, allow the Hor- 3 HORSEPOWER seIR compiler to fuse these loops to just one loop to avoid In this section we present HorsePower, a system designed materializing these intermediate vectors.

Accelerating Database Queries for Advanced Data Analytics Hanfeng Chen, Joseph V

Monetdb/X100: Hyper-Pipelining Query Execution

Hannes Mühleisen

LDBC Introduction and Status Update

The Database Architectures Research Group at CWI

Micro Adaptivity in Vectorwise

Hmi + Clds = New Sdsc Center

Columnar Storage in SQL Server 2012

User's Guide for Oracle Data Visualization

(Spring 2019) :: Query Execution & Processing

Prototype and Deployment of a Big Data Analytics Platform

DBMS Data Loading: an Analysis on Modern Hardware

Monetdb: Two Decades of Research in Column-Oriented Database Architectures