Horseir: Bringing Array Programming Languages Together with Database Query Processing

HorseIR: Bringing Array Programming Languages Together with Database Query Processing Hanfeng Chen Joseph Vinish D’silva Hongji Chen McGill University McGill University McGill University Canada Canada Canada [email protected] [email protected] [email protected] Bettina Kemme Laurie Hendren McGill University McGill University Canada Canada [email protected] [email protected] Abstract Keywords IR, Compiler optimizations, Array programming, Relational database management systems (RDBMS) are oper- SQL database queries ationally similar to a dynamic language processor. They take ACM Reference Format: SQL queries as input, dynamically generate an optimized ex- Hanfeng Chen, Joseph Vinish D’silva, Hongji Chen, Bettina Kemme, ecution plan, and then execute it. In recent decades, the emer- and Laurie Hendren. 2018. HorseIR: Bringing Array Programming gence of in-memory databases with columnar storage, which Languages Together with Database Query Processing. In Proceed- use array-like storage structures, has shifted the focus on ings of the 14th ACM SIGPLAN International Symposium on Dynamic optimizations from the traditional I/O bottleneck to CPU and Languages (DLS ’18), November 6, 2018, Boston, MA, USA. ACM, New memory. However, database research so far has primarily fo- York, NY, USA, 13 pages. https://doi.org/10.1145/3276945.3276951 cused on CPU cache optimizations. The similarity in the com- putational characteristics of such database workloads and 1 Introduction array programming language optimizations are largely unex- Relational Database Management Systems (RDBMS) have plored. We believe that these database implementations can been the primary data management software of choice for or- benefit from merging database optimizations with dynamic ganizations for decades with SQL being the de facto standard array-based programming language approaches. Therefore, query language [16]. Being a declarative language based on in this paper, we propose a novel approach to optimize data- relational algebra [11], SQL gives RDBMS implementers the base query execution using a new array-based intermedi- opportunity to optimize the execution plan for a query [7]. ate representation, HorseIR, that resides between database In this aspect, one can think of an RDBMS as a dynamic queries and compiled code. Furthermore, we provide a trans- language processor which receives SQL queries as input and lator to generate HorseIR from database execution plans and dynamically translates the SQL query to an optimized plan a compiler that optimizes HorseIR and generates efficient that minimizes the execution cost of the query and executes code. We compare HorseIR with the MonetDB RDBMS, by the plan. testing standard SQL queries, and show how our approach Traditionally, these optimizations were targeted at the pri- and compiler optimizations improve the runtime of complex mary resource bottleneck, I/O [17]. A major RDBMS storage queries. optimization, termed column-store architecture [1], where a column’s data structure in the disk is akin to that of a pro- CCS Concepts • Software and its engineering → Com- gramming language array, introduced significant reduction pilers; • Information systems → Database query pro- in I/O costs for large-scale data processing. At the same time, cessing; the increase in the size of the main memory had improved Permission to make digital or hard copies of all or part of this work for disk caching, making many database (DB) workloads CPU personal or classroom use is granted without fee provided that copies are not and memory bound [3,5]. made or distributed for profit or commercial advantage and that copies bear This had the database researchers looking for new op- this notice and the full citation on the first page. Copyrights for components timization strategies. The “array like” nature of a table’s of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to columns in column-store based RDBMSes, especially its 1 redistribute to lists, requires prior specific permission and/or a fee. Request working data sets , naturally led to a series of studies and op- permissions from [email protected]. timization strategies focused on the benefits of CPU caching DLS ’18, November 6, 2018, Boston, MA, USA [4, 19]. However, to the best of our knowledge, beyond these © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6030-2/18/11...$15.00 1The term working dataset is used to denote the copy of data that has been https://doi.org/10.1145/3276945.3276951 brought from the disk to main memory for processing a request. 37 DLS ’18, November 6, 2018, Boston, MA, USA H. Chen, J. D’silva, H. Chen, B. Kemme, and L. Hendren RDBMS plans produced by the columnar RDBMS, HyPer [19], but SQL Front-End Execution Translate HorseIR the principles of translating execution plans of any RDBMS Queries Plans (raw) to HorseIR is largely the same. The HorseIR optimizer first Optimize performs standard compiler optimization techniques, such HorseIR Libraries as dead code elimination and loop fusion. Next, an intelli- (opt.) gent pattern-based code generation strategy is deployed to Link Compile recognize the patterns of HorseIR that have efficient C im- Query Binary C plementations and replace them with such libraries. Finally, Result Output Compile the generated C code is compiled into executable form. The idea of generating array-based code from SQL is both Figure 1. Overview of our approach. Shaded boxes corre- novel and challenging. To the best of our knowledge, we are spond to our contributions. the first to introduce an array-based IR that allows column- based in-memory database systems to benefit from a whole excursions into exploiting CPU cache-based optimizations, range of compiler optimizations. Prior researchers mainly the database community at large is yet to benefit from more focused on how to generate low-level code directly from SQL comprehensive optimization techniques that the compiler queries. Rather than implementing complex operations such community has amassed from its decades of research on as joins directly, HorseIR provides a repertoire of smaller array programming languages. We believe that working built-in primitives that can be combined to achieve the same datasets of column-stores are good candidates to apply array functionality. Introducing a high-level IR might open a new programming optimizations, as a column essentially contains research direction for database query optimizations. How- homogeneous data which maps nicely to array-based/vector- ever, having an additional IR for performing compiler opti- based primitives. mizations may result in some overhead and must be consid- In this paper, we follow a layered approach that facili- ered along with the performance benefits that they bring. tates a wide range of compiler optimizations in a systematic To demonstrate the importance of compiler optimizations way that exploits and further builds upon the many opti- and the feasibility of our approach, we performed exten- mizations developed by the database community in terms of sive experiments, using a subset of the TPC-H SQL bench- generating efficient execution trees for declarative queries. mark [32]. We compare the performance of HorseIR with Specifically, we propose an approach where SQL queries that of a popular open source column-store RDBMS, Mon- are first translated into execution plans using standard DB etDB [14]2. Execution times are overall relatively similar, optimization techniques that consider the operators in the showing that an intermediate representation and the use of query and the characteristics of the input dataset. These DB an array-based programming language is a promising ap- optimized plans are the basis on which we explore compiler proach that can compete with a highly optimized DB engine. optimizations. To do so in a systematic way, we translate In fact, the optimized HorseIR even outperforms MonetDB. the conventional database query execution plan into a new We also analyze in detail as to which compiler optimizations array-based intermediate representation (IR), called HorseIR, are particularly important for database queries. on which compiler optimization strategies are then applied. The main contributions of this paper are that we: An array-based IR is a natural choice, given the columnar • identified the need for a common IR for SQL queries; structure of the working data sets. The modular approach of • designed and implemented an array-based IR, called using an IR spares the RDBMS implementers from having to HorseIR, to represent SQL queries; tailor specific DB operators and algorithms to benefit from • delivered a translator which is able to generate HorseIR various hardware related optimization techniques. Instead, code from execution plans automatically; such optimizations can be performed on HorseIR. • identified and applied important compiler optimiza- The core data structure in HorseIR is a vector (correspond- tions for generating efficient code from HorseIR; ing to columns in the DB tables) and the implementation • demonstrated performance benefits of these compiler has a rich set of well-defined built-in functions with clearly optimizations, and compared the overall performance defined semantics. They are easy to optimize, and in thecase with MonetDB, a popular column-store based RDBMS. of elementwise built-in functions, are easy to vectorize and The rest of the paper is organized as follows. We first pro- parallelize. The type system of HorseIR facilitates declaring vide the relevant background about array programming and variables with explicit types, as well as a wild-card type and RDBMS in Sec.2. We then present the details of the design associated type inference rules. and implementation of HorseIR in Sec.3; the translation Fig.1 shows the overview of our approach. Queries are first translated to query execution plans. Then, a plan-to- 2While we would have liked to compare with HyPer, whose plan generator HorseIR translator translates the execution plans into raw we used, this was not possible as only the plan generator but not the HyPer HorseIR programs.

Load more