Exploring Query Execution Strategies for JIT, Vectorization and SIMD
Tim Gubner (CWI)            Peter Boncz (CWI)
[email protected]         [email protected]

ABSTRACT

This paper partially explores the design space for efficient query processors on future hardware that is rich in SIMD capabilities. It departs from two well-known approaches: (1) interpreted block-at-a-time execution (a.k.a. "vectorization") and (2) "data-centric" JIT compilation, as in the HyPer system. We argue that between these two design points, in terms of granularity of execution and unit of compilation, there is a whole design space to be explored, in particular when considering how to exploit SIMD. We focus on TPC-H Q1, providing implementation alternatives ("flavors") and benchmarking these on various architectures. In doing so, we explain in detail the considerations around operating on SQL data in compact types, and the system features that could help keep data as compact as possible. We also discuss various implementations of aggregation, and propose a new strategy called "in-register aggregation" that reduces memory pressure and also allows computation on more compact, SIMD-friendly data types. The latter is related to an in-depth discussion of detecting numeric overflows, where we make a case for numeric overflow prevention rather than detection. Our evaluation shows positive results, confirming that there is still a lot of design headroom.

1. INTRODUCTION

In the past decades, query processing architectures have evolved from the tuple-at-a-time iterator model to specialized analytical systems that use block-at-a-time execution, either producing entire materialized intermediate table columns, as in MonetDB [3], or small columnar fragments ("vectorized execution"), as in VectorWise [4]. Two important advantages of columnar and block-at-a-time processing are: (1) reduced interpretation overhead (interpretation decisions are made per block rather than per individual tuple) and (2) a direct mapping onto SIMD instructions.¹ Interpretation overhead in tuple-at-a-time execution of analytical queries is typically more than 90% of CPU work [4], hence block-at-a-time execution often delivers an order of magnitude better performance.

An alternative way to avoid interpretation overhead is Just-In-Time (JIT) compilation of database query pipelines into low-level instructions (e.g. LLVM intermediate code). This approach was proposed and demonstrated in HyPer [9], and has also been adopted in Spark SQL for data science workloads. The data-centric compilation that HyPer proposes, however, leads to tight tuple-at-a-time loops, which means that so far these architectures do not exploit SIMD, since that typically requires multiple tuples to be processed at the same time. JIT compilation is beneficial not only for analytical queries but also for transactional queries, and thus fits mixed workloads better than vectorized execution. Also, with high-level query languages blending into user code in frameworks such as Spark, JIT compilation provides a standing opportunity to compile and optimize user code and database system operations together in a single framework. A notable related project in this direction is Weld [11], which offers a low-level DSL for data science workloads that can be JIT-compiled onto multiple hardware platforms.

Overall, it seems that a mix of vectorization and JIT [14] is needed, specifically because certain tasks, where processing is data-dependent (e.g. fast data decompression) or where adaptivity is important, cannot be efficiently JIT-ted. HyPer uses vectorized interpreted processing in its data scans, and passes blocks of decompressed columns to its JIT-compiled query pipelines [8], so it already uses a mix.

In the past two decades, SIMD instructions in common x86 processors have evolved from 64-bit width (MMX) to the current 512-bit width (AVX-512). With MMX, the decision of common database architectures to ignore SIMD was logical, but the gap between the standard ("scalar") CPU instruction set, which operates at the 64-bit machine width, and AVX-512 is now almost an order of magnitude, and arguably should not be ignored any longer. Admittedly, the database research community has studied the potential of SIMD instructions in database operations (a non-exhaustive list is [12, 17]), but these works typically focus on SIMD-izing particular operations rather than looking at database architecture as a whole. Arguably, we may have to re-evaluate database architecture on a much broader level in order to truly leverage modern processors with SIMD.

Many of the characteristics of SIMD are also found in GPU and APU processors, which likewise flourish (only) if the same set of instructions is applied to many independent data items without divergence in control flow [5]. The aversion to divergent control flow is also found in FPGA and ASIC implementations of query operators [16, 10].

In this paper we do not yet aim to propose a new architecture for data processing on modern, heterogeneous computing architectures, but rather pursue the more modest goal of exploring parts of its design space in order to inform future choices. Specifically, we think that columnar execution as in MonetDB and data-centric execution as in HyPer can be seen as two designs at extreme points in terms of granularity (full-table versus tuple-at-a-time) as well as compilation (fully interpreted versus fully compiled). The question is what trade-offs lie in between. To shed light on this question, we focus on a restricted set of operations (we essentially exclude join) and, for this purpose, on the well-known TPC-H query 1 (Q1, Figure 1). This query measures the computational efficiency of a database system and consists of an aggregation with few groups and significant expression calculation. Join optimization is not important as the query contains no joins, nor is indexing, since the selection predicate of Q1 selects > 95% of the tuples.

Thus, in this paper we explore the execution options that combining vectorization with JIT brings, specifically also trying to use SIMD instructions, by generating many different implementation variants and benchmarking them. The main contributions of this paper are as follows:

• we show, by illustration, that there is a large design space for combined vectorized and JIT compilation, in the form of different implementation flavors [13] for Q1.

• we show that query executors on SIMD benefit significantly from using compact data types, and we discuss design options for database storage that maximize opportunities for using thin data types. Specifically, this involves both compact storage and associated statistics, but also the policy used for numeric overflow.

• we contribute a new adaptive "in-register" aggregation strategy that adaptively triggers in aggregations with high group locality (few neighbouring distinct groups in the tuple stream).

In our experiments (among others on a Knights Landing processor that supports AVX-512), we see that using compact data types has become important for efficiency, as the flavors using them consistently improve performance. Our "in-register" aggregation is the generic execution strategy that performs best, showing that on modern architectures there is still considerable headroom beyond the fastest system.

¹ Sec 4.1 of [1] lists more advantages of vectorized execution (e.g. adaptivity, algorithmic optimization, better profiling).

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing the authors. Proceedings of ADMS 2017.

    SELECT
        l_returnflag, l_linestatus,
        sum(l_quantity) AS sum_qty,
        sum(l_extendedprice) AS sum_base_price,
        sum(l_extendedprice*(1-l_discount)) AS sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge,
        avg(l_quantity) AS avg_qty,
        avg(l_extendedprice) AS avg_price,
        avg(l_discount) AS avg_disc,
        count(*) AS count_order
    FROM
        lineitem
    WHERE
        l_shipdate <= date '1998-12-01' - interval '90' day
    GROUP BY
        l_returnflag, l_linestatus
    ORDER BY
        l_returnflag, l_linestatus

    Figure 1: TPC-H Query 1 (Q1)

2. EXECUTING Q1, IN DETAIL

The case of Q1 was initially used [4] to motivate the VectorWise system (then called MonetDB/X100; we use "X100" here as shorthand for VectorWise-style query execution). It showed the large interpretation overhead that the then-common tuple-at-a-time iterator model imposed on analytical queries. The vectorized approach iterates through the table at a vector size of, typically, 1024 tuples. For each operator in an expression (e.g. l_tax+1), the execution interpreter calls a primitive function that accepts its inputs either as constants or as small arrays of the vector size. The primitive implementation typically is a simple loop without dependencies that iterates through the vector (e.g. for(int i=0; i<1024; i++) dst[i] = col[i]+1).

VectorWise also proposed "micro-adaptivity" [13], in which different equivalent execution strategies inside an operator are tried using a one-armed bandit approach: the different strategies are sometimes explored, and otherwise the best alternative is exploited. Depending on the operator, one can often find different ways of doing the same thing (e.g. evaluating a filter using if-then branches or with predication), and different compilation options (different compiler, different optimization options, different target machines) also deliver equivalent "flavors" of doing the same thing. A more complex example of micro-adaptivity is changing the order of expensive conjunctive filter predicates depending on the observed selectivities. Relevant for l_tax+1 is "full computation" [13], in which computation operations that follow a filter operation can be executed without taking the filter into account, i.e.
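To make the primitive loop from Section 2 concrete, the following is a minimal sketch of a vectorized map primitive in the spirit of the for-loop quoted above. The function and parameter names are ours for illustration; they are not VectorWise's actual API.

```c
#include <stddef.h>

/* Illustrative vectorized "map" primitive: a tight, dependency-free
 * loop over one vector of at most 1024 tuples. An expression
 * interpreter evaluating l_tax+1 would call such a primitive once per
 * vector, amortizing interpretation cost over 1024 tuples. */
enum { VECTOR_SIZE = 1024 };

/* dst[i] = col[i] + val for the first n tuples (n <= VECTOR_SIZE). */
static void map_add_int64_val(size_t n, long long dst[],
                              const long long col[], long long val)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = col[i] + val;   /* trivially auto-vectorizable */
}
```

Because the loop body has no cross-iteration dependencies, a compiler can map it directly onto SIMD instructions, which is exactly advantage (2) of block-at-a-time processing named in the introduction.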
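The branch-versus-predication example of micro-adaptivity can likewise be sketched as two equivalent selection-primitive flavors. We assume here, as in VectorWise, that a selection produces a selection vector of qualifying tuple positions; again, the names are illustrative.

```c
#include <stddef.h>

/* Two equivalent flavors of a selection primitive, both producing a
 * selection vector of qualifying positions. Which flavor wins depends
 * on selectivity and branch predictability -- exactly the situation
 * micro-adaptivity exploits. */

/* Flavor 1: if-then branch per tuple (mispredicts near 50% selectivity). */
static size_t sel_lt_branch(size_t n, size_t sel[],
                            const int col[], int bound)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < bound)
            sel[k++] = i;
    return k;
}

/* Flavor 2: predication -- branch-free, data-dependent increment. */
static size_t sel_lt_pred(size_t n, size_t sel[],
                          const int col[], int bound)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = i;               /* always write the candidate position */
        k += (col[i] < bound);    /* keep it only if the predicate holds */
    }
    return k;
}
```

Both flavors return the same selection vector; the predicated version trades a few extra stores for the absence of a hard-to-predict branch.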
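The one-armed-bandit choice between such flavors can be realized, for instance, with a simple explore/exploit policy. The sketch below is our own illustration of the idea; VectorWise's actual micro-adaptivity policy differs in its details.

```c
#include <stddef.h>

/* Sketch of a bandit over equivalent primitive flavors: mostly exploit
 * the flavor with the lowest running mean cost (e.g. cycles per tuple),
 * periodically explore the others. Illustrative only. */
enum { N_FLAVORS = 2, EXPLORE_EVERY = 16 };

struct bandit {
    double mean_cost[N_FLAVORS];  /* running mean cost per flavor */
    long   count[N_FLAVORS];      /* number of timed runs per flavor */
    long   calls;
};

/* Pick the flavor to run for the next vector. */
static int bandit_choose(struct bandit *b)
{
    long c = b->calls++;
    if (c % EXPLORE_EVERY == 0)   /* explore: round-robin over flavors */
        return (int)((c / EXPLORE_EVERY) % N_FLAVORS);
    int best = -1;                /* exploit: cheapest flavor seen so far */
    for (int f = 0; f < N_FLAVORS; f++)
        if (b->count[f] > 0 &&
            (best < 0 || b->mean_cost[f] < b->mean_cost[best]))
            best = f;
    return best < 0 ? 0 : best;
}

/* Feed back the measured cost of running flavor f on one vector. */
static void bandit_update(struct bandit *b, int f, double cost)
{
    b->count[f]++;
    b->mean_cost[f] += (cost - b->mean_cost[f]) / b->count[f];
}
```

Because each decision covers a whole vector of ~1024 tuples, the per-tuple overhead of this bookkeeping is negligible, which is what makes such adaptivity affordable in block-at-a-time execution.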