Micro Adaptivity in Vectorwise
Bogdan Răducanu, Actian ([email protected])
Peter Boncz, CWI ([email protected])
Marcin Żukowski, Snowflake Computing (marcin.zukowski@snowflakecomputing.com; work performed while at Actian)

ABSTRACT

Performance of query processing functions in a DBMS can be affected by many factors, including the hardware platform, data distributions, predicate parameters, compilation method, algorithmic variations and the interactions between these. Given that there are often different function implementations possible, there is a latent performance diversity which represents both a threat to performance robustness if ignored (as is usual now) and an opportunity to increase performance if one would be able to use the best performing implementation in each situation. Micro Adaptivity, proposed here, is a framework that keeps many alternative function implementations ("flavors") in a system. It uses a learning algorithm to choose the most promising flavor, potentially at each function call, guided by the actual costs observed so far. We argue that Micro Adaptivity both increases performance robustness and saves development time spent in finding and tuning heuristics and cost model thresholds in query optimization. In this paper, we (i) characterize a number of factors that cause performance diversity between primitive flavors, (ii) describe an ε-greedy learning algorithm that casts the flavor selection into a multi-armed bandit problem, and (iii) describe the software framework for Micro Adaptivity that we implemented in the Vectorwise system. We provide micro-benchmarks, and an overall evaluation on TPC-H, showing consistent improvements.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems - Query processing
Keywords: Adaptive; Self tuning; Query processing

SIGMOD'13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

1. INTRODUCTION

Adapting to unpredictable, uncertain and changing environments is one of the major challenges in database research. Adaptive query execution and query re-optimization have been valuable additions to database technology that aim to re-arrange the shape of a query plan in reaction to its execution so far, thereby enabling the system to correct mistakes in the cost model, or to e.g. adapt to changing selectivities, join hit ratios or tuple arrival rates. Micro Adaptivity, introduced here, is an orthogonal approach, taking adaptivity to the micro level by making the low-level functions of a query executor more performance-robust.

Micro Adaptivity was conceived inside the Vectorwise system, a high-performance analytical RDBMS developed by Actian Corp. [18]. The raw execution power of Vectorwise stems from its vectorized query executor, in which each operator performs actions on vectors-at-a-time rather than a single tuple-at-a-time, where each vector is an array of (e.g. 1000) tuples. This approach is a form of block-oriented query execution [11], found in some existing database systems, and was shown to strongly improve performance [3].

Primitive Functions. In vectorized execution the basic computational actions for each relational algebra operator are implemented in primitive functions. Each such "primitive" works on input parameters that are each represented as a vector, i.e. an array of values, and similarly produces an array of output values. All relational operators such as Projection, Selection, Aggregation and Join rely on such primitives to process data. The (non-duplicate eliminating) Projection operator is typically used to compute expressions as new columns, and each possible expression predicate combined with each possible input type typically maps to a separate primitive function (e.g. multiplication of doubles, addition of floats, subtraction of short integers, etc.). For Selection, the system contains primitives that for a boolean predicate compute a sub-vector containing the positions of qualifying tuples from the input vectors. For Aggregation, each aggregate function (sum, count, min, etc.) and possible input type leads to a primitive function that performs the work of updating an aggregate result. In addition, hash aggregation employs primitives for vectorized hash value computation, vectorized hash-table lookup and insertion; similar holds for the Hash-Join operator. In total, the Vectorwise system contains some 5000 different primitive functions.

Consider a simple query that involves a Selection:

    SELECT l_orderkey FROM lineitem WHERE l_quantity < 40

It calls a primitive that performs smaller-than comparisons on integer values. Table 1 shows a sample trace of this query:

    stage:   preprocess   execute      primitives   postprocess
    cycles:  1679694      4321561972   3983412990   807654
    %:       0.03%        99.92%       92.17%       0.01%

    Table 1: Example of time spent in different execution stages.

Almost all time is spent in the execute stage (99.92%) and, within it, almost all time is spent in primitives (92.17%). Similar work division occurs in more complex query plans, including joins, aggregations etc.
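A primitive of the kind described above is essentially a tight loop over column arrays. As an illustration, here is a minimal sketch of what a multiplication Projection primitive could look like; the function name and signature are hypothetical, not Vectorwise's actual code:

```c
#include <stddef.h>

/* Hypothetical sketch of a vectorized Projection ("map") primitive:
 * one function call processes a whole vector of tuples, so the call
 * overhead is amortized over (e.g.) 1000 values instead of one. */
size_t map_mul_int_col_int_col(size_t n, int* res, int* col1, int* col2) {
    for (size_t i = 0; i < n; ++i)
        res[i] = col1[i] * col2[i];  /* tight loop: easy for the compiler
                                        to unroll and auto-vectorize */
    return n;
}
```

A separate such function would exist for every combination of operation and input types (multiplication of doubles, addition of floats, and so on), which is how a vectorized engine ends up with thousands of primitives.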
Thus, with primitives dominating the execution time, their speed is critical to overall system performance.

Given that the implementation of a vectorized primitive contains a loop over the input array(s), the advantage of the vectorized execution approach is that the overhead of calling the function is amortized over many tuples, thus strongly reduced compared with the tuple-at-a-time approach. Additionally, specific compiler techniques that revolve around the optimization of loop code are triggered, e.g. strength reduction, loop unrolling, loop fission, loop fusion, but also the automatic generation of SIMD instructions. On the software engineering level, the effect of vectorization is a separation of control logic (the relational operators) and data processing logic (the primitives). As such, vectorized processing brings more than only performance advantages. One example is in performance monitoring and query plan profiling: vectorized primitives can be easily profiled, given the fact that significant CPU time is spent inside them, providing the ability to characterize all steps in a query plan to the level of CPU-cycles-per-tuple. This is less obviously done in tuple-at-a-time engines, where it would be too expensive to trigger performance measurement code around each function call performing e.g. a single integer multiplication, because such performance monitoring code would likely be much more costly than the function being monitored. This ability to cheaply keep track of the performance of each call to a primitive function enables our system to become adaptive to the observed cost of primitive calls.

As an example, consider the less-than selection primitive. It accepts as arguments a vector col of ints and its size n, a constant val, and a vector res in which to store the result. It produces a selection vector with the indices of the elements in the input vector which have a value strictly less than the constant value. The selection vector is then passed to other primitives. The Branching implementation in Listing 1 uses a branch, while the primitive shown in Listing 2 (No-Branching) instead uses operators and index arithmetic to completely remove any branching. These implementations are functionally equivalent: they always produce the same result.

Listing 1: Branching less-than Selection primitive

    size_t
    select_less_than(size_t n, int* res, int* col, int* val) {
        size_t k = 0;
        for (size_t i = 0; i < n; ++i)
            if (col[i] < *val) res[k++] = i;
        return k;
    }

Listing 2: No-Branching less-than Selection primitive

    size_t
    select_less_than(size_t n, int* res, int* col, int* val) {
        size_t k = 0;
        for (size_t i = 0; i < n; ++i)
            res[k] = i, k += (col[i] < *val);
        return k;
    }

Modern CPUs attempt to predict the outcome of branches, so that each CPU cycle the next instructions can be pushed into the execution pipeline, even though the branch instruction has not finished (left the pipeline) yet. Accurate branch prediction is thus a necessary element to make pipelined CPU designs efficient, because in case of a mispredicted branch the entire CPU pipeline must be flushed and delays occur. Prediction will be accurate when the branch behavior can be easily predicted from the previous few executions, i.e. if the branch condition is mostly true or mostly false. Consequently, the performance of the Branching selection primitive depends on the input data distribution; see [13].
The No-Branching implementation always performs the same number of operations, while with Branching this depends on the data. If the data is such that the branch is almost never taken, then the Branching implementation will do less work, as it avoids executing the code that generates a result. Which implementation is fastest depends on the data, as shown in Figure 1, where selectivity is the probability that the branch condition is true. For very low and very high selectivities Branching is faster, while No-Branching is better in between.

Primitive performance. Primitive efficiency depends on the algorithm chosen to implement it and the way the code was compiled. But it is also influenced by the environment: hardware, data distributions, query parameters, concurrent query workload, and the interactions between these. The high complexity of computer systems, with their complex cache hierarchies, out-of-order execution capabilities and constraints, SIMD instruction support etc., combined with the dynamic aspects of the environments where the primitives are applied, makes it impossible to correctly choose one optimal implementation even for a known workload.

Micro Adaptivity. We introduce the Micro Adaptivity framework, which addresses the above problem.
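As a rough illustration of the kind of mechanism the abstract announces, an ε-greedy flavor chooser can track the observed mean cost (cycles per tuple) of each flavor and mostly exploit the cheapest one, occasionally exploring the others. This sketch is our own stylized simplification; the actual algorithm and its bookkeeping are described later in the paper:

```c
#include <stdlib.h>

/* Stylized epsilon-greedy flavor selection (hypothetical simplification).
 * With probability epsilon a random flavor is explored; otherwise the
 * flavor with the lowest observed mean cost per tuple is exploited. */
#define NFLAVORS 2

typedef struct {
    double total_cost[NFLAVORS];    /* accumulated cycles per flavor   */
    double total_tuples[NFLAVORS];  /* tuples processed by each flavor */
} flavor_stats;

/* Mean observed cost; unseen flavors report 0, so they get tried first. */
static double mean_cost(const flavor_stats* s, int f) {
    return s->total_tuples[f] ? s->total_cost[f] / s->total_tuples[f] : 0.0;
}

int choose_flavor(const flavor_stats* s, double epsilon) {
    if ((double)rand() / RAND_MAX < epsilon)
        return rand() % NFLAVORS;              /* explore */
    int best = 0;
    for (int f = 1; f < NFLAVORS; ++f)         /* exploit the cheapest */
        if (mean_cost(s, f) < mean_cost(s, best)) best = f;
    return best;
}

void record_cost(flavor_stats* s, int f, double cycles, double tuples) {
    s->total_cost[f] += cycles;
    s->total_tuples[f] += tuples;
}
```

Because every primitive call already yields a cheap cycles-per-tuple measurement, feeding record_cost and consulting choose_flavor adds little overhead on top of vectorized execution.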