Efficient Many-Core Query Execution
in Main Memory Column-Stores
Jonathan Dees, Peter Sanders
SAP AG, 69190 Walldorf, Germany
Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany
[email protected], [email protected]
Abstract—We use the full query set of the TPC-H Benchmark as a case study for the efficient implementation of decision support queries on main memory column-store databases. Instead of splitting a query into separate independent operators, we consider the query as a whole and translate the execution plan into a single function performing the query. This allows highly efficient CPU utilization, minimal materialization, and execution in a single pass over the data for most queries. The single pass is performed in parallel and scales near-linearly with the number of cores. The resulting query plans for most of the 22 queries are remarkably simple and are suited for automatic generation and fast compilation. Using a data-parallel, NUMA-aware many-core implementation with block summaries, inverted index data structures, and efficient aggregation algorithms, we achieve one to two orders of magnitude better performance than the current record holders of the TPC-H Benchmark.

I. INTRODUCTION

The starting point for the research presented here are three recent developments that enable significantly improved performance of relational databases:
- Storing each column separately reduces the amount of data that has to be accessed for typical queries, which use only a small fraction of the columns present in industrial databases.
- Decreasing RAM prices make it possible to store fairly large databases entirely in main memory. This greatly improves the available bandwidth and reduces latencies by several orders of magnitude. In particular, it nullifies possible disadvantages of column-store databases, which have a more fine-grained memory access pattern than row-store representations. This effect is amplified by the fact that columns can often be compressed, further decreasing memory cost and saving memory bandwidth.
- Many-core processors offer high processing power at relatively low cost. In contrast to cluster-based systems, the shared memory of these systems gives us more flexible access to the data.

These developments make it possible to access large amounts of data very fast. To take full advantage, the CPU needs to keep up with processing the data. However, traditional database systems are usually based on the assumption that access to the data is the main bottleneck. A standard processing technique is the Volcano iterator model [1]: each tuple is processed one by one (a tuple can be just a single value). This incurs a significant interpretation overhead for each operation actually performed. For example, to access the next tuple, a function next() is called which hides the underlying logic for obtaining the tuple (e.g., decompression). This results in a simple and easy-to-use interface but has several drawbacks: First, we need a real function call which cannot be inlined, as we use a virtual function or a function pointer; this also hinders branch prediction. Second, we need to remember and query the iterator state to provide the correct next tuple. Calling many different next() functions in one iteration increases the danger of running out of (L1) cache for the execution code.

One solution to this problem is to process several tuples at once per step, i.e., one next() call returns n > 1 tuples [2], [3], or even all of them [4], instead of one. This usually requires materializing the resulting tuples. Other operators are extended as well so that they process n tuples at a time. Note that this is not fully possible for all operators (e.g., pipeline breakers like sort or join). When n is large enough, the overhead for the call is negligible. For n not too large, tuples still reside in cache when the next operator processes them. However, we lose the full pipelining power, as we need to materialize n tuples in one step and read them again for the next operator from cache (potentially second or third level cache). This forces additional CPU stalls when waiting for memory and may consume memory bandwidth when the data is no longer in cache.

The best of both worlds would therefore be full pipelining and no overhead for processing single tuples. We achieve this goal by a simple but radical approach: writing the complete query plan directly in C++ and compiling it to native machine code for execution. This gives us the following advantages:
- tuples can be passed in registers,
- minimal materialization due to full pipelining,
- no overhead from interpretation or function calls,
- full optimization knowledge for the compiler.

This results in high CPU utilization and memory bandwidth consumption. We wanted to know how big the benefits can be when combined with algorithmic techniques using the full capacity of modern parallel hardware. In this paper we use hand-written code optimized not only for scalability and efficiency but also for simplicity and orthogonality between query plan issues and the details of parallelization and data structures. In ongoing work, we build on this to address automatic compilation. First results indicate that, using new compiler technologies like LLVM and Clang C++ [5], automatic translation yields similar performance at translation latencies around 10 ms.

We use the data and queries of the TPC-H benchmark [6] as a case study. We view TPC-H as a good starting point since it defines a diverse set of nontrivial queries. This means that efficient algorithms, data structures, and parallelization strategies can have a significant performance impact. It also allows direct comparison with runtime results from other systems.

Looking at database applications in particular, better performance has several practical implications: on the one hand, we need it in order to cope with rapidly growing data sets. On the other hand, growing efficiency allows us to save hardware and energy costs and to open up new applications. For example, in the past, classical decision support queries were used by a small number of people to make strategic decisions, tolerating long latencies. With two orders of magnitude lower query latencies, we can afford equally expensive queries on a regular basis, used by a large number of people expecting immediate answers as in web search engines. One goal of our research is to explore new techniques to further improve the performance of the parallel main memory based SAP HANA Database.

After introducing basic concepts in Section II, we present our algorithmic techniques in Section III. Section IV describes implementation aspects and Section V reports on experiments. Section VI summarizes the results and discusses how they can be transferred to a more general setting with updates and an SQL query compiler. The appendix includes pseudocode describing the implementation of all 22 queries.

Related Work

Using a column-oriented instead of a row-oriented data representation has proven to result in a significant speedup for complex analytical queries, both for main memory and disk based systems [7]. Today, there exist several academic column-oriented database systems, e.g., C-Store [8], MonetDB [4], or MonetDB/X100 [3], as well as commercial products like Vertica [9] (based on C-Store), VectorWise [10] (based on MonetDB/X100), or Sybase IQ [11]. All these systems load the accessed data into main memory to process it; still, they are partly designed to handle the data efficiently on disk.

For query execution, C-Store processes tuple by tuple and hands one tuple from one operator to the next (Volcano iterator model). The processing of one tuple in an operator induces a significant performance overhead due to function calls and plan interpretation. MonetDB faces this problem by processing all tuples in one operator step, making the performance overhead for the call and plan interpretation negligible. Each operator is available as a precompiled function for each input type combination and can be fine-tuned for efficient CPU usage and high IPC (instructions per cycle). Although these functions are very efficient, we lose the ability to pipeline data, i.e., we need to materialize all resulting tuples in main memory for each operator step, consuming a lot of memory bandwidth and limiting the execution speed. MonetDB/X100 reduces this problem by processing only several tuples in an operator step, introducing pipelining again. If the number of tuples is carefully chosen, the next operator in line finds the input tuples still in cache (L2 or even L1 cache), significantly increasing access speed and reducing memory consumption. Note that we still need to materialize the data and pass it via caches instead of registers, losing the full potential of pipelining.

An interesting fact is that one starting point for MonetDB/X100 was the good performance of a query written directly in C++ [3]. Even for the most simple TPC-H query (#1), which uses only a single table, the C++ implementation clearly dominates other systems. The comparison included only one query with 1 GB of data and was single-threaded.

There are some recent approaches for generating code for a given SQL query. HIQUE [12] is a prototype system generating C code for each incoming SQL query. Compilation times of the system are clearly visible, and the reported execution times for three tested TPC-H queries are close to MonetDB's: queries 1 and 3 are faster, query 10 is slower. The translated operators are still separately visible and not fully pipelined. Also note that the data of a table is stored in classical tuple format on disk.

In HyPer [13], SQL queries are compiled using LLVM [5] and C++. Here, code compilation times are lower and operators are more interlaced. They used a modified TPC-H benchmark version (TPC-CH [14]) which makes direct comparison more difficult. Running times for TPC-H-like OLAP queries are below MonetDB and VectorWise by a factor of 1–2 for the 5 tested queries. Both [13] and [12] do not use parallel execution or data compression, and they use only scale factor one in the experiments (1 GB of data).

DBToaster [15], [16] is another database system which uses C++ code generation and compilation to speed up execution. However, it does not target the execution of general SQL queries but produces specific C++ code for efficiently updating materialized database views on insertion or deletion of data.

There has been a lot of research on parallel databases in the late 1980s and early 1990s [17], [18]. Our approach to addressing locality owes a lot to the shared-nothing concept developed during that time. Böhm et al. [19] achieve good speedups for TPC-R – the predecessor of TPC-H – on a distributed memory machine. However, this approach is expensive since it is based on replicating all the data.

II. PRELIMINARIES

Figure 1 summarizes the database schema for the TPC-H Benchmark, which simulates a large database of orders consisting of 1–7 line items each. Except for the two constant size tables NATION and REGION, the cardinality of each relation is the scale factor SF times a (large) constant, such that the total size of an uncompressed row representation is about SF · 10^9 bytes. Attribute values are populated using rules that are realistic to a certain extent but ultimately contain uniformly random decisions.
There are 22 queries, each of which also involves a small number of random parameters. In a power test, queries 1–22 are executed one after the other. In a throughput test, a prescribed minimum number of query streams is processed, where the order of queries in each stream is prescribed but the scheduling between streams is flexible.

We are targeting shared memory machines with multiple sockets, each containing a multi-core processor with a significant number of cores and simultaneously executing threads. Each socket has its own locally attached main memory; accessing the memory of other sockets comes with a performance penalty. There is a complex hierarchy of caches: with increasing cache size, latency grows and more threads share the same instantiation of the cache.

Fig. 1. TPC-H database schema (cardinalities as multiples of the scale factor F: LINEITEM 6M·F, ORDERS 1.5M·F, PARTSUPP 800K·F, PART 200K·F, CUSTOMER 150K·F, SUPPLIER 10K·F, NATION 25, REGION 5). Vertical edges go downward and indicate attributes storing primary keys of the relation below.

III. TECHNIQUES

A. Database Schema

We consider database schemas that can be viewed as a generalization of the snowflake schema [20], which is in turn a generalization of a star schema. The relations form a directed acyclic schema graph G where an edge (R, U) indicates that a tuple of relation R references a tuple of relation U. Already the TPC-H benchmark (Figure 1) shows that the graph is not necessarily a tree. Paths in G specify a way to access data via a sequence of key dereferencing operations. From the point of view of a node R, the DAG G can be viewed as a compact representation of a closure relation R* that contains an attribute for every path starting at R. This view resembles the network model [21].

B. Data Representation

Basic Data Types and Compression: Most attributes can be mapped to integers, i.e., numbers, enumeration types, and dates. The latter are stored as the number of days since Jan 1st 1970 in TPC-H. For reasons of efficiency and simplicity, we currently store all scalar values using 1, 2, 4, or 8 bytes; for example, dates take 2 bytes. We have also implemented more general bit compression, i.e., storing exactly the number of bits needed, but we are currently not using this for performance reasons. We have also not tried more sophisticated compression schemes like [22], [23] for the same reason. More generally, we do not think that TPC-H is a good basis for experiments on data compression, since the randomly generated data is unlikely to yield good predictions for real world scenarios. Strings stemming from a small set of possible values are compressed into an enumeration referencing string values in a dictionary. For example, there are only 150 strings denoting part types. More complex strings are currently not compressed. Note that we can often circumvent actually processing these strings using the inverted text indices introduced in Section III-C.

Block-cyclic allocation: Each column of the database is split into blocks which are assigned to the local memories of the sockets of a NUMA machine in a round-robin fashion. Blocking ensures cache efficiency, while the cyclic distribution ensures that the memory banks are about equally loaded. Note that simply splitting a column into one segment per local memory would yield very bad performance, since many queries do most of their work on only a subrange of a column. For example, in TPC-H Query 4, ORDERS and LINEITEMs are selected from a small time interval. We handle the assignment of memory blocks to sockets transparently: the application still uses a contiguous memory range, as this range is virtual and is translated to physical memory by the operating system. The assignment of virtual to physical addresses can be modified by a system call on Linux systems such that the physical address points to the correct socket [24]. For the size of a memory block we use a multiple of the system page size. We keep track of the assignment of blocks to NUMA nodes, as we use this information for optimal memory access when iterating over the data (see Section III-E).

C. Index Data Structures

Index data structures allow us to efficiently navigate the schema graph. We have implemented several types of index data structures. Note that these data structures are very different from the B-tree based indices used in traditional disk-based column-store databases; they rather resemble relations of the network data model [21], in which each tuple of a table references one, or an arbitrary set of, tuples from another table.

Forward Join Indices: Recall that an edge R → U in the schema DAG means that each tuple r ∈ R contains a primary key logically referencing a unique tuple u ∈ U. A forward index makes this referencing explicit by also storing the actual position of u in r.

Indexing Sorted Relations: When a relation R is sorted by the primary key of another relation U reachable from R in the schema graph, a tuple u ∈ U may simply specify the first tuple in R that references u. The desired set can then be found by simply scanning R starting at that position. We use this type of index to reference LINEITEMs from ORDERs.

Inverted Join Indices: These efficiently go backwards in the schema graph G described in Section III-A. Suppose there is a path P from relation R to relation U in G. Then, given a tuple u of U, we want to look up all tuples of R that reference u via P. Inverted indices support this type of lookup by explicitly listing the tuples in R referring to u. Note that the path may contain more than a single edge. For example, in TPC-H Query 11, we use an inverted index listing all PARTSUPPs with SUPPLIERs from a given nation. Note that we do not yet use sophisticated index compression techniques like [25], [26], which could further reduce space usage while keeping high performance. However, in our experiments, inverted indices take less than one third of the total space, so this optimization seems not very pressing.

D. Block Summaries

For columns whose values are correlated with the ordering column, we store minima and maxima for each block, similar to Netezza's zone maps described in [27]. For example, in TPC-H, ship dates, commit dates, and receipt dates of LINEITEMs are strongly correlated with order dates, which are the primary sorting criterion. This optimization causes only negligible storage overhead, since it stores only two additional values per data block.
When we select the rows of a column lying in a range [a, b], we have five cases for each block:
1) min ≥ a and max ≤ b: select all rows
2) min > b or max < a: no row is selected
3) min ≥ a and max > b: only check for b
4) min < a and max ≤ b: only check for a
5) min < a and max > b: both checks needed
Except for case (5) we gain an advantage. In cases (1) and (2) we do not have to touch the values at all and thus save memory bandwidth.

NUMA-Aware Inverted Join Indices: Our system tries to schedule tasks accessing tuples on socket S to threads also located on S to improve memory access. Tuples referenced by a tuple on socket S are likely to be accessed in the same task and should therefore be located on the same socket. This is not possible in general. Still, for a relation R that is sorted by the primary key of another relation U, we can adjust the NUMA memory block distribution of R such that the tuples r ∈ R reside on the same socket as the referencing tuple u. In TPC-H, we apply this for ORDERS and LINEITEM.

For inverted join indices we introduce a new NUMA-aware data structure: suppose we have an inverted index from relation R to relation U. For a tuple r ∈ R, we have a set of referenced tuples u ∈ U for each memory socket. Each per-socket set holds the positions of all tuples located on that socket. When we use this index during query execution, we do not directly access the referenced tuples but start a new task for each socket. The tasks are then executed by a thread located on the corresponding socket, which has local memory access to the referenced tuples. As we need to start an extra task for each memory socket, this index should only be used if the list of referenced tuples for one tuple r ∈ R is long enough on average and the NUMA effects outweigh the additional work.
Inverted Text Indices: Consider any text attribute t of a closure relation R*. We can view the texts in t as a collection of documents and use standard techniques from information retrieval to support fast search. In particular, we can store an inverted index recording, for each word w occurring in some document, the list of tuple IDs in R* where t = w. Unfortunately, this way of searching is not immediately compatible with SQL predicates of the form like '%pattern%', because such predicates also match substrings of a word. For example, in TPC-H there are eight queries with the keyword like. They all will actually only match full words, but the question is how to rule out other kinds of matches. Fortunately, there is a fast way to do this: we build a character based full text index C on the dictionary of words occurring in any text in t, for example a suffix array. In many applications, and in TPC-H, this dictionary is very small, so the additional space consumption is negligible. Then, when evaluating a like query for pattern '%x%', we search for all occurrences of x in C. If the number of occurrences is small, we can use the inverted index for those words to quickly find all relevant tuples. In particular, if x is actually a word that is not a substring of another dictionary entry, then there will be exactly one occurrence.

E. Scanning

Many SQL queries that are written down as a complicated combination of joins can actually be viewed as a simple select-from-where query on a closure relation R* and can be implemented by scanning relation R and accessing attributes corresponding to paths in the schema graph, explicitly following the path using our indices. Doing this efficiently requires a number of optimizations.

NUMA-Aware Parallelization using NBB: Our initial implementation used Intel Threading Building Blocks (Intel TBB) [28]. It splits a given range of a column to be scanned into smaller ranges (blocks) and assigns the blocks as tasks to different threads. For small ranges, these blocks are only a fraction of a memory block in order to allow for enough parallelism and load balancing. Unfortunately, Intel TBB is not NUMA-aware so far – there is no way to reliably assign a thread to a CPU, and therefore we cannot ensure that the accessed memory resides on the local socket. To overcome this limitation, we built our own NUMA-aware scheduler, called NBB [29], based on POSIX Threads. NBB manages a pool of threads for each socket. We generate NBB tasks such that they scan a piece of a relation located on a single socket, and NBB tries to assign such a task to a thread on this socket, only using threads from remote sockets when all local threads are busy.

Ordering Predicate Evaluations: The where-clause of most queries restricts the set of tuples contributing to the output. Often, this restriction is a conjunction of several predicates, and it matters in which order these predicates are tested. Usually, we want to start with the most restrictive predicate; however, we also need to consider the cost of evaluating each predicate, in particular the access cost for the data. Hence, choosing the order in which to test conditions is an important tuning problem.

Access Cost: The cost for accessing an attribute of R* defined by a path P depends not only on the length of the path but also on the size of the involved physical columns: small columns may fit in cache while large columns will cause additional cache faults. For example, in TPC-H it makes little difference whether one accesses an attribute directly stored with a supplier or stored with the nation of the supplier – the 25 entries of the nation dimension table easily fit in cache.

Loop Unrolling: In most queries, the main part of the running time is spent in one (innermost) loop iterating over data and processing it. Usually, the instructions within the loop depend on each other and must be executed in a certain order, which may cause lost processor cycles. By unrolling the loop we allow the compiler to reorder the instructions, potentially allowing a higher number of instructions per cycle. Note that the unroll count must not be too large, as the code of the loop should still fit into the L1 cache (we use an unroll count of 8).

SIMD Instructions: At first glance, decision support queries are highly data parallel and thus should allow parallel operation on multiple tuples using the single-instruction-multiple-data features of modern processors. We tried this in some cases, however so far without noticeable performance improvements. One reason might be that precompiled tuple-by-tuple processing does not benefit as much from SIMD as vectorized execution does [30].

G. Using Inverted Indices

Suppose we are performing computations on tuples from a closure relation R* which involve a highly selective predicate P depending only on a closure relation U*, and we have an index inverting the path from R to U. Then, rather than scanning R and performing expensive indirect accesses to the attributes from U*, we can scan U and use the index to retrieve only those tuples of R for which the predicate P is fulfilled. Using inverted indices exhibits parallelism at two levels of granularity – when scanning U, and when scanning an inverted list for a particular tuple u ∈ U.

Beyond this most common standard use of indices, we can also combine several index results: when indices are available for several sufficiently selective predicates, we can compute the intersection of the index results directly. For inverted index data structures this is a standard operation known from information retrieval. In particular, a pairwise intersection can be computed in time proportional to the smaller set if the inverted lists are represented appropriately [33]. We can also efficiently intersect the sets of intervals we get when we have sorted relations, and the mixed case.

TPC-H Query 13 is a (somewhat nonstandard) example where intersecting lists is useful: this query processes orders whose comment text does not match two words specified in the query. By intersecting the inverted indices for these two words, we can efficiently precompute a bit pattern specifying which tuples can be ignored, thus saving complex string matching operations.
F. Aggregation

Decision support queries frequently reduce large data sets by commutative and associative operations like counting, summing, minimizing, or maximizing over a large number of values. We differentiate between two cases in TPC-H.

(1) For the majority of the queries, the number of results produced is so small that we can afford to compute a local result table for each thread participating in the computation. Often, this result will even fit into the L1 cache. Only at the end are the local results combined into the overall result. This way we avoid ruinous synchronization overhead. To obtain the overall result, we repeatedly combine pairs of remaining local results into single results, in parallel. Thus, the number of parallel steps to obtain the overall result is ⌈log2(number of threads)⌉ when we use our scheduling framework NBB.

(2) If the results do not fit into the L2 cache, we use a different method (Queries 10 and 15 in TPC-H). We divide the final result table into parts such that each part fits into the L2 cache. In our implementation we use 256 parts and assign entries to parts based on the most significant bits of the (hash) key. The aggregation is done in two passes. First, each thread splits its part of the input into the 256 parts; memory management for the required arrays of tuples uses thread-local memory pools. In the second pass, each part is entirely assigned to one thread, which can produce the corresponding part of the result table by local computations. The approach eliminates the need for inter-processor synchronization and greatly improves cache locality. See also the aggregation of Cieslewicz et al. [31] or Manegold et al. [32], who use a similar technique to speed up joins.

H. Preprocessing

Suppose we are performing computations on tuples from a closure relation R* which involve computing some predicate or value x that only depends on data in a closure relation U* which contains far fewer tuples than R. Then we can precompute these values cheaply, storing them in a temporary column X. This makes sense if evaluating x is considerably more expensive than accessing X, which can be the case even for very simple computations. For example, if x is a predicate then X is just a bit array that is more likely to fit into cache. In TPC-H, this technique is particularly useful if the computation involves complicated string operations (for example in TPC-H Query 13 described above).

I. Postprocessing

Top-k Queries: Many TPC-H queries output only the k largest or smallest results in sorted order. To save memory and improve cache locality, each thread periodically eliminates all but the top k of its local results. This can be done in linear time. In the end we use partial sorting to extract the final output, i.e., the globally top k elements are identified and only those are actually sorted.

J. Miscellaneous Functions

We are using a fast implementation of the Boyer-Moore algorithm for string matching problems where we do not have an index [34], [35].

Most predicates on dates can be reduced to comparisons between integers.
As a byproduct of our studies we found a TBB Thread Building Blocks 35 very fast way to extract the year y out of a date represented as number of days d since Jan 1st 1970 (or any other starting 30 date) without storing a full lookup table:
25
Ö " × ÔÔ Õ ¡ Ö " × Õ y YT d 8 .base d&255 YT d 8 .offset 20
speedup × where the precomputed table YT Öi stores the year of day 256i 15 and the number of days remaining of this year at day 256i. 10 Note that this table will fit into L1 cache. Months or days within a month can be extracted in similar ways. 5
Fig. 2. Average speedup of query 1 (10 runs each) over the number of threads (up to 64 on the 32-core machine), showing also min and max speedup. TBB is slower and less predictable.

IV. IMPLEMENTATION

A. Software Engineering Aspects

Our strategy is to encapsulate low-level aspects of the machines like NUMA-aware data layout, SIMD instructions, parallelization, thread scheduling, (de)compression, data layout etc. in easy-to-use generic C++ libraries.
Thus a query compiler or an application programmer can generate queries without knowing about the details of the machine or about many-core programming. By recompiling, one can also adapt these query codes to different platforms. Of course, for optimal

TABLE I
QUERY PROPERTIES FOR SF 300 (columns: Q, NUMA, BS, Agg, Top-K, NUMA Index, Indexes, Speedup, GB/s; optimization present)
1  1.49  1.21  34.6  33
2  2F,I  26.3  24