Interleaving with Coroutines: A Practical Approach for Robust Index Joins

Georgios Psaropoulos⋆†   Thomas Legler†   Norman May†   Anastasia Ailamaki⋆‡

⋆EPFL, Lausanne, Switzerland   †SAP SE, Walldorf, Germany   ‡RAW Labs SA
{first-name.last-name}@epfl.ch   {first-name.last-name}@sap.com

ABSTRACT

Index join performance is determined by the efficiency of the lookup operation on the involved index. Although database indexes are highly optimized to leverage processor caches, main memory accesses inevitably increase lookup runtime when the index outsizes the last-level cache; hence, index join performance drops. Still, robust index join performance becomes possible with instruction stream interleaving: given a group of lookups, we can hide cache misses in one lookup with instructions from other lookups by switching among their respective instruction streams upon a cache miss.

In this paper, we propose interleaving with coroutines for any type of index join. We showcase our proposal on SAP HANA by implementing binary search and CSB+-tree traversal for an instance of index join related to dictionary compression. Coroutine implementations not only perform similarly to prior interleaving techniques, but also resemble the original code closely, while supporting both interleaved and non-interleaved execution. Thus, we claim that coroutines make interleaving practical for use in real DBMS codebases.

Figure 1: Response time of an IN-predicate query with 10K INTEGER values. Main memory accesses hinder sequential execution when the dictionary is larger than the cache (25 MB); interleaved execution is affected much less. (Y-axis: query response time in ms; x-axis: data size, 1 MB–2 GB, log scale; series: Main, Main-Interleaved.)

PVLDB Reference Format:
Georgios Psaropoulos, Thomas Legler, Norman May, and Anastasia Ailamaki. Interleaving with Coroutines: A Practical Approach for Robust Index Joins. PVLDB, 11(2): 230-242, 2017.
DOI: https://doi.org/10.14778/3149193.3149202

1. INTRODUCTION

When choosing the physical operator for an equi-join between two relations, A and B, a query optimizer checks if either has an index on the join attribute. Such an indexed relation, e.g., A, can be used for an index join, which scans B, looking up A's index to retrieve the matching records.

In main memory column stores that employ dictionary encoding [9, 17, 19, 26, 27], we encounter relations that are always indexed: the dictionaries. Each dictionary holds the mapping between a value and its encoding, and every query for a list of values requires a sequence of lookups on the dictionary. In this paper, we consider these lookups a case of an index join that we use to propose a practical technique that significantly enhances index join performance by hiding the cost of main memory accesses.

Like all index lookups, dictionary lookups become disproportionally expensive when the dictionary outsizes the last level cache of the processor. Figure 1 illustrates this disproportionality in the runtime of a query with an IN predicate, executed on the SAP HANA column store [9]. For the size range 1 MB–2 GB, we observe a significant runtime increase when the dictionary outgrows the last level cache (25 MB). The increase is caused by main memory accesses (details in Section 2), a known problem for index joins [32] and main memory database systems in general [5, 20].

Traditional tenets for dealing with main memory accesses prescribe the elimination of excessive indirection and non-sequential memory access patterns. Given that dictionary implementations are cache-conscious data structures, we can assume any effected main memory accesses to be essential and thus unavoidable.
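Before turning to how these accesses can be hidden, the sketch below makes the dictionary-lookup setting concrete: a minimal sorted-array dictionary with the two access methods discussed in Section 2.1, extract (an array lookup) and locate (a binary search). It is an illustrative simplification rather than SAP HANA code; the class name, the sentinel code for absent values, and the use of std::string values are assumptions.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Minimal sorted-array dictionary: the array position is the integer code.
    // Illustrative only; real column-store dictionaries are more elaborate.
    class SortedDictionary {
    public:
        static constexpr int32_t kNotFound = -1;  // assumed sentinel for absent values

        explicit SortedDictionary(std::vector<std::string> sorted_values)
            : values_(std::move(sorted_values)) {}

        // extract: code -> value, a simple array lookup.
        const std::string& extract(int32_t code) const { return values_[code]; }

        // locate: value -> code, a binary search on the sorted array.
        int32_t locate(const std::string& value) const {
            auto it = std::lower_bound(values_.begin(), values_.end(), value);
            if (it != values_.end() && *it == value)
                return static_cast<int32_t>(it - values_.begin());
            return kNotFound;
        }

    private:
        std::vector<std::string> values_;
    };

    // Encoding the value list of an IN predicate is a sequence of locate calls,
    // i.e., an index join between the value list and the dictionary.
    std::vector<int32_t> encode_predicate(const SortedDictionary& dict,
                                          const std::vector<std::string>& in_list) {
        std::vector<int32_t> codes;
        codes.reserve(in_list.size());
        for (const auto& value : in_list)
            codes.push_back(dict.locate(value));
        return codes;
    }

Each locate in this loop touches array positions that lie far apart for large dictionaries, which is where the main memory accesses discussed above come from.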
Still, we can hide the latency of main memory accesses by providing the processor with enough independent instructions to execute while fetching data. In our case, lookups are independent from each other, allowing us to execute them concurrently in a time-sharing fashion: we can interleave the instruction streams of several lookups so that, when a cache miss occurs in one instruction stream, execution continues in another. Hence, the processor keeps executing instructions without having to wait for data.

Prior works propose two forms of such instruction stream interleaving (ISI): static, like group prefetching (GP) [6] and

230 software pipelined prefetching (SPP) [6], and dynamic, like 2.1 Dictionaries in Column Stores the state-of-the-art asynchronous memory access chaining Dictionary encoding is a common compression method in (AMAC) [15]. Static interleaving has negligible overhead main-memory column stores, e.g. [9, 17, 19, 26, 27]. It maps for instruction streams with identical control flow, whereas the value domain of one or more columns to a contiguous dynamic interleaving efficiently supports a wider range of integer range [7, 9, 10, 12, 18, 20]. This mapping replaces use cases, allowing instruction streams to diverge, e.g., with column values with unique integer codes and is stored in a early returns. Both approaches require to rewrite code either separate data structure, the dictionary,whichsupportstwo as a group or a in the static case, or a state ma- access methods: chine in the dynamic case (see Section 3). The interleaved extract returns the value for a code. code ends up in the database codebase alongside the origi- • locate returns the code for a value that exists in the nal, non-interleaved implementation, implying functionality • dictionary, or a special code that denotes absence. duplication, as well as additional testing and maintenance The resulting vector of codes and the dictionary consti- costs. As a result, developers can be reluctant to use inter- tute the encoded column representation. The code vector leaved execution in production. is usually smaller than the original column, reflecting the In this work, we propose interleaving with coroutines, i.e., smaller representation of codes, whereas the dictionary size functions that can suspend their execution and resume at a is determined by the value domain, which can comprise from later point. Coroutines inherently support interleaved exe- few to billions of distinct values as encountered by database cution, while they can also run in non-interleaved mode. vendors in customer workloads [25]. Atechnicalspecification[3]forthepopularC++ language In this work, we use the column store of SAP HANA, introduces coroutine support at the language level: the pro- which has two parts for each column: the read-optimized grammer simply inserts suspension statements into the code, Main,againstwhichthequeryinFigure1wasrun,andthe and the compiler automatically handles the state that has update-friendly Delta.AMain dictionary is a sorted array to be maintained between suspension and resumption (see of the domain values, and the array positions correspond to Section 4). With this soon-to-be-standard language support, codes, similarly to [9, 17, 27]. Hence, extract is a simple ar- we implement interleaving with comparable performance to ray lookup, whereas locate is a binary search on the array the prior proposals, supporting both interleaved and non- contents for the appropriate array position. On the other interleaved execution through a single implementation that hand, Delta dictionaries are implemented as unsorted ar- closely resembles the original code (see Section 5). rays indexed by a cache-conscious B+-tree (CSB+-tree) [28]; Concretely, we make the following contributions: extract is again an array lookup, but locate is now an in- A technique for instruction stream interleaving based + • dex lookup on the CSB -tree. on coroutines. 
We exhibit our technique in index joins, Sorted or not, a dictionary array can be considered a rela- + where we use sorted arrays and CSB -trees as indexes, tion (code, value)thatisindexedonbothattributes:codes describing how it can be applied to any type of pointer- are encoded as array indices, whereas values can be retrieved based index structure. through binary search or index lookup, respectively in the Acomparisonwithgroup prefetching (GP) and asyn- • sorted and the unsorted case; here, we focus on the value chronous memory access chaining (AMAC).Sinceco- index. Since a sequence of values is also a relation S(value), routines are a dynamic interleaving approach, they are every value lookup from a column involves a join S ◃▹ D, equivalent to AMAC in terms of applicable use cases which is performed as an index join when S << D .Such and performance without the need for an explicit state joins dominate the IN-predicate queries that| we| discuss| | next. machine. Coroutine code closely resembles the original implementation and can be used in both interleaved 2.2 IN Predicates and Their Performance and non-interleaved execution, relying on the compi- In this paper, we study the problem of random memory ler for state management and optimization. accesses in queries with IN predicates [23], yet our analysis An optimized implementation for IN-predicate queries and the coroutine-based technique we propose apply to any • with predictable runtime, proportional to the dictio- index join that involves pointer-based data structures like nary size. This implementation is part of a prototype hash tables or B+-trees, or algorithms involving chains of based on SAP HANA, and targets both sorted and non-sequential memory accesses like binary search. non-sorted dictionaries for INTEGER columns. IN predicates are commonly used in an ETL process to Demonstrating how coroutines minimize the effort of im- extract interesting items from a table. An IN predicate is plementing ISI,webelieveourcontributionsprovideastrong encountered in the WHERE clause of a query, introducing argument for interleaved execution in real DBMS codebases. asubqueryoranexplicitlistofvaluesthatthereferenced column must match. Listing 1 showcases an IN predicate 2. BACKGROUND & RELATED WORK from Q8 of TPC-DS [4], which extracts all zip codes from In this section, we describe how column stores use dictio- the customer address table that belong in a specified list of nary encoding, explain why dictionary lookups performed in 400 predicate values. bulk are instances of index joins, and establish IN-predicate 1 SELECT substr(ca_zip ,1,5) ca_zip queries against dictionary encoded columns as the running 2 FROM customer_address example for this work. Further, we quantify the negative 3 WHERE substr(ca_zip ,1,5) effect of main memory accesses on dictionary lookups when 4 IN (’24128’, ..., ’35576’) the dictionary outsizes the last level cache, and justify why Listing 1: IN predicate excerpt from TPC-DS Q8. eliminating all main memory accesses is unrealistic for large dictionaries regardless of their implementation, thus moti- When IN-predicate queries are executed on dictionary- vating instruction stream interleaving. encoded data, the predicate values need to be encoded before

231 the corresponding rows can be searched in the code vector. Table 2: Pipeline slot breakdown for locate. This encoding comprises a sequence of locate operations, Main Delta which can be viewed as an index join, as we described in the 1MB 2GB 1MB 2GB previous subsection. Ideally, the runtime of an IN-predicate Front-End 10.4% 3.5% 0.7% 0.2% query would depend only on the computation involved, re- Bad speculation 43.3% 26.1% 0.0% 0.3% sembling the Interleaved data series of Figure 1 rather than Memory 2.8% 46.0% 30.8% 85.9% the measured Sequential one. This behavior is not observed Core 16.4% 20.5% 28.9% 7.3% only in Main; Delta has a similar problem, as illustrated in Retiring 27.0% 3.9% 40.0% 6.3% Figure 8. To identify what causes the significant runtime increase for large dictionaries of both Main and Delta, we profile the query execution for the smallest and largest dictionary the processor caches, so memory stalls are avoided with out- sizes, i.e., 1 MB and 2 GB. The resulting list of hotspots of-order execution. For the 2 GB dictionary, only the first identifies dictionary lookups (locate)asthemainexecution few binary search iterations (Main) or tree levels (Delta) component, as shown in Table 1. are expected to be in a warmed-up cache, since they are reused by many lookups, while the rest of the data reque- Table 1: Execution details of locate. sts incur main memory accesses. Each main memory access Main Delta has a latency of 182 cycles [11] that cannot be hidden with 1MB 2GB 1MB 2GB out-of-order execution, hence the observed memory stalls. Runtime % 21.465.734.378.8 1 function lookup( table , value) { Cycles per Instruction 0.9 6.3 0.7 4.2 2size=table.size() 3low=0 4 while size/2 > 0 do In the 1 MB case, locate contributes just 21.4%(34.3%) 5half=⌊ size/⌋ 2 ⌊ ⌋ of the total execution time for Main(Delta), but this con- 6probe=low+half tribution surges to 65.7%(78.8%) for the 2 GB case. These 7v=table[probe] 8 if v < value then surges can be attributed to the 7 (6 )increaseinthecy- × × 9low=probe cles per instruction (CPI) ratio between the 1 MB and the 10 s i z e =half 2GBcases.ToexplaintheCPIdifference, we investigate 11 return −low microarchitectural behavior applying the Top-Down Micro- 12 } architectural Analysis Method(TMAM) for identifying per- Listing 2: Binary search. formance bottlenecks in out-of-order cores [11]. Below we establish the terminology used in the rest of the paper. Furthermore, a significant fraction of the issued instructi- Top-down Microarchitecture Analysis.Thismethod ons in Main never retire, regardless of the dictionary size. uses a simplified model for the instruction pipeline of an out- These instructions are speculatively executed and then rol- of-order core, consisting of two parts: led back because of bad speculation, which is inherent to Front-end:Fetchesprograminstructionsanddecodesthem binary search: the search continues with the same probabi- into one or more micro-ops (µops), which are then fed to the lity in either the left or the right subarray from the current Back-end—up to four µops per cycle in Intel architectures. array position, depending on the result of the comparison Back-end:Monitorswhentheoperandsofaµop become between the value of the current position and the one we available and executes it on an available execution unit. are looking for (line 8 in Listing 2). 
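For concreteness, a compilable C++ rendering of the binary-search pseudocode of Listing 2 follows. The container type is an assumption, and the comparison uses <= as in Listings 3 and 4, so that the returned position holds the searched value whenever it is present; callers must verify the match.

    #include <cstddef>
    #include <vector>

    // Binary search over a sorted array, following Listing 2.
    std::size_t lookup(const std::vector<int>& table, int value) {
        std::size_t size = table.size();
        std::size_t low = 0;
        while (size / 2 > 0) {
            std::size_t half = size / 2;
            std::size_t probe = low + half;
            int v = table[probe];   // the load that may miss in the cache
            if (v <= value)         // data-dependent branch (line 8 of Listing 2)
                low = probe;
            size -= half;
        }
        return low;  // position of the last element <= value (0 for an empty table)
    }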
To avoid waiting for the µops that complete execution retire (again up to four µops result of the comparison, the processor predicts which of the per cycle) after writing their results to registers/memory. two alternative control flow paths will be chosen and execu- Furthermore, TMAM uses pipeline slots to abstract the tes it speculatively. Given both alternatives have the same hardware resources necessary to execute one µop and assu- probability, the prediction is wrong 50% of the time, so the mes there are four available slots per cycle and per core. speculatively executed instructions have to be rolled back. In each cycle, a pipeline slot is either filled with a µop, or Nevertheless, a binary search implementation can avoid spe- remains empty (stalled)duetoastallcausedbyeitherthe culation using a conditional move;theDeltausessuchan Front-end or the Back-end.TheFront-end may not be able implementation for the tree nodes, so no pipelines slots get to provide a µop to the Back-end due to, e.g., instruction wasted due to bad speculation. When the comparison ope- cache misses; whereas the Back-end can be unable to accept rands reside in the cache, avoiding speculation is preferable a µop from the Front-end due to data cache misses (Me- since there is little latency to hide; still, in case the compari- mory)orunavailableexecutionunits(Core). In the absence son needs data to be fetched from main memory, speculated of stalls, the slot can either retire (Retirement )orexecute execution is better, as we explain in Section 5.4.1. non-useful work due to Bad Speculation. Moreover, we believe bad speculation is also the main re- In Table 2, we present the pipeline slots of locate’s exe- ason for the front-end stalls we observe in Main, given that cution1 divided into the aforementioned categories. In the these stalls are negligible in Delta, and do not appear in 2 GB case, memory stalls account for 46.0% and 85.9% of the the non-speculative microbenchmarks we study in Section 5. pipeline slots, respectively for Main and Delta, while they Finally, the core fraction in both Main and Delta contains are relatively less prominent in the 1 MB case. The stalls stalls due to data unavailable execution units. occur from random accesses in the dictionary array for Main Takeaway.Memorystallsduetodatacachemissesbe- and to the index nodes for Delta. The 1 MB dictionary fits in come the main source of inefficiency for dictionaries that do 1Retrieved from a profiling session of 60 seconds. not fit in the cache.
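The branch-free variant mentioned above can be sketched as follows. Whether the compiler actually emits a conditional move depends on the compiler and optimization flags, so this illustrates the idea rather than reproducing the Delta implementation.

    #include <cstddef>
    #include <vector>

    // Branch-free binary search: the data-dependent branch of Listing 2 is
    // replaced by a conditional select that compilers typically lower to a
    // conditional move, trading bad speculation for a data dependency.
    std::size_t lookup_branchless(const std::vector<int>& table, int value) {
        std::size_t size = table.size();
        std::size_t low = 0;
        while (size / 2 > 0) {
            std::size_t half = size / 2;
            std::size_t probe = low + half;
            low = (table[probe] <= value) ? probe : low;  // no branch on the comparison
            size -= half;
        }
        return low;
    }

The selection makes the next iteration wait for the loaded element: cheap when the element is cached, but slower than (sometimes wrong) speculation when the element has to come from main memory, as Section 5.4.1 elaborates.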

232 2.3 Tackling Cache Misses In the literature, we find many software techniques to deal with cache misses. Based on how they affect the number of cache misses and the incurred penalty, these techniques fall into the following three categories: Eliminate cache misses by increasing spatial and • temporal locality. Locality is increased by a) elimina- ting indirection, designing cache-conscious data struc- tures like the CSB+-tree [28]; b) matching the data layout to the access pattern of the algorithm, i.e, store data that are accessed together in contiguous space; or ) reorganizing memory accesses to increase locality, e.g., with array [16] and tree blocking [14, 32]. In this work, we assume that the index has the best possible implementation and locality cannot be further incre- Figure 2: Sequential vs interleaved execution. ased without penalizing single lookups. Nonetheless, our proposal can be applied to any index structure. Reduce the cache miss penalty by in- tion stages of duration Tcompute.Withsequentialexecution, • dependent instructions to execute after a load; this the three instruction streams run one after the other (we approach increases instruction-level parallelism and le- omit the third instruction stream due to lack of space), and ads to more effective out-of-order execution. To reduce there is a mechanism (e.g., a loop) that, when IS A finis- the main-memory access penalty, a non-blocking load hes, switches to IS B and then to IS C. Every array access has to be introduced early enough, allowing indepen- leads to a cache miss and a corresponding Tstall.Within- dent instructions to execute while fetching data. This terleaving, execution switches from one instruction stream is achieved through simple prefetching within an in- to another at each memory access, incurring an overhead struction stream, or exploiting simultaneous multithre- Tswitch (per instruction stream) that overlaps with Tstall and leaves Ttarget = Tstall Tswitch stalls. In this case, ading with helper threads that prefetch data [31]. In − index lookups, however, one memory access depends when IS A accesses the array during its first stage, execu- on the previous one with few independent instructions tion switches to the first stage of IS B, then to the first stage in-between, so these techniques do not apply. of IS C, and then back to IS A for the second stage. Hide the cache miss penalty by overlapping me- We generalize this example to model a group of G in- • mory accesses. The memory system can serve several struction streams, where each instruction stream i has dis- memory requests in parallel (10 in current Intel CPUs) tinct Ti,compute, Ti,switch,andTi,target parameters. Ti,target j=i and exploiting this memory-level parallelism increases is removed iff Ti,target j̸ [0..G) (Tj,compute + Tj,switch). ≤ ∈ memory throughput. However, overlapping requires In case of identical model parameters across the instruction ! independent memory accesses, which do not exist in streams, we can drop the indices and get: the access chain of an index lookup. T Takeaway.Theseapproachesdonotbenefitindividual G target +1 (1) index lookups. Next, we show how to hide the cache misses ≥ Tcompute + Tswitch of a group of lookups with interleaved execution. Inequality 1 estimates the optimal G,i.e.,theminimum group size for which stalls are eliminated. Interleaving more instruction streams does not further improve performance 3. 
INTERLEAVED EXECUTION since there are no stalls; to the contrary, performance may In this section, we present the idea of instruction stream deteriorate due to cache conflicts—see Section 5.4.5. interleaving, how it applies to IN-predicate queries, and ex- Considering the total execution of an instruction stream isting techniques that can be used to implement it. group, ISI belongs in the second category of the taxonomy We deal with the general case of code with independent in Section 2.3. Nevertheless, it becomes also a member of the instruction streams, some or all of which exhibit memory third category when memory stalls from different instruction access patterns that hardware prefetchers cannot predict. streams overlap, e.g., the example in Figure 2. Our objective is to overlap memory accesses from one in- Implementing ISI .Inprinciple,wecanimplementISI struction stream with computation from others, keeping the using any technique. However, in- instruction pipeline filled instead of incurring memory stalls. terleaved execution makes sense only when Tstall >> Tswitch, We interleave the execution of several instruction streams, i.e., the mechanism employed to switch instruction streams switching to a different instruction stream each time a load requires significantly less cycles than the penalty of the cor- stalls on a cache miss. Therefore, we call this approach in- responding cache miss. Furthermore, an effective switching struction stream interleaving (ISI). mechanism should not introduce more cache misses. Figure 2 illustrates ISI through an example with three OS threads are a natural candidate to encode instruction binary searches on a sorted array with eight elements. IS A, streams and can be synchronized to cooperatively multi- B, and C are instruction streams that correspond to each task; the granularity we consider, however, rules out such binary search. For simplicity, we assume all array acces- implementations: preemptive multithreading imposes non- ses are cache misses and all other memory operations hit in negligible synchronization overhead, context switching in- the cache. Each instruction stream accesses the array three volves system calls and takes several thousand cycles to com- times, splitting the instruction stream into four computa- plete, while switching among full-blown stacks likely

233 thrashes the cache and the TLB. Hence, the techniques we follows a different control flow. To decouple the progress present below do not depend on OS multithreading; instead, of different instruction streams, they proposed the state-of- they eliminate stalls, increasing the efficiency of thread exe- the-art asynchronous memory access chaining (AMAC),a cution. Given an amount of work, interleaving techniques technique that encodes traversals of pointer-intensive data reduce the necessary execution cycles in both single- and structures as finite state machines. The traversal code is multi-threaded execution. manually rewritten to resemble a state machine, enabling Existing techniques.Intheliterature,wefindpre- each instruction stream in a group of traversals to progress fetching techniques that implement restricted forms of in- independently from others, based only on its current state. struction stream interleaving. Chen et al. [6] proposed to 1 enum stage A, B, C exploit instruction stream parallelism across subsequent tu- 2 struct state{ } ples in hash joins by manually applying well-known loop 3value;low;probe;size;stage{ transformations that a general-purpose compiler cannot con- 4 } sider due to lack of dependency information. They proposed 5 group prefetching (GP) and software-pipelined prefetching 6 struct circular buffer 7...//members { (SPP),twotechniquesthattransformafixedchainofNme- 8 function load next state () ... mory accesses inside a loop into sequences of N+1 computa- 9 { } tion stages separated by prefetches. GP executes each stage 10 } for the whole group of instruction streams in the loop before 11 procedure bulk lookup amac( moving to the next stage; whereas SPP executes a different 12 g r o u p size , table , table size , input instruction stream at each stage in a pipeline fashion. Both 13 ) 14 {init b f // circular buffer of group size techniques interleave instruction streams, although they tar- 15 not done = group size get ones with a fixed number of stages. 16 while not done > 0 do To apply GP on the dictionary lookups of IN predicates, 17 s t a t e = b f.load next state () we decompose the loop of a binary search (lines 4–11 in 18 switch (state.stage) { Listing 2) into a prefetch and a load stage (lines 9–13 and 14– 19 case A: // I n i t i a l i z a t i o n 20 in Listing 3). The number of times these two stages are 20 if index < input size then 21 s t a t e . low = 0 repeated depends on the table size, therefore the vanilla GP 22 state.value = input[index++] proposal does not apply. Nevertheless, the idea behind GP 23 s t a t e . s i z e = t a b l e size is not inherently restricted to a fixed number of stages: in 24 s t a t e . s t a g e = B cases like the dictionary lookups of IN predicates, the stage 25 else sequence is the same for all instruction streams, enabling us 26 s t a t e . s t a g e = Done to use a variation of GP 2 in the implementation of Listing 3. 27 not done = not done 1 28 break − 1 struct state value ; low; 29 case B: // Prefetch 2 { } 30 if state.size / 2 > 0 then 3 procedure bulk lookup gp( 31 h⌊ a l f = state.size⌋ / 2 4groupsize , table , table size , input 32 state.probe=⌊ state.low+⌋ half 5) 33 prefetch table[state.probe] 6 { foreach value group in input do 34 s t a t e . s i z e =half 7size=tablesize 35 s t a t e . s t a g e− = C 8 init search group 36 else 9 while size / 2 > 0 do 37 //Output result 10 h a l f⌊ = size⌋ / 2 38 s t a t e . 
s t a g e = A 11 foreach⌊ state⌋ in search group do 39 break 12 probe=state.low+half 40 case C: //Access 13 prefetch table[probe] 41 v= table[state.probe] 14 for i=1to group size do 42 if v <=state.valuethen 15 s t a t e = s e a r c h group [ i ] 43 state.low=state.probe 16 probe=state.low+half 44 else 17 v = t a b l e [ probe ] 45 s t a t e . s t a g e = B 18 if v <=valuegroup [ i ] then 46 break 19 s t a t e . low = probe 47 20 s i z e =half 48 }store state in b f 21 foreach− state in search group do 49 } 22 store state .low Listing 4: Binary search with AMAC. 23 } Listing 3: Binary search with GP. In Listing 4, we illustrate lookups on a sorted dictionary interleaved with AMAC.Thestatemachinecodeisaswitch The loop is shared among all instruction streams in a group, statement (line 18–47) with one case for each stage. The reducing the state variables that have to be maintained for state of each instruction stream is stored in a buffer (line each instruction stream, as well as the executed instructions. 14) and retrieved in a round-robin fashion. The state ma- However, sharing the loop means the instruction streams are chine examines the current stage of the instruction stream coupled and execute the same instruction sequence. and decides how to proceed. This way, instruction streams Kocberber et al. [15] considered this coupling as a limi- can progress independently—but at the cost of an imple- tation that complicates cases where each instruction stream mentation that has little resemblance to the original code. 2We have not yet investigated how to form a pipeline with Takeaway.Table3summarizesthepropertiesoftheISI variable size, so we do not provide a SPP implementation. implementation techniques we study in this paper.
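As a compilable counterpart to Listing 3, the sketch below applies the GP variation to a group of binary searches: one shared loop issues a prefetch per lookup and then, in a second pass, consumes the prefetched elements. The prefetch hint matches the one used in Section 5.1 (_MM_HINT_NTA); the default group size and the result layout are assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <vector>
    #include <xmmintrin.h>  // _mm_prefetch

    // Group prefetching (GP) for binary searches: all lookups in a group share
    // the search loop; each iteration first issues one prefetch per lookup and
    // then performs the comparisons once the data is (hopefully) cached.
    void bulk_lookup_gp(const std::vector<int>& table,
                        const std::vector<int>& values,
                        std::vector<std::size_t>& results,
                        std::size_t group_size = 10) {
        results.resize(values.size());
        for (std::size_t g = 0; g < values.size(); g += group_size) {
            const std::size_t n = std::min(group_size, values.size() - g);
            std::vector<std::size_t> low(n, 0);  // per-lookup state: current low
            std::size_t size = table.size();     // loop state shared by the group
            while (size / 2 > 0) {
                const std::size_t half = size / 2;
                // Prefetch stage: one prefetch per lookup in the group.
                for (std::size_t i = 0; i < n; ++i)
                    _mm_prefetch(reinterpret_cast<const char*>(&table[low[i] + half]),
                                 _MM_HINT_NTA);
                // Access stage: consume the elements and advance each lookup.
                for (std::size_t i = 0; i < n; ++i) {
                    const std::size_t probe = low[i] + half;
                    if (table[probe] <= values[g + i])
                        low[i] = probe;
                }
                size -= half;
            }
            for (std::size_t i = 0; i < n; ++i)
                results[g + i] = low[i];
        }
    }

An AMAC or coroutine version keeps the loop state per lookup instead of sharing it, which is exactly the extra state management that Table 3 and Section 5.4.4 attribute to the non-coupling techniques.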

234 Table 3: Properties of interleaving techniques. chine. In a sequence of transformation steps, the compiler Interleaving IS IS Switch Added Code splits the body of a coroutine into distinct stages that are se- Technique Coupling Overhead Complexity parated by the suspension/resumption points. These stages correspond to the state machine stages that a programmer GP Yes Very Low High derives manually for AMAC; in the coroutine case, however, AMAC No Low Very High the compiler performs the transformation, taking also care Coroutines No Low Very Low of preserving the necessary state across suspension/resump- tion. The compiler identifies which variable to preserve and stores them on the process heap, in a dedicated coroutine GP adds minimum overhead owing to execution coupling, frame that is analogous to each entry in the state buffer of whereas AMAC supports more use cases by allowing each AMAC. Beside these variables, the coroutine frame contains instruction stream to proceed independently. Nevertheless, also the resume address and some register values; these are both GP and AMAC require intrusive code changes that stored during suspension and restored upon resumption, ad- obfuscate the original control flow, incurring high costs for ding an total overhead equivalent to two function calls. development, testing, and maintenance. These costs make Binary search as a coroutine.InListing5,wedemon- the two techniques impractical to use in a large codebase. strate with binary search how to transform an index lookup In the next section, we present an interleaving technique to support interleaved execution. The presented pseudocode that requires minimal non-intrusive code changes. introduces the coroutine keyword to denote the difference to the ordinary procedure and function.Moreover,the 4. INTERLEAVING WITH COROUTINES suspension and return statements hint to the actual code. Acoroutineisacontrolabstractionthatextendssubrou- 1 coroutine lookup( tines [24]. A starts its execution upon invocation 2table,tablesize , value , interleave 3) by a caller and can only run to completion where the control 4size=table{ size is returned to the caller. The coroutine construct augments 5low=0 this lifetime with suspension and resumption:acoroutine 6 while size / 2 > 0 do can suspend its execution and return control before its com- 7half=⌊ size⌋ / 2 ⌊ ⌋ pletion; the suspended coroutine can be resumed at a later 8probe=low+half point, continuing its execution from the suspension point on- 9 if interleave == true then 10 prefetch table[probe] ward. To resume a coroutine, one has to use the coroutine 11 co await suspend always () handle that is returned to the caller at the first suspension. 12 v = t a b l e [ probe ] Although coroutines were introduced in 1963 [8], main- 13 if v < value then stream programming languages did not support them until 14 low = probe 15 s i z e =half recently, except for restricted generator constructs in lan- − guages like C# [1] and Python [2]. The advantages of co- 16 co return low 17 routines gave rise to library solutions, e.g., Boost.ASIO and } Boost.Coroutine3, which rely on tricks like Duff’s device4, Listing 5: Binary search coroutine. or OS support like fibers on Windows [30] and ucontext_t on POSIX systems [13]. 
Asynchronous programming and its Calling lookup creates a coroutine instance and returns a rise in popularity brought coroutines to the spotlight as a handle object with the following API: a resume method for general control abstraction for expressing asynchronous ope- resuming execution; an isDone method that returns true if rations without callbacks or state machines. Languages like the coroutine completed its execution, and false otherwise; C#, Python, Scala and Javascript have adopted coroutine- a getResult method to retrieve the result after completion. like await constructs; C++ has a technical specification for Lines 4–16 are the code of the original sequential imple- coroutines as a language feature [3], which at the time of mentation augmented with a prefetch (line 10) and a sus- writing is supported by the Visual C++ and the Clang com- pension statement (line 11) before the memory access that piler. Naturally, database implementations have also picked causes the cache miss (line 12). The added code is wrapped up coroutines to simplify asynchronous I/O [29], but, to the in an if statement combining both sequential and interle- best of our knowledge, not to hide cache misses. aved execution (depending on the value of interleave)in Implementing ISI .Coroutinescanyieldcontrolinthe asingleimplementation.TheactualC++ code uses tem- middle of their execution and be later resumed. This abi- plate metaprogramming with interleave as a template pa- lity makes them candidates for implementing ISI, as already rameter to ensure the conditional statement is evaluated at remarked by Kocberber et al. [15]. An efficient implemen- compile time, therefore generating the optimal code for each tation needs a) a suspension/resumption mechanism that case. Finally, instead of a normal , the co- consumes a few tens of cycles at most, and b) a space foot- routine code uses co_return in line 16 to return the result. + print that does not thrash the cache. Our interleaving with CSB -tree lookup as a coroutine.InListing6,we + coroutines proposal satisfies these requirements by using the depict the coroutine implementation for a CSB -tree lookup stackless coroutines as specified for C++ in [3]. that adheres to the original proposal of Rao et al. [28]. For Coroutines as state machines.Astacklesscoroutine5 simplicity, we assume a cached root node; for all other tree is compiled into assembly code that resembles a state ma- levels, we prefetch all cache lines for each touched node and suspend (lines 10 to 12). Note that, for the binary search 3http://www.boost.org/doc/libs within nodes, we use the coroutine of Listing 5 without sus- 4https://en.wikipedia.org/wiki/Duff%27s_device pension; the node prefetch brings the keyList to the cache, 5As opposed to stackfull coroutines, see [24] for details. so the binary search causes no cache misses. Moreover, a

235 leaf node differs from an inner node since the result of the The runInterleaved scheduler initializes a group of binary search is used to fetch the searched value from the • coroutines, specifying interleaved=true,andmain- valueList instead of a child node; this value is the result tains a buffer of coroutine handles (lines 15–17). Since returned in line 16. lookup execution now suspends, the while loop over the buffer resumes unfinished lookups (line 23), or re- 1 coroutine tree lookup( trieves the results from the finished lookups (lines 25– 2tree,treeheight , value , interleave 3) 26) and starts new ones (lines 27–29). 4node=tree{ >root Either of the two schedulers can be selected depending on 5 while node −>level > 0 do the probability of cache misses in a lookup and amount of 6il=node− >keyList ; i n=node>nKeys lookup parallelism present; in cases like the node search in − − 7handle=lookup(il, in, value, false) Listing 6, or when there is no other work to interleave with, 8ic=node>firstChild sequential execution is better as it incurs no overhead. 9node=ic+handle.getResult()− 10 if interleave then Finally, since the schedulers are agnostic to the coroutine 11 p r e f e t c h node implementation, they can be used with any index lookup. 12 co await syspend always () Performance considerations.Thedescribedwayofin- 13 l kl = node >keyList ; l n=node>nKeys terleaving with coroutines relies on an optimizing compiler − − 14 handle = lookup ( l kl , l n, value, false) to generate assembly that a) in interleaved execution, recy- 15 l vl = node >valueList cles coroutine frames from completed lookups for subsequent 16 co return l−vl [handle.getResult()] 17 coroutine calls (lines 25–29 in Listing 7), and b) in sequential } execution, allocates no coroutine frame, since it is not neces- + Listing 6: CSB -tree lookup coroutine. sary for non-suspending code (lines 5–6 in Listing 7). These optimizations avoid unnecessary overhead by eliding unne- Sequential and Interleaved Execution.Thelookup cessary frame allocations. At the time of writing, the Visual described above can be executed with or without suspension, C++ compiler we use does not perform these optimizations, depending on the scheduler,i.e.,thecodeimplementingthe so we apply them manually in separate implementations for execution policy for the lookup sequence. sequential and interleaved execution. As compiler support matures, manual optimization separately for sequential and 1 procedure runSequential( 2index,indexsize , values , results interleaved execution will become unnecessary. 3) Takeaway.Toimplementinterleavingwithcoroutines 4 { foreach value in values do means essentially to add suspension statements to sequen- 5handle=lookup(index, tial code at each point where there will probably be a cache 6indexsize , value , false) miss. Furthermore, the same coroutine implementation sup- 7result=handle.getResult() ports both sequential and interleaved execution, depending 8 store result to results 9 on the scheduler we use. As we show next, coroutines make 10 } interleaved execution significantly easier to adopt compared 11 procedure runInterleaved( to GP and AMAC,whileoffering similar performance. 12 index , i n d e x size , values , results 13 ) 14 { for i=0to group size 1 do 5. 
EXPERIMENTAL EVALUATION 15 value = v a l u e s [ i ] − In this section, we demonstrate that interleaving with co- 16 handles[i] = lookup(index, routines is easier to code and maintain, and performs simi- 17 i n d e x size , value , true) larly to the other two ISI techniques studied in this work, 18 not done = group size i.e., GP and AMAC.First,wecomparethethreetechniques 19 i = g r o u p size 20 while not done > 0 do in interleaved binary searches, highlighting the minimal code 21 foreach handle in handles do overhead and the few modifications of interleaving with co- 22 if not handle . isDone () then routines. Second, we demonstrate the advantages of interle- 23 handle . resume ( ) aving over sequential execution for binary searches over int 24 else and string arrays. Third, we explain the performance gains 25 result =handle.getResult() with a thorough microarchitectural analysis of the int case, 26 store result to results 27 if i < values . size () then where we also show how to estimate the best group size— 28 handle=lookup(index, the number of concurrent instruction streams. Finally, we 29 i n d e x size , values [ i ] , true) implement our coroutine-based technique in both the Main 30 i = i + 1 and the Delta of SAP HANA, enabling IN-predicate query 31 else execution with robust response times. 32 not done = not done 1 33 − } 5.1 Methodology Listing 7: Sequential and interleaved schedulers. Microbenchmarks.Westudyfivebinarysearchimple- mentations: two for sequential execution and three for inter- In Listing 7, we present two schedulers: leaved. The sequential ones are std::lower_bound from the The runSequential scheduler performs the lookups C++ standard library (abbreviated as std), and Baseline, • one after the other (lines 4–8). The coroutines are which is similar to Listing 2 and uses a conditional move to called with interleaved=false,sotheydonotsus- avoid speculative execution. Based on Baseline,weimple- pend. The only difference to a normal lookup function ment the three ISI techniques. For each implementation, we is that we retrieve the result through the handle. can configure the group size,i.e.,howmanylookupsrun

236 interleaved at any given point in time. Finally, GP and AMAC Table 5: Implementation complexity and code footprint resemble the pseudocode in Listings 3 and 4 respectively, of ISI techniques. The two CORO variants differ the least whereas CORO is a modified version of Listing 5 that avoids from the original code (11 LoC) and require the least code memory allocations by using the same coroutine frame for to support both sequential and interleaved execution. subsequent binary searches. Technique GP AMAC CORO-U CORO-S Interleaved 24 67 15 18 Table 4: Architectural parameters. ! Diff-to-original 18 64 6 9 Processor Intel Xeon 2660v3 [11] TotalCodeFootprint 35 78 16 29 Architecture Haswell Technology [email protected] #Cores 10(Hyperthreadingdisabled6) Core Type 4-wide OoO Why not use essential or cyclomatic complexity? L1 I/D (per core) 32 KB/32 KB, 8-way associative Contrary to the two LoC metrics we use above, standard #LineFillBuffers 10 metrics like essential and cyclomatic complexity [22] reflect L2 Cache 256 KB, 8-way associative code properties that are not useful in determining which LLC Cache 25 MB technique is easier to implement and maintain. Essential DTLB 64 entries, 4-way associative complexity assesses how structured the code is, examining STLB 1024 entries, 8-way associative the entry and exit points of each control flow structure used in the code; depending on their state, coroutines are entered and exited at different points, which means they have high Experimental Setup. The workstation used in our ex- essential complexity although they are arguably easy to un- periments is listed in Table 4. It features two Intel Xeon derstand. Moreover, cyclomatic complexity is a property of 2660 v3 processors with 10 cores per socket and runs Win- the control flow graph, which is almost identical for AMAC dows 10 1511. For our measurements, we have pinned our and CORO since they are both state machines with the same microbenchmarks on one core and migrated all other pro- states; consequently, AMAC and CORO have similar cyclomatic cesses to the other socket in order to minimize performance complexity, despite the little resemblance between them, in variability due to uncontrollable thread migrations between analogy to how a and the equivalent se- sockets and external interference from other processes. quence of if...elseshavethesamecyclomaticcomplexity. To use coroutines, we compile our code with the Microsoft For these reasons, we do not consider these two metrics in 7 Visual C++ (MSVC) v14.1 compiler .Toprefetchdata,we our comparison. use the instruction PREFETCHNTA (through the compiler Takeaway.Interleavingwithcoroutineshastwokeypro- intrinsic _mm_prefetch(ptr, _MM_HINT_NTA)). The compi- perties: a) the lookup logic is kept separate from the execu- lation flags used are: /Ox /arch:AVX2 /await.Finally,we tion policy, enabling a single codepath to be configured for use Intel VTune Amplifier XE 2017 to profile execution and sequential or interleaved execution; b) the coroutine code is observe microarchitectural behavior. the sequential code extended with a prefetch and a suspen- sion statement per switch point. Thanks to these properties, 5.2 Code Complexity and Maintainability coroutines incur significantly lower development and main- By comparing the implementation methodologies of the tenance costs compared to prior ISI techniques. 
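To illustrate the coroutine approach end to end with current tooling, the following sketch expresses the binary search of Listing 5 and a round-robin scheduler in standard C++20 (<coroutine>). The paper targets the earlier coroutines technical specification through MSVC's /await, so the task type (Lookup), the scheduler (run_interleaved), and the default group size below are assumptions for illustration, not the implementation measured in this section.

    #include <coroutine>
    #include <cstddef>
    #include <exception>
    #include <utility>
    #include <vector>
    #include <xmmintrin.h>

    // Minimal handle type for one interleaved lookup (C++20 coroutines).
    struct Lookup {
        struct promise_type {
            std::size_t result{};
            Lookup get_return_object() {
                return Lookup{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_never initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }  // keep frame to read result
            void return_value(std::size_t r) { result = r; }
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> h;
        bool done() const { return h.done(); }
        void resume() const { h.resume(); }
        std::size_t result() const { return h.promise().result; }
        void destroy() const { h.destroy(); }
    };

    // Binary search as a coroutine: prefetch, suspend, then touch the element.
    Lookup lookup_coro(const std::vector<int>& table, int value) {
        std::size_t size = table.size();
        std::size_t low = 0;
        while (size / 2 > 0) {
            std::size_t half = size / 2;
            std::size_t probe = low + half;
            _mm_prefetch(reinterpret_cast<const char*>(&table[probe]), _MM_HINT_NTA);
            co_await std::suspend_always{};   // switch to another lookup here
            if (table[probe] <= value)
                low = probe;
            size -= half;
        }
        co_return low;
    }

    // Round-robin scheduler: keeps group_size lookups in flight.
    std::vector<std::size_t> run_interleaved(const std::vector<int>& table,
                                             const std::vector<int>& values,
                                             std::size_t group_size = 6) {
        std::vector<std::size_t> results(values.size());
        std::vector<std::pair<Lookup, std::size_t>> inflight;  // (lookup, index into values)
        std::size_t next = 0;
        while (next < values.size() && inflight.size() < group_size) {
            inflight.emplace_back(lookup_coro(table, values[next]), next);
            ++next;
        }
        while (!inflight.empty()) {
            for (std::size_t i = 0; i < inflight.size();) {
                Lookup& lk = inflight[i].first;
                if (!lk.done()) {              // resume: runs until the next suspension
                    lk.resume();
                    ++i;
                } else {                       // harvest the result and refill the slot
                    results[inflight[i].second] = lk.result();
                    lk.destroy();
                    if (next < values.size()) {
                        inflight[i] = {lookup_coro(table, values[next]), next};
                        ++next;
                        ++i;
                    } else {
                        inflight.erase(inflight.begin() + i);
                    }
                }
            }
        }
        return results;
    }

A sequential scheduler is the degenerate case: resume each lookup in a loop until it is done and read the result; the paper instead selects between the two modes with an interleave template parameter evaluated at compile time, so a single coroutine body serves both sequential and interleaved execution.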
three ISI techniques (and the corresponding examples in Listings 3, 4, and 5), we intuitively see that interleaving 5.3 Sequential vs Interleaved Execution with coroutines is easier to implement and maintain than We evaluate the five aforementioned implementations on the prior techniques. To validate this intuition, we calcu- sorted arrays whose size ranges from 1 MB to 2 GB. We late the lines of code (LoC) that are different between each generate the array values using the array indices: for inte- of the ISI implementations and the original sequential code ger arrays, the values are the corresponding array indices, (Diff-to-Original), as well as the total LoC one has to main- whereas for string arrays we convert the index to a string of tain per lookup algorithm, e.g., binary search, to support 15 characters, suffixing characters as necessary. Further, the both sequential and interleaved execution (Total Code Foot- list of lookup values is a subset of the array values, generated print). The first metric hints to the implementation com- using std::uniform_int_distribution and std::mt19937 plexity, while the second one to maintainability; for both with seed 0. metrics, lower values are better. Figure 3 depicts the performance per binary search with In Table 5, we present the two metrics for GP and AMAC, lookup lists of 10K values. We report the average runtime of as well as CORO for both the proposed unified implementa- 100 executions, and, for ISI implementations, the depicted tion (CORO-U)andtheseparateimplementations(CORO-S). measurements correspond to the best group size configura- CORO-U requires the least modifications/additions (6 LoC) to tion (see Subsection 5.4.5). Since the instructions executed the original code to be implemented, while it has the smal- in a binary search are a logarithmic function of the array lest code footprint (16 LoC) thanks to the unified codepath; size, the horizontal axis has a logarithmic scale. all other implementations have separate codepaths for each The difference between sequential and interleaved execu- mode of execution and, thus, have two implementations for tion is clear for both int and string arrays. std and Baseline the same lookup algorithm. Nonetheless, both CORO variants incur an important runtime increase for arrays larger than have significantly less code than GP and AMAC. 16 MB. These arrays outsize the last level cache (25 MB), 6To simplify the interpretation of measurements [21]. so binary search incurs main memory accesses that manifest 7Clang did not have a stable support at the time we perfor- as stall cycles, as we explained in Section 2.2. As a result, med the experiments, hence the use of MSVC. runtime diverges significantly from the logarithmic function

Figure 3: Binary searches over sorted array. Interleaving increases runtime robustness. CORO performs similarly to AMAC, while the difference to GP is smaller for the string case. (Panels: (a) Integer array, (b) String array; y-axis: cycles per search ×100; x-axis: array size, 1 MB–2 GB; series: std, Baseline, GP, AMAC, CORO.)

Figure 4: Binary searches over sorted array with sorted lookup values. Sorting increases temporal locality, but does not eliminate compulsory cache misses. (Panels: (a) Integer array, (b) String array; same axes and series as Figure 3.)

values are not close to each other, which is likely for arrays we described above. Contrary to this behavior, runtime in- much larger than the lookup lists, there will still be compul- creases are less significant for GP, AMAC and CORO. sory cache misses to hide. Focusing on arrays larger than the last level cache, the Takeaway. Interleaved execution is more robust to array three ISI implementations behave similarly. GP constantly size increases compared to sequential execution. CORO has has the lowest runtime, in the range 2.7–3.7 and 1.8–2.2 slightly better performance than the functionally equivalent respectively for integer and string values. CORO× and AMAC fol-× AMAC,whileGP performs best thanks to the minimal over- low with decent speedups, in the ranges 2.0–2.4 and 1.8– head of static interleaving. Furthermore, sorting the lookup 2.3 for integers, and in the ranges 1.4–2.1 and× 1.2–1.9 values increases temporal locality between subsequent look- for× strings. We should note that, thanks to× compiler opti-× ups, but does not eliminate compulsory cache misses. mizations, CORO performs slightly better that AMAC,whose data alignment and layout we have carefully optimized. Finally, as array size increases, we observe a smoother in- 5.4 Microarchitectural Analysis crease of the interleaved execution for strings than for inte- To understand the effect of interleaved execution, we per- gers. This observation reflects the computationally heavier form a microarchitectural analysis of our binary search im- string comparisons, which de-emphasize cache misses. In plementations. We study them for int arrays and unsorted Section 5.4, we focus on the integer case, identifying how lookup values, analyzing them with TMAM (described in runtime behavior changes for different array sizes. Section 2.2). Furthermore, we leverage the same analysis to Increasing locality with sorting.Sortingsmalllistsis determine the best group size for each implementation. acheapoperation,andthusavalidpreprocessingstep.In this case, the lookup values are sorted before starting the 5.4.1 Where does the time go? binary searches. Figure 4 depicts the corresponding mea- In Figure 5, we depict the execution time breakdown of surements for integers (strings). std and Baseline are up abinarysearchasthearraysizeincreases,withthebest to 2.6 (1.8 )and2.4 (1.8 ) faster, owing to increased group size for each technique, i.e., 10 for GP,and6for temporal× locality:× since× subsequent× lookups access monoto- AMAC and CORO (we describe how to determine these values nically increasing positions in the array, the values access in in Section 5.4.5). We calculate the execution cycles spent one lookup are likely to be cache hits in later lookups. This on front-end, memory or resource stalls, wasted due to bad additional temporal locality benefits also GP, AMAC and CORO speculation, or retired normally (as specified by TMAM )by up to 2.2 (1.4 ), 1.9 (1.3 )and1.9 (1.3 ) respectively. multiplying the respective percentages reported by VTune Still, sorting× does× not× affect× spatial locality:× × if the lookup with the measured cycles per search.

Figure 5: Execution time breakdown of binary search. Interleaved execution reduces memory stalls significantly. (One panel per implementation: std, Baseline, GP, AMAC, CORO; y-axis: cycles per search ×100; x-axis: array size; components: Front-End, Bad Speculation, Memory, Core, Retiring.)

Figure 6: Breakdown of L1D misses. Interleaved execution hides the latency of data cache misses. (One panel per implementation: std, Baseline, GP, AMAC, CORO; y-axis: loads per search; x-axis: array size; categories: LFB hits, L2 hits, L3 hits, DRAM accesses.)

Owing to the small instruction footprint of our implemen- of prefetch instructions by the interleaving techniques: each tations, the front-end and bad speculation components are prefetch that misses in L1D creates a memory request allo- negligible in all implementations except for std,whichis cating an LFB; the corresponding load either finds the data penalized by bad speculation as explained in Section 2.2. in L1D in case enough instructions are executed between Notably, however, std runs faster than Baseline for arrays the prefetch and the load, or finds the allocated LFB other- larger than 16 MB; this means that speculation, even if it is wise. The instructions GP injects between a prefetch and the bad half the time, is better that waiting for the data to be corresponding load are not enough to effectively hide L1D fetched from the main memory. misses, despite using the best group size (see Section 5.4.5 Compared to std and Baseline, memory stalls are redu- for an explanation); still, they reduce the average miss la- ced in GP, AMAC and CORO.Theyarenegligibleuntil4MB, tency, leading to the observed runtime decreases. Contrary and they start to dominate GP execution from 32 MB; in to GP, AMAC and CORO eliminate most L1D misses for ar- AMAC and GP memory stalls are even fewer, but come with rays up to 32 MB; for larger arrays, the effected L1D misses more resource stalls and normally retiring cycles, as a result seem to be caused by address translation, as we describe in of their instruction overhead which is larger than GP. Section 5.4.3. Takeaway. Interleaving significantly reduces the memory Takeaway.Interleavedexecutionintroducesenoughin- stalls that dominate sequential binary search execution. structions between a prefetch and the corresponding load, decreasing the average memory latency of load instructions. 5.4.2 How does interleaving reduce memory stalls? Memory stalls occur when a load instruction fetches data 5.4.3 How does address translation affect execution? from an address that is not in the L1D cache. In this case, In Section 5.3, we note that runtime increases smoothly the Line Fill Buffers (LFB) are checked to see if there is for string arrays. However, in the measurements for integer a memory request for the same cacheline. If not, a new arrays, we observe runtime jumps when increasing the array memory request is created, an empty LFB is allocated to size from 4 MB to 8 MB, from 16 MB to 32 MB, and with track the status of the request, and the request itself gets every increase beyond 128 MB. Since the memory load ana- forwarded to the L2 cache. If the requested address is not lysis of Section 5.4.2 cannot explain these runtime jumps, in the L2, the request is next forwarded to the L3 cache we monitor and analyze the address translation behavior. (also called last level cache, LLC ), which is shared among Profiling shows most loads hit in the DTLB, the first-level the cores in a socket for the Intel processor used in our expe- translation look-aside buffer for data addresses. However, riment. Finally, if the address is not in the L3,therequest DTLB misses can hit in the STLB, the second-level TLB goes to the memory controller and subsequently to the main for both code and data addresses; or perform a page walk to memory (DRAM ). Depending on the level where the reque- find the address mapping in the page tables. 
In the latter sted address is found, we categorize a load as a L1D hit, a case, the appropriate page tables can be located in any level LFB hit, a L2 hit, a L3 hit, or a DRAM access. of the memory hierarchy—we denote the page walks that In Figure 6, we depict a breakdown of the load instructi- hit L1D, L2, L3 and DRAM as PW-L1, PW-L2, PW-L3 ons per implementation and array size, based on the me- and PW-DRAM respectively. mory hierarchy level in which they hit and omitting L1D The aforementioned runtime jumps correspond to para- hits as they do not cause lengthy memory stalls. We ge- meters related to address translation. The first runtime nerally observe that, with interleaved execution, most L1D jump from 4 MB to 8 MB matches the STLB size, and our misses are LFB hits. The reason for this behavior is the use profiling results show PW-L1 hits for larger arrays, while the

239 second one, from 16 MB to 32 MB, corresponds to PW-L2 35 hits. Since the latencies of L1D and L2 are partially hidden Baseline AMAC by out-of-order execution, the two first jumps are small. Ho- 30 GP CORO wever, the PW-L3 hits that cause the third jump cannot be 100) × 25 hidden, so increasing the array size beyond 128 MB incurs the most evident runtime increases. 20 We should note that interleaving works thanks to prefetch 15 instructions. A prefetch does not block the pipeline in case of an L1D miss, and thereby allows subsequent instructions 10 in the instruction stream to execute. Yet, the pipeline is 5 blocked until the prefetched virtual address is translated to Cycles per search ( 0 aphysicalone,possiblyinvolvinglongpagewalks. 123456789101112 Takeaway.Largerarraysizesimplyhigheraddresstrans- Group size lation latency that cannot be hidden with interleaving. Figure 7: The effect of group size on runtime (for 256 MB int array). Best group sizes: 10 for GP,5–6forAMAC, CORO. 5.4.4 Why does GP perform best? The performance difference between the three instruction stream interleaving techniques can be explained by their re- Section 5.4.2), limiting the number of outstanding memory spective instruction overhead: Compared to Baseline,from accesses and, thus, the benefit of interleaving. However, 10 which they are derived, GP, AMAC and CORO execute 1.8 , LFBs suffice for AMAC and CORO,corroboratingourestimates. 4.4 and 5.4 more instructions. These instruction over-× × × We should note that interleaved execution with group heads, also reflected as more retiring cycles in Figure 5, size 1 makes no sense: GP, AMAC and CORO are slower than correspond to the overhead of switching among instruction Baseline due to the overhead of the switching mechanism. streams, which mainly consists of managing state. The non-negligible overhead emphasizes the need for imple- Many binary searches on the same array is a best case sce- mentations that switch only in case of interleaved execution; nario for group prefetching: all instruction streams within otherwise, the switching mechanism should be bypassed. agroupexecutethesamecodeforthesamenumberofite- Finally, similar observations to the ones above can be rations. The instruction streams share the binary search made for the other array sizes and for string arrays (we omit loop, reducing the number of instructions executed and the these measurements due to space limitations). Varying the number of state variables that have to be tracked per in- array size affects the number of cache misses and not the struction stream. As we describe in Listing 3, the tracked Tcompute nor the Tstall per cache miss, whereas Tcompute for state variables include only the searched value and the cur- comparing strings with 15 characters seems to not differ sig- rent low, whereas probe is inexpensively recomputed. In nificantly from integer comparison. addition to these variables, the non-coupling AMAC and CORO Takeaway.Knowingperinstructionstreamtheavailable have to maintain the loop state separately per instruction computation, the memory stalls, and the switch overhead, stream, which means they execute more load and store in- we can assess the effect of an interleaving technique on a structions when switching between instruction streams. group of lookups. 
5.4.5 How to choose the group size?
As already mentioned, the results we present correspond by default to the best group size configurations for each implementation. Given that all lookups execute the same instructions, we can estimate the best group sizes by applying the interleaving model of Section 3 to the profiling measurements. From Baseline, we map memory stalls to Tstall and all other cycles to Tcompute. Further, we compute Tswitch as the difference in retiring cycles between Baseline and each of the three interleaved implementations for group size 1. We apply these parameters to Inequality 1 for a 256 MB int array, yielding GGP ≥ 12 and GAMAC = GCORO ≥ 6.
To verify these results, we run our microbenchmarks with all possible combinations of array and group sizes. In Figure 7, we depict our runtime measurements for a 256 MB int array as a function of the group size, which ranges between 1 and 12 concurrent binary searches (performance varies little for larger group sizes). We observe that the best group sizes are 10 for GP, and 5–6 for AMAC and CORO. For GP, Gestimated differs from Gobserved due to a hardware bottleneck: current Intel architectures have 10 LFBs (see Section 5.4.2), limiting the number of outstanding memory accesses and, thus, the benefit of interleaving. However, 10 LFBs suffice for AMAC and CORO, corroborating our estimates.
Figure 7: The effect of group size on runtime (for a 256 MB int array). Best group sizes: 10 for GP, 5–6 for AMAC and CORO.
We should note that interleaved execution with group size 1 makes no sense: GP, AMAC, and CORO are slower than Baseline due to the overhead of the switching mechanism. The non-negligible overhead emphasizes the need for implementations that switch only in case of interleaved execution; otherwise, the switching mechanism should be bypassed.
Finally, similar observations to the ones above can be made for the other array sizes and for string arrays (we omit these measurements due to space limitations). Varying the array size affects the number of cache misses but not Tcompute nor the Tstall per cache miss, whereas Tcompute for comparing strings of 15 characters does not seem to differ significantly from that of integer comparison.
Takeaway. Knowing, per instruction stream, the available computation, the memory stalls, and the switch overhead, we can assess the effect of an interleaving technique on a group of lookups. Since these parameters are similar for all lookups, Inequality 1 provides a reasonable estimate of the best group size, as long as the hardware supports the necessary memory-level parallelism.
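Because Inequality 1 itself is not restated in this section, the following formulation is only our reading of the model above: interleaving pays off once the combined computation and switch time of the other G - 1 streams covers one stream's memory stall. Under that assumption:

\[
(G - 1)\,(T_{\mathrm{compute}} + T_{\mathrm{switch}}) \;\geq\; T_{\mathrm{stall}}
\quad\Longleftrightarrow\quad
G \;\geq\; 1 + \frac{T_{\mathrm{stall}}}{T_{\mathrm{compute}} + T_{\mathrm{switch}}}
\]

Read this way, the lower switch overhead of GP enlarges its estimated group size (GGP ≥ 12), while the heavier per-stream state of AMAC and CORO yields the smaller bound of 6, consistent with the estimates above.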

5.5 IN-Predicate Queries on SAP HANA
Besides our microbenchmark evaluation, we apply our coroutine proposal in the codebase of SAP HANA. We interleave the execution of dictionary lookups in both Main (binary search) and Delta (CSB+-tree lookup). The Main implementation is straightforward, but the Delta one differs from the CSB+-tree described in Section 4: leaf nodes contain codes instead of values, so comparisons against the values of a leaf node incur accesses to the dictionary array to retrieve the actual values, adding an extra suspension point on the access to the dictionary array.
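To illustrate what such a suspension point can look like, here is a minimal, self-contained sketch in standard C++20. It is not the SAP HANA code: the task type, the flat leaf layout (a plain code array instead of a CSB+-tree), and all names are ours. The lookup suspends once before reading a leaf code and once more before decoding that code through the dictionary array, mirroring the extra suspension point described above.

// Sketch: a lookup coroutine with two suspension points (ours, not HANA's).
#include <coroutine>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <vector>

// Awaitable: prefetch addr, then suspend so that another lookup can run
// while the cache line is being fetched.
struct stall {
  const void* addr;
  bool await_ready() const noexcept { __builtin_prefetch(addr); return false; }
  void await_suspend(std::coroutine_handle<>) const noexcept {}
  void await_resume() const noexcept {}
};

// Manually driven coroutine that eventually produces a dictionary code.
struct lookup_task {
  struct promise_type {
    std::int32_t code = -1;
    lookup_task get_return_object() {
      return lookup_task{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_always initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_value(std::int32_t c) noexcept { code = c; }
    void unhandled_exception() noexcept { std::terminate(); }
  };
  std::coroutine_handle<promise_type> handle;
  explicit lookup_task(std::coroutine_handle<promise_type> h) : handle(h) {}
  lookup_task(lookup_task&& other) noexcept : handle(other.handle) { other.handle = {}; }
  lookup_task(const lookup_task&) = delete;
  ~lookup_task() { if (handle) handle.destroy(); }
};

// One lookup: suspend before reading a (possibly uncached) leaf code, and
// again before decoding that code; codes are assumed to be valid indexes
// into the dictionary array.
lookup_task find_code(const std::vector<std::int32_t>& leaf_codes,
                      const std::vector<std::int32_t>& dictionary,
                      std::int32_t key) {
  for (std::size_t i = 0; i < leaf_codes.size(); ++i) {
    co_await stall{&leaf_codes[i]};       // suspension point 1: leaf access
    const std::int32_t code = leaf_codes[i];
    co_await stall{&dictionary[code]};    // suspension point 2: dictionary decode
    if (dictionary[code] == key) co_return code;
  }
  co_return -1;                           // key not in this leaf
}

// Round-robin driver: resume every unfinished lookup in turn until all done.
void run_interleaved(std::vector<lookup_task>& group) {
  for (bool pending = true; pending; ) {
    pending = false;
    for (auto& t : group)
      if (!t.handle.done()) { t.handle.resume(); pending = true; }
  }
}

A caller would create one find_code task per probe value and hand the group to run_interleaved; whenever one lookup suspends on a prefetched line, the driver resumes the next one, so the misses of the group overlap. In the paper's approach the same coroutine body can also run without interleaving, which this sketch does not attempt to show.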

We evaluate the two implementations in the execution of IN-predicate queries with 1K, 5K, 10K, and 50K predicate values over INTEGER arrays with distinct values and dictionaries whose sizes range from 1 MB to 2 GB. Figure 8 depicts the query response time for the 10K case. In the other cases, lookups account for a smaller part of total execution and the benefit of interleaving is less evident.
Figure 8: IN-predicate queries with 10K INTEGER values run on SAP HANA. Interleaving with coroutines increases performance in both Main and Delta.
For dictionaries larger than 16 MB, interleaving reduces the Main runtime, from 9% at 32 MB to 40% at 2 GB, corroborating the microbenchmark results. On the other hand, the Delta runtime is reduced for all dictionary sizes, from 10% at 1 MB to 30% at 2 GB; this can be explained by the memory stalls Delta exhibits in the 1 MB case, as we see in Section 2.2.
Takeaway. For both Main and Delta, interleaved execution hides the cache misses of dictionary lookups as the dictionary grows larger than the last level cache. As a result, IN-predicate queries become robust to dictionary size.

6. DISCUSSION & FUTURE WORK
In this work, we have proposed a practical way to hide the latency of unavoidable memory accesses in index joins. Here, we discuss other potential applications, hardware features that can increase the efficiency of the suspension/resumption mechanism, as well as the interplay of interleaving with address translation and NUMA effects.
Other targets for interleaving. Having demonstrated interleaving with coroutines on the distinct codepaths of binary search and CSB+-tree traversal, we believe our technique applies to the lookup methods of any pointer-based index. A hash table with bucket lists is such an index, so the probe phases of hash joins that use it are straightforward candidates for our technique; moreover, since Kocberber et al. [15] demonstrate AMAC also on the build phase, and our technique is equivalent to AMAC, interleaving with coroutines applies also to important hash-join operators. In fact, our technique can be employed in any data- or task-parallel operation with a memory stall problem, like sort operators, or operations related to the state management of components such as the transaction manager. However, if the amount of computation and stalls varies among instruction streams, we have to consider the parameters of each instruction stream separately, so we cannot use Inequality 1.
Moreover, coroutines allow instruction streams to progress asynchronously, so, in principle, even different operations on multiple data structures can be interleaved, from simple lookups to whole transactional queries. In the latter case, instruction stream latency might pose a restriction—unlike in a join—while instruction cache misses can increase front-end stalls. Consequently, the potential of interleaving requires further study.
Hardware support for interleaving. In this work, we switch instruction streams at every load that may cause a cache miss, assuming the switch eliminates more memory stalls than it adds instruction overhead. However, we could conditionally switch instruction streams with hardware support in the form of an instruction that tells whether a memory address is cached; with such an instruction, we could avoid suspension, and its overhead, when the data is already cached.
Moreover, most of the instruction overhead comes from storing and restoring the state of each instruction stream. With a processor that supports as many hardware contexts as there are instruction streams in a group, an instruction stream switch would be instant, as it would not require swapping working sets in the registers. This way, interleaving with coroutines that are aware of this support could be as fast as group prefetching.
Interleaving and TLB misses. To circumvent the lack of TLB locality in binary search over large arrays, we can introduce a B+-tree index with page-sized nodes on top of the sorted array. Lookups on this structure traverse the tree nodes, performing binary searches within each of them. Each binary search involves memory accesses within a single page, so the corresponding address translations hit in the TLB most of the time, contrary to the original scheme without the B+-tree, where the binary search thrashes the TLB and incurs expensive page walks. We could also use large or huge pages, but this alternative requires special privileges, manual configuration, or dedicated system calls, requirements inappropriate for general-purpose systems. Nevertheless, both alternatives can be combined with interleaving.
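As a rough sketch of that layout (ours, not an implementation from the paper, and simplified to a single inner level with assumed 4 KiB pages), the inner index keeps only the first key of every page-sized block of the sorted array, so both the search over the separators and the search inside the chosen block stay within page-sized regions:

// Sketch: a one-level, page-sized-node index on top of a sorted array.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kEntriesPerPage = 4096 / sizeof(std::int32_t);  // 4 KiB nodes

struct paged_index {
  const std::int32_t* data;              // the sorted array itself
  std::size_t size;
  std::vector<std::int32_t> first_keys;  // first key of every page-sized block

  paged_index(const std::int32_t* d, std::size_t n) : data(d), size(n) {
    for (std::size_t i = 0; i < n; i += kEntriesPerPage)
      first_keys.push_back(d[i]);        // inner node built once, scanned often
  }

  // Returns the position of key in data, or size if it is absent.
  std::size_t lookup(std::int32_t key) const {
    // Binary search over the block separators (small, hot, TLB-resident).
    std::size_t block =
        std::upper_bound(first_keys.begin(), first_keys.end(), key) -
        first_keys.begin();
    if (block == 0) return size;         // key smaller than every element
    --block;
    // Binary search inside one page-sized block (single page of the array).
    const std::int32_t* lo = data + block * kEntriesPerPage;
    const std::int32_t* hi = data + std::min(size, (block + 1) * kEntriesPerPage);
    const std::int32_t* it = std::lower_bound(lo, hi, key);
    return (it != hi && *it == key) ? static_cast<std::size_t>(it - data) : size;
  }
};

The separator array is small and hot, so its translations stay in the TLB, and the second search touches exactly one page of the large array; interleaving can still be applied on top by prefetching and suspending before each of the two searches.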
Interleaving and NUMA effects. In our experiments, we ensure that the microbenchmarks allocate memory and run on the same socket, to avoid process migration across sockets and remote memory accesses. However, the idea of interleaved execution applies also to cases with remote memory accesses; interleaving could be even more beneficial, assuming there is enough work to hide the increased memory latency. We plan on analyzing interleaved execution in a NUMA context to determine the related bottlenecks.

7. CONCLUSION
Instruction stream interleaving is an effective method to eliminate memory stalls in index lookups and improve index join performance. However, prior techniques required significant effort to develop, test, and maintain a special implementation for interleaved execution alongside the original code, an effort that could prohibit the adoption of interleaved execution in a real database system.
In this work, we proposed interleaving with coroutines, a compiler-based approach for lookup implementations that can be executed both sequentially and interleaved. Leveraging language support, our approach is easy to implement for any kind of index lookup, which we have demonstrated with binary search and CSB+-tree lookups. Finally, interleaving with coroutines exhibits performance similar to carefully optimized implementations of prior interleaving proposals, so we consider our proposal the first practical interleaving technique for real codebases.

8. ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers, as well as Ismail Oukid, Eleni Tzirita-Zacharatou, Lucas Lersch, Frank Tetzel, Manos Karpathiotakis, Iraklis Psaroudakis, Stefan Noll, Georgios Chatzopoulos, and Mirjana Pavlovic, for their valuable feedback on the paper. This work was supported by SAP SE, Walldorf, Germany.

9. REFERENCES
[1] C# Reference. https://msdn.microsoft.com/en-us/library/9k7k7cf0.aspx [Online; accessed 14-August-2017].
[2] Generators. Python Wiki. https://wiki.python.org/moin/Generators [Online; accessed 14-August-2017].
[3] Programming languages — C++ extensions for coroutines.
[4] Transaction Processing Performance Council. TPC-DS benchmark version 2.3.0. http://www.tpc.org/tpcds/ [Online; accessed 14-August-2017].
[5] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a modern processor: Where does time go? In Proc. VLDB, pages 266–277, 1999.
[6] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. ACM Trans. Database Syst., 32(3), 2007.
[7] M. Colgan. Oracle Database In-Memory. Technical report, Oracle Corporation, 2015. http://www.oracle.com/technetwork/database/in-memory/overview/twp-oracle-database-in-memory-2245633.html.
[8] M. E. Conway. Design of a separable transition-diagram compiler. Commun. ACM, 6(7):396–408, 1963.
[9] F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. The SAP HANA Database – an architecture overview. IEEE Data Eng. Bull., 35(1):28–33, 2012.
[10] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, and M. Kersten. MonetDB: Two decades of research in column-oriented database architectures. IEEE Data Eng. Bull., 35(1):40–45, 2012.
[11] Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual, June 2016.
[12] A. Kemper and T. Neumann. HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots. In Proc. ICDE, pages 195–206, 2011.
[13] M. Kerrisk. The Linux Programming Interface: A Linux and UNIX System Programming Handbook.
[14] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In Proc. SIGMOD, pages 339–350, 2010.
[15] O. Kocberber, B. Falsafi, and B. Grot. Asynchronous memory access chaining. PVLDB, 9(4):252–263, 2015.
[16] M. D. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proc. ASPLOS, 1991.
[17] H. Lang, T. Mühlbauer, F. Funke, P. Boncz, T. Neumann, and A. Kemper. Data Blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In Proc. SIGMOD, 2016.
[18] P.-A. Larson, C. Clinciu, C. Fraser, E. N. Hanson, M. Mokhtar, M. Nowakiewicz, V. Papadimos, S. L. Price, S. Rangarajan, R. Rusanu, and M. Saubhasik. Enhancements to SQL Server column stores. In Proc. SIGMOD, pages 1159–1168, 2013.
[19] P.-A. Larson, C. Clinciu, E. N. Hanson, A. Oks, S. L. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL Server column store indexes. In Proc. SIGMOD, pages 1177–1184, 2011.
[20] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: Memory access. VLDB Journal, 9(3):231–246, 2000.
[21] J. Marusarz. Understanding how general exploration works in Intel VTune Amplifier XE, 2015. https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe [Online; accessed 14-August-2017].
[22] T. J. McCabe. A complexity measure. IEEE Trans. Softw. Eng., 2(4):308–320, July 1976.
[23] J. Melton. Advanced SQL 1999: Understanding Object-Relational, and Other Advanced Features. Elsevier Science Inc., 2002.
[24] A. L. D. Moura and R. Ierusalimschy. Revisiting coroutines. ACM Trans. Program. Lang. Syst., 31(2):6:1–6:31, Feb. 2009.
[25] I. Müller, C. Ratsch, and F. Färber. Adaptive string dictionary compression in in-memory column-store database systems. In Proc. EDBT, pages 283–294, 2014.
[26] M. Poess and D. Potapov. Data compression in Oracle. In Proc. VLDB, pages 937–947, 2003.
[27] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. DB2 with BLU Acceleration: So much more than just a column store. PVLDB, 6(11):1080–1091, 2013.
[28] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In Proc. SIGMOD, pages 475–486, 2000.
[29] RethinkDB Team. Improving a large C++ project with coroutines, 2010. https://www.rethinkdb.com/blog/improving-a-large-c-project-with-coroutines/ [Online; accessed 14-August-2016].
[30] M. E. Russinovich, D. A. Solomon, and A. Ionescu. Windows Internals, Part 1: Covering Windows Server 2008 R2 and Windows 7. Microsoft Press, 6th edition, 2012.
[31] J. Zhou, J. Cieslewicz, K. A. Ross, and M. Shah. Improving database performance on simultaneous multithreading processors. In Proc. VLDB, pages 49–60, 2005.
[32] J. Zhou and K. A. Ross. Buffering accesses to memory-resident index structures. In Proc. VLDB, pages 405–416, 2003.
