Learning Multi-dimensional Indexes

Vikram Nathan∗, Jialin Ding∗, Mohammad Alizadeh, Tim Kraska
{vikramn,jialind,alizadeh,kraska}@mit.edu
Massachusetts Institute of Technology

∗Equal contribution.

ABSTRACT

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.

1 INTRODUCTION

Scanning and filtering are the foundation of any analytical database engine, and several advances over the past several years specifically target database scan and filter performance. Most importantly, column stores [7] have been proposed to delay or entirely avoid accessing columns (i.e., attributes) which are not relevant to a query. Similarly, there exist many techniques to skip over records that do not match a query filter. For example, transactional database systems create a clustered B-Tree index on a single attribute, while column stores often sort the data by a single attribute. The idea behind both is the same: if the data is organized according to an attribute that is present in the query filter, the execution engine can either traverse the B-Tree or use binary search, respectively, to quickly narrow its search to the relevant range in that attribute. We refer to both approaches as clustered column indexes.

If data has to be filtered by more than one attribute, secondary indexes can be used. Unfortunately, their large storage overhead and the latency incurred by chasing pointers make them viable only for a rather narrow use case, namely when the predicate on the indexed attribute has a very high selectivity; in most other cases, scanning the entire table can be faster and more space efficient [6]. An alternative approach is to use multi-dimensional indexes to organize the data; these may be tree-based data structures (e.g., k-d trees, R-Trees, or octrees) or a specialized sort order over multiple attributes (e.g., a space-filling curve like Z-ordering or a hand-picked hierarchical sort). Indeed, many state-of-the-art analytical database systems use multi-dimensional indexes or sort orders to improve the scan performance of queries with predicates over several columns. For example, both Redshift [1] and SparkSQL [4] use Z-ordering to lay out the data; Vertica can define a sort order over multiple columns (e.g., first age, then date), while IBM Informix, along with other systems, uses an R-Tree [15].

However, multi-dimensional indexes still have significant drawbacks. First, these techniques are extremely hard to tune. For example, Vertica's ability to sort hierarchically on multiple attributes requires an admin to carefully pick the sort order. The admin must therefore know which columns are accessed together, and their selectivity, to make an informed decision. Second, there is no single approach (even if tuned correctly) that dominates all others. As our experiments will show, the best multi-dimensional index varies depending on the data distribution and query workload. Third, most existing techniques cannot be fully tailored for a specific data distribution and query workload. While all of them provide tunable parameters (e.g., page size), they do not allow finer-grained customization for a specific dataset and filter access pattern.

To address these shortcomings, we propose Flood, the first learned multi-dimensional in-memory index. Flood's goal is to locate records matching a query filter faster than existing indexes, by automatically co-optimizing the data layout and index structure for a particular data and query distribution.

Central to Flood are two key ideas. First, Flood uses a sample query filter workload to learn how often certain dimensions are used, which ones are used together, and which are more selective than others. Based on this information, Flood automatically customizes the entire layout to optimize query performance on the given workload. Second, Flood uses empirical CDF models to project the multi-dimensional and potentially skewed data distribution into a more uniform space. This "flattening" step helps limit the number of points that are searched and is key to achieving good performance.

Flood's learning-based approach to layout optimization is what distinguishes it from other multi-dimensional index structures. It allows Flood to target its performance to a particular query workload, avoid the superlinear growth in index size that plagues some indexes even with uniformly distributed data [9], and locate relevant records quickly without the high traversal times incurred by k-d trees and hyperoctrees, especially for larger range scans.

While Flood's techniques are general and may potentially benefit a wide range of systems, from OLTP in-memory transaction processing systems to disk-based data warehouses, this paper focuses on improving multi-dimensional index performance (i.e., reducing unnecessary scan and filter overhead) for an in-memory column store. In-memory stores are increasingly popular due to lower RAM prices [20] and the increasing amount of main memory which can be put into a single machine [8, 19]. In addition, Flood is optimized for reads (i.e., query speed) at the expense of writes (i.e., incremental index updates), making it most suitable for rather static analytical workloads, though our experiments show that adjusting to a new query workload is relatively fast. We envision that Flood could serve as the building block for a multi-dimensional in-memory key-value store or be integrated into commercial in-memory (offline) analytics accelerators like Oracle's Database In-Memory (DBIM) [34].

The ability to self-optimize allows Flood to outperform alternative state-of-the-art techniques by up to three orders of magnitude, while often having a significantly smaller storage overhead. More importantly though, Flood achieves optimality across the board: it has better, or at least on-par, performance compared to the next-fastest indexing technique on all our datasets and workloads. For example, on a real sales dataset, Flood achieves a boost of 3× over a tuned clustered column index and 72× over Amazon Redshift's Z-encoding method. On a different workload derived from TPC-H, Flood is 61× faster than the clustered column index but only 3× faster than the Z-encoding.

We make the following contributions:

(1) We design and implement Flood, the first learned multi-dimensional index, on an in-memory column store. Flood targets its layout for a particular workload by learning from a sample filter predicate distribution.
(2) We evaluate a wide range of multi-dimensional indexes on one synthetic and three real-world datasets, including one with a workload from an actual sales database at a major analytical database company. Our evaluation shows that Flood outperforms all other index structures.
(3) We show that Flood achieves query speedups on different filter predicates and data sizes, and its index creation time is competitive with existing multi-dimensional indexes.

2 RELATED WORK

There is a rich corpus of work dedicated to multi-dimensional indexes, and many commercial database systems have turned to multi-dimensional indexing schemes. For example, Amazon Redshift organizes points by Z-order [29], which maps multi-dimensional points onto a single dimension for sorting [1, 33, 46]. With spatial dimensions, SQL Server allows Z-ordering [27], and IBM Informix uses an R-Tree [15]. Other multi-dimensional indexes include k-d trees, octrees, R∗-trees, and UB-trees (which also make use of the Z-order), among many others (see [31, 40] for a survey). Flood's underlying index structure is perhaps most similar to Grid Files [30], which have many variants [13, 14, 41]. However, Grid Files do not automatically adjust to the query workload, yielding poorer performance (§7). In fact, Grid Files tend to have superlinear growth in index size even for uniformly distributed data [9].

Flood also differs from other adaptive indexing techniques such as database cracking [16, 17, 37]. The main goal of cracking is to build a query-adaptive incremental index by partitioning the data incrementally with each observed query. However, cracking produces only single-dimensional clustered indexes, and does not jointly optimize the layout over multiple attributes. This limits its usefulness on queries with multi-dimensional filters. Furthermore, cracking does not take the data distribution into account and adapts only to queries; on the other hand, Flood adapts to both the queries and the underlying data.

Arguably most relevant to this work is automatic index selection [3, 25, 44]. However, these approaches mainly focus on creating secondary indexes, whereas Flood optimizes the storage and index itself for a given workload and data distribution.

For aggregation queries, data cubes [11] are an alternative to indexes. However, data cubes alone are insufficient for queries over arbitrary filter ranges, and they cannot support arbitrary actions over the queried records (e.g., returning the records themselves).

Finally, learned models have been used to replace or enhance traditional B-trees [5, 10, 23] and secondary indexes [21, 45]. Self-designing systems use learned cost models to synthesize optimal algorithms, resulting in a continuum of possible designs that form a "periodic table" of data structures [18]. Flood extends these works in two ways. First, Flood learns models for indexing multiple dimensions. Since there is no natural sort order for points in many dimensions, Flood requires a design tailored specifically to multi-dimensional data. Second, prior work focused solely on constructing models of the data, without taking queries into account. Flood optimizes its layout by learning from the query workload as well. Also unlike [18], Flood embeds models into the data structure itself.


[Figure 1: Flood's system architecture.]

[Figure 2: A basic layout in 2D, with dimension order (x, y) and c0 = 5. Points are bucketed into columns along x and then sorted by their y-values, creating the serialization order indicated by the arrows.]

SageDB [22] proposed the idea of a learned multi-dimensional index but did not describe any details.

3 INDEX OVERVIEW

Flood is a multi-dimensional clustered index that speeds up the processing of relational queries that select a range over one or more attributes. For example:

SELECT SUM(R.X)
FROM MyTable
WHERE (a ≤ R.Y ≤ b) AND (c ≤ R.Z ≤ d)

Note that equality predicates of the form R.Z == f can be rewritten as f ≤ R.Z ≤ f. Typical selections generally also include disjunctions (i.e., OR clauses). However, these can be decomposed into multiple queries over disjoint attribute ranges; hence our focus on ANDs.

Flood consists of two parts: (1) an offline preprocessing step that chooses an optimal layout and creates an index based on that layout, and (2) an online component responsible for executing queries as they arrive (see Fig. 1).

At a high level, Flood is a variant of a basic grid index that divides d-dimensional data space into a d-dimensional grid of contiguous cells, so that data in each cell is stored together. We describe Flood's grid layout and online operation in §3.1 and §3.2. We then discuss Flood's central idea: how to automatically optimize the grid layout's parameters for a particular query workload (§4). The rest of this paper uses the terms attribute and dimension interchangeably, as well as the terms record and point.

3.1 Data Layout

Consider an index on d dimensions. Unlike the single-dimensional case, points in multiple dimensions have no natural sort order. Our first goal is then to impose an ordering over the data. We first rank the d attributes. The details of how to choose a ranking are discussed in §4, but for the purposes of illustration, we assume it is given. Next, we use the first d−1 dimensions in the ordering to overlay a (d−1)-dimensional grid on the data, where the ith dimension in the ordering is divided into ci equally spaced columns between its minimum and maximum values. Every point maps to a particular cell in this grid, i.e., a tuple with d−1 attributes. In particular, if Mi and mi are the maximum and minimum values of the data along the ith dimension, then define the dimension's range as ri = Mi − mi + 1. The cell for point p = (p1, ..., pd) is then:

cell(p) = ( ⌊(p1 − m1)/r1 · c1⌋, ..., ⌊(pd−1 − md−1)/rd−1 · cd−1⌋ )

Note that the cell is determined only by the first d−1 dimensions; the dth dimension, the sort dimension, will be used to order points within a cell.

Flood orders the points using a depth-first traversal of the cells along the dimension ordering, i.e., cells are sorted by the first value in the tuple, then the second, etc. Within each cell, points are sorted by their value in the dth dimension. Fig. 2 illustrates the sort order for a dataset with two attributes. Flood then sorts the data by this traversal. In other words, points in cell 0 (sorted by their sort dimension) come first, followed by cell 1, etc. Ties are broken arbitrarily.
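To make the mapping concrete, here is a small sketch of the layout step (our illustration, not the authors' implementation; the helper names and the toy dataset are hypothetical), assuming integer-valued attributes:

```python
import numpy as np

def make_layout(data, num_cols):
    """Grid metadata over the first d-1 dimensions; the d-th is the sort dimension.
    data: (n, d) integer array; num_cols: the c_i for dimensions 0..d-2."""
    mins = data.min(axis=0)
    ranges = data.max(axis=0) - mins + 1           # r_i = M_i - m_i + 1
    return mins, ranges, np.asarray(num_cols)

def cell(p, mins, ranges, num_cols):
    """cell(p): floor((p_i - m_i) / r_i * c_i) over the first d-1 dimensions."""
    return tuple(int((p[i] - mins[i]) * num_cols[i] // ranges[i])
                 for i in range(len(p) - 1))

def serialize(data, layout):
    """Order points by cell (depth-first over the grid), then by the sort dimension."""
    keys = np.array([cell(p, *layout) + (p[-1],) for p in data])
    return data[np.lexsort(keys.T[::-1])]          # last key in lexsort is primary

# Example matching Fig. 2: dimension order (x, y), c_0 = 5 columns along x.
data = np.array([[3, 9], [17, 2], [4, 1], [12, 7], [19, 5]])
layout = make_layout(data, num_cols=[5])
print([cell(p, *layout) for p in data])            # grid column of each point
print(serialize(data, layout))                     # storage order
```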
3.2 Basic Operation

Flood receives as input a filter predicate consisting of ranges over one or more attributes, joined by ANDs. The intersection of these ranges defines a hyper-rectangle, and Flood's goal is to find and process exactly the points within this hyper-rectangle (e.g., by aggregating them). At a high level, Flood executes the following workflow (Fig. 3):

(1) Projection: Identify the cells in the grid layout that intersect with the predicate's hyper-rectangle. For each such cell, identify the range of positions in storage, i.e., the physical index range, that contains that cell's points (§3.2.1).
(2) Refinement: If applicable, take advantage of the ordering of points within each cell to shorten (or refine) each physical index range that must be scanned (§3.2.2).
(3) Scan: For each refined physical index range, scan and process the records that match the filter.
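Continuing the sketch above (again our illustration, reusing cell() and the layout tuple from the previous block; `cell_table` is assumed to map each cell to its physical [start, end) range, and unfiltered dimensions are assumed clamped to the data's min/max), the three steps look roughly like this:

```python
from bisect import bisect_left, bisect_right
from itertools import product

def run_query(data, cell_table, layout, lo, hi):
    """Count points p with lo[i] <= p[i] <= hi[i] for all dimensions i."""
    result = 0
    # (1) Projection: intersecting cells lie between cell(lo) and cell(hi)
    # in every grid dimension; the cell table yields their physical ranges.
    c_lo, c_hi = cell(lo, *layout), cell(hi, *layout)
    for c in product(*(range(a, b + 1) for a, b in zip(c_lo, c_hi))):
        start, end = cell_table[c]
        # (2) Refinement: within a cell, points are sorted by the sort (last)
        # dimension, so binary search shortens the range to scan.
        sort_vals = [p[-1] for p in data[start:end]]
        i1 = start + bisect_left(sort_vals, lo[-1])
        i2 = start + bisect_right(sort_vals, hi[-1])
        # (3) Scan: check the remaining (grid) dimensions for each point.
        for p in data[i1:i2]:
            if all(lo[i] <= p[i] <= hi[i] for i in range(len(p) - 1)):
                result += 1
    return result
```

In Flood, the refinement step is served by a learned per-cell CDF model rather than a plain binary search (§5.2, §7.8).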

[Figure 3: Basic flow of Flood's operation: (1a) Projection finds the intersecting cells (here, cells 1–4); (1b) identify the physical index range of each intersecting cell; (2) refine the physical index range; (3) scan and filter.]

[Figure 4: Doubling the number of columns can increase the number of visited cells but decreases the number of scanned points that don't match the filter.]

3.2.1 Projection. In order to determine which points match a filter, Flood first determines which cells contain the matching points. Since the query defines a "hyper-rectangle" in the (d−1)-dimensional grid, computing intersections is straightforward. Suppose that each filter in the query is a range of the form [qs_i, qe_i] for each indexed dimension i. If an indexed dimension is not present in the query, we simply take the start and end points of the range to be −∞ and +∞, respectively. Conversely, if the query includes a dimension not in the index, that filter is ignored at this stage of query processing.

The "lower-left" corner of the hyper-rectangle is qs = (qs_0, ..., qs_{d−1}), and likewise for the "upper-right" corner qe. Both are shown in Fig. 3. Then, we define the set of intersecting cells as {Ci | cell(qs)_i ≤ Ci ≤ cell(qe)_i}. Flood keeps a cell table which records the physical index of the first point in each cell. Knowing the intersecting cells then easily translates to a set of physical index ranges to scan.

3.2.2 Refinement. When the query includes a filter over the sort dimension, Flood uses the fact that points in each cell are ordered by the sort dimension to further refine the physical index ranges to scan. In particular, suppose the query includes a filter over the sort dimension R.S of the form a ≤ R.S ≤ b. For each cell, Flood finds the physical indices of both the first point I1 having R.S ≥ a and the last point I2 such that R.S ≤ b. This narrows the physical index range for that cell down to [I1, I2]. The simplest way to find [I1, I2] is by performing binary search within the cell C on the values in the sort dimension. This is possible only because the points in C are stored contiguously.

4 LEARNING THE INDEX LAYOUT

Increasing the number of columns in a grid dimension decreases the number of scanned points that do not match the filter (Fig. 4). However, adding more columns also increases the number of sub-ranges, which incurs extra cost for projection and refinement. Striking the right balance requires choosing a layout with an optimal number of columns in each dimension.

Flood can also select the sort dimension. The sort dimension is special because it will incur no scan overhead; given a query, Flood finds the precise sub-ranges to scan in the refinement step, so that the values in the sort dimension for scanned points are guaranteed to lie in the desired range. On the other hand, the grid dimensions do incur scan overhead because a certain column might only lie partially within the query rectangle. Therefore, the choice of sort dimension can have a significant impact on performance.

It is hard to select the optimal number of columns in each dimension because it depends on many interacting factors, including the frequency of queries filtering on that dimension, the average and variance of filter selectivities on that dimension, and correlations with other dimensions in both the data and query workload. The optimal sort dimension is also hard to select for similar reasons. Therefore, we optimize layout parameters using a cost model based approach. We first describe the cost model, then present the procedure that Flood uses to optimize the layout.

4.1 Cost Model

Define a layout over d dimensions as L = (O, {ci}0≤i<d−1), where O is the ordering of the dimensions and ci is the number of columns in the ith grid dimension. Our model decomposes the time to run a query q on a dataset D under layout L into the cost of each workflow step:

(1) Projection contributes wp Nc to the query time, where Nc is the number of cells intersecting the query and wp is the average time to locate each cell's physical index range. It is cheaper to locate ranges of cells that are adjacent on linear storage media than a hypercube of cells along multiple grid dimensions which are non-adjacent.

(2) Refinement contributes wr Nc to the query time, where wr is the average time to perform refinement on a cell. If the query q does not filter on the sort dimension, refinement is skipped and wr is zero. Also, wr is lower if the cell is smaller, because the piecewise linear CDF for that cell (explained in §5.2) is likely less complex and makes predictions more quickly.

(3) Scan contributes ws Ns to the query time, where Ns is the number of scanned points, and ws is the average time to perform each scan. The weight ws depends on the number of dimensions filtered (fewer dimensions means fewer lookups for each scanned point), the run length of the scan (longer runs have better locality), and how many scans fall within exact sub-ranges (explained in §7.1).

Putting everything together, our model for query time is:

Time(D, q, L) = wp Nc + wr Nc + ws Ns    (1)

Given a dataset D and a workload of queries {qi}, we find the layout L that minimizes the average of Eq. 1 for all q ∈ {qi}.

[Figure 5: ws (and by extension, scan time) is not constant and is difficult to model analytically because of its non-linear dependence on related features (number of scanned points and average scan run length).]

4.1.1 Calibrating the Cost Model Weights. Since the weight parameters w = {wp, wr, ws} vary based on the data, query, and layout, Flood uses models to predict w. The features of these weight models are statistics that can be measured when running the query on a dataset with a certain layout. These statistics include N = {Nc, Ns}, the total number of cells, the average, median, and tail quantiles of the sizes of the filterable cells, the number of dimensions filtered by the query, the average number of visited points in each cell, and the number of points visited in exact sub-ranges.

As we show in §7.7, the weight models are accurate across different datasets and query workloads. In particular, when new data arrives or the query distribution changes, Flood needs only to evaluate the existing models, instead of training new ones. Flood therefore trains the weight models once to calibrate to the underlying hardware. To produce training examples, Flood uses an arbitrary dataset and query workload, which can be synthetic. Flood generates random layouts by randomly selecting an ordering of the d dimensions, then randomly selecting the number of columns in the grid dimensions to achieve a random target number of total cells. Flood then runs the query workload on each layout, and measures the weights w and the aforementioned statistics for each query. Each query on each random layout produces a single training example. In our evaluation, we found that 10 random layouts produce a sufficient number of training examples to create accurate models. Flood then trains a random forest regression model to predict the weights based on the statistics.

One natural question to ask is whether a single random forest model can be trained to predict query time, instead of factoring the query time as weighted linear terms and training a model for each weight. However, a single model is inadequate because we want to accurately predict query times across a range of magnitudes; a model for query time would optimize for accuracy on slow queries to the detriment of fast queries. On the other hand, the weights span a relatively narrow range (e.g., the average time to scan a point will not vary across orders of magnitude), so they are more amenable to our goal.

4.1.2 Why Use Machine Learning? We model the cost using machine learning because query time is a function of many interdependent variables with potentially non-linear relationships that are difficult to model analytically. For instance, on 10k training examples, Fig. 5 shows not only that the empirical average time to scan a point (ws) is not constant, but also that its dependence on two related features (number of scanned points and average scan run length, which affects locality) is non-linear and does not follow an obvious pattern.

Indeed, we found that query time predicted using a simple analytical model that replaces the weight parameters of Eq. 1 with fine-tuned constants has on average a 9× larger difference from the true query time than our machine-learning based cost model. Furthermore, predicting the weight parameters using a linear regression model that uses the same input features as our random forest produces query time predictions with a 4× larger difference from the true query time, which confirms that the features are interdependent and/or have a non-linear relation with query time.
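As a rough sketch of how the calibrated weight models plug into Eq. 1 (ours, not the authors' code; the paper trains a random forest regression, for which we substitute scikit-learn here, and the feature values below are synthetic stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Calibration: each row of X holds the per-query statistics described above
# (number of cells, cell-size quantiles, dimensions filtered, ...), measured
# while replaying a workload on random layouts; y holds the measured weights.
X = rng.random((10_000, 6))
y = np.column_stack([0.5 + X[:, 0],                 # w_p (synthetic stand-in)
                     0.2 * X[:, 1] ** 2,            # w_r
                     1.0 + 0.1 * X[:, 2]])          # w_s
weight_model = RandomForestRegressor(n_estimators=50).fit(X, y)

def predicted_time(stats, n_cells, n_scanned):
    """Eq. 1: Time(D, q, L) = w_p*N_c + w_r*N_c + w_s*N_s, with learned weights."""
    w_p, w_r, w_s = weight_model.predict(stats.reshape(1, -1))[0]
    return (w_p + w_r) * n_cells + w_s * n_scanned

# A candidate layout is scored by the average predicted time over the workload.
print(predicted_time(rng.random(6), n_cells=200, n_scanned=5_000))
```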
4.2 Optimizing the Layout

Given a calibrated cost model, Flood optimizes its layout for a specific dataset and query workload as follows (pseudocode is provided in Appendix B):

(1) Sample the dataset and query workload, then flatten the data sample and workload sample using RMIs trained on each dimension (a sketch of this flattening step follows the list).
(2) Iteratively select each of the d dimensions to be the sort dimension. Order the remaining d−1 dimensions that form the grid by the average selectivity on that dimension across all queries in the workload. This gives us O.

(3) For each of these d possible orderings, run a gradient descent search algorithm to find the number of columns {ci}0≤i<d−1 that minimizes the average query time predicted by the cost model.
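Step (1)'s flattening can be illustrated with a plain empirical CDF per dimension (Flood instead uses learned RMI models; this simplified stand-in is ours):

```python
import numpy as np

def fit_cdf(column):
    """A sorted sample acts as an empirical CDF of one dimension."""
    return np.sort(column)

def flatten(values, sorted_sample):
    """Map values to their approximate quantiles in [0, 1]. The flattened
    dimension is roughly uniform, so equally spaced grid columns receive
    roughly equal numbers of points even under heavy skew."""
    return np.searchsorted(sorted_sample, values, side="right") / len(sorted_sample)

# A heavily skewed dimension: equal-width columns on raw values would be
# badly unbalanced, but after flattening each of 5 columns gets ~20% of points.
col = np.random.default_rng(1).exponential(size=100_000)
print(np.histogram(flatten(col, fit_cdf(col)), bins=5)[0])
```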

Our implementation uses 64-bit integer-valued attributes. Any string values are dictionary encoded prior to evaluation. Floating point values are typically limited to a fixed number of decimal points (e.g., 2 for price values). We scale all values by the smallest power of 10 that converts them to integers. Our column store implementation has two other optimizations to improve scanning times:

(1) If the range of data being scanned is exact, i.e., we are guaranteed ahead of time that all elements within the range match the query filter, we skip checking each value against the query filter. For common aggregations, e.g., COUNT, this removes unnecessary accesses to the underlying data.
(2) Similar to the idea of [24], our implementation allows indexes to speed up common aggregations like SUM by including a column in which the ith value is the cumulative aggregation of all elements up to index i. In the case of an exact range, the final aggregation result is simply the difference between the cumulative aggregations at the range endpoints (see the prefix-sum sketch at the end of this subsection). Note that this is not a data cube, as we can support arbitrary ranges instead of only pre-aggregated ranges.

These additions are meant to demonstrate that Flood can take advantage of features that existing indexes enjoy. We show in §7.5 that these optimizations are not required for Flood: its performance benefits are due primarily to the optimality of the layout and not the details of the underlying implementation. Our random forest regression uses Python's Scipy library [38].

To demonstrate that our column store implementation is comparable to the existing state of the art, we benchmark our column store against MonetDB, an open-source column store [28], by executing a query workload with full scans. Both MonetDB and our implementation were run single-threaded, with identical bit widths for each attribute, and without compression. Note that MonetDB does not support compression on numerical columns. Averaged over 150 aggregation queries on the TPC-H dataset (§7.3), our scan times are within 5% of MonetDB, showing that our column store implementation is on par with existing systems.
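The running-aggregate column in optimization (2) is a prefix-sum trick; a minimal sketch (ours):

```python
import numpy as np

values = np.array([5, 3, 8, 1, 9, 2, 7])
cum = np.concatenate(([0], np.cumsum(values)))   # cum[i] = sum of values[:i]

def exact_range_sum(start, end):
    """SUM over an exact physical index range [start, end): O(1), no scan."""
    return cum[end] - cum[start]

assert exact_range_sum(2, 5) == values[2:5].sum()   # 8 + 1 + 9 = 18
```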
7.2 Baselines

We compare Flood to several other approaches, each implemented on the same column store and using the same optimizations where applicable:

(1) Full Scan: Every point is visited, but only the columns present in the query filter are accessed.
(2) Clustered Single-Dimensional Index: Points are sorted by the most selective dimension in the query workload, and we learn a B-Tree over this sorted column using an RMI [23]. If a query filter contains this dimension, we locate the endpoints using the RMI. Otherwise, we perform a full scan. Since a clustered index spends the vast majority of its time scanning instead of indexing, an RMI-based approach performs comparably to a standard B-tree: their total query times are within 1% of each other. Therefore, we show results only for the RMI-based index.
(3) Grid Files [30] index points by assigning them to cells in a grid, similar to Flood. However, unlike Flood, Grid File columns are determined incrementally and do not optimize for a query workload. Additionally, points in multiple adjacent cells may be stored together in the same bucket and are not sorted. We found that the Grid Files algorithm in [30] requires a long time to construct on heavily skewed data, so we omit results when it took over an hour.
(4) The Z-Order Index is a multi-dimensional index that orders points by their Z-value [9]; contiguous chunks are grouped into pages. Given a query, the index finds the smallest and largest Z-value contained in the query rectangle and iterates through each page with Z-values in this range (a bit-interleaving sketch appears at the end of this subsection).
(5) The UB-tree [36] also indexes points using their Z-values. A query finds the range of points to scan in the same manner as the Z-Order Index. The UB-tree has the ability to skip forward to the next Z-value contained in the query rectangle, which avoids unnecessary scans.
(6) The Hyperoctree [26] recursively subdivides space equally into hyperoctants (the d-dimensional analog of 2-dimensional quadrants), until the number of points in each leaf is below a predefined but tunable page size.
(7) The k-d tree recursively partitions space using the median value along each dimension, until the number of points in each leaf falls below the page size. The dimensions are selected in a round-robin fashion, in order of selectivity.
(8) The R∗-Tree is a read-optimized variant of the R-Tree that is bulk loaded to optimize for read query performance. We benchmark the R∗-Tree implementation from libspatialindex [12]. On larger datasets, the R∗-Tree was prone to out-of-memory errors and was not included in benchmarks.

While some techniques are entirely read-optimized like Flood (e.g., the Z-Order and R∗-Tree), others are inherently more write-friendly (e.g., the UB-tree); we still include them for the sake of comparison while optimizing them for reads as much as possible (e.g., using dense cache-aligned pages). Additional implementation details can be found in Appendix A.

Our primary goal is to evaluate the performance of our multi-dimensional index as a fundamental building block for improving range requests with predicates (e.g., filters) over one or more attributes. We therefore do not evaluate against other full-fledged database systems or queries with joins, group-bys, or other complex query operators. While the impact of our multi-dimensional clustered index on a full query workload for an in-memory column-store database system would be interesting, it requires major changes to any available open-source column store and is beyond the scope of this paper. However, it should be noted that even in its current form, Flood could be directly used as a component to build useful services, such as a multi-dimensional key-value store.
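For reference, the Z-value used by baselines (4) and (5) interleaves the bits of the attribute values (a Morton code); a minimal sketch (ours):

```python
def z_value(point, bits=16):
    """Interleave coordinate bits: bit j of dimension i maps to bit j*d + i."""
    z, d = 0, len(point)
    for j in range(bits):
        for i, x in enumerate(point):
            z |= ((x >> j) & 1) << (j * d + i)
    return z

# Sorting by z_value yields the Z-ordering these baselines cluster by.
points = [(3, 5), (2, 2), (7, 1)]
print(sorted(points, key=z_value))
```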

Table 1: Dataset and query characteristics.

                sales   tpc-h   osm    perfmon
    records     30M     300M    105M   230M
    queries     1000    700     1000   800
    dimensions  6       7       6      6
    size (GB)   1.44    16.8    5.04   11

For a fair comparison, all benchmarks are implemented using a single thread without SIMD instructions. We excluded multiple threads mainly because our baselines were not optimized for them. We discuss how Flood could take advantage of parallelism and concurrency in §8. All experiments are run on an Ubuntu Linux machine with an Intel Core i9 3.6GHz CPU and 64GB RAM.

7.3 Datasets

We evaluate indexes on three real-world datasets and one synthetic dataset, summarized in Tab. 1. Queries are either real workloads or synthesized for each dataset, and include a mix of range filters and equality filters. The Sales dataset is a 6-attribute dataset and corresponding query workload drawn directly from a sales database at a commercial technology company. It was donated to us by a large corporation on the condition of anonymity. The dataset consists of 30 million records, with an anonymizing transformation applied to each dimension. Each query in this workload was submitted by an analyst as part of report generation and analysis at the corporation.

Our second real-world dataset, OSM, consists of all 105 million records catalogued by the OpenStreetMap [32] project in the US Northeast. All elements contain 6 attributes, including an ID and timestamp, and 90% of the records contain GPS coordinates. Our queries answer relevant analytics questions, such as "How many nodes were added to the database in a particular time interval?" and "How many buildings are in a given lat-lon rectangle?" Queries use between 1 and 3 dimensions, with range filters on timestamp, latitude, and longitude, and equality filters on type of record and landmark category. Each query is scaled so that the average selectivity is 0.1%±0.013%.

The performance monitoring dataset Perfmon contains logs of all machines managed by a major US university over the course of a year. Our queries include filters over time, machine name, CPU usage, memory usage, swap usage, and load average. The data in each dimension is non-uniform and often highly skewed. The original dataset has 23M records, but we use a scaled dataset with 230M records.

Our last dataset is TPC-H [42]. For our evaluation, we use only the fact table, lineitem, with 300M records (scale factor 50) and create queries by using filters commonly found in the TPC-H query workload, with filter ranges scaled so that the average query selectivity is 0.1%. Our queries include filters over ship date, receipt date, quantity, discount, order key, and supplier key, and either perform a SUM or COUNT aggregation.

For each dataset, we generate a train and test query workload from the same distribution. Flood's layout is optimized on the training set, and we only report results on the test set.

7.4 Results

Overall Performance. We first benchmark how well Flood can optimize for a query workload compared to baseline indexes that are also optimized for the same query workload. Fig. 7 shows the query time for each optimized index on each dataset. Flood uses the layout learned using the algorithm in §4, while we tuned the baseline approaches as much as possible per workload (e.g., ordered dimensions by selectivity and tuned the page sizes). This represents the best-case scenario for the other indexes: that the database administrator had the time and ability to tune the index parameters.

On three of the datasets, Flood achieves between a 2.4× and 3.3× speedup in query time compared to the next closest index, and is always at least on par, thus achieving the best performance across the board. However, the next best system changes across datasets. Thus, depending on the dataset and workload, Flood can outperform each baseline by orders of magnitude. For example, on the real-world sales dataset, Flood is at least 43× faster than each multi-dimensional index, but only 3× faster than a clustered index. However, on the TPC-H dataset, Flood is 187× faster than a clustered index.

On every dataset, Fig. 8 shows that Flood beats the Pareto frontier set by the other multi-dimensional indexes. In particular, even though Flood's performance on OSM is on par with the hyperoctree, its index size is more than 20× smaller. The hyperoctree thus has to spend much more memory for its performance than Flood.
Flood’s space overhead comes ticular time interval?” and “How many buildings are in a given partially from the grid layout metadata, but mostly (over 95%) lat-lon rectangle?” Queries use between 1 and 3 dimensions, from the models of the sort attribute it maintains per cell. with range filters on timestamp, latitude, and longtitude, and Different Workload Characteristics. In practical settings, equality filters on type of record and landmark category. Each it is unlikely that a database administrator will be able to man- query is scaled so that the average selectivity is 0.1%±0.013%. ually tune the index for every workload change. The ability The performance monitoring dataset Perfmon contains of Flood to automatically configure its index for the current logs of all machines managed by a major US university over query workload is thus a significant advantage. We measure the course of a year. Our queries include filters over time, this advantage by tuning all indexes for the workloads in Fig. 7, machine name, CPU usage, memory usage, swap usage, and and then changing the query workload characteristics to: load average. The data in each dimension is non-uniform and (1) Single record filters, i.e. point lookups, using one ortwo often highly skewed. The original dataset has 23M records, ID attributes, as commonly found in OLTP systems. but we use a scaled dataset with 230M records. (2) An OLAP workload, similar to the ones in Fig. 7, that Our last dataset is TPC-H [42]. For our evaluation, we use answer reasonable business questions about the underly- only the fact table, lineitem, with 300M records (scale factor ing dataset. Some types of queries occur more often than 50) and create queries by using filters commonly found in the others, skewing the workload. TPC-H query workload, with filter ranges scaled so that the (3) An OLAP workload where each query type is equally average query selectivity is 0.1%. Our queries include filters likely. 9 SIGMOD’20, June 14-19, 2020, Portland, OR, USA Vikram Nathan∗, Jialin Ding∗, Mohammad Alizadeh, Tim Kraska

[Figure 7: Query speed of Flood on all datasets (Sales, TPC-H, OSM, Perfmon). Flood's index is trained automatically, while every other index is manually optimized for each workload to achieve the best performance. We excluded the R-tree for cases in which it ran out of memory. Note the log scale.]

[Figure 8: Average query time vs. index size on each dataset. Flood sees faster performance with a smaller index, pushing the Pareto frontier. Note the log scale.]

(4) An equal split of workloads (1) and (2), i.e., combined OLTP and OLAP queries.
(5) A workload with a single type of query, using the same dimensions with the same selectivities.
(6) A workload with fewer dimensions (a strict subset) than indexed by the baseline indexes.

[Figure 9: Flood and other indexes on TPC-H and OSM workloads that have: fewer dimensions than the index (FD), as many dimensions as the index (MD), a skewed OLAP workload (O), a uniform OLAP workload (Ou), an OLTP workload over a single primary key (i.e., point lookups) (O1) and two keys (O2), a mixed OLTP + OLAP workload (OO), and a single query type (ST). Note the log scale.]

Fig. 9 shows the potential advantages Flood can achieve over more static alternatives. Flood consistently beats other indexes, though the magnitude of improvement depends on the dataset and query workload. For example, on TPC-H, Flood achieves a speedup of more than 20× on half the workloads, while on OSM, the median improvement is 2.2×.

Dynamic Query Workload Changes. Here, we demonstrate how the performance of Flood varies over several random workloads, when the administrator does not tune the other indexes. We created 30 random workloads for the TPC-H dataset. Each workload runs for one hour and consists of at most 10 distinct query types, and each query type in turn consists of up to 6 dimensions, both chosen uniformly at random. The selectivities of each dimension are chosen randomly, with the constraint that all queries have an average total selectivity of around 0.1% and are more selective on key attributes.

Fig. 10 shows the results over time, with Flood being the only index that changes from one hour to the next (all others were kept fixed and tuned for the workload in Fig. 7). At the start of each hour, a new query workload is introduced, and we trigger Flood's retraining. During the retraining phase, which we assume happens on a separate instance, Flood runs the new queries on its old layout, causing brief performance degradation and producing a spike at the start of each hour. It only switches to the new, more performant layout once retraining is finished.


[Figure 10: Flood vs. other indexes on 30 random TPC-H query workloads, each for one hour. At the start of each hour, Flood's performance degrades, since it is not trained for the new workload; however, it recovers in 5 minutes on average once the layout is re-learned, and beats the next best index by 5× at the median. Note the log scale.]

Table 2: Performance breakdown: scan overhead (SO), i.e., the ratio between points scanned and result size; average time spent scanning per scanned point, in nanoseconds (TPS); average time spent scanning, in milliseconds (ST); average time spent indexing (for Flood this includes projection and refinement), in milliseconds (IT); total query time, in milliseconds (TT). SO×TPS is proportional to ST, and ST+IT+ϵ=TT (a small fraction of query time is spent neither scanning nor indexing). R∗-tree omitted because instrumentation for collecting statistics was inadequate in [12].

sales:
                 SO      TPS    ST       IT        TT
    Full Scan    644     3.09   92.6     0         92.8
    Clustered    3.18    3.09   0.462    6.76e-4   0.463
    Z Order      57.9    4.00   10.9     0.0161    10.9
    UB tree      55.7    14.5   38.0     0.0175    38.1
    Hyperoctree  38.8    3.34   6.11     0.353     6.46
    K-d tree     38.2    3.40   6.13     1.21      7.34
    Grid File    37.4    4.53   7.99     0.0594    7.99
    Flood        1.82    1.26   0.108    0.0182    0.128

tpc-h:
                 SO      TPS    ST       IT        TT
    Full Scan    965     5.27   1580     0         1620
    Clustered    447     4.71   655      7.15e-4   662
    Z Order      14.9    7.63   34.8     0.0267    34.8
    UB tree      15.3    16.1   75.2     0.0284    75.3
    Hyperoctree  20.8    4.38   27.8     1.77      29.6
    K-d tree     36.4    4.26   47.3     8.85      56.2
    Grid File    36.9    5.28   59.5     1.88      61.5
    Flood        5.90    5.53   9.96     2.02      12.0

osm:
                 SO      TPS    ST       IT        TT
    Full Scan    1090    3.83   403      0         406
    Clustered    478     4.50   207      8.92e-4   208
    Z Order      6.85    8.37   5.5      0.0164    5.52
    UB tree      22.5    31.3   67.5     0.0171    67.6
    Hyperoctree  2.36    3.59   0.812    0.253     1.07
    K-d tree     6.60    3.51   2.22     0.611     2.84
    Grid File    N/A     N/A    N/A      N/A       N/A
    Flood        3.13    2.39   0.717    0.328     1.05

perfmon:
                 SO      TPS    ST       IT        TT
    Full Scan    990     3.52   833      0         843
    Clustered    186     3.32   144      1.20e-3   144
    Z Order      9.08    4.42   9.64     0.0146    9.66
    UB tree      38.8    21.9   204      0.0120    204
    Hyperoctree  33.8    3.47   28.2     13.4      41.7
    K-d tree     15.7    3.07   11.6     2.51      14.1
    Grid File    N/A     N/A    N/A      N/A       N/A
    Flood        4.26    2.77   2.84     0.327     3.17

[Figure 11: Average query time of a simple grid, then incrementally adding the sort dimension, flattening, and learning, on each dataset. Flattening and learning help Flood achieve low query times, but their benefit is workload dependent.]

Flood outperforms all other indexes, showing a median improvement of more than 5× over the closest competitor, with 30% of queries achieving more than a 10× speedup. The results suggest that Flood is able to generalize well by adapting to new and unforeseen workloads.

Fig. 10 also highlights the importance of learning from the query workload. When transitioning to the next query workload, Flood's performance often worsens, since the current layout is usually not suitable for the new workload. Re-learning the layout based on the new workload lowers query time back below that of the other indexes. Learning a layout is therefore (a) effective at adapting to new query workloads and (b) crucial to Flood's performance improvement over other indexes. We leave the detection of workload changes to future work (§8).

While the results are encouraging, it is also important to consider the time it takes to adjust to a new query workload. Flood takes at most around 1 minute to adapt to a new query workload, but it more than makes up for this adjustment period through improved performance on the subsequent workload. We evaluate index creation time in further detail in §7.7.

Performance Breakdown. Where does Flood's advantage over baseline indexes come from? We look at the scan overhead, the ratio of total points scanned by the index to points matching the query. The scan overhead is implementation agnostic: it relies neither on the machine nor on the implementation of the underlying column store. A high scan overhead suggests that the index wastes time scanning unnecessary points. Since all indexes spend the vast majority of their time scanning, the scan overhead is a good proxy for overall query performance.

Tab. 2 shows that Flood achieves the lowest scan overhead (SO) on three out of four datasets, which confirms that Flood's optimized layout is able to better isolate records that match a query filter. Additionally, Flood usually spends less time per scanned point (TPS) because Flood avoids accessing the sort dimension.

[Figure 12: TPC-H: query time when varying dataset size (a) and query selectivity (b). Flood's performance scales both with dataset size and query selectivity. The dashed blue line depicts what linear scaling would look like.]

[Figure 13: Synthetic data, scaling the number of dimensions. (a) Query time as the number of dataset dimensions varies. (b) Ratio of each index's query time to the time for a full scan.]

As a result, Flood consistently achieves the lowest scan time (ST), which is proportional to the product of SO and TPS. This more than makes up for Flood's higher index time (IT), which includes the time to project and refine. Indexes based on Z-order incur the cost of computing Z-values and thus have a higher time per scanned point. Tree-based indexes have the highest index time due to the overhead of tree traversal.

Which of Flood's components is responsible for its performance? Fig. 11 shows the incremental benefit of (1) sorting the last indexed dimension instead of creating a d-dimensional histogram, (2) flattening the data instead of using columns of fixed width, and (3) adapting to the query workload using the training procedure from §4. The baseline system is a "Simple Grid" on all d dimensions, with the number of columns in each dimension proportional to that dimension's selectivity. Sorting by the last dimension offers marginal benefits: it allows more columns to be allocated to the first d−1 dimensions, increasing the resolution of the index along those dimensions without increasing the total number of cells. The biggest improvements come from flattening and learning the layout from the queries. However, the effect of each varies across datasets. Flattening benefits OSM and Perfmon, since both datasets have heavily skewed attributes. Since Sales and TPC-H data are fairly uniform, using a non-flattened layout performs equally well. Finally, learning from queries provides major performance gains on all datasets, corroborating the results from Fig. 10.

7.5 Scalability

Dataset Size. To show how Flood scales with dataset size, we sample records from the TPC-H dataset to create smaller datasets. We train and evaluate these smaller datasets with the same train and test workloads as the full dataset. Fig. 12a shows that the query time of Flood grows sub-linearly. As the number of records grows, the layout learned by Flood uses more columns in each dimension, which results in more cells. The extra overhead incurred by processing more cells is outweighed by the benefit of lowering scan overhead.

Query Selectivity. To show how Flood performs at different query selectivities, we scale the filter ranges of the queries in the original TPC-H workloads up and down equally in each dimension in order to achieve between 0.001% and 10% selectivity. Fig. 12b shows that Flood performs well at all selectivities. The performance benefit of Flood is less apparent at 10% selectivity because all indexes are able to incur lower scan overhead when more points fall into the query rectangle.

Number of Dimensions. To show how Flood scales with dimensions, we create synthetic d-dimensional datasets (d ≤ 18) with 100 million records whose values in each dimension are distributed uniformly at random. For each dataset, we create a query workload of 1000 queries. The number of dimensions filtered in the queries varies uniformly from 1 to d. If a query has a filter on k dimensions, they are the first k dimensions in the dataset. For each query, the filter selectivity along each dimension is the same and is set so that the overall selectivity is 0.1%. For example, for a 2-dimensional dataset, 500 queries will select 0.1% of the domain of dimension 1, and 500 queries will select around 3.2% of the domains of dimensions 1 and 2.

Fig. 13a shows that Flood continues to outperform the baseline indexes at higher dimensions. Note that the clustered index's relative performance also improves, since the baseline indexes spend resources on dimensions which are not frequently filtered on. By contrast, Flood learns which dimensions to prioritize; for example, on the higher-dimensional datasets, Flood chooses not to include the least frequently filtered dimensions in the index. Yet, Flood is also impacted by the curse of dimensionality (Fig. 13b), which depicts the speedup of each index compared to a full scan. However, Flood can dampen the effect of the curse through its self-optimization, degrading more slowly than other indexes.

7.6 The Cost Model

Finding the Optimum. Choosing an optimal layout requires balancing two competing factors: reducing the latency of locating both the relevant cells and physical index ranges within each cell (index time), and reducing the scan time by lowering scan overhead.

[Figure 14: TPC-H: Adding cells reduces scan overhead but incurs a higher indexing cost and worse locality (query, index, and scan time vs. number of cells, with the learned optimum marked; scan overhead and time per scan vs. number of cells).]

Table 3: Query time (ms) when layouts are learned using cost models trained on different examples.

    Models trained on \ Layout learned for:
                 sales          tpc-h         osm           perfmon
    sales        0.128          10.8 (-8%)    0.975 (-7%)   3.49 (+17%)
    tpch         0.132 (+3%)    11.7          0.986 (-6%)   3.18 (+6%)
    osm          0.134 (+5%)    11.7 (+0%)    1.05          3.14 (+5%)
    perfmon      0.137 (+7%)    11.6 (-1%)    0.964 (-8%)   2.99

Table 4: Index creation time in seconds.

                    sales   tpc-h   osm    perfmon
    Flood Learning  10.3    33.4    44.5   33.3
    Flood Loading   4.12    29.6    8.03   22.0
    Flood Total     14.4    63.0    52.5   55.3
    Clustered       2.11    16.2    4.85   11.6
    Z Order         7.82    86.7    24.9   72.6
    UB tree         8.28    81.9    26.0   69.5
    Hyperoctree     2.47    42.2    31.4   54.8
    K-d tree        8.45    140     36.9   250
    Grid File       10.6    121     N/A    N/A
    R* tree         259     N/A     1340   N/A

Fig. 14a illustrates this trade-off as the size of the grid changes (we fix a layout and then scale the columns in each dimension proportionally): as the number of cells grows, scan time decreases because scan overhead decreases, but index time increases because there are more cells to process.

Flood's cost model must be able to find the appropriate trade-off between scan time and index time. Indeed, Fig. 14a shows that Flood finds the number of cells that minimizes the total query time (red star). Note that we show the cost surface along only a single degree of freedom for visual clarity.

Robustness of the model. As §4.1.1 describes, our cost model needs to learn the weights {wp, wl, wr, ws}, here referred to as calibration. This calibration should happen once per dataset and machine. On our server, it took around 10 minutes, most of which was spent generating training examples.

However, maybe surprisingly, the weights are quite robust to the data itself, so the cost model does not need to be retrained for every dataset. To show this, we trained our cost model on each of our four datasets, used each model to learn layouts for all four datasets, and then ran all 16 layouts on the corresponding query workloads. Tab. 3 shows that, no matter which dataset is used to learn the layout, the query times from the resulting layouts are similar, often with less than a 10% difference between them. Therefore, Flood can use the same cost model regardless of changes to the dataset or query workload. This makes calibration a one-time cost of 10 minutes.

7.7 Index Creation

Tab. 4 shows the time to create each index. We separate index creation time for Flood into learning time, which is the time taken to learn the layout (§4.2), and loading time, which is the time to build the primary index. The reported learning times use sampling of the dataset and query workload, described next. The total index creation time of Flood is competitive with the creation time of the baseline indexes.

Sampling records. Optimizing the layout using the entire dataset and query workload can take prohibitively long and does not scale well to larger datasets and workloads. However, Flood can reduce learning time without a significant performance loss by sampling the data. Fig. 15 shows that even when estimating features with a sample of only 0.01–1%, Flood maintains low query times. This is because the main purpose of the sample is to estimate the number of records scanned per query. Since our query selectivities are around 0.1% or higher, a sample of 1% of records is sufficiently accurate. Yet, this alone is not sufficient to match the creation time of the hyperoctree, the fastest of our multi-dimensional baselines.

Sampling queries. Sampling the query workload can reduce Flood's creation time even further. Here we conservatively use a data sample size of 100k records and vary the query sample size. As Fig. 16 shows, Flood maintains low query times when using only 5% of queries. This is because the query workloads contain a limited number of query types. Since queries within each type have similar characteristics with respect to selectivity and which dimensions are filtered, Flood only requires a few queries of each type to learn a good layout. However, the variance in performance increases as the query workload sample size decreases. With both optimizations (data and query samples), Flood achieves a learning time on par with the hyperoctree creation time without sacrificing performance.

7.8 Per-cell Models

In §5.2, we discuss CDF models to accelerate the location of physical indexes along the sort dimension. Since these CDFs are evaluated twice for each visited cell (beginning and end of the range), small speedups in lookup time may be noticeable in overall query time. Fig. 17a benchmarks the lookup time, including inference and an exponential search rectification phase, of three options we considered: the piecewise-linear model (PLM, our approach), the learned B-tree [23], and binary search. We used real data (timestamps from the OSM dataset) and synthetic staggered uniform data (data is uniform over identically sized but disjoint intervals), with query points sampled from the dataset. The PLM and RMI perform comparably, and both beat binary search by up to 4× on these datasets. We use the PLM since it requires only a single tuning parameter δ, which encodes the tradeoff between accuracy and size (Fig. 17b). By contrast, the learned B-tree [23] requires extensive tuning of the number of experts per layer. The choice of δ depends on how much the administrator prioritizes speed over space: Fig. 17b shows that δ = 50 strikes a reasonable balance.
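A sketch of the per-cell lookup (ours; for brevity the segments below are equally spaced rather than bounded by the error parameter δ as in Flood):

```python
import bisect
import numpy as np

class PLM:
    """Piecewise-linear model of a cell's (sorted) sort-dimension values."""
    def __init__(self, sorted_vals, segments=8):
        knots = np.linspace(0, len(sorted_vals) - 1, segments + 1).astype(int)
        self.xs, self.ys = sorted_vals[knots], knots

    def predict(self, v):
        return int(np.interp(v, self.xs, self.ys))   # approximate position of v

def lower_bound(sorted_vals, plm, v):
    """First index i with sorted_vals[i] >= v: predict a position, then
    rectify with exponential search around the prediction."""
    pos, step = plm.predict(v), 1
    if sorted_vals[pos] < v:                          # answer lies to the right
        while pos + step < len(sorted_vals) and sorted_vals[pos + step] < v:
            step *= 2
        return bisect.bisect_left(sorted_vals, v, pos,
                                  min(pos + step + 1, len(sorted_vals)))
    while pos - step >= 0 and sorted_vals[pos - step] >= v:  # answer to the left
        step *= 2
    return bisect.bisect_left(sorted_vals, v, max(pos - step, 0), pos + 1)

vals = np.sort(np.random.default_rng(2).normal(size=10_000))
assert lower_bound(vals, PLM(vals), 0.5) == bisect.bisect_left(vals, 0.5)
```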

[Figure 15: Learning time and resulting query time when sampling the dataset over several trials, on Sales, TPC-H, OSM, and Perfmon. One standard deviation from the mean is shaded. For comparison, we show the index creation time for the hyperoctree.]

[Figure 16: Learning time and resulting query time when sampling the queries over several trials, on Sales, TPC-H, OSM, and Perfmon. One standard deviation from the mean is shaded. For comparison, we show the index creation time for the hyperoctree.]

on queries over a recent time window. If the cost exceeds a 400 600 PLM threshold, Flood can replace the layout. RMI 375 500 Binary 350 Additionally, Flood is completely rebuilt for each new work- 400 325 load. However, Flood could also be incrementally adjusted, 300 300 275 e.g. by coalescing adjacent columns or splitting a column, or 200 = 50 250 by incorporating aspects of incremental layout creation from 100 Lookup time (ns) 225

Avg Lookup Time (ns) 200 0 database cracking [16], to avoid rebuilding the index. 30k 6M 105M 500k 10M 0 200 400 600 800 OSM Staggered Size (kB) Insertions. Flood currently only supports read-only work- Figure 17: (a) A comparison of three per-cell CDF loads. To support insertions, each cell could maintain gaps, models on two 1-D datasets. (b) The size-speed tradeoff similar to the fill factor of a B+Tree. It could also maintain a for the PLM, with our configuration marked. delta index [39] in which updates are buffered and periodically merged into the data store, similar to Bigtable [2]. search. We used real data (timestamps from the OSM dataset) Concurrency and parallelism. Flood is currently single- and synthetic staggered uniform data (data is uniform over threaded, but it can be extended to take advantage of con- identically sized but disjoint intervals), with query points sam- currency and parallelism. Different cells can be refined and pled from the dataset. The PLM and RMI perform comparably, scanned simultaneously; within a cell, records can be scanned and both beat binary search by up to 4× on these datasets. We in parallel, allowing Flood to benefit from multithreading. use the PLM since it requires only a single tuning parame- Additionally, since Flood stores each column in the column ter δ, which encodes the tradeoff between accuracy and size store as a dense array, it can also take advantage of SIMD. (Fig. 17b). By contrast, the learned B-tree [23] requires exten- sive tuning of the number of experts per layer. The choice of δ dependsonhowmuchtheadministratorprioritizesspeedover 9 CONCLUSION space: Fig. 17b shows that δ =50 strikes a reasonable balance. Despite the shift of OLAP workloads towards in-memory databases, state-of-the-art systems have failed to take advan- 8 FUTURE WORK tage of multi-dimensional indexes to accelerate their queries. Many instead opt for simple 1-D clustered indexes with bulky Shifting workloads. Flood can quickly adapt to workload secondary indexes that waste space. We design a new multi- changes but cannot detect when the query distribution has dimensional index Flood with two properties. First, it serves changed sufficiently to merit a new layout. To do this, Flood as the primary index and is used as the storage order for could periodically evaluate the cost (§4) of the current layout underlying data. Second, it is jointly optimized using both 14 Learning Multi-dimensional Indexes SIGMOD’20, June 14-19, 2020, Portland, OR, USA the underlying data and query workloads. Our approach out- [15] IBM. [n.d.]. The Spatial Index. https://www.ibm.com/support/ performs existing clustered indexes by 30−400×. Likewise, knowledgecenter/SSGU8G_12.1.0/com.ibm.spatial.doc/ids_spat_024. learning the index layout from the query workload allows htm. [16] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Flood to beat optimally tuned spatial indexes, while using a Cracking. Conference on Innovative Data Systems Research (CIDR). fraction of the space. Our results suggest that learned primary [17] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self- multi-dimensional indexes offer a significant performance im- organizing Tuple Reconstruction in Column-stores. In SIGMOD. ACM. provement over state-of-the-art approaches and can serve as [18] Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael S. useful building blocks in larger in-memory database systems. Kester, and Demi Guo. 2018. 
REFERENCES
[1] Amazon AWS. 2016. Amazon Redshift Engineering's Advanced Table Design Playbook: Compound and Interleaved Sort Keys. https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/.
[2] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages. https://doi.org/10.1145/1365815.1365816
[3] Surajit Chaudhuri and Vivek Narasayya. 1997. An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server. In Proceedings of the VLDB Endowment. VLDB Endowment.
[4] Databricks Engineering Blog. [n.d.]. Processing Petabytes of Data in Seconds with Databricks Delta. https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html.
[5] Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, and David Lomet. 2019. ALEX: An Updatable Adaptive Learned Index. arXiv:1905.08898.
[6] Andrew Lamb et al. 2012. The Vertica Analytic Database: C-Store 7 Years Later. In Proceedings of the VLDB Endowment. VLDB Endowment.
[7] Mike Stonebraker et al. 2005. C-Store: A Column-oriented DBMS. In Proceedings of the 31st VLDB Conference. VLDB Endowment.
[8] Exasol. [n.d.]. The World's Fastest In-Memory Analytic Database. https://www.exasol.com/en/community/resources/resource/worlds-fastest-analytic-database/.
[9] Volker Gaede and Oliver Günther. 1998. Multidimensional Access Methods. ACM Computing Surveys (CSUR) 30, 2 (1998), 170–231.
[10] Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2018. A-Tree: A Bounded Approximate Index Structure. CoRR abs/1801.10207 (2018). http://arxiv.org/abs/1801.10207
[11] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Min. Knowl. Discov. 1, 1 (Jan. 1997), 29–53. https://doi.org/10.1023/A:1009726021843
[12] Marios Hadjieleftheriou. 2014. libspatialindex. https://libspatialindex.org/.
[13] Klaus Hinrichs. 1985. Implementation of the Grid File: Design Concepts and Experience. BIT 25, 4 (Dec. 1985), 569–592. https://doi.org/10.1007/BF01936137
[14] Andreas Hutflesz, Hans-Werner Six, and Peter Widmayer. 1988. Twin Grid Files: Space Optimizing Access Schemes. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (SIGMOD '88). ACM, New York, NY, USA, 183–190. https://doi.org/10.1145/50202.50222
[15] IBM. [n.d.]. The Spatial Index. https://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.spatial.doc/ids_spat_024.htm.
[16] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking. In Conference on Innovative Data Systems Research (CIDR).
[17] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self-organizing Tuple Reconstruction in Column-stores. In SIGMOD. ACM.
[18] Stratos Idreos, Konstantinos Zoumpatianos, Brian Hentschel, Michael S. Kester, and Demi Guo. 2018. The Data Calculator: Data Structure Design and Cost Synthesis from First Principles, and Learned Cost Models. In ACM SIGMOD International Conference on Management of Data.
[19] Intel Corporation. 2014. Scaling Data Capacity for SAP HANA with Fujitsu PRIMERGY/PRIMEQUEST Servers. Technical Report.
[20] Irfan Khan. 2012. Falling RAM Prices Drive In-memory Database Surge. https://www.itworld.com/article/2718428/falling-ram-prices-drive-in-memory-database-surge.html.
[21] Hideaki Kimura, George Huo, Alexander Rasin, Samuel Madden, and Stanley B. Zdonik. 2009. Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies. Proc. VLDB Endow. 2, 1 (Aug. 2009), 1222–1233. https://doi.org/10.14778/1687627.1687765
[22] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings.
[23] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD). ACM.
[24] Iosif Lazaridis and Sharad Mehrotra. 2001. Progressive Approximate Aggregate Queries with a Multi-resolution Tree Structure. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD '01). ACM, New York, NY, USA, 401–412. https://doi.org/10.1145/375663.375718
[25] Lin Ma, Dana Van Aken, Ahmed Hefny, Gustavo Mezerhane, Andrew Pavlo, and Geoffrey J. Gordon. 2018. Query-based Workload Forecasting for Self-Driving Database Management Systems. In SIGMOD. ACM.
[26] Donald Meagher. 1980. Octree Encoding: A New Technique for the Representation, Manipulation and Display of Arbitrary 3-D Objects by Computer. Technical Report.
[27] Microsoft SQL Server. 2016. Spatial Indexes Overview. https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview?view=sql-server-2017.
[28] MonetDB. 2018. MonetDB. https://www.monetdb.org/.
[29] G. M. Morton. 1966. A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing. Technical Report. IBM.
[30] J. Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. 1984. The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. Database Syst. 9, 1 (March 1984), 38–71. https://doi.org/10.1145/348.318586
[31] Beng Chin Ooi, Ron Sacks-Davis, and Jiawei Han. 2019. Indexing in Spatial Databases.
[32] OpenStreetMap contributors. 2019. US Northeast dump obtained from https://download.geofabrik.de/. https://www.openstreetmap.org.
[33] Oracle Database Data Warehousing Guide. 2017. Attribute Clustering. https://docs.oracle.com/database/121/DWHSG/attcluster.htm.
[34] Oracle, Inc. [n.d.]. Oracle Database In-Memory. https://www.oracle.com/database/technologies/in-memory.html.
[35] Yongjoo Park, Shucheng Zhong, and Barzan Mozafari. 2019. QuickSel: Quick Selectivity Learning with Mixture Models. In SIGMOD. ACM.
[36] Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. 2000. Integrating the UB-Tree into a Database System Kernel. In Proceedings of the 26th International Conference on Very Large Databases. VLDB Endowment.

[37] Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The Uncracked Pieces in Database Cracking. Proc. VLDB Endow. 7, 2 (Oct. 2013), 97–108. https://doi.org/10.14778/2732228.2732229
[38] Scipy.org. [n.d.]. scipy.optimize.basinhopping. https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.basinhopping.html.
[39] Dennis G. Severance and Guy M. Lohman. 1976. Differential Files: Their Application to the Maintenance of Large Databases. ACM Trans. Database Syst. 1, 3 (Sept. 1976), 256–267. https://doi.org/10.1145/320473.320484
[40] Hari Singh and Seema Bawa. 2017. A Survey of Traditional and MapReduce-Based Spatial Query Processing Approaches. SIGMOD Rec. 46, 2 (Sept. 2017), 18–29. https://doi.org/10.1145/3137586.3137590
[41] Markku Tamminen. 1982. The Extendible Cell Method for Closest Point Problems. BIT 22 (1982), 27–41. https://doi.org/10.1007/BF01934393
[42] TPC. 2019. TPC-H. http://www.tpc.org/tpch/.
[43] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. 2011. Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions. In Proceedings of the VLDB Endowment. VLDB Endowment.
[44] Gary Valentin, Michael Zuliani, Daniel C. Zilio, Guy Lohman, and Alan Skelley. 2000. DB2 Advisor: An Optimizer Smart Enough to Recommend Its Own Indexes. In Proceedings of the 16th International Conference on Data Engineering. IEEE.
[45] Yingjun Wu, Jia Yu, Yuanyuan Tian, Richard Sidle, and Ronald Barber. 2019. Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19), Amsterdam, The Netherlands, June 30 - July 5, 2019. 1223–1240.
[46] Zack Slayton. 2017. Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB. https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/.

A INDEX IMPLEMENTATION DETAILS

This section provides additional details for the indexes described in §7.2 that we implemented ourselves. When querying an index, the user provides two arguments: (1) the start and end value of the filter range in each dimension (set to negative and positive infinity if the dimension is not filtered in the query), and (2) a Visitor object which accumulates the statistic of the aggregation. All of our experiments are performed on aggregation queries. Indexes only scan the columns for dimensions that appear in the query filter.
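For concreteness, this query contract might look as follows in C++; the type and method names here are our own assumptions, not the actual API of any of the implementations.

    #include <cstdint>
    #include <limits>
    #include <utility>
    #include <vector>

    // One (start, end) filter range per dimension; unfiltered dimensions
    // are set to (-infinity, +infinity), as described above.
    struct Query {
        std::vector<std::pair<double, double>> ranges;
    };

    // A Visitor accumulates the statistic of the aggregation; COUNT shown.
    struct CountVisitor {
        uint64_t count = 0;
        void visit(const std::vector<double>& /*record*/) { ++count; }
    };

    // Reference implementation of the contract as a full scan. The indexes
    // described next implement the same contract while skipping pages that
    // cannot match and scanning only the filtered dimensions' columns.
    template <typename Visitor>
    void scan_query(const std::vector<std::vector<double>>& records,
                    const Query& q, Visitor& v) {
        for (const auto& r : records) {
            bool match = true;
            for (size_t d = 0; d < q.ranges.size() && match; ++d)
                match = (r[d] >= q.ranges[d].first &&
                         r[d] <= q.ranges[d].second);
            if (match) v.visit(r);
        }
    }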
The baseline indexes we implemented:
• Clustered Single-Dimensional Index: We use an RMI with three layers, where all models are linear models. Models in the non-leaf layers are linear spline models, to ensure that the models accessed in the following layer are monotonic; the models in the leaf layer are linear regressions. The numbers of experts in each layer are 1, √n, and n, respectively, with n tuned to minimize query time on the target workload.
• Grid File: The d-dimensional space is divided into blocks by a grid (each block is one grid cell). Multiple adjacent blocks constitute a bucket. All points in a bucket are stored contiguously and not sorted: if a record in a bucket needs to be accessed, the entire bucket must be scanned. The grid is built incrementally, starting with a single block that contains the entire space. Each point is added to its corresponding bucket; once the number of points in a bucket hits a user-defined page size, that bucket is split to form a new bucket by either (1) splitting points along an existing block boundary, if one exists in any dimension, or (2) adding a grid column (and therefore more blocks) that divides the existing bucket at its midpoint along a particular dimension. The dimension along which the block is split is cycled through in round-robin fashion. The page size is tuned to minimize query time on the target workload. When a query arrives, the grid file scans all the buckets that intersect the query rectangle.
• Z-Order Index: We use 64-bit Z-order values. When indexing d dimensions, we compute the Z-order value for a point by taking the first ⌊64/d⌋ bits of each dimension's value and interleaving them, ordered by selectivity (e.g., the most selective dimension's LSB is the Z-order value's LSB); a sketch of this interleaving follows the list below. We order points by their Z-order value and group contiguous chunks into pages. For each page, we store the min and max value in each dimension for points in the page. Given a query, the index finds the smallest and largest Z-order value contained in the query rectangle (conceptually, the bottom-left and top-right vertices of the query rectangle), uses binary search to find the physical indexes that correspond to those Z-order values, and iterates through every page that falls between those physical indexes. The points in a page are only scanned if the metadata min/max values indicate that it is possible for points in the page to match the query filter (i.e., we only scan a page if the rectangle formed by the page's min/max values intersects the query rectangle).
• UB-tree: Z-order values are computed in the same way as in the Z-Order Index. We order points by their Z-order value and group contiguous chunks into pages. For each page, we store the minimum Z-order value contained in that page. Given a query, the index finds the smallest and largest Z-order value contained in the query rectangle (conceptually, the bottom-left and top-right vertices of the query rectangle), uses binary search to find the physical indexes that correspond to those Z-order values, and iterates through every physical index in this range. If we reach a Z-order value that is outside the query rectangle (the Z-order curve might enter and exit the query rectangle many times), we compute the next Z-order value that falls within the query rectangle. We then "skip ahead" to the page that contains this Z-order value, by comparing against each page's minimum Z-order value.

• Hyperoctree: We recursively split the space into d-dimensional hyperoctants until each page has fewer points than the page size. Points within a page are stored contiguously, and pages are ordered by an in-order traversal of the tree index. Each node in the hyperoctree contains an array of pointers to its 2^d child nodes, the min and max value in each dimension for points in the page, and the start and end physical index for points in the page. Given a query, the index finds all pages that intersect with the query rectangle, uses the node's metadata to identify the physical index range for each page, and scans all physical index ranges.
• K-d tree: We recursively partition the space using the median value along each dimension, until the number of points in each page is below the page size. The dimensions are used for partitioning in a round-robin fashion, in order of decreasing selectivity. If the remaining points all have the same value in a particular dimension, that dimension is no longer used for further partitioning. Points within a page are stored contiguously, and pages are ordered by an in-order traversal of the tree index. Each node in the k-d tree contains the pointers to its two children, the dimension that is split on, the split value, and the start and end physical index for points in the page. Given a query, the index finds all pages that intersect with the query rectangle, uses the node's metadata to identify the physical index range for each page, and scans all physical index ranges.
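The bit interleaving used by both Z-order baselines can be sketched as follows. We assume each dimension's value has already been quantized to an unsigned integer of ⌊64/d⌋ bits (a step the description above leaves implicit), and that dims is ordered by decreasing selectivity, so that the most selective dimension supplies the result's least significant bit.

    #include <cstdint>
    #include <vector>

    // Compute a 64-bit Z-order (Morton) value from d quantized dimension
    // values. Bit b of dims[j] lands at output bit b*d + j, so dims[0]
    // (the most selective dimension) contributes the LSB of the result.
    uint64_t z_order(const std::vector<uint64_t>& dims) {
        const uint64_t d = dims.size();
        const uint64_t bits_per_dim = 64 / d;   // floor(64 / d) bits each
        uint64_t z = 0;
        for (uint64_t b = 0; b < bits_per_dim; ++b)
            for (uint64_t j = 0; j < d; ++j)
                z |= ((dims[j] >> b) & 1ULL) << (b * d + j);
        return z;
    }

Sorting records by z_order then yields the paged layout that both the Z-Order Index and the UB-tree build on.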

B OPTIMIZATION PSEUDOCODE

Algorithm 1 provides pseudocode for the procedure of optimizing the layout using a calibrated cost model, described in §4.2.

Algorithm 1 Layout Optimization

1: Inputs: d-dimensional dataset D, query workload Q = {q_i}, cost model T : (D, q, L) → query time
2: Output: layout L = (O, C), where O is the order of dimensions and C = {c_i}_{0≤i<d} is the number of columns in each dimension
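The following sketch conveys only the overall shape of this optimization, under our own assumptions about the search procedure: candidate dimension orderings are scored with the calibrated cost model T, and the per-dimension column counts are refined greedily for each ordering. Both the candidate generation and the greedy refinement are illustrative placeholders, not the exact algorithm.

    #include <functional>
    #include <limits>
    #include <vector>

    struct Layout {
        std::vector<int> order;    // O: a permutation of the d dimensions
        std::vector<int> columns;  // C: number of columns per dimension
    };

    // Stand-in for the calibrated cost model T; the real T also takes the
    // dataset D and each workload query q, averaged over the workload Q.
    using CostModel = std::function<double(const Layout&)>;

    Layout optimize_layout(int d, const CostModel& cost,
                           const std::vector<std::vector<int>>& candidate_orders) {
        Layout best;
        double best_cost = std::numeric_limits<double>::infinity();
        for (const auto& order : candidate_orders) {
            Layout cur{order, std::vector<int>(d, 2)};  // start small
            // Greedy coordinate descent on the column counts: accept only
            // strictly improving single-dimension changes.
            for (bool improved = true; improved;) {
                improved = false;
                for (int i = 0; i < d; ++i)
                    for (int delta : {-1, +1}) {
                        Layout next = cur;
                        next.columns[i] += delta;
                        if (next.columns[i] < 1) continue;
                        if (cost(next) < cost(cur)) { cur = next; improved = true; }
                    }
            }
            if (cost(cur) < best_cost) { best_cost = cost(cur); best = cur; }
        }
        return best;
    }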
