An Eight-Dimensional Systematic Evaluation of Optimized Search Algorithms on Modern Processors

Lars-Christian Schulz, David Broneske, Gunter Saake
University of Magdeburg, Magdeburg, Germany
[email protected] [email protected] [email protected]

ABSTRACT

Searching in sorted arrays of keys is a common task with a broad range of applications. Often searching is part of the performance-critical sections of a query or index access, raising the question what kind of search algorithm to choose and how to optimize it to obtain the best possible performance on real-world hardware. This paper strives to answer this question by evaluating a large number of optimized sequential, binary and k-ary search algorithms on a modern processor. In this context, we consider hardware-sensitive optimization strategies as well as algorithmic variations, resulting in an eight-dimensional evaluation space.

As a result, we give insights on expected interactions between search algorithms and optimizations on modern hardware. In fact, there is no single best optimized algorithm, leading to a set of recommendations on which variants should be considered first given a particular array size.

PVLDB Reference Format:
Lars-Christian Schulz, David Broneske, Gunter Saake. An Eight-Dimensional Systematic Evaluation of Optimized Search Algorithms on Modern Processors. PVLDB, 11 (11): 1550-1562, 2018.
DOI: https://doi.org/10.14778/3236187.3236205

Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowment, Vol. 11, No. 11. Copyright 2018 VLDB Endowment 2150-8097/18/07.

1. INTRODUCTION

Searching in sorted data is one of the most fundamental operations in computing. A typical search problem consists of a sequence of key/value pairs sorted by keys and a query for either a single key or a range of keys. The task is then either to retrieve the value belonging to the given key, or to retrieve all key/value pairs in the given key range.

Since this problem is so fundamental and its solution so universally useful, it has been studied extensively since the early days of computing. The classical and asymptotically optimal algorithm to solve it is the well-known binary search [7]. However, there is no single binary search algorithm, but many different variants with slightly different properties [10]. Moreover, binary searching can logically be extended to ternary searching and in all generality to k-ary searching. While their conceptual benefit is intuitive, these benefits are not as easily inferable on a modern, deeply pipelined, superscalar, out-of-order processor with a multi-level cache hierarchy and SIMD facilities. In this context, there is also the question what influence code optimizations like loop unrolling have on the performance of a particular algorithm. Furthermore, the algorithm, the applied code optimizations, and micro-architectural specifics of the processor can have surprising interactions. Therefore, a systematic experimental evaluation of all these factors is necessary to find optimal search algorithms. To enable reproducibility, the code can be accessed from our git repository¹.

Our Contributions. This paper aims at providing insight into the behavior of optimized search algorithms on modern processors on static arrays of tightly packed 32-bit integer keys. We evaluate search variants in an eight-dimensional space of algorithmic variations, code optimizations, and data-set properties. The algorithmic aspects include:

1. Base algorithm. We consider sequential, binary and k-ary search algorithms. The binary and k-ary search include variants with irregular subdivisions or strictly regular subdivisions, also often called uniform.

2. Exact match vs. lower bound. There are two possible query functions when searching in sorted lists: range and exact-match queries. To answer range queries, all keys in the closed interval [a, b] have to be retrieved. To this end, a lower-bound and an upper-bound search are performed. We only consider the lower-bound search, since only minimal differences exist to the upper-bound search. Since exact match can be trivially derived once the lower bound has been located, we compare direct exact-match search algorithms and lower-bound-based exact-match search algorithms.

We apply the following four code optimization techniques:

3. Branch Elimination. We remove branches from the search loops by using branch predication.

4. Loop Unrolling. We unroll the search loops.

5. Software Prefetching. We add prefetching to avoid or shorten the stalls caused by cache misses.

6. Vectorization. We use the AVX2 instruction set to parallelize the algorithms in SIMD fashion.

The generated evaluation dataset raises two dimensions:

7. Dataset size. We vary the array size from one to 2^28 keys. Thus, we consider a continuous range of array sizes from arrays comfortably fitting into the processor cache to much larger ones.

¹ https://git.iti.cs.ovgu.de/dbronesk/SearchAlgorithms

8. Locality of search queries. We use two different schemes to generate search queries with different caching demands.
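To make the lower-bound semantics used throughout the paper concrete, a textbook reference implementation can look as follows (a minimal C++ sketch with our own naming; this is for illustration only and is not one of the evaluated, tuned variants):

```cpp
#include <cstddef>
#include <cstdint>

// Reference lower-bound search: returns the index of the first key that is
// greater than or equal to searchKey, or size if no such key exists.
std::size_t lowerBound(const std::int32_t* keys, std::size_t size,
                       std::int32_t searchKey) {
    std::size_t left = 0, right = size;
    while (left < right) {
        std::size_t mid = left + (right - left) / 2;
        if (keys[mid] < searchKey)
            left = mid + 1;   // lower bound lies to the right of mid
        else
            right = mid;      // mid itself may be the lower bound
    }
    return left;
}

// Exact match derived from the lower bound, as described in contribution 2.
bool exactMatch(const std::int32_t* keys, std::size_t size,
                std::int32_t searchKey) {
    std::size_t lb = lowerBound(keys, size, searchKey);
    return lb < size && keys[lb] == searchKey;
}
```

The upper-bound search differs only in using a strict comparison, which is why the paper restricts itself to the lower bound.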

In addition to runtime measurements, we analyze micro-architectural performance aspects like branch mispredictions and cache activity utilizing the performance counters built into modern processors. Ultimately, we arrive at a recommendation which algorithms and code tuning strategies should be used first when a search is needed.

2. OPTIMIZING SEARCH ALGORITHMS

We consider three different general methods to search on sorted data: (1) sequential searching, (2) binary searching and, as a generalization thereof, (3) k-ary searching. Section 2.1 discusses these algorithms. In Section 2.2, we improve the performance of the basic algorithms by adapting them to characteristics of modern processors.

2.1 Algorithmic Variations

In the following, we briefly discuss sequential, binary and k-ary searching for the lower bound. Additionally, we introduce modifications to the traditional binary and k-ary search based on searching in perfect search trees inspired by Schlegel et al. [12]. The differences between lower-bound and exact-match searching are outlined in Section 2.1.5.

2.1.1 Sequential Search

The sequential search visits each element in turn and returns with the current index when the first key not smaller than the search key (the lower bound) has been found. Its run time is linear in the number of array elements.

2.1.2 Binary Search

A binary search recursively splits the search range in two approximately equally sized partitions and examines the array element separating them, i.e., the separator key. The search then continues with just one of the partitions, thereby halving the search space. This way we need at most ⌊log2(size)⌋ + 1 iterations to localize the lower bound. This can be seen by constructing the binary search tree corresponding to the search and analyzing its height [7].

A variant of the binary search splits the search range in two differently sized partitions according to a fixed ratio, so that the separator key is not exactly in the middle of the range. We will call this offset binary search.

2.1.3 k-ary Search

A logical generalization of binary searching is the k-ary search, not just dividing the search range in two partitions, but in k partitions for a k > 2. This requires probing k − 1 separator elements in each iteration. The separator elements are chosen to divide the search range evenly. Since a k-ary search reduces the search space by a factor of k in each iteration, the worst case number of iterations is ⌈log_k(size + 1)⌉ = ⌊log_k(size)⌋ + 1. Thus, the speedup of k-ary searching is approximately log2(n) / log_k(n) = log2(k) [12].

2.1.4 Uniform Binary and k-ary Search

The binary and k-ary search algorithms discussed above divide the number of keys by the desired number of partitions. This has the disadvantage that the partitions are not exactly equally sized if the current number of keys is not divisible by the number of partitions. Therefore, the number of iterations needed to complete a lower-bound search can vary depending on whether the smaller or the larger partitions are chosen during the search. Furthermore, we have to explicitly track the bounds of the current search range.

It is possible to avoid these problems by restricting the algorithms to array sizes of the form k^h − 1, where h > 0 is an integer. We will call these array sizes perfect, allowing a perfect k-ary search tree to be constructed. In a perfect k-ary search tree every node contains k − 1 keys and every internal node has k children. Additionally, every leaf node is at the same depth [12]. Figure 1 shows a perfectly sized array and the perfect search tree for k = 3. As we can see, all partitions (excluding the separator keys) possibly created during a search have perfect length themselves. The separator keys contained in a node at height h > 0 are located at indices of the form left + i · k^(h−1) − 1, where left is the start index of the current search range. We call algorithms based on perfect search trees uniform. The uniform algorithms generally have simpler index computations than their non-uniform counterparts, enabling more optimizations.

Figure 1: Perfect k-ary search tree for k = 3 containing 26 keys and the corresponding array.

Uniform Binary Search. We first consider the uniform binary search (Algorithm 1). In contrast to the non-uniform binary search, its search loop runs for exactly ⌈log2(size + 1)⌉ iterations. This is the height of the conceptual binary search tree constructed by the algorithm. The indices of the separator keys are computed as left + 2^(height−1) − 1. Generalization to arrays of arbitrary length is achieved by potentially letting the left and right partitions overlap in the first iteration. The following iterations can then assume both the left and right partition to be equally sized and correspond to perfect binary search trees, i.e., to have a length of 2^(height−1) − 1. Since the right partition begins with the key to the right of the separator element, the overlapping is achieved by setting the index of the first separator element to size − 2^(height−1). This way neither the left nor the right partition extends beyond the array bounds.

Algorithm 1 Lower-Bound Uniform Binary Search
  function uniformBinarySearch(keys, size, searchKey)
      height ← ⌈log2(size + 1)⌉
      mid ← size − 2^(height−1); left ← 0
      height ← height − 1
      while height ≥ 0 do
          if searchKey > keys[mid] then
              left ← mid + 1
          end if
          mid ← left + 2^(height−1) − 1   // value unused in the last iteration
          height ← height − 1
      end while
      return left
  end function
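For concreteness, Algorithm 1 can be rendered in C++ roughly as follows. This is a sketch under our own conventions: the height is computed with a shift loop instead of a floating-point logarithm, the empty-array guard is ours, and the loop is restructured so the out-of-range separator index of the last iteration is never computed:

```cpp
#include <cstddef>
#include <cstdint>

// Lower-bound uniform binary search, following Algorithm 1.
// Returns the index of the first key >= searchKey (size if none).
std::size_t uniformBinarySearch(const std::int32_t* keys, std::size_t size,
                                std::int32_t searchKey) {
    if (size == 0) return 0;                 // guard, not part of Algorithm 1
    unsigned height = 0;                     // height = ceil(log2(size + 1))
    while ((std::size_t{1} << height) < size + 1) ++height;
    std::size_t left = 0;
    // First separator at size - 2^(height-1): the right partition is
    // perfectly sized, and both partitions may overlap in this iteration.
    std::size_t mid = size - (std::size_t{1} << (height - 1));
    for (;;) {
        if (searchKey > keys[mid]) left = mid + 1;
        if (--height == 0) break;            // exactly ceil(log2(size+1)) probes
        mid = left + (std::size_t{1} << (height - 1)) - 1;
    }
    return left;
}
```

Note how the loop always executes the same number of comparisons for a given array size, independently of the search key.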

Uniform k-ary Search. The uniform k-ary search handles imperfectly sized arrays using the same idea as the binary search: The size of the array is rounded up to the next perfect size and the size of the partitions is selected accordingly, but they are allowed to overlap. Again, care must be taken that the rightmost partition does not extend outside of the array. The following iterations then proceed as shown in Algorithm 2. The algorithm uses an inner for loop to examine the k − 1 separators. The index of the i-th separator is computed as left + offset − 1, where offset = i · k^(height−1), i.e., the offset of the partition to the left of the current separator from the start of the current search range.

Algorithm 2 Lower-Bound Uniform k-ary Search
  function uniformKarySearch(keys, size, searchKey)
      height ← ⌈log_k(size + 1)⌉; left ← 0
      if size ≠ k^height − 1 then
          // imperfect array size
          height ← height − 1
      end if
      while height > 0 do
          step ← 0; offset ← 0
          for i ← 1, ..., k − 1 do
              offset ← offset + k^(height−1)
              if searchKey ≤ keys[left + offset − 1] then
                  break
              end if
              step ← offset
          end for
          left ← left + step
          height ← height − 1
      end while
      return left
  end function

2.1.5 Exact Match

Until now we have only described lower-bound search algorithms. There are two ways to derive exact-match variants from them: (1) Add an equality case to every key comparison and terminate successfully when the search key is found. (2) Perform a lower-bound search, then check whether the search key is at the index of the lower bound. We have implemented exact-match binary and k-ary search algorithms in both variants and compare them in Sections 4.2.4 and 4.3.3. Note that the index computations in variant (1) are slightly different from a lower-bound search, since the separator elements themselves are not included in any of the partitions.

2.2 Hardware-Sensitive Optimizations

We implemented the algorithms described above and optimized the resulting code by eliminating branches, unrolling loops and adding software prefetching. Moreover, we vectorized the sequential, binary and uniform k-ary search using AVX2 instructions. We assumed the array length to be evenly divisible by the SIMD register width and the arrays to be aligned at SIMD word boundaries. If these two assumptions are not met, additional code is needed to handle the "overhanging" elements. The vectorized implementations were again tuned like the scalar variants. In Table 1, we summarize which optimizations were applied to which algorithm. More implementation details are available in [13].

Table 1: Overview of the applied optimizations.
  Algorithm           Branch Free     Loop Unrolled     Prefetch
  Sequential          predication     2/4/8/16 times    –
  Vectorized Seq.     mask eval.      2/4/8 times       –
  Binary              predication     –                 1 iter.
  Vectorized Bin.     predication     –                 1 iter.
  Uniform Bin.        (predication)   completely        1/2 iter.
  Vect. Unif. Bin.    predication     completely        1 iter.
  k-ary               predication     inner loop        –
  Uniform k-ary       predication     inner loop        1 iter.
  Vect. Unif. k-ary   (mask eval.)    (inner loop)      –

2.2.1 Vectorization

In this paper, we use AVX2 to exploit SIMD parallelism on a single processor for all search algorithms. In each algorithm, we have to evaluate the bitmask returned by an AVX2 comparison. This mask contains set bits in each lane where the comparison evaluated to true and unset bits otherwise. One way to branch on an AVX2 comparison is the vptest instruction. Another, more flexible way is to create a bitmask in a general-purpose register derived from the comparison result using the vpmovmskb instruction. We can then count the number of bits set in the mask to decide which branch to take, or use the count directly without any branch. More information on the evaluation of SIMD comparison bitmasks can be found in [14].

The sequential search is vectorized by simply loading and comparing m consecutive keys in parallel each iteration, where m is the number of keys per SIMD word.

The binary search can be vectorized by viewing the array as a sequence of SIMD word sized blocks. A binary search is then performed on these blocks until the lower bound is localized to a single block [12, 15]. Once the search range is narrowed down to a single block, the position of the lower bound is determined as outlined above.

Schlegel et al. [12] propose to use SIMD to parallelize index computations and key comparisons of a k-ary search by fitting the k − 1 separator elements and their indices in a single SIMD register. Hence, k is determined by the width of the SIMD registers and the larger of the key and index bit width. Assuming AVX2 with 32-bit indices and keys, k = 9. We thus implemented a vectorized branchless uniform k-ary search. It uses AVX2 gather instructions to load separators from non-contiguous memory locations.

2.2.2 Branch Elimination

Modern processors are deeply pipelined, relying on branch prediction to be able to continue executing instructions immediately after they have encountered a conditional jump [6]. Consequently, mispredicted branches are costly [1]. Therefore it is worthwhile to eliminate unpredictable branches from programs. The branches in the loops of our search algorithms often fall in this category, since generally the search keys are not predictable.

One way to eliminate branches is by branch predication, i.e., both sides of a branch are executed and only the effects of the side that was actually needed are kept. The x86 architecture supports branch predication via conditional move instructions (cmov). The scalar sequential search can make use of conditional moves by always visiting all array elements and updating the position of the lower bound only as long as the lower bound has not yet been passed. Branches
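As a concrete illustration of Algorithm 2, the following scalar C++ sketch restricts itself to perfect array sizes (size = K^h − 1); the generalization to arbitrary sizes via an overlapping first iteration, as used in the paper, is omitted here for clarity:

```cpp
#include <cstddef>
#include <cstdint>

// Lower-bound uniform k-ary search for perfect array sizes (K^h - 1),
// following the core loop of Algorithm 2. K is the fan-out.
template <std::size_t K>
std::size_t uniformKarySearch(const std::int32_t* keys, std::size_t size,
                              std::int32_t searchKey) {
    std::size_t height = 0, pw = 1;           // pw tracks K^height
    while (pw - 1 < size) { pw *= K; ++height; }
    std::size_t left = 0;
    while (height > 0) {
        pw /= K;                              // pw == K^(height-1)
        std::size_t step = 0, offset = 0;
        for (std::size_t i = 1; i < K; ++i) { // probe the K-1 separators
            offset += pw;                     // separator at left+offset-1
            if (searchKey <= keys[left + offset - 1]) break;
            step = offset;                    // descend right of this separator
        }
        left += step;
        --height;
    }
    return left;
}
```

When the search key equals a separator, the inner loop of a later, smaller partition steps past all its keys and lands exactly on the separator's index, which is why the excluded separator keys are still found correctly.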

from the inner search loop over the k − 1 separator elements in the k-ary search can be eliminated similarly. We also used branch predication in the binary and vectorized binary search, where either the left or the right boundary of the search range is updated. Note that the vectorized binary search retains one branch in the search loop to handle cases where the lower bound is located in the current SIMD word. Also, exact-match search algorithms (not based on a lower bound) always retain one branch in the search loop allowing them to terminate when an exact match is found.

Sometimes, a branch can be avoided. This is the case in the vectorized sequential search, where the number of set bits in a comparison result mask is directly used to update the location of the lower bound [14]. In the uniform binary search, there is no need to manually eliminate branches, because the compiler predicated the single branch in the search loop by itself. Similarly, no branch elimination was necessary in the vectorized uniform k-ary search, because the starting index of the selected partition is computed directly from the comparison result without a branch.

2.2.3 Loop Unrolling

Loop unrolling improves performance by reducing the time spent executing loop control instructions. We have unrolled the loop in the scalar sequential search 2, 4, 8 and 16 times. The vectorized sequential search was unrolled 2, 4 and 8 times. We did not unroll the vectorized search further, since its code size is about twice that of the scalar variant, leading to a higher register pressure [4, 6].

Since the number of iterations in the uniform binary search only depends on the logarithm of the array size, we have unrolled all possible iterations of the search loop in the scalar and vectorized variant. Our implementations of scalar k-ary searching employ an inner loop over the k − 1 separator elements. Although this loop was automatically unrolled by the compiler, we also manually unrolled these loops.

2.2.4 Software Prefetching

A processor may start loading cache lines not yet accessed by the program, expecting them to be accessed soon due to the principle of locality [6]. This is called prefetching. Prefetching can be triggered by the hardware itself, or by using special instructions in software [1].

Due to its perfectly sequential memory access, the sequential search does not need extra software prefetching. In contrast, the scalar and vectorized binary search keep track of the next two possible separator elements, so that they can be prefetched one iteration before they are needed. The regular spacing of the separators in the uniform binary search simplifies keeping track of the possible next separator elements, thus reducing the amount of index computations needed for software prefetching. We have implemented variants looking one iteration ahead like for the non-uniform binary search, but also variants looking two iterations ahead and therefore prefetching four separators each search step. Notably, prefetching stops in the last iterations, because the cache lines containing the next separators are already loaded.

Moreover, we added software prefetching to the branchless uniform k-ary search by splitting the search loop into two variants. The first prefetches the k − 1 separators in each of the k partitions while iterating over them, the second is unmodified. We use the unmodified loop for the last ⌈log_k(m)⌉ iterations, where m is the number of keys per cache line.

3. METHODOLOGY

We compare the search algorithm variants from the last section by the time needed to search for a single key in a tightly packed array of 32-bit signed integer keys. By averaging over a batch of 10,000 searches we obtained reproducible runtimes. Additionally, the caches were prepopulated by executing 10,000 search runs before we started the measurement. The behavior of the algorithms when starting from an empty cache is explored in Section 5.1.

The largest arrays used in the evaluation contain 2^28 keys, yielding a size of 1 GiB with 32-bit keys. By going up to this size, we can capture any changes in runtime from cases where the whole array fits into the L1 cache up to cases where nodes of the conceptual search trees spanned by the binary and k-ary search algorithms do not fit into the L3 cache.

How many nodes in these search trees are available in the caches is critical for performance. Therefore, we consider two different schemes to select the keys to search for. The first scheme (Scheme 1) uniformly draws random keys from the array. Thus the whole search tree is evenly accessed. The second scheme (Scheme 2) works by randomly selecting 128 keys from the array and then randomly drawing from these keys for 2000 searches. After every 2000 searches, 128 new keys are selected. In doing so, only a small subset of the paths through the search tree is accessed. The evaluation is mostly based on Scheme 1. When we do not explicitly state that Scheme 2 was used to generate the search keys, Scheme 1 was used. We will only refer to Scheme 2 when there is a significant difference to Scheme 1.

Note that the distribution of the keys in the arrays has little influence on the evaluation, because the search keys are selected from random array positions and not from an underlying key distribution. In this evaluation, the arrays simply contain uniformly distributed keys.

The experiments were conducted on a single thread with an otherwise idle system. This means the search algorithms had almost the entirety of the processor cache size and bandwidth exclusively available. This limits the applicability of our results, since in a real-world scenario other code would pollute the caches. The processor used was an Intel Xeon E5-2630 v3. This processor has 32 KiB of L1 instruction cache, 32 KiB of L1 data cache and 256 KiB of L2 cache per core. Additionally, it has 20 MiB of shared L3 cache. To analyze the micro-architectural resource usage, we utilized the performance monitoring counters available in the evaluation system. The specific performance events used are stated in footnotes when referring to special performance counters. See the Intel Software Developer's Manual for a description of the counters and their settings [2].

All algorithms were implemented in C++ and compiled with the GNU C++ compiler version 5.4.0 on the highest optimization level (-O3). In some cases we had to resort to inline assembler to apply branch elimination. Automatic loop vectorization was disabled.

Goal of the Evaluation. Our goal is to determine the best search algorithm and set of optimizations minimizing the time required to locate an integer key in a sorted array. To this end, we first conduct an intra-algorithm evaluation for each of the sequential, binary and k-ary search algorithms to analyze the impact of software optimization techniques on their runtime. In doing so, we determine the best implementation variant of each algorithm depending on the size of the input array.
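The two query-generation schemes can be sketched as follows. The constants 128 and 2000 come from the text above; the generator structure and the choice of RNG are our own simplification:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Scheme 1: every query is drawn uniformly from the whole array, so all
// paths of the conceptual search tree are exercised evenly.
std::vector<std::int32_t> scheme1(const std::vector<std::int32_t>& keys,
                                  std::size_t n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, keys.size() - 1);
    std::vector<std::int32_t> q;
    q.reserve(n);
    for (std::size_t i = 0; i < n; ++i) q.push_back(keys[pos(rng)]);
    return q;
}

// Scheme 2: each batch of 2000 queries is drawn from a working set of only
// 128 keys, refreshed between batches, so few tree paths are touched.
std::vector<std::int32_t> scheme2(const std::vector<std::int32_t>& keys,
                                  std::size_t n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, keys.size() - 1);
    std::uniform_int_distribution<std::size_t> slot(0, 127);
    std::vector<std::int32_t> working(128), q;
    q.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 2000 == 0)                    // pick a fresh working set
            for (auto& w : working) w = keys[pos(rng)];
        q.push_back(working[slot(rng)]);
    }
    return q;
}
```

Under Scheme 2 the upper levels of the search tree stay cache-resident across a batch, which is what gives the two schemes their different caching demands.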

While the exact runtimes and array size class boundaries depend on the specific hardware, the same algorithm variants should perform well on all similar systems. We then use the knowledge gained in the intra-algorithm evaluation to conduct an inter-algorithm comparison, where we determine the best implementation variants from the complete set of best sequential, binary and k-ary search algorithms.

4. INTRA-ALGORITHM EVALUATION

In the following, we evaluate the effectiveness of branch elimination, loop unrolling, software-controlled prefetching and vectorization in sequential, binary and k-ary searching.

4.1 Sequential Search

For the sequential search we consider branch elimination, loop unrolling and vectorization. We exclude software prefetching, since the sequential access pattern of these algorithms is already handled by the hardware prefetcher. Since the behavior of branch elimination and loop unrolling is similar to other work, we exclude performance graphs for brevity and refer to [13].

4.1.1 Branch Elimination

We compared the average search time of the branching search implementations with their branchless counterparts, for scalar and vectorized searches. Since the branchless variants always iterate over the whole array, they get slower compared to the branching ones the larger the array becomes. However, on small arrays the cost of a mispredicted branch exiting the search loop outweighs the benefits of earlier termination. This break-even point is at 28 keys for the scalar and about 135 keys for the vectorized implementation.

4.1.2 Loop Unrolling

Unrolling the loop of the scalar sequential search variants does not yield consistent improvements. There is a slight speedup for some array sizes below 100 to 200 keys of up to 5%, but otherwise the runtimes are worse. Repeating the same experiment on a different processor (Intel Core i7-3610QM), we observed speed improvements of over 20% in the scalar branching search and some slight improvements in the four times unrolled scalar branchless search. Furthermore, we found the compiler has a significant influence on the effectiveness of loop unrolling in the sequential search.

We obtained more consistent results for the vectorized sequential search algorithms, where the branching variant generally profits by about 5% from loop unrolling on arrays of more than 200 keys and is never slower than a not unrolled implementation. Similarly, the branchless implementation unrolled four or eight times is accelerated by about 10% on arrays larger than 250 keys. However, the two times unrolled variant is slower than the not unrolled base version.

Given the strong influence of compiler and hardware, loop unrolling in the sequential search requires careful tuning to be effective.

4.1.3 Vectorization

In Figure 2, we compare the fastest scalar and vectorized search functions. From Figure 2a we conclude that the branchless vectorized implementation offers the best performance for arrays with up to about 135 elements. For arrays with more elements, the branching vectorized search is better. Figure 2b shows the speedup obtained by vectorization. At an array size of 100 elements, the branching vectorized search processes about 30% more keys per second than the scalar implementation. The difference is greater in the branchless search, where vectorization increases the search speed by a factor of 2.3 for 100 elements. At an array size of 4096 keys, the speedup factor has converged to 3.5 for the branching and 4.9 for the branchless search.

Figure 2: Comparison of scalar and vectorized (AVX2) sequential search implementations. Panels: (a) run-time; (b) vectorization speedup.

4.1.4 Best Sequential Search Algorithms

For the sequential search, vectorized implementations are superior to scalar ones. For arrays of 135 keys or less, the branchless version is to be preferred. Loop unrolling does not significantly improve it. On larger arrays the branching implementation becomes superior. In contrast to the branchless variant, it profits from unrolling the search loop at least four times. Table 2 summarizes these results.

Table 2: Best sequential search algorithms.
  Keys     Best Lower-Bound Sequential Search
  ≤ 135    Vectorized branchless sequential search
  > 135    Vectorized branching sequential search unrolled 4 times

4.2 Binary Search

Next we evaluate the interaction of the uniform and non-uniform binary search with branch elimination, software prefetching, loop unrolling and vectorization.

4.2.1 Branch Elimination

The runtime of the binary search, its branchless version and the uniform binary search, which is also branchless, are plotted in the first row of Figure 3. The sizes of the L1, L2 and L3 caches are marked with dashed lines. As can be seen, the branchless algorithms are faster than the branching implementation for arrays up to a size of 2^17 elements. This is twice the size of the L2 cache. For larger arrays, their performance declines.

The increased runtime of the branchless implementations can be explained with significantly more stalled clock cycles² than in the branching implementation for arrays of more than 2^17 keys (Figure 3c). The branching search has less stalled clock cycles, because the processor speculates on the outcome of the conditional branches and can continue with the next iteration before the necessary separator elements are actually loaded. Such speculation is not possible in the branchless implementation.

² Performance event RESOURCE_STALLS.ANY
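The branchless sequential variant evaluated above can be condensed to a predicated counting loop. This is our own compact formulation, equivalent on sorted input to updating the lower-bound position while it has not yet been passed; the comparison is expected to compile to setcc/cmov-style code rather than a data-dependent jump:

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free sequential lower-bound search: all elements are visited and
// the result advances by a predicate instead of a conditional jump.
// On a sorted array, the number of keys smaller than the search key is
// exactly the lower-bound index.
std::size_t sequentialBranchless(const std::int32_t* keys, std::size_t size,
                                 std::int32_t searchKey) {
    std::size_t lb = 0;
    for (std::size_t i = 0; i < size; ++i)
        lb += static_cast<std::size_t>(keys[i] < searchKey);
    return lb;
}
```

The fixed trip count is also what makes this variant slower than the early-terminating branching search on large arrays, as measured in Section 4.1.1.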

1554 (a) Run-Time (Linear Scale) 100 (b) Run-Time (Logarithmic Scale) 1600 branching 90 1400 branchless 80 1200 uniform (branchless) 70 1000

60 800 time in ns 50 time in ns 600 branching 40 400 branchless 30 uniform (branchless) 200 20 0 0 500 1000 1500 2000 2500 3000 3500 4000 22 24 26 28 210 212 214 216 218 220 222 224 226 228 size in elements size in elements

(e) Retiered Branch Mispredictions (d) L2 HW Prefetcher Requests 8 (c) Stalled Cycles (Resource Allocation Stalled) 14 5000 branching 7 branching 12 branchless 4000 branchless 6 uniform (branchless) 10 uniform (branchless) 5 branching 3000 8 4 branchless 6 uniform (branchless) 2000 3

stalled cycles 4

number of requests 2 branch misses retired 1000 2 1

[Figure 3 plots omitted: (a) run-time on a linear scale, (d) L2 hardware prefetcher requests, and (e) retired branch mispredictions, each over the size in elements.]

Figure 3: Branch elimination applied to the binary search (vertical lines represent respective cache sizes).

branchless implementation. Additionally, the branching binary search makes use of hardware prefetching. Figure 3d shows the number of L2 hardware prefetcher requests³. As we can see, for more than 2^17 keys the prefetcher is only active in the branching implementation.

The linearly scaled runtime graph in Figure 3a shows differently shaped curves for the three algorithms on smaller arrays. The branching, non-uniform search closely follows the logarithmic curve we expect. The uniform binary search performs exactly one additional iteration for each doubling of the array size, resulting in steps at powers of two. In contrast to the former two, the branchless non-uniform search oscillates with minima at power-of-two element counts and maxima about one third of the way to the next power of two. These oscillations are explained by the number of retired branch mispredictions⁴,⁵ plotted in Figure 3e. At power-of-two array sizes, the non-uniform binary search runs for exactly log2(array size) + 1 iterations, independently of which key is searched for. If the array size is not a power of two, either log2(array size) or log2(array size) + 1 iterations are needed. This makes the jump at the end of the search loop harder to predict than in the uniform variant.

4.2.2 Prefetching

For the evaluation of the software prefetching algorithms, we used the T0 locality hint, i.e., the prefetcht0 instruction, loading into all cache levels. We found this generally gives the best performance.

Since hardware prefetching did not work for larger arrays in the branchless implementations, software-controlled prefetching might yield an improvement. The left column of Figure 4 shows our results for the non-uniform binary search. As we can see, adding software prefetching only marginally improves the runtime on arrays below the 2^17 elements mark. On larger arrays however, the branchless binary search is significantly faster if prefetching is employed. The difference with software prefetching in the branching implementation is less apparent, but becomes more visible once the array size surpasses 2^26 keys, since DRAM accesses are by several factors more expensive.

Remember that the uniform binary search algorithms conditionally call the prefetch intrinsic to avoid redundant instructions. We found that this significantly reduces the execution time for arrays of less than 2^18 keys, with almost no change in runtime on even larger arrays. The middle column of Figure 4 shows that prefetching one iteration in advance (prefetch 2 keys) increases the performance of the uniform binary search if the array has more than about 2^16 keys. For smaller arrays however, prefetching adds a few nanoseconds of overhead to the runtime. Naturally, this overhead is even larger in the implementation prefetching two iterations in advance (prefetch 4 keys). Nevertheless, prefetching two iterations in advance becomes slightly faster than prefetching just one iteration in advance for arrays larger than 2^17 keys. Figure 4c shows the runtime difference between prefetching two and four keys in more detail. Interestingly, the largest speed difference occurs around the size of the L3 cache between 2^20 and 2^23 keys, where prefetching four keys is about 25% faster than prefetching just two keys.⁶

In Figure 4f, we show the number of L3 cache misses caused by the different implementations. The uniform and non-uniform branchless binary search have the least number of cache misses, but are both slow on large arrays. As expected, prefetching two keys each iteration in the uniform and non-uniform branchless search causes more cache misses, but it also significantly improves the performance of the algorithms. The uniform search prefetching four keys each iteration causes yet more cache misses for a slight performance improvement on large arrays. The unoptimized branching binary search causes the most cache misses, yet it is never the fastest implementation. Therefore the other algorithms should be preferred.

³ Performance event L2_RQSTS.ALL_PF
⁴ Branch instructions whose outcome was mispredicted, but that were not executed due to false speculation themselves.
⁵ Performance event BR_MISP_RETIRED.ALL_BRANCHES
⁶ Performance event LONGEST_LAT_CACHE.MISS
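The "prefetch 2 keys" scheme evaluated above can be sketched as follows (a minimal illustration, not the benchmarked implementation; the function name is ours, and `__builtin_prefetch` with locality 3 typically compiles to prefetcht0 on x86):

```cpp
#include <cstddef>
#include <cstdint>

// Branchless lower-bound binary search with software prefetching
// ("prefetch 2 keys"): each iteration issues a prefetch for both
// separators the next iteration might touch, so the following load
// hits a cache line that is already in flight.
std::size_t lower_bound_prefetch(const std::uint32_t* keys,
                                 std::size_t n,
                                 std::uint32_t needle) {
    std::size_t low = 0;
    while (n > 1) {
        const std::size_t half = n / 2;
        const std::size_t next = (n - half) / 2;  // next iteration's step
        __builtin_prefetch(keys + low + next, 0, 3);         // "not taken" candidate
        __builtin_prefetch(keys + low + half + next, 0, 3);  // "taken" candidate
        // Conditional move instead of a hard-to-predict branch:
        low += (keys[low + half] <= needle) ? half : 0;
        n -= half;
    }
    // Index of the last key <= needle (assumes a sorted, non-empty
    // array and needle >= keys[0]).
    return low;
}
```

Issuing prefetches for both candidate separators of the next iteration trades extra memory traffic for latency hiding, which matches the observation above that prefetching only pays off once the array outgrows the faster cache levels.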

[Figure 4 plots omitted: (a) run-time of the non-uniform search (linear scale), (b) run-time of the uniform search (linear scale), (c) run-time improvement by prefetching four keys instead of two, (d) run-time of the non-uniform search (log scale), (e) run-time of the uniform search (log scale), (f) L3 cache misses; all over the size in elements.]

Figure 4: Software prefetching applied to the non-uniform and uniform binary search.

4.2.3 Loop Unrolling

In Figure 5, we show the speedups obtained by loop unrolling. In the non-prefetching uniform binary search, loop unrolling has the most noticeable benefit for arrays of between 2^18 and 2^26 keys. If prefetching (four keys per iteration) is used, there are improvements of about 5% on arrays of less than 2^18 keys. We have excluded the variant prefetching two keys per iteration from this comparison, since the compiler did not generate branch-free code for the unrolled version. These results differ from our earlier evaluation [13], where loop unrolling had no benefit using a different compiler. We conclude that the effectiveness of loop unrolling in a high-level language like C++ depends on the compiler.

[Figure 5 plot omitted: relative speed of the unrolled variants (no prefetching, prefetch 4 keys) over the size in elements.]

Figure 5: Speedup achieved by unrolling the uniform binary search.

4.2.4 Exact-Match Search

In this section, we compare direct exact-match search implementations with implementations in terms of a lower-bound search. The former have an equality test in the search loop, terminating the search if the search key has been seen. In contrast, the latter have the same search loop as a lower-bound search and delay the equality test to the end. Since the search keys are drawn from the array being searched in (see Section 3), the search key is always found and the direct exact-match implementation can make use of its early exit condition. Consequently, the direct exact-match searches need almost precisely one iteration less than the lower-bound-based exact-match search.

[Figure 6 plots omitted: relative speed of the branchless (prefetch) and uniform (prefetch 2 keys) variants over the size in elements, for (a) search keys generated by Scheme 1 and (b) search keys generated by Scheme 2.]

Figure 6: Speedup obtained by a lower-bound-based exact-match search over a direct implementation.

Figure 6 shows the speedup obtained for a lower-bound search to find an exact match instead of using a direct implementation. On the left side, the search keys were drawn from the whole array (Scheme 1), and on the right side only from a small randomly determined subset (Scheme 2). In case of the (non-uniform) branchless binary search, the lower-bound-based search is almost always faster than the direct implementation. On the other hand, the lower-bound-based uniform exact-match search becomes inferior to the corresponding direct variant on large arrays, but only in Scheme 1. In all cases, the performance advantage of the lower-bound-based search algorithms diminishes on very large arrays.

In conclusion, removing the potentially expensive equality test from the search loop is often worth more than saving one search iteration, but cannot be recommended for arrays much larger than the processor cache.
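To make the lower-bound-based exact-match scheme concrete: the search loop contains no equality test, and a single comparison after the loop decides hit or miss (a minimal sketch with our own names; the inner loop stands for any of the lower-bound variants evaluated here):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal lower-bound search used for illustration: returns the index
// of the last key <= needle (0 if the needle is below every key).
std::size_t lower_bound_index(const std::uint32_t* keys, std::size_t n,
                              std::uint32_t needle) {
    std::size_t low = 0;
    while (n > 1) {
        const std::size_t half = n / 2;
        low += (keys[low + half] <= needle) ? half : 0;
        n -= half;
    }
    return low;
}

// Exact-match search in terms of the lower bound: the equality test is
// hoisted out of the loop and executed exactly once at the end.
bool contains(const std::uint32_t* keys, std::size_t n,
              std::uint32_t needle) {
    if (n == 0) return false;
    return keys[lower_bound_index(keys, n, needle)] == needle;
}
```

The direct variant would instead test for equality inside the loop and return early; the wrapper above gives up that early exit in exchange for a shorter, branch-poor loop body.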

4.2.5 Vectorization

We compared the runtimes of the vectorized binary search using the ptest instruction and the alternative implementation using a combination of pmovmskb and test/cmp. We found little to no difference between them.

The effect of optimizing the vectorized binary search by removing branches and adding software prefetching can be seen in Figure 7. Note that in contrast to the scalar binary search there is also a branching variant of the vectorized uniform binary search. Just eliminating a branch in the search loop again yields an algorithm becoming very slow on large arrays. A branching implementation with software prefetching offers the best performance for both the uniform and non-uniform algorithm. On smaller arrays of less than 2^15 keys, the optimizations do not make a large difference, with the exception of the branching uniform search with prefetching, where a branchless implementation should be preferred. An additional unrolling of the vectorized binary search did not yield any benefits.

[Figure 7 plots omitted: run-times of the branching, branchless, prefetch, and branchless + prefetch variants of the (a, c) non-uniform and (b, d) uniform vectorized binary search on smaller (linear scale) and larger (log scale) arrays.]

Figure 7: Branch elimination and prefetching in the vectorized (AVX2) binary search.

Figure 8 shows the runtimes of the fastest scalar and SIMD lower-bound search algorithms on smaller (≤ 4096 keys) and larger (> 4096 keys) arrays. In both cases, a vectorized non-uniform binary search is faster than the fastest scalar implementation. The opposite is true for the uniform binary search, where an optimized scalar implementation is clearly superior to the vectorized version. An optimized scalar uniform binary search is preferable over vectorization. We conclude that the additional complexity introduced by using SIMD increases the runtime more than it is decreased by slightly reducing the number of iterations.

[Figure 8 plots omitted: (a) smaller arrays (linear scale) and (b) larger arrays (log scale); scalar vs. vectorized branching (prefetch) and uniform (branchless) variants.]

Figure 8: Comparison of scalar and vectorized binary search implementations.

4.2.6 Cache Utilization and Offset Binary Search

The binary search algorithms evaluated so far exhibit an unfortunate interaction with the way the processor caches are indexed. Consider Figure 9 showing the runtime of an optimized (branching, prefetching) non-uniform and an optimized (branchless, prefetching 4 keys, unrolled) uniform binary search. In Figure 9a, the search keys were randomly selected from the whole array (Scheme 1 described in Section 3), and in Figure 9b, the keys were drawn from a small subset of array elements (Scheme 2). Once the number of keys surpasses the size of the L3 cache (20 MiB ≈ 2^22.3 keys), the runtime grows rapidly and the graphs become very spiky. The peaks occur at powers of two and are more pronounced if the better cacheable Scheme 2 is used to select the search keys, hinting at a problem with caching.

[Figure 9 plots omitted: run-times of the binary, uniform binary, and offset binary 3:5 searches over the size in elements, for (a) search keys generated by Scheme 1 and (b) search keys generated by Scheme 2.]

Figure 9: Inefficient cache usage of the binary search on large arrays.

The problems with the binary search on larger arrays, especially when the size is a power of two, are explained by Figure 10. It shows the number of unique memory addresses accessed, not including prefetching, falling into each set of the L1 cache. The data was collected over 10,000 search runs with randomly selected search keys on an array of 2^24 keys. The diagram only shows the regular non-uniform binary search and the offset binary search, but the uniform binary search behaves almost identically to the regular non-uniform variant. As we can see, the access distribution of the regular binary search is very spiky and therefore many separator elements compete for the same set in the cache. This results in an inefficient usage of the available space.

[Figure 10 plot omitted: number of unique memory addresses per L1 cache set for the binary search and the offset binary search 3:5.]

Figure 10: Number of unique memory accesses falling into each L1 cache set.

The aliasing of many separator elements to the same cache sets arises because of the formula used to index them (in our case of an 8-way set associative L1 cache: cache line address mod 64) and because the search algorithm selects separator keys approximately spaced powers of two apart. The effect is worst for the uniform binary search, since it directly uses powers of two to calculate the separator indices, and for the non-uniform binary search on arrays with a power-of-two size, since then most keys fall into the same cache set.
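The indexing formula is easy to check numerically. A sketch, assuming the 64-byte cache lines and 64 sets of the test machine's L1 cache (the function name is ours):

```cpp
#include <cstddef>

// L1 set index of a byte offset into the array, assuming 64-byte cache
// lines and 64 sets: set = cache line address mod 64, as in the text.
std::size_t l1_set(std::size_t byte_offset) {
    return (byte_offset / 64) % 64;
}
```

For an array of 2^24 32-bit keys, the first separators probed by a binary search lie at indices n/2, n/4, 3n/4, and so on; their byte offsets are large power-of-two multiples, so l1_set maps all of them to set 0 and they compete for the same eight ways.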

We recorded virtual addresses to generate the graphs in Figure 10. For the L1 cache this is correct, because our test machine actually indexes it with virtual addresses. The situation is more complex for the L2 and L3 cache, since higher cache levels are typically indexed with physical addresses. If the array is in a continuous range of physical memory, the plots would look like in Figure 10. Otherwise, the location of pages in physical memory and the distribution of search keys will influence the aliasing effect, giving the binary search an unreliable performance.

Fortunately, there is a way to remedy the inefficient cache usage of the binary search. By not dividing the search range in two (almost) equally sized halves but in an unequal ratio, we can avoid power-of-two-based memory accesses. The runtime of an offset binary search using a split ratio of 3:5 is plotted in Figure 9. If the search keys are generated by Scheme 1, it becomes the fastest binary search algorithm for more than 2^23 keys. In Scheme 2 it becomes the fastest algorithm even earlier, at about 2^18 keys.

The algorithm plotted in Figure 9 employs software prefetching. As we can see in Figure 11, this optimization provides the best performance on large arrays. We have not found a significant difference between equal split ratios in the left and right partitions and mirrored ratios. Since mirroring the split ratios adds additional complexity to the implementation, we do not recommend it.

[Figure 11 plots omitted: run-times of the branching, branchless, prefetch, and branchless + prefetch offset binary searches on (a) smaller arrays (linear scale) and (b) larger arrays (log scale).]

Figure 11: Runtime of the 3:5 offset binary searches.

4.2.7 Best Binary Search Algorithms

In Table 3, we summarize which binary search algorithm performs best depending on the size of the arrays to search in. Note that the break-even point between offset and regular binary searching strongly depends on the search key distribution. Distributions exhibiting a high locality of reference encounter caching issues earlier and favor the offset binary search more. For example, the offset binary search already becomes faster than the regular variants for 2^18 keys if Scheme 2 is used to generate the search keys.

Table 3: Best lower-bound binary search algorithms.

Keys          Best Lower-Bound Binary Search
< 40          Different algorithms; the unrolled branchless uniform binary search works consistently well
40 – 2^13     Unrolled branchless uniform binary search
2^13 – 2^23   Unrolled branchless uniform binary search prefetching keys two iterations in advance
> 2^23        Offset binary search (ratio 3:5) prefetching separator keys one iteration in advance

4.3 k-ary Search

The k-ary search aims at reducing the number of search iterations by using not 1 (binary search) but k − 1 separator elements (cf. Section 2.1.3). In Figure 12, we compare uniform and non-uniform variants of the scalar k-ary search. We show results with k from 3 to 9, since the performance quickly declines for larger k. The k plotted in Figure 12 are the ones yielding the best performance for each variant.

[Figure 12 plots omitted: run-times of the branching and branchless non-uniform and uniform k-ary searches (best k per variant) on (a) smaller arrays (linear scale) and (b) larger arrays (log scale).]

Figure 12: Runtimes of non-uniform and uniform k-ary search with and without branches in the loop.

For arrays of up to 2^17 keys, the branchless uniform ternary search is the fastest algorithm. On larger arrays, the branching uniform 9-ary search becomes faster. Similar to the binary search, the branching implementations profit from speculative execution and hardware prefetching, giving them better performance on larger arrays.

Note that the cache aliasing effects observed in the binary search also apply to the k-ary search, so powers of two are a bad choice for k on larger arrays. On smaller arrays of less than 2^16 keys however, the branching non-uniform k-ary search works well with k = 4 or 8, because they lead to simpler index computations.

4.3.1 Loop Unrolling

Concerning loop unrolling, we only considered the inner loop over the k − 1 separator elements. In all cases, this loop was automatically unrolled by the compiler, but we also tried to manually unroll these loops. For the uniform k-ary search this did not improve runtimes. However, the manually unrolled non-uniform k-ary search is slightly faster than the automatic unrolling by the compiler. The graphs in Figure 12 show the automatically unrolled uniform and the manually unrolled non-uniform implementations.

4.3.2 Prefetching

We augmented the branchless uniform k-ary search with software prefetching. Each iteration, the keys potentially needed for the next search step are prefetched. Figure 13 shows the results. It is clear that for larger k there is simply too much prefetching. In fact, k = 3 is the only case where prefetching improves the runtime. The improvements only occur for arrays larger than about 2^16 keys, but are quite substantial. The comparison between the 3-ary search with software prefetching and the 9-ary search without additional prefetching is interesting, because they both load eight search keys in each iteration: two keys are loaded for comparisons and six are prefetched in the prefetching 3-ary search, whereas in the non-prefetching 9-ary search eight keys are directly loaded for comparisons. As we can see, k = 3 with software prefetching is faster than k = 9 just relying on the hardware prefetcher. This could be due to a limit in line fill buffers on modern processors [1].
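To illustrate the structure of a k-ary step, a minimal branchless 3-ary lower-bound sketch (our own simplified formulation, not the exact uniform variant benchmarked here) probes k − 1 = 2 separators per iteration:

```cpp
#include <cstddef>
#include <cstdint>

// Branchless 3-ary lower-bound sketch: two separator probes per
// iteration, each compiled to a conditional move, cut the range to
// roughly a third. A final binary step handles a leftover pair.
std::size_t lower_bound_3ary(const std::uint32_t* keys,
                             std::size_t n,
                             std::uint32_t needle) {
    std::size_t low = 0;
    while (n > 2) {
        const std::size_t third = n / 3;
        low += (keys[low + third] <= needle) ? third : 0;
        low += (keys[low + third] <= needle) ? third : 0;
        n -= 2 * third;
    }
    if (n == 2)
        low += (keys[low + 1] <= needle) ? 1 : 0;
    // Index of the last key <= needle (assumes a sorted, non-empty
    // array and needle >= keys[0]).
    return low;
}
```

Compared to the binary step, each iteration issues more independent loads, which is exactly what makes the combination with software prefetching attractive for k = 3.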

[Figure 13 plots omitted: run-times of the branchless uniform k-ary search for k = 3, 5, 9 with and without prefetching on (a) smaller arrays (linear scale) and (b) larger arrays (log scale).]

Figure 13: Scalar branchless uniform k-ary search with software prefetching.

4.3.3 Exact-Match Search

We compared exact-match search implementations, terminating the search as soon as the search key has been seen, and implementations based on the lower bound, moving the equality tests out of the search loop, like we did for the binary search in Section 4.2.4. For this evaluation in Figure 14, we tested k = 3 and 5, since these give consistent performance across all implementations and array sizes. Note that the search keys are drawn from the array being searched in, so that all searches are successful (see Section 3).

Remember that the lower-bound-based exact-match binary search needs about one iteration more than the direct exact-match implementation. For the k-ary search we expect this difference to be lower, because the conceptual search tree has up to k times more leaves than internal nodes. This means the search is more likely to only discover the search key in the last iteration, the greater k is. For the uniform k-ary search we measured about 0.5 iterations difference if k = 3 and 0.25 iterations difference if k = 5. The differences for the non-uniform k-ary search are slightly larger with on average 0.6 and 0.4.

[Figure 14 plots omitted: speedup factors of the branching, branchless, uniform branching, and uniform branchless variants for (a) the ternary and (b) the 5-ary search.]

Figure 14: Speedup obtained by a lower-bound-based exact-match search over a direct execution.

A noticeable performance improvement obtained by using a lower-bound-based exact-match search occurs in the uniform branchless k-ary search. On arrays of up to 2^20 keys, the lower-bound-based approach is significantly faster than the direct implementation. However, for the non-uniform k-ary search, the direct implementation is slightly faster. For the branching uniform k-ary search, there is no preference.

4.3.4 Vectorization

In Figure 15, we compare the vectorized k-ary search using AVX2 with the corresponding scalar branchless uniform k-ary search. We include runtimes on arrays with the usual 32-bit keys (int32), but also with 64-bit keys (int64). The k of the scalar search is chosen to match the vectorized search with 256-bit AVX registers. In all cases, the vectorized search is slower than the scalar variant, due to its expensive fetching of separator keys into registers. Note that a linearized k-ary tree improves its performance drastically [13].

[Figure 15 plot omitted: run-times of the vectorized (int32, int64) and matching scalar (k = 9 int32, k = 5 int64) searches over the size in elements.]

Figure 15: Comparison of the vectorized and the scalar branchless uniform k-ary search.

4.3.5 Best k-ary Search Algorithms

The fastest k-ary search algorithm in our evaluation is the scalar branchless uniform k-ary search. On arrays of 2^14 or fewer keys it obtains its top performance with k = 3. If the array is larger, software prefetching should be used with the same algorithm and the same k. The runtime plots of these algorithms can be found in Figure 13. In Table 4, we summarize our results.

Table 4: Best lower-bound k-ary search algorithms.

Keys      Best Lower-Bound k-ary Search
≤ 2^14    Branchless uniform 3-ary search
> 2^14    Branchless uniform 3-ary search with prefetching

5. INTER-ALGORITHM EVALUATION

In this section, we want to determine the best combination of search algorithm and optimizations for searching in sorted arrays. To this end, we compare the runtimes of the best sequential, binary and k-ary search algorithms determined in the previous sections.

[Figure 16 plot omitted: run-times of the scalar and vectorized branching and branchless sequential searches, the unrolled uniform branchless binary search, and the uniform branchless 3-ary search on arrays of up to 100 elements.]

Figure 16: Fastest sequential, binary and k-ary search algorithms on small arrays.

In Figure 16, we consider small arrays of less than 100 keys, where sequential search algorithms might be superior to the more complicated binary and k-ary algorithms. This is the case for the vectorized branchless sequential search, which is the fastest algorithm for up to 48 elements. For larger arrays, the uniform branchless ternary search is faster.
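The branchless sequential search that wins on these smallest arrays is, at its core, a counting loop without an early exit; a scalar sketch (our own illustration, with a hypothetical name) that compilers can auto-vectorize:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of the branchless sequential search: instead of
// exiting early, count every key <= needle. The loop body contains no
// data-dependent branches, so the compiler can turn it into SIMD
// compare-and-accumulate instructions.
std::size_t count_leq(const std::uint32_t* keys, std::size_t n,
                      std::uint32_t needle) {
    std::size_t cnt = 0;
    for (std::size_t i = 0; i < n; ++i)
        cnt += (keys[i] <= needle) ? 1u : 0u;
    return cnt;  // number of keys <= needle; the lower-bound index is cnt - 1
}
```

Scanning every element costs O(n) comparisons, but for a few dozen keys the absence of branch mispredictions and the SIMD-friendly access pattern outweigh the extra work.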

[Figure 17 plot omitted: run-times of the unrolled uniform branchless binary search (with and without prefetching four keys), the offset binary search 3:5 (prefetch), and the uniform branchless 3-ary and 5-ary searches on arrays of up to 4096 keys.]

Figure 17: Binary and k-ary search variants on arrays of up to 4096 keys.

Figure 17 shows the runtimes of the fastest binary and k-ary search implementations on arrays of up to 4096 keys. Again, the uniform branchless ternary search performs best. Other promising algorithms are the unrolled branchless uniform binary search and the uniform branchless 5-ary search. Since the runtime of the uniform implementations increases in steps, both can sometimes be slightly faster than the ternary search. Between 2^13 keys (32 KiB, size of the L1 cache) and 2^16 keys (256 KiB, size of the L2 cache), the uniform branchless ternary search with prefetching becomes faster than the variant without prefetching. As shown in Figure 18, it is the fastest algorithm for more than 2^16 keys. Neither the optimized offset binary nor the uniform 5-ary search, which have a very similar runtime, reaches its performance. It is noteworthy that between 2^16 and 2^23 keys (approx. size of the L3 cache) the unrolled branchless uniform binary search with prefetching is slightly faster than the offset binary and uniform 5-ary search, but on larger arrays it does not use the cache efficiently and performs much worse.

[Figure 18 plot omitted: run-times of the same variants on large arrays (log scale).]

Figure 18: Fastest search algorithms on large arrays.

In Figure 19, we look at the number of L3 cache misses caused by the best performing search algorithms. Since the number of last level cache misses corresponds to the main memory bandwidth consumed by the algorithm, it might be preferable to select an algorithm with few L3 cache misses. Looking at the cache misses, we can differentiate between the offset binary search with prefetching and the uniform 5-ary search, which have almost the same runtime. Since the 5-ary search reaches this runtime with fewer cache misses than the prefetching offset binary search, it is the better algorithm.

[Figure 19 plot omitted: LLC misses of the same variants over the size in elements.]

Figure 19: Number of last level cache misses caused by optimized binary and k-ary search algorithms.

There are some noteworthy differences considering search keys generated by the better cacheable Scheme 2 instead of Scheme 1: for up to 2^16 keys, uniform binary search variants are faster than the uniform branchless ternary search. The unrolled branchless binary search is the fastest implementation. Furthermore, the prefetching uniform branchless ternary search has no advantage over the non-prefetching k-ary search variants. For large arrays of more than 2^17 keys, the unrolled uniform k-ary search algorithms work best, especially the unrolled uniform 9-ary search. Binary search variants are still far behind on large arrays. Table 5 summarizes which variants perform best on different array sizes. It also names variants almost as fast as the best algorithm, which might be worth considering.

Table 5: Best lower-bound search algorithms.

Keys          Best Lower-Bound Search
≤ 48          • Vectorized branchless sequential search
49 – 2^16     • Branchless uniform 3-ary search
              Alternatives:
              • Unrolled branchless binary search (esp. Scheme 2)
              • Branchless uniform 5-ary search
2^16 – 2^23   • Branchless uniform 3-ary search with prefetching
              Alternatives:
              • Unrolled branchless uniform binary search prefetching separator keys two iterations ahead
              • Branchless uniform 5-ary search
> 2^23        • Branchless uniform 3-ary search with prefetching
              Alternatives:
              • Branchless uniform 5-ary search
              • Branchless uniform 9-ary search (for Scheme 2)

5.1 Cache Warm-Up

Until now, we only considered runtimes after 10,000 warm-up iterations. Now, we describe the behavior of the best binary and k-ary search algorithms recommended in Table 5 during the warm-up phase. Note that the search keys were generated using Scheme 1 to capture the worst case of random accesses over the whole array. If there is more locality to the accesses, there can be much shorter warm-up periods.

For less than 2^13 keys (32 KiB, size of the L1 cache), there is only a very short warm-up phase of less than 50 iterations. As the array becomes larger, the number of warm-up iterations increases. For up to 2^18 keys the implementations employing prefetching reach their stable performance faster than other algorithms. Figure 20a shows the runtime of a single search after a certain number of preceding search runs, starting from a cold cache. Even though the non-prefetching 3-ary and 5-ary search reach an almost identical runtime to the prefetching 3-ary search, they only achieve it after about three times as many searches.

On larger arrays, the algorithms exhibit a similar number of warm-up iterations, but there is still a difference in the number of cache misses. Figure 20b shows the average number of L3 cache misses during a single search. Implementations utilizing software prefetching, or enabling more hardware prefetching like the 5-ary search, initially have a higher number of cache misses, but reach almost no cache misses after 15,000 search runs.

1560 (a) Runtimes searching in 216 keys 400 6. IMPLICATIONS TO THE DATABASE uniform branchless binary (unrolled) 350 uniform branchless binary pref. 4 keys (unrolled) DOMAIN offset binary search 3:5 (prefetch) 300 uniform branchless 3-ary A key result of our evaluation is that the best search algo- 250 uniform branchless 5-ary rithm clearly depends on the expected size of the array to be uniform branchless 3-ary (prefetch) 200

time in ns searched in. From this result, we see direct implications to 150 two different levels of database research: (1) hybrid searches

100 in sorted lists and (2) index structures.

50 0 500 1000 1500 2000 2500 3000 searches Hybrid Search Algorithms. In an abstract sense, the smal- (b) Cache misses searching in 221 keys 20 ler the remaining array to be searched in, the less separator uniform branchless binary (unrolled) keys should be examined to find a promising partition and uniform branchless binary pref. 4 keys (unrolled) 15 offset binary search 3:5 (prefetch) vice versa. Given this observation and rather stable break- uniform branchless 3-ary even points, it is a good idea to adapt a search algorithm to uniform branchless 5-ary 10 uniform branchless 3-ary (prefetch) use a mix of the proposed optimized search algorithms. The

LLC misses idea is to start a ternary search as long as the remaining

5 partition is large enough. Afterwards, the search follows a binary search till the remaining partition is in the 10s of keys

0 0 5000 10000 15000 20000 such that a sequential search is more beneficial. Notably, searches given a specific array size, a code generator could wire the steps of the search algorithms to cause minimal overhead. Figure 20: Runtime and LLC misses over the total number of search runs. Index Structures. There is a multitude of index structures that rely on sorted node entries and searching the right node entry to follow is an important operation. Improving this higher number of cache misses, but reach almost no cache search by choosing the right search algorithm within the misses after 15,000 search runs. 23 node will add a constant speedup to the overall tree search. Arrays of 2 and more keys are larger than the L3 cache There are two kinds of trees: (1) those having rather ho- (20 MiB), so LLC misses become unavoidable. Due to the in- mogeneous node sizes (e.g., B-Tree [5, 11]) and (2) having efficent cache usage of the non-offset binary search especially variable-length node sizes (e.g., Elf [3], Learned Index [8], with power of two array sizes, it has is almost no warm-up Adaptive Radix Tree [9]). For the former, a search algorithm phase anymore and immediately reaches a stable number of for the whole tree can be determined. However for the lat- LLC misses. The other algorithms reach stable runtimes af- ter, it is necessary to switch between the search algorithms ter about 10,000 searches, though slight improvements can depending on the size of the node. be observed after even more searches. Nevertheless, the general algorithm performance is not 7. CONCLUSION affected by the warm-up phase with the uniform branchless ternary search being the best algorithm for 216 keys. In this paper, we evaluated a large number of optimized search algorithms based on sequential, binary and k-ary 5.2 Hardware Differences searching. 
Index Structures. There is a multitude of index structures that rely on sorted node entries, and searching for the right node entry to follow is an important operation. Improving this search by choosing the right search algorithm within the node will add a constant speedup to the overall tree search. There are two kinds of trees: (1) those having rather homogeneous node sizes (e.g., the B-Tree [5, 11]) and (2) those having variable-length node sizes (e.g., Elf [3], the Learned Index [8], or the Adaptive Radix Tree [9]). For the former, a single search algorithm for the whole tree can be determined. For the latter, however, it is necessary to switch between search algorithms depending on the size of the node.
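As a minimal sketch of such size-dependent switching, a node search could dispatch between a sequential scan for small nodes and a binary search for larger ones. The threshold of 64 keys and all function names below are assumptions for illustration; a full implementation would also cover the vectorized and k-ary variants evaluated in this paper:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

namespace {
// Sequential lower-bound scan, preferable for the smallest nodes.
std::size_t sequential_search(const std::uint32_t* keys, std::size_t n,
                              std::uint32_t key) {
    std::size_t i = 0;
    while (i < n && keys[i] < key) ++i;
    return i;
}

// Stand-in for a uniform binary search over larger nodes.
std::size_t binary_search_lb(const std::uint32_t* keys, std::size_t n,
                             std::uint32_t key) {
    return static_cast<std::size_t>(
        std::lower_bound(keys, keys + n, key) - keys);
}
}  // namespace

// Dispatch by node size; 64 keys is an assumed, not measured, threshold.
std::size_t search_node(const std::uint32_t* keys, std::size_t n,
                        std::uint32_t key) {
    return n <= 64 ? sequential_search(keys, n, key)
                   : binary_search_lb(keys, n, key);
}
```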
7. CONCLUSION

In this paper, we evaluated a large number of optimized search algorithms based on sequential, binary, and k-ary searching. We showed that a combination of algorithmic variations and hardware-sensitive optimizations yields significant performance improvements over a standard implementation. Which combination of algorithm and optimizations offers the best performance is first and foremost dependent on the dataset size. Vectorized sequential searching is best suited to the smallest arrays, while uniform binary and ternary searching work best for medium sizes. For the largest arrays, software prefetching and k-ary search algorithms with larger k become worthwhile. Another important factor in choosing the best algorithm is the distribution of the search keys. If a small subset of keys is accessed disproportionately often, fewer keys need to be held in the processor caches. Therefore, the boundaries of an algorithm's preferred array sizes shift and different cache usage patterns become advantageous. What exact array sizes and search key distributions are favored by which class of algorithms and optimizations will depend on hardware factors like the available SIMD bandwidth and the size of the processor caches. Nevertheless, our results should provide a guideline for optimizing search algorithms taking the properties of modern hardware into account.

Acknowledgements. We thank the reviewers for their valuable feedback. This work was partially funded by the DFG (grant no.: SA 465/50-1 & SA 465/51-1).

8. REFERENCES

[1] Intel 64 and IA-32 Architectures Optimization Reference Manual, Nov. 2016.
[2] Intel 64 and IA-32 Architectures Software Developer's Manual, Dec. 2016.
[3] D. Broneske, V. Köppen, G. Saake, and M. Schäler. Accelerating multi-column selection predicates in main-memory - the Elf approach. In Proceedings of the International Conference on Data Engineering (ICDE), pages 647–658. IEEE, 2017.
[4] D. Broneske and G. Saake. Exploiting capabilities of modern processors in data intensive applications. it - Information Technology, 59(3):133–140, 2017.
[5] G. Graefe and P.-A. Larson. B-tree indexes and CPU caches. In Proceedings of the International Conference on Data Engineering (ICDE), pages 349–358. IEEE, 2001.
[6] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, fifth edition, 2011.
[7] D. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, second edition, 1998.
[8] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 489–504, 2018.
[9] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In Proceedings of the International Conference on Data Engineering (ICDE), pages 38–49. IEEE, 2013.
[10] R. Lesuisse. Some lessons drawn from the history of the binary search algorithm. The Computer Journal, 26(2):154–163, 1983.
[11] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In ACM SIGMOD Record, volume 29, pages 475–486. ACM, 2000.
[12] B. Schlegel, R. Gemulla, and W. Lehner. k-ary search on modern processors. In Proceedings of the International Workshop on Data Management on New Hardware (DAMON), pages 52–60. ACM, 2009.
[13] L.-C. Schulz. Searching in sorted lists on modern processors. Bachelor's thesis, University of Magdeburg, 2017. Available online at http://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/thesisSchulz17.pdf.
[14] S. Zeuch, F. Huber, and J.-C. Freytag. Adapting tree structures for processing with SIMD instructions. In Proceedings of the International Conference on Extending Database Technology (EDBT). OpenProceedings.org, 2014.
[15] J. Zhou and K. A. Ross. Implementing database operations using SIMD instructions. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 145–156. ACM, 2002.
