
Fast Sorted-Set Intersection using SIMD Instructions

Benjamin Schlegel, TU Dresden, Dresden, Germany, benjamin.schlegel@tu-dresden.de
Thomas Willhalm, Intel GmbH, Munich, Germany, [email protected]
Wolfgang Lehner, TU Dresden, Dresden, Germany, wolfgang.lehner@tu-dresden.de

ABSTRACT

In this paper, we focus on sorted-set intersection, which is an important part of many algorithms, e.g., RID-list intersection, inverted indexes, and others. In contrast to traditional scalar sorted-set intersection algorithms that try to reduce the number of comparisons, we propose a parallel algorithm that relies on speculative execution of comparisons. In general, our algorithm requires more comparisons but fewer instructions than scalar algorithms, which translates into better overall speed. We achieve this by utilizing efficient single-instruction-multiple-data (SIMD) instructions that are available in many processors. We provide different sorted-set intersection algorithms for different data types. We propose versions that use uncompressed integer values as input and output as well as a version that uses a tailor-made data layout for even faster intersections. In our experiments, we achieve speedups of up to 5.3x compared to popular fast scalar algorithms.

1. INTRODUCTION

Sorted-set intersection is a fundamental operation in query processing in the area of databases and information retrieval. It is part of many algorithms and often accounts for a large fraction of the overall runtime. Examples are inverted indexes in information retrieval [25], list intersection in frequent-itemset mining [1, 28], and merging of RID-lists in database query processing [19]. In many of these application areas, low latencies are a key concern, so reducing the execution time of set intersection is very important.

There has been a lot of research with the goal of improving sorted-set intersection. Many approaches focus on speeding up sequential intersection [4, 5, 9, 11, 13, 20] by using efficient data structures or improved processing techniques. Other approaches focus on utilizing modern hardware like graphics processors (GPUs) [1, 2, 12, 26, 27] and multi-core CPUs [23, 24] to exploit the parallelism offered by these processors. However, the main focus of these approaches is only on thread-level parallelism. Data-level parallelism has so far not been considered, although it is available via SIMD instruction sets in almost all modern CPUs. For this reason, our goal in this paper is to speed up the intersection of sorted sets using SIMD capabilities.

Utilization of SIMD capabilities in algorithms is ideally achieved through automatic vectorization by compilers or by inserting SIMD instructions manually. However, many compilers detect vectorization opportunities only for simple loop constructs with few or no data dependencies. In all other cases, hand-tuned assembly or intrinsics must be used. Unfortunately, all sorted-set intersection algorithms have complex data dependencies, so automatic vectorization cannot be applied.

In this paper, we introduce parallel sorted-set intersection algorithms that rely on speculative execution. Our main intention is to speculatively execute more than the necessary comparisons done by the scalar algorithms. To do this efficiently, we use the string and text processing new instructions (STTNI) that are part of the Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE 4.2).1 These instructions allow a fast full comparison of either eight 16-bit values (= 64 comparisons) or sixteen 8-bit values (= 256 comparisons) with only one instruction. Many of these comparisons are useless and would not be executed by scalar algorithms. However, since these instructions themselves require only 8 cycles to complete [14], we achieve significant speedups.

In summary, our contributions are as follows:

• We propose fast parallel sorted-set intersection algorithms for 8-bit, 16-bit, and 32-bit integer values based on STTNI of Intel SSE 4.2. The algorithms use uncompressed integer values as input and output. We explain in detail the necessary SIMD instructions and steps of the parallel intersection.

• We present a hierarchical data layout that is tailor-made for a fast parallel intersection of integer values with a precision higher than 16 bits.

• We compare our parallel algorithms with two highly efficient scalar versions on synthetic datasets.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at: The Second International Workshop on Accelerating Data Management Systems using Modern Processor and Storage Architectures (ADMS'11). Copyright 2011.

1 Intel and Core are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others.

The scope of our paper is as follows. We focus solely on sorted-set intersection of integer values without duplicates. Usually, this is not a limitation since both of our example scenarios—information retrieval and frequent-itemset mining—deal only with integer values without any duplicates. Nonetheless, our parallel algorithms could be extended to allow duplicates by storing the frequencies of the integer values. We further assume that all sets being intersected are already available in main memory. This holds true for all of our application areas since the data is either loaded or generated in main memory during pre-processing or kept in memory all the time, e.g., in in-memory databases.

2. BACKGROUND AND RELATED WORK

In this section, we survey related work and explain a typical scalar sorted-set intersection algorithm. In the remainder, we present the SIMD instructions required for our parallel algorithms.

  int intersect(short *A, short *B, int l_a, int l_b, short *C) {
    int count = 0;
    int i_a = 0, i_b = 0;

    while (i_a < l_a && i_b < l_b) {
      if (A[i_a] == B[i_b]) {
        C[count++] = A[i_a];
        i_a++;
        i_b++;
      }
      else if (A[i_a] > B[i_b])
        i_b++;
      else
        i_a++;
    }
    return count;
  }

Figure 1: Code snippet for scalar sorted-set intersection with branches for 16-bit integer values.

2.1 Related Work

We focus on set intersection where both sorted sets have a similar cardinality. If the difference in cardinality of the two sets is very large, search techniques (e.g., binary or interpolation search) are used to find corresponding items in the larger set [13]. For sets with similar cardinality, the well-known merge algorithm—similar to the merge step of merge-sort [18]—is used. Prior research provided many optimizations for this algorithm that can be divided into three classes: (1) adaptive intersection, (2) hierarchical intersection, and (3) parallel intersection.

Adaptive algorithms [3, 4, 5, 9] focus on reducing the number of comparisons during the intersection; this is achieved by combining the merge algorithm with various search techniques. This works very well when applied to skewed data. However, adaptivity sometimes introduces a large overhead that offsets the benefit of the fewer comparisons [10, 11]. Our main idea to execute even more than the necessary comparisons seems contrary to the adaptive approaches. Nevertheless, a combination of our approach with adaptive algorithms would be interesting to reduce the number of full comparisons.

Besides the pure merge approaches, there are also hierarchical approaches that utilize different—often tree-like—data structures to speed up the intersection. The main advantage is a faster insertion of new integer values into the sets. Examples of such data structures are AVL trees [7] and treaps [6]. We will show that many of these hierarchical approaches can be combined with our parallel algorithms. Other approaches represent sets using two levels, an upper and a lower level. Sanders [20] provides a two-level representation in which an upper set contains entry points for a number of lower sets. The goal is to use compression for the lower-level sets and allow binary search on the upper-level set. Tsirogiannis [24] partitions a set into subsets until these subsets consist of only uniformly distributed values. This increases the effect of interpolation search within the subsets. A small partition table is used for the upper-level intersection. This approach is orthogonal to our work; intersection of the small subsets could be executed using our data-parallel algorithms. In recent work, Ding [11] provides a two-level approach based on hashing. The values of a set are mapped to bitmaps tailor-made for the word length of a processor. The intersection is then executed by bit-wise ANDing of the bitmaps. The approach works very well if the cardinality of the result set is small. Otherwise, the reconstruction of the bitmaps to actual integer values usually offsets any benefit of the fast bitmap merging.

Parallel set intersection algorithms [23, 24] focus on utilizing multi-core processors to speed up the intersection process. Tsirogiannis [24] proposes a set partitioning that allows efficient load balancing for multi-core processors. Furthermore, there is a large number of algorithms tailor-made for GPUs [1, 2, 12, 26, 27]; all these works focus on massive thread-level parallelism and efficient data layouts for set intersection on GPUs. In summary, all of these parallel algorithms focus solely on thread-level parallelism. In this paper, we provide the first intersection algorithm that utilizes the data-level parallelism offered by modern CPUs. To the best of our knowledge, there exists no such work so far.

Apart from the set intersection algorithms, there are several algorithms that benefit from STTNI. Clearly, the most obvious use-cases are string and text processing algorithms; a large number of code samples that were optimized using STTNI can be found in Chapter 11 of Intel's Optimization Reference Manual [14]. This includes, amongst others, null character identification, string comparison, string token extraction, word counting, sub-string searching, and string-to-integer conversion. More application-specific algorithms like accelerated XML parsing and XML schema validation [17] are further typical candidates since they heavily rely on string processing. However, other algorithms that have, at first sight, nothing in common with string processing also benefit from STTNI. Shi [22] proposes optimized search algorithms for searching in arrays, trees, and hash tables; these algorithms achieve high speedups compared to their scalar counterparts. Nevertheless, none of these works considers accelerating sorted-set intersection using STTNI.

2.2 Scalar sorted-set intersection

Sorted-set intersection requires two sorted sets as input and outputs a sorted set that contains the common values of both input sets. Let A and B be two sorted sets of length la and lb, respectively, and C the result set. The scalar algorithm iteratively compares two values taken from each set; it starts with the first value of both sets. Whenever the two compared values are equal, the value is written into the result set and a counter—that counts the number of common values—is incremented. Then, the next value of each set is loaded for the next comparison.

Figure 2: Scalar intersection of two sorted sets with 24 comparisons; the resulting intersection set consists of 7 elements (gray shaded). [The figure shows the sets A = 1, 2, 3, 9, 11, 12, 16, 17, 18, 21, 22, 26, 27, 32, 34, 35 and B = 1, 6, 7, 8, 16, 17, 18, 19, 23, 26, 28, 30, 31, 32, 33, 34, with successful and unsuccessful probes marked.]

If the compared values are unequal, then only the next value of the sorted set with the smaller value is loaded and compared with the current value of the other set. The algorithm finishes when there is no next value in one of the two sets. Furthermore, indices ia and ib are used for maintaining the position of the next value of each set. Figure 1 shows the C code for an implementation of this algorithm.

The worst-case number of comparisons for intersecting two sorted sets of length la and lb is given by la + lb − 1; in this case, the sets have no common value. The best case requires only min(la, lb) comparisons. Figure 2 illustrates the scalar sorted-set intersection with an example. Here, 24 comparisons are necessary to find the 7 common values.

With a data-parallel version of the scalar algorithm in mind, the following characteristics of this algorithm are important. The scalar code consists only of comparison, arithmetic, and load/store instructions. Furthermore, there is no defined pattern for increasing the indices ia and ib because whether the next value is loaded from set A or set B depends on the values being intersected. Consider the example in Figure 2. Here, the indices (0-based) are increased for the first five comparisons as follows: (ia = 0, ib = 0), (ia = 1, ib = 1), (ia = 3, ib = 2), (ia = 4, ib = 2), and (ia = 4, ib = 3). This behaviour plays an important role for a data-parallel version of the merge algorithm.

2.3 SIMD capabilities

Initially introduced with MMX for the Intel Pentium processor, there now exist a large number of SIMD instruction sets like 3DNow!, different revisions of SSE, and AVX for modern x86 processors. Compared to traditional scalar instructions that process only a single data element at once, a SIMD instruction can process k elements at once; k itself depends on the capabilities of the instruction set as well as on the data type. For example, most SSE instructions can process four 32-bit or eight 16-bit values at once, while AVX instructions allow processing eight 32-bit values. SIMD instructions can be classified into load and store, element-wise, and horizontal instructions. In this paper, we require instructions of all three classes.

Load and store instructions copy data from main memory into the SIMD registers and vice versa. For our algorithms, we require aligned as well as unaligned load and store instructions. For current processors, there is no performance difference whether aligned or unaligned memory load/store instructions are used; however, unaligned loads can cause cacheline splits. Note that loading and storing of vector registers is restricted to continuous memory chunks, i.e., if four 32-bit values should be loaded into a vector register with one instruction, all four values must reside in a continuous block of 128 bits. Scatter and gather operations—that allow loading/storing data for each of the k elements independently—are not supported by current and near-future processors.2

Element-wise instructions execute an operation element-wise on one or more input vectors, i.e., the n-th element of the output vector only depends on the n-th element of each input vector. There exists a large number of element-wise instructions including arithmetic, comparison, shift, conversion, and logical instructions. For the algorithms in this work, we only need element-wise arithmetic and conversion instructions. In more detail, we require the conversion instructions PUNPCKHWD and PUNPCKLWD that take two vectors as input and mix the eight upper (PUNPCKHWD) or eight lower (PUNPCKLWD) 16-bit elements of both vectors and write them into an output vector.

Horizontal instructions act across the elements of a SIMD register. Here, the n-th element of the output vector can possibly depend on all elements of the input vectors. In this paper, we make extensive use of one of the four instructions of STTNI. Originally intended for string comparison, the four instructions differ only in the output format (index or bitmask) and whether the length specification of the processed strings is given explicitly or implicitly (using a '0'-termination symbol). The execution procedure of all four instructions consists of three stages: (1) full comparison of two input vectors, both consisting of either sixteen 8-bit values or eight 16-bit values; (2) aggregation of the intermediate comparison result using one of four different aggregation functions; (3) the final output is returned. For our algorithms, we use the "equal any" aggregation function; this function ORs the comparison results per element of the second input vector. In other words, if there is at least one element in the first input vector equal to the n-th element of the second input vector, then the n-th bit in the aggregated bitmask is set to one. Two instructions of STTNI return the aggregated bitmask, while the other two return the position of the first or last "one" bit in the aggregated bitmask. In our algorithms, we use the PCMPESTRM instruction, which requires an explicit length specification and returns an aggregated mask. Figure 3 illustrates the intersection of two sets—each represented as a vector with 8 elements—using the PCMPESTRM instruction.

Figure 3: Parallel intersection of two vectors using the PCMPESTRM instruction. The intermediate result matrix holds the results of the full comparison and is aggregated using the "equal any" aggregation method. The result bitmask indicates that the first, seventh, and eighth value of vector A are also found in vector B.

Besides the PCMPESTRM instruction, we make use of the byte-shuffle instruction PSHUFB, which is part of the SSSE3 instruction set and performs arbitrary in-place shuffles of the bytes in a SIMD register. The byte permutation itself is controlled by a shuffle control mask.

All required instructions are supported by Intel processors starting with the Nehalem architecture [16] and forthcoming AMD processors starting with the Bulldozer architecture [8]. More details about the used instructions are available in the Intel Software Developer's Manual [15].

2 Gather instructions have just been announced for Haswell.

3. PARALLEL SET INTERSECTION

In general, exploiting SIMD capabilities for sorted-set intersection could be done by either executing k sorted-set intersections in parallel or speeding up the execution of a single sorted-set intersection. We will first discuss why SIMD capabilities cannot be efficiently utilized for the first option and will then propose our solution for the second option.

3.1 Multiple intersections

The standard approach for exploiting SIMD instructions would be to execute k independent set intersections in parallel instead of a single scalar set intersection. This could be achieved by exchanging scalar instructions with SIMD instructions.

  int intersect(short *A, short *B, int l_a, int l_b, short *C) {
    int count = 0;
    short i_a = 0, i_b = 0;

    while (i_a < l_a && i_b < l_b) {
      // 1. Load the vectors
      __m128i v_a = _mm_load_si128((__m128i*)&A[i_a]);
      __m128i v_b = _mm_load_si128((__m128i*)&B[i_b]);
      // 2. Full comparison
      __m128i res_v = _mm_cmpestrm(v_b, 8, v_a, 8,
          _SIDD_UWORD_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_BIT_MASK);
      int r = _mm_extract_epi32(res_v, 0);
      unsigned short a7 = _mm_extract_epi16(v_a, 7);
      unsigned short b7 = _mm_extract_epi16(v_b, 7);
      i_a += (a7 <= b7) * 8;
      i_b += (a7 >= b7) * 8;
      // 3. Write back common values
      __m128i p = _mm_shuffle_epi8(v_a, sh_mask[r]);
      _mm_storeu_si128((__m128i*)&C[count], p);
      count += _mm_popcnt_u32(r);
    }
    return count;
  }

Figure 4: Code snippet for parallel sorted-set intersection of 16-bit integer values using SIMD instructions.
Scalar comparison instructions as well as scalar arithmetic instructions could be exchanged, since SSE provides SIMD counterparts for these instructions (see Section 2.3). However, the main problem of this approach is that the available load and store instructions are restricted to loading and storing continuous memory chunks. In each step, the scalar merge requires loading one value from each set; for k parallel merges, this implies loading k values from the k upper sets and k values from the k lower sets. Since, as mentioned before, the increasing of a set's indices during a scalar merge depends on the values being intersected, the indices of the k upper sets as well as the indices of the k lower sets are increased independently of each other. For this reason, it is not possible to load the k values from k sets from a continuous chunk of memory using the aforementioned load instructions. Scatter and gather instructions would be a solution to this problem, but these instructions are not supported by the current and next generation of processors. So the only way is to fill up the vector registers with 2·k scalar load instructions, introducing a high sequential fraction into the algorithm.

Another problem is that the sorted sets being intersected should have similar lengths, because the achievable speedup is reduced if some of the k merges finish earlier. For these two reasons, we focus on the second option: exploiting SIMD instructions to speed up the intersection of two sorted sets.

3.2 Basic intersection

The parallel merge algorithm is based on speculative execution of comparisons. While the scalar merge compares in each iteration only the two values indexed by ia and ib of both input sets, the main idea of the parallel merge is to also compare values that are beyond the current index values. Usually, such an approach would not be beneficial because of the additional overhead for the large percentage of unnecessary comparisons. Nevertheless, because the PCMPESTRM instruction allows a large number of comparisons in almost the same amount of time as a single comparison, we can achieve a high speedup despite the many unnecessary comparisons.

Since all the instructions of STTNI are restricted to 8-bit and 16-bit integer values, the basic parallel sorted-set intersection can only process integer values of such a precision. In the following, we focus on the parallel merge of two sets with 16-bit values; the version for 8-bit values works similarly.

Similar to the scalar version, the input of the parallel algorithm are two sorted arrays A and B with lengths la and lb, respectively. Both arrays are 16-byte aligned and, for ease of explanation, we assume that the length of each array is a multiple of 8. Furthermore, two variables ia = 0 and ib = 0 are required to indicate already processed values in the arrays A and B, respectively. Finally, the output consisting of the common values of A and B is written back to an array C. The number of common values is stored in count. The major steps of the algorithm are: (1) load the data into the vector registers, (2) do a full comparison of both vectors using the PCMPESTRM instruction, and (3) write back and count the common values. We will now explain the algorithm in more detail. The main loop is executed as long as there are unprocessed values in both sets (ia < la and ib < lb).

1. Load both vectors. Eight consecutive values of the array A starting at position ia are loaded into a vector register va. Similarly, the vector register vb is filled with eight consecutive values of B starting at ib. Both loads are executed using two aligned load instructions.

2. Fully compare both vectors. The full comparison of va and vb is executed using the PCMPESTRM instruction; the comparison result—a bitmask consisting of 8 bits—is stored in a variable r. Each "one" bit in r indicates a common value in va and vb. More specifically, a bit at position n in r is set to one if there is a value in vb equal to the n-th value in va. Furthermore, a scalar comparison of the respective last values of va and vb is used to select the values of A and B for the next full comparison. If the last value of va is smaller than the last value of vb, then ia is increased by 8. Otherwise, ib is increased by 8. In case the last values of va and vb are equal, both indices ia and ib are increased by 8. This ensures that all common values are found once and only once.

3. Write back common values. Finally, the common values stored in va are packed together and stored in C. This packing is performed using the byte permutation (shuffle) instruction. The required permutation mask is obtained by using the bitmask r as an index into a permutation mask lookup table. Finally, an unaligned store instruction is used to write back the packed common values. The number of common values is calculated using a population count instruction on the bitmask r.

The C code of the algorithm is shown in Figure 4,3 while Figure 5 illustrates the parallel merge on the same sorted sets as used before (cf. Figure 2). Each set consists of 16 values. The first step is the full comparison of the first eight values of both vectors. After this comparison, the common values 1, 16, and 17 are found and written back to the result vector (not shown). The comparison of the last value of each vector—17 for the upper vector and 19 for the lower vector—indicates that all values smaller than or equal to 17 are already merged. However, since the values 18 and 19 from the lower set may also be in the upper set, only the index of the upper set is increased by 8. After the second full comparison, the value 18 is found. The last full comparison (second 8 values of both sets) reveals the values 26, 32, and 34. After only three full comparisons and two scalar comparisons, all common values are found.

Figure 5: Parallel linear merge of two sorted sets with only three full comparisons and two end-pair comparisons. These comparisons of the particular last value of each vector (bold) are used to decide which vectors are compared next.

The parallel merge requires more comparisons than the scalar merge. For the 16-bit parallel merge, the best case occurs if both sets are identical and the length of both sets is a multiple of 8. Given that, the number of comparisons is given by 8·la because after each full comparison the algorithm proceeds in both sets by 8 values. The worst-case number of comparisons is given by 16·la − 64. In this case, both sets have the same cardinality, but the last values of all fully compared vectors are always unequal, so that the algorithm proceeds after a full comparison in only one set by 8 values. Under the unrealistic assumption that the PCMPESTRM instruction can process the 64 comparisons in the same time as a single scalar comparison instruction, the speedup for the parallel merge would be in a range of 4x to 8x. Similar considerations lead to a potential speedup from 8x to 16x for the parallel merge of 8-bit values.

3 The code snippet has been adapted for better readability and differs from the source code used for the experiments.

3.3 Hierarchical intersection

So far, we presented a parallel sorted-set intersection algorithm for 8-bit and 16-bit integer values. Since STTNI can only be applied to such integer values, we propose a hierarchical intersection approach that allows processing integers of higher precision (> 16 bits). We make use of the fact that each integer value of a given domain D can be divided into h upper bits and 16 lower bits, where h + 16 is the precision of the integer; e.g., h = 16 for 32-bit integer values. Using this, we can partition the set A into disjoint subsets A1, A2, ..., Am where each subset Ai consists of integer values that share the same upper h bits. The set B is partitioned into B1, B2, ..., Bn in the same way. Based on that, hierarchical intersection works as follows:

Upper-level intersection is done by merging the subsets A1, A2, ..., Am and B1, B2, ..., Bn based on the shared h bits of each subset. For each pair Ai and Bj that share the same h bits, the lower-level intersection is executed.

Lower-level intersection merges two subsets Ai and Bj based on the lower 16 bits of each value. This is done using our parallel 16-bit intersection algorithm.

In summary, we have a single upper-level intersection and multiple lower-level intersections. The upper-level intersection could be executed based on the scalar merge (cf. Section 2.2) or also using an intersection based on STTNI. However, an important observation is that in most cases the upper-level intersection is responsible for only a small fraction of the execution time of the complete intersection process. In general, the fraction of upper-level and lower-level execution time depends on the average cardinality of the subsets. Roughly speaking, the higher the cardinality of the subsets, the more of the execution time is spent on the intersection of the subsets. For this reason, we use a scalar algorithm for the upper-level intersection and our 16-bit parallel algorithm for the lower-level intersections.

We provide two different hierarchical parallel merge algorithms. The first algorithm processes uncompressed integer values and requires pre- and post-processing to partition these values into subsets during processing. The second algorithm uses a tailor-made data structure to avoid the overhead of such processing during the intersection. We will explain both algorithms for 32-bit values only; the steps are similar for values of higher precision.

In the first algorithm, we execute pre- and post-processing during the intersection to avoid the large intermediate results that would be necessary if the phases were separated from the complete intersection process. Instead, we maintain only two buffers, each with 65,536 elements. Whenever a lower-level intersection is executed, we copy the lower 16-bit values of both subsets being intersected into the buffers. We achieve this by using the PSHUFB instruction to move the lower 16 bits of all values to the lower part of a SIMD register. We copy the lower parts of two such "shuffled" SIMD registers into the respective buffer using a PUNPCKHWD and a store instruction. After all values of both subsets are copied into their assigned buffers, the pre-processing for this lower-level intersection is finished and the actual parallel 16-bit intersection is executed.4 The separation of the complete copying of the 16-bit values and their intersection afterwards avoids complex code that would require many branches and would probably be slow. However, post-processing is executed without buffering. It is used to concatenate the upper 16 bits with each 16-bit value of the lower-level intersection result. We utilize the instructions PUNPCKHWD and PUNPCKLWD for this concatenation. The upper 16 bits are stored replicated in the first register, while the second register contains the intersection results. After each intersection iteration, we execute both conversion instructions followed by two unaligned store instructions.

The second algorithm avoids the pre- and post-processing overhead by using a tailor-made data layout for the input sets. The main idea is to store the values of a set already partitioned using two levels; all data of a set is stored in a continuous memory area consisting of an array of 16-bit values. The subsets A1, A2, ..., Am of A are then stored as follows: starting at the array, the shared 16 bits of the values in A1 are stored once using a single cell. The next cell contains the cardinality of A1. A single cell for the cardinality is sufficient since the maximum number of values in a subset never exceeds 65,536. The next |A1| cells comprise the lower 16 bits of the values of A1. After this, the next subsets A2 to Am are stored in a similar way.

The two-level data layout often has lower memory requirements than the original representation of the uncompressed integer values it consists of. This is because the upper bits of all values of a subset are stored only once. Only in the worst case—each subset contains only one element—do we have an additional overhead of 16 bits for the subset's length field per integer value. However, this case is very unusual for realistic datasets. For example, document ids in inverted indexes normally have a sequential numbering. Furthermore, already for an average cardinality of 2, the memory requirements are the same as for storing uncompressed 32-bit values. For higher average cardinality, the memory requirements approach 16 bits per value.

Conversion of the two-level data layout into 32-bit values can be done with almost no overhead during intersection using the post-processing of the first hierarchical algorithm. As indicated by experiments, there is only a small difference in the execution times whether the intersection result is written back in the two-level data layout or as uncompressed 32-bit integer values.

4 Each 8 consecutive values in the buffers are not ordered anymore. The full comparison still works, but we have to adapt the shuffle control masks that are required for packing the result in each iteration of the intersection.

4. EXPERIMENTAL EVALUATION

This section contains the results of our experimental evaluation. We first give an overview of the used setup and then show the results of our experiments.

4.1 Setup

We implemented several algorithms using C as the programming language. More specifically, we implemented two scalar versions. The first version (branch) uses branches in the main loop for increasing the indices (cf. Figure 1), while the second version (branchless) is based on predication5 and has no branches in the main loop. We optimized both versions by hand. As we will show, the usual goal of avoiding branches does not always lead to the best results, since under certain circumstances (perfect branch prediction) code with branches can be very fast. Each of the scalar versions is instrumented with C preprocessor directives to obtain versions tailor-made for different integer precisions (i.e., 8-bit, 16-bit, and 32-bit). Furthermore, we implemented four parallel algorithms using STTNI. The first three algorithms—SIMD8, SIMD16, and SIMD32—differ only in the precision of the integers they process. All three algorithms use the same interface as the scalar algorithms; they process and output uncompressed integer values. Finally, the fourth parallel version (SIMD32-H) is based on the proposed hierarchical data layout, which is used as input and output of the algorithm. We used C intrinsics to insert SIMD instructions. All algorithms were compiled using Intel Parallel Composer 2011. We also tried the GCC compiler, but the generated code was slower in all experiments.

We evaluated all algorithms on a system comprised of an Intel(R) Core(TM) i7-920 processor with a core frequency of 2.67GHz and 6GB of main memory. We used Linux (2.6.34) as the operating system. In all experiments, we measured the wall-clock time for the intersection process only. All input datasets were main-memory resident. Furthermore, the output is written to preallocated memory so that no time is required for allocating memory (e.g., malloc calls).

We generated synthetic datasets consisting of uniformly distributed values. We varied the selectivity—denoted as the fraction of the number of common values of both sets and the cardinality of the smaller set—of the datasets by changing the domain of the values. A greater domain leads to a smaller selectivity and vice versa. In each experiment, we intersected in total 10 million values with another 10 million values. Since the maximum number of integers of the 8-bit and 16-bit versions is restricted to 256 and 65,536, respectively, we executed multiple merges for these versions: 78,125 merges of sets each with 128 elements for the 8-bit versions and 5,000 merges of sets each with 2,000 elements for the 16-bit versions.

Figure 6: Varying the selectivity for 78,125 sorted-set intersections on sets with 128 elements and 8-bit integer values. [The plot shows runtime (ms) over selectivities from 50% to 100% for the Branch, Branchless, and SIMD8 algorithms.]

4.2 Results
In the first experiment, we compared the execution time 5Transforming branches into data dependencies to avoid for intersecting 8-bit and 16-bit sorted sets. For that, we mispredicted branches. compared both scalar versions, branch and branchless, with average number of elements that share the same upper 16 bits 120 Branch 32k 4k 512 64 8 1 100 500
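For reference, the two scalar baselines can be sketched as follows. This is a minimal sketch with our own function names; the paper's actual implementations are hand-optimized and instrumented with preprocessor directives for different integer precisions.

```c
#include <stddef.h>
#include <stdint.h>

/* branch: compare the heads of both sets and advance via if/else
 * branches. Fast when the branch predictor does well (high selectivity). */
size_t intersect_branch(const uint32_t *a, size_t la,
                        const uint32_t *b, size_t lb, uint32_t *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < la && j < lb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {            /* match: emit and advance both indices */
            out[k++] = a[i];
            i++; j++;
        }
    }
    return k;
}

/* branchless: predication — replace the branches with data dependencies.
 * The candidate is always stored; the output cursor advances only on a
 * match, so "out" needs room for one element beyond the match count. */
size_t intersect_branchless(const uint32_t *a, size_t la,
                            const uint32_t *b, size_t lb, uint32_t *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < la && j < lb) {
        uint32_t va = a[i], vb = b[j];
        out[k] = va;         /* speculative store               */
        k += (va == vb);     /* keep it only on a match         */
        i += (va <= vb);     /* advance the smaller side(s);    */
        j += (vb <= va);     /* on a match both indices advance */
    }
    return k;
}
```

Note how, in the branchless variant, a found pair advances both indices, which is why both scalar versions benefit from a higher selectivity.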
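The two-level data layout and its cheap conversion back to 32-bit values can likewise be sketched in plain C. The cell order follows the description in Section 3.3, but all names are ours, and we assume stored cardinalities of at most 65,535 (how a full subset of 65,536 values is encoded in one cell is not spelled out above).

```c
#include <stddef.h>
#include <stdint.h>

/* Two-level layout: a set is a flat array of 16-bit cells, each subset
 * stored as [shared upper 16 bits][cardinality][lower 16-bit values...].
 * Example: {1, 2, 5, 9} encodes the subset {0x00010005, 0x00010009}.
 * Decoding shows why conversion to 32-bit values is nearly free:
 * one shift and one OR per value. */
size_t decode_two_level(const uint16_t *cells, size_t ncells, uint32_t *out) {
    size_t pos = 0, k = 0;
    while (pos + 2 <= ncells) {
        uint32_t upper = (uint32_t)cells[pos] << 16; /* shared upper bits  */
        uint32_t card  = cells[pos + 1];             /* subset cardinality */
        for (uint32_t v = 0; v < card; v++)
            out[k++] = upper | cells[pos + 2 + v];   /* concatenate halves */
        pos += 2 + card;
    }
    return k;
}
```

The per-subset length field is the 16-bit overhead discussed above; everything else is payload, which is why the layout approaches 16 bits per value for larger subsets.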
We varied the selectivity from 50% to 100% for the 8-bit intersection algorithms and from 0% to 100% for the 16-bit algorithms.

Figure 6 illustrates the results for the 8-bit intersection. Independently of the selectivity, SIMD8 outperforms both scalar versions; it has a nearly constant execution time of only about 11 ms because it proceeds in almost every iteration by 16 elements in one of the two sets. However, when both sets fully overlap (100% selectivity), it proceeds in each set by 16 elements per iteration, so the runtime further decreases to 7 ms. Of the two scalar algorithms, branchless performs better up to a selectivity of 90%. Beyond this point, branch gets faster because of the nearly perfect branch prediction. Nevertheless, branchless as well as branch benefit from an increasing selectivity because for each found pair the indices of both sets are increased. We achieve speedups from 2.4x to 5.3x compared to the better performing of the two considered scalar implementations at each measuring point.

We obtain similar results for the 16-bit sorted-set intersection shown in Figure 7. Again, our parallel merge algorithm is always faster than both scalar algorithms. The average execution time of SIMD16 was 16 ms, and only 9 ms at 100% selectivity. Since SIMD16 proceeds in almost every iteration by 8 elements, the speedups are lower than those achieved by SIMD8. Nevertheless, SIMD16 is 2.2x up to 4.8x faster than the scalar algorithms (always considering the better runtime of the two scalar algorithms). In summary, the speedups of both parallel algorithms differ from the theoretical speedups (cf. Section 3.2) because of the additional overhead for the result permutation and the higher latency of the PCMPESTRM instruction compared to a traditional compare instruction.

Figure 7: Varying the selectivity for 5,000 sorted-set intersections on sets with 2,000 elements and 16-bit integer values.

In the next experiment, we compared the two scalar 32-bit algorithms with our two parallel 32-bit algorithms, SIMD32 and SIMD32-H. We varied the selectivity from 0% to 100%. The results—shown in Figure 8—are similar to those of the previous experiments. Here too, our parallel algorithms outperform the two scalar algorithms. However, branch has nearly the same execution time as SIMD32 at 95% selectivity and is even slightly faster at 100% selectivity. Because of the pre- and post-processing overhead of SIMD32, the execution times of the two parallel algorithms differ on average by 13 ms; in other words, SIMD32-H is 1.8x faster than SIMD32. Furthermore, SIMD32-H is only about 1.5 ms slower than SIMD16, indicating that the overhead for the two-level data layout is negligible.

Figure 8: Varying the selectivity for a single sorted-set intersection of two sets with 10 million elements and 32-bit integer values.

In the last experiment, we varied the domain of the integer values being intersected. While in the previous experiments we used a relatively small domain for sets with a large number of elements, we now proceed the other way around. Figure 9 illustrates the results for the intersection of 32,768 values for a domain ranging from 64k values up to 2 billion values. The increasing domain has nearly no effect on the scalar intersection algorithms. The slight increase of their execution times from 64k values to about 256k values (note the log scale) results from the higher selectivity for this domain; as shown by the other experiments, the scalar versions benefit from a higher selectivity. For a domain larger than 256k values, the selectivity drops below 1%. The execution time of the parallel algorithms increases for a growing domain because the average number of values in a subset decreases, which in turn reduces the positive effect of the full comparison. Nevertheless, the execution time deteriorates only for very large domains that are probably not typical for realistic datasets. SIMD32 is faster than the scalar algorithms as long as there are at least 18 values per subset. SIMD32-H is even better; only for subsets with fewer than 12 values per subset is it slower than the scalar versions.

Figure 9: Varying the domain for sorted-set intersection of sets with 32,768 elements and 32-bit integer values. The higher the domain of the integer values, the fewer values share the same upper 16 bits.

5. CONCLUSION
In this paper, we presented parallel sorted-set intersection algorithms that are based on STTNI of Intel SSE 4.2. Three of the algorithms use sorted sets consisting of uncompressed integer values as input and output; they differ only in the precision (8-bit, 16-bit, and 32-bit) of the integers they process. Since the 32-bit variant requires pre- and post-processing of the integer values, we provide a hierarchical data layout that avoids such overhead during the intersection process. Our experiments indicate high speedups of up to 5.3x of our parallel sorted-set intersection algorithms over highly efficient scalar counterparts. We believe that many algorithms in query processing, frequent-itemset mining, or information retrieval—in which set intersection accounts for a large part of the overall execution time—could benefit from our proposed approach. In future work, we want to investigate the combination with adaptive intersection algorithms and parallel compression techniques [21].

6. REFERENCES
[1] R. R. Amossen and R. Pagh. A new data layout for set intersection on GPUs. CoRR, abs/1102.1003, 2011.
[2] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. In VLDB, 2011.
[3] R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Combinatorial Pattern Matching, pages 400–408, 2004.
[4] J. Barbay, A. López-Ortiz, and T. Lu. Faster adaptive set intersections for text searching. In WEA, pages 146–157. Springer, 2006.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14:7:3.7–7:3.24, January 2010.
[6] G. E. Blelloch and M. Reid-Miller. Fast set operations using treaps. In ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 16–26, New York, NY, USA, 1998. ACM.
[7] M. R. Brown and R. E. Tarjan. A fast merging algorithm. J. ACM, 26:211–226, April 1979.
[8] M. Butler. AMD "Bulldozer" core - a new approach to multithreaded compute performance for maximum efficiency and throughput. Hot Chips 22, August 2010.
[9] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In SODA, pages 743–752, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
[10] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Experiments on adaptive set intersections for text retrieval systems. In ALENEX, pages 91–104, London, UK, 2001. Springer-Verlag.
[11] B. Ding and A. C. König. Fast set intersection in memory. Proc. VLDB Endow., 4:255–266, January 2011.
[12] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high-performance IR query processing. In WWW, pages 1213–1214, New York, NY, USA, 2008. ACM.
[13] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly-ordered sets. SIAM Journal on Computing, 1:31–39, 1972.
[14] Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual, April 2011.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 2A and 2B: Instruction Set Reference, A-Z, May 2011.
[16] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture, May 2011.
[17] Intel Inc. Accelerating XML Processing with Intel SSE4.2 to Improve Business Solutions.
[18] D. E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.
[19] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive RID-list intersection, and its application to index ANDing. In SIGMOD '07, pages 773–784, New York, NY, USA, 2007. ACM.
[20] P. Sanders and F. Transier. Intersection in integer inverted indices. In ALENEX. SIAM, 2007.
[21] B. Schlegel, R. Gemulla, and W. Lehner. Fast integer compression using SIMD instructions. In DaMoN, 2010.
[22] G. Shi, M. Li, and M. Lipasti. Accelerating search and recognition workloads with SSE 4.2 string and text processing instructions. In ISPASS, pages 145–153, April 2011.
[23] S. Tatikonda, F. Junqueira, B. B. Cambazoglu, and V. Plachouras. On efficient posting list intersection with multicore processors. In SIGIR '09, pages 738–739, New York, NY, USA, 2009. ACM.
[24] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. Proc. VLDB Endow., 2:838–849, August 2009.
[25] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[26] D. Wu, F. Zhang, N. Ao, F. Wang, X. Liu, and G. Wang. A batched GPU algorithm for set intersection. In Symposium on Pervasive Systems, Algorithms, and Networks, ISPAN '09, pages 752–756, Washington, DC, USA, 2009. IEEE Computer Society.
[27] D. Wu, F. Zhang, N. Ao, G. Wang, X. Liu, and J. Liu. Efficient lists intersection by CPU-GPU cooperative computing. In IPDPS Workshops, pages 1–8. IEEE, 2010.
[28] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Technical report, Rochester, NY, USA, 1997.