Counter Control Flow Divergence in Compiled Query Pipelines

Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines Harald Lang Andreas Kipf Linnea Passing Technical University of Munich Technical University of Munich Technical University of Munich Munich, Germany Munich, Germany Munich, Germany [email protected] [email protected] [email protected] Peter Boncz Thomas Neumann Alfons Kemper Centrum Wiskunde & Informatica Technical University of Munich Technical University of Munich Amsterdam, The Netherlands Munich, Germany Munich, Germany [email protected] [email protected] [email protected] ABSTRACT Control flow graph: SIMD lane utilization: Increasing single instruction multiple data (SIMD) capabilities in out scan σ ⋈ ⋈ ⋈ out scan σ ⋈ ... modern hardware allows for compiling efficient data-parallel query index 7 x x x x x x x pipelines. This means GPU-alike challenges arise: control flow di- traversal 6 x x Index ⋈ vergence causes underutilization of vector-processing units. In this 5 x x x paper, we present efficient algorithms for the AVX-512 architecture 4 x 3 x x x x x σ no to address this issue. These algorithms allow for fine-grained as- x x x x x x x x match SIMD lanes 2 signment of new tuples to idle SIMD lanes. Furthermore, we present 1 x x x strategies for their integration with compiled query pipelines with- scan 0 x x x out introducing inefficient memory materializations. We evaluate t our approach with a high-performance geospatial join query, which Figure 1: During query processing, individual SIMD lanes shows performance improvements of up to 35%. may (temporarily) become inactive due to different control ACM Reference Format: flows. The resulting underutilization of vector-processing Harald Lang, Andreas Kipf, Linnea Passing, Peter Boncz, Thomas Neumann, units causes performance degradations. We propose effi- and Alfons Kemper. 2018. Make the Most out of Your SIMD Investments: cient algorithms and strategies to fill these gaps. Counter Control Flow Divergence in Compiled Query Pipelines. In Da- MoN’18: 14th International Workshop on Data Management on New Hard- SIMD is mostly used in interpreting database systems that use ware, June 11, 2018, Houston, TX, USA. ACM, New York, NY, USA, 8 pages. the column-at-a-time or vector-at-a-time [3] execution model. Com- https://doi.org/10.1145/3211922.3211928 piling database systems like HyPer [6] barely use it due to their data-centric tuple-at-a-time execution model [13]. In such systems, 1 INTRODUCTION therefore, SIMD is primarily used in scan operators [9] and in string Integrating SIMD processing with database systems has been stud- processing [12]. ied for more than a decade [21]. Several operations, such as selec- With the increasing vector-processing capabilities for database tion [9, 16], join [1, 2, 7, 19], partitioning [14], sorting [4], CSV pars- workloads in modern hardware, especially with the advent of the ing [12], regular expression matching [18], and (de-)compression [10, AVX-512 instruction set, query compilers can now vectorize entire 16, 20] have been accelerated using the SIMD capabilities of the query execution pipelines and benefit from the high degree of x86 architectures. In more recent iterations of hardware evolu- data-parallelism [5]. With AVX-512, the width of vector registers tion, SIMD instruction sets have become even more popular in increased to 512 bit, allowing for processing of an entire cache line the field of database systems. Wider registers, higher degrees of in a single instruction. Depending on the bit-width of the attribute data-parallelism, and comprehensive support for integer data have values, up to 64 elements can be packed into a single register. increased the interest in SIMD and led to the development of many Vectorizing entire query pipelines raises new challenges. One novel algorithms. such challenge is keeping all SIMD lanes busy during query evaluation. Efficient algorithms are required to counter underutilization DaMoN’18, June 11, 2018, Houston, TX, USA © 2018 Copyright held by the owner/author(s). Publication rights licensed to the of vector-processing units (VPUs). In [11], this issue was addressed Association for Computing Machinery. by introducing (memory) materialization points immediately after This is the author’s version of the work. It is posted here for your personal use. Not each vectorized operator. However, with respect to the more strict for redistribution. The definitive Version of Record was published in DaMoN’18: 14th International Workshop on Data Management on New Hardware, June 11, 2018, Houston, definition of pipeline breakers given in[13], materialization points TX, USA, https://doi.org/10.1145/3211922.3211928. can be considered pipeline breakers because tuples are evicted from registers to slower (cache) memory. In this work, we present algorithms and strategies that do not break pipelines. Further, our approach can be applied at intra-operator level and not only at operator boundaries. DaMoN’18, June 11, 2018, Houston, TX, USA H. Lang et al. The remainder of this paper is organized as follows. In Section 2, active elements in destination destination vector register we describe the potential performance degradation caused by un- 1 1 1 0 1 0 1 1 0 1 2 8 4 9 6 7 derutilization in holistically vectorized pipelines. In Section 3, we bitwise not 1 write mask present efficient algorithms to counter underutilization, and inSec- 0 0 0 1 0 1 0 0 expand load 2 tion 4, we present strategies for integrating these algorithms with compiled query pipelines. Experimental evaluation of the proposed memory ... 6 7 8 9 10 11 12 13 ... algorithm is given in Section 5, and our conclusions are given in data Section 6. read position Figure 2: Refilling empty SIMD lanes from memory using 2 VECTORIZED PIPELINES the AVX-512 expand load instruction. The major difference between a scalar (i.e., non-vectorized) pipeline and a vectorized pipeline is that in the latter, multiple tuples are pushed through the pipeline at once. This impacts control flow read position read position vector within the query pipeline. In a scalar pipeline, whenever the control 8 broadcast 1 8 8 8 8 8 8 8 8 flow reaches any operator, it is guaranteed that there is exactly SEQUENCE one tuple to process (tuple-at-a-time). By contrast, in a vectorized pipeline, there are several tuples to process. However, because the 0 1 2 3 4 5 6 7 control flow is not necessarily the same for all tuples, some SIMD add 2 lanes may become inactive when a conditional branch is taken. Such a branch is only taken if at least one element satisfies the branch 8 9 10 11 12 13 14 15 condition. This implies that a vector of length n may contain up to − write mask n 1 inactive elements, as depicted in Figure 1. 0 0 0 1 0 1 0 0 expand 3 Inactive elements typically result from filter operations. Tuples TID vector register that do not satisfy some given predicate are disqualified by set- 0 1 2 8 4 9 6 7 ting the corresponding SIMD lane as inactive, but the actual non- qualifying values remain in the vector register. Nevertheless, the Figure 3: In general, the associated tuple identifiers (TIDs) control flow must enter the subsequent operator if at least one of loaded values need to be carried along the query pipeline. element in the vector register qualifies. Thus, disqualified elements The TIDs are derived from the current read position and as- cause underutilization in the subsequent operator(s). signed to a TID vector register. Another source of inactive lanes are search operations, for example, in pointer-based data structures such as trees and hash tables, which typically occur in joins. Because some search operations may as follows: (i) where new elements are copied from a source memory be completed earlier than others, some SIMD lanes may become address and (ii) where elements are already in vector registers. (temporarily) inactive. This is an inherent problem when traversing irregular pointer-based data structures in a SIMD fashion [17]. 3.1 Memory to Register Generally, all conditional branches within the query pipeline are Refilling from memory typically occurs in the table scan operator, potential sources of control flow divergence and, therefore, a source where contiguous elements are loaded from memory (assuming a of underutilization of VPUs. To avoid underutilization through columnar storage layout). AVX-512 offers the convenient expand divergence we need to dynamically assign new tuples to idle SIMD load instruction that loads contiguous values from memory directly lanes, possibly at multiple “points of divergence” within the query into the desired SIMD lanes (cf., Figure 2). One mask instruction pipeline. We refer to this process as pipeline refill. (bitwise not) is required to determine the write mask and one vector instruction (expand load) to execute the actual load. Overall, the 3 REFILL ALGORITHMS simple case of refilling from memory is supported by AVX-512 In this section, we present our refill algorithms for AVX-5121, which directly out of the box. we later integrate into compiled query pipelines (cf., Section 4). Nevertheless, the table scan operator typically produces an ad- These algorithms essentially copy new elements to desired positions ditional output vector containing the tuple identifiers (TIDs) of in a destination register. In this context, these desired positions the newly loaded attribute values. The TIDs are derived from the are the lanes that contain inactive elements. The active lanes are current read position and are used, for example, to (lazily) load identified by a small bitmask (or simply mask), where the ith bit attribute values of a different column later on or to reconstruct the corresponds to the ith SIMD lane.

Counter Control Flow Divergence in Compiled Query Pipelines

RISC-V Vector Extension Webinar II

Optimizing Packed String Matching on AVX2 Platform

Optimizing Software Performance Using Vector Instructions Invited Talk at Speed-B Conference, October 19–21, 2016, Utrecht, the Netherlands

Idisa+: a Portable Model for High Performance Simd Programming

MPC7410/MPC7400 RISC Microprocessor Reference Manual

A Second Generation SIMD Microprocessor Architecture

Effective Vectorization with Openmp 4.5

Jon Stokes Jon

Altivec Extension to Powerpc Accelerates Media Processing

Register Level Sort Algorithm on Multi-Core SIMD Processors

G4 Is First Powerpc with Altivec: 11/16/98

Performance Portable Short Vector Transforms