Hardware-Oblivious SIMD Parallelism for In-Memory Column-Stores

Template Vector Library — General Approach and Opportunities

Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Alexander Krause, Dirk Habich, Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, Dresden, Germany
[email protected]

Erich Focht
High Performance Computing Europe, NEC Deutschland GmbH, Stuttgart, Germany
[email protected]

ABSTRACT

Vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm is a core technique to improve query processing performance, especially in state-of-the-art in-memory column-stores. In mainstream CPUs, vectorization is offered by a large number of powerful SIMD extensions growing not only in vector size but also in terms of complexity of the provided instruction sets. However, programming with vector extensions is a non-trivial task and currently accomplished in a hardware-conscious way. This implementation process is not only error-prone but also connected with quite some effort for embracing new vector extensions or porting to other vector extensions. To overcome that, we present a Template Vector Library (TVL) as a hardware-oblivious concept in this paper. We will show that our single-source hardware-oblivious implementation runs efficiently on different SIMD extensions as well as on a pure vector engine. Moreover, we demonstrate that several new optimization opportunities are possible, which are difficult to realize without a hardware-oblivious approach.

1. INTRODUCTION

To satisfy the requirements of high query throughput and low query latency, database systems constantly adapt to novel hardware features, but usually in a hardware-conscious way [6, 9, 11, 22, 26, 27]. That means, the implementation of, for example, query operators is very hardware-specific and any change or development of the underlying hardware leads to significant implementation and maintenance activities [12]. To overcome this issue, a very promising and upcoming research direction will focus on the development of hardware-oblivious or abstraction concepts. For example, Heimel et al. [12] already proposed a hardware-oblivious parallel library for query operators, so that these operators can be mapped to a variety of parallel processing architectures like many-core CPUs or GPUs. Another recent hardware-oblivious approach is the data processing interface DPI for modern networks [2].

In addition to parallel processing over multiple cores, vector processing according to the Single Instruction Multiple Data (SIMD) parallel paradigm is also a heavily used hardware-driven technique to improve the query performance, in particular for in-memory column-store database systems [10, 14, 21, 26, 27].¹ On mainstream CPUs, this vectorization is done using SIMD extensions such as Intel's SSE (Streaming SIMD Extensions) or AVX (Advanced Vector Extensions). In the last years, we have seen great advancements for this vectorization feature in the form of wider vector registers and more complex SIMD instruction sets. Moreover, NEC Corporation recently released a strong vector engine as a co-processor called SX-Aurora TSUBASA, which operates on vector registers multiple times wider than those of recent mainstream processors [16, 24].

Our Contribution and Outline. To use the evolving variety of vector processing units in a unified way for a highly vectorized query processing in in-memory column-stores, we developed a Template Vector Library (TVL) as a novel hardware-oblivious approach. The unique properties of TVL are: (i) we provide a well-defined, standardized, and abstract interface for vectorized query processing, (ii) query operators have to be vectorized only once using TVL, (iii) this single set of query operators can be mapped to all vector processing units from different SIMD extensions up to vector engines at compile time, and (iv) based on our hardware-oblivious vectorization concept, we enable new optimization opportunities. To support our claims, we make the following contributions in this paper:

• We start with a motivation why we need a hardware-oblivious vectorization concept and present related work in Section 2.
• Then, we introduce our developed Template Vector Library (TVL) as a hardware-oblivious approach by describing the library interface and the mapping to different vector processing units in Section 3.
• Based on TVL, we developed a novel in-memory column-store MorphStore [10]. Using that system, we empirically prove that our single hardware-oblivious implementation can efficiently run on different vector processing units in Section 4. Moreover, we discuss possible directions for future research in Section 5.

Finally, the paper concludes with a short summary of our findings in Section 6.

¹ Without loss of generality, we will mainly focus on in-memory column-stores in this paper. Nevertheless, our concepts should be applicable in other contexts as well.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2020. 10th Annual Conference on Innovative Data Systems Research (CIDR '20), January 12-15, 2020, Amsterdam, Netherlands.

Figure 1: Diversity in SIMD hardware landscape, characterized by the number of vector instructions, the vector length, and the base data width (bit-level granularity): (a) Intel (Xeon Gold (Skylake), Xeon Phi (Knights Landing)), (b) ARM (chips with NEON/NEON v.2 and with SVE), (c) NEC SX-Aurora TSUBASA vector engine.

2. NECESSITY OF AN ABSTRACTION

Hardware vendors continuously improve their individual hardware systems by providing an increasingly high degree of parallelism [5]. This results in higher hardware core counts (MIMD parallelization) and improved vector units (SIMD parallelization). In MIMD (Multiple Instruction Multiple Data), multiple processors execute different instructions on different data elements in a parallel manner. This parallelization concept is heavily exploited in database systems [1, 3, 15, 20, 23] and Heimel et al. [12] already proposed a hardware-oblivious parallel library to easily map to different MIMD processing architectures. In contrast to that, for SIMD parallelization (often called vectorization), multiple processing elements perform the same operation on multiple data elements simultaneously. This vectorization concept is always provided by modern x86-processors using specific SIMD instruction set extensions.

2.1 SIMD Parallelism Diversity

This SIMD parallel paradigm has received a lot of attention to increase query performance, especially in the domain of in-memory column-stores [10, 14, 21, 26, 27].
Without claim of completeness, current approaches mainly focus on vectorizing isolated operators, compression, partitioning, or on completely new processing models [7, 8, 14, 19, 21, 26, 27]. However, the vectorization of the mentioned aspects is normally done in a hardware-conscious way via hand-written code using SIMD intrinsics providing an understandable interface for the vector instructions [29]. This code will usually result in the best performance, but the code will probably only run on the given architecture and porting to a new architecture requires a high effort.

The reason is that the SIMD hardware landscape is increasingly diverse as illustrated in Figure 1. For this diversity, three metrics are important: (i) the number of available vector instructions², (ii) the vector length, and (iii) the granularity of the bit-level parallelism, i.e., on which data widths the vector instructions are executable. Figure 1(a) shows the metric values for two recent Intel architectures: Xeon Skylake and MIC Knights Landing. Generally, Intel offers several SIMD extensions such as SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), AVX2, or AVX-512. Both architectures in Figure 1(a) offer SIMD functionality up to Intel's latest extension AVX-512, resulting in a very high number of vector instructions with three different vector lengths of 128, 256, and 512 bit. The SIMD processing can be done on 64-, 32-, 16-, and 8-bit data elements on both architectures. As depicted, the number of vector instructions differs between both Intel architectures, because not all available SIMD extensions are meant to be supported by all architectures.

In contrast to that, ARM, for example, pursues a completely different approach as highlighted in Figure 1(b). Instead of providing a high number of vector instructions, ARM supports much wider vector lengths. While the ARM NEON extension (available in ARMv7-A and ARMv8-A) was limited to a vector length of 128 bit, the Scalable Vector Extension (SVE) (available in ARMv8-A AArch64) aims at supporting many more vector lengths from 128 to 2,048 bits, in 128-bit increments [28]. In all cases, the SIMD processing can be done on 64-, 32-, 16-, and 8-bit data elements. Moreover, Figure 1(c) shows the metric values for a recently released pure vector engine from NEC Corporation [16, 24]. This vector engine operates on vector registers multiple times wider than those of recent SIMD extensions of modern x86-processors: the engine supports vector lengths from 64 to 16,384 bits, in 64-bit increments. However, the number of available vector instructions is lower than with the SIMD extensions and the instructions can only be executed on 64- and 32-bit data elements.

² We counted the instructions in the hardware vendor intrinsic guides.

2.2 Abstraction Guidelines

With this diversity in mind, we believe that developing a hardware-oblivious or abstraction concept will become equally important as achieving optimal performance. From our point of view, the hardware-oblivious concept should provide the following core aspects:

Portability and Extensibility: The vectorized code written in a hardware-oblivious way should be easily portable between vector processing units with different SIMD capabilities. In addition, the implementation effort for the integration of new SIMD functionalities offered by a specific vector processing unit should be manageable. Of course, this also means that the hardware-oblivious approach should be extensible with new functions that are necessary for the explicit vectorization of application logic.

Enabling Explicit Vectorization: To achieve the best performance, explicit vectorization of application logic is still the best way [17, 27]. Even if a given code can be auto-vectorized implicitly, this poses new challenges to the overall system as described in [17]. Thus, a hardware-oblivious concept should enable an explicit vectorization for the diversity in the SIMD hardware landscape (see Figure 1) with the following properties. On the one hand, the diversity characteristics of SIMD functionality, vector length, and the granularity of the bit-level parallelism should be treated independently of each other. This is the best way to represent diversity in a meaningful way. On the other hand, the hardware-oblivious concept should allow a separation between the application logic implementation and the specification of the three diversity characteristics at compile- or run-time. This enables the highest degree of flexibility in mapping application logic to a specific SIMD hardware.
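To make concrete what hand-written, hardware-conscious explicit vectorization looks like, and why porting such code is costly, the following is a minimal stand-alone illustration using Intel SSE2 intrinsics. It is our own sketch for this discussion, not code from TVL or MorphStore; the function name sum_sse2 and the assumption that the element count is a multiple of two are ours.

// A hand-written, hardware-conscious sum over 64-bit integers using Intel SSE2
// intrinsics. Minimal illustration of explicit vectorization as discussed above;
// porting it, e.g., to ARM NEON or AVX-512 would require rewriting every
// intrinsic call.
#include <cstdint>
#include <cstddef>
#include <emmintrin.h>  // SSE2

uint64_t sum_sse2(const uint64_t* data, size_t count) {
    // Assumes count is a multiple of 2 (two 64-bit lanes per 128-bit register).
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < count; i += 2) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi64(acc, v);
    }
    // Horizontal reduction of the two 64-bit lanes.
    uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1];
}

Every call in this snippet is tied to one instruction set and one register width, which is exactly the coupling the guidelines above aim to remove.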
2.3 Related Work

As already mentioned, a very promising and upcoming research direction will focus on the development of hardware-oblivious concepts to tackle the ever-increasing diversity or heterogeneity in hardware. For example, Heimel et al. [12] presented a hardware-oblivious extension for MonetDB by using the parallel programming framework OpenCL. As they have shown, they can map single-source query operators implemented in OpenCL to different parallel processing architectures like multi-core CPUs or GPUs. Another hardware-oblivious approach in this direction is Voodoo, which is a declarative intermediate algebra abstracting the detailed architectural properties [25]. The Voodoo compiler also produces OpenCL code to support multi-core CPUs and GPUs. Moreover, Voodoo includes a small set of vector operations like Scatter. However, the main bottleneck of both approaches is OpenCL from a vectorization performance perspective. Behrens et al. [4] have clearly shown that, e.g., vectorized hashing based on SIMD intrinsics outperforms OpenCL-based hashing.

There are already approaches for hardware-oblivious vectorization in the form of vector libraries, which avoid additional layers like OpenCL and reduce the additional overhead to a minimum, e.g., Vc [17] and Sierra [18]. Both approaches automatically translate a generic vector type to the largest available vector size and overload standard operators to work with this vector type. This reduces the translation of the written code to exactly one vector size and instruction set, even if the system offers more, and potentially faster, variants. Additionally, very specialized functions, e.g., a stream store, which can enhance the performance in some cases but have no equivalent standard operator, cannot be used. Hence, Vc and Sierra work well in plain mathematical scenarios, but they do not satisfy our abstraction guidelines as presented in the previous section.

3. TEMPLATE VECTOR LIBRARY

In this section, we introduce TVL, our developed hardware-oblivious template vector library approach for in-memory column-store systems. To the best of our knowledge, TVL is the first approach in this direction satisfying our defined abstraction guidelines.

3.1 TVL Overview

Figure 2 illustrates our library architecture according to a separation of concerns concept with the help of template metaprogramming in C++. That means, our TVL offers hardware-oblivious, but column-store specific primitives as generic functions. This explicitly enables database systems programmers to implement each query operator in a vectorized, but hardware-independent fashion, on the one hand. On the other hand, the TVL is also responsible for mapping the provided hardware-oblivious primitives to different SIMD hardware. For this mapping, our TVL includes a plug-in concept, where each plug-in has to implement each provided hardware-oblivious primitive for a specific SIMD hardware in a hardware-conscious manner.

Figure 2: Architecture of our Template Vector Library (TVL).
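The plug-in idea described above can be pictured as a generic primitive that is declared once and then specialized per SIMD extension via C++ template metaprogramming. The following is a deliberately simplified sketch of that pattern; all names in it (scalar64, sse128, VectorOps, add_once) are hypothetical illustrations of ours and not the actual TVL interface.

// Simplified sketch of the plug-in concept: one generic interface, one
// hardware-conscious specialization per SIMD extension.
#include <cstdint>
#include <emmintrin.h>  // SSE2

struct scalar64 { using vector_t = uint64_t; };  // one 64-bit element per "vector"
struct sse128   { using vector_t = __m128i;  };  // two 64-bit elements per vector

template<class Style> struct VectorOps;          // generic primitive (plug-in point)

template<> struct VectorOps<scalar64> {          // scalar plug-in
    static uint64_t add(uint64_t a, uint64_t b) { return a + b; }
};

template<> struct VectorOps<sse128> {            // SSE plug-in
    static __m128i add(__m128i a, __m128i b) { return _mm_add_epi64(a, b); }
};

// Application logic is written once against the generic interface:
template<class Style>
typename Style::vector_t add_once(typename Style::vector_t a,
                                  typename Style::vector_t b) {
    return VectorOps<Style>::add(a, b);
}

The operator code only refers to the generic interface; which specialization is compiled in is decided by the chosen style parameter, mirroring the separation of concerns discussed above.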
3.2 Hardware-Oblivious Primitives

To enable a hardware-oblivious approach without sacrificing the performance, our TVL primitives abstract from SIMD intrinsics in such a way that there is the same interface and the same data types for all SIMD architectures. Our data types are vector_t referring to a vector, base_t referring to the type of the elements in a vector, and mask_t referring to a bitmask. While scalar or plain mathematical computations are always a combination of load/store (L/S) operations, comparisons, arithmetic calculations, and boolean logic, vectorized query processing in in-memory column-stores requires additional functionality, e.g., permutation of vector elements as discussed in the literature [7, 8, 14, 19, 21, 26, 27]. Based on this observation, we define 7 different classes of TVL primitives as illustrated in Figure 2.

Load/Store Class: The Load/Store Class contains different load and store primitives. The most obvious members of this class are sequential load and store primitives. Random memory access is realized by gather and scatter primitives. A special primitive for selectively storing the elements of a vector is called compressstore. A compressstore takes a bitmask (mask_t), a vector (vector_t), and a pointer as arguments. Then, it stores all elements of the vector, which have a corresponding set bit in the bitmask, consecutively into memory without gaps for non-selected elements. Furthermore, data can be aligned or unaligned in memory, and there are instruction sets which explicitly support streaming, also known as non-temporal memory access. To differentiate between these cases, a template parameter is used, which can have the values ALIGNED, STREAM, and UNALIGNED for alternative implementations. To determine the granularity of any operation, another template parameter is used. For instance, if a gather is supposed to read 64-bit integers, the granularity template parameter is set to 64.

Arithmetic Class: The Arithmetic Class provides different unary and binary function primitives. Currently, the unary function primitives contain the aggregation of all elements of a vector by summing them up, and the change of the sign of a number. The currently provided binary function primitives are the four basic arithmetical operations, modulo, and shift operations, where each element of a vector is shifted by a defined number of bits. The result returned by arithmetic primitives is always a vector (vector_t), regardless of whether there are one or two input vectors. The granularity of bit-level parallelism of the operations is again provided by a template parameter.

Comparison Class: This class provides element-wise comparisons between vectors, e.g., for equality or greater/less than. The input parameters are the same as for the binary arithmetic primitives, but the output is a bitmask (mask_t) instead of a vector. This is especially useful when the result of a comparison is stored. Instead of storing the bitmask, the corresponding values can be stored by using the compressstore of the Load/Store Class.

//a) AVX-512
_mm512_mask_compressstoreu_epi64(dataPtr, mask, vec);

//b) SSE
switch (mask){
  case 0: return; //store nothing
  case 1: _mm_storeu_si128(dataPtr, vec); return; //store everything, unnecessary second vector element is overwritten later
  case 2: vec = _mm_shuffle_epi8(vec, _mm_set_epi8(7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8)); //exchange lanes before storing
          _mm_storeu_si128(dataPtr, vec); return;
  case 3: _mm_storeu_si128(dataPtr, vec); return; //store everything
}

//c) NEON
switch (mask){
  case 0: return; //store nothing
  case 1: vst1q_lane_u64(dataPtr, vec, 0); return; //store 1st lane
  case 2: vst1q_lane_u64(dataPtr, vec, 1); return; //store 2nd lane
  case 3: vst1q_u64(dataPtr, vec); return; //store everything
}

Figure 3: Different vectorized compressstore specializations.

template<...>
void compressstore(
    typename base_t * dataPtr,
    typename vector_t vec,
    typename mask_t mask
);

Figure 4: Compressstore template declaration in C++.
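To make the compressstore behavior declared in Figure 4 concrete, the following scalar reference version spells out its semantics on plain arrays. It is our own illustration of the behavior described above, assuming a 64-bit base type; it is not the TVL implementation, and the explicit elementCount parameter is an artifact of working on arrays instead of vector registers.

// Scalar reference semantics of compressstore on 64-bit elements: every element
// of vec whose bit is set in mask is written to dataPtr without gaps.
#include <cstdint>
#include <cstddef>

size_t compressstore_scalar(uint64_t* dataPtr, const uint64_t* vec,
                            uint64_t mask, size_t elementCount) {
    size_t out = 0;
    for (size_t i = 0; i < elementCount; ++i) {
        if (mask & (uint64_t(1) << i)) {
            dataPtr[out++] = vec[i];   // selected elements are stored consecutively
        }
    }
    return out;                        // number of stored elements
}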
Bitwise Logic Class: In this class, boolean algebra is treated. Currently, a bitwise AND and a bitwise OR are provided. Compared to the interface of the binary arithmetic primitives, the template parameter which indicates the granularity is not necessary, because the operations are always performed bitwise.

Create Class: Sometimes, the elements of a vector are not loaded from memory, but computed at runtime, or loaded from immediate values or constants. These cases are treated by the Create Class. This class poses the challenge that setting the elements of a vector to different values is an operation whose interface depends on the number of elements of a vector, i.e., depending on the number of elements per vector, the number of input parameters varies. This is incompatible with the concept of code which is portable to different vector lengths. For this reason, we provide two additional primitives in this class: set1 and set_sequence. The first one sets all elements of a vector to the same value. The second one fills a vector with a sequence of numbers, where the first number (base_t) and the distance between the following numbers are provided as parameters. This is especially useful for the initial creation of record indexes.

Extract Class: Whenever it is necessary to get a single element out of a vector, the extract primitive is used. This is the only member of the Extract Class. The same effect can be achieved by using a compressstore (Load/Store Class) with a single bit set in the bitmask, but extract avoids the roundtrip to the main memory, if a corresponding instruction is available on the targeted architecture.

Manipulate Class: The last class is the Manipulate Class, which takes care of vector manipulations on the element-level, i.e., permutations of the vector elements. Currently, a single primitive is provided, the rotate primitive. The rotate primitive rotates the elements in a vector by one element. If there are only two elements in a vector, e.g., two double values in a 128-bit register, this results in swapping these two elements.

These classes and primitives are specific for an efficient vectorized query processing in in-memory column-stores. An example TVL primitive from the Load/Store Class, which is used in selective database operators, is the compressstore function template as depicted in Figure 4. This specific function takes a pointer (dataPtr) to a memory location, a vector (vec), and a bitmask (mask) as input parameters. Then, the task is to store the vector register (vec) in such a way that the vector elements with a corresponding set bit in the bitmask are stored consecutively to memory. How this function is realized is not the subject of this template declaration.
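For the two Create Class primitives mentioned above, the following scalar sketch spells out their semantics, e.g., for creating an ascending sequence of record indexes. The function names mirror the primitives, but the signatures are hypothetical and operate on plain arrays; this is our own illustration, not TVL code.

// Scalar sketch of the Create Class semantics.
#include <cstdint>
#include <cstddef>

void set1_scalar(uint64_t* vec, size_t elementCount, uint64_t value) {
    for (size_t i = 0; i < elementCount; ++i)
        vec[i] = value;                       // all elements get the same value
}

void set_sequence_scalar(uint64_t* vec, size_t elementCount,
                         uint64_t start, uint64_t distance) {
    for (size_t i = 0; i < elementCount; ++i)
        vec[i] = start + i * distance;        // start, start+d, start+2d, ...
}

// Usage: record indexes 0, 1, 2, 3 for a 4-element vector.
// uint64_t idx[4]; set_sequence_scalar(idx, 4, 0, 1);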
3.3 Hardware-Conscious Specialization

From a template metaprogramming perspective, our TVL primitives are generic interfaces to implement in a vectorized way, but for the execution, we require a hardware-conscious function template specialization for the underlying SIMD hardware. This function template specialization has to be implemented, whereby the implementation depends on the available functionality of the SIMD hardware. However, this is independent from the query operators and must be done only once by a domain expert for a specific SIMD hardware. In the best case, we can directly map a TVL primitive to a SIMD intrinsic. However, if the necessary SIMD intrinsic is not available, we are able to implement a work-around in a hardware-conscious way. To illustrate this specialization, Figure 3 shows the implementation of the compressstore primitive for three different SIMD instruction sets assuming a 64-bit base data type.³ To the best of our knowledge, Intel's AVX-512 is the only instruction set which contains an intrinsic doing exactly what a compressstore is meant to do, and we map directly to this intrinsic. For architectures without AVX-512, a work-around has to be implemented. In this work-around, we have to treat all possible values of the bitmask manually. If no bit in the bitmask is set, the function returns without storing anything. If all bits are set, the whole register is stored. If only a subset of bits is set, there are different ways to store the corresponding vector elements. In NEON, there is an intrinsic to store selected vector elements. In SSE, we could either use an intrinsic to extract values into a scalar register and then write them to memory, or shuffle the according elements to the beginning of the register before writing it to memory. We decided for the latter because of a higher compatibility with different base data types.

³ Note that, for the sake of simplicity, the function headers are not shown.

To eliminate the overhead of a function call when using primitives, we inlined all primitives with inline __attribute__((always_inline)). This ensures that the overhead over using intrinsics is negligible as we show in Section 4.

A nice side effect of our overall concept is that we are also able to map to a scalar specialization. In this case, the vector length can be 8-, 16-, 32-, or 64-bit and we map our TVL primitives to the corresponding scalar instructions.
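To illustrate why the forced inlining described above leaves no call overhead when a primitive maps directly to an intrinsic, consider the following minimal sketch of a directly mapped, always-inlined AVX2 addition. The function name avx2_add_64 is hypothetical and the snippet is our own illustration, not the actual TVL source.

// A force-inlined specialization that maps one-to-one onto an AVX2 intrinsic;
// after inlining, only the single vector instruction remains.
#include <immintrin.h>  // AVX2

inline __attribute__((always_inline))
__m256i avx2_add_64(__m256i a, __m256i b) {
    return _mm256_add_epi64(a, b);
}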
3.4 Interplay

In order to connect our hardware-oblivious primitives with the different hardware-conscious specializations during query compile-time, we decided to use three template parameters and call a combination of these parameters a processing style. The template parameters are derived from the description of the SIMD variety in Section 2: (1) the vector extension (e.g., SSE, AVX, NEON, or scalar), (2) the vector size in bit, and (3) the base data type with bit granularity (e.g., int8, int64, float). This enables us to define the exact mapping in a very fine-grained and flexible way. That means, each primitive within a query operator can be called with its own processing style. An example of how exactly the processing styles are used is shown in Figure 5. In a), the intrinsics provided by AVX-512 are used directly. This code will only work on 512-bit registers on Intel architectures providing this instruction. Snippet b) uses the TVL in a very naïve way. The primitive compressstore is called with a processing style that maps to AVX-512. The second template parameter indicates that the data does not need to be aligned. To make the code fully portable to other SIMD architectures, derived constants, such as base_t_size_bit, can be used, which are also provided by our TVL as shown in snippet c) in Figure 5.

//let outPtr, mask, and vec be local variables
//a) Using AVX-512 intrinsics directly, no TVL
_mm512_mask_compressstoreu_epi64(outPtr, mask, vec);

//b) Naive implementation using TVL
tvl::compressstore<avx512<...>, tvl::UNALIGNED, 64>(outPtr, vec, mask);

//c) Fully portable implementation using TVL
using processingStyle = avx512<...>;
tvl::compressstore<processingStyle, tvl::UNALIGNED, base_t_size_bit>(outPtr, vec, mask);

Figure 5: Example usage of the TVL.

Finally, Figure 6 illustrates how a query operator implemented using TVL can look like and how it can be called. In particular, we show a simple aggregation operator, which assumes that the number of data elements is a multiple of the number of elements per vector.

// Aggregation operator definition.
template<class ps>  // processing style
base_t agg(const base_t * in, size_t elCount) {
    // For simplicity, we assume that elCount is a multiple of the number of
    // data elements per vector register.
    const size_t vecCount = elCount / ps::vector_element_count;
    // Initialize running sum to zero.
    vector_t resVec = tvl::set1(0);

    // Add all input data elements to running sum.
    for(size_t i = 0; i < vecCount; ++i) {
        vector_t dataVec = tvl::load(in);
        resVec = tvl::add(resVec, dataVec);
        in += ps::vector_element_count;
    }

    // Calculate final sum using horizontal summation of the vector elements.
    return tvl::hadd(resVec);
}

// Calling the operator.
using ps = tvl::avx2<...>; // for example
size_t count = 1024;
uint64_t * array = generate_data(count);
uint64_t sum = agg<ps>(array, count);

Figure 6: A simple summation operator using the TVL.

4. EVALUATION

Our TVL is a core component of MorphStore, a proper in-memory column-store with a novel highly vectorized compression-aware query processing concept [10]. The source code of MorphStore including the TVL is available on GitHub.⁴ Since the contribution of this paper is on hardware-oblivious vectorization using TVL, our evaluation mainly focuses on vectorized query processing of uncompressed 64-bit data in MorphStore. For this evaluation, we implemented template specializations for scalar processing and different SIMD extensions such as Intel SSE, AVX2, and AVX-512, a NEC vector engine, and ARM NEON.

⁴ https://github.com/MorphStore/Engine

4.1 Microbenchmarks

For an initial evaluation of the TVL, we run some microbenchmarks on different hardware. All microbenchmarks consist of isolated query operators. We use the same code base on every system for every supported instruction set. Only the processing style is changed, which affects one line of code.

Runtime Overhead Consideration. In a first series of experiments, we compared the runtimes of hand-written query operator code using SIMD intrinsics with runtimes of TVL-enabled query operators. To keep the results comparable, both variants use the functionalities offered by Intel AVX2. For all these microbenchmarks, the complete processed data fit at least into the L3-cache of the Xeon Gold 5120 (Skylake, max. core frequency: 3.2 GHz) and the benchmarks ran single-threaded. All microbenchmarks were executed 900 times to minimize the impact of time measurement.
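The following is a generic sketch of the kind of repetition-based timing used for such microbenchmarks: each operator is executed many times and the runtimes are aggregated to reduce measurement noise. It is our own illustration in plain C++, not the MorphStore/TVL benchmark driver, and the function name benchmark_ms is hypothetical.

// Repetition-based timing harness sketch.
#include <chrono>

template<class Operator>
double benchmark_ms(Operator&& op, int repetitions) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i)
        op();                                    // run the query operator under test
    const auto end = clock::now();
    const std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / repetitions;          // average runtime in milliseconds
}

// Usage: double avgMs = benchmark_ms([]{ /* call the operator here */ }, 900);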

As an example, Figure 7 shows the average runtimes for a column scan with different range predicate selectivities. We used these selectivities to vary the runtime behavior. The selectivities are plotted on the x-axis while the y-axis shows the complete runtime of the operator in milliseconds (lower is better). On the one hand, the variant using TVL performs slightly better than the variant using SIMD intrinsics for selectivities of 5% and 95%, respectively. On the other hand, for selectivities of 25% and 50%, the operator variants behave inversely, resulting in an average runtime overhead for our TVL approach of around 1.02% for this operator. For the other query operators, we observed a similar, negligible overhead. Hence, we conclude that our TVL offers high flexibility at virtually no performance cost.

Figure 7: Runtime overhead for column scan. The dataset consisted of about 18.5 MB of integers.

Comparison with Sierra. In a second series of experiments, we compared the runtime performance of the TVL to another vector library, namely Sierra [18]. Since the functionality of Sierra is limited, we did not implement any selective operator. Instead, we implemented an aggregation, which is trivially vectorizable. To not measure the efficiency of an auto-vectorizer, but of Sierra's SIMD mode and the TVL primitives respectively, we turned off auto-vectorization. To compile the implementation for Sierra, we used the Sierra fork of clang (version 3.3). For the TVL, we used the regular clang compiler (version 7). All experiments were run single-threaded on an Intel Xeon Gold 6130 CPU (max. core frequency: 3.7 GHz). Figure 8 shows the runtimes for SSE, AVX, and scalar processing. In addition to the median of 10 runs, the range of the runtimes is shown. In Sierra, there are two different versions of scalar processing. The first one, SSE-Scalar, is compiled for SSE, but with a vector length of 1, i.e., every vector contains only one element but the compiler is allowed to use SSE functionality. The second one, AVX-Scalar, also has a vector length of 1, but is compiled for AVX. For the TVL, we used scalar, SSE, and AVX2 processing styles with the native vector lengths of the corresponding instruction set, e.g., 128-bit registers for SSE. The graph shows that in this use-case, the size of the registers does not scale proportionally with the runtime performance. This is because the workload mainly consists of memory read accesses, which are bandwidth limited. The ranges of the runtimes for Sierra and the TVL are overlapping in all cases except for AVX2. However, Sierra performs slightly better than the TVL. We suppose that this is because of the highly SIMD-optimized Sierra compiler.
A six times lower cache-reference count and 30% more instructions per CPU cycle, with slightly slower CPU cycles, for the Sierra implementation supports this assumption. Considering that Sierra puts tight limits on the usable functionality of any vector extension, especially when it comes to selective operations, it is not applicable for a database system despite the small performance gain compared to the TVL.

Figure 8: Runtime comparison of an aggregation between Sierra (blue and gray) and the TVL (orange). The dataset contained 10^7 integers.

TVL on a Vector Engine. As another part of our evaluation, we investigated the applicability of TVL on the novel hardware accelerator card SX-Aurora TSUBASA. On the one hand, this accelerator card, manufactured by NEC Corp., offers a very high sustained memory throughput of up to 1 TB/s. This is made possible by an HBM packaging technology, which has been designed specifically for this accelerator [16]. On the other hand, SX-Aurora TSUBASA provides comparably large vector registers containing up to 256 double words (16,384 bits). While the vector card is shipped with a tailored compiler providing auto-vectorization but no intrinsics support, our experiments were conducted using an LLVM-VE backend [13] (version 1.7). Since intrinsics are provided by the backend, we could expand the TVL to support SX-Aurora TSUBASA quite easily using the already discussed separation of concerns.

To measure the potential overhead of the TVL, we implemented a FILTER operator which filters out all elements from a given dataset which are less than or equal to a constant number. The result is materialized as a bitvector, where every set bit represents an element from the initial dataset which satisfies the given predicate. In addition, we implemented an AGGREGATE operator, which takes a dataset as well as a bitvector as input and calculates the average of all elements from the given dataset, while only considering elements with their corresponding bit set within the bitvector. The operators were implemented twice, once using the intrinsics defined by LLVM-VE and once using the extended TVL primitives.
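For clarity, the following scalar sketch spells out what the FILTER operator described above computes: every element strictly greater than the constant sets its bit in the output bitvector. It is our own illustration of the semantics only; the implementations actually measured in the experiments are the vectorized LLVM-VE and TVL variants, and the function name filter_gt_bitvector is hypothetical.

// Scalar reference semantics of the FILTER operator producing a bitvector.
#include <cstdint>
#include <cstddef>

void filter_gt_bitvector(const uint64_t* data, size_t count,
                         uint64_t constant, uint64_t* bitvector) {
    // The caller provides ceil(count / 64) zero-initialized 64-bit words.
    for (size_t i = 0; i < count; ++i) {
        if (data[i] > constant) {
            bitvector[i / 64] |= (uint64_t(1) << (i % 64));  // mark qualifying element
        }
    }
}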
The experiments were executed by a single thread on a single machine, consisting of an Intel(R) Xeon(R) Gold 6126 CPU and an SX-Aurora TSUBASA 10C with a frequency of 1.4 GHz and 24 GB HBM2 providing a maximum memory bandwidth of 750.0 GB/s. On the host system, a CentOS release 7.7.1908 was used. The vector engine is operated using veos 2.1.0. To build the experiments, LLVM-VE (1.7.0) was used with the platform flag -target ve-linux and the optimization flags -O3 -fno-vectorize -fno-slp-vectorize, whereby the latter two disable auto-vectorization. The experiments were executed completely on the vector card utilizing veos. To measure the performance of the operators, the user clock cycles on the vector engine were retrieved using inline assembly. All operators were measured with different data sizes as well as data distributions, which lead to different selectivities for the FILTER. Figure 9 shows the results for a single FILTER operator on 4 GB of data and for a FILTER on a 4 GB buffer followed by an AGGREGATE on another 4 GB buffer, resulting in a combined data size of 8 GB. The maximum measured overheads were −0.017% for a selectivity of 70% and 0.012% for a selectivity of 50%. Apart from that, the overhead of the TVL is usually within the range of −0.005% to 0.005% and is, thus, negligible. Hence, we conclude that the TVL can also be extended to support specialized hardware at no significant performance costs.

Figure 9: Runtime overhead for a FILTER and a FILTER with a subsequent AGGREGATION on NEC SX-Aurora TSUBASA.

4.2 Star Schema Benchmark (SSB)

For a more sophisticated evaluation to show the applicability and portability of our approach, we ran the SSB using a single code base facilitating the TVL with different processing styles to use different SIMD extensions. Our systems were: a) an Intel Xeon Gold 5120 (Skylake, max. core frequency: 3.2 GHz), b) an Intel Xeon Phi 7250 (max. core frequency: 1.6 GHz), and c) an Odroid-C2, a single-board computer equipped with 4 ARM Cortex-A53 cores (max. core frequency: 1.5 GHz). While the Intel machines support SSE, AVX2, and AVX-512, the ARM cores only support the NEON instruction set. For a better comprehensibility, we used register sizes according to the instruction set, e.g., AVX2 uses only 256-bit registers, not 128-bit registers, and the queries are executed by a single thread. Figure 10 shows the individual query runtimes for all systems and instruction sets for a scale factor of 1. At larger scale factors, the data would not fit into the Odroid's 2 GB of RAM. The results on the Intel machines show that the performance benefits from wider registers. Note that this improvement can be accomplished by changing only the single line of code which defines the processing style. However, on the Odroid, vectorization does reduce the runtime, but not as much as expected. One reason for this is the high amount of work-arounds resulting from the low number of available vector instructions.

Overall, our evaluation empirically proves that our single set of hardware-oblivious operators with TVL can efficiently run on different vector SIMD extensions, including different hardware vendors, and that the query performance can be increased with the vector width.

Figure 10: Star Schema Benchmark results on different systems using different Processing Styles: (a) Intel Xeon Gold, (b) Intel Xeon Phi, (c) ARM Cortex-A53.
5. ONGOING RESEARCH DIRECTIONS

Our ongoing research activities are manifold. On the one hand, we are currently completing our integration of the instructions of the vector engine SX-Aurora into our TVL. This will also give us the opportunity to use a common code base to investigate the interplay of the vector engine with its Intel Skylake host processor for an efficient query processing. That means, we want to leverage the wide vector registers as well as the high memory throughput of the vector engine while compensating the shortage of processing logic (see Section 2) with the Intel Skylake.

On the other hand, the different available instruction sets on one hardware system, for example on Intel with SSE, AVX2, and AVX-512 or on ARM with a configurable vector length in SVE, provide a huge optimization potential. That means, we want to optimize the overall performance on a very fine-grained level. To illustrate that, Figure 11 shows two subsequent operators of SSB query 3.3. As we can see, the same instruction set behaves differently on different systems, such that, for example, AVX-512 is not always the best choice on Intel systems. Moreover, the prioritization of queries gets another dimension. For instance, low-priority queries can be executed using scalar code or SSE, while high-priority queries are executed with AVX-512. This allows for less frequency down-scaling on the whole system, because AVX2 and AVX-512 scale down the core frequency if multiple cores are used.

Generally, our TVL concept enables new research directions on a fine-grained level which are hardly realizable without a hardware-oblivious concept.

Figure 11: Runtime of two subsequent operators of SSB query 3.3 on two different Intel systems (Intel Xeon Gold (Skylake) and Intel Xeon Phi) for different instruction sets and register sizes.

6. CONCLUSION

In this paper, we introduced a Template Vector Library (TVL) as a hardware-oblivious SIMD parallelism concept for in-memory column-stores. The TVL can effectively represent the vector extension, the vector width, and the granularity of the bit-level parallelism as the most important dimensions of the variety of the SIMD parallelism paradigm. As we have shown, we are able to map a single set of operators to different SIMD architectures without sacrificing performance compared to hardware-conscious implementations. Moreover, TVL is already a core component of a new in-memory column-store called MorphStore [10].

7. ACKNOWLEDGMENT

This work was partly funded (1) by the German Research Foundation (DFG) within the CRC 912 (HAEC), RTG 1907 (RoSI) as well as by an individual project LE-1416/26-1 and (2) by NEC Corporation.

8. REFERENCES

[1] A. Ailamaki, E. Liarou, P. Tözün, D. Porobic, and I. Psaroudakis. Databases on Modern Hardware: How to Stop Underutilization and Love Multicores. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2017.
[2] G. Alonso, C. Binnig, I. Pandis, K. Salem, J. Skrzypczak, R. Stutsman, L. Thostrup, T. Wang, Z. Wang, and T. Ziegler. DPI: the data processing interface for modern networks. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.
[3] C. Barthels, G. Alonso, T. Hoefler, T. Schneider, and I. Müller. Distributed join algorithms on thousands of cores. PVLDB, 10(5):517–528, 2017.
[4] T. Behrens, V. Rosenfeld, J. Traub, S. Breß, and V. Markl. Efficient SIMD vectorization for hashing in OpenCL. In Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018, pages 489–492, 2018.
[5] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, 2011.
[6] P. Chrysogelos, P. Sioulas, and A. Ailamaki. Hardware-conscious query processing in GPU-accelerated analytical engines. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.
[7] P. Damme, D. Habich, J. Hildebrandt, and W. Lehner. Lightweight data compression algorithms: An experimental survey (experiments and analyses). In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017, pages 72–83, 2017.
[8] P. Damme, A. Ungethüm, J. Hildebrandt, D. Habich, and W. Lehner. From a comprehensive experimental survey to a cost-based selection strategy for lightweight integer compression algorithms. ACM Trans. Database Syst., 44(3):9:1–9:46, 2019.
[9] F. Faerber, A. Kemper, P. Larson, J. J. Levandoski, T. Neumann, and A. Pavlo. Main memory database systems. Foundations and Trends in Databases, 8(1-2):1–130, 2017.
[10] D. Habich, P. Damme, A. Ungethüm, J. Pietrzyk, A. Krause, J. Hildebrandt, and W. Lehner. MorphStore - in-memory query processing based on morphing compressed intermediates LIVE. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 1917–1920, 2019.
[11] J. He, S. Zhang, and B. He. In-cache query co-processing on coupled CPU-GPU architectures. PVLDB, 8(4):329–340, 2014.
[12] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. PVLDB, 6(9):709–720, 2013.
[13] K. Ishizaka, K. Marukawa, E. Focht, S. Moll, M. Kurtenacker, and S. Hack. NEC SX-Aurora - A Scalable Vector Architecture. US LLVM Developers' Meeting (2018), 2018. http://compilers.cs.uni-saarland.de/papers/nec poster llvmdev18.pdf [Online, last accessed 12-16-2019].
[14] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. A. Boncz. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. PVLDB, 11(13):2209–2222, 2018.
[15] T. Kiefer, T. Kissinger, B. Schlegel, D. Habich, D. Molka, and W. Lehner. ERIS live: a NUMA-aware in-memory storage engine for tera-scale multiprocessor systems. In International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 689–692, 2014.
[16] K. Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, and H. Kobayashi. Performance evaluation of a vector supercomputer SX-Aurora TSUBASA. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018, Dallas, TX, USA, November 11-16, 2018, pages 54:1–54:12, 2018.
[17] M. Kretz and V. Lindenstruth. Vc: A C++ library for explicit vectorization. Softw., Pract. Exper., 42(11):1409–1430, 2012.
[18] R. Leißa, I. Haffner, and S. Hack. Sierra: a SIMD extension for C++. In Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, WPMVP 2014, Orlando, Florida, USA, February 16, 2014, pages 17–24, 2014.
[19] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Softw., Pract. Exper., 45(1):1–29, 2015.
[20] F. Liu, L. Yin, and S. Blanas. Design and evaluation of an RDMA-aware data shuffling operator for parallel database systems. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys 2017, Belgrade, Serbia, April 23-26, 2017, pages 48–63, 2017.
[21] P. Menon, A. Pavlo, and T. C. Mowry. Relaxed operator fusion for in-memory databases: Making compilation, vectorization, and prefetching work together at last. PVLDB, 11(1):1–13, 2017.
[22] I. Oukid, D. Booss, A. Lespinasse, W. Lehner, T. Willhalm, and G. Gomes. Memory management techniques for large-scale persistent-main-memory systems. PVLDB, 10(11):1166–1177, 2017.
[23] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. PVLDB, 3(1):928–939, 2010.
[24] J. Pietrzyk, D. Habich, P. Damme, E. Focht, and W. Lehner. Evaluating the vector supercomputer SX-Aurora TSUBASA as a co-processor for in-memory database systems. Datenbank-Spektrum, 19(3):183–197, 2019.
[25] H. Pirk, O. Moll, M. Zaharia, and S. Madden. Voodoo - A vector algebra for portable database performance on modern hardware. PVLDB, 9(14):1707–1718, 2016.
[26] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1493–1508, 2015.
[27] O. Polychroniou and K. A. Ross. Towards practical vectorized analytical query engines. In Proceedings of the 15th International Workshop on Data Management on New Hardware, DaMoN 2019, Amsterdam, The Netherlands, 1 July 2019, pages 10:1–10:7, 2019.
[28] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Prémillieu, A. Reid, A. Rico, and P. Walker. The ARM scalable vector extension. IEEE Micro, 37(2):26–39, 2017.
[29] J. Zhou and K. A. Ross. Implementing database operations using SIMD instructions. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA, June 3-6, 2002, pages 145–156, 2002.