Vegen: a Vectorizer Generator for SIMD and Beyond

VeGen: A Vectorizer Generator for SIMD and Beyond Yishen Chen Charith Mendis Michael Carbin Saman Amarasinghe MIT CSAIL UIUC MIT CSAIL MIT CSAIL USA USA USA USA [email protected] [email protected] [email protected] [email protected] ABSTRACT Vector instructions are ubiquitous in modern processors. Traditional A1 A2 A3 A4 A1 A2 A3 A4 compiler auto-vectorization techniques have focused on targeting single instruction multiple data (SIMD) instructions. However, these B1 B2 B3 B4 B1 B2 B3 B4 auto-vectorization techniques are not sufficiently powerful to model non-SIMD vector instructions, which can accelerate applications A1+B1 A2+B2 A3+B3 A4+B4 A1+B1 A2-B2 A3+B3 A4-B4 in domains such as image processing, digital signal processing, and (a) vaddpd (b) vaddsubpd machine learning. To target non-SIMD instruction, compiler developers have resorted to complicated, ad hoc peephole optimizations, expending significant development time while still coming up short. A1 A2 A3 A4 A1 A2 A3 A4 A5 A6 A7 A8 As vector instruction sets continue to rapidly evolve, compilers can- B1 B2 B3 B4 B5 B6 B7 B8 not keep up with these new hardware capabilities. B1 B2 B3 B4 In this paper, we introduce Lane Level Parallelism (LLP), which A1*B1+ A3*B3+ A5*B5+ A7*B8+ captures the model of parallelism implemented by both SIMD and A1+A2 B1+B2 A3+A4 B3+B4 A2*B2 A4*B4 A6*B6 A7*B8 non-SIMD vector instructions. We present VeGen, a vectorizer gen- (c) vhaddpd (d) vpmaddwd erator that automatically generates a vectorization pass to uncover target-architecture-specific LLP in programs while using only in- Figure 1: Examples of SIMD and non-SIMD instruction in struction semantics as input. VeGen decouples, yet coordinates the AVX2 instruction set. We use different colors to indicate automatically generated target-specific vectorization utilities with how the input values flow to different output lanes. its target-independent vectorization algorithm. This design enables us to systematically target non-SIMD vector instructions that un- til now require ad hoc coordination between different compiler 1 INTRODUCTION stages. We show that VeGen can use non-SIMD vector instructions Vector instructions are ubiquitous in modern processors. Previ- effectively, for example, getting speedup 3× (compared to LLVM’s ous work on auto-vectorization has focused on single instruction vectorizer) on x265’s idct4 kernel. multiple data (SIMD) instructions, but there is little research on CCS CONCEPTS systematically targeting non-SIMD vector instructions, which has applications in domains such as digital signal processing, image · Software and its engineering → Translator writing systems processing, and machine learning (e.g., Intel’s VNNI extension and and compiler generators; Retargetable compilers; Specifica- the dot-product instructions in ARMv8 [14]). In contrast with the tion languages; Automatic programming; · Computer systems SIMD instruction shown in Figure 1(a), Figures 1(b)ś1(d) show three organization → Single instruction, multiple data. examples of the non-SIMD instructions from the AVX2 instruction set. Figure 1(b) shows a single instruction, multiple operations, mul- KEYWORDS tiple data (SIMOMD) instruction [2] that performs additions and optimization, auto-vectorization, non-SIMD subtractions on alternating lanes (vaddsubpd); Figure 1(c) shows a horizontal addition with lane interleaving (vhaddpd); and Fig- ACM Reference Format: ure 1(d) shows an instruction computing dot-products (vpmaddwd). Yishen Chen, Charith Mendis, Michael Carbin, and Saman Amarasinghe. 2021. VeGen: A Vectorizer Generator for SIMD and Beyond. In Proceedings To date, there is no unified model of parallelism that captures the of the 26th ACM International Conference on Architectural Support for Pro- capabilities of these instructions. gramming Languages and Operating Systems (ASPLOS ’21), April 19ś23, 2021, Automatic Vectorization. There are two mainstream techniques Virtual, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/ 3445814.3446692 for extracting SIMD parallelism: loop vectorization [1, 20, 21] and superword level parallelism (SLP) based vectorization [15, 17, 24]. Permission to make digital or hard copies of part or all of this work for personal or Both techniques make two fundamental assumptions about vector classroom use is granted without fee provided that copies are not made or distributed instructions: a SIMD instruction performs isomorphic operations for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. across all lanes, and the instruction applies the operations element- For all other uses, contact the owner/author(s). wise (i.e., there is no cross-lane operation). Relying on these two as- ASPLOS ’21, April 19ś23, 2021, Virtual, USA sumptions, these algorithms enable compiler developers to support © 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8317-2/21/04. SIMD instructions across a variety of architectures with relatively https://doi.org/10.1145/3445814.3446692 little incremental effort. 902 ASPLOS ’21, April 19–23, 2021, Virtual, USA Yishen Chen, Charith Mendis, Michael Carbin, and Saman Amarasinghe (a) Reference Implementation (b) ICC (c) GCC (d) LLVM (e) VeGen void movzx r11d , [rdi] vmovdqa xmm0 , [rip] vmovdqu xmm6, [rsi + 32] vmovdqu64 zmm0 , [rdx] dot_16x1x16_uint8_int8_int32( movsx eax , [rsi] vmovdqu xmm1 , [rsi] vmovdqu xmm7, [rsi + 48] vpbroadcastd zmm1 , [rdi] uint8_t data[restrict 4], imul r11d , eax ... ... vpdpbusd zmm0, zmm0, [rsi] int8_t kernel[restrict 16][4], ... vpmovsxbw xmm7 , xmm6 vpmulld zmm1, zmm11, zmm1 vmovdqu64 [rdx], zmm0 int32_t output[restrict 16]) { add r11d , r10d vpbroadcastw xmm5 , xmm5 vpaddd zmm1, zmm1, [rdx] for (int i = 0; i < 16; i++) add r11d , ecx vpmullw xmm7, xmm7, xmm9 vpmovsxbd zmm3 , xmm3 for (int k = 0; k < 4; k++) mov [rdx], r11d vpsrldq xmm2, xmm6, 8 vpmulld zmm3, zmm10, zmm3 output[i] += ... ... data[k] * kernel[i][k]; } Number of Instructions 273 106 61 4 Speedup Relative to ICC 1.0× 1.5× 2.2× 11.0× Vector Extensions Used Not Vectorized SSE4 SSE4 & AVX-512 AVX512-VNNI Figure 2: One of the dot-product kernels used by TVM’s 2D convolution layer (Figure 2(a)). Compiler generated assembly and statistics for Intel’s compiler ICC (Figure 2(b)), GCC (Figure 2(c)), LLVM (Figure 2(d)), and VeGen (Figure 2(e)) Existing Support for Non-SIMD Instructions. Because non- couple of target-dependent vectorization utility functions. By inter- SIMD instructions violate the two fundamental assumptions of facing with these utilities, the core vectorization algorithm in our existing vectorization algorithms, compiler developers support non- framework remains target-independent, as traditional vectorization SIMD instructions using ad hoc approaches that are cumbersome algorithms do. and often ineffective. For most non-SIMD instructions, compiler We realize this framework with VeGen, a system that automat- developers support them with backend peephole rewrites. However, ically generates target-architecture-aware vectorizers to uncover because these peephole rewrites do not generate vector instructions LLP in straight-line code sequences while using only instruction by themselvesÐthey fuse sequences of SIMD instructions and vec- semantics as input. From these instruction semantics, VeGen au- tor shuffles into more non-SIMD instructionsÐrelying on peephole tomatically generates the implementation of the aforementioned rewrites alone is ineffective. A relatively more effective but more vectorization utilities as a compiler library to describe the specific labor-intensive strategy involves coordinating with the compiler’s kind of LLP supported by the target architecture. With this auto- vectorizers to generate SIMD vector patterns that are tailored for matically generated target-description library, VeGen’s vectorizer those rewrite rules. For instance, the initial support in LLVM [16] can automatically use non-SIMD vector instructions. We added for the addsub instruction family (Figure 1(b)) required three co- support for newer classes of non-SIMD vector instructions (e.g., ordinated changes to LLVM: refactoring LLVM’s SLP vectorizer those found in AVX512-VNNI, which are not fully supported by to support alternating opcodes, changing LLVM’s cost model to LLVM) by providing only their semantics. recognize a special case of vector shuffle (blending odd and even We make the following contributions in this paper: lanes), and modifying LLVM’s backend lowering logic to detect the special patterns generated by the SLP vectorizer. As processor vendors continue to add more complex non-SIMD instructions, this • We introduce Lane Level Parallelism, which captures the methodology is not sustainable. Compilers are falling behind in type of parallelism implemented by both SIMD and non- identifying the complex code sequences that can be mapped to SIMD vector instructions. these instructions, and these multibillion-dollar investments by the • We describe a code-generation framework that jointly per- processor vendors in enhancing the vector instruction sets go un- forms vectorization and vector instruction selection while derutilized without expert developers manually writing assembly maintaining the modularity of traditional target-independent or compiler intrinsics. vectorizers designed for SIMD instructions. • We present VeGen, a vectorizer generator that automati- VeGen Our Approach: . In this paper, we describe

Vegen: a Vectorizer Generator for SIMD and Beyond

Vectorization Optimization

Exploiting Automatic Vectorization to Employ SPMD on SIMD Registers

Using Arm Scalable Vector Extension to Optimize OPEN MPI

Introduction on Vectorization

Advanced Parallel Programming II

MMX and SSE MMX Data Types

Compiler Auto-Vectorization with Imitation Learning

Vector Parallelism on Multi-Core Processors

VECTORIZATION-Slides

Impact of Vectorization and Multithreading on Performance and Energy Consumption on Jetson Boards

A Using Machine Learning to Improve Automatic Vectorization

Selective Vectorization for Short-Vector Instructions Samuel Larsen, Rodric Rabbah, and Saman Amarasinghe