Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets
Daniel S. McFarlin, Volodymyr Arbatov, and Franz Franchetti
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
[email protected], [email protected], [email protected]

Markus Püschel
Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
[email protected]

ABSTRACT
The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks, such as small transpositions and perfect shuffles, that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.

Categories and Subject Descriptors
D.3.4 [Software]: Programming Languages—Code generation, Optimization

General Terms
Performance

Keywords
Autovectorization, super-optimization, SIMD, program generation, Fourier transform

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'11, May 31–June 4, 2011, Tucson, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0102-2/11/05 ...$10.00.

1. Introduction
Power and area constraints are increasingly dictating microarchitectural developments in the commodity and high-performance (HPC) CPU space. Consequently, the once dominant approach of dynamically extracting instruction-level parallelism (ILP) through monolithic out-of-order microarchitectures is being supplanted by designs with simpler, replicable architectural features. This trend is most evident in the proliferation of architectures containing many symmetrical processing cores. Such designs provide for flexible power management and reduced area by trading dynamic ILP for static, software-defined thread-level parallelism. A similar trade-off is occurring with the steadily increasing vector width and complexity of single-instruction multiple-data (SIMD) vector instruction sets.

AVX and Larrabee. Intel's recent AVX and Larrabee (LRB) architectures feature 256-bit and 512-bit vector lengths, respectively; architectures with 1024-bit vectors are already planned [2, 24]. Vector functional units and vector registers are regular structures that are fairly easy to replicate and expand. Like multiple cores, vector units provide for flexible power management in that individual vector functional units can be selectively idled. SIMD instructions also represent a form of scalar instruction compression, thereby reducing the power and area consumed by instruction decoding. Collectively, this architectural trend towards multiple cores and wide vectors has fundamentally shifted the burden of achieving performance from hardware to software.

Programming SIMD extensions. In contrast to multiple cores, SIMD architectures require software to explicitly encode fine-grain data-level parallelism using the SIMD instruction set. These SIMD instruction sets are quickly evolving as vendors add new instructions with every CPU generation, and SIMD extensions are incompatible across CPU vendors. Consequently, explicitly vectorized code is hard to write and inherently non-portable. The complexity of SIMD instruction sets complicates hand-vectorization, while auto-vectorization, just like auto-parallelization, poses a continuing challenge for compilers. The latest versions of production compilers (Intel C++, IBM XL C, and Gnu C) all contain autovectorization technology [17, 36, 15] that provides speed-up across a large class of computation kernels.
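To make the notion of explicitly encoded data-level parallelism concrete, the following sketch models a 4-way SIMD unit in plain Python. It is a toy model, not a real intrinsics API: the names vload, vmul, and vshuffle are illustrative stand-ins for machine instructions. Rewriting a loop into whole-register operations plus explicit shuffles is, in essence, what hand-vectorization with intrinsics amounts to.

```python
# Toy model of a 4-way SIMD unit: "registers" are 4-tuples and the only
# operations are whole-register loads, element-wise arithmetic, and a
# two-input shuffle. All names are illustrative, not a real API.
VLEN = 4

def vload(a, i):
    """Load VLEN contiguous elements starting at index i."""
    return tuple(a[i:i + VLEN])

def vmul(x, y):
    """Element-wise multiply: one SIMD instruction."""
    return tuple(p * q for p, q in zip(x, y))

def vshuffle(x, y, idx):
    """Generic shuffle: select elements of the concatenation x ++ y."""
    src = x + y
    return tuple(src[j] for j in idx)

# Explicitly vectorized element-wise product of two length-8 arrays.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
out = []
for i in range(0, len(a), VLEN):
    out.extend(vmul(vload(a, i), vload(b, i)))
assert out == [p * q for p, q in zip(a, b)]

# Data reorganization needs shuffles: deinterleave a complex array
# (re0, im0, re1, im1, ...) into separate real and imaginary registers.
data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
lo, hi = vload(data, 0), vload(data, 4)
re = vshuffle(lo, hi, (0, 2, 4, 6))  # (0.0, 2.0, 4.0, 6.0)
im = vshuffle(lo, hi, (1, 3, 5, 7))  # (1.0, 3.0, 5.0, 7.0)
```

Even in this small example, the shuffles contribute instructions that do no arithmetic, which is exactly the vectorization overhead the remainder of the paper is concerned with minimizing.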
However, for many kernels like the fast Fourier transform (FFT) and matrix multiplication, the results are usually suboptimal [8], since optimal vectorization requires algorithm knowledge or there are simply too many choices that the compiler cannot evaluate.

Much of the difficulty in vectorization lies in the instructions required to transform and keep data in vector form. These shuffle or permutation instructions are generally the most complex and expensive operations in the SIMD instruction set. They tend to scale poorly, may not support arbitrary permutations, and their parameters become increasingly non-obvious to use, especially with wider vector units. From a performance point of view, shuffle instructions are the overhead imposed by vectorization, which prevents the perfect speedup linear in the vector length. Consequently, minimizing the number and cost of shuffles is crucial.

Figure 1: The dataflow of an SSE-vectorized kernel. The computation is an element-wise multiplication of two complex input arrays (blue and red) of length four in interleaved format.

Contribution. This paper makes two key contributions. First, we present a super-optimization infrastructure that takes as input the instruction set specification and automates the discovery of efficient SIMD instruction sequences for basic data reorganization operations such as small matrix transpositions and stride permutations. These are required, for example, by many high performance computing kernels including the FFT and linear algebra kernels. Second, we incorporate this infrastructure into the library generator Spiral to generate the reordering blocks needed for FFT vectorization [8]. This approach effectively automates the porting of FFTs to new vector architectures. We then evaluate efficacy by automatically generating vectorized FFTs for AVX and LRB. We demonstrate speed-ups (measured using runtime or instruction counts) of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. We also compare the AVX code against Intel's IPP. For LRB, no benchmarks are available at the time of writing.

Besides this main contribution, with AVX and Larrabee it now becomes possible to study the efficiency and overhead of vectorization methods across a range of vector lengths: 2, 4, 8, and 16 for single-precision floating-point. We include such a study for Spiral's FFT vectorization.

2. Related Work
The work in this paper extends the SIMD support in the Spiral system with an approach that is designed for both longer vector lengths and more complex instruction sets. It is related to vectorization techniques developed for traditional vector computers, SIMDization techniques developed for short-length SIMD vector instruction sets, and superoptimization.

Vectorization. Automatic vectorization has been the subject of extensive study in the literature. Two excellent references are [21, 37]. Vectorization becomes (again) increasingly important for SIMD extensions like Larrabee and the latest versions of SSE (SSE 4.1) that allow for efficient implementation of gather/scatter operations and large data caches, since the conditions on such architectures are similar to traditional vector computers.

SIMDization. Originating from SIMD within a register (SWAR) [5, 34], SIMDization was recognized as a hybrid between vectorization and instruction-level parallelism extraction [1]. Recent advances in compilation techniques for SIMD vector instruction sets in the presence of alignment and stride constraints are described in [4, 28]. SIMD instruction extraction for two-way architectures aimed at basic blocks is presented in [22]. This technique is included in FFTW 2.1.5 [13, 12] and has shown good performance improvements across multiple two-way SIMD extensions. FFTW3 [14] contains SIMD codelets for SSE and AltiVec, supporting vector lengths of 2 and 4.

Superoptimization. The classic paper on superoptimization is [26], while [3] presents a modern approach that is close in spirit to our own. A dataflow-graph and integer-linear programming based approach to finding
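The superoptimization idea discussed in this section, namely searching over the instruction set itself for a shortest sequence that implements a desired data reorganization, can be illustrated with a greatly simplified sketch. Here the "ISA" is three hypothetical single-register permutation instructions (not real machine instructions), and an exhaustive breadth-first search finds a shortest sequence realizing a stride permutation; the system described in this paper of course works from the semantics of real multi-register SIMD instruction sets.

```python
from collections import deque

N = 8
# Tiny modeled "instruction set": each instruction is a fixed permutation,
# given as the index vector it selects. Names are illustrative stand-ins.
ISA = {
    "interleave":  (0, 4, 1, 5, 2, 6, 3, 7),  # perfect shuffle of halves
    "swap_halves": (4, 5, 6, 7, 0, 1, 2, 3),
    "reverse":     (7, 6, 5, 4, 3, 2, 1, 0),
}

def apply_perm(x, p):
    """Apply a permutation instruction p to register contents x."""
    return tuple(x[j] for j in p)

def search(target, max_len=4):
    """Breadth-first search for a shortest instruction sequence whose
    composition realizes the target permutation of the identity."""
    start = tuple(range(N))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, seq = queue.popleft()
        if state == target:
            return seq
        if len(seq) == max_len:
            continue
        for name, p in ISA.items():
            nxt = apply_perm(state, p)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + [name]))
    return None  # not reachable within max_len instructions

# Stride-by-2 permutation (deinterleave), one of the reorganization
# blocks needed by vectorized FFTs:
target = (0, 2, 4, 6, 1, 3, 5, 7)
seq = search(target)
assert seq is not None and len(seq) == 2  # two interleaves compose to it
```

The search space explodes combinatorially with the number of instructions, their parameters, and the sequence length, which is why the discovery of such sequences for real instruction sets like AVX and LRB benefits from the automated infrastructure this paper proposes.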