Compilers, Hands-Off My Hands-On Optimizations

Richard Veras, Doru Thom Popovici, Tze Meng Low, Franz Franchetti
Department of Electrical and Computer Engineering, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA 15213
{rveras, dpopovic, lowt, franzf}@cmu.edu

ABSTRACT
Achieving high performance for compute-bound numerical kernels typically requires an expert to hand-select an appropriate set of single-instruction multiple-data (SIMD) instructions and then statically schedule them in order to hide their latency while avoiding register spilling in the process. Unfortunately, this level of control over the code forces the expert to trade programming abstraction for performance, which is why many performance-critical kernels are written in assembly. An alternative is to either resort to auto-vectorization (see Figure 1) or to use intrinsic functions, both features offered by compilers. However, in both scenarios the expert loses control over which instructions are selected, which optimizations are applied to the code, and moreover how the instructions are scheduled for a target architecture. Ideally, the expert needs assembly-like control over their SIMD instructions, beyond what intrinsics provide, while maintaining a C-level abstraction for the non-performance-critical parts.

In this paper, we bridge the gap between performance and abstraction for SIMD instructions through the use of custom macro intrinsics that give the programmer control over instruction selection and scheduling, while leveraging the compiler to manage the registers. This provides the best of both assembly and vector intrinsics programming, so that a programmer can obtain high performance implementations within the C language.

Keywords
SIMD, Performance Engineering

Acknowledgment
This work was sponsored by the DARPA PERFECT program under agreement HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.

WPMVP '16, March 13, 2016, Barcelona, Spain. DOI: http://dx.doi.org/10.1145/2870650.2870654

Figure 1: In this plot ("Autovect versus Expert, Nehalem"; y-axis: GFLOPS; x-axis: k with m = n = 1280; lines: expert, gcc-autovect, icc-autovect) we compare the performance of an expert implementation of matrix-multiply against two compiler auto-vectorized implementations. The peak performance of the machine is 9.08 GFLOPS, and the expert implementation reaches near that performance, while the two compiler-vectorized implementations fall well below it.

1. INTRODUCTION
Implementing high performance mathematical kernels such as the matrix-matrix multiplication kernel is an extremely difficult task because it requires the precise orchestration of CPU resources via carefully scheduled instructions. This typically requires using the single instruction multiple data (SIMD) units on modern out-of-order processors, such as Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) [6] on Intel processors, AltiVec [2] on PowerPC processors, or NEON [1] on ARM processors.

For key kernels, this task is often undertaken by expert programmers who are knowledgeable about the application, the available SIMD instruction set, and the hardware architecture. This is because the programmer must manually select the appropriate instructions for their implementation, then schedule the instructions, and finally orchestrate the movement of data to efficiently use the SIMD units. Therefore, many high performance libraries rely on assembly-coded kernels, mainly because of the high level of control offered by such languages.
For this reason, library instantiations of the Basic Linear Algebra Subprograms (BLAS) [8, 4, 3] such as GotoBLAS [5] (now OpenBLAS [9]), BLIS [10], and ATLAS [11] make use of routines written in assembly to implement operations like vector additions, scalar multiplications, dot products, and matrix-matrix multiplications. In this work we address this issue and provide a bridge between the low-level control that an expert needs and the high-level benefits of a compiled language. We do this through the use of custom intrinsic-like macros.

Implementing a kernel directly in assembly requires a substantial amount of human effort and is typically reserved for performance-critical code. Thus, to ease the programmer's burden, vector intrinsics are often used as a replacement for the low-level assembly language. The compiler maps these vector intrinsics to the specific vector assembly code. Figure 2 shows the mapping between different vector intrinsics and assembly instructions. During the translation of the application to machine code, the compiler applies various optimizations to increase performance. In this process the compiler may move and schedule instructions and assign named registers according to general-purpose heuristics. Thus, the programmer loses control over the instruction schedule of the application.

vmulpd %ymm8, %ymm5, %ymm9              //assembly
t9 = _mm256_mul_pd(t8, t5);             //intrinsic

vaddpd %ymm9, %ymm1, %ymm1              //assembly
t1 = _mm256_add_pd(t9, t1);             //intrinsic

vperm2f128 $1, %ymm5, %ymm5, %ymm6      //assembly
t6 = _mm256_permute2f128_pd(t5, t5, 1); //intrinsic

Figure 2: The vector assembly instructions can be replaced with the vector intrinsics offered by vendors. The named registers are replaced with variable names. The compiler performs the translation between the intrinsics and assembly along with the mapping between the variable names and the named registers.

A programmer wants the best of both worlds. On one hand, he or she wants full control over the instructions and the scheduling mechanisms, while on the other hand he or she desires programmability. The programmer simply wants to write the kernel in a high-level language such as C, but somehow inhibit the compiler from moving and scheduling instructions, while still using it for operations such as register coloring or other optimizations for the glue code around the kernels. In other words, certain parts of the application or the library should not be modified by the compiler.

In this paper, we propose a mechanism for preserving instruction order that is as transparent to the programmer as existing compiler intrinsics. We use inline assembly compatible with the gcc compiler to retain control over the application; however, we embed our construct within parameterized C macros to hide the low-level details. Moreover, we use the volatile construct to notify the compiler not to touch the instructions and to preserve their statically scheduled order. Figure 7 shows an example of a vector addition described using our parametrized C macro. We use these macro instructions within matrix multiplication kernels and show that the statically scheduled kernels outperform the same kernels written with the normal vector intrinsics and compiled with the same compiler.

Contributions. Our work contributes the following:

– Parametrized C macros. We introduce vector macros that provide the same ease of programming afforded by traditional vector intrinsics, while providing the programmer a level of control over instruction selection and scheduling that one would expect from programming in assembly.

– Demonstration of flexibility. We demonstrate the flexibility of these customized instructions through the implementation of high-performance matrix-matrix multiply kernels. We use generated and optimized code to show that when the customized instructions are used, performance increases in comparison to generated code that uses compiler vector intrinsics.

2. BACKGROUND
As our running example, we focus our attention on matrix-matrix multiplication.
When coded and tuned by an expert, this operation can achieve near the peak machine performance due to its O(N^3) floating point operations relative to its O(N^2) memory operations. Furthermore, there is great interest in achieving peak performance for every new architecture because the Basic Linear Algebra Subprograms (BLAS), specifically the level-3 BLAS [3], cast the bulk of their computation in terms of matrix multiplication. Therefore, any improvement in the performance of matrix multiplication translates to improvements in the rest of the level-3 BLAS and the numerical and scientific libraries that build upon it.

Without loss of generality, we further narrow our focus to the variant of matrix multiplication of the form

    C = AB + C                                    (1)

where the matrix C is in R^{m x n}. The simplest implementation of an algorithm for computing the matrix multiplication can be done with three nested loops, similar to the example written in C in Figure 4. The inner-most loop (over the k dimension) performs the inner product between row i of matrix A and column j of matrix B. The outer two loops (the i and j indices) iterate through the rows and columns of the two matrices. Although the implementation is fairly simple, the attainable performance of this approach is low. Thus, over the past 20 years there have been substantial efforts in the advancement of high performance implementations of matrix multiplication.

// C = AB + C
for(int i = 0; i < m; ++i) {
  for(int j = 0; j < n; ++j) {
    for(int p = 0; p < k; ++p) {
      C[i][j] += A[i][p] * B[p][j];
    }
  }
}

Figure 4: The simplest implementation of the matrix-matrix multiplication algorithm. The inner-most loop computes the inner product between the rows of matrix A and the columns of matrix B. The outer two loops iterate through the rows and columns of the two matrices.

Layering for Performance.
GotoBLAS [5] (now maintained as OpenBLAS [9]) identified that a high performance implementation of matrix-matrix multiplication can be achieved by layering loops around an extremely tuned assembly-implemented kernel (Figure 3). Figure 5 shows a simple implementation of a 4 by 4 kernel for matrix-matrix multiplication.

Figure 3: The GotoBLAS approach layers loops around an extremely tuned assembly-coded kernel. It first blocks the original input C += AB in the K dimension, then it blocks in the M dimension; after that the blocks are fed to the tuned kernel.

vmovaps (%rcx), %ymm8
vmovaps (%rdx), %ymm4
vmulpd %ymm8, %ymm4, %ymm9
vaddpd %ymm9, %ymm0, %ymm0
vshufpd $5, %ymm4, %ymm4, %ymm5
vmulpd %ymm8, %ymm5, %ymm9
vaddpd %ymm9, %ymm1, %ymm1
vperm2f128 $1, %ymm5, %ymm5, %ymm6
vmulpd %ymm8, %ymm6, %ymm9
vaddpd %ymm9, %ymm2, %ymm2
vshufpd $5, %ymm6, %ymm6, %ymm7
vmulpd %ymm8, %ymm7, %ymm9
vaddpd %ymm9, %ymm3, %ymm3

Figure 5: An example of a matrix-matrix multiplication of size m = 4, n = 4, and k = 1 using vector assembly instructions. Assembly coding with vector instructions provides control to the programmer, but requires the programmer to select instructions, manage registers, and schedule instructions manually.

The exposed matrix-matrix kernel operates on a large L2-cache-resident block of A. Smaller portions of B and C are blocked for the L1 cache, and the kernel must be carefully implemented such that it computes floating point operations at the rate at which the elements are streamed from cache.

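To make the layering concrete, the following is a minimal sketch of the loop structure that Figure 3 describes: block in the K dimension, then in the M dimension, and hand small blocks of C to a tuned kernel. This is our illustration, not the paper's code; the block sizes KC, MC, MR, NR, the row-major layout, the even division of the problem sizes, and the scalar micro_kernel stand-in are all assumptions (a real implementation would use an assembly kernel like the one in Figure 5).

#define KC 256   /* K-dimension block size (illustrative) */
#define MC 128   /* M-dimension block size (illustrative) */
#define MR 4     /* rows computed by the tuned kernel     */
#define NR 4     /* columns computed by the tuned kernel  */

/* Stand-in for the tuned kernel: C[0:MR][0:NR] += A[0:MR][0:kc] * B[0:kc][0:NR],
   with row-major leading dimensions lda, ldb, ldc. */
static void micro_kernel(int kc, const double *A, int lda,
                         const double *B, int ldb, double *C, int ldc) {
    for (int p = 0; p < kc; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
}

/* Layered loops around the kernel: block in K, then in M, then feed
   MR x NR blocks of C to the kernel (assumes all sizes divide evenly). */
static void gemm_layered(int m, int n, int k,
                         const double *A, const double *B, double *C) {
    for (int pc = 0; pc < k; pc += KC)           /* block in the K dimension */
        for (int ic = 0; ic < m; ic += MC)       /* block in the M dimension */
            for (int jr = 0; jr < n; jr += NR)
                for (int ir = ic; ir < ic + MC; ir += MR)
                    micro_kernel(KC, &A[ir * k + pc], k,
                                     &B[pc * n + jr], n,
                                     &C[ir * n + jr], n);
}

In GotoBLAS the analogous loops additionally pack the blocks of A and B into contiguous buffers before calling the kernel; that step is omitted here for brevity.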
In order to achieve this, an expert needs to perform the following optimizations:

– An appropriate mix of SIMD instructions needs to be selected such that the processor can sustain their rate of execution.

– These instructions must be statically scheduled such that their latencies are hidden by overlapping instructions.

Additionally, the following low-level optimizations are necessary in order to prevent the fetch and decode stages of the processor from becoming a bottleneck:

– In some cases, we generate instructions that are meant to operate on single-precision data instead of instructions that operate on double-precision data. An example of this is the use of the vmovaps instruction to load reg_a, instead of vmovapd. This is because both instructions perform the identical operation, but the single-precision instruction can be encoded in fewer bytes.

– For memory operations, address offsets that are beyond the range of -128 to 127 bytes require additional bytes to encode. Therefore, we restrict address offsets to fit in this range by subtracting 128 bytes from the base pointers into A and B (see the sketch after this list).
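The displacement range optimization can be illustrated with a small, hypothetical example (ours, not the paper's kernel). Adjusting a base pointer by 128 bytes re-centers a 256-byte window of accesses into the signed one-byte displacement range; whether the adjustment is an addition (as below, for accesses starting at the pointer itself) or a subtraction (as in the paper's kernels) depends on which offsets the kernel actually uses.

#include <immintrin.h>

/* A hypothetical kernel touches A[0] .. A[31] (256 bytes of doubles, A assumed
   32-byte aligned). With the shifted pointer Ab = A + 16 doubles (= +128 bytes),
   every access falls in the disp8 range -128..127 and encodes in one byte. */
static void load_block_ends(const double *A, __m256d *first, __m256d *last) {
    const double *Ab = A + 16;                    /* re-centered base pointer */
    __m256d v0, v7;
    asm volatile("vmovapd -128(%[p]), %[v]"       /* A[0..3]                  */
                 : [v] "=x"(v0) : [p] "r"(Ab) : "memory");
    asm volatile("vmovapd 96(%[p]), %[v]"         /* A[28..31]; without the   */
                 : [v] "=x"(v7) : [p] "r"(Ab)     /* shift this offset (224)  */
                 : "memory");                     /* would need 4 bytes       */
    *first = v0;
    *last  = v7;
}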
The optimizations require that the expert has a certain level of control over their code. In the following sections we discuss the various alternatives that an expert can use to implement the kernels.

Intrinsic functions.
Intrinsic functions are instructions offered by different programming languages which are handled specially by the compiler. The compiler replaces the intrinsic function with an instruction or with a sequence of instructions (i.e., assembly instructions), similar to an inline function. However, the compiler has some extra knowledge regarding the intrinsic function. This extra knowledge is helpful because it guides the compiler to better optimize the application. These optimizations are applied only when the user explicitly specifies it to the compiler through a flag, i.e., -mavx for vector intrinsics using the Advanced Vector Extensions (AVX) SIMD instruction set, or -fopenmp for parallel intrinsics. In this paper we focus on the vector intrinsic functions that may help with performance improvements on architectures that offer vector units.

Vector intrinsics are intrinsic functions that specify to the compiler that the instructions are vector assembly instructions and that the data on which the instructions operate are stored in vector registers, as opposed to scalar registers. The difference between a vector register and a scalar register is that the vector register can store more data. For example, a single AVX 256-bit register can store up to four double precision floating point numbers, whereas a scalar 64-bit register can hold exactly one double precision floating point number.

3. CUSTOM INTRINSICS

3.1 Custom instructions
We introduce custom intrinsics that provide the control offered by assembly instructions, while also allowing us to leverage the assistance offered by the compiler when coding with vector intrinsics or intrinsics in general.

Customized intrinsics are parameterized C macros that insert assembly instructions into the code at compile time (see Figure 7). Since customized intrinsics are replaced with assembly instructions at compile time, it is the same as assembly programming (at a slightly higher level of abstraction), which implies that the programmer retains the same level of control as programming in assembly. In addition, the parameters of the macros are variables holding the inputs and output of the assembly instructions. At compile time, the variables are replaced with actual register names. This leverages the register allocation algorithms in the compiler, thus relieving the programmer from having to manually perform the bookkeeping.
#define VADD(srca,srcb,dest) \
asm volatile( \
  "vaddpd %[vsrca],%[vsrcb],%[vdest]\n" \
  : [vdest] "=x"(dest) \
  : [vsrca] "x"(srca), \
    [vsrcb] "x"(srcb) \
);

Figure 7: The code represents an example of the customized intrinsics, which is essentially a parameterized C macro. The macro is marked as "volatile" to prohibit the compiler from optimizing the instruction(s). As such, the macro will be translated directly to the vector assembly instruction that we want to have in our kernel.

The template for a customized instruction is shown in Figure 8. We use inline assembly constructs so that the customized instructions are recognized by most C preprocessors and/or compilers. The __asm__ construct tells the compiler that the arguments within the brackets represent the assembly instruction and the extra information required by the instruction. The first argument represents the assembly instruction, which should follow a specific pattern, i.e., vaddpd %[vsrca], %[vsrcb], %[vdest]. The next arguments represent the operands for the assembly instruction. Whenever we desire to specify an output we have to mark the operand with an = sign, i.e., [vdest] "=x"(dest).

__asm__(
  assembler template
  : output operands
  : input operands
  : optional list of clobbered registers
);

Figure 8: The template for writing inline assembly instructions that can be used within C programs. The first argument represents a string for the assembly instruction, while the next arguments represent the operands for the instruction.

The above template offers us control over the code, because we are still writing the application using assembly. However, the compiler can still move these instructions around, trying to schedule them according to the scheduling algorithm specific to each compiler. A simple solution to disable code movement is to use the "volatile" construct, which should be placed in front of the assembly template, i.e., volatile __asm__(...). Therefore one can statically schedule the code for a specific architecture, knowing that the optimizations done by the compiler will not affect the assembly inserted using the customized intrinsics.

In order to relieve the programmer from the task of managing the use of the registers, we wrap the assembly constructs together using the macro definitions offered by the programming language, such as C/C++. In C, we define the macros using the #define construct. We parameterize the macro by adding arguments, where these arguments play the role of placeholders for the variables used in the application. This allows us to leverage the compiler's register management capabilities. More importantly, this raises the level of abstraction, since the actual assembly instructions are hidden from the programmer, and the use of variables instead of register names allows one to treat these customized intrinsics as regular intrinsic/function calls. Moreover, the customized intrinsics permit us to schedule the code before compilation, without any worry that the compiler might move instructions around.

One particularly useful extension to our custom macros is that we can implement our own intrinsics header file that mimics the compiler's built-in intrinsics (Figure 9). This allows the programmer to transition existing vectorized code to use our macros.

inline __m256d _mm256_load_pd(double * A){
  __m256d ret;
  VMOV(ret, A);
  return ret;
}

inline __m256d _mm256_mul_pd(__m256d A, __m256d B){
  __m256d ret;
  VMUL(ret, A, B);
  return ret;
}

Figure 9: In the case of legacy code we can create a drop-in replacement for the compiler intrinsics. The existing compiler intrinsics can be redefined using our custom macro instructions. This provides the benefit that the instructions will be preserved while keeping the same interface as the compiler intrinsics.
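Figure 9 calls VMOV and VMUL, whose definitions are not listed in the paper. The following sketch shows how they could plausibly be written following the VADD template of Figure 7, together with a small usage example; the destination-first parameter order matches the calls in Figure 9, and everything here (names, constraints, the vaxpy4 helper) is our assumption rather than the paper's code.

#include <immintrin.h>

/* Load four doubles from a 32-byte-aligned address into a vector variable. */
#define VMOV(dest,addr) \
asm volatile( \
  "vmovapd (%[ptr]),%[vdest]\n" \
  : [vdest] "=x"(dest) \
  : [ptr] "r"(addr) \
  : "memory" \
);

/* dest = srca * srcb, element-wise double precision. */
#define VMUL(dest,srca,srcb) \
asm volatile( \
  "vmulpd %[vsrca],%[vsrcb],%[vdest]\n" \
  : [vdest] "=x"(dest) \
  : [vsrca] "x"(srca), \
    [vsrcb] "x"(srcb) \
);

/* y[0:4] += a[0:4] * b[0:4]; the three emitted instructions keep this order.
   Assumes a, b, y are 32-byte aligned and VADD is defined as in Figure 7. */
static void vaxpy4(const double *a, const double *b, double *y) {
    __m256d va, vb, vy, vt;
    VMOV(va, a);
    VMOV(vb, b);
    VMOV(vy, y);
    VMUL(vt, va, vb);
    VADD(vt, vy, vy);   /* VADD(srca,srcb,dest): vy = vt + vy */
    _mm256_store_pd(y, vy);
}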

3.2 A matrix-matrix multiplication example
Using the matrix-matrix multiplication kernel as an example, one could use vector intrinsics rather than the vector assembly instructions (see Figure 6). In this case, the compiler will have a more significant role, because it will have to translate the application to machine code.

__m256d y8 = (__m256d)_mm256_load_ps(&A[0]);
__m256d y4 = (__m256d)_mm256_load_ps(&B[0]);

__m256d tmp = _mm256_mul_pd(y8, y4);
y0 = _mm256_add_pd(tmp, y0);

__m256d y5 = _mm256_shuffle_pd(y4, y4, 5);
tmp = _mm256_mul_pd(y8, y5);
y1 = _mm256_add_pd(tmp, y1);

__m256d y6 = _mm256_permute2f128_pd(y5, y5, 1);
tmp = _mm256_mul_pd(y8, y6);
y2 = _mm256_add_pd(tmp, y2);

__m256d y7 = _mm256_shuffle_pd(y6, y6, 5);
tmp = _mm256_mul_pd(y8, y7);
y3 = _mm256_add_pd(tmp, y3);

Figure 6: The example in Figure 5 implemented with the help of vector intrinsics. The variables are declared as __m256d, which specifies to the compiler that the register is going to contain 4 double precision values.

Based on our experience with the matrix-matrix multiplication kernel, we have identified a class of vector intrinsics that are useful for the computation. The intrinsics can be classified into the following categories:

– Load intrinsics are instructions that move data from memory to registers, i.e., _mm256_load_pd. For example, architectures that offer AVX support permit one to store either 8 single precision floats or 4 double precision values per register. Moreover, different instructions can fill the entire vector register with the same data point or with completely different data points.

– Store intrinsics are similar to the load intrinsics; they move data from registers back to memory, i.e., _mm256_store_pd.

– Permutation intrinsics are instructions that perform in-register permutations or data shuffling, i.e., _mm256_shuffle_pd or _mm256_permute2f128_pd. The permutations are used for efficient computation of the matrix-matrix multiplication kernel. The matrix-matrix multiplication requires a reduction operation, which is costly; therefore shuffle operations permit the movement of data around to do the reduction computation.

– Compute intrinsics are the instructions that perform the computation, similar to the example presented in Figure 10. Additions and multiplications are the basic instructions used for matrix-matrix multiplication, i.e., _mm256_add_pd or _mm256_mul_pd. For newer architectures, one could use the fused multiply-add instructions, i.e., _mm256_fmadd_pd (see the sketch below).

Figure 10: Vector intrinsic applied on two vector registers y_0 and y_1 of vector type __m256d, which is specific to AVX 256-bit width four-way double precision. The SIMD intrinsic describes a vector addition between the two vector registers. The output of the instruction is stored in the third vector register y_3. The compiler translates this instruction to the vector assembly instruction vaddpd.
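As an illustration of the compute category, the separate multiply and add intrinsics of Figure 6 could be collapsed into fused multiply-add intrinsics on FMA-capable hardware. The sketch below is our own example and only an assumption about how the updates would look on a newer architecture (the Nehalem and Sandy Bridge machines used later in this paper predate FMA):

#include <immintrin.h>

/* First two updates of Figure 6 rewritten with fused multiply-add:
   each _mm256_fmadd_pd(a, b, c) computes a*b + c in one instruction.
   A and B are assumed 32-byte aligned; compile with FMA support enabled. */
static void rank1_update_fma(const double *A, const double *B,
                             __m256d *y0, __m256d *y1) {
    __m256d y8 = _mm256_load_pd(&A[0]);
    __m256d y4 = _mm256_load_pd(&B[0]);

    *y0 = _mm256_fmadd_pd(y8, y4, *y0);            /* replaces mul + add */

    __m256d y5 = _mm256_shuffle_pd(y4, y4, 5);
    *y1 = _mm256_fmadd_pd(y8, y5, *y1);
}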
piler auto-vectorization. The implementations that we se- The vector intrinsics that fall in the 4 categories can be lected are representative of how an expert would tune an used to implement the kernel for the matrix-matrix mul- optimize an implementation of a matrix-multiplication ker- tiplication. Vector intrinsics raise the level of abstraction nel. What we will show is that our custom macros preserve and ease the burden on the programmer, such as the book- these hand optimizations. keeping of the registers. Instead of using explicit vector reg- isters, the programmer will have to declare the variables Methodology. with a specific data type that will give enough information For all of our experiments we compare the performance to the compiler to map the variables to the specific regis- of various tuned and untuned matrix-multiplication kernels. ters. The compiler uses a coloring algorithm when mapping Our implementations are compiled using either the GNU the variables to registers. The goal is to keep the data as Cross Compiler version 4.8.3 and Intel Compiler version close as possible to the computation due to the low laten- 14.0.3.174. The two machines used in are experiments are cies of accessing data in registers. Spilling to main memory an Intel Xeon X7560 running at 2.27GHz (Nehalem) with a will increase the latency of bringing it back when computa- peak of 9.08 GFLOPS and an Intel Xeon E3-1225 running tion requires it. Besides this optimization, the compiler also at 3.10 GHz (Sandy Bridge) with a peak of 24.8 GFLOPS. Various Schedules for Nehalem (gcc-macro) Various Schedules for Nehalem (icc-macro)

[Figure 11 plots: "Various Schedules for Nehalem (gcc-macro)" (left) and "Various Schedules for Nehalem (icc-macro)" (right). y-axis: GFLOPS (6 to 9); x-axis: k (m = n = 1280); lines: Best, SWP-1, SWP-2, SWP-3, SWP-4, Straightline.]

Figure 11: Here we demonstrate that our custom macro instruction wrappers preserve instruction order after compilation. We do this by implementing the same kernel with different instruction schedules and comparing their performance. The line labeled Best was scheduled using the software pipelining algorithm, which overlaps instructions from different iterations and achieves the overall best performance. The lines labeled SWP-N are variants of Best but with decreased amounts of instruction overlap: the larger the N in SWP-N, the less overlap and the more likely the implementation will incur instruction stalls. The line labeled Straightline has no instruction overlap and does not hide the latency of each instruction like the other variants. What this demonstrates is that static scheduling affects performance even on out-of-order processors. On all plots peak performance is the top of the y-axis. Note that on this plot the minimum value of the y-axis is offset to show the relative difference between the lines.

Auto-vectorization versus Expert.
In this first experiment, we establish the need for an expert when implementing a matrix-multiply kernel. To demonstrate this we compare the performance of a straight-line C implementation of a matrix-multiplication kernel against the same kernel expertly implemented using our custom macro instructions and compiled using gcc. The straight-line C implementations were compiled with the auto-vectorization optimization, such that the inner-most loop of the kernel was transformed to use SIMD instructions. In the expertly implemented kernel, we determined an efficient instruction schedule that minimized stalls and used our custom macros to implement it.

The results (Figure 1) can be interpreted as follows: on the x-axis we vary the size of the k dimension, where m = n = 1280. This value is selected so that for large values of k we see the behavior of the operation as it operates on data in memory, while for small k we see the behavior of the operation when it operates on cache-resident data. On the y-axis we measure GFLOPS, where the top of the y-axis represents the peak performance of the machine. Each line represents an implementation. The expert-tuned kernel reaches near the peak performance of the target hardware. However, the gcc auto-vectorized implementation only achieves slightly above 25% of the peak of the machine. The icc performs better, but still falls well below the expert implementation. When we inspect the assembly code emitted by the compilers, it is clear that the compiler does not make efficient use of the registers and resorts to spilling and filling from memory. This is because the matrix-multiplication kernel is a tightly constrained operation and requires careful instruction scheduling in order to prevent spilling while still achieving high performance. Even with auto-vectorization, compilers fall short of delivering the performance that an expert programmer can achieve.

Static Scheduling Matters.
In the first experiment we demonstrated that high performance can be achieved from a kernel implemented in C using our custom macros. In this second experiment, Figure 11, we show that static instruction scheduling is still necessary for performance and that our custom instructions preserve the relative instruction order after being compiled by gcc and icc. Even in the absence of spilling, the same instructions scheduled in different orderings achieve different performance. This suggests that static scheduling impacts the performance of tightly constrained numerical kernels even on out-of-order processors. In both plots the straightline implementation is a compiler-scheduled unrolled implementation.

The line labeled Best is statically scheduled in a software pipelined fashion [7] to maximize theoretical performance. Software pipelining is an approach for instruction scheduling, typically used for Very Long Instruction Word (VLIW) processors, that hides the latency of instructions in a loop by restructuring the loop to interleave non-conflicting instructions from different iterations. Using this approach the latency of each instruction is hidden by overlapping instructions from different iterations. By decreasing how much the instructions overlap we get the lines labeled SWP-N.
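As a toy illustration of the difference between the schedules (ours, not the paper's kernel), the same vector accumulation can be written straightline or software pipelined. The sketch assumes the VADD macro of Figure 7 and the VMOV/VMUL sketches from Section 3 are in scope, that n is a positive multiple of 4, and that a and b are 32-byte aligned.

#include <immintrin.h>

/* Straightline schedule: each iteration's multiply waits on its own loads. */
static __m256d accumulate_straightline(const double *a, const double *b, int n) {
    __m256d acc = _mm256_setzero_pd(), va, vb, vt;
    for (int i = 0; i < n; i += 4) {
        VMOV(va, &a[i]);
        VMOV(vb, &b[i]);
        VMUL(vt, va, vb);
        VADD(vt, acc, acc);          /* acc += a[i:i+4] * b[i:i+4] */
    }
    return acc;
}

/* Software pipelined schedule: the loads for iteration i+1 are issued while
   the multiply of iteration i is still in flight, hiding its latency. */
static __m256d accumulate_pipelined(const double *a, const double *b, int n) {
    __m256d acc = _mm256_setzero_pd(), va, vb, vt;
    VMOV(va, &a[0]);                 /* prologue: first loads issued early */
    VMOV(vb, &b[0]);
    for (int i = 0; i < n - 4; i += 4) {
        VMUL(vt, va, vb);
        VMOV(va, &a[i + 4]);         /* next iteration's loads overlap ...  */
        VMOV(vb, &b[i + 4]);         /* ... the latency of the multiply     */
        VADD(vt, acc, acc);
    }
    VMUL(vt, va, vb);                /* epilogue: drain the final iteration */
    VADD(vt, acc, acc);
    return acc;
}

The pipelined version overlaps work from adjacent iterations, which is exactly the kind of static overlap that the SWP-N variants progressively remove.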
The SWP-N lines are variants of Best, where the larger the number, the less instruction overlap is performed and the more instruction stalls occur.

The line labeled Best uses the schedule that achieved the best overall performance between the two compilers. This performance was achieved with gcc but not with icc. We suspect that the optimizations performed by icc are not as effective on already optimized code. Furthermore, we suspect that icc is very aggressive with its transformations, because the performance of the SWP-N variants is much closer together than in the gcc plot. In the gcc plots, the less overlap in the code (SWP-N with larger numbers and Straightline), the lower the performance, because the instruction latency is not hidden. We believe that the reason why the SWP-3 performance is worse than SWP-4 and the straightline version might be that this implementation clusters large instructions together, which would require multiple cycles to decode.

When we examine the assembly generated by the compilers, the ordering of the instructions is preserved. The results show that our macros can affect the instruction schedule in a predictable manner and that static scheduling impacts performance.

Experts needed for Scheduling.
In the previous experiment we demonstrated that static scheduling affects performance even on out-of-order processors. In the following experiment we demonstrate that, given our custom macro instructions, an expert can schedule a matrix-multiply kernel that achieves better performance than a compiler. We do this by comparing an expert-scheduled kernel that was implemented using our macros, in order to preserve the selected schedule, against a compiler-scheduled implementation using compiler intrinsics.
This allows the compiler to determine an ordering of those selected instructions. In both implementations the resulting code has the exact same instructions, but potentially in a different ordering.

In Figure 12 we compare the performance of the best scheduled implementation using our custom macros versus a compiler scheduled implementation using compiler intrinsics. The lines labeled Intrinsic Compiler Scheduled represent the performance of a compiler-scheduled, straight-line implementation of the matrix multiply kernel that uses the built-in compiler intrinsics. This implementation gives the compiler the freedom to schedule the instructions in the kernel as it sees fit. The lines labeled Custom Macro Scheduled are the implementations that have been scheduled for performance and implemented using our macros to preserve said schedule. For both systems and both compilers the scheduled implementation using the custom macros outperforms the compiler scheduled implementation. We suspect that for tightly constrained kernels the heuristics used by the compilers are not as effective as software pipelining.

Compiler Intrinsics are not enough.
In the previous experiment, we established that for a high performance implementation of matrix multiplication an expert cannot rely on the compiler to schedule the instructions in the implementation. In this experiment, Figure 13, we compare the effectiveness of our custom macros against the built-in compiler intrinsics. For each compiler and system combination we take the software pipelined, expert-implemented schedule from the previous experiment (Figure 12) and implement it with our custom macros and with the compiler intrinsics. What we want to test is whether the compiler preserves the instruction ordering when intrinsics are used, or whether our custom macros are needed to preserve the ordering.

On the Nehalem there is a significant performance difference between the two implementations on both compilers. Even though both implementations have the same instruction order, the compiler reorders the intrinsic implementation, and does so sub-optimally. On the Sandy Bridge the performance difference is slight. We examined the assembly code generated by both icc and gcc, and in both cases the ordering of the instructions is not the same as the initial ordering, so the intrinsics do not maintain the ordering. In the previous experiment (Figure 12) the compiler-determined ordering was not close enough to an optimal one, and the compiler scheduled implementation did not achieve the same performance as the expert. We suspect that, because of its large reorder window, the Sandy Bridge is less sensitive to the instruction order than the Nehalem.

5. CONCLUSION
In this paper, we show a simple solution whereby programmers can have full control over the code without explicitly using assembly language. We propose parametrized C macros that wrap inline assembly instructions marked with the "volatile" construct. The macro instructions hide the low level details of the assembly language, permitting the programmer to use high level constructs such as variables and leaving the mapping of the variables to the named registers in the scope of the compiler. Moreover, the "volatile" construct inhibits the compiler from touching the instructions and moving them around. Therefore, in the scenario where the code is automatically generated and scheduled, the programmer has the guarantee that the compiler will not affect the code. We use these parametrized macro instructions within matrix-matrix multiplication kernels and show that they bring real benefits in comparison to the same code written with the normal off-the-shelf vector intrinsics.

APPENDIX
A. REFERENCES
[1] NEON Programmers Guide.
[2] Power ISA Version 2.07. May 2013.
[3] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft., 16(1):1-17, March 1990.
[4] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14(1):1-17, March 1988.
[5] K. Goto and R. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Soft., 34(3):12:1-12:25, May 2008.
[6] Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual. Sept. 2015.
[7] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI '88, pages 318-328, New York, NY, USA, 1988. ACM.
[8] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft., 5(3):308-323, Sept. 1979.
[9] OpenBLAS. http://xianyi.github.com/OpenBLAS/, 2012.
[10] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for generating BLAS-like libraries. ACM Trans. Math. Soft., 2013. Accepted.
[11] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of SC'98, 1998.

[Figure 12 plots: "Scheduled with Custom Macros Nehalem (icc)", "Scheduled with Custom Macros Nehalem (gcc)", "Scheduled with Custom Macros Sandy Bridge (icc)", and "Scheduled with Custom Macros Sandy Bridge (gcc)". y-axis: GFLOPS; x-axis: k (m = n = 1280); lines: Custom Macro Scheduled, Intrinsic Compiler Scheduled.]

Figure 12: In these plots we show that even when given the same mix of instructions to implement a kernel, an expert implements a higher performance kernel than the compiler. For each machine and compiler combination we compare an implementation that is hand scheduled and coded using our custom macros against the same ordered set of instructions coded using compiler intrinsics. The hand scheduled instructions perform significantly better than the compiler scheduled implementations.

[Figure 13 plots: "Best Schedule using Intrinsics versus Macros Nehalem (icc)", "Best Schedule using Intrinsics versus Macros Nehalem (gcc)", "Best Schedule using Intrinsics versus Macros Sandy Bridge (icc)", and "Best Schedule using Intrinsics versus Macros Sandy Bridge (gcc)". y-axis: GFLOPS; x-axis: k (m = n = 1280); lines: Best Macro, Best Intrinsic.]

Figure 13: In these plots we show that compiler intrinsics do not preserve the implemented instruction schedule as well as our custom macro instruction wrappers. For each system and compiler combination we implemented the same matrix-matrix multiply with the same static schedule using our custom macro instructions and using the compiler intrinsics. The x-axis is the problem size and the y-axis represents performance in GFLOPS. The top of each plot represents the peak performance of the target machine. For the Nehalem there is a significant difference in performance. On the Sandy Bridge the performance is comparable, but the assembly code produced by the compiler intrinsics does not preserve the order of the instructions. However, the compiler intrinsic implementations do not achieve this level of performance unless the instructions are scheduled.