Automatic Generation of Fast BLAS3-GEMM: a Portable Compiler Approach

Automatic Generation of Fast BLAS3-GEMM: ifact rt * * Comple A t te n * A te s W i E * s e n l C l o D C O o A Portable Compiler Approach * * c u e G m s E u e C e n R t v e o d t * y * s E a a l d u e a t Xing Suy Xiangke Liaoy Jingling Xuez yNational Laboratory for Parallel and Distributed Processing, College of Computer, NUDT, Changsha, 410073, China zSchool of Computer Science and Engineering, UNSW, Sydney, NSW 2052, Australia Abstract the other level-3 BLAS routines can be defined in terms of GEMM is the main computational kernel in BLAS3. Its GEMM and some level 1 and 2 computations [14]. In addi- micro-kernel is either hand-crafted in assembly code or gen- tion, GEMM is also the key library routine for deep learn- erated from C code by general-purpose compilers (guided by ing. Finally, the LINPACK benchmarks rely critically on architecture-specific directives or auto-tuning). Therefore, GEMM for its performance measurements. Therefore, op- either performance or portability suffers. timizing GEMM is the core task in any high-performance We present a POrtable Compiler Approach, POCA, im- BLAS implementation. However, given a C specification of plemented in LLVM, to automatically generate and opti- GEMM, general-purpose compilers are still not able to gen- mize this micro-kernel in an architecture-independent man- erate machine code that achieves near peak performance. ner, without involving domain experts. The key insight is Broadly speaking, there are three approaches for obtain- to leverage a wide range of architecture-specific abstrac- ing optimized loop-based GEMM kernels, (1) assembly pro- tions already available in LLVM, by first generating a vector- gramming, (2) auto-tuning, and (3) directive-based program- ized micro-kernel in the architecture-independent LLVM IR ming, yielding different tradeoffs between performance and and then improving its performance by applying a series of portability. With assembly programming, domain experts domain-specific yet architecture-independent optimizations. write a few innermost loops in GEMM directly in assem- The optimized micro-kernel drops easily in existing GEMM bly code. In the case of auto-tuning, ATLAS [32] generates frameworks such as BLIS and OpenBLAS. Validation fo- different GEMM kernels (with different parameter values) cuses on optimizing GEMM in double precision on two in C and compile them to run on the actual computing sys- architectures. On Intel Sandybridge and AArch64 Cortex- tem to find the best-performing one. Finally, directive-based A57, POCA’s micro-kernels outperform expert-crafted as- programming is embraced by POET [35] and AUGEM [31]. sembly code by 2.35% and 7.54%, respectively, and both Given a GEMM kernel in C, POET inserts annotations into BLIS and OpenBLAS achieve competitive or better perfor- the C code to direct source-to-source compiler transforma- mance once their micro-kernels are replaced by POCA’s. tions and AUGEM uses a template-based method to match predefined patterns in the C code and transforms the matched Categories and Subject Descriptors D.3.4 [Programming C code sequence into an optimized assembly code sequence. Languages]: Processors—Compilers; G.1.3 [Numerical These three approaches make different performance and Analysis]: Numerical Linear Algebra portability tradeoffs. Coding GEMM in assembly by do- Keywords dense linear algebra, GEMM, code optimization main experts can achieve near peak performance but is te- dious and non-portable. Auto-tuning makes the opposite 1. Introduction tradeoff. For example, ATLAS relies on general-purpose Dense linear algebra libraries are fundamental in scien- compilers to generate optimized GEMM kernels for differ- tific computing. Basic Linear Algebra Subprograms (BLAS) ent architectures automatically, thus resulting in portable are routines that provide standard building blocks for per- but sub-optimal performance. In the case of directive-based forming vector-vector operations (level 1), matrix-vector op- programming, POET and AUGEM resort to architecture- erations (level 2), and matrix-matrix operations (level 3). specific annotations and templates, respectively, written still BLAS libraries are widely available. Vendor supplied imple- by domain experts, to guide general-purpose compilers to mentations include Intel MKL, AMD ACML and NVIDIA produce optimized GEMM kernels. In particular, AUGEM cuBLAS. The HPC community have also contributed sev- is shown to generate optimized GEMM kernels for x86 only. eral high-performance BLAS implementations such as AT- How do we obtain near peak yet portable performance LAS [32], GotoBLAS [12], OpenBLAS [37] and BLIS [28]. for GEMM automatically for a wide range of architectures? For the level-3 BLAS (BLAS3), GEMM (GEneral Ma- Despite its algorithmic simplicity, this problem is very chal- trix Multiplication) is the main computational kernel, as 978-1-5090-4931-8/17 c 2017 IEEE 122 CGO 2017, Austin, USA Accepted for publication by IEEE. c 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. A B C GEMM Blocked Algorithm K N N for jj = 0:N :N 1 c ¡ M K M A =A; B = B[:][jj:jj + N 1]; C = C[:][jj:jj + N 1]; 0 0 c ¡ 0 c ¡ layer 1 Nc Nc K Nc Nc Kc for kk = 0:K :K 1 c ¡ M K M A =A [:][kk:kk + K 1]; B =B [kk:kk + K 1][:]; C =C ; 1 0 c ¡ 1 0 c ¡ 1 0 layer 2 Kc Kc Nc Nc Kc for ii = 1:M :M 1 Mc Mc c ¡ M M A =A [ii:ii + M 1][:]; B =B ; C =C [ii:ii + M 1][:]; 2 1 c ¡ 2 1 2 1 c ¡ layer 3 Kc Nc Nc for j = 1:Nr:Nc 1 Kc ¡ Mc Mc Nr A =A ; B =B [:][j:j + N 1]; C =C [:][j:j + N 1]; 3 2 3 2 r ¡ 3 2 r ¡ layer 4 Nr Kc Nr Nr Mr Mr for i = 0:Mr:Mc 1 Kc ¡ Mc Mc A =A [i:i + M 1][:]; B =B ; C =C [i:i + M 1][:]; 4 3 r ¡ 4 3 4 3 r ¡ layer 5 Kc Nr Nr 1 Mr Mr for k = 0:1:K 1 c ¡ 1 Kc A5=A4[:][k]; B5=B4[k][:]; C5=C4; l macro-kernel (GEBP) e n layer 6 r 1 Nr Nr e k 1 ¹ Mr Mr C +=A B ; 5 5 £ 5 layer 7 Figure 1. Structure of blocked DGEMM (the structures of SGEMM, CGEMM and ZGEMM are similar (Section 2)). lenging to solve. Any promising solution is significant due Intel Sandybridge and AArch64 Cortex-A57, POCA’s to the continued proliferation of computer architectures. micro-kernels outperform expert-crafted assembly code In this paper, we present a POrtable Compiler Approach, by 2.35% and 7.54% respectively. In addition, both BLIS POCA, implemented in LLVM, to automatically generate and OpenBLAS achieve competitive or better perfor- and optimize GEMM in an architecture-independent man- mance once their micro-kernels are replaced by POCA’s. ner. Due to the nature of GEMM, it suffices to focus on its • We provide a comprehensive quantitative analysis to un- innermost loop, known as a micro-kernel, µkernel. The key derstand and evaluate the impact of POCA-specific com- insight is to leverage a wide range of architecture-specific piler optimizations on kernel performance. abstractions (e.g., SIMD engines supported and instruction To the best of our knowledge, this is the first compiler ap- latencies) already available in LLVM, by first generating a proach for generating fast GEMM kernels portably, without vectorized µkernel in the architecture-independent LLVM involving domain experts. While the work presented here is IR and then boosting its performance by applying a series of for GEMM in BLAS3, the approach can be applied to the domain-specific yet architecture-independent optimizations. other kernels in BLAS3 such as TRSM and GEMV. The optimized µkernel, obtained without any involvement of domain experts, drops easily in existing GEMM frameworks 2. Structure of GEMM such as GotoBLAS [12], OpenBLAS [37] and BLIS [28]. GEMM comes with four main varaints, SGEMM, DGEMM, We restrict our presentation to GEMM that works on CGEMM and ZGEMM, which operate on four different data double-precision real numbers, known as DGEMM, as in types, single-precision floating-point (S), double-precision prior work [26, 31, 35], for two reasons. First, the basic floating-point (D), complex single-precision floating-point idea behind POCA applies to other variants of GEMM such (C), and complex double-precision floating-point (Z). We as SGEMM, CGEMM and ZGEMM (as discussed in Sec- first describe a blocked algorithm for DGEMM and intro- tion 2). Second, the LINPACK benchmarks, which rely on duce several basic concepts used throughout the paper. We GEMM as the performance-critical routine, must work on then look at SGEMM, CGEMM and ZGEMM briefly. double-precision real numbers in order to build the TOP500 DGEMM performs a matrix-multiply-add operation on list, ranking the world’s most powerful supercomputers. double-precision real numbers, C = βC + αAB, where This paper makes the following contributions: A; B and C are matrices of sizes M ×K, K×N and M ×N, respectively, and α and β are scalars. While this operation is • We introduce a portable compiler approach, POCA for algorithmically simple, so that a 3-deep loop nest suffices to generating highly optimized GEMM fully automatically.

Load more