Songklanakarin Journal of Science and Technology, SJST-2019-0174.R1

An Effective Implementation of Strassen's Algorithm using AVX Intrinsics for a Multicore Architecture

Journal: Songklanakarin Journal of Science and Technology
Manuscript ID: SJST-2019-0174.R1
Manuscript Type: Original Article
Date Submitted by the Author: 24-Aug-2019
Complete List of Authors: Oo, Nwe Zin (Prince of Songkla University, Dept. of Computer Engineering); Chaikan, Panyayot (Prince of Songkla University, Dept. of Computer Engineering)
Keyword: Engineering and Industrial Research

Original Article

An Effective Implementation of Strassen's Algorithm using AVX Intrinsics for a Multicore Architecture

Nwe Zin Oo, Panyayot Chaikan*

Department of Computer Engineering, Faculty of Engineering, Prince of Songkla University, Hat Yai, Songkhla, 90110, Thailand

* Corresponding author, Email address: [email protected]

Abstract

This paper proposes an effective implementation of Strassen's algorithm with AVX intrinsics to augment matrix-matrix multiplication in a multicore system. AVX-2 and FMA3 intrinsic functions are utilized, along with OpenMP, to implement the multiplication kernel of Strassen's algorithm. Loop tiling and unrolling techniques are also utilized to increase cache utilization. A systematic method is proposed for determining the best stop condition for the recursion to achieve maximum performance on specific matrix sizes. In addition, an analysis method makes fine-tuning possible when our algorithm is deployed on another machine with a different hardware configuration. Performance comparisons between our algorithm and the latest versions of two well-known open-source libraries have been carried out. Our algorithm is, on average, 1.52 and 1.87 times faster than the Eigen and OpenBLAS libraries, respectively, and can be scaled efficiently when the matrix becomes larger.

Keywords: Advanced vector extension, AVX, AVX-2, Matrix-Matrix multiplication, FMA, Strassen's algorithm


1. Introduction

In recent years, the Advanced Vector Extension (AVX) instruction set has been bundled with all the CPUs produced by Intel and AMD. It allows multiple pieces of floating-point data to be processed at the same time, resulting in very high performance. Its successor, AVX-2, added 256-bit integer operations and fused-multiply-accumulate operations for floating-point data, useful for scientific applications. Many researchers have reported on its use to augment processing performance. For example, Kye, Lee, and Lee increased the processing speed of matrix transposition (2018), Al Hasib, Cebrian, and Natvig proposed an implementation of k-means clustering for compressed datasets (2018), and Bramas and Kus sped up the processing of a sparse matrix-vector product (2018). Hassan et al. utilized AVX and OpenMP to accelerate vector-matrix multiplication (2018), Barash, Guskova, and Shchur improved the performance of random number generators (2017), and Bramas boosted the speed of the quicksort algorithm (2017).

We employ AVX intrinsics for an effective implementation of Strassen's algorithm for single precision matrix-matrix multiplication. We decided to utilize AVX-2 with its FMA3 capabilities, which are available in reasonably priced CPUs from both Intel and AMD. Our aim is to augment the speed of applications that rely on matrix-matrix multiplication using off-the-shelf CPUs. There is no need to buy an expensive graphics card with high power consumption when the AVX features bundled in current CPUs are sufficient.


2. Matrix-Matrix Multiplication

Matrix-matrix multiplication, defined as c = a×b, where a, b, and c are n×n, requires 2n^3 floating-point operations. The basic sequential algorithm is shown in Figure 1. The performance of the algorithm in Giga Floating-Point Operations per Second (GFLOPS) is

GFLOPS = (2 × n × n × n) / (s × 10^9),   (1)

where s is the execution time of the program in seconds.

3. AVX Instruction Set and the FMA3

The third generation of Intel's Advanced Vector Extensions (AVX) (Intel, 2011) comprises sixteen 256-bit registers, YMM0-YMM15, supporting both integer and floating-point operations. The AVX instructions allow eight single-precision floating-point operations to be processed simultaneously, twice the number supported by Streaming SIMD Extensions (SSE). There are four main ways to take advantage of AVX instructions: 1) using assembly language to call AVX instructions directly; 2) using AVX inline assembly in C or C++; 3) using compiler intrinsics; or 4) utilizing the compiler's automatic vectorization feature. We employ compiler intrinsics to implement Strassen's algorithm because this gives better performance than auto-vectorization, but is not as cumbersome or error prone as assembly language. Using AVX inline assembly in a high level language is not significantly different from utilizing compiler intrinsics (Hassana, Hemeida, & Mahmoud, 2016).

The syntax of AVX intrinsic functions follows the pattern _mm256_<operation>_<suffix> (Mitra, Johnston, Rendell, McCreath, & Zhou, 2013)


where the operation can be a load, store, arithmetic, or logical operation, and the suffix is the type of data used. For example, _mm256_add_ps and _mm256_add_pd add 32-bit and 64-bit floating-point data respectively. Figure 2 shows more function prototype examples.

Floating-point matrix-matrix multiplication relies on the fused-multiply-add operation, which can be implemented using the _mm256_mul_ps and _mm256_add_ps functions. However, replacing these two functions with a single _mm256_fmadd_ps call can speed up the computation. This fused-multiply-add (FMA) operation performs the multiplication and addition of the packed single-precision floating-point data in a single step with a single rounding. Intel Haswell processors have supported FMA since 2013 (Intel, 2019), and the processors currently produced by AMD also support it (Advanced Micro Devices, 2019).
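As a brief illustration of the difference, the following sketch (our own example, not a reproduction of the paper's Figure 2) contrasts the separate multiply-add with the fused version; it assumes eight-element float arrays a, b, and c that are 32-byte aligned and supplied by the caller.

#include <immintrin.h>

/* Multiply a[0..7] by b[0..7] and add the products to c[0..7].
   The two variants are functionally equivalent, but the fused version
   performs the multiply and add in one instruction with a single
   rounding, as described above. */
void madd_separate(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_load_ps(a);        /* load 8 packed floats */
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_load_ps(c);
    __m256 prod = _mm256_mul_ps(va, vb);  /* 8 multiplications    */
    vc = _mm256_add_ps(vc, prod);         /* 8 additions          */
    _mm256_store_ps(c, vc);
}

void madd_fused(const float *a, const float *b, float *c)
{
    __m256 va = _mm256_load_ps(a);
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_load_ps(c);
    vc = _mm256_fmadd_ps(va, vb, vc);     /* c = a*b + c in one step */
    _mm256_store_ps(c, vc);
}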


4. Optimization Methods for Parallel Matrix-Matrix Multiplication

The single-instruction-multiple-data (SIMD) processing of the AVX gives higher performance than using scalar instructions, and every processing core has an AVX unit. As a consequence, very high performance can be expected by utilizing AVX instructions on a multi-core machine. To maximize the performance of a parallel application utilizing AVX, OpenMP is employed in conjunction with two optimization techniques: loop tiling and loop unrolling.

4.1 Loop Tiling

If the size of the data is very large, it is impossible to keep it all inside the cache. Data movements between the cache and main memory may be required very often, leading to many cache miss penalties. To reduce this effect, the data can be split into smaller chunks, and each chunk is loaded by the processor and kept inside the cache automatically by its cache controller. Increased reuse of this data from the cache leads to improved performance. In the case of matrix-matrix multiplication, c = a×b, the matrices are stored in 3 arrays, and each data element of a and b will be accessed multiple times. If the matrix size is n×n, then each data element from each matrix will be accessed at least n times. When loop tiling is applied, the outer loop keeps a chunk of the first source matrix inside the cache, while a series of chunks taken from the second matrix are processed by the inner loop. This pattern allows the chunk of the first matrix to be reused many times before being flushed from the cache. The next chunk from the first matrix will then be processed using the same pattern, and so on. The chunk size in the outer loop must be large enough to minimize memory accesses and to increase temporal locality. However, it must not be larger than the L1 data cache, to prevent some of the data being evicted to a higher cache level. The programmer must find an appropriate chunk size for each implementation.

4.2 Loop Unrolling

Loop unrolling is an attempt to increase program execution speed by sacrificing program code size. By reducing the number of loop iterations, branch penalties are potentially decreased, but the loop body becomes larger as it is unrolled. Modern CPUs also use superscalar and out-of-order execution techniques, which enable many independent instructions to be executed simultaneously, producing instruction level parallelism. Therefore, the instruction execution units inside the processor are better utilized if the unrolled statements are independent of each other. However, because loop unrolling increases the size of the program, the code may become larger than the L1


code cache, which could degrade performance due to cache misses. Moreover, the unrolled code may be more difficult for the programmer to understand.

4.3 OpenMP

The OpenMP API for shared memory architectures allows the programmer to easily develop a parallel application for multiprocessors and multi-core machines. OpenMP directives allow a program to create multiple concurrent threads that are assigned to different processing cores through the runtime environment. It allows the data to be split between processing cores to encourage data parallelism, with each core processing only its part of the data. OpenMP also provides an easy way for the programmer to create different tasks executed by different threads, thereby supporting task parallelism.

To efficiently apply loop tiling, loop unrolling, and AVX intrinsics in OpenMP, there are several considerations that must be kept in mind. The first is that the data must be kept in the CPU registers as long as possible, before being moved to higher latency memory (the L1 cache, and then the higher cache levels). Also, loop tiling must allow at least one larger block of data to be used by every core at the same time. This reduces memory contention between the processing cores, while smaller blocks are accessed by different cores. However, to provide fast access, the larger block must not be bigger than the cache. Finally, unrolling must produce as many independent operations as possible, but should not result in register spilling and movement of the code to the L2 cache.
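To make these considerations concrete, the following sketch (our own simplification, not the paper's Figure 3) shows a scalar loop-tiled multiplication parallelized with OpenMP. The tile size BS is a hypothetical parameter that would have to be tuned against the L1 data cache as discussed above, and the output matrix c is assumed to be zero-initialized by the caller.

#include <stddef.h>

#define BS 16  /* hypothetical tile size; must be tuned per machine */

/* Scalar loop-tiled multiplication, c = a x b, for n x n row-major
   matrices.  The kk loop keeps a BS-wide band of matrix b in the cache
   while it is reused across the columns of c; the rows of c are split
   between the cores by OpenMP. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    #pragma omp parallel for            /* one band of rows per thread  */
    for (long i = 0; i < (long)n; i++) {
        for (size_t kk = 0; kk < n; kk += BS) {       /* tile the k loop */
            for (size_t k = kk; k < kk + BS && k < n; k++) {
                float aik = a[i * n + k];             /* reused n times  */
                for (size_t j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}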


5. Advanced AVX-Intrinsic based Matrix-Matrix Multiplication

The algorithm for single precision floating-point matrix-matrix multiplication (c = a×b) using AVX intrinsics is shown in Figure 3. The variables r0, r1, r2, r3, a0, b0 and b1 are of type __m256, and will be mapped to the YMM registers inside the CPU. Sixteen consecutive data elements from matrix b are preloaded using _mm256_load_ps, and kept in variables b0 and b1. The data from matrix a is loaded one element at a time before being broadcast to all eight elements of the target AVX register. This broadcast value is multiplied with the preloaded b0 and b1 values, and the products kept in two registers, r0 and r1. To reduce memory accesses, the preloaded data from matrix b is reused and multiplied with the next data element read from the next row of matrix a using the same pattern. Two additional registers, r2 and r3, are required to keep the appended results. The products are accumulated in variables r0 to r3 before being stored in the destination matrix c at the end of the innermost loop. The VectorSize parameter is set to 8 because it is the number of 32-bit floating-point operations that the AVX instructions can perform at once. The data elements a[i][k] and a[i+1][k] are multiplied with the data from matrix b, and four results are written into matrix c. This approach means that the unrolling factor, defined as UF, is set to 2 in Figure 3, but if greater performance is required, then this factor could be increased. For example, if UF is increased to 4, then four data elements from matrix a (a[i][k], a[i+1][k], a[i+2][k], a[i+3][k]) are multiplied with the preloaded data from matrix b, and eight products are stored in 8 variables, named r0 to r7, before being written to matrix c. By increasing the UF value, the preloaded data from matrix b is utilized repeatedly, thus reducing memory accesses and augmenting performance.
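Figure 3 is not reproduced here, but a minimal sketch written from the description above is given below for UF = 2. It omits the kk-loop tiling described next, uses _mm256_broadcast_ss for the broadcast (our assumption about the intrinsic intended by the text), and assumes row-major, 32-byte-aligned matrices whose dimension n is a multiple of 16.

#include <stddef.h>
#include <immintrin.h>

/* Sketch of the kernel described above (UF = 2, VectorSize = 8).
   For each k, sixteen consecutive floats of row k of b are preloaded
   into b0 and b1 and reused for two rows of a; the products are
   accumulated in r0..r3 and stored into c after the innermost loop. */
void avx_kernel_sketch(const float *a, const float *b, float *c, size_t n)
{
    const size_t VectorSize = 8;
    for (size_t i = 0; i < n; i += 2) {                  /* UF = 2 rows of a */
        for (size_t j = 0; j < n; j += 2 * VectorSize) { /* 16 columns of c  */
            __m256 r0 = _mm256_setzero_ps(), r1 = _mm256_setzero_ps();
            __m256 r2 = _mm256_setzero_ps(), r3 = _mm256_setzero_ps();
            for (size_t k = 0; k < n; k++) {
                __m256 b0 = _mm256_load_ps(&b[k * n + j]);              /* preload   */
                __m256 b1 = _mm256_load_ps(&b[k * n + j + VectorSize]);
                __m256 a0 = _mm256_broadcast_ss(&a[i * n + k]);         /* broadcast */
                r0 = _mm256_fmadd_ps(a0, b0, r0);
                r1 = _mm256_fmadd_ps(a0, b1, r1);
                a0 = _mm256_broadcast_ss(&a[(i + 1) * n + k]);          /* next row  */
                r2 = _mm256_fmadd_ps(a0, b0, r2);
                r3 = _mm256_fmadd_ps(a0, b1, r3);
            }
            _mm256_store_ps(&c[i * n + j], r0);
            _mm256_store_ps(&c[i * n + j + VectorSize], r1);
            _mm256_store_ps(&c[(i + 1) * n + j], r2);
            _mm256_store_ps(&c[(i + 1) * n + j + VectorSize], r3);
        }
    }
}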


Loop tiling is utilized in this algorithm by splitting the kk-loop counter into blocks whose size is specified by the TileSize parameter. The appropriate values for TileSize and UF need to be determined for several matrix sizes to achieve maximum performance. A multi-core version of this algorithm is created by applying a "#pragma omp parallel for" statement to the outermost loop, and the list of private variables for each thread must be declared explicitly.

Due to its use of AVX instructions, loop tiling, and unrolling, our proposed algorithm is called "AVX-Tiling" for easier reference in the rest of this paper.

6. Strassen's Algorithm and the AVX Instruction Set

The computational complexity of conventional matrix-matrix multiplication for matrices of size n×n is O(n^3), and Strassen's algorithm reduces this to O(n^2.8074) (Stothers, 2010). Many algorithms based on Coppersmith-Winograd's method (1990) have been proposed to achieve better asymptotic bounds. For example, Davie and Stothers reduced the complexity of Coppersmith-Winograd from O(n^2.375477) to O(n^2.3736897) (2013), and Le Gall's method is O(n^2.3728639) (2014). However, unlike Strassen's algorithm, the Coppersmith-Winograd based algorithms are rarely used in practice because they are difficult to implement (Le Gall, 2012).

The Strassen algorithm splits each input matrix into 4 chunks, each one fourth of the original's size. The addition, subtraction, and multiplication operations are applied to these sub-matrices, as shown in Figure 4, and the result is kept in an output matrix c. If a multiplication operation between these sub-matrices is required, then they are recursively split again using the same mechanism until a stop condition is reached.


In theory, this condition is determined by the smallest size that the multiplication routine can support, but usually it is unnecessary to use Strassen's algorithm down to this level. Instead, the programmer must determine the number of recursive levels that gives the best performance for their implementation. At each level, it is necessary to allocate memory for 21 temporary matrices, i.e. M1, M1A, M1B, M2, M2A, B11, M3, M3B, A11, M4, M4B, A22, M5, M5A, B22, M6, M6A, M6B, M7, M7A, and M7B, all of which are the same size. After the matrix c at the current level is complete, these matrices are deallocated to prevent memory overflow.

Figure 5 shows pseudo code for implementing Strassen's algorithm using AVX intrinsics. To create the sub-matrices at each level of recursion, two steps are required, and both use OpenMP to enable parallel operations to take place. The first step creates sub-matrices that use data from matrix a, and the second creates sub-matrices from matrix b. Figure 6 shows the pseudo code for creating A11, A22, and all of MxA, and this pseudo code can also be applied to creating B11, B22, and all of MxB. The algorithm passes its sub-matrices through 7 recursive calls to produce the output data in M1-M7. Then Cxy is generated from M1-M7 using the same programming style as in Figure 6.
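The paper's Figure 5 is not reproduced here; the following is only a structural sketch of the recursion written from the description above, using the classical Strassen formulas. A plain triple loop stands in for the AVX-Tiling kernel at the stop condition, allocation error handling is omitted, and the helper names are our own.

#include <stdlib.h>
#include <string.h>

/* Copy the h x h quadrant starting at (row, col) of an n x n row-major
   matrix m into a contiguous h x h buffer q. */
static void get_q(const float *m, float *q, size_t n, size_t h,
                  size_t row, size_t col)
{
    for (size_t i = 0; i < h; i++)
        memcpy(&q[i * h], &m[(row + i) * n + col], h * sizeof(float));
}

static void addm(const float *x, const float *y, float *z, size_t h)
{
    for (size_t i = 0; i < h * h; i++) z[i] = x[i] + y[i];
}

static void subm(const float *x, const float *y, float *z, size_t h)
{
    for (size_t i = 0; i < h * h; i++) z[i] = x[i] - y[i];
}

/* Base case: a plain triple loop stands in for the AVX-Tiling kernel of
   Section 5, which is what the actual implementation calls here. */
static void base_mul(const float *a, const float *b, float *c, size_t n)
{
    memset(c, 0, n * n * sizeof(float));
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* One step of Strassen's recursion, c = a x b, for n x n matrices
   (n even).  Recursion stops after 'levels' splits. */
void strassen_sketch(const float *a, const float *b, float *c,
                     size_t n, int levels)
{
    if (levels == 0) { base_mul(a, b, c, n); return; }

    size_t h = n / 2, bytes = h * h * sizeof(float);
    float *A[4], *B[4], *M[7], *T1 = malloc(bytes), *T2 = malloc(bytes);
    for (int i = 0; i < 4; i++) { A[i] = malloc(bytes); B[i] = malloc(bytes); }
    for (int i = 0; i < 7; i++) M[i] = malloc(bytes);

    get_q(a, A[0], n, h, 0, 0); get_q(a, A[1], n, h, 0, h);   /* A11 A12 */
    get_q(a, A[2], n, h, h, 0); get_q(a, A[3], n, h, h, h);   /* A21 A22 */
    get_q(b, B[0], n, h, 0, 0); get_q(b, B[1], n, h, 0, h);   /* B11 B12 */
    get_q(b, B[2], n, h, h, 0); get_q(b, B[3], n, h, h, h);   /* B21 B22 */

    addm(A[0], A[3], T1, h); addm(B[0], B[3], T2, h);
    strassen_sketch(T1, T2, M[0], h, levels - 1);    /* M1 = (A11+A22)(B11+B22) */
    addm(A[2], A[3], T1, h);
    strassen_sketch(T1, B[0], M[1], h, levels - 1);  /* M2 = (A21+A22)B11       */
    subm(B[1], B[3], T2, h);
    strassen_sketch(A[0], T2, M[2], h, levels - 1);  /* M3 = A11(B12-B22)       */
    subm(B[2], B[0], T2, h);
    strassen_sketch(A[3], T2, M[3], h, levels - 1);  /* M4 = A22(B21-B11)       */
    addm(A[0], A[1], T1, h);
    strassen_sketch(T1, B[3], M[4], h, levels - 1);  /* M5 = (A11+A12)B22       */
    subm(A[2], A[0], T1, h); addm(B[0], B[1], T2, h);
    strassen_sketch(T1, T2, M[5], h, levels - 1);    /* M6 = (A21-A11)(B11+B12) */
    subm(A[1], A[3], T1, h); addm(B[2], B[3], T2, h);
    strassen_sketch(T1, T2, M[6], h, levels - 1);    /* M7 = (A12-A22)(B21+B22) */

    for (size_t i = 0; i < h; i++)                   /* assemble the quadrants of c */
        for (size_t j = 0; j < h; j++) {
            size_t k = i * h + j;
            c[i * n + j]           = M[0][k] + M[3][k] - M[4][k] + M[6][k]; /* C11 */
            c[i * n + j + h]       = M[2][k] + M[4][k];                     /* C12 */
            c[(i + h) * n + j]     = M[1][k] + M[3][k];                     /* C21 */
            c[(i + h) * n + j + h] = M[0][k] - M[1][k] + M[2][k] + M[5][k]; /* C22 */
        }

    for (int i = 0; i < 4; i++) { free(A[i]); free(B[i]); }
    for (int i = 0; i < 7; i++) free(M[i]);
    free(T1); free(T2);
}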


7. Experimental Results and Discussions

The performance of single precision floating-point matrix-matrix multiplication was evaluated across several configurations. The test machine was a 2.5 GHz Core i7-4710HQ (Haswell microarchitecture) with 4 processing cores, with hyperthreading turned on. The programs were written using Microsoft Visual C++ 2017 and OpenMP. Every test matrix was of size n×n. Sixteen different values of n were investigated, starting from 1024 and incrementing by 1024 in each subsequent configuration, until the last value of n was 16384. To obtain the GFLOPS performance of each configuration, the RDTSC instruction (Intel, 1997) was executed before and after each matrix multiplication to measure the number of clock cycles. The s variable in equation (1) was obtained by dividing this number of clock cycles by the CPU frequency.

We implemented a simple matrix-matrix multiplication, as shown in Figure 1, and a parallel version was obtained by adding the "#pragma omp parallel for" statement, along with the declaration of the local variables utilized in each thread. When the compiler's optimization and automatic vectorization were turned off, the obtained performance is as shown in Figure 7. The parallel version is about 2.99 times faster than the serial version on average. The best performance of 2.09 GFLOPS came from the parallel version when the matrix size was 1024×1024. After automatic vectorization was turned on, and the optimization option of the compiler set to /O2 for maximum speed, the performance of the simple matrix-matrix multiplication increased dramatically. The best performance was now 42.37 GFLOPS when the matrix size was 2048×2048.

There are two things to note. The first is that utilizing OpenMP speeds up matrix-matrix multiplication quite efficiently. Although the average speedup factor (3.6) is less than the number of processing cores (4), some results attained speedup factors higher than the number of cores. For example, when the matrix dimensions were 2048, 3072, and 8192, the speedups were 5.97, 4.73, and 4.04 respectively. Secondly, the effect of the compiler optimizations and automatic vectorization is substantial when we compare the results of the parallel versions of matrix-matrix multiplication. As a consequence, all the implementations in the rest of this section use the compiler's optimizations.
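As an illustration of how the GFLOPS figures can be obtained, the following sketch times a triple-loop multiplication in the style of Figure 1 with the RDTSC counter; __rdtsc is the MSVC/GCC intrinsic for that instruction, the 2.5 GHz value is the test machine's nominal clock, and the matrices are left as zeros since only the timing is of interest here.

#include <stdio.h>
#include <stdlib.h>
#ifdef _MSC_VER
#include <intrin.h>     /* __rdtsc on MSVC      */
#else
#include <x86intrin.h>  /* __rdtsc on GCC/Clang */
#endif

/* Simple sequential multiplication in the style of Figure 1. */
static void simple_mul(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

int main(void)
{
    const size_t n = 1024;
    const double cpu_hz = 2.5e9;                 /* nominal clock of the test CPU */
    float *a = calloc(n * n, sizeof(float));
    float *b = calloc(n * n, sizeof(float));
    float *c = calloc(n * n, sizeof(float));

    unsigned long long t0 = __rdtsc();
    simple_mul(a, b, c, n);
    unsigned long long t1 = __rdtsc();

    double s = (double)(t1 - t0) / cpu_hz;       /* execution time used in equation (1) */
    double gflops = 2.0 * n * n * n / (s * 1e9);
    printf("%zu x %zu: %.2f GFLOPS\n", n, n, gflops);

    free(a); free(b); free(c);
    return 0;
}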


Our AVX-Tiling algorithm employs several parameters which can be adjusted to obtain better performance. To make AVX-Tiling more suitable for the recursion in Strassen's algorithm, the unrolling factor UF should be a power of two. As a consequence, our algorithm employed four different UF values – 2, 4, 8, and 16 – combined with four different TileSize values – 8, 16, 32, and 64 – resulting in 16 combinations. The best {UF, TileSize} pair was {4, 16}, and its performance is shown in Figure 8. The best performance was 107.39 GFLOPS when the matrix size was 2048×2048, but the performance tends to decrease as the matrix size gets larger. The lowest performance, as shown in Figure 8, was 34.14 GFLOPS.

Why did a UF value of 4 give better results than 2, 8, or 16? Disassembly of the compiled program showed that if UF was set to 2 or 4, then the required number of accumulator registers, 4 and 8 respectively, did not exceed the number of available AVX registers. The total number of independent calculations using the FMA instruction should be as large as possible to provide maximum throughput of the FMA engine inside the AVX unit. Therefore setting UF to 4 gave better performance than 2, but setting UF to 8 or 16 required 16 or 32 accumulator registers respectively. This, combined with the three registers needed to store the preloaded values from matrix b and the broadcast value from matrix a, exceeded the number of hardware registers. This caused the compiler to use register spilling, which resulted in reduced performance.

Two more related questions are: 1) what was the reason for the decrease in performance when the matrix size was larger, and 2) why was maximum performance obtained for a matrix size of 2048×2048?


By profiling the program, we found that OpenMP parallelizes the i-loop of the AVX-Tiling algorithm among the processing cores. Let A, B, and C be the tiles of data for the matrices a, b, and c in the j-loop, as shown in Figure 9. The total number of bytes in the A tile can be calculated from UF*TileSize*sizeof(float), the total number of bytes in the B tile can be determined using TileSize*n*sizeof(float), and the number of bytes in the C tile is obtained from UF*n*sizeof(float). When TileSize is set to 16, the size of the A tile is 256 bytes, and is not dependent on the size of the source matrices, but the sizes of the B and C tiles do depend on it. Table 1 shows the memory required for the B and C tiles for different matrix sizes. Since hyperthreading was turned on, each processing core is responsible for 2 threads at a time, so each thread in the same processing core will read the same data from the B tile. However, two threads require two A tiles and two C tiles. As mentioned before, the size of the A tile is quite small, so reading two separate A tiles for each thread is not a problem because both tiles can be stored inside the core's L1 cache. However, the sizes of the B and C tiles depend on the size of the destination matrix, so it is impossible to keep two C tiles for both threads, and a B tile, in the L1 cache simultaneously. As shown in Table 1, when the matrix size is no larger than 2048×2048, all the tiles from matrices b and c can be kept inside the L2 cache, which is 256 KB large. This size restriction is the main reason why our AVX-Tiling algorithm gives the best result for a matrix size of 2048×2048, when TileSize is set to 16.

Once the best values for UF and TileSize had been determined, the AVX-Tiling algorithm was utilized as the multiplication kernel for Strassen's algorithm. Our implementation of Strassen's algorithm using the AVX, as outlined in Figures 4 and 5, relies on the creation of sub-matrices at each level of recursion, along with SIMD


addition/subtraction functions, and multiplication kernels. The algorithm divides the matrix into four parts at each level of recursion, stopping when the level reaches the stop condition. Then the multiplication kernel is utilized to obtain a result for that level. Four different recursive levels were tested, and the results are shown in Figure 10 and Table 2.

Table 2 shows the performance of our implementation of Strassen's algorithm. Four different recursive levels were tested for each matrix size, and the performance was obtained in GFLOPS. The level of recursion that gave the best performance for each matrix size is shown in bold italics in the respective row of the table. For a matrix of size 5120×5120 or smaller, the best performance was obtained with one level of recursion. For matrix sizes of 6144×6144 to 10240×10240, our algorithm gave the best results at two levels of recursion. However, for very large matrices, of sizes larger than 10240×10240, three levels of recursion gave the optimum performance.

Why do different matrix sizes need different levels of recursion to attain peak performance, as shown in Figure 10 and Table 2? We calculated the sizes of the sub-matrices at each recursive level of the tested matrix sizes, and the results are shown in Table 3. We found that almost all the sub-matrices that deliver maximum performance are of size 2048×2048 or smaller. This result conforms with what happened when we used the AVX-Tiling algorithm alone, as shown in Table 1. However, one larger sub-matrix size, 2560×2560, also delivered the best performance. After calculating the required memory for the B and C tiles, we found that the total memory required for (B+2*C) equaled 240 KB, which was smaller than the L2 cache size of our machine.
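To make the cache argument concrete, the following small sketch evaluates the tile-footprint formulas above for the sub-matrix sizes in question; the 256 KB L2 size is that of the test machine, and the formulas are those given for the {UF, TileSize} = {4, 16} configuration.

#include <stdio.h>

/* Evaluate the B and C tile footprints (in KB) for a sub-matrix
   dimension n, using the formulas above with UF = 4 and TileSize = 16. */
int main(void)
{
    const int UF = 4, TileSize = 16, L2_KB = 256;
    const int dims[] = { 1024, 2048, 2560, 3072 };

    for (int i = 0; i < 4; i++) {
        int n = dims[i];
        double B   = TileSize * n * sizeof(float) / 1024.0;  /* B tile  */
        double C   = UF       * n * sizeof(float) / 1024.0;  /* C tile  */
        double tot = B + 2 * C;                              /* B + 2*C */
        printf("%5d: B = %4.0f KB, C = %3.0f KB, B+2C = %4.0f KB (%s L2)\n",
               n, B, C, tot, tot <= L2_KB ? "fits in" : "exceeds");
    }
    return 0;
}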


These results highlight two important considerations. The first is that when the matrix size is small (≤ 2048), the use of AVX-Tiling alone (see Figure 8 and Figure 10) gives better performance at all recursive levels compared to Strassen+AVX-Tiling. The second is that when the matrix size becomes larger, Strassen+AVX-Tiling gives better performance, but the number of recursive levels needs to be determined based on the size of the matrix and the size of the L2 cache.

To give our algorithm the flexibility to be implemented on processors with different sizes of L2 cache, we propose a rule that obtains the optimal recursive level for the Strassen algorithm when utilizing AVX-Tiling as the multiplication kernel:

L = arg min_{L ≥ 0} { C − (n / 2^L)(4·TileSize + 8·UF) : C ≥ (n / 2^L)(4·TileSize + 8·UF) },   (2)

where C is the size of the L2 cache in bytes and n is the size of the destination matrix. In other words, L is the smallest number of recursive levels for which the B tile and two C tiles of the resulting sub-matrices fit inside the L2 cache.

If L from equation (2) equals 0, then AVX-Tiling should be applied to the input matrix directly. However, when L is greater than zero, Strassen+AVX-Tiling is utilized with L recursive call levels.
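A direct transcription of this rule into code might look like the following sketch; the L2 size, UF, and TileSize values are those of the configuration discussed above and would have to be changed for a different machine.

#include <stdio.h>

/* Return the smallest recursion level L (equation (2)) such that the
   sub-matrix working set, (n / 2^L) * (4*TileSize + 8*UF) bytes, fits
   into an L2 cache of cache_bytes bytes. */
static int recursion_levels(long n, long cache_bytes, long UF, long TileSize)
{
    int L = 0;
    while ((n >> L) * (4 * TileSize + 8 * UF) > cache_bytes)
        L++;
    return L;
}

int main(void)
{
    const long L2 = 256 * 1024;   /* L2 cache of the test machine */
    const long UF = 4, TileSize = 16;
    long sizes[] = { 2048, 5120, 10240, 16384 };

    for (int i = 0; i < 4; i++)
        printf("n = %5ld -> L = %d\n", sizes[i],
               recursion_levels(sizes[i], L2, UF, TileSize));
    return 0;
}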


We compared the performance of our algorithm, utilizing equation (2), with the latest versions of two open-source libraries, Eigen (version 3.3.7) and OpenBLAS (version 0.3.5). These libraries were compiled on our test machine, with /O2 optimization, multithreaded support, and other optimizations for the Haswell microarchitecture, and Figure 11 shows their GFLOPS performance. Our algorithm is, on average, 1.52 and 1.87 times faster than Eigen and OpenBLAS respectively.

8. Conclusions

We have proposed an effective implementation of Strassen's algorithm for matrix-matrix multiplication that utilizes AVX instructions and OpenMP on a multi-core architecture. The results show that our algorithm is, on average, 1.52 and 1.87 times faster than the latest versions of the Eigen and OpenBLAS libraries, respectively. In addition, the performance of our algorithm increases as the matrix size becomes larger, while the performance of both open-source libraries remains flat.

Even though Strassen's algorithm requires fewer multiplications than conventional matrix multiplication, it does require additional additions and subtractions, and comes with the overhead of memory copying between the current source matrix and its sub-matrices. Therefore, too many recursive calls in Strassen's algorithm may reduce performance. We have proposed a systematic method for determining the optimum number of recursive levels to obtain the best performance for each matrix size. It ensures that the working set of each sub-matrix multiplication does not exceed the size of the L2 cache.

Acknowledgments

This work was supported by the Higher Education Research Promotion and the Thailand's Education Hub for Southern Region of ASEAN Countries Project Office of the Higher Education Commission.

The authors are grateful to Dr. Andrew Davison for his kind help in polishing the language of this paper.


References

Advanced Micro Devices, Inc. (2019). AMD64 Architecture Programmer's Manual Volume 4: 128-Bit and 256-Bit Media Instructions. Retrieved from https://www.amd.com/system/files/TechDocs/26568.pdf

Al Hasib, A., Cebrian, J.M., & Natvig, L. (2018). A Vectorized K-Means Algorithm for Compressed Datasets: Design and Experimental Analysis. Journal of Supercomputing, 74(6), 2705-2728. doi: 10.1007/s11227-018-2310-0

Bramas, B. (2017). A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel Skylake. International Journal of Advanced Computer Science and Applications, 8(10), 337-344. doi: 10.14569/IJACSA.2017.081044

Bramas, B., & Kus, P. (2018). Computing the Sparse Matrix Vector Product using Block-based Kernels without Zero Padding on Processors with AVX-512 Instructions. PeerJ Computer Science, e151. doi: 10.7717/peerj-cs.151

Barash, L.Y., Guskova, M.S., & Shchur, L.N. (2017). Employing AVX Vectorization to Improve the Performance of Random Number Generators. Programming and Computer Software, 43(3), 145-160. doi: 10.1134/S0361768817030033

Coppersmith, D., & Winograd, S. (1990). Matrix Multiplication via Arithmetic Progressions. Journal of Symbolic Computation, 9(3), 251-280. doi: 10.1016/S0747-7171(08)80013-2

Davie, A.M., & Stothers, A.J. (2013). Improved Bound for Complexity of Matrix Multiplication. Proceedings of the Royal Society of Edinburgh, 351-369. doi: 10.1017/S0308210511001648


Hassana, S.A., Hemeida, A.M., & Mahmoud, M.M. (2016). Performance Evaluation of Matrix-Matrix Multiplications using Intel's Advanced Vector Extensions (AVX). Microprocessors and Microsystems, 47(SI), 369-374. doi: 10.1016/j.micpro.2016.10.002

Hassan, S.A., Mahmoud, M.M., Hemeida, A.M., & Saber, M.A. (2018). Effective Implementation of Matrix-Vector Multiplication on Intel's AVX Multicore Processor. Computer Languages, Systems & Structures, 51, 158-175. doi: 10.1016/j.cl.2017.06.003

Intel Corporation. (1997). Using the RDTSC Instruction for Performance Monitoring. Retrieved from http://developer.intel.com/drg/pentiumII/appnotes/RDTSCPM1.HTM

Intel Corporation. (2011). Intel® 64 and IA-32 Architectures Software Developer's Manual. Retrieved from https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

Intel Corporation. (2019). Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture. Retrieved from https://software.intel.com/sites/default/files/managed/a4/60/253665-sdm-vol-1.pdf

Kye, H., Lee, S.H., & Lee, J. (2018). CPU-based Real-Time Maximum Intensity Projection via Fast Matrix Transposition using Parallelization Operations with AVX Instruction Set. Multimedia Tools and Applications, 77(12), 15971-15994. doi: 10.1007/s11042-017-5171-2


Le Gall, F. (2012). Faster Algorithms for Rectangular Matrix Multiplication. Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science, 514-523. doi: 10.1109/FOCS.2012.80

Le Gall, F. (2014). Powers of Tensors and Fast Matrix Multiplication. Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, 296-303. doi: 10.1145/2608628.2608664

Mitra, G., Johnston, B., Rendell, A.P., McCreath, E., & Zhou, J. (2013). Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms. IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 1107-1116. doi: 10.1109/IPDPSW.2013.207

Stothers, A.J. (2010). On the Complexity of Matrix Multiplication (Doctoral thesis, University of Edinburgh, Scotland). Retrieved from https://www.era.lib.ed.ac.uk/bitstream/handle/1842/4734/Stothers 2010.pdf


List of figures

Figure 1 Basic sequential algorithm to calculate matrix-matrix multiplication of size n×n.
Figure 2 Example AVX intrinsic function prototypes.
Figure 3 Matrix-matrix multiplication of size n×n using AVX intrinsics.
Figure 4 Strassen's algorithm.
Figure 5 Our proposed AVX based Strassen algorithm.
Figure 6 Create A11, A22, and the MxA sub-matrices.
Figure 7 Performance of simple matrix-matrix multiplication.
Figure 8 Performance of our AVX-Tiling matrix-matrix multiplication when the {UF, TileSize} was set to {4, 16}.
Figure 9 Tiles for the source and destination matrices in the j-loop of the AVX-Tiling algorithm.
Figure 10 Performance of Strassen matrix-matrix multiplication for different levels of recursion.
Figure 11 Performance comparison of our AVX-based Strassen's algorithm versus the Eigen and the OpenBLAS libraries.


Figure 1 Basic sequential algorithm to calculate matrix-matrix multiplication of size n×n.

Figure 2 Example AVX intrinsic function prototypes.


Figure 3 Matrix-matrix multiplication of size n×n using AVX intrinsics.


Figure 4 Strassen's algorithm.


Figure 5 Our proposed AVX based Strassen algorithm.

Figure 6 Create A11, A22, and the MxA sub-matrices.


Figure 7 Performance of simple matrix-matrix multiplication (GFLOPS versus matrix dimension, for the serial and parallel versions with and without compiler optimizations).

Figure 8 Performance of our AVX-Tiling matrix-matrix multiplication when the {UF, TileSize} was set to {4, 16} (GFLOPS versus matrix dimension).


Figure 9 Tiles for the source and destination matrices in the j-loop of the AVX-Tiling algorithm (within the n×n matrices a, b, and c, the A tile is UF×TileSize, the B tile is TileSize×n, and the C tile is UF×n).

Figure 10 Performance of Strassen matrix-matrix multiplication for different levels of recursion (GFLOPS versus matrix dimension, for 1 to 4 recursive levels).


Figure 11 Performance comparison of our AVX-based Strassen's algorithm versus the Eigen and the OpenBLAS libraries (GFLOPS versus matrix dimension, for Strassen+AVX-Tiling, Eigen, and OpenBLAS).


List of tables

Table 1 Memory requirements for each tile, and total memory required for (B+2*C) when the {UF, TileSize} is {4, 16}.
Table 2 Performance of matrix-matrix multiplication using Strassen's algorithm.
Table 3 Sizes of the sub-matrices at different recursive levels.


Matrix size      Memory required for each tile (kilobytes)
                 B        C        B+2*C
1024×1024        64       16       96
2048×2048        128      32       192
3072×3072        192      48       288
4096×4096        256      64       384

Table 1 Memory requirements for each tile, and total memory required for (B+2*C) when the {UF, TileSize} is {4, 16}.


Matrix size      Performance (GFLOPS) at different levels of recursion
                 1 level    2 levels    3 levels    4 levels
1024×1024        79.29      53.06       28.34       20.00
2048×2048        96.21      95.89       63.03       33.87
3072×3072        121.25     99.78       69.74       43.54
4096×4096        102.51     93.64       75.92       53.15
5120×5120        99.13      94.87       85.21       60.72
6144×6144        88.15      133.48      119.54      76.49
7168×7168        121.14     145.02      126.34      84.64
8192×8192        110.88     133.99      120.81      88.83
9216×9216        107.72     138.29      136.12      100.43
10240×10240      101.54     143.20      137.33      106.98
11264×11264      113.45     128.18      141.06      113.34
12288×12288      97.01      126.24      138.45      118.65
13312×13312      97.05      128.69      146.82      124.58
14336×14336      84.16      125.24      147.07      129.85
15360×15360      89.40      120.48      154.62      134.29
16384×16384      74.77      120.62      141.80      126.67

Table 2 Performance of matrix-matrix multiplication using Strassen's algorithm.


Matrix size      Sizes of the sub-matrices to be calculated with the AVX-Tiling algorithm at different recursive levels
                 1 level      2 levels     3 levels     4 levels
1024×1024        512×512      256×256      128×128      64×64
2048×2048        1024×1024    512×512      256×256      128×128
3072×3072        1536×1536    768×768      384×384      192×192
4096×4096        2048×2048    1024×1024    512×512      256×256
5120×5120        2560×2560    1280×1280    640×640      320×320
6144×6144        3072×3072    1536×1536    768×768      384×384
7168×7168        3584×3584    1792×1792    896×896      448×448
8192×8192        4096×4096    2048×2048    1024×1024    512×512
9216×9216        4608×4608    2304×2304    1152×1152    576×576
10240×10240      5120×5120    2560×2560    1280×1280    640×640
11264×11264      5632×5632    2816×2816    1408×1408    704×704
12288×12288      6144×6144    3072×3072    1536×1536    768×768
13312×13312      6656×6656    3328×3328    1664×1664    832×832
14336×14336      7168×7168    3584×3584    1792×1792    896×896
15360×15360      7680×7680    3840×3840    1920×1920    960×960
16384×16384      8192×8192    4096×4096    2048×2048    1024×1024

Table 3 Sizes of the sub-matrices at different recursive levels.
