Research Paper Creating a Compiler Optimized Inlineable

Academia Journal of Scientific Research 6(6): 262-271, June 2018 DOI: 10.15413/ajsr.2018.0118 ISSN 2315-7712 ©2018 Academia Publishing Research Paper Creating a compiler optimized inlineable implementation of INTEL SVML SIMD intrinsics Accepted 23rd May, 2018 ABSTRACT Single Input Multiple Data (SIMD) provides data parallelism execution through implemented SIMD instructions and registers. Most mainstream computers architectures have been enhanced to provide data parallelism through SIMD extensions. These advances in parallel hardware have not been accompanied by the necessary software libraries granting programmers a set of functions that could work across all major compilers adding a 4x, 8x or 16x performance increase as compared to standard SISD. Intel’s SVML library offers SIMD implementation of Mathematics and Scientific functions that have never been Promyk Zamojski* and Raafat Elfouly ported to work outside of Intel's own compiler. An Open Source inlineable implementation would increase performance and portability. This paper Mathematics and Computer Science illustrates the development of an alternative compiler neutral implementation of Department/ Rhode Island College, Intel's SVML library. RI *Corresponding author. E-mail: Keywords: Data Parallelism, SIMD Intrinsics, SVML, C++ Math Library, Computer [email protected]. Simulation, Scientific Mathematics Library. INTRODUCTION Data parallelism occurs when one or more operations are (https://www.cilkplus.org/), a programmer is able o force repeatedly applied to several values. Generally, the compiler to work harder on specific parts of code to programming languages sequentially programme scalar optimize data-parallelism. However, this is not always the values. Single Input Multiple Data (SIMD) compiler case since the compiler cannot re-order all scalar intrinsics allow operations to be grouped together leading operation to parallel operations. It is possible to do to improved transistors per Floating Point Operations conditional processing within SIMD; however, it requires reducing power usage and overall modern SIMD capable of some tricks since the compiler does not understand how to CPUs. SIMD has some limitation because you cannot easily implement from traditional scalar code. vectorize different functions with lots of statements. These In the case where automatic vectorization is not an limitations define SIMD Intrinsic programming to be option there needs to be a library of fictions a programmer almost a new programming paradigm within languages can target. This library should be portable enough to be like C or C++. able to port code written for this library to be able to be In most modern compilers the work of optimization can compiled by the main C/C++ compilers that support Intel generally outperform many programmer tricks to generate Intrinsics such as GCC, CLANG and MSVC. Many vendors optimal output. In the case of SIMD, the compiler is smart have specified an API for Intrinsic such as Intel® short enough to transform Vector/Arrays into optimized SIMD vector math library (SVML) operations. The idea is simple, instead of computing one (https://software.intel.com/sites/landingpage/IntrinsicsG value of an array at a time your compute identical uide/#techs=SVML), a software standard provided by operation on all values is able to fit into SIMD register. All Intel® C++ Compiler (https://software.intel.com/en- data must be subject to the same processing. With the us/node/524289). addition of compiler pragmas such as Intel® Cilk™ Plus SVML functions can be broken down into five (5) main Academia Journal of Scientific Research; Zamojski and Elfouly. 263 Figure 1: SIMD vs. SISD branching problem. categories: Distribution, Division, Exponentiation, Round, when SIMD comparative operation is used to compare two Trigonometry and subcategories of __m128, __m128d, (2) vectors resulting in mask of all 1 for true or mask of all __m128i, __m256, __m256d, __m256i, __m512, __m512d, 0 for false. Each mask corresponds to value held in vector __m512i corresponding to vector data types. When position whose size is allocated to each data type byte implementing SVML functions, the functions can be alignment (Figures 2 to 5). approximated using Approximation of Taylor series It is important to understand bitwise selection in SSE expansion in combination with bitwise manipulation of SIMD operations since they will take the pace of IEEE745 floating point representation (Malossi et al., jumping/branching based on condition in most arithmetic 2015; Nyland, 2004; Kretz, 2015). Reference operation. These bitwise operations become fundamental implementation of scalar version can be translated into in the design and analysis of SIMD optimized algorithms. SIMD version from Cephes library Since all data operation are identical across SIMD vector (http://www.netlib.org/cephes). The output of the code segments must be broken up based on condition and replacement functions can be tested and compared to properly placed into the correct vector portion without output form SVML library provided in Intel® C++ resorting to scalar operations to do so. It is important to Compiler (https://software.intel.com/en- note that SSE4.1 blendv bitwise selection under the hood is us/node/524289). performing the operation of a bitwise OR the result of a (~mask and first_input_value) and the result of the (mask and second_input_value). This function is completely BITWISE SELECTION bitwise in nature and depends on binary vector representation of TRUE and FALSE to allow the pass of Due to the nature of SIMD functions operating entirely values within each vector position within the SIMD bitwise, sequential functionality must be considered and register (Figure 6). reworked for sequential implementation of some non- Once a programmer understand that bitwise selection sequential/bitwise algorithms. This includes conditions masks are used in place of conditional statements the where an algorithm can jump from one point of the stack- porting of existing C/C++ library becomes easier. The frame to a point in order to compute one value that would programmer would still have to account for inefficiency of be common in SISD algorithms (Figure 1). computing all possible branches and corner cases which Bitwise selection occurs when vector bitmask is use to would later merge into the resulting SIMD vector. With this select values within the vector. Vector bitmask is produced in mind unnecessary corner cases and redundancy within Academia Journal of Scientific Research; Zamojski and Elfouly. 264 __m128 _mm_radian_asin_ps(__m128 a) { __m128 y, n, p; p = _mm_cmpge_ps(a, __m128_1); // if(a >= 1) Generates Mask 0xffffffff -OR- 0x0 n = _mm_cmple_ps(a, __m128_minus_1); // if(a <= -1) Generates Mask 0xffffffff -OR- 0x0 // 0xffffffff is TRUE // 0x0 is FALSE y = _mm_asin_ps(a); y = _mm_blendv_ps(y, __m128_PIo2F, p); // replace value for >= 1 in var y based on 0xffffffff mask y = _mm_blendv_ps(y, __m128_minus_PIo2F, n); // replace value for <= -1 in var y based on 0xffffffff mask return y; } Figure 2: SSE SIMD branching example. _mm_radian_asin_ps(__m128): push rsi movaps xmm9, xmm0 movups xmm8, XMMWORD PTR __m128_1[rip] cmpleps xmm8, xmm0 cmpleps xmm9, XMMWORD PTR __m128_minus_1[rip] call __svml_asinf4 movaps xmm1, xmm0 movaps xmm0, xmm8 blendvps xmm1, XMMWORD PTR __m128_PIo2F[rip], xmm0 movaps xmm0, xmm9 blendvps xmm1, XMMWORD PTR __m128_minus_PIo2F[rip], xmm0 movaps xmm0, xmm1 pop rcx ret Figure 3: SSE SIMD branching assembly language output (Intel C++ Compiler). float radian_asin(float a){ if(a >= 1){ return PIo2F; } else if(a <= -1){ return -PIo2F;} else{ return asin(a); } }Figure 4: Standard branching example. an algorithm should be evaluated as critical or non-critical to compute the final resulting value (Figure 7). Academia Journal of Scientific Research; Zamojski and Elfouly. 265 radian_asin(float): push rsi comiss xmm0, DWORD PTR .L_2il0floatpacket.3[rip] jae ..B1.6 movss xmm1, DWORD PTR .L_2il0floatpacket.1[rip] comiss xmm1, xmm0 jb ..B1.4 movss xmm0, DWORD PTR .L_2il0floatpacket.2[rip] pop rcx ret The..B1.4: FMA instruction set is an extension to the 128-bit and call asinf 256pop-bit rcx Streaming SIMD Extensions instructions in the Intel Compatibleret microprocessor instruction set to perform fused ..B1.6: multiplymovss– xmm0add (FMA), DWORD There PTR are.L_2il0floatpacket.0 two variants: [rip] FMA4pop rcxis supported in AMD processors starting with the ret Bulldozer (2011) architecture. FMA4 was realized in Figure 5: Standard branching example assembly language output (Intel C++ Compiler). hardware before FMA3 [6]. Ultimately FMA4 [9] was __m128idropped _mm_blendv_si128 in favor ( __m128ito the x, superior__m128i y, __m Intel128i mask) FMA3 { hardware implementation. FMA3 is supported in AMD processors since processors// Replace bit in x sincewith bit in2014. y when matchingIts goal bit inis mask to isremove set: CPU cycles returnby combining _mm_or_si128(_mm_andnot_si128(mask, multiplication andx), _mm_and_si128(mask, addition into y)); one } instruction. This would reduce the total exaction time of an Figure 6: Bitwise selection example code. algorithm. __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c) // Latency ~6 __m128 _mm_add_ps ( _mm_mul_ps (__m128 a, __m128 b), __m128 c) // Latency ~10 Figure 7: FMA vs. SSE CPU Cycle latency. FMA INSTRUCTION SET (https://software.intel.com/sites/default/files/managed/ 39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf). Ultimately, The FMA instruction set is an extension to the 128-bit

Research Paper Creating a Compiler Optimized Inlineable

Elementary Functions: Towards Automatically Generated, Efficient

Andre Heidekrueger

Hierarchical Roofline Analysis for Gpus: Accelerating Performance

Theoretical Peak FLOPS Per Instruction Set on Modern Intel Cpus

Introduction to Intel Scalable Architectures

Intel® Architecture Instruction Set Extensions and Future Features

Intel(R) Advanced Vector Extensions Programming Reference

Effective Vectorization with Openmp 4.5

Asianux Server 4 ==

VCL C++ Vector Class Library Manual

Speeding up Energy System Models - a Best Practice Guide

Intel® Architecture Instruction Set Extensions Programming Reference