Creating a Compiler Optimized Inlineable Implementation of Intel

International Journal of Academic Engineering Research (IJAER) ISSN: 2000-001X Vol. 2 Issue 7, July – 2018, Pages: 6-13 Creating a Compiler Optimized Inlineable Implementation of Intel Svml Simd Intrinsics Promyk Zamojski1, Raafat Elfouly2 Mathematics and Computer Science Department/ Rhode Island College, RI Email: [email protected], [email protected] Abstract: Single Input Multiple Data (SIMD) provides data parallelism execution via implemented SIMD instructions and registers. Most mainstream computers architectures have been enhanced to provide data parallelism though SIMD extensions. These advances in parallel hardware have not been accompanied by the necessary software libraries granting programmers a set of functions that could work across all major compilers adding a 4x, 8x or 16x performance increase compared to standard SISD. Intel’s SVML library offers SIMD implementation of Math and Scientific functions that have never been ported to work outside of Intel's own compiler. An Open Source inlineable implementation would increase performance and portability. This paper illustrates the development of an alternative compiler neutral implementation of Intel's SVML library. Keywords: Data Parallelism, SIMD Intrinsics, SVML, C++ Math Library, Computer Simulation, Scientific Math Library. 1. INTRODUCTION Compiler[4]. SVML functions can be broken down into 5 main category, Distribution, Division, Exponentiation, Data parallelism occurs when one or more operations Round, Trigonometry and subcategories of __m128, are applied repeatedly to several values. Generally __m128d, __m128i, __m256, __m256d, __m256i, __m512, programming languages sequential programming of scalar __m512d, __m512i corresponding to vector data types. values. Single Input Multiple Data (SIMD) compiler When implementing SVML functions the functions can be intrinsics allow for operations to be grouped together which approximated using Approximation of Taylor series leads to an improved transistors per Floating Point expansion in combination with bitwise manipulation of Operations reducing power usage and overall modern SIMD IEEE745 floating point representation [10],[11],[12]. capable CPUs. SIMD has some limitation because you Reference implementation of scalar version can be translated cannot vectorize different functions with lots of if else into SIMD version from Cephes library[2]. The output of the statements easily. These limitation define SIMD Intrinsic replacement functions can be tested and compared to output programming to be almost a new programming paradigm form SVML library provided in Intel® C++ Compiler[4]. within languages like C or C++. In most modern compilers the work of optimization can generally outperform many ORGANIZATION programmer tricks to generate optimal output. In the case of This paper is organized into four section, section one is the SIMD the compiler is smart enough to transform introduction. In section two we will talk about bitwise Vector/Arrays into optimized SIMD operations. The idea is selection. In section three we will introduce the idea of simple, instead of computing one value of an array at a time fused multiply accumulator FMA. In section four we will your compute identical operation on all values able to fit into SIMD register. All data must be subject to the same talk about the implementation and its parts including sign processing. With the addition of compiler pragmas such as preservation, exponentiation, trigonometry, and distribution Intel® Cilk™ Plus[1], a programmer is able to force the functions. Conclusion and results are in section five. compiler to work harder on specific parts of code to optimize 2. BITWISE SELECTION data-parallelism. However this is not always the case since the compiler cannot re-order all scalar operation to parallel Due to the nature of SIMD functions operating entirely operations. It is possible to do conditional processing within bitwise, sequential functionality must be considered and SIMD however it requires some tricks the compiler does not reworked for sequential implementation of some non- understand how to implement from traditional scalar code. In sequential/bitwise algorithms. This includes conditions the case where automatic vectorization is not an option there where an algorithm can jump from one point of the stack- needs to be a library of fictions a programmer can target. frame to a point in order to compute one value that would be This library should be portable enough to be able to port common in SISD algorithms. code write for this library to be able to be compiled by the main C/C++ compilers that support Intel Intrinsic such as Figure 1: SIMD vs SISD Branching Problem GCC, CLANG and MSVC. Many Vendors have specified an API for Intrinsic such as Intel® short vector math library Bitwise selection occurs when vector bitmask is use to select (SVML) [3] a software standard provided by Intel® C++ values within vector. Vector bitmask produced when SIMD www.ijeais.org/ijaer 6 International Journal of Academic Engineering Research (IJAER) ISSN: 2000-001X Vol. 2 Issue 7, July – 2018, Pages: 6-13 comparative operation is is used to compare 2 vectors performing the operation of a bitwise OR of the result of a resulting in mask of all 1 for true or mask of all 0 for false. (~mask & first_input_value) and the result of the (mask & Each mask corresponds to value held in vector position second_input_value). This functions is completely bitwise in whose size allocated to each data types byte alignment. nature and depends on binary vector representation of TRUE and FALSE to allow the pass of values within each vector __m128 _mm_radian_asin_ps(__m128 a) position within the SIMD register. { __m128 y, n, p; p = _mm_cmpge_ps(a, __m128_1); __m128i _mm_blendv_si128 (__m128i x, __m128i y, __m128i mask) { // if(a >= 1) Generates Mask 0xffffffff -OR- 0x0 // Replace bit in x with bit in y when matching bit in mask is set: n = _mm_cmple_ps(a, __m128_minus_1); return _mm_or_si128(_mm_andnot_si128(mask, x), _mm_and_si128(mask, y)); // if(a <= -1) Generates Mask 0xffffffff -OR- 0x0 } // 0xffffffff is TRUE // 0x0 is FALSE Figure 6: Bitwise selection example code y = _mm_asin_ps(a); y = _mm_blendv_ps(y, __m128_PIo2F, p); // replace value for >= 1 in var y based on 0xffffffff mask y = _mm_blendv_ps(y, __m128_minus_PIo2F, n); Once a programmer understand that bitwise selection masks // replace value for <= -1 in var y based on 0xffffffff mask return y; are used in place of conditional statements the porting of } existing C/C++ library become easier. The programmer Figure 2: SSE SIMD Branching example would still have to account for inefficiency of computing all possible branches and corner cases which would later merge _mm_radian_asin_ps(__m128): push rsi into the resulting SIMD vector. With this in mind movaps xmm9, xmm0 movups xmm8, XMMWORD PTR __m128_1[rip] unnecessary corner cases and redundancy within an cmpleps xmm8, xmm0 cmpleps xmm9, XMMWORD PTR __m128_minus_1[rip] algorithm should be evaluated as critical or noncritical to call __svml_asinf4 movaps xmm1, xmm0 compute the final resulting value. movaps xmm0, xmm8 blendvps xmm1, XMMWORD PTR __m128_PIo2F[rip], xmm0 movaps xmm0, xmm9 blendvps xmm1, XMMWORD PTR __m128_minus_PIo2F[rip], xmm0 movaps xmm0, xmm1 3. FMA INSTRUCTION SET pop rcx ret The FMA instruction set is an extension to the 128-bit and Figure 3: SSE SIMD Branching assembly language 256-bit Streaming SIMD Extensions instructions in the Intel output (Intel C++ Compiler) Compatible microprocessor instruction set to perform fused multiply–add (FMA) There are two variants: float radian_asin(float a){ if(a >= 1){ return PIo2F; } FMA4 is supported in AMD processors starting with the else if(a <= -1){ return -PIo2F;} else{ return asin(a); } Bulldozer (2011) architecture. FMA4 was realized in } hardware before FMA3[6]. Ultimately FMA4[9] was Figure 4: Standard branching example dropped in favor to the superior Intel FMA3 hardware implementation. FMA3 is supported in AMD processors since processors since 2014. Its goal is to remove CPU radian_asin(float): push rsi cycles by combining multiplication and addition into one comiss xmm0, DWORD PTR .L_2il0floatpacket.3[rip] jae ..B1.6 instruction. This would reduce the total exaction time of an movss xmm1, DWORD PTR .L_2il0floatpacket.1[rip] comiss xmm1, xmm0 algorithm. jb ..B1.4 movss xmm0, DWORD PTR .L_2il0floatpacket.2[rip] pop rcx ret __m128 _mm_fmadd_ps(__m128 a, __m128 b, __m128 c) ..B1.4: // Latency ~6 call asinf pop rcx __m128 _mm_add_ps ( _mm_mul_ps (__m128 a, __m128 b), __m128 c) ret // Latency ~10 ..B1.6: movss xmm0, DWORD PTR .L_2il0floatpacket.0[rip] Figure 7: FMA vs SSE CPU Cycle Latency pop rcx ret Figure 5: Standard Branching example assembly 4. IMPLEMENTATION language output (Intel C++ Compiler) SVML compatible library start with the most commonly called functions falling under the family of libC math It is important to understand bitwise selection in SSE SIMD functions. Most implementations are portable form operations since they will take the pace of of __m128 to __m128d, __m128 to __m256/__m512 and jumping/branching based on condition in most arithmetic operation. These bitwise operations become fundamental in __m128d to __m256d/__m512d with minor changes to the design and analysis of SIMD optimized algorithms. numerical constants and corresponding intrinsic for Since all data operation are identical across SIMD vector specific vector datatype. Libc Math functions are able to code segments must be broken up based

Load more