Intrinsics Lecture 1
Manfred Liebmann
Technische Universität München
Chair of Optimal Control
Center for Mathematical Sciences, M17
[email protected]
January 12, 2016

Programming with Intrinsics

What are intrinsics?

Intrinsics are functions that the compiler replaces with the proper assembly instructions. Intrinsics are primarily used to access the vector processing capabilities of modern CPUs.

• Long history of intrinsics
  – MMX: Multi Media Extensions, 8 x 64-bit registers (1997)
  – SSE/SSE2/SSE3/SSSE3/SSE4.x: Streaming SIMD Extensions, 8 x 128-bit registers (1999)
  – AVX/AVX2/FMA: Advanced Vector Extensions, 16 x 256-bit registers (2008)
  – AVX-512/KNC: Advanced Vector Extensions, 32 x 512-bit registers (2012)

Choose the Right Header!

Intrinsics are supported by all modern C/C++ compilers.

• Every generation has its own header!
  – #include <mmintrin.h> //MMX
  – #include <xmmintrin.h> //SSE
  – #include <emmintrin.h> //SSE2
  – #include <pmmintrin.h> //SSE3
  – #include <tmmintrin.h> //SSSE3
  – #include <smmintrin.h> //SSE4.1
  – #include <nmmintrin.h> //SSE4.2
  – #include <ammintrin.h> //SSE4A
  – #include <wmmintrin.h> //AES
  – #include <immintrin.h> //AVX

Advanced Vector Extensions (AVX)

Intel Advanced Vector Extensions (AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs. These instructions extend the previous SIMD offerings, MMX instructions and Intel Streaming SIMD Extensions (SSE).

Intel Intrinsics Guide
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Complete interactive reference for all intrinsic functions!

Instruction Set Architecture (ISA) Extensions
https://software.intel.com/en-us/isa-extensions

Intel AVX Suffix Markings

All modern C++ compilers support the same intrinsic operations to simplify using Intel AVX from C or C++ code.
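To make the header choice and the compiler support described above concrete, here is a minimal sketch of an element-wise array addition using AVX intrinsics from <immintrin.h>. The helper name add_arrays is hypothetical and not part of the lecture. The sketch assumes the compiler defines the __AVX__ macro when AVX code generation is enabled (for example with -mavx on GCC or Clang); without it, only the scalar loop is compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics
#endif

// Hypothetical helper (not from the lecture): returns c with c[i] = a[i] + b[i].
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    std::size_t i = 0;
#if defined(__AVX__)
    // Process eight floats per iteration, one 256-bit register per operand.
    for (; i + 8 <= a.size(); i += 8) {
        __m256 va = _mm256_loadu_ps(a.data() + i);  // unaligned load of 8 floats
        __m256 vb = _mm256_loadu_ps(b.data() + i);
        _mm256_storeu_ps(c.data() + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar loop: handles the tail, or every element when AVX is not enabled.
    for (; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```

The unaligned load and store variants are used so that the sketch works with the default std::vector allocator; aligned variants exist as well but require 32-byte aligned storage.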
Intrinsics are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names follow this format:

    _mm256_op_suffix(data_type param1, data_type param2, data_type param3)

where _mm256 is the prefix for working on the new 256-bit registers; op is the operation, like add for addition or sub for subtraction; and suffix denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types listed below.

• Suffix Markings
  [s/d]: Single- or double-precision floating point
  [i/u]nnn: Signed or unsigned integer of bit size nnn, where nnn is 128, 64, 32, 16, or 8
  [ps/pd/sd]: Packed single, packed double, or scalar double
  epi32: Extended packed 32-bit signed integer
  si256: Scalar 256-bit integer

Intel AVX Intrinsics Data Types

• Data Types
  __m256: 256-bit as eight single-precision floating-point values
  __m256d: 256-bit as four double-precision floating-point values
  __m256i: 256-bit as integers (bytes, words, etc.)
  __m128: 128-bit as four single-precision floating-point values (32 bits each)
  __m128d: 128-bit as two double-precision floating-point values (64 bits each)

Figure 1: Intel AVX and Intel SSE data types

Mandelbrot Set Code Example

Pseudocode for calculating the Mandelbrot set.
    z, p are complex numbers
    for each point p on the complex plane
        z = 0
        for count = 0 to max_iterations
            if abs(z) > 2.0 break
            z = z*z + p
        set color at p based on count reached

Mandelbrot Set Visualization

Figure 2: Mandelbrot set 0.29768 + 0.48354i to 0.29778 + 0.48364i with 4096 max iterations

Simple Mandelbrot C++ STL Code

    #include <iostream>
    #include <complex>
    using namespace std;

    int main(int argc, char** argv)
    {
        // Region of the complex plane; y runs from y1 (first row) to y2 (last row)
        float x1 = 0.29768f, y1 = 0.48364f, x2 = 0.29778f, y2 = 0.48354f;
        int width = 2048, height = 2048, maxIters = 4096;
        unsigned short *image = new unsigned short[width * height];
        float dx = (x2-x1)/width, dy = (y2-y1)/height;
        for (int j = 0; j < height; ++j)
        {
            for (int i = 0; i < width; ++i)
            {
                complex<float> c(x1+dx*i, y1+dy*j), z(0,0);
                int count = -1;
                // norm(z) is |z|^2, so the test abs(z) < 2.0 becomes norm(z) < 4.0
                while ((++count < maxIters) && (norm(z) < 4.0f))
                    z = z*z+c;
                // Index the buffer instead of advancing the pointer, so image stays valid
                image[j * width + i] = count;
            }
        }
        delete[] image;
    }

Mandelbrot Set Benchmark

    Cores   STL       FPU       AVX
    1       63.5186   11.9445   1.64415
    2       50.1687   9.42479   1.26957
    4       42.7716   8.02288   1.05672
    8       23.2062   4.34219   0.569152
    16      13.9921   2.62823   0.345063

Table 1: Total runtimes in seconds for the Mandelbrot set benchmark with a 2048 x 2048 grid on 2x Intel Xeon E5-2650 @ 2.00GHz.
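Table 1 lists FPU and AVX variants whose code does not appear in this part of the lecture. The following is a minimal sketch of what such kernels might look like, not the author's implementation. The names mandel_scalar and mandel8 are hypothetical. The sketch assumes that __AVX__ is defined when AVX is enabled (for example with -mavx), that iteration counts stay below 2^24 so they can be held exactly in float lanes, and that a point counts as escaped once |z|^2 >= 4.

```cpp
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical scalar ("FPU") kernel: iterations until |z|^2 >= 4 or max_iters.
int mandel_scalar(float cr, float ci, int max_iters)
{
    float zr = 0.0f, zi = 0.0f;
    int count = 0;
    while (count < max_iters && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;  // real part of z*z + c
        zi = 2.0f * zr * zi + ci;          // imaginary part of z*z + c
        zr = t;
        ++count;
    }
    return count;
}

// Hypothetical AVX kernel: iterates eight points at once, one per float lane.
void mandel8(const float* cre, const float* cim, int max_iters, int* counts)
{
    assert(max_iters > 0 && max_iters <= (1 << 24));
#if defined(__AVX__)
    const __m256 cr = _mm256_loadu_ps(cre), ci = _mm256_loadu_ps(cim);
    const __m256 four = _mm256_set1_ps(4.0f), one = _mm256_set1_ps(1.0f);
    __m256 zr = _mm256_setzero_ps(), zi = _mm256_setzero_ps();
    __m256 count = _mm256_setzero_ps();
    // All-ones mask: every lane starts active (0 < 4 is true in each lane).
    __m256 active = _mm256_cmp_ps(zr, four, _CMP_LT_OQ);
    for (int it = 0; it < max_iters; ++it) {
        __m256 zr2 = _mm256_mul_ps(zr, zr);
        __m256 zi2 = _mm256_mul_ps(zi, zi);
        // A lane stays active only while |z|^2 < 4; once cleared it stays cleared.
        __m256 inside = _mm256_cmp_ps(_mm256_add_ps(zr2, zi2), four, _CMP_LT_OQ);
        active = _mm256_and_ps(active, inside);
        if (_mm256_movemask_ps(active) == 0) break;  // all eight lanes escaped
        // Add 1.0 to the counter of each active lane, 0.0 to the others.
        count = _mm256_add_ps(count, _mm256_and_ps(active, one));
        __m256 zrzi = _mm256_mul_ps(zr, zi);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), cr);
        zi = _mm256_add_ps(_mm256_add_ps(zrzi, zrzi), ci);
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, count);
    for (int k = 0; k < 8; ++k) counts[k] = static_cast<int>(tmp[k]);
#else
    // Fallback when AVX is not enabled at compile time.
    for (int k = 0; k < 8; ++k)
        counts[k] = mandel_scalar(cre[k], cim[k], max_iters);
#endif
}
```

Counting in float lanes is a design choice made here because 256-bit integer arithmetic is an AVX2 feature; with plain AVX, a masked floating-point add is one simple way to keep per-lane counters. Lanes that have already escaped keep being iterated, which is harmless because their mask bit is never set again.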