Intrinsics Lecture 1

Manfred Liebmann
Technische Universität München
Chair of Optimal Control
Center for Mathematical Sciences, M17
[email protected]

January 12, 2016

Programming with Intrinsics

What are intrinsics?

Intrinsics are functions that the compiler replaces with the proper assembly instructions. Intrinsics are primarily used to access the vector processing capabilities of modern CPUs.
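As a minimal sketch of this idea (not taken from the lecture), the function below adds four floats at once with the SSE intrinsic `_mm_add_ps`, which compilers typically translate to a single `addps` instruction. The name `add4` and the macro `SKETCH_HAVE_SSE` are hypothetical, and the scalar fallback is an assumption for builds where SSE is not enabled.

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3.
void add4(const float* a, const float* b, float* out) {
#if SKETCH_HAVE_SSE
    __m128 va  = _mm_loadu_ps(a);       // unaligned load of four floats
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);    // four additions in one operation
    _mm_storeu_ps(out, sum);            // unaligned store of four floats
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];  // scalar fallback
#endif
}
```

The intrinsic is called like an ordinary function, but no function call is expected in the generated code.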

Long history of intrinsics
  – MMX: Multi Media Extensions, 8 x 64-bit registers (1997)
  – SSE/SSE2/SSE3/SSSE3/SSE4.x: Streaming SIMD Extensions, 8 x 128-bit registers (1999)
  – AVX/AVX2/FMA: Advanced Vector Extensions, 16 x 256-bit registers (2008)
  – AVX-512/KNC: Advanced Vector Extensions, 32 x 512-bit registers (2012)

Choose the Right Header!

Intrinsics are supported by all modern C/C++ compilers.

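Every generation has its own header!
  – #include <mmintrin.h>   //MMX
  – #include <xmmintrin.h>  //SSE
  – #include <emmintrin.h>  //SSE2
  – #include <pmmintrin.h>  //SSE3
  – #include <tmmintrin.h>  //SSSE3
  – #include <smmintrin.h>  //SSE4.1
  – #include <nmmintrin.h>  //SSE4.2
  – #include <ammintrin.h>  //SSE4A
  – #include <wmmintrin.h>  //AES
  – #include <immintrin.h>  //AVX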
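One possible way to pick a header at compile time is sketched below. It assumes the predefined macros `__AVX__` and `__SSE2__`, which GCC and Clang set when the matching instruction set is enabled (for example with `-mavx`); other compilers may use different macros. The helpers `selected_extension` and `vector_bytes` are hypothetical names used only for illustration.

```cpp
// Include the header for the widest extension enabled at compile time.
#if defined(__AVX__)
#include <immintrin.h>   // AVX
#elif defined(__SSE2__)
#include <emmintrin.h>   // SSE2
#endif

// Hypothetical helper: name of the extension selected above.
const char* selected_extension() {
#if defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none";
#endif
}

// Hypothetical helper: vector register width in bytes for that extension.
int vector_bytes() {
#if defined(__AVX__)
    return static_cast<int>(sizeof(__m256));   // 256-bit register
#elif defined(__SSE2__)
    return static_cast<int>(sizeof(__m128d));  // 128-bit register
#else
    return 0;                                  // no vector extension selected
#endif
}
```

Code that uses intrinsics can be guarded with the same macros, so that a scalar path is compiled when the extension is not available.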

Advanced Vector Extensions (AVX)

Intel Advanced Vector Extensions (AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs. These instructions extend the previous SIMD offerings, MMX instructions and Intel Streaming SIMD Extensions (SSE).

Intel Intrinsics Guide

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Complete interactive reference for all intrinsic functions!

Instruction Set Architecture (ISA) Extensions

https://software.intel.com/en-us/isa-extensions

Intel AVX Suffix Markings

All modern C and C++ compilers support the same set of intrinsics, which simplifies using Intel AVX from C or C++ code. Intrinsics are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names have the following format:

    _mm256_op_suffix(data_type param1, data_type param2, data_type param3)

where _mm256 is the prefix for working on the new 256-bit registers; op is the operation, like add for addition or sub for subtraction; and suffix denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types listed below.

Suffix Markings
  – [s/d]: Single- or double-precision floating point
  – [i/u]nnn: Signed or unsigned integer of bit size nnn, where nnn is 128, 64, 32, 16, or 8
  – [ps/pd/sd]: Packed single, packed double, or scalar double
  – epi32: Extended packed 32-bit signed integer
  – si256: Scalar 256-bit integer
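The naming scheme can be seen in a few real AVX intrinsics. The sketch below is an illustration rather than lecture code: the wrappers `add_ps`, `mul_pd` and `fill_epi32` are hypothetical, and the scalar fallbacks are an assumption for builds without AVX (the `__AVX__` macro is set by GCC and Clang under `-mavx`).

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Suffix ps: packed single precision, eight floats per 256-bit register.
void add_ps(const float* a, const float* b, float* out) {
#if defined(__AVX__)
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
#else
    for (int i = 0; i < 8; ++i) out[i] = a[i] + b[i];   // scalar fallback
#endif
}

// Suffix pd: packed double precision, four doubles per 256-bit register.
void mul_pd(const double* a, const double* b, double* out) {
#if defined(__AVX__)
    _mm256_storeu_pd(out, _mm256_mul_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b)));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] * b[i];   // scalar fallback
#endif
}

// Suffix epi32: eight 32-bit signed integers; suffix si256: the register as one 256-bit integer.
void fill_epi32(int value, int* out) {
#if defined(__AVX__)
    __m256i v = _mm256_set1_epi32(value);                     // broadcast value to all eight lanes
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);  // store all 256 bits
#else
    for (int i = 0; i < 8; ++i) out[i] = value;               // scalar fallback
#endif
}
```

In each name the prefix `_mm256` selects the 256-bit registers, the middle part names the operation, and the suffix fixes how the bits are interpreted.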

Intel AVX Intrinsics Data Types

Data Types
  – __m256: 256-bit as eight single-precision floating-point values
  – __m256d: 256-bit as four double-precision floating-point values
  – __m256i: 256-bit as integers (bytes, words, etc.)
  – __m128: 128-bit single-precision floating point (32 bits each)
  – __m128d: 128-bit double-precision floating point (64 bits each)

Figure 1: Intel AVX and Intel SSE data types
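The element counts per register follow from dividing the register width by the element width. The sketch below checks this arithmetic at compile time; the helper `lanes` is a hypothetical name, and the size checks on the vector types are only compiled when AVX is enabled (assumed here to be signalled by the `__AVX__` macro).

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
// Each vector type occupies exactly one register's worth of bytes.
static_assert(sizeof(__m256)  == 32, "__m256 holds 256 bits");
static_assert(sizeof(__m256d) == 32, "__m256d holds 256 bits");
static_assert(sizeof(__m256i) == 32, "__m256i holds 256 bits");
static_assert(sizeof(__m128)  == 16, "__m128 holds 128 bits");
static_assert(sizeof(__m128d) == 16, "__m128d holds 128 bits");
#endif

// Hypothetical helper: number of elements of element_bytes bytes
// that fit into a register of register_bits bits.
constexpr int lanes(int register_bits, std::size_t element_bytes) {
    return register_bits / static_cast<int>(8 * element_bytes);
}

static_assert(lanes(256, sizeof(float))  == 8, "eight floats per 256-bit register");
static_assert(lanes(256, sizeof(double)) == 4, "four doubles per 256-bit register");
static_assert(lanes(128, sizeof(float))  == 4, "four floats per 128-bit register");
static_assert(lanes(128, sizeof(double)) == 2, "two doubles per 128-bit register");
```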

Mandelbrot Set Code Example

Pseudocode for calculating the Mandelbrot set.

    z,p are complex numbers
    for each point p on the complex plane
        z = 0
        for count = 0 to max_iterations
            if abs(z) > 2.0
                break
            z = z*z+p
        set color at p based on count reached
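The inner loop of the pseudocode can be written for a single point with plain floating-point arithmetic. This is a sketch under assumptions: the name `escape_count` is hypothetical, the complex numbers are split into real and imaginary parts, and the test abs(z) > 2.0 is compared in squared form to avoid a square root.

```cpp
// Hypothetical helper: number of iterations performed for the point
// p = pr + pi*i before abs(z) exceeds 2, capped at max_iters.
int escape_count(float pr, float pi, int max_iters) {
    float zr = 0.0f, zi = 0.0f;                 // z = 0
    int count = 0;
    for (; count < max_iters; ++count) {
        if (zr * zr + zi * zi > 4.0f) break;    // abs(z) > 2.0, squared on both sides
        float t = zr * zr - zi * zi + pr;       // real part of z*z + p
        zi = 2.0f * zr * zi + pi;               // imaginary part of z*z + p
        zr = t;
    }
    return count;
}
```

Points inside the set never escape and return `max_iters`; points far outside return after one or two iterations.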

Mandelbrot Set Visualization

Figure 2: Mandelbrot set 0.29768 + 0.48354i to 0.29778 + 0.48364i with 4096 max iterations

Simple Mandelbrot C++ STL Code

    #include <complex>
    using namespace std;

    int main(int argc, char** argv) {
        // Corners of the region shown in Figure 2.
        float x1 = 0.29768f, y1 = 0.48364f, x2 = 0.29778f, y2 = 0.48354f;
        int width = 2048, height = 2048, maxIters = 4096;
        unsigned short *image = new unsigned short[width * height];
        unsigned short *pixel = image;  // write cursor; image keeps the start for delete[]

        float dx = (x2-x1)/width, dy = (y2-y1)/height;
        for (int j = 0; j < height; ++j) {
            for (int i = 0; i < width; ++i) {
                complex<float> c(x1+dx*i, y1+dy*j), z(0,0);
                int count = -1;
                // norm(z) is the squared magnitude, so abs(z) < 2.0 becomes norm(z) < 4.0.
                while ((++count < maxIters) && (norm(z) < 4.0f))
                    z = z*z+c;
                *pixel++ = count;
            }
        }
        delete[] image;
        return 0;
    }
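For comparison, a vectorized inner loop can process eight pixels of one image row at a time. The following is a hedged sketch, not the code behind the lecture's benchmark: the name `escape_count8` is hypothetical, iteration continues while the squared magnitude is below 4 (that is, abs(z) < 2), the counts are kept as floats because plain AVX has no 256-bit integer addition, and a scalar fallback is compiled when `__AVX__` is not defined.

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: iteration counts for eight points with real parts
// cr[0..7] and a shared imaginary part ci, capped at max_iters.
void escape_count8(const float* cr, float ci, int max_iters, int* counts) {
#if defined(__AVX__)
    const __m256 vcr  = _mm256_loadu_ps(cr);
    const __m256 vci  = _mm256_set1_ps(ci);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one  = _mm256_set1_ps(1.0f);
    __m256 zr  = _mm256_setzero_ps();
    __m256 zi  = _mm256_setzero_ps();
    __m256 cnt = _mm256_setzero_ps();                        // per-lane counts as floats (exact up to 2^24)
    __m256 alive = _mm256_cmp_ps(four, four, _CMP_EQ_OQ);    // all lanes start active
    for (int k = 0; k < max_iters; ++k) {
        __m256 zr2  = _mm256_mul_ps(zr, zr);
        __m256 zi2  = _mm256_mul_ps(zi, zi);
        __m256 mag2 = _mm256_add_ps(zr2, zi2);               // squared magnitude per lane
        // A lane stays active only while it has never reached mag2 >= 4.
        alive = _mm256_and_ps(alive, _mm256_cmp_ps(mag2, four, _CMP_LT_OQ));
        if (_mm256_movemask_ps(alive) == 0) break;           // every lane has escaped
        cnt = _mm256_add_ps(cnt, _mm256_and_ps(alive, one)); // +1 for active lanes only
        __m256 zri = _mm256_mul_ps(zr, zi);
        zr = _mm256_add_ps(_mm256_sub_ps(zr2, zi2), vcr);    // real part of z*z + c
        zi = _mm256_add_ps(_mm256_add_ps(zri, zri), vci);    // imaginary part of z*z + c
    }
    float tmp[8];
    _mm256_storeu_ps(tmp, cnt);
    for (int i = 0; i < 8; ++i) counts[i] = static_cast<int>(tmp[i]);
#else
    // Scalar fallback with the same stopping rule.
    for (int i = 0; i < 8; ++i) {
        float zr = 0.0f, zi = 0.0f;
        int c = 0;
        while (c < max_iters && zr * zr + zi * zi < 4.0f) {
            float t = zr * zr - zi * zi + cr[i];
            zi = 2.0f * zr * zi + ci;
            zr = t;
            ++c;
        }
        counts[i] = c;
    }
#endif
}
```

The loop runs until the slowest of the eight lanes has escaped, so the benefit depends on neighbouring pixels needing similar iteration counts.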

Mandelbrot Set Benchmark

    Cores   STL       FPU       AVX
    1       63.5186   11.9445   1.64415
    2       50.1687   9.42479   1.26957
    4       42.7716   8.02288   1.05672
    8       23.2062   4.34219   0.569152
    16      13.9921   2.62823   0.345063

Table 1: Total runtimes in seconds for the Mandelbrot set benchmark with a 2048 x 2048 grid on 2x Intel E5-2650 @ 2.00GHz.
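The table can be read in two directions: across a row (gain from vectorization) and down a column (gain from more cores). The sketch below copies the runtimes from Table 1 and computes both ratios; the array and function names are hypothetical, and parallel efficiency is taken to mean speedup divided by core count.

```cpp
// Runtimes in seconds copied from Table 1, for 1, 2, 4, 8 and 16 cores.
const int    kCores[5] = {1, 2, 4, 8, 16};
const double kStl[5]   = {63.5186, 50.1687, 42.7716, 23.2062, 13.9921};
const double kFpu[5]   = {11.9445, 9.42479, 8.02288, 4.34219, 2.62823};
const double kAvx[5]   = {1.64415, 1.26957, 1.05672, 0.569152, 0.345063};

// How many times faster `seconds` is than `baseline_seconds`.
double speedup(double baseline_seconds, double seconds) {
    return baseline_seconds / seconds;
}

// Speedup over the single-core run divided by the number of cores used.
double parallel_efficiency(double one_core_seconds, double seconds, int cores) {
    return speedup(one_core_seconds, seconds) / cores;
}
```

For example, `speedup(kStl[0], kAvx[0])` gives about 38.6 and `speedup(kFpu[0], kAvx[0])` about 7.3 on one core, while `parallel_efficiency(kAvx[0], kAvx[4], kCores[4])` shows how far the 16-core AVX run is from ideal scaling.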
