Advanced Parallel Programming II

Alexander Leutgeb, RISC Software GmbH

RISC Software GmbH – Johannes Kepler University Linz © 2016 22.09.2016 | 1 Introduction to Vectorization

Motivation

. Increase in the number of cores
  – Threading techniques to improve performance
. But flops per cycle of vector units have increased as much as the number of cores
. No use of vector units -> wasting flops/watt
. For best performance
  – Use all cores
  – Efficient use of vector units
  – Ignoring the potential of vector units is as inefficient as using only one core

Vector Unit

. Single Instruction Multiple Data (SIMD) units
. Mostly for floating point operations
. Data parallelization with one instruction
  – 64-bit unit -> 1 DP flop, 2 SP flops
  – 128-bit unit -> 2 DP flops, 4 SP flops
  – …
. Multiple data elements are loaded into vector registers and used by vector units
. Some architectures can execute more than one vector instruction per cycle (e.g. Sandy Bridge)
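The lane counts listed for the 64-bit and 128-bit units follow from dividing the register width by the element width. A minimal sketch of that arithmetic; the helper name `simd_lanes` is hypothetical and not part of any library:

```cpp
#include <cstddef>

// Number of elements ("lanes") that fit into one vector register.
// register_bits: width of the vector unit, e.g. 128 (SSE) or 256 (AVX)
// element_bits:  32 for single precision (SP), 64 for double precision (DP)
constexpr std::size_t simd_lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

static_assert(simd_lanes(128, 64) == 2, "128-bit unit: 2 DP values");
static_assert(simd_lanes(128, 32) == 4, "128-bit unit: 4 SP values");
static_assert(simd_lanes(256, 32) == 8, "256-bit unit: 8 SP values");
```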

Parallel Execution

Scalar version works on one element at a time:
a[i] = b[i] + c[i] * d[i];

Vector version carries out the same instructions on many elements at a time:
a[i:8] = b[i:8] + c[i:8] * d[i:8];

[Figure: the scalar version computes a[i] = b[i] + c[i] * d[i] for a single element; the vector version computes the eight elements a[i] … a[i+7] from b[i] … b[i+7], c[i] … c[i+7] and d[i] … d[i+7] side by side.]
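The difference between the two versions can be sketched in plain C++. This is an illustration only, not the course code: the function names `fma_scalar` and `fma_blocked8` are hypothetical, and the inner loop over eight lanes stands in for what a single 256-bit single-precision vector instruction would do. The array length is assumed to be a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar version: one element per iteration.
void fma_scalar(std::vector<float>& a, const std::vector<float>& b,
                const std::vector<float>& c, const std::vector<float>& d) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[i] + c[i] * d[i];
}

// Blocked version: each outer iteration handles 8 consecutive elements,
// mirroring a[i:8] = b[i:8] + c[i:8] * d[i:8].
// Assumption: a.size() is a multiple of 8 (no remainder loop here).
void fma_blocked8(std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, const std::vector<float>& d) {
    assert(a.size() % 8 == 0);
    for (std::size_t i = 0; i < a.size(); i += 8)
        for (std::size_t lane = 0; lane < 8; ++lane)  // the 8 "lanes"
            a[i + lane] = b[i + lane] + c[i + lane] * d[i + lane];
}
```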

Vector Registers

Vector Unit Usage (Programmer's View)

Options, ordered from greatest ease of use (top) to greatest programmer control (bottom):

. Use vectorized libraries (e.g. Intel MKL)
. Fully automatic vectorization
. Auto vectorization hints (#pragma ivdep)
. SIMD feature (#pragma simd and SIMD function annotation)
. Vector intrinsics (e.g. _mm_add_ps())
. ASM code (e.g. addps)

Auto Vectorization

. Modern compilers analyse loops in serial code -> identification of loops suitable for vectorization
  – Perform loop transformations to enable this identification
. Use of the instruction set of the target architecture

Common Switches GCC and ICC

Switch                          Functionality
-O0                             Disable optimization
-O1                             Optimize for speed (no code size increase)
-O2                             Optimize for speed (ICC default)
-O3                             High-level optimizer (e.g. loop unroll)
-fast (ICC)                     Aggressive optimizations (e.g. -ipo, -O3, …)
-g                              Create symbols for debugging
-S                              Generate assembly files
-openmp (ICC), -fopenmp (GCC)   OpenMP support

Architecture Specific Compiler Switches GCC

Switch          Functionality
-march=native   Optimize for current machine
-msse           Generate SSE v1 code
-msse2          Generate SSE v2 code (default, may also emit SSE v1 code)
-msse3          Generate SSE v3 code (may also emit SSE v1 and v2 code)
-mssse3         Generate SSSE v3 code (may also emit SSE v1, v2, and v3 code)
-msse4.1        Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)
-msse4.2        Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4 code)
-mavx           Generate AVX code
-mavx2          Generate AVX v2 code

Architecture Specific Compiler Switches ICC

Switch *      Functionality
-xHOST        Optimize for current machine
-xSSE1        Generate SSE v1 code
-xSSE2        Generate SSE v2 code (default, may also emit SSE v1 code)
-xSSE3        Generate SSE v3 code (may also emit SSE v1 and v2 code)
-xSSE_ATOM    Generate SSE v3 code for Atom-based processors
-xSSSE3       Generate SSSE v3 code (may also emit SSE v1, v2, and v3 code)
-xSSE4.1      Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)
-xSSE4.2      Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and v4 code)
-xAVX         Generate AVX code

* For Intel processors use -x, for non-Intel processors use -m

Example 4 Simple Vector Addition

1. Go to the directory example_4.
2. Compile simple.cpp (no vectorization) and execute the binary:
   icc -std=c++11 -O2 -no-vec simple.cpp
   (g++ -std=c++11 -O2 simple.cpp)
3. Compile simple.cpp (vectorization) and execute the binary:
   icc -std=c++11 -O2 simple.cpp
   (g++ -std=c++11 -O2 -ftree-vectorize simple.cpp)
4. What is the difference between the execution times?
5. Can the execution times be improved further?

SIMD Vectorization Basics

. Vectorization offers good performance improvement on floating point intensive code
. Vectorized code can compute slightly different results than non-vectorized code (x87 FPU – 80 bit, SIMD – 64 bit)
. Even for scalar operations the vector unit is used
. Vectorization is only one aspect of improving performance
  – Efficient use of the cache is necessary

Parallelization at No Cost

. We tell the compiler to vectorize
  – That's not the whole story
. There are cases where a compiler cannot vectorize the code
. How can we analyse such situations -> vectorization report
. Generation of the vectorization report via the compiler
  – More interesting: what was not done and why
  – Focus on code paths which were not vectorized

Example 5 Vectorization Report

1. Go to the directory example_5.
2. Compile simple.cpp with vectorization report generation enabled:
   icc -std=c++11 -O2 -vec-report=2 simple.cpp
   (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp)
3. What positive/negative information does the vectorization report give?
4. Insert the following code before std::cout in calcSP() and compile again. Have a look at the vectorization report.
   for (i = 1; i < VECSIZE; i++) { a[i] = a[i] + a[i - 1]; }

Vectorization Report ICC

. Information
  – Which code was vectorized?
  – Which code was not vectorized?
. Compiler switch -vec-report=n
  – n=0: no diagnostic information
  – n=1: (default) vectorized loops
  – n=2: vectorized/non vectorized loops (and why)
  – n=3: additional dependency information
  – n=4: only non vectorized loops
  – n=5: only non vectorized loops and dependency information
  – n=6: vectorized/non vectorized loops with details

for (i = 0; i < N; i++)
{
    a[i] = b[i] * c[i];
}

Unrolling allows the compiler to reconstruct the loop for vector operations:

for (i = 0; i < N; i += 4)
{
    a[i    ] = b[i    ] * c[i    ];
    a[i + 1] = b[i + 1] * c[i + 1];
    a[i + 2] = b[i + 2] * c[i + 2];
    a[i + 3] = b[i + 3] * c[i + 3];
}

The unrolled body maps onto four vector operations:
  – Load b(i, …, i + 3)
  – Load c(i, …, i + 3)
  – Operate b * c -> a
  – Store a(i, …, i + 3)
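The unrolled loop shown here is only correct when N is a multiple of 4. A hedged sketch of how the same transformation could be written for arbitrary N, with a scalar remainder loop for the leftover elements; the function name `multiply_unrolled4` is hypothetical:

```cpp
#include <cstddef>

// Element-wise product, unrolled by 4, with a scalar remainder loop.
void multiply_unrolled4(float* a, const float* b, const float* c, std::size_t n) {
    const std::size_t main_end = n - (n % 4);  // largest multiple of 4 that is <= n
    std::size_t i = 0;
    for (; i < main_end; i += 4) {             // candidate for one vector operation
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }
    for (; i < n; ++i)                         // remainder: at most 3 elements
        a[i] = b[i] * c[i];
}
```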

Requirements for Auto Vectorization

. Countable (trip count known at loop entry)
. Single entry and exit
    while (i < 100) {
        a[i] = b[i] * c[i];
        if (a[i] < 0.0) // data-dependent exit condition
            break;
        ++i;
    } // loop not vectorized
. Straight-line code (no switch; if only with masking)
    for (int i = 0; i < length; i++) {
        float s = b[i] * b[i] - 4 * a[i] * c[i];
        if (s >= 0) x[i] = sqrt(s);
        else x[i] = 0.;
    } // loop vectorized (because of masking)
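Masking means that both branches are evaluated for every element and a per-element condition selects which result is kept. The following sketch writes the discriminant loop in that branch-free style by hand; it is an illustration of the idea, not what any particular compiler emits, and the function name `masked_sqrt_discriminant` is hypothetical:

```cpp
#include <cmath>
#include <cstddef>

// Branch-free form of the discriminant loop: a per-element mask selects
// between sqrt(s) and 0 instead of taking an if/else branch.
void masked_sqrt_discriminant(const float* a, const float* b, const float* c,
                              float* x, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        const float s = b[i] * b[i] - 4.0f * a[i] * c[i];
        const bool mask = (s >= 0.0f);
        const float root = std::sqrt(mask ? s : 0.0f);  // never takes sqrt of a negative value
        x[i] = mask ? root : 0.0f;
    }
}
```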

Requirements for Auto Vectorization

. Only the innermost loop (caution: loop transformations such as loop collapsing can change which loop is innermost)
. No function calls, except
  – Intrinsic math (sin, log, …)
  – Inline functions
  – Elemental functions __attribute__((vector))
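A small inline function is the simplest of the three exceptions: once the compiler has inlined it, the loop body no longer contains a call. A minimal sketch with hypothetical names (`axpy_elem`, `axpy`); whether a given compiler actually inlines and vectorizes it has to be checked with the vectorization report:

```cpp
#include <cstddef>

// Small per-element helper; defined in the same translation unit so
// that the compiler can inline it into the loop below.
inline float axpy_elem(float alpha, float x, float y) {
    return alpha * x + y;
}

// After inlining, the loop body is straight-line arithmetic.
void axpy(float alpha, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = axpy_elem(alpha, x[i], y[i]);
}
```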

Inhibitors of Auto Vectorization

. Non contiguous data
    // arrays accessed with stride 2
    for (int i=0; i …

    // inner loop accesses a with stride SIZE
    for (int j=0; j …

    // indirect addressing of x using index array
    for (int i=0; i …

    F90: do j=1,n …
    C:   for (j=0; j …
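The loops on this slide are cut off in the source, so the following is not a reconstruction of them. It is a hypothetical sketch of the three access patterns named in the comments (stride 2, stride equal to the row length, and indirect addressing); all function names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Stride-2 access: only every second element of b is read.
// Requires b.size() >= 2 * a.size().
void copy_stride2(std::vector<float>& a, const std::vector<float>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[2 * i];
}

// Column sum of a row-major matrix: consecutive iterations are
// `cols` elements apart in memory.
float column_sum(const std::vector<float>& m, std::size_t rows,
                 std::size_t cols, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
        sum += m[r * cols + col];
    return sum;
}

// Indirect addressing (gather): x is read through an index array.
void gather(std::vector<float>& y, const std::vector<float>& x,
            const std::vector<std::size_t>& index) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = x[index[i]];
}
```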

Inhibitors of Auto Vectorization

. Inability to identify data with alias (or overlapping)
  – Runtime check possible -> multi versioned (vectorized/non vectorized)
      void my_cp(int nx, double* a, double* b) {
          for (int i = 0; i < nx; i++) a[i] = b[i];
      }
  – Runtime check not possible -> non vectorized
      void my_combine(int* ioff, int nx, double* a, double* b, double* c) {
          for (int i = 0; i < nx; i++) {
              a[i] = b[i] + c[i + *ioff];
          }
      }
    Would vectorize with strict aliasing (-ansi-alias)
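Besides the compiler switch, the programmer can state the absence of aliasing in the code. The sketch below is one possible variant of my_combine, assuming the caller guarantees that a, b and c do not overlap. `__restrict` is a compiler extension in C++ (accepted by GCC, Clang, ICC and MSVC), not ISO C++, and the name `my_combine_restrict` is hypothetical.

```cpp
// Variant of my_combine with explicit no-alias promises.
// Assumption: the memory reached through a, b and c never overlaps.
void my_combine_restrict(const int* ioff, int nx,
                         double* __restrict a,
                         const double* __restrict b,
                         const double* __restrict c) {
    const int off = *ioff;  // read the offset once, outside the loop
    for (int i = 0; i < nx; i++)
        a[i] = b[i] + c[i + off];
}
```

Calling such a function with overlapping arrays breaks the promise and leads to undefined behaviour.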

Inhibitors of Auto Vectorization

. Vector dependency
  – Read after write (RAW): non vectorizable
      for (i = 1; i < N; i++) a[i] = a[i - 1] + b[i];
  – Write after read (WAR): vectorizable
      for (i = 0; i < N - 1; i++) a[i] = a[i + 1] + b[i];
  – Read after read (RAR): vectorizable
      for (i = 0; i < N; i++) a[i] = b[i % M] + c[i];
  – Write after write (WAW): non vectorizable
      for (i = 0; i < N; i++) a[i % M] = b[i] + c[i];
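Why RAW blocks vectorization while WAR does not can be checked by emulating vector execution: per block, all operands are read first and all results are written afterwards. The sketch below does this for the RAW and WAR loops; the block width `W = 4` and the function names are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <vector>

const int W = 4;  // assumed vector width in elements

// Reference: sequential execution of a[i] = a[i + shift] + b[i].
// shift = -1 gives the RAW loop, shift = +1 the WAR loop.
std::vector<int> run_scalar(std::vector<int> a, const std::vector<int>& b, int shift) {
    const int n = static_cast<int>(a.size());
    const int begin = (shift < 0) ? 1 : 0;
    const int end = (shift > 0) ? n - 1 : n;
    for (int i = begin; i < end; ++i)
        a[i] = a[i + shift] + b[i];
    return a;
}

// Emulated vector execution: per block of W elements, first read all
// operands ("vector load"), then write all results ("vector store").
std::vector<int> run_blocked(std::vector<int> a, const std::vector<int>& b, int shift) {
    const int n = static_cast<int>(a.size());
    const int begin = (shift < 0) ? 1 : 0;
    const int end = (shift > 0) ? n - 1 : n;
    for (int i = begin; i < end; i += W) {
        const int lanes = std::min(W, end - i);
        int tmp[W];
        for (int l = 0; l < lanes; ++l)
            tmp[l] = a[i + l + shift] + b[i + l];
        for (int l = 0; l < lanes; ++l)
            a[i + l] = tmp[l];
    }
    return a;
}
```

For the WAR loop both functions return the same array. For the RAW loop they differ, because a lane would need a value that its neighbouring lane has not yet written.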

Efficiency Aspects for Auto Vectorization: Alignment

. Address alignment (SSE: 16 bytes, AVX: 32 bytes)
. Check at compile time -> alignment known
. Otherwise check at runtime -> peel and remainder loop
. Explicit definition in code:
  – local/global: __attribute__((aligned(16)))
  – Heap: _mm_malloc, _mm_free
. Compiler support: __assume_aligned(x, 16)
    peel=x&0x0f; if (peel!=0) { peel=16-peel; …
    void fill(char* x) { for (i=0; i …
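The peel computation on this slide is cut off in the source. As a hedged sketch of the general idea (not the slide's exact code), the following splits a fill operation into a peel loop, an aligned main loop and a remainder loop; the names `peel_count` and `fill_peeled` and the block size of 16 bytes are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Number of leading bytes to handle separately until p reaches the
// requested alignment (mirrors peel = x & 0x0f; peel = 16 - peel).
std::size_t peel_count(const void* p, std::size_t alignment) {
    const std::size_t misalign =
        static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % alignment);
    return misalign == 0 ? 0 : alignment - misalign;
}

// Fill n bytes in three phases: peel, aligned main loop, remainder.
void fill_peeled(char* x, std::size_t n, char value) {
    const std::size_t block = 16;
    const std::size_t peel = std::min(peel_count(x, block), n);
    std::size_t i = 0;
    for (; i < peel; ++i)                       // peel loop
        x[i] = value;
    for (; i + block <= n; i += block)          // main loop: x + i is 16-byte aligned
        for (std::size_t k = 0; k < block; ++k)
            x[i + k] = value;
    for (; i < n; ++i)                          // remainder loop
        x[i] = value;
}
```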

Efficiency Aspects for Auto Vectorization: Data Layout

. Structure of Arrays (SoA) instead of Array of Structures (AoS)

struct Vector3d { // AoS, memory layout: x y z x y z x y z x y z
    double x;
    double y;
    double z;
};

struct Vectors3d { // SoA, memory layout: x x x x y y y y z z z z
    double* x;
    double* y;
    double* z;
};
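The effect of the two layouts on a loop can be sketched as follows. This is an illustration under assumptions: the SoA type here uses std::vector instead of raw pointers and is named `Vectors3dSoA`, and the helper functions are hypothetical. Summing all x components touches every third double in the AoS layout but a contiguous array in the SoA layout.

```cpp
#include <cstddef>
#include <vector>

struct Vector3d { double x, y, z; };  // AoS element

struct Vectors3dSoA {                 // SoA with owning containers (assumption)
    std::vector<double> x, y, z;
};

// AoS: consecutive x values are 3 doubles apart in memory.
double sum_x_aos(const std::vector<Vector3d>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i].x;
    return sum;
}

// SoA: x values are contiguous (stride 1).
double sum_x_soa(const Vectors3dSoA& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.x.size(); ++i)
        sum += v.x[i];
    return sum;
}

// Convert AoS to SoA.
Vectors3dSoA to_soa(const std::vector<Vector3d>& v) {
    Vectors3dSoA out;
    for (std::size_t i = 0; i < v.size(); ++i) {
        out.x.push_back(v[i].x);
        out.y.push_back(v[i].y);
        out.z.push_back(v[i].z);
    }
    return out;
}
```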

Example 6 Non Contiguous Data

1. Go to the directory example_6.
2. Compile contig.cpp with the vectorization report enabled:
   icc -std=c++11 -O2 -vec-report=2 contig.cpp
   (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed contig.cpp)
3. What does the vectorization report tell?

Compiler Directives ICC

. Many compilers have directives for vectorization hints
. ivdep (C: "#pragma ivdep", Fortran: "!dec$ ivdep")
  – No vector dependency in loop (usage in case of an existing dependency leads to incorrect code)
. vector always (C: "#pragma vector always", Fortran: "!dec$ vector always")
. Elemental functions: __attribute__((vector))

Compiler Directives ICC: SIMD Extensions

. SIMD pragma
  – SIMD function annotation __attribute__((simd))
. Differences to traditional ivdep and vector always
  – Traditional pragmas are more like hints
  – New SIMD extension is more like an assertion
. Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert)
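As a sketch of the assertion style with a reduction clause, the following uses the OpenMP 4.0 `#pragma omp simd` directive. This is a swapped-in, standardized counterpart, not the ICC-specific `#pragma simd` from the slide. With GCC it takes effect with -fopenmp or -fopenmp-simd; without such a switch the pragma is ignored and the loop stays an ordinary scalar loop. The function name `dot` is hypothetical.

```cpp
#include <cstddef>

// Dot product with an explicit SIMD reduction.
// The pragma asserts that the loop may be vectorized and declares
// `sum` as a reduction variable.
float dot(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because the reduction may be evaluated in a different order than the scalar loop, results for general floating point data can differ in the last bits.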

Example 7 Vector Dependency

. Go to the directory example_7.
. Compile the file forward.cpp and execute the binary:
  icc -std=c++11 -O2 -vec-report=2 forward.cpp
  (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec forward.cpp)
. What is the execution time?
. What does the vectorization report tell?
. What happens if you split the inner loop into two separate update loops for b and a?

Vector Intrinsics

void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c)
{
    size_t i;
    __m256 va;
    __m256 vb;
    __m256 vc;
    // processes 8 floats per iteration; assumes a->size is a multiple of 8
    for (i = 0; i < a->size; i += 8) {
        va = _mm256_loadu_ps(&a->data[i]);   // unaligned load of 8 floats
        vb = _mm256_loadu_ps(&b->data[i]);
        vc = _mm256_mul_ps(va, vb);          // element-wise product
        _mm256_storeu_ps(&c->data[i], vc);   // unaligned store of 8 floats
    }
}

Vector Intrinsics

. Different data types for different architectures
  – SSE: __m128, __m128d, __m128i
  – AVX: __m256, __m256d, __m256i
. Different operations for different architectures
  – SSE: _mm_add_ps(), _mm_add_pd(), …
  – AVX: _mm256_add_ps(), _mm256_add_pd(), …
. Portability
  – Implement wrappers for different architectures
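One way such a wrapper could look is sketched below. The names `vfloat`, `vload`, `vmul`, `vstore` and `eltwise_product` are hypothetical; the intrinsics themselves are the standard SSE and AVX ones. The instruction set is selected at compile time through the predefined macros `__AVX__` and `__SSE__` (as set by GCC and Clang), with a scalar fallback for everything else.

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>
typedef __m256 vfloat;
const std::size_t vfloat_width = 8;
inline vfloat vload(const float* p) { return _mm256_loadu_ps(p); }
inline vfloat vmul(vfloat a, vfloat b) { return _mm256_mul_ps(a, b); }
inline void vstore(float* p, vfloat v) { _mm256_storeu_ps(p, v); }
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vfloat;
const std::size_t vfloat_width = 4;
inline vfloat vload(const float* p) { return _mm_loadu_ps(p); }
inline vfloat vmul(vfloat a, vfloat b) { return _mm_mul_ps(a, b); }
inline void vstore(float* p, vfloat v) { _mm_storeu_ps(p, v); }
#else
typedef float vfloat;  // scalar fallback: one "lane"
const std::size_t vfloat_width = 1;
inline vfloat vload(const float* p) { return *p; }
inline vfloat vmul(vfloat a, vfloat b) { return a * b; }
inline void vstore(float* p, vfloat v) { *p = v; }
#endif

// Architecture-independent kernel written against the wrapper.
void eltwise_product(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + vfloat_width <= n; i += vfloat_width)
        vstore(c + i, vmul(vload(a + i), vload(b + i)));
    for (; i < n; ++i)  // scalar remainder
        c[i] = a[i] * b[i];
}
```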

Cilk Plus Array Notation

. C/C++ language extension
. Support for data parallel operations
. New language array expression
    array-expression [ lower-bound : length : stride ]
. Default values of each argument in [:]
  – lower-bound: 0
  – length: length of array
  – stride: 1 (if default stride, the second ":" may be omitted)
. array-expression[:] -> array section is the entire array of known length and stride 1

Cilk Plus Array Notation

. array-expression[:][:] denotes a two dimensional array
. Two new terms
  – rank: number of array sections of a single array (rank zero -> scalar)
  – shape: length of each array section
. Statement
  – all expressions have the same rank and shape
  – or rank zero -> broadcast for each element
. Built-in functions (reductions, …)
. Overlap between LHS and RHS array expression -> undefined behaviour (unless exact overlap)

Cilk Plus Array Notation

. No new data types (use of existing array types in C and C++)
. Short-hand for entire array: array[:]
. Exception: dynamically allocated arrays -> array[start : length]
. Availability
  – Intel C/C++ compiler
  – GCC 5
  – Clang/LLVM fork (https://cilkplus.github.io)
. Examples
    int a[10];
    int b[10];
    int c[10][10];
    int d[10];

Examples

Cilk array notation   Scalar C/C++ code
a[:] = 5;             for (i = 0; i < 10; i++) a[i] = 5;
a[0:7] = 5;           for (i = 0; i < 7; i++) a[i] = 5;
a[7:3] = 4;           for (i = 7; i < (7 + 3); i++) a[i] = 4;
a[0:5:2] = 5;         for (i = 0; i < 10; i += 2) a[i] = 5;
a[1:5:2] = 4;         for (i = 1; i < 10; i += 2) a[i] = 4;
a[:] = b[:];          for (i = 0; i < 10; i++) a[i] = b[i];
a[:] = b[:] + 5;      for (i = 0; i < 10; i++) a[i] = b[i] + 5;
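The index sets behind the section syntax [lower-bound : length : stride] can be reproduced in standard C++. The sketch below is not Cilk Plus itself; `section_indices` and `assign_section` are hypothetical helpers that only mirror which elements a section touches.

```cpp
#include <cstddef>
#include <vector>

// Indices addressed by the array section [lower : length : stride].
std::vector<std::size_t> section_indices(std::size_t lower, std::size_t length,
                                         std::size_t stride = 1) {
    std::vector<std::size_t> idx;
    for (std::size_t k = 0; k < length; ++k)
        idx.push_back(lower + k * stride);
    return idx;
}

// Equivalent of a[lower : length : stride] = value.
// The caller must make sure that all addressed indices lie inside a.
void assign_section(std::vector<int>& a, std::size_t lower, std::size_t length,
                    std::size_t stride, int value) {
    for (std::size_t k = 0; k < length; ++k)
        a[lower + k * stride] = value;
}
```

Note that the second argument is a length, not an upper bound: the section 0:5:2 addresses five elements, the last one at index 8.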

Cilk array notation   Scalar C/C++ code
d[:] = a[:] + b[:];   for (i = 0; i < 10; i++) d[i] = a[i] + b[i];
a[0:n] = 5;           for (i = 0; i < n; i++) a[i] = 5;
c[:][:] = 12;         for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) c[i][j] = 12;
c[0:5:2][:] = 12;     for (i = 0; i < 10; i += 2) for (j = 0; j < 10; j++) c[i][j] = 12;
c[4][:] = a[:];       for (j = 0; j < 10; j++) c[4][j] = a[j];
func(a[:]);           for (i = 0; i < 10; i++) func(a[i]);
d[:] = a[b[:]];       for (i = 0; i < 10; i++) d[i] = a[b[i]];
a[b[:]] = d[:];       for (i = 0; i < 10; i++) a[b[i]] = d[i];

Cilk array notation:
    if (5 == a[:])
        b[:] = 1;
    else
        b[:] = 0;
Scalar C/C++ code:
    for (i = 0; i < 10; i++)
    {
        if (5 == a[i])
            b[i] = 1;
        else
            b[i] = 0;
    }

Cilk array notation:
    if ((5 == a[:]) || (8 == a[:]))
        b[:] = 1;
    else
        b[:] = 0;
Scalar C/C++ code:
    for (i = 0; i < 10; i++)
    {
        if ((5 == a[i]) || (8 == a[i]))
            b[i] = 1;
        else
            b[i] = 0;
    }

Cilk array notation:
    a[:] = b[:] < 5 ? b[:] : a[:];
Scalar C/C++ code:
    for (i = 0; i < 10; i++)
        a[i] = b[i] < 5 ? b[i] : a[i];

Functions

. Scalar function -> applied to all elements of the array section
. Element type -> overloading in C++
. Example: a[:] = sin(b[:])
. Compiler may use a vectorized version of the function
  – Built-in function
  – User defined SIMD-enabled function (elemental functions: __attribute__((vector)))

Builtin Functions

Builtin function                Return value
__sec_reduce_add(a[:])          Scalar that is the sum of all the elements in the array section.
__sec_reduce_mul(a[:])          Scalar that is the product of all the elements in the array section.
__sec_reduce_max(a[:])          Scalar that is the largest element in the array section.
__sec_reduce_min(a[:])          Scalar that is the smallest element in the array section.
__sec_reduce_max_ind(a[:])      Integer index of the largest element in the array section.
__sec_reduce_min_ind(a[:])      Integer index of the smallest element in the array section.
__sec_reduce_all_zero(a[:])     1 if all the elements of the array section are zero, else 0.
__sec_reduce_all_nonzero(a[:])  1 if all elements of the array section are non-zero, else 0.
__sec_reduce_any_zero(a[:])     1 if any of the elements in the array section are zero, else 0.
__sec_reduce_any_nonzero(a[:])  1 if any of the elements in the array section are non-zero, else 0.
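For readers without a Cilk Plus compiler, the meaning of several of these reductions can be expressed with the C++ standard library. The functions below are hypothetical stand-ins that follow the descriptions in the table; they are not the Cilk Plus builtins, and details such as tie-breaking for the index variants are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

int reduce_add(const std::vector<int>& a) {
    return std::accumulate(a.begin(), a.end(), 0);
}

int reduce_mul(const std::vector<int>& a) {
    return std::accumulate(a.begin(), a.end(), 1, std::multiplies<int>());
}

// Requires a non-empty vector.
int reduce_max(const std::vector<int>& a) {
    return *std::max_element(a.begin(), a.end());
}

// Index of the first largest element; requires a non-empty vector.
std::ptrdiff_t reduce_max_ind(const std::vector<int>& a) {
    return std::distance(a.begin(), std::max_element(a.begin(), a.end()));
}

bool reduce_all_zero(const std::vector<int>& a) {
    return std::all_of(a.begin(), a.end(), [](int v) { return v == 0; });
}

bool reduce_any_nonzero(const std::vector<int>& a) {
    return std::any_of(a.begin(), a.end(), [](int v) { return v != 0; });
}
```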

Example 8 Minimum Distance Computation

[Figure: illustration of the minimum distance computation; labelled value ~2.07]

. Go to the directory example_8.
. Compile the files BinarySTLReader.cpp and distance.cpp:
  icc -O2 -openmp BinarySTLReader.cpp distance.cpp
. Execute the binary.
. What are the execution times?
. Enable OpenMP support (uncomment the relevant code).
. Compile and execute again.
. What are the execution times?

Portability

. Initial situation
  – Auto vectorization is enabled by default (Intel)
  – Default instruction set SSE2 (-msse2)
. Goal
  – Optimal performance on target machine
  – Usage of latest instruction set architecture (-xAVX)
. Problem
  – Illegal instruction exception on older hardware
. Solution
  – Create several binaries
  – Better: usage of CPU Dispatch

ICC CPU Dispatch

. Generation of multiple code paths
. Usage of compiler switch -ax
. Base line path
  – Other switches (e.g. -O3) apply to base line path
  – Specified via -x or -m (default -mSSE2)
. Alternative path
  – Specified via -ax (e.g. -axAVX)
. Path selection based on executing CPU

Manual CPU Dispatch

. Usage of __attribute__((cpu_dispatch(cpuid,…)))
. Example

__attribute__((cpu_dispatch(generic, future_cpu_16)))
void dispatch_func() {};

__attribute__((cpu_specific(generic)))
void dispatch_func() { /* Code for generic */ }

__attribute__((cpu_specific(future_cpu_16)))
void dispatch_func() { /* Code for future_cpu_16 */ }

int main() { dispatch_func(); }
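Where the ICC attributes are not available, a similar selection can be made by hand at runtime. The sketch below uses the GCC/Clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which is a swapped-in technique and not the ICC mechanism shown above. The kernel functions are hypothetical placeholders; in a real program the AVX variant would live in a source file compiled with -mavx.

```cpp
typedef const char* (*kernel_fn)();

// Placeholder kernels: each returns its own name so the choice is visible.
const char* kernel_generic() { return "generic"; }
const char* kernel_avx() { return "avx"; }

// Pick the kernel once, based on the CPU the program is running on.
kernel_fn select_kernel() {
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return kernel_avx;
#endif
    return kernel_generic;  // base line path for everything else
}
```

A caller would store the result once, for example `kernel_fn k = select_kernel();`, and then call `k()` in place of a fixed function.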

Conclusion

. Vectorization can improve the runtime performance of floating point intensive loops (increase flops/watt)
. Auto vectorization
  – Many factors inhibit auto vectorization (some do not apply to certain processors, like scatter/gather)
  – Vectorization report helps to identify (non) vectorized code
  – If one compiler fails, another could be successful
  – The Intel compiler can provide hints for code modifications
. Explicit vectorization
  – Vector intrinsics: abstraction over vector assembler
  – Cilk Plus array notation: good abstraction for data parallelism

Thank You!

[Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland]

www.risc-software.at
