Introduction to Vectorization

Alexander Leutgeb, RISC Software GmbH

RISC Software GmbH – Johannes Kepler University Linz, © 2014, 16.04.2014

Motivation

- Increase in the number of cores – threading techniques improve performance
- But flops per cycle of vector units increased as much as the number of cores
- Not using the vector units → wasting flops/watt
- For best performance:
  - Use all cores
  - Use vector units efficiently
  - Ignoring the potential of vector units is as inefficient as using only one core

Vector Unit

- Single Instruction Multiple Data (SIMD) units
- Mostly for floating-point operations
- Data parallelism with one instruction:
  - 64-bit unit → 1 DP flop, 2 SP flops
  - 128-bit unit → 2 DP flops, 4 SP flops
  - …
- Multiple data elements are loaded into vector registers and processed by the vector units
- Some architectures can issue more than one vector instruction per cycle (e.g. Sandy Bridge)

Parallel Execution

The scalar version works on one element at a time:

a[i] = b[i] + c[i] * d[i];

The vector version carries out the same instructions on many elements at a time:

a[i:8] = b[i:8] + c[i:8] * d[i:8];

a[i]   = b[i]   + c[i]   * d[i]
a[i+1] = b[i+1] + c[i+1] * d[i+1]
a[i+2] = b[i+2] + c[i+2] * d[i+2]
  …        …        …        …
a[i+7] = b[i+7] + c[i+7] * d[i+7]

Vector Registers

Usage from a Programmer's Point of View

From highest ease of use to most programmer control:

- Use vectorized libraries (e.g. Intel MKL) — ease of use
- Fully automatic vectorization
- Auto-vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and SIMD function annotation)
- Vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps) — programmer control

Auto Vectorization

- Modern compilers analyse loops in serial code and identify candidates for vectorization
  - They perform loop transformations to enable that identification
- They use the instruction set of the target architecture

Common Switches

Functionality                                 | Linux
--------------------------------------------- | -----------
Disable optimization                          | -O0
Optimize for speed (no code size increase)    | -O1
Optimize for speed (default)                  | -O2
High-level optimizer (e.g. loop unrolling)    | -O3
Aggressive optimizations (e.g. -ipo, -O3, …)  | -fast
Create symbols for debugging                  | -g
Generate assembly files                       | -S
Optimization report generation                | -opt-report
OpenMP support                                | -openmp

Architecture Specific Compiler Switches

Functionality                                                     | Linux*
----------------------------------------------------------------- | ------------
Optimize for current machine                                       | -xHOST
Generate SSE v1 code                                               | -xSSE1
Generate SSE v2 code (default; may also emit SSE v1 code)          | -xSSE2
Generate SSE v3 code (may also emit SSE v1 and v2 code)            | -xSSE3
Generate SSE v3 code for Atom-based processors                     | -xSSE3_ATOM
Generate SSSE v3 code (may also emit SSE v1, v2, and v3 code)      | -xSSSE3
Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)    | -xSSE4.1
Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and 4.1 code) | -xSSE4.2
Generate AVX code                                                  | -xAVX

* For Intel processors use -x; for non-Intel processors use -m

Exercise 1: Simple Vector Addition

1. Look at example_1/simple_dp.c
2. Compile the code in the following ways:
   1. icc -openmp -O2 -no-vec (gcc -fopenmp -O2)
   2. icc -openmp -O2 (gcc -fopenmp -O2 -ftree-vectorize)
3. What is the difference in execution times?
4. Repeat the same procedure for simple_sp.c.
5. Look again at the time differences.
6. Compare the sums between the vectorized and non-vectorized versions.
7. Can the execution times be improved further?

SIMD Vectorization Basics

- Vectorization offers a good performance improvement for floating-point-intensive code
- Vectorized code may compute slightly different results than non-vectorized code (non-vectorized x87 FPU: 80 bit; vectorized: 64 bit)
- Vectorization is only one aspect of improving performance
  - Efficient use of the cache is also necessary

Parallelization at No Cost

- We tell the compiler to vectorize – but that's not the whole story
- There are cases where a compiler cannot vectorize the code
- How can we analyse such situations? → the vectorization report
- Generate the vectorization report via the compiler
  - More interesting: what was not done, and why
  - Focus on the code paths that were not vectorized

Exercise 2: Vectorization Report

1. Compile example_1/simple_dp.c again, this time generating a vectorization report:
   icc -openmp -O2 -xAVX -vec-report=2 simple_dp.c
   (gcc -fopenmp -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 simple_dp.c)
2. What positive/negative information does the vectorization report give?
3. Insert the following code after the main loop and repeat from step 1:

   for (i = 1; i < VECSIZE; i++) {
       a[i] = a[i] + a[i - 1];
   }

Vectorization Report

- Information:
  - Which code was vectorized?
  - Which code was not vectorized?
- Compiler switch -vec-report=n:
  - n=0: no diagnostic information
  - n=1: vectorized loops (default)
  - n=2: vectorized and non-vectorized loops (and why)
  - n=3: additional dependency information
  - n=4: only non-vectorized loops
  - n=5: only non-vectorized loops, with dependency information
  - n=6: vectorized and non-vectorized loops, with details


Unrolling allows the compiler to reconstruct the loop for vector operations:

for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
}

for (i = 0; i < N; i += 4) {
    a[i]     = b[i]     * c[i];
    a[i + 1] = b[i + 1] * c[i + 1];
    a[i + 2] = b[i + 2] * c[i + 2];
    a[i + 3] = b[i + 3] * c[i + 3];
}

Vector execution: load b(i, …, i+3), load c(i, …, i+3), compute b * c → a, store a(i, …, i+3).

Requirements for Auto Vectorization

- Countable
- Single entry and single exit

while (i < 100) {
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)  // data-dependent exit condition
        break;
    ++i;
}  // loop not vectorized

- Straight-line code (no switch; if only via masking)

for (int i = 0; i < length; i++) {
    float s = b[i] * b[i] - 4 * a[i] * c[i];
    if (s >= 0)
        x[i] = sqrt(s);
    else
        x[i] = 0.;
}  // loop vectorized (because of masking)

Requirements for Auto Vectorization

- Only the innermost loop (caution in case of loop interchange or loop collapsing)
- No function calls, except:
  - Intrinsic math functions (sin, log, …)
  - Inline functions
  - Elemental functions: __attribute__((vector))

Inhibitors of Auto Vectorization

- Non-contiguous data

// arrays accessed with stride 2
for (int i = 0; i < SIZE; i += 2)
    b[i] += a[i];

// inner loop accesses a with stride SIZE
for (int j = 0; j < SIZE; j++) {
    for (int i = 0; i < SIZE; i++)
        b[i] += a[i][j] * x[j];
}

// indirect addressing of x using index array
for (int i = 0; i < SIZE; i++)
    b[i] = x[index[i]] * a[i];

Inhibitors of Auto Vectorization

- Inability to prove that data does not alias (or overlap)
  - Runtime check possible → multi-versioned code (vectorized/non-vectorized)

void my_cp(int nx, double* a, double* b) {
    for (int i = 0; i < nx; i++)
        a[i] = b[i];
}

  - Runtime check not possible → not vectorized

void my_combine(int* ioff, int nx, double* a, double* b, double* c) {
    for (int i = 0; i < nx; i++) {
        a[i] = b[i] + c[i + *ioff];
    }
}

Would vectorize with strict aliasing (-ansi-alias).

Inhibitors of Auto Vectorization

- Vector dependency
  - Read after write (RAW): not vectorizable

    for (i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

  - Write after read (WAR): vectorizable

    for (i = 0; i < N - 1; i++)
        a[i] = a[i + 1] + b[i];

  - Read after read (RAR): vectorizable

    for (i = 0; i < N; i++)
        a[i] = b[i % M] + c[i];

  - Write after write (WAW): not vectorizable

    for (i = 0; i < N; i++)
        a[i % M] = b[i] + c[i];

Efficiency Aspects for Auto Vectorization: Alignment

- Width of vector register (SSE 16 bytes, AVX 32 bytes)
- Alignment checked at compile time where possible
- Otherwise checked at runtime → runtime peeling

void fill(char* x) {
    for (int i = 0; i < 1024; i++)
        x[i] = 1;
}

- Explicit alignment definition, local or global: __attribute__((aligned(16)))

Runtime peeling generated for fill():

peel = x & 0x0f;
if (peel != 0) {
    peel = 16 - peel;
    // scalar prologue up to the first aligned element
    for (i = 0; i < peel; i++)
        x[i] = 1;
}
// aligned vector loop
for (i = peel; i < 1024; i++)
    x[i] = 1;

Efficiency Aspects for Auto Vectorization: Data Layout

- Use Structure of Arrays (SoA) instead of Array of Structures (AoS)

// AoS: x y z x y z x y z x y z
struct Vector3s {
    double x;
    double y;
    double z;
};

// SoA: x x x x y y y y z z z z
struct Vectors3s {
    double* x;
    double* y;
    double* z;
};

Exercise 3: Non-Contiguous Data

1. Look at example_2/contig.c
2. Compile contig.c with
   1. icc -openmp -O2 -xAVX -vec-report=2 contig.c
   2. gcc -fopenmp -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 contig.c
3. What do the vectorization reports tell?
4. What happens if we remove the update of b?

Exercise 4: Overlapping Data and Aliasing

1. Look at the code in example_3
2. Compile alias.c (icc -openmp -O2 -xAVX -o alias alias.c)
3. Compile alias_func.c and alias_main.c (icc -openmp -O2 -xAVX -o alias_multi alias_func.c alias_main.c)
4. What are the runtime differences between alias and alias_multi?
5. What does the vectorization report of alias_multi tell (-vec-report=2)?
6. Use Intel Guided Compilation (-guide) for further advice (caution: -guide means no code generation)
7. Implement the hint and compile again. What are the runtime differences now?

Compiler Directives

- Many compilers have directives for vectorization hints
- ivdep (C: "#pragma ivdep", Fortran: "!dec$ ivdep")
  Asserts that the loop has no vector dependency (using it on a loop with an actual dependency leads to incorrect code)
- vector always (C: "#pragma vector always", Fortran: "!dec$ vector always")
- Elemental functions: __attribute__((vector))

Compiler Directives: SIMD Extension

- SIMD pragma: #pragma simd
- SIMD function annotation: __attribute__((simd))
- Differences to the traditional ivdep and vector always:
  - The traditional pragmas are more like hints
  - The new SIMD extension is more like an assertion
- Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert)

Exercise 5: Vector Dependency

1. Look at example_4/forward.c.
2. Compile forward.c and execute the binary (icc -openmp -O2 -xAVX -vec-report=2 forward.c)
3. What is the execution time?
4. What does the vectorization report tell?
5. What happens if you split the inner loop into two separate update loops for b and a?

Portability

- Initial situation:
  - Auto vectorization is enabled by default (Intel)
  - Default instruction set: SSE2 (-msse2)
- Goal:
  - Optimal performance on the target machine
  - Use of the latest instruction set architecture (-xAVX)
- Problem:
  - Illegal-instruction exception on older hardware
- Solution:
  - Create several binaries
  - Better: use CPU dispatch

CPU Dispatch

- Generation of multiple code paths
- Enabled with the compiler switch -ax
- Baseline path:
  - Other switches (e.g. -O3) apply to the baseline path
  - Specified via -x or -m (default -msse2)
- Alternative path:
  - Specified via -ax (e.g. -axAVX)
- Path selected at runtime based on the executing CPU

Manual CPU Dispatch

- Use __attribute__((cpu_dispatch(cpuid, …)))
- Example:

__attribute__((cpu_dispatch(generic, future_cpu_16)))
void dispatch_func() {}

__attribute__((cpu_specific(generic)))
void dispatch_func() { /* code for generic */ }

__attribute__((cpu_specific(future_cpu_16)))
void dispatch_func() { /* code for future_cpu_16 */ }

int main() {
    dispatch_func();
}

Vector Intrinsics: Elementwise Vector Multiplication

void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c) {
    size_t i;
    __m256 va;
    __m256 vb;
    __m256 vc;
    // assumes a->size is a multiple of 8 (8 floats per 256-bit register)
    for (i = 0; i < a->size; i += 8) {
        va = _mm256_loadu_ps(&a->data[i]);
        vb = _mm256_loadu_ps(&b->data[i]);
        vc = _mm256_mul_ps(va, vb);
        _mm256_storeu_ps(&c->data[i], vc);
    }
}

Conclusion

- Vectorization can improve the runtime performance of floating-point-intensive loops (our examples all fit into the cache)
- Many factors inhibit auto vectorization; some do not apply to certain processors (e.g. those with scatter/gather support)
- The vectorization report helps to identify (non-)vectorized code
- If one compiler cannot vectorize, another one may succeed
- The Intel compiler can provide hints for code modifications that lead to successful auto vectorization

Thank You!

(Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn; Valais, Switzerland)

www.risc-software.at
