Advanced Parallel Programming II

Alexander Leutgeb, RISC Software GmbH
RISC Software GmbH, Johannes Kepler University Linz, © 2016, 22.09.2016

Introduction to Vectorization

Motivation

- The number of cores keeps increasing
  - Threading techniques improve performance
- But the flops per cycle of the vector units have increased as much as the number of cores
- Not using the vector units wastes flops/watt
- For best performance
  - Use all cores
  - Use the vector units efficiently
  - Ignoring the potential of the vector units is as inefficient as using only one core

Vector Unit

- Single Instruction Multiple Data (SIMD) units
- Mostly for floating point operations
- Data parallelization with one instruction
  - 64-bit unit: 1 DP flop, 2 SP flop
  - 128-bit unit: 2 DP flop, 4 SP flop
  - …
- Multiple data elements are loaded into vector registers and used by the vector units
- Some architectures can execute more than one vector instruction per cycle (e.g. Sandy Bridge)

Parallel Execution

The scalar version works on one element at a time:

    a[i] = b[i] + c[i] * d[i];

The vector version carries out the same instructions on many elements at a time (here the eight elements i to i+7):

    a[i:8] = b[i:8] + c[i:8] * d[i:8];

Vector Registers

[Figure: vector registers]

Vector Unit Usage (Programmer's View)

Listed from greatest ease of use to greatest programmer control:

- Use vectorized libraries (e.g. Intel MKL)
- Fully automatic vectorization
- Auto vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and simd function annotation)
- Vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps)

Auto Vectorization

- Modern compilers analyse loops in serial code to identify candidates for vectorization
  - They perform loop transformations to make this identification possible
- They use the instruction set of the target architecture

Common Compiler Switches (GCC and ICC)

    Functionality                                   Switch
    Disable optimization                            -O0
    Optimize for speed (no code size increase)      -O1
    Optimize for speed (ICC default)                -O2
    High-level optimizer (e.g. loop unroll)         -O3
    Aggressive optimizations (e.g. -ipo, -O3, …)    -fast (ICC only)
    Create symbols for debugging                    -g
    Generate assembly files                         -S
    OpenMP support                                  -openmp (ICC), -fopenmp (GCC)

Architecture Specific Compiler Switches (GCC)

    Functionality                                                           Switch
    Optimize for current machine                                            -march=native
    Generate SSE v1 code                                                    -msse
    Generate SSE v2 code (default, may also emit SSE v1 code)               -msse2
    Generate SSE v3 code (may also emit SSE v1 and v2 code)                 -msse3
    Generate SSSE3 code (may also emit SSE v1, v2, and v3 code)             -mssse3
    Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)         -msse4.1
    Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and SSE4.1 code) -msse4.2
    Generate AVX code                                                       -mavx
    Generate AVX v2 code                                                    -mavx2

Architecture Specific Compiler Switches (ICC)

    Functionality                                                           Switch *
    Optimize for current machine                                            -xHost
    Generate SSE v1 code                                                    -xSSE1
    Generate SSE v2 code (default, may also emit SSE v1 code)               -xSSE2
    Generate SSE v3 code (may also emit SSE v1 and v2 code)                 -xSSE3
    Generate SSE v3 code for Atom-based processors                          -xSSE3_ATOM
    Generate SSSE3 code (may also emit SSE v1, v2, and v3 code)             -xSSSE3
    Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)         -xSSE4.1
    Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and SSE4.1 code) -xSSE4.2
    Generate AVX code                                                       -xAVX

    * For Intel processors use -x, for non-Intel processors use -m

Example 4: Simple Vector Addition

1. Go to the directory example_4.
2. Compile simple.cpp (no vectorization) and execute the binary:
     ICC: icc -std=c++11 -O2 -no-vec simple.cpp
     GCC: g++ -std=c++11 -O2 simple.cpp
3. Compile simple.cpp (vectorization) and execute the binary:
     ICC: icc -std=c++11 -O2 simple.cpp
     GCC: g++ -std=c++11 -O2 -ftree-vectorize simple.cpp
4. How do the execution times differ?
5. Can the execution times be improved further?

SIMD Vectorization Basics

- Vectorization offers a good performance improvement on floating point intensive code
- Vectorized code can compute slightly different results than non-vectorized code (x87 FPU: 80 bit, SIMD: 64 bit)
- The vector unit is used even for scalar operations
- Vectorization is only one aspect of improving performance
  - Efficient use of the cache is also necessary

Parallelization at No Cost

- We tell the compiler to vectorize
  - That's not the whole story
- There are cases where a compiler cannot vectorize the code
- How can we analyse such situations? With the vectorization report
- The compiler generates the vectorization report
  - What was not done, and why, is the more interesting part
  - Focus on code paths which were not vectorized

Example 5: Vectorization Report

1. Go to the directory example_5.
2. Compile simple.cpp with vector report generation enabled:
     ICC: icc -std=c++11 -O2 -vec-report=2 simple.cpp
     GCC: g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp
3. Which positive/negative information does the vectorization report give?
4. Insert the following code before std::cout in calcSP() and compile again. Have a look at the vectorization report.

    for (i = 1; i < VECSIZE; i++)
    {
        a[i] = a[i] + a[i - 1];
    }

Vectorization Report (ICC)

- Information
  - Which code was vectorized?
  - Which code was not vectorized?
- Compiler switch -vec-report<n>
  - n=0: no diagnostic information
  - n=1: (default) vectorized loops
  - n=2: vectorized/non-vectorized loops (and why)
  - n=3: additional dependency information
  - n=4: only non-vectorized loops
  - n=5: only non-vectorized loops and dependency information
  - n=6: vectorized/non-vectorized loops with details

Loop Unrolling

    for (i = 0; i < N; i++)
    {
        a[i] = b[i] * c[i];
    }

Unrolling allows the compiler to reconstruct the loop for vector operations:

    for (i = 0; i < N; i += 4)
    {
        a[i    ] = b[i    ] * c[i    ];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

The unrolled body corresponds to these vector operations:

1. Load b(i, …, i + 3)
2. Load c(i, …, i + 3)
3. Operate b * c -> a
4. Store a(i, …, i + 3)

Requirements for Auto Vectorization

- Countable
- Single entry and exit

    while (i < 100) {
        a[i] = b[i] * c[i];
        if (a[i] < 0.0)  // data-dependent exit condition
            break;
        ++i;
    }  // loop not vectorized

- Straight-line code (no switch; if only with masking)

    for (int i = 0; i < length; i++) {
        float s = b[i] * b[i] - 4 * a[i] * c[i];
        if (s >= 0)
            x[i] = sqrt(s);
        else
            x[i] = 0.;
    }  // loop vectorized (because of masking)

- Only the innermost loop (caution in case of loop interchange or loop collapsing)
- No function calls, except
  - Intrinsic math (sin, log, …)
  - Inline functions
  - Elemental functions __attribute__((vector))

Inhibitors of Auto Vectorization

- Non-contiguous data

    // arrays accessed with stride 2
    for (int i=0; i<SIZE; i+=2)
        b[i]+=a[i]+x[i];

    // inner loop accesses a with stride SIZE
    for (int j=0; j<SIZE; j++)
        for (int i=0; i<SIZE; i++)
            b[i]+=a[i][j]*x[j];

    // indirect addressing of x using index array
    for (int i=0; i<SIZE; i+=2)
        b[i]+=a[i]*x[index[i]];

Stride 1 is best. Caution in case of multi-dimensional arrays, where the stride-1 index differs between languages:

    F90:
    do j=1,n
      do i=1,n
        a(i,j)=b(i,j)*s
      enddo
    enddo

    C:
    for (j=0; j<n; j++)
      for (i=0;i<n;i++)
        a[j][i]=b[j][i]*s;

- Inability to identify data with aliases (or overlapping data)
  - Runtime check possible: multi-versioned code (vectorized/non-vectorized)

    void my_cp(int nx, double* a, double* b)
    {
        for (int i = 0; i < nx; i++)
            a[i] = b[i];
    }

  - Runtime check not possible: non-vectorized code

    void my_combine(int* ioff, int nx, double* a, double* b, double* c)
    {
        for (int i = 0; i < nx; i++) {
            a[i] = b[i] + c[i + *ioff];
        }
    }

    Would vectorize with strict aliasing (-ansi-alias)

Inhibitors of Auto Vectorization
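One common way to remove the aliasing inhibitor shown for my_cp and my_combine is to promise the compiler that the pointers never overlap. The following is a minimal sketch, not the course's solution. It assumes the `__restrict` qualifier, a compiler extension accepted by GCC, Clang, ICC and MSVC that is not part of standard C++. The function names `my_cp_restrict` and `my_combine_restrict` are hypothetical variants introduced here for illustration.

```cpp
#include <cassert>
#include <vector>

// Assumption: __restrict is a compiler extension, not standard C++.
// It promises that memory reached through a is not reached through b.
// Calling the function with overlapping ranges is undefined behaviour.
void my_cp_restrict(int nx, double* __restrict a, const double* __restrict b)
{
    assert(nx >= 0);  // precondition checked in debug builds only
    for (int i = 0; i < nx; i++)
        a[i] = b[i];
}

// Hypothetical variant of my_combine: the offset is read once into a local,
// so a store to a[i] can no longer be suspected of changing it.
void my_combine_restrict(const int* ioff, int nx,
                         double* __restrict a,
                         const double* __restrict b,
                         const double* __restrict c)
{
    assert(nx >= 0);
    const int off = *ioff;  // loop-invariant copy of the offset
    for (int i = 0; i < nx; i++)
        a[i] = b[i] + c[i + off];
}
```

Whether the loops are actually vectorized still depends on the compiler and its switches, so the vectorization report remains the place to check. The caller must guarantee that c holds at least nx + *ioff elements.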

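The stride-SIZE loop nest listed under non-contiguous data can be given a stride-1 inner loop by interchanging the two loops. The sketch below illustrates this under stated assumptions: the matrix is stored row-major in a flat std::vector, SIZE is an arbitrary illustrative value, and the function names `matvec_strided` and `matvec_unit_stride` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t SIZE = 64;  // hypothetical problem size

// Loop order as on the slide: the inner loop walks down a column of the
// row-major matrix, so consecutive iterations touch a with stride SIZE.
void matvec_strided(const std::vector<double>& a,
                    const std::vector<double>& x,
                    std::vector<double>& b)
{
    assert(a.size() == SIZE * SIZE && x.size() == SIZE && b.size() == SIZE);
    for (std::size_t j = 0; j < SIZE; j++)
        for (std::size_t i = 0; i < SIZE; i++)
            b[i] += a[i * SIZE + j] * x[j];
}

// Interchanged loops: the inner loop walks along a row, so a and x are
// both read with stride 1 and b[i] becomes a reduction over j.
void matvec_unit_stride(const std::vector<double>& a,
                        const std::vector<double>& x,
                        std::vector<double>& b)
{
    assert(a.size() == SIZE * SIZE && x.size() == SIZE && b.size() == SIZE);
    for (std::size_t i = 0; i < SIZE; i++) {
        double sum = b[i];  // local accumulator for the reduction
        for (std::size_t j = 0; j < SIZE; j++)
            sum += a[i * SIZE + j] * x[j];
        b[i] = sum;
    }
}
```

Both versions add the terms for each b[i] in the same order, so they produce the same result. Note that the interchanged inner loop is a floating point reduction; whether a compiler vectorizes it depends on its floating point settings, so the vectorization report should be consulted.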