Advanced Parallel Programming II
Alexander Leutgeb, RISC Software GmbH
RISC Software GmbH – Johannes Kepler University Linz © 2016 22.09.2016

Introduction to Vectorization

Motivation
- The number of cores keeps increasing
  - Threading techniques use them to improve performance
- But the flops per cycle of the vector units have increased as much as the number of cores
- Leaving the vector units unused wastes flops per watt
- For best performance
  - Use all cores
  - Use the vector units efficiently
  - Ignoring the potential of the vector units is as inefficient as using only one core

Vector Unit
- Single Instruction Multiple Data (SIMD) units
- Mostly for floating point operations
- Data parallelization with one instruction
  - 64-bit unit: 1 DP flop, 2 SP flops
  - 128-bit unit: 2 DP flops, 4 SP flops
  - …
- Multiple data elements are loaded into vector registers and used by the vector units
- Some architectures can execute more than one vector instruction per cycle (e.g. Sandy Bridge)

Parallel Execution
Scalar version works on one element at a time:
    a[i] = b[i] + c[i] * d[i];
Vector version carries out the same instructions on many elements at a time:
    a[i:8] = b[i:8] + c[i:8] * d[i:8];
[Figure: the scalar form computes a[i] alone; the vector form computes a[i] through a[i+7] in one step.]

Vector Registers
[Figure]

Vector Unit Usage (Programmer's View)
Options, ordered from ease of use (first) to programmer control (last):
- Use vectorized libraries (e.g.
Intel MKL)
- Fully automatic vectorization
- Auto vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and SIMD function annotation)
- Vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps); this end of the list gives the most programmer control

Auto Vectorization
- Modern compilers analyse loops in serial code to identify candidates for vectorization
  - They perform loop transformations to make that identification possible
- The generated code uses the instruction set of the target architecture

Common Compiler Switches GCC and ICC
Functionality                                   Switch
Disable optimization                            -O0
Optimize for speed (no code size increase)      -O1
Optimize for speed (default for ICC)            -O2
High-level optimizer (e.g. loop unroll)         -O3
Aggressive optimizations (e.g. -ipo, -O3, …)    -fast (ICC)
Create symbols for debugging                    -g
Generate assembly files                         -S
OpenMP support                                  -openmp (ICC), -fopenmp (GCC)

Architecture Specific Compiler Switches GCC
Functionality                                                                   Switch
Optimize for current machine                                                    -march=native
Generate SSE code                                                               -msse
Generate SSE2 code (default, may also emit SSE code)                            -msse2
Generate SSE3 code (may also emit SSE and SSE2 code)                            -msse3
Generate SSSE3 code (may also emit SSE, SSE2, and SSE3 code)                    -mssse3
Generate SSE4.1 code (may also emit SSE, SSE2, SSE3, and SSSE3 code)            -msse4.1
Generate SSE4.2 code (may also emit SSE, SSE2, SSE3, SSSE3, and SSE4.1 code)    -msse4.2
Generate AVX code                                                               -mavx
Generate AVX2 code                                                              -mavx2

Architecture Specific Compiler Switches ICC
Functionality                                                                   Switch *
Optimize for current machine                                                    -xHost
Generate SSE code                                                               -xSSE1
Generate SSE2 code (default, may also emit SSE code)                            -xSSE2
Generate SSE3 code (may also emit SSE and SSE2 code)                            -xSSE3
Generate SSE3 code for Atom-based processors
                                                                                -xSSE_ATOM
Generate SSSE3 code (may also emit SSE, SSE2, and SSE3 code)                    -xSSSE3
Generate SSE4.1 code (may also emit SSE, SSE2, SSE3, and SSSE3 code)            -xSSE4.1
Generate SSE4.2 code (may also emit SSE, SSE2, SSE3, SSSE3, and SSE4.1 code)    -xSSE4.2
Generate AVX code                                                               -xAVX
* For Intel processors use -x, for non-Intel processors use -m

Example 4 Simple Vector Addition
1. Go to the directory example_4.
2. Compile simple.cpp (no vectorization) and execute the binary:
       icc -std=c++11 -O2 -no-vec simple.cpp
       (g++ -std=c++11 -O2 simple.cpp)
3. Compile simple.cpp (vectorization) and execute the binary:
       icc -std=c++11 -O2 simple.cpp
       (g++ -std=c++11 -O2 -ftree-vectorize simple.cpp)
4. How do the execution times differ?
5. Can the execution times be improved further?

SIMD Vectorization Basics
- Vectorization offers a good performance improvement on floating-point-intensive code
- Vectorized code may compute slightly different results than non-vectorized code (x87 FPU: 80 bit, SIMD: 64 bit)
- The vector unit is used even for scalar operations
- Vectorization is only one aspect of improving performance
  - Efficient use of the cache is also necessary

Parallelization at No Cost
- We tell the compiler to vectorize
  - That's not the whole story
- There are cases where a compiler cannot vectorize the code
- How can we analyse such situations? With the vectorization report
- The compiler generates the vectorization report
  - What was not done, and why, is the more interesting part
  - Focus on the code paths that were not vectorized

Example 5 Vectorization Report
1. Go to the directory example_5.
2.
Compile simple.cpp with vectorization report generation enabled:
       icc -std=c++11 -O2 -vec-report=2 simple.cpp
       (g++ -std=c++11 -O2 -ftree-vectorize -fopt-info-vec-missed simple.cpp)
3. What positive and negative information does the vectorization report give?
4. Insert the following code before std::cout in calcSP() and compile again. Have a look at the vectorization report.
       for (i = 1; i < VECSIZE; i++)
       {
           a[i] = a[i] + a[i - 1];
       }

Vectorization Report ICC
- Information
  - Which code was vectorized?
  - Which code was not vectorized?
- Compiler switch -vec-report<n>
  - n=0: no diagnostic information
  - n=1: (default) vectorized loops
  - n=2: vectorized/non-vectorized loops (and why)
  - n=3: additional dependency information
  - n=4: only non-vectorized loops
  - n=5: only non-vectorized loops and dependency information
  - n=6: vectorized/non-vectorized loops with details

Loop Unrolling
Unrolling allows the compiler to reconstruct the loop for vector operations.

Original loop:
    for (i = 0; i < N; i++)
    {
        a[i] = b[i] * c[i];
    }

Unrolled loop:
    for (i = 0; i < N; i += 4)
    {
        a[i    ] = b[i    ] * c[i    ];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

Vector operations for one unrolled iteration:
    Load b(i, …, i + 3)
    Load c(i, …, i + 3)
    Operate b * c -> a
    Store a(i, …, i + 3)

Requirements for Auto Vectorization
- Countable
- Single entry and exit
    while (i < 100)
    {
        a[i] = b[i] * c[i];
        if (a[i] < 0.0) // data-dependent exit condition
            break;
        ++i;
    } // loop not vectorized
- Straight-line code (no switch; if only with masking)
    for (int i = 0; i < length; i++)
    {
        float s = b[i] * b[i] - 4 * a[i] * c[i];
        if (s >= 0) x[i] = sqrt(s);
        else x[i] = 0.;
    } // loop vectorized (because of masking)

Requirements for Auto Vectorization
- Only the innermost loop is vectorized (caution in case of loop interchange or loop collapsing)
- No function calls, except
  - Intrinsic math functions (sin, log, …)
  - Inline functions
  - Elemental functions declared with __attribute__((vector))

Inhibitors of Auto Vectorization
- Non-contiguous data
    // arrays accessed with stride 2
    for (int i=0; i<SIZE; i+=2) b[i]+=a[i]*x[i];

    // inner loop accesses a with stride SIZE
    for (int j=0; j<SIZE; j++)
        for (int i=0; i<SIZE; i++) b[i]+=a[i][j]*x[j];

    // indirect addressing of x using index array
    for (int i=0; i<SIZE; i+=2) b[i]+=a[i]*x[index[i]];
- Stride 1 is best. Caution with multidimensional arrays: Fortran stores them column by column and C row by row, so the innermost loop runs over the first index in F90 and over the last index in C.
    F90:
    do j=1,n
      do i=1,n
        a(i,j)=b(i,j)*s
      enddo
    enddo

    C:
    for (j=0; j<n; j++)
      for (i=0; i<n; i++)
        a[j][i]=b[j][i]*s;

Inhibitors of Auto Vectorization
- The compiler cannot rule out aliased (overlapping) data
  - Runtime check possible: multi-versioned code (vectorized/non-vectorized)
    void my_cp(int nx, double* a, double* b)
    {
        for (int i = 0; i < nx; i++) a[i] = b[i];
    }
  - Runtime check not possible: non-vectorized
    void my_combine(int* ioff, int nx, double* a, double* b, double* c)
    {
        for (int i = 0; i < nx; i++)
        {
            a[i] = b[i] + c[i + *ioff];
        }
    }
    Would vectorize with strict aliasing (-ansi-alias)

Inhibitors of Auto Vectorization