Introduction to Vectorization

Alexander Leutgeb, RISC Software GmbH

RISC Software GmbH – Johannes Kepler University Linz, © 2014, 16.04.2014

Motivation

- Increase in the number of cores – threading techniques improve performance
- But flops per cycle of vector units increased as much as the number of cores
- Not using the vector units → wasting flops/watt
- For best performance:
  - Use all cores
  - Use vector units efficiently
  - Ignoring the potential of vector units is as inefficient as using only one core

Vector Unit

- Single Instruction Multiple Data (SIMD) units
- Mostly for floating-point operations
- Data parallelism with one instruction:
  - 64-bit unit → 1 DP flop, 2 SP flops
  - 128-bit unit → 2 DP flops, 4 SP flops
  - …
- Multiple data elements are loaded into vector registers and processed by the vector units
- Some architectures can issue more than one vector instruction per cycle (e.g. Sandy Bridge)

Parallel Execution

The scalar version works on one element at a time:

a[i] = b[i] + c[i] * d[i];

The vector version carries out the same instructions on many elements at a time:

a[i:8] = b[i:8] + c[i:8] * d[i:8];

a[i]   = b[i]   + c[i]   * d[i]
a[i+1] = b[i+1] + c[i+1] * d[i+1]
a[i+2] = b[i+2] + c[i+2] * d[i+2]
  …        …        …        …
a[i+7] = b[i+7] + c[i+7] * d[i+7]

Vector Registers

Usage from a Programmer's Point of View

From highest ease of use to most programmer control:

- Use vectorized libraries (e.g. Intel MKL) — ease of use
- Fully automatic vectorization
- Auto-vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and SIMD function annotation)
- Vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps) — programmer control

Auto Vectorization

- Modern compilers analyse loops in serial code and identify candidates for vectorization
  - They perform loop transformations to enable that identification
- They use the instruction set of the target architecture

Common Switches

Functionality                                 | Linux
--------------------------------------------- | -----------
Disable optimization                          | -O0
Optimize for speed (no code size increase)    | -O1
Optimize for speed (default)                  | -O2
High-level optimizer (e.g. loop unrolling)    | -O3
Aggressive optimizations (e.g. -ipo, -O3, …)  | -fast
Create symbols for debugging                  | -g
Generate assembly files                       | -S
Optimization report generation                | -opt-report
OpenMP support                                | -openmp

Architecture Specific Compiler Switches

Functionality                                                     | Linux*
----------------------------------------------------------------- | ------------
Optimize for current machine                                       | -xHOST
Generate SSE v1 code                                               | -xSSE1
Generate SSE v2 code (default; may also emit SSE v1 code)          | -xSSE2
Generate SSE v3 code (may also emit SSE v1 and v2 code)            | -xSSE3
Generate SSE v3 code for Atom-based processors                     | -xSSE3_ATOM
Generate SSSE v3 code (may also emit SSE v1, v2, and v3 code)      | -xSSSE3
Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)    | -xSSE4.1
Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and 4.1 code) | -xSSE4.2
Generate AVX code                                                  | -xAVX

* For Intel processors use -x; for non-Intel processors use -m

Exercise 1: Simple Vector Addition

1. Look at example_1/simple_dp.c
2. Compile the code in the following ways:
   1. icc -openmp -O2 -no-vec (gcc -fopenmp -O2)
   2. icc -openmp -O2 (gcc -fopenmp -O2 -ftree-vectorize)
3. What is the difference in execution times?
4. Repeat the same procedure for simple_sp.c.
5. Look again at the time differences.
6. Compare the sums between the vectorized and non-vectorized versions.
7. Can the execution times be improved further?

SIMD Vectorization Basics

- Vectorization offers a good performance improvement for floating-point-intensive code
- Vectorized code may compute slightly different results than non-vectorized code (non-vectorized x87 FPU: 80 bit; vectorized: 64 bit)
- Vectorization is only one aspect of improving performance
  - Efficient use of the cache is also necessary

Parallelization at No Cost

- We tell the compiler to vectorize – but that's not the whole story
- There are cases where a compiler cannot vectorize the code
- How can we analyse such situations? → the vectorization report
- Generate the vectorization report via the compiler
  - More interesting: what was not done, and why
  - Focus on the code paths that were not vectorized

Exercise 2: Vectorization Report

1. Compile example_1/simple_dp.c again, this time generating a vectorization report:
   icc -openmp -O2 -xAVX -vec-report=2 simple_dp.c
   (gcc -fopenmp -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 simple_dp.c)
2. What positive/negative information does the vectorization report give?
3. Insert the following code after the main loop and repeat from step 1:

   for (i = 1; i < VECSIZE; i++) {
       a[i] = a[i] + a[i - 1];
   }

Vectorization Report

- Information:
  - Which code was vectorized?
  - Which code was not vectorized?
- Compiler switch -vec-report=n:
  - n=0: no diagnostic information
  - n=1: vectorized loops (default)
  - n=2: vectorized and non-vectorized loops (and why)
  - n=3: additional dependency information
  - n=4: only non-vectorized loops
  - n=5: only non-vectorized loops, with dependency information
  - n=6: vectorized and non-vectorized loops, with details


Unrolling allows the compiler to reconstruct the loop for vector operations:

for (i = 0; i < N; i++) {
    a[i] = b[i] * c[i];
}

for (i = 0; i < N; i += 4) {
    a[i]     = b[i]     * c[i];
    a[i + 1] = b[i + 1] * c[i + 1];
    a[i + 2] = b[i + 2] * c[i + 2];
    a[i + 3] = b[i + 3] * c[i + 3];
}

Vector execution: load b(i, …, i+3), load c(i, …, i+3), compute b * c → a, store a(i, …, i+3).

Requirements for Auto Vectorization

- Countable
- Single entry and single exit

while (i < 100) {
    a[i] = b[i] * c[i];
    if (a[i] < 0.0)  // data-dependent exit condition
        break;
    ++i;
}  // loop not vectorized

- Straight-line code (no switch; if only via masking)

for (int i = 0; i < length; i++) {
    float s = b[i] * b[i] - 4 * a[i] * c[i];
    if (s >= 0)
        x[i] = sqrt(s);
    else
        x[i] = 0.;
}  // loop vectorized (because of masking)

Requirements for Auto Vectorization

- Only the innermost loop (caution in case of loop interchange or loop collapsing)
- No function calls, except:
  - Intrinsic math functions (sin, log, …)
  - Inline functions
  - Elemental functions: __attribute__((vector))

Inhibitors of Auto Vectorization

- Non-contiguous data

// arrays accessed with stride 2
for (int i = 0; i < SIZE; i += 2)
    b[i] += a[i];

// inner loop accesses a with stride SIZE
for (int j = 0; j < SIZE; j++) {
    for (int i = 0; i < SIZE; i++)
        b[i] += a[i][j] * x[j];
}

// indirect addressing of x using index array
for (int i = 0; i < SIZE; i++)
    b[i] = x[index[i]] * a[i];

Inhibitors of Auto Vectorization

- Inability to prove that data does not alias (or overlap)
  - Runtime check possible → multi-versioned code (vectorized/non-vectorized)

void my_cp(int nx, double* a, double* b) {
    for (int i = 0; i < nx; i++)
        a[i] = b[i];
}

  - Runtime check not possible → not vectorized

void my_combine(int* ioff, int nx, double* a, double* b, double* c) {
    for (int i = 0; i < nx; i++) {
        a[i] = b[i] + c[i + *ioff];
    }
}

Would vectorize with strict aliasing (-ansi-alias).

Inhibitors of Auto Vectorization

- Vector dependency
  - Read after write (RAW): not vectorizable

    for (i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

  - Write after read (WAR): vectorizable

    for (i = 0; i < N - 1; i++)
        a[i] = a[i + 1] + b[i];

  - Read after read (RAR): vectorizable

    for (i = 0; i < N; i++)
        a[i] = b[i % M] + c[i];

  - Write after write (WAW): not vectorizable

    for (i = 0; i < N; i++)
        a[i % M] = b[i] + c[i];

Efficiency Aspects for Auto Vectorization: Alignment

- Width of vector register (SSE 16 bytes, AVX 32 bytes)
- Alignment checked at compile time where possible
- Otherwise checked at runtime → runtime peeling

void fill(char* x) {
    for (int i = 0; i < 1024; i++)
        x[i] = 1;
}

- Explicit alignment definition, local or global: __attribute__((aligned(16)))

Runtime peeling generated for fill():

peel = x & 0x0f;
if (peel != 0) {
    peel = 16 - peel;
    // scalar prologue up to the first aligned element
    for (i = 0; i < peel; i++)
        x[i] = 1;
}
// aligned vector loop
for (i = peel; i < 1024; i++)
    x[i] = 1;

Efficiency Aspects for Auto Vectorization: Data Layout

- Use Structure of Arrays (SoA) instead of Array of Structures (AoS)

// AoS: x y z x y z x y z x y z
struct Vector3s {
    double x;
    double y;
    double z;
};

// SoA: x x x x y y y y z z z z
struct Vectors3s {
    double* x;
    double* y;
    double* z;
};

Exercise 3: Non-Contiguous Data

1. Look at example_2/contig.c
2. Compile contig.c with
   1. icc -openmp -O2 -xAVX -vec-report=2 contig.c
   2. gcc -fopenmp -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 contig.c
3. What do the vectorization reports tell?
4. What happens if we remove the update of b?

Exercise 4: Overlapping Data and Aliasing

1. Look at the code in example_3
2. Compile alias.c (icc -openmp -O2 -xAVX -o alias alias.c)
3. Compile alias_func.c and alias_main.c (icc -openmp -O2 -xAVX -o alias_multi alias_func.c alias_main.c)
4. What are the runtime differences between alias and alias_multi?
5. What does the vectorization report of alias_multi tell (-vec-report=2)?
6. Use Intel Guided Compilation (-guide) for further advice (caution: -guide means no code generation)
7. Implement the hint and compile again. What are the runtime differences now?

Compiler Directives

- Many compilers have directives for vectorization hints
- ivdep (C: "#pragma ivdep", Fortran: "!dec$ ivdep")
  Asserts that the loop has no vector dependency (using it on a loop with an actual dependency leads to incorrect code)
- vector always (C: "#pragma vector always", Fortran: "!dec$ vector always")
- Elemental functions: __attribute__((vector))

Compiler Directives: SIMD Extension

- SIMD pragma: #pragma simd
- SIMD function annotation: __attribute__((simd))
- Differences to the traditional ivdep and vector always:
  - The traditional pragmas are more like hints
  - The new SIMD extension is more like an assertion
- Fine control over auto vectorization with additional clauses (vectorlength, private, linear, reduction, assert)

Exercise 5: Vector Dependency

1. Look at example_4/forward.c.
2. Compile forward.c and execute the binary (icc -openmp -O2 -xAVX -vec-report=2 forward.c)
3. What is the execution time?
4. What does the vectorization report tell?
5. What happens if you split the inner loop into two separate update loops for b and a?

Portability

- Initial situation:
  - Auto vectorization is enabled by default (Intel)
  - Default instruction set: SSE2 (-msse2)
- Goal:
  - Optimal performance on the target machine
  - Use of the latest instruction set architecture (-xAVX)
- Problem:
  - Illegal-instruction exception on older hardware
- Solution:
  - Create several binaries
  - Better: use CPU dispatch

CPU Dispatch

- Generation of multiple code paths
- Enabled with the compiler switch -ax
- Baseline path:
  - Other switches (e.g. -O3) apply to the baseline path
  - Specified via -x or -m (default -msse2)
- Alternative path:
  - Specified via -ax (e.g. -axAVX)
- Path selected at runtime based on the executing CPU

Manual CPU Dispatch

- Use __attribute__((cpu_dispatch(cpuid, …)))
- Example:

__attribute__((cpu_dispatch(generic, future_cpu_16)))
void dispatch_func() {}

__attribute__((cpu_specific(generic)))
void dispatch_func() { /* code for generic */ }

__attribute__((cpu_specific(future_cpu_16)))
void dispatch_func() { /* code for future_cpu_16 */ }

int main() {
    dispatch_func();
}

Vector Intrinsics: Elementwise Vector Multiplication

void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c) {
    size_t i;
    __m256 va;
    __m256 vb;
    __m256 vc;
    // assumes a->size is a multiple of 8 (8 floats per 256-bit register)
    for (i = 0; i < a->size; i += 8) {
        va = _mm256_loadu_ps(&a->data[i]);
        vb = _mm256_loadu_ps(&b->data[i]);
        vc = _mm256_mul_ps(va, vb);
        _mm256_storeu_ps(&c->data[i], vc);
    }
}

Conclusion

- Vectorization can improve the runtime performance of floating-point-intensive loops (our examples all fit into the cache)
- Many factors inhibit auto vectorization; some do not apply to certain processors (e.g. those with scatter/gather support)
- The vectorization report helps to identify (non-)vectorized code
- If one compiler cannot vectorize, another one may succeed
- The Intel compiler can provide hints for code modifications that lead to successful auto vectorization

Thank You!

(Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn; Valais, Switzerland)

www.risc-software.at
