
Introduction to Vectorization
Alexander Leutgeb, RISC Software GmbH
RISC Software GmbH – Johannes Kepler University Linz, © 2014, 16.04.2014

Motivation
- The increase in the number of cores is addressed by threading techniques to improve performance.
- But the flops per cycle of the vector units have increased as much as the number of cores.
- Not using the vector units wastes flops/watt.
- For best performance:
  - use all cores
  - use the vector units efficiently
  - ignoring the potential of the vector units is as inefficient as using only one core

Vector Unit
- Single Instruction Multiple Data (SIMD) units, mostly for floating-point operations.
- Data parallelism with one instruction:
  - 64-bit unit: 1 DP flop, 2 SP flops
  - 128-bit unit: 2 DP flops, 4 SP flops
  - ...
- Multiple data elements are loaded into vector registers and used by the vector units.
- Some architectures can execute more than one vector instruction per cycle (e.g. Sandy Bridge).

Parallel Execution
The scalar version works on one element at a time:

    a[i] = b[i] + c[i] * d[i];

The vector version carries out the same instructions on many elements at a time:

    a[i:8] = b[i:8] + c[i:8] * d[i:8];

that is, a[i] = b[i] + c[i] * d[i], a[i+1] = b[i+1] + c[i+1] * d[i+1], ..., up to a[i+7] = b[i+7] + c[i+7] * d[i+7], all in one step.

Vector Registers
[figure: the vector register sets]

Usage from a Programmer's Point of View
Ordered from ease of use down to full programmer control:
- vectorized libraries (e.g. Intel MKL)
- fully automatic vectorization
- auto-vectorization hints (#pragma ivdep)
- SIMD feature (#pragma simd and the simd function annotation)
- vector intrinsics (e.g. _mm_add_ps())
- ASM code (e.g. addps)

Auto Vectorization
- Modern compilers analyse the loops in serial code to identify candidates for vectorization.
  - They perform loop transformations to support this identification.
- They use the instruction set of the target architecture.

Common Compiler Switches (Linux)
    -O0          disable optimization
    -O1          optimize for speed (no code-size increase)
    -O2          optimize for speed (default)
    -O3          high-level optimizer (e.g. loop unrolling)
    -fast        aggressive optimizations (e.g. -ipo, -O3, ...)
    -g           create symbols for debugging
    -S           generate assembly files
    -opt-report  optimization report generation
    -openmp      OpenMP support

Architecture-Specific Compiler Switches (Linux)
    -xHOST       optimize for the current machine
    -xSSE1       generate SSE v1 code
    -xSSE2       generate SSE v2 code (default; may also emit SSE v1 code)
    -xSSE3       generate SSE v3 code (may also emit SSE v1 and v2 code)
    -xSSE3_ATOM  generate SSE v3 code for Atom-based processors
    -xSSSE3      generate SSSE3 code (may also emit SSE v1, v2, and v3 code)
    -xSSE4.1     generate SSE4.1 code (may also emit SSSE3 and earlier SSE code)
    -xSSE4.2     generate SSE4.2 code (may also emit SSE4.1 and earlier SSE code)
    -xAVX        generate AVX code
For Intel processors use -x, for non-Intel processors use -m.

Exercise 1: Simple Vector Addition
1. Look at example_1/simple_dp.c.
2. Compile the code in the following ways:
   1. icc -openmp -O2 -no-vec (gcc -fopenmp -O2)
   2. icc -openmp -O2 (gcc -fopenmp -O2 -ftree-vectorize)
3. What is the difference in the execution times?
4. Repeat the same procedure for simple_sp.c.
5. Look again at the time differences.
6. Compare the sums between the vectorized and non-vectorized versions.
7. Can the execution times be further improved?
SIMD Vectorization Basics
- Vectorization offers a good performance improvement on floating-point-intensive code.
- Vectorized code can compute slightly different results than non-vectorized code (the non-vectorized x87 FPU computes with 80 bits internally, the vectorized code with 64 bits).
- Vectorization is only one aspect of improving performance; efficient use of the cache is also necessary.

Parallelization at No Cost
- We tell the compiler to vectorize, but that is not the whole story.
- There are cases where a compiler cannot vectorize the code.
- How can we analyse such situations? With the vectorization report.
- The compiler generates the vectorization report.
  - What was not done, and why, is the more interesting part.
  - Focus on the code paths that were not vectorized.

Exercise 2: Vectorization Report
1. Compile example_1/simple_dp.c again, this time with generation of a vectorization report:
       icc -openmp -O2 -xAVX -vec-report=2 simple_dp.c
   (gcc -fopenmp -O2 -ftree-vectorize -ftree-vectorizer-verbose=2)
2. Which positive/negative information does the vectorization report give?
3. Insert the following code after the main loop and repeat from step 1:
       for (i = 1; i < VECSIZE; i++) {
           a[i] = a[i] + a[i - 1];
       }

Vectorization Report
- Information:
  - Which code was vectorized?
  - Which code was not vectorized?
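The loop added in step 3 of the exercise, extracted here into a function for illustration (the name prefix_sum is ours), is exactly the kind of loop the report flags as non-vectorizable: each iteration reads a[i - 1], which was written by the previous iteration.

```c
/* Computes running sums in place. The read of a[i - 1] depends on the
 * write in the previous iteration (a loop-carried, read-after-write
 * dependency), so the compiler cannot run the iterations in vector lanes. */
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

Running it on {1, 2, 3, 4} yields {1, 3, 6, 10}: because each result feeds the next iteration, the elements cannot be computed independently side by side.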
- Compiler switch -vec-report<n>:
  - n=0: no diagnostic information
  - n=1: vectorized loops (default)
  - n=2: vectorized and non-vectorized loops (and why)
  - n=3: additional dependency information
  - n=4: only non-vectorized loops
  - n=5: only non-vectorized loops and dependency information
  - n=6: vectorized and non-vectorized loops with details

Loop Unrolling
Unrolling allows the compiler to reconstruct the loop for vector operations. The original loop

    for (i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
    }

unrolled by four becomes

    for (i = 0; i < N; i += 4) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

which maps onto vector operations: load b(i, ..., i + 3), load c(i, ..., i + 3), operate b * c -> a, store a(i, ..., i + 3).

Requirements for Auto Vectorization
- Countable loop.
- Single entry and single exit:

      while (i < 100) {
          a[i] = b[i] * c[i];
          if (a[i] < 0.0)  // data-dependent exit condition
              break;
          ++i;
      }  // loop not vectorized

- Straight-line code (no switch; an if is possible via masking):

      for (int i = 0; i < length; i++) {
          float s = b[i] * b[i] - 4 * a[i] * c[i];
          if (s >= 0)
              x[i] = sqrt(s);
          else
              x[i] = 0.;
      }  // loop vectorized (because of masking)

- Only the innermost loop (caution in case of loop interchange or loop collapsing).
- No function calls, except:
  - intrinsic math functions (sin, log, ...)
  - inline functions
  - elemental functions (__attribute__((vector)))

Inhibitors of Auto Vectorization
- Non-contiguous data:

      // arrays accessed with stride 2
      for (int i = 0; i < SIZE; i += 2)
          b[i] += a[i] + x[i];

      // inner loop accesses a with stride SIZE
      for (int j = 0; j < SIZE; j++)
          for (int i = 0; i < SIZE; i++)
              b[i] += a[i][j] * x[j];

      // indirect addressing of x using an index array
      for (int i = 0; i < SIZE; i += 2)
          b[i] += a[i] * x[index[i]];

  Stride 1 is best. Caution with multi-dimensional arrays: Fortran is column-major, C is row-major, so the fast index differs.

      ! F90
      do j=1,n
        do i=1,n
          a(i,j) = b(i,j) * s
        enddo
      enddo

      /* C */
      for (j = 0; j < n; j++)
          for (i = 0; i < n; i++)
              a[j][i] = b[j][i] * s;

Inhibitors of Auto Vectorization
- Inability to prove the absence of aliasing (overlapping data):
  - Runtime check possible: multi-versioned code is generated (vectorized and non-vectorized).

        void my_cp(int nx, double* a, double* b) {
            for (int i = 0; i < nx; i++)
                a[i] = b[i];
        }

  - Runtime check not possible: not vectorized.

        void my_combine(int* ioff, int nx, double* a, double* b, double* c) {
            for (int i = 0; i < nx; i++) {
                a[i] = b[i] + c[i + *ioff];
            }
        }

    This would vectorize with strict aliasing (-ansi-alias).

Inhibitors of Auto Vectorization
- Vector dependency:
  - Read after write (RAW): not vectorizable

        for (i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

  - Write after read (WAR): vectorizable

        for (i = 0; i < N - 1; i++)
            a[i] = a[i + 1] + b[i];

  - Read after read (RAR): vectorizable

        for (i = 0; i < N; i++)
            a[i] = b[i % M] + c[i];

  - Write after write (WAW): not vectorizable

        for (i = 0; i < N; i++)
            a[i % M] = b[i] + c[i];

Efficiency Aspects for Auto Vectorization: Alignment
- Data should be aligned to the width of the vector register (SSE: 16 bytes, AVX: 32 bytes).
- The compiler checks alignment at compile time where it can; otherwise it checks at runtime and peels the loop.

      void fill(char* x) {
          for (int i = 0; i < 1024; i++)
              x[i] = 1;
      }

  With runtime peeling, the possibly unaligned first elements are handled separately:

      peel = (uintptr_t)x & 0x0f;
      if (peel != 0) {
          peel = 16 - peel;
          for (i = 0; i < peel; i++)
              x[i] = 1;
      }
      /* aligned vector loop continues from x[peel] */

- Explicit definition of alignment:
  - local or global data: __attribute__((aligned(16)))
  - heap data: _mm_malloc, _mm_free
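The two explicit-alignment options above can be sketched as follows. The slide's _mm_malloc/_mm_free are Intel-specific; C11 aligned_alloc/free are substituted here so the sketch stays portable, and the helper names are ours.

```c
#include <stdint.h>
#include <stdlib.h>

/* local or global data: aligned via an attribute */
__attribute__((aligned(16))) static char buf[1024];

/* heap data: n must be a multiple of the alignment for aligned_alloc */
char *alloc_aligned16(size_t n) {
    return aligned_alloc(16, n);
}

/* same mask the runtime peeling check uses */
int aligned16(const void *p) {
    return ((uintptr_t)p & 0x0f) == 0;
}
```

For both buf and a block from alloc_aligned16(), the peeling check finds peel == 0, so the vector loop can start at element 0 with no scalar prologue.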