SIMD: CilkPlus and OpenMP
Kent Milfeld, Georg Zitzlsberger, Michael Klemm, Carlos Rosales
ISC15 The Road to Application Performance on Intel Xeon Phi July 16, 2015
1 Single Instruction Multiple Data (SIMD) Data Registers Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives
SIMD
2 Single Instruction Multiple Data (SIMD) Data Registers Playground Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives
SIMD
3 What to consider about SIMD • Know as Vectorization by scientific community. • Speed Kills – It was the speed of microprocessors that killed the Cray vector story in the 90’s. – We a rediscovering how to use vectors. – Microprocessor vectors were 2DP long for many years. • We live in a parallel universe – It’s not just about parallel SIMD, we also live in a silky environment of thread tasks and MPI tasks.
4 What to make of this? SIMD Lanes
• SIMD registers are getting wider now, but there are other factors to consider. – Caches: Maybe non-coherent, possible 9 layers of memory later – Alignment: Avoid cache-to-register hickups – Prefetching: MIC needs user intervention here – Data Arrangement: AoS vs SoA, gather, scatter, permutes – Masking: Allows conditional execution– but you get less bang for your buck. – Striding: 1 is best
5 Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives SIMD
6 Intel CilkPlus • # pragma SIMD – Force SIMD operation on loops • Array Notation à data arranged appropriate for SIMD a[index:count] start at index, end count-start-1 (also index:count:stride) a[i:n] = b[i-1:n]+b[i+1] (think single line SIMD, heap arrays) (optimize away subarrays) e[:] = f[:] +g[:] (entire array, heap or stack) r[:] = s[i[:]], r[i[:]=s[:] (gather, scatter) func(a[:]) (scalar/simd-enabled=by element/SIMD) if(5==a[:]) result[:]=0 (works with conditionals) • SIMD Enabled Functions: element àvector function
7 SIMD pragma – Instructs the compiler to create SIMD operations for iterations of the loops. – Reason for vectorization failure: too many pointers, complicated indexing … (ivdep is a hint)
Without pragma vec-report=2 was helpful: remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
void do2(double a[n][n], double b[n][n], int end){ ivdep and vector always #pragma SIMD don’t work here.
for (int i=0 ; i 8 SIMD Enabled Functions • SIMDizable Functions: double fun1(double r, double s, double t); double fun2(double r, double s, double t); … void driver (double R[N], double S[N], double T[N]){ for (int i=0; i 9 courier SIMD Enabled Functions • Can be invoked with scalar or vector arguments. • Use array notation with SIMD version (optimized for vector width) ** __declspec(vector) double fun1(double r, double s, double t); __declspec(vector) double fun2(double r, double s, double t); // Function is for an element operation; … // but in parallel context (CilkPlus) provides an array for a vector version. void driver (double R[N], double S[N], double T[N]){ A[:] = fun1(R[:],S[:],T[:]); B[:] = fun2(R[:],S[:],T[:]); } ** or __attribute((vector)) 10 SIMD Enabled Functions • Vector attribute/declspec decorations generate scalar and SIMD version with: Syntax: __attribute__((vector (clauses))) function_declaration __declspec( vector(clauses)) function_declaration Clauses: vectorlength(n) Vector Length linear(list : step) scalar list variables are incremented by step; uniform(list) (same) values are broadcast to all iterations [no]mask generate a masked vector version 11 SIMD and Threads • Cilk’s “los tres amigos” – cilk_for – cilk_spawn – cilk_sync • Cilk loops are SIMDizes, and invoke multiple threads. • Functions use SIMD form in CilkPlus loops. 12 Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives SIMD 13 OpenMP SIMD • First appeared in OpenMP 4.0 2013 • Appears as – SIMD – SIMD do/for – declare SIMD • SIMD refinements in OpenMP 4.1, ~2015. 14 SIMD • OMP Directive SIMDizes loop Syntax (Fortran): !$OMP SIMD [clause[[,] clause] ... ] #pragma omp SIMD [clause[[,] clause] ... ] Clauses: safelen(n) number (n) of interations in a SIMD chunk linear(list : step) scalar list variables are incremented by step; loop iterations incremented by (vector length)*step aligned(list :n) uses aligned (by n bytes) move on listed variables collapse(n), lastprivate(list), private(list), reduction(operator: list) 15 SIMD + Worksharing Loop • OMP Directive Workshares and SIMDizes loop Syntax: !$OMP DO SIMD [clause[[,] clause] ... ] #pragma omp SIMD [clause[[,] clause] ... ] Clauses: any DO clause data sharing attributes, nowait, etc. any SIMD clause Creates SIMD loop which uses chunks containing increments of the vector size. Remaining iterations are distributed “consistently”. No scheduling details are give. 16 SIMD Enabled Functions • OMP Directive generates scalar and SIMD version with: Syntax (Fortran): $OMP DECLARE SIMD(routine-name) [clause[[,] clause]... ] Clauses: aligned(list:n) uses aligned (by n bytes) moves on listed variables [not]inbranch must always be called in conditional [or never in] linear(list:step) scalar list variables are incremented by step; loop iterations incremented by (vector length)*step simdlen(n) vector length uniform(list) listed variables have invariant value 17 Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop SIMD CilkPlus OpenMP SIMD mapping Alignment & Elemental Functions Alignment Beyond Present Directives SIMD 18 CilkPlus à OpenMP Mapping CilkPlus OpenMP • SIMD (on loop) • SIMD (on loop) – Reduction – Reduction – Vector length – Vector length – Linear (increment) – Linear (increment) – Private, Lastprivate – Private, Lastprivate e.g (fortran) !dir$ simd reduction(+:mysum) linear(j:1) vectorlength(4) do…; mysum=mysum+j; j=fun(); enddo !$omp simd reduction(+:mysum) linear(j:1) safelen(4) do…; mysum=mysum+j; j=fun(); enddo 19 CilkPlus OpenMP SIMD Differences CilkPlus OpenMP • SIMD (on loop) • SIMD (on loop) – firstprivate – aligned(var_list,bsize) – vectorlengthfor – collapse – [no]vectremainder – [no]assert • #pragma cilk grainsize – schedule(kind, chunk) 20 CilkPlus Enable OMP Declare Differences CilkPlus OpenMP • vector clauses • declare simd – vectorlength – simdlen – linear – linear – uniform – uniform – [no]mask – inbranch/notinbranch – processor(cpuid) – aligned – vectorlengthfor 21 Alignment • Memory Alignment – Allocation alignment • C/C++ – dynamic: memalloc routines – static: __declspec(align(64)) declaration • Fortran – dynamic: !dir$ attributes align: 64 :: var – static: !dir$ attributes align: 64 :: var – compiler: -align array64byte 22 Alignment (CilkPlus) • Memory Alignment – Access Description • C/C++ – loop: #pragma vector aligned (all variables) – Cilk_for vars: _assume_aligned(var,size) – pointers attribute: __attribute__((align_value (size))) • Fortran – dynamic: !dir$ attributes align: 64 :: var (allocatable var) – static: !dir$ attributes align: 64 :: var (stack var) – compiler: -align array64byte 23 Alignment (OpenMP) • Memory Alignment – No API functions, no separate construct – Declaration SIMD / SIMD have aligned clauses 24 Prefetch • Prefetch distance can be controlled via compiler options and pragmas #pragma prefetch var:hint:distance • inner loops • may be important to turn off prefetch • available for Fortran 25 What do developers need to control at the directive level? • Caches: locality of data • Alignment: Avoid cache-to-register hickups • Prefetching: hiding latency (not available with OMP) • Rearranging data: characterizing data structure (kokkos) • Masking: Allows conditional execution– but you get less bang for your buck. • Striding: characterized data structure, (restrict) 26 Single Instruction Multiple Data (SIMD) Evolution of SIMD Hardware Data Registers Instruction Set Overview, AVX Intel Cilk Plus SIMD Directive Declaration Examples OpenMP SIMD Directive Declaration SIMD loop Alignment Beyond Present Directives SIMD 27 Vector Compiler Options • Compiler will look for vectorization opportunities at optimization – O2 level. • Use architecture option: –x 28 Vector Compiler Options (cont.) • Alignment options here. • Inlining here. 29 Vec. Programming Alignment Alignment • Alignment of data and data structures can affect performance. For AVX, alignment to 32byte boundaries (4 Double Precision words) allows a single reference to a cache line for moving 4DP words into the registers (SIMD support). For MIC, alignment is 64 bytes. • Compilers are great at detecting alignment and peeling off a few iterations before working on a sustained alignment within a loop body. (Aligned data can use the more efficient movdqa instruction, rather than the less efficient movdqu instruction.) 30 Vec. Programming Alignment Alignment Load 4 DP Words Load 4 DP Words 32-byte Load 4 DP Words Aligned Load 4 DP Words registers Single Cache access for 4 DP Words Cache Line 0 Cache Line 1 Cache Line 2 Cache Line 3 Load 4 DP Words Load 4 DP Words Non- Load 4 DP Words Aligned Load 4 DP Words registers Across Cache Line access for 4 DP Words Cache Line 0 Cache Line 1 Cache Line 2 Cache Line 3 Cache Line 4 31 Compiler Directives Alignment Vector Align • Unaligned accesses are slower. – Non-sequen al across “bus”. – Cross cache line boundary. • #pragma vector aligned or !DEC$ vector aligned for(i=0; i 32 Vec. Programming Inlining Inlining • Functions within a loop prevent vectorization. – Inlining can often overcome this problem. e.g. file main.c file funs.c ... for(i=0; i 33 Vec. Programming Inlining Inlining file main.c file funs.c ... for(i=0; i Inlining Vectorization Time (ms) not inlined not vectorized 1.55 inlined not vectorized 0.44 inlined vectorized 0.056 34 SIMD END • Questions • Discussion • … 35 some Slides from 2012 Tutorial (kfm) 36 SIMD Hardware Vectors SIMD Hardware (for Vectoriza on) SAXPY Opera on Registers Memory ⎡z1⎤ ⎡x1⎤ ⎡ y1⎤ Cache ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ z2 x2 y2 ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢z3⎥ = a*⎢x3⎥ + ⎢ y3⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ! ! ! ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ a y1, y2, y3, … yn x1, x2, x3, … xn ⎢ n⎥ ⎢ n⎥ ⎢ n⎥ ⎣z ⎦ ⎣x ⎦ ⎣ y ⎦ z1, z2, z3, … zn • Op mal Vectoriza on requires concerns beyond the SIMD Unit! – Opera ons: Requires elemental (independent) opera ons (SIMD opera ons) – Registers: Alignment of data on 64, 128, or 256 bit boundaries might be important – Cache: Access to elements in caches is fast, access from memory is much slower – Memory: Store vector elements sequen ally for fastest aggregate retrieval 37 SIMD Hardware Vectors SIMD Processing -- Vectoriza on • Vectoriza on, or SIMD* processing, allows a simultaneous, independent instruc on on mul ple data operands with a single instruc on. ( Loops over array elements o en provide a constant stream of data.) Note: Streams provide Instruc on Vectors of length 2-16 for execu on in the SIMD … unit. *SIMD= Single Instruc on Mul ple Data Stream 38 Vectorization Example Add in Hardware Vector Add -- AVX vmovupd L1 Data Cache vmovupd Cache Line 1A Cache Line 2A Cache Lin e 128A vaddpd + AVX … Unit = Cache Line 1D Cache Line 2D Cache Lin e 128D vmovupd Assembly • Only vector code will load mul ple sets of Instr. Instruc ons data into registers simultaneously. Cache Line • Non-aligned sets do consume more 4 64-bit DP FP Clock Periods (CPs). 1 256-bit Register 39 Vectorization Example KNC Vectors Vectorization ( on KNC) void mult(double *a, double *b, double*c, int n){ for(int i=0; i subroutine mult(a, b, c, n); real*8 :: a(n),b(n),c(n) do i=1,n; a(i)=b(i)+c(i); enddo end subroutine Scalar Instruc ons Vector Instruc on 8 instruc ons, 8 element pairs 1 instruc on, 8 element pairs b0 b1 b2 b3 b4 b5 b6 b7 c0 c1 c2 c3 c4 c5 c6 c7 b0 b1 b2 b3 b4 b5 b6 b7 c0 c1 c2 c3 c4 c5 c6 c7 add vaddpd a0 a0 a1 a2 a3 a4 a5 a6 a7 add a2 … add a7 40 Compiler Directives Compiler Directives: Hints and Coercion alloc_section distribute_point inline, noinline, and forceinline ivdep loop_count memref_control novector optimize optimization_level prefetch/noprefetch simd unroll/nounroll unroll_and_jam/nounroll_and_jam vector 41