Serial & Vector Optimization

Lucas A. Wilson
TACC Summer Supercomputing Institute
June 16, 2014

Overview

• Optimization
  – Optimization Process
  – Measuring Improvement
• Scalar Optimization
  – Options
• Vector Optimization

How much of the code should be tuned?

Optimization is an iterative process:
• Profile the code
• Tune the most time-intensive sections of code
• Re-evaluate performance
• Repeat until performance is sufficient, or stop when no tuning time remains

For specific algorithms, derive an ideal performance objective for the architecture, but recognize its limitations as you optimize. Over successive optimization cycles the ideal expectation usually decreases while the realistic (measured) performance increases, until the two converge.

How is code performance measured? Some useful performance metrics:

1. FLOPS
• Floating-point operations per second (GFLOPS/MFLOPS)
• Important for numerically intensive code
• Example: dot product

   program prog_dot
   integer, parameter :: n=1024*1024*1024
   real*8 :: x(n), y(n), sum
   ...
   do i=1,n
      sum=sum+x(i)*y(i)   ! two FLOPs per iteration
   end do                 ! total work ~ 2 GFLOP
   end program

If execution time is 1.0 s, the program achieves 2.0 GFLOPS. How do you measure this? Ideally, CPUs can execute 4-8 FLOPs per clock period (CP), so:

   4-8 FLOPs/CP x CPU speed (CPs/sec) = peak performance
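As a concrete illustration (mine, not from the original slides), a minimal C sketch that times the dot product with gettimeofday and reports the achieved GFLOPS; the array size is an arbitrary assumption:

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/time.h>

   int main(void){
       long n = 1L<<26;                      /* hypothetical size, large enough to time */
       double *x = malloc(n*sizeof(double));
       double *y = malloc(n*sizeof(double));
       for (long i=0; i<n; i++){ x[i]=1.0; y[i]=2.0; }

       struct timeval t0, t1;
       gettimeofday(&t0, NULL);
       double sum = 0.0;
       for (long i=0; i<n; i++)
           sum += x[i]*y[i];                 /* 2 FLOPs per iteration */
       gettimeofday(&t1, NULL);

       double dt = (t1.tv_sec-t0.tv_sec) + 1e-6*(t1.tv_usec-t0.tv_usec);
       printf("sum=%g  time=%g s  rate=%g GFLOPS\n", sum, dt, 2.0*n/dt/1e9);
       free(x); free(y);
       return 0;
   }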

How is code performance measured? Some useful performance metrics:

2. MIPS
• Millions of instructions per second
• A good indicator of efficiency (high MIPS means efficient use of the functional units)
• But MIPS is not necessarily indicative of "productive" work. Code A below incurs loop overhead roughly every 9 operations; code B performs its 8 floating-point operations with no inner-loop overhead:

   ! code A
   do i1=1,n
     do i2=i1+1,n
       rdist=0.0
       do idim=1,3
         dx=x(idim,i2)-x(idim,i1)
         rdist=rdist+dx**2
       end do
     end do
   end do

   ! code B
   do i1=1,n
     do i2=i1+1,n
       rdist=(x(1,i2)-x(1,i1))**2 + &
             (x(2,i2)-x(2,i1))**2 + &
             (x(3,i2)-x(3,i1))**2
     end do
   end do

Both codes do the "same" work, but code A gives rise to more instructions.

3. External timer (stop-watch style):

   % /usr/bin/time -p ./a.out
   real 0.18
   user 0.16
   sys  0.02

CPU (user) time + system (sys) time = time spent executing your instructions + kernel time spent executing instructions on your behalf.

4. Instrument the code. To time code blocks, use system_clock in Fortran (wall-clock time) or clock in C (CPU time):

   ! Fortran
   program MAIN
   integer :: ic0, ic1, icr
   real*8 :: dt
   ......
   call system_clock(count=ic0,count_rate=icr)
   call jacobi_iter(psi0,psi,m,n)
   call system_clock(count=ic1)
   dt=dble(ic1-ic0)/dble(icr)
   ......
   end program

   /* C */
   #include <time.h>
   int main(int argc, char *argv[]){
     clock_t t0, t1; float dt;
     ......
     t0=clock();
     jacobi_iter(psi,psi0,m,n);
     t1=clock();
     dt=(float)(t1-t0)/CLOCKS_PER_SEC;
     ......
   }
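For a user/sys breakdown from inside the program (rather than via /usr/bin/time), one option is getrusage; a minimal sketch, mine rather than the slides':

   #include <stdio.h>
   #include <sys/resource.h>

   int main(void){
       double x = 0.0;
       for (long i = 1; i < 50000000; i++)   /* some work to measure */
           x += 1.0/(double)i;

       struct rusage ru;
       getrusage(RUSAGE_SELF, &ru);          /* resource usage of this process */
       printf("user %ld.%06ld s  sys %ld.%06ld s  (x=%g)\n",
              (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
              (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec, x);
       return 0;
   }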

What are the tools for measuring metrics?

Tools
• Profilers (routine level)
  – gprof (usage sketch follows this list)
• Timers (application/block level)
  – time, system_clock, clock, gettimeofday, getrusage, ...
• Analyzers (profiles, line-level timings, and traces)
  – mpiP, TAU, VTune, Paraver, ...
• Expert systems
  – PerfExpert (Burtscher, Texas State; Browne & Rane, UT & TACC)
• Hardware counters
  – PAPI
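A hypothetical gprof session (program name made up): compile with profiling instrumentation, run, then view the routine-level profile.

   % gcc -O2 -pg myprog.c -o myprog    # -pg adds gprof instrumentation
   % ./myprog                          # run; writes gmon.out
   % gprof ./myprog gmon.out | less    # flat profile + call graph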

Compiler Options

• Three important categories:
  – Optimization level
  – Interprocedural optimization
  – Architecture specification

You should always have (at least) one option from each category, as in the example below.
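A hypothetical Intel compile line combining one option from each category (the file name is made up): -O3 sets the optimization level, -ipo enables interprocedural optimization, and -xHOST specifies the architecture of the build machine.

   % ifort -O3 -ipo -xHOST prog.f90 -o prog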

Compiler Optimization Levels

   -O0  -O1  -O2  -O3  -O4  -O5  -O6
   <- debugging, quick compilation | standard optimizations | interprocedural, aggressive optimization ->

• -O0: debug
• -O1: fast compile
• -O2: standard optimizations
• -O3: more aggressive optimization, including loop interchange and high-order transforms (blocking, routine substitution); may change semantics
• Higher levels: interprocedural analysis

Interprocedural Analysis

Interprocedural analysis (IPA) allows the compiler to optimize across routine boundaries. Benefits can include: automatic procedure inlining (across modules); copy propagation, in which a flow dependency is eliminated by propagating a copy of the variable causing the (false) dependency; automatic recognition of standard libraries; and vectorization.

IPA options by compiler:

   Intel:    -ip (single file), -ipo (multi-file)
   PGI:      -Mipa, inline=, inline=threshold=
   IBM xlf:  -qipa, inline=(auto|noauto)
   Sun:      -xipo (all objects, whole program)
   SGI:      -IPA

System-Specific Optimizations

• Include architectural information on the command line; otherwise, many compilers use a "generic" instruction set.
• Vector loops can be sent to a SIMD unit, unrolled and pipelined, parallelized through OpenMP directives, or parallelized automatically.
• SIMD hardware examples: Power6, G4/G5 Velocity Engine (AltiVec); Intel/AMD MMX, SSE, SSE2, SSE3, SSE4; Cray vector units.

Compiler options for x86 on TACC machines:
• Intel: -xHOST (for Stampede and Lonestar)

Optimizations performed by the compiler

• Copy propagation
  – Removes unnecessary dependencies
  – Aids instruction-level parallelism (ILP)
  – e.g.

   x=y               x=y
   z=1.0+x     ->    z=1.0+y     (the two statements can now execute simultaneously)

(ILP = instruction-level parallelism)

• Constant folding and propagation
  – Replaces variable references with a constant to generate faster code
  – The programmer can help the compiler by declaring such constants in PARAMETER statements in Fortran or with the const qualifier in C (see the sketch below)
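A minimal sketch (mine, not from the slides) of how const helps: with n known at compile time, the compiler can fold n*n into a literal and optimize the fixed-bound loop.

   #include <stdio.h>

   /* 'const' lets the compiler treat n as a compile-time constant and
      propagate it into the loop bound and the n*n expression */
   static const int n = 64;

   int main(void){
       double a[64];
       for (int i = 0; i < n; i++)
           a[i] = (double)(n*n) + i;   /* n*n can be folded to 4096 */
       printf("%f\n", a[63]);
       return 0;
   }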

Optimizations performed by the compiler

• Strength reduction
  – Replacement of an expensive calculation with a cheaper one:

   Use x*x instead of x**2.0, and 0.5*x instead of x/2.0.

   (exp(T)-exp(2*T))  ->  tmp=exp(T); (tmp-tmp*tmp)

   Use logical shifts (x<<n or x>>n) instead of multiplying or dividing by 2**n.
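A small C sketch of these strength reductions (my example; compile with -lm). An optimizing compiler applies the same rewrites automatically:

   #include <math.h>

   /* Expensive forms */
   double slow(double x, double T, unsigned k, unsigned n){
       double a = pow(x, 2.0);                  /* x**2.0             */
       double b = x / 2.0;                      /* division           */
       double c = exp(T) - exp(2.0*T);          /* two exp() calls    */
       unsigned d = k * (unsigned)pow(2.0, n);  /* multiply by 2**n   */
       return a + b + c + (double)d;
   }

   /* Strength-reduced forms */
   double fast(double x, double T, unsigned k, unsigned n){
       double a = x * x;                        /* multiply, not pow()   */
       double b = 0.5 * x;                      /* multiply, not divide  */
       double tmp = exp(T);
       double c = tmp - tmp*tmp;                /* one exp() call        */
       unsigned d = k << n;                     /* shift, not 2**n       */
       return a + b + c + (double)d;
   }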

Optimizations performed by the compiler

• Common subexpression elimination, e.g.

   x=a/b              temp=a/b
   ......       ->    x=temp
   y=y0+a/b           ......
                      y=y0+temp

The compiler generates a temporary variable to avoid recomputing a/b.

Optimizations performed by the compiler

• Loop-invariant code motion
  – Moves loop-independent calculations out of the loop, e.g.

   for (i=0;i<n;i++)          temp=a*b;
     c[i]=a*b*x[i];     ->    for (i=0;i<n;i++)
                                c[i]=temp*x[i];

Optimizations performed by the compiler

• Function inlining
  – Replaces a call with the body of the function, removing call overhead (similar in effect to a macro):

   /* function version */
   float min(float x, float y);

   main(int argc, char* argv[])
   {
     float x1, x2, xmin;
     ...
     xmin=min(x1,x2);
     ...
   }

   float min(float x, float y)
   {
     return (x<y) ? x : y;
   }

   /* inlined (macro) version */
   #define min(x,y) ((x)<(y)?(x):(y))

   main(int argc, char* argv[])
   {
     float x1, x2, xmin;
     ...
     xmin=min(x1,x2);
     ...
   }

Loop Optimization

Loop splitting: splits a loop in half, creating two logical loops half the size of the original, and combines the operations of the two halves in a single loop of half the length, accumulating into independent partial sums.

   do i=1,n
     sum=sum+a(i)+b(i)
   end do

   do i=1,n/2
     sum0=sum0+a(i    )+b(i    )
     sum1=sum1+a(i+n/2)+b(i+n/2)
   end do
   sum=sum0+sum1

The addition pipeline is fuller, at the cost of two more streams from memory.

Loop Optimization

Loop fusion:

Loop fusion combines two or more loops of the same iteration space (loop length) into a single loop:

   for (i=0;i<n;i++)              for (i=0;i<n;i++){
     a[i]=b[i]+1.0;                 a[i]=b[i]+1.0;
   for (i=0;i<n;i++)       ->       c[i]=a[i]*0.5;
     c[i]=a[i]/2.0;               }

Using multiplication instead of division (the inverse) also keeps the addition pipeline fuller.

Loop Optimization

Loop Fission:

The opposite of loop fusion is loop distribution, or fission. Fission splits a single loop with independent operations into multiple loops:

   do i=1,n                            do i=1,n
     a(i)=b(i)+c(i)*d(i)                 a(i)=b(i)+c(i)*d(i)
     e(i)=f(i)-g(i)*h(i)+p(i)          end do
     q(i)=r(i)+s(i)             ->     do i=1,n
   end do                                e(i)=f(i)-g(i)*h(i)+p(i)
                                       end do
                                       do i=1,n
                                         q(i)=r(i)+s(i)
                                       end do

Memory Tuning

Low-stride memory access is preferred because it uses more of the data in each cache line retrieved from memory and helps trigger streaming.

High-stride access (slow; Fortran is column-major, C is row-major):

   do i=1,n                  for (j=0;j<n;j++)
     do j=1,n                  for (i=0;i<n;i++)
       a(i,j)=b(i,j)+s           a[i][j]=b[i][j]+s;
     end do
   end do

Stride-1 access (fast):

   do j=1,n                  for (i=0;i<n;i++)
     do i=1,n                  for (j=0;j<n;j++)
       a(i,j)=b(i,j)+s           a[i][j]=b[i][j]+s;
     end do
   end do

Blocked code (blocking on j, block size 8):

   do jouter=1,n,8
     do i=1,n
       do j=jouter,min(jouter+7,n)
         a(i,j)=a(i,j)+s*b(j,i)
       end do
     end do
   end do

Memory Tuning

Array blocking: blocked matrix multiplication operates on nb x nb blocks that fit in cache:

   real*8 a(n,n), b(n,n), c(n,n)
   do ii=1,n,nb
     do jj=1,n,nb
       do kk=1,n,nb
         do i=ii,min(n,ii+nb-1)
           do j=jj,min(n,jj+nb-1)
             do k=kk,min(n,kk+nb-1)
               c(i,j)=c(i,j)+a(i,k)*b(k,j)
             end do
           end do
         end do
       end do
     end do
   end do

Memory Tuning

Much more efficient implementations exist in HPC scientific libraries (ESSL, MKL, ACML, ATLAS, ...); see the sketch below.
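For instance, a minimal sketch (mine) of calling a tuned DGEMM through the standard CBLAS interface instead of hand-blocking; the header name varies by library (mkl.h with Intel MKL, cblas.h elsewhere):

   #include <cblas.h>   /* or <mkl.h> when using Intel MKL */

   /* C = 1.0*A*B + 0.0*C for square n x n matrices in column-major order */
   void matmul(int n, const double *a, const double *b, double *c){
       cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                   n, n, n,
                   1.0, a, n,    /* alpha, A, leading dimension of A */
                        b, n,    /*        B, leading dimension of B */
                   0.0, c, n);   /* beta,  C, leading dimension of C */
   }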

Memory Tuning

Loop interchange can help in the case of a DAXPY-type loop:

Dot-product form (stride-n access of a in the inner loop):

   integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
   real*8 :: x(n), a(n,n), y(n)
   ...
   do i=1,n
     s=0.0
     do j=1,n
       s=s+a(i,j)*x(j)
     end do
     y(i)=s
   end do

Interchanged, DAXPY-type form (stride-1 access of a):

   integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
   real*8 :: x(n), a(n,n), y(n)
   ...
   do j=1,n
     do i=1,n
       y(i)=y(i)+a(i,j)*x(j)
     end do
   end do

VECTORIZATION

SIMD Hardware (for Vectorization)

A SAXPY operation, z_i = a*x_i + y_i for i = 1..n, streams the vectors x, y, and z between memory, cache, and the SIMD registers.

Optimal vectorization requires concerns beyond the SIMD unit itself:
• Operations: requires elemental (independent) operations (SIMD operations)
• Registers: alignment of data on 64-, 128-, or 256-bit boundaries may be important
• Cache: access to elements in cache is fast; access from memory is much slower
• Memory: store vector elements sequentially for fastest aggregate retrieval

Vectorization

• Vectorization, or SIMD* processing, allows the same instruction to be executed on multiple data operands. (A loop over array elements often provides a constant stream of data.)

Note: streams provide vectors of length 2-16 to the instruction stream for execution in the SIMD unit, e.g. a(i)=b(i)+c(i).

*SIMD = Single Instruction, Multiple Data stream

Vectorization

• REGISTER SIZE DETERMINES the number of simultaneous operands in a SIMD operation.
• Compilers are good at vectorizing inner loops, gathering multiple data and executing a single instruction.
• Each iteration must be independent (see the sketch below).
• Inlined functions and intrinsic Short Vector Math Library (SVML) functions can provide vectorization opportunities.
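A small C illustration (mine, not from the slides) of the independence requirement: the first loop's iterations are independent and can vectorize; the second carries a dependence from one iteration to the next and cannot.

   void independent(int n, double *a, const double *b, const double *c){
       for (int i = 0; i < n; i++)
           a[i] = b[i] + c[i];     /* no iteration reads another's result: vectorizable */
   }

   void dependent(int n, double *z){
       for (int i = 1; i < n; i++)
           z[i] = z[i-1] + 1.0;    /* flow dependence on z[i-1]: not vectorizable */
   }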

Vector Compiler Options

• The compiler looks for vectorization opportunities at optimization level -O2.
• Use an architecture option (-x...) to ensure the latest vectorization hardware/instruction set is used.
• Confirm with a vector report: -vec-report=n, where n is the verbosity.
• To get the assembly code (myprog.s): -S
• For a rough estimate of the vectorization benefit, run with and without vectorization: -no-vec

Vectorization Report

• Intel vector reporting is OFF by default.
• Use vector reporting levels 4 and 5 to report on loops that were NOT vectorized.

   % ifort -xHOST -vec-report=4 prog.f90 -c

   prog.f90(31): (col. 11) remark: loop was not vectorized: existence of vector dependence.

With -vec-report=5:

   prog.f90(31): (col. 4) ... assumed ANTI dependence between z line 31 and z line 31.
   prog.f90(31): (col. 4) ... assumed FLOW dependence between z line 31 and z line 31.

Vector Add

• Each iteration can be executed independently, so the loop will vectorize. (Use -vec-report=6 for details.)
• The compiler is aware of the data size, but may be unaware of data alignment and cache residency (see the alignment sketch below).

   double a[N], b[N], c[N];   /* C */        real*8 :: a(N), b(N), c(N)   ! F90
   ...                                       ...
   for (j=0;j<N;j++)                         do j=1,N
     a[j]=b[j]+c[j];                           a(j)=b(j)+c(j)
                                             end do
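One way to remove the alignment uncertainty (my sketch, not from the slides) is to allocate the arrays on 32-byte boundaries, matching 256-bit AVX registers, with posix_memalign:

   #include <stdlib.h>

   #define N 1024

   int main(void){
       double *a, *b, *c;
       /* request 32-byte (256-bit) alignment to match AVX register width */
       if (posix_memalign((void**)&a, 32, N*sizeof(double)) ||
           posix_memalign((void**)&b, 32, N*sizeof(double)) ||
           posix_memalign((void**)&c, 32, N*sizeof(double)))
           return 1;

       for (int j = 0; j < N; j++){ b[j] = j; c[j] = 2.0*j; }

       for (int j = 0; j < N; j++)
           a[j] = b[j] + c[j];    /* independent iterations: vectorizable */

       free(a); free(b); free(c);
       return 0;
   }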

[Diagram: vmovupd instructions load cache lines from the L1 data cache into AVX registers, a vaddpd in the AVX unit adds them, and a vmovupd stores the result cache lines back.]

Assembly
• Only vector code will load multiple sets of data into registers simultaneously.
• Four 64-bit DP floating-point words fill one 256-bit register.
• Non-aligned sets consume more clock periods (CPs).

Beyond Arithmetic/Load/Store

• Predicates (more) for comparison
• Shuffle
• Broadcast
• Masked move for conditional load/store
• Insert/extract
• Permute

• Future: FMA

Intel Intrinsics

• Available with AVX
• No need to write inline assembly
• Not available for Fortran

Assembly vs Intrinsics -- functions

a(i)=b(i)+c(i)

• Assembly: actually more complex than shown below
• Intrinsics: use C bindings to call from Fortran

Assembly version:

   void inadd(double *a, double *b, double *c) {
     __asm {
       mov rax, a
       mov rbx, b
       mov rcx, c
       movupd xmm0, [rax]
       movupd xmm1, [rbx]
       addpd  xmm0, xmm1
       movupd [rcx], xmm0
     }
   }

Intrinsics version:

   void inadd(double *a, double *b, double *c){
     __m128d m0, m1;
     m0 = _mm_load_pd(a);
     m1 = _mm_load_pd(b);
     m0 = _mm_add_pd(m0, m1);
     _mm_store_pd(c, m0);
   }
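The SSE2 intrinsics above operate on two doubles at a time; a possible AVX equivalent (my sketch) processes four doubles per operation using the 256-bit types from immintrin.h, with unaligned variants since alignment isn't guaranteed:

   #include <immintrin.h>

   /* c[0..3] = a[0..3] + b[0..3] using one 256-bit AVX add */
   void inadd4(const double *a, const double *b, double *c){
       __m256d m0 = _mm256_loadu_pd(a);   /* unaligned 256-bit load  */
       __m256d m1 = _mm256_loadu_pd(b);
       m0 = _mm256_add_pd(m0, m1);        /* four DP additions at once */
       _mm256_storeu_pd(c, m0);           /* unaligned 256-bit store */
   }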

Vector Programming

• Optimal performance considerations:
  – REGISTER fill (vectorize)
    • Vector loops
    • Dependencies
  – STREAM from cache and memory
    • Memory and cache bandwidths
    • Strided access
  – Use libraries (not covered)

(DP = double-precision word, 8-byte storage)

Vector Loops

• Write loops with independent iterations.
• Such loops will "vectorize": they are unrolled or "pipelined" to allow simultaneous transfers of multiple data elements from L1 cache to a vector register (2 DP words for Westmere, 4 DP words for Sandy Bridge).
• Vector loop requirements: countable; single entry and single exit; straight-line code (no switches, though ifs may be masked); an "inner" loop; no function calls (e.g. I/O).

Vector Loops

• There are many functions for which the compiler has a vectorizable version: sin, asin, cos, acos, etc.; exp, pow, log, etc.; erf, erfc; ceil, floor, fmin, fmax, etc. (see the sketch below).
• Use the -opt-report-phase ipo_inl option for an inlining report.
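For example (my sketch), this loop calls exp() in every iteration; with an Intel compiler at -O2 and an architecture flag, the call is typically replaced by a short-vector SVML variant, so the loop still vectorizes:

   #include <math.h>

   /* y[i] = exp(x[i]); the libm call need not block vectorization,
      since the compiler can substitute a vectorized (SVML) exp */
   void vexp(int n, const double *x, double *y){
       for (int i = 0; i < n; i++)
           y[i] = exp(x[i]);
   }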

Loop Unrolling

Unrolling allows the compiler to reorder operations and group loads and stores:

   do i=1,N                    for (i=0;i<N;i++)
     a(i)=b(i)*c(i)              a[i]=b[i]*c[i];
   end do

Unrolled by 4 (load b(i,...,i+3) and c(i,...,i+3), then store a(i,...,i+3)):

   do i=1,N,4                        for (i=0;i<N;i+=4){
     a(i  ) = b(i  )*c(i  )            a[i  ] = b[i  ]*c[i  ];
     a(i+1) = b(i+1)*c(i+1)            a[i+1] = b[i+1]*c[i+1];
     a(i+2) = b(i+2)*c(i+2)            a[i+2] = b[i+2]*c[i+2];
     a(i+3) = b(i+3)*c(i+3)            a[i+3] = b[i+3]*c[i+3];
   end do                            }

Memory and Cache Bandwidths

• Many applications access elements of arrays.
• These elements may reside in the caches (L1 | L2 | L3) or in memory when the load/store instructions execute.
• Performance depends on where the data are located.

• If a stream of data (a series of accesses) comes from memory (bandwidth limited), vectorization has little or no effect on performance: the vector units are starved.
• If the majority of accesses come from memory, the application is called memory-bandwidth limited.

"Pipes" for streaming data to cores:

                   Core          Cache (LD/ST)
   Westmere        4 FLOPS/CP    2/2 DP words/CP
   Sandy Bridge    8 FLOPS/CP    4/2 DP words/CP

   Memory: ~0.4 DP word/CP (1600 DDR3, 1 channel, 3.0 GHz core)

Strided Access

• Striding decreases performance. The effect is greatest for accesses from memory, and can be minimal for the closest (L1) cache.

For L2/L3 and memory, moving unused data degrades effective bandwidth (for useful data).

For L1 Cache, don’t split vectors across cache lines. Strided Access • 1-D Array – Stride 1 access is best! • Multi-D Array Fortran: Stride 1 access on “inner” dimension is best. C/C++: Stride 1 access on “outer” dimension is best. do j = 1,n F90 C for(j=0;j

Memory Strided Add* Performance

   *do i = 1,4000000*istride, istride
      a(i) = b(i) + c(i) * sfactor
    end do

• Striding through memory reduces the effective bandwidth to the vector units, roughly by (1 - stride/8) for DP arrays.
[Plot: time (giga clock periods) vs. stride 0-8; lower is better.]

L2 Strided Add* Performance

   *do i = 1, 2048*istride, istride
      a(i) = b(i) + c(i) * sfactor
    end do

• The effective L2 bandwidth degradation is smaller.
[Plot: time (clock periods) vs. stride 1-4; lower is better.]

Vector Libraries

In some cases, an entire loop can be replaced with a single call to a vector function. For example, the loop below can be written as a call to vdInvSqrt in the Intel VML:

   for (i=0;i<n;i++)
     y[i] = 1.0/sqrt(x[i]);        ->    vdInvSqrt(n, x, y);

Similarly, a loop computing both the sine and cosine of each element,

   for (i=0;i<n;i++){
     s[i] = sin(x[i]);
     c[i] = cos(x[i]);
   }

can be replaced with a single call: vdSinCos(n, x, s, c);
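A self-contained sketch (mine) of the vdInvSqrt replacement as it might look with Intel MKL's VML; treat the header and build details (e.g. icc -mkl) as assumptions to check against your MKL version:

   #include <stdio.h>
   #include <mkl.h>     /* declares the VML functions, e.g. vdInvSqrt */

   int main(void){
       const int n = 8;
       double x[8], y[8];
       for (int i = 0; i < n; i++)
           x[i] = (double)(i + 1);

       vdInvSqrt(n, x, y);           /* y[i] = 1.0/sqrt(x[i]) for all i */

       printf("y[3] = %f\n", y[3]);  /* expect 0.5, since x[3] = 4 */
       return 0;
   }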

Aliasing

• In general, two different pointers can have the same target (can point to the same memory location). Hence, loops in routines that use pointer references may not vectorize. • In routines that pass pointers, make sure the compiler can be certain that references don’t overlap. Aliasing

   void my_cp(int nx, double *a, double *b){
     for(int i=0; i<nx; i++)
       a[i] = b[i];
   }

This will generally vectorize, because the compiler can check for overlap at runtime.

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

The compiler cannot determine that there is no aliasing, so it won't vectorize the loop.

Aliasing

• In general, two different pointers can have the same target (can point to the same memory location). Hence, loops in routines that use pointer references may not vectorize:

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

If a and c have been "equivalenced", there may be dependencies.

(Strict aliasing means that two objects of different types cannot refer to the same location in memory.)

Aliasing

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

   $ icpc -xhost -c -O2 func.cpp -vec-report=2
   func.cpp(3): (col. 4) remark: loop was not vectorized: existence of vector dependence.

Aliasing

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

Vectorization allowed with ANSI aliasing rules:

   $ icpc -xhost -c -O2 func.cpp -ansi-alias -vec-report=2
   func.cpp(3): (col. 4) remark: LOOP WAS VECTORIZED.
   func.cpp(3): (col. 4) remark: loop skipped: multiversioned.

But the option applies to the whole compilation unit; use it with caution.

Aliasing

   void my_combine(int * restrict ioff, int nx, double * restrict a,
                   double * restrict b, double * restrict c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

Vectorization is allowed with the "restrict" declaration plus the corresponding option, specific to this routine:

   $ icpc -xhost -c -O2 func.cpp -restrict -vec-report=2
   func.cpp(3): (col. 4) remark: LOOP WAS VECTORIZED.

The __restrict keyword can be used without any option with the Intel compiler.

Vectorization Summary

• AVX is here.

• Register widths are leaping forward → be aware of SIMD power.
• Vector instruction sets are getting larger → more opportunity for handling data in SIMD fashion.
• But just because register sizes double (or quadruple) doesn't mean your performance will increase by the same amount.
