Serial & Vector Optimization

Lucas A. Wilson
TACC Summer Supercomputing Institute
June 16, 2014

Overview

• Optimization
  – Optimization Process
  – Measuring Improvement
• Scalar Optimization
  – Options
• Vector Optimization

How much of the code should be tuned?

Optimization is an iterative process:
• Profile the code
• Tune the most time-intensive sections of code
• Re-evaluate performance
• Repeat until performance is sufficient, or stop when no tuning time remains

For specific algorithms, derive an ideal performance objective for the architecture, but recognize its limitations as you optimize. Over successive optimization cycles the ideal expectation usually decreases while the realistic (measured) performance increases, until the two converge.

How is code performance measured? Some useful performance metrics:

1. FLOPS
• Floating-point operations per second (GFLOPS/MFLOPS)
• Important for numerically intensive code
• Example: dot product

   program prog_dot
   integer, parameter :: n=1024*1024*1024
   real*8 :: x(n), y(n), sum
   ...
   do i=1,n
      sum=sum+x(i)*y(i)   ! two FLOPs per iteration
   end do                 ! total work ~ 2 GFLOP
   end program

If execution time is 1.0 s, the program achieves 2.0 GFLOPS. How do you measure this? Ideally, CPUs can execute 4-8 FLOPs per clock period (CP), so:

   4-8 FLOPs/CP x CPU speed (CPs/sec) = peak performance
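As a concrete illustration (mine, not from the original slides), a minimal C sketch that times the dot product with gettimeofday and reports the achieved GFLOPS; the array size is an arbitrary assumption:

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/time.h>

   int main(void){
       long n = 1L<<26;                      /* hypothetical size, large enough to time */
       double *x = malloc(n*sizeof(double));
       double *y = malloc(n*sizeof(double));
       for (long i=0; i<n; i++){ x[i]=1.0; y[i]=2.0; }

       struct timeval t0, t1;
       gettimeofday(&t0, NULL);
       double sum = 0.0;
       for (long i=0; i<n; i++)
           sum += x[i]*y[i];                 /* 2 FLOPs per iteration */
       gettimeofday(&t1, NULL);

       double dt = (t1.tv_sec-t0.tv_sec) + 1e-6*(t1.tv_usec-t0.tv_usec);
       printf("sum=%g  time=%g s  rate=%g GFLOPS\n", sum, dt, 2.0*n/dt/1e9);
       free(x); free(y);
       return 0;
   }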

How is code performance measured? Some useful performance metrics:

2. MIPS
• Millions of instructions per second
• A good indicator of efficiency (high MIPS means efficient use of the functional units)
• But MIPS is not necessarily indicative of "productive" work. Code A below incurs loop overhead roughly every 9 operations; code B performs its 8 floating-point operations with no inner-loop overhead:

   ! code A
   do i1=1,n
     do i2=i1+1,n
       rdist=0.0
       do idim=1,3
         dx=x(idim,i2)-x(idim,i1)
         rdist=rdist+dx**2
       end do
     end do
   end do

   ! code B
   do i1=1,n
     do i2=i1+1,n
       rdist=(x(1,i2)-x(1,i1))**2 + &
             (x(2,i2)-x(2,i1))**2 + &
             (x(3,i2)-x(3,i1))**2
     end do
   end do

Both codes do the "same" work, but code A gives rise to more instructions.

3. External timer (stop-watch style):

   % /usr/bin/time -p ./a.out
   real 0.18
   user 0.16
   sys  0.02

CPU (user) time + system (sys) time = time spent executing your instructions + kernel time spent executing instructions on your behalf.

4. Instrument the code. To time code blocks, use system_clock in Fortran (wall-clock time) or clock in C (CPU time):

   ! Fortran
   program MAIN
   integer :: ic0, ic1, icr
   real*8 :: dt
   ......
   call system_clock(count=ic0,count_rate=icr)
   call jacobi_iter(psi0,psi,m,n)
   call system_clock(count=ic1)
   dt=dble(ic1-ic0)/dble(icr)
   ......
   end program

   /* C */
   #include <time.h>
   int main(int argc, char *argv[]){
     clock_t t0, t1; float dt;
     ......
     t0=clock();
     jacobi_iter(psi,psi0,m,n);
     t1=clock();
     dt=(float)(t1-t0)/CLOCKS_PER_SEC;
     ......
   }
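For a user/sys breakdown from inside the program (rather than via /usr/bin/time), one option is getrusage; a minimal sketch, mine rather than the slides':

   #include <stdio.h>
   #include <sys/resource.h>

   int main(void){
       double x = 0.0;
       for (long i = 1; i < 50000000; i++)   /* some work to measure */
           x += 1.0/(double)i;

       struct rusage ru;
       getrusage(RUSAGE_SELF, &ru);          /* resource usage of this process */
       printf("user %ld.%06ld s  sys %ld.%06ld s  (x=%g)\n",
              (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
              (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec, x);
       return 0;
   }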

What are the tools for measuring metrics?

Tools
• Profilers (routine level)
  – gprof (usage sketch follows this list)
• Timers (application/block level)
  – time, system_clock, clock, gettimeofday, getrusage, ...
• Analyzers (profiles, line-level timings, and traces)
  – mpiP, TAU, VTune, Paraver, ...
• Expert systems
  – PerfExpert (Burtscher, Texas State; Browne & Rane, UT & TACC)
• Hardware counters
  – PAPI
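A hypothetical gprof session (program name made up): compile with profiling instrumentation, run, then view the routine-level profile.

   % gcc -O2 -pg myprog.c -o myprog    # -pg adds gprof instrumentation
   % ./myprog                          # run; writes gmon.out
   % gprof ./myprog gmon.out | less    # flat profile + call graph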

Compiler Options

• Three important categories:
  – Optimization level
  – Interprocedural optimization
  – Architecture specification

You should always have (at least) one option from each category, as in the example below.
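A hypothetical Intel compile line combining one option from each category (the file name is made up): -O3 sets the optimization level, -ipo enables interprocedural optimization, and -xHOST specifies the architecture of the build machine.

   % ifort -O3 -ipo -xHOST prog.f90 -o prog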

Compiler Optimization Levels

   -O0  -O1  -O2  -O3  -O4  -O5  -O6
   <- debugging, quick compilation | standard optimizations | interprocedural, aggressive optimization ->

• -O0: debug
• -O1: fast compile
• -O2: standard optimizations
• -O3: more aggressive optimization, including loop interchange and high-order transforms (blocking, routine substitution); may change semantics
• Higher levels: interprocedural analysis

Interprocedural Analysis

Interprocedural analysis (IPA) allows the compiler to optimize across routine boundaries. Benefits can include: automatic procedure inlining (across modules); copy propagation, in which a flow dependency is eliminated by propagating a copy of the variable causing the (false) dependency; automatic recognition of standard libraries; and vectorization.

IPA options by compiler:

   Intel:    -ip (single file), -ipo (multi-file)
   PGI:      -Mipa, inline=, inline=threshold=
   IBM xlf:  -qipa, inline=(auto|noauto)
   Sun:      -xipo (all objects, whole program)
   SGI:      -IPA

System-Specific Optimizations

• Include architectural information on the command line; otherwise, many compilers use a "generic" instruction set.
• Vector loops can be sent to a SIMD unit, unrolled and pipelined, parallelized through OpenMP directives, or parallelized automatically.
• SIMD hardware examples: Power6, G4/G5 Velocity Engine (AltiVec); Intel/AMD MMX, SSE, SSE2, SSE3, SSE4; Cray vector units.

Compiler options for x86 on TACC machines:
• Intel: -xHOST (for Stampede and Lonestar)

Optimizations performed by the compiler

• Copy propagation
  – Removes unnecessary dependencies
  – Aids instruction-level parallelism (ILP)
  – e.g.

   x=y               x=y
   z=1.0+x     ->    z=1.0+y     (the two statements can now execute simultaneously)

(ILP = instruction-level parallelism)

• Constant folding and propagation
  – Replaces variable references with a constant to generate faster code
  – The programmer can help the compiler by declaring such constants in PARAMETER statements in Fortran or with the const qualifier in C (see the sketch below)
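A minimal sketch (mine, not from the slides) of how const helps: with n known at compile time, the compiler can fold n*n into a literal and optimize the fixed-bound loop.

   #include <stdio.h>

   /* 'const' lets the compiler treat n as a compile-time constant and
      propagate it into the loop bound and the n*n expression */
   static const int n = 64;

   int main(void){
       double a[64];
       for (int i = 0; i < n; i++)
           a[i] = (double)(n*n) + i;   /* n*n can be folded to 4096 */
       printf("%f\n", a[63]);
       return 0;
   }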

Optimizations performed by the compiler

• Strength reduction
  – Replacement of an expensive calculation with a cheaper one:

   Use x*x instead of x**2.0, and 0.5*x instead of x/2.0.

   (exp(T)-exp(2*T))  ->  tmp=exp(T); (tmp-tmp*tmp)

   Use logical shifts (x<<n or x>>n) instead of multiplying or dividing by 2**n.
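A small C sketch of these strength reductions (my example; compile with -lm). An optimizing compiler applies the same rewrites automatically:

   #include <math.h>

   /* Expensive forms */
   double slow(double x, double T, unsigned k, unsigned n){
       double a = pow(x, 2.0);                  /* x**2.0             */
       double b = x / 2.0;                      /* division           */
       double c = exp(T) - exp(2.0*T);          /* two exp() calls    */
       unsigned d = k * (unsigned)pow(2.0, n);  /* multiply by 2**n   */
       return a + b + c + (double)d;
   }

   /* Strength-reduced forms */
   double fast(double x, double T, unsigned k, unsigned n){
       double a = x * x;                        /* multiply, not pow()   */
       double b = 0.5 * x;                      /* multiply, not divide  */
       double tmp = exp(T);
       double c = tmp - tmp*tmp;                /* one exp() call        */
       unsigned d = k << n;                     /* shift, not 2**n       */
       return a + b + c + (double)d;
   }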

Optimizations performed by the compiler

• Common subexpression elimination, e.g.

   x=a/b              temp=a/b
   ......       ->    x=temp
   y=y0+a/b           ......
                      y=y0+temp

The compiler generates a temporary variable to avoid recomputing a/b.

Optimizations performed by the compiler

• Loop-invariant code motion
  – Moves loop-independent calculations out of the loop, e.g.

   for (i=0;i<n;i++)          temp=a*b;
     c[i]=a*b*x[i];     ->    for (i=0;i<n;i++)
                                c[i]=temp*x[i];

Optimizations performed by the compiler

• Function inlining
  – Replaces a call with the body of the function, removing call overhead (similar in effect to a macro):

   /* function version */
   float min(float x, float y);

   main(int argc, char* argv[])
   {
     float x1, x2, xmin;
     ...
     xmin=min(x1,x2);
     ...
   }

   float min(float x, float y)
   {
     return (x<y) ? x : y;
   }

   /* inlined (macro) version */
   #define min(x,y) ((x)<(y)?(x):(y))

   main(int argc, char* argv[])
   {
     float x1, x2, xmin;
     ...
     xmin=min(x1,x2);
     ...
   }

Loop Optimization

Loop splitting: splits a loop in half, creating two logical loops half the size of the original, and combines the operations of the two halves in a single loop of half the length, accumulating into independent partial sums.

   do i=1,n
     sum=sum+a(i)+b(i)
   end do

   do i=1,n/2
     sum0=sum0+a(i    )+b(i    )
     sum1=sum1+a(i+n/2)+b(i+n/2)
   end do
   sum=sum0+sum1

The addition pipeline is fuller, at the cost of two more streams from memory.

Loop Optimization

Loop fusion:

Loop fusion combines two or more loops of the same iteration space (loop length) into a single loop:

   for (i=0;i<n;i++)              for (i=0;i<n;i++){
     a[i]=b[i]+1.0;                 a[i]=b[i]+1.0;
   for (i=0;i<n;i++)       ->       c[i]=a[i]*0.5;
     c[i]=a[i]/2.0;               }

Using multiplication instead of division (the inverse) also keeps the addition pipeline fuller.

Loop Optimization

Loop Fission:

The opposite of loop fusion is loop distribution, or fission. Fission splits a single loop with independent operations into multiple loops:

   do i=1,n                            do i=1,n
     a(i)=b(i)+c(i)*d(i)                 a(i)=b(i)+c(i)*d(i)
     e(i)=f(i)-g(i)*h(i)+p(i)          end do
     q(i)=r(i)+s(i)             ->     do i=1,n
   end do                                e(i)=f(i)-g(i)*h(i)+p(i)
                                       end do
                                       do i=1,n
                                         q(i)=r(i)+s(i)
                                       end do

Memory Tuning

Low-stride memory access is preferred because it uses more of the data in each cache line retrieved from memory and helps trigger streaming.

High-stride access (slow; Fortran is column-major, C is row-major):

   do i=1,n                  for (j=0;j<n;j++)
     do j=1,n                  for (i=0;i<n;i++)
       a(i,j)=b(i,j)+s           a[i][j]=b[i][j]+s;
     end do
   end do

Stride-1 access (fast):

   do j=1,n                  for (i=0;i<n;i++)
     do i=1,n                  for (j=0;j<n;j++)
       a(i,j)=b(i,j)+s           a[i][j]=b[i][j]+s;
     end do
   end do

Blocked code (blocking on j, block size 8):

   do jouter=1,n,8
     do i=1,n
       do j=jouter,min(jouter+7,n)
         a(i,j)=a(i,j)+s*b(j,i)
       end do
     end do
   end do

Memory Tuning

Array blocking: blocked matrix multiplication operates on nb x nb blocks that fit in cache:

   real*8 a(n,n), b(n,n), c(n,n)
   do ii=1,n,nb
     do jj=1,n,nb
       do kk=1,n,nb
         do i=ii,min(n,ii+nb-1)
           do j=jj,min(n,jj+nb-1)
             do k=kk,min(n,kk+nb-1)
               c(i,j)=c(i,j)+a(i,k)*b(k,j)
             end do
           end do
         end do
       end do
     end do
   end do

Memory Tuning

Much more efficient implementations exist in HPC scientific libraries (ESSL, MKL, ACML, ATLAS, ...); see the sketch below.
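For instance, a minimal sketch (mine) of calling a tuned DGEMM through the standard CBLAS interface instead of hand-blocking; the header name varies by library (mkl.h with Intel MKL, cblas.h elsewhere):

   #include <cblas.h>   /* or <mkl.h> when using Intel MKL */

   /* C = 1.0*A*B + 0.0*C for square n x n matrices in column-major order */
   void matmul(int n, const double *a, const double *b, double *c){
       cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                   n, n, n,
                   1.0, a, n,    /* alpha, A, leading dimension of A */
                        b, n,    /*        B, leading dimension of B */
                   0.0, c, n);   /* beta,  C, leading dimension of C */
   }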

Memory Tuning

Loop interchange can help in the case of a DAXPY-type loop:

Dot-product form (stride-n access of a in the inner loop):

   integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
   real*8 :: x(n), a(n,n), y(n)
   ...
   do i=1,n
     s=0.0
     do j=1,n
       s=s+a(i,j)*x(j)
     end do
     y(i)=s
   end do

Interchanged, DAXPY-type form (stride-1 access of a):

   integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
   real*8 :: x(n), a(n,n), y(n)
   ...
   do j=1,n
     do i=1,n
       y(i)=y(i)+a(i,j)*x(j)
     end do
   end do

VECTORIZATION

SIMD Hardware (for Vectorization)

A SAXPY operation, z_i = a*x_i + y_i for i = 1..n, streams the vectors x, y, and z between memory, cache, and the SIMD registers.

Optimal vectorization requires concerns beyond the SIMD unit itself:
• Operations: requires elemental (independent) operations (SIMD operations)
• Registers: alignment of data on 64-, 128-, or 256-bit boundaries may be important
• Cache: access to elements in cache is fast; access from memory is much slower
• Memory: store vector elements sequentially for fastest aggregate retrieval

Vectorization

• Vectorization, or SIMD* processing, allows the same instruction to be executed on multiple data operands. (A loop over array elements often provides a constant stream of data.)

Note: streams provide vectors of length 2-16 to the instruction stream for execution in the SIMD unit, e.g. a(i)=b(i)+c(i).

*SIMD = Single Instruction, Multiple Data stream

Vectorization

• REGISTER SIZE DETERMINES the number of simultaneous operands in a SIMD operation.
• Compilers are good at vectorizing inner loops, gathering multiple data and executing a single instruction.
• Each iteration must be independent (see the sketch below).
• Inlined functions and intrinsic Short Vector Math Library (SVML) functions can provide vectorization opportunities.
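A small C illustration (mine, not from the slides) of the independence requirement: the first loop's iterations are independent and can vectorize; the second carries a dependence from one iteration to the next and cannot.

   void independent(int n, double *a, const double *b, const double *c){
       for (int i = 0; i < n; i++)
           a[i] = b[i] + c[i];     /* no iteration reads another's result: vectorizable */
   }

   void dependent(int n, double *z){
       for (int i = 1; i < n; i++)
           z[i] = z[i-1] + 1.0;    /* flow dependence on z[i-1]: not vectorizable */
   }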

Vector Compiler Options

• The compiler looks for vectorization opportunities at optimization level -O2.
• Use an architecture option (-x...) to ensure the latest vectorization hardware/instruction set is used.
• Confirm with a vector report: -vec-report=n, where n is the verbosity.
• To get the assembly code (myprog.s): -S
• For a rough estimate of the vectorization benefit, run with and without vectorization: -no-vec

Vectorization Report

• Intel vector reporting is OFF by default.
• Use vector reporting levels 4 and 5 to report on loops that were NOT vectorized.

   % ifort -xHOST -vec-report=4 prog.f90 -c

   prog.f90(31): (col. 11) remark: loop was not vectorized: existence of vector dependence.

With -vec-report=5:

   prog.f90(31): (col. 4) ... assumed ANTI dependence between z line 31 and z line 31.
   prog.f90(31): (col. 4) ... assumed FLOW dependence between z line 31 and z line 31.

Vector Add

• Each iteration can be executed independently, so the loop will vectorize. (Use -vec-report=6 for details.)
• The compiler is aware of the data size, but may be unaware of data alignment and cache residency (see the alignment sketch below).

   double a[N], b[N], c[N];   /* C */        real*8 :: a(N), b(N), c(N)   ! F90
   ...                                       ...
   for (j=0;j<N;j++)                         do j=1,N
     a[j]=b[j]+c[j];                           a(j)=b(j)+c(j)
                                             end do
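One way to remove the alignment uncertainty (my sketch, not from the slides) is to allocate the arrays on 32-byte boundaries, matching 256-bit AVX registers, with posix_memalign:

   #include <stdlib.h>

   #define N 1024

   int main(void){
       double *a, *b, *c;
       /* request 32-byte (256-bit) alignment to match AVX register width */
       if (posix_memalign((void**)&a, 32, N*sizeof(double)) ||
           posix_memalign((void**)&b, 32, N*sizeof(double)) ||
           posix_memalign((void**)&c, 32, N*sizeof(double)))
           return 1;

       for (int j = 0; j < N; j++){ b[j] = j; c[j] = 2.0*j; }

       for (int j = 0; j < N; j++)
           a[j] = b[j] + c[j];    /* independent iterations: vectorizable */

       free(a); free(b); free(c);
       return 0;
   }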

[Diagram: vmovupd instructions load cache lines from the L1 data cache into AVX registers, a vaddpd in the AVX unit adds them, and a vmovupd stores the result cache lines back.]

Assembly
• Only vector code will load multiple sets of data into registers simultaneously.
• Four 64-bit DP floating-point words fill one 256-bit register.
• Non-aligned sets consume more clock periods (CPs).

Beyond Arithmetic/Load/Store

• Predicates (more) for comparison
• Shuffle
• Broadcast
• Masked move for conditional load/store
• Insert/extract
• Permute

• Future: FMA

Intel Intrinsics

• Available with AVX
• No need to write inline assembly
• Not available for Fortran

Assembly vs Intrinsics -- functions

a(i)=b(i)+c(i)

• Assembly: actually more complex than shown below
• Intrinsics: use C bindings to call from Fortran

Assembly version:

   void inadd(double *a, double *b, double *c) {
     __asm {
       mov rax, a
       mov rbx, b
       mov rcx, c
       movupd xmm0, [rax]
       movupd xmm1, [rbx]
       addpd  xmm0, xmm1
       movupd [rcx], xmm0
     }
   }

Intrinsics version:

   void inadd(double *a, double *b, double *c){
     __m128d m0, m1;
     m0 = _mm_load_pd(a);
     m1 = _mm_load_pd(b);
     m0 = _mm_add_pd(m0, m1);
     _mm_store_pd(c, m0);
   }
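The SSE2 intrinsics above operate on two doubles at a time; a possible AVX equivalent (my sketch) processes four doubles per operation using the 256-bit types from immintrin.h, with unaligned variants since alignment isn't guaranteed:

   #include <immintrin.h>

   /* c[0..3] = a[0..3] + b[0..3] using one 256-bit AVX add */
   void inadd4(const double *a, const double *b, double *c){
       __m256d m0 = _mm256_loadu_pd(a);   /* unaligned 256-bit load  */
       __m256d m1 = _mm256_loadu_pd(b);
       m0 = _mm256_add_pd(m0, m1);        /* four DP additions at once */
       _mm256_storeu_pd(c, m0);           /* unaligned 256-bit store */
   }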

Vector Programming

• Optimal performance considerations:
  – REGISTER fill (vectorize)
    • Vector loops
    • Dependencies
  – STREAM from cache and memory
    • Memory and cache bandwidths
    • Strided access
  – Use libraries (not covered)

(DP = double-precision word, 8-byte storage)

Vector Loops

• Write loops with independent iterations.
• Such loops will "vectorize": they are unrolled or "pipelined" to allow simultaneous transfers of multiple data elements from L1 cache to a vector register (2 DP words for Westmere, 4 DP words for Sandy Bridge).
• Vector loop requirements: countable; single entry and single exit; straight-line code (no switches, though ifs may be masked); an "inner" loop; no function calls (e.g. I/O).

Vector Loops

• There are many functions for which the compiler has a vectorizable version: sin, asin, cos, acos, etc.; exp, pow, log, etc.; erf, erfc; ceil, floor, fmin, fmax, etc. (see the sketch below).
• Use the -opt-report-phase ipo_inl option for an inlining report.
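For example (my sketch), this loop calls exp() in every iteration; with an Intel compiler at -O2 and an architecture flag, the call is typically replaced by a short-vector SVML variant, so the loop still vectorizes:

   #include <math.h>

   /* y[i] = exp(x[i]); the libm call need not block vectorization,
      since the compiler can substitute a vectorized (SVML) exp */
   void vexp(int n, const double *x, double *y){
       for (int i = 0; i < n; i++)
           y[i] = exp(x[i]);
   }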

Loop Unrolling

Unrolling allows the compiler to reorder operations and group loads and stores:

   do i=1,N                    for (i=0;i<N;i++)
     a(i)=b(i)*c(i)              a[i]=b[i]*c[i];
   end do

Unrolled by 4 (load b(i,...,i+3) and c(i,...,i+3), then store a(i,...,i+3)):

   do i=1,N,4                        for (i=0;i<N;i+=4){
     a(i  ) = b(i  )*c(i  )            a[i  ] = b[i  ]*c[i  ];
     a(i+1) = b(i+1)*c(i+1)            a[i+1] = b[i+1]*c[i+1];
     a(i+2) = b(i+2)*c(i+2)            a[i+2] = b[i+2]*c[i+2];
     a(i+3) = b(i+3)*c(i+3)            a[i+3] = b[i+3]*c[i+3];
   end do                            }

Memory and Cache Bandwidths

• Many applications access elements of arrays.
• These elements may reside in the caches (L1 | L2 | L3) or in memory when the load/store instructions execute.
• Performance depends on where the data are located.

• If a stream of data (a series of accesses) comes from memory (bandwidth limited), vectorization has little or no effect on performance: the vector units are starved.
• If the majority of accesses come from memory, the application is called memory-bandwidth limited.

"Pipes" for streaming data to cores:

                   Core          Cache (LD/ST)
   Westmere        4 FLOPS/CP    2/2 DP words/CP
   Sandy Bridge    8 FLOPS/CP    4/2 DP words/CP

   Memory: ~0.4 DP word/CP (1600 DDR3, 1 channel, 3.0 GHz core)

Strided Access

• Striding decreases performance. The effect is greatest for accesses from memory, and can be minimal for the closest (L1) cache.

For L2/L3 and memory, moving unused data degrades effective bandwidth (for useful data).

For L1 Cache, don’t split vectors across cache lines. Strided Access • 1-D Array – Stride 1 access is best! • Multi-D Array Fortran: Stride 1 access on “inner” dimension is best. C/C++: Stride 1 access on “outer” dimension is best. do j = 1,n F90 C for(j=0;j

Memory Strided Add* Performance

   *do i = 1,4000000*istride, istride
      a(i) = b(i) + c(i) * sfactor
    end do

• Striding through memory reduces the effective bandwidth to the vector units, roughly by (1 - stride/8) for DP arrays.
[Plot: time (giga clock periods) vs. stride 0-8; lower is better.]

L2 Strided Add* Performance

   *do i = 1, 2048*istride, istride
      a(i) = b(i) + c(i) * sfactor
    end do

• The effective L2 bandwidth degradation is smaller.
[Plot: time (clock periods) vs. stride 1-4; lower is better.]

Vector Libraries

In some cases, an entire loop can be replaced with a single call to a vector function. For example, the loop below can be written as a call to vdInvSqrt in the Intel VML:

   for (i=0;i<n;i++)
     y[i] = 1.0/sqrt(x[i]);        ->    vdInvSqrt(n, x, y);

Similarly, a loop computing both the sine and cosine of each element,

   for (i=0;i<n;i++){
     s[i] = sin(x[i]);
     c[i] = cos(x[i]);
   }

can be replaced with a single call: vdSinCos(n, x, s, c);
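A self-contained sketch (mine) of the vdInvSqrt replacement as it might look with Intel MKL's VML; treat the header and build details (e.g. icc -mkl) as assumptions to check against your MKL version:

   #include <stdio.h>
   #include <mkl.h>     /* declares the VML functions, e.g. vdInvSqrt */

   int main(void){
       const int n = 8;
       double x[8], y[8];
       for (int i = 0; i < n; i++)
           x[i] = (double)(i + 1);

       vdInvSqrt(n, x, y);           /* y[i] = 1.0/sqrt(x[i]) for all i */

       printf("y[3] = %f\n", y[3]);  /* expect 0.5, since x[3] = 4 */
       return 0;
   }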

Aliasing

• In general, two different pointers can have the same target (can point to the same memory location). Hence, loops in routines that use pointer references may not vectorize. • In routines that pass pointers, make sure the compiler can be certain that references don’t overlap. Aliasing

   void my_cp(int nx, double *a, double *b){
     for(int i=0; i<nx; i++)
       a[i] = b[i];
   }

This will generally vectorize, because the compiler can check for overlap at runtime.

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

The compiler cannot determine that there is no aliasing, so it won't vectorize the loop.

Aliasing

• In general, two different pointers can have the same target (can point to the same memory location). Hence, loops in routines that use pointer references may not vectorize:

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

If a and c have been "equivalenced", there may be dependencies.

(Strict aliasing means that two objects of different types cannot refer to the same location in memory.)

Aliasing

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

   $ icpc -xhost -c -O2 func.cpp -vec-report=2
   func.cpp(3): (col. 4) remark: loop was not vectorized: existence of vector dependence.

Aliasing

   void my_combine(int *ioff, int nx, double *a, double *b, double *c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

Vectorization allowed with ANSI aliasing rules:

   $ icpc -xhost -c -O2 func.cpp -ansi-alias -vec-report=2
   func.cpp(3): (col. 4) remark: LOOP WAS VECTORIZED.
   func.cpp(3): (col. 4) remark: loop skipped: multiversioned.

But the option applies to the whole compilation unit; use it with caution.

Aliasing

   void my_combine(int * restrict ioff, int nx, double * restrict a,
                   double * restrict b, double * restrict c){
     for(int i=0; i<nx; i++)
       a[i] = b[i] + c[i + *ioff];
   }

Vectorization is allowed with the "restrict" declaration plus the corresponding option, specific to this routine:

   $ icpc -xhost -c -O2 func.cpp -restrict -vec-report=2
   func.cpp(3): (col. 4) remark: LOOP WAS VECTORIZED.

The __restrict keyword can be used without any option with the Intel compiler.

Vectorization Summary

• AVX is here.

• Register widths are leaping forward → be aware of SIMD power.
• Vector instruction sets are getting larger → more opportunity for handling data in SIMD fashion.
• But just because register sizes double (or quadruple) doesn't mean your performance will increase by the same amount.
