
Serial & Vector Optimization
Lucas A. Wilson
TACC Summer Supercomputing Institute
June 16, 2014

Overview
• Optimization
  – Optimization Process
  – Measuring Improvement
• Scalar Optimization
  – Compiler Options
• Vector Optimization

How much of the code should be tuned?
Optimization is an iterative process:
• Profile the code
• Tune the most time-intensive portions of the code
• Re-evaluate and repeat
[Flowchart: profile, tune the most time-intensive section, re-evaluate; stop when the performance increase is sufficient or the available time goes to zero.]

For specific algorithms, derive an ideal architectural objective, but recognize its limitations as you optimize. As you optimize:
• the ideal expectation usually decreases,
• while the realistic (measured) performance increases.
[Plot: performance vs. optimization cycles, comparing ideal expectations with realistic (measured) performance.]

How is code performance measured? Some useful performance metrics:

1. FLOPS
• Floating-point operations per second (GFLOPS/MFLOPS)
• Important for numerically intensive code
• Example: dot product

    program prog_dot
      integer, parameter :: n=1024*1024*1024
      real*8 :: x(n), y(n), sum
      ...
      do i=1,n
        sum=sum+x(i)*y(i)   ! two FLOPs per iteration
      end do                ! total FLOPs = 2 G
    end program

If the execution time is 1.0 s, the program achieves 2.0 GFLOPS. How do you then measure this?
Ideally, CPUs can execute 4-8 FLOPs per clock period (CP):
4-8 FLOPs/CP x CPU speed (Hz, CPs/sec) = peak performance

2. MIPS
• Millions of instructions per second
• Good indicator of efficiency (high MIPS means efficient use of the functional units)
• But MIPS is not necessarily indicative of "productive" work

Code A (every 9 operations there is loop overhead):

    do i1=1,n
      do i2=i1+1,n
        rdist=0.0
        do idim=1,3
          dx=x(idim,i2)-x(idim,i1)
          rdist=rdist+dx**2
        end do
        ...
      end do
    end do

Code B (8 floating-point operations, no inner-loop overhead):

    do i1=1,n
      do i2=i1+1,n
        rdist=(x(1,i2)-x(1,i1))**2 + &
              (x(2,i2)-x(2,i1))**2 + &
              (x(3,i2)-x(3,i1))**2
        ...
      end do
    end do

Both codes do the "same" work, but code A gives rise to more instructions.

3. External timer (time, stop-watch)
4. Instrumenting the code

CPU (user) time is the time spent executing your instructions; system (sys) time is the kernel time spent executing instructions on your behalf. Wall-clock timers report the total elapsed time.

Application times:

    % /usr/bin/time -p ./a.out
    real 0.18
    user 0.16
    sys  0.02

To measure code-block times, use system_clock in Fortran and clock in C:

    program MAIN
      integer :: ic0, ic1, icr
      real*8 :: dt
      ...
      call system_clock(count=ic0,count_rate=icr)
      call jacobi_iter(psi0,psi,m,n)
      call system_clock(count=ic1)
      dt=dble(ic1-ic0)/dble(icr)
      ...
    end program

    #include <time.h>
    int main(int argc, char* argv[]){
      clock_t t0, t1;
      float dt;
      ...
      t0=clock();
      jacobi_iter(psi,psi0,m,n);
      t1=clock();
      dt=(float)(t1-t0)/CLOCKS_PER_SEC;
      ...
    }

What are the tools for measuring these metrics?
Tools
• Profilers (routine level) – gprof
• Timers (application/block) – time, system_clock, clock, gettimeofday, getrusage, ... (a gettimeofday sketch follows this list)
• Analyzers (profile, line-level timings, and trace) – mpiP, TAU, VTune, Paraver, ...
• Expert system – PerfExpert (Burtscher, Texas State; Browne & Rane, UT & TACC)
• Hardware counters – PAPI
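The timers listed above include gettimeofday, which is not shown in code on the slides. Below is a minimal POSIX C sketch of wall-clock timing with gettimeofday; the do_work routine is a hypothetical stand-in for the code block being measured.

    #include <stdio.h>
    #include <sys/time.h>

    /* hypothetical stand-in for the code block being timed */
    static void do_work(void)
    {
        volatile double s = 0.0;
        for (int i = 0; i < 10000000; i++) s += i * 0.5;
    }

    int main(void)
    {
        struct timeval t0, t1;
        double dt;

        gettimeofday(&t0, NULL);   /* wall-clock time before the block */
        do_work();
        gettimeofday(&t1, NULL);   /* wall-clock time after the block  */

        /* elapsed wall-clock seconds: whole seconds plus microseconds */
        dt = (double)(t1.tv_sec - t0.tv_sec)
           + (double)(t1.tv_usec - t0.tv_usec) / 1.0e6;

        printf("elapsed: %f s\n", dt);
        return 0;
    }

Unlike clock(), which accumulates CPU time, gettimeofday measures elapsed wall-clock time, so it also captures time the process spends waiting.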
Compiler Options
• Three important categories:
  – Optimization level
  – Interprocedural optimization
  – Architecture specification
You should always have (at least) one option from each category!

Compiler Optimization Levels
[Diagram: -O, -O1, -O2, -O3, -O4, -O5, -O6 span a range from debugging and quick compilations, through standard optimizations, to interprocedural optimization and aggressive optimizations such as loop unrolling, loop interchanging, semantics changes, and high-order transforms (blocking, routine substitution).]
• -O0: Debug
• -O1: Fast compile
• -O2: Standard
• -O3: More aggressive

Interprocedural Analysis
Interprocedural analysis (IPA) allows compilers to optimize across routine boundaries. Benefits can include automatic procedure inlining (across modules); copy propagation, in which a flow dependency is eliminated by propagating a copy of the variable causing the (false) dependency; and vectorization. IPA can also provide:
• automatic recognition of standard libraries
• procedure inlining
• whole-program alias analysis
• pointer analysis

IPA compiler flags:
• Intel:   -ip (single file), -ipo (multi-file)
• PGI:     -Mipa
• IBM xlf: -qipa, inline=(auto|noauto), inline=<fun._name>, inline=threshold=<num>
• Sun:     -xipo (all objects)
• SGI:     -IPA

System Specific Optimizations
• Include architectural information on the command line; otherwise, many compilers use a "generic" instruction set.
Vector loops:
• Can be sent to a SIMD unit
• Can be unrolled and pipelined
• Can be parallelized through OpenMP directives
• Can be automatically parallelized
SIMD/vector hardware examples: Power6, G4/G5 Velocity Engine (AltiVec); Intel/AMD MMX, SSE, SSE2, SSE3, SSE4; Cray vector units.
Compiler options for x86 on TACC machines:
• Intel: -xHOST (for Stampede and Lonestar)
A sketch of a simple loop a compiler can map onto the SIMD unit follows.
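Below is a minimal C sketch of the kind of loop a compiler can typically auto-vectorize when built with an architecture option such as Intel's -xHOST: every iteration is independent and the arrays are accessed with unit stride. The function name, array names, problem size, and the restrict qualifiers (which assert that the arrays do not overlap, removing a false dependency that would block vectorization) are illustrative assumptions, not from the original slides.

    #include <stdlib.h>

    /* Unit-stride triad loop; with no loop-carried dependence the
       compiler is free to issue SIMD (SSE/AVX) instructions. */
    void triad(int n, double * restrict a,
               const double * restrict b,
               const double * restrict c, double s)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    int main(void)
    {
        int n = 1 << 20;                      /* illustrative problem size */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        for (int i = 0; i < n; i++) { b[i] = i; c[i] = 2.0 * i; }

        triad(n, a, b, c, 0.5);               /* candidate for auto-vectorization */

        free(a); free(b); free(c);
        return 0;
    }

Most compilers can report whether such a loop was vectorized; checking that report is a quick way to confirm the architecture flag is taking effect.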
Optimizations performed by the compiler

• Copy propagation
  – Involves getting rid of unnecessary dependencies
  – Aids in instruction-level parallelism (ILP)
  – e.g.

        x=y              x=y
        z=1.0+x    -->   z=1.0+y

    These two statements can now be executed simultaneously.

• Constant folding
  – Involves replacement of variable references with a constant to generate faster code
  – The programmer can help the compiler by declaring such constants in PARAMETER statements in Fortran or with the const modifier in C

• Strength reduction
  – Replacement of an expensive calculation with a cheaper one:
    Use x*x instead of x**2.0
    Use 0.5*x instead of x/2.0
    Use tmp=exp(T); (tmp-tmp*tmp) instead of (exp(T)-exp(2T))
    Use n logical shifts (<< or >>) instead of 2**n

• Common subexpression elimination
  – The compiler generates a temporary variable to save recalculating a/b:

        x=a/b                temp=a/b
        ...           -->    x=temp
        y=y0+a/b             ...
                             y=y0+temp

• Loop-invariant code motion
  – Involves moving loop-independent calculations out of the loop, e.g.:

        for (i=0;i<n;i++) {          temp=a*b;
          x[i]=y[i]+a*b;       -->   for (i=0;i<n;i++) {
          last=x[n-1];                 x[i]=y[i]+temp;
        }                            }
                                     last=x[n-1];

• Induction variable simplification
  – Replacement of complicated expressions using induction variables with simpler ones, e.g.:

        for (i=0;i<n;i++) {          j=5;
          j=i*2+5;             -->   for (i=0;i<n;i++) {
          ...                          ...
        }                              j+=2;
                                     }

Example: use of a macro instead of a function

Use of a function:

    float min(float x, float y);

    int main(int argc, char* argv[])
    {
      float x1, x2, xmin;
      ...
      xmin=min(x1,x2);
      ...
    }

    float min(float x, float y)
    {
      return (x<y)?x:y;
    }

Use of a macro:

    #define min(x,y) ((x)<(y)?(x):(y))

    int main(int argc, char* argv[])
    {
      float x1, x2, xmin;
      ...
      xmin=min(x1,x2);
      ...
    }

Example: procedure inlining

Function dist is called niter times:

    program MAIN
      integer, parameter :: ndim=2, niter=10000000
      real*8 :: x(ndim), x0(ndim), r
      integer :: i, j
      ...
      do i=1,niter
        ...
        r=dist(x,x0,ndim)
        ...
      end do
      ...
    end program

    real*8 function dist(x,x0,n)
      integer :: j, n
      real*8 :: x0(n), x(n), r
      r=0.0
      do j=1,n
        r=r+(x(j)-x0(j))**2
      end do
      dist=r
    end function

Function dist is expanded inline inside the loop:

    program MAIN
      integer, parameter :: ndim=2, niter=10000000
      real*8 :: x(ndim), x0(ndim), r
      integer :: i, j
      ...
      do i=1,niter
        ...
        r=0.0
        do j=1,ndim
          r=r+(x(j)-x0(j))* &
              (x(j)-x0(j))
        end do
        ...
      end do
      ...
    end program

Loop Optimization

Loop bisection: splits the iteration space in half and accumulates both halves, with separate partial sums, in a single loop of half the length. The addition pipeline is fuller, and two more streams come from memory:

    do i=1,n                       do i=1,n/2
      sum=sum+a(i)+b(i)      -->     sum0=sum0+a(i    )+b(i    )
    end do                           sum1=sum1+a(i+n/2)+b(i+n/2)
                                   end do
                                   sum=sum0+sum1

Loop fusion: combines two or more loops over the same iteration space (loop length) into a single loop:

    for (i=0;i<n;i++){             for (i=0;i<n;i++){
      a[i]=x[i]+y[i];        -->     a[i]=x[i]+y[i];
    }                                b[i]=1.0/x[i]+z[i];
    for (i=0;i<n;i++){             }
      b[i]=1.0/x[i]+z[i];
    }

Only n memory accesses are needed for the x array, and five streams are created. The costly division (at least 30 CP) may not be pipelined; with a multiplication instead of the inverse, the addition pipeline is fuller.

Loop fission: the opposite of loop fusion is loop distribution, or fission. Fission splits a single loop with independent operations into multiple loops:

    do i=1,n                       do i=1,n
      a(i)=b(i)+c(i)*d(i)            a(i)=b(i)+c(i)*d(i)
      e(i)=f(i)-g(i)*h(i)+p(i)     end do
      q(i)=r(i)+s(i)         -->   do i=1,n
    end do                           e(i)=f(i)-g(i)*h(i)+p(i)
                                   end do
                                   do i=1,n
                                     q(i)=r(i)+s(i)
                                   end do

Memory Tuning

Low-stride memory access is preferred because it uses more of the data in a cache line retrieved from memory and helps trigger streaming.

High-stride access (avoid):

    do i=1,n                       for (j=0;j<n;j++) {
      do j=1,n                       for (i=0;i<n;i++) {
        C(i,j)=A(i,j)+B(i,j)           C[i][j]=A[i][j]+B[i][j];
      end do                         }
    end do                         }

Unit-stride access (preferred):

    do j=1,n                       for (i=0;i<n;i++) {
      do i=1,n                       for (j=0;j<n;j++) {
        C(i,j)=A(i,j)+B(i,j)           C[i][j]=A[i][j]+B[i][j];
      end do                         }
    end do                         }

Loop Optimization

Strip mining: transforms a single loop into two loops to ensure that each element of a cache line is used once it is brought into the cache. In the example below, the loop over j is split into two:

Original code:

    do j=1,n
      do i=1,n
        a(i,j)=a(i,j)+s*b(j,i)
      end do
    end do

"Intermediate code":

    do jouter=1,n,8
      do j=jouter,min(jouter+7,n)
        do i=1,n
          a(i,j)=a(i,j)+s*b(j,i)
        end do
      end do
    end do

Blocked code:

    do jouter=1,n,8
      do i=1,n
        do j=jouter,min(jouter+7,n)
          a(i,j)=a(i,j)+s*b(j,i)
        end do
      end do
    end do

Memory Tuning

Array blocking (matrix multiplication over nb x nb blocks):

    real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
    end do; end do; end do; end do; end do; end do

Much more efficient implementations exist in HPC scientific libraries (ESSL, MKL, ACML, ATLAS, ...).

Memory Tuning

Loop interchange can help in the case of a DAXPY-type loop:

    integer,parameter :: nkb=16, kb=1024, n=nkb*kb/8
    real*8 :: x(n), a(n,n), y(n)
    ...

A C sketch of the loop-interchange idea follows.
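To illustrate loop interchange in C (row-major storage), here is a minimal sketch, not from the original slides: the loops of a matrix-vector update (a sequence of DAXPY-type operations) are swapped so that the innermost loop walks the matrix with unit stride. The function names, array layout, and problem size are illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    /* y = y + A*x with the column loop outermost: a[i*n+j] is read with
       stride n in the inner loop, so little of each cache line is reused. */
    static void axpy_high_stride(int n, const double *a, const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                y[i] += a[i*n + j] * x[j];
    }

    /* Interchanged loops: the row loop is outermost, so the inner loop
       reads a[i*n+j] contiguously (unit stride) in row-major C. */
    static void axpy_unit_stride(int n, const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += a[i*n + j] * x[j];
    }

    int main(void)
    {
        int n = 2048;                                /* illustrative size */
        double *a = malloc((size_t)n * n * sizeof *a);
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 0.0; }
        for (int i = 0; i < n * n; i++) a[i] = 1.0;

        axpy_high_stride(n, a, x, y);   /* same arithmetic, different access pattern */
        axpy_unit_stride(n, a, x, y);

        printf("y[0] = %f\n", y[0]);
        free(a); free(x); free(y);
        return 0;
    }

Both routines compute the same result; only the memory access pattern differs, which is exactly what the stride and interchange discussion above is about. In Fortran (column-major), the roles of the two loop orders are reversed.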