Serial & Vector Optimization
Lucas A. Wilson
TACC Summer Supercomputing Institute
June 16, 2014

Overview
• Optimization
  – Optimization Process
  – Measuring Improvement
• Scalar Optimization
  – Compiler Options
• Vector Optimization

How much of the code should be tuned?
Optimization is an iterative process:
• Profile the code
• Tune the most time-intensive sections
• Re-evaluate performance
• Repeat until performance is sufficient (or no tuning time remains)
For specific algorithms, derive an ideal objective for the architecture, but
recognize the architecture's limitations as you optimize. Over successive
optimization cycles, the ideal expectation usually decreases while the
measured performance increases.

[Figure: ideal expectations and realistic (measured) performance converging
over optimization cycles]

How is code performance measured?

Some useful performance metrics

1. FLOPS
• Floating-point operations per second (GFLOPS/MFLOPS)
• Important for numerically intensive code
• Example: dot product (two FLOPs per iteration)

    program prog_dot
      integer, parameter :: n=1024*1024*1024
      real*8 :: x(n), y(n), sum
      ...
      do i=1,n
        sum=sum+x(i)*y(i)   ! two FLOPs per iteration
      end do                ! total FLOPs = 2n = 2 GFLOP
    end program
If execution time is 1.0 s, the program achieves 2.0 GFLOPS. How do you
measure this? Ideally, CPUs can execute 4-8 FLOPs per clock period (CP):

    4-8 FLOPs/CP x CPU speed (Hz, CPs/sec) = peak performance

How is code performance measured?

Some useful performance metrics

2. MIPS
• Millions of instructions per second
• Good indicator of efficiency (high MIPS means efficient use of the
  functional units)
• But MIPS is not necessarily indicative of “productive” work
Both codes do the “same” work, but code A gives rise to more instructions:
code A incurs loop overhead every 9 operations, while code B performs 8
floating-point operations with no inner-loop overhead.

Code A (loop overhead every 9 operations):

    do i1=1,n
      do i2=i1+1,n
        rdist=0.0
        do idim=1,3
          dx=x(idim,i2)-x(idim,i1)
          rdist=rdist+dx**2
        end do
        ......
      end do
    end do

Code B (8 float operations, no inner-loop overhead):

    do i1=1,n
      do i2=i1+1,n
        rdist=(x(1,i2)-x(1,i1))**2 + &
              (x(2,i2)-x(2,i1))**2 + &
              (x(3,i2)-x(3,i1))**2
        ......
      end do
    end do

How is code performance measured?

Some useful performance metrics

3. External timer (stopwatch): CPU (user) time + system (sys) time =
   execution of your instructions + kernel time spent executing
   instructions on your behalf.

    % /usr/bin/time -p ./a.out
    real 0.18
    user 0.16
    sys  0.02

4. Instrument the code: to measure code-block times with wall-clock
   timers, use system_clock in Fortran and clock in C.
Tools
• Profilers (routine level): gprof
• Timers (application/block): time, system_clock, clock, gettimeofday,
  getrusage, ...
• Analyzers (profile, line-level timings, and trace): mpiP, TAU, VTune,
  Paraver, ...
• Expert system: PerfExpert (Burtscher, Texas State; Browne and Rane,
  UT and TACC)
• Hardware counters: PAPI

Compiler Options
• Three important categories:
  – Optimization level
  – Interprocedural optimization
  – Architecture specification
You should always have (at least) one option from each category! Compiler Optimization Levels
-O0  -O1  -O2  -O3  -O4  -O5  -O6
(debugging / quick compilation  →  aggressive optimization)

• -O0: Debug; quick compilation
• -O1: Fast compile; standard optimizations
• -O2: Standard optimizations
• -O3: More aggressive optimization: loop unrolling, loop interchange,
  high-order transforms (blocking, routine substitution); may change
  semantics

Interprocedural Analysis

Interprocedural analysis (IPA) allows compilers to optimize across
routine boundaries. Benefits can include: automatic procedure inlining
(across modules); copy propagation, in which a flow dependency is
eliminated by propagating a copy of the variable causing the (false)
dependency; automatic recognition of standard libraries; whole-program
alias and pointer analysis; and vectorization.

    Compiler   IPA options
    Intel      -ip (single file), -ipo (multi-file)
    PGI        -Mipa
    IBM xlf    -qipa, inline=(auto|noauto)
    Sun        -xipo
    SGI        -IPA
Compiler options for x86 on TACC machines:
• Intel: -xHOST (for Stampede and Lonestar)

Optimizations performed by the compiler
• Copy propagation
  – Eliminates unnecessary dependencies
  – Aids instruction-level parallelism (ILP), e.g.

        x=y            x=y
        z=1.0+x   →    z=1.0+y

    After propagation, the two statements can execute simultaneously.
• Constant folding
  – Replaces variable references with a constant to generate faster code
  – The programmer can help the compiler by declaring such constants in
    PARAMETER statements in Fortran or with the const qualifier in C

Optimizations performed by the compiler
• Strength reduction
  – Replaces an expensive calculation with a cheaper one:

    Use x*x               instead of x**2.0
        0.5*x             instead of x/2.0
        tmp=exp(t); tmp-tmp*tmp   instead of exp(t)-exp(2t)
        a logical shift (<< or >>) instead of 2**n
Optimizations performed by the compiler
• Common subexpression elimination, e.g.

      x=a/b              temp=a/b
      ......       →     x=temp
      y=y0+a/b           ......
                         y=y0+temp

  The compiler generates a temporary variable to save on calculating
  a/b twice.

Optimizations performed by the compiler
• Loop-invariant code motion
  – Moves loop-independent calculations out of the loop, e.g. computing
    temp=a*b once before the loop instead of recomputing a*b on every
    iteration.

• Function inlining
  – Replaces a function call with the function body (here via a macro),
    eliminating call overhead:

        float min(float x, float y);    #define min(x,y) ((x)<(y)?(x):(y))

        main(int argc, char* argv[])    main(int argc, char* argv[])
        {                               {
          float x1, x2, xmin;             float x1, x2, xmin;
          ...                             ...
          xmin=min(x1,x2);                xmin=min(x1,x2);
          ...                             ...
        }                               }

        float min(float x, float y)
        {
          return (x < y) ? x : y;
        }

Loop Optimization

Loop splitting: splits a loop in half to create two separate loops half
the size of the original, and combines the operations of the two halves
in a single loop of half the length. The addition pipeline is fuller,
with two more streams from memory:

    do i=1,n                    do i=1,n/2
      sum=sum+a(i)+b(i)           sum0=sum0+a(i)+b(i)
    end do                        sum1=sum1+a(i+n/2)+b(i+n/2)
                                end do
                                sum=sum0+sum1

Loop fusion: combines two or more loops of the same iteration space
(loop length) into a single loop, which also keeps the addition pipeline
fuller.

Loop fission: the opposite of loop fusion is loop distribution, or
fission. Fission splits a single loop with independent operations into
multiple loops:

    do i=1,n                    do i=1,n
      a(i)=b(i)+c(i)*d(i)         a(i)=b(i)+c(i)*d(i)
      e(i)=f(i)-g(i)*h(i)+p(i)  end do
      q(i)=r(i)+s(i)            do i=1,n
    end do                        e(i)=f(i)-g(i)*h(i)+p(i)
                                end do
                                do i=1,n
                                  q(i)=r(i)+s(i)
                                end do

Memory Tuning

Low-stride memory access is preferred because it uses more of the data
in a cache line retrieved from memory and helps to trigger streaming.
In Fortran (column-major storage) the leftmost index should vary fastest
in the inner loop; in C (row-major storage) the rightmost index should.

Blocked code:

    do jouter=1,n,8
      do i=1,n
        do j=jouter,min(jouter+7,n)
          a(i,j)=a(i,j)+s*b(j,i)
        end do
      end do
    end do

Memory Tuning

Array blocking for matrix multiplication (nb x nb blocks):

    real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb
     do jj=1,n,nb
      do kk=1,n,nb
       do i=ii,min(n,ii+nb-1)
        do j=jj,min(n,jj+nb-1)
         do k=kk,min(n,kk+nb-1)
          c(i,j)=c(i,j)+a(i,k)*b(k,j)
    end do; end do; end do; end do; end do; end do

Much more efficient implementations exist in HPC scientific libraries
(ESSL, MKL, ACML, ATLAS, ...).

Memory Tuning

Loop interchange can help in the case of a DAXPY-type loop:

    integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
    real*8 :: x(n), a(n,n), y(n)
    ...
    do i=1,n                    do j=1,n
      s=0.0                       do i=1,n
      do j=1,n            →         y(i)=y(i)+a(i,j)*x(j)
        s=s+a(i,j)*x(j)           end do
      end do                    end do
      y(i)=s
    end do

VECTORIZATION

SIMD Hardware (for Vectorization)

SAXPY operation: z(i) = a*x(i) + y(i)
[Figure: vector elements x, y, z stream from memory through cache into
SIMD registers]

• Optimal vectorization requires concerns beyond the SIMD unit!
  – Operations: requires elemental (independent) operations (SIMD
    operations)
  – Registers: alignment of data on 64-, 128-, or 256-bit boundaries
    might be important
  – Cache: access to elements in cache is fast; access from memory is
    much slower
  – Memory: store vector elements sequentially for fastest aggregate
    retrieval

Vectorization

• Vectorization, or SIMD* processing, allows the same instruction to be
  executed on multiple data operands. (A loop over array elements often
  provides a constant stream of data.)
• Note: streams provide vectors of length 2-16 for execution in the
  SIMD unit.
    a(i)=b(i)+c(i)

*SIMD = Single Instruction, Multiple Data stream

Vectorization

• Register size determines the number of simultaneous operands in a
  SIMD operation.
• Compilers are good at vectorizing inner loops: gathering multiple
  data elements and executing a single instruction.
• Each iteration must be independent.
• Inlined functions and intrinsic Short Vector Math Library (SVML)
  functions can provide vectorization opportunities.

Vector Compiler Options

• The compiler will look for vectorization opportunities at
  optimization level -O2.
• Use the architecture option: -x<arch>

Vectorization Report

• Intel vector reporting is OFF by default.
• Use vector reporting levels 4 and 5 to report on loops NOT
  vectorized:

    % ifort -xHOST -vec-report=4 prog.f90 -c
    prog.f90(31): (col. 11) remark: loop was not vectorized: existence
    of vector dependence.

    ... -vec-report=5
    prog.f90(31): (col. 4) ...assumed ANTI dependence between z line 31
    and z line 31.
    prog.f90(31): (col. 4) ...assumed FLOW dependence between z line 31
    and z line 31.

Vector Add

• Each iteration can be executed independently: the loop will
  vectorize. (Use -vec-report=6 for details.)
• The compiler is aware of data size, but may be unaware of data
  alignment and cache.

    C:                            F90:
    double a[N], b[N], c[N];      real*8 :: a(N), b(N), c(N)
    ...                           ...
    for (j=0;j<N;j++)             do j=1,N
      a[j]=b[j]+c[j];               a(j)=b(j)+c(j)
                                  end do