Introduction to Performance Analysis and Vectorization on SX-Aurora TSUBASA
July 27, 2018 NEC Corporation
1 © NEC Corporation 2018
Contents
What is vectorization?
Is vectorization difficult?
Basic points to promote vectorization
SX-Aurora TSUBASA performance tools
"The most efficient way to execute a vectorizable application is a vector processor."
Jim Smith, International Symposium on Computer Architecture (1994)
(Source: Computer Architecture: A Quantitative Approach, Sixth Edition)
What is vectorization?
Scalar data and vector data
Variables and each element of an array are called scalar data.
An ordered sequence of scalar data, such as a row, column, or diagonal of a matrix, is called vector data.
Processing such vector data simultaneously is called vectorization.
Scalar processing and vector processing
▌Scalar processing
  DO I = 1,100
    C(I) = A(I) + B(I)
  END DO
The additions are computed one at a time, 100 times:
  C(1) = A(1) + B(1)
  C(2) = A(2) + B(2)
  C(3) = A(3) + B(3)
  …
  C(99) = A(99) + B(99)
  C(100) = A(100) + B(100)
▌Vector processing
  DO I = 1,100
    C(I) = A(I) + B(I)
  END DO
Multiple data are computed at once:
  [C(1), C(2), C(3), …, C(99), C(100)] = [A(1), A(2), A(3), …, A(99), A(100)] + [B(1), B(2), B(3), …, B(99), B(100)]
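The loop above can also be written as a Fortran array expression, which states all the additions in a single statement. The following is a minimal sketch, not taken from the slides: the module and subroutine names are hypothetical, and both forms are only meant to show that they compute the same result.

```fortran
module vector_add_sketch
  implicit none
contains
  ! Element-by-element form, as written on the slide.
  subroutine add_with_loop(a, b, c)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: c(:)
    integer :: i
    do i = 1, size(c)
      c(i) = a(i) + b(i)
    end do
  end subroutine add_with_loop

  ! Whole-array form: one statement describes every addition.
  subroutine add_with_array_expression(a, b, c)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: c(:)
    c = a + b
  end subroutine add_with_array_expression
end module vector_add_sketch
```

Both subroutines assume that a, b, and c have the same size. Calling either one with 100-element arrays reproduces the example on the slide.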
Is vectorization difficult?
Advanced automatic vectorization
▌The NEC compiler analyzes source files and generates an executable that is automatically vectorized as much as possible.
  Automatic: detects vectorizable loops automatically and vectorizes them
  Powerful: vectorizes loops that include IF blocks and reduction operations
  Efficient: transforms nested loops so that vectorization improves performance efficiently
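As an illustration of a loop that combines an IF block with a reduction, the following sketch sums only the positive elements of an array. The module and function names are hypothetical, and whether a particular compiler version vectorizes this loop should be confirmed from its diagnostics rather than assumed.

```fortran
module masked_sum_sketch
  implicit none
contains
  ! Sum of the positive elements of x. The loop body contains an IF block
  ! and accumulates into a single scalar, which is a reduction operation.
  function sum_of_positives(x) result(total)
    real, intent(in) :: x(:)
    real :: total
    integer :: i
    total = 0.0
    do i = 1, size(x)
      if (x(i) > 0.0) then
        total = total + x(i)
      end if
    end do
  end function sum_of_positives
end module masked_sum_sketch
```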
Vectorization conditions
▌Some conditions must be satisfied for the compiler to vectorize a loop automatically.
  • The loops, statements, types, and operations must be ones that vectorization supports. For example, the following loops are not vectorized:
      - loops containing variables or arrays of character type or quadruple-precision real type
      - loops containing procedures (functions and subroutines)
      - loops containing I/O statements
  • There must be no dependency between array element definitions and references within the same loop.
  • Vectorization must be expected to improve performance. For example, the compiler judges that vectorization improves performance in the following cases:
      - the loop length is sufficiently long for vectorization
      - the performance gain from vectorization outweighs the added cost of the loop transformations that vectorization requires
Range of vectorization subjects
Vectorizable loops
  Array expression, DO loop, DO WHILE loop, loop consisting of an IF construct and a GOTO statement
Vectorizable statements
  Numeric and logical intrinsic assignments, CONTINUE, GOTO, CYCLE, EXIT, IF construct, CASE construct
  Not subject to vectorization: CALL, I/O statements, pointer assignment statements
Vectorizable types
  4-byte and 8-byte integer types, 4-byte and 8-byte logical types, real type, double-precision real type, complex type, double-precision complex type
  Not subject to vectorization: character type, quadruple-precision real type, quadruple-precision complex type, 2-byte integer type, single-byte logical type, derived type
Vectorizable operations
  Four arithmetic operations, logical operations, numeric relational operations, power operation, type conversions, intrinsic procedures
  Not subject to vectorization: user-defined operations
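Dependency
▌A loop with a dependency cannot be vectorized.
  DO I = 1, N
    A(I+1) = A(I)*B(I) + C(I)
  END DO
Execution order in scalar processing (each right-hand side A(I) is the updated value from the previous iteration, except A(1)):
  A(2) = A(1)*B(1)+C(1)
  A(3) = A(2)*B(2)+C(2)
  A(4) = A(3)*B(3)+C(3)
  …
Execution order in vector processing (every right-hand side A(I) is the original value, because the elements are processed together):
  A(2) = A(1)*B(1)+C(1)
  A(3) = A(2)*B(2)+C(2)
  A(4) = A(3)*B(3)+C(3)
  …
The two orders read different values of A, so vector processing would change the result.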
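The effect of the dependency can be reproduced on any Fortran compiler by emulating the two execution orders. This is a sketch for illustration only: the module and subroutine names are hypothetical, and the "vector order" routine relies on the standard rule that the right-hand side of an array assignment is fully evaluated before anything is stored.

```fortran
module dependency_sketch
  implicit none
contains
  ! Scalar order: iteration I reads the A(I) written by iteration I-1.
  subroutine update_scalar_order(a, b, c)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:)
    integer :: i
    do i = 1, size(a) - 1
      a(i+1) = a(i)*b(i) + c(i)
    end do
  end subroutine update_scalar_order

  ! Emulated vector order: every right-hand side is evaluated from the
  ! original A before any element is stored.
  subroutine update_vector_order(a, b, c)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:)
    integer :: n
    n = size(a)
    a(2:n) = a(1:n-1)*b(1:n-1) + c(1:n-1)
  end subroutine update_vector_order
end module dependency_sketch
```

With A = (1, 1, 1, 1), every B(I) = 2, and every C(I) = 1, the scalar order gives (1, 3, 7, 15) while the emulated vector order gives (1, 3, 3, 3), which is why the compiler must not vectorize such a loop.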
Basic points to promote vectorization
Essential points for vectorization
  Raising the vectorization ratio (vector operation ratio)
  Lengthening loops
  Making memory access efficient
Vectorization ratio
[Figure: execution time by scalar processing, with the vectorizable part marked, compared with the execution time after vectorization]
$$\text{Vectorization ratio} = \frac{\text{Execution time of vectorizable parts}}{\text{Execution time by scalar processing}}$$
Vectorization ratio and performance improvement
▌The vectorization ratio needs to be sufficiently high to deliver superior performance.
[Figure: speed-up (0 to 50) versus vectorization ratio (0% to 100%); the curve is annotated with 2x, 4.6x, and 25x]
* This graph assumes, based on Amdahl’s law, that the performance improvement is 50x at a vectorization ratio of 100%.
Make the vectorization ratio as close as possible to 100%.
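The curve can be reproduced from Amdahl's law. In the sketch below, r is the vectorization ratio and the vectorized part is assumed to run 50 times faster, as the footnote states. The ratios 50%, 80%, and 98% are inferred here by matching the annotated speed-ups; the slide text does not state them.

```latex
$$
S(r) = \frac{1}{(1 - r) + \dfrac{r}{50}}
$$
$$
S(0.5) = \frac{1}{0.5 + 0.01} \approx 1.96, \qquad
S(0.8) = \frac{1}{0.2 + 0.016} \approx 4.63, \qquad
S(0.98) = \frac{1}{0.02 + 0.0196} \approx 25.3, \qquad
S(1) = 50
$$
```

Going from 98% to 100% doubles the speed-up, which is why the last few percent of vectorization ratio matter most.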
Loop length (vector length)
▌A sufficiently long vectorized loop increases the benefit of vectorization.
[Figure: execution time versus loop length for unvectorized and vectorized execution, showing the start-up time of the vectorized loop and the cross length at which the two lines intersect]
Make the loop length as long as possible.
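One simple way to read the figure is a linear cost model. This model and all of its symbols are assumptions made here for illustration, not values from the slides: n is the loop length, t_s and t_v are the per-element times of the unvectorized and vectorized loop (with t_v < t_s), and T_0 is the vector start-up time.

```latex
$$
T_{\text{unvectorized}}(n) = n\,t_s, \qquad
T_{\text{vectorized}}(n) = T_0 + n\,t_v
$$
$$
T_{\text{vectorized}}(n) < T_{\text{unvectorized}}(n)
\iff n > n_{\text{cross}} = \frac{T_0}{t_s - t_v}
$$
```

Under this model the vectorized loop is faster only beyond the cross length, and the relative weight of the start-up time shrinks as the loop gets longer.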
Steps for performance improvement
  (1) Analyze the performance of the whole program: PROGINF
  (2) Identify the procedures (functions and subroutines) that are performance bottlenecks: FTRACE
  (3) Identify the loops and array expressions that are performance bottlenecks: diagnostic messages, format list, FTRACE REGION
  (4) Apply measures for vectorization: compiler directives, source modifications
PROGINF
▌PROGINF shows information about program execution, such as the execution time and the number of executed instructions.
▌How to use
  (1) Link the program with the “-proginf” option.
  (2) Execute the program with “YES” or “DETAIL” specified for the run-time option “VE_PROGINF”.
  (3) The information is written to standard error at the end of program execution.
Example: VE_PROGINF=YES
  ******** Program Information ********
  Real Time (sec)               : 121.233126
  User Time (sec)               : 121.228955
  Vector Time (sec)             : 106.934651
  Inst. Count                   : 119280358861
  V. Inst. Count                : 29274454500
  V. Element Count              : 6389370973939
  V. Load Element Count         : 3141249840232
  FLOP Count                    : 3182379290112
  MOPS                          : 58637.969529
  MOPS (Real)                   : 58635.323545
  MFLOPS                        : 26251.407833
  MFLOPS (Real)                 : 26250.223262
  A. V. Length                  : 218.257559
  V. Op. Ratio (%)              : 98.733828
  L1 Cache Miss (sec)           : 0.639122
  VLD LLC Hit Element Ratio (%) : 73.051749
  Memory Size Used (MB)         : 27532.000000
  Start Time (date)             : Sun Jul 22 20:09:51 2018 JST
  End Time (date)               : Sun Jul 22 20:11:52 2018 JST
Points to check:
  Does “Vector Time” account for most of “User Time”?
  Is “A. V. Length” (average vector length) sufficiently long? (Is the length close to 256?)
  Is “V. Op. Ratio” (vector operation ratio) sufficiently high? (Is the ratio close to 100%?)
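The first two checks can be worked through with the numbers in this example. The second relation is an observation that happens to reproduce the listed “A. V. Length” for this sample; it is offered as a plausibility check, not as the tool's documented definition.

```latex
$$
\frac{\text{Vector Time}}{\text{User Time}}
= \frac{106.934651}{121.228955} \approx 0.882
$$
$$
\frac{\text{V. Element Count}}{\text{V. Inst. Count}}
= \frac{6389370973939}{29274454500} \approx 218.26
$$
```

In this run about 88% of the user time is vector time, and the average vector length of about 218 is reasonably close to 256.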
FTRACE
▌FTRACE collects performance analysis information for each procedure.
▌How to use
  (1) Compile and link the program with the “-ftrace” option.
  (2) The analysis information file (ftrace.out) is generated at the end of program execution.
  (3) Run the ftrace command to write the analysis list to standard output.
      (Example) ftrace -f ftrace.out

*----------------------*
  FTRACE ANALYSIS LIST
*----------------------*

Execution Date : Sun Jul 22 21:10:50 2018 JST
Total CPU Time : 0:02'02"434 (122.434 sec.)

FREQUENCY EXCLUSIVE       AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE CPU PORT VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS     CONF HIT E.%

      512   58.110( 47.5)   113.496  68380.5  30870.9  99.27  223.8   58.110    0.000    0.000   74.26 SUB1
      510   31.763( 25.9)    62.280  69474.6  31552.6  99.31  223.6   31.762    0.000    0.000   76.59 SUB2
        2    9.056(  7.4)  4527.963   3198.9      0.0  14.91  205.2    0.062    1.272    0.000   73.33 SUB3
      459    7.732(  6.3)    16.846  55793.0  23438.2  98.44  163.7    7.732    0.000    0.000   51.84 SUB4
      459    7.037(  5.7)    15.332  56239.0  26694.7  98.84  212.3    7.037    0.000    0.000   62.00 SUB5
  2097152    6.667(  5.4)     0.003   2816.0    322.1  22.88  256.0    0.282    0.283    0.000  100.00 SUB6
        4    1.355(  1.1)   338.818  19218.8  11094.9  98.75  243.9    1.355    0.000    0.000    0.82 SUB7
      463    0.448(  0.4)     0.968  57096.7      0.0  95.73  176.4    0.448    0.000    0.000    0.00 SUB8
     1483    0.141(  0.1)     0.095  18966.1      0.0  97.87  205.4    0.141    0.000    0.048    2.59 SUB9
  2099270    0.122(  0.1)     0.000   3056.8     17.2   0.00    0.0    0.000    0.000    0.000    0.00 SUB10
       51    0.001(  0.0)     0.017    667.2      0.0   0.00    0.0    0.000    0.001    0.000    0.00 SUB11
        1    0.000(  0.0)     0.292    444.7      0.6   0.01    8.0    0.000    0.000    0.000    0.00 MAIN_
  (omitted)
------------------------------------------------------------------------------------------------------------------
  4201150  122.434(100.0)     0.029  58071.6  25992.7  98.59  218.3  106.929    1.558    0.048   73.05 total

Columns:
  FREQUENCY: calling count of a function
  EXCLUSIVE TIME[sec]( % ): exclusive CPU time and its ratio to the CPU time of the whole program
  AVER.TIME[msec]: average CPU time required for one execution
  MOPS and the columns after it: the same information items as PROGINF
Format list
▌The format list shows various information (vectorization and parallelization of loops, inline expansion of procedure calls, and so on) alongside the corresponding source lines.
▌How to use
  (1) Compile the program with the “-report-format” option. The “-report-all” option is recommended instead, because it outputs format lists together with diagnostic messages.
  (2) The format list is written to a file whose name has the suffix “.L”.

539: +------> do i3=2,n3-1
540: |+-----> do i2=2,n2-1
541: ||V----> do i1=1,n1
542: |||                 r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
543: |||                >       + r(i1,i2,i3-1) + r(i1,i2,i3+1)
544: |||                 r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
545: |||                >       + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
546: ||V---- enddo
547: ||V----> do i1=2,n1-1
548: |||        F        u(i1,i2,i3) = u(i1,i2,i3)
549: |||                >            + c(0) * r(i1,i2,i3)
550: |||                >            + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
551: |||                >            + r1(i1) )
552: |||                >            + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
553: |||           c---------------------------------------------------------------------
554: |||           c  Assume c(3) = 0 (Enable line below if c(3) not= 0)
555: |||           c---------------------------------------------------------------------
556: |||           c    >            + c(3) * ( r2(i1-1) + r2(i1+1) )
557: |||           c---------------------------------------------------------------------
558: ||V---- enddo
559: |+----- enddo
560: +------ enddo
Format list output examples
▌Optimization information such as loop structure and vectorization status is indicated by symbols.
Whole loop is vectorized:
  V------> DO I=1,N
  |
  V------ END DO
Loop is not vectorized:
  +------> DO I=1,N
  |
  +------ END DO
Loop is partially vectorized (this is called “partial vectorization”):
  S------> DO I=1,N
  |
  S------ END DO
Nested loops are collapsed:
  W------> DO I=1,N
  |*-----> DO J=1,M
  ||
  |*----- END DO
  W------ END DO
Nested loops are interchanged:
  X------> DO I=1,N
  |*-----> DO J=1,M
  ||
  |*----- END DO
  X------ END DO
Outer loop is unrolled and inner loop is vectorized:
  U------> DO I=1,N
  |V-----> DO J=1,M
  ||
  |V----- END DO
  U------ END DO
▌Optimization information for each statement is indicated at column 17.

547: ||V----> do i1=2,n1-1
548: |||        F        u(i1,i2,i3) = u(i1,i2,i3)
549: |||                >            + c(0) * r(i1,i2,i3)
550: |||                >            + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
551: |||                >            + r1(i1) )
552: |||                >            + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
553: |||           c---------------------------------------------------------------------
554: |||           c  Assume c(3) = 0 (Enable line below if c(3) not= 0)
555: |||           c---------------------------------------------------------------------
556: |||           c    >            + c(3) * ( r2(i1-1) + r2(i1+1) )
557: |||           c---------------------------------------------------------------------
558: ||V---- enddo
559: |+----- enddo
560: +------ enddo

F : Vector fused-multiply-add instruction is generated
I : Routine call is inlined
M : Nested loops are replaced with a vector matrix multiply routine
G : Vector gather instruction is generated
FTRACE REGION (User-specified regions)
▌Users can specify the regions whose performance information is to be collected. Such a region is called a user-specified region.

 55: |        CALL FTRACE_REGION_BEGIN("Loop")
 56: |+-----> DO 10 K=KS,KE
 57: ||    C
 58: ||+----> DO 20 MM=MMIN,MMAX
 59: |||        ID1=MM-JE
 60: |||        ID2=MM-JS
 61: |||        IS=MAX0(ID1,IIS)
 62: |||        IE=MIN0(ID2,IIE)
 63: |||   C
 64: |||+---> DO 30 I=IS,IE
 65: ||||
              (calculation using a hyperplane method)
506: ||||
507: |||+--- 30 continue
508: ||+---- 20 continue
509: |+----- 10 continue
510: |        CALL FTRACE_REGION_END("Loop")

FREQUENCY EXCLUSIVE       AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE CPU PORT VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS     CONF HIT E.%

       10   32.966( 13.6)  3296.625    818.5    245.5   0.00    0.0    0.000   19.977    0.000    0.00 Loop
Compiler directives
▌The NEC compiler can vectorize more code when it is given an optimization hint through a compiler directive.

 55: |
 56: |+-----> DO 10 K=KS,KE
 57: ||    C
 58: ||+----> DO 20 MM=MMIN,MMAX
 59: |||        ID1=MM-JE
 60: |||        ID2=MM-JS
 61: |||        IS=MAX0(ID1,IIS)
 62: |||        IE=MIN0(ID2,IIE)
 63: |||   C
 64: |||      !NEC$ ivdep
 65: |||V---> DO 30 I=ISTA,IEND
 66: ||||
              (calculation using a hyperplane method)
506: ||||
507: |||V--- 30 continue
508: ||+---- 20 continue
509: |+----- 10 continue
510: |

Before using the compiler directive (the innermost loop is marked “+”, not vectorized):
FREQUENCY EXCLUSIVE       AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE CPU PORT VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS     CONF HIT E.%

       10   32.966( 13.6)  3296.625    818.5    245.5   0.00    0.0    0.000   19.977    0.000    0.00 Loop

After using the compiler directive (the innermost loop is marked “V”, vectorized):
FREQUENCY EXCLUSIVE       AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE CPU PORT VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS     CONF HIT E.%

       10    1.703(  1.3)   170.257  18980.1   8210.0  96.69   69.4    1.368    0.181    0.046   99.66 Loop

Almost 20 times faster: the exclusive time of the region drops from 32.966 sec to 1.703 sec.
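A smaller case where such a hint applies is an update through an index array. The sketch below is illustrative only and is not the hyperplane code from the slide: the module and subroutine names are hypothetical, and the directive is written as `!NEC$ ivdep`, which should be checked against the compiler manual for the version in use. Other compilers treat the line as an ordinary comment, so the code still runs.

```fortran
module ivdep_sketch
  implicit none
contains
  ! Adds b(i) to a(ix(i)). The compiler cannot prove that ix holds no
  ! duplicate values, so it may have to assume that iterations depend on
  ! each other. The directive asserts that they do not, which is only
  ! safe when the caller guarantees that ix contains distinct indices.
  subroutine scatter_add(a, ix, b)
    real,    intent(inout) :: a(:)
    integer, intent(in)    :: ix(:)
    real,    intent(in)    :: b(:)
    integer :: i
!NEC$ ivdep
    do i = 1, size(ix)
      a(ix(i)) = a(ix(i)) + b(i)
    end do
  end subroutine scatter_add
end module ivdep_sketch
```

The directive is a promise from the programmer, not a check. If ix did contain repeated indices, a vectorized version could produce a different result from the scalar loop, for the reason shown on the Dependency slide.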
SX-Aurora TSUBASA performance tools
NEC Software Development Kit for Vector Engine
NEC SDK for VE includes the compilers, scientific computing libraries, performance analysis tool, and other components necessary for developing and optimizing software for SX-Aurora TSUBASA.
NEC Software Development Kit for Vector Engine (NEC SDK for VE)
  Compilers
    • Fortran Compiler
    • C/C++ Compiler
  Scientific computing libraries
    • NEC Numeric Library Collection
  Debugger
    • NEC Parallel Debugger
  Performance analysis tool
    • NEC Ftrace Viewer
NEC SDK for VE - Fortran Compiler, C/C++ Compiler
Compilers with advanced automatic vectorization and parallelization
▌Automatic vectorization
  Detecting vectorizable loops automatically and vectorizing them
  Vectorizing loops including IF blocks and reduction operations
  Transforming nested loops so that vectorization improves performance efficiently
▌Automatic parallelization
  Detecting task-parallelizable loops automatically and parallelizing them using shared-memory parallelization
  Providing various parallelization methods, such as using a condition to select parallel or non-parallel code at execution time
▌Supported language standards
  Fortran 2003
    • Partial support for the Fortran 2008/2015 language standards (the range of support is scheduled to expand gradually)
    • Compatible with other vendors' language specifications, such as the CRAY pointer
  C11/C++14
    • Partial support for the C++17 language standard
    • Compatible with GCC language specifications
  OpenMP 4.5
    • Support for a de facto standard Application Programming Interface (API) for shared-memory parallelization
NEC SDK for VE - NEC Numeric Library Collection
NEC Numeric Library Collection maximizes hardware performance
▌Maximized performance for functions frequently used in numerical analysis
  BLAS, LAPACK, ScaLAPACK
  FFTW Interface
    • The FFTW API can be used just by replacing a header file
    • Uses the kernels of the NEC FFT library, which is thoroughly tuned for performance
▌High performance with modern numerical analysis algorithms
  Mersenne Twister pseudo-random numbers
    • Generates high-quality random numbers fast
  Sobol quasi-random numbers
    • Necessary for the quasi-Monte Carlo method
▌Reduced calculation time by offloading processing between VE and Xeon
  Direct sparse solver
    • Uses VE or Xeon as appropriate for vector-oriented and scalar-oriented calculations
    • Solves fast by utilizing the advantages of the respective architectures
[Figure: applications call the FFTW Interface, which sits on the FFT library developed by NEC within NEC Numeric Library Collection; the direct solver developed by NEC divides its work into scalar processing and vector processing, with Xeon shown for the scalar side]
NEC SDK for VE - NEC Numeric Library Collection
NEC Numeric Library Collection consists of the following libraries.
  ASL Unified Interface
    Fourier Transforms, Random Number Generators
  FFTW Interface
    Interface library to use Fourier Transform functions of ASL with FFTW (version 3.x) API
  ASL
    Basic Matrix Algebra, Simultaneous Linear Equations, Eigenvalues and Eigenvectors, Fourier Transforms, Spline Functions, Approximation and Interpolation, Numerical Differentials, Numerical Integration, Roots of Equations, Extremum Problems and Optimization, Differential Equations and Their Applications, Special Functions, Random Number, Sorting and Ranking, Probability Distributions, Sample Statistics, Tests and Estimates, Analysis of Variance and Design of Experiments, Nonparametric Tests, Multivariate Analysis, Time Series Analysis, Regression Analysis
  BLAS
    Basic vector and matrix operations
  LAPACK
    Simultaneous linear equations, Eigenvalues and Eigenvectors
  ScaLAPACK
    Simultaneous linear equations, Eigenvalues and Eigenvectors (for distributed memory parallel programs)
  CBLAS
    C interface to BLAS
  SBLAS
    Basic operations of sparse matrices
NEC SDK for VE - NEC Parallel Debugger
GUI debugger supporting VE (Eclipse plugin)
[Figure: debugging workflow. A local machine (operation on screen), a front-end machine and file server (source, Makefile, environment file, and the compiled VE program ve.out), and SDM connecting the debug session to gdb instances attached to VE processes on SX-Aurora TSUBASA]
SDM: Scalable Debug Manager
NEC SDK for VE - NEC Ftrace Viewer
Performance analysis GUI tool
[Screenshots of the following views]
  Maximum, minimum, average, and standard deviation of exclusive CPU time for each function
  Exclusive CPU time for each function
  Exclusive CPU time and vector operation ratio for each function
  Exclusive CPU time for each process
  Elapsed time, MPI communication time, and MPI communication idle time for each MPI process in a function
NEC MPI
NEC MPI consists of commands and a library for distributed-memory parallel programming.
Key features
  Compliant with the latest MPI specification (MPI 3.1)
  Supports InfiniBand (EDR) as the interconnect
  Automatic selection of the optimal inter-process communication method, and high-speed communication using zero-copy transfer
    • Using shared memory within the same VE
    • Selecting DMA transfer or InfiniBand transfer between VEs according to process placement
[Figure: two Xeon hosts connected by InfiniBand (IB), each with several VEs and their memory, labeled with shared memory within a VE, DMA transfer between VEs, and InfiniBand transfer between VEs]
Conclusion
▌Vectorization itself is not too difficult.
  Basically, the compiler performs vectorization automatically.
  Give the compiler an optimization hint through a compiler directive when it cannot determine whether a loop can be vectorized.
▌As future topics …
  Efficient memory access
  Vectorization with source modifications
  Efficient parallelization using automatic parallelization, OpenMP, and MPI
  Performance improvement techniques using the vector host (VH) and vector engine (VE)
  and others
We would like to introduce various methods for getting superior performance out of SX-Aurora TSUBASA, along with useful examples.