Introduction to Performance Analysis and Vectorization on SX-Aurora TSUBASA

July 27, 2018 NEC Corporation

1 © NEC Corporation 2018

Contents

■ What is vectorization?

■ Is vectorization difficult?

■ Basic points to promote vectorization

■ SX-Aurora TSUBASA performance tools

"The most efficient way to execute a vectorizable application is a vector processor."

Jim Smith, International Symposium on Computer Architecture (1994)

(Source: Computer Architecture: A Quantitative Approach, Sixth Edition)

What is vectorization?

Scalar data and vector data

Variables and each element of an array are called scalar data.

[Figure: a variable and an array]

An ordered sequence of scalar data, such as a row, column, or diagonal of a matrix, is called vector data.

[Figure: a matrix]

Processing such vector data simultaneously is called vectorization.

Scalar processing and vector processing

▌Scalar processing

DO I = 1,100
  C(I) = A(I) + B(I)
END DO

C(1) = A(1) + B(1)

C(2) = A(2) + B(2)

C(3) = A(3) + B(3)

Compute 100 times …

C(99) = A(99) + B(99)

C(100) = A(100) + B(100)


▌Vector processing

DO I = 1,100
  C(I) = A(I) + B(I)
END DO

C(1)       A(1)       B(1)
C(2)       A(2)       B(2)
C(3)       A(3)       B(3)
 …     =    …     +    …        Compute multiple data at once
C(99)      A(99)      B(99)
C(100)     A(100)     B(100)
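As a minimal sketch of the same idea, the element-by-element loop can also be written as a Fortran array expression, which states the whole-vector operation in one statement. The module and procedure names below are hypothetical and are not part of the slides.

```fortran
module vector_add_sketch
  implicit none
contains
  ! Element-by-element form, as in the DO loop on the slide.
  subroutine add_loop(a, b, c)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: c(:)
    integer :: i
    do i = 1, size(c)
      c(i) = a(i) + b(i)
    end do
  end subroutine add_loop

  ! Array-expression form: the same addition over the whole vector.
  subroutine add_array(a, b, c)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: c(:)
    c = a + b
  end subroutine add_array
end module vector_add_sketch
```

Both procedures produce the same result for arrays of equal size; calling either with three 100-element arrays reproduces the example above.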

Is vectorization difficult?

Advanced automatic vectorization

▌The NEC compiler analyzes source files and generates an executable that is vectorized automatically as far as possible.

Automatic: detects vectorizable loops automatically and vectorizes them

Powerful: vectorizes loops that include IF blocks and reduction operations

Efficient: transforms nested loops so that vectorization improves performance efficiently
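To make the "IF blocks and reduction operations" case concrete, here is a small sketch of a loop that combines both. The names are hypothetical, and whether a particular compiler version vectorizes this exact loop should be confirmed from its diagnostics rather than assumed.

```fortran
module masked_sum_sketch
  implicit none
contains
  ! A loop containing an IF block and a sum reduction into s.
  function sum_positive(x) result(s)
    real, intent(in) :: x(:)
    real :: s
    integer :: i
    s = 0.0
    do i = 1, size(x)
      if (x(i) > 0.0) then
        s = s + x(i)
      end if
    end do
  end function sum_positive
end module masked_sum_sketch
```

The accumulation into `s` is the reduction, and the test on `x(i)` is the IF block.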

Vectorization conditions

▌The compiler vectorizes a loop automatically only when certain conditions are satisfied.

■ The loops, statements, types, and operations must be eligible for vectorization. For example, the following loops are not vectorized:
• Loops containing variables or arrays of character type or quadruple-precision real type
• Loops containing procedure calls (functions and subroutines)
• Loops containing I/O statements

■ There must be no dependency between array element definitions and references within the same loop.

■ Vectorization must be expected to improve performance. For example, the compiler judges that vectorization improves performance in the following cases:
• The loop length is long enough for vectorization.
• The performance gain from vectorization outweighs the added cost of the loop transformations that vectorization requires.

Range of vectorization subjects

Vectorizable loops
  Array expressions, DO loops, DO WHILE loops, and loops formed by an IF construct and a GOTO statement

Vectorizable statements
  Numeric and logical intrinsic assignments, CONTINUE, GOTO, CYCLE, EXIT, IF construct, CASE construct
  Not subject to vectorization: CALL, I/O statements, pointer assignment statements

Vectorizable types
  4-byte and 8-byte integer types, 4-byte and 8-byte logical types, real type, double-precision real type, complex type, double-precision complex type
  Not subject to vectorization: character type, quadruple-precision real type, quadruple-precision complex type, 2-byte integer type, single-byte logical type, derived type

Vectorizable operations
  The four arithmetic operations, logical operations, numeric relational operations, power operation, type conversions, intrinsic procedures
  Not subject to vectorization: user-defined operations

Dependency

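▌A loop with a dependency cannot be vectorized.

DO I = 1, N
  A(I+1) = A(I)*B(I) + C(I)
END DO

Execution order in scalar processing (from the second statement on, the right-hand side reads an updated element of A):

A(2) = A(1)*B(1)+C(1)     A(1): original
A(3) = A(2)*B(2)+C(2)     A(2): updated
A(4) = A(3)*B(3)+C(3)     A(3): updated
…

Execution order in vector processing (every right-hand side reads an original element of A):

A(2) = A(1)*B(1)+C(1)     A(1): original
A(3) = A(2)*B(2)+C(2)     A(2): original
A(4) = A(3)*B(3)+C(3)     A(3): original
…

The two orders give different results, so this loop is not vectorized.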
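The difference between the two execution orders can be reproduced on any Fortran compiler. In the sketch below (hypothetical names, not from the slides), the first routine runs the recurrence in scalar order, while the second uses an array assignment, whose semantics evaluate the whole right-hand side from the original values before storing, which mimics the vector order.

```fortran
module dependency_sketch
  implicit none
contains
  ! Scalar order: iteration i reads a(i) as updated by iteration i-1.
  subroutine recurrence_scalar(a, b, c)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:)
    integer :: i
    do i = 1, size(a) - 1
      a(i+1) = a(i)*b(i) + c(i)
    end do
  end subroutine recurrence_scalar

  ! Vector-style order: the right-hand side is evaluated entirely from
  ! the original contents of a before any element is stored.
  subroutine recurrence_vector_order(a, b, c)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:)
    integer :: n
    n = size(a)
    a(2:n) = a(1:n-1)*b(1:n-1) + c(1:n-1)
  end subroutine recurrence_vector_order
end module dependency_sketch
```

With A initialized to 1, B to 2, and C to 1 over four elements, the scalar order gives A = 1, 3, 7, 15, whereas the vector-style order gives A = 1, 3, 3, 3.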

Basic points to promote vectorization

Essential points for vectorization

■ Raising the vectorization ratio (vector operation ratio)

■ Lengthening loops

■ Making memory access efficient

Vectorization ratio

[Figure: execution time by scalar processing with its vectorizable part marked, compared with execution time after vectorization]

\[
\text{Vectorization ratio} = \frac{\text{Execution time of vectorizable parts}}{\text{Execution time by scalar}}
\]

Vectorization ratio and performance improvement

▌The vectorization ratio needs to be sufficiently high to deliver superior performance.

[Figure: speed-up (vertical axis, 0 to 50) plotted against vectorization ratio (horizontal axis, 0% to 100%); annotated points: 2x, 4.6x, 25x]

* This graph is based on Amdahl's law and assumes a 50x performance improvement at a vectorization ratio of 100%.

Bring the vectorization ratio as close as possible to 100%.
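The curve can be written out as a worked equation. The symbol r for the vectorization ratio is introduced here for illustration; the pairing of the 2x, 4.6x, and 25x labels with ratios of about 50%, 80%, and 98% is inferred from the formula and is not stated on the slide.

```latex
S(r) = \frac{1}{(1 - r) + \dfrac{r}{50}}
\qquad
\begin{aligned}
S(0.50) &= \frac{1}{0.50 + 0.010} \approx 1.96 \\
S(0.80) &= \frac{1}{0.20 + 0.016} \approx 4.63 \\
S(0.98) &= \frac{1}{0.02 + 0.0196} \approx 25.3 \\
S(1.00) &= 50
\end{aligned}
```

Going from 98% to 100% doubles the speed-up, which is why the last few percent of vectorization ratio matter most.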

Loop length (Vector length)

▌A sufficiently long loop length increases the benefit of vectorizing a loop.

[Figure: execution time plotted against loop length for the unvectorized and the vectorized loop; marked: start-up time, cross length]

Make the loop length as long as possible.
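One simple way to read the figure is a linear cost model. This is a sketch under stated assumptions: the symbols below (loop length n, per-element times t_s and t_v, start-up time T_0) are introduced here and do not appear in the slides, and real execution time is not exactly linear.

```latex
T_{\text{unvectorized}}(n) = n\,t_s,
\qquad
T_{\text{vectorized}}(n) = T_0 + n\,t_v \quad (t_v < t_s),
\qquad
n_{\text{cross}} = \frac{T_0}{t_s - t_v}
```

Under this model the vectorized loop is faster only for loop lengths above n_cross, and its advantage grows with n, which is consistent with the advice to make loops as long as possible.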

Steps for performance improvement

(1) Analyze the performance of the whole program.
    Tool: PROGINF

(2) Identify the procedures (functions and subroutines) that are performance bottlenecks.
    Tool: FTRACE

(3) Identify the loops and array expressions that are performance bottlenecks.
    Tools: diagnostic messages, format list, FTRACE REGION

(4) Apply measures for vectorization.
    Means: compiler directives, source modifications

PROGINF

▌PROGINF shows information about program execution, such as the execution time and the number of executed instructions.

▌How to use
(1) Link the program with the “-proginf” option.
(2) Execute the program with the run-time option “VE_PROGINF” set to “YES” or “DETAIL”.
(3) The information is written to standard error at the end of program execution.

Example : VE_PROGINF=YES

  ********  Program  Information  ********
  Real Time (sec)               :       121.233126
  User Time (sec)               :       121.228955
  Vector Time (sec)             :       106.934651
  Inst. Count                   :     119280358861
  V. Inst. Count                :      29274454500
  V. Element Count              :    6389370973939
  V. Load Element Count         :    3141249840232
  FLOP Count                    :    3182379290112
  MOPS                          :     58637.969529
  MOPS (Real)                   :     58635.323545
  MFLOPS                        :     26251.407833
  MFLOPS (Real)                 :     26250.223262
  A. V. Length                  :       218.257559
  V. Op. Ratio (%)              :        98.733828
  L1 Cache Miss (sec)           :         0.639122
  VLD LLC Hit Element Ratio (%) :        73.051749
  Memory Size Used (MB)         :     27532.000000

  Start Time (date)             : Sun Jul 22 20:09:51 2018 JST
  End   Time (date)             : Sun Jul 22 20:11:52 2018 JST

Points to check:
■ Does “Vector Time” account for most of “User Time”?
■ Is “A. V. Length” (average vector length) sufficiently long? (Is the length close to 256?)
■ Is “V. Op. Ratio” (vector operation ratio) sufficiently high? (Is the ratio close to 100%?)
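The first two checks can be worked through with the numbers from the example output. Note that the second relation (average vector length as vector elements per vector instruction) is inferred here from the example values, which it reproduces, rather than quoted from the tool's documentation.

```latex
\frac{\text{Vector Time}}{\text{User Time}}
  = \frac{106.934651}{121.228955} \approx 0.882
\qquad
\frac{\text{V. Element Count}}{\text{V. Inst. Count}}
  = \frac{6389370973939}{29274454500} \approx 218.26
```

In this run about 88% of the user time is spent in vector execution, and each vector instruction processes about 218 elements on average, against the maximum of 256 mentioned above.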

FTRACE

▌FTRACE collects performance analysis information for each procedure.

▌How to use
(1) Compile and link the program with the “-ftrace” option.
(2) The analysis information file (ftrace.out) is generated at the end of program execution.
(3) Running the ftrace command writes the analysis list to standard output.
    (Example) ftrace -f ftrace.out

*------*
FTRACE ANALYSIS LIST
*------*

Execution Date : Sun Jul 22 21:10:50 2018 JST
Total CPU Time : 0:02'02"434 (122.434 sec.)

FREQUENCY       EXCLUSIVE AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE  CPU PORT  VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS      CONF  HIT E.%

      512   58.110( 47.5)   113.496  68380.5  30870.9  99.27  223.8   58.110    0.000     0.000    74.26 SUB1
      510   31.763( 25.9)    62.280  69474.6  31552.6  99.31  223.6   31.762    0.000     0.000    76.59 SUB2
        2    9.056(  7.4)  4527.963   3198.9      0.0  14.91  205.2    0.062    1.272     0.000    73.33 SUB3
      459    7.732(  6.3)    16.846  55793.0  23438.2  98.44  163.7    7.732    0.000     0.000    51.84 SUB4
      459    7.037(  5.7)    15.332  56239.0  26694.7  98.84  212.3    7.037    0.000     0.000    62.00 SUB5
  2097152    6.667(  5.4)     0.003   2816.0    322.1  22.88  256.0    0.282    0.283     0.000   100.00 SUB6
        4    1.355(  1.1)   338.818  19218.8  11094.9  98.75  243.9    1.355    0.000     0.000     0.82 SUB7
      463    0.448(  0.4)     0.968  57096.7      0.0  95.73  176.4    0.448    0.000     0.000     0.00 SUB8
     1483    0.141(  0.1)     0.095  18966.1      0.0  97.87  205.4    0.141    0.000     0.048     2.59 SUB9
  2099270    0.122(  0.1)     0.000   3056.8     17.2   0.00    0.0    0.000    0.000     0.000     0.00 SUB10
       51    0.001(  0.0)     0.017    667.2      0.0   0.00    0.0    0.000    0.001     0.000     0.00 SUB11
        1    0.000(  0.0)     0.292    444.7      0.6   0.01    8.0    0.000    0.000     0.000     0.00 MAIN_
(omitted)
------
  4201150  122.434(100.0)     0.029  58071.6  25992.7  98.59  218.3  106.929    1.558     0.048    73.05 total

Column notes:
■ FREQUENCY: calling count of a function
■ EXCLUSIVE TIME[sec]( % ): exclusive CPU time and its ratio to the CPU time of the whole program
■ AVER.TIME [msec]: average CPU time required for one execution
■ MOPS through VLD LLC HIT E.%: the same information items as PROGINF

Format list

▌The format list shows various information (vectorization and parallelization of loops, inline expansion of procedure calls, and so on) alongside the relevant source lines.

▌How to use
(1) Compile the program with the “-report-format” option. The “-report-all” option is recommended instead, because it outputs the format list together with diagnostic messages.
(2) The format list is written to a file whose name has the suffix “.L”.

539: +------>         do i3=2,n3-1
540: |+----->            do i2=2,n2-1
541: ||V---->               do i1=1,n1
542: |||                       r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
543: |||             >                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
544: |||                       r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
545: |||             >                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
546: ||V----                enddo
547: ||V---->               do i1=2,n1-1
548: |||      F                u(i1,i2,i3) = u(i1,i2,i3)
549: |||             >                     + c(0) * r(i1,i2,i3)
550: |||             >                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
551: |||             >                              + r1(i1) )
552: |||             >                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
553: |||        c------
554: |||        c Assume c(3) = 0 (Enable line below if c(3) not= 0)
555: |||        c------
556: |||        c    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
557: |||        c------
558: ||V----                enddo
559: |+-----             enddo
560: +------          enddo

Format list output examples

▌Optimization information, such as loop structure and vectorization status, is indicated by symbols.

V------> DO I=1,N
|                          Whole loop is vectorized
V------  END DO

+------> DO I=1,N
|                          Loop is not vectorized
+------  END DO

S------> DO I=1,N
|                          Loop is partially vectorized (this is called “partial vectorization”)
S------  END DO


W------> DO I=1,N
|*----->   DO J=1,M
||                         Nested loops are collapsed
|*-----    END DO
W------  END DO

X------> DO I=1,N
|*----->   DO J=1,M
||                         Nested loops are interchanged
|*-----    END DO
X------  END DO

U------> DO I=1,N
|V----->   DO J=1,M
||                         Outer loop is unrolled and inner loop is vectorized
|V-----    END DO
U------  END DO


▌Optimization information for each statement is indicated at column 17.

547: ||V---->               do i1=2,n1-1
548: |||      F                u(i1,i2,i3) = u(i1,i2,i3)
549: |||             >                     + c(0) * r(i1,i2,i3)
550: |||             >                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)
551: |||             >                              + r1(i1) )
552: |||             >                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
553: |||        c------
554: |||        c Assume c(3) = 0 (Enable line below if c(3) not= 0)
555: |||        c------
556: |||        c    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
557: |||        c------
558: ||V----                enddo
559: |+-----             enddo
560: +------          enddo

F : Vector fused-multiply-add instruction is generated

I : Routine call is inlined

M : Nested loops are replaced with a vector matrix multiply routine

G : Vector gather instruction is generated

FTRACE REGION (User-specified regions)

▌Users can specify regions of a program for which performance information is collected. Such a region is called a user-specified region.

 55: |                CALL FTRACE_REGION_BEGIN("Loop")
 56: |+----->         DO 10 K=KS,KE
 57: ||         C
 58: ||+---->         DO 20 MM=MMIN,MMAX
 59: |||              ID1=MM-JE
 60: |||              ID2=MM-JS
 61: |||              IS=MAX0(ID1,IIS)
 62: |||              IE=MIN0(ID2,IIE)
 63: |||        C
 64: |||+--->         DO 30 I=IS,IE
 65: ||||
          (calculation using a hyperplane method)
506: ||||
507: |||+---       30 continue
508: ||+----       20 continue
509: |+-----       10 continue
510: |                CALL FTRACE_REGION_END("Loop")

FREQUENCY       EXCLUSIVE AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE  CPU PORT  VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS      CONF  HIT E.%

       10   32.966( 13.6)  3296.625    818.5    245.5   0.00    0.0    0.000   19.977     0.000     0.00 Loop

Compiler directives

▌The NEC compiler can vectorize more code when it is given an optimization hint through a compiler directive.

 55: |
 56: |+----->         DO 10 K=KS,KE
 57: ||         C
 58: ||+---->         DO 20 MM=MMIN,MMAX
 59: |||              ID1=MM-JE
 60: |||              ID2=MM-JS
 61: |||              IS=MAX0(ID1,IIS)
 62: |||              IE=MIN0(ID2,IIE)
 63: |||        C
 64: |||        !$NEC ivdep
 65: |||V--->         DO 30 I=ISTA,IEND
 66: ||||
          (calculation using a hyperplane method)
506: ||||
507: |||V---       30 continue
508: ||+----       20 continue
509: |+-----       10 continue
510: |

Before using a compiler directive

FREQUENCY       EXCLUSIVE AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE  CPU PORT  VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS      CONF  HIT E.%

       10   32.966( 13.6)  3296.625    818.5    245.5   0.00    0.0    0.000   19.977     0.000     0.00 Loop

After using a compiler directive (the loop is vectorized and runs almost 20 times faster)

FREQUENCY       EXCLUSIVE AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE  CPU PORT  VLD LLC PROC.NAME
          TIME[sec](  % )    [msec]                    RATIO  V.LEN     TIME     MISS      CONF  HIT E.%

       10    1.703(  1.3)   170.257  18980.1   8210.0  96.69   69.4    1.368    0.181     0.046    99.66 Loop
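As an illustration of where such a hint is typically needed, the sketch below updates an array through an index array. This is a hypothetical loop, not the hyperplane loop from the slide, whose body is not shown. The compiler generally cannot prove that the index values are all distinct, so it has to assume a possible dependency; the `!$NEC ivdep` directive asserts that there is none. Other compilers treat the directive line as an ordinary comment.

```fortran
module ivdep_sketch
  implicit none
contains
  ! Scales selected elements of a in place.
  ! The directive is only valid if every idx(i) is distinct;
  ! with repeated indices the vectorized result could be wrong.
  subroutine scale_indexed(a, idx, factor)
    real,    intent(inout) :: a(:)
    integer, intent(in)    :: idx(:)
    real,    intent(in)    :: factor
    integer :: i
!$NEC ivdep
    do i = 1, size(idx)
      a(idx(i)) = a(idx(i)) * factor
    end do
  end subroutine scale_indexed
end module ivdep_sketch
```

The directive shifts responsibility for correctness to the programmer, so it is worth confirming the no-dependency assumption before adding it and checking the format list afterwards to see whether the loop mark changed to V.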

SX-Aurora TSUBASA performance tools

NEC Software Development Kit for Vector Engine

NEC SDK for VE includes the compilers, scientific computing libraries, performance analysis tool, and other components needed to develop and optimize software for SX-Aurora TSUBASA.

NEC Software Development Kit for Vector Engine (NEC SDK for VE)

Compilers
• Fortran Compiler
• C/C++ Compiler

Scientific computing libraries
• NEC Numeric Library Collection

Debugger
• NEC Parallel Debugger

Performance analysis tool
• NEC Ftrace Viewer

NEC SDK for VE - Fortran Compiler, C/C++ Compiler

Compilers with advanced automatic vectorization and parallelization

▌Automatic vectorization
■ Detects vectorizable loops automatically and vectorizes them
■ Vectorizes loops that include IF blocks and reduction operations
■ Transforms nested loops so that vectorization improves performance efficiently

▌Automatic parallelization
■ Detects task-parallelizable loops automatically and parallelizes them using shared-memory parallelization
■ Provides various parallelization methods, such as conditional parallelization, which selects parallel or non-parallel code at execution time

▌Supported language standards
■ Fortran 2003
• Partial support for the Fortran 2008/2015 language standards (the range of support is scheduled to expand gradually)
• Compatible with other vendors' language specifications, such as the Cray pointer
■ C11/C++14
• Partial support for the C++17 language standard
• Compatible with GCC language specifications
■ OpenMP 4.5
• Support for a de facto standard Application Programming Interface (API) for shared-memory parallelization

NEC SDK for VE - NEC Numeric Library Collection

NEC Numeric Library Collection maximizes hardware performance

▌Performance maximization of functions frequently used in numerical analysis
■ BLAS, LAPACK, ScaLAPACK
■ FFTW Interface
• The FFTW API can be used just by replacing a header file
• Uses kernels of the NEC FFT library, which has been thoroughly tuned for performance

▌High performance with modern numerical analysis algorithms
■ Mersenne Twister pseudo-random numbers
• Generates high-quality random numbers fast
■ Sobol quasi-random numbers
• Needed for the quasi-Monte Carlo method

▌Reduction of calculation time by offloading processing between VE and Xeon
■ Direct sparse solver
• Uses VE and Xeon appropriately for vector-oriented and scalar-oriented calculations
• Solves fast by exploiting the advantages of the respective architectures

[Figure labels: Applications, FFTW Interface, FFT library developed by NEC, NEC Numeric Library Collection, Direct solver developed by NEC, Scalar processing, Vector processing, Xeon]


NEC Numeric Library Collection consists of the following libraries.

ASL Unified Interface
  Fourier Transforms, Random Number Generators

FFTW Interface
  Interface library to use the Fourier Transform functions of ASL with the FFTW (version 3.x) API

ASL
  Basic Matrix Algebra, Simultaneous Linear Equations, Eigenvalues and Eigenvectors, Fourier Transforms, Spline Functions, Approximation and Interpolation, Numerical Differentials, Numerical Integration, Roots of Equations, Extremum Problems and Optimization, Differential Equations and Their Applications, Special Functions, Random Number, Sorting and Ranking, Probability Distributions, Sample Statistics, Tests and Estimates, Analysis of Variance and Design of Experiments, Nonparametric Tests, Multivariate Analysis, Time Series Analysis, Regression Analysis

BLAS
  Basic vector and matrix operations

LAPACK
  Simultaneous linear equations, Eigenvalues and Eigenvectors

ScaLAPACK
  Simultaneous linear equations, Eigenvalues and Eigenvectors (for distributed memory parallel programs)

CBLAS
  C interface to BLAS

SBLAS
  Basic operations of sparse matrices

NEC SDK for VE - NEC Parallel Debugger

GUI debugger supporting VE (Eclipse plugin)

[Figure: debugger configuration. Labels: local machine (debug operation on screen), front-end machine (compile), file server (Makefile, source, environment file, VE program ve.out), SDM, gdb, VE process, SX-Aurora TSUBASA]

SDM: Scalable Debug Manager

NEC SDK for VE - NEC Ftrace Viewer

Performance analysis GUI tool

[Figure: NEC Ftrace Viewer screens, showing:]
■ Maximum, minimum, average, and standard deviation of exclusive CPU time for each function
■ Exclusive CPU time for each function
■ Exclusive CPU time and vector operation ratio for each function
■ Exclusive CPU time for each process
■ Elapsed time, MPI communication time, and MPI communication idle time for each MPI process in a function

NEC MPI

NEC MPI consists of commands and a library for distributed-memory parallel programming.

Key features
■ Compliant with the latest MPI specification (MPI 3.1)
■ Supports InfiniBand (EDR) as the interconnect
■ Automatic selection of the optimal inter-process communication method and high-speed communication using zero-copy transfer
• Uses shared memory within the same VE
• Selects DMA transfer or InfiniBand transfer between VEs according to process placement

[Figure: Xeon hosts with InfiniBand (IB) adapters and VEs, each VE with its own memory. Labels: shared memory within VE, DMA transfer between VEs, InfiniBand transfer between VEs]

Conclusion

▌Vectorization itself is not too difficult.

■ Basically, the compiler performs vectorization automatically.
■ Give an optimization hint through a compiler directive when the compiler cannot determine whether a loop can be vectorized.

▌As future topics …

■ Efficient memory access
■ Vectorization with source modifications
■ Efficient parallelization using automatic parallelization/OpenMP/MPI
■ Performance improvement techniques using vector host (VH) and vector engine (VE)
■ and others

We would like to introduce various methods for getting superior performance out of SX-Aurora TSUBASA, along with useful examples.
