Introduction to Performance Analysis and Vectorization on SX-Aurora TSUBASA

July 27, 2018
NEC Corporation
© NEC Corporation 2018

Contents
  What is vectorization?
  Is vectorization difficult?
  Basic points to promote vectorization
  SX-Aurora TSUBASA performance tools

"The most efficient way to execute a vectorizable application is a vector processor."
Jim Smith, International Symposium on Computer Architecture (1994)
(Source: Computer Architecture: A Quantitative Approach, Sixth Edition)


What is vectorization?

Scalar data and vector data
  A variable, or a single element of an array, is called scalar data.
  An orderly arranged sequence of scalar data, such as a row, column, or diagonal of a matrix, is called vector data.
  Processing such vector data simultaneously is called vectorization.

Scalar processing and vector processing
▌Scalar processing

  DO I = 1,100
    C(I) = A(I) + B(I)
  END DO

  The addition is computed 100 times, one element at a time:

  C(1)   = A(1)   + B(1)
  C(2)   = A(2)   + B(2)
  C(3)   = A(3)   + B(3)
  …
  C(99)  = A(99)  + B(99)
  C(100) = A(100) + B(100)

▌Vector processing

  DO I = 1,100
    C(I) = A(I) + B(I)
  END DO

  Multiple data are computed at once:

  C(1)        A(1)        B(1)
  C(2)        A(2)        B(2)
  C(3)    =   A(3)    +   B(3)
  …           …           …
  C(99)       A(99)       B(99)
  C(100)      A(100)      B(100)


Is vectorization difficult?

Advanced automatic vectorization
▌The NEC compiler analyzes the source files and generates an executable that is automatically vectorized as much as possible.
  Automatic: detects vectorizable loops automatically and vectorizes them.
  Powerful: vectorizes loops that include IF blocks and reduction operations.
  Efficient: transforms nested loops so that vectorization improves performance efficiently.

Vectorization conditions
▌Some conditions must be satisfied for the compiler to vectorize a loop automatically.
  The loops, statements, types, and operations must be ones that vectorization supports.
    For example, the following loops are not vectorized:
      Loops containing variables or arrays of character type or quadruple-precision real type
      Loops containing procedure calls (functions and subroutines)
      Loops containing I/O statements
  There must be no dependency between the definitions and references of array elements within the same loop.
  Vectorization must be expected to improve performance.
    For example, the compiler judges that vectorization improves performance in the following cases:
      The loop length is sufficiently long for vectorization.
      The performance gained by vectorization outweighs the added cost of the loop transformations that vectorization requires.

Range of vectorization subjects
  Vectorizable loops:
    Array expression, DO loop, DO WHILE loop, loop consisting of an IF construct and a GOTO statement
  Vectorizable statements:
    Numeric and logical intrinsic assignments, CONTINUE, GOTO, CYCLE, EXIT, IF construct, CASE construct
    Not subject to vectorization: CALL, I/O statements, pointer assignment statements
  Vectorizable types:
    4-byte and 8-byte integer types, 4-byte and 8-byte logical types, real type, double-precision real type, complex type, double-precision complex type
    Not subject to vectorization: character type, quadruple-precision real type, quadruple-precision complex type, 2-byte integer type, single-byte logical type, derived type
  Vectorizable operations:
    Four arithmetic operations, logical operations, numeric relational operations, power operation, type conversions, intrinsic procedures
    Not subject to vectorization: user-defined operations

Dependency
▌A loop with a dependency cannot be vectorized.

  DO I = 1, N
    A(I+1) = A(I)*B(I) + C(I)
  END DO

  Execution order in scalar processing    Execution order in vector processing
  A(2) = A(1)*B(1)+C(1)                   A(2) = A(1)*B(1)+C(1)
  A(3) = A(2)*B(2)+C(2)                   A(3) = A(2)*B(2)+C(2)
  A(4) = A(3)*B(3)+C(3)                   A(4) = A(3)*B(3)+C(3)
  …                                       …

  In scalar processing, each statement after the first reads the element of A updated by the preceding statement (A(3) is computed from the updated A(2), and so on). In vector processing, the elements A(1), A(2), A(3), … on the right-hand side are read as elements of the original array before any updated element is stored, so the results differ from scalar processing.


Basic points to promote vectorization

Essential points for vectorization
  Raising the vectorization ratio (vector operation ratio)
  Lengthening loops
  Making memory access efficient

Vectorization ratio
  [Figure: two bars, the execution time in scalar execution with its vectorizable part marked, and the shorter execution time after vectorization.]

  Vectorization ratio = (execution time of vectorizable parts) / (execution time in scalar execution)

Vectorization ratio and performance improvement
▌The vectorization ratio needs to be sufficiently high to deliver superior performance.
  [Figure: speed-up (0 to 50) plotted against vectorization ratio (0% to 100%); points on the curve are labeled 2x, 4.6x, and 25x.]
  * The graph assumes, based on Amdahl's law, that the performance improvement is 50x at a vectorization ratio of 100%.
  Make the vectorization ratio as close to 100% as possible.

Loop length (Vector length)
▌A sufficiently long vectorized loop increases the effect of vectorization.
  [Figure: execution time plotted against loop length for an unvectorized and a vectorized loop; the vectorized line starts at the start-up time and crosses the unvectorized line at the cross length.]
  Make the loop length as long as possible.

Steps for performance improvement
  Analyzing the performance of the whole program: PROGINF
  Identifying the procedures (functions and subroutines) that are performance bottlenecks: FTRACE
  Identifying the loops and array expressions that are performance bottlenecks: diagnostic messages, format list, FTRACE REGION
  Applying measures for vectorization: compiler directives, source modifications

PROGINF
▌PROGINF shows information about program execution, such as the execution time and the number of executed instructions.
▌How to use
  (1) Link the program with the “-proginf” option.
  (2) Execute the program with the run-time option “VE_PROGINF” set to “YES” or “DETAIL”.
  (3) The information is written to standard error at the end of program execution.

  Example: VE_PROGINF=YES

  ******** Program Information ********
    Real Time (sec)               :       121.233126
    User Time (sec)               :       121.228955
    Vector Time (sec)             :       106.934651
    Inst. Count                   :     119280358861
    V. Inst. Count                :      29274454500
    V. Element Count              :    6389370973939
    V. Load Element Count         :    3141249840232
    FLOP Count                    :    3182379290112
    MOPS                          :     58637.969529
    MOPS (Real)                   :     58635.323545
    MFLOPS                        :     26251.407833
    MFLOPS (Real)                 :     26250.223262
    A. V. Length                  :       218.257559
    V. Op. Ratio (%)              :        98.733828
    L1 Cache Miss (sec)           :         0.639122
    VLD LLC Hit Element Ratio (%) :        73.051749
    Memory Size Used (MB)         :     27532.000000
    Start Time (date)             : Sun Jul 22 20:09:51 2018 JST
    End Time (date)               : Sun Jul 22 20:11:52 2018 JST

  Points to check:
    Does “Vector Time” account for a dominant share of “User Time”?
    Is “A. V. Length” (average vector length) sufficiently long? (Is the length close to 256?)
    Is “V. Op. Ratio” (vector operation ratio) sufficiently high? (Is the ratio close to 100%?)

FTRACE
▌FTRACE collects performance analysis information for each procedure.
▌How to use
  (1) Compile and link the program with the “-ftrace” option.
  (2) The analysis information file (ftrace.out) is generated at the end of program execution.
  (3) Executing the ftrace command writes the analysis list to standard output.
      (Example) ftrace -f ftrace.out

  Columns of the analysis list:
    FREQUENCY: number of times the function was called
    EXCLUSIVE TIME[sec]( % ): exclusive CPU time and its ratio to the CPU time of the whole program
    AVER.TIME[msec]: average CPU time required for one execution
    Remaining columns: the same information items as PROGINF

*----------------------*
  FTRACE ANALYSIS LIST
*----------------------*

Execution Date : Sun Jul 22 21:10:50 2018 JST
Total CPU Time : 0:02'02"434 (122.434 sec.)

FREQUENCY       EXCLUSIVE AVER.TIME     MOPS   MFLOPS   V.OP  AVER.   VECTOR  L1CACHE  CPU PORT  VLD LLC  PROC.NAME
           TIME[sec]( % )    [msec]                    RATIO  V.LEN     TIME     MISS      CONF  HIT E.%

      512   58.110( 47.5)   113.496  68380.5  30870.9  99.27  223.8   58.110    0.000     0.000    74.26  SUB1
      510   31.763( 25.9)    62.280  69474.6  31552.6  99.31  223.6   31.762    0.000     0.000    76.59  SUB2
        2    9.056(  7.4)  4527.963   3198.9      0.0  14.91  205.2    0.062    1.272     0.000    73.33  SUB3
      459    7.732(  6.3)    16.846  55793.0  23438.2  98.44  163.7    7.732    0.000     0.000    51.84  SUB4
      459    7.037(  5.7)    15.332  56239.0  26694.7  98.84  212.3    7.037    0.000     0.000    62.00  SUB5
  2097152    6.667(  5.4)     0.003   2816.0    322.1  22.88  256.0    0.282    0.283     0.000   100.00  SUB6
        4    1.355(  1.1)   338.818  19218.8  11094.9  98.75  243.9    1.355    0.000     0.000     0.82  SUB7
      463    0.448(  0.4)     0.968  57096.7      0.0  95.73  176.4    0.448    0.000     0.000     0.00  SUB8
     1483    0.141(  0.1)     0.095  18966.1      0.0  97.87  205.4    0.141    0.000     0.048     2.59  SUB9
  2099270    0.122(  0.1)     0.000   3056.8     17.2   0.00    0.0    0.000    0.000     0.000     0.00  SUB10
       51    0.001(  0.0)     0.017    667.2      0.0   0.00    0.0    0.000    0.001     0.000     0.00  SUB11
        1    0.000(  0.0)     0.292    444.7      0.6   0.01    8.0    0.000    0.000     0.000     0.00  MAIN_
(omitted)
------------------------------------------------------------------------------------------------------------------
  4201150  122.434(100.0)     0.029  58071.6  25992.7  98.59  218.3  106.929    1.558     0.048    73.05  total

Format list
▌The format list shows information such as the vectorization and parallelization status of loops and the inline expansion of procedure calls, alongside the corresponding source lines.
▌How to use
  (1) Compile the program with the “-report-format” option.
      The “-report-all” option is recommended instead of “-report-format”, because it outputs the format list together with the diagnostic messages.
  (2) The format list is written to a file whose name has the suffix “.L”.

  539: +------>     do i3=2,n3-1
  540: |+----->       do i2=2,n2-1
  541: ||V---->         do i1=1,n1
  542: |||                r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)
  543: |||         >             + r(i1,i2,i3-1) + r(i1,i2,i3+1)
  544: |||                r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)
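
The relation between vectorization ratio and speed-up described under "Vectorization ratio and performance improvement" can be sketched with Amdahl's law. This is a minimal illustration, not NEC tooling: the function name `amdahl_speedup` and its parameter names are hypothetical, and the 50x gain for fully vectorized code is the assumption stated in the graph's footnote.

```python
def amdahl_speedup(vector_ratio, vector_gain=50.0):
    """Overall speed-up when the fraction `vector_ratio` of the scalar
    execution time runs `vector_gain` times faster (Amdahl's law).

    vector_gain=50.0 is the assumption from the graph's footnote.
    """
    if not 0.0 <= vector_ratio <= 1.0:
        raise ValueError("vector_ratio must be between 0 and 1")
    scalar_part = 1.0 - vector_ratio          # time that stays scalar
    vector_part = vector_ratio / vector_gain  # vectorizable time after speed-up
    return 1.0 / (scalar_part + vector_part)


# Print the speed-up for a few vectorization ratios.
for ratio in (0.0, 0.2, 0.5, 0.8, 0.9, 0.98, 1.0):
    print(f"{ratio:5.0%}  {amdahl_speedup(ratio):5.1f}x")
```

Under this assumption a ratio of 50% gives roughly 2x and 80% gives roughly 4.6x, consistent with two of the labels on the graph. Even at 98% the result is only about 25x, half of the 50x ceiling, which is why the ratio should be pushed as close to 100% as possible.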

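The loop from the "Dependency" slide, A(I+1) = A(I)*B(I) + C(I), can be simulated to see why it cannot be vectorized. The sketch below is an illustration only: `scalar_order` and `vector_order` are hypothetical helper names, and `vector_order` models a naive vector execution that loads every right-hand-side element before storing any result, which is an assumption about the execution model rather than a description of the hardware.

```python
def scalar_order(a, b, c):
    """Execute A(I+1) = A(I)*B(I) + C(I) one iteration at a time."""
    a = list(a)
    for i in range(len(b)):
        # a[i] was already updated by the previous iteration (for i > 0).
        a[i + 1] = a[i] * b[i] + c[i]
    return a


def vector_order(a, b, c):
    """Naive vector execution: load all A(I), compute, then store all A(I+1)."""
    a = list(a)
    n = len(b)
    loaded = a[:n]  # every right-hand-side element is an original value
    results = [x * y + z for x, y, z in zip(loaded, b, c)]
    a[1:n + 1] = results
    return a


# Sample data chosen for illustration: A has N+1 elements, B and C have N.
a = [1.0, 0.0, 0.0, 0.0]
b = [2.0, 2.0, 2.0]
c = [1.0, 1.0, 1.0]

print("scalar:", scalar_order(a, b, c))
print("vector:", vector_order(a, b, c))
```

With this sample data the two orders agree only on the first updated element and diverge afterwards, because the scalar loop feeds each result into the next iteration while the vector form never sees the updated values.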