NEC SX-Aurora TSUBASA Code Porting Workshop

CAU Kiel: NEC SX Aurora TSUBASA migration workshop
2018/11/16
Dr. Jens-Olaf Beismann, Senior Benchmarking Analyst, NEC Deutschland GmbH
© NEC Deutschland GmbH 2018

Agenda
10:00 – 10:30  SX Aurora Tsubasa: technology overview
10:30 – 11:30  Migration to SX-Aurora Tsubasa: compilers, libraries, …
11:30 – 12:00  RZ Kiel: overview
Lunch break
13:00 – 17:00  Hands-on session: porting, run-time, performance

SX Aurora TSUBASA: Technology overview
- Dedicated vector processor
- High memory bandwidth
- Commodity processors
- De facto standard x86/Linux environment

Brand-new Vector Supercomputer
- TSUBASA means "wing" in Japanese.
- High performance: 1.22 TB/s per processor, 150 GB/s per core
- Linux OS
- Fortran/C/C++ programming, OpenMP
- Automatic vectorization/parallelization

Strategy
- Linux open environment: Linux assets, user environment, library, tool
- VE high performance: high sustained performance on applications
- x86: peripherals on the x86/Linux environment

Architecture Inherited and Changed
[Figure: the previous SX (Super-UX, application, processor with cores, SPU, VPU, memory, storage) compared with SX-Aurora TSUBASA (x86 server running Linux, VE processor with cores, SPU, VPU, memory, VE OS, application).]

Aurora Architecture
- x86 node + Vector Engine (VE)
- VE capability is provided on x86/Linux environment
- VH = Vector Host (x86 node, Linux OS); VE = Vector Engine (VE OS)
- Form factors: rack mount servers, supercomputer, desktop tower

GPGPU and VE Processor
VE 1.0 peak performance: 2.45 TF at 1.6 GHz
GPGPU architecture
- OS and AP run on the x86 CPU; functions are offloaded to the GPGPU via CUDA over PCIe.
- Flow: exec, data transmission, start processing, end processing, result transmission, exit.
- Frequent PCIe transmission leads to a PCIe bottleneck.
- Disadvantages: small memory, programming difficulty.

Aurora architecture
- The whole AP is executed on the VE; the x86 side handles OS functions, I/O, etc.
- Avoids the PCIe bottleneck.
- Advantages: larger memory, standard language.

VE 1.0 specification
  cores/CPU           8
  core performance    ~307 GF (DP), ~614 GF (SP)
  CPU performance     ~2.45 TF (DP), ~4.91 TF (SP)
  cache capacity      16 MB shared
  memory bandwidth    1.22 TB/s
  memory capacity     24, 48 GB
[Figure: 8 cores at 307 GF each, bandwidth labels 0.4 TB/s and 3 TB/s, 16 MB software-controllable cache, 1.22 TB/s to HBM2 memory x 6.]

Product
- Processor card developed by NEC
- World's highest memory bandwidth: 1.22 TB/s per processor
- Products of SX-Aurora TSUBASA: A100 Tower, A300 Server, A500 DLC Supercomputer (1 VE, 2 VE, 4 VE, 8 VE)
- CAU Kiel: 17.2 TF

Core Architecture
[Figure: single core with replicated vector pipelines, each containing VFMA0, VFMA1, VFMA2, ALU0, ALU1 and DIV units, plus the SPU (Scalar Processing Unit).]
- Peak performance: 268.8 GF = 32 Flops/cycle x 2 (FMA) x 3 x 1.4 GHz
- Memory bandwidth: 400 GB/s per core (average 150 GB/s per core), 1.22 TB/s per processor

Characteristics
[Figure: positioning of Vector Engine, GPGPU and Xeon® by memory bandwidth per processor (standard to high specification) and by programming language (standard to special).]

Fundamental Benchmarks
- STREAM: VE has the highest sustained memory bandwidth per node.
- HPL: VE provides competitive FLOPS capability, in the same range of sustained HPL performance as SKL/KNL.
[Figures: STREAM / node, HPL / node.]

Performance/Price
High price competitiveness:
- The highest STREAM sustained performance / price
- Competitive HPL sustained performance / price, in the same range as Intel products
[Figures: STREAM / price, HPL / price.]

HPCG
- VE provides high HPCG performance per node and per price.
- HPL (performance bound) and STREAM (memory bandwidth bound) are the bookends of the benchmark spectrum, and HPCG stands between them.
[Figures: HPCG / node and HPCG / price as performance ratio of Aurora to SKL, both labelled 3x.]

Usability
SX Aurora TSUBASA user environment

Programming Environment
Vector cross compiler with automatic vectorization and automatic parallelization:
- Fortran: F2003, F2008 (partially)
- C: C11
- C++: C++14
- OpenMP: OpenMP 4.5
- MPI: MPI 3.1
Compile on the VH:
  $ vi sample.c
  $ ncc sample.c
Execution environment (started on the VH, executed on the VE):
  $ a.out

Compilers
- Cross compilers: nfort, ncc, nc++
- Tools: nld, nar, nranlib, …
- MPI wrappers: mpinfort, mpincc, mpinc++

Programming Environment
NEC supports the latest language standards along with GNU compatibility.
- C/C++: ISO/IEC 9899:2011 (aka C11), ISO/IEC 14882:2014 (aka C++14)
- Fortran: ISO/IEC 1539-1:2004 (aka Fortran 2003), ISO/IEC 1539-1:2010 (aka Fortran 2008)
- OpenMP: Version 4.5
- Libraries: libc, MPI Version 3.1 (fully tuned for Aurora architecture), numeric libraries (BLAS, FFT, LAPACK, etc.)
- Tools: GNU Profiler (gprof), GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP), FtraceViewer / PROGINF

Options
  Previous SX option    SX-Aurora TSUBASA option
  -Caopt                -O4
  -Chopt                -O3
  -Cvopt                -O2
  -Cvsafe               -O1
  -Cnoopt               -O0
  -Omove                -fmove-loop-invariants-unsafe
  -Onomovediv           -fmove-loop-invariants
  -Onomove              -fno-move-loop-invariants
  -Popenmp              -fopenmp
  -pi,auto              -finline-functions   # no cross-file inlining
  …

Options
Compiler diagnostics / listings:
  -fdiag-vector=0|1|2     # more or less detailed vectorization diagnostics
  -fdiag-parallel=0|1|2
  -fdiag-inline=0|1|2
  -report-all             # get both diagnostics and formatted listing in .L file
Default type size:
  -fdefault-integer=4|8
  -fdefault-real=4|8
  -fdefault-double=8|16
Cache usage:
  -mretain-[all|list-vector|none]

Directives
  !CDIR …               !NEC$ …
  nodep                 ivdep
  expand=n              unroll(n)
  move                  move_unsafe
  nomovediv             move
  nomove                nomove
  outerunroll=n         outerloop_unroll(n)
  unroll=n              unroll(n)
  …
Directive conversion tool: nfdirconv

Libraries
The NEC library provides a wide variety of functions and is fully tuned for the Aurora architecture.
  Function                                NEC Lib   MKL
  BLAS                                    yes       yes
  LAPACK                                  yes       yes
  ScaLAPACK                               yes       yes
  FFT                                     yes       yes
  Random number generators                yes       yes
  Direct sparse solvers                   yes       yes
  Iterative sparse solvers                yes       yes
  Functions for statistics                yes       yes
  Spline functions                        yes       yes
  Special functions                       yes
  Approximation and interpolation         yes
  Numerical differentials/integrals       yes
  Roots of equations                      yes
  Time series analysis                    yes
  Sorting and ranking                     yes

UNIX system function interface
To use extensions such as the GETARG, FLUSH, ABORT, … subroutines, compile with
  -use F90_UNIX[,F90_UNIX_ENV,…]
See Fortran Users' Guide 8.2 for details.

Endianness
SX-Aurora TSUBASA is little-endian! (Former SXs were big-endian.)
  export VE_FORT_UFMTENDIAN=10,11    (or ALL)
This sets the unit numbers of unformatted files that are to be treated as big-endian files. When the value of this variable is ALL, it applies to all unit numbers. Two or more unit numbers can be specified, separated by commas.
GNU Fortran extension: convert specifier
  open(10,file='test.dat', form='unformatted', &
       convert='big_endian')
This is non-standard Fortran, but it is supported by nfort.

Correctness
Run-time errors…?
- Compile with -traceback and set
    export VE_TRACEBACK=FULL|ALL
- Reduce optimization
- -fcheck=bounds (or all)
- Stack/heap initialization:
    -minit-stack=zero|nan
    export VE_INIT_HEAP=ZERO|NAN
- -mno-vector-fma
- export VE_FPE_ENABLE=(DIV,FOF,FUF,INV,INE)

Debugger
[Figure: debugger session with panes for process, sets, variables, functions, source code, stack trace, standard output and standard error output (debugger).]
The Eclipse Parallel Tools Platform (PTP) VE plugin provides a GUI debugging environment.

PROGINF performance information
Compile with "-proginf" and set
  export VE_PROGINF=YES|DETAIL

  ******** Program Information ********
  Real Time (sec)               : 8783.021690
  User Time (sec)               : 8753.527852
  Vector Time (sec)             : 4959.702777
  Inst. Count                   : 8018493848355
  V. Inst. Count                : 1081598389267
  V. Element Count              : 221178430822266
  V. Load Element Count         : 51426036073697
  FLOP Count                    : 140842692526140
  MOPS                          : 29663.826745
  MOPS (Real)                   : 29444.837925
  MFLOPS                        : 16155.280253
  MFLOPS (Real)                 : 16036.016282
  A. V. Length                  : 204.492197
  V. Op. Ratio (%)              : 97.317633
  L1 Cache Miss (sec)           : 801.540473
  CPU Port Conf. (sec)          : 2.435367
  V. Arith. Exec. (sec)         : 2355.046188
  V. Load Exec. (sec)           : 2346.286974
  VLD LLC Hit Element Ratio (%) : 79.566662
  Power Throttling (sec)        : 0.000000
  Thermal Throttling (sec)      : 0.000000
  Memory Size Used (MB)         : 10956.000000
  Start Time (date)             : Tue Nov 6 13:05:18 2018 CET
  End Time (date)               : Tue Nov 6 15:31:41 2018
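
The PROGINF counters can be cross-checked with a few lines of code. The following is a minimal Python sketch, not an NEC tool: `parse_proginf` and the embedded `SAMPLE` excerpt are hypothetical names introduced for illustration, and the parser assumes only the `label : value` layout of the output shown in this section. For this run, dividing V. Element Count by V. Inst. Count reproduces the reported A. V. Length; that PROGINF defines the value this way is an assumption, and the vector time share is a plain ratio rather than a PROGINF field.

```python
"""Parse PROGINF-style 'label : value' lines and derive two simple ratios."""

# Excerpt of the sample output from this section (values copied verbatim).
SAMPLE = """\
******** Program Information ********
Real Time (sec) : 8783.021690
User Time (sec) : 8753.527852
Vector Time (sec) : 4959.702777
Inst. Count : 8018493848355
V. Inst. Count : 1081598389267
V. Element Count : 221178430822266
A. V. Length : 204.492197
Start Time (date) : Tue Nov 6 13:05:18 2018 CET
"""


def parse_proginf(text):
    """Return {label: float} for every line of the form 'label : number'.

    Lines without ' : ' (the banner) or with a non-numeric value
    (the date lines) are skipped.
    """
    values = {}
    for line in text.splitlines():
        label, sep, raw = line.partition(" : ")
        if not sep:
            continue
        try:
            values[label.strip()] = float(raw)
        except ValueError:
            continue
    return values


values = parse_proginf(SAMPLE)

# Assumption: average vector length = vector elements per vector instruction.
avg_vector_length = values["V. Element Count"] / values["V. Inst. Count"]

# Share of user time spent in vector execution (illustrative ratio only).
vector_time_share = values["Vector Time (sec)"] / values["User Time (sec)"]

print(f"average vector length: {avg_vector_length:.6f}")
print(f"vector time share:     {vector_time_share:.3f}")
```

To apply the sketch to a real run, capture the PROGINF block that the program prints when VE_PROGINF is set and pass that text to `parse_proginf` in place of `SAMPLE`.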
