2018/11/16

Agenda

10:00 – 10:30  SX-Aurora TSUBASA: technology overview
10:30 – 11:30  Migration to SX-Aurora TSUBASA: libraries, …
11:30 – 12:00  RZ Kiel: overview
               Lunch break
13:00 – 17:00  Hands-on session: porting, run-time, performance

NEC SX-Aurora TSUBASA migration workshop, CAU Kiel
Dr. Jens-Olaf Beismann
Senior Benchmarking Analyst
NEC Deutschland GmbH

SX Aurora TSUBASA

Technology overview

- Dedicated
- High memory bandwidth

- Commodity processors
- De facto standard x86/Linux environment

4 © NEC Deutschland GmbH 2018

Brand-new Vector Strategy

TSUBASA: meaning "wing" in Japanese
- Linux asset
- High performance

Linux open environment

- 1.22 TB/s per processor, 150 GB/s per core
- User environment: Linux OS on x86, Vector Engine (VE) for high performance
- Environment: Fortran/C/C++ programming, OpenMP
- Library, Tool
- Automatic vectorization/parallelization
- VE: high sustained performance on applications
- x86: peripherals in the x86/Linux environment

Architecture

Aurora Architecture
- x86 node + Vector Engine (VE)
- VE capability is provided in the x86/Linux environment
- SX-Aurora TSUBASA = Vector Host (VH: x86 server processor, Linux OS) + Vector Engine (VE: VE OS, Linux application)

Inherited and Changed

[Figure: Previous SX compared with SX-Aurora TSUBASA.]
- Previous SX: application on Super-UX; processor cores with SPU, VPU and memory; storage; supercomputer
- SX-Aurora TSUBASA: application on Linux with VE OS; x86 plus processor cores with SPU, VPU and memory; storage; desktop tower, rack mount servers, supercomputer

GPGPU and VE

GPGPU Architecture (x86 and GPGPU connected via PCIe)
- OS and application (AP) run on the x86 side; functions are executed on the GPGPU (CUDA)
- Execution flow: start processing, data transmission, function execution, result transmission, end processing
- Frequent PCIe transmission
- Disadvantages: PCIe bottleneck, small memory, programming difficulty

Aurora Architecture (x86 and VE connected via PCIe)
- Whole AP is executed on the VE; OS functions (I/O, etc.) stay on the x86 side
- Advantages: avoiding the PCIe bottleneck, larger memory, standard language

Processor

VE1.0 Spec.
  cores/CPU           8
  core performance    ~307 GF (DP), ~614 GF (SP)
  CPU performance     ~2.45 TF (DP), ~4.91 TF (SP)
  cache capacity      16 MB shared, software controllable cache
  memory bandwidth    1.22 TB/s
  memory capacity     24, 48 GB

[Figure: VE processor block diagram: 8 cores at 307 GF each (2.45 TF at 1.6 GHz), bandwidth labels 0.4 TB/s and 3 TB/s, 16 MB shared cache, 1.22 TB/s to HBM2 memory x 6.]

Product

- Processor card developed by NEC
- World's highest memory bandwidth: 1.22 TB/s per processor
- Products of SX-Aurora TSUBASA: A100 Tower, A300 Server, A500 DLC Supercomputer
- VE counts shown: 1 VE, 2 VE, 4 VE, 8 VE
- CAU Kiel: 17.2 TF

Core Architecture

Single core
- SPU: Scalar Processing Unit
- Vector pipelines: VFMA0, VFMA1, VFMA2, ALU0, ALU1, DIV
- Peak performance: 268.8 GF = 32 Flops/cycle x 2 (FMA) x 3 x 1.4 GHz
- Memory bandwidth: 400 GB/s per core (average 150 GB/s per core)
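The peak-performance formula on the Core Architecture slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function and parameter names are illustrative, the 8 cores per VE come from the VE1.0 spec, and reading the 17.2 TF quoted for CAU Kiel as eight 1.4 GHz VEs is an inference, not something the slides state.

```python
def peak_gflops_per_core(clock_ghz, per_cycle=32, flops_per_fma=2, fma_pipes=3):
    """Peak double-precision GFLOPS of one VE core, following the slide's formula."""
    return per_cycle * flops_per_fma * fma_pipes * clock_ghz


CORES_PER_VE = 8  # "cores/CPU 8" in the VE1.0 spec

core_14 = peak_gflops_per_core(1.4)           # 268.8 GF per core at 1.4 GHz
ve_14_tf = core_14 * CORES_PER_VE / 1000.0    # about 2.15 TF per VE
eight_ve_tf = ve_14_tf * 8                    # about 17.2 TF (assumed: eight VEs)

core_16 = peak_gflops_per_core(1.6)           # about 307 GF per core at 1.6 GHz
ve_16_tf = core_16 * CORES_PER_VE / 1000.0    # about 2.46 TF per VE

print(f"1.4 GHz: {core_14:.1f} GF/core, {ve_14_tf:.2f} TF/VE, {eight_ve_tf:.1f} TF for 8 VEs")
print(f"1.6 GHz: {core_16:.1f} GF/core, {ve_16_tf:.2f} TF/VE")
```

The 1.6 GHz values line up with the ~307 GF and ~2.45 TF figures in the VE1.0 spec table.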

Characteristics

Position
[Figure: memory bandwidth per processor (standard to high spec.) plotted against language specification (standard to special), positioning Xeon®, GPGPU and Vector Engine.]

Fundamental Benchmarks

- STREAM: VE has the highest sustained memory bandwidth per node
- HPL: VE provides competitive FLOPS capability

[Figures: STREAM / node and HPL / node.]

- VE provides the highest sustained memory bandwidth
- VE provides HPL sustained performance in the same range as SKL/KNL

Performance/Price

- High price competitiveness
  - The highest STREAM sustained performance / price
  - Competitive HPL sustained performance / price

[Figures: STREAM / price and HPL / price, shown as performance ratios.]

- VE provides the highest memory bandwidth / price
- VE provides HPL sustained performance / price in the same range as Intel products

HPCG

- VE provides high HPCG performance per node and per price
- HPL and STREAM are the bookends of the benchmarks, and HPCG stands between them (from memory bandwidth bound to performance bound)

[Figures: HPCG / node and HPCG / price, performance ratio Aurora vs. SKL; each chart is annotated "3x".]

Usability

Programming Environment

Vector cross compiler
- automatic vectorization
- automatic parallelization

Fortran: F2003, F2008 (partially)
C: C11
C++: C++14
OpenMP: OpenMP 4.5
MPI: MPI 3.1

User environment (SX-Aurora TSUBASA, on the VH)
    $ vi sample.c
    $ ncc sample.c

Execution environment (started on the VH, executed on the VE)
    $ a.out

Compilers

Cross compilers: nfort, ncc, nc++
Tools: nld, nar, nranlib, …
MPI wrappers: mpinfort, mpincc, mpinc++

Programming Environment

- NEC supports the latest language standards along with GNU compatibility

▌C/C++
  - ISO/IEC 9899:2011 (aka C11)
  - ISO/IEC 14882:2014 (aka C++14)
▌Fortran
  - ISO/IEC 1539-1:2004 (aka Fortran 2003)
  - ISO/IEC 1539-1:2010 (aka Fortran 2008)
▌OpenMP
  - Version 4.5
▌Libraries
  - libc
  - MPI Version 3.1 (fully tuned for Aurora architecture)
  - Numeric libraries (BLAS, FFT, LAPACK, etc.)
▌Tools
  - GNU Profiler (gprof)
  - GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
  - FtraceViewer / PROGINF

Options

  Former SX option    SX-Aurora TSUBASA option
  -Caopt              -O4
  -Chopt              -O3
  -Cvopt              -O2
  -Cvsafe             -O1
  -Cnoopt             -O0
  -Omove              -fmove-loop-invariants-unsafe
  -Onomovediv         -fmove-loop-invariants
  -Onomove            -fno-move-loop-invariants
  -Popenmp            -fopenmp
  -pi,auto            -finline-functions    # no cross-file inlining
  …

Compiler diagnostics / listings
  -fdiag-vector=0|1|2      # more or less detailed vectorization diagnostics
  -fdiag-parallel=0|1|2
  -fdiag-inline=0|1|2
  -report-all              # get both diagnostics and formatted listing in .L file

Default type size
  -fdefault-integer=4|8
  -fdefault-real=4|8
  -fdefault-double=8|16

Cache usage
  -mretain-[all|list-vector|none]
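When migrating build scripts, the option table above can be applied mechanically. The following is a minimal sketch, not an NEC tool: the helper name `translate_flags` is made up for illustration, the mapping contains only the pairs listed on the slide, and any flag not in the table is passed through unchanged for manual review.

```python
# Pairs taken from the "Options" slide: former SX option -> SX-Aurora TSUBASA option.
SX_TO_AURORA = {
    "-Caopt": "-O4",
    "-Chopt": "-O3",
    "-Cvopt": "-O2",
    "-Cvsafe": "-O1",
    "-Cnoopt": "-O0",
    "-Omove": "-fmove-loop-invariants-unsafe",
    "-Onomovediv": "-fmove-loop-invariants",
    "-Onomove": "-fno-move-loop-invariants",
    "-Popenmp": "-fopenmp",
    "-pi,auto": "-finline-functions",  # slide note: no cross-file inlining
}


def translate_flags(flags):
    """Rewrite known former-SX options; leave every other token untouched."""
    return " ".join(SX_TO_AURORA.get(token, token) for token in flags.split())


old_fflags = "-Chopt -Popenmp -pi,auto -I./include"
print(translate_flags(old_fflags))
```

Whether the translated options give the same behaviour as the old ones for a given code still has to be checked with the compiler diagnostics.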

Directives

  !CDIR …            !NEC$ …
  nodep              ivdep
  expand=n           unroll(n)
  move               move_unsafe
  nomovediv          move
  nomove             nomove
  outerunroll=n      outerloop_unroll(n)
  unroll=n           unroll(n)
  …

→ Directive conversion tool: nfdirconv

Libraries

▌NEC Library provides a wide variety of functions
  - NEC library is fully tuned for Aurora architecture

                                        NEC Lib   MKL
  BLAS                                     ✓       ✓
  LAPACK                                   ✓       ✓
  ScaLAPACK                                ✓       ✓
  FFT                                      ✓       ✓
  Random number generators                 ✓       ✓
  Direct sparse solvers                    ✓       ✓
  Iterative sparse solvers                 ✓       ✓
  Functions for Statistics                 ✓       ✓
  Spline functions                         ✓       ✓
  Special functions                        ✓
  Approximation and Interpolation          ✓
  Numerical Differentials/Integrals        ✓
  Roots of Equations                       ✓
  Time series analysis                     ✓
  Sorting and ranking                      ✓
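To illustrate what the directive table means in practice, here is a small Python sketch that rewrites `!CDIR` lines into `!NEC$` lines using only the pairs listed above. It is not the `nfdirconv` tool and does not claim to reproduce its behaviour: the function name `convert_line` is hypothetical, only free-form `!CDIR name` and `!CDIR name=n` lines are handled, and anything unrecognised is left unchanged.

```python
import re

# Pairs from the "Directives" slide.
SIMPLE = {"nodep": "ivdep", "move": "move_unsafe", "nomovediv": "move", "nomove": "nomove"}
WITH_ARG = {"expand": "unroll", "outerunroll": "outerloop_unroll", "unroll": "unroll"}

# Matches "!CDIR name" or "!CDIR name=n" with optional leading indentation.
_CDIR = re.compile(r"^(\s*)!CDIR\s+(\w+)(?:\s*=\s*(\d+))?\s*$", re.IGNORECASE)


def convert_line(line):
    """Return the !NEC$ form of a known !CDIR line, or the line unchanged."""
    match = _CDIR.match(line)
    if not match:
        return line
    indent, name, arg = match.group(1), match.group(2).lower(), match.group(3)
    if arg is not None and name in WITH_ARG:
        return f"{indent}!NEC$ {WITH_ARG[name]}({arg})"
    if arg is None and name in SIMPLE:
        return f"{indent}!NEC$ {SIMPLE[name]}"
    return line  # unknown directive: keep it for manual review


source = """\
!CDIR NODEP
      do i = 1, n
        a(idx(i)) = a(idx(i)) + b(i)
      end do
!CDIR OUTERUNROLL=4
"""
print("\n".join(convert_line(l) for l in source.splitlines()))
```

For real code bases the NEC-provided conversion tool named on the slide is the appropriate route; the sketch only makes the mapping concrete.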

UNIX system function interface

To use extensions like the GETARG, FLUSH, ABORT, … subroutines, compile with
    -use F90_UNIX[,F90_UNIX_ENV,…]
See Fortran Users' Guide 8.2 for details.

Endianness

SX-Aurora TSUBASA is little-endian! (Former SXs were big-endian.)

    export VE_FORT_UFMTENDIAN=10,11    (or ALL)

This variable sets the unit numbers of unformatted files that are to be treated as big-endian. Two or more unit numbers can be specified, separated by commas. When the value is ALL, the setting applies to all unit numbers.

GNU Fortran extension: convert specifier

    open(10, file='test.dat', form='unformatted', &
         convert='big_endian')

→ Non-standard Fortran, but supported by nfort
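Why the byte order matters can be seen without any Fortran at all. The sketch below uses Python's `struct` module to decode the same bytes in both orders; it assumes IEEE 754 data and ignores the record markers of Fortran unformatted files, and the helper name `decode_reals` is illustrative only.

```python
import struct


def decode_reals(raw, big_endian):
    """Decode a byte string as consecutive 8-byte reals in the given byte order."""
    count = len(raw) // 8
    order = ">" if big_endian else "<"
    return struct.unpack(f"{order}{count}d", raw[: count * 8])


# Three 8-byte reals as a big-endian machine would have written them.
values = (1.0, 2.5, -3.0)
raw_big_endian = struct.pack(">3d", *values)

as_big = decode_reals(raw_big_endian, big_endian=True)      # original values
as_little = decode_reals(raw_big_endian, big_endian=False)  # silently wrong numbers

# Same effect for a 4-byte integer: the bytes of 1 read in the wrong order.
misread_int = struct.unpack("<i", struct.pack(">i", 1))[0]  # 16777216 (0x01000000)

print("read as big-endian:   ", as_big)
print("read as little-endian:", as_little)
print("integer 1 misread as: ", misread_int)
```

No error is raised in the wrong-order case, which is why old big-endian data files need either the environment variable or the convert specifier shown above.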

Correctness

Run-time errors…?
- Compile with -traceback and set
      export VE_TRACEBACK=FULL|ALL
- Reduce optimization
- -fcheck=bounds (all)
- Stack/heap initialization:
      -minit-stack=zero|nan
      export VE_INIT_HEAP=ZERO|NAN
- -mno-vector-fma
- export VE_FPE_ENABLE=(DIV,FOF,FUF,INV,INE)

Debugger

- Eclipse Parallel Tools Platform (PTP): the VE plugin provides a GUI debugging environment

[Figure: debugger window with panes for process sets, variables, functions, source code, stack trace, standard output, and standard error output (debugger).]

PROGINF performance information

Compile with "-proginf"
export VE_PROGINF=YES/DETAIL

  ******** Program Information ********
  Real Time (sec)               :        8783.021690
  User Time (sec)               :        8753.527852
  Vector Time (sec)             :        4959.702777
  Inst. Count                   :      8018493848355
  V. Inst. Count                :      1081598389267
  V. Element Count              :    221178430822266
  V. Load Element Count         :     51426036073697
  FLOP Count                    :    140842692526140
  MOPS                          :       29663.826745
  MOPS (Real)                   :       29444.837925
  MFLOPS                        :       16155.280253
  MFLOPS (Real)                 :       16036.016282
  A. V. Length                  :         204.492197
  V. Op. Ratio (%)              :          97.317633
  L1 Cache Miss (sec)           :         801.540473
  CPU Port Conf. (sec)          :           2.435367
  V. Arith. Exec. (sec)         :        2355.046188
  V. Load Exec. (sec)           :        2346.286974
  VLD LLC Hit Element Ratio (%) :          79.566662
  Power Throttling (sec)        :           0.000000
  Thermal Throttling (sec)      :           0.000000
  Memory Size Used (MB)         :       10956.000000
  Start Time (date)             : Tue Nov 6 13:05:18 2018 CET
  End Time (date)               : Tue Nov 6 15:31:41 2018 CET
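PROGINF output is plain "name : value" text, so it is easy to post-process. The sketch below parses a few lines copied from the sample above and derives two ratios; `parse_proginf` is a hypothetical helper, not part of any NEC tool. Dividing V. Element Count by V. Inst. Count reproduces the reported A. V. Length for this sample, which suggests (but does not prove) that this is how the value is defined.

```python
SAMPLE = """\
******** Program Information ********
  Real Time (sec)         : 8783.021690
  User Time (sec)         : 8753.527852
  Vector Time (sec)       : 4959.702777
  V. Inst. Count          : 1081598389267
  V. Element Count        : 221178430822266
  A. V. Length            : 204.492197
  Start Time (date)       : Tue Nov 6 13:05:18 2018 CET
"""


def parse_proginf(text):
    """Collect the numeric 'name : value' lines of a PROGINF report into a dict."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # banner or blank line
        try:
            metrics[key.strip()] = float(value.strip())
        except ValueError:
            continue  # non-numeric entries such as dates are skipped
    return metrics


metrics = parse_proginf(SAMPLE)

# Average number of elements processed per vector instruction.
avg_vector_length = metrics["V. Element Count"] / metrics["V. Inst. Count"]

# Share of user time spent in vector instructions.
vector_time_fraction = metrics["Vector Time (sec)"] / metrics["User Time (sec)"]

print(f"average vector length: {avg_vector_length:.3f}")
print(f"vector time fraction:  {vector_time_fraction:.1%}")
```

A low vector time fraction or a short average vector length are typical starting points for looking at the vectorization diagnostics of the hot loops.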

configure

autoconf: https://github.com/SX-Aurora/autoconf-helper

configure command:
    ./configure CC=ncc CXX=nc++ FC=nfort F90=nfort \
                AR=nar LD=nld AS=nas --host=ve--linux

CMAKE Toolchain (example): https://github.com/SX-Aurora/CMake-toolchain-file

Available at www.rz.uni-kiel.de/de/angebote/hiperf/nec-sx-aurora-tsubasa

Documentation

▌Official Documentation: http://www.nec.com/en/global/prod/hpc/aurora/document

▌Official Software: http://www.nec.com/en/global/prod/hpc/aurora/ve-software

▌Open Source Software: https://github.com/SX-Aurora

Aurora Forum community website

Visit https://www.hpc.nec and join our quest

Better communication through BBS: let's discuss openly!

[Figure: forum categories: Evaluation, Porting, Report, Roadmap, Tuning, Request.]

About posting

- Post text and arbitrary files (Word, PPT, PDF, pictures, movies, software, etc.); there is no limitation on the type of file.
- Please "Like" good posts.

Let’s develop better Aurora together!!
