2018/11/16
NEC SX Aurora TSUBASA migration workshop
CAU Kiel
Dr. Jens-Olaf Beismann, Senior Benchmarking Analyst, NEC Deutschland GmbH

Agenda
10:00 – 10:30  SX Aurora Tsubasa : technology overview
10:30 – 11:30  Migration to SX-Aurora Tsubasa : compilers, libraries, …
11:30 – 12:00  RZ Kiel : overview
Lunch break
13:00 – 17:00  Hands-on session: porting, run-time, performance
SX Aurora TSUBASA
Technology overview
▌Dedicated vector processor
▌High memory bandwidth
▌Commodity processors
▌De facto standard x86/Linux environment
4 © NEC Deutschland GmbH 2018
Brand-new Vector Supercomputer Strategy
TSUBASA : meaning "wing" in Japanese

▌High performance (Vector Engine, VE)
- 1.22TB/s / processor, 150GB/s / core
- High sustained performance
▌Linux open environment (x86/Linux)
- Linux OS
- Fortran/C/C++ programming, OpenMP
- Libraries and tools
- Automatic vectorization/parallelization
- Applications run in an x86/Linux environment, so Linux assets remain usable
- Peripherals attach to the x86 side
Architecture
Aurora Architecture
▌x86 node + Vector Engine (VE)
▌VE capability is provided on x86/Linux environment
[Diagram: Vector Host (VH, x86 node running Linux OS) and Vector Engine (VE, with VE OS); the VE processor consists of cores, each with an SPU and a VPU, plus memory]

Inherited and Changed
[Diagram: previous SX (supercomputer, Super-UX, Super-UX application) compared with SX-Aurora TSUBASA (x86 server plus Vector Engine, Linux, Linux application); form factors shown: supercomputer, rack mount servers, desktop tower]
GPGPU and VE
▌GPGPU Architecture
- x86 runs the OS and the application (AP); functions are offloaded to the GPGPU via CUDA
- Execution flow: exec, data transmission, start processing, end processing, result transmission, exit
- Frequent PCIe transmission
- Disadvantages: PCIe bottleneck, small memory, programming difficulty
▌Aurora Architecture
- x86 handles the OS (I/O, etc.); the whole AP is executed on the VE
- Advantages: avoids the PCIe bottleneck, larger memory, standard language

Processor
VE1.0 Spec.
  cores/CPU           8
  core performance    ~307GF (DP), ~614GF (SP)
  CPU performance     ~2.45TF (DP), ~4.91TF (SP)
  cache capacity      16MB shared
  memory bandwidth    1.22TB/s
  memory capacity     24, 48GB
[Diagram: 8 cores at 307GF each, 2.45TF in total (@1.6GHz); bandwidth labels 0.4TB/s and 3TB/s between cores and cache; 16MB software controllable cache; 1.22TB/s to HBM2 memory x 6]
Product
▌Processor card
- Developed by NEC
- World's highest memory bandwidth: 1.22TB/s / processor (Ave. 150GB/s / core)
▌Products of SX-Aurora TSUBASA (1VE, 2VE, 4VE, 8VE)
- A100 Tower
- A300 Server
- A500 DLC Supercomputer
▌CAU Kiel : 17.2 TF

Core Architecture
▌Single core
- Vector pipelines: VFMA0, VFMA1, VFMA2, ALU0, ALU1, DIV
- SPU: Scalar Processing Unit
- 400GB/s / core
▌Peak Performance : 268.8GF = 32 Flops/cycle x 2 (FMA) x 3 x 1.4GHz
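The peak-performance formula on the Core Architecture slide can be checked with a few lines of arithmetic. The sketch below takes its inputs from the slides (32 Flops/cycle, factor 2 for FMA, factor 3, 8 cores per processor, 1.22TB/s per processor). The function and variable names are illustrative, and reading the factor 3 as the three VFMA pipelines is an assumption based on the pipeline labels.

```python
# Sketch: reproduce the peak-performance arithmetic quoted on the slides.
# Names are illustrative; treating the factor 3 as "three VFMA pipelines"
# is an assumption based on the VFMA0/VFMA1/VFMA2 labels.

def core_peak_gflops(clock_ghz, flops_per_cycle=32, fma_factor=2, vfma_pipes=3):
    """Double-precision peak of one VE core in GFlop/s."""
    return flops_per_cycle * fma_factor * vfma_pipes * clock_ghz

CORES_PER_VE = 8          # "cores/CPU 8" in the VE1.0 spec
MEM_BW_GB_PER_S = 1220.0  # 1.22TB/s per processor

core_14 = core_peak_gflops(1.4)                # 1.4GHz part
core_16 = core_peak_gflops(1.6)                # 1.6GHz part
ve_14_tf = core_14 * CORES_PER_VE / 1000.0     # whole processor, TFlop/s
ve_16_tf = core_16 * CORES_PER_VE / 1000.0
bw_per_core = MEM_BW_GB_PER_S / CORES_PER_VE   # average GB/s per core

print(f"core peak @1.4GHz: {core_14:.1f} GF")
print(f"core peak @1.6GHz: {core_16:.1f} GF")
print(f"processor peak @1.4GHz: {ve_14_tf:.4f} TF")
print(f"processor peak @1.6GHz: {ve_16_tf:.4f} TF")
print(f"average memory bandwidth per core: {bw_per_core:.1f} GB/s")
```

At 1.6GHz this gives 307.2 GF per core and about 2.46 TF per processor, in line with the ~307GF and ~2.45TF of the VE1.0 spec. The 152.5 GB/s per core matches the quoted average of 150GB/s. Eight processors at 1.4GHz would come to about 17.2 TF, which agrees with the figure quoted for CAU Kiel, although the slides do not state that system's configuration.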
Characteristics
[Chart: positioning of Xeon, GPGPU, and Vector Engine by memory bandwidth / processor, specification (standard to high spec.), and language (standard to special)]

Fundamental Benchmarks
▌STREAM: VE has the highest sustained memory bandwidth / node
▌HPL: VE provides competitive FLOPS capability
[Charts: STREAM / node and HPL / node]
▌VE provides the highest sustained memory bandwidth
▌VE provides HPL sustained performance in the same range as SKL/KNL
Performance/Price
▌High price competitiveness
- The highest STREAM sustained performance / price
- Competitive HPL sustained performance / price
[Charts: STREAM / price and HPL / price, shown as performance ratios]
▌VE provides the highest memory bandwidth / price
▌VE provides HPL sustained performance / price in the same range as Intel products

HPCG
▌VE provides high HPCG performance per node and per price
▌HPL and STREAM are the bookends of benchmarking, and HPCG stands between them
[Charts: HPCG / node and HPCG / price as performance ratios, Aurora vs. SKL, each annotated "3x"]
[Diagram: benchmark characteristics ranging from memory bandwidth bound (STREAM) to performance bound (HPL)]
Usability
Programming Environment
▌Vector cross compiler
- automatic vectorization
- automatic parallelization
- Fortran: F2003, F2008 (partially)
- C: C11
- C++: C++14
- OpenMP: OpenMP4.5
- MPI: MPI3.1
▌User environment (on the VH)
    $ vi sample.c
    $ ncc sample.c
▌Execution environment (VH and VE)
    $ a.out
  The command is started on the VH and the program executes on the VE.
Compilers
▌Cross compilers :
- nfort
- ncc
- nc++
▌Tools :
- nld, nar, nranlib, …
▌MPI wrappers :
- mpinfort
- mpincc
- mpinc++

Programming Environment
NEC supports the latest language standards along with GNU compatibility
▌C/C++
- ISO/IEC 9899:2011 (aka C11)
- ISO/IEC 14882:2014 (aka C++14)
▌Fortran
- ISO/IEC 1539-1:2004 (aka Fortran 2003)
- ISO/IEC 1539-1:2010 (aka Fortran 2008)
▌OpenMP
- Version 4.5
▌Libraries
- libc
- MPI Version 3.1 (fully tuned for Aurora architecture)
- Numeric libraries (BLAS, FFT, Lapack, etc.)
▌Tools
- GNU Profiler (gprof)
- GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
- FtraceViewer / PROGINF
Options
Previous SX option    SX-Aurora TSUBASA option
-Caopt                -O4
-Chopt                -O3
-Cvopt                -O2
-Cvsafe               -O1
-Cnoopt               -O0
-Omove                -fmove-loop-invariants-unsafe
-Onomovediv           -fmove-loop-invariants
-Onomove              -fno-move-loop-invariants
-Popenmp              -fopenmp
-pi,auto              -finline-functions    # no cross-file inlining
…

Options
▌Compiler diagnostics / listings
- -fdiag-vector=0|1|2      # more or less detailed vectorization diagnostics
- -fdiag-parallel=0|1|2
- -fdiag-inline=0|1|2
- -report-all              # get both diagnostics and formatted listing in .L file
▌Default type size
- -fdefault-integer=4|8
- -fdefault-real=4|8
- -fdefault-double=8|16
▌Cache usage
- -mretain-[all|list-vector|none]
Directives
!CDIR …            !NEC$ …
nodep              ivdep
expand=n           unroll(n)
move               move_unsafe
nomovediv          move
nomove             nomove
outerunroll=n      outerloop_unroll(n)
unroll=n           unroll(n)
…
▌Directive conversion tool: nfdirconv

Libraries
▌NEC Library provides a wide variety of functions
- NEC library is fully tuned for Aurora architecture

Marked for both NEC Lib and MKL:
- BLAS
- LAPACK
- ScaLAPACK
- FFT
- Random number generators
- Direct sparse solvers
- Iterative sparse solvers
- Functions for Statistics
- Spline functions
Marked in one column only:
- Special functions
- Approximation and Interpolation
- Numerical Differentials/Integrals
- Roots of Equations
- Time series analysis
- Sorting and ranking
UNIX system function interface
To use extensions like the GETARG, FLUSH, ABORT, … subroutines, compile with
    -use F90_UNIX[,F90_UNIX_ENV,…]
See Fortran Users' Guide 8.2 for details.

Endianness
SX-Aurora TSUBASA is little-endian! (Former SXs were big-endian.)
▌Environment variable
    export VE_FORT_UFMTENDIAN=10,11    (or ALL)
  Sets the unit numbers of unformatted files to be treated as big-endian files. If the value is ALL, the setting applies to all unit numbers. Two or more unit numbers can be given, separated by commas.
▌GNU Fortran extension : convert specifier
    open(10, file='test.dat', form='unformatted', &
         convert='big_endian')
  Non-standard Fortran, but supported by nfort.
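Why unformatted files written on a former SX need this treatment can be seen from the raw bytes of a single value. The following minimal sketch uses Python's standard `struct` module and is purely illustrative: it does not model the Fortran record structure of an unformatted file, only the byte order of one 8-byte real.

```python
import struct

value = 1.0

# Little-endian: the byte order of SX-Aurora TSUBASA (and x86).
little = struct.pack("<d", value)
# Big-endian: the byte order of former SX systems.
big = struct.pack(">d", value)

# The same eight bytes, in reversed order.
assert little == big[::-1]

# Reading big-endian data without conversion yields a wrong number.
misread = struct.unpack("<d", big)[0]
# Reading it with the matching byte order recovers the value.
converted = struct.unpack(">d", big)[0]

print("little-endian bytes:", little.hex())
print("big-endian bytes:   ", big.hex())
print("misread value:      ", misread)
print("converted value:    ", converted)
```

Setting VE_FORT_UFMTENDIAN or using convert='big_endian' asks the Fortran run-time to perform the equivalent of the "converted" read for the selected units.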
Correctness
Run-time errors…?
▌Compile with -traceback
    export VE_TRACEBACK=FULL|ALL
▌Reduce optimization
▌-fcheck=bounds (or all)
▌Stack/heap initialization
    -minit-stack=zero|nan
    export VE_INIT_HEAP=ZERO|NAN
▌-mno-vector-fma
▌export VE_FPE_ENABLE=(DIV,FOF,FUF,INV,INE)

Debugger
▌Eclipse parallel tools platform (PTP): the VE plugin provides a GUI debugging environment
[Screenshot: debugger window with panes for process sets, variables, functions, source code, stack trace, standard output, and standard error output (debugger)]
PROGINF performance information
▌Compile with "-proginf"
▌export VE_PROGINF=YES/DETAIL

PROGINF performance information
  ******** Program Information ********
  Real Time (sec)                 : 8783.021690
  User Time (sec)                 : 8753.527852
  Vector Time (sec)               : 4959.702777
  Inst. Count                     : 8018493848355
  V. Inst. Count                  : 1081598389267
  V. Element Count                : 221178430822266
  V. Load Element Count           : 51426036073697
  FLOP Count                      : 140842692526140
  MOPS                            : 29663.826745
  MOPS (Real)                     : 29444.837925
  MFLOPS                          : 16155.280253
  MFLOPS (Real)                   : 16036.016282
  A. V. Length                    : 204.492197
  V. Op. Ratio (%)                : 97.317633
  L1 Cache Miss (sec)             : 801.540473
  CPU Port Conf. (sec)            : 2.435367
  V. Arith. Exec. (sec)           : 2355.046188
  V. Load Exec. (sec)             : 2346.286974
  VLD LLC Hit Element Ratio (%)   : 79.566662
  Power Throttling (sec)          : 0.000000
  Thermal Throttling (sec)        : 0.000000
  Memory Size Used (MB)           : 10956.000000
  Start Time (date)               : Tue Nov 6 13:05:18 2018 CET
  End Time (date)                 : Tue Nov 6 15:31:41 2018 CET
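Some PROGINF entries can be cross-checked against the raw counters in the same listing. The sketch below copies the counter values from the output above. The relation "A. V. Length = V. Element Count / V. Inst. Count" is an assumption that happens to reproduce the listed value, not a definition taken from the slides, and the vector-time share is a simple derived indicator that PROGINF does not print.

```python
# Counter values copied from the PROGINF listing above.
user_time_s = 8753.527852
vector_time_s = 4959.702777
v_inst_count = 1081598389267
v_element_count = 221178430822266

# Assumed relation: average vector length = elements per vector instruction.
avg_vector_length = v_element_count / v_inst_count

# Derived indicator (not printed by PROGINF): share of user time
# spent executing vector instructions.
vector_time_share = 100.0 * vector_time_s / user_time_s

print(f"average vector length: {avg_vector_length:.6f}")
print(f"vector time share:     {vector_time_share:.1f} %")
```

The computed average vector length agrees with the "A. V. Length" line of the listing to the printed precision.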
configure
▌autoconf: https://github.com/SX-Aurora/autoconf-helper
▌configure command:
    ./configure CC=ncc CXX=nc++ FC=nfort F90=nfort \
        AR=nar LD=nld AS=nas --host=ve-nec-linux
▌CMAKE Toolchain (example): https://github.com/SX-Aurora/CMake-toolchain-file

Documentation
Available at www.rz.uni-kiel.de/de/angebote/hiperf/nec-sx-aurora-tsubasa
Documentation
▌Official Documentation: http://www.nec.com/en/global/prod/hpc/aurora/document
▌Official Software: http://www.nec.com/en/global/prod/hpc/aurora/ve-software
▌Open Source Software: https://github.com/SX-Aurora
Aurora Forum community website
Visit https://www.hpc.nec and join our quest
Aurora Forum community website
Better communication through the BBS: let's discuss openly!
[Screenshot: forum categories Evaluation, Porting, Report, Roadmap, Tuning request]

Aurora Forum community website
About posting: please "Like" good posts
- Posts can contain text and arbitrary files (Word, PowerPoint, PDF, pictures, movies, software, etc.); there is no limitation on the type of file.
Aurora Forum community website
Let’s develop better Aurora together!!