SX-Aurora TSUBASA Introduction: Vector Supercomputer Technology on a PCIe Card
What is a Vector Processor? (1/2)

A vector processor operates on many data elements at once, which makes it well suited to fast processing of large-scale data.

General Processor: suited to processing data in small units, as in business operations and web servers.
Vector Processor: suited to processing data in large units at once, as in simulation, AI, and big data.

[Figure: a scalar calculation takes one data element and produces one output; a vector calculation takes 256 data elements and produces 256 outputs]

© NEC Corporation 2019

What is a Vector Processor? (2/2)

① Many small cores vs. a small number of large cores
② Balance of computation performance and data access performance
③ Software development environment

GPU-like Processors
① Many small cores
② Larger size of computation circuits
③ Special language (such as CUDA)

Vector Processors
① Small number of large cores
② Balanced size of computation circuits and data access circuits
③ Standard language (C/C++/Fortran)

[Figure: cores, data access circuits, and memory for each processor type]

Vector Processor – History & Future

▌Vector processors have traditionally been used to process big data, long before the term "big data" was coined.
▌The very first vector-processor-based machine, the Cray-1, was built by Seymour Cray in 1976. NEC made its first vector supercomputer, the SX-2, in 1981. The SX-2 was the first CPU ever to exceed 1 Gflops of peak performance. Soon, Fujitsu and Hitachi followed in NEC's footsteps in the high-end HPC technology segment.
▌However, in the 1990s, the computer industry changed drastically with the advent of affordable x86 processors. The eventual dominance of x86 played a key role in the democratization of HPC across academia and industry.
▌Soon, due to economic pressure, Cray bailed out of making vector supercomputers, followed by Fujitsu and Hitachi.
▌NEC is the only remaining vendor that is still committed to developing and enhancing pure vector processors.
NEC SX-Series of Vector Supercomputers

High Bytes/Flops has been the core feature of the NEC SX-Series of vector supercomputers.

Earlier SX-Series systems: good, but…
• Large
• Expensive
• Special
like dinosaurs

SX-Aurora TSUBASA Vector Engine:
• Fast
• Strong
• Compact
• Economical
like falcons

Vector technology experience accumulated over 35 years, combining hardware innovations and software innovations, packed into a PCIe card.

[Figure: performance over time (1990, 2000, 2010) for SX-2, SX-3, SX-4, SX-5, SX-6, SX-7, SX-8, SX-9, SX-ACE, Earth Simulator, Earth Simulator 2, Earth Simulator 3, and the SX-Aurora TSUBASA Vector Engine]

Vector Processor on PCIe Card (World's highest Memory Capacity & Bandwidth Processor)

• 8 cores / processor
• 1.35TB/s memory bandwidth, 48GB memory (very high memory bandwidth)
• Standard programming with Fortran/C/C++ (no special programming model needed)
• 2.45TF performance (double precision)
• 4.90TF performance (single precision)

SX Aurora Vector Engine Design Vision

Design concept
• High sustained performance in real applications
• TCO reduction

▌High sustained performance
• Vector accelerator
• High B/F → good balance of memory bandwidth and CPU performance

▌TCO reduction
• Low power consumption
• High density → smaller installation space
• Productivity (programming, code maintenance)

[Figure: TCO breakdown into hardware, software, power, machine room, etc.]

Aurora Vector Engine 1E: Specification

VE10E Specification
  cores/CPU          8
  core performance   ~307GF (DP), ~614GF (SP)
  CPU performance    ~2.45TF (DP), ~4.91TF (SP)
  cache capacity     16MB shared
  memory bandwidth   1.35TB/s
  memory capacity    48GB

[Figure: eight cores at 307GF each (2.45TF in total), a 16MB software-controllable cache, and HBM2 memory x 6, with bandwidth labels of 0.4TB/s, 3TB/s, and 1.35TB/s]

Architecture

• SX-Aurora TSUBASA = standard x86 + Vector Engine
• Linux + standard languages (Fortran/C/C++)
• Enjoy high performance with easy programming

SX-Aurora TSUBASA Architecture

Hardware
• Standard x86 server + Vector Engine

Software
• Linux OS
• Automatic vectorization compiler
• Fortran/C/C++ → no special programming like CUDA

Interconnect
• InfiniBand for MPI
• VE-VE direct communication support

[Figure: x86 server (VH) connected over PCIe to the Vector Engine (VE), running the Linux OS and the application. Automatic vectorization compiler + easy programming (standard language) = enjoy high performance!]

Usability

Programming Environment (Vector Cross Compiler)
• Automatic vectorization
• Automatic parallelization
• Fortran: F2003, F2008
• C/C++: C11/C++14
• OpenMP: OpenMP 4.5
• Library: MPI 3.1, libc, BLAS, LAPACK, etc.
• Debugger: gdb, Eclipse Parallel Tools Platform
• Tools: PROGINF, FtraceViewer

    $ vi sample.c
    $ ncc sample.c

Execution Environment (VH and VE)

    $ ./a.out

SX-Aurora TSUBASA Programming Environment

Support for the latest language standards along with GNU compatibility

▌C/C++
• ISO/IEC 9899:2011 (aka C11)
• ISO/IEC 14882:2014 (aka C++14)
▌Fortran
• ISO/IEC 1539-1:2004 (aka Fortran 2003)
• ISO/IEC 1539-1:2010 (aka Fortran 2008)
▌OpenMP
• Version 4.5
▌Libraries
• libc
• MPI Version 3.1 (fully tuned for the Aurora architecture)
• Numeric libraries (Stencil, BLAS, FFT, LAPACK, etc.)
▌Tools
• GNU Profiler (gprof)
• GNU Debugger (gdb), Eclipse Parallel Tools Platform (PTP)
• FtraceViewer / PROGINF

NEC Numerical Library Collection (NLC)

NLC is a collection of mathematical libraries that supports the development of numerical simulation programs.
• ASL Unified Interface: Fourier transforms and random number generators
• BLAS / CBLAS: basic linear algebra subprograms
• FFTW3 Interface: interface library to use the Fourier transform functions of ASL with the FFTW (version 3.x) API
• LAPACK: linear algebra package
• ScaLAPACK: scalable linear algebra package for distributed-memory parallel programs
• ASL: scientific library with a wide variety of algorithms for numerical/statistical calculations: linear algebra, Fourier transforms, spline functions, special functions, approximation and interpolation, numerical differentials and integration, roots of equations, basic statistics, etc.
• SBLAS: sparse BLAS
• Stencil Code Accelerator: stencil code acceleration
• HeteroSolver: direct sparse solver

Default Execution Model

Accelerator (GPGPU): frequent data transfer will become a performance bottleneck.
SX-Aurora TSUBASA: the entire application runs on the Vector Engine. No data transfer bottleneck.

[Figure: with an accelerator (GPGPU), the application runs under the Linux OS on the x86 processor and individual functions run on the accelerator; with SX-Aurora TSUBASA, the application and its functions run on the Vector Engine]

VEOS offload models

Run the application in the way it is supposed to run.

• OS Offload
• VH call
• VEO

[Figure: for each model, the placement of the x86 application and the VE application across an x86 node running Linux and a Vector Engine running VEOS]

Hybrid MPI

An MPI application runs processes on the VE and the VH, communicating through a PCIe switch.

[Figure: processes (P) on one VH and several VEs connected by a PCIe switch]

HPL using Hybrid MPI

  Configuration                          Performance
  8 procs on VE                          1867 Gflops
  8 procs on VH                          1430 Gflops
  Hybrid MPI, 16 procs on VE and VH      2830 Gflops

Offload I/O using Hybrid MPI

Run the I/O process on the VH using Hybrid MPI and continue computation on the VE.

[Figure: compute processes (P) on the VEs and a process for I/O on the VH, connected through a switch to the file system]

SX-Aurora based System Providers in North America

• DL380
• Apollo 6500
• Vector Engine Card

SX-Aurora based System Providers in North America

• Over 30 years of experience in delivering custom and HPC solutions
• Extensive customer base, especially academia and research labs
• Specialized HPC expertise
• Solution design and development
• HPC research and training
• Hybrid system design
• The NEC and Colfax partnership aims to provide "personal supercomputing" power for leading-edge development

Performance Benchmarks

DGEMM performance

Aurora 1E (2019 CPU) performance is similar to A64FX (2020 CPU).

[Figure: DGEMM single node performance [GFLOPS] for Xeon Gold 6148 (2016, 2CPU), Tesla V100 *1 (2017, 1GPU), Aurora1E 10AE (2019, 1CPU), and A64FX *2 (2020, 1CPU). Bar values as extracted, order not recoverable: 6627, 2398, 2500, 2104]

*1 AMD NEXT HORIZON http://ir.amd.com/static-files/ef99f84b-e1ad-4e12-8058-f3488f4c47b7
*2 The post-K project and Fujitsu ARM-SVE enabled A64FX processor https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf

Himeno Benchmark

Aurora 1E (2019 CPU) performance is similar to A64FX (2020 CPU).

[Figure: Himeno BM single node performance [GFLOPS] (size: XL) for Xeon Gold 6148 (2016, 2CPU), Tesla V100 *1 (2017, 1GPU), Aurora1E 10AE (2019, 1CPU), and A64FX *2 (2020, 1CPU). Bar values as extracted, order not recoverable: 339, 346, 305, 82]

*1 Performance evaluation of a vector supercomputer SX-aurora TSUBASA https://dl.acm.org/citation.cfm?id=3291728
*2 Supercomputer "Fugaku", formerly known as Post-K https://www.fujitsu.com/global/Images/supercomputer-fugaku.pdf

Stream Benchmark

Aurora 1E (2019 CPU) performance is more than 30% higher than competitors.

[Figure: STREAM Triad single node performance [GB/s] for Xeon Gold 6148 (2016, 2CPU), Tesla V100 *1 (2017, 1GPU), Aurora1E 10AE (2019, 1CPU), and A64FX *2 (2020, 1CPU). Bar values as extracted, order not recoverable: 1084, 830, 830, 180]

*1 The post-K project and Fujitsu ARM-SVE enabled A64FX processor https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf

HPC Use Case: Stencil Code Acceleration for O&G

Stencil Code Overview

Seismic Imaging

▌Reverse Time Migration (RTM)
• A typical method for seismic imaging.
• The most costly part is "stencil code".
• In the case of 3D RTM, it consumes about 90% of the total execution time even when using 40 threads.

[Figure: elapsed time ratio [%] (0 to 100) split into stencil code, other computation, and I/O. 3D RTM on Xeon Gold 6148 x2 (Skylake 2.40GHz 40C). Dataset: Sandia/SEG Salt Model 45 shot subset]
[Figure: 3D RTM seismic imaging example]

Stencil Code

▌What is "stencil code"?
• A procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc.