WSCAD2020
An Evolved and Brand New Vector Technology SX-Aurora TSUBASA-Present & Future-
22nd October, 2020 NEC Corporation AI Platform Division
© NEC Corporation 2020
Presentation Agenda
1) Introduction of SX-Aurora TSUBASA
2) Details of new products and their performance
3) Future prospects
4) Brief introduction of SX-Aurora TSUBASA communities
Over 30 Years of Experience in High Sustained Performance
NEC has carried the technology and know-how of the previous SX series over into a PCIe Vector Engine card.
The next-generation innovation platform SX-Aurora TSUBASA was released in February 2018 and renewed in 2020 (the 2nd generation of SX-Aurora TSUBASA). Customers all over the world are using Aurora!
SX-Aurora TSUBASA Is Growing
Over 17,000 VE cards have been sold to more than 100 customers. NEC hopes to provide a better HPC solution to more users.
[Chart: VE cards sold, 2018 to 2020; vertical axis from 0 to 20,000 cards]
Trusted and Chosen by World-Famous HPC Centers
NEC has received several orders from Japanese and European HPC centers. The following three systems are already in operation.
Fusion Science (Japan), NIFS (National Institute for Fusion Science): By utilizing a new Plasma Simulator, it is possible to analyze the behavior of complex fusion plasmas at a larger scale and in a shorter period of time.
Weather / Climate (Germany): The new HPC system will enable the development of seamless prediction of severe weather events, including thunderstorms and heavy rain.
Academic (Japan), Tohoku University: The supercomputer AOBA installed at Tohoku University will contribute not only to academic research but also to supporting a secure and safe society.
What is SX-Aurora TSUBASA
SX-Aurora TSUBASA
Downsizing of the supercomputer, realized by NEC's technology: from the traditional supercomputer to Rack Type and Tower Type systems.
Vector technology makes it possible to process large volumes of data at a time, with 10 times the memory bandwidth of Xeon.
No specialized knowledge is required: an application can be executed once it has been compiled. Programs are written in C/C++/Fortran.
Customers can choose the system that meets their needs. Everything from the server type to the card specification is selectable, and NEC helps customers maximize cost performance across all market requirements.
1. High Performance (Vector Engine card)
Vector processor (up to 10 cores) on a standard PCIe card
A different execution model from GPGPU
Standard programming environment: Fortran/C/C++
Peak performance: 3.07 TF (DP) / 6.14 TF (SP)
Memory bandwidth: 1.53 TB/s (highest in the world), 10 times that of a Xeon processor
Memory capacity: 48 GB
2. Ease of Use: Open Environment
From a dedicated OS on the previous SX series to an open Linux OS on Aurora. Existing Linux assets can be widely used.
[Diagram: Linux open environment. Linux assets (user environment, library, tool, application) and x86 peripherals on the Linux OS, paired with high-performance calculation on the Vector Engine.]
Linux assets and peripherals can mostly be used as they are on the Linux OS, alongside the high performance of the Vector Engine.
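Basic Execution Model
Accelerator (GPGPU): needs special programming for the accelerator. The application runs on the x86 processor under the Linux OS and hands individual functions to the accelerator.
SX-Aurora TSUBASA: the entire application runs on the Vector Engine. Just compile and run. No data transfer bottleneck.
[Diagram: on the GPGPU side, the application sits on the x86 processor and its functions on the accelerator; on the SX-Aurora TSUBASA side, the application and its functions sit on the Vector Engine, next to the Linux OS on the x86 processor.]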
VEOS Offload Models
Run the application in the right way.
Basic execution model: the application runs on the VE.
VEO (VE Offload): the application runs on the x86 node and offloads work to the VE.
VH call: the application runs on the VE and calls into the x86 node (VH).
[Diagram: for each model, an x86 node running Linux with VEOS, attached to a Vector Engine.]
See the manuals on the Aurora Web Forum (Documentation, VEOS): the tutorial and API reference of libsysve/veoffload.
Architecture
SX-Aurora TSUBASA = standard x86 + Vector Engine
Linux + standard languages (Fortran/C/C++)
Enjoy high performance with easy programming
Hardware: standard x86 server (VH) + Vector Engine (VE), connected by PCIe
Software: Linux OS; automatic vectorization compiler for Fortran/C/C++; no special programming like CUDA
Interconnect: InfiniBand for MPI; VE-VE direct communication support
Automatic vectorization compiler + easy programming (standard language) = enjoy high performance!
3. Flexibility
A wide range of models, from the personal tower model to the high-end supercomputer model
Supercomputer model A511-64: for large-scale configurations; DLC with 40℃ water
Water-cooled model A401-8/B401-8: 8 VE within 2U; DLC with 40℃ water
Rack mount model A311-4/B300-8: flexible configuration; air cooled
Tower model A111-1: for developers/programmers; office use
New Products and Performance
Product Lineup
A wide range of models, from the personal tower model to the high-end supercomputer model
VE20 Processor (the 2nd generation of Vector Engine)
VE20 Specifications
Item                     Type 20A                      Type 20B
Cores/processor          10 (NEW)                      8
Core performance         307 GF (DP) / 614 GF (SP)     (both types)
Processor performance    3.07 TF (DP) / 6.14 TF (SP)   2.45 TF (DP) / 4.91 TF (SP)
Cache capacity           16 MB                         (both types)
Cache bandwidth          3 TB/s                        (both types)
Cache function           Software controllable         (both types)
Memory capacity          48 GB                         (both types)
Memory bandwidth         1.53 TB/s (NEW)               (both types)
Power                    ~300 W (TDP), ~200 W (application)   (both types)
[Diagram: processor layout with the cores, two 8 MB LLC blocks, and HBM2 interfaces connecting to the HBM2 stacks]
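The processor-level figures in the specification follow from the per-core figures, and the ratio of memory bandwidth to peak performance gives the bytes-per-flop balance. The following is a minimal sketch of that arithmetic using only numbers quoted on the slide; the function and variable names are hypothetical. Note that 8 cores x 307 GF gives 2.456 TF, which the slide quotes as 2.45 TF.

```python
# Figures quoted on the VE20 specification slide.
CORE_DP_GFLOPS = 307.0   # per-core double-precision peak
CORE_SP_GFLOPS = 614.0   # per-core single-precision peak
MEM_BW_GBPS = 1530.0     # 1.53 TB/s memory bandwidth
CORES = {"Type 20A": 10, "Type 20B": 8}


def processor_peak_tflops(cores, core_gflops):
    """Processor peak in TFLOPS = cores x per-core peak in GFLOPS / 1000."""
    return cores * core_gflops / 1000.0


def bytes_per_flop(mem_bw_gbps, peak_tflops):
    """Memory bandwidth (GB/s) divided by peak performance (GFLOPS)."""
    return mem_bw_gbps / (peak_tflops * 1000.0)


for name, cores in CORES.items():
    dp = processor_peak_tflops(cores, CORE_DP_GFLOPS)
    sp = processor_peak_tflops(cores, CORE_SP_GFLOPS)
    balance = bytes_per_flop(MEM_BW_GBPS, dp)
    print(f"{name}: {dp:.3f} TF (DP), {sp:.3f} TF (SP), {balance:.2f} B/F (DP)")
```

A balance of roughly half a byte per double-precision flop is what allows memory-bound codes to sustain a large fraction of peak.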
VE SKU
SKU               Memory bandwidth   Memory capacity
VE20A, VE20B      1.53 TB/s          48 GB
VE10AE, VE10BE    1.35 TB/s          48 GB
VE10A, VE10B      1.22 TB/s          48 GB
VE10CE            1.00 TB/s          24 GB
VE10C             0.75 TB/s          24 GB
* VE10AE runs at 1.584 GHz; VE10BE runs at 1.408 GHz.
Horizontal axis of the chart: peak performance 2.15 TF, 2.45 TF, 3.07 TF; frequency 1.4 GHz, 1.6 GHz; cores 8, 10.
Card Implementation
Standard PCIe implementation
Connector: PCIe Gen.3 x16
Double height (same form factor as Nvidia)
<300 W
CPU Technology
Six-HBM2 memory implementation developed by TSMC, Broadcom, and NEC
Processor: 16 nm FF, 33.00 x 14.96 mm, up to 10 cores, 3.07 TF, 1.53 TB/s
Memory: HBM2 3D-stacked memory, HBM2 x 6 per processor
Si interposer: the processor and the 6 HBM2 stacks are integrated
Stream Benchmark
▌Single node comparison (Triad)
System                       Performance [GB/s]
Xeon Gold 6142 (2 CPU)       180
Tesla V100 (1 GPU)           830
A64FX*1 (1 CPU)              830
Aurora1 10A (1 CPU)          988
Aurora1E 10AE (1 CPU)        1084
Aurora2 20B (1 CPU)          1245
*1 Supercomputer "Fugaku", formerly known as Post-K: https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf
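For readers unfamiliar with the Triad kernel, it computes a = b + s * c over large arrays and reports the bytes moved per second. The following is a minimal NumPy sketch of that idea, not the official STREAM code used for the figures above; the function name, array size, and repeat count are assumptions chosen for illustration.

```python
import time

import numpy as np


def stream_triad(n=10_000_000, scalar=3.0, repeats=5):
    """Run a[i] = b[i] + scalar * c[i] and return (a, best GB/s).

    Bandwidth is computed the STREAM way: three arrays of n 8-byte
    elements (read b, read c, write a) per iteration. NumPy evaluates
    the expression in two passes, so the real traffic is higher and
    the reported figure is a conservative estimate.
    """
    b = np.full(n, 1.0)
    c = np.full(n, 2.0)
    a = np.empty(n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.multiply(c, scalar, out=a)   # a = scalar * c
        np.add(a, b, out=a)             # a = b + scalar * c
        best = min(best, time.perf_counter() - start)
    bytes_moved = 3 * n * a.itemsize
    return a, bytes_moved / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    _, gbps = stream_triad()
    print(f"Triad: {gbps:.1f} GB/s")
```

The kernel performs only two floating-point operations per 24 bytes of traffic, which is why the result tracks memory bandwidth rather than peak FLOPS.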
Himeno Benchmark
▌Single node comparison (size: XL, single precision)
[Chart: performance in GFLOPS for Xeon Gold 6148 (2 CPU), Tesla V100*1 (1 GPU), A64FX*2 (1 CPU), Aurora1 10A (1 CPU), Aurora1E 10AE (1 CPU), and Aurora2 20A (1 CPU); bar labels as extracted: 380, 346, 331, 339, 305, 82]
*1 Performance evaluation of a vector supercomputer SX-Aurora TSUBASA: https://dl.acm.org/citation.cfm?id=3291728
*2 Supercomputer "Fugaku", formerly known as Post-K: https://www.fujitsu.com/global/Images/supercomputer-fugaku.pdf
Product Lineup: New Product Introduction
A wide range of models, from the personal tower model to the high-end supercomputer model
High Density DLC Server
Packaging density is 2 times higher than the A5xx series (64 VE). Space and energy savings are achieved by the high-temperature (40℃) cooling system.
8 VE/2U high-density DLC server
[Diagram: system integration of x18 servers per rack, with cooling by a chiller tower]
144 VE/rack with DLC: ~442 TF/rack, 220 TB/s/rack
High Density Solution
Direct Liquid Cooling (DLC) for the VE: provides higher density; hot water cooling (~40℃)
High-density VH server, 8 VE/2U. One VH consists of:
VE x8
AMD Rome processor x1
IB HCA x2
Performance: 24.56 TF/VH
Memory bandwidth: 12.24 TB/s/VH
HPL: 2.8 kW
Meteorology code: 1.7 kW
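The per-VH and per-rack figures quoted on the high-density slides are straightforward multiples of the single-VE peak values. A minimal sketch of that arithmetic follows, assuming the Type 20A figures (3.07 TF, 1.53 TB/s) and the 8 VE per VH and 18 servers per rack stated on the slides; the constant and function names are hypothetical.

```python
VE_PEAK_TFLOPS = 3.07    # double-precision peak of one VE (Type 20A)
VE_MEM_BW_TBPS = 1.53    # memory bandwidth of one VE
VES_PER_VH = 8           # 8 VE within one 2U VH server
VHS_PER_RACK = 18        # x18 servers per rack


def aggregate(per_ve, ve_count):
    """Aggregate peak value for a group of identical VEs."""
    return per_ve * ve_count


ves_per_rack = VES_PER_VH * VHS_PER_RACK

print(f"VEs per rack:       {ves_per_rack}")
print(f"Peak per VH:        {aggregate(VE_PEAK_TFLOPS, VES_PER_VH):.2f} TF")
print(f"Bandwidth per VH:   {aggregate(VE_MEM_BW_TBPS, VES_PER_VH):.2f} TB/s")
print(f"Peak per rack:      {aggregate(VE_PEAK_TFLOPS, ves_per_rack):.2f} TF")
print(f"Bandwidth per rack: {aggregate(VE_MEM_BW_TBPS, ves_per_rack):.2f} TB/s")
```

These are peak values; sustained figures depend on the application.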
Performance Sample (1): Stencil Code Accelerator (SCA) Library
▌What is "stencil code"?
A procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc. It updates each element in a multidimensional array by referring to the neighboring elements:
$$B_{i,j,k} = c\,B_{i,j,k} + \sum_{r=-t}^{t} \sum_{q=-t}^{t} \sum_{p=-t}^{t} F_{p,q,r}\,A_{i+p,\,j+q,\,k+r}$$
Requires high performance in both computation and memory access.
[Figure: stencil shape examples]
▌Domains where stencil code appears:
Scientific simulations: fluid dynamics, thermal analysis, electromagnetic field analysis, climate/weather, etc.
Signal processing: audio, sonar, radar, radio telescopes, etc.
Image / volume data processing: retouching, data compression, recognition, medical diagnosis (biopsy, CT, MRI, ...), etc.
Machine learning: deep learning (convolutional neural networks)
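The stencil update on this slide can be written in a few lines of NumPy. The following is a naive sketch for illustration only, not the SCA library's API or implementation; the function name is hypothetical, and it assumes that A carries a halo of width t on every side so that all neighbor accesses stay in bounds.

```python
import numpy as np


def stencil_update(A, B, F, c):
    """Return c*B[i,j,k] + sum over p,q,r of F[p,q,r] * A[i+p, j+q, k+r].

    F has shape (2t+1, 2t+1, 2t+1). A is B's grid padded with a halo of
    width t on every side, so A.shape == B.shape + 2*t per dimension.
    """
    t = F.shape[0] // 2
    nx, ny, nz = B.shape
    assert F.shape == (2 * t + 1,) * 3
    assert A.shape == (nx + 2 * t, ny + 2 * t, nz + 2 * t)
    out = c * B
    for p in range(-t, t + 1):
        for q in range(-t, t + 1):
            for r in range(-t, t + 1):
                # Shifted view of A that lines up neighbor (p, q, r)
                # with every point of B at once.
                shifted = A[t + p:t + p + nx, t + q:t + q + ny, t + r:t + r + nz]
                out = out + F[p + t, q + t, r + t] * shifted
    return out


rng = np.random.default_rng(0)
t = 1
B = rng.random((4, 5, 6))
A = rng.random((4 + 2 * t, 5 + 2 * t, 6 + 2 * t))
F = rng.random((2 * t + 1,) * 3)
result = stencil_update(A, B, F, c=0.5)
print(result.shape)
```

Each output point reads (2t+1)^3 neighbors, which is why the pattern stresses both arithmetic throughput and memory bandwidth.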
Performance Results
SCA on VE shows the highest performance.
[Chart: GFLOPS (single precision), axis from 0 to 3000, for stencil shapes 1x1y1za, 2x2y2za, 3x3y3za, 4x4y4za, 5x5y5za, and 6x6y6za. Series: Aurora VE Type 20B / naïve implementation; Aurora VE Type 20B / SCA; Tesla V100 PCI-E 16GB / naïve implementation; Tesla V100 PCI-E 16GB / Physis; Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / naïve implementation; Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / YASK. GPU memory transfer is excluded.]
NEC Numerical Library Collection (NLC)
NLC is a collection of mathematical libraries that powerfully supports the development of numerical simulation programs.
ASL Unified Interface: Fourier transforms and random number generators
BLAS / CBLAS: Basic linear algebra subprograms
FFTW3 Interface: Interface library to use the Fourier transform functions of ASL with the FFTW (version 3.x) API
LAPACK: Linear algebra package
ScaLAPACK: Scalable linear algebra package for distributed-memory parallel programs
ASL: Scientific library with a wide variety of algorithms for numerical/statistical calculations: linear algebra, Fourier transforms, spline functions, special functions, approximation and interpolation, numerical differentials and integration, roots of equations, basic statistics, etc.
SBLAS: Sparse BLAS
Stencil Code Accelerator: Stencil code acceleration
HeteroSolver: Direct sparse solver
Performance Sample (2): NLCPy
NLCPy is a package for accelerating the performance of NumPy scripts. It enables NumPy scripts to compute on the Vector Engine of SX-Aurora TSUBASA and provides a subset of NumPy's API.
▌Example of Python script
NumPy version (script and computation run on the x86 node):
    import numpy as np
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    c = a + b
    d = np.dot(a, b)
NLCPy version (the script runs on the x86 node; the library transfers data between VH and VE and computes on the VE):
    import nlcpy as np
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    c = a + b
    d = np.dot(a, b)
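Because the two scripts differ only in the import line, one script can serve both environments. The following is a minimal sketch of that pattern, assuming NLCPy is importable as nlcpy on a machine with a Vector Engine; the alias xp and the variable backend are choices made for this illustration, not part of either library.

```python
# Prefer NLCPy when it is installed; otherwise fall back to NumPy.
try:
    import nlcpy as xp
    backend = "nlcpy"
except ImportError:
    import numpy as xp
    backend = "numpy"

# The same array code runs with either backend.
a = xp.random.rand(1000, 1000)
b = xp.random.rand(1000, 1000)

c = a + b          # element-wise addition
d = xp.dot(a, b)   # matrix product

print(f"backend: {backend}, c: {c.shape}, d: {d.shape}")
```

This only works for functions that NLCPy actually provides, since it covers a subset of NumPy's API.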
Performance Comparison of NLCPy vs NumPy
NLCPy delivers 3.8x higher performance than NumPy for Haversine.
NLCPy delivers 37x higher performance than NumPy for the 2-D heat equation.
[Charts: Haversine, plotted against size (number of sets of latitudes and longitudes); 2-D heat equation, plotted against Nx and Ny (number of grid points in the x- and y-directions)]
Processors (both charts):
NumPy: Xeon Gold 6126 x 2 sockets (Skylake 2.6 GHz, 48 cores)
NLCPy: Aurora 1.4 GHz, 8 cores
The benchmark programs are based on the following sites:
https://github.com/weld-project/split-annotations/tree/master/python/benchmarks/haversine
https://benchpress.readthedocs.io/autodoc_benchmarks/heat_equation.html#heat-equation-python-numpy
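To give a sense of the two workloads, here is a minimal NumPy sketch of a Haversine distance calculation and of one Jacobi-style sweep for a 2-D heat equation. It is an illustration written for this text, not the code from the benchmark sites above; the function names, the Earth radius constant, and the four-neighbor averaging scheme are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def heat_step(grid):
    """One Jacobi sweep: each interior point becomes the mean of its
    four neighbors; boundary values are kept fixed."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
    )
    return new


# Haversine over whole arrays of coordinates at once.
lats = np.array([0.0, 35.0, 52.0])
lons = np.array([0.0, 139.0, 13.0])
print(haversine_km(lats, lons, 0.0, 0.0))

# Heat equation: hot top edge, cold interior, a few sweeps.
grid = np.zeros((6, 6))
grid[0, :] = 1.0
for _ in range(10):
    grid = heat_step(grid)
print(grid.round(3))
```

Both kernels are expressed entirely as whole-array operations, which is the style of script that an array library can execute on the VE without per-element Python overhead.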
Comparison with NumPy
The first version of NLCPy provides a subset of NumPy functions. NLCPy will continue to expand its coverage.
Function list             NLCPy   NumPy   Examples
Array Creation            17      45      empty, eye, ones, zeros, full, etc.
Array Manipulation        15      53      reshape, transpose, broadcast, etc.
Universal Functions       83      89      add, subtract, multiply, divide, etc.
Mathematical Functions    67      94      sin, cos, tan, exp, log, etc.
Indexing Routines         4       36      nonzero, where, take, etc.
Random Sampling           46      90
Statistics                14      27
Sorting and Searching     7       19
Linear Algebra            1       31
Fourier Transform         0       18
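The function counts above can be turned into per-category coverage ratios. The following is a small sketch of that arithmetic using only the numbers in the table; the dictionary and function names are hypothetical.

```python
# (NLCPy count, NumPy count) per category, copied from the table.
COUNTS = {
    "Array Creation": (17, 45),
    "Array Manipulation": (15, 53),
    "Universal Functions": (83, 89),
    "Mathematical Functions": (67, 94),
    "Indexing Routines": (4, 36),
    "Random Sampling": (46, 90),
    "Statistics": (14, 27),
    "Sorting and Searching": (7, 19),
    "Linear Algebra": (1, 31),
    "Fourier Transform": (0, 18),
}


def coverage(nlcpy_count, numpy_count):
    """Fraction of the NumPy functions in a category that NLCPy provides."""
    return nlcpy_count / numpy_count


for category, (n_nlcpy, n_numpy) in COUNTS.items():
    print(f"{category:<24} {coverage(n_nlcpy, n_numpy):6.1%}")

total_nlcpy = sum(n for n, _ in COUNTS.values())
total_numpy = sum(m for _, m in COUNTS.values())
print(f"{'Total':<24} {coverage(total_nlcpy, total_numpy):6.1%}")
```

The ratios make it easy to see that element-wise (universal) functions are nearly complete in the first version, while linear algebra and Fourier transforms are the main gaps.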
The Future
Target Area of NEC SX Series
While continuing development in the traditional HPC area, NEC aims to reach users in new areas where supercomputers have not been used because of their size.
▌Aiming to utilize and collaborate on 5G and other future technologies through the development of new areas
Artificial intelligence, image analysis, chemical analysis, tsunami forecast, plant control, weather forecast, malware detection, fluid analysis, structural analysis
>>>>> Downsizing >>>>>
Disaster prevention and mitigation, manufacturing, e-commerce, transportation, security, medical
We will update software for SX-Aurora TSUBASA
We plan to continue to enhance the functionality and performance of our software.
OpenMP 5.0
Docker/Singularity
Hybrid MPI (Host+VE)
VE Offload/VH Call extension
NLCPy updates
Etc.
Community
Aurora Forum: https://www.hpc.nec/
SC20: Register NOW!
https://sc20.supercomputing.org/
Please check the Aurora Forum website: https://www.hpc.nec/
Please feel free to contact me. [email protected]
Please send any requests to Fernanda Silos at NEC Latin America S.A. [email protected]