WSCAD2020

An Evolved and Brand New Vector Technology: SX-Aurora TSUBASA - Present & Future -

22nd October, 2020 NEC Corporation AI Platform Division

© NEC Corporation 2020

Presentation Agenda
1) Introduction of SX-Aurora TSUBASA
2) Details of new products and their performance
3) Future prospects
4) Brief introduction of SX-Aurora TSUBASA communities

Over 30 Years of Experience for High Sustained Performance

NEC has brought the technology and know-how of the previous SX series together in a PCIe Vector Engine card.

The next-generation innovation platform SX-Aurora TSUBASA was released in February 2018 and renewed in 2020 (the 2nd generation of SX-Aurora TSUBASA). Customers all over the world are using Aurora!

SX-Aurora TSUBASA Is Growing

- Over 17,000 VE cards have been sold to more than 100 customers.
- We hope to provide a better HPC solution to more users.

[Chart: VE cards (vertical axis, 0 to 20,000) by year (2018, 2019, 2020)]

Trusted and Chosen by World Famous HPC Centers

NEC has received several orders from Japanese and European HPC centers. The following three systems are already in operation.

- Fusion Science (Japan), NIFS (National Institute for Fusion Science): With the new Plasma Simulator, the behavior of complex fusion plasmas can be analyzed at a larger scale and in a shorter period of time.
- Weather / Climate (Germany): The new HPC system will enable the development of seamless prediction of severe weather events, including thunderstorms and heavy rain.
- Academic (Japan), Tohoku Univ.: AOBA, installed at Tohoku University, will contribute not only to academic research but also to supporting a secure and safe society.

What is SX-Aurora TSUBASA

SX-Aurora TSUBASA

Downsizing of the supercomputer, realized by NEC's technology: from the traditional supercomputer to rack-type and tower-type systems.

- Vector technology makes it possible to process large amounts of data at a time, with 10 times the memory bandwidth of Xeon.
- No specialized knowledge is required: an application can be executed simply by compiling it. Programs are written in standard languages such as C/C++.
- Customers can choose a system that meets their needs. Everything from the server type to the card specification is optional, and NEC helps customers maximize cost performance and fit every market requirement.

1. High Performance (Vector Engine card)

- ~10 cores
- PCIe standard
- A different execution model from GPGPU
- Standard programming environment: Fortran/C/C++
- Peak performance: 3.07TF (DP) / 6.14TF (SP)
- Memory bandwidth: 1.53TB/s (highest in the world), 10 times that of a Xeon processor
- Memory capacity: 48GB

2. Ease of Use: Open Environment

From a dedicated OS on the previous SX series to an open OS on Aurora. Existing Linux assets can be widely used.

[Diagram: Linux open environment. The user environment (Linux OS, libraries, tools, applications) on x86 provides Linux assets and peripherals; the Vector Engine provides high-performance calculation.]

The high performance of the Vector Engine can mostly be used from the Linux OS.

Basic Execution Model

Accelerator (GPGPU):
- Needs special programming for the accelerator.

SX-Aurora TSUBASA:
- The entire application runs on the Vector Engine. Just compile and run.
- No data transfer bottleneck.

[Diagram: on the accelerator side, the application runs on the x86 processor under Linux OS and its functions run on the accelerator (GPGPU); on the SX-Aurora TSUBASA side, the application and its functions run on the Vector Engine, with Linux OS on the x86 processor.]

VEOS Offload Models

Run the application in the right way.

- Basic execution model: the application runs on the VE.
- VEO (VE Offload): an x86 application works together with a VE application.
- VH call: a VE application works together with an x86 application.

[Diagram: in each model, VEOS and Linux run on the x86 node, which hosts the Vector Engine.]

See the manual on the Aurora Web Forum: Documentation > VEOS > The tutorial and API reference of libsysve/veoffload

Architecture

- SX-Aurora TSUBASA = standard x86 + Vector Engine
- Linux + standard language (Fortran/C/C++)
- Enjoy high performance with easy programming

SX-Aurora TSUBASA Architecture

Hardware
- Standard x86 server (VH) + Vector Engine (VE), connected via PCIe

Software
- Linux OS
- Automatic vectorization
- Fortran/C/C++
- No special programming like CUDA

Interconnect
- InfiniBand for MPI
- VE-VE direct communication support

Automatic vectorization compiler + easy programming (standard language) = enjoy high performance!

3. Flexibility

A wide range of models, from a personal tower model to a high-end supercomputer model:

- Supercomputer model (A511-64): for large-scale configurations; DLC with 40℃ water
- Water-cooled model (A401-8/B401-8): 8VE within 2U; DLC with 40℃ water
- Rack-mount model (A311-4/B300-8): flexible configuration; air cooled
- Tower model (A111-1): for developers/programmers; office room use

New Products and Performance


VE20 Processor (the 2nd generation of Vector Engine)

VE20 Specifications

Processor version        Type 20A                    Type 20B
Cores/processor          10 (NEW)                    8
Processor performance    3.07TF (DP) / 6.14TF (SP)   2.45TF (DP) / 4.91TF (SP)

Common to both types:
Core performance         307GF (DP) / 614GF (SP)
Cache capacity           16MB
Cache bandwidth          3TB/s
Cache function           Software controllable
Memory capacity          48GB
Memory bandwidth         1.53TB/s (NEW)
Power                    ~300W (TDP), ~200W (application)

[Block diagram: cores, two 8MB LLC blocks, HBM2 interfaces, and HBM2 memory stacks]
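The processor figures in the specification follow from the per-core figures multiplied by the core count. The following is a minimal Python sketch of that arithmetic, using only values from the slide. The function names are illustrative, and the bytes-per-FLOP ratio is a derived quantity that does not appear on the slide.

```python
# Per-core peak performance from the VE20 specification (GFLOPS).
CORE_PEAK_GFLOPS = {"DP": 307.0, "SP": 614.0}
# Memory bandwidth from the VE20 specification: 1.53 TB/s, expressed in GB/s.
MEMORY_BANDWIDTH_GBS = 1530.0


def processor_peak_tflops(cores, precision="DP"):
    """Peak processor performance in TFLOPS: cores x per-core peak."""
    return cores * CORE_PEAK_GFLOPS[precision] / 1000.0


def bytes_per_flop(cores, precision="DP"):
    """Memory bandwidth divided by peak performance (a derived ratio)."""
    return MEMORY_BANDWIDTH_GBS / (cores * CORE_PEAK_GFLOPS[precision])


if __name__ == "__main__":
    for name, cores in (("Type 20A", 10), ("Type 20B", 8)):
        dp = processor_peak_tflops(cores, "DP")
        sp = processor_peak_tflops(cores, "SP")
        ratio = bytes_per_flop(cores, "DP")
        print(f"{name}: {dp:.3f} TF (DP), {sp:.3f} TF (SP), {ratio:.2f} B/F (DP)")
```

For Type 20B the product is 2.456 TF (DP) and 4.912 TF (SP), which the slide lists as 2.45TF and 4.91TF.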

VE SKU

Memory capacity   Memory bandwidth   SKUs
48GB              1.53TB/s           VE20B, VE20A
48GB              1.35TB/s           VE10BE*, VE10AE*
48GB              1.22TB/s           VE10B, VE10A
24GB              1.00TB/s           VE10CE
24GB              0.75TB/s           VE10C

* 10AE is 1.584GHz; 10BE is 1.408GHz.

Horizontal axis of the chart: performance (2.15TF, 2.45TF, 3.07TF), frequency (1.4GHz, 1.6GHz), cores (8core, 10core).

Card Implementation

- Standard PCIe implementation
- Connector: PCIe Gen.3 x16
- Double height (same form factor as Nvidia)
- <300W
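For a sense of scale, the theoretical bandwidth of a PCIe Gen.3 x16 link can be compared with the 1.53TB/s on-card memory bandwidth quoted in this deck. The sketch below assumes the standard PCIe 3.0 parameters (8 GT/s per lane, 128b/130b encoding), which are not stated on the slide, and ignores protocol overhead. The helper names are illustrative.

```python
def pcie_bandwidth_gbs(transfer_rate_gt_s, lanes, encoding_efficiency):
    """Theoretical one-direction link bandwidth in GB/s."""
    # Each transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return transfer_rate_gt_s * lanes * encoding_efficiency / 8.0


def transfer_seconds(n_bytes, bandwidth_gbs):
    """Time to move n_bytes at the given bandwidth (GB/s, 1 GB = 1e9 bytes)."""
    return n_bytes / (bandwidth_gbs * 1e9)


# Assumed PCIe 3.0 parameters: 8 GT/s per lane, 16 lanes, 128b/130b encoding.
PCIE_GEN3_X16_GBS = pcie_bandwidth_gbs(8.0, 16, 128.0 / 130.0)
# On-card memory bandwidth from the deck: 1.53 TB/s.
VE_MEMORY_GBS = 1530.0

if __name__ == "__main__":
    one_gib = 2**30
    print(f"PCIe Gen.3 x16: {PCIE_GEN3_X16_GBS:.2f} GB/s per direction")
    print(f"Memory/PCIe bandwidth ratio: {VE_MEMORY_GBS / PCIE_GEN3_X16_GBS:.0f}x")
    print(f"1 GiB over PCIe:   {transfer_seconds(one_gib, PCIE_GEN3_X16_GBS) * 1e3:.1f} ms")
    print(f"1 GiB from memory: {transfer_seconds(one_gib, VE_MEMORY_GBS) * 1e3:.2f} ms")
```

The gap between the two figures is one way to read the deck's point about running the entire application on the Vector Engine rather than moving data across the link for each offloaded function.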

CPU Technology

6 HBM2 memory implementation, developed by TSMC/Broadcom/NEC

Processor
- 16 nm FF
- 33.00 x 14.96 mm
- ~10 cores
- 3.07TF
- 1.53TB/s

Memory
- HBM2 3D stacking memory
- HBM2 x 6 per processor

Si interposer
- Processor and 6 HBM2 are integrated
- TSMC/Broadcom/NEC

Stream Benchmark

▌Single node comparison (Triad)

[Bar chart: performance in GB/s. Systems: Xeon Gold 6142 (2CPU), Tesla V100 (1GPU), A64FX*1 (1CPU), Aurora1 10A (1CPU), Aurora1E 10AE (1CPU), Aurora2 20B (1CPU). Bar values, in extraction order and not matched to systems: 1245, 1084, 988, 830, 830, 180.]

*1 Supercomputer "Fugaku" (formerly known as Post-K): https://indico.math.cnrs.fr/event/4705/attachments/2362/2942/CEA-RIKEN-school-19013.pdf
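For readers unfamiliar with the Triad kernel, the following is a rough NumPy sketch of what it measures: the update a = b + scalar * c over large arrays, timed and converted to GB/s. It is not the official STREAM code and was not used for the figures above. The array size, scalar, and repeat count are arbitrary assumptions, and because NumPy performs the update in two passes, the reported number understates what the memory system actually moved.

```python
import time

import numpy as np


def stream_triad(n=2_000_000, scalar=3.0, repeats=5):
    """Run a[i] = b[i] + scalar * c[i] and return (a, best bandwidth in GB/s)."""
    b = np.full(n, 1.0)
    c = np.full(n, 2.0)
    a = np.empty_like(b)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.multiply(c, scalar, out=a)  # a = scalar * c
        np.add(a, b, out=a)            # a = b + scalar * c
        elapsed = time.perf_counter() - start
        best = min(best, max(elapsed, 1e-9))  # guard against a zero timer reading
    # STREAM convention for Triad: two arrays read and one array written.
    bytes_counted = 3 * n * a.itemsize
    return a, bytes_counted / best / 1e9


if __name__ == "__main__":
    _, gbs = stream_triad()
    print(f"Triad: {gbs:.1f} GB/s")
```

The kernel does two floating-point operations per 24 bytes counted, so its result is governed almost entirely by memory bandwidth rather than by peak FLOPS.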

Himeno Benchmark

▌Single node comparison (size: XL, single precision)

[Bar chart: performance in GFLOPS. Systems: Xeon Gold 6148 (2CPU), Tesla V100*1 (1GPU), A64FX*2 (1CPU), Aurora1 10A (1CPU), Aurora1E 10AE (1CPU), Aurora2 20A (1CPU). Bar values, in extraction order and not matched to systems: 380, 346, 331, 339, 305, 82.]

*1 Performance evaluation of a vector supercomputer SX-Aurora TSUBASA: https://dl.acm.org/citation.cfm?id=3291728

*2 Supercomputer "Fugaku" (formerly known as Post-K): https://www.fujitsu.com/global/Images/supercomputer-fugaku.pdf

Product Lineup: New Product Introduction

High Density DLC Server

- Packaging density is 2 times higher than the A5xx series (64VE).
- Space and energy savings are realized by the high-temperature (40℃) cooling system.

[Diagram: system integration of 8VE/2U high density DLC servers (x18 servers) with a cooling chiller tower]

144VE/rack DLC: ~442TF/rack, 220TB/s/rack

High Density Solution

Direct Liquid Cooling
- VE DLC to provide higher density
- Hot water cooling (~40℃)

High density VH server: 8VE/2U

One VH consists of:
- VE x8
- AMD Rome processor x1
- IB HCA x2

Performance: 24.56TF/VH
Memory bandwidth: 12.24TB/s/VH

HPL: 2.8kW
Meteorology code: 1.7kW
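The per-VH and per-rack figures quoted in this deck can be reproduced by scaling the per-card figures of the 10-core VE (3.07TF, 1.53TB/s) by the number of cards. A small Python sketch of that arithmetic follows. The constant and function names are illustrative, and the 18 servers per rack is taken from the "x18 servers" label on the DLC server slide.

```python
# Per-card figures of the 10-core VE quoted in this deck.
VE_PEAK_TFLOPS = 3.07   # double precision
VE_MEM_BW_TBS = 1.53    # memory bandwidth in TB/s

VES_PER_VH = 8          # 8VE/2U high density server
VHS_PER_RACK = 18       # "x18 servers" per rack


def aggregate(n_ve):
    """Return (peak TFLOPS, memory bandwidth in TB/s) for n_ve cards."""
    return n_ve * VE_PEAK_TFLOPS, n_ve * VE_MEM_BW_TBS


if __name__ == "__main__":
    vh_tf, vh_bw = aggregate(VES_PER_VH)
    rack_ve = VES_PER_VH * VHS_PER_RACK
    rack_tf, rack_bw = aggregate(rack_ve)
    print(f"Per VH:   {vh_tf:.2f} TF, {vh_bw:.2f} TB/s")
    print(f"Per rack: {rack_ve} VE, {rack_tf:.2f} TF, {rack_bw:.2f} TB/s")
```

These are peak figures obtained by simple multiplication; they say nothing about sustained application performance across cards.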

Performance Sample (1): Stencil Code Accelerator (SCA) Library

▌What is "stencil code"?
- A procedure pattern that frequently appears in scientific simulations, image processing, signal processing, deep learning, etc.
- Updates each element in a multidimensional array by referring to the neighbor elements:

$$B_{i,j,k} = c\,B_{i,j,k} + \sum_{r=-t}^{t} \sum_{q=-t}^{t} \sum_{p=-t}^{t} F_{p,q,r}\, A_{i+p,\,j+q,\,k+r}$$

- Requires significant performance of both computation and memory access.
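To make the update rule concrete, the following is a naive NumPy sketch of the stencil formula above. It is not the SCA library and does not use its API; it only illustrates the computation pattern. The function name and the boundary treatment (boundary points keep their old values) are assumptions, since the slide does not specify them.

```python
import numpy as np


def naive_stencil(A, B, F, c):
    """Apply B[i,j,k] = c*B[i,j,k] + sum_{p,q,r} F[p,q,r] * A[i+p,j+q,k+r].

    A, B : 3-D arrays of the same shape
    F    : (2t+1, 2t+1, 2t+1) coefficient array; F[p+t, q+t, r+t] holds F_{p,q,r}
    c    : scalar weight applied to the old value of B
    Only interior points are updated (assumed boundary treatment).
    """
    t = F.shape[0] // 2
    n0, n1, n2 = A.shape
    acc = np.zeros((n0 - 2 * t, n1 - 2 * t, n2 - 2 * t),
                   dtype=np.result_type(A, F))
    for p in range(-t, t + 1):
        for q in range(-t, t + 1):
            for r in range(-t, t + 1):
                # Shifted view of A: element (i+p, j+q, k+r) for every interior (i, j, k).
                shifted = A[t + p:n0 - t + p, t + q:n1 - t + q, t + r:n2 - t + r]
                acc += F[p + t, q + t, r + t] * shifted
    out = np.array(B, dtype=np.result_type(B, acc), copy=True)
    interior = (slice(t, n0 - t), slice(t, n1 - t), slice(t, n2 - t))
    out[interior] = c * B[interior] + acc
    return out


if __name__ == "__main__":
    A = np.ones((5, 5, 5))
    B = np.zeros((5, 5, 5))
    F = np.ones((3, 3, 3))  # t = 1: a full 3x3x3 neighborhood
    print(naive_stencil(A, B, F, c=0.5)[2, 2, 2])
```

Each output element reads (2t+1)^3 neighbors for one write, which is why the pattern stresses both arithmetic throughput and memory access.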

[Stencil Shape Examples]

▌Domains where stencil code appears:
- Scientific simulations: fluid dynamics, thermal analysis, electromagnetic field analysis, climate / weather, etc.
- Signal processing: audio, sonar, radar, radio telescopes, etc.
- Image / volume data processing: retouch, data compression, recognition, medical diagnosis (biopsy, CT, MRI, ...), etc.
- Machine learning: deep learning (convolutional neural networks)

Performance Results

SCA on VE shows the highest performance.

[Bar chart: GFLOPS (single precision), axis from 0 to 3000, for stencil shapes 1x1y1za, 2x2y2za, 3x3y3za, 4x4y4za, 5x5y5za, and 6x6y6za.]

Series:
- Aurora VE Type 20B / Naïve Impl.
- Aurora VE Type 20B / SCA
- Tesla V100 PCI-E 16GB / Naïve Impl.
- Tesla V100 PCI-E 16GB / Physis
- Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / Naïve Impl.
- Xeon Gold 6148 x2 (Skylake 2.40GHz 40C) / YASK

GPU memory transfer is excluded.

NEC Numerical Library Collection (NLC)

NLC is a collection of mathematical libraries that powerfully supports the development of numerical simulation programs.

- ASL Unified Interface: Fourier transforms and random number generators
- BLAS / CBLAS: Basic linear algebra subprograms
- FFTW3 Interface: Interface library to use Fourier transform functions of ASL with the FFTW (version 3.x) API
- LAPACK: Linear algebra package
- ScaLAPACK: Scalable linear algebra package for distributed memory parallel programs
- ASL: Scientific library with a wide variety of algorithms for numerical/statistical calculations: linear algebra, Fourier transforms, spline functions, special functions, approximation and interpolation, numerical differentials and integration, roots of equations, basic statistics, etc.
- SBLAS: Sparse BLAS
- Stencil Code Accelerator: Stencil code acceleration
- HeteroSolver: Direct sparse solver

Performance Sample (2): NLCPy

NLCPy is a package for accelerating the performance of NumPy scripts.
- Enables NumPy scripts to compute on the Vector Engine of SX-Aurora TSUBASA.
- Provides a subset of NumPy's API.

▌Example of Python script

NumPy:

    import numpy as np

    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)

    c = a + b
    d = np.dot(a, b)

NLCPy (only the import line changes):

    import nlcpy as np

    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)

    c = a + b
    d = np.dot(a, b)

[Diagram: with NumPy, the script and the library both run on the x86 node. With NLCPy, the script runs on the x86 node, the library computation runs on the Vector Engine, and data is transferred between VH and VE.]
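Because the two scripts differ only in the import line, one script can serve both environments. The sketch below is one possible way to arrange that, not a pattern prescribed by NLCPy: it tries `nlcpy` first and falls back to NumPy when the package is not installed. The `BACKEND` variable and the `add_and_dot` helper are illustrative names.

```python
# Prefer NLCPy (computation on the Vector Engine) and fall back to NumPy.
try:
    import nlcpy as np
    BACKEND = "nlcpy"
except ImportError:
    import numpy as np
    BACKEND = "numpy"


def add_and_dot(n=1000):
    """Run the two operations from the slide on n x n random matrices."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = a + b          # element-wise addition
    d = np.dot(a, b)   # matrix product
    return c, d


if __name__ == "__main__":
    c, d = add_and_dot(200)
    print(BACKEND, c.shape, d.shape)
```

Since NLCPy provides a subset of NumPy's API, a fallback like this only works for scripts that stay within the functions both packages implement.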

Performance Comparison of NLCPy vs NumPy

- NLCPy delivers 3.8x higher performance compared to NumPy for Haversine.
  size: number of sets of latitudes and longitudes
- NLCPy delivers 37x higher performance compared to NumPy for the 2-D heat equation.
  Nx: number of grid points in x-direction; Ny: number of grid points in y-direction

Processors (both benchmarks):
- NumPy: Xeon Gold 6126 x 2 sockets (Skylake 2.6 GHz, 48 cores)
- NLCPy: Aurora 1.4GHz, 8 cores

The benchmarking programs are based on the following sites:

https://github.com/weld-project/split-annotations/tree/master/python/benchmarks/haversine

https://benchpress.readthedocs.io/autodoc_benchmarks/heat_equation.html#heat-equation-python-numpy
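For orientation, here is a short NumPy sketch of the two kinds of kernels named above: a vectorized Haversine distance and one Jacobi-style update step for a 2-D heat equation. It is written from the standard formulas, not copied from the linked benchmark programs, so details such as the Earth radius, the 5-point averaging weights, and the boundary handling are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(x) for x in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


def heat_step(grid):
    """One update step: each interior point becomes the average of itself
    and its four neighbors; boundary points are kept fixed."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.2 * (
        grid[1:-1, 1:-1]
        + grid[:-2, 1:-1] + grid[2:, 1:-1]
        + grid[1:-1, :-2] + grid[1:-1, 2:]
    )
    return new


if __name__ == "__main__":
    lats = np.array([0.0, 10.0, 35.0])
    lons = np.array([0.0, 20.0, 139.0])
    print(haversine(lats, lons, 0.0, 0.0))

    grid = np.zeros((6, 6))
    grid[0, :] = 100.0  # hot top edge
    for _ in range(10):
        grid = heat_step(grid)
    print(grid[1, 1:-1])
```

Both kernels are expressed entirely as whole-array operations, which is the style of NumPy code that a package like NLCPy can hand to the Vector Engine.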

Comparison with NumPy

The first version of NLCPy provides a subset of NumPy functions. NLCPy will continue to expand the coverage.

Function list              NLCPy   NumPy   Examples
Array Creation             17      45      empty, eye, ones, zeros, full, etc.
Array Manipulation         15      53      reshape, transpose, broadcast, etc.
Universal Functions        83      89      add, subtract, multiply, divide, etc.
Mathematical Functions     67      94      sin, cos, tan, exp, log, etc.
Indexing Routines          4       36      nonzero, where, take, etc.
Random Sampling            46      90
Statistics                 14      27
Sorting and Searching      7       19
Linear Algebra             1       31
Fourier Transform          0       18
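The function counts above can be turned into coverage ratios per category. The following Python sketch does that arithmetic with the numbers from the slide. The dictionary and function names are illustrative, and the ratios are derived values that do not appear in the deck.

```python
# (NLCPy functions, NumPy functions) per category, as listed on the slide.
FUNCTION_COUNTS = {
    "Array Creation": (17, 45),
    "Array Manipulation": (15, 53),
    "Universal Functions": (83, 89),
    "Mathematical Functions": (67, 94),
    "Indexing Routines": (4, 36),
    "Random Sampling": (46, 90),
    "Statistics": (14, 27),
    "Sorting and Searching": (7, 19),
    "Linear Algebra": (1, 31),
    "Fourier Transform": (0, 18),
}


def coverage(counts):
    """Fraction of NumPy functions that NLCPy provides, per category."""
    return {name: nlcpy / numpy for name, (nlcpy, numpy) in counts.items()}


def totals(counts):
    """Sum of NLCPy and NumPy function counts over all categories."""
    nlcpy_total = sum(n for n, _ in counts.values())
    numpy_total = sum(m for _, m in counts.values())
    return nlcpy_total, numpy_total


if __name__ == "__main__":
    for name, ratio in coverage(FUNCTION_COUNTS).items():
        print(f"{name:<24} {ratio:6.1%}")
    nlcpy_total, numpy_total = totals(FUNCTION_COUNTS)
    print(f"{'Total':<24} {nlcpy_total / numpy_total:6.1%} ({nlcpy_total}/{numpy_total})")
```

Read this way, the first version covers universal and mathematical functions most completely, while linear algebra and Fourier transforms are the largest gaps.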

The Future

Target Area of NEC SX Series

While continuing to advance in the traditional HPC area, the SX series is reaching users in new areas where it has not been used before because of its size.

▌Aiming to utilize and collaborate in 5G and other future technologies through the development of new areas

Applications: Artificial Intelligence, Image Analysis, Chemical Analysis, Tsunami Forecast, Plant Control, Weather Forecast, Malware Detection, Fluid Analysis, Structural Analysis

>>>>> Downsizing >>>>>

Industries: Disaster prevention and mitigation, Manufacturing, E-Commerce, Transportation, Security, Medical

We will update software for SX-Aurora TSUBASA

We plan to continue to enhance the functionality and performance of our software.

- OpenMP 5.0
- Docker/Singularity
- Hybrid MPI (Host+VE)
- VE Offload/VH Call Extension
- NLCPy updates
- Etc.

Community

Aurora Forum: https://www.hpc.nec/

SC20: Register NOW!

https://sc20.supercomputing.org/

Please check the Aurora Forum website: https://www.hpc.nec/

Please feel free to contact me. [email protected]

For any requests, please contact Fernanda Silos at NEC Latin America S.A. [email protected]
