SX-Aurora TSUBASA (Vector Engine)

Brand-new Vector Supercomputing power in Server Chassis

Deepak Pathania, OEM Alliance

1 © NEC Corporation 2019 High per FLOPs or Evolving towards massive data processing

Vector Processor massive data processing Memory Performance Demand Forecasting

Crash Price Forecasting Simulation Recommendation Weather Simulation Simulation AI

Voice Recognition

Image Recognition Massively Parallel Processor Multipurpose (GPGPU) Processor Computing Compute Performance 2 © NEC Corporation 2019 Performance High Bytes per FLOP’s Targeted Roadmap

Vector Engine Memory Performance Aurora 3

Aurora 2

Aurora 1 GPU

CPU Computing ComputePerformance 3 © NEC Corporation 2019 Performance on Card (World’s Highest Memory Bandwidth Processor)

 New Developed Vector Processor (Derived from Super-Computer)  PCIe Card Implementation  8 cores / processor  2.15TF performance (double precision)  1.2TB/s memory bandwidth, 48GB memory  Normal programming with Fortran/C/C++

4 © NEC Corporation 2019 Core Architecture Single core

VFMA0 VFMA0 VFMA0VFMA1 VFMA0VFMA1 VFMA0VFMA1 VFMA0VFMA1VFMA2 VFMA0VFMA1VFMA2 VFMA0VFMA1VFMA2 VFMA0VFMA1VFMA2ALU0 VFMA0VFMA1VFMA2ALU0 VFMA0VFMA1VFMA2ALU0 VFMA0VFMA1VFMA2ALU0ALU1 VFMA0VFMA1VFMA2ALU0ALU1 VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA0VFMA1VFMA2ALU0ALU1DIV VFMA1VFMA2ALU0ALU1DIV VFMA1VFMA2ALU0ALU1DIV VFMA2ALU0ALU1DIV VFMA2ALU0ALU1DIV VFMA2ALU0ALU1DIV ALU0ALU1DIV ALU0ALU1DIV ALU0ALU1DIV ALU1DIV ALU1DIV 1.2TB/s / processor 400GB/s / core ALU1DIV DIV (Ave. 150GB/s / core) DIV

SPU Scalar Processing Unit

5 © NEC Corporation 2019 Licensed Open-Source

▌C/C++ Compiler  ISO/IEC 9899:2011 (aka C11) ▌Materials for Learning and  ISO/IEC 14882:2014 (aka C++14) ▌Fortran Compiler Training  ISO/IEC 1539-1:2004 (aka Fortran 2003) Tuning guides and various manuals  ISO/IEC 1539-1:2010 (aka Fortran 2008) ▌Open MP and MPI ▌VEOS or middleware  Version 4.5  Heterogeneous Computing with Vectors ▌Libraries ▌ML Libraries in C/C++  Glibc  MPI version 3.1 (fully tuned for Vector Engine Architecture  Frovedis Library or Spark alike machine  NLC Libraries (BLAS, LAPACK, FFT etc.) learning framework in C/C++ ▌ Tools FtraceViewer/PROGINF ▌Deep Learning using Gprof Vectors GDB, Eclipse Parallel Tools Platform

6 © NEC Corporation 2019 Ease of Programming

Automatic Vectorization feature and various tools help in vectorization for advance levels

void matmul(float *A, float *B, float *C, int l, int m, int n){ int i, j, k; No modification is necessary for vectorization for (i = 0; i < l; i++) { for (j = 0; j < n; j++) { float sum = 0.0; for (k = 0; k < m; k++) Just compile, and loops are vectorized automatically sum += A[i * m + k] * B[k * n + j]; C[i*n+j] = sum; } } [Compiler diagnostic message] } $ ncc sample.c -O4 -report-all -fdiag-vector=2 void alloc_matrix(float **m_h, int h, int w){ ncc: opt(1589): sample.c, line 11: Outer loop moved inside inner loop(s).: j *m_h = (float *)malloc(sizeof(float) * h * w); ncc: vec( 101): sample.c, line 11: Vectorized loop. } ncc: opt(1592): sample.c, line 13: Outer loop unrolled inside inner loop.: k ncc: vec( 101): sample.c, line 13: Vectorized loop. // other function definitions … ncc: vec( 128): sample.c, line 14: Fused multiply-add operation applied. ncc: vec( 101): sample.c, line 15: Vectorized loop. int main(int argc, char *argv[]){ float *Ah, *Bh, *Ch; struct timeval t1, t2; [Format list] // prepare matrix A alloc_matrix(&Ah, L, M); 8: void matmul(float *A, float *B, float *C, int l, int m, int n){ init_matrix(Ah, L, M); 9: int i, j, k; // do it again for matrix B 10: +------> for (i = 0; i < l; i++) { alloc_matrix(&Bh, M, N); 11: |X-----> for (j = 0; j < n; j++) { init_matrix(Bh, M, N); 12: || float sum = 0.0; // allocate spaces for matrix C 13: ||V----> for (k = 0; k < m; k++) alloc_matrix(&Ch, L, N); 14: ||V---- F sum += A[i * m + k] * B[k * n + j]; 15: || C[i*n+j] = sum; // call matmul function 16: |X----- } matmul(Ah, Bh, Ch, L, M, N); 17: +------} 18: } return 0; } 7 © NEC Corporation 2019 VEOS offload models

Run the application in the right way possible using VEOS

OS Offload VH call VEO

VE x86 Application Application

x86 VE VE Application Application Application

OS OS OS x86 Vector x86 Vector x86 Vector node Engine node Engine node Engine

Study Reference: https://www.hpc.nec/api/v1/forum/file/download?id=LbGhNY

8 © NEC Corporation 2019 Vectors for Weather Vectors for Material Science

NEC received a 50 Million Euro order from the NEC SX-Aurora TSUBASA installed by High Energy Deutscher Wetterdienst (DWD) for a highly Accelerator Research Organization, and National innovative European weather forecasting system Institute for Environmental Studies for successive using NEC SX-Aurora TSUBASA vector systems, Japan

“The system combines forecasting based on observations with very demanding numerical weather prediction models “The organization promotes simulation research in high in order for a more precise prediction of the development energy physics, and introduced SX-Aurora TSUBASA to and the tracks of such small-scale weather events up to implement the joint use program "Primary Particle Nuclear twelve hours into the future. This will enable better and Space Simulation Program".,” earlier warnings for local populations," Says KEK (High Energy Accelerator Research Organization) said Mr. Detlev Majewski, Head of the Department of

Meteorological Analysis and Numerical Modelling at the Source: NEC Press Release Deutscher Wetterdienst.

Source: NEC Press Release

9 © NEC Corporation 2019 NEC and HPE join forces in an Exclusive partnership Vector optimized expertise meets Global HPC leader

 30+ years in Vector compute  #1 HPC Market Share * architecture  Global market coverage  Silicon to software design expertise  Purpose built server portfolio

 Optimized software stack  Strong ISV partner eco-system • Compiler throughout the world • SDK • Libraries  Specialized HPC expertise • System management and operation • Benchmarking • Solution integration • Hybrid infrastructure

10 11 NEC and Colfax partner to provide groundbreaking HPC development at your desk

• Over 30 years of experience in delivering custom and HPC solutions

• Extensive customer base especially academia and research labs

• Specialized HPC expertise • Solution design and development • HPC research and training • Hybrid system design

• NEC and Colfax partnership aims to provide “personal supercomputing” power for leading-edge development

11 © NEC Corporation 2019 Legacy Adapt/Evolve Change for Good

▌Earth Simulator delivered the ▌Super computers unlike in the ▌SX-Aurora TSUBASA (Wing) highest efficiency or high past, should now be available supercomputing processor on a Bytes/Flops. PCIe Card and supporting for everyone. heterogeneity. ▌Fairly easy to implement for harnessing full vector potential. ▌Pure vectors needed for ▌HBM2 and 1.2 TB/s of massive data processing but bandwidth with large vector ▌Vectors are till today and for keeping in mind power and pipes and cores delivering 2.15 tomorrow will remain the efficiency. TFLOPs DP per card. ultimate super power for huge data processing of any super ▌Auto-vectorization and computing architecture. ▌Co-exist with others for solving parallelization for ease of problems. programming. ▌ Yet reduction in size has been an area of improvement from ▌Continue the legacy…. ▌Continue the legacy to deliver legacy. High Bytes/FLOPs with power efficiency.

12 © NEC Corporation 2019 Wish to experience our NEC SX-Aurora TSUBASA?

Join our trial program today!

▌Contact Us: : [email protected]..com

13 © NEC Corporation 2019