Vector Engine Card and SX-Aurora TSUBASA

March 2019 NEC Corporation

Agenda

• Overview of VE and SX-Aurora TSUBASA
• Benchmark
• Use Cases
• Frovedis

Overview

History of Vector computing

NEC has always provided high sustained performance by vector supercomputer (SX series)

Good, but…
• large
• expensive
• special
like dinosaurs

Vector Engine (PCI card)
• Fast
• Powerful
• Compact
• Economical
like falcons

Packed vector technologies accumulated over 35 years into PCI card

[Figure: performance over time (1990, 2000, 2010) for SX-2, SX-3, SX-4, SX-5, SX-6, SX-7, SX-8, SX-9 and SX-ACE, with Earth Simulator, Earth Simulator 2 and Earth Simulator 3]

© NEC Corporation 2019

New Value

NEC’s vector technology can create new social value as the key to accelerating HPC + AI/Big Data analytics

[Figure: vector technology as the base for Simulation (HPC) and AI/BigData Analytics. Labels include Financial/Economics, Life, Security, Energy, Manufacturing, Disaster management, Weather, Climate, Genetic, Geophysical, and statistical, image, acoustic, fluid and structural analysis]

New Architecture

• SX-Aurora TSUBASA = Standard x86 + Vector Engine
• Linux + standard language (Fortran/C/C++)
• Enjoy high performance with easy programming

SX-Aurora TSUBASA Architecture

Hardware
• Standard x86 server + Vector Engine

Software
• Linux OS
• Automatic vectorization compiler
• Fortran/C/C++
• No special programming like CUDA

Interconnect
• InfiniBand for MPI
• VE-VE direct communication support

[Figure: application and Linux OS on an x86 server (VH), connected over PCIe to the Vector Engine (VE)]

• Automatic vectorization compiler
• Easy programming (standard language)
• Enjoy high performance

GPGPU and VE

GPGPU Architecture (Function offload model)
• The OS and the application (AP) run on the x86 side; functions are offloaded to the GPGPU (CUDA)
• Frequent PCIe transmission: data transmission, start processing, end processing, result transmission
Disadvantage
• PCIe bottleneck
• Programming difficulty

Aurora Architecture (OS offload model)
• Whole AP is executed on VE; the OS (I/O, etc.) is on the x86 side
Advantage
• Avoiding PCIe bottleneck
• Standard language

[Figure: x86 + GPGPU and x86 + VE, each device with its own memory and connected by PCIe; timelines from exec to exit, with OS (I/O, etc.) and function phases marked]

VEOS offload models

Run the application in the right way

• OS Offload: the application runs on the VE
• VEO: an x86 application works together with a VE application
• VH call: a VE application works together with an x86 application

[Figure: each model is drawn on an x86 node + Vector Engine pair, with the OS on the x86 node]

Vector Processor on The Card

• Newly developed vector processor
• PCIe card implementation, but not an accelerator
• 8 cores / processor
• 2.45TF performance
• 1.22TB/s memory bandwidth
• Normal programming with Fortran/C/C++

Processor

VE Specification

cores/CPU           8
core performance    ~307GF (DP), ~614GF (SP)
CPU performance     ~2.45TF (DP), ~4.91TF (SP)
cache capacity      16MB shared
memory bandwidth    1.22TB/s
memory capacity     24, 48GB

[Figure: processor block diagram with 8 cores at 307GF each (2.45TF in total), bandwidth labels 0.4TB/s and 3TB/s between the cores and the 16MB software controllable cache, and 1.22TB/s to HBM2 memory x 6]

Processor SKU

• 3 Processor SKUs, Type 10A/10B/10C
  - Frequency: 1.6GHz or 1.4GHz
  - Memory Bandwidth: 1.22TB/s or 0.75TB/s
  - Memory Capacity: 48GB or 24GB

Processor SKUs

VE Type    Freq. (GHz)   GF/core   Cores   DP TF/processor   Memory BW (TB/s)   Memory size (GB)
Type 10A   1.6           307       8       2.45              1.22               48
Type 10B   1.4           269       8       2.15              1.22               48
Type 10C   1.4           269       8       2.15              0.75               24
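The peak figures in the SKU table follow from cores x clock x floating-point operations per cycle. The sketch below reproduces them under one assumption: 192 double-precision operations per cycle per core. That constant is not stated on the slide; it is inferred from 307GF per core at 1.6GHz. The function and dictionary names are illustrative only.

```python
# Peak-performance arithmetic for the three VE SKUs.
# Assumption: 192 DP flop/cycle/core, inferred from 307 GF / 1.6 GHz on the slide.
DP_FLOP_PER_CYCLE = 192

# (frequency in GHz, cores, memory bandwidth in TB/s), taken from the SKU table
SKUS = {
    "Type 10A": (1.6, 8, 1.22),
    "Type 10B": (1.4, 8, 1.22),
    "Type 10C": (1.4, 8, 0.75),
}


def peak_gflops(freq_ghz, cores, flop_per_cycle=DP_FLOP_PER_CYCLE):
    """Peak double-precision GFLOPS of one processor."""
    return freq_ghz * flop_per_cycle * cores


def bytes_per_flop(bw_tb_s, gflops):
    """Memory bandwidth available per unit of peak compute (bytes per flop)."""
    return (bw_tb_s * 1000.0) / gflops


if __name__ == "__main__":
    for name, (freq, cores, bw) in SKUS.items():
        gf = peak_gflops(freq, cores)
        print(f"{name}: {gf / 1000:.2f} TF peak, {bytes_per_flop(bw, gf):.2f} B/F")
```

Under this assumption Type 10B comes out at 2150.4 GFLOPS, which agrees with the "VE(8c, 1.4GHz): 2150.4 GFlops(DP)" figure quoted on the use-case slides, and Type 10A at 2457.6 GFLOPS, which the table lists as 2.45.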

Software stack

Programming Environment

Vector Cross Compiler
• automatic vectorization
• automatic parallelization

Fortran: F2003, F2008
C/C++: C11/C++14
OpenMP: OpenMP 4.5
Library: MPI 3.1, libc, BLAS, LAPACK, etc.
Debugger: gdb, Eclipse Parallel Tools Platform
Tools: PROGINF, FtraceViewer

$ vi sample.c
$ ncc sample.c

Execution Environment

$ ./a.out

[Figure: the command is entered on the VH and execution takes place on the VE]

Libraries

▌NEC Library provides a wide variety of functions
• NEC library is fully tuned for Aurora architecture

Function                              NEC Lib   MKL
BLAS                                  ✓         ✓
LAPACK                                ✓         ✓
ScaLAPACK                             ✓         ✓
FFT                                   ✓         ✓
Random number generators              ✓         ✓
Direct sparse solvers                 ✓         ✓
Iterative sparse solvers              ✓         ✓
Functions for Statistics              ✓         ✓
Spline functions                      ✓         ✓
Special functions                     ✓         ✓
Approximation and Interpolation       ✓
Numerical Differentials/Integrals     ✓
Roots of Equations                    ✓
Time series analysis                  ✓
Sorting and ranking                   ✓

Product Lineup

• Product: VEs + x86/Linux server
• 3 series, A100, A300, and A500, for various users

Supercomputer Model: A500 series
• For large scale configuration
• DLC (direct liquid cooling) with 40℃ water

Rack Mount Model: A300 series (2VE, 4VE, 8VE)
• Flexible configuration
• Air Cooled

Tower Model: A100 series (1VE)
• For developer/programmer
• Office room use

Product Lineup

▌Vector Engine (VE) SKUs

SKU        # of cores   Peak performance   Memory Bandwidth   Memory Capacity
Type 10A   8 cores      2.45 TFLOPS        1.20 TB/s          48 GB
Type 10B   8 cores      2.15 TFLOPS        1.20 TB/s          48 GB
Type 10C   8 cores      2.15 TFLOPS        0.75 TB/s          24 GB

▌Models

                 Tower model   Rack-mount model                               Supercomputer
Models           A100-1        A300-2         A300-4         A300-8           A500-64
# of VE          1             Up to 2        Up to 4        Up to 8          Up to 64
Form Factor      Tower         1U Rackmount   1U Rackmount   4U Rackmount     Dedicated Rack
System cooling   Air cool      Air cool       Air cool       Air cool         Liquid cool
Supported VE     Type 10A, Type 10B, Type 10C (varies by model)

Benchmark

HPL and STREAM

• HPL: Aurora provides competitive FLOPS capability
• STREAM: Aurora provides highest sustained memory bandwidth

[Figure: HPL / Node (SKL=1) and STREAM / Node (SKL=1), bar charts including SKL6148 and Volta100]

• Aurora provides HPL sustained performance in the same range as SKL/KNL
• Aurora provides the highest memory bandwidth

• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node
• KNL is Intel Knights Landing x1/node
• V100 is NVIDIA Tesla V100 x1/node

HPCG

Performance/power of Aurora is 7 times better than SKL

[Figure: HPCG/Node, Performance/Node (SKL=1), and HPCG/power (W), Performance/power (SKL=1), for Aurora and SKL6148; the performance/power chart is annotated "7x"]

• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node

NAS Parallel Benchmark (Class C)

Aurora shows over 2x performance/power compared to SKL(6148)

[Figure: Performance/node [SKL=1] and Performance/Power [SKL=1] for FT, CG and MG, Aurora vs SKL(6148)]

• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node

Use cases

Use case: Financial Option Pricing

▌European option pricing (Monte Carlo)
• Using Intel MKL financial option pricing example
  https://software.intel.com/en-us/mkl_cookbook_samples

[Figure: Xeon vs VE. One chart plots against Number of Cores (0 to 16); axis labels in the extracted text read "Throughput" and "Training Time [sec]". Bar values 12.38 and 3.77; annotations "x3.3 faster" and "x4.7 faster (socket comparison)"]

Xeon Gold 6126 (12c, 2.6GHz/1.7GHz): 652.8GFlops(DP)
VE (8c, 1.4GHz): 2150.4 GFlops(DP)
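For readers unfamiliar with the workload, a minimal sketch of Monte Carlo pricing for a European call option follows. It is not the Intel MKL cookbook sample used in the benchmark; the parameter values, function names and the pure-Python random number generator are assumptions chosen for illustration. Each simulated path is independent of the others, which is the property that makes this kind of loop a candidate for vector hardware.

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call (reference value)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        # Standard normal cumulative distribution function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def monte_carlo_call(s0, k, r, sigma, t, n_paths, seed=0):
    """Monte Carlo estimate: mean discounted payoff over simulated terminal prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        # Terminal price under geometric Brownian motion
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)
    return math.exp(-r * t) * total / n_paths


if __name__ == "__main__":
    # Hypothetical option parameters, not taken from the benchmark
    params = dict(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
    print("Monte Carlo  :", monte_carlo_call(n_paths=100_000, **params))
    print("Black-Scholes:", black_scholes_call(**params))
```

The closed-form value is included only as a sanity check on the estimate; the benchmark itself measures the simulation loop.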

CT Scan Program Benchmark Result (MBIR)

~4x higher performance than 1 socket Xeon

▌Model-Based Iterative Reconstruction (MBIR)

[Figure: "Execution time per core": Time [sec] for Xeon and VE, bar values 1024.96 and 142.69, annotated "x7.1 faster". "Improved performance": Performance against Number of Cores (0 to 16), normalized by 1 core performance of Gold 6126, annotated "x3.9 faster"]

- NEC tested CT Program with Model-Based Iterative Reconstruction (MBIR) algorithm
- NEC compared VE performance to Gold 6126 (2.6GHz)
- Test Program is OSS provided by Purdue University

Use case: Malware Detection

• Malware detection training used to take a few days
• Daily training
• Detect new malware rapidly
• Reduce malware damage

▌Analyze application binary and detect malware
Using NEC machine learning application

[Figure: "Training time using 1 core (required training time for 100 epochs)": Training Time [sec] for Xeon and VE, bar values 574.66 and 154.57, annotated "x3.7 faster". "Performance per socket (training throughput)": Throughput against Number of Cores (0 to 16), annotated "x2.5 more training"]

Xeon Gold 6126 (12c, 2.6GHz/1.7GHz): 652.8GFlops(DP)
VE (8c, 1.4GHz): 2150.4 GFlops(DP)
Assign independent training to each core (assuming parallel training with different parameters)
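The two annotations on this slide can be connected with a short calculation. The sketch below assumes that 574.66 and 154.57 are the single-core training times of the Xeon and the VE respectively (the reading that matches "x3.7 faster"), and that per-core time stays the same when every core runs its own independent training, as the slide's note suggests. Names are illustrative.

```python
# Figures read from the slide (assumed assignment: larger time = Xeon)
XEON_1CORE_SEC = 574.66   # training time on 1 core
VE_1CORE_SEC = 154.57
XEON_CORES = 12           # Xeon Gold 6126
VE_CORES = 8              # Vector Engine


def single_core_speedup(t_ref, t_new):
    """How many times faster the new single-core time is than the reference."""
    return t_ref / t_new


def socket_throughput(cores, sec_per_training):
    """Independent trainings completed per second with one training per core.

    Assumption: per-core time does not degrade when all cores are busy.
    """
    return cores / sec_per_training


if __name__ == "__main__":
    core_ratio = single_core_speedup(XEON_1CORE_SEC, VE_1CORE_SEC)
    socket_ratio = (socket_throughput(VE_CORES, VE_1CORE_SEC)
                    / socket_throughput(XEON_CORES, XEON_1CORE_SEC))
    print(f"1 core: x{core_ratio:.1f}, socket: x{socket_ratio:.1f}")
    # prints "1 core: x3.7, socket: x2.5"
```

The socket ratio is smaller than the single-core ratio because the Xeon socket has 12 cores against the VE's 8.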

Simulation: Risk assessment of heatstroke

SX-Aurora TSUBASA delivers 1.8x higher performance compared to Xeon/Skylake

[Figure: Performance/node (Xeon=1), SX-Aurora TSUBASA vs Xeon (Gold 6148), annotated "1.8x faster". Body surface temperature [ºC] maps (scale 33.0 to 39.0) for subjects labeled 3 years old (male), 22 years old (male), 22 years old (female), 65 years old (male) and 75 years old (male). Source: Prof. Hirata of Nagoya Institute of Technology]

This simulation is ...
• Temperature elevation and sweating are computed by solving a bio-heat equation considering thermoregulatory response.
• With weather forecast, the risk of heatstroke can be estimated.

Simulation of Shot peening

SX-Aurora TSUBASA delivers 2.4x higher performance compared to Xeon/Skylake

[Figure: Performance/node (Xeon=1), SX-Aurora TSUBASA vs Xeon (Gold 6148), annotated "2.4x faster". Source: PhD student Y. Mizuno and Prof. S. Takahashi of Tokai Univ.]

This simulation is ...
• Technology to create compressive residual stress layer on metal surface
• Collision of metal shots (small particles) to strengthen the material
• More accurate and efficient process for various subjects
• Many applications such as wing panel, crankshaft, connecting rod, etc.

Frovedis

AI and BigData

We target accelerating memory intensive workloads for HPC and AI
• Several statistical MLs and some Neural Networks (MLP, LSTM)

Developed middleware that speeds up Spark for statistical ML

[Figure: AI and Big Data workloads placed by memory performance and computation performance, with regions for the vector processor, general purpose GPU and general purpose CPU. Statistical machine learning: demand prediction, price prediction, recommender system. Deep learning: translation (LSTM), voice recognition (CNN), image recognition (CNN). MLP: multi layer perceptron]

Frovedis: NEC Middleware for statistical machine learning

▌Frovedis: NEC middleware compatible with Spark
• Same API
• Only 3 lines to change in the original Spark program
• OSS developed by NEC
  https://github.com/frovedis

Original Spark program: logistic regression

…
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
…
val model = LogisticRegressionWithSGD.train(data)
…

Change to call NEC middleware implementation

…
import com.nec.frovedis.mllib.classification.LogisticRegressionWithSGD  // Change import
…
FrovedisServer.initialize(...)                                          // Start server
val model = LogisticRegressionWithSGD.train(data)                       // Same API (no change)
FrovedisServer.shut_down()                                              // Stop server
…

Frovedis supported process

Machine Learning
• Logistic Regression, Linear Regression, Lasso Regression, Ridge Regression
• Linear SVM, Collaborative filtering (ALS), SVD, EVD, PCA
• Naive Bayes, Decision Tree, Factorization Machines, K-means, word2vec

Data Frame
• Filter, Sort, Join, Group by

Basic matrix operations
• Solve, Inverse, Gemv, SpMV, LU, Least square, Gemm
• Backed by ScaLAPACK, LAPACK, BLAS

Performance of Frovedis (Machine Learning)

Frovedis + VE shows over 100x performance in ML compared to Spark + x86

[Figure: Speedup (spark/x86=1), Spark + x86 vs Frovedis + Aurora (VE), for Logistic Regression (web ad), K-means (document clustering) and Singular value decomposition (recommendation). Spark + x86 bars are 1; Frovedis + Aurora bar values are 42.8, 56.8 and 111, with the annotation "111x"]

• x86: Intel Xeon Gold 6126 x1 socket
• Aurora: Vector Engine Type 10B (1.4GHz, 8core) x1
• Performance comparison does not include I/O time

Frovedis + VE shows high performance in Data Frame compared to Spark + x86

[Figure: Speedup (spark/x86=1), Spark + x86 vs Frovedis + Aurora (VE), for Filter, Sort, Join and Group by. Spark + x86 bars are 1; Frovedis + Aurora bar values are 26.7, 17.2, 9.82 and 8.7]

• x86: Intel Xeon Gold 6126 x1 socket
• Aurora: Vector Engine Type 10B (1.4GHz, 8core) x1
• Performance comparison does not include I/O time

Visit https://www.hpc.nec and join our quest

Hybrid system

There are various types of applications
• Run the right application on the right platform

[Figure: a vector cluster and a scalar cluster share a scheduler (NQSV), a network (IB) and a shared filesystem (ScaTeFS). Vector cluster: VHs (x86) running VE ctrl, with VEs running v.out. Scalar cluster: x86 servers running s.out]

Incorporate VH as part of scalar cluster
• Higher performance, higher system utilization

[Figure: same system with a shared scheduler (NQSV), network (IB) and shared filesystem (ScaTeFS). The VHs (x86) run s.out alongside VE ctrl, so the scalar cluster spans the VHs and the x86 servers; the vector cluster consists of the VEs running v.out]

MPI processes run on both VE nodes and x86 nodes as a single MPI program.

[Figure: shared scheduler (NQSV) and shared filesystem (ScaTeFS). Scalar cluster (VHs and x86 servers): s.out processes with rank labels 0, 1023, 1024 and 2047. Vector cluster (VEs): v.out processes with rank labels 2048 and 4097]

▌Single job per compute node
• Use all of the compute nodes at a time
• Run multiple copies of a single script on many nodes with a different set of data for each copy (Bulk job)

User can run a Bulk job over Xeon cluster and VE cluster

JobX

#!/bin/bash
#PBS -b 2000 --venum_lhost=1
#PBS -b 3500 --cpunum_lhost=48
#PBS -T distrib

if [ -n "${VE_NODE_NUMBER}" ]; then
    ve.out  # VEnode
else
    xe.out  # Xeon
fi

[Figure: NQSV dispatches JobX.0, JobX.1, … JobX.1999 (ve.out) to the VE cluster and JobX.2000, JobX.2001, … JobX.5499 (xe.out) to the Xeon cluster]
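The bulk job can be read as a simple mapping from sub-job index to executable. The sketch below models that mapping and the script's branch on VE_NODE_NUMBER. It is an illustration only: the function names are hypothetical, and the assumption that VE sub-jobs are numbered before Xeon sub-jobs is taken from the figure, not from NQSV documentation.

```python
# Sub-job counts from the two "#PBS -b" lines of JobX
VE_SUBJOBS = 2000     # #PBS -b 2000 --venum_lhost=1
XEON_SUBJOBS = 3500   # #PBS -b 3500 --cpunum_lhost=48


def binary_for_subjob(index):
    """Executable run by sub-job JobX.<index>.

    Assumption: VE sub-jobs are numbered first, as drawn on the slide.
    """
    if not 0 <= index < VE_SUBJOBS + XEON_SUBJOBS:
        raise ValueError(f"sub-job index out of range: {index}")
    return "ve.out" if index < VE_SUBJOBS else "xe.out"


def select_binary(env):
    """Mirror of the script's branch: ve.out when VE_NODE_NUMBER is set and non-empty."""
    return "ve.out" if env.get("VE_NODE_NUMBER") else "xe.out"


if __name__ == "__main__":
    for i in (0, 1999, 2000, 5499):
        print(f"JobX.{i}: {binary_for_subjob(i)}")
    print(select_binary({"VE_NODE_NUMBER": "0"}), select_binary({}))
```

Keeping the choice inside one script means the same file is submitted to both clusters and each copy decides locally which binary to run.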

▌Single job utilizing all compute nodes
• MPI processes running on both VE and Xeon nodes as a single MPI program (Hybrid MPI)

A hybrid job that uses VE and Xeon clusters can be easily executed in close cooperation with NEC MPI

#!/bin/bash
#PBS -b 2000 --venum_lhost=1
#PBS -b 3500 --cpunum_lhost=48
#PBS -T necmpi
mpiexec -ppn 1 -venode -nn 2000 ve.out : -nn 3500 xe.out

[Figure: under NQSV, a single MPI program runs ranks 0, 1, … 1999 (ve.out) on the VE cluster and ranks 2000, 2001, … 5499 (xe.out) on the Xeon cluster]

User can run VH-VE hybrid MPI application within a single job

#!/bin/bash
#PBS -b 2
#PBS -T necmpi

mpiexec -nn 2 -np 2 vh.out : -venode -nn 16 -np 32 ve.out

• Options for VH: -nn 2 -np 2 vh.out
• Options for VE: -venode -nn 16 -np 32 ve.out

[Figure: single MPI program across two VHs. Each VH runs vh.out and connects through PCIe switches (PCIe SW) and HCA 0 / HCA 1 to VE 0 through VE 7, each running ve.out]

User can run MPI application without being aware of VE topology

$ qsub -b 2 --cpunum_lhost=2 --venum_lhost=8 run.sh

run.sh

#!/usr/bin/bash
mpiexec -venode -nn 16 -np 32 ve.out    # -nn: #VEs, -np: #Procs

[Figure: run.sh starts mpiexec on a VH, with mpid on each VH; ve.out processes run on the VEs behind PCIe switches (PCIe SW) and HCAs; communication paths are labeled RDMA and SHMEM]
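To make the numbers in the command concrete: 2 logical hosts with 8 VEs each give 16 VEs, and 32 processes over 16 VEs is 2 processes per VE. The sketch below assigns ranks to (host, VE, slot) under a block-wise, evenly split placement. That placement order is an assumption made for illustration; the actual placement is decided by NEC MPI and NQSV and may differ. The function name is hypothetical.

```python
def place_ranks(hosts, ves_per_host, total_procs):
    """Map each MPI rank to (host, ve, slot), filling VEs in order.

    Assumption: processes are split evenly and placed block-wise.
    """
    total_ves = hosts * ves_per_host
    if total_procs % total_ves != 0:
        raise ValueError("this sketch only handles an even split")
    per_ve = total_procs // total_ves
    placement = {}
    for rank in range(total_procs):
        global_ve, slot = divmod(rank, per_ve)       # which VE overall, which slot on it
        host, ve = divmod(global_ve, ves_per_host)   # which host, which VE on that host
        placement[rank] = (host, ve, slot)
    return placement


if __name__ == "__main__":
    # qsub -b 2 --venum_lhost=8 ; mpiexec -venode -nn 16 -np 32 ve.out
    layout = place_ranks(hosts=2, ves_per_host=8, total_procs=32)
    for rank in (0, 1, 2, 15, 16, 31):
        print(rank, layout[rank])
```

The point of the slide is that the user never writes such a mapping: only the VE count and the process count appear on the command line.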

User can run VH-VE hybrid MPI application by one command

$ qsub -b 2 --cpunum_lhost=2 --venum_lhost=8 run.sh

run.sh

#!/usr/bin/bash
mpiexec -nn 2 -np 2 vh.out : -venode -nn 16 -np 32 ve.out

• Options for VH: -nn 2 -np 2 vh.out
• Options for VE: -venode -nn 16 -np 32 ve.out

[Figure: run.sh starts mpiexec on a VH, with mpid on each VH; vh.out runs on the VHs and ve.out on the VEs behind PCIe switches (PCIe SW) and HCAs; communication paths are labeled RDMA]