Vector Engine Card and SX-Aurora TSUBASA
March 2019 NEC Corporation
Agenda
• Overview of VE and SX-Aurora TSUBASA
• Benchmark
• Use Cases
• Frovedis
Overview
History of Vector computing
NEC has always provided high sustained performance by vector supercomputer SX series.
Good, but…
• large
• expensive
• special
like dinosaurs
Vector Engine (PCI card)
• Fast
• Powerful
• Compact
• Economical
like falcons
Packed vector technologies accumulated over 35 years into PCI card
[Figure: performance timeline (1990, 2000, 2010) running from SX-2 through SX-3, SX-4, SX-5, SX-6, SX-7, SX-8, SX-9 and SX-ACE, with Earth Simulator, Earth Simulator 2 and Earth Simulator 3.]
© NEC Corporation 2019
New Value
NEC’s Vector technology can invent new Social Values - as the key to accelerate HPC + AI/Big Data Analytics
[Figure: application areas built on vector technology. Foundations: Simulation (HPC) and AI/BigData Analytics. Areas: Financial/Economics, Life, Security, Energy, Progress of science, Manufacturing, Disaster management, Weather, Climate. Analysis types: statistical, image, acoustic, genetic, fluid, geophysical, structural.]
New Architecture
• SX-Aurora TSUBASA = Standard x86 + Vector Engine
• Linux + standard language (Fortran/C/C++)
• Enjoy high performance with easy programming
SX-Aurora TSUBASA Architecture
• Hardware: Standard x86 server (VH) + Vector Engine (VE), connected by PCIe
• Software: Linux OS; Fortran/C/C++ with automatic vectorization compiler; no special programming like CUDA
• Interconnect: InfiniBand for MPI; VE-VE direct communication support
Automatic vectorization compiler → Easy programming (standard language) → Enjoy high performance
GPGPU and VE
GPGPU Architecture (Function offload model)
• The OS and the application (AP) run on the x86 host; only functions (CUDA) run on the GPGPU.
• Between exec and exit, each offloaded function is preceded by data transmission and followed by result transmission.
• Frequent PCIe transmission
• Disadvantage: PCIe bottleneck, programming difficulty
Aurora Architecture (OS offload model)
• The whole AP is executed on the VE, from start of processing to end of processing; OS work (I/O, etc.) is handled on the x86 side.
• Advantage: avoiding the PCIe bottleneck, standard language
VEOS offload models
Run the application in the right way
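Three offload models:
• OS Offload: a VE application runs on the Vector Engine, with the OS on the x86 node.
• VEO: an x86 application offloads work to a VE application.
• VH call: a VE application calls into an x86 application.
[Figure: each model drawn as application boxes above the OS, on an x86 node plus Vector Engine.]
Vector Processor on The Card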
• Newly developed vector processor
• PCIe card implementation, but not an accelerator
• 8 cores / processor
• 2.45TF performance
• 1.22TB/s memory bandwidth
• Normal programming with Fortran/C/C++
Processor
VE Specification
cores/CPU:          8
core performance:   ~307GF (DP), ~614GF (SP)
CPU performance:    ~2.45TF (DP), ~4.91TF (SP)
cache capacity:     16MB shared
memory bandwidth:   1.22TB/s
memory capacity:    24, 48GB
[Figure: 8 cores at 307GF each (2.45TF in total) attached to a 16MB software controllable cache (bandwidth labels 0.4TB/s and 3TB/s), which connects at 1.22TB/s to HBM2 memory x 6.]
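The specification figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the rounded values printed on the slide (307GF per core, 1.22TB/s) and derives the processor peak and the memory bytes available per double-precision flop. The variable names are not from the slide.

```python
# Figures copied from the VE Specification slide (rounded as printed there).
CORES = 8
CORE_GFLOPS_DP = 307      # double-precision GF per core
MEMORY_BW_TBPS = 1.22     # HBM2 memory bandwidth in TB/s

# Processor peak is the per-core peak multiplied by the core count.
cpu_tflops_dp = CORES * CORE_GFLOPS_DP / 1000.0

# The slide lists single precision at twice the double-precision rate.
cpu_tflops_sp = 2 * cpu_tflops_dp

# Memory bytes that can be moved per double-precision flop at peak.
bytes_per_flop = MEMORY_BW_TBPS / cpu_tflops_dp

print(f"DP peak: {cpu_tflops_dp:.3f} TF")
print(f"SP peak: {cpu_tflops_sp:.3f} TF")
print(f"Bytes per flop: {bytes_per_flop:.2f}")
```

With these inputs the ratio is roughly 0.5 bytes per flop, which fits the deck's emphasis on memory-intensive workloads.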
Processor SKU
3 Processor SKUs, Type 10A/10B/10C
- Frequency: 1.6GHz or 1.4GHz
- Memory Bandwidth: 1.22TB/s or 0.75TB/s
- Memory Capacity: 48GB or 24GB
Processor SKUs
VE Type    Freq. (GHz)  GF/core  cores  DP TF/processor  Memory BW (TB/s)  Memory size (GB)
Type 10A   1.6          307      8      2.45             1.22              48
Type 10B   1.4          269      8      2.15             1.22              48
Type 10C   1.4          269      8      2.15             0.75              24
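The SKU table implies that per-core performance scales with clock frequency. The sketch below records the three SKUs as read from the table (the merged cells are taken to mean that Type 10B shares 1.22TB/s and 48GB with Type 10A) and derives flops per cycle per core and the processor peak. The `VeSku` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VeSku:
    name: str
    freq_ghz: float
    core_gflops: float   # double-precision GF per core
    cores: int
    mem_bw_tbps: float
    mem_gb: int


# Values as read from the SKU table (an interpretation of its merged cells).
SKUS = [
    VeSku("Type 10A", 1.6, 307, 8, 1.22, 48),
    VeSku("Type 10B", 1.4, 269, 8, 1.22, 48),
    VeSku("Type 10C", 1.4, 269, 8, 0.75, 24),
]


def flops_per_cycle(sku: VeSku) -> float:
    """Double-precision flops per core per clock cycle implied by the table."""
    return sku.core_gflops / sku.freq_ghz


def processor_tflops(sku: VeSku) -> float:
    """Processor peak in TF: per-core GF times core count."""
    return sku.cores * sku.core_gflops / 1000.0


for sku in SKUS:
    print(sku.name, round(flops_per_cycle(sku)), f"{processor_tflops(sku):.3f} TF")
```

All three SKUs come out at about 192 flops per cycle per core, so the difference between the 2.45TF and 2.15TF parts is explained by frequency alone.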
Software stack
Programming Environment
• Vector Cross Compiler: automatic vectorization, automatic parallelization
• Fortran: F2003, F2008
• C/C++: C11/C++14
• OpenMP: OpenMP4.5
• Library: MPI 3.1, libc, BLAS, LAPACK, etc.
• Debugger: gdb, Eclipse Parallel Tools Platform
• Tools: PROGINF, FtraceViewer
    $ vi sample.c
    $ ncc sample.c
Execution Environment
    $ ./a.out
[Figure: the command is entered on the VH and execution takes place on the VE.]
Libraries
▌NEC Library provides wide variety of functions
NEC library is fully tuned for Aurora architecture
[Table comparing NEC Lib and MKL across the following function groups:]
• BLAS
• LAPACK
• ScaLAPACK
• FFT
• Random number generators
• Direct sparse solvers
• Iterative sparse solvers
• Functions for Statistics
• Spline functions
• Special functions
• Approximation and Interpolation
• Numerical Differentials/Integrals
• Roots of Equations
• Time series analysis
• Sorting and ranking
Product Lineup
Product: VEs + x86/Linux server
3 series, A100, A300, and A500 for various users
Supercomputer Model, A500 series
• For large scale configuration
• DLC (direct liquid cooling) with 40℃ water
Rack Mount Model, A300 series
• Flexible configuration
• Air cooled
• 2VE, 4VE, 8VE
Tower Model, A100 series
• For developer/programmer
• Office room use
• 1VE
Product Lineup
▌Vector Engine (VE) SKUs
SKU        # of cores  Peak performance  Memory bandwidth  Memory capacity
Type 10A   8 cores     2.45 TFLOPS       1.20 TB/s         48 GB
Type 10B   8 cores     2.15 TFLOPS       1.20 TB/s         48 GB
Type 10C   8 cores     2.15 TFLOPS       0.75 TB/s         24 GB
▌Models
Model           A100-1    A300-2        A300-4        A300-8        A500-64
Category        Tower     Rack-mount    Rack-mount    Rack-mount    Supercomputer
# of VE         1         Up to 2       Up to 4       Up to 8       Up to 64
Form factor     Tower     1U Rackmount  1U Rackmount  4U Rackmount  Dedicated Rack
System cooling  Air cool  Air cool      Air cool      Air cool      Liquid cool
Supported VE: Type 10A, Type 10B, or Type 10C, depending on the model
Benchmark
HPL and STREAM
HPL: Aurora provides competitive FLOPS capability
STREAM: Aurora provides highest sustained memory bandwidth
[Figure: two bar charts, HPL / Node and STREAM / Node, each normalized to SKL = 1, with bars labeled SKL6148 and Volta100 alongside Aurora.]
• HPL: Aurora provides the same range of HPL performance as SKL/KNL
• STREAM: Aurora provides the highest sustained memory bandwidth
• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node
• KNL is Intel Knights Landing x1/node
• V100 is NVIDIA Tesla V100 x1/node
HPCG
Aurora delivers 7 times better performance per watt than SKL
[Figure: two bar charts for Aurora and SKL6148, normalized to SKL = 1: HPCG performance/node and HPCG performance/power (W), the latter annotated "7x".]
• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node
NAS Parallel Benchmark (Class C)
Aurora shows over 2x performance/power compared to SKL(6148)
[Figure: two bar charts for the FT, CG, and MG kernels, Aurora versus SKL(6148), normalized to SKL = 1: performance/node and performance/power.]
• Aurora is Vector Engine Type 10B (1.4GHz, 8core)
• SKL is Intel Skylake 6148 Xeon x2/node
Use cases
Use case: Financial Option Pricing
▌European option pricing (Monte Carlo)
Using Intel MKL financial option pricing example
https://software.intel.com/en-us/mkl_cookbook_samples
[Figure: two charts comparing Xeon and VE. One is a bar chart with the values 12.38 and 3.77, annotated "x3.3 faster". The other plots throughput against the number of cores (0 to 16), annotated "x4.7 faster (socket comparison)".]
Xeon Gold 6126 (12c, 2.6GHz/1.7GHz): 652.8GFlops(DP)
VE (8c, 1.4GHz): 2150.4GFlops(DP)
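The two peak figures quoted above follow from cores × clock × flops per cycle. The sketch below reproduces them under two assumptions that the slide does not state: 32 double-precision flops per cycle per Xeon core, evaluated at the lower 1.7GHz clock, and 192 per VE core. Both values were chosen because they reproduce the slide's numbers.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle."""
    return cores * ghz * flops_per_cycle


# Assumed per-core, per-cycle double-precision flop counts (not on the slide).
XEON_FLOPS_PER_CYCLE = 32
VE_FLOPS_PER_CYCLE = 192

xeon_peak = peak_gflops(12, 1.7, XEON_FLOPS_PER_CYCLE)   # Xeon Gold 6126, 12 cores
ve_peak = peak_gflops(8, 1.4, VE_FLOPS_PER_CYCLE)        # VE, 8 cores at 1.4GHz

print(f"Xeon peak: {xeon_peak:.1f} GFlops")
print(f"VE peak:   {ve_peak:.1f} GFlops")
print(f"Peak ratio VE/Xeon: {ve_peak / xeon_peak:.2f}")
```

Under these assumptions the peak ratio between the two sockets is about 3.3.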
CT Scan Program Benchmark Result (MBIR)
~4x higher performance than 1 socket Xeon
▌Model-Based Iterative Reconstruction (MBIR)
[Figure: two charts comparing Xeon and VE. "Execution time per core": bar chart of time [sec] with the values 1024.96 and 142.69, annotated "x7.1 faster". "Improved performance": performance against the number of cores (0 to 16), normalized by 1 core performance of Gold 6126, annotated "x3.9 faster".]
- NEC tested CT Program with Model-Based Iterative Reconstruction (MBIR) algorithm
- NEC compared VE performance to Gold 6126 (2.6GHz)
- Test Program is OSS provided by Purdue University
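A speedup factor such as "x7.1 faster" is the ratio of two execution times. The sketch below applies that definition to the two bar values on this slide, reading 1024.96 as the Xeon single-core time and 142.69 as the VE single-core time in seconds. That assignment is an interpretation of the extracted chart, not something the slide states in text.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / candidate_seconds


# Bar values from the MBIR slide, read as single-core times in seconds.
XEON_SECONDS = 1024.96
VE_SECONDS = 142.69

mbir_speedup = speedup(XEON_SECONDS, VE_SECONDS)
print(f"Single-core speedup: {mbir_speedup:.2f}x")
```

The ratio lands between 7.1 and 7.2, consistent with the slide's annotation.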
Use case: Malware Detection
Malware detection training used to take a few days. Daily training makes it possible to detect new malware rapidly and reduce malware damage.
▌Analyze application binary and detect malware
Using NEC machine learning application
[Figure: two charts comparing Xeon and VE. "Training time using 1 core (required training time for 100 epoch)": bar chart of training time [sec] with the values 574.66 and 154.57, annotated "x3.7 faster". "Performance per socket (training throughput)": throughput against the number of cores (0 to 16), annotated "x2.5 more training".]
Xeon Gold 6126 (12c, 2.6GHz/1.7GHz): 652.8GFlops(DP)
VE (8c, 1.4GHz): 2150.4GFlops(DP)
Assign independent training to each core (assuming parallel training with different parameters)
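The "independent training per core" setup can be pictured as a pool of workers, each fitting the same model with a different parameter and never communicating. The sketch below is a toy stand-in, not NEC's application: it fits a one-dimensional logistic regression on a made-up dataset with several learning rates. It uses a thread pool only to show the assignment pattern; on CPython, real per-core parallelism would need processes or native code.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy dataset: the label is 1 when x is positive.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def predict(w: float, b: float, x: float) -> float:
    """Logistic model output for one sample."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(learning_rate: float, epochs: int = 100) -> tuple:
    """Run one independent training job; return (learning_rate, final loss)."""
    w, b = 0.0, 0.0
    n = len(DATA)
    for _ in range(epochs):
        grad_w = sum((predict(w, b, x) - y) * x for x, y in DATA) / n
        grad_b = sum((predict(w, b, x) - y) for x, y in DATA) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    loss = -sum(
        y * math.log(predict(w, b, x)) + (1 - y) * math.log(1 - predict(w, b, x))
        for x, y in DATA
    ) / n
    return learning_rate, loss


# One parameter setting per worker; the runs share nothing.
LEARNING_RATES = [0.01, 0.1, 0.5, 1.0]

with ThreadPoolExecutor(max_workers=8) as pool:   # 8 workers, like 8 VE cores
    results = list(pool.map(train, LEARNING_RATES))

best_lr, best_loss = min(results, key=lambda r: r[1])
print(f"best learning rate: {best_lr}, loss: {best_loss:.4f}")
```

Because the runs are independent, throughput in this setup scales with the number of workers until memory bandwidth or core count becomes the limit.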
Simulation: Risk assessment of heatstroke
SX-Aurora TSUBASA delivers 1.8x higher performance compared to Xeon/Skylake
[Figure: simulated body surface temperature [ºC] (color scale 33.0 to 39.0) for a 3-year-old (male), 22-year-old (male), 22-year-old (female), 65-year-old (male), and 75-year-old (male). Source: Prof. Hirata of Nagoya Institute of Technology.]
[Figure: bar chart of performance/node (Xeon = 1), SX-Aurora TSUBASA versus Xeon (Gold 6148), annotated "1.8x faster".]
This simulation is ...
Temperature elevation and sweating are computed by solving a bio-heat equation considering thermoregulatory response. With weather forecast, the risk of heatstroke can be estimated.
Simulation of Shot peening
SX-Aurora TSUBASA delivers 2.4x higher performance compared to Xeon/Skylake
[Figure: bar chart of performance/node (Xeon = 1), SX-Aurora TSUBASA versus Xeon (Gold 6148), annotated "2.4x faster".]
Source: PhD student Y. Mizuno and Prof. S. Takahashi of Tokai Univ.
This simulation is ...
• Technology to create compressive residual stress layer on metal surface
• Collision of metal shots (small particles) to strengthen the material
• More accurate and efficient process for various subjects
• Many applications such as wing panel, crankshaft, connecting rod, etc.
Frovedis
AI and BigData
We target accelerating memory intensive workloads for HPC and AI
• Several statistical MLs and some Neural Networks (MLP, LSTM)
Developed middleware that speeds up Spark for statistical ML
[Figure: AI and Big Data workloads placed by memory performance (vertical axis) and computation performance (horizontal axis), with regions for the vector processor, GPU, and general purpose CPU. Statistical machine learning: demand prediction, price prediction, recommender system (MLP: multi layer perceptron). Deep learning: translation (LSTM), voice recognition (CNN), image recognition (CNN).]
Frovedis: NEC Middleware for statistical machine learning
▌Frovedis: NEC middleware compatible with Spark
• Same API: only 3 lines to change in the original Spark program
• OSS developed by NEC: https://github.com/frovedis
Original Spark program: logistic regression
    …
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    …
    val model = LogisticRegressionWithSGD.train(data)
    …
Change to call NEC middleware implementation
    …
    import com.nec.frovedis.mllib.classification.LogisticRegressionWithSGD  // change import
    …
    FrovedisServer.initialize(...)                      // start server
    val model = LogisticRegressionWithSGD.train(data)   // same API (no change)
    FrovedisServer.shut_down()                          // stop server
    …
Frovedis supported process
Machine Learning
• Logistic Regression, Linear Regression, Lasso Regression, Ridge Regression
• Linear SVM, Collaborative filtering (ALS), SVD, EVD, PCA
• Naive Bayes, Decision Tree, Factorization Machines, K-means, word2vec
Data Frame
• Filter, Sort, Group by, Join
Basic matrix operations
• Solve, Inverse, Gemv, SpMV, LU, Least square, Gemm
• Backed by ScaLAPACK, LAPACK, BLAS
Performance of Frovedis (Machine Learning)
Frovedis + VE shows over 100x performance in ML compared to Spark + x86
[Figure: bar chart of speedup (Spark/x86 = 1) for Logistic Regression (web ad), K-means (document clustering), and Singular value decomposition (recommendation). Spark + x86 is 1 in each case; the Frovedis + Aurora (VE) bars are labeled 42.8, 56.8, and 111, with the largest annotated "111x".]
• x86: Intel Xeon Gold 6126 x1 socket
• Aurora: Vector Engine Type 10B (1.4GHz, 8core) x1
• Performance comparison does not include I/O time
Performance of Frovedis (Data Frame)
Frovedis + VE shows high performance in Data Frame compared to Spark + x86
[Figure: bar chart of speedup (Spark/x86 = 1) for Filter, Sort, Join, and Group by. Spark + x86 is 1 in each case; the Frovedis + Aurora (VE) bars are labeled 26.7, 17.2, 9.82, and 8.7.]
• x86: Intel Xeon Gold 6126 x1 socket
• Aurora: Vector Engine Type 10B (1.4GHz, 8core) x1
• Performance comparison does not include I/O time
Aurora Forum community website
Visit https://www.hpc.nec and join our quest
Hybrid system
There are various types of applications
Run the right application on the right platform
[Figure: a vector cluster (VHs, each an x86 host with VE ctrl and several VEs running v.out) and a scalar cluster (x86 servers running s.out), joined by a shared scheduler (NQSV), network (IB), and filesystem (ScaTeFS).]
Hybrid system
Incorporate VH as part of scalar cluster
Higher performance, higher system utilization
[Figure: the same system, with the VHs (x86) now also running s.out as members of the scalar cluster alongside the x86 servers, while their VEs run v.out in the vector cluster. Shared scheduler (NQSV), network (IB), and filesystem (ScaTeFS).]
Hybrid MPI
MPI processes run on both VE nodes and x86 nodes as a single MPI program.
[Figure: one MPI program spanning both clusters. s.out ranks run on the VHs (x86) and x86 servers of the scalar cluster, with rank labels 0, 1023, 1024, and 2047; v.out ranks run on the VEs of the vector cluster, with rank labels 2048 and 4097. Shared scheduler (NQSV), network, and filesystem (ScaTeFS).]
Heterogeneous system
▌Single job per compute node
Use all of the compute nodes at a time
• Run multiple copies of single script on many nodes with different set of data for each copy (Bulk job)
• User can run a Bulk job over Xeon cluster and VE cluster
JobX
    #!/bin/bash
    #PBS -b 2000 --venum_lhost=1
    #PBS -b 3500 --cpunum_lhost=48
    #PBS -T distrib

    if [ -n "${VE_NODE_NUMBER}" ]; then
        ve.out    # VE node
    else
        xe.out    # Xeon
    fi
[Figure: NQSV dispatches sub-jobs JobX.0, JobX.1, … JobX.1999 to VEs in the VE cluster, each running ve.out, and JobX.2000, JobX.2001, … JobX.5499 to Xeon nodes in the Xeon cluster, each running xe.out.]
1 job = 1 node
Heterogeneous system (2/3)
▌Single job utilizing all compute nodes
• MPI processes running on both VE and Xeon nodes as a single MPI program (Hybrid MPI)
• A hybrid job that uses VE and Xeon clusters can be easily executed in close cooperation with NEC MPI
    #!/bin/bash
    #PBS -b 2000 --venum_lhost=1
    #PBS -b 3500 --cpunum_lhost=48
    #PBS -T necmpi
    mpiexec -ppn 1 -venode -nn 2000 ve.out : -nn 3500 xe.out
[Figure: NQSV runs a single MPI program across both clusters: ranks 0, 1, … 1999 run ve.out on VEs in the VE cluster, and ranks 2000, 2001, … 5499 run xe.out on Xeon nodes in the Xeon cluster.]
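The rank numbering in the figure (0 to 1999 on VEs, 2000 to 5499 on Xeon nodes) is what results when ranks are handed out in the order the executables appear on the mpiexec line, one process per node. The sketch below encodes that reading; the ordering rule is an assumption taken from the figure, and `binary_for_rank` is a hypothetical helper, not part of NEC MPI.

```python
VE_NODES = 2000       # from "#PBS -b 2000 --venum_lhost=1" and "-nn 2000"
XEON_NODES = 3500     # from "#PBS -b 3500 --cpunum_lhost=48" and "-nn 3500"
PROCS_PER_NODE = 1    # from "-ppn 1", as the figure's rank range suggests


def binary_for_rank(rank: int) -> str:
    """Return which executable a given MPI rank runs in the hybrid job."""
    ve_ranks = VE_NODES * PROCS_PER_NODE
    total_ranks = ve_ranks + XEON_NODES * PROCS_PER_NODE
    if not 0 <= rank < total_ranks:
        raise ValueError(f"rank must be in 0..{total_ranks - 1}")
    # Assumed ordering: the first executable on the command line gets the low ranks.
    return "ve.out" if rank < ve_ranks else "xe.out"


for r in (0, 1999, 2000, 5499):
    print(r, binary_for_rank(r))
```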
Heterogeneous system (3/3)
User can run VH-VE hybrid MPI application within a single job
    #!/bin/bash
    #PBS -b 2
    #PBS -T necmpi

    mpiexec -nn 2 -np 2 vh.out : -venode -nn 16 -np 32 ve.out
• Options for VH: -nn 2 -np 2 vh.out
• Options for VE: -venode -nn 16 -np 32 ve.out
[Figure: a single MPI program on two VHs. Each VH runs vh.out and connects through PCIe switches to VE 0 through VE 7 and to HCA 0 and HCA 1; every VE runs ve.out.]
NEC MPI Execution on VE
User can run MPI application without being aware of VE topology
    $ qsub -b 2 --cpunum_lhost=2 --venum_lhost=8 run.sh
run.sh
    #!/usr/bin/bash
    mpiexec -venode -nn 16 -np 32 ve.out
• -nn: #VEs
• -np: #Procs
[Figure: run.sh starts mpiexec on a VH, with an mpid on each VH. ve.out processes run on the VEs, which connect through PCIe switches to the HCAs; communication uses RDMA and SHMEM.]
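With 32 processes on 16 VEs and 8 VEs per VH, each VE hosts two ve.out processes. The sketch below lays the ranks out in simple block order to make that arithmetic concrete. The block ordering is an assumption for illustration; the slide does not say how NEC MPI actually numbers processes across VEs, and `place` is a hypothetical helper.

```python
def place(num_procs: int, num_ves: int, ves_per_host: int) -> list:
    """Map each rank to (rank, host index, VE index on that host) in block order."""
    if num_procs % num_ves != 0:
        raise ValueError("processes must divide evenly across VEs")
    procs_per_ve = num_procs // num_ves
    layout = []
    for rank in range(num_procs):
        ve_global = rank // procs_per_ve          # consecutive ranks share a VE
        host = ve_global // ves_per_host          # which VH owns that VE
        ve_local = ve_global % ves_per_host       # VE number within the VH
        layout.append((rank, host, ve_local))
    return layout


# Values from the slide: -nn 16 (#VEs), -np 32 (#Procs), --venum_lhost=8.
layout = place(num_procs=32, num_ves=16, ves_per_host=8)

for rank, host, ve in layout[:4]:
    print(f"rank {rank}: VH {host}, VE {ve}")
```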
NEC MPI Execution on VH-VE Hybrid
User can run VH-VE hybrid MPI application by one command
    $ qsub -b 2 --cpunum_lhost=2 --venum_lhost=8 run.sh
run.sh
    #!/usr/bin/bash
    mpiexec -nn 2 -np 2 vh.out : -venode -nn 16 -np 32 ve.out
• Options for VH: -nn 2 -np 2 vh.out
• Options for VE: -venode -nn 16 -np 32 ve.out
[Figure: run.sh starts mpiexec on a VH, with an mpid and a vh.out process on each VH. ve.out processes run on the VEs, which connect through PCIe switches to the HCAs; communication uses RDMA.]