Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi
Tohoku University
14 November, 2018, SC18

Outline

• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

Background

• Supercomputers have become important infrastructure
  • Widely used for scientific research as well as various industries
  • The Top-1 system reaches 143.5 Pflop/s
• Big gap between theoretical performance and sustained performance
  ◎ Compute-intensive applications stand to benefit from high peak performance
  ✖ Memory-intensive applications are limited by lower memory performance
→ Memory performance has gained more and more attention

A new vector supercomputer SX-Aurora TSUBASA

• Two important concepts in its design
  • High usability
  • High sustained performance
• New memory integration technology
  • Realizes the world's highest memory bandwidth
• New architecture
  • A vector host (VH, x86 Linux) is attached to vector engines (VEs, vector processors)
  • The VE is responsible for executing an entire application
  • The VH is used for processing system calls invoked by the applications

New execution model

• Conventional model (Host + GPU)
  • The host starts processing, offloads computation kernels to the GPU, the GPU executes the kernels, and the host finishes processing
• New execution model (VH + VE)
  • The VH loads the executable module onto the VE, the VE starts processing and executes the application, system calls (I/O, etc.) are transparently offloaded to OS functions on the VH, and the VE finishes processing

Highlights of the execution model

• Two advantages over the conventional execution model
• Avoiding frequent data transfers between VE and VH
  • Applications are entirely executed on the VE
  • Only the data necessary for system calls needs to be transferred
  → High sustained performance
• No special programming
  • Explicit specification of computation kernels is not necessary
  • System calls (I/O, etc.) are transparently offloaded to OS functions on the VH
  • Programmers do not need to care about system calls
  → High usability

Specification of SX-Aurora TSUBASA

• Memory bandwidth
  • 1.22 TB/s, the world's highest memory bandwidth
  • Integration of six HBM2 memory modules
  • 3.0 TB/s LLC bandwidth
  • The LLC is connected to the cores via a 2D mesh network
• Computational capability
  • 2.15 Tflop/s @ 1.40 GHz (a consistency check follows below)
  • 8 powerful vector cores
• 16 nm FinFET process technology
  • 4.8 billion transistors
  • 14.96 mm x 33.00 mm
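As a quick consistency check on these figures (the per-core peak of 268.8 Gflop/s is quoted in the experimental-environment table later in this deck; the 192 flop/cycle per core is simply what the division implies, not a number stated on the slides):

\[
2.15\ \mathrm{Tflop/s} \approx 8\ \text{cores} \times 268.8\ \mathrm{Gflop/s}
  = 8 \times \bigl(1.40\ \mathrm{GHz} \times 192\ \mathrm{flop/cycle}\bigr)
\]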

(Figure: Block diagram of a vector processor)

Outline

• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

Experimental environments

• SX-Aurora TSUBASA A300-2
  • 2x VEs (vector engines, Type 10B)
  • 1x VH (vector host, x86 Linux)

                        VH: Xeon Gold 6126             VE: Type 10B
Frequency               2.60 GHz / 3.70 GHz (Turbo)    1.4 GHz
Peak FP / core          83.2 Gflop/s                   268.8 Gflop/s
# of cores              12                             8
Peak DP flop/s (socket) 998.4 Gflop/s                  2.15 Tflop/s
Memory BW               128 GB/s                       1.2 TB/s
Memory capacity         96 GB                          48 GB
Memory configuration    DDR4-2666 DIMM 16 GB x 6       HBM2

Experimental environments (cont.)

Processor         SX-Aurora Type 10B   Xeon Gold 6126    SX-ACE         Tesla V100    KNL 7290
Frequency         1.4 GHz              2.6 GHz           1.0 GHz        1.245 GHz     1.5 GHz
# of cores        8                    12                4              5120          72
DP flop/s         2.15 TF              998.4 GF          256 GF         7 TF          3.456 TF
(SP flop/s)       (4.30 TF)            (1996.8 GF)       -              (14 TF)       (6.912 TF)
Memory subsystem  HBM2 x6              DDR4 x6ch         DDR3 x16ch     HBM2 x4       MCDRAM + DDR4
Memory BW         1.22 TB/s            128 GB/s          256 GB/s       900 GB/s      450+ GB/s / 115.2 GB/s
Memory capacity   48 GB                96 GB             64 GB          16 GB         16 GB + 96 GB
LLC BW            2.66 TB/s            N/A               1.0 TB/s       N/A           N/A
LLC capacity      16 MB shared         19.25 MB shared   1 MB private   6 MB shared   1 MB shared by 2 cores

Applications used for evaluation

• SGEMM/DGEMM
  • Matrix-matrix multiplications to evaluate the peak flop/s
• Stream benchmark
  • Simple kernels (copy, scale, add, triad) to measure the sustained memory performance (a sketch of the triad kernel is shown below)
• Himeno benchmark
  • Jacobi kernels with a 19-point stencil as a memory-intensive kernel
• Application kernels
  • Kernels of practical applications of Tohoku University in earthquake, CFD, and electromagnetics
• Microbenchmark to evaluate the execution model
  • A mixture of vector-friendly Jacobi kernels and I/O kernels
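For reference, a minimal C/OpenMP sketch of the triad kernel (a[i] = b[i] + s*c[i]) measured by the Stream benchmark; the array length, timing harness, and byte accounting below are illustrative placeholders, not the settings used in this evaluation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* placeholder array length; real runs size the arrays well beyond the LLC */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double s = 3.0;

    /* touch the arrays once so pages are allocated before timing */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    /* triad: a[i] = b[i] + s * c[i], three 8-byte streams per element */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];
    double t1 = omp_get_wtime();

    /* three arrays of 8-byte elements are moved per iteration */
    printf("Triad bandwidth: %.1f GB/s\n", 3.0 * 8.0 * (double)N / 1e9 / (t1 - t0));
    free(a); free(b); free(c);
    return 0;
}
```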

Overview of application kernels

Kernel          Field            Method         Memory access   Mesh size        Code B/F   Actual B/F
Land mine       Electromagnetic  FDTD           Sequential      100x750x750      6.22       5.15
Earthquake      Seismology       Friction law   Sequential      2047x2047x256    4.00       4.00
Turbulent flow  CFD              Navier-Stokes  Sequential      512x16384x512    1.91       0.35
Antenna         Electromagnetic  FDTD           Sequential      252755x9x97336   1.73       0.98
Plasma          Physics          Lax-Wendroff   Indirect        20,048,000       1.12       0.075
Turbine         CFD              LU-SGS         Indirect        480x80x80x10     0.96       0.0084

Outline

• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

SGEMM/DGEMM performance

(Figure: GEMM performance (Gflop/s) and efficiency (%) of DGEMM and SGEMM vs. number of threads, 1 to 8)

• High scalability up to 8 threads
  • High vectorization ratio of 99.36%, good vector length of 253.8
• High efficiency
  • Efficiency of 97.8%~99.2% (a generic measurement sketch follows)
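The slides do not show how the GEMM numbers were collected; purely as an illustration of how peak flop/s is typically probed, here is a minimal C sketch that times one CBLAS dgemm call and converts it to Gflop/s. The matrix size, the CBLAS interface, and the timing choices are assumptions for illustration, not details of this evaluation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <omp.h>

int main(void) {
    const int n = 4096;   /* placeholder matrix size */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* C = 1.0 * A * B + 0.0 * C in double precision */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    /* a matrix-matrix multiply performs 2*n^3 floating-point operations */
    printf("DGEMM: %.1f Gflop/s\n", 2.0 * n * n * n / (t1 - t0) / 1e9);
    free(A); free(B); free(C);
    return 0;
}
```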

Memory performance (Stream Triad)

(Figure: Stream triad bandwidth (GB/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL, with relative speedups of x4.72, x11.7, x1.37, and x2.23 for SX-Aurora TSUBASA, and stream bandwidth vs. number of threads, 1 to 12)

• High sustained memory bandwidth of SX-Aurora TSUBASA
  • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%
• Scalability
  • The bandwidth saturates even when the number of threads is less than half

Himeno (Jacobi) performance

(Figure: Himeno performance (Gflop/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL, with relative speedups of x3.4, x7.96, x0.93, and x2.1, and Himeno performance vs. number of threads, 1 to 12)

• Higher performance than the other platforms, except the GPU
  • The vector reduction becomes a bottleneck due to copies among vector pipes (a simplified sketch of the stencil and its reduction follows)
• Nice thread scalability
  • 6.9x speedup with 8 threads => 86% parallel efficiency
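To show where that reduction comes from, here is a simplified sketch of one 19-point Jacobi sweep in the spirit of the Himeno benchmark. The real benchmark carries per-point coefficient arrays and a boundary mask; the scalar coefficients and flattened indexing below are simplifications for illustration only, not the benchmark source.

```c
#include <stddef.h>

/* One simplified 19-point Jacobi sweep: update q from p and accumulate the
 * residual "gosa", which is the reduction referred to on this slide. */
double jacobi19_sweep(size_t ni, size_t nj, size_t nk,
                      const double *p, double *q,
                      double ca, double cb, double cc, double omega)
{
#define P(i, j, k) p[((i) * nj + (j)) * nk + (k)]
#define Q(i, j, k) q[((i) * nj + (j)) * nk + (k)]
    double gosa = 0.0;
    for (size_t i = 1; i + 1 < ni; i++)
        for (size_t j = 1; j + 1 < nj; j++)
            for (size_t k = 1; k + 1 < nk; k++) {   /* innermost loop vectorizes on the VE */
                double s0 =
                      ca * (P(i+1,j,k) + P(i-1,j,k)                                    /* 6 face neighbours  */
                          + P(i,j+1,k) + P(i,j-1,k)
                          + P(i,j,k+1) + P(i,j,k-1))
                    + cb * (P(i+1,j+1,k) + P(i-1,j-1,k) - P(i+1,j-1,k) - P(i-1,j+1,k)  /* 12 edge neighbours */
                          + P(i,j+1,k+1) + P(i,j-1,k-1) - P(i,j-1,k+1) - P(i,j+1,k-1)
                          + P(i+1,j,k+1) + P(i-1,j,k-1) - P(i-1,j,k+1) - P(i+1,j,k-1));
                double ss = cc * s0 - P(i,j,k);      /* the centre point completes the 19-point stencil */
                gosa += ss * ss;                     /* reduction across the vectorized loop            */
                Q(i,j,k) = P(i,j,k) + omega * ss;
            }
#undef P
#undef Q
    return gosa;
}
```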

Application kernel performance

(Figure: execution time (sec) of the six application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, with speedups of x9.9, x3.4, x4.3, x2.9, x2.6, and x3.4 annotated, and attainable performance (Gflop/s) vs. arithmetic intensity (flop/byte) for SX-Aurora TSUBASA and SX-ACE)

• SX-Aurora TSUBASA achieves high performance
  • Plasma, Turbine => indirect access, memory latency-bound
  • Antenna => computation-bound to memory BW-bound
  • Land mine, Earthquake, Turbulent flow => memory or LLC BW-bound (see the roofline bound sketched below)
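The right-hand plot is a roofline-style view: under the usual roofline model, the attainable performance at arithmetic intensity I is bounded by the smaller of the peak flop/s and I times the memory bandwidth. Taking I = 1 flop/byte on the VE purely as an illustration (these are not numbers from the slide):

\[
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \times B_{\text{mem}}\bigr)
\;\Longrightarrow\;
\min\bigl(2.15\ \mathrm{Tflop/s},\; 1\ \tfrac{\mathrm{flop}}{\mathrm{byte}} \times 1.22\ \mathrm{TB/s}\bigr) = 1.22\ \mathrm{Tflop/s}
\]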

Memory bound? or LLC bound?

• Further analysis using four types of Bytes/Flop (B/F) ratios
  • Memory B/F = (memory BW) / (peak performance)
  • LLC B/F = (LLC BW) / (peak performance)
  • Code B/F = (necessary data in bytes) / (# FP operations)
  • Actual B/F = (# block memory accesses) x (block size) / (# FP operations)

B/F ratios            Actual B/F < Memory B/F   Actual B/F > Memory B/F
Code B/F < LLC B/F    Computation-bound         Memory BW-bound
Code B/F > LLC B/F    LLC BW-bound              Memory or LLC BW-bound *

* Code B/F > Actual B/F x (LLC BW / Memory BW) => LLC bound
  Code B/F < Actual B/F x (LLC BW / Memory BW) => memory bound
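As a small sketch, the decision rules above (including the footnoted tie-break) can be transcribed into code as follows; the function name and the example values for the Antenna kernel are taken from these slides purely for illustration and are not part of the original material.

```c
#include <stdio.h>

/* Classify a kernel from its Code B/F and Actual B/F, given the machine's
 * Memory B/F, LLC B/F, memory BW, and LLC BW, following the 2x2 table and
 * the footnoted tie-break stated on this slide. */
static const char *classify(double code_bf, double actual_bf,
                            double mem_bf, double llc_bf,
                            double mem_bw, double llc_bw)
{
    if (actual_bf < mem_bf)                 /* memory traffic fits in the memory BW budget */
        return (code_bf < llc_bf) ? "computation-bound" : "LLC BW-bound";
    if (code_bf < llc_bf)                   /* LLC can keep up, memory cannot */
        return "memory BW-bound";
    /* both exceeded: tie-break from the footnote */
    return (code_bf > actual_bf * (llc_bw / mem_bw)) ? "LLC bound" : "memory bound";
}

int main(void) {
    /* Example: the Antenna kernel (Code B/F 1.73, Actual B/F 0.98) on the VE
     * (Memory B/F 0.57, LLC B/F 1.24, memory BW 1.22 TB/s, LLC BW 2.66 TB/s) */
    printf("Antenna: %s\n", classify(1.73, 0.98, 0.57, 1.24, 1.22, 2.66));
    return 0;
}
```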

Application kernel performance

(Figure: execution time of the application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, shown together with the bound-classification table from the previous slide)

• Memory B/F = 1.22 TB/s / 2.15 Tflop/s = 0.57
• LLC B/F = 2.66 TB/s / 2.15 Tflop/s = 1.24
• Land mine (Code 6.22, Actual 5.79) => LLC bound
• Earthquake (Code 6.00, Actual 2.00) => LLC bound
• Turbulent flow (Code 1.91, Actual 0.35) => memory BW bound
• Antenna (Code 1.73, Actual 0.98) => memory BW bound

Evaluation of the execution model

• Offload from VH to VE vs. offload from VE to VH (transparent / explicit)

(Figure: execution time (sec) of the microbenchmark, broken down into data transfer, Jacobi 1, I/O, and Jacobi 2, for transparent VE execution, VE-to-VH explicit offload, and VH-to-VE offload)

Multi-VE performance on A300-8

(Figure: Stream bandwidth (GB/s) and Himeno performance (Gflop/s) vs. number of VEs, 1 to 8, compared with ideal scaling)

• Stream VE-level scalability
  • Almost ideal scalability up to 8 VEs
• Himeno VE-level scalability
  • Good scalability up to 4 VEs
  • Lack of vector length when more than 5 VEs are used
    • The problem size is too small

Outline

• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

Application evaluation

• Applications
  • Tsunami simulation code
    • Mainly consists of ground fluctuation and tsunami inundation prediction
    • Used for a real-time tsunami inundation prediction system
    • A coastal region of Japan (1244 x 826 km) with an 810 m resolution
  • Direct numerical simulation (DNS) code of turbulent flows
    • Incompressible Navier-Stokes equations and a continuity equation are solved by 3D FFT
    • MPI parallelization on up to 4 VEs (1~32 processes)
    • 128^3, 256^3, 512^3 grid points

Performance of the Tsunami code

• Core performance is x11.4, x4.1, and x2.4 compared with KNL, Skylake, and SX-ACE, respectively
• Socket performance is x2.5, x1.9, and x3.5

Performance of the DNS code

• About 8.14x, 6.73x, and 5.48x speedups from 1 core to 32 cores for the 128^3, 256^3, and 512^3 grids
• The LLC hit ratios of 256^3 and 512^3 are lower than that of 128^3
  • About 10% is due to bank conflicts

Conclusions

• A vector supercomputer SX-Aurora TSUBASA
  • The highest memory bandwidth of 1.22 TB/s by integration of six HBM2 modules
  • New execution model to achieve high usability and high sustained performance
• Performance evaluation
  • Benchmark programs → high sustained memory performance and effectiveness of the new execution model
  • Application programs → high sustained performance
• Future work
  • Further optimizations for SX-Aurora TSUBASA

Acknowledgements

• People
  • Hiroyuki Takizawa
  • Ryusuke Egawa
  • Souya Fujimoto
  • Yasuhisa Masaoka
• Projects
  • The closed beta program for early access to Aurora
  • High Performance Computing Division (jointly organized with NEC), Cyberscience Center, Tohoku University
