Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi
Tohoku University
14 November, 2018, SC18

Outline
• Background
• Overview of SX-Aurora TSUBASA
• Performance evaluation
  • Benchmark performance
  • Application performance
• Conclusions

Background
• Supercomputers have become important infrastructure
  • Widely used for scientific research as well as in various industries
  • The Top-1 Summit system reaches 143.5 Pflop/s
• Big gap between theoretical performance and sustained performance
  ◎ Compute-intensive applications stand to benefit from high peak performance
  ✖ Memory-intensive applications are limited by lower memory performance
→ Memory performance has gained more and more attention

A new vector supercomputer: SX-Aurora TSUBASA
• Two important concepts in its design
  • High usability
  • High sustained performance
• New memory integration technology
  • Realizes the world's highest memory bandwidth
• New architecture
  • A vector host (VH, an x86 Linux node) is attached to vector engines (VEs, vector processors)
  • A VE is responsible for executing an entire application
  • The VH is used for processing system calls invoked by the application

New execution model
• Conventional model (host + GPU): the host starts processing, offloads computation kernels to the GPU for kernel execution, and finishes processing
• New execution model (VH + VE): the executable module is loaded onto the VE, which runs the whole application; system calls (I/O, etc.) are transparently offloaded to the VH, which executes the OS function and returns control to the VE
[Figure: execution flow of the conventional host/GPU model vs. the VH/VE model]

Highlights of the execution model
• Two advantages over the conventional execution model
• Avoids frequent data transfers between VE and VH
  • Applications are entirely executed on the VE
  • Only the data necessary for system calls needs to be transferred
  → High sustained performance
• No special programming
  • Explicit specification of computation kernels is not necessary
  • System calls are transparently offloaded to the VH
  • Programmers do not need to care about system calls
  → High usability

Specification of SX-Aurora TSUBASA
• Memory bandwidth
  • 1.22 TB/s, the world's highest memory bandwidth
  • Integration of six HBM2 memory modules
  • 3.0 TB/s LLC bandwidth
  • The LLC is connected to the cores via a 2D mesh network
• Computational capability
  • 2.15 Tflop/s at 1.4 GHz
  • 8 powerful vector cores
  • 16 nm FinFET process technology
  • 4.8 billion transistors
  • 14.96 mm x 33.00 mm die
[Figure: block diagram of a vector processor]

Experimental environments
• SX-Aurora TSUBASA A300-2
  • 2x VEs (Type 10B)
  • 1x VH (x86 Linux)

                   VH                            VE
  Processor        Intel Xeon Gold 6126          Type 10B
  Frequency        2.60 GHz / 3.70 GHz (Turbo)   1.4 GHz
  Peak FP / core   83.2 Gflop/s                  268.8 Gflop/s
  # cores          12                            8
  Peak DP flop/s   998.4 Gflop/s                 2.15 Tflop/s per socket
  Memory BW        128 GB/s                      1.22 TB/s
  Memory capacity  96 GB                         48 GB
  Memory config    DDR4-2666 DIMM 16 GB x 6      HBM2

Experimental environments (cont.)
                   SX-Aurora     Xeon Gold    SX-ACE      Tesla V100    Xeon Phi
  Processor        Type 10B      6126                                   KNL 7290
  Frequency        1.4 GHz       2.6 GHz      1.0 GHz     1.245 GHz     1.5 GHz
  # of cores       8             12           4           5120          72
  DP flop/s        2.15 T        998.4 G      256 G       7 T           3.456 T
  (SP flop/s)      (4.30 T)      (1996.8 G)               (14 T)        (6.912 T)
  Memory           HBM2 x6       DDR4 x6ch    DDR3 x16ch  HBM2 x4       MCDRAM + DDR4
  Memory BW        1.22 TB/s     128 GB/s     256 GB/s    900 GB/s      450+ GB/s / 115.2 GB/s
  Memory capacity  48 GB         96 GB        64 GB       16 GB         16 GB / 96 GB
  LLC BW           2.66 TB/s     N/A          1.0 TB/s    N/A           N/A
  LLC capacity     16 MB shared  19.25 MB     1 MB        6 MB shared   1 MB shared
                                 shared       private                   by 2 cores

Applications used for evaluation
• SGEMM/DGEMM
  • Matrix-matrix multiplications to evaluate peak flop/s
• Stream benchmark
  • Simple kernels (copy, scale, add, triad) to measure sustained memory performance
• Himeno benchmark
  • Jacobi kernels with a 19-point stencil as a memory-intensive kernel
• Application kernels
  • Kernels of practical applications of Tohoku University in earthquake simulation, CFD, and electromagnetics
• Microbenchmark to evaluate the execution model
  • A mixture of vector-friendly Jacobi kernels and I/O kernels

Overview of application kernels

  Code            Fields            Methods        Memory access  Mesh size       Code B/F  Actual B/F
  Land mine       Electromagnetics  FDTD           Sequential     100x750x750     6.22      5.15
  Earthquake      Seismology        Friction law   Sequential     2047x2047x256   4.00      4.00
  Turbulent flow  CFD               Navier-Stokes  Sequential     512x16384x512   1.91      0.35
  Antenna         Electromagnetics  FDTD           Sequential     252755x9x97336  1.73      0.98
  Plasma          Physics           Lax-Wendroff   Indirect       20,048,000      1.12      0.075
  Turbine         CFD               LU-SGS         Indirect       480x80x80x10    0.96      0.0084
SGEMM/DGEMM performance
[Figure: GEMM performance (Gflop/s) and efficiency (%) of DGEMM and SGEMM vs. number of threads (1-8)]
• High scalability up to 8 threads
  • High vectorization ratio of 99.36%, good average vector length of 253.8
• High efficiency
  • Efficiency of 97.8-99.2%

Memory performance (Stream triad)
[Figure: sustained Stream triad bandwidth (GB/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL, with SX-Aurora speedups of x4.72, x11.7, x1.37, and x2.23; bandwidth vs. number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake]
• High sustained memory bandwidth of SX-Aurora TSUBASA
  • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81%
• Scalability
  • Bandwidth is saturated even when the number of threads is less than half the core count

Himeno (Jacobi) performance
[Figure: Himeno performance (Gflop/s) of SX-Aurora TSUBASA, SX-ACE, Skylake, Tesla V100, and KNL, with SX-Aurora speedups of x3.4, x7.96, x0.93, and x2.1; performance vs. number of threads for SX-Aurora TSUBASA, SX-ACE, and Skylake]
• Higher performance than the other processors, except the GPU
  • Vector reduction becomes a bottleneck due to copies among the vector pipes
• Nice thread scalability
  • 6.9x speedup with 8 threads => 86% parallel efficiency

Application kernel performance
[Figure: execution times (sec) of the six kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, with SX-Aurora speedups of x2.6 to x9.9; roofline-style plot of attainable performance (Gflop/s) vs. arithmetic intensity (flop/byte) for SX-Aurora TSUBASA and SX-ACE]
• SX-Aurora TSUBASA achieves high performance on all six kernels
  • Plasma, Turbine => indirect access, memory latency-bound
  • Antenna => shifts from computation-bound to memory BW-bound
  • Land mine, Earthquake, Turbulent flow => memory or LLC BW-bound

Memory bound? or LLC bound?
• Further analysis using four types of Bytes/Flop (B/F) ratios
  • Memory B/F = (memory BW) / (peak performance)
  • LLC B/F = (LLC BW) / (peak performance)
  • Code B/F = (necessary data in bytes) / (# FP operations)
  • Actual B/F = (# memory block accesses) x (block size) / (# FP operations)

  B/F ratio   Actual < Memory    Actual > Memory
  Code < LLC  Computation-bound  Memory BW-bound
  Code > LLC  LLC BW-bound       Memory or LLC bound*

* Code B/F > Actual B/F x (LLC BW / memory BW) => LLC bound
  Code B/F < Actual B/F x (LLC BW / memory BW) => memory bound

Application kernel performance (cont.)
• Memory B/F = 1.22 TB/s / 2.15 Tflop/s = 0.57
• LLC B/F = 2.66 TB/s / 2.15 Tflop/s = 1.24
• Land mine (Code 6.22, Actual 5.79) => LLC bound
• Earthquake (Code 6.00, Actual 2.00) => LLC bound
• Turbulent flow (Code 1.91, Actual 0.35) => memory BW-bound
• Antenna (Code 1.73, Actual 0.98) => memory BW-bound

Evaluation of the execution model
• Three configurations compared: transparent VE execution (system calls offloaded from VE to VH), explicit VE-to-VH offloading, and VH execution with explicit VH-to-VE offloading
• The microbenchmark interleaves vector-friendly Jacobi kernels with I/O
[Figure: execution time (sec) breakdown into data transfer, Jacobi 1, I/O, and Jacobi 2 phases for the three configurations]

Multi-VE performance on A300-8
[Figure: Stream bandwidth (GB/s) and Himeno performance (Gflop/s) vs. number of VEs (1-8), measured vs. ideal]
• Stream VE-level scalability
  • Almost ideal scalability up to 8 VEs
• Himeno VE-level scalability
  • Good scalability up to 4 VEs
  • Lack of vector lengths when more than 5 VEs are used
  • The problem size is too small

Application evaluation
• Applications
  • Tsunami simulation code
    • Mainly consists of ground fluctuation and tsunami inundation prediction
    • Used for a real-time tsunami inundation system
    • A coastal region of Japan (1244 x 826 km) with an 810 m resolution
  • Direct numerical simulation code of turbulent flows
    • Incompressible Navier-Stokes equations and a continuity equation are solved by 3D FFT
    • MPI parallelization on up to 4 VEs (1-32 processes)
    • 128^3, 256^3, and 512^3 grid points

Performance of the Tsunami code
• Core performance is x11.4, x4.1, and x2.4 that of KNL, Skylake, and SX-ACE, respectively.