Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA

Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November, 2018 SC18 Outline • Background • Overview of SX-Aurora TSUBASA • Performance evaluation • Benchmark performance • Application performance • Conclusions 14 November, 2018 SC18 2 Background • Supercomputers become important infrastructures • Widely used for scien8fic researches as well as various industries • Top1 Summit system reaches 143.5 Pflop/s • Big gap between theore8cal performance and sustained performance ◎ Compute-intensive applica8ons stand to benefit from high peak performance ✖Memory-intensive applica8ons are limited by lower memory performance Memory performance has gained more and more attentions 14 November, 2018 SC18 3 A new vector supercomputer SX-Aurora TSUBASA • Two important concepts of its design • High usability • High sustained performance • New memory integration technology • Realize the world’s highest memory bandwidth • New architecture • Vector host (VH) is attached to vector engines (VEs) • VE is responsible for executing an entire application • VH is used for processing system calls invoked by the applications VH VEVE 14 November, 2018 SC18 5 X86 Linux Vector processor New execution model • Conven1onal model • New execution model Host GPU VH VE Start Exe module load Start processing processing Transparent System call Kernel offload (I/O, etc) execution OS function Kernel Finish execution processing Fisnish processing SC18 6 Highlights of the execution model • Two advantages over conventional execution model • Avoid frequent data transfers between VE and VH • Applications are entirely executed on VE • Only necessary data for system calls need to be transferred →High sustained performance VH VE • No special programming • Explicit specifications of computation Exe module load Start kernels are not necessary processing • System calls are transparently offloaded Offload System call (I/O, etc) to the VH OS function • Programmers do not need to care system calls →High usability Finish processing 14 November, 2018 SC18 7 Specification of SX-Aurora TSUBASA • Memory bandwidth • 1.22 TB/s world’s highest memory bandwidth • Six HBM2 memory modules integraEon • 3.0 TB/s LLC bandwidth • LLC is connected to cores via 2D mesh network • ComputaEonal capability • 2.15 Tflop/[email protected] GHz • 8 powerful vector cores • 16 nm FINFET process technology • 4.8 billion transistors • 14.96 mm x 33.00 mm 14 November, 2018 SC18 Block diagram of a vector processor Outline • Background • Overview of SX-Aurora TSUBASA • Performance evaluation • Benchmark performance • Application performance • Conclusions 14 November, 2018 SC18 9 Experimental environments • SX-Aurora TSUBASA A300-2 VH VE • 2x VEs Type 10B VE • 1x VH X86 Linux Vector Engines VH Intel Xeon Gold 6126 VE Type 10B Frequency 2.60 GHz / 3.70 GHz (Turbo) Frequency 1.4 GHz Peak FP / core 83.2 Gflop/s Peak FP / core 268.8 Gflop/s # cores 12 # cores 8 Peak DP Flops 998.4 Gflop/s Peak DP Flops / socket 2.15 Tflop/s Mem BW 128 GB/s Memory BW 1.2 TB/s Mem Capacity 96 GB Memory capacity 48 GB Mem config DDR4-2666 DIMM 16GB x 6 Memory config HBM2 14 November, 2018 SC18 10 Experimental environments cont. SX-Aurora Xeon Gold Xeon Phi Processor SX-ACE Tesla V100 Type 10B 6126 KNL 7290 Frequency 1.4 GHz 2.6 GHz 1.0 GHz 1.245 GHz 1.5 GHz # oF cores 8 12 4 5120 72 DP Flop/s 2.15 T 998.4 GF 7 TF 3.456 TF 256 GF (SP flop/s) (4.30 T) (1996.8 GF) (14 TF) (6.912 TF) Memory MCDRAM HBM2 x6 DDR4 x6ch DDR3 x16ch HBM2 x4 subsystem DDR4 450+ GB/s Memory BW 1.22 TB/s 128 GB/s 256 GB/s 900 GB/s 115.2 GB/s Memory 16 GB 48 GB 96 GB 64 GB 16 GB capacity 96 GB LLC BW 2.66 TB/s N/A 1.0 TB/s N/A N/A 16 MB 19.25 MB 1 MB shared LLC capacity 1 MB private 6 MB shared 14 November,shared 2018 shared SC18 by 211 cores Applications used for evaluation • SGEMM/DGEMM • Matrix-matrix mul;plica;ons to evaluate the Peak flop/s • Stream benchmark • Simple kernels (copy, scale, add, triad) to measure sustained memory performance • Himeno benchmark • Jacobi kernels with a 19-point stencil as a memory-intensive kernels • Applica;on kernels • Kernels of prac;cal applica;ons of Tohoku univ in Earthquake, CFD, Electromagne;c • Microbenchmark to evaluate the execu;on model • Mixture with vector-friendly Jacobi kernels and I/O kernels 14 November, 2018 SC18 12 Overview of application kernels Memory Mesh Code Actual Kernels Fields Methods access size B/F B/F Electro Land mine FDTD Sequential 100x750x750 6.22 5.15 magnetic Seismolo Friction Earthquake Sequential 2047x2047x256 4.00 4.00 gy Law Navier- Turbulent Flow CFD Sequential 512x16384x512 1.91 0.35 Stokes Electro Antenna FDTD Sequential 252755x9x97336 1.73 0.98 magnetic Lax- Plasma Physics Indirect 20,048,000 1.12 0.075 Wendroff Turbine CFD LU-SGS Indirect 480x80x80x10 0.96 0.0084 14 November, 2018 SC18 13 Outline • Background • Overview of SX-Aurora TSUBASA • Performance evaluation • Benchmark performance • Application performance • Conclusions 14 November, 2018 SC18 14 SGEMM/DGEMM Performance DGEMM Gflop/s SGEMM Gflop/s DGEMM efficiency SGEMM Efficiency 4500 100 4000 90 3500 80 3000 70 60 2500 50 2000 40 1500 30 Efficiency (%) Efficiency 1000 20 500 10 0 0 GEMM performance (Gflop/s) 1 2 3 4 5 6 7 8 Number of threads • High scalability up to 8 threads • High vectorization ratio 99.36%, good vector length 253.8 • High efficiency • Efficiency 97.8~99.2% 14 November, 2018 SC18 15 Memory performance(Stream Triad) 1200 1400 x 4.72 x 11.7 x 1.37 x 2.23 SX-Aurora… SX-ACE Skylake 1000 1200 800 1000 600 800 600 400 400 200 Stream bandwidth Stream (GB/s) Stream bandwidth Stream (GB/s) 200 0 SX-Aurora SX-ACE Skylake Tesla V100 KNL 0 TSUBASA 0 2 4 6 8 10 12 Number of threads • High sustained memory bandwidth of SX-Aurora TSUBASA • Efficiency: Aurora 79%, ACE 83%, Skylake 66%, V100 81% • Scalability • Saturated even when the number of threads is less than half 14 November, 2018 SC18 16 Himeno (Jacobi) performance 350 320 x 0.93 SX-Aurora TSUBASA SX-ACE Skylake 300 x 3.4 x 7.96 x 2.1 280 250 240 x 6.9 200 200 160 150 120 100 80 50 40 Himeno performance (Gflop/s) Himeno performance (Gflop/s) 0 0 SX-Aurora SX-ACE Skylake Tesla V100 KNL 0 2 4 6 8 10 12 TSUBASA Number of threads • Higher performance… except GPU • Vector reduction becomes bottleneck due to copy among vector pipes • Nice thread scalability • 6.9x speedup in 8 threads => 86% parallel efficiency 14 November, 2018 SC18 17 Application kernel performance 80 SX-Aurora TSUBASA SX-ACE Skylake 4096 SX-Aurora TSUBASA SX-ACE 70 Antenna 60 1024 Earthquake x 9.9 Turbulent flow 50 256 Turbine 40 Plasma 64 Antenna 30 Plasma 16 Land mine Execution time (sec) 20 x 3.4 x 4.3 10 x 2.9 x 2.6 x 3.4 4 0 Attainable performance (Gflop/s) 1 Land mine Earthquake Turbulent Antenna Plasma Turbine flow 0.1250.25 0.5 1 2 4 8 16 32 64 128 Arithmetic intensity (Flops/Byte) • SX-Aurora TSUBASA could achieve high performance • Plasma, Turbine => Indirect access, memory latency-bound • Antenna => computation-bound to memory BW-bound • Land mine, Earthqauke, Turbulent flow =>memory or LLC BW- bound 14 November, 2018 SC18 18 Memory bound? or LLC bound? • Further analysis using 4 types of Bytes/Flop ratio • Memory B/F = (memory BW) / (peak performance) • LLC B/F = (LLC BW) / (peak performance) • Code B/F = (necessary data in Byte) / (# FP operations) • Actual B/F = (# block memory access) * (block size) / (# FP operations) B/F ratio Actual < Memory Memory > Actual Code < LLC Computation-bound Memory BW-bound Code > LLC LLC BW-bound Memory or LLC bound * • Code B/F > Actual B/F * LLC BW / Memory BW => LLC bound • Code B/F < Actual B/F * LLC BW / Memory BW => memory bound 14 November, 2018 SC18 19 Application kernel performance 80 SX-Aurora TSUBASA SX-ACE Skylake B/F ratio Actual < Memory Actual > Memory 70 60 x 9.9 Code < LLC Computation-bound Memory BW-bound 50 Code > LLC LLC BW-bound Memory or LLC bound 40 30 • Memory B/F = 1.22 TB/s / 2.15 TF Execution time (sec) 20 x 3.4 = 0.57 x 4.3 10 x 2.9 x 2.6 x 3.4 • LLC B/F = 2.66 TB/s / 2.15 TF 0 Land mine Earthquake Turbulent Antenna Plasma Turbine = 1.24 flow • Land mine (Code 6.22, Actual 5.79) => LLC bound • Earthqauke (Code 6.00, Actual 2.00) => LLC bound • Turbulent flow (Code 1.91, Actual 0.35) => memory BW bound • Antenna (Code 1.73, Actual 0.98) => memory BW bound 14 November, 2018 SC18 20 Evaluation of the execution model • (Transparent/Explicit) • Offload from VH to VE Offload from VE to VH VH VE VH VE VH V S VE S V offload AP AP offload 40 Data transfer Jacobi 1 I/O Jacobi 2 35 30 V 25 AP V V AP 20 S S 15 S AP Execution time (sec) 10 V V 5 AP V AP S 0 AP 21 TransparentVE execution VE to VH explicitVH offload VE to VH VHVE to offload VE Multi-VE performance on A300-8 9000 2400 Aurora Ideal Aurora Ideal /s) 2200 8000 2000 7000 Gflop 1800 6000 1600 5000 1400 1200 4000 1000 3000 800 performance ( 2000 600 400 1000 Stream bandwidth (GB/s) Stream 200 0 Himeno 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of VEs Number of VEs • Stream VE-level scalability • Almost ideal scalability up to 8 VEs • Himeno VE-level scalability • Good scalability up to 4VEs • Lack of vector lengths when more than 5VEs • Problem size is too small 14 November, 2018 SC18 22 Outline • Background • Overview of SX-Aurora TSUBASA • Performance evaluation • Benchmark performance • Application performance • Conclusions 14 November, 2018 SC18 23 Application evaluation • Applications • Tsunami simulation code • Mainly consists on ground fluctuation and Tsunami inundation prediction • Used for a real-time tsunami inundation system • A costal region of Japan(1244x826km) with an 810m resolution • Direct numerical simulation code of turbulent flows • Incompressible Navier-Stokes equations and a continuum equation are solved by 3D FFT • MPI parallelization to 4VEs (1~32 processes) • 1283, 2563, 5123 grid points 14 November, 2018 SC18 24 Performance of the Tsunami code • Core performance is x 11.4, x 4.1, x2.4, to KNL, Skylake, ACE.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    27 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us