
Japanese Development and Hybrid Accelerated Supercomputing

Taisuke Boku, Director, Center for Computational Sciences, University of Tsukuba

With courtesy of HPCI and R-CCS (first part)

Agenda
• Development and Deployment of Supercomputers in Japan
  • Tier-1 and Tier-2 systems
  • Supercomputers in national universities
  • Fugaku (Post-K)
• Multi-Hybrid Accelerated Supercomputing at U. Tsukuba
  • Today's accelerated supercomputing
  • New concept of multi-hybrid accelerated computing
  • Combining GPU and FPGA in a system
  • Programming and applications
  • Cygnus supercomputer at U. Tsukuba
• Conclusions

Development and Deployment of Supercomputers in Japan

Towards the Future
[Roadmap chart, 2008-2020, peak performance from 1 PF towards 1000 PF (Exascale)]
• Tier-1: national flagship machines at RIKEN AICS/R-CCS, originally developed MPP systems: the K Computer, then the Fugaku (Post-K) Computer
• Tier-2: university supercomputer centers (clusters, vector machines, GPU systems, etc.); 9 national universities procure their own systems, e.g. T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), TSUBAME2.0 at Tokyo Tech., and OFP at JCAHPC (U. Tsukuba and U. Tokyo)
• Tier-1 and Tier-2 supercomputers form HPCI and move forward to Exascale computing like two wheels

HPCI – High Performance Computing Infrastructure in Japan
• A national networking program, under MEXT, to share most of the supercomputers in Japan
• The national flagship supercomputer "K" (and "Fugaku" in 2021) and all other representative supercomputers in national university supercomputer centers are connected physically and logically
• A nation-wide supercomputer sharing program based on proposals (called twice a year) from all fields of computational science, reviewed by a selection committee that assigns the computational resources
• A large shared storage capacity (~250 PByte), distributed over two sites and shared by all HPCI supercomputer facilities, connected by a 100 Gbps-class nation-wide network
• A "single sign-on" system (Globus) for easy login and job scheduling across all resources

HPCI Tier 2 Systems Roadmap (As of Jun. 2019)

[Roadmap chart, fiscal years 2016-2027, showing each Tier-2 center's current and planned systems and their power budgets (power is the maximum consumption including cooling). Sites shown: Hokkaido, Tohoku, Tsukuba (HA-PACS, COMA (PACS-IX), PPX, Cygnus 2.4 PF, Oakforest-PACS 25 PF at JCAHPC, PACS-XI 100 PF planned), Tokyo (FX10, Reedbush, Oakbridge-II, BDEC), Tokyo Tech. (TSUBAME 2.5 / 3.0 / 4.0), Nagoya, Kyoto, Osaka, Kyushu, plus JAMSTEC and ISM.]

FUGAKU (富岳): New National Flagship Machine

Slides courtesy of M. Sato, RIKEN R-CCS

FLAGSHIP2020 Project
• Missions
  • Building the Japanese national flagship supercomputer Fugaku (a.k.a. post-K)
  • Developing a wide range of HPC applications, running on Fugaku, in order to solve social and scientific issues in Japan
• Overview of the Fugaku architecture
  • Node: Fujitsu A64FX processor, manycore architecture
  • Armv8-A + SVE (Scalable Vector Extension), SIMD length 512 bits
  • # of cores: 48 + (2/4 for OS), > 2.7 TF / 48 cores
  • Co-design with application developers, and high memory bandwidth utilizing on-package stacked memory (HBM2) at 1 TB/s
  • Low power: 15 GF/W (dgemm)
  • Network: TofuD with chip-integrated NIC, 6D mesh/torus interconnect
• Status and update
  • "Design and Implementation" completed
  • The official contract with Fujitsu to manufacture, ship, and install the hardware for Fugaku is done
  • RIKEN revealed #nodes > 150K
  • The name of the system was decided as "Fugaku"
  • RIKEN announced the Fugaku early access program to begin around Q2/CY2020
[Photo: prototype board]

CPU Architecture: A64FX

• Armv8.2-A (AArch64 only) + SVE (Scalable Vector Extension)
  • FP64 / FP32 / FP16 (https://developer.arm.com/products/architecture/a-profile/docs)
  • SVE 512-bit wide SIMD
• # of cores: 48 + (2/4 for OS)
  • CMG (Core-Memory-Group): a NUMA node of 12+1 cores with 8 GiB of HBM2
  • The "common" programming model will be to run one MPI process per NUMA node (CMG) with OpenMP-MPI hybrid programming; 48-thread OpenMP is also supported (a hybrid sketch follows after this list)
• Co-design with application developers, and high memory bandwidth utilizing on-package stacked memory: HBM2 (32 GiB)
• Leading-edge Si technology (7 nm FinFET), low-power logic design (approx. 15 GF/W at dgemm), and power-controlling knobs
• PCIe Gen3, 16 lanes
• Peak performance
  • > 2.7 TFLOPS (> 90% efficiency on dgemm)
  • Memory B/W 1024 GB/s (> 80% on STREAM)
  • Bytes per flop: approx. 0.4
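A minimal sketch (not from the slides) of the per-CMG hybrid style described above: one MPI rank per CMG, OpenMP threads inside the rank. The launch flags and thread count are assumptions that depend on the actual MPI runtime and job scheduler.

/* Hedged sketch: one MPI rank per CMG (NUMA node), OpenMP threads inside.
 * Assumed launch, e.g.: OMP_NUM_THREADS=12 mpirun -np 4 --map-by numa ./a.out
 * so each rank's threads stay within one CMG and its local HBM2. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    /* Data-parallel work inside one CMG, shared by its OpenMP threads. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1.0);

    double global_sum = 0.0;
    /* Inter-CMG / inter-node communication is done with MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}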

Peak Performance: Fugaku vs. K (HPL & STREAM)
• > 2.5 TF/node for dgemm (double precision), > 830 GB/s/node for STREAM triad
• Peak DP (double precision): Fugaku 400+ Pflops vs. K 11.3 Pflops (x34+)
• Peak SP (single precision): Fugaku 800+ Pflops vs. K 11.3 Pflops (x70+)
• Peak HP (half precision): Fugaku 1600+ Pflops vs. K -- (x141+)
• Total memory bandwidth: Fugaku 150+ PB/s vs. K 5.2 PB/s (x29+)
[Himeno Benchmark (Fortran90) comparison chart]

† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

Target Applications' Performance
• Performance targets (https://postk-web.r-ccs.riken.jp/perf.html)
  • 100 times faster than K for some applications (tuning included)
  • 30 to 40 MW power consumption
• Predicted performance of the 9 target applications (as of 2019/05/14), speedup over K:
  1. Health and longevity / Innovative computing infrastructure for drug discovery: GENESIS (MD for proteins), x125+
  2. Health and longevity / Personalized and preventive medicine using big data: Genomon (genome processing, genome alignment), x8+
  3. Disaster prevention and environment / Integrated simulation systems induced by earthquake and tsunami: GAMERA (earthquake simulator, FEM on unstructured & structured grids), x45+
  4. Disaster prevention and environment / Meteorological and global environmental prediction using big data: NICAM+LETKF (weather prediction system, structured-grid stencil & ensemble Kalman filter), x120+
  5. Energy issues / New technologies for energy creation, conversion/storage, and use: NTChem (molecular electronic structure calculation), x40+
  6. Energy issues / Accelerated development of innovative clean energy systems: Adventure (computational mechanics system for large-scale analysis and design, unstructured grid), x35+
  7. Industrial competitiveness enhancement / Creation of new functional devices and high-performance materials: RSDFT (ab-initio program, density functional theory), x30+
  8. Industrial competitiveness enhancement / Development of innovative design and production processes: FFB (large eddy simulation, unstructured grid), x25+
  9. Basic science / Elucidation of the fundamental laws and evolution of the universe: LQCD (lattice QCD simulation, structured-grid Monte Carlo), x25+

Performance study using the Post-K simulator
• We have been developing a cycle-level simulator for the post-K processor using gem5
• Collaboration with U. Tsukuba
• Kernel evaluation using a single core:

                          Post-K simulator   KNL
  Execution time [msec]   4.2                5.5
  Number of L1D misses    29569              -
  L1D miss rate           1.19%              -
  Number of L2 misses     20                 -
  L2 miss rate            0.01%              -

• 1.3 times faster than KNL per core
• With further optimization (instruction scheduling), the execution time is reduced to 3.4 msec (1.6 times faster)
• This is an evaluation within L1; OpenMP multicore execution will be much faster due to the HBM memory

Fugaku prototype board and rack
• "Fujitsu Completes Post-K Supercomputer CPU Prototype, Begins Functionality Trials", HPCwire, June 21, 2018

[Photos: A64FX CPU package (HBM2, 60 mm x 60 mm), water cooling, AOC QSFP28 links (X, Y, Z) and electrical signals; CPU die photo by Fujitsu Ltd.]
• CMU: 2 CPUs / CMU
• Shelf: 48 CPUs (24 CMUs)
• Rack: 8 shelves = 384 CPUs (8 x 48)

Advances from the K computer

                              K computer   Fugaku       ratio   key factor
  # of cores                  8            48                   Si tech.
  Si tech. (nm)               45           7
  Core perf. (GFLOPS)         16           > 56         3.5     SVE
  Chip (node) perf. (TFLOPS)  0.128        > 2.7        21      CMG & Si tech.
  Memory BW (GB/s)            64           1024                 HBM
  B/F (Bytes/FLOP)            0.5          0.4
  # of nodes / rack           96           384          4
  Rack perf. (TFLOPS)         12.3         1036.8       84
  # of nodes / system         82,944       > 150,000
  System perf. (DP PFLOPS)    10.6         > 405        38

More than 7.5 M general-purpose cores!
• SVE increases core performance
• Silicon technology and the scalable architecture (CMG) increase node performance
• HBM enables high bandwidth

Multi-Hybrid Accelerated Supercomputing

Work done as part of the PACS-X Project, CCS, U. Tsukuba, in collaboration with M. Umemura, K. Yoshikawa, R. Kobayashi, N. Fujita and Y. Yamaguchi

Accelerators in HPC

• Traditionally...
  • Cell Broadband Engine, ClearSpeed, GRAPE, ...
  • then GPU (the most popular)
• Is the GPU perfect?
  • good for many applications (replacing vector machines)
  • but it depends on very wide and regular parallelism
  • large-scale SIMD (SIMT) mechanism in a chip [photo: Tesla V100 (Volta)]
  • high-bandwidth memory (HBM, HBM2) and local memory, with a PCIe interface
  • insufficient for cases with...
    • not enough parallelism
    • irregular computation (warp splitting)
    • frequent inter-node communication (kernel switch, going back to the CPU)

FPGA in HPC
• Strengths of recent FPGAs for HPC
  • true co-design with applications (essential)
  • programmability improvement: OpenCL and other high-level languages
  • high-performance interconnect: 40 Gb to 100 Gb
  • precision control is possible
  • relatively low power
• Problems
  • programmability: OpenCL is not enough, not efficient
  • low nominal FLOPS: still cannot catch up with GPUs -> "never try what GPUs already do well"
  • memory bandwidth: one generation older than high-end CPU/GPU -> to be improved by HBM (Stratix10)
[Photo: Nallatech 520N with Stratix10 FPGA, equipped with 4x 100 Gbps optical interconnection interfaces]

What is an FPGA?
• FPGA: Field Programmable Gate Array
• A reconfigurable logic circuit built from a user description (low or high level), where all described logic elements are implemented in the circuit as routing, gates and flip-flops
• Low-level languages (HDL) have been used so far; recently HLS (High Level Synthesis) is available with C, C++, OpenCL
  • all "n" elements of a computation are pipelined (see the sketch below)
  • the circuit frequency depends on the complexity and length of each element's path
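An illustrative OpenCL single work-item kernel (not from the slides) of the kind an FPGA HLS flow such as the Intel FPGA SDK for OpenCL turns into a pipeline: once the pipeline is filled, roughly one loop iteration completes per clock. The names (a, b, c, n) are illustrative only.

// Hedged sketch: a single work-item OpenCL kernel; on an FPGA the loop body
// becomes a hardware pipeline (load, multiply, add, store as stages).
__kernel void vec_mad(__global const float* restrict a,
                      __global const float* restrict b,
                      __global float* restrict c,
                      const int n)
{
    for (int i = 0; i < n; i++) {
        // each stage of this expression maps to a pipeline stage
        c[i] = a[i] * b[i] + c[i];
    }
}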

Simple pros/cons

           performance   programming   external communication
           (FLOPS)       cost          (sec, B/s)
  CPU      △             ○             ◎
  GPU      ◎             △             ○
  FPGA     ○             ◎             × ➝ △?

How to compensate with each other toward a large degree of strong scaling?

AiS and PACS-X Project

AiS: conceptual model of Accelerator in Switch
[Figure: within a node, the CPU, GPU and FPGA are connected via PCIe; the CPU invokes GPU/FPGA kernels and transfers data via PCIe (GPU kernels can also be invoked from the FPGA); FPGAs are linked to each other through QSFP+ interconnects and an Ethernet switch]
• The FPGA can work for both computation and communication in a unified manner
• The GPU/CPU can request collective or specialized application-specific communication from the FPGA

Multi-Hybrid Accelerated Computing: PACS-X Project
• Combining the strengths of different types of accelerators: GPU + FPGA
  • the GPU is still an essential accelerator for simple and highly parallel workloads, providing ~10 TFLOPS of peak performance
  • the FPGA is a new type of accelerator for application-specific hardware with programmability, sped up by pipelining the calculation
  • the FPGA is also good for external communication, with advanced high-speed interconnects of up to 100 Gbps x 4 channels
• Multi (two types of devices) - Hybrid (acceleration) Supercomputing
  • a PC cluster with multiple acceleration devices: GPU + FPGA (+ CPU)
  • mixing their strengths for high-performance computational science, especially for strong scaling over various problem sizes in complicated calculations such as multi-physics simulations
  • using the FPGA's external communication links for low-latency, high-bandwidth communication

PPX (Pre-PACS-X): testbed under the AiS concept (x 13 nodes)
[Node block diagram]
• CPU: Xeon (Broadwell) x2, connected via QPI
• GPU: NVIDIA P100 x2 or V100 x2 (coarse-grain offloading)
• FPGA: Intel Arria10 or Xilinx Kintex (fine-grain partial offloading + high-speed interconnect); 40 Gb Ethernet x2 -> upgraded to 100G x4 (Stratix10)
• HCA: Mellanox IB/EDR (100G InfiniBand EDR)
• Storage: 1.6 TB NVMe

OpenCL-enabled high-speed network
• An OpenCL environment is available
  • e.g. Intel FPGA SDK for OpenCL
  • basic computation can be written in OpenCL without Verilog HDL
• But current FPGA boards are not ready for OpenCL access to the interconnect
  • the BSP (Board Support Package) is not complete for the interconnect ➝ we developed one for OpenCL access
• Our goals
  • enabling OpenCL descriptions by users, including inter-FPGA communication
  • providing a basic set of HPC functions such as collective communication and a basic linear algebra library
  • providing 40G to 100G Ethernet access with external switches for large-scale systems
• CoE (Channel over Ethernet)
  • a Verilog HDL module, developed at CCS, that connects OpenCL user kernels to the FPGA's optical interconnect interface hardware (IP)

Enabling networking from OpenCL
• The Board Support Package (BSP) is a hardware component describing a board specification
• We added a network controller into the BSP
  • 40 Gb Ethernet IP (on Arria10) ⇨ 100 Gb on Stratix10
  • bridge logic between the OpenCL kernels and the Ethernet IP
• OpenCL kernels can use it through BSP I/O channels
[Figure: FPGA board (A10PL4) containing the Ethernet IP core with QSFP+ port, DDR4 memory controllers, the PCIe controller and the OpenCL kernel inside the BSP; the host PC runs the host application and driver]

CoE user-level programming

sender code on FPGA1

__kernel void sender(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        float v = x[i];
        // push each element into the CoE network output channel
        write_channel_intel(network_out, v);
    }
}

receiver code on FPGA2

__kernel void receiver(__global float* restrict x, int n)
{
    for (int i = 0; i < n; i++) {
        // pull each element from the CoE network input channel
        float v = read_channel_intel(network_in);
        x[i] = v;
    }
}
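For context, a hedged sketch (not from the slides) of how a host program might launch such a kernel with the standard OpenCL C API; it assumes a context, queue and program already created for the FPGA device, uses the "sender" kernel name from the example above, and omits error handling.

/* Hedged host-side sketch: enqueue the "sender" kernel as a single
 * work-item task on the FPGA's OpenCL command queue. */
#include <CL/cl.h>

void run_sender(cl_context ctx, cl_command_queue q, cl_program prog,
                const float *host_x, int n)
{
    cl_int err;
    /* copy the host array into an FPGA-side buffer */
    cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * n, (void *)host_x, &err);

    cl_kernel k = clCreateKernel(prog, "sender", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &x);
    clSetKernelArg(k, 1, sizeof(int), &n);

    /* single work-item kernel: global size = 1 */
    size_t gsize = 1;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clFinish(q);

    clReleaseKernel(k);
    clReleaseMemObject(x);
}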

Performance comparison: traditional way vs. CoE
[Figure: two paths for FPGA-to-FPGA communication between nodes. via-IB: FPGA -> CPU over PCIe Gen3 x8 (56 Gbps), CPU -> IB HCA over PCIe Gen3 x16, then IB EDR (100 Gbps) through an InfiniBand switch. CoE: FPGA -> FPGA directly over QSFP+ (40 Gbps) through an Ethernet switch.]

Communication bandwidth (on Arria10)
[Bandwidth chart vs. message size; minimum latency = 950 ns; theoretical peak performance = 40 Gbps]
[Reference] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Taisuke Boku, "Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming", Proc. of Int. Workshop on Accelerators and Hybrid Exascale Systems (AsHES2019) in IPDPS2019 (to be published), May 20th, 2019.

GPU-FPGA communication (via CPU memory)
[Figure: data moves between the FPGA and the GPU through CPU memory, requiring two PCIe copies]

GPU-FPGA communication (DMA)
[Figure: the FPGA's DMA engine moves data directly between GPU memory and FPGA memory over PCIe, without staging in CPU memory]

OpenCL kernel on the FPGA that kicks a GPU-to-FPGA DMA transfer:

__kernel void fpga_dma(__global float *restrict fpga_mem,
                       const ulong gpu_memadr,
                       const uint id_and_len)
{
    cldesc_t desc;
    // DMA transfer GPU -> FPGA: build the descriptor
    desc.src = gpu_memadr;
    desc.dst = (ulong)(&fpga_mem[0]);
    desc.id_and_len = id_and_len;
    // kick the DMA engine, then wait for the completion status
    write_channel_intel(fpga_dma, desc);
    ulong status = read_channel_intel(dma_stat);
}

Communication Bandwidth (on Arria10 – V100)
[Bandwidth chart vs. message size, comparing FPGA<->GPU transfers via CPU with FPGA-GPU DMA in both directions; minimum latency: 0.60 µsec with FPGA-GPU DMA vs. 1.44 µsec via CPU]
[Reference] Ryohei Kobayashi, Norihisa Fujita, Yoshiki Yamaguchi, Ayumi Nakamichi, Taisuke Boku, "GPU-FPGA Heterogeneous Computing with OpenCL-enabled Direct Memory Access", Proc. of Int. Workshop on Accelerators and Hybrid Exascale Systems (AsHES2019) in IPDPS2019 (to be published), May 20th, 2019.

How to Program GPU+FPGA?

King Ghidorah (by Toho)
[Photos: King Ghidorah (by Toho) next to our system: a node with CPU, GPUs, FPGA, IB HCAs and optical links. How to program it??]

Method-1: CUDA + OpenCL

• Call two device kernels, written in CUDA (for the GPU) and OpenCL (for the FPGA)
• CUDA compiler (NVIDIA/PGI) and OpenCL compiler (Intel) → two "host" programs would normally exist
  • the host program's behavior differs between the two systems, but they can be combined → one host program calls the kernels of both systems
• We identified the libraries to be resolved for each compiler and confirmed that they don't conflict → link everything together (a sketch follows below)
[Figure: one host program invoking a CUDA kernel and an OpenCL kernel]
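A hedged sketch of the single-host-program idea of Method-1, under the assumption of a wrapper function launch_gpu_kernel() compiled separately by nvcc and an already-created OpenCL kernel for the FPGA; these names are illustrative, not the actual code used on the slides, and error handling is omitted.

/* Hedged sketch: one host program driving both a CUDA kernel (via an
 * nvcc-compiled wrapper) and an OpenCL kernel on the FPGA. */
#include <CL/cl.h>
#include <cuda_runtime.h>

/* assumed wrapper, defined in a .cu file and compiled by nvcc;
 * it launches the CUDA kernel on data already resident on the GPU */
extern void launch_gpu_kernel(float *d_data, int n);

void run_hybrid(cl_command_queue fpga_q, cl_kernel fpga_kernel,
                cl_mem fpga_buf, float *host_data, int n)
{
    /* GPU side: allocate, copy, launch through the wrapper */
    float *d_data;
    cudaMalloc((void **)&d_data, sizeof(float) * n);
    cudaMemcpy(d_data, host_data, sizeof(float) * n, cudaMemcpyHostToDevice);
    launch_gpu_kernel(d_data, n);

    /* FPGA side: set arguments and enqueue the OpenCL kernel as a task */
    clSetKernelArg(fpga_kernel, 0, sizeof(cl_mem), &fpga_buf);
    clSetKernelArg(fpga_kernel, 1, sizeof(int), &n);
    size_t gsize = 1;
    clEnqueueNDRangeKernel(fpga_q, fpga_kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);

    /* wait for both devices; the two runtimes coexist in one process */
    cudaDeviceSynchronize();
    clFinish(fpga_q);
    cudaFree(d_data);
}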

Method-2: OpenACC high-level coding
• OpenACC is available both for GPU and FPGA
  • under development in collaboration between CCS-Tsukuba and ORNL
  • GPU compilation: PGI OpenACC compiler
  • FPGA compilation: OpenARC (OpenACC for FPGA), developed at the FTG, ORNL -> collaboration between CCS and ORNL
• Solving some conflicts in the host code environment and runtime
  • basic execution is confirmed; example codes are being tested (a directive sketch follows below)
• Issues
  • performance tuning for GPU and FPGA differs completely
    • GPU: horizontal (data) parallelism in SIMD manner
    • FPGA: clock-level pipelining
  • memory model difference: HBM2 vs. BRAM
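A minimal, generic OpenACC example (not from the slides) of the kind of code that could be fed both to the PGI compiler for the GPU and to OpenARC for the FPGA; the loop, sizes and names are illustrative assumptions.

/* Hedged sketch: the same OpenACC-annotated loop can be compiled for a
 * GPU (PGI) or, through OpenARC, translated toward an FPGA backend.
 * A GPU maps iterations to SIMD threads; an FPGA would pipeline the body. */
#include <stdio.h>
#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* copy a and b to the device, copy c back; offload the loop */
    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}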

Method-3: high-level parallel programming paradigm
• XcalableACC
  • under development in collaboration between CCS-Tsukuba and RIKEN-AICS
  • the PGAS language XcalableMP is extended to embed OpenACC, for sophisticated coding of distributed-memory parallelization with accelerators
  • inter-node communication among FPGAs can be implemented over the FPGA-Ethernet direct links
  • data movement between GPU and FPGA
• OpenACC for FPGA
  • (planned) research collaboration with the ORNL FTG
  • OpenACC -> OpenCL -> FPGA compilation by the OpenARC project is under development
• Final goal: XcalableACC with the OpenARC compiler and CoE

Application Example: ARGOT
• ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree)
  • simulator for the early-stage universe where the first stars and galaxies were born
  • radiative transfer code developed at the Center for Computational Sciences (CCS), University of Tsukuba
  • CPU (OpenMP) and GPU (CUDA) implementations are available
  • inter-node parallelism is also supported using MPI
• ART (Authentic Radiation Transfer) method
  • solves the radiative transfer from light sources spreading out in space
  • the dominant computation part (90%+) of the ARGOT program
• In this research, we accelerate the ART method on an FPGA, using the Intel FPGA SDK for OpenCL as the HLS environment

Radiation SPH simulation of Radiative Feedback on First Star Formation
[Simulation snapshot: 1 million dark matter / gas particles; a star emits UV radiation toward a gas cloud]

UV radiation from a star generates an ionized region accompanied by a shock, which collides with a gas cloud. If the cloud density is higher than a threshold value, it can collapse to form a new star.

ARGOT code: radiation transfer simulation
[Figure: radiation from spot light sources is handled by the ARGOT method; radiation from spatially distributed light sources is handled by the ART method]
[Figure (AiS mapping): the ARGOT part runs on the GPU and the ART part on the FPGA within each node, following the AiS (Accelerator in Switch) model]

ART Method
• The ART method is based on ray tracing
  • the 3D target space is split into 3D meshes
  • rays come in from the boundaries and move in straight lines, parallel to each other
  • the directions (angles) are given by the HEALPix algorithm
• The ART method computes the radiative intensity on each mesh as shown in formula (1)
  • the bottleneck of this kernel is the exponential function (expf)
  • there is one expf call per frequency (ν); the number of frequencies is 1 to 6 at maximum, depending on the target problem
  • all computation uses single precision
• The memory access pattern for the mesh data varies depending on the ray's direction
  • not suitable for SIMD-style architectures
  • FPGAs can optimize it using custom memory access logic
(An illustrative per-mesh update is sketched below.)
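Formula (1) itself is not reproduced in this extraction; as a hedged illustration only, a per-mesh, per-frequency attenuation of the form I_out = I_in * exp(-Δτ_ν) is typical of such ray-tracing radiative transfer, which is why one expf per frequency dominates the kernel. The single-precision C sketch below (the names, the source-term handling and the data layout are assumptions, not the actual ARGOT code) shows that structure.

/* Hedged sketch of a per-mesh ART-style update: one expf per frequency,
 * all in single precision. */
#include <math.h>

#define NU_MAX 6  /* number of frequencies, 1..6 depending on the problem */

void art_update_mesh(float intensity[NU_MAX],    /* I_in -> I_out per frequency */
                     const float dtau[NU_MAX],   /* optical depth through this mesh */
                     const float source[NU_MAX], /* local source term (assumed) */
                     int num_freq)
{
    for (int nu = 0; nu < num_freq; nu++) {
        /* the expf call is the dominant cost of the kernel */
        float att = expf(-dtau[nu]);
        intensity[nu] = intensity[nu] * att + source[nu] * (1.0f - att);
    }
}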

Parallelizing the ART method using channels
• PE (Processing Element)
  • the computation kernel of the ART method
  • each PE holds a mesh sub-domain of size 8³ in its local memory
• BE (Boundary Element)
  • handles ray I/O, including boundary processing
  • R/W access to the boundary, or R/W access to the ray buffer
• Channels are used for communication
  • transferring ray data (96 bits x2) to/from the neighboring PE or BE
  • two channels (read, write) are used for each connection because a channel is one-sided
[Figure: a 2D array of PEs surrounded by BEs in the x and y directions, connected by channels carrying ray data; a simplified PE example follows below]
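A hedged, much-simplified OpenCL sketch (not the actual ARGOT kernel) of the channel pattern described above: a PE reads a ray from its incoming channel, updates it against its local mesh, and forwards it to the next PE or BE. The ray_t layout, channel names and the update itself are illustrative assumptions.

// Hedged sketch of one PE in the channel-connected PE/BE array.
#pragma OPENCL EXTENSION cl_intel_channels : enable

typedef struct { float intensity; ushort ix, iy, iz; } ray_t;

channel ray_t ray_in  __attribute__((depth(64)));
channel ray_t ray_out __attribute__((depth(64)));

__kernel void pe(const int num_rays)
{
    // local mesh sub-domain held by this PE (8x8x8, illustrative;
    // loading of the sub-domain is omitted for brevity)
    float dtau_local[8][8][8];

    for (int r = 0; r < num_rays; r++) {
        ray_t ray = read_channel_intel(ray_in);           // from neighbor PE/BE
        float att = exp(-dtau_local[ray.ix][ray.iy][ray.iz]);
        ray.intensity *= att;                             // attenuate along the ray
        write_channel_intel(ray_out, ray);                // forward to next PE/BE
    }
}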

Performance (FPGA vs. GPU vs. CPU)
[Chart: ART kernel performance for several problem sizes; accelerator results of 1282.8, 1165.2, 1111.0 and 1133.5 against CPU results of 112.4 (14C), 165.0 (28C), 183.4 (28C) and 227.2 (28C); annotations: 2.3x faster, 11x faster; 14C = 14 CPU cores = 1 socket, 28C = 28 CPU cores = 2 sockets]
• FPGA is faster than the CPU on all problem sizes
• FPGA is 4.6 times faster on 64³ and 6.9 times faster on 128³
• The performance decrease on the CPU from 64³ to 128³ is caused by cache misses: the mesh data no longer fits into the cache
[Reference] Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku, Yuma Oobata, Yoshiki Yamaguchi, Kohji Yoshikawa, Makino Abe, Masayuki Umemura, "Accelerating Space Radiative Transfer on FPGA using OpenCL", Proc. of HEART2018 (Int. Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies), Toronto, Jun. 21st, 2018.

Cygnus: New Supercomputer at CCS

System name and construction
• System name: "Cygnus"
  • named after "Cygnus A", which has two accelerated radiation streams
• Consisting of two parts
  • GPU-only part: equipped with CPUs and GPUs, used as an ordinary GPU cluster
  • GPU+FPGA part: equipped with GPUs and FPGAs connected via PCIe, plus CPUs, used as an AiS system as well as an ordinary GPU cluster (when not using the FPGAs)
  • the CPUs and GPUs of both parts are identical, as is the PCIe switch configuration
• The two parts are combined by a high-performance InfiniBand interconnect
• When no FPGA-ready job is running, the GPU+FPGA part is also available as a GPU-only part, for a high utilization ratio of the system
• Each computation node of either part has a "fat" configuration: very high performance in both computation and communication

55 HPC-AI Advisory Council @ Perth 2019/08/27 Center for Computational Sciences, Univ. of Tsukuba Deneb • Cygnus has two major starts: Deneb and Albireo • Deneb is alpha-star (largest) • Albireo is beta-star and it is “double-star” ⇒ looks like two accelerators Albireo (GPU and FPGA) work together

Single node configuration (Albireo)
[Node diagram: two CPUs, each with two HCAs to the network switches (100 Gbps x2 per CPU); each CPU connects through a PCIe switch to 2 GPUs and 1 FPGA; the two FPGAs also join the inter-FPGA direct network (100 Gbps x4 each)]
• Each node is equipped with both the IB EDR network and the FPGA-direct network
• Some nodes (Albireo) are equipped with both FPGAs and GPUs; the other nodes have GPUs only

Single node configuration (Deneb)
[Node diagram: same as Albireo but without FPGAs; two CPUs, each with two HCAs (100 Gbps x2) and a PCIe switch connecting 2 GPUs]

Specification of Cygnus

  Item                       Specification
  Peak performance           2.4 PFLOPS DP (GPU: 2.24 PFLOPS, CPU: 0.16 PFLOPS) + 0.64 PFLOPS SP (FPGA)
  # of nodes                 80 (32 Albireo nodes, 48 Deneb nodes)
  CPU / node                 Intel Xeon Gold x2 sockets
  GPU / node                 NVIDIA V100 x4 (PCIe)
  FPGA / node                Nallatech (Bittware) 520N with Intel Stratix10 x2 (each with 100 Gbps x4 links)
  NVMe                       Intel NVMe 1.6 TB, driven by NVMe-oF Target Offload
  Global file system         DDN Lustre, RAID6, 2.5 PB
  Interconnection network    Mellanox InfiniBand HDR100 x4 = 400 Gbps/node (switch: HDR200)
  Total network bandwidth    4 TB/s
  Programming languages      CPU: C, C++, Fortran, OpenMP / GPU: OpenACC, CUDA / FPGA: OpenCL, Verilog HDL
  MPI                        MVAPICH2, IntelMPI with GDR + original FPGA-GPU communication library
  System integrator          NEC

Two types of interconnection network
• Inter-FPGA direct network (only for Albireo nodes)
  • the 64 FPGAs on the Albireo nodes (2 FPGAs/node) are connected by an 8x8 2D torus network without a switch
• InfiniBand HDR100/200 network (100 Gbps x4/node) for parallel processing communication and shared file system access from all nodes
  • all computation nodes (Albireo and Deneb) are connected by a full-bisection Fat-Tree network with 4 channels of InfiniBand HDR100 per node (combined into HDR200 at the switch), used for parallel processing communication such as MPI, and also to access the Lustre shared file system
(A small helper for the torus neighbor mapping is sketched below.)
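Just to illustrate the 8x8 torus topology mentioned above, a hedged C helper (not from the slides; the FPGA-ID-to-coordinate mapping of the real machine is not specified here) that computes the four torus neighbors of an FPGA.

/* Hedged sketch: neighbors in an 8x8 2D torus of 64 FPGAs. The mapping of
 * FPGA IDs to (x, y) coordinates is an assumption for illustration only;
 * the actual Cygnus wiring may differ. */
#include <stdio.h>

#define TORUS_X 8
#define TORUS_Y 8

typedef struct { int xp, xm, yp, ym; } neighbors_t;

neighbors_t torus_neighbors(int id)
{
    int x = id % TORUS_X;
    int y = id / TORUS_X;
    neighbors_t n;
    n.xp = ((x + 1) % TORUS_X) + y * TORUS_X;             /* +x neighbor */
    n.xm = ((x - 1 + TORUS_X) % TORUS_X) + y * TORUS_X;   /* -x neighbor */
    n.yp = x + ((y + 1) % TORUS_Y) * TORUS_X;             /* +y neighbor */
    n.ym = x + ((y - 1 + TORUS_Y) % TORUS_Y) * TORUS_X;   /* -y neighbor */
    return n;
}

int main(void)
{
    neighbors_t n = torus_neighbors(0);
    printf("FPGA 0 neighbors: +x=%d -x=%d +y=%d -y=%d\n", n.xp, n.xm, n.yp, n.ym);
    return 0;
}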

Cygnus Overlook
[Photos of the Cygnus system and an Albireo node: 4 GPUs, 2 FPGAs and 2 CPUs per node; IB HDR100 x4 ⇨ HDR200 x2; 100 Gbps x4 FPGA optical network; IB HDR200 switch (for the full-bisection Fat-Tree)]

Summary
• CCS will introduce a new supercomputer based on the AiS concept as the next generation's multi-hybrid accelerated supercomputer
• The new machine, Cygnus, is equipped with very high performance GPUs and (partially) FPGAs to build a "strong-scaling ready" accelerated system for applications where GPU-only solutions are weak, as well as for all kinds of GPU-ready applications
• FPGA for HPC is a new concept toward the next generation's flexible and low-power solution beyond GPU-only computing
• Multi-physics simulation is the first-stage target of Cygnus, and it will be expanded to a variety of applications where a GPU-only solution has some bottleneck
• New types of applications, including AI-related ones with computational complexity such as non-SIMD-style parallelism, mixed (and unusual) precision floating-point/integer operations, irregular computation, etc., will be challenged
