Japanese Supercomputer Development and Hybrid Accelerated Supercomputing
Taisuke Boku
Director, Center for Computational Sciences, University of Tsukuba
With courtesy of HPCI and R-CCS (first part)
HPC-AI Advisory Council @ Perth, 2019/08/27

Agenda
• Development and Deployment of Supercomputers in Japan
  • Tier-1 and Tier-2 systems
  • Supercomputers in national universities
  • FUGAKU (Post-K) Computer
• Multi-Hybrid Accelerated Supercomputing at U. Tsukuba
  • Today's accelerated supercomputing
  • New concept of multi-hybrid accelerated computing
  • Combining GPU and FPGA in a system
  • Programming and applications
  • Cygnus supercomputer in U. Tsukuba
• Conclusions

Development and Deployment of Supercomputers in Japan

Towards Exascale Computing
Tier-1 and Tier-2 supercomputers form HPCI and move forward to exascale computing like two wheels.
• Tier-1: National Flagship Machine (RIKEN R-CCS, formerly AICS)
  • Originally developed MPP
  • K Computer
  • Fugaku (Post-K) Computer
• Tier-2: University Supercomputer Centers
  • Cluster, vector, GPU, etc.
  • 9 national universities procure original systems
[Figure: performance in PF (log scale, 1 to 1000) versus year (2008 to 2020), showing T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), TSUBAME2.0 (Tokyo Tech.), OFP (JCAHPC: U. Tsukuba and U. Tokyo), and the Post-K Computer on the way to a future exascale system]

HPCI – High Performance Computing Infrastructure in Japan
• National networking program, under MEXT, to share most of the supercomputers in Japan
• The national flagship supercomputer "K" (and "FUGAKU" in 2021) and all other representative supercomputers in national university supercomputer centers are connected physically and logically
• Nation-wide supercomputer sharing program: proposals from all kinds of computational science and engineering fields are accepted twice a year, reviewed by a selection committee, and assigned computation resources
• Large-capacity shared storage (~250 PByte), distributed across two sites and shared by all HPCI supercomputer facilities, which are connected by a 100 Gbps-class nation-wide network
• "Single sign-on" system (Globus) for easy login and job scheduling among all resources

HPCI Tier 2 Systems Roadmap (As of Jun.
2019)
[Figure: Gantt-style roadmap chart covering fiscal years 2016 to 2027. The systems and planned successors shown for each site are listed below.]
• Hokkaido: Hitachi SR16000/M1 (172 TF, 22 TB); Cloud System BS2000 (44 TF, 14 TB); Data Science Cloud / Storage HA8000 / WOS7000 (10 TF, 1.96 PB); 3.96 PF (UCC + CFL/M) 0.9 MW and 0.16 PF (Cloud) 0.1 MW; 35 PF (UCC + CFL-M) 2 MW
• Tohoku: SX-ACE (707 TF, 160 TB, 655 TB/s), LX406e (31 TF), Storage (4 PB), 3D Vis, 2 MW; ~30 PF, ~30 PB/s Mem BW (CFL-D/CFL-M) ~3 MW; 100~200 PF, 100-200 PB/s (CFL-D/CFL-D) ~4 MW
• Tsukuba: HA-PACS (1166 TF); COMA (PACS-IX) (1001 TF); PPX1 (62 TF); PPX2 (62 TF); Cygnus 2.4 PF (TPF) 0.4 MW; PACS-XI 100 PF (TPF)
• Tsukuba / Tokyo (shared): Oakforest-PACS (OFP) 25 PF (UCC + TPF) 3.2 MW; 100+ PF (UCC + TPF) 4.5-6.0 MW
• Tokyo: Fujitsu FX10 (Oakleaf/Oakbridge) (1.27 PFlops, 168 TiB, 460 TB/s); Hitachi SR16K/M1 (54.9 TF, 10.9 TiB, 28.7 TB/s); Reedbush-U/H 1.92 PF (FAC) 0.7 MW (Reedbush-U until the end of June 2020); Reedbush-L 1.4 PF (FAC) 0.2 MW; Oakbridge-II 4+ PF 1.0 MW; BDEC 60+ PF (FAC) 3.5-4.5 MW; 200+ PF (FAC) 6.5-8.0 MW
• Tokyo Tech.: TSUBAME 2.5 (5.7 PF, 110+ TB, 1160 TB/s) 1.4 MW; TSUBAME 3.0 (12.15 PF, 1.66 PB/s); TSUBAME 4.0 (~100 PF, ~10 PB/s, ~2.0 MW)
• Nagoya: Fujitsu FX100 (2.9 PF, 81 TiB); Fujitsu CX400 (774 TF, 71 TiB); an unlabeled entry (542 TF, 71 TiB); 2 MW in total; 20+ PF (FAC/UCC + CFL-M) up to 3 MW; 100+ PF (FAC/UCC + CFL-M) up to 3 MW
• Kyoto: Cray: XE6 + GB8K XC30 (983 TF); Cray XC30 (584 TF); Cray XC40 (5.5 PF) + CS400 (1.0 PF) 1.33 MW; 20-40+ PF (FAC/TPF + UCC) 1.5 MW; 80-150+ PF (FAC/TPF + UCC) 2 MW
• Osaka: NEC SX-ACE (423 TF); NEC Express5800 (22.4 TF); OCTOPUS 1.463 PF (UCC); 3.2 PB/s, 15~25 Pflop/s, 1.0-1.5 MW (CFL-M); 25.6 PB/s, 50-100 Pflop/s (TPF) 1.5-2.0 MW
• Kyushu: HA8000 (712 TF, 242 TB); SR16000 (8.2 TF, 6 TB); FX10 (272.4 TF, 36 TB); CX400 (966.2 TF, 183 TB); FX10 (90.8 TFLOPS); 2.0 MW; Fujitsu PRIMERGY CX subsystem A + B, 10.4 PF (UCC/TPF) 2.7 MW; 100+ PF (FAC/TPF + UCC/TPF) ~3 MW
• JAMSTEC: SX-ACE (1.3 PF, 320 TiB) 3 MW; 100 PF, 3 MW
• ISM: UV2000 (98 TF, 128 TiB) 0.3 MW; 2 PF, 0.3 MW
Power is the maximum consumption including cooling.

FUGAKU (富岳): New National Flagship Machine
Slides courtesy of M. Sato of RIKEN R-CCS

FLAGSHIP2020 Project
Missions
• Building the Japanese national flagship supercomputer Fugaku (a.k.
a post K), and
• Developing a wide range of HPC applications, running on Fugaku, in order to solve social and science issues in Japan

Overview of Fugaku architecture
[Figure: Fujitsu A64FX processor and prototype board]
Node: Manycore architecture
• Armv8-A + SVE (Scalable Vector Extension)
• SIMD length: 512 bits
• # of cores: 48 + (2/4 for OS) (> 2.7 TF / 48 cores)
• Co-design with application developers and high memory bandwidth utilizing on-package stacked memory (HBM2), 1 TB/s B/W
• Low power: 15 GF/W (dgemm)
Network: TofuD
• Chip-integrated NIC, 6D mesh/torus interconnect

Status and Update
• "Design and Implementation" completed
• The official contract with Fujitsu to manufacture, ship, and install hardware for Fugaku is done
• RIKEN revealed #nodes > 150K
• The name of the system was decided as "Fugaku"
• RIKEN announced the Fugaku early access program to begin around Q2/CY2020

CPU Architecture: A64FX
• Armv8.2-A (AArch64 only) + SVE (Scalable Vector Extension)
• FP64/FP32/FP16 (https://developer.arm.com/products/architecture/a-profile/docs)
• SVE 512-bit wide SIMD
• # of cores: 48 + (2/4 for OS)
  • CMG (Core-Memory-Group): NUMA node, 12+1 cores
  • The "common" programming model will be to run each MPI process on a NUMA node (CMG) with OpenMP-MPI hybrid programming
  • 48-thread OpenMP is also supported
• Co-design with application developers and high memory bandwidth utilizing on-package stacked memory: HBM2 (32 GiB)
• Leading-edge Si technology (7nm FinFET), low-power logic design (approx. 15 GF/W (dgemm)), and power-controlling knobs
• PCIe Gen3, 16 lanes
• Peak performance
  • > 2.7 TFLOPS (>90% @ dgemm)
  • Memory B/W 1024 GB/s (>80% stream)
  • Byte per Flops: approx.
0.4
[Figure label: HBM2: 8GiB]

Peak Performance

| | Fugaku | K |
| Peak DP (double precision) | 400+ Pflops (x34+) | 11.3 Pflops |
| Peak SP (single precision) | 800+ Pflops (x70+) | 11.3 Pflops |
| Peak HP (half precision) | 1600+ Pflops (x141+) | -- |
| Total memory bandwidth | 150+ PB/sec (x29+) | 5.2 PB/sec |

• HPL & Stream
  • > 2.5 TF / node for dgemm
  • > 830 GB/s / node for stream triad
• Himeno Benchmark (Fortran90) †
[Figure: Himeno Benchmark results]
† "Performance evaluation of a vector supercomputer SX-aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

Target Application's Performance
• Performance targets
  • 100 times faster than K for some applications (tuning included)
  • 30 to 40 MW power consumption
https://postk-web.r-ccs.riken.jp/perf.html

Predicted Performance of 9 Target Applications (As of 2019/05/14)

| Area | Priority Issue | Performance Speedup over K | Application | Brief description |
| Health and longevity | 1. Innovative computing infrastructure for drug discovery | x125+ | GENESIS | MD for proteins |
| Health and longevity | 2. Personalized and preventive medicine using big data | x8+ | Genomon | Genome processing (Genome alignment) |
| Disaster prevention and Environment | 3. Integrated simulation systems induced by earthquake and tsunami | x45+ | GAMERA | Earthquake simulator (FEM in unstructured & structured grid) |
| Disaster prevention and Environment | 4. Meteorological and global environmental prediction using big data | x120+ | NICAM+LETKF | Weather prediction system using Big data (structured grid stencil & ensemble Kalman filter) |
| Energy issue | 5. New technologies for energy creation, conversion / storage, and use | x40+ | NTChem | Molecular electronic (structure calculation) |
| Energy issue | 6. Accelerated development of innovative clean energy systems | x35+ | Adventure | Computational Mechanics System for Large Scale Analysis and Design (unstructured grid) |
| Industrial competitiveness enhancement | 7. Creation of new functional devices and high-performance materials | x30+ | RSDFT | Ab-initio program (density functional theory) |
| Industrial competitiveness enhancement | 8. Development of innovative design and production processes | x25+ | FFB | Large Eddy Simulation (unstructured grid) |
| Basic science | 9. Elucidation of the fundamental laws and evolution of the universe | x25+ | LQCD | Lattice QCD simulation (structured grid Monte Carlo) |

Performance study using Post-K simulator
• We have been developing a cycle-level simulator for the post-K processor using gem5.
• Collaboration with U. Tsukuba
• Kernel evaluation using a single core

| | Post-K Simulator | KNL |
| Execution time [msec] | 4.2 | 5.5 |
| Number of L1D misses | 29569 | -- |
| L1D miss rate | 1.19% | -- |
| Number of L2 misses | 20 | -- |
| L2 miss rate | 0.01% | -- |

• 1.3 times faster than KNL per core
• With further optimization (inst. scheduling), the execution time is reduced to 3.4 msec (1.6 times faster)
• This is the evaluation on L1. OpenMP multicore execution will be much faster due to HBM memory

Fugaku prototype board and rack
• "Fujitsu Completes Post-K Supercomputer CPU Prototype, Begins Functionality Trials", HPCwire, June 21, 2018
[Figure: CPU die photo (by Fujitsu Ltd.) and CMU board. Labels: HBM2, 60mm x 60mm, Water, Water, AOC QSFP28 (X), AOC QSFP28 (Y), AOC QSFP28 (Z), Electrical signals]
• 2 CPU / CMU
• Shelf: 48 CPUs (24 CMU)
• Rack: 8 shelves = 384 CPUs (8x48)
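
The rack composition and per-node figures quoted on these slides can be cross-checked with simple arithmetic. The Python sketch below is only an illustration: the function names are hypothetical, the slide's lower bounds (> 2.7 TFLOPS, > 150K nodes) are treated as exact values, and one CPU per node is assumed. The resulting rack count is therefore a rough lower-bound estimate, not a number taken from the slides.

```python
import math

# Figures quoted on the slides. Lower bounds are treated as exact values,
# which is an assumption made only for this illustration.
PEAK_TFLOPS_PER_NODE = 2.7    # "> 2.7 TFLOPS" per A64FX
MEM_BW_GBS_PER_NODE = 1024.0  # "Memory B/W 1024GB/s"
CPUS_PER_CMU = 2              # "2 CPU / CMU"
CMUS_PER_SHELF = 24           # "Shelf: 48 CPUs (24 CMU)"
SHELVES_PER_RACK = 8          # "Rack: 8 shelves"
MIN_NODES = 150_000           # "#nodes > 150K" (assumes 1 CPU per node)


def bytes_per_flop(mem_bw_gbs, peak_tflops):
    """Memory bandwidth divided by peak FLOP rate, both in base units."""
    return (mem_bw_gbs * 1e9) / (peak_tflops * 1e12)


def cpus_per_rack(cpus_per_cmu, cmus_per_shelf, shelves_per_rack):
    """CPUs in one rack: CPUs per CMU x CMUs per shelf x shelves per rack."""
    return cpus_per_cmu * cmus_per_shelf * shelves_per_rack


def racks_needed(nodes, per_rack):
    """Smallest whole number of racks that can hold the given node count."""
    return math.ceil(nodes / per_rack)


if __name__ == "__main__":
    bf = bytes_per_flop(MEM_BW_GBS_PER_NODE, PEAK_TFLOPS_PER_NODE)
    per_rack = cpus_per_rack(CPUS_PER_CMU, CMUS_PER_SHELF, SHELVES_PER_RACK)
    racks = racks_needed(MIN_NODES, per_rack)
    print(f"Bytes per flop: {bf:.2f}")
    print(f"CPUs per rack: {per_rack}")
    print(f"Racks for {MIN_NODES} nodes: at least {racks}")
```

Under these assumptions the sketch gives about 0.38 bytes per flop, in line with the slide's "approx. 0.4", 384 CPUs per rack as stated on the rack slide, and at least 391 racks for 150,000 nodes.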