Supercomputer "Fugaku"
Formerly known as Post-K
Copyright 2019 FUJITSU LIMITED

Supercomputer Fugaku Project
RIKEN and Fujitsu are currently developing Japan's next-generation flagship supercomputer, the successor to the K computer, as the most advanced general-purpose supercomputer in the world.
[Figure: Fujitsu supercomputers K computer, PRIMEHPC FX10, PRIMEHPC FX100, and Supercomputer Fugaku, with labels No.1 (2017), Finalist (2016), No.1 (2018). Image © RIKEN]
- RIKEN and Fujitsu announced that manufacturing started in March 2019
- RIKEN announced on May 23, 2019 that the supercomputer is named “Fugaku”*
* Formerly known as Post-K

Goals and Approaches for Fugaku
Goals
- High application performance
- Good usability and wide range of uses
- Keeping application compatibility
RIKEN announced predicted performance:
- More than 100x faster than K computer for GENESIS and NICAM+LETKF
- Geometric mean of speedup over K computer in 9 Priority Issues is greater than 37x
https://postk-web.r-ccs.riken.jp/perf.html
Approaches
Develop:
1. High-performance Arm CPU A64FX in HPC and AI areas
2. Cutting-edge hardware design
3. System software stack
Achieve:
- High performance in real applications
- High efficiency in key features for AI applications
1. High-Performance Arm CPU A64FX in HPC and AI Areas

Architecture features
  ISA            Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension)
  SIMD width     512-bit
  Precision      FP64/32/16, INT64/32/16/8
  Cores          48 computing cores + 4 assistant cores (4 CMGs)
  Memory         HBM2: Peak B/W 1,024 GB/s
  Interconnect   TofuD: 28 Gbps x 2 lanes x 10 ports

CMG (Core Memory Group) specification
  Cores          13 cores
  L2 Cache       8 MiB
  Memory         8 GiB, 256 GB/s

I/O
  TofuD          28 Gbps x 2 lanes x 10 ports
  PCIe           Gen3 16 lanes

[Figure: A64FX block diagram with four HBM2, a network on chip, a TofuD controller, and a PCIe controller]
[Figure: A64FX die layout showing cores, L2 caches, ring bus, HBM2 interfaces, TofuD interface, and PCIe interface]

Peak performance (Chip level) (TOPS); the chart groups the element sizes under HPC and AI
  Element size   A64FX (Fugaku)   SPARC64 VIIIfx (K computer)
  64 bits        2.7+             0.128
  32 bits        5.4+             0.128
  16 bits        10.8+            N/A
  8 bits         21.6+            N/A
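As a rough cross-check of the chip figures above, the sketch below recomputes them from the slide values. It assumes that one multiply-add counts as two operations, and it uses the two 512-bit SIMD pipelines per core stated on the deck's AI slide. The function and variable names are illustrative, and the implied clock is only a lower bound derived from the "+" figures, not a published specification.

```python
# Sanity check of the A64FX chip-level figures quoted in the slides.
# Assumption (not in the slides): one multiply-add counts as 2 operations.

COMPUTE_CORES = 48        # slide: 48 computing cores
SIMD_BITS = 512           # slide: SIMD width 512-bit
PIPELINES_PER_CORE = 2    # slide (AI section): 2x 512-bit SIMD pipelines per core
OPS_PER_MULTIPLY_ADD = 2  # assumption

# Peak chip performance in TOPS per element size, as quoted (all are "+" values).
QUOTED_PEAK_TOPS = {64: 2.7, 32: 5.4, 16: 10.8, 8: 21.6}


def ops_per_cycle(element_bits: int) -> int:
    """Operations per clock cycle summed over all compute cores."""
    lanes = SIMD_BITS // element_bits
    return COMPUTE_CORES * PIPELINES_PER_CORE * lanes * OPS_PER_MULTIPLY_ADD


def implied_clock_ghz(element_bits: int) -> float:
    """Clock frequency that would yield the quoted peak (a lower bound)."""
    peak_ops_per_second = QUOTED_PEAK_TOPS[element_bits] * 1e12
    return peak_ops_per_second / ops_per_cycle(element_bits) / 1e9


# Memory: 4 CMGs, each with 256 GB/s of HBM2 bandwidth.
hbm2_peak_gb_per_s = 4 * 256

# TofuD raw signalling rate: 28 Gbps x 2 lanes x 10 ports (encoding overhead ignored).
tofud_raw_gbps = 28 * 2 * 10

if __name__ == "__main__":
    for bits in QUOTED_PEAK_TOPS:
        print(bits, ops_per_cycle(bits), round(implied_clock_ghz(bits), 3))
    print(hbm2_peak_gb_per_s, tofud_raw_gbps)
```

Under these assumptions every element size implies the same clock of about 1.76 GHz, which is consistent with peak throughput doubling each time the element size halves. The per-CMG bandwidth also multiplies out to the quoted 1,024 GB/s.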
2. Cutting-edge Hardware Design

1 PFlops by Fugaku and K computer (photos © RIKEN)
                  Fugaku                       K computer
  Configuration   1x rack including SSDs       80x compute racks & 20x disk racks
  Nodes           384                          8,160
  Footprint       1.1 m² (0.8 m x 1.4 m)       128 m² (4 m x 32 m)

Scalable design
  Unit                  CPU      CMU      BoB     Shelf    Rack    Fugaku
  Nodes                 1        2        16      48       384     150k+
  Performance [Flops]   2.7 T+   5.4 T+   43 T+   129 T+   1 P+    400 P+

CMU (CPU Memory Unit)
- Water couplers (IN, OUT), 100% direct water cooling
- PCIe connector
- TofuD connector
- 3x QSFP for AOC (Active Optical Cables)
- Single-sided blind mate connectors for electrical signals and water
[Figure: TofuD cables (AOC) on QSFP28 ports for the X, Y, and Z axes]

3. System Software Stack
Fujitsu developing system software in collaboration with RIKEN
Fujitsu Technical Computing Suite implementing development and execution environments with great usability on large-scale system

Stack, top to bottom (legend in figure: "Under development w/ RIKEN"):
  Fugaku applications
  Fujitsu Technical Computing Suite / RIKEN developing system software
    Management software
    - System management for high availability & power saving operation
    - Job management for higher system utilization & power efficiency
    File system
    - FEFS: Lustre-based distributed file system
    - LLIO: NVM-based file I/O accelerator
    Programming environment
    - XcalableMP
    - MPI (Open MPI, MPICH)
    - OpenMP, Coarray
    - Compilers (C, C++, Fortran)
    - Debugging and tuning tools
    - Math. libs.
  Linux OS / McKernel (Lightweight kernel)
  Fugaku system hardware
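The scalable-design row of the hardware slide can be checked with simple multiplication. The sketch below assumes 2.7 TFlops per node and treats "150k+" as exactly 150,000 nodes, so its results are lower bounds. The dictionary and function names are illustrative, and the K computer per-node peak of 0.128 TFlops is taken from the CPU slide's chart.

```python
# Recompute the "Scalable design" figures from the per-node peak.
# Assumptions: 2.7 TFlops per node (the slide says "2.7 T+") and
# exactly 150,000 nodes for the full system (the slide says "150k+").

NODE_PEAK_TFLOPS = 2.7
NODES_PER_UNIT = {
    "CPU": 1,
    "CMU": 2,
    "BoB": 16,
    "Shelf": 48,
    "Rack": 384,
    "Fugaku": 150_000,
}


def unit_peak_tflops(unit: str) -> float:
    """Lower-bound peak of one packaging unit, in TFlops."""
    return NODES_PER_UNIT[unit] * NODE_PEAK_TFLOPS


# 1 PFlops comparison from the same slide.
K_NODES, K_NODE_PEAK_TFLOPS = 8_160, 0.128     # K computer
K_FOOTPRINT_M2, FUGAKU_FOOTPRINT_M2 = 128.0, 1.1

k_peak_tflops = K_NODES * K_NODE_PEAK_TFLOPS
footprint_ratio = K_FOOTPRINT_M2 / FUGAKU_FOOTPRINT_M2
node_ratio = K_NODES / NODES_PER_UNIT["Rack"]

if __name__ == "__main__":
    for unit in NODES_PER_UNIT:
        print(f"{unit:7s} {unit_peak_tflops(unit):12.1f} TFlops")
    print(f"K computer, 8,160 nodes: {k_peak_tflops:.1f} TFlops")
    print(f"Footprint ratio: {footprint_ratio:.0f}x, node ratio: {node_ratio:.2f}x")
```

With these inputs one rack comes to about 1,037 TFlops and the full system to about 405 PFlops, in line with the slide's "1 P+" and "400 P+". The same 1 PFlops takes roughly 116 times the floor area and about 21 times the node count on the K computer.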
3. System Software Stack
Programming environment:
- Exploits hardware performance by compiler optimizations such as SVE vectorization
- Supports new programming language standards and data type FP16

Achieve High Performance in Real Application

WRF: Weather Research and Forecasting model (v3.8.1)
- Vectorizing loops including IF statements is key optimization

Himeno Benchmark (Fortran90, size: XL)
- Stencil calculation to solve Poisson’s equation by Jacobi method
- High memory B/W and long SIMD length work effectively

WRF v3.8.1 (48-hour, 12km, CONUS) on 48 cores; performance ratio*, higher is better
  Skylake (Xeon Platinum 8168), 2 CPUs    1.00
  A64FX, 1 CPU with source tuning         1.56
* Normalized by the average elapsed time for timestep of Skylake

Himeno Benchmark (Fortran90); GFlops, higher is better
  Skylake (Xeon Platinum 8168), 2 CPUs    85
  FX100, 1 CPU                            103
  A64FX, 1 CPU                            346
  SX-Aurora TSUBASA, 1 VE†                286
  Tesla V100, 1 GPU†                      305
† "Performance evaluation of a vector supercomputer SX-aurora TSUBASA", https://dl.acm.org/citation.cfm?id=3291728

Achieve High Efficiency in Key Features for AI Applications

High FP16 & INT8 peak performance and high memory peak B/W
  FP16 performance   10.8+ TOPS, > 90%@HGEMM
  INT8 performance   21.6+ TOPS in partial dot product
  Memory B/W         1,024 GB/s, > 80%@STREAM Triad
[Figure: INT8 partial dot product. 8-bit elements A0..A3 are multiplied by 8-bit elements B0..B3 and accumulated into a 32-bit value C]

Functions contributing to key features in AI fields
A64FX CPU
• 2x 512-bit wide SIMD pipelines per core for FP16 and INT8
• High memory B/W and calculation throughput
Compilers & libraries
• Vectorization and software pipelining
• FP16 as data type of programming language (e.g., real (kind=2) in Fortran)
• Mathematical Library for HGEMM

Future Plans
[Figure: timeline CY2011 to 2021 with a "Today" marker; images © RIKEN]
  Supercomputer Fugaku   Basic Design; Design and Implementation; Manufacturing, Installation and Tuning; Operations starting around CY2021
  K computer             Development; Operations; Operations ending Aug, 2019
Fujitsu HPC Products: Fujitsu will begin global sales of supercomputers based on the Supercomputer Fugaku technology in the 2nd half of FY2019
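Returning to the INT8 partial dot product on the AI slide, the sketch below emulates what the figure shows: groups of four 8-bit products accumulated into one 32-bit lane. It is a minimal model, not Fujitsu's or Arm's implementation. It assumes signed 8-bit inputs and 32-bit wrap-around on overflow, neither of which the slide states, and the names `partial_dot` and `wrap_int32` are hypothetical.

```python
# Emulation of an INT8 partial dot product with 32-bit accumulation.
# Assumptions (not in the slides): inputs are signed 8-bit values and the
# 32-bit accumulator wraps around on overflow.

GROUP = 4  # 8-bit elements feeding each 32-bit lane (from the slide's figure)


def wrap_int32(value: int) -> int:
    """Wrap an unbounded Python int to signed 32-bit two's complement."""
    value &= 0xFFFF_FFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def partial_dot(acc: list[int], a: list[int], b: list[int]) -> list[int]:
    """Return acc[i] + sum of the four products a[4i+k] * b[4i+k], per lane."""
    if len(a) != len(b) or len(a) != GROUP * len(acc):
        raise ValueError("a and b need four 8-bit elements per accumulator lane")
    if any(not -128 <= x <= 127 for x in (*a, *b)):
        raise ValueError("inputs must fit in signed 8 bits")
    result = []
    for lane, c in enumerate(acc):
        start = lane * GROUP
        products = (x * y for x, y in zip(a[start:start + GROUP],
                                          b[start:start + GROUP]))
        result.append(wrap_int32(c + sum(products)))
    return result


if __name__ == "__main__":
    # Two lanes: 10 + (1*5 + 2*6 + 3*7 + 4*8) and 0 + (-1 - 2 - 3 - 4).
    print(partial_dot([10, 0],
                      [1, 2, 3, 4, -1, -2, -3, -4],
                      [5, 6, 7, 8, 1, 1, 1, 1]))  # prints [80, -10]
```

A 512-bit vector holds 64 such 8-bit elements in 16 lanes, so one operation of this kind performs 64 multiplies and 64 additions. That is eight times the work of a 64-bit multiply-add on the same vector, matching the ratio between the quoted 21.6+ and 2.7+ TOPS.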