Supercomputer Fugaku

Supercomputer Fugaku Toshiyuki Shimizu Feb. 18th, 2020 FUJITSU LIMITED Copyright 2020 FUJITSU LIMITED Outline ◼ Fugaku project overview ◼ Co-design ◼ Approach ◼ Design results ◼ Performance & energy consumption evaluation ◼ Green500 ◼ OSS apps ◼ Fugaku priority issues ◼ Summary 1 Copyright 2020 FUJITSU LIMITED Supercomputer “Fugaku”, formerly known as Post-K Focus Approach Application performance Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2 Leading-edge Si-technology, Fujitsu's proven low power & high Power efficiency performance logic design, and power-controlling knobs Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Usability Linux 2 Copyright 2020 FUJITSU LIMITED Fugaku project schedule 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Fugaku development & delivery Manufacturing, Apps Basic Detailed design & General Feasibility study Installation review design Implementation operation and Tuning Select Architecture & Co-Design w/ apps groups apps sizing 3 Copyright 2020 FUJITSU LIMITED Fugaku co-design ◼ Co-design goals ◼ Obtain the best performance, 100x apps performance than K computer, within power budget, 30-40MW • Design applications, compilers, libraries, and hardware ◼ Approach ◼ Estimate perf & power using apps info, performance counts of Fujitsu FX100, and cycle base simulator • Computation time: brief & precise estimation • Communication time: bandwidth and latency for communication w/ some attributes for communication patterns • I/O time: ◼ Then, optimize apps/compilers etc. and resolve bottlenecks ◼ Estimation of performance and power ◼ Precise performance estimation for primary kernels • Make & run Fugaku objects on the Fugaku cycle base simulator ◼ Brief performance estimation for other sections • Replace performance counts of FX100 w/ Fugaku params: # of inst. commit/cycle, wait cycles of barrier, inst. fetch, branch, fp exec, data load/store wait cycles of L1D/L2, etc. ◼ Power estimation • DGEMM execution toggles on the emulator + estimation of memory and interconnect considering utilization + loss of convertors 4 Copyright 2020 FUJITSU LIMITED Co-design iterations around year 2015 ◼ Apply each of application kernels ◼ Define/refine a set of architecture parameters ◼ Implement/tune the kernel under the architecture parameters ◼ Evaluate execution time using the estimation tools ◼ Identify hardware bottlenecks and explore design space ◼ Examples of architecture parameters ◼ Frequency, # of arch regs, SIMD width, cache structure & size, # of cores… ◼ Memory, interconnect parameters ◼ Implementation of instructions: i.e. Combined gather… 5 Copyright 2020 FUJITSU LIMITED A64FX CPU ◼ Arm SVE, high performance and high efficiency ◼ DP performance 2.7+ TFLOPS, >90%@DGEMM, (CPU freq=1.8/2.0/2.2 GHz) ◼ Memory BW 1024 GB/s, >80%@STREAM Triad FUGAKU operating cond. 12x compute cores A64FX 1x assistant core PCle Tofu Controller Interface ISA (Base, extension) Armv8.2-A, SVE CMG CMG Peak DP performance 2.7+ TFLOPS HBM2 HBM2 C C SIMD width 512-bit N # of cores 48 + 4 CMG O CMG Memory capacity 32 GiB (HBM2 x4) HBM2 HBM2 C C C Memory peak bandwidth 1024 GB/s PCIe Gen3 16 lanes CMG：Core Memory Group NOC：Network on Chip High speed interconnect TofuD integrated 6 Copyright 2020 FUJITSU LIMITED A64FX leading-edge Si-technology ◼ TSMC 7nm FinFET & CoWoS TofuD Interface PCIe Interface ◼ HBM2 Broadcom SerDes, HBM I/O, and SRAMs Core Core Core Core Core Core Core Core Core Core ◼ 8.786 billion transistors Interface L2 L2 interface HBM2 Core Core Core Core ◼ 594 signal pins Cache Cache Core Core Core Core Core Core RING Core Core Core Core Core Core - Bus Core Core Core Core Core Core Core Core Core Core Core Core HBM2 Interface HBM2 L2 L2 Core Core Core Core Cache Cache HBM2 Interface HBM2 Core Core Core Core Core Core Core Core Core Core 7 Copyright 2020 FUJITSU LIMITED Fujitsu-designed CPU core w/ High Memory Bandwidth ◼ A64FX out-of-order controls in cores, caches, and memories CMG 12x Compute Cores + 1x Assistant Core achieve superior throughput Core Core Core Core 512-bit wide SIMD 2x FMAs 115+ 230+ GB/s GB/s L1D 64KiB, 4way BW and calc. perf. A64FX B/F 57+ 115+ GB/s DP floating perf. (TFlops) 2.7+ - GB/s L2 Cache 8MiB, 16way L1 data cache (TB/s) 11+ 4 256 L2 cache (TB/s) 3.6+ 1.3 GB/s HBM2 8GiB Memory BW (GB/s) 1024 0.37 HBM2HBM2HBM2 8GiB 8GiB 8GiB x4 8 Copyright 2020 FUJITSU LIMITED A64FX optimized load efficiency for apps performance ◼ 128 bytes/cycle sustained bandwidth L1D cache Read data0 Read port0 even for unaligned SIMD load 64B/cycle Read data1 Read port1 64B/cycle 128B ◼ “Combined Gather” doubles gather Mem. 0 1 3 2 6 7 5 4 (indirect) load’s data throughput, flow-1 flow-4 flow-3 when target elements are within a flow-2 “128-byte aligned block” for a pair of two regs, even & odd Regs 0 1 2 3 4 5 6 7 Suggested through Co-design work w/ app teams 8B Maximizes BW to 32 bytes/cyc. 9 Copyright 2020 FUJITSU LIMITED A64FX Power Knobs to reduce power consumption ◼ “Power knob” limits units’ activities via user APIs ◼ Performance/Watt can be optimized by utilizing Power knobs Fetch Issue Dispatch Reg-read Execute Cache and Memory Commit CSE EAGA Fetch EXC Port Frequency reduction L1 inst. Decode RSA EAGB PC PGPR EXD Store cache & Issue L1 data EXA Port cache EXB Control RSE0 Write Registers Branch Buffer Predictor PPR PRX RSE1 EXA pipeline only FLA RSBR PFPR FLB FLA pipeline only L2 cache HBM2 Controller 52 cores Tofu controller PCI Controller HBM2 B/W adjust Decode width: 4 to 2 HBM2 (units of 10%) 10 Copyright 2020 FUJITSU LIMITED Fugaku system software developed with RIKEN Fugaku applications Fujitsu Technical Computing Suite / RIKEN-developed system software Management software File system Programming environment System management FEFS XcalableMP, FDPS for high availability & Lustre-based Open MPI, MPICH(PiP), DTF power saving distributed file system operation OpenMP, COARRAY, Math. libs. Tensorflow Job management for LLIO Chainer Compilers, AI frameworks higher system NVM-based file I/O Pytorch utilization & power Accelerator for Fugaku efficiency Debug & tuning tools, Containers Docker KVM Multi-Kernel System: RHEL8 and light-weight kernel (IHK/McKernel) Singularity Fugaku system hardware FDPS: Framework for Developing Particle Simulator Fugaku developed w/ RIKEN PiP: Process-in-Process DTF: Data Transfer Framework IHK: Interface for Heterogeneous Kernels 11 Copyright 2020 FUJITSU LIMITED Outline ◼ Fugaku project overview ◼ Co-design ◼ Approach ◼ Design results ◼ Performance & energy consumption evaluation ◼ Green500 ◼ OSS apps ◼ Fugaku priority issues ◼ Summary 12 Copyright 2020 FUJITSU LIMITED Green500, Nov. 2019 ◼ A64FX prototype – ◼ Fujitsu A64FX 48C 2GHz ranked #1 on the list ◼ 768x general purpose A64FX CPU w/o accelerators • 1.9995 PFLOPS @ HPL, 84.75% • 16.876 GF/W • Power quality level 2 https://www.top500.org/green500/lists/2019/11/ 13 Copyright 2020 FUJITSU LIMITED How we achieved ◼ Key for GF/W is {energy efficient HW} x {parallel/exec efficiency} ◼ A64FX is designed for energy efficient ◼ Fujitsu’s proven CPU microarchitecture & 7nm FinFET ◼ SoC design: Tofu interconnect D integrated + Concerted efforts ◼ CoWoS: 4x HBM2 for main memory integrated of co-design ◼ Superior parallel/exec efficiency ◼ Math. libraries are tuned for application efficiency ◼ Comm. libs are also tuned utilizing long experience of Tofu @ K computer ◼ Performance tuning is efficiently done utilizing rich performance analyzer/monitor 14 Copyright 2020 FUJITSU LIMITED SC19 TOP500 calculation efficiency ◼ A64FX superior A64FX prototype efficiency 84.75% 100% with small Nmax 90% 80% 70% ◼ Results of: 60% Efficiency ◼ Optimized 50% communication 40% Challenge and math. libs 30% w/o Acc ◼ Optimization of 20% w/ Acc overlapped 10% communication 0% 0 5,000,000 10,000,000 15,000,000 20,000,000 Nmax 15 Copyright 2020 FUJITSU LIMITED A64FX CPU performance evaluation for real apps Relative speed up ratio ◼ Open source software, Real 0.5 1.0 1.5 apps on an A64FX @ 2.2GHz OpenFOAM ◼ Up to 1.8x faster over the latest x86 processor (24core, FrontlSTR 2.9GHz) x2 ABINIT ◼ High memory BW and long SALMON SIMD length of A64FX work SPECFEM3D effectively with these WRF applications MPAS FUJITSU A64FX x86 (24core, 2.9GHz) x2 16 Copyright 2020 FUJITSU LIMITED A64FX CPU power efficiency for real apps Relative power efficiency ratio ◼ Performance / Energy 1.0 2.0 3.0 4.0 consumption on an A64FX @ 2.2GHz OpenFOAM ◼ Up to 3.7x more efficient over FrontlSTR the latest x86 processor ABINIT (24core, 2.9GHz) x2 SALMON ◼ High efficiency is achieved by SPECFEM3D energy-conscious design and WRF implementation MPAS FUJITSU A64FX x86 (24core, 2.9GHz) x2 17 Copyright 2020 FUJITSU LIMITED Fugaku priority issues and performance prediction https://postk-web.r-ccs.riken.jp/perf.html ◼ 100x app performance and power budget requirements are met ◼ Some apps utilize power knob to reduce power consumption and achieve high energy efficiency ◼ Measuring and optimizing using real HW 18 Copyright 2020 FUJITSU LIMITED Fugaku project schedule 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Fugaku development & delivery Manufacturing, Apps Basic Detailed design & General Feasibility study Installation review design Implementation operation and Tuning Co-Design w/ apps groups Select Architecture & apps sizing Apps tuning Previous estimation w/o HW w/ HW 19 Copyright 2020 FUJITSU LIMITED Current status normalized by the estimation w/o HW ◼ Performance and energy (=power*elapse)

Supercomputer Fugaku

File System and Power Management Enhanced for Supercomputer Fugaku

Performance Analysis on Energy Efficient High

BDEC2 Poznan Japanese HPC Infrastructure Update

Measuring Power Consumption on IBM Blue Gene/P

Download SC18 Final Program

Compute System Metrics (Bof)

Mixed Precision Block Fused Multiply-Add

Computer Architectures an Overview

Programming on K Computer

HPC Systems in the Next Decade – What to Expect, When, Where

2010 IBM and the Environment Report

Management Direction Update