Supercomputer

Toshiyuki Shimizu Feb. 18th, 2020

FUJITSU LIMITED

Copyright 2020 LIMITED Outline

◼ Fugaku project overview ◼ Co-design ◼ Approach ◼ Design results ◼ Performance & energy consumption evaluation ◼ ◼ OSS apps ◼ Fugaku priority issues ◼ Summary

1 Copyright 2020 FUJITSU LIMITED Supercomputer “Fugaku”, formerly known as Post-K

Focus Approach

Application performance Co-design w/ application developers and Fujitsu-designed CPU core w/ high memory bandwidth utilizing HBM2 Leading-edge Si-technology, Fujitsu's proven low power & high Power efficiency performance logic design, and power-controlling knobs Arm®v8-A ISA with Scalable Vector Extension (“SVE”), and Arm standard Usability

2 Copyright 2020 FUJITSU LIMITED Fugaku project schedule

2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022

Fugaku development & delivery

Manufacturing, Apps Basic Detailed design & General Feasibility study Installation review design Implementation operation and Tuning

Select Architecture & Co-Design w/ apps groups apps sizing

3 Copyright 2020 FUJITSU LIMITED Fugaku co-design

◼ Co-design goals ◼ Obtain the best performance, 100x apps performance than , within power budget, 30-40MW • Design applications, compilers, libraries, and hardware ◼ Approach ◼ Estimate perf & power using apps info, performance counts of Fujitsu FX100, and cycle base simulator • Computation time: brief & precise estimation • Communication time: bandwidth and latency for communication w/ some attributes for communication patterns • I/O time: ◼ Then, optimize apps/compilers etc. and resolve bottlenecks ◼ Estimation of performance and power ◼ Precise performance estimation for primary kernels • Make & run Fugaku objects on the Fugaku cycle base simulator ◼ Brief performance estimation for other sections • Replace performance counts of FX100 w/ Fugaku params: # of inst. commit/cycle, wait cycles of barrier, inst. fetch, branch, fp exec, data load/store wait cycles of L1D/L2, etc. ◼ Power estimation • DGEMM execution toggles on the emulator + estimation of memory and interconnect considering utilization + loss of convertors

4 Copyright 2020 FUJITSU LIMITED Co-design iterations around year 2015

◼ Apply each of application kernels ◼ Define/refine a set of architecture parameters ◼ Implement/tune the kernel under the architecture parameters ◼ Evaluate execution time using the estimation tools ◼ Identify hardware bottlenecks and explore design space ◼ Examples of architecture parameters ◼ Frequency, # of arch regs, SIMD width, cache structure & size, # of cores… ◼ Memory, interconnect parameters ◼ Implementation of instructions: i.e. Combined gather…

5 Copyright 2020 FUJITSU LIMITED A64FX CPU

◼ Arm SVE, high performance and high efficiency ◼ DP performance 2.7+ TFLOPS, >90%@DGEMM, (CPU freq=1.8/2.0/2.2 GHz) ◼ Memory BW 1024 GB/s, >80%@STREAM Triad FUGAKU operating cond.

12x compute cores A64FX 1x assistant core PCle Tofu Controller Interface ISA (Base, extension) Armv8.2-A, SVE CMG CMG

Peak DP performance 2.7+ TFLOPS HBM2 HBM2 C C SIMD width 512-bit N # of cores 48 + 4

CMG O CMG Memory capacity 32 GiB (HBM2 x4) HBM2 HBM2 C C C Memory peak bandwidth 1024 GB/s PCIe Gen3 16 lanes CMG:Core Memory Group NOC:Network on Chip High speed interconnect TofuD integrated

6 Copyright 2020 FUJITSU LIMITED A64FX leading-edge Si-technology

◼ TSMC 7nm FinFET & CoWoS TofuD Interface PCIe Interface

◼ HBM2 Broadcom SerDes, HBM I/O, and SRAMs Core Core Core Core Core Core Core Core Core Core ◼ 8.786 billion transistors Interface

L2 L2 interface HBM2 Core Core Core Core ◼ 594 signal pins Cache Cache

Core Core Core Core Core Core RING Core Core Core Core Core Core

- Bus

Core Core Core Core Core Core Core Core Core Core Core Core HBM2Interface L2 L2 Core Core Core Core

Cache Cache HBM2 Interface HBM2 Core Core Core Core Core Core Core Core Core Core

7 Copyright 2020 FUJITSU LIMITED Fujitsu-designed CPU core w/ High Memory Bandwidth

◼ A64FX out-of-order controls in cores, caches, and memories CMG 12x Compute Cores + 1x Assistant Core achieve superior throughput Core Core Core Core 512-bit wide SIMD 2x FMAs 115+ 230+ GB/s GB/s L1D 64KiB, 4way BW and calc. perf. A64FX B/F 57+ 115+ GB/s DP floating perf. (TFlops) 2.7+ - GB/s L2 Cache 8MiB, 16way L1 data cache (TB/s) 11+ 4 256 L2 cache (TB/s) 3.6+ 1.3 GB/s HBM2 8GiB Memory BW (GB/s) 1024 0.37 HBM2HBM2HBM2 8GiB 8GiB 8GiB x4

8 Copyright 2020 FUJITSU LIMITED A64FX optimized load efficiency for apps performance

◼ 128 bytes/cycle sustained bandwidth L1D cache Read data0 Read port0 even for unaligned SIMD load 64B/cycle

Read data1 Read port1 64B/cycle

128B ◼ “Combined Gather” doubles gather Mem. 0 1 3 2 6 7 5 4 (indirect) load’s data throughput, flow-1 flow-4 flow-3 when target elements are within a flow-2 “128-byte aligned block” for a pair of two regs, even & odd Regs 0 1 2 3 4 5 6 7

Suggested through Co-design work w/ app teams 8B Maximizes BW to 32 bytes/cyc. 9 Copyright 2020 FUJITSU LIMITED A64FX Power Knobs to reduce power consumption

◼ “Power knob” limits units’ activities via user APIs ◼ Performance/Watt can be optimized by utilizing Power knobs

Fetch Issue Dispatch Reg-read Execute Cache and Memory Commit

CSE EAGA Fetch EXC Port Frequency reduction L1 inst. Decode RSA EAGB PC PGPR EXD Store cache & Issue L1 data EXA Port cache EXB Control RSE0 Write Registers Branch Buffer Predictor PPR PRX RSE1 EXA pipeline only FLA RSBR PFPR FLB FLA pipeline only L2 cache HBM2 Controller 52 cores Tofu controller PCI Controller HBM2 B/W adjust Decode width: 4 to 2 HBM2 (units of 10%)

10 Copyright 2020 FUJITSU LIMITED Fugaku system software developed with

Fugaku applications Fujitsu Technical Computing Suite / RIKEN-developed system software Management software File system Programming environment System management FEFS XcalableMP, FDPS for high availability & -based Open MPI, MPICH(PiP), DTF power saving distributed file system operation OpenMP, COARRAY, Math. libs. Tensorflow Job management for LLIO Chainer Compilers, AI frameworks higher system NVM-based file I/O Pytorch utilization & power Accelerator for Fugaku efficiency Debug & tuning tools, Containers Docker KVM Multi-Kernel System: RHEL8 and light-weight kernel (IHK/McKernel) Singularity

Fugaku system hardware FDPS: Framework for Developing Particle Simulator Fugaku developed w/ RIKEN PiP: Process-in-Process DTF: Data Transfer Framework IHK: Interface for Heterogeneous Kernels

11 Copyright 2020 FUJITSU LIMITED Outline

◼ Fugaku project overview ◼ Co-design ◼ Approach ◼ Design results ◼ Performance & energy consumption evaluation ◼ Green500 ◼ OSS apps ◼ Fugaku priority issues ◼ Summary

12 Copyright 2020 FUJITSU LIMITED Green500, Nov. 2019

◼ A64FX prototype – ◼ Fujitsu A64FX 48C 2GHz ranked #1 on the list ◼ 768x general purpose A64FX CPU w/o accelerators

• 1.9995 PFLOPS @ HPL, 84.75% • 16.876 GF/W • Power quality level 2

https://www.top500.org/green500/lists/2019/11/

13 Copyright 2020 FUJITSU LIMITED How we achieved

◼ Key for GF/W is {energy efficient HW} x {parallel/exec efficiency} ◼ A64FX is designed for energy efficient ◼ Fujitsu’s proven CPU & 7nm FinFET ◼ SoC design: Tofu interconnect D integrated + Concerted efforts ◼ CoWoS: 4x HBM2 for main memory integrated of co-design ◼ Superior parallel/exec efficiency ◼ Math. libraries are tuned for application efficiency ◼ Comm. libs are also tuned utilizing long experience of Tofu @ K computer ◼ Performance tuning is efficiently done utilizing rich performance analyzer/monitor

14 Copyright 2020 FUJITSU LIMITED SC19 TOP500 calculation efficiency

◼ A64FX superior A64FX prototype efficiency 84.75% 100% with small Nmax 90% 80%

70%

◼ Results of: 60% Efficiency ◼ Optimized 50% communication 40% Challenge and math. libs 30% w/o Acc ◼ Optimization of 20% w/ Acc overlapped 10% communication 0% 0 5,000,000 10,000,000 15,000,000 20,000,000 Nmax

15 Copyright 2020 FUJITSU LIMITED A64FX CPU performance evaluation for real apps

Relative speed up ratio ◼ Open source software, Real 0.5 1.0 1.5 apps on an A64FX @ 2.2GHz OpenFOAM ◼ Up to 1.8x faster over the latest x86 processor (24core, FrontlSTR 2.9GHz) x2 ABINIT ◼ High memory BW and long SALMON SIMD length of A64FX work SPECFEM3D

effectively with these WRF applications MPAS

FUJITSU A64FX x86 (24core, 2.9GHz) x2

16 Copyright 2020 FUJITSU LIMITED A64FX CPU power efficiency for real apps

Relative power efficiency ratio ◼ Performance / Energy 1.0 2.0 3.0 4.0 consumption on an A64FX @ 2.2GHz OpenFOAM ◼ Up to 3.7x more efficient over FrontlSTR the latest x86 processor ABINIT (24core, 2.9GHz) x2 SALMON

◼ High efficiency is achieved by SPECFEM3D

energy-conscious design and WRF implementation MPAS

FUJITSU A64FX x86 (24core, 2.9GHz) x2

17 Copyright 2020 FUJITSU LIMITED Fugaku priority issues and performance prediction https://postk-web.r-ccs.riken.jp/perf.html ◼ 100x app performance and power budget requirements are met ◼ Some apps utilize power knob to reduce power consumption and achieve high energy efficiency ◼ Measuring and optimizing using real HW

18 Copyright 2020 FUJITSU LIMITED Fugaku project schedule

2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022

Fugaku development & delivery

Manufacturing, Apps Basic Detailed design & General Feasibility study Installation review design Implementation operation and Tuning

Co-Design w/ apps groups Select Architecture & apps sizing Apps tuning

Previous estimation w/o HW w/ HW 19 Copyright 2020 FUJITSU LIMITED Current status normalized by the estimation w/o HW

◼ Performance and energy (=power*elapse) of apps

Note: evaluation underway

Faster than prev. est. prev. than Faster

Lower energy than prev. est. than prev. energy Lower

Estimation w/ Estimation HW / w/o HW

Energy Performance

DGEMM Stream A B C D E F

20 Copyright 2020 FUJITSU LIMITED Current status normalized by the estimation w/o HW

• Performances are match Computations were Good estimation and ◼ Performancedue to precise simulation and energy (=power*elapse)reduced after the of appsoptimization: errors are • Lower power 8% in logic & previous estimation less than 20% 15% in memory use; Note: evaluation underway

acceptable variations

Estimation w/ Estimation HW / w/o HW

Energy Performance

DGEMM Stream A B C D E F

21 Copyright 2020 FUJITSU LIMITED Summary

◼ “Fugaku” is co-designed and runs apps at high performance w/ optimal power consumption Fujitsu PRIMEHPC FX1000 ◼ Arm HPC ecosystem and apps portfolio are growing by efforts of communities and selections of HW ◼ Fujitsu Supercomputer PRIMEHPC FX1000 & FX700 based on Fugaku tech. ◼ Cray CS500 using Fujitsu A64FX Arm CPU

FX700 https://www.riken.jp/pr/news/2019/20190827_1/

22 Copyright 2020 FUJITSU LIMITED 23 Copyright 2020 FUJITSU LIMITED