Supercomputer “” Formerly known as Post-K

0 Copyright 2019 LIMITED Supercomputer Fugaku Project

RIKEN and Fujitsu are currently developing Japan's next-generation flagship supercomputer, the successor to the , as the most advanced general-purpose supercomputer in the world

PRIMEHPC FX100 PRIMEHPC FX10 No.1(2017) Finalist(2016) No.1(2018) K computer Supercomputer Fugaku

©

 RIKEN and Fujitsu announced that manufacturing started in March 2019

 RIKEN announced on May 23, 2019 that the supercomputer is named*Formerly “Fugaku known as” Post-K

1 Copyright 2019 FUJITSU LIMITED Goals and Approaches for Fugaku  Goals

High application Good usability Keeping application performance and wide range of uses compatibility

RIKEN announced predicted performance:  More than 100x+ faster than K computer for GENESIS and NICAM+LETKF  Geometric mean of speedup over K computer in 9 Priority Issues is greater than 37x+ https://postk-web.r-ccs.riken.jp/perf.html

2 Copyright 2019 FUJITSU LIMITED Goals and Approaches for Fugaku  Goals

High application Good usability Keeping application performance and wide range of uses compatibility

 Approaches Develop Achieve 1. High-performance Arm CPU A64FX in HPC and AI areas - High performance in real applications 2. Cutting-edge hardware design - High efficiency in key features for AI 3. System software stack applications

3 Copyright 2019 FUJITSU LIMITED 1. High-Performance Arm CPU A64FX in HPC and AI Areas  Architecture features

CMG (Core Memory Group) ISA Armv8.2-A (AArch64 only) SVE (Scalable Vector Extension) specification TofuD 13 cores 28 Gbps x 2 lanes x 10 ports SIMD width 512-bit L2 Cache 8 MiB I/O Memory 8 GiB, 256 GB/s PCIe Gen3 16 lanes Precision FP64/32/16, INT64/32/16/8 Cores 48 computing cores + 4 assistant cores (4 CMGs) Memory HBM2: Peak B/W 1,024 GB/s TofuD PCIe Controller Controller Interconnect TofuD: 28 Gbps x 2 lanes x 10 ports HBM2 HBM2  Peak performance (Chip level)

(TOPS) HPC AI

HBM2 Network on Chip HBM2 25 21.6+ A64FX (Fugaku) 20 SPARC64 VIIIfx (K computer) 15 10.8+ 10 5.4+ 2.7+ 5 0.128 0.128 N/A N/A 0 64 bits 32 bits 16 bits 8 bits (Element size)

4 Copyright 2019 FUJITSU LIMITED 1. High-Performance Arm CPU A64FX in HPC and AI Areas  Architecture features

ISA Armv8.2-A (AArch64 only) SVE (Scalable Vector Extension) TofuD Interface PCIe Interface SIMD width 512-bit HBM2 Precision FP64/32/16, INT64/32/16/8 Core Core Core Core Core Core Core Core Core Core

Interface Cores 48 computing cores + 4 assistant cores (4 CMGs) Memory HBM2: Peak B/W 1,024 GB/s L2 L2

Core Core Core Core Interface HBM2 Cache Cache Interconnect TofuD: 28 Gbps x 2 lanes x 10 ports

Core Core Core Core Core Core RingBus Core Core Core Core Core Core  Peak performance (Chip level)

(TOPS) HPC AI Core Core Core Core Core Core Core Core Core Core Core Core 25 21.6+ A64FX (Fugaku) 20 HBM2Interface SPARC64 VIIIfx (K computer) L2 L2 Core Core Core Core 15 Cache Cache 10.8+ 10 5.4+ 5 2.7+ Core Core Core Core Core Core Core Core Core Core HBM2Interface 0.128 0.128 N/A N/A 0 64 bits 32 bits 16 bits 8 bits (Element size)

5 Copyright 2019 FUJITSU LIMITED [Flops] Performance Nodes   2. Cutting Footprint Nodes Configuration Scalable design Scalable 1PFlopsby 2.7 T+ CPU 1 5.4 T+ CMU 2 - edge Fugaku and 43 T+ 43 BoB 16 Hardware Design 129 T+ 129 Shelf 1.1 m 1x 48 rackincluding SSDs 2 K Fugaku (0.8 mx1.4 m) 384 computer 1 384 Rack P+ Fugaku 400 P+ 400 150k+ 6     water and electricalsignals mate Single O QSFP forAOC(Active3x cooling direct100% water CMU(CPU Memory Unit) Memory CMU(CPU pticalCables) connectorsfor - sided 80x compute racks & 20x disk racks disk compute& 20x racks blind © 128 m 128 RIKEN K 2 8,160 (4 m x (4 m computer IN 32 m) 32 OUT AOC QSFP28 (Z) Copyright 2019 FUJITSU LIMITED FUJITSU 2019 Copyright AOC QSFP28 (Y) AOC QSFP28 (X) connector PCIe coupler Water cables TofuD connector TofuD 3. System Software Stack

 Fujitsu developing system software in collaboration with RIKEN  Fujitsu Technical Computing Suite implementing development and execution environments with great usability on large-scale system

Fugaku applications Fujitsu Technical Computing Suite / RIKEN developing system software Management software File system Programming environment

System management FEFS XcalableMP MPI for high availability & power -based distributed file (Open MPI, MPICH) saving operation system OpenMP, Compilers Job management for higher Coarray (C, C++, Fortran) LLIO system utilization & power NVM-based file I/O accelerator Debugging and efficiency tuning tools Math. libs.

Linux OS / McKernel (Lightweight kernel) Fugaku system hardware

Fugaku Under development w/ RIKEN 7 Copyright 2019 FUJITSU LIMITED 3. System Software Stack

 Fujitsu developing system software in collaboration with RIKEN  Fujitsu Technical Computing Suite implementing development and execution environments with great usability on large-scale system

Fugaku applications Fujitsu Technical Computing Suite / RIKEN developing system software Management software File system Programming environment  Exploits hardware performance by compiler System management FEFS XcalableMPXcalableMP* MPI for optimizationshigh availability & power such as SVELustre vectorization-based distributed file (Open MPI, MPICH) saving operation system OpenMP, Compilers Job management for higher COARRAYCoarray (C, C++, Fortran)  Supports new programming languageLLIO system utilization & power Math. libs. NVM-based file I/O accelerator Debugging and standardsefficiency and data type FP16 tuning tools Math. libs.

Linux OS / McKernel (Lightweight kernel) Fugaku system hardware

Fugaku Under development w/ RIKEN 8 Copyright 2019 FUJITSU LIMITED Achieve High Performance in Real Application

 WRF: Weather Research and Forecasting model (v3.8.1)  Vectorizing loops including IF statements is key optimization High memory B/W and long SIMD length work  Himeno Benchmark (Fortran90, size: XL) effectively  Stencil calculation to solve Poisson’s equation by Jacobi method

WRF v3.8.1 (48-hour,12km, CONUS) on 48 cores Himeno Benchmark (Fortran90)

2 400 346 1.56 286 305 1.5 300 1.00 1 200

GFlops 85 103

0.5 100

Performance Ratio * Ratio Performance Higher is Higher better 0 is Higher better 0 Skylake A64FX Skylake( FX100 A64FX SX-Aurora Tesla V100 (Xeon Platinum 8168) 1 CPU Platinum 8168) 1CPU 1CPU TSUBASA 1GPU† 2 CPUs with source tuning 2CPUs 1VE† †Performance evaluation of a vector supercomputer SX-aurora TSUBASA * Normalized by the average elapsed time for timestep of Skylake https://dl.acm.org/citation.cfm?id=3291728

9 Copyright 2019 FUJITSU LIMITED Achieve High Efficiency in Key Features for AI Applications

 High FP16 & INT8 peak performance and high memory peak B/W

INT8 partial dot product

FP16 performance: 10.8+ TOPS, > 90%@HGEMM 8-bit 8-bit 8-bit 8-bit

A0 A1 A2 A3 X X X X INT8 performance : 21.6+ TOPS in partial dot product B0 B1 B2 B3

C Memory B/W : 1,024 GB/s, > 80%@STREAM Triad 32-bit

 Functions contributing to key features in AI fields

• 2x 512-bit wide SIMD pipelines per core for FP16 and INT8 A64FX CPU • High memory B/W and calculation throughput

• Vectorization and software pipelining Compilers & libraries • FP16 as data type of programming language (e.g., real (kind=2) in Fortran) • Mathematical Library for HGEMM

10 Copyright 2019 FUJITSU LIMITED Future Plans Supercomputer Fugaku Today  Operations starting around CY2021

CY2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 Manufacturing, Basic Design Implementation Installation Operations Fugaku Design and Tuning

Development Operations Operations ending Aug, 2019 © RIKEN K computer

Fujitsu HPC Products  Fujitsu will begin global sales of based on the Supercomputer Fugaku technology in the 2nd half of FY2019

11 Copyright 2019 FUJITSU LIMITED 12