Fugaku: the First ‘Exascale’ Machine

Satoshi Matsuoka, Director, RIKEN R-CCS / Professor, Tokyo Institute of Technology
The 6th ENES HPC Workshop (virtual), 26 May 2020

The “Fugaku” (富岳) “Exascale” for Society 5.0

High Peak (Capability) --- Mt. Fuji's peak representing the ideal of supercomputing; acceleration of large-scale applications
Broad Base --- applicability & capacity
- Broad applications: simulation, data science, AI, …
- Broad user base: academia, industry, cloud, startups, …

A64FX & Fugaku 富岳 / Post-K are:
 A64FX design: Armv8.2-A (SVE), 48/52-core CPU
 HPC optimized: extremely high package memory BW (1 TByte/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 Teraflops), various AI support (FP16, INT8, etc.)
 General-purpose CPU – Windows (Word), other SCs/clouds
 Extremely power efficient – >10x power/perf efficiency for a CFD benchmark over current mainstream x86 CPUs
 Largest and fastest supercomputer ever built, circa 2020:
- >150,000 nodes, superseding LLNL
- >150 PetaByte/s memory BW
- Tofu-D 6D torus network, 60 Petabps injection BW (~10x global IDC traffic)
- 25~30 PB NVMe L1 storage
- Many-endpoint 100 Gbps I/O network
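The "~3 Teraflops" SVE FLOPS figure follows from the core count, clock, and vector width. A back-of-the-envelope check in Python (the 2 GHz normal-mode clock appears later in this deck; the two FMA pipelines per core are an assumption based on published A64FX material):

```python
# Back-of-the-envelope peak FP64 FLOPS for one A64FX CPU (normal mode).
cores = 48              # compute cores (assistant/OS cores excluded)
clock_hz = 2.0e9        # normal-mode frequency, 2 GHz
simd_lanes = 512 // 64  # a 512-bit SVE register holds 8 FP64 lanes
fma_pipes = 2           # assumed: two SVE FMA pipelines per core
flops_per_fma = 2       # a fused multiply-add counts as 2 FLOPs

peak_dp = cores * clock_hz * simd_lanes * fma_pipes * flops_per_fma
print(f"Peak FP64 per CPU: {peak_dp / 1e12:.2f} TFLOPS")  # ~3.07 TFLOPS

# Halving the element width doubles the lane count, so FP32 and FP16
# peaks are 2x and 4x the FP64 figure respectively.
print(f"Peak FP16 per CPU: {4 * peak_dp / 1e12:.2f} TFLOPS")
```

This reproduces the "~3 Teraflops" per-chip figure on the slide and explains the FP16/INT8 AI throughput multipliers quoted later.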

Brief History of R-CCS towards Fugaku

- January 2006: Next Generation Supercomputer Project (K Computer) start
- July 2010: RIKEN AICS established
- August 2010: HPCI Project start
- September 2010: K computer installation start
- 2010: First meeting of SDHPC
- June 2011: #1 on Top 500
- November 2011: #1 on Top 500 (>10 Petaflops); ACM Gordon Bell Award
- End of FY 2011 (March 2012): SDHPC Whitepaper
- April 2012: Post-K Feasibility Study start (3 architecture teams and 1 applications team)
- June 2012: K computer construction complete
- September 2012: K computer production start
- November 2012: ACM Gordon Bell Award
- End of FY 2013 (March 2014): Post-K Feasibility Study Reports
- April 2014: Post-K project start
- June 2014: #1 on Graph 500
- April 2018: AICS renamed to RIKEN R-CCS; Satoshi Matsuoka becomes new Director
- August 2018: Arm A64FX announced at Hot Chips
- October 2018: NEDO 100x processor project start
- November 2018: Post-K manufacturing approval by the Prime Minister's CSTI Committee
- March 2019: Post-K manufacturing start
- May 2019: Post-K named "Supercomputer Fugaku"
- July 2019: Post-Moore Whitepaper start
- August 2019: K Computer shutdown
- December 2019: Fugaku installation start
- April 2020: COVID-19 teams
- May 2020: Fugaku (Post-K) shipment complete

Co-Design Activities in Fugaku

Multiple activities since 2011, coupling Science by Computing with Science of Computing:
- 9 priority app areas of high concern to the general public: medical/pharma, environment/disaster, energy, manufacturing, …
- A64FX, for the Post-K supercomputer

- Science by Computing: select representatives from 100s of applications signifying various computational characteristics
- Science of Computing: design systems with parameters that consider various application characteristics
 Extremely tight collaborations between the co-design apps centers, Riken, Fujitsu, etc.
 Chose 9 representative apps as the "target application" scenario
 Achieve up to 100x speedup c.f. the K Computer
 Also ease of programming, broad SW ecosystem, very low power, …

Research Subjects of the Post-K Computer

The post-K computer will expand the fields pioneered by the K computer, and also challenge new areas.

NICAM: Global Climate Simulation

 Global cloud-resolving model with 0.87 km mesh, which allows resolution of cumulus clouds
 Month-long forecasts of Madden-Julian oscillations in the tropics are realized.

Global cloud resolving model

Miyamoto et al. (2013), Geophys. Res. Lett., 40, 4922–4926, doi:10.1002/grl.50944.

"Big Data Assimilation": NICAM+LETKF

High-precision simulations

Future-generation technologies available 10 years in advance

High-precision observations

Mutual feedback between simulations and observations
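LETKF belongs to the ensemble Kalman filter family. Purely to illustrate the ensemble-Kalman idea behind the NICAM+LETKF assimilation loop — this is a generic stochastic EnKF sketch, not RIKEN's LETKF algorithm, and all names and dimensions are invented:

```python
import numpy as np

def enkf_analysis(X, y, H, r, rng):
    """One stochastic EnKF update (illustrative only).
    X: (n, N) ensemble of model states, y: (m,) observation vector,
    H: (m, n) observation operator, r: observation error std dev."""
    n, N = X.shape
    A = X - X.mean(axis=1, keepdims=True)       # ensemble anomalies
    HA = H @ A
    P_HT = A @ HA.T / (N - 1)                   # P H^T estimated from the ensemble
    S = HA @ HA.T / (N - 1) + r**2 * np.eye(len(y))
    K = P_HT @ np.linalg.inv(S)                 # Kalman gain
    # Perturbed observations keep the analysis spread statistically consistent
    Y = y[:, None] + rng.normal(0.0, r, size=(len(y), N))
    return X + K @ (Y - H @ X)

rng = np.random.default_rng(0)
truth = np.array([1.0, -2.0, 0.5])               # toy "true" state
H = np.eye(3)                                    # direct observation of the state
X = truth[:, None] + rng.normal(0.0, 1.0, size=(3, 40))   # background ensemble
y = truth + rng.normal(0.0, 0.1, 3)              # accurate observation
Xa = enkf_analysis(X, y, H, 0.1, rng)
print("background error:", np.linalg.norm(X.mean(1) - truth))
print("analysis error:  ", np.linalg.norm(Xa.mean(1) - truth))
```

With an accurate observation, the analysis ensemble collapses toward it: the mean moves close to the truth and the spread shrinks, which is the "mutual feedback" loop between simulation and observation in schematic form.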

Fugaku's Fujitsu A64FX Processor is…

 a many-core Arm CPU…
- 48 compute cores + 2 or 4 assistant (OS) cores
- Brand-new core design
- Near …-class integer performance core
- Armv8.2-A --- 64-bit Arm ecosystem
- Tofu-D + PCIe Gen3 external connections
 …but also an accelerated, GPU-like processor
- SVE 512-bit x 2 vector extensions (Arm & Fujitsu)
- Integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
- Cache + memory localization (sector cache)
- HBM2 on-package memory – massive memory BW (Bytes/DPF ~0.4)
- Streaming memory access, strided access, scatter/gather, etc.
- Intra-chip barrier synchronization and other memory-enhancing features
 GPU-like high performance in HPC, especially CFD / weather & climate (even with traditional Fortran code), plus AI/Big Data
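Among the memory features above, scatter/gather acceleration targets the indirect-indexing access pattern typical of unstructured-grid codes. An illustrative NumPy rendering of the pattern (array names invented for the example):

```python
import numpy as np

# Unstructured-mesh style indirection: values live in a flat array and are
# addressed through an index list rather than sequentially.
values = np.arange(10.0)                 # node data
idx = np.array([7, 2, 9, 0])             # connectivity / permutation (unique indices)

gathered = values[idx]                   # gather: load from scattered addresses
print(gathered)                          # [7. 2. 9. 0.]

values[idx] += 1.0                       # scatter: store back to scattered addresses

# Strided access (every 2nd element) is another pattern SVE assists in hardware.
strided = values[::2]
```

On a vector machine, each of these lines maps onto a single predicated gather/scatter or strided load instruction instead of a scalar loop, which is why hardware support matters for CFD-style codes.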

"Fugaku" CPU Performance Evaluation (2/3)

 Himeno benchmark (Fortran90)
 Stencil calculation solving Poisson's equation by the Jacobi method
[Chart: GFlops comparison †]


† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

Post-K Activities, ISC19, Frankfurt. Copyright 2019 FUJITSU LIMITED
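The Himeno kernel is a 19-point Jacobi stencil. A stripped-down 2D analogue of the Jacobi method for Poisson's equation, purely to show the bandwidth-bound stencil structure being benchmarked (grid size and iteration count are arbitrary choices for the sketch):

```python
import numpy as np

def jacobi_poisson_2d(f, h, iters=2000):
    """Jacobi iterations for del^2 u = f on a square grid, u = 0 on the boundary.
    Each sweep replaces every interior point by the average of its four
    neighbours minus a source term -- the memory-bandwidth-bound stencil
    pattern the Himeno benchmark measures."""
    u = np.zeros_like(f)
    for _ in range(iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                + u[1:-1, :-2] + u[1:-1, 2:]
                                - h * h * f[1:-1, 1:-1])
    return u

n = 33
h = 1.0 / (n - 1)
f = np.full((n, n), -1.0)        # uniform source: -del^2 u = 1
u = jacobi_poisson_2d(f, h)
print(f"max of solution: {u.max():.4f}")   # analytic max is ~0.0737 at the centre
```

Each grid point is read ~5 times and written once per sweep with almost no arithmetic per byte, which is why Himeno performance tracks memory bandwidth (and hence favours A64FX's HBM2).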

"Fugaku" CPU Performance Evaluation (3/3)

 WRF: Weather Research and Forecasting model
 Vectorizing loops that include IF-constructs is the key optimization
 Source-code tuning using directives promotes compiler optimizations
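"Vectorizing loops including IF-constructs" means converting per-element branches into masked, branch-free operations — which SVE supports directly through predicate registers. A NumPy analogue of that transformation, using a made-up clamping condition rather than actual WRF code:

```python
import numpy as np

t = np.array([-5.0, 3.0, 120.0, 47.0])

# Scalar-style loop with a branch in the body -- hard for a compiler to
# vectorize as written.
def clamp_loop(x, lo=0.0, hi=100.0):
    out = np.empty_like(x)
    for i in range(len(x)):
        if x[i] < lo:
            out[i] = lo
        elif x[i] > hi:
            out[i] = hi
        else:
            out[i] = x[i]
    return out

# Branch-free, mask-based equivalent: the IF becomes element-wise selection,
# which maps onto SVE predicated instructions.
def clamp_masked(x, lo=0.0, hi=100.0):
    return np.where(x < lo, lo, np.where(x > hi, hi, x))

assert np.array_equal(clamp_loop(t), clamp_masked(t))
```

Compiler directives of the kind mentioned above essentially promise the compiler that this rewrite is safe, letting it emit the masked form automatically.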


Fugaku system configuration

 Scalable design: CPU → CMU → BoB → Shelf → Rack → System

Unit | # of nodes | Description
CPU | 1 | Single-socket node with HBM2 & Tofu-D interconnect
CMU | 2 | CPU Memory Unit: 2x CPU
BoB | 16 | Bunch of Blades: 8x CMU
Shelf | 48 | 3x BoB
Rack | 384 | 8x Shelf
System | 150k+ | As the Fugaku system

Fugaku Total System Config & Performance

 Total # nodes: 158,976
- 384 nodes/rack x 396 (full) racks = 152,064 nodes
- 192 nodes/rack x 36 (half) racks = 6,912 nodes
- c.f. K Computer: 88,128 nodes
 Theoretical peak compute performance
- Normal mode (CPU frequency 2.0 GHz):
  - 64-bit double-precision FP: 488 Petaflops
  - 32-bit single-precision FP: 977 Petaflops
  - 16-bit half-precision FP (AI training): 1.95 Exaflops
  - 8-bit integer (AI inference): 3.90 Exaops
- Boost mode (CPU frequency 2.2 GHz):
  - 64-bit double-precision FP: 537 Petaflops
  - 32-bit single-precision FP: 1.07 Exaflops
  - 16-bit half-precision FP (AI training): 2.15 Exaflops
  - 8-bit integer (AI inference): 4.30 Exaops
 Theoretical peak memory bandwidth: 163 Petabytes/s
 Comparison vs. K Computer (boost mode):
- 64-bit double precision: 48x (K peak: 11.28 PF for all FP precisions)
- 32-bit single precision: 95x
- 16-bit half precision (AI training): 190x
- 8-bit integer (AI inference): >1,500x (K peak: 2.82 Petaops, via 64-bit units)
- Memory bandwidth: 29x (K peak: 5.64 Petabytes/s)
(Image courtesy of Fujitsu Limited)

Fugaku is a Year's Worth of IT in Japan

Power comparison:
- Smartphones: 20 million units (2/3 of annual shipments in Japan) = 10 W x 20 million ≈ 200 MW
- IDC servers incl. clouds: 300,000 units (2/3 of annual shipments in Japan) = 600–700 W x 300K ≈ 200 MW (incl. cooling)
- Fugaku: 30 MW
- K Computer: 15 MW
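The node counts and peak figures quoted above are internally consistent. A quick Python check using the packaging hierarchy (CPU → CMU → BoB → shelf → rack) and an assumed per-node FP64 peak of 3.072 TFLOPS (48 cores x 2 GHz x 32 FLOPs/cycle):

```python
# Node count from the packaging hierarchy on the configuration slide.
nodes_per_rack = 2 * 8 * 3 * 8     # CPUs/CMU x CMUs/BoB x BoBs/shelf x shelves/rack = 384
full = 396 * nodes_per_rack        # 152,064 nodes in full racks
half = 36 * (nodes_per_rack // 2)  # 6,912 nodes in half (192-node) racks
total = full + half
print(total)                       # 158976

# System peak from an assumed 3.072 TFLOPS FP64 per node in normal mode.
node_dp_tflops = 3.072
system_dp_pflops = total * node_dp_tflops / 1000
print(f"normal-mode DP: {system_dp_pflops:.0f} PF")        # ~488 PF, as on the slide
# Boost mode (2.2 GHz) scales every peak by 1.1.
print(f"boost-mode DP:  {system_dp_pflops * 1.1:.0f} PF")  # ~537 PF
```

The FP32, FP16, and INT8 peaks follow by successive doubling of lanes per element-width halving, reproducing the 977 PF / 1.95 EF / 3.90 Exaops ladder.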

Comparison (Smartphones | IDC servers incl. clouds | Fugaku | K Computer):
- CPU ISA: Arm | x86/Arm | Arm | SPARC
- System SW: iOS/Android Linux | Linux (Red Hat etc.)/Windows | Linux (Red Hat etc.) | Proprietary Linux
- Acceleration: low-generality AI custom ASIC (inference only) | GPU | general-purpose CPU acceleration (e.g. SVE instructions) | None

Green500, Nov. 2019

 A64FX prototype – Fujitsu A64FX 48C 2GHz – ranked #1 on the list
 768 general-purpose A64FX CPUs, w/o accelerators
 1.9995 PFLOPS @ HPL, 84.75% efficiency
 16.876 GF/W
 Power quality level 2

FUJITSU CONFIDENTIAL. Copyright 2019 FUJITSU LIMITED

SC19 Green500 ranking and first-appeared TOP500 edition

 "A64FX prototype", the prototype of Supercomputer Fugaku, ranked #1 at 16.876 GF/W
 Approx. 3x better GF/W than the 2nd-ranked system (CPUs w/ accelerators)
 The first system w/o accelerators ranked #1 in 9.5 years
 Top Xeon machine: #37, 5.843 GF/W; Endeavor: #94, 2.961 GF/W
[Charts: GF/W vs. Green500 rank, and GF/W vs. first-appeared TOP500 edition]

A64FX CPU performance evaluation for real apps

 Open-source real applications run on an A64FX @ 2.2 GHz: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
 Up to 1.8x faster than the latest x86 processor (24-core, 2.9 GHz) x 2 sockets, i.e. 3.6x per socket
 The high memory bandwidth and long SIMD length of the A64FX work effectively with these applications
[Chart: relative speedup ratio, FUJITSU A64FX vs. x86 (24-core, 2.9 GHz) x 2]

A64FX CPU power efficiency for real apps

 Performance per energy consumption on an A64FX @ 2.2 GHz: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
 Up to 3.7x more efficient than the latest x86 processor (24-core, 2.9 GHz) x 2
 High efficiency is achieved by energy-conscious design and implementation
[Chart: relative power efficiency ratio, FUJITSU A64FX vs. x86 (24-core, 2.9 GHz) x 2]

Fugaku Performance Estimate on 9 Co-Design Target Apps

 Performance target goals and longevity:
- 100 times faster than K for some applications (tuning included)
- 30 to 40 MW power consumption
 Peak performance to be achieved (c.f. K Computer):
- Peak DP (double precision): >400 Pflops (34x+ over K's 11.3 Pflops)
- Peak SP (single precision): >800 Pflops (70x+ over K's 11.3 Pflops)
- Peak HP (half precision): >1600 Pflops (141x+)
- Total memory bandwidth: >150 PB/s (29x+ over K's 5,184 TB/s)

Priority issue area | Application | Brief description | Speedup over K

Health
1. Innovative computing infrastructure for drug discovery | GENESIS | MD for proteins | 125x+
2. Personalized and preventive medicine using big data | Genomon | Genome processing (genome alignment) | 8x+

Environment / disaster prevention
3. Integrated simulation systems of hazards induced by earthquake and tsunami | GAMERA | Earthquake simulator (FEM on unstructured & structured grids) | 45x+
4. Meteorological and global environmental prediction using big data | NICAM+LETKF | Weather prediction system using big data (structured-grid stencil & ensemble Kalman filter) | 120x+

Energy
5. New technologies for energy creation, conversion/storage, and use | NTChem | Molecular electronic simulation (structure calculation) | 40x+
6. Accelerated development of innovative clean energy systems | Adventure | Computational mechanics system for large-scale analysis and design (unstructured grid) | 35x+

Industrial competitiveness enhancement
7. Creation of new functional devices and high-performance materials | RSDFT | Ab initio simulation (density functional theory) | 30x+
8. Development of innovative design and production processes | FFB | Large-eddy simulation (unstructured grid) | 25x+

Basic science
9. Elucidation of the fundamental laws and evolution of the universe | LQCD | Lattice QCD simulation (structured-grid Monte Carlo) | 25x+

 Geometric mean of the performance speedups of the 9 target applications over the K Computer: >37x (as of 2019/05/14)
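The ">37x" geometric-mean figure can be reproduced directly from the per-application speedup targets in the table:

```python
from math import prod

# "Speedup over K" targets for the 9 co-design applications.
speedups = {
    "GENESIS": 125, "Genomon": 8, "GAMERA": 45, "NICAM+LETKF": 120,
    "NTChem": 40, "Adventure": 35, "RSDFT": 30, "FFB": 25, "LQCD": 25,
}
geo_mean = prod(speedups.values()) ** (1 / len(speedups))
print(f"geometric mean speedup: {geo_mean:.1f}x")   # ~37x
```

The geometric mean is used rather than the arithmetic mean so that a single outlier (GENESIS at 125x) does not dominate the headline figure.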

Fugaku / Fujitsu FX1000 System Software Stack

- OS: Red Hat Enterprise Linux 8; RHEL kernel + optional lightweight kernel (McKernel)
- Fugaku AI (DL4Fugaku) – RIKEN: Chainer, PyTorch, TensorFlow, DNNL, …
- Live data analytics: Apache Flink, Kibana, …
- ~3,000 apps supported by Spack (LLNL's open-source package manager); most applications work with a simple recompile from an x86/RHEL environment, and Spack automates this
- Math libraries – Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II; RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, …
- Cloud software stack: OpenStack, Kubernetes, NEWT, …
- Compilers and scripting languages: Fortran, C/C++, OpenMP, Java, Python, … (multiple compilers supported: Fujitsu, Arm, GNU, LLVM/Clang, PGI, …)
- Batch job and management system; object store (S3-compatible); hierarchical file system
- Tuning and debugging tools – Fujitsu: profiler, debugger, GUI
- High-level and domain-specific programming: XMP, FDPS
- Communication: Fujitsu MPI, RIKEN MPI; low-level communication: uTofu, LLC
- File I/O: DTF; file I/O for hierarchical storage: Lustre/LLIO
- Process/thread: PIP
- Virtualization & containers: KVM, Singularity

2020/01/27, RIKEN Center for Computational Science

New PRIMEHPC Lineup

- PRIMEHPC FX1000: supercomputer optimized for large-scale computing – superior power efficiency, high scalability, high density
- PRIMEHPC FX700: supercomputer based on standard technologies – ease of use, easy installation

The HPE/Cray CS500 – Fujitsu A64FX Arm Server

 Cray Fujitsu Technology Agreement
 Supported in Cray CS500 infrastructure
 Full Cray Programming Environment
 Leadership performance for many memory-intensive HPC applications, e.g., weather
 GA in mid-2020
 A number of adoptions – US: Stony Brook, DoE Labs, etc.; multiple yet-to-be-named EU centers

Fugaku Deployment Status (Apr. 2020)

 Pipelined manufacturing, installation, and bring-up; first rack shipped on Dec 3, 2019
 All racks on the floor May 13, 2020(!)
 2020 early users, incl. COVID-19 apps running already
 Open to international users through HPCI; general allocation April 2021, applications starting Sept. 2020 (does not need to involve a Japanese PI)
 Also some internal test nodes (Apr. 2020) and allocations (Apr. 2021) are available for R-CCS

Massive Scale Deep Learning on Fugaku

The Fugaku processor is AI/DL-ready:
 High-performance FP16 & INT8; high memory BW for convolution
 High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM)
 High-performance, ultra-scalable network: built-in scalable Tofu network for massive-scale model & data parallelism; high BW for fast reductions
 Unprecedented DL scalability: ultra-scalability of data/model parallelism, shown to 20,000 nodes; HPL-AI to beyond exaflops

Large Scale Public AI Infrastructures in Japan

System | Deployed | Purpose | AI processor | Inference peak (FP16) | Training peak (FP32/FP16) | HPL-AI perf | Top500 perf/rank | Green500 perf/rank
Tokyo Tech TSUBAME3 (public) | July 2017 | HPC + AI | NVIDIA P100 x 2160 | 45.8 PF | 22.9 PF / 45.8 PF | – | 8.125 PF #22 | 13.704 GF/W #8
U-Tokyo Reedbush-H/L (public) | Apr. 2018 (update) | HPC + AI | NVIDIA P100 x 496 | 10.71 PF | 5.36 PF / 10.71 PF | – | (unranked) | (unranked)
U-Kyushu ITO-B (public) | Oct. 2017 | HPC + AI | NVIDIA P100 x 512 | 11.1 PF | 5.53 PF / 11.1 PF | – | (unranked) | (unranked)
AIST-AIRC AICC (lab only) | Oct. 2017 | AI | NVIDIA P100 x 400 | 8.64 PF | 4.32 PF / 8.64 PF | – | (unranked) | (unranked)
Riken-AIP Raiden (lab only) | Apr. 2018 (update) | AI | NVIDIA V100 x 432 | 54.0 PF | 6.40 PF / 54.0 PF | – | 1.213 PF #462 | (unranked)
AIST-AIRC ABCI (public) | Aug. 2018 | AI | NVIDIA V100 x 4352 | 544.0 PF | 65.3 PF / 544.0 PF | – | 19.88 PF #8 | 14.423 GF/W #6
NICT & Sakura Internet (lab only) | Summer 2019 | AI | NVIDIA V100 x 1700 | ~210 PF | ~26 PF / ~210 PF | – | 4.128 PF #51, 3.712 PF #58 | (unranked)
C.f. US ORNL Summit (public) | Summer 2018 | HPC + AI | NVIDIA V100 x 27,000 | 3,375 PF | 405 PF / 3,375 PF | 445 PF (FP16) | 143.5 PF #1 | 14.719 GF/W #5
Riken R-CCS Fugaku (public) | 2020~2021 | HPC + AI | Fujitsu A64FX x >150,000 (equiv. ~100K GPUs) | >4,000 PO (INT8) | >1,000 PF / >2,000 PF | >1,000 PF (FP16) | >400 PF #1 (2020?) | 16.876 GF/W #1 (prototype)

[Chart annotations: 838.5 PF training; 86.9 PF; other systems ≈ 1/4 (inference) and 1/5 (training) of Fugaku; Fugaku HPL-AI x2 ~ x4 of Summit]

Fujitsu-Riken-Arm joint effort on AI framework development on SVE/A64FX

 MOU signed between Fujitsu and Riken, Nov. 25, 2019; also with Arm
 Optimization levels: Step 1 (Arm DNNL, with Xbyak, Eigen + libraries) → Step 2 → Step 3 (DL frameworks: PyTorch, TensorFlow)
 1st release May 2020
- First version optimizes the A64FX processor for inference
- Next version: training optimization

Exaops of Sim, Data, and AI on Fugaku and the Cloud

Cloud Service Providers Partnership
https://www.r-ccs.riken.jp/library/topics/200213.html (in Japanese)


Action items:
• Cool project name and logo!
• Trial methods to provide Fugaku computing resources to end users via service providers
• Evaluate the effectiveness of the methods as quantitatively as possible and organize the issues

COVID-19: Fugaku Early Production & HPCI

 Supercomputer Fugaku used to help fight against COVID-19

 https://www.r-ccs.riken.jp/en/topics/fugaku-coronavirus.html

 Production a year ahead of schedule

 Max availability in April~May: 72 racks (80 Petaflops), 1/6 of the full system

 More to be added after June

 Public call by MEXT and Riken – fast track, bypassing the normal allocation procedure; priority allocation of massive resources with extensive R-CCS support

 Must work closely w/Riken R-CCS and other COVID19 groups

 HPCI Tier-2 COVID19 resource allocations

 https://www.hpci-office.jp/pages/e_hpci_covid19

 Most HPCI Tier-2 resources; more variety (e.g., GPUs), less capacity, and fewer restrictions c.f. Fugaku

 Goes through an accelerated peer-review process by RIST

 Both are available to international research groups