Fugaku: the First ‘Exascale’ Machine
Satoshi Matsuoka, Director, RIKEN R-CCS / Professor, Tokyo Institute of Technology. The 6th ENES HPC Workshop (virtual), 26 May 2020
The “Fugaku” 富岳 “Exascale” Supercomputer for Society 5.0

Mt. Fuji, representing the ideal of supercomputing:
- High peak (capability): acceleration of large-scale applications
- Broad base (applicability & capacity): broad applications (simulation, data science, AI, …) and a broad user base (academia, industry, cloud startups, …), for Society 5.0

A64FX & Fugaku 富岳 / Post-K are:
- Fujitsu-Riken design: A64FX, an Armv8.2 (SVE) 48/52-core CPU
- HPC-optimized: extremely high in-package memory BW (1 TByte/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 Teraflops), various AI support (FP16, INT8, etc.)
- General-purpose CPU: Linux, Windows (Word), other SCs/clouds
- Extremely power-efficient: >10x power/perf efficiency for a CFD benchmark over current mainstream x86 CPUs
- Largest and fastest supercomputer ever built circa 2020: >150,000 nodes (superseding LLNL Sequoia), >150 PetaByte/s memory BW, Tofu-D 6D torus network with 60 Petabps injection BW (10x global IDC traffic), 25-30 PB NVMe L1 storage, many-endpoint 100 Gbps I/O network into Lustre
Brief History of R-CCS towards Fugaku

- January 2006: Next Generation Supercomputer Project (K Computer) start
- July 2010: RIKEN AICS established
- August 2010: HPCI Project start
- September 2010: K computer installation start; first meeting of SDHPC
- June 2011: #1 on Top500
- November 2011: #1 on Top500 with >10 Petaflops; ACM Gordon Bell Award
- End of FY2011 (March 2012): SDHPC Whitepaper
- April 2012: Post-K Feasibility Study start (3 Arch teams and 1 Apps team)
- June 2012: K computer construction complete
- September 2012: K computer production start
- November 2012: ACM Gordon Bell Award
- End of FY2013 (March 2014): Post-K Feasibility Study reports
- April 2014: Post-K project start
- June 2014: #1 on Graph500
- April 2018: AICS renamed to RIKEN R-CCS; Satoshi Matsuoka becomes new Director
- August 2018: Arm A64FX announced at Hot Chips
- October 2018: NEDO 100x processor project start
- November 2018: Post-K manufacturing approval by the Prime Minister's CSTI Committee
- March 2019: Post-K manufacturing start
- May 2019: Post-K named "Supercomputer Fugaku"
- July 2019: Post-Moore Whitepaper start
- August 2019: K Computer shutdown
- December 2019: Fugaku installation start
- April 2020: COVID-19 teams
- May 2020: Fugaku shipment complete

Co-Design Activities in Fugaku
Multiple activities since 2011, spanning Science by Computing and Science of Computing:
- 9 Priority App Areas of high concern to the general public: medical/pharma, environment/disaster, energy, manufacturing, …
- A64FX: the CPU designed for the Post-K supercomputer
- Select representative applications from 100s of applications signifying various computational characteristics
- Design systems with parameters that consider various application characteristics
- Extremely tight collaboration between the co-design app centers, Riken, Fujitsu, etc.
- Chose 9 representative apps as the "target application" scenario
- Achieve up to 100x speedup c.f. the K computer
- Also ease of programming, a broad SW ecosystem, very low power, …

Research Subjects of the Post-K Computer
The Post-K computer will expand the fields pioneered by the K computer, and also challenge new areas.
NICAM: Global Climate Simulation
Global cloud-resolving model with 0.87 km mesh, which allows resolution of cumulus clouds. Month-long forecasts of Madden-Julian oscillations in the tropics have been realized.
Global cloud resolving model
Miyamoto et al. (2013), Geophys. Res. Lett., 40, 4922-4926, doi:10.1002/grl.50944.

"Big Data Assimilation": NICAM+LETKF
- High-precision simulations: future-generation technologies available 10 years in advance
- High-precision observations
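LETKF couples the simulation to the observations through an ensemble Kalman filter update. As a rough illustration of the analysis step only, here is a simplified stochastic EnKF with a linear observation operator (this is not the localized, transform-based LETKF algorithm itself; all sizes and values below are made up for the sketch):

```python
import numpy as np

def enkf_update(X, y, H, r, rng):
    """Stochastic ensemble Kalman filter analysis step.

    X : (n, m) ensemble of model states (m members)
    y : (p,) observation vector
    H : (p, n) linear observation operator
    r : observation error variance (R = r * I)
    """
    n, m = X.shape
    xm = X.mean(axis=1, keepdims=True)        # ensemble mean
    A = X - xm                                 # anomalies
    P = A @ A.T / (m - 1)                      # sample covariance
    S = H @ P @ H.T + r * np.eye(len(y))       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    # perturbed observations, one per member
    Y = y[:, None] + rng.normal(0.0, np.sqrt(r), size=(len(y), m))
    return X + K @ (Y - H @ X)                 # analysis ensemble

rng = np.random.default_rng(0)
# 3-variable state, 200 members; truth = [1, 2, 3], observe the first two
truth = np.array([1.0, 2.0, 3.0])
X = truth[:, None] + rng.normal(0.0, 1.0, size=(3, 200))   # prior spread 1.0
H = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = H @ truth
Xa = enkf_update(X, y, H, r=0.01, rng=rng)
# the analysis mean of the observed variables is pulled close to y
print(np.abs(Xa[:2].mean(axis=1) - y).max())
```

With accurate observations (small r), the gain is close to 1 for the observed variables and the analysis ensemble collapses toward the observation, which is the feedback loop the slide describes.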
- Mutual feedback between simulations and observations

Fugaku's Fujitsu A64FX Processor is…
…a many-core Arm CPU:
- 48 compute cores + 2 or 4 assistant (OS) cores
- Brand-new core design with near Xeon-class integer performance
- Arm v8, 64-bit: the Arm ecosystem
- Tofu-D + PCIe 3 external connections

…but also an accelerated, GPU-like processor:
- SVE 512-bit x 2 vector extensions (Arm & Fujitsu): integer (1, 2, 4, 8 bytes) and float (16, 32, 64 bits)
- Cache + memory localization (sector cache)
- HBM2 on-package memory: massive memory BW (Bytes/DPF ~0.4)
- Streaming memory access, strided access, scatter/gather, etc.
- Intra-chip barrier synchronization and other memory-enhancing features

GPU-like high performance in HPC, especially CFD and weather & climate (even with traditional Fortran code), plus AI/Big Data.

"Fugaku" CPU Performance Evaluation (2/3)
Himeno Benchmark (Fortran 90): a stencil calculation that solves Poisson's equation by the Jacobi method (results in GFlops).
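To make the kernel concrete: a minimal 2-D sketch of the kind of computation Himeno measures, Jacobi relaxation of a Poisson problem on a regular grid (the real benchmark is 3-D with variable coefficients; this simplified analogue shows why the kernel is memory-bandwidth-bound):

```python
import numpy as np

def jacobi_poisson(f, h, iters):
    """Jacobi iterations for the 2-D Poisson equation -∇²u = f
    on the unit square with u = 0 on the boundary (grid spacing h)."""
    u = np.zeros_like(f)
    for _ in range(iters):
        u_new = u.copy()
        # 5-point stencil: each sweep streams the whole grid and does only
        # a few flops per point, so it is bound by memory bandwidth
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] +
                                    h * h * f[1:-1, 1:-1])
        u = u_new
    return u

n = 33
h = 1.0 / (n - 1)
f = np.ones((n, n))          # constant source term
u = jacobi_poisson(f, h, iters=2000)
print(u[n // 2, n // 2])      # centre value, ~0.074 for -∇²u = 1
```

The ~0.4 Bytes/DPF ratio of the A64FX quoted earlier is what lets this class of bandwidth-bound stencil code run well on it.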
† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728
Post-K Activities, ISC19, Frankfurt. Copyright 2019 FUJITSU LIMITED.

"Fugaku" CPU Performance Evaluation (3/3)
WRF: Weather Research and Forecasting model. Vectorizing loops that include IF-constructs is the key optimization; source-code tuning using directives promotes compiler optimizations.
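The optimization above relies on turning IF-constructs into predicated (masked) vector operations, which SVE supports in hardware. A small illustration of the idea using a hypothetical saturation loop (not WRF code), with NumPy masking standing in for SVE predication:

```python
import numpy as np

def saturate_scalar(a, lo, hi):
    # branchy loop: IF-constructs inside the body block vectorization
    out = np.empty_like(a)
    for i in range(len(a)):
        if a[i] < lo:
            out[i] = lo
        elif a[i] > hi:
            out[i] = hi
        else:
            out[i] = a[i]
    return out

def saturate_masked(a, lo, hi):
    # branch-free form: both arms are evaluated and combined under a mask,
    # which is how predicated SIMD executes the same logic
    return np.where(a < lo, lo, np.where(a > hi, hi, a))

a = np.array([-2.0, 0.5, 3.0, 1.0])
print(saturate_masked(a, 0.0, 1.0))   # [0.  0.5 1.  1. ]
```

Both forms compute the same result; the masked form is what a vectorizing compiler (guided by directives, as the slide notes) generates for the branchy loop.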
Fugaku system configuration
Scalable design
CPU → CMU → BoB → Shelf → Rack → System

Unit  | # of nodes | Description
CPU   | 1          | Single-socket node with HBM2 & Tofu-D interconnect
CMU   | 2          | CPU Memory Unit: 2x CPU
BoB   | 16         | Bunch of Blades: 8x CMU
Shelf | 48         | 3x BoB
Rack  | 384        | 8x Shelf
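The hierarchy multiplies out to the quoted per-rack and full-system totals, and the FP64 peak follows from the per-core rate; a quick check (figures taken from these slides; the 32 flop/cycle per core assumes the A64FX's two 512-bit FMA pipes):

```python
# Fugaku packaging hierarchy -> node counts and FP64 peak
nodes_per_cmu  = 2            # CPU Memory Unit: 2 CPUs, one node per CPU
cmu_per_bob    = 8            # Bunch of Blades
bob_per_shelf  = 3
shelf_per_rack = 8

nodes_per_rack = nodes_per_cmu * cmu_per_bob * bob_per_shelf * shelf_per_rack
# full system (from the total-config slide): 396 full racks + 36 half racks
total_nodes = 396 * nodes_per_rack + 36 * (nodes_per_rack // 2)

# per-core FP64 at 2 GHz: 2 SVE pipes x 8 lanes x 2 ops (FMA) = 32 flop/cycle
flops_per_node = 48 * 32 * 2.0e9              # 48 compute cores per node
peak_pf = total_nodes * flops_per_node / 1e15

print(nodes_per_rack, total_nodes, round(peak_pf))   # 384 158976 488
```

This reproduces the 384 nodes/rack, 158,976 total nodes, and 488 PF normal-mode FP64 peak quoted on the following slide.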
System | 150k+ | The full Fugaku system

Fugaku Total System Config & Performance
Total # of nodes: 158,976
- 384 nodes/rack x 396 (full) racks = 152,064 nodes
- 192 nodes/rack x 36 (half) racks = 6,912 nodes
- c.f. K Computer: 88,128 nodes

Theoretical peak compute performance (provided by Fujitsu Limited):
Normal mode (CPU frequency 2 GHz):
- 64-bit double-precision FP: 488 Petaflops
- 32-bit single-precision FP: 977 Petaflops
- 16-bit half-precision FP (AI training): 1.95 Exaflops
- 8-bit integer (AI inference): 3.90 Exaops
Boost mode (CPU frequency 2.2 GHz):
- 64-bit double-precision FP: 537 Petaflops
- 32-bit single-precision FP: 1.07 Exaflops
- 16-bit half-precision FP (AI training): 2.15 Exaflops
- 8-bit integer (AI inference): 4.30 Exaops
Theoretical peak memory bandwidth: 163 Petabytes/s

Comparison vs. K Computer (boost mode; K peak: 11.28 PF for all precisions, 2.82 Petaops 64-bit, 5.64 Petabytes/s memory BW):
- 64-bit double precision: 48x
- 32-bit single precision: 95x
- 16-bit half precision (AI training): 190x
- 8-bit integer (AI inference): >1,500x
- Peak memory bandwidth: 29x

Fugaku is a Year's Worth of IT in Japan
- Smartphones: 20 million units (2/3 of annual shipments in Japan); power = 10 W x 20 million = 200 MW
- IDC servers incl. clouds: 300,000 units (2/3 of annual shipments in Japan); power = 600-700 W x 300K, ~200 MW incl. cooling
- Fugaku: 1 system, ~30 MW
- K Computer: 1 system, ~15 MW
            | Smartphones                  | IDC servers/clouds       | Fugaku               | K Computer
CPU ISA     | Arm                          | x86/Arm                  | Arm                  | Sparc
System SW   | iOS/Android (Linux)          | Linux (Red Hat etc.)/Win | Linux (Red Hat etc.) | Proprietary Linux
Generality  | Low generality               | Gen. purpose             | Gen. CPU             | Gen. CPU
AI accel.   | Custom ASIC (inference only) | GPU                      | e.g. SVE instructions | None

Green500, Nov. 2019
A64FX prototype (Fujitsu A64FX 48C, 2 GHz) ranked #1 on the list:
768 general-purpose A64FX CPUs, without accelerators
- 1.9995 PFLOPS @ HPL, 84.75% efficiency
- 16.876 GF/W
- Power quality level 2
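The three figures above are mutually consistent; a quick check (derived values only, from the slide's numbers; the official Green500 power-measurement methodology may differ slightly):

```python
# Sanity-check the A64FX prototype's Green500 entry
hpl_pflops = 1.9995
gf_per_w   = 16.876
nodes      = 768

# theoretical FP64 peak: 48 cores x 32 flop/cycle x 2 GHz = 3.072 TF per CPU
peak_pflops = nodes * 48 * 32 * 2.0e9 / 1e15
efficiency = hpl_pflops / peak_pflops            # -> 0.8475, the quoted 84.75%
power_kw = hpl_pflops * 1e6 / gf_per_w / 1e3     # implied HPL power draw

print(round(efficiency, 4), round(power_kw, 1))  # 0.8475 118.5
```

So the 84.75% figure is simply HPL Rmax over the 2.36 PF theoretical peak, and the efficiency rating implies roughly 118 kW under HPL.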
SC19 Green500 ranking vs. first-appeared TOP500 edition
"A64FX prototype", the prototype of Supercomputer Fugaku, ranked #1 at 16.876 GF/W.
(Chart: GF/W vs. Green500 rank and first TOP500 edition.)
- Approx. 3x better GF/W than the 2nd-ranked system (CPUs with accelerators)
- The first system without accelerators ranked #1 in 9.5 years
- Top Xeon machine: #37, 5.843 GF/W; Endeavor: #94, 2.961 GF/W
A64FX CPU performance evaluation for real apps
Relative speedup of open-source real apps on an A64FX @ 2.2 GHz vs. the latest x86 processor (24-core, 2.9 GHz) x 2 sockets:
- Apps: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
- Up to 1.8x faster per node, i.e., 3.6x per socket
- The high memory bandwidth and long SIMD length of the A64FX work effectively with these applications
A64FX CPU power efficiency for real apps
Relative power-efficiency ratio (performance / energy consumption) on an A64FX @ 2.2 GHz:
- Apps: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
- Up to 3.7x more efficient than the latest x86 processor (24-core, 2.9 GHz) x 2
- High efficiency is achieved by energy-conscious design and implementation

Fugaku Performance Estimate on 9 Co-Design Target Apps
Performance target goals:
- 100 times faster than K for some applications (tuning included)
- 30 to 40 MW power consumption
- Peak performance to be achieved:
  - Peak DP (double precision): >400 Pflops (34x+ vs. K's 11.3 Pflops)
  - Peak SP (single precision): >800 Pflops (70x+ vs. K's 11.3 Pflops)
  - Peak HP (half precision): >1,600 Pflops (141x+)
  - Total memory bandwidth: >150 PB/s (29x+ vs. K's 5,184 TB/s)

Priority issue area / application / brief description / performance speedup over K:

Health:
1. Innovative computing infrastructure for drug discovery: GENESIS (MD for proteins), 125x+
2. Personalized and preventive medicine using big data: Genomon (genome processing, genome alignment), 8x+

Environment / disaster prevention:
3. Integrated simulation systems induced by earthquake and tsunami: GAMERA (earthquake simulator, FEM on unstructured & structured grids), 45x+
4. Meteorological and global environmental prediction using big data: NICAM+LETKF (structured-grid stencil & ensemble Kalman filter), 120x+

Energy issues:
5. New technologies for energy creation, conversion/storage, and use: NTChem (molecular electronic simulation, structure calculation), 40x+
6. Accelerated development of innovative clean energy systems: Adventure (computational mechanics system for large-scale analysis and design, unstructured grid), 35x+

Industrial competitiveness enhancement:
7. Creation of new functional devices and high-performance materials: RSDFT (ab initio simulation, density functional theory), 30x+
8. Development of innovative design and production processes: FFB (large eddy simulation, unstructured grid), 25x+

Basic science:
9. Elucidation of the fundamental laws and evolution of the universe: LQCD (lattice QCD simulation, structured-grid Monte Carlo), 25x+

Geometric mean of the performance speedup of the 9 target applications over the K computer: >37x (as of 2019/05/14)

Fugaku / Fujitsu FX1000 System Software Stack
- ~3,000 apps supported by Spack (open-source management tool)
- Fugaku AI (DL4Fugaku). RIKEN: Chainer, PyTorch, TensorFlow, DNNL, …
- Live data analytics: Apache Flink, Kibana, …
- Math libraries. Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II; RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, …
- Cloud software stack: OpenStack, Kubernetes, NEWT, …
- Compilers and script languages: Fortran, C/C++, OpenMP, Java, Python, … (multiple compilers supported: Fujitsu, Arm, GNU, LLVM/Clang, PGI, …)
- Batch job and management system
- Tuning and debugging tools. Fujitsu: profiler, debugger, GUI
- High-level programming and domain-specific languages: XMP, FDPS
- Communication: Fujitsu MPI, RIKEN MPI, DTF; low-level communication: uTofu, LLC
- File I/O: hierarchical file system (Lustre/LLIO), file I/O for hierarchical storage; S3-compatible object store
- Virtualization & containers: KVM, Singularity
- Process/thread: PIP
- OS: Red Hat Enterprise Linux 8; RHEL kernel + optional lightweight kernel (McKernel)

Most applications will work with a simple recompile from the x86/RHEL environment; LLNL's Spack automates this.
2020/01/27, RIKEN Center for Computational Science

New PRIMEHPC Lineup
- PRIMEHPC FX1000: a supercomputer optimized for large-scale computing; superior power efficiency, high scalability, high density
- PRIMEHPC FX700: a supercomputer based on standard technologies; ease of use, easy installation
The HPE/Cray CS500 - Fujitsu A64FX Arm Server
Cray Fujitsu Technology Agreement Supported in Cray CS500 infrastructure
Full Cray Programming Environment
- Leadership performance for many memory-intensive HPC applications, e.g., weather
- GA in mid-2020
- A number of adoptions. US: Stony Brook, DoE labs, etc.; multiple yet-to-be-named EU centers
Fugaku Deployment Status (Apr. 2020)
- Pipelined manufacturing, installation, and bring-up; the first rack shipped on Dec. 3, 2019, and all racks were on the floor by May 13, 2020(!)
- Early users, incl. COVID-19 apps, are running already
- Open to international users through HPCI; general allocation April 2021, with applications starting Sept. 2020 (a Japanese PI is not required)
- Some internal test nodes (Apr. 2020) and allocations (Apr. 2021) are also available for R-CCS
Massive Scale Deep Learning on Fugaku
The Fugaku processor is AI/DL-ready:
- High-performance FP16 & INT8, and high memory BW for convolution
- Built-in scalable Tofu network for massive scaling of model & data parallelism
- Unprecedented DL scalability: high-performance DNN convolution plus a high-performance, ultra-scalable network
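Data-parallel scaling of the kind described here rests on a bandwidth-optimal allreduce of gradients across nodes. A minimal pure-Python simulation of the classic ring-allreduce pattern (illustrative only; on the real machine this is done by MPI/collective libraries over Tofu-D, not by this code):

```python
def ring_allreduce(contrib):
    """Simulate ring allreduce over p ranks.

    contrib[r][i] is rank r's contribution to chunk i (one chunk per rank).
    After p-1 reduce-scatter steps and p-1 allgather steps, every rank
    holds the elementwise sum, while each rank only ever sends 2*(p-1)
    chunks: this constant per-link traffic is why the pattern scales.
    """
    p = len(contrib)
    data = [row[:] for row in contrib]
    for step in range(p - 1):              # reduce-scatter phase
        for r in range(p):
            i = (r - step) % p             # chunk rank r forwards this step
            data[(r + 1) % p][i] += data[r][i]
    for step in range(p - 1):              # allgather phase
        for r in range(p):
            i = (r + 1 - step) % p         # completed chunk to broadcast
            data[(r + 1) % p][i] = data[r][i]
    return data

ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
         [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
out = ring_allreduce(ranks)
print(out[0])   # [1111, 2222, 3333, 4444] on every rank
```

In practice topology-aware collectives on a 6D torus can do better than a plain ring, but the reduce-scatter + allgather decomposition is the same.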
- Low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM)
- High BW for fast reduction; ultra-scalability of data/model parallelism
- Scalability shown to 20,000 nodes; HPL-AI to beyond exaflops

Large Scale Public AI Infrastructures in Japan
Peak performance figures below are inference (FP16) / training (FP32/FP16), with HPL-AI, Top500, and Green500 entries where ranked.

- TSUBAME3, Tokyo Tech (July 2017, HPC + AI, public): NVIDIA P100 x 2160; inference 45.8 PF; training 22.9 PF / 45.8 PF; Top500 #22 (8.125 PF); Green500 #8 (13.704 GF/W)
- Reedbush-H/L, U-Tokyo (Apr. 2018 update, HPC + AI, public): NVIDIA P100 x 496; inference 10.71 PF; training 5.36 PF / 10.71 PF; unranked
- ITO-B, U-Kyushu (Oct. 2017, HPC + AI, public): NVIDIA P100 x 512; inference 11.1 PF; training 5.53 PF / 11.1 PF; unranked
- AICC, AIST-AIRC (Oct. 2017, AI, lab only): NVIDIA P100 x 400; inference 8.64 PF; training 4.32 PF / 8.64 PF; unranked
- Raiden, Riken-AIP (Apr. 2018 update, AI, lab only): NVIDIA V100 x 432; inference 54.0 PF; training 6.40 PF / 54.0 PF; Top500 #462 (1.213 PF)
- ABCI, AIST-AIRC (Aug. 2018, AI, public): NVIDIA V100 x 4352; inference 544.0 PF; training 65.3 PF / 544.0 PF; Top500 #8 (19.88 PF); Green500 #6 (14.423 GF/W)
- NICT & Sakura Internet (Summer 2019, AI, lab only): NVIDIA V100 x 1700; inference ~210 PF; training ~26 PF / ~210 PF; Top500 #51 (4.128 PF) and #58 (3.712 PF)
- C.f. Summit, US ORNL (Summer 2018, HPC + AI, public): NVIDIA V100 x 27,000; inference 3,375 PF; training 405 PF / 3,375 PF; HPL-AI 445 PF (FP16); Top500 #1 (143.5 PF); Green500 #5 (14.719 GF/W)
- Fugaku, Riken R-CCS (2020-2021, HPC + AI, public): Fujitsu A64FX x >150,000 (equivalent to ~100K GPUs); inference >4,000 POps (INT8); training >1,000 PF / >2,000 PF (FP32/FP16); HPL-AI >1,000 PF (FP16); Top500 >400 PF, #1 (2020?); Green500 16.876 GF/W, #1 (prototype)

Aggregate figures on the slide: inference 838.5 PF, training 86.9 PF; vs. Summit roughly 1/4 (inference) and 1/5 (training); Fugaku alone is projected at 2x-4x Summit.

Fujitsu-Riken-Arm joint effort on AI framework development on SVE/A64FX
MOU signed Nov. 25, 2019. Fujitsu-Riken optimization levels (bottom up):
- Step 1: Arm DNNL (also with Arm: Xbyak, Eigen + libraries)
- Step 2: DL frameworks (PyTorch, TensorFlow)
- Step 3 (top of the stack)
1st release May 2020: the first version optimizes the A64FX processor for inference; the next version adds training optimization.

Exaops of Sim, Data, and AI on Fugaku and Cloud

Cloud Service Providers Partnership: https://www.r-ccs.riken.jp/library/topics/200213.html (in Japanese)
Action items:
- Cool project name and logo!
- Trial methods to provide Fugaku computing resources to end users via service providers
- Evaluate the effectiveness of the methods as quantitatively as possible and organize the issues
COVID19 Fugaku Early Production & HPCI
Supercomputer Fugaku used to help fight against COVID-19
https://www.r-ccs.riken.jp/en/topics/fugaku-coronavirus.html
Production a year ahead of schedule
Max availability in April-May: 72 racks (80 Petaflops), 1/6 of the full system
More to be added after June
Public call by MEXT and Riken – fast track, bypasses normal allocation procedure, priority allocation to massive resource w/extensive R-CCS support
Must work closely w/Riken R-CCS and other COVID19 groups
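The availability figures above can be cross-checked against the rack arithmetic from earlier slides (488 PF FP64 and 158,976 nodes for the full system; the slide's "80 Petaflops" is presumably a rounded or slightly derated figure):

```python
# Check the COVID-19 early-availability figures: 72 racks ≈ 80 PF, "1/6 full"
nodes_per_rack = 384
total_nodes = 158_976                    # full Fugaku
full_peak_pf = 488                       # FP64, normal mode

nodes = 72 * nodes_per_rack              # nodes available in April-May
fraction = nodes / total_nodes           # ≈ 0.174, i.e. about 1/6
peak_pf = full_peak_pf * fraction        # ≈ 85 PF at pro-rata peak

print(nodes, round(fraction, 3), round(peak_pf))   # 27648 0.174 85
```

So 72 racks is 27,648 nodes, about 1/6 of the machine, and roughly 85 PF pro-rata peak, consistent with the quoted 80 PF.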
HPCI Tier-2 COVID19 resource allocations
https://www.hpci-office.jp/pages/e_hpci_covid19
Most HPCI Tier-2 resources: more variety (e.g., GPUs), less capacity, and fewer restrictions c.f. Fugaku
Goes through an accelerated peer-review process by RIST
Both are available to international research groups