Fugaku: the First ‘Exascale’ Machine
Satoshi Matsuoka, Director, RIKEN R-CCS / Professor, Tokyo Institute of Technology. The 6th ENES HPC Workshop (virtual), 26 May 2020
The “Fugaku” 富岳 “Exascale” Supercomputer for Society 5.0

Mt. Fuji, representing the ideal of supercomputing:
- High peak (capability): acceleration of large-scale applications
- Broad base (applicability & capacity): broad applications (simulation, data science, AI, …) and a broad user base (academia, industry, cloud startups, …), for Society 5.0

A64FX & Fugaku 富岳 / Post-K are:
- Fujitsu-Riken design: A64FX, an Armv8.2 (SVE) 48/52-core CPU
- HPC-optimized: extremely high in-package memory BW (1 TByte/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 Teraflops), various AI support (FP16, INT8, etc.)
- General-purpose CPU: Linux, Windows (Word), other SCs/clouds
- Extremely power-efficient: >10x power/perf efficiency for a CFD benchmark over current mainstream x86 CPUs
- Largest and fastest supercomputer ever built circa 2020: >150,000 nodes (superseding LLNL Sequoia), >150 PetaByte/s memory BW, Tofu-D 6D torus network with 60 Petabps injection BW (10x global IDC traffic), 25-30 PB NVMe L1 storage, many-endpoint 100 Gbps I/O network into Lustre
Brief History of R-CCS towards Fugaku

- January 2006: Next Generation Supercomputer Project (K Computer) start
- July 2010: RIKEN AICS established
- August 2010: HPCI Project start
- September 2010: K computer installation start; first meeting of SDHPC
- June 2011: #1 on Top500
- November 2011: #1 on Top500 with >10 Petaflops; ACM Gordon Bell Award
- End of FY2011 (March 2012): SDHPC Whitepaper
- April 2012: Post-K Feasibility Study start (3 Arch teams and 1 Apps team)
- June 2012: K computer construction complete
- September 2012: K computer production start
- November 2012: ACM Gordon Bell Award
- End of FY2013 (March 2014): Post-K Feasibility Study reports
- April 2014: Post-K project start
- June 2014: #1 on Graph500
- April 2018: AICS renamed to RIKEN R-CCS; Satoshi Matsuoka becomes new Director
- August 2018: Arm A64FX announced at Hot Chips
- October 2018: NEDO 100x processor project start
- November 2018: Post-K manufacturing approval by the Prime Minister's CSTI Committee
- March 2019: Post-K manufacturing start
- May 2019: Post-K named "Supercomputer Fugaku"
- July 2019: Post-Moore Whitepaper start
- August 2019: K Computer shutdown
- December 2019: Fugaku installation start
- April 2020: COVID-19 teams
- May 2020: Fugaku shipment complete

Co-Design Activities in Fugaku
Multiple activities since 2011, spanning Science by Computing and Science of Computing:
- 9 Priority App Areas of high concern to the general public: medical/pharma, environment/disaster, energy, manufacturing, …
- A64FX: the CPU designed for the Post-K supercomputer
- Select representative applications from 100s of applications signifying various computational characteristics
- Design systems with parameters that consider various application characteristics
- Extremely tight collaboration between the co-design app centers, Riken, Fujitsu, etc.
- Chose 9 representative apps as the "target application" scenario
- Achieve up to 100x speedup c.f. the K computer
- Also ease of programming, a broad SW ecosystem, very low power, …

Research Subjects of the Post-K Computer
The Post-K computer will expand the fields pioneered by the K computer, and also challenge new areas.
NICAM: Global Climate Simulation
Global cloud-resolving model with 0.87 km mesh, which allows resolution of cumulus clouds. Month-long forecasts of Madden-Julian oscillations in the tropics have been realized.
Global cloud resolving model
Miyamoto et al. (2013), Geophys. Res. Lett., 40, 4922-4926, doi:10.1002/grl.50944.

"Big Data Assimilation": NICAM+LETKF
- High-precision simulations: future-generation technologies available 10 years in advance
- High-precision observations
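LETKF couples the simulation to the observations through an ensemble Kalman filter update. As a rough illustration of the analysis step only, here is a simplified stochastic EnKF with a linear observation operator (this is not the localized, transform-based LETKF algorithm itself; all sizes and values below are made up for the sketch):

```python
import numpy as np

def enkf_update(X, y, H, r, rng):
    """Stochastic ensemble Kalman filter analysis step.

    X : (n, m) ensemble of model states (m members)
    y : (p,) observation vector
    H : (p, n) linear observation operator
    r : observation error variance (R = r * I)
    """
    n, m = X.shape
    xm = X.mean(axis=1, keepdims=True)        # ensemble mean
    A = X - xm                                 # anomalies
    P = A @ A.T / (m - 1)                      # sample covariance
    S = H @ P @ H.T + r * np.eye(len(y))       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    # perturbed observations, one per member
    Y = y[:, None] + rng.normal(0.0, np.sqrt(r), size=(len(y), m))
    return X + K @ (Y - H @ X)                 # analysis ensemble

rng = np.random.default_rng(0)
# 3-variable state, 200 members; truth = [1, 2, 3], observe the first two
truth = np.array([1.0, 2.0, 3.0])
X = truth[:, None] + rng.normal(0.0, 1.0, size=(3, 200))   # prior spread 1.0
H = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = H @ truth
Xa = enkf_update(X, y, H, r=0.01, rng=rng)
# the analysis mean of the observed variables is pulled close to y
print(np.abs(Xa[:2].mean(axis=1) - y).max())
```

With accurate observations (small r), the gain is close to 1 for the observed variables and the analysis ensemble collapses toward the observation, which is the feedback loop the slide describes.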
- Mutual feedback between simulations and observations

Fugaku's Fujitsu A64FX Processor is…
…a many-core Arm CPU:
- 48 compute cores + 2 or 4 assistant (OS) cores
- Brand-new core design with near Xeon-class integer performance
- Arm v8, 64-bit: the Arm ecosystem
- Tofu-D + PCIe 3 external connections

…but also an accelerated, GPU-like processor:
- SVE 512-bit x 2 vector extensions (Arm & Fujitsu): integer (1, 2, 4, 8 bytes) and float (16, 32, 64 bits)
- Cache + memory localization (sector cache)
- HBM2 on-package memory: massive memory BW (Bytes/DPF ~0.4)
- Streaming memory access, strided access, scatter/gather, etc.
- Intra-chip barrier synchronization and other memory-enhancing features

GPU-like high performance in HPC, especially CFD and weather & climate (even with traditional Fortran code), plus AI/Big Data.

"Fugaku" CPU Performance Evaluation (2/3)
Himeno Benchmark (Fortran 90): a stencil calculation that solves Poisson's equation by the Jacobi method (results in GFlops).
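To make the kernel concrete: a minimal 2-D sketch of the kind of computation Himeno measures, Jacobi relaxation of a Poisson problem on a regular grid (the real benchmark is 3-D with variable coefficients; this simplified analogue shows why the kernel is memory-bandwidth-bound):

```python
import numpy as np

def jacobi_poisson(f, h, iters):
    """Jacobi iterations for the 2-D Poisson equation -∇²u = f
    on the unit square with u = 0 on the boundary (grid spacing h)."""
    u = np.zeros_like(f)
    for _ in range(iters):
        u_new = u.copy()
        # 5-point stencil: each sweep streams the whole grid and does only
        # a few flops per point, so it is bound by memory bandwidth
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] +
                                    h * h * f[1:-1, 1:-1])
        u = u_new
    return u

n = 33
h = 1.0 / (n - 1)
f = np.ones((n, n))          # constant source term
u = jacobi_poisson(f, h, iters=2000)
print(u[n // 2, n // 2])      # centre value, ~0.074 for -∇²u = 1
```

The ~0.4 Bytes/DPF ratio of the A64FX quoted earlier is what lets this class of bandwidth-bound stencil code run well on it.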
† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728
Post-K Activities, ISC19, Frankfurt. Copyright 2019 FUJITSU LIMITED.

"Fugaku" CPU Performance Evaluation (3/3)
WRF: Weather Research and Forecasting model. Vectorizing loops that include IF-constructs is the key optimization; source-code tuning using directives promotes compiler optimizations.
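The optimization above relies on turning IF-constructs into predicated (masked) vector operations, which SVE supports in hardware. A small illustration of the idea using a hypothetical saturation loop (not WRF code), with NumPy masking standing in for SVE predication:

```python
import numpy as np

def saturate_scalar(a, lo, hi):
    # branchy loop: IF-constructs inside the body block vectorization
    out = np.empty_like(a)
    for i in range(len(a)):
        if a[i] < lo:
            out[i] = lo
        elif a[i] > hi:
            out[i] = hi
        else:
            out[i] = a[i]
    return out

def saturate_masked(a, lo, hi):
    # branch-free form: both arms are evaluated and combined under a mask,
    # which is how predicated SIMD executes the same logic
    return np.where(a < lo, lo, np.where(a > hi, hi, a))

a = np.array([-2.0, 0.5, 3.0, 1.0])
print(saturate_masked(a, 0.0, 1.0))   # [0.  0.5 1.  1. ]
```

Both forms compute the same result; the masked form is what a vectorizing compiler (guided by directives, as the slide notes) generates for the branchy loop.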
Fugaku system configuration
Scalable design
CPU → CMU → BoB → Shelf → Rack → System

Unit  | # of nodes | Description
CPU   | 1          | Single-socket node with HBM2 & Tofu-D interconnect
CMU   | 2          | CPU Memory Unit: 2x CPU
BoB   | 16         | Bunch of Blades: 8x CMU
Shelf | 48         | 3x BoB
Rack  | 384        | 8x Shelf
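The hierarchy multiplies out to the quoted per-rack and full-system totals, and the FP64 peak follows from the per-core rate; a quick check (figures taken from these slides; the 32 flop/cycle per core assumes the A64FX's two 512-bit FMA pipes):

```python
# Fugaku packaging hierarchy -> node counts and FP64 peak
nodes_per_cmu  = 2            # CPU Memory Unit: 2 CPUs, one node per CPU
cmu_per_bob    = 8            # Bunch of Blades
bob_per_shelf  = 3
shelf_per_rack = 8

nodes_per_rack = nodes_per_cmu * cmu_per_bob * bob_per_shelf * shelf_per_rack
# full system (from the total-config slide): 396 full racks + 36 half racks
total_nodes = 396 * nodes_per_rack + 36 * (nodes_per_rack // 2)

# per-core FP64 at 2 GHz: 2 SVE pipes x 8 lanes x 2 ops (FMA) = 32 flop/cycle
flops_per_node = 48 * 32 * 2.0e9              # 48 compute cores per node
peak_pf = total_nodes * flops_per_node / 1e15

print(nodes_per_rack, total_nodes, round(peak_pf))   # 384 158976 488
```

This reproduces the 384 nodes/rack, 158,976 total nodes, and 488 PF normal-mode FP64 peak quoted on the following slide.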
System | 150k+ | The full Fugaku system

Fugaku Total System Config & Performance
Total # of nodes: 158,976
- 384 nodes/rack x 396 (full) racks = 152,064 nodes
- 192 nodes/rack x 36 (half) racks = 6,912 nodes
- c.f. K Computer: 88,128 nodes

Theoretical peak compute performance (provided by Fujitsu Limited):
Normal mode (CPU frequency 2 GHz):
- 64-bit double-precision FP: 488 Petaflops
- 32-bit single-precision FP: 977 Petaflops
- 16-bit half-precision FP (AI training): 1.95 Exaflops
- 8-bit integer (AI inference): 3.90 Exaops
Boost mode (CPU frequency 2.2 GHz):
- 64-bit double-precision FP: 537 Petaflops
- 32-bit single-precision FP: 1.07 Exaflops
- 16-bit half-precision FP (AI training): 2.15 Exaflops
- 8-bit integer (AI inference): 4.30 Exaops
Theoretical peak memory bandwidth: 163 Petabytes/s

Comparison vs. K Computer (boost mode; K peak: 11.28 PF for all precisions, 2.82 Petaops 64-bit, 5.64 Petabytes/s memory BW):
- 64-bit double precision: 48x
- 32-bit single precision: 95x
- 16-bit half precision (AI training): 190x
- 8-bit integer (AI inference): >1,500x
- Peak memory bandwidth: 29x

Fugaku is a Year's Worth of IT in Japan
- Smartphones: 20 million units (2/3 of annual shipments in Japan); power = 10 W x 20 million = 200 MW
- IDC servers incl. clouds: 300,000 units (2/3 of annual shipments in Japan); power = 600-700 W x 300K, ~200 MW incl. cooling
- Fugaku: 1 system, ~30 MW
- K Computer: 1 system, ~15 MW
            | Smartphones                  | IDC servers/clouds       | Fugaku               | K Computer
CPU ISA     | Arm                          | x86/Arm                  | Arm                  | Sparc
System SW   | iOS/Android (Linux)          | Linux (Red Hat etc.)/Win | Linux (Red Hat etc.) | Proprietary Linux
Generality  | Low generality               | Gen. purpose             | Gen. CPU             | Gen. CPU
AI accel.   | Custom ASIC (inference only) | GPU                      | e.g. SVE instructions | None

Green500, Nov. 2019
A64FX prototype (Fujitsu A64FX 48C, 2 GHz) ranked #1 on the list:
768 general-purpose A64FX CPUs, without accelerators
- 1.9995 PFLOPS @ HPL, 84.75% efficiency
- 16.876 GF/W
- Power quality level 2
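The three figures above are mutually consistent; a quick check (derived values only, from the slide's numbers; the official Green500 power-measurement methodology may differ slightly):

```python
# Sanity-check the A64FX prototype's Green500 entry
hpl_pflops = 1.9995
gf_per_w   = 16.876
nodes      = 768

# theoretical FP64 peak: 48 cores x 32 flop/cycle x 2 GHz = 3.072 TF per CPU
peak_pflops = nodes * 48 * 32 * 2.0e9 / 1e15
efficiency = hpl_pflops / peak_pflops            # -> 0.8475, the quoted 84.75%
power_kw = hpl_pflops * 1e6 / gf_per_w / 1e3     # implied HPL power draw

print(round(efficiency, 4), round(power_kw, 1))  # 0.8475 118.5
```

So the 84.75% figure is simply HPL Rmax over the 2.36 PF theoretical peak, and the efficiency rating implies roughly 118 kW under HPL.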
SC19 Green500 ranking vs. first-appeared TOP500 edition
"A64FX prototype", the prototype of Supercomputer Fugaku, ranked #1 at 16.876 GF/W.
(Chart: GF/W vs. Green500 rank and first TOP500 edition.)
- Approx. 3x better GF/W than the 2nd-ranked system (CPUs with accelerators)
- The first system without accelerators ranked #1 in 9.5 years
- Top Xeon machine: #37, 5.843 GF/W; Endeavor: #94, 2.961 GF/W
A64FX CPU performance evaluation for real apps
Relative speedup of open-source real apps on an A64FX @ 2.2 GHz vs. the latest x86 processor (24-core, 2.9 GHz) x 2 sockets:
- Apps: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
- Up to 1.8x faster per node, i.e., 3.6x per socket
- The high memory bandwidth and long SIMD length of the A64FX work effectively with these applications
A64FX CPU power efficiency for real apps
Relative power-efficiency ratio (performance / energy consumption) on an A64FX @ 2.2 GHz:
- Apps: OpenFOAM, FrontISTR, ABINIT, SALMON, SPECFEM3D, WRF, MPAS
- Up to 3.7x more efficient than the latest x86 processor (24-core, 2.9 GHz) x 2
- High efficiency is achieved by energy-conscious design and implementation

Fugaku Performance Estimate on 9 Co-Design Target Apps
Performance target goals:
- 100 times faster than K for some applications (tuning included)
- 30 to 40 MW power consumption
- Peak performance to be achieved:
  - Peak DP (double precision): >400 Pflops (34x+ vs. K's 11.3 Pflops)
  - Peak SP (single precision): >800 Pflops (70x+ vs. K's 11.3 Pflops)
  - Peak HP (half precision): >1,600 Pflops (141x+)
  - Total memory bandwidth: >150 PB/s (29x+ vs. K's 5,184 TB/s)

Priority issue area / application / brief description / performance speedup over K:

Health:
1. Innovative computing infrastructure for drug discovery: GENESIS (MD for proteins), 125x+
2. Personalized and preventive medicine using big data: Genomon (genome processing, genome alignment), 8x+

Environment / disaster prevention:
3. Integrated simulation systems induced by earthquake and tsunami: GAMERA (earthquake simulator, FEM on unstructured & structured grids), 45x+
4. Meteorological and global environmental prediction using big data: NICAM+LETKF (structured-grid stencil & ensemble Kalman filter), 120x+

Energy issues:
5. New technologies for energy creation, conversion/storage, and use: NTChem (molecular electronic simulation, structure calculation), 40x+
6. Accelerated development of innovative clean energy systems: Adventure (computational mechanics system for large-scale analysis and design, unstructured grid), 35x+

Industrial competitiveness enhancement:
7. Creation of new functional devices and high-performance materials: RSDFT (ab initio simulation, density functional theory), 30x+
8. Development of innovative design and production processes: FFB (large eddy simulation, unstructured grid), 25x+

Basic science:
9. Elucidation of the fundamental laws and evolution of the universe: LQCD (lattice QCD simulation, structured-grid Monte Carlo), 25x+

Geometric mean of the performance speedup of the 9 target applications over the K computer: >37x (as of 2019/05/14)

Fugaku / Fujitsu FX1000 System Software Stack
- ~3,000 apps supported by Spack (open-source management tool)
- Fugaku AI (DL4Fugaku). RIKEN: Chainer, PyTorch, TensorFlow, DNNL, …
- Live data analytics: Apache Flink, Kibana, …
- Math libraries. Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II; RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, …
- Cloud software stack: OpenStack, Kubernetes, NEWT, …
- Compilers and script languages: Fortran, C/C++, OpenMP, Java, Python, … (multiple compilers supported: Fujitsu, Arm, GNU, LLVM/Clang, PGI, …)
- Batch job and management system
- Tuning and debugging tools. Fujitsu: profiler, debugger, GUI
- High-level programming and domain-specific languages: XMP, FDPS
- Communication: Fujitsu MPI, RIKEN MPI, DTF; low-level communication: uTofu, LLC
- File I/O: hierarchical file system (Lustre/LLIO), file I/O for hierarchical storage; S3-compatible object store
- Virtualization & containers: KVM, Singularity
- Process/thread: PIP
- OS: Red Hat Enterprise Linux 8; RHEL kernel + optional lightweight kernel (McKernel)

Most applications will work with a simple recompile from the x86/RHEL environment; LLNL's Spack automates this.
2020/01/27, RIKEN Center for Computational Science

New PRIMEHPC Lineup
- PRIMEHPC FX1000: a supercomputer optimized for large-scale computing; superior power efficiency, high scalability, high density
- PRIMEHPC FX700: a supercomputer based on standard technologies; ease of use, easy installation
The HPE/Cray CS500 - Fujitsu A64FX Arm Server
Cray Fujitsu Technology Agreement Supported in Cray CS500 infrastructure
Full Cray Programming Environment
- Leadership performance for many memory-intensive HPC applications, e.g., weather
- GA in mid-2020
- A number of adoptions. US: Stony Brook, DoE labs, etc.; multiple yet-to-be-named EU centers
Fugaku Deployment Status (Apr. 2020)
- Pipelined manufacturing, installation, and bring-up; the first rack shipped on Dec. 3, 2019, and all racks were on the floor by May 13, 2020(!)
- Early users, incl. COVID-19 apps, are running already
- Open to international users through HPCI; general allocation April 2021, with applications starting Sept. 2020 (a Japanese PI is not required)
- Some internal test nodes (Apr. 2020) and allocations (Apr. 2021) are also available for R-CCS
Massive Scale Deep Learning on Fugaku
The Fugaku processor is AI/DL-ready:
- High-performance FP16 & INT8, and high memory BW for convolution
- Built-in scalable Tofu network for massive scaling of model & data parallelism
- Unprecedented DL scalability: high-performance DNN convolution plus a high-performance, ultra-scalable network
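Data-parallel scaling of the kind described here rests on a bandwidth-optimal allreduce of gradients across nodes. A minimal pure-Python simulation of the classic ring-allreduce pattern (illustrative only; on the real machine this is done by MPI/collective libraries over Tofu-D, not by this code):

```python
def ring_allreduce(contrib):
    """Simulate ring allreduce over p ranks.

    contrib[r][i] is rank r's contribution to chunk i (one chunk per rank).
    After p-1 reduce-scatter steps and p-1 allgather steps, every rank
    holds the elementwise sum, while each rank only ever sends 2*(p-1)
    chunks: this constant per-link traffic is why the pattern scales.
    """
    p = len(contrib)
    data = [row[:] for row in contrib]
    for step in range(p - 1):              # reduce-scatter phase
        for r in range(p):
            i = (r - step) % p             # chunk rank r forwards this step
            data[(r + 1) % p][i] += data[r][i]
    for step in range(p - 1):              # allgather phase
        for r in range(p):
            i = (r + 1 - step) % p         # completed chunk to broadcast
            data[(r + 1) % p][i] = data[r][i]
    return data

ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
         [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
out = ring_allreduce(ranks)
print(out[0])   # [1111, 2222, 3333, 4444] on every rank
```

In practice topology-aware collectives on a 6D torus can do better than a plain ring, but the reduce-scatter + allgather decomposition is the same.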
- Low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM)
- High BW for fast reduction; ultra-scalability of data/model parallelism
- Scalability shown to 20,000 nodes; HPL-AI to beyond exaflops

Large Scale Public AI Infrastructures in Japan
Peak performance figures below are inference (FP16) / training (FP32/FP16), with HPL-AI, Top500, and Green500 entries where ranked.

- TSUBAME3, Tokyo Tech (July 2017, HPC + AI, public): NVIDIA P100 x 2160; inference 45.8 PF; training 22.9 PF / 45.8 PF; Top500 #22 (8.125 PF); Green500 #8 (13.704 GF/W)
- Reedbush-H/L, U-Tokyo (Apr. 2018 update, HPC + AI, public): NVIDIA P100 x 496; inference 10.71 PF; training 5.36 PF / 10.71 PF; unranked
- ITO-B, U-Kyushu (Oct. 2017, HPC + AI, public): NVIDIA P100 x 512; inference 11.1 PF; training 5.53 PF / 11.1 PF; unranked
- AICC, AIST-AIRC (Oct. 2017, AI, lab only): NVIDIA P100 x 400; inference 8.64 PF; training 4.32 PF / 8.64 PF; unranked
- Raiden, Riken-AIP (Apr. 2018 update, AI, lab only): NVIDIA V100 x 432; inference 54.0 PF; training 6.40 PF / 54.0 PF; Top500 #462 (1.213 PF)
- ABCI, AIST-AIRC (Aug. 2018, AI, public): NVIDIA V100 x 4352; inference 544.0 PF; training 65.3 PF / 544.0 PF; Top500 #8 (19.88 PF); Green500 #6 (14.423 GF/W)
- NICT & Sakura Internet (Summer 2019, AI, lab only): NVIDIA V100 x 1700; inference ~210 PF; training ~26 PF / ~210 PF; Top500 #51 (4.128 PF) and #58 (3.712 PF)
- C.f. Summit, US ORNL (Summer 2018, HPC + AI, public): NVIDIA V100 x 27,000; inference 3,375 PF; training 405 PF / 3,375 PF; HPL-AI 445 PF (FP16); Top500 #1 (143.5 PF); Green500 #5 (14.719 GF/W)
- Fugaku, Riken R-CCS (2020-2021, HPC + AI, public): Fujitsu A64FX x >150,000 (equivalent to ~100K GPUs); inference >4,000 POps (INT8); training >1,000 PF / >2,000 PF (FP32/FP16); HPL-AI >1,000 PF (FP16); Top500 >400 PF, #1 (2020?); Green500 16.876 GF/W, #1 (prototype)

Aggregate figures on the slide: inference 838.5 PF, training 86.9 PF; vs. Summit roughly 1/4 (inference) and 1/5 (training); Fugaku alone is projected at 2x-4x Summit.

Fujitsu-Riken-Arm joint effort on AI framework development on SVE/A64FX
MOU signed Nov. 25, 2019. Fujitsu-Riken optimization levels (bottom up):
- Step 1: Arm DNNL (also with Arm: Xbyak, Eigen + libraries)
- Step 2: DL frameworks (PyTorch, TensorFlow)
- Step 3 (top of the stack)
1st release May 2020: the first version optimizes the A64FX processor for inference; the next version adds training optimization.

Exaops of Sim, Data, and AI on Fugaku and Cloud

Cloud Service Providers Partnership: https://www.r-ccs.riken.jp/library/topics/200213.html (in Japanese)
Action items:
- Cool project name and logo!
- Trial methods to provide Fugaku computing resources to end users via service providers
- Evaluate the effectiveness of the methods as quantitatively as possible and organize the issues
COVID19 Fugaku Early Production & HPCI
Supercomputer Fugaku used to help fight against COVID-19
https://www.r-ccs.riken.jp/en/topics/fugaku-coronavirus.html
Production a year ahead of schedule
Max availability in April-May: 72 racks (80 Petaflops), 1/6 of the full system
More to be added after June
Public call by MEXT and Riken – fast track, bypasses normal allocation procedure, priority allocation to massive resource w/extensive R-CCS support
Must work closely w/Riken R-CCS and other COVID19 groups
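The availability figures above can be cross-checked against the rack arithmetic from earlier slides (488 PF FP64 and 158,976 nodes for the full system; the slide's "80 Petaflops" is presumably a rounded or slightly derated figure):

```python
# Check the COVID-19 early-availability figures: 72 racks ≈ 80 PF, "1/6 full"
nodes_per_rack = 384
total_nodes = 158_976                    # full Fugaku
full_peak_pf = 488                       # FP64, normal mode

nodes = 72 * nodes_per_rack              # nodes available in April-May
fraction = nodes / total_nodes           # ≈ 0.174, i.e. about 1/6
peak_pf = full_peak_pf * fraction        # ≈ 85 PF at pro-rata peak

print(nodes, round(fraction, 3), round(peak_pf))   # 27648 0.174 85
```

So 72 racks is 27,648 nodes, about 1/6 of the machine, and roughly 85 PF pro-rata peak, consistent with the quoted 80 PF.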
HPCI Tier-2 COVID19 resource allocations
https://www.hpci-office.jp/pages/e_hpci_covid19
Most HPCI Tier-2 resources: more variety (e.g., GPUs), less capacity, and fewer restrictions c.f. Fugaku
Goes through an accelerated peer-review process by RIST
Both are available to international research groups