Fugaku: the First 'Exascale' Supercomputer - Past, Present and Future
Satoshi Matsuoka, Director, RIKEN R-CCS / Prof., Tokyo Institute of Technology
Cluster 2020 Keynote Presentation, 15 Sep. 2020

R-CCS: Science of Computing, by Computing, for Computing
An international core research center in the science of high performance computing (HPC); one of RIKEN's 12 research centers; 18 research teams + 4 operations teams; > 200 FTEs.

Achievements
• First 'exascale' supercomputer: Fugaku R&D. R&D started at the genesis of the center to design and build the world's first "exascale" supercomputer. Official government approval to build Fugaku was given in Nov. 2018; installation started in Dec. 2019; operations are to start in early 2021. (Photo: Fugaku/post-K rack and CPU prototype, ©Fujitsu.)
• Groundbreaking apps, e.g. large-scale earthquake disaster simulation with HPC & AI: ultra-large-scale earthquake simulation using AI for a massive-scale earthquake in the Tokyo area, modeling buildings, underground structures and geological layers. Gordon Bell 2018 finalist; SC16 and SC17 Best Poster Awards. (©Tokyo Univ.)
• Industry outreach, e.g. the RIKEN Combustion System CAE Consortium: a RIKEN consortium to develop next-generation CAE for combustors, established with 11 industry and 9 academia members; it will expand the user base of R-CCS as well as of Fugaku. (Simulation CAE examples.)
• Top-tier awards: a high-speed algorithm for analyzing social networks won the Best Paper Award at HiPC; Dr. Matsuoka received the Achievement Award at ACM HPDC 2018 and the Asia HPC Leadership Award at SupercomputingAsia 2019.

The "Fugaku" 富岳 "Exascale" Supercomputer for Society 5.0
Mt. Fuji represents the ideal of supercomputing: a high peak, standing for large-scale capability and the acceleration of large-scale applications, on a broad base, standing for applicability and capacity. Broad applications: simulation, data science, AI, … Broad user base: academia, industry, cloud startups, … For Society 5.0.

Fugaku: Largest & Fastest Supercomputer Ever
'Applications First' R&D: a high-risk "moonshot" R&D challenge.
• A new high-performance, low-power Arm A64FX CPU co-developed by RIKEN R-CCS and Fujitsu, together with nationwide HPC researchers, as a national flagship 2020 project:
  - 3x performance c.f. the top CPU in HPC apps ("moonshot" R&D target)
  - 3x power efficiency c.f. the top CPU
  - A general-purpose Arm CPU that runs the same programs as smartphones
  - Acceleration features for AI
• Fugaku x 2~3 = the entire annual IT of Japan:
  - Smartphones: ~20 million units (≈ annual shipment in Japan); 10W x 20 million = 200MW
  - Servers (incl. IDC): ~300,000 units (≈ annual shipment in Japan); 600-700W x 300,000 ≈ 200MW (less than 1/10 the efficiency of Fugaku)
  - K Computer: 1 unit; 15MW
  - Fugaku: 1 unit (160K nodes); > 30MW (incl. cooling; very low)
• Developed via extensive co-design: the "Science of Computing" by RIKEN, Fujitsu, the HPCI centers, etc. and the Arm ecosystem, reflecting numerous research results, and the "Science by Computing" through "9 Priority Areas" that develop target applications to tackle important societal problems.

Brief History of R-CCS towards Fugaku
• January 2006: Next Generation Supercomputer Project (K Computer) start
• July 2010: RIKEN AICS established
• August 2010: HPCI Project start
• September 2010: K computer installation start
• First meeting of SDHPC (Post-K)
• June 2011: K #1 on Top 500
• November 2011: K #1 on Top 500 at > 10 petaflops; ACM Gordon Bell Award
• End of FY2011 (March 2012): SDHPC Whitepaper
• April 2012: Post-K Feasibility Study start; 3 architecture teams and 1 applications team start
• June 2012: K computer construction complete
• September 2012: K computer production start
• November 2012: ACM Gordon Bell Award
• End of FY2013 (March 2014): Post-K Feasibility Study reports
• April 2014: Post-K project start
• June 2014: K #1 on Graph 500
• April 2018: AICS renamed to RIKEN R-CCS; Satoshi Matsuoka becomes the new Director
• August 2018: Arm A64FX announced at Hot Chips
• October 2018: NEDO 100x processor project start
• November 2018: Post-K manufacturing approved by the Prime Minister's CSTI Committee
• March 2019: Post-K manufacturing start
• May 2019: Post-K named "Supercomputer Fugaku"
• July 2019: Post-Moore Whitepaper start
• August 2019: K Computer shutdown
• December 2019: Fugaku installation start
• April 2020: COVID-19 teams start
• May 2020: Fugaku shipment complete
• June 2020: Fugaku #1 on Top 500 etc.

"Applications First" Co-Design Activities in Fugaku
• Science by Computing: 9 priority application areas of high concern to the general public (Medical/Pharma, Environment/Disaster, Energy, Manufacturing, …); representatives are selected from 100s of applications signifying various computational characteristics.
• Science of Computing: RIKEN R-CCS & Fujitsu design the system (the A64FX, for the Post-K supercomputer) with parameters that consider the various application characteristics.
• Extremely tight collaboration between the co-design app centers, RIKEN, Fujitsu, etc. Nine representative apps were chosen as the "target application" scenario, with the goal of up to 100x speedup c.f. the K computer, plus ease of programming, a broad SW ecosystem, very low power, …

A64FX: Leading-edge Si Technology
• TSMC 7nm FinFET & CoWoS
• Broadcom SerDes, HBM I/O, and SRAMs
• 8.786 billion transistors
• 594 signal pins
(Die plot: TofuD interface, PCIe interface, four HBM2 interfaces, and four core/L2-cache clusters connected by a ring bus.)

Fugaku's Fujitsu A64FX Processor is…
…a many-core Arm CPU:
• 48 compute cores + 2 or 4 assistant (OS) cores
• Brand-new core design
• Near Xeon-class integer performance per core
• Armv8, 64-bit Arm ecosystem
• Tofu-D + PCIe Gen3 external connections
…but also an accelerated, GPU-like processor:
• SVE 512-bit x 2 vector extensions (Arm & Fujitsu): integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
• Cache + memory localization (sector cache)
• HBM2 on-package memory with massive memory bandwidth (bytes per DP flop ~0.4)
• Streaming memory access, strided access, scatter/gather, etc.
• Intra-chip barrier synchronization and other memory-enhancing features
→ GPU-like high performance in HPC, especially CFD, weather & climate (even with traditional Fortran code), plus AI/big data.
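Because these vector features are exposed through the standard Arm SVE instruction set, ordinary compiled code can reach them without accelerator-style offloading. The fragment below is a minimal sketch, not part of the Fugaku software stack: it illustrates the vector-length-agnostic style SVE encourages, using Arm's ACLE intrinsics, and assumes an SVE-capable compiler (e.g. built with -march=armv8.2-a+sve). On A64FX each vector is 512 bits wide, i.e. 8 doubles.

    #include <arm_sve.h>
    #include <stdint.h>

    /* daxpy: y[i] = a * x[i] + y[i], written without hard-coding the
     * vector length; svcntd() returns the doubles per vector (8 on A64FX). */
    void daxpy_sve(double a, const double *x, double *y, int64_t n)
    {
        svfloat64_t va = svdup_n_f64(a);
        for (int64_t i = 0; i < n; i += svcntd()) {
            svbool_t pg = svwhilelt_b64_s64(i, n);   /* predicate masks the loop tail */
            svfloat64_t vx = svld1_f64(pg, &x[i]);   /* contiguous (streaming) load */
            svfloat64_t vy = svld1_f64(pg, &y[i]);
            vy = svmla_f64_m(pg, vy, vx, va);        /* fused multiply-add */
            svst1_f64(pg, &y[i], vy);
        }
    }

Gather/scatter accesses use the analogous predicated svld1_gather/svst1_scatter intrinsic families; the sector-cache and barrier features are driven through Fujitsu's compiler and system software and are not shown here.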
A64FX Chip Overview
Architecture features:
• Armv8.2-A (AArch64 only)
• SVE 512-bit wide SIMD
• 48 computing cores + 4 assistant cores (all cores are identical)
• HBM2, 32GiB on package
• Tofu interconnect: 6D mesh/torus, 28Gbps x 2 lanes x 10 ports
• PCIe Gen3, 16 lanes
CMG specification: 13 cores; L2$ 8MiB; memory 8GiB at 256GB/s.
(Chip diagram: four CMGs, Tofu controller, PCIe controller, network on chip, and four HBM2 stacks.)

A64FX (Post-K) vs. SPARC64 XIfx (PRIMEHPC FX100)
• ISA (base): Armv8.2-A vs. SPARC-V9
• ISA (extension): SVE vs. HPC-ACE2
• Process node: 7nm vs. 20nm
• Peak performance: >2.7TFLOPS vs. 1.1TFLOPS
• SIMD width: 512-bit vs. 256-bit
• Number of cores: 48+4 vs. 32+2
• Memory: HBM2 vs. HMC
• Memory peak B/W: 1024GB/s vs. 240GB/s x2 (in/out)
A64FX summary: 7nm FinFET, 8,786M transistors, 594 package signal pins; peak performance >2.7TFLOPS (>90% efficiency @DGEMM); memory B/W 1024GB/s (>80% @STREAM Triad).

A64FX Core Pipeline
A64FX inherits and enhances the superior features of SPARC64:
• Inherits superscalar, out-of-order execution, branch prediction, etc.
• Enhances SIMD and predicate operations:
  - 2x 512-bit wide SIMD FMA + predicate operation + 4x ALU (shared with 2x AGEN)
  - 2x 512-bit wide SIMD loads, or one 512-bit wide SIMD store
(Pipeline diagram: fetch, issue, dispatch, register-read, execute, cache-and-memory and commit stages for the 52 cores, with L1 I$/D$, L2$, HBM2 controller, Tofu controller and PCIe controller; the enhanced blocks are highlighted.)

Power Management (Cont.)
"Power knob" for power optimization: A64FX provides a power management function called the "power knob".
• Applications can change hardware configurations for power optimization.
→ The power knobs and the energy monitor/analyzer will help users optimize the power consumption of their applications.
(Power knob diagram: decode width limited to 2, EX pipeline usage restricted to EXA only, FL pipeline usage restricted to FLA only, frequency reduction, and HBM2 B/W adjustment in units of 10%.)

Many-Core Architecture
A64FX consists of four CMGs (Core Memory Groups).
• A CMG consists of 13 cores, an L2 cache and a memory controller.
  - One of the 13 cores is an assistant core which handles daemons, I/O, etc.
• The four CMGs keep cache coherency by ccNUMA with an on-chip directory.
• The X-bar connection within a CMG maximizes the throughput efficiency of the L2 cache.
• Process binding within a CMG allows linear scalability up to 48 cores.
• The on-chip network with a wide ring bus secures I/O performance.
(Diagrams: CMG configuration with 13 cores, X-bar connection, 8MiB 16-way L2 cache and memory controller; A64FX chip configuration with four CMGs, Tofu and PCIe controllers, ring-bus network on chip, and four HBM2 stacks.)
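In practice, exploiting this ccNUMA layout means keeping each thread, and the pages it touches, within one CMG. The sketch below is a generic first-touch OpenMP pattern, not Fujitsu tooling: the array name and size are illustrative, and binding is done with standard OpenMP environment variables (e.g. OMP_NUM_THREADS=48 OMP_PROC_BIND=close OMP_PLACES=cores), or by running one MPI rank per CMG (12 cores each).

    #include <stdlib.h>
    #include <omp.h>

    #define N (1L << 26)   /* illustrative array size */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;

        /* First touch: each thread initializes the pages it will later use,
         * so the OS places them in the HBM2 of that thread's own CMG
         * (each CMG is one ccNUMA node on A64FX). */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* The compute phase reuses the same static schedule, so every
         * thread reads and writes only CMG-local memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * a[i] + 1.0;

        free(a);
        return 0;
    }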
High Bandwidth
Extremely high bandwidth in the caches and memory:
• A64FX has out-of-order mechanisms in the cores, the caches and the memory controllers, which maximizes the usable bandwidth of each layer of the hierarchy.
• Each CMG couples 12 computing cores + 1 assistant core to its L2 cache and HBM2 stack.
• Core: 512-bit wide SIMD, 2x FMAs; chip peak >2.7TFLOPS.
• L1D cache (64KiB, 4-way): >230GB/s load and >115GB/s store per core; >11.0TB/s per chip (B/F ratio = 4).
• L2 cache (8MiB, 16-way): >115GB/s load and >57GB/s store per core; >3.6TB/s per chip (B/F ratio = 1.3).
• Memory (HBM2, 8GiB per CMG): 256GB/s per CMG, 1024GB/s per chip (B/F ratio ≈ 0.37).
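The bytes-per-flop (B/F) ratios above follow directly from the peak figures on the slide. The short program below simply reproduces that arithmetic; the 1.8GHz clock is an assumption chosen to match the ">2.7TFLOPS" figure (production A64FX parts clock higher, which raises the peak and lowers the memory B/F accordingly).

    #include <stdio.h>

    int main(void)
    {
        /* Figures from the slides: 48 compute cores, 2 FMA pipes per core,
         * 512-bit SIMD = 8 doubles per vector, 2 flops per FMA.
         * The clock is an assumed value that reproduces ">2.7 TFLOPS". */
        const double cores = 48.0, fma_pipes = 2.0, dp_lanes = 512.0 / 64.0;
        const double flops_per_fma = 2.0, clock_hz = 1.8e9;

        double peak    = cores * fma_pipes * dp_lanes * flops_per_fma * clock_hz;
        double hbm2_bw = 4.0 * 256e9;   /* four CMGs x 256 GB/s = 1024 GB/s */

        printf("peak DP performance: %.2f TFLOPS\n", peak / 1e12);        /* ~2.76 */
        printf("memory B/F ratio   : %.2f bytes/flop\n", hbm2_bw / peak); /* ~0.37 */
        return 0;
    }

The L1 and L2 ratios quoted above come from the same division, using the per-chip cache bandwidths of >11.0TB/s and >3.6TB/s.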