Exascale” Supercomputer Fugaku & Beyond

The first “exascale” supercomputer Fugaku & beyond l Satoshi Matsuoka l Director, RIKEN Center for Computational Science l 20190815 Modsim Presentation 1 Arm64fx & Fugaku 富岳 /Post-K are: l Fujitsu-Riken design A64fx ARM v8.2 (SVE), 48/52 core CPU l HPC Optimized: Extremely high package high memory BW (1TByte/s), on-die Tofu-D network BW (~400Gbps), high SVE FLOPS (~3Teraflops), various AI support (FP16, INT8, etc.) l Gen purpose CPU – Linux, Windows (Word), otherSCs/Clouds l Extremely power efficient – > 10x power/perf efficiency for CFD benchmark over current mainstream x86 CPU l Largest and fastest supercomputer to be ever built circa 2020 l > 150,000 nodes, superseding LLNL Sequoia l > 150 PetaByte/s memory BW l Tofu-D 6D Torus NW, 60 Petabps injection BW (10x global IDC traffic) l 25~30PB NVMe L1 storage l many endpoint 100Gbps I/O network into Lustre l The first ‘exascale’ machine (not exa64bitflops but in apps perf.) 3 Brief History of R-CCS towards Fugaku April 2018 April 2012 AICS renamed to RIKEN R-CCS. July 2010 Post-K Feasibility Study start Satoshi Matsuoka becomes new RIKEN AICS established 3 Arch Teams and 1 Apps Team Director August 2010 June 2012 Aug 2018 HPCI Project start K computer construction complete Arm A64fx announce at Hotchips September 2010 September 2012 Oct 2018 K computer installation starts K computer production start NEDO 100x processor project start First meeting of SDHPC (Post-K) November 2012 Nov 2018 ACM Gordon bell Award Post-K Manufacturing approval by Prime Minister’s CSTI Committee 2 006 2011 2013 2014 2019 2010 2018 2012 January 2006 End of FY2013 (Mar 2014) Next Generation Supercomputer Post-K Feasibility Study Reports March 2019 Project (K Computer) start Post-K Manufacturing start May 2019 June 2011 Post-K named “Supercomputer Fugaku” #1 on Top 500 July 2019 November 2011 Post-Moore Whitepaper start #1 on Top 500 > 10 Petaflops Aug 2019 April 2014 K Computer shutdown ACM Gordon Bell Award Post-K project start End of FY 2011 (March 2012) Dec 2019 June 2014 Fugaku installation start (planned) 4 SDHPC Whitepaper #1 on Graph 500 SDHPC (2011-2012) Candidate of ExaScale Architecture https://www.exascale.org/mediawiki/images/a/aa/Talk-3-kondo.pdf }} Four types of architectures are considered }} General Purpose(GP) }} Ordinary CPU-based MPPs Memory }} e.g.) K-Computer, GPU, Blue Gene, capacity x86-based PC-clusters CB oriented }} Capacity-Bandwidth oriented (CB) Memory }} With expensive memory-I/F rather than bandwidth General computing capability purpose Reduced }} e.g.) Vector machines Memory }} Reduced Memory(RM) Compute }} With embedded (main) memory oriented }} e.g.) SoC, MD-GRAPE4, Anton }} ComputeOriented (CO) FLOPS }} Many processing units }} e.g.) ClearSpeed, GRAPE-DR IESP Meeting@Kobe (April 12, 2012) 5 SDHPC (2011-2012) Performance Projection }} Performance projection for an HPC system in 2018 }} Achieved through continuous technology development }} Constraints: 20 – 30MW electricity & 2000sqmspace Total CPU Total Memory Total Memory Performance Bandwidth Capacity Byte / Flop Node Performance (PetaFLOPS) (PetaByte/s) (PetaByte) General Purpose 200~400 20~40 20~40 0.1 Capacity-BW Oriented 50~100 50~100 50~100 1.0 Reduced Memory 500~1000 250~500 0.1~0.2 0.5 Compute Oriented 1000~2000 5~10 5~10 0.005 Network Storage Min Max Total Capacity Total Bandwidth Injection P-to-P Bisection Latency Latency 1 EB 10TB/s High-radix 32GB/s 32 GB/s 2.0PB/s 200ns 1000ns 100 times larger For saving all data (Dragonfly) than main in memory to disks Low-radix 128GB/s 16 GB/s 0.13PB/s 100ns 5000ns memory within 1000-sec. (4DTorus) IESP Meeting@Kobe (April 12, 2012) 6 SDHPC (2011-2012) Gap Between Requirement and Technology Trends }} Mapping four architectures onto science requirement }} Projected performance vs. science requirement }} Big gap between projected and required performance Mapping of Architectures Projected vs. Required Perf. 1.0E+1 2700 CB Gap between 1.0E+0 requirements and F / technoloGy trends B 1800 1.0E-1 (PFLOPS) of RM GP 1.0E-2 CO 900 1.0E-3 Requirement Requirement 1.0E-4 0 1.0E-3 1.0E-2 1.0E-1 1.0E+0 1.0E+1 1.0E+2 1.0E+3 CP RM GP CB Requirement of Memory Capacity (PB) Needs national research project for science-driven HPC systems IESP Meeting@Kobe (April 12, 2012) 7 Post-K Feasibility Study (2012- 2013) l 3 Architecture Teams, from identified architectural types in the SDHPC report l General Purpose --- balanced l Compute Intensive --- high flops and/or low memory capacity & high memory BW l Large Memory Capacity --- also w/high memory BW l The A64fx processor satisfied multiple roles - basically balanced but also compute intensive l Application Team (Tomita, Matsuoka) l Put all the K-Computer applications stakeholders into one room l Templated reporting of science impact possible on exascale machines and their computational algorithms / requirements l 600 page report (English summary available) 8 Post-K Application Feasibility Study 2012- 2013 https://hpci-aplfs.r-ccs.riken.jp/document/roadmap/roadmap_e_1405.pdf 9 Co-Design Activities in Fugaku Multiple Activities since 2011 Science by Computing Science of Computing ・9 Priority App Areas: High Concern to General Public: Medical/Pharma, Environment/Disaster, Energy, Manufacturing, … A 6 4 f x For the Post-K supercomputer Select representatives fr Design systems with param om 100s of applications eters that consider various signifying various compu application characteristics tational characteristics l Extremely tight collabrations between the Co-Design apps centers, Riken, and Fujitsu, etc. l Chose 9 representative apps as “target application” scenario l Achieve up to x100 speedup c.f. K-Computer l Also ease-of-programming, broad SW ecosystem, very low power, … Research Subjects of the Post-K Computer The post K computer will expand the fields pioneered by the K computer, and also challenge new areas. 11 Genesis MD: proteins in a cell environment Protein simulation before K Protein simulation with K n Simulation of a protein in isolation n all atom simulation of a cell interior Folding simulation of Villin, a small protein with 36 amino acids n cytoplasm of Mycoplasma genitalium 400nm DNA Ribosome Proteins GROEL metabolites Ribosome water O R G EL ion TRNA ATP 100nm 12 NICAM: Global Climate Simulation n Global cloud resolving model with 0.87 km-mesh which allows resolution of cumulus clouds n Month-long forecasts of Madden-Julian oscillations in the tropics isrealized. Global cloud resolving model Miyamoto et al (2013) , Geophys. Res. Lett., 40, 4922–4926, doi:10.1002/grl.50944. 13 “Big Data Assimilation” NICAM+LETKF High-precision Simulations Future-generation technologies available 10 years in advance High-precision observations Mutual feedback Co-design from Apps to Architecture l Architectural Parameters to be determined l #SIMD, SIMD length, #core, #NUMA node, O3 resources, specialized hardware l cache (size and bandwidth), memory technologies Target applications representatives of l Chip die-size, power consumption almost all our applications in termsof l Interconnect computational methods and communication patterns in order to l We have selected a set of target applications design architectural features. l Performance estimation tool l Performance projection using Fujitsu FX100 execution profile to a set of arch. parameters. l Co-design Methodology (at early design phase) 1. Setting set of system parameters 2. Tuning target applications under the system parameters 3. Evaluating execution time using prediction tools 4. Identifying hardware bottlenecks and changing the set of system parameters 15 Co-design of Apps for Architecture l Tools for performance tuning l Performance estimation tool l Performance projection using Fujitsu FX100 execution profile l Gives “target” performance l Post-K processor simulator l Based on gem5, O3, cycle-level simulation l Very slow, so limited to kernel-level evaluation TarGet As is TuninG 1 performance l Co-design of apps TuninG 2 ① l 1. Estimate “target” performance using ② performance estimation tool ③ l 2. Extract kernel code for simulator time l 3. Measure exec time using simulator Execution l 4. Feed-back to code optimization Perform- ance estimation Simulator Simulator Simulator l 5. Feed-back to compiler toolcd 16 ARM for HPC - Co-design Opportunities l ARM SVE Vector Length Agnostic feature is very interesting, since we can examine vector performance using the same binary. l We have investigated how to improve the performance of SVEkeeping hardware-resource the same. (in “Rev-A” paper) l ex. “512 bits SVE x 2 pipes” vs. “1024 bits SVE x 1 pipe” l Evaluation of Performance and Power ( in “coolchips” paper) by using our gem-5 simulator (with “white” parameter) and ARM compiler. l Conclusion: Wide vector size over FPU element size will improve performance if there are enough rename registers and the utilization of FPU has room for improvement. Note that these researches are not relevantto 1.40 “post-K” architecture. 1.20 l Y. Kodama, T. Oajima and M. Sato. “Preliminary 1.00 Performance Evaluation of Application Kernels Using Time e ARM SVE with Multiple Vector Lengths”, In Re- 0.80 Emergence of Vector Architectures Workshop(Rev- ast F r A)in 2017 IEEE International Conference on Cluster Execution 0.60 Computing, pp. 677-684, Sep. 2017. 0.40 l T. Odajima, Y. Kodama and M. Sato, “Power Relative Performance Analysis of ARM Scalable Vector 0.20 Extension”, In IEEE Symposium on Low-Power and 0.00 High-Speed Chips and Systems (COOL Chips 21), Apr. triad nbody dgemm 2018 LEN=4 LEN=8 LEN=8 (x2) 17 A64FX Leading-edge Si-technology n TSMC 7nm FinFET & CoWoS TofuD Interface PCIe Interface n HBM2 Broadcom SerDes, HBM I/O, and Core Core Core Core Core Core Core Core Core Core Interface SRAMs interface L2 L2 HBM2 Core Core Core Core n 8.786 billion transistors Cache Cache n 594 signal pins R B Core Core Core Core Core Core u Core Core Core Core Core Core I N s G - Core Core Core Core Core Core Core Core Core Core Core Core HBM2 L2 L2 Core Core Core Core Cache Cache Interface Interface HBM2 Core Core Core Core Core Core Core Core Core Core Post-K Activities, ISC19, Frankfurt 18 Copyright 2019 FUJITSU LIMITED Fugaku: The Game Changer 1.

Exascale” Supercomputer Fugaku & Beyond

Petaflops for the People

File System and Power Management Enhanced for Supercomputer Fugaku

Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators

NVIDIA Tesla Personal Supercomputer, Please Visit

Blockchain Database for a Cyber Security Learning System

Science-Driven Development of Supercomputer Fugaku

An Introduction to Cloud Databases a Guide for Administrators

Middleware-Based Database Replication: the Gaps Between Theory and Practice

The Sunway Taihulight Supercomputer: System and Applications

BDEC2 Poznan Japanese HPC Infrastructure Update

Supercomputer Fugaku

Providing a Robust Tools Landscape for CORAL Machines