BDEC2 Poznan Japanese HPC Infrastructure Update

Presenter: Masaaki Kondo, Riken R-CCS/Univ. of Tokyo (on behalf of Satoshi Matsuoka, Director, Riken R-CCS)

1 Post-K: The Game Changer (2020)

1. Heritage of the K-Computer: high performance in simulation via extensive co-design
• High performance: up to 100x the performance of K in real applications
• Multitudes of scientific breakthroughs via Post-K application programs
• Simultaneous high performance and ease of programming

2. New technology innovations of Post-K: global leadership not just in the machine & apps, but as cutting-edge IT
• High performance, especially via high memory BW: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps via BW & vector acceleration
• Very green, e.g. extreme power efficiency: ultra power-efficient design & various power control knobs
• Arm global ecosystem & SVE contribution: top CPU in the ARM ecosystem of 21 billion chips/year; SVE co-design and world's first implementation (ARM: a massive ecosystem from embedded to HPC)
• High performance on Society 5.0 apps incl. AI: architectural features for high performance on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, blockchain security, etc.

Technology not limited to Post-K, but extending into societal IT infrastructures, e.g. clouds.

2 Arm64fx & Post-K (to be renamed)

• Fujitsu-Riken design A64fx: ARM v8.2 (SVE), 48/52-core CPU

• HPC optimized: extremely high package-integrated memory BW (1 TB/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 TFLOPS), various AI support (FP16, INT8, etc.)

• General-purpose CPU – Linux, Windows (Word), other SCs/clouds

• Extremely power efficient – >10x performance-per-watt on a CFD benchmark over current mainstream x86 CPUs
• Largest and fastest supercomputer ever built, circa 2020:

• >150,000 nodes, superseding LLNL Sequoia

• >150 PB/s memory BW

• Tofu-D 6D torus network, 60 Pbps injection BW (~10x global IDC traffic; see the arithmetic after this list)

• 25~30 PB NVMe L1 storage

• ~10,000-endpoint 100 Gbps I/O network into Lustre

• The first 'exascale' machine (not exa 64-bit flops, but in application performance)
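The 60 Pbps injection-bandwidth figure is consistent with the node count and per-node Tofu-D bandwidth quoted above:

$$150{,}000\ \mathrm{nodes} \times 400\ \mathrm{Gbps/node} = 6\times10^{16}\ \mathrm{bit/s} = 60\ \mathrm{Pbps}$$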

3 Post-K A64fx Processor

• A many-core ARM CPU…
  • 48 compute cores + 2 or 4 assistant (OS) cores
  • Brand-new core design
  • Near Xeon-class integer performance per core
  • ARM v8: 64-bit ARM ecosystem
  • Tofu-D + PCIe Gen3 external connections

• …but also an accelerated GPU-like processor
  • SVE 512-bit vector extensions (ARM & Fujitsu); see the sketch after this list

  • Integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
  • Cache + scratchpad-like local memory (sector cache)
  • HBM2 on-package memory – massive memory BW (~0.4 bytes per DP flop)

  • Streaming memory access, strided access, scatter/gather, etc.
  • Intra-chip barrier synchronization and other memory-enhancing features

• GPU-like high performance in HPC, AI/Big Data, autonomous driving…
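As a concrete illustration of the SVE programming model mentioned above, here is a minimal vector-length-agnostic daxpy kernel using ACLE SVE intrinsics. The kernel is illustrative and not from the slides; the same binary would use 512-bit vectors on A64FX and other widths on other SVE parts.

```c
#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic daxpy (y[i] += a * x[i]) with ACLE SVE intrinsics.
   Compile with e.g. gcc -O2 -march=armv8.2-a+sve */
void daxpy_sve(int64_t n, double a, const double *x, double *y) {
    for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd() = doubles per vector */
        svbool_t pg = svwhilelt_b64_s64(i, n);      /* predicate masks the tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);      /* predicated loads */
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);          /* vy += vx * a */
        svst1_f64(pg, &y[i], vy);                   /* predicated store */
    }
}
```

Because the loop never hard-codes the vector width, the same code exercises the full 512-bit datapath on A64FX without recompilation.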

4 A64FX: Spec Summary

• Arm SVE, high performance and high efficiency
• DP performance 2.7+ TFLOPS, >90% of peak on DGEMM
• Memory BW 1024 GB/s, >80% of peak on STREAM Triad

ISA: Armv8.2-A, SVE
Process technology: 7 nm
Peak DP performance: >2.7 TFLOPS
SIMD width: 512-bit
Number of cores: 48 + 4 (4 CMGs, each with 12 compute cores + 1 assistant core)
Memory capacity: 32 GiB (HBM2 x 4)
Memory peak bandwidth: 1024 GB/s
PCIe: Gen3, 16 lanes
High-speed interconnect: Tofu-D, integrated on chip
(CMG: Core Memory Group; NOC: Network on Chip)
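The STREAM Triad figure above is sustained bandwidth on a simple three-array kernel; note that 1024 GB/s against 2.7 DP TFLOPS is exactly the ~0.4 bytes-per-flop ratio cited earlier. A minimal single-threaded sketch of the Triad measurement follows (array size and timing scaffolding are illustrative; the real benchmark runs many repetitions across threads):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 25)  /* 32M doubles per array (~256 MB), enough to defeat caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* Triad: a = b + scalar * c */
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Triad touches three arrays per iteration: 2 reads + 1 write = 24 bytes */
    printf("Triad bandwidth: %.1f GB/s\n", 24.0 * N / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```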

5 Preliminary performance evaluation results

• Over 2.5x faster on HPC & AI benchmarks than SPARC64 XIfx

6 Preliminary performance evaluation results

• Himeno Benchmark (Fortran90)
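Himeno measures a Jacobi-style pressure-Poisson stencil sweep and is strongly memory-bandwidth-bound, which is why high-bandwidth parts (A64FX, SX-Aurora TSUBASA) do well on it. A simplified C sketch of the idea follows; the real benchmark uses a 19-point stencil with coefficient arrays, so this 7-point version is illustrative only:

```c
/* Simplified Jacobi stencil sweep in the spirit of the Himeno benchmark.
   The real Himeno kernel uses a 19-point stencil with coefficient arrays;
   this 7-point version is illustrative only. */
#define NX 128
#define NY 128
#define NZ 128

static float p[NZ][NY][NX];    /* pressure field */
static float wrk[NZ][NY][NX];  /* next iterate */

float jacobi_sweep(void) {
    float gosa = 0.0f;  /* accumulated squared residual, as in Himeno */
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++) {
                float s = p[k][j][i-1] + p[k][j][i+1]
                        + p[k][j-1][i] + p[k][j+1][i]
                        + p[k-1][j][i] + p[k+1][j][i];
                float np = s / 6.0f;          /* average of the 6 neighbours */
                float d  = np - p[k][j][i];
                gosa += d * d;
                wrk[k][j][i] = np;
            }
    /* wrk is copied back to p between sweeps (omitted for brevity) */
    return gosa;
}
```

Each grid point needs several loads with little data reuse per flop, so sustained performance tracks memory bandwidth rather than peak FLOPS.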

† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

7 Overview of Post-K System & Storage

• 3-level hierarchical storage
  • 1st layer: GFS cache + temp FS (25~30 PB NVMe)
  • 2nd layer: Lustre-based GFS (a few hundred PB of HDD)
  • 3rd layer: off-site cloud storage

• Full machine spec
  • >150,000 nodes, ~8 million high-performance Arm v8.2 cores
  • >150 PB/s memory BW
  • Tofu-D at 60 Pbps, 10x global IDC traffic
  • ~10,000 I/O fabric endpoints
  • >400 racks
  • ~40 MW machine + IDC, PUE ~1.1, high-pressure DLC (direct liquid cooling)
  • NRE pays off: roughly equivalent to 15~30 million state-of-the-art competing CPU cores for HPC workloads (both dense and sparse problems)
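PUE (power usage effectiveness) is total facility power divided by power delivered to the IT equipment. If the ~40 MW figure above is read as the total machine + IDC draw (an interpretation; the slide does not break it down), PUE ~1.1 implies roughly:

$$P_{\mathrm{IT}} \approx \frac{40\ \mathrm{MW}}{1.1} \approx 36\ \mathrm{MW}$$

i.e. cooling and other facility overheads add only about 10% on top of the IT load.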

8 Preparing the 40+ MW Facility (actual photo)

9 What is HPCI?

• World's top-class computing resources

• Open to worldwide HPC communities

The operation of the K computer will stop in August 2019.

10 HPCI Computational Resources (sites / machines)

• HPCI provides a wide variety of supercomputer resources

11 HPCI Tier 2 Systems Roadmap (As of Nov. 2018)

System roadmaps by site, FY2016 through FY2027 (power figures are the maximum consumption, including cooling):

Hokkaido: HITACHI SR16000/M1 (172 TF, 22 TB), Cloud System BS2000 (44 TF, 14 TB), Data Science Cloud / Storage HA8000 / WOS7000 (10 TF, 1.96 PB) → 3.96 PF (UCC + CFL/M) 0.9 MW + 0.16 PF cloud 0.1 MW → 35 PF (UCC + CFL-M) 2 MW

Tohoku: SX-ACE (707 TF, 160 TB, 655 TB/s) + LX406e (31 TF), storage (4 PB), 3D vis, 2 MW → ~30 PF, ~30 PB/s memory BW (CFL-D/CFL-M) ~3 MW → 100~200 PF, 100~200 PB/s (CFL-D/CFL-D) ~4 MW

Tsukuba: HA-PACS (1166 TF), COMA (PACS-IX) (1001 TF), PPX1/PPX2 (62 TF each) → Cygnus 3+ PF (TPF) 0.4 MW → PACS-XI 100 PF (TPF)

Tsukuba & Tokyo (JCAHPC): Oakforest-PACS (OFP) 25 PF → 100+ PF, 4.5-6.0 MW

Tokyo: Fujitsu FX10 (Oakleaf/Oakbridge) (1.27 PF, 168 TiB, 460 TB/s) (UCC + TPF) 3.2 MW; Hitachi SR16K/M1 (54.9 TF, 10.9 TiB, 28.7 TB/s); Reedbush-U/H 1.92 PF (FAC) 0.7 MW (Reedbush-U until the end of June 2020); Reedbush-L 1.4 PF (FAC) 0.2 MW → Oakbridge-II 4+ PF (UCC + TPF) 1.0 MW; BDEC 60+ PF (FAC) 3.5-4.5 MW → 200+ PF (FAC) 6.5-8.0 MW

Tokyo Tech.: TSUBAME 2.5 (5.7 PF, 110+ TB, 1160 TB/s) 1.4 MW (possible extended operation as TSUBAME 2.6, up to 5 PF) → TSUBAME 3.0 (12.15 PF, 1.66 PB/s) 1.4 MW total → TSUBAME 4.0 (~100 PF, ~10 PB/s, ~2.0 MW)

Nagoya: Fujitsu FX100 (2.9 PF, 81 TiB), Fujitsu CX400 (774 TF, 71 TiB) (542 TF, 71 TiB) → 20+ PF (FAC/UCC + CFL-M) up to 3 MW → 100+ PF (FAC/UCC + CFL-M) up to 3 MW, 2 MW in total

Kyoto: Cray XE6 + GB8K XC30 (983 TF), Cray XC30 (584 TF) → Cray XC40 (5.5 PF) + CS400 (1.0 PF) 1.33 MW → 20-40+ PF (FAC/TPF + UCC) 1.5 MW → 80-150+ PF (FAC/TPF + UCC) 2 MW

Osaka: NEC SX-ACE (423 TF), NEC Express5800 (22.4 TF), OCTOPUS 1.463 PF (UCC) → 3.2 PB/s, 15~25 PF (CFL-M) 1.0-1.5 MW → 25.6 PB/s, 50-100 PF (TPF) 1.5-2.0 MW

Kyushu: HA8000 (712 TF, 242 TB), SR16000 (8.2 TF, 6 TB), FX10 (272.4 TF, 36 TB), CX400 (966.2 TF, 183 TB), FX10 (90.8 TF) → Fujitsu CX subsystem A + B, 10.4 PF (UCC/TPF) 2.7 MW, 2.0 MW → 100+ PF (FAC/TPF + UCC/TPF) ~3 MW

JAMSTEC: SX-ACE (1.3 PF, 320 TiB) 3 MW → 100 PF, 3 MW

ISM: UV2000 (98 TF, 128 TiB) 0.3 MW → 2 PF, 0.3 MW

12 Supercomputer roadmap in ITC/U.Tokyo

• 2 big systems, 6-year cycle (timeline covers FY2011 through FY2025)

13 ITC/U.Tokyo now operating 2 (or 4) systems!

2,000+ users (1,000 from outside of U.Tokyo)

• Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal))
  • Integrated Supercomputer System for Data Analyses & Scientific Simulations

  • Jul. 2016 - Jun. 2020 (RB-U); through Mar. 2021 (RB-H/L)
  • Reedbush-U: CPU only, 420 nodes, 508 TF (Jul. 2016)
  • Reedbush-H: 120 nodes, 2 GPUs/node, 1.42 PF (Mar. 2017)
  • Reedbush-L: 64 nodes, 4 GPUs/node, 1.43 PF (Oct. 2017)
  • 502.2 kW (w/o cooling, air-cooled)

• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
  • JCAHPC: Joint Center for Advanced HPC (U.Tsukuba & U.Tokyo)
  • 8,208 nodes, 25 PF peak, 13.55 PF (HPL)

  • TOP500 #14 (#2 in Japan), HPCG #9 (#3 in Japan) (Nov. 2018)
  • Omni-Path Architecture, full bi-section fat-tree
  • DDN IME (burst buffer), Lustre 26 PB

  • IO500 #1 (June 2018), #4 (Nov. 2018)
  • 4.24 MW (incl. cooling), 3.44 MW (w/o cooling)

14 Oakbridge-CX (OBCX): awarded to Fujitsu

• Intel Xeon Scalable Processors (Cascade Lake-SP)
  • Platinum 8280 (28 cores, 2.7 GHz) x 2 sockets
• Overview
  • 1,368 nodes, 6.61 PF peak
  • Aggregated memory bandwidth: 385.1 TB/s
  • Total HPL performance: 4.2+ PF
• Fast cache: Intel SSDs on 128 nodes
  • 1.6 TB/node, 3.20/1.32 GB/s/node for read/write
  • For staging, checkpointing, and data-intensive applications
  • 16 of these nodes can directly access external resources (servers, storage, sensor networks, etc.)
• Network: Intel Omni-Path, 100 Gbps, full bi-section
• Storage: DDN EXAScaler (Lustre), 12.4 PB, 193.9 GB/s
• Power consumption: 950.5 kVA
• Operation starts: July 1st, 2019

15 Tokyo Tech. TSUBAME3.0 – TSUBAME Supercomputing History

World-leading supercomputing: a 100,000x speedup in 17 years through world-leading use of massively parallel, many-core technology (2000: Matsuoka GSIC appointment).

• 2000: custom supercomputer, 128 Gigaflops, 32 cores
• 2002: "TSUBAME0", 1.3 TeraFlops, 800 cores, first "TeraScale" Japanese university supercomputer
• 2006: TSUBAME1.0, 80 TeraFlops, 10,000 cores, No.1 Asia / No.7 World
• 2008: TSUBAME1.2, 170 TeraFlops, world's first GPU supercomputer
• 2010: TSUBAME2.0, 2.4 Petaflops, No.1 World green production supercomputer, ACM Gordon Bell Prize

Underlying technologies: general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling.

• 2013: TSUBAME2.5 (upgraded, 4118 GPUs), 5.7 Petaflops (17.1 AI-Petaflops), No.2 Japan
• 2013: TSUBAME-KFC, TSUBAME3 prototype, oil immersion cooling, Green World No.1; 2015: AI prototype upgrade (KFC/DL)
• 2017: TSUBAME3.0, 12.1 Petaflops (47.2 AI-Petaflops), >10 million cores, Green World No.1, HPC and Big Data / AI convergence

ABCI: AI Supercomputer to Serve Academic & Industry AI Research in Japan, hosted by AIRC-AIST (2018)

Commoditizing TSUBAME3 supercomputer technologies (17 AI-Petaflops, 70 kW/rack, free cooling) to the cloud:
• 4332 Volta GPUs + 2166 Skylake CPUs
• 0.55 AI-Exaflops, 37 DFP Petaflops
• ~2 Petabytes NVMe, 20 Petabytes HDD
• DNN-training-optimized InfiniBand EDR
• New IDC: single floor, inexpensive & fast build, hard concrete floor with 2 tons/m2 weight tolerance
• Racks: 144 max.; ABCI uses 43 racks for compute & storage
• Power capacity: 3.25 MW max. (ABCI 2.3 MW max.); total 3.2 MW min. (summer)
• Water-air hybrid warm-water "free" cooling: 70 kW/rack, PUE < 1.1; 32°C free cooling even at ~39°C external temperature with high humidity
• ABCI IDC built in 7 months; in operation since Aug. 2018

Training ImageNet in Minutes

Rio Yokota, Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Hiroki Naganuma, Shun Iwase, Kaku Linsho, Satoshi Matsuoka (Tokyo Institute of Technology / Riken) + Akira Naruse (NVIDIA)

Group (system) | # GPUs | Time
Facebook | 512 | 30 min
Preferred Networks | 1024 | 15 min
UC Berkeley | 2048 | 14 min
Tencent | 2048 | 6.6 min
Sony (ABCI) | ~3000 | 3.7 min
Google (TPU/GCC) | 1024 | 2.2 min
Fujitsu Lab+ (ABCI) | 2048 | 75 sec
TokyoTech/NVIDIA/Riken (ABCI) | 4096 | ??

Source: Ben-Nun & Hoefler, https://arxiv.org/pdf/1802.09941.pdf
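For scale, a rough per-GPU throughput estimate for the 75-second entry, assuming the standard ImageNet recipe of ~1.28M images trained for ~90 epochs (the epoch count is an assumption, not stated on the slide):

$$\frac{1.28\times10^{6}\ \mathrm{images} \times 90\ \mathrm{epochs}}{75\ \mathrm{s} \times 2048\ \mathrm{GPUs}} \approx 750\ \mathrm{images/s\ per\ GPU}$$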

Massive Scale Deep Learning on Post-K

Post-K Processor:
◆ High-performance FP16 & INT8
◆ High memory BW for convolution
◆ Built-in scalable Tofu network for massive-scale model & data parallelism

High-performance, ultra-scalable network: the Tofu network's high injection BW enables fast reduction. High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM). Together these target unprecedented scalability of data/model parallelism (a sketch of the reduction step follows below).
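The "fast reduction" above refers to averaging gradients across nodes in data-parallel training. A minimal, self-contained sketch of that step using a generic MPI all-reduce (the buffer name and model size are illustrative, not from the slides; the collective is what a high-injection-BW torus like Tofu accelerates):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NPARAMS = 1024 };               /* toy model size */
    float grad[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        grad[i] = (float)rank;             /* stand-in for a locally computed gradient */

    /* Sum gradients across all ranks in place, then divide to get the mean. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= size;

    if (rank == 0)
        printf("mean gradient[0] = %f\n", grad[0]);
    MPI_Finalize();
    return 0;
}
```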

19 Large AI Infrastructures in Japan

System (host) | Deployed | Purpose / Access | AI Processor | Inference Peak | Training Peak (FP32/FP16) | Top500 Perf / Rank | Green500 Perf / Rank
TSUBAME3 (Tokyo Tech.) | July 2017 | HPC + AI, public | NVIDIA P100 x 2160 | 45.8 PF (FP16) | 22.9 PF / 45.8 PF | 8.125 PF #22 | 13.704 GF/W #5
Reedbush-H/L (U-Tokyo) | Apr. 2018 (update) | HPC + AI, public | NVIDIA P100 x 496 | 10.71 PF (FP16) | 5.36 PF / 10.71 PF | unranked | unranked
ITO-B (U-Kyushu) | Oct. 2017 | HPC + AI, public | NVIDIA P100 x 512 | 11.1 PF (FP16) | 5.53 PF / 11.1 PF | unranked | unranked
AICC (AIST-AIRC) | Oct. 2017 | AI, lab only | NVIDIA P100 x 400 | 8.64 PF (FP16) | 4.32 PF / 8.64 PF | 0.961 PF #446 | 12.681 GF/W #7
Raiden (Riken-AIP) | Apr. 2018 (update) | AI, lab only | NVIDIA V100 x 432 | 54.0 PF (FP16) | 6.40 PF / 54.0 PF | 1.213 PF #280 | 11.363 GF/W #10
ABCI (AIST-AIRC) | Aug. 2018 | AI, public | NVIDIA V100 x 4352 | 544.0 PF (FP16) | 65.3 PF / 544.0 PF | 19.88 PF #7 | 14.423 GF/W #4
(unnamed) (NICT) | Summer 2019 | AI, lab only | NVIDIA V100, about x 1700 | ~210 PF (FP16) | ~26 PF / ~210 PF | ???? | ????
C.f. Summit (US ORNL) | Summer 2018 | HPC + AI, public | NVIDIA V100 x 27,000 | 3,375 PF (FP16) | 405 PF / 3,375 PF | 143.5 PF #1 | 14.668 GF/W #3
Post-K (Riken R-CCS) | 2020~2021 | HPC + AI, public | Fujitsu A64fx, > x 150,000 | > 4000 PF (INT8) | > 1000 PF / > 2000 PF | > 400 PF #1 (2020?) | > 15 GF/W ???
ABCI 2, speculative (AIST-AIRC) | 2022~2023 | AI, public | Future GPU, ~5000 | similar | similar | ~100 PF ??? | 25~30 GF/W
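A useful property of this table: dividing a system's Top500 (HPL) performance by its Green500 efficiency recovers its approximate power draw during the HPL run, since Green500 reports sustained GFLOPS per watt. For ABCI, for example:

$$\frac{19.88\ \mathrm{PF}}{14.423\ \mathrm{GF/W}} = \frac{19.88\times10^{6}\ \mathrm{GF/s}}{14.423\ (\mathrm{GF/s})/\mathrm{W}} \approx 1.38\ \mathrm{MW}$$

This is only approximate, since Green500 power-measurement scopes vary between submissions.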