BDEC2 Poznan Japanese HPC Infrastructure Update

Presenter: Masaaki Kondo, Riken R-CCS/Univ. of Tokyo (on behalf of Satoshi Matsuoka, Director, Riken R-CCS)

1 Post-K: The Game Changer (2020)

1. Heritage of the K-Computer: high performance in simulation via extensive co-design
• High performance: up to 100x the performance of K in real applications
• Multitudes of scientific breakthroughs via Post-K application programs
• Simultaneous high performance and ease of programming

2. New technology innovations of Post-K: global leadership not just in the machine & apps, but as cutting-edge IT
• High performance, especially via high memory BW: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps via BW & vector acceleration
• Very green, e.g. extreme power efficiency: ultra power-efficient design & various power control knobs
• Arm global ecosystem & SVE contribution: top CPU in the ARM ecosystem of 21 billion chips/year; SVE co-design and world's first implementation (ARM: a massive ecosystem from embedded to HPC)
• High performance on Society 5.0 apps incl. AI: architectural features for high performance on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, blockchain security, etc.

Technology not limited to Post-K, but extending into societal IT infrastructures, e.g. clouds.

2 Arm64fx & Post-K (to be renamed)

• Fujitsu-Riken design A64fx: ARM v8.2 (SVE), 48/52-core CPU

• HPC optimized: extremely high package-integrated memory BW (1 TB/s), on-die Tofu-D network BW (~400 Gbps), high SVE FLOPS (~3 TFLOPS), various AI support (FP16, INT8, etc.)

• General-purpose CPU – Linux, Windows (Word), other SCs/clouds

• Extremely power efficient – >10x performance-per-watt on a CFD benchmark over current mainstream x86 CPUs
• Largest and fastest supercomputer ever built, circa 2020:

• >150,000 nodes, superseding LLNL Sequoia

• >150 PB/s memory BW

• Tofu-D 6D torus network, 60 Pbps injection BW (~10x global IDC traffic; see the arithmetic after this list)

• 25~30 PB NVMe L1 storage

• ~10,000-endpoint 100 Gbps I/O network into Lustre

• The first 'exascale' machine (not exa 64-bit flops, but in application performance)
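The 60 Pbps injection-bandwidth figure is consistent with the node count and per-node Tofu-D bandwidth quoted above:

$$150{,}000\ \mathrm{nodes} \times 400\ \mathrm{Gbps/node} = 6\times10^{16}\ \mathrm{bit/s} = 60\ \mathrm{Pbps}$$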

3 Post-K A64fx Processor

• A many-core ARM CPU…
  • 48 compute cores + 2 or 4 assistant (OS) cores
  • Brand-new core design
  • Near Xeon-class integer performance per core
  • ARM v8: 64-bit ARM ecosystem
  • Tofu-D + PCIe Gen3 external connections

• …but also an accelerated GPU-like processor
  • SVE 512-bit vector extensions (ARM & Fujitsu); see the sketch after this list

  • Integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
  • Cache + scratchpad-like local memory (sector cache)
  • HBM2 on-package memory – massive memory BW (~0.4 bytes per DP flop)

  • Streaming memory access, strided access, scatter/gather, etc.
  • Intra-chip barrier synchronization and other memory-enhancing features

• GPU-like high performance in HPC, AI/Big Data, autonomous driving…
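As a concrete illustration of the SVE programming model mentioned above, here is a minimal vector-length-agnostic daxpy kernel using ACLE SVE intrinsics. The kernel is illustrative and not from the slides; the same binary would use 512-bit vectors on A64FX and other widths on other SVE parts.

```c
#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic daxpy (y[i] += a * x[i]) with ACLE SVE intrinsics.
   Compile with e.g. gcc -O2 -march=armv8.2-a+sve */
void daxpy_sve(int64_t n, double a, const double *x, double *y) {
    for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd() = doubles per vector */
        svbool_t pg = svwhilelt_b64_s64(i, n);      /* predicate masks the tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);      /* predicated loads */
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);          /* vy += vx * a */
        svst1_f64(pg, &y[i], vy);                   /* predicated store */
    }
}
```

Because the loop never hard-codes the vector width, the same code exercises the full 512-bit datapath on A64FX without recompilation.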

4 A64FX: Spec Summary

• Arm SVE, high performance and high efficiency
• DP performance 2.7+ TFLOPS, >90% of peak on DGEMM
• Memory BW 1024 GB/s, >80% of peak on STREAM Triad

ISA: Armv8.2-A, SVE
Process technology: 7 nm
Peak DP performance: >2.7 TFLOPS
SIMD width: 512-bit
Number of cores: 48 + 4 (4 CMGs, each with 12 compute cores + 1 assistant core)
Memory capacity: 32 GiB (HBM2 x 4)
Memory peak bandwidth: 1024 GB/s
PCIe: Gen3, 16 lanes
High-speed interconnect: Tofu-D, integrated on chip
(CMG: Core Memory Group; NOC: Network on Chip)
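The STREAM Triad figure above is sustained bandwidth on a simple three-array kernel; note that 1024 GB/s against 2.7 DP TFLOPS is exactly the ~0.4 bytes-per-flop ratio cited earlier. A minimal single-threaded sketch of the Triad measurement follows (array size and timing scaffolding are illustrative; the real benchmark runs many repetitions across threads):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 25)  /* 32M doubles per array (~256 MB), enough to defeat caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* Triad: a = b + scalar * c */
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* Triad touches three arrays per iteration: 2 reads + 1 write = 24 bytes */
    printf("Triad bandwidth: %.1f GB/s\n", 24.0 * N / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```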

5 Preliminary performance evaluation results

• Over 2.5x faster on HPC & AI benchmarks than SPARC64 XIfx

6 Preliminary performance evaluation results

• Himeno Benchmark (Fortran90)
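Himeno measures a Jacobi-style pressure-Poisson stencil sweep and is strongly memory-bandwidth-bound, which is why high-bandwidth parts (A64FX, SX-Aurora TSUBASA) do well on it. A simplified C sketch of the idea follows; the real benchmark uses a 19-point stencil with coefficient arrays, so this 7-point version is illustrative only:

```c
/* Simplified Jacobi stencil sweep in the spirit of the Himeno benchmark.
   The real Himeno kernel uses a 19-point stencil with coefficient arrays;
   this 7-point version is illustrative only. */
#define NX 128
#define NY 128
#define NZ 128

static float p[NZ][NY][NX];    /* pressure field */
static float wrk[NZ][NY][NX];  /* next iterate */

float jacobi_sweep(void) {
    float gosa = 0.0f;  /* accumulated squared residual, as in Himeno */
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++) {
                float s = p[k][j][i-1] + p[k][j][i+1]
                        + p[k][j-1][i] + p[k][j+1][i]
                        + p[k-1][j][i] + p[k+1][j][i];
                float np = s / 6.0f;          /* average of the 6 neighbours */
                float d  = np - p[k][j][i];
                gosa += d * d;
                wrk[k][j][i] = np;
            }
    /* wrk is copied back to p between sweeps (omitted for brevity) */
    return gosa;
}
```

Each grid point needs several loads with little data reuse per flop, so sustained performance tracks memory bandwidth rather than peak FLOPS.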

† "Performance evaluation of a vector supercomputer SX-Aurora TSUBASA", SC18, https://dl.acm.org/citation.cfm?id=3291728

7 Overview of Post-K System & Storage

• 3-level hierarchical storage
  • 1st layer: GFS cache + temp FS (25~30 PB NVMe)
  • 2nd layer: Lustre-based GFS (a few hundred PB of HDD)
  • 3rd layer: off-site cloud storage

• Full machine spec
  • >150,000 nodes, ~8 million high-performance Arm v8.2 cores
  • >150 PB/s memory BW
  • Tofu-D at 60 Pbps, 10x global IDC traffic
  • ~10,000 I/O fabric endpoints
  • >400 racks
  • ~40 MW machine + IDC, PUE ~1.1, high-pressure DLC (direct liquid cooling)
  • NRE pays off: roughly equivalent to 15~30 million state-of-the-art competing CPU cores for HPC workloads (both dense and sparse problems)
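PUE (power usage effectiveness) is total facility power divided by power delivered to the IT equipment. If the ~40 MW figure above is read as the total machine + IDC draw (an interpretation; the slide does not break it down), PUE ~1.1 implies roughly:

$$P_{\mathrm{IT}} \approx \frac{40\ \mathrm{MW}}{1.1} \approx 36\ \mathrm{MW}$$

i.e. cooling and other facility overheads add only about 10% on top of the IT load.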

8 Preparing the 40+ MW Facility (actual photo)

9 What is HPCI?

• World's top-class computing resources

• Open to worldwide HPC communities

The operation of the K computer will stop in August 2019.

10 HPCI Computational Resources (sites / machines)

• HPCI provides a wide variety of supercomputer resources

11 HPCI Tier 2 Systems Roadmap (As of Nov. 2018)

System roadmaps by site, FY2016 through FY2027 (power figures are the maximum consumption, including cooling):

Hokkaido: HITACHI SR16000/M1 (172 TF, 22 TB), Cloud System BS2000 (44 TF, 14 TB), Data Science Cloud / Storage HA8000 / WOS7000 (10 TF, 1.96 PB) → 3.96 PF (UCC + CFL/M) 0.9 MW + 0.16 PF cloud 0.1 MW → 35 PF (UCC + CFL-M) 2 MW

Tohoku: SX-ACE (707 TF, 160 TB, 655 TB/s) + LX406e (31 TF), storage (4 PB), 3D vis, 2 MW → ~30 PF, ~30 PB/s memory BW (CFL-D/CFL-M) ~3 MW → 100~200 PF, 100~200 PB/s (CFL-D/CFL-D) ~4 MW

Tsukuba: HA-PACS (1166 TF), COMA (PACS-IX) (1001 TF), PPX1/PPX2 (62 TF each) → Cygnus 3+ PF (TPF) 0.4 MW → PACS-XI 100 PF (TPF)

Tsukuba & Tokyo (JCAHPC): Oakforest-PACS (OFP) 25 PF → 100+ PF, 4.5-6.0 MW

Tokyo: Fujitsu FX10 (Oakleaf/Oakbridge) (1.27 PF, 168 TiB, 460 TB/s) (UCC + TPF) 3.2 MW; Hitachi SR16K/M1 (54.9 TF, 10.9 TiB, 28.7 TB/s); Reedbush-U/H 1.92 PF (FAC) 0.7 MW (Reedbush-U until the end of June 2020); Reedbush-L 1.4 PF (FAC) 0.2 MW → Oakbridge-II 4+ PF (UCC + TPF) 1.0 MW; BDEC 60+ PF (FAC) 3.5-4.5 MW → 200+ PF (FAC) 6.5-8.0 MW

Tokyo Tech.: TSUBAME 2.5 (5.7 PF, 110+ TB, 1160 TB/s) 1.4 MW (possible extended operation as TSUBAME 2.6, up to 5 PF) → TSUBAME 3.0 (12.15 PF, 1.66 PB/s) 1.4 MW total → TSUBAME 4.0 (~100 PF, ~10 PB/s, ~2.0 MW)

Nagoya: Fujitsu FX100 (2.9 PF, 81 TiB), Fujitsu CX400 (774 TF, 71 TiB) (542 TF, 71 TiB) → 20+ PF (FAC/UCC + CFL-M) up to 3 MW → 100+ PF (FAC/UCC + CFL-M) up to 3 MW, 2 MW in total

Kyoto: Cray XE6 + GB8K XC30 (983 TF), Cray XC30 (584 TF) → Cray XC40 (5.5 PF) + CS400 (1.0 PF) 1.33 MW → 20-40+ PF (FAC/TPF + UCC) 1.5 MW → 80-150+ PF (FAC/TPF + UCC) 2 MW

Osaka: NEC SX-ACE (423 TF), NEC Express5800 (22.4 TF), OCTOPUS 1.463 PF (UCC) → 3.2 PB/s, 15~25 PF (CFL-M) 1.0-1.5 MW → 25.6 PB/s, 50-100 PF (TPF) 1.5-2.0 MW

Kyushu: HA8000 (712 TF, 242 TB), SR16000 (8.2 TF, 6 TB), FX10 (272.4 TF, 36 TB), CX400 (966.2 TF, 183 TB), FX10 (90.8 TF) → Fujitsu CX subsystem A + B, 10.4 PF (UCC/TPF) 2.7 MW, 2.0 MW → 100+ PF (FAC/TPF + UCC/TPF) ~3 MW

JAMSTEC: SX-ACE (1.3 PF, 320 TiB) 3 MW → 100 PF, 3 MW

ISM: UV2000 (98 TF, 128 TiB) 0.3 MW → 2 PF, 0.3 MW

12 Supercomputer roadmap in ITC/U.Tokyo

• 2 big systems, 6-year cycle (timeline covers FY2011 through FY2025)

13 ITC/U.Tokyo now operating 2 (or 4) systems!

2,000+ users (1,000 from outside of U.Tokyo)

• Reedbush (HPE, Intel BDW + NVIDIA P100 (Pascal))
  • Integrated Supercomputer System for Data Analyses & Scientific Simulations

  • Jul. 2016 - Jun. 2020 (RB-U); through Mar. 2021 (RB-H/L)
  • Reedbush-U: CPU only, 420 nodes, 508 TF (Jul. 2016)
  • Reedbush-H: 120 nodes, 2 GPUs/node, 1.42 PF (Mar. 2017)
  • Reedbush-L: 64 nodes, 4 GPUs/node, 1.43 PF (Oct. 2017)
  • 502.2 kW (w/o cooling, air-cooled)

• Oakforest-PACS (OFP) (Fujitsu, Intel Xeon Phi (KNL))
  • JCAHPC: Joint Center for Advanced HPC (U.Tsukuba & U.Tokyo)
  • 8,208 nodes, 25 PF peak, 13.55 PF (HPL)

  • TOP500 #14 (#2 in Japan), HPCG #9 (#3 in Japan) (Nov. 2018)
  • Omni-Path Architecture, full bi-section fat-tree
  • DDN IME (burst buffer), Lustre 26 PB

  • IO500 #1 (June 2018), #4 (Nov. 2018)
  • 4.24 MW (incl. cooling), 3.44 MW (w/o cooling)

14 Oakbridge-CX (OBCX): awarded to Fujitsu

• Intel Xeon Scalable Processors (Cascade Lake-SP)
  • Platinum 8280 (28 cores, 2.7 GHz) x 2 sockets
• Overview
  • 1,368 nodes, 6.61 PF peak
  • Aggregated memory bandwidth: 385.1 TB/s
  • Total HPL performance: 4.2+ PF
• Fast cache: Intel SSDs on 128 nodes
  • 1.6 TB/node, 3.20/1.32 GB/s/node for read/write
  • For staging, checkpointing, and data-intensive applications
  • 16 of these nodes can directly access external resources (servers, storage, sensor networks, etc.)
• Network: Intel Omni-Path, 100 Gbps, full bi-section
• Storage: DDN EXAScaler (Lustre), 12.4 PB, 193.9 GB/s
• Power consumption: 950.5 kVA
• Operation starts: July 1st, 2019

15 Tokyo Tech. TSUBAME3.0 – TSUBAME Supercomputing History

World-leading supercomputing: a 100,000x speedup in 17 years through world-leading use of massively parallel, many-core technology (2000: Matsuoka GSIC appointment).

• 2000: custom supercomputer, 128 Gigaflops, 32 cores
• 2002: "TSUBAME0", 1.3 TeraFlops, 800 cores, first "TeraScale" Japanese university supercomputer
• 2006: TSUBAME1.0, 80 TeraFlops, 10,000 cores, No.1 Asia / No.7 World
• 2008: TSUBAME1.2, 170 TeraFlops, world's first GPU supercomputer
• 2010: TSUBAME2.0, 2.4 Petaflops, No.1 World green production supercomputer, ACM Gordon Bell Prize

Underlying technologies: general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling.

• 2013: TSUBAME2.5 (upgraded, 4118 GPUs), 5.7 Petaflops (17.1 AI-Petaflops), No.2 Japan
• 2013: TSUBAME-KFC, TSUBAME3 prototype, oil immersion cooling, Green World No.1; 2015: AI prototype upgrade (KFC/DL)
• 2017: TSUBAME3.0, 12.1 Petaflops (47.2 AI-Petaflops), >10 million cores, Green World No.1, HPC and Big Data / AI convergence

ABCI: AI Supercomputer to Serve Academic & Industry AI Research in Japan, hosted by AIRC-AIST (2018)

Commoditizing TSUBAME3 supercomputer technologies (17 AI-Petaflops, 70 kW/rack, free cooling) to the cloud:
• 4332 Volta GPUs + 2166 Skylake CPUs
• 0.55 AI-Exaflops, 37 DFP Petaflops
• ~2 Petabytes NVMe, 20 Petabytes HDD
• DNN-training-optimized InfiniBand EDR
• New IDC: single floor, inexpensive & fast build, hard concrete floor with 2 tons/m2 weight tolerance
• Racks: 144 max.; ABCI uses 43 racks for compute & storage
• Power capacity: 3.25 MW max. (ABCI 2.3 MW max.); total 3.2 MW min. (summer)
• Water-air hybrid warm-water "free" cooling: 70 kW/rack, PUE < 1.1; 32°C free cooling even at ~39°C external temperature with high humidity
• ABCI IDC built in 7 months; in operation since Aug. 2018

Training ImageNet in Minutes

Rio Yokota, Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Hiroki Naganuma, Shun Iwase, Kaku Linsho, Satoshi Matsuoka (Tokyo Institute of Technology / Riken) + Akira Naruse (NVIDIA)

Group (system) | # GPUs | Time
Facebook | 512 | 30 min
Preferred Networks | 1024 | 15 min
UC Berkeley | 2048 | 14 min
Tencent | 2048 | 6.6 min
Sony (ABCI) | ~3000 | 3.7 min
Google (TPU/GCC) | 1024 | 2.2 min
Fujitsu Lab+ (ABCI) | 2048 | 75 sec
TokyoTech/NVIDIA/Riken (ABCI) | 4096 | ??

Source: Ben-Nun & Hoefler, https://arxiv.org/pdf/1802.09941.pdf
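For scale, a rough per-GPU throughput estimate for the 75-second entry, assuming the standard ImageNet recipe of ~1.28M images trained for ~90 epochs (the epoch count is an assumption, not stated on the slide):

$$\frac{1.28\times10^{6}\ \mathrm{images} \times 90\ \mathrm{epochs}}{75\ \mathrm{s} \times 2048\ \mathrm{GPUs}} \approx 750\ \mathrm{images/s\ per\ GPU}$$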

Massive Scale Deep Learning on Post-K

Post-K Processor:
◆ High-performance FP16 & INT8
◆ High memory BW for convolution
◆ Built-in scalable Tofu network for massive-scale model & data parallelism

High-performance, ultra-scalable network: the Tofu network's high injection BW enables fast reduction. High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM). Together these target unprecedented scalability of data/model parallelism (a sketch of the reduction step follows below).
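The "fast reduction" above refers to averaging gradients across nodes in data-parallel training. A minimal, self-contained sketch of that step using a generic MPI all-reduce (the buffer name and model size are illustrative, not from the slides; the collective is what a high-injection-BW torus like Tofu accelerates):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NPARAMS = 1024 };               /* toy model size */
    float grad[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        grad[i] = (float)rank;             /* stand-in for a locally computed gradient */

    /* Sum gradients across all ranks in place, then divide to get the mean. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= size;

    if (rank == 0)
        printf("mean gradient[0] = %f\n", grad[0]);
    MPI_Finalize();
    return 0;
}
```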

19 Large AI Infrastructures in Japan

System (host) | Deployed | Purpose / Access | AI Processor | Inference Peak | Training Peak (FP32/FP16) | Top500 Perf / Rank | Green500 Perf / Rank
TSUBAME3 (Tokyo Tech.) | July 2017 | HPC + AI, public | NVIDIA P100 x 2160 | 45.8 PF (FP16) | 22.9 PF / 45.8 PF | 8.125 PF #22 | 13.704 GF/W #5
Reedbush-H/L (U-Tokyo) | Apr. 2018 (update) | HPC + AI, public | NVIDIA P100 x 496 | 10.71 PF (FP16) | 5.36 PF / 10.71 PF | unranked | unranked
ITO-B (U-Kyushu) | Oct. 2017 | HPC + AI, public | NVIDIA P100 x 512 | 11.1 PF (FP16) | 5.53 PF / 11.1 PF | unranked | unranked
AICC (AIST-AIRC) | Oct. 2017 | AI, lab only | NVIDIA P100 x 400 | 8.64 PF (FP16) | 4.32 PF / 8.64 PF | 0.961 PF #446 | 12.681 GF/W #7
Raiden (Riken-AIP) | Apr. 2018 (update) | AI, lab only | NVIDIA V100 x 432 | 54.0 PF (FP16) | 6.40 PF / 54.0 PF | 1.213 PF #280 | 11.363 GF/W #10
ABCI (AIST-AIRC) | Aug. 2018 | AI, public | NVIDIA V100 x 4352 | 544.0 PF (FP16) | 65.3 PF / 544.0 PF | 19.88 PF #7 | 14.423 GF/W #4
(unnamed) (NICT) | Summer 2019 | AI, lab only | NVIDIA V100, about x 1700 | ~210 PF (FP16) | ~26 PF / ~210 PF | ???? | ????
C.f. Summit (US ORNL) | Summer 2018 | HPC + AI, public | NVIDIA V100 x 27,000 | 3,375 PF (FP16) | 405 PF / 3,375 PF | 143.5 PF #1 | 14.668 GF/W #3
Post-K (Riken R-CCS) | 2020~2021 | HPC + AI, public | Fujitsu A64fx, > x 150,000 | > 4000 PF (INT8) | > 1000 PF / > 2000 PF | > 400 PF #1 (2020?) | > 15 GF/W ???
ABCI 2, speculative (AIST-AIRC) | 2022~2023 | AI, public | Future GPU, ~5000 | similar | similar | ~100 PF ??? | 25~30 GF/W
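A useful property of this table: dividing a system's Top500 (HPL) performance by its Green500 efficiency recovers its approximate power draw during the HPL run, since Green500 reports sustained GFLOPS per watt. For ABCI, for example:

$$\frac{19.88\ \mathrm{PF}}{14.423\ \mathrm{GF/W}} = \frac{19.88\times10^{6}\ \mathrm{GF/s}}{14.423\ (\mathrm{GF/s})/\mathrm{W}} \approx 1.38\ \mathrm{MW}$$

This is only approximate, since Green500 power-measurement scopes vary between submissions.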