INTRODUCTION TO HPC | PATH TO EXASCALE

Ondřej Vysocký | Infrastructure Research Lab, IT4Innovations
Materials taken from top500.org, exascaleproject.org, eurohpc-ju.europa.eu, and vendors' & SC centers' web pages and presentations

PATH TO EXASCALE

TRENDS

TOP500 LIST

▪ List of the most powerful supercomputers
▪ Updated 2x a year – at ISC (June) and SC (November)
▪ Since 1993 ranked by the High-Performance Linpack (HPL) benchmark
▪ Since 2017 also the High-Performance Conjugate Gradient (HPCG) benchmark
▪ Since 2013 the Green500 list
▪ Since 2019 HPL-AI – not a list yet – mixed-precision algorithms
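The core task HPL times can be shown with a toy solver: it solves a dense linear system Ax = b (via LU factorization with partial pivoting) and credits roughly 2/3·n³ floating-point operations. A minimal stdlib-Python sketch of the same computation, not the real benchmark:

```python
import random

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting on an n x n system,
    the same kind of computation HPL times at massive scale."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for k in range(n):
        # partial pivoting: bring the largest |entry| into the pivot position
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def hpl_flops(n):
    """Operation count HPL credits for an n x n solve."""
    return 2.0 / 3.0 * n ** 3 + 2.0 * n ** 2

random.seed(0)
n = 50
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
x_true = [1.0] * n
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
x = solve_dense(A, b)
print(max(abs(xi - 1.0) for xi in x))  # tiny residual
```

The benchmark reports flops/runtime; HPCG instead stresses memory bandwidth with a sparse iterative solve, which is why its scores are far below HPL's.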

[Charts: TOP500 list development and HPL + HPCG results (snapshots from June 2008, June 2013, 6/2019, and 11/2020); annotations: "ARM", "No China", "EU +11, 12, 15, 16, 18", "Where's Russia?!"]

11/2020 TOP500 LIST HPL

[Chart: power needed to reach 1 EFlops by scaling up systems from the 11/2020 list: x2 = 60 MW, x5 = 50 MW, x8 = 60 MW, x8 = 123 MW, x13 = 34 MW, x10 = 185 MW, x14 = 25 MW.]

The exascale goal is 50 GFlops/Watt = a 20 MW system.

GREEN500

• Direct warm-water cooling (separate cooling circuits for CPUs and GPUs)
• Availability of power-control knobs
• Higher heterogeneity of new systems – accelerators, GPGPUs, FPGAs, single/mixed-precision units
• Decarbonization
• AI everywhere
• And many more
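The arithmetic behind these power figures is simple: system power at a target performance is the performance divided by the machine's GFlops/Watt rating. A small sketch (the 26 GFlops/W value is an illustrative efficiency, not a quoted one):

```python
def system_power_mw(target_flops, gflops_per_watt):
    """Power (in MW) a system of the given efficiency needs
    to deliver the target performance."""
    watts = target_flops / (gflops_per_watt * 1e9)
    return watts / 1e6

EXAFLOP = 1e18
print(system_power_mw(EXAFLOP, 50))   # 20.0 -> the 20 MW exascale goal
print(system_power_mw(EXAFLOP, 26))   # ~38 MW for a 26 GFlops/W machine
```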

[Charts: Green500 11/2020 – accelerators in the leading systems: Nvidia A100, Nvidia V100, MN-Core; Summit highlighted.]

FUGAKU SUPERCOMPUTER

• 158,976 nodes, node peak performance 3.4 TFlop/s
• Fujitsu A64FX, Arm v8.2-A, 48 (+4) cores, SVE 512-bit instructions
• High-bandwidth 3D stacked memory: 4x 8 GB HBM with 1,024 GB/s
• On-die Tofu-D network (~400 Gbps)
• 29.9 MW
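A quick sanity check of these numbers: node count times node peak gives the system peak, and dividing by power gives the efficiency.

```python
nodes = 158_976
node_peak_tflops = 3.4
power_mw = 29.9

peak_pflops = nodes * node_peak_tflops / 1000.0          # TFlops -> PFlops
gflops_per_watt = peak_pflops * 1e6 / (power_mw * 1e6)   # GFlops per Watt
print(round(peak_pflops, 1))       # ~540.5 PFlops peak
print(round(gflops_per_watt, 1))   # ~18.1 GFlops/W
```

At ~18 GFlops/W, Fugaku sits well short of the 50 GFlops/W exascale goal, which is exactly the gap the Green500 trends target.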

[Diagram: Tofu interconnect IN/OUT ports; direct water cooling.]

GREEN500 #1 2020/6: MN-3

• 2x Platinum 8260M (CSL) 24C 2.4 GHz
• Optane persistent memory
• MN-Core
  • Preferred Networks' accelerator
  • Specialized for deep-learning training
  • Optimized for energy efficiency
  • Efficiency above one teraflop per watt
  • 1 matrix arithmetic unit (MAU) + 4 processing elements (PE, provide data to the MAU) = matrix arithmetic block (MAB)
  • 4 dies per chip, 512 MABs per die
• Air cooled ?!

PATH TO EXASCALE

ROADMAPS

USA ROADMAP

1.5 EFlops AMD CPU + GPU

1 EFlops Intel CPU + Intel Xe

>2 EFlops, ~40 MW, AMD CPU + GPU
High variability of CPU and GPU vendors

USA ROADMAP

$500M, 1 EFlops, <=60 MW, 2 Intel Sapphire Rapids + 6 Intel Xe per node

$600M, 1.5 EFlops, ~30 MW, AMD 1 CPU + 4 GPUs

>2 EFlops, ~40 MW, AMD CPU + GPU – 1st exascale system?

CHINA

▪ Homogeneous
  ▪ NUDT: Tianhe-2A (2018, Intel Xeon + Matrix-2000, 95 PFlops) -> Tianhe-3 (2021?, Matrix-3000, ~1.3 EFlops); 100 cabinets, 128 blades each, 8 CPUs per blade
  ▪ NRCPC: Sunway TaihuLight (ShenWei 26010) -> NRCPC prototype (ShenWei 26010) -> ?
  ▪ ShenWei 26010 = 260 cores, 4 core groups, 3 TFlops

▪ Accelerated ▪ Sugon prototype (Hygon CPU+DCU ACC) -> Sugon (Hygon accelerated)

Matrix-3000
• >=96 cores, >10 TFlops
• HBM2
• Supports half precision

Hygon CPU
• Licensed AMD EPYC clone

ShenWei 26010
• 260 cores, 4 core groups
• 3 TFlops

THE EUROHPC JOINT UNDERTAKING

▪ A legal and funding agency

▪ 32 member countries

▪ A co-funding programme to build a pan-European supercomputing infrastructure

Medium-to-high-range supercomputers
▪ At least 4 PFlops
▪ Bulgaria, Czech Republic, Luxembourg, Portugal, Slovenia
▪ Expected installation by H1 2021

High-range pre-exascale supercomputers
▪ 150-200 PFlops
▪ Finnish, Spanish, and Italian consortia
▪ Expected installation mid-2021

Next generations of systems planned for 2023-2024 (exascale) and 2026-2027

EUROPEAN PRE-EXASCALE SYSTEMS

▪ H2 2021

▪ 240M €

▪ 248 PFlops

▪ 2 Intel Xeon Ice Lake CPUs + 4 Nvidia A100 GPUs

MareNostrum V

▪ 200 PFlops

▪ €223 million

[Chart overlay: a heterogeneous pre-exascale system, 552 PFlops peak, 375 PFlops LINPACK, mid-2021]

IT4INNOVATIONS ROADMAP

▪ EURO_IT4I Q1 2021
▪ 15.2 PFlops
▪ AMD EPYC + Nvidia A100
▪ Homogeneous (2x 7H12), accelerated (2x 7452 + 8x A100), visualization (Nvidia Quadro RTX 6000), big data (32x Intel Xeon 8268, 24,576 GB RAM), and cloud partitions
▪ 200 Gb/s interconnect

11. 12. 2020 IT4INNOVATIONS ROADMAP


▪ New experimental systems
  ▪ Late 2021 – 4 architectures, targeting the most promising technologies
  ▪ Late 2022 – quantum computer?

IT4INNOVATIONS ROADMAP

Name the computer: bit.ly/jmenosuperpocitace
▪ EURO_IT4I Q1 2021
▪ Experimental system late 2021

▪ Experimental system late 2022

[Chart: planned acquisitions 2023-2031 with budgets 9.5M €, 17M €, 5.5M €, 7.5M €, 2M €, 17M €, 2M €, 7.5M €, 4M €.]

PATH TO EXASCALE

HARDWARE

INTEL PROCESSORS

▪ All Intel architectures are delayed
  ▪ One-year delay in 7 nm and 6 months in 10 nm technology
  ▪ An Intel Xeon SP 7 nm CPU is on the roadmap for the first half of 2023

INTEL XEON ICE LAKE SP
▪ 2021: 10 nm, PCIe 4.0, DDR4
▪ H2 2021: 10 nm, PCIe 5.0, DDR5

INTEL XE – PONTE VECCHIO

▪ 1-, 2-, or 4-tile packing design

▪ 4-tile variant should provide over 40 TFlops FP32 (2-tile design for Aurora)

INTEL OPTANE MEMORY

▪ 3D XPoint is a non-volatile memory (NVM)

▪ Another layer in the memory hierarchy

▪ Requires CPU support

AMD EPYC PROCESSORS
A chiplet-based architecture built on Zen cores

▪ Naples (2017) ▪ 14nm

▪ Rome (2019) ▪ 7nm, 8 mem channels, up to 4 TB RAM

▪ Milan (Q1 2021) ▪ 7nm+

▪ Genoa (expected in 2021?)
  ▪ 5 nm

AMD INSTINCT GPUS

MI25 (2017)
▪ 14 nm
▪ 4096 stream processors
▪ 768 GFlops DP
▪ 16 GB HBM2, 484 GB/s
▪ PCIe 3.0
▪ Passively cooled, 300 W TDP

MI50 (2018)
▪ 7 nm
▪ 3840 stream processors, 1725 MHz
▪ 6.6 TFlops DP
▪ 16 GB HBM2, 1024 GB/s
▪ PCIe 3.0/4.0
▪ Passively cooled, 300 W TDP

MI60 (2018)
▪ 4096 stream processors, 1800 MHz
▪ 14.7 TFlops SP
▪ Not on sale any more

MI100 (11/2020)
▪ 7 nm
▪ 7,680 stream processors, 1502 MHz
▪ 11.5 TFlops DP, 92.3 TFlops BFloat
▪ 32 GB HBM2, 1228.8 GB/s
▪ PCIe 3.0/4.0
▪ Passively cooled, 300 W TDP

IBM PROCESSORS

Power10 offers a ~3x performance gain and ~2.6x core efficiency gain over Power9

[Timeline: IBM Power9 (2017) -> Power10 (2021)]

NVIDIA GPUS

NVIDIA Tesla V100 (Volta), 12 nm
▪ 5120 CUDA cores + 640 tensor cores
▪ 16/32 GB HBM2, 900 GB/s
▪ 300 GB/s NVLink
▪ 7.8 TFlops DP
▪ 15.7 TFlops SP

NVIDIA A100 (Ampere), 7 nm
▪ 6912 CUDA cores + 432 tensor cores
▪ 40/80 GB memory, 1.5/2 TB/s
▪ 600 GB/s NVLink
▪ 9.7 TFlops DP, 19.5 TFlops tensor-core DP
▪ 19.5 TFlops SP, 156 TFlops tensor-core SP

TENSOR CORES

Mixed (half) precision computing with tensor cores; since the Ampere architecture also double precision!

NVIDIA DGX PLATFORM
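Why the wide accumulator in the tensor-core slide above matters: multiplying in FP16 but accumulating in FP32 avoids the rounding stall you get when the running sum itself is kept in half precision. A stdlib-Python sketch, using `struct`'s `'e'` format (IEEE 754 binary16) to emulate FP16 rounding:

```python
import struct

def to_fp16(x):
    """Round a float to IEEE 754 half precision via struct's 'e' format."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Sum 4096 copies of 0.0001; the exact answer is 0.4096.
vals = [0.0001] * 4096

# Accumulating entirely in half precision: each partial sum is rounded
# to fp16, so once the sum grows, the tiny addends round away to nothing.
half_acc = 0.0
for v in vals:
    half_acc = to_fp16(half_acc + to_fp16(v))

# Tensor-core style: fp16 inputs, but a wider (double) accumulator.
wide_acc = 0.0
for v in vals:
    wide_acc += to_fp16(v)

print(half_acc)   # stalls well below the true sum
print(wide_acc)   # close to 0.4096
```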

DGX-1
▪ 8x NVIDIA Tesla V100, 32 GB/GPU
▪ 40,960 CUDA cores + 5,120 tensor cores
▪ NVIDIA NVLink – hybrid cube mesh
▪ 512 GB DDR4

DGX-2
▪ 16x NVIDIA Tesla V100
▪ Intel Xeon Platinum
▪ NVSwitch – 2.4 TB/s of bisection bandwidth

NVIDIA DGX-A100 & SUPERPOD

▪ 8x NVIDIA A100 GPU
▪ 2x AMD EPYC Rome CPU
▪ 640 GB memory
▪ 600 GB/s GPU-to-GPU bi-directional bandwidth
▪ 5 PFlops AI
▪ 6.5 kW

NVIDIA DGX-A100 & SUPERPOD


▪ #1 Green500 11/2020
▪ 20-140 DGX-A100
▪ 100-700 PFlops system
▪ 32.5 kW per rack
▪ Deployable in weeks

ARM IN HPC

Arm brings advantages compared to x86 processors. The Arm roadmap expects the 5 nm Poseidon platform in 2021.

Fujitsu A64FX
▪ Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension), 512-bit, 7 nm
▪ 48 computing cores + 4 assistant cores

ThunderX2
▪ ARMv8.1, 64-bit, 14 nm
▪ 32 cores, 128 threads

ThunderX3
▪ ARMv8.3+, 128-bit, 7 nm
▪ 96 cores, 384 threads
▪ Expected in 2020

EUROPEAN PROCESSOR INITIATIVE (EPI)

Europe is investing in the development of a new processor
▪ Security
▪ Competitiveness

Design a roadmap of future European low-power processors
▪ Common platform
▪ General-purpose processor
▪ Accelerator
▪ Automotive

FPGAS IN HPC

Device             Fab [nm]  #cores        Peak performance     TDP [W]  Perf/W [GOPs/W]  Mem BW [GB/s]  Mem type
Intel 10 DX        14        11,520 DSPs   8,600 SP / 143 INT8  ?        1000             512            HBM2
Intel Agilex       10        ?             40,000 FP16          ?        ?                512            HBM2
Xilinx Alveo U280  16        9,024 DSPs    24.5 INT8            225      109              38/460         DDR4/HBM2
Xilinx Alveo U250  16        12,288 DSPs   33.3 INT8            225      148              77             DDR4

QUANTUM COMPUTING
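Back to the FPGA table above: the Perf/W column appears to be peak INT8 throughput (read as TOPs) divided by board TDP, an assumption the two Xilinx rows confirm:

```python
def gops_per_watt(peak_tops_int8, tdp_watts):
    """Energy efficiency: peak INT8 tera-operations/s over board TDP,
    converted to GOPs per watt."""
    return peak_tops_int8 * 1000.0 / tdp_watts  # TOPs -> GOPs

print(round(gops_per_watt(24.5, 225)))  # Alveo U280 -> 109, matches the table
print(round(gops_per_watt(33.3, 225)))  # Alveo U250 -> 148, matches the table
```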

Several basic quantum computer implementations and hardware emulators exist

▪ D-Wave, IBM, Google, Atos, ...

JUNIQ system in Juelich

▪ D-Wave system

PATH TO EXASCALE

SOFTWARE

EXASCALE APPLICATIONS

[Figure: Fugaku software stack]

EXASCALE APPLICATIONS

▪ Earth and space science

▪ Chemistry and materials – medicine, plasma science, molecular dynamics

▪ Energy production and transmission

▪ National security = military

NEW SOFTWARE SPECIFICATIONS

▪ OpenMP 5.1
  ▪ OpenMP 5.0 will be fully implemented in GCC 12, except OMPT and OMPD
  ▪ New directives
    ▪ interop
    ▪ dispatch
    ▪ assume
    ▪ target_device selector
    ▪ ... and many more

▪ MPI 4.0
  ▪ Specification 2/2021, implementations by the end of 2021
  ▪ New features

    ▪ Solution for "big count" operations
    ▪ Persistent collectives
    ▪ Partitioned communication
    ▪ Topology-aware communicators
    ▪ ... and many more

EXASCALE SOFTWARE STACK

Simplified software development for heterogeneous hardware

▪ Intel oneAPI

▪ AMD ROCm

▪ CUDA-X HPC & AI software stack

QUANTUM COMPUTING

Different frameworks and programming languages:

▪ QASM, Qiskit (IBM), Cirq (Google), Forest/pyQuil (Rigetti), Q# (Microsoft), Ocean (D-Wave)

IBM Quantum Experience

▪ Free online access to quantum simulators (up to 32 qubits) and actual quantum computers (1-15 qubits) with different topologies

▪ Programmable with a visual interface and via different languages (Python, QASM, Jupyter notebooks)

Atos myQLM

▪ Freeware for Linux or Windows machines

▪ A quantum software stack for writing, simulating, optimizing, and executing quantum programs; provides up to 20 qubits
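Underneath all of these frameworks sits the same model: gates are unitary operations on a complex statevector. A toy stdlib-Python sketch of the standard two-qubit Bell-state circuit (Hadamard on one qubit, then CNOT), with the gate actions written out by hand instead of via any framework:

```python
import math

r = 1 / math.sqrt(2)

# 2-qubit statevector, basis order |00>, |01>, |10>, |11> (q0 is the left bit)
state = [1.0, 0.0, 0.0, 0.0]          # start in |00>

def hadamard_q0(s):
    """H on qubit 0 mixes the amplitude pairs that differ only in q0."""
    return [r * (s[0] + s[2]), r * (s[1] + s[3]),
            r * (s[0] - s[2]), r * (s[1] - s[3])]

def cnot_q0_q1(s):
    """CNOT (control q0, target q1) swaps the |10> and |11> amplitudes."""
    return [s[0], s[1], s[3], s[2]]

state = cnot_q0_q1(hadamard_q0(state))       # the Bell-state circuit
probs = [abs(a) ** 2 for a in state]         # measurement probabilities
print([round(p, 3) for p in probs])          # [0.5, 0.0, 0.0, 0.5]
```

Measuring this state yields 00 or 11, each with probability 1/2 and never 01 or 10: the entanglement that the real hardware (and the 20- or 32-qubit simulators above) manipulates at scale.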

For more information:

▪ Elías F. Combarro: A Practical Introduction to Quantum Computing: From Qubits to Quantum Machine Learning and Beyond

▪ Workshop on Quantum Computing at IT4Innovations (Q1 2021)

Ondřej Vysocký
[email protected]

IT4Innovations National Supercomputing Center
VSB – Technical University of Ostrava
Studentská 6231/1B
708 00 Ostrava-Poruba, Czech Republic
www.it4i.cz