INTRODUCTION TO HPC | PATH TO EXASCALE

Ondřej Vysocký | Infrastructure Research Lab, IT4Innovations
Materials taken from top500.org, exascaleproject.org, eurohpc-ju.europa.eu, and vendors' & SC centers' web pages and presentations

PATH TO EXASCALE

TRENDS

TOP500 LIST

▪ List of the most powerful supercomputers
▪ Updated 2x a year – at ISC (June) and SC (November)
▪ Since 1993 ranked by the High-Performance Linpack (HPL) benchmark
▪ Since 2017 also the High-Performance Conjugate Gradient (HPCG) benchmark
▪ Since 2013 the Green500 list
▪ Since 2019 HPL-AI – not a list yet – mixed-precision algorithms
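The core task HPL times can be shown with a toy solver: it solves a dense linear system Ax = b (via LU factorization with partial pivoting) and credits roughly 2/3·n³ floating-point operations. A minimal stdlib-Python sketch of the same computation, not the real benchmark:

```python
import random

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting on an n x n system,
    the same kind of computation HPL times at massive scale."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for k in range(n):
        # partial pivoting: bring the largest |entry| into the pivot position
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def hpl_flops(n):
    """Operation count HPL credits for an n x n solve."""
    return 2.0 / 3.0 * n ** 3 + 2.0 * n ** 2

random.seed(0)
n = 50
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
x_true = [1.0] * n
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]
x = solve_dense(A, b)
print(max(abs(xi - 1.0) for xi in x))  # tiny residual
```

The benchmark reports flops/runtime; HPCG instead stresses memory bandwidth with a sparse iterative solve, which is why its scores are far below HPL's.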

[Charts: TOP500 list development and HPL + HPCG results (snapshots from June 2008, June 2013, 6/2019, and 11/2020); annotations: "ARM", "No China", "EU +11, 12, 15, 16, 18", "Where's Russia?!"]

11/2020 TOP500 LIST HPL

[Chart: power needed to reach 1 EFlops by scaling up systems from the 11/2020 list: x2 = 60 MW, x5 = 50 MW, x8 = 60 MW, x8 = 123 MW, x13 = 34 MW, x10 = 185 MW, x14 = 25 MW.]

The exascale goal is 50 GFlops/Watt = a 20 MW system.

GREEN500

• Direct warm-water cooling (separate cooling circuits for CPUs and GPUs)
• Availability of power-control knobs
• Higher heterogeneity of new systems – accelerators, GPGPUs, FPGAs, single/mixed-precision units
• Decarbonization
• AI everywhere
• And many more
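The arithmetic behind these power figures is simple: system power at a target performance is the performance divided by the machine's GFlops/Watt rating. A small sketch (the 26 GFlops/W value is an illustrative efficiency, not a quoted one):

```python
def system_power_mw(target_flops, gflops_per_watt):
    """Power (in MW) a system of the given efficiency needs
    to deliver the target performance."""
    watts = target_flops / (gflops_per_watt * 1e9)
    return watts / 1e6

EXAFLOP = 1e18
print(system_power_mw(EXAFLOP, 50))   # 20.0 -> the 20 MW exascale goal
print(system_power_mw(EXAFLOP, 26))   # ~38 MW for a 26 GFlops/W machine
```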

[Charts: Green500 11/2020 – accelerators in the leading systems: Nvidia A100, Nvidia V100, MN-Core; Summit highlighted.]

FUGAKU SUPERCOMPUTER

• 158,976 nodes, node peak performance 3.4 TFlop/s
• Fujitsu A64FX, Arm v8.2-A, 48 (+4) cores, SVE 512-bit instructions
• High-bandwidth 3D stacked memory: 4x 8 GB HBM with 1,024 GB/s
• On-die Tofu-D network (~400 Gbps)
• 29.9 MW
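A quick sanity check of these numbers: node count times node peak gives the system peak, and dividing by power gives the efficiency.

```python
nodes = 158_976
node_peak_tflops = 3.4
power_mw = 29.9

peak_pflops = nodes * node_peak_tflops / 1000.0          # TFlops -> PFlops
gflops_per_watt = peak_pflops * 1e6 / (power_mw * 1e6)   # GFlops per Watt
print(round(peak_pflops, 1))       # ~540.5 PFlops peak
print(round(gflops_per_watt, 1))   # ~18.1 GFlops/W
```

At ~18 GFlops/W, Fugaku sits well short of the 50 GFlops/W exascale goal, which is exactly the gap the Green500 trends target.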

[Diagram: Tofu interconnect IN/OUT ports; direct water cooling.]

GREEN500 #1 2020/6: MN-3

• 2x Platinum 8260M (CSL) 24C 2.4 GHz
• Optane persistent memory
• MN-Core
  • Preferred Networks' accelerator
  • Specialized for deep-learning training
  • Optimized for energy efficiency
  • Efficiency above one teraflop per watt
  • 1 matrix arithmetic unit (MAU) + 4 processing elements (PE, provide data to the MAU) = matrix arithmetic block (MAB)
  • 4 dies per chip, 512 MABs per die
• Air cooled ?!

PATH TO EXASCALE

ROADMAPS

USA ROADMAP

1.5 EFlops AMD CPU + GPU

1 EFlops Intel CPU + Intel Xe

>2 EFlops, ~40 MW, AMD CPU + GPU
High variability of CPU and GPU vendors

USA ROADMAP

$500M, 1 EFlops, <=60 MW, 2 Intel Sapphire Rapids + 6 Intel Xe per node

$600M, 1.5 EFlops, ~30 MW, AMD 1 CPU + 4 GPUs

>2 EFlops, ~40 MW, AMD CPU + GPU – 1st exascale system?

CHINA

▪ Homogeneous
  ▪ NUDT: Tianhe-2A (2018, Intel Xeon + Matrix-2000, 95 PFlops) -> Tianhe-3 (2021?, Matrix-3000, ~1.3 EFlops); 100 cabinets, 128 blades each, 8 CPUs per blade
  ▪ NRCPC: Sunway TaihuLight (ShenWei 26010) -> NRCPC prototype (ShenWei 26010) -> ?
  ▪ ShenWei 26010 = 260 cores, 4 core groups, 3 TFlops

▪ Accelerated ▪ Sugon prototype (Hygon CPU+DCU ACC) -> Sugon (Hygon accelerated)

Matrix-3000
• >=96 cores, >10 TFlops
• HBM2
• Supports half precision

Hygon CPU
• Licensed AMD EPYC clone

ShenWei 26010
• 260 cores, 4 core groups
• 3 TFlops

THE EUROHPC JOINT UNDERTAKING

▪ A legal and funding agency

▪ 32 member countries

▪ A co-funding programme to build a pan-European supercomputing infrastructure

Medium-to-high-range supercomputers
▪ At least 4 PFlops
▪ Bulgaria, Czech Republic, Luxembourg, Portugal, Slovenia
▪ Expected installation by H1 2021

High-range pre-exascale supercomputers
▪ 150-200 PFlops
▪ Finnish, Spanish, and Italian consortia
▪ Expected installation mid-2021

Next generations of systems planned for 2023-2024 (exascale) and 2026-2027

EUROPEAN PRE-EXASCALE SYSTEMS

▪ H2 2021

▪ 240M €

▪ 248 PFlops

▪ 2 Intel Xeon Ice Lake CPUs + 4 Nvidia A100 GPUs

MareNostrum V

▪ 200 PFlops

▪ €223 million

[Chart overlay: a heterogeneous pre-exascale system, 552 PFlops peak, 375 PFlops LINPACK, mid-2021]

IT4INNOVATIONS ROADMAP

▪ EURO_IT4I Q1 2021
▪ 15.2 PFlops
▪ AMD EPYC + Nvidia A100
▪ Homogeneous (2x 7H12), accelerated (2x 7452 + 8x A100), visualization (Nvidia Quadro RTX 6000), big data (32x Intel Xeon 8268, 24,576 GB RAM), and cloud partitions
▪ 200 Gb/s interconnect

11. 12. 2020 IT4INNOVATIONS ROADMAP


▪ New experimental systems
  ▪ Late 2021 – 4 architectures, targeting the most promising technologies
  ▪ Late 2022 – quantum computer?

IT4INNOVATIONS ROADMAP

Name the computer: bit.ly/jmenosuperpocitace
▪ EURO_IT4I Q1 2021
▪ Experimental system late 2021

▪ Experimental system late 2022

[Chart: planned acquisitions 2023-2031 with budgets 9.5M €, 17M €, 5.5M €, 7.5M €, 2M €, 17M €, 2M €, 7.5M €, 4M €.]

PATH TO EXASCALE

HARDWARE

INTEL PROCESSORS

▪ All Intel architectures are delayed
  ▪ One-year delay in 7 nm and 6 months in 10 nm technology
  ▪ An Intel Xeon SP 7 nm CPU is on the roadmap for the first half of 2023

INTEL XEON ICE LAKE SP
▪ 2021: 10 nm, PCIe 4.0, DDR4
▪ H2 2021: 10 nm, PCIe 5.0, DDR5

INTEL XE – PONTE VECCHIO

▪ 1-, 2-, or 4-tile packing design

▪ 4-tile variant should provide over 40 TFlops FP32 (2-tile design for Aurora)

INTEL OPTANE MEMORY

▪ 3D XPoint is a non-volatile memory (NVM)

▪ Another layer in the memory hierarchy

▪ Requires CPU support

AMD EPYC PROCESSORS
A chiplet-based architecture built on Zen cores

▪ Naples (2017) ▪ 14nm

▪ Rome (2019) ▪ 7nm, 8 mem channels, up to 4 TB RAM

▪ Milan (Q1 2021) ▪ 7nm+

▪ Genoa (expected in 2021?)
  ▪ 5 nm

AMD INSTINCT GPUS

MI25 (2017)
▪ 14 nm
▪ 4096 stream processors
▪ 768 GFlops DP
▪ 16 GB HBM2, 484 GB/s
▪ PCIe 3.0
▪ Passively cooled, 300 W TDP

MI50 (2018)
▪ 7 nm
▪ 3840 stream processors, 1725 MHz
▪ 6.6 TFlops DP
▪ 16 GB HBM2, 1024 GB/s
▪ PCIe 3.0/4.0
▪ Passively cooled, 300 W TDP

MI60 (2018)
▪ 4096 stream processors, 1800 MHz
▪ 14.7 TFlops SP
▪ Not on sale any more

MI100 (11/2020)
▪ 7 nm
▪ 7,680 stream processors, 1502 MHz
▪ 11.5 TFlops DP, 92.3 TFlops BFloat
▪ 32 GB HBM2, 1228.8 GB/s
▪ PCIe 3.0/4.0
▪ Passively cooled, 300 W TDP

IBM PROCESSORS

Power10 offers a ~3x performance gain and ~2.6x core efficiency gain over Power9

[Timeline: IBM Power9 (2017) -> Power10 (2021)]

NVIDIA GPUS

NVIDIA Tesla V100 (Volta), 12 nm
▪ 5120 CUDA cores + 640 tensor cores
▪ 16/32 GB HBM2, 900 GB/s
▪ 300 GB/s NVLink
▪ 7.8 TFlops DP
▪ 15.7 TFlops SP

NVIDIA A100 (Ampere), 7 nm
▪ 6912 CUDA cores + 432 tensor cores
▪ 40/80 GB memory, 1.5/2 TB/s
▪ 600 GB/s NVLink
▪ 9.7 TFlops DP, 19.5 TFlops tensor-core DP
▪ 19.5 TFlops SP, 156 TFlops tensor-core SP

TENSOR CORES

Mixed (half) precision computing with tensor cores; since the Ampere architecture also double precision!

NVIDIA DGX PLATFORM
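Why the wide accumulator in the tensor-core slide above matters: multiplying in FP16 but accumulating in FP32 avoids the rounding stall you get when the running sum itself is kept in half precision. A stdlib-Python sketch, using `struct`'s `'e'` format (IEEE 754 binary16) to emulate FP16 rounding:

```python
import struct

def to_fp16(x):
    """Round a float to IEEE 754 half precision via struct's 'e' format."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Sum 4096 copies of 0.0001; the exact answer is 0.4096.
vals = [0.0001] * 4096

# Accumulating entirely in half precision: each partial sum is rounded
# to fp16, so once the sum grows, the tiny addends round away to nothing.
half_acc = 0.0
for v in vals:
    half_acc = to_fp16(half_acc + to_fp16(v))

# Tensor-core style: fp16 inputs, but a wider (double) accumulator.
wide_acc = 0.0
for v in vals:
    wide_acc += to_fp16(v)

print(half_acc)   # stalls well below the true sum
print(wide_acc)   # close to 0.4096
```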

DGX-1
▪ 8x NVIDIA Tesla V100, 32 GB/GPU
▪ 40,960 CUDA cores + 5,120 tensor cores
▪ NVIDIA NVLink – hybrid cube mesh
▪ 512 GB DDR4

DGX-2
▪ 16x NVIDIA Tesla V100
▪ Intel Xeon Platinum
▪ NVSwitch – 2.4 TB/s of bisection bandwidth

NVIDIA DGX-A100 & SUPERPOD

▪ 8x NVIDIA A100 GPU
▪ 2x AMD EPYC Rome CPU
▪ 640 GB memory
▪ 600 GB/s GPU-to-GPU bi-directional bandwidth
▪ 5 PFlops AI
▪ 6.5 kW

NVIDIA DGX-A100 & SUPERPOD


▪ #1 Green500 11/2020
▪ 20-140 DGX-A100
▪ 100-700 PFlops system
▪ 32.5 kW per rack
▪ Deployable in weeks

ARM IN HPC

Arm brings advantages compared to x86 processors. The Arm roadmap expects the 5 nm Poseidon platform in 2021.

Fujitsu A64FX
▪ Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension), 512-bit, 7 nm
▪ 48 computing cores + 4 assistant cores

ThunderX2
▪ ARMv8.1, 64-bit, 14 nm
▪ 32 cores, 128 threads

ThunderX3
▪ ARMv8.3+, 128-bit, 7 nm
▪ 96 cores, 384 threads
▪ Expected in 2020

EUROPEAN PROCESSOR INITIATIVE (EPI)

Europe is investing in the development of a new processor
▪ Security
▪ Competitiveness

Design a roadmap of future European low-power processors
▪ Common platform
▪ General-purpose processor
▪ Accelerator
▪ Automotive

FPGAS IN HPC

Device             Fab [nm]  #cores        Peak performance     TDP [W]  Perf/W [GOPs/W]  Mem BW [GB/s]  Mem type
Intel 10 DX        14        11,520 DSPs   8,600 SP / 143 INT8  ?        1000             512            HBM2
Intel Agilex       10        ?             40,000 FP16          ?        ?                512            HBM2
Xilinx Alveo U280  16        9,024 DSPs    24.5 INT8            225      109              38/460         DDR4/HBM2
Xilinx Alveo U250  16        12,288 DSPs   33.3 INT8            225      148              77             DDR4

QUANTUM COMPUTING
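Back to the FPGA table above: the Perf/W column appears to be peak INT8 throughput (read as TOPs) divided by board TDP, an assumption the two Xilinx rows confirm:

```python
def gops_per_watt(peak_tops_int8, tdp_watts):
    """Energy efficiency: peak INT8 tera-operations/s over board TDP,
    converted to GOPs per watt."""
    return peak_tops_int8 * 1000.0 / tdp_watts  # TOPs -> GOPs

print(round(gops_per_watt(24.5, 225)))  # Alveo U280 -> 109, matches the table
print(round(gops_per_watt(33.3, 225)))  # Alveo U250 -> 148, matches the table
```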

Several basic quantum computer implementations and hardware emulators exist

▪ D-Wave, IBM, Google, Atos, ...

JUNIQ system in Juelich

▪ D-Wave system

PATH TO EXASCALE

SOFTWARE

EXASCALE APPLICATIONS

[Figure: Fugaku software stack]

EXASCALE APPLICATIONS

▪ Earth and space science

▪ Chemistry and materials – medicine, plasma science, molecular dynamics

▪ Energy production and transmission

▪ National security = military

NEW SOFTWARE SPECIFICATIONS

▪ OpenMP 5.1
  ▪ OpenMP 5.0 will be fully implemented in GCC 12, except OMPT and OMPD
  ▪ New directives
    ▪ interop
    ▪ dispatch
    ▪ assume
    ▪ target_device selector
    ▪ ... and many more

▪ MPI 4.0
  ▪ Specification 2/2021, implementations by the end of 2021
  ▪ New features

    ▪ Solution for "big count" operations
    ▪ Persistent collectives
    ▪ Partitioned communication
    ▪ Topology-aware communicators
    ▪ ... and many more

EXASCALE SOFTWARE STACK

Simplified software development for heterogeneous hardware

▪ Intel oneAPI

▪ AMD ROCm

▪ CUDA-X HPC & AI software stack

QUANTUM COMPUTING

Different frameworks and programming languages:

▪ QASM, Qiskit (IBM), Cirq (Google), Forest/pyQuil (Rigetti), Q# (Microsoft), Ocean (D-Wave)

IBM Quantum Experience

▪ Free online access to quantum simulators (up to 32 qubits) and actual quantum computers (1-15 qubits) with different topologies

▪ Programmable with a visual interface and via different languages (Python, QASM, Jupyter notebooks)

Atos myQLM

▪ Freeware for Linux or Windows machines

▪ A quantum software stack for writing, simulating, optimizing, and executing quantum programs; provides up to 20 qubits
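Underneath all of these frameworks sits the same model: gates are unitary operations on a complex statevector. A toy stdlib-Python sketch of the standard two-qubit Bell-state circuit (Hadamard on one qubit, then CNOT), with the gate actions written out by hand instead of via any framework:

```python
import math

r = 1 / math.sqrt(2)

# 2-qubit statevector, basis order |00>, |01>, |10>, |11> (q0 is the left bit)
state = [1.0, 0.0, 0.0, 0.0]          # start in |00>

def hadamard_q0(s):
    """H on qubit 0 mixes the amplitude pairs that differ only in q0."""
    return [r * (s[0] + s[2]), r * (s[1] + s[3]),
            r * (s[0] - s[2]), r * (s[1] - s[3])]

def cnot_q0_q1(s):
    """CNOT (control q0, target q1) swaps the |10> and |11> amplitudes."""
    return [s[0], s[1], s[3], s[2]]

state = cnot_q0_q1(hadamard_q0(state))       # the Bell-state circuit
probs = [abs(a) ** 2 for a in state]         # measurement probabilities
print([round(p, 3) for p in probs])          # [0.5, 0.0, 0.0, 0.5]
```

Measuring this state yields 00 or 11, each with probability 1/2 and never 01 or 10: the entanglement that the real hardware (and the 20- or 32-qubit simulators above) manipulates at scale.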

For more information:

▪ Elías F. Combarro: A Practical Introduction to Quantum Computing: From Qubits to Quantum Machine Learning and Beyond

▪ Workshop on Quantum Computing at IT4Innovations (Q1 2021)

Ondřej Vysocký
[email protected]

IT4Innovations National Supercomputing Center
VSB – Technical University of Ostrava
Studentská 6231/1B
708 00 Ostrava-Poruba, Czech Republic
www.it4i.cz