Introduction to HPC | Path to Exascale


INTRODUCTION TO HPC | PATH TO EXASCALE
Ondřej Vysocký | Infrastructure Research Lab, IT4Innovations
Materials taken from top500.org, exascaleproject.org, eurohpc-ju.europa.eu, and vendors' & supercomputing centres' web pages and presentations.

PATH TO EXASCALE: TRENDS

TOP500 LIST
▪ List of the most powerful supercomputers
▪ Updated twice a year – at ISC (June) and at SC (November)
▪ Since 1993: High Performance Linpack (HPL) benchmark
▪ Since 2017: also the High Performance Conjugate Gradient (HPCG) benchmark
▪ Since 2013: Green500 list
▪ Since 2019: HPL-AI – not a separate list yet – mixed-precision algorithms
[Charts, 11/2020 list and earlier editions: HPL and HPCG rankings with annotations on ARM-based systems, China ("No China" in one ranking), EU entries at positions 11, 12, 15, 16 and 18, and the question "Where's Russia?!"; historical snapshots from June 2008, June 2013 and 6/2019.]

TOP500 LIST – HPL AND POWER (11/2020)
▪ Exascale goal is 50 GFlops/Watt, i.e. a 20 MW system (a worked check follows after the Fugaku slide below)
[Chart: scaling today's leading systems to roughly 1 EFlop/s at their current efficiency – ×2 = 60 MW, ×5 = 50 MW, ×8 = 60 MW, ×8 = 123 MW, ×13 = 34 MW, ×10 = 185 MW, ×14 = 25 MW.]

GREEN500 (11/2020)
• Direct warm-water cooling (separate cooling circuits for CPUs and GPUs)
• Availability of power-controlling knobs
• Higher heterogeneity of new systems = accelerators, GPGPUs, FPGAs, single/mixed-precision units
• Decarbonization
• AI everywhere
• And many more
[Chart: the most efficient systems on the 11/2020 list are accelerated by Nvidia A100, Nvidia V100, and MN-Core devices.]

SUMMIT SUPERCOMPUTER

FUGAKU SUPERCOMPUTER
• 158 976 nodes, node peak performance 3.4 TFlop/s
• Fujitsu A64FX, ARM v8.2-A, 48 (+4) cores, SVE 512-bit instructions
• High-bandwidth 3D stacked memory: 4x 8 GB HBM at 1 024 GB/s
• On-die Tofu-D network (~400 Gbps)
• 29.9 MW, Tofu interconnect, direct water cooling
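The power target and the Fugaku figures above are consistent with simple arithmetic. A worked check (using peak rather than measured HPL numbers, so the efficiency is an optimistic estimate):

    \[
      P_{\mathrm{exa}} = \frac{10^{18}\,\mathrm{Flop/s}}{50\times10^{9}\,\mathrm{Flop/s\ per\ W}}
                       = 2\times10^{7}\,\mathrm{W} = 20\,\mathrm{MW}
    \]
    \[
      R_{\mathrm{peak}}^{\mathrm{Fugaku}} \approx 158\,976 \times 3.4\,\mathrm{TFlop/s} \approx 540\,\mathrm{PFlop/s},
      \qquad
      \frac{540\times10^{15}\,\mathrm{Flop/s}}{29.9\times10^{6}\,\mathrm{W}} \approx 18\,\mathrm{GFlop/s\ per\ W}
    \]

So even Fugaku, the top system of the 11/2020 list, sits well below the 50 GFlops/W exascale target, which is why the Green500 trends above matter.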
GREEN500 #1 IN 6/2020: MN-3
• 2x Xeon Platinum 8260M (Cascade Lake), 24 cores, 2.4 GHz
• Intel Optane persistent memory
• MN-Core
  • Preferred Networks' accelerator
  • Specialized for deep-learning training
  • Optimized for energy efficiency – above one teraflop per Watt
  • 1 Matrix Arithmetic Unit (MAU) + 4 Processing Elements (PEs, which feed data to the MAU) = Matrix Arithmetic Block (MAB)
  • 4 dies per chip, 512 MABs per die
  • Air cooled ?!

PATH TO EXASCALE: ROADMAPS

USA ROADMAP
▪ 1 EFlops – Intel CPU + Intel Xe GPUs
▪ 1.5 EFlops – AMD CPU + GPU
▪ >2 EFlops, ~40 MW – AMD CPU + GPU
▪ High variability of CPU and GPU vendors
▪ 500M $ – 1 EFlops, <= 60 MW – 2 Intel Sapphire Rapids CPUs + 6 Intel Xe GPUs per node
▪ 600M $ – 1.5 EFlops, ~30 MW – 1 AMD CPU + 4 AMD GPUs per node

AURORA – 1ST EXASCALE SYSTEM?

CHINA
▪ Homogeneous
  ▪ NUDT: Tianhe-2A (2018, Intel Xeon + Matrix-2000, 95 PFlops) -> Tianhe-3 (2021?, Matrix-3000, ~1.3 EFlops; 100 cabinets, 128 blades each, 8 CPUs per blade)
  ▪ NRCPC: Sunway TaihuLight (ShenWei 26010) -> NRCPC prototype (ShenWei 26010) -> ?
▪ Accelerated
  ▪ Sugon prototype (Hygon CPU + DCU accelerator) -> Sugon (Hygon accelerated)
▪ Matrix-3000: >= 96 cores, > 10 TFlops, HBM2, half-precision support
▪ Hygon x86 CPU: licensed AMD EPYC clone
▪ ShenWei 26010: 260 cores, 4 core groups, 3 TFlops

THE EUROHPC JOINT UNDERTAKING
▪ A legal and funding agency
▪ 32 member countries
▪ A co-funding programme to build a pan-European supercomputing infrastructure
▪ Medium-to-high-range supercomputers
  ▪ at least 4 PFlops
  ▪ Bulgaria, Czech Republic, Luxembourg, Portugal, Slovenia
  ▪ expected installation by H1 2021
▪ High-range pre-exascale supercomputers
  ▪ 150-200 PFlops
  ▪ Finnish, Spanish and Italian consortiums
  ▪ expected installation mid-2021
▪ Next generations of systems planned for 2023-2024 (exascale) and 2026-2027

EUROPEAN PRE-EXASCALE SYSTEMS
▪ H2 2021, 240M €, 248 PFlops – 2 Intel Xeon Ice Lake CPUs + 4 Nvidia A100 GPUs per node
▪ MareNostrum V – 200 PFlops, 223 million €
▪ Heterogeneous system – 552 PFlops peak, 375 PFlops LINPACK, mid-2021

IT4INNOVATIONS ROADMAP
▪ EURO_IT4I, Q1 2021
  ▪ 15.2 PFlops
  ▪ AMD EPYC + Nvidia A100
  ▪ Homogeneous (2x 7H12), accelerated (2x 7452 + 8x A100), visualization (Nvidia Quadro RTX 6000), big data (32x Intel Xeon 8268, 24 576 GB RAM), and cloud partitions
  ▪ 200 Gb/s interconnect
▪ New experimental systems
  ▪ Late 2021 – 4 architectures, targeting the most promising technologies
  ▪ Late 2022 – quantum computer?
▪ Name the computer: bit.ly/jmenosuperpocitace
[Chart, 11. 12. 2020: planned investments of 9.5M €, 17M €, 5.5M €, 7.5M €, 2M €, 17M €, 2M €, 7.5M € and 4M € spread over 2023-2031.]

PATH TO EXASCALE: HARDWARE

INTEL PROCESSORS
▪ All Intel architectures are delayed
▪ One-year delay on the 7 nm process and six months on 10 nm
▪ An Intel Xeon SP 7 nm CPU is on the roadmap for the first half of 2023
▪ 2021: 10 nm, PCIe 4.0, DDR4; H2 2021: 10 nm, PCIe 5.0, DDR5

INTEL XEON ICE LAKE SP

INTEL XE – PONTE VECCHIO
▪ 1-, 2-, or 4-tile package design
▪ The 4-tile variant should provide over 40 TFlops FP32 (the 2-tile design is intended for Aurora)

INTEL OPTANE MEMORY
▪ 3D XPoint is a non-volatile memory (NVM)
▪ Another layer in the memory hierarchy (a minimal access sketch follows below)
▪ Requires CPU support
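To illustrate the "another layer in the memory hierarchy" point: in App Direct mode the persistent modules are typically exposed through a DAX-capable filesystem and mapped into the address space, so ordinary loads and stores reach the media. A minimal POSIX sketch, assuming a DAX filesystem mounted at /mnt/pmem (the path and sizes are assumptions, not from the slides); production code would normally use Intel's PMDK libraries rather than raw mmap:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PMEM_FILE "/mnt/pmem/scratch.bin"   /* assumed DAX-mounted Optane namespace */
    #define PMEM_SIZE (64UL * 1024 * 1024)      /* 64 MiB region, illustrative */

    int main(void)
    {
        int fd = open(PMEM_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, PMEM_SIZE) != 0) { perror("ftruncate"); return 1; }

        /* Map the persistent region; on DAX, loads/stores bypass the page cache. */
        char *pmem = mmap(NULL, PMEM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ordinary stores into the mapping... */
        strcpy(pmem, "checkpoint data lives here");

        /* ...followed by an explicit flush to make them durable.
           (PMDK uses user-space cache-flush instructions instead of msync.) */
        if (msync(pmem, PMEM_SIZE, MS_SYNC) != 0) perror("msync");

        munmap(pmem, PMEM_SIZE);
        close(fd);
        return 0;
    }

The point of the sketch is the programming model: the persistent tier is addressed like memory, not like a block device, which is why CPU support for the required cache-flush semantics matters.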
AMD EPYC PROCESSORS
A chiplet-based architecture built on Zen cores:
▪ Naples (2017) – 14 nm
▪ Rome (2019) – 7 nm, 8 memory channels, up to 4 TB RAM
▪ Milan (Q1 2021) – 7 nm+
▪ Genoa (expected in 2021?) – 5 nm

AMD RADEON INSTINCT GPUS
▪ MI25 (2017): 14 nm, 4096 stream processors, 768 GFlops DP, 16 GB HBM2 at 484 GB/s, PCIe 3.0, passively cooled, 300 W TDP
▪ MI50 (2018): 7 nm, 3840 stream processors at 1725 MHz, 6.6 TFlops DP, 16 GB HBM2 at 1024 GB/s, PCIe 3.0/4.0, passively cooled, 300 W TDP
▪ MI60 (2018): 4096 stream processors at 1800 MHz, 14.7 TFlops SP, no longer on sale
▪ MI100 (11/2020): 7 nm, 7 680 stream processors at 1502 MHz, 11.5 TFlops DP, 92.3 TFlops BFloat16, 32 GB HBM2 at 1228.8 GB/s, PCIe 3.0/4.0, passively cooled, 300 W TDP

IBM PROCESSORS
▪ Power9 (2017) -> Power10 (2021)
▪ Power10 offers a ~3x performance gain and a ~2.6x core-efficiency gain over Power9

IBM POWER10

NVIDIA GPUS
▪ NVIDIA Tesla V100 (Volta), 12 nm
  ▪ 5120 CUDA cores + 640 tensor cores
  ▪ 16/32 GB HBM2, 900 GB/s
  ▪ 300 GB/s NVLink
  ▪ 7.8 TFlops DP, 15.7 TFlops SP
▪ NVIDIA A100 (Ampere), 7 nm
  ▪ 6912 CUDA cores + 432 tensor cores
  ▪ 40/80 GB memory, 1.5/2 TB/s
  ▪ 600 GB/s NVLink
  ▪ 9.7 TFlops DP, 19.5 TFlops tensor-core DP
  ▪ 19.5 TFlops SP, 156 TFlops tensor-core SP (TF32)

TENSOR CORES
▪ Mixed (half) precision computing on tensor cores
▪ From the Ampere architecture, also double precision!
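The single-precision peak figures above follow from (CUDA cores) × (clock) × 2 Flop per fused multiply-add; a quick check using the published boost clocks of roughly 1.53 GHz for V100 and 1.41 GHz for A100 (the clocks are not on the slide):

    \[
      R^{\mathrm{SP}}_{\mathrm{V100}} \approx 5120 \times 1.53\,\mathrm{GHz} \times 2 \approx 15.7\,\mathrm{TFlop/s},
      \qquad
      R^{\mathrm{SP}}_{\mathrm{A100}} \approx 6912 \times 1.41\,\mathrm{GHz} \times 2 \approx 19.5\,\mathrm{TFlop/s}
    \]

Double precision is half of single precision on both parts (7.8 and 9.7 TFlops), while the much larger tensor-core figures come from the separate tensor-core units rather than from this formula.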
NVIDIA DGX PLATFORM
▪ DGX-1
  ▪ 8x NVIDIA Tesla V100, 32 GB per GPU
  ▪ 40 960 CUDA cores + 5 120 tensor cores
  ▪ NVIDIA NVLink – hybrid cube mesh
  ▪ 512 GB DDR4
▪ DGX-2
  ▪ 16x NVIDIA Tesla V100
  ▪ Intel Xeon Platinum
  ▪ NVSwitch – 2.4 TB/s of bisection bandwidth

NVIDIA DGX A100 & SUPERPOD
▪ DGX A100
  ▪ 8x NVIDIA A100 GPUs
  ▪ 2x AMD EPYC Rome CPUs
  ▪ 640 GB memory
  ▪ 600 GB/s GPU-to-GPU bidirectional bandwidth
  ▪ 5 PFlops AI
  ▪ 6.5 kW
  ▪ #1 on the Green500, 11/2020
▪ SuperPOD
  ▪ 20 – 140 DGX A100 systems
  ▪ 100 – 700 PFlops AI
  ▪ 32.5 kW per rack
  ▪ Deployable in weeks

ARM IN HPC
▪ ARM brings better performance per Watt compared to x86 processors
▪ The ARM roadmap expects the 5 nm Poseidon platform in 2021
▪ Fujitsu A64FX
  ▪ Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension), 512-bit, 7 nm
  ▪ 48 computing cores + 4 assistant cores
▪ ThunderX2
  ▪ ARMv8.1, 64-bit, 14 nm
  ▪ 32 cores, 128 threads
▪ ThunderX3
  ▪ ARMv8.3+, 128-bit, 7 nm
  ▪ 96 cores, 384 threads
  ▪ Expected in 2020

EUROPEAN PROCESSOR INITIATIVE (EPI)
▪ Europe invests in the development of a new processor
  ▪ Security
  ▪ Competitiveness
▪ Design a roadmap of future European low-power processors
  ▪ common platform
  ▪ general-purpose processor
  ▪ accelerator
  ▪ automotive

FPGAS IN HPC
Device              | Fabrication [nm] | #cores      | Peak performance            | TDP [W] | Perf/W [GOPs/W] | Memory bandwidth [GB/s] | Memory type
Intel Stratix 10 DX | 14               | 11 520 DSPs | 8 600 GFlops SP / 143 TOPS INT8 | ?   | 1000            | 512                     | HBM2
Intel Agilex        | 10               | ?           | 40 000 GFlops FP16          | ?       | ?               | 512                     | HBM2
Xilinx Alveo U280   | 16               | 9 024 DSPs  | 24.5 TOPS INT8              | 225     | 109             | 38/460                  | DDR4/HBM2
Xilinx Alveo U250   | 16               | 12 288 DSPs | 33.3 TOPS INT8              | 225     | 148             | 77                      | DDR4

QUANTUM COMPUTING
▪ Several basic quantum computer implementations and hardware emulators
  ▪ D-Wave, IBM, Google, Atos, ...
▪ JUNIQ system in Juelich – a D-Wave system

PATH TO EXASCALE: SOFTWARE

EXASCALE APPLICATIONS
[Figure: the Fugaku software stack.]
▪ Earth and space science
▪ Chemistry and materials – medicine, plasma science, molecular dynamics
▪ Energy production and transmission
▪ National security = military

NEW SOFTWARE SPECIFICATIONS
▪ OpenMP 5.1
  ▪ OpenMP 5.0 will be fully implemented in GCC 12, except OMPT and OMPD
  ▪ New directives: interop, dispatch, assume, the target_device selector, ... and many more
▪ MPI 4.0
  ▪ Specification 2/2021, implementations by the end of 2021
  ▪ New features: a solution for "big count" operations, persistent collectives, partitioned communication, topology-aware communicators, ... and many more (a sketch of a persistent collective follows at the end of this section)

EXASCALE SOFTWARE STACK
Simplified software development for heterogeneous hardware:
▪ Intel oneAPI
▪ AMD ROCm
▪ CUDA-X HPC & AI software stack

QUANTUM COMPUTING
▪ Different frameworks and programming languages: Qasm, Qiskit (IBM), Cirq (Google), Forest/pyQuil (Rigetti), Q# (Microsoft), Ocean (D-Wave)
▪ IBM Quantum Experience
  ▪ Free online access to quantum simulators (up to 32 qubits) and actual quantum computers (1-15 qubits) with different topologies
  ▪ Programmable with a visual interface and via different languages (Python, QASM, Jupyter notebooks)
▪ Atos myQLM
  ▪ Freeware for Linux or Windows machines
  ▪ A quantum software stack for writing, simulating, optimizing, and executing quantum programs.
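As a pointer to what the MPI 4.0 features listed above look like in code, here is a minimal sketch of a persistent collective (the MPI_Allreduce_init / MPI_Start / MPI_Wait pattern). It assumes an MPI library that already implements the MPI 4.0 persistent-collective calls; the iteration count and buffer size are arbitrary illustration values:

    #include <mpi.h>
    #include <stdio.h>

    #define N     4      /* elements reduced per iteration (illustrative) */
    #define STEPS 100    /* number of iterations (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[N], global[N];
        MPI_Request req;

        /* MPI 4.0: set the collective up once; its arguments stay fixed across starts. */
        MPI_Allreduce_init(local, global, N, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD, MPI_INFO_NULL, &req);

        for (int step = 0; step < STEPS; ++step) {
            for (int i = 0; i < N; ++i)          /* refill the send buffer */
                local[i] = rank + step + i;

            MPI_Start(&req);                     /* launch this iteration's reduction */
            /* ... independent work could overlap with the collective here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* global[] now holds the sums */
        }

        if (rank == 0)
            printf("last reduction, element 0: %f\n", global[0]);

        MPI_Request_free(&req);
        MPI_Finalize();
        return 0;
    }

Compared with calling MPI_Allreduce inside the loop, the setup cost is paid once and the library gets a chance to pre-plan the repeated reduction.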