End-To-End I/O Portfolio for the Summit Supercomputing Ecosystem

Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang, Christopher Zimmer, Christopher Brumgard, Jesse Hanley, George Markomanolis, Ross Miller, Dustin Leverman, Scott Atchley, and Veronica Vergara Larrea

ORNL is managed by UT-Battelle, LLC for the US Department of Energy.

ORNL Summit System Overview

System performance:
• Peak of 200 petaflops (FP64) for modeling and simulation
• Peak of 3.3 exaops (FP16) for data analytics and artificial intelligence

The system includes:
• 4,608 nodes
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM file system transferring data at 2.5 TB/s

Each node has:
• 2 IBM POWER9 processors
• 6 NVIDIA Tesla V100 GPUs
• 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
• 1.6 TB of non-volatile memory

Summit Node Schematic

[Node schematic: two POWER9 sockets, each with 256 GB of DDR4 (135 GB/s) and three V100 GPUs (7 TF, 16 GB HBM2 at 900 GB/s each), joined by a 64 GB/s X-Bus; 50 GB/s NVLink v2 links among the three GPUs and the CPU on each side; 16 GB/s PCIe Gen4 to the NVM (6.0 GB/s read, 2.2 GB/s write) and to the shared NIC with dual 12.5 GB/s EDR ports.]

• Coherent memory across the entire node
• NVLink v2 fully interconnects three GPUs and one CPU on each side of the node
• PCIe Gen 4 connects the NVM and the NIC
• Single shared NIC with dual EDR ports

Per-node totals: 42 TF (6 x 7 TF), 96 GB HBM2 (6 x 16 GB), 512 GB DRAM (2 x 16 x 16 GB), 25 GB/s network (2 x 12.5 GB/s), 83 MMsg/s. HBM and DRAM speeds are aggregate (read + write); all other speeds (X-Bus, NVLink, PCIe, IB) are bidirectional.

OLCF I/O Requirements

• POSIX API – ~1,300 users access and request POSIX
• Usability and transparency – no data islands, center-wide access, and no application code changes
• Capacity – O(100s) PB; at least 30x of Summit CPU memory, 80x desired
• I/O performance – O(TB/s); 50% of total CPU memory to be written in 5 minutes
• N-N and N-1 I/O support
• ML workload support – high small-read IOPS
• Cost-efficient design – not to exceed 15% of the Summit cost

OLCF workloads do not distinguish between checkpoints and data outputs, and persist all output data.

OLCF I/O Solutions Architecture

• Summit compute nodes (4,608), each with a 1.6 TB NVMe device formatted as XFS and attached via PCIe Gen4; per-device rates of 5.8 GB/s read, 2.1 GB/s write, 1 million read IOPS, and 170K write IOPS
• In-system storage layer – 7.4 PB aggregate capacity and 9.7 TB/s aggregate write bandwidth, presented either through the Spectral intercept library or through SymphonyFS, a FUSE-based distributed file system
• Mellanox InfiniBand EDR fat-tree network at 2.5 TB/s
• Spider 3 center-wide PFS – 250 PB capacity and 2.5 TB/s bandwidth from 77 IBM Spectrum Scale GL4 building blocks, each with NSD I/O servers and two redundancy groups of 211 HDDs, also serving the other OLCF systems
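The split between the 2.5 TB/s center-wide PFS and the much faster in-system layer can be connected back to the I/O performance requirement with a rough estimate. This is a back-of-the-envelope sketch only, assuming "CPU memory" refers to the 512 GB of DDR4 per node; the equation uses amsmath-style notation:

```latex
% Rough estimate, assuming "CPU memory" = 512 GB DDR4 per node over 4,608 nodes.
\[
\frac{0.5 \times 4{,}608 \times 512\ \mathrm{GB}}{300\ \mathrm{s}}
  \;\approx\; \frac{1.18\ \mathrm{PB}}{300\ \mathrm{s}}
  \;\approx\; 3.9\ \mathrm{TB/s}
\]
```

Under that reading, the required burst rate exceeds the 2.5 TB/s of Spider 3 but fits comfortably within the 9.7 TB/s aggregate write bandwidth of the node-local NVMe layer.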
Design Challenges

• Not possible to order from a catalogue
– For GPFS at the time of the selection (before 2014), the largest deployment was 18 PB, the highest throughput was 400 GB/s, and the highest create rate in a single directory was 5 creates/s
– Required a multi-year collaboration with IBM to identify performance bottlenecks, verify the design and implementation, and work through deployment and acceptance issues
• Provide a transparent mount point
– Users don't like multiple mount points
– Users don't like searching for data in multiple places
• Need a solution combining node-local and center-wide POSIX mount points
• Need a solution moving data from node-local mount points to GPFS transparently

Center-Wide Parallel File System (PFS)

• Spider 3/Alpine
– Shared POSIX namespace
– IBM Spectrum Scale/GPFS
– 77 ESS GL4 building blocks with O(30K) 10 TB NL-SAS drives
– IB EDR connected
– 250 PB usable, formatted (~90x the 2.8 PB of Summit DDR + HBM)
– 2.5 TB/s aggregate sequential write/read
– 2.2 TB/s aggregate random write/read
– 800K/s 32 KB file transactions (create/open + write + close)
– ~30K 0-byte file creates in a shared directory
• Each GL4
– 2 POWER9-based NSD servers
– 4 106-slot disk enclosures, 12 Gbps SAS connected (NSD server to enclosure)
– 422 disks in total, organized in 2 distributed RAID sets
• Each NSD server
– 2 IB ConnectX-5 EDR ports connecting to Summit
– 2 IB ConnectX-5 EDR ports connecting to the rest of OLCF

Summit In-System Storage Layer

• Each Summit compute node can write to Alpine at 12.5 GB/s – 200 Summit compute nodes are enough to saturate Alpine
• Each Summit node has an exclusive 1.6 TB Samsung PM1725a NVMe SSD
– 6 GB/s read and 2.1 GB/s write I/O performance
– 5 drive writes per day (DWPD)
– Formatted as XFS (a node-local file system)
• In aggregate, the Summit in-system storage layer provides (see the check after this list)
– 7.4 PB of capacity at 26.7 TB/s read and 9.7 TB/s write I/O performance
– 4.6 billion IOPS
– 2.5 times the capacity of the aggregate system DRAM and HBM
• The question is how to aggregate and present this in software as an effective I/O solution
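The aggregate figures follow directly from the per-node NVMe numbers already quoted (1.6 TB, 5.8 GB/s read, 2.1 GB/s write, 1 million read IOPS) multiplied across 4,608 nodes; the short check below is only that multiplication, written with amsmath notation:

```latex
% Aggregates from per-node NVMe figures across 4,608 Summit nodes.
\begin{align*}
4{,}608 \times 1.6\ \mathrm{TB}     &\approx 7.4\ \mathrm{PB},\\
4{,}608 \times 5.8\ \mathrm{GB/s}   &\approx 26.7\ \mathrm{TB/s}\ \text{(read)},\\
4{,}608 \times 2.1\ \mathrm{GB/s}   &\approx 9.7\ \mathrm{TB/s}\ \text{(write)},\\
4{,}608 \times 10^{6}\ \mathrm{IOPS} &\approx 4.6\ \text{billion read IOPS}.
\end{align*}
```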
Spectral

• An I/O intercept library interposing between the application and Spider 3 – https://code.ornl.gov/cz7/Spectral
• No application code changes required; supports N-N and N-1 workloads
• All writes are redirected to the local XFS and later transferred to a location on Spider 3 specified by an environment variable in the job script
• Does not add extra metadata on the I/O critical path
• Detects the close() of the file handle and enqueues the data transfer to a data-mover process that runs on the isolated system cores
• Maintains a log of files in motion so that, when an application exits, it can call the Spectral wait tool to hold the job open until the remaining files are transferred
• Feature complete; code hardening for operations

SymphonyFS

• A FUSE-based file system presenting a unified namespace across the distributed local XFS file systems and Spider 3 – https://code.ornl.gov/techint/SymphonyFS
• Provides a single mount point combining the node-local and PFS namespaces
• No application code changes required; supports N-N and N-1 workloads
• Metadata and read calls are directed to Spider 3; writes are directed to the local XFS
– Relies on Spider 3 for most metadata operations and adds latency by transiting them through FUSE
• The SymphonyFS daemon then transfers the data from the local XFS namespace to the compute-node-local GPFS client
• Applications must either avoid read-after-write and overlapping writes between nodes (not very common use cases at OLCF) or intervene manually with mechanisms such as fsync()
• In development; improving metadata performance

Spectral and SymphonyFS Workflow

[Workflow figure: applications on each compute node write to the node-local NVMe through either the Spectral library or SymphonyFS; the data is then drained to GPFS, which presents the unified namespace.]

Spectral Basic Use Case

Needs four easy changes to the application configuration (a hypothetical interception sketch follows after this list):
• Update the application's configuration file so that it writes to a path on the local XFS file system on the node-local NVMe SSD
• Set PERSIST_DIR, the base directory on the local XFS where the application will write these files
• Set PFS_DIR, the location on Spider 3 where Spectral will move the files
• Notify the runtime system, via the module and alloc_flags parameters, to load the Spectral libraries
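To make the interception model concrete, here is a minimal conceptual sketch of an LD_PRELOAD-style close() shim in C. It is not the Spectral implementation: the queue file path, its format, and the use of /proc/self/fd to recover the file's path are assumptions made for illustration, and the real library hands the transfer to a data-mover process pinned to the isolated system cores rather than appending to a log file.

```c
/*
 * Conceptual sketch only (not the actual Spectral code): an LD_PRELOAD-style
 * shim that notices when a file under PERSIST_DIR is closed and records a
 * pending copy toward the matching path under PFS_DIR.
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int (*real_close)(int) = NULL;

/* Record "local_path <dest_path>" for a separate data mover to pick up. */
static void enqueue_transfer(const char *local_path)
{
    const char *persist = getenv("PERSIST_DIR");
    const char *pfs     = getenv("PFS_DIR");

    if (!persist || !pfs || strncmp(local_path, persist, strlen(persist)) != 0)
        return;  /* not a file written under the node-local staging directory */

    FILE *queue = fopen("/tmp/spectral-queue.log", "a");  /* hypothetical path */
    if (queue) {
        fprintf(queue, "%s %s%s\n", local_path, pfs, local_path + strlen(persist));
        fclose(queue);
    }
}

int close(int fd)
{
    char link[64];
    char path[PATH_MAX];
    ssize_t n;

    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

    /* Resolve the file's path before the descriptor goes away (Linux-specific). */
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, path, sizeof(path) - 1);
    if (n > 0) {
        path[n] = '\0';
        enqueue_transfer(path);
    }
    return real_close(fd);
}
```

Built as a shared object (for example, gcc -shared -fPIC -o shim.so shim.c -ldl) and preloaded into the application, a wrapper of this shape leaves the write path untouched and only adds work at close(), which is consistent with Spectral's stated goal of keeping extra metadata off the I/O critical path.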
OLCF End-to-End I/O Solutions

Table 1: End-to-end I/O solutions for Summit and its ecosystem. Spectral and SymphonyFS reflect application-level performance. The Spectral I/O library performs at the rate of the aggregate local XFS file system. SymphonyFS performance numbers are projections based on small-scale test results and reflect an N-1 I/O workload with large block sizes.

Solution     POSIX       Usability   Capacity (aggregate)    Performance (aggregate)             Supported I/O modes   ML/DL support
Spider 3     Compliant   High        250 PB                  2.5 TB/s for reads and writes       N-N and N-1           Low
Local XFS    Compliant   Low         7.4 PB (1.6 TB/node)    24.8 TB/s reads, 9.6 TB/s writes    N-N                   High
Spectral     N/A         Medium      7.4 PB                  24.8 TB/s reads, 9.6 TB/s writes    N-N                   High
SymphonyFS   Compliant   High        7.4 PB                  2.5 TB/s reads, 9.6 TB/s writes     N-N and N-1           High