NVSwitch and DGX-2


NVSWITCH AND DGX-2: NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER

Alex Ishii and Denis Foley, with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel and John Schafer

NVIDIA DGX-2 SERVER AND NVSWITCH

DGX-2 server:
- 16 Tesla V100 32 GB GPUs in a single-server chassis
- 10U/19-inch rack mount, 10 kW peak TDP
- Dual 24-core Xeon CPUs, 1.5 TB DDR4 DRAM, 30 TB NVMe storage

NVSwitch network:
- 12 NVSwitch chips in a full-bandwidth fat-tree topology
- 2.4 TBps bisection bandwidth
- Global shared memory across 512 GB of GPU HBM2

New NVSwitch chip:
- 18 second-generation NVLink ports, 25 GBps per port
- 900 GBps total bidirectional bandwidth (450 GBps of throughput in each direction)
- Repeater-less links

Aggregate performance: FP64 125 TFLOPS, FP32 250 TFLOPS, Tensor 2000 TFLOPS.

OUTLINE
- The case for "One Gigantic GPU" and NVLink review
- NVSwitch: speeds and feeds, architecture, and system applications
- DGX-2 server: speeds and feeds, architecture, and packaging
- Signal-integrity design highlights
- Achieved performance

MULTI-CORE AND CUDA WITH ONE GPU
- Users explicitly express parallel work in NVIDIA CUDA.
- The GPU driver distributes work (data and CUDA kernels) to the available Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores.
- GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s).
- GPC/SM cores use the shared HBM2s to exchange data.

(Diagram: a single GPU with GPCs, L2 caches, HBM2 stacks, copy engines, Hub, PCIe I/O and NVLink ports around a central XBAR; work flows in from the CPU and results flow back.)

TWO GPUS WITH PCIE
- Access to the HBM2 of the other GPU runs at PCIe bandwidth (32 GBps bidirectional).
- GPU-to-GPU traffic competes with interactions with the CPU.
- PCIe is the "Wild West" (lots of performance bandits).

TWO GPUS WITH NVLINK
- All GPCs can access all HBM2 memories.
- Access to the HBM2 of the other GPU runs at multi-NVLink bandwidth (300 GBps bidirectional in V100 GPUs).
- NVLinks are effectively a "bridge" between the XBARs.
- No collisions with PCIe traffic.

THE "ONE GIGANTIC GPU" IDEAL
- Highest number of GPUs possible.
- A single GPU driver process controls all work across all GPUs.
- From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything "just works"; a sketch follows below).
- Access to all HBM2s is independent of PCIe.
- Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable).
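As a concrete illustration of LD/ST access "just working" across GPUs, here is a minimal CUDA sketch (not from the slides; the buffer size and kernel are illustrative) in which a kernel on GPU 0 loads directly from a buffer resident in GPU 1's HBM2 once peer access is enabled:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel on GPU 0 loads directly from a buffer in GPU 1's HBM2.
// Over NVLink (or NVSwitch), these loads travel the bridged XBARs.
__global__ void scaleRemote(const float* remote, float* local, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local[i] = 2.0f * remote[i];  // LD from peer GPU memory
}

int main() {
    const int n = 1 << 20;
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);     // can GPU 0 reach GPU 1?
    if (!can) { printf("no peer access between GPU 0 and GPU 1\n"); return 1; }

    float* remote = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&remote, n * sizeof(float));  // lives in GPU 1's HBM2

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU 1's memory into GPU 0
    float* local = nullptr;
    cudaMalloc(&local, n * sizeof(float));

    scaleRemote<<<(n + 255) / 256, 256>>>(remote, local, n);
    cudaDeviceSynchronize();
    return 0;
}
```

The same code runs whether the two GPUs are bridged by direct NVLinks or by NVSwitch; only the bandwidth behind the loads changes.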
"ONE GIGANTIC GPU" BENEFITS
- Problem size capacity: problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than by the capacity of a single GPU.
- Strong scaling: NUMA effects are greatly reduced compared to existing solutions, and aggregate bandwidth to HBM2 grows with the number of GPUs.
- Ease of use: apps written for a small number of GPUs port more easily, and abundant resources enable rapid experimentation.

INTRODUCING NVSWITCH
- Bidirectional bandwidth per NVLink: 51.5 GBps
- NRZ lane rate (x8 lanes per NVLink): 25.78125 Gbps
- Transistors: 2 billion
- Process: TSMC 12FFN
- Die size: 106 mm^2
- Bidirectional aggregate bandwidth: 928 GBps
- NVLink ports: 18
- Management port (config, maintenance, errors): PCIe
- LD/ST BW efficiency (128 B pkts): 80.0%
- Copy Engine BW efficiency (256 B pkts): 88.9%

(The two efficiency figures are consistent with roughly 32 B of per-packet overhead: 128/(128+32) = 80.0% and 256/(256+32) ≈ 88.9%.)

NVSWITCH BLOCK DIAGRAM
- A GPU-XBAR-bridging device, not a general networking device.
- Packet transforms make traffic to/from multiple GPUs look like it is to/from a single GPU.
- The XBAR (18 x 18) is non-blocking.
- SRAM-based buffering.
- NVLink IP blocks and the XBAR design/verification infrastructure are reused from V100.

(Diagram: 18 NVLink ports, each with PHY/DL/TL layers and port logic for classification and routing, packet transforms, error check and statistics collection, and transaction tracking and packet transforms, all feeding an 18 x 18 XBAR; management runs over PCIe I/O.)

NVLINK: PHYSICAL SHARED MEMORY
- Virtual-to-physical address translation is done in the GPCs.
- NVLink packets carry physical addresses.
- NVSwitch and DGX-2 follow the same model.

NVSWITCH: RELIABILITY
- Hop-by-hop error checking was deemed sufficient in the intra-chassis operating environment.
- The low BER (10^-12 or better) of intra-chassis transmission media removes the need for high-overhead NVLink protections like FEC.
- HW-based consistency checks guard against "escapes" from the hop-to-hop protection: ECC on SRAMs and on in-flight pkts; Data Link (DL) pkt tracking and retry for external links; CRC on external pkts; access-control checks for mis-configured pkts (from either config/app errors or corruption); configurable buffer allocations.
- Management hooks support SW-based scrubbing for table corruptions and error conditions (orphaned packets), plus partial reset and queue draining.
- SW is responsible for additional consistency checks and clean-up.

NVSWITCH: DIE ASPECT RATIO
- Packet-processing and switching (XBAR) logic is extremely compact.
- With more than 50% of the area taken up by I/O blocks, maximizing the perimeter-to-area ratio was key to an efficient design.
- Having all NVLink ports exit the die on parallel paths simplified package substrate routing.

(Diagram: die plot with NVLink PHY blocks lining the top and bottom edges around the non-NVLink logic.)

SIMPLE SWITCH SYSTEM

Without NVSwitch:
- Connect GPUs directly, aggregating NVLinks into gangs for higher bandwidth.
- Traffic is interleaved over the links of a gang to prevent camping.
- Max bandwidth between two GPUs is limited to the bandwidth of the gang.

With NVSwitch:
- Traffic is interleaved across all the links to support full bandwidth between any pair of GPUs (see the sketch below).
- Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of six NVLinks is not exceeded.
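The interleaving above happens in hardware; the following host-side toy model (an assumption for illustration, not the actual NVLink routing logic) shows the idea of spreading packets across a gang by address so that no single stream camps on one link:

```cuda
#include <cstdint>
#include <cstdio>

// Toy model of gang interleaving. Constants come from the slides; the
// address-hash link selection is an illustrative assumption.
constexpr int kLinksPerGang = 6;        // six NVLinks per V100
constexpr double kGBpsPerLink = 50.0;   // ~25 GBps per direction per link

int pickLink(uint64_t physAddr) {
    // Interleave at 128 B packet granularity (the LD/ST packet size above),
    // so consecutive packets from one stream rotate across the gang.
    return static_cast<int>((physAddr >> 7) % kLinksPerGang);
}

int main() {
    uint64_t packets[kLinksPerGang] = {};
    for (uint64_t addr = 0; addr < (1u << 22); addr += 128)
        packets[pickLink(addr)]++;
    for (int l = 0; l < kLinksPerGang; ++l)
        printf("link %d: %llu packets\n", l, (unsigned long long)packets[l]);
    // A six-link gang yields the full 300 GBps bidirectional between a pair.
    printf("gang bandwidth: %.0f GBps bidirectional\n",
           kLinksPerGang * kGBpsPerLink);
    return 0;
}
```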
SWITCH CONNECTED SYSTEMS
- Add NVSwitches in parallel to support more GPUs.
- An eight-GPU closed system can be built with three NVSwitches, with two NVLinks from each GPU to each switch.
- Gang width is still six; traffic is interleaved across all the GPU links.
- GPUs can now communicate pairwise using the full 300 GBps bidirectional between any pair.
- The NVSwitch XBAR provides unique paths from any source to any destination: non-blocking, non-interfering.

NON-BLOCKING BUILDING BLOCK
- Taking this to the limit: connect one NVLink from each of eight GPUs to each of six switches.
- No routing between different switch planes is required.
- Eight of the 18 NVLinks available per switch are used to connect to GPUs.
- Ten NVLinks are available per switch for communication outside the local group (only eight are required to support full bandwidth).
- This is the GPU baseboard configuration for DGX-2.

DGX-2 NVLINK FABRIC
- Two of these building blocks together form a fully connected 16-GPU cluster.
- Non-blocking, non-interfering (unless the same destination is involved).
- Regular loads, stores, and atomics just work.
- Left-right symmetry simplifies physical packaging and manufacturability.

NVIDIA DGX-2: SPEEDS AND FEEDS
- Number of Tesla V100 GPUs: 16
- Aggregate FP64/FP32: 125/250 TFLOPS
- Aggregate Tensor (FP16): 2000 TFLOPS
- Aggregate shared HBM2: 512 GB
- Aggregate HBM2 bandwidth: 14.4 TBps
- Per-GPU NVLink bandwidth: 300 GBps bidirectional
- Chassis bisection bandwidth: 2.4 TBps
- CPUs: dual Xeon Platinum 8168
- CPU memory: 1.5 TB DDR4
- Aggregate storage: 30 TB (8 NVMe drives)
- Peak max TDP: 10 kW
- Dimensions (H/W/D): 17.3 in (10U) / 19.0 in / 32.8 in (440.0 mm / 482.3 mm / 834.0 mm)
- Weight: 340 lb (154.2 kg)
- InfiniBand NICs: 8 Mellanox EDR
- Cooling (forced air): 1,000 CFM

DGX-2 PCIE NETWORK
- The Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI.
- The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMAs over the IB network.
- Configuration and control of the …

(Diagram: two QPI-linked x86 sockets; PCIe switches connect 100G NICs to pairs of V100 GPUs, and each socket reaches the six NVSwitches on its baseboard over PCIe.)
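A quick consistency check of the fabric numbers quoted above can be reproduced from the per-link rate. This is a sketch built only from figures in the slides; the 8-trunk-links-per-switch accounting is taken from the building-block slide ("only eight are required"):

```cuda
#include <cstdio>

// Back-of-envelope check of the DGX-2 fabric numbers quoted above.
int main() {
    const int gpusPerBaseboard   = 8;
    const int switchesPerBoard   = 6;    // one NVLink from each GPU to each switch
    const int trunkLinksPerSw    = 8;    // inter-baseboard links needed for full BW
    const double gbpsPerLinkBidi = 50.0; // ~51.5 GBps raw per the NVSwitch table

    // Each switch: 8 GPU-facing ports + 8 trunk ports out of 18 available.
    printf("ports used per switch: %d of 18\n",
           gpusPerBaseboard + trunkLinksPerSw);

    // Bisection: all trunk links crossing between the two baseboards.
    double bisection = switchesPerBoard * trunkLinksPerSw * gbpsPerLinkBidi;
    printf("bisection bandwidth: %.1f TBps\n", bisection / 1000.0);  // ~2.4 TBps

    // Per-GPU injection: six links into the fabric.
    printf("per-GPU NVLink bandwidth: %.0f GBps bidirectional\n",
           switchesPerBoard * gbpsPerLinkBidi);                      // 300 GBps
    return 0;
}
```

The numbers line up with the chassis table: 6 switches x 8 trunk links x ~50 GBps gives the 2.4 TBps bisection bandwidth, and six ~50 GBps links per GPU give the 300 GBps per-GPU figure.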