Leadership AI Computing
Mike Houston, Chief Architect, AI Systems

AI Impact On Science
AI SUPERCOMPUTING DELIVERING SCIENTIFIC BREAKTHROUGHS
7 of 10 Gordon Bell Finalists Used NVIDIA AI Platforms
AI-Driven Multiscale Simulation: Largest AI+MD Simulation Ever | Docking Analysis: 58x Full Pipeline Speedup | DeePMD-kit: 1,000x Speedup | Square Kilometer Array: 250 GB/s Data Processed End-to-End

EXASCALE AI SCIENCE
Climate (1.12 EF): LBNL | NVIDIA; Genomics (2.36 EF): ORNL; Nuclear Waste Remediation (1.2 EF): LBNL | PNNL | Brown U. | NVIDIA; Cancer Detection (1.3 EF): ORNL | Stony Brook U.

FUSING SIMULATION + AI + DATA ANALYTICS
Transforming Scientific Workflows Across Multiple Domains
Predicting Extreme Weather Events: 5-Day Forecast with 85% Accuracy | SimNet (PINN) for CFD: 18,000x Speedup | RAPIDS for Seismic Analysis: 260x Speedup Using K-Means

FERMILAB USES TRITON TO SCALE DEEP LEARNING INFERENCE IN HIGH ENERGY PARTICLE PHYSICS
GPU-Accelerated Offline Neutrino Reconstruction Workflow
• 400 TB of data from hundreds of millions of neutrino events
• 17x speedup of the DL model on a T4 GPU vs. CPU
• Triton in Kubernetes enables "DL inference" as a service (a client-side sketch follows below)
• Expected to scale to thousands of particle physics client nodes
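For illustration, a client node can submit a batch of events to a Triton server over HTTP with the standard tritonclient Python API. This is a minimal sketch only: the model name ("neutrino_classifier"), tensor names, shapes, and server address are hypothetical, not Fermilab's actual configuration.

```python
# Minimal sketch of "DL inference as a service" from a client node via Triton.
# Model/tensor names, shapes, and the server URL are illustrative placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.example.org:8000")

# One batch of reconstructed event data; the shape is chosen for illustration only.
events = np.random.rand(8, 1, 500, 500).astype(np.float32)

inputs = [httpclient.InferInput("INPUT__0", list(events.shape), "FP32")]
inputs[0].set_data_from_numpy(events)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

result = client.infer(model_name="neutrino_classifier", inputs=inputs, outputs=outputs)
scores = result.as_numpy("OUTPUT__0")  # per-event class scores
print(scores.shape)
```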
Figure: neutrino event classification from reconstruction.

EXPANDING UNIVERSE OF SCIENTIFIC COMPUTING
Simulation | Supercomputing | Data Analytics | Edge Appliance | Network Edge | Visualization | Streaming | Extreme IO | Cloud | AI

THE ERA OF EXASCALE AI SUPERCOMPUTING
AI is the New Growth Driver for Modern HPC
HPL VS. AI PERFORMANCE
Chart: PFLOPS (log scale, 100 to 10,000) by year, 2017 to 2022, comparing HPL and peak AI performance for Summit, Sierra, JUWELS, Fugaku, Perlmutter, and Leonardo.
HPL: based on the #1 system in the June TOP500. AI: peak system FP16 FLOPS.
FIGHTING COVID-19 WITH SCIENTIFIC COMPUTING
$1.25 Trillion Industry | $2B R&D per Drug | 12 ½ Years Development | 90% Failure Rate

Drug discovery pipeline: SEARCH over O(10^60) chemical compounds → O(10^9); GENOMICS, STRUCTURE, DOCKING, SIMULATION, and IMAGING against the biological drug target → O(10^2) candidates; NLP over literature and real-world data informs the whole workflow.

FIGHTING COVID-19 WITH NVIDIA CLARA DISCOVERY
The same funnel, from O(10^60) chemical compounds through O(10^9) screened to O(10^2) candidates against the biological drug target, accelerated at each stage: RAPIDS (search and analytics), Clara Parabricks (genomics), AlphaFold, CryoSPARC and Relion (structure), AutoDock (docking), Schrodinger, NAMD, VMD, OpenMM and MELD (simulation), Clara Imaging and MONAI (imaging), and BioMegatron and BioBERT (NLP over literature and real-world data).

EXPLODING DATA AND MODEL SIZE
EXPLODING MODEL SIZE: Driving Superhuman Capabilities
Chart: parameter count (log scale) by year, 2017 to 2021: Transformer 65 M, BERT 340 M, GPT-2 8B 8.3 Bn, Turing-NLG 17 Bn, GPT-3 175 Bn.

BIG DATA GROWTH: 90% of the World's Data Created in the Last 2 Years
Chart: global data growing from roughly 58 zettabytes in 2020 to 175 zettabytes by 2025.

GROWTH IN SCIENTIFIC DATA: Fueled by Accurate Sensors & Simulations
Examples: COVID-19 graph analytics 393 TB, ECMWF 287 TB/day, NASA Mars landing simulation 550 TB, SKA 16 TB/sec.
Source for Big Data Growth chart: IDC, The Digitization of the World (May 2020)

AI SUPERCOMPUTING NEEDS EXTREME IO
Diagram: multi-node IO path. In each node, eight GPUs sit behind PCIe switches with NICs and 200 GB/s links to storage and to peer nodes, with the CPU and system memory in the path; the CUDA and CUDA-X software stack (RTX, HPC, RAPIDS, AI, CLARA, METRO, DRIVE, ISAAC, AERIAL) generates and consumes this IO.
MAGNUM IO
Four pillars: Network IO (GPUDirect RDMA, node to node), Storage IO (GPUDirect Storage, node to storage), In-Network Compute (SHARP in-network reductions across nodes 0-15 on the IB fabric), and IO Management.
Reported gains: 2x DL inference performance, 10x IO performance, 6x lower CPU utilization.
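The collective that SHARP offloads into the switches is the allreduce used to sum gradients across GPUs. Below is a minimal sketch using PyTorch's NCCL backend, assuming a standard launcher (torchrun or Slurm) has exported RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT; the buffer size is arbitrary.

```python
# Minimal sketch of an NCCL allreduce, the collective SHARP can offload into
# the InfiniBand switches. Assumes a standard launcher has set RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # NCCL handles the IB fabric
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each rank contributes a gradient-sized buffer; allreduce sums it in place
# across all GPUs in the job.
grad = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of FP32
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: sum element = {grad[0].item()}")
```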
IO OPTIMIZATION IMPACT
• Improving simulation of a phosphorus monolayer with the latest NCCL P2P: 1x (MPI), 2.4x (NCCL), 3.8x (NCCL + P2P)
• Segmenting extreme weather phenomena in climate simulations: 1x (NumPy), 1.24x (DALI), 1.54x (DALI + GDS)
• Remote file reads at peak fabric bandwidth on DGX A100: 1x (without GDS), 2.9x (with GDS)

DATA CENTER ARCHITECTURE

SELENE
DGX SuperPOD Deployment
#1 on MLPerf for commercially available systems
#5 on TOP500 (63.46 PetaFLOPS HPL)
#5 on Green500 (23.98 GF/W); #1 on Green500 (26.2 GF/W) for a single scalable unit
#4 on HPCG (1.6 PetaFLOPS)
#3 on HPL-AI (250 PetaFLOPS)
Fastest Industrial System in U.S. — 3+ ExaFLOPS AI
Built with NVIDIA DGX SuperPOD Architecture
• NVIDIA DGX A100 and NVIDIA Mellanox IB
• NVIDIA’s decade of AI experience
Configuration:
• 4480 NVIDIA A100 Tensor Core GPUs
• 560 NVIDIA DGX A100 systems
• 850 Mellanox 200G HDR IB switches
• 14 PB of all-flash storage
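As a quick cross-check, the headline counts follow from multiplying out the configuration above; a small sketch of that arithmetic:

```python
# Sanity-check sketch of Selene's aggregate counts from the configuration
# listed above (560 DGX A100 systems, 8 GPUs and 8 compute + 2 storage
# HDR 200G links per system).
nodes = 560
gpus_per_node = 8
compute_hcas_per_node = 8
storage_hcas_per_node = 2

total_gpus = nodes * gpus_per_node              # 4480 A100 GPUs
compute_links = nodes * compute_hcas_per_node   # 4480 HDR 200G compute ports
storage_links = nodes * storage_hcas_per_node   # 1120 HDR 200G storage ports
print(total_gpus, compute_links, storage_links)
```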
LESSONS LEARNED
How to Build and Deploy HPC Systems with Hyperscale Sensibilities
Speed and feed matching
Thermal and power design
Interconnect design
Deployability
Operability
Flexibility
Expandability
DGX-1 PODs
Photos: NVIDIA DGX-1 original layout; DGX-1 Multi-POD; RIKEN RAIDEN; NVIDIA DGX-1 new layout.

DGX SuperPOD with DGX-2

A NEW DATA CENTER DESIGN

DGX SUPERPOD
Fast Deployment Ready: Cold Aisle Containment Design

DGX SuperPOD Cooling / Airflow

A NEW GENERATION OF SYSTEMS: NVIDIA DGX A100
NVIDIA DGX A100 specifications:
GPUs: 8x NVIDIA A100
GPU memory: 640 GB total
Peak performance: 5 petaFLOPS AI | 10 petaOPS INT8
NVSwitches: 6
System power usage: 6.5 kW max
CPU: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
System memory: 2 TB
Networking: 8x single-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (compute network); 2x dual-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (storage network, also used for Ethernet*)
Storage: OS: 2x 1.92 TB M.2 NVMe drives; internal: 15 TB (4x 3.84 TB) U.2 NVMe drives
Software: Ubuntu Linux OS (5.3+ kernel)
System weight: 271 lbs (123 kg)
Packaged system weight: 315 lbs (143 kg)
Height: 6U
Operating temperature range: 5°C to 30°C (41°F to 86°F)
* Optional upgrades
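The peak AI and INT8 figures in the table follow from the per-GPU A100 Tensor Core peaks; a minimal sketch, assuming the commonly quoted structured-sparsity peaks of 624 TFLOPS FP16 and 1,248 TOPS INT8 per A100:

```python
# Sanity-check sketch of the DGX A100 peak figures above, assuming per-A100
# structured-sparsity peaks of 624 TFLOPS (FP16 Tensor Core) and 1248 TOPS (INT8).
gpus = 8
fp16_tflops_per_gpu = 624   # assumed A100 FP16 Tensor Core peak with sparsity
int8_tops_per_gpu = 1248    # assumed A100 INT8 Tensor Core peak with sparsity

ai_petaflops = gpus * fp16_tflops_per_gpu / 1000   # ~5 petaFLOPS AI
int8_petaops = gpus * int8_tops_per_gpu / 1000     # ~10 petaOPS INT8
print(ai_petaflops, int8_petaops)                  # 4.992, 9.984
```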
MODULARITY: RAPID DEPLOYMENT
Compute: Scalable Unit (SU) | Compute Fabric | Storage and Management
DGX SUPERPOD: Modular Architecture
1K GPU POD (SuperPOD cluster):
• 140 DGX A100 nodes (1,120 GPUs) in a GPU POD
• 1st-tier fast storage: DDN AI400X with Lustre
• Mellanox HDR 200 Gb/s InfiniBand, full fat-tree
• Network optimized for AI and HPC

DGX A100 nodes:
• 2x AMD 7742 EPYC CPUs + 8x A100 GPUs
• NVLink 3.0 fully connected switch
• 8 compute + 2 storage HDR IB ports

A fast interconnect:
• Modular IB fat-tree
• Separate networks for compute and storage
• Adaptive routing and SHARPv2 support for offload

Diagram: distributed core switches, spine and storage spine switches, leaf and storage leaf switches, DGX A100 #1 through #140, and storage.
DGX SUPERPOD: Extensible Architecture

POD to POD:
• Modular IB fat-tree or DragonFly+
• Core IB switches distributed between PODs
• Direct connect POD to POD

Diagram: multiple 1K GPU PODs (each with distributed core switches, spine and storage spine switches, leaf and storage leaf switches, DGX A100 #1 through #140, and storage) linked through the distributed core.
MULTI-NODE IB COMPUTE: The Details

Designed with a Mellanox 200 Gb/s HDR IB network:
• Separate compute and storage fabrics: 8 links for compute, 2 links for storage (Lustre)
• Both networks share a similar fat-tree design

Modular POD design:
• 140 DGX A100 nodes are fully connected in a SuperPOD
• A SuperPOD contains compute nodes and storage
• All nodes and storage are usable between SuperPODs

SHARPv2-optimized design:
• Leaf and spine switches are organized in HCA planes
• Within a SuperPOD, HCA1 from all 140 DGX A100 systems connects to an HCA1-plane fat-tree network
• Traffic from HCA1 to HCA1 between any two nodes in a POD stays at the leaf or spine level
• Core switches are used only when moving data between HCA planes (e.g. mlx5_0 to mlx5_1 in another system) or moving any data between SuperPODs
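The routing behavior described above reduces to a simple rule. The sketch below encodes it for illustration only; the POD and HCA-plane identifiers are hypothetical and this is not a real fabric-management API.

```python
# Illustrative sketch of the routing rules above: same-plane traffic within a
# SuperPOD stays at the leaf or spine level of that plane's fat tree, and core
# switches are crossed only when changing HCA planes or SuperPODs.
# Identifiers are hypothetical.
def fabric_level(src_pod, src_plane, dst_pod, dst_plane):
    """Return the highest switch tier a message traverses."""
    if src_pod != dst_pod:
        return "core"           # any traffic between SuperPODs uses core switches
    if src_plane != dst_plane:
        return "core"           # e.g. mlx5_0 to mlx5_1 in another system
    return "leaf-or-spine"      # same plane, same POD: stays inside the plane

print(fabric_level(0, 1, 0, 1))  # leaf-or-spine: same POD, same HCA plane
print(fabric_level(0, 0, 0, 1))  # core: crossing HCA planes
```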
DESIGNING FOR PERFORMANCE: In the Data Center
The design follows a radix-optimized approach, both for SHARPv2 support and fabric performance and to align with the design of the Mellanox Quantum switches.
Scalable Unit (SU)
SHARP HDR200: Selene Early Results
128 NVIDIA DGX A100 systems (1,024 GPUs, 1,024 InfiniBand adapters)
Chart: NCCL AllReduce performance increase with SHARP, plotted as a performance increase factor (0 to 3) against message size.
STORAGE
Parallel filesystem for performance, NFS for home directories. Per SuperPOD:

Fast parallel FS: Lustre (DDN)
• 10 DDN AI400X units
• Total capacity: 2.5 PB
• Max read/write performance: 490/250 GB/s
• 80 HDR-100 cables required
• 16.6 kW

Shared FS: Oracle ZFS5-2
• HA controller pair, 768 GB total
• 8U total space (4U per disk shelf, 2U per controller)
• 76.8 TB raw (24x 3.2 TB SSD)
• 16x 40 GbE
• Key features: NFS, HA, snapshots, dedupe
• 2 kW
STORAGE HIERARCHY
• Memory (file) cache (aggregate): 224 TB/sec, 1.1 PB (2 TB/node)
• NVMe cache (aggregate): 28 TB/sec, 16 PB (30 TB/node)
• Network filesystem (cache, Lustre): 2 TB/sec, 10 PB
• Object storage: 100 GB/sec, 100+ PB
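Dividing these aggregates by the node count (560 DGX A100 systems in the Selene configuration listed earlier) gives the per-node view of each tier; a small sketch of that arithmetic:

```python
# Per-node view of the storage hierarchy, derived from the aggregate figures
# above and the 560-node Selene configuration listed earlier in the deck.
nodes = 560
tiers = {
    # name: (aggregate bandwidth in TB/s, aggregate capacity in TB)
    "memory (file) cache": (224, 1100),
    "NVMe cache":          (28, 16000),
    "Lustre (network FS)": (2, 10000),
    "object storage":      (0.1, 100000),
}
for name, (bw_tb_s, cap_tb) in tiers.items():
    print(f"{name:22s} ~{bw_tb_s / nodes * 1000:7.1f} GB/s per node, "
          f"~{cap_tb / nodes:7.1f} TB per node")
```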
SOFTWARE OPERATIONS
SCALE TO MULTIPLE NODES: Software Stack (Application)
• Deep learning model: hyperparameters tuned for multi-node scaling; multi-node launcher scripts (see the sketch below)
• Deep learning container: optimized TensorFlow, GPU libraries, and multi-node software
• Host: host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
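In practice this layer boils down to reading the job topology from the launcher and rescaling hyperparameters with the world size. A minimal sketch using Slurm environment variables and a simple linear learning-rate scaling heuristic; the model, base learning rate, and scaling rule are illustrative placeholders, not the tuned production recipe.

```python
# Minimal sketch of a multi-node training entry point on this stack.
# Rank/world size come from Slurm; MASTER_ADDR/MASTER_PORT are assumed to be
# exported by the job script. Model and LR scaling rule are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])

base_lr = 1e-3                 # placeholder single-GPU learning rate
lr = base_lr * world_size      # one common multi-node scaling heuristic
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
```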
SCALE TO MULTIPLE NODES: Software Stack (System)
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks
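Putting the pieces together, a containerized training job is typically launched by handing Slurm an NGC image through the Pyxis flags. A minimal sketch that composes such an srun command from Python; the image tag, mount points, resource counts, and script path are placeholders.

```python
# Minimal sketch of launching a containerized job through Slurm + Pyxis/Enroot.
# The image tag, mounts, node/task counts, and script are placeholders;
# --container-image and --container-mounts are the Pyxis plugin flags.
import subprocess

cmd = [
    "srun",
    "--nodes=4",
    "--ntasks-per-node=8",                               # one task per GPU on a DGX A100
    "--container-image=nvcr.io#nvidia/pytorch:20.06-py3",
    "--container-mounts=/lustre/datasets:/data",
    "python", "train.py", "--epochs", "90",
]
subprocess.run(cmd, check=True)
```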
Deployment layout: login nodes and a Slurm controller front a DGX POD of DGX servers running the DGX Base OS, with Pyxis and Enroot (or Docker) as the container runtime, DCGM for GPU monitoring, and NGC model containers (PyTorch and TensorFlow, from release 19.09 onward).
INTEGRATING CLUSTERS IN THE DEVELOPMENT WORKFLOW
Supercomputer-scale CI (continuous integration internal at NVIDIA)
• Integrating DL-friendly tools like GitLab and Docker with HPC systems
• Kick off 10,000s of GPU-hours of tests with a single button click in GitLab
  … build and package with Docker
  … schedule and prioritize with Slurm
  … on demand or on a schedule
  … reporting via GitLab, ELK stack, Slack, email
• Emphasis on keeping things simple for users while hiding integration complexity
• Ensure reproducibility and rapid triage
LINKS
RESOURCES: Presentations
GTC Sessions (https://www.nvidia.com/en-us/gtc/session-catalog/):
• Under the Hood of the New DGX A100 System Architecture [S21884]
• Inside the NVIDIA Ampere Architecture [S21730]
• CUDA New Features and Beyond [S21760]
• Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing [S21766]
• Introducing NVIDIA DGX A100: the Universal AI System for Enterprise [S21702]
• Mixed-Precision Training of Neural Networks [S22082]
• Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide [S21929]
• Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 [S21745]

Hot Chips:
• Hot Chips Tutorial: Scale-Out Training Experiences, Megatron Language Model
• Hot Chips Session: NVIDIA's A100 GPU: Performance and Innovation for GPU Computing

Pyxis/Enroot:
• https://fosdem.org/2020/schedule/event/containers_hpc_unprivileged/
RESOURCES: Links and Other Docs
DGX A100 page: https://www.nvidia.com/en-us/data-center/dgx-a100/

Blogs:
• DGX SuperPOD: https://blogs.nvidia.com/blog/2020/05/14/dgx-superpod-a100/
• DDN blog for DGX A100 storage: https://www.ddn.com/press-releases/ddn-a3i-nvidia-dgx-a100/
• Kitchen Keynote summary: https://blogs.nvidia.com/blog/2020/05/14/gtc-2020-keynote/
• Double-Precision Tensor Cores: https://blogs.nvidia.com/blog/2020/05/14/double-precision-tensor-cores/