Leadership AI Computing
Mike Houston, Chief Architect, AI Systems

AI Impact On Science
AI SUPERCOMPUTING DELIVERING SCIENTIFIC BREAKTHROUGHS
7 of 10 Gordon Bell Finalists Used NVIDIA AI Platforms
AI-Driven Multiscale Simulation: Largest AI+MD Simulation Ever | Docking Analysis: 58x Full Pipeline Speedup | DeePMD-kit: 1,000x Speedup | Square Kilometer Array: 250 GB/s Data Processed End-to-End

EXASCALE AI SCIENCE
Climate (1.12 EF): LBNL | NVIDIA; Genomics (2.36 EF): ORNL; Nuclear Waste Remediation (1.2 EF): LBNL | PNNL | Brown U. | NVIDIA; Cancer Detection (1.3 EF): ORNL | Stony Brook U.

FUSING SIMULATION + AI + DATA ANALYTICS
Transforming Scientific Workflows Across Multiple Domains
Predicting Extreme Weather Events: 5-Day Forecast with 85% Accuracy | SimNet (PINN) for CFD: 18,000x Speedup | RAPIDS for Seismic Analysis: 260x Speedup Using K-Means

FERMILAB USES TRITON TO SCALE DEEP LEARNING INFERENCE IN HIGH ENERGY PARTICLE PHYSICS
GPU-Accelerated Offline Neutrino Reconstruction Workflow
• 400 TB of data from hundreds of millions of neutrino events
• 17x speedup of the DL model on a T4 GPU vs. CPU
• Triton in Kubernetes enables "DL inference" as a service (a client-side sketch follows below)
• Expected to scale to thousands of particle physics client nodes
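For illustration, a client node can submit a batch of events to a Triton server over HTTP with the standard tritonclient Python API. This is a minimal sketch only: the model name ("neutrino_classifier"), tensor names, shapes, and server address are hypothetical, not Fermilab's actual configuration.

```python
# Minimal sketch of "DL inference as a service" from a client node via Triton.
# Model/tensor names, shapes, and the server URL are illustrative placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.example.org:8000")

# One batch of reconstructed event data; the shape is chosen for illustration only.
events = np.random.rand(8, 1, 500, 500).astype(np.float32)

inputs = [httpclient.InferInput("INPUT__0", list(events.shape), "FP32")]
inputs[0].set_data_from_numpy(events)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

result = client.infer(model_name="neutrino_classifier", inputs=inputs, outputs=outputs)
scores = result.as_numpy("OUTPUT__0")  # per-event class scores
print(scores.shape)
```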
Figure: neutrino event classification from reconstruction.

EXPANDING UNIVERSE OF SCIENTIFIC COMPUTING
Simulation | Supercomputing | Data Analytics | Edge Appliance | Network Edge | Visualization | Streaming | Extreme IO | Cloud | AI

THE ERA OF EXASCALE AI SUPERCOMPUTING
AI is the New Growth Driver for Modern HPC
HPL VS. AI PERFORMANCE
Chart: PFLOPS (log scale, 100 to 10,000) by year, 2017 to 2022, comparing HPL and peak AI performance for Summit, Sierra, JUWELS, Fugaku, Perlmutter, and Leonardo.
HPL: based on the #1 system in the June TOP500. AI: peak system FP16 FLOPS.
FIGHTING COVID-19 WITH SCIENTIFIC COMPUTING
$1.25 Trillion Industry | $2B R&D per Drug | 12 ½ Years Development | 90% Failure Rate

Drug discovery pipeline: SEARCH over O(10^60) chemical compounds → O(10^9); GENOMICS, STRUCTURE, DOCKING, SIMULATION, and IMAGING against the biological drug target → O(10^2) candidates; NLP over literature and real-world data informs the whole workflow.

FIGHTING COVID-19 WITH NVIDIA CLARA DISCOVERY
The same funnel, from O(10^60) chemical compounds through O(10^9) screened to O(10^2) candidates against the biological drug target, accelerated at each stage: RAPIDS (search and analytics), Clara Parabricks (genomics), AlphaFold, CryoSPARC and Relion (structure), AutoDock (docking), Schrodinger, NAMD, VMD, OpenMM and MELD (simulation), Clara Imaging and MONAI (imaging), and BioMegatron and BioBERT (NLP over literature and real-world data).

EXPLODING DATA AND MODEL SIZE
EXPLODING MODEL SIZE: Driving Superhuman Capabilities
Chart: parameter count (log scale) by year, 2017 to 2021: Transformer 65 M, BERT 340 M, GPT-2 8B 8.3 Bn, Turing-NLG 17 Bn, GPT-3 175 Bn.

BIG DATA GROWTH: 90% of the World's Data Created in the Last 2 Years
Chart: global data growing from roughly 58 zettabytes in 2020 to 175 zettabytes by 2025.

GROWTH IN SCIENTIFIC DATA: Fueled by Accurate Sensors & Simulations
Examples: COVID-19 graph analytics 393 TB, ECMWF 287 TB/day, NASA Mars landing simulation 550 TB, SKA 16 TB/sec.
Source for Big Data Growth chart: IDC, The Digitization of the World (May 2020)

AI SUPERCOMPUTING NEEDS EXTREME IO
Diagram: multi-node IO path. In each node, eight GPUs sit behind PCIe switches with NICs and 200 GB/s links to storage and to peer nodes, with the CPU and system memory in the path; the CUDA and CUDA-X software stack (RTX, HPC, RAPIDS, AI, CLARA, METRO, DRIVE, ISAAC, AERIAL) generates and consumes this IO.
MAGNUM IO
Four pillars: Network IO (GPUDirect RDMA, node to node), Storage IO (GPUDirect Storage, node to storage), In-Network Compute (SHARP in-network reductions across nodes 0-15 on the IB fabric), and IO Management.
Reported gains: 2x DL inference performance, 10x IO performance, 6x lower CPU utilization.
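The collective that SHARP offloads into the switches is the allreduce used to sum gradients across GPUs. Below is a minimal sketch using PyTorch's NCCL backend, assuming a standard launcher (torchrun or Slurm) has exported RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT; the buffer size is arbitrary.

```python
# Minimal sketch of an NCCL allreduce, the collective SHARP can offload into
# the InfiniBand switches. Assumes a standard launcher has set RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # NCCL handles the IB fabric
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each rank contributes a gradient-sized buffer; allreduce sums it in place
# across all GPUs in the job.
grad = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of FP32
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: sum element = {grad[0].item()}")
```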
IO OPTIMIZATION IMPACT
• Improving simulation of a phosphorus monolayer with the latest NCCL P2P: 1x (MPI), 2.4x (NCCL), 3.8x (NCCL + P2P)
• Segmenting extreme weather phenomena in climate simulations: 1x (NumPy), 1.24x (DALI), 1.54x (DALI + GDS)
• Remote file reads at peak fabric bandwidth on DGX A100: 1x (without GDS), 2.9x (with GDS)

DATA CENTER ARCHITECTURE

SELENE
DGX SuperPOD Deployment
#1 on MLPerf for commercially available systems
#5 on TOP500 (63.46 PetaFLOPS HPL)
#5 on Green500 (23.98 GF/W); #1 on Green500 (26.2 GF/W) for a single scalable unit
#4 on HPCG (1.6 PetaFLOPS)
#3 on HPL-AI (250 PetaFLOPS)
Fastest Industrial System in U.S. — 3+ ExaFLOPS AI
Built with NVIDIA DGX SuperPOD Architecture
• NVIDIA DGX A100 and NVIDIA Mellanox IB
• NVIDIA’s decade of AI experience
Configuration:
• 4480 NVIDIA A100 Tensor Core GPUs
• 560 NVIDIA DGX A100 systems
• 850 Mellanox 200G HDR IB switches
• 14 PB of all-flash storage
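As a quick cross-check, the headline counts follow from multiplying out the configuration above; a small sketch of that arithmetic:

```python
# Sanity-check sketch of Selene's aggregate counts from the configuration
# listed above (560 DGX A100 systems, 8 GPUs and 8 compute + 2 storage
# HDR 200G links per system).
nodes = 560
gpus_per_node = 8
compute_hcas_per_node = 8
storage_hcas_per_node = 2

total_gpus = nodes * gpus_per_node              # 4480 A100 GPUs
compute_links = nodes * compute_hcas_per_node   # 4480 HDR 200G compute ports
storage_links = nodes * storage_hcas_per_node   # 1120 HDR 200G storage ports
print(total_gpus, compute_links, storage_links)
```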
LESSONS LEARNED
How to Build and Deploy HPC Systems with Hyperscale Sensibilities
Speed and feed matching
Thermal and power design
Interconnect design
Deployability
Operability
Flexibility
Expandability
DGX-1 PODs
Photos: NVIDIA DGX-1 original layout; DGX-1 Multi-POD; RIKEN RAIDEN; NVIDIA DGX-1 new layout.

DGX SuperPOD with DGX-2

A NEW DATA CENTER DESIGN

DGX SUPERPOD
Fast Deployment Ready: Cold Aisle Containment Design

DGX SuperPOD Cooling / Airflow

A NEW GENERATION OF SYSTEMS: NVIDIA DGX A100
NVIDIA DGX A100 specifications:
GPUs: 8x NVIDIA A100
GPU memory: 640 GB total
Peak performance: 5 petaFLOPS AI | 10 petaOPS INT8
NVSwitches: 6
System power usage: 6.5 kW max
CPU: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
System memory: 2 TB
Networking: 8x single-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (compute network); 2x dual-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (storage network, also used for Ethernet*)
Storage: OS: 2x 1.92 TB M.2 NVMe drives; internal: 15 TB (4x 3.84 TB) U.2 NVMe drives
Software: Ubuntu Linux OS (5.3+ kernel)
System weight: 271 lbs (123 kg)
Packaged system weight: 315 lbs (143 kg)
Height: 6U
Operating temperature range: 5°C to 30°C (41°F to 86°F)
* Optional upgrades
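The peak AI and INT8 figures in the table follow from the per-GPU A100 Tensor Core peaks; a minimal sketch, assuming the commonly quoted structured-sparsity peaks of 624 TFLOPS FP16 and 1,248 TOPS INT8 per A100:

```python
# Sanity-check sketch of the DGX A100 peak figures above, assuming per-A100
# structured-sparsity peaks of 624 TFLOPS (FP16 Tensor Core) and 1248 TOPS (INT8).
gpus = 8
fp16_tflops_per_gpu = 624   # assumed A100 FP16 Tensor Core peak with sparsity
int8_tops_per_gpu = 1248    # assumed A100 INT8 Tensor Core peak with sparsity

ai_petaflops = gpus * fp16_tflops_per_gpu / 1000   # ~5 petaFLOPS AI
int8_petaops = gpus * int8_tops_per_gpu / 1000     # ~10 petaOPS INT8
print(ai_petaflops, int8_petaops)                  # 4.992, 9.984
```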
MODULARITY: RAPID DEPLOYMENT
Compute: Scalable Unit (SU) | Compute Fabric | Storage and Management
DGX SUPERPOD: Modular Architecture
1K GPU POD (SuperPOD cluster):
• 140 DGX A100 nodes (1,120 GPUs) in a GPU POD
• 1st-tier fast storage: DDN AI400X with Lustre
• Mellanox HDR 200 Gb/s InfiniBand, full fat-tree
• Network optimized for AI and HPC

DGX A100 nodes:
• 2x AMD 7742 EPYC CPUs + 8x A100 GPUs
• NVLink 3.0 fully connected switch
• 8 compute + 2 storage HDR IB ports

A fast interconnect:
• Modular IB fat-tree
• Separate networks for compute and storage
• Adaptive routing and SHARPv2 support for offload

Diagram: distributed core switches, spine and storage spine switches, leaf and storage leaf switches, DGX A100 #1 through #140, and storage.
DGX SUPERPOD: Extensible Architecture

POD to POD:
• Modular IB fat-tree or DragonFly+
• Core IB switches distributed between PODs
• Direct connect POD to POD

Diagram: multiple 1K GPU PODs (each with distributed core switches, spine and storage spine switches, leaf and storage leaf switches, DGX A100 #1 through #140, and storage) linked through the distributed core.
MULTI-NODE IB COMPUTE: The Details

Designed with a Mellanox 200 Gb/s HDR IB network:
• Separate compute and storage fabrics: 8 links for compute, 2 links for storage (Lustre)
• Both networks share a similar fat-tree design

Modular POD design:
• 140 DGX A100 nodes are fully connected in a SuperPOD
• A SuperPOD contains compute nodes and storage
• All nodes and storage are usable between SuperPODs

SHARPv2-optimized design:
• Leaf and spine switches are organized in HCA planes
• Within a SuperPOD, HCA1 from all 140 DGX A100 systems connects to an HCA1-plane fat-tree network
• Traffic from HCA1 to HCA1 between any two nodes in a POD stays at the leaf or spine level
• Core switches are used only when moving data between HCA planes (e.g. mlx5_0 to mlx5_1 in another system) or moving any data between SuperPODs
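The routing behavior described above reduces to a simple rule. The sketch below encodes it for illustration only; the POD and HCA-plane identifiers are hypothetical and this is not a real fabric-management API.

```python
# Illustrative sketch of the routing rules above: same-plane traffic within a
# SuperPOD stays at the leaf or spine level of that plane's fat tree, and core
# switches are crossed only when changing HCA planes or SuperPODs.
# Identifiers are hypothetical.
def fabric_level(src_pod, src_plane, dst_pod, dst_plane):
    """Return the highest switch tier a message traverses."""
    if src_pod != dst_pod:
        return "core"           # any traffic between SuperPODs uses core switches
    if src_plane != dst_plane:
        return "core"           # e.g. mlx5_0 to mlx5_1 in another system
    return "leaf-or-spine"      # same plane, same POD: stays inside the plane

print(fabric_level(0, 1, 0, 1))  # leaf-or-spine: same POD, same HCA plane
print(fabric_level(0, 0, 0, 1))  # core: crossing HCA planes
```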
DESIGNING FOR PERFORMANCE: In the Data Center
The design follows a radix-optimized approach, both for SHARPv2 support and fabric performance and to align with the design of the Mellanox Quantum switches.
Scalable Unit (SU)
SHARP HDR200: Selene Early Results
128 NVIDIA DGX A100 systems (1,024 GPUs, 1,024 InfiniBand adapters)
Chart: NCCL AllReduce performance increase with SHARP, plotted as a performance increase factor (0 to 3) against message size.
STORAGE
Parallel filesystem for performance, NFS for home directories. Per SuperPOD:

Fast parallel FS: Lustre (DDN)
• 10 DDN AI400X units
• Total capacity: 2.5 PB
• Max read/write performance: 490/250 GB/s
• 80 HDR-100 cables required
• 16.6 kW

Shared FS: Oracle ZFS5-2
• HA controller pair, 768 GB total
• 8U total space (4U per disk shelf, 2U per controller)
• 76.8 TB raw (24x 3.2 TB SSD)
• 16x 40 GbE
• Key features: NFS, HA, snapshots, dedupe
• 2 kW
STORAGE HIERARCHY
• Memory (file) cache (aggregate): 224 TB/sec, 1.1 PB (2 TB/node)
• NVMe cache (aggregate): 28 TB/sec, 16 PB (30 TB/node)
• Network filesystem (cache, Lustre): 2 TB/sec, 10 PB
• Object storage: 100 GB/sec, 100+ PB
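Dividing these aggregates by the node count (560 DGX A100 systems in the Selene configuration listed earlier) gives the per-node view of each tier; a small sketch of that arithmetic:

```python
# Per-node view of the storage hierarchy, derived from the aggregate figures
# above and the 560-node Selene configuration listed earlier in the deck.
nodes = 560
tiers = {
    # name: (aggregate bandwidth in TB/s, aggregate capacity in TB)
    "memory (file) cache": (224, 1100),
    "NVMe cache":          (28, 16000),
    "Lustre (network FS)": (2, 10000),
    "object storage":      (0.1, 100000),
}
for name, (bw_tb_s, cap_tb) in tiers.items():
    print(f"{name:22s} ~{bw_tb_s / nodes * 1000:7.1f} GB/s per node, "
          f"~{cap_tb / nodes:7.1f} TB per node")
```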
SOFTWARE OPERATIONS
SCALE TO MULTIPLE NODES: Software Stack (Application)
• Deep learning model: hyperparameters tuned for multi-node scaling; multi-node launcher scripts (see the sketch below)
• Deep learning container: optimized TensorFlow, GPU libraries, and multi-node software
• Host: host OS, GPU driver, IB driver, container runtime engine (Docker, Enroot)
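In practice this layer boils down to reading the job topology from the launcher and rescaling hyperparameters with the world size. A minimal sketch using Slurm environment variables and a simple linear learning-rate scaling heuristic; the model, base learning rate, and scaling rule are illustrative placeholders, not the tuned production recipe.

```python
# Minimal sketch of a multi-node training entry point on this stack.
# Rank/world size come from Slurm; MASTER_ADDR/MASTER_PORT are assumed to be
# exported by the job script. Model and LR scaling rule are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])

base_lr = 1e-3                 # placeholder single-GPU learning rate
lr = base_lr * world_size      # one common multi-node scaling heuristic
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
```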
SCALE TO MULTIPLE NODES: Software Stack (System)
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks
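Putting the pieces together, a containerized training job is typically launched by handing Slurm an NGC image through the Pyxis flags. A minimal sketch that composes such an srun command from Python; the image tag, mount points, resource counts, and script path are placeholders.

```python
# Minimal sketch of launching a containerized job through Slurm + Pyxis/Enroot.
# The image tag, mounts, node/task counts, and script are placeholders;
# --container-image and --container-mounts are the Pyxis plugin flags.
import subprocess

cmd = [
    "srun",
    "--nodes=4",
    "--ntasks-per-node=8",                               # one task per GPU on a DGX A100
    "--container-image=nvcr.io#nvidia/pytorch:20.06-py3",
    "--container-mounts=/lustre/datasets:/data",
    "python", "train.py", "--epochs", "90",
]
subprocess.run(cmd, check=True)
```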
Deployment layout: login nodes and a Slurm controller front a DGX POD of DGX servers running the DGX Base OS, with Pyxis and Enroot (or Docker) as the container runtime, DCGM for GPU monitoring, and NGC model containers (PyTorch and TensorFlow, from release 19.09 onward).
INTEGRATING CLUSTERS IN THE DEVELOPMENT WORKFLOW
Supercomputer-scale CI (continuous integration internal at NVIDIA)
• Integrating DL-friendly tools like GitLab and Docker with HPC systems
• Kick off 10,000s of GPU-hours of tests with a single button click in GitLab
  … build and package with Docker
  … schedule and prioritize with Slurm
  … on demand or on a schedule
  … reporting via GitLab, ELK stack, Slack, email
• Emphasis on keeping things simple for users while hiding integration complexity
• Ensure reproducibility and rapid triage
LINKS
RESOURCES: Presentations
GTC Sessions (https://www.nvidia.com/en-us/gtc/session-catalog/):
• Under the Hood of the New DGX A100 System Architecture [S21884]
• Inside the NVIDIA Ampere Architecture [S21730]
• CUDA New Features and Beyond [S21760]
• Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing [S21766]
• Introducing NVIDIA DGX A100: the Universal AI System for Enterprise [S21702]
• Mixed-Precision Training of Neural Networks [S22082]
• Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide [S21929]
• Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 [S21745]

Hot Chips:
• Hot Chips Tutorial: Scale-Out Training Experiences, Megatron Language Model
• Hot Chips Session: NVIDIA's A100 GPU: Performance and Innovation for GPU Computing

Pyxis/Enroot:
• https://fosdem.org/2020/schedule/event/containers_hpc_unprivileged/
RESOURCES: Links and Other Docs
DGX A100 page: https://www.nvidia.com/en-us/data-center/dgx-a100/

Blogs:
• DGX SuperPOD: https://blogs.nvidia.com/blog/2020/05/14/dgx-superpod-a100/
• DDN blog for DGX A100 storage: https://www.ddn.com/press-releases/ddn-a3i-nvidia-dgx-a100/
• Kitchen Keynote summary: https://blogs.nvidia.com/blog/2020/05/14/gtc-2020-keynote/
• Double-Precision Tensor Cores: https://blogs.nvidia.com/blog/2020/05/14/double-precision-tensor-cores/