
CUDA 11 AND A100 - WHAT'S NEW?
Markus Hrywniak, 23rd June 2020

TOPICS FOR TODAY

Ampere architecture – A100, powering DGX-A100, HGX-A100... and soon, FZ Jülich's JUWELS Booster

New CUDA 11 Toolkit release

Overview of features

Talk next week: Third generation Tensor Cores

The GTC talks go into much more detail. See the references!

HGX-A100 4-GPU: 4x A100 with NVLink
HGX-A100 8-GPU: 8x A100 with NVSwitch

HIERARCHY OF SCALES

Multi-system rack: unlimited scale
Multi-GPU system: 8 GPUs
Multi-SM GPU: 108 multiprocessors
Multi-core SM: 2048 threads

AMDAHL'S LAW

A program alternates serial sections and parallel sections. The shortest possible runtime is the sum of the serial-section times: parallelism only saves time in the parallel sections.
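For reference (the slide shows this only graphically): with a parallelizable fraction p of the work and N workers, Amdahl's law bounds the speedup as

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

so the serial fraction 1 - p caps the achievable speedup no matter how many workers are added.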

Some parallelism: program time = sum(serial times + parallel times).
Increased parallelism: parallel sections take less time; serial sections take the same time.
Infinite parallelism: parallel sections take no time; serial sections take the same time.

OVERCOMING AMDAHL: ASYNCHRONY & LATENCY

Split up the serial & parallel components and overlap them:
Some parallelism: program time = sum(serial times + parallel times).
With overlap: parallel sections overlap with serial sections.
Infinite parallelism: parallel sections take no time; serial sections take the same time.

CUDA Concurrency Mechanisms At Every Scope
CUDA kernel: threads, warps, blocks, barriers
Application: CUDA streams, CUDA graphs (see the sketch below)
Node: Multi-Process Service, GPU-Direct
System: NCCL, CUDA-aware MPI, NVSHMEM
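To make the application-scope mechanisms concrete, here is a hedged sketch that overlaps host-to-device copies with kernel execution using two CUDA streams (the kernel process_chunk, the buffers and the chunk size are illustrative; h_buf is assumed to be pinned with cudaMallocHost so the copies can actually overlap):

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < 2; ++s) {
        size_t offset = s * chunk_elems;
        // Each chunk gets its own stream, so the copy of one chunk can
        // overlap with the kernel processing the other chunk.
        cudaMemcpyAsync(d_buf + offset, h_buf + offset,
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<grid, block, 0, streams[s]>>>(d_buf + offset, chunk_elems);
    }

    for (int s = 0; s < 2; ++s)
        cudaStreamSynchronize(streams[s]);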

OVERCOMING AMDAHL: ASYNCHRONY & LATENCY

Execution overheads are non-productive latencies (waste); they are reduced through hardware & system efficiency improvements.
Operation latencies (network latencies, memory reads/writes, file I/O, ...) are the cost of doing work; they are improved through hardware & software optimization.

CUDA KEY INITIATIVES


Hierarchy: programming and running systems at every scale.
Asynchrony: creating concurrency at every level of the hierarchy.
Latency: overcoming Amdahl with lower overheads for memory & processing.
Language: supporting and evolving Standard Languages.

THE AMPERE GPU ARCHITECTURE

NVIDIA GA100 Key Architectural Features

Multi-Instance GPU

Advanced barriers

Asynchronous data movement

L2 management

Task graph acceleration

New Tensor Core precisions

For more information see: S21730 - Inside the NVIDIA Ampere Architecture and www.nvidia.com/nvidia-ampere-architecture-whitepaper

A100 TENSOR-CORE GPU
54 billion transistors in 7 nm

Scale out: GigaThread Engine with MIG control, Multi-Instance GPU.
Scale up: third-generation NVLink with 2x bandwidth.
108 SMs, 6912 CUDA cores; 40 MB L2 cache (6.7x capacity); 1.56 TB/s HBM2 (1.7x bandwidth).

A100 SM

Third-generation Tensor Cores: faster and more efficient, comprehensive data types, FP64 support, sparsity acceleration.
Asynchronous data movement and synchronization.
Increased L1/SMEM capacity: 192 KB combined L1 data cache / shared memory.

ACCELERATING HPC
(Chart: A100 vs. V100 speedups of roughly 1.5x to 2.1x across molecular dynamics, physics, engineering and geoscience codes: AMBER, GROMACS, LAMMPS, NAMD, Chroma, BerkeleyGW, FUN3D, RTM, SPECFEM3D.)

All results are measured. Except for BerkeleyGW, the V100 used is a single V100 SXM2 and the A100 a single A100 SXM4. Application details: AMBER with PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with the Cartesian four-material model. BerkeleyGW is based on Chi Sum and uses 8x V100 in DGX-1 vs. 8x A100 in DGX A100.

NEW MULTI-INSTANCE GPU (MIG)
Divide a Single GPU Into Multiple Instances, Each With Isolated Paths Through the Entire Memory System

Up to 7 GPU instances in a single A100: full software stack enabled on each instance, with dedicated SMs, memory, L2 cache & bandwidth.
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput & latency and with fault & error isolation.
Diverse deployment environments: supported with bare metal, Docker, Kubernetes Pods, and virtualized environments.

(Diagram: seven users, USER0-USER6, each mapped to their own GPU instance with dedicated SMs, L2 slice, crossbar ports, data pipes and DRAM.)

LOGICAL VS. PHYSICAL PARTITIONING

(Diagram: GPU A runs TensorFlow, PyTorch and Jarvis + TensorRT processes through the CUDA Multi-Process Service control layer; GPU B runs the same frameworks on separate MIG instances.)

Multi-Process Service: dynamic contention for GPU resources; single tenant.
Multi-Instance GPU: hierarchy of instances with guaranteed resource allocation; multiple tenants.

ASYNC MEMCOPY: DIRECT TRANSFER INTO SHARED MEMORY

Without async copy, moving data from global to shared memory takes two steps: (1) a thread loads the data from GPU memory into registers through the L1 cache, and (2) the thread stores the data from registers into shared memory.
On A100, an asynchronous copy transfers the data from GPU memory directly into shared memory, bypassing the thread's registers.

ASYNC COPY
Asynchronous load + store into shared memory. Typical way of using shared memory:

    __shared__ int smem[1024];
    smem[threadIdx.x] = input[index];   // LDG.E.SYS R0, [R2] ;  * STALL *  STS [R5], R0 ;

• Wastes registers
• Stalls while the data is loaded
• Wastes L1/SHM bandwidth

Using the asynchronous copy instead:

    __shared__ int smem[1024];
    __pipeline_memcpy_async(&smem[threadIdx.x], &input[index], sizeof(int));
    __pipeline_commit();
    __pipeline_wait_prior(0);

Copies the data straight to shared memory asynchronously, with two possible paths:
• L1 access (data gets cached in L1)
• L1 bypass (no L1 caching; 16-byte vector LDGSTS)
Very flexible scheduling (e.g. multi-stage pipelines); see the kernel sketch below.

For more details: S21170 (Carter Edwards)
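Putting these primitives together, here is a minimal, hedged sketch of a kernel that stages a tile through shared memory with the pipeline API (the kernel name, the 256-thread tile and the doubling step are illustrative; n is assumed to be a multiple of the block size):

    #include <cuda_pipeline.h>

    // Launch with 256 threads per block.
    __global__ void scale_via_shared(const int *in, int *out, int n)
    {
        __shared__ int smem[256];                      // one element per thread
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Issue the asynchronous global->shared copy (bypasses registers on A100).
        __pipeline_memcpy_async(&smem[threadIdx.x], &in[idx], sizeof(int));
        __pipeline_commit();                           // close this batch of async copies
        __pipeline_wait_prior(0);                      // wait until all committed batches have landed
        __syncthreads();                               // make the tile visible to the whole block

        if (idx < n)
            out[idx] = 2 * smem[threadIdx.x];
    }

Several batches can be in flight at once: commit a few, then __pipeline_wait_prior(N) waits until at most the N most recent batches are still pending, which is what the multi-stage scheduling mentioned above builds on.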

ASYNCHRONOUS BARRIERS

Single-stage barriers combine back-to-back arrive & wait: threads produce data, then all block on the slowest arrival before consuming it.
Asynchronous barriers split arrive and wait, so independent work can be processed between the two, enabling pipelined processing.

ASYNCHRONOUS PROGRAMMING MODEL

    __device__ void split_barrier_example() {
        __shared__ barrier b1, b2;        // initialization omitted
        compute_head(part_one);
        auto t1 = b1.arrive();
        compute_head(part_two);
        auto t2 = b2.arrive();
        b1.wait(t1);
        compute_tail(part_one);
        b2.wait(t2);
        compute_tail(part_two);
    }

    __device__ void memcpy_example() {
        __shared__ barrier b1, b2;        // initialization omitted
        cuda::memcpy_async(/* ... */, b1);
        cuda::memcpy_async(/* ... */, b2);
        b1.arrive_and_wait();
        compute();
        b2.arrive_and_wait();
        compute();
    }
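For completeness, a hedged sketch of the memcpy_async pattern with the barrier initialization the slide omits (the kernel name and the doubling step are illustrative; the tile size equals the block size, passed as dynamic shared memory, and n is assumed to be a multiple of the block size):

    #include <cooperative_groups.h>
    #include <cuda/barrier>
    namespace cg = cooperative_groups;

    __global__ void stage_and_consume(const int *in, int *out, int n)
    {
        extern __shared__ int tile[];                          // blockDim.x ints
        using barrier = cuda::barrier<cuda::thread_scope_block>;
        __shared__ barrier bar;

        auto block = cg::this_thread_block();
        if (block.thread_rank() == 0)
            init(&bar, block.size());                          // expect one arrival per thread
        block.sync();

        int base = blockIdx.x * blockDim.x;
        // Asynchronous global->shared copy; completion is tracked by the barrier.
        cuda::memcpy_async(block, tile, in + base, sizeof(int) * blockDim.x, bar);

        bar.arrive_and_wait();                                 // copy complete and all threads arrived

        int i = base + threadIdx.x;
        if (i < n)
            out[i] = 2 * tile[threadIdx.x];
    }

A launch would look like stage_and_consume<<<grid, threads, threads * sizeof(int)>>>(in, out, n).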

For more information see: S21730 - Inside the NVIDIA Ampere Architecture

HIERARCHY OF LATENCIES

Approximate relative latencies (shared memory = 1x):
• SM shared memory / L1: 1x
• L2 cache: 5x
• HBM: 15x
• CPU DRAM (over PCIe): 25x
• Network: 50x

MANAGING LATENCY: L2 CACHE RESIDENCY CONTROL

             Shared Memory   L2 Cache   GPU Memory
Latency           1x            5x         15x
Bandwidth        13x            3x          1x

L2 Cache Residency Control:
• Specify an address range of up to 128 MB for persistent caching
• Normal & streaming accesses cannot evict persistent data
• Loads/stores from the range persist in L2 even between kernel launches
• Normal accesses can still use the entire cache if no persistent data is present

TUNING FOR L2 CACHE
Global Memory Histogram

More frequently accessed histogram bins stay pinned in L2, increasing the hit rate for global memory atomics.

    __global__ void histogram(int *hist, int *data, int nbins)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int bin_id = data[tid];
        // Performing atomics in global memory
        atomicAdd(hist + bin_id, 1);
    }

TUNING FOR L2 CACHE
Setting Persistence on a Global Memory Data Region

A global memory region can be marked for persistent access using accessPolicyWindow. Subsequent kernel launches in the stream or CUDA graph then carry the persistence property for the marked data region.

    cudaStreamAttrValue attribute;
    auto &window = attribute.accessPolicyWindow;
    window.base_ptr  = data_ptr;
    window.num_bytes = num_bytes;
    window.hitRatio  = 1.0;
    window.hitProp   = cudaAccessPropertyPersisting;
    window.missProp  = cudaAccessPropertyStreaming;

    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attribute);

    // The launch configuration was stripped in the original slide text;
    // any kernel launched into `stream` picks up the window.
    cuda_kernel<<<grid, block, 0, stream>>>(data_ptr);

(Diagram: the [data_ptr, data_ptr + num_bytes) window of global memory maps to the L2 region reserved for persisting accesses; the rest of L2 serves normal accesses.)
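Two related host-side calls are worth knowing about (a hedged addition, not shown on the slide; the 16 MB figure is illustrative): persistence typically requires an L2 set-aside to be configured first, and the persisting lines can be released once the hot region changes.

    size_t carveout = 16 * 1024 * 1024;                               // portion of L2 to set aside
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);     // enable the set-aside

    /* ... launch kernels that use the access policy window above ... */

    cudaCtxResetPersistingL2Cache();                                  // release persisting cache lines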

For more detail on the API: S21170 (Carter Edwards)

TUNING FOR L2 CACHE
Global Memory Histogram

Dataset size = 1024 MB (256 million integers); persistent histogram bins = 20 MB (5 million integer bins).

Speedup: V100 = 1.0x, A100 = 1.73x, A100 + L2 residency control = 2.48x (+43% over A100 without residency control).

For more information see: S21819 - Optimizing for NVIDIA Ampere

A100 ACCELERATES GRAPH LAUNCH & EXECUTION

(Diagram: cudaGraphLaunch(g1, s1) flows through CUDA graph launch, stream queues, grid management and execution on the SMs; graph, grid and kernel data are uploaded up front.)

New A100 execution optimizations for task graphs:
1. Grid launch latency reduction via whole-graph upload of grid & kernel data
2. Overhead reduction via accelerated dependency resolution
A capture-and-launch sketch follows below.
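For context, a hedged sketch of how such a graph is typically built and launched via stream capture (kernel names, launch sizes and the iteration count are illustrative):

    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;
    cudaStream_t    s1;
    cudaStreamCreate(&s1);

    // Record the work once...
    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    kernel_a<<<grid, block, 0, s1>>>(/* ... */);
    kernel_b<<<grid, block, 0, s1>>>(/* ... */);
    cudaStreamEndCapture(s1, &graph);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // ...then replay it cheaply; repeated launches amortize setup cost
    // and benefit from the A100 whole-graph upload described above.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(graphExec, s1);
    cudaStreamSynchronize(s1);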

A100 STRONG SCALING INNOVATIONS
Delivering unprecedented levels of performance: A100 improvements over V100

Math:     2.5x Tensor Core math BW (FP16); 5.0x Sparse Tensor Core (FP16)
RF:       2.9x effective RF BW with A100 Tensor Core; 2.8x effective RF capacity with Async-Copy bypassing RF
SMEM/L1:  3.0x effective SMEM BW with A100 Tensor Core and Async-Copy; 2.3x SMEM capacity
L2:       2.3x L2 BW; 6.7x L2 capacity, plus residency control
DRAM:     1.7x DRAM BW; 1.3x DRAM capacity
NVLINK:   2.0x NVLink BW
Compute (max): 13.3x

A100 GPU ACCELERATED MATH LIBRARIES IN CUDA 11.0

cuBLAS: BF16, TF32 and FP64 Tensor Cores
cuSPARSE: increased memory BW, shared memory & L2
cuTENSOR: BF16, TF32 and FP64 Tensor Cores
cuSOLVER: BF16, TF32 and FP64 Tensor Cores
CUTLASS: BF16, TF32 and FP64 Tensor Cores
cuFFT: increased memory BW, shared memory & L2
CUDA Math API: BF16 & TF32 support
nvJPEG: hardware decoder

For more information see: S21681 - How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU

THE NVIDIA HPC SDK
Apply now at developer.nvidia.com/hpc-sdk

NVIDIA HPC SDK

DEVELOPMENT
• Programming models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
• Compilers: nvcc, nvc, nvc++, nvfortran
• Core libraries: libcu++, Thrust, CUB
• Math libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
• Communication libraries: Open MPI, NVSHMEM, NCCL

ANALYSIS
• Profilers: Nsight Systems, Nsight Compute
• Debugger: cuda-gdb (host & device)

Develop for the NVIDIA HPC platform: GPU, CPU and interconnect. HPC libraries | GPU-accelerated C++ and Fortran | Directives | CUDA. Compatible with HPC Container Maker and 99% of Top500 systems.

cuSOLVER DENSE LINEAR ALGEBRA PERFORMANCE ON THE NEW NVIDIA A100 & DGX-A100™

Dense Linear Algebra Performance Comparison Between GA100 and GV100

(Chart: speedup per GPU as a function of matrix size, up to 2.4x, for DGETRF (LU), DGEQRF (QR), DPOTRF (Cholesky) and DSYEVD (symmetric eigensolver).)

Results compare CUDA 11.0 cuSOLVER on the NVIDIA A100 to CUDA 10.2 on V100.

GPU PROGRAMMING IN 2020 AND BEYOND
Math Libraries | Standard Languages | Directives | CUDA

Standard Languages – GPU-accelerated C++ and Fortran:

    std::transform(std::execution::par, x, x+n, y, y,
                   [=](float x, float y) { return y + a*x; });

    do concurrent (i = 1:n)
       y(i) = y(i) + a*x(i)
    enddo

Directives – incremental performance optimization:

    #pragma acc data copy(x,y)
    {
       ...
       std::transform(par, x, x+n, y, y,
                      [=](float x, float y) { return y + a*x; });
       ...
    }

CUDA C++/Fortran – maximize GPU performance:

    __global__ void saxpy(int n, float a, float *x, float *y) {
       int i = blockIdx.x*blockDim.x + threadIdx.x;
       if (i < n) y[i] += a*x[i];
    }

    int main(void) {
       cudaMallocManaged(&x, ...);
       cudaMallocManaged(&y, ...);
       ...
       saxpy<<<(N+255)/256,256>>>(..., x, y);
       cudaDeviceSynchronize();
       ...
    }

GPU Accelerated Math Libraries

For more information see: S21766 - Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing

libcu++ : THE CUDA C++ STANDARD LIBRARY

ISO C++ == Language + Standard Library CUDA C++ == Language + libcu++

Strictly conforming to ISO C++, plus conforming extensions

Opt-in, Heterogeneous, Incremental

cuda::std::

Opt-in: does not interfere with or replace your host standard library.
Heterogeneous: copyable/movable objects can migrate between host & device; host & device can call all member functions; host & device can concurrently use synchronization primitives*.
Incremental: a subset of the standard library today; each release adds more functionality.

*Synchronization primitives must be in managed memory and be declared with cuda::std::thread_scope_system.
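A small, hedged sketch of the opt-in usage (the kernel and counter names are illustrative): a cuda::std::atomic counter placed in managed memory, per the footnote above, so device code increments it and the host reads the result back.

    #include <cuda/std/atomic>

    __global__ void count_even(const int *data, int n, cuda::std::atomic<int> *counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] % 2 == 0)
            counter->fetch_add(1, cuda::std::memory_order_relaxed);
    }

    // Host side (illustrative):
    //   cuda::std::atomic<int> *counter;
    //   cudaMallocManaged(&counter, sizeof(*counter));
    //   new (counter) cuda::std::atomic<int>(0);
    //   count_even<<<grid, block>>>(d_data, n, counter);
    //   cudaDeviceSynchronize();
    //   int evens = counter->load();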

CUDA C++ HETEROGENEOUS ARCHITECTURE

Thrust: host-side, Standard Library-inspired primitives, e.g. for_each, sort, reduce
CUB: re-usable building blocks, targeting 3 layers of abstraction
libcu++: heterogeneous ISO C++ Standard Library

CUB is now a fully-supported component of the CUDA Toolkit. Thrust integrates CUB’s high performance kernels.

CUB: CUDA UNBOUND
Reusable Software Components for Every Layer of the CUDA Programming Model

• Device-wide primitives: parallel sort, prefix scan, reduction, histogram, etc.; compatible with CUDA dynamic parallelism (see the sketch below)
• Block-wide "collective" primitives: cooperative I/O, sort, scan, reduction, histogram, etc.; compatible with arbitrary thread block sizes and types
• Warp-wide "collective" primitives: cooperative warp-wide prefix scan, reduction, etc.; safely specialized for each underlying CUDA architecture
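As an example of a device-wide primitive, a hedged sketch of cub::DeviceReduce::Sum with CUB's usual two-phase call (the wrapper function name is illustrative; d_in and d_out are assumed to be device pointers prepared by the caller):

    #include <cub/cub.cuh>

    void sum_on_device(const int *d_in, int *d_out, int num_items)
    {
        void  *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;

        // First call: no work is done, only the required temporary storage size is reported.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);

        // Second call: performs the reduction.
        cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
        cudaFree(d_temp_storage);
    }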

HPC PROGRAMMING IN ISO C++
C++ Parallel Algorithms

std::sort(std::execution::par, c.begin(), c.end());

std::unique(std::execution::par, c.begin(), c.end());

➢ Introduced in C++17
➢ Parallel and vector concurrency via execution policies: std::execution::par, std::execution::par_unseq, std::execution::seq
➢ Several new algorithms in C++17, including std::for_each_n(POLICY, first, size, func)
➢ Insert std::execution::par as the first parameter when calling algorithms
➢ NVC++ 20.5: automatic GPU acceleration of C++17 parallel algorithms (see the sketch below)
➢ Leverages CUDA Unified Memory
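A minimal sketch of what this looks like in practice (file and function names are illustrative): SAXPY written with a C++17 parallel algorithm, which nvc++ -stdpar can offload to the GPU.

    #include <algorithm>
    #include <execution>
    #include <vector>

    void saxpy(float a, const std::vector<float> &x, std::vector<float> &y)
    {
        // y = a*x + y, expressed as a standard parallel algorithm.
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                       [=](float xi, float yi) { return a * xi + yi; });
    }

    // Build (illustrative): nvc++ -stdpar -O3 saxpy.cpp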

C++ PARALLEL ALGORITHMS
Lulesh Hydrodynamics Mini-app

➢ ~9000 lines of C++

➢ Parallel versions in MPI, OpenMP, OpenACC, CUDA, RAJA, Kokkos, …
➢ Designed to stress compiler vectorization, parallel overheads, on-node parallelism

https://computing.llnl.gov/projects/co-design/lulesh

PARALLEL C++
➢ Composable, compact and elegant
➢ Easy to read and maintain
➢ ISO Standard
➢ Portable – nvc++, g++, icpc, MSVC, …

C++ with OpenMP:

    static inline
    void CalcHydroConstraintForElems(Domain &domain, Index_t length,
                                     Index_t *regElemlist, Real_t dvovmax,
                                     Real_t &dthydro)
    {
    #if _OPENMP
       const Index_t threads = omp_get_max_threads();
       Index_t hydro_elem_per_thread[threads];
       Real_t dthydro_per_thread[threads];
    #else
       Index_t threads = 1;
       Index_t hydro_elem_per_thread[1];
       Real_t dthydro_per_thread[1];
    #endif
    #pragma omp parallel firstprivate(length, dvovmax)
       {
          Real_t dthydro_tmp = dthydro ;
          Index_t hydro_elem = -1 ;
    #if _OPENMP
          Index_t thread_num = omp_get_thread_num();
    #else
          Index_t thread_num = 0;
    #endif
    #pragma omp for
          for (Index_t i = 0 ; i < length ; ++i) {
             Index_t indx = regElemlist[i] ;
             if (domain.vdov(indx) != Real_t(0.)) {
                Real_t dtdvov = dvovmax / (FABS(domain.vdov(indx)) + Real_t(1.e-20)) ;
                if ( dthydro_tmp > dtdvov ) {
                   dthydro_tmp = dtdvov ;
                   hydro_elem = indx ;
                }
             }
          }
          dthydro_per_thread[thread_num] = dthydro_tmp ;
          hydro_elem_per_thread[thread_num] = hydro_elem ;
       }
       for (Index_t i = 1; i < threads; ++i) {
          if (dthydro_per_thread[i] < dthydro_per_thread[0]) {
             dthydro_per_thread[0] = dthydro_per_thread[i];
             hydro_elem_per_thread[0] = hydro_elem_per_thread[i];
          }
       }
       if (hydro_elem_per_thread[0] != -1) {
          dthydro = dthydro_per_thread[0] ;
       }
       return ;
    }

Parallel C++17:

    static inline
    void CalcHydroConstraintForElems(Domain &domain, Index_t length,
                                     Index_t *regElemlist, Real_t dvovmax,
                                     Real_t &dthydro)
    {
       dthydro = std::transform_reduce(
          std::execution::par, counting_iterator(0), counting_iterator(length),
          dthydro, [](Real_t a, Real_t b) { return a < b ? a : b; },
          [=, &domain](Index_t i) {
             Index_t indx = regElemlist[i];
             if (domain.vdov(indx) == Real_t(0.0)) {
                return std::numeric_limits<Real_t>::max();
             } else {
                return dvovmax / (std::abs(domain.vdov(indx)) + Real_t(1.e-20));
             }
          });
    }

LULESH PERFORMANCE
Speedup – Higher is Better

(Bar chart: relative speedup, higher is better, of the same ISO C++ code running as C++ on a 2-socket 20-core Xeon Gold 6148, as C++ on A100, and as OpenACC on A100.)

HPC PROGRAMMING IN ISO FORTRAN
ISO is the place for portable concurrency and parallelism

Fortran 2018
➢ Array syntax and intrinsics – NVFORTRAN 20.5: accelerated matmul, reshape, spread, etc.
➢ DO CONCURRENT – NVFORTRAN 20.x: auto-offload & multi-core
➢ Co-arrays – coming soon: accelerated co-array images

Fortran 202x
➢ DO CONCURRENT reductions: REDUCE subclause added; support for +, *, MIN, MAX, IAND, IOR, IEOR and for .AND., .OR., .EQV., .NEQV. on LOGICAL values
➢ Atomics

HPC PROGRAMMING IN ISO FORTRAN
NVFORTRAN Accelerates Fortran Intrinsics with a cuTENSOR Backend

(Chart: FP64 matrix-multiply throughput in TFLOPs for a naïve inline loop on V100, Fortran MATMUL on V100, and Fortran MATMUL on A100.)

NSIGHT COMPUTE 2020.1
New Roofline Analysis

An efficient way to evaluate kernel characteristics and quickly understand existing limiters and potential directions for further improvement.

Inputs: arithmetic intensity (FLOP/byte) and performance (FLOP/s).
Ceilings: peak memory bandwidth and peak FP32/FP64 performance.
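From these inputs the standard roofline bound follows (not spelled out on the slide):

    P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right)

where I is the arithmetic intensity in FLOP/byte, B_peak the peak memory bandwidth, and P_peak the peak FP32/FP64 throughput; a kernel to the left of the ridge point I = P_peak / B_peak is memory-bound, to the right of it compute-bound.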

COMPUTE-SANITIZER
Command-Line Interface (CLI) Tool Based on the Sanitizer API

Next-Gen Replacement Tool for cuda-memcheck

Significant performance improvement of 2x - 5x compared with cuda-memcheck (depending on application size)

Performance gains for applications using libraries such as cuSOLVER and cuFFT, or DL frameworks.

cuda-memcheck is still supported in CUDA 11.0 (but does not support Arm SBSA).

https://docs.nvidia.com/cuda/compute-sanitizer

For more information see: S22043 – CUDA Developer Tools: Overview and Exciting New Features

SUMMARY

CUDA 11 and Ampere key architecture improvements go hand-in-hand

Huge performance improvements (raw compute, plus automatic gains through libraries)

New programming model improvements (asynchrony)

More focus on modern C++, standard libraries

HPC SDK as focused distribution of compilers/libraries

REFERENCES AND FURTHER DETAILS

nvidia.com/nvidia-ampere-architecture-whitepaper

GTC talks, also check their references:

S21730: Inside the NVIDIA Ampere Architecture

S21760: CUDA New Features And Beyond

S21766: Inside the NVIDIA HPC SDK

THANK YOU