Inside Volta
Olivier Giroux and Luke Durant, NVIDIA
May 10, 2017

VOLTA: A GIANT LEAP FOR DEEP LEARNING
- ResNet-50 training: 2.4x faster than P100 (V100 Tensor Cores vs. P100 FP32), measured in images per second.
- ResNet-50 inference with TensorRT at 7 ms latency: 3.7x faster than P100 (V100 Tensor Cores vs. P100 FP16), measured in images per second.
V100 measured on pre-production hardware.

ROAD TO EXASCALE
Volta to fuel the most powerful US supercomputers.
- Summit supercomputer: 200+ petaflops, ~3,400 nodes, 10 megawatts.
- Volta HPC application performance shown relative to Tesla P100.
System config: 2x Xeon E5-2690 v4 at 2.6 GHz with 1x Tesla P100 or V100. V100 measured on pre-production hardware.

INTRODUCING TESLA V100
The fastest and most productive GPU for deep learning and HPC.
- Volta architecture: most productive GPU
- Tensor Core: 120 programmable TFLOPS for deep learning
- Improved NVLink and HBM2: efficient bandwidth
- Volta MPS: inference utilization
- Improved SIMT model: new algorithms

TESLA V100
- 21B transistors, 815 mm2
- 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
- 16 GB HBM2 at 900 GB/s
- 300 GB/s NVLink
*The full GV100 chip contains 84 SMs.

GPU PERFORMANCE COMPARISON
                         P100            V100             Ratio
Training acceleration    10 TOPS         120 TOPS         12x
Inference acceleration   21 TFLOPS       120 TOPS         6x
FP64/FP32                5/10 TFLOPS     7.5/15 TFLOPS    1.5x
HBM2 bandwidth           720 GB/s        900 GB/s         1.2x
NVLink bandwidth         160 GB/s        300 GB/s         1.9x
L2 cache                 4 MB            6 MB             1.5x
L1 caches                1.3 MB          10 MB            7.7x

NEW HBM2 MEMORY ARCHITECTURE
- 1.5x delivered bandwidth (STREAM Triad).
- DRAM utilization: 76% on P100, 95% on V100.
V100 measured on pre-production hardware.

VOLTA NVLINK
- 300 GB/sec
- 50% more links
- 28% faster signaling

PROGRAMMABILITY

PROGRAMMABILITY DRIVES DEEP LEARNING
A 2012-2018 timeline of deep learning methods developed using CUDA, including: AlexNet, 1-bit SGD, FFT convolutions, cuDNN, batch normalization, NCCL, Winograd convolutions, persistent RNNs, Phased LSTM, FlowNet, sparsely gated mixture of experts, and billion-scale similarity search (FAISS), with 2018 still an open question.
New solvers, new layers, new scaling techniques, new applications for old techniques, and much more.

STATE OF UNIFIED MEMORY
High performance, low effort.
- Unified Memory provides automatic data movement for allocatables; the baseline uses explicit data movement.
- PGI OpenACC on Pascal P100, geometric mean across all 15 SPEC ACCEL benchmarks: 86% of explicit-data-movement performance over PCI-E, 91% over NVLink.
- See also: S7285, Unified Memory on the Latest GPU Architectures.
PGI 17.1 compilers; OpenACC SPEC ACCEL 1.1 performance measured March 2017. SPEC and the benchmark name SPEC ACCEL are registered trademarks of the Standard Performance Evaluation Corporation.
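The Unified Memory results above rest on the programming model sketched below. This is a minimal, illustrative CUDA sketch rather than code from the talk (the kernel, array name, and sizes are assumptions): a single cudaMallocManaged allocation is touched by both the CPU and the GPU, and pages migrate automatically between them.

    // Minimal Unified Memory sketch; assumes CUDA 8+ and a Pascal- or Volta-class GPU.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;   // GPU touches the pages; they migrate on demand
    }

    int main()
    {
        const int n = 1 << 20;
        float *data = nullptr;

        // One allocation visible to both CPU and GPU; the Page Migration Engine
        // moves pages to whichever processor touches them.
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; ++i) data[i] = 1.0f;       // CPU writes: pages resident on the CPU

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);   // GPU reads/writes: pages migrate to the GPU
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);                // CPU reads again: pages migrate back
        cudaFree(data);
        return 0;
    }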
PASCAL UNIFIED MEMORY
- A single Unified Memory space spans GPU memory and CPU memory.
- The Page Migration Engine moves pages on demand between a GPU-optimized state and a CPU-optimized state.

VOLTA + PCIE CPU UNIFIED MEMORY
- Page Migration Engine, plus access counters to guide migration between the GPU-optimized and CPU-optimized states.

VOLTA + NVLINK CPU UNIFIED MEMORY
- Page Migration Engine, plus access counters, plus new NVLink features: coherence, atomics, and address translation services (ATS).

GPU MULTI-PROCESS SCHEDULING
Background: CPU processes A, B, and C share one GPU.
- Timeslice scheduling: single-process throughput optimized.
- Multi-Process Service (MPS): multi-process throughput optimized.

MULTI-PROCESS TIMESLICING
- Timeslice 1: process A executes on the GPU (Pascal GP100).
- Timeslice 2: process B executes.
- Timeslice 3: process C executes.
Full process isolation; peak throughput optimized for each process.

PASCAL MULTI-PROCESS SERVICE
CUDA Multi-Process Service: improve GPU utilization by sharing GPU resources amongst small jobs.
- Software work submission through the CUDA MPS server.
- Processes A, B, and C execute concurrently on the GPU with limited isolation.
- Opt-in: limited isolation, peak throughput optimized across processes.

VOLTA MULTI-PROCESS SERVICE
Volta MPS enhancements:
- Hardware-accelerated work submission.
- Reduced launch latency and improved launch throughput.
- Improved quality of service with scheduler partitioning, for more reliable performance.
- Hardware isolation between clients.
- 3x more clients than Pascal.

VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system (ResNet-50 images/sec at 7 ms latency):
- Multiple Volta clients under MPS, no batching: 7x faster than a single Volta client with no batching.
- Reaches 60% of the performance of Volta with a batching system and no MPS.
V100 measured on pre-production hardware.
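MPS requires no changes to CUDA code: each client is an ordinary CUDA process, and the MPS server multiplexes the GPU among them. The sketch below is an illustrative stand-in for one small inference-style job (the kernel name and sizes are assumptions, not from the talk); launching several copies as separate OS processes while the MPS control daemon is running (typically started with nvidia-cuda-mps-control -d on Linux) makes each of them an MPS client sharing the GPU.

    // Hypothetical MPS "client": an ordinary CUDA program that by itself
    // underutilizes the GPU. Under MPS, several such processes run concurrently.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void infer(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 0.5f + 1.0f;   // stand-in for a small inference kernel
    }

    int main()
    {
        const int n = 1 << 16;                     // a small job: one client alone leaves the GPU idle
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));      // placeholder input

        infer<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();

        printf("done\n");
        cudaFree(in);
        cudaFree(out);
        return 0;
    }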
NEW SM MICROARCHITECTURE

VOLTA GV100 SM
Per-SM resources (GV100):
- FP32 units: 64
- FP64 units: 32
- INT32 units: 64
- Tensor Cores: 8
- Register file: 256 KB
- Unified L1/shared memory: 128 KB
- Active threads: 2048

VOLTA GV100 SM: REDESIGNED FOR PRODUCTIVITY
- Completely new ISA
- Twice the schedulers
- Simplified issue logic
- Large, fast L1 cache
- Improved SIMT model
- Tensor acceleration
= The easiest SM to program yet.

RECAP: PASCAL L1 AND SHARED MEMORY
Pascal SM: the load/store units access 64 KB of shared memory (low latency) and a separate 24 KB L1 cache (streaming: unlimited cache misses in flight), backed by a 4 MB L2 cache.

UNIFYING KEY TECHNOLOGIES
- Pascal SM: separate 64 KB shared memory and 24 KB L1 cache; 4 MB L2.
- Volta SM: unified 128 KB L1 cache and shared memory; 6 MB L2.

VOLTA L1 AND SHARED MEMORY
Volta streaming L1 cache:
- Unlimited cache misses in flight
- Low cache-hit latency
- 4x more bandwidth and 5x more capacity than Pascal's L1
Volta shared memory:
- Unified storage with the L1 cache
- Configurable up to 96 KB
Backed by a 6 MB L2 cache.

NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache
Directed testing of shared memory versus global memory through the cache; average shared-memory benefit: 70% on Pascal, 93% on Volta.
- Cache: easier to use, and now 90%+ as good as shared memory.
- Shared memory: still offers faster atomics, more banks, and more predictable performance.

INDEPENDENT THREAD SCHEDULING

VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating algorithms:
- Pascal: lock-free algorithms; threads cannot wait for messages.
- Volta: starvation-free algorithms; threads may wait for messages.

STARVATION FREE ALGORITHM
Example: a doubly-linked list with a fine-grained lock per node; node b is inserted between node a and its successor c.

    __device__ void insert_after(Node *a, Node *b)
    {
        Node *c;
        lock(a); lock(a->next);
        c = a->next;
        a->next = b;
        b->prev = a;
        b->next = c;
        c->prev = b;
        unlock(c); unlock(a);
    }

*Not shown: lock() implementation.
Tip! Volta can run 163,840 threads simultaneously. Minimize contention!
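The talk leaves the lock() implementation unspecified. The sketch below is one possible per-node spinlock, an assumption rather than the talk's code: it adds a hypothetical int mutex field to Node and uses atomicCAS/atomicExch with __threadfence() for visibility. Volta's independent thread scheduling, described above, is what makes this kind of spinning starvation-free even among threads of the same warp.

    // Assumed node layout for this sketch; the talk's Node only shows prev/next.
    struct Node {
        Node *prev;
        Node *next;
        int   mutex;      // 0 = unlocked, 1 = locked
    };

    __device__ void lock(Node *n)
    {
        // Spin until the compare-and-swap acquires the lock. With independent
        // thread scheduling, other threads of the warp keep making progress
        // while this thread spins, so the current holder can eventually release.
        while (atomicCAS(&n->mutex, 0, 1) != 0) { }
        __threadfence();  // make the previous holder's writes visible to this thread
    }

    __device__ void unlock(Node *n)
    {
        __threadfence();            // publish this thread's writes before releasing
        atomicExch(&n->mutex, 0);   // release the lock with an atomic store
    }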
WARP IMPLEMENTATION: PRE-VOLTA
A 32-thread warp shares a single program counter (PC) and stack (S).

PASCAL WARP EXECUTION MODEL

    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }

Over time, the warp diverges at the branch, runs each side's statements as a group, and then reconverges.

PASCAL WARP EXECUTION MODEL: NO SYNCHRONIZATION PERMITTED

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }

On Pascal, the diverged sub-groups cannot synchronize with each other before the warp reconverges.

VOLTA WARP IMPLEMENTATION
Each of the 32 threads in the warp has its own program counter (PC) and stack (S). The warp executes with independent thread scheduling, guided by a convergence optimizer.

VOLTA WARP EXECUTION MODEL
Synchronization may lead to interleaved scheduling!

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
    __syncwarp();

The warp still diverges at the branch, but the two sides can synchronize at __syncwarp() and their execution may interleave. Software synchronization is also supported, e.g. the locks used in the doubly-linked list example.

VOLTA'S EXTENDED SIMT MODEL
The SIMT model: enable thread-parallel programs to execute with vector efficiency.

                        CPU     Pascal GPU        Volta GPU
    Thread-parallelism  MIMD    SIMT (lock-free)  SIMT
    Data-parallelism    SIMD    SIMT              SIMT

COOPERATION SYNCHRONIZATION
See also: CUDA 9.0 (S7132, Wednesday 2:30pm) and Cooperative Groups (S7622, Wednesday 4pm).

SHUFFLE SYNCHRONIZATION
__shfl_sync deprecates __shfl.
- Warp-synchronizing operation
- Lane data exchange
- Result distributed across the warp

COMPARE SYNCHRONIZATION
__ballot_sync and __any_sync / __all_sync deprecate their namesakes.
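To make the warp-execution diagrams above concrete, here is a hedged sketch of the divergent-branch pattern around __syncwarp(); the shared buffer, indices, and values are illustrative assumptions, not code from the talk. It assumes a Volta-class GPU (compute capability 7.0+) and a single-warp launch, and both paths name the full warp in the default mask, so the diverged sub-groups rendezvous at __syncwarp() as the slides describe.

    // Intended to be launched as diverge_then_sync<<<1, 32>>>(out).
    __global__ void diverge_then_sync(int *out)
    {
        __shared__ int buf[32];               // one slot per lane; assumes blockDim.x == 32
        int lane = threadIdx.x & 31;

        if (lane < 4) {
            buf[lane] = lane * 10;            // "A": this sub-group publishes a value
            __syncwarp();                     // rendezvous with the other sub-group
            out[lane] = buf[(lane + 1) & 31]; // "B": read a value the other side may have written
        } else {
            buf[lane] = lane * 10;            // "X"
            __syncwarp();                     // matching rendezvous from the divergent path
            out[lane] = buf[(lane + 1) & 31]; // "Y"
        }
        // On pre-Volta GPUs this cross-branch synchronization pattern is not permitted.
    }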
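As a small illustration of the *_sync primitives just listed, the sketch below (illustrative kernel name and parameters, intended for a single-warp <<<1, 32>>> launch) sums a value across the warp with __shfl_down_sync and takes a predicate vote with __ballot_sync. Each call names the full warp in its mask, so it both synchronizes the lanes and exchanges data among them.

    __global__ void warp_reduce(const int *in, int *sum_out, unsigned *vote_out, int threshold)
    {
        int lane = threadIdx.x & 31;
        int val  = in[threadIdx.x];

        // Butterfly-style reduction: each step shifts values down by 'offset' lanes
        // and accumulates, leaving the warp-wide sum in lane 0.
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);

        // __ballot_sync returns a 32-bit mask with one bit per lane whose predicate was true.
        unsigned ballot = __ballot_sync(0xffffffff, in[threadIdx.x] > threshold);

        if (lane == 0) {
            *sum_out  = val;
            *vote_out = ballot;
        }
    }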