PARALLEL PROGRAMMING MANY-CORE : CUDA INTRODUCTION (3/5)

Rob van Nieuwpoort [email protected]

Schedule

2

1. Introduction, performance metrics & analysis

2. Many-core hardware, low-level optimizations

3. GPU hardware and Cuda class 1: basics

4. Cuda class 2: advanced

5. Case study: LOFAR telescope with many-cores

3 GPU hardware introduction: It's all about the memory

4 Integration into host system

5

 Typically PCI Express 2.0 x16

 Theoretical speed 8 GB/s

 Protocol overhead reduces this to roughly 6 GB/s

 In reality: 4 – 6 GB/s

 V3.0 is coming soon

 Double bandwidth

 Less protocol overhead Lessons from Graphics

6

 Throughput is paramount  must paint every pixel within frame time 

 Create, run, & retire lots of threads very rapidly  measured 14.8 billion threads/s on increment() kernel

 Use multithreading to hide latency  1 stalled thread is OK if 100 are ready to run

CPU vs GPU

7

 Movie

 The Mythbusters  Jamie Hyneman & Adam Savage  Discovery Channel

 Appearance at NVIDIA’s NVISION 2008 Why is this different from a CPU?

8

 Different goals produce different designs  GPU assumes work load is highly parallel  CPU must be good at everything, parallel or not

 CPU: minimize latency experienced by 1 thread  big on-chip caches  sophisticated control logic

 GPU: maximize throughput of all threads  # threads in flight limited by resources => lots of resources (registers, etc.)  multithreading can hide latency => skip the big caches  share control logic across many threads

Flynn’s taxonomy revisited

9

                       Single Data   Multiple Data
Single Instruction     SISD          SIMD
Multiple Instruction   MISD          MIMD

GPUs don’t fit! Key architectural Ideas

10

 SIMT (Single Instruction Multiple Thread) execution  HW automatically handles divergence

 Hardware multithreading  HW resource allocation & thread scheduling  HW relies on threads to hide latency  Context switching is (basically) free

11 GPU hardware: ATI

CPU vs GPU Chip

12

AMD Magny-Cours (6 cores) ATI 4870 (800 cores) Latest generation ATI

13

 Northern Islands

 1 chip: HD 6970  1536 cores  176 GB/sec memory bandwidth  2.7 tflops single, 675 gflops double precision  Maximum power: 250 Watts  299 euros!

 2 chips: HD 6990  3072 cores, 5.1 tflops, 575 euro!

 Comparison: entire 72-node DAS-4 VU cluster has 4.4 tflops ATI 5870 architecture overview

14 ATI 5870 SIMD engine

15

 Each of the 20 SIMD engines has:  16 thread processors x 5 stream cores = 80 scalar units  20 * 16 * 5 = 1600 cores total  32KB Local Data Share  its own control logic and runs from a shared set of threads  a dedicated fetch unit with 8KB L1  a 64KB global data share to communicate with other SIMD engines ATI 5870 thread

16

 Each thread processor includes:  4 stream cores + 1 special function stream core  general purpose registers

 FMA in a single clock ATI 5870

17

 EDC (Error Detection Code)  CRC Checks on Data Transfers for Improved Reliability at High Clock Speeds

 Bandwidths  Up to 1 TB/sec L1 texture fetch bandwidth  Up to 435 GB/sec between L1 & L2  153.6 GB/s to device memory  PCI-e 2.0, 16x: 8GB/s to main memory ATI programming models

18

 Low-level: CAL (assembly)

 High-level: Brook+  Originally developed at Stanford University  Streaming language  Performance is not great

 Now: OpenCL

19 GPU Hardware: NVIDIA

Reading material

20

 Reader:  NVIDIA’s Next Generation CUDA Compute Architecture: Fermi

 Recommended further reading:

 CUDA: Compute Unified Device Architecture

Fermi

21

 Consumer: GTX 480, 580

 GPGPU: Tesla C2050  more memory, ECC  1.0 teraflop single precision  515 gigaflops double precision

 16 streaming multiprocessors (SMs)  GTX 580: 16  GTX 480: 15  C2050: 14

 SMs are independent

[Figure: Fermi die layout: host interface, GigaThread engine, GPCs with raster and Polymorph engines, the SMs, a shared L2 cache, and the memory controllers]

Fermi Streaming Multiprocessor (SM)

22

 32 cores per SM (512 cores total)

 64KB configurable L1 cache / shared memory

 32,768 32-bit registers

[Figure: Fermi die layout, as on the previous slide]

CUDA Core Architecture

23

 Decoupled floating point and integer data paths

 Double precision throughput is 50% of single precision

 Integer operations optimized for extended precision  64 bit and wider data element size

 Predication field for all instructions

 Fused-multiply-add
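Fused multiply-add computes a*b + c with a single rounding step, whereas a separate multiply followed by an add rounds twice. A minimal sketch (not from the slides) contrasting the two on the device; the intrinsic names are from the CUDA single-precision math API:

__global__ void fma_demo(const float *a, const float *b, const float *c,
                         float *out_fma, float *out_sep) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  out_fma[i] = __fmaf_rn(a[i], b[i], c[i]);             // one fused operation, one rounding
  out_sep[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);  // two operations, two roundings
}

In ordinary code, simply writing a[i] * b[i] + c[i] normally lets the compiler contract the expression into an FMA.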

Memory Hierarchy

24

 Configurable L1 cache per SM  16KB L1 cache / 48KB shared memory, or  48KB L1 cache / 16KB shared memory

 Shared 768KB L2 cache
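The preferred split can be requested per kernel from the host; a hedged sketch using the CUDA runtime call cudaFuncSetCacheConfig, assuming a kernel named my_kernel:

// prefer 48KB shared memory / 16KB L1 for this kernel
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

// or prefer 48KB L1 / 16KB shared memory
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);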

[Figure: memory hierarchy: registers, L1 cache / shared memory, L2 cache, device memory; host memory attached via PCI-e]

Multiple Memory Scopes

25

 Per-thread private memory  each thread has its own local memory  stacks, other private data

 Per-SM shared memory  small memory close to the processor, low latency

 Per-device global memory (device memory)  the GPU frame buffer  can be accessed by any thread in any SM, and by successive kernels (Kernel 0, Kernel 1, ...)
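A minimal kernel sketch (not from the slides) showing where each scope appears in code; the names are illustrative, and it assumes at most 256 threads per block and a grid that covers the whole array:

__device__ int d_flag;                    // per-device: visible to every thread of every kernel

__global__ void scopes_demo(const int *in, int *out) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;  // per-thread: lives in a register / local memory
  __shared__ int s_buf[256];                        // per-SM/block: shared by the threads of this block

  s_buf[threadIdx.x] = in[idx];           // stage the element in fast shared memory
  __syncthreads();
  out[idx] = s_buf[threadIdx.x];          // write back to per-device global memory
  if (idx == 0) d_flag = 1;               // one thread updates the device-wide variable
}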

Unified Load/Store Addressing

26

Non-unified address space: each memory has its own 32-bit address window and pointer type  local (*p_local)  shared (*p_shared)  device (*p_device)

Unified address space: local, shared and device memory share a single 40-bit address space, accessed through one generic pointer (*p)

Atomic Operations

27

 Device memory is not coherent!

 Share data between streaming multiprocessors

 Read / Modify / Write

 Fermi increases atomic performance by 5x to 20x ECC (Error-Correcting Code)

28

 All major internal memories are ECC protected  register files, L1 cache, L2 cache

 DRAM protected by ECC (on Tesla only)

 ECC is a must have for many computing applications

NVIDIA GPUs become more generic

29

 Expand performance sweet spot of the GPU  Caching  Concurrent kernels  Double precision floating point  C++

 Full integration in modern software development environment  Debugging  Profiling

 Bring more users, more applications to the GPU

30 Programming NVIDIA GPUs

CUDA

31

 CUDA: Scalable parallel programming  C/C++ extensions

 Provide straightforward mapping onto hardware  Good fit to GPU architecture  Maps well to multi-core CPUs too

 Scale to 1000s of cores & 100,000s of threads  GPU threads are lightweight: creation and switching are essentially free  GPU needs 1000s of threads for full utilization Parallel Abstractions in CUDA

32

 Hierarchy of concurrent threads

 Lightweight synchronization primitives

 Shared memory model for cooperating threads Hierarchy of concurrent threads

33

 Parallel kernels composed of many threads  All threads execute the same sequential program, called the kernel

 Threads are grouped into thread blocks  Threads in the same block can cooperate

 Threads in different blocks cannot!  All thread blocks are organized in a Grid

 Threads/blocks have unique IDs Grids, Thread Blocks and Threads

[Figure: a grid of 2 x 3 thread blocks (Thread Block 0,0 through 1,2); each block contains 3 x 4 threads, indexed (0,0) through (2,3)]

CUDA Model of Parallelism

35

[Figure: several thread blocks, each with its own shared memory, all backed by a common device memory]

 CUDA virtualizes the physical hardware  Devices have  Different numbers of SMs  Different compute capabilities (Fermi = 2.0)  block is a virtualized streaming multiprocessor (threads, shared memory)  thread is a virtualized scalar processor (registers, PC, state)  Scheduled onto physical hardware without pre-emption  threads/blocks launch & run to completion  blocks should be independent Hardware Memory Spaces in CUDA

36

[Figure: a grid of thread blocks; each thread has its own registers, each block has a shared memory, and all threads access the per-device memory and constant memory, which the host can also reach]

Device Memory

37

 CPU and GPU have separate memory spaces  Data is moved across PCI-e bus  Use functions to allocate/set/copy memory on GPU  Very similar to corresponding C functions

 Pointers are just addresses  Can’t tell from the pointer value whether the address is on CPU or GPU  Must exercise care when dereferencing:  Dereferencing CPU pointer on GPU will likely crash  Same for vice versa Additional memories

 Textures  Read-only  Data resides in device memory  Different read path, includes specialized caches

 Constant memory  Data resides in device memory  Manually managed  Small (e.g., 64KB)  Use when all threads in a block read the same address  Serializes otherwise GPU Memory Allocation / Release

39

 Host (CPU) manages device (GPU) memory:  cudaMalloc(void **pointer, size_t nbytes)  cudaMemset(void *pointer, int val, size_t count)  cudaFree(void* pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int* data = 0;
cudaMalloc((void**)&data, nbytes);
cudaMemset(data, 0, nbytes);
cudaFree(data);

Data Copies

40

 cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);  returns after the copy is complete  blocks CPU thread until all bytes have been copied  doesn’t start copying until previous CUDA calls complete

 enum cudaMemcpyKind  cudaMemcpyHostToDevice  cudaMemcpyDeviceToHost  cudaMemcpyDeviceToDevice
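For example, the typical blocking pattern around a kernel launch (a sketch; deviceA, hostA and nbytes are illustrative names):

cudaMemcpy(deviceA, hostA, nbytes, cudaMemcpyHostToDevice);   // host to device
// ... launch kernels that read/write deviceA ...
cudaMemcpy(hostA, deviceA, nbytes, cudaMemcpyDeviceToHost);   // device to host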

 Non-blocking copies are also available  DMA transfers, overlap computation and communication CUDA Variable Type Qualifiers

41

Variable declaration              Memory     Scope    Lifetime
int var;                          register   thread   thread
int array_var[10];                local      thread   thread
__shared__   int shared_var;      shared     block    block
__device__   int global_var;      device     grid     application
__constant__ int constant_var;    constant   grid     application

C for CUDA

42

 Philosophy: provide minimal set of extensions necessary

 Function qualifiers: __global__ void my_kernel() { } __device__ float my_device_func() { }

 Execution configuration:
dim3 grid_dim(100, 50);    // 5000 thread blocks
dim3 block_dim(4, 8, 8);   // 256 threads per block (1.3M threads total)
my_kernel <<< grid_dim, block_dim >>> (...);   // Launch kernel

 Built-in variables and functions valid in device code:
dim3 gridDim;    // Grid dimension
dim3 blockDim;   // Block dimension
dim3 blockIdx;   // Block index
dim3 threadIdx;  // Thread index

void __syncthreads();  // Thread synchronization

Calculating the global thread index

43

[Figure: a grid of 3 thread blocks (0, 1, 2), each containing threads 0-3; blockDim.x = 4]

 "global" thread index: blockDim.x * blockIdx.x + threadIdx.x;

Calculating the global thread index

44

[Figure: the same grid; the highlighted element is thread 1 of thread block 2]

 "global" thread index: blockDim.x * blockIdx.x + threadIdx.x;

4 * 2 + 1 = 9

Vector add

45

// sequential CPU version
void vector_add(int size, float* a, float* b, float* c) {
  for(int i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
  }
}

46

// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

GPU code
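The kernel above assumes the total number of threads exactly matches the vector length. A minimal variant (not from the slides) passes the length and guards against the extra threads launched when N is not a multiple of the block size:

__global__ void vector_add_checked(float* A, float* B, float* C, int n) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n) {                  // extra threads in the last block do nothing
    C[i] = A[i] + B[i];
  }
}

// launched with enough blocks to cover all n elements:
// vector_add_checked<<< (n + 255) / 256, 256 >>>(deviceA, deviceB, deviceC, n);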

Host code

int main() {
  // initialization code here ...

  // launch N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);

  // cleanup code here ...
}

(can be in the same file)

Vector addition host code

int main(int argc, char** argv) {
  float *hostA, *deviceA, *hostB, *deviceB, *hostC, *deviceC;
  int size = N * sizeof(float);

  // allocate host memory
  hostA = (float*) malloc(size);
  hostB = (float*) malloc(size);
  hostC = (float*) malloc(size);

  // initialize A, B arrays here...

  // allocate device memory
  cudaMalloc((void**)&deviceA, size);
  cudaMalloc((void**)&deviceB, size);
  cudaMalloc((void**)&deviceC, size);

47

Vector addition host code

  // transfer the data from the host to the device
  cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
  cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

  // launch N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);

  // transfer the result back from the GPU to the host
  cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);
}
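The host code above omits error handling and the cleanup hinted at on the earlier slide. A hedged completion, assuming the same variable names and that <stdio.h> is included, placed just before the closing brace of main:

  // check that the launch and the copies succeeded
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
  }

  // release device and host memory
  cudaFree(deviceA);  cudaFree(deviceB);  cudaFree(deviceC);
  free(hostA);  free(hostB);  free(hostC);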

48

49 CUDA shared memory

Using shared memory

50

// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input) {
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if(i > 0) {
    // each thread loads two elements from device memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

Using shared memory


52

// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input) {
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  if(i > 0) {
    // each thread loads two elements from device memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];

    result[i] = x_i - x_i_minus_one;
  }
}

The next thread also reads input[i]

Using shared memory

53

__global__ void adj_diff(int *result, int *input) {
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

  __shared__ int s_data[BLOCK_SIZE];  // shared, 1 element per thread

  // each thread reads 1 device memory element, stores it in s_data
  s_data[threadIdx.x] = input[i];

  // avoid race condition: ensure all loads are complete
  __syncthreads();

  if(threadIdx.x > 0) {
    result[i] = s_data[threadIdx.x] - s_data[threadIdx.x-1];
  } else if(i > 0) {
    // handle thread block boundary
    result[i] = s_data[threadIdx.x] - input[i-1];
  }
}

A Common Programming Strategy

54

 Partition data into subsets that fit into shared memory A Common Programming Strategy

55

 Handle each data subset with one thread block A Common Programming Strategy

56

 Load the subset from device memory to shared memory, using multiple threads to exploit memory-level parallelism

57

 Perform the computation on the subset from shared memory A Common Programming Strategy

58

 Copy the result from shared memory back to device memory
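A minimal sketch (not from the slides) that follows these five steps; it computes one partial sum per thread block and assumes a power-of-two block size of BLOCK_SIZE threads and an input length that is a multiple of BLOCK_SIZE:

#define BLOCK_SIZE 256

__global__ void block_sum(const float *input, float *block_results) {
  __shared__ float s_data[BLOCK_SIZE];             // the subset that fits in shared memory

  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  s_data[threadIdx.x] = input[i];                  // load the block's subset from device memory
  __syncthreads();                                 // wait until the whole subset is loaded

  // compute on the subset in shared memory: tree reduction
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride)
      s_data[threadIdx.x] += s_data[threadIdx.x + stride];
    __syncthreads();
  }

  if (threadIdx.x == 0)                            // copy the block's result back to device memory
    block_results[blockIdx.x] = s_data[0];
}

// launched with one block per BLOCK_SIZE elements, e.g.:
// block_sum<<< N / BLOCK_SIZE, BLOCK_SIZE >>>(deviceInput, devicePartialSums);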