Overview: Graphics Processing Units

- advent of GPUs
- GPU architecture
  - the NVIDIA Fermi processor
- the CUDA programming model
  - simple example, thread organization, memory model
  - case study: matrix multiply using shared memory
  - memories, thread synchronization, scheduling
  - case study: reductions
  - performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
    - host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
- the OpenCL programming model
- directive-based programming models

refs: Lin & Snyder, Ch 10; CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Kirk & Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010; Sanders & Kandrot, CUDA by Example; the OpenCL web page; Scarpino, OpenCL in Action.

Advent of General-Purpose Graphics Processing Units

- many applications have massive amounts of mostly independent calculations
  - e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV
  - these can be largely expressed in terms of SIMD operations
    - implementable with minimal control logic & caches and simple instruction sets
- design point: maximize the number of ALUs & FPUs and the memory bandwidth, to take advantage of Moore's Law (shown here)
  - put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc.
- such an architecture and its infrastructure require a massive economic base for their development (the gaming industry!)
  - pre-2006: only specialized graphics operations (integer & float data)
  - 2006: 'General Purpose' (GPGPU): general computations, but only through a graphics library (e.g. OpenGL)
  - 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL)
- some applications achieve large speedups (10–500×) over a single CPU core

Graphics Processor Unit Systems

- GPU systems are a co-processor 'device' on a CPU-based system [O'H. & Bryant, fig. 1.4]
  - separate memory spaces (DRAMs) for the CPU (host) and the GPU (device)
  - must allocate space on the GPU and copy data from CPU memory to GPU memory (and vice versa) via the PCIe bus (a host-side sketch of this pattern follows below)
  - also need a way to copy the GPU executable code over and start it (kernel launch)
  - issues? why not use the same memory space?
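As a concrete illustration of the allocate / copy / launch / copy-back pattern just described, here is a minimal host-side sketch. It is not from the course material: the kernel scaleV, the buffer names x_h / x_d and the block size of 256 are illustrative assumptions; the reverseArray example later in these notes follows the same pattern.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // illustrative kernel (not from the slides): scale each element by 2
    __global__ void scaleV(float *x_d, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                                          // guard: grid may be larger than n
        x_d[i] *= 2.0f;
    }

    int main(void) {
      int n = 1 << 20;
      size_t sz = n * sizeof(float);
      float *x_h = (float *) malloc(sz);                  // host (CPU) buffer
      for (int i = 0; i < n; i++) x_h[i] = (float) i;
      float *x_d;
      cudaMalloc((void **) &x_d, sz);                     // allocate device (GPU) memory
      cudaMemcpy(x_d, x_h, sz, cudaMemcpyHostToDevice);   // host -> device over PCIe
      scaleV<<<(n + 255) / 256, 256>>>(x_d, n);           // kernel launch: grid of 256-thread blocks
      cudaDeviceSynchronize();                            // wait for the device to finish
      cudaMemcpy(x_h, x_d, sz, cudaMemcpyDeviceToHost);   // device -> host
      cudaFree(x_d); free(x_h);
      return 0;
    }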
Graphics Processor Unit Architecture

- a GPU chip is an array of streaming multiprocessors (SMs) sharing an L2 cache
  - comparison with the UltraSPARC T2 (courtesy Real World Tech)
  - each SM has (8–32) streaming processors (SPs)
    - only SPs (= cores) within an SM can (easily) synchronize and share data
  - identical threads are organized into fixed-size blocks, each allocated to an SM
  - blocks are in turn divided into warps
  - at any timestep, all SPs execute an instruction from a warp ('SIMT' mode)
  - latencies are hidden by scheduling from many warps
  (figures: Tesla S2050 co-processor and Tesla S2050 architecture, courtesy NVIDIA)

The Fermi Graphics Processor Chip

- GF110 model: 1.15 GHz; 900 W; 3D grids & thread blocks; warp size 32; max resident: 8 blocks, 32 warps, 1536 threads (courtesy NCI NF)

GPU vs CPU Floating Point Speed and Memory Bandwidth

- (figure-only slide: trends in peak floating point speed and memory bandwidth for GPUs vs CPUs)

The Common Unified Device Architecture Programming Model

- 'device' refers to a co-processor with its own DRAM that can run many threads in parallel
- the 'host' performs serial execution, transfers data to/from the device (via DMA), and sends (highly parallel) kernels to the device
- a kernel's threads are organized into a grid of blocks
  - each block is sent to an SM
- a CUDA program is a C/C++ program with device calls & 'kernels' (each with many threads) embedded into it
- GPU threads are very lightweight (there is some overhead in invoking a kernel and in dispatching each block)
  - threads are identical but have thread (& block) ids (courtesy NCSU)
- the CUDA compiler (e.g. nvcc) produces a normal executable with the device code embedded into it
  - it has the CUDA runtime (cudart) and core (cuda) libraries linked into it

CUDA Program: Simple Example

- reverse an array (reverseArray.cu):

    __global__ void reverseArray(int *a_d, int N) {
      int idx = threadIdx.x;
      int v = a_d[N-idx-1]; a_d[N-idx-1] = a_d[idx]; a_d[idx] = v;
    }

    #define N (1<<16)
    int main() {
      int a[N], *a_d, a_size = N * sizeof(int);   // may not dereference a_d!
      ...
      cudaMalloc((void **) &a_d, a_size);
      cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
      reverseArray <<<1, N/2>>> (a_d, N);
      cudaThreadSynchronize();                    // wait till threads finish
      cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
      cudaFree(a_d); ...
    }

- cf. OpenMP on a normal multicore: style? practicality?

    #pragma omp parallel num_threads(N/2) default(shared)
    {
      int idx = omp_get_thread_num();
      int v = a[N-idx-1]; a[N-idx-1] = a[idx]; a[idx] = v;
    }

CUDA Thread Organization and Memory Model

- the memory model (left) reflects that of the GPU (courtesy Real World Tech.)
- examples of thread organization: a 2 × 1 grid with 2 × 1 blocks, and a 2 × 2 grid with 4 × 2 × 2 blocks (courtesy NCSC); see the index sketch below
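As a hedged illustration of how the built-in index variables compose under this organization, the sketch below launches a 2 × 2 grid of 2 × 2 blocks and has each thread record its global coordinates. The kernel name whoAmI and the output layout are illustrative assumptions, not part of the course material; 3-D blocks work the same way using the .z components.

    #include <cuda_runtime.h>

    // each thread computes its global (row, col) position in the grid of blocks
    __global__ void whoAmI(int *row_d, int *col_d, int width) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
      int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
      row_d[row * width + col] = row;
      col_d[row * width + col] = col;
    }

    int main(void) {
      const int width = 4;                               // 2 blocks of 2 threads per dimension
      int *row_d, *col_d;
      cudaMalloc((void **) &row_d, width * width * sizeof(int));
      cudaMalloc((void **) &col_d, width * width * sizeof(int));
      dim3 dimG(2, 2), dimB(2, 2);                       // 2 x 2 grid of 2 x 2 blocks
      whoAmI<<<dimG, dimB>>>(row_d, col_d, width);
      cudaDeviceSynchronize();
      cudaFree(row_d); cudaFree(col_d);
      return 0;
    }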
Case Study: Matrix Multiply

- perform C += AB, where C is N × N, A is N × K and B is K × N
- column-major storage: C_ij is at C[i + j*N]
- 1st attempt: each thread computes one element of C, C_ij
- invocation with W × W thread blocks (assume W divides N)
  - why is this better than using a single N × N thread block? (2 reasons, both important!)
- for thread (tx, ty) of block (bx, by), i = by*W + ty and j = bx*W + tx (courtesy xfig)

CUDA Matrix Multiply: Implementation

- kernel:

    __global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
      int i = blockIdx.y * blockDim.y + threadIdx.y;
      int j = blockIdx.x * blockDim.x + threadIdx.x;
      double cij = C_d[i + j*N];
      for (int k = 0; k < K; k++)
        cij += A_d[i + k*N] * B_d[k + j*K];
      C_d[i + j*N] = cij;
    }

- the main program needs to allocate device versions of A, B & C (A_d, B_d and C_d) and cudaMemcpy() the host versions into them
- invocation with W × W thread blocks (assume W divides N):

    dim3 dimG(N/W, N/W);
    dim3 dimB(W, W);     // in the kernel, blockDim.x == W
    matMult <<<dimG, dimB>>> (N, K, A_d, B_d, C_d);

- what if N % W > 0? add if (i < N && j < N) to the kernel and declare dim3 dimG((N+W-1)/W, (N+W-1)/W);
  - note: the SIMD nature of the SPs means cycles for both branches of the if are consumed

CUDA Memories and Thread Synchronization

- GPUs can potentially suffer even more from the memory wall
  - a DRAM access may still take 100s of cycles
  - bandwidth is limited for load/store-intensive kernels
- shared memory is on-chip (hence very fast)
  - the __shared__ type modifier may be used to declare a (fixed-size) array allocated in shared memory
- threads within a block can synchronize via the __syncthreads() intrinsic (efficient - why?)
  - (SM-level) atomic instructions can enforce data consistency within a block
- note: there is no (easy) way to synchronize between blocks, or to safely ensure data consistency across blocks
  - this can only be done across separate kernel invocations

Matrix Multiply Using Shared Memory

- threads (tx, 0) ... (tx, W-1) all access B(k, bx*W+tx); threads (0, ty) ... (W-1, ty) all access A(by*W+ty, k)
  - high ratio of load to FP instructions
  - harder to hide L1 cache latencies; strains memory bandwidth
- the kernel can be improved by utilizing the SM's shared memory:

    __global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
      __shared__ double A_s[W][W], B_s[W][W];   // W assumed to be a compile-time constant (e.g. a #define)
      int ty = threadIdx.y, tx = threadIdx.x;
      int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
      double cij = C_d[i + j*N];
      for (int k = 0; k < K; k += W) {
        A_s[ty][tx] = A_d[i + (k+tx)*N];
        B_s[ty][tx] = B_d[(k+ty) + j*K];
        __syncthreads();
        for (int w = 0; w < W; w++)
          cij += A_s[ty][w] * B_s[w][tx];
        __syncthreads();   // can this be avoided?
      }
      C_d[i + j*N] = cij;
    }

GPU Scheduling - Warps

- the GigaThread scheduler assigns the (independently executable) thread blocks to the SMs
- each block is divided into groups (of 32 threads) called warps
  - grouping occurs in linear order by tx + bx*ty + bx*by*tz, where bx and by here denote the block's x and y dimensions (the figure example uses warp size 4)
- the warp scheduler determines which warps are ready to run
  - with 32-thread warps, suitable block sizes range from 4 × 8 to 16 × 16
  - SIMT: each SP executes the next instruction SIMD-style (note: this requires only a single instruction fetch!)
- thus, a kernel with enough blocks can scale across a GPU with any number of cores
  (courtesy NVIDIA - both figures)

Reductions and Thread Divergence

- threads within a single 1D block summing A[0..N-1] (the listing is truncated here; a completed sketch follows below):

    __global__ void sumV(int N, double *A, double *s) {
      int bx = blockDim.x;
      __shared__ double pSum[bx];
      int tx = threadIdx.x, x;
      pSum[tx] = ...
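The slide's listing breaks off above (and its variable-length __shared__ array would need a compile-time size, or an extern __shared__ declaration, in real CUDA). A minimal completed sketch of the same idea, a block-level tree reduction in shared memory, is given below; the fixed block size BLK is my own assumption, not from the slides. Halving the stride each step keeps the active threads contiguous, so whole warps drop out together and thread divergence within a warp is limited.

    #define BLK 256   // assumed fixed block size (chosen for this sketch, not from the slides)

    // assumes the kernel is launched with blockDim.x == BLK and that BLK is a power of two
    __global__ void sumV(int N, const double *A, double *s) {
      __shared__ double pSum[BLK];
      int tx = threadIdx.x;

      // phase 1: each thread accumulates a strided partial sum over A
      double v = 0.0;
      for (int x = tx; x < N; x += blockDim.x)
        v += A[x];
      pSum[tx] = v;
      __syncthreads();

      // phase 2: tree reduction in shared memory; the stride halves each step,
      // so the active threads stay contiguous within the lower-numbered warps
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tx < stride)
          pSum[tx] += pSum[tx + stride];
        __syncthreads();
      }

      if (tx == 0)
        *s = pSum[0];   // thread 0 writes the block's result
    }

    // invocation with a single 1-D block, e.g.:
    //   sumV<<<1, BLK>>>(N, A_d, s_d);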