Overview: Graphics Processing Units
Total Page:16
File Type:pdf, Size:1020Kb
Overview: Graphics Processing Units l advent of GPUs l GPU architecture n the NVIDIA Fermi processor l the CUDA programming model n simple example, threads organization, memory model n case study: matrix multiply using shared memory n memories, thread synchronization, scheduling n case study: reductions n performance considerations: bandwidth, scheduling, resource conflicts, instruction mix u host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs l the OpenCL programming model l directive-based programming models refs: Lin & Snyder Ch 10, CUDA Toolkit Documentation, An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page, Programming Massively Parallel Processors, Kirk & Hwu, Morgan-Kaufman, 2010; Cuda By Example, by Sanders and Kandrot; OpenCL web page, OpenCL in Action, by Matthew Scarpino COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 1 Advent of General-purpose Graphics Processing Units l many applications have massive amounts of mostly independent calculations n e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV n can be largely expressed in terms of SIMD operations u implementable with minimal control logic & caches, simple instruction sets l design point: maximize number of ALUs & FPUs and memory bandwidth to take advantage of Moore’s’ Law (shown here) n put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc l architecture/infrastructure development requires a massive economic base for its development (the gaming industry!) n pre 2006: only specialized graphics operations (integer & float data) n 2006: ‘General Purpose’ (GPGPU): general computations but only through a graphics library (e.g. OpenGL) n 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL) Some applications have large speedups (10–500×) over a single CPU core. COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 2 Graphics Processor Unit Systems l GPU systems are a co-processor ‘device’ on a CPU-based system ([O’H.&Bryant, fig 1.4]) n separate memory space (DRAMs) for CPU (host) and GPU (device) n must allocate space on GPU and copy data from CPU memory to GPU memory (and visa versa) via the PCIe bus n also need a way to copy the GPU executable code and start it (kernel launch) n issues? Why not use the same memory space? COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 3 Graphics Processor Unit Architecture l GPU chip: an array of streaming multiprocessors(SMs) sharing an L2 cache n comparison with UltraSPARC T2 (courtesy Real World Tech) n eachSM has (8–32) streaming processors(SPs) TeslaS2050 co-processor u onlySPs (= cores) within anSM can (easily) synchronize, share data n identical threads are organized into fixed-size blocks, each allocated to anSM n blocks are in turn are divided into warps n at any timestep, allSPs execute an instruction from a warp(‘SIMT’ mode) n latencies are hidden by scheduling from TeslaS2050 architecture many warps (courtesy NVIDIA) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 4 The Fermi Graphics Processor Chip GF110 model: 1.15 GHz; 900W; 3D grid & thread blocks; warp size: 32; max resident: blocks 8, warps 32, threads 1536 (courtesy NCI NF) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 5 GPU vs CPU Floating Point Speed and Memory Bandwidth COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 6 The Common Unified Device Architecture Programming Model l ‘device’ refers to a co-processor with its own DRAM that can run many threads in parallel l ‘host’ performs serial execution, transfers data to/from device (via DMA), and sends (highly jj) kernels to device l the kernel’s threads are organized into a grid of blocks n each block is sent to anSM l a CUDA program is a C/C++ program with device calls & ‘kernels’ (each with many threads) embedded into it l GPU threads are very lightweight (some overheads in invoking a kernel, and dispatching each block) n threads are identical but have thread (& block) ids (courtesy NCSU) l CUDA compiler (e.g. nvcc) produces a normal executable with device code embedded into it n has CUDA runtime (cudart) and core (cuda) libraries linked into it COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 7 CUDA Program: Simple Example l reverse an array (reversearray.cu) g l o b a l void reverseArray(int ∗ a d, intN) f int idx= threadIdx.x; intv=a[N −idx −1];a[N −idx −1] =a[idx];a[idx] =v; g #define N (1 <<16) int main() f //may not dereferencea d! inta[N], ∗ a d,a size=N ∗ sizeof(int); ... cudaMalloc((void ∗∗)&a d,a size); cudaMemcpy(a d,a,a size, cudaMemcpyHostToDevice); reverseArray <<<1,N/2 >>> (a d,N); cudaThreadSynchronize();// wait till threads finish cudaMemcpy(a,a d,a size, cudaMemcpyDeviceToHost); cudaFree(a d);... g l cf. OpenMP on a normal multicore: style; practicality? #pragma omp parallel num threads(N/2) default(shared) f int idx= omp get threads num(); intv=a[N −idx −1];a[N −idx −1] =a[idx];a[idx] =v; g COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 8 CUDA Thread Organization and Memory Model l memory model (left) reflects l a 2 × 1 grid with 2 × 1 blocks that of the GPU l 2 × 2 grid with 4 × 2 × 2 blocks (courtesy Real World Tech.) (courtesy NCSC) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 9 Case Study: Matrix Multiply l perform C+=AB, C is N × N, A is N × K, B is K × N l column-major storage: Ci j is at C[i+j ∗N] l 1st attempt: each thread computes one element of C, Ci, j l invocation with W ×W thread blocks (assume WjN) n why better than using a N × N thread block? (2 reasons, both important!) l for thread( tx,ty) of block( bx,by), i = byW +ty and j = bxW +tx (courtesy xfig) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 10 CUDA Matrix Multiply: Implementation l kernel: g l o b a l void matMult(intN, intK, double ∗ A d, double ∗ B d, double ∗ C d) f inti= blockIdx.y ∗ blockDim.y+ threadIdx.y; intj= blockIdx.x ∗ blockDim.x+ threadIdx.x; double cij=C d[i+j ∗ N]; for(intk=0;k < K;k++) cij +=A d[i+k ∗ N] ∗ B d[k+j ∗ K]; C d[i+j ∗ N] = cij; g l main program: needs to allocate device versions of A, B & C (A d, B d, and C d) and cudaMemcpy() host versions into them l invocation with W ×W thread blocks (assume WjN) dim3 dimG(N/W,N/W); dim3 dimB(W,W);// in kernel blockDim.x ==W matMult <<<dimG, dimB >>> (N,K,A d,B d,C d); l what if N%W > 0? Add to kernel if(i < N&&j < N) and declare dim3 dimG((N+W−1)/W,(N+W−1)/W); n note: SIMD nature ofSPs ) cycles for both branches of if are consumed COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 11 CUDA Memories and Thread Synchronization l GPUs can potentially suffer more still from the memory wall n DRAM access still may be 100’s of cycles n bandwidth is limited for load/store intensive kernels l the shared memory is on-chip (hence very fast) n the shared type modifier may be used to denote a (fixed) array allocated to shared memory l threads within a block can synchronized via the syncthreads() intrinsic (efficient – why?) n (SM-level) atomic instructions can enforce data consistency within a block l note: no (easy) way to synchronize between blocks, or safely ensure data consistency across blocks n can only be done across separate kernel invocations COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 12 Matrix Multiply Using Shared Memory l threads (tx,0) ... (tx,W − 1) all access Bk,bxW +tx;((0,ty) ... (W − 1,ty) access AbyW +ty,k) n high ratio of load to FP instructions n harder to hide L1 cache latencies; strains memory bandwidth l can improve kernel by utilizing SM shared memory: s h a r e d doubleA s[W][W],B s[W][W]; g l o b a l void matMult s(intN, intK, double ∗ A d, double ∗ B d, double ∗ C d) f int ty= threadIdx.y, tx= threadIdx.x; inti= blockIdx.y ∗ W+ ty,j= blockIdx.x ∗ W+ tx; double cij=C d[i+j ∗ N]; for(intk=0;k < K;k+=W) f A s[ty][tx] =A d[i+(k+tx) ∗ N]; B s[ty][tx] =B d[(k+ty) +j ∗ K]; syncthreads(); for(intw=0;w < W;w++) cij +=A s[ty][w] ∗ B s[w][tx]; syncthreads();// can this be avoided? g C d[i+j ∗ N] = cij; g COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 13 GPU Scheduling - Warps l the GigaThread scheduler assigns the (independently executable) thread blocks to eachSM l each block is divided into groups (of 32) called warps n grouping occurs in linear order by (e.g. warp size 4) tx + bxty + bxbytz l the warp scheduler determines which blocks are ready to run n with 32-thread warps, suitable block sizes range from 4 × 8 to 16 × 16 n SIMT: eachSP executes next instr’n SIMD-style (note: requires only a single instruction fetch!) l thus, a kernel with enough blocks can scale across a GPU with any number of cores (courtesy NVIDIA - both) COMP4300/8300 L18,19: Graphics Processing Units 2021 JJJ • III × 14 Reductions and Thread Divergence l threads within a single 1D block summing A[0..N−1]: g l o b a l void sumV(intN, double ∗ A, double ∗ s) f int bx= blockDim.x; s h a r e d double pSum[bx]; int tx= threadID.x,x; pSum[tx] = ..