Overview: Graphics Processing Units

- advent of GPUs
- GPU architecture
  - the NVIDIA Fermi processor
- the CUDA programming model
  - simple example; thread organization; memory model
  - case study: matrix multiply using shared memory
  - memories, thread synchronization, scheduling
  - case study: reductions
  - performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
  - host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
- the OpenCL programming model
- directive-based programming models

refs: Lin & Snyder Ch 10; CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Programming Massively Parallel Processors, Kirk & Hwu, Morgan Kaufmann, 2010; CUDA by Example, Sanders & Kandrot; OpenCL web page; OpenCL in Action, Matthew Scarpino

Advent of General-purpose Graphics Processing Units

- many applications have massive amounts of mostly independent calculations
  - e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV
  - these can be largely expressed in terms of SIMD operations
    - implementable with minimal control logic & caches, and simple instruction sets
- design point: maximize the number of ALUs & FPUs and the memory bandwidth, to take advantage of Moore's Law
  - put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc.
- such an architecture/infrastructure requires a massive economic base for its development (the gaming industry!)
  - pre 2006: only specialized graphics operations (integer & float data)
  - 2006: 'General Purpose' GPU (GPGPU): general computations, but only through a graphics library (e.g. OpenGL)
  - 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL)
- some applications achieve large speedups (10–500×) over a single CPU core

Graphics Processor Unit Systems

- GPU systems are a co-processor 'device' on a CPU-based system [O'H. & Bryant, fig 1.4]
  - separate memory spaces (DRAMs) for the CPU (host) and the GPU (device)
  - must allocate space on the GPU and copy data from CPU memory to GPU memory (and vice versa) via the PCIe bus
  - also need a way to copy the GPU executable code over and start it (kernel launch)
  - issues? why not use the same memory space?

Graphics Processor Unit Architecture

- GPU chip: an array of streaming multiprocessors (SMs) sharing an L2 cache
  - comparison with the UltraSPARC T2 (figure courtesy Real World Tech)
  - each SM has (8–32) streaming processors (SPs)
    - only SPs (= cores) within an SM can (easily) synchronize and share data
  - identical threads are organized into fixed-size blocks, each allocated to an SM
  - blocks are in turn divided into warps
  - at any timestep, all SPs execute an instruction from a warp ('SIMT' mode)
  - latencies are hidden by scheduling from many warps
  (figures: Tesla S2050 co-processor; Tesla S2050 architecture, courtesy NVIDIA)
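
Most of these architectural parameters can be queried at run time; a minimal sketch using the standard CUDA runtime API (not taken from the slides):

  #include <cstdio>

  int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // properties of device 0
    printf("%s: %d SMs, warp size %d, %zu bytes shared memory per block\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.sharedMemPerBlock);
    return 0;
  }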

The Fermi Graphics Processor Chip

GF110 model: 1.15 GHz; 900W; 3D grid & thread blocks; warp size 32; max resident per SM: 8 blocks, 48 warps, 1536 threads

(courtesy NCI NF)

GPU vs CPU Floating Point Speed and Memory Bandwidth

The Compute Unified Device Architecture (CUDA) Programming Model

- 'device' refers to a co-processor with its own DRAM that can run many threads in parallel
- 'host' performs serial execution, transfers data to/from the device (via DMA), and sends (highly parallel) kernels to the device
- a kernel's threads are organized into a grid of blocks
  - each block is sent to an SM
- a CUDA program is a C/C++ program with device calls & 'kernels' (each with many threads) embedded into it
- GPU threads are very lightweight (some overheads in invoking a kernel and in dispatching each block)
  - threads are identical but have thread (& block) ids (see the index sketch below)
- the CUDA compiler (e.g. nvcc) produces a normal executable with the device code embedded into it
  - it has the CUDA runtime (cudart) and core (cuda) libraries linked into it
  (figure courtesy NCSU)
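
A minimal sketch (not from the slides) of how a thread combines its block and thread ids into a global index; the kernel name scale and the array x_d are hypothetical:

  __global__ void scale(int n, double *x_d) {
    // global index: which block I am in, times the block size, plus my position in the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                 // guard: the last block may be only partially full
      x_d[idx] *= 2.0;
  }

  // host side: enough 256-thread blocks to cover n elements
  // scale<<<(n + 255)/256, 256>>>(n, x_d);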

CUDA Program: Simple Example

- reverse an array (reversearray.cu); an error-checked variant of the launch is sketched below

  __global__ void reverseArray(int *a_d, int N) {
    int idx = threadIdx.x;
    int v = a_d[N-idx-1]; a_d[N-idx-1] = a_d[idx]; a_d[idx] = v;
  }

  #define N (1<<16)
  int main() {                         // may not dereference a_d!
    int a[N], *a_d, a_size = N*sizeof(int);
    ...
    cudaMalloc((void **) &a_d, a_size);
    cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
    reverseArray <<<1, N/2>>> (a_d, N);
    cudaThreadSynchronize();           // wait till threads finish
    cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
    cudaFree(a_d); ...
  }

- cf. OpenMP on a normal multicore: style? practicality?

  #pragma omp parallel num_threads(N/2) default(shared)
  {
    int idx = omp_get_thread_num();
    int v = a[N-idx-1]; a[N-idx-1] = a[idx]; a[idx] = v;
  }
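
An aside not in the original slides: cudaThreadSynchronize() is deprecated in later CUDA releases in favour of cudaDeviceSynchronize(), and real code should check for errors. A minimal sketch of an error-checked launch (a drop-in for the launch/synchronize lines in main above; needs <cstdio>):

  reverseArray<<<1, N/2>>>(a_d, N);
  cudaError_t err = cudaGetLastError();          // errors detected at launch time
  if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
  err = cudaDeviceSynchronize();                 // errors detected while the kernel ran
  if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));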

CUDA Thread Organization and Memory Model

- the memory model (left figure) reflects that of the GPU
- thread organization examples (right figure): a 2 × 1 grid of 2 × 1 blocks; a 2 × 2 grid of 4 × 2 × 2 blocks
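
A small sketch (not from the slides) of how these two layouts would be expressed at kernel launch; the no-argument kernel k is hypothetical:

  __global__ void k() { /* each thread can inspect blockIdx and threadIdx here */ }

  int main() {
    dim3 grid1(2, 1), block1(2, 1);      // 2 x 1 grid of 2 x 1 blocks: 4 threads in total
    k<<<grid1, block1>>>();
    dim3 grid2(2, 2), block2(4, 2, 2);   // 2 x 2 grid of 4 x 2 x 2 blocks: 64 threads in total
    k<<<grid2, block2>>>();
    cudaDeviceSynchronize();
    return 0;
  }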

(figures courtesy Real World Tech. and NCSC)

Case Study: Matrix Multiply

- perform C += AB, where C is N × N, A is N × K, and B is K × N
- column-major storage: C_ij is at C[i+j*N]
- 1st attempt: each thread computes one element of C, C_ij
- invocation with W × W thread blocks (assume W | N)
  - why is this better than using an N × N thread block? (2 reasons, both important!)
- for thread (tx,ty) of block (bx,by): i = by*W + ty and j = bx*W + tx

(figure courtesy xfig)

CUDA Matrix Multiply: Implementation

- kernel:

  __global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;
    double cij = C_d[i + j*N];
    for (int k = 0; k < K; k++)
      cij += A_d[i + k*N] * B_d[k + j*K];
    C_d[i + j*N] = cij;
  }

- main program: needs to allocate device versions of A, B & C (A_d, B_d, and C_d) and cudaMemcpy() the host versions into them (a host-side driver is sketched below)
- invocation with W × W thread blocks (assume W | N):

  dim3 dimG(N/W, N/W);
  dim3 dimB(W, W);                       // in the kernel, blockDim.x == W
  matMult <<<dimG, dimB>>> (N, K, A_d, B_d, C_d);

- what if N % W > 0? add  if (i < N && j < N)  to the kernel, and declare
  dim3 dimG((N+W-1)/W, (N+W-1)/W);
  - note: the SIMD nature of the SPs means that cycles for both branches of the if are consumed
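
A sketch of the host-side driver mentioned above (assumptions: column-major host arrays A, B, C already populated and W already defined; not taken from the slides):

  size_t A_size = (size_t)N*K*sizeof(double), B_size = (size_t)K*N*sizeof(double),
         C_size = (size_t)N*N*sizeof(double);
  double *A_d, *B_d, *C_d;
  cudaMalloc((void **) &A_d, A_size);
  cudaMalloc((void **) &B_d, B_size);
  cudaMalloc((void **) &C_d, C_size);
  cudaMemcpy(A_d, A, A_size, cudaMemcpyHostToDevice);
  cudaMemcpy(B_d, B, B_size, cudaMemcpyHostToDevice);
  cudaMemcpy(C_d, C, C_size, cudaMemcpyHostToDevice);   // C += AB, so C is an input too
  dim3 dimG((N+W-1)/W, (N+W-1)/W), dimB(W, W);
  matMult<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);
  cudaDeviceSynchronize();
  cudaMemcpy(C, C_d, C_size, cudaMemcpyDeviceToHost);
  cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);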

CUDA Memories and Thread Synchronization

- GPUs can potentially suffer even more from the memory wall
  - DRAM access may still be 100's of cycles
  - bandwidth is limited for load/store intensive kernels
- the shared memory is on-chip (hence very fast)
  - the __shared__ type modifier may be used to denote a (fixed-size) array allocated in shared memory
- threads within a block can synchronize via the __syncthreads() intrinsic (efficient – why?)
  - (SM-level) atomic instructions can enforce data consistency within a block (see the sketch below)
- note: there is no (easy) way to synchronize between blocks, or to safely ensure data consistency across blocks
  - this can only be done across separate kernel invocations
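
A minimal sketch combining __shared__, __syncthreads() and atomics; the kernel countNonZero and its counting task are hypothetical, not from the slides:

  __global__ void countNonZero(int N, const int *v_d, int *count_d) {
    __shared__ int blockCount;                       // one counter per block, in shared memory
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();                                 // counter initialized before anyone uses it
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N && v_d[idx] != 0)
      atomicAdd(&blockCount, 1);                     // atomic on shared memory, within the block
    __syncthreads();                                 // all contributions are in
    if (threadIdx.x == 0)
      atomicAdd(count_d, blockCount);                // atomic on global memory combines the blocks' results
  }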

Matrix Multiply Using Shared Memory

- threads (tx,0) ... (tx,W-1) all access B_(k, bx*W+tx); threads (0,ty) ... (W-1,ty) all access A_(by*W+ty, k)
  - high ratio of load to FP instructions
  - harder to hide L1 cache latencies; strains memory bandwidth
- can improve the kernel by utilizing the SM's shared memory (W here must be a compile-time constant):

  __global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
    __shared__ double A_s[W][W], B_s[W][W];
    int ty = threadIdx.y, tx = threadIdx.x;
    int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
    double cij = C_d[i + j*N];
    for (int k = 0; k < K; k += W) {
      A_s[ty][tx] = A_d[i + (k+tx)*N];
      B_s[ty][tx] = B_d[(k+ty) + j*K];
      __syncthreads();
      for (int w = 0; w < W; w++)
        cij += A_s[ty][w] * B_s[w][tx];
      __syncthreads();                 // can this be avoided?
    }
    C_d[i + j*N] = cij;
  }

GPU Scheduling - Warps

- the GigaThread scheduler assigns the (independently executable) thread blocks to each SM
- each block is divided into groups (of 32) called warps
  - grouping occurs in linear order by tx + bx*ty + bx*by*tz, where bx, by here are the block dimensions (the figure illustrates this for warp size 4); see the sketch below
- the warp scheduler determines which warps are ready to run
  - with 32-thread warps, suitable block sizes range from 4 × 8 to 16 × 16
  - SIMT: each SP executes the next instruction SIMD-style (note: this requires only a single instruction fetch!)
- thus, a kernel with enough blocks can scale across a GPU with any number of cores
  (figures courtesy NVIDIA)
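
A small sketch (not from the slides) of that linear ordering, recovering a thread's warp number inside an arbitrary 3D block:

  __device__ int warpNumber() {
    int linear = threadIdx.x
               + blockDim.x * threadIdx.y
               + blockDim.x * blockDim.y * threadIdx.z;   // linear order used for warp grouping
    return linear / warpSize;                             // warpSize is a built-in variable (32 on current GPUs)
  }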

Reductions and Thread Divergence

- threads within a single 1D block summing A[0..N-1] (launched with the shared-memory size as sketched below):

  __global__ void sumV(int N, double *A, double *s) {
    int bx = blockDim.x;
    extern __shared__ double pSum[];   // size (bx doubles) supplied at kernel launch
    int tx = threadIdx.x, x;
    pSum[tx] = ...
    for (x = bx/2; x > 0; x /= 2) {
      __syncthreads();
      if (tx < x) pSum[tx] += pSum[tx+x];
    }
    if (tx == 0) *s = pSum[tx];
  }

- divergence is minimized: it occurs only when x < 32 (i.e. within one warp)
- predicated execution: threads in a warp where the condition is false execute a 'no-op'
- if-else statements thus cause thread divergence (worse when nested)
- cf. the alternative algorithm:

  for (x = 1; x < bx; x *= 2) {
    __syncthreads();
    if (tx % (2*x) == 0) pSum[tx] += pSum[tx+x];
  }

  (figure courtesy NVIDIA)
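
Because pSum is declared extern __shared__, its size must be given as the third launch-configuration parameter; a minimal sketch (the block size and the device pointers A_d, s_d are hypothetical):

  int bx = 256;                                       // one block of 256 threads
  sumV<<<1, bx, bx*sizeof(double)>>>(N, A_d, s_d);    // third parameter: bytes of dynamic shared memory
  cudaDeviceSynchronize();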

Global Memory Bandwidth Issues

- in the reduction example, all threads in a warp access contiguous elements of the (shared) array pSum
- this is very important when you have global memory accesses (see the sketch below):
  - the memory subsystem can coalesce such accesses into a single transaction
  - this allows the DRAM banks to deliver peak bandwidth (burst mode)
    - reason: the 2D organization of DRAM chips (same row address) (Lect 3, p14)
- matMult example: threads within a warp access B contiguously, but not A
  - the effect of the accesses to A is in this case mitigated by the use of shared memory in the multiply
  - note that this is the opposite of normal cores, where contiguous access within a thread is most desirable (it maximizes spatial locality)
- worst-case scenario: memory strides in (large) powers of 2 - these cause memory bank conflicts
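
A sketch of the contrast, using hypothetical kernels (not from the slides):

  __global__ void coalesced(int n, double *a) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0;          // consecutive threads touch consecutive words:
                                     // one coalesced transaction per warp
  }

  __global__ void strided(int n, double *a, int stride) {
    int i = (blockIdx.x*blockDim.x + threadIdx.x) * stride;
    if (i < n) a[i] *= 2.0;          // consecutive threads are 'stride' words apart:
                                     // many separate transactions; power-of-2 strides also risk bank conflicts
  }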

SM Registers and Warp Scheduling

- the SM maintains the block ids of its scheduled blocks, and the thread ids (and block sizes) of its scheduled threads
- the SM's (32K-word) register file is shared between all of these
  - the block and thread ids are used to index the file for the registers allocated to a particular thread
- warps whose next instruction has its operands ready for consumption may be selected
  - round-robin is used if several are ready
- thus, registers need to be 'scoreboarded'
  - this can be exploited to 'prefetch' data and better hide latencies (cf. the shared-memory matMult)
- example: if there are 4 instructions between a load & its use, on the G80 (with 4 clock cycles needed to process an instruction) we need 14 active warps to tolerate a 200-cycle memory latency
  (figure courtesy NVIDIA)

Performance Considerations: Shared SM Resources

- on Fermi GPUs, each SM may have resident: 8 blocks, 48 warps and 1536 threads; it has a 128 KB register file and 64 KB of shared memory / L1 cache
  - to fully utilize the block & thread 'slots', need at least 192 threads per block
  - assuming 4-byte operands, can have at most 21 registers per thread
- resource contention can cause a dramatic loss of performance
  - 'optimizations' on a kernel resulting in more registers may result in fewer blocks being resident . . .
- the CUDA occupancy calculator can help evaluate this
  (figure courtesy NVIDIA)
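
A worked check of the two occupancy figures, using only the numbers quoted above:

  \[
    \frac{1536\ \text{threads}}{8\ \text{blocks}} = 192\ \text{threads per block},
    \qquad
    \left\lfloor \frac{128\,\text{KB} / 4\,\text{B}}{1536\ \text{threads}} \right\rfloor
      = \left\lfloor \frac{32768}{1536} \right\rfloor = 21\ \text{registers per thread}
  \]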

Performance Considerations: Instruction Mix

- goal: keep the SPs' FPUs fully occupied doing useful operations
  - every other kind of instruction (loads, address calculations, branches) hinders this!
- matrix multiply revisited:
  - strategy 1: 'unroll' the k loop:

      for (int k = 0; k < K; k += 2)
        cij += A_d[i+k*N] * B_d[k+j*K] + A_d[i+(k+1)*N] * B_d[k+1+j*K];

    this halves the loop index increments & branches
  - strategy 2: each thread computes a 2 × 2 'tile' of C instead of a single element (a sketch follows below)
    - reduces load instructions; reduces branches by 4× - but may require 4× the registers!
    - also increases thread granularity: may help if K is not large
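
A sketch of strategy 2 (assumptions: column-major layout as before, N even, and a grid sized so that each thread owns one 2 × 2 tile; an illustrative kernel, not the course's reference code):

  __global__ void matMult2x2(int N, int K, const double *A_d, const double *B_d, double *C_d) {
    int i = 2*(blockIdx.y*blockDim.y + threadIdx.y);    // top row of this thread's tile
    int j = 2*(blockIdx.x*blockDim.x + threadIdx.x);    // left column of this thread's tile
    double c00 = C_d[i + j*N],     c10 = C_d[i+1 + j*N];
    double c01 = C_d[i + (j+1)*N], c11 = C_d[i+1 + (j+1)*N];
    for (int k = 0; k < K; k++) {
      double a0 = A_d[i + k*N],  a1 = A_d[i+1 + k*N];   // two loads of A and
      double b0 = B_d[k + j*K],  b1 = B_d[k + (j+1)*K]; // two of B feed four multiply-adds
      c00 += a0*b0;  c10 += a1*b0;
      c01 += a0*b1;  c11 += a1*b1;
    }
    C_d[i + j*N] = c00;      C_d[i+1 + j*N] = c10;
    C_d[i + (j+1)*N] = c01;  C_d[i+1 + (j+1)*N] = c11;
  }

  // launched with e.g. dim3 dimG(N/(2*W), N/(2*W)), dimB(W, W)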

Host-Device Issues: Multiple GPUs, NVLink, and Unified Memory

- transfer of data between host and device is error-prone and potentially a performance bottleneck (and what if the matrix for a linear system solver could not fit in GPU memory?)
- the problem is exacerbated when multiple GPUs are connected to one host
  - we can select the required device via cudaSetDevice():

  cudaSetDevice(0);
  cudaMalloc(&a_d, n); cudaMemcpy(a_d, a, n, ...);
  reverseArray<<<1, n/2>>>(a_d, n);
  cudaThreadSynchronize();
  cudaMemcpyPeer(b_d, 1, a_d, 0, n);     // dst (on device 1) <- src (on device 0)
  cudaSetDevice(1);
  reverseArray<<<1, n/2>>>(b_d, n);

- fast interconnects such as NVLink will reduce the transfer costs (e.g. the Sierra system)
- CUDA's Unified Memory improves programmability (and, in some cases, performance); a sketch follows below
  - cudaMallocManaged(&a, n); allocates the array on the host so that it can migrate, page-by-page, to/from the GPU(s) transparently and on demand
- alternatively, have the device and the CPU use the same memory, as on AMD's APU for Exascale Computing
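
A minimal sketch (not from the slides) of the earlier reverseArray example using Unified Memory, so that no explicit cudaMemcpy calls are needed:

  int *a;                                   // one pointer, visible to host and device
  cudaMallocManaged(&a, N*sizeof(int));
  for (int i = 0; i < N; i++) a[i] = i;     // initialize on the host
  reverseArray<<<1, N/2>>>(a, N);           // pages migrate to the GPU on demand
  cudaDeviceSynchronize();                  // ... and back when the host touches a[] again
  cudaFree(a);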

The Open Computing Language (OpenCL) for Devices and Regular Cores

- an open standard – not proprietary like CUDA; based on C (no C++)
- design philosophy: treat GPUs and CPUs as peers; both data- and task-parallel compute models
- similar execution model to CUDA:
  - NDRange (CUDA grid): operates on global data; units within it cannot synchronize
  - WorkGroup (CUDA block): units within it can use local data (CUDA __shared__) and can synchronize
  - WorkItem (CUDA thread): independent unit of execution, also has private data
- example kernel:

  __kernel void reverseArray(__global int *a_d, int N) {
    int idx = get_global_id(0);
    int v = a_d[N-idx-1]; a_d[N-idx-1] = a_d[idx]; a_d[idx] = v;
  }

- recall that in CUDA we could launch this as reverseArray<<<1, N/2>>>(a_d, N), but in OpenCL . . .

OpenCL Kernel Launch

- must explicitly create a device handle, a compute context and a work-queue, load and compile the kernel, and finally enqueue it for execution:

  clGetDeviceIDs(..., CL_DEVICE_TYPE_GPU, 1, &device, ...);
  context = clCreateContext(0, 1, &device, ...);
  queue = clCreateCommandQueue(context, device, ...);

  program = clCreateProgramWithSource(context, "reverseArray.cl", ...);
  clBuildProgram(program, 1, &device, ...);

  reverseArray_k = clCreateKernel(program, "reverseArray", ...);
  clSetKernelArg(reverseArray_k, 0, sizeof(cl_mem), &a_d);
  clSetKernelArg(reverseArray_k, 1, sizeof(int), &N);
  cnDimension = N/2; cnBlockSize = N/2;     // global work size and work-group size: one group of N/2 work-items
  clEnqueueNDRangeKernel(queue, reverseArray_k, 1, 0, &cnDimension, &cnBlockSize, 0, 0, 0);

- note: CUDA device code is compiled into .cubin intermediate files, and the generated CUDA host code follows a similar sequence of runtime calls under the hood
- for usage on a normal core (CL_DEVICE_TYPE_CPU), a WorkItem corresponds to an item in a work queue that a number of (kernel-level) threads take work from
  - the compiler may aggregate these to reduce overheads

Directive-Based Programming Models

- OpenACC enables us to specify which code is to run on a device, and how to transfer data to/from it:

  #pragma acc parallel loop copyin(A, B) copy(C)
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      double cij = C[i + j*N];
      for (int k = 0; k < K; k++)
        cij += A[i + k*N] * B[k + j*K];
      C[i + j*N] = cij;
    }

  - the data directive may be used to specify data placement across kernels
  - the code can also be compiled to run across multiple CPUs
- OpenMP 4.0 operates similarly; for the above example:

  #pragma omp target map(to:A[0:N*K], B[0:K*N]) map(tofrom:C[0:N*N])
  #pragma omp teams distribute parallel for default(shared)

- studies on complex applications, where all data must be kept on the device, indicate a productivity gain but a performance loss of ≈ 2× relative to CUDA (e.g. Zhe14)

Graphics Processing Units: Summary

- GPUs are designed to exploit computations expressible as large numbers of identical, independent threads
  - these are grouped into blocks; each block is allocated to an SM and hence can synchronize internally
- GPU cores are designed for throughput, not single-thread speed
  - low clock speed; instructions take several clock cycles
- SIMT execution hides long latencies; large amounts of hardware maintain many thread contexts
- destructive sharing appears as resource contention
- performance may be lost through poor utilization, but not through load imbalance
- the L2 cache and memory bandwidth are important considerations, but the main consideration for access patterns is within a warp
