
What is GPGPU?

• General Purpose computation using GPU in applications other than 3D graphics
  – GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

(Slides by David Kirk)

Previous GPGPU Constraints

• Dealing with graphics API
  – Working with the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of Integer & bit ops
• Communication limited
  – Between pixels
  – Scatter a[i] = p
(Figure: legacy fragment-shader pipeline showing Input Registers, Fragment Program, Texture, Constants, Temp Registers, Output Registers, and FB Memory, with per-thread, per-Shader, and per-Context scopes.)

CUDA

• "Compute Unified Device Architecture"
• General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  – Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
  – Standalone Driver – Optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
(Product images: GeForce 8800, Tesla D870, Tesla S870.)

Extended C

• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
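Not from the original slides: a minimal, self-contained sketch that puts the extensions above together (the kernel name scale, the array size, and the doubling operation are illustrative assumptions):

#include <cstdio>
#include <cuda_runtime.h>

// Each thread stages one element in per-block shared memory, synchronizes,
// then writes its result back to global memory.
__global__ void scale(float *data) {
    __shared__ float tile[10];                         // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // built-in variables
    tile[threadIdx.x] = data[i];                       // load from global memory
    __syncthreads();                                   // barrier across the block
    data[i] = 2.0f * tile[threadIdx.x];                // write result
}

int main() {
    const int n = 1000, bytes = n * sizeof(float);
    float h[1000];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void**)&d, bytes);                     // device allocation
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device
    scale<<<100, 10>>>(d);                             // 100 blocks, 10 threads per block
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);                       // expect 2.000000
    return 0;
}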

CUDA Programming Model: A Highly Multithreaded Coprocessor

• The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few

Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate
(Figure: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (1, 1), which contains threads Thread (0, 0) through Thread (4, 2). Courtesy: NVIDIA)

Block and Thread IDs

• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
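Not from the original slides: a small sketch of how 2D block and thread IDs might map onto the pixels of a row-major W x H image (the kernel name brighten and the per-pixel operation are illustrative):

// Each thread handles one pixel; the 2D IDs give its (x, y) coordinates.
__global__ void brighten(float *img, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;     // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;     // row
    if (x < W && y < H)
        img[y * W + x] += 0.1f;                        // per-pixel work
}

// Launch: a 2D grid of 2D blocks covering the image.
// dim3 block(16, 16);
// dim3 grid((W + 15) / 16, (H + 15) / 16);
// brighten<<<grid, block>>>(d_img, W, H);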

CUDA Device Memory Space Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
(Figure: each block on the device grid has its own shared memory; each thread has registers and local memory; global, constant, and texture memory are accessible to the host. Courtesy: NVIDIA)

Global, Constant, and Texture Memories (Long Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and Constant Memories
  – Constants initialized by host
  – Contents visible to all threads
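Not from the original slides: illustrative declarations showing where each memory space appears in CUDA source (names and sizes are made up):

__constant__ float coeffs[16];     // per-grid constant memory (read-only on the device)
__device__   float table[256];     // per-grid global memory, statically allocated

__global__ void read_spaces(float *out) {      // out points into global memory
    __shared__ float buf[128];                 // per-block shared memory (assumes blockDim.x <= 128)
    float tmp = coeffs[0];                     // scalar locals live in per-thread registers
    buf[threadIdx.x] = tmp;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x] + table[threadIdx.x % 256];
}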

CUDA Highlights: Easy and Lightweight

• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance

CUDA – API

CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object
• cudaFree()
  – Frees object from device Global Memory
    • Pointer to freed object

CUDA Device Memory Allocation (cont.)

• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to the pointer d_matrix
  – "d" is often used to indicate a device data structure

int BLOCK_SIZE = 64;
float *d_matrix;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

cudaMalloc((void**)&d_matrix, size);
...
cudaFree(d_matrix);

CUDA Host-Device Data Transfer

• cudaMemcpy()
  – memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA 1.1

CUDA Host-Device Data Transfer (cont.)

• Code example:
  – Transfer a 64 * 64 single precision float array
  – h_matrix is in host memory and d_matrix is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
...
cudaMemcpy(h_matrix, d_matrix, size, cudaMemcpyDeviceToHost);
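Not on the slides, but worth noting: each runtime call returns a status that can be checked. A minimal fragment, reusing the h_matrix/d_matrix names from the example above:

cudaError_t err = cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));   // needs <cstdio>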

Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);           // 5000 thread blocks
dim3 DimBlock(4, 8, 8);          // 256 threads per block
size_t SharedMemBytes = 64;      // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.1 on; explicit synchronization is needed for blocking
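A short sketch (not from the slides) of the explicit synchronization mentioned above, reusing the convolve launch from earlier; cudaThreadSynchronize() blocks the host until all previously launched device work has finished (newer toolkits spell it cudaDeviceSynchronize()). The h_result buffer is an illustrative assumption:

convolve<<<100, 10>>>(myimage);     // returns to the host immediately
cudaThreadSynchronize();            // host blocks here until the kernel completes
cudaMemcpy(h_result, myimage, bytes, cudaMemcpyDeviceToHost);   // now safe to read results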

Why Use the GPU for Computing?

• The GPU has evolved into a very flexible and powerful processor:
  – It's programmable using high-level languages
  – It supports 32-bit floating point precision
  – It offers lots of GFLOPS
• GPU in every PC and workstation
(Chart: GFLOPS over time for G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.)

What is Behind such an Evolution?

• The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
  – So, more transistors can be devoted to data processing rather than data caching and flow control
(Figure: CPU vs. GPU die area – the CPU devotes large areas to control logic and cache; the GPU devotes most of its transistors to ALUs. Both sit atop DRAM.)
• The fast-growing video game industry exerts strong economic pressure that forces constant innovation

G80 Thread Computing Pipeline

• The future of GPUs is programmable processing
• So – build the architecture around the processor
• Processors execute computing threads
• Alternative operating mode specifically for computing

split personality n. Two distinct personalities in the same entity, each of which prevails at a particular time.

(Figure: G80 block diagram – Host, Input Assembler, Thread Execution Manager with Setup / Rstr / ZCull and Vtx, Geom, and Pixel Thread Issue; arrays of streaming processors (SP) with Parallel Data Caches, texture fetch (TF) and Texture L1 units; L2 caches and Load/store units; FB (frame buffer) / Global Memory partitions; Thread Processor.)

GeForce 8800 Series Technical Specs

• Maximum number of threads per block: 512
• Maximum size of each dimension of a grid: 65,535
• Number of streaming multiprocessors (SM):
  – GeForce 8800 GTX: 16 @ 675 MHz
  – GeForce 8800 GTS: 12 @ 600 MHz
• Device memory:
  – GeForce 8800 GTX: 768 MB
  – GeForce 8800 GTS: 640 MB
• Shared memory per multiprocessor: 16 KB divided into 16 banks
• Constant memory: 64 KB
• Warp size: 32 threads (16 Warps/Block)

What is the GPU Good at?

• The GPU is good at data-parallel processing
  – The same computation executed on many data elements in parallel – low control flow overhead
  – High SP floating point arithmetic intensity
    • Many calculations per memory access
    • Currently also need high floating point to integer ratio
• High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches
  – Still need to avoid bandwidth saturation!

Drawbacks of (legacy) GPGPU Model: Hardware Limitations

• Memory accesses are done as pixels
  – Only gather: can read data from other pixels
  – No scatter: can only write to one pixel
(Figure: shader ALUs behind a cache can read any DRAM location d0 … d7, but each can write only its own output pixel.)

Less programming flexibility

Drawbacks of (legacy) GPGPU Model: Hardware Limitations

• Applications can easily be limited by DRAM memory bandwidth
(Figure: many ALUs sharing small caches all contend for DRAM locations d0 … d7.)

Waste of computation power due to data starvation

CUDA Highlights: Scatter

• CUDA provides generic DRAM memory addressing
  – Gather: read data from any DRAM location
  – And scatter: no longer limited to write one pixel
(Figure: ALUs reading from and writing to arbitrary DRAM locations d0 … d7; a code sketch follows the next slide.)

More programming flexibility

CUDA Highlights: On-Chip Shared Memory

• CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing
(Figure: each group of ALUs has its own on-chip shared memory staging d0 … d3 and d4 … d7 next to the DRAM.)

Big memory bandwidth savings
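Not from the original slides: a minimal sketch of scatter in CUDA – each thread computes its own destination index and writes there, which the legacy pixel-shader model could not express (the kernel name and index array are illustrative):

// Scatter: out[idx[i]] = in[i]
__global__ void scatter(float *out, const float *in, const int *idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];     // arbitrary write location
                                 // (assumes idx[] holds distinct destinations, so threads do not race)
}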

A Common Programming Pattern

• Local and global memory reside in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to block data to take advantage of fast shared memory:
  – Partition data into data subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    • Copying results from shared memory to global memory
(A code sketch of this pattern follows the next slide.)

A Common Programming Pattern (cont.)

• Texture and Constant memory also reside in device memory (DRAM) – much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – R/O no structure → constant memory
  – R/O array structured → texture memory
  – R/W shared within Block → shared memory
  – R/W registers spill to local memory
  – R/W inputs/results → global memory
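A minimal sketch of the blocking pattern above, assuming a simple 3-point moving average over a 1-D array (the kernel name avg3, the tile size, and the halo handling are illustrative, not from the slides):

#define TILE 128     // threads per block; launch as avg3<<<(n + TILE - 1) / TILE, TILE>>>(d_out, d_in, n)

__global__ void avg3(float *out, const float *in, int n) {
    __shared__ float tile[TILE + 2];                   // block's subset plus a one-element halo on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + 1;                         // local index (slot 0 is the left halo)

    // 1. Load the subset from global memory to shared memory, one element per thread
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // 2. Compute from fast shared memory; 3. copy the result back to global memory
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}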

G80 Hardware Implementation: A Set of SIMD Multiprocessors

• The device is a set of 16 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  – Shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size
(Figure: the Device contains Multiprocessor 1 … Multiprocessor N; each holds Processor 1 … Processor M with per-processor Registers and a shared Instruction Unit.)

Hardware Implementation: Memory Architecture

• The local, global, constant, and texture spaces are regions of device memory
• Each multiprocessor has:
  – A set of 32-bit registers per processor
  – On-chip shared memory
    • Where the shared memory space resides
  – A read-only constant cache
    • To speed up access to the constant memory space
  – A read-only texture cache
    • To speed up access to the texture memory space
(Figure: each multiprocessor contains Shared Memory, Registers, a Constant Cache, and a Texture Cache; below them sits device memory holding the global, constant, and texture memories.)

Hardware Implementation: Execution Model (review)

• Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM)
  – The device processes only one grid at a time
• Each thread block is executed by one multiprocessor
  – So that the shared memory space resides in the on-chip shared memory
• A multiprocessor can execute multiple blocks concurrently
  – Shared memory and registers are partitioned among the threads of all concurrent blocks
  – So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Threads, Warps, Blocks

• There are (up to) 32 threads in a Warp
  – Only <32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to "fill" the device
• More is better
  – If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM

Language Extensions: Built-in Variables

• dim3 gridDim;
  – Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
  – Dimensions of the block in threads
• dim3 blockIdx;
  – Block index within the grid
• dim3 threadIdx;
  – Thread index within the block
(A usage sketch follows the next slide.)

Common Runtime Component

• Provides:
  – Built-in vector types
  – A subset of the C runtime library supported in both host and device codes
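Not from the original slides: a grid-stride sketch showing the built-in variables in use – each thread derives a unique starting index and the total thread count, so one launch can cover an array larger than gridDim.x * blockDim.x elements (the kernel name saxpy is illustrative):

__global__ void saxpy(float a, const float *x, float *y, int n) {
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's first element
    int stride = gridDim.x * blockDim.x;                  // total threads in the grid
    for (int i = start; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}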

Common Runtime Component: Built-in Vector Types

• [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
  – Structures accessed with x, y, z, w fields:
      uint4 param;
      int y = param.y;
• dim3
  – Based on uint3
  – Used to specify dimensions

Common Runtime Component: Mathematical Functions

• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
  – When executed on the host, a given function uses the C runtime implementation if available
  – These functions are only supported for scalar types, not vector types
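Not from the original slides: a small sketch combining a built-in vector type with the scalar math functions listed above (the kernel and its purpose are illustrative):

// float4 packs four components, accessed as .x .y .z .w;
// sqrtf is the single-precision form of sqrt.
__global__ void lengths(const float4 *pts, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pts[i];
        out[i] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z + p.w * p.w);
    }
}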

Host Runtime Component: Memory Management

• Device memory allocation
  – cudaMalloc(), cudaFree()
• Memory copy from host to device, device to host, device to device
  – cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol()
• Memory addressing
  – cudaGetSymbolAddress()
(A small code sketch follows the next slide.)

Device Runtime Component: Mathematical Functions

• Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x))
  – __pow
  – __log, __log2, __log10
  – __exp
  – __sin, __cos, __tan
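Not from the original slides: a minimal sketch of cudaMemcpyToSymbol initializing a __constant__ array from host code (the names and values are made up):

__constant__ float c_filter[4];    // device constant-memory symbol

void setup_filter() {
    float h_filter[4] = {0.25f, 0.25f, 0.25f, 0.25f};
    // Copies four floats from host memory into the constant-memory symbol.
    cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));
}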

Device Runtime Component: Synchronization Function

• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
• Allowed in conditional constructs only if the conditional is uniform across the entire thread block
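Not from the original slides: a short sketch of __syncthreads() guarding a read-after-write hazard in shared memory (the kernel name reverse_block is illustrative; it assumes the array length is a multiple of blockDim.x and blockDim.x <= 256):

__global__ void reverse_block(float *d) {
    __shared__ float s[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];                   // every thread writes its own slot
    __syncthreads();                      // barrier: all writes to s[] are now visible
    d[base + t] = s[blockDim.x - 1 - t];  // safe to read another thread's slot
}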

Some Useful Information on Tools

Compilation

• Any source file containing CUDA language extensions must be compiled with nvcc
• nvcc is a compiler driver
  – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• nvcc can output:
  – Either C code
    • That must then be compiled with the rest of the application using another tool
  – Or object code directly

Linking

• Any executable with CUDA code requires two dynamic libraries:
  – The CUDA runtime library (cudart)
  – The CUDA core library (cuda)

Debugging Using the Device Emulation Mode

• An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  – No need of any device and CUDA driver
  – Each device thread is emulated with a host thread
• When running in device emulation mode, one can:
  – Use host native debug support (breakpoints, inspection, etc.)
  – Access any device-specific data from host code and vice-versa
  – Call any host function from device code (e.g. printf) and vice-versa
  – Detect deadlock situations caused by improper usage of __syncthreads

Device Emulation Mode Pitfalls

• Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
• Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
• Results of floating-point computations will slightly differ because of:
  – Different compiler outputs, instruction sets
  – Use of extended precision for intermediate results
    • There are various options to force strict single precision on the host
