CSE 591: GPU Programming
Basics on Architecture and Programming

Klaus Mueller
Computer Science Department, Stony Brook University

Recommended Literature
• text book
• reference book
• programming guides available from .com
• more general books on parallel programming

Course Topic Tag Cloud

Architecture • Limits of parallel programming • Host • Performance tuning • Kernels • Debugging • Thread management • OpenCL • Algorithms • Memory • CUDA • Device control • Example applications • Parallel programming • Speedup Curves

but wait, there is more to this…..

Amdahl’s Law

Governs theoretical speedup:

$$S = \frac{1}{(1-P) + \frac{P}{S_{\text{parallel}}}} = \frac{1}{(1-P) + \frac{P}{N}}$$

P: parallelizable portion of the program
S: speedup
N: number of parallel processors

P determines the theoretically achievable speedup
• example (assuming infinite N): P = 90% → S = 10; P = 99% → S = 100

How many processors to use
• when P is small → a small number of processors will do
• when P is large (embarrassingly parallel) → a high N is useful
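Worked out, the limiting case behind these example numbers: as N → ∞ the P/N term vanishes, so the serial fraction caps the speedup:

$$\lim_{N \to \infty} \frac{1}{(1-P) + \frac{P}{N}} = \frac{1}{1-P}, \qquad P = 0.90 \Rightarrow S = \frac{1}{0.10} = 10, \quad P = 0.99 \Rightarrow S = \frac{1}{0.01} = 100$$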

Focus Efforts on Most Beneficial

Optimize the program portion with the most ‘bang for the buck’
• look at each program component
• don’t be ambitious in the wrong place

Example: program with 2 independent parts A, B (execution times shown as bars in the original figure)
• original program: A followed by B
• variant 1: B sped up 5×
• variant 2: A sped up 2×
• sometimes one gains more with less
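To make this concrete, a small worked example with assumed execution times (say A takes 8 time units and B takes 2; the actual bar lengths in the slide’s figure may differ):

$$\text{B sped up } 5\times:\; 8 + \tfrac{2}{5} = 8.4 \;\;(\approx 1.19\times \text{ overall}), \qquad \text{A sped up } 2\times:\; \tfrac{8}{2} + 2 = 6 \;\;(\approx 1.67\times \text{ overall})$$

Under these assumptions the modest 2× speedup of the larger part beats the impressive 5× speedup of the smaller part.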

Beyond Theory....

Limits from mismatch of parallel program and parallel platform
• man-made ‘laws’ subject to change with new architectures

Memory access patterns
• data access locality and strides vs. memory banks

Memory access efficiency
• arithmetic intensity vs. cache sizes and hierarchies

Enabled granularity of program parallelism
• MIMD vs. SIMD

Hardware support for specific tasks → on-chip ASICs

Support for hardware access → drivers, APIs

Device Transfer Costs

Transferring the data to the device is also important
• the computational benefit of a transfer plays a large role
• transfer costs are (or can be) significant

Adding two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of additions: N²
→ operations-transfer ratio = 1/3, i.e. O(1)

Multiplying two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of multiplications and additions: N³
→ operations-transfer ratio = O(N), grows with N
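A quick numeric illustration (N = 1000 is an arbitrary choice, not from the slide):

$$\text{add: } \frac{N^2}{3N^2} = \frac{10^6}{3\cdot 10^6} \approx 0.33 \text{ ops/element}, \qquad \text{multiply: } \frac{N^3}{3N^2} = \frac{N}{3} \approx 333 \text{ ops/element}$$

Matrix addition does a fraction of an operation per transferred element, while matrix multiplication amortizes each transferred element over hundreds of operations, so it is far more worth offloading.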

Conclusions: Programming Strategy

Use the GPU to complement CPU execution
• recognize parallel program segments and parallelize only these
• leave the sequential (serial) portions on the CPU

PPP (Peach of Parallel Programming – Kirk/Hwu)
• parallel portions (enjoy)
• sequential portions (do not bite)

Course Topic Tag Cloud

Overall GPU Architecture (G80)

Memory bandwidth: 86.4 GB/s (GPU); 4 GB/s (GPU ↔ CPU, PCI Express)

[block diagram: host → input assembler → thread execution manager; streaming multiprocessors (SM) built from streaming processors (SP), each SM with parallel data cache and texture unit; load/store units; global memory; 768 MB off-chip (GDDR) DRAM on-board]

GPU Architecture Specifics

Additional hardware
• each SP has a multiply-add (MAD) unit and one extra multiply unit
• special floating-point function units (SQRT, TRIG, ...)

Massive multi-thread support
• CPUs typically run 2 or 4 threads/core
• G80 can run up to 768 threads/SM → 12,000 threads/chip
• G200 can run 1024 threads/SM → 30,000 threads/chip

G80 (2008)
• GeForce 8-series (8800 GTX, etc.)
• 128 SP (16 SM × 8 SP)
• 500 Gflops (768 MB DRAM)

G200 (2009)
• GeForce GTX 280, etc.
• 240 SP, 1 Tflops (1 GB DRAM)
• professional version of the consumer GeForce series also available

NVIDIA Fermi

New GeForce 400 series
• GTX 480, etc.
• up to 512 SP (16 × 32), but typically < 500 (the GTX 480 has 496 SP)
• 1.3 Tflops (1.5 GB DRAM)

On chip: 16 SMs; 16 CUDA cores per SM, 512 per chip; 6 memory interfaces (64 bit each, 384-bit total)

Important features:
• C++, support for C, Fortran, Java, Python, OpenCL, DirectCompute
• ECC (Error Correcting Code) memory (Tesla only)
• 512 CUDA Cores™ with the new IEEE 754-2008 floating-point standard
• 8× peak double-precision arithmetic performance over last-gen GPUs
• NVIDIA Parallel DataCache™ cache hierarchy
• NVIDIA GigaThread™ for concurrent kernel execution

NVIDIA Fermi

[Fermi SM diagram: full cross-bar interface; two 16-wide cores; 4 special function units (math, etc.)]

Course Topic Tag Cloud

Mapping the Architecture to Parallel Programs

Parallelism is exposed as threads
• all threads run the same code
• a thread runs on one SP
• SPMD (Single Program, Multiple Data)

The threads divide into blocks
• threads execute together in a block
• each block has a unique ID within a grid → block ID
• each thread has a unique ID within a block → thread ID
• block ID and thread ID can be used to compute a global ID

The blocks aggregate into grid cells
• each such cell is an SM

Thread communication
• threads within a block cooperate via shared memory, atomic operations, barrier synchronization
• threads in different blocks cannot cooperate

[figure: Thread Block 0 … Thread Block N−1, threadID 0..7 in each block; every thread executes
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
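The global ID mentioned above combines the built-in block and thread IDs. A minimal sketch (the kernel name perElement and the squaring operation stand in for func() from the figure and are not from the slides):

    // each thread computes one global index and processes exactly one element
    __global__ void perElement(const float *input, float *output)
    {
        int globalID = blockIdx.x * blockDim.x + threadIdx.x;   // block ID * block size + thread ID
        float x = input[globalID];
        output[globalID] = x * x;                               // stand-in for func(x)
    }

    // launched, e.g., with N/256 blocks of 256 threads:  perElement<<< N/256, 256 >>>(d_in, d_out);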

Mapping the Architecture to Parallel Programs

Threads within a block are organized into warps
• a warp executes the same instruction simultaneously, on different data
• a warp is 32 threads → a 16-core block takes 2 clock cycles to compute a warp

One SM can maintain 48 warps simultaneously
• keep one warp active while 47 wait for memory → latency hiding
• 32 threads × 48 warps × 16 SMs → 24,576 threads!

SPMD mapping
• depends on the device hardware

Thread management
• very lightweight thread creation and scheduling
• in contrast, thread management on the CPU is very heavy

Qualifiers

Function qualifiers
• specify whether a function executes on the host or on the device
• __global__ defines a kernel function (must return void)
• __device__ and __host__ can be used together

Variable qualifiers
• specify the memory location of a variable on the device
• __shared__ and __constant__ are optionally used together with __device__

Function       Executed on   Callable from
__device__     GPU           GPU
__global__     GPU           CPU
__host__       CPU           CPU

Variable       Memory        Scope     Lifetime
__shared__     Shared        Block     Block
__device__     Global        Grid      Application
__constant__   Constant      Grid      Application

(CPU = host, GPU = device)

For functions executed on the device:
• no recursion
• no static variable declarations inside the function
• no variable number of arguments
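A small sketch of how these qualifiers appear together in source code (the names factor, scaleOne, scaleAll, and tile are invented for this illustration):

    __constant__ float factor;                 // constant memory: grid scope, application lifetime

    __device__ float scaleOne(float x)         // runs on the GPU, callable only from GPU code
    {
        return factor * x;
    }

    __global__ void scaleAll(float *data)      // kernel: runs on the GPU, launched from the CPU
    {
        __shared__ float tile[256];            // shared memory, visible to all threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes blockDim.x == 256
        tile[threadIdx.x] = data[i];
        __syncthreads();                       // barrier: wait until the whole block has loaded
        data[i] = scaleOne(tile[threadIdx.x]);
    }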

Anatomy of a Kernel Function Call

Define the function as a device kernel to be called from the host:

    __global__ void KernelFunc(...);

Configure the thread layout and memory:

    dim3 DimGrid(100, 50);          // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);         // 256 threads per (3D) block
    size_t SharedMemBytes = 64;     // 64 bytes of shared memory

Launch the kernel (<<< and >>> are CUDA runtime directives):

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Program Constructs

Synchronization
• any call to a kernel function is asynchronous from CUDA 1.0 on
• explicit synchronization is needed for blocking

Memory allocation on the device
• use cudaMalloc(&devPtr, size)
• the resulting pointer may be manipulated on the host, but the allocated memory cannot be accessed from the host

Data transfer to and from the device
• use cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice) for host → device
• use cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost) for device → host

Number of CUDA devices
• use cudaGetDeviceCount(&count);

Get device properties for the cnt devices
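A minimal host-side sketch of the device-properties query mentioned above (illustrative only; the slide’s own code is not reproduced in this text):

    int cnt;
    cudaGetDeviceCount(&cnt);                    // how many CUDA devices are present
    for (int dev = 0; dev < cnt; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fill prop with the device's capabilities
        printf("device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }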

Example: Vector Add (CPU)

    void vectorAdd(float *A, float *B, float *C, int N)
    {
        for (int i = 0; i < N; i++)              // one addition per element, O(N) work
            C[i] = A[i] + B[i];
    }

    int main()
    {
        int N = 4096;
        // allocate and initialize memory on the CPU
        float *A = (float *) malloc(sizeof(float) * N);
        float *B = (float *) malloc(sizeof(float) * N);
        float *C = (float *) malloc(sizeof(float) * N);
        init(A); init(B);

        vectorAdd(A, B, C, N);                   // run kernel
        free(A); free(B); free(C);               // free memory
    }

from: Dana Schaa and Byunghyun Jang, NEU

Example: Vector Add (GPU)

    __global__ void gpuVecAdd(float *A, float *B, float *C)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        C[tid] = A[tid] + B[tid];
    }

    int main()
    {
        int N = 4096;
        // allocate and initialize memory on the CPU
        float *A = (float *) malloc(sizeof(float) * N);
        float *B = (float *) malloc(sizeof(float) * N);
        float *C = (float *) malloc(sizeof(float) * N);
        init(A); init(B);

        // allocate and initialize memory on the GPU
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, sizeof(float) * N);
        cudaMalloc(&d_B, sizeof(float) * N);
        cudaMalloc(&d_C, sizeof(float) * N);
        cudaMemcpy(d_A, A, sizeof(float) * N, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, sizeof(float) * N, cudaMemcpyHostToDevice);

        // configure threads
        dim3 dimBlock(32, 1);
        dim3 dimGrid(N / 32, 1);

        // run kernel on GPU
        gpuVecAdd<<< dimGrid, dimBlock >>>(d_A, d_B, d_C);

        // copy result back to CPU
        cudaMemcpy(C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);

        // free memory on CPU and GPU
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(A); free(B); free(C);
    }

[figure: thread indexing for blockDim.x = 32 — threads (0,0), (1,0) … (31,0) in each block;
 tid = blockIdx.x * blockDim.x + threadIdx.x]

Course Topic Tag Cloud

Large Sums

Add up a large set of numbers

• Normalization factor:
$$S = v_1 + v_2 + \cdots + v_n$$

• Mean square error:
$$MSE = \frac{(a_1 - b_1)^2 + \cdots + (a_n - b_n)^2}{n}$$

• L2 norm:
$$\lVert \vec{x} \rVert = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

Large Sums

Common operator:
$$v_1 + v_2 + v_3 + \cdots + v_n$$

• input numbers sit in memory space
• two tasks: read the numbers into memory, then do the computation (addition) and write the result

Non-Parallel Approach

Code in C++ running on the CPU: O(n) additions, only one thread is generated

    float sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];

Parallel Approach: Kernel 1

• generate 16 threads
• reduction approach: O(log n) complexity
• threads in the same step execute in parallel

[figure: pairwise reduction tree over the input elements]

Reduction Approach: Kernel 1

CUDA code:
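The kernel source appears only as an image in the original slides; the following is a minimal sketch of the kind of first-pass reduction kernel discussed here (interleaved addressing with the % operator, which the next slide criticizes), not the slide’s exact code:

    __global__ void reduce0(float *g_in, float *g_out)
    {
        extern __shared__ float sdata[];             // size passed at launch: blockDim.x * sizeof(float)
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_in[i];                        // each thread loads one element
        __syncthreads();

        // interleaved addressing: stride doubles each step, O(log n) steps
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)                  // the '%' test is the inefficient statement
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)                                // thread 0 writes this block's partial sum
            g_out[blockIdx.x] = sdata[0];
    }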

Kernel optimization

• the modulo test is a very inefficient statement — the % operator is very slow
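One common way to remove the modulo (again a sketch under the same assumptions as above, not necessarily the slide’s code) is to compute the active index directly:

    __global__ void reduce1(float *g_in, float *g_out)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_in[i];
        __syncthreads();

        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            int index = 2 * s * tid;                 // replaces 'if (tid % (2*s) == 0)'
            if (index < blockDim.x)
                sdata[index] += sdata[index + s];
            __syncthreads();
        }

        if (tid == 0)
            g_out[blockIdx.x] = sdata[0];
    }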

Toward Final Optimized Kernel

Performance for 4M numbers: [performance table]

Hardware Requirements

NVIDIA CUDA-able devices:
• desktop machines: GeForce 8-series and up
• mobile: GeForce 8M-series and up
• for more information see http://en.wikipedia.org/wiki/CUDA

May use a CUDA emulator for older devices
• slower, but better debugging support

Final optimized kernel:
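The fully optimized kernel is likewise shown only as an image in the slides; as a stand-in, here is a sketch combining optimizations typically applied at this stage (sequential addressing plus a first add during the global load) — illustrative only, not the course’s exact final kernel:

    __global__ void reduceFinal(float *g_in, float *g_out, unsigned int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

        // first addition already happens while loading from global memory
        sdata[tid] = (i < n ? g_in[i] : 0.0f)
                   + (i + blockDim.x < n ? g_in[i + blockDim.x] : 0.0f);
        __syncthreads();

        // sequential addressing: contiguous threads stay active, no '%' operator
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            g_out[blockIdx.x] = sdata[0];
    }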

Parallel Reduction