CSE 591: GPU Programming
Basics on Architecture and Programming
Klaus Mueller
Computer Science Department, Stony Brook University

Recommended Literature
• text book, reference book
• programming guides available from nvidia.com
• more general books on parallel programming
Course Topic Tag Cloud
Architecture, Limits of parallel programming, Host, Performance tuning, Kernels, Debugging, Thread management, OpenCL, Algorithms, Memory, CUDA, Device control, Example applications, Parallel programming, Speedup Curves
but wait, there is more to this...
Amdahl's Law
Governs theoretical speedup:

    S = 1 / ((1 − P) + P / S_parallel)    or, with N processors:    S = 1 / ((1 − P) + P / N)

• P: parallelizable portion of the program
• S: speedup
• N: number of parallel processors
P determines the theoretically achievable speedup
• example (assuming infinite N): P = 90% → S = 10; P = 99% → S = 100

How many processors to use
• when P is small → a small number of processors will do
• when P is large (embarrassingly parallel) → high N is useful
Focus Efforts on Most Beneficial
Optimize the program portion with the most 'bang for the buck'
• look at each program component
• don't be ambitious in the wrong place
Example: a program with 2 independent parts A and B (execution times shown on the slide; A dominates)
• B sped up 5×: small overall gain
• A sped up 2×: large overall gain
• sometimes one gains more with less
Beyond Theory....
Limits arise from the mismatch of parallel program and parallel platform
• man-made 'laws', subject to change with new architectures
Memory access patterns
• data access locality and strides vs. memory banks
Memory access efficiency
• arithmetic intensity vs. cache sizes and hierarchies
Enabled granularity of program parallelism
• MIMD vs. SIMD
Hardware support for specific tasks → on-chip ASICs
Support for hardware access → drivers, APIs

Device Transfer Costs
Transferring the data to the device is also important
• the computational benefit of a transfer plays a large role
• transfer costs are (or can be) significant
Adding two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of additions: N²
→ operations-transfer ratio = 1/3, i.e., O(1)
Multiplying two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of multiplications and additions: N³
→ operations-transfer ratio = O(N), grows with N
Conclusions: Programming Strategy
Use the GPU to complement CPU execution
• recognize parallel program segments and parallelize only these
• leave the sequential (serial) portions on the CPU
PPP (Peach of Parallel Programming – Kirk/Hwu)
• parallel portions: the flesh of the peach (enjoy)
• sequential portions: the pit (do not bite)
Overall GPU Architecture (G80)
• streaming multiprocessor (SM): a block of stream processors (SP)
• memory bandwidth: 86.4 GB/s (GPU); 4 GB/s BW (GPU ↔ CPU, PCI Express)
• Host → Input Assembler → Thread Execution Manager; per-SM parallel data caches and texture units; load/store units to global memory
• 768 MB off-chip (GDDR) DRAM (on-board)

GPU Architecture Specifics
Additional hardware
• each SP has a multiply-add (MAD) unit and one extra multiply unit
• special floating-point function units (SQRT, TRIG, ...)
Massive multi-thread support
• CPUs typically run 2 or 4 threads/core
• G80 can run up to 768 threads/SM → 12,000 threads/chip
• G200 can run 1024 threads/SM → 30,000 threads/chip
G80 (2008)
• GeForce 8-series (8800 GTX, etc.)
• 128 SP (16 SM × 8 SP)
• 500 Gflops (768 MB DRAM)
G200 (2009)
• GeForce GTX 280, etc.
• 240 SP, 1 Tflops (1 GB DRAM)
NVIDIA Quadro: professional version of the consumer GeForce series

NVIDIA Fermi
New GeForce 400 series
• GTX 480, etc.
• up to 512 SP (16 SMs × 32 CUDA cores), but typically < 500 (the GTX 480 has 496 SP)
• 1.3 Tflops (1.5 GB DRAM)
On chip: 16 SMs; 32 CUDA cores/SM → 512/chip; 6 memory interfaces (64 bit each, 384-bit BW)
Important features:
• C++, support for C, Fortran, Java, Python, OpenCL, DirectCompute
• ECC (Error Correcting Code) memory (Tesla only)
• 512 CUDA Cores™ with the new IEEE 754-2008 floating-point standard
• 8× peak double-precision arithmetic performance over last-generation GPUs
• NVIDIA Parallel DataCache™ cache hierarchy
• NVIDIA GigaThread™ for concurrent kernel execution
NVIDIA Fermi SM (Streaming Multiprocessor)
• full cross-bar interface
• two 16-wide cores
• 4 special function units (math, etc.)

Mapping the Architecture to Parallel Programs
Parallelism is exposed as threads
• all threads run the same code: SPMD (Single Program, Multiple Data)
• a thread runs on one SP
The threads divide into blocks
• threads execute together in a block
• each block has a unique ID within the grid → block ID
• each thread has a unique ID within its block → thread ID
• block ID and thread ID can be combined to compute a global ID
The blocks aggregate into grid cells
• each such cell is served by an SM
Thread communication
• threads within a block cooperate via shared memory, atomic operations, barrier synchronization
• threads in different blocks cannot cooperate

[Figure: Thread Blocks 0 ... N−1, each running threads 0–7 over the same body:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
Mapping the Architecture to Parallel Programs
Threads within a block are organized into SPMD warps
• a warp executes the same instruction simultaneously with different data
A warp is 32 threads
→ a 16-core SM takes 2 clock cycles to compute a warp
One SM can maintain 48 warps simultaneously
• keep one warp active while 47 wait for memory → latency hiding
• 32 threads × 48 warps × 16 SMs → 24,576 threads!
Mapping
• depends on the device hardware
Thread management
• very lightweight thread creation and scheduling
• in contrast, CPU thread management is very heavy
Qualifiers
Function qualifiers
• specify whether a function executes on the host or on the device
• __global__ defines a kernel function (must return void)
• __device__ and __host__ can be used together

  Function     Executed on   Callable from
  __device__   GPU           GPU
  __global__   GPU           CPU
  __host__     CPU           CPU

Variable qualifiers
• specify the memory location on the device of a variable
• __shared__ and __constant__ are optionally used together with __device__

  Variable      Memory     Scope   Lifetime
  __shared__    Shared     Block   Block
  __device__    Global     Grid    Application
  __constant__  Constant   Grid    Application

(CPU = host, GPU = device)
For functions executed on the device:
• no recursion
• no static variable declarations inside the function
• no variable number of arguments
Anatomy of a Kernel Function Call
Define the function as a device kernel to be called from the host:
  __global__ void KernelFunc(...);
Configure thread layout and memory:
  dim3 DimGrid(100, 50);        // 5000 thread blocks
  dim3 DimBlock(4, 8, 8);       // 256 threads per (3D) block
  size_t SharedMemBytes = 64;   // 64 bytes of shared memory
Launch the kernel (<<< >>> are CUDA runtime directives):
  KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

Program Constructs
Synchronization
• any call to a kernel function is asynchronous from CUDA 1.0 on
• explicit synchronization is needed for blocking
Memory allocation on the device
• use cudaMalloc(&mem, size)
• the resulting pointer may be manipulated on the host, but the allocated memory cannot be accessed from the host
Data transfer from and to the device
• use cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice) for host → device
• use cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost) for device → host
Number of CUDA devices
• use cudaGetDeviceCount(&count);
Get device properties for cnt devices
• use cudaGetDeviceProperties(&deviceProp, dev) for each device dev = 0 ... cnt−1

Example: Vector Add (CPU)
  void vectorAdd(float *A, float *B, float *C, int N) {
      for (int i = 0; i < N; i++)
          C[i] = A[i] + B[i];
  }

  int main() {
      int N = 4096;
      // allocate and initialize memory
      float *A = (float *) malloc(sizeof(float) * N);
      float *B = (float *) malloc(sizeof(float) * N);
      float *C = (float *) malloc(sizeof(float) * N);
      init(A); init(B);
      vectorAdd(A, B, C, N);       // run kernel
      free(A); free(B); free(C);   // free memory
  }
from: Dana Schaa and Byunghyun Jang, NEU

Example: Vector Add (GPU)
  __global__ void gpuVecAdd(float *A, float *B, float *C) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      C[tid] = A[tid] + B[tid];
  }

  int main() {
      int N = 4096;
      // allocate and initialize memory on the CPU
      float *A = (float *) malloc(sizeof(float) * N);
      float *B = (float *) malloc(sizeof(float) * N);
      float *C = (float *) malloc(sizeof(float) * N);
      init(A); init(B);
      // allocate and initialize memory on the GPU
      float *d_A, *d_B, *d_C;
      cudaMalloc(&d_A, sizeof(float) * N);
      cudaMalloc(&d_B, sizeof(float) * N);
      cudaMalloc(&d_C, sizeof(float) * N);
      cudaMemcpy(d_A, A, sizeof(float) * N, cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, sizeof(float) * N, cudaMemcpyHostToDevice);
      // configure threads
      dim3 dimBlock(32, 1);
      dim3 dimGrid(N / 32, 1);
      // run kernel on GPU (grid first, then block)
      gpuVecAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
      // copy result back to CPU
      cudaMemcpy(C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);
      // free memory on CPU and GPU
      cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
      free(A); free(B); free(C);
  }
• per block: threads (0,0), (1,0), ..., (31,0), with blockDim.x = 32
• per thread: tid = blockIdx.x * blockDim.x + threadIdx.x

Large Sums
Add up a large set of numbers
• normalization factor: S = v1 + v2 + ... + vn
• mean square error: MSE = ((a1 − b1)² + ... + (an − bn)²) / n
• L2 norm: ||x|| = sqrt(x1² + x2² + ... + xn²)
Common operator: the sum v1 + v2 + v3 + ... + vn

Non-Parallel Approach
• generate only one thread
• code in C++ running on the CPU, O(n) additions:
    float sum = 0;
    for (int i = 0; i < n; i++) sum += v[i];
Two tasks:
• read the numbers to memory
• do the computation (addition) and write the result

Parallel Approach: Kernel 1
• generate 16 threads; threads in the same step execute in parallel
• reduction approach → O(log n) complexity

Kernel Optimization
• the first CUDA kernel (shown on the slides) contains a very inefficient statement: the % operator is very slow
• performance measured for 4M numbers (chart on slide)
• final optimized kernel: parallel reduction

Hardware Requirements
NVIDIA CUDA-able devices:
• desktop machines: GeForce 8-series and up
• mobile: GeForce 8m-series and up
• for more information see http://en.wikipedia.org/wiki/CUDA
May use the CUDA emulator for older devices
• slower, but better debugging support