CSE 591: GPU Programming
Basics on Architecture and Programming

Klaus Mueller
Computer Science Department, Stony Brook University

Recommended Literature
• text book
• reference book
• programming guides available from .com
• more general books on parallel programming

Course Topic Tag Cloud

Architecture • Limits of parallel programming • Host • Performance tuning • Kernels • Debugging • Thread management • OpenCL • Algorithms • Memory • CUDA • Device control • Example applications • Parallel programming • Speedup Curves

but wait, there is more to this…..

Amdahl’s Law

Governs theoretical speedup:

$$S = \frac{1}{(1-P) + \frac{P}{S_{\text{parallel}}}} = \frac{1}{(1-P) + \frac{P}{N}}$$

P: parallelizable portion of the program
S: speedup
N: number of parallel processors

P determines the theoretically achievable speedup
• example (assuming infinite N): P = 90% → S = 10; P = 99% → S = 100

How many processors to use
• when P is small → a small number of processors will do
• when P is large (embarrassingly parallel) → a high N is useful
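Worked out, the limiting case behind these example numbers: as N → ∞ the P/N term vanishes, so the serial fraction caps the speedup:

$$\lim_{N \to \infty} \frac{1}{(1-P) + \frac{P}{N}} = \frac{1}{1-P}, \qquad P = 0.90 \Rightarrow S = \frac{1}{0.10} = 10, \quad P = 0.99 \Rightarrow S = \frac{1}{0.01} = 100$$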

Focus Efforts on Most Beneficial

Optimize the program portion with the most ‘bang for the buck’
• look at each program component
• don’t be ambitious in the wrong place

Example: program with 2 independent parts A, B (execution times shown as bars in the original figure)
• original program: A followed by B
• variant 1: B sped up 5×
• variant 2: A sped up 2×
• sometimes one gains more with less
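To make this concrete, a small worked example with assumed execution times (say A takes 8 time units and B takes 2; the actual bar lengths in the slide’s figure may differ):

$$\text{B sped up } 5\times:\; 8 + \tfrac{2}{5} = 8.4 \;\;(\approx 1.19\times \text{ overall}), \qquad \text{A sped up } 2\times:\; \tfrac{8}{2} + 2 = 6 \;\;(\approx 1.67\times \text{ overall})$$

Under these assumptions the modest 2× speedup of the larger part beats the impressive 5× speedup of the smaller part.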

Beyond Theory....

Limits from mismatch of parallel program and parallel platform
• man-made ‘laws’ subject to change with new architectures

Memory access patterns
• data access locality and strides vs. memory banks

Memory access efficiency
• arithmetic intensity vs. cache sizes and hierarchies

Enabled granularity of program parallelism
• MIMD vs. SIMD

Hardware support for specific tasks → on-chip ASICs

Support for hardware access → drivers, APIs

Device Transfer Costs

Transferring the data to the device is also important
• the computational benefit of a transfer plays a large role
• transfer costs are (or can be) significant

Adding two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of additions: N²
→ operations-transfer ratio = 1/3, i.e. O(1)

Multiplying two (N×N) matrices:
• transfer to and from the device: 3 N² elements
• number of multiplications and additions: N³
→ operations-transfer ratio = O(N), grows with N
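A quick numeric illustration (N = 1000 is an arbitrary choice, not from the slide):

$$\text{add: } \frac{N^2}{3N^2} = \frac{10^6}{3\cdot 10^6} \approx 0.33 \text{ ops/element}, \qquad \text{multiply: } \frac{N^3}{3N^2} = \frac{N}{3} \approx 333 \text{ ops/element}$$

Matrix addition does a fraction of an operation per transferred element, while matrix multiplication amortizes each transferred element over hundreds of operations, so it is far more worth offloading.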

Conclusions: Programming Strategy

Use the GPU to complement CPU execution
• recognize parallel program segments and parallelize only these
• leave the sequential (serial) portions on the CPU

PPP (Peach of Parallel Programming – Kirk/Hwu)
• parallel portions (enjoy)
• sequential portions (do not bite)

Course Topic Tag Cloud

Overall GPU Architecture (G80)

Memory bandwidth: 86.4 GB/s (GPU); 4 GB/s (GPU ↔ CPU, PCI Express)

[block diagram: host → input assembler → thread execution manager; streaming multiprocessors (SM) built from streaming processors (SP), each SM with parallel data cache and texture unit; load/store units; global memory; 768 MB off-chip (GDDR) DRAM on-board]

GPU Architecture Specifics

Additional hardware
• each SP has a multiply-add (MAD) unit and one extra multiply unit
• special floating-point function units (SQRT, TRIG, ...)

Massive multi-thread support
• CPUs typically run 2 or 4 threads/core
• G80 can run up to 768 threads/SM → 12,000 threads/chip
• G200 can run 1024 threads/SM → 30,000 threads/chip

G80 (2008)
• GeForce 8-series (8800 GTX, etc.)
• 128 SP (16 SM × 8 SP)
• 500 Gflops (768 MB DRAM)

G200 (2009)
• GeForce GTX 280, etc.
• 240 SP, 1 Tflops (1 GB DRAM)
• professional version of the consumer GeForce series also available

NVIDIA Fermi

New GeForce 400 series
• GTX 480, etc.
• up to 512 SP (16 × 32), but typically < 500 (the GTX 480 has 496 SP)
• 1.3 Tflops (1.5 GB DRAM)

On chip: 16 SMs; 16 CUDA cores per SM, 512 per chip; 6 memory interfaces (64 bit each, 384-bit total)

Important features:
• C++, support for C, Fortran, Java, Python, OpenCL, DirectCompute
• ECC (Error Correcting Code) memory (Tesla only)
• 512 CUDA Cores™ with the new IEEE 754-2008 floating-point standard
• 8× peak double-precision arithmetic performance over last-gen GPUs
• NVIDIA Parallel DataCache™ cache hierarchy
• NVIDIA GigaThread™ for concurrent kernel execution

NVIDIA Fermi

[Fermi SM diagram: full cross-bar interface; two 16-wide cores; 4 special function units (math, etc.)]

Course Topic Tag Cloud

Mapping the Architecture to Parallel Programs

Parallelism is exposed as threads
• all threads run the same code
• a thread runs on one SP
• SPMD (Single Program, Multiple Data)

The threads divide into blocks
• threads execute together in a block
• each block has a unique ID within a grid → block ID
• each thread has a unique ID within a block → thread ID
• block ID and thread ID can be used to compute a global ID

The blocks aggregate into grid cells
• each such cell is an SM

Thread communication
• threads within a block cooperate via shared memory, atomic operations, barrier synchronization
• threads in different blocks cannot cooperate

[figure: Thread Block 0 … Thread Block N−1, threadID 0..7 in each block; every thread executes
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
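The global ID mentioned above combines the built-in block and thread IDs. A minimal sketch (the kernel name perElement and the squaring operation stand in for func() from the figure and are not from the slides):

    // each thread computes one global index and processes exactly one element
    __global__ void perElement(const float *input, float *output)
    {
        int globalID = blockIdx.x * blockDim.x + threadIdx.x;   // block ID * block size + thread ID
        float x = input[globalID];
        output[globalID] = x * x;                               // stand-in for func(x)
    }

    // launched, e.g., with N/256 blocks of 256 threads:  perElement<<< N/256, 256 >>>(d_in, d_out);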

Mapping the Architecture to Parallel Programs

Threads within a block are organized into warps
• a warp executes the same instruction simultaneously, on different data
• a warp is 32 threads → a 16-core block takes 2 clock cycles to compute a warp

One SM can maintain 48 warps simultaneously
• keep one warp active while 47 wait for memory → latency hiding
• 32 threads × 48 warps × 16 SMs → 24,576 threads!

SPMD mapping
• depends on the device hardware

Thread management
• very lightweight thread creation and scheduling
• in contrast, thread management on the CPU is very heavy

Qualifiers

Function qualifiers
• specify whether a function executes on the host or on the device
• __global__ defines a kernel function (must return void)
• __device__ and __host__ can be used together

Variable qualifiers
• specify the memory location of a variable on the device
• __shared__ and __constant__ are optionally used together with __device__

Function       Executed on   Callable from
__device__     GPU           GPU
__global__     GPU           CPU
__host__       CPU           CPU

Variable       Memory        Scope     Lifetime
__shared__     Shared        Block     Block
__device__     Global        Grid      Application
__constant__   Constant      Grid      Application

(CPU = host, GPU = device)

For functions executed on the device:
• no recursion
• no static variable declarations inside the function
• no variable number of arguments
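A small sketch of how these qualifiers appear together in source code (the names factor, scaleOne, scaleAll, and tile are invented for this illustration):

    __constant__ float factor;                 // constant memory: grid scope, application lifetime

    __device__ float scaleOne(float x)         // runs on the GPU, callable only from GPU code
    {
        return factor * x;
    }

    __global__ void scaleAll(float *data)      // kernel: runs on the GPU, launched from the CPU
    {
        __shared__ float tile[256];            // shared memory, visible to all threads of this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes blockDim.x == 256
        tile[threadIdx.x] = data[i];
        __syncthreads();                       // barrier: wait until the whole block has loaded
        data[i] = scaleOne(tile[threadIdx.x]);
    }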

Anatomy of a Kernel Function Call

Define the function as a device kernel to be called from the host:

    __global__ void KernelFunc(...);

Configure the thread layout and memory:

    dim3 DimGrid(100, 50);          // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);         // 256 threads per (3D) block
    size_t SharedMemBytes = 64;     // 64 bytes of shared memory

Launch the kernel (<<< and >>> are CUDA runtime directives):

    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Program Constructs

Synchronization
• any call to a kernel function is asynchronous from CUDA 1.0 on
• explicit synchronization is needed for blocking

Memory allocation on the device
• use cudaMalloc(&devPtr, size)
• the resulting pointer may be manipulated on the host, but the allocated memory cannot be accessed from the host

Data transfer to and from the device
• use cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice) for host → device
• use cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost) for device → host

Number of CUDA devices
• use cudaGetDeviceCount(&count);

Get device properties for the cnt devices
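A minimal host-side sketch of the device-properties query mentioned above (illustrative only; the slide’s own code is not reproduced in this text):

    int cnt;
    cudaGetDeviceCount(&cnt);                    // how many CUDA devices are present
    for (int dev = 0; dev < cnt; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fill prop with the device's capabilities
        printf("device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }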

Example: Vector Add (CPU)

    void vectorAdd(float *A, float *B, float *C, int N)
    {
        for (int i = 0; i < N; i++)              // one addition per element, O(N) work
            C[i] = A[i] + B[i];
    }

    int main()
    {
        int N = 4096;
        // allocate and initialize memory on the CPU
        float *A = (float *) malloc(sizeof(float) * N);
        float *B = (float *) malloc(sizeof(float) * N);
        float *C = (float *) malloc(sizeof(float) * N);
        init(A); init(B);

        vectorAdd(A, B, C, N);                   // run kernel
        free(A); free(B); free(C);               // free memory
    }

from: Dana Schaa and Byunghyun Jang, NEU

Example: Vector Add (GPU)

    __global__ void gpuVecAdd(float *A, float *B, float *C)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        C[tid] = A[tid] + B[tid];
    }

    int main()
    {
        int N = 4096;
        // allocate and initialize memory on the CPU
        float *A = (float *) malloc(sizeof(float) * N);
        float *B = (float *) malloc(sizeof(float) * N);
        float *C = (float *) malloc(sizeof(float) * N);
        init(A); init(B);

        // allocate and initialize memory on the GPU
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, sizeof(float) * N);
        cudaMalloc(&d_B, sizeof(float) * N);
        cudaMalloc(&d_C, sizeof(float) * N);
        cudaMemcpy(d_A, A, sizeof(float) * N, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, sizeof(float) * N, cudaMemcpyHostToDevice);

        // configure threads
        dim3 dimBlock(32, 1);
        dim3 dimGrid(N / 32, 1);

        // run kernel on GPU
        gpuVecAdd<<< dimGrid, dimBlock >>>(d_A, d_B, d_C);

        // copy result back to CPU
        cudaMemcpy(C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);

        // free memory on CPU and GPU
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(A); free(B); free(C);
    }

[figure: thread indexing for blockDim.x = 32 — threads (0,0), (1,0) … (31,0) in each block;
 tid = blockIdx.x * blockDim.x + threadIdx.x]

Course Topic Tag Cloud

Large Sums

Add up a large set of numbers

• Normalization factor:
$$S = v_1 + v_2 + \cdots + v_n$$

• Mean square error:
$$MSE = \frac{(a_1 - b_1)^2 + \cdots + (a_n - b_n)^2}{n}$$

• L2 norm:
$$\lVert \vec{x} \rVert = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

Large Sums

Common operator:
$$v_1 + v_2 + v_3 + \cdots + v_n$$

• input numbers sit in memory space
• two tasks: read the numbers into memory, then do the computation (addition) and write the result

Non-Parallel Approach

Code in C++ running on the CPU: O(n) additions, only one thread is generated

    float sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];

Parallel Approach: Kernel 1

• generate 16 threads
• reduction approach: O(log n) complexity
• threads in the same step execute in parallel

[figure: pairwise reduction tree over the input elements]

Reduction Approach: Kernel 1

CUDA code:
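The kernel source appears only as an image in the original slides; the following is a minimal sketch of the kind of first-pass reduction kernel discussed here (interleaved addressing with the % operator, which the next slide criticizes), not the slide’s exact code:

    __global__ void reduce0(float *g_in, float *g_out)
    {
        extern __shared__ float sdata[];             // size passed at launch: blockDim.x * sizeof(float)
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_in[i];                        // each thread loads one element
        __syncthreads();

        // interleaved addressing: stride doubles each step, O(log n) steps
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)                  // the '%' test is the inefficient statement
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)                                // thread 0 writes this block's partial sum
            g_out[blockIdx.x] = sdata[0];
    }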

Kernel optimization

• the modulo test is a very inefficient statement — the % operator is very slow
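One common way to remove the modulo (again a sketch under the same assumptions as above, not necessarily the slide’s code) is to compute the active index directly:

    __global__ void reduce1(float *g_in, float *g_out)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_in[i];
        __syncthreads();

        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            int index = 2 * s * tid;                 // replaces 'if (tid % (2*s) == 0)'
            if (index < blockDim.x)
                sdata[index] += sdata[index + s];
            __syncthreads();
        }

        if (tid == 0)
            g_out[blockIdx.x] = sdata[0];
    }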

Toward Final Optimized Kernel

Performance for 4M numbers: [performance table]

Hardware Requirements

NVIDIA CUDA-able devices:
• desktop machines: GeForce 8-series and up
• mobile: GeForce 8M-series and up
• for more information see http://en.wikipedia.org/wiki/CUDA

May use a CUDA emulator for older devices
• slower, but better debugging support

Final optimized kernel:
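The fully optimized kernel is likewise shown only as an image in the slides; as a stand-in, here is a sketch combining optimizations typically applied at this stage (sequential addressing plus a first add during the global load) — illustrative only, not the course’s exact final kernel:

    __global__ void reduceFinal(float *g_in, float *g_out, unsigned int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

        // first addition already happens while loading from global memory
        sdata[tid] = (i < n ? g_in[i] : 0.0f)
                   + (i + blockDim.x < n ? g_in[i + blockDim.x] : 0.0f);
        __syncthreads();

        // sequential addressing: contiguous threads stay active, no '%' operator
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            g_out[blockIdx.x] = sdata[0];
    }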

Parallel Reduction