Peter Messmer (NVIDIA), Sami Saarinen (CSC), Sami Ilvonen (CSC)

Advanced CUDA and OpenACC
Jan 21-23, 2014
PRACE Advanced Training Centre
CSC – IT Center for Science Ltd, Finland

(continuation of the MPI + CUDA vector addition example, worker branch:)

        free(hA); free(hB); free(hC);
    } else { // Worker
        domainDecomposition(rank, nprocs, N, &localN, &offset);

        // Allocate host and device arrays
        hA = (double*)malloc(sizeof(double) * localN);
        hB = (double*)malloc(sizeof(double) * localN);
        hC = (double*)malloc(sizeof(double) * localN);
        CUDA_CHECK( cudaMalloc((void**)&dA, sizeof(double) * localN) );
        CUDA_CHECK( cudaMalloc((void**)&dB, sizeof(double) * localN) );
        CUDA_CHECK( cudaMalloc((void**)&dC, sizeof(double) * localN) );

        // Receive the input data from the root and copy it to the device
        MPI_Recv(hA, localN, MPI_DOUBLE, 0, 11, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(hB, localN, MPI_DOUBLE, 0, 12, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        CUDA_CHECK( cudaMemcpy(dA, hA, sizeof(double) * localN, cudaMemcpyHostToDevice) );
        CUDA_CHECK( cudaMemcpy(dB, hB, sizeof(double) * localN, cudaMemcpyHostToDevice) );
        vector_add(dC, dA, dB, localN, CUDA);
        CUDA_CHECK( cudaMemcpy(hC, dC, sizeof(double) * localN, cudaMemcpyDeviceToHost) );

        // Copy the results back to root process
        MPI_Send(hC, localN, MPI_DOUBLE, 0, 13, MPI_COMM_WORLD);

        // Release the allocated memory
        CUDA_CHECK( cudaFree((void*)dA) );

CUDA recap

Sami Ilvonen
CSC – IT Center for Science Ltd
Espoo, Finland

Material on "CUDA recap", (c) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License,
http://creativecommons.org/licenses/by-nc-sa/3.0/

Outline
– Programming model and memory hierarchy
– Kernels
– Synchronization, streams, events
– Summary

CUDA
Compute Unified Device Architecture
CUDA C is a C/C++ language extension for GPU programming
CUDA API is the most up-to-date programming interface for NVIDIA GPUs
– Other options: OpenCL (Khronos Group), DirectCompute (Microsoft)
– Two APIs: runtime and driver

Unified Device Architecture
CUDA provides an abstraction layer between different GPUs
– Parallel Thread Execution (PTX) ISA and virtual machine
– PTX can be compiled to device binary code either at compile time or by the driver using JIT at runtime
Important concept of compute capability
– Original G80 architecture supported CC 1.0
– CC 3.5 only supported by K20/K20x and K40
  . Dynamic parallelism
– Many features only supported with the most recent GPUs

CUDA C Code Consists of
Qualifiers: __global__, __device__, __shared__, __local__, __constant__, ...
Built-in variables: threadIdx, blockIdx, ...
Intrinsics: __syncthreads, __fmul_rn, ...
Runtime API calls: memory, device, execution management
Kernel launches: kern<<<1024, 128>>>(d_data);

    __device__ float array[128];

    __global__ void kern(float *data)
    {
        __shared__ float buffer[32];
        ...
        buffer[threadIdx.x] = data[i];
        ...
        __syncthreads();
        ...
    }

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);
    kern<<<1024, 128>>>(d_data);


CUDA APIs
CUDA API includes functions for
– Memory control (allocation, copying)
– Synchronization and execution control
– Hardware control and query
– etc.

CUDA Runtime API
– User-friendlier interface for application developers
– Requires the nvcc compiler
CUDA Driver API
– Low-level interface for more detailed control
– Much more complicated and verbose
– Can be used with other C compilers
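For reference, a minimal sketch of the kind of error-checking wrapper used around runtime API calls (here called CUDA_CHECK, as in the earlier example; the exact macro is course-specific and this version is only an illustration):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Minimal error-checking macro: aborts on the first failing runtime API call.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)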

PROGRAMMING MODEL, THREADS

CUDA Programming Model
GPU accelerator is called device, CPU is host
Parallel code (kernel) is launched by the host and executed on the device by several threads
Threads are grouped into thread blocks
Program code is written from a single thread's point of view
– Each thread has a unique id
– Each thread can diverge and execute a unique code path (can cause performance issues)


Thread Hierarchy
Threads:
– 3D IDs, unique within a block
Blocks:
– 3D IDs, unique within a grid
Dimensions are set at kernel launch
Built-in variables for device code:
– threadIdx, blockIdx
– blockDim, gridDim
[Figure: a grid of thread blocks; each block is a 2D array of threads]

Hardware Implementation, SIMT Architecture
Maximum number of threads in a block depends on the compute capability (1024 on Fermi/Kepler)
GPU multiprocessor creates, manages, schedules and executes threads in warps of 32
Warp executes one common instruction at a time
– Threads are allowed to branch, but each branch is executed serially
Context switch is extremely fast; the warp scheduler selects warps that are ready to execute → can hide latencies
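For illustration, a minimal sketch of how the built-in variables combine into a global index (the kernel, its arguments and the 16x16 block shape are made up for this example):

    __global__ void scale2d(float *data, int nx, int ny, float factor)
    {
        // Global 2D index from block and thread coordinates
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < nx && y < ny)
            data[y * nx + x] *= factor;
    }

    // Launch: 16x16 threads per block, enough blocks to cover the nx x ny domain
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_data, nx, ny, 2.0f);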

Hardware Implementation (cont.)
The actual number of simultaneously executing threads depends on the device
– For example, Fermi can execute 512 threads simultaneously
For maximum throughput many more threads are needed
– Many threads will be waiting for memory requests or register values
– The thread scheduler can hide latencies efficiently only if there are enough threads waiting for execution

PROGRAMMING MODEL, MEMORY


CPU and GPU Memories
Host and device have separate memories
Host manages the GPU memory
Usually one has to
1. Copy (explicitly) data from the host to the device
2. Execute the GPU kernel
3. Copy (explicitly) the results back to the host
Data copies between host and device use the PCI bus with very limited bandwidth → minimize the transfers!

Device Memory Hierarchy
Per-thread local memory: registers are fast, off-chip local memory has high latency
Per-block shared memory: tens of kB per block, on-chip, very fast
Global memory (grid scope): size up to 6 GB, high latency
– Random access very expensive! Coalesced access much more efficient
Constant memory (64 kB)
Texture memory

Memory Hierarchy
Another view of the memory hierarchy of a CUDA device
[Figure: device grid with per-thread registers and local memory, per-block shared memory, and global, constant and texture memory; arrows show the read and write permissions]
Host can only access global, texture and constant memory
Note that the global, constant and texture memory spaces are persistent between kernel calls!

Allocating Device Memory
cudaMalloc()
– Allocates device global memory
cudaFree()
– Frees the allocated memory
Note that the host code can not dereference device memory pointers!


Device-Host Data Transfer
cudaMemcpy() transfers data
– Host to Host
– Host to Device
– Device to Host
– Device to Device
This call blocks the execution of the host code. There is also an asynchronous copy function.

Coalesced Memory Access
Global memory access has very high latency
Threads are executed in warps; memory operations are grouped in a similar fashion
– Memory access is optimized for coalesced access where threads read from / write to successive memory locations
– Exact alignment rules and performance issues depend on the compute capability
Shared memory is better suited for more complicated data access patterns
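A minimal sketch of the asynchronous copy mentioned above (it assumes a page-locked host buffer h_data and a device buffer d_data already exist; the names are illustrative):

    cudaStream_t stream;
    CUDA_CHECK( cudaStreamCreate(&stream) );

    // Returns immediately; needs page-locked host memory to overlap
    // with host work or with kernels running in other streams.
    CUDA_CHECK( cudaMemcpyAsync(d_data, h_data, bytes,
                                cudaMemcpyHostToDevice, stream) );

    // ... launch kernels into the same stream; they run after the copy ...

    CUDA_CHECK( cudaStreamSynchronize(stream) );   // wait for copy (and kernels)
    CUDA_CHECK( cudaStreamDestroy(stream) );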

Access Patterns

    __global__ void offsetDemo(float *out, float *in, int oset)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x + oset;
        out[tid] = in[tid];
    }

oset = 0: coalesced access, within a 128B aligned segment
oset = 1: unaligned (but sequential) access
[Figure: a (half) warp of threads accessing a 128B aligned segment, with and without the offset]
Note that access would be coalesced also when all threads do not copy

Coalesced Access (cont.)
Performance penalties for non-coalesced access were very high for CC 1.0/1.1 devices
CC 2.0 has relaxed the requirements considerably
– For best performance portability it is safest (and hardest) to comply with the older requirements
Refer to the NVIDIA documentation for details


Shared Memory
Non-coalesced global memory access is expensive; how to avoid it?
– Load data using coalesced operations from global to shared memory
– Access shared memory (avoid bank conflicts) and do the needed manipulations
– Save the output back to global memory using coalesced writes

Shared Memory
__shared__ qualifier declares a variable that
– Resides in the shared memory space of the thread block
– Has the lifetime of the block
– Is accessible only from threads within the block
Beware of synchronization issues, such as write-after-write, write-after-read
– Synchronize the execution of threads with __syncthreads() when needed
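A minimal sketch of this staging pattern (the kernel reverses each block-sized chunk of an array; the kernel name and block size are made up, and the array length is assumed to be a multiple of the block size):

    #define BLOCK 256

    __global__ void reverseBlock(float *out, const float *in)
    {
        __shared__ float tile[BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[gid];     // coalesced read into shared memory
        __syncthreads();                 // all loads done before any reuse

        // The reordered access hits fast shared memory,
        // while the global write stays coalesced.
        out[gid] = tile[blockDim.x - 1 - threadIdx.x];
    }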

Allocating Shared Memory
Inside device code (static size):

    __global__ void kern(float *in)
    {
        __shared__ float sdata[NX][NY];
        ...
    }

External variable, array size determined by a kernel call parameter:

    extern __shared__ float sdata[];

    __global__ void kern(float *in)
    {
        float *arr = (float *)sdata;
        ...
    }

    kern<<<128, 10, 1024, 0>>>( ... );

Example of Memory Operations

    int main(void)
    {
        float *A = (float *) malloc(N*sizeof(float));
        float *d_A;
        cudaMalloc((void**)&d_A, N*sizeof(float));
        cudaMemcpy(d_A, A, N*sizeof(float), cudaMemcpyHostToDevice);
        ...
        float A0 = d_A[0];   // WRONG: can not dereference device pointers in host code!
        ...
        cudaMemcpy(A, d_A, N*sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_A);
        free(A);
        return 0;
    }


DEVICE CODE, KERNELS

Device Code
C++ function with restrictions:
– Can only dereference pointers to device memory
– No static variables, no recursion
– No variable number of arguments
Functions must be declared with a qualifier
– __global__: kernel, called from the CPU
  . Cannot be called from the GPU (except CUDA 5+ with CC 3.5)
  . Must return void
– __device__: called from __device__ and __global__ functions
  . Cannot be called from the CPU
– __host__: can only be called by the CPU
  . Can be combined with the __device__ qualifier
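A small sketch of how the qualifiers combine (the function names are made up for the example):

    // Callable from both host and device code
    __host__ __device__ float square(float x) { return x * x; }

    // Device-only helper, callable from kernels and other __device__ functions
    __device__ float scaled(float x, float a) { return a * square(x); }

    // Kernel: launched from the host, must return void
    __global__ void apply(float *data, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = scaled(data[i], a);
    }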

Calling GPU Kernel
Special syntax:
– kname<<<grid, block>>>(args)
– kname is the name of the kernel function
– grid determines the block hierarchy
– block determines the thread hierarchy in a block
– args is the list of arguments of the kernel
grid and block can be either integers or structs (classes) of type dim3
There are two additional parameters, more about them later

Kernel Call Example
3 blocks with 4 threads in each:

    __global__ void kern(int *A)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        A[idx] = idx;
    }

    void main()
    {
        // Allocate memories, copy values
        dim3 grid, block;
        block.x = 4;
        grid.x = 12 / block.x;
        kern<<<grid, block>>>(d_A);
        // Copy results back
    }

Result: A = {0,1,2,3,4,5,6,7,8,9,10,11}


THREAD BRANCHING AND SYNCHRONIZATION

Thread Branching
Thread execution can branch according to e.g. the thread index
– All threads in the warp execute the same command
– All threads do not have to participate
Execution of different code paths is serialized
Performance issues
– Depend on the code paths and type of divergence; the general suggestion is not to branch if possible

Thread Branching
Example of divergence between thread 0 and threads 1-31 of a warp:

    int tid = threadIdx.x;
    if (tid == 0) { ++var1; }
    else          { var1 = var1 + 2; }
    var2 = 3 * var1;

Important issue when combined with memory access!

Thread Synchronization
Kernel-level synchronization: blocks must be independent
– Can run in any order, concurrently or sequentially
Some level of coordination can be achieved using atomic intrinsics → can cause performance issues
Threads in a block can synchronize using the __syncthreads() intrinsic
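A minimal sketch of the atomic coordination mentioned above (a histogram kernel; the name, bin count and non-negative keys are assumptions made for the example):

    #define NBINS 64

    __global__ void histogram(const int *keys, int *bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[keys[i] % NBINS], 1);  // safe concurrent update across all threads
    }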


PAGE-LOCKED MEMORY, STREAMS, EVENTS

Virtual Memory System
Modern operating systems utilize virtual memory
– Memory is organized into memory pages
– Memory pages can reside in the swap area on the disk
[Figure: per-process virtual memory mapped to physical memory and disk]

Streams Example

    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&streams[i]);

    float* h_Data;
    cudaMallocHost(&h_Data, 2*dsize);

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(in_d_Data + i*dsize, h_Data + i*dsize, dsize,
                        cudaMemcpyHostToDevice, streams[i]);
        MyKernel<<<100, 1024, 0, streams[i]>>>(out_d_Data + i*dsize,
                                               in_d_Data + i*dsize, dsize);
        cudaMemcpyAsync(h_Data + i*dsize, out_d_Data + i*dsize, dsize,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(streams[i]);

CUDA Events
Runtime API provides events to monitor the device's progress and perform accurate timing
Events are recorded asynchronously at any point of the program execution
– An event has completed when all tasks or commands in a given stream have completed
– Events in stream 0 are completed after all preceding tasks and commands in all streams are completed


CPU/GPU Synchronization
One can synchronize the host thread and the GPU with
– cudaDeviceSynchronize()
  . Blocks until all previous CUDA calls in all streams of all host threads have completed
– cudaStreamSynchronize(stream)
  . Blocks until all CUDA calls associated with the stream are completed
– cudaEventSynchronize(event)
  . Blocks until the event is recorded
All CUDA calls to stream 0 block until the previous call is completed

Timing Using CUDA Events

    float time_diff_ms = 0.0;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    kernel_call<<<...>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&time_diff_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

Links to NVIDIA Material
NVIDIA CUDA main page
http://developer.nvidia.com/category/zone/cuda-zone
CUDA documentation main page
http://developer.nvidia.com/nvidia-gpu-computing-documentation

CSC cluster introduction

Sami Ilvonen
CSC – IT Center for Science Ltd
Espoo, Finland

Material on "CSC cluster introduction", (C) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License,
http://creativecommons.org/licenses/by-nc-sa/3.0/

Scalable hybrid prototype
Part of the PRACE Technology Evaluation
Objectives
– Enabling key applications on new architectures
– Familiarizing users and providing a research platform
– Whole-system benchmarking: energy efficiency, productivity and performance
Located at CSC – IT Center for Science Ltd
– Espoo, Finland
Documentation of the system
https://confluence.csc.fi/display/HPCproto/HPC+Prototypes

Current configuration
master: head node (frontend)
– Users login here
– No GPUs available
node[02-10]: compute nodes
– Accessible via the batch job queue system
– node[02-05]: Xeon Phi
– node[06-10]: Nvidia Keplers
  . node[06-09]: K20
  . node10: K20x

Diagram of a Kepler Node
[Figure: host with two CPUs (CPU0/CPU1, 6 cores each, 15 MB L2 cache, QPI 96 GB/s), 2 x 16 GB DDR3 host RAM at 51.2 GB/s per bank, PCIe2 at 8 GB/s to the NVIDIA Kepler K20(x)m with 5 (6) GB GDDR5 at 208 (250) GB/s, SMX1 … SMX13(14), 1280 kB (1536 kB) L2 cache, plus a PCIe2-attached FDR InfiniBand HCA at 7 GB/s]

Upcoming system
Larger accelerated system from Bull
– 44 Xeon Phi 7120X nodes
– Similar amount of Nvidia K40 nodes in 1Q14
Extreme energy efficiency
– Latest and greatest versions of Phi and Atlas
– Direct Liquid Cooling
– Located in CSC's Kajaani datacenter


First login and modules
ssh to hybrid.csc.fi with your training account
$ ssh -Y hybrid.csc.fi -l trngNN
Familiarize yourself with environment modules
– To load CUDA 5.5, use:
$ module load cuda/5.5
– To see all available modules:
$ module avail
– To see what modules are currently loaded:
$ module list

Modules continued
Unloading and switching loaded modules
– To unload a module
$ module unload
– To remove all modules
$ module purge
– To switch the version of a module
$ module swap cuda/5.5 cuda/5.0

Custom configuration on Hybrid
NFS mounts
– /home, /share, /usr/local
Additional native support libraries and programs
– Python, HDF5, gcc etc.
– Small libraries and utilities (strace etc.)
SLURM batch job queuing system

SLURM batch job queue system
Reserves and allocates nodes to jobs
At CSC we are moving to use SLURM on all systems
– Designed for HPC from the ground up
– Open source, extendable, lightweight
– Becoming increasingly popular in the HPC community


SLURM commands
Checking the status of queues
$ squeue
Checking node status
$ sinfo [-r]
Running a job interactively
$ srun [command]
Sending a batch job
$ sbatch [job script]
For simplicity all of the following examples use interactive execution (srun). However, for "real" work you should run batch jobs.

SLURM options
Here is a short list of common SLURM options:
-p (--partition)      Partition name (queue)
-n (--ntasks)         Number of MPI tasks
--ntasks-per-node     Number of MPI tasks per node
-c (--cpus-per-task)  Number of CPUs per MPI task

Partitions
Default partition does not have GPUs
GPU type specific partitions k20 and k20x, and one overlapping partition for all GPUs

    [master:~ ] sinfo
    PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
    michost*  up    1-00:00:00     4 idle  node[02-05]
    gpu       up    1-00:00:00     5 idle  node[06-10]
    k20       up    1-00:00:00     4 idle  node[06-09]
    k20x      up    1-00:00:00     1 idle  node10
    all       up    1-00:00:00     9 idle  node[02-10]

Running a job (one liner)
Running a program using srun:

    [master:~ ] srun -n 1 -p k20 hostname
    node06

More complicated example:

    [master:~ ] srun -n 2 -p k20 --ntasks-per-node=1 hostname
    node07
    node06


Running a GPU Job
Requires the GRES parameter to be used
– If you don't use it, you won't get access to the GPU
$ srun --gres=gpu:1 ./hello_cuda
MPI GPU job
$ srun -n 2 --ntasks-per-node=1 --gres=gpu:1 ./mpihello_cuda

Batch job
Generate a batch job file (gputest.sh):

    #!/bin/bash
    #SBATCH -p gpu
    #SBATCH -n 2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1
    srun ./gputest 100000

Submit the job using
$ sbatch gputest.sh

Overview: CUDA 5.5 and Kepler
Peter Messmer
© NVIDIA Corporation 2014

GPGPU Revolutionizes Computing
Latency Processor + Throughput Processor
[Figure: CPU (latency processor) + GPU (throughput processor)]


Low Latency or High Throughput?
CPU
. Optimized for low-latency access to cached data sets
. Control logic for out-of-order and speculative execution
GPU
. Optimized for data-parallel, throughput computation
. Architecture tolerant of memory latency
. More transistors dedicated to computation

Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[Figure: CPU core (low-latency processor) processes threads T1..Tn one after another; GPU streaming multiprocessor (high-throughput processor) context-switches between warps W1..W4 — computation, waiting for data, ready to be processed]

GPU Architecture: Two Main Components
Global memory
– Analogous to RAM in a CPU server
– Accessible by both GPU and CPU
– Currently up to 12 GB per GPU
– Bandwidth currently up to ~288 GB/s (Tesla products)
– ECC on/off (Quadro and Tesla products)
Streaming Multiprocessors (SMs)
– Perform the actual computations
– Each SM has its own: control units, registers, execution pipelines, caches
[Figure: GPU die with host interface, GigaThread engine, L2 cache and DRAM interfaces]


The Kepler GK110 GPU
Performance
Efficiency
Programmability

New High-Performance SMX Instructions
SHFL (shuffle) -- intra-warp data exchange
ATOM -- broader functionality, faster
Compiler-generated, high-performance instructions:
– bit shift
– bit rotate
– fp32 division
– read-only cache

New Instruction: SHFL
Data exchange between threads within a warp
Avoids use of shared memory
One 32-bit value per exchange
4 variants:
– __shfl(): indexed any-to-any
– __shfl_up(): shift right to n-th neighbour
– __shfl_down(): shift left to n-th neighbour
– __shfl_xor(): butterfly (XOR) exchange

SHFL Example: Warp Prefix-Sum

    __global__ void shfl_prefix_sum(int *data)
    {
        int id = threadIdx.x;
        int value = data[id];
        int lane_id = threadIdx.x & (warpSize - 1);

        // Now accumulate in log2(32) steps (width is the warp width, 32)
        for (int i = 1; i <= width; i *= 2) {
            int n = __shfl_up(value, i);
            if (lane_id >= i)
                value += n;
        }

        // Write out our result
        data[id] = value;
    }

Step by step on one warp: starting from {3 8 2 6 3 9 1 4}, after __shfl_up(value, 1) and add: {3 11 10 8 9 12 10 5}; after shift by 2: {3 11 13 19 19 20 19 17}; after shift by 4: {3 11 13 19 21 31 32 36}.


ATOM instruction enhancements
Added int64 functions to match existing int32 (add, cas, exch, min/max, and/or/xor)
2 – 10x performance gains
– Shorter processing
– More atomic processors
– Slowest 10x faster
– Fastest 2x faster

High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: data reduction (sum of all values)
Without atomics:
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Output is N values
4. Second launch of N threads reduces the outputs to a single value

High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: data reduction (sum of all values)
With atomics:
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Write output directly via atomic. No need for a second kernel launch.

Improved Texture Performance
Texture:
– Provides hardware-accelerated filtered sampling of data (1D, 2D, 3D)
– Read-only data cache holds fetched samples
– Backed up by the L2 cache
SMX vs Fermi SM: 4x filter ops per clock, 4x cache capacity
[Figure: SMX with texture units and read-only data cache backed by L2]
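A minimal sketch of the atomics-based reduction pattern described above (block-level tree reduction in shared memory plus one atomicAdd per block; it assumes a 256-thread block, float atomics (CC 2.0+), and a result initialized to zero — names are illustrative):

    __global__ void reduceSum(const float *in, float *result, int n)
    {
        __shared__ float partial[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction within the block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }

        // One atomic per block combines the partial sums; no second kernel launch needed
        if (threadIdx.x == 0)
            atomicAdd(result, partial[0]);
    }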


Texture Cache Unlocked
Added a new path for compute
– Avoids the texture unit
– Allows a global address to be fetched and cached
– Eliminates texture setup
Why use it?
– Separate pipeline from shared/L1
– Highest miss bandwidth
– Flexible, e.g. unaligned accesses
Managed automatically by the compiler; "const __restrict__" indicates eligibility
[Figure: SMX read-only data cache backed by L2]

const __restrict__ Example
Annotate eligible kernel parameters with const __restrict__
The compiler will automatically map loads to the read-only data cache path

    __global__ void saxpy(float x, float y,
                          const float * __restrict__ input,
                          float *output)
    {
        size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

        // Compiler will automatically use the read-only data cache for "input"
        output[offset] = (input[offset] * x) + y;
    }

An even easier way: __ldg()
Issues a load through the texture unit
For all built-in types: int, double, float4, double2, etc.
Example:

    float a = array[i];

becomes

    float a = __ldg(&array[i]);   // Note: pass an address.

Without Bindless Textures

    #define N 1024
    texture<float> tex;

    __global__ void kernel()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = tex1D(tex, i);
        // do some work using x...
    }

    void call_kernel(float *buffer)
    {
        // bind texture to buffer
        cudaBindTexture(0, tex, buffer, N*sizeof(float));
        kernel<<<grid, block>>>();
    }


With Bindless Textures

    #define N 1024

    __global__ void kernel(cudaTextureObject_t tex)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = tex1Dfetch<float>(tex, i);
        // do some work using x ...
    }

    void call_kernel(cudaTextureObject_t tex)
    {
        kernel<<<grid, block>>>(tex);
    }

    void main()
    {
        ...
        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    }

Bindless Texture
Requires Compute Capability 3.x and CUDA 5.x or later
Texture reference objects are passable
Direct access to memory via the texture unit using __ldg()
For read-only data

Optimizing for Kepler
Fermi code runs on Kepler as is
Better results – recompile code for Kepler
Best performance – tune code for Kepler


Fermi – Concurrent Kernels
Streams 1-3 issue kernels A–B–C, P–Q–R and X–Y–Z, but they multiplex into a single hardware work queue
– Depth-first issue order: Z–Y–X, R–Q–P, C–B–A
– Breadth-first issue order: Z – R – C – Y – Q – B – X – P – A
Fermi allows 16-way concurrency
But CUDA kernels multiplex into a single queue
Issue order matters for concurrency
https://developer.nvidia.com/gpu-computing-webinars
http://www.stanford.edu/group/ttsdocs/cgi-bin/techbriefingvideos/2013/01/18/cuda-programming-your-gpu

K20 Improved Concurrency
Each stream (A–B–C, P–Q–R, X–Y–Z) maps to its own hardware work queue
Kepler allows 32-way concurrency
One kernel queue per stream
No inter-stream dependencies, whether issued depth-first or breadth-first

Grid Management Unit
CUDA-generated work passes through stream queue management into the Grid Management Unit
– Pending & suspended grids: 1000s of pending grids
Work Distributor
– Fermi: 16 active grids
– Kepler GK110: 32 active grids
[Figure: Fermi (stream queues → work distributor → SMs) vs Kepler GK110 (stream queues → grid management unit → work distributor → SMXs)]


Hyper-Q Enables Efficient Scheduling
Grid management unit can select the most appropriate grid from 32 streams
Improves scheduling of concurrently executed grids
Particularly interesting for MPI applications

Strong Scaling of MPI Application
[Figure: execution time split into serial part, CPU parallel part and GPU parallelizable part; multicore CPU only, N = 1]


Strong Scaling of MPI Application
[Figure: as the rank count grows (N = 1, 2, 4, multicore CPU only), the CPU parallel part shrinks while the serial and GPU-parallelizable parts remain]


Strong Scaling of MPI Application
[Figure: at N = 8 (multicore CPU only) the runtime is dominated by the GPU-parallelizable part]

GPU Accelerated MPI Application
[Figure: with a GPU-accelerated CPU at N = 1, the GPU-parallelizable part is offloaded to the GPU and the total time drops sharply]

GPU Accelerated Strong Scaling
With Hyper-Q/MPS (available in K20, K40) multiple MPI ranks can share the GPU, so the GPU-accelerated code keeps scaling as ranks are added
[Figure: serial, CPU parallel and GPU-parallelizable parts for multicore CPU only vs GPU-accelerated CPU, N = 1, 2, 4, 8]


Example: Hyper-Q/MPS for CP2K
[Figure: CP2K performance with Hyper-Q/MPS]

How to use MPS
- No application modifications necessary
- Proxy process between user processes and GPU
  nvidia_cuda_mps_control -d
- Set environment variable to use the proxy
  export CUDA_MPS_CLIENT=1


Don't Forget Large-Scale Behavior
Profile in a realistic environment
Get a profile at scale
– Tau, Scalasca, VampirTrace+Vampir, Craypat, ...
[Figure: per-rank timeline at Nrank = 384 split into compute, work and waste time]
Fix messaging problems first!
– GPUs will accelerate your compute and amplify messaging problems
– Will also help CPU-only code


Improving Programmability with Dynamic Parallelism
– Library calls from kernels
– Simplify CPU/GPU divide
– Batching to help fill the GPU
– Dynamic occupancy
– Dynamic load balancing
– Data-dependent execution
– Recursive parallel algorithms

What is Dynamic Parallelism?
The ability to launch new grids from the GPU
– Dynamically
– Simultaneously
– Independently

What Does It Mean?
Fermi: only the CPU can generate GPU work (GPU as co-processor)
Kepler: the GPU can generate work for itself (autonomous, dynamic parallelism)


Dynamic Work Generation
Fixed grid: statically assign a conservative worst-case grid
Dynamic parallelism: dynamically assign performance where accuracy is required
[Figure: initial grid vs dynamically refined grid]

CUDA Dynamic Parallelism
Kernel launches grids
Identical syntax as on the host
CUDA runtime functions are in the cudadevrt library
Enabled via the nvcc flag -rdc=true

    __global__ void childKernel()
    {
        printf("Hello %d", threadIdx.x);
    }

    __global__ void parentKernel()
    {
        childKernel<<<1,10>>>();
        cudaDeviceSynchronize();
        printf("World!\n");
    }

    int main(int argc, char *argv[])
    {
        parentKernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }

Offline Static Linker
Link and externally call device code
[Figure: a.cu + b.cu are compiled into a device library (ab.culib), which is linked together with main1.cpp/foo.cu and main2.cpp/bar.cu object files into program1.exe and program2.exe]


Compile Trajectory / CUDA 5 Introduces Device Code Linker
Separation of host and device code
Device code translates into a device-specific binary (.cubin) or device-independent assembly (.ptx)
Device code is embedded in the host object file
[Figure: a.cu/b.cu split into host code (a.c, b.c) and device code (a.ptx/a.cubin, b.ptx/b.cubin); the device linker and host linker combine a.o and b.o into a.out]


Device Linker Invocation

Introduction of an optional link step for device code

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L -lcudart

Link device-runtime library for dynamic parallelism

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L -lcudadevrt -lcudart

Currently, link occurs at cubin level (PTX not supported)



Stream Priorities for Kernels
Without priorities: kernel X launched into stream 2 waits until kernels A, B and C in stream 1 have finished
With priorities: a high-priority kernel X in stream 2 runs as soon as it is launched, ahead of the remaining work in stream 1
[Figure: timelines of stream 1 (kernels A, B, C) and stream 2 (kernel X) with and without priorities]

CUDA Tools
The CUDA Toolkit: libraries, profilers, debuggers
. Development tools: NVCC, PTXAS, cuobjdump, Nsight
. Libraries: CUBLAS, CUFFT, CURAND, CUSPARSE, NPP, THRUST
. Debugging tools: CUDA-GDB, CUDA-MEMCHECK
. Performance tools / profilers: NVVP, NVPROF
. System management: NVIDIA-SMI, NVML

NVCC

. NVCC is the main CUDA compiler driver
. Uses an LLVM-based compiler to build device code
. Host code is compiled with g++
. Supports most common features
. Useful options:
  . Pass args to GCC: -Xcompiler, to PTXAS: -Xptxas
  . --lineinfo (for debugger and profiler), -G (kernel debugging code)
  . -gencode arch=compute_35,code=sm_35
. Fun trick: nvcc --cuda
  . Generates code which can be built with another compiler *at your own risk*

PTXAS
. Assembles PTX to native binary code
. Invoked automatically by NVCC
. PTX is an abstract virtual ISA (like LLVM, the language)
. Useful options (usually passed via nvcc -Xptxas=[options]):
  . -v : print out information on each kernel compiled for each architecture
  . -maxrregcount : limit registers to increase occupancy
  . -dlcm=cg : shut off L1 caching of globals (Fermi)

cuobjdump
. Similar to objdump (binutils) for host code
. Allows you to extract compiled information, including GPU hardware instructions (SASS)
. Useful options:
  . -ptx : extract PTX from a compiled file
  . -sass : disassemble compiled GPU hardware instructions
  . -symbols : dump ELF symbol names


NSIGHT IDE

. Eclipse-based IDE

CUBLAS
. Full BLAS library, levels 1, 2 & 3
. Extremely well tuned for every NVIDIA architecture
. Full host-side API for use without any other kernels
. Device-callable API (newer) for calling BLAS from other kernels

CURAND
. Random number generator
. Host API for using the GPU to generate huge arrays of random numbers
. Device API for generating random numbers within kernels
. Huge number of random number generator algorithms implemented
  . Pseudorandom: XORWOW, MRG32K3A, MTGP32
  . Quasirandom: SOBOL32, SOBOL64 & scrambled variants
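A minimal sketch of the CURAND host API (fills a device array with uniform floats; the buffer name is illustrative and error checking is omitted):

    #include <curand.h>

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // d_rand is a device buffer of n floats allocated with cudaMalloc
    curandGenerateUniform(gen, d_rand, n);

    curandDestroyGenerator(gen);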


CUFFT
. Multi-dimensional FFT on GPU
  . 1D, 2D, 3D
. Host-side API, interoperable with other CUDA code
. FFTW compatibility mode
  . Uses data layouts compatible with FFTW

THRUST
. STL-like template library (C++)
. Masks host<->device data copies
. Includes many useful algorithms:
  . Sort
  . Reordering
  . Reductions
  . Prefix-sums
  . Transformations (foreach)
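A minimal sketch of the cuFFT host API (in-place 1D complex-to-complex transform of a device buffer; the buffer name is illustrative and error checking is omitted):

    #include <cufft.h>

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);   // single 1D transform of length N

    // d_signal is a device buffer of N cufftComplex values
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    cufftDestroy(plan);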

CUSPARSE
. Tuned sparse linear algebra library
. Supports a variety of sparse formats:
  . COO, CSR, CSC, ELL, HYB, BSR, BSRX
. Supports 3 types of operations:
  . Sparse matrix, dense vector
  . Sparse vector, dense vector
  . Sparse matrix, set of dense vectors

NVIDIA Performance Primitives (NPP)
. Large signal and image/video processing library
. Primitive operations from which signal processing algorithms are constructed


Printf

. First form of “debugging” most programmers try

. Compute 2.0 and above allow printf from kernels

. Caveats: . Prints to a buffer which is copied out at kernel end - cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size);

. Can be explosive
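A small sketch of kernel printf (the kernel name is made up; the buffer resize is only needed if the default FIFO is too small):

    #include <cstdio>

    __global__ void hello()
    {
        printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // optional: enlarge printf buffer
        hello<<<2, 4>>>();
        cudaDeviceSynchronize();   // output is flushed when the kernel completes / at sync
        return 0;
    }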

cuda-memcheck
. Checks for memory management / access problems:
  . Out-of-bounds read/write
  . Shared memory race detection
  . Stack overflow
  . Misaligned access
  . Device memory leaks
. Notes:
  . Build with nvcc -G -lineinfo

cuda-gdb
. GDB-based debugger for kernel code
. Requires SM 2.0 or later
. Breakpoints: kernel launch, source line, function entry
. Change thread/block focus
. Trap into kernels on illegal instruction
. Step by instruction or source line
. Print and change variables
. GUI interface through Nsight Eclipse Edition
. Notes:
  . Build with nvcc -G to debug kernel code


Nvidia Visual Profiler (nvvp)

. Runs full applications and gathers a whole host of metrics and counters . Visual timeline display of kernels

. New in 5.5: Guided Optimization Analysis tools

. Caveats: . Limited counter storage, app may be rerun many times for profile generation . Some profile counters are not entirely intuitive without deep hardware knowledge.

Nvidia Visual Profiler (nvvp)
[Screenshots: timeline views showing what the host is doing, which copies are happening, how much overlap there is, stream activity, and multi-GPU activity]

NVProf

. “backend” to NVVP which collects counters . Run from command line, no dependence on X . Useful for profiling on cluster/ nodes

. New in 5.5: NVVP counters . nvprof --analysis-metrics -o profile ./appname . Generate profile on cluster, review on your laptop.

. On : export PMI_NO_FORK=1

“Command Line Profiler”

. Built into the driver . Always present . Activated by environment variables / config files . outputs to file

. Notes: . Predates NVVP and NVPROF . Today, largely used when there is a reason NVVP or NVPROF cannot be used.


NVIDIA-SMI
. Utility included in the driver for controlling and configuring Tesla and Quadro products
  . Set GPU access mode
  . Read, reset ECC state
  . Reset
  . Change clocks / power limits, accounting mode
. Silly feature: dasBlinkenlights
  . nvidia-smi -i 0 -t 0; sleep 1; nvidia-smi -i 0 -t 1;

NVML
. NVIDIA Management Library
. C library exposes driver operations for system management tasks
. Largely makes NVIDIA-SMI functionality accessible to an application in library form

Other ways to unleash GPU power
. OpenACC
  . PGI, Cray, and CAPS compilers support directive-based acceleration
. CUDA Fortran
  . PGI compiler specific, allows you to write kernels in Fortran
. 3rd party libs:
  . Many 3rd party libs take advantage of GPU acceleration: PETSc, Trilinos, Magma, ...
. 3rd party apps:
  . Many apps from Photoshop to Matlab already employ GPU acceleration in one form or another
. Play games:
  . It's where GPUs started and they're still very good at it!


New Features in CUDA 6
1 Unified Memory
2 XT and Drop-in Libraries
3 GPUDirect RDMA in MPI
4 Developer Tools

Unified Memory: Dramatically Lower Developer Effort
Developer view today: separate system memory and GPU memory
Developer view with Unified Memory: a single unified memory
[Figure: CPU and GPU each with their own memory vs a single unified memory]

Super Simplified Memory Management Code

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);

        qsort(data, N, 1, compare);

        use_data(data);

        free(data);
    }

CUDA 6 code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);

        cudaFree(data);
    }

Unified Memory Delivers
1. Simpler programming & memory model
   . Single pointer to data, accessible anywhere
   . Tight language integration
   . Greatly simplifies code porting
2. Performance through data locality
   . Migrate data to the accessing processor
   . Guarantee global coherency
   . Still allows cudaMemcpyAsync() hand tuning


Simpler Memory Model: Eliminate Deep Copies

    struct dataElem
    {
        int prop1;
        int prop2;
        char *text;
    };

Without unified memory, two copies are required — both the struct and the string it points to must be copied, and the embedded pointer patched:

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;

        int textlen = strlen(elem->text);

        // Allocate storage for struct and text
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including new "text" pointer value
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but
        // CPU & GPU use different copies of "elem"
        kernel<<< ... >>>(g_elem);
    }

With Unified Memory, the same structure can be passed directly:

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }


Simpler Memory Model
Example: GPU & CPU shared linked lists
[Figure: linked list nodes (key, data, next) residing in CPU memory, accessed by the GPU over PCIe]
Without Unified Memory the only practical option is to use zero-copy (pinned system) memory
– GPU accesses at PCIe bandwidth
– All GPU accesses at very high latency

Simpler Memory Model
Example: GPU & CPU shared linked lists with Unified Memory
– Can pass list elements between host & device
– Can insert and delete elements from host or device*
– Single list – no complex synchronization
*Program must still ensure no race conditions.
*Data is coherent between CPU & GPU at kernel launch & sync only.

Unified Memory with C++: A Powerful Combination
C++ objects migrate easily when allocated on the managed heap
Overload the new operator to use C++ in the unified memory region
Deep copies, pass-by-value, pass-by-reference: JUST WORKS

    class Managed {
        void *operator new(size_t len) {
            void *ptr;
            cudaMallocManaged(&ptr, len);
            return ptr;
        }

        void operator delete(void *ptr) {
            cudaFree(ptr);
        }
    };

    // Inherit from "Managed",
    // C++ now handles our deep copies
    class dataElem : public Managed {
        int prop1;
        int prop2;
        String text;
    };

    dataElem *data = new dataElem;


Unified Memory Roadmap
CUDA 6: Ease of Use
– Single pointer to data
– No memcopy required
– Coherence @ launch & sync
– Shared C/C++ data structures
Next (Maxwell): Optimizations
– Prefetching
– Migration hints
– System allocator unified
– Stack memory unified
– Additional OS support
– HW-accelerated coherence

Extended (XT) Library Interfaces
Automatic scaling to >1 GPU per node
– cuFFT and cuBLAS level 3
– Tuned for multi-GPU cards (K10)
– Out-of-core operations: e.g. very large GEMM
– BLAS 3 host interfaces: automatically overlap memory transfers

Multi-GPU cuFFT
Single & batch transforms across multiple GPUs (max 2 in CUDA 6)
Better scaling for larger transforms
[Chart: cuFFT 3D execution time (ms) on K10, single vs dual GPU, for 256x256x256 and 512x512x512 transforms; does not include memcpy time]


Multi-GPU cuBLAS
Single function call automatically spreads work across two GPUs
Source and result data in system memory
Supports matrices > size of memory (out-of-core)
All BLAS Level-3 routines
[Chart: cuBLAS ZGEMM performance (GFLOPS) on 1 vs 2 K20c as a function of matrix size (NxN), with the in-core limit marked]

New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically routes standard BLAS3 calls to cuBLAS
Optionally configure which routines and matrix sizes are accelerated
User provides the CPU-only BLAS dynamic library location
Simply re-link or change the library load order:

    gcc myapp.c -lnvblas -lmkl_rt -o myapp
        - or -
    env LD_PRELOAD=libnvblas.so myapp

New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically routes BLAS3 calls to cuBLAS
Example: drop-in for R

    > LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
    > A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > system.time(C <- A %*% B)
       user  system elapsed
      0.348   0.142   0.289

Use in any app that uses standard BLAS3: Octave, Scilab, etc.
[Chart: matrix-matrix multiplication in R, GFlop/s vs matrix dimension, nvBLAS on 4x K20X GPUs vs fp64 MKL on a 6-core Xeon E5-2667 CPU]


GPUDirect RDMA in MVAPICH2 & OpenMPI
Reduced inter-node latency
Better MPI application scaling
[Chart: preliminary performance of MVAPICH2 with GPU-Direct-RDMA; execution time of the HSG application with 2 GPU nodes, 4 MPI processes / GPU node; MVAPICH2-1.9 vs MVAPICH2-1.9-GDR, 36-67% improvement across side numbers 4-256. Based on MVAPICH2-1.9, Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.0, OFED 1.5.4.1 with GPU-Direct-RDMA patch]

Remote Development with Nsight Eclipse Edition
Local IDE, remote application
– Edit locally, build & run remotely
– Automatic sync via ssh
– Cross-compilation to ARM
– Full debugging & profiling via remote connection

CUDA tools for MPS (Multi-Process Server)
– Profile MPI apps on MPS using nvprof
– Import multi-process MPI ranks into Visual Profiler
– Run CUDA-MEMCHECK on apps running on MPS


Detailed Kernel Profiling (Visual Profiler and Nsight EE)
Detailed Instruction Mix Visualization (Visual Profiler and Nsight EE)
Instruction counts automatically locate hot spots in your code
[Screenshots: detected hot spot in the source view with the corresponding assembly]

CUDA 6

Dramatically Simplifies Parallel Programming with Unified Memory

Sign up for CUDA Registered Developer Program https://developer.nvidia.com/cuda-toolkit

GPU Optimization, Part 1
Peter Messmer

Drinking from the Firehose
Lots of information here
Interrupt me with questions
– No, seriously. Please do this.
Tell me if you're lost before we move on
– Be brave, you're likely not alone
Discussion is good for learning


GPU OPTIMIZATION FUNDAMENTALS

Main Requirements for GPU Performance
Expose sufficient parallelism
Utilize parallel execution resources efficiently
Use the memory system efficiently
– Coalesce global memory accesses
– Use shared memory where possible
Have coherent execution within warps of threads

APOD: A Systematic Path to Performance
Assess → Parallelize → Optimize → Deploy (repeat)

Assess
Identify hotspots (total time, number of calls)
Understand scaling (strong and weak)


Parallelize
Applications can be parallelized with libraries, compiler directives, or programming languages

Optimize
Profile-driven optimization
Tools:
– nsight: Visual Studio Edition or Eclipse Edition
– nvvp: NVIDIA Visual Profiler
– nvprof: command-line profiling

Deploy
Productize
– Check API return values
– Run cuda-memcheck tools
Library distribution
Cluster management
Early gains
Subsequent changes are evolutionary

ASSESS


Assess
Profile the code, find the hotspot(s)
Focus your attention where it will give the most benefit

Assess
We've found a hotspot to work on!
– What percent of our total time does this represent?
– How much can we improve it? What is the "speed of light"?
– How much will this improve our overall performance?

Assess
Let's investigate…
– Strong scaling and Amdahl's Law
– Weak scaling and Gustafson's Law
– Expected perf limiters: bandwidth? computation? latency?

Assess: Understanding Scaling
Strong scaling
– A measure of how, for fixed overall problem size, the time to solution decreases as more processors are added to a system
– Linear strong scaling: the speedup achieved is equal to the number of processors used
Amdahl's Law:

    S = 1 / ((1 - P) + P/N)  ≈  1 / (1 - P)


Assess: Understanding Scaling
Weak scaling
– A measure of how the time to solution changes as more processors are added with fixed problem size per processor
– Linear weak scaling: the overall problem size increases as the number of processors increases, but execution time remains constant
Gustafson's Law:

    S = N + (1 - P)(1 - N)

Assess: Applying Strong and Weak Scaling
Understanding which type of scaling is most applicable is an important part of estimating speedup:
– Sometimes the problem size will remain constant
– Other times the problem size will grow to fill the available processors
Apply either Amdahl's or Gustafson's Law to determine an upper bound for the speedup

Assess: Applying Strong Scaling
Recall that in this case we want to optimize an existing kernel with a pre-determined workload
That's strong scaling, so Amdahl's Law will determine the maximum speedup

Assess: Applying Strong Scaling
Say, for example, our kernel is ~93% of total time:

    Speedup S = 1 / ((1 - P) + P/S_P)     (S_P = speedup of the parallel part)

In the limit when S_P is huge, S will approach 1 / (1 - 0.93) ≈ 14.3
In practice, it will be less than that, depending on the S_P achieved
Getting S_P to be high is the goal of optimizing, of course


Assess: Speed of Light
What's the limiting factor?
– Memory bandwidth?
– Compute throughput?
– Latency?
Not sure?
– Get a rough estimate by counting bytes per instruction and compare it to the "balanced" peak ratio GBytes/sec : Ginsns/sec
– The profiler will help you determine this

Assess: Limiting Factor
Comparing bytes per instruction will give you a guess as to whether you're likely to be bandwidth-bound or instruction-bound
Comparing actual achieved GB/s vs. theory and achieved Ginstr/s vs. theory will give you an idea of how well you're doing
– If both are low, then you're probably latency-bound and need to expose more (concurrent) parallelism

Assess: Limiting Factor
What's the limiting factor? Memory bandwidth? Compute throughput? Latency?
Consider SpMV: intuitively we expect it to be bandwidth-limited

Assess: Speed of Light
Say we discover we're getting only ~38% of peak bandwidth
If we aim to get this up to ~65% of peak, that's a 1.7x speedup for this kernel
1.7x for this kernel translates into 1.6x overall due to Amdahl:

    S = 1 / ((1 - 0.93) + 0.93/1.7) ≈ 1.6


Assess: Limiting Factor

For our example SpMV kernel, our first discovery was that we’re latency-limited, not bandwidth, since utilization was so low

This tells us our first "optimization" step actually needs to be related to how we expose (memory-level) parallelism

PARALLELIZE


Parallelize

PARALLELIZE Computation

Applications can be parallelized with libraries, compiler directives, or programming languages
Pick the best tool for the job


Parallelize: e.g., with GPU Accelerated Libraries
NVIDIA cuBLAS (linear algebra), cuSPARSE, cuFFT, cuRAND, NPP (vector signal / image processing), Thrust (building-block C++ templated parallel algorithms), Matrix Algebra on GPU and Multicore, IMSL Library, CenterSpace NMath, ...

Parallelize: e.g., with Thrust
Similar to the C++ STL
High-level interface
– Enhances developer productivity
– Enables performance portability between GPUs and multicore CPUs
Flexible
– Backends for CUDA, OpenMP, TBB
– Extensible and customizable
– Integrates with existing software
Open source
thrust.github.com or developer.nvidia.com/thrust

    // generate 32M random numbers on host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to device (GPU)
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

Parallelize: e.g., with OpenACC
Directives-based approach
Compiler parallelizes code
Works on many-core GPUs & multicore CPUs
www.nvidia.com/gpudirectives

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

Parallelize: e.g., with CUDA C
developer.nvidia.com/cuda-toolkit

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Perform SAXPY on 1M elements
    saxpy_serial(4096*256, 2.0, x, y);

CUDA C code:

    __global__
    void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Perform SAXPY on 1M elements
    saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);


Parallelism Needed
GPU is a parallel machine
– Lots of arithmetic pipelines
– Multiple memory banks
To get good performance, your code must expose sufficient parallelism for 2 reasons:
– To actually give work to all the pipelines
– To hide latency of the pipelines
Rough rule of thumb for Tesla K20X: you want to have 14K or more threads running concurrently

Case Study: Matrix Transpose

    void transpose(float in[][], float out[][], int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j][i] = in[i][j];
    }

An Initial CUDA Version

    __global__ void transpose(float in[], float out[], int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N+j] = in[j*N+i];
    }

    float in[N*N], out[N*N];
    ...
    transpose<<<1,1>>>(in, out, N);

+ Quickly implemented    - Performance weak
Need to expose parallelism!

Parallelize across matrix elements
Process elements independently: each thread (tid) in each block (bid) handles one element

    __global__ void transpose(float in[], float out[])
    {
        int tid = threadIdx.x;
        int bid = blockIdx.x;

        out[tid*N+bid] = in[bid*N+tid];
    }

    float in[], out[];
    ...
    transpose<<<N,N>>>(in, out);


PARALLELIZE Data Transfer

Asynchronicity = Overlap = Parallelism
Heterogeneous system: overlap work and data movement
[Figure: CPU, DMA engines and GPU working concurrently]

Asynchronicity
This is the kind of case we would be concerned about:
– Found the top kernel, but the GPU is mostly idle – that is our bottleneck
– Need to overlap CPU/GPU computation and PCIe transfers

Parallelize: Achieve Asynchronicity
What we want to see is maximum overlap of all engines


Main Requirements for GPU Performance
Expose sufficient parallelism
Utilize parallel execution resources efficiently
Use the memory system efficiently
– Coalesce global memory accesses
– Use shared memory where possible
Have coherent execution within warps of threads

OPTIMIZE

GPU Optimization Fundamentals
Find ways to parallelize sequential code
Adjust kernel launch configuration to maximize device utilization
Ensure global memory accesses are coalesced
Minimize redundant accesses to global memory
Avoid different execution paths within the same warp
Minimize data transfers between the host and the device
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

GPU Optimization Fundamentals
Kernel optimizations
– Launch configuration
– Global memory throughput
– Shared memory access
– Instruction throughput / control flow
Optimization of CPU-GPU interaction
– Maximizing PCIe throughput
– Overlapping kernel execution with memory copies


OPTIMIZE: Kernel Optimizations: Kernel Launch Configuration

Kernel Launch Configuration
A kernel is a function that runs on the GPU
A kernel is launched as a grid of blocks of threads
Launch configuration is the number of blocks and number of threads per block, expressed in CUDA with the <<< >>> notation:

    mykernel<<<blocks, threads_per_block>>>(…);

What values should we pick for these?
– Need enough total threads to process the entire input
– Need enough threads to keep the GPU busy
– Selection of block size is an optimization step involving warp occupancy
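A common sketch for sizing the launch (round the block count up so every input element is covered; the kernel name and sizes are illustrative):

    int N = 1 << 20;                 // problem size
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division

    mykernel<<<blocks, threadsPerBlock>>>(d_data, N);
    // inside the kernel, guard the extra threads with: if (idx < N) { ... }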

High-level view of GPU Architecture
Several Streaming Multiprocessors
– E.g., Kepler GK110 has up to 15 SMs
L2 cache shared among SMs
Multiple channels to DRAM

Kepler Streaming Multiprocessor (SMX)
Per SMX:
– 192 SP CUDA cores
– 64 DP CUDA cores
– 4 warp schedulers
  . Up to 2048 concurrent threads
  . One or two instructions issued per scheduler per clock from a single warp
– Register file (256 KB)
– Shared memory (48 KB)


CUDA Execution Model (software ↔ hardware)
Thread: sequential execution unit
– All threads execute the same sequential program
– Threads execute in parallel
– Threads are executed by scalar CUDA cores
Block: a group of threads
– Executes on a single Streaming Multiprocessor (SM); thread blocks do not migrate
– Threads within a block can cooperate: light-weight synchronization, data exchange
– Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid: a collection of thread blocks
– A kernel is launched as a grid of thread blocks
– Thread blocks of a grid execute across multiple SMs
– Thread blocks do not synchronize with each other; communication between blocks is expensive

Launch Configuration: General Guidelines
How many blocks should we use?
– 1,000 or more thread blocks is best
– Rule of thumb: enough blocks to fill the GPU at least 10s of times over
– Makes your code ready for several generations of future GPUs

Launch Configuration: General Guidelines
How many threads per block should we choose?
– The really short answer: 128, 256, or 512 are often good choices
– The slightly longer answer:
  . Pick a size that suits the problem well
  . Multiples of 32 threads are best
  . Pick a number of threads per block (and a number of blocks) that is sufficient to keep the SM busy


Warps
A thread block consists of warps of 32 threads
A warp is executed physically in parallel on some multiprocessor
Threads of a warp issue instructions in lock-step (as with SIMD)

Hardware Levels of Parallelism
– SIMD: Single Instruction, Multiple Data — in-core parallelism
– SMT: Simultaneous Multithreading — cross-core, cross-socket; tightly-coupled single computer; OpenMP, pthreads
– SIMT: Single Instruction, Multiple Threads — in-processor parallelism; many threads on many cores
– MPI: multiple "computers"; supercomputing apps
These form a continuum. Best performance is achieved with a mix.

Occupancy
Need enough concurrent warps per SM to hide latencies:
– Instruction latencies
– Memory access latencies
Hardware resources determine the number of warps that fit per SM
Occupancy = N_actual / N_max

Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other (warps of) threads
[Figure: CPU core (low-latency processor) vs GPU streaming multiprocessor (high-throughput processor) context-switching between warps]

227 228

Latency Hiding
- Instruction latencies: roughly 10-20 cycles for arithmetic operations; DRAM accesses have higher latencies (400-800 cycles)
- Instruction Level Parallelism (ILP): independent instructions between two dependent ones; ILP depends on the code, extracted by the compiler. Example (ILP=2):
    FFMA R0, R43, R0, R4;
    FFMA R1, R43, R4, R5;
    FMUL R7, R9, R0;
    FMUL R8, R9, R1;
    ST.E [R2], R7;
- Switching to a different warp: if a warp must stall for N cycles due to dependencies, having N other warps with eligible instructions keeps the SM going
- Switching among concurrently resident warps has no overhead: state (registers, shared memory) is partitioned, not stored/restored

Occupancy
- Occupancy: number of concurrent warps per SM, expressed either as the absolute number of warps that fit concurrently (e.g., 1..64) or as the ratio of warps that fit concurrently to the architectural maximum (0..100%)
- The number of warps that fit is determined by resource availability: threads per thread block, registers per thread, shared memory per thread block
- Kepler SM resources: 64K 32-bit registers, up to 48 KB of shared memory, up to 2048 concurrent threads, up to 16 concurrent thread blocks
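A back-of-the-envelope illustration of how per-thread register use limits occupancy on such an SM (the 40 registers per thread is a made-up example; real hardware also allocates registers with per-warp granularity, so treat this only as an estimate):

    int regsPerThread   = 40;          // hypothetical, e.g. from -Xptxas -v output
    int regsPerSM       = 64 * 1024;   // Kepler SM: 64K 32-bit registers
    int maxThreadsPerSM = 2048;

    int threadsByRegs = regsPerSM / regsPerThread;                  // 1638 threads
    int threads = (threadsByRegs < maxThreadsPerSM) ? threadsByRegs
                                                    : maxThreadsPerSM;
    int warps = threads / 32;                                       // ~51 warps
    float occupancy = (float)warps / (maxThreadsPerSM / 32);        // ~0.8, i.e. ~80%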

229 230 Occupancy and Performance Thread Block Size and Occupancy

Occupancy and Performance
- Note that 100% occupancy isn't needed to reach maximum performance: once the "needed" occupancy (enough warps to switch among to cover latencies) is reached, further increases won't improve performance
- The level of occupancy needed depends on the code: more independent work per thread -> less occupancy is needed; memory-bound codes tend to need more occupancy (higher latency than for arithmetic, so more work is needed to hide it)

Thread Block Size and Occupancy
- Thread block size is a multiple of warp size (32); even if you request fewer threads, the hardware rounds up
- Thread blocks can be too small: a Kepler SM can run up to 16 thread blocks concurrently, so the SM can reach the block count limit before reaching good occupancy (e.g., 1-warp blocks = 16 warps/SM on Kepler, 25% occupancy – probably not enough)
- Thread blocks can be too big: there may be enough SM resources for more threads, but not enough for a whole block; a thread block isn't started until resources are available for all of its threads

231 232

Thread Block Sizing CUDA Occupancy Calculator

Number of warps allowed by SM resources Too few SM resources: threads per block Registers Shared memory Analyze effect of resource consumption on occupancy

Too many threads per block

233 234 Occupancy Analysis in NVIDIA Visual Profiler Kepler: Level of Parallelism Needed

Occupancy analysis in the NVIDIA Visual Profiler: in the example shown, occupancy is limited by grid size and the number of threads per block.

Kepler: level of parallelism needed
- To saturate instruction bandwidth: ~1.7K independent fp32 math instructions per SM (fewer for lower-throughput instructions); keep in mind that Kepler can track up to 2048 threads per SM
- To saturate memory bandwidth: 100+ concurrent independent 128-byte lines per SM

235 236 GPU Optimization Part 2 OPTIMIZE

Peter Messmer Kernel Optimizations: Global Memory Throughput

237 238

CUDA Memory Architecture Optimizing Memory Throughput

CUDA Memory Architecture
[Figure: host (CPU, system DRAM, chipset) connected over PCIe to the GPU; each multiprocessor has registers, shared memory and L1; all multiprocessors share the L2 cache, the constant and texture caches, and device DRAM (local, global, constant, texture memory)]

Optimizing Memory Throughput
- Goal: utilize all available memory bandwidth
- Little's Law: # bytes in flight = latency * bandwidth
- Increase parallelism (bytes in flight), and/or reduce latency (time between requests)

239 240 Illustration: Little’s Law for Escalators Memory-Level Parallelism = Bandwidth

Illustration: Little's Law for Escalators
- Say the parameters of our escalator are: 1 person fits on each step; a step arrives every 2 seconds (bandwidth = 0.5 persons/s); it is 20 steps tall (latency = 40 seconds)
- 1 person in flight: 0.025 persons/s achieved
- To saturate bandwidth: need 1 person arriving every 2 s, which means we need 20 persons in flight
- The idea: bandwidth × latency. It takes latency time units for the first person to arrive, and we need bandwidth persons to get on the escalator every time unit

Memory-Level Parallelism = Bandwidth
- In order to saturate memory bandwidth, the SM must have enough independent memory requests in flight concurrently
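Translating the escalator picture back to GPU memory, with assumed order-of-magnitude numbers (not measured values from the slides):

    /* Little's Law with illustrative numbers: bytes in flight = latency * bandwidth */
    double bandwidth = 200e9;      // ~200 GB/s device memory bandwidth (assumption)
    double latency   = 500e-9;     // ~500 ns access latency (assumption)

    double bytesInFlight = bandwidth * latency;      // ~100 KB for the whole GPU
    double linesInFlight = bytesInFlight / 128.0;    // ~800 concurrent 128-byte lines
    // Spread over ~13-15 SMs that is several tens to ~100 lines per SM,
    // consistent in magnitude with the figure quoted a few slides later.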

241 242

Memory-Level Parallelism: Requests in flight Requests per Thread and Performance

Memory-level parallelism: requests in flight – achieved Kepler memory throughput, shown as a function of the number of concurrent requests per SM with 128-byte lines; the memcopy kernel has 2 concurrent requests per warp (one write and the read following it).
Requests per thread and performance – experiment: vary the size of the accesses made by the threads of a warp and check performance. Accesses by a warp: 4-byte words need 1 line, 8-byte words 2 lines, 16-byte words 4 lines.

To achieve same throughput at lower occupancy or with smaller words, need more independent requests per warp

243 244 Optimizing Access Concurrency

Ways to increase concurrent accesses:
- Increase occupancy (run more warps concurrently): adjust block dimensions to maximize occupancy; if occupancy is limited by registers per thread, try to reduce the register count (-maxrregcount option or __launch_bounds__)
- Modify the code to process several elements per thread: doubling the elements per thread doubles the independent accesses per thread

OPTIMIZE Kernel Optimizations: Global Memory Access Coalescing
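Both ideas can be sketched in one kernel (illustrative, not from the slides): a grid-stride loop gives each thread several independent accesses, and __launch_bounds__ tells the compiler the maximum block size so it can budget registers accordingly:

    // Sketch: each thread strides over the whole grid, issuing several
    // independent, fully coalesced loads and stores.
    __global__ void __launch_bounds__(256)      // assumed maximum block size
    scale_kernel(double *dst, const double *src, double a, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)       // stride = total thread count
        {
            dst[i] = a * src[i];
        }
    }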

245 246

Mechanics of a Memory Access Memory Access Efficiency Analysis

Mechanics of a Memory Access
- Memory operations are issued per warp, just like all other instructions
- Operation: the threads in a warp provide memory addresses; hardware determines which lines/segments are needed and fetches them; memory is accessed at 32-byte granularity
- Broadcast: the same small transaction serves many threads in a warp

Memory Access Efficiency Analysis
- Two perspectives on the throughput: the application's point of view (count only the bytes requested by the application) and the HW point of view (count all bytes moved by the hardware)
- The two views can be different: with a scattered or offset pattern, the application doesn't use all the bytes the hardware actually transferred

247 248 Access Patterns vs. Memory Throughput Access Patterns vs. Memory Throughput

Scenario: warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 4 segments; the warp needs 128 bytes and 128 bytes move across the bus. Bus utilization: 100%.
Scenario: warp requests 32 aligned, permuted 4-byte words. Addresses still fall within 4 segments; the warp needs 128 bytes and 128 bytes move across the bus. Bus utilization: 100%.

addresses from a warp addresses from a warp ......

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses Memory addresses

249 250

Access Patterns vs. Memory Throughput Access Patterns vs. Memory Throughput

Scenario: warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within at most 5 segments; the warp needs 128 bytes and at most 160 bytes move across the bus. Bus utilization: at least 80% (some misaligned patterns still fall within 4 segments, giving 100% utilization).
Scenario: all threads in a warp request the same 4-byte word. Addresses fall within a single segment; the warp needs 4 bytes but 32 bytes move across the bus. Bus utilization: 12.5%.

addresses from a warp addresses from a warp ......

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses Memory addresses

251 252 Access Patterns vs. Memory Throughput Parallelizing SAXPY

Scenario: warp requests 32 scattered 4-byte words; the more segments the addresses fall into, the more bytes move across the bus for the same 128 requested bytes, and the lower the bus utilization.

Parallelizing SAXPY – start from the serial CPU loop and divide the work equally among T threads:

    void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses

253 254

Parallelizing SAXPY Parallelizing SAXPY

Parallelizing SAXPY – first attempt: each thread computes one contiguous region of the array.

    __global__ void saxpy1(int n, float a, float *x, float *y)
    {
        int workPerThread = 1 + n / blockDim.x;
        int base = threadIdx.x * workPerThread;
        for (int i = 0; i < workPerThread; i++) {
            y[base + i] += a * x[base + i];   // (bounds check against n omitted on the slide)
        }
    }

In SIMT, the 32 threads of a warp issue the x[base+i] load simultaneously. Each thread has a different value of base, so when workPerThread > 1 this becomes a strided load: thread 0, thread 1, … thread 31 touch widely separated addresses.

255 256 Parallelizing SAXPY A Better Way to Parallelize SAXPY

(The strided-load problem of saxpy1 is repeated on the next slide.)

A Better Way to Parallelize SAXPY: divide the work up so that on each pass through the loop the thread block computes one "contiguous region" of the array. This achieves memory coalescing:

    __global__ void saxpy2(int n, float a, float *x, float *y)
    {
        int loopCount = 0;
        int id = threadIdx.x;                      // first contiguous region
        while (id < n) {
            y[id] += a * x[id];
            loopCount++;
            id = loopCount * blockDim.x + threadIdx.x;
        }
    }

257 258

A Better Way to Parallelize SAXPY Structures of Non-Native Size

A Better Way to Parallelize SAXPY (cont'd): the area of x addressed by each warp is contiguous in global memory, so the number of global memory transactions is minimized; the same effect applies to the loads and stores of y.

Structures of Non-Native Size: say we are reading a 12-byte structure per thread:

    struct Position
    {
        float x, y, z;
    };
    ...
    __global__ void kernel( Position *data, ... )
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        Position temp = data[idx];
        ...
    }

259 260 Structure of Non-Native Size First Load Instruction

First Load Instruction: the compiler converts temp = data[idx] into 3 loads, each loading 4 bytes. It can't do an 8-byte and a 4-byte load: 12 bytes per element means every other element wouldn't align the 8-byte load on an 8-byte boundary. For each of the three loads, successive threads in the warp read 4 bytes at a 12-byte stride.
[Figure: addresses from a warp spread over memory addresses 0-64]

261 262

Second Load Instruction Third Load Instruction

addresses from a warp addresses from a warp

......

0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64

263 264 Performance and Solutions Global Memory Access Patterns

Performance and Solutions
- Because of the address pattern, we end up moving 3x more bytes than the application requests; we waste a lot of bandwidth, leaving performance on the table
- Potential solutions: change the data layout from array of structures to structure of arrays (in this case: 3 separate arrays of floats) – the most reliable approach (also ideal for both CPUs and GPUs); use loads via the read-only cache (as long as lines survive in the cache, performance will be nearly optimal); stage loads via shared memory

Global Memory Access Patterns
- SoA vs AoS: good: point.x[i]; not so good: point[i].x
- Strided array access: ~OK: x[i] = a[i+1] – a[i]; slower: x[i] = a[64*i] – a[i]
- Random array access: slower: a[rand(i)]
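As a sketch of the structure-of-arrays layout for the 12-byte Position example above (names are illustrative):

    /* AoS -> SoA: three separate float arrays instead of an array of structs,
       so each of the three loads below is a coalesced 4-byte-per-thread access
       rather than a 12-byte-strided one. */
    struct PositionSoA { float *x, *y, *z; };   // hypothetical container

    __global__ void kernel_soa(PositionSoA data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float px = data.x[idx];   // consecutive threads -> consecutive addresses
            float py = data.y[idx];
            float pz = data.z[idx];
            /* ... use px, py, pz ... */
        }
    }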

265 266

Summary: GMEM Optimization A note about caches

Summary: GMEM Optimization
- Strive for perfect address coalescing per warp: align the starting address (may require padding); a warp should ideally access within a contiguous region; avoid scattered address patterns or patterns with large strides between threads
- Analyze and optimize address patterns: use profiling tools (included with the CUDA toolkit download); compare the transactions per request to the ideal ratio
- Choose an appropriate data layout (prefer SoA); if needed, try read-only loads or staging accesses via SMEM

A note about caches
- L1 and L2 caches: ignore in software design; with thousands of concurrent threads, cache blocking is difficult at best
- Read-only data cache: shared with the texture pipeline; useful for uncoalesced reads; handled by the compiler when const __restrict__ is used, or use the __ldg() primitive

267 268 Blocking for GPU Memory Caches Read-only Data Cache

Blocking for GPU Memory Caches
- Short answer: DON'T
- GPU caches are not intended for the same use as CPU caches: smaller size (especially per thread), so not aimed at temporal reuse; intended to smooth out some access patterns, help with spilled registers, etc.
- Usually not worth trying to cache-block like you would on a CPU: 100s to 1,000s of run-time scheduled threads compete for the cache; if it is possible to block for L1 then it's possible to block for SMEM – same size, same or higher bandwidth, and guaranteed locality (hw will not evict behind your back)

Read-only Data Cache
- Go through the read-only cache; it is not coherent with writes, thus the addresses must not be written by the same kernel
- Two ways to enable: decorate pointer arguments as hints to the compiler (pointer of interest: const __restrict__; all other pointer arguments: __restrict__ – conveys to the compiler that no aliasing will occur), or use the __ldg() intrinsic, which requires no pointer decoration

269 270

Read-only Data Cache Read-only Data Cache

The two ways to enable the read-only cache, side by side.

Using the __ldg() intrinsic – requires no pointer decoration:

    __global__ void kernel( int *output, int *input )
    {
        ...
        output[idx] = __ldg( &input[idx] );
    }

Decorating pointer arguments as hints to the compiler (pointer of interest: const __restrict__, all other pointer arguments: __restrict__, conveying that no aliasing will occur):

    __global__ void kernel( int* __restrict__ output,
                            const int* __restrict__ input )
    {
        ...
        output[idx] = input[idx];
    }

271 272 Texture and Constant Memory Texture

Texture and Constant Memory: read-only; data resides in global memory; read via special-purpose caches.

Texture
- Separate cache; dedicated texture cache hardware provides:
- Out-of-bounds index handling: clamp or wrap-around
- Optional interpolation (think: using fp indices for arrays): linear, bilinear, trilinear; interpolation weights are 9-bit
- Optional format conversion: {char, short, int} -> float
- All of these are "free"

273 274

Examples of Texture Object Indexing

Examples of Texture Object Indexing
[Figure: a 2D texture indexed with floating-point coordinates such as (2.5, 0.5) and (1.0, 1.0); integer indices fall between elements, and the optional interpolation weights are determined by the coordinate distance. Out-of-range coordinates such as (5.5, 1.5) are handled by index wrap or index clamp.]

OPTIMIZE Kernel Optimizations: Shared Memory Accesses
275 276 Shared Memory Shared Memory Organization

Shared Memory
- Fast, on-chip memory within the SM (alongside registers and L1)
- Accessible by all threads within a thread block; common allocation for the entire thread block
- Variety of uses: software-managed cache (e.g., tiled DGEMM), global memory coalescing (e.g., transpose), communication within a thread block (e.g., FFT, reductions)
- Limited resource: use of shared memory affects occupancy

Shared Memory Organization
- Organized in 32 independent banks
- Any 1:1 or multicast pattern is conflict-free; optimal access: no two words from the same bank; banks can multicast
- Multiple words from the same bank serialize

277 278

Bank Addressing Examples Bank Addressing Examples

. No Bank Conflicts . No Bank Conflicts . 2-way Bank Conflicts . 8-way Bank Conflicts

x8 Thread 0 Bank 0 Thread 0 Bank 0 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Bank 3 Thread 3 Bank 3 Thread 3 Thread 4 Bank 4 Thread 4 Bank 4 Thread 4 Bank 4 Thread 4 Thread 5 Bank 5 Thread 5 Bank 5 Bank 5 Thread 5 Bank 7 Bank 6 Thread 6 Bank 6 Thread 6 Bank 6 Thread 6 Bank 8 Thread 7 Bank 7 Thread 7 Bank 7 Bank 7 Thread 7 Thread 28 x8 Bank 9 Thread 29 Thread 30 Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31

279 280 Motivating Example: Matrix Transpose Transposing with Shared Memory

Motivating Example: Matrix Transpose

    __global__ void gpuTranspose_kernel(int rows, int cols,
                                        float *in, float *out)
    {
        int i, j;
        i = blockIdx.x * blockDim.x + threadIdx.x;
        j = blockIdx.y * blockDim.y + threadIdx.y;
        out[i * rows + j] = in[j * cols + i];
    }

Either the write or the read is strided in gmem and therefore uncoalesced. Solution: tile in shared memory.

Transposing with Shared Memory
1. Read block_ij into shared memory – reads are coalesced
2. Transpose the shared memory indices
3. Write the transposed block to global memory – writes are coalesced
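A commonly used tiled-transpose kernel along these lines (a sketch, not the course's reference code; TILE_DIM = 32 and the +1 padding are conventional choices, and the padding avoids the shared-memory bank conflicts discussed on the following slides):

    #define TILE_DIM 32

    __global__ void transpose_tiled(int rows, int cols,
                                    const float *in, float *out)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column: no bank conflicts

        int x = blockIdx.x * TILE_DIM + threadIdx.x;     // column index in 'in'
        int y = blockIdx.y * TILE_DIM + threadIdx.y;     // row index in 'in'
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read

        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;         // column index in 'out'
        y = blockIdx.x * TILE_DIM + threadIdx.y;         // row index in 'out'
        if (x < rows && y < cols)
            out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }

Launched with dim3 block(TILE_DIM, TILE_DIM) and a grid of (cols/TILE_DIM) x (rows/TILE_DIM) blocks, rounded up.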

281 282

Shared Memory Organization Shared Memory: Avoiding Bank Conflicts

Shared Memory Organization (recap)
- Organized in 32 independent banks – note: the same as the warp size, not a coincidence
- Successive 32-bit words are assigned to successive banks, modulo 32
- Any 1:1 or multicast pattern is fine; optimal access: no two words from the same bank; separate banks per thread; banks can multicast
- Multiple words from the same bank serialize: called a bank conflict, causes instruction replay

Shared Memory: Avoiding Bank Conflicts
- Example: 32x32 SMEM array; a warp accesses a column: 32-way bank conflicts (all threads in the warp access the same bank)

283 284 Shared Memory: Avoiding Bank Conflicts Shared Memory: Avoiding Bank Conflicts

Example: 32x32 SMEM array
- Accesses along a row produce no bank conflicts
- Accesses along a column produce 32-way bank conflicts (replays), since all threads of the warp hit the same bank

Add a column for padding: 32x33 SMEM array
- Accesses along a row still produce no bank conflicts
- Accesses along a column now hit 32 different banks, so no bank conflicts

285 286

Shared Memory/L1 Sizing Final Notes on Shared Memory

Shared Memory/L1 Sizing
- Shared memory and L1 use the same 64 KB of physical memory; the split is program-configurable: Fermi: 48:16 or 16:48; Kepler: 48:16, 16:48 or 32:32
- CUDA API: cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig()
- 16k L1 / 48k shared (default on both Fermi and Kepler), 48k L1 / 16k shared, 32k L1 / 32k shared (Kepler only)
- A large L1 can improve performance when spilling registers (more lines in the cache -> fewer evictions); a large SMEM can improve performance when occupancy is limited by SMEM

Final Notes on Shared Memory
- Fast: high bandwidth, low latency
- Useful as a user-managed cache for coalescing, caching, and communication within a thread block
- Shared memory size / L1 cache size is API-configurable
- Be careful of: overuse (excessive allocation can hurt occupancy) and access patterns (lots of bank conflicts can hurt performance)
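A minimal sketch of selecting the split through the runtime API (the kernel name is hypothetical):

    /* Device-wide preference: favour shared memory (48 KB SMEM / 16 KB L1). */
    CUDA_CHECK( cudaDeviceSetCacheConfig(cudaFuncCachePreferShared) );

    /* Per-kernel override for a register-heavy kernel that benefits from more L1. */
    CUDA_CHECK( cudaFuncSetCacheConfig(my_spilling_kernel, cudaFuncCachePreferL1) );

    /* On Kepler, cudaFuncCachePreferEqual selects the 32 KB / 32 KB split. */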

287 288 Exposing Sufficient Parallelism

What SMX ultimately needs: Sufficient number of independent instructions Kepler GK110 is “wider” than Fermi or GK104; needs more parallelism

Two ways to increase parallelism: More independent instructions (ILP) within a thread (warp) More concurrent threads (warps) OPTIMIZE Kernel Optimizations: Instruction Throughput / Control Flow

289 290

Independent Instructions: ILP vs. TLP Control Flow

Independent Instructions: ILP vs. TLP
- The SMX can leverage available Instruction-Level Parallelism more or less interchangeably with Thread-Level Parallelism
- Sometimes it is easier to increase ILP than to increase TLP, e.g., when the number of threads is limited by the algorithm or by HW resource limits
- If each thread has some degree of independent operations to do, the Kepler SMX can leverage that (e.g., a small loop that is unrolled)
- In fact, some degree of ILP is actually required to approach the theoretical maximum Instructions Per Clock (IPC)

Control Flow
- Instructions are issued per 32 threads (warp)
- Divergent branches: threads within a single warp take different paths (if-else, ...); the different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance

291 292 Control Flow Control Flow

Control Flow
- Avoid diverging within a warp; note: some divergence is not necessarily a problem, but large amounts impact execution efficiency
- Example with divergence: if (threadIdx.x > 2) {...} else {...} – branch granularity < warp size
- Example without divergence: if (threadIdx.x / warpSize > 2) {...} else {...} – branch granularity is a whole multiple of warp size

    if ( ... )
    {   // then-clause instructions
    }
    else
    {   // else-clause instructions
    }

293 294

Execution within warps is coherent Execution diverges within a warp

0 1 2 3 30 31 32 33 34 35 62 63 0 1 2 3 30 31 32 33 34 35 62 63 instructions / time / instructions time / instructions

Warp Warp (“vector” of threads) (“vector” of threads)

295 296 Execution diverges within a warp Runtime Math Library and Intrinsics

[Figure: when execution diverges within a warp, the then- and else-instructions are serialized over time for the two halves of the warp.] Solution: group threads with similar control flow.

Runtime Math Library and Intrinsics
- Two types of runtime math library functions:
- __func(): many map directly to the hardware ISA; fast but lower accuracy (see the CUDA Programming Guide for full details); examples: __sinf(x), __expf(x), __powf(x, y)
- func(): compiles to multiple instructions; slower but higher accuracy (5 ulp or less); examples: sin(x), exp(x), pow(x, y)
- A number of additional intrinsics: __sincosf(), __frcp_rz(), ...; explicit IEEE rounding modes (rz, rn, ru, rd)
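A tiny sketch contrasting the two flavours (the kernel is illustrative; nvcc's --use_fast_math switch makes the compiler substitute the intrinsic forms automatically):

    __global__ void compare_sin(float *diff, const float *angle, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float accurate = sinf(angle[i]);    // multiple instructions, higher accuracy
            float fast     = __sinf(angle[i]);  // hardware intrinsic, lower accuracy
            diff[i] = accurate - fast;          // inspect the error if interested
        }
    }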

297 298

Maximizing PCIe Throughput

Use transfers that are of reasonable size (a few MB, at least) Use pinned system memory Overlap memcopies with useful computation
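A sketch combining the three recommendations (buffer size, kernel and grid/block values are illustrative placeholders):

    /* Pinned host memory plus an asynchronous copy overlapped with
       independent work running in another stream. */
    float *h_buf, *d_buf;
    size_t bytes = 64 * 1024 * 1024;                       // "a few MB, at least"
    cudaStream_t copyStream, computeStream;
    dim3 grid(1024), block(256);                           // assumed launch config

    CUDA_CHECK( cudaMallocHost((void**)&h_buf, bytes) );   // pinned (page-locked)
    CUDA_CHECK( cudaMalloc((void**)&d_buf, bytes) );
    CUDA_CHECK( cudaStreamCreate(&copyStream) );
    CUDA_CHECK( cudaStreamCreate(&computeStream) );

    CUDA_CHECK( cudaMemcpyAsync(d_buf, h_buf, bytes,
                                cudaMemcpyHostToDevice, copyStream) );
    other_kernel<<<grid, block, 0, computeStream>>>();     // placeholder: independent work
    CUDA_CHECK( cudaStreamSynchronize(copyStream) );       // before d_buf is used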

OPTIMIZE Optimizing CPU-GPU Interaction: Maximizing PCIe Throughput

299 300 Deploy GPU Optimization Fundamentals

Deploy
- We've removed (or reduced) some bottleneck; our app is now faster while remaining fully functional – let's take advantage of that!
- Don't forget to check correctness at every step

GPU Optimization Fundamentals – recap:
- Develop systematically with APOD
- Expose sufficient parallelism
- Utilize parallel processing resources efficiently

Assess

Deploy Parallelize

Optimize

301 302

Online Resources

www.udacity.com

devtalk.nvidia.com

developer.nvidia.com docs.nvidia.com www.stackoverflow.com

303 MPI+CUDA

[Figure: a cluster of nodes 0 .. n-1; each node has a GPU with GDDR5 memory and a CPU with system memory, connected via PCI-e, plus a network card linking the nodes.]

CUDA Aware MPI

2

304 305

MPI+CUDA MPI+CUDA

[Figure: the same cluster of nodes 0 .. n-1 as above.]

    //MPI rank 0
    MPI_Send(s_buf_d,size,MPI_CHAR,n-1,tag,MPI_COMM_WORLD);

    //MPI rank n-1
    MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

With a CUDA-aware MPI, the device buffers (s_buf_d, r_buf_d) are passed directly to the MPI calls.

3 4

306 307 Outline Message Passing Interface - MPI

Outline
. Short Introduction to MPI
. Unified Virtual Addressing and GPUDirect
. How CUDA-aware MPI works
. Performance Results
. Wrap-up and conclusions

Message Passing Interface - MPI
. Standard to exchange data between processes via messages
— Defines an API to exchange messages: point-to-point (e.g. MPI_Send, MPI_Recv) and collectives (e.g. MPI_Reduce)
. Multiple implementations (open source and commercial)
— Bindings for C/C++, Fortran, Python, …

5 6

308 309

MPI – How to launch a MPI program MPI – A minimal program

mpirun –np 4 ./myapp    (starts four instances of myapp)

    #include <mpi.h>
    int main(int argc, char *argv[])
    {
        int myrank;
        /* Initialize the MPI library */
        MPI_Init(&argc, &argv);
        /* Determine the calling process rank */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        /* Call MPI routines like MPI_Send, MPI_Recv, ... */
        /* Shutdown MPI library */
        MPI_Finalize();
        return 0;
    }

7 8

Unified Virtual Addressing
[Figure: without UVA there are multiple memory spaces (host 0x0000–0xFFFF and GPU 0x0000–0xFFFF); with UVA, CPU and GPU memory share a single address space.]
. One address space for all CPU and GPU memory
— Determine the physical memory location from a pointer value
— Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
. Supported on devices with compute capability 2.0, for 64-bit applications on Linux and on Windows also in TCC mode

9 10

312 313

MPI+CUDA
With UVA and a CUDA-aware MPI, device buffers are passed directly to MPI:

    //MPI rank 0
    MPI_Send(s_buf_d,size,…);

    //MPI rank n-1
    MPI_Recv(r_buf_d,size,…);

With no UVA and a regular MPI, data must be staged through host buffers:

    //MPI rank 0
    cudaMemcpy(s_buf_h,s_buf_d,size,…);
    MPI_Send(s_buf_h,size,…);

    //MPI rank n-1
    MPI_Recv(r_buf_h,size,…);
    cudaMemcpy(r_buf_d,r_buf_h,size,…);

CUDA-aware MPI makes MPI+CUDA easier.

NVIDIA GPUDirect™: Accelerated communication with network & storage devices
[Figure: node with CPU, GPU1, GPU2, system memory, PCI-e chipset and InfiniBand adapter.]

314 315 NVIDIA GPUDirectTM NVIDIA GPUDirectTM Accelerated communication with network & storage devices Peer to Peer Transfers

GPU1 GPU2 GPU1 GPU2 Memory Memory Memory Memory

System System Memory Memory

CPU CPU GPU1 GPU2 GPU1 GPU2

PCI-e Chip IB PCI-e Chip IB set set

13 14

316 317

NVIDIA GPUDirectTM NVIDIA GPUDirectTM Peer to Peer Transfers Support for RDMA

GPU1 GPU2 GPU1 GPU2 Memory Memory Memory Memory

System System Memory Memory

CPU CPU GPU1 GPU2 GPU1 GPU2

PCI-e Chip IB PCI-e Chip IB set set

15 16

318 319 TM NVIDIA GPUDirect CUDA-Aware MPI Support for RDMA

GPU1 GPU2 Memory Memory Example:

System MPI Rank 0 MPI_Send from GPU Buffer Memory MPI Rank 1 MPI_Recv to GPU Buffer

. Show how CUDA+MPI works in principle CPU — Depending on the MPI implementation, message size, system GPU1 GPU2 setup, … situation might be different . Two GPUs in two nodes PCI-e Chip IB set

17 18

320 321

MPI GPU to Remote GPU – GPUDirect support for RDMA
[Figures: the CUDA-aware MPI transfer pipeline (GPU buffer, pinned CUDA buffer, pinned fabric buffer, memcpy and RDMA stages), and the GPUDirect RDMA case where MPI rank 0 sends directly from a GPU buffer over PCI-E DMA to MPI rank 1.]

    MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

19 20

MPI GPU to Remote GPU with GPUDirect support for RDMA: on the timeline the transfer appears as a single MPI_Sendrecv.

Regular MPI GPU to Remote GPU: the device buffers are staged through the host explicitly.

    cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
    cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);

324 325

Regular MPI GPU to Remote GPU – timeline: memcpy D->H, MPI_Sendrecv, memcpy H->D in sequence.

MPI GPU to Remote GPU without GPUDirect
[Figure: MPI rank 0 and MPI rank 1, GPU and host timelines.]

    MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

23 24

MPI GPU to Remote GPU without GPUDirect – timeline: the MPI_Sendrecv itself dominates.

Performance Results, two nodes
[Figure: bandwidth in GB/s versus message size for MVAPICH2-1.9b, comparing D-to-D with host memory staging, D-to-D with GPUDirect, and MPI H-to-H transfers.]
Latency (1 byte): 19.00 µs / 18.34 µs / 1.11 µs

328 329

Example: Jacobi
. Solves the 2D-Poisson equation on a rectangle: Δu(x,y) = 0 ∀ (x,y) ∈ Ω\δΩ, with Dirichlet boundary conditions u(x,y) = f(x,y) on δΩ
. 2D domain decomposition into n x k domains, with ranks (0,0), (0,1), … (0,n-1) down to (k-1,0), (k-1,1), … (k-1,n-1)

While not converged:
. Do the Jacobi step:
    for (int i=1; i < n-1; i++)
      for (int j=1; j < m-1; j++)
        u_new[i][j] = 0.0f - 0.25f*(u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1]);
. Exchange halo with 2 – 4 neighbours
. Swap u_new and u
. Next iteration

27 28
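The halo exchange itself can be handed device pointers directly when the MPI is CUDA-aware. A sketch for the top/bottom direction (row-major storage with m columns and localN interior rows per rank; top and bottom are the neighbour ranks, or MPI_PROC_NULL at the domain edge – all names here are illustrative, not the course's reference code):

    /* u_d points to a device array of (localN+2) x m doubles: rows 1..localN
       are the interior, rows 0 and localN+1 are the halos. */
    MPI_Sendrecv(u_d + 1*m,          m, MPI_DOUBLE, top,    0,  /* send first interior row */
                 u_d + (localN+1)*m, m, MPI_DOUBLE, bottom, 0,  /* recv into bottom halo   */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u_d + localN*m,     m, MPI_DOUBLE, bottom, 1,  /* send last interior row  */
                 u_d,                m, MPI_DOUBLE, top,    1,  /* recv into top halo      */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

With a regular MPI the same exchange would additionally need cudaMemcpy staging of the boundary rows.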

330 331 Jacobi Results (1000 steps) Jacobi Results (1000 steps) weak scaling 4k x 4k per process weak scaling 4k x 4k per process

9 8.00 Communication with top/bottom neighbor 8 7.80 needed 2x1 topology 7 Communication with 7.60 top/bottom and left right 6 7.40 neighbor needed 2x2 topology 5 7.20 4 regular MPI regular MPI 7.00 3 CUDA-aware MPI CUDA-aware MPI 2 6.80 1 6.60

Performance per GPU GLU/s GPU per in Performance 0 GLU/s GPU per in Performance 6.40 1 2 4 8 1 2 4 8 #Processes #Processes

29 30

332 333

LBM D2Q37
. Lattice Boltzmann Method (LBM), D2Q37 model
. Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
. Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate, e.g. to solve the Rayleigh-Taylor instability
. Simulation of large problems requires double precision and many GPUs
. Implementation and benchmarks: F. Schifano (U Ferrara)

Results: strong scaling on 8192x1024 cells
[Figure: MLups versus number of processes (2, 4, 8, 16) for MVAPICH2-1.9b with host memory staging and for CUDA-aware MVAPICH2-1.9b; roughly a 1.25x speed-up.]

334 335 CUDA-Aware MPI Implementations CUDA-Aware Caveats Integrated Support for GPU Computing . MVAPICH2 1.8/1.9b . cudaSetDevice needs to be called before MPI_Init — http://mvapich.cse.ohio-state.edu/overview/mvapich2/ . MPI Environment vars. can be used to set GPU affinity . OpenMPI 1.7 (beta) — MVAPICH2: MV2_COMM_WORLD_LOCAL_RANK — http://www.open-mpi.org/ — OpenMPI: OMPI_COMM_WORLD_LOCAL_RANK . CRAY MPI (MPT 5.6.2) . MV2_USE_CUDA needs to be set for MVAPICH . IBM Platform MPI (8.3) . MPICH_RDMA_ENABLED_CUDA for MPT on Cray . PMPI_GPU_AWARE for Platform MPI . Lib needs to be build with CUDA-awarenes enabled

33 34

336 337

CUDA-Aware MPI + OpenACC Profiling MPI+CUDA applications

. To use CUDA-aware MPI with OpenACC, pass the device address of the buffer to MPI:

    #pragma acc host_data use_device(s_buf)
    MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

. To use a regular MPI with OpenACC, update the host copy first:

    #pragma acc update host(s_buf[0:size])
    MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

Profiling MPI+CUDA applications
. Use nvprof:
    mpirun –np n nvprof --output-profile out.$MV2_COMM_WORLD_RANK ./app
  see docs.nvidia.com for details
. Use CUDA-aware tracing libraries like Score-P or VampirTrace and tools like Vampir

35 36

338 339 Conclusions

Conclusions
. Use CUDA-aware MPI when possible
. Depending on the CUDA version, hardware setup, …, a CUDA-aware MPI gives you
— Ease of programming
— Pipelined data transfer, which automatically provides optimizations when available
. Overlap of CUDA copy and RDMA transfer
— Utilization of the best GPUDirect technology available

Thank you
http://developer.nvidia.com/content/introduction-cuda-aware-mpi

. Examples are available for download at github: https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example 37 38

340 341 1 : Introduction to OpenACC

Sami Saarinen Material on “CSC cluster introduction”, (C) 2014 by CSC – IT Center for Science Ltd. CSC – IT Center for Science Ltd Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Espoo, Finland Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/ 342 343

Three lectures on OpenACC What is OpenACC ?

– Introduction to OpenACC: general introduction with simple examples
– Tuning OpenACC programs: useful tips for better performance, code profiling
– Advanced OpenACC topics: interfacing with CUDA codes and libraries, hybrid host CPU + GPU programming with OpenACC, OpenACC with MPI and OpenMP

The OpenACC API defines a set of compiler directives that allow parallel loops and code regions to be offloaded from the host CPU to the attached GPU(s)
– The directives are very similar to the OpenMP directives
The OpenACC directives allow creation of high-level host CPU + GPU programs without the explicit need to manage data transfers between host CPU and GPU – or to write CUDA code
The API supports both C/C++ and Fortran bindings
– Compilers from PGI, Cray and CAPS – and soon from GNU (4.9?)
More about the OpenACC standard: http://www.openacc.org

344 345 OpenACC directive syntax in a nutshell Execution model in OpenACC

C/C++:
    #pragma acc directive [clause [,] clause] ...]
    { /* A structured C-code block to be executed on a GPU */ }

Fortran:
    !$acc directive [clause [,] clause] ...]
    ... Fortran code section to be executed on a GPU ...
    !$acc end directive

Important directives: parallel, kernels, data, loop, update, host_data, wait, ...
Often used clauses: if (condition), async(handle), ...

Execution model: the application program runs on the host CPU(s); code sections marked with, e.g., !$acc parallel ... !$acc end parallel are offloaded to the GPU.
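A small C sketch of the syntax above (the array names, sizes and the threshold are illustrative; the if and async clauses are the ones listed as often used, and the compiler handles the data transfers implicitly here):

    /* Offload the loop only for large vectors, and launch it asynchronously
       on queue 1 so the host can do other work before waiting. */
    #pragma acc parallel loop if (n > 10000) async(1)
    for (int j = 0; j < n; j++)
        y[j] = y[j] + a * x[j];

    /* ... other host work ... */
    #pragma acc wait(1)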

346 347

Just before getting serious … some info Systems used during the lectures Host Ivy Brigde (alias IvB) nodes Reference systems and compilers used – Intel® Xeon® CPU E5-2640 0 @ 2.50GHz, 15360KB L2-cache, memory 32GB/node – In these lectures we compare GPU performance against – 2 sockets/node x 6-cores/socket, 2-way hyperthreading (HT) enabled multi-threaded host CPUs (Xeon) and Intel XeonPhi (MIC) GPU NVIDIA Tesla Kepler K20m – currently one per non-MIC IvB node – – We use PGI compilers except for some comparisons Intel Architecture sm_35 @ 705MHz, memory 5GB, with UVA, ECC on, 1280KB L2-cache Intel KNC (alias MIC) cards – currently one card per non-GPU IvB node pgaccelinfo – Intel® XeonPhi® 5110 @ 1.053GHz, 512KB L2-cache, memory 8GB/card – Provides a nice hardware information on GPU used – 60-cores, 4-way HT Compilers nvidia-smi – PGI 13.10 with OpenACC – Displays f.ex. whether our GPU is in a sane state or not . CUDA-aware MPI (MVAPICH2-2.0b) – Intel compilers v14.0.1 built on 20131008 Showing timer functions that were used BLAS libraries – We are interested in wall-clock timings only – Intel MKL 11.1.1.106 (also used with PGI except when doing MPI-runs) – NVIDIA cuBLAS 5.0 348 349 pgaccelinfo

CUDA Driver Version: 6000 NVRM version: NVIDIA UNIX nvidia-smi x86_64 Kernel Module 331.20 Device Number: 0 Execution Timeout Device Name: Tesla K20m Device Revision Number: 3.5 +------+ Global Memory Size: 5032706048 | NVIDIA-SMI 331.20 Driver Version: 331.20 | Number of Multiprocessors: 13 Execution Timeout: No |------+------+------+ Number of SP Cores: 2496 Integrated Device: No Number of DP Cores: 832 Can Map Host Memory: Yes | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | Concurrent Copy and Execution: Yes Compute Mode: exclusive-process | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | Total Constant Memory: 65536 Concurrent Kernels: Yes |======+======+======| Total Shared Memory per Block: 49152 ECC Enabled: Yes Registers per Block: 65536 Memory Clock Rate: 2600 MHz | 0 Tesla K20m On | 0000:01:00.0 Off | 0 | Warp Size: 32 Memory Bus Width: 320 bits | N/A 21C P0 22W / 225W | 13MiB / 4799MiB | 3% E. Process | Maximum Threads per Block: 1024 L2 Cache Size: 1310720 bytes +------+------+------+ Maximum Block Dimensions: 1024, 1024, 64 Max Threads Per SMP: 2048 Maximum Grid Dimensions: 2147483647 x Async Engines: 2 65535 x 65535 Unified Addressing: Yes +------+ Maximum Memory Pitch: 2147483647B Initialization time: 35756 microseconds | Compute processes: GPU Memory | Texture Alignment: 512B Current free memory: 4951715840 Clock Rate: 705 MHz Upload time (4MB): 2467 microseconds ( 935 ms pinned) | GPU PID Process name Usage | Download time: 2639 microseconds (1043 ms pinned) |======| Upload bandwidth: 1700 MB/sec (4485 MB/sec pinned) | No running compute processes found | Download bandwidth: 1589 MB/sec (4021 MB/sec pinned) PGI Compiler Option: -ta=nvidia,cc35 +------+

350 351

Recommended portable wall-clock timers DAXPY with OpenMP : y[j] = y[j] + a * x[j] – alternatives to omp_get_wtime() on host CPUs

C/C++ & Fortran callable Fortran-only In CUDA A simple vector-kernel with OpenMP for the host CPU

#include FUNCTION ftimer() #include void daxpy(int n, double a, SUBROUTINE daxpy(n, a, x, y)

#include const double *restrict x, INTEGER :: n, j implicit none cudaEvent_t start, stop; double ctimer_() { real(kind=8) :: ftimer cudaEventCreate(&start); double *restrict y) REAL(kind=8) :: a, x(n), y(n) cudaEventCreate(&stop); struct timeval tbuf; integer :: t, rate { !$omp parallel do cudaEventRecord(start, 0); gettimeofday(&tbuf,NULL); CALL SYSTEM_CLOCK(t,count_rate=rate) #pragma omp parallel for DO j = 1,n

kernel<<>>(args); for (int j=0; j

352 353 DAXPY performance (GFlops/s) : Vector length N=128M DAXPY with OpenACC : y[j] = y[j] + a * x[j] Tesla K20m (CUDA) A simple vector-kernel with OpenACC for the GPU Tesla K20m (OpenACC) void daxpy(int n, double a, SUBROUTINE daxpy(n, a, x, y) const double *restrict x, INTEGER :: n, j double *restrict y) REAL(kind=8) :: a, x(n), y(n) MIC omp 120 { !$acc parallel loop Bigger the #pragma acc parallel loop DO j = 1,n IvB omp 6 better ! for (int j=0; j

354 355

DAXPY with CUDA : y[j] = y[j] + a * x[j] Defining OpenACC parallel regions

    // daxpy_cuda.cu
    __global__ void daxpy(int n, double a, const double *x, double *y)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        while (tid < n) {
            y[tid] += a * x[tid];
            tid += stride;
        }
    }

    int main() {
        ...
        int n = 1<<27; // 128M
        int vlen = 256;
        dim3 blockdim = dim3(vlen,1,1);
        // dim3 griddim = dim3(65536,1,1);
        dim3 griddim = dim3((n+blockdim.x-1)/blockdim.x,1,1);
        ...
        daxpy <<<griddim, blockdim>>> (n, a, d_x, d_y);

Defining OpenACC parallel regions:
Two approaches to define parallel regions for GPU code:
– PARALLEL and KERNELS
Both are equally valid and can perform equally well
PARALLEL requires careful analysis by the programmer to make sure the expressed parallelism is safe
– Often preferred when translating from OpenMP to OpenACC
KERNELS relies on the compiler's parallel analysis and produces a safely parallelized region
– Even large code blocks can be covered under this directive

356 357 PARALLEL vs. KERNELS (1) PARALLEL vs. KERNELS (2) w/o ”acc loop” in PARALLEL the loop would run redundantly on every thread ACC PARALLEL merged with LOOP ; ACC KERNELS does not require LOOP

!$acc parallel !$acc kernels !$acc parallel loop !$acc kernels loop ! “loop” optional !$acc loop ! Required !$acc loop ! Optional do j=1,n do j=1,n do j=1,n do j=1,n a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) enddo enddo enddo enddo !$acc end parallel ! Optional here !$acc end kernels ! Always prefer !$acc end parallel !$acc end kernels

#pragma acc parallel #pragma acc kernels #pragma acc parallel loop #pragma acc kernels // loop optional

{ { for (int j=0; j

358 359

Traps & pitfalls with PARALLEL (1) : OpenMP vs. OpenACC Traps & pitfalls with PARALLEL (2) : A correction LOOPs within PARALLEL executed independently aka “NOWAIT” Add LOOP VECTOR, or use separate PARALLEL or use KERNELS

// Normalization : OpenMP // OpenACC PARALLEL (wrong) // Behaves like OpenMP NOWAIT // OpenACC PARALLEL (ok) // Separate PARALLEL loops // or use OpenACC KERNELS sum = 0; sum = 0; sum = 0; sum = 0; sum = 0; sum = 0; #pragma omp parallel #pragma acc parallel #pragma omp parallel #pragma acc parallel #pragma acc parallel loop #pragma acc kernels { { { { for (j=0; j

} }

360 361 OpenACC PARALLEL –directive : summary OpenACC KERNELS –directive : summary

First of the two methods to define parallel region An alternative definition of parallel OpenACC region – May contain one or more parallel loops – As with PARALLEL may also contain one or more loops The programmer is responsible for making sure there are KERNELS will be safely parallelized since compiler, not no dependencies in the consecutive loops programmer is held responsible – Enables highly parallel execution w/o checks – Compilers do make a parallel dependency analysis for a Thus there is no implicit barrier between loops KERNELS-region to decide what is safe – unlike PARALLEL – Consecutive loops are allowed to start while previous ones Generated kernels (e.g. separate loops) could still be run are still in progress : strong likelihood for incorrect results independently if dependencies are not preventing this, e.g. . Behaves a bit like OpenMP NOWAIT would do #pragma acc kernels loop independent – Using vector-loops (where applicable) seems(!) to imply sync

362 363

What happens to sequential code in OpenACC ? Three levels of parallelism in OpenACC

What happens to sequential code in OpenACC?
– Usually safer to create it via a KERNELS region, if applicable (e.g., a small sequential loop wrapped in #pragma acc kernels)
– Single statements are handled as loops with a vector length of one – both in KERNELS and PARALLEL
– One could also use "acc loop seq" to enforce sequential execution

Three levels of parallelism in OpenACC
gang – equivalent to a CUDA thread block; the highest level of parallelism, where "gangs" work independently of each other without synchronization
worker – alias CUDA warp
vector – the SIMT lanes within a worker, i.e. the CUDA threads of a warp (the finest level)

Three levels of parallelism in OpenACC (cont'd)
We will cover in the exercises how to suggest different parallelisms and chunk sizes for PARALLEL and KERNELS. Briefly, with a PARALLEL construct this is done with clauses such as num_gangs(100) num_workers(16) vector_length(64); with a KERNELS construct, with loop gang / loop worker / loop vector clauses on the individual loops (see the sketch below).
This is especially suitable for 3D finite-difference stencils, or heavy triple loops with truly independent operations. The OpenACC compiler usually decides (also for PARALLEL regions) how to split the work across these levels; you can provide suggestions to the compiler for the mapping it should use.
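A cleaned-up sketch of mapping a triple loop onto the three levels (the array, the loop bounds and process() are illustrative; without the clauses the compiler picks its own mapping):

    #pragma acc parallel num_gangs(100) num_workers(16) vector_length(64)
    {
      #pragma acc loop gang
      for (int i = 0; i < nx; i++)
        #pragma acc loop worker
        for (int j = 0; j < ny; j++)
          #pragma acc loop vector
          for (int k = 0; k < nz; k++)
            arr3d[i][j][k] = process(i, j, k);   // hypothetical independent work
    }

With KERNELS the same hints would be given as #pragma acc kernels plus loop gang, loop worker and loop vector on the individual loops.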

2D-Laplace solver with Jacobi 5-point stencil update The driver loop for both OpenMP and OpenACC

2 Laplace equation ∇ u = 0 with fixed boundary conditions allocate (u(0:nx+1,0:ny+1), unew(0:nx+1,0:ny+1)) Discretized in a square rectangle using 5-point stencil call init(u) ; call init(unew) – Test case uses 2400 x 2400 grid for interior points norm = eps + 1 Solver uses slow Jacobi iteration because it’s iter = 0 – Easy to parallelize – and explain do while (iter <= maxiter .and. norm >= eps) Analogical cases found in many finite difference schemes call update (unew, u) therefore this technique is useful to understand call update (u, unew, norm) A good starting point is an OpenMP implementation iter = iter + 2 enddo

368 369 Stencil update loop with OpenMP Stencil update loop with OpenACC

norm = 0 norm = 0 !$omp parallel do reduction(max:norm) private(i,j) !$acc parallel loop reduction(max:norm) private(i,j) do j=1,ny do j=1,ny do i=1,nx do i=1,nx U_new(i,j) = 0.25*( & U_new(i,j) = 0.25*( & & U_old(i-1,j) + U_old(i+1,j) + & & U_old(i-1,j) + U_old(i+1,j) + & & U_old(i,j-1) + U_old(i,j+1)) & U_old(i,j-1) + U_old(i,j+1)) norm = max(norm,abs(U_new(i,j) – U_old(i,j))) norm = max(norm,abs(U_new(i,j) – U_old(i,j))) enddo enddo enddo enddo

370 371

Version#1 of Jacobi 5-point stencil update : N=2400 Performance of Jacobi update : poor GPU results

Bigger the 4500 better ! Tesla K20m 4000 3500

IvB serial

3000 ]

/s 2500 IvB omp 12

2000 MLups Smaller the [ Performance 1500 Poor GPU better ! MIC omp 240 performance 1000 here ?! 500 0 5 10 15 20 25 30 0 MIC omp 240 IvB omp 12 IvB serial Tesla K20m Tesla K20m MIC omp 240 IvB omp 12 IvB serial Time(s) 0,961 2,1 10,6 28,1 MLups/s 148 4326 1976 393 372 373 pgireport pgfortran -O4 -fast -Mvect -Minline -Minfo=all -acc file.F90

So what’s going on then ? 91: !$acc parallel loop reduction(max:norm) private(i,j)

VVVV-- [update] : Accelerator kernel generated Compile with pgireport in front of PGI-compiler command VVVV-- [update] : Generating present_or_copyin(old(:nx+1,:ny+1)) VVVV-- [update] : Generating present_or_copyout(new(1:nx,1:ny)) – Developed at CSC – produces annotated listing file .lst VVVV-- [update] : Generating NVIDIA code ... To quickly reveal performance bottlenecks, use nvprof –profiler provided by NVIDIA for GPU regions 92: do j=1,ny

– Easy to use : nvprof /my/path/stencil.x 2400 2400 VVVV-- [update] : !$acc loop gang ! blockidx%x VVVV-- [update] : Loop not fused: no successor loop – By default ( -s ) produces very useful summary output 93: do i=1,nx – Option --print-gpu-trace provides timeline breakdown VVVV-- [update] : !$acc loop vector(256) ! threadidx%x – A file for NVIDIA visual profiler (nvvp) with option -o file VVVV-- [update] : Generated 4 alternate versions of the loop VVVV-- [update] : Generated vector sse code for the loop Additional information with PGI’s environment variables: VVVV-- [update] : Generated 3 prefetch instructions for the loop export PGI_ACC_NOTIFY=2 # verbosity on each data transfer 94: new(i,j) = factor*(old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1)) export PGI_ACC_TIME=1 # timing summary for each kernel 95: norm = max(norm,abs(new(i,j) - old(i,j)))

VVVV-- [update] : Max reduction generated for norm

374 375

nvprof performance profile output (version#1) PGI_ACC_TIME=1 (version#1) Data xfers > 10s in total Time > 11 sec Time(%) Time Calls Avg Min Max Name /share/sbs/research/perfstat/openacc/testing/stencil/openacc_basic/PGIACC_stencil.F90 update NVIDIA devicenum=0 time(us): 11,350,717 48.52 5.38s 2533 2.13ms 928ns 2.73ms [CUDA memcpy HtoD] 91: compute region reached 361 times Source file 45.92 5.09s 2539 2.01ms 2.53us 2.60ms [CUDA memcpy DtoH] 91: data copyin reached 1083 times device time(us): total=2,754,043 max=2,827 min=2,013 avg=2,542 91: kernel launched 361 times 2.95 326.90ms 361 905.54us 901.64us 910.15us update_91_gpu grid: [2400] block: [256] Data in : 2,7s 2.58 286.36ms 361 793.25us 784.87us 800.26us update_99_gpu device time(us): total=341,368 max=1,043 min=941 avg=945 0.03 3.26ms 361 9.02us 8.83us 9.25us update_95_gpu_red elapsed time(us): total=346,142 max=1,055 min=953 avg=958 91: reduction kernel launched 361 times Line grid: [1] block: [256] Data out : 2,6s 0.01 686.69us 2 343.35us 343.11us 343.58us init_70_gpu device time(us): total=8,685 max=79 min=22 avg=24 numbers 0.00 8.96us 2 4.48us 4.45us 4.51us init_78_gpu elapsed time(us): total=13,088 max=91 min=33 avg=36 0.00 5.31us 2 2.66us 2.59us 2.72us init_77_gpu 98: data copyout reached 1083 times device time(us): total=2,595,456 max=2,822 min=1,931 avg=2,396

Actual computation ~ 0,6s

376 377 Important directives & clauses touched in this lecture

OpenACC – kernels and parallel But we can do better – much, much better !! – loop sometimes with gang, worker, vector, seq – kernels loop with independent … to be continued … – reduction([+|max]:var) –clause OpenMP – parallel – for/DO sometimes with reduction([+|max]:var) and nowait

378 379

Summary

OpenACC provides a portable way to harness computing power of GPU accelerators without need to learn CUDA Simple and directive based syntax resembles OpenMP and enables very rapid deployment of GPUs Try to identify the main bottlenecks of your code on the host CPUs and translate the program to use OpenACC gradually i.e. not everything has to be changed instantly As with any parallel programming model, there are good practices in place for OpenACC in order to obtain better performance – some of them covered in the next lectures

380 2 : Tuning OpenACC programs

Sami Saarinen Material on “CSC cluster introduction”, (C) 2014 by CSC – IT Center for Science Ltd. CSC – IT Center for Science Ltd Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Espoo, Finland Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/ 381 382

Three lectures on OpenACC

Introduction to OpenACC – General introduction with simple examples Quote from the previous lecture : Tuning OpenACC programs – Useful tips for better performance “But we can do better – much, much better !!” – Code profiling Advanced OpenACC topics – Interfacing with CUDA codes and libraries – Hybrid host CPU + GPU programming with OpenACC – OpenACC with MPI and OpenMP

383 384 PGI_ACC_TIME=1 (version#2) nvprof performance profile output (version#2) Time ~ 0,6 sec

/share/sbs/research/perfstat/openacc/testing/stencil/openacc_data/PGIACC_stencil.F90 Time(%) Time Calls Avg Min Max Name update NVIDIA devicenum=0 time(us): 632,050 52.83 326.86ms 361 905.42us 901.76us 908.74us update_95_gpu No data transfers !! 95: compute region reached 361 times 46.35 286.72ms 361 794.25us 786.60us 801.06us update_103_gpu 95: kernel launched 361 times 0.53 3.26ms 361 9.02us 8.80us 9.28us update_99_gpu_red grid: [2400] block: [256] device time(us): total=332,326 max=990 min=916 avg=920 0.13 782.98us 361 2.17us 2.11us 2.62us [CUDA memcpy DtoH] elapsed time(us): total=336,283 max=1,001 min=926 avg=931 0.11 688.10us 2 344.05us 342.72us 345.38us init_74_gpu 95: reduction kernel launched 361 times 0.06 346.34us 361 959ns 928ns 1.70us [CUDA memcpy HtoD] grid: [1] block: [256] device time(us): total=8,083 max=70 min=21 avg=22 0.00 5.60us 2 2.80us 2.72us 2.88us init_82_gpu elapsed time(us): total=12,078 max=81 min=31 avg=33 0.00 4.90us 2 2.45us 2.27us 2.62us init_81_gpu

Computation still ~ 0,6s Data transfer cost now ~ 1ms

385 386

Better results achieved through DATA –directive Driving loop for GPU-only data with OpenACC

Probably the single most important directive in OpenACC allocate (u(0:nx+1,0:ny+1), unew(0:nx+1,0:ny+1)) to guarantee GPU program performance Controls data creation on GPUs and can lead to radically call init(u) ; call init(unew) reduced data transfers between GPU and host norm = eps + 1 – The aim is to keep data on GPUs as long as possible iter = 0 do while (iter <= maxiter .and. norm >= eps) float A[100], B[10], C[20], D[3][3]; REAL A(100), B(10), C(20), D(3,3) call update (unew, u) #pragma acc data create(A) copy(B[3:7]) \ !$acc data create(A) copy(B(4:10)) & copyin(C[0:20]) copyout(D) call update (u, unew, norm) !$acc& copyout(C(1:20)) copyout(D) { iter = iter + 2 #pragma acc kernels ... !$acc kernels ... !$acc end kernels #pragma acc parallel ... !$acc parallel ... !$acc end parallel enddo #pragma acc data ... // nesting ok !$acc data ... !$acc end data } !$acc end data 387 388 Stencil update loop with OpenACC : pgireport pgfortran -O4 -fast -Mvect -Minline -Minfo=all -acc file.F90

95: !$acc parallel loop reduction(max:norm) private(i,j) present(new,old) Data already present on GPU VVVV-- [update] : Generating present(old(:,:)) VVVV-- [update] : Generating present(new(:,:)) norm = 0 VVVV-- [update] : Accelerator kernel generated VVVV-- [update] : Generating NVIDIA code !$acc parallel loop reduction(max:norm) private(i,j) & ... 96: do j=1,ny !$acc& present(U_new, U_old) do j=1,ny VVVV-- [update] : !$acc loop gang ! blockidx%x VVVV-- [update] : Loop not fused: no successor loop do i=1,nx 97: do i=1,nx U_new(i,j) = 0.25*( & & U_old(i-1,j) + U_old(i+1,j) + & VVVV-- [update] : !$acc loop vector(256) ! threadidx%x VVVV-- [update] : Generated 4 alternate versions of the loop & U_old(i,j-1) + U_old(i,j+1)) VVVV-- [update] : Generated vector sse code for the loop VVVV-- [update] : Generated 3 prefetch instructions for the loop norm = max(norm,abs(U_new(i,j) – U_old(i,j))) enddo 98: new(i,j) = factor*(old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1)) 99: norm = max(norm,abs(new(i,j) - old(i,j))) enddo ...

389 390

Version#2 of Jacobi 5-point stencil update : N=2400 Performance of Jacobi update – with !$acc data

Smaller the Bigger the 7000 better ! better ! Tesla K20m 6000

5000

IvB serial ]

/s 4000

IvB omp 12 3000

MLups [ Performance 2000 MIC omp 240 1000

0 5 10 15 20 25 30 0 Tesla MIC omp IvB omp 12 IvB serial MIC omp 240 IvB omp 12 IvB serial Tesla K20m K20m 240 Time(s) V1 0,961 2,1 10,6 28,1 MLups/s 297 4327 1976 393 Time(s) V2 0,622 With $acc data 6280 391 392 Cornerstones for accelerator performance But what happens without DATA –directive ? the DATA –directive plays a major role here Every time upon entering either PARALLEL or KERNELS Always give GPUs lots of computational work to do regions input data gets copied into GPU from host CPU, – Independent work in triple loops or 3D-stencils etc. and output data returned back when leaving the regions Pay close attention to data transfers between host and This creates a lot of unnecessary traffic between GPU GPU and usage of OpenACC data on devices (host/GPU) and host CPU – Prefer to create data directly on the GPU and keep it there DATA –directive creates a scope within which data can be Focus on data re-use whilst it resides on the GPU to avoid declared to stay on GPU, potentially never touching CPU memory bandwidth bottlenecks due to transfers back It is possible to control data movement between host and forth between host and GPUs and GPU, or even have nested DATA –directives (scopes)

393 394

Explicit control of data movement Explicit control of data movement (cont’d)

You can add data clauses to PARALLEL and KERNELS create – data array gets allocated on GPU device without directives for providing finer and more explicit control of copying data between host and device what needs to be transferred between host and GPU present – assumes data array is already present on GPU For arrays (vectors, matrices, multidimensional data), you copy – allocates data array on GPU and upon entering can provide arrays with a data range, which currently has the PARALLEL/KERNELS region copies it from host to GPU to be contiguous and back to host when leaving the region copyin – allocates data array on GPU and only copies it In Fortran the range is given using Fortran array notation from host to GPU upon start of the parallel region i.e. ARRAY(start_index:end_index) copyout – allocates data array on GPU and copies it back In C/C++ the range is given by ARRAY[start_index:length] to host when leaving the parallel region

395 396 Explicit control of data movement (cont’d) Examples of controlling data movement

int a[100], *b, d[3][3]; In C/C++ INTEGER A(0:99), D(3,3) pcopy (or present_and_copy) – same as present and start : b = (int *)malloc(sizeof(*b)*10) INTEGER, ALLOCATABLE :: B(:) In Fortran length start : end copy together, but checks presence of an array in GPU #pragma acc data create(a) ALLOCATE(B(0:9)) { !$acc data create(A) before attempting to copy data to GPU when entering #pragma acc kernels copyin(b[0:1]) !$acc kernels copyin(B(0:0)) the parallel region. Data gets automatically copied back { a[0] = b[0]; } A(0) = B(0) #pragma acc parallel copyout(d) !$acc end kernels to host upon leaving the region { !$acc parallel copyout(D) pcopyin (or present_and_copyin) – similar as in the #pragma acc loop collapse(2) !$acc loop collapse(2) for (int i=0; i<3; ++i) DO J=1,3; DO I=1,3; D(I,J) = (I-1)*3 + J previous merges features of present and copyin for (int j=0; j<3; ++j) !$acc end parallel d[i][j] = i*3 + j + 1; !$acc data copy(B) pcopyout (or present_and_copyout) – merges features } !$acc kernels of present and copyout #pragma acc data copy(b) B(:) = B(:) + 1 #pragma acc kernels !$acc end kernels { for (int i; i<10; ++i) b[i] += 1; } !$acc end data } !$acc end data 397 398

An example of creating snapshots Using UPDATE –directive !$acc data create(u, unew) In addition to explicit data clauses in PARALLEL and do while (iter <= maxiter .and. norm >= eps) KERNELS -directives, you can also UPDATE HOST and/or if (mod(iter,100) == 0) then ! Snapshot GPU device arrays whilst in parallel or DATA region !$acc update host(u) This comes handy if you have to produce periodical call visualize(u) ! Runs on HOST snapshots of data whilst in highly parallel GPU endif call update(unew, u) processing: e.g. pass data to host for visualization call update(u, unew, norm) – Often done conditionally (under IF -clause) and iter = iter + 2 asynchronously (ASYNC -clause, later followed by WAIT) enddo Reserve true as well a device array can also receive an !$acc end data UPDATE during simulation (via UPDATE DEVICE) call visualize(u) ! The final, converged field

399 400 In a more asynchronous manner h = 1 ! A handle for async operations, >= 0 !$acc data create(u, unew) do while (iter <= maxiter .and. norm >= eps) !$acc update host(u) async(h) if (mod(iter,100)==0) call update (unew, u) if (mod(iter,100)==0) then Start asynchronous A more serious tuning example !$acc wait(h) transfer to host call visualize(u) endif Wait for async call update (u, unew, norm) operation to iter = iter + 2 complete and enddo visualize again !$acc end data call visualize(u)

401 402

QR-decomposition by Modified Gram-Schmidt (MGS) The basic QR-decomp algorithm with MGS

Inspiration from Ron Farber’s DrDobbs article in 2012 A pseudo code for A  Q * R with MGS, approx. 2mn2 flp. opers for k = 1 to n Starting off from an OpenMP version on the host R(k,k) = tmp = NORM2 ( A(1:m,k) ) Q(1:m,k) = A(1:m,k) / tmp Mapping the OpenMP directives to OpenACC for j = k+1 to n Simple GPU profiling to identify unexpected bottlenecks R(k,j) = DOT_PRODUCT( Q(1:m,k), A(1:m,j) ) A(1:m,j) = A(1:m,j) – Q(1:m,k) * R(k,j) Transposed arrays also beneficial on host CPU side end # for j = k+1 to n end # for k = 1 to n Performance with OpenACC – n/a in OpenMP Computational kernels involve Also making MatMul (MM) much faster with OpenACC – Creating a near singular (ill-conditioned) A-matrix . Not time critical, but probes MGS-algorithm’s accuracy brilliantly The final profile and performance results – MGS sweep for creating Q and R from A, and directly on GPU device – Run MatMul to check that Q * R is close enough to the original A-matrix

403 404 A  Q * R with MGS : OpenMP & OpenACC Loop Difference between row and column –major orders indexing (version#1) “wrong way 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #pragma omp parallel reduction(+:tmp) #pragma acc kernels present(wrk,Q,R) around” – for (int k = 0; k

405 406

Version#1 of Modified Gram-Schmidt : N=2400

[Figure: run-time bar chart (smaller the better) comparing IvB serial, IvB omp 12 and MIC omp 240.]

MM : QR → Q * R : must be close enough to A (version#1)

// -- OpenMP --
double Q[nrows][nmin], R[nmin][ncols];
double QR[nrows][ncols];
double sum;
#pragma omp parallel for \
        reduction(+:sum)
for (int i = 0; i < ...

// -- OpenACC --
// The same declarations as in OpenMP
#pragma acc data present(QR,Q,R)
{
  #pragma acc kernels
  for (int i = 0; i < ...

407 408 nvprof profile output – version#1 Lessons learned from version#1

Time(%)      Time  Calls       Avg       Min       Max  Name
  58.42   171.03s      1   171.03s   171.03s   171.03s  MGS_247_gpu
  41.58   121.72s      1   121.72s   121.72s   121.72s  MatMul_303_gpu
   0.00    8.59ms      1    8.59ms    8.59ms    8.59ms  Check_282_gpu
   0.00  930.79us      1  930.79us  930.79us  930.79us  InitMat_201_gpu
   0.00  660.58us      1  660.58us  660.58us  660.58us  MGS_232_gpu
   0.00    8.96us      1    8.96us    8.96us    8.96us  Check_282_gpu_red
   0.00    2.53us      1    2.53us    2.53us    2.53us  [CUDA memcpy DtoH]
   0.00    1.66us      1    1.66us    1.66us    1.66us  [CUDA memcpy HtoD]

The GPU version’s performance is far below what was expected, and it was beaten even by a single host CPU core
– Both MGS and MM fail to deliver performance
And yet we do have the DATA directives in place …
Looking solely into the OpenACC version, and perhaps nvvp (NVIDIA Visual Profiler) charts, it can be figured out that
– in MGS the work ("wrk") and Q arrays need to be transposed
– more parallelism is needed in MGS – no matter what
– MM needs revisiting (but no cuBLAS/DGEMM in this lecture!)

409 410

A → Q * R with MGS : Optimized OpenACC version#2

double wrkT[ncols][nrows], Qt[nmin][nrows]; // Transposes
#pragma acc data create(wrkT,Qt) present(A,Q,R)
{
  ...
  R[k][k] = tmp = sqrt(tmp);
  for (int i = 0; i < ...

// The heaviest kernel (’MGS_266_gpu’):
#pragma acc parallel loop
for (int j = k+1; j < ...

nvprof profile output – version#2

[Table: nvprof output for version#2; the heaviest kernel is MGS_266_gpu.]
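Relating to the version#2 kernel above, here is a minimal sketch of what the transposed inner update can look like; the exact array shapes (wrkT[ncols][nrows], Qt[ncols][nrows], R[ncols][ncols]) and loop bounds are assumptions for illustration, not the slide’s full listing. The point is that with the work matrix stored transposed, the outer j-loop supplies the parallelism and the inner i-loops are unit-stride.

/* One k-step of the transposed MGS update: column j of the work matrix is
 * the contiguous row wrkT[j][..], so each gang streams through memory. */
void mgs_update_k(int k, int nrows, int ncols,
                  double wrkT[ncols][nrows],
                  double Qt[ncols][nrows],
                  double R[ncols][ncols])
{
  #pragma acc parallel loop present(wrkT, Qt, R)
  for (int j = k + 1; j < ncols; ++j) {
    double dot = 0.0;
    #pragma acc loop vector reduction(+:dot)
    for (int i = 0; i < nrows; ++i)        /* R(k,j) = Q(:,k) . A(:,j)  */
      dot += Qt[k][i] * wrkT[j][i];
    R[k][j] = dot;
    #pragma acc loop vector
    for (int i = 0; i < nrows; ++i)        /* A(:,j) -= Q(:,k) * R(k,j) */
      wrkT[j][i] -= Qt[k][i] * dot;
  }
}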

Version#2 of Modified Gram-Schmidt : N=2400

[Figure: run-time bar chart for version#2, with IvB serial and IvB omp 12 among the entries; annotations "Dzjeez!" and "Ain’t there any BLAS??".]

MM : QR → Q * R : must be close enough to A (version#2 – OpenMP blocked)

// -- OpenMP (version#1) --
double Q[nrows][nmin], R[nmin][ncols];
double QR[nrows][ncols];
double sum;
...

// -- OpenMP (version#2) --
int bs = 32;                  // block size
int iimax, jjmax, kkmax;
#pragma omp parallel
{
  #pragma omp for
  for (int i=0; i < ...

415 416 Performance of the MGS-routine (A  Q * R) Performance of the MatMul-routine (Q * R  A)

MGS-routine (A → Q * R), performance in GFlops/s, bigger the better:

                   Tesla K20m   MIC omp 240   IvB omp 12   IvB serial
  Version#1             0.162         0.755        4.056        0.352
  Version#2            19.063        30.373       18.186        2.809

MatMul-routine (Q * R → A), performance in GFlops/s, bigger the better:

                   Tesla K20m   MIC omp 240   IvB omp 12   IvB serial
  Version#1             0.227         1.952        5.922        0.533
  Version#2            36.427         6.724       23.303        2.189

417 418

Important directives & clauses touched in this lecture

OpenACC
– data with
  . create, present, copy[in|out], pcopy[in|out]
– update host [and update device] with
  . async(handle) and if(cond)
  . wait(handle)
– loop collapse(2), loop vector
– reduction([+|max]:var) clauses
OpenMP host
– parallel
– for / DO, sometimes with reduction(+:var) and nowait

Summary

The OpenACC DATA directive provides an essential framework for efficient GPU programming with OpenACC.
A straightforward conversion from OpenMP directives to OpenACC does not always give the best performance – as we have seen in the Modified Gram-Schmidt QR-decomposition program.
Sometimes the speedups with OpenACC compared to the OpenMP versions are just phenomenal.
Some more excitement to come in the next lecture!

419 420 3 : Advanced OpenACC topics

Sami Saarinen, CSC – IT Center for Science Ltd, Espoo, Finland
Material on "Advanced OpenACC topics", (C) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/
421 422

Three lectures on OpenACC

Introduction to OpenACC
– General introduction with simple examples
Tuning OpenACC programs
– Useful tips for better performance
– Code profiling
Advanced OpenACC topics
– Interfacing with CUDA codes and libraries
– Hybrid host CPU + GPU programming with OpenACC
– OpenACC with MPI and OpenMP

CUDA and OpenACC interoperability

In the next couple of slides we learn how to access CUDA kernels from OpenACC code – and vice versa.
The new OpenACC directive host_data and the data-directive clause deviceptr play key roles in making interoperability with CUDA possible.
As a test example we will go back to our very first example from lecture 1 – DAXPY.
– In addition we show data initialization and vector summation – written in OpenACC, but called from CUDA !!

423 424 The host_data –directive Calling CUDA-kernel from OpenACC-program

Makes the addresses of device-resident data arrays accessible to the host CPU inside a valid DATA scope.
– CUDA kernels must be called with the help of OpenACC host_data.

// C
extern void some_CUDA_kernel_wrapper(int m, int *ia, float *x, float *y);

void calc(const int m, float y[m], float x[m], int ia[2*m])
{
  #pragma acc data present(ia[0:2*m]) \
              copy(y[0:m]) copyout(x[0:m])
  {
    #pragma acc host_data use_device(ia,x,y)
    {
      some_CUDA_kernel_wrapper(m,ia,x,y);
    } // #pragma acc host_data
  } // #pragma acc data
}

! Fortran
SUBROUTINE CALC(m, Y, X, IA)
  INTEGER :: m, ia(2*m)
  REAL :: Y(m), X(m)
  !$acc data present(ia) copy(y) &
  !$acc& copyout(x)
  !$acc host_data use_device(ia,x,y)
  CALL some_CUDA_kernel_wrapper(m,ia,x,y)
  !$acc end host_data
  !$acc end data
END SUBROUTINE CALC

Calling a CUDA kernel from an OpenACC program:
In this scenario we have a (main) program written in C/C++ (or Fortran), and this driver uses OpenACC directives.
The interface function in the CUDA file must have extern "C" void func(…) linkage.
The CUDA codes are compiled with the NVIDIA nvcc compiler, e.g.
  nvcc -c -O4 --restrict -arch=sm_35 daxpy_cuda.cu
The OpenACC codes are compiled with the PGI compiler, e.g.
  pgcc -c -acc -O4 call_cuda_from_openacc.c
Linking with the PGI compiler must also include -acc -Mcuda, e.g.
  pgcc -acc -Mcuda call_cuda_from_openacc.o daxpy_cuda.o
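For completeness, the extern "C" contract between the nvcc-compiled file and the OpenACC C code can live in a small shared header; a minimal sketch (the header name and include guard are hypothetical, only the wrapper prototype comes from the slide):

/* cuda_kernels.h – hypothetical shared header, included both by the OpenACC
 * C source (compiled with pgcc) and by the .cu file (compiled with nvcc),
 * so the wrapper is seen with plain C linkage on both sides. */
#ifndef CUDA_KERNELS_H
#define CUDA_KERNELS_H

#ifdef __cplusplus
extern "C" {
#endif

/* Implemented in the .cu file: launches the CUDA kernel on the device
 * pointers handed over by the OpenACC host_data use_device region. */
void some_CUDA_kernel_wrapper(int m, int *ia, float *x, float *y);

#ifdef __cplusplus
}
#endif

#endif /* CUDA_KERNELS_H */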

425 426

Calling a CUDA kernel from an OpenACC program

// call_cuda_from_openacc.c
extern void daxpy(int n, double a, const double *x, double *y);

int main(int argc, char *argv[])
{
  int n = (argc > 1) ? atoi(argv[1]) : (1 << 27);
  const double a = 2.0;
  double *x = malloc(n * sizeof(*x));
  double *y = malloc(n * sizeof(*y));
  #pragma acc data create(x[0:n], y[0:n])
  {
    // Initialize x & y
    ...
  } // #pragma acc data
}

// daxpy_cuda.cu
__global__
void daxpy_kernel(int n, double a, const double *x, double *y)
{ // The actual CUDA-kernel
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  while (tid < n) {
    y[tid] += a * x[tid];
    tid += stride;
  }
}

extern "C" void daxpy(int n, double a, const double *x, double *y)
{
  ...
  daxpy_kernel<<<...>>>(n, a, x, y);
}

The deviceptr data clause

In DATA, PARALLEL or KERNELS constructs it tells that the array pointers belong to the device side – thus the following routines become CUDA-callable (requires extern "C" void prototypes):

// C
void daxpy(int n, double a, const double *x, double *y)
{
  #pragma acc parallel deviceptr(x,y)
  {
    #pragma acc loop
    for (int j=0; j<n; ++j)
      y[j] += a * x[j];
  }
}

! Fortran
SUBROUTINE daxpy(n, a, X, Y)
  INTEGER :: n
  REAL(8) :: a, X(n), Y(n)
  !$acc kernels deviceptr(X,Y)
  Y(:) = Y(:) + a * X(:)
  !$acc end kernels
END SUBROUTINE daxpy

427 428 Calling OpenACC-routines from CUDA-programs

Calling OpenACC routines from CUDA programs

In this scenario we have a (main) program written in CUDA, and it calls functions written in C/C++/Fortran with OpenACC extensions.
Linking must still be done with PGI using -acc -Mcuda, e.g.
  pgcc -acc -Mcuda call_openacc_from_cuda.o daxpy_openacc.o

// call_openacc_from_cuda.cu
#include <stdio.h>
#include <stdlib.h>

extern "C" void daxpy(int n, double a, const double *x, double *y);
extern "C" void init(int n, double scaling, double *v);

int main(int argc, char *argv[])
{
  int n = (argc > 1) ? atoi(argv[1]) : (1 << 27);
  double *x, *y, *s, tmp;
  ...
  cudaMemcpy(&tmp, s, (size_t)1*sizeof(*s),
             cudaMemcpyDeviceToHost);  // chksum
  cudaFree(x); cudaFree(y); cudaFree(s);
}

// daxpy_openacc.c
void daxpy(int n, double a,
           const double *restrict x, double *restrict y)
{
  #pragma acc parallel loop deviceptr(x,y)
  for (int j=0; j<n; ++j)
    y[j] += a * x[j];
}

void init(int n, double scaling, double *v)
{
  #pragma acc parallel loop deviceptr(v)
  for (int j=0; j<n; ++j)
    ...
}

// Inside the checksum routine of daxpy_openacc.c:
  // *res = s; // not supported for deviceptr
  #pragma acc loop seq
  for (int j=0; j<1; ++j) res[j] = s;

429 430

MatMul (MM) using the cuBLAS library

We already got familiar with MM in the previous lecture, when verifying the correctness of our MGS QR-decomposition.
Pushing a little further and introducing calls to the xGEMM library functions from the BLAS library
– The "x" stands for D and S, as in double and single precision math.
For GPUs this function resides in NVIDIA’s cuBLAS library
– The OpenACC routine needs a host_data use_device(...) wrap.
We compare results – double and single precision math – against
– the host CPU using the Intel MKL library with the PGI and Intel compilers
– the Intel MIC card, where we use MKL with the Intel compiler (native).

The target problem

Matrix-matrix multiply updates the input/output matrix C (M x N) using the input matrices A (M x K) and B (K x N) in the following way:
  [C] = α[A][B] + β[C]
Traditionally the xGEMM routines are catered for Fortran access, i.e. matrices stored in column-major order.
The same routines can also be used from C
– using Fortran access by storing the matrices in a vector, or
– using row-major matrices and calculating a "fake" transpose:
  [C]^T = α[B]^T[A]^T + β[C]^T
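To see why the fake transpose works, here is a tiny self-contained check. The naive gemm_colmajor routine below is only a stand-in for a Fortran-style, column-major xGEMM (it is not a BLAS call), and the matrix values are arbitrary: passing the row-major arrays as if they were the column-major transposes, with the operands and the M/N dimensions swapped, produces the row-major product directly.

#include <stdio.h>

/* Naive column-major GEMM reference: C = alpha*A*B + beta*C, where A is
 * m x k, B is k x n, C is m x n, all column-major with leading dimensions
 * lda, ldb, ldc (like Fortran xGEMM with "N","N"). */
static void gemm_colmajor(int m, int n, int k, double alpha,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double beta, double *C, int ldc)
{
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      double s = 0.0;
      for (int p = 0; p < k; ++p)
        s += A[i + p*lda] * B[p + j*ldb];
      C[i + j*ldc] = alpha*s + beta*C[i + j*ldc];
    }
}

int main(void)
{
  /* Row-major 2x3 A and 3x2 B; we want the row-major 2x2 C = A*B. */
  double A[2][3] = {{1,2,3},{4,5,6}};
  double B[3][2] = {{7,8},{9,10},{11,12}};
  double C[2][2] = {{0,0},{0,0}};
  int m = 2, n = 2, k = 3;

  /* Fake transpose: ask the column-major routine for C^T = B^T * A^T by
   * swapping the operands and the m/n dimensions; the row-major arrays
   * are reinterpreted as the column-major transposes "for free". */
  gemm_colmajor(n, m, k, 1.0,
                &B[0][0], n,       /* B row-major == B^T column-major, ld = n */
                &A[0][0], k,       /* A row-major == A^T column-major, ld = k */
                0.0, &C[0][0], n); /* C row-major == C^T column-major, ld = n */

  printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);
  /* Expected: 58 64 / 139 154 */
  return 0;
}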

431 432 Calling double precision DGEMM on host CPU Calling single precision SGEMM on host CPU

! Fortran (column-major)
INTEGER :: M,N,K,LDA,LDB,LDC
! LDA ≥ M, LDB ≥ K, LDC ≥ M
REAL(8) :: A(LDA,K) ! M-by-K
REAL(8) :: B(LDB,N) ! K-by-N
REAL(8) :: C(LDC,N) ! M-by-N
REAL(8) :: ALPHA, BETA
! ...
CALL DGEMM ("N","N",  &
            M, N, K,  &
            ALPHA,    &
            A, LDA,   &
            B, LDB,   &
            BETA,     &
            C, LDC )

// C, imitating Fortran-order
#include <mkl.h> // with MKL
int m,n,k,lda,ldb,ldc;
double *A; // minlen lda * k
double *B; // minlen ldb * n
double *C; // minlen ldc * n
double alpha, beta;
/* ... */
DGEMM ("N","N",
       &m, &n, &k,
       &alpha,
       A, &lda,
       B, &ldb,
       &beta,
       C, &ldc );

// With C row-major matrices
#include <mkl.h> // with MKL
int m,n,k,lda,ldb,ldc;
double A[m][lda]; // lda ≥ k
double B[k][ldb]; // ldb ≥ n
double C[m][ldc]; // ldc ≥ n
double alpha, beta;
/* ... */
DGEMM ("N","N",
       &n, &m, &k,
       &alpha,
       (const double *)B, &ldb,
       (const double *)A, &lda,
       &beta,
       (double *)C, &ldc );

Calling single precision SGEMM is identical, with REAL(4)/float in place of REAL(8)/double and SGEMM in place of DGEMM.

433 434

With cuBLAS the prototype is different (DGEMM) With cuBLAS the prototype is different (SGEMM)

// #include
// Cannot include in PGI → manual interface

#define DGEMM cublasDgemm

extern void
DGEMM (char tr_a, char tr_b,
       int m, int n, int k,
       double alpha,
       const double *A, int lda,
       const double *B, int ldb,
       double beta,
       double *C, int ldc);

// C, imitating Fortran-order
int m,n,k,lda,ldb,ldc;
double *A; // minlen lda * k
double *B; // minlen ldb * n
double *C; // minlen ldc * n
double alpha, beta;
/* ... */
#pragma acc data present(A,B,C)
#pragma acc host_data \
            use_device(A,B,C)
DGEMM ('N', 'N',
       m, n, k,
       alpha,
       A, lda,
       B, ldb,
       beta,
       C, ldc );

// With C row-major matrices
int m,n,k,lda,ldb,ldc;
double A[m][lda]; // lda ≥ k
double B[k][ldb]; // ldb ≥ n
double C[m][ldc]; // ldc ≥ n
double alpha, beta;
/* ... */
#pragma acc data present(A,B,C)
#pragma acc host_data \
            use_device(A,B,C)
DGEMM ('N', 'N',
       n, m, k,
       alpha,
       (const double *)B, ldb,
       (const double *)A, lda,
       beta,
       (double *)C, ldc );

The SGEMM version is identical, with float in place of double and #define SGEMM cublasSgemm.

435 436 Accessing DGEMM/SGEMM

Using the PGI/OpenACC compiler, cuBLAS (for the NVIDIA Tesla K20m) can be accessed by providing the following library to the linker:
  -L$CUDA_INSTALL_PATH/lib64 -lcublas
Using the PGI compiler on the host (Intel Xeon, IvB) – to get linked with the multithreaded Intel MKL – the magic string gets pretty complex:
  -mp=numa,bind,allcores -L$MKLROOT/lib/intel64 -Wl,-rpath=$MKLROOT/lib/intel64
  -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -lpgftnrtl -lrt
  -pgf90libs   # Needed for C main programs
Using the Intel compiler for the host, just add -mkl=parallel, and for the native Intel Xeon Phi 5110 (MIC) card also add -mmic.

437 438

Can OpenMP and OpenACC co-exist?

OK, as long as they are used consistently and compiler support exists.
– One (… or more) GPU(s) per host CPU can be considered
– Another scope: hybrid GPU + host CPU matrix multiplication
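As a C counterpart to the Fortran examples shown right after this, here is a minimal multi-GPU sketch; the routine names gpu_work and process and the chunked data layout are assumptions for illustration.

#include <omp.h>
#include <openacc.h>

/* Illustrative device worker: scales its chunk in place.  It receives a
 * device pointer (via host_data use_device below), hence deviceptr. */
void gpu_work(int n, double *restrict chunk)
{
  #pragma acc parallel loop deviceptr(chunk)
  for (int i = 0; i < n; ++i)
    chunk[i] *= 2.0;
}

/* One OpenMP thread per GPU: each thread selects its own device and runs
 * its own OpenACC data region on its slice of the array. */
void process(int ngpus, int n, double *data)   /* data holds ngpus*n values */
{
  #pragma omp parallel num_threads(ngpus)
  {
    int tid = omp_get_thread_num();
    acc_set_device_num(tid, acc_device_nvidia);

    double *chunk = data + (long)tid * n;
    #pragma acc data copy(chunk[0:n])
    {
      #pragma acc host_data use_device(chunk)
      gpu_work(n, chunk);
    }
  }
}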

! Single GPU-device per host CPU node
USE OMP_LIB
INTEGER :: TID
REAL(8), INTENT(INOUT) :: A(:), B(:)
!$OMP PARALLEL SHARED(A,B) PRIVATE(TID)
TID = OMP_GET_THREAD_NUM() ! ≥ 0
IF (TID == 0) THEN
  !$ACC DATA PRESENT(A)
  !$ACC HOST_DATA USE_DEVICE(A)
  CALL some_CUDA_or_GPU_code(SIZE(A),A)
  !$ACC END HOST_DATA
  !$ACC END DATA
ELSE
  CALL DO_SOMETHING_on_CPUs_with(B, ...)
ENDIF
!$OMP END PARALLEL

! Up to NGPUs devices per host CPU node
USE OMP_LIB
USE OPENACC
INTEGER :: TID, IDX, N, NGPUs
REAL(8) :: IN(N * NGPUs), OUT(N * NGPUs)
!$OMP PARALLEL SHARED(IN,OUT) &
!$OMP& PRIVATE(TID, IDX) NUM_THREADS(NGPUs)
TID = OMP_GET_THREAD_NUM() ! ≥ 0
CALL ACC_SET_DEVICE_NUM(TID,ACC_DEVICE_NVIDIA)
IDX = TID * N
!$ACC DATA COPYIN(IN(IDX+1:IDX+N)) &
!$ACC& COPYOUT(OUT(IDX+1:IDX+N))
!$ACC HOST_DATA USE_DEVICE(IN,OUT)
CALL some_CUDA_or_GPU_code(IDX,N,IN,OUT)
!$ACC END HOST_DATA
!$ACC END DATA
!$OMP END PARALLEL

439 440

Hybrid MatMul using host CPU(s) and one GPU

Assuming GPU-resident matrices C (M x N), A (M x K) and B (K x N).
The matrix-matrix multiplication
  [C] = α[A][B] + β[C]
can be split up into a GPU (=1) and a host CPU (=2) part:
  [C1] = α[A1][B] + β[C1] , where C1 is M1 x N, A1 is M1 x K and B is K x N
  [C2] = α[A2][B] + β[C2] , where C2 is M2 x N, A2 is M2 x K and B is K x N
The transposed form of the same split can be written as:
  [C1]^T = α[B]^T[A1]^T + β[C1]^T
  [C2]^T = α[B]^T[A2]^T + β[C2]^T

Hybrid MatMul (cont’d)

Showing only the row-major version (C) and just DGEMM, we notice that we have to glue together two code sections
– the 1st one calling the GPU’s cuBLAS (cublasDgemm)
– the 2nd one calling the CPU’s multithreaded DGEMM
The split-up (M = M1 + M2) is a question that needs testing.
To gain any benefit our hybrid approach must run faster than, say, cuBLAS or the host CPU version alone
– We must keep the host CPU’s DGEMM using all of its cores.

441 442

Hybrid MatMul (cont’d)

// With C row-major matrices (GPU)
int m,n,k,lda,ldb,ldc;
int m1, m2;            // m1 + m2 = m
double A[m][lda];      // lda ≥ k
double B[k][ldb];      // ldb ≥ n
double C[m][ldc];      // ldc ≥ n
double alpha, beta;
#pragma acc data present(A,B,C)
{
  #pragma acc host_data \
              use_device(A,B,C)
  cublasDgemm ('N', 'N',
               n, m1, k,
               alpha,
               (const double *)B, ldb,
               (const double *)A, lda,
               beta,
               (double *)C, ldc );
}

// With C row-major matrices (CPU)
#include <mkl.h> // with MKL
#pragma acc data present(A,B,C)
{
  const int ah = 1, bh = 2, ch = 3;
  #pragma acc update host(B[0:k][0:ldb]) async(bh)
  #pragma acc update host(A[m1:m2][0:lda]) async(ah)
  #pragma acc update host(C[m1:m2][0:ldc]) async(ch) \
              if (beta != 0)
  #pragma acc wait
  DGEMM ("N","N",
         &n, &m2, &k,
         &alpha,
         (const double *)B, &ldb,
         (const double *)&A[m1][0], &lda,
         &beta,
         (double *)&C[m1][0], &ldc );
  #pragma acc update device(C[m1:m2][0:ldc])
}

Hybrid MatMul (cont’d)

A nice way to do the split-up is to launch an OpenMP PARALLEL SECTIONS region with two (2) threads on top of the GPU and host CPU sections.
Each OpenMP section does its own MatMul call in an OpenACC data region
– The GPU calls cuBLAS, the CPU calls DGEMM (from, say, MKL).
Before the host CPU can do its portion of MatMul, it needs copies of sub-matrix A2, the whole B and possibly C2.
But can we keep the CPU DGEMM multithreaded, since we call MKL already from an OpenMP parallel region? (Yes) No

443 444

Plugging in the outer OpenMP parallel region

int maxth = omp_get_max_threads();
#pragma omp parallel sections num_threads(2) default(shared) if (m1 > 0)
{
  #pragma omp section
  {
    #pragma acc data present(A,B,C) // This must be *inside* the OMP-section
    if (m1 > 0) {
      // GPU-call to cublasDgemm( ... )
    }
  } // end of OMP-section 1

  #pragma omp section
  {
    #pragma acc data present(A,B,C) // This must be *inside* the OMP-section
    if (m2 > 0) {
      // Xfer B, A2 & possibly C2 from GPU to CPU – asynchronously + wait
      // omp_set_num_threads(maxth);
      // Host CPU-call to DGEMM( ... ) – tried to keep this multithreaded
      // Update C-matrix portion C2 back to GPU
    }
  } // end of OMP-section 2
} // #pragma omp parallel sections

[Figure: run-time comparison of the variants – 100% GPU; 75% GPU + 25% host CPU (non-hybrid MKL in single-threaded mode!); host CPU, 12 cores; hybrid 100% host CPU, 12 cores (data from GPU) – run with MKL_DYNAMIC=false and OMP_NESTED=true.]

445 446

Explaining poor performance in hybrid xGEMMs

Why was the 75% GPU + 25% host CPU version much slower than anticipated? [Data transfer plays only a small role]
. The CPU was not using MKL multithreading, since its xGEMM call was launched from within an OpenMP region (nesting). The limitation only applies to non-Intel (e.g. PGI) compiled programs that call Intel MKL through a non-Intel threading layer whilst already in a threaded region !!

Why was the pure 100% host CPU hybrid version slightly slower than a non-hybrid host CPU version with 12 threads?
. The CPU needs to transfer the A2 and B matrices from the GPU and return C2 back.
. It indeed used MKL multithreading, since it was launched from a non-parallel OpenMP region (because #pragma omp ... if (m1 > 0) was specified).
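One possible way to express the intended threading setup in code rather than through environment variables; this is only a sketch, assuming MKL’s service functions (mkl_set_dynamic, mkl_set_num_threads) and omp_set_nested are available, and, as noted above, it may still not help a PGI-built binary that reaches MKL through a non-Intel threading layer.

#include <omp.h>
#include <mkl.h>

/* Roughly equivalent to exporting OMP_NESTED=true and MKL_DYNAMIC=false
 * before the run: allow the MKL thread team to nest inside our OpenMP
 * section and stop MKL from throttling its own thread count. */
void setup_hybrid_threading(int cpu_threads)
{
  omp_set_nested(1);                 /* OMP_NESTED=true                 */
  mkl_set_dynamic(0);                /* MKL_DYNAMIC=false               */
  mkl_set_num_threads(cpu_threads);  /* e.g. all 12 IvB cores for DGEMM */
}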

447 448 Halo-data (nearest neighbor) exchange, Y-dir MPI and OpenACC interoperability

[Figure: Y-direction domain decomposition on a numbered grid (cells 1–18); the parenthesized rows (7)(8)(9) and (10)(11)(12) mark the halo rows exchanged between the neighboring subdomains; axes X and Y.]

Consider a situation where there is one GPU per host node.
The aim is to write an MPI-parallel program where each MPI task uses OpenACC to drive its GPU device.
To be efficient, and much easier to program, the MPI communication between tasks should be GPU-to-GPU
– We talk about a CUDA-aware MPI implementation, e.g.
  . MVAPICH2 v2.0b has just been built at CSC for use with the PGI compilers
With standard MPI the communication goes via host-to-host.
An example: halo data exchange for GPU-resident data.

449 450

Using standard, non-CUDA aware MPI Using standard host-to-host MPI (not CUDA-aware) South-to-North communication only shown here

Our computational regions are GPU-resident, with one GPU attached to each MPI task
– Before the stencil updates, the halo data needs to be exchanged
On each MPI task we need to copy the halo data from the GPU (using OpenACC) into a (send) buffer residing on the host
– Then send the buffer from this host to the neighboring host
  . Once received, the neighboring host transfers the received buffer to its own GPU using the OpenACC UPDATE directive o aaaaargh !!
This obviously sounds very, very complicated …

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
real(kind=8) :: sendbuf(nx), recvbuf(nx) ! msg-buffers
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000 ! prev & next procs, tag
!$acc data present(u) create(sendbuf)
if (nextproc < npes) then ! Send halo to north (buffered send)
  !$acc kernels
  sendbuf(1:nx) = u(1:nx,ny)
  !$acc end kernels
  !$acc update host(sendbuf(1:nx))    !! Don't like this ... a "baddie"
  call mpi_bsend(sendbuf,nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(recvbuf,nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
  !$acc kernels copyin(recvbuf(1:nx)) !! Don't like this ... a "baddie"
  u(1:nx,0) = recvbuf(1:nx)
  !$acc end kernels
endif
!$acc end data

451 452

Using CUDA-aware MPI – South-to-North communication only shown here

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
real(kind=8) :: sendbuf(nx), recvbuf(nx) ! msg-buffers
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000 ! prev & next procs, tag
!$acc data present(u) create(sendbuf, recvbuf)
!$acc host_data use_device(sendbuf, recvbuf) ! Affects both MPI_bsend & MPI_recv
if (nextproc < npes) then ! Send halo to north (buffered send)
  !$acc kernels
  sendbuf(1:nx) = u(1:nx,ny)
  !$acc end kernels
  call mpi_bsend(sendbuf,nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(recvbuf,nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
  !$acc kernels
  u(1:nx,0) = recvbuf(1:nx)
  !$acc end kernels
endif
!$acc end host_data
!$acc end data

Using CUDA-aware MPI – South-to-North communication made even simpler !!

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000
!$acc data present(u)
!$acc host_data use_device(u) ! Affects both MPI_bsend & MPI_recv
if (nextproc < npes) then ! Send halo to north (buffered send)
  call mpi_bsend(u(1,ny),nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(u(1,0),nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
endif
!$acc end host_data
!$acc end data
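The same CUDA-aware exchange can be written in C; a minimal sketch, where the function name, the row-major layout and the use of plain MPI_Send instead of a buffered send are assumptions for illustration, and which requires a CUDA-aware MPI library just like the Fortran version:

#include <mpi.h>

/* South-to-North exchange: u is the local (ny+2) x (nx+2) region with halos,
 * stored row by row (u[j*(nx+2)+i]) and already present on the GPU.  With a
 * CUDA-aware MPI the device addresses exposed by host_data use_device can be
 * passed straight to MPI_Send/MPI_Recv. */
void exchange_north(double *u, int nx, int ny, int rank, int npes)
{
  int north = rank + 1, south = rank - 1, tag = 1000;
  MPI_Status st;

  #pragma acc data present(u[0:(nx+2)*(ny+2)])
  {
    #pragma acc host_data use_device(u)
    {
      if (north < npes)   /* send my top interior row to the northern neighbor */
        MPI_Send(&u[ny*(nx+2) + 1], nx, MPI_DOUBLE,
                 north, tag + rank, MPI_COMM_WORLD);
      if (south >= 0)     /* receive my bottom halo row from the south         */
        MPI_Recv(&u[1], nx, MPI_DOUBLE,
                 south, tag + south, MPI_COMM_WORLD, &st);
    }
  }
}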

453 454

Performance on 2 hosts, 1 MPI-task/host, 1 GPU/task
Performance [MLups/s], bigger the better:

                     Tesla K20m          Tesla K20m            IvB omp 12
                     with standard MPI   with CUDA-aware MPI   with MPI
  w/o MPI, 1 host          6280                6280                1976
  with MPI, 2 hosts        6097                9077                4048

Important directives & clauses touched in this lecture

OpenACC
– host_data use_device(…)
– deviceptr(…) in data/parallel/kernels constructs
– async(handle) and if(cond) clauses
– wait clause
OpenMP
– parallel sections … if (cond)
– section

455 456

Summary

Calling CUDA kernels from within OpenACC – and vice versa – is not a very complicated task once certain conventions are adhered to and compiler/linker support is there.
Calling NVIDIA cuBLAS from OpenACC is also feasible when the interface and the matrix row/column ordering are known.
Using OpenMP & OpenACC in concert is not too complex either, and may help some hybrid codes see the light of day.
Efficient GPU-to-GPU MPI communication requires a CUDA-aware MPI implementation – which is also transparent for host-to-host MPI communication.

457