Lecture 7 CUDA

Dr. Wilson Rivera

ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline

• GPU vs CPU
• CUDA Execution Model
• CUDA Types
• CUDA Programming
• CUDA Timer

CUDA

• Compute Unified Device Architecture
  – Designed and developed by NVIDIA
  – Data-parallel programming interface to GPUs
• Requires an NVIDIA GPU (GeForce, Tesla, …)

CUDA SDK

[Figure: the CUDA SDK components]

GPU and CPU: The Differences

[Figure: CPU vs GPU chip layout. The CPU devotes large areas to control logic and cache next to a few ALUs; the GPU devotes most of its area to ALUs. Each has its own DRAM.]

• GPU
  – More transistors devoted to computation, instead of caching or flow control
  – Threads are extremely lightweight: very little creation overhead
  – Suitable for data-intensive computation: high arithmetic/memory operation ratio

Grids and Blocks

[Figure: the host launches kernels on the device; each kernel executes as a grid of blocks (e.g., Grid 1 with blocks (0,0) … (2,1)), and each block is a 2D array of threads (e.g., Block (1,1) with threads (0,0) … (4,2))]

• A kernel is executed as a grid of blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
  – (Unless through slow global memory)
• Threads and blocks have IDs

Accelerate Applications

Three ways to accelerate applications:
• Libraries: "drop-in" acceleration
• OpenACC directives: easily accelerate applications
• Programming languages: maximum flexibility

© NVIDIA 2013

GPU Programming

(approaches ordered from simplicity to performance)
• Libraries
  – cuBLAS
  – cuSPARSE
  – cuFFT
• Compiler directives
  – OpenACC (OpenMP-like)
• Language extensions
  – CUDA
  – OpenCL (not specific to NVIDIA)

GPU Programming Languages

Numerical analytics: MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Copperhead
F#: Alea.cuBase

© NVIDIA 2013

CUDA Execution Model

• Warp size: 32 threads
• Maximum thread block size: 512 threads
• Maximum grid size: 65,535 blocks per dimension
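These limits depend on the GPU generation and can be queried at runtime. A minimal sketch (not from the original slides) using the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );   /* properties of device 0 */
    printf( "warp size:             %d\n", prop.warpSize );
    printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "max grid size:         %d x %d x %d\n",
            prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
    return 0;
}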

Programming Model

Single Instruction Multiple Thread (SIMT) execution:
• Groups of 32 threads are formed into warps
  o always executing the same instruction
  o share instruction fetch/dispatch
  o some become inactive when code paths diverge
  o hardware automatically handles divergence (see the sketch below)
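A sketch of divergence (assumed, not from the slides): in the kernel below, threads of the same warp take different branches, so the two paths execute one after the other while the threads on the other path sit inactive.

__global__ void diverge( int *out )
{
    int tid = threadIdx.x;
    if ( tid % 2 == 0 )          /* half of each warp takes this path ...       */
        out[tid] = tid * 2;
    else                         /* ... and waits while the other half runs this */
        out[tid] = tid + 1;
}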

• Warps are the primitive unit of scheduling
• The scheduler picks 1 of up to 24 resident warps for each instruction slot
• All warps from all active blocks are time-sliced

CUDA Execution Model

• Threads within a warp run synchronously in parallel
  – Threads in a warp are implicitly and efficiently synchronized
• Threads within a thread block run asynchronously in parallel
  – Threads in the same thread block can cooperate and synchronize (a sketch follows)
  – Threads in different thread blocks cannot cooperate and synchronize; they can, however, communicate indirectly via global memory
• The programmer is encouraged to decompose the program into small independent sub-tasks
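A sketch (assumed, not from the slides) of cooperation within one block: each thread loads one element into shared memory, __syncthreads() makes all loads visible block-wide, and each thread then reads a neighbor's value.

__global__ void shift( const float *in, float *out )
{
    __shared__ float buf[256];            /* one element per thread; assumes blockDim.x <= 256 */
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      /* all loads complete before any thread reads */
    out[blockIdx.x * blockDim.x + tid] = buf[(tid + 1) % blockDim.x];
}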

CUDA Parallel Threads and Memory

[Figure: memory spaces at each level of the thread hierarchy]

• Per-thread private local memory and registers:  float LocalVar;
• Per-block shared memory:  __shared__ float SharedVar;
• Per-application device global memory, shared by the sequence of grids (Grid 0, Grid 1, …):  __device__ float GlobalVar;

CUDA Kernel Maps to Grid of Blocks

[Figure: a host thread on the CPU launches a grid of thread blocks on the GPU; each SM has its own shared memory (SMem); CPU and GPU each have caches and their own DRAM (host memory, device memory), connected through the host bridge over PCIe]

CUDA: Hello, World!

#include <stdio.h>   /* for printf */

#define NUM_BLOCKS 4
#define BLOCK_WIDTH 8

__global__ void hello( void );   /* forward declaration, since main() calls it first */

/* Main function, executed on host (CPU) */
int main( void )
{
    /* print message from CPU */
    printf( "Hello Cuda!\n" );

    /* execute function on device (GPU); a kernel is a parallel function that runs on the GPU */
    hello<<< NUM_BLOCKS, BLOCK_WIDTH >>>();

    /* wait until all threads finish their job */
    cudaDeviceSynchronize();

    /* print message from CPU */
    printf( "Welcome back to CPU!\n" );

    return 0;
}

/* Function executed on device (GPU) */
__global__ void hello( void )
{
    printf( "\tHello from GPU: thread %d and block %d\n",
            threadIdx.x, blockIdx.x );
}

Kernel Function Call

• kernel<<< Grid, Block, Shared_mem, Stream >>>();
  – Grid: grid dimension (up to 2D)
  – Block: block dimension (up to 3D)
  – Shared_mem: dynamic shared memory size in bytes (optional)
  – Stream: stream ID (optional)

__global__ void filter( int *in, int *out );

dim3 grid( 16, 16 );
dim3 block( 16, 16 );
filter<<< grid, block, 0, 0 >>>( in, out );   // equivalent: filter<<< grid, block >>>( in, out );

Programming Model

Simple example (matrix addition): first as a plain CPU C program (sketched below), then as a CUDA program.
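The slide announces a CPU C version that did not survive extraction; here is a minimal sketch of what it would look like (add_matrix_cpu is a hypothetical name, using the same row-major layout as the CUDA kernel below):

void add_matrix_cpu( const float *a, const float *b, float *c, int N )
{
    /* walk the N x N matrices element by element */
    for ( int j = 0; j < N; ++j )
        for ( int i = 0; i < N; ++i ) {
            int index = i + j * N;
            c[index] = a[index] + b[index];
        }
}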

CUDA Example: Add_matrix

// Set grid size
const int N = 1024;
const int blocksize = 16;

// Compute kernel
__global__ void add_matrix( float *a, float *b, float *c, int N )
{
    // threadIdx, blockIdx, blockDim are built-in variables provided by CUDA at runtime
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    // CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];

    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    // GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );


    // copy data to GPU
    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    // execute kernel
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<< dimGrid, dimBlock >>>( ad, bd, cd, N );

    // copy result back to CPU
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    // clean up and return
    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;   // from <cstdlib>
}
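The example above ignores error codes for brevity. In practice every CUDA runtime call returns a cudaError_t; a common checking idiom (CUDA_CHECK is a hypothetical helper macro, not part of the original example; assumes <stdio.h> and <cstdlib>) is:

#define CUDA_CHECK( call )                                          \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if ( err != cudaSuccess ) {                                 \
            fprintf( stderr, "CUDA error at %s:%d: %s\n",           \
                     __FILE__, __LINE__, cudaGetErrorString(err) ); \
            exit( EXIT_FAILURE );                                   \
        }                                                           \
    } while (0)

/* usage: CUDA_CHECK( cudaMalloc( (void**)&ad, size ) ); */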

Memory Model

• Registers: per thread, read-write
• Local memory: per thread, read-write
• Shared memory: per block, read-write; for sharing data within a block
• Global memory: per grid, read-write; not cached
• Constant memory: per grid, read-only; cached
• Texture memory: per grid, read-only; spatially cached

Memory Model

There are 6 memory types (a usage sketch follows the list):

• Registers: on chip, fast access, per thread, limited amount, 32 bit
• Local memory: in DRAM, slow, non-cached, per thread, relatively large
• Shared memory: on chip, fast access, per block, 16 KByte, synchronize between threads
• Global memory: in DRAM, slow, non-cached, per grid, communicate between grids
• Constant memory: in DRAM, cached, per grid, read-only
• Texture memory: in DRAM, cached, per grid, read-only
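A sketch (assumed, not from the slides) of two of these spaces in use: coefficients placed in cached, read-only constant memory via cudaMemcpyToSymbol, and a per-block scratch buffer in shared memory.

__constant__ float coeff[4];              /* constant memory: per grid, read-only, cached */

__global__ void weighted_sum( const float *in, float *out )
{
    __shared__ float tile[128];           /* shared memory: per block, on chip; assumes blockDim.x <= 128 */
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = coeff[tid % 4] * tile[tid];
}

/* host side: float h_coeff[4] = { ... };
   cudaMemcpyToSymbol( coeff, h_coeff, sizeof(h_coeff) ); */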

Built-in Variables accessible in a Kernel

dim3 gridDim: the dimensions of the grid in blocks, as specified during kernel invocation (gridDim.x, gridDim.y; .z is unused)
uint3 blockIdx: the block index within the grid (blockIdx.x, blockIdx.y; .z is unused)
dim3 blockDim: the dimensions of a block in threads (blockDim.x, blockDim.y, blockDim.z)
uint3 threadIdx: the thread index within the block (threadIdx.x, threadIdx.y, threadIdx.z)
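Combining these built-ins gives each thread a unique global index; a typical 1D pattern (a sketch, echoing the add_matrix kernel above):

__global__ void scale( float *data, float alpha, int n )
{
    /* unique global index: block offset plus thread offset within the block */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if ( gid < n )                     /* guard: the grid may overshoot n */
        data[gid] *= alpha;
}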

CUDA Type Qualifiers

Function type qualifiers:
• __global__: executed on the device; callable from the host only
• __device__: executed on the device; callable from the device only
• __host__: executed on the host; callable from the host only (default type if unspecified)

Variable type qualifiers:
• __device__: global memory space; accessible from all the threads within the grid
• __constant__: constant memory space; accessible from all the threads within the grid
• __shared__: shared memory space of a thread block; only accessible from the threads within the block
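A sketch (assumed) showing the function qualifiers together: a __device__ helper callable only from GPU code, invoked from a __global__ kernel launched by the host.

__device__ float square( float x )              /* runs on GPU, callable from GPU code only */
{
    return x * x;
}

__global__ void squares( float *data, int n )   /* runs on GPU, launched from the host */
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if ( gid < n )
        data[gid] = square( data[gid] );
}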

CUDA Variable Type Qualifiers

Variable declaration             | Memory   | Scope  | Lifetime
int var;                         | register | thread | thread
int array_var[10];               | local    | thread | thread
__shared__ int shared_var;       | shared   | block  | block
__device__ int global_var;       | global   | grid   | application
__constant__ int constant_var;   | constant | grid   | application

• “Automatic” scalar variables without a qualifier reside in a register
  – the compiler will spill to thread-local memory if registers run out
• “Automatic” array variables without a qualifier reside in thread-local memory

CUDA Variable Type Performance

Variable declaration             | Memory   | Penalty
int var;                         | register | 1x
int array_var[10];               | local    | 100x
__shared__ int shared_var;       | shared   | 1x
__device__ int global_var;       | global   | 100x
__constant__ int constant_var;   | constant | 1x

• Scalar variables reside in fast, on-chip registers
• Shared variables reside in fast, on-chip memories
• Thread-local arrays and global variables reside in uncached off-chip memory
• Constant variables reside in cached off-chip memory

CUDA Timer

int main()
{
    float myTime;
    cudaEvent_t myTimerStart, myTimerStop;   // note: the type is cudaEvent_t
    cudaEventCreate( &myTimerStart );
    cudaEventCreate( &myTimerStop );

    cudaEventRecord( myTimerStart, 0 );
    // ... task to be timed ...
    cudaEventRecord( myTimerStop, 0 );

    cudaEventSynchronize( myTimerStop );                          // wait for the stop event
    cudaEventElapsedTime( &myTime, myTimerStart, myTimerStop );   // elapsed time in milliseconds
    cudaEventDestroy( myTimerStart );                             // release the events
    cudaEventDestroy( myTimerStop );
}

Some GPU-accelerated Libraries

• NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP
• NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• Sparse Linear Algebra
• Building-block Algorithms for CUDA
• C++ STL Features for CUDA
• ArrayFire Matrix Computations
• IMSL Library

© NVIDIA 2013

GPU Accelerated Libraries

• cuFFT
• cuBLAS
• cuSPARSE
• Performance Primitives
• Thrust

Performance on Scientific Applications

[Figure: bar chart of GPU speedups, 0.0x to 20.0x, on scientific applications: MATLAB (FFT)* (Engineering), Chroma (Physics), SPECFEM3D (Earth Science), AMBER (Molecular Dynamics)]

CPU results: dual-socket E5-2687W, 3.10 GHz. GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs. *MATLAB results compare one i7-2600K CPU vs. one Tesla K20 GPU. Disclaimer: non-NVIDIA implementations may not have been fully optimized.

© NVIDIA 2013

CUFFT Performance vs. FFTW

• CUFFT starts to perform better than FFTW around data sizes of 8192 elements, and it beats FFTW for most larger sizes (> 10,000 elements)

Source: http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
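For reference, a minimal cuFFT sketch (assumed, not from the slides): plan a 1D complex-to-complex transform, execute it in place on device data, and release the plan.

#include <cufft.h>

void fft_inplace( cufftComplex *d_data, int n )   /* d_data: device pointer, n elements */
{
    cufftHandle plan;
    cufftPlan1d( &plan, n, CUFFT_C2C, 1 );                 /* one 1D C2C transform of size n */
    cufftExecC2C( plan, d_data, d_data, CUFFT_FORWARD );   /* in-place forward FFT */
    cufftDestroy( plan );
}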

GEMM C = αAB + βC

/* General Matrix Multiply (simplified version), column-major storage */
static void simple_dgemm( int n, double alpha, const double *A,
                          const double *B, double beta, double *C )
{
    int i, j, k;
    for ( i = 0; i < n; ++i ) {
        for ( j = 0; j < n; ++j ) {

            double prod = 0;
            for ( k = 0; k < n; ++k )
                prod += A[k * n + i] * B[j * n + k];

            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

GEMM C = αAB + βC

/* dgemm from BLAS library */
extern "C" {
    extern void dgemm_( char *, char *, int *, int *, int *,
                        double *, double *, int *, double *, int *,
                        double *, double *, int * );
};

/* Main */
int main( int argc, char **argv )
{
    . . .

    /* call gemm from BLAS library; "N","N" = no transpose of A or B */
    dgemm_( "N", "N", &N, &N, &N, &alpha, h_A, &N, h_B, &N, &beta, h_C_blas, &N );
    . . .

GEMM C = αAB + βC

/* Main */
int main( int argc, char **argv )
{
    /* 0. Initialize CUBLAS */
    cublasCreate( &handle );

    /* 1. Allocate memory on GPU */
    cudaMalloc( (void **)&d_A, n2 * sizeof(d_A[0]) );

    /* 2. Copy data from host to GPU */
    status = cublasSetVector( n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );

    /* 3. Execute GPU kernel */
    cublasDgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, d_A, N, d_B, N, &beta, d_C, N );

    /* 4. Copy data from GPU back to host */
    cublasGetVector( n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );

    /* 5. Free GPU memory */
    cudaFree( d_A );
    cublasDestroy( handle );   // release the cuBLAS context
}

Jacobi Iteration

[Figure: 5-point stencil; each point A(i,j) is updated from its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]

© NVIDIA 2013

Jacobi Iteration C Code

while ( error > tol && iter < iter_max )   // iterate until converged
{
    error = 0.0;

    for ( int j = 1; j < n-1; j++ ) {      // iterate across matrix elements
        for ( int i = 1; i < m-1; i++ ) {

            // calculate new value from neighbors
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            // compute max error for convergence
            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    // swap input/output arrays
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

OpenMP C Code

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    // parallelize loop across CPU threads
    // (note: in production code, error needs a max reduction, e.g. reduction(max:error))
    #pragma omp parallel for shared(m, n, Anew, A)
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    // parallelize loop across CPU threads
    #pragma omp parallel for shared(m, n, Anew, A)
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

OpenACC C

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels   // execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma acc kernels   // execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

First Attempt: Performance

CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz.  GPU: NVIDIA Tesla M2070.

                        | Execution Time (s) | Speedup
CPU, 1 OpenMP thread    | 69.80              | --
CPU, 2 OpenMP threads   | 44.76              | 1.56x
CPU, 4 OpenMP threads   | 39.59              | 1.76x
CPU, 6 OpenMP threads   | 39.71              | 1.76x
OpenACC GPU             | 162.16             | 0.24x  FAIL

© NVIDIA 2013

Excessive Data Transfers

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    // A and Anew are copied to the GPU at the start of the kernels region and
    // back at its end; these copies happen every iteration of the outer while loop!*
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }
    ...
}

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while-loop iteration!

© NVIDIA 2013

OpenACC C (second version)

// copy A in at the beginning of the loop and out at the end;
// allocate Anew on the accelerator (never copied)
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

Second Attempt: Performance

CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz.  GPU: NVIDIA Tesla M2070.

                        | Execution Time (s) | Speedup
CPU, 1 OpenMP thread    | 69.80              | --
CPU, 2 OpenMP threads   | 44.76              | 1.56x
CPU, 4 OpenMP threads   | 39.59              | 1.76x
CPU, 6 OpenMP threads   | 39.71              | 1.76x
OpenACC GPU             | 13.65              | 2.9x

© NVIDIA 2013

Data Clauses

copy( list ): allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
copyin( list ): allocates memory on the GPU and copies data from host to GPU when entering the region.
copyout( list ): allocates memory on the GPU and copies data to the host when exiting the region.
create( list ): allocates memory on the GPU but does not copy.
present( list ): data is already present on the GPU from another containing data region.
Also: present_or_copy[in|out], present_or_create, deviceptr.
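A sketch (assumed, not from the slides) of the clauses in use: x is only copied in, while y is copied both in and out, avoiding unnecessary transfers.

void saxpy( int n, float alpha, const float *x, float *y )
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels
        for ( int i = 0; i < n; ++i )
            y[i] = alpha * x[i] + y[i];
    }
}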

© NVIDIA 2013

Learn More

• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda

• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight

• Programming Guide / Best Practices: docs.nvidia.com

• Questions:
  – NVIDIA Developer forums: devtalk.nvidia.com
  – Search or ask on: www.stackoverflow.com/tags/cuda

• General: www.nvidia.com/cudazone

© NVIDIA 2013

Learn More

• CUDA C/C++: http://developer.nvidia.com/cuda-toolkit
• Thrust C++ Template Library: http://developer.nvidia.com/thrust
• CUDA Fortran: http://developer.nvidia.com/cuda-toolkit
• GPU.NET: http://tidepowerd.com
• MATLAB: http://www.mathworks.com/discovery/matlab-gpu.html
• PyCUDA (Python): http://mathema.tician.de/software/pycuda
• Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and--support/

© NVIDIA 2013

Summary

• GPU vs CPU
• CUDA Execution Model
• CUDA Types
• CUDA Programming
• CUDA Timer
