Lecture 7: CUDA

Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline

• GPU vs CPU
• CUDA execution model
• CUDA types
• CUDA programming
• CUDA timer

CUDA

• Compute Unified Device Architecture
  – Designed and developed by NVIDIA
  – A data-parallel programming interface to GPUs
• Requires an NVIDIA GPU (GeForce, Tesla, Quadro)

CUDA SDK

[Figure: the CUDA SDK software stack]

GPU and CPU: The Differences

[Figure: the CPU spends its transistors on control logic and cache in front of DRAM; the GPU spends them on rows of ALUs]

• GPU
  – More transistors devoted to computation, instead of caching or flow control
  – Threads are extremely lightweight: very little creation overhead
  – Suitable for data-intensive computation: high arithmetic/memory-operation ratio

Grids and Blocks

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D arrangement of blocks, Block (0, 0) through Block (2, 1), and Block (1, 1) expands into threads Thread (0, 0) through Thread (4, 2)]

• A kernel is executed as a grid of thread blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
  – (Unless through slow global memory)
• Threads and blocks have IDs

Accelerate Applications

Three ways to accelerate applications (© NVIDIA 2013):
• Libraries: "drop-in" acceleration
• OpenACC compiler directives: easily accelerate applications
• Programming languages: maximum flexibility

GPU Programming

• Libraries (simplicity)
  – cuBLAS
  – cuSPARSE
  – cuFFT
• Compiler directives
  – OpenACC (OpenMP-like)
• Language extensions (performance)
  – CUDA
  – OpenCL (not specific to NVIDIA)

GPU Programming Languages (© NVIDIA 2013)

Numerical analytics:  MATLAB, Mathematica, LabVIEW
Fortran:              OpenACC, CUDA Fortran
C:                    OpenACC, CUDA C
C++:                  Thrust, CUDA C++
Python:               PyCUDA, Copperhead
F#:                   Alea.cuBase

CUDA Execution Model

• Warp size: 32 threads
• Thread block size: up to 512 threads
• Grid size: up to 65K blocks per dimension

Programming Model

Single Instruction Multiple Thread (SIMT) execution:
• Groups of 32 threads are formed into warps
  – always executing the same instruction
  – share instruction fetch/dispatch
  – some become inactive when code paths diverge
  – hardware automatically handles divergence
• Warps are the primitive unit of scheduling
  – pick 1 of 24 resident warps for each instruction slot
  – all warps from all active blocks are time-sliced

CUDA Execution Model

• Threads within a warp run synchronously in parallel
  – Threads in a warp are implicitly and efficiently synchronized
• Threads within a thread block run asynchronously in parallel
  – Threads in the same thread block can cooperate and synchronize
  – Threads in different thread blocks cannot cooperate or synchronize; they can, however, communicate indirectly via global memory
• The programmer is encouraged to decompose the program into small independent sub-tasks

CUDA Parallel Threads and Memory

• Per-thread private local memory (and registers):
    float LocalVar;
• Per-block shared memory:
    __shared__ float SharedVar;
• Per-application device global memory, visible to every grid in a sequence (Grid 0, Grid 1, ...):
    __device__ float GlobalVar;
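To make these three scopes concrete, here is a minimal kernel sketch. It is not from the slides: the kernel name scope_demo and the neighbor-exchange logic are illustrative, and it assumes a launch with exactly 64 threads per block.

    __device__ float GlobalVar = 0.5f;       // per-application device global memory

    __global__ void scope_demo(float *out)
    {
        float LocalVar = (float)threadIdx.x; // per-thread: register/local memory
        __shared__ float SharedVar[64];      // per-block: low-latency shared memory

        SharedVar[threadIdx.x] = LocalVar;   // each thread publishes its value
        __syncthreads();                     // hazard-free access: wait until every
                                             // thread in the block has written

        // read a neighbor's value; this cooperation is only possible because
        // both threads belong to the same block
        int next = (threadIdx.x + 1) % blockDim.x;
        out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar[next] + GlobalVar;
    }

    // example launch: scope_demo<<<4, 64>>>(d_out);  (d_out: 256 device floats)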
CUDA Kernel Maps to a Grid of Blocks

[Figure: a host thread launches a grid of thread blocks; on the hardware side, the GPU's multiprocessors (each with shared memory, SMem) and cache sit in front of device memory, connected to the CPU, its cache, and host memory through the PCIe host bridge]

CUDA: Hello, World!

A kernel is a parallel function that runs on the GPU.

    #include <stdio.h>

    #define NUM_BLOCKS 4
    #define BLOCK_WIDTH 8

    /* Kernel: function executed on device (GPU) */
    __global__ void hello(void)
    {
        printf("\tHello from GPU: thread %d and block %d\n",
               threadIdx.x, blockIdx.x);
    }

    /* Main function, executed on host (CPU) */
    int main(void)
    {
        /* print message from CPU */
        printf("Hello Cuda!\n");

        /* execute function on device (GPU) */
        hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

        /* wait until all threads finish their job */
        cudaDeviceSynchronize();

        /* print message from CPU */
        printf("Welcome back to CPU!\n");
        return 0;
    }

Kernel Function Call

• kernel<<<grid, block, shared_mem, stream>>>();
  – grid: grid dimension (up to 2D)
  – block: block dimension (up to 3D)
  – shared_mem: shared memory size in bytes (optional, defaults to 0)
  – stream: stream ID (optional, defaults to 0)

    __global__ void filter(int *in, int *out);

    dim3 grid(16, 16);
    dim3 block(16, 16);
    filter<<<grid, block, 0, 0>>>(in, out);
    // equivalent, using the defaults: filter<<<grid, block>>>(in, out);

Programming Model

Simple example (matrix addition): [Figure: a plain C program for the CPU shown side by side with the corresponding CUDA program]

CUDA Example: add_matrix

    #include <cstdlib>

    // Set grid size
    const int N = 1024;
    const int blocksize = 16;

    // Compute kernel
    __global__ void add_matrix(float *a, float *b, float *c, int N)
    {
        // threadIdx, blockIdx, and blockDim are built-in variables
        // provided by CUDA at runtime
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if (i < N && j < N)
            c[index] = a[index] + b[index];
    }

    int main()
    {
        // CPU memory allocation
        float *a = new float[N*N];
        float *b = new float[N*N];
        float *c = new float[N*N];
        for (int i = 0; i < N*N; ++i) {
            a[i] = 1.0f;
            b[i] = 3.5f;
        }

        // GPU memory allocation
        float *ad, *bd, *cd;
        const int size = N * N * sizeof(float);
        cudaMalloc((void**)&ad, size);
        cudaMalloc((void**)&bd, size);
        cudaMalloc((void**)&cd, size);

        // copy data to GPU
        cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

        // execute kernel
        dim3 dimBlock(blocksize, blocksize);
        dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
        add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

        // copy result back to CPU
        cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

        // clean up and return
        cudaFree(ad);
        cudaFree(bd);
        cudaFree(cd);
        delete[] a;
        delete[] b;
        delete[] c;
        return EXIT_SUCCESS;
    }
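One caveat the example leaves implicit: dimGrid(N/dimBlock.x, N/dimBlock.y) covers the whole matrix only because N = 1024 is an exact multiple of blocksize = 16. For a size that need not divide evenly, the usual idiom (not shown on the slide) is to round the grid dimensions up and let the kernel's bounds check discard the leftover threads:

    // ceiling division: enough blocks to cover N even when N % blocksize != 0
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);
    // the kernel's "if (i < N && j < N)" test makes the extra threads do nothing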
Memory Model

• Registers: per thread, read-write
• Local memory: per thread, read-write
• Shared memory: per block, read-write; for sharing data within a block
• Global memory: per grid, read-write; not cached
• Constant memory: per grid, read-only; cached
• Texture memory: per grid, read-only; spatially cached

In more detail, the six memory types are:

• Registers: on chip; fast access; per thread; limited amount; 32 bit
• Local memory: in DRAM; slow; non-cached; per thread; relatively large
• Shared memory: on chip; fast access; per block; 16 KByte; synchronize between threads
• Global memory: in DRAM; slow; non-cached; per grid; communicate between grids
• Constant memory: in DRAM; cached; per grid; read-only
• Texture memory: in DRAM; cached; per grid; read-only

Built-in Variables Accessible in a Kernel

• dim3 gridDim: the dimensions of the grid in blocks, as specified during kernel invocation: gridDim.x, gridDim.y (.z is unused)
• uint3 blockIdx: the block index within the grid: blockIdx.x, blockIdx.y (.z is unused)
• dim3 blockDim: the dimensions of a block in threads: blockDim.x, blockDim.y, blockDim.z
• uint3 threadIdx: the thread index within the block: threadIdx.x, threadIdx.y, threadIdx.z

CUDA Type Qualifiers

Function type qualifiers:
• __device__
  – Executed on the device
  – Callable from the device only
• __global__
  – Executed on the device
  – Callable from the host only
• __host__
  – Executed on the host
  – Callable from the host only
  – Default type if unspecified

Variable type qualifiers:
• __device__
  – Global memory space
  – Accessible from all the threads within the grid
• __constant__
  – Constant memory space
  – Accessible from all the threads within the grid
• __shared__
  – Shared memory space of a thread block
  – Accessible only from the threads within the block

CUDA Variable Type Qualifiers

Variable declaration              Memory     Scope    Lifetime
int var;                          register   thread   thread
int array_var[10];                local      thread   thread
__shared__ int shared_var;        shared     block    block
__device__ int global_var;        global     grid     application
__constant__ int constant_var;    constant   grid     application

• "Automatic" scalar variables without a qualifier reside in a register
  – the compiler will spill them to thread-local memory when registers run out
• "Automatic" array variables without a qualifier reside in thread-local memory

CUDA Variable Type Performance

Variable declaration              Memory     Penalty
int var;                          register   1x
int array_var[10];                local      100x
__shared__ int shared_var;        shared     1x
__device__ int global_var;        global     100x
__constant__ int constant_var;    constant   1x

• Scalar variables reside in fast, on-chip registers
• Shared variables reside in fast, on-chip memories
• Thread-local arrays and global variables reside in uncached, off-chip memory
• Constant variables reside in cached, off-chip memory

CUDA Timer

    int main()
    {
        float myTime;
        cudaEvent_t myTimerStart, myTimerStop;
        cudaEventCreate(&myTimerStart);
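        /* The source listing breaks off above. What follows is a minimal
           completion sketch using the standard CUDA event API; my_kernel,
           grid, and block are placeholders, not names from the lecture. */
        cudaEventCreate(&myTimerStop);

        cudaEventRecord(myTimerStart, 0);            /* mark start on stream 0 */
        my_kernel<<<grid, block>>>();                /* the work being timed */
        cudaEventRecord(myTimerStop, 0);             /* mark stop on stream 0 */

        cudaEventSynchronize(myTimerStop);           /* wait for the stop event */
        cudaEventElapsedTime(&myTime, myTimerStart, myTimerStop);
        printf("Kernel time: %f ms\n", myTime);      /* elapsed time in ms */

        cudaEventDestroy(myTimerStart);
        cudaEventDestroy(myTimerStop);
        return 0;
    }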
