Lecture 7 CUDA
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico

Outline
• GPU vs CPU
• CUDA execution model
• CUDA types
• CUDA programming
• CUDA timer

CUDA
• Compute Unified Device Architecture
  – Designed and developed by NVIDIA
  – Data-parallel programming interface to GPUs
• Requires an NVIDIA GPU (GeForce, Tesla, Quadro)
[Figure: CUDA SDK]

GPU and CPU: The Differences
[Figure: CPU and GPU block diagrams showing control logic, ALUs, cache, and DRAM]
• GPU
  – More transistors devoted to computation, instead of caching or flow control
  – Threads are extremely lightweight
    • Very little creation overhead
  – Suitable for data-intensive computation
    • High ratio of arithmetic to memory operations

Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
  – (Unless through slow global memory)
• Threads and blocks have IDs
[Figure: the host launches Kernel 1 on Grid 1 (blocks (0, 0) to (2, 1)) and Kernel 2 on Grid 2 on the device; Block (1, 1) is expanded into threads (0, 0) to (4, 2)]

Accelerate Applications
[Figure: three ways to accelerate applications]
• Libraries: "drop-in" acceleration
• OpenACC directives: easily accelerate applications
• Programming languages: maximum flexibility
© NVIDIA 2013

GPU Programming
• Libraries
  – cuBLAS
  – cuSPARSE
  – cuFFT
• Compiler directives
  – OpenACC (OpenMP-like)
• Language extensions
  – CUDA
  – OpenCL (not specific to NVIDIA)
(The list runs from simplicity at the top to performance at the bottom.)

GPU Programming Languages
Numerical analytics MATLAB, Mathematica, LabVIEW
Fortran OpenACC, CUDA Fortran
C OpenACC, CUDA C
C++ Thrust, CUDA C++
Python PyCUDA, Copperhead
F# Alea.cuBase

CUDA Execution Model
• Warp size = 32 threads
• Maximum thread block size = 512 threads (compute capability 1.x devices; 1024 on 2.x and later)
• Maximum grid size: 65,535 blocks per dimension
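To see how these limits interact, here is a minimal host-side C++ sketch (it needs no GPU) that computes how many warps a block occupies and how many blocks a launch needs. The names `kWarpSize`, `warps_per_block`, and `blocks_needed` are hypothetical helpers introduced for illustration; they are not part of the CUDA API.

```cpp
#include <cassert>

// Warp size quoted above; the constant name is illustrative only.
const int kWarpSize = 32;

// Warps occupied by a block of `threads` threads. A partially filled
// last warp still takes a whole warp slot, hence the rounding up.
int warps_per_block(int threads) {
    assert(threads > 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Blocks needed to cover `n` elements with one thread per element
// (ceiling division).
int blocks_needed(int n, int threads_per_block) {
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, a 512-thread block is 16 warps, a 100-thread block still occupies 4 warps (the last one only partly filled), and 1,000,000 elements at 512 threads per block need 1954 blocks, well within the 65,535 per-dimension limit.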

Programming Model
Single Instruction Multiple Thread (SIMT) execution:
• Groups of 32 threads are formed into warps
  o always executing the same instruction
  o share instruction fetch/dispatch
  o some threads become inactive when the code path diverges
  o hardware automatically handles divergence
• Warps are the primitive unit of scheduling
• The scheduler picks 1 of 24 warps for each instruction slot
• All warps from all active blocks are time-sliced

CUDA Execution Model
• Threads within a warp run synchronously in parallel
  – Threads in a warp are implicitly and efficiently synchronized
• Threads within a thread block run asynchronously in parallel
  – Threads in the same thread block can cooperate and synchronize
  – Threads in different thread blocks cannot cooperate or synchronize; they can, however, communicate indirectly via global memory
• The programmer is encouraged to decompose the program into small independent sub-tasks
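Because the threads of a warp execute the same instruction, a branch costs extra only when threads of the same warp disagree on it. The following host-side C++ sketch emulates that rule by counting the warps of a block that see both outcomes of a condition. `divergent_warps` and `kWarpSize` are hypothetical names used for illustration; the code models the scheduling rule and does not run on a GPU.

```cpp
#include <cassert>
#include <functional>

const int kWarpSize = 32;  // illustrative constant, matching the warp size above

// Counts how many warps of a block see both outcomes of `cond`.
// Under SIMT such a warp executes both branch paths one after the other,
// with the non-participating threads inactive.
int divergent_warps(int threads_per_block, const std::function<bool(int)>& cond) {
    assert(threads_per_block > 0);
    int divergent = 0;
    for (int first = 0; first < threads_per_block; first += kWarpSize) {
        bool any_true = false;
        bool any_false = false;
        for (int tid = first; tid < first + kWarpSize && tid < threads_per_block; ++tid) {
            if (cond(tid)) any_true = true;
            else any_false = true;
        }
        if (any_true && any_false) ++divergent;
    }
    return divergent;
}
```

With 256 threads per block (8 warps), the condition `tid % 2 == 0` splits every warp, so all 8 diverge, while `(tid / 32) % 2 == 0` is uniform within each warp and none diverge. Branching on a per-warp quantity is therefore the cheaper pattern when the algorithm allows it.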

CUDA Parallel Threads and Memory
• Thread: registers and per-thread private local memory
    float LocalVar;
• Block: per-block shared memory
    __shared__ float SharedVar;
• Sequence of grids (Grid 0, Grid 1, ...): per-application device global memory
    __device__ float GlobalVar;

CUDA kernel maps to Grid of Blocks
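[Figure: a host thread on the CPU launches a grid of thread blocks on the GPU. The CPU (cache, host memory) connects through a bridge and PCIe to the GPU (per-multiprocessor shared memory (SMem), cache, device memory)]

CUDA: Hello, World!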
    #include <stdio.h>

    #define NUM_BLOCKS 4
    #define BLOCK_WIDTH 8

    /* Kernel: a parallel function executed on the device (GPU) */
    __global__ void hello( void )
    {
        printf( "\tHello from GPU: thread %d and block %d\n", threadIdx.x, blockIdx.x );
    }

    /* Main function, executed on host (CPU) */
    int main( void )
    {
        /* print message from CPU */
        printf( "Hello Cuda!\n" );

        /* execute function on device (GPU) */
        hello<<<NUM_BLOCKS, BLOCK_WIDTH>>>();

        /* wait for the kernel to finish so that its output is flushed */
        cudaDeviceSynchronize();

        /* print message from CPU */
        printf( "Welcome back to CPU!\n" );

        return 0;
    }

Kernel Function call
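• kernel<<<grid, block, shared_mem_bytes, stream>>>(args);
  – The last two launch parameters are optional and default to 0

    __global__ void filter(int *in, int *out);
    dim3 grid(16, 16);
    dim3 block(16, 16);
    filter<<< grid, block, 0, 0 >>>(in, out);
    // equivalent: filter<<< grid, block >>>(in, out);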

Programming Model
Simple example (matrix addition): CPU C program vs. CUDA program
[Figure: side-by-side code listings, not recoverable]

CUDA Example: Add_matrix
    #include <cstdlib>   // EXIT_SUCCESS

    // Set grid size
    const int N = 1024;
    const int blocksize = 16;

    // Compute kernel
    __global__ void add_matrix( float *a, float *b, float *c, int N )
    {
        // blockIdx, blockDim, and threadIdx are built-in variables provided by CUDA at runtime
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j*N;
        if ( i < N && j < N )
            c[index] = a[index] + b[index];
    }

    int main()
    {
        // CPU memory allocation
        float *a = new float[N*N];
        float *b = new float[N*N];
        float *c = new float[N*N];

        for ( int i = 0; i < N*N; ++i ) {
            a[i] = 1.0f;
            b[i] = 3.5f;
        }

        // GPU memory allocation
        float *ad, *bd, *cd;
        const int size = N*N*sizeof(float);
        cudaMalloc( (void**)&ad, size );
        cudaMalloc( (void**)&bd, size );
        cudaMalloc( (void**)&cd, size );

        // copy data to GPU
        cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

        // execute kernel
        dim3 dimBlock( blocksize, blocksize );
        dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
        add_matrix<<<dimGrid, dimBlock>>>( ad, bd, cd, N );

        // copy result back to CPU
        cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

        // clean up and return
        cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
        delete[] a; delete[] b; delete[] c;
        return EXIT_SUCCESS;
    }
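The launch above sizes the grid as N/dimBlock.x, which is exact only because 1024 is a multiple of 16. The sketch below is a CPU-only emulation, under the assumption that four nested loops can stand in for blockIdx and threadIdx. It shows the usual remedy when N is not a multiple of the block size: round the grid up and rely on the `i < N && j < N` guard to idle the surplus threads. `emulate_add_matrix` is a hypothetical helper, not CUDA code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the add_matrix launch. The loops over (by, bx) play the
// role of blockIdx and the loops over (ty, tx) the role of threadIdx.
// Returns how many emulated threads passed the bounds guard.
int emulate_add_matrix(const std::vector<float>& a, const std::vector<float>& b,
                       std::vector<float>& c, int N, int blocksize) {
    assert(N > 0 && blocksize > 0);
    assert(a.size() == static_cast<std::size_t>(N) * N);
    assert(b.size() == a.size() && c.size() == a.size());

    const int grid = (N + blocksize - 1) / blocksize;  // ceiling division
    int active = 0;
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < blocksize; ++ty)
                for (int tx = 0; tx < blocksize; ++tx) {
                    int i = bx * blocksize + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    int j = by * blocksize + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    if (i < N && j < N) {         // same guard as the kernel
                        int index = i + j * N;
                        c[index] = a[index] + b[index];
                        ++active;
                    }
                }
    return active;
}
```

With N = 100 and a block size of 16, the rounded-up grid is 7 x 7 blocks (112 x 112 emulated threads), of which exactly 10,000 pass the guard. A truncating N/blocksize would give a 6 x 6 grid and leave the last four rows and columns unprocessed.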

Memory Model
• Registers – per thread, read-write
• Local memory – per thread, read-write
• Shared memory – per block, read-write; for sharing data within a block
• Global memory – per grid, read-write; not cached (on early devices; compute capability 2.x and later cache it in L2)
• Constant memory – per grid, read-only; cached
• Texture memory – per grid, read-only; spatially cached

Memory Model
There are 6 memory types:
• Registers
  o on chip
  o fast access
  o per thread
  o limited amount
  o 32 bit
• Local memory
  o in DRAM
  o slow
  o non-cached
  o per thread
  o relatively large
• Shared memory
  o on chip
  o fast access
  o per block
  o 16 KByte
  o synchronize between threads
• Global memory
  o in DRAM
  o slow
  o non-cached
  o per grid
  o communicate between grids
• Constant memory
  o in DRAM
  o cached
  o per grid
  o read-only
• Texture memory
  o in DRAM
  o cached
  o per grid
  o read-only

Built-in Variables Accessible in a Kernel
dim3 gridDim
• Contains the dimensions of the grid, in blocks, as specified during kernel invocation: gridDim.x, gridDim.y (.z is unused on compute capability 1.x devices)
uint3 blockIdx
• Contains the block index within the grid: blockIdx.x, blockIdx.y (.z is unused on compute capability 1.x devices)
dim3 blockDim
• Contains the dimensions of the block, in threads: blockDim.x, blockDim.y, and blockDim.z
uint3 threadIdx
• Contains the thread index within the block: threadIdx.x, threadIdx.y, and threadIdx.z

CUDA Type Qualifiers
Function type qualifiers
• __device__
  – Executed on the device
  – Callable from the device only
• __global__
  – Executed on the device
  – Callable from the host only
• __host__
  – Executed on the host
  – Callable from the host only
  – Default type if unspecified
Variable type qualifiers
• __device__
  – Resides in the global memory space
  – Accessible from all the threads within the grid
• __constant__
  – Resides in the constant memory space
  – Accessible from all the threads within the grid
• __shared__
  – Resides in the shared memory space of a thread block
  – Accessible only from the threads within the block

CUDA Variable Type Qualifiers
Variable declaration             Memory     Scope    Lifetime
int var;                         register   thread   thread
int array_var[10];               local      thread   thread
__shared__ int shared_var;       shared     block    block
__device__ int global_var;       global     grid     application
__constant__ int constant_var;   constant   grid     application

• "Automatic" scalar variables without a qualifier reside in a register
  – The compiler spills them to thread-local memory when registers run out
• "Automatic" array variables without a qualifier reside in thread-local memory

CUDA Variable Type Performance
Variable declaration             Memory     Penalty
int var;                         register   1x
int array_var[10];               local      100x
__shared__ int shared_var;       shared     1x
__device__ int global_var;       global     100x
__constant__ int constant_var;   constant   1x

• Scalar variables reside in fast, on-chip registers
• Shared variables reside in fast, on-chip memories
• Thread-local arrays and global variables reside in uncached off-chip memory
• Constant variables reside in cached off-chip memory

CUDA Timer
    int main()
    {
        float myTime;                          // elapsed time in milliseconds
        cudaEvent_t myTimerStart, myTimerStop;
        cudaEventCreate(&myTimerStart);
        cudaEventCreate(&myTimerStop);

        cudaEventRecord(myTimerStart, 0);
        // task to be timed
        cudaEventRecord(myTimerStop, 0);

        // wait until the stop event has actually been recorded
        cudaEventSynchronize(myTimerStop);
        cudaEventElapsedTime(&myTime, myTimerStart, myTimerStop);

        cudaEventDestroy(myTimerStart);
        cudaEventDestroy(myTimerStop);
        return 0;
    }
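A measured time is often easier to interpret as an effective bandwidth. The host-side C++ sketch below converts an elapsed time in milliseconds, the unit that cudaEventElapsedTime reports, into GB/s. `effective_bandwidth_gbs` is a hypothetical helper, and the 0.5 ms figure in the usage note is made up for illustration, not a measurement.

```cpp
#include <cassert>

// Converts a byte count and an elapsed time in milliseconds into an
// effective bandwidth in GB/s, taking 1 GB = 1e9 bytes.
//   bytes / (ms * 1e-3 s) / 1e9  ==  bytes / (ms * 1e6)
double effective_bandwidth_gbs(double bytes_moved, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return bytes_moved / (elapsed_ms * 1.0e6);
}
```

For the add_matrix kernel with N = 1024, each launch reads two arrays and writes one, so it moves 3 x 1024 x 1024 x 4 = 12,582,912 bytes. If the timer reported a hypothetical 0.5 ms, the effective bandwidth would be about 25.17 GB/s.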

Some GPU-accelerated Libraries
[Figure: grid of library logos and captions]
• NVIDIA cuBLAS
• NVIDIA cuRAND
• NVIDIA cuSPARSE
• NVIDIA NPP
• NVIDIA cuFFT
• Vector signal and image processing
• GPU-accelerated linear algebra
• Matrix algebra on GPU and multicore
• ArrayFire matrix computations
• IMSL Library
• Building-block algorithms for CUDA
• Sparse linear algebra
• C++ STL features for CUDA

GPU Accelerated Libraries
• cuFFT
• cuBLAS
• cuSPARSE
• Performance Primitives
• Thrust

Performance on Scientific Applications
[Figure: bar chart of GPU speedup (axis from 0.0x to 20.0x) for MATLAB (FFT)* (engineering), Chroma (physics), SPECFEM3D (earth science), and AMBER (molecular dynamics)]
CPU results: dual-socket E5-2687w, 3.10 GHz. GPU results: dual-socket E5-2687w + 2 Tesla K20X GPUs.
*MATLAB results compare one i7-2600K CPU against one Tesla K20 GPU.
Disclaimer: non-NVIDIA implementations may not have been fully optimized.

CUFFT Performance vs. FFTW
• CUFFT starts to perform better than FFTW at data sizes of around 8192 elements. It beats FFTW for most large sizes (> 10,000 elements).

Source: http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/

GEMM C = αAB + βC
    /* General Matrix Multiply (simplified version) */
    /* Matrices are n x n and stored column-major: element (i, j) is at [j * n + i] */
    static void simple_dgemm( int n, double alpha, const double *A, const double *B,
                              double beta, double *C )
    {
        int i, j, k;
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                double prod = 0;
                for (k = 0; k < n; ++k)
                    prod += A[k * n + i] * B[j * n + k];

                C[j * n + i] = alpha * prod + beta * C[j * n + i];
            }
        }
    }
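The indexing in simple_dgemm only makes sense once the storage order is fixed. As a small worked check, the C++ sketch below repeats the routine and applies it to 2 x 2 matrices laid out column by column. The helper `col_major_at` is a hypothetical accessor added for illustration.

```cpp
#include <cassert>

// Same computation as simple_dgemm above: C = alpha * A * B + beta * C,
// for n x n matrices stored column-major.
void simple_dgemm(int n, double alpha, const double* A, const double* B,
                  double beta, double* C) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double prod = 0;
            for (int k = 0; k < n; ++k)
                prod += A[k * n + i] * B[j * n + k];
            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

// Element (row i, column j) of a column-major n x n matrix.
double col_major_at(const double* M, int n, int i, int j) {
    assert(i >= 0 && i < n && j >= 0 && j < n);
    return M[j * n + i];
}
```

Take A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. Stored column-major they are {1, 3, 2, 4} and {5, 7, 6, 8}. With alpha = 1 and beta = 0 the product is [[19, 22], [43, 50]], which comes back in memory as {19, 43, 22, 50}. The column-major convention is also what the BLAS dgemm routine expects.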

GEMM C = αAB + βC

    /* dgemm from BLAS library */
    extern "C" {
        extern void dgemm_(char *, char *, int *, int *, int *, double *, double *, int *,
                           double *, int *, double *, double *, int *);
    }

    /* Main */
    int main(int argc, char **argv)
    {
        . . .

        /* call gemm from BLAS library */
        dgemm_("N", "N", &N, &N, &N, &alpha, h_A, &N, h_B, &N, &beta, h_C_blas, &N);
        . . .

GEMM C = αAB + βC

    /* Main */
    int main(int argc, char **argv)
    {
        /* 0. Initialize CUBLAS */
        cublasCreate(&handle);

        /* 1. Allocate memory on GPU */
        cudaMalloc((void **)&d_A, n2 * sizeof(d_A[0]));

        /* 2. Copy data from Host to GPU */
        status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);

        /* 3. Execute GPU kernel */
        cublasDgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, d_A, N, d_B, N, &beta, d_C, N );

        /* 4. Copy data from GPU back to Host */
        cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

        /* 5. Free GPU memory */
        cudaFree(d_A);
    }

Jacobi Iteration
[Figure: five-point stencil. Each point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), and A(i,j+1)]

Jacobi Iteration C Code

    /* fmax and fabs come from <math.h> */
    /* Iterate until converged */
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        /* Iterate across matrix elements */
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                /* Calculate new value from neighbors */
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                /* Compute max error for convergence */
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        /* Swap input/output arrays */
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

OpenMP C Code

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        /* Parallelize loop across CPU threads.
           reduction(max:error) (OpenMP 3.1 or later) avoids a data race on error. */
        #pragma omp parallel for shared(m, n, Anew, A) reduction(max:error)
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        /* Parallelize loop across CPU threads */
        #pragma omp parallel for shared(m, n, Anew, A)
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }
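The listings above are fragments. For reference, a self-contained serial version can be sketched in C++ as follows, under the assumption that the grid is a vector of rows whose boundary values stay fixed. The names `Grid` and `jacobi` are hypothetical, and error starts above tol so that at least one sweep runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Runs Jacobi sweeps on A in place until the largest point-wise change
// drops to tol or iter_max sweeps have run. Boundary rows and columns are
// held fixed. Returns the number of sweeps performed.
int jacobi(Grid& A, double tol, int iter_max) {
    const int n = static_cast<int>(A.size());
    assert(n >= 3);
    const int m = static_cast<int>(A[0].size());
    assert(m >= 3);

    Grid Anew = A;
    double error = tol + 1.0;  // force at least one sweep
    int iter = 0;
    while (error > tol && iter < iter_max) {
        error = 0.0;
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j - 1][i] + A[j + 1][i]);
                error = std::max(error, std::fabs(Anew[j][i] - A[j][i]));
            }
        }
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j][i] = Anew[j][i];
        iter++;
    }
    return iter;
}
```

On a 3 x 3 grid of zeros whose top row is fixed at 1.0, the single interior point becomes 0.25 after the first sweep. The second sweep changes nothing, so the loop stops after 2 sweeps.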

OpenACC C

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        /* Execute GPU kernel for loop nest */
        #pragma acc kernels
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        /* Execute GPU kernel for loop nest */
        #pragma acc kernels
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

First Attempt: Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz
GPU: NVIDIA Tesla M2070

                         Execution time (s)   Speedup
CPU, 1 OpenMP thread           69.80            --
CPU, 2 OpenMP threads          44.76           1.56x
CPU, 4 OpenMP threads          39.59           1.76x
CPU, 6 OpenMP threads          39.71           1.76x
OpenACC GPU                   162.16           0.24x  FAIL

(CPU speedups are relative to 1 OpenMP thread. The GPU figure matches 39.71 s / 162.16 s, that is, a comparison against 6 OpenMP threads.)

Excessive Data Transfers

    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        /* Copy (host to accelerator) on entry to the kernels region */
        #pragma acc kernels
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        /* Copy (accelerator to host) on exit from the kernels region */

        ...
    }

These copies happen on every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!

OpenACC C (second version)

    /* Copy A in at the beginning of the loop and out at the end.
       Allocate Anew on the accelerator. */
    #pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        #pragma acc kernels
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        #pragma acc kernels
        for( int j = 1; j < n-1; j++ ) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }

        iter++;
    }

Second Attempt: Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz
GPU: NVIDIA Tesla M2070

                         Execution time (s)   Speedup
CPU, 1 OpenMP thread           69.80            --
CPU, 2 OpenMP threads          44.76           1.56x
CPU, 4 OpenMP threads          39.59           1.76x
CPU, 6 OpenMP threads          39.71           1.76x
OpenACC GPU                    13.65           2.9x

(CPU speedups are relative to 1 OpenMP thread. The GPU figure matches 39.71 s / 13.65 s, that is, a comparison against 6 OpenMP threads.)

Data Clauses
• copy ( list ): allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
• copyin ( list ): allocates memory on the GPU and copies data from host to GPU when entering the region.
• copyout ( list ): allocates memory on the GPU and copies data to the host when exiting the region.
• create ( list ): allocates memory on the GPU but does not copy.
• present ( list ): data is already present on the GPU from another containing data region.
• Also: present_or_copy[in|out], present_or_create, deviceptr.
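The effect of the data clauses on transfer counts can be tabulated in a few lines. The C++ sketch below is a bookkeeping model only, not OpenACC code: `Clause`, `Transfers`, `transfers_for`, and the two counting helpers are hypothetical names, and the count of 4 copies per iteration for the first Jacobi version is taken from the "Excessive Data Transfers" note earlier in the lecture.

```cpp
#include <cassert>

// The five basic data clauses listed above.
enum class Clause { Copy, CopyIn, CopyOut, Create, Present };

// Transfers triggered by one array under one clause over the lifetime
// of a single data region.
struct Transfers {
    int host_to_device;  // on entering the region
    int device_to_host;  // on exiting the region
};

Transfers transfers_for(Clause c) {
    switch (c) {
        case Clause::Copy:    return {1, 1};
        case Clause::CopyIn:  return {1, 0};
        case Clause::CopyOut: return {0, 1};
        case Clause::Create:  return {0, 0};
        case Clause::Present: return {0, 0};
    }
    return {0, 0};
}

// First Jacobi version: 4 copies on every iteration of the while loop.
long copies_without_data_region(long iterations) {
    assert(iterations >= 0);
    return 4 * iterations;
}

// Second version: one data region with copy(A), create(Anew) around the loop.
int copies_with_data_region() {
    Transfers a = transfers_for(Clause::Copy);
    Transfers anew = transfers_for(Clause::Create);
    return a.host_to_device + a.device_to_host +
           anew.host_to_device + anew.device_to_host;
}
```

Under this model, 1000 iterations of the first version trigger 4000 copies, while the second version triggers 2 regardless of the iteration count. That is consistent with the drop from 162.16 s to 13.65 s reported above.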

Learn More
• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda
• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight
• Programming Guide/Best Practices: docs.nvidia.com
• Questions:
  – NVIDIA Developer forums: devtalk.nvidia.com
  – Search or ask on: www.stackoverflow.com/tags/cuda
• General: www.nvidia.com/cudazone

Learn More
• CUDA C/C++: http://developer.nvidia.com/cuda-toolkit
• Thrust C++ Template Library: http://developer.nvidia.com/thrust
• CUDA Fortran: http://developer.nvidia.com/cuda-toolkit
• PyCUDA (Python): http://mathema.tician.de/software/pycuda
• GPU.NET: http://tidepowerd.com
• MATLAB: http://www.mathworks.com/discovery/matlab-gpu.html
• Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/

Summary
• GPU vs CPU
• CUDA execution model
• CUDA types
• CUDA programming
• CUDA timer