Lecture 7 CUDA

Dr. Wilson Rivera

ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department, University of Puerto Rico

Outline

• GPU vs CPU
• CUDA Execution Model
• CUDA Types
• CUDA Programming
• CUDA Timer

CUDA

• Compute Unified Device Architecture
  – Designed and developed by NVIDIA
  – Data-parallel programming interface to GPUs
• Requires an NVIDIA GPU (GeForce, Tesla, …)

CUDA SDK

[Figure: the CUDA SDK components]

GPU and CPU: The Differences

[Figure: CPU vs GPU chip layout. The CPU devotes large areas to control logic and cache next to a few ALUs; the GPU devotes most of its area to ALUs. Each has its own DRAM.]

• GPU
  – More transistors devoted to computation, instead of caching or flow control
  – Threads are extremely lightweight: very little creation overhead
  – Suitable for data-intensive computation: high arithmetic/memory operation ratio

Grids and Blocks

[Figure: the host launches kernels on the device; each kernel executes as a grid of blocks (e.g., Grid 1 with blocks (0,0) … (2,1)), and each block is a 2D array of threads (e.g., Block (1,1) with threads (0,0) … (4,2))]

• A kernel is executed as a grid of blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
  – (Unless through slow global memory)
• Threads and blocks have IDs

Accelerate Applications

Three ways to accelerate applications:
• Libraries: "drop-in" acceleration
• OpenACC directives: easily accelerate applications
• Programming languages: maximum flexibility

© NVIDIA 2013

GPU Programming

(approaches ordered from simplicity to performance)
• Libraries
  – cuBLAS
  – cuSPARSE
  – cuFFT
• Compiler directives
  – OpenACC (OpenMP-like)
• Language extensions
  – CUDA
  – OpenCL (not specific to NVIDIA)

GPU Programming Languages

Numerical analytics: MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Copperhead
F#: Alea.cuBase

© NVIDIA 2013

CUDA Execution Model

• Warp size: 32 threads
• Maximum thread block size: 512 threads
• Maximum grid size: 65,535 blocks per dimension
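These limits depend on the GPU generation and can be queried at runtime. A minimal sketch (not from the original slides) using the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );   /* properties of device 0 */
    printf( "warp size:             %d\n", prop.warpSize );
    printf( "max threads per block: %d\n", prop.maxThreadsPerBlock );
    printf( "max grid size:         %d x %d x %d\n",
            prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
    return 0;
}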

Programming Model

Single Instruction Multiple Thread (SIMT) execution:
• Groups of 32 threads are formed into warps
  o always executing the same instruction
  o share instruction fetch/dispatch
  o some become inactive when code paths diverge
  o hardware automatically handles divergence (see the sketch below)
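A sketch of divergence (assumed, not from the slides): in the kernel below, threads of the same warp take different branches, so the two paths execute one after the other while the threads on the other path sit inactive.

__global__ void diverge( int *out )
{
    int tid = threadIdx.x;
    if ( tid % 2 == 0 )          /* half of each warp takes this path ...       */
        out[tid] = tid * 2;
    else                         /* ... and waits while the other half runs this */
        out[tid] = tid + 1;
}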

• Warps are the primitive unit of scheduling
• The scheduler picks 1 of up to 24 resident warps for each instruction slot
• All warps from all active blocks are time-sliced

CUDA Execution Model

• Threads within a warp run synchronously in parallel
  – Threads in a warp are implicitly and efficiently synchronized
• Threads within a thread block run asynchronously in parallel
  – Threads in the same thread block can cooperate and synchronize (a sketch follows)
  – Threads in different thread blocks cannot cooperate and synchronize; they can, however, communicate indirectly via global memory
• The programmer is encouraged to decompose the program into small independent sub-tasks
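A sketch (assumed, not from the slides) of cooperation within one block: each thread loads one element into shared memory, __syncthreads() makes all loads visible block-wide, and each thread then reads a neighbor's value.

__global__ void shift( const float *in, float *out )
{
    __shared__ float buf[256];            /* one element per thread; assumes blockDim.x <= 256 */
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      /* all loads complete before any thread reads */
    out[blockIdx.x * blockDim.x + tid] = buf[(tid + 1) % blockDim.x];
}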

CUDA Parallel Threads and Memory

[Figure: memory spaces at each level of the thread hierarchy]

• Per-thread private local memory and registers:  float LocalVar;
• Per-block shared memory:  __shared__ float SharedVar;
• Per-application device global memory, shared by the sequence of grids (Grid 0, Grid 1, …):  __device__ float GlobalVar;

CUDA Kernel Maps to Grid of Blocks

[Figure: a host thread on the CPU launches a grid of thread blocks on the GPU; each SM has its own shared memory (SMem); CPU and GPU each have caches and their own DRAM (host memory, device memory), connected through the host bridge over PCIe]

CUDA: Hello, World!

#include <stdio.h>   /* for printf */

#define NUM_BLOCKS 4
#define BLOCK_WIDTH 8

__global__ void hello( void );   /* forward declaration, since main() calls it first */

/* Main function, executed on host (CPU) */
int main( void )
{
    /* print message from CPU */
    printf( "Hello Cuda!\n" );

    /* execute function on device (GPU); a kernel is a parallel function that runs on the GPU */
    hello<<< NUM_BLOCKS, BLOCK_WIDTH >>>();

    /* wait until all threads finish their job */
    cudaDeviceSynchronize();

    /* print message from CPU */
    printf( "Welcome back to CPU!\n" );

    return 0;
}

/* Function executed on device (GPU) */
__global__ void hello( void )
{
    printf( "\tHello from GPU: thread %d and block %d\n",
            threadIdx.x, blockIdx.x );
}

Kernel Function Call

• kernel<<< Grid, Block, Shared_mem, Stream >>>();
  – Grid: grid dimension (up to 2D)
  – Block: block dimension (up to 3D)
  – Shared_mem: dynamic shared memory size in bytes (optional)
  – Stream: stream ID (optional)

__global__ void filter( int *in, int *out );

dim3 grid( 16, 16 );
dim3 block( 16, 16 );
filter<<< grid, block, 0, 0 >>>( in, out );   // equivalent: filter<<< grid, block >>>( in, out );

Programming Model

Simple example (matrix addition): first as a plain CPU C program (sketched below), then as a CUDA program.
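The slide announces a CPU C version that did not survive extraction; here is a minimal sketch of what it would look like (add_matrix_cpu is a hypothetical name, using the same row-major layout as the CUDA kernel below):

void add_matrix_cpu( const float *a, const float *b, float *c, int N )
{
    /* walk the N x N matrices element by element */
    for ( int j = 0; j < N; ++j )
        for ( int i = 0; i < N; ++i ) {
            int index = i + j * N;
            c[index] = a[index] + b[index];
        }
}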

CUDA Example: Add_matrix

// Set grid size
const int N = 1024;
const int blocksize = 16;

// Compute kernel
__global__ void add_matrix( float *a, float *b, float *c, int N )
{
    // threadIdx, blockIdx, blockDim are built-in variables provided by CUDA at runtime
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    // CPU memory allocation
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];

    for ( int i = 0; i < N*N; ++i ) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    // GPU memory allocation
    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc( (void**)&ad, size );
    cudaMalloc( (void**)&bd, size );
    cudaMalloc( (void**)&cd, size );


    // copy data to GPU
    cudaMemcpy( ad, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, size, cudaMemcpyHostToDevice );

    // execute kernel
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<< dimGrid, dimBlock >>>( ad, bd, cd, N );

    // copy result back to CPU
    cudaMemcpy( c, cd, size, cudaMemcpyDeviceToHost );

    // clean up and return
    cudaFree( ad ); cudaFree( bd ); cudaFree( cd );
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;   // from <cstdlib>
}
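The example above ignores error codes for brevity. In practice every CUDA runtime call returns a cudaError_t; a common checking idiom (CUDA_CHECK is a hypothetical helper macro, not part of the original example; assumes <stdio.h> and <cstdlib>) is:

#define CUDA_CHECK( call )                                          \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if ( err != cudaSuccess ) {                                 \
            fprintf( stderr, "CUDA error at %s:%d: %s\n",           \
                     __FILE__, __LINE__, cudaGetErrorString(err) ); \
            exit( EXIT_FAILURE );                                   \
        }                                                           \
    } while (0)

/* usage: CUDA_CHECK( cudaMalloc( (void**)&ad, size ) ); */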

Memory Model

• Registers: per thread, read-write
• Local memory: per thread, read-write
• Shared memory: per block, read-write; for sharing data within a block
• Global memory: per grid, read-write; not cached
• Constant memory: per grid, read-only; cached
• Texture memory: per grid, read-only; spatially cached

Memory Model

There are 6 memory types (a usage sketch follows the list):

• Registers: on chip, fast access, per thread, limited amount, 32 bit
• Local memory: in DRAM, slow, non-cached, per thread, relatively large
• Shared memory: on chip, fast access, per block, 16 KByte, synchronize between threads
• Global memory: in DRAM, slow, non-cached, per grid, communicate between grids
• Constant memory: in DRAM, cached, per grid, read-only
• Texture memory: in DRAM, cached, per grid, read-only
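A sketch (assumed, not from the slides) of two of these spaces in use: coefficients placed in cached, read-only constant memory via cudaMemcpyToSymbol, and a per-block scratch buffer in shared memory.

__constant__ float coeff[4];              /* constant memory: per grid, read-only, cached */

__global__ void weighted_sum( const float *in, float *out )
{
    __shared__ float tile[128];           /* shared memory: per block, on chip; assumes blockDim.x <= 128 */
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = coeff[tid % 4] * tile[tid];
}

/* host side: float h_coeff[4] = { ... };
   cudaMemcpyToSymbol( coeff, h_coeff, sizeof(h_coeff) ); */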

Built-in Variables accessible in a Kernel

dim3 gridDim: the dimensions of the grid in blocks, as specified during kernel invocation (gridDim.x, gridDim.y; .z is unused)
uint3 blockIdx: the block index within the grid (blockIdx.x, blockIdx.y; .z is unused)
dim3 blockDim: the dimensions of a block in threads (blockDim.x, blockDim.y, blockDim.z)
uint3 threadIdx: the thread index within the block (threadIdx.x, threadIdx.y, threadIdx.z)
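Combining these built-ins gives each thread a unique global index; a typical 1D pattern (a sketch, echoing the add_matrix kernel above):

__global__ void scale( float *data, float alpha, int n )
{
    /* unique global index: block offset plus thread offset within the block */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if ( gid < n )                     /* guard: the grid may overshoot n */
        data[gid] *= alpha;
}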

CUDA Type Qualifiers

Function type qualifiers:
• __global__: executed on the device; callable from the host only
• __device__: executed on the device; callable from the device only
• __host__: executed on the host; callable from the host only (default type if unspecified)

Variable type qualifiers:
• __device__: global memory space; accessible from all the threads within the grid
• __constant__: constant memory space; accessible from all the threads within the grid
• __shared__: shared memory space of a thread block; only accessible from the threads within the block
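A sketch (assumed) showing the function qualifiers together: a __device__ helper callable only from GPU code, invoked from a __global__ kernel launched by the host.

__device__ float square( float x )              /* runs on GPU, callable from GPU code only */
{
    return x * x;
}

__global__ void squares( float *data, int n )   /* runs on GPU, launched from the host */
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if ( gid < n )
        data[gid] = square( data[gid] );
}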

CUDA Variable Type Qualifiers

Variable declaration             | Memory   | Scope  | Lifetime
int var;                         | register | thread | thread
int array_var[10];               | local    | thread | thread
__shared__ int shared_var;       | shared   | block  | block
__device__ int global_var;       | global   | grid   | application
__constant__ int constant_var;   | constant | grid   | application

• “Automatic” scalar variables without a qualifier reside in a register
  – the compiler will spill to thread-local memory if registers run out
• “Automatic” array variables without a qualifier reside in thread-local memory

CUDA Variable Type Performance

Variable declaration             | Memory   | Penalty
int var;                         | register | 1x
int array_var[10];               | local    | 100x
__shared__ int shared_var;       | shared   | 1x
__device__ int global_var;       | global   | 100x
__constant__ int constant_var;   | constant | 1x

• Scalar variables reside in fast, on-chip registers
• Shared variables reside in fast, on-chip memories
• Thread-local arrays and global variables reside in uncached off-chip memory
• Constant variables reside in cached off-chip memory

CUDA Timer

int main()
{
    float myTime;
    cudaEvent_t myTimerStart, myTimerStop;   // note: the type is cudaEvent_t
    cudaEventCreate( &myTimerStart );
    cudaEventCreate( &myTimerStop );

    cudaEventRecord( myTimerStart, 0 );
    // ... task to be timed ...
    cudaEventRecord( myTimerStop, 0 );

    cudaEventSynchronize( myTimerStop );                          // wait for the stop event
    cudaEventElapsedTime( &myTime, myTimerStart, myTimerStop );   // elapsed time in milliseconds
    cudaEventDestroy( myTimerStart );                             // release the events
    cudaEventDestroy( myTimerStop );
}

Some GPU-accelerated Libraries

• NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP
• NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• Sparse Linear Algebra
• Building-block Algorithms for CUDA
• C++ STL Features for CUDA
• ArrayFire Matrix Computations
• IMSL Library

© NVIDIA 2013

GPU Accelerated Libraries

• cuFFT
• cuBLAS
• cuSPARSE
• Performance Primitives
• Thrust

Performance on Scientific Applications

[Figure: bar chart of GPU speedups, 0.0x to 20.0x, on scientific applications: MATLAB (FFT)* (Engineering), Chroma (Physics), SPECFEM3D (Earth Science), AMBER (Molecular Dynamics)]

CPU results: dual-socket E5-2687W, 3.10 GHz. GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs. *MATLAB results compare one i7-2600K CPU vs. one Tesla K20 GPU. Disclaimer: non-NVIDIA implementations may not have been fully optimized.

© NVIDIA 2013

CUFFT Performance vs. FFTW

• CUFFT starts to perform better than FFTW around data sizes of 8192 elements, and it beats FFTW for most larger sizes (> 10,000 elements)

Source: http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
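For reference, a minimal cuFFT sketch (assumed, not from the slides): plan a 1D complex-to-complex transform, execute it in place on device data, and release the plan.

#include <cufft.h>

void fft_inplace( cufftComplex *d_data, int n )   /* d_data: device pointer, n elements */
{
    cufftHandle plan;
    cufftPlan1d( &plan, n, CUFFT_C2C, 1 );                 /* one 1D C2C transform of size n */
    cufftExecC2C( plan, d_data, d_data, CUFFT_FORWARD );   /* in-place forward FFT */
    cufftDestroy( plan );
}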

GEMM C = αAB + βC

/* General Matrix Multiply (simplified version), column-major storage */
static void simple_dgemm( int n, double alpha, const double *A,
                          const double *B, double beta, double *C )
{
    int i, j, k;
    for ( i = 0; i < n; ++i ) {
        for ( j = 0; j < n; ++j ) {

            double prod = 0;
            for ( k = 0; k < n; ++k )
                prod += A[k * n + i] * B[j * n + k];

            C[j * n + i] = alpha * prod + beta * C[j * n + i];
        }
    }
}

GEMM C = αAB + βC

/* dgemm from BLAS library */
extern "C" {
    extern void dgemm_( char *, char *, int *, int *, int *,
                        double *, double *, int *, double *, int *,
                        double *, double *, int * );
};

/* Main */
int main( int argc, char **argv )
{
    . . .

    /* call gemm from BLAS library; "N","N" = no transpose of A or B */
    dgemm_( "N", "N", &N, &N, &N, &alpha, h_A, &N, h_B, &N, &beta, h_C_blas, &N );
    . . .

GEMM C = αAB + βC

/* Main */
int main( int argc, char **argv )
{
    /* 0. Initialize CUBLAS */
    cublasCreate( &handle );

    /* 1. Allocate memory on GPU */
    cudaMalloc( (void **)&d_A, n2 * sizeof(d_A[0]) );

    /* 2. Copy data from host to GPU */
    status = cublasSetVector( n2, sizeof(h_A[0]), h_A, 1, d_A, 1 );

    /* 3. Execute GPU kernel */
    cublasDgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, d_A, N, d_B, N, &beta, d_C, N );

    /* 4. Copy data from GPU back to host */
    cublasGetVector( n2, sizeof(h_C[0]), d_C, 1, h_C, 1 );

    /* 5. Free GPU memory */
    cudaFree( d_A );
    cublasDestroy( handle );   // release the cuBLAS context
}

Jacobi Iteration

[Figure: 5-point stencil; each point A(i,j) is updated from its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]

© NVIDIA 2013

Jacobi Iteration C Code

while ( error > tol && iter < iter_max )   // iterate until converged
{
    error = 0.0;

    for ( int j = 1; j < n-1; j++ ) {      // iterate across matrix elements
        for ( int i = 1; i < m-1; i++ ) {

            // calculate new value from neighbors
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            // compute max error for convergence
            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    // swap input/output arrays
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

OpenMP C Code

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    // parallelize loop across CPU threads
    // (note: in production code, error needs a max reduction, e.g. reduction(max:error))
    #pragma omp parallel for shared(m, n, Anew, A)
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    // parallelize loop across CPU threads
    #pragma omp parallel for shared(m, n, Anew, A)
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

OpenACC C

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels   // execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma acc kernels   // execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

First Attempt: Performance

CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz.  GPU: NVIDIA Tesla M2070.

                        | Execution Time (s) | Speedup
CPU, 1 OpenMP thread    | 69.80              | --
CPU, 2 OpenMP threads   | 44.76              | 1.56x
CPU, 4 OpenMP threads   | 39.59              | 1.76x
CPU, 6 OpenMP threads   | 39.71              | 1.76x
OpenACC GPU             | 162.16             | 0.24x  FAIL

© NVIDIA 2013

Excessive Data Transfers

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    // A and Anew are copied to the GPU at the start of the kernels region and
    // back at its end; these copies happen every iteration of the outer while loop!*
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }
    ...
}

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while-loop iteration!

© NVIDIA 2013

OpenACC C (second version)

// copy A in at the beginning of the loop and out at the end;
// allocate Anew on the accelerator (never copied)
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {

            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );

            error = max( error, abs(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}

© NVIDIA 2013

Second Attempt: Performance

CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz.  GPU: NVIDIA Tesla M2070.

                        | Execution Time (s) | Speedup
CPU, 1 OpenMP thread    | 69.80              | --
CPU, 2 OpenMP threads   | 44.76              | 1.56x
CPU, 4 OpenMP threads   | 39.59              | 1.76x
CPU, 6 OpenMP threads   | 39.71              | 1.76x
OpenACC GPU             | 13.65              | 2.9x

© NVIDIA 2013

Data Clauses

copy( list ): allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
copyin( list ): allocates memory on the GPU and copies data from host to GPU when entering the region.
copyout( list ): allocates memory on the GPU and copies data to the host when exiting the region.
create( list ): allocates memory on the GPU but does not copy.
present( list ): data is already present on the GPU from another containing data region.
Also: present_or_copy[in|out], present_or_create, deviceptr.
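A sketch (assumed, not from the slides) of the clauses in use: x is only copied in, while y is copied both in and out, avoiding unnecessary transfers.

void saxpy( int n, float alpha, const float *x, float *y )
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels
        for ( int i = 0; i < n; ++i )
            y[i] = alpha * x[i] + y[i];
    }
}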

© NVIDIA 2013

Learn More

• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda

• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight

• Programming Guide / Best Practices: docs.nvidia.com

• Questions:
  – NVIDIA Developer forums: devtalk.nvidia.com
  – Search or ask on: www.stackoverflow.com/tags/cuda

• General: www.nvidia.com/cudazone

© NVIDIA 2013

Learn More

• CUDA C/C++: http://developer.nvidia.com/cuda-toolkit
• Thrust C++ Template Library: http://developer.nvidia.com/thrust
• CUDA Fortran: http://developer.nvidia.com/cuda-toolkit
• GPU.NET: http://tidepowerd.com
• MATLAB: http://www.mathworks.com/discovery/matlab-gpu.html
• PyCUDA (Python): http://mathema.tician.de/software/pycuda
• Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and--support/

© NVIDIA 2013

Summary

• GPU vs CPU
• CUDA Execution Model
• CUDA Types
• CUDA Programming
• CUDA Timer
