
Basics of CUDA Programming

Weijun Xiao, Department of Electrical and Computer Engineering, University of Minnesota

1 Outline

• What's GPU computing?
• CUDA programming model
• Basic memory management
• Basic kernels and execution on GPU
• CPU and GPU coordination
• CUDA debugging and profiling
• Conclusions

2 What is a GPU?

• Graphics Processing Unit

[Diagram: the GPU turns a logical representation of visual information into an output signal for the display.]

3 Performance Gap between GPUs and CPUs

4 GPU = Fast Parallel Machine

• GPU speed is increasing at a faster pace than Moore's Law.
• This is a consequence of the data-parallel streaming aspects of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together, and you can get a super-computer.

So can we use the GPU for general-purpose computing?

5 Sure, thousands of Applications

• Large matrix/vector operations (BLAS)
• Protein folding (molecular dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech recognition (hidden Markov models, neural nets)
• Databases
• Sort/search
• Storage
• MRI
• …

6 Why Are We Interested in GPUs?

• High-performance computing
• High parallelism
• Low cost
• GPUs are programmable
• GPGPU (general-purpose computing on GPUs)

7 Growth and Development of GPU

• A quiet revolution and potential speed-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Before CUDA, programmed through graphics APIs

[Chart: GFLOPS over time for successive GPUs vs. CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

– GPU in every PC and workstation – massive volume and potential impact

8 GeForce 8800

16 highly threaded multiprocessors (MPs), 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory BW, 4 GB/s BW to CPU

[Block diagram of the GeForce 8800: Host → Input Assembler → Thread Execution Manager → an array of multiprocessors, each with a Parallel Data Cache and texture units, connected through load/store units to Global Memory.]

9 Tesla C2050

14 MPs, 448 cores, 1.03 TFLOPS single precision / 515 GFLOPS double precision, 3 GB GDDR5 DRAM with ECC, 144 GB/s memory BW, PCIe 2.0 x16 (8 GB/s BW to CPU)

10 GPU Languages

• Assembly
• Cg (NVIDIA) - for graphics
• GLSL (OpenGL) - OpenGL Shading Language
• HLSL (Microsoft) - High-Level Shading Language
• Brook C/C++ (AMD)
• CUDA (NVIDIA)
• OpenCL

11 How Did GPGPU Work before CUDA?

• Follow the graphics pipeline
• Pretend to be a graphics application
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise computation as render passes
• Fool the graphics pipeline into doing computation

12 CUDA Programming Model

• Compute Unified Device Architecture
• Simple and general-purpose programming model
• Standalone driver to load computation programs into the GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use, with a low learning curve

13 CUDA – C with no shader limitations!

• Integrated host + device application C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)

Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args); . . .

Serial Code (host)

Parallel Kernel (device) . . . KernelB<<< nBlk, nTid >>>(args);

14 CUDA Devices and Threads

• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

15 Extended C

• Declspecs – global, device, shared, local, constant
• Keywords – threadIdx, blockIdx
• Intrinsics – __syncthreads
• Runtime API – memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

16 Compiling a CUDA Program

• Parallel Thread eXecution (PTX)
– Virtual machine and ISA
– Programming model
– Execution resources and state

Compilation flow: C/C++ CUDA application → NVCC → host CPU code + virtual PTX code → PTX-to-target compiler → target code for a physical GPU (G80, …).

Example:
C:    float4 me = gx[gtid]; me.x += me.y * me.z;
PTX:  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;

17 Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads – All threads run the same code (SPMD) – Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID:  0  1  2  3  4  5  6  7

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
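As a concrete illustration of this pattern, here is a minimal sketch (the kernel name, the scaling operation, and the size parameter n are illustrative, not from the slides): each thread computes one element, and a guard masks off any extra threads.

// Sketch of the SPMD pattern above: one array element per thread.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < n)                                      // guard against extra threads
        data[tid] = factor * data[tid];
}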

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

18 Thread Blocks: Scalable Cooperation

• Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (a small example follows the figure below)
– Threads in different blocks cannot cooperate

[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threadIDs 0–7 running the same code:
float x = input[threadID]; float y = func(x); output[threadID] = y;]
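To make "cooperate via shared memory and barrier synchronization" concrete, here is a hedged sketch (kernel name and tile size are illustrative): each block stages its tile of the input in shared memory, synchronizes, and writes the tile back reversed.

// Sketch: threads in a block stage data in shared memory, hit a barrier,
// then read an element loaded by a different thread of the same block.
__global__ void block_reverse(int *data)
{
    __shared__ int tile[256];                 // assumes blockDim.x == 256
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[gid];            // each thread loads one element
    __syncthreads();                          // barrier: whole tile is loaded

    data[gid] = tile[blockDim.x - 1 - threadIdx.x];
}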

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

19 Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …

[Figure: the Host launches Kernel 1 on Grid 1, containing Blocks (0,0)–(1,1); Kernel 2 launches Grid 2, where Block (1,1) contains Threads (0,0,0)–(3,0,1). Courtesy: NVIDIA]

(Figure 3.2: An Example of CUDA Thread Organization.)

20 CUDA Memory Model

• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
– Long latency access

[Figure: the Host connects to device Global Memory; a Grid contains Blocks (0,0) and (1,0), each with its own Shared Memory and per-thread Registers for Threads (0,0) and (1,0).]
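A small sketch of where data lives in this model (names and sizes are illustrative): kernel pointer parameters refer to global memory, automatic variables live in registers, and __shared__ arrays are per-block shared memory.

// Illustrative only: one variable per memory space discussed above.
__global__ void memory_spaces_demo(float *g_in, float *g_out)   // g_* point to global memory
{
    __shared__ float s_buf[128];      // shared memory, visible to the whole block
    float r_val;                      // automatic variable, lives in a register

    int tid = threadIdx.x;            // assumes blockDim.x <= 128
    s_buf[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    r_val = s_buf[tid] * 2.0f;
    g_out[blockIdx.x * blockDim.x + tid] = r_val;   // result back to global memory
}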

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

21 Basic Memory Management

22 Memory Spaces

• CPU and GPU have separate memory spaces – Data is moved across PCIe – Use functions to allocate/set/copy memory on GPU • Very similar to corresponding C functions

• Pointers are just addresses – Can’t tell from the pointer value whether the address is on CPU or GPU – Must exercise care when dereferencing: • Dereferencing CPU pointer on GPU will likely crash • Same for vice versa

23 GPU Memory Allocation / Release

• Host (CPU) manages device (GPU) memory: – cudaMalloc (void ** pointer, size_t nbytes) – cudaMemset (void * pointer, int value, size_t count) – cudaFree (void* pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree(d_a);

24 Data Copies

• cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
– returns after the copy is complete
– blocks the CPU thread until all bytes have been copied
– doesn't start copying until previous CUDA calls complete
• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
• Non-blocking memcopies are also provided (a sketch follows below)
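A minimal sketch of a non-blocking copy, assuming a stream and pinned host memory (names and sizes are illustrative; a truly asynchronous host-device copy requires page-locked host memory).

// Sketch: overlap a host-to-device copy with other CPU/GPU work.
int nbytes = 1024 * sizeof(float);
float *h_buf = 0, *d_buf = 0;
cudaStream_t stream;

cudaMallocHost((void**)&h_buf, nbytes);     // pinned (page-locked) host allocation
cudaMalloc((void**)&d_buf, nbytes);
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
// ... CPU work, or kernels in other streams, can overlap with the copy here ...
cudaStreamSynchronize(stream);              // wait for the copy to finish

cudaStreamDestroy(stream);
cudaFree(d_buf);
cudaFreeHost(h_buf);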

25 Code Walkthrough 1

• Allocate CPU memory for n integers • Allocate GPU memory for n integers • Initialize GPU memory to 0s • Copy from GPU to CPU • Print the values

26 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

27 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

28 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

29 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int i=0; i<dimx; i++) printf("%d ", h_a[i]); printf("\n");

free( h_a ); cudaFree( d_a );

return 0;
}

30 Basic Kernels and Execution on GPU

31 CUDA Function Declarations

                                    Executed on the:    Only callable from the:
__device__ float DeviceFunc()       device              device
__global__ void  KernelFunc()       device              host
__host__   float HostFunc()         host                host

• __global__ defines a kernel function – Must return void • __device__ and __host__ can be used together
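A minimal sketch of "__device__ and __host__ can be used together" (the function and kernel names are illustrative): the same source function is compiled once for the CPU and once for the GPU.

// Sketch: one function usable from both host and device code.
__host__ __device__ float square(float x)
{
    return x * x;
}

__global__ void square_kernel(float *a)
{
    a[threadIdx.x] = square(a[threadIdx.x]);   // device-side call
}

// On the CPU:  float y = square(3.0f);        // host-side call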

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

32 CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken • For functions executed on the device: – Can only access GPU memory – No recursion – No static variable declarations inside the function – No variable number of arguments

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

33 Code Walkthrough 2

• Build on Walkthrough 1 • Write a kernel to initialize integers • Copy the result back to CPU • Print the values

34 Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

35 Launching kernels on GPU

• Launch parameters:
– grid dimensions (up to 2D), dim3 type
– thread-block dimensions (up to 3D), dim3 type
– shared memory: number of bytes per block
• for extern smem variables declared without size
• optional, 0 by default
– stream ID
• optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);

(A sketch using the shared-memory and stream arguments follows below.)
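A hedged sketch of a launch that uses all four parameters (the kernel name, sizes, and d_data are illustrative assumptions): the third argument sets the size of the extern __shared__ array, the fourth selects the stream.

// Device side: dynamic shared memory sized at launch time.
__global__ void smem_kernel(float *data)
{
    extern __shared__ float smem[];          // size supplied by the launch
    smem[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... use smem ...
}

// Host side:
dim3 grid(16), block(128);
size_t smem_bytes = block.x * sizeof(float); // dynamic shared memory per block
cudaStream_t s;
cudaStreamCreate(&s);
smem_kernel<<<grid, block, smem_bytes, s>>>(d_data);   // d_data assumed allocated earlier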

36

#include <stdio.h>

__global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = 7; }

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes );

dim3 grid, block; block.x = 4; grid.x = dimx / block.x;

kernel<<<grid, block>>>( d_a );

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int i=0; i<dimx; i++) printf("%d ", h_a[i]); printf("\n");

free( h_a ); cudaFree( d_a );

return 0;
}

37 Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

38 Code Walkthrough 3

• Build on Walkthrough 2 • Write a kernel to increment n×m integers • Copy the result back to CPU • Print the values

39 Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
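The walkthrough below assumes dimx and dimy are exact multiples of the block size. When they are not, a common pattern is to round the grid up and mask off the extra threads; a hedged sketch (the kernel name kernel_guarded is illustrative):

// Sketch: grid sizing when dimx/dimy are not multiples of the block size.
__global__ void kernel_guarded( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    if (ix < dimx && iy < dimy)                  // bounds check for partial blocks
        a[iy*dimx + ix] += 1;
}

// Host side:
dim3 block(4, 4);
dim3 grid( (dimx + block.x - 1) / block.x,       // ceiling division
           (dimy + block.y - 1) / block.y );
kernel_guarded<<<grid, block>>>( d_a, dimx, dimy );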

40 int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes );

dim3 grid, block;
block.x = 4; block.y = 4;
grid.x = dimx / block.x; grid.y = dimy / block.y;

kernel<<<grid, block>>>( d_a, dimx, dimy );

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col]); printf("\n"); }

free( h_a ); cudaFree( d_a );

return 0;
}

41 Blocks must be independent

• Any possible interleaving of blocks should be valid – presumed to run to completion without pre-emption – can run in any order – can run concurrently OR sequentially

• Blocks may coordinate but not synchronize
– shared queue pointer: OK (see the sketch after this slide)
– shared lock: BAD … can easily deadlock

• Independence requirement gives scalability
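"Shared queue pointer: OK" can be made concrete with a global counter advanced by atomicAdd; whichever blocks happen to be running grab the next work items, so no ordering between blocks is assumed. A hedged sketch (the kernel name and the notion of "items" are illustrative):

// Sketch: threads pull work items from a global queue via atomicAdd.
// No block waits on another block, so any execution order is valid.
__global__ void worker(int *queue_head, int num_items, float *items)
{
    while (true) {
        int my_item = atomicAdd(queue_head, 1);   // grab the next index atomically
        if (my_item >= num_items)
            break;                                // queue exhausted
        items[my_item] *= 2.0f;                   // "process" the work item
    }
}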

42 Blocks must be independent

• Thread blocks can run in any order – Concurrently or sequentially – Facilitates scaling of the same code across many devices

Scalability

43 Coordinating CPU and GPU Execution

44 Synchronizing GPU and CPU

• All kernel launches are asynchronous
– control returns to CPU immediately
– kernel starts executing once all previous CUDA calls have completed
• Memcopies are synchronous
– control returns to CPU once the copy is complete
– copy starts once all previous CUDA calls have completed
• cudaThreadSynchronize()
– blocks until all previous CUDA calls complete
• Asynchronous CUDA calls provide:
– non-blocking memcopies
– ability to overlap memcopies and kernel execution

45 CUDA Error Reporting to CPU

• All CUDA calls return error code: – except kernel launches – cudaError_t type

• cudaError_t cudaGetLastError(void) – returns the code for the last error (“no error” has a code)

• char* cudaGetErrorString(cudaError_t code)
– returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
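A common way to use these calls is a small checking wrapper; a minimal sketch (the macro name CUDA_CHECK is illustrative, not a CUDA API):

// Sketch of a simple error-checking wrapper around CUDA runtime calls.
#include <stdio.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            printf("CUDA error at %s:%d: %s\n",                        \
                   __FILE__, __LINE__, cudaGetErrorString(err));       \
        }                                                              \
    } while (0)

// Usage:
//   CUDA_CHECK( cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost) );
//   kernel<<<grid, block>>>(d_a);
//   CUDA_CHECK( cudaGetLastError() );   // kernel launches do not return an error code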

46 Device Management

• CPU can query and select GPU devices – cudaGetDeviceCount( int* count ) – cudaSetDevice( int device ) – cudaGetDevice( int *current_device ) – cudaGetDeviceProperties( cudaDeviceProp* prop, int device ) – cudaChooseDevice( int *device, cudaDeviceProp* prop ) • Multi-GPU setup: – device 0 is used by default – one CPU thread can control one GPU • multiple CPU threads can control the same GPU – calls are serialized by the driver
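A short sketch tying these calls together (device 0 is chosen arbitrarily for illustration): enumerate the devices, print their properties, then select one for the current CPU thread.

// Sketch: enumerate CUDA devices and select one.
int count = 0;
cudaGetDeviceCount(&count);

for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s, compute capability %d.%d\n",
           dev, prop.name, prop.major, prop.minor);
}

cudaSetDevice(0);   // subsequent CUDA calls in this CPU thread use device 0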

47 CUDA Debugging and Profiling

48 What's cuda-gdb?

• All-in-one debugging tool • Host and CUDA codes • Extension to gdb • 32/64-bit Linux • 4.0 release

49 Debug Compilation

• Compile with -g -G
• nvcc -g -G foo.cu -o foo
• Fermi: -gencode arch=compute_20,code=sm_20
• Makefile
• CUDA-GDB error: undefined reference to '$gpu_registers' (2.2 beta or previous)
• ptxvars.cu: nvcc "/usr/local/cuda/bin/ptxvars.cu" -g -G --host-compilation=c -c -define-always-macro _DEVICE_LAUNCH_PARAMETERS_H__ -Xptxas -fext

50 Extension to GDB

• Debug both host and GPU code seamlessly
• GPU memory is seen as an extension to host memory
• GPU threads/blocks are seen as extensions to host threads
• Breakpoints at any host and/or device function symbol or file line number
• Single-step individual warps

51 Debug commands

• thread <<<(BX,BY),(TX,TY,TZ)>>>, e.g. thread <<<170>>>, thread <<<2,(10,10)>>>
• cuda block (n,m) thread (x,y,z)
• info cuda state (replacing state with devices, kernels, system, warp, sm, …)

52 Debugging Commands(cont.)

• break • print • continue • next • step • quit • set args … • GDB quick reference http://users.ece.utexas.edu/~adnan/gdb-refcard.pdf

53 Example code

• 8-bit bit reverse • 00011101 -> 10111000 • 10010111 -> 11101001

54

r = 0;
for (int i = 0; i < 8; i++) {
    r = r << 1;
    if (x & 1) r += 1;   // test the low bit
    x = x >> 1;
}

x = ((0xf0 & x) >> 4) | ((0x0f & x) << 4);
x = ((0xcc & x) >> 2) | ((0x33 & x) << 2);
x = ((0xaa & x) >> 1) | ((0x55 & x) << 1);

55 Code

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  // Simple 8-bit bit reversal Compute test
 5
 6  #define N 256
 7
 8  __global__ void bitreverse(unsigned int *data)
 9  {
10      unsigned int *idata = data;
11
12      unsigned int x = idata[threadIdx.x];
13
14      x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
15      x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
16      x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);
17
18      idata[threadIdx.x] = x;
19  }
20
21  int main(void)
22  {
23      unsigned int *d = NULL; int i;
24      unsigned int idata[N], odata[N];
25
26      for (i = 0; i < N; i++)
27          idata[i] = (unsigned int)i;
28
29      cudaMalloc((void**)&d, sizeof(int)*N);
30      cudaMemcpy(d, idata, sizeof(int)*N,
31                 cudaMemcpyHostToDevice);
32
33      bitreverse<<<1, N>>>(d);
34
35      cudaMemcpy(odata, d, sizeof(int)*N,
36                 cudaMemcpyDeviceToHost);
37
38      for (i = 0; i < N; i++)
39          printf("%u -> %u\n", idata[i], odata[i]);
40
41      cudaFree((void*)d);
42      return 0;
43  }

56 Cuda-gdb Supported Platforms

• Host platform
– X11 cannot be running on the GPU used for debugging
– One GPU: disable X11
– Two or more GPUs: run X11 on a GPU not used for debugging
• GPU requirements
– All CUDA-enabled GPUs except 8800 GTS, 8800 GTX, 8800 Ultra, FX 4600, and FX 5600

57 Debugging example code

• Step 1: nvcc -g -G bitreverse.cu -o bitreverse
• Step 2: cuda-gdb ./bitreverse
• Step 3: Set breakpoints (break main, break bitreverse, break 18)
• Step 4: Run the CUDA application: (cuda-gdb) run
• Step 5: Continue and watch variables: (cuda-gdb) continue, (cuda-gdb) thread, (cuda-gdb) print x

58 Profiling Tools

• CUDA memcheck • Occupancy Calculator • Visual Profiler

59 CUDA Visual Profiler

60 CUDA Counters

61 Profiler Counters for Fermi

• branch, divergent branch • instruction issued, instruction executed • sm cta launched • gld request, gst request • local load, local store • shared load, shared store • warps launched, threads launched • l1 global load hit, l1 global load miss • l1 local load hit, l1 local load miss • l1 local store hit, l1 local store miss • l1 shared bank conflicts • uncached global load transaction • global store transaction • l2 read requests, l2 write requests • l2 read misses, l2 write misses • dram reads, dram writes • tex cache requests, tex cache misses

62 Memory throughput

• Compute capability < 2.0:
Global read throughput = (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime
Global write throughput = (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / gputime

• Compute capability >= 2.0:
Global read throughput = (dram reads * 32) / gputime
Global write throughput = (dram writes * 32) / gputime

• Gmem overall throughput = read throughput + write throughput
• Tesla C2050: theoretical bandwidth 144 GB/s
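As a worked example with made-up counter values (illustrative only): on a compute-capability-2.0 device, if the profiler reports 2,000,000 dram reads and 1,000,000 dram writes for a kernel with gputime = 1 ms, then read throughput = 2,000,000 × 32 B / 1 ms = 64 GB/s, write throughput = 1,000,000 × 32 B / 1 ms = 32 GB/s, and overall global-memory throughput ≈ 96 GB/s, roughly two thirds of the Tesla C2050's theoretical 144 GB/s.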

63 Conclusions

• GPU as an accelerator for HPC • CUDA programming model • CUDA thread and kernel • CUDA example codes • CUDA debugging and profiling

64