
Basics of CUDA Programming

Weijun Xiao, Department of Electrical and Computer Engineering, University of Minnesota

1 Outline

• What's GPU computing?
• CUDA programming model
• Basic memory management
• Basic kernels and execution on GPU
• CPU and GPU coordination
• CUDA debugging and profiling
• Conclusions

2 What is a GPU?

• Graphics Processing Unit

[Diagram: the GPU turns a logical representation of visual information into an output signal for the display.]

3 Performance Gap between GPUs and CPUs

4 GPU = Fast Parallel Machine

• GPU speed is increasing at a faster pace than Moore's Law.
• This is a consequence of the data-parallel streaming aspects of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together, and you can get a super-computer.

So can we use the GPU for general-purpose computing?

5 Sure, thousands of Applications

• Large matrix/vector operations (BLAS)
• Protein folding (molecular dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech recognition (hidden Markov models, neural nets)
• Databases
• Sort/search
• Storage
• MRI
• …

6 Why Are We Interested in GPUs?

• High-performance computing
• High parallelism
• Low cost
• GPUs are programmable
• GPGPU (general-purpose computing on GPUs)

7 Growth and Development of GPU

• A quiet revolution and potential speed-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Before CUDA, programmed through graphics APIs

[Chart: GFLOPS over time for successive GPUs vs. CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

– GPU in every PC and workstation – massive volume and potential impact

8 GeForce 8800

16 highly threaded multiprocessors (MPs), 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory BW, 4 GB/s BW to CPU

[Block diagram of the GeForce 8800: Host → Input Assembler → Thread Execution Manager → an array of multiprocessors, each with a Parallel Data Cache and texture units, connected through load/store units to Global Memory.]

9 Tesla C2050

14 MPs, 448 cores, 1.03 TFLOPS single precision / 515 GFLOPS double precision, 3 GB GDDR5 DRAM with ECC, 144 GB/s memory BW, PCIe 2.0 x16 (8 GB/s BW to CPU)

10 GPU Languages

• Assembly
• Cg (NVIDIA) - for graphics
• GLSL (OpenGL) - OpenGL Shading Language
• HLSL (Microsoft) - High-Level Shading Language
• Brook C/C++ (AMD)
• CUDA (NVIDIA)
• OpenCL

11 How Did GPGPU Work before CUDA?

• Follow the graphics pipeline
• Pretend to be a graphics application
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise computation as render passes
• Fool the graphics pipeline into doing computation

12 CUDA Programming Model

• Compute Unified Device Architecture
• Simple and general-purpose programming model
• Standalone driver to load computation programs into the GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use, with a low learning curve

13 CUDA – C with no shader limitations!

• Integrated host + device application C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)

Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args); . . .

Serial Code (host)

Parallel Kernel (device) . . . KernelB<<< nBlk, nTid >>>(args);

14 CUDA Devices and Threads

• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

15 Extended C

• Declspecs – global, device, shared, local, constant
• Keywords – threadIdx, blockIdx
• Intrinsics – __syncthreads
• Runtime API – memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

16 Compiling a CUDA Program

• Parallel Thread eXecution (PTX)
– Virtual machine and ISA
– Programming model
– Execution resources and state

Compilation flow: C/C++ CUDA application → NVCC → host CPU code + virtual PTX code → PTX-to-target compiler → target code for a physical GPU (G80, …).

Example:
C:    float4 me = gx[gtid]; me.x += me.y * me.z;
PTX:  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;

17 Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads – All threads run the same code (SPMD) – Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID:  0  1  2  3  4  5  6  7

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
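As a concrete illustration of this pattern, here is a minimal sketch (the kernel name, the scaling operation, and the size parameter n are illustrative, not from the slides): each thread computes one element, and a guard masks off any extra threads.

// Sketch of the SPMD pattern above: one array element per thread.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (tid < n)                                      // guard against extra threads
        data[tid] = factor * data[tid];
}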

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

18 Thread Blocks: Scalable Cooperation

• Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (a small example follows the figure below)
– Threads in different blocks cannot cooperate

[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threadIDs 0–7 running the same code:
float x = input[threadID]; float y = func(x); output[threadID] = y;]
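To make "cooperate via shared memory and barrier synchronization" concrete, here is a hedged sketch (kernel name and tile size are illustrative): each block stages its tile of the input in shared memory, synchronizes, and writes the tile back reversed.

// Sketch: threads in a block stage data in shared memory, hit a barrier,
// then read an element loaded by a different thread of the same block.
__global__ void block_reverse(int *data)
{
    __shared__ int tile[256];                 // assumes blockDim.x == 256
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[gid];            // each thread loads one element
    __syncthreads();                          // barrier: whole tile is loaded

    data[gid] = tile[blockDim.x - 1 - threadIdx.x];
}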

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

19 Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …

[Figure: the Host launches Kernel 1 on Grid 1, containing Blocks (0,0)–(1,1); Kernel 2 launches Grid 2, where Block (1,1) contains Threads (0,0,0)–(3,0,1). Courtesy: NVIDIA]

(Figure 3.2: An Example of CUDA Thread Organization.)

20 CUDA Memory Model

• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
– Long latency access

[Figure: the Host connects to device Global Memory; a Grid contains Blocks (0,0) and (1,0), each with its own Shared Memory and per-thread Registers for Threads (0,0) and (1,0).]
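A small sketch of where data lives in this model (names and sizes are illustrative): kernel pointer parameters refer to global memory, automatic variables live in registers, and __shared__ arrays are per-block shared memory.

// Illustrative only: one variable per memory space discussed above.
__global__ void memory_spaces_demo(float *g_in, float *g_out)   // g_* point to global memory
{
    __shared__ float s_buf[128];      // shared memory, visible to the whole block
    float r_val;                      // automatic variable, lives in a register

    int tid = threadIdx.x;            // assumes blockDim.x <= 128
    s_buf[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    r_val = s_buf[tid] * 2.0f;
    g_out[blockIdx.x * blockDim.x + tid] = r_val;   // result back to global memory
}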

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

21 Basic Memory Management

22 Memory Spaces

• CPU and GPU have separate memory spaces – Data is moved across PCIe – Use functions to allocate/set/copy memory on GPU • Very similar to corresponding C functions

• Pointers are just addresses – Can’t tell from the pointer value whether the address is on CPU or GPU – Must exercise care when dereferencing: • Dereferencing CPU pointer on GPU will likely crash • Same for vice versa

23 GPU Memory Allocation / Release

• Host (CPU) manages device (GPU) memory: – cudaMalloc (void ** pointer, size_t nbytes) – cudaMemset (void * pointer, int value, size_t count) – cudaFree (void* pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree(d_a);

24 Data Copies

• cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
– returns after the copy is complete
– blocks the CPU thread until all bytes have been copied
– doesn't start copying until previous CUDA calls complete
• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
• Non-blocking memcopies are also provided (a sketch follows below)
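A minimal sketch of a non-blocking copy, assuming a stream and pinned host memory (names and sizes are illustrative; a truly asynchronous host-device copy requires page-locked host memory).

// Sketch: overlap a host-to-device copy with other CPU/GPU work.
int nbytes = 1024 * sizeof(float);
float *h_buf = 0, *d_buf = 0;
cudaStream_t stream;

cudaMallocHost((void**)&h_buf, nbytes);     // pinned (page-locked) host allocation
cudaMalloc((void**)&d_buf, nbytes);
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
// ... CPU work, or kernels in other streams, can overlap with the copy here ...
cudaStreamSynchronize(stream);              // wait for the copy to finish

cudaStreamDestroy(stream);
cudaFree(d_buf);
cudaFreeHost(h_buf);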

25 Code Walkthrough 1

• Allocate CPU memory for n integers • Allocate GPU memory for n integers • Initialize GPU memory to 0s • Copy from GPU to CPU • Print the values

26 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

27 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

28 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

29 Code Walkthrough 1

#include <stdio.h>

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int i=0; i<dimx; i++) printf("%d ", h_a[i]); printf("\n");

free( h_a ); cudaFree( d_a );

return 0;
}

30 Basic Kernels and Execution on GPU

31 CUDA Function Declarations

                                    Executed on the:    Only callable from the:
__device__ float DeviceFunc()       device              device
__global__ void  KernelFunc()       device              host
__host__   float HostFunc()         host                host

• __global__ defines a kernel function – Must return void • __device__ and __host__ can be used together
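A minimal sketch of "__device__ and __host__ can be used together" (the function and kernel names are illustrative): the same source function is compiled once for the CPU and once for the GPU.

// Sketch: one function usable from both host and device code.
__host__ __device__ float square(float x)
{
    return x * x;
}

__global__ void square_kernel(float *a)
{
    a[threadIdx.x] = square(a[threadIdx.x]);   // device-side call
}

// On the CPU:  float y = square(3.0f);        // host-side call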

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

32 CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken • For functions executed on the device: – Can only access GPU memory – No recursion – No static variable declarations inside the function – No variable number of arguments

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

33 Code Walkthrough 2

• Build on Walkthrough 1 • Write a kernel to initialize integers • Copy the result back to CPU • Print the values

34 Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

35 Launching kernels on GPU

• Launch parameters:
– grid dimensions (up to 2D), dim3 type
– thread-block dimensions (up to 3D), dim3 type
– shared memory: number of bytes per block
• for extern smem variables declared without size
• optional, 0 by default
– stream ID
• optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);

(A sketch using the shared-memory and stream arguments follows below.)
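A hedged sketch of a launch that uses all four parameters (the kernel name, sizes, and d_data are illustrative assumptions): the third argument sets the size of the extern __shared__ array, the fourth selects the stream.

// Device side: dynamic shared memory sized at launch time.
__global__ void smem_kernel(float *data)
{
    extern __shared__ float smem[];          // size supplied by the launch
    smem[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... use smem ...
}

// Host side:
dim3 grid(16), block(128);
size_t smem_bytes = block.x * sizeof(float); // dynamic shared memory per block
cudaStream_t s;
cudaStreamCreate(&s);
smem_kernel<<<grid, block, smem_bytes, s>>>(d_data);   // d_data assumed allocated earlier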

36

#include <stdio.h>

__global__ void kernel( int *a ) { int idx = blockIdx.x*blockDim.x + threadIdx.x; a[idx] = 7; }

int main() { int dimx = 16; int num_bytes = dimx*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes );

dim3 grid, block; block.x = 4; grid.x = dimx / block.x;

kernel<<<grid, block>>>( d_a );

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int i=0; i<dimx; i++) printf("%d ", h_a[i]); printf("\n");

free( h_a ); cudaFree( d_a );

return 0;
}

37 Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

38 Code Walkthrough 3

• Build on Walkthrough 2 • Write a kernel to increment n×m integers • Copy the result back to CPU • Print the values

39 Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
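The walkthrough below assumes dimx and dimy are exact multiples of the block size. When they are not, a common pattern is to round the grid up and mask off the extra threads; a hedged sketch (the kernel name kernel_guarded is illustrative):

// Sketch: grid sizing when dimx/dimy are not multiples of the block size.
__global__ void kernel_guarded( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    if (ix < dimx && iy < dimy)                  // bounds check for partial blocks
        a[iy*dimx + ix] += 1;
}

// Host side:
dim3 block(4, 4);
dim3 grid( (dimx + block.x - 1) / block.x,       // ceiling division
           (dimy + block.y - 1) / block.y );
kernel_guarded<<<grid, block>>>( d_a, dimx, dimy );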

40 int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int);

int *d_a=0, *h_a=0; // device and host pointers

h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes );

if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

cudaMemset( d_a, 0, num_bytes );

dim3 grid, block;
block.x = 4; block.y = 4;
grid.x = dimx / block.x; grid.y = dimy / block.y;

kernel<<<grid, block>>>( d_a, dimx, dimy );

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col]); printf("\n"); }

free( h_a ); cudaFree( d_a );

return 0;
}

41 Blocks must be independent

• Any possible interleaving of blocks should be valid – presumed to run to completion without pre-emption – can run in any order – can run concurrently OR sequentially

• Blocks may coordinate but not synchronize
– shared queue pointer: OK (see the sketch after this slide)
– shared lock: BAD … can easily deadlock

• Independence requirement gives scalability
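"Shared queue pointer: OK" can be made concrete with a global counter advanced by atomicAdd; whichever blocks happen to be running grab the next work items, so no ordering between blocks is assumed. A hedged sketch (the kernel name and the notion of "items" are illustrative):

// Sketch: threads pull work items from a global queue via atomicAdd.
// No block waits on another block, so any execution order is valid.
__global__ void worker(int *queue_head, int num_items, float *items)
{
    while (true) {
        int my_item = atomicAdd(queue_head, 1);   // grab the next index atomically
        if (my_item >= num_items)
            break;                                // queue exhausted
        items[my_item] *= 2.0f;                   // "process" the work item
    }
}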

42 Blocks must be independent

• Thread blocks can run in any order – Concurrently or sequentially – Facilitates scaling of the same code across many devices

Scalability

43 Coordinating CPU and GPU Execution

44 Synchronizing GPU and CPU

• All kernel launches are asynchronous
– control returns to CPU immediately
– kernel starts executing once all previous CUDA calls have completed
• Memcopies are synchronous
– control returns to CPU once the copy is complete
– copy starts once all previous CUDA calls have completed
• cudaThreadSynchronize()
– blocks until all previous CUDA calls complete
• Asynchronous CUDA calls provide:
– non-blocking memcopies
– ability to overlap memcopies and kernel execution

45 CUDA Error Reporting to CPU

• All CUDA calls return error code: – except kernel launches – cudaError_t type

• cudaError_t cudaGetLastError(void) – returns the code for the last error (“no error” has a code)

• char* cudaGetErrorString(cudaError_t code)
– returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
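A common way to use these calls is a small checking wrapper; a minimal sketch (the macro name CUDA_CHECK is illustrative, not a CUDA API):

// Sketch of a simple error-checking wrapper around CUDA runtime calls.
#include <stdio.h>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            printf("CUDA error at %s:%d: %s\n",                        \
                   __FILE__, __LINE__, cudaGetErrorString(err));       \
        }                                                              \
    } while (0)

// Usage:
//   CUDA_CHECK( cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost) );
//   kernel<<<grid, block>>>(d_a);
//   CUDA_CHECK( cudaGetLastError() );   // kernel launches do not return an error code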

46 Device Management

• CPU can query and select GPU devices – cudaGetDeviceCount( int* count ) – cudaSetDevice( int device ) – cudaGetDevice( int *current_device ) – cudaGetDeviceProperties( cudaDeviceProp* prop, int device ) – cudaChooseDevice( int *device, cudaDeviceProp* prop ) • Multi-GPU setup: – device 0 is used by default – one CPU thread can control one GPU • multiple CPU threads can control the same GPU – calls are serialized by the driver
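A short sketch tying these calls together (device 0 is chosen arbitrarily for illustration): enumerate the devices, print their properties, then select one for the current CPU thread.

// Sketch: enumerate CUDA devices and select one.
int count = 0;
cudaGetDeviceCount(&count);

for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s, compute capability %d.%d\n",
           dev, prop.name, prop.major, prop.minor);
}

cudaSetDevice(0);   // subsequent CUDA calls in this CPU thread use device 0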

47 CUDA Debugging and Profiling

48 What's cuda-gdb?

• All-in-one debugging tool • Host and CUDA codes • Extension to gdb • 32/64-bit Linux • 4.0 release

49 Debug Compilation

• Compile with -g -G
• nvcc -g -G foo.cu -o foo
• Fermi: -gencode arch=compute_20,code=sm_20
• Makefile
• CUDA-GDB error: undefined reference to '$gpu_registers' (2.2 beta or previous)
• ptxvars.cu: nvcc "/usr/local/cuda/bin/ptxvars.cu" -g -G --host-compilation=c -c -define-always-macro _DEVICE_LAUNCH_PARAMETERS_H__ -Xptxas -fext

50 Extension to GDB

• Debug both host and GPU code seamlessly
• GPU memory is seen as an extension to host memory
• GPU threads/blocks are seen as extensions to host threads
• Breakpoints at any host and/or device function symbol or file line number
• Single-step individual warps

51 Debug commands

• thread <<<(BX,BY),(TX,TY,TZ)>>>, e.g. thread <<<170>>>, thread <<<2,(10,10)>>>
• cuda block (n,m) thread (x,y,z)
• info cuda state (replacing state with devices, kernels, system, warp, sm, …)

52 Debugging Commands(cont.)

• break • print • continue • next • step • quit • set args … • GDB quick reference http://users.ece.utexas.edu/~adnan/gdb-refcard.pdf

53 Example code

• 8-bit bit reverse • 00011101 -> 10111000 • 10010111 -> 11101001

54

r = 0;
for (int i = 0; i < 8; i++) {
    r = r << 1;
    if (x & 1) r += 1;   // test the low bit
    x = x >> 1;
}

x = ((0xf0 & x) >> 4) | ((0x0f & x) << 4);
x = ((0xcc & x) >> 2) | ((0x33 & x) << 2);
x = ((0xaa & x) >> 1) | ((0x55 & x) << 1);

55 Code

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  // Simple 8-bit bit reversal Compute test
 5
 6  #define N 256
 7
 8  __global__ void bitreverse(unsigned int *data)
 9  {
10      unsigned int *idata = data;
11
12      unsigned int x = idata[threadIdx.x];
13
14      x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
15      x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
16      x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);
17
18      idata[threadIdx.x] = x;
19  }
20
21  int main(void)
22  {
23      unsigned int *d = NULL; int i;
24      unsigned int idata[N], odata[N];
25
26      for (i = 0; i < N; i++)
27          idata[i] = (unsigned int)i;
28
29      cudaMalloc((void**)&d, sizeof(int)*N);
30      cudaMemcpy(d, idata, sizeof(int)*N,
31                 cudaMemcpyHostToDevice);
32
33      bitreverse<<<1, N>>>(d);
34
35      cudaMemcpy(odata, d, sizeof(int)*N,
36                 cudaMemcpyDeviceToHost);
37
38      for (i = 0; i < N; i++)
39          printf("%u -> %u\n", idata[i], odata[i]);
40
41      cudaFree((void*)d);
42      return 0;
43  }

56 Cuda-gdb Supported Platforms

• Host platform
– X11 cannot be running on the GPU used for debugging
– One GPU: disable X11
– Two or more GPUs: run X11 on a GPU not used for debugging
• GPU requirements
– All CUDA-enabled GPUs except 8800 GTS, 8800 GTX, 8800 Ultra, FX 4600, and FX 5600

57 Debugging example code

• Step 1: nvcc -g -G bitreverse.cu -o bitreverse
• Step 2: cuda-gdb ./bitreverse
• Step 3: Set breakpoints (break main, break bitreverse, break 18)
• Step 4: Run the CUDA application: (cuda-gdb) run
• Step 5: Continue and watch variables: (cuda-gdb) continue, (cuda-gdb) thread, (cuda-gdb) print x

58 Profiling Tools

• CUDA memcheck • Occupancy Calculator • Visual Profiler

59 CUDA Visual Profiler

60 CUDA Counters

61 Profiler Counters for Fermi

• branch, divergent branch • instruction issued, instruction executed • sm cta launched • gld request, gst request • local load, local store • shared load, shared store • warps launched, threads launched • l1 global load hit, l1 global load miss • l1 local load hit, l1 local load miss • l1 local store hit, l1 local store miss • l1 shared bank conflicts • uncached global load transaction • global store transaction • l2 read requests, l2 write requests • l2 read misses, l2 write misses • dram reads, dram writes • tex cache requests, tex cache misses

62 Memory throughput

• Compute capability < 2.0:
Global read throughput = (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime
Global write throughput = (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / gputime

• Compute capability >= 2.0:
Global read throughput = (dram reads * 32) / gputime
Global write throughput = (dram writes * 32) / gputime

• Gmem overall throughput = read throughput + write throughput
• Tesla C2050: theoretical bandwidth 144 GB/s
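As a worked example with made-up counter values (illustrative only): on a compute-capability-2.0 device, if the profiler reports 2,000,000 dram reads and 1,000,000 dram writes for a kernel with gputime = 1 ms, then read throughput = 2,000,000 × 32 B / 1 ms = 64 GB/s, write throughput = 1,000,000 × 32 B / 1 ms = 32 GB/s, and overall global-memory throughput ≈ 96 GB/s, roughly two thirds of the Tesla C2050's theoretical 144 GB/s.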

63 Conclusions

• GPU as an accelerator for HPC • CUDA programming model • CUDA thread and kernel • CUDA example codes • CUDA debugging and profiling

64