Lecture: Introduction to Parallel GPU/CUDA Programming

Flynn’s Classical Taxonomy

Single Instruction, Single Data (SISD)

Single Instruction, Multiple Data (SIMD)

Multiple Instruction, Single Data (MISD)

Multiple Instruction, Multiple Data (MIMD)

Single Instruction, Multiple Threads (SIMT)

Goal

Terminology:

● Host: The CPU and its memory (host memory)
● Device: The GPU and its memory (device memory)

CUDA’s Processing Flow

CPUs and GPUs

● GPU + CPU accelerates scientific and engineering applications
● CPUs – a few cores
● GPUs – thousands of smaller, more efficient cores designed for parallel performance
● Serial code runs on the CPU while parallel portions run on the GPU

CUDA Architecture

CUDA

● CUDA C is a variant of C with extensions to define:
  – Where a function executes (on the host CPU or on the GPU)
  – Where a variable is located in the CPU or GPU address space
  – Execution parallelism of a kernel function, distributed in terms of grids and blocks
  – Built-in variables for grid and block dimensions, and indices for blocks and threads (see the sketch below)
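As a rough sketch (the kernel, variable, and launch parameters below are invented for illustration), these extensions look like this in source:

// __global__: runs on the GPU (device), called from host code
__global__ void scale(float *data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in block/thread index variables
    data[i] = factor * data[i];
}

// __device__: runs on the GPU, callable only from device code
__device__ float square(float x) { return x * x; }

// __constant__: variable placed in the GPU (constant) address space
__constant__ float coeff[16];

// Execution parallelism is specified at launch time in terms of grid and block dimensions:
//     scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f);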

CUDA C

● Requires the nvcc 64-bit compiler and the CUDA driver; nvcc outputs PTX (Parallel Thread Execution, a pseudo-assembly language), CUDA binaries, and standard C binaries
● CUDA run-time JIT compiler (optional) compiles PTX code into native GPU operations
● Math libraries: cuFFT, cuBLAS, and cuDPP (optional)

Installing CUDA SDK

References

● https://docs.nvidia.com/cuda/
● Sanders & Kandrot, CUDA by Example
● Kirk & Hwu, Programming Massively Parallel Processors
● https://developer.nvidia.com/cuda-education-training#1

CPU vs GPU

Hello World!

Running GPU/CUDA Jobs on tuckoo

GPU/CUDA Environment on tuckoo

● Identify the model name of your GPU.
● cat /etc/motd

The CUDA Compiler: nvcc

● You can install the CUDA toolkit and compile code even without a GPU device.
● To compile, use: nvcc
● NOTE: CUDA does not support doubles on the device by default: you need to add "-arch=sm_30" (or a higher compute capability) to your nvcc command (see the sketch below)
● Try "simple_hello.cu" on the server
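For example, a minimal sketch (the file name, kernel, and values are made up) of a kernel that uses double precision on the device and therefore needs the -arch flag:

// double_demo.cu -- compile with: nvcc -arch=sm_30 -o double_demo double_demo.cu
#include <stdio.h>

__global__ void halve(double *x) {
    *x = *x / 2.0;                    // double-precision arithmetic on the device
}

int main(void) {
    double h = 10.0, *d;
    cudaMalloc((void **)&d, sizeof(double));
    cudaMemcpy(d, &h, sizeof(double), cudaMemcpyHostToDevice);
    halve<<<1, 1>>>(d);               // launch on the GPU
    cudaMemcpy(&h, d, sizeof(double), cudaMemcpyDeviceToHost);
    printf("result = %f\n", h);       // expect 5.0
    cudaFree(d);
    return 0;
}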

Compiling & running CUDA code using a batch script

● nvcc -o simple_hello simple_hello.cu
● cat batch.simple_hello
● qsub batch.simple_hello
● Change the node information in the batch script (check cat /etc/motd)
● What is different, and why?

Hello World! With Device Code

simple_kernel.cu

Hello World! With Device Code

__global__ void mykernel(void) { }

● CUDA C/C++ keyword __global__ indicates a function that:
  – Runs on the device
  – Is called from host code
● nvcc separates source code into host and device components
● Device functions (e.g. mykernel()) processed by NVIDIA compiler
● Host functions (e.g. main()) processed by standard host compiler

Hello World! With Device Code

mykernel<<<1,1>>>();

● Triple angle brackets mark a call from host code to device code
  – Also called a “kernel launch”
  – mykernel() in this case is just an empty function
● That’s all that is required to execute a function on the GPU! A complete sketch follows.
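Putting the pieces together, simple_kernel.cu presumably looks something like this minimal sketch (the exact file on the server may differ):

#include <stdio.h>

__global__ void mykernel(void) {
    // empty kernel: does nothing on the device
}

int main(void) {
    mykernel<<<1, 1>>>();             // kernel launch: 1 block of 1 thread on the GPU
    printf("Hello World!\n");         // printed by the host
    return 0;
}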

Two types of parallelism:

● Block parallelism
  – Launch N blocks with 1 thread each: add<<< N, 1 >>>(dev_a, dev_b, dev_c)
● Thread parallelism
  – Launch 1 block with N threads: add<<< 1, N >>>(dev_a, dev_b, dev_c)

Run the program simple_kernel2.cu using batch!

We will look at examples of each type of parallelism; a sketch contrasting the two launch configurations follows.
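A sketch contrasting the two configurations (the kernels and pointer names below are illustrative; the course files may differ):

// Block parallelism: N blocks of 1 thread each -- each block handles one element
__global__ void add_blocks(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
// launched as: add_blocks<<< N, 1 >>>(dev_a, dev_b, dev_c);

// Thread parallelism: 1 block of N threads -- each thread handles one element
__global__ void add_threads(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
// launched as: add_threads<<< 1, N >>>(dev_a, dev_b, dev_c);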

Memory Allocation

● CPU: malloc, calloc, free, cudaMallocHost, cudaFreeHost

● GPU: cudaMalloc, cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray
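A small sketch (the buffer size is arbitrary) pairing each allocation call with its matching free:

#include <stdlib.h>

int main(void) {
    int *h_pageable, *h_pinned, *d_data;
    size_t size = 1024 * sizeof(int);

    h_pageable = (int *)malloc(size);             // ordinary (pageable) host memory
    cudaMallocHost((void **)&h_pinned, size);     // page-locked (pinned) host memory
    cudaMalloc((void **)&d_data, size);           // device (GPU) memory

    /* ... use the buffers ... */

    free(h_pageable);                             // matches malloc
    cudaFreeHost(h_pinned);                       // matches cudaMallocHost
    cudaFree(d_data);                             // matches cudaMalloc
    return 0;
}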

Passing Parameters to the Kernel

Parallel Programming in CUDA C

● But wait… GPU computing is about massive parallelism!

● We need a more interesting example…

● We’ll start by adding two integers and build up to vector addition

Adding Two Numbers

● A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) { *c = *a + *b; }

See “simple_kernel_params.cu”

● As before, __global__ is a CUDA C/C++ keyword meaning
  – add() will execute on the device
  – add() will be called from the host

Adding Two Numbers

● Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) { *c = *a + *b; }

● add() runs on the device, so a, b and c must point to device memory
● We need to allocate memory on the GPU

Memory Management

● Host and device memory are separate entities
  – Device pointers point to GPU memory
    ● May be passed to/from host code
    ● May not be dereferenced in host code
  – Host pointers point to CPU memory
    ● May be passed to/from device code
    ● May not be dereferenced in device code

● Simple CUDA API for handling device memory
  – cudaMalloc(), cudaFree(), cudaMemcpy()
  – Similar to the C equivalents malloc(), free(), memcpy()

Adding Two Numbers
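The host code shown on these slides likely resembles the following sketch (modeled on simple_kernel_params.cu; the input values are made up):

#include <stdio.h>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int a = 2, b = 7, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;             // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the GPU
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}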

Moving to Parallel: CPU

Moving to Parallel: CUDA

• GPU computing is about massive parallelism – So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel

CUDA dimensions

● In its simplest form a kernel launch looks like: kernelRoutine<<< gridDim, blockDim >>>(args)
● The kernel runs on the device. It is executed by threads, each of which knows about:
  – variables passed as arguments
  – pointers to arrays in device memory (also arguments)
  – global constants in device memory
  – and its private registers/local variables
A toy sketch follows.
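The sketch below (device-side printf requires compute capability 2.0 or higher; the kernel and values are invented) has each thread report what it knows:

#include <stdio.h>

__global__ void whoami(int offset) {              // 'offset' arrives as a kernel argument
    printf("block %d, thread %d, offset %d\n",
           blockIdx.x, threadIdx.x, offset);       // built-in index variables
}

int main(void) {
    whoami<<<2, 3>>>(100);                         // grid of 2 blocks, 3 threads per block
    cudaDeviceSynchronize();                       // wait so the device printf output appears
    return 0;
}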

Vector Addition on the Device

• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
  – The set of blocks is referred to as a grid
  – Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }

• By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }

• On the device, each block can execute in parallel:

Block 0:  c[0] = a[0] + b[0];
Block 1:  c[1] = a[1] + b[1];
Block 2:  c[2] = a[2] + b[2];
Block 3:  c[3] = a[3] + b[3];

Vector Addition on the Device: add()

• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }

• Let’s take a look at main()…

Vector Addition on the Device: main()

#define N 512

int main(void) {
    int *a, *b, *c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;        // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Vector Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Common Pattern

Block index

Remark

Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don’t worry if you aren’t familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won’t be offended.
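As a hedged sketch (DIM and the kernel are invented for illustration), two-dimensional indexing might look like this:

#define DIM 64

__global__ void kernel2d(float *grid) {
    int x = blockIdx.x;                   // column index of this block
    int y = blockIdx.y;                   // row index of this block
    int offset = x + y * gridDim.x;       // translate the 2-D block index into a linear offset
    grid[offset] = 0.0f;
}

// launched from the host as:
//     dim3 blocks(DIM, DIM);             // a DIM x DIM two-dimensional grid of blocks
//     kernel2d<<<blocks, 1>>>(d_grid);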

Remark

● We call the collection of parallel blocks a grid
● These threads will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1
● Later, we will use blockIdx.y

Remark

● Why do we check whether tid is less than N?

It should always be less than N, but we check anyway, just to be safe; call it healthy paranoia.

● If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: No dimension of your launch of blocks may exceed 65,535. Go check enum_gpu.cu again.
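The guarded kernel being discussed looks roughly like this sketch (following the style of the book’s add example):

#define N 10                              // try 10000 or 50000 for a massively parallel launch

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;                 // this block's index
    if (tid < N)                          // guard: only touch valid elements
        c[tid] = a[tid] + b[tid];
}

// launched as: add<<< N, 1 >>>(dev_a, dev_b, dev_c);
// no dimension of the grid of blocks may exceed 65,535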

Review (1 of 2)

● Difference between host and device
  – Host = CPU
  – Device = GPU

● Using __global__ to declare a function as device code
  – Executes on the device
  – Called from the host

● Passing parameters from host code to a device function

Review (2 of 2)

● Basic device memory management
  – cudaMalloc()
  – cudaMemcpy()
  – cudaFree()

● Launching parallel kernels
  – Launch N copies of add() with add<<<N,1>>>(…);
  – Use blockIdx.x to access the block index
