Introduction to Parallel Computing Lecture
GPU/CUDA Programming
Flynn’s Classical Taxonomy
Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)
Single Instruction, Multiple Threads (SIMT)
Goal
Terminology:
● Host: The CPU and its memory (host memory)
● Device: The GPU and its memory (device memory)
CUDA’s Processing Flow
CPUs and GPUs
● A GPU (graphics processing unit) working together with a CPU accelerates scientific and engineering applications
● CPUs: a few cores
● GPUs: thousands of smaller, more efficient cores designed for parallel performance
● Serial code runs on the CPU while parallel portions run on the GPU
CUDA Architecture
CUDA C
● CUDA C is a variant of C with extensions to define:
– Where a function executes (on the host CPU or on the GPU)
– Where a variable is located in the CPU or GPU address space
– The execution parallelism of a kernel function, distributed in terms of grids and blocks
– Built-in variables for grid and block dimensions, and indices for blocks and threads
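As an illustration (a hedged sketch, not taken from the lecture’s own files; names such as scaleArray are made up), the snippet below uses a __device__ variable in GPU memory, a __global__ kernel, and the built-in grid/block/thread index variables:

#include <stdio.h>

#define N 32

__device__ int scale = 2;                        // variable located in device (GPU) memory

__global__ void scaleArray(int *data)            // function that executes on the GPU
{
    // built-in variables give the launch geometry and this thread's indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= scale;
}

int main(void)                                   // function that executes on the host CPU
{
    int h[N], *d;
    for (int i = 0; i < N; i++) h[i] = i;

    cudaMalloc((void **)&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    scaleArray<<<4, 8>>>(d);                     // grid of 4 blocks, 8 threads per block

    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[5] = %d\n", h[5]);                 // expect 10
    cudaFree(d);
    return 0;
}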
CUDA C
● Requires the nvcc 64-bit compiler and the CUDA driver; nvcc outputs PTX (Parallel Thread eXecution, NVIDIA’s pseudo-assembly language), CUDA binaries, and standard C binaries
● CUDA run-time JIT compiler (optional): compiles PTX code into native device operations
● Math libraries: cuFFT, cuBLAS, and cuDPP (optional)
Installing CUDA SDK (Linux)
References
● https://docs.nvidia.com/cuda/
● Sanders & Kandrot, CUDA by Example
● Kirk & Hwu, Programming Massively Parallel Processors
● https://developer.nvidia.com/cuda-education-training#1
CPU vs GPU
Hello World!
Running GPU/CUDA Jobs on tuckoo
GPU/CUDA Env on tuckoo
● Identify the model name of your GPU.
● cat /etc/motd
The CUDA Compiler: nvcc
● You can install the CUDA toolkit and compile code without a GPU device.
● To compile, use: nvcc
● NOTE: CUDA does not support doubles on the device by default; you need to add the switch -arch=sm_30 (or a higher compute capability) to your nvcc command
● Try simple_hello.cu on the server
Compiling & running CUDA code using a batch script
● nvcc -o simple_hello simple_hello.cu
● cat batch.simple_hello
● qsub batch.simple_hello
● Change the node information according to cat /etc/motd
● What is different, and why?
Hello World! With Device Code
simple_kernel.cu
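A minimal sketch of what a simple_kernel.cu-style program looks like (the actual file on the server may differ):

#include <stdio.h>

// an empty kernel: it runs on the device but is called from the host
__global__ void mykernel(void)
{
}

int main(void)
{
    mykernel<<<1, 1>>>();          // kernel launch: 1 block of 1 thread
    printf("Hello World!\n");      // printed by the host after the launch
    return 0;
}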
Hello World! With Device Code
__global__ void mykernel(void) { }
● CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
● nvcc separates source code into host and device components
● Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
● Host functions (e.g. main()) are processed by the standard host compiler
Hello World! With Device Code
mykernel<<<1,1>>>();
● Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– mykernel() in this case is just an empty function
● That’s all that is required to execute a function on the GPU!
Two types of parallelism:
● Block Parallelism
– Launch N blocks with 1 thread each: add<<< N, 1 >>>(dev_a, dev_b, dev_c)
● Thread Parallelism
– Launch 1 block with N threads: add<<< 1, N >>>(dev_a, dev_b, dev_c)
Run the program simple_kernel2.cu using batch!
We will look at examples of each type of parallel mechanism; a sketch of the two launch configurations is shown below.
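The following is a hedged sketch contrasting the two launch configurations for the same element-wise addition (kernel and variable names are placeholders, not necessarily those in simple_kernel2.cu):

#include <stdio.h>
#define N 8

// block parallelism: each of the N blocks adds one element
__global__ void add_blocks(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// thread parallelism: each of the N threads adds one element
__global__ void add_threads(int *a, int *b, int *c)
{
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10 * i; }

    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    add_blocks<<<N, 1>>>(dev_a, dev_b, dev_c);    // N blocks, 1 thread each
    add_threads<<<1, N>>>(dev_a, dev_b, dev_c);   // 1 block, N threads (same result)

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
    printf("c[3] = %d\n", c[3]);                  // expect 33
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}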
Memory Allocation
● CPU: malloc, calloc, free, cudaMallocHost, cudaFreeHost
● GPU: cudaMalloc, cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray
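A brief hedged sketch combining the host-side and device-side allocation calls listed above (sizes and buffer names are placeholders):

#include <stdlib.h>

int main(void)
{
    int n = 1024;
    size_t size = n * sizeof(float);

    float *h_pageable = (float *)malloc(size);   // ordinary pageable host memory (malloc/free)
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, size);    // page-locked host memory (cudaMallocHost/cudaFreeHost)

    float *d_data;
    cudaMalloc((void **)&d_data, size);          // linear device memory (cudaMalloc/cudaFree)

    /* ... copy data to the device, launch kernels, copy results back ... */

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_data);
    return 0;
}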
Passing Parameters to the Kernel
Parallel Programming in CUDA C
● But wait… GPU computing is about massive parallelism!
● We need a more interesting example…
● We’ll start by adding two integers and build up to vector addition
Adding Two Numbers
● A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) { *c = *a + *b; }
See simple_kernel_params.cu
● As before, __global__ is a CUDA C/C++ keyword meaning:
– add() will execute on the device
– add() will be called from the host
Adding Two Numbers
● Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) { *c = *a + *b; }
● add() runs on the device, so a, b and c must point to device memory
● We need to allocate memory on the GPU
Memory Management
● Host and device memory are separate entities
– Device pointers point to GPU memory
● May be passed to/from host code
● May not be dereferenced in host code
– Host pointers point to CPU memory
● May be passed to/from device code
● May not be dereferenced in device code
● Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
Adding Two Numbers
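The slide content is not reproduced here; below is a hedged sketch of the main() such slides typically show (the input values 2 and 7 are an assumption), driving the pointer-based add() kernel defined above:

int main(void)
{
    int a, b, c;                 // host copies of a, b, c
    int *d_a, *d_b, *d_c;        // device copies of a, b, c
    int size = sizeof(int);

    // allocate space for the device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // set up input values on the host
    a = 2;
    b = 7;

    // copy the inputs to the device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // launch add() on the GPU, then copy the result back to the host
    add<<<1, 1>>>(d_a, d_b, d_c);
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}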
Moving to Parallel: CPU
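The slide itself is not reproduced; a plausible sketch of the serial CPU version being contrasted here (the function name add_cpu is an assumption):

// serial vector addition on the CPU: a single loop walks over every element
void add_cpu(int *a, int *b, int *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}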
Moving to Parallel: CUDA
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
• Instead of executing add() once, execute N times in parallel
CUDA dimensions
● In its simplest form it looks like: kernelRoutine <<< gridDim, blockDim >>> (args)
● The kernel runs on the device. It is executed by threads, each of which knows about:
– variables passed as arguments
– pointers to arrays in device memory (also arguments)
– global constants in device memory
– shared memory and private registers/local variables
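As a hedged illustration of those launch parameters (the kernel whereAmI is made up for this sketch), gridDim and blockDim can also be given as dim3 values for multi-dimensional launches:

#include <stdio.h>

__global__ void whereAmI(void)
{
    // every thread can read its block/thread indices and the launch geometry
    // (device-side printf needs -arch=sm_20 or higher)
    if (threadIdx.x == 0 && threadIdx.y == 0)
        printf("block (%d,%d) of a %d x %d grid\n",
               blockIdx.x, blockIdx.y, gridDim.x, gridDim.y);
}

int main(void)
{
    dim3 grid(4, 2);     // 4 x 2 = 8 blocks in the grid
    dim3 block(8, 8);    // 8 x 8 = 64 threads per block

    whereAmI<<<grid, block>>>();
    cudaDeviceSynchronize();   // wait for the device-side printf output
    return 0;
}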
Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }
• By using blockIdx.x to index into the array, each block handles a different index
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }
• On the device, each block can execute in parallel:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; }
• Let’s take a look at main()…
Vector Addition on the Device: main()
#define N 512

int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Common Pattern
Block index
Remark
Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don’t worry if you aren’t familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won’t be offended.
Remark
● We call the collection of parallel blocks a grid
● Each copy of the kernel sees a different value of blockIdx.x, the first taking value 0 and the last taking value N-1
● Later, we will use blockIdx.y
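A hedged sketch of what two-dimensional block indexing might look like once blockIdx.y comes into play (the flattening scheme below is an assumption, not taken from the lecture):

__global__ void add2D(int *a, int *b, int *c)
{
    // flatten the 2-D block index (x, y) into a 1-D offset into the arrays
    int index = blockIdx.x + blockIdx.y * gridDim.x;
    c[index] = a[index] + b[index];
}

// example launch: a DIM x DIM grid of blocks, one thread per block
//   dim3 grid(DIM, DIM);
//   add2D<<<grid, 1>>>(d_a, d_b, d_c);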
Remark
● Why do we check whether tid is less than N?
It should always be less than N, but we check anyway out of healthy paranoia about out-of-range accesses.
● If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: No dimension of your launch of blocks may exceed 65,535. Go check enum_gpu.cu again.
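For reference, a hedged sketch of the kind of guarded kernel the remark refers to (in the style of CUDA by Example; the actual enum_gpu.cu on the server may differ):

#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x;    // this block's index doubles as the element index
    if (tid < N)             // paranoia: never touch memory past the end of the arrays
        c[tid] = a[tid] + b[tid];
}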
Review (1 of 2)
● Difference between host and device
– Host: CPU
– Device: GPU
● Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
● Passing parameters from host code to a device function
Review (2 of 2)
● Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
● Launching parallel kernels
– Launch N copies of add() with add<<<N, 1>>>(…)