GPU/CUDA Programming Flynn’s Classical Taxonomy

Introduction to Parallel Computing
Lecture GPU/CUDA Programming

Flynn’s Classical Taxonomy

Single Instruction, Single Data (SISD)

Single Instruction, Multiple Data (SIMD)

Multiple Instruction, Single Data (MISD)

Multiple Instruction, Multiple Data (MIMD)

Single Instruction, Multiple Threads (SIMT)

Goal

Terminology:
● Host: The CPU and its memory (host memory)
● Device: The GPU and its memory (device memory)

CUDA’s Processing Flow

CPUs and GPUs
● GPU (graphics processing unit) + CPU accelerates scientific and engineering applications
● CPUs - a few cores
● GPUs - thousands of smaller, more efficient cores designed for parallel performance
● Serial code runs on the CPU while parallel portions run on the GPU

CUDA Architecture

CUDA C
● CUDA C is a variant of C with extensions to define:
– Where a function executes (host CPU or the GPU)
– Where a variable is located (in the CPU or GPU address space)
– Execution parallelism of a kernel function, distributed in terms of grids and blocks
– Built-in variables for grid and block dimensions, and indices for blocks and threads

CUDA C
● Requires the nvcc 64-bit compiler and the CUDA driver; outputs PTX (Parallel Thread eXecution, NVIDIA pseudo-assembly language), CUDA, and standard C binaries
● CUDA run-time JIT compiler (optional); compiles PTX code into native operations
● Math libraries: cuFFT, cuBLAS and cuDPP (optional)

Installing CUDA SDK (Linux)

References
● https://docs.nvidia.com/cuda/
● CUDA By Example, by Sanders & Kandrot
● Kirk & Hwu, Programming Massively Parallel Processors
● https://developer.nvidia.com/cuda-education-training#1

CPU vs GPU

Hello World!

Running GPU/CUDA Jobs on tuckoo

GPU/CUDA Env on tuckoo
● Identify the model name of your GPU.
● cat /etc/motd

The CUDA Compiler: nvcc
● You can install the CUDA toolkit and compile code without a GPU device.
● To compile use: nvcc
● NOTE: Depending on the toolkit's default target architecture, CUDA may not support doubles on the device by default: add the switch “-arch sm_30” (or a higher compute capability) to your nvcc command
● Try “simple_hello.cu” on the server

Compiling & running CUDA code using batch script
● nvcc -o simple_hello simple_hello.cu
● cat batch.simple_hello
● qsub batch.simple_hello
● Change the nodes information using cat /etc/motd
● What is different, and why?

Hello World! With Device Code
simple_kernel.cu

Hello World! With Device Code
    __global__ void mykernel(void) { }
● CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
● nvcc separates source code into host and device components
● Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
● Host functions (e.g. main()) are processed by the standard host compiler

Hello World! With Device Code
    mykernel<<<1,1>>>();
● Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– mykernel() in this case is just an empty function
● That’s all that is required to execute a function on the GPU!

Two types of parallelism:
● Block Parallelism
– Launch N blocks with 1 thread each:
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
● Thread Parallelism
– Launch 1 block with N threads:
    add<<<1, N>>>(dev_a, dev_b, dev_c);
>>> Run the program simple_kernel2.cu using batch!
>>> We will look at examples of each type of parallel mechanism.

Memory Allocation
● CPU: malloc, calloc, free, cudaMallocHost, cudaFreeHost
● GPU: cudaMalloc, cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray

Passing Parameters to the Kernel

Parallel Programming in CUDA C
● But wait… GPU computing is about massive parallelism!
● We need a more interesting example…
● We’ll start by adding two integers and build up to vector addition

Adding two Numbers
● A simple kernel to add two integers
    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }
  See “simple_kernel_params.cu”
● As before, __global__ is a CUDA C/C++ keyword meaning
– add() will execute on the device
– add() will be called from the host

Adding two Numbers
● Note that we use pointers for the variables
    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }
● add() runs on the device, so a, b and c must point to device memory
● We need to allocate memory on the GPU

Memory Management
● Host and device memory are separate entities
– Device pointers point to GPU memory
    ● May be passed to/from host code
    ● May not be dereferenced in host code
– Host pointers point to CPU memory
    ● May be passed to/from device code
    ● May not be dereferenced in device code
● Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()

Adding Two Numbers

Moving to Parallel: CPU

Moving to Parallel: CUDA
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
    add<<< 1, 1 >>>();
    add<<< N, 1 >>>();
• Instead of executing add() once, execute it N times in parallel

CUDA dimensions
● In its simplest form it looks like:
    kernelRoutine<<<gridDim, blockDim>>>(args)
● The kernel runs on the device. It is executed by threads, each of which knows about:
– variables passed as arguments
– pointers to arrays in device memory (also arguments)
– global constants in device memory
– shared memory and private registers/local variables

Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }
• By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device
    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }
• On the device, each block can execute in parallel:
    Block 0: c[0] = a[0] + b[0];
    Block 1: c[1] = a[1] + b[1];
    Block 2: c[2] = a[2] + b[2];
    Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }
• Let’s take a look at main()…

Vector Addition on the Device: main()
    #define N 512
    int main(void) {
        int *a, *b, *c;          // host copies of a, b, c
        int *d_a, *d_b, *d_c;    // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        // (random_ints() is a helper that is not shown on the slides)
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

Vector Addition on the Device: main()
        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Common Pattern

Block index

Remark
Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don’t worry if you aren’t familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won’t be offended.

Remark
● We call the collection of parallel blocks a grid
● With one thread per block, these threads will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1.
● Later, we use blockIdx.y

Remark
● Why do we check whether tid is less than N? It should always be less than N, but we check anyway because we are paranoid about reading or writing past the end of the arrays.
● If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: No dimension of your launch of blocks may exceed your device's maximum grid size (65,535 on the hardware assumed here). Go check enum_gpu.cu again.

Review (1 of 2)
● Difference between host and device
– Host: CPU
– Device: GPU
● Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
● Passing parameters from host code to a device function

Review (2 of 2)
● Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
● Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access block index
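
The remark about checking tid against N refers to a kernel that is not reproduced in the text above, and the thread-parallel launch add<<<1, N>>> is mentioned without code. The following is a minimal sketch of how both might look, not the course's own files. The kernel names add_blocks and add_threads, the check() helper, the input values, and the file name guarded_add.cu are all assumptions made for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N 10  // small on purpose; must stay within the device's launch limits

// Hypothetical helper (not part of the course code): stop on any CUDA error.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Block parallelism: launched as <<<N, 1>>>, so each block handles one element.
__global__ void add_blocks(const int *a, const int *b, int *c) {
    int tid = blockIdx.x;   // index of this block within the grid
    if (tid < N)            // guard in case more blocks than elements are launched
        c[tid] = a[tid] + b[tid];
}

// Thread parallelism: launched as <<<1, N>>>, so each thread handles one element.
__global__ void add_threads(const int *a, const int *b, int *c) {
    int tid = threadIdx.x;  // index of this thread within its block
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main(void) {
    int a[N], b[N], c[N];        // host copies
    int *dev_a, *dev_b, *dev_c;  // device copies
    size_t size = N * sizeof(int);

    // Arbitrary input values, chosen only for illustration.
    for (int i = 0; i < N; i++) { a[i] = -i; b[i] = i * i; }

    check(cudaMalloc((void **)&dev_a, size), "cudaMalloc dev_a");
    check(cudaMalloc((void **)&dev_b, size), "cudaMalloc dev_b");
    check(cudaMalloc((void **)&dev_c, size), "cudaMalloc dev_c");

    check(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice), "copy b");

    // N blocks of 1 thread each.
    add_blocks<<<N, 1>>>(dev_a, dev_b, dev_c);
    check(cudaGetLastError(), "add_blocks launch");
    check(cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost), "copy c");
    for (int i = 0; i < N; i++)
        printf("blocks:  %d + %d = %d\n", a[i], b[i], c[i]);

    // 1 block of N threads.
    add_threads<<<1, N>>>(dev_a, dev_b, dev_c);
    check(cudaGetLastError(), "add_threads launch");
    check(cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost), "copy c");
    for (int i = 0; i < N; i++)
        printf("threads: %d + %d = %d\n", a[i], b[i], c[i]);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}
```

Assuming the file is saved as guarded_add.cu, it can be compiled the same way as the earlier examples, e.g. nvcc -o guarded_add guarded_add.cu. The two kernels differ only in which built-in index they read; the number of threads per block is also limited by the device, so the <<<1, N>>> form only works for small N.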
