GPU/CUDA Programming: Flynn’s Classical Taxonomy

Introduction to Parallel Computing Lecture: GPU/CUDA Programming, Flynn’s Classical Taxonomy

Flynn’s Classical Taxonomy
● Single Instruction, Single Data (SISD)
● Single Instruction, Multiple Data (SIMD)
● Multiple Instruction, Single Data (MISD)
● Multiple Instruction, Multiple Data (MIMD)
● Single Instruction, Multiple Threads (SIMT)

Goal

Terminology
● Host: the CPU and its memory (host memory)
● Device: the GPU and its memory (device memory)

CUDA’s Processing Flow

CPUs and GPUs
● A GPU (graphics processing unit) working together with the CPU accelerates scientific and engineering applications
● CPUs have a few cores
● GPUs have thousands of smaller, more efficient cores designed for parallel performance
● Serial code runs on the CPU while parallel portions run on the GPU

CUDA Architecture

CUDA C
● CUDA C is a variant of C with extensions to define:
  – where a function executes (on the host CPU or on the GPU)
  – where a variable is located in the CPU or GPU address space
  – the execution parallelism of a kernel function, distributed in terms of grids and blocks
  – variables for grid and block dimensions, and indices for blocks and threads

CUDA C (continued)
● Requires the nvcc 64-bit compiler and the CUDA driver; nvcc outputs PTX (Parallel Thread eXecution, NVIDIA’s pseudo-assembly language), CUDA, and standard C binaries
● The CUDA run-time JIT compiler (optional) compiles PTX code into native operations
● Math libraries: cuFFT, cuBLAS, and cuDPP (optional)

Installing the CUDA SDK (Linux)

References
● https://docs.nvidia.com/cuda/
● Sanders & Kandrot, CUDA by Example
● Kirk & Hwu, Programming Massively Parallel Processors
● https://developer.nvidia.com/cuda-education-training#1

CPU vs GPU

Hello World!

Running GPU/CUDA Jobs on tuckoo

GPU/CUDA Environment on tuckoo
● Identify the model name of your GPU.
● cat /etc/motd

The CUDA Compiler: nvcc
● You can install the CUDA toolkit and compile code without a GPU device.
● To compile, use nvcc.
● NOTE: CUDA does not support doubles on the device by default; add the switch "-arch=sm_30" (or a higher compute capability) to your nvcc command.
● Try "simple_hello.cu" on the server.

Compiling and Running CUDA Code Using a Batch Script
● nvcc -o simple_hello simple_hello.cu
● cat batch.simple_hello
● qsub batch.simple_hello
● Change the node information using cat /etc/motd
● What is different, and why?

Hello World! with Device Code (simple_kernel.cu)

Hello World! with Device Code
__global__ void mykernel(void) {
}
● The CUDA C/C++ keyword __global__ indicates a function that:
  – runs on the device
  – is called from host code
● nvcc separates source code into host and device components
● Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
● Host functions (e.g. main()) are processed by the standard host compiler

Hello World! with Device Code
mykernel<<<1,1>>>();
● Triple angle brackets mark a call from host code to device code
  – Also called a "kernel launch"
  – mykernel() in this case is just an empty function
● That’s all that is required to execute a function on the GPU! (A minimal complete sketch appears below, after the next slide.)

Two Types of Parallelism
● Block parallelism
  – Launch N blocks with 1 thread each: add<<<N, 1>>>(dev_a, dev_b, dev_c)
● Thread parallelism
  – Launch 1 block with N threads: add<<<1, N>>>(dev_a, dev_b, dev_c)
● Run the program simple_kernel2.cu using batch!
● We will look at examples of each type of parallel mechanism.
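To make the "Hello World! with Device Code" slides concrete, here is a minimal sketch of a complete program in that style. It assumes a file such as simple_kernel.cu; the course’s actual listing is not reproduced here, and the cudaDeviceSynchronize() call is an addition for safety, so treat this as an illustration only.

#include <stdio.h>

// Empty kernel: compiled for the device by nvcc, callable from host code
__global__ void mykernel(void) {
}

int main(void) {
    // Kernel launch: 1 block, 1 thread (the triple angle brackets)
    mykernel<<<1, 1>>>();

    // Wait for the device to finish before the host prints and exits
    cudaDeviceSynchronize();

    printf("Hello World!\n");
    return 0;
}

Compile and run it the same way as simple_hello.cu, e.g. nvcc -o simple_kernel simple_kernel.cu followed by the qsub batch script on tuckoo.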
Memory Allocation
● CPU: malloc, calloc, free, cudaMallocHost, cudaFreeHost
● GPU: cudaMalloc, cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray

Passing Parameters to the Kernel

Parallel Programming in CUDA C
● But wait… GPU computing is about massive parallelism!
● We need a more interesting example…
● We’ll start by adding two integers and build up to vector addition.

Adding Two Numbers
● A simple kernel to add two integers:
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
● See "simple_kernel_params.cu"
● As before, __global__ is a CUDA C/C++ keyword meaning that
  – add() will execute on the device
  – add() will be called from the host

Adding Two Numbers
● Note that we use pointers for the variables:
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
● add() runs on the device, so a, b and c must point to device memory
● We need to allocate memory on the GPU

Memory Management
● Host and device memory are separate entities
  – Device pointers point to GPU memory
    ● May be passed to/from host code
    ● May not be dereferenced in host code
  – Host pointers point to CPU memory
    ● May be passed to/from device code
    ● May not be dereferenced in device code
● Simple CUDA API for handling device memory
  – cudaMalloc(), cudaFree(), cudaMemcpy()
  – Similar to the C equivalents malloc(), free(), memcpy()

Adding Two Numbers (host code; a complete sketch appears after the "Moving to Parallel" slides below)

Moving to Parallel: CPU

Moving to Parallel: CUDA
● GPU computing is about massive parallelism, so how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
● Instead of executing add() once, execute it N times in parallel
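Here is the promised sketch of the host code for the two-integer add() kernel. The "Adding Two Numbers" slides refer to simple_kernel_params.cu; its exact listing is not reproduced here, so the variable names, input values, and lack of error checking below are assumptions.

#include <stdio.h>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;                    // device-side addition of two integers
}

int main(void) {
    int a = 2, b = 7, c;             // host copies
    int *d_a, *d_b, *d_c;            // device copies
    int size = sizeof(int);

    // Allocate device memory for the three integers
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to the device, launch the kernel, copy the result back
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
    add<<<1, 1>>>(d_a, d_b, d_c);
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    printf("%d + %d = %d\n", a, b, c);

    // Free device memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Note how the host never dereferences d_a, d_b, or d_c; it only passes them to cudaMemcpy(), the kernel, and cudaFree(), which is exactly the restriction stated on the "Memory Management" slide.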
CUDA Dimensions
● In its simplest form a launch looks like: kernelRoutine<<< gridDim, blockDim >>>(args)
● The kernel runs on the device. It is executed by threads, each of which knows about:
  – variables passed as arguments
  – pointers to arrays in device memory (also arguments)
  – global constants in device memory
  – shared memory and private registers/local variables

Vector Addition on the Device
● With add() running in parallel we can do vector addition
● Terminology: each parallel invocation of add() is referred to as a block
  – The set of blocks is referred to as a grid
  – Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
● By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
● On the device, each block can execute in parallel:
  – Block 0: c[0] = a[0] + b[0];
  – Block 1: c[1] = a[1] + b[1];
  – Block 2: c[2] = a[2] + b[2];
  – Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
● Returning to our parallelized add() kernel:
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
● Let’s take a look at main()…

Vector Addition on the Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Vector Addition on the Device: main() (continued)
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Common Pattern

Block Index

Remark
● Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don’t worry if you aren’t familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won’t be offended.

Remark
● We call the collection of parallel blocks a grid.
● These threads will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1.
● Later, we will use blockIdx.y.

Remark
● Why do we check whether tid is less than N? It should always be less than N, but we check anyway because we are paranoid (see the sketch below).
● If you would like to see how easy it is to generate a massively parallel application, try changing the 10 in the line #define N 10 to 10000 or 50000 to launch tens of thousands of parallel blocks. Be warned, though: no dimension of your launch of blocks may exceed 65,535. Go check enum_gpu.cu again.
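Two small pieces are assumed rather than shown in the slides above: the random_ints() helper used in main(), and the bounds-checked form of the kernel that the last remark alludes to. A possible sketch of both (names and details are assumptions):

#include <stdlib.h>

#define N 512

// A possible random_ints() helper: fill a host array with pseudo-random values
void random_ints(int *p, int n) {
    for (int i = 0; i < n; i++)
        p[i] = rand();
}

// Bounds-checked kernel: the guard keeps any block index at or beyond N
// from reading or writing past the end of the arrays
__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;        // this block handles element tid
    if (tid < N)                 // the "paranoid" check from the remark
        c[tid] = a[tid] + b[tid];
}

These slot directly into the main() shown above; with the guard in place, launching with a grid larger than N is merely wasteful rather than incorrect.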
Review (1 of 2)
● Difference between host and device
  – Host: CPU
  – Device: GPU
● Using __global__ to declare a function as device code
  – Executes on the device
  – Called from the host
● Passing parameters from host code to a device function

Review (2 of 2)
● Basic device memory management
  – cudaMalloc()
  – cudaMemcpy()
  – cudaFree()
● Launching parallel kernels
  – Launch N copies of add() with add<<<N,1>>>(…);
  – Use blockIdx.x to access the block index
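To connect the review back to the "Two Types of Parallelism" slide: the same vector addition can also be written with thread parallelism instead of block parallelism. This variant is not worked through in the slides above, so treat it as an assumed sketch; note that a single block is limited to 1024 threads (512 on older GPUs), so it only works for small N.

// Thread-parallel variant: one block of N threads, indexed with threadIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// In main(), only the launch configuration changes:
//   add<<<N, 1>>>(d_a, d_b, d_c);   // block parallelism: N blocks, 1 thread each
//   add<<<1, N>>>(d_a, d_b, d_c);   // thread parallelism: 1 block, N threads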
