Lecture 3: a Simple Example, Tools, and CUDA Threads

ECE498AL Lecture 3: A Simple Example, Tools, and CUDA Threads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 1 ECE498AL, University of Illinois, Urbana-Champaign A Simple Running Example Matrix Multiplication • A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs – Leave shared memory usage until later – Local, register usage – Thread ID usage – Memory data transfer API between host and device – Assume square matrix for simplicity © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 2 ECE498AL, University of Illinois, Urbana-Champaign Programming Model: Square Matrix Multiplication Example • P = M * N of size WIDTH x WIDTH N • Without tiling: H T D – One thread calculates one element I W of P – M and N are loaded WIDTH times from global memory M P H T D I W WIDTH WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 3 ECE498AL, University of Illinois, Urbana-Champaign Memory Layout of a Matrix in C M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 M M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 4 ECE498AL, University of Illinois, Urbana-Champaign Step 1: Matrix Multiplication A Simple Host Version in C // Matrix multiplication on the (CPU) host in double precision void MatrixMulOnHost(float* M, float* N, float* P, int Width)N { k for (int i = 0; i < Width; ++i) H j T for (int j = 0; j < Width; ++j) { D I double sum = 0; W for (int k = 0; k < Width; ++k) { double a = M[i * width + k]; double b = N[k * width + j]; sum += a * b; M P } P[i * Width + j] = sum; i H } T D I } W k WIDTH WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 5 ECE498AL, University of Illinois, Urbana-Champaign Step 2: Input Matrix Data Transfer (Host-side Code) void MatrixMulOnDevice(float* M, float* N, float* P, int Width) { int size = Width * Width * sizeof(float); float* Md, Nd, Pd; … 1. // Allocate and Load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device cudaMalloc(&Pd, size); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 6 ECE498AL, University of Illinois, Urbana-Champaign Step 3: Output Matrix Data Transfer (Host-side Code) 2. // Kernel invocation code – to be shown later … 3. // Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md); cudaFree(Nd); cudaFree (Pd); } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 7 ECE498AL, University of Illinois, Urbana-Champaign Step 4: Kernel Function // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 8 ECE498AL, University of Illinois, Urbana-Champaign Step 4: Kernel Function (cont.) for (int k = 0; k < Width; ++k) { Nd float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; k H T D Pvalue += Melement * Nelement; I } tx W Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; } Md Pd ty ty H T D I W k tx © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 WIDTH WIDTH 9 ECE498AL, University of Illinois, Urbana-Champaign Step 5: Kernel Invocation (Host-side Code) // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 10 ECE498AL, University of Illinois, Urbana-Champaign Only One Thread Block Used Grid 1 Nd • One Block of threads compute Block 1 2 matrix Pd – Each thread computes one 4 element of Pd Thread 2 (2, 2) • Each thread 6 – Loads a row of matrix Md – Loads a column of matrix Nd – Perform one multiply and addition for each pair of Md and Nd elements – Compute to off-chip memory 3 2 5 4 48 access ratio close to 1:1 (not very high) • Size of matrix limited by the WIDTH number of threads allowed in a thread block Md Pd © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 11 ECE498AL, University of Illinois, Urbana-Champaign Step 7: Handling Arbitrary Sized Square Matrices (will cover later) • Have each 2D thread block to Nd compute a (TILE_WIDTH)2 sub- matrix (tile) of the result matrix H T D I – Each has (TILE_WIDTH)2 threads W • Generate a 2D Grid of (WIDTH/TILE_WIDTH)2 blocks Md Pd You still need to put a loop by around the kernel call for cases TILE_WIDTH where WIDTH/TILE_WIDTH H T D ty I is greater than max grid size W (64K)! bx tx © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 WIDTH WIDTH 12 ECE498AL, University of Illinois, Urbana-Champaign Some Useful Information on Tools © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 13 ECE498AL, University of Illinois, Urbana-Champaign Compiling a CUDA Program C/C++ CUDA float4 me = gx[gtid]; Application me.x += me.y * me.z; • Parallel Thread eXecution (PTX) – Virtual Machine NVCC CPU Code and ISA – Programming model PTX Code – Execution Virtual resources and state PhysicalPTX to Target ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1; Compiler G80 … GPU Target code © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 14 ECE498AL, University of Illinois, Urbana-Champaign Compilation • Any source file containing CUDA language extensions must be compiled with NVCC • NVCC is a compiler driver – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ... • NVCC outputs: – C code (host CPU Code) • Must then be compiled with the rest of the application using another tool – PTX • Object code directly • Or, PTX source, interpreted at runtime © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 15 ECE498AL, University of Illinois, Urbana-Champaign Linking • Any executable with CUDA code requires two dynamic libraries: – The CUDA runtime library (cudart) – The CUDA core library (cuda) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 16 ECE498AL, University of Illinois, Urbana-Champaign Debugging Using the Device Emulation Mode • An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime – No need of any device and CUDA driver – Each device thread is emulated with a host thread • Running in device emulation mode, one can: – Use host native debug support (breakpoints, inspection, etc.) – Access any device-specific data from host code and vice-versa – Call any host function from device code (e.g. printf) and vice-versa – Detect deadlock situations caused by improper usage of __syncthreads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 17 ECE498AL, University of Illinois, Urbana-Champaign Device Emulation Mode Pitfalls • Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results. • Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 18 ECE498AL, University of Illinois, Urbana-Champaign Floating Point • Results of floating-point computations will slightly differ because of: – Different compiler outputs, instruction sets – Use of extended precision for intermediate results • There are various options to force strict single precision on the host © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 19 ECE498AL, University of Illinois, Urbana-Champaign CUDA Threads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 20 ECE498AL, University of Illinois, Urbana-Champaign Block IDs and Thread IDs Host • Each thread uses IDs to Device decide what data to work on Grid 1 Kernel Block Block – Block ID: 1D or 2D 1 (0, 0) (1, 0) – Thread ID: 1D, 2D, or 3D Block Block (0, 1) (1, 1) • Simplifies memory Grid 2 addressing when Kernel processing 2 Block (1, 1) multidimensional data (0,0,1) (1,0,1) (2,0,1) (3,0,1) – Image processing Thread Thread Thread Thread – Solving PDEs on volumes (0,0,0) (1,0,0) (2,0,0) (3,0,0) – … Thread Thread Thread Thread (0,1,0) (1,1,0) (2,1,0) (3,1,0) Courtesy: NDVIA © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 21 ECE498AL, University of Illinois, Urbana-Champaign bx Matrix Multiplication Using 0 1 2 tx Multiple Blocks 012 TILE_WIDTH-1 Nd • Break-up Pd into tiles H T D • Each block calculates one I tile W – Each thread calculates one element – Block size equal tile size Md Pd 0 E 0 H Pd T 1 sub H D T I 2 D I W ty _ W by 1 E L I T TILE_WIDTH-1 TILE_WIDTH 2 WIDTH WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 22 ECE498AL, University of Illinois, Urbana-Champaign A Small Example Block(0,0) Block(1,0) P0,0 P1,0 P2,0 P3,0 TILE_WIDTH = 2 P0,1 P1,1 P2,1 P3,1 P0,2 P1,2 P2,2 P3,2 P0,3 P1,3 P2,3 P3,3 Block(0,1) Block(1,1) © David Kirk/NVIDIA and Wen-mei W.

Lecture 3: a Simple Example, Tools, and CUDA Threads

Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline

Unit: 4 Processes and Threads in Distributed Systems

Parallel Computing

Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2

Parallel Programming with Openmp

Intel's Hyper-Threading

Thread Scheduling in Multi-Core Operating Systems Redha Gouicem

Concurrent Computing

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Opencl Introduction

An Introduction to CUDA/Opencl and Graphics Processors

Message Passing Interface Part - II