Lecture 3: A Simple Example, Tools, and CUDA Threads

ECE498AL, University of Illinois, Urbana-Champaign
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

A Simple Running Example: Matrix Multiplication

• A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  – Leave shared memory usage until later
  – Local, register usage
  – Thread ID usage
  – Memory data transfer API between host and device
  – Assume square matrices for simplicity

Programming Model: Square Matrix Multiplication Example

• P = M * N, all of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: M, N, and P as WIDTH x WIDTH matrices; each element of P combines a row of M with a column of N]

Memory Layout of a Matrix in C

A 4x4 matrix (Mx,y denotes the element in column x, row y):

    M0,0 M1,0 M2,0 M3,0
    M0,1 M1,1 M2,1 M3,1
    M0,2 M1,2 M2,2 M3,2
    M0,3 M1,3 M2,3 M3,3

is stored in row-major order as the linear array

    M = M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

so element Mx,y lives at M[y * Width + x].

Step 1: Matrix Multiplication, a Simple Host Version in C

    // Matrix multiplication on the (CPU) host in double precision
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

Step 2: Input Matrix Data Transfer (Host-side Code)

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;
        …

        // 1. Allocate and load M, N to device memory
        cudaMalloc(&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc(&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc(&Pd, size);

Step 3: Output Matrix Data Transfer (Host-side Code)

        // 2. Kernel invocation code – to be shown later
        …

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }
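For brevity the slide code checks no return values. Every CUDA runtime call returns a cudaError_t, so a defensive version of Steps 2 and 3 might look like the following minimal sketch; the checkCuda helper and the MatrixMulOnDeviceChecked name are illustrative assumptions, not part of the lecture code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical helper: abort with a message when a runtime call fails.
    static void checkCuda(cudaError_t err, const char* what)
    {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Steps 2 and 3 again, with each allocation and copy checked.
    void MatrixMulOnDeviceChecked(float* M, float* N, float* P, int Width)
    {
        size_t size = (size_t)Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        checkCuda(cudaMalloc(&Md, size), "cudaMalloc Md");
        checkCuda(cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice), "copy M to Md");
        checkCuda(cudaMalloc(&Nd, size), "cudaMalloc Nd");
        checkCuda(cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice), "copy N to Nd");
        checkCuda(cudaMalloc(&Pd, size), "cudaMalloc Pd");

        // Kernel invocation goes here (Step 5).

        checkCuda(cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost), "copy Pd to P");
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }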
Step 4: Kernel Function

    // Matrix multiplication kernel – per thread code
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue is used to store the element of the matrix
        // that is computed by the thread
        float Pvalue = 0;

Step 4: Kernel Function (cont.)

        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to produce one element of Pd]

Step 5: Kernel Invocation (Host-side Code)

    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used

• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread:
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements (two floating-point operations per two loads)
  – The compute to off-chip memory access ratio is therefore close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

[Figure: Grid 1 contains a single Block 1; Thread (2, 2) forms the dot product of a row of Md and a column of Nd, e.g. 2*3 + 4*2 + 2*5 + 6*4 = 48]

Step 7: Handling Arbitrary Sized Square Matrices (will cover later)

• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
• You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)! A sketch of this configuration follows.

[Figure: block (bx, by), thread (tx, ty) computes one element of a TILE_WIDTH-wide tile of Pd]
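To make the preceding slide concrete, here is a minimal sketch of the multi-block execution configuration. It is not the lecture's code: the TILE_WIDTH value, the LaunchMatrixMul wrapper, and the assumption that Width divides evenly are illustrative.

    #include <cuda_runtime.h>

    #define TILE_WIDTH 16  // illustrative tile size

    // Revised kernel that uses block IDs; sketched after the
    // "Matrix Multiplication Using Multiple Blocks" slides below.
    __global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int Width);

    // Sketch: launch a 2D grid of (Width/TILE_WIDTH)^2 blocks,
    // each with (TILE_WIDTH)^2 threads. Assumes Width is a multiple
    // of TILE_WIDTH and Width/TILE_WIDTH fits in the maximum grid
    // dimension; otherwise wrap the launch in a loop, as the slide notes.
    void LaunchMatrixMul(float* Md, float* Nd, float* Pd, int Width)
    {
        dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
        dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
        MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    }

Note that the single-block kernel of Step 4 cannot simply be launched this way: it indexes with threadIdx alone, so every block would compute the same top-left tile. The kernel must also be revised to use its block ID, which is the subject of the slides below.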
Some Useful Information on Tools

Compiling a CUDA Program

• A C/C++ CUDA application, e.g.

      float4 me = gx[gtid];
      me.x += me.y * me.z;

  is compiled by NVCC into CPU code plus (virtual) PTX code, e.g.

      ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
      mad.f32 $f1, $f5, $f3, $f1;

  which a PTX-to-target compiler translates into target code for a physical GPU (G80, …)
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state

Compilation

• Any source file containing CUDA language extensions must be compiled with NVCC
• NVCC is a compiler driver
  – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• NVCC outputs:
  – C code (host CPU code)
    • Must then be compiled with the rest of the application using another tool
  – PTX
    • Object code directly
    • Or, PTX source, interpreted at runtime

Linking

• Any executable with CUDA code requires two dynamic libraries:
  – The CUDA runtime library (cudart)
  – The CUDA core library (cuda)

Debugging Using the Device Emulation Mode

• An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  – No need of any device or CUDA driver
  – Each device thread is emulated with a host thread
• Running in device emulation mode, one can:
  – Use host native debug support (breakpoints, inspection, etc.)
  – Access any device-specific data from host code and vice versa
  – Call any host function from device code (e.g. printf) and vice versa
  – Detect deadlock situations caused by improper usage of __syncthreads

Device Emulation Mode Pitfalls

• Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
• Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode

Floating Point

• Results of floating-point computations will differ slightly because of:
  – Different compiler outputs and instruction sets
  – Use of extended precision for intermediate results
• There are various options to force strict single precision on the host

CUDA Threads

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

[Figure: the host launches Kernel 1 on Grid 1, a 2x2 arrangement of blocks, and Kernel 2 on Grid 2; Block (1, 1) is shown expanded into a 4x2x2 arrangement of threads. Courtesy: NVIDIA]

Matrix Multiplication Using Multiple Blocks

• Break up Pd into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

[Figure: block (bx, by), thread (tx, ty) computes one element of the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub]

A Small Example

With TILE_WIDTH = 2, a 4x4 Pd is covered by four 2x2 blocks:

        Block(0,0)     Block(1,0)
        P0,0  P1,0  |  P2,0  P3,0
        P0,1  P1,1  |  P2,1  P3,1
        ------------+------------
        P0,2  P1,2  |  P2,2  P3,2
        P0,3  P1,3  |  P2,3  P3,3
        Block(0,1)     Block(1,1)
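The excerpt ends before the revised kernel is shown, but the indexing it needs follows directly from the slides above. Here is a minimal hedged sketch; the kernel name and the Row/Col variable names are assumptions, and blockDim is used in place of TILE_WIDTH since block size equals tile size:

    // Sketch: each thread combines its block ID and thread ID to select
    // one element of Pd, so a 2D grid of blocks covers the whole matrix.
    __global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int Width)
    {
        int Row = blockIdx.y * blockDim.y + threadIdx.y;  // y coordinate in Pd
        int Col = blockIdx.x * blockDim.x + threadIdx.x;  // x coordinate in Pd

        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }

Checking against the small example: with TILE_WIDTH = 2, thread (tx, ty) = (1, 0) of Block(1, 0) gets Col = 1*2 + 1 = 3 and Row = 0*2 + 0 = 0, i.e. it computes element P3,0, which indeed lies in Block(1, 0)'s tile.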
