GPU programming

Introduction to CUDA

Dániel Berényi, 22 October 2019

Introduction

Dániel Berényi

• E-mail: [email protected]

Materials related to the course:

• http://u235axe.web.elte.hu/GPUCourse/

The birth of GPGPUs

What was there before…

Graphics APIs – vertex / fragment

• The GPUs had dedicated parts for executing the vertex and the fragment shaders
• Limited number of data, data types and data structures
• Limited number of instructions in shaders

People tried to create simulations and other non-graphical computations with these.

The GeForce 7 series

G71 chip design

The GeForce 8 series

In 2006 Nvidia introduced a new card (GeForce 8800) with more advanced, unified processors and gave general-purpose access to it via the CUDA API.

More reading: GeForce 8800 unified pipeline architecture

(Diagram of the GeForce 8800 unified pipeline: the host feeds a data assembler and setup / raster / ZCull units, vertex / geometry / pixel thread issue stages dispatch work onto a shared array of streaming processors (SP) with texture filter units (TF), L1 and L2 caches, and framebuffer (FB) partitions.)

CUDA

NVIDIA gave access to the new card via a new API, CUDA

“Compute Unified Device Architecture”

It is not specialized for graphics tasks; it has a general-purpose programming model.

It was based on the C language, but extended with specific elements.

CUDA vs OpenCL

Let’s compare CUDA with OpenCL!

• What it is: CUDA is a HW architecture, language, API, SDK, tools, etc.; OpenCL is a language and API specification
• Type: CUDA is a proprietary technology; OpenCL is a royalty-free standard
• Maintained by: CUDA by Nvidia; OpenCL by Khronos, with multiple vendors
• Target hardware: CUDA targets Nvidia hw (mostly GPUs); OpenCL targets a wide range of devices (CPU, GPU, FPGA, DSPs, …)

CUDA vs OpenCL

• Form: CUDA is single source (host and device code in the same source code); OpenCL is separate source (host and device code in separate source files)
• Language: CUDA is an extension of C / C++; in OpenCL the host only handles API code and can be in any language, while the device code is an extension of C
• Compiled by: CUDA code is compiled by nvcc; in OpenCL the host code is compiled by the host language's compiler and the device code is compiled by the vendor runtime

CUDA vs OpenCL

• Intermediate representation: nvcc compiles CUDA device code into PTX; conforming OpenCL implementations compile to SPIR or SPIR-V
• Can compile to intermediate offline and load at runtime: yes for both
• Graphics interoperability: CUDA with OpenGL / DirectX / Vulkan; OpenCL with OpenGL / DirectX

CUDA vs OpenCL

• Initialization: implicit in CUDA; in OpenCL the platform, device, context and queue are created explicitly
• Data management: in CUDA, explicit device memory allocation, copy to device, copy back; in OpenCL, explicit device memory allocation, some buffer movement handled implicitly, copy back
• Kernel launch: in CUDA, just invoke it with arguments; in OpenCL, load the source, create a program, build it, bind the arguments and enqueue on a queue
• Queue management: implicit in CUDA; explicit in OpenCL

CUDA vs OpenCL

The terminology maps as follows (CUDA term / OpenCL term):

• Grid / NDRange
• Thread block / Work group
• Thread / Work item
• Thread ID / Global ID
• Block index / Block ID
• Thread index / Local ID
• Shared memory / Local memory
• Registers / Private memory

CUDA vs OpenCL

The CUDA computational grid is specified as:
• The number of threads inside a block
• The number of blocks inside the grid

The OpenCL computational grid is specified as:
• The number of threads inside a workgroup
• The total number of threads inside the grid (and the workgroup size must evenly divide the grid size!)
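As a minimal sketch of the CUDA convention (N, threadsPerBlock, numBlocks and my_kernel are illustrative names, not from the slides), the block count is usually derived from the total problem size by rounding up:

// Hypothetical launch-configuration sketch: choose a block size,
// then round the number of blocks up so that N elements are covered.
int N = 1000000;                // total number of elements (example value)
int threadsPerBlock = 256;      // threads inside a block
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // blocks inside the grid
// my_kernel<<<numBlocks, threadsPerBlock>>>(...);  // threads with index >= N must be masked inside the kernel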

A sample code in detail

• Squaring an array of integers

The CUDA kernel

__global__ void sq(int* dst, int* src)
{
    int x = src[threadIdx.x];
    dst[threadIdx.x] = x * x;
}

• The __global__ qualifier signals device function entry points
• All entry points must return void
• There are no specific types or qualifiers for buffers, we just expect pointers to arrays
• threadIdx is a built-in object that stores the actual thread's indices in the computational grid

The CUDA host code

#include <vector>
#include <iostream>
#include <algorithm>

int main()
{
    std::vector<int> A{1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<int> B(A.size());
    std::vector<int> C(A.size());

    size_t sz = A.size();
    int* src = nullptr;
    int* dst = nullptr;

    cudaError_t err = cudaSuccess;

    err = cudaMalloc( (void**)&src, sz*sizeof(int) );
    if( err != cudaSuccess ){ … }

    err = cudaMalloc( (void**)&dst, sz*sizeof(int) );
    if( err != cudaSuccess ){ … }

• cudaMalloc allocates memory on the device

• CUDA API functions return error codes to signal success or failure.

The CUDA host code

Copying sz*sizeof(int) bytes of data from host pointer A.data() to device pointer src:

err = cudaMemcpy( src, A.data(), sz*sizeof(int), cudaMemcpyHostToDevice );
if( err != cudaSuccess ){ … }

The CUDA host code

dim3 dimGrid( 1, 1 );
dim3 dimBlock( sz, 1 );

sq<<<dimGrid, dimBlock>>>(dst, src);

err = cudaGetLastError();
if( err != cudaSuccess ){ … }

• dim3 is a built-in type for 3D sizes. Here we create only one block (dimGrid) with sz threads (dimBlock).
• <<< … >>> marks a kernel invocation: the computation grid and other options go between the triple angle brackets, the function name (sq) comes before them, and the function arguments after them.
• Errors during the kernel invocation are reported through the cudaGetLastError() function.

The CUDA host code

err = cudaMemcpy( B.data(), dst, sz*sizeof(int), cudaMemcpyDeviceToHost );
if( err != cudaSuccess ){ … }

• Data must be copied back from the device to the host.
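Once the results are back on the host, the device buffers can be released with cudaFree; a minimal cleanup sketch (not shown on the original slides):

// Release the device allocations made with cudaMalloc.
err = cudaFree(src);
if( err != cudaSuccess ){ … }
err = cudaFree(dst);
if( err != cudaSuccess ){ … }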

The CUDA host code: what is happening in the background?

In CUDA there is an implicit stream (queue) that receives all issued commands. These are automatically sequenced after each other, and cudaMemcpy calls block until the data from the previous kernel call is available.

So in the simplest examples here we do not need to handle host synchronization. When we do, we can use events, callbacks, or the cudaDeviceSynchronize function.
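A minimal sketch of the blunt approach, waiting for all previously issued device work to finish before the host continues:

sq<<<dimGrid, dimBlock>>>(dst, src);   // runs asynchronously with respect to the host

err = cudaDeviceSynchronize();         // block the host until all issued device work has completed
if( err != cudaSuccess ){ … }
// from here on the kernel has finished; timing or further host work can safely proceed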

Compiling CUDA codes

CUDA code is compiled by the nvcc compiler supplied by Nvidia. It extracts the device side code (marked by <<< >>> and __global__) from the source, then compiles the host code to binary and the device code to PTX.

Compiling CUDA codes

Compiling under Windows (note the installation resources here):
• A Visual Studio installation or the Visual Studio Build Tools are needed. See here.
• You need to run the Developer Command Prompt (documentation here and here).
• You need to be aware of which VS version is compatible with which CUDA version.
• Once you've set up the developer prompt, you can invoke nvcc on your CUDA sources.

Compiling CUDA codes

Sample setup script for initializing the command line on Windows:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" amd64 -vcvars_ver=14.16

• vcvars64.bat is the generic setup script for initializing the command prompt for development
• amd64 selects the architecture
• -vcvars_ver=14.16 selects the Visual Studio toolset version to build with (here the 2017 toolset)

Compiling CUDA codes

Compiling under Linux/Unix and/or Mac OS:

If you've installed it properly, nvcc should simply be available from the command line, just like gcc.

Installation resources: Mac OS, Linux, nvcc, more information on nvcc

Once your system / command prompt is ready, nvcc acts like gcc:

nvcc mycode.cu -o myexecutable

Some arguments we will use:
• -O3 to enable optimizations
• -std=c++11 or -std=c++14 to select the C++ standard version
• --expt-extended-lambda for extended lambda functions (lambdas usable in device code)
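Putting the listed options together, a typical invocation (file and executable names are placeholders) might look like:

nvcc -O3 -std=c++14 --expt-extended-lambda mycode.cu -o myexecutable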

The goal of GPU computing

We would like to write code that runs fast!

Faster than on the CPU.

But the GPU result must be correct too.

It is highly recommended that we:
• measure CPU performance
• measure GPU performance
• compare the results to check that they did the same thing (assuming the CPU implementation was correct)

Measuring performance on the CPU

Since C++11:

#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();

// computation to measure

auto t1 = std::chrono::high_resolution_clock::now();

auto dt = std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count(); // pick the duration unit you need (see below)

Measuring performance on the CPU

Since C++11 (see on cppreference):

auto dt = std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count();

The duration type can be one of:
• std::chrono::hours
• std::chrono::minutes
• std::chrono::seconds
• std::chrono::milliseconds
• std::chrono::microseconds
• std::chrono::nanoseconds

Measuring performance on the GPU

Simply using the CPU timers to measure the GPU performance may give incorrect results, as the GPU is operating asynchronously. Thus it would require CPU-GPU synchronizations that can reduce performance on large programs.

Much more reliable timing can be obtained with events that are placed into the stream to record time marks.

Measuring performance on the GPU

cudaEvent_t evt[2];
for(auto& e : evt){ cudaEventCreate(&e); }

cudaEventRecord(evt[0]);
kernel<<< ... >>>( ... );
cudaEventRecord(evt[1]);

cudaEventSynchronize(evt[1]);

float dt = 0.0f; // milliseconds
cudaEventElapsedTime(&dt, evt[0], evt[1]);

for(auto& e : evt){ cudaEventDestroy(e); }

• Events must be created explicitly
• Place events into the active stream before and after the kernel call
• cudaEventSynchronize waits for the last event (and all previously issued commands) to finish
• cudaEventElapsedTime queries the elapsed time between two events in milliseconds
• Dispose of the events at the end

Comparing results between CPU and GPU

A handy function to check for mismatch between CPU and GPU result arrays is std::mismatch (see here).

It traverses a pair of ranges and, at the first mismatch, returns a pair containing the iterators to the mismatching positions. If the ranges are the same (no mismatch), the end iterators are returned.

if( std::mismatch(B.begin(), B.end(), C.begin()).first == B.end() )
{
    std::cout << "They are the same.";
}
else
{
    std::cout << "There is a mismatch.";
}
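For floating point results (as in the exercises below), an exact comparison may flag harmless rounding differences. A sketch using the predicate overload of std::mismatch with a tolerance (nearly_equal and the eps value 1e-5f are illustrative choices; needs <cmath> for std::fabs):

// Compare two float arrays element-wise, allowing a small absolute difference.
auto nearly_equal = [](float a, float b){ return std::fabs(a - b) < 1e-5f; };
if( std::mismatch(B.begin(), B.end(), C.begin(), nearly_equal).first == B.end() )
{
    std::cout << "They agree within tolerance.";
}
else
{
    std::cout << "There is a mismatch.";
}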

CUDA built-in vector types

We can create vector types from the following basic types: char, short, int, long, longlong (and their unsigned variants with the u prefix), float, double, by appending a number 1, 2, 3 or 4 at the end: e.g. float4 is a four element vector type with members .x, .y, .z, .w. Such a type can be initialized with braces {a, b, c, d} or by calling the make_float4(a, b, c, d) function. dim3 is equivalent to uint3.

CUDA built-in vector types

Vector types can be useful when loading or writing data, as they can improve bandwidth in certain cases.

CUDA built-in variables in device code

• dim3 gridDim; .x, .y, .z: dimensions of the computational grid
• uint3 blockIdx; .x, .y, .z: index of the block in the computational grid
• dim3 blockDim; .x, .y, .z: dimensions of a block
• uint3 threadIdx; .x, .y, .z: thread index within the block
• int warpSize; number of threads that are handled together by the multiprocessor (usually 32)
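As a small illustrative sketch (sq_global and n are names not taken from the slides), these built-ins are typically combined into a global thread index so that one thread handles one array element across many blocks:

// Hypothetical kernel: each thread computes its global index from the
// built-in variables and squares one element, guarding against overrun.
__global__ void sq_global(int* dst, const int* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index in a 1D grid
    if(i < n)
    {
        dst[i] = src[i] * src[i];
    }
}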

Global memory access

• The GPU's memory access hardware is designed to serve some access patterns very effectively: if threads access consecutive addresses, they can be served in a single memory transaction, so they are fast. This is called coalesced memory access.

• Totally random access may, in the worst case, be serialized into separate transactions, and thus will be very slow.

• Strided access falls in between: the cost depends on how the accessed locations spread over 32-, 64- and 128-byte segments.
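A small sketch contrasting the two patterns (copy_coalesced, copy_strided and stride are illustrative names; bounds checks are omitted, so the caller must size the arrays accordingly):

// Coalesced: consecutive threads read consecutive addresses,
// so a warp's loads can be combined into few memory transactions.
__global__ void copy_coalesced(float* dst, const float* src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart,
// touching more memory segments per warp, so effective bandwidth drops.
__global__ void copy_strided(float* dst, const float* src, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];
}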

Shared memory

The threads inside a block can access a common memory, the shared memory.

We must explicitly specify the size of this memory and the variables placed in it.

Read more here.

Shared memory

Shared memory can be static, when the required size is known at compile time:

__global__ void my_kernel(...)
{
    __shared__ int shared[64];
}

Shared memory

Shared memory can be dynamic, when the required size is only known at run time:

__global__ void my_kernel(...)
{
    extern __shared__ int shared[];
}

In this case, we must specify the shared memory size in bytes as the third launch parameter:

my_kernel<<<dimGrid, dimBlock, sharedMemBytes>>>(...); // sharedMemBytes: dynamic shared memory size in bytes
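A minimal sketch (sharedMemBytes is an illustrative name) tying the extern declaration to the launch parameter: the byte count usually follows from the block size and the element type.

// One int of dynamic shared memory per thread in the block.
size_t sharedMemBytes = dimBlock.x * sizeof(int);
my_kernel<<<dimGrid, dimBlock, sharedMemBytes>>>(...);
// Inside the kernel, 'extern __shared__ int shared[];' then refers to this buffer.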

Shared memory

Shared memory is organized into banks.

In a single access, only different banks can be accessed simultaneously. If multiple threads access the same bank, their accesses are serialized. This is called a bank conflict.

The bank width (4 or 8 bytes) can be configured by calling cudaDeviceSetSharedMemConfig. See the docs

The multiprocessor's on-chip memory is divided between shared memory and the L1 cache; this split can be configured by calling cudaDeviceSetCacheConfig.

Shared memory bank conflicts

See this video: link
And see this blogpost: link

Assume an array of integers (4 bytes)

No bank conflict with access like: array[thread_idx]

Bank conflict with access like: array[thread_idx*2]

Banks serving odd indices will be unused, while even-indexed banks receive two accesses each; these are serialized, giving a 2-fold slowdown.
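A common remedy for strided patterns in 2D shared tiles is to pad the leading dimension. A sketch under the usual assumption of 32 four-byte banks (the tile name is illustrative):

// Inside a kernel: without the +1 padding, a warp reading a column of a
// 32x32 float tile (tile[threadIdx.x][k]) hits the same bank 32 times.
// The extra column shifts consecutive rows to different banks and
// removes the conflict.
__shared__ float tile[32][32 + 1];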

Shared memory access

Shared memory access must be synchronized if more than one warp is executing in the block.

__syncthreads() is the function to call to wait for all threads in the block to catch up.

Caveat: all threads of the block must reach the sync point, or the kernel will hang.

The GPU reduce

Reduction on the GPU in CUDA is a great exercise to learn about code performance. We will go over the classic document by Mark Harris: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
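To give a flavour of what the document covers, here is a sketch of a simple shared-memory tree reduction within one block, in the spirit of the early variants discussed there (block_sum and cache are illustrative names; the block size is assumed to be a power of two):

// Each block sums blockDim.x elements of src into one partial result in dst.
__global__ void block_sum(float* dst, const float* src)
{
    extern __shared__ float cache[];              // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    cache[tid] = src[blockIdx.x * blockDim.x + tid];
    __syncthreads();                              // all loads done before anyone reads cache

    // Tree reduction: halve the number of active threads in each step.
    for(int s = blockDim.x / 2; s > 0; s /= 2)
    {
        if(tid < s){ cache[tid] += cache[tid + s]; }
        __syncthreads();
    }

    if(tid == 0){ dst[blockIdx.x] = cache[0]; }   // thread 0 writes the block's partial sum
}

// Launched e.g. as: block_sum<<<numBlocks, threadsPerBlock, threadsPerBlock*sizeof(float)>>>(partial, data);
// The partial sums per block are then reduced further (on the host or with another kernel).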

Exercises

1) Write a CUDA code that computes the average of a large array of floating point numbers.

2) Extend this to compute the standard deviation of the dataset.

3) Compare the performance with a CPU version.

4) Write a code to compute the dot product of two large arrays of floating point numbers. Compare with the performance of std::inner_product

+1) Make the code work for any number of elements that fits into the GPU memory.

Resources

• For C++ related questions, search cppreference.com

• For CUDA related questions, search the CUDA documentation

• Also, stackoverflow and the Nvidia developer forums are a great place to find answers to common questions.

• The Nvidia dev blogs are useful for learning about old and new features.