GPU programming

Introduction to CUDA

Dániel Berényi, 22 October 2019

Introduction

Dániel Berényi

• E-mail: [email protected]

Materials related to the course:

• http://u235axe.web.elte.hu/GPUCourse/

The birth of GPGPUs

What was there before…

Graphics APIs – vertex / fragment

• The GPUs had dedicated parts for executing the vertex and the fragment shaders
• Limited number of data, data types and data structures
• Limited number of instructions in shaders

People tried to create simulations and other non-graphical computations with these.

The GeForce 7 series

G71 chip design

The GeForce 8 series

In 2006 Nvidia introduced a new card (GeForce 8800) with more advanced, unified processors and gave general-purpose access to it via the CUDA API.

More reading: GeForce 8800 unified pipeline architecture

(Diagram of the GeForce 8800 unified pipeline: the host feeds a data assembler and setup / raster / ZCull units, vertex / geometry / pixel thread issue stages dispatch work onto a shared array of streaming processors (SP) with texture filter units (TF), L1 and L2 caches, and framebuffer (FB) partitions.)

CUDA

NVIDIA gave access to the new card via a new API, CUDA

“Compute Unified Device Architecture”

It is not specialized for graphics tasks; it has a general-purpose programming model.

It was based on the C language, but extended with specific elements.

CUDA vs OpenCL

Let’s compare CUDA with OpenCL!

• What it is: CUDA is a HW architecture, language, API, SDK, tools, etc.; OpenCL is a language and API specification
• Type: CUDA is a proprietary technology; OpenCL is a royalty-free standard
• Maintained by: CUDA by Nvidia; OpenCL by Khronos, with multiple vendors
• Target hardware: CUDA targets Nvidia hw (mostly GPUs); OpenCL targets a wide range of devices (CPU, GPU, FPGA, DSPs, …)

CUDA vs OpenCL

• Form: CUDA is single source (host and device code in the same source code); OpenCL is separate source (host and device code in separate source files)
• Language: CUDA is an extension of C / C++; in OpenCL the host only handles API code and can be in any language, while the device code is an extension of C
• Compiled by: CUDA code is compiled by nvcc; in OpenCL the host code is compiled by the host language's compiler and the device code is compiled by the vendor runtime

CUDA vs OpenCL

• Intermediate representation: nvcc compiles CUDA device code into PTX; conforming OpenCL implementations compile to SPIR or SPIR-V
• Can compile to intermediate offline and load at runtime: yes for both
• Graphics interoperability: CUDA with OpenGL / DirectX / Vulkan; OpenCL with OpenGL / DirectX

CUDA vs OpenCL

• Initialization: implicit in CUDA; in OpenCL the platform, device, context and queue are created explicitly
• Data management: in CUDA, explicit device memory allocation, copy to device, copy back; in OpenCL, explicit device memory allocation, some buffer movement handled implicitly, copy back
• Kernel launch: in CUDA, just invoke it with arguments; in OpenCL, load the source, create a program, build it, bind the arguments and enqueue on a queue
• Queue management: implicit in CUDA; explicit in OpenCL

CUDA vs OpenCL

The terminology maps as follows (CUDA term / OpenCL term):

• Grid / NDRange
• Thread block / Work group
• Thread / Work item
• Thread ID / Global ID
• Block index / Block ID
• Thread index / Local ID
• Shared memory / Local memory
• Registers / Private memory

CUDA vs OpenCL

The CUDA computational grid is specified as:
• The number of threads inside a block
• The number of blocks inside the grid

The OpenCL computational grid is specified as:
• The number of threads inside a workgroup
• The total number of threads inside the grid (and the workgroup size must evenly divide the grid size!)
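As a minimal sketch of the CUDA convention (N, threadsPerBlock, numBlocks and my_kernel are illustrative names, not from the slides), the block count is usually derived from the total problem size by rounding up:

// Hypothetical launch-configuration sketch: choose a block size,
// then round the number of blocks up so that N elements are covered.
int N = 1000000;                // total number of elements (example value)
int threadsPerBlock = 256;      // threads inside a block
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // blocks inside the grid
// my_kernel<<<numBlocks, threadsPerBlock>>>(...);  // threads with index >= N must be masked inside the kernel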

A sample code in detail

• Squaring an array of integers

The CUDA kernel

__global__ void sq(int* dst, int* src)
{
    int x = src[threadIdx.x];
    dst[threadIdx.x] = x * x;
}

• The __global__ qualifier signals device function entry points
• All entry points must return void
• There are no specific types or qualifiers for buffers, we just expect pointers to arrays
• threadIdx is a built-in object that stores the actual thread's indices in the computational grid

The CUDA host code

#include <vector>
#include <iostream>
#include <algorithm>

int main()
{
    std::vector<int> A{1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::vector<int> B(A.size());
    std::vector<int> C(A.size());

    size_t sz = A.size();
    int* src = nullptr;
    int* dst = nullptr;

    cudaError_t err = cudaSuccess;

    err = cudaMalloc( (void**)&src, sz*sizeof(int) );
    if( err != cudaSuccess ){ … }

    err = cudaMalloc( (void**)&dst, sz*sizeof(int) );
    if( err != cudaSuccess ){ … }

• cudaMalloc allocates memory on the device

• CUDA API functions return error codes to signal success or failure.

The CUDA host code

Copying sz*sizeof(int) bytes of data from host pointer A.data() to device pointer src:

err = cudaMemcpy( src, A.data(), sz*sizeof(int), cudaMemcpyHostToDevice );
if( err != cudaSuccess ){ … }

The CUDA host code

dim3 dimGrid( 1, 1 );
dim3 dimBlock( sz, 1 );

sq<<<dimGrid, dimBlock>>>(dst, src);

err = cudaGetLastError();
if( err != cudaSuccess ){ … }

• dim3 is a built-in type for 3D sizes. Here we create only one block (dimGrid) with sz threads (dimBlock).
• <<< … >>> marks a kernel invocation: the computation grid and other options go between the triple angle brackets, the function name (sq) comes before them, and the function arguments after them.
• Errors during the kernel invocation are reported through the cudaGetLastError() function.

The CUDA host code

err = cudaMemcpy( B.data(), dst, sz*sizeof(int), cudaMemcpyDeviceToHost );
if( err != cudaSuccess ){ … }

• Data must be copied back from the device to the host.
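Once the results are back on the host, the device buffers can be released with cudaFree; a minimal cleanup sketch (not shown on the original slides):

// Release the device allocations made with cudaMalloc.
err = cudaFree(src);
if( err != cudaSuccess ){ … }
err = cudaFree(dst);
if( err != cudaSuccess ){ … }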

The CUDA host code: what is happening in the background?

In CUDA there is an implicit stream (queue) that receives all issued commands. These are automatically sequenced after each other, and cudaMemcpy calls block until the data from the previous kernel call is available.

So in the simplest examples here we do not need to handle host synchronization. When we do, we can use events, callbacks, or the cudaDeviceSynchronize function.
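A minimal sketch of the blunt approach, waiting for all previously issued device work to finish before the host continues:

sq<<<dimGrid, dimBlock>>>(dst, src);   // runs asynchronously with respect to the host

err = cudaDeviceSynchronize();         // block the host until all issued device work has completed
if( err != cudaSuccess ){ … }
// from here on the kernel has finished; timing or further host work can safely proceed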

Compiling CUDA codes

CUDA code is compiled by the nvcc compiler supplied by Nvidia. It extracts the device side code (marked by <<< >>> and __global__) from the source, then compiles the host code to binary and the device code to PTX.

Compiling CUDA codes

Compiling under Windows (note the installation resources here):
• A Visual Studio installation or the Visual Studio Build Tools are needed. See here.
• You need to run the Developer Command Prompt (documentation here and here).
• You need to be aware of which VS version is compatible with which CUDA version.
• Once you've set up the developer prompt, you can invoke nvcc on your CUDA sources.

Compiling CUDA codes

Sample setup script for initializing the command line on Windows:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" amd64 -vcvars_ver=14.16

• vcvars64.bat is the generic setup script for initializing the command prompt for development
• amd64 selects the architecture
• -vcvars_ver=14.16 selects the Visual Studio toolset version to build with (here the 2017 toolset)

Compiling CUDA codes

Compiling under Linux/Unix and/or Mac OS:

If you've installed it properly, nvcc should simply be available from the command line, just like gcc.

Installation resources: Mac OS, Linux, nvcc, more information on nvcc

Once your system / command prompt is ready, nvcc acts like gcc:

nvcc mycode.cu -o myexecutable

Some arguments we will use:
• -O3 to enable optimizations
• -std=c++11 or -std=c++14 to select the C++ standard version
• --expt-extended-lambda for extended lambda functions (lambdas usable in device code)
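Putting the listed options together, a typical invocation (file and executable names are placeholders) might look like:

nvcc -O3 -std=c++14 --expt-extended-lambda mycode.cu -o myexecutable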

The goal of GPU computing

We would like to write code that runs fast!

Faster than on the CPU.

But the GPU result must be correct too.

It is highly recommended that we:
• measure CPU performance
• measure GPU performance
• compare the results to check that they did the same thing (assuming the CPU implementation was correct)

Measuring performance on the CPU

Since C++11:

#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();

// computation to measure

auto t1 = std::chrono::high_resolution_clock::now();

auto dt = std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count(); // pick the duration unit you need (see below)

Measuring performance on the CPU

Since C++11 (see on cppreference):

auto dt = std::chrono::duration_cast<std::chrono::microseconds>(t1-t0).count();

The duration type can be one of:
• std::chrono::hours
• std::chrono::minutes
• std::chrono::seconds
• std::chrono::milliseconds
• std::chrono::microseconds
• std::chrono::nanoseconds

Measuring performance on the GPU

Simply using the CPU timers to measure the GPU performance may give incorrect results, as the GPU is operating asynchronously. Thus it would require CPU-GPU synchronizations that can reduce performance on large programs.

Much more reliable timing can be obtained with events that are placed into the stream to record time marks.

Measuring performance on the GPU

cudaEvent_t evt[2];
for(auto& e : evt){ cudaEventCreate(&e); }

cudaEventRecord(evt[0]);
kernel<<< ... >>>( ... );
cudaEventRecord(evt[1]);

cudaEventSynchronize(evt[1]);

float dt = 0.0f; // milliseconds
cudaEventElapsedTime(&dt, evt[0], evt[1]);

for(auto& e : evt){ cudaEventDestroy(e); }

• Events must be created explicitly
• Place events into the active stream before and after the kernel call
• cudaEventSynchronize waits for the last event (and all previously issued commands) to finish
• cudaEventElapsedTime queries the elapsed time between two events in milliseconds
• Dispose of the events at the end

Comparing results between CPU and GPU

A handy function to check for mismatch between CPU and GPU result arrays is std::mismatch (see here).

It traverses a pair of ranges and, at the first mismatch, returns a pair containing the iterators to the mismatching positions. If the ranges are the same (no mismatch), the end iterators are returned.

if( std::mismatch(B.begin(), B.end(), C.begin()).first == B.end() )
{
    std::cout << "They are the same.";
}
else
{
    std::cout << "There is a mismatch.";
}
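For floating point results (as in the exercises below), an exact comparison may flag harmless rounding differences. A sketch using the predicate overload of std::mismatch with a tolerance (nearly_equal and the eps value 1e-5f are illustrative choices; needs <cmath> for std::fabs):

// Compare two float arrays element-wise, allowing a small absolute difference.
auto nearly_equal = [](float a, float b){ return std::fabs(a - b) < 1e-5f; };
if( std::mismatch(B.begin(), B.end(), C.begin(), nearly_equal).first == B.end() )
{
    std::cout << "They agree within tolerance.";
}
else
{
    std::cout << "There is a mismatch.";
}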

CUDA built-in vector types

We can create vector types from the following basic types: char, short, int, long, longlong (and their unsigned variants with the u prefix), float, double, by appending a number 1, 2, 3 or 4 at the end: e.g. float4 is a four element vector type with members .x, .y, .z, .w. Such a type can be initialized with braces {a, b, c, d} or by calling the make_float4(a, b, c, d) function. dim3 is equivalent to uint3.

CUDA built-in vector types

Vector types can be useful when loading or writing data, as they can improve bandwidth in certain cases.

CUDA built-in variables in device code

• dim3 gridDim; .x, .y, .z: dimensions of the computational grid
• uint3 blockIdx; .x, .y, .z: index of the block in the computational grid
• dim3 blockDim; .x, .y, .z: dimensions of a block
• uint3 threadIdx; .x, .y, .z: thread index within the block
• int warpSize; number of threads that are handled together by the multiprocessor (usually 32)
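As a small illustrative sketch (sq_global and n are names not taken from the slides), these built-ins are typically combined into a global thread index so that one thread handles one array element across many blocks:

// Hypothetical kernel: each thread computes its global index from the
// built-in variables and squares one element, guarding against overrun.
__global__ void sq_global(int* dst, const int* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global index in a 1D grid
    if(i < n)
    {
        dst[i] = src[i] * src[i];
    }
}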

Global memory access

• The GPU's memory access hardware is designed to serve some access patterns very effectively: if threads access consecutive addresses, they can be served in a single memory transaction, so they are fast. This is called coalesced memory access.

• Totally random access may, in the worst case, be serialized into separate transactions, and thus will be very slow.

• Strided access falls in between: the cost depends on how the accessed locations spread over 32-, 64- and 128-byte segments.
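A small sketch contrasting the two patterns (copy_coalesced, copy_strided and stride are illustrative names; bounds checks are omitted, so the caller must size the arrays accordingly):

// Coalesced: consecutive threads read consecutive addresses,
// so a warp's loads can be combined into few memory transactions.
__global__ void copy_coalesced(float* dst, const float* src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart,
// touching more memory segments per warp, so effective bandwidth drops.
__global__ void copy_strided(float* dst, const float* src, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];
}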

Shared memory

The threads inside a block can access a common memory, the shared memory.

We must explicitly specify the size of this memory and the variables placed in it.

Read more here.

Shared memory

Shared memory can be static, when the required size is known at compile time:

__global__ void my_kernel(...)
{
    __shared__ int shared[64];
}

Shared memory

Shared memory can be dynamic, when the required size is only known at run time:

__global__ void my_kernel(...)
{
    extern __shared__ int shared[];
}

In this case, we must specify the shared memory size in bytes as the third launch parameter:

my_kernel<<<dimGrid, dimBlock, sharedMemBytes>>>(...); // sharedMemBytes: dynamic shared memory size in bytes
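A minimal sketch (sharedMemBytes is an illustrative name) tying the extern declaration to the launch parameter: the byte count usually follows from the block size and the element type.

// One int of dynamic shared memory per thread in the block.
size_t sharedMemBytes = dimBlock.x * sizeof(int);
my_kernel<<<dimGrid, dimBlock, sharedMemBytes>>>(...);
// Inside the kernel, 'extern __shared__ int shared[];' then refers to this buffer.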

Shared memory

Shared memory is organized into banks.

In a single access, only different banks can be accessed simultaneously. If multiple threads access the same bank, their accesses are serialized. This is called a bank conflict.

The bank width (4 or 8 bytes) can be configured by calling cudaDeviceSetSharedMemConfig. See the docs

The multiprocessor's on-chip memory is divided between shared memory and the L1 cache; this split can be configured by calling cudaDeviceSetCacheConfig.

Shared memory bank conflicts

See this video: link
And see this blogpost: link

Assume an array of integers (4 bytes)

No bank conflict with access like: array[thread_idx]

Bank conflict with access like: array[thread_idx*2]

Banks serving odd indices will be unused, while even-indexed banks receive two accesses each; these are serialized, giving a 2-fold slowdown.
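A common remedy for strided patterns in 2D shared tiles is to pad the leading dimension. A sketch under the usual assumption of 32 four-byte banks (the tile name is illustrative):

// Inside a kernel: without the +1 padding, a warp reading a column of a
// 32x32 float tile (tile[threadIdx.x][k]) hits the same bank 32 times.
// The extra column shifts consecutive rows to different banks and
// removes the conflict.
__shared__ float tile[32][32 + 1];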

Shared memory access

Shared memory access must be synchronized if more than one warp is executing in the block.

__syncthreads() is the function to call to wait for all threads in the block to catch up.

Caveat: all threads of the block must reach the sync point, or the kernel will hang.

The GPU reduce

Reduction on the GPU in CUDA is a great exercise to learn about code performance. We will go over the classic document by Mark Harris: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
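To give a flavour of what the document covers, here is a sketch of a simple shared-memory tree reduction within one block, in the spirit of the early variants discussed there (block_sum and cache are illustrative names; the block size is assumed to be a power of two):

// Each block sums blockDim.x elements of src into one partial result in dst.
__global__ void block_sum(float* dst, const float* src)
{
    extern __shared__ float cache[];              // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    cache[tid] = src[blockIdx.x * blockDim.x + tid];
    __syncthreads();                              // all loads done before anyone reads cache

    // Tree reduction: halve the number of active threads in each step.
    for(int s = blockDim.x / 2; s > 0; s /= 2)
    {
        if(tid < s){ cache[tid] += cache[tid + s]; }
        __syncthreads();
    }

    if(tid == 0){ dst[blockIdx.x] = cache[0]; }   // thread 0 writes the block's partial sum
}

// Launched e.g. as: block_sum<<<numBlocks, threadsPerBlock, threadsPerBlock*sizeof(float)>>>(partial, data);
// The partial sums per block are then reduced further (on the host or with another kernel).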

Exercises

1) Write a CUDA code that computes the average of a large array of floating point numbers.

2) Extend this to compute the standard deviation of the dataset.

3) Compare the performance with a CPU version.

4) Write a code to compute the dot product of two large arrays of floating point numbers. Compare with the performance of std::inner_product

+1) Make the code work for any number of elements that fits into the GPU memory.

Resources

• For C++ related questions, search cppreference.com

• For CUDA related questions, search the CUDA documentation

• Also, stackoverflow and the Nvidia developer forums are a great place to find answers to common questions.

• The Nvidia dev blogs are useful for learning about old and new features.