Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA
Luciano Martins and Robert Sohigian, 2018-11-22

Agenda

- Introduction to Python
- GPU-Accelerated Computing
- NVIDIA® CUDA® technology
- Why Use Python with GPUs?
- Methods: PyCUDA, Numba, CuPy, and scikit-cuda
- Summary
- Q&A

Introduction to Python

Released by Guido van Rossum in 1991

The Zen of Python:
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.

Interpreted language (CPython and other implementations)

Dynamically typed; based on objects

Small core structure:
- ~30 keywords
- ~80 built-in functions

Indentation is syntactically significant

Binds to many different languages

Supports GPU acceleration via modules


GPU-Accelerated Computing

“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)

Most common GPU-accelerated operations:
- Large vector/matrix operations (Basic Linear Algebra Subprograms, BLAS)
- Speech recognition
- Computer vision


Important concepts for GPU-accelerated computing:

Host ― the machine running the workload (CPU)

Device ― the GPU(s) inside a host

Kernel ― the portion of code that runs on the GPU

SIMT ― Single Instruction, Multiple Threads


CUDA

Parallel computing platform and programming model developed by NVIDIA:
- Stands for Compute Unified Device Architecture
- Based on C/C++ with some extensions
- Fairly short learning curve for those with OpenMP and MPI programming experience

CUDA on a system has three components:
- Driver (software that controls the graphics card)
- Toolkit (nvcc, several libraries, debugging tools)
- SDK (examples and error-checking utilities)


- A kernel is executed as a grid of thread blocks
- All threads within a block share a portion of data memory
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution to provide hazard-free common memory accesses
  - Efficiently sharing data through low-latency shared memory
- Multiple blocks are combined to form a grid
- All blocks in a grid contain the same number of threads (see the launch-sizing sketch below)
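As a concrete illustration (not from the original slides), a one-dimensional launch is typically sized by rounding the block count up so the grid covers every element:

```python
n = 1_000_000            # total elements to process (illustrative)
threads_per_block = 256  # a common block size

# Round up so blocks * threads covers all n elements; the kernel
# then guards against the few excess threads with an index check.
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block

print(blocks_per_grid)   # 3907 blocks * 256 threads = 1,000,192 threads
```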


[Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of thread blocks on the device; Kernel 2 is launched on Grid 2. Block (1,1) is expanded to show its 5×3 arrangement of threads.]


The host (CPU) performs the following tasks:

1. Initializes GPU card(s)
2. Allocates memory on host and device
3. Copies data from host to device memory
4. Launches instances of the kernel on device(s)
5. Copies data from device memory to host
6. Repeats steps 3-5 as needed
7. Deallocates all memory and terminates

A sketch of these steps appears below.
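A minimal sketch of these steps with PyCUDA (introduced later in this deck); the kernel and names are illustrative, not the presenters' code:

```python
import numpy as np
import pycuda.autoinit                  # step 1: initialize the GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Kernel: each thread doubles one element (CUDA C inside a Python string)
mod = SourceModule("""
__global__ void double_it(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) a[i] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")

n = 1024
a = np.random.randn(n).astype(np.float32)   # step 2: allocate host memory
a_gpu = drv.mem_alloc(a.nbytes)             # step 2: allocate device memory
drv.memcpy_htod(a_gpu, a)                   # step 3: copy host -> device

double_it(a_gpu, np.int32(n),               # step 4: launch the kernel
          block=(256, 1, 1), grid=(n // 256, 1))

drv.memcpy_dtoh(a, a_gpu)                   # step 5: copy device -> host
a_gpu.free()                                # step 7: deallocate device memory
```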

“Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”

https://www.python.org/doc/essays/blurb/

Python (and the Need for Speed)

Interpreted, high-level languages can be too slow for high-performance workloads, so Python needs assistance for those tasks.

Keep the best of both worlds:
- Quick development and prototyping with Python
- The high processing power and speed of the GPU

Accelerating Python

Accelerated code may be pure Python or may also involve C code.

Focusing here on the following modules:
- PyCUDA
- Numba
- CuPy
- scikit-cuda

PyCUDA

A Python wrapper to the CUDA API

Gives speed to Python – near-zero wrapping overhead

Requires C programming knowledge (kernels are written in CUDA C)

Compiles the CUDA code and copies it to the GPU

CUDA errors translated to Python exceptions

Easy installation (pip)
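A minimal sketch of this near-zero-wrapping style (illustrative, not the presenters' code), using pycuda.gpuarray to move NumPy arrays to the GPU and back:

```python
import numpy as np
import pycuda.autoinit              # set up the first available GPU
import pycuda.gpuarray as gpuarray

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)          # host -> device copy
b_gpu = 2 * a_gpu                   # elementwise op executed on the GPU
print(b_gpu.get())                  # device -> host copy
```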


Numba

No need to write C code

High-performance functions written in Python

On-the-fly code generation

Native code generation for the CPU and GPU

Integration with the Python scientific stack

Takes advantage of Python decorators

Code translation is done using the LLVM compiler (see the sketch below)
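A minimal sketch of the decorator-driven approach (illustrative, not the presenters' code), assuming numba.vectorize with the CUDA target:

```python
import numpy as np
from numba import vectorize

# The decorator JIT-compiles this scalar function through LLVM
# into a CUDA kernel; no C code is written by hand.
@vectorize(['float32(float32, float32)'], target='cuda')
def add(x, y):
    return x + y

a = np.arange(10, dtype=np.float32)
print(add(a, a))   # transfers and kernel launch happen implicitly
```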

Benchmark example (Black-Scholes option pricing):
https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html


CuPy

An implementation of NumPy-compatible multi-dimensional arrays on CUDA

Useful for performing matrix operations on GPUs

Provides easy ways to define three types of CUDA kernels (see the sketch below):
- Elementwise kernels
- Reduction kernels
- Raw kernels

Also easy to install (pip)
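A minimal sketch (illustrative names, not the presenters' code) showing NumPy-style operations plus a custom elementwise kernel:

```python
import cupy as cp

# NumPy-compatible arrays and operations that execute on the GPU
x = cp.arange(6, dtype=cp.float32).reshape(2, 3)
print(cp.asnumpy(x.sum(axis=1)))       # reduction runs on the device

# One of the three kernel types: a custom elementwise kernel
squared_diff = cp.ElementwiseKernel(
    'float32 a, float32 b',            # input parameters
    'float32 c',                       # output parameter
    'c = (a - b) * (a - b)',           # CUDA C body applied per element
    'squared_diff')                    # kernel name
print(squared_diff(x, cp.float32(1)))
```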


NumPy vs. CuPy timing comparison:

Array Size | NumPy [ms] | CuPy [ms]
10^4       |       0.03 |      0.58
10^5       |       0.20 |      0.97
10^6       |       2.00 |      1.84
10^7       |      55.55 |     12.48
10^8       |     517.17 |     84.73

scikit-cuda

Motivated by the idea of enhancing PyCUDA

Exposes GPU-powered libraries

Tested on Linux (potentially works elsewhere)

Can be seen as “SciPy on GPU juice”

Presents low-level and high-level functions


Low-level functions:
- Wrap C functions via ctypes
- Catch errors and map them to Python exceptions

High-level functions:
- Take advantage of PyCUDA's GPUArray to manipulate matrices in GPU memory
- Include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)

A sketch of the high-level interface appears below.
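A minimal sketch of the high-level interface (illustrative, following the usage pattern documented for skcuda.linalg):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()                          # initialize the underlying GPU libraries

a = np.asarray(np.random.rand(4, 4), np.float32)
a_gpu = gpuarray.to_gpu(a)             # matrices live in GPU memory (GPUArray)
b_gpu = gpuarray.to_gpu(a)

c_gpu = linalg.dot(a_gpu, b_gpu)       # matrix product computed on the GPU
print(np.allclose(np.dot(a, a), c_gpu.get()))
```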


Summary

Many GPU-computing projects ported to Python are available

Keeps the simplicity of Python whilst adding GPU performance

Allows faster prototype development cycles

Approaches C-level performance, depending on the module (approach) chosen

Spans matrix operations and scientific programming through to custom kernel creation

Summary Pages

PyCUDA Summary

PyCUDA:
- CUDA Python wrapper
- C code added directly in the Python project
- Supports all CUDA libraries
- Notable complexity, since kernels are written in C
- https://documen.tician.de/pycuda/

Numba Summary

Numba:
- Similar coverage to PyCUDA
- No C coding needed
- Takes advantage of LLVM and JIT compilation
- Missing: dynamic parallelism and texture memory
- http://numba.pydata.org/doc.html

CuPy Summary

CuPy:
- Fully supports NumPy structures
- Performs the same operations at scale using the GPU
- Allows CPU/GPU-agnostic code creation
- https://docs-cupy.chainer.org/en/stable

scikit-cuda Summary

scikit-cuda:
- Scientific computing using Python and GPUs
- Provides high-level and low-level functions
- Broad coverage of operations already available
- Depends on PyCUDA's GPUArray mechanisms
- http://scikit-cuda.readthedocs.io/en/latest/
