Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA
Luciano Martins and Robert Sohigian, 2018-11-22

Agenda
- Introduction to Python
- GPU-Accelerated Computing
- NVIDIA® CUDA® technology
- Why Use Python with GPUs?
- Methods: PyCUDA, Numba, CuPy, and scikit-cuda
- Summary
- Q&A
2 Introduction to Python
Released by Guido van Rossum in 1991
The Zen of Python: “Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested.”
Interpreted language (CPython, Jython, ...)
Dynamically typed; based on objects
3 Introduction to Python
Small core structure: ~30 keywords, ~80 built-in functions
Indentation is syntactically significant (no braces)
Binds to many different languages
Supports GPU acceleration via modules
4 Introduction to Python
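A minimal sketch of these traits in action (values are illustrative): dynamic typing lets a name rebind across types, indentation delimits blocks, and built-ins cover the common operations.

```python
# Dynamic typing: the same name can rebind to values of different types.
x = 42           # int
x = "forty-two"  # now a str

# Indentation delimits blocks; no braces or 'end' keywords.
def describe(items):
    for item in items:
        print(f"{item!r} is a {type(item).__name__}")

# A couple of the ~80 built-in functions in action.
describe(sorted([3, 1, 2]))
print(len("Python"))
```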
7 GPU-Accelerated Computing
“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)
Most common GPU-accelerated operations:
- Large vector/matrix operations (Basic Linear Algebra Subprograms, BLAS)
- Speech recognition
- Computer vision
8 GPU-Accelerated Computing
Important concepts for GPU-accelerated computing:
Host ― the machine running the workload (CPU)
Device ― the GPU(s) inside the host
Kernel ― the part of the code that runs on the GPU
SIMT ― Single Instruction, Multiple Threads
11 CUDA
Parallel computing platform and programming model developed by NVIDIA:
- Stands for Compute Unified Device Architecture
- Based on C/C++ with some extensions
- Fairly short learning curve for those with OpenMP and MPI programming experience
CUDA on a system has three components:
- Driver (software that controls the graphics card)
- Toolkit (nvcc, several libraries, debugging tools)
- SDK (examples and error-checking utilities)
12 CUDA
A kernel is executed as a grid of thread blocks:
- All threads within a block share a portion of data memory
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution to provide hazard-free common memory accesses
  - Efficiently sharing data through low-latency shared memory
- Multiple blocks are combined to form a grid
- All blocks in a grid contain the same number of threads
13 CUDA
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of blocks, and Kernel 2 on Grid 2; an expanded Block (1, 1) shows its 5×3 arrangement of threads]
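To make the layout concrete, here is a pure-Python sketch (an illustration, not GPU code) of how each thread derives a unique global index from its block and thread coordinates, using the dimensions from the figure:

```python
# Pure-Python illustration of the CUDA indexing rule: a thread's global
# position is blockIdx * blockDim + threadIdx, per dimension.
blockDim = (5, 3)  # threads per block, as in the figure
gridDim = (3, 2)   # blocks per grid (Grid 1 in the figure)

def global_index(blockIdx, threadIdx):
    x = blockIdx[0] * blockDim[0] + threadIdx[0]
    y = blockIdx[1] * blockDim[1] + threadIdx[1]
    return x, y

# Thread (2, 0) of Block (1, 1) handles element (7, 3) of the data.
print(global_index((1, 1), (2, 0)))  # -> (7, 3)
```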
14 CUDA
The host (CPU) performs the following tasks (see the sketch after this list):
1. Initializes GPU card(s)
2. Allocates memory on the host and on the device
3. Copies data from host to device memory
4. Launches instances of the kernel on the device(s)
5. Copies data from device memory to the host
6. Repeats steps 3-5 as needed
7. De-allocates all memory and terminates
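As a hedged illustration of steps 1-5, a minimal sketch using PyCUDA (the module itself is covered shortly); the kernel and array size are illustrative:

```python
# Minimal sketch of the host workflow: doubles a vector on the GPU.
import numpy as np
import pycuda.autoinit                       # 1. initialize the GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

a = np.random.randn(400).astype(np.float32)  # 2. allocate host memory...
a_gpu = cuda.mem_alloc(a.nbytes)             #    ...and device memory
cuda.memcpy_htod(a_gpu, a)                   # 3. copy host -> device

mod = SourceModule("""
__global__ void double_it(float *a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")
double_it(a_gpu, block=(400, 1, 1), grid=(1, 1))  # 4. launch the kernel

result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)              # 5. copy device -> host
# 6./7. repeat as needed; PyCUDA frees device memory when buffers are released
```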
15 “Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”
https://www.python.org/doc/essays/blurb/
16 Python (and the Need for Speed)
Interpreted, high-level languages can be too slow for high-performance work, so Python needs assistance for those tasks.
Keep the best of both worlds:
- Quick development and prototyping with Python
- The high processing power and speed of the GPU
17 Accelerating Python
Accelerated code may be pure Python or may also involve C code.
Focusing here on the following modules:
- PyCUDA
- Numba
- CuPy
- scikit-cuda
18 PyCUDA
A Python wrapper for the CUDA API
Gives C-like speed to Python with near-zero wrapping overhead
Requires C programming knowledge (for the kernels)
Compiles the CUDA code and copies it to the GPU
CUDA errors translated to Python exceptions
Easy installation (pip)
19 PyCUDA
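A minimal PyCUDA sketch in the spirit of the points above: gpuarray manages the transfers, ElementwiseKernel wraps a one-line C kernel body, and CUDA errors surface as Python exceptions. The function name and values are illustrative.

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# z = a*x + y, compiled from a C snippet at runtime.
lin_comb = ElementwiseKernel(
    "float a, float *x, float *y, float *z",  # C argument list
    "z[i] = a * x[i] + y[i]",                 # per-element C body
    "lin_comb")

x = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
z = gpuarray.empty_like(x)
lin_comb(np.float32(5), x, y, z)              # runs on the GPU
print(z.get()[:4])                            # copy the result back
```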
21 Numba
No need to write C-code
High-performance functions written in Python
On-the-fly code generation
Native code generation for the CPU and GPU
Integration with the Python scientific stack
Takes advantage of Python decorators
Code translation is done using the LLVM compiler
22 Numba
https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html
23 Numba
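A minimal Numba sketch of the decorator approach: @vectorize JIT-compiles a plain Python function into a CUDA kernel (the array size and target choice are illustrative).

```python
import numpy as np
from numba import vectorize

# Numba generates a GPU kernel from this scalar Python function.
@vectorize(['float32(float32, float32)'], target='cuda')
def add(a, b):
    return a + b

x = np.arange(1_000_000, dtype=np.float32)
print(add(x, x)[:5])  # executed on the GPU, returned as a NumPy array
```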
24 CuPy
An implementation of a NumPy-compatible multi-dimensional array on CUDA
Useful for performing matrix operations on GPUs
Provides easy ways to define three types of CUDA kernels:
- Elementwise kernels
- Reduction kernels
- Raw kernels
Also easy to install (pip)
25 CuPy
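A minimal CuPy sketch showing the NumPy-compatible API plus a custom elementwise kernel (the kernel name and values are illustrative):

```python
import cupy as cp

x = cp.arange(10, dtype=cp.float32)
print(cp.sum(x ** 2))                # NumPy-style call, runs on the GPU

# A custom elementwise kernel: inputs, output, per-element C body, name.
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',          # input parameters
    'float32 z',                     # output parameter
    'z = (x - y) * (x - y)',         # per-element body
    'squared_diff')
print(squared_diff(x, cp.float32(1)))
```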
26 CuPy
Array Size | NumPy [ms] | CuPy [ms]
-----------|------------|----------
10^4       |       0.03 |      0.58
10^5       |       0.20 |      0.97
10^6       |       2.00 |      1.84
10^7       |      55.55 |     12.48
10^8       |     517.17 |     84.73
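For context, a hypothetical sketch of how timings like these could be collected; the operation benchmarked is illustrative, and the explicit synchronize ensures the GPU work is actually measured:

```python
import time
import numpy as np
import cupy as cp

def bench(xp, n):
    a = xp.arange(n, dtype=xp.float32)
    start = time.perf_counter()
    b = a * 2.0 + 1.0                       # simple elementwise op
    if xp is cp:
        cp.cuda.Stream.null.synchronize()   # wait for the GPU to finish
    return (time.perf_counter() - start) * 1000.0  # milliseconds

for n in (10**4, 10**6, 10**8):
    print(n, bench(np, n), bench(cp, n))
```

As the table shows, small arrays favor NumPy because of transfer and launch overhead; the GPU wins as sizes grow.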
27 scikit-cuda
Motivated by the idea of enhancing PyCUDA
Exposes GPU-powered libraries
Tested on Linux (potentially works elsewhere)
Can be seen as “SciPy on GPU juice”
Presents low-level and high-level functions
28 scikit-cuda
Low-level functions:
- Wrap C functions via ctypes
- Catch errors and map them to Python exceptions
High-level functions:
- Take advantage of PyCUDA GPUArray to manipulate matrices in GPU memory
- Include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)
29 scikit-cuda
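A minimal scikit-cuda sketch: a high-level FFT operating on a PyCUDA GPUArray, checked against NumPy (sizes and tolerance are illustrative):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda import fft

x = np.random.randn(1024).astype(np.float32)
x_gpu = gpuarray.to_gpu(x)
xf_gpu = gpuarray.empty(1024 // 2 + 1, np.complex64)  # real-FFT output size

plan = fft.Plan(x.shape, np.float32, np.complex64)    # plan the transform
fft.fft(x_gpu, xf_gpu, plan)                          # run it on the GPU
print(np.allclose(xf_gpu.get(), np.fft.rfft(x), atol=1e-3))
```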
30 Summary
Many GPU-computing projects ported to Python are available
Keeps the simplicity of Python whilst adding GPU performance
Allows faster prototype development cycles
Approaches C performance, depending on the module (approach) chosen
Ranges from matrix operations and scientific programming to custom kernel creation
31 Summary Pages
32 PyCUDA Summary
PyCUDA:
- CUDA Python wrapper
- C code added directly in the Python project
- Supports all CUDA libraries
- Notable complexity due to kernels written in C
https://documen.tician.de/pycuda/
33 Numba Summary
Numba:
- Similar coverage to PyCUDA
- No C coding needed
- Takes advantage of LLVM and JIT compiling
- Missing: dynamic parallelism and texture memory
http://numba.pydata.org/doc.html
34 CuPy Summary
CuPy:
- Fully supports NumPy structures
- Performs the same operations at scale using the GPU
- Allows CPU/GPU-agnostic code creation
https://docs-cupy.chainer.org/en/stable
35 scikit-cuda Summary
scikit-cuda:
- Scientific computing using Python and GPUs
- Presents high-level and low-level functions
- Broad coverage of operations already available
- Depends on PyCUDA GPUArray mechanisms
http://scikit-cuda.readthedocs.io/en/latest/