Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA
Luciano Martins and Robert Sohigian, 2018-11-22

Agenda

- Introduction to Python
- GPU-Accelerated Computing
- NVIDIA® CUDA® technology
- Why Use Python with GPUs?
- Methods: PyCUDA, Numba, CuPy, and scikit-cuda
- Summary
- Q&A

Introduction to Python

Released by Guido van Rossum in 1991

The Zen of Python:
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.

Interpreted language (CPython and other implementations)

Dynamically typed; based on objects

Small core structure:
- ~30 keywords
- ~80 built-in functions

Indentation is syntactically significant

Binds to many different languages

Supports GPU acceleration via modules


GPU-Accelerated Computing

“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)

Most common GPU-accelerated operations:
- Large vector/matrix operations (Basic Linear Algebra Subprograms, BLAS)
- Speech recognition
- Computer vision


Important concepts for GPU-accelerated computing:

Host ― the machine running the workload (CPU)

Device ― the GPU(s) inside a host

Kernel ― the portion of code that runs on the GPU

SIMT ― Single Instruction, Multiple Threads


CUDA

Parallel computing platform and programming model developed by NVIDIA:
- Stands for Compute Unified Device Architecture
- Based on C/C++ with some extensions
- Fairly short learning curve for those with OpenMP and MPI programming experience

CUDA on a system has three components:
- Driver (software that controls the graphics card)
- Toolkit (nvcc, several libraries, debugging tools)
- SDK (examples and error-checking utilities)


- A kernel is executed as a grid of thread blocks
- All threads within a block share a portion of data memory
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution to provide hazard-free common memory accesses
  - Efficiently sharing data through low-latency shared memory
- Multiple blocks are combined to form a grid
- All blocks in a grid contain the same number of threads (see the launch-sizing sketch below)
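As a concrete illustration (not from the original slides), a one-dimensional launch is typically sized by rounding the block count up so the grid covers every element:

```python
n = 1_000_000            # total elements to process (illustrative)
threads_per_block = 256  # a common block size

# Round up so blocks * threads covers all n elements; the kernel
# then guards against the few excess threads with an index check.
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block

print(blocks_per_grid)   # 3907 blocks * 256 threads = 1,000,192 threads
```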


[Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of thread blocks on the device; Kernel 2 is launched on Grid 2. Block (1,1) is expanded to show its 5×3 arrangement of threads.]


The host (CPU) performs the following tasks:

1. Initializes GPU card(s)
2. Allocates memory on host and device
3. Copies data from host to device memory
4. Launches instances of the kernel on device(s)
5. Copies data from device memory to host
6. Repeats steps 3-5 as needed
7. Deallocates all memory and terminates

A sketch of these steps appears below.
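A minimal sketch of these steps with PyCUDA (introduced later in this deck); the kernel and names are illustrative, not the presenters' code:

```python
import numpy as np
import pycuda.autoinit                  # step 1: initialize the GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Kernel: each thread doubles one element (CUDA C inside a Python string)
mod = SourceModule("""
__global__ void double_it(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) a[i] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")

n = 1024
a = np.random.randn(n).astype(np.float32)   # step 2: allocate host memory
a_gpu = drv.mem_alloc(a.nbytes)             # step 2: allocate device memory
drv.memcpy_htod(a_gpu, a)                   # step 3: copy host -> device

double_it(a_gpu, np.int32(n),               # step 4: launch the kernel
          block=(256, 1, 1), grid=(n // 256, 1))

drv.memcpy_dtoh(a, a_gpu)                   # step 5: copy device -> host
a_gpu.free()                                # step 7: deallocate device memory
```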

“Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”

https://www.python.org/doc/essays/blurb/

Python (and the Need for Speed)

Interpreted, high-level languages can be too slow for high-performance workloads, so Python needs assistance for those tasks.

Keep the best of both worlds:
- Quick development and prototyping with Python
- The high processing power and speed of the GPU

Accelerating Python

Accelerated code may be pure Python or may also involve C code.

Focusing here on the following modules:
- PyCUDA
- Numba
- CuPy
- scikit-cuda

PyCUDA

A Python wrapper to the CUDA API

Gives speed to Python – near-zero wrapping overhead

Requires C programming knowledge (kernels are written in CUDA C)

Compiles the CUDA code and copies it to the GPU

CUDA errors translated to Python exceptions

Easy installation (pip)
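A minimal sketch of this near-zero-wrapping style (illustrative, not the presenters' code), using pycuda.gpuarray to move NumPy arrays to the GPU and back:

```python
import numpy as np
import pycuda.autoinit              # set up the first available GPU
import pycuda.gpuarray as gpuarray

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)          # host -> device copy
b_gpu = 2 * a_gpu                   # elementwise op executed on the GPU
print(b_gpu.get())                  # device -> host copy
```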


Numba

No need to write C code

High-performance functions written in Python

On-the-fly code generation

Native code generation for the CPU and GPU

Integration with the Python scientific stack

Takes advantage of Python decorators

Code translation is done using the LLVM compiler (see the sketch below)
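A minimal sketch of the decorator-driven approach (illustrative, not the presenters' code), assuming numba.vectorize with the CUDA target:

```python
import numpy as np
from numba import vectorize

# The decorator JIT-compiles this scalar function through LLVM
# into a CUDA kernel; no C code is written by hand.
@vectorize(['float32(float32, float32)'], target='cuda')
def add(x, y):
    return x + y

a = np.arange(10, dtype=np.float32)
print(add(a, a))   # transfers and kernel launch happen implicitly
```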

Benchmark example (Black-Scholes option pricing):
https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html


CuPy

An implementation of NumPy-compatible multi-dimensional arrays on CUDA

Useful for performing matrix operations on GPUs

Provides easy ways to define three types of CUDA kernels (see the sketch below):
- Elementwise kernels
- Reduction kernels
- Raw kernels

Also easy to install (pip)
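A minimal sketch (illustrative names, not the presenters' code) showing NumPy-style operations plus a custom elementwise kernel:

```python
import cupy as cp

# NumPy-compatible arrays and operations that execute on the GPU
x = cp.arange(6, dtype=cp.float32).reshape(2, 3)
print(cp.asnumpy(x.sum(axis=1)))       # reduction runs on the device

# One of the three kernel types: a custom elementwise kernel
squared_diff = cp.ElementwiseKernel(
    'float32 a, float32 b',            # input parameters
    'float32 c',                       # output parameter
    'c = (a - b) * (a - b)',           # CUDA C body applied per element
    'squared_diff')                    # kernel name
print(squared_diff(x, cp.float32(1)))
```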


NumPy vs. CuPy timing comparison:

Array Size | NumPy [ms] | CuPy [ms]
10^4       |       0.03 |      0.58
10^5       |       0.20 |      0.97
10^6       |       2.00 |      1.84
10^7       |      55.55 |     12.48
10^8       |     517.17 |     84.73

scikit-cuda

Motivated by the idea of enhancing PyCUDA

Exposes GPU-powered libraries

Tested on Linux (potentially works elsewhere)

Can be seen as “SciPy on GPU juice”

Presents low-level and high-level functions


Low-level functions:
- Wrap C functions via ctypes
- Catch errors and map them to Python exceptions

High-level functions:
- Take advantage of PyCUDA's GPUArray to manipulate matrices in GPU memory
- Include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)

A sketch of the high-level interface appears below.
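A minimal sketch of the high-level interface (illustrative, following the usage pattern documented for skcuda.linalg):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

linalg.init()                          # initialize the underlying GPU libraries

a = np.asarray(np.random.rand(4, 4), np.float32)
a_gpu = gpuarray.to_gpu(a)             # matrices live in GPU memory (GPUArray)
b_gpu = gpuarray.to_gpu(a)

c_gpu = linalg.dot(a_gpu, b_gpu)       # matrix product computed on the GPU
print(np.allclose(np.dot(a, a), c_gpu.get()))
```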


Summary

Many GPU-computing projects ported to Python are available

Keeps the simplicity of Python whilst adding GPU performance

Allows faster prototype development cycles

Approaches C-level performance, depending on the module (approach) chosen

Spans matrix operations and scientific programming through to custom kernel creation

Summary Pages

PyCUDA Summary

PyCUDA:
- CUDA Python wrapper
- C code added directly in the Python project
- Supports all CUDA libraries
- Notable complexity, since kernels are written in C
- https://documen.tician.de/pycuda/

Numba Summary

Numba:
- Similar coverage to PyCUDA
- No C coding needed
- Takes advantage of LLVM and JIT compilation
- Missing: dynamic parallelism and texture memory
- http://numba.pydata.org/doc.html

CuPy Summary

CuPy:
- Fully supports NumPy structures
- Performs the same operations at scale using the GPU
- Allows CPU/GPU-agnostic code creation
- https://docs-cupy.chainer.org/en/stable

scikit-cuda Summary

scikit-cuda:
- Scientific computing using Python and GPUs
- Provides high-level and low-level functions
- Broad coverage of operations already available
- Depends on PyCUDA's GPUArray mechanisms
- http://scikit-cuda.readthedocs.io/en/latest/
