GPU Computing with Apache Spark and Python

GPU Computing with Apache Spark and Python Stan Seibert Siu Kwan Lam April 5, 2016 © 2015 Continuum Analytics- Confidential & Proprietary My Background • Trained in particle physics • Using Python for data analysis for 10 years • Using GPUs for data analysis for 7 years • Currently lead the High Performance Python team at Continuum 2 About Continuum Analytics • We give superpowers to people who change the world! • We want to help everyone analyze their data with Python (and other tools), so we offer: • Enterprise Products • Consulting • Training • Open Source 3 • I’m going to use Anaconda throughout this presentation. • Anaconda is a free Mac/Win/Linux Python distribution: • Based on conda, an open source package manager • Installs both Python and non-Python dependencies • Easiest way to get the software I will talk about today • https://www.continuum.io/downloads 4 Overview 1. Why Python? 2. Numba: A Python JIT compiler for the CPU and GPU 3. PySpark: Distributed computing for Python 4. Example: Image Registration 5. Tips and Tricks 6. Conclusion WHY PYTHON? 6 Why is Python so popular? • Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists • Great community: • Lots of tutorial and reference materials • Easy to interface with other languages • Vast ecosystem of useful libraries 7 8 … But, isn’t Python slow? • Pure, interpreted Python is slow. • Python excels at interfacing with other languages used in HPC: C: ctypes, CFFI, Cython C++: Cython, Boost.Python FORTRAN: f2py • Secret: Most scientific Python packages put the speed critical sections of their algorithms in a compiled language. 9 Is there another way? • Switching languages for speed in your projects can be a little clunky: • Sometimes tedious boilerplate for translating data types across the language barrier • Generating compiled functions for the wide range of data types can be difficult • How can we use cutting edge hardware, like GPUs? 10 NUMBA: A PYTHON JIT COMPILER 11 Compiling Python • Numba is an open-source, type-specializing compiler for Python functions • Can translate Python syntax into machine code if all type information can be deduced when the function is called. • Implemented as a module. Does not replace the Python interpreter! • Code generation done with: • LLVM (for CPU) • NVVM (for CUDA GPUs). 12 Supported Platforms OS HW SW • Windows (7 and later) • 32 and 64-bit x86 CPUs • Python 2 and 3 • OS X (10.9 and later) • CUDA-capable NVIDIA GPUs • NumPy 1.7 through 1.10 • Linux (~RHEL 5 and • HSA-capable AMD GPUs later) • Experimental support for ARMv7 (Raspberry Pi 2) 13 How Does Numba Work? @jit def do_math(a, b): … >>> do_math(x, y) Python Function Functions (bytecode) Arguments Type Rewrite IR Inference Bytecode Numba IR Analysis Lowering Cache Execute! Machine LLVM/NVVM JIT LLVM IR Code 14 Numba on the CPU 15 Numba on the CPU Numba decorator (nopython=True not required) Array Allocation Looping over ndarray x as an iterator Using numpy math functions Returning a slice of the array 2.7x speedup! 16 CUDA Kernels in Python 17 CUDA Kernels in Python Decorator will infer type signature when you call it Helper function to compute blockIdx.x * blockDim.x + threadIdx.x NumPy arrays have expected attributes and indexing Helper function to compute blockDim.x * gridDim.x 18 Calling the Kernel from Python Works just like CUDA C, except Numba handles allocating and copying data to/from the host if needed 19 Handling Device Memory Directly Memory allocation matters in small tasks. 20 Higher Level Tools: GPU Ufuncs 21 Higher Level Tools: GPU Ufuncs Decorator for creating ufunc List of supported type signatures Code generation target 22 GPU Ufuncs Performance 4x speedup incl. host<->device round trip on GeForce GT 650M 23 Accelerate Library Bindings: cuFFT MKL accelerated FFT >2x speedup incl. host<->device round trip on GeForce GT 650M 24 PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON 25 What is Apache Spark? • An API and an execution engine for distributed computing on a cluster • Based on the concept of Resilient Distributed Datasets (RDDs) • Dataset: Collection of independent elements (files, objects, etc) in memory from previous calculations, or originating from some data store • Distributed: Elements in RDDs are grouped into partitions and may be stored on different nodes • Resilient: RDDs remember how they were created, so if a node goes down, Spark can recompute the lost elements on another node 26 Computation DAGs Fig from: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html 27 How does Spark Scale? • All cluster scaling is about minimizing I/O. Spark does this in several ways: • Keep intermediate results in memory with rdd.cache() • Move computation to the data whenever possible (functions are small and data is big!) • Provide computation primitives that expose parallelism and minimize communication between workers: map, filter, sample, reduce, … 28 Python and Spark • Spark is implemented in Java & Scala on the JVM • Full API support for Scala, Java, and Python (+ limited support for R) • How does Python work, since it doesn’t run on the JVM? (not counting IronPython) 29 Spark Py4J Python Worker Interpreter Java Python Spark Py4J Spark Context Context Spark Py4J Python Python Java Worker Interpreter Java Python Client Cluster 30 Getting Started • conda can create a local environment for Spark for you: conda create -n spark -c anaconda-cluster python=3.5 spark numba \ cudatoolkit ipython-notebook source activate spark #Uncomment below if on Mac OS X #export JRE_HOME=$(/usr/libexec/java_home) #export JAVA_HOME=$(/usr/libexec/java_home) IPYTHON_OPTS="notebook" pyspark # starts jupyter notebook 31 Using Numba (CPU) with Spark 32 Using Numba (CPU) with Spark Load arrays onto the Spark Workers Apply my function to every element in the RDD and return first element 33 Using CUDA Python with Spark Define CUDA kernel Compilation happens here Wrap CUDA kernel launching logic Creates Spark RDD (8 partitions) Apply gpu_work on each partition 34 recompile if CUDA arch is different LLVM IR LLVM IR PTX CUDA Python Compiles Serialize Deserialize Finalize CUDA Kernel Binary PTX PTX use serialized PTX if CUDA arch matches client Client Cluster 35 LLVM IR LLVM IR PTX CUDA Python Compiles Serialize Deserialize Finalize CUDA Kernel Binary PTX PTX This happens on every worker process Client Cluster 36 EXAMPLE: IMAGE REGISTRATION 37 Basic Algorithm image set Group similar images (unsupervised kNN clustering) attempt image registration on every pair in each group (phase correlation based; FFT heavy) unused images new images Progress? 38 Basic Algorithm • Reduce number of pairwise image set image registration attempt • Run on CPU Group similar images (unsupervised kNN clustering) attempt image registration on every pair in each group • FFT 2D heavy (phase correlation based; FFT heavy) • Expensive • Mostly running on GPU unused images new images Progress? 39 Cross Power Spectrum Core of phase correlation based image registration algorithm 40 Cross Power Spectrum Core of phase correlation based image registration algorithm cuFFT explicit memory transfer 41 Scaling on Spark Image set • Machines have multiple GPUs • Each worker computes a partition at a time Random partitioning rdd.repartition(num_parts) partition partition partition Apply ImgReg on each Partition i i i rdd.mapPartitions(imgreg_func) … u n u n u n … Progress? 42 CUDA Multi-Process Service (MPS) • Sharing one GPU between multiple workers can be beneficial • nvidia-cuda-mps-control • Better GPU utilization from multiple processes • For our example app: 5-10% improvement with MPS • Effect can be bigger for app with higher GPU utilization 43 Scaling with Multiple GPUs and MPS • 1 CPU worker: 1 Tesla K20 • 2 CPU worker: 2 Tesla K20 • 4 CPU worker: 2 Tesla K20 (2 workers per GPU using MPS) • Spark + Dask = External process performing GPU calculation 44 Spark vs CUDA • Spark logic is similar to CUDA host logic • Partitions are like CUDA blocks • mapPartitions is like kernel launch • Work cooperatively within a partition • Spark network transfer overhead vs PCI-express transfer overhead Image set Random partitioning partition partition partition rdd.repartition(num_parts i i i … Apply ImgReg on each Partition u n u n u n … rdd.mapPartitions(imgreg_func) Progress? 45 TIPS AND TRICKS 46 Parallelism is about communication • Be aware of the communication overhead when deciding how to chunk your work in Spark. • Bigger chunks: Risk of underutilization of resources • Smaller chunks: Risk of computation swamped by overhead ➡ Start with big chunks, then move to smaller ones if you fail to reach full utilization. 47 Start Small • Easier to debug your logic on your local system! • Run a Spark cluster on your workstation where it is easier to debug and profile. • Move to the cluster only after you’ve seen things work locally. (Spark makes this fairly painless.) 48 Be interactive with Big Data! • Run Jupyter notebook on Spark + MultiGPU cluster • Live experimentation helps you develop and understand your progress • Spark keeps track of intermediate data and recomputes them if they are lost 49 Amortize Setup Costs • Make sure that one-time costs are actually done once: • GPU state (like FFT plans) can be slow to initialize. Do this at the top of your mapPartition call and reuse for each element in the RDD partition. • Coarser chunking of tasks allows GPU memory allocations to be reused between steps inside a single task. 50 Be(a)ware

GPU Computing with Apache Spark and Python

Python on Gpus (Work in Progress!)

Introduction Shrinkage Factor Reference

Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA Luciano Martins and Robert Sohigian, 2018-11-22 Introduction to Python

Tangent: Automatic Differentiation Using Source-Code Transformation for Dynamically Typed Array Programming

Python Guide Documentation 0.0.1

Hyperlearn Documentation Release 1

Zmcintegral: a Package for Multi-Dimensional Monte Carlo Integration on Multi-Gpus

Python Guide Documentation Publicación 0.0.1

Aayush Mandhyan

Tensorflow, Pytorch and Horovod

Numba: a Compiler for Python Functions

Or How to Avoid Recoding Your (Entire) Application in Another Language Before You Start