RAPIDS
FOSDEM’19, Dr. Christoph Angerer, Manager AI Developer Technologies, NVIDIA

HPC & AI TRANSFORMS INDUSTRIES
Computational & Data Scientists Are Driving Change

Healthcare | Industrial | Consumer Internet | Automotive | Ad Tech / MarTech | Retail | Financial / Insurance

DATA SCIENCE IS NOT A LINEAR PROCESS
It Requires Exploration and Iterations

Manage Data → Training → Evaluate → Deploy

[Workflow diagram: All Data → ETL → Structured Data Store → Data Preparation → Model Training → Visualization → Inference]

Iterate … Cross Validate … Grid Search … Iterate some more.

Accelerating model training alone has benefits but doesn’t address the whole problem.

A DAY IN THE LIFE
Or: Why did I want to become a Data Scientist?

Data scientists are valued resources. Why not give them an environment that lets them configure ETL workflows and be more productive?

PERFORMANCE AND DATA GROWTH

Post-Moore's law

Data sizes continue to grow

Moore's law is no longer a reliable predictor of CPU capacity growth

Distributing CPUs exacerbates the problem

TRADITIONAL DATA SCIENCE CLUSTER

Workload Profile: Fannie Mae Mortgage Data
• 192 GB data set
• 16 years, 68 quarters
• 34.7 million single-family mortgage loans
• 1.85 billion performance records
• XGBoost training set: 50 features

300 Servers | $3M | 180 kW

GPU-ACCELERATED MACHINE LEARNING CLUSTER
NVIDIA Data Science Platform with DGX-2
1 DGX-2 | 10 kW
1/8 the Cost | 1/15 the Space | 1/18 the Power

[Chart: end-to-end runtime, scale 0–10,000, comparing 20, 30, 50, and 100 CPU nodes against one DGX-2 and 5x DGX-1]

DELIVERING DATA SCIENCE VALUE

Maximized Productivity | Top Model Accuracy | Lowest TCO

Oak Ridge National Labs: 215x speedup using RAPIDS with XGBoost
Global Retail Giant: $1B potential saving with 4% error-rate reduction
Streaming Media Company: $1.5M infrastructure cost saving

DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-End GPU-accelerated Workflow Built on CUDA


DATA PREPARATION
• GPU-accelerated compute for in-memory data preparation
• Simplified implementation using familiar data science tools
• Python drop-in Pandas replacement built on CUDA C++
• GPU-accelerated Spark (in development)
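To make the drop-in claim concrete, a minimal sketch of the Pandas-style API; the file and column names are hypothetical:

```python
import cudf

# Read a CSV straight into GPU memory; mirrors pandas.read_csv
gdf = cudf.read_csv("mortgages.csv")  # hypothetical file

# Familiar pandas-style filtering and groupby, executed on the GPU
late = gdf[gdf["days_late"] > 90]
summary = late.groupby("state")["loan_amount"].mean()

# Convert back to pandas only at the edges, e.g. for plotting
print(summary.to_pandas().head())
```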

DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-End GPU-accelerated Workflow Built on CUDA


MODEL TRAINING
• GPU acceleration of today’s most popular ML algorithms
• XGBoost, PCA, K-means, k-NN, DBSCAN, tSVD, …
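A minimal sketch of GPU-accelerated training with cuML's scikit-learn-style estimators; the import path and input file are assumptions that vary by release:

```python
import cudf
from cuml.cluster import KMeans  # import path may differ across cuML releases

# Hypothetical feature table, already resident in GPU memory
features = cudf.read_csv("features.csv")

# Same estimator pattern as scikit-learn, but fit/predict run on the GPU
kmeans = KMeans(n_clusters=8, random_state=0)
kmeans.fit(features)
labels = kmeans.predict(features)
```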

DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, End-to-End GPU-accelerated Workflow Built on CUDA


VISUALIZATION
• Effortless exploration of datasets: billions of records in milliseconds
• Dynamic interaction with data = faster ML model development
• Data visualization ecosystem (Graphistry & OmniSci), integrated with RAPIDS

THE EFFECTS OF END-TO-END ACCELERATION
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from Disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

Spark In-Memory Processing (25–100x improvement; less code; language flexible; primarily in-memory):
HDFS Read → Query → ETL → ML Train

GPU/Spark In-Memory Processing (5–10x improvement; more code; language rigid; substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train

RAPIDS (50–100x improvement; same code; language flexible; primarily on GPU):
Arrow Read → Query → ETL → ML Train

Yes, GPUs are fast, but …

ADDRESSING CHALLENGES IN GPU-ACCELERATED DATA SCIENCE
• Too much data movement
• Too many makeshift data formats
• Writing CUDA C/C++ is involved
• No Python API for data manipulation

DATA MOVEMENT AND TRANSFORMATION
The bane of productivity and performance

[Diagram: APP A loads data; every handoff between CPU, GPU, APP A, and APP B requires a copy-and-convert step before APP B can read the data]

DATA MOVEMENT AND TRANSFORMATION
What if we could keep data on the GPU?

[Diagram: the same pipeline with the copy-and-convert steps eliminated by keeping data on the GPU]

LEARNING FROM APACHE ARROW

From the Apache Arrow home page: https://arrow.apache.org/

CUDA DATA FRAMES IN PYTHON
GPUs at your Fingertips

Illustrations from https://changhsinlee.com/pyspark-dataframe-basics/

RAPIDS
Open GPU Data Science

RAPIDS
Open GPU Data Science

Stack: APPLICATIONS → ALGORITHMS → SYSTEMS → CUDA → ARCHITECTURE

• Learn what the data science community needs
• Use best practices and standards
• Build scalable systems and algorithms
• Test applications and workflows
• Iterate

RAPIDS COMPONENTS

Data Preparation Model Training Visualization

cuDF (Analytics) | cuML (Machine Learning) | cuGraph (Graph Analytics) | PyTorch & Chainer (Deep Learning) | Kepler.GL (Visualization)

GPU Memory

DASK

CUML & CUGRAPH

Data Preparation Model Training Visualization

cuDF (Analytics) | cuML (Machine Learning) | cuGraph (Graph Analytics) | PyTorch & Chainer (Deep Learning) | Kepler.GL (Visualization)

GPU Memory

DASK

AI LIBRARIES
cuML & cuGraph

Machine Learning | Graph Analytics

Machine Learning algorithms: XGBoost, Decision Trees, Random Forests, Linear Regression, Logistic Regression, K-Means, K-Nearest Neighbor, DBSCAN, Kalman Filtering, Principal Components, Singular Value Decomposition, Bayesian Inferencing, and time series (ARIMA, Holt-Winters)

Graph Analytics algorithms: PageRank, BFS, Jaccard Similarity, Single Source Shortest Path, Triangle Counting, Louvain Modularity

Machine learning is fundamental to prediction, classification, clustering, anomaly detection, and recommendations. Graph analytics is fundamental to network analysis. Both can be accelerated with NVIDIA GPUs, accelerating more of the AI ecosystem.

Benchmarks: XGBoost on the mortgage dataset, 90x speedup (3 hours to 2 minutes on 1 DGX-1); 8x V100 is 20–90x faster than a dual-socket CPU.
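A hedged sketch of one of the listed graph algorithms (PageRank) with cuGraph; the exact graph-construction calls have varied across releases, and the edge list is made up:

```python
import cudf
import cugraph

# Hypothetical edge list stored as a cuDF DataFrame
edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})

# Build the graph on the GPU and run PageRank
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)  # cuDF DataFrame with vertex and pagerank columns
print(scores.sort_values("pagerank", ascending=False).head())
```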

CUDF + XGBOOST
DGX-2 vs. Scale-Out CPU Cluster

• Full end-to-end pipeline
• Leveraging Dask + PyGDF
• Store each GPU’s results in system memory, then read back in
• Arrow to DMatrix (CSR) for XGBoost
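A rough sketch of the training step on a single node; the file, label column, and direct GPU-data DMatrix construction are assumptions that depend on the XGBoost build (older builds require converting to host arrays first):

```python
import cudf
import xgboost as xgb

# Hypothetical training table; "label" is a made-up column name
gdf = cudf.read_csv("train.csv")
X, y = gdf.drop(columns=["label"]), gdf["label"]

# Recent XGBoost builds can construct a DMatrix from GPU data directly
dtrain = xgb.DMatrix(X, label=y)

params = {"tree_method": "gpu_hist", "objective": "reg:squarederror"}
booster = xgb.train(params, dtrain, num_boost_round=100)
```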

CUDF + XGBOOST
Scale-Out GPU Cluster vs. DGX-2


• Full end-to-end pipeline
• Leveraging Dask for multi-node + PyGDF
• Store each GPU’s results in system memory, then read back in
• Arrow to DMatrix (CSR) for XGBoost

[Chart: runtime in seconds (0–350) for DGX-2 vs. 5x DGX-1, split into ETL+CSV, ML Prep, and ML stages]

CUML
Benchmarks of Initial Algorithms

NEAR FUTURE WORK ON CUML
Additional algorithms in development right now

• K-means – Released
• K-NN – Released
• Kalman filter – v0.5
• GLM – v0.5
• Random Forests – v0.6
• ARIMA – v0.6
• UMAP – v0.6
• Collaborative filtering – Q2 2019

CUGRAPH
GPU-Accelerated Graph Analytics Library

Coming Soon: Full NVGraph Integration Q1 2019

CUDF

Data Preparation Model Training Visualization

cuDF (Analytics) | cuML (Machine Learning) | cuGraph (Graph Analytics) | PyTorch & Chainer (Deep Learning) | Kepler.GL (Visualization)

GPU Memory

DASK

CUDF
GPU DataFrame Library

• Apache Arrow data format
• Pandas-like API
• Unary and binary operations
• Joins / merges
• GroupBys
• Filters
• User-Defined Functions (UDFs)
• Accelerated file readers
• Etc.
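A short sketch of several of these operations with cuDF; the example tables are made up, and later releases expose the element-wise UDF path under different names:

```python
import cudf

# Hypothetical example tables
left = cudf.DataFrame({"key": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
right = cudf.DataFrame({"key": [1, 2, 4], "y": [10, 20, 40]})

# Join / merge with the familiar pandas signature
joined = left.merge(right, on="key", how="inner")

# GroupBy aggregation on the GPU
agg = joined.groupby("key").agg({"x": "sum", "y": "max"})

# Element-wise UDF, JIT-compiled to a GPU kernel under the hood
doubled = joined["x"].applymap(lambda v: v * 2.0)
```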

CUDF TODAY

CUDA C++:
• Low-level library containing function implementations and a C/C++ API
• Importing/exporting Apache Arrow using the CUDA IPC mechanism
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, and reduction operations on GPU DataFrames

Python Bindings:
• A Python library for manipulating GPU DataFrames
• Python interface to the CUDA C++ layer with additional functionality
• Creating Apache Arrow from arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions (UDFs) using Numba (see the sketch below)
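As a hedged illustration of the Numba-JIT UDF path, a sketch using cuDF's apply_rows, which compiles the Python function body into a GPU kernel; the data and weights are made up:

```python
import numpy as np
import cudf

df = cudf.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Row-wise UDF: cuDF JIT-compiles this function with Numba into a GPU kernel
def weighted_sum(a, b, out):
    for i, (x, y) in enumerate(zip(a, b)):
        out[i] = 0.7 * x + 0.3 * y

result = df.apply_rows(weighted_sum,
                       incols=["a", "b"],
                       outcols={"out": np.float64},
                       kwargs={})
print(result["out"])
```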

CUSTRING & NVSTRING
GPU-Accelerated String Functions with a Pandas-like API

• API and functionality follow Pandas: https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
• lower(): ~22x speedup
• find(): ~40x speedup
• slice(): ~100x speedup

[Chart: runtime in milliseconds (0–800) of lower(), find(#), and slice(1,15), Pandas vs. cudastrings]
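A minimal sketch assuming the standalone nvstrings module of that era; the example strings are made up:

```python
import nvstrings  # standalone GPU strings library of the cuDF 0.x era

# Copy host strings into device memory
strs = nvstrings.to_device(["RAPIDS", "cuDF", "FOSDEM 2019"])

# Pandas-like string operations, executed on the GPU
lowered = strs.lower()       # the slide quotes ~22x over Pandas
positions = strs.find("D")   # per-string substring search (~40x)
pieces = strs.slice(1, 15)   # per-string substring extraction (~100x)
```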

DASK

Data Preparation Model Training Visualization

cuDF (Analytics) | cuML (Machine Learning) | cuGraph (Graph Analytics) | PyTorch & Chainer (Deep Learning) | Kepler.GL (Visualization)

GPU Memory

DASK

DASK
What is Dask, and why does RAPIDS use it for scaling out?

• Dask is a distributed computation scheduler built to scale Python workloads from laptops to supercomputer clusters.

• Extremely modular: scheduling, compute, data transfer, and out-of-core handling are all decoupled, allowing us to plug in our own implementations.

• Can easily run multiple Dask workers per node, allowing a development model of one worker per GPU regardless of single-node or multi-node environment.

DASK
Scale up and out with cuDF

• Use cuDF primitives underneath in map-reduce-style operations with the same high-level API (see the sketch after this list)

• Instead of using typical Dask data movement of pickling objects and sending via TCP sockets, take advantage of hardware advancements using a communications framework called OpenUCX:

• For intranode data movement, utilize NVLink and PCIe peer-to-peer communications

• For internode data movement, utilize GPU RDMA over InfiniBand and RoCE

https://github.com/rapidsai/dask_gdf
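A sketch of the scale-out API, assuming the dask_cudf package (successor to the dask_gdf repository above) and a hypothetical shard pattern:

```python
import dask_cudf  # formerly dask_gdf

# Partitioned GPU DataFrame: each partition is a cuDF DataFrame on one worker
ddf = dask_cudf.read_csv("mortgage_perf_*.csv")  # hypothetical shard pattern

# Same high-level API as cuDF; executes as map-reduce over partitions
means = ddf.groupby("loan_id")["balance"].mean().compute()
```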

DASK
Scale up and out with cuML

• Native integration with Dask + cuDF

• Can easily use Dask workers to initialize NCCL for optimized gather/scatter operations
• Example: this is how the dask-xgboost included in the container works for multi-GPU and multi-node, multi-GPU training (a sketch follows below)

• Provides easy-to-use, high-level primitives for synchronization of workers, which is needed for many ML algorithms
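A hedged sketch of multi-GPU training via the dask-xgboost package; the scheduler address, file pattern, and label column are hypothetical, and GPU-DataFrame input support depends on the build:

```python
from dask.distributed import Client
import dask_cudf
import dask_xgboost  # the dask-xgboost package referenced above

# Hypothetical scheduler address; assumes one Dask worker per GPU
client = Client("tcp://scheduler:8786")

ddf = dask_cudf.read_csv("mortgage_perf_*.csv")
X, y = ddf.drop(columns=["default"]), ddf["default"]

params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}

# Trains one XGBoost worker per Dask worker, synchronized across nodes
booster = dask_xgboost.train(client, params, X, y, num_boost_round=100)
```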

LOOKING TO THE FUTURE

GPU DATAFRAME
Next few months

• Continue improving performance and functionality

• Single GPU

• Single node, multi GPU

• Multi node, multi GPU

• String Support

• Support for specific “string” dtype with GPU-accelerated functionality similar to Pandas

• Accelerated Data Loading

• File formats: CSV, Parquet, ORC – to start

ACCELERATED DATA LOADING
CPUs bottleneck data loading in high-throughput systems

• CSV Reader
• Follows API of pandas.read_csv (see the sketch after this list)
• Current implementation is a >10x speed improvement over pandas
• Parquet Reader
• Work in progress: https://github.com/gpuopenanalytics/libgdf/pull/85
• Will follow API of pandas.read_parquet
• ORC Reader

• Additionally looking towards GPU-accelerating decompression for common compression schemes
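As a sketch of the API parity with pandas.read_csv (file and column names hypothetical; supported keywords vary by release):

```python
import cudf

# Mirrors the pandas.read_csv call shape; parsing runs on the GPU
gdf = cudf.read_csv(
    "performance.csv",                      # hypothetical file
    usecols=["loan_id", "current_balance"],
    dtype={"loan_id": "int64", "current_balance": "float64"},
)
print(gdf.dtypes)
```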

Source: Apache Crail blog, “SQL Performance: Part 1 – Input File Formats”

PYTHON CUDA ARRAY INTERFACE
Interoperability for Python GPU Array Libraries

• The CUDA array interface is a standard format that describes a GPU array to allow sharing GPU arrays between different libraries without needing to copy or convert data
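A small sketch of the interface in action, sharing one buffer between Numba and CuPy without a copy:

```python
import numpy as np
from numba import cuda
import cupy  # CuPy >= 5.0 understands the interface, per the links below

# Allocate an array on the GPU with Numba
d_arr = cuda.to_device(np.arange(10, dtype=np.float32))

# The standard attribute exposes pointer, shape, and typestr of the buffer
print(d_arr.__cuda_array_interface__)

# CuPy wraps the same memory without copying or converting
c_arr = cupy.asarray(d_arr)
c_arr *= 2  # the change is visible through the Numba handle too
```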

• Numba, CuPy, and PyTorch are the first libraries to adopt the interface:
• https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
• https://github.com/cupy/cupy/releases/tag/v5.0.0b4
• https://github.com/pytorch/pytorch/pull/11984

CONCLUDING REMARKS

A DAY IN THE LIFE
Or: Why did I want to become a Data Scientist?

A DAY IN THE LIFE
Or: Why did I want to become a Data Scientist?
A: For the Data Science. And coffee.

ONE ARCHITECTURE FOR HPC AND DATA SCIENCE

Simulation | Data Analytics | Visualization

RAPIDS
How do I get the software?

• https://github.com/rapidsai
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://anaconda.org/rapidsai/
• https://hub.docker.com/r/rapidsai/rapidsai/
• WIP:
• https://pypi.org/project/cudf
• https://pypi.org/project/cuml

JOIN THE MOVEMENT
Everyone Can Help!

APACHE ARROW: https://arrow.apache.org/ | @ApacheArrow
RAPIDS: https://rapids.ai | @RAPIDSAI
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ | @GPUOAI

Integrations, feedback, documentation support, pull requests, new issues, or code donations welcomed!

THANK YOU