From inception to insight: Accelerating AI productivity with GPUs
John Zedlewski, Director, RAPIDS Machine Learning @ NVIDIA
Ramesh Radhakrishnan, Technologist @ Server OCTO, Dell EMC

Data sizes continue to grow; prototyping and production diverge

Workstation (Python, R): fast iteration and low latency, but only small data subsets fit, and small subsets make it hard to build realistic models.
Large-scale cluster (Spark, Hadoop): high throughput on the full data, but two challenges: a "tools gap" (Python or R code must be rewritten as Spark/Hadoop jobs to scale to the cluster) and high cluster latency that slows iteration.
RAPIDS on GPU (RAPIDS + Dask): consistent tools on workstation or cluster, with high throughput and fast, low-latency iteration on the full data or large subsets.
Data Processing Evolution
Faster data access, less data movement

Hadoop processing, reading from disk: each stage round-trips through HDFS (Read, Query, Write; Read, ETL, Write; Read, ML Train).
Spark in-memory processing: 25-100x improvement, less code, language flexible; a single HDFS Read, then Query, ETL, and ML Train run primarily in memory.
Traditional GPU processing: 5-10x improvement, but more code and a rigid language choice; after the HDFS Read, data is copied back and forth between CPU and GPU around Query, ETL, and ML Train, which run substantially on GPU.
Data Movement and Transformation
The bane of productivity and performance

(Diagram: APP A loads data, and every hand-off, from APP A to APP B and between CPU and GPU, requires a "copy & convert" step before the next application can read the data.)
Data Movement and Transformation
What if we could keep data on the GPU?

(The same diagram: each "copy & convert" step between APP A, CPU, GPU, and APP B is exactly what keeping data resident on the GPU would eliminate.)
Data Processing Evolution (continued)
Faster data access, less data movement

The same progression (Hadoop reading from disk; Spark in-memory at 25-100x; traditional GPU processing at 5-10x) with a fourth stage added:
RAPIDS: 50-100x improvement, same code, language flexible; an Arrow-based Read, then Query, ETL, and ML Train run primarily on GPU.
RAPIDS
Scale up and out with accelerated GPU data science

One stack for Data Preparation, Model Training, and Visualization, coordinated by Dask and sharing GPU memory:
• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, Chainer, MXNet
• Visualization: cuXfilter <> pyViz
RAPIDS
Scale up and out with accelerated GPU data science

The same stack, with familiar APIs layered on top: cuDF exposes a Pandas API, cuML a scikit-learn API, and cuGraph a NetworkX API, all over shared GPU memory.
Scale up with RAPIDS

PyData today: NumPy, Pandas, Scikit-Learn, Numba, and many more, typically bound to a single CPU core and in-memory data.
Scale up / accelerate: RAPIDS and others run the same workloads on a single GPU:
• NumPy -> CuPy / PyTorch / ...
• Pandas -> cuDF
• Scikit-Learn -> cuML
• Numba -> Numba
Scale out with RAPIDS + Dask with OpenUCX

Dask takes PyData multi-core and distributed:
• NumPy -> Dask Array
• Pandas -> Dask DataFrame
• Scikit-Learn -> Dask-ML
• ... -> Dask Futures

RAPIDS + Dask with OpenUCX combines both axes: GPU-accelerated and multi-GPU, on a single node (DGX) or across a cluster.
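The mappings above are literal: the accelerated libraries keep the PyData call signatures, so the same code runs on CPU or GPU. A minimal sketch of that drop-in pattern (assuming pandas is installed; `cudf` only imports on a machine with a supported NVIDIA GPU and a RAPIDS install):

```python
# Drop-in acceleration sketch: prefer the GPU library when available,
# fall back to pandas on CPU. The downstream code is unchanged either way.
try:
    import cudf as xdf        # GPU DataFrames (RAPIDS)
except ImportError:
    import pandas as xdf      # CPU fallback; same API surface for this snippet

df = xdf.DataFrame({"user_id": [1, 1, 2], "value": [10.0, 20.0, 30.0]})
means = df.groupby("user_id").value.mean()  # identical call on CPU or GPU
```

The snippet's point is the import line, not the arithmetic: swapping the library, not rewriting the workload, is what "scale up / accelerate" means here.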
Faster Speeds, Real-World Benefits
cuIO/cuDF (Load and Data Preparation) | cuML XGBoost | End-to-End

(Bar chart, time in seconds, shorter is better; each bar breaks down into cuIO/cuDF load and data prep, data conversion, and XGBoost. Chart values: 8762, 6148, 3925, 3221, 322, 213.)

Benchmark: 200GB CSV dataset; data prep includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
Dask + cuML
Machine Learning at Scale

Machine Learning
More models, more problems

(The RAPIDS stack diagram again, with cuML highlighted in the Model Training stage.)
Algorithms
GPU-accelerated Scikit-Learn

Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
Inference: Random Forest / GBDT inference
Clustering: K-Means, DBSCAN, Spectral Clustering
Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
Time Series: Holt-Winters, Kalman Filtering
Hyper-parameter Tuning: Cross Validation

More to come!
RAPIDS matches common Python APIs
GPU-Accelerated Clustering

from sklearn.datasets import make_moons
import cudf

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find clusters:

from cuml import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
y_hat = dbscan.fit_predict(X)  # fit_predict replaces the fit/predict pair;
                               # DBSCAN labels only the points it was fit on
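The cuML snippet above mirrors scikit-learn call for call. For comparison, here is the CPU version with scikit-learn itself, using the same parameters as the slide (the expected cluster count assumes the default `make_moons` geometry, where the two moons are well separated at this noise level):

```python
# CPU equivalent of the slide's cuML example: scikit-learn's own DBSCAN.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)

dbscan = DBSCAN(eps=0.3, min_samples=5)
y_hat = dbscan.fit_predict(X)          # one cluster label per sample; -1 = noise
n_clusters = len(set(y_hat) - {-1})    # count clusters, excluding noise points
```

Switching this to GPU is only the import change (`cuml.DBSCAN` in place of `sklearn.cluster.DBSCAN`) plus loading the features into a cuDF DataFrame, which is the whole point of the API-matching design.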
Benchmarks: single-GPU cuML vs. scikit-learn
1x V100 vs. 2x 20-core CPU
Why Dask?

• PyData native
  • Built on top of NumPy, Pandas, Scikit-Learn, etc. (easy to migrate)
  • With the same APIs (easy to train)
  • With the same developer community (well trusted)
• Scales
  • Easy to install and use on a laptop
  • Scales out to thousand-node clusters
• Popular
  • The most common parallelism framework today at PyData and SciPy conferences
• Deployable
  • HPC: SLURM, PBS, LSF, SGE
  • Cloud: Kubernetes
  • Hadoop/Spark: Yarn
Using Dask in Practice
Familiar Python APIs

Pandas (single process):

import pandas as pd
df = pd.read_csv("data-0.csv")  # pandas reads one file at a time
df.groupby(df.user_id).value.mean()

Dask + cuDF (distributed and lazy; the wildcard becomes one partition per file):

import dask_cudf
df = dask_cudf.read_csv("data-*.csv")
df.groupby(df.user_id).value.mean().compute()
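The only visible difference in the Dask version is the trailing `.compute()`: nothing runs until it is called. A toy stand-in (not the Dask API) for that deferred-execution idea:

```python
# Toy sketch of lazy evaluation: build a description of the work,
# execute nothing until .compute() is called. Dask does the same with
# a full task graph, scheduler, and distributed workers.
class Lazy:
    def __init__(self, fn):
        self.fn = fn          # the pending task

    def compute(self):
        return self.fn()      # trigger execution now

data = {"alice": [1.0, 3.0], "bob": [2.0]}

# Defining the deferred per-user mean does no work yet
lazy_means = Lazy(lambda: {k: sum(v) / len(v) for k, v in data.items()})

result = lazy_means.compute()   # -> {'alice': 2.0, 'bob': 2.0}
```

Deferring execution is what lets Dask plan the whole groupby across many partitions (and many GPUs) before moving any data.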
Dask supports a variety of data structures and backends
A quick demo...

RAPIDS
How do I get the software?

• https://github.com/rapidsai
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://anaconda.org/rapidsai/
• https://hub.docker.com/r/rapidsai/rapidsai/
AI is more than Model Training

DISCOVERY: identify and define the business problem; capture business requirements; preliminary ROI analysis
EXPLORATION: exploratory data analysis; acquire, prepare & enrich data
MODELING: model development; modeling & performance evaluation
FINDINGS: deliver insights & recommendations; promote business enablement
OPERATIONALIZE: deploy the solution at scale; promote user adoption; continuous model improvement; measure business effectiveness & ROI

Source: Adapted from Dell Tech data science solutions learnings & presentation
A vision for the end-to-end ML journey: Fluid and Flexible
Dell Tech AI offerings: cloud to in-house, one vendor with no lock-in

Sales & services: direct & consultative sales, system integrator partners, Dell Tech consulting services
Software partner eco-system: enterprise ISV ML software, accelerator virtualization and pooling, open-source deep learning frameworks
Cloud foundation: public cloud, hybrid cloud, private cloud
Infrastructure: public cloud, hosted edge, hybrid & distributed cloud, private cloud, Ready Bundles and Appliances for ML and DL
Data preparation: Boomi connectors, partner software
Storage: Dell EMC enterprise storage, public cloud storage, Elastic Cloud Storage
AI and Deep Learning Workstation Use Cases

Two use cases, both running NVIDIA GPU Cloud (NGC) container frameworks (ML, RAPIDS, PyTorch) on Precision workstations with 1-3 GPUs each:
• Data science sandboxes: data copied and staged from Isilon storage
• Production development: data accessed in place on Isilon storage
Pre-verified GPU-accelerated Deep Learning Platforms
White papers, best practices, performance tests, dimensioning guidelines

Compute (running NGC container frameworks: ML, RAPIDS, PyTorch): Precision workstations, hyper-converged systems, Dell PowerEdge, Dell DSS8440, DGX-1 and DGX-2 servers
Storage: Isilon F, H, and A nodes as appropriate; ECS object store; Ready Solutions. Isilon is the foundation data lake for the AI platforms.
Some combinations are pre-verified today; others are to be verified in H2 2019.
THANK YOU

Ramesh Radhakrishnan TODO insert email / Twitter
John Zedlewski @zstats [email protected]

Join the Movement
Everyone can help!

APACHE ARROW: https://arrow.apache.org/ @ApacheArrow
RAPIDS: https://rapids.ai @RAPIDSAI
Dask: https://dask.org @Dask_dev
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @GPUOAI

Integrations, feedback, documentation support, pull requests, new issues, and code donations are all welcome!
cuDF

RAPIDS
GPU-accelerated data wrangling and feature engineering

(The RAPIDS stack diagram again, with cuDF and cuIO highlighted in the Data Preparation stage.)
GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL as opposed to training models.

ETL: the Backbone of Data Science
cuDF is...

● A Python library for manipulating GPU DataFrames, following the Pandas API
● A Python interface to a CUDA C++ library, with additional functionality
● Able to create GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
● JIT compilation of user-defined functions (UDFs) using Numba
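The UDF point is the one that differs most under the hood. A minimal example of the kind of elementwise UDF involved, shown with pandas so it runs anywhere (in cuDF, the equivalent call, `apply`/`applymap` depending on version, hands the Python function to Numba, which JIT-compiles it to a GPU kernel instead of looping in the interpreter):

```python
# Elementwise UDF on a DataFrame column. With pandas this runs as a
# Python-level loop; cuDF JIT-compiles the same function via Numba
# and executes it across the column on the GPU.
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
df["y"] = df["x"].apply(lambda v: v * 2.0 + 1.0)  # y = 2x + 1 per element
```

The user-facing code stays one line either way; only the execution strategy changes.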
Benchmarks: single-GPU Speedup vs. Pandas

cuDF v0.8 vs. Pandas 0.23.4, running on an NVIDIA DGX-1:
GPU: NVIDIA Tesla V100 32GB
CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

(Chart: speedup of GPU (cuDF) vs. CPU (Pandas) across operations.)
Eliminate Data Extraction Bottlenecks
cuIO under the hood

• Follows Pandas APIs and provides >10x speedups
• CSV reader (v0.2), CSV writer (v0.8)
• Parquet reader (v0.7), Parquet writer (v0.10)
• ORC reader (v0.7), ORC writer (v0.10)
• JSON reader (v0.8)
• Avro reader (v0.9)
• GPUDirect Storage integration is in progress to bypass PCIe bottlenecks
• The key is GPU-accelerating both parsing and decompression wherever possible

Source: Apache Crail blog, "SQL Performance: Part 1 - Input File Formats"
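"Follows Pandas APIs" means the reader calls have the same shape. A sketch of that shape, shown with `pandas.read_csv` so it runs on CPU (`cudf.read_csv` accepts file paths and file-like buffers in the same way, but decodes on the GPU); the column names and values here are made up for illustration:

```python
# Reader API parity: the call below is pandas, but cudf.read_csv is
# modeled on it, so the GPU version is the same line with a different import.
import io

import pandas as pd

csv_text = "ts,price\n1,10.5\n2,11.0\n3,10.8\n"
df = pd.read_csv(io.StringIO(csv_text))   # parse CSV from an in-memory buffer
n_rows, n_cols = df.shape                 # -> 3 rows, 2 columns
```

Keeping the signatures identical is what lets existing ingestion code move to cuIO without a rewrite; the win comes from parsing and decompressing on the GPU.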
cuML

ML Technology Stack

From top to bottom:
• Python: Dask cuML, Dask cuDF, cuDF, NumPy
• Cython bindings
• cuML algorithms
• cuML prims
• CUDA libraries: Thrust, CUB, CUTLASS, cuSolver, nvGraph, cuSparse, cuRand, cuBLAS
• CUDA