
From inception to insight: Accelerating AI productivity with GPUs

John Zedlewski, Director, RAPIDS @ NVIDIA
Ramesh Radhakrishnan, Technologist @ Server OCTO, Dell EMC

Data sizes continue to grow; prototyping and production diverge

Challenges!

• Large-scale cluster (Spark, Hadoop): high throughput on the full data, but high latency on the cluster leads to slower iteration.
• Workstation (Python): fast iteration and low latency, but small data subsets make it hard to build realistic models.
• The "tools gap": rewriting Python code as Spark/Hadoop jobs to scale to the cluster.
• RAPIDS on GPU (RAPIDS + Dask): high throughput and low latency on the full data or large subsets, with consistent tools on workstation or cluster.

Data Processing Evolution
Faster data access, less data movement

Hadoop Processing: reading from disk

HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train

Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read -> Query -> ETL -> ML Train

Traditional GPU Processing (5-10x improvement, more code, language rigid, substantially on GPU):
HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train

Data Movement and Transformation
The bane of productivity and performance

[Diagram: each application (APP A, APP B) loads its data, then copies and converts it between CPU and GPU memory at every handoff between apps.]

Data Movement and Transformation
What if we could keep data on the GPU?

[Same diagram: with the data kept on the GPU, the copy-and-convert steps between applications fall away.]

Data Processing Evolution (continued)


RAPIDS (50-100x improvement, same code, language flexible, primarily on GPU):
Arrow Read -> Query -> ETL -> ML Train

RAPIDS
Scale up and out with accelerated GPU data science

Data Preparation | Model Training | Visualization, all over Dask and shared GPU memory:

• Analytics: cuDF, cuIO
• Machine Learning: cuML
• Graph Analytics: cuGraph
• Deep Learning: PyTorch, MxNet
• Visualization: cuXfilter <> pyViz
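The practical payoff of the shared GPU-memory layer is that each stage hands its output to the next without a round trip through host memory. A minimal sketch of an ETL-to-training handoff (file and column names are illustrative, not from the deck):

    import cudf
    from cuml.linear_model import LinearRegression

    # ETL entirely in GPU memory with cuDF/cuIO
    gdf = cudf.read_csv("trips.csv")             # hypothetical input file
    gdf = gdf.dropna()
    X = gdf[["trip_distance"]].astype("float32")
    y = gdf["fare_amount"].astype("float32")

    # cuML consumes the GPU DataFrame directly; no copy back to host
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)                      # predictions stay on the GPU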

RAPIDS
Scale up and out with accelerated GPU data science

The same stack, exposed through familiar APIs: cuDF mirrors the Pandas API, cuML the Scikit-Learn API, and cuGraph the NetworkX API, alongside PyTorch, Chainer, and MxNet for deep learning and cuXfilter <> pyViz for visualization, all over Dask and shared GPU memory.
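As a small illustration of the API match (the data here is made up; the cuDF call is intended to mirror the Pandas one):

    import pandas as pd
    import cudf

    pdf = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
    gdf = cudf.from_pandas(pdf)      # same table, now in GPU memory

    # Identical groupby code on CPU (Pandas) and GPU (cuDF)
    print(pdf.groupby("key").value.mean())
    print(gdf.groupby("key").value.mean())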

Scale up with RAPIDS

RAPIDS and others: accelerated on a single GPU (the scale-up / accelerate axis):

• NumPy -> CuPy/PyTorch/...
• Pandas -> cuDF
• Scikit-Learn -> cuML
• Numba -> Numba

The starting point is PyData on a single CPU core with in-memory data: NumPy, Pandas, Scikit-Learn, Numba, and many more.
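For NumPy-style code, CuPy is usually a near drop-in swap; a minimal sketch (array size is arbitrary):

    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(10_000_000)
    x_gpu = cp.asarray(x_cpu)          # copy the array to GPU memory

    # The same array API, now executing on the GPU
    y = float(cp.sqrt(x_gpu).sum())    # reduce on GPU, return a host scalar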

Scale out with RAPIDS + Dask with OpenUCX

RAPIDS + Dask with OpenUCX: multi-GPU, on a single node (DGX) or across a cluster, accelerated on each GPU.

Dask brings multi-core and distributed PyData (the scale-out / parallelize axis):

• NumPy -> Dask Array
• Pandas -> Dask DataFrame
• Scikit-Learn -> Dask-ML
• ... -> Dask Futures
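A minimal scale-out sketch, assuming the dask_cuda and dask_cudf packages and an illustrative file pattern:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    import dask_cudf

    # One Dask worker per GPU on this node; the same Client API can
    # instead point at a scheduler running on a multi-node cluster
    cluster = LocalCUDACluster()
    client = Client(cluster)

    ddf = dask_cudf.read_csv("data-*.csv")   # partitioned across GPUs
    result = ddf.groupby("user_id").value.mean().compute()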

Faster Speeds, Real-World Benefits
cuIO/cuDF (load and data preparation); cuML XGBoost; end-to-end

[Bar chart, time in seconds (shorter is better): 8762, 6148, 3925, 3221, 322, 213; each bar broken into cuIO/cuDF (load and data prep), data conversion, and XGBoost.]

Benchmark: 200 GB CSV dataset; data prep includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.

Dask + cuML
Machine Learning at Scale
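cuML exposes multi-GPU versions of several estimators under cuml.dask; a hedged sketch with illustrative data, reusing the cluster setup from the scale-out example:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from cuml.dask.cluster import KMeans
    import dask_cudf

    cluster = LocalCUDACluster()
    client = Client(cluster)

    ddf = dask_cudf.read_csv("features-*.csv").astype("float32")

    # Distributed k-means: each GPU trains on its own partitions
    km = KMeans(n_clusters=8)
    km.fit(ddf)
    labels = km.predict(ddf)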

Machine Learning
More models, more problems

[RAPIDS stack diagram as before; here the focus is cuML for model training.]

Algorithms
GPU-accelerated Scikit-Learn

• Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
• Inference: Random Forest / GBDT inference
• Clustering: K-Means, DBSCAN, Spectral Clustering
• Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
• Time Series: Holt-Winters, Kalman Filtering
• Hyper-parameter Tuning: Cross Validation

More to come!
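These estimators follow the Scikit-Learn pattern; for instance, a minimal random forest sketch on synthetic data (cuML's forests expect float32 features and int32 labels):

    import numpy as np
    from cuml.ensemble import RandomForestClassifier

    X = np.random.rand(1000, 10).astype(np.float32)
    y = (X[:, 0] > 0.5).astype(np.int32)   # synthetic binary target

    clf = RandomForestClassifier(n_estimators=100, max_depth=8)
    clf.fit(X, y)
    preds = clf.predict(X)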

RAPIDS matches common Python APIs
GPU-Accelerated Clustering

    from sklearn.datasets import make_moons
    import cudf

    # Generate data on the host, then move it into a GPU DataFrame
    X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
    X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find Clusters

    from cuml import DBSCAN

    # Same estimator interface as Scikit-Learn; DBSCAN has no separate
    # predict(), so fit_predict() returns the cluster labels directly
    dbscan = DBSCAN(eps=0.3, min_samples=5)
    y_hat = dbscan.fit_predict(X)
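Because the estimator mirrors Scikit-Learn's, y_hat comes back as a GPU-resident series of cluster labels; if needed, to_pandas() copies it back to the host.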

Benchmarks: single-GPU cuML vs Scikit-Learn

1x V100 vs 2x 20-core CPU

Why Dask?

• PyData native
  • Built on top of NumPy, Pandas, Scikit-Learn, etc. (easy to migrate)
  • With the same APIs (easy to train)
  • With the same developer community (well trusted)
• Scales
  • Easy to install and use on a laptop
  • Scales out to thousand-node clusters
• Popular
  • The most common parallelism framework today at PyData and SciPy conferences
• Deployable
  • HPC: SLURM, PBS, LSF, SGE (see the sketch after this list)
  • Cloud: Kubernetes
  • Hadoop/Spark: Yarn
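For the HPC schedulers, the dask-jobqueue package provides matching cluster classes; a minimal SLURM sketch (queue name and resources are placeholders):

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # Each Dask worker is submitted to SLURM as a batch job
    cluster = SLURMCluster(queue="regular", cores=8, memory="32GB")
    cluster.scale(jobs=10)   # ask the scheduler for ten worker jobs
    client = Client(cluster)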

Using Dask in Practice
Familiar Python APIs

Pandas, single process:

    import pandas as pd
    df = pd.read_csv("data-0.csv")         # one file at a time
    df.groupby(df.user_id).value.mean()

Dask + cuDF, parallel across GPUs:

    import dask_cudf
    df = dask_cudf.read_csv("data-*.csv")  # all matching files, partitioned
    df.groupby(df.user_id).value.mean().compute()

Dask supports a variety of data structures and backends

A quick demo...

RAPIDS
How do I get the software?

• Source: https://github.com/rapidsai
• NGC container: https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• Conda packages: https://anaconda.org/rapidsai/
• Docker Hub: https://hub.docker.com/r/rapidsai/rapidsai/

AI is more than Model Training

Phases: DISCOVERY -> EXPLORATION -> MODELING -> FINDINGS -> OPERATIONALIZE

Activities across the journey: identify and define the business problem; capture business requirements; preliminary ROI analysis; acquire, prepare & enrich data; exploratory analysis; model development; modeling & performance evaluation; deliver insights & recommendations; deploy the solution at scale; promote user adoption; promote business enablement; continuous model improvement; measure business effectiveness & ROI.

Source: adapted from Dell Tech data science solutions learnings & presentations

A vision for the end-to-end ML journey: fluid and flexible
Dell Tech AI offerings: cloud to in-house, one vendor with no lock-in

• Sales & services: direct & consultative sales, system integrator partners, Dell Tech consulting services
• Software partners ecosystem: enterprise ISV ML software, accelerator virtualization and pooling, open-source deep learning frameworks
• Cloud foundation: public cloud, hybrid cloud, private cloud
• Infrastructure: public cloud, hosted hybrid & distributed edge cloud, private cloud, Ready Bundles and appliances for ML and DL
• Boomi connectors; partner software for data preparation
• Storage: Dell EMC enterprise storage, public cloud storage, Elastic Cloud Storage (ECS)

AI and Deep Learning Workstation Use Cases

• Data science sandboxes and production development, built on the NVIDIA GPU Cloud (NGC) container framework (ML, RAPIDS, PyTorch)
• 1-3x GPU-equipped Precision workstations per use case
• Data on Isilon storage, accessed either by copy-and-stage or in place

Pre-verified GPU-accelerated Deep Learning Platforms
White papers, best practices, performance tests, dimensioning guidelines

• NVIDIA GPU Cloud (NGC) software stack: ML, RAPIDS, PyTorch
• Compute: Precision workstations, hyper-converged systems, Dell PowerEdge servers, Dell DSS8440, DGX-1, DGX-2 (some currently pre-verified, the rest to be verified in H2 2019)
• Storage: ECS object store; Isilon F, H, and A nodes; Ready Solutions as appropriate; Isilon is the foundation data lake for AI platforms

THANK YOU

Ramesh Radhakrishnan TODO insert email / Twitter

John Zedlewski, @zstats, [email protected]

Join the Movement
Everyone can help!

• Apache Arrow: https://arrow.apache.org/ (@ApacheArrow)
• RAPIDS: https://rapids.ai (@RAPIDSAI)
• Dask: https://dask.org (@Dask_dev)
• GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@GPUOAI)

Integrations, feedback, documentation support, pull requests, new issues, or code donations welcomed!

cuDF

RAPIDS
GPU-accelerated data wrangling and feature engineering

[RAPIDS stack diagram as before; here the focus is cuDF and cuIO for data preparation.]

GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL, as opposed to training models

ETL: the Backbone of Data Science

cuDF is…

● A Python library for manipulating GPU DataFrames following the Pandas API
● A Python interface to the underlying CUDA C++ library, with additional functionality
● A way to create GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
● JIT compilation of User-Defined Functions (UDFs) using Numba
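A minimal sketch of those entry points (the data is made up; applymap was the Series UDF path in cuDF releases of this era):

    import numpy as np
    import pandas as pd
    import cudf

    # GPU DataFrames built from host-side objects
    gdf1 = cudf.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))
    gdf2 = cudf.DataFrame({"b": np.arange(3, dtype=np.float32)})

    # A Python UDF, JIT-compiled by Numba and executed on the GPU
    gdf2["b2"] = gdf2["b"].applymap(lambda x: x * 2 + 1)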

Benchmarks: single-GPU speedup vs. Pandas

cuDF v0.8 vs Pandas 0.23.4, running on an NVIDIA DGX-1
GPU: NVIDIA Tesla V100 32GB; CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

[Chart: GPU (cuDF) speedup vs. CPU (Pandas).]

Eliminate data extraction bottlenecks
cuIO under the hood

• Follows the Pandas APIs and provides >10x speedup
• CSV reader (v0.2), CSV writer (v0.8)
• Parquet reader (v0.7), Parquet writer (v0.10)
• ORC reader (v0.7), ORC writer (v0.10)
• JSON reader (v0.8)
• Avro reader (v0.9)
• GPU Direct Storage integration is in progress to bypass PCIe bottlenecks!
• The key is GPU-accelerating both parsing and decompression wherever possible

Source: Apache Crail blog, "SQL Performance: Part 1 - Input File Formats"
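Usage mirrors the Pandas readers; a minimal sketch with illustrative file names:

    import cudf

    # Parsing and decompression run on the GPU where possible
    df_csv = cudf.read_csv("data.csv")
    df_parquet = cudf.read_parquet("data.parquet")
    df_orc = cudf.read_orc("data.orc")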

cuML

ML Technology Stack

• Python layer: Dask cuML, Dask cuDF, cuDF, NumPy
• cuML Algorithms
• cuML Prims
• CUDA libraries: Thrust, CUB, cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBLAS
• CUDA
