Accelerating Data Science Productivity with GPUs: From Workstation to Cluster
From inception to insight: Accelerating AI productivity with GPUs
John Zedlewski, Director, RAPIDS Machine Learning @ NVIDIA
Ramesh Radhakrishnan, Technologist @ Server OCTO, Dell EMC

Data sizes continue to grow; prototyping and production diverge
Today's challenges:
• "Tools gap": Python or R code must be rewritten as Spark/Hadoop jobs to scale to a cluster
• High latency on the cluster leads to slower iteration
• Small data subsets on the workstation make it hard to build realistic models
RAPIDS offers consistent tools on workstation or cluster:
• Workstation: RAPIDS on GPU turns small-data Python prototyping into high-throughput, low-latency, fast iteration
• Large-scale cluster: RAPIDS + Dask delivers high throughput on the full data or large subsets, in place of Spark/Hadoop rewrites

Data Processing Evolution: faster data access, less data movement
• Hadoop processing, reading from disk: HDFS Read -> Query -> Write -> HDFS Read -> ETL -> Write -> HDFS Read -> ML Train
• Spark in-memory processing (25-100x improvement; less code; language flexible; primarily in-memory): HDFS Read -> Query -> ETL -> ML Train
• Traditional GPU processing (5-10x improvement; more code; language rigid; substantially on GPU): HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train

Data Movement and Transformation: the bane of productivity and performance
[Diagram: applications APP A and APP B each load data, then repeatedly copy and convert it between CPU and GPU memory as it flows through the pipeline.]

Data Movement and Transformation: what if we could keep data on the GPU?
[Diagram: the same pipeline with the CPU-to-GPU copy-and-convert steps eliminated; data stays resident in GPU memory between applications.]

Data Processing Evolution, revisited
RAPIDS adds a fourth stage to the pipelines above:
• RAPIDS (50-100x improvement; same code; language flexible; primarily on GPU): Arrow Read -> Query -> ETL -> ML Train

RAPIDS: scale up and out with accelerated GPU data science
• Data preparation: Dask, cuDF, cuIO (analytics)
• Model training: cuML (machine learning), cuGraph (graph analytics), PyTorch / Chainer / MxNet (deep learning)
• Visualization: cuXfilter <> pyViz
All components share GPU memory, and each mirrors a familiar API: cuDF follows Pandas, cuML follows scikit-learn, and cuGraph follows NetworkX.

Scale up with RAPIDS
PyData (NumPy, Pandas, Scikit-Learn, Numba, and many more) runs on a single CPU core with in-memory data. RAPIDS and related projects accelerate the same workflows on a single GPU:
• NumPy -> CuPy / PyTorch / ...
• Pandas -> cuDF
• Scikit-Learn -> cuML
• Numba -> Numba
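To make the scale-up mapping concrete, here is a minimal sketch (not from the deck) of the Pandas -> cuDF swap: the same read-and-aggregate code with only the library changed. It assumes a RAPIDS installation, and the file and column names ("taxi.csv", "passenger_count", "fare_amount") are purely illustrative.

    import pandas as pd
    import cudf  # RAPIDS GPU DataFrame library

    # CPU: classic Pandas on a single core
    pdf = pd.read_csv("taxi.csv")
    print(pdf.groupby("passenger_count")["fare_amount"].mean())

    # GPU: cuDF with the same API, executed on the GPU
    gdf = cudf.read_csv("taxi.csv")
    print(gdf.groupby("passenger_count")["fare_amount"].mean())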
Scale out with RAPIDS + Dask with OpenUCX
Dask makes PyData multi-core and distributed:
• NumPy -> Dask Array
• Pandas -> Dask DataFrame
• Scikit-Learn -> Dask-ML
• ... -> Dask Futures
RAPIDS + Dask with OpenUCX extends this to multiple GPUs, whether on a single node (such as a DGX) or across a cluster.

Faster Speeds, Real-World Benefits
[Benchmark chart, "Time in seconds (shorter is better)": cuIO/cuDF load and data preparation, cuML XGBoost, and end-to-end results of 8762, 6148, 3925, 3221, 322, and 213 seconds across the two configurations, with the DGX results broken down into cuIO/cuDF (load and data prep), data conversion, and XGBoost. CPU cluster configuration: Apache Spark on CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform). DGX cluster configuration: 5x DGX-1 on an InfiniBand network. 200 GB CSV dataset; data prep includes joins and variable transformations.]

Dask + cuML: Machine Learning at Scale

Machine Learning: more models, more problems
[The RAPIDS stack diagram again, with cuML highlighted as the machine learning component.]

Algorithms: GPU-accelerated Scikit-Learn
• Classification / regression: decision trees / random forests, linear regression, logistic regression, k-nearest neighbors
• Inference: random forest / GBDT inference
• Clustering: K-Means, DBSCAN, spectral clustering
• Decomposition & dimensionality reduction: principal components, singular value decomposition, UMAP, spectral embedding
• Time series: Holt-Winters, Kalman filtering
• Hyper-parameter tuning: cross validation
More to come!

RAPIDS matches common Python APIs
GPU-accelerated clustering:

    from sklearn.datasets import make_moons
    import cudf

    X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
    X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

Find clusters:

    from cuml import DBSCAN

    dbscan = DBSCAN(eps=0.3, min_samples=5)
    y_hat = dbscan.fit_predict(X)

Benchmarks: single-GPU cuML vs scikit-learn
[Speedup chart: 1x V100 GPU vs. 2x 20-core CPUs.]

Why Dask?
• PyData native: built on top of NumPy, Pandas, Scikit-Learn, etc. (easy to migrate), with the same APIs (easy to train) and the same developer community (well trusted)
• Scales: easy to install and use on a laptop, yet scales out to thousand-node clusters
• Popular: the most common parallelism framework today at PyData and SciPy conferences
• Deployable: HPC (SLURM, PBS, LSF, SGE), cloud (Kubernetes), and Hadoop/Spark (YARN)

Using Dask in Practice: familiar Python APIs
Pandas on a single core:

    import pandas as pd

    df = pd.read_csv("data-0.csv")
    df.groupby(df.user_id).value.mean()

The same workflow across many partitions, GPU-accelerated (a multi-GPU version appears in the sketch below):

    import dask_cudf

    df = dask_cudf.read_csv("data-*.csv")
    df.groupby(df.user_id).value.mean().compute()

Dask supports a variety of data structures and backends.
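As a hedged sketch of the multi-GPU path described above (assuming the dask-cuda package is installed alongside RAPIDS), LocalCUDACluster starts one Dask worker per visible GPU, and the "data-*.csv" pattern from the slide is reused:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster  # one worker per local GPU
    import dask_cudf

    # Spin up a local multi-GPU cluster and connect a client to it
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Each CSV partition is read and aggregated on a GPU worker
    df = dask_cudf.read_csv("data-*.csv")
    print(df.groupby(df.user_id).value.mean().compute())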
A quick demo...

RAPIDS: how do I get the software?
• https://github.com/rapidsai
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://anaconda.org/rapidsai/
• https://hub.docker.com/r/rapidsai/rapidsai/

AI is more than model training
The journey runs from discovery through operations:
• DISCOVERY: identify and define the business problem; capture business requirements; preliminary ROI analysis
• EXPLORATION: exploratory data analysis; acquire, prepare & enrich data
• MODELING: models development; modeling & performance evaluation
• FINDINGS: deliver insights & recommendations; measure business effectiveness & ROI
• OPERATIONALIZE: deploy the solution at scale; promote user adoption; continuous model improvement; promote business enablement
Source: Adapted from DellTech data science solutions learnings & presentation

A vision for the end-to-end ML journey – Fluid and Flexible
Dell Tech AI offerings span cloud to in-house, one vendor with no lock-in:
• Go-to-market: direct & consultative sales, system integrator partners, Dell Tech consulting services
• Software partner ecosystem: enterprise ML software, ISVs, accelerator virtualization and pooling, open-source deep learning frameworks
• Cloud: public, hybrid, and private cloud
• Foundation infrastructure: public cloud, hosted hybrid & distributed edge cloud, private cloud appliances, Ready Bundles for ML and DL
• Data preparation: Boomi connectors, partner software
• Storage: Dell EMC public cloud storage, Elastic Cloud Storage, enterprise storage

AI and Deep Learning Workstation Use Cases
• Data science sandboxes and production development, built on NVIDIA GPU Cloud (NGC) container frameworks (ML, RAPIDS, PyTorch)
• 1-3x GPUs per Precision workstation
• Data on Isilon storage, accessed by copy-and-stage or in place

Pre-verified GPU-accelerated Deep Learning Platforms
White papers, best practices, performance tests, and dimensioning guidelines cover NGC containers (ML, RAPIDS, PyTorch) on Precision workstations, hyper-converged systems, Dell PowerEdge, Dell DSS8440, and DGX-1 and DGX-2 servers, backed by ECS object store and Isilon F, H, and A nodes (Ready Solutions as appropriate). Some platforms are pre-verified today; others are to be verified in H2 2019. Isilon is the foundation data lake for the AI platforms.

THANK YOU
Ramesh Radhakrishnan, TODO insert email / Twitter
John Zedlewski, @zstats, [email protected]

Join the Movement: everyone can help!
• Apache Arrow: https://arrow.apache.org/ (@ApacheArrow)
• RAPIDS: https://rapids.ai (@RAPIDSAI)
• Dask: https://dask.org (@Dask_dev)
• GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@GPUOAI)
Integrations, feedback, documentation support, pull requests, new issues, and code donations are all welcome!

cuDF

RAPIDS: GPU-accelerated data wrangling and feature engineering
[The RAPIDS stack diagram again, with the data preparation components (Dask, cuDF, cuIO) in focus.]

GPU-Accelerated ETL
The average data scientist spends 90+% of their time in ETL, as opposed to training models.

ETL: the Backbone of Data Science
cuDF is a Python library for manipulating GPU DataFrames, following the Pandas API. It provides:
• A Python interface to a CUDA C++ library, with additional functionality
• Creation of GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables (see the sketch below)
• JIT compilation of user-defined functions (UDFs) using Numba
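A minimal sketch of those capabilities (the data values and column names are illustrative): building GPU DataFrames from Pandas and NumPy objects, then running a Numba-JIT-compiled row UDF through cuDF's apply_rows.

    import numpy as np
    import pandas as pd
    import cudf

    # GPU DataFrame from an existing Pandas DataFrame
    gdf = cudf.DataFrame.from_pandas(
        pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}))

    # GPU DataFrame from a NumPy array
    gdf2 = cudf.DataFrame({"x": np.arange(5, dtype=np.float64)})

    # Row-wise UDF, JIT-compiled for the GPU by Numba: out = 2 * x
    def double(x, out):
        for i, v in enumerate(x):
            out[i] = 2.0 * v

    result = gdf2.apply_rows(double, incols=["x"],
                             outcols={"out": np.float64}, kwargs={})
    print(result)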
Benchmarks: single-GPU speedup vs. Pandas
[Chart: GPU (cuDF) speedup vs. CPU (Pandas). cuDF v0.8 vs. Pandas 0.23.4, running on an NVIDIA DGX-1 (GPU: NVIDIA Tesla V100 32 GB; CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz).]

Eliminate data extraction bottlenecks: cuIO under the hood
• Follows Pandas APIs and provides a >10x speedup
• CSV reader (v0.2), CSV writer (v0.8)
• Parquet reader (v0.7), Parquet writer (v0.10)
• ORC reader (v0.7), ORC writer (v0.10)
• JSON reader (v0.8)
• Avro reader (v0.9)
• GPU Direct Storage integration is in progress, to bypass PCIe bottlenecks
• The key is GPU-accelerating both parsing and decompression wherever possible
Source: Apache Crail blog, "SQL Performance: Part 1 - Input File Formats"

cuML

ML Technology Stack
From the user-facing API down to the hardware:
• Python: Dask cuML, Dask cuDF, cuDF, NumPy
• Cython
• cuML algorithms
• cuML prims
• CUDA libraries: cuSolver, nvGraph, CUTLASS, cuSparse, cuRand, cuBLAS, Thrust, CUB
• CUDA
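To illustrate the top of that stack (a sketch with illustrative data, not taken from the deck): a cuML estimator is called exactly like its scikit-learn counterpart, while fit and predict dispatch through Cython into the cuML C++/CUDA primitives below.

    import cudf
    from cuml.linear_model import LinearRegression

    X = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                        "x2": [0.0, 1.0, 0.0, 1.0]})
    y = cudf.Series([1.0, 2.5, 3.0, 4.5])

    # scikit-learn-style API; execution happens on the GPU
    model = LinearRegression()
    model.fit(X, y)
    print(model.predict(X))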