XGBOOST WITH RAPIDS
WHAT IS XGBOOST?
A Superior Implementation of CART
• Xtreme: engineering improvements such as a parallelization-first design plus memory and cache optimization, making existing GBDT models more robust and accurate
• Gradient Boosted: additional small trees are added to minimize a gradient
• Classification And Regression Trees: appropriate for discrete or continuous prediction problems

WHICH ML ALGORITHM PERFORMED BEST?
[Chart: average rank across 165 ML datasets; lower is better]
Source: https://arxiv.org/pdf/1708.05070.pdf

PREDICT: WHO ENJOYS COMPUTER GAMES
[Figure: example of a decision tree]
Source: https://xgboost.ai/2016/12/14/GPU-accelerated-xgboost.html

COMBINE TREES FOR STRONGER PREDICTION
If (age < 15) AND (uses a computer daily) = enjoys computer games
Source: https://xgboost.ai/2016/12/14/GPU-accelerated-xgboost.html

WHY XGBOOST?
A Strong History of Success on a Wide Range of Problems
• Winner of the Caterpillar Kaggle contest 2015: machinery component pricing
• Winner of the CERN Large Hadron Collider Kaggle contest 2015: classification of rare particle decay phenomena
• Winner of KDD Cup 2016: research institutions' impact on the acceptance of submitted academic papers
• Winner of the ACM RecSys Challenge 2017: job posting recommendation

WHY XGBOOST?
It Produces Comprehensible Solutions
• Deep learning has been very successful, but it cannot say exactly why it came to the conclusion it did
• XGBoost is, at a very high level, a regression
• This increases stakeholder confidence in the predictions being made

XGBOOST LIMITATIONS
Time Is Everything
• The large number of hyperparameters can mean that finding the best solution takes a long time as you try different combinations
• The CPU implementation may be fast relative to other methods, but it is not fast enough for problems beyond a certain size
• Scale-out can require compromises on accuracy

WHY RAPIDS WITH XGBOOST: BENCHMARKS
Multi-GPU, Multi-Node Scalability
• Freedom to execute an end-to-end data science and analytics pipeline entirely on GPU
• User-friendly Python interfaces
• Relies on CUDA primitives
• Faster results make tuning parameters more interactive

[Benchmark charts: time in seconds, shorter is better; a third chart showed end-to-end times]

Configuration     Load and data preparation (cuIO/cuDF)    XGBoost (cuML)
20 CPU nodes      2,741                                    2,290
30 CPU nodes      1,675                                    1,956
50 CPU nodes      715                                      1,999
100 CPU nodes     379                                      1,948
DGX-2             42                                       169
5x DGX-1          19                                       157

Cluster configuration: CPU nodes with 61 GiB of memory, 8 vCPUs, 64-bit platform, running Apache Spark; DGX cluster of 5x DGX-1 on an InfiniBand network. 200 GB CSV dataset; data preparation includes joins and variable transformations.

WHAT IS RAPIDS?
The New GPU Data Science Pipeline (rapids.ai)
• Suite of open-source, end-to-end data science tools
• Built on CUDA
• Pandas-like API for data cleaning and transformation
• Scikit-learn-like API for machine learning
• A unifying framework for GPU data science

DATA SCIENCE WORKFLOW WITH RAPIDS
Open Source, GPU-Accelerated ML Built on CUDA
Dataset -> data preparation / wrangling (cuDF) -> ML model training (cuML) -> visualize data / explore predictions

RAPIDS PREREQUISITES
See more at rapids.ai
• NVIDIA Pascal GPU architecture or better
• NVIDIA driver compatible with CUDA 9.2 or 10.0
• Ubuntu 16.04 or 18.04
• Docker CE v18+
• nvidia-docker v2+

SELECTING THE RIGHT RAPIDS SOLUTION
Unparalleled Data Science Performance and Productivity
From the ML enthusiast to machine learning developers on data science workstations to shared data center infrastructure for data science teams:

Product              GPU memory   GPUs     Fabric     Benefit
TITAN RTX            48 GB        2-way    NVLINK     PC solution; easy to acquire, deploy, and start experimenting
Quadro Workstation   64 GB        2-way    NVLINK     Enterprise workstation for experienced data scientists
DGX Station          128 GB       4-way    NVLINK     Enterprise ML workgroups; largest memory on a workstation
DGX-1 / OEM          256 GB       8-way    NVLINK     Enterprise server; proven 8-way configuration; modular approach for scale, multi-node training
DGX-2 / OEM          512 GB       16-way   NVSWITCH   Largest compute and memory capacity in a single node; fastest training solution

DOWNLOAD RAPIDS FROM NGC
GPU-Accelerated Software for Machine Learning, Deep Learning, and HPC
• Download the ready-to-run RAPIDS container from the NGC container registry
• Run on the top cloud providers, NVIDIA DGX systems, NGC-Ready systems, NVIDIA TITAN, and Quadro
• Browse NGC at: https://ngc.nvidia.com/
• For full instructions see: https://docs.nvidia.com/ngc/

CONDA PACKAGES
https://anaconda.org/rapidsai
For bare-metal installs, install cuDF using conda. You can get a minimal conda installation with Miniconda or the full installation with Anaconda.
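A conda install might look like the following; the channel list and version pins are illustrative assumptions, not the exact command from the slides:

```shell
# Illustrative bare-metal install of cuDF via conda. The channels and pins
# below are assumptions -- see https://github.com/rapidsai/cudf for the
# command matching your CUDA version and RAPIDS release.
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
    cudf python=3.6 cudatoolkit=10.0
```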
For full instructions see https://github.com/rapidsai/cudf

GETTING STARTED WITH DOCKER
Quickly Get Up to Speed on RAPIDS
• Consistently releasing updated containers
• All the latest components of the RAPIDS project
• Get started in four lines of code

A 'HELLO WORLD' WORKFLOW
Surviving the Titanic
[Notebook screenshots in the original slides]

GETTING STARTED WITH DOCKER: MULTI-GPU
• Download the data set from https://rapidsai.github.io/demos/datasets/mortgage-data
• The example notebook expects 8 GPUs by default; change this based on the specifics of the system you are running
• The address of the Dask dashboard will need to be modified: in your browser's address bar, enter the IP address of the machine you are working on, as {YOUR IP ADDRESS}:8787
• Modify the path to the volume you mounted at container start-up
• Modify the part count to control memory usage
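The Docker steps above might look like the following sketch; the image tag, ports, and host path are illustrative assumptions, not the exact commands from the slides (port 8888 serves Jupyter and 8787 the Dask dashboard mentioned above):

```shell
# Pull the RAPIDS container and launch it with Jupyter and the Dask
# dashboard exposed. The tag and mount path are assumptions -- browse
# https://ngc.nvidia.com/ for current container tags.
docker pull rapidsai/rapidsai:latest
docker run --runtime=nvidia --rm -it \
    -p 8888:8888 -p 8787:8787 \
    -v /path/to/host/data:/rapids/data \
    rapidsai/rapidsai:latest
```

The `-v` mount is the volume you will need to point the notebook at, per the note above on modifying the volume path.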
DGX-2 END-TO-END XGBOOST
XGBoost training across 16 V100 32 GB GPUs

COMMON ERRORS
Mortgage XGBoost Demo: Running Out of GPU Device Memory
• ETL processes may create many copies of data in device memory, causing memory-utilization spikes
• Budget 25% of GPU device memory for XGBoost overhead: do not exceed 24 GB on a 32 GB GPU, or 12 GB on a 16 GB GPU
• Memory utilization that exceeds available device resources will cause a Dask worker to crash
• This error can propagate forward in the Dask task graph and manifest as very short ETL times (on a sub-millisecond timescale)
• An error may be raised by another routine referring to NoneType in data, or similar

GETTING STARTED MULTI-NODE
Multi-Node, Multi-GPU
The GitHub repository contains everything you need to try out multi-node, multi-GPU XGBoost.
To instantiate a Dask cluster, there are two helper scripts and one read-only file:
/path/to/notebooks/utils/dask-cluster.py ← Python script
/path/to/notebooks/utils/dask-setup.sh ← Bash script
/path/to/notebooks/utils/dask.conf ← read-only

1. Start the notebook on the MASTER node: cd /path/to/notebooks && bash utils/start-jupyter.sh
2. Call the cluster script on the MASTER node: %run -i ../utils/dask-cluster.py
3. Call the cluster script on all WORKER nodes: cd /path/to/notebooks/mortgage && python ../utils/dask-cluster.py
4.
Attach the client to the scheduler on the MASTER node:

    import subprocess

    import dask
    from dask.delayed import delayed
    from dask.distributed import Client, wait

    # Look up this machine's first IP address and connect to the
    # Dask scheduler listening on port 8786.
    cmd = "hostname --all-ip-addresses"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    IPADDR = str(output.decode()).split()[0]

    _client = IPADDR + ":8786"
    client = Client(_client)
    client

GETTING STARTED RESOURCES
rapids.ai
• cuDF documentation: https://rapidsai.github.io/projects/cudf/en/latest/
• cuML documentation: https://rapidsai.github.io/projects/cuml/en/latest/
• GitHub: https://github.com/RAPIDSai
• Twitter: @rapidsai
• NVIDIA Accelerated Data Science: https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/
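To close the loop on the "Gradient Boosted" idea from the start of this deck (small trees added one after another to minimize a gradient), here is a minimal pure-Python sketch that fits depth-one stumps to squared-error residuals. It is an illustration of the concept only, not XGBoost's actual implementation; the data and function names are made up for the example:

```python
# Minimal gradient boosting for squared-error regression: each round fits a
# depth-1 "stump" to the current residuals (the negative gradient of squared
# error), and the ensemble is the sum of the shrunken stumps.

def fit_stump(x, residuals):
    """Find the single threshold split on x that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue  # degenerate split, skip
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=20, lr=0.3):
    """Add `rounds` stumps, each fit to the residuals of the ensemble so far."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: sum(lr * tree(xi) for tree in trees)

# Toy data: two clearly separated groups of x values.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
model = boost(x, y)
```

XGBoost layers regularization, second-order gradient information, and the engineering optimizations described earlier in this deck on top of this basic loop.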