GPU OPEN ANALYTICS INITIATIVE END-TO-END ACCELERATED ANALYTICS
Brad Rees, Ph.D. - Senior Solution Architect - NVIDIA
GTC DC, November 2017
The AI Computing Company
AGENDA – TWO PARTS
Discuss Analysis from the Perspective of Data Science
“Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data …” - Wikipedia
• Part 1
  • Big Data and Spark
  • GPU Barriers
• Part 2
  • GOAI
Better Exploration ∝ Better Science
Faster Analytics yield better Exploration
Fail Fast Needs to be Embraced
I have not failed. I've just found 10,000 ways that won't work. - Thomas A. Edison
the Big Data Catalyst The Glue that Binds Big Data
• Spark has become synonymous with Hadoop and Big Data
• It’s the interface/API for big data app-to-app communication
• The processing layer for big data and leading ML framework
SPARK IS NOT ENOUGH
We Want More Efficiency and Speed
• Common issue is speed at scale
• Scaling out to get the necessary speed for mission critical workloads is prohibitively expensive
• Clients want core ML on GPU Commercial Government HPC
We need a GPU-equivalent to Spark … But there are some Barriers
GPU ADOPTION BARRIERS
• Too much data movement
• Too many makeshift data formats
• No inter-GPU communication
• No Python API for data manipulation
• No all inclusive Machine Learning Library
Concerns:
• Too Hard to Integrate GPUs
• Not suited for Data Science
DATA MOVEMENT AND TRANSFORMATION
The bane of productivity / performance
• Too much time spent Moving data • Data movement and conversion hinder any performance gains • No Inter-GPU Communication
CPU
DATA FORMATS
Parquet, CSV, GML, Pandas, Avro, HDFS, XML, NumPy, JSON, Pickle, ProtoBuf, CSC, CSR, COO
Plain Text vs Binary
Compressed vs Uncompressed
* Not a complete list
ARE THE GPU BARRIERS TOO GREAT?
Is there any hope?
☹️ Data movement
☹️ Data formats
☹️ Inter-GPU communication
☹️ No Python API for data manipulation
☹️ No all inclusive Machine Learning Library
GPU OPEN ANALYTICS INITIATIVE
Luckily others were also thinking about the problems
• Formed in March at Strata SJ; Launched at GTC in May
• Goal: GOAI seeks to foster open collaboration between GPU analytics projects and products to enable data scientists to efficiently combine the best tools for their workflows.
ACCELERATED ANALYTICS ECOSYSTEM
Prior State (pre-March 2017)
● Fragmented with too many holes
● Still too reliant on CPU for moving data between applications
● 80-90% of data science is accelerated analytics, not deep learning yet
INTERACTION: Graphistry, Jupyter NB, MapD Immerse
PROCESSING AND ANALYTICS: Data Manipulation, MapD, BlazingDB (“SQL”), Anaconda (Dask “Python”), NV Graph, * Fast Data (Streaming)
IN GPU MEMORY DATA STRUCTURE: Many Columnar Data Frames (everyone has their own makeshift data frame)
STORAGE: MapD GPU RAM, BlazingDB Disk
Key: Open Source, Free to Use, Closed Source
* Primarily x86 w/ some GPU acceleration
ACCELERATED ANALYTICS ECOSYSTEM
Post-March 2017
INTERACTION: Graphistry, Jupyter NB, MapD Immerse
PROCESSING AND ANALYTICS: Data Manipulation, MapD, BlazingDB (“SQL”), Anaconda (Dask “Python”), H2O (Data.Table “R”), NV Graph, Fast Data (Streaming), H2O.ai (GPU MLlib)
IN GPU MEMORY DATA STRUCTURE: Standard Columnar Data Frame (Open Sourced/Free to Use from MapD)
STORAGE: MapD + BlazingDB, MapD GPU RAM, BlazingDB Disk, System Memory
Key: Open Source, Free to Use, Closed Source
LEARNING FROM APACHE ARROW
Interoperability
Big Data ecosystem facing similar issues
Major push in the big data world to remove bottlenecks of copy & converting data between systems
Apache Arrow™
• Enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations
• Columnar layout is optimized for data locality for better performance on modern hardware like CPUs and GPUs.
• The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
THE GPU DATA FRAME
First GOAI Project
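The columnar layout and zero-copy reads described for Arrow can be illustrated with a minimal sketch that uses only the Python standard library. This is not the Arrow or GDF API; the column names (`ids`, `fares`) and the sample rows are made up for illustration.

```python
from array import array

# Row-oriented records (hypothetical sample data).
rows = [(1, 2.5), (2, 4.0), (3, 1.5), (4, 8.0)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", (r[0] for r in rows))    # 64-bit integers
fares = array("d", (r[1] for r in rows))  # 64-bit floats

# A memoryview exposes the column's buffer without copying it.
# Slicing the view also copies nothing; it only narrows the window.
fare_view = memoryview(fares)
tail = fare_view[2:]

# Writing through the original column is visible through the view,
# which shows that both refer to the same memory.
fares[3] = 9.0
total = sum(tail)
print(tail[1], total)
```

A consumer that receives `tail` can read the values in place, with no serialization or conversion step in between, which is the property the Arrow bullets above describe.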
✓ Data movement ✓ Data formats ✓ Inter-GPU communication ✓ Python API ✓ Machine Learning Library
CPU
So … what does this get me?
SEAMLESS CALLS BETWEEN APPLICATIONS
What does GOAI get me? Big improvement for Data Science
• Load data into MapD • Call an H2O ML algorithm • All via Anaconda Python • Within a Jupyter Notebook
Demos available on goai github
SIMPLE DATA CONVERSION
Convert from Pandas and Numpy
Several Examples Available on GOAI GitHub
GOAL OF GOAI
Better Adoption with Better Usability and TCO
Hadoop Processing, Reading from disk
  HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → Train

Spark In-Memory Processing
  HDFS Read → Query → ETL → ML Train
  25-100x Improvement over Hadoop. Less code. Language flexible. Primarily In-Memory.
  Large TCO benefit. Large Adoption.

GPU + Spark In-Memory Processing
  HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
  5-10x Improvement over Spark. More code. Language rigid. Substantially on GPU.
  Small TCO benefit. Small Adoption.

End-to-End GPU Processing (GOAI)
  Arrow Read → Query → ETL → ML Train
  25-100x Improvement over Spark. Same code. Language flexible. Primarily on GPU.
  Large TCO benefit. Large Adoption?

INITIAL LIBRARIES
GPU Data Frame
github.com/gpuopenanalytics
• libgdf: C library of helper functions:
  • Copying GDF metadata block to the host and parsing it to a host-side struct
  • Importing/exporting via CUDA IPC
  • CUDA kernels to perform element-wise math operations on GDF columns.
  • CUDA sort, join, and reduction operations on GDFs.
• pygdf: Python library for manipulating GDFs
  • Creating GDFs from numpy arrays and Pandas DataFrames
  • Performing math operations on columns
  • Import/export via CUDA IPC
  • Sort, join, reductions
  • JIT compilation of group by and filter kernels using Numba
• dask_gdf: Extension for Dask to work with distributed GDFs.
  • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers.
ABOUT
• ~8.5x speedup on half a DGX to produce a robust GLM via 10-fold cross-validation vs an 8 node Spark cluster
• ~100x speedup using MapD on half a DGX to analyze census data vs a 20 node Spark cluster
• Python on GPU... Numba and Pandas
• ~5X faster than Redshift to utilize full disk storage and system memory
• >50x speedup in performing pagerank on a graph on half a DGX vs an 8 node Spark cluster
• ~100x more cyber security data interactively visualized using an intuitive layout algorithm on a single GPU as a connected graph
MapD
GPU-accelerated analytics platform
Consists of MapD Core database and MapD Immerse
MapD Core database is an in-GPU-memory, columnar, open-source, GPU-accelerated, SQL database.
MapD Enterprise brings distributed and high availability modes, GPU-accelerated backend rendering, Kerberos/LDAP security, and ODBC/JDBC.
MapD Immerse is a visual analytics platform on top of the MapD Core database that allows data scientists and analysts to interactively explore large datasets.
1.1 BILLION TAXI RIDES BENCHMARK
[Figure: bar chart of time in milliseconds for Query 1 to Query 4 on MapD DGX-1, Kinetica DGX-1, Redshift 6-node, and Spark 11-node. Off-scale labels for Query 1 to Query 4: 10190, 8134, 19624, 85942. Other bar labels: 2970, 2250, 1560, 1209, 1250, 795, 596, 518, 372, 150, 21, 80.]
GPU memory based databases:
• 8x to 15x faster than CPU in-memory databases such as Redshift.
• 100x to 485x faster than Spark on 11 servers
• Open Source core DBMS
• Free Community Edition
Source: MapD Benchmarks on DGX-1 from internal NVIDIA testing following guidelines of @marklit82 Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS
BlazingDB
GPU-accelerated petabyte scale data warehouse
Consists of BlazingDB database
BlazingDB database is a disk-based, columnar, GPU-accelerated SQL database.
BlazingDB has distributed and high availability modes, JDBC, and Python/C# APIs.
BlazingDB offers a Community Edition that can be downloaded for free and has an Enterprise Edition that you can launch today on AWS.
BlazingDB
High performance SQL on petabyte scale
Blazing speedup
• BlazingDB SQL is built on a columnar relational data model.
• Enterprise grade security through Spring Security
• BlazingDB distributes both data and computation to multiple instances, for more data or faster query speeds
https://blazingdb.com/
Anaconda Python
Open-source focused, GPU-accelerated data science platform
Contains Anaconda Accelerate, Numba, and Dask
Anaconda Accelerate provides access to libraries optimized for performance on NVIDIA GPUs such as CUDA Sorting and cuBLAS.
Numba is a compiler for Python functions that generates native code for GPU hardware.
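As a minimal sketch of what "a compiler for Python functions" means in practice, the example below decorates a plain Python loop with Numba's `njit`. It uses the CPU target so that it runs without a GPU; the GPU target (`numba.cuda.jit`) needs a CUDA device and a kernel-style function. The function name `sum_of_squares` is made up for illustration, and the fallback decorator is an assumption added so the sketch still runs where Numba is not installed.

```python
try:
    # CPU target of the Numba JIT compiler.
    from numba import njit
except Exception:
    # Assumption for this sketch: if Numba is unavailable, fall back to
    # a no-op decorator so the function runs as ordinary Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A plain Python loop; with Numba present it is compiled to native
    # code the first time the function is called.
    total = 0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
print(result)
```

The function body is unchanged Python; only the decorator is added, which is why small rewrites can yield large speedups on loop-heavy code.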
Dask is a parallel computing library for analytic computing in Python. It enables distributed computing in pure Python and integrates with Anaconda Accelerate and Numba.
NUMBA PERFORMANCE
How Fast
Jeremy Howard
Deep learning researcher & educator.
Founder: fast.ai
Faculty: USF & Singularity University
Previously - CEO: Enlitic; President: Kaggle; CEO: Fastmail
Rewrote the PolynomialFeatures from scikit-learn in Numba. Got a 40x speedup in only 12 lines of code.
H2O.ai
Open-source GPU-accelerated machine learning platform
Contains H2O.ai platform
H2O.ai has a working implementation of GPU-accelerated generalized linear modeling.
H2O.ai is working to GPU-accelerate additional machine learning algorithms such as random forests, gradient boosting machines, and clustering.
H2O.ai is working on porting data.table, a columnar data frame library, along with the world's fastest implementation of the sort algorithm to NVIDIA GPUs.
MACHINE LEARNING LIBRARY
H2O4GPU Roadmap
Graphistry
GPU-accelerated graph visualization engine
Consists of Graphistry graph visualization engine
Graphistry uses GPUs in the backend for layout calculation and machine learning.
Graphistry uses GPUs in the frontend for rendering the visualization in a web browser.
Graphistry allows a user to interactively visualize orders of magnitude more data than traditional solutions in an intuitive way.
Different Graphs, Different Questions
• IR: Killchain Analysis
• Hunting: Daily Anomalies
• SecOps: Shadow IT Use
• Threat Intel: Botnet Analysis
• Ops/NOC: Outage Root Cause
• Fraud: Tracking Embezzlers
Gunrock
Open-source GPU-accelerated graph analytics library
Consists of Gunrock graph analytics library
Gunrock has multi-GPU implementations of graph algorithms such as PageRank, Breadth First Search, Single Source Shortest Path, etc.
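For readers unfamiliar with PageRank, one of the algorithms named above, here is a minimal single-threaded sketch of the power-iteration form in plain Python. It is not Gunrock's API or implementation; the function name, the edge-list input, and the default damping factor of 0.85 are assumptions made for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Each node passes an equal share of its rank along each out-edge.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical 4-node graph: nodes 0, 1, and 3 all link to node 2.
edges = [(0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges, num_nodes=4)
print(ranks)
```

The inner loop over edges is the part that parallelizes naturally, since each edge contributes independently to its destination; that is the kind of work a GPU graph library spreads across many threads.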
Gunrock has a high-level API in C that is accessible from Python.
JOIN THE REVOLUTION
Everyone Can Help!
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai
APACHE ARROW: https://arrow.apache.org/ @ApacheArrow
APACHE PARQUET: https://parquet.apache.org/ @ApacheParquet
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
GOAI PARTNER SESSION LINE-UP AT GTC DC 2017
Session #  Topic
DC7213  World's Fastest Machine Learning With GPUs
        Jon Mckinney - Senior Developer, H2O.ai
        Wednesday 11/1, 2:00pm, Hemisphere A
DC7212  Interpretable AI: Not Just For Regulators
        Patrick Hall - Director of Data Science, H2O.ai
        Wednesday 11/1, 2:30pm, Hemisphere A
DC7189  The Impact of GPUs in Geovisualization for Government
        Todd Mostak - CEO & Founder, MapD
        Wednesday 11/1, 5:00pm, Polaris
DC7133  Scaling Event Data Investigations with GPU Visual Graph Analytics
        Leo Meyerovich - CEO, Graphistry, Inc
        Thursday 11/2, 2:00pm, Hemisphere B
DC7111  Accelerating Cyber Threat Detection with GPUs
        Josh Patterson - NVIDIA
        Thursday 11/2, 4:30pm, Atrium Hall
NVIDIA DEEP LEARNING INSTITUTE
Training available as online self-paced labs and instructor-led workshops
Take self-paced labs at www.nvidia.com/dlilabs
Find or request an instructor-led workshop at www.nvidia.com/dli
Educators: download the Teaching Kit at developer.nvidia.com/teaching-kit and contact [email protected] for info on the University Ambassador Program
Fundamentals, Autonomous Vehicles, Healthcare, Media & Entertainment, Machine Vision - IVA, Finance …and more
http://gpuopenanalytics.com/
Thank You !