Data Analytics on Spark

Software & Services Group
Intel® Confidential - INTERNAL USE ONLY

Spark platform for big data analytics

What is Apache Spark?
• Apache Spark is an open-source cluster-computing framework
• Spark has a developed ecosystem
• Designed for massively distributed apps
• Fault tolerance
• Dynamic resource sharing

https://software.intel.com/ai

Intel and the Spark community

• BigDL: a library for Apache Spark
• Intel® Data Analytics Acceleration Library (Intel DAAL): a library that includes support for Apache Spark

• Performance optimizations across Apache Hadoop and Apache Spark open source projects
• Storage and file system: Ceph, Tachyon, and HDFS
• Security: Sentry and integration into various Hadoop & Spark modules
• Benchmark contributions: BigBench, HiBench, CloudSort

Taxonomy

Artificial Intelligence (AI): machines that can sense, reason, and act without explicit programming

Machine Learning (ML), a key tool for AI, is the development and application of algorithms that improve their performance at some task based on experience (previous iterations)

Deep Learning (BigDL focus): algorithms where multiple layers of neurons learn successively complex representations
• CNN, RNN, RBM, …

Classic Machine Learning (DAAL focus): algorithms based on statistical or other techniques for estimating functions from examples
• Dimensionality Reduction, Classification, Clustering, Regression

Training: Build a mathematical model based on a data set

Inference: Use trained model to make predictions about new data
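The training/inference split can be illustrated with a minimal sketch in plain Python. It is not BigDL or Intel DAAL code: the function names `train` and `infer`, the toy data set, and the choice of ordinary least squares for a one-variable linear model are all assumptions made for illustration.

```python
def train(xs, ys):
    """Training: build a model (slope, intercept) from a data set
    using ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x_new):
    """Inference: apply the trained model to data it has not seen."""
    slope, intercept = model
    return slope * x_new + intercept


# Hypothetical toy data generated from y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = train(train_x, train_y)   # training phase
prediction = infer(model, 10.0)   # inference phase
print(model, prediction)
```

Training runs once over the data set and is the expensive step; inference reuses the stored model and only evaluates it on each new input.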

BigDL and Intel DAAL

BigDL and Intel DAAL are machine learning and data analytics libraries natively integrated into the Apache Spark ecosystem

What is BigDL?

• BigDL is a distributed deep learning library for Apache Spark
• Lets you write deep learning applications as standard Spark programs
• Runs on top of existing Spark or Hadoop/Hive clusters
• Feature parity with popular DL frameworks
• High performance: Intel MKL and multi-threaded programming
• Efficient scale-out with all-reduce communications on Spark
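The all-reduce idea mentioned in the last bullet can be sketched as a single-process simulation in plain Python. This is not how BigDL implements it on Spark; the function name `all_reduce_mean`, the three simulated workers, and the gradient values are assumptions chosen only to show the pattern: every worker contributes a local gradient, and every worker receives the same aggregated result.

```python
def all_reduce_mean(worker_vectors):
    """Simulated all-reduce: return one copy per worker of the
    element-wise mean of all workers' vectors."""
    num_workers = len(worker_vectors)
    length = len(worker_vectors[0])
    # Reduce: element-wise sum across workers.
    total = [sum(vec[i] for vec in worker_vectors) for i in range(length)]
    mean = [t / num_workers for t in total]
    # Broadcast: each worker gets its own copy of the same result.
    return [list(mean) for _ in range(num_workers)]


# Hypothetical local gradients, one per worker, each computed on
# that worker's own partition of the training data.
local_gradients = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

synced = all_reduce_mean(local_gradients)

# Every worker applies the identical update, so model replicas stay in sync.
learning_rate = 0.5
weights = [[0.0, 0.0] for _ in local_gradients]
for w, g in zip(weights, synced):
    for i in range(len(w)):
        w[i] -= learning_rate * g[i]

print(synced[0], weights[0])
```

Because all replicas see the same averaged gradient, no central coordinator has to hold the authoritative copy of the model.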

BigDL: Ease of Use

You may want to write your deep learning programs using BigDL if you need to:
• Analyze “big data” using deep learning on the same Hadoop/Spark cluster where the data are stored
• Add deep learning functionality to Big Data (Spark) programs and/or workflows
• Leverage existing Hadoop/Spark clusters to run deep learning applications
• Dynamically share the cluster with other workloads (e.g., ETL, data warehouse, classical machine learning, graph analytics, etc.)
• Make deep learning more accessible to Big Data users and data scientists, who are usually not deep learning experts

BigDL can re-use/fine-tune models from other frameworks

• Load existing Caffe/Torch models
• Allows for transition from single-node to distributed application deployment
• Useful for inference
• Allows for minor model tuning
• Allows for model sharing between Data Scientists and Production Engineers

[Diagram: BigDL loads BigDL, Caffe, and Torch model files from storage and saves models back to storage.]

Python API Support

Based on PySpark, the Python API in BigDL allows use of existing Python libraries:
• NumPy
• SciPy
• pandas
• scikit-learn
• NLTK
• Matplotlib
• …

BigDL Examples

BigDL provides examples to help developers experiment with BigDL and get started with popular models.
https://github.com/intel-analytics/BigDL/wiki/Examples

Models (training and inference example code):
. LeNet, Inception, VGG, ResNet, RNN, Auto-encoder

Examples:
• Text Classification
• Image Classification
• Load Torch/Caffe model

BigDL performance and scale-out

• Single-node Xeon performance
  . Benchmarked to be best on Xeon E5-26XX v3 or E5-26XX v4
  . Orders of magnitude speedup vs. out-of-box open source Caffe, Torch or TensorFlow
• Scaling out
  . Efficiently scales out to tens to hundreds of Xeon servers on Spark

* For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

BigDL Features

Layers
. 90+ layers

Criterion
. 10+ loss functions

Optimization
. SGD, Adagrad, LBFGS, etc.
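To make two of the optimization methods named above concrete, here is a minimal sketch of the textbook SGD and Adagrad update rules in plain Python. It does not use BigDL's optimizer classes; the one-dimensional objective f(w) = (w - 3)^2, the function names, and the hyperparameters are assumptions picked for illustration.

```python
import math


def gradient(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def sgd(w, steps, lr):
    """Plain gradient descent step: w <- w - lr * g."""
    for _ in range(steps):
        w -= lr * gradient(w)
    return w


def adagrad(w, steps, lr, eps=1e-8):
    """Adagrad: scale each step by the root of the accumulated
    squared gradients, so the effective step size shrinks over time."""
    accum = 0.0
    for _ in range(steps):
        g = gradient(w)
        accum += g * g
        w -= lr * g / (math.sqrt(accum) + eps)
    return w


w_sgd = sgd(0.0, steps=100, lr=0.1)
w_adagrad = adagrad(0.0, steps=500, lr=1.0)
print(w_sgd, w_adagrad)  # both approach the minimizer w = 3
```

With a single parameter the difference is only in the step-size schedule; with many parameters Adagrad keeps one accumulator per parameter, which gives rarely updated parameters larger steps.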

BigDL Features: text classification

Build and train model

Evaluate the prediction result

BigDL Resources

https://github.com/intel-analytics/BigDL
Software.intel.com/bigdl

Join Our Mail List [email protected]

Report Bugs And Create Feature Request https://github.com/intel-analytics/BigDL/issues

What is Intel® DAAL?

An Intel-optimized library that provides building blocks for all data analytics stages, from data preparation to data mining & machine learning

. Python, Java & C++ APIs
. Developed by the same team as the industry-leading Intel® Math Kernel Library
. Can be used with many platforms (Hadoop*, Spark*, R*, …) but not tied to any of them
. Open source; free community-supported and commercial premium-supported options
. Flexible interface to connect to different data sources (CSV, SQL, HDFS, …)
. Also included in Parallel Studio XE suites
. Windows*, Linux*, and OS X*

Why Intel® DAAL?

Automatic performance scaling
. Scale-up: from core to multicore to multi-socket
. Scale-out: from in-memory analysis to clusters to cloud

A rich set of analytics algorithms
. Widely applicable to most data mining and machine learning workloads

Leverages decades of R&D work in code optimization on IA
. By the same team behind Intel® Math Kernel Library (Intel® MKL)

Who should use Intel® DAAL?

Software developers
• Need optimized ML algorithms in their apps
• No resources/time/expertise to manually optimize themselves

Data Scientists
• Build and execute math models for domain-specific knowledge discovery
• Need to speed up the performance-critical parts of their models

Data Analytics ISVs
• Want competitive advantages by making their solutions run faster on IA

Big Data System Integrators
• Want to beef up their product portfolio by providing performance-enhanced alternatives to popular open-source analytics tools

Ideas Behind Intel® DAAL: Heterogeneous Analytics

Data is different, data analytics pipeline is the same

Data transfer between devices is costly, protocols are different

. Need data analysis proximity to Data Source

. Need data analysis proximity to Client

. Data Source device ≠ Client device

. Requires abstraction from communication protocols

Devices: Data Source Edge, Compute (Server, Desktop, …), Client Edge

Data types: Business, Web/Social, Scientific/Engineering

Pipeline stages:
. Pre-processing: Decompression, Filtering, Normalization
. Transformation: Aggregation, Dimension Reduction
. Analysis: Summary Statistics, Clustering, etc.
. Modeling: Machine Learning (Training), Parameter Estimation, Simulation
. Validation: Hypothesis testing, Model errors
. Decision Making: Forecasting, Decision Trees, etc.

Ideas Behind Intel DAAL: Languages & Platforms

• Intel DAAL has multiple programming language bindings
  . C++: ultimate performance for real-time analytics with Intel DAAL
  . Java*/Scala*: easy integration with Big Data platforms (Hadoop*, Spark*, etc.)
  . Python*: advanced analytics for data scientists

Intel® DAAL Algorithms

Data Transformation and Analysis in Intel® DAAL
. Basic statistics for datasets: Low order moments, Quantiles, Order statistics
. Correlation and dependence: Cosine distance, Correlation distance, Variance-Covariance matrix
. Matrix factorizations: SVD, QR, Cholesky
. Dimensionality reduction: PCA
. Outlier detection: Univariate, Multivariate
. Association rule mining (Apriori)
. Optimization solvers (SGD, AdaGrad, lBFGS)
. Math functions (exp, log, …)

Legend: Algorithms supporting batch processing; Algorithms supporting distributed computation in Apache Spark, Hadoop and MPI
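As an illustration of what distributed computation means for an algorithm such as low order moments, the sketch below computes a mean and variance from per-partition partial results that are then merged. It is plain Python and does not use Intel DAAL's API; the function names, the three-partition data set, and the (count, sum, sum of squares) summary are assumptions made for illustration.

```python
from functools import reduce


def partial_moments(partition):
    """Local step: summarize one partition as (count, sum, sum of squares)."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return (n, s, ss)


def merge(a, b):
    """Combine two partial results; order and grouping do not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(moments):
    """Final step: turn the merged summary into mean and population variance.
    Note: this sum-of-squares formula is simple but can lose precision
    when the mean is large relative to the spread of the data."""
    n, s, ss = moments
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


# Hypothetical data set split across three nodes.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

partials = [partial_moments(p) for p in partitions]  # runs where the data lives
combined = reduce(merge, partials)                    # small summaries travel, not raw data
mean, variance = finalize(combined)
print(combined, mean, variance)
```

Only the three-number summaries cross the network, which is the property that makes this class of algorithm a good fit for Spark, Hadoop, or MPI.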

Intel® DAAL Algorithms

Machine Learning in Intel® DAAL

Supervised learning
. Regression: Ridge Regression
. Classification: Weak learner, Boosting (Ada, Brown, Logit), Naïve Bayes, kNN, Support Vector Machine
. Decision Forest
. Neural networks

Unsupervised learning
. Clustering: K-Means, EM for GMM

Collaborative filtering
. Alternating Least Squares

Legend: Algorithms supporting batch processing; Algorithms supporting distributed computation in Apache Spark, Hadoop and MPI

PCA Performance Boosts Using Intel® DAAL vs. Spark* MLlib on an Eight-node Cluster

[Bar chart: PCA speedup (y-axis: Speedup, 0 to 14) for eight dataset sizes: 1M rows with 200, 400, 600, 800, and 1K columns; 10M, 20M, and 40M rows with 5K columns. Bar labels range from 3.4x to 11.6x.]

Configuration Info - Versions: Intel® Data Analytics Acceleration Library 2017, Spark 1.2; Hardware: Intel® Xeon® Processor E5-2699 v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.

DAAL Resources

https://github.com/01org/daal
Software.intel.com/DAAL

DAAL Forum software.intel.com/en-us/forums/intel-data-analytics-acceleration-library

Report Bugs And Create Feature Request https://github.com/01org/daal/issues
