How to Use Intel® DAAL with Spark*

Zhang Zhang
Developer Products Division

Agenda
• Overview - Apache Spark™
• What is Intel® DAAL?
• Intel® DAAL in the context of Spark
• Speeding up Spark MLlib with Intel® DAAL
  – Data conversion
  – Programming model of Intel® DAAL distributed processing
  – Example: PCA
  – Example: Linear regression
• Performance
• Future of Intel® DAAL

Optimization Notice. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Overview of Apache Spark™
A fast and general engine for large-scale data processing*.
• Provides more operations than MapReduce
• Increases developer productivity
• Runs on Hadoop, on Mesos, standalone, or in the cloud
• Fault-tolerant distributed data structures (RDD)
• A stack of powerful libraries
* Source: Apache Spark website (http://spark.apache.org)

Intel® Data Analytics Acceleration Library (Intel® DAAL)
An industry-leading data analytics acceleration library for Intel® Architecture, providing fundamental algorithms that cover all machine learning stages.
• Stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making
• Application areas: Business, Web/Social, Scientific/Engineering
• Algorithms shown: PCA, Linear regression, (De-)Compression, Statistical moments, Naïve Bayes, Collaborative filtering, Variance matrix, SVM, QR, SVD, Cholesky, Classifier boosting, Neural Networks, Apriori, K-Means, EM for GMM

What’s in the Package?
• Supports IA-32 and Intel® 64 architectures.
• Supports Linux, Windows, and OS X*.
• C++ and Java APIs. Python support is coming soon.
• Static and dynamic linking.
• A standalone library, and also bundled in Intel® Parallel Studio XE 2016.
• Also available as part of the Intel Performance Libraries Community Edition (free).
• An open-source version of Intel DAAL will soon be available on GitHub.
Note: The bundled version is not available on OS X*.

Where Does Intel DAAL Fit?
[Diagram: Spark* MLlib software stack. Labels: Spark*, Breeze, MLlib, Netlib-Java, JVM, JNI, Netlib BLAS.]

Intel® Data Analytics Acceleration Library
Analysis
• PCA
• Low order moments
• Matrix factorization
• Outlier detection
• Distances
• Association rules
• ……
Machine learning
• Regression
  – Linear regression
• Recommendation
  – ALS
• Classification
  – SVM
  – Naïve Bayes
  – Boosting algorithms
• Clustering
  – K-Means
  – EM for GMM
• ……
Utilities
• Data compression
• Serialization
• Model import/output
Programming languages: C++, Java
Processing modes: Batch processing, Distributed processing, Online processing

Intel® DAAL Distributed Processing Conceptual Model
[Diagram: each input goes through local processing to produce a partial result; the partial results are collected and passed to the master, which produces the final result.]

Intel DAAL in the Context of Spark
[Diagram: the same flow mapped onto a Spark cluster. Executors on the worker nodes run local processing on their inputs and produce partial results. The driver program (SparkContext), connected to the workers through the cluster manager, collects the partial results and runs the master step to produce the final result.]
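The local/master split in the conceptual model can be illustrated without any library. The following is a minimal plain-Java sketch that computes a mean from per-block partial results. The class and method names (PartialMeanSketch, computeLocal, finalizeOnMaster) are hypothetical and are not part of the Intel DAAL or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the local -> collect -> master flow.
// All names are hypothetical; none of them is an Intel DAAL or Spark API.
final class PartialMeanSketch {

    // Partial result produced by one local processing step.
    static final class Partial {
        final long count;
        final double sum;

        Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    // Local processing: runs independently on each block of input.
    static Partial computeLocal(double[] block) {
        double sum = 0.0;
        for (double v : block) {
            sum += v;
        }
        return new Partial(block.length, sum);
    }

    // Master step: merges the collected partial results into the final result.
    static double finalizeOnMaster(List<Partial> parts) {
        long n = 0;
        double sum = 0.0;
        for (Partial p : parts) {
            n += p.count;
            sum += p.sum;
        }
        if (n == 0) {
            throw new IllegalArgumentException("no observations");
        }
        return sum / n;
    }

    public static void main(String[] args) {
        double[][] blocks = { {1.0, 2.0, 3.0}, {4.0, 5.0} };
        List<Partial> collected = new ArrayList<>();
        for (double[] block : blocks) {
            collected.add(computeLocal(block)); // one partial result per block
        }
        System.out.println(finalizeOnMaster(collected));
    }
}
```

The partial result is small and fixed in size regardless of how many observations a block holds, which is what makes collecting the partial results in one place practical.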
Intel DAAL Numeric Tables
• Homogeneous – Dense matrix
  – 2D matrix: n rows (observations), p columns (features)
• Homogeneous – Sparse matrix (CSR)
  – Supports both 0-based and 1-based indexing
• Heterogeneous – AOS
  – Observations are stored in contiguous memory buffers
• Heterogeneous – SOA
  – Features are stored in contiguous memory buffers

m-by-n homogeneous table:
    x00      x01      …  x0(n-1)
    x10      x11      …  x1(n-1)
    …        …        …  …
    x(m-1)0  x(m-1)1  …  x(m-1)(n-1)

Numeric Tables and RDD
[Diagram: a JavaRDD<NumericTable> made up of local NumericTable objects.]

Handle MLlib Distributed Data Structures
[Diagram: JavaRDD<Vector> and RowMatrix are turned into JavaRDD<NumericTable> by user-defined conversion methods.]

Data Conversion: RDD<Vector> to RDD<NumericTable>
[Slide content not preserved in this transcript.]

Data Conversion: RDD<NumericTable> to RDD<Vector>
[Slide content not preserved in this transcript.]

Programming Model of Distributed Processing: Local Processing
Local processing algorithm abstractions:
• Package: com.intel.daal.algorithms
  – pca.DistributedStep1Local
  – linear_regression.training.TrainingDistributedStep1Local
  – ……
Partial result abstractions:
  – pca.PartialResult
  – linear_regression.training.PartialResult
  – ……

For each table in JavaRDD<NumericTable>:

    // Local processing algorithm and params
    DistributedStep1Local alg =
        new DistributedStep1Local(……);
    // Set input
    alg.input.set(InputId.data, table);
    // Compute and access partial result
    PartialResult ret = alg.compute();
    …

Programming Model of Distributed Processing: Master Side
Master-side algorithm abstractions:
• Package: com.intel.daal.algorithms
  – pca.DistributedStep2Master
  – linear_regression.training.TrainingDistributedStep2Master
  – ……
Final result abstractions:
  – pca.Result
  – linear_regression.training.TrainingResult
  – ……

    // Collect partial results from all slaves
    List<PartialResult> partsList = parts.collect();
    // Master side processing algorithm and params
    DistributedStep2Master alg =
        new DistributedStep2Master(……);
    // Set master side processing input
    for (PartialResult val : partsList) {
        ……
        alg.input.add(MasterInputId.partialResults, val);
    }
    // Compute
    alg.compute();
    // Get final result
    Result result = alg.finalizeCompute();

Example: Intel DAAL PCA on Spark
Intel DAAL provides two computation methods:
  – Correlation method (default)
  – SVD method
Input in the form of NumericTables:
  – Non-normalized data, or
  – Normalized data (µ = 0, σ = 1), or
  – Correlation matrix
Output:
  – Scores (a 1×p NumericTable with eigenvalues, largest to smallest)
  – Loadings (a p×p NumericTable with the corresponding eigenvectors)
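To make the correlation method concrete, here is a plain-Java sketch of how a correlation matrix can be assembled from per-block partial results: each block contributes an observation count, per-feature sums, and a cross-product matrix, and the master merges them. This illustrates the general technique only. The names (CorrelationPartialSketch, computeLocal, merge, finalizeCorrelation) are hypothetical, and whether Intel DAAL structures its PCA partial results this way is an assumption. The eigen-decomposition that would turn the correlation matrix into scores and loadings is left out.

```java
// Plain-Java sketch of a correlation matrix built from partial results.
// Hypothetical names; this is not the Intel DAAL API or its implementation.
final class CorrelationPartialSketch {

    // Partial result for one block: count, per-feature sums, cross-products.
    static final class Partial {
        long n;
        final double[] sum;
        final double[][] cross;

        Partial(int p) {
            sum = new double[p];
            cross = new double[p][p];
        }
    }

    // Local step: rows are observations, columns are the p features.
    static Partial computeLocal(double[][] rows, int p) {
        Partial part = new Partial(p);
        for (double[] x : rows) {
            part.n++;
            for (int i = 0; i < p; i++) {
                part.sum[i] += x[i];
                for (int j = 0; j < p; j++) {
                    part.cross[i][j] += x[i] * x[j];
                }
            }
        }
        return part;
    }

    // Master step, part 1: partial results combine by element-wise addition.
    static Partial merge(Partial a, Partial b) {
        int p = a.sum.length;
        Partial m = new Partial(p);
        m.n = a.n + b.n;
        for (int i = 0; i < p; i++) {
            m.sum[i] = a.sum[i] + b.sum[i];
            for (int j = 0; j < p; j++) {
                m.cross[i][j] = a.cross[i][j] + b.cross[i][j];
            }
        }
        return m;
    }

    // Master step, part 2: covariance from the merged sums, then correlation.
    static double[][] finalizeCorrelation(Partial t) {
        if (t.n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        int p = t.sum.length;
        double[][] cov = new double[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                cov[i][j] = (t.cross[i][j] - t.sum[i] * t.sum[j] / t.n) / (t.n - 1);
            }
        }
        double[][] corr = new double[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                corr[i][j] = cov[i][j] / Math.sqrt(cov[i][i] * cov[j][j]);
            }
        }
        return corr;
    }

    public static void main(String[] args) {
        Partial a = computeLocal(new double[][] {{1, 2}, {2, 4}}, 2);
        Partial b = computeLocal(new double[][] {{3, 6}, {4, 8}}, 2);
        double[][] corr = finalizeCorrelation(merge(a, b));
        System.out.println(corr[0][1]);
    }
}
```

The sum-of-products formula is easy to follow but can lose precision when feature means are large relative to their spread, so a production implementation would likely use a more stable update.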
PCA – Slave Side
[Slide content not preserved in this transcript.]

PCA – Master Side
[Slide content not preserved in this transcript.]

Example: Intel DAAL Linear Regression on Spark
Training – distributed processing
• Computation methods:
  – Normal equation method
  – QR method
• Input:
  – An n×p NumericTable of independent variables
  – An n×k NumericTable of corresponding known responses
• Output:
  – A Model object (the intercept, coefficients)
Prediction – batch processing, on master side only
• Input:
  – A Model
  – An m×p NumericTable of unseen data
• Output:
  – An m×k NumericTable of predicted responses

Training Data Representation
[Diagram: a JavaPairRDD<NumericTable, NumericTable>; each local pair holds a table of independent variables and a table of responses.]

Linear Regression Model Training – Slave Side
[Slide content not preserved in this transcript.]

Linear Regression Model Training – Master Side
[Slide content not preserved in this transcript.]

Linear Regression Prediction
[Slide content not preserved in this transcript.]

PCA Performance Boosts Using Intel® DAAL vs. Spark* MLlib on Intel® Architectures
PCA (correlation method) on an 8-node Hadoop* cluster based on Intel® Xeon® Processors E5-2697 v3
[Chart: speedup (y-axis, 0 to 8) by table size (x-axis: 1M x 200, 1M x 400, 1M x 600, 1M x 800, 1M x 1000). The reported speedups range from 4X to 7X.]
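The normal equation method named on the linear regression training slide fits the same local/master pattern as PCA. The sketch below is plain Java with a single response column (k = 1): each block contributes X^T X and X^T y, and the master sums them and solves for the intercept and coefficients. The names (NormalEquationSketch, computeLocal, finalizeOnMaster) are hypothetical, and this is not the Intel DAAL API or a description of its internals.

```java
// Plain-Java sketch of linear regression training via normal equations.
// Hypothetical names; this is not the Intel DAAL API or its implementation.
final class NormalEquationSketch {

    // Partial result for one block: X^T X and X^T y, with a leading
    // column of ones added to X so that beta[0] is the intercept.
    static final class Partial {
        final double[][] xtx;
        final double[] xty;

        Partial(int d) {
            xtx = new double[d][d];
            xty = new double[d];
        }
    }

    // Local step: x holds one observation per row (blocks must be non-empty).
    static Partial computeLocal(double[][] x, double[] y) {
        int d = x[0].length + 1;
        Partial part = new Partial(d);
        double[] row = new double[d];
        for (int r = 0; r < x.length; r++) {
            row[0] = 1.0;
            System.arraycopy(x[r], 0, row, 1, d - 1);
            for (int i = 0; i < d; i++) {
                part.xty[i] += row[i] * y[r];
                for (int j = 0; j < d; j++) {
                    part.xtx[i][j] += row[i] * row[j];
                }
            }
        }
        return part;
    }

    // Master step: sum the partial results, then solve (X^T X) beta = X^T y
    // by Gaussian elimination with partial pivoting.
    static double[] finalizeOnMaster(Partial[] parts) {
        int d = parts[0].xty.length;
        double[][] a = new double[d][d + 1]; // augmented matrix [X^T X | X^T y]
        for (Partial p : parts) {
            for (int i = 0; i < d; i++) {
                a[i][d] += p.xty[i];
                for (int j = 0; j < d; j++) {
                    a[i][j] += p.xtx[i][j];
                }
            }
        }
        for (int c = 0; c < d; c++) {
            int piv = c;
            for (int r = c + 1; r < d; r++) {
                if (Math.abs(a[r][c]) > Math.abs(a[piv][c])) {
                    piv = r;
                }
            }
            double[] tmp = a[c];
            a[c] = a[piv];
            a[piv] = tmp;
            if (Math.abs(a[c][c]) < 1e-12) {
                throw new ArithmeticException("singular normal equations");
            }
            for (int r = c + 1; r < d; r++) {
                double f = a[r][c] / a[c][c];
                for (int k = c; k <= d; k++) {
                    a[r][k] -= f * a[c][k];
                }
            }
        }
        double[] beta = new double[d]; // beta[0] = intercept, rest = coefficients
        for (int i = d - 1; i >= 0; i--) {
            double s = a[i][d];
            for (int j = i + 1; j < d; j++) {
                s -= a[i][j] * beta[j];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Two blocks sampled from y = 1 + 2x.
        Partial p1 = computeLocal(new double[][] {{0}, {1}}, new double[] {1, 3});
        Partial p2 = computeLocal(new double[][] {{2}, {3}}, new double[] {5, 7});
        double[] beta = finalizeOnMaster(new Partial[] {p1, p2});
        System.out.println(beta[0] + " " + beta[1]);
    }
}
```

Because the fitted model is small (one intercept plus p coefficients), prediction can run on the master alone, which matches the batch-mode prediction described on the slide.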
