Intel® Software Development Tools for High Performance Computing and Machine Learning Tasks
Intel® MKL, Intel® DAAL, Intel® IPP
Developer Products Division
Gennady Fedorov
Oct 2017

Optimized Mathematical Building Blocks
• Intel® Math Kernel Library – Intel® MKL
• Intel® Data Analytics Acceleration Library – Intel® DAAL
• Intel® Integrated Performance Primitives Library – Intel® IPP
Where to get MKL/IPP/DAAL
For MKL/IPP/DAAL
• Sub-component of Intel Parallel Studio XE/Intel System Studio
• Free access through High Performance Library (MKL/IPP/DAAL)
  https://software.intel.com/en-us/performance-libraries
• YUM/APT repository for MKL/IPP/DAAL & Intel Distribution for Python
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo
Only for MKL
• Cloudera* Parcels support since Intel MKL 2017 Update 2
  https://software.intel.com/en-us/articles/installing-intel-mkl-cloudera-cdh-parcel
• Conda* package/Anaconda Cloud* support since Intel MKL 2017 Update 2
  https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda
Optimized Mathematical Building Blocks: Intel® MKL

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• PARDISO* SMP & Cluster
• Iterative sparse solvers

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs

Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
*Other names and brands may be claimed as property of others.
Automatic Dispatching to Tuned ISA-specific Code Paths
More cores, more threads, wider vectors

Processor                                           Up to Core(s)  Up to Threads  SIMD Width  Vector ISA
Intel® Xeon® Processor 64-bit                       1              2              128         Intel® SSE3
Intel® Xeon® Processor 5100 series                  2              2              128         Intel® SSE3
Intel® Xeon® Processor 5500 series                  4              8              128         Intel® SSE4.1
Intel® Xeon® Processor 5600 series                  6              12             128         Intel® SSE4.2
Intel® Xeon® Processor E5-2600 v2 series            12             24             256         Intel® AVX
Intel® Xeon® Processor E5-2600 v3 series/v4 series  18-22          36-44          256         Intel® AVX2
Intel® Xeon® Scalable Processor¹                    28             56             512         Intel® AVX-512
Intel® Xeon Phi™ x200 Processor                     72             288            512         Intel® AVX-512
1. Product specification for launched and shipped products available on ark.intel.com.
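To make the "SIMD Width" column concrete, the following minimal sketch converts a register width in bits into the number of single- and double-precision elements processed per vector instruction. The helper names `simd_lanes` and `lanes_table` are hypothetical and are not part of any Intel library; the widths are the 128/256/512-bit values from the table above.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# SIMD widths (in bits) taken from the table above.
WIDTHS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes_table():
    """Map each ISA family to its float32 and float64 lane counts."""
    return {
        isa: {"float32": simd_lanes(bits, 32), "float64": simd_lanes(bits, 64)}
        for isa, bits in WIDTHS.items()
    }


if __name__ == "__main__":
    for isa, lanes in lanes_table().items():
        print(isa, lanes)
```

Doubling the register width doubles the elements handled per instruction, which is why a library that dispatches to the widest available ISA can speed up the same source code on newer processors.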
Intel® MKL - Performance Benefit to Applications
The latest version of Intel® MKL unleashes the performance benefits of Intel architectures

What’s new in Intel® MKL v.2018
Deep Learning for Knights Mill
• New 8-bit and 16-bit integer matrix multiply routines
• 2x speedup, SGEMM over DNN
• Faster DNN convolution and inner product optimizations

New Vector Math Functions
• NEW! 24 added: v?Fmod, v?Remainder, v?Powr, v?Exp2, v?Exp10, v?Log2, v?Logb, v?Cospi, v?Sinpi, v?Tanpi, v?Acospi, v?Asinpi, v?Atanpi, v?Atan2pi, v?Cosd, v?Sind, v?Tand, v?CopySign, v?NextAfter, v?Fdim, v?Fmax, v?Fmin, v?MaxMag, v?MinMag

BLAS
• BLAS group and batch: efficiency and performance
• New: batched optimized triangular solve (TRSM)
• ~1.6x speedup, GEMM_BATCH over GEMM
• Compact BLAS and LAPACK functions

LAPACK
• Aasen’s factor and solve algorithm
• Improved ScaLAPACK performance
• Fast LU factorization and inverse without pivoting
MKL v.2018 - direct call feature
• Better performance of selected functions on small sizes (<32): ?GEMM and ?TRSM, plus LU/Inverse, Cholesky, and QR added in MKL 2018
• Partial inlining of small-size kernels from headers
• Reduced call, error-checking, and threading overhead
• Enabled by the compiler option -DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL
MKL v.2018 - Batch Matrix-Matrix Multiplication
Compute independent matrix-matrix multiplications (GEMMs) simultaneously with a single function call
– Supports all precisions of GEMM and GEMM3M
– Handles varying matrix sizes with a single function call
– Better utilizes multi/many-core processors for small sizes
– Performance optimizations added in Intel® MKL v.2018
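The semantics of a batched GEMM can be sketched in plain Python. This is a reference illustration only, not the Intel MKL interface: the function names `gemm` and `gemm_batch` and the dictionary-based group description are hypothetical, and the grouping of equally shaped problems is an assumption made to show how varying sizes can be handled in one call.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def gemm_batch(groups):
    """Run every independent GEMM described by `groups` in a single call.

    Each group is a dict with scalars 'alpha' and 'beta' and a list
    'problems' of (a, b, c) triples that share one shape; shapes may
    differ from group to group.
    """
    results = []
    for group in groups:
        for a, b, c in group["problems"]:
            results.append(gemm(group["alpha"], a, b, group["beta"], c))
    return results


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    zeros = [[0, 0], [0, 0]]
    groups = [
        {"alpha": 1, "beta": 0,
         "problems": [([[1, 2], [3, 4]], identity, zeros),
                      ([[2, 0], [0, 2]], [[1, 1], [1, 1]], zeros)]},
        {"alpha": 2, "beta": 1,
         "problems": [([[1, 2, 3]], [[1], [1], [1]], [[10]])]},
    ]
    print(gemm_batch(groups))
```

Because the multiplications are independent, an optimized implementation is free to distribute them across cores instead of running the sequential loop shown here, which is where the benefit for small matrices comes from.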
MKL v.2018 - New Compact BLAS/LAPACK Functions
Compact Layout
• Matrix subgroups are oriented as 3D tensor with matrix number increasing first
• Subgroup length is set to the architecture SIMD width
• Consistent layout for all supported BLAS/LAPACK routines/matrices
From 2D space rank-1 tensor (vector) to 3D rank-2 tensor (matrix)
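A minimal sketch of the interleaving idea follows. It assumes equally sized matrices, a column-major walk over the element positions inside each subgroup, and zero padding of an incomplete last subgroup; these three choices, and the function names `pack_compact` and `unpack_compact`, are illustrative assumptions rather than the library's actual format or service functions.

```python
def pack_compact(matrices, simd_len):
    """Interleave matrices so the matrix index varies fastest in each subgroup."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    packed = []
    for start in range(0, len(matrices), simd_len):
        pack = matrices[start:start + simd_len]
        for j in range(cols):          # column-major walk over element (i, j)
            for i in range(rows):
                for lane in range(simd_len):
                    # Pad an incomplete last subgroup with zeros.
                    packed.append(pack[lane][i][j] if lane < len(pack) else 0.0)
    return packed


def unpack_compact(packed, count, rows, cols, simd_len):
    """Inverse of pack_compact: recover `count` matrices of shape rows x cols."""
    matrices = [[[0.0] * cols for _ in range(rows)] for _ in range(count)]
    pack_size = rows * cols * simd_len
    for m in range(count):
        base = (m // simd_len) * pack_size   # start of this matrix's subgroup
        lane = m % simd_len                  # position inside the subgroup
        for j in range(cols):
            for i in range(rows):
                matrices[m][i][j] = packed[base + (j * rows + i) * simd_len + lane]
    return matrices


if __name__ == "__main__":
    mats = [[[1, 2], [3, 4]], [[10, 20], [30, 40]], [[100, 200], [300, 400]]]
    flat = pack_compact(mats, simd_len=2)
    print(flat[:4])
    print(unpack_compact(flat, 3, 2, 2, 2) == mats)
```

With this ordering, element (i, j) of `simd_len` consecutive matrices sits in adjacent memory, so one vector load fetches the same element from several matrices and a single vector instruction advances all of them at once.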
MKL v.2018 - Compact BLAS Functions
New compact GEMM and TRSM functions
• mkl_?gemm_compact
• mkl_?trsm_compact
Provides better performance for small matrices due to cross-matrix vectorization
– Service functions copying to/from compact format
– Pack once, unpack only output matrices
– Optimized for Intel AVX and later
[Chart: single-core performance]
Compact LAPACK Functions
New compact LU factorization with no pivoting, Cholesky and QR factorizations, and Inverse
• mkl_?getr[f|i]np_compact
• mkl_?potrf_compact
• mkl_?geqrf_compact
Provides better performance for small matrices due to cross-matrix vectorization
– Service functions copying to/from compact format
– Pack once, unpack only output matrices
[Chart: all-cores performance]
Intel® Data Analytics Acceleration Library
• Targets both data centers (Intel® Xeon® and Intel® Xeon Phi™) and edge-devices (Intel® Atom)
• Perform analysis close to data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security
• Offload data to server/cluster for complex and large-scale analytics
Data sources: Business, Web/Social, Scientific/Engineering
Pipeline stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making
Intel® DAAL Algorithms
• (De-)Compression
• (De-)Serialization
• PCA
• Statistical moments
• Quantiles
• Variance matrix
• QR, SVD, Cholesky
• Apriori
• Outlier detection
• Regression: Linear, Ridge
• Classification: Naïve Bayes, SVM, Classifier boosting, kNN
• Clustering: Kmeans, EM GMM
• Collaborative filtering: ALS
• Neural Networks

Machine Learning in Intel® DAAL
• Supervised learning
  – Regression: Linear Regression, Ridge Regression
  – Classification: Decision Forest, Decision Tree, Boosting (Ada, Brown, Logit), Naïve Bayes, Weak learner, k-NN, Neural networks, Support Vector Machine
• Unsupervised learning
  – Clustering: K-Means, EM for GMM
• Collaborative filtering: Alternating Least Squares
Legend: algorithms supporting batch processing; algorithms supporting batch, online and/or distributed processing
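The difference between batch processing and online or distributed processing can be illustrated with a small sketch: each data block produces a partial result, and partial results are merged into the final answer. The example below does this for mean and variance using the standard pairwise update of sums of squared deviations. The names `Moments`, `partial`, and `merge` are hypothetical and do not correspond to the Intel DAAL API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Moments:
    """Partial result: count, mean, and sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 when fewer than two observations were seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def partial(block):
    """Compute a partial result from one block of data (batch mode on a block)."""
    n = len(block)
    if n == 0:
        return Moments()
    mean = sum(block) / n
    return Moments(n, mean, sum((x - mean) ** 2 for x in block))


def merge(a, b):
    """Combine two partial results without revisiting the raw data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    blocks = [data[:3], data[3:5], data[5:]]
    batch = partial(data)                                    # all data at once
    online = reduce(merge, (partial(b) for b in blocks), Moments())
    print(batch.mean, batch.variance)
    print(online.mean, online.variance)
```

In an online setting the blocks arrive over time and are merged one by one; in a distributed setting the partial results are computed on different nodes and merged on one of them. In both cases only the small partial results move, not the raw data.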
Intel® DAAL v.2018 – Decision Tree, Performance
[Figure: two bar charts, decision tree training time (seconds) and prediction time (seconds), on the Ethylene and CO datasets; algorithmFPType=float, split criterion=entropy, 1 thread, large datasets; series: DAAL, Scikit-learn]
Configuration Info – Versions: Intel® DAAL® 2018, OpenCV® 2.4.13.2, Scikit-learn® 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @1.40GHz, 68 cores (1024K L2 cache, 16GB of DDR4 RAM and 8 GB MCDRAM); Operating System: RHEL 7.2 GA x86_64; Ethylene/CO Data -- https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Intel® DAAL v.2018 – Decision Tree, Performance
[Figure: two bar charts, decision tree training time (seconds) and prediction time (seconds), on the Ethylene and CO datasets; algorithmFPType=float, split criterion=gini, 68 threads, large datasets; series: DAAL, OpenCV, Scikit-learn]
Configuration Info – Versions: Intel® DAAL® 2018, OpenCV® 2.4.13.2, Scikit-learn® 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @1.40GHz, 68 cores (1024K L2 cache, 16GB of DDR4 RAM and 8 GB MCDRAM); Operating System: RHEL 7.2 GA x86_64; Ethylene/CO Data -- https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures
Benchmark Source: Intel Corporation

Intel® DAAL v.2018 – Decision Forest, Performance
[Figure: two bar charts, decision forest training time (seconds) and prediction time (seconds), on the Letter, Isolet, and MNIST datasets; algorithmFPType=float, split criterion=gini; series: DAAL, Ranger]
Configuration Info – Versions: Intel® DAAL® 2018, Ranger® 0.7.1. Hardware: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 2 sockets, 12 cores per socket, 504GB RAM; Operating System: RHEL 7.2 GA x86_64
Benchmark Source: Intel Corporation
Intel® DAAL 2018: K-means (MLlib API)
Notation: comments marked "deleted" and "added" show the code changes; the remaining comments are notes.

// Example from the MLlib documentation
// import org.apache.spark.mllib.clustering.KMeans   // deleted
import com.intel.daal.mllib_api.KMeans                // added: use K-means from the DAAL namespace
// computeCost and predict are called from the MLlib KMeansModel;
// it is not safe to inherit from KMeansModel
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData: RDD[Vector] = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
// numClusters, numIterations, initializationMode, and the optional seed are defined by the caller
// train: distributed call optimized with DAAL
val clusters: KMeansModel = KMeans.train(parsedData, numClusters, numIterations, initializationMode, seed)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save and load model
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
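For readers unfamiliar with the metric printed above, the Within Set Sum of Squared Errors is the sum, over all points, of the squared Euclidean distance to the nearest cluster center. The sketch below computes it in plain Python; the function names and the sample points are illustrative assumptions and are unrelated to the Spark or DAAL implementations.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    centers = [(0, 1), (10, 11)]
    print(wssse(points, centers))
```

A lower value means the points lie closer to their assigned centers, so the metric is commonly used to compare clusterings obtained with different numbers of clusters or different initializations.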
Intel® IPP - Intel® Integrated Performance Primitives Library

Image Processing
• Geometry transformations
• Linear and non-linear filtering
• Linear transforms
• Statistics and analysis
• Color models

Computer Vision
• Feature detection
• Objects tracking
• Pyramids functions
• Segmentation, enhancement
• Camera functions

Signal Processing
• Transforms
• Convolution, Cross-Correlation
• Signal generation
• Digital filtering
• Statistical

Data Compression
• LZSS
• LZ77 (ZLIB)
• LZO
• Bzip2

Cryptography
• Symmetric cryptography
• Hash functions
• Data authentication
• Public key

String Processing
• String Functions: Find, Insert, Remove, Compare, etc.
• Regular expression

• And more
Intel® IPP v.2018 – IPP_LZ4 vs LZ4*
[Figure: two charts comparing IPP_LZ4 with LZ4*; average speedup ~17% and average speedup ~27%]
Configuration Info - Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1; Hardware: CPU = Intel(R) Xeon(R) Platinum 8168, 2.7GHz 24c (NP=96), Memory - 192GB 2666MHz DDR4 Dual-rank, OS RHEL 7.; HT=OFF, Turbo=OFF, EIST=OFF
Benchmark Source: Intel Corporation
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .
* -- https://github.com/lz4/lz4

Intel® IPP v.2018 – IPP_LZ4, Florida Collection
[Figure: IPP_LZ4 vs LZ4* on matrices from the Florida collection; average speedup ~22%]

Name               Dimension  nnz       Size, MB  Description
ASIC_100k          99340      954163    32        circuit simulation problem
inline_1           503712     18660027  589.4     structural problem
ldoor              952203     23737339  648.3     structural problem
dielFilterV2real   1157456    24828204  855       electromagnetics problem
kkt_power          2063494    8130343   213.6     optimization problem
soc-LiveJournal1   4847571    68993773  1011.6    directed graph
Configuration Info - Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1; Hardware: CPU = Intel(R) Xeon(R) Platinum 8168, 2.7GHz 24c (NP=96), Memory - 192GB 2666MHz DDR4 Dual-rank, OS RHEL 7.; HT=OFF, Turbo=OFF, EIST=OFF
References
Intel® MKL and MKL Forum pages
• http://software.intel.com/en-us/articles/intel-mkl
• http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation
• http://software.intel.com/en-us/forums/intel-math-kernel-library
Intel® DAAL and DAAL Forum pages
• https://software.intel.com/en-us/intel-daal
• https://software.intel.com/en-us/forums/intel-data-analytics-acceleration-library
Intel® IPP and IPP Forum pages
• https://software.intel.com/en-us/intel-ipp
• https://software.intel.com/en-us/forums/intel-integrated-performance-primitives
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804