Intel® Development Tools for High Performance Computing and Machine Learning Tasks: Intel® MKL, Intel® DAAL, Intel® IPP

Developer Products Division

Gennady Fedorov

Oct 2017

Optimized Mathematical Building Blocks

Intel® Math Kernel Library – Intel® MKL
Intel® Data Analytics Acceleration Library – Intel® DAAL
Intel® Integrated Performance Primitives Library – Intel® IPP

Where to get MKL/IPP/DAAL

For MKL/IPP/DAAL
• Sub-component of Intel® Parallel Studio XE / Intel® System Studio
• Free access through the High Performance Libraries program (MKL/IPP/DAAL): https://software.intel.com/en-us/performance-libraries
• YUM/APT repositories for MKL/IPP/DAAL & Intel® Distribution for Python:
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo

Only for MKL
• Cloudera* Parcels support since Intel MKL 2017 Update 2: https://software.intel.com/en-us/articles/installing-intel-mkl-cloudera-cdh-parcel
• Conda* package / Anaconda Cloud* support since Intel MKL 2017 Update 2: https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda

Optimized Mathematical Building Blocks: Intel® MKL

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• PARDISO* SMP & Cluster
• Iterative sparse solvers

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root

Vector RNGs

Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver

*Other names and brands may be claimed as property of others.

Automatic Dispatching to Tuned ISA-specific Code Paths

More cores → More threads → Wider vectors

• 64-bit Intel® Xeon® Processor: up to 1 core / 2 threads, 128-bit SIMD, Intel® SSE3
• Intel® Xeon® Processor 5100 series: up to 2 cores / 2 threads, 128-bit SIMD, Intel® SSE3
• Intel® Xeon® Processor 5500 series: up to 4 cores / 8 threads, 128-bit SIMD, Intel® SSE4.1
• Intel® Xeon® Processor 5600 series: up to 6 cores / 12 threads, 128-bit SIMD, Intel® SSE4.2
• Intel® Xeon® Processor E5-2600 v2 series: up to 12 cores / 24 threads, 256-bit SIMD, Intel® AVX
• Intel® Xeon® Processor E5-2600 v3/v4 series: up to 18–22 cores / 36–44 threads, 256-bit SIMD, Intel® AVX2
• Intel® Xeon® Scalable Processor¹: up to 28 cores / 56 threads, 512-bit SIMD, Intel® AVX-512
• Intel® Xeon Phi™ x200 Processor: up to 72 cores / 288 threads, 512-bit SIMD, Intel® AVX-512

1. Product specification for launched and shipped products available on ark.intel.com.
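The dispatching happens inside the library at run time, so no source changes are needed. As a minimal sketch (not from the original slides), the MKL build in use can be confirmed with mkl_get_version_string(); setting the environment variable MKL_VERBOSE=1 before running additionally logs the code path (e.g. AVX2 or AVX-512) chosen for each MKL call, and MKL_ENABLE_INSTRUCTIONS can restrict the dispatcher if needed.

// Minimal check of the MKL build in use; the run-time dispatcher picks the
// ISA-specific code path automatically.
#include <cstdio>
#include <mkl.h>

int main() {
    char version[256];
    mkl_get_version_string(version, sizeof(version));  // e.g. "Intel(R) Math Kernel Library Version 2018..."
    std::printf("%s\n", version);
    return 0;
}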

Intel® MKL - Performance Benefit to Applications

The latest version of Intel® MKL unleashes the performance benefits of Intel® architectures.

What’s new in Intel® MKL v.2018

Deep Learning for Knights Mill
• New 8-bit and 16-bit integer matrix multiply
• Faster DNN convolution & inner product optimizations

New Vector Math Functions
• NEW! 24 added routines: v?Fmod, v?Remainder, v?Powr, v?Exp2, v?Exp10, v?Log2, v?Logb, v?Cospi, v?Sinpi, v?Tanpi, v?Acospi, v?Asinpi, v?Atanpi, v?Atan2pi, v?Cosd, v?Sind, v?Tand, v?CopySign, v?NextAfter, v?Fdim, v?Fmax, v?Fmin, v?MaxMag, v?MinMag

BLAS
• BLAS group & batch API: efficiency & performance (~1.6x speedup of GEMM_BATCH over GEMM)
• New batched triangular solve (~2x speedup over TRSM)
• Improved ScaLAPACK SGEMM performance
• Compact BLAS and LAPACK functions

LAPACK
• Aasen’s algorithm for factor & solve
• Fast LU factorization without pivoting and inverse without pivoting
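The new vector math routines listed above follow the standard VM calling convention. A minimal sketch (not from the original slides) using vdExp2, one of the 2018 additions; the array contents are arbitrary example values:

// Element-wise 2^x with one of the vector math routines added in MKL 2018.
// All VM routines follow the same (length, input, output) pattern.
#include <cstdio>
#include <mkl.h>

int main() {
    const MKL_INT n = 4;
    double a[] = {0.0, 1.0, 2.5, 10.0};
    double r[4];
    vdExp2(n, a, r);                       // r[i] = 2^a[i]
    for (MKL_INT i = 0; i < n; ++i)
        std::printf("2^%g = %g\n", a[i], r[i]);
    return 0;
}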

MKL v.2018 - direct call feature

• Better performance of selected functions on small sizes (< 32): ?GEMM, ?TRSM; LU/inverse, Cholesky and QR added in MKL 2018
• Partial inlining of small-size kernels from headers
• Reduced call / error-checking / threading overhead
• Enabled with -DMKL_DIRECT_CALL (Linux*) or /DMKL_DIRECT_CALL (Windows*)
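A minimal sketch (not from the original slides) of how the feature is used: the only change is the compile-time define, and an existing small GEMM call can then take the direct-call path. The compile line is one possible example; MKL_DIRECT_CALL_SEQ is the variant for the sequential MKL library.

// Build, for example: icpc -DMKL_DIRECT_CALL dgemm_small.cpp -mkl
#include <mkl.h>

int main() {
    const int n = 8;                          // small sizes (< 32) benefit most
    double a[n * n], b[n * n], c[n * n];
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    // With -DMKL_DIRECT_CALL this call may be partially inlined from the
    // headers, skipping error checking and threading overhead for tiny matrices.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    return 0;
}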

MKL v.2018 - Batch Matrix-Matrix Multiplication (?GEMM_BATCH)

Compute independent matrix-matrix multiplications (GEMMs) simultaneously with a single function call

– Supports all precisions of GEMM and GEMM3M
– Handles varying matrix sizes with a single function call
– Better utilizes multi-/many-core processors for small sizes
– Performance optimizations added in Intel® MKL v.2018
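A minimal sketch (not from the original slides) of the batch interface: one group of equally sized small matrices multiplied with a single cblas_dgemm_batch call; the sizes and batch count are arbitrary example values.

#include <vector>
#include <mkl.h>

int main() {
    const MKL_INT n = 8, batch = 1000;

    std::vector<double> a(batch * n * n, 1.0), b(batch * n * n, 2.0), c(batch * n * n, 0.0);
    std::vector<const double*> a_ptrs(batch), b_ptrs(batch);
    std::vector<double*> c_ptrs(batch);
    for (MKL_INT i = 0; i < batch; ++i) {
        a_ptrs[i] = &a[i * n * n];
        b_ptrs[i] = &b[i * n * n];
        c_ptrs[i] = &c[i * n * n];
    }

    // One entry per group; all matrices in the group share sizes and scalars.
    CBLAS_TRANSPOSE transa = CblasNoTrans, transb = CblasNoTrans;
    MKL_INT m_arr = n, n_arr = n, k_arr = n, lda = n, ldb = n, ldc = n, group_size = batch;
    double alpha = 1.0, beta = 0.0;

    cblas_dgemm_batch(CblasRowMajor, &transa, &transb,
                      &m_arr, &n_arr, &k_arr, &alpha,
                      a_ptrs.data(), &lda, b_ptrs.data(), &ldb,
                      &beta, c_ptrs.data(), &ldc,
                      1 /* group_count */, &group_size);
    return 0;
}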

MKL v.2018 - New Compact BLAS/LAPACK Functions

Compact Layout
• Matrix subgroups are stored as a 3D tensor, with the matrix index increasing first
• Subgroup length is set to the architecture SIMD width
• Consistent layout for all supported BLAS/LAPACK routines/matrices

[Figure: from a rank-1 tensor (vector) in 2D space to a rank-2 tensor (matrix) in 3D]

MKL v.2018 - Compact BLAS Functions

New compact GEMM and TRSM functions
• mkl_?gemm_compact
• mkl_?trsm_compact
Provide better performance for small matrices due to cross-matrix vectorization
• Service functions copy to/from the compact format
• Pack once, unpack only output matrices
• Optimized for Intel® AVX and later

[Chart: single-core performance]
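A rough sketch (not from the original slides) of the pack → compute → unpack workflow around mkl_dgemm_compact; the exact signatures of the compact and service functions below follow the MKL 2018 Developer Reference as I recall them and should be verified against mkl.h, and the sizes are arbitrary example values.

#include <vector>
#include <mkl.h>

int main() {
    const MKL_INT n = 4, nm = 64;                  // 64 small n-by-n matrices

    // Conventional column-major storage: one pointer per matrix.
    std::vector<double> a(nm * n * n, 1.0), c(nm * n * n, 0.0);
    std::vector<const double*> a_ptrs(nm);
    std::vector<double*> c_ptrs(nm);
    for (MKL_INT i = 0; i < nm; ++i) {
        a_ptrs[i] = &a[i * n * n];
        c_ptrs[i] = &c[i * n * n];
    }

    // Pick the SIMD-friendly pack format for this CPU and allocate compact buffers.
    MKL_COMPACT_PACK fmt = mkl_get_format_compact();
    MKL_INT bytes = mkl_dget_size_compact(n, n, fmt, nm);
    std::vector<double> ap(bytes / sizeof(double)), cp(bytes / sizeof(double));

    // Pack once.
    mkl_dgepack_compact(MKL_COL_MAJOR, n, n, a_ptrs.data(), n, ap.data(), n, fmt, nm);

    // C_i = A_i * A_i for every matrix in the compact buffer (cross-matrix vectorization).
    mkl_dgemm_compact(MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS, n, n, n,
                      1.0, ap.data(), n, ap.data(), n, 0.0, cp.data(), n, fmt, nm);

    // Unpack only the output matrices.
    mkl_dgeunpack_compact(MKL_COL_MAJOR, n, n, c_ptrs.data(), n, cp.data(), n, fmt, nm);
    return 0;
}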

Compact LAPACK Functions

[Chart: all-cores performance]

New compact LU factorization with no pivoting, Cholesky and QR factorizations, and inverse
• mkl_?getr[f|i]np_compact
• mkl_?potrf_compact
• mkl_?geqrf_compact
Provide better performance for small matrices due to cross-matrix vectorization
• Service functions copy to/from the compact format
• Pack once, unpack only output matrices

Intel® Data Analytics Acceleration Library

• Targets both data centers (Intel® Xeon® and Intel® Xeon Phi™) and edge devices (Intel® Atom™)

• Perform analysis close to the data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security

• Offload data to a server/cluster for complex and large-scale analytics

Pre-processing → Transformation → Analysis → Modeling → Validation → Decision Making

(Data sources: Business, Web/Social, Scientific/Engineering)

Intel® DAAL Algorithms
• Regression: Linear, Ridge
• Classification: Naïve Bayes, SVM, Classifier boosting, kNN
• Clustering: Kmeans, EM GMM
• Collaborative filtering: ALS
• PCA
• Statistical moments, Quantiles, Variance matrix
• QR, SVD, Cholesky
• Apriori
• Outlier detection
• Neural Networks
• (De-)Compression, (De-)Serialization

Machine Learning in Intel® DAAL
Supervised learning
• Regression: Linear Regression, Ridge Regression, Decision Forest, Decision Tree
• Classification: Boosting (Ada, Brown, Logit), Naïve Bayes, Weak learner, k-NN, Support Vector Machine, Neural networks, Decision Forest, Decision Tree
Unsupervised learning
• Clustering: K-Means, EM for GMM
• Collaborative filtering: Alternating Least Squares
(The original slide's legend distinguishes algorithms supporting batch processing from algorithms supporting batch, online and/or distributed processing.)
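As an illustration of how these algorithms are driven from C++, here is a minimal sketch of the batch K-Means path, modeled on the kmeans_dense_batch example shipped with Intel DAAL; the CSV file name, cluster count and iteration count are placeholders, and the exact API should be checked against the installed DAAL headers.

#include "daal.h"

using namespace daal;
using namespace daal::algorithms;
using namespace daal::data_management;

int main() {
    const size_t nClusters = 20, nIterations = 5;

    // Load the whole CSV file into a numeric table.
    FileDataSource<CSVFeatureManager> dataSource("kmeans_data.csv",
        DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext);
    dataSource.loadDataBlock();
    NumericTablePtr data = dataSource.getNumericTable();

    // Compute initial centroids.
    kmeans::init::Batch<> init(nClusters);
    init.input.set(kmeans::init::data, data);
    init.compute();
    NumericTablePtr centroids = init.getResult()->get(kmeans::init::centroids);

    // Run the clustering itself.
    kmeans::Batch<> algorithm(nClusters, nIterations);
    algorithm.input.set(kmeans::data, data);
    algorithm.input.set(kmeans::inputCentroids, centroids);
    algorithm.compute();

    NumericTablePtr assignments = algorithm.getResult()->get(kmeans::assignments);
    return 0;
}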

Intel® DAAL v.2018 – Decision Tree, Performance

[Charts: Decision tree training and prediction time in seconds, algorithmFPType=float, split criterion=entropy, 1 thread, large datasets (Ethylene, CO); Intel DAAL vs. Scikit-learn]

Configuration Info – Versions: Intel® DAAL 2018, OpenCV 2.4.13.2, Scikit-learn 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @ 1.40 GHz, 68 cores, 1024K L2 cache, 16 GB DDR4 RAM and 8 GB MCDRAM. Operating System: RHEL 7.2 GA x86_64. Ethylene/CO data: https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures

Intel® DAAL v.2018 – Decision Tree, Performance

[Charts: Decision tree training and prediction time in seconds, algorithmFPType=float, split criterion=gini, 68 threads, large datasets (Ethylene, CO); Intel DAAL vs. OpenCV vs. Scikit-learn]

Configuration Info – Versions: Intel® DAAL 2018, OpenCV 2.4.13.2, Scikit-learn 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @ 1.40 GHz, 68 cores, 1024K L2 cache, 16 GB DDR4 RAM and 8 GB MCDRAM. Operating System: RHEL 7.2 GA x86_64. Ethylene/CO data: https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures

Intel® DAAL v.2018 – Decision Forest, Performance

[Charts: Decision Forest training and prediction time in seconds, algorithmFPType=float, split criterion=gini, datasets Letter, Isolet, MNIST; Intel DAAL vs. Ranger]

Configuration Info – Versions: Intel® DAAL 2018, Ranger 0.7.1. Hardware: Intel® Xeon® CPU E5-2680 v3 @ 2.50 GHz, 2 sockets, 12 cores per socket, 504 GB RAM. Operating System: RHEL 7.2 GA x86_64.

Intel® DAAL 2018: K-means (Spark MLlib API)

Example from the MLlib documentation; the changes needed to use the DAAL-optimized K-means are marked in the comments (the original slide color-codes deleted code, added code, and notes).

// import org.apache.spark.mllib.clustering.KMeans     // removed: MLlib K-means
import com.intel.daal.mllib_api.KMeans                  // added: use K-means from the DAAL namespace
import org.apache.spark.mllib.clustering.KMeansModel    // computeCost and predict are still called from the MLlib KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}  // note: it is not safe to inherit from KMeansModel
import org.apache.spark.rdd.RDD

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData: RDD[Vector] = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val clusters: KMeansModel = KMeans.train(parsedData, numClusters, numIterations, initializationMode, seed)
// train: distributed call optimized with DAAL

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save and load the model
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

Intel® IPP - Intel® Integrated Performance Primitives Library

Image Processing
• Geometry transformations
• Linear and non-linear filtering
• Linear transforms
• Statistics and analysis
• Color models

Computer Vision
• Feature detection
• Objects tracking
• Pyramids functions
• Segmentation, enhancement
• Camera functions

Signal Processing
• Transforms
• Convolution, Cross-Correlation
• Signal generation
• Digital filtering
• Statistical
• And more

Data Compression
• LZSS
• LZ77 (ZLIB)
• LZO
• Bzip2

Cryptography
• Symmetric cryptography
• Hash functions
• Data authentication
• Public key

String Processing
• String functions: Find, Insert, Remove, Compare, etc.
• Regular expressions
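All IPP domains follow the same flat C calling convention. A minimal sketch (not from the original slides) using the signal-processing domain listed above: allocate aligned buffers and add two vectors; the buffer length and values are arbitrary example choices.

#include <cstdio>
#include <ipp.h>

int main() {
    const int len = 1024;
    Ipp32f* x = ippsMalloc_32f(len);   // aligned allocations from the IPP allocator
    Ipp32f* y = ippsMalloc_32f(len);
    Ipp32f* z = ippsMalloc_32f(len);

    ippsSet_32f(1.5f, x, len);
    ippsSet_32f(2.5f, y, len);

    IppStatus st = ippsAdd_32f(x, y, z, len);   // z[i] = x[i] + y[i]
    std::printf("status=%s z[0]=%f\n", ippGetStatusString(st), z[0]);

    ippsFree(x); ippsFree(y); ippsFree(z);
    return 0;
}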

Intel® IPP v.2018 – IPP_LZ4 vs LZ4*

[Charts: IPP_LZ4 vs. LZ4*, speedups of ~17 % and ~27 %]

Configuration Info – Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1. Hardware: Intel® Xeon® Platinum 8168 CPU, 2.7 GHz, 24 cores (NP=96), 192 GB 2666 MHz DDR4 dual-rank memory, OS RHEL 7; HT=OFF, Turbo=OFF, EIST=OFF.


* https://github.com/lz4/lz4

Intel® IPP v.2018 – IPP_LZ4, Florida Collection

[Chart: IPP_LZ4 vs. LZ4*, speedup ~22 %]

Name              Dimension   nnz         Size, MB   Description
ASIC_100k         99340       954163      32         circuit simulation problem
inline_1          503712      18660027    589.4      structural problem
ldoor             952203      23737339    648.3      structural problem
dielFilterV2real  1157456     24828204    855        electromagnetics problem
kkt_power         2063494     8130343     213.6      optimization problem
soc-LiveJournal1  4847571     68993773    1011.6     directed graph

Configuration Info – Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1. Hardware: Intel® Xeon® Platinum 8168 CPU, 2.7 GHz, 24 cores (NP=96), 192 GB 2666 MHz DDR4 dual-rank memory, OS RHEL 7; HT=OFF, Turbo=OFF, EIST=OFF.



References

Intel® MKL and MKL Forum pages
• http://software.intel.com/en-us/articles/intel-mkl
• http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation
• http://software.intel.com/en-us/forums/intel-math-kernel-library

Intel® DAAL and DAAL Forum pages
• https://software.intel.com/en-us/intel-daal
• https://software.intel.com/en-us/forums/intel-data-analytics-acceleration-library

Intel® IPP and IPP Forum pages
• https://software.intel.com/en-us/intel-ipp
• https://software.intel.com/en-us/forums/intel-integrated-performance-primitives

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel Confidential — Do Not Forward