Intel® Development Tools for High Performance Computing and Machine Learning Tasks: Intel® MKL, Intel® DAAL, Intel® IPP

Developer Products Division

Gennady Fedorov

Oct 2017

Optimized Mathematical Building Blocks

Intel® Math Kernel Library – Intel® MKL
Intel® Data Analytics Acceleration Library – Intel® DAAL
Intel® Integrated Performance Primitives Library – Intel® IPP

Where to get MKL/IPP/DAAL

For MKL/IPP/DAAL
• Sub-component of Intel® Parallel Studio XE / Intel® System Studio
• Free access through the High Performance Libraries program (MKL/IPP/DAAL): https://software.intel.com/en-us/performance-libraries
• YUM/APT repositories for MKL/IPP/DAAL & Intel® Distribution for Python:
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo

Only for MKL
• Cloudera* Parcels support since Intel MKL 2017 Update 2: https://software.intel.com/en-us/articles/installing-intel-mkl-cloudera-cdh-parcel
• Conda* package / Anaconda Cloud* support since Intel MKL 2017 Update 2: https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda

Optimized Mathematical Building Blocks: Intel® MKL

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• PARDISO* SMP & Cluster
• Iterative sparse solvers

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root

Vector RNGs

Deep Neural Networks
• Convolution
• Pooling
• Normalization
• ReLU
• Inner Product

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver

*Other names and brands may be claimed as property of others.

Automatic Dispatching to Tuned ISA-specific Code Paths

More cores → More threads → Wider vectors

• 64-bit Intel® Xeon® Processor: up to 1 core / 2 threads, 128-bit SIMD, Intel® SSE3
• Intel® Xeon® Processor 5100 series: up to 2 cores / 2 threads, 128-bit SIMD, Intel® SSE3
• Intel® Xeon® Processor 5500 series: up to 4 cores / 8 threads, 128-bit SIMD, Intel® SSE4.1
• Intel® Xeon® Processor 5600 series: up to 6 cores / 12 threads, 128-bit SIMD, Intel® SSE4.2
• Intel® Xeon® Processor E5-2600 v2 series: up to 12 cores / 24 threads, 256-bit SIMD, Intel® AVX
• Intel® Xeon® Processor E5-2600 v3/v4 series: up to 18–22 cores / 36–44 threads, 256-bit SIMD, Intel® AVX2
• Intel® Xeon® Scalable Processor¹: up to 28 cores / 56 threads, 512-bit SIMD, Intel® AVX-512
• Intel® Xeon Phi™ x200 Processor: up to 72 cores / 288 threads, 512-bit SIMD, Intel® AVX-512

1. Product specification for launched and shipped products available on ark.intel.com.
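The dispatching happens inside the library at run time, so no source changes are needed. As a minimal sketch (not from the original slides), the MKL build in use can be confirmed with mkl_get_version_string(); setting the environment variable MKL_VERBOSE=1 before running additionally logs the code path (e.g. AVX2 or AVX-512) chosen for each MKL call, and MKL_ENABLE_INSTRUCTIONS can restrict the dispatcher if needed.

// Minimal check of the MKL build in use; the run-time dispatcher picks the
// ISA-specific code path automatically.
#include <cstdio>
#include <mkl.h>

int main() {
    char version[256];
    mkl_get_version_string(version, sizeof(version));  // e.g. "Intel(R) Math Kernel Library Version 2018..."
    std::printf("%s\n", version);
    return 0;
}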

Intel® MKL - Performance Benefit to Applications

The latest version of Intel® MKL unleashes the performance benefits of Intel® architectures.

What’s new in Intel® MKL v.2018

Deep Learning for Knights Mill
• New 8-bit and 16-bit integer matrix multiply
• Faster DNN convolution & inner product optimizations

New Vector Math Functions
• NEW! 24 added routines: v?Fmod, v?Remainder, v?Powr, v?Exp2, v?Exp10, v?Log2, v?Logb, v?Cospi, v?Sinpi, v?Tanpi, v?Acospi, v?Asinpi, v?Atanpi, v?Atan2pi, v?Cosd, v?Sind, v?Tand, v?CopySign, v?NextAfter, v?Fdim, v?Fmax, v?Fmin, v?MaxMag, v?MinMag

BLAS
• BLAS group & batch API: efficiency & performance (~1.6x speedup of GEMM_BATCH over GEMM)
• New batched triangular solve (~2x speedup over TRSM)
• Improved ScaLAPACK SGEMM performance
• Compact BLAS and LAPACK functions

LAPACK
• Aasen’s algorithm for factor & solve
• Fast LU factorization without pivoting and inverse without pivoting
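The new vector math routines listed above follow the standard VM calling convention. A minimal sketch (not from the original slides) using vdExp2, one of the 2018 additions; the array contents are arbitrary example values:

// Element-wise 2^x with one of the vector math routines added in MKL 2018.
// All VM routines follow the same (length, input, output) pattern.
#include <cstdio>
#include <mkl.h>

int main() {
    const MKL_INT n = 4;
    double a[] = {0.0, 1.0, 2.5, 10.0};
    double r[4];
    vdExp2(n, a, r);                       // r[i] = 2^a[i]
    for (MKL_INT i = 0; i < n; ++i)
        std::printf("2^%g = %g\n", a[i], r[i]);
    return 0;
}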

MKL v.2018 - direct call feature

• Better performance of selected functions on small sizes (< 32): ?GEMM, ?TRSM; LU/inverse, Cholesky and QR added in MKL 2018
• Partial inlining of small-size kernels from headers
• Reduced call / error-checking / threading overhead
• Enabled with -DMKL_DIRECT_CALL (Linux*) or /DMKL_DIRECT_CALL (Windows*)
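A minimal sketch (not from the original slides) of how the feature is used: the only change is the compile-time define, and an existing small GEMM call can then take the direct-call path. The compile line is one possible example; MKL_DIRECT_CALL_SEQ is the variant for the sequential MKL library.

// Build, for example: icpc -DMKL_DIRECT_CALL dgemm_small.cpp -mkl
#include <mkl.h>

int main() {
    const int n = 8;                          // small sizes (< 32) benefit most
    double a[n * n], b[n * n], c[n * n];
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    // With -DMKL_DIRECT_CALL this call may be partially inlined from the
    // headers, skipping error checking and threading overhead for tiny matrices.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    return 0;
}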

MKL v.2018 - Batch Matrix-Matrix Multiplication (?GEMM_BATCH)

Compute independent matrix-matrix multiplications (GEMMs) simultaneously with a single function call

– Supports all precisions of GEMM and GEMM3M
– Handles varying matrix sizes with a single function call
– Better utilizes multi-/many-core processors for small sizes
– Performance optimizations added in Intel® MKL v.2018
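A minimal sketch (not from the original slides) of the batch interface: one group of equally sized small matrices multiplied with a single cblas_dgemm_batch call; the sizes and batch count are arbitrary example values.

#include <vector>
#include <mkl.h>

int main() {
    const MKL_INT n = 8, batch = 1000;

    std::vector<double> a(batch * n * n, 1.0), b(batch * n * n, 2.0), c(batch * n * n, 0.0);
    std::vector<const double*> a_ptrs(batch), b_ptrs(batch);
    std::vector<double*> c_ptrs(batch);
    for (MKL_INT i = 0; i < batch; ++i) {
        a_ptrs[i] = &a[i * n * n];
        b_ptrs[i] = &b[i * n * n];
        c_ptrs[i] = &c[i * n * n];
    }

    // One entry per group; all matrices in the group share sizes and scalars.
    CBLAS_TRANSPOSE transa = CblasNoTrans, transb = CblasNoTrans;
    MKL_INT m_arr = n, n_arr = n, k_arr = n, lda = n, ldb = n, ldc = n, group_size = batch;
    double alpha = 1.0, beta = 0.0;

    cblas_dgemm_batch(CblasRowMajor, &transa, &transb,
                      &m_arr, &n_arr, &k_arr, &alpha,
                      a_ptrs.data(), &lda, b_ptrs.data(), &ldb,
                      &beta, c_ptrs.data(), &ldc,
                      1 /* group_count */, &group_size);
    return 0;
}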

MKL v.2018 - New Compact BLAS/LAPACK Functions

Compact Layout
• Matrix subgroups are stored as a 3D tensor, with the matrix index increasing first
• Subgroup length is set to the architecture SIMD width
• Consistent layout for all supported BLAS/LAPACK routines/matrices

[Figure: from a rank-1 tensor (vector) in 2D space to a rank-2 tensor (matrix) in 3D]

MKL v.2018 - Compact BLAS Functions

New compact GEMM and TRSM functions
• mkl_?gemm_compact
• mkl_?trsm_compact
Provide better performance for small matrices due to cross-matrix vectorization
• Service functions copy to/from the compact format
• Pack once, unpack only output matrices
• Optimized for Intel® AVX and later

[Chart: single-core performance]
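A rough sketch (not from the original slides) of the pack → compute → unpack workflow around mkl_dgemm_compact; the exact signatures of the compact and service functions below follow the MKL 2018 Developer Reference as I recall them and should be verified against mkl.h, and the sizes are arbitrary example values.

#include <vector>
#include <mkl.h>

int main() {
    const MKL_INT n = 4, nm = 64;                  // 64 small n-by-n matrices

    // Conventional column-major storage: one pointer per matrix.
    std::vector<double> a(nm * n * n, 1.0), c(nm * n * n, 0.0);
    std::vector<const double*> a_ptrs(nm);
    std::vector<double*> c_ptrs(nm);
    for (MKL_INT i = 0; i < nm; ++i) {
        a_ptrs[i] = &a[i * n * n];
        c_ptrs[i] = &c[i * n * n];
    }

    // Pick the SIMD-friendly pack format for this CPU and allocate compact buffers.
    MKL_COMPACT_PACK fmt = mkl_get_format_compact();
    MKL_INT bytes = mkl_dget_size_compact(n, n, fmt, nm);
    std::vector<double> ap(bytes / sizeof(double)), cp(bytes / sizeof(double));

    // Pack once.
    mkl_dgepack_compact(MKL_COL_MAJOR, n, n, a_ptrs.data(), n, ap.data(), n, fmt, nm);

    // C_i = A_i * A_i for every matrix in the compact buffer (cross-matrix vectorization).
    mkl_dgemm_compact(MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS, n, n, n,
                      1.0, ap.data(), n, ap.data(), n, 0.0, cp.data(), n, fmt, nm);

    // Unpack only the output matrices.
    mkl_dgeunpack_compact(MKL_COL_MAJOR, n, n, c_ptrs.data(), n, cp.data(), n, fmt, nm);
    return 0;
}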

Compact LAPACK Functions

[Chart: all-cores performance]

New compact LU factorization with no pivoting, Cholesky and QR factorizations, and inverse
• mkl_?getr[f|i]np_compact
• mkl_?potrf_compact
• mkl_?geqrf_compact
Provide better performance for small matrices due to cross-matrix vectorization
• Service functions copy to/from the compact format
• Pack once, unpack only output matrices

Intel® Data Analytics Acceleration Library

• Targets both data centers (Intel® Xeon® and Intel® Xeon Phi™) and edge devices (Intel® Atom™)

• Perform analysis close to the data source (sensor/client/server) to optimize response latency, decrease network bandwidth utilization, and maximize security

• Offload data to a server/cluster for complex and large-scale analytics

Pre-processing → Transformation → Analysis → Modeling → Validation → Decision Making

(Data sources: Business, Web/Social, Scientific/Engineering)

Intel® DAAL Algorithms
• Regression: Linear, Ridge
• Classification: Naïve Bayes, SVM, Classifier boosting, kNN
• Clustering: Kmeans, EM GMM
• Collaborative filtering: ALS
• PCA
• Statistical moments, Quantiles, Variance matrix
• QR, SVD, Cholesky
• Apriori
• Outlier detection
• Neural Networks
• (De-)Compression, (De-)Serialization

Machine Learning in Intel® DAAL
Supervised learning
• Regression: Linear Regression, Ridge Regression, Decision Forest, Decision Tree
• Classification: Boosting (Ada, Brown, Logit), Naïve Bayes, Weak learner, k-NN, Support Vector Machine, Neural networks, Decision Forest, Decision Tree
Unsupervised learning
• Clustering: K-Means, EM for GMM
• Collaborative filtering: Alternating Least Squares
(The original slide's legend distinguishes algorithms supporting batch processing from algorithms supporting batch, online and/or distributed processing.)
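As an illustration of how these algorithms are driven from C++, here is a minimal sketch of the batch K-Means path, modeled on the kmeans_dense_batch example shipped with Intel DAAL; the CSV file name, cluster count and iteration count are placeholders, and the exact API should be checked against the installed DAAL headers.

#include "daal.h"

using namespace daal;
using namespace daal::algorithms;
using namespace daal::data_management;

int main() {
    const size_t nClusters = 20, nIterations = 5;

    // Load the whole CSV file into a numeric table.
    FileDataSource<CSVFeatureManager> dataSource("kmeans_data.csv",
        DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext);
    dataSource.loadDataBlock();
    NumericTablePtr data = dataSource.getNumericTable();

    // Compute initial centroids.
    kmeans::init::Batch<> init(nClusters);
    init.input.set(kmeans::init::data, data);
    init.compute();
    NumericTablePtr centroids = init.getResult()->get(kmeans::init::centroids);

    // Run the clustering itself.
    kmeans::Batch<> algorithm(nClusters, nIterations);
    algorithm.input.set(kmeans::data, data);
    algorithm.input.set(kmeans::inputCentroids, centroids);
    algorithm.compute();

    NumericTablePtr assignments = algorithm.getResult()->get(kmeans::assignments);
    return 0;
}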

Intel® DAAL v.2018 – Decision Tree, Performance

[Charts: Decision tree training and prediction time in seconds, algorithmFPType=float, split criterion=entropy, 1 thread, large datasets (Ethylene, CO); Intel DAAL vs. Scikit-learn]

Configuration Info – Versions: Intel® DAAL 2018, OpenCV 2.4.13.2, Scikit-learn 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @ 1.40 GHz, 68 cores, 1024K L2 cache, 16 GB DDR4 RAM and 8 GB MCDRAM. Operating System: RHEL 7.2 GA x86_64. Ethylene/CO data: https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures

Intel® DAAL v.2018 – Decision Tree, Performance

[Charts: Decision tree training and prediction time in seconds, algorithmFPType=float, split criterion=gini, 68 threads, large datasets (Ethylene, CO); Intel DAAL vs. OpenCV vs. Scikit-learn]

Configuration Info – Versions: Intel® DAAL 2018, OpenCV 2.4.13.2, Scikit-learn 0.18.1. Hardware: Intel® Xeon Phi™ Processor 000A @ 1.40 GHz, 68 cores, 1024K L2 cache, 16 GB DDR4 RAM and 8 GB MCDRAM. Operating System: RHEL 7.2 GA x86_64. Ethylene/CO data: https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+dynamic+gas+mixtures

Intel® DAAL v.2018 – Decision Forest, Performance

[Charts: Decision Forest training and prediction time in seconds, algorithmFPType=float, split criterion=gini, datasets Letter, Isolet, MNIST; Intel DAAL vs. Ranger]

Configuration Info – Versions: Intel® DAAL 2018, Ranger 0.7.1. Hardware: Intel® Xeon® CPU E5-2680 v3 @ 2.50 GHz, 2 sockets, 12 cores per socket, 504 GB RAM. Operating System: RHEL 7.2 GA x86_64.

Intel® DAAL 2018: K-means (Spark MLlib API)

Example from the MLlib documentation; the changes needed to use the DAAL-optimized K-means are marked in the comments (the original slide color-codes deleted code, added code, and notes).

// import org.apache.spark.mllib.clustering.KMeans     // removed: MLlib K-means
import com.intel.daal.mllib_api.KMeans                  // added: use K-means from the DAAL namespace
import org.apache.spark.mllib.clustering.KMeansModel    // computeCost and predict are still called from the MLlib KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}  // note: it is not safe to inherit from KMeansModel
import org.apache.spark.rdd.RDD

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData: RDD[Vector] = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val clusters: KMeansModel = KMeans.train(parsedData, numClusters, numIterations, initializationMode, seed)
// train: distributed call optimized with DAAL

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

// Save and load the model
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/org/apache/spark/KMeansExample/KMeansModel")

Intel® IPP - Intel® Integrated Performance Primitives Library

Image Processing
• Geometry transformations
• Linear and non-linear filtering
• Linear transforms
• Statistics and analysis
• Color models

Computer Vision
• Feature detection
• Objects tracking
• Pyramids functions
• Segmentation, enhancement
• Camera functions

Signal Processing
• Transforms
• Convolution, Cross-Correlation
• Signal generation
• Digital filtering
• Statistical
• And more

Data Compression
• LZSS
• LZ77 (ZLIB)
• LZO
• Bzip2

Cryptography
• Symmetric cryptography
• Hash functions
• Data authentication
• Public key

String Processing
• String functions: Find, Insert, Remove, Compare, etc.
• Regular expressions
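All IPP domains follow the same flat C calling convention. A minimal sketch (not from the original slides) using the signal-processing domain listed above: allocate aligned buffers and add two vectors; the buffer length and values are arbitrary example choices.

#include <cstdio>
#include <ipp.h>

int main() {
    const int len = 1024;
    Ipp32f* x = ippsMalloc_32f(len);   // aligned allocations from the IPP allocator
    Ipp32f* y = ippsMalloc_32f(len);
    Ipp32f* z = ippsMalloc_32f(len);

    ippsSet_32f(1.5f, x, len);
    ippsSet_32f(2.5f, y, len);

    IppStatus st = ippsAdd_32f(x, y, z, len);   // z[i] = x[i] + y[i]
    std::printf("status=%s z[0]=%f\n", ippGetStatusString(st), z[0]);

    ippsFree(x); ippsFree(y); ippsFree(z);
    return 0;
}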

Intel® IPP v.2018 – IPP_LZ4 vs LZ4*

[Charts: IPP_LZ4 vs. LZ4*, speedups of ~17 % and ~27 %]

Configuration Info – Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1. Hardware: Intel® Xeon® Platinum 8168 CPU, 2.7 GHz, 24 cores (NP=96), 192 GB 2666 MHz DDR4 dual-rank memory, OS RHEL 7; HT=OFF, Turbo=OFF, EIST=OFF.


* https://github.com/lz4/lz4

Intel® IPP v.2018 – IPP_LZ4, Florida Collection

[Chart: IPP_LZ4 vs. LZ4*, speedup ~22 %]

Name              Dimension   nnz         Size, MB   Description
ASIC_100k         99340       954163      32         circuit simulation problem
inline_1          503712      18660027    589.4      structural problem
ldoor             952203      23737339    648.3      structural problem
dielFilterV2real  1157456     24828204    855        electromagnetics problem
kkt_power         2063494     8130343     213.6      optimization problem
soc-LiveJournal1  4847571     68993773    1011.6     directed graph

Configuration Info – Versions: Intel® Integrated Performance Primitives (Intel® IPP) 2018 beta 1. Hardware: Intel® Xeon® Platinum 8168 CPU, 2.7 GHz, 24 cores (NP=96), 192 GB 2666 MHz DDR4 dual-rank memory, OS RHEL 7; HT=OFF, Turbo=OFF, EIST=OFF.



References

Intel® MKL and MKL Forum pages
• http://software.intel.com/en-us/articles/intel-mkl
• http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation
• http://software.intel.com/en-us/forums/intel-math-kernel-library

Intel® DAAL and DAAL Forum pages
• https://software.intel.com/en-us/intel-daal
• https://software.intel.com/en-us/forums/intel-data-analytics-acceleration-library

Intel® IPP and IPP Forum pages
• https://software.intel.com/en-us/intel-ipp
• https://software.intel.com/en-us/forums/intel-integrated-performance-primitives

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel Confidential — Do Not Forward