Intel® AI Workshop 2021 Intel® Acceleration for Classical

Laurent Duhem – HPC/AI Solutions Architect
Shailen Sobhee – AI Software Technical Consultant

Notices and Disclaimers

▪ Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

▪ No product or component can be absolutely secure.

▪ Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .

▪ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .

▪ Intel® Advanced Vector Extensions (Intel® AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

▪ Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

▪ Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

▪ Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

▪ © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Executive Summary

▪ Intel® Distribution for Python covers major usages in HPC and Data Science

▪ Achieve faster Python application performance — right out of the box — with minimal or no changes to a code

▪ Accelerate NumPy*, SciPy*, and scikit-learn* with integrated Intel® Performance Libraries such as Intel® oneMKL (Math Kernel Library) and Intel® oneDAL (Data Analytics Library)

▪ Access the latest vectorization and multithreading instructions, Numba* and Cython*, composable parallelism with Threading Building Blocks, and more

Audience:
▪ Analysts
▪ Data Scientists
▪ Machine Learning Developers

Intel® Distribution for Python Architecture

[Figure: layered architecture of Intel® Distribution for Python. Interface: command line (> python script.py), scientific environments, developer environments. Language: CPython (GIL). Python packages: numerical, parallelism (tbb4py, smp, mpi4py), daal4py. Native technologies: oneMKL, DPC++, oneDAL, TBB, iomp, impi. Legend: Intel technology, community technology.]

Accelerated NumPy and SciPy

• Optimizations include use of oneMKL, which has optimized BLAS/LAPACK operations and FFT computations
• Optimizations also include use of Intel® C and Fortran compilers to enable better use of vectorization
• Interface directly works with single and double precision NumPy arrays
• Natively supports multidimensional transforms
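A minimal sketch of the kind of code these optimizations target. Nothing in it is Intel-specific: the calls are standard NumPy, and the assumption is that under Intel® Distribution for Python the same calls are served by oneMKL-backed BLAS/LAPACK and FFT routines. Array sizes and the random data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense linear algebra: the matrix product and the linear solve go through BLAS/LAPACK.
a = rng.random((200, 200))
b = rng.random((200, 200))
product = a @ b
system = a + 200 * np.eye(200)          # diagonally dominant, so well conditioned
x = np.linalg.solve(system, b[:, 0])

# Multidimensional FFT on a single-precision array.
image = rng.random((64, 64)).astype(np.float32)
spectrum = np.fft.fft2(image)
roundtrip = np.fft.ifft2(spectrum).real

print(product.shape, x.shape, np.allclose(roundtrip, image, atol=1e-5))
```

The script is unchanged between stock NumPy and the Intel build, which is the "minimal or no changes" point from the summary slide.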


oneAPI Data Analytics Library (oneDAL)

Optimized building blocks for all stages of data analytics on Intel Architecture

GitHub: https://github.com/oneapi-src/oneDAL

What makes oneDAL faster?

Intel® oneAPI Data Analytics Library (oneDAL) Algorithms: Machine Learning

Supervised learning
• Regression: Linear Regression, Ridge Regression, LASSO
• Classification: AdaBoost, Brown/Logit Boosting, Naïve Bayes, Logistic Regression, kNN, SVM
• Decision Tree, Random Forest, Gradient Boosting

Unsupervised learning
• K-Means Clustering, DBSCAN, EM for GMM

Collaborative filtering
• Alternating Least Squares

Apriori

Legend (the color coding of the original diagram is not recoverable): algorithms supporting Intel GPU (Gen 9 & Gen12) & dGPU; algorithms supporting batch processing; algorithms supporting batch and distributed processing.

Intel® oneAPI Data Analytics Library (oneDAL) Algorithms: Data Transformation and Analysis

• Basic statistics for datasets: Low order moments, Quantiles, Order statistics
• Correlation and dependence: Cosine distance, Correlation distance, Variance-Covariance matrix
• Dimensionality reduction: PCA
• Matrix factorizations: SVD, QR, Cholesky
• tSVD
• Outlier detection: Univariate, Multivariate
• Association rule mining (Apriori)
• Optimization solvers (SGD, AdaGrad, lBFGS, CD)
• Math functions (exp, log, …)

Legend (the color coding of the original diagram is not recoverable): algorithms supporting Intel GPU (Gen 9 & Gen12) & dGPU; algorithms supporting batch processing; algorithms supporting batch, online and/or distributed processing.

K-Means Using Scikit-learn and daal4py

▪ Scikit-learn

    from sklearn.cluster import KMeans
    import pandas as pd

    data = pd.read_csv("./kmeans.csv")               # Load the data

    algo = KMeans(n_clusters=20,                     # Configure K-means
                  init='k-means++', max_iter=5)      # main object

    result = algo.fit(data)                          # Compute the clusters and labels

    result.labels_                                   # Print the results
    result.cluster_centers_

▪ daal4py

    from daal4py import kmeans_init, kmeans
    import pandas as pd

    data = pd.read_csv("./kmeans.csv")               # Load the data

    init = kmeans_init(nClusters=20,                 # Compute initial
                       method="plusPlusDense").compute(data)   # centroids

    algo = kmeans(nClusters=20,                      # Configure K-means
                  maxIterations=5, assignFlag=True)  # main object

    result = algo.compute(data,                      # Compute the
                          init.centroids)            # clusters and labels

    result.assignments                               # Print the results
    result.centroids
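To make the two daal4py steps (initialization, then a fixed number of iterations) concrete, here is a NumPy-only sketch of what such a pipeline computes. It is not the daal4py or scikit-learn implementation: the function names `init_plus_plus` and `lloyd` and the synthetic data are assumptions made for illustration. The returned `centroids` and `assignments` play the roles of `result.centroids` and `result.assignments` above.

```python
import numpy as np

def init_plus_plus(data, n_clusters, rng):
    """k-means++ style seeding: prefer points far from the centroids chosen so far."""
    centroids = [data[rng.integers(len(data))]]
    for _ in range(n_clusters - 1):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid.
        d2 = ((data[:, None, :] - chosen[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)

def lloyd(data, centroids, max_iterations):
    """Alternate assignment and centroid update for a fixed number of iterations."""
    for _ in range(max_iterations):
        d2 = ((data[:, None, :] - centroids[None]) ** 2).sum(-1)
        assignments = d2.argmin(axis=1)
        for k in range(len(centroids)):
            members = data[assignments == k]
            if len(members):             # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    assignments = ((data[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
    return centroids, assignments

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 2-D (hypothetical data).
data = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

init = init_plus_plus(data, n_clusters=3, rng=rng)
centroids, assignments = lloyd(data, init.copy(), max_iterations=5)
print(centroids.round(1), np.bincount(assignments))
```

Separating seeding from iteration mirrors the `kmeans_init(...)` / `kmeans(...)` split in the daal4py column, where the initial centroids are passed explicitly to `compute`.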

scikit-learn

Optimized building blocks for all stages of data analytics on Intel Architecture

GitHub: https://github.com/oneapi-src/oneDAL

The most popular ML package for Python*

Intel Distribution for Python (IDP) Scikit-learn

Same Code, Same Behavior

▪ Common Scikit-learn (Scikit-learn mainline)

    from sklearn.svm import SVC
    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Scikit-learn with Intel CPU opts (available through Intel conda: conda install daal4py -c intel)

    import daal4py as d4p
    d4p.patch_sklearn()

    from sklearn.svm import SVC
    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI
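A runnable variant of the same pattern, offered as a sketch. It assumes the patch is importable as `daal4py.sklearn.patch_sklearn` and falls back to stock scikit-learn when daal4py is not installed; `make_classification` stands in for the slide's `get_dataset()`, which is not defined in the deck.

```python
# Patch scikit-learn if daal4py is available; otherwise run stock scikit-learn.
try:
    from daal4py.sklearn import patch_sklearn
    patch_sklearn()              # apply before importing the estimators
    patched = True
except ImportError:
    patched = False

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for get_dataset() (hypothetical data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SVC().fit(X, y)
res = clf.predict(X)
accuracy = (res == y).mean()
print("patched:", patched, "training accuracy:", accuracy)
```

Everything after the `try` block is identical in both cases, so the patched and unpatched runs can be timed against each other without touching the modeling code.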

Intel optimized Scikit-Learn

Speedup of Intel® oneDAL powered Scikit-Learn over the original Scikit-Learn

    Workload                                   Speedup
    K-means fit, 1M x 20, k=1000                  44.0
    K-means predict, 1M x 20, k=1000               3.6
    PCA fit, 1M x 50                               4.0
    PCA transform, 1M x 50                        27.2
    Random Forest fit, higgs1m                    38.3
    Random Forest predict, higgs1m                55.4
    Ridge Reg fit, 10M x 20                       53.4
    Linear Reg fit, 2M x 100                      91.8
    LASSO fit, 9M x 45                            50.9
    SVC fit, ijcnn                                29.0
    SVC predict, ijcnn                            95.3
    SVC fit, mnist                                82.4
    SVC predict, mnist                           221.0
    DBSCAN fit, 500K x 50                         17.3
    train_test_split, 5M x 20                      9.4
    kNN predict, 100K x 20, class=2, k=5         131.4
    kNN predict, 20K x 50, class=2, k=5          113.8

Same Code, Same Behavior
• Scikit-learn, not scikit-learn-like
• Scikit-learn conformance (mathematical equivalence) defined by Scikit-learn Consortium, continuously vetted by public CI

HW: Intel Xeon Platinum 8276L CPU @ 2.20GHz, 2 sockets, 28 cores per socket; Details: https://medium.com/intel-analytics-software/accelerate-your-scikit-learn-applications-a06cacf44912

Available algorithms

▪ Accelerated IDP Scikit-learn algorithms:
• Linear/Ridge Regression
• Logistic Regression
• ElasticNet/LASSO
• PCA
• K-means
• DBSCAN
• SVC
• train_test_split(), assume_all_finite()
• Random Forest Regression/Classification - DAAL 2020.3
• kNN (kd-tree and brute force) - DAAL 2020.3

Demo

XGBoost

Optimized building blocks for all stages of data analytics on Intel Architecture

GitHub: https://github.com/oneapi-src/oneDAL

Gradient Boosting - overview

• Gradient Boosting:
• Boosting algorithm (Decision Trees - base learners)
• Solve many types of ML problems (classification, regression, learning to rank)
• Highly-accurate, widely used by Data Scientists
• Compute intensive workload
• Known implementations: XGBoost*, LightGBM*, CatBoost*, Intel® DAAL, …

[Figure: three successive boosting stages, each labeled "Error".]
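The idea of trees as base learners, each one fitted to the error left by the previous ones, can be sketched in a few lines. This is a toy squared-error booster over one-split trees (stumps) on synthetic 1-D data. It is not how XGBoost*, LightGBM* or Intel® DAAL implement boosting, and every name in it (`fit_stump`, `gradient_boost`, the learning rate) is an assumption made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:                      # the last value would leave the right side empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_trees, learning_rate):
    """Each stump is fitted to the current residuals (the negative gradient of squared error)."""
    prediction = np.full_like(y, y.mean())
    errors = [np.mean((y - prediction) ** 2)]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - prediction)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        errors.append(np.mean((y - prediction) ** 2))
    return stumps, errors

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)                  # hypothetical training data
y = np.sin(x) + rng.normal(0.0, 0.1, size=200)

stumps, errors = gradient_boost(x, y, n_trees=50, learning_rate=0.3)
print(f"MSE before boosting: {errors[0]:.4f}, after 50 stumps: {errors[-1]:.4f}")
```

The exhaustive threshold search in `fit_stump` is the step that dominates run time, which is why production implementations replace it with histogram-based split finding such as XGBoost's 'hist' mode.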

DMLC XGBoost* ACCELERATION

▪ Intel® contributed 3 pull requests to the XGBoost* project on GitHub* during the year 2020
▪ Goal: performance optimizations of ‘hist’ mode for Intel® CPUs

XGBoost training improvements:

    Metric          Library version    Airline-OHE, 4.69M
    Train time, s   XGBoost 0.81       4481
    Train time, s   XGBoost 1.2.0      243
    Accuracy        XGBoost 0.81       0.841544
    Accuracy        XGBoost 1.2.0      0.842981

Speedup: 18.4

Workload description: the Airline dataset was preprocessed with one-hot encoding (OHE); after a random permutation, the first 7M rows were selected and split into train and test parts (70%-30%).

2 x Intel® Xeon Gold 6230R @ 26 cores, OS: CentOS 8 (Core), 193 GB RAM.

SW: XGBoost 1.2 and 0.81 from the xgboost pip channel. Compiler: G++ 7.4. Intel DAAL: version 2020.3, downloaded from conda. Python env: Python 3.7, NumPy 1.18.5, Pandas 0.25.3, Scikit-learn 0.23.2.

XGB and LGBM prediction acceleration

daal4py Gradient Boosting Model Convertors

XGBoost:

    xgb_model = xgb.train(params, X_train)                   # Train common XGBoost model as usual
    import daal4py as d4p
    daal_model = d4p.get_gbt_model_from_xgboost(xgb_model)   # XGBoost model to DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)   # make fast prediction with DAAL

LGBM:

    lgb_model = lgb.train(params, X_train)                   # Train common LGBM model as usual
    import daal4py as d4p
    daal_model = d4p.get_gbt_model_from_lightgbm(lgb_model)  # LGBM model to DAAL model
    daal_prediction = d4p.gbt_classification_prediction(…).compute(X_test, daal_model)   # make fast prediction with DAAL

Convert an already trained XGB/LGBM model to speed up prediction without loss of accuracy.

    Dataset    Model   Prediction time, s   With daal4py, s   Speed up   Accuracy/MSE   Accuracy/MSE with daal4py
    Higgs      LGBM    9.156                0.728             12.6       0.75626        0.75626
    Higgs      XGB     5.514                0.7               7.9        0.75828        0.75828
    Mortgage   LGBM    9.156                0.728             12.6       0.49061        0.49061
    Mortgage   XGB     5.514                0.7               7.9        0.4879         0.4879
    MSRank     LGBM    0.857                0.111             7.7        0.57101        0.57101
    MSRank     XGB     0.934                0.121             7.7        0.57177        0.57177

Demo

Intel® Distribution for Python Architecture

[Figure: layered architecture of Intel® Distribution for Python, extended with compilers and dataframes. Interface: command line (> python script.py), scientific environments, developer environments. Language: CPython (GIL), C++ extensions (release GIL), Numba with LLVM IR (release GIL). Python packages: numerical, parallelism (tbb4py, smp, mpi4py), dataframe (SDC), daal4py. Native technologies: oneMKL, DPC++, oneDAL, TBB, iomp, impi. Legend: Intel technology, community technology.]

Intel® Scalable DataFrame Compiler

Just import Numba and use decorator

▪ Extension for Numba* to accelerate AI workflows
▪ Supports more data types (Series, Dataframes, ASCII/Unicode strings)
▪ Compiler, not a library
▪ Scales from laptops to multi-core servers
▪ Open-source project
  GitHub page: https://github.com/IntelPython/sdc
  Documentation: https://intelpython.github.io/sdc-doc/latest/index.html
▪ Available as conda package and pip wheels
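The "import Numba and use decorator" workflow can be sketched with plain Numba. The example below exercises only the decorator pattern on a NumPy array; the SDC-specific part (compiling functions that operate on pandas Series and DataFrames) is not shown. The function `zscore_clip`, its data, and the no-op fallback used when Numba is not installed are assumptions for illustration.

```python
import numpy as np

try:
    from numba import njit           # compiles the function on first call
except ImportError:
    def njit(func):                  # hypothetical stand-in: run as plain Python
        return func

@njit
def zscore_clip(column, limit):
    """Standardize a numeric column and clip outliers, written as an explicit loop."""
    mean = column.mean()
    std = column.std()
    out = np.empty_like(column)
    for i in range(column.shape[0]):
        z = (column[i] - mean) / std
        out[i] = min(max(z, -limit), limit)
    return out

column = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
result = zscore_clip(column, 1.5)
print(result)
```

The function body stays ordinary Python; only the decorator changes how it is executed, which is the sense in which SDC is described as a compiler rather than a library.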

Intel® SDC

Speedup, SDC vs. Pandas (run_etl)

    Threads    Speedup
    1           1.6991
    4           3.3001
    20         10.9496
    40         14.5491

Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores; Numba* 0.51.2, Pandas* 1.0.5, SDC 0.37.0

Demo

Modin

▪ Usable and Scalable
▪ To use Modin, replace the pandas import

[Figure: two panels, each showing a DataFrame in memory above four CPUs. Pandas DataFrame: idle cores. Modin DataFrame: full utilization.]
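A sketch of the import swap. `modin.pandas` is Modin's drop-in module; the fallback to stock pandas is added here only so the snippet runs where Modin (and an execution engine such as Ray) is not installed. The column names and values are hypothetical.

```python
try:
    import modin.pandas as pd     # the only line that changes
except ImportError:
    import pandas as pd           # fallback so the sketch runs without Modin

# Hypothetical data; the rest of the script is ordinary pandas code.
df = pd.DataFrame({"key": ["a", "b", "b", "c", "a"],
                   "value": [30, 50, 60, 90, 35]})
mean_by_key = df.groupby("key")["value"].mean()
print(mean_by_key)
```

Because only the import line differs, an existing pandas script can be tried under Modin without rewriting the DataFrame logic.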

Modin

Execution time, Pandas vs. Modin[ray]

    Library    Time, s
    Pandas     340.0729
    Modin       31.2453

Speedup: 10.8

Intel® Xeon™ Gold 6248 CPU @ 2.50GHz, 2x20 cores

▪ Dataset size: 2.4GB

End-to-End Data Pipeline Acceleration

▪ Workload: Train a model on 50 years of Census data from IPUMS.org to predict income based on education

▪ Solution: Intel Modin for data ingestion and ETL, daal4py and Intel scikit-learn for model training and prediction

▪ Perf Gains:

• Read_CSV (Read from disk and store as a dataframe) : 6x

• ETL operations : 38x

• Train Test Split : 4x

• ML training (fit & predict) with Ridge Regression : 21x

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See backup for configuration details.

Envision a GPU-enabled Python Library Ecosystem

Data Parallel Python: Unified Python Offload Programming Model

Extending PyData ecosystem for XPU:

    with device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = numba.njit(compute_embedding)(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)

• Optimized Packages for Intel CPUs & GPUs
• Jit Compilation

Unified Data & Execution Infrastructure:
• numpy → dpnp
• ndarray → dparray
• host memory → unified shared mem
• CPU → XPU
• zero-copy USM array interface
• common device execution queues

DPC++ RUNTIME: OpenCL, Level 0, CUDA

New Additions to Numba’s Language Design

▪ @dppy.kernel: explicit kernels, low-level kernel programming for expert ninjas

    import dpctl
    import numba_dppy as dppy
    import numpy as np

    @dppy.kernel
    def sum(a, b, c):
        i = dppy.get_global_id(0)
        c[i] = a[i] + b[i]

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    c = np.zeros_like(a)
    with dpctl.device_context("gpu"):
        sum[1024, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

▪ @njit: NumPy-based array programming, auto-offload, high-productivity

    from numba import njit
    import numpy as np
    import dpctl

    @njit
    def f1(a, b):
        c = a + b
        return c

    a = np.ones(1024, dtype=np.float32)
    b = np.ones(1024, dtype=np.float32)
    with dpctl.device_context("gpu"):
        c = f1(a, b)

Seamless interoperability and sharing of resources

• Different packages share same execution context
• Data can be exchanged without extra copies and kept on the device

    import dpctl, numba, dpnp, daal4py

    @numba.njit
    def compute(a):                 # Numba function
        ...

    with dpctl.device_context("gpu"):
        a_dparray = dpnp.random.random(1024, 3)
        X_dparray = compute(a_dparray)
        res_dparray = daal4py.kmeans().compute(X_dparray)   # daal4py function

Portability Across Architectures

    import numba
    import numpy as np
    import math
    import dpctl

    @numba.vectorize(nopython=True)
    def cndf2(inp):
        out = 0.5 + 0.5 * math.erf((math.sqrt(2.0) / 2.0) * inp)
        return out

    @numba.njit(parallel={"offload": True}, fastmath=True)
    def blackscholes(sptprice, strike, rate, volatility, timev):
        logterm = np.log(sptprice / strike)
        powterm = 0.5 * volatility * volatility
        den = volatility * np.sqrt(timev)
        d1 = (((rate + powterm) * timev) + logterm) / den
        d2 = d1 - den
        NofXd1 = cndf2(d1)
        NofXd2 = cndf2(d2)
        futureValue = strike * np.exp(-rate * timev)
        c1 = futureValue * NofXd2
        call = sptprice * NofXd1 - c1
        put = call - sptprice + futureValue   # put-call parity
        return put

    # Runs on CPU by default
    blackscholes(...)

    # Runs on GPU
    with dpctl.device_context("gpu"):
        blackscholes(...)

    # In future
    with dpctl.device_context("cuda:gpu"):
        blackscholes(...)
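The pricing formula used by the kernel can be checked on the host with the standard library alone, with no Numba, dpctl or GPU involved. The sketch below is a scalar version with hypothetical inputs. It computes the put directly from the closed-form expression and uses put-call parity (call − put = spot − strike·e^(−rate·time)) as a sanity check.

```python
import math

def cndf(x):
    """Cumulative distribution function of the standard normal, via erf."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))

def black_scholes(spot, strike, rate, volatility, time):
    """Return (call, put) prices of a European option."""
    den = volatility * math.sqrt(time)
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * time) / den
    d2 = d1 - den
    discounted_strike = strike * math.exp(-rate * time)
    call = spot * cndf(d1) - discounted_strike * cndf(d2)
    put = discounted_strike * cndf(-d2) - spot * cndf(-d1)
    return call, put

# Hypothetical at-the-money option: 5% rate, 20% volatility, one year to expiry.
call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, volatility=0.2, time=1.0)

# Put-call parity: call - put should equal spot - discounted strike.
parity_gap = (call - put) - (100.0 - 100.0 * math.exp(-0.05))
print(f"call={call:.4f} put={put:.4f} parity gap={parity_gap:.2e}")
```

A scalar reference like this is a convenient way to validate results when the same kernel is moved between CPU and GPU execution contexts.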

Scikit-Learn on XPU

▪ Stock on Host:

    from sklearn.svm import SVC

    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Optimized on Host:

    import daal4py as d4p
    d4p.patch_sklearn()
    from sklearn.svm import SVC

    X, y = get_dataset()

    clf = SVC().fit(X, y)
    res = clf.predict(X)

▪ Offload to XPU:

    import daal4py as d4p
    d4p.patch_sklearn()
    import dpctl
    from sklearn.svm import SVC

    X, y = get_dataset()
    with dpctl.device_context("gpu"):
        clf = SVC().fit(X, y)
        res = clf.predict(X)

SAME NUMERIC BEHAVIOR as defined by Scikit-learn Consortium & continuously validated by CI

Installing Intel® Distribution for Python* 2021

▪ Anaconda.org (https://anaconda.org/intel/packages)

    > conda create -n idp -c intel intelpython3_core python=3.x
    > conda activate idp
    > conda install intel::numpy

▪ YUM/APT

    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html
    https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-yum-repo.html

▪ Docker Hub

    docker pull intelpython/intelpython3_full

▪ oneAPI

    https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html

▪ Standalone Installer

    https://software.intel.com/content/www/us/en/develop/articles/oneapi-standalone-components.html#python

▪ PyPI

    > pip install intel-numpy
    > pip install intel-scipy
    > pip install mkl_fft
    > pip install mkl_random
    + Intel library Runtime packages
    + Intel development packages

Get the Most from Your Code Today with Intel® Tech.Decoded

Visit TechDecoded.intel.io to learn how to put key optimization strategies into practice with Intel development tools.

▪ Big Picture Videos: Discover Intel’s vision for key development areas.
▪ Essential Webinars: Gain strategies, practices and tools to optimize application and solution performance.
▪ Quick Hit How-To Videos: Learn how to do specific programming tasks using Intel® tools.

TOPICS: Visual Computing, Code Modernization, Systems & IoT, Data Science, Data Center & Cloud

More Resources

Intel® Distribution for Python
• Product page – overview, features, FAQs…
• Training materials – movies, tech briefs, documentation, evaluation guides…
• Support – forums, secure support…
• Machine Learning Benchmarks
• https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics
• https://github.com/IntelPython/scikit-learn_bench

Thank you