9/29/2020 AI Overview

Intel Corporation APJ Datacenter Group Sales Hiroshi Ouchiyama AI also can be run on CPU

▪ Versatility and flexibility for all workloads are the hallmarks of the CPU

インテル株式会社 2 Recent Intel AI ~Supports a wider AI workload~

Machine ML and DL training learning Scope-in

Deep Learning

Intel® VNNI

Inference Training

インテル株式会社 3 AVX-512 & DL Boost on Intel CPU ▪ AVX-512 (SIMD) is installed, contributing to the improvement of parallel processing performance. Furthermore, further acceleration can be expected with the dedicated Boost instruction. Intel® AVX-512 Inside (Intel® Advanced Vector Extensions 512) Inside + Intel® DL Boost (Intel® Deep Learning Boost)

From 10th gen Ice Lake From Skylake, AVX-512 From Cascade Lake, DL Boost

インテル株式会社 4 Intel® AI : ML and DL

Developer TOols Machine learning Deep LEarning Management Tools

App Developers Containers

SW Platform Developer Topologies & Models ▪ Intel Distribution for Python Data Scientist (SKlearn, Pandas) Frameworks Deep Learning Architect & Data Scientist Reference Stack DevOps

▪ Intel Data Analytics ▪ Intel Machine Learning Scaling Graph Acceleration Library Library (Intel MLSL) Data Analytics (Intel DAAL) Reference ML Performance Engineer Stack ▪ Intel (Intel MKL) Kernel ▪ Intel® Deep Neural Network Library (DNNL) ML Performance Engineer CPU CPU ▪︎GPU ▪︎FPGA ▪︎専用 Red font products are the most broadly applicable SW products for AI users

インテル株式会社 5 Deep Learning Framework (Optimizations by Intel)

SCALING UTILIZE ALL VECTORIZE / EFFICIENT ▪ Improve load THE CORES SIMD MEMORY / balancing ▪ OpenMP, MPI ▪ Unit strided CACHE USE ▪ Reduce ▪ Reduce access per SIMD synchronization ▪ Blocking synchronization lane events, all-to-all ▪ Data reuse events, serial ▪ High vector comms code efficiency ▪ Prefetching ▪ Improve load ▪ Data alignment ▪ Memory balancing allocation for

More framework See installation guides at optimizations underway ai.intel.com/framework-optimizations/ (e.g., PaddlePaddle*, CNTK* and more)

SEE ALSO: Machine Learning Libraries for Python (Scikit-learn, Pandas, NumPy), R (Cart, randomForest, e1071), Distributed (MlLib on Spark, Mahout) *Limited availability today Optimization Notice

6 Optimization and Quantization of Deep Learning Models for Further Performance Improvement of Inference Processing

▪ Optimization: Make models smarter by removing unnecessary Ops, integrating multiple Ops, etc. ▪ Quantization*: Make models slim by converting internal numerical representation of the model from FP32 to INT8.

by Optimiz ation Quantiz by ation A quantization tool is available for each Optimization & framework. 元のモデル Quantization (TensorFlow*、PyTorch* などで作成) by

* 2nd generation Intel® ® scalable processors and later with Intel® Deep Learning Boost (VNNI) as of May 2020, effective on 10th generation Intel® Core™ processor family (Ice Lake† only) and later

インテル株式会社 7 参考値 Deep Learning Inference Processing Benchmark Intel® Xeon® Gold 6254 processor @ 2.10GHz (18 cores x 1 sockets)

As of 3/20/2020 性能比 (倍) Resnet50 - FPS Input=224x224, BS=1, 1 stream 8.00

6.00

4.00

2.00

0.00 TensorFlow* OpenVINO™ TensorFlow* OpenVINO™ 1.15.0 ツールキット 2020R1 1.15.0 ツールキット 2020R1 FP32 INT8

注)インテル社員による性能確認のための個人的なベンチマーク結果であり、インテルの公式結果ではありません。 インテル株式会社 8 CheXNet Performance Optimization by OpenVINO™

Before Optimization← → After Optimization Optimization Quantization Parallelization Measured the batch • Export the model to • Quantize the IR to INT8 • Change the source code inference performance ONNX. format by OpenVINO’s to use asynchronized with 22K images • Convert the ONNX to IR quantization tool processing and multi by OpenVINO’s Model • Run the IR on threading on Optimizer. OpenVINO’s inference OpenVINO’s inference • Run the IR on engine. (on VNNI) engine (8 parallel). OpenVINO’s inference engine.

11,177 sec x10.0 1,116 sec x3.1 359 sec x1.4 251 sec (Baseline)

on x 44.5 Xeon 6252 against Baseline

Please refer the link below to find the specific source code. https://github.com/taneishi/CheXNet

9 Training with Huge Memory ~U-Net Training by NUS~ The DICE (model accuracy) is on average 5%

GPU higher for models trained on an Intel® CPU.

- • V100 GPU (32GB memory) based env based • 10 CPU cores • 126GB RAM

• Batch size of 1

Result CPU

- • 2 x Intel Platinum CPUs. based env based • 2 x 24 CPU cores • 384GB RAM • Batch size of 6

インテル株式会社 10 What if I want performance?

Use multiple CPUs in a bundle In other words, Distributed Training ☝

インテル株式会社 11 Scaling Efficient Deep Learning on Existing Infrastructure: The Case of GENCI and CERN GENCI CERN French research institute focused on numerical simulation and HPC the European Organization for Nuclear Research, which operates the across all scientific and industrial fields Large Hadron Collider (LHC), the world’s largest particle accelerator Succeeded in training a plant classification model for 94% scaling efficiency up to 128 nodes, with a 300K species, 1.5TByte dataset of 12 million images significant reduction in training time per epoch for on 1024 2S Intel® Xeon® Nodes with Resnet50. 3D-GANs High Energy Physics: 3D GANs Training Speedup Performance Intel 2S Xeon(R) on Stampede2/TACC, OPA Fabric TensorFlow 1.9+MKL-DNN+horovod, Intel MPI, Core Aff. BKMs, 4 Workers/Node 2S Xeon 8160: Secs/Epoch Speedup Ideal Scaling Efficiency 100% 100% 98% 97% 97% 96% 100% 256 95% 94%

128 90% 120 80% 128-Node Perf: 64 148 Secs/Epoch 61 70%

32 60% 31 16 50%

15.5 Speedup 8 40% 7.8 30% Efficiency Speedup 4 3.9 20% 2 2.0 10% 1.0 1 0% 1 2 4 8 16 32 64 128 Intel(R) 2S Xeon(R) Nodes

インテル株式会社 12 ML is still important Top Data Science, Machine Learning

Methods used in 2018/2019 Share of Respondents Regression 56% AI Decision Trees / Rules 48% Clustering 47% Visualizaiton 46% Random Forests 45% ML Statistics - Descriptive 39% K-NearestNeighbors 33% Time Series 32% Ensamble Methods 30% PCA 28% Text Mining 28% DL Boosting 27% Neural Networks - Deep Learning 25% Anomaly / Deviation Detection 23% Dgradient Boosted Machines 23% Neural Networks - CNN 22% Support Vector Machine 22%

引用元: インテル株式会社 https://www.kdnuggets.com/2019/04/top-data-science-machine-learning-methods-2018-2019.html 13 Public Cloud 行列のコレスキー分解 Intel® Distribution for on AVX512 & 72cores Python* 9 倍 (OSS実装との比較) Intel's implementation and optimization of Python and related libraries • Numpy • Pandas Public Cloud 回帰分析 学習処理 • Scipy on AVX512 & 72cores

423 倍(OSS実装との比較) • Scikit-learn • XGBoost • TensorFlow • etc..

インテル株式会社 https://software.intel.com/en-us/distribution-for-python/benchmarks 14 Intel® AI Library & oneDAL

Collective Math ML & Analytics DL Communicaiton

Intel® oneAPI Intel® oneAPI Intel® oneAPI Intel® oneAPI Collective Deep Neural Network Math Kernel Library Data Analytics Library Communication Library (oneMKL) (oneDAL) Library (oneDNN) (oneCCL)

daal4py Partner Solution

pip install daal4py pip install intel-scikit-learn Able to install into Spark http://www.intel.com/analytics

インテル株式会社 https://www.oneapi.com/ 15 New demand, New Technology Security Data

PPML Graphs as Algorithm

- ★ - ★ - - - 0 1 (Privacy Preserving Machine Learning) - - - - ★ - ★ - - - - - ★ - Graph 3 6 4 A = ★ - ★ ------★ - From vertex - - ★ - - - - 2 5 (rows) - - ★ ★ ★ - - SLIDE Machine learning (Sub-LInear Deep learning Engine) technology with an emphasis on privacy Analysis of graph data or Collaboration with Rice University. protection pattern detection using Deep learning’s training algorithms machine learning have been fundamentally redesigned to achieve higher learning performance on the CPU than on the GPU.

インテル株式会社 16 Intel Technology Blog on Graph Analysis

https://medium.com/intel-analytics-software/you-dont- have-to-spend-800-000-to-compute-pagerank- fa6799133402

https://medium.com/intel-analytics-software

インテル株式会社 17 AI Software Ecosystem on Intel

インテル株式会社 18 Accelerate Your AI Journey with Intel Intel Xeon Scalable Processor: The only data center CPU with built-in AI acceleration DISCOVERY Data Develop Deploy of possibilities & next steps setup, ingestion & cleaning models using analytics/AI into production & iterate Ecosystem software Hardware

Intel® AI Over 100 vertical & horizontal Over 50 optimized Ethernet ecosystem solutions Data Analytics software platforms Move Silicon Builders Photonics Amazon Web Services Baidu Cloud Machine Intel Distribution Store Optimized cloud Google Cloud Platform Learning for Python Intel 3D NAND SSD Azure & More In development AI-Optimized Deep Learning ProCess CONFIGurations

All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice. Optimization Notice

19 20