Intel AI Overview

Total Pages: 16

File Type: PDF, Size: 1,020 KB

Intel AI Overview – 9/29/2020 – Intel Corporation, APJ Datacenter Group Sales, Hiroshi Ouchiyama

Slide 2: AI can also run on the CPU
▪ Versatility and flexibility for all workloads are the hallmarks of the CPU.

Slide 3: Recent Intel AI – supporting a wider range of AI workloads
▪ The scope now takes in machine learning as well as deep learning, covering both training and inference, with Intel® VNNI accelerating deep learning inference.

Slide 4: AVX-512 & DL Boost on Intel CPUs
▪ AVX-512 (SIMD) is built in and improves parallel-processing performance; the dedicated Intel® Deep Learning Boost (DL Boost) instructions provide further acceleration on top of it.
▪ Intel® AVX-512 (Intel® Advanced Vector Extensions 512) has shipped since Skylake; DL Boost since Cascade Lake on the Xeon side and since the 10th-generation Ice Lake parts on the client side.

Slide 5: Intel® AI software – ML and DL developer tools
▪ Libraries and kernels: Intel® Distribution for Python (scikit-learn, Pandas), Intel® Data Analytics Acceleration Library (Intel DAAL), Intel® Machine Learning Scaling Library (Intel MLSL), Intel® Math Kernel Library (Intel MKL), and Intel® Deep Neural Network Library (DNNL).
▪ Higher layers: deep learning frameworks, topologies and models, the Deep Learning Reference Stack and Data Analytics Reference Stack, containers, and management tools, aimed at app developers, data scientists, architects, DevOps, and ML performance engineers.
▪ Hardware targets: CPU, GPU, FPGA, and dedicated accelerators. The products shown in red on the slide are the most broadly applicable software products for AI users.

Slide 6: Deep learning framework optimizations by Intel
▪ Utilize all the cores: OpenMP and MPI; reduce synchronization events and serial code; improve load balancing.
▪ Vectorize / SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment.
▪ Efficient memory and cache use: blocking; data reuse; prefetching; memory allocation.
▪ Scaling: improve load balancing; reduce synchronization events and all-to-all communication.
▪ More framework optimizations are underway (e.g., PaddlePaddle*, CNTK*, and more; *limited availability today). See the installation guides at ai.intel.com/framework-optimizations/.
▪ See also: machine learning libraries for Python (scikit-learn, Pandas, NumPy), R (CART, randomForest, e1071), and distributed processing (MLlib on Spark, Mahout).

Slide 7: Optimization and quantization of deep learning models for faster inference
▪ Optimization: make models leaner by removing unnecessary ops, fusing multiple ops, and so on.
▪ Quantization*: make models smaller by converting the model's internal numerical representation from FP32 to INT8. The original model (created with TensorFlow*, PyTorch*, etc.) is first optimized and then quantized; a quantization tool is available for each framework.
▪ * Requires 2nd-generation Intel® Xeon® Scalable processors or later with Intel® Deep Learning Boost (VNNI); as of May 2020 it is also effective on the 10th-generation Intel® Core™ processor family (Ice Lake only) and later.
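Slide 7 notes that each framework ships its own quantization tooling for the FP32-to-INT8 conversion. As one concrete illustration (a minimal sketch, not the specific tool referenced in the deck), the following uses PyTorch's built-in post-training static quantization on a deliberately tiny placeholder model; the random calibration batch stands in for a representative dataset.

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):
        """Placeholder FP32 model used only to demonstrate the quantization flow."""
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 entry point
            self.conv = nn.Conv2d(3, 8, 3)
            self.relu = nn.ReLU()
            self.fc = nn.Linear(8 * 30 * 30, 10)
            self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 exit point

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.conv(x))
            x = torch.flatten(x, 1)
            x = self.fc(x)
            return self.dequant(x)

    model = TinyNet().eval()
    model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
    prepared = torch.quantization.prepare(model)       # insert calibration observers
    with torch.no_grad():
        prepared(torch.randn(8, 3, 32, 32))            # calibration pass (placeholder data)
    quantized = torch.quantization.convert(prepared)   # weights/activations now INT8

The same two-step pattern – optimize the graph, then calibrate and convert to INT8 – is what the CheXNet example on slide 9 walks through with the OpenVINO™ tools.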
Slide 8: Deep learning inference benchmark (reference values)
▪ Intel® Xeon® Gold 6254 @ 2.10 GHz (18 cores, 1 socket), as of 3/20/2020. ResNet50, frames per second, input 224x224, batch size 1, 1 stream.
▪ The chart compares the relative performance of TensorFlow* 1.15.0 and the OpenVINO™ toolkit 2020R1 in FP32 and in INT8.
▪ Note: these are personal benchmark results taken by an Intel employee for performance verification, not official Intel results.

Slide 9: CheXNet performance optimization with the OpenVINO™ toolkit
▪ Batch inference over 22K images on a Xeon 6252; baseline: 11,177 sec.
▪ Optimization: export the model to ONNX, convert the ONNX model to IR with OpenVINO's Model Optimizer, and run the IR on OpenVINO's inference engine – 1,116 sec (x10.0).
▪ Quantization: quantize the IR to INT8 with OpenVINO's quantization tool and run it on the inference engine (using VNNI) – 359 sec (x3.1).
▪ Parallelization: change the source code to use asynchronous processing and multi-threading on the inference engine (8 parallel requests) – 251 sec (x1.4).
▪ Overall: x44.5 against the baseline. The specific source code is at https://github.com/taneishi/CheXNet

Slide 10: Training with huge memory – U-Net training by NUS
▪ The DICE score (model accuracy) is on average 5% higher for models trained on an Intel® CPU.
▪ GPU-based environment: V100 GPU (32 GB memory), 10 CPU cores, 126 GB RAM, batch size 1.
▪ CPU-based environment: 2 x Intel Platinum CPUs (2 x 24 cores), 384 GB RAM, batch size 6.

Slide 11: What if I want more performance? Use multiple CPUs together – in other words, distributed training.

Slide 12: Scaling efficient deep learning on existing infrastructure – the case of GENCI and CERN
▪ GENCI, a French research institute focused on numerical simulation and HPC across all scientific and industrial fields, succeeded in training a plant-classification model for 300K species on a 1.5 TB dataset of 12 million images across 1,024 two-socket Intel® Xeon® nodes with ResNet50.
▪ CERN, the European Organization for Nuclear Research, which operates the Large Hadron Collider (LHC), the world's largest particle accelerator, reached 94% scaling efficiency up to 128 nodes, with a significant reduction in training time per epoch for 3D GANs.
▪ Chart: high-energy-physics 3D GAN training speedup on two-socket Intel® Xeon® 8160 nodes (Stampede2/TACC, OPA fabric, TensorFlow 1.9 + MKL-DNN + Horovod, Intel MPI, core-affinity BKMs, 4 workers per node). Speedup scales nearly linearly from 1 to 128 nodes (scaling efficiency 94-100%), reaching 148 sec/epoch at 128 nodes.
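Slides 11 and 12 describe scaling training across many CPU nodes with Horovod and MPI. The sketch below shows what that typically looks like at the framework level; it is a generic Horovod-with-Keras example under assumed package versions, not the GENCI/CERN training code, and the tiny MNIST model is a placeholder.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per worker, launched by horovodrun or mpirun

    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0
    # naive sharding: each worker trains on its own slice of the data
    x, y = x[hvd.rank()::hvd.size()], y[hvd.rank()::hvd.size()]

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # scale the learning rate by the number of workers, then let Horovod
    # average gradients across workers with its allreduce-based optimizer
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # make sure every worker starts from the same initial weights
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)

A run such as "horovodrun -np 4 python train_hvd.py" (train_hvd.py being whatever file holds the script above, or the equivalent mpirun invocation with Intel MPI) starts one worker per process, mirroring the "4 workers per node" configuration quoted on the slide.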
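The parallelization step in the CheXNet walkthrough on slide 9 relies on OpenVINO's asynchronous inference requests. A rough sketch of that pattern is below, written against the pre-2022 openvino.inference_engine Python API; the file names, blob names, input shape, and random data are placeholders, and attribute names can differ between OpenVINO releases.

    import numpy as np
    from openvino.inference_engine import IECore

    ie = IECore()
    # IR files produced earlier by the Model Optimizer (optionally INT8-quantized)
    net = ie.read_network(model="chexnet.xml", weights="chexnet.bin")
    exec_net = ie.load_network(network=net, device_name="CPU", num_requests=8)

    input_name, output_name = "data", "prob"   # hypothetical blob names
    batches = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(8)]

    # launch all 8 requests without waiting on each other
    for rid, batch in enumerate(batches):
        exec_net.start_async(request_id=rid, inputs={input_name: batch})

    # collect the results once each request has finished
    for rid in range(len(batches)):
        exec_net.requests[rid].wait()
        result = exec_net.requests[rid].outputs[output_name]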
Slide 13: ML is still important – top data science and machine learning methods used in 2018/2019 (share of respondents)
▪ Regression 56%, decision trees/rules 48%, clustering 47%, visualization 46%, random forests 45%, descriptive statistics 39%, k-nearest neighbors 33%, time series 32%, ensemble methods 30%, PCA 28%, text mining 28%, boosting 27%, neural networks (deep learning) 25%, anomaly/deviation detection 23%, gradient boosted machines 23%, neural networks (CNN) 22%, support vector machines 22%.
▪ Source: https://www.kdnuggets.com/2019/04/top-data-science-machine-learning-methods-2018-2019.html

Slide 14: Intel® Distribution for Python*
▪ Intel's implementation and optimization of Python and related libraries: NumPy, Pandas, SciPy, scikit-learn, XGBoost, TensorFlow, and more.
▪ Public-cloud benchmarks on AVX-512 with 72 cores: Cholesky decomposition of matrices runs 9x faster, and regression-analysis training 423x faster, than the open-source implementations.
▪ https://software.intel.com/en-us/distribution-for-python/benchmarks

Slide 15: Intel® AI libraries and oneDAL
▪ Math: Intel® oneAPI Math Kernel Library (oneMKL). ML and analytics: Intel® oneAPI Data Analytics Library (oneDAL). Deep learning: Intel® oneAPI Deep Neural Network Library (oneDNN). Collective communication: Intel® oneAPI Collective Communications Library (oneCCL).
▪ daal4py: pip install daal4py, pip install intel-scikit-learn; it can also be installed into Spark (partner solution).
▪ http://www.intel.com/analytics and https://www.oneapi.com/

Slide 16: New demand, new technology
▪ PPML (Privacy-Preserving Machine Learning): machine learning technology with an emphasis on protecting the privacy of data.
▪ Graphs as linear algebra: a graph is represented as an adjacency matrix (vertices as rows and columns), so graph analysis and pattern detection can be carried out with machine learning and sparse linear algebra.
▪ SLIDE (Sub-LInear Deep learning Engine): a collaboration with Rice University in which deep learning training algorithms have been fundamentally redesigned to achieve higher training performance on the CPU than on the GPU.

Slide 17: Intel technology blog on graph analysis
▪ https://medium.com/intel-analytics-software/you-dont-have-to-spend-800-000-to-compute-pagerank-fa6799133402
▪ https://medium.com/intel-analytics-software

Slide 18: AI software ecosystem on Intel.

Slide 19: Accelerate your AI journey with Intel
▪ Intel® Xeon® Scalable processors: the only data center CPU with built-in AI acceleration.
▪ The journey: discovery of possibilities and next steps → data setup, ingestion, and cleaning → develop models using analytics/AI → deploy into production and iterate.
▪ Ecosystem: over 100 vertical and horizontal ecosystem solutions, Silicon Builders, and optimized cloud platforms (Amazon Web Services, Baidu Cloud, Google Cloud Platform, Microsoft Azure, and more).
▪ Software: over 50 optimized software platforms spanning data analytics, machine learning, the Intel® Distribution for Python, and deep learning, with AI-optimized configurations in development.
▪ Hardware: Intel® AI silicon to process data, Ethernet and silicon photonics to move it, and Intel® 3D NAND SSDs to store it.
▪ All products, computer systems, dates, and figures are preliminary based on current expectations and are subject to change without notice.
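The "graphs as linear algebra" item above (and the PageRank post linked on slide 17) treats a graph as an adjacency matrix so that graph analytics becomes a sequence of matrix-vector products. A minimal NumPy illustration of that idea, using a small made-up 5-vertex graph and a plain power iteration rather than any Intel library:

    import numpy as np

    # Adjacency matrix: A[i, j] = 1 if there is an edge from vertex i to vertex j.
    # A hypothetical 5-vertex graph standing in for the slide's illustration.
    A = np.array([
        [0, 1, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0],
    ], dtype=float)

    n = A.shape[0]
    d = 0.85                               # damping factor
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    rank = np.full(n, 1.0 / n)

    # PageRank as repeated matrix-vector products (power iteration)
    for _ in range(100):
        rank = (1 - d) / n + d * (P.T @ rank)

    print(rank / rank.sum())               # relative importance of each vertex

On a real graph, A would be a large sparse matrix and the same loop becomes a handful of sparse matrix-vector multiplies, which is the kind of workload the linked PageRank post discusses running on Xeon CPUs.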
Recommended publications
  • Using Intel® Math Kernel Library and Intel® Integrated Performance Primitives in the Microsoft* .NET* Framework
    Using Intel® Math Kernel Library and Intel® Integrated Performance Primitives in the Microsoft* .NET* Framework. Document Number: 323195-001US. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
    [Show full text]
  • SOL: Effortless Device Support for AI Frameworks Without Source Code Changes
    SOL: Effortless Device Support for AI Frameworks without Source Code Changes. Nicolas Weber and Felipe Huici, NEC Laboratories Europe. Abstract—Modern high performance computing clusters heavily rely on accelerators to overcome the limited compute power of CPUs. These supercomputers run various applications from different domains such as simulations, numerical applications or artificial intelligence (AI). As a result, vendors need to be able to efficiently run a wide variety of workloads on their hardware. In the AI domain this is in particular exacerbated by the existence of a number of popular frameworks (e.g., PyTorch, TensorFlow, etc.) that have no common code base and can vary in functionality. The code of these frameworks evolves quickly, making it expensive to keep up with all changes and potentially forcing developers to go through constant rounds of upstreaming. In this paper we explore how to provide hardware support in AI frameworks without changing the framework's source code in order to minimize maintenance overhead. We introduce SOL, an AI acceleration middleware that provides a hardware abstraction layer that allows us to transparently support heterogeneous hardware. (Fig. 1: Abstraction layers within AI frameworks – API (Python, C/C++, …) → Framework Core → Device Backends in the state of the art vs. SOL in the proposed design.) … lines of code to their scripts in order to enable SOL and its hardware support. We explore two strategies to integrate new devices into AI frameworks using SOL as a middleware, to keep the original AI framework unchanged and still add support to new device types. The first strategy hides the entire offloading procedure from the framework, and the second only injects the necessary functionality into the framework to enable the execution, but …
    [Show full text]
  • Intel® Deep Learning Boost (Intel® DL Boost) Product Overview
    Intel® Deep Learning Boost (Intel® DL Boost): built-in acceleration for training and inference workloads. Run complex workloads on the same platform: Intel® Xeon® Scalable processors are built specifically for the flexibility to run complex workloads on the same hardware as your existing workloads. Intel AVX-512 (1st, 2nd and 3rd generation Intel Xeon Scalable processors): ultra-wide 512-bit vector operations with up to two fused multiply-add units and other optimizations accelerate performance for demanding computational tasks. Intel DL Boost with Intel VNNI (2nd and 3rd generation Intel Xeon Scalable processors): based on Intel Advanced Vector Extensions 512 (Intel AVX-512), the Intel DL Boost Vector Neural Network Instructions (VNNI) deliver a significant performance improvement by combining three instructions into one, thereby maximizing the use of compute resources, utilizing the cache better, and avoiding potential bandwidth bottlenecks. Intel DL Boost with bfloat16 (3rd generation Intel Xeon Scalable processors on 4S+ platforms): brain floating-point format (bfloat16 or BF16) is a number encoding format occupying 16 bits that represents a floating-point number; it is a more efficient numeric format for workloads that have high compute intensity but a lower need for precision. Common training and inference workloads: image classification, speech recognition, language translation, object detection. Intel DL Boost's vector neural network instruction (VNNI) extends Intel AVX-512 to accelerate AI/DL inference: combining the three instructions VPMADDUBSW, VPMADDWD and VPADDD into the single instruction VPDPBUSD maximizes the use of compute resources, improves cache utilization, and avoids potential bandwidth bottlenecks, for up to 11x higher DL throughput vs. the current-generation Intel Xeon Scalable CPU (see the arithmetic sketch after this entry).
    [Show full text]
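    The preceding overview describes VPDPBUSD as fusing the work of VPMADDUBSW, VPMADDWD and VPADDD: multiply unsigned 8-bit activations by signed 8-bit weights, sum the products in groups of four, and accumulate into a signed 32-bit lane. The snippet below is only an arithmetic illustration of that per-lane computation in NumPy; it ignores the saturation behaviour of the legacy instruction sequence and is not vectorized code.

        import numpy as np

        activations = np.array([12, 250, 7, 33], dtype=np.uint8)  # u8 inputs for one 32-bit lane
        weights = np.array([-3, 5, 120, -90], dtype=np.int8)      # s8 weights
        acc = np.int32(1000)                                      # existing s32 accumulator lane

        # widen first so the 8-bit products cannot overflow in this Python sketch
        products = activations.astype(np.int32) * weights.astype(np.int32)
        acc = acc + products.sum(dtype=np.int32)

        print(acc)  # the INT8 dot product accumulated into the 32-bit lane: 84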
  • Intel(R) Math Kernel Library for Linux* OS User's Guide
    Intel® Math Kernel Library for Linux* OS User's Guide. MKL 10.3 – Linux* OS. Document Number: 315930-012US. Legal Information. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
    [Show full text]
  • Intel® Math Kernel Library for Windows* OS User's Guide
    Intel® Math Kernel Library for Windows* OS User's Guide. Intel® MKL – Windows* OS. Document Number: 315930-027US. Contents: Legal Information (p. 7); Introducing the Intel® Math Kernel Library (p. 9); Getting Help and Support (p. 11); Notational Conventions (p. 13); Chapter 1: Overview – Document Overview (p. 15), What's New (p. 15), Related Information (p. 15); Chapter 2: Getting Started – Checking Your Installation (p. 17), Setting Environment Variables (p. 17), Compiler Support (p. 19), Using Code Examples (p. 19), What You Need to Know Before You Begin Using the Intel® Math Kernel Library (p. 19); Chapter 3: Structure of the Intel® Math Kernel Library – Architecture Support (p. 23)
    [Show full text]
  • BLIS: A Modern Alternative to the BLAS
    BLIS: A Modern Alternative to the BLAS. FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin. We propose the portable BLAS-like Interface Software (BLIS) framework which addresses a number of shortcomings in both the original BLAS interface and present-day BLAS implementations. The framework allows developers to rapidly instantiate high-performance BLAS-like libraries on existing and new architectures with relatively little effort. The key to this achievement is the observation that virtually all computation within level-2 and level-3 BLAS operations may be expressed in terms of very simple kernels. Higher-level framework code is generalized so that it can be reused and/or re-parameterized for different operations (as well as different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any and all BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are supported through a straightforward compatibility layer, though calling sequences must be updated for those who wish to access new functionality. Experimental performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL). Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency. General Terms: Algorithms, Performance. Additional Key Words and Phrases: linear algebra, libraries, high-performance, matrix, BLAS. ACM Reference Format: ACM Trans. Math. Softw. 0, 0, Article 0 (0000), 31 pages. (A small Python sketch of the micro-kernel idea follows this entry.)
    [Show full text]
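    To make the central idea of the BLIS abstract above concrete, here is a minimal NumPy sketch (not BLIS code, and far from optimized) in which a blocked matrix-matrix multiply funnels all of its arithmetic through one tiny micro-kernel; in BLIS, that micro-kernel is essentially the only piece that must be hand-optimized for each architecture. The block sizes and the use of NumPy inside the micro-kernel are illustrative choices.

        import numpy as np

        def microkernel(Cb, Ab, Bb):
            # Update one small MR x NR block of C in place: Cb += Ab @ Bb.
            # This stands in for the single hand-tuned, architecture-specific kernel.
            Cb += Ab @ Bb

        def blocked_gemm(C, A, B, mr=4, nr=4, kc=8):
            # C += A @ B, expressed purely as micro-kernel calls on sub-blocks.
            m, k = A.shape
            _, n = B.shape
            for p in range(0, k, kc):          # panels along the shared k dimension
                for j in range(0, n, nr):      # column panels of B and C
                    for i in range(0, m, mr):  # row blocks of A and C
                        microkernel(C[i:i+mr, j:j+nr],
                                    A[i:i+mr, p:p+kc],
                                    B[p:p+kc, j:j+nr])

        rng = np.random.default_rng(0)
        A = rng.standard_normal((16, 24))
        B = rng.standard_normal((24, 12))
        C = np.zeros((16, 12))
        blocked_gemm(C, A, B)
        assert np.allclose(C, A @ B)   # same result as an unblocked multiply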
  • DL-Inferencing for 3D Cephalometric Landmarks Regression Task Using OpenVINO
    DL-inferencing for a 3D Cephalometric Landmarks Regression task using OpenVINO. Evgeny Vasiliev, Dmitrii Lachinov, and Alexandra Getmanskaya (Institute of Information Technologies, Mathematics and Mechanics, Lobachevsky State University; Department of Ophthalmology and Optometry, Medical University of Vienna). Abstract. In this paper, we evaluate the performance of the Intel Distribution of OpenVINO toolkit in practical solving of the problem of automatic three-dimensional cephalometric analysis using deep learning methods. This year, the authors proposed an approach to the detection of cephalometric landmarks from CT-tomography data which is resistant to skull deformities and uses convolutional neural networks (CNN). Resistance to deformations is due to the initial detection of 4 points that are basic for the parameterization of the skull shape. The approach was explored on CNNs of three architectures, and a record regression accuracy in comparison with analogs was obtained. This paper evaluates the performance of decision making for the trained CNN models at the inference stage. For a comparative study, the computing environments PyTorch and Intel Distribution of OpenVINO were selected, along with 2 of the 3 CNN architectures: one based on VGG for regression of cephalometric landmarks, and an Hourglass-based model with the ResNeXt backbone for landmark heatmap regression. The experimental dataset consisted of 20 CT scans of patients with acquired craniomaxillofacial deformities and included pre- and post-operative CT scans with a format of 800x800x496 and voxel spacing of 0.2x0.2x0.2 mm.
    [Show full text]
  • Dr. Fabio Baruffa Senior Technical Consulting Engineer, Intel IAGS Legal Disclaimer & Optimization Notice
    Dr. Fabio Baruffa Senior Technical Consulting Engineer, Intel IAGS Legal Disclaimer & Optimization Notice Performance results are based on testing as of September 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Copyright © 2019, Intel Corporation. All rights reserved. Intel, the Intel logo, Pentium, Xeon, Core, VTune, OpenVINO, Cilk, are trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
    [Show full text]
  • Intel® Parallel Studio XE 2017 Runtime
    Intel® Parallel Studio XE 2017 Runtime Release Notes. 26 July 2016. Contents: 1 Introduction (p. 1); 1.1 What Every User Should Know About This Release (p. 1); 2 Product Contents (p. 2); 3 System Requirements (p. 3); 3.1 Processor Requirements (p. 3); 3.2 Disk Space Requirements (p. 3); 3.3 Operating System Requirements (p. 3); 3.4 Memory Requirements (p. 3); 3.5 Additional Software Requirements (p. 3); 4 Issues and Limitations …
    [Show full text]
  • BLIS: A Framework for Rapidly Instantiating BLAS Functionality
    BLIS: A Framework for Rapidly Instantiating BLAS Functionality. FIELD G. VAN ZEE and ROBERT A. VAN DE GEIJN, The University of Texas at Austin. The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. While others have had similar insights, BLIS reduces the necessary kernels to what we believe is the simplest set that still supports the high performance that the computational science community demands. Higher-level framework code is generalized and implemented in ISO C99 so that it can be reused and/or re-parameterized for different operations (and different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are given a choice of using the traditional Fortran-77 BLAS interface, a generalized C interface, or any other higher-level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL). Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency. General Terms: Algorithms, Performance. Additional Key Words and Phrases: linear algebra, libraries, high-performance, matrix, BLAS. ACM Reference Format: ACM Trans.
    [Show full text]
  • Accelerating Spark ML Applications Date Published: 2020-01-16 Date Modified
    Best Practices Accelerating Spark ML Applications Date published: 2020-01-16 Date modified: https://docs.cloudera.com/ Legal Notice © Cloudera Inc. 2021. All rights reserved. The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release. Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information. Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs. Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera. Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners. Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH.
    [Show full text]
  • Accelerate Scientific Deep Learning Models on Heterogeneous Computing Platform with FPGA
    Accelerate Scientific Deep Learning Models on Heterogeneous Computing Platform with FPGA. Chao Jiang, David Ojika, Sofia Vallecorsa, Thorsten Kurth, Prabhat, Bhavesh Patel, and Herman Lam (SHREC: NSF Center for Space, High-Performance, and Resilient Computing, University of Florida; CERN openlab; NVIDIA; National Energy Research Scientific Computing Center; Dell EMC). Abstract. AI and deep learning are experiencing explosive growth in almost every domain involving analysis of big data. Deep learning using Deep Neural Networks (DNNs) has shown great promise for such scientific data analysis applications. However, traditional CPU-based sequential computing without special instructions can no longer meet the requirements of mission-critical applications, which are compute-intensive and require low latency and high throughput. Heterogeneous computing (HGC), with CPUs integrated with GPUs, FPGAs, and other science-targeted accelerators, offers unique capabilities to accelerate DNNs. Collaborating researchers at SHREC at the University of Florida, CERN openlab, NERSC at Lawrence Berkeley National Lab, Dell EMC, and Intel are studying the application of heterogeneous computing (HGC) to scientific problems using DNN models. This paper focuses on the use of FPGAs to accelerate the inferencing stage of the HGC workflow. We present case studies and results in inferencing state-of-the-art DNN models for scientific data analysis, using the Intel Distribution of OpenVINO, running on an Intel Programmable Acceleration Card (PAC) equipped with an Arria 10 GX FPGA. Using the Intel Deep Learning Acceleration (DLA) development suite to optimize existing FPGA primitives and develop new ones, we were able to accelerate the scientific DNN models under study with a speedup from 2.46x to 9.59x for a single Arria 10 FPGA against a single core (single thread) of a server-class Skylake CPU.
    [Show full text]