arXiv:1809.02697v3 [cs.DC] 5 Jul 2019

Optimizing CNN Model Inference on CPUs

Yizhi Liu*, Yao Wang*, Ruofei Yu, Mu Li, Vin Sharma, Yida Wang
Amazon Web Services
{yizhiliu, wayao, yuruofei, mli, vinarm, wangyida}@amazon.com
*Equal contribution

Abstract

The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gain to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations from the off-the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents NeoCPU, a comprehensive approach of CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. NeoCPU optimizes the operations as templates without relying on third-party libraries, which enables further improvement of the performance via operation- and graph-level joint optimization. Experiments show that NeoCPU achieves up to 3.45× lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.

1 Introduction

The growing use of Convolutional Neural Network (CNN) models in computer vision applications makes this model architecture a natural focus for performance optimization efforts. Similarly, the widespread deployment of CPUs in servers, clients, and edge devices makes this hardware platform an attractive target. Therefore, performing CNN model inference efficiently on CPUs is of critical interest to many users.

The performance of CNN model inference on CPUs leaves significant room for improvement. Performing a CNN model inference is essentially executing a computation graph consisting of operations. In practice, people normally use high-performance kernel libraries (e.g. Intel MKL-DNN [27] and OpenBLAS [51]) to obtain high performance for CNN operations. While these libraries tune very carefully for common operations with normal input data shapes (e.g. 2D convolutions), they only focus on the (mostly convolution) operations and miss the opportunities to further optimize the end-to-end model inference at the graph level. The graph-level optimization is often handled by the deep learning frameworks, e.g. TensorFlow [5] and MXNet [8].

However, the graph-level optimization that a framework can do, such as operation fusion and data layout planning, is limited because the operation implementation is already predefined in the third-party libraries. Therefore, the optimizations in the frameworks do not work in concert with the optimizations in the kernel library, which leaves significant performance gains unrealized in practice. Furthermore, different CPU architectures rely on different high-performance libraries, and integrating a library into a deep learning framework requires error-prone and time-consuming engineering effort. Lastly, although those libraries are highly optimized, they are present as third-party plug-ins, which may introduce contention issues with other libraries in the framework. As an example, TensorFlow originally used the Eigen library [4] to handle computation on CPUs. Later on, MKL-DNN was also introduced. As a consequence, at runtime MKL-DNN threads coexist with Eigen threads, resulting in resource contention. In summary, this kind of framework-specific approach for CNN model inference on CPUs is inflexible, cumbersome, and sub-optimal.

Because of the constraint imposed by the framework, optimizing the performance of CNN model inference end-to-end without involving a framework (i.e. a framework-agnostic method) is of obvious interest to many deep learning practitioners. Recently, Intel launched a universal CNN model inference engine called OpenVINO Toolkit [16]. This toolkit optimizes CNN models in the computer vision domain on Intel processors (mostly x86 CPUs) and claims to achieve better performance than the deep learning frameworks alone. Yet, OpenVINO could only provide limited graph-level optimization (e.g. operation fusion as implemented in ngraph [15]) as it still relies upon MKL-DNN to deliver performance gains for the carefully-tuned operations. Therefore, the optimization done by OpenVINO is still not sufficient for most of the CNN models.

Based on the previous observation, we argue that in order to further improve the CNN model inference performance on CPUs, being able to do flexible end-to-end optimization is the key. In this paper, we propose NeoCPU, a comprehensive approach to optimize CNN models for efficient inference on CPUs. NeoCPU is full-stack and systematic: it includes operation- and graph-level joint optimizations and does not rely on any third-party high-performance libraries. At the operation level, we follow the well-studied techniques to optimize the most computationally-intensive operations like convolution (CONV) in a template, which is applicable to different workloads on multiple CPU architectures and enables flexible graph-level optimization. At the graph level, in addition to the common techniques such as operation fusion and inference simplification, we coordinate the individual operation optimizations by manipulating the data layout flowing through the entire model for the best end-to-end performance. In summary, NeoCPU does the end-to-end optimization in a flexible and automatic fashion, while the existing works rely on third-party libraries and lack comprehensive performance tuning.

NeoCPU is built upon a deep learning compiler stack named TVM [9] with a number of enhancements. TVM makes it possible to use our own operation-level optimizations instead of third-party high-performance libraries, which makes it flexible to apply our operation- and graph-level joint optimization. However, before our work the original TVM stack had only one customized operation-level optimization on ARM CPUs, for convolutions with specific data shapes, and no operation- and graph-level joint optimization. In addition, there exist other deep learning compilers such as Tensor Comprehensions [46] and Glow [40]. Unfortunately, they either do not target CPUs or do not optimize CPU performance well; e.g. based on the paper description and our own experiments, Glow only optimizes the single-core performance for CPUs. Therefore we do not incorporate those works as baselines. Table 1 summarizes the features of NeoCPU compared to others. To the best of our knowledge, NeoCPU achieves competitive performance for CNN model inference on various kinds of popular CPUs.

Table 1: Side-by-side comparison between NeoCPU and existing works on CNN model inference

                             Op-level opt   Graph-level opt   Joint opt   Open-source
NeoCPU                       ✓              ✓                 ✓           ✓
MXNet [8] / TensorFlow [5]   3rd party      limited           ✗           ✓
OpenVINO [16]                3rd party      limited           ?           ✗
Original TVM [9]             incomplete     ✓                 ✗           ✓
Glow [40]                    single core    ✓                 ✗           ✓

Specifically, this paper makes the following contributions:

• Provides an operation- and graph-level joint optimization scheme to obtain high CNN model inference performance on different popular CPUs including Intel, AMD and ARM, which outperforms the current state-of-the-art implementations;

• Constructs a template to achieve good performance of convolutions, which is flexible to apply to various convolution workloads on multiple CPU architectures (x86 and ARM) without relying on high-performance kernel libraries;

• Designs a global scheme to look for the best layout combination in different operations of a CNN model, which minimizes the data layout transformation overhead between operations while maintaining the high performance of individual operations.

It is worth noting that this paper primarily deals with direct convolution computation, while NeoCPU is compatible with other optimization work on the computationally-intensive kernels, e.g. CONVs via Winograd [7, 29] or FFT [52].

We evaluated NeoCPU on CPUs with both x86 and ARM architectures. In general, NeoCPU delivers the best performance for 13 out of 15 popular networks on Intel Skylake CPUs, 14 out of 15 on AMD EPYC CPUs, and all 15 models on ARM Cortex A72 CPUs. It is worth noting that the baselines on x86 CPUs were more carefully tuned by the chip vendor (Intel MKL-DNN) but the ARM CPUs were less optimized. While the selected framework-specific (MXNet and TensorFlow) and framework-agnostic (OpenVINO) solutions may perform well on one case and less favorably on the other case, NeoCPU runs efficiently across models on different architectures.

In addition, NeoCPU produces a standalone module with minimal size that does not depend on either the frameworks or the high-performance kernel libraries, which enables easy deployment to multiple platforms. NeoCPU is used in Amazon SageMaker Neo Service¹, enabling model developers to optimize for inference on CPU-based servers in the cloud and devices at the edge. Using this service, a number of application developers have deployed CNN models optimized for inference in production on several types of platforms. All source code has been released to the open source TVM project².

The rest of this paper is organized as follows: Section 2 reviews the background of modern CPUs as well as the typical CNN models; Section 3 elaborates the optimization ideas that we propose and how we implement them, followed by evaluations in Section 4. We list the related works in Section 5 and summarize the paper in Section 6.

¹ https://aws.amazon.com/sagemaker/neo/
² https://github.com/dmlc/tvm

2 Background

2.1 Modern CPUs

Although accelerators like GPUs and TPUs demonstrate their [...]

[...] model is normally abstracted as a computation graph, essentially a Directed Acyclic Graph (DAG), in which a node represents an operation and a directed edge pointing from node X to Y represents that the output of operation X serves as (a part of) the inputs of operation Y (i.e.
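
The computation-graph abstraction described above (nodes are operations, an edge from X to Y means Y consumes the output of X) can be illustrated with a small sketch. This is not NeoCPU or TVM code: the graph, the operation names, and the scalar stand-in "kernels" are all hypothetical, chosen only to show how a DAG of operations might be executed in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy graph: each key is an operation, each value lists the
# operations whose outputs it consumes (its incoming edges).
graph = {
    "conv1": ["input"],
    "relu1": ["conv1"],
    "conv2": ["relu1"],
    "add": ["conv2", "relu1"],  # a node may consume several outputs
}

# Stand-in scalar "kernels"; real operations would work on tensors.
kernels = {
    "conv1": lambda x: 2 * x,
    "relu1": lambda x: max(x, 0),
    "conv2": lambda x: x + 1,
    "add": lambda a, b: a + b,
}


def run_graph(graph, kernels, inputs):
    """Execute every operation once, in an order that respects the edges."""
    values = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in values:  # graph inputs are already available
            continue
        args = [values[dep] for dep in graph[node]]
        values[node] = kernels[node](*args)
    return values


result = run_graph(graph, kernels, {"input": 3})
print(result["add"])
```

Because the executor only sees node names and edges, a graph-level pass (for example, replacing "conv1" and "relu1" with one fused node) would amount to rewriting the `graph` and `kernels` dictionaries before calling `run_graph`; that rewriting step is left out of this sketch.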