Benchmarking Contemporary Deep Learning Hardware and Frameworks: A Survey of Qualitative Metrics
Wei Dai, Daniel Berleant

To cite this version:

Wei Dai, Daniel Berleant. Benchmarking Contemporary Deep Learning Hardware and Frameworks: A Survey of Qualitative Metrics. 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI), Dec 2019, Los Angeles, United States. pp. 148-155, 10.1109/CogMI48466.2019.00029. hal-02501736

HAL Id: hal-02501736
https://hal.archives-ouvertes.fr/hal-02501736
Submitted on 7 Mar 2020

Benchmarking Contemporary Deep Learning Hardware and Frameworks: A Survey of Qualitative Metrics

Wei Dai, Department of Computer Science, Southeast Missouri State University, Cape Girardeau, MO, USA, [email protected]
Daniel Berleant, Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA, [email protected]

Abstract—This paper surveys benchmarking principles; machine learning devices including GPUs, FPGAs, and ASICs; and deep learning software frameworks. It also reviews these technologies with respect to benchmarking from the perspectives of a 6-metric approach to frameworks and an 11-metric approach to hardware platforms. Because MLPerf is a benchmark organization working with industry and academia, and offering deep learning benchmarks that evaluate training and inference on deep learning hardware devices, the survey also mentions MLPerf benchmark results, benchmark metrics, datasets, deep learning frameworks, and algorithms. We summarize seven benchmarking principles, differential characteristics of mainstream AI devices, and a qualitative comparison of deep learning hardware and frameworks.

Keywords—deep learning benchmark, AI hardware and software, MLPerf, AI metrics

Fig. 1. Benchmarking Metrics and AI Architectures

I. INTRODUCTION

After developing for about 75 years, deep learning technologies are still maturing. In July 2018, Gartner, an IT research and consultancy company, pointed out that deep learning technologies are in the Peak-of-Inflated-Expectations (PoIE) stage on the Gartner Hype Cycle diagram [1], as shown in Figure 2, which means deep learning networks trigger many industry projects as well as research topics [2][3][4].

Image data impacts the accuracy of deep learning algorithms, and the accuracy of deep learning algorithms can be improved by feeding high quality images into these algorithms. Well-known image sets useful for this include CIFAR-10 [5], MNIST [6], ImageNet [7], and Pascal Visual Object Classes (P-VOC) [8]. The CIFAR-10 dataset has 10 classes, and all images are 32×32 color images. MNIST consists of handwritten digit images in black and white. ImageNet and P-VOC are high quality image datasets, and are broadly used in visual object category recognition and detection.
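To make these dataset differences concrete, the following minimal Python sketch prints the shapes of CIFAR-10 and MNIST. It assumes TensorFlow is installed and uses the loaders bundled in tf.keras, which download the data on first use.

import tensorflow as tf

# Load the training splits of the two datasets discussed above.
(cifar_x, cifar_y), _ = tf.keras.datasets.cifar10.load_data()
(mnist_x, mnist_y), _ = tf.keras.datasets.mnist.load_data()

print(cifar_x.shape)  # (50000, 32, 32, 3): 32x32 color images in 10 classes
print(mnist_x.shape)  # (60000, 28, 28): 28x28 grayscale handwritten digits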
Benchmarking is useful in both industry and academia. The definition from the Oxford English Dictionary [9] states that a benchmark is "to evaluate or check (something) by comparison with an established standard." Deep neural networks are leading technologies that extend their computing performance and capability based on flexibility, distributed architectures, creative algorithms, and large volume datasets. Comparing them via benchmarking is increasingly important.

Even though previous research papers provide knowledge of deep learning, it is hard to find a survey discussing qualitative benchmarks for machine learning hardware devices and deep learning software frameworks, as shown in Figure 1. In this paper we introduce 11 qualitative benchmarking metrics for hardware devices and six metrics for software frameworks in deep learning, respectively. The paper also provides qualitative benchmark results for major deep learning devices, and compares 18 deep learning frameworks.

According to [16], [17], and [18], there are seven vital characteristics for benchmarks. These key properties are:

1) Relevance: Benchmarks should measure important features.
2) Representativeness: Benchmark performance metrics should be broadly accepted by industry and academia.
3) Equity: All systems should be fairly compared.
4) Repeatability: Benchmark results should be verifiable.
5) Cost-effectiveness: Benchmark tests should be economical.
6) Scalability: Benchmark tests should scale from a single server to multiple servers.
7) Transparency: Benchmark metrics should be readily understandable.

Evaluating artificial intelligence (AI) hardware and deep learning frameworks allows discovering both the strengths and weaknesses of deep learning technologies. To illuminate, this paper is organized as follows. Section II reviews hardware platforms including CPUs, GPUs, FPGAs, and ASICs, qualitative metrics for benchmarking hardware, and qualitative results on benchmarking devices. Section III introduces qualitative metrics for benchmarking frameworks and results. Section IV introduces a machine learning benchmark organization named MLPerf and its deep learning benchmarking metrics. Section V presents our conclusions. Section VI discusses future work.


Fig. 2. Milestones of deep learning on the Gartner Hype Cycle. We inserted some deep learning historical milestones, modifying the figure of Gartner [1].

II. DEEP LEARNING HARDWARE

AI algorithms often benefit from many-core hardware and high bandwidth memory, in comparison to many non-AI algorithms. Thus computational power is not just a one-dimensional concept. The type of computations the hardware design is best suited for must be considered, since a hardware platform can have more or less computational power depending on the type of computation on which it is measured. GPUs (graphics processing units) do well on the kind of parallelism often beneficial to AI algorithms, in comparison to CPUs (central processing units), and thus tend to be well suited to AI applications. FPGAs (field programmable gate arrays), being configurable, can be configured to perform well on AI algorithms, although currently they lack the rich software layer needed to fully achieve their potential in the AI domain. ASICs (application specific integrated circuits) are similar to FPGAs in this regard, since in principle a specially configured FPGA is a kind of ASIC. Thus GPUs, FPGAs, and ASICs have the potential to expedite machine learning algorithms in part because of their capabilities for parallel computing and high-speed internal memory.

Nevertheless, while earlier generation CPUs have had performance bottlenecks when training or using deep learning algorithms, cutting edge CPUs can provide better performance and thus better support for deep learning algorithms. In 2017, Intel released the Intel Xeon Scalable processors, which include the Intel Advanced Vector Extensions 512 (Intel AVX-512) instruction set and the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) [10]. Intel AVX-512 and MKL-DNN accelerate deep learning algorithms on lower precision tasks. Compared to the mainstream 32-bit floating point precision (fp32) on GPUs, the 16-bit and 8-bit floating-point precisions (fp16/fp8) are lower in precision, but can be sufficient for inference in the deep learning application domain. In addition, lower precision can also enhance usage of cache and memory, and can maximize memory bandwidth. Let us look specifically at GPUs, FPGAs, and ASICs next.

A. GPU Devices

GPUs are specialized unitary processors that are dedicated to accelerating real time three-dimensional (3D) graphics. GPUs contain an internal cache, high speed bandwidth, and quick parallel performance. The GPU cache accelerates matrix multiplication routines because these routines do not need to access global memory.

GPUs are universal hardware devices for deep learning. After testing neural networks, including ones with 200 hidden layers, on the MNIST handwritten data sets, GPU performance was found to be better than that of CPUs [11]. The test results show the GeForce 6800 Ultra has a 3.3X speedup compared to the Intel 3 GHz P4, and the ATI Radeon X800 has a 2.4-3.4X speedup. NVIDIA GPUs have steadily increased FLOPS (floating point operations per second) performance. In [12], a single NVIDIA GeForce 8800 GTX, released in November 2006, had 128 CUDA cores with 345.6 gigaflops, and its memory bandwidth was 86.4 GB/s; by September 2018, the NVIDIA GeForce RTX 2080 Ti [13] had 4,352 CUDA cores with 13.4 teraflops, and its memory bandwidth had increased to 616 GB/s.
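These peak-FLOPS numbers follow from cores × clock × 2, since a fused multiply-add counts as two floating point operations per core per cycle. A small sketch reproduces both GPU figures; the clock rates are our assumptions taken from public spec sheets, not values given in [12] or [13].

def peak_gflops(cores, clock_ghz, ops_per_cycle=2):
    # Theoretical peak = cores * clock (GHz) * FLOPs issued per cycle.
    return cores * clock_ghz * ops_per_cycle

print(peak_gflops(128, 1.35))    # GeForce 8800 GTX: ~345.6 gigaflops
print(peak_gflops(4352, 1.545))  # RTX 2080 Ti: ~13447 gigaflops, i.e. ~13.4 teraflops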
B. FPGA Devices

FPGAs have dynamically reconfigurable hardware, so hardware engineers develop for FPGAs using a hardware description language (HDL) such as VHDL or Verilog [14][15]. Some use cases will always involve energy-sensitive scenarios, and FPGA devices offer better performance per watt than GPUs. According to [16], when comparing gigaflops per watt, FPGA devices often have a 3-4X advantage over GPUs. After comparing the performance of FPGAs and GPUs on ImageNet 1K data sets, Ovtcharov et al. [17] confirmed that Arria 10 GX1150 FPGA devices handled about 233 images/sec at a device power of 25 watts. In comparison, NVIDIA K40 GPUs handled 500-824 images/sec at a device power of 235 watts. Briefly, [17] demonstrates that these FPGAs can process 9.3 images/joule, but these GPUs can only process 2.1-3.4 images/joule.
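The images-per-joule figures follow directly from throughput divided by power draw, since a watt is one joule per second. A one-line check of the numbers above:

def images_per_joule(images_per_sec, watts):
    # (images/second) / (joules/second) = images per joule of energy.
    return images_per_sec / watts

print(images_per_joule(233, 25))   # Arria 10 GX1150 FPGA: ~9.3 images/joule
print(images_per_joule(500, 235))  # NVIDIA K40, low end: ~2.1 images/joule
print(images_per_joule(824, 235))  # NVIDIA K40, high end: ~3.5 images/joule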
C. ASIC Devices

Usually, ASIC devices have high throughput and low energy consumption, because ASICs are fabricated chips designed for special applications instead of generic tasks. While testing AlexNet [18], one of the convolutional neural networks, the Eyeriss accelerator consumed 278 mW. Furthermore, the Eyeriss achieved 125.9 images/joule (with batch size N=4) [19]. In [12], Google researchers confirm that the TPU 1.0, based on ASIC technologies, has about a 15-30X speed-up compared to GPUs or CPUs of the same period, with TOPS/watt about 30-80X better.

D. Enhancing Hardware Performance

Even though multiple cores, CPUs, and hyper-threading are mainstream technologies, these technologies still show weaknesses in the big data era. For example, deep learning models usually involve matrix products and transpositions [11], so these algorithms require intensive computing resources. GPUs, FPGAs, and ASICs have better computing performance with lower latency than conventional CPUs because these specialized chipsets consist of many cores and on-chip memory. The memory hierarchy on these hardware devices is usually separated into two layers: 1) off-chip memory, named global memory or main memory; and 2) on-chip memory, termed local memory or shared memory. After copying data from global memory, deep learning algorithms can use high-speed shared memory to expedite computing. Specific program libraries provide dedicated application programming interfaces (APIs) for hardware devices, abstract away complex parallel programming, and increase execution performance. For instance, the cuDNN library, released by NVIDIA, can improve the performance of Apache MXNet and Caffe on NVIDIA GPUs [20][17].

Traditionally, multiple cores, improved I/O bandwidth, and increased core clock speeds improve hardware speed [21]. In Figure 3, Arithmetic Logic Unit (ALU), single instruction, multiple data (SIMD), and single instruction, multiple thread (SIMT) systems concurrently execute multiply-accumulate (MAC) tasks based on shared memory and configuration files.

Fig. 3. Parallel Chipsets and memory diagrams (after [21])

There are also new algorithms for improving computing performance. GPUs are low-latency temporary storage architectures, so the Toeplitz matrix, fast Fourier transform (FFT), and Winograd and Strassen algorithms can be used to improve GPU performance [21]. Data movement consumes energy. FPGAs and ASICs are spatial architectures; these devices contain low-energy on-chip memory, so reusable dataflow algorithms provide solutions for reducing data movement. Weight stationary dataflow, output stationary dataflow, no local reuse dataflow, and row stationary dataflow were developed for decreasing the energy consumption of FPGAs and ASICs [21]. In addition, co-design of deep learning algorithms and hardware devices is another approach. According to [21], there are two solutions. 1) Decrease precision: several algorithms decrease the precision of DNN operations and operands, such as 8-bit fixed point, binary weight sharing, and log domain quantization. 2) Reduce the number of operations and the model size: notable algorithms include exploiting activation statistics, network pruning, and knowledge distillation.
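As a small illustration of the FFT route to faster convolution, the sketch below uses NumPy and SciPy on the CPU as stand-ins for the GPU libraries discussed above, and verifies that FFT-based convolution matches the direct method.

import numpy as np
from scipy.signal import fftconvolve

signal = np.random.rand(4096)
kernel = np.random.rand(64)

direct = np.convolve(signal, kernel)    # direct convolution, O(n*k) multiplies
via_fft = fftconvolve(signal, kernel)   # same convolution via the FFT, O(n log n)

print(np.allclose(direct, via_fft))     # True: same result, fewer operations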
E. Qualitative Benchmarking Metrics for Machine Learning Hardware

GPUs, FPGAs, and ASICs can be used in domains besides deep learning, including cloud servers and edge devices. We distinguish 11 qualitative benchmarking metrics for machine learning devices, as follows. Results of the benchmarks are shown in Table I.

TABLE I. QUALITATIVE BENCHMARKING OF HARDWARE FOR MACHINE LEARNING ([10]-[20])

# | Attribute | ASICs | FPGAs | GPUs
1 | Computing Performance | High | Low | Moderate
2 | Low Latency | High | Moderate | Low
3 | Energy Efficiency | High | Moderate | Good
4 | Compatibility | Low | Moderate | High
5 | Research Costs | High | Moderate | Low
6 | Research Risks | High | Low | Moderate
7 | Upgradability | Low | Moderate | High
8 | Scalability | High | Low | Moderate
9 | Chip Price | Low | High | Moderate
10 | Ubicomp | Low | High | High
11 | Time-to-Market | Low | High | High

1) Computing performance can be measured by FLOPS. For measuring ASICs and GPUs, a quadrillion (thousand trillion) FLOPS (petaflops) is used in testing modern chipsets. In May 2017, Google announced the Tensor Processing Unit 2.0 (TPU 2.0), which provides 11.5 petaflops per pod [22]. TPU 3.0, released in May 2018, offers 23.0 petaflops [23]. By comparison, the NVIDIA GeForce RTX 2080 Ti offers only 13.4 teraflops [13]. According to [24] and [25], ASICs have the most FLOPS, and GPUs are better than FPGAs.

2) Low latency describes an important chipset capability [26], and is distinguished from throughput [12]. In [12][24], ASICs have the lowest latency, while FPGAs have lower latency than GPUs.

3) Energy efficiency in computing is particularly important for edge nodes because mobile devices generally have limited power. In [12][24], ASICs have the highest energy efficiency, and FPGAs and GPUs come in second and third, respectively.

4) Compatibility means devices can be supported by multiple deep learning frameworks and popular programming languages. FPGAs need specially developed libraries, so FPGAs are not that good with respect to compatibility. GPUs have the best compatibility [24]. ASICs currently are second; for example, TPUs support TensorFlow, Caffe, etc.

5) Research costs refer to the total costs for developing devices, incurred from designing architectures, developing algorithms, and deploying chip sets on hardware devices. GPUs are affordable devices [24]. ASICs are expensive, and FPGAs are between GPUs and ASICs.

6) Research risks are determined by hardware architectures, development risks, and deployed chip sets. ASICs have the highest risks before market scaling. FPGAs are very flexible, so their risks are limited. GPUs are in the middle.

7) Upgradability is a challenge for most hardware devices. In [24], GPUs are the most flexible after deployment, and are better than FPGAs. ASICs are the most difficult to update after delivery.

8) Scalability means hardware devices can scale up quickly with low costs. Scalability is vital for clouds and data centres. ASICs have excellent scalability. GPUs have good scalability, but not as good as ASICs. FPGAs are the lowest on this dimension.

9) Chip price means the price of each unit chip after industrial-scale production. In [27], FPGAs have the highest chip cost after production scale-up. ASICs have the lowest cost, and GPUs are in the middle.

10) Ubicomp (also named ubiquitous computing) indicates hardware devices used extensively for varied use cases including, e.g., large scale clouds and low energy mobile devices. FPGAs are very flexible, so the devices can be used in different industries and scientific fields. ASICs usually are dedicated to specific industry needs. GPUs, like FPGAs, can be used in many research fields and industry domains.

11) Time-to-market means the length of time from design to sale of products. According to [15], [24], and [27], FPGAs and GPUs have lower development time than ASICs.

III. MAINSTREAM DEEP LEARNING FRAMEWORKS

Open source deep learning frameworks allow engineers and scientists to define activation functions, develop special algorithms, train big data, and deploy neural networks on different hardware platforms, from x86 servers to mobile devices.

Based on the wide variety of usages, support teams, and development interfaces, we split 18 frameworks into three sets: mature frameworks, developing frameworks, and inactive frameworks. The 10 mature frameworks can be used currently to enhance training speed, improve scalable performance, and reduce development risks. The developing frameworks are not yet broadly used in industry or research projects, but some could be used in specific fields. Retired frameworks are largely inactive.

A. Mature Frameworks

1) Caffe and Facebook Caffe2: Caffe [28] was developed at the University of California, Berkeley in C++. According to [29], Caffe can be used on FPGA platforms. Caffe2 [30] is an updated framework supported by Facebook.

2) Chainer Framework: Chainer [31], written in Python, can be extended to multiple nodes and GPU platforms through the CuPy and MPI4Python libraries [32][33].

3) DyNet Framework: DyNet [34] was written in C++. The framework can readily define dynamic computation graphs, so DyNet can help improve development speed; a sketch of the define-by-run idea appears below. Currently, DyNet only supports single nodes and not multiple node platforms.
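Define-by-run means the computation graph is rebuilt for every input, so control flow can depend on runtime data. The minimal sketch below illustrates the idea; it is written in PyTorch, another define-by-run framework surveyed later, simply to keep all sketches in Python. The DyNet Python API is analogous.

import torch

def forward(x, depth):
    # The graph is rebuilt on every call, so its depth can depend
    # on runtime data -- the define-by-run idea.
    w = torch.eye(x.shape[0], requires_grad=True)
    h = x
    for _ in range(depth):
        h = torch.tanh(w @ h)
    return h.sum()

loss = forward(torch.ones(3), depth=2)  # two layers on this call
loss.backward()                         # gradients follow the graph actually built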

Fig. 4. Popular deep learning frameworks. The columns show, from right to left: hardware, frameworks, license types, core codes, and API codes.

4) MXNet: Apache MXNet [35][36] is a well known deep learning framework. The framework was built in C++, and MXNet supports NVIDIA GPUs through the NVIDIA cuDNN library. Gluon [37] is a development interface for MXNet; a minimal Gluon sketch follows below.

5) Microsoft CNTK: The Microsoft Cognitive Toolkit (Microsoft CNTK) [38][39], funded by Microsoft and written in C++, supports distributed platforms.
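As an illustration of the Gluon interface, the minimal sketch below defines and runs a small network on MXNet; the layer widths and input size are arbitrary choices for the example, not values from the cited sources.

import mxnet as mx
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation='relu'),  # hidden layer
        nn.Dense(10))                     # output layer
net.initialize()                          # allocate and initialize parameters

x = mx.nd.random.uniform(shape=(1, 20))   # one sample with 20 features
print(net(x).shape)                       # (1, 10)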

6) Google TensorFlow: In 2011, Google released DistBelief [40], but the framework was not an open source project. In 2016, the project was merged into TensorFlow [41][42], an open source deep learning framework.

7) Keras: Keras [43][44] is a Python library for TensorFlow, Theano, and Microsoft CNTK. Keras has a reasonable development interface that can help developers to quickly develop demo systems and reduce development costs and risks.
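Keras's suitability for quickly developing demo systems is easy to see in code. The minimal sketch below, assuming the TensorFlow-bundled Keras API and its MNIST loader, trains a small classifier in about a dozen lines.

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0  # scale pixel values to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)  # one epoch as a quick smoke test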

8) Neon and PlaidML, partially supported by Intel: Neon [45], supported by Nervana Systems and Intel, may improve performance for deep learning on diverse platforms. PlaidML [46] was released by Vertex.AI in 2017; Intel will soon fund PlaidML.

9) PyTorch Framework: PyTorch [47][48], written in Python, can be integrated with Jupyter Notebook. FastAI [49] is another development interface for PyTorch.

10) Theano Framework: The core language of Theano [50][51] is Python, with a BSD license. Lasagne [52][53] is an additional development library for Theano.

B. Developing Frameworks

In addition, some deep learning frameworks are less frequently mentioned in academic papers because of their limited functions. For example:

1. Apache SINGA [54] was developed in C++. The framework is supported by the Apache group [55][56].
2. BigDL [57][58], built with Scala code, is a deep learning framework that can run on Apache Spark and Apache Hadoop.
3. In [59], the authors mention DeepLearning4J (DL4J), which can be accelerated by cuDNN.
4. The PaddlePaddle deep learning framework was developed by Baidu using Python [60].

C. Inactive Frameworks

We mention two of these. (1) Torch [61] was written in Lua; it is inactive. (2) Purine [62][63] is open source and has not been updated since 2014.

D. Qualitative Benchmarking Metrics for Deep Learning Frameworks

Benchmarking metrics for deep learning frameworks include the six qualitative metrics described next.

1) License Type: Open source software licenses impose a variety of restrictions. In [64], degree of openness is used as a metric for ranking open source licenses. The Apache 2.0 license has relatively few restrictions. The MIT license imposes the most limitations. BSD is in the middle. So, comparing degree of openness, Apache 2.0 > BSD > MIT.

2) Interface Codes (also called the API): The more functionality the API offers, the better it tends to support development. A good API can increase development productivity, reduce development cost, and enhance the functionality of the framework.

3) Compatible Hardware: Computing hardware devices including CPUs and GPUs constitute the underlying support for deep learning frameworks. The more different hardware devices a deep learning framework can run on, the better it is on this dimension.

4) Reliability: No single point of failure (NSPOF) is a risk minimizing design strategy. This approach ensures that one fault in a framework will not break an entire system. To avoid single points of failure, a mature framework might run on multi-server platforms rather than a single node.

5) Tested Deep Learning Networks: Evaluating software can discover potential problems, measure performance metrics, and highlight strengths and weaknesses. If a framework can be officially verified with a variety of deep learning networks, then the framework is correspondingly more suitable as a mainstream production framework.

6) Tested Datasets: Image datasets, voice datasets, and text datasets are among those used for training and testing deep learning networks. If a framework has been verified on diverse datasets, we are able to know its performance, strengths, and weaknesses.

Consistent with these six metrics, 16 mainstream deep learning frameworks are compared in Figure 4 and Table II (shown after the references).

IV. A MACHINE LEARNING BENCHMARK ORGANIZATION

MLPerf is a machine learning benchmark organization that offers useful benchmarks evaluating training and inference on deep learning hardware devices. MLPerf and its members are associated with advanced chip hardware companies and leading research universities. Hardware companies include Google, NVIDIA, and Intel. Research universities include Stanford University, Harvard University, and the University of Texas at Austin.

MLPerf members share their benchmarking results. Benchmark results, source codes, deep learning algorithms (also called deep learning models), and configuration files are submitted to a website on github.com. Currently MLPerf members have already submitted the MLPerf Training Results v0.5 and MLPerf Training Results v0.6, and the deep learning reference results v0.5 will be released soon.

MLPerf benchmarks involve benchmark metrics, datasets, deep learning algorithms, and deep learning frameworks. MLPerf members execute deep learning algorithms on hardware devices, then record the execution time, the deep learning algorithms, the deep learning frameworks, and the tested open datasets. Time is a critical metric for measuring MLPerf training or inference benchmarks [65]; short run time is associated with high performance of deep learning devices. Benchmark datasets consist of image datasets, translation datasets, and recommendation datasets. ImageNet and COCO [66] are among the image datasets. WMT English-German [67] and MovieLens-20M [68] are translation and recommendation datasets, respectively. MLPerf benchmark frameworks are TensorFlow, PyTorch, MXNet, Intel Caffe, and Sinian. MLPerf deep learning algorithms benchmarked [69] include ResNet50-v1.5, MobileNet-v1, SSD-MobileNet, and SSD-ResNet34.
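Because wall-clock time to a target quality is the core MLPerf-style training metric [65], the measurement logic can be sketched in a few lines. In the sketch below, train_one_epoch and evaluate are hypothetical placeholders for a real model, dataset, and accuracy check.

import time

def time_to_accuracy(train_one_epoch, evaluate, target_acc, max_epochs=100):
    # Train until the evaluated accuracy reaches the target, then
    # return the elapsed wall-clock time -- the benchmark's score.
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target_acc:
            return time.perf_counter() - start
    raise RuntimeError('target accuracy not reached within max_epochs')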
V. CONCLUSIONS

Deep learning has increased in popularity dramatically in recent years. This technology can be used in image classification, speech recognition, and language translation. In addition, deep learning technology is continually developing. Many innovative chipsets, useful frameworks, creative models, and big data sets are emerging, extending the markets and uses for deep learning.

While deep learning technology is expanding, it is useful to understand the dimensions and methods for measuring deep learning hardware and software. Benchmarking principles include representativeness, relevance, equity, repeatability, affordable cost, scalability, and transparency. Major deep learning hardware platform types include CPUs, GPUs, FPGAs, and ASICs. We discussed machine learning platforms, and mentioned approaches that enhance performance of these platforms. In addition, we listed 11 qualitative benchmarking features for comparing deep learning hardware.

AI algorithms often benefit from many-core hardware and high bandwidth memory, in comparison to many non-AI algorithms that are often encountered in practice [70]. Thus it is not just the computational power of hardware as a one-dimensional concept that makes it more (or less) suited to AI applications, but also the type of computations the hardware excels in. A hardware platform can have more or less computational power depending on the type of computation on which it is measured. GPUs (graphics processing units) often do comparatively well on the kind of parallelism often beneficial to AI algorithms, and thus tend to be well suited to AI applications. FPGAs, being configurable, can be configured to perform well on AI algorithms, although currently they lack the rich software layer needed to be as useful for this as they could become. ASICs are similar to FPGAs in this regard, since in principle a specially configured FPGA is a kind of ASIC.

Software frameworks for deep learning are diverse. We compared 16 mainstream frameworks through license types, compliant hardware devices, and tested deep learning algorithms. We split popular deep learning frameworks into three categories: mature deep learning frameworks, developing frameworks, and retired frameworks.

Deep learning benchmarks can help link industry and academia. MLPerf is a new and preeminent deep learning benchmark organization. The organization offers benchmarking metrics, dataset evaluation, test codes, and result sharing.

VI. FUTURE WORK

Deep learning technology, including supporting hardware devices and software frameworks, is increasing in importance, so scientists and engineers are developing new hardware and creative frameworks. We are planning a website named Benchmarking Performance Suite (http://www.animpala.com/research.html) for collecting and updating results of benchmarking hardware and frameworks. Users will be able to access the website to share deep learning knowledge.

ACKNOWLEDGMENT

We are grateful to Google for partial support of this project in 2019.

REFERENCES

[1] J. Hare and P. Krensky, “Hype Cycle for Data Science and Machine Learning, 2018,” Gartner, 2018. [Online]. Available: https://www.gartner.com/doc/3883664/hype-cycle-data-science-machine.
[2] W. Dai, K. Yoshigoe, and W. Parsley, “Improving data quality through deep learning and statistical models,” in Advances in Intelligent Systems and Computing, 2018.
[3] W. Dai and N. Wu, “Profiling essential professional skills of chief data officers through topic modeling algorithms,” in AMCIS 2017 - America’s Conference on Information Systems: A Tradition of Innovation, 2017.
[4] R. Keefer and N. Bourbakis, “A Survey on Document Image Processing Methods Useful for Assistive Technology for the Blind,” Int. J. Image Graph., 2015.
[5] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2008.
[6] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of handwritten digits,” Courant Inst. Math. Sci., 1998.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., 2010.
[9] “Oxford English Dictionary Online,” Oxford English Dictionary, 2010.
[10] A. Rodriguez, E. Segal, E. Meiri, E. Fomenko, Y. J. Kim, and H. Shen, “Lower Numerical Precision Deep Learning Inference and Training,” Intel White Paper, 2018.
[11] D. Steinkraus, I. Buck, and P. Y. Simard, “Using GPUs for machine learning algorithms,” in Eighth International Conference on Document Analysis and Recognition (ICDAR’05), 2005.
[12] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proc. IEEE, vol. 96, no. 5, pp. 879–899, 2008.
[13] “Graphics Reinvented: NVIDIA GeForce RTX 2080 Ti Graphics Card,” NVIDIA.
[14] D. Galloway, “The Transmogrifier C hardware description language and compiler for FPGAs,” Proc. IEEE Symp. FPGAs Cust. Comput. Mach., 1995.
[15] G. Lacey, G. W. Taylor, and S. Areibi, “Deep Learning on FPGAs: Past, Present, and Future,” arXiv preprint arXiv:1602.04283, 2016.
[16] E. Nurvitadhi et al., “Can FPGAs beat GPUs in accelerating next-generation deep neural networks?,” in FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017.
[17] K. Ovtcharov, O. Ruwase, J. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft Research Whitepaper, 2015.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., 2012.
[19] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE J. Solid-State Circuits, 2017.
[20] MXNet Developers, “Apache MXNet (incubating) - A Flexible and Efficient Library for Deep Learning,” Apache, 2018. [Online]. Available: https://mxnet.apache.org/.
[21] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, 2017.
[22] “Google reveals more details about its second-gen TPU AI chips.” [Online]. Available: https://www.theinquirer.net/inquirer/news/3023202/google-reveals-more-details-about-its-second-gen-tpu-ai-chips.
[23] “Google announces a new generation for its TPU machine learning hardware,” TechCrunch. [Online]. Available: https://techcrunch.com/2018/05/08/google-announces-a-new-generation-for-its-tpu-machine-learning-hardware/.
[24] BERTEN DSP, “GPU vs FPGA Performance Comparison,” 2016. [Online]. Available: http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf.
[25] M. Parker, “Understanding Peak Floating-Point Performance Claims,” Intel FPGA White Paper, 2016.
[26] D. A. Patterson, “Latency lags bandwidth,” Commun. ACM, 2004.
[27] E. Vansteenkiste, “New FPGA Design Tools and Architectures,” Ghent University, Faculty of Engineering and Architecture, 2016.
[28] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[29] J. Xu, Z. Liu, J. Jiang, Y. Dou, and S. Li, “CaFPGA: An automatic generation model for CNN accelerator,” Microprocess. Microsyst., 2018.
[30] “Caffe2, GitHub repository,” 2018. [Online]. Available: https://caffe2.ai/.
[31] Chainer Developers, “Chainer repository,” GitHub repository, 2018. [Online]. Available: https://github.com/chainer.
[32] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a Next-Generation Open Source Framework for Deep Learning,” in Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
[33] T. Akiba, K. Fukuda, and S. Suzuki, “ChainerMN: Scalable Distributed Deep Learning Framework,” in Proceedings of the Workshop on ML Systems at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.
[34] G. Neubig et al., “DyNet: The dynamic neural network toolkit,” arXiv preprint arXiv:1701.03980, 2017.
[35] T. Chen et al., “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems,” arXiv preprint arXiv:1512.01274, 2015.
[36] “MXNet.js: Deep Learning in Browser,” GitHub repository, 2018. [Online]. Available: https://github.com/dmlc/mxnet.js/.
[37] “Gluon,” 2018. [Online]. Available: https://gluon.mxnet.io/index.html.
[38] “Microsoft CNTK,” 2018. [Online]. Available: https://www.microsoft.com/en-us/cognitive-toolkit/.
[39] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in 22nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[40] J. Dean et al., “Large Scale Distributed Deep Networks,” Adv. Neural Inf. Process. Syst., 2012.
[41] “TensorFlow: An open source library,” 2018. [Online]. Available: https://www.tensorflow.org/.
[42] M. Abadi et al., “TensorFlow: A System for Large-Scale Machine Learning,” in Proc. 12th USENIX Conf. Oper. Syst. Des. Implement., 2016.
[43] F. Chollet, “Keras,” GitHub repository, 2015. [Online]. Available: https://keras.io/.
[44] F. Chollet, “Keras: The Python Deep Learning library,” keras.io, 2015.
[45] “Neon, GitHub repository,” 2018. [Online]. Available: https://github.com/NervanaSystems/neon.
[46] C. Ng, “Announcing PlaidML: Open Source Deep Learning for Every Platform,” 2017.
[47] “PyTorch: An open source deep learning platform,” 2018. [Online]. Available: https://pytorch.org/.
[48] A. Paszke et al., “Automatic differentiation in PyTorch,” 31st Conf. Neural Inf. Process. Syst., 2017.
[49] “FastAI, GitHub repository,” 2018. [Online]. Available: https://www.fast.ai/.
[50] “Theano,” 2018. [Online]. Available: http://deeplearning.net/software/theano/.
[51] R. Al-Rfou et al., “Theano: A Python framework for fast computation of mathematical expressions,” arXiv preprint, 2016.
[52] “Lasagne, GitHub repository,” 2018. [Online]. Available: https://github.com/Lasagne/Lasagne.
[53] B. Van Merriënboer et al., “Blocks and fuel: Frameworks for deep learning,” arXiv preprint arXiv:1506.00619, 2015.
[54] B. C. Ooi et al., “SINGA: A Distributed Deep Learning Platform,” in Proc. 23rd ACM Int. Conf. Multimedia (MM ’15), 2015.
[55] W. Wang et al., “SINGA: Putting Deep Learning in the Hands of Multimedia Users,” Multimedia, 2015.
[56] D. Dai, T. Dunning, et al., “Apache SINGA,” Apache, 2018. [Online]. Available: https://singa.incubator.apache.org/.
[57] “BigDL,” Apache, 2018. [Online]. Available: https://github.com/intel-analytics/BigDL.
[58] Y. Wang et al., “BigDL: A Distributed Deep Learning Framework for Big Data,” arXiv preprint arXiv:1804.05839, 2018.
[59] “Deeplearning4j: Open-source distributed deep learning for the JVM,” Apache Software Foundation Licenses, 2018.
[60] Baidu, “PaddlePaddle-based AI.” [Online]. Available: http://en.paddlepaddle.org/.
[61] “Torch, GitHub repository,” 2018. [Online]. Available: https://github.com/torch/torch7.
[62] “Purine, GitHub repository,” 2018. [Online]. Available: https://github.com/purine/purine2.
[63] M. Lin, S. Li, X. Luo, and S. Yan, “Purine: A bi-graph based deep learning framework,” arXiv preprint arXiv:1412.6249, 2014.
[64] Y. H. Lin, T. M. Ko, T. R. Chuang, and K. J. Lin, “Open source licenses and the creative commons framework: License selection and comparison,” J. Inf. Sci. Eng., 2006.
[65] C. Coleman et al., “Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark,” arXiv preprint arXiv:1806.01427, 2018.
[66] T. Y. Lin et al., “Microsoft COCO: Common objects in context,” in Lecture Notes in Computer Science, 2014.
[67] WMT 2018, “Third Conference on Machine Translation.” [Online]. Available: http://www.statmt.org/wmt18/WMT-2018.pdf.
[68] GroupLens, “MovieLens-20M data sets.” [Online]. Available: https://grouplens.org/datasets/movielens/20m/.
[69] P. Mattson, “MLPerf Training Algorithms.” [Online]. Available: https://github.com/mlperf/training.
[70] G. Anadiotis, “AI chips for big data and machine learning: GPUs, FPGAs, and hard choices in the cloud and on-premise,” ZDNet, 2019. [Online]. Available: https://www.zdnet.com/article/ai-chips-for-big-data-and-machine-learning-gpus-fpgas-and-hard-choices-in-the-cloud-and-on-premise/.

TABLE II. COMPARING POPULAR DEEP LEARNING FRAMEWORKS

# | Frameworks (a) | License Type (b) | Core Codes | API Codes | Hardware Devices | Reliability | Tested Networks | Related Datasets
1 | BigDL | Apache 2.0 | C/C++ | Scala | CPU/GPU | Multi-Server | VGG, Inception, ResNet, GoogleNet | ImageNet, CIFAR-10
2 | Caffe/Caffe2 | BSD License | C/C++ | Python, C++, MATLAB | CPU/GPU/FPGA/Mobile | Multi-Server | LeNet, RNN | CIFAR-10, MNIST, ImageNet
3 | Chainer | MIT License | C/C++ | Python | CPU/GPU | Multi-Server | RNN | CIFAR-10, ImageNet
4 | DeepLearning4j | Apache 2.0 | Java | Java, Scala, Clojure, Python, Kotlin | CPU/GPU | Multi-Server | AlexNet, LeNet, Inception, ResNet, RNN, LSTM, VGG, Xception | ImageNet
5 | DyNet | Apache 2.0 | C/C++ | C++, Python | CPU/GPU | Single Node | RNN, LSTM | ImageNet
6 | FastAI | Apache 2.0 | Python | Python | CPU/GPU | Multi-Server | ResNet | CIFAR-10, ImageNet
7 | Keras | MIT License | Python | Python, R | CPU/GPU | Multi-Server | CNN, RNN | CIFAR-10, MNIST
8 | Microsoft CNTK | MIT License | C/C++ | C++, C#, Python, Java | CPU/GPU | Multi-Server | CNN, RNN, LSTM | CIFAR-10, MNIST, ImageNet, P-VOC
9 | MXNet | Apache 2.0 | C/C++ | C++, Python, Clojure, Julia, Perl, R, Scala, Java, JavaScript, MATLAB | CPU/GPU/Mobile | Multi-Server | CNN, RNN, Inception | CIFAR-10, MNIST, ImageNet, P-VOC
10 | Neon | Apache 2.0 | Python | Python | CPU/GPU | Multi-Server | AlexNet, ResNet, LSTM | CIFAR-10, MNIST, ImageNet
11 | PaddlePaddle | Apache 2.0 | C/C++ | Python | CPU/GPU/Mobile | Multi-Server | AlexNet, GoogleNet, LSTM | CIFAR-10, ImageNet
12 | PlaidML | Apache 2.0 | C/C++ | Python, C++ | CPU/GPU | Multi-Server | Inception, ResNet, VGG, Xception, MobileNet, DenseNet, ShuffleNet, LSTM | CIFAR-10, ImageNet
13 | PyTorch | BSD License | Python | Python | CPU/GPU | Multi-Server | AlexNet, Inception, ResNet, VGG, DenseNet, SqueezeNet | CIFAR-10, ImageNet
14 | SINGA | Apache 2.0 | C/C++ | Python | CPU/GPU | Multi-Server | RNN, AlexNet, DenseNet, GoogleNet, Inception, ResidualNet, VGG | MNIST, ImageNet
15 | TensorFlow | Apache 2.0 | C/C++ | Python, C++, Java, Go, JavaScript, Scala, Julia, Swift | CPU/GPU/TPU/Mobile | Multi-Server | AlexNet, Inception, ResNet, VGG, LeNet, MobileNet | CIFAR-10, MNIST, ImageNet
16 | Theano | BSD License | Python | Python (Keras) | CPU/GPU | Multi-Server | AlexNet, VGG, GoogleNet | CIFAR-10, ImageNet

a. Frameworks are listed in alphabetical order. b. In the License Type column, Apache 2.0 means the Apache 2.0 license.