Deepedgebench: Benchmarking Deep Neural Networks on Edge Devices*

DeepEdgeBench: Benchmarking Deep Neural Networks on Edge Devices* *Note: This the preprint version of the accepted paper at IC2E’21 Stephan Patrick Baller∗, Anshul Jindal∗, Mohak Chadha∗, Michael Gerndt∗ ∗Chair of Computer Architecture and Parallel Systems, Technische Universitat¨ Munchen,¨ Germany Garching (near Munich), Germany Email: fstephan.baller, anshul.jindal, [email protected], [email protected] Abstract—EdgeAI (Edge computing based Artificial Intelli- strict latency requirements that the cloud would have difficulty gence) has been most actively researched for the last few years to meeting since it may be far away from the users [7]. handle variety of massively distributed AI applications to meet The concept of Edge Computing has been recently proposed up the strict latency requirements. Meanwhile, many companies have released edge devices with smaller form factors (low power to complement cloud computing to resolve these problems by consumption and limited resources) like the popular Raspberry performing certain tasks at the edge of the network [8]. The Pi and Nvidia’s Jetson Nano for acting as compute nodes at idea is to distribute parts of processing and communication to the edge computing environments. Although the edge devices are the ”edge” of the network, i.e closer to the location where it is limited in terms of computing power and hardware resources, needed. As a result, the server needs less computing resources, they are powered by accelerators to enhance their performance behavior. Therefore, it is interesting to see how AI-based Deep the network is less strained and latencies is decreased. Neural Networks perform on such devices with limited resources. Edge Devices can come in a variety of forms ranging from In this work, we present and compare the performance in large servers to low-powered System on a chip (SoC) devices terms of inference time and power consumption of the four SoCs: like the popular Raspberry Pi or any other ARM based devices. Asus Tinker Edge R, Raspberry Pi 4, Google Coral Dev Board, Deep Neural Networks (DNNs) may occupy big amounts of Nvidia Jetson Nano, and one microcontroller: Arduino Nano 33 BLE, on different deep learning models and frameworks. We also storage and computing resources [9], [10]. Although the edge provide a method for measuring power consumption, inference devices are limited in terms of computing power and hardware time and accuracy for the devices, which can be easily extended resources, they are powered by accelerators to enhance their to other devices. Our results showcase that, for Tensorflow based performance at the edge computing environments. quantized model, the Google Coral Dev Board delivers the best performance, both for inference time and power consumption. In the context of Edge Computing, it is rather interesting For a low fraction of inference computation time, i.e. less than to see how devices with smaller form factors (low power con- 29.3% of the time for MobileNetV2, the Jetson Nano performs sumption and limited resources) can handle DNN evaluation. faster than the other devices. There are a considerable number of articles on the benchmark Index Terms—edge computing, deep learning, performance of edge devices [11]–[15]. However, some of the articles are benchmark, edge devices, power consumption, inference time, already outdated and others miss benchmarks on the major lat- power prediction est edge devices, thus lacking a thorough comparison between the edge devices concerning the DNN applications. Unlike I. INTRODUCTION other works, this article focuses on evaluating the performance of recent edge device for DNN models inferencing. The key Advances in computing devices and high-speed mobile net- contributions of this work are as follows: arXiv:2108.09457v1 [cs.AI] 21 Aug 2021 working provide today’s applications to be distributed globally. • We present a comparative analysis on recent accelerator- Such applications are thus deployed on cloud service platforms based edge devices specialized for the DNN application to benefit from on demand provisioning, unlimited resource domain. We choose following devices as our target edge pooling, and dynamic scalability. Modern deep learning tech- devices to assess their performance behavior and capa- niques serve a key component in various real-time applica- bilities for DNN applications: tions, like speech recognition [1], recommendation systems [2] and video classification [3]. However, deep learning-based – ASUS Tinker Edge R [16] approaches require a large volume of high-quality data to train – Raspberry Pi 4 [17] and are very expensive in terms of computation, memory and – Google Coral Dev Board [18] power consumption [4]. Moreover, existing cloud computing – NVIDIA Jetson Nano [19] is unable to manage these massively distributed applications – Arduino Nano 33 BLE [20] and analyze their data due to: i) challenges posed on the • Our evaluation perspectives include inference speed for network capacity when tasks are deployed to the cloud [5]; ii) a fixed number of images, power consumption during many applications, for example, autonomous driving [6], have (accelerated) DNN inference, and models accuracies with and without optimization for the respective device. aggregation of tens of thousands of ALUs (arithmetic logic • We used four different deep learning inference frame- units) and faster data transfer rate between the TPU and the works for the evaluation: Tensorflow, TensorRT, Tensor- memory. However, the Edge TPU is fine-tuned for matrix flow Lite and RKNN-Toolkit. operations which are very frequent in neural network machine • We also provide a method for measuring power consump- learning. tion, inference time and accuracy for the devices, which The Google Coral Dev Board comes with either 1GB or can be easily extended to other devices. We open source 4GB of RAM. The SoC integrates the Google Edge TPU with the collected data and the developed method for further the performance of 4 trillion operations (tera-operations) per 1 research . second (TOPS) or 2 Tera Operations Per Second (TOPS) per The rest of this paper is organized as follows. §II gives watt [25]. To make use of the dedicated unit, models in a an overview of the target edge devices, model frameworks supported format can be converted to work with the PyCoral and formats used in this work for the evaluation. In §III, the and Tensorflow Lite framework [26]. overall evaluation setup and methodology are described. §IV 3) Asus Tinker Edge R: Currently, ASUS offers six devices provides the experimental configurations details. In §V, our under the Tinker Board brand [27]. Two of them, the Tinker performance evaluation results are presented along with the Edge R [16] and Tinker Edge T [28] are specifically made for discussion of those results. Finally, §VI concludes the paper AI applications. While the Tinker Edge T is supported by a and presents a future outlook. Google Edge TPU, the Tinker Edge R uses Rockchip Neural Processing Unit (NPU) (RK3399Pro), a Machine Learning II. TARGET EDGE DEVICES,FRAMEWORKS AND MODEL (ML) accelerator that speeds up processing efficiency, and FORMATS lowers power demands. With this integrated Machine Learning In this section, we compare different edge device architec- (ML) accelerator, the Tinker Edge R is capable of perform- tures and present their hardware overview, and also briefly ing 3 tera-operations per second (TOPS), using low power describe the characteristics of different DNN models and consumption. And it’s optimized for Neural Network (NN) frameworks used in this work for benchmarking the edge architecture, which means Tinker Edge R can support multiple devices. Machine Learning (ML) frameworks and common Machine A. Target Edge Devices Learning (ML) models can be easily compiled and run on the Tinker Edge R. Rockchip, the manufacturer of the chip, We have considered the following five edge devices sum- provides the RKNN-Toolkit as a library to convert and run marized in Table I. models from different frameworks. As operating system, Asus 1) Nvidia Jetson Nano: The Jetson Nano is one of the offers a Debian Stretch and Android 9 variant of its so called offerings from Nvidia for edge computing [19]. With the Tinker Operating System (OS) to power the board. Compared Jetson product line, Nvidia deploys their Graphics Processing to the other devices, the board offers additional interfaces Unit (GPU) modules to the Edge with accelerated AI per- for expanding connectivity (mPCIe, M.2, serial camera, serial formance. The Jetson Nano’s GPU is based on the Maxwell display), especially interesting for IoT applications. microarchitecture (GM20B) and comes with one streaming 4) Raspberry Pi 4: The devices from the Raspberry Pi multiprocessor (SM) with 128 CUDA cores, which allows series are among the most popular SoCs and represent go- to run multiple neural networks in parallel [19]. The Jetson to products for people who are interested in IoT [29]. In June Nano is the lowest-end version of the product line with a 2019, the Raspberry Pi 4 was released, featuring a variety peak performance of 472 Giga Floating Operations Per Second of RAM options and a better processor. The Raspberry Pi (GFLOPS) [21]. It comes with either 2GB or 4GB Random 4 comes with Gigabit Ethernet, along with onboard wireless Access Memory (RAM) version wherefore we’re testing the networking and Bluetooth. The Rasbperry Pi is commonly 4GB version. The Nano can be operated in two power modes, used with Raspbian, a 32 Bit OS based on Debian. As the 5W and 10W. We used in our experiment an Nvidia Jetson Raspberry Pi 4 is capable of running a 64 Bit OS. We’re Nano with 4 GB RAM and 10 W to maximize performance. instead going to use the available 64 Bit Raspbian version Compared to the other target edge devices in this work, the (aarch64 architecture) to provide the same testing conditions Jetson Nano stands out with a fully utilizable GPU. as the other devices.

Deepedgebench: Benchmarking Deep Neural Networks on Edge Devices*

GPU Developments 2018

User Manual - S.USV Solutions Compatible with Raspberry Pi, up Board and Tinker Board Revision 2.2 | Date 07.06.2018

Performance Analysis and Optimization Opportunities for NVIDIA Automotive Gpus

A Low-Cost Deep Neural Network-Based Autonomous Car

Diseño De Un Prototipo Electrónico Para El Control Automático De La Luz Alta De Un Vehículo Mediante Detección Inteligente De Otros Automóviles.”

Procurement of Electric-Electronic

042Cf377-Ed0c-4715-9260-770F680082fc.Pdf

Arduino Nano

NVIDIA Jetson TX2 Module

Jetson TX2 Developer Kit User Guide

Die Meilensteine Der Computer-, Elek

Case 6:21-Cv-00138-ADA Document 1 Filed 02/09/21 Page 1 of 21