Neural Network Inference on Mobile SoCs

Siqi Wang, Anuj Pathania, Tulika Mitra

Abstract—The ever-increasing demand from mobile Machine Learning (ML) applications calls for evermore powerful on-chip computing resources. Mobile devices are empowered with heterogeneous multi-processor Systems-on-Chips (SoCs) to process ML workloads such as Convolutional Neural Network (CNN) inference. Mobile SoCs house several different types of ML-capable components on-die, such as CPU, GPU, and accelerators. These different components are capable of independently performing inference but with very different power-performance characteristics. In this article, we provide a quantitative evaluation of the inference capabilities of the different components on mobile SoCs. We also present insights behind their respective power-performance behavior. Finally, we explore the performance limit of the mobile SoCs by synergistically engaging all the components concurrently. We observe that a mobile SoC provides up to 2x improvement with parallel inference when all its components are engaged, as opposed to engaging only one component.

Index Terms—Deep learning, convolutional neural networks, embedded multiprocessor SoCs

S. Wang, A. Pathania, and T. Mitra are with the Department of Computer Science, School of Computing, National University of Singapore. E-mail: (wangsq, pathania, tulika)@comp.nus.edu.sg. Address: COM1, 13 Computing Drive, S117417. (Corresponding author: Tulika Mitra)

Fig. 1: An abstract block diagram of a mobile SoC with an asymmetric multi-core CPU, GPU, and NPU.

I. INTRODUCTION

The tremendous popularity of Neural-Network (NN) based machine learning applications in recent years has been fuelled partly by the increased capability of the compute engines, in particular, the GPUs. Traditionally, both the network training and inference were performed on the cloud, with mobile devices only acting as user interfaces. However, enriched user experience and privacy concerns now demand inference to be performed on the mobile devices themselves with high accuracy and throughput.

In this article, we look at NN-enabled vision applications on mobile devices. These applications extract high-level semantic information from real-time video streams and predominantly use Convolutional Neural Networks (CNNs). They are important in many domains, such as Advanced Driver-Assistance Systems (ADAS), Virtual Reality (VR), and Augmented Reality (AR). Enabling these applications on power-constrained mobile devices is challenging due to the enormous computational and memory requirements.

Heterogeneous multi-processor SoCs enable the current state-of-the-art mobile devices. However, the presence of multiple vendors fragments the mobile SoC market. Accelerators (including GPUs, FPGAs, and dedicated neural accelerators) demonstrate great performance for inference. However, these high-performance components are present in only a small fraction of mobile devices. Moreover, due to market fragmentation, it is impossible to develop a mobile application with accelerators that can run across multiple devices. Instead, the CPUs remain the common denominator among mobile SoCs and are the favored choice for inference [1].

We embark on an exploration to quantitatively characterize and understand the inferencing capabilities of mobile SoCs given this diverse landscape. We portray the power-performance gap between the ubiquitous CPUs and the high-performance accelerators in high-end devices and uncover the reasons behind the gap through roofline models. Finally, we propose simultaneous engagement of all the SoC components to greatly expand the promise of functional deployment of vision applications on mobile devices.

II. INFERENCE ON MOBILE SOCS

A. Heterogeneous Multi-processor SoCs

There are over two thousand unique mobile SoCs in the mobile devices market. The diversity comes from the choice of different CPUs, GPUs, caches, memory controllers, and other application-specific accelerators. This fragmentation of the SoC market makes standard optimizations impossible. However, the similarity among these SoCs lies in the choice of one or more CPU core clusters.

1) ARM big.LITTLE: Multi-cores enable the state-of-the-art mobile SoCs; 99.9% of the Android devices in the market in 2019 have multiple cores [1]. Among these, about half of the SoCs implement performance heterogeneity with at least two CPU clusters: a high-performance and an energy-efficient core cluster. The ARM big.LITTLE architecture, one of the most popular architectures implementing this heterogeneity, is present in HiSilicon Kirin, Samsung Exynos, and Qualcomm Snapdragon series SoCs. The heterogeneous cores differ in power-performance-area characteristics but share the same Instruction Set Architecture (ISA). Figure 1 shows an abstract block diagram of this architecture. The general availability of CPUs makes them a favorable choice for mobile inference and makes device-agnostic optimizations feasible.

2) Accelerators: Existing architectures, including GPUs and FPGAs, have proven to be advantageous for ML workloads and are thus commonly used for deployment on certain devices. Both academic and commercial dedicated accelerators (Google Edge TPU, Intel Nervana NNP, Huawei NPU, Apple Neural Engine) offer exceptional runtime and energy-efficiency. There are no standard neural accelerators for mobile SoCs, making horizontal application integration difficult. Limited availability even constrains the use of GPUs.

B. Mobile ML Framework and Optimizations

TensorFlow, PyTorch, and MXNet are some of the common ML development frameworks for all scenarios. Frameworks like TensorFlow Lite facilitate the compression of huge models to fit into resource-constrained mobile devices. Efficient libraries and APIs bridge the gap between the frameworks and the underlying hardware; examples are cuDNN for NVIDIA GPUs, ARM NN powered by the Compute Library (ARM-CL) for ARM CPUs and GPUs, Facebook NNPACK, and QNNPACK for mobile CPUs. These libraries usually optimize with detailed architectural information. ARM-CL supports acceleration through ARM NEON vectorization and provides NEON assembly implementations for the most computationally intensive convolution kernels. Algorithmic optimizations (Winograd transform, FFT, sparsity exploration) lower the computational complexity of convolution computations. Furthermore, quantization and network pruning are common techniques that bring down the processing requirement at the sacrifice of some accuracy [2].

Even though most mobile inference workloads run on CPUs, optimization of ML workloads with accelerators attracts most of the attention. There is a lot of room for optimizations on mobile CPUs to enable ML applications across different mobile platforms.
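
As a rough illustration of how an algorithmic optimization such as the Winograd transform lowers the arithmetic cost of convolution, the sketch below counts multiplications for a direct 3x3 convolution versus the F(2x2, 3x3) Winograd variant. It is a back-of-the-envelope calculation, not code from ARM-CL, and the feature-map size is an arbitrary illustrative choice.

```python
def direct_mults(out_h, out_w, k=3):
    # Direct convolution: one multiply per kernel tap per output element
    # (per input channel, per output channel).
    return out_h * out_w * k * k

def winograd_f2x2_3x3_mults(out_h, out_w):
    # Winograd F(2x2, 3x3): each 2x2 output tile costs 16 element-wise
    # multiplies on a transformed 4x4 input tile.
    tiles = (out_h // 2) * (out_w // 2)
    return tiles * 16

if __name__ == "__main__":
    h = w = 56  # illustrative feature-map size, not taken from the article
    d, wg = direct_mults(h, w), winograd_f2x2_3x3_mults(h, w)
    print(f"direct: {d} mults, winograd: {wg} mults, ratio: {d / wg:.2f}x")  # ~2.25x
```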

III. CHARACTERIZING INFERENCING ON MOBILE SOC

We perform experiments across different technology nodes using two commonly used mobile SoCs: the 28 nm Exynos 5422 within the Odroid XU3 development platform and the 10 nm Kirin 970 within the Hikey 970 development platform. Released in 2014 and 2017, respectively, these two SoCs show us the progress of mobile SoC development over the years. Furthermore, these two SoCs roughly approximate the mid- and high-end mobile SoCs of today.

In the experiments, both SoCs use ARM-CL 18.05v. The Kirin 970 NPU is supported by HiAI DDK (v100) for network deployment. For the Exynos 5422, in-built power sensors, running at 200 Hz, measure the power of each component. For the Kirin 970, because of the absence of any integrated on-chip power sensors, we approximate the power consumption by measuring the socket power with the help of a power measurement unit [3] running at 100 Hz.

A. Experimental Set-up

1) CPU: Both SoCs include an ARM big.LITTLE based asymmetric multi-core CPU. The Kirin 970 CPU adopts the ARMv8-A architecture. It consists of a high-performance, high-power out-of-order four-core Cortex-A73 cluster (2.36 GHz) and a low-performance, low-power four-core in-order Cortex-A53 cluster (1.8 GHz). The Exynos 5422 has a similar design but uses an older ARMv7-A architecture with Cortex-A15 (2 GHz) and Cortex-A7 (1.4 GHz) cores. All CPU cores support NEON advanced Single Instruction Multiple Data (SIMD) operations, which allow for four 32-bit floating-point operations per cycle.

2) GPU: The Kirin 970 adopts an ARM Mali-G72 MP12 GPU (850 MHz), implementing the second-generation Bifrost architecture. It has twelve cores with three execution engines each. Each engine is capable of eight FP32 operations per cycle, giving a total peak compute capability of 244.8 GFLOPS for the G72. The Exynos 5422 includes an ARM Mali-T628 MP6 GPU (600 MHz). It adopts the older Midgard architecture with six shader cores implementing the Tripipe design with two arithmetic pipelines. Each pipeline is capable of eight FP32 operations per cycle, providing a total peak compute capability of 57.6 GFLOPS for the T628.

3) NPU: The Kirin 970 includes a Huawei NPU purpose-built for ML. It has a peak performance of 1.92 TFLOPS with FP16. The accompanying HiAI DDK API enables the deployment of networks on the NPU but only works with Android. The Exynos 5422 does not have any ML accelerator.

4) Network Structure: We experiment with several popular networks introduced in recent years: AlexNet [4], GoogLeNet [5], MobileNet [6], ResNet50 [7], and SqueezeNet [8].

B. Individual Heterogeneous Components

We first study each component in isolation by running inferencing of multiple images in a stream on a single component. Both the Big and Small clusters are self-sufficient for inferencing. The GPU and NPU require the support of a Small cluster for inferencing.

1) Throughput: Table I shows the throughput of each component on both our SoCs. All components in the Kirin 970 outperform their respective counterparts in the older Exynos 5422. The Big A73 cluster, Small A53 cluster, and G72 GPU outperform the Big A15 cluster, Small A7 cluster, and T628 GPU on average by factors of 4.4x, 2.6x, and 4.2x, respectively. The performance gap between the Big and Small cluster has reduced from 4x to 2.5x with a decrease in the Big-to-Small power consumption ratio from 10x to 4x. Furthermore, the performance gap between the GPU and the CPU clusters is only about 2x to 3x for both SoCs.

For the NPU, we were unable to deploy MobileNet due to incompatible operators. On average, the NPU is only 1.6x better than the high-end G72 GPU. On the other hand, the portability of applications across different platforms remains a challenge for dedicated accelerators. The proprietary development kit makes general optimization a difficult endeavor.

TABLE I: Throughput of different networks on different mobile SoC components running at their peak frequencies.

                    Exynos 5422               Kirin 970
                 Throughput (Imgs/s)      Throughput (Imgs/s)
Network          A7    A15   T628        A53   A73    G72    NPU
AlexNet          1.1   3.1   7.8         2.2   7.6    32.5   32.5
GoogLeNet        0.9   3.4   5.2         3.0   7.1    19.9   34.4
MobileNet        1.5   5.7   8.5         6.5   17.7   29.1   Not Supported
ResNet50         0.2   1.3   2.1         1.5   2.8    8.4    21.9
SqueezeNet       1.5   5.0   8.0         6.8   15.7   43.0   49.3
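
The peak compute capabilities quoted in Section III-A follow directly from the core counts, SIMD width, and clock frequencies given above. A minimal sketch of that arithmetic (illustrative only; the Cortex-A15 figure is derived from the stated four FP32 NEON operations per cycle and is not a number quoted in the article):

```python
def peak_gflops(cores, units_per_core, fp32_ops_per_cycle, freq_ghz):
    """Peak single-precision throughput = cores x units x FP32 ops/cycle x GHz."""
    return cores * units_per_core * fp32_ops_per_cycle * freq_ghz

# Mali-G72 MP12 @ 0.85 GHz: 12 cores x 3 execution engines x 8 FP32 ops/cycle
print(peak_gflops(12, 3, 8, 0.85))  # 244.8 GFLOPS, as quoted in Section III-A
# Mali-T628 MP6 @ 0.60 GHz: 6 shader cores x 2 arithmetic pipelines x 8 FP32 ops/cycle
print(peak_gflops(6, 2, 8, 0.60))   # 57.6 GFLOPS
# Cortex-A15 cluster @ 2 GHz: 4 cores x 4 FP32 NEON ops/cycle (derived, not stated)
print(peak_gflops(4, 1, 4, 2.0))    # 32 GFLOPS
```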

2) Energy Efficiency: We measure the average active power consumption of inferencing on the different components and calculate the energy efficiency, as shown in Figure 2. For the Exynos 5422, power sensors for individual components measure the power consumption of each component separately. For the Kirin 970, we calculate active power values by subtracting the idle power (measured when no workload is running) from the socket power measurement taken during inferencing. Therefore, the power measurements for the Kirin 970 are slightly higher, as the memory power cannot be separated out.

The NPU is the most energy-efficient among all components, which we expect, given its custom design for inference. GPUs are the second-most energy-efficient components. The Small clusters also show good energy-efficiency. However, Table I shows that their performance in terms of absolute throughput is too low to be ever useful alone.

Comparing across the two platforms, the energy efficiency of each component has improved for the newer SoC. However, the improvement is minimal and even negative for the Small CPU cluster. Compared to its predecessor A7, the A53 is more complex and area-hungry with 64-bit support, complex branch prediction, and a larger TLB. It achieves greater performance but at the cost of even greater power consumption.

Fig. 2: Energy efficiency of different components while running at their peak frequencies.
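
The energy-efficiency metric plotted in Figure 2 is simply throughput divided by average active power (for example, images per joule); the sketch below mirrors the measurement methodology described above. The numeric readings in the example are hypothetical placeholders, not measurements from this article.

```python
def active_power_w(socket_power_w, idle_power_w):
    # Kirin 970: no per-component power rails, so active power is approximated
    # as socket power under load minus socket power at idle (Section III-B).
    return socket_power_w - idle_power_w

def images_per_joule(throughput_imgs_per_s, power_w):
    # Energy efficiency = throughput / average active power.
    return throughput_imgs_per_s / power_w

# Hypothetical example: 32.5 imgs/s (Table I, AlexNet on G72) at an assumed
# 6.0 W socket power under load and 2.0 W idle power (placeholder values).
print(images_per_joule(32.5, active_power_w(6.0, 2.0)))  # ~8.1 images per joule
```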

3) Impact of Technology Scaling Versus Architectural Innovations: The Exynos 5422 and Kirin 970 use the 28 nm and 10 nm technology nodes, respectively. In moving from the 28 nm Exynos 5422 to the 10 nm Kirin 970, the maximum frequency of the Big cluster has only changed from 2 GHz (A15) to 2.36 GHz (A73), while the Small cluster changes from 1.4 GHz (A7) to 1.8 GHz (A53). So the frequency scaling is 1.18x for the Big cluster and 1.29x for the Small cluster across these two platforms. On the other hand, we get 4.4x and 2.6x throughput improvement across technology generations (Table I) for the Big cluster and the Small cluster, respectively. This improvement in performance is achieved through smart designs such as micro-architectural improvements (improved branch predictors, cache data prefetchers, etc.), larger caches, and 64-bit support leading to improved NEON processing, among others. However, in the case of the Small cluster, with an increased area, the micro-architectural changes give an increase in power that cannot be offset by technology scaling. Indeed, the Small A53 cluster consumes roughly twice the power of the Small A7 cluster. Thus, the energy-efficiency improvement is limited for the Small cluster for some networks as we move from the A7 to the A53. In contrast, between the two Big clusters, the A73 is more power-efficient than the A15; the energy-efficiency improves from the A15 to the A73 cluster. As mentioned earlier, the power measurements for the A7 and A15 are quite accurate, while the measured power for the A53 and A73 is higher as it includes the memory power that could not be separated.
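
The split between frequency scaling and micro-architectural improvement discussed above can be separated with simple ratios, using only the numbers reported in this section. This is a rough decomposition that ignores memory-system and software effects:

```python
def microarch_gain(throughput_gain, freq_gain):
    # If throughput scaled only with clock frequency, the gain would equal
    # freq_gain; the remainder is attributed to micro-architecture, caches,
    # and improved NEON processing.
    return throughput_gain / freq_gain

print(microarch_gain(4.4, 2.36 / 2.0))  # Big cluster: ~3.7x beyond frequency scaling
print(microarch_gain(2.6, 1.8 / 1.4))   # Small cluster: ~2.0x beyond frequency scaling
```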

4) Insights: We observe that the NPU provides unmatched energy-efficiency for inference. It is the optimal choice for performing network inference on platforms with such dedicated accelerators. However, a developer needs to put in substantial effort to port their application with the proprietary API to execute on the NPU, and the effort would not bear any fruit on mobile devices lacking this very specific NPU. The NPU, as a black box, also causes inflexibility in development and optimizations. Furthermore, the NPU is compatible with only a limited set of network designs. These extra requirements could make it quickly obsolete for future networks.

On the other hand, high-end GPUs can provide performance comparable to the NPU at satisfactory energy-efficiency. GPUs are capable of running General-Purpose (GPGPU) applications written in OpenCL, which is easily portable to a large variety of GPUs and even to CPUs supporting OpenCL. This generality makes the GPU a good candidate when high performance is a major consideration.

CPUs provide both the worst energy-efficiency and the worst throughput among all components. Still, they are critical for inferencing because they are commonly present across all mobile devices. Low-end mobile SoCs lack accelerators like the NPU. They may contain a low-end GPU, but one possibly missing OpenCL support and thereby lacking any inferencing capability. Network inference on the CPU is therefore inevitable and demands optimization considerations.

Our analysis shows that any component alone, on both platforms, can barely support the increasing performance requirement of network inferencing. Section V-A presents the co-execution methodology that can mitigate the performance issue to some extent. Still, we must continue to look into the networks themselves in search of further optimization opportunities.

IV. ROOFLINE ANALYSIS

To understand the execution behavior of the networks on each SoC component, we perform a roofline analysis. Roofline analysis [9] is a widely applied methodology that can classify an application as memory- or compute-bound on given hardware. It gives insights to developers for improving their application design to cater to the computation and memory capability of the underlying processing devices. The horizontal "Ceiling" and the slanted "Roof" construct a "Roofline" that bounds the maximum performance of an application (measured in GOP/s) under a hardware-determined compute- or memory-bound, respectively. The Operational Intensity (OI) of an application (measured in OP/byte) determines whether its peak performance is bounded by the memory bandwidth (measured in GB/s) or the compute capability (measured in GOP/s) of the hardware. Both the Exynos 5422 and Kirin 970 show similar behavior for the CPU core clusters and GPU. Therefore, we only present the analysis for the Exynos 5422 here.

A. Construction of a Roofline Model

Hardware specifications provide the peak pure compute performance. Micro-benchmarking [10] provides the peak (sustainable) memory bandwidth. Specifications claim the peak memory bandwidth of the memory bus to be 14.9 GB/s. However, we observe the actual component-wise peak bandwidth to be 3.44 GB/s, 0.49 GB/s, and 6.15 GB/s for the A15 cluster, A7 cluster, and T628 GPU, respectively.

Many variations of the roofline model have been constructed to adapt to different use-cases. In this analysis, we define two operational intensities, namely the theoretical OI (OI_t) and the empirical OI (OI_e), given in Eqn. (1) and (2):

    OI_t = GOPs / Memory Accesses    (1)
    OI_e = GOPs / DRAM Accesses      (2)

We calculate OI_t by analyzing the code; the memory accesses include all the data required in the computation. During actual execution, multiple levels of caches within the components improve the memory access performance. The caches make it difficult for OI_t to correlate with the actual performance on the components. Therefore, we introduce the empirical operational intensity OI_e. We calculate OI_e using the actual DRAM accesses on the bus, which models the presence of the multi-level memory hierarchy. It is more informative and has a better correlation with the actual performance on the component than OI_t. We use application-specific performance counters obtained from ARM Streamline DS5 at run-time for the calculation of OI_e (CPU: L2 data refills; GPU: Mali L2 cache external read/write bytes). Figure 3(a) shows the roofline points of major layers in AlexNet on the A15 cluster for both OI_t and OI_e.
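
A minimal sketch of the roofline construction: attainable performance is the lower of the compute "Ceiling" and the bandwidth-scaled "Roof". The 3.44 GB/s bandwidth is the measured A15-cluster value from above; the 32 GOP/s peak is an assumption derived from the NEON width and frequency in Section III-A, not a figure stated in the article.

```python
def attainable_gops(oi_ops_per_byte, peak_gops, peak_gbps):
    # Roofline: performance is capped either by peak compute ("Ceiling")
    # or by memory bandwidth x operational intensity ("Roof").
    return min(peak_gops, oi_ops_per_byte * peak_gbps)

def bound_type(oi_ops_per_byte, peak_gops, peak_gbps):
    ridge = peak_gops / peak_gbps  # OI at which the two bounds meet
    return "compute-bound" if oi_ops_per_byte >= ridge else "memory-bound"

PEAK_GOPS, PEAK_GBPS = 32.0, 3.44  # assumed A15 ceiling, measured bandwidth
for oi in (0.5, 5.0, 20.0):        # illustrative OI values, not measured ones
    print(oi, attainable_gops(oi, PEAK_GOPS, PEAK_GBPS),
          bound_type(oi, PEAK_GOPS, PEAK_GBPS))
```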

B. Theoretical and Empirical OI

Figure 3(a) plots the OI_t (squares) and OI_e (diamonds) values of several major AlexNet layers, marked with different colors. Black marks the whole-network OI_t and OI_e of AlexNet. The intersection points of the OI_t values with the "Roofline" represent the theoretical maximum performance for the code-based theoretical operational intensities, which fall in the memory-bound region on the "Roof". The corresponding points for OI_e are the actual achieved performance in GOP/s, which is always below the "Roofline".

The presence of caches reduces the memory accesses going to the DRAM during execution, and thus increases the operational intensity. Therefore, for all layers, the OI_e points are to the right of the OI_t points, indicating higher performance. For layers with low OI_t (fully-connected, FC), the points move along the "Roofline", achieving the theoretical maximum performance. For layers with higher OI_t (convolutional, CONV), the points cross the boundary of the memory-bound region and become compute-bound. The performance gain is not as significant, and we explain this with underutilization due to insufficient or imperfect parallelization. Overall, OI_e is a better indicator of real-world performance. Therefore, we only plot values of OI_e going forward.

C. Across Different Components

Figure 3(b) shows the performance of different networks on different components of the Exynos 5422. The color of the points corresponds to the respective component. We can observe that memory severely bottlenecks the performance of both the A7 cluster and the T628 GPU. The performance of the A15 cluster falls in both the compute- and memory-bound regions depending upon the network.

The OI_e values differ because of the different memory hierarchies of the components. The Big core cluster, with a larger cache (L2: 2 MB), derives higher benefits from the memory hierarchy than the GPU (L2: 128 KB). However, for AlexNet, which is notorious for its huge parameter size, the caches get flushed regardless of their size, resulting in a smaller benefit from the memory hierarchy. On the other hand, small filter sizes lead to sub-optimal parallelization (under-utilization). This observation holds more starkly for newer networks with smaller filter sizes than for older networks. The observation explains the significant deviation of the empirical performance of the networks on the components from the "Roofline".

D. Major Layers in Inference

We perform a deeper layer-level analysis to explain the behavior of the networks. Convolutional and fully-connected layers dominate the total execution time of the networks, and thus both are considered major layers worthy of examination. We limit our analysis to the Big cluster because networks there show both memory- and compute-bound behavior. Figure 3(c) shows that different layers in AlexNet (and also in other networks, to a lesser extent) exhibit different empirical OIs. Convolutional layers at the start of AlexNet perform compute-intensive convolution on large inputs and thereby have relatively higher OIs. On the other hand, fully-connected layers perform memory-intensive operations on large parameters and thereby have relatively lower OIs. The convolutional and fully-connected layers of AlexNet fall in the compute- and memory-bound regions of the roofline model, respectively. Overall, AlexNet falls somewhere in the middle of both.
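
The CONV-versus-FC contrast can be made concrete by evaluating Eqn. (1) for two representative layer shapes. The shapes below are illustrative choices, not the exact AlexNet dimensions measured in the article, and the memory-traffic estimate is a rough lower bound.

```python
def conv_oi_t(c_in, c_out, k, h_out, w_out, bytes_per_elem=4):
    # Theoretical OI of a k x k convolution: each MAC counts as 2 ops; memory
    # traffic is roughly weights + input activations + output activations.
    ops = 2 * k * k * c_in * c_out * h_out * w_out
    data = k * k * c_in * c_out + c_in * h_out * w_out + c_out * h_out * w_out
    return ops / (data * bytes_per_elem)

def fc_oi_t(n_in, n_out, bytes_per_elem=4):
    ops = 2 * n_in * n_out
    data = n_in * n_out + n_in + n_out  # the weight matrix dominates
    return ops / (data * bytes_per_elem)

print(f"3x3 conv, 64->64 channels, 56x56 output: OI_t ~ {conv_oi_t(64, 64, 3, 56, 56):.0f} OP/byte")
print(f"fully-connected, 9216->4096:             OI_t ~ {fc_oi_t(9216, 4096):.2f} OP/byte")
```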

Fig. 3: Roofline plots for inference workloads and major-layer information on multiple processors in the Exynos 5422. (a) Roofline plot with theoretical (OI_t) and empirical (OI_e) operational intensities for AlexNet (black) and some major layers (colors) on the Exynos 5422 A15 CPU cluster. (b) Comparison of different processor rooflines with empirical operational intensities for five CNN applications on the Exynos 5422 A15, A7, and GPU. (c) Roofline plot with major-layer information for five CNN applications on the Exynos 5422 A15 CPU cluster.

In general, we observe that the layers of a network are scattered in both the compute- and memory-bound regions. This difference comes from the choice of the sizes of the input tensors and filters. The vast differences in OI_e for different layers within a network motivate layer-level optimizations such as per-layer Dynamic Voltage and Frequency Scaling (DVFS) for power management. Furthermore, the variation within a network motivates fine-grain layer-level co-execution, which improves the overall chip utilization [11].

E. Effect of Quantization

Quantization is a commonly applied technique that reduces the memory and computation requirements of a network at the cost of some accuracy. However, the quality of its implementation primarily determines the benefits it provides. The implementation of quantized MobileNet in ARM-CL (18.05v) uses the QASYMM8 model with 8-bit weights. This implementation fails to improve the overall performance of the network. Deeper analysis reveals that the latencies of the convolutional layers are indeed reduced, but the overheads from extensive de-quantization and re-quantization overshadow any benefit.

Quantization reduces the total operations and memory accesses required near-proportionally. The reduction in memory accesses results in a slightly higher empirical operational intensity OI_e. Therefore, the roofline analysis of a quantized network nearly overlaps with that of its non-quantized counterpart, and quantization does not improve the memory behavior of the layers. The lower operation requirements under quantization predominantly contribute to the reduction in execution time of the convolutional layers.

F. Glimpse of NPU

The NPU, due to its novelty and dedicated machine-learning design, garners a lot of attention. However, most of its details are kept confidential. We are unaware of its architectural and integration details. Therefore, we can only attempt to reverse engineer its behavior to gain some insights.

We implement a kernel module that enables counting of the traffic on the CCI bus. We attribute the traffic on the CCI bus that goes to DRAM during the engagement of the NPU to the main memory activity of the NPU. The maximum observed memory bandwidth while executing several networks and the peak performance of 1.92 TOPS from the specification construct the "Roof" and "Ceiling" of the NPU roofline. We observe that the performance of the NPU is significantly bounded by the memory for the networks tested. This observation shows a significant scope for optimization to achieve the full processing potential of the NPU.

V. IMPROVING THE PERFORMANCE

A. Co-Execution of Multiple Components

Real-time vision, depending on the application, requires 10 to 40 images/second throughput. Some applications even require multiple inferences to run at the same time. Table I shows that the high-end Kirin 970 SoC can barely sustain such a requirement, while the mid-end Exynos 5422 cannot.

We previously observed that the peak bandwidth consumed by any individual component is far below the total bandwidth supported by the bus. This observation supports the claim that inferencing through multiple components together will not make the individual components more memory-constrained compared to their isolated inferencing. Therefore, we use ARM-CL to create an infrastructure wherein multiple components process images from a single unified stream in parallel using a work-stealing mechanism. The infrastructure uses a buffer to reorder the out-of-sync outputs from different components. Co-execution obtains significantly higher throughput than the highest-throughput component in isolated execution.
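
The co-execution infrastructure sketched above (a shared stream, work stealing, and a reorder buffer) can be approximated in a few lines. The snippet below uses placeholder inference callables in Python purely to illustrate the control flow, whereas the actual implementation in this work is built on ARM-CL.

```python
import queue
import threading

def run_coexecution(frames, components):
    """Sketch of Section V-A: 'components' maps a name (big CPU, small CPU,
    GPU, ...) to an inference callable; these are placeholders, not ARM-CL."""
    stream = queue.Queue()
    for idx, frame in enumerate(frames):
        stream.put((idx, frame))
    results = {}

    def worker(name, infer):
        while True:
            try:
                idx, frame = stream.get_nowait()  # work stealing: grab the next frame
            except queue.Empty:
                return
            results[idx] = (name, infer(frame))

    threads = [threading.Thread(target=worker, args=item) for item in components.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(results))]  # reorder buffer

# Dummy components standing in for the big cluster, small cluster, and GPU:
outputs = run_coexecution(range(8), {"big": lambda f: f, "small": lambda f: f, "gpu": lambda f: f})
```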

Table II shows the peak co-execution throughput on both mobile SoCs with the ARM big.LITTLE CPU core clusters and GPU. We include the best individual component executions, which are the GPUs for both platforms, for comparison. On average, co-execution gives a 50% throughput improvement over GPU-only execution. Furthermore, Table II shows the Exynos 5422's obsolescence: even with co-execution, the Exynos 5422 delivers very low absolute throughput.

TABLE II: Throughput improvement on the Exynos 5422 and Kirin 970 by co-execution over the best throughput with a single component (T628 and G72 GPU).

                     Exynos 5422                          Kirin 970
                  Throughput (Imgs/s)                Throughput (Imgs/s)
Network        T628    Co-execution   Gain        G72    Co-execution   Gain
AlexNet        7.8     10.3           32.4%       32.5   33.4           2.8%
GoogLeNet      5.2     8.7            66.3%       19.9   28.4           42.8%
MobileNet      8.5     14.9           76.7%       29.1   51.5           77.1%
ResNet50       2.1     2.9            38.6%       8.4    12.3           46.3%
SqueezeNet     8.0     13.8           73.9%       43.0   54.5           26.7%

B. Co-execution with NPU

The performance of the NPU is unbeatable. Table III shows that the Kirin 970, with co-execution of all on-chip components, gives exceptionally high throughput. In practice, we can execute the NPU and GPU in parallel for one application that demands very high performance, or to perform multiple inferences simultaneously with multiple applications.

TABLE III: Throughput improvement on the Kirin 970 by co-execution over the best throughput with a single component (NPU).

               Throughput (Images/s)    Gain     Image Frames Composition (%)
Network        NPU    Co-execution      (%)      A73     A53     G72      NPU
AlexNet        32.5   63.7              96.0     1.90    0.95    47.47    49.68
GoogLeNet      34.4   59.3              72.4     3.06    1.70    33.33    61.90
ResNet50       21.9   30.9              40.9     2.63    1.32    26.97    69.08
SqueezeNet     49.3   95.1              92.9     3.18    1.69    43.43    51.69

C. Co-Execution Energy Efficiency

Synergistic co-execution engages multiple components simultaneously to improve performance at the cost of higher power consumption. Therefore, the energy efficiency of the co-execution is the average energy efficiency of the engaged components. Figure 4 shows the energy efficiency of the execution that engages all the components on the Exynos 5422, the CPU clusters and GPU on the Kirin 970 (excluding the NPU), and all the components on the Kirin 970 (including the NPU). Overall, the co-execution energy efficiency is always better than that of the Big CPU cluster. On the Kirin 970 SoC, as the GPU is much more energy-efficient than the CPU clusters, the co-execution provides better energy efficiency than the power-efficient Small CPU cluster.

Fig. 4: Energy efficiency of co-execution on the Exynos 5422 with all components, and on the Kirin 970 with CPU and GPU (excluding NPU) and with all components (including NPU).

VI. SUMMARY

Mobile inferencing is now ubiquitous. In this work, we examine the power-performance characteristics of inferencing through several prominent neural networks on the different components available within a mobile SoC. We also perform roofline analysis of the networks on the components to unveil further optimization scope. We show that network throughput can increase by up to 2x using co-execution that engages all the components in inferencing simultaneously.

Siqi Wang is currently a research assistant and is working toward the Ph.D. degree at the School of Computing, National University of Singapore. Her current research interests include performance optimization, task scheduling, general-purpose GPUs, and deep learning on heterogeneous multi-processor systems.

Anuj Pathania is currently working as a research fellow at the School of Computing, National University of Singapore. He received his Ph.D. degree from the Karlsruhe Institute of Technology (KIT), Germany, in 2018. His research focuses on resource management algorithms with emphasis on performance-, power-, and thermal-efficiency in embedded systems.

Tulika Mitra is a Professor of Computer Science at the School of Computing, National University of Singapore. She received her Ph.D. degree in computer science from the State University of New York at Stony Brook in 2000. Her research interests span various aspects of the design automation of embedded real-time systems, cyber-physical systems, and the Internet of Things.

REFERENCES

[1] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., "Machine learning at Facebook: Understanding inference at the edge," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331-344.
[2] M. Wess, S. M. P. Dinakarrao, and A. Jantsch, "Weighted quantization-regularization in DNNs for weight memory minimization toward HW implementation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2929-2939, 2018.
[3] "Keysight Technologies B2900 Series Precision Source/Measure Unit," https://goo.gl/U4HMbu.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[9] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for floating-point programs and multicore architectures," Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), Tech. Rep., 2009.
[10] S. Siamashka, "Tinymembench," https://github.com/ssvb/tinymembench.
[11] S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, "High-throughput CNN inference on embedded ARM big.LITTLE multi-core processors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019. [Online]. Available: http://dx.doi.org/10.1109/TCAD.2019.2944584