AI Benchmark: Running Deep Neural Networks on Android Smartphones
Andrey Ignatov (ETH Zurich) [email protected]
Radu Timofte (ETH Zurich) [email protected]
William Chou (Qualcomm, Inc.) [email protected]
Ke Wang (Huawei, Inc.) [email protected]
Max Wu (MediaTek, Inc.) [email protected]
Tim Hartley (Arm, Inc.) [email protected]
Luc Van Gool∗ (ETH Zurich) [email protected]

Abstract

Over the last years, the computational power of mobile devices such as smartphones and tablets has grown dramatically, reaching the level of desktop computers available not long ago. While standard smartphone apps are no longer a problem for them, there is still a group of tasks that can easily challenge even high-end devices, namely running artificial intelligence algorithms. In this paper, we present a study of the current state of deep learning in the Android ecosystem and describe available frameworks, programming models and the limitations of running AI on smartphones. We give an overview of the hardware acceleration resources available on four main mobile chipset platforms: Qualcomm, HiSilicon, MediaTek and Samsung.
Additionally, we present the real-world performance results of different mobile SoCs collected with AI Benchmark¹, covering all main existing hardware configurations.

∗ We also thank Przemyslaw Szczepaniak ([email protected]), Google Inc., for writing and editing sections 2.7, 3.1 and 3.2.
¹ http://ai-benchmark.com

arXiv:1810.01109v2 [cs.AI] 15 Oct 2018

1 Introduction

With the recent advances in mobile system-on-chip (SoC) technologies, the performance of portable Android devices has increased by a multiple over the past years. With their multi-core processors, dedicated GPUs, and gigabytes of RAM, the capabilities of current smartphones have already gone far beyond running the standard built-in phone applications or simple mobile games. Whereas their computational power already significantly exceeds the needs of most everyday use cases, artificial intelligence algorithms still remain challenging even for high-end smartphones and tablets. Despite the fact that many machine learning solutions are highly useful when deployed on end-user devices, running them on mobile platforms is associated with a huge computational overhead on phone CPUs and a serious drain on battery power.

Many recent developments in deep learning are, however, tightly connected to tasks meant for mobile devices. One notable group of such tasks is concerned with computer vision problems like image classification [1, 2, 3], image enhancement [4, 5, 6] and super-resolution [7, 8, 9], optical character recognition [10], object tracking [11, 12], visual scene understanding [13, 14], face detection and recognition [15, 16], gaze tracking [17], etc. Another group of tasks encompasses various natural language processing problems such as natural language translation [18, 19], sentence completion [20, 21], sentence sentiment analysis [22, 23] or interactive chatbots [24]. A separate group deals with on-line sensor data processing for human activity recognition from accelerometer data [25, 26], gesture recognition [27] or sleep monitoring [28]. Several other deep learning problems on smartphones are related to speech recognition, virtual reality and many other tasks.

Despite the rising interest in deep learning for mobile applications, the majority of AI algorithms are either not available on smartphones or are executed on remote servers due to the aforementioned phones' hardware limitations. The latter option is also not flawless, causing: a) privacy issues; b) dependency on an internet connection; c) delays associated with network latency; d) bottleneck problems — the number of possible clients depends on the servers' computational capabilities. To overcome these issues, there were a number of attempts to port separate algorithms or whole machine learning libraries to mobile platforms with added hardware acceleration (HA) using GPUs or DSPs. In [29], the authors implemented a mobile neural network classification engine capable of sensor inference tasks on Qualcomm's Hexagon DSP [30]. Though they achieved very impressive energy consumption results, the DSP was able to run only very simple CNN models due to its small program and memory space. In [31], the authors presented a GPU-accelerated library, CNNdroid, for parallel execution of pre-trained CNNs on mobile GPUs. The library was based on the RenderScript framework [32] that parallelizes computations across CPUs and GPUs, and though the proposed solution was up to 40 times faster compared to the baseline naive single-thread implementation, in reality its speed was comparable to that of the CPU-based TensorFlow Mobile library [33] relying on the Arm NEON [34] instruction set. Motamedi et al. [35] exploited the same approach of using RenderScript, but used a CPU's imprecise computing modes to lower execution times. Despite the promising results, the effect inexact arithmetic had on accuracy was not investigated in depth in this paper, and therefore the applicability of this approach remains unclear. RSTensorFlow [36] is another attempt to exploit RenderScript for GPU-based acceleration of matrix operations, and in this case it was used to directly modify the TensorFlow Mobile library. The results demonstrated that, while matrix multiplications can be executed up to 3 times faster, it is not possible to speed up the convolutional operations that take approximately 75% of the total inference time. Additionally, the experiment revealed that RenderScript does not always use GPUs on all devices — sometimes it runs on a CPU only, leading to slower execution times even compared to the original TF implementation.

Besides that, some SDKs for running computationally intensive operations were proposed directly by SoC manufacturers. In 2016, Qualcomm introduced the Snapdragon Neural Processing Engine (SNPE) [37] to accelerate the execution of neural networks with their GPUs and DSPs. The next year, HiSilicon proposed the HiAI platform [38] for running neural networks on Kirin's NPU, and later MediaTek presented the NeuroPilot SDK [39] that can trigger GPUs or APUs to run deep learning models. The biggest issue is that all these SDKs were developed for the corresponding chipsets only, i.e., an application relying on HiAI will not run on a Qualcomm SoC, and vice versa, thus forcing developers to create a separate version of their app for each platform, or to give up on some of them. This situation changed with the introduction of the Android Neural Networks API (NNAPI) [40], designed to run deep learning models on mobile devices. This API is basically an intermediate layer between the higher-level machine learning framework and the device's hardware acceleration resources, and is responsible for their communication and for scheduling the execution of tasks on the most suitable hardware. NNAPI still requires specific SoC vendors' drivers in order to run the computations on anything but a CPU, and therefore its default presence in Android 8.1+ does not automatically guarantee hardware acceleration support.

While there exist a number of common benchmarks testing the CPU and GPU performance of mobile phones, none of them measure the speed and acceleration of AI operations that can be achieved due to available AI chips and DSPs. In this paper, we present an AI Benchmark designed specifically to test the machine learning performance, available hardware AI accelerators, chipset drivers, and memory limitations of current Android devices. It consists of a number of computer vision AI tests that are executed directly on the phones' hardware and that cover relevant deep learning architectures and operations. We provide a detailed description of the actual chipset platforms and popular mobile machine learning frameworks, and describe the limitations of running deep learning algorithms on smartphones. Finally, we present the in-the-wild performance of about 200 Android devices and major mobile chipsets, as collected with our AI Benchmark from over 10,000 smartphones and tablets.

The rest of the paper is arranged as follows. In Section 2 we describe the hardware acceleration resources available on the main chipset platforms, as well as the programming interfaces for accessing them. Section 3 gives an overview of popular mobile deep learning frameworks. Section 4 provides a detailed description of the benchmark architecture, its programming implementation, and the computer vision tests that it includes. Section 5 shows the experimental results and inference times for different deep learning architectures, for various Android devices and chipsets. Section 6 analyzes the obtained results. Finally, Section 7 concludes the paper.

Figure 1: Mobile SoCs with potential acceleration support for third-party AI applications.

2 Hardware Acceleration

While the first consumer computers were mostly equipped with a single, stand-alone CPU, it soon became clear that its computational performance is too limited for a number of multimedia applications. This led to the creation of special co-processors working in parallel with the main CPU. Their architecture was optimized for many signal processing tasks. The era of digital signal processors (DSPs) began in the early 1980s with the introduction of the NEC PD7720 [41], the AT&T DSP1 [42] and the TI TMS32010 [43] co-processors. They established the general principles of DSP architecture used until now [44]: Harvard architecture, a hardware block for multiply-accumulate (MAC) operations, VLIW and SIMD instruction sets for parallel computations, etc. Though the first DSPs had quite restricted capabilities due to their limited set of instructions and memory constraints, they were widely used till the mid-90s of the last century. They were popular for applications related to computer graphics, sound and video decoding, as mathematical co-processors and accelerators for various photo editing software, and even for running the first deep learning OCR models designed in 1989 [45].