arXiv:1910.06663v1 [cs.PF] 15 Oct 2019
AI Benchmark: All About Deep Learning on Smartphones in 2019

Andrey Ignatov (ETH Zurich), Radu Timofte (ETH Zurich), Andrei Kulik (Google Research), Seungsoo Yang (Samsung, Inc.), Ke Wang (Huawei, Inc.), Felix Baum (Qualcomm, Inc.), Max Wu (MediaTek, Inc.), Lirong Xu (Unisoc, Inc.), Luc Van Gool∗ (ETH Zurich)

∗ We also thank Oli Gaymond, Google Inc., for writing and editing Section 3.1 of this paper.

Abstract

The performance of mobile AI accelerators has been evolving rapidly in the past two years, nearly doubling with each new generation of SoCs. The current 4th generation of mobile NPUs is already approaching the results of CUDA-compatible Nvidia graphics cards presented not long ago, which together with the increased capabilities of mobile deep learning frameworks makes it possible to run complex and deep AI models on mobile devices. In this paper, we evaluate the performance and compare the results of all chipsets from Qualcomm, HiSilicon, Samsung, MediaTek and Unisoc that provide hardware acceleration for AI inference. We also discuss the recent changes in the Android ML pipeline and provide an overview of the deployment of deep learning models on mobile devices. All numerical results provided in this paper can be found, and are regularly updated, on the official project website: http://ai-benchmark.com.

1. Introduction

Over the past years, deep learning and AI have become one of the key trends in the mobile industry. This was a natural fit, as since the late 1990s mobile devices had been equipped with more and more software for intelligent data processing: face and eye detection [20], eye tracking [53], voice recognition [51], barcode scanning [84], accelerometer-based gesture recognition [48, 57], predictive text recognition [74], handwritten text recognition [4], OCR [36], etc. At the beginning, all proposed methods were mainly based on manually designed features and very compact models, as they were running at best on devices with a single-core 600 MHz Arm CPU and 8-128 MB of RAM. The situation changed after 2010, when mobile devices started to get multi-core processors, as well as powerful GPUs, DSPs and NPUs, well suited to machine and deep learning tasks. At the same time, the deep learning field was developing fast, with numerous novel approaches and models achieving a fundamentally new level of performance for many practical tasks, such as image classification, photo and speech processing, natural language understanding, etc. Since then, the previously used hand-crafted solutions were gradually replaced by considerably more powerful and efficient deep learning techniques, bringing us to the current state of AI applications on smartphones.

Nowadays, various deep learning models can be found in nearly any mobile device. Among the most popular tasks are different computer vision problems like image classification [38, 82, 23], image enhancement [27, 28, 32, 30], image super-resolution [17, 42, 83], bokeh simulation [85], object tracking [87, 25], optical character recognition [56], face detection and recognition [44, 70], augmented reality [3, 16], etc. Another important group of tasks running on mobile devices is related to various NLP (Natural Language Processing) problems, such as natural language translation [80, 7], sentence completion [52, 24], sentence sentiment analysis [77, 72, 33], voice assistants [18] and interactive chatbots [71]. Additionally, many tasks deal with time series processing, e.g., human activity recognition [39, 26], gesture recognition [60], sleep monitoring [69], adaptive power management [50, 47], music tracking [86] and classification [73]. Many machine and deep learning algorithms are also integrated directly into smartphone firmware and used as auxiliary methods for estimating various parameters and for intelligent data processing.

Figure 1: Performance evolution of mobile AI accelerators: image throughput for the float Inception-V3 model. Mobile devices were running the FP16 model using TensorFlow Lite and NNAPI. Acceleration on Intel CPUs was achieved using the Intel MKL-DNN library [45], and on Nvidia GPUs with CUDA [10] and cuDNN [8]. The results on Intel and Nvidia hardware were obtained using the standard TensorFlow library [2] running the FP32 model with a batch size of 20 (the FP16 format is currently not supported by these CPUs / GPUs). Note that Inception-V3 is a relatively small network, and for bigger models the advantage of Nvidia GPUs over other silicon might be larger.

While running many state-of-the-art deep learning models on smartphones was initially a challenge, as they are usually not optimized for mobile inference, the last few years have radically changed this situation. Presented back in 2015, TensorFlow Mobile [79] was the first official library that allowed standard AI models to run on mobile devices without any special modification or conversion, though also without any hardware acceleration, i.e. on CPU only. In 2017, the latter limitation was lifted by the TensorFlow Lite (TFLite) [46] framework, which dropped support for many vital deep learning operations but offered a significantly reduced binary size and kernels optimized for on-device inference. This library also got support for the Android Neural Networks API (NNAPI) [5], introduced the same year, which gives access to the device's AI hardware acceleration resources directly through the Android operating system. This was an important milestone, as a full-fledged mobile ML pipeline was finally established: training, exporting and running the resulting models on mobile devices became possible within one standard deep learning library, without using specialized vendor tools or SDKs. At first, however, this approach also had numerous flaws related to NNAPI and TensorFlow Lite themselves, making it impractical for many use cases. The most notable issues were the lack of valid NNAPI drivers in the majority of Android devices (only 4 commercial models featured them as of September 2018 [19]), and the lack of support for many popular ML models by TFLite.

These two issues were largely resolved during the past year. Since the spring of 2019, nearly all new devices with Qualcomm, HiSilicon, Samsung and MediaTek systems on a chip (SoCs) and with dedicated AI hardware are shipped with NNAPI drivers that allow ML workloads to run on embedded AI accelerators. In Android 10, the Neural Networks API was upgraded to version 1.2, which implements 60 new ops [1] and extends the range of supported models. Many of these ops were also added to TensorFlow Lite starting from builds 1.14 and 1.15. Another important change was the introduction of TFLite delegates [12]. These delegates can be written directly by hardware vendors and then used for accelerating AI inference on devices with outdated or absent NNAPI drivers. A universal delegate for accelerating deep learning models on mobile GPUs (based on OpenGL ES, OpenCL or Metal) was already released by Google earlier this year [43]. All these changes build the foundation for a new mobile AI infrastructure tightly connected with the standard machine learning (ML) environment, thus making the deployment of machine learning models on smartphones easy and convenient. The above changes will be described in detail in Section 3.

The latest generation of mid-range and high-end mobile SoCs comes with AI hardware, the performance of which is getting close to the results of desktop CUDA-enabled Nvidia GPUs released in the past years. In this paper, we present and analyze performance results of all generations of mobile AI accelerators from Qualcomm, HiSilicon, Samsung, MediaTek and Unisoc, starting from the first mobile NPUs released back in 2017. We compare against the results obtained with desktop GPUs and CPUs, thus assessing the performance of mobile vs. conventional machine learning silicon. To do this, we use a professional AI Benchmark application [31] consisting of 21 deep learning tests and measuring more than 50 different aspects of AI performance, including speed, accuracy, initialization time, stability, etc. The benchmark was significantly updated since the previous year to reflect the latest changes in the ML ecosystem. These updates are described in Section 4. Finally, we provide an overview of the performance, functionality and usage of Android ML inference tools and libraries, and show the performance of more than 200 Android devices and 100 mobile SoCs collected in-the-wild with the AI Benchmark application.

The rest of the paper is arranged as follows. In Section 2 we describe the hardware acceleration resources available on the main chipset platforms and the programming interfaces to access them. Section 3 gives an overview of the latest changes in the mobile machine learning ecosystem. Section 4 provides a detailed description of the recent modifications in our AI Benchmark architecture, its programming implementation and deep learning tests. Section 5 shows the experimental performance results for various mobile devices and chipsets, and compares them to the performance of desktop CPUs and GPUs. Section 6 analyzes the results. Finally, Section 7 concludes the paper.

2. Hardware Acceleration

Though many deep learning algorithms were presented back in the 1990s [40, 41, 22], the lack of appropriate (and affordable) hardware to train such models prevented them from being extensively used by the research community until 2009, when it became possible to effectively accelerate their training with general-purpose consumer GPUs [65].

Figure 2: The overall architecture of the Exynos 9820 NPU [78].