Comparison and Benchmarking of AI Models and Frameworks on Mobile Devices
1st Chunjie Luo, 2nd Xiwen He, 3rd Jianfeng Zhan, 4th Lei Wang, 5th Wanling Gao
Institute of Computing Technology, Chinese Academy of Sciences; BenchCouncil; Beijing, China
6th Jiahui Dai
Beijing Academy of Frontier Sciences and Technology; Beijing, China
Jianfeng Zhan is the corresponding author.

Abstract—Due to increasing amounts of data and compute resources, deep learning has achieved many successes in various domains. The application of deep learning on mobile and embedded devices is attracting more and more attention, and benchmarking and ranking the AI abilities of these devices has become an urgent problem to be solved. Considering both model diversity and framework diversity, we propose a benchmark suite, AIoTBench, which focuses on evaluating the inference abilities of mobile and embedded devices. AIoTBench covers three typical heavy-weight networks: ResNet50, InceptionV3, and DenseNet121, as well as three light-weight networks: SqueezeNet, MobileNetV2, and MnasNet. Each network is implemented by three frameworks designed for mobile and embedded devices: Tensorflow Lite, Caffe2, and Pytorch Mobile. To compare and rank the AI capabilities of the devices, we propose two unified metrics as the AI scores: Valid Images Per Second (VIPS) and Valid FLOPs Per Second (VOPS). Currently, we have compared and ranked 5 mobile devices using our benchmark. This list will be extended and updated in the near future.

I. INTRODUCTION

Due to increasing amounts of data and compute resources, deep learning has achieved many successes in various domains. In the early days, researchers focused on training more accurate models to prove the validity of AI. Now that many models have achieved good performance in various domains, delivering AI models to end users has been put on the agenda, and in this application period, inference becomes more and more important. According to an IDC report [1], the demand for AI inference will far exceed the demand for training in the future.

With the popularity of smart phones and the internet of things, researchers and engineers have made efforts to apply AI inference on mobile and embedded devices. Running inference on edge devices can 1) reduce latency, 2) protect privacy, and 3) save power [2].

To make inference on edge devices more efficient, neural networks are made more light-weight, either by using simpler architectures or by quantizing, pruning, and compressing the networks (a minimal quantization sketch follows below). Different networks present different trade-offs between accuracy and computational complexity; they are designed with very different philosophies and have very different architectures, and no single network architecture unifies network design and application. Besides the diversity of models, the AI frameworks on mobile and embedded devices are also diverse, e.g., Tensorflow Lite [2], Caffe2 [3], and Pytorch Mobile [4]. In addition, mobile and embedded devices provide hardware acceleration using GPUs or NPUs to support AI applications. Since AI applications on mobile and embedded devices are receiving more and more attention, benchmarking and ranking the AI abilities of those devices has become an urgent problem to be solved.
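To make the quantization route above concrete, here is a minimal sketch (not code from the paper) that converts a model into a quantized Tensorflow Lite model; the file names are hypothetical placeholders.

```python
import tensorflow as tf

# Load a network exported in the SavedModel format (hypothetical path).
converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_v2_savedmodel")

# Optimize.DEFAULT enables dynamic-range quantization of the weights,
# shrinking the model for mobile deployment.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write out the light-weight model that a mobile app would bundle.
with open("mobilenet_v2_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

Pruning and compression follow the same spirit: trade a little accuracy for a model that is cheaper to store and execute on the device.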
MLPerf Inference [5] proposes a set of rules and practices to ensure comparability across systems with wildly differing architectures. It defines several high-level abstractions of AI problems and specifies the task to be accomplished and the general rules of the road, but leaves implementation details to the submitters. MLPerf Inference thus aims at more general and flexible benchmarking. However, this flexibility makes the benchmark infeasible and unaffordable for a specific domain: users need to make their own efforts to optimize the whole stack of the AI solution, from languages, libraries, and frameworks down to the hardware. Moreover, excessive optimization of a particular solution may hurt the generality of the AI system.

ETH Zurich AI Benchmark [6], [7] aims to evaluate the AI ability of Android smartphones. Although AI Benchmark covers diverse tasks, we believe that model diversity deserves more attention, since models are often shared between different tasks by transfer learning. For example, pre-trained networks for image recognition are often used as backbones to extract feature maps for other vision tasks, e.g., object detection and segmentation. Moreover, AI Benchmark is implemented only using TensorFlow Lite.

In this paper, we propose a benchmark suite, AIoTBench, which focuses on evaluating the inference ability of mobile and embedded devices. The workloads in our benchmark cover more model architectures and more frameworks. Specifically, AIoTBench covers three typical heavy-weight networks: ResNet50 [8], InceptionV3 [9], and DenseNet121 [10], as well as three light-weight networks: SqueezeNet [11], MobileNetV2 [12], and MnasNet [13]. Each model is implemented by three frameworks: Tensorflow Lite, Caffe2, and Pytorch Mobile. Compared to MLPerf Inference, our benchmark is more available and affordable for users, since it is off the shelf and needs no re-implementation.

Our benchmark can be used for:
• Comparison of different AI models. Users can make a tradeoff between accuracy, model complexity, model size, and speed depending on the application requirements.
• Comparison of different AI frameworks in a mobile environment. Depending on the selected model, users can compare the AI frameworks on mobile devices.
• Benchmarking and ranking the AI abilities of different mobile devices. With diverse and representative models and frameworks, mobile devices receive a comprehensive benchmarking and evaluation.

To compare and rank the AI capabilities of the devices, we propose two unified metrics as the AI scores: Valid Images Per Second (VIPS) and Valid FLOPs Per Second (VOPS). They reflect the trade-off between quality and performance of the AI system. Currently, we have compared and ranked 5 mobile devices using our benchmark: Galaxy S10e, Honor V20, Vivo X27, Vivo NEX, and Oppo R17. This list will be extended and updated in the near future.
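This section names the two scores but does not spell out their formulas. One plausible reading, in which "valid" means weighting measured speed by the model's accuracy, is sketched below; the formulas and numbers are illustrative assumptions, not definitions from the paper.

```python
# ASSUMED definitions: the text only names VIPS and VOPS here, so this
# sketch takes "valid" to mean accuracy-weighted speed. Numbers are made up.

def vips(accuracy: float, images_per_second: float) -> float:
    """Valid Images Per Second: throughput weighted by accuracy (assumed)."""
    return accuracy * images_per_second

def vops(accuracy: float, flops_per_image: float, images_per_second: float) -> float:
    """Valid FLOPs Per Second: executed FLOPs weighted by accuracy (assumed)."""
    return accuracy * flops_per_image * images_per_second

# A model with 75% top-1 accuracy and 4.1 GFLOPs per image, at 20 images/s:
print(vips(0.75, 20.0))         # 15.0 valid images per second
print(vops(0.75, 4.1e9, 20.0))  # 6.15e+10 valid FLOPs per second
```

Under such a reading, a device scores well only if it is both fast and runs the models accurately, which matches the stated quality/performance trade-off.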
II. RELATED WORK

MLPerf Inference [5] proposes a set of rules and practices to ensure comparability across systems with wildly differing architectures. Unlike traditional benchmarks, e.g., SPEC CPU, which run out of the box, MLPerf Inference is a semantic-level benchmark: users need to make their own efforts to implement and optimize the whole stack of the AI solution, from languages, libraries, and frameworks down to the hardware. For hardware manufacturers, the implementation and optimization of the AI task may take a larger proportion of the effort needed to achieve a higher benchmark score. Moreover, excessive optimization of a particular solution may hurt the generality of the system.

ETH Zurich AI Benchmark [6], [7] aims to evaluate the AI ability of Android smartphones. It contains several workloads covering the tasks of object recognition, face recognition, playing Atari games, image deblurring, image super-resolution, bokeh simulation, semantic segmentation, and photo enhancement. Although AI Benchmark covers diverse tasks, we believe that model diversity deserves more attention, since models are often shared between different tasks by transfer learning. For example, pre-trained networks for image recognition are often used as backbones to extract feature maps for other vision tasks, e.g., object detection and segmentation. Moreover, AI Benchmark is implemented only using TensorFlow Lite.

Simone Bianco et al. [14] analyze more than 40 state-of-the-art deep architectures in terms of accuracy, model complexity, memory usage, computational complexity, and inference time. They run the selected architectures on a workstation equipped with an NVIDIA Titan X Pascal and on an embedded system based on an NVIDIA Jetson TX1 board; their workloads are implemented using PyTorch. Shaohuai Shi et al. [15] compare state-of-the-art deep learning software frameworks, including Caffe, CNTK, MXNet, TensorFlow, and Torch, measuring the running performance of these frameworks with three popular types of neural networks on two CPU platforms and three GPU platforms. The paper [16] discusses a comprehensive benchmark design for mobile and embedded devices, while we focus on the vision domain in this paper. Other AI benchmarks include Fathom [17] and DAWNBench [18]; these benchmarks are not designed for mobile devices.

Our benchmark focuses on evaluating the inference ability of mobile and embedded devices. The workloads in our benchmark cover more model architectures and more frameworks. Additionally, compared to MLPerf Inference, our benchmark is more available and affordable for users, since it is off the shelf and needs no re-implementation. Detailed comparisons of the different benchmarks are shown in Table I.

III. CONSIDERATIONS

In this section, we highlight the main considerations of AIoTBench. The details