Freeze It If You Can: Challenges and Future Directions in Benchmarking Smartphone Performance

Yao Guo, Yunnan Xu, Xiangqun Chen
Key Laboratory of High-Confidence Software Technologies (Ministry of Education)
School of Electronics Engineering and Computer Science, Peking University
{yaoguo, xuyn14, cherry}@pku.edu.cn

ABSTRACT
Benchmarking the performance of mobile devices such as Android-based smartphones is important for understanding and comparing the performance of different devices. Performance benchmarking tools such as Antutu have been widely used in both academia and industry. However, one of the main difficulties in benchmarking smartphone performance is that performance cannot be measured accurately and steadily. This paper investigates the challenges of performance benchmarking for Android-based smartphones. We identify key factors affecting performance benchmarking, including the frequency scaling setting, temperature, and running background services. Experiments show that some of these factors may cause performance to fluctuate by as much as 60%. We show preliminary results on controlling the benchmarking process to generate steady results; for example, freezing a smartphone in a refrigerator removes most of the fluctuation. Finally, we discuss the implications of our study and possible future research directions.

CCS Concepts
• General and reference → Performance; • Human-centered computing → Smartphones;

Keywords
Operating systems, smartphones, Android, performance benchmarking, frequency scaling

1. INTRODUCTION
Mobile devices such as smartphones have been widely adopted in the past decade. The number of smartphones shipped reached over 1.4 billion in 2015, about four times the number of PCs shipped in the same year. Despite the popularity of iPhones, Android-based smartphones have dominated the market with more than 80% of the share.

Benchmarking the performance of a computer system is important for many tasks, ranging from hardware design and OS optimization to minimizing the overhead of security approaches. Many studies have focused on performance measurement and optimization for traditional computers such as Windows- and Linux-based devices, including various benchmarks [8], measurement studies and optimization methods. However, little research has focused on the performance benchmarking of mobile devices, in contrast to the enthusiasm for studying them from the energy perspective. Nonetheless, the performance of mobile devices remains an important issue. As with traditional computer systems, when developing new techniques or approaches for mobile systems, it is important to measure or compare the performance of mobile devices in order to understand the performance overhead and implications of new approaches.

Although there are many benchmarks available for traditional computer systems, there are no widely adopted performance benchmarks for smartphones. To compare the performance of two smartphones, the most widely used methods are benchmarking applications such as Antutu [6] and PCMark [4]. For example, many security approaches use Antutu to assess their performance overhead on Android smartphones, including Compac [13], SEAndroid [12] and RootGuard [11]. However, it is difficult to achieve consistent results in Antutu benchmarking. In order to evaluate the performance overhead of SEAndroid [12], the authors chose to run Antutu 200 times (each run of Antutu typically takes about 5-10 minutes). One of the reasons for running Antutu so many times is that the variation is high: for some performance score components, the standard deviation is over 10% of the mean value, which indicates almost prohibitively high variance among the benchmarking results.

Meanwhile, many device manufacturers have been accused of cheating on Android benchmarking tools [5], including big brands such as Samsung, HTC, LG and Asus. For example, Samsung has been shown to use "benchmark boosters" such as overclocking to increase benchmark scores by 20-50%.
This paper attempts to answer the following two questions: Why is it hard to measure the performance of smartphones accurately? What are the challenges, and how do we control the measurement process in order to achieve steady performance scores?

We present a detailed study of performance measurements on Android smartphones with benchmarks such as Antutu, as well as microbenchmarks for storage and database performance. We have found that the measurements can be easily affected by various factors, sometimes generating results more than 50% apart. The key factors affecting performance measurements include CPU frequency scaling settings, temperature, background services running on the device, available memory, etc.

Based on these experiments, we observe that we can improve benchmarking scores significantly by changing the parameters of the test environment. For example, putting a smartphone in a colder environment, such as a refrigerator, can improve the overall Antutu score by more than 60%. Modifying the frequency scaling governor in Android or killing unused background services can also improve benchmarking scores significantly.

However, the focus of this paper is not on how to improve (or cheat) benchmarking applications, but on how to achieve stability in performance benchmarking results. In a normal test environment, some components of the Antutu benchmarking scores can vary by as much as 50% without intentionally changing any environment parameters. The variation can be reduced by fixing the CPU frequency, fixing the environment temperature, and killing background services that may generate performance noise. We show that with these practices enabled, we can improve the stability of benchmarking results by an order of magnitude for both Antutu and other microbenchmarks. Finally, we show that in our recommended benchmarking environment, the variations of both Antutu and microbenchmarks can be controlled within 1% of the mean value, which can be considered statistically stable.

This paper makes the following main contributions:

• To the best of our knowledge, this is the first explorative study on how to measure the performance of Android-based devices accurately and steadily.

• We have identified a list of factors that may cause fluctuation in smartphone benchmarking results and measured their effects through experiments on both benchmarking apps and microbenchmarks.

• We propose best practices that can be applied in performance benchmarking, which reduce the variation of Android performance measurements to a minimal degree.
2. BACKGROUND AND RELATED WORK

2.1 Background
Android has been the most popular mobile platform, with more than 80% market share. Due to the openness of Android, there are thousands of different Android-based smartphone devices on the market.

Benchmarking the performance of a computing system (smartphones included) has always been an important issue. For smartphone designers, it is important to understand the performance bottlenecks in a system and fix the issues before shipment. For smartphone users, it is important to know the performance numbers before choosing a device. For researchers, whenever the system has been modified, whether for improving security or optimizing for power, it is important to test the performance implications to make sure the modification does not cause significant performance degradation.

Another important issue, particularly for Android-based devices, is comparing the performance of different Android devices. Besides comparing hardware parameters, many evaluation and review reports include some kind of performance measurement results, for example with tools such as Antutu. As many devices nowadays are equipped with processors of similar configurations and the same amount of memory, the performance testing results demonstrate which device has been better optimized.

2.2 Existing Tools and Methods
Typically, performance evaluation of a computing system requires both benchmarks and testing tools. For example, the SPEC CPU benchmarks [8, 9] have been widely used for benchmarking the performance of computer systems, especially at the architecture level. The Linpack benchmarks [7] are also often used to measure a system's floating-point computing power.

For a mobile system, the most widely used methods are themselves mobile applications, many of which can be downloaded directly from Google Play, including Antutu [6], GeekBench [2], PCMark [4], etc. For example, Antutu [6] is one of the most widely used performance measurement tools; it not only produces a set of performance scores, but also compares the scores with those of similar devices to help users understand the advantages and limitations of their device. Besides an overall performance score, Antutu (as of version 6.0) also provides a breakdown including "3D (Graphics)", "User Experience (UX)" (including data security, data processing, I/O and gaming), "CPU" (including mathematics, common algorithms and multi-core computation), and "RAM" performance tests.

Other researchers have also studied mobile performance issues. Yoon [14] studied the performance of the Android platform using a benchmark application, as well as public profiling software. MobileBench [10] is a set of mobile benchmarks including both performance and user experience benchmarks. In comparison to this existing work, which focuses on how to evaluate or improve benchmarking scores, our main goal is to identify the reasons why performance numbers fluctuate and how to achieve steady performance results during benchmarking.

3. EXPERIMENTAL SETUP
We first evaluated different benchmarking methods on an Android smartphone to examine the extent of variation in the performance benchmarking results. We mainly use a Nexus 5 smartphone, running Android version 4.4 or Android version 6.0 depending on the experiment. We perform experiments with the following benchmarks:

• We use the latest Antutu version 6.0 as one of the representative benchmarking tools. Antutu is a comprehensive performance benchmarking application that measures performance from different aspects, including CPU, User Experience, Storage and RAM.

• We use GeekBench 3, an Android application focusing on both single-core and multi-core performance benchmarking.

• We also wrote a microbenchmark that ports FIO [1], a flexible I/O testing application, to Android, to test I/O performance, especially for files and storage.

• Finally, we use a database benchmark application for Android, ORM [3], to test the database performance of Android (including the default SQLite and other database modules).

For most of the experiments, we restart the phone and wait for about 5-10 minutes for the device to reach a steady state before we conduct the experiments. This procedure is in accord with most practices recommended by benchmark producers and is also widely used in research papers.
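This restart-and-settle procedure is easy to get wrong by hand, so it is best scripted. The Python sketch below is one possible harness (not the exact scripts used in our experiments): it assumes a single device attached over adb, and `collect_score` is a hypothetical placeholder for launching a benchmark and parsing its score.

```python
import subprocess
import time

def adb(*args):
    """Run an adb command and return its stdout as text."""
    proc = subprocess.run(["adb", *args], check=True,
                          capture_output=True, text=True)
    return proc.stdout

def reboot_and_settle(settle_minutes=7):
    """Reboot the phone, wait for boot to finish, then let it idle."""
    adb("reboot")
    adb("wait-for-device")
    # sys.boot_completed becomes "1" once Android has fully booted.
    while adb("shell", "getprop", "sys.boot_completed").strip() != "1":
        time.sleep(5)
    time.sleep(settle_minutes * 60)  # 5-10 minutes to reach a steady state

def run_experiment(collect_score, runs=15):
    """Reboot before each run and collect one score per run."""
    scores = []
    for _ in range(runs):
        reboot_and_settle()
        scores.append(collect_score())  # hypothetical: launch benchmark, parse score
    return scores
```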
4. BENCHMARKING RESULTS

4.1 Original Antutu Results
Testing Method #1 (Normal): We first show how the performance scores may vary in a normal benchmarking environment. We did not try to change the environment or control the testing process; instead, we ran the benchmarking tools normally at room temperature, without intentionally changing any parameters.

We ran the Antutu benchmark on a Nexus 5 with Android 4.4 for 15 times. Table 1 analyzes the distribution of the performance results. To expose the variation, we show the highest and lowest scores for each score element, the average score (Mean), the standard deviation (SD), and the coefficient of variance (CV, calculated as SD/Mean), which indicates the degree of variation in the results.

Table 1: Overall benchmarking results using Antutu 6.0 for Nexus 5 (Android 4.4) in 15 runs. Higher is better. (Diff: difference between highest and lowest scores; SD: standard deviation; CV: coefficient of variance.)

Antutu Score        Highest  Lowest  Diff    Mean      SD       CV
Overall             33792    29719   13.71%  31233.27  1246.99  3.99%
3D                  4030     3909    3.10%   3989.53   32.10    0.80%
3D [Garden]         1980     1796    10.24%  1841.27   40.52    2.20%
3D [Marooned]       2177     2050    6.20%   2148.27   30.76    1.43%
UX                  9229     7964    15.88%  8722.93   408.07   4.68%
UX Data Secure      2209     1885    17.19%  2029.87   80.27    3.95%
UX Data Process     736      627     17.38%  650.93    24.87    3.82%
UX Strategy Games   1261     807     56.26%  1078.87   165.33   15.32%
UX Image Process    3415     2999    13.87%  3278.27   146.99   4.48%
UX I/O Performance  1852     1507    22.89%  1679.00   112.45   6.70%
CPU                 15612    12467   25.23%  13597.53  979.70   7.20%
CPU Math            4731     3428    38.01%  3636.07   306.49   8.43%
CPU Common          5227     3311    57.87%  4053.80   552.75   13.64%
CPU Multi-Core      6365     5532    15.06%  5907.67   307.44   5.20%
RAM                 5298     4673    13.37%  4923.33   162.92   3.31%
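The statistics in Table 1 can be reproduced from the raw per-run scores in a few lines of Python; this is only a sketch of the definitions used in the table, not a tool from the paper.

```python
import statistics

def summarize(scores):
    """Compute the Table 1 statistics for one score component."""
    hi, lo = max(scores), min(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)      # sample standard deviation
    return {
        "Highest": hi,
        "Lowest": lo,
        "Diff": (hi - lo) / lo,        # spread between best and worst run
        "Mean": mean,
        "SD": sd,
        "CV": sd / mean,               # coefficient of variance (SD/Mean)
    }
```

For the Overall row, for instance, (33792 - 29719) / 29719 gives the 13.7% Diff and 1246.99 / 31233.27 gives the 3.99% CV.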
Although the standard deviation of the overall scores is only 4% of the mean value, the difference between the highest and lowest overall scores in 15 runs is close to 14%, which indicates that the variance is quite high. Looking at the detailed score breakdown, we can see that for some UX (User Experience) and CPU scores, the differences between the highest and lowest scores are over 55%, while the standard deviation is more than 15% of the mean value. With such high fluctuation, these scores cannot serve as an accurate indication during performance comparison. Running the benchmarking applications many times does not solve this issue, because the numbers are unsteady in nature.

4.2 Variations Due to Frequency Scaling
From the results in Table 1, it appears that the main factors affecting the final performance are mostly the CPU-related scores. Because the default frequency scaling policy is set to "ONDEMAND", the governor may change the frequency in a different way every time Antutu runs, resulting in the performance fluctuation.

Testing Method #2 (Fixed Frequency): In order to confirm whether frequency scaling increases the variance, we modified the frequency scaling (DVS) policy from "ONDEMAND" to "PERFORMANCE", which means that the smartphone will run at a fixed high frequency to achieve the best possible performance.
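On a rooted device, the cpufreq governor is exposed through sysfs, so this policy change can be scripted. The sketch below is one way to do it over adb; it assumes root access, the standard sysfs layout and the Nexus 5's four cores (paths and core counts vary across kernels, and hotplugged offline cores may not expose the file).

```python
import subprocess

GOV = "/sys/devices/system/cpu/cpu{}/cpufreq/scaling_governor"

def set_governor(governor, num_cpus=4):
    """Write the given cpufreq governor on every core (needs root)."""
    for cpu in range(num_cpus):
        cmd = "su -c 'echo {} > {}'".format(governor, GOV.format(cpu))
        subprocess.run(["adb", "shell", cmd], check=True)

# set_governor("performance")  # fixed high frequency for the test
# set_governor("ondemand")     # restore the default afterwards
```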
Besides Antutu, we also tested other benchmarks, including PCMark and GeekBench, with similar results. Table 2 compares the results of GeekBench with and without controlling the CPU frequency. When we change the CPU scaling policy to "Performance", the overall results improve and the variance is reduced by almost an order of magnitude. At the same time, we observe that the overall performance improves by 40% simply by changing the CPU frequency scaling policy.

Table 2: Benchmarking results with GeekBench 3. Higher is better.

Setup       GeekBench Score  Mean     SD      CV
Normal      Single Core      667.05   127.57  19.13%
            Multi-Core       2069.43  405.65  19.60%
Fixed Freq  Single Core      925.73   18.38   1.98%
            Multi-Core       2851.13  65.63   2.30%

Besides benchmarking apps, we also tested the FIO microbenchmark. Table 3 shows the benchmarking results with FIO in 40 test runs, comparing different CPU governor settings. With the default "ONDEMAND" setting, the variance is quite high: the difference between the best and worst runs is more than 13%. When we fix the CPU frequency, the variance is reduced by 3x. Because FIO focuses on file and storage performance, it is not CPU-intensive, so fixing the CPU frequency at a high level only improves its performance by about 10% on average. However, with a fixed CPU frequency, we can control the performance variation much better.

Table 3: Testing results for FIO in 40 test runs. Lower is better. (Unit: milliseconds.)

FIO      Normal   Fixed Freq
Highest  4820     4175
Lowest   4238     3993
Diff     13.73%   4.56%
Mean     4544.7   4047
SD       123.8    37.5
CV       2.72%    0.93%
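As one concrete illustration of how such an I/O microbenchmark can be driven, the sketch below runs a small random-write FIO [1] job through adb. It assumes an fio binary has already been cross-compiled and pushed to /data/local/tmp/fio on the device, and the job parameters are illustrative rather than the exact configuration used in our tests.

```python
import subprocess

def run_fio_randwrite():
    """Run a small random-write FIO job on the device, return its report."""
    cmd = ("/data/local/tmp/fio --name=randwrite --rw=randwrite "
           "--bs=4k --size=64M --directory=/data/local/tmp")
    proc = subprocess.run(["adb", "shell", cmd], check=True,
                          capture_output=True, text=True)
    return proc.stdout  # parse latency/bandwidth lines from the report
```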
4.3 The Effect of Temperature Control
Controlling the CPU frequency at the highest level may overheat the CPU and board, so it should not be a recommended benchmarking practice. The Nexus 5 smartphone sometimes became extremely hot during our tests. One reason the performance drops is that the frequency governor has to scale down the frequency when temperature readings exceed a threshold. If we can control the temperature of the environment, we can improve the performance and probably reduce the variation as well.

Testing Method #3 (Fixed Temperature): To measure the effects of environment temperature, we conduct experiments at a fixed lower temperature. During testing, without changing any configuration on the phone itself, we put it in a refrigerator that has a fixed temperature of 4 °C. We run the ORM database benchmarking application [3] under different temperatures: first at room temperature of about 25 °C, and then at the fixed temperature of 4 °C.

Table 4 shows the comparison of performance results for the ORM benchmark. The ORM application compares the write (W) and read (R) performance of different database modules, including the default SQLite (-RAW), the optimized SQLite (-OPT), ORMLite (-ORM) and GreenDAO (-GDAO).

Table 4: Database benchmarking results with the ORM application, showing each score component from 15 test runs. Each column shows the Write or Read performance for each database module. Lower is better.

Setups              Value  W-RAW    W-OPT    W-ORM    W-GDAO   R-RAW   R-OPT   R-ORM    R-GDAO
Normal              Mean   2388.67  1109.27  5047.93  1482.47  602.07  488.73  1464.00  883.47
                    SD     63.72    43.14    176.72   76.39    34.16   24.92   80.81    47.47
                    CV     2.67%    3.89%    3.50%    5.15%    5.67%   5.10%   5.52%    5.37%
Fixed Temp          Mean   2327.80  1131.80  5098.40  1487.47  596.20  483.40  1439.93  863.20
                    SD     57.29    24.62    59.48    20.79    31.59   15.06   36.14    40.97
                    CV     2.46%    2.18%    1.17%    1.40%    5.30%   3.12%   2.51%    4.75%
Fixed Temp+CPU      Mean   1619.47  903.87   4141.13  1204.93  407.80  307.47  1122.40  598.80
                    SD     15.45    11.48    20.91    9.27     6.10    3.26    12.78    6.39
                    CV     0.95%    1.27%    0.50%    0.77%    1.50%   1.06%   1.14%    1.07%
Fixed Temp+CPU      Mean   1620.53  897.73   4080.53  1198.73  403.40  305.87  1088.20  598.93
 +Kill BG Services  SD     5.94     3.47     19.28    4.02     2.92    2.33    3.60     2.64
                    CV     0.37%    0.39%    0.47%    0.34%    0.72%   0.76%   0.33%    0.44%

[Figure 1: Database benchmarking results, showing only the coefficient of variance (CV) of each score component from 15 test runs. Each line shows the CV of read or write performance using the different database access modules (W-/R- RAW, OPTIMIZED, ORMLite, GreenDAO) under the setups Normal, Fixed Temp, Fixed Temp+CPU, and Fixed Temp+CPU+Kill BG Services.]

We can see that when we fix the temperature at 4 °C, the overall performance does not increase significantly, but the variance is reduced to about half (see Figure 1). When we fix both temperature and CPU frequency, the performance improves by 20-30%, while the variance is reduced even further.
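Whether the device has actually reached a steady thermal state can be checked before each run. Android reports the battery temperature through dumpsys, and many kernels expose further sensors under /sys/class/thermal; the sketch below uses only the former (the "temperature:" field is in tenths of a degree Celsius).

```python
import re
import subprocess

def battery_temp_celsius():
    """Read the battery temperature via dumpsys battery."""
    out = subprocess.run(["adb", "shell", "dumpsys", "battery"],
                         check=True, capture_output=True, text=True).stdout
    match = re.search(r"temperature:\s*(\d+)", out)
    return int(match.group(1)) / 10.0

# e.g. poll battery_temp_celsius() until it stabilizes before starting a run
```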
4.4 The Effect of Killing Background Services
Besides temperature and CPU frequency, other factors might also affect performance numbers, including the number of running background services, the amount of free memory, the I/O or network activities running in the background, etc. We have tested various combinations, finding that background services and activities cause significant performance noise during benchmarking. Next, we show the influence of killing background services on the performance results; conversely, running extra processes or services in the background will influence the results in the opposite direction.

Testing Method #4 (Killing Background Services): There are many processes and services running in the background when an Android smartphone boots up, depending on the number of pre-installed applications and services provided by the manufacturer. We kill all user-mode processes running in the background before we run the benchmarking application. Note that some of the processes might restart automatically after being killed, so we wait 3-5 minutes for the system to reach a steady state before we run benchmarks.
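This step can also be scripted. One simple approach, sketched below, force-stops every third-party package; it assumes adb access, `pm list packages -3` lists only user-installed packages, and force-stopped apps may be restarted by the system, which is why we wait before benchmarking.

```python
import subprocess
import time

def kill_background_apps(benchmark_pkg):
    """Force-stop all third-party packages except the benchmark itself."""
    out = subprocess.run(["adb", "shell", "pm", "list", "packages", "-3"],
                         check=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        pkg = line.replace("package:", "").strip()
        if pkg and pkg != benchmark_pkg:
            subprocess.run(["adb", "shell", "am", "force-stop", pkg],
                           check=True)
    time.sleep(4 * 60)  # 3-5 minutes for the system to settle again
```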

As an example, we again use the ORM database benchmarking results shown in Table 4 and Figure 1. In the last configuration, we show performance results after killing all background processes, in addition to fixing the temperature and CPU frequency. The results show that although this improves the overall performance only marginally, it reduces the performance variance by a further 50% or so.

4.5 Antutu Revisited
Finally, we show Antutu benchmarking results again, with the same Nexus 5 smartphone running Android 6.0 (similar results can be achieved with Android 4.4 as well). Table 5 shows how we controlled the test environment to reduce the variation in the benchmarking results. The absolute performance numbers are similar to the scores shown previously in Table 1.

Table 5: Antutu benchmarking results in different controlled environments for Nexus 5 with Android 6.0. We run Antutu 5 times in each case.

Setups               Avg Score  SD       CV
Normal @ Room Temp   31833      1363.05  4.28%
Fixed Temp @ 4 °C    50716      970.92   1.91%
Killing BG Services  52013      276.29   0.53%

The results show that once we fixed the temperature at 4 °C (freezing the phone in a refrigerator), the variance was reduced by more than half, to less than 2%. After we removed the influence of background services, the variance was reduced further, to about 0.5%. In this case, even without setting the CPU frequency at a fixed high level, we can still achieve very steady benchmarking results. Although these results are still preliminary, they indicate that the variation of smartphone performance can be reduced in a controlled testing environment. As a result, there is no need to run Antutu hundreds of times, because the variation can be controlled within a small enough range.

5. DISCUSSIONS

5.1 Accurate and Steady Performance Benchmarking is Possible
Based on the results presented above, we recommend the following practices for minimizing fluctuation in performance benchmarking. One should consider controlling the following parameters in the testing environment:

• The temperature in the testing environment should be controlled at a stable level. Running benchmarks on a smartphone will certainly increase its temperature (especially on the inside boards), which affects the decisions made by the frequency scaling governor. Controlling the temperature at a level lower than normal room temperature will improve the stability of benchmarking results significantly. Note that although the overall performance may improve when testing at a low temperature, this does not invalidate the results, because our focus is to compare the performance between different scenarios rather than absolute performance numbers.

• Keeping background activity to a minimum will help reduce benchmarking noise. This includes killing unnecessary processes and services running in the background and keeping an eye on the available free memory to check for irregularities. Removing the influence of background activities also helps compare two different smartphones more fairly, as different smartphone makers pre-install different sets of apps that might run in the background.

• In order to achieve really low variance, one should consider disabling dynamic scaling and setting the CPU frequency at a fixed level if necessary. For example, changing the scaling governor from ONDEMAND to PERFORMANCE will make the results more stable. However, this should be used with care. On one hand, because frequency scaling is unavoidable in reality, we risk creating an unrealistic testing environment. On the other hand, one also risks burning the phone if the temperature reaches a very high level.

Although far from an ideal solution, putting the smartphone inside a refrigerator or close to an AC vent will help stabilize the testing results.
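Taken together, these practices can form a single scripted preparation step before each benchmarking run. The sketch below simply chains the helper functions from the earlier sketches (and inherits their assumptions: a rooted device, adb, and an external way to hold the ambient temperature).

```python
def prepare_device(benchmark_pkg, fix_frequency=False):
    """Apply the recommended practices before a benchmarking run."""
    reboot_and_settle()                    # clean boot, steady state
    kill_background_apps(benchmark_pkg)    # minimize background noise
    if fix_frequency:                      # optional: see caveats above
        set_governor("performance")
    # Ambient temperature must be held externally (e.g., a 4 °C fridge);
    # battery_temp_celsius() can verify that the device has cooled down.
```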
This includes formance, as a device has good reasons to scale down the killing unnecessary processes and services running in frequency to respond to the running environment. the background and keeping an eye on the available For the purpose of this work, we believe achieving steady free memory to check whether there are any irregular- results is important to compare the performance overhead ities. Removing the influence of background activities of many research implementations, for example, security or energy optimization approaches. Our main goal is to Acknowledgment help researchers obtain steady results while evaluating the This work was partially supported by the National Key performance impact of newly applied mechanisms. However, Research and Development Program. 2016YFB1000105 and comparing the performance of different smartphones would the National Natural Science Foundation of China under need to consider many other factors, which is out of the Grant No.61421091. We also want to thank the anonymous scope of this paper. reviewers for their constructive feedbacks. 5.3 Future Research Directions Performance benchmarking of mobile devices such as An- 7. REFERENCES droid smartphones is an important issue that has not been [1] FIO: A Flexible I/O Tester. https://github.com/ studied extensively. Although we have shown preliminary axboe/fio. results that performance benchmarking can be conducted [2] GeekBench 3. https://play.google.com/store/ accurately and steadily with careful control on the testing apps/details?id=com.primatelabs.geekbench. environment, we need a more systematic approach to bench- [3] ORM: Android application for benchmarking marking mobile performance. ORMLite and GreenDao. A systematic study on factors resulting in per- https://github.com/daj/android-orm-benchmark. formance variations. For example, how do the different [4] PCMark. https://play.google.com/store/apps/ detail- factors affect each other in performance variations? How s?id=com.futuremark.pcmark.android.benchmark. should we perform trade-offs between performance speedup [5] AnandTech. The State of Cheating in Android and performance stability? What is the extent of perfor- Benchmarks. http://www.anandtech.com/show/ mance variations on different smartphone devices? Besides 7384/state-of-cheating-in-android-benchmarks. smartphones, do these factors affect performance stability [6] Antutu Labs. Antutu Benchmark. https: of PCs or servers as well? We plan to expand our research //play.google.com/store/apps/details?id=com.Antutu. to cover these details to reveal more insights on smartphone ABenchMark. performance benchmarking. [7] Dongarra, J. J., Bunch, J. R., Moler, C. B., and More comprehensive approaches to achieving per- Stewart, G. W. LINPACK users’ guide. Siam, 1979. formance stability. We have shown several methods that can be used to achieve performance stability, however [8] Henning, J. L. Spec cpu2000: Measuring cpu methods such as “freezing the smartphone” can only be used performance in the new millennium. Computer 33, 7 as an ad-hoc solution, disabling frequency scaling should (2000), 28–35. not be recommended as a best practice either. We need [9] Henning, J. L. Spec cpu2006 benchmark more comprehensive and practical solutions that can not descriptions. ACM SIGARCH Computer Architecture only achieve performance stability, but also represent real- News 34, 4 (2006), 1–17. world testing scenarios as well. [10] Kim, C., Jung, J., Ko, T.-K., Lim, S. W., Kim, S., Mobile benchmarks. 
Mobile benchmarks. The research community needs mobile benchmarks that can be used to evaluate the performance of a mobile device, but there are no good benchmark suites available at this point. Performance-oriented benchmarks can also be used to evaluate the power/energy aspects of a smartphone. Compared to traditional benchmarks, which are computing-intensive, a mobile benchmark suite should consider the special characteristics of mobile devices, including user interaction, mobile gaming and different network configurations.

Performance testing tools. Many researchers agree that tools such as Antutu are not a good indication of device performance. Nonetheless, Antutu is still widely used in both academia and industry, because we do not have a better substitute. We, as a community, should be able to create a performance testing tool that reports accurate and steady results, such that we can keep ourselves and other consumers/researchers from being cheated by smartphone manufacturers.

6. CONCLUDING REMARKS
This paper presents an explorative study on how to measure and compare the performance of Android-based devices steadily. Through a series of experiments, we show that measuring the performance of Android smartphones can be inaccurate and difficult to control due to various reasons. We also show that it is possible to generate stable performance results by controlling the testing environments and parameters.

Acknowledgment
This work was partially supported by the National Key Research and Development Program (No. 2016YFB1000105) and the National Natural Science Foundation of China under Grant No. 61421091. We also want to thank the anonymous reviewers for their constructive feedback.

7. REFERENCES
[1] FIO: A Flexible I/O Tester. https://github.com/axboe/fio.
[2] GeekBench 3. https://play.google.com/store/apps/details?id=com.primatelabs.geekbench.
[3] ORM: Android application for benchmarking ORMLite and GreenDAO. https://github.com/daj/android-orm-benchmark.
[4] PCMark. https://play.google.com/store/apps/details?id=com.futuremark.pcmark.android.benchmark.
[5] AnandTech. The State of Cheating in Android Benchmarks. http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks.
[6] Antutu Labs. Antutu Benchmark. https://play.google.com/store/apps/details?id=com.Antutu.ABenchMark.
[7] Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. LINPACK Users' Guide. SIAM, 1979.
[8] Henning, J. L. SPEC CPU2000: Measuring CPU performance in the new millennium. Computer 33, 7 (2000), 28-35.
[9] Henning, J. L. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News 34, 4 (2006), 1-17.
[10] Kim, C., Jung, J., Ko, T.-K., Lim, S. W., Kim, S., Lee, K., and Lee, W. MobileBench: A thorough performance evaluation framework for mobile systems. In The First International Workshop on Parallelism in Mobile Platforms (PRISM-1), in conjunction with HPCA-19 (2013).
[11] Shao, Y., Luo, X., and Qian, C. RootGuard: Protecting rooted Android phones. Computer 47, 6 (June 2014), 32-40.
[12] Smalley, S., and Craig, R. Security Enhanced (SE) Android: Bringing flexible MAC to Android. In NDSS (2013), vol. 310, pp. 20-38.
[13] Wang, Y., Hariharan, S., Zhao, C., Liu, J., and Du, W. Compac: Enforce component-level access control in Android. In Proceedings of the 4th ACM Conference on Data and Application Security and Privacy (CODASPY '14) (New York, NY, USA, 2014), ACM, pp. 25-36.
[14] Yoon, H.-J. A study on the performance of Android platform. International Journal on Computer Science and Engineering 4, 4 (2012), 532.