Moby: A Mobile Benchmark Suite for Architectural Simulators

Yongbing Huang∗†, Zhongbin Zha∗†, Mingyu Chen∗, Lixin Zhang∗
∗State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
†University of Chinese Academy of Sciences, Beijing, China
Email: {huangyongbing, zhazhongbin, cmy, zhanglixin}@ict.ac.cn

Abstract—Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user-interaction requirements.

In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, and that current mobile platforms, especially their instruction-related components, cannot meet their requirements.

I. INTRODUCTION

Mobile devices, especially smartphones and tablets, have become an important world-wide market. From the application point of view, a wide variety of mobile applications are now widely used; these include web browsers, social networks, email clients, audio and video players, document processing systems, and map programs, to name a few. Different types of applications present different requirements for the hardware components of mobile platforms. From the mobile operating system point of view, Android [1] and iOS [2] have the highest market occupancy and growth speed. Android adoption has ramped up quickly, gaining in popularity six times faster than iOS. Android and iOS use different programming languages and execution models, and they differ in their utilization of hardware resources. Therefore, the requirements placed on hardware resources by different mobile applications vary.

As for mobile platforms, ARM [3] based mobile processors such as Apple's Ax [4], TI's OMAP [5], and Qualcomm's Snapdragon [6] are more prevalent than processors like Intel's Atom [7]. Generally, as the performance of these mobile processors improves, their microarchitectures become more complicated. For example, mobile processors with four cores, an out-of-order execution model, and two-level caches have become the mainstream. Mobile system designers must consider how application and OS diversity affect their design choices in this increasingly complex design space.

Benchmarking and architectural simulation are two important tools for processor design and computer architecture research. To be relevant, a benchmark suite for architectural research must satisfy at least two properties. First, workloads in the benchmark suite should be diverse enough to exhibit the range of behaviors of the target applications. Second, all the applications should be portable to architectural simulators. However, most current mobile benchmark suites represent only a small subset of mobile applications [8], [9], [10], [11], [12], [13], and some cannot be run directly in simulators due to user-interaction requirements (e.g., the interactive games and audio player of Gutierrez et al. [14]). Meanwhile, existing benchmarks such as SPEC CPU2006 exhibit significantly different behaviors from interactive mobile applications [13], [14].

In this paper, we develop Moby, a benchmark suite designed to evaluate microarchitectures of mobile platforms in full-system architectural simulators. Two design issues drive our benchmark suite. First, mobile applications on different operating systems are incompatible. Since Android is the most commonly used OS for mobile devices, Moby contains only mobile applications that run on the Android OS. Second, most popular mobile applications are commercial, and thus their source codes are not generally available. We choose only applications that can be freely downloaded, in order to avoid licensing issues. In total, Moby contains 10 mobile applications spanning nine categories, including web browser, social networking, email, news, audio, video, document, map, and game applications. Except for the web browser application BBench [14], all the applications are selected from the Google Play Store [15].

Since our benchmark suite is intended to drive architectural simulators, all applications should be executable without manual user inputs. Although AutoGUI [13] provides a user-interface automation tool to record and deterministically replay user actions, we use an alternative method that bypasses user interaction by executing only typical representative operations of these mobile applications. While Moby can be executed on many simulators that support the Android OS, we take the commonly used gem5 simulator [16] as an example to test and characterize Moby. The gem5 disk image for Moby has already been made public [17].

We measure microarchitecture-independent features on the gem5 simulator with the ARM ISA and microarchitecture-dependent features on the ARM-based Pandaboard development board [18]. The microarchitecture-independent features include instruction mix, working set size, data and instruction locality, and binary execution behaviors. The instruction features show that the representative operations of all applications execute several billion instructions and that nearly 70% of branches are conditional. Furthermore, most applications spawn about 20 processes and invoke more than 20 libraries. On the Pandaboard, the measured microarchitecture-dependent features mainly include CPI and the behaviors of the branch predictor, cache, TLB, and memory components. Experimental results demonstrate that all applications present high CPIs, which implies that these mobile applications and current ARM-based mobile platforms are not well matched. In particular, the instruction-related resources (branch predictor, instruction cache, and TLB) suffer from serious miss rates.

In summary, we make three contributions:

• We present Moby, a new mobile benchmark suite that contains a diverse set of applications and is appropriate for simulation-based design space exploration.
• We extract typical representative operations of the interactive applications in Moby and automatically execute them on full-system architectural simulators.
• We describe both microarchitecture-independent and microarchitecture-dependent characteristics of all Moby applications.

The rest of this paper is organized as follows. Section II describes the applications included in Moby. Section III introduces our experimental platforms and tools. Then the microarchitecture-independent features and microarchitecture-dependent features of Moby are illustrated in Section IV and Section V, respectively. Finally, we describe related work in Section VI and conclude in Section VII.
II. THE MOBY BENCHMARK SUITE

The goal of this work is to define a benchmark suite that can be used to design and optimize mobile processors. In this section, we first present our methods for selecting suitable applications for such a suite, and then we describe each included application in detail.

A. Benchmark Selection Methods

1) Requirements: As a mobile benchmark suite for architectural simulators, Moby should both contain emerging and diverse applications and also support research.

The rapid innovation of the mobile Internet spawns many emerging applications with new and varied behaviors. Mobile platform architects need tools to model the diverse behaviors and resource requirements of current and emerging mobile applications in different application markets.

Benchmark suites intended to support architectural design space exploration and system research should make it possible to instrument, manipulate, and model the constituent applications in detail. However, most popular mobile applications are commercial, which complicates instrumentation because their source codes are unavailable. Although most Moby applications lack source code, all can be downloaded for free. Note that most mobile applications involve user interaction and require a network connection, both of which are difficult to implement or model in architectural simulators. The dependence on networks can easily be removed by buffering any required remote data in local storage.

2) User Interaction: A major difficulty in analyzing interactive mobile applications is generating reproducible results without manual user inputs. Moreover, the slow execution speed of full-system simulators makes it impractical to incorporate user-action inputs in experiments. There are two main solutions for coping with user interaction in simulators. Tools like AutoGUI [13] and Xnee [19] provide automation capabilities to record and replay user inputs. Using similar tools, we are working to identify representative code pieces that suffer from large response latencies. However, most current automation tools still suffer from shortcomings such as nondeterministic replay.

The other, simpler solution is to avoid user interaction in simulators altogether. We find that the main activities of most mobile applications can be executed as separate processes, and user interactions are mainly used to specify inputs for these activities. By launching these activities directly and specifying their inputs on the command line, user interactions are no longer required (a sketch of such a launch command is given at the end of this subsection). Compared to using automation tools, this method is both simple and efficient, and thus we adopt this approach for Moby.

The main activities described above are considered to be the representative operations of the mobile applications, and they can be extracted from the AndroidManifest.xml file of applications running on the Android OS. In the current version of Moby, only typical operations of each application are executed in architectural simulators. In addition, several typical operations can be combined in order to improve simulation accuracy.

3) Selection Steps: We take five steps to select the applications in the Moby benchmark suite. Given the popularity and maturity of the Android ecosystem, we study mobile applications executed on the Android OS. Initially, we choose commonly used programs from the Google Play Store [15] as our application pool. Then, for each category in the Google Play Store, we focus only on popular applications that are free and have high download rates. A subset of the applications studied can be found in Zhang et al. [20]. Next, we measure microarchitectural characteristics of these applications on a real platform (see Section V). After that, according to their characteristics, we select several applications to represent each category. Finally, we extract representative operations for the selected applications, and verify whether these operations can be automatically executed without interaction with users and whether necessary data downloaded from networks can be buffered and replayed offline. Moby includes only applications that pass these tests.
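As a concrete illustration of the launch-with-inputs approach described above, the sketch below (our own illustration, not part of the Moby release; the package name, activity name, and file path are hypothetical placeholders) starts an application's main activity through the Android activity manager with its input given on the command line, so the run needs no touch input.

```python
import subprocess

def launch_activity(package: str, activity: str, data_uri: str = "") -> None:
    """Start an Android activity via `am start`, optionally passing an input URI."""
    cmd = ["adb", "shell", "am", "start", "-n", f"{package}/{activity}"]
    if data_uri:
        cmd += ["-d", data_uri]  # e.g. a locally buffered input file
    subprocess.run(cmd, check=True)

# Hypothetical example: open a locally stored video with no touch interaction.
launch_activity("com.example.player", ".PlayerActivity",
                "file:///sdcard/moby/input.mp4")
```

Inside a full-system simulation, the same `am start` invocation can be placed in a boot or init script instead of being issued through adb.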
As a result, we have chosen nine applications from the Google Play Store for our benchmark suite; summaries of these are shown in Table I. Almost all of these applications are commercial, but they can be downloaded for free. Moreover, we choose BBench [14] to represent web-browser applications, creating a suite of 10 applications. We will add more applications to our benchmark suite as more mainstream or popular applications emerge.

TABLE I. SUMMARY OF MOBY

Bench          | Category       | Typical OP            | Input
BBench∗        | Web Browser    | Load web pages        | Web pages
K9Mail         | Email          | Load/Show emails      | Buffered emails
SinaWeibo      | Social Network | Load information      | Buffered texts
NeteaseNews    | News           | Check and load news   | Buffered news
KingsoftOffice | Document       | Open doc/xls/ppt file | A doc file
AdobeReader    | Document       | Open pdf file         | A PDF file
BaiduMap       | Map            | Load an area's map    | Buffered maps
MXPlayer       | Video          | Play a video          | A video file
TTPod          | Audio          | Play a song           | A music file
FrozenBubble   | Game           | Load game             | Null

∗: BBench is from Gutierrez et al. [14]

B. Benchmark Descriptions

In this subsection, we describe each application's use and features in detail.

1) BBench [14] is an automated web-browser page-rendering benchmark that tests rendering performance. It comprises a sequence of snapshots of a varied selection of the most popular sites. The webpages included in the benchmark contain diverse content and page styles (e.g., dynamic content, JavaScript, video, images, Flash, CSS, HTML5, etc.). Its typical operation is simply loading a webpage.

2) K9Mail [21] is an open-source email client running on the Android platform, which supports the commonly used POP3 and IMAP4 protocols. Although K9Mail supports features like sending/receiving email, searching, and multi-folder syncing, our benchmark only uses loading and displaying emails buffered in local storage as its typical operations. This requires no network connection, and it can easily be automated without user interaction.

3) SinaWeibo [22] is a client for one of China's biggest social networking and microblogging services. It allows users to publish information instantly and share it with others. The information includes text, pictures, music, and video. Loading and displaying information is the typical operation for social networking applications. Again, this information is buffered locally.

4) NeteaseNews [23] is a news reader application. Users can obtain news by subscribing to magazines, newspapers, and other resources. The typical operation for news readers is checking the news from the server and listing articles. In our benchmark, we substitute local data for data on the remote server.

5) KingsoftOffice [24] is an efficient mobile office application. It contains rich editing features and supports 23 kinds of files, including DOC, XLS, PPT, and PDF. Writer, presentation, and spreadsheet are the commonly used KingsoftOffice programs for manipulating DOC, PPT, and XLS files respectively. Hence, the typical operations for KingsoftOffice are opening files with these formats.
6) AdobeReader [25] is an application for reliably viewing and interacting with PDF documents. Its typical operation is viewing a PDF file.

7) BaiduMap [26] is a mobile map client from China's biggest search engine and is similar to Google Maps. The map client presents detailed maps with 3D buildings, supports navigation, and displays neighboring restaurant and hotel information. Loading the map of a specific area from offline maps is its typical operation.

8) MXPlayer [27] is a video player that supports almost every movie format. It applies hardware acceleration to all videos with the help of a new H/W decoder, and it supports multi-core decoding. Its typical operation consists of playing an mp4 video clip stored on the local disk.

9) TTPod [28] is a music player for a wide variety of audio formats. It provides high-quality decoding, highly accurate lyrics, and album art downloads. It supports a rich graphical user interface with built-in graphics, a customizable equalizer function, and floating lyrics. The operation for this application is to play the first minute of an MP3 file.

10) FrozenBubble [29] is a puzzle game for Android. Interactive games have become important applications on mobile devices with the introduction of high-performance CPUs and mobile GPUs. However, interactive games heavily rely on users and thus cannot be automatically executed in simulators. This game is chosen because it can be fully loaded and simply played without any user interaction.

C. Input Sets

Generally, each benchmark suite should provide several input sets that represent various usage scenarios. For example, PARSEC [30] applications come with six input sets, each of which processes a different amount of data. Unlike applications designed for high-performance servers, most mobile applications do not focus on data processing. Therefore, in the current version of Moby, we use just one input size for each application. Nevertheless, multiple input sets for Moby applications can easily be produced. For applications like KingsoftOffice, AdobeReader, MXPlayer, and TTPod (which mainly execute or process input files), input sets can easily be constructed by selecting input files with varied types and sizes. The input sets of the other mobile applications are actually buffered network data, which users can obtain from real mobile platforms by performing different web queries. These applications include K9Mail, SinaWeibo, NeteaseNews, and BaiduMap. For example, users can construct input sets for BaiduMap by downloading maps of various areas from anywhere on the internet.
III. METHODOLOGY

In this section, we explain how we characterize the Moby benchmark suite in terms of platforms and tools.

TABLE II. MOBY INSTRUCTION SUMMARY

Bench          | Instruction Count (Billions) | Branches: Total | Branches: Cond./Total∗ | Loads  | Stores | Working Set Size (MB)
BBench†        | 2.48                         | 14.43%          | 69.5%                  | 23.05% | 12.16% | 80
K9Mail         | 1.18                         | 11.00%          | 72.60%                 | 20.03% | 9.34%  | 64
SinaWeibo      | 2.23                         | 16.92%          | 68.35%                 | 27.21% | 14.68% | 114
NeteaseNews    | 2.65                         | 16.58%          | 69.01%                 | 25.85% | 12.22% | 104
KingsoftOffice | 2.24                         | 16.59%          | 68.73%                 | 26.13% | 14.06% | 87
AdobeReader    | 2.09                         | 15.17%          | 70.47%                 | 23.74% | 12.19% | 83
BaiduMap       | 3.53                         | 14.31%          | 72.50%                 | 22.79% | 12.29% | 102
MXPlayer‡      | 3.84                         | 18.22%          | 70.64%                 | 23.79% | 12.76% | 97
TTPod‡         | 3.87                         | 15.18%          | 68.45%                 | 25.49% | 12.84% | 126
FrozenBubble   | 0.28                         | 15.59%          | 71.76%                 | 21.53% | 9.66%  | 47

∗: Percent of all branch instructions that are conditional

†: BBench only loads each page once

‡: TTPod and MXPlayer each play about three seconds of music/video

A. Platforms

We measure the Moby benchmark suite on both the gem5 simulator [16] and the Pandaboard ES [18] development board, running Android ICS 4.0. The gem5 simulator is a widely used architectural simulator that supports the Alpha, ARM, SPARC, MIPS, POWER, and x86 ISAs. By default, the gem5 simulator provides several machine configurations for the ARM ISA. These machine configurations, which contain the parameters of the main hardware components, are almost the same as the configurations of real ARM-based development boards such as Versatile Express [31].

The Pandaboard ES board comes with a market-quality OMAP 4460 system-on-chip (SoC) equipped with a dual-core Cortex A9 processor [32] manufactured on the 45 nm process node and 1 GB of LPDDR2 DRAM. The Cortex A9 processor is a complex out-of-order superscalar core with eight pipeline stages. It has 32 KB 4-way set associative L1 I/D caches and a 512 KB 16-way set associative L2 cache.

In our experiments, the configurations of the main hardware components such as cache and memory in gem5 are set according to those of the Pandaboard. Their operating systems and disk images are also nearly the same.

B. Tools

In order to study instruction behaviors, we modify the gem5 simulator to collect the instruction trace of each application. Meanwhile, we can also map instructions to their binaries, such as libraries, the OS kernel, and the application binary file, by dumping the mapping tables between instructions' virtual addresses and binaries. The mapping tables are just the contents of the proc file "/proc/pid/maps" in the Android file system. This information is maintained in the virtual memory structure and can be tracked through the task structure of each process. Thus, in the simulator, we only need to find these task structures for the different processes and then read out the corresponding contents.

For the purpose of studying the performance of processor and memory components, we measure the microarchitectural characteristics of the Moby suite using hardware performance counters [3] on the Pandaboard. The hardware performance events provided by the Cortex A9 processor cover most main components, including the processor pipeline, cache, and TLB. The Cortex A9 provides six core counters that can count up to six events simultaneously, one extra cycle counter, and two L2 cache counters. However, the metrics shown in Section V cannot all be directly acquired or computed using these counters when running each application just once. Hence, we repeat each experiment multiple times with different combinations of counters, and report average values over ten measurements. All performance event data are collected using the lightweight performance counter tool TopMC [33].
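The address-to-binary lookup described above can be reproduced offline from a dumped maps table. The following is a minimal sketch of that lookup (ours, assuming the standard Linux /proc/&lt;pid&gt;/maps text format), not the authors' actual gem5 modification.

```python
from bisect import bisect_right

def parse_maps(maps_text):
    """Parse a dumped /proc/<pid>/maps into sorted (start, end, binary) regions."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5] if len(fields) > 5 else "[anonymous]"
        regions.append((start, end, path))
    return sorted(regions)

def binary_for_pc(regions, pc):
    """Return the binary (library, kernel image, app code) that backs a PC."""
    starts = [r[0] for r in regions]
    i = bisect_right(starts, pc) - 1
    if i >= 0 and pc < regions[i][1]:
        return regions[i][2]
    return "[unmapped]"
```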
IV. MICROARCHITECTURE-INDEPENDENT CHARACTERIZATION

Microarchitecture-independent characteristics enable us to understand the inherent nature of applications. In this section, we provide an overview of the microarchitecture-independent characteristics of Moby in terms of instruction mix, working set size, spawned processes, invoked libraries, and code and data locality. Note that most mobile applications execute many short activities, where each activity accounts for only a few billion instructions. Compared to the trillions of instructions of SPEC CPU2006 applications [34], executing these few billion instructions is much more suitable for slow, full-system simulators. We use the gem5 full-system simulator to execute the representative operations of each workload shown in Table I (we have already released the corresponding gem5 disk images and execution scripts for all Moby applications [17]), and collect all the following microarchitecture-independent metrics. We find that our workloads share similar instruction profiles even though their working-set sizes vary significantly.

A. Instruction Mix

The mix of instructions reflects the requirements placed on different hardware resources. For example, load and store instructions rely on cache and memory resources. Different types of branch instructions indirectly reveal the complexity of programs and their demands on branch predictors. As shown in Table II, load and store instructions account for about 25% and 12%, respectively, for most applications. Compared to most integer benchmarks of SPEC CPU2006, which present diverse load and store behaviors [35], the percentages of load and store instructions are similar across Moby applications. Another 15% of instructions are branches for all applications except MXPlayer and K9Mail. Meanwhile, conditional branches account for nearly 70% of these branch instructions across all applications. Generally, each conditional branch instruction may result in executing a wrong path and consequently require the out-of-order processor to roll back. The high percentage of conditional branch instructions is likely to trigger many mispredictions with large penalties, which affects overall performance.
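The Table II columns can be reproduced from any per-instruction trace with a simple tally. The sketch below is ours and assumes a hypothetical two-column trace format (program counter and instruction class); it is not the modified gem5 tracer itself.

```python
from collections import Counter

def instruction_mix(trace_lines):
    """Tally a decoded trace into the Table II categories.

    Assumes a hypothetical "<pc> <class>" text format, where <class> is one
    of: load, store, cond_branch, uncond_branch, other.
    """
    counts = Counter(line.split()[1] for line in trace_lines)
    total = sum(counts.values())
    branches = counts["cond_branch"] + counts["uncond_branch"]
    return {
        "loads_pct": 100.0 * counts["load"] / total,
        "stores_pct": 100.0 * counts["store"] / total,
        "branches_pct": 100.0 * branches / total,
        "cond_share_of_branches_pct": 100.0 * counts["cond_branch"] / max(branches, 1),
    }
```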

(Figure 1 shows one panel per application: (a) BBench, (b) K9Mail, (c) SinaWeibo, (d) NeteaseNews, (e) KingsoftOffice, (f) AdobeReader, (g) BaiduMap, (h) MXPlayer, (i) TTPod, (j) FrozenBubble; each panel plots the percent of instruction and data requests against reuse distance.)

Fig. 1. L1 reuse distance distributions

(Figure 2 plots, for each application, the CDF of L2 reuse distances; a vertical marker indicates the 16-way set associativity of the L2 cache.)

Fig. 2. L2 reuse distance distributions

B. Working Sets

Working-set size can be measured at cache-line or page granularity, depending on the purpose. We choose pages (4 KB) as our basic granularity in this paper because we aim to study main-memory access behaviors. Note that most typical operations of Moby applications last only several seconds, and thus we consider the working-set size to be the total number of pages touched during the whole execution.

Half of the working sets in Table II approach or exceed 100 megabytes. Only K9Mail and FrozenBubble, which execute only around a billion instructions, have working sets smaller than 65 MB. Even so, all working sets exceed the capacity of the last-level cache. In contrast to these large working sets, mobile input sets are usually small, and the applications do not execute sustained memory accesses. For example, SinaWeibo typically loads tens of small text messages at a time, which implies that only a small portion of each touched page is used. Given this, these applications are likely to suffer frequent TLB misses.

C. Locality

In order to gain a deeper understanding of the code and data locality in Moby applications, we analyze the reuse distances (the number of distinct references between two successive uses of a line) of all references to two different cache levels. Figure 1 shows the reuse distances of instruction and data requests for each Moby application. All requests are captured when they access the L1 instruction or data caches. Figure 2 shows the cumulative density function (CDF) of reuse distances for L2 cache requests; the requests studied there are those that miss in the 32 KB L1 instruction/data caches.

As shown in Figure 1, Moby applications present similar instruction and data locality behaviors. Typically, only about 30% of instruction references have reuse distances of less than four, which is the set associativity of the Pandaboard instruction cache. Memory references with larger reuse distances suffer misses under LRU replacement. Some instructions have a zero reuse distance because whenever a line is fetched into the instruction queue, subsequent instructions will also be found in the queue: no cache access is required. The figure shows that highly associative instruction caches (64 ways or more) could service over 80% of instructions.
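The reuse-distance measure used in Figures 1 and 2 can be computed with an LRU stack: the stack depth of a line at the moment it is re-referenced equals the number of distinct lines touched since its previous use. A minimal sketch of this computation (ours; the 32-byte line size is an assumed default, not taken from the paper):

```python
def reuse_distances(addresses, line_bytes=32):
    """LRU stack distances at cache-line granularity.

    The reuse distance of an access is the number of distinct lines touched
    since the previous access to the same line; first-time accesses get None.
    """
    stack = []   # cache lines, most recently used first
    dists = []
    for addr in addresses:
        line = addr // line_bytes
        if line in stack:
            dists.append(stack.index(line))
            stack.remove(line)
        else:
            dists.append(None)
        stack.insert(0, line)
    return dists
```

Comparing each distance against a cache's associativity approximates whether the access would hit under LRU replacement, which is how the 4-way and 16-way thresholds in Figures 1 and 2 are read.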

Fig. 3. Numbers of processes and invoked libraries

As for data requests, Figure 1 indicates that Moby applications generally present good data locality. For a four-way set associative data cache, nearly 70% of lines can be reused. Requests with a reuse distance of one constitute 40% of all accesses, which implies that data within a cache line enjoy high temporal locality.

Figure 2 shows the reuse distance distribution of L2 cache accesses. For SinaWeibo, TTPod, and NeteaseNews, about 40% of memory references have reuse distances smaller than 16 (the associativity of the Pandaboard L2 cache). At this associativity, reused memory locations for the remaining applications only reach 20%. Moreover, as reuse distance grows, these "reusability ratios" increase only gradually until they level off at 512. This implies that 16 is a good choice for L2 cache associativity on mobile platforms.

D. Instruction Execution Flow

The instructions executed by most mobile applications exhibit complex behavior. Mobile applications tend to depend on GUI-based display systems. Furthermore, for high portability and productivity, most Android applications are written in the object-oriented Java language. Thus, Moby applications may invoke many libraries and generate many instructions.

Figure 3 depicts the number of processes spawned and libraries invoked. Most applications create tens of processes/threads and access more than 15 libraries. Six of the Moby applications invoke more than 20 libraries, which increases code footprints and puts pressure on all instruction-related microarchitectural resources. Furthermore, multiple processes running in parallel inevitably cause interference in the caches, the TLB, and the predictors.

To better understand dynamic instruction behaviors, we collect instruction traces and map dynamic instructions back to their static binaries. Given the many background processes running within the Android OS, we record only instructions closely related to the target application. A memory map file for each process assists the translation (as described in Section III-B).

As an example, we present a part of the instruction execution flow of KingsoftOffice in Figure 4. The X-axis depicts the number of instructions executed, and the Y-axis shows the corresponding static binary files for these instructions. The colorful segments imply that those binaries are executed continuously without suffering interference from other binaries.

Fig. 4. Instruction flow distributions for KingsoftOffice

Figure 4 illustrates that the Android kernel and five other binaries (dalvik-jit-code-cache, libdvm.so, libcutils.so, libc.so, and libnativehelper.so) dominate the execution of KingsoftOffice. These binaries can be organized into three groups: Java-language related, C-language related, and system related. Moreover, the execution switches among the different binaries frequently. For instance, the executions of the libc library and the Android kernel are interleaved. In such situations, instruction locality and branch prediction accuracy may be affected, which results in poor performance for instruction-related components.

E. PCA Analysis

Diversity is an important metric for evaluating the representativeness of a benchmark suite. We use principal component analysis (PCA) to demonstrate the diversity of Moby applications by analyzing both their microarchitecture-independent and microarchitecture-dependent behaviors. PCA applies an orthogonal transformation to a group of possibly correlated variables to convert them into several uncorrelated variables (principal components) with different weights. A similar PCA analysis has been conducted on mobile applications and traditional SPEC benchmarks by Sunwoo et al. [13], whose results demonstrate that mobile applications differ greatly from SPEC benchmarks, especially in instruction-side behaviors.

Figure 5 depicts the PCA map of the above microarchitecture-independent metrics for Moby applications, showing only the two main principal components. The X-axis (i.e., Dim 1) shows the first principal component, which represents more than 65% of the variability. Dim 2, shown on the Y-axis, accounts for another 20% of the variability. Hence, these two principal components capture most of the differences among the Moby applications. The distance between points on the PCA map indicates the dissimilarity of applications: the closer two points are, the more similar the applications. As shown in the figure, the 10 Moby applications are evenly scattered across different regions. This means that the mobile applications we have chosen are diverse with respect to their inherent characteristics.
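The PCA map in Figure 5 can be produced with standard tooling once the per-application metrics are assembled into a feature matrix. A sketch using scikit-learn (ours; the random matrix is only a placeholder for the measured Table II and locality metrics):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rows are the 10 Moby applications, columns are characterization metrics
# (instruction mix, working-set size, locality statistics, ...).
features = np.random.default_rng(0).random((10, 8))   # placeholder values

pca = PCA(n_components=2)
coords = pca.fit_transform(StandardScaler().fit_transform(features))

print(pca.explained_variance_ratio_)   # share of variability in Dim 1 / Dim 2
# Plotting coords[:, 0] against coords[:, 1] gives a map like Figure 5;
# nearby points indicate applications with similar behavior.
```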


Fig. 5. PCA results for microarchitecture-independent metrics

Fig. 6. CPI results



Fig. 7. Contribution to overall cycles broken down by component

Fig. 8. Branch misprediction rates

V. MICROARCHITECTURE-DEPENDENT CHARACTERIZATION

In this section, we explain the microarchitecture-dependent results of Moby on the Pandaboard development board. These metrics are obtained from the hardware performance counters provided by ARM processors.

A. Overall Performance

1) CPI: Cycles per instruction (CPI) characterizes the overall performance of a target application on the measured platform. Applications with high CPI perform poorly, which means that the microarchitecture of the measured platform could be improved to better cope with these applications. Figure 6 depicts the CPI results for all Moby applications. Six out of ten applications have a CPI higher than 3, and the CPIs of the remaining applications are around 2. Note that the ideal CPI for the Cortex A9 processor, with its two-issue width, is 0.5, and hence these applications perform poorly. Mobile processors like the Cortex A9 could be better optimized for workloads like Moby. Moreover, we observe that the four applications with relatively low CPI (KingsoftOffice, AdobeReader, MXPlayer, and TTPod) process large amounts of data, unlike the applications with higher CPI. This implies that instruction-related components might hinder overall performance.

2) Stalled Cycles per Component: The processor's pipeline will stall if components fill and cannot allocate additional resources for incoming requests. There are many such components in ARM processors, including the cache, TLB, reorder buffer, load/store buffer, and reservation stations.

Figure 7 depicts the percentage of stalled cycles caused by cache and TLB resources. Since other components cause very few stall cycles, the remaining cycles can be considered the processor's active cycles. Two interesting observations can be made from Figure 7. First, nearly 2% and 5% of processor cycles are stalled waiting for the TLB and instruction cache, respectively, for almost all Moby applications, while the pipeline stall cycles incurred by the data cache vary from 3% to 20% across applications. Second, for applications such as K9Mail and TTPod, the instruction cache stalls the processor's pipeline more often than the data cache. Unlike desktop applications [36] and server applications [14], whose data cache dominates the pipeline stalls, the TLB and instruction cache of mobile processors are primarily responsible for the observed performance degradation. Therefore, more attention should be paid to optimizing mobile processor TLBs and instruction caches.
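The Section V metrics are simple ratios over raw event counts. A sketch of the arithmetic (ours; the event names are illustrative stand-ins for the Cortex A9 PMU events gathered with a tool such as TopMC or perf):

```python
def pipeline_summary(counters):
    """Derive the Section V ratios from raw PMU event counts.

    `counters` maps event names to counts for one run; the names used here
    are illustrative, not the exact Cortex A9 event identifiers.
    """
    cycles = counters["cycles"]
    return {
        "cpi": cycles / counters["instructions"],
        "icache_stall_pct": 100.0 * counters["icache_stall_cycles"] / cycles,
        "dcache_stall_pct": 100.0 * counters["dcache_stall_cycles"] / cycles,
        "tlb_stall_pct": 100.0 * counters["tlb_stall_cycles"] / cycles,
        "branch_mispredict_pct": 100.0 * counters["branches_mispredicted"]
                                        / counters["branches_retired"],
    }
```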


Fig. 9. Cache and TLB miss rates

Fig. 10. L2 cache miss rates

B. Branch Misprediction Rate

The branch predictor plays an important role in ensuring efficient out-of-order execution and exploiting instruction-level parallelism. As shown in Figure 8, the branch misprediction rates of NeteaseNews and BaiduMap reach up to 12%. This happens because nearly 70% of branches are conditional, as shown in Table II, and the execution of these applications switches frequently among different binaries, as illustrated in Figure 4. Each time execution switches between binaries, branch mispredictions are likely to occur. Note that unpredictable user behaviors in interactive mobile applications can further exacerbate the branch misprediction rate.

C. Cache, TLB, and Memory

1) L1 I/D Cache and I/D TLB: As illustrated by Gutierrez et al. [14], Jiang et al. [36], and Ferdman et al. [37], the miss rates of the instruction cache and instruction TLB are high due to the large code size of interactive applications and the limited cache sizes of mobile processors. Figure 9 shows the same observation.

According to the L1 cache reuse distances shown in Figure 1, less than 35% of instruction references have reuse distances smaller than four (the associativity of the Pandaboard instruction and data caches). Given that data references with similarly small reuse distances reach nearly 70%, it is obvious that the data cache outperforms the instruction cache. Furthermore, since most mobile applications do not manipulate large amounts of data, their data references are relatively few compared to their instruction references.

The DTLB suffers a higher miss rate than the data cache for several applications (e.g., BBench). This suggests that the number of DTLB entries is insufficient to hold randomly distributed data references.

2) L2 Cache: Figure 10 depicts the miss rates of the different kinds of requests to the L2 cache. More than 10% of instruction requests miss in the L2 cache for all Moby applications, and several applications suffer more than 25% instruction misses. This result again demonstrates the large code footprints of mobile applications. Although mobile applications present good locality in the L1 data cache, nearly 50% of the data requests miss in the L2 cache.

It is interesting to observe that data references no longer dominate L2 requests, as shown in Figure 11. Except for KingsoftOffice, AdobeReader, and FrozenBubble, the L2 cache receives more instruction requests than data requests. However, from the point of view of memory, instruction requests account for a small fraction of all memory requests, since there will be many DMA requests issued by other I/O devices such as the GPU and the LCD display system.

Fig. 11. Ratio of data requests in the L2 cache. The rest of the requests are instructions.
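The cache results above are reported under two normalizations: misses per thousand instructions (the MPKI features later fed into the PCA of Section V-E) and miss ratios over cache accesses. A small sketch of the two normalizations (ours; counter names are illustrative):

```python
def cache_metrics(misses, accesses, instructions):
    """Two normalizations of the same miss count.

    `accesses` and `misses` would come from a cache's access/refill events,
    `instructions` from the retired-instruction counter.
    """
    return {
        "miss_ratio_pct": 100.0 * misses / accesses,   # miss rate over accesses
        "mpki": 1000.0 * misses / instructions,        # misses per 1K instructions
    }
```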

D. Core Utilization

Mobile processors are still improving, in terms of both frequency and number of cores. In order to study core utilization, we count the cycles executed by the different cores. In Figure 12, we depict the ratio of cycles executed by Core 0 to the total cycles executed by both cores on the Pandaboard. Except for MXPlayer, Moby applications do most of their work on Core 0. This suggests that most mobile applications are programmed without considering the existence of a multicore platform, and thus they cannot fully utilize precious processor resources. Under this condition, simply integrating more cores into mobile processors just consumes more power without improving performance.

Fig. 12. Ratio of cycles executed by Core 0. Core 1 accounts for the rest.

E. PCA Analysis

Figure 13 depicts the top two principal components of a PCA analysis based on the above microarchitecture-dependent characteristics. Dim 1 is the primary principal component, and its main contributor is the L2 miss rate. Dim 2 is the second principal component, and it includes the L1D MPKI, the DTLB MPKI, and the branch misprediction rate. Although the data points of applications such as SinaWeibo and FrozenBubble lie somewhat closer together along the Y-axis, these applications are widely spread across the primary principal component. Given the PCA results for both microarchitecture-independent and microarchitecture-dependent behaviors, we can conclude that the behaviors of Moby applications vary significantly and in many ways.

Fig. 13. PCA results for microarchitecture-dependent characteristics on the Pandaboard

VI. RELATED WORK

There are many kinds of benchmarks for evaluating the performance of mobile devices. In industry, commonly used benchmarks such as EEMBC [38], SiSoft Sandra [39], AnTuTu [8], 3D GLBenchmark [9], and Geekbench [10] measure the peak performance of mobile device components, including the CPU, memory, GPU, and multimedia support. On one hand, some of these benchmarks are not freely available to academia. On the other hand, the peak performance of each component cannot represent the total performance of the system. Other benchmarks such as SunSpider [11] and BrowserMark [40] only test the performance of specific applications or classes of applications (e.g., embedded Java benchmarks [43] or the MEVBench computer vision applications [42]).

In the research community, MiBench [41] has been widely used for embedded systems. Although it contains 35 embedded applications covering six categories, the applications differ greatly from current mobile applications in terms of diversity, coding language, code size, and functionality. Hence, MiBench has become less suitable for current microarchitectural analysis. Gutierrez et al. [14] present an interactive game, a video player, a media player, and BBench as typical benchmarks for smartphones. MobileBench [12] contains several web browsing applications, a photo rendering application, and a video player. Sunwoo et al. [13] study several smartphone workloads (AndEBench, CaffeineMark, RL Benchmark, Angry Birds, and KingsoftOffice) to measure the performance of the Dalvik virtual machine, SQLite, and the whole system. Compared to these benchmarks, Moby contains some similar applications, and some with diverse behaviors not yet found in other suites.

VII. CONCLUSION

Mobile devices have already become the primary consumer computing devices, and their use still exhibits rapid growth. Efficient mobile processor design requires knowledge of typical mobile applications. In this paper, we have presented a mobile benchmark suite, Moby, that includes popular applications executed under the Android OS. Our analysis finds them to be sufficiently diverse to be considered representative. In this study, we fully characterize Moby in order to assist other researchers in using it for their studies. We use the gem5 simulator and the hardware performance counters provided by ARM processors to evaluate Moby's microarchitecture-independent features (instruction mix, working set size, data and instruction locality, and binary execution behavior) and microarchitecture-dependent features (CPI and the behaviors of the branch predictor, caches, TLBs, and other memory components). We will continue to add more mobile applications to the Moby benchmark suite as more mainstream or popular applications emerge. Furthermore, we will integrate user-action automation tools to model the effects of user inputs on applications.

ACKNOWLEDGMENT

We would like to thank Sally McKee for her useful suggestions and hard work improving the writing quality. We also thank Yungang Bao, Kun Zhang, and other teammates from ICT, and the anonymous reviewers for helpful suggestions and insightful feedback. This research is supported by the National Basic Research Program of China (973 Program) under grant number 2011CB302502, the National Natural Science Foundation of China (NSFC) under grant numbers 60925009, 61272132, and 61221062, the Strategic Priority Research Program of the Chinese Academy of Sciences under grant number XDA06010401, and the Huawei Research Program under grant number YBCB2011030.
REFERENCES

[1] "Android operating system for mobile devices," http://www.android.com.
[2] "iOS operating system for Apple," http://www.apple.com/ios.
[3] "ARM architecture reference manual: ARM v7-A and ARM v7-R edition."
[4] "Apple system on chips," http://en.wikipedia.org/wiki/Apple\ System\ on\ Chips.
[5] "OMAP applications processors," http://www.ti.com/lsds/ti/omap-applications-processors/features.page.
[6] "Qualcomm Snapdragon processors," http://www.qualcomm.com/snapdragon.
[7] "Intel Atom processor," http://www.intel.com/content/www/us/en/processors/atom/atom-processor.html.
[8] "AnTuTu," http://www.antutu.com/index.shtml.
[9] "Gfxbench," https://gfxbench.com/result.jsp.
[10] "Geekbench," http://www.primatelabs.com/geekbench.
[11] "SunSpider," http://www.webkit.org/perf/sunspider/sunspider.html.
[12] D. Pandiyan, S.-Y. Lee, and C.-J. Wu, "Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench," in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[13] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Blake, C. Emmons, and N. Paver, "A structured approach to the simulation, analysis and characterization of smartphone applications," in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2013.
[14] A. Gutierrez, R. G. Dreslinski, T. F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver, "Full-system analysis and characterization of interactive smartphone applications," in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2011, pp. 81–90.
[15] "Google Play Store," https://play.google.com/store.
[16] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[17] "Moby: A Mobile Benchmark Suite," http://asg.ict.ac.cn/projects/moby, 2013.
[18] OMAP4460 Pandaboard ES System Reference Manual, pandaboard.org, 2011.
[19] "GNU Xnee webpage," http://www.gnu.org/software/xnee.
[20] K. Zhang, Y. Huang, and M. Chen, "Architecture characteristics and analysis of mobile device applications," in National Annual Conference on High Performance Computing, China (in Chinese), 2013, pp. 81–90.
[21] "K9Mail," https://github.com/k9mail/k-9, http://t.cn/zTlAnPO.
[22] "SinaWeibo," http://www.weibo.com, http://t.cn/zTYHOxK.
[23] "NeteaseNews," http://www.163.com, http://t.cn/zTYmGMj.
[24] "KingsoftOffice," http://www.kingsoftstore.com, http://t.cn/zTYsBQC.
[25] "AdobeReader," http://www.adobe.com/products/eulas, http://t.cn/zTTPgDj.
[26] "BaiduMap," http://map.baidu.com, http://t.cn/zTT7y0Y.
[27] "MXPlayer," https://sites.google.com/site/mxvpen, http://t.cn/zTTAq7Q.
[28] "TTPod," http://www.ttpod.com, http://t.cn/zTT2cNg.
[29] "FrozenBubble," http://t.cn/zTTLjD8.
[30] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 72–81.
[31] "Versatile Express products," http://www.arm.com/products/tools/development-boards/versatile-express/index.php.
[32] "ARM Cortex A9," http://www.arm.com/products/processors/cortex-a/cortex-a9.php.
[33] "TopMC," http://asg.ict.ac.cn/projects/topmc, 2011.
[34] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[35] S. Bird, A. Phansalkar, L. K. John, A. Mericas, and R. Indukuru, "Performance characterization of SPEC CPU benchmarks on Intel's Core microarchitecture based processor," in SPEC Benchmark Workshop, 2007.
[36] T. Jiang, R. Hou, L. Zhang, K. Zhang, L. Chen, M. Chen, and N. Sun, "Micro-architectural characterization of desktop cloud workloads," in IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2012, pp. 131–140.
[37] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Clearing the clouds: a study of emerging scale-out workloads on modern hardware," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2012, pp. 37–48.
[38] "EDN embedded microprocessor benchmark consortium," http://www.eembc.org.
[39] "SiSoft Sandra," http://www.sisoftware.net.
[40] "Browsermark," http://browsermark.rightware.com.
[41] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in IEEE International Workshop on Workload Characterization. IEEE, 2001, pp. 3–14.
[42] J. Clemons, H. Zhu, S. Savarese, and T. Austin, "MEVBench: A mobile computer vision benchmarking suite," in IEEE International Symposium on Workload Characterization. IEEE, 2011, pp. 91–102.
[43] C. Isen, L. John, J. P. Choi, and H. J. Song, "On the representativeness of embedded Java benchmarks," in IEEE International Symposium on Workload Characterization. IEEE, 2008, pp. 153–162.