Improving the Performance of Smartphone Apps with Soft Hang Bug Detection and Dynamic Resource Management

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Marco Brocanelli, M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2018

Dissertation Committee:

Dr. Xiaorui Wang, Advisor
Dr. Feng Qin
Dr. Christopher C. Stewart

© Copyright by

Marco Brocanelli

2018

Abstract

Two critical quality factors for mobile devices (e.g., smartphones, tablets) are battery

life and apps’ user perceived performance. For example, apps that require frequent user

actions with the user interface should have high responsiveness, which indicates how fast an

app reacts to user actions. On the other hand, apps used mostly for video/music play should

have a high throughput, which allows for example a video to be played smoothly without

perceivable frame rate loss. Two main causes of poor performance for these apps are soft

hang bugs and resource contention. A soft hang bug causes the app to have soft hangs, i.e.,

the app’s response time of handling a certain user action is longer than a user-perceivable

delay. A soft hang bug is a blocking operations that executes on the app’s main thread

and can be fixed by moving the execution of this operation to a background worker thread.

Resource contention can cause concurrent foreground apps to miss their performance target.

Indeed, during recent years, the improvements in mobile operating system performance and

the increasing display size have enabled these resource-constrained devices to

execute multiple apps at the same time, e.g., watch a video while chatting with a friend.

As a result, the resource contention among the apps sharing the screen can either cause

performance degradation for at least one of the concurrent apps or cause an unnecessarily

high energy consumption.

In this dissertation, we first introduce Hang Doctor, a runtime soft hang detection and

diagnosis methodology that runs in the wild on user devices. Hang Doctor helps developers

track the responsiveness performance of their apps and provides diagnosis information

for them to fix soft hangs. Hang Doctor exploits performance event counters to ensure

high detection quality and low overhead. In particular, we propose a soft hang filter that

examines the performance event counters during the app execution to automatically prune

false positives. We have implemented Hang Doctor and tested it with the latest releases of

114 real-world apps. Hang Doctor has found 34 new soft hang bugs previously unknown to

their developers. So far, 62% of the bugs have already been confirmed by the developers

and 68% are missed by offline detection algorithms.

Second, in order to ensure good user-perceived performance of concurrent apps and low

energy consumption, we propose SURF, Supervisory control of User-perceived peRFor-

mance. Specifically, SURF first dynamically balances the performance of the concurrent

apps by regulating the resource allocation among them according to their actual

performance needs. Then, when the concurrent foreground apps have balanced performance,

SURF manipulates CPU DVFS (dynamic voltage and frequency scaling) to ensure that the

user-perceived performance of all the apps stays close to their desired values while mini-

mizing the energy consumption. A key advantage of SURF is that it is designed rigorously

based on supervisory control theory, which provides analytical stability and performance

guarantees compared to heuristic solutions. We test SURF on several mobile device models with real-world open-source apps and show that it can reduce the CPU energy consumption

by 30-90% compared to state-of-the-art solutions while causing no perceivable performance

degradation.

Dedicated to all the people that guided me to this life achievement

Acknowledgments

The first person I want to thank is my advisor, Dr. Xiaorui Wang. Since the first day of my Ph.D., he pushed me to improve myself and do things that I didn’t know I could do.

He taught me how to properly conduct scientific research, develop projects, write papers, and present in front of large audiences. I really appreciate his great contribution to my personal and professional development, which will positively impact my future.

I would like to thank Dr. Feng Qin, Dr. Christopher Stewart, Dr. Fusun Ozguner, and

Dr. Jian Tan for being part of my candidacy and final exam committees. They all gave me insightful feedback on my projects, and I really hope to collaborate with them in the future.

I also want to thank my present and former lab mates Kuangyu Zheng, Li Li, Bruce

Beitman, Yunhao Bai, Wenli Zheng, Zichen Xu, Kai Ma, and Xiaodong Wang for all the invaluable time spent discussing research ideas, which has highly contributed to the successful publication of many papers.

Special thanks go to Dr. Andrea Serrani, who helped me come to The Ohio State

University at the end of my master’s studies. My life would be very different without his help, and I would probably not even have started a Ph.D. if it were not for him.

Last but not least, I would like to thank my family. I want to thank my mom Lorella, my sister Linda, and all the rest of my Italian family for encouraging and supporting me to pursue a career outside my home-country, Italy. In addition, I would like to thank my wife

Paula and my Colombian family for all the moral support and kindness given to me.

Vita

2012-Present ...... Ph.D., Electrical and Computer Engineering, The Ohio State University, USA.
2010 ...... Visiting Scholar, Control Engineer, The Ohio State University, USA.
2008 ...... M.S., Control Systems, University of Rome Tor Vergata, Italy.
2005 ...... B.S., Control Systems, University of Rome Tor Vergata, Italy.

Publications

Research Publications

Marco Brocanelli, Xiaorui Wang. “SURF: Supervisory Control of User-Perceived Per- formance for Mobile Device Energy Savings”. International Conference on Distributed Computing Systems (ICDCS), July 2018.

Marco Brocanelli, Xiaorui Wang. “Hang Doctor: Runtime Detection and Diagnosis of Soft Hangs for Smartphone Apps”. European Conference on Computer Systems (EuroSys), April 2018.

Marco Brocanelli, Xiaorui Wang. “Smartphone Radio Interface Management for Longer Battery Lifetime”. IEEE International Conference on Autonomic Computing (ICAC), July 2017.

Marco Brocanelli, Sen Li, Xiaorui Wang, Wei Zhang. “Maximizing the revenues of data centers in regulation market by coordinating with electric vehicles”. Sustainable Computing: Informatics and Systems, 6: 26-38. June 2015.

Marco Brocanelli, Wenli Zheng, Xiaorui Wang. “Reducing the expenses of geo-distributed data centers with portable containerized modules”. IFIP WG 7.3 Performance 2014, September 2014.

Sen Li, Marco Brocanelli, Wei Zhang, Xiaorui Wang. “Integrated Power Management of Data Centers and Electric Vehicles for Energy and Regulation Market Participation”. IEEE Transactions on Smart Grid, 5(5): 2283-2294. June 2014.

Marco Brocanelli, Sen Li, Xiaorui Wang, Wei Zhang. “Joint management of data centers and electric vehicles for maximized regulation profits”. International Green Computing Conference (IGCC), June 2013.

Sen Li, Marco Brocanelli, Wei Zhang, Xiaorui Wang. “Data center power control for frequency regulation”. Power and Energy Society General Meeting (PES), July 2013.

Marco Brocanelli, Yakup Gunbatar, Andrea Serrani, Michael Bolender, “Robust Control for Unstart Recovery in Hypersonic Vehicles”. AIAA Guidance, Navigation, and Control Conference, August 2012.

Fields of Study

Major Field: Electrical and Computer Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... x

List of Figures ...... xi

1. Introduction ...... 1

1.1 Soft Hang Bugs ...... 2
1.2 Resource Contention ...... 3
1.3 Major Contributions ...... 4

2. Hang Doctor: Runtime Detection and Diagnosis of Soft Hangs for Smartphone Apps...... 6

2.1 Background and Motivation ...... 10
2.1.1 Background ...... 10
2.1.2 Motivation ...... 13
2.2 Design of Hang Doctor ...... 15
2.2.1 Goals and Challenges ...... 15
2.2.2 Design Overview ...... 16
2.2.3 First Phase: S-Checker ...... 20
2.2.4 Second Phase: Diagnoser ...... 31
2.2.5 Hang Doctor Implementation ...... 33
2.3 Evaluation ...... 34

2.3.1 Baselines and Performance Metrics ...... 34
2.3.2 Result Summary and Developers’ Response ...... 36
2.3.3 Example Runtime Hang Bug Detection ...... 40
2.3.4 Detection Performance Comparison ...... 43
2.3.5 Overhead Analysis ...... 46
2.3.6 Alternative Approaches and Limitations ...... 47
2.4 Related Work ...... 49

3. SURF: Supervisory Control of User-Perceived Performance for Mobile Device Energy Savings ...... 51

3.1 Related Work ...... 54
3.2 Background and Motivation ...... 55
3.2.1 Background ...... 55
3.2.2 Motivation ...... 56
3.3 Design of SURF ...... 61
3.3.1 Design Overview ...... 61
3.3.2 Inner Loop: Performance Balancer ...... 64
3.3.3 Outer Loop: Performance Controller ...... 70
3.3.4 Discussion ...... 72
3.4 Experimental Results ...... 73
3.4.1 Experimental Setup ...... 73
3.4.2 SURF: Overall Summary of Results ...... 75
3.4.3 Inner Loop: Performance Balancer ...... 77
3.4.4 Outer Loop: Performance Controller ...... 78
3.4.5 Integrated Solution: SURF ...... 80

4. Conclusions ...... 85

Bibliography ...... 87

List of Tables

Table Page

2.1 Apps with well-known soft hang bugs tested in the motivation study. The commit number refers to the app version that has the bug...... 13

2.2 The timeout value influences the performance of Timeout-based runtime detection algorithms. The numbers report the average numbers of true positives and false positives detected for the apps in Table 2.1...... 14

2.3 Correlation analysis results used for the design of S-Checker. Top-10 most correlated performance events for soft hang diagnosis. (a) Monitoring main thread and render thread increases the correlation by about 14% on average compared to (b) monitoring only the main thread...... 23

2.4 Sensitivity analysis for the correlation analysis of Table 2.3(a). The correlation analyses with (a) 75% and (b) 50% of the data points used in Table 2.3(a) have similar results, thus the results do not depend on the training set. 25

2.6 S-Checker uses three performance events, i.e., context-switches, task-clock, and page-faults, to find soft hang bugs. The 23 New Bugs are those from Table 2.5 that were previously unknown to be soft hang bugs, i.e., missed offline. All the new soft hang bugs are correctly recognized by at least one of the three event counters...... 44

3.1 Apps tested. The commit number indicates the latest version available at the time of the experiments...... 73

List of Figures

Figure Page

2.1 The Main Thread of the app A Better Camera [30] executes UI-related APIs (e.g., setText, inflate, init, enable) and camera APIs (e.g., setParameters, open). Moving blocking APIs (e.g., open) to a worker thread makes the app more responsive to user actions...... 11

2.2 (a) High-level architecture of Hang Doctor. It is designed as a two-phase algorithm that is activated for every user action. The detected soft hang bugs are communicated to the developer through the Hang Bug Report. (b) Example entries of the app AndStatus in Hang Bug Report...... 17

2.3 The first-phase S-Checker and the second-phase Diagnoser transition the state of each individual action based on their analysis result to improve the detection performance and lower the overhead. S-Checker monitors the performance event counters and the response time of actions in the Uncategorized state to filter out soft hangs caused by UI operations. Diagnoser collects stack traces during the soft hangs caused by actions in the Suspicious and Hang Bug states to determine the root cause blocking operation...... 18

2.4 Analysis of three top-correlated performance events in our training set. Using these three performance events makes it possible to distinguish soft hangs caused by soft hang bugs (HB) from those caused by UI operations (i.e., UI-API). Most of the soft hang bugs have a high performance event difference while most of the UI-APIs have a low performance event difference. This is because soft hang bugs, different from UI-APIs, cause more work for the main thread and less work for the render thread...... 27

2.5 Context-switch traces of main thread and render thread for two actions with a soft hang caused by the (a) soft hang bug 2 and (b) UI-API 2 in Figure 2.4(a). Using only a few samples collected at the beginning of the action execution may lead to false positives (e.g., from time 0s to 0.6s in (b)). . . . 31

2.6 (a) Execution trace of a user action with K9-mail. One of the input events related to the action has a soft hang (shadowed area). S-Checker, at the end of the action execution (i.e., at time 3.1s), finds a positive context-switch difference, i.e., there may be a soft hang bug. (b) At the next execution of the same action, Diagnoser collects the Stack Traces (ST) during the soft hang to find the root cause operation: clean API, code line 25 of HtmlSanitizer.java. 41

2.7 S-Checker and Diagnoser use action states (U for Uncategorized, S for Suspicious, H for Hang bug, N for Normal) to minimize the overhead of collecting stack traces for soft hangs caused by UI-APIs...... 42

2.8 Detection performance normalized to the Timeout-based (TI) baseline, which does not have false negatives. Hang Doctor (HD), different from the baselines, (a) traces most of the real soft hang bugs every time they manifest (the few false negatives are only due to the initial filtering activity of S-Checker) while (b) pruning most of the false positives...... 45

2.9 Hang Doctor achieves low overheads, while having high detection performance at the same time...... 46

3.1 Performance balancing for concurrent apps can be obtained by manipulating the app priorities. User-perceived performance (left) and relative performance (right) of Rocket.Chat and ExoPlayer when the apps run (a) on the same core and (b) on different cores...... 58

3.2 Execution time distribution of Surface Flinger across the two CPU cores running Rocket.Chat (core 0) and ExoPlayer (core 1). Increasing ExoPlayer’s priority leads the load balancer to execute Surface Flinger more of the time on the core running Rocket.Chat, thus impacting its performance...... 60

3.3 High-level design of SURF for a generic event...... 62

3.4 Relative error for various priority ratio values of two events. It can be described with a linear model...... 66

3.5 Results of SURF for various combinations of throughput apps (ExoPlayer, Mapbox) and interactive apps. The default Android has an average of 54ms response time and 58fps frame rate because it mostly uses high frequencies. SURF reduces the core frequencies to (a) save CPU energy while (b) causing no perceivable performance degradation...... 76

3.6 Comparison of the performance balancer of SURF (i.e., SURF-B) with the baselines. (a) The performance balancer converges faster into the desired region (the grey band) and is more stable. (b) SURF-B achieves better performance balancing for three sizes of desired performance region. . . . . 77

3.7 The performance controller of SURF (i.e., SURF-C) based on supervisory control theory ensures higher tracking accuracy and lower overhead compared to the Periodic baseline based on time-driven control theory. . . . 79

3.8 Applying single-app solutions to concurrent executions (a) may lead to an increased CPU frequency and energy consumption, caused by (b) app performance imbalance...... 80

3.9 SURF (a) balances the app performance, then (b) tracks the desired relative performance of the apps (gray band) to (c) ensure good user-perceived performance while reducing the CPU energy consumption...... 81

3.10 Compared to eQoS and Periodic, SURF has (a, b) high tracking accuracy for both apps by (c) lowering the core frequencies, thus (d) increasing the CPU energy savings...... 83

Chapter 1: Introduction

The recent widespread use of smartphones has led to an exponential growth of mobile apps available in the market. For example, Google Play Store counted about 1.8 million apps in 2015 [74] that can be used to watch videos, listen to music, chat with friends, and navigate the web. Typically, mobile users expect to see good perceivable performance while interacting with their devices. The user-perceived performance of these apps can be judged depending on how the user interacts with them. For example, there are Interactive-oriented apps, which require frequent user actions (e.g., touch buttons, scroll a timeline or a web page), and Throughput-oriented apps, which do not require frequent user actions (e.g., watch a video or listen to music). The most important requirement for an Interactive-oriented app is its responsiveness to user actions. A responsive app reacts to user actions quickly without letting the user perceive much delay, i.e., the response time of each user action is lower than 100-200ms (minimum human-perceivable delay [94]). A poorly responsive app that often hangs/freezes after user actions is perceived as sluggish by users. As a result, the app may receive a low rating in the market and thus it may have a lower probability of success.

For Throughput-oriented apps like video play, it is important to maintain high reproduction quality by ensuring frame rates above the minimum human-perceivable motion delay of

30fps [94]. At the same time, these perceivable requirements need to be achieved using the lowest possible amount of energy to avoid quickly depleting the device’s battery.

Two main problems that cause poor performance for mobile apps are soft hang bugs and

resource contention. The work presented in this dissertation describes two main systematic

methodologies to efficiently handle both problems.

1.1 Soft Hang Bugs

Soft hang bugs are programming issues that may cause the app to have soft hangs,

i.e., the app becomes unresponsive for a limited but perceivable period of time. A soft

hang bug is some blocking operation on the app’s main thread that can be executed on a

separate worker thread, such that the main thread can become more responsive [76]. Typical

examples of well-known soft hang bugs are file read and write, network operations, and

database operations [48]. Despite the presence of online guidelines that explain how to

avoid soft hang bugs [24], inexperienced developers can easily make mistakes and have

soft hang bugs in the released version of their apps. Therefore, it is important to help those

smartphone developers detect and diagnose soft hangs in their apps.

The primary approach used to detect soft hang bugs is offline detection [48, 76, 88], which analyzes the source code of the app to find well-known soft hang bugs. Compared to

runtime approaches that detect bugs directly on the user devices, offline detection has the

major advantage of detecting soft hang bugs before the market release of the app. Unfortunately,

these offline approaches can fail to identify blocking operations on the main thread of the

app that are previously unknown or hidden in libraries. As a result, even after using those

offline detection tools, soft hang bugs can still manifest at runtime.

The main runtime approach used to detect soft hang bugs is called Timeout-based, which detects any soft hangs longer than a certain timeout. The main advantage of runtime

detection is that it does not depend on the particular source code of the app or on a database

of known blocking operations. Therefore, it can potentially solve the above described

problem of offline detection. However, an important challenge for runtime soft hang bug

detection is to diagnose if a soft hang is indeed caused by soft hang bugs, instead of lengthy

User Interface (UI) operations that must execute on the main thread. If a UI operation

is mistakenly diagnosed as a soft hang bug, we say it is a false positive. Similar to the

Timeout-based approach, Android OS incorporates an Application Not Responding (ANR)

tool [32] to detect soft hangs that last longer than 5 seconds, which is much longer than the

100ms perceivable delay [24]. Thus, it can miss many soft hangs. However, simply reducing

the timeout to 100ms, as proposed in [45], would lead to a large number of false positives.

As a result, it is desired to have a more efficient runtime soft hang bug detection tool that

can detect soft hang bugs and report them to the developers, who can then update their apps

to improve the app responsiveness.

1.2 Resource Contention

The quality of experience of mobile users is mainly influenced by the user-perceived

performance of apps (e.g., response time of touches lower than 100ms, frame rate of videos

higher than 30fps [94]) and the battery life of the mobile device. Unfortunately, these two

objectives often conflict with each other. Therefore, it is of primary importance to properly

allocate the available computing resources for the best tradeoff between performance and

energy consumption.

Several recent studies have attempted to improve the mobile devices’ user experience.

Unfortunately, they have at least one of the following limitations. First, most of them

[8, 21, 27, 40, 79, 94] focus mainly on a single foreground app. Hence, they may not

efficiently handle the concurrent execution of multiple foreground apps. For example, now

a smartphone user can play a video while navigating posts on Facebook or chatting with a friend, by exploiting the split screen [96], Picture in Picture (PiP) [16], or freeform windows

[17]. Applying single-app solutions to concurrent executions may result in a performance imbalance among the apps that can cause either an increase of CPU frequency and energy consumption, or poor performance for at least one of the executing apps. Second, many solutions focus on regulating the app performance on a periodic basis [40, 79]. However, such solutions conflict with the aperiodic nature of user actions (e.g., button click) on mobile devices. It is indeed possible for those solutions to have a regulating period long enough (e.g., several seconds) to sample the execution of multiple user actions, but this may lead to poor responsiveness.

1.3 Major Contributions

Corresponding to the two main problems described above, i.e., soft hang bugs and resource contention, in this dissertation we describe our two solutions, Hang Doctor and

SURF, respectively.

In Chapter 2, we present Hang Doctor, a runtime methodology that supplements the existing offline algorithms by detecting and diagnosing soft hangs caused by previously unknown blocking operations. Hang Doctor features a two-phase algorithm that first checks response time and performance event counters for detecting possible soft hang bugs with small overheads, and then performs stack trace analysis when diagnosis is necessary. A novel soft hang filter based on correlation analysis is designed to minimize false positives and negatives for high detection performance and low overhead. We have implemented a prototype of Hang Doctor and tested it with the latest releases of 114 real-world apps. Hang

Doctor has identified 34 new soft hang bugs that are previously unknown to their developers,

among which 62%, so far, have been confirmed by the developers, and 68% are missed by

offline algorithms.

In Chapter 3, we present SURF, Supervisory control of User-perceived peRFormance, which is designed to overcome the two limitations of current studies in mobile device

user experience. First, it dynamically allocates resources to concurrent apps for balanced

performance. Second, SURF uses supervisory control theory to handle the aperiodicity of

user actions. SURF features a two-level architecture design that performs the two tasks at

different time scales, according to their different overheads and timing requirements. We

test SURF on several mobile device models with real-world open-source apps and show

that it can reduce the CPU energy consumption by 30-90% compared to state-of-the-art

solutions while causing no perceivable performance degradation.

Chapter 2: Hang Doctor: Runtime Detection and Diagnosis of Soft Hangs for Smartphone Apps

There can be a variety of reasons for software to have responsiveness problems, and programming issues are among the major ones. Correctness bugs such as deadlocks or infinite loops [43, 44, 93] may cause an app to become unresponsive for an unlimited period of time or until the app is killed. Soft hang bugs, which are our focus in this chapter, are instead programming issues that may cause the app to have soft hangs, i.e., the app becomes unresponsive for a limited but perceivable period of time. A soft hang bug is some blocking operation1 on the app’s main thread that can be executed on a separate worker thread, such that the main thread can become more responsive [76]. For example, a soft hang may occur when the main thread is blocked by some lengthy I/O APIs (e.g., file read and write).

Different from server/desktop software, the development of mobile apps is more accessible even to inexperienced developers who can easily have soft hang bugs in the released version of their app. Therefore, it is important to help those smartphone developers detect and diagnose soft hangs in their apps.

Existing studies [48, 76, 88] propose offline detection algorithms that try to find soft hang bugs by searching for calls to well-known blocking APIs on the app’s main thread.

1We adopt the terminology used in [76] and consider any operation blocking if there exists a worst-case scenario that prevents the calling thread from making progress until timeout (e.g., 100ms perceivable delay [24]).

Unfortunately, offline algorithms can fail for three main reasons. First, the exponential

growth of new APIs [82] makes it almost impossible to have full knowledge of their

processing time, thus new blocking APIs (i.e., potential soft hang bugs) may be unknown to

offline detection algorithms and developers (e.g., K9-mail bug #1007 in Table 2.5). Second,

some segments of the app code, e.g., closed-source third-party libraries, may have a soft

hang bug but may not be directly accessible. Thus, offline solutions may not be able to

analyze the source code of those libraries and may miss soft hang bugs. For example, one out

of the three SageMath bugs (#84 in Table 2.5) is caused by a well-known blocking database

API hidden within a third-party library. This API can be detected only if the offline algorithm

has a chance to examine the library code. Third, a self-developed lengthy operation (e.g., a

heavy loop) on the main thread cannot be detected by offline algorithms that try to search for

the names of well-known blocking APIs. Some studies [15, 55] optimize loops to improve

app performance, but they do not focus on soft hang bugs. As a result, an app may still have

bugs that can cause soft hangs at runtime, even after offline detection tools have already

been applied.

Given the limitations of offline detection, it is desirable to have a runtime hang detection

algorithm that catches a soft hang on the fly and finds which blocking operation is causing

it, so that the developer can get sufficient diagnosis information to fix the problem. An

important challenge for runtime soft hang detection is to diagnose if a soft hang is indeed

caused by soft hang bugs, instead of lengthy User Interface (UI) operations that must execute

on the main thread. If a UI operation is mistakenly diagnosed as a soft hang bug, we say it

is a false positive. Some proposed runtime algorithms for server/desktop software [59,93]

monitor the resource utilization of the software (e.g., CPU time or memory access) and detect

potential hangs when static resource utilization thresholds are violated. Unfortunately, those

algorithms are mainly designed for correctness bugs rather than soft hang bugs. Correctness

bugs, different from soft hang bugs, cause the app to become unresponsive for an unlimited

period of time and thus can be detected by monitoring the coarse-grained resource utilization

of apps. However, soft hang bugs can last as little as 100ms and need a more fine-grained

monitoring of the app execution. As a result, as we show in this chapter, resource utilizations

used for soft hang bug detection may cause large numbers of false positives and negatives.

Some recent studies [60] propose in-lab test case generation to detect a sequence of actions whose execution cost gradually increases with time, but they are not designed to work in the wild to detect soft hang bugs. Some practical tools have been developed for smartphones in

the wild. For example, Android OS incorporates an Application Not Responding (ANR)

tool [32] to detect hangs that last longer than 5 seconds, which is much longer than the

100ms perceivable delay [24]. Thus, it can miss many soft hangs. However, as shown in

Section 2.1.2, simply reducing the timeout to 100ms, as proposed in [45], would lead to a

large number of false positives.

In this chapter, we propose Hang Doctor, a runtime soft hang detection and diagnosis

methodology that runs in the wild on user devices. Hang Doctor helps developers track the

responsiveness performance of their apps and provides diagnosis information for them to

fix soft hangs. Hang Doctor is not meant to replace offline detection, which is the primary

approach because it can detect known soft hang bugs before the app is released in the wild. Instead, it can supplement offline detection by identifying new blocking APIs that are

previously unknown.

Hang Doctor features a two-phase algorithm to achieve high detection performance with small overheads. The first phase is a lightweight soft hang bug symptom checker

(S-Checker) that is invoked upon the execution of each user action to label only those actions

that have the symptoms of a soft hang bug. We define the symptoms of a soft hang bug by

profiling performance event counters during soft hangs. Then, we use correlation analysis, which identifies the performance events that are more suitable for soft hang bug detection.

Based on this analysis, we design a soft hang filter that reads the selected performance events

and compares them with their thresholds to find soft hang bugs and minimize the numbers

of false positives and negatives. Compared to resource utilizations [59, 80, 93], monitoring

and accessing performance event counters is more lightweight and provides a wider variety

of low-level hardware metrics. Compared to just monitoring the response time [45], using

both response time and performance events makes it possible to minimize the number of false positives,

thus improving the detection performance. The second phase is a Diagnoser that is invoked

only for those labeled actions that have the soft hang bug symptoms. Diagnoser monitors the

response time of an executing action. If its response time exceeds 100ms again, Diagnoser

collects stack traces until the end of the soft hang for in-depth diagnosis. Then, Diagnoser

analyzes the collected stack traces to determine if there is indeed a soft hang bug. Upon the

detection of a soft hang bug, the collected information is reported to the app developer. If it

is caused by a previously unknown blocking API, Hang Doctor adds it to the database of

known blocking APIs, so that the offline algorithms can detect them.

Hang Doctor addresses the limitations of offline detection solutions because it is a

runtime solution that can detect soft hang bugs caused by 1) new blocking APIs, 2) known

blocking APIs called in third-party libraries (without the need of source code), and 3)

self-developed lengthy operations, as long as those soft hang bugs manifest themselves at

runtime. Therefore, with Hang Doctor, developers can track the responsiveness performance

of their apps in the wild and get diagnosis information about the soft hang bugs to be fixed.

Specifically, this chapter makes three contributions:

• We propose Hang Doctor, a runtime methodology to detect soft hang bugs that can be

missed by offline detection algorithms.

• Hang Doctor features a two-phase detection algorithm to achieve small runtime over-

heads. A novel soft hang filter is designed based on response time and performance

event counters for high detection performance. To the best of our knowledge, this is the first

work that leverages performance event counters for soft hang bug detection.

• We have implemented Hang Doctor and tested it with the latest releases of 114 real-

world apps. Hang Doctor has found 34 new soft hang bugs previously unknown

to their developers. So far, 62% of the bugs have already been confirmed by the

developers (see our website for details [5]) and 68% are missed by offline detection

algorithms.

The rest of this chapter is organized as follows. Section 2.1 motivates our study and

Section 2.2 describes the design of Hang Doctor. Section 2.3 evaluates our solution. Section

2.4 discusses the related work.

2.1 Background and Motivation

In this section, we first introduce some background information on soft hangs. We then

use real-world examples and traces to demonstrate the limitations of existing soft hang bug

detection algorithms as our motivation.

2.1.1 Background

For mobile apps (e.g., Android, iOS, Windows), the only app thread that is designed to

receive and execute user actions from the User Interface (UI) is the main thread [24]. Here, we briefly introduce how apps handle user actions and why blocking operations cause soft hangs. Note, in this dissertation we mainly focus on Android OS for its open-source nature, but similar considerations can be applied to other mobile OSs.

[Figure 2.1: timeline (0-400 ms) of the operations executed for one input event. Legend: 1 - setParameters (camera), 2 - open (camera), 3 - setText (TextView), 4 - inflate (LayoutInflater), 5 - init (SeekBar), 6 - enable (OrientationEventListener). The buggy main thread runs all six operations; the fixed version moves operation 2 to a worker thread.]

Figure 2.1: The Main Thread of the app A Better Camera [30] executes UI-related APIs (e.g., setText, inflate, init, enable) and camera APIs (e.g., setParameters, open). Moving blocking APIs (e.g., open) to a worker thread makes the app more responsive to user actions.

User actions performed through the touchscreen of the smartphone are recognized and

forwarded by the OS to the main thread of the foreground app as input events. An input

event is a message containing some information that allows the main thread to determine what code to execute for that event (e.g., listeners, handlers). New events are put into an

event queue upon their arrival and the events are executed, one by one, in their queue order.

Therefore, if the code related to an input event includes the execution of a blocking operation

on the main thread, a delay longer than 100ms may be perceived by users [32,48]. These

responsiveness bugs are well known in the literature as soft hang bugs [76,88]. In order to

avoid soft hangs, as suggested by the Android guideline [32], blocking operations should be

moved to a worker thread, so that the main thread can execute new input events in a timely

manner without the user perceiving delay.

Practical Example. Figure 2.1 shows the sequence of operations executed by an input

event on the main thread of the app A Better Camera [28, 30] when the user Resumes

the main activity (labeled Buggy Main Thread in the figure). The main activity of the app

is composed of 1) camera images loaded through two camera APIs (e.g., setParameters,

open [31]) and 2) a UI interface loaded through four UI-APIs (e.g., setText, inflate, init,

enable) to allow the user to interact with the app. The input event has a response time of

423ms, which is perceivable by the user, to execute those operations. As shown in Figure

2.1, the camera API open is the one that takes the longest to execute. This API connects

the app’s UI with the camera, thus it may take a long time to establish the connection. The

response time shown in Figure 2.1 can be reduced to the less perceivable 160ms by moving

the execution of this API to a worker thread (labeled Fixed in Figure 2.1), so that it can be

executed asynchronously and return the necessary data to the main thread (e.g., with the

onPostExecute method of AsyncTask) without affecting the user experience. Note, while the

responsiveness can be further improved by also moving other APIs (e.g., setParameters),

the UI-APIs must be executed on the main thread because they manipulate the UI and

may inevitably introduce a perceivable delay. Thus, UI-APIs are not soft hang bugs and

should not be reported by Hang Doctor. It is our future work to study the responsiveness of

UI-APIs.
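To make the fix concrete, below is a minimal sketch of this kind of change, assuming an activity named CameraActivity with a hypothetical bindCameraToPreview helper; it illustrates the pattern described above (moving the blocking call to an AsyncTask and returning the result in onPostExecute) and is not the actual patch of A Better Camera.

```java
import android.app.Activity;
import android.hardware.Camera;
import android.os.AsyncTask;

public class CameraActivity extends Activity {

    @Override
    protected void onResume() {
        super.onResume();
        // The blocking open() call no longer runs on the main thread.
        new OpenCameraTask().execute();
    }

    // Non-static inner class so onPostExecute can reach the activity's UI.
    private class OpenCameraTask extends AsyncTask<Void, Void, Camera> {
        @Override
        protected Camera doInBackground(Void... params) {
            return Camera.open();            // blocking call, worker thread
        }

        @Override
        protected void onPostExecute(Camera camera) {
            bindCameraToPreview(camera);     // main thread: finish UI setup
        }
    }

    private void bindCameraToPreview(Camera camera) {
        // Hypothetical helper: attach the opened camera to the preview surface.
    }
}
```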

Some soft hang bugs may need a more sophisticated fix, because the subsequent operations on the main thread may depend on the data generated by the blocking API. As a result,

such a blocking API cannot be simply moved to a worker thread. Note that the focus of

Hang Doctor is to detect and report such a blocking operation as a soft hang bug. It is up to

the developer to decide how to fix it.

App Name          Commit #
DroidWall         3e2b654
FrostWire         55427ef
Ushaidi           59fbb533d0
WebSMS            1f596fbd29
cgeo              6e4a8d4ba8
Seadroid          5a7531d
FBReaderJ         0f02d4e923
A Better Camera   9f8e3b0

Table 2.1: Apps with well-known soft hang bugs tested in the motivation study. The commit number refers to the app version that has the bug.

2.1.2 Motivation

In this section, we first discuss the limitations of offline detection algorithms with several

real-world examples. We then conduct experiments with some open-source apps and an LG

V10 smartphone to test the performance of existing runtime algorithms. We select these

apps (summarized in Table 2.1) from recent studies [48] and from online repositories [28]

by searching the keywords “freeze”, “hang”, or “ANR” (Application Not Responding) in the

apps’ change logs.

Offline Soft Hang Detection. Offline detection algorithms [48, 76, 88] find soft hang

bugs by scanning the app code to look for well-known blocking APIs on the main thread.

However, these offline approaches fail to detect those soft hangs caused by APIs that are not

known as blocking.

There are cases in which some APIs were not known as blocking in the past but caused soft

hangs. For example, the camera API open has been available since 2008 but has been marked

as blocking only after 2011 [31, 33]. Similarly, other APIs, such as mediaplayer.prepare,

bitmap.decode, and bluetooth.accept, have all been available since 2009 but have been clearly

marked as blocking only after 2012. As a result, any offline algorithm would not have

been able to detect the soft hang bugs caused by those APIs before 2011/2012. Thus, soft

hangs may occur at runtime even after using those offline algorithms. Our hypothesis is that

13 True Positives False Positives App Name 5s 1s 500ms 100ms 5s 1s 500ms 100ms DroidWall 0 0 0 1 0 0 1 3 FrostWire 0 0 1 1 0 0 0 5 Ushaidi 0 0 0 2 0 0 1 4 SeaDroid 0 1 1 1 0 0 2 6 WebSMS 0 0 0 1 0 0 0 3 cgeo 0 0 0 5 0 0 2 5 FBReaderJ 0 0 0 6 0 0 2 4 A Better Camera 0 0 0 2 0 0 0 4 TOTAL 0/19 1/19 2/19 19/19 0 0 8 33

Table 2.2: The timeout value influences the performance of Timeout-based runtime detection algorithms. The numbers report the average numbers of true positives and false positives detected for the apps in Table 2.1.

there can be other blocking APIs that remain unknown to offline tools and can cause soft

hangs at runtime. To our knowledge, currently, there is no established way to automatically

determine if a certain API is an unknown soft hang bug. As also stated in some related work, e.g., [76], an API becomes a soft hang bug mainly based on expert knowledge, which

often entails manually diagnosing performance data and/or stack traces collected in the wild.

Therefore, for these unknown blocking APIs, only a runtime detection algorithm would be

able to detect the incurred soft hang in a timely manner and report to the developer. For

example, the K9-mail bug (#1007 in Table 2.5) that we have found at runtime is caused by

an unknown blocking API clean, which is missed by a state-of-the-art detection tool [48].

In addition, self-developed lengthy operations and well-known blocking APIs nested into

closed-source third-party libraries may also be missed. We discuss various examples of such

cases in Section 2.3.2.

Runtime Detection. The most representative state-of-the-art runtime method used

to catch responsiveness problems is the Timeout-based (e.g., [18, 32, 45]). It detects a

responsiveness problem when the response time of a user action is longer than a timeout.

The main thread may execute several input events for each action. Each input event may

cause a soft hang. We refer to the user action response time as the maximum response time

of the input events executed.
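For illustration, a bare-bones Timeout-based detector can be built on top of the log messages that the Android Looper emits around every main-thread dispatch. The sketch below is our own simplified version (the class name, log tag, and 100 ms constant are illustrative), and it exhibits exactly the weakness discussed next: it flags every long dispatch, whether caused by a soft hang bug or by legitimate UI work.

```java
import android.os.Looper;
import android.os.SystemClock;
import android.util.Log;
import android.util.Printer;

/** Minimal Timeout-based soft hang detector (illustrative sketch). */
public final class TimeoutWatcher {

    private static final long TIMEOUT_MS = 100;   // perceivable-delay threshold

    public static void install() {
        Looper.getMainLooper().setMessageLogging(new Printer() {
            private long dispatchStart;

            @Override
            public void println(String x) {
                // Looper prints ">>>>> Dispatching ..." before and
                // "<<<<< Finished ..." after each main-thread message.
                if (x.startsWith(">>>>> Dispatching")) {
                    dispatchStart = SystemClock.uptimeMillis();
                } else if (x.startsWith("<<<<< Finished")) {
                    long elapsed = SystemClock.uptimeMillis() - dispatchStart;
                    if (elapsed > TIMEOUT_MS) {
                        // Reports every long dispatch, including legitimate UI work,
                        // which is why a timeout alone produces false positives.
                        Log.w("TimeoutWatcher", "Main-thread dispatch took " + elapsed + " ms");
                    }
                }
            }
        });
    }
}
```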

The choice of the timeout value determines the detection quality of the Timeout-based

method. As Table 2.2 shows, a long timeout (e.g., 5 seconds used by Android’s ANR

tool [32]) misses most of the soft hang bugs. A shorter timeout (e.g., 100ms) leads to many

false positives caused by UI operations. As we show in Section 2.3.5, collecting stack

traces for every soft hang longer than 100ms may lead to an unnecessarily high overhead.

Thus, Timeout-based methods alone are not sufficient for soft hang bug detection. Hang

Doctor achieves better detection performance and lower overhead by using response time

and performance event counters.

2.2 Design of Hang Doctor

In this section, we first describe the goals and challenges of Hang Doctor. We then

introduce its two-phase algorithm at a high level. Finally, we discuss the details of each phase.

2.2.1 Goals and Challenges

The target of Hang Doctor is to help app developers fix soft hang bugs that can be missed

by offline algorithms (e.g., [48]). Hang Doctor runs at runtime on the users’ devices and has

three main goals: 1) understand whether an app is affected by soft hang bugs, 2) diagnose which blocking operation causes each soft hang, and 3) update the database of known

blocking APIs used by the offline algorithms. Soft hang bugs caused by self-developed

lengthy operations are communicated only to the app developer.

There are three major challenges for Hang Doctor:

1. Finding the root cause: Hang Doctor should be able to detect soft hang bugs caused

by APIs previously unknown as blocking or nested within libraries.

2. High detection performance: Hang Doctor should ensure high-quality detection,

i.e., all and only the manifested soft hang bugs are detected and analyzed.

3. Low overhead: Analyzing every soft hang could lead to a high overhead due to a

large number of false positives.

In order to achieve the goals and meet the challenges described above, Hang Doctor is

designed as a two-phase algorithm that is activated for every user action. The first phase is

a lightweight soft hang bug symptom checker (i.e., S-Checker) and the second phase is a

soft-hang Diagnoser.

2.2.2 Design Overview

Figure 2.2(a) shows the high-level architecture design of Hang Doctor. Because soft

hang bugs occur only for some user actions, Hang Doctor dynamically transitions each

action among several states. Based on the current action state, Hang Doctor performs either

a lightweight analysis with the first-phase S-Checker or a deep analysis with the second-

phase Diagnoser. Hang Doctor has five runtime components (yellow boxes on the right

side of Figure 2.2(a)), i.e., the Response Time Monitor, the Performance Event Monitor,

the first-phase S-Checker, and the second-phase Diagnoser, which is composed of Trace

Collector and Trace Analyzer. Hang Doctor includes also two offline components (blue

boxes in Figure 2.2(a)), i.e., Hang Bug Report and App Injector.

[Figure 2.2(a): block diagram of the Hang Doctor architecture. Runtime components on the user smartphone: Response Time Monitor, Performance Event Monitor, first-phase S-Checker, and second-phase Diagnoser (Trace Collector and Trace Analyzer). Offline components in the development lab: Hang Bug Report and App Injector, connected to the database of known blocking APIs used by the offline hang bug detector, to soft hang bugs reported by other users, and to the app market.]

(a) High-level architecture

File Name                    Subroutine        Blocking Op.     Code Line   Occurrence
AttachedImageDrawable.java   DrawableFromPath  decodeFile       74          75%
Xstl.java                    toHtmlString      transform        67          15%
Xstl.java                    toHtmlString      newTransformer   64          10%

(b) Hang Bug Report Example

Figure 2.2: (a) High-level architecture of Hang Doctor. It is designed as a two-phase algorithm that is activated for every user action. The detected soft hang bugs are communicated to the developer through the Hang Bug Report. (b) Example entries of the app AndStatus in Hang Bug Report.

S-Checker. The main approach of Hang Doctor to balance performance and overhead is to first analyze an executing action with the lightweight first-phase S-Checker. Figure 2.3 shows a state machine that represents how Hang Doctor manages an action’s state over time.

Each node is a state and the solid black arrows represent the transition of an action from one state to another. The labels on these arrows specify the Hang Doctor component that causes the transition (in bold) and the condition. There are three possible paths for an action to go through, starting from state Uncategorized, which means the action has never caused a soft hang before.

Path A: Upon the execution of an uncategorized action, if the response time of this action is longer than 100ms, the performance event counters are examined by S-Checker.

If the performance event values are low (see Section 2.2.3 for more details), the action is determined as a UI operation and transitioned by S-Checker to the Normal state, which means it does not have a soft hang bug.

[Figure 2.3: state machine over the action states Uncategorized, Normal, Suspicious, and Hang Bug. Path A (no symptoms of a soft hang bug): S-Checker moves an Uncategorized action to Normal; a periodic reset returns Normal actions to Uncategorized. Path B (has the symptoms but is not a soft hang bug): S-Checker moves the action to Suspicious and Diagnoser then moves it to Normal. Path C (it is a soft hang bug): Diagnoser moves the Suspicious action to Hang Bug, where occasional hang bugs keep being diagnosed.]

Figure 2.3: The first-phase S-Checker and the second-phase Diagnoser transition the state of each individual action based on their analysis result to improve the detection performance and lower the overhead. S-Checker monitors the performance event counters and the response time of actions in the Uncategorized state to filter out soft hangs caused by UI operations. Diagnoser collects stack traces during the soft hangs caused by actions in the Suspicious and Hang Bug states to determine the root cause blocking operation.

Paths B and C: If the uncategorized action has the symptoms of a soft hang bug, i.e.,

response time longer than 100ms and high performance event values, the action transitions

to the Suspicious state. Diagnoser is then triggered to determine (see below) if this action

indeed contains a soft hang bug. If not, the action follows Path B to Normal. Otherwise,

the action transitions to Hang Bug through Path C. For those actions in the Normal state,

to account for soft hang bugs that may manifest after a long time, S-Checker periodically

resets them back to Uncategorized, so that they can be analyzed again. The period can be

configurable (e.g., every 20 executions of the action [60]).
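The per-action bookkeeping behind these paths can be pictured with a small state holder. The sketch below is our own illustration of the transitions in Figure 2.3 (the names and the reset logic are simplified assumptions), not Hang Doctor's actual code.

```java
/** Per-action state, following the paths of Figure 2.3 (illustrative sketch). */
enum ActionState { UNCATEGORIZED, SUSPICIOUS, HANG_BUG, NORMAL }

final class ActionRecord {
    ActionState state = ActionState.UNCATEGORIZED;
    private int normalExecutions = 0;            // counts executions for the periodic reset

    /** First phase: S-Checker outcome for one execution of an Uncategorized action. */
    void onSCheckerResult(boolean softHang, boolean hangBugSymptoms) {
        if (state != ActionState.UNCATEGORIZED) return;
        if (softHang && hangBugSymptoms) {
            state = ActionState.SUSPICIOUS;      // Paths B/C: hand over to Diagnoser
        } else if (softHang) {
            state = ActionState.NORMAL;          // Path A: symptoms point to a UI operation
        }
    }

    /** Second phase: Diagnoser verdict after analyzing the collected stack traces. */
    void onDiagnoserResult(boolean isHangBug) {
        state = isHangBug ? ActionState.HANG_BUG // Path C
                          : ActionState.NORMAL;  // Path B
    }

    /** Periodic reset so bugs that manifest rarely can still be caught later. */
    void onExecutionWhileNormal(int resetPeriod) {
        if (state == ActionState.NORMAL && ++normalExecutions >= resetPeriod) {
            state = ActionState.UNCATEGORIZED;
            normalExecutions = 0;
        }
    }
}
```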

Diagnoser. As Figures 2.2(a) and 2.3 show, actions in the Suspicious state are analyzed by Diagnoser to determine if the executing action has a soft hang bug. Diagnoser checks if the action currently executing violates the 100ms timeout again and generates a soft hang. If the timeout is not violated (i.e., there is no soft hang), the action may have a soft hang bug that manifests only occasionally because a soft hang was previously detected by S-Checker for this action. In such cases, the Diagnoser leaves the action in the Suspicious state, so that it can be traced and analyzed as soon as it causes another soft hang. On the other hand, if the timeout is violated again, Trace Collector collects the main thread’s stack traces until the end of the soft hang, which are then analyzed by Trace Analyzer to determine whether the soft hang is caused by a UI operation or a real soft hang bug. In the former case, Diagnoser transitions the action to Normal through Path B in Figure 2.3. On the other hand, when a soft hang is determined to be a soft hang bug (Path C), Diagnoser transitions the action to the Hang Bug state so that it is always analyzed by Diagnoser during future executions.

Note, we could avoid collecting other stack traces during soft hangs of actions in the Hang

Bug state to further reduce the overhead. However, doing so may lead to misdiagnosing the root cause of some soft hangs: some actions (e.g., Andstatus bug 303, K9-mail bug 1007 [5]) may include multiple soft hang bugs that cause soft hangs in different executions.
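Collecting such traces does not require anything exotic: the main thread's stack can be snapshotted from a background thread with standard Java APIs. The sketch below is our own illustration (the sampling interval and the AtomicBoolean handshake are assumptions), roughly the job a trace collector performs while a soft hang is in progress.

```java
import android.os.Looper;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Samples the main thread's stack while a soft hang is in progress (sketch). */
final class StackTraceSampler {

    private static final long SAMPLE_INTERVAL_MS = 20;   // illustrative value

    /** Called on a background thread; 'hangInProgress' is cleared by the detector. */
    List<StackTraceElement[]> sampleWhile(AtomicBoolean hangInProgress)
            throws InterruptedException {
        Thread mainThread = Looper.getMainLooper().getThread();
        List<StackTraceElement[]> traces = new ArrayList<>();
        while (hangInProgress.get()) {
            traces.add(mainThread.getStackTrace());       // snapshot of the main thread
            Thread.sleep(SAMPLE_INTERVAL_MS);
        }
        return traces;   // analyzed afterwards to find the dominant blocking call
    }
}
```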

Developer Feedback and Implementation. Hang Doctor maintains the Hang Bug

Report for the developer, which lets the developer view statistical information about the app’s responsiveness performance in the wild. It includes a table of detected soft hang bugs ordered by the percentage of occurrences across user devices. Figure 2.2(b) shows an example of report entries for the three new soft hang bugs of the app AndStatus (see Section 2.3.2). As

Figure 2.2(a) shows, Hang Doctor adds the detected unknown soft hang bugs to the list of known blocking APIs used by offline algorithms, so that developers of other apps

can also be warned about the possible new soft hang bugs and fix them before they cause

problems in the wild. We consider Hang Doctor as a supplementary runtime solution to

offline algorithms for two main reasons. First, it is desired to detect soft hang bugs offline to

avoid poor user ratings and runtime overhead. However, as we have discussed in Section 2,

there are unknown soft hang bugs, e.g., transform in Figure 2.2(b), that can be missed by

offline solutions, thus a runtime solution is also needed. Second, the user privacy, which is a

concern of runtime solutions, is not violated by Hang Doctor because all the anonymized

data sent out from the user devices only include those blocking operations that have caused

a soft hang. Hang Doctor can be embedded into an app by the developer who wants to

improve the app performance. It runs as an additional, separate, and lightweight thread within the app, but it does not need any OS modification to work (see discussion in Section

2.2.5).
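As a side note, each entry of the report shown in Figure 2.2(b) boils down to a small record. The sketch below is ours (the field names follow the figure's columns, the class name is illustrative) and shows one way such an entry could be represented when aggregating reports from user devices.

```java
/** One row of the Hang Bug Report of Figure 2.2(b) (illustrative sketch). */
final class HangBugReportEntry {
    final String fileName;     // e.g., "Xstl.java"
    final String subroutine;   // e.g., "toHtmlString"
    final String blockingOp;   // root-cause blocking operation, e.g., "transform"
    final int codeLine;        // source line of the blocking call
    final double occurrence;   // fraction of occurrences reported across user devices

    HangBugReportEntry(String fileName, String subroutine, String blockingOp,
                       int codeLine, double occurrence) {
        this.fileName = fileName;
        this.subroutine = subroutine;
        this.blockingOp = blockingOp;
        this.codeLine = codeLine;
        this.occurrence = occurrence;
    }
}
```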

2.2.3 First Phase: S-Checker

Uncategorized actions are analyzed by S-Checker, which performs a light-weight analy-

sis of their execution to filter out soft hangs caused by UI-APIs. The filtering is based on

soft hang bug symptoms, which we define by using correlation analysis of performance

event counters with soft hang bugs. We first provide some background information about

the performance event counters and the methodology used for the analysis. Second, we

explain which app threads to select for the analysis. Third, we examine the results of the

correlation analysis and discuss their generality across platforms and training sets. Finally,

based on these results, we determine how many performance events are needed to detect

soft hang bugs and define the soft hang bug symptoms.

2.2.3.1 Soft Hang Bug Detection with Performance Event Counters

Performance Event Counters. Performance event counters can provide low-level

information about how well an app is performing. In general, there are two main types

of performance event counters: performance events generated and counted at kernel level

and performance events generated by the performance monitoring unit (PMU) of the CPU, which are counted using a limited number of special registers. The main advantages of

using performance events are the low monitoring overhead and the high customizability,

e.g., the user can select which performance events to monitor and the target process or

thread. However, using all the available performance events for soft hang bug detection

can have two main drawbacks. First, the counting accuracy may decrease because the

number of PMU-generated events available is usually much greater than the number of

registers (e.g., 37 events vs 6 registers in the LG V10). Second, the soft hang bug detection

performance may degrade because some of the available performance events may not be

able to distinguish soft hang bugs from UI operations, which may lead to high numbers of

false positives and false negatives. To solve these two problems, we design the S-Checker

based on correlation analysis, which identifies a few performance events that lead to high soft

hang bug detection performance.
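As a simplified illustration of per-thread event counting: Hang Doctor reads kernel- and PMU-level performance event counters, whereas the sketch below is only our own stand-in that reads the context-switch counts Linux exposes under /proc, and the threshold comparison is an assumption about how a filter could use the main-vs-render difference discussed later in this section.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Reads per-thread context-switch counts from /proc (illustrative sketch). */
final class ThreadCounterReader {

    /** Sum of voluntary and nonvoluntary context switches for one thread. */
    static long readContextSwitches(int pid, int tid) throws IOException {
        long total = 0;
        try (BufferedReader reader = new BufferedReader(
                new FileReader("/proc/" + pid + "/task/" + tid + "/status"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("voluntary_ctxt_switches")
                        || line.startsWith("nonvoluntary_ctxt_switches")) {
                    total += Long.parseLong(line.replaceAll("\\D+", ""));
                }
            }
        }
        return total;
    }

    /** A large positive main-minus-render difference is treated as a hang bug symptom. */
    static boolean hangBugSymptom(long mainDelta, long renderDelta, long threshold) {
        return (mainDelta - renderDelta) > threshold;
    }
}
```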

Methodology. Here, we briefly describe the methodology used for the correlation

analysis. We mainly present the results with the LG V10 smartphone, but we have obtained

similar results with other devices (e.g., Nexus 5, Galaxy S3). We collect performance

event samples (46 performance events are available in total) to use in the analysis from the

apps listed in Table 2.5. We use 1) a training set of soft hang bugs and UI-APIs to find

the performance events and their thresholds that allow us to detect soft hang bugs with low

numbers of false positives and false negatives and 2) a validation set of soft hang bugs and

UI-APIs to demonstrate the efficacy of our solution. For the training set, we have used 10

different well-known hang bugs in Table 2.5 that are also detected by offline tools and 11

UI-APIs. Note, due to the limited number of different known soft hang bugs that can be

analyzed, the training set size is limited. For the validation set (see results in Section 2.3.4), we have used the previously unknown soft hang bugs in Table 2.5 that are missed by existing

offline solutions. None of the soft hang bugs in the training set is included in the validation

set.

To train S-Checker, we sample the performance events during the execution of user

actions that have soft hangs caused by the soft hang bugs and UI-APIs in the training set.

For each performance event sample, we perform the correlation analysis by calculating

the Pearson correlation coefficient [41] between the samples collected during each action

and a vector that identifies each sample as soft hang bug or UI operation. Each coefficient

ranges from -1 (negative correlation) to 1 (positive correlation): the higher the correlation

coefficient of a performance event, the better it diagnoses the soft hang cause. Note that

here we test the linear correlation of performance events with soft hang bugs. We leave as

future work studying the non-linear correlation.
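For reference, the coefficient used here is the standard Pearson formula; writing x_i for the value of a given performance event measured for action i, and y_i in {0, 1} for its label (1 for a soft hang bug, 0 for a UI operation, which is our reading of the description above):

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
```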

Thread Selection. In order to perform the analysis, we first need to choose which

app threads to monitor. In general, there are three types of threads for each app: several

background worker threads, a main thread, and a render thread. Background threads are not

involved, in most of the cases, in soft hangs and thus they should not be monitored. The

main thread, as explained in Section 2.1.1, is the thread that may have soft hang bugs and

thus it should be included in the analysis. It handles user actions and performs some UI

update operations. These UI changes are then communicated to the render thread, which

performs the heavier job of generating and communicating every frame (e.g., button color change) to the Graphics Processing Unit (GPU).

(a) Main Thread - Render Thread                (b) Only Main Thread
Performance Event      Corr. Coeff.            Performance Event    Corr. Coeff.
context-switches       0.658                   minor-faults         0.601
task-clock             0.632                   page-faults          0.601
cpu-clock              0.632                   L1-dcache-loads      0.469
page-faults            0.561                   L1-dcache-stores     0.454
minor-faults           0.557                   instructions         0.451
cpu-migrations         0.548                   cache-misses         0.440
cache-misses           0.472                   task-clock           0.431
instructions           0.466                   cpu-clock            0.431
cache-references       0.466                   cache-references     0.428
raw-l1-dcache-refill   0.459                   branch-loads         0.416
Average                0.545                   Average              0.472

Table 2.3: Correlation analysis results used for the design of S-Checker. Top-10 most correlated performance events for soft hang diagnosis. (a) Monitoring main thread and render thread increases the correlation by about 14% on average compared to (b) monitoring only the main thread.

Thus, when there is no soft hang bug,

the main thread executes mostly UI-related jobs and generates a lot of work for the render

thread. Intuitively, it may be possible to recognize soft hang bugs when the main thread does

not generate much work for the render thread. Therefore, for each action, we consider two

cases for the correlation analysis. First, we monitor all the performance events of both main

thread and render thread, i.e., each performance event has one sample that is the difference

between the recorded performance event values of main thread and render thread. Second, we monitor only the main thread. Note, we do not consider the case of only render thread

because soft hang bugs are located only in the main thread.
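A minimal sketch of how the first case's samples could be formed is shown below; the map-based layout and the method name are our own illustration.

```java
import java.util.HashMap;
import java.util.Map;

// For each performance event, the sample of an action in the first case is the difference
// between the counts accumulated by the main thread and by the render thread.
final class ActionSampleSketch {
    static Map<String, Long> mainMinusRender(Map<String, Long> mainThreadCounts,
                                             Map<String, Long> renderThreadCounts) {
        Map<String, Long> diff = new HashMap<>();
        for (Map.Entry<String, Long> e : mainThreadCounts.entrySet()) {
            long render = renderThreadCounts.getOrDefault(e.getKey(), 0L);
            diff.put(e.getKey(), e.getValue() - render);  // main thread - render thread
        }
        return diff;
    }
}
```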

Correlation Analysis. Table 2.3(a) shows the top-10 most correlated performance

events for the first case: 30% of the events have a coefficient higher than 0.6, 30% between

0.5 and 0.6, and 40% between 0.4 and 0.5. Table 2.3(b) shows the top-10 performance

events for the second case: 20% of the coefficients are just above 0.6 while 80% are between

0.4 and 0.5. These results confirm that monitoring both the main thread and the render thread

yields better soft hang bug detection performance than monitoring only the main thread.

Thus, we use both threads to identify which performance events are necessary to minimize

false positives and negatives. On the other hand, S-Checker could be designed based only

on the main thread for smartphones running older versions of Android (i.e., below 5.0) that

do not have the render thread.

Some events in Table 2.3(a), e.g., context-switch, task-clock, and page-fault, have a

high correlation coefficient because their increase is dictated by OS decisions on thread

scheduling rather than the particular source code of a soft hang bug. The context-switch

count of a thread increases whenever this thread is executing but is preempted by another

thread. The task-clock count of a thread increases to keep track of the CPU time received

by this thread. The page-fault count of a thread increases whenever this thread is executing

and tries to access a page that is not currently mapped in memory. When there is a soft

hang bug, the main thread has heavy work to execute and does not provide much work to

the render thread. As a result, during soft hang bugs, the context-switch, task-clock, and

page-fault counters are generally high for the main thread and low for the render thread,

i.e., the difference between main thread and render thread for each event counter is usually

high. During a UI-API, the main thread provides much more (UI-related) work to the render

thread, i.e., the difference between main thread and render thread for each event counter

is usually low. As a result, these event counters are likely to be useful for the detection

of soft hang bugs. Other event counters have a lower correlation coefficient because their

increase may depend more on the specific source code of a soft hang bug. For example, the

instruction count increases every time a thread executes an instruction. Each soft hang bug

(a) 75% training set                    (b) 50% training set
Performance Event     Corr. Coeff.      Performance Event     Corr. Coeff.
context-switches      0.707             context-switches      0.817
task-clock            0.678             task-clock            0.778
cpu-clock             0.678             cpu-clock             0.777
page-faults           0.563             page-faults           0.734
minor-faults          0.560             minor-faults          0.733
cache-misses          0.467             raw-l1-dcache-refill  0.548
L1-dcache-stores      0.463             cache-misses          0.540
cpu-migrations        0.457             instructions          0.467
raw-l1-dcache-refill  0.449             raw-l1-itlb-refill    0.464
cache-references      0.443             cache-references      0.461

Table 2.4: Sensitivity analysis for the correlation analysis of Table 2.3(a). The correlation analyses with (a) 75% and (b) 50% of the data points used in Table 2.3(a) have similar results, thus the results do not depend on the training set.

may have more or fewer instructions than UI-APIs, and thus it is more difficult to use

those event counters to distinguish soft hang bugs from UI-APIs. As a result, it is unlikely

that they can be used for soft hang bug detection. Note, these observations hold for heavy

APIs and also self-developed lengthy operations2.

Generality of the Analysis. It could be argued that the above analysis may depend

on the platform and the training set used. As we have verified by testing various devices

(LG V10, Nexus 5, Galaxy S3), the proposed correlation analysis has little to do with

the particular platform used because these performance events are mostly related to OS

scheduling decisions at the kernel level. In addition, the first six most-correlated event

counters in Table 2.3(a) are generated by the kernel and thus they are available independently

from the particular CPU and architecture. The rest of the events in Table 2.3(a) are generated

2We do not discuss network-related operations because 1) they are well-known hang bugs that can be detected by offline tools and 2) they would generate an exception during the build operation of the app, thus it is unlikely to find them in the wild. Hang Doctor can be easily extended to also detect these bugs by monitoring the network activity of the main thread.

by the PMU of the CPU but most of them, e.g., cache-misses, instructions, cache-references,

are present in most CPUs. This is why different platforms have similar correlation analysis

results. Next, we perform a sensitivity analysis to demonstrate that the proposed correlation

analysis is not affected by the particular training set.

In order to perform the sensitivity analysis, we change the training set used to execute

the correlation analysis. Due to the limited number of known data points, we perform the

sensitivity analysis by 1) reducing the size of the training set to generate new training sets,

2) executing the correlation analysis on the new training sets, and 3) comparing the most

correlated performance events for each training set. The correlation analysis summarized in

Table 2.3(a) is robust to training set changes if the top-correlated performance events remain

the same across all the training sets. We randomly remove data points from the full training

set and generate two new training sets that have 75% and 50% of the data points used in

the full training set. Tables 2.4(a) and 2.4(b) show the correlation analysis with the 75%

and 50% training sets, respectively. The top-5 performance events, i.e., the context-switch,

task-clock, cpu-clock, page-faults, and minor-faults, have the same ranking positions in

all the training sets, which means that the correlation of these performance events to the

soft hang bugs is not affected by the training set used. Note that, with smaller training sets,

the correlation coefficients may increase because it is easier to separate hang bugs from

UI-APIs. In addition, in some cases, low-correlated performance events may change in

ranking position because their correlation may be more dependent on the particular data

points in the training set. This sensitivity analysis demonstrates that our correlation analysis

is robust to different training sets and that the correlation of the top-5 performance events is

not affected by the training set used.
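The reduced training sets can be generated, for example, by random subsampling as in the sketch below; the helper name and the use of a fixed seed are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Keep a random 75% (or 50%) of the full training set and rerun the correlation analysis.
final class SensitivitySketch {
    static <T> List<T> randomSubset(List<T> fullTrainingSet, double keepFraction, long seed) {
        List<T> shuffled = new ArrayList<>(fullTrainingSet);
        Collections.shuffle(shuffled, new Random(seed));
        int keep = (int) Math.round(shuffled.size() * keepFraction);
        return new ArrayList<>(shuffled.subList(0, keep));
    }
}
```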

[Figure 2.4 plots omitted: (a) Context-Switch Difference, (b) Task-Clock Difference, and (c) Page-Fault Difference (main thread - render thread) for each soft hang bug (HB) and UI-API in the training set, shown in descending order.]

Figure 2.4: Analysis of three top-correlated performance events in our training set. Using these three performance events allows us to distinguish soft hangs caused by soft hang bugs (HB) from those caused by UI operations (i.e., UI-API). Most of the soft hang bugs have a high performance event difference while most of the UI-APIs have a low performance event difference. This is because soft hang bugs, different from UI-APIs, cause more work for the main thread and less work for the render thread.

Hang Bug Symptoms and Filter Details. In order to find which performance events

are necessary to detect the soft hang bugs, we use the following procedure. First, starting

from the most correlated event counter (i.e., context-switches in Table 2.3(a)), we find the

best threshold that distinguishes soft hang bugs from UI-APIs by minimizing false positives

and false negatives. Second, in case of false negatives, we include another performance

event (as ordered in Table 2.3(a)) until all the soft hang bugs in the training set can be

detected by at least one performance event. Note, the primary target of Hang Doctor is

to detect soft hang bugs, thus it is important that we minimize or eliminate (if possible)

the number of false negatives. The minimization of false positives is a secondary target

to reduce the overhead that we address by properly choosing the thresholds. Using this

procedure (see next paragraph), we find that just three performance events are necessary:

two performance events that track the CPU activity, i.e., context-switches and task-clock,3

and one that tracks the memory activity, i.e., page-faults.
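The sketch below illustrates this greedy procedure under our own simplifying assumptions (samples stored as per-event arrays, candidate thresholds taken from the observed values); it is not Hang Doctor's actual code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Greedy selection: try events in decreasing order of correlation; for each one pick the
// threshold that minimizes false positives plus false negatives on the training set, and
// keep adding events until every soft hang bug is detected by at least one event.
final class EventSelectorSketch {
    // samples.get(event)[i] is the (main - render) difference of that event for action i;
    // isBug[i] is true if action i's soft hang is caused by a soft hang bug.
    static Map<String, Long> selectEvents(List<String> eventsByCorrelation,
                                          Map<String, long[]> samples, boolean[] isBug) {
        Map<String, Long> selected = new LinkedHashMap<>();
        boolean[] covered = new boolean[isBug.length];
        for (String event : eventsByCorrelation) {
            long[] x = samples.get(event);
            long threshold = bestThreshold(x, isBug);
            selected.put(event, threshold);
            for (int i = 0; i < x.length; i++) {
                if (isBug[i] && x[i] > threshold) covered[i] = true;
            }
            if (allBugsCovered(covered, isBug)) break;   // no false negatives remain
        }
        return selected;
    }

    private static long bestThreshold(long[] x, boolean[] isBug) {
        long best = 0; int bestErrors = Integer.MAX_VALUE;
        for (long candidate : x) {                       // candidate thresholds from the data
            int errors = 0;
            for (int i = 0; i < x.length; i++) {
                boolean flagged = x[i] > candidate;
                if (flagged != isBug[i]) errors++;       // false positive or false negative
            }
            if (errors < bestErrors) { bestErrors = errors; best = candidate; }
        }
        return best;
    }

    private static boolean allBugsCovered(boolean[] covered, boolean[] isBug) {
        for (int i = 0; i < isBug.length; i++) if (isBug[i] && !covered[i]) return false;
        return true;
    }
}
```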

As explained before, a higher event counter difference indicates a higher main thread

activity. Figure 2.4 shows the soft hang samples (HB for soft hang bug and UI-API for UI

operations) in descending order for the three event counters. As the figure shows, most of

the soft hang bugs have a high performance event difference. For each performance event, we identify the soft hang bug symptoms that distinguish most of the soft hang bugs from

most of the UI-APIs:

• Positive context-switch difference. As Figure 2.4(a) shows, 90% of the UI-API samples

have a negative context-switch difference while 90% of the soft hang bugs have a

positive difference.

• Task-clock difference above 1.7e8. As Figure 2.4(b) shows, 80% of the soft hang bug

samples have a task-clock difference greater than 1.7e8, which is more than twice as

large as that of 80% of the UI-API samples.

• Page-fault difference above 500. As Figure 2.4(c) shows, 90% of the soft hang bug

samples have a page-fault difference greater than 500, which is more than twice as large

as that of 80% of the UI-API samples.

3The cpu-clock is omitted because it is similar to the task-clock.

When an Uncategorized user action has a soft hang, S-Checker monitors the above three

performance events and reads their accumulated values at the end of the action: if at least one

of the above three conditions is verified, S-Checker transitions the action to the Suspicious

state, so that it can be further diagnosed with the Diagnoser. Otherwise, if none of the

conditions are verified, the action is transitioned to the Normal state, which does not do any

data collection for minimized overhead. If a soft hang does not occur, in order to account

for soft hang bugs that may occasionally manifest with soft hangs, the action is left in the

Uncategorized state so that it is monitored again in future executions. This filter recognizes

100% of the soft hang bugs and prunes 64% of the false positives in the training set (81%

overall accuracy).
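A minimal sketch of this filtering step, with the three thresholds from the training-set analysis and hypothetical type and method names, is shown below.

```java
// Minimal sketch of the S-Checker filter described above; not Hang Doctor's actual code.
final class SCheckerFilterSketch {
    enum ActionState { UNCATEGORIZED, SUSPICIOUS, HANG_BUG, NORMAL }

    static final long TASK_CLOCK_THRESHOLD = 170_000_000L; // 1.7e8
    static final long PAGE_FAULT_THRESHOLD = 500L;

    // Each difference is the (main thread - render thread) value accumulated over the
    // whole action execution.
    static ActionState classify(boolean actionHadSoftHang, long ctxSwitchDiff,
                                long taskClockDiff, long pageFaultDiff) {
        if (!actionHadSoftHang) {
            return ActionState.UNCATEGORIZED;        // keep monitoring future executions
        }
        boolean suspicious = ctxSwitchDiff > 0
                || taskClockDiff > TASK_CLOCK_THRESHOLD
                || pageFaultDiff > PAGE_FAULT_THRESHOLD;
        return suspicious ? ActionState.SUSPICIOUS   // hand the action over to the Diagnoser
                          : ActionState.NORMAL;      // likely a UI-API; stop data collection
    }
}
```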

Automatic Adaptation of the Filter. As explained before, the effect of soft hang bugs

on the execution behavior of main thread and render thread is mainly software dependent

rather than platform dependent. Thus, as we verify by testing the designed filter with various

devices (e.g., LG V10, Nexus 5, Galaxy S3), the selected thresholds and events are generally

good also for other platforms. In addition, our validation results in Section 2.3.4 show that

the above conditions ensure good detection performance even with a set of soft hang bugs

and UI-APIs not used for the design of S-Checker.

On the other hand, because unfortunately we could not test all the possible existing soft

hang bugs and platforms, we cannot completely exclude the possibility that there could

be cases of soft hang bugs that need more event counters to be used or a slightly different

threshold on a particular device. In order to address this concern, Hang Doctor could

automatically adapt the thresholds or even the selected event counters. For example, Hang

Doctor could perform a periodic data collection of performance event counters (e.g., top-ten

counters in Table 2.3(a)) and stack traces during the execution of user actions. This data

collection would be performed as an extra task for Hang Doctor and would be independent

from the activities of S-Checker and Diagnoser. The data collection period could be adjusted

and set long enough so that this extra data collection overhead can become negligible.

Using the collected data, Hang Doctor may verify whether to execute a light adaptation or a

heavier adaptation algorithm. The light adaptation is executed on the user device and has

a low computational overhead. It is executed if the data collected includes false positives

or false negatives that can be eliminated by simply increasing or decreasing, respectively,

some of the thresholds of the selected performance event counters. The heavy adaptation, which may lead to a higher computational overhead and thus could run on a server, is

executed if the light adaptation is not sufficient or leads to poor detection performance.

The heavy adaptation uses the collected data to execute an automated version of the above

described algorithm to select the performance events and their thresholds. The selected new

performance event counters and their new thresholds could then be sent as upgrades to the

device for improved detection performance.
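For instance, the light adaptation could adjust a single threshold as in the purely illustrative sketch below; when the observed false positives and false negatives pull the threshold in opposite directions, the heavy adaptation would be triggered instead.

```java
// Hypothetical sketch of the light adaptation for one performance event's threshold.
final class LightAdaptationSketch {
    static long adaptThreshold(long currentThreshold,
                               long[] falsePositiveSamples,   // flagged, but caused by UI-APIs
                               long[] falseNegativeSamples) { // missed soft hang bugs
        long adapted = currentThreshold;
        for (long fp : falsePositiveSamples) {
            adapted = Math.max(adapted, fp);      // raise so this sample is no longer flagged
        }
        for (long fn : falseNegativeSamples) {
            adapted = Math.min(adapted, fn - 1);  // lower so this sample is flagged
        }
        return adapted;
    }
}
```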

Discussion. In order to minimize the overhead of S-Checker, we could run the above

filter based only on a few performance event samples collected at the beginning of an action

execution. Unfortunately, this strategy may lead to many false positives. Figures 2.5(a) and

2.5(b) show the context-switch count of two actions that lead to the soft hang bug number 2

and the UI-API number 2 in Figure 2.4(a). While the action with the soft hang bug shows

soft hang bug symptoms during the whole execution, i.e., positive difference, the UI-API

in Figure 2.5(b) has soft hang bug symptoms between time 0s to 0.6s, even though the

soft hang is caused by a UI operation (similar results for most of the other UI-APIs and

performance events). This behavior is common at the beginning of an action execution

because the main thread has to execute some developer-defined code (e.g., in the onClick

[Figure 2.5 plots omitted: (a) Soft Hang Bug and (b) UI-API; context-switch counts of the main thread and render thread over time.]

Figure 2.5: Context-switch traces of main thread and render thread for two actions with a soft hang caused by the (a) soft hang bug 2 and (b) UI-API 2 in Figure 2.4(a). Using only a few samples collected at the beginning of the action execution may lead to false positives (e.g., from time 0s to 0.6s in (b)).

method for buttons) and some UI-related operations (e.g., calculate UI elements positions)

before doing any UI changes that involve the render thread. Therefore, the above soft hang

bug symptoms may not work through the whole action execution. As a result, S-Checker

conservatively counts the performance events until the end of the action execution, i.e., until

neither of the two threads executes or a new action is detected. Then, it checks the above conditions.

2.2.4 Second Phase: Diagnoser

Suspicious actions are analyzed by the Diagnoser, which performs a deep analysis of

their execution to determine the root cause blocking operations that cause soft hangs.

2.2.4.1 Trace Collection and Analysis

During a user action execution, the main thread may execute several input events in

sequence, which are analyzed by Hang Doctor according to the action’s current state. If

any of these input events has a response time longer than the minimum human-perceivable

delay (i.e., 100ms), Diagnoser starts collecting stack traces of the main thread until the

end of the soft hang to find the root cause blocking operation (i.e., Diagnoser does not

monitor performance events). A stack trace shows which operation and code line a thread is

executing at a certain time point. Therefore, by collecting stack traces during a soft hang, we can understand which operations the main thread executes over time. In particular, an

operation that executes for long time and causes a soft hang appears in most of the collected

stack traces. For example, in Figure 2.1, the camera.open API, which is the root cause

of the soft hang, appears in about 60% of the stack traces collected during the soft hang.

Different from other tracking methods, e.g., code injection to log when certain operations

are executed [65], this technique allows us to track the execution of any operation executed on

the main thread of the app, even those blocking APIs nested in third-party libraries. Stack

traces are also useful in detecting soft hangs caused by self-developed lengthy operations,

such as heavy loops, because they show which file, subroutine, and code line number in

the app code contains the heavy loop. At the end of the soft hang, Trace Analyzer analyzes the

stack traces collected to find the root cause blocking operation.

Trace Analyzer determines the root cause of a soft hang by analyzing the occurrence

factor of the API that appears the most across the stack traces collected. The occurrence

factor is defined as the percentage of stack traces that include a certain API. If the occurrence

factor is high (the exact threshold can be adjusted), as in the example in Figure 2.1, the probable

cause of the soft hang is a single heavy API (e.g., camera.open). However, there could

be cases of self-developed operations executing many light APIs that cause a soft hang.

In these situations, moving just one of these APIs would likely not fix the soft hang. In

order to fix this type of soft hang, the whole self-developed operation should be moved to a

background thread. Trace Analyzer recognizes these cases when the occurrence factor is

low: it first finds the most common caller function (i.e., the self-developed operation that executes those APIs) with a high occurrence factor across the collected stack traces, and then indicates this caller function as the probable cause of the soft hang.

Next, Trace Analyzer determines if the root cause is a soft hang bug or a UI-API. To our best knowledge, any operation that does not involve the UI can be moved off the main thread to improve the app responsiveness. This analysis can be automated because UI-APIs are well known as they are grouped in a few classes (e.g., View and Widget Classes [33]) and thus they can be easily recognized by analyzing the stack traces. Trace Analyzer can recognize even new UI-APIs from their class name because, different from new soft hang bugs, new UI-APIs are expected to be part of those classes. Note, self-developed lengthy operations are reported as possible soft hang bugs to the app developer if the collected stack traces do not include only UI-APIs.
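A simplified sketch of this analysis is shown below; the occurrence-factor threshold, the UI package prefixes, and the representation of stack traces as lists of frame strings are our own assumptions, not the exact implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Trace Analyzer logic: compute the occurrence factor of the
// most frequent API across the collected stack traces and decide whether the root cause
// looks like a soft hang bug or a UI-API.
final class TraceAnalyzerSketch {
    static final double OCCURRENCE_THRESHOLD = 0.5;  // adjustable, per the text
    static final List<String> UI_PACKAGES = Arrays.asList("android.view.", "android.widget.");

    // Each stack trace is a list of frames such as "org.htmlcleaner.HtmlCleaner.clean(...)".
    static String findRootCause(List<List<String>> stackTraces) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> trace : stackTraces) {
            for (String frame : new HashSet<>(trace)) {   // count each API once per trace
                counts.merge(frame, 1, Integer::sum);
            }
        }
        String topApi = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
        double occurrenceFactor = counts.get(topApi) / (double) stackTraces.size();
        if (occurrenceFactor >= OCCURRENCE_THRESHOLD) {
            return isUiApi(topApi) ? "UI-API: " + topApi : "soft hang bug: " + topApi;
        }
        // Low occurrence factor: blame the most common caller function (the self-developed
        // operation) instead of any single light API; caller extraction omitted in this sketch.
        return "possible soft hang bug in a self-developed operation";
    }

    private static boolean isUiApi(String frame) {
        for (String pkg : UI_PACKAGES) if (frame.startsWith(pkg)) return true;
        return false;
    }
}
```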

2.2.5 Hang Doctor Implementation

Android apps handle user actions by implementing special listeners, handlers, and callback functions such as onClick when buttons are clicked, onScroll when the user scrolls lists of items, and so on. To distinguish the various actions, App Injector assigns a Unique

ID (UID) to every action. Then, at runtime, a look-up table is created to save various information about the actions, including UIDs and current states. When the user executes an action, Hang Doctor reads the UID and looks up the current state of that action to eventually activate S-Checker or Diagnoser.

Hang Doctor measures the response time of each input event executed on the main thread by exploiting the setMessageLogging API of Android’s Looper class, which is invoked in two cases: 1) when an input event is dequeued for execution and 2) when this input

event finishes the execution. As a result, the response time is measured as the difference

between these two invocations. The performance events are accessed and monitored using

Simpleperf [34], which is an executable that accepts a wide range of parameters to customize which threads and which performance events to collect. Currently, Hang Doctor exploits

this executable to start and stop the monitoring of performance events during a user action.

SimplePerf can be easily included with the app as an additional lightweight executable (i.e.,

less than 1% of extra space) or directly integrated into Hang Doctor’s source code.
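The sketch below shows the idea behind the response-time measurement; it is a minimal version under our own naming, not Hang Doctor's full implementation. The main Looper prints a line starting with ">>>>> Dispatching" when it starts executing a message and one starting with "<<<<< Finished" when it is done, so the elapsed time between the two invocations is the input event's response time.

```java
import android.os.Looper;
import android.util.Log;
import android.util.Printer;

// Hypothetical sketch: measure input-event response times via Looper.setMessageLogging.
final class ResponseTimeMonitorSketch {
    private static final long SOFT_HANG_MS = 100;  // human-perceivable delay
    private long dispatchStartMs;

    void install() {
        Looper.getMainLooper().setMessageLogging(new Printer() {
            @Override
            public void println(String x) {
                if (x.startsWith(">>>>> Dispatching")) {
                    dispatchStartMs = System.currentTimeMillis();
                } else if (x.startsWith("<<<<< Finished")) {
                    long responseTime = System.currentTimeMillis() - dispatchStartMs;
                    if (responseTime > SOFT_HANG_MS) {
                        Log.d("HangDoctor", "Soft hang: " + responseTime + " ms");
                        // ...activate S-Checker or Diagnoser based on the action's state...
                    }
                }
            }
        });
    }
}
```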

We integrate Hang Doctor in the app so that developers do not need any OS modification

to track the responsiveness performance of their apps. However, the methodology of Hang

Doctor could be generalized and integrated into the OS as a more general framework that

improves the currently used ANR tool [32]. We plan to do this in our future work.

2.3 Evaluation

In this section, we first introduce our baseline and evaluation metrics. Second, we

summarize all the real-world soft hang bugs detected by Hang Doctor. Third, we give a

concrete example on how Hang Doctor works. Fourth, we evaluate its detection performance

and overhead. Finally, we discuss alternative design approaches and the current limitations

of Hang Doctor.

2.3.1 Baselines and Performance Metrics

Baselines. We consider three baselines for comparison:

1. TImeout-based (TI) detects a potential soft hang bug when the action’s response

time becomes longer than the human-perceivable delay of 100ms. TI is similar to

solutions adopted in Android OS [32] and proposed by various studies such as Jovic

et al. [45].

2. UTilization-based (UT) monitors periodically (every 100ms) the resource utilizations

of the main thread (e.g., CPU time, memory traffic, network usage). It detects a

potential soft hang bug when at least one of the resource utilizations is above its static

utilization threshold. UT is similar to the solutions proposed by various studies such

as Pelleg et al. [59] and Zhu et al. [93].

3. UT+TI is a simple combination of Utilization-based and Timeout-based but collects

resource utilizations only during soft hangs. UT+TI detects potential soft hang bugs

when 1) the response time becomes longer than 100ms and 2) at least one of the

resource utilizations monitored is above its static threshold. Different from Hang

Doctor, it uses coarse-grain resource utilizations rather than low-level performance

events to diagnose soft hang bugs.

Like Hang Doctor, all the baselines collect stack traces when they detect a potential

soft hang bug to find out the causing operation. For the baselines that use the resource

utilizations, to test the impacts of different thresholds, we consider two possible thresholds

for each app: 1) a low threshold (i.e., UTL, UTL+TI), which is the minimum resource

utilization observed during soft hang bugs, 2) a high threshold (i.e., UTH, UTH+TI), which

is set as the 90% peak value of the resource usage observed during soft hang bugs. In

addition, we compare the detection performance of Hang Doctor with PerfChecker [48], which is the state-of-the-art offline detection tool. We do not test a phase-2-only baseline

because it would be similar to the Timeout-based (TI) baseline.

Performance Metrics. In order to compare the detection performance of the baselines with Hang Doctor, we manually review the source code of apps to find well-known soft hang

bugs (e.g. database operations). Then, for each baseline, we count the true positives, i.e.,

real soft hang bugs that are also detected by the baseline, false positives, i.e., bugs detected

by the baseline that are not real soft hang bugs, and false negatives, i.e., real soft hang bugs

that are missed by the algorithm. When an unknown bug is detected by any of the baselines, we revise the app code to determine whether it is a true positive, i.e., we fix the bug and verify that the app does not have any more soft hangs, or a false positive.

A problem we have is how to count the false negatives for the unknown soft hang bugs.

For those new bugs that manifest with a soft hang, we can use the baseline TI because it

reports all the stack traces collected during all the soft hang occurrences. Therefore, in order

to count the false negatives in such cases, we run Hang Doctor (or a baseline) and TI at

the same time to get two separate detection traces. After manually reviewing each trace

as described above, we compare these two traces to count the numbers of false negatives

for Hang Doctor (or for a baseline). On the other hand, some unknown soft hang bugs

may never manifest with a soft hang during our experiments. Unfortunately, there is no

manageable way to find and study these hidden unknown bugs and count them as false

negatives. However, in this study we consider a wide variety of soft hang bugs and thus we

believe that Hang Doctor, whenever those hidden bugs actually manifest, would be able to

correctly diagnose them.

2.3.2 Result Summary and Developers’ Response

App Selection and Testing. The apps used in our tests are all open-source and available

in the Google Play Store [30] and in the GitHub [28]. We have started testing those apps

that are still receiving regular updates from the developers, have high counts of downloads,

and ensure a large variety of categories (e.g., Social, Productivity). Using these criteria, we

have tested about 114 apps so far and asked 20 users to test them with Hang Doctor on their

own devices for 60 days. Due to space limitations, we summarize only those tested apps

that have shown soft hang problems in Table 2.5. More details about all the tested apps are

available on our Hang Doctor website [5]. The reported commit number is the latest version

of each app at the time of the tests.

The apps in Table 2.5 represent typical usage cases of smartphone users: for example,

AndStatus is used to scroll social posts in a timeline and is similar to more widespread

social apps such as Facebook; K9-mail is an email client similar to Outlook or Gmail;

AntennaPod is used to listen to podcasts and is similar to more popular podcast players.

In general, the likelihood of finding soft hang bugs (or any other type of software bug) in

apps that are not well tested is higher than in well-tested, mature apps. Among the

apps in Table 2.5, 50% have less than 10,000 downloads, 37% have between 50,000 and

100,000 downloads, and 13% have more than 1,000,000 downloads. Thus, the majority

of the apps where we have found soft hang bugs are not well-tested. In fact, Hang Doctor

can be a powerful tool for the inexperienced developers of apps that need more testing and

improvements, so that they can have higher chances of success.

As summarized in Table 2.5, Hang Doctor has identified 34 new soft hang bugs that were previously unknown to their developers: 68% of the soft hang bugs found by Hang

Doctor are missed by PerfChecker, i.e., the state-of-the-art offline detection algorithm [48],

because the root causes were new unknown blocking APIs. In addition, all the known soft

hang bugs detected by PerfChecker that manifest with a soft hang are diagnosed by Hang

Doctor (see discussion in Section 2.3.6 for the bugs that did not cause soft hangs). When

App Name (# Downloads)   Commit #    Category          Issue ID   BD (MO)
AndStatus (1K+)          49ef41c     Social            303        3 (2)
DashClock (1M+)          7e248f7     Personalization   874        1 (0)
CycleStreets (50K+)      2d8d550     Travel & Local    117        4 (3)
K9-mail (5M+)            ac131a2     Communication     1007       2 (2)
Omni-Notes (50K+)        8ffde3a     Productivity      253        3 (3)
OwnTracks (1K+)          1514d4a     Travel & Local    303        1 (0)
QKSMS (100K+)            2a80947     Communication     382        3 (3)
StickerCamera (5K+)      6fc41b1     Photography       29         3 (0)
AntennaPod (100K+)       c3808e2     Media & Video     1921       3 (2)
Merchant (10K+)          c87d69a     Business          17         1 (1)
UOITDC Booking (100+)    5d18c26     Tools             3          2 (2)
Sage Math (10K+)         3198106     Education         84         3 (2)
RadioDroid (10+)         0108e8b     Music & Audio     29         2 (1)
Git@OSC (10K+)           bb80e0a95   Tools             89         1 (1)
Lens-Launcher (100K+)    e41e6c6     Personalization   15         1 (0)
SkyTube (5K+)            3da671c     Video Players     88         1 (1)
Total                                                             34 (23)

Table 2.5: Apps tested with Hang Doctor that have shown soft hang problems (114 apps tested in total). The “Commit Number” refers to the master version at the time of our experiments. BD is the number of Bugs Detected by Hang Doctor and MO shows how many of them are Missed by a state-of-the-art Offline detection tool [48]. The web-links to the “Issue IDs” and more details are available on our website [5].

Hang Doctor detected a soft hang bug, we opened an issue with the app’s developers on

GitHub (Issue ID in Table 2.5). For all those issues that have received a reply (62% of the

detected soft hang bugs), the developers have confirmed the problem and, in many cases,

have fixed it with a new release of the app (e.g., AndStatus bug 303, K9-mail bug 1007).

Some of the opened issues (38% of the cases) did not receive feedback from the developers

(e.g., Cyclestreets bug 117). However, we verify the correctness of the detected soft hang

bug by fixing it ourselves and testing the modified app. In all the cases, the modified appdid

not show any more soft hangs.

Detected New Blocking APIs. Hang Doctor supplements existing offline detection

tools by identifying APIs that are not known as blocking (68% of the cases). These soft hang

38 bugs can be missed by the offline algorithms and thus may cause soft hangs at runtime. For

example, one of the soft hang bugs detected in K9-mail manifests for an API named clean

from a third-party library named org.HtmlCleaner. This API is used to parse HTML content when some emails are opened by users. For particularly heavy pages, this operation causes

the app to have a response time longer than 1.3s. Two of the three detected soft hang bugs

in Sage Math manifest for an API called toJson from the library com.google.gson, which

is used to serialize a specified object into its equivalent Json (JavaScript Object Notation)

representation. The serialization lasts about one second for particularly large objects. All

these previously unknown blocking APIs detected with Hang Doctor can be added to the

database of known blocking APIs, so that they can improve the detection performance of

offline algorithms.

Other Detected Bugs. In a few other cases, i.e., 11 out of 34 (32% of the cases), the

soft hang bugs detected by Hang Doctor are caused by well-known blocking APIs such as bitmap

decode or database operations, thus they can be detected by the offline algorithm. However,

Hang Doctor can be useful also in these cases in two ways. First, some developers may

simply choose to ignore soft hang bugs detected offline because they underestimate their

impact at runtime. For example, the developer of AndStatus (issue 303) has initially argued

that the blocking API BitmapFactory.decodeFile would not cause many problems since it would be rarely executed. However, Hang Doctor has reported that this blocking API has

frequently caused soft hangs of 600ms every time the timeline of AndStatus is scrolled. As

a result, the developer has promptly fixed the issue and released a more responsive version

of the app. Second, in 3 out of these 11 cases (Owntracks, Sage Math, Lens-Launcher), the

call to the well-known blocking API is nested within a library API used on the main thread.

For example, one of the three soft hang bugs detected in Sage Math has a call to the API

get from a third-party library named cupboard, which is not known as blocking. However,

this library API hides the execution of a database operation (insertWithOnConflict). As

discussed in Section 2, the source code of some libraries may be unavailable or encrypted,

and thus soft hang bugs could be missed by offline tools. By detecting soft hang bugs while

they occur at runtime, Hang Doctor is able to detect the root causes of any soft hangs, even when the bug is nested within a library API whose code cannot be analyzed offline.

These results confirm that Hang Doctor can effectively help developers improve the

responsiveness of their apps.

2.3.3 Example Runtime Hang Bug Detection

In this section, we show how Hang Doctor detects soft hang bugs with an example app:

K9-Mail. First, we focus on a particular user action to explain how Hang Doctor finds the

root cause of a soft hang. Then, we show how Hang Doctor changes the action state when

multiple actions are executed.

Finding Root Cause of the Soft Hang. Figure 2.6 shows how Hang Doctor detects

a soft hang bug. Specifically, Figure 2.6(a) shows the activity of S-Checker for the user

action Open email. It shows 1) a shadowed area that highlights the response time and time

period when an input event of this action has a soft hang and 2) the context-switch of main

thread and render thread during the action execution. Note, the other two performance events

collected are not shown because they are less meaningful for this specific case.

[Figure 2.6(a) plot omitted: response time of the action and context-switch counts of the main thread and render thread over time; the accumulated context-switch difference is +799.]

[Figure 2.6(b): extract of the Stack Traces (ST) collected by Diagnoser; each line shows the app source code file name and code line, followed by the root cause of the soft hang; action response time: 1300 ms]
[ST 1] sanitize(HtmlSanitizer.java:25) -> org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
[ST 2] sanitize(HtmlSanitizer.java:25) -> org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
[ST 3] sanitize(HtmlSanitizer.java:25) -> org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
....
[ST 60] sanitize(HtmlSanitizer.java:25) -> org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:371)
[ST 61] setText(MessageWebView.java:129) -> loadDataWithBaseURL(WebView.java:946)
[ST 62] setText(MessageWebView.java:129) -> loadDataWithBaseURL(WebView.java:946)

Figure 2.6: (a) Execution trace of a user action with K9-mail. One of the input events related to the action has a soft hang (shadowed area). S-Checker, at the end of the action execution (i.e., at time 3.1s), finds a positive context-switch difference, i.e., there may be a soft hang bug. (b) At the next execution of the same action, Diagnoser collects the Stack Traces (ST) during the soft hang to find the root cause operation: clean API, code line 25 of HtmlSanitizer.java.

The user action Open Email has never caused a soft hang before, thus it has an initial state of Uncategorized and is analyzed by S-Checker. One of the input events executed for this action has a response time of 1.3s (i.e., from time 0.45s to 1.75s), which is much longer than the 100ms human-perceivable delay. At the end of the action execution (i.e., at time

3.1s), S-Checker reads the performance event counters and finds a positive context-switch

difference. Thus, S-Checker determines that there could be a potential soft hang bug and

transitions that action to Suspicious for further diagnosis. When this action causes another

soft hang (similar to the soft hang shown in Figure 2.6(a)), Diagnoser collects stack traces

during the soft hang manifestation. Figure 2.6(b) shows an extract of the collected stack

traces. Diagnoser examines them and determines 1) the root cause API, 2) the file name

and the code line in the app source code containing the bug. The API clean has a high occurrence factor (i.e., 96%, see Section 2.2.4.1) and is not a UI-API, thus it is determined to be a soft hang bug. As a result, Diagnoser transitions the action to the Hang Bug state.

[Figure 2.7 plot omitted: response time, page-fault difference (main thread - render thread), and page-fault threshold over time for the Folders and Inbox actions, with the Hang Doctor component (S-Checker or Diagnoser) examining each execution and the state transitions annotated.]

Figure 2.7: S-Checker and Diagnoser use action states (U for Uncategorized, S for Suspicious, H for Hang bug, N for Normal) to minimize the overhead of collecting stack traces for soft hangs caused by UI-APIs.

Action State Transitioning. Hang Doctor transitions actions to several states to mini- mize the stack trace collection overhead during soft hangs caused by UI operations. Figure

2.7 shows an example trace of K9-mail with two different actions (Folders and Inbox) that lead to soft hangs caused by UI-APIs, i.e., they are not soft hang bugs. The figure shows the response time of the actions (shadowed areas), and, when collected, the page-fault difference between the main thread and render thread (the other two event counters are less meaningful for this specific case, thus we do not show them). The bottom of the figure describes 1)Hang

Doctor’s component examining each action execution, 2) the action name, 3) the root cause of the soft hang (e.g., UI-API) and the action state update decision (U for Uncategorized, S for Suspicious, H for Hang bug, N for Normal).

As Figure 2.7 shows, when the user opens for the first time the Folders menu at time

0.2s, a soft hang of 305ms occurs. However, S-Checker finds that the page-fault difference is negative, which is lower than its threshold (dashed red line) and correctly determines that

this soft hang is caused by a UI-API. As a result, it transitions the action to the Normal state

so that Hang Doctor does not check this action in future executions, e.g., 6.3s and 15.4s in

Figure 2.7. In contrast, when the user opens the Inbox at 2.3s, S-Checker finds a soft hang

of 350ms and a page-fault difference above the threshold, i.e., it is a false positive. Thus,

S-Checker transitions this action to the Suspicious state for a deeper diagnosis. When the

user executes again this action at 10.7s, Diagnoser collects the stack traces and finds that

the root cause is indeed a UI-API. As a result, Diagnoser transitions this action to Normal,

so that it does not cause unnecessarily high overhead for stack trace collection in future

executions, e.g., at 18.4s (see Section 2.3.5 for more results).

2.3.4 Detection Performance Comparison

Ideally, to ensure best detection performance and lowest overhead, all and only the soft

hangs with soft hang bugs are traced with stack traces. Thus, here we count true positives,

false positives, and false negatives by counting the soft hangs caused by soft hang bugs and

UI operations that are actually traced. Note, based on the apps in Table 2.5, in Section 2.2.3.1 we have designed Hang Doctor with a training set, which includes only the well-known soft

hang bugs that are not missed offline. Here, we test Hang Doctor using the validation set, which includes a different set of soft hang bugs that are missed offline. First, we study how

S-Checker detects the new soft hang bugs. Then, we compare the detection performance of

Hang Doctor with the baselines.

Table 2.6 lists all the soft hang bugs in the validation set. Hang Doctor monitors

performance events to detect previously unknown soft hang bugs. For each app, we report

how many bugs in the validation set are detected with each one of the three performance

events monitored, i.e., context-switches, task-clock, and page-faults. As Table 2.6 shows,

                          # of Bugs Detected with
App Name        New Bugs  Context-Switches  Task-Clock  Page-Faults
AndStatus          2             1              -            1
CycleStreets       3             3              -            -
K9-Mail            2             2              2            2
Omni-Notes         3             -              -            3
QKSMS              3             3              3            -
AntennaPod         2             2              2            -
Merchant           1             1              -            -
UOITDC Booking     2             2              2            2
SageMath           2             2              2            2
RadioDroid         1             -              -            1
GIT@OSC            1             1              -            -
SkyTube            1             1              1            1
Total             23            18             12           12

Table 2.6: S-Checker uses three performance events, i.e., context-switches, task-clock, and page-faults, to find soft hang bugs. The 23 New Bugs are those from Table 2.5 that were previously unknown to be soft hang bugs, i.e., missed offline. All the new soft hang bugs are correctly recognized by at least one of the three event counters.

Hang Doctor correctly recognizes all the 23 unknown soft hang bugs. In particular, 18 bugs

out of 23 are recognized with the context-switch counter, and 12 out of 23 with the task-

clock and page-fault counters. Thus, similar to the results observed in Section 2.2.3.1, the

context-switch counter is the most correlated with the soft hang bugs. However, using only

this event counter would miss 5 new soft hang bugs in these tests, i.e., 1 bug in AndStatus,

3 in Omni-Notes, and 1 in RadioDroid, which are detected with the page-fault counter.

These results demonstrate 1) the effectiveness of Hang Doctor in recognizing soft hang

bugs not included in the training set and 2) the importance of using several performance

event counters in S-Checker.

Figures 2.8(a) and 2.8(b) summarize the comparison with the baselines of true positives

and false positives, respectively. The results are normalized to the TI baseline, which collects

stack traces for all the soft hangs, thus it does not have false negatives (see Section 2.3.6 for

[Figure 2.8 plots omitted: (a) True Positives and (b) False Positives, normalized to TI, for AndStatus, CycleStreets, K9-mail, Omni-Notes, UOITDC Booking, and their average.]

Figure 2.8: Detection performance normalized to the Timeout-based (TI) baseline, which does not have false negatives. Hang Doctor (HD), different from the baselines, (a) traces most of the real soft hang bugs every time they manifest (the few false negatives are only due to the initial filtering activity of S-Checker) while (b) pruning most of the false positives.

a discussion about the false negatives that have never manifested with a soft hang). In order

to ensure a fair comparison, we use the same app user traces to test Hang Doctor and the

baselines. Due to space limitation, we report only the results of some representative apps.

Similar results are obtained with the rest of the apps listed in Table 2.5. CycleStreets, as we verify by comparing the number of true positives for all the apps in Figure 2.8(a), has the

lowest number of true positives. This is because this app includes map loading operations

that may cause high resource utilization on the main thread. Thus, the UT baselines may

not be able to distinguish soft hang bugs from UI operations. As Figure 2.8(a) shows, Hang

Doctor increases the true positives by relying on performance events, e.g., 66% more

true positives compared to UTH+TI. On average, across the various apps, Hang Doctor

traces 80% of the true positive soft hangs and, as Figure 2.8(b) shows, less than 10% of

the false positive soft hangs. The false negatives for Hang Doctor are due to the initial

filtering activity of S-Checker. However, all the soft hang bugs are correctly traced with

[Figure 2.9 plot omitted: resource usage overhead (%) of each algorithm for AndStatus, CycleStreets, K9-mail, Omni-Notes, UOITDC Booking, and their average.]

Figure 2.9: Hang Doctor achieves low overheads, while having high detection performance at the same time.

Diagnoser in the subsequent executions of those actions. None of the other baselines achieve a high true positive count and a low false positive count at the same time for all the apps.

For example, UTL detects all the soft hang bugs but traces from 8 to 22 times more false positives compared to TI. UTH has a near-zero false positive count but misses 62% of the soft hang bugs. The combinations UTL+TI and UTH+TI achieve a lower false positive count compared to UTL and UTH, respectively. However, they cannot achieve the high detection performance of Hang Doctor because they do not use performance events and do not transition actions across states to lower the false positives.

2.3.5 Overhead Analysis

Here, we compare the resource usage overhead of Hang Doctor with that of the baselines.

Specifically, for each trace, we measure the CPU and memory access (from the stat and io files, respectively, available in the /proc/PID filesystem) before and after the execution of a trace without Hang Doctor (or a baseline). Then, we repeat the measurements when

Hang Doctor (or the baseline) executes and calculate the percentages of CPU and memory

increase. The resource usage overhead is calculated as the average between the percentage

CPU overhead and the percentage memory overhead.
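As an illustration, the raw counters could be read as in the sketch below, assuming the standard Linux layout of /proc/&lt;pid&gt;/stat (where utime and stime are the 14th and 15th fields) and /proc/&lt;pid&gt;/io; the class and method names are ours.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch of reading the per-process CPU and I/O counters used for the
// overhead measurement from the proc filesystem.
final class ProcReaderSketch {
    // Sum of utime and stime (fields 14 and 15 of /proc/<pid>/stat), in clock ticks.
    static long readCpuTicks(int pid) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/" + pid + "/stat"))) {
            String line = r.readLine();
            // The process name (field 2) is wrapped in parentheses and may contain spaces,
            // so parse the numeric fields after the closing parenthesis.
            String[] fields = line.substring(line.lastIndexOf(')') + 2).split(" ");
            return Long.parseLong(fields[11]) + Long.parseLong(fields[12]); // utime + stime
        }
    }

    // Bytes read and written by the process, from /proc/<pid>/io.
    static long readIoBytes(int pid) throws IOException {
        long total = 0;
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/" + pid + "/io"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("read_bytes:") || line.startsWith("write_bytes:")) {
                    total += Long.parseLong(line.split("\\s+")[1]);
                }
            }
        }
        return total;
    }
}
```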

Figure 2.9 shows the overhead comparison between the baselines and Hang Doctor.

UTL and UTH have about 25% and 10% overhead on average, respectively, because they

need to periodically sample the resource utilizations. In addition, UTL frequently collects

stack traces because it has many more false positives than UTH, which further increases the

overhead. TI instead, on average, has more false positives than UTH but collects stack traces

only when the app has a response time longer than 100ms without need of periodically

measuring the resource utilizations. Thus, TI has a lower overhead, i.e., 2.26% on average.

UTH+TI is the algorithm with the lowest overhead (about 0.58%) but, as discussed in Section

2.3.4, it misses most of the soft hang bugs. Hang Doctor has a high detection performance

and a 0.83% overhead, which is slightly higher than that of UTH+TI, but 63% lower than

that of TI. These results demonstrate the efficiency of collecting performance-event data

rather than resource utilizations to prune false positives and reduce the overall overhead.

Hang Doctor also has a negligible impact on apps’ code size, energy consumption, and

responsiveness.

2.3.6 Alternative Approaches and Limitations

Hang Doctor finds soft hang bugs in the wild while users interact with the apps. An

alternative approach would be to run Hang Doctor on a test bed of smartphones where user

inputs are automatically generated by tools such as Android’s Monkey and MonkeyRunner.

The main advantage of this approach is that soft hang bugs could be detected before they

cause problems on user devices. In addition, in a test bed environment, smartphones can

be easily connected to external power, thus the overhead of Hang Doctor would not be an

important concern. As a result, the second phase of Hang Doctor may be sufficient in a test

bed because Trace Analyzer can discard most of the false positives by reading the stack

traces collected during all the soft hangs. However, note that such test beds often cannot

completely recreate the real environment of apps in the wild, which may cause some soft

hang bugs to never manifest. As a result, soft hang bugs could still be missed in the test bed

and Hang Doctor would still need to run in the wild.

Hang Doctor has four possible limitations.

First, under special conditions, e.g., a soft hang bug within an action that has some

heavy render thread operations, none of the conditions described in Section 2.2.3.1 may be verified, which leads to possible false negatives. However, in our experiments, we have not yet encountered such cases. We plan to address this issue in our future work.

Second, some soft hang bugs may never manifest at runtime with a soft hang. Due to

its runtime detection nature, Hang Doctor will miss these soft hang bugs. However, the

user would also not experience any responsiveness problems in such cases. Thus, we can

consider these missed bugs as benign false negatives. Note that the false negatives due to

unknown soft hang bugs are challenging to identify if they never cause a soft hang. We plan

to address this issue in our future work.

Third, Hang Doctor may miss occasional hang bugs in user actions that have previously

caused a false positive and thus are in the Normal state. Although in our experiments all

the known occasional soft hang bugs were diagnosed as soon as they manifest with a soft

hang, in order to handle such situations, Hang Doctor periodically resets Normal events to

Uncategorized, so that they can be analyzed again.

Fourth, the training set size used for the correlation and sensitivity analyses in Section

2.2.3.1 is limited due to the limited number of known soft hang bugs. We plan to repeat the

analysis with a larger training set when more soft hang bugs are reported.

2.4 Related Work

Recent research has proposed a variety of strategies to improve apps’ performance

[63,66,68,72]. For example, some studies [9,29,87] propose to offload the computation-

intensive tasks of an app to the cloud. A few other studies help developers improve their apps’

performance by identifying the critical path in user transactions [65, 91] or by estimating

apps’ execution time for given inputs [47]. Different from these studies, we focus on

detecting soft hang bugs.

Offline Detection. A widely adopted approach to diagnosing programming issues in

software is the offline analysis of source code. For example, many studies [15,42,55,56,85]

focus on helping developers find the app performance bottleneck (e.g., inefficient loops).

Huang et al. [37] propose to help developers identify programming issues across different

app commits. The most closely related work is offline soft hang bug detection [48,76,88], which proposes offline algorithms to automatically detect soft hang bugs by searching the

app code for well-known blocking APIs. In contrast, Hang Doctor detects and diagnoses soft

hangs at runtime in order to address the limitations of offline detection discussed in Section

2.

Runtime Detection. A variety of runtime approaches have also been proposed to

address responsiveness problems. Many proposed runtime algorithms for server/desktop

software [12–14,19,26,70,71,83] are not suitable for smartphone apps mainly because of

their relatively high overheads. Some studies [59,93] profile various resource utilizations

(e.g., CPU time, Memory Access) during bug-free runs of the application and use static

thresholds to detect responsiveness problems caused by correctness bugs. Other solutions

detect software failure due to concurrency bugs [4] or diagnose synchronization bugs [1].

Different from these approaches, Hang Doctor is designed to detect a different type of bug,

i.e., soft hang bugs, on smartphone apps.

Some research [6,7,10,45,54,89], similar to the ANR detection tool of Android [32],

detects soft hangs in software by monitoring the response time of user actions. The main

limitation of these timeout-based approaches is that they can lead to large numbers of false

positives and negatives. Pradel et al. [60] propose in-lab test case generation to detect a

sequence of actions whose execution cost gradually increases with time, but this solution

is not designed to work in the wild to detect soft hang bugs. In addition to their offline

solutions, Wang et al. [76] propose to allow users to force-terminate the currently executing

job during a soft hang, but, different from Hang Doctor, they do not diagnose its root cause.

Another important feature of Hang Doctor is its two-phase algorithm that balances

detection performance and overheads. Some proposed approaches [18,53] also attempt to

balance monitoring performance and logging overhead in the wild. However, different from

Hang Doctor, they either do not detect soft hang bugs [53] or perform only timeout-based

detection, without pinpointing the exact blocking operation that causes the soft hang [18].

Chapter 3: SURF: Supervisory Control of User-Perceived Performance for Mobile Device Energy Savings

The quality of experience of mobile users is mainly influenced by the user-perceived

performance of apps (e.g., response time of touches lower than 100ms, frame rate of videos

higher than 30fps [94]) and the battery life of the mobile device. Unfortunately, these two

objectives often conflict with each other. Therefore, it is of primary importance to properly

allocate the available computing resources for the best tradeoff between performance and

energy consumption.

Recently, many studies have proposed to improve mobile devices’ user experience and

energy efficiency. Sadly, existing solutions have at least one of the following two limitations.

First, most of those solutions (e.g., [8, 21, 27, 40, 79, 94]) focus solely on a single foreground

app to tradeoff performance and energy consumption. Hence, they may not efficiently

handle the future trend of having multiple foreground apps concurrently executing. For

example, similar to the split-view mode already available in many desktop/laptop OSs (e.g.,

Windows and macOS [3]), now a smartphone user can play a video while navigating posts

on Facebook or chatting with a friend, by exploiting the split screen [96], Picture in Picture

(PiP) [16], or freeform windows [17]. In addition, users can transform the smartphone into

a large-screen laptop [51, 52, 67, 81, 84], which makes it even easier to execute multiple

concurrent apps. Applying single-app solutions to concurrent executions may result in a

performance imbalance among the apps that can cause either an increase of CPU frequency

and energy consumption, or poor performance for at least one of the executing apps (see

Section 3.2 for motivation results). Second, many solutions focus on regulating the app

performance on a periodic basis [40, 79]. However, such solutions conflict with the

aperiodic nature of user actions (e.g., button click) on mobile devices. It is indeed possible

for those solutions to have a regulating period long enough (e.g., several seconds) to sample

the execution of multiple user actions, but this may lead to a poor responsiveness.

In this chapter, we present SURF, Supervisory control of User-perceived peRFormance.

Different from previous work, SURF is a 2-level solution that is systematically designed to

overcome the above two limitations. Specifically, SURF 1) dynamically allocates resources

to concurrent apps for balanced app performance, 2) uses supervisory control theory to

handle the aperiodicity of user actions.

In order to achieve performance balancing, app priorities are adjusted to impact the

computing resource (CPU, memory) allocated to each app (see Section 3.2). Once the app

performance is balanced, SURF manipulates CPU DVFS (dynamic voltage and frequency

scaling) to ensure that the user-perceived performance metrics of all the app events (both

periodic and aperiodic) stay close to their desired values. Note that if DVFS is applied to a

CPU that has app performance imbalance, either some app may fail to achieve the desired

performance (CPU frequency too low) or the CPU may have high energy consumption

(CPU frequency set too high unnecessarily). A key advantage of SURF is that it is designed

rigorously based on supervisory control theory, which provides analytical stability and

performance guarantees compared to heuristic solutions (e.g., [94]). Although control theory

has already been adopted for better performance management (e.g., [40]), such time-driven

control approaches must execute periodically, which makes it hard to handle the aperiodic

nature of user actions on mobile devices. In contrast, a supervisory controller is designed to activate itself only when the controlled variable is outside a desired region, and thus provides a more efficient solution for aperiodic app events.

Due to their different overheads and timing requirements, SURF features a two-loop architecture that manipulates app priorities and CPU DVFS at different time scales. Particu- larly, SURF must first balance the app performance, before it can control the performance of all the apps with CPU DVFS. Thus, SURF uses the outer loop to manipulate CPU DVFS at a longer time scale than the app priorities, which are manipulated in the inner loop. The two loops are coordinated to achieve the desired performance, control stability, and accuracy, as discussed in Section 3.3.

In summary, this chapter makes the following contributions:

• Despite a large body of related work on mobile device performance and power man-

agement, we identify that existing solutions still have at least one of two limitations:

1) they cannot efficiently manage concurrent foreground apps, thus leading to energy waste,

and 2) they cannot handle the aperiodic nature of user actions.

• In order to address those challenges, we design SURF, a 2-level architecture

that overcomes the two limitations: 1) performance balancing of concurrent apps at

the finest time grain and 2) supervisory control to handle aperiodicity. To the best of our

knowledge, this is the first work that uses supervisory control theory to control the

user-perceived performance of mobile devices.

• We prototyped SURF on real smartphones with several real-world apps. SURF

achieves 30-90% CPU energy savings compared to state-of-the-art solutions.

The chapter is organized as follows. Section 3.1 reviews the related work. Section 3.2 gives background and motivation. Section 3.3 reports the design of SURF and Section 3.4 discusses the experimental results.

3.1 Related Work

Recent research studies have proposed to reduce the energy consumption of mobile devices by offloading code executions [11], optimizing network data transfers [61, 69], detecting energy bugs in apps [58], adjusting the display brightness [2], resolution [36], and colors [20]. Different from these studies, we ensure good user-perceived performance while minimizing the CPU energy consumption.

DVFS has been widely used to control the CPU performance and power consumption [57,

62,64,73,90]. Recent studies [35,94,95] have proposed to reduce the energy consumption of mobile apps while maintaining good user-perceived performance. However, these solutions adopt heuristic algorithms. As we show later, a key advantage of control-theoretic designs is to provide fast convergence and analytical guarantees of control accuracy. Time-driven control theory has been used to regulate the performance on a periodic basis to a desired target [23,40,77,79]. However, this approach may not be efficient for aperiodic jobs such as user actions. In addition, most of the above solutions may not efficiently allocate resources for concurrent apps.

Operating systems implement a scheduler and a load balancer to distribute the resources among concurrent apps based on static app priorities [22,75,78,86]. In order to ensure good user experience, several proposed algorithms [38,39,49,92] detect the threads that influence the user-perceived performance and then maximize their priorities to reduce the interference from background processes. Different from the above studies that focus mainly on one app,

we use app priorities to influence all the concurrent apps, such that they can all meet their

user-perceived performance targets while reducing the CPU energy consumption.

3.2 Background and Motivation

3.2.1 Background

User-Perceived Performance Metrics. The user-perceived performance in mobile

devices is evaluated, in most of the cases, by either response time or frame rate. For example,

apps such as Gmail and WhatsApp are mostly interaction-oriented, e.g., open an email,

send a message. For these apps, the user perceives good performance when the interaction

response time is shorter than 100ms [25]. Other apps such as Netflix and YouTube are

mostly throughput-oriented because users perceive good performance when the average

frame rate is higher than 30fps [94]. Therefore, we mainly use these two metrics, but SURF

can be extended to handle other metrics.

Thread Priority. Thread priority is used in many CPU schedulers to allocate resources

among competing jobs [38, 50]. In general, for non-real-time threads, there are 40 priority

levels ranging from 0 to 39, where 0 is the highest priority and 39 is the lowest one. Android

assigns to each foreground app a static default priority level,4 i.e., 20 (the middle of the range).

Each priority level is mapped to a static weight for scheduling decisions. Particularly, the

Completely Fair Scheduler (CFS, default in Android) allocates CPU time among runnable

threads through the concept of per-thread virtual runtime (VR), which is calculated as:

VR(\tau_i, t) = \frac{w_0}{W(\tau_i)} \, A(\tau_i, t) \qquad (3.1)

4 In the rest of the chapter, we refer to the app priority as the priority assigned to all the app's threads that influence the user-perceivable performance [92].

where w0 is the weight of the default priority, W(τi) is the weight of the priority level

assigned to thread τi, and A(τi,t) is the CPU time allocated to thread τi until time t. The thread virtual runtime is directly proportional to the CPU time and inversely proportional to the

priority (a higher priority has a higher weight). Thus, among threads with the same CPU

time, the thread with the highest priority has the lowest virtual runtime. Fairness is achieved

by ensuring the same virtual runtime for all the threads on a core: CFS executes the thread with the lowest virtual runtime and calculates the time-slice duration proportionally to

the thread’s priority. Thus, CFS executes higher-priority threads sooner and with longer

time-slices than other lower-priority threads [38].
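To make Equation (3.1) concrete, the following Python sketch computes per-thread virtual runtimes from allocated CPU time and priority. The weight() mapping below is a hypothetical monotone stand-in for the kernel's actual nice-to-weight table, so the numbers are illustrative only.

# Minimal sketch of CFS-style virtual runtime accounting (Equation 3.1).
# The weight mapping is hypothetical; the real kernel uses its own table.

DEFAULT_PRIORITY = 20  # Android's static default for foreground app threads

def weight(priority: int) -> float:
    # Hypothetical monotone mapping: higher priority (lower number) -> larger weight.
    return 1024.0 * (1.25 ** (DEFAULT_PRIORITY - priority))

W0 = weight(DEFAULT_PRIORITY)  # weight of the default priority level

def virtual_runtime(cpu_time: float, priority: int) -> float:
    """VR(tau_i, t) = (w0 / W(tau_i)) * A(tau_i, t)."""
    return (W0 / weight(priority)) * cpu_time

# Among threads with equal CPU time, the higher-priority thread has the
# smaller virtual runtime, so CFS picks it to run first.
print(virtual_runtime(10.0, priority=0))   # high priority -> small VR
print(virtual_runtime(10.0, priority=39))  # low priority  -> large VR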

In a multicore environment, CFS maintains a queue of runnable threads for each active

core. A load balancer moves threads across cores [50] to balance the cores’ load. Each

runnable thread running on a core contributes to the core’s load. The thread’s load is

calculated proportionally to 1) the thread utilization and 2) the weight associated with the

thread's priority [50]. As a result, the priority of a thread also influences the decisions of the

load balancer.
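As a rough illustration of why priorities also steer the load balancer, the sketch below computes a per-core load from per-thread utilizations and priority weights; the formula and weight mapping are simplified assumptions, not the kernel's exact implementation.

# Sketch of per-thread load as seen by a CFS-style load balancer: the
# contribution grows with both the thread's utilization and the weight of its
# priority, so higher-priority threads "look heavier" on their core.

def weight(priority: int, default: int = 20) -> float:
    # Same hypothetical mapping as in the previous sketch.
    return 1024.0 * (1.25 ** (default - priority))

def thread_load(utilization: float, priority: int) -> float:
    # utilization in [0, 1]
    return utilization * weight(priority)

def core_load(threads) -> float:
    """threads: iterable of (utilization, priority) pairs queued on one core."""
    return sum(thread_load(u, p) for u, p in threads)

# Raising an app's priority inflates its core's load, which nudges the
# balancer to place other work (e.g., Surface Flinger) on a different core.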

3.2.2 Motivation

Here, we first define the metric that we use to compare the performance of different apps.

We then study how and why the app priorities affect performance balancing.

Relative Performance. In order to balance the performance of different types of apps, we define a metric called relative performance. The relative performance for an interactive

event of an interaction-oriented app is calculated as RT(k)/RTdes, where RT(k) is the

response time of the event at the kth execution and RTdes is the desired value, e.g., 100ms.

The relative performance of a throughput-oriented app is calculated as FRdes/FR(k), where

FRdes is the desired frame rate, e.g., 30fps, and FR(k) is the average frame rate measured

during the kth time interval. Therefore, a relative performance higher than one indicates

poor user-perceived performance and a relative performance close to zero indicates energy waste. In contrast, a relative performance value of one indicates the ideal case (see Figure

3.1): good user-perceived performance at the minimum energy cost.
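The two relative-performance formulas translate directly into code. The sketch below uses the 100ms and 30fps targets mentioned above; the example values match the imbalance discussed in the motivation that follows (178ms response time, 60fps frame rate).

# Sketch of the relative performance metric: values above one mean the
# desired performance is missed; values well below one mean the CPU is faster
# than necessary (energy waste); one is the ideal operating point.

def relative_perf_interactive(response_time_ms: float,
                              desired_ms: float = 100.0) -> float:
    return response_time_ms / desired_ms        # RT(k) / RT_des

def relative_perf_throughput(frame_rate_fps: float,
                             desired_fps: float = 30.0) -> float:
    return desired_fps / frame_rate_fps         # FR_des / FR(k)

print(relative_perf_interactive(178.0))  # 1.78 -> too slow
print(relative_perf_throughput(60.0))    # 0.50 -> faster than needed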

Experiment Setup. We test multiple smartphones with the latest Android version (e.g.,

Nexus 5, Galaxy S3 with Android 7.1.1). We set 100ms response time and 30fps frame

rate as the desired performance [94] for interaction-oriented and throughput-oriented apps,

respectively. We use the scenario of a throughput-oriented app, i.e., ExoPlayer [30], and

an interaction-oriented app, i.e., Rocket.Chat [30] in split screen (see Section 3.4 for other

apps). We fix the CPU frequency and the number of active cores to eliminate their impacts

and test only the app priorities. Then, while ExoPlayer plays a local video, we replay a 20s

trace during which we inject user actions on the device to send 10 messages in Rocket.Chat

and average the results.

Advantages of Priority Management. App performance imbalance can either cause

some apps to violate the performance requirement (CPU frequency too low) or lead to high

CPU energy consumption (CPU frequency set too high unnecessarily). Here, we test how

app priorities can be used for performance balancing in two general cases: the apps run on

1) the same core and 2) different cores.

Figure 3.1(a) shows the response time of Rocket.Chat, the frame rate of ExoPlayer,

and the relative performance for various priorities when the two apps run on the same

core at 1,026MHz. With the default priority, i.e., 20 for both apps, Rocket.Chat has

poor performance with an average response time of 178ms (i.e., 178/100=1.78 relative

performance) while ExoPlayer has good performance with an average frame rate of 60fps


Figure 3.1: Performance balancing for concurrent apps can be obtained by manipulating the app priorities. User-perceived performance (left) and relative performance (right) of Rocket.Chat and ExoPlayer when the apps run (a) on the same core and (b) on different cores.

(30/60=0.5 relative performance). Thus, the apps’ relative performance is imbalanced. Just

maximizing the priorities of both apps to 0, as proposed in some related work [38,39,49,92],

may improve performance, but cannot completely fix the problem, as shown in Figure 3.1(a).

Maximizing only the priority of Rocket.Chat because of its poor performance (see (0, 20) in

Figure 3.1(a)) leads Rocket.Chat to have a response time below 100ms but ExoPlayer to

have a frame rate below 30fps. In order to get good performance for both apps, a simplistic

solution would be to increase the core's frequency at the cost of a higher CPU energy

consumption. On the other hand, setting ExoPlayer’s priority to 5 or 10 while keeping the

Rocket.Chat’s priority at 0 allows more balanced performance without increasing the CPU

frequency, thus reducing the CPU energy consumption.

Figure 3.1(b) shows the results for two different cores, i.e., one core runs Rocket.Chat

at frequency 702MHz and another core runs ExoPlayer at frequency 1.134GHz (random

frequencies for which one of the two apps has poor performance at default priorities). With

default priorities (i.e., 20), Rocket.Chat has an average relative performance of 0.65, while

ExoPlayer has an unacceptable average relative performance of 2.16. As we explain later,

the app priorities influence the performance balance also when the apps run on different

cores. Similar to the previous case, maximizing the priorities of both apps or just the priority

of the app with poor performance does not lead to a good performance balance. However,

setting Rocket.Chat’s priority to 19 and ExoPlayer’s priority to 11 allows a better balance

and an acceptable user-perceived performance for both apps. As a result, there is no need to

increase the core frequencies.

These results prove that manipulating app priorities can help balance the relative perfor-

mance, such that both apps have their desired performance without having to increase the

CPU frequency. That in turn leads to energy savings.

Why Does Priority Affect the Performance Balance? The priorities of two apps running on

the same CPU core influence the app performance balancing because the apps have to share

the same core time (see Section 3.2.1). However, it is less intuitive why the app priorities

influence the performance balancing when the apps run on different cores. To clarify this

point, we use Android’s systrace tool to record the CPU thread scheduling activities using

the same setup of Figure 3.1(b), i.e., Rocket.Chat on core 0 and ExoPlayer on core 1. The


Figure 3.2: Execution time distribution of Surface Flinger across the two CPU cores running Rocket.Chat (core 0) and ExoPlayer (core 1). Increasing ExoPlayer's priority leads the load balancer to execute Surface Flinger for more time on the core running Rocket.Chat, thus impacting its performance.

key finding is that other system processes scheduled by the load balancer can affect the

performance balancing of the two apps. Here, we examine where the load balancer places a

system process called Surface Flinger (real-time priority), which must execute periodically

to generate the display images. We choose this process for two reasons: 1) it takes up a

big part of the total CPU time and 2) its total CPU time is about the same in the various

experiments.

Figure 3.2 shows how the load balancer distributes Surface Flinger’s execution time

among the two cores for various priority settings of ExoPlayer. When the two apps have the

default priority (i.e., 20), the Surface Flinger’s time is almost equally distributed between the

two cores. However, as Figure 3.2 shows, when we increase the priority of ExoPlayer run-

ning on core 1 (to 15, 10, ..., 0), the portion of Surface Flinger’s execution time increases on

core 0, thus allowing ExoPlayer to achieve better performance while lowering Rocket.Chat’s

performance, as shown in Figure 3.1(b). The results demonstrate that a higher app priority

can effectively cause the default load balancer of Android to schedule less workload on the

same core. This explains why it is possible to balance the app performance by manipulating

their priorities even when the apps run on different cores.

3.3 Design of SURF

Here, we first describe the two-loop architecture of SURF at a high level. Then, we

provide the details of each loop.

3.3.1 Design Overview

SURF aims to ensure good user-perceivable performance and low energy consumption.

To achieve this objective, SURF runs on users’ devices to 1) balance the performance

of concurrent apps for minimized CPU energy consumption and 2) efficiently handle the

aperiodicity of user action events. The main challenge is to ensure high control accuracy and

low overhead. In order to simplify the explanation of the SURF design, in the following, we

consider only two concurrent foreground apps (one interaction-oriented and one throughput-

oriented), which is the maximum number of foreground concurrent apps currently allowed

in mobile devices. We discuss the general case in Section 3.3.4.

In order to achieve the above goals, SURF features a two-level architecture design that

uses 1) app priority adaptation for performance balancing and 2) supervisory control theory

to handle aperiodic app events with theoretically guaranteed accuracy and low overhead.

Figure 3.3 shows the high-level architecture design of SURF. The primary control loop, i.e.,

the inner-loop, maintains performance balancing between the two apps by manipulating app

priorities. Consequently, the apps can have roughly the same relative performance. After the

inner loop enters the steady state, the secondary control loop, i.e., the outer-loop, controls

the apps’ relative performance to a desired value by manipulating CPU DVFS. As a result,


Figure 3.3: High-level design of SURF for a generic event.

the energy consumption can be minimized while guaranteeing the desired performance for all the apps. The outer loop executes at a coarser time scale than the inner loop due to their different overheads and timing requirements.

Inner Loop. As Figure 3.3 shows, the key components of the inner loop are a performance monitor and a performance balancer. During the execution of the two apps, the performance monitor measures the apps' user-perceived performance, calculates the apps' relative error, which quantifies the app performance balancing, and sends the value to the performance balancer. The performance balancer implements closed-loop feedback control based on supervisory control theory. The balancer, at runtime, 1) models the relative error as a function of the app priorities for improved control accuracy compared to offline modeling, and 2) uses this model to dynamically adjust the app priorities for performance balancing, based on the measured relative error.

Outer Loop. The key components of the outer loop are the same performance monitor and a performance controller. When the apps have balanced relative performance after

several invocations of the inner loop, the performance monitor calculates the apps' relative

performance and sends the values to the performance controller. The performance controller

implements a closed-loop feedback controller based on supervisory control theory. At

runtime, the performance controller 1) models the relative performance as a function of

the CPU DVFS level for improved control accuracy compared to offline modeling, and 2) uses

the relative performance model to dynamically adjust the CPU DVFS and control the user-

perceived performance to a desired value for both the apps, based on the measured relative

performance.

Loops Coordination. The two loops can both influence the apps' performance and may

interfere with each other. Thus, we design them to run at different time scales, so that SURF

achieves global system stability as long as each individual loop is proved to be stable: the inner loop

is activated for every execution of an interactive app; the outer loop is activated only when

the inner loop stabilizes and reaches the steady state after a short settling time (i.e., the apps'

performance is balanced after several consecutive balancer invocations).
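The coordination just described can be summarized in an event-driven sketch: the inner loop runs on every interactive-event completion, and the outer loop runs only after the relative error has stayed in its region for several consecutive invocations. The class, method names, and settling threshold below are illustrative assumptions, not SURF's actual implementation.

# Hypothetical sketch of the two-loop coordination. The settling threshold
# and the balancer/controller interfaces are assumptions for illustration.

STEADY_STATE_COUNT = 3  # consecutive balanced invocations before the outer loop acts

class TwoLoopCoordinator:
    def __init__(self, balancer, controller):
        self.balancer = balancer       # inner loop: adjusts app priorities
        self.controller = controller   # outer loop: adjusts CPU DVFS
        self.balanced_in_a_row = 0

    def on_interactive_event(self, relative_error: float, relative_performance: float):
        if self.balancer.inside_region(relative_error):
            self.balanced_in_a_row += 1
        else:
            self.balanced_in_a_row = 0
            self.balancer.actuate(relative_error)           # rebalance priorities
        if self.balanced_in_a_row >= STEADY_STATE_COUNT:
            self.controller.actuate(relative_performance)   # then tune DVFS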

Handling Diverse App Events. Generally, in Android, there are N different interactive

events (e.g., button click, scroll) for an interactive app and only one event for a throughput-

oriented app (e.g., video frames). The resource contention between the two apps mainly

occurs when one of the N interactive events executes concurrently with the throughput event.

Thus, there are N different couples of events to be controlled: each couple is composed

of one of the N interactive events and the common throughput event. Different interactive

events may require different resource allocations to achieve the same performance. Thus,

using only one of the above 2-level control structures for all the event couples may lead to poor control

accuracy (see Section 3.3.2.1).

To ensure high control accuracy, SURF is designed to control each one of the N couples

independently from the others, i.e., every event couple has the above described 2-level

control structure. In addition, every event couple has different models and parameters that

are loaded when the event couple is executed. Specifically, as Figure 3.3 shows, the two

loops of each event couple store the models, parameters, and resource allocation decision in

the event state table, which is read at the beginning of the event couple execution to enforce

the allocation previously calculated by SURF for this couple. Note that only one of the

N control structures is active at each given time. Thus, having N control structures for N

events instead of one controller for all the N events incurs negligible extra overhead to

load/store the control parameters of each control structure in the event tables.
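A per-couple state table of this kind can be as small as a few fields per event couple. The sketch below shows one plausible layout; the field names are illustrative, not the actual data structure used by SURF.

# Hypothetical per-event-couple state table: each (interactive event,
# throughput event) couple keeps its own model parameters and last resource
# allocation, loaded when that couple starts executing.

from dataclasses import dataclass

@dataclass
class EventCoupleState:
    a_re: float = 0.0            # slope of the relative-error model
    b_re: float = 0.0            # intercept of the relative-error model
    a_rp: float = 0.0            # slope of the relative-performance model
    priority_ratio: float = 1.0  # last priority ratio applied
    dvfs_level: int = 0          # last CPU frequency index chosen

event_state_table: dict = {}     # event couple name -> EventCoupleState

def load_state(couple: str) -> EventCoupleState:
    # Only a few numbers per couple, so the memory overhead is negligible.
    return event_state_table.setdefault(couple, EventCoupleState())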

We prototyped SURF and deployed it on real devices. SURF does not require changes

to the apps’ source code. Its energy overhead is less than 1%, thus negligible. The mem-

ory overhead is also negligible because the event state tables store only a few bytes of

information.

3.3.2 Inner Loop: Performance Balancer

The performance balancer manipulates the app priorities to control the relative error and

ensure performance balancing. Because of the discrete nature of the system (e.g., discrete

priority levels), it is not always feasible to regulate the relative error exactly to a set point.

Thus, we design the control system so that it controls the relative error inside a desired region

that centers at the desired relative error. The performance balancer is based on supervisory

control theory, which provides analytical stability and performance guarantees compared

to heuristic solutions [94]. Moreover, different from time-driven control solutions that

must execute periodically [40], a supervisory controller activates only when the controlled

variable is outside the desired region, thus it provides a more efficient solution for aperiodic

app events. In order to design the supervisory controller, we need a model that relates the

relative error to the app priorities.

3.3.2.1 Relative Error Modeling

The relative error RE(k) for the kth execution of the event couple is calculated as:

RE(k) = \frac{RP_{Interactive}(k)}{RP_{Throughput}(k)} \qquad (3.2)

where RPInteractive(k) is the relative performance of the interactive event and RPThroughput(k)

is the average relative performance of the throughput event measured during the kth execution

of the current interactive event. The event couple has balanced relative performance when

the relative error is equal to one. Note that there are two manipulated variables, i.e., the

two app priorities, but only one controlled variable, i.e., the relative error. This fact may

complicate the model and the control design. To solve this problem, we combine the two

app priorities into one manipulated variable, i.e., the priority ratio PR, which is calculated

as P (k) PR(k) = Throughput (3.3) PInteractive(k) where PThroughput(k) is the priority level of the throughput app and PInteractive(k) is the

priority level of the interactive app during the kth execution of the event couple5.

Figure 3.4 shows the relative error samples for various priority ratio values of two

randomly selected interactive events of Rocket.Chat, i.e., Attach a file and open the SideMenu, while ExoPlayer plays a video. All the samples are measured at the same fixed CPU

frequency to eliminate the impact of DVFS. From the figure, we have two observations.

5 Because the priority levels range from 0 to 39, in order to avoid dividing by zero, we map the priority levels into the range 1 to 40 for the control.


Figure 3.4: Relative error for various priority ratio values of two events. It can be described with a linear model.

First, the relative error of the two event couples changes almost linearly with the priority

ratio: the modeling error, calculated as the absolute difference of each relative error sample

from its linear approximation (i.e., the solid and dashed lines in figure), is 0.21 on average

(worst case 0.36) and 0.07 (worst case 0.16) for Attach and SideMenu, respectively. Second,

using one model for all the N event couples would lead to a high modeling error and thus,

poor control accuracy. For example, using one model for the two event couples in Figure

3.4 leads to a 0.47 modeling error on average (worst case 0.89). As a result, to ensure high

control accuracy, each event couple has a linear model as follows:

RE(k) = a_{RE} \, PR(k) + b_{RE} \qquad (3.4)

where aRE and bRE are the estimated slope and intercept of the model. The dynamic relative

error model is as follows:

RE(k + 1) - RE(k) = a_{RE} \, (PR(k + 1) - PR(k)) \qquad (3.5)

where PR(k + 1) is the priority ratio selected for the next execution of the event couple, i.e.,

the manipulated variable.

Runtime Parameter Estimation. There are two main ways to estimate the parameters

of the relative error: offline static modeling and runtime modeling. Offline modeling is

generally preferred because it avoids any runtime modeling overhead. However,

given the different resource demands of different events, offline modeling would require

good offline estimates of the model parameters for nearly every event of every

app, which may be infeasible. As a result, we use runtime modeling. In order to limit

the runtime modeling overhead, the performance balancer collects only a few relative error

samples measured at different priority ratios. Then, it uses these samples to find the slope

and intercept of the linear model, which are then stored in the event state table to be used by

the controller. Generally, three samples are sufficient to accurately model the relative error.

However, these model parameters can be dynamically adjusted if the modeling error is high

(e.g., > 10%).
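A handful of (priority ratio, relative error) samples is enough to fit the line by ordinary least squares and to record the average residual as the modeling error; the sketch below shows one way to do it, with made-up sample values.

# Sketch of the runtime modeling step: fit RE = a_RE * PR + b_RE from a few
# samples and report the average absolute residual as the modeling error.

def fit_linear_model(samples):
    """samples: list of (priority_ratio, relative_error) pairs; three usually suffice."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    a_re = sxy / sxx
    b_re = mean_y - a_re * mean_x
    modeling_error = sum(abs(y - (a_re * x + b_re)) for x, y in samples) / n
    return a_re, b_re, modeling_error

# Hypothetical samples collected at three different priority ratios.
a_re, b_re, err = fit_linear_model([(0.5, 1.6), (1.0, 1.0), (2.0, 0.4)])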

3.3.2.2 Control Design

The main goal of supervisory control is to ensure that the controlled variable, i.e.,

relative error, is inside the desired region. The region is centered on one for balanced relative

performance. The upper and lower bounds are set at a distance from the center calculated as

the average modeling error measured during the modeling phase. The controller is activated

only when the relative error is outside the desired region to restore the performance balance.

Based on the system model (3.5), the balancer selects the next priority ratio as follows:

PR(k + 1) = PR(k) + \frac{g_{RE}}{a_{RE}} \, (RE^{*} - RE(k)) \qquad (3.6)

where gRE is a tunable control gain and RE∗ is the desired set point, i.e., the center of

the desired region. Equation (3.6) is derived from the dynamic model Equation (3.5). By

substituting Equation (3.6) into Equation (3.5), we have that RE(k + 1) = RE∗, assuming a

perfectly linear model (i.e., gRE set to one). Thus, the relative error of the next execution of

the event couple would be the desired value. Unfortunately, because of the unavoidable modeling error

caused by runtime system dynamics (e.g., CPU frequency change), the closed-loop system

could be unstable. We use theoretical analysis to find the control gain gRE that ensures

convergence of the relative error into the desired region despite the modeling error.
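Putting the pieces together, a single inner-loop step looks roughly like the sketch below: do nothing while the relative error is inside the desired region, otherwise apply Equation (3.6). The tolerance and the example numbers are illustrative assumptions.

# Sketch of one supervisory balancer step (Equation 3.6).

def balancer_step(pr_k: float, re_k: float, a_re: float, g_re: float,
                  re_target: float = 1.0, tolerance: float = 0.2) -> float:
    """Return the priority ratio for the next execution of the event couple."""
    if abs(re_k - re_target) <= tolerance:
        return pr_k                                   # inside the region: stay idle
    return pr_k + (g_re / a_re) * (re_target - re_k)  # supervisory correction

# Example with a hypothetical slope (a_re) and a gain chosen as in Theorem 1
# (g_re = 1 / max modeling error, assumed here to be 1.0).
next_pr = balancer_step(pr_k=1.0, re_k=1.78, a_re=-0.8, g_re=1.0)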

3.3.2.3 Theoretical Analysis

Many control theoretic feedback systems use the notion of stability to ensure that the

controlled variable converges to a desired set point. However, given the finite nature of

our manipulated variable (i.e., priority levels), we can only ensure that the relative error

stays close enough to the desired set point. For this reason, we use the notion of practical

stability [46] to guarantee good performance balancing: the closed-loop system is practically

stable if the relative error converges, within a finite time, into a desired region that centers at

the desired set point.

The main challenge for guaranteeing practical stability for the closed-loop system is

that the parameter aRE of model (3.5) is uncertain because of modeling errors. Thus, the

real aRE may be different from the one estimated and used in the controller (3.6). This

uncertainty, if not handled, could lead to the instability of the closed-loop system. Thus, we use theoretical analysis to find the control gain gRE that also guarantees the robustness

property: the closed-loop system is robust if it is practically stable even in the presence of

uncertainties in the controlled system.

Theorem 1. Given an execution of the event couple with a relative error outside the desired

region, if we assume bounded modeling error and bounded relative error, controller (3.6)

with g_{RE} = \frac{1}{\varepsilon_{RE}^{max}} guarantees practical stability for any modeling error lower than \varepsilon_{RE}^{max}, where \varepsilon_{RE}^{max} is the maximum modeling error.

Proof. In order to prove Theorem 1, we first need to prove that, if the current relative error

is too high (or too low), the control Equation (3.6) with g_{RE} = \frac{1}{\varepsilon_{RE}^{max}} guarantees that the next relative error will either fall into the desired region or be too low (or too high). To prove this

statement, we need to verify that the following inequality holds when the control (3.6) is

used (here we consider the case of RE(k) too high, but a similar proof holds for RE(k) too

low):

RE(k + 1) < RE^{*} + e_{RE} \qquad (3.7)

where eRE is the acceptable relative error that defines the bounds of the desired region. Let's

consider the presence of a bounded linearization error \varepsilon_{RE} < \varepsilon_{RE}^{max}. Following Equation (3.5), the real RE(k + 1) value that includes the linearization error can be described as:

RE(k + 1) = RE(k) + \varepsilon_{RE} \, a_{RE} \, \Delta PR(k) \qquad (3.8)

where ∆PR(k) = PR(k + 1) − PR(k). By substituting Equation (3.6) into Equation (3.8), the

Inequality (3.7) becomes:

(1 - \varepsilon_{RE} \, g_{RE}) \, \Delta RE^{*}(k) < e_{RE} \qquad (3.9)

where \Delta RE^{*}(k) = RE(k) - RE^{*} > 0. Let us define \Delta RE^{max} = RE^{max} - RE^{*} > e_{RE}, where

RE^{max} is the recorded upper bound of the relative error for an event. Note that e_{RE} < \Delta RE^{max},

thus Inequality (3.9) can be rewritten as follows:

(1 - \varepsilon_{RE} \, g_{RE}) < \frac{\Delta RE^{max}}{\Delta RE^{*}(k)} \qquad (3.10)

The right element of Inequality (3.10) is ≥ 1. Therefore, for Inequality (3.7) to hold true, we just need the left element of Inequality (3.10) to be strictly lower than 1. To make sure

of this, the gain g_{RE} should be lower than \frac{1}{\varepsilon_{RE}}, which is verified when g_{RE} = \frac{1}{\varepsilon_{RE}^{max}}, thus proving the above statement.

We now know that if the relative error RE(k) is too high, using the control Equation

(3.6) with g_{RE} = \frac{1}{\varepsilon_{RE}^{max}}, the next relative error for this event couple will be either in the desired region or too low. Therefore, to prove the convergence into the desired region, we just need

to verify that the distance between RE(k) and the set point RE∗ is strictly higher than the

distance between RE(k + 1) (in the too low region) and RE∗. Specifically, if the following

constraint is verified, then Theorem 1 is proved:

RE(k) - RE^{*} > RE^{*} - RE(k + 1) \qquad (3.11)

By substituting Equations (3.8) and (3.6) into Inequality (3.11) and by using g_{RE} = \frac{1}{\varepsilon_{RE}^{max}}, we can rewrite Inequality (3.11) as:

\frac{1}{2 \varepsilon_{RE}^{max}} < \frac{1}{\varepsilon_{RE}} \qquad (3.12)

which is verified, thus proving Theorem 1.

In practice, at runtime the performance balancer, after estimating the model parameters,

calculates gRE based on the maximum measured modeling error.

3.3.3 Outer Loop: Performance Controller

The performance controller is activated when the performance balancer reaches the

steady state, i.e., the event couple has balanced relative performance after several consecutive

balancer invocations. For the same reasons explained in Section 3.3.2, we design the

performance controller based on supervisory control theory. The interactive event and the

throughput event can be running 1) on the same CPU core or 2) on different CPU cores,

based on the decision of the load balancer. In the first case, there is only one controller that

controls the average relative performance of the two events by manipulating the common

core’s DVFS level (we also refer to it as CPU frequency because voltage is always scaled with frequency). In the second case, there are two controllers, one for each event. Each

controller manipulates the core’s DVFS level where the controlled event is executing. Similar

to the first case, we could design the outer loop to have only one controller that controls

the average performance of the two events instead of having two controllers for two events.

However, with only one controller, the outer loop would limit the flexibility of multi-core

CPUs to minimize the energy consumption because it would execute the two events at the

same frequency level. The two loops share many design similarities. Thus, here, we mainly

focus on the differences.

The performance controller models the relative performance of each interactive and

throughput event as a linear function of the CPU frequency. Thus, the dynamic model of the

event relative performance can be described as follows:

RP(k + 1) - RP(k) = a_{RP} \, (f(k + 1) - f(k)) \qquad (3.13)

where RP(k) is the relative performance of the kth execution of the event at frequency f(k).

aRP is the estimated slope. The steps used for estimating the model parameters follow those

detailed in Section 3.3.2.1 and thus are omitted. Because of the finite number of CPU

frequencies available, we define a desired region for relative performance: its upper bound

is set to one (e.g., maximum desired response time) and its lower bound is set as the average

modeling error calculated during the relative performance modeling phase.

The performance controller calculates the frequency level for the next execution of

the controlled events, so that the relative performance converges into the desired region.

Based on supervisor control theory and the relative performance model (3.13), we derive

the controller equation that calculates the next frequency level f(k + 1) as:

f(k + 1) = f(k) + \frac{g_{RP}}{a} \, (RP^{*} - RP(k)) \qquad (3.14)

where gRP is a tunable control parameter. RP∗ is the center of the desired region. RP(k)

is the measured average relative performance at the kth execution of the controlled events.

If the events are on different cores, a is the estimated slope aRP of the controlled event.

Otherwise, if the events are on the same core, a is the average of the estimated slopes of the

two events. The analysis to determine the gain gRP follows that presented in Section 3.3.2.3,

thus we omit it.
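For the outer loop, one step amounts to applying Equation (3.14) and snapping the result to the closest available DVFS level. The frequency list and parameter values in the sketch below are illustrative assumptions, not the actual levels of any specific device.

# Sketch of one supervisory DVFS step (Equation 3.14) for a controlled event.

AVAILABLE_FREQS_MHZ = [300, 422, 652, 702, 1026, 1134, 1512, 2265]  # illustrative

def controller_step(f_k: float, rp_k: float, a_rp: float, g_rp: float,
                    rp_target: float, lower_bound: float,
                    upper_bound: float = 1.0) -> float:
    """Return the DVFS level (MHz) for the next execution of the controlled event."""
    if lower_bound <= rp_k <= upper_bound:
        return f_k                                   # inside the desired region
    ideal = f_k + (g_rp / a_rp) * (rp_target - rp_k)
    return min(AVAILABLE_FREQS_MHZ, key=lambda f: abs(f - ideal))

# Relative performance well below one -> the core can be slowed down.
next_f = controller_step(f_k=1026, rp_k=0.5, a_rp=-0.0009, g_rp=1.0,
                         rp_target=0.9, lower_bound=0.8)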

3.3.4 Discussion

More than Two Concurrent Apps. The design of SURF is mainly described based

on the example combination of an interactive app and a throughput app. However, SURF

can be applied to more general scenarios. For only one foreground app, the outer loop can

be used to control the performance of this app. For more than two foreground apps, e.g.,

two throughput apps and one interactive app, the performance balancer first balances the

performance of the throughput apps. Then, it balances the interactive app’s performance with the average performance of the throughput apps.

Priority Ratio to App Priorities. In order to translate the desired priority ratio

(controller (3.6)) into app priorities, we store in a lookup table an ordered list of available

priority ratios calculated by combining the 40 priority levels. At runtime, SURF uses binary

search to find the combination of priorities that has the closest ratio to the desired one. The

computational overhead of this method is negligible.
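The lookup-table idea can be sketched with Python's bisect module: precompute every ratio reachable from the 40 x 40 priority combinations once, then binary-search for the pair closest to the desired ratio at runtime. The table layout below is an illustrative assumption, not SURF's exact data structure.

# Sketch of mapping a desired priority ratio back to a pair of priority levels.

import bisect

# (ratio, throughput_priority, interactive_priority), using shifted levels 1..40.
RATIO_TABLE = sorted((t / i, t, i) for t in range(1, 41) for i in range(1, 41))
RATIOS = [entry[0] for entry in RATIO_TABLE]

def priorities_for_ratio(desired_ratio: float):
    idx = bisect.bisect_left(RATIOS, desired_ratio)
    candidates = RATIO_TABLE[max(0, idx - 1): idx + 1]   # neighbors of the insertion point
    _, t, i = min(candidates, key=lambda entry: abs(entry[0] - desired_ratio))
    return t - 1, i - 1     # map back to the 0..39 range before applying

print(priorities_for_ratio(2.2))   # pair whose ratio is closest to 2.2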

Background Jobs. Some actions may start background jobs such as data processing,

network operations, or I/O operations. Typically, these jobs are assigned to the background

App Name        Commit    Category        App Name     Commit    Category
ExoPlayer       ab6f9ae   Video Player    Mapbox       da0197a   Navigation
Rocket.Chat     975511e   Chat            AndStatus    55f16d9   Social
Conversations   03c3464   Chat            LeafPic      1617a73   Photography
muPDF           8ee452b   PDF reader      PocketHub    2052f77   Productivity

Table 3.1: Apps tested. The commit number indicates the latest version available at the time of the experiments.

cgroup that cannot have more than 10% CPU utilization, thus they have little interference with the user-perceived performance. The optimization of background jobs is orthogonal to

our study and can be complementary to our solution.

3.4 Experimental Results

3.4.1 Experimental Setup

In our experiments, because of limitations imposed by the current mobile OSs, we con-

sider up to two apps running in the foreground. We use open-source apps available in the app

store [30] and listed in Table 3.1. We select two throughput apps: ExoPlayer, a video player

app that plays a local video to eliminate network interference, and MapBox, a navigation

app that provides navigation directions using a fixed route to ensure similar conditions

in the comparisons. We have interactive apps for chatting (Rocket.Chat, Conversations),

reading documents (muPDF), browsing social networks (AndStatus), pictures (LeafPic),

and repositories (PocketHub). In order to ensure a fair comparison of different solutions, we

build a program that injects user touches on the device with reliable timing and precision.

We test multiple smartphones (e.g., Nexus 5, Galaxy S3) with Android 7.1.1, which uses

the up-to-date scheduler and load balancer. Unless specified, the presented results are those

of the Nexus 5. The results for the other devices are similar and thus omitted. We measure

the response time and frame rate from Android's looper thread. Because smartphones do not have CPU power sensors, we use the Monsoon Power Monitor to profile the power consumption of the whole device for various active cores and frequencies. We then use this profile to estimate the CPU energy consumption.
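The energy estimate is essentially power integrated over time in each CPU configuration; a minimal sketch of that bookkeeping is shown below. The profile values are placeholders rather than actual Monsoon measurements.

# Hypothetical sketch of the energy-estimation step: integrate the measured
# power profile, indexed by (active cores, frequency), over the time spent in
# each CPU configuration. Profile numbers below are placeholders.

POWER_PROFILE_MW = {           # (active_cores, freq_MHz) -> average power in mW
    (1, 652): 120.0,
    (1, 1026): 210.0,
    (2, 1026): 380.0,
}

def estimate_energy_mj(intervals) -> float:
    """intervals: iterable of (active_cores, freq_MHz, duration_s) tuples."""
    return sum(POWER_PROFILE_MW[(cores, freq)] * dt for cores, freq, dt in intervals)

print(estimate_energy_mj([(1, 652, 5.0), (2, 1026, 2.0)]))  # energy in mJ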

We compare SURF against the following baselines:

• AdHoc is similar to SURF but has a heuristic balancer that increases/decreases the

app priorities by one level at a time instead of using control theory. It has a slow

convergence time into the desired region.

• Binary is similar to SURF but has a heuristic balancer that uses binary search instead

of control theory to find the app priorities that minimize the relative error distance

from the desired value. It has faster convergence than AdHoc, but it may lead to

instability.

• Periodic is a state-of-the-art DVFS controller based on time-driven control theory

(similar to [40, 79]). It activates periodically to control the app performance to a

specified target. It may not efficiently handle aperiodic user actions and does not

balance app performance.

• eQoS is a state-of-the-art DVFS controller proposed in [94]. Similar to SURF, it builds

performance models for each event to find an initial core frequency that gives a

desired performance. Then, it uses a heuristic algorithm instead of control theory to

change the core frequency one level at a time and compensate for the modeling errors. It

may not efficiently handle concurrent executions because it does not balance apps’

performance.

• Android is the solution used in the current Android OS to regulate the app priorities

and CPU DVFS. It does not control the app performance and assigns the static default

priority to foreground apps.

In the following sections, we first summarize the results of SURF compared to Android for all the apps in Table 3.1. Then, we use the case of Rocket.Chat and ExoPlayer to study the two loops of SURF (the results with other apps are similar and thus omitted). For the inner loop, we compare against AdHoc and Binary in Section 3.4.3 to show the advantage of control-theoretic designs. For the outer loop, we compare against Periodic in Section

3.4.4 to highlight the capability of supervisory control to handle aperiodic user actions. We then compare against Periodic, eQoS, and Android in Section 3.4.5 to demonstrate how the two loops of SURF are coordinated to reduce the CPU energy consumption while causing no perceivable performance degradation.

3.4.2 SURF: Overall Summary of Results

Here, we summarize the results of SURF for all the combinations of apps listed in Table

3.1 that cause resource competition among concurrent apps, i.e., an interactive app and a throughput app, or two throughput apps. Note, two interactive apps are not considered because they do not compete for resources with each other. Thus, SURF can control each app independently using only the outer loop. We omit the results of these cases due to space limitations.

Figure 3.5(a) shows the CPU energy savings of SURF compared to the Android baseline.

Figure 3.5(b) shows the average user-perceived performance of the apps during the experi- ments. Compared to Android, which uses the highest CPU frequencies during most of the user actions and has an average of 54ms response time and 58fps frame rate across the apps,



Figure 3.5: Results of SURF for various combinations of throughput apps (ExoPlayer, Mapbox) and interactive apps. The default Android has an average of 54ms response time and 58fps frame rate because it mostly uses high frequencies. SURF reduces the core frequencies to (a) save CPU energy while (b) causing no perceivable performance degradation.

SURF achieves 30% to 90% CPU energy savings while causing no perceivable performance

degradation. The variability in CPU energy savings across the apps is due to the diverse

apps’ resource consumption. For example, interacting with muPDF to change the page of

a document while playing a video with ExoPlayer requires just one core at frequency of

652MHz. With this configuration, which achieves 90% CPU energy savings compared to

Android, the two apps have an average of 84.5ms response time and 34fps frame rate, i.e.,

both apps are close to their desired performance (the two horizontal lines in Figure 3.5(b)).

A similar performance is achieved when opening pictures with LeafPic while navigating with

Mapbox. However, these performance values are achieved using a more energy-demanding

configuration, i.e., two cores at frequencies of 1,036MHz and 422MHz on average. In this

case, SURF achieves 30% CPU energy savings on average compared to Android.

These experiments show that SURF, compared to Android, achieves high CPU energy

savings while ensuring good performance for various real-world apps.


Figure 3.6: Comparison of the performance balancer of SURF (i.e., SURF-B) with the baselines. (a) The performance balancer converges faster into the desired region (the grey band) and is more stable. (b) SURF-B achieves better performance balancing for three sizes of desired performance region.

3.4.3 Inner Loop: Performance Balancer

Here, we test the inner loop when the outer loop of SURF is disabled (i.e., SURF-B) and compare with AdHoc and Binary. We set ExoPlayer and Rocket.Chat in split screen and fix the core frequencies to eliminate their impact.

To test the control accuracy of the three solutions, we examine three different sizes of desired relative error regions, i.e., large (1 ± 0.3), medium (1 ± 0.2, calculated by SURF-B, see Section 3.3.2.2), and small (1 ± 0.1). Figure 3.6(a) shows an example of performance balancing with the small desired region (the gray band). Note, the different initial relative error of the three solutions is due to background noise from system jobs, but it has a negligible impact on the overall comparison. Figure 3.6(b) shows the results of the three solutions with all the three region sizes in terms of percentage of actions with relative error in the desired region. As Figure 3.6(a) shows, AdHoc has a slow convergence and thus, as

Figure 3.6(b) shows, it has only 42% of the actions within the large region. Binary has a faster convergence than AdHoc, and thus has 90% and 73.9% of the actions within the large and medium regions, respectively. However, it achieves this fast convergence by aggressively adjusting app priorities, which may lead to instability of the relative error, especially for the small-region case (e.g., action 18 in Figure 3.6(a)). SURF-B is based on supervisory control theory, and thus ensures fast convergence and better stability. As a result, SURF-B has 42.5% more actions in the small region compared to Binary and, overall, it has 90%, 80%, and 65% of the actions within the large, medium, and small regions, respectively.

These experiments show that the performance balancer ensures good performance balancing between concurrent apps.

3.4.4 Outer Loop: Performance Controller

Here, we test the outer-loop performance controller of SURF without the inner loop, i.e.,

SURF-C, and compare SURF-C against Periodic, which is based on time-driven control theory. We design Periodic as a Proportional Integral Derivative (PID) controller using the same linear model built by SURF-C. We set the desired response time to 90ms for Periodic.

Correspondingly, the desired region for SURF-C is between 80ms and 100ms. Then, we inject a fixed sequence of actions in Rocket.Chat (40 in total) and measure the average tracking error and control overhead. The tracking error is the absolute difference between each action’s response time and the desired value. The control overhead is the number of times each solution executes its control algorithm. To eliminate the effect of performance balancing, we test only a single foreground app (Rocket.Chat) and describe the results for concurrent foreground apps in the next section.


Figure 3.7: The performance controller of SURF (i.e., SURF-C) based on supervisory control theory ensures higher tracking accuracy and lower overhead compared to the Periodic baseline based on time-driven control theory.

Figure 3.7(a) shows the response time variation of the send text event of Rocket.Chat with the two solutions. Periodic needs to trade off tracking accuracy with computational

overhead. When Periodic runs with a short regulating period in which at least one action is executed,

i.e., Periodic-Short in Figure 3.7, it ensures a good tracking accuracy of 2ms, but has a

high control overhead of 33 executions because it runs after most of the actions. Increasing

the regulating period to have at least 3 (Periodic-Medium) or 15 actions (Periodic-Long)

executed, as Figure 3.7(b) shows, decreases the control overhead to 10 executions

and 2 executions, respectively. Unfortunately, the tracking error increases to 6ms and 12ms

for Periodic-Medium and Periodic-Long, respectively. SURF-C, because it is based on

supervisory control theory, activates only when the response time is outside the desired

region (80-100ms). Thus, as Figure 3.7(b) shows, SURF-C achieves a tracking accuracy

similar to that of Periodic-Short, i.e., 3ms, while only incurring an overhead similar to that

of Periodic-Long, i.e., 2 executions.


Figure 3.8: Applying single-app solutions to concurrent executions (a) may lead to an increased CPU frequency and energy consumption, caused by (b) app performance imbalance.

These results show that SURF-C achieves high tracking accuracy while incurring only a small control overhead.

3.4.5 Integrated Solution: SURF

Here, we first test SURF-C to highlight the problem of unbalanced performance for concurrent apps. Then, we show the results of SURF and compare with the baselines.

Unbalanced Performance. Figure 3.8(a) shows the relative performance and the core frequencies of the two concurrent apps when SURF-C controls the app performance without performance balancing. Figure 3.8(b) shows the corresponding performance. As Figure

3.8(a) shows, the core frequency of ExoPlayer quickly goes to the minimum because, as

Figure 3.8(b) shows, the frame rate of this app has an average of 50fps, thus above the desired performance of 34fps. Meanwhile, the unhandled imbalance leads the load balancer to schedule more work on the same core with Rocket.Chat, which has to increase its core


Figure 3.9: SURF (a) balances the app performance, then (b) tracks the desired relative performance of the apps (gray band) to (c) ensure good user-perceived performance while reducing the CPU energy consumption.

frequency (e.g., after action 4) thus increasing the energy consumption. This experiment demonstrates the importance of performance balancing.

Loop Interaction. Figure 3.9 shows the results when the two loops of SURF work together. The initial configuration is the same as that used for Figure 3.8, i.e., default

priorities and highest frequencies. As Figure 3.9(c) shows, the initial response time and

frame rate of the apps are far from the user-perceivable limits (i.e., the two straight lines).

However, as Figure 3.9(b) shows, their relative performance is similar at user action 1, i.e.,

the relative error is near 1. As a result, the performance controller, at user actions 2 and 3 in

Figure 3.9(b), starts lowering the CPU frequencies. At action 6, the relative performance is within its desired region. However, with the new CPU frequencies, the apps’ performance

is imbalanced, which activates the performance balancer. The performance balancer starts

changing the app priorities to control the relative error within the desired region. At action

23, both the relative error and the relative performance are controlled within the desired

regions, i.e., the response time of Rocket.Chat and the frame rate of ExoPlayer are closer to

the 100ms and 30fps limits (Figure 3.9(c)). After this point, the average core frequency for

ExoPlayer and Rocket.Chat is 835MHz and 813MHz, respectively, thus reducing the CPU

energy consumption by 54%, on average, compared to the initial configuration (see Figure 3.10).

Sometimes, e.g., action 31 in Figure 3.9(a), due to the occasional execution of other system

processes, the relative error and/or the relative performance of the apps may get out of the

desired regions. However, as the figures show, the two loops keep executing to adjust the

resource allocation of the apps and ensure good user-perceived performance with low CPU

energy consumption.

Comparison with Baselines. Figures 3.10(a) and 3.10(b) show the tracking errors of

the two apps with SURF, eQoS, and Periodic (Periodic-Short for better accuracy). Figure

3.10(c) shows the average frequencies, and Figure 3.10(d) shows the CPU energy savings

compared to Android, which uses the highest frequencies during most of the actions. We test

three set-point pairs of response time and frame rate, i.e., 1) 90ms and 34fps, 2) 80ms and

38fps, and 3) 70ms and 52fps. Similar to the results of SURF-C in Figure 3.8, both eQoS


Figure 3.10: Compared to eQoS and Periodic, SURF has (a, b) high tracking accuracy for both apps by (c) lowering the core frequencies, thus (d) increasing the CPU energy savings.

and Periodic have large tracking errors for at least one of the two apps, due to performance

imbalance. For example, for the 90ms and 34fps set points, the baselines have a small

tracking error for Rocket.Chat but a large tracking error for ExoPlayer. It would be possible

to reduce the imbalance by setting a higher set point, e.g., 52fps, for ExoPlayer. However,

as Figure 3.10(c) shows, this would cause an increased CPU frequency and thus an energy waste because the user cannot perceive any difference for a frame rate higher than 30fps.

SURF achieves a 68-73% better tracking accuracy for Rocket.Chat and 64-68% better

accuracy for ExoPlayer compared to the baselines, by balancing the app performance (see

Figures 3.9(a) and 3.9(b)). Meanwhile, SURF decreases the core frequency of Rocket.Chat, which leads (on average) to 54%, 31%, and 32% CPU energy savings compared to Android,

eQoS, and Periodic, respectively.

These experiments prove that, by balancing the concurrent app performance, it is possible

to have good user-perceived performance while reducing the CPU energy consumption.

Chapter 4: Conclusions

In this dissertation, we have discussed how to improve the user experience with mobile

apps. In particular, we have focused on two main causes of poor performance, i.e., soft hang

bugs and resource contention.

In Chapter 2, we have presented Hang Doctor, a runtime methodology that supplements

the existing offline algorithms by detecting and diagnosing soft hangs caused by previously

unknown blocking operations. Hang Doctor features a two-phase algorithm that first checks

response time and performance event counters for detecting possible soft hang bugs with

small overheads, and then performs stack trace analysis when diagnosis is necessary. A

novel soft hang filter based on correlation analysis is designed to minimize false positives

and negatives for high detection performance and low overhead. Our results have shown

that Hang Doctor has identified 34 new soft hang bugs that were previously unknown to

their developers, among which 62%, so far, have already been confirmed by the developers,

and 68% are missed by offline detection algorithms.

In Chapter 3, we have presented SURF, Supervisory control of User-perceived peRFor-

mance. SURF is a 2-level solution that overcomes the two limitations of existing work:

1) they cannot efficiently manage concurrent foreground apps, which may lead to performance

imbalance and high energy consumption, and 2) they cannot handle the aperiodicity of user actions.

First, SURF dynamically allocates resources to concurrent apps for balanced performance.

85 Second, SURF uses supervisory control theory to handle the aperiodicity of user actions.

SURF features a two-level architecture design that performs the two tasks at different time scales, according to their different overheads and timing requirements. We have tested

SURF on real mobile devices with several real-world apps and show that it can reduce the

CPU energy consumption by 30-90% compared to state-of-the-art solutions while causing no perceivable performance degradation.

Bibliography

[1] Mohammad Mejbah ul Alam, Tongping Liu, Guangming Zeng, and Abdullah Muzahid. Syncperf: Categorizing, detecting, and diagnosing synchronization performance bugs. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys ’17), 2017.

[2] Bhojan Anand, Karthik Thirugnanam, Jeena Sebastian, Pravein G. Kannan, Ananda L. Akhihebbal, Mun Choon Chan, and Rajesh Krishna Balan. Demo: Adaptive display power management for mobile games. In MobiSys, 2011.

[3] Apple. Use two Mac apps side by side in Split View. https://support.apple.com/en-us/HT204948, 2017.

[4] Joy Arulraj, Po-Chun Chang, Guoliang Jin, and Shan Lu. Production-run software failure diagnosis via hardware performance counters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13), 2013.

[5] Marco Brocanelli and Xiaorui Wang. Hang doctor: Runtime detection and diagnosis of soft hangs for smartphone apps. https://sites.google.com/site/hangdoctorhome/, 2017.

[6] Benjamin Elliott Canning and Thomas Scott Coon. Method, system, and apparatus for identifying unresponsive portions of a computer program, 2008. Corporation, US Patent.

[7] Michael Carbin, Sasa Misailovic, Michael Kling, and Martin C. Rinard. Detecting and escaping infinite loops with jolt. In 25th European Conference on Object-Oriented Programming (ECOOP ’11), 2011.

[8] Aaron Carroll and Gernot Heiser. Mobile multicores: Use them or waste them. In HotPower, 2013.

[9] Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik, and Ashwin Patti. Clonecloud: Elastic execution between mobile device and cloud. In Proceedings of the Sixth Conference on Computer Systems (Eurosys ’11), 2011.

87 [10] Domenico Cotroneo, Roberto Natella, and Stefano Russo. Assessment and improve- ment of hang detection in the linux operating system. In 28th IEEE International Symposium on Reliable Distributed Systems (SRDS ’09), 2009.

[11] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan Saroiu, Ranveer Chandra, and Paramvir Bahl. Maui: Making smartphones last longer with code offload. In MobiSys, 2010.

[12] Daniel J. Dean, Hiep Nguyen, Xiaohui Gu, Hui Zhang, Junghwan Rhee, Nipun Arora, and Geoff Jiang. Perfscope: Practical online server performance bug inference in production cloud computing infrastructures. In Proceedings of the ACM Symposium on Cloud Computing (SOCC ’14), 2014.

[13] Daniel J Dean, Hiep Nguyen, Peipei Wang, Xiaohui Gu, Anca Sailer, and Andrzej Kochut. Perfcompass: Online performance anomaly fault localization and inference in infrastructure-as-a-service clouds. IEEE Transactions on Parallel and Distributed Systems, vol. 27(no. 6):pp. 1742–1755, 2016.

[14] Daniel Joseph Dean, Hiep Nguyen, and Xiaohui Gu. Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Pro- ceedings of the 9th International Conference on Autonomic Computing (ICAC ’12), 2012.

[15] Luca Della Toffola, Michael Pradel, and Thomas R. Gross. Performance problems you can fix: A dynamic analysis of memoization opportunities. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA ’15), 2015.

[16] Android Developers. Adding picture-in-picture. https://developer.android. com/training/tv/playback/picture-in-picture..

[17] Android Developers. Multi-window support. https://developer.android.com/ guide/topics/ui/multi-window.html.

[18] Rui Ding, Hucheng Zhou, Jian-Guang Lou, Hongyu Zhang, Qingwei Lin, Qiang Fu, Dongmei Zhang, and Tao Xie. Log2: A cost-aware logging mechanism for performance diagnosis. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’15), 2015.

[19] Xiaoning Ding, Hai Huang, Yaoping Ruan, Anees Shaikh, and Xiaodong Zhang. Automatic software fault diagnosis by exploiting application signatures. In Proceedings of the 22nd Conference on Large Installation System Administration Conference (LISA ’08), 2008.

88 [20] Mian Dong and Lin Zhong. Chameleon: A color-adaptive web browser for mobile oled displays. In MobiSys, 2011.

[21] Yuyang Du, Sebastien Haezebrouck, Jin Cui, Rajeev Muralidhar, Harinarayanan Seshadri, Vishwesh Rudramuni, Nicole Chalhoub, YongTong Chua, and Richard Quinzio. Taskfolder: Dynamic and fine-grained workload consolidation for mobile devices. In MobiSys, 2016.

[22] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. Fairness via source throttling: A configurable and high-performance fairness substrate for multi-core memory systems. In ASPLOS, 2010.

[23] Antonio Filieri, Henry Hoffmann, and Martina Maggio. Automated design of self- adaptive software with control-theoretical formal guarantees. In ICSE, 2014.

[24] Brad Fitzpatrick. Writing zippy android apps. In Google I/O Developers Conference, 2010.

[25] Brad Fitzpatrick. Writing zippy android apps. In Google I/O, 2010.

[26] Pierre-Marc Fournier and Michel R. Dagenais. Analyzing blocking to debug perfor- mance problems on multi-core systems. SIGOPS Oper. Syst. Rev., vol. 44(no. 2):pp. 77–87, 2010.

[27] Benjamin Gaudette, Carole-Jean Wu, and Sarma Vrudhula. Improving smartphone user experience by balancing performance and energy with probabilistic qos guarantee. In HPCA, 2016.

[28] GitHub. Home page. https://github.com/, 2018.

[29] Ioana Giurgiu, Oriana Riva, and Gustavo Alonso. Dynamic software deployment from clouds to mobile devices. In Proceedings of the 13th International Middleware Conference (Middleware ’12), 2012.

[30] Google. Google play store. https://play.google.com/store.

[31] Android Developers Guide. Controlling the camera. https://developer.android. com/training/camera/cameradirect.html, 2018.

[32] Android Developers Guide. Keeping your app responsive. https://developer. android.com/training/articles/perf-anr.html, 2018.

[33] Android Developers Guide. Package index. https://developer.android.com/ reference/packages.html, 2018.

[34] Android Developers Guide. Simpleperf. https://developer.android.com/ndk/ guides/simpleperf.html, 2018.

89 [35] Haofu Han, Jiadi Yu, Hongzi Zhu, Yingying Chen, Jie Yang, Guangtao Xue, Yan- min Zhu, and Minglu Li. E3: Energy-efficient engine for frame rate adaptation on smartphones. In SenSys, 2013.

[36] Songtao He, Yunxin Liu, and Hucheng Zhou. Optimizing smartphone power consump- tion through dynamic resolution scaling. In Mobicom, 2015.

[37] Peng Huang, Xiao Ma, Dongcai Shen, and Yuanyuan Zhou. Performance regression testing target prioritization via performance risk analysis. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14), 2014.

[38] Sungju Huh, Jonghun Yoo, and Seongsoo Hong. Improving interactivity via vt-cfs and framework-assisted task characterization for linux/android smartphones. In RTCSA, 2012.

[39] Jinho Hwang and Timothy Wood. Adaptive dynamic priority scheduling for virtual desktop infrastructures. In IWQoS, 2012.

[40] Connor Imes, David H.K. Kim, Martina Maggio, and Henry Hoffman. Poet: a portable approach to minimizing energy under soft real-time constraints. In RTAS, 2015.

[41] S.L. Jackson. Research Methods and Statistics: A Critical Thinking Approach. Cengage Learning, 2012.

[42] Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. Understanding and detecting real-world performance bugs. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12), 2012.

[43] Guoliang Jin, Linhai Song, Wei Zhang, Shan Lu, and Ben Liblit. Automated atomicity- violation fixing. In Proceedings of the 32nd ACM SIGPLAN Conference on Program- ming Language Design and Implementation (PLDI ’11), 2011.

[44] Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu. Automated concurrency-bug fixing. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI ’12), 2012.

[45] Milan Jovic, Andrea Adamoli, and Matthias Hauswirth. Catch me if you can: perfor- mance bug detection in the wild. ACM SIGPLAN Notices, vol. 46(no. 10):pp. 155–170, 2011.

[46] Xenofon Koutsoukos, Radhika Tekumalla, Balachandran Natarajan, and Chenyang Lu. Hybrid supervisory utilization control of real-time systems. In RTAS, 2005.

[47] Yongin Kwon, Sangmin Lee, Hayoon Yi, Donghyun Kwon, Seungjun Yang, Byung- Gon Chun, Ling Huang, Petros Maniatis, Mayur Naik, and Yunheung Paek. Mantis: Efficient predictions of execution time, energy usage, memory usage and network

90 usage on smart mobile devices. IEEE Transactions on Mobile Computing, vol. 14(no. 10):pp. 2059–2072, 2015.

[48] Yepang Liu, Chang Xu, and Shing-Chi Cheung. Characterizing and detecting perfor- mance bugs for smartphone applications. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14), 2014.

[49] Robert Love. O(1) scheduler. In Linux Symposium, Sams, 2003.

[50] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, and Alexandra Fedorova. The linux scheduler: A decade of wasted cores. In EuroSys, 2016.

[51] Mirateam. Turn your smartphone into a laptop. https://www.indiegogo.com/ projects/turn-your-smartphone-into-a-laptop-mobile-android#/.

[52] David Z. Morris. The $99 peripheral that turns your smartphone into a laptop. http: //fortune.com/2016/08/14/smartphone-laptop-peripheral/.

[53] Priya Nagpurkar, Hussam Mousa, Chandra Krintz, and Timothy Sherwood. Efficient remote profiling for resource-constrained devices. ACM Transactions on Architecture and Code Optimization (TACO), vol. 3(no. 1):pp. 35–66, 2006.

[54] Nithin Nakka, Giacinto Paolo Saggese, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. An architectural framework for detecting process hangs/crashes. In 5th European Dependable Computing Conference (EDCC ’05), 2005.

[55] Adrian Nistor, Po-Chun Chang, Cosmin Radoi, and Shan Lu. Caramel: Detecting and fixing performance problems that have non-intrusive fixes. In Proceedings of the 37th International Conference on Software Engineering (ICSE ’15), 2015.

[56] Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. Toddler: Detecting per- formance problems via similar memory-access patterns. In Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13), 2013.

[57] Santiago Pagani, Muhammad Shafique, Heba Khdr, Jian-Jia Chen, and Jörg Henkel. seBoost: Selective boosting for heterogeneous manycores. In CODES, 2015.

[58] Abhinav Pathak, Abhilash Jindal, Y. Charlie Hu, and Samuel P. Midkiff. What is keeping my phone awake?: Characterizing and detecting no-sleep energy bugs in smartphone apps. In MobiSys, 2012.

[59] Dan Pelleg, Muli Ben-Yehuda, Rick Harper, Lisa Spainhower, and Tokunbo Adeshiyan. Vigilant: out-of-band detection of failures in virtual machines. ACM SIGOPS Operat- ing Systems Review, vol. 42(no. 1):pp. 26–31, 2008.

91 [60] Michael Pradel, Parker Schuh, George Necula, and Koushik Sen. Eventbreak: Analyz- ing the responsiveness of user interfaces through performance-guided test generation. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA ’14), 2014. [61] Feng Qian, Zhaoguang Wang, Yudong Gao, Junxian Huang, Alexandre Gerber, Zhuo- qing Mao, Subhabrata Sen, and Oliver Spatscheck. Periodic transfers in mobile applications: Network-wide origin, impact, and optimization. In WWW, 2012. [62] Arun Raghavan, Laurel Emurian, Lei Shao, Marios Papaefthymiou, Kevin P. Pipe, Thomas F. Wenisch, and Milo M.K. Martin. Computational sprinting on a hardware/- software testbed. In ASPLOS, 2013. [63] Arun Raghavan, Yixin Luo, Anuj Chandawalla, Marios Papaefthymiou, Kevin P. Pipe, Thomas F. Wenisch, and Milo M. K. Martin. Computational sprinting. In Proceedings of the IEEE 18th International Symposium on High-Performance Computer Architec- ture (HPCA ’12), 2012. [64] Arun Raghavan, Yixin Luo, Anuj Chandawalla, Marios Papaefthymiou, Kevin P Pipe, Thomas F Wenisch, and Milo MK Martin. Computational sprinting. In HPCA, 2012. [65] Lenin Ravindranath, Jitendra Padhye, Sharad Agarwal, Ratul Mahajan, Ian Obermiller, and Shahin Shayandeh. Appinsight: Mobile app performance monitoring in the wild. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), 2012. [66] Lenin S. Ravindranath, Jitendra Padhye, Ratul Mahajan, and Hari Balakrishnan. Time- card: Controlling user-perceived delays in server-based mobile applications. In The 24th ACM Symposium on Operating Systems Principles (SOSP ’13), 2013. [67] Samsung. Turn your phone into a computer. http://www.samsung.com/us/ explore/dex/. [68] Dan Schatzberg, James Cadden, Han Dong, Orran Krieger, and Jonathan Appavoo. Ebbrt: A framework for building per-application library operating systems. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), 2016. [69] Aaron Schulman, Vishnu Navda, Ramachandran Ramjee, Neil Spring, Pralhad Desh- pande, Calvin Grunewald, Kamal Jain, and Venkata N. Padmanabhan. Bartendr: A practical approach to energy-aware cellular data scheduling. In MobiCom, 2010. [70] Abhishek B Sharma, Haifeng Chen, Min Ding, Kenji Yoshihira, and Guofei Jiang. Fault detection and localization in distributed systems using invariant relationships. In 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN ’13), 2013.

92 [71] Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li. Reference-driven per- formance anomaly identification. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’09), 2009.

[72] Wook Song, Nosub Sung, Byung-Gon Chun, and Jihong Kim. Reducing energy con- sumption of smartphones using user-perceived response time analysis. In Proceedings of the 15th Workshop on Mobile Computing Systems and Applications (HotMobile ’14), 2014.

[73] Wook Song, Nosub Sung, Byung-Gon Chun, and Jihong Kim. Reducing energy con- sumption of smartphones using user-perceived response time analysis. In HotMobile, 2014.

[74] Statista. Number of available applications in the google play store from december 2009 to november 2015. http://www.statista.com/, 2016.

[75] Kenzo Van Craeynest, Shoaib Akram, Wim Heirman, Aamer Jaleel, and Lieven Eeckhout. Fairness-aware scheduling on single-isa heterogeneous multi-cores. In PACT, 2013.

[76] Xi Wang, Zhenyu Guo, Xuezheng Liu, and Zhilei Xu. Hang analysis: Fighting responsiveness bugs. In Proceedings of the 3rd Conference on Computer Systems (Eurosys ’08), 2008.

[77] Xiaorui Wang, Yingming Chen, Chenyang Lu, and Xenofon Koutsoukos. Fc-orb: A robust distributed real-time embedded middleware with end-to-end utilization control. Journal of Systems and Software, 80(7):938–950, 2007.

[78] Xiaorui Wang, Kai Ma, and Yefu Wang. Cache latency control for application fairness or differentiation in power-constrained chip multiprocessors. IEEE Transactions on Computers, 61(10):1371–1385, 2012.

[79] Yefu Wang, Xiaorui Wang, Ming Chen, and Xiaoyun Zhu. Power-efficient response time guarantees for virtualized enterprise servers. In RTSS, 2008.

[80] Long Wangand, Zbigniew Kalbarczyk, Weining Guand, and Ravishankar K Iyer. Reliability microkernel: Providing application-aware reliability in the OS. IEEE Transactions on Reliability, vol. 56(no. 4):pp. 597–614, 2007.

[81] Tom Warren. Microsoft is about to turn a phone into a real pc. https://www. theverge.com/, 2016.

[82] Programmable Web Research Center. Growth in web apis from 2005 to 2013. http: //www.programmableweb.com/api-research, 2014.

93 [83] Zilong Wen, Weiqi Dai, Deqing Zou, and Hai Jin. Perfdoc: Automatic performance bug diagnosis in production cloud computing infrastructures. In Trustcom/BigDataSE/I SPA, 2016.

[84] XFinitum. Turn your mobile devices into a ultimate machine. http://xfinitum.com.

[85] Xusheng Xiao, Shi Han, Dongmei Zhang, and Tao Xie. Context-sensitive delta inference for identifying workload-dependent performance bottlenecks. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA ’13), 2013.

[86] Di Xu, Chenggang Wu, Pen-Chung Yew, Jianjun Li, and Zhenjiang Wang. Providing fairness on shared-memory multiprocessors via process scheduling. In SIGMETRICS, 2012.

[87] Lei Yang, Jiannong Cao, Shaojie Tang, Di Han, and Neeraj Suri. Run time application repartitioning in dynamic mobile cloud environments. IEEE Transactions on Cloud Computing, vol. 4(no. 3):pp. 336 – 348, 2016.

[88] Shengqian Yang, Dacon Yan, and Atanas Rountev. Testing for poor responsiveness in android applications. In 1st International Workshop on the Engineering of Mobile- Enabled Systems (MOBS ’13), 2013.

[89] Andrew Zeigler, Shawn M Woods, David M Ruzyski, John H Lueders, Jon R Berry, and Daniel James Plaster. Hang recovery in software applications, 2012. Microsoft Corporation, US Patent.

[90] Huazhe Zhang and Henry Hoffmann. Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques. In ASPLOS, 2016.

[91] Lide Zhang, David R Bild, Robert P Dick, Z Morley Mao, and Peter Dinda. Panapp- ticon: event-based tracing to measure mobile application and platform performance. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS ’13), 2013.

[92] Haoqiang Zheng and Jason Nieh. RSIO: Automatic user interaction detection and scheduling. In SIGMETRICS, 2010.

[93] Yian Zhu, Yue Li, Jingling Xue, Tian Tan, Jialong Shi, Yang Shen, and Chunyan Ma. What is system hang and how to handle it. In IEEE 23rd International Symposium on Software Reliability Engineering (ISSRE ’12), 2012.

[94] Yuhao Zhu, Matthew Halpern, and Vijay Janapa Reddi. Event-based scheduling for energy-efficient qos (eqos) in mobile web applications. In HPCA, 2015.

[95] Yuhao Zhu and Vijay Janapa Reddi. Greenweb: Language extensions for energy- efficient mobile web computing. In PLDI, 2016.

94 [96] Christian Zibreg. You can now watch videos while you do other things on viber for iphone. http://www.idownloadblog.com/, 2017.

95