UNIVERSITY OF CALIFORNIA RIVERSIDE

Detecting and Verifying Event-Driven Races in Mobile Apps

A Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

by

Yongjian Hu

December 2017

Dissertation Committee:

Dr. Rajiv Gupta, Chairperson
Dr. Iulian Neamtiu
Dr. Heng Yin
Dr. Zhijia Zhao

Copyright by Yongjian Hu 2017

The Dissertation of Yongjian Hu is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments

I am deeply indebted to my advisor, Dr. Iulian Neamtiu, for his generous support and guidance during my Ph.D. journey. Without his help, this dissertation would not have been possible. I am grateful to Iulian for the time we spent together discussing various research ideas and for the kindness that allowed me to pursue the research I am interested in. I still miss the crazy time we spent working together before paper deadlines. His passion and enthusiasm will be a great model for my future career. Thanks, Dr. Neamtiu!

I want to thank my committee chair, Dr. Rajiv Gupta, for his help and for teaching an excellent class on compiler optimization which stimulated my interest in program analysis research. I also thank the other members of the committee, Dr. Heng Yin and Dr. Zhijia Zhao, for their support and constructive feedback.

I am also thankful to my collaborators, Dr. Oriana Riva and Dr. Suman Nath from Microsoft Research, and Dr. Patrick Mutchler and Dr. Gogul Balakrishnan from Google. The internships with them were valuable experiences that helped me improve my research and coding.

I would like to express my gratitude to all my labmates: Tanzirul Azim, Arash Alavi, Zhiyong Shan, Bo Zhou, Yan Wang, Changhui Lin, Amlan Kusum, Vineet Singh, and Steve Su, for helping me in many ways during my graduate studies. I also express my regards to the University staff members Ms. Amy Ricks, Ms. Vanda Yamaguchi, Ms. Amanda Wong and Ms. Madie Heersink for their administrative support.

Finally, I would like to take this opportunity to thank my family for their unconditional and endless support. My wife Yiqiu Yang always stood by my side and encouraged me when I was stressed during my graduate studies. Her love is the driving force of my efforts. My parents Donghai Hu and Huifang Zhang spent all they could to help me succeed, and this dissertation is a small tribute to their endeavors.

To my wife and my parents, who have given me endless support.

ABSTRACT OF THE DISSERTATION

Detecting and Verifying Event-Driven Races in Mobile Apps

by

Yongjian Hu

Doctor of Philosophy, Graduate Program in Computer Science
University of California, Riverside, December 2017
Dr. Rajiv Gupta, Chairperson

Concurrency bugs are notoriously difficult to find and fix. The situation is even worse on mobile platforms due to their asynchronous programming model. Android, the dominant mobile platform, has been plagued by concurrency errors ever since its inception.

The majority of these errors are event-driven races, a new type of race caused by the nondeterministic order of events that access the same memory location, with at least one write access.

Prior research in this area consists of dynamic approaches to discover harmful event-driven races. However, due to their dynamic nature, these approaches suffer from coverage and false negative issues. Moreover, they produce false positives, cannot reproduce races, and cannot distinguish between benign and harmful races.

In this dissertation, we present an effective approach to systematically detect, verify and classify event-driven races in Android apps. The approach is composed of three sub-systems. First, a static event-driven race detector named SIERRA; this is the first static approach for race detection in Android, and in mobile systems in general. By employing a novel action-sensitive pointer analysis, a static happens-before graph, and symbolic execution-based refinement, SIERRA finds true races with high accuracy and efficiency. On a benchmark suite used by prior work on dynamic race detection, SIERRA was able to discover a superset of the races reported by the dynamic detector (29.5 vs. 4 true races per app), which demonstrates the advantage of a sound static approach. Second, a versatile yet lightweight record-and-replay tool named VALERA, which uses a novel sensor- and event-stream driven approach to record-and-replay. VALERA's low overhead (about 1% for either record or replay) and precise schedule replay allow it to reproduce event-driven races and hence facilitate bug fixing. Finally, a race verification and classification approach named ERVA, which uses event dependency graphs, event flipping, and replay to filter out false positives; for true positives, ERVA can distinguish benign races from harmful races via state comparison.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Dissertation Overview
  1.4 Contributions
  1.5 Organization

2 Background
  2.1 Android Background
    2.1.1 Android App Construction
    2.1.2 Android Concurrency Model
  2.2 Event-Driven Races
    2.2.1 Examples
    2.2.2 Formal Definition

3 Static Event-Driven Race Detection
  3.1 Design Overview
  3.2 Harness Generation
  3.3 Event Call Graph and Pointer Analysis
    3.3.1 Action Sensitivity
  3.4 Happens-before Relationship
    3.4.1 Definitions
    3.4.2 Actions: Static Happens-before Graph Nodes
    3.4.3 Static Happens-before Rules
    3.4.4 Accesses and Races
  3.5 Symbolic Execution-based Refutation
  3.6 Evaluation
    3.6.1 Effectiveness
    3.6.2 Efficiency
    3.6.3 Harmful Race Example
    3.6.4 Comparison with Dynamic Race Detection
    3.6.5 Discussion
    3.6.6 Results on the 174 App Dataset
  3.7 Summary

4 Race Reproduction via Record and Replay
  4.1 Motivation
    4.1.1 Accurate Timing and Low Overhead
    4.1.2 Schedule Replay
    4.1.3 Network and Sensors
  4.2 Design Overview
  4.3 API Interception and Replay
    4.3.1 Example: Intercepting the Location Services
    4.3.2 Automatic Interception through App Rewriting
    4.3.3 Interceptor Specification
    4.3.4 Intercepting Events and Eliminating Nondeterminism
    4.3.5 Intercepting Intents
    4.3.6 Recording and Replaying
  4.4 Fuzzy Replay
    4.4.1 Semantic Sensor Data Alteration
    4.4.2 Cross-app Testing
  4.5 Event Schedule Replay
    4.5.1 Recording the Event Schedule
    4.5.2 Replaying the Event Schedule
  4.6 Evaluation
    4.6.1 Evaluation of VALERA
    4.6.2 Evaluation of Fuzzy Replay
  4.7 Summary

5 Race Verification and Classification
  5.1 Motivation Examples
    5.1.1 False Positive Type-1: Imprecise Android Component Model
    5.1.2 False Positive Type-2: Implicit Happens-before Relation
    5.1.3 Benign Race Type-1: Control Flow Protection
    5.1.4 Benign Race Type-2: No State Difference
  5.2 Design Overview
    5.2.1 Race Detection
    5.2.2 Input Capture and Replay
    5.2.3 Event Dependency Graph
    5.2.4 Event Flipping
    5.2.5 State Recording and Comparison
    5.2.6 Race Verification and Classification
  5.3 Evaluation
    5.3.1 Effectiveness
    5.3.2 Efficiency
  5.4 Summary

6 Related Work
  6.1 Race Detection
    6.1.1 Race Detection for Thread-based Programs
    6.1.2 Race Detection for Event-Driven Programs
  6.2 Race Classification
  6.3 Record and Replay

7 Conclusions
  7.1 Future Work

Bibliography

List of Figures

1.1 Motivation example of an event-driven race. (a) A correct execution schedule. (b) An incorrect schedule caused by the alternated execution of onReceive and onStop.
1.2 Overview of this dissertation.

2.1 Thread model of a typical Android app.
2.2 Event posting patterns in Android.
2.3 Intra-component race.
2.4 Source code of race bug in Tomdroid.
2.5 Event schedule of main thread in Tomdroid: normal execution (left) and race leading to crash (right).

3.1 Overview of SIERRA.
3.2 Harness example.
3.3 HB edges among Activity lifecycle callbacks (dashed red arrows) induced by CFG dominance in our harness model. Black edges represent control flow.
3.4 HB edges (dashed red arrows) induced by CFG dominance in our GUI model, e.g., onResume→ onClick1, or onClick2→ onClick3. Black edges represent control flow.
3.5 Adding intra-action transitive HB edges: (a) is the action order, while (b) and (c) are possible schedules.
3.6 Refutation helps eliminate this false positive in the OpenSudoku app.

4.1 Screenshots of correct execution (left) and divergent execution due to imprecise timing (right).
4.2 Schedule of correct execution (left) and divergent execution due to imprecise schedule replay (right).
4.3 Overview of the VALERA runtime.
4.4 Location API example.
4.5 Overview of VALERA's automatic interception.
4.6 Location interceptor example.
4.7 Camera interceptor example.
4.8 Overview of SSDA.
4.9 Overview of cross-app testing.

5.1 False positive type-1 in the AnyMemo app.
5.2 False positive type-2 in the Cool Reader app.
5.3 Benign race type-1 in the Volley library.
5.4 Benign race type-2 in the AnyMemo app.
5.5 Overview of ERVA.
5.6 Excerpt from the EDG of the TomDroid app.
5.7 Performing event flipping on TomDroid's EDG in Figure 5.6.

List of Tables

3.1 Actions and HB introduction.
3.2 App popularity and size for the 20-app dataset.
3.3 SIERRA effectiveness on the 20-app dataset.
3.4 SIERRA efficiency on the 20-app dataset: running time for each stage and total, in seconds.
3.5 SIERRA effectiveness and efficiency on the 174-app dataset.

4.1 Network and sensor API usage in top-11 apps in each of the 25 categories on Google Play.
4.2 Annotation keywords.
4.3 VALERA evaluation results: apps, popularity, streams and overhead.
4.4 VALERA event streams: number of events and burst event rate (events/second); a '-' indicates that the app did not use that stream during our recording.
4.5 SSDA evaluation results.
4.6 Source and destination apps for cross-app replay.

5.1 ERVA effectiveness.
5.2 ERVA efficiency: running time, in seconds.

Chapter 1

Introduction

1.1 Motivation

Mobile technologies have changed the way we live, work and communicate. It is estimated that there are 4.92 billion mobile users globally in 2017, equaling 66% of the total world population [8]. The time users spend in mobile apps continues to grow. According to a recent survey, U.S. consumers now spend 5 hours per day on mobile devices, a number that has increased 69% year-over-year [10]. Users are increasingly relying on mobile apps for a variety of tasks, for example fetching news from the CNN app, booking restaurants via OpenTable, shopping online through Amazon, and paying bills with PayPal. Hence, the demand for high-quality apps has become pressing; buggy, crash-prone apps quickly lose customers.

This dissertation focuses on Android, the dominant software platform for smartphones and tablets [7]. Ever since the platform's inception, however, Android has been plagued by concurrency errors, with concurrency being one of the top-5 most common bug causes every year starting in 2008 [149]. In fact, a study of 18,000 fixed bugs in the Android platform and apps has revealed that 66% of the high-severity bugs are due to concurrency [150]. Android concurrency research has shown that the majority of Android race bugs are event-driven races [99, 72, 25]; per Maiya et al. [99], in Android apps, event-driven races are 4–7 times more frequent than multi-threading data races.

void onCreate() {
  db = new Database();
  registerReceiver(recv, ...);    // register the broadcast receiver
}
void onStart() {
  db.open();
}
void onReceive() {
  db.update(...);
}
void onStop() {
  db.close();
}
void onDestroy() {
  unregisterReceiver(recv);       // unregister the broadcast receiver
  db = null;
}

(a) Correct execution: onCreate, onStart, onReceive, onStop, onDestroy.
(b) Incorrect execution: onCreate, onStart, onStop, onReceive (exception!), onDestroy.

Figure 1.1: Motivation example of an event-driven race. (a) A correct execution schedule. (b) An incorrect schedule caused by the alternated execution of onReceive and onStop.

Example. Figure 1.1 shows an example of a harmful event-driven race that crashes the app. In the onCreate callback of the MainActivity, a Database object is created and a BroadcastReceiver is registered. Correspondingly, the receiver is unregistered and the Database object is freed in the onDestroy callback, where the activity is no longer usable. The program opens the database in the onStart method, when the activity is becoming visible to the user, and closes it in onStop, when it is no longer visible — the rationale is that the app should consume fewer resources when the activity is pushed into the background. The BroadcastReceiver is designed to be invoked from the background service when new data is available and to communicate with the foreground activity to update the data.

The event-driven race occurs if the broadcast message is delivered at the time when the activity is pushed into the background. Since the database is closed in the onStop callback, an update operation in the onReceive callback at this point causes an exception. Unlike traditional multi-threading races, which occur on different threads when at least one memory access is a write, event-driven races occur on the same thread, i.e., the looper thread, where incoming events arrive in nondeterministic order. There are multiple ways to fix this event race bug, for example registering and unregistering the broadcast receiver in onStart and onStop, or adding a flag that indicates the status of the activity and checking it before database updates.
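As a concrete illustration, the first fix mentioned above ties the receiver's lifetime to the activity's visible lifetime, so that onReceive can never run after the database has been closed. The sketch below is hedged: it reuses the db and recv names from Figure 1.1, and the IntentFilter variable filter is an assumption.

@Override
protected void onStart() {
    super.onStart();
    db.open();
    registerReceiver(recv, filter);   // receiver only lives while the activity is visible
}

@Override
protected void onStop() {
    unregisterReceiver(recv);         // no more onReceive callbacks can be dispatched
    db.close();
    super.onStop();
}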

Similar to other concurrency errors, such event-driven races are hard to detect and debug due to the nondeterministic nature of asynchronous events. Hence there is a strong impetus for constructing tools that help find event-driven races in Android apps.

This dissertation presents an effective approach to automatically detect, verify and classify event-driven races. Our approach helps app developers produce higher-quality apps; it helps researchers better understand event-driven races; and it helps mobile platform developers understand and reduce the sources of event-driven races. Ultimately, our approach improves the state of research and practice in mobile app development, which benefits mobile app users.

1.2 Challenges

Recently, researchers have devised program analyses to discover harmful event-driven races. Several dynamic detectors have been proposed, e.g., DroidRacer [99], CAFA [72], and EventRacer [25]. They operate in a similar fashion. First, they define a set of Happens-Before (HB) rules for Android's event-driven model. Then, they instrument the Android platform to collect runtime traces and run the app under this instrumentation; the collected traces contain event begin/end/posting and memory read/write information. Finally, they analyze the trace according to the HB graph: if there exist read/write or write/write operations on the same memory location and these operations are not ordered by HB, the tools report an event-driven race.
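The core check these detectors perform can be summarized as follows; this is a minimal sketch with hypothetical types and method names, not the actual implementation of any of the tools above.

// Two accesses race if they touch the same location, at least one is a write,
// and neither enclosing event is ordered before the other in the HB graph.
boolean isRace(Access a1, Access a2, HappensBeforeGraph hb) {
    return a1.location().equals(a2.location())
        && (a1.isWrite() || a2.isWrite())
        && !hb.ordered(a1.event(), a2.event())
        && !hb.ordered(a2.event(), a1.event());
}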

Though they find many event-driven races, these tools face substantial problems in real usage. This section summarizes the problems these approaches face.

Soundness of Race Detection. In this context, "soundness" means that the race detector never misses any true race in an app. However, due to the nature of dynamic analysis, dynamic tools are prone to missing actual bugs. For example, in Figure 1.1, a dynamic race detector will miss the race if the onReceive event of the BroadcastReceiver is not triggered by the underlying system. In our experiments, we found that EventRacer, the state-of-the-art dynamic race detector, missed 25.5 out of 29.5 true races, on average. Note that the effectiveness of dynamic tools hinges on high-quality inputs that ensure good coverage [21], as well as efficient ways to explore schedules.

The Complexity of the Android Framework. Since its 2007 inception, Android has evolved into a very complex system, and analyzing the Android Framework is a challenging task. The complex concurrency model makes event-driven race detection difficult for several reasons. First, unlike traditional (desktop/server) Java applications, Android apps do not have a main method but rather rely on callbacks being invoked by the Android Framework. Second, apps consist of activities (separate screens) that can be navigated back and forth [21]; further, each activity comprises GUI objects which can be accessed in relatively unconstrained order [123]. Third, asynchronous/long-running operations (e.g., network and I/O) are run in separate threads and their results are posted back via messages, in nondeterministic order.

Reproducing Races. Races are hard to debug because they manifest only under very specific schedules. The ability to record and reproduce races helps developers evaluate the side effects of the reported races and fix them. While useful, this task has proven difficult on mobile devices: smartphone apps revolve around concurrent streams of events that have to be recorded and replayed with precise timing.

Prior work on record-and-replay for desktop and server platforms [48, 143, 107, 142, 66] has relied on techniques such as hardware changes, VM logging, or system call interception; using such techniques on Android is problematic, due to high overhead and the wrong granularity—their word- or object-level granularity can be used for small, timing-insensitive programs, but not for Android apps. In desktop/server programs input comes from the file system, network, mouse or keyboard; with the exception of drags and double-clicks, large changes to input timing between the record and replay executions are gracefully tolerated by the program. In contrast, on smartphones, input can come concurrently from the network, GPS, camera, microphone, touchscreen, accelerometer, compass, and other apps via IPC. Moreover, the timing of delivering these input events during replay must be extremely accurate, as even small time perturbations will cause record or replay to fail.

These challenges require mobile record-and-replay tools to be lightweight, yet precise enough (recording sufficient input and schedule nondeterminism) to be able to reproduce races.

Precision and Classification. Precision is the key factor determining the usefulness of a race detector. If the detector reports too many false races, developers will be overwhelmed and unlikely to use the tool. Unfortunately, in our experience, even the most precise race detector currently available, EventRacer, is still prone to false positives. There are two main sources of false positives: 1) due to the complexity of Android, the race detector might omit certain asynchronous callbacks which have an implicit happens-before relation with other callbacks, and 2) ad-hoc synchronization is widely used in Android apps to protect shared variable accesses across events – reasoning about such synchronization requires sophisticated, computationally expensive analyses.

True races are not necessarily harmful. Prior research [106] on desktop applications has shown that 90% of true data races are actually benign. In our experiments on 16 apps, we reached a similar conclusion: only 3% of the reported true event-driven races are harmful [76]. Thus, the race detector should have the ability to classify races as benign or harmful. However, distinguishing between benign and harmful behavior is very challenging.

Static Race Detection → Record and Replay → Race Verifier & Classifier

Figure 1.2: Overview of this dissertation

1.3 Dissertation Overview

The overarching goal of this dissertation is to effectively detect event-driven races, automatically filter out most false positives, classify races as benign or harmful, and provide a concrete schedule to reproduce each race. This dissertation is composed of three components; Figure 1.2 shows the overall workflow. First, we design a static event-driven race detector to address the soundness issue of dynamic analysis. Second, we implement a lightweight record-and-replay tool for Android that is able to reproduce the racy schedule. Finally, we show how an event-flipping approach can verify true races and further classify them into benign and harmful races.

Next, we describe the high-level design of each component.

Static Race Detection. To address the soundness issue of dynamic analysis, we have implemented a tool named SIERRA (StatIc Event-based Race detectoR for Android). SIERRA analyzes bytecode, hence the app source code is not required, and apps can be readily analyzed in the APK format they are distributed in.

Given an app, SIERRA first decomposes the app and analyzes the manifest to fetch the list of all the activities. Second, for each activity, SIERRA automatically constructs harnesses to "kickstart" the static analysis. SIERRA introduces a new model that reifies Android concurrency primitives and their processing as context-sensitive actions (event processors) that can model threads, messages, lifecycle activities and GUI events. In addition, SIERRA introduces a novel action-sensitive context abstraction for pointer analysis that greatly improves the precision of alias analysis. Third, SIERRA statically infers the happens-before rules which order actions, from a harness-based model for lifecycle and GUI events to inter- and intra-procedural domination; the result is a Static Happens-before Graph. Next, for those actions and memory accesses that have not been ordered yet, SIERRA uses backward symbolic analysis, i.e., goal-directed (refutation-based) symbolic execution, to check whether independent path conditions indeed allow racy memory accesses to execute in any order. Finally, to help developers fix likely harmful races, SIERRA prioritizes race reports using several heuristics: 1) races in application code have higher priority than those in framework code; 2) framework races directly invoked from app code have higher priority than those invoked from the library; 3) races involving pointer reference reads/writes are more likely to be dangerous as they can lead to a NullPointerException.

Record and Replay. We introduce a novel, sensor- and event-stream driven approach to record-and-replay; by focusing on sensors and event streams, rather than system calls or the instruction stream, our approach is effective yet lightweight. We have implemented our approach in a tool called VALERA (VersAtile yet Lightweight rEcord and Replay for Android) that records and replays mobile apps. VALERA records input streams and events with minimal overhead, and then replays them with exact timing. Stream-driven replay hits a "sweet spot" on the accuracy vs. overhead curve: replaying sensor inputs and events with precise timing allows VALERA to be lightweight yet achieve high accuracy.

Note, however, that VALERA does not record all system state: it does not record memory accesses or the VM instruction stream (as other approaches do, albeit not on smartphones [107, 48, 142]), as this state is not externally visible. We made these design choices for two main reasons. First, to keep the approach widely applicable, we avoided hardware modifications. Second, the overhead of recording all memory accesses or the VM instruction stream is prohibitive: our experience with PinPlay for Android (a whole-system record-and-replay approach) shows that its high overhead perturbs app execution significantly.

VALERA is composed of two main components: API interceptors, which intercept the communication between the app and the system to eliminate nondeterminism due to network and sensor inputs, as well as inter-app communication via intents; and ScheduleReplayer, an approach for recording and replaying event schedules that eliminates event schedule nondeterminism and allows event-driven races to be deterministically replayed and reproduced.

Race Verifier and Classifier. To solve the problem of race verification and classification, we present ERVA (Event-race Reproducer and Verifier for Android), an automated approach for verifying and reproducing event-driven races in Android apps. ERVA takes as input a report of a potential race, and uses a suite of techniques to categorize the race. If the race is a false positive, it is reported as such. If the race can be confirmed, it is classified as benign or harmful. To support this classification, we introduce event dependency graphs and a novel definition of benign vs. harmful races in Android apps based on state comparison.

With the help of VALERA, ERVA is able to record sufficient replay information (e.g., UI events, input streams, sensor streams) and synchronization information (e.g., begin/end of threads and synchronization actions, event postings, etc.). The next stage examines the true positives to further distinguish between benign and harmful races. ERVA replays executions multiple times using previously-recorded inputs, and instruments the app to record app state. In each of these executions, ERVA "flips" – alternates the ordering of – the events to check their side effects, i.e., the effect of flipping on app state (app state includes all the UI view states, shared preferences, files, databases, and network traffic). If the flipping has no side effect, ERVA categorizes the race as benign; otherwise it is declared harmful. Since ERVA employs replay, developers have the opportunity to replay the app with those inputs and event schedules that lead to harmful races, to facilitate fixing the cause of the race.
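The flip-and-compare step can be summarized by the following sketch; the types and method names here (replayWithOrder, State, report) are hypothetical and only illustrate the idea, not ERVA's actual code.

// Replay the same recorded inputs under both orderings of the racy event pair,
// then compare the externally visible state of the two runs.
State s1 = replayWithOrder(recordedInputs, e1, e2);  // e1 handled before e2
State s2 = replayWithOrder(recordedInputs, e2, e1);  // flipped order
if (s1.equals(s2)) {
    report(RaceKind.BENIGN);    // order does not affect visible state
} else {
    report(RaceKind.HARMFUL);   // order changes visible state
}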

1.4 Contributions

This dissertation makes the following contributions:

• It presents, to the best of our knowledge, the first static event-driven race detector for Android apps. This static approach makes three main contributions. First, it defines actions as the Android concurrency unit, and introduces a novel action-sensitive context abstraction for pointer analysis that significantly improves precision. Second, it describes an approach for defining static happens-before rules in Android apps; two actions are not racing if they are in a happens-before relation, i.e., they are ordered. Finally, it introduces a suite of refinements and precision enablers, based on path-sensitive backward symbolic analysis, that substantially reduce the number of false races caused by ad-hoc synchronization.

• It introduces VALERA, a dynamic, stream-oriented record-and-replay tool to reproduce races. Unlike traditional heavyweight replay systems that record at memory-level granularity, VALERA focuses on event streams: it records input streams and events with minimal overhead, and then replays them with exact timing. VALERA can deterministically reproduce races if they manifest during the recording stage, and provides developers with concrete schedule traces to debug and fix the races.

• It describes an automatic approach to verify and classify reported races. An event-flipping algorithm is proposed that, with the help of record-and-replay, checks the side effects of races by alternating the event schedule. This approach introduces a novel definition of benign vs. harmful races based on externally visible state; the definition is configurable and extensible.

1.5 Organization

The rest of this dissertation is organized as follows. Chapter 2 introduces background knowledge about the Android platform and its concurrency model; we present several examples to give an informal view of what event-driven races are, and then we formally define event-driven races. In Chapter 3, we present the SIERRA static event-driven race detector with three novel contributions: a new context abstraction for pointer analysis designed for the event-driven model, static happens-before relationship inference, and path-sensitive backward symbolic execution to filter out ad-hoc synchronization. Chapter 4 introduces VALERA, a lightweight yet versatile record-and-replay system for Android; we describe how VALERA hits the "sweet spot" on the accuracy vs. overhead curve via a novel sensor- and event-stream driven approach to record-and-replay. Chapter 5 presents how event flipping can be used to distinguish true races from false positives; true races are further classified as benign or harmful by comparing externally visible state. Chapter 6 surveys related work. In Chapter 7 we conclude and outline directions for future work.

Chapter 2

Background

In this chapter, we provide background knowledge on Android, including app construction and the concurrency model, in Section 2.1. Several informal race examples are shown in Section 2.2, followed by a formal definition of event-driven races.

2.1 Android Background

The Android software stack consists of apps running on top of the Android Framework (AF), which orchestrates app control flow and mediates intra-app and inter-app communication, as well as the communication between apps and hardware. Apps are typically written in Java (though certain parts can be written in C or C++ for efficiency) and compiled either into Dalvik bytecode that executes on top of a Dalvik virtual machine (Android version < 5.0) or directly to native code (Android version ≥ 5.0). The Dalvik VM or native code in turn runs on top of an Android-specific Linux kernel.

2.1.1 Android App Construction

An app consists of components; there are four kinds of components: (1) Activities, i.e., user visible screens, (2) Services, used for background operations, (3) Content Providers, which manage access to data, and (4) Broadcast receivers, used for system or application events [16].

Activities are the most popular components—apps usually consist of a suite of Activities. The app transitions among activities in response to user input, e.g., in the Amazon app, the "Home" screen is named MainActivity; when the user clicks on the "Search" box, the app transitions to a SearchActivity; upon selecting a search from the list of result items, the app transitions to the SearchListActivity. Within one activity, various GUI objects are placed in a View hierarchy. Activities follow a state machine, where the states have associated callbacks that can be filled out by the programmer, e.g., upon activity creation the onCreate() method is called, upon destruction the onDestroy() callback is invoked, while in between the activity can cycle between visible and invisible states that have associated onStop()/onRestart() callbacks. GUI objects, e.g., menus and buttons, have callbacks as well. The AF automatically invokes callbacks in response to user input (e.g., clicking 'Back') or hardware events. While components are strongly isolated — e.g., the only way for one Activity to share information with another activity is through message passing (called Intents) — inter-component races are possible.
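The skeleton below illustrates these callbacks; it is a hedged sketch in which the layout resource R.layout.search, the view id R.id.search_button, and the startSearch helper are assumptions introduced only for illustration.

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;

public class SearchActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {    // lifecycle callback: creation
        super.onCreate(savedInstanceState);
        setContentView(R.layout.search);                     // assumed layout resource
        findViewById(R.id.search_button)                     // assumed view id
            .setOnClickListener(v -> startSearch());         // GUI callback registered in code
    }

    @Override
    protected void onStop() { super.onStop(); }              // activity becomes invisible

    @Override
    protected void onRestart() { super.onRestart(); }        // activity becomes visible again

    private void startSearch() {
        // transition to another activity via message passing (an Intent)
        startActivity(new Intent(this, SearchListActivity.class));
    }
}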


Figure 2.1: Thread model of a typical Android app.

2.1.2 Android Concurrency Model

Android has three main kinds of threads: looper, background, and binder. In Android, threads can communicate with each other in two ways: via messages (the most common way) or via shared memory (as in traditional Java applications, used sparingly).

In Android's concurrency model, every application process has a main thread (also called the "UI thread"); only the main thread can access the GUI objects, to prevent non-responsive threads from blocking the GUI. To update the GUI, other (non-main) threads can send messages to the main thread, and the main thread will dispatch these events to the appropriate user interface widgets. Long-running tasks such as network access and CPU-intensive operations are usually run in background threads. When these tasks are finished, the background threads post messages (we call these messages internal events) together with the data back to the UI thread. We now describe the Android threading model and then provide an example of how threads are used in a concurrent app.

Example. Figure 2.1 shows a standard Android app that downloads an image. When the user touches the touchscreen, the hardware touch events are delivered to the Window Manager Service (WMS). WMS keeps a record of all the apps' windows, i.e., window coordinates and layers. WMS checks the hardware touchscreen event's coordinates and sends it to the corresponding app. A handler is then invoked on the app's UI thread. The handler traverses the view tree hierarchy and invokes the corresponding view's action. If the user clicks a button, the handler posts an internal event with the onClick action to the UI thread's event queue. The onClick action forks a new background thread to download the image, offloading the task from the UI thread. The downloader thread may periodically send back internal events to show the percentage it has downloaded. When the download task is done, the downloader thread will post another event message along with the image to the UI thread. Finally, the UI thread decodes the image and displays it.
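A hedged sketch of this pattern is shown below; the download helper, the widget fields, and the class name are assumptions for illustration, not code from any particular app.

import android.graphics.Bitmap;
import android.os.AsyncTask;
import android.widget.ImageView;
import android.widget.ProgressBar;

class Downloader extends AsyncTask<String, Integer, Bitmap> {
    private final ProgressBar progressBar;   // assumed UI widgets owned by the activity
    private final ImageView imageView;

    Downloader(ProgressBar p, ImageView i) { progressBar = p; imageView = i; }

    @Override
    protected Bitmap doInBackground(String... urls) {  // runs on a background thread
        Bitmap bmp = download(urls[0]);                // hypothetical download helper
        publishProgress(100);                          // posts an internal event to the UI thread
        return bmp;
    }

    @Override
    protected void onProgressUpdate(Integer... pct) {  // runs on the UI (looper) thread
        progressBar.setProgress(pct[0]);
    }

    @Override
    protected void onPostExecute(Bitmap bmp) {         // runs on the UI (looper) thread
        imageView.setImageBitmap(bmp);
    }

    private Bitmap download(String url) { /* elided network code */ return null; }
}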

Threading Model. The following grammar describes Android thread kinds.

Thread ::= Looper | Non-looper
Non-looper ::= Background | Binder

Looper threads are threads with an associated Looper object that confers message dispatching capabilities on the thread: the thread blocks waiting for messages and, when a message comes, it is processed atomically. The main thread is a looper thread.

A background thread is the result of a regular thread fork() that does not register a Looper. Binder threads are created when an app is launched; binders are widely used for inter-process communication (IPC). Each app holds a binder thread pool, and the number of binder threads in the pool is automatically adjusted based on IPC usage.

Event Model. The following grammar describes the Android event model.

Event ::= ExternalEvent | InternalEvent
ExternalEvent ::= InputEvent | SensorEvent | IPC | HardwareInterrupt
InputEvent ::= MotionEvent | KeyEvent
SensorEvent ::= Compass | Accelerometer | Light | ...
InternalEvent ::= Message | Runnable Object

In Android, events can be either external or internal. External events originate in the hardware, cross into the OS and then into the AF. Apps can choose to use default handling for these events, in which case they are handled by the UI thread, or can register custom event handlers in other looper threads. Typical external events include input events (e.g., gesture or key events), sensor events (e.g., accelerometer, compass), IPC and hardware interrupts (such as VSYNC, a hardware "heartbeat" signal invoked 60 times per second). Internal events are messages or runnable objects sent from a non-looper thread to a looper thread. Internal events are created and sent via the Handler API at the AF or app level, rather than coming from the OS.

Loopers. Android provides a Looper class that clients can use to attach message dispatching capabilities to threads. Each thread can attach at most one Looper object; the main thread has one Looper by default when the app launches. Inside the Looper object there is a MessageQueue. If there is no message in the queue, the thread blocks; otherwise, the Looper removes the message at the front of the queue and processes it. Once the thread begins to process one message event, no other message processing routine can be started until the current one finishes. Hence event handling within each thread is atomic [72].

Figure 2.2: Event posting patterns in Android.

Messages and Handlers. Android also provides a Handler class that allows threads to send and process messages, as well as runnable actions. Each Handler instance is associated with a single thread and that thread's MessageQueue. There are two main uses for a Handler: (1) to schedule messages and runnables to be executed at some point in the future; and (2) to enqueue an action to be performed on a different thread. When the Looper retrieves a message from the message queue, it dispatches it to the corresponding Handler, which handles the message or runs the runnable action. Messages can be posted in a variety of ways: AtTime(time), i.e., post a message/action at a specific time; Delayed(delay), i.e., post a message/action after a specific delay; or AtFrontOfQueue, i.e., post a message/action at the front of the message queue.
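A brief, hedged illustration of these two uses follows; doWork, textView, and MSG_TIMEOUT are assumed names introduced only for this sketch.

Handler uiHandler = new Handler(Looper.getMainLooper());  // bound to the main thread's Looper

// Use (2): enqueue an action on the UI thread from a background thread.
new Thread(() -> {
    String result = doWork();                              // hypothetical long-running task
    uiHandler.post(() -> textView.setText(result));        // runnable runs on the UI thread
}).start();

// Use (1): schedule a message for the future, e.g., a 5-second timeout.
uiHandler.sendMessageDelayed(uiHandler.obtainMessage(MSG_TIMEOUT), 5000);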

Event posting. Event posting is at the core of the Android programming model. Android events can be divided into two categories: internal events (messages or callbacks sent to a looper thread) and external events (caused by hardware interrupts). We have identified six different event posting types (Figure 2.2 illustrates them).

Type 1: an external event posting a message to the same looper. For example, when the user touches a view widget on screen, the touchscreen generates a series of hardware interrupts (motion events). If the motion events are located in a View object that has a listener, the UI thread will post a listener callback to itself.

Type 2: an internal event posting a message to the same looper, i.e., a looper posting to itself. One use of this scenario is for implementing timeouts. For example, if a looper wants to limit the time for processing a task, it can use an internal event to post a timeout message to the same looper—when the timeout message is due for processing, the task has reached its time processing limit.

Type 3: a background worker thread posting a message to a looper thread. Since Android only allows the main thread to update the UI elements, it is common that background threads post messages or runnable callbacks to update the UI.

Type 4: cross-looper posting of internal events, e.g., when looper thread 1 posts a message to looper thread 2. Type 4 posting is very similar to Type 2, the difference being that in Type 4 the message is posted to another looper.

Type 5: cross-looper posting of external events, e.g., looper thread 1 posts a message to looper thread 2. This is similar to Type 4, but the event is external.

Type 6: a binder thread posting a message to a looper thread. Android apps frequently communicate with background services such as the Activity Manager Service (AMS). For example, when the AMS decides to start an app's activity, it sends a "launch activity" IPC call to the binder of that app and the binder posts an internal event message to the main thread looper. In this scenario, the activity's onCreate lifecycle callback will be invoked.

2.2 Event-Driven Races

In Android, event-driven races are the most common concurrency errors; they are 4–7 times more frequent than multi-threading data races [99]. In this section, we first demonstrate several race examples to give an informal view of what event-driven races are, and then present the formal definitions.

2.2.1 Examples

Intra-component Race. Figure 2.3 shows an actual, harmful event-driven race 1 — more precisely, an intra-component race, as it happens within one activity. The NewsActivity, shown on the left, has a RecycleView to display the news items. RecycleView is an advanced widget, designed to display large data sets that can be scrolled very efficiently by maintaining a limited number of views. In NewsActivity's onCreate method, the RecycleView is initialized and the corresponding adapter is set (lines 7–9). The activity registers an onClickListener that creates a LoaderTask (a subclass of AsyncTask) to update the news list; this is shown in the center of the figure. The time-consuming download operation is put in doInBackground, which runs in a separate thread; this practice is strongly suggested in Android to make the app more responsive. When the AsyncTask is done, it posts an onPostExecute callback to the main thread and notifies the adapter to refresh the RecycleView with the latest data.

1https://code.google.com/p/android/issues/detail?id=77846

1  class NewsActivity extends Activity
2      implements onClickListener {
3    RecycleView rv;
4    NewsAdapter adapter;
5
6    void onCreate() {
7      rv = (RecycleView) findViewById(...);
8      adapter = new NewsAdapter();
9      rv.setAdapter(adapter);
10   }
11
12   void onClick(View v) {
13     new LoaderTask(adapter).execute();
14   }
15 }

1  class LoaderTask extends AsyncTask {
2    final NewsAdapter adapter;
3    LoaderTask(NewsAdapter adapter) {
4      this.adapter = adapter;
5    }
6
7    void doInBackground() {
8      News[] newslist = download(...);
9      adapter.add(newslist);
10   }
11
12   void onPostExecute(News news) {
13     adapter.notifyDataSetChanged();
14   }
15 }

(Racy schedule, center of the figure: on the main thread, onClick runs LoaderTask.execute(); a later scroll runs getViewForPosition()/validateForPosition() while the background thread runs adapter.add(newslist), before onPostExecute invokes adapter.notifyDataSetChanged() on the main thread.)

Figure 2.3: Intra-component race.

The race manifests when the user scrolls the view before downloading has finished – a runtime exception will then crash the app. This exception occurs only in the specific event schedule (shown in Figure 2.3 on the right) where the onScroll callback is executed before onPostExecute on the main thread, and the adapter's internal data has just been updated in the background thread. The root cause of the bug is that when the user scrolls down, the RecycleView decides which view to show according to the last-scrolled position. If the view position does not match the previously-cached result because the adapter has not had a chance to execute notifyDataSetChanged to update the cache, the exception is thrown. The fix for this bug is to invoke notifyDataSetChanged right after the adapter's add method, or to move the add method into the onPostExecute callback of the AsyncTask.
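The second fix can be sketched as follows; this is a hedged illustration reusing the names from Figure 2.3 (NEWS_URL and the download helper are assumptions), not the app's actual patch.

class LoaderTask extends AsyncTask<Void, Void, News[]> {
    private final NewsAdapter adapter;
    LoaderTask(NewsAdapter adapter) { this.adapter = adapter; }

    @Override
    protected News[] doInBackground(Void... ignored) {  // background thread: download only
        return download(NEWS_URL);                      // hypothetical helper and constant
    }

    @Override
    protected void onPostExecute(News[] newslist) {     // main thread: mutate and notify together
        adapter.add(newslist);
        adapter.notifyDataSetChanged();
    }
}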

Inter-component Race. The previous example showed a harmful race within one Android component (an Activity in that case). Now we present an inter-component race, i.e., a race that occurs across components. Figure 2.4 shows a code snippet from Tomdroid, an open-source note-taking app which allows notes to be saved on, and synchronized with, a remote server. When the user clicks the 'Sync' button to synchronize notes, a new background worker thread (the "sync task") is forked to perform this task. Periodically, the sync task sends progress status back to the main thread. These status messages are handled by a SyncMessageHandler; each activity has an associated SyncMessageHandler. If the received message is SYNC_START, the main thread invokes the onSynchronizationStarted method, which plays an animation to show that Tomdroid is now syncing data. When the main thread receives SYNC_DONE, it calls onSynchronizationDone to stop the animation.

Tomdroid has a race condition which can lead to a crash. Suppose the user enters the ViewNote activity, clicks the 'Sync' button, waits until the sync operation is done, and then clicks 'Back' to go back to the main activity. The sync operation usually completes quickly, thus in most cases the user clicks 'Back' after the sync has already completed. The left side of Figure 2.5 shows the event schedule of the main thread in this scenario.

However, in case the sync is slow, the user could click 'Back' before the sync is done. The 'Back' operation then triggers a switch from the ViewNote activity to the Main activity. When the SYNC_DONE message is processed in the Main activity's handler, the main thread invokes onSynchronizationDone; in that method, dot.getAnimation() returns null because the animation object was created and registered in the ViewNote activity's handler. This causes a null pointer exception that crashes the app; the event schedule is shown on the right side of Figure 2.5. Note that to faithfully reproduce this bug, SYNC_DONE must be delivered after the activity transition events are handled. In the next section, we show how we achieve this by replaying the event schedule.

class SyncMessageHandler extends Handler {
  Activity activity;

  void onSynchronizationStarted() {
    Animation pulse = loadAnimation();
    View dot = activity.findViewById(R.id.sync_dot);
    dot.startAnimation(pulse);
  }

  void onSynchronizationDone() {
    View dot = activity.findViewById(R.id.sync_dot);
    Animation pulse = dot.getAnimation();   // returns null if the animation was registered
    pulse.setRepeatCount(0);                // in another activity's handler -> crash
  }

  public void handleMessage(Message msg) {
    if (msg.what == SYNC_START) {
      onSynchronizationStarted();
    } else if (msg.what == SYNC_DONE) {
      onSynchronizationDone();
    }
  }
}

Figure 2.4: Source code of race bug in Tomdroid.

// Schedule: click 'Back' after sync is done (normal execution)
  ...
  Lifecycle event: launch Main activity.
  UI event: click ListView to show one note.
  Lifecycle event: pause Main activity.
  Lifecycle event: launch ViewNote activity.
  Lifecycle event: stop Main activity.
  UI event: click sync button to sync notes.
  Async event: SYNC_START
  Async event: SYNC_PROGRESS
  Async event: SYNC_PROGRESS
  Async event: SYNC_DONE
  UI event: click back button.
  Lifecycle event: pause ViewNote activity.
  Lifecycle event: resume Main activity.
  ...

// Schedule: click 'Back' before sync is done (race leading to crash)
  ...
  Lifecycle event: launch Main activity.
  UI event: click ListView to show one note.
  Lifecycle event: pause Main activity.
  Lifecycle event: launch ViewNote activity.
  Lifecycle event: stop Main activity.
  UI event: click sync button to sync notes.
  Async event: SYNC_START
  Async event: SYNC_PROGRESS
  UI event: click back button.
  Lifecycle event: pause ViewNote activity.
  Lifecycle event: resume Main activity.
  Async event: SYNC_PROGRESS
  Async event: SYNC_DONE
  CRASH: Null pointer exception

Figure 2.5: Event schedule of main thread in Tomdroid: normal execution (left) and race leading to crash (right).

2.2.2 Formal Definition

Now we describe the formal definition of event-driven race. We begin by defining the concurrency model in terms of threads, events, memory locations, and operations on them. Then, we introduce a set of rules that define the Happens-before relation. Finally, we introduce definitions for false positives, benign races, and harmful races.

Thread type        t    ::= t^l | t^nl
Access type        τ    ::= read | write
Memory location    ρ    ∈  Pointers
Memory access      α    ::= α_τ(ρ)
Message            m    ∈  Pointers
Runnable object    r    ∈  Pointers
Event posting      β    ::= post(e, t^l, m | r, ∆)
Thread operation   γ    ::= fork(t1, t2) | join(t1, t2)
Operation          op   ::= α | β | γ
Event              e    ::= begin; op1; ... opn; end
Looper trace       π^l  ::= e*
Non-looper trace   π^nl ::= op1; ... opn
Trace              π    ::= π^l | π^nl

Definitions. In our approach, threads t can be either loopers t^l or non-loopers t^nl. For each thread we record a trace. For looper threads, the trace π^l contains a (possibly empty) sequence of events e. For non-looper threads, the trace π^nl contains a sequence of operations. Operations op can be memory accesses α (which capture the location ρ and the access kind, read or write); thread operations γ, for example fork(parent_tid, child_tid) or join(parent_tid, child_tid); or event postings β. Event postings create new events e by either sending a message m or posting a runnable object r to a looper thread t^l with a time delay ∆.

25 Happens-before relationship. Existing event-driven race detectors [25, 72, 99] have proposed various HB definitions (≺). We now proceed to define HB as a set of rules tied together by transitivity.

Program order rule: if an operation op1 precedes another operation op2 on the same thread in the trace, then they follow program order op1 ≺_π op2. Program order on non-looper threads implies HB, i.e., op1 ∈ t^nl ∧ op2 ∈ t^nl ∧ op1 ≺_π op2 =⇒ op1 ≺ op2, but not on looper threads. Rather, HB on looper threads can only be introduced by the looper atomicity rule, discussed next.

Looper atomicity rule: the order of operations executed within one event establishes HB; that is, if op1 ∈ ek ∧ op2 ∈ ek ∧ op1 ≺_π op2, then op1 ≺ op2.

Event order rule: e1 ≺ e2 if end(e1) ≺ begin(e2).

Event post rule: new events can be posted from looper threads t^l or non-looper threads t^nl. For the former case, say β = post(e2, t^l_1, m | r, ∆) ∧ β ∈ e1 ∧ e1 ∈ t^l_1, i.e., event e1 posts an event e2 to the looper thread t^l_1 that e1 belongs to; then e1 ≺ e2. For the latter case, say β = post(e, t^l, m | r, ∆) ∧ β ∈ t^nl, i.e., an event e is posted by a non-looper thread; then β ≺ e and, for all α ∈ t^nl with α ≺ β, we have α ≺ e.

Thread rule: if thread ti creates a new thread tj (γ = fork(ti, tj)), then ∀α ∈ tj we have γ ≺ α. Similarly, for a thread join γ = join(ti, tj), we have ∀α ∈ tj, α ≺ γ.

External event rule: in our model, each external event sensor si has a begin(si, θ) and an end(si, θ), where θ is the sequence number. The external events within the boundary of begin(si) and end(si) are ordered by HB. For example, a click operation is a series of external events from the touchscreen starting with ACTION_DOWN, followed by many ACTION_MOVEs, and ending with ACTION_UP. Here begin(si, θ) is ACTION_DOWN and end(si, θ) is ACTION_UP. All the external events e1, e2, . . . , en within this boundary follow HB order. However, if e1 and e2 are from two different sequences, then there is no strict HB order between them. An example is two click operations that could be triggered in either order.

Android component lifecycle rule: lifecycle callbacks of components such as Activity, Service, Fragment, and View are ordered by HB. For instance, an Activity's onCreate callback is always invoked before its onDestroy. Based on Android's documentation [59], a lifecycle graph [28] can be built to precisely describe the HB relation between Android component callbacks.

Transitivity: HB is transitive, that is, α1 ≺ α2 ∧ α2 ≺ α3 =⇒ α1 ≺ α3 and e1 ≺ e2 ∧ e2 ≺ e3 =⇒ e1 ≺ e3.

Event-driven races. We can now state the race definition. We say that event ei races with event ej if there exists a shared variable ρ such that αi(ρ) ∈ ei, αj(ρ) ∈ ej and ei ⊀ ej. On occasion we will refer to an event pair that satisfies this definition as "racy".
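Informally, revisiting Figure 1.1 under this definition: the onStop and onReceive events both access the shared database object db (db.close() and db.update(...), respectively, at least one of which is a write), and none of the rules above orders the two events, so the pair (onStop, onReceive) is racy.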

Access influence. Let α1 = α_τ1(ρ1) and α2 = α_τ2(ρ2). We say that access α1 influences access α2 (denoted α1 ⇝ α2) if executing α1 leads to a different value for ρ2 compared to omitting α1.

Benign race. We say that two events ei and ej have a benign race if they have an event-driven race (as defined above) on at least one location ρ, but for all αi ∈ ei and all αj ∈ ej we have ¬(αi; αj ⇝ α_EVS) and ¬(αj; αi ⇝ α_EVS). That is, the order in which αi and αj execute does not have any effect on the externally visible state (EVS). The EVS can be customized by the user; the default EVS definition will be presented in Section 5.2.5.

Harmful race. We define as harmful a race where the event execution order influences program state and this is reflected in the EVS. More precisely, we say that two events ei and ej have a harmful race if they have an event-driven race on at least one location ρ and ∃αi ∈ ei, ∃αj ∈ ej such that αi; αj ⇝ α_EVS or αj; αi ⇝ α_EVS. Harmful races can have various consequences, e.g., a crash, an exception, or an erroneous GUI state.

Chapter 3

Static Event-Driven Race

Detection

In this chapter, we present our static approach for detecting event-driven races.

In the previous two chapters we have shown that Android has been plagued by concurrency errors, and among them, event-driven races are pervasive and pernicious. Detecting and debugging these races is complicated in Android, and in event-driven systems in general, for several reasons. First, unlike traditional (desktop/server) Java applications, Android apps do not have a main method but rather rely on callbacks being invoked by the AF. Second, apps consist of activities (separate screens) that can be navigated back and forth [21]; further, each activity comprises GUI objects which can be accessed in relatively unconstrained order [123]. Third, asynchronous/long-running operations (e.g., network and I/O) are run in separate threads and their results are posted back via messages, in nondeterministic order. Fourth, ad-hoc synchronization eludes standard control- and data-flow analyses.

To overcome these challenges, we have implemented a tool named SIERRA (StatIc Event-based Race detectoR for Android). In the design of SIERRA, we introduce several novel approaches. First, we reify Android concurrency primitives and their processing as context-sensitive actions (event processors) that can model threads, messages, lifecycle activities and GUI events. Second, we use static analysis refinements to significantly improve precision, e.g., automatically-constructed harnesses to kickstart the static analysis, and a novel action-sensitive context abstraction for pointer analysis. Third, we introduce happens-before rules which order actions, from a harness-based model for lifecycle and GUI events to inter- and intra-procedural domination; the result is a Static Happens-before Graph. Fourth, for those actions and memory accesses that have not been ordered yet, we use symbolic analysis, i.e., goal-directed (refutation-based) symbolic execution, to check whether independent path conditions indeed allow events to execute in any order.

We evaluated SIERRA on 194 apps, of which 20 were chosen for further manual analysis. Experiments show that SIERRA is effective, discovering about 1,223 happens-before edges and 68 racy pairs per app. Refutation reduces these substantially, to just 43 race reports per app. SIERRA is efficient: it typically takes 960 seconds to analyze an app, which is acceptable for a static analysis. For the 20 manually-analyzed apps, we ran EventRacer Android [25], the most advanced dynamic race detector to date. We found that SIERRA reports 38 potential races on average, of which 29.5 are true races, whereas EventRacer Android reports 4 races, missing 25.5 true races. Moreover, SIERRA can also filter out some false positives reported by EventRacer.

Figure 3.1: Overview of SIERRA.

3.1 Design Overview

Figure 3.1 shows the architecture of SIERRA. First, we leverage DroidEL [29], a

static Android Framework modeling tool, to handle view inflation and reflection. The AF

relies on reflection to load the APK. For example, the GUI layouts, written in XML, are

accessed via the findViewById(int id) API to access the specific view. However, static analysis

cannot resolve such objects created via reflection. DroidEL can resolve these objects and

creates bindings between layout structure and view objects. The models generated by

DroidEL are then intergrated into our harness generator (described later) that will drive

the static analysis.

Second, we leverage WALA [79] to perform whole-program (application and frame-

work) analysis. WALA is a mature, industrial-level program analysis tool for object-oriented

languages like Java. It provides versatile features for program analysis such as pointer anal-

ysis, call graph generation and control/data flow analysis. Selecting the appropriate context

in pointer analysis is key to achieving scalability and precision. Prior research has shown

that object sensitivity is an effective abstraction for object oriented languages. However, in

event-driven systems like Android, object sensitivity is not precise enough. SIERRA introduces a novel abstraction called hybrid context sensitivity which selectively combines object

sensitivity, call-site sensitivity, and thread sensitivity (Section 3.3).

Different action execution orders on the looper thread lead to a non-deterministic

schedule; an event-driven race can manifest if two actions access the same memory, and at

least one access is a write. However, naively considering that each pair of memory actions

is a potential race will produce an overwhelming number of false positives. SIERRA defines

a set of static happens-before rules between actions to rule out infeasible racy action pairs,

e.g., onCreate always happens-before onDestroy (only actions that do not have strict happens-

before relation could be involved in races). This stage, described in Section 3.4.3, yields a

Static Happens-before Graph (SHBG).

Next, SIERRA generates candidate races by intersecting the points-to sets between

actions that are not ordered by happens-before. However, these pairs (named racy pairs)

are not necessarily races since in asynchronous programming ad-hoc synchronizations are

widespread. So, in the next step, we attempt to refute (rule out) false positives by a path-

sensitive, backward symbolic execution; for this we extended the Thresher tool [27] to verify

path feasibility between two actions (Section 3.5).

Finally, to help developers fix likely-harmful races, SIERRA prioritizes race reports

using several heuristics: 1) races in application code have higher priority than those in

framework code; 2) framework races directly invoked from app code have higher priority

than those invoked from the library; 3) races involved in pointer reference reads/writes are

more likely to be dangerous as they can lead to a NullPointerException.

Action class    | Action / creation (SHBG node)            | Happens-before introduction (SHBG edge)
Thread          | Asynchronous thread: new AsyncTask       | AsyncTask.execute()
                | Background thread: new Thread            | Thread.start()
                | Runnable: new <...> implements Runnable  | Executor.execute()
Message         | Message.obtain()                         | sendMessage∗(Message msg) / post∗(Runnable r); execution: Runnable.run()
Lifecycle event | onCreate(), onDestroy(), onStart(), onStop(), onRestart(), onPause(), onResume() | According to the activity lifecycle, e.g., onCreate → onStart, onStop → onRestart
GUI event       | onClick∗()                               | According to the GUI model, e.g., onClick1 → onClick2
System event    | BroadcastReceiver.onReceive()            | registerReceiver
                | onServiceConnected                       | bindService
                | onServiceDisconnected                    | startService

Table 3.1: Actions and HB introduction.

3.2 Harness Generation

We now describe SIERRA’s automatic harness creation approach. As SIERRA performs whole-program analysis, we need to find the app’s entrypoints. While in traditional

Java programs we would start at main(), Android apps have no main. Rather, in Android, app control flow is orchestrated by the AF, which invokes lifecycle callbacks, such as onCreate when the app is created, or onDestroy when the app is destroyed. Besides these lifecycle callbacks, an app can implement view event handlers (e.g., onClick and onScroll ) that can be

registered either statically in the layout XML or dynamically in code. Figure 3.2 illustrates

a harness generated for the example in Figure 1.1.

First, we create a Harness activity with a main method which serves as the entrypoint.

Second, we instantiate the NewsActivity and invoke its Activity lifecycle callbacks (lines 4–6

and 14–16). Third, starting from these lifecycle callbacks, a call graph is built by WALA

to compute the reachable methods. Within the reachable methods, the analysis might

discover new callbacks. For example, an onClickListener may be created and registered via

1 class Harness {

2 public static void main() {

3 NewsActivity a = new NewsActivity();

4 a.onCreate();

5 a.onStart() ;

6 a.onResume();

7 while (∗) {

8 switch(∗) {

9 case 1: a.invokeOnClick(); break;

10 case 2: a.invokeOnScroll() ; break;

11 ......

12 }

13 }

14 a.onPause();

15 a.onStop();

16 a.onDestroy();

17 }}

Figure 3.2: Harness example.

setOnClickListener. At this time, the harness generator creates synthetic invocation sites (lines 9–11) and builds the call graph again. This process iterates until a fixed point is reached, i.e., no new callbacks are found. Finally, the callbacks registered in XML files are added to the harness, since they are unique. We borrow FlowDroid [19]'s predefined callback list to find callbacks.

3.3 Event Call Graph and Pointer Analysis

Context sensitivity plays a key role for scalability and precision in static analysis.

Two main kinds of context sensitivity have been proposed for object-oriented languages: object-sensitivity (k-obj) [104] and call-site-sensitivity (k-cfa) [127].

Prior research [105, 128] has shown that object-sensitivity increases precision; how-

ever, we have found that it is still not precise enough for our Android setting. K-obj sen-

sitivity merges the last k object allocation sites, thus precision is lost for contexts longer

than k. The same loss occurs for k-cfa sensitivity which merges the last k call sites. In-

correct aliasing may occur when two different actions call a method foo() which contains

j call sites to method bar() and allocates an object. If j > k, both k-obj and k-cfa fail to

distinguish that the objects are allocated in two different actions and incorrectly consider

them as aliased because their last k allocation sites (or k call sites, respectively) are the

same. While precision could be improved by increasing the value of k, this greatly decreases

performance, as analysis complexity is exponential in k.
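To make the imprecision concrete, the following minimal Java sketch (all class and method names are hypothetical, not taken from any analyzed app) shows two actions reaching the same allocation site through a chain of calls longer than k, so a k-limited abstraction conflates the two objects.

```java
// Hypothetical sketch: two actions share a utility method whose
// allocation site is reached through more than k calls/allocations.
class Cache {
    Object bar() {                     // allocation site shared by both callers
        return new Object();
    }
    Object foo() { return helper1(); } // several nested call sites before bar()
    Object helper1() { return helper2(); }
    Object helper2() { return bar(); }
}

class ActionA implements Runnable {    // e.g., an onClick handler
    Object data;
    public void run() { data = new Cache().foo(); }
}

class ActionB implements Runnable {    // e.g., a message callback
    Object data;
    public void run() { data = new Cache().foo(); }
}
// With 2-obj or 2-cfa sensitivity, the last two context elements
// (helper2/bar) are identical for both actions, so ActionA.data and
// ActionB.data are conflated; tagging each abstract object with the
// id of the allocating action keeps them separate.
```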

3.3.1 Action Sensitivity

Based on the insight that objects should be associated with their corresponding ac-

tions, SIERRA introduces a new context abstraction named action-sensitivity which greatly

improves precision. When building the call graph for an action, we add the action’s id as

part of its context, and leverage hybrid-context-sensitivity which consists of object-sensitivity

and call-site-sensitivity. More specifically, the hybrid-context-sensitivity uses k-obj for nor-

mal dispatch calls and k-cfa for static invocations within one action. Each object’s abstract

context has the id of the action in which the object is allocated. In the previous foo() example, the objects created by foo() have different action ids in their contexts, and are not aliased

(conflated) anymore. Note that, although within one action the objects may still lose precision due to last-k merges, across actions objects are still separate. Since SIERRA focuses on analyzing objects accessed by different actions, we find that action-sensitivity is particularly useful for our race detection. While action-sensitivity is effective at distinguishing abstract objects, a class of objects named "views" needs to be handled specially, as explained next.

Inflated view context. Apps can define views using layout XML files, and then inflate the views at runtime. Android provides the findViewById(int id) API to access the inflated view, given the constant view id; findViewById can be invoked in different actions, but the object is aliased when using the same id. SIERRA uses a special context named Inflated-

ViewContext that contains view ids and their type. During APK parsing, for each view defined in the layout, SIERRA saves its view id into a map. When findViewById(id) is called,

SIERRA uses this constant id to retrieve the view object from the map; two inflated view objects are considered aliased when they have the same ids.
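The following is a minimal sketch of how such an inflated-view context could be represented; the class and field names are illustrative assumptions (keyed by the constant R.id value from the layout XML), not SIERRA's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: an inflated-view context keyed by the constant view id.
final class InflatedViewContext {
    final int viewId;        // constant id from R.id.*
    final String viewType;   // e.g., "android.widget.ListView"

    InflatedViewContext(int viewId, String viewType) {
        this.viewId = viewId;
        this.viewType = viewType;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof InflatedViewContext
                && ((InflatedViewContext) o).viewId == viewId;
    }

    @Override
    public int hashCode() { return viewId; }
}

final class ViewIdMap {
    private final Map<Integer, InflatedViewContext> byId = new HashMap<>();

    // Populated while parsing the APK's layout XML files.
    void register(int viewId, String viewType) {
        byId.put(viewId, new InflatedViewContext(viewId, viewType));
    }

    // Consulted when the analysis reaches findViewById(id) with a constant
    // argument: the same id yields the same abstract view object.
    InflatedViewContext lookup(int viewId) { return byId.get(viewId); }
}
```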

3.4 Happens-before Relationship

Prior event-driven race detectors for Android have defined dynamic happens-before rules [25, 72, 99]. Those definitions do not easily translate here, as our approach is static and uses path condition information, hence we define our own static happens-before (HB) rules. HB orders actions, described shortly, and the order relations are captured in a Static

Happens-Before Graph (SHBG).

3.4.1 Definitions

We first define the concepts and notations used in our approach. We use A, B,

A1, etc. as action names. The happens-before relation, denoted A ≺ B, indicates that we can statically prove that action A is completed before action B starts.

Races. We define races as unordered memory accesses, at least one of which is a write.

Our points-to sets map variables x to memory locations ρ, i.e., π(x) = ρ. Memory accesses

α are ⟨x, τ, A⟩ bundles, indicating that variable x is accessed using access type τ (read or write) in action A.

Racy pairs. We define racy pairs as follows: accesses α1 and α2 form a racy pair if they come from different actions A1 and A2, operate on at least one shared location (i.e., their points-to sets’ intersection is non-empty, π(α1.x) ∩ π(α1.x) 6= ∅) and at least one of the

accesses α1.τ or α2.τ is a write.

Race-finding strategy. Our approach proceeds by constructing an HB graph, then find-

ing all candidate racy pairs, and finally using symbolic analysis to refute those racy pairs

that are actually ordered.
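Putting the definitions together, a minimal sketch of the candidate-race check might look as follows; the types and helper methods are illustrative placeholders, assuming the points-to sets and SHBG queries are supplied by the preceding analysis stages.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative types only; SIERRA's internal representation differs.
final class Access {
    final String var;       // x: the accessed variable
    final boolean isWrite;  // τ: whether the access is a write
    final String action;    // A: the action performing the access

    Access(String var, boolean isWrite, String action) {
        this.var = var;
        this.isWrite = isWrite;
        this.action = action;
    }
}

final class RaceCheck {
    // π(x): points-to set of x, supplied by the pointer analysis (stubbed here).
    static Set<String> pointsTo(String var) { return new HashSet<>(); }

    // SHBG query: true if action a completes before action b starts (stubbed here).
    static boolean happensBefore(String a, String b) { return false; }

    static boolean isRacyPair(Access a1, Access a2) {
        if (a1.action.equals(a2.action)) return false;                 // need two distinct actions
        if (happensBefore(a1.action, a2.action)
                || happensBefore(a2.action, a1.action)) return false;  // already ordered
        Set<String> shared = new HashSet<>(pointsTo(a1.var));
        shared.retainAll(pointsTo(a2.var));                            // π(α1.x) ∩ π(α2.x)
        return !shared.isEmpty() && (a1.isWrite || a2.isWrite);        // shared location + a write
    }
}
```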

3.4.2 Actions: Static Happens-before Graph Nodes

Actions are the building blocks of SIERRA. An action represents context-sensitive

event handling. Table 3.1 shows how HB nodes and edges are identified, so they can be

added to the SHBG. When the analysis reaches an action creation (column 2) it creates the

appropriate HB node, as described next.

There are four classes of actions. Threads can be created as asynchronous tasks,

background threads, or runnables. Messages: in Android, messages are sent using either the

send∗ or post∗ API; in either case, the message has an associated Runnable which will execute

in the recipient thread. Lifecycle events: Android activities are controlled by the Android

Framework and have a well-defined lifecycle, described as activity states, which form HB nodes, while activity state transitions form HB edges ( Section 3.4.3 §2). GUI events: our

harness is a GUI model where GUI callbacks are HB nodes, while the GUI callback order

introduces HB edges ( Section 3.4.3 §3).

3.4.3 Static Happens-before Rules

We now define the HB rules, i.e., rules for adding edges between actions in the

SHBG.

1. Action invocation rule: when an action is invoked, the sender action happens

before the recipient. For example, as is standard in race detection, we add an HB edge

between the action in which a thread is created and the new action (that thread’s body).

Similarly, we add an HB edge from the message sender’s action to the message’s Runnable.

2. Android component lifecycle rule: in Android, activities follow a lifecycle

described as a state machine where state transitions invoke callbacks [59]. The Android

Framework will invoke these callbacks in predefined order, e.g., upon activity creation,

onCreate is invoked first, then onStart, then onResume.1 Our key insight is to use (pre)dominator information to distinguish between different instances of callbacks that appear in cycles so we can order them.

1While this lifecycle state machine has been unchanged since Android's inception, it would be trivial to change our model to accommodate potential future changes in the state machine, should they occur in subsequent Android versions.

Figure 3.3: HB edges among Activity lifecycle callbacks (dashed red arrows) induced by CFG dominance in our harness model. Black edges represent control flow.

We illustrate this rule in Figure 3.3 on the actual Android activity lifecycle. According to the lifecycle rules, onCreate is the first method to be invoked after an Activity has been created, while onDestroy is the last method to be invoked before the Activity is disposed of. However, in the call graph, all these callbacks are disconnected. The harness, described in Section 3.2, mirrors the Activity lifecycle and invokes the callbacks in the required order.

In the harness, as onCreate dominates any other node, we know that any shared memory access in onCreate will precede accesses in subsequent actions, e.g., onStart, hence we can add an HB edge onCreate→ onStart.

We now show how we deal with cycles. As seen in Figure 3.3, onResume/onPause and onStart/onStop form cycles. At first sight, these callbacks do not appear orderable by

HB. For example, since onResume is invoked after either the Started or Paused states, onResume appears not to be orderable with onPause — onPause can come either before or after onResume.

Our insight is that we can distinguish between the two onResume's if we take into account dominator information. For simplicity let us name onResume ‘‘1’’ the callback pre-dominated by onStart and onResume ‘‘2’’ the callback pre-dominated by onPause. Now it can be easily seen that

onResume ‘‘1’’ ≺ onPause, and

onPause ≺ onResume ‘‘2’’, hence the previously-unorderable callbacks can actually be ordered. Similarly, we have:

onStart ‘‘1’’ ≺ onStop

onStop ≺ onStart ‘‘2’’.

3. GUI layout/object order: similar to the Android lifecycle, the GUI layout captured by the harness (Section 3.2) is used as a basis for HB. We illustrate this rule in

Figure 3.4 on a simplified GUI layout, where an app cycles and nondeterministically chooses between onClick1() or the sequence onClick2(); onClick3(). Since onResume pre-dominates onClick1 we have:

onResume ≺ onClick1

onResume ≺ onClick2

onClick2 ≺ onClick3.

Figure 3.4: HB edges (dashed red arrows) induced by CFG dominance in our GUI model, e.g., onResume→ onClick1, or onClick2→ onClick3. Black edges represent control flow.

4. Intra-procedural domination. Assume that a method M in activity A has two outgoing calls e1 and e2 that post actions A1 and A2, respectively. If e1 dominates e2 then A1 ≺ A2; this is intuitive because e1 will always be invoked before e2, and by the time e2 executes (and gets a chance to post A2), A1 has already been posted, so A2 can only be posted after A1.
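As a hypothetical instance of this rule, consider a lifecycle callback that posts two runnables in straight-line code; the first post site dominates the second, so the corresponding actions are ordered (the activity and field names below are invented for illustration).

```java
import android.app.Activity;
import android.os.Bundle;
import android.os.Handler;
import android.os.Looper;

// Hypothetical example of rule 4: e1 (the first post) dominates e2,
// so action A1 is ordered before action A2 on the UI looper.
public class ExampleActivity extends Activity {
    private final Handler handler = new Handler(Looper.getMainLooper());
    private int counter = 0;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        handler.post(() -> counter = 1);   // e1 posts action A1
        handler.post(() -> counter = 2);   // e2 posts action A2
        // e1 dominates e2 in onCreate's CFG, so A1 ≺ A2 and the two
        // writes to counter need not be reported as a race.
    }
}
```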

5. Inter-procedural, intra-action domination. This is similar to rule 4, but

the difference is that e1 and e2 can be in two separate methods M1 and M2 of the same

activity A. Note that e1 does not necessarily dominate e2, because e2 might be invoked from

a context that does not involve e1. We leverage WALA’s interprocedural CFG (ICFG) to

address this issue as follows: we remove e1 from the ICFG and check whether e2 is still

reachable; if it is not reachable, then de facto e1 dominates e2 and we add A1 ≺ A2. If, on

the other hand, e2 is still reachable when e1 is absent, we do not add any HB edges.

6. Inter-action transitivity: If A1 ≺ A2, A1 posts an action A3, and A2 posts

an action A4, then A3 ≺ A4. Figure 3.5 illustrates this. On top (Figure 3.5 (a)) we show the order relation. On the bottom we show the two possible execution schedules for this order. A1 executes first, and during its execution, it posts A3. Importantly, by the time

A1 finishes, A3 is already posted. We have two cases: Figure 3.5(b) when A3 executes before A2 does, hence A3 ≺ A4 because A4 has not even been posted when A2 finishes; and Figure 3.5(c) when A2 executes first, but because A3 has already been posted when

A2 starts executing, A4 can only be posted after A3 hence A3 ≺ A4. We can infer these orderings thanks to the looper atomicity guarantee.

7. Transitivity: HB is transitive, i.e.,

A1 ≺ A2 ∧ A2 ≺ A3 =⇒ A1 ≺ A3

We repeatedly invoke transitive closure together with rule 6, as rule 6 can discover new HB edges in ways other than control- or data- flow (which rules 1–5 are limited to).

Note that after applying these HB rules we still have an under-approximation of all

HB relations, which preserves soundness at the expense of having potential false positives.

We now describe how we introduce further ordering to refine our HB relations and hence reduce false positives.

3.4.4 Accesses and Races

A thread can register a Looper object to receive asynchronous messages. Each Looper object is associated with one thread and each thread can register at most one Looper. In

Handler’s constructor, a Looper object must be specified so that the messages sent via this

Handler will be delivered to the corresponding thread. Two actions are considered to be potentially racy iff the corresponding Handler objects refer to the same Looper. SIERRA pre-processes all the creation sites of Loopers and Handlers to learn which thread each Handler is associated with, by traversing the call graph from the entry of each thread and performing an in-thread reachability analysis.

Figure 3.5: Adding intra-action transitive HB edges: (a) is the action order, while (b) and (c) are possible schedules.
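For reference, the standard Android pattern that this pre-processing reasons about is sketched below; whether two actions can race depends on whether their Handlers are bound to the same Looper (the thread and handler names are illustrative).

```java
import android.os.Handler;
import android.os.HandlerThread;
import android.os.Looper;
import android.os.Message;

// Each Handler is bound to exactly one Looper; two actions can only race,
// in SIERRA's model, if their Handlers refer to the same Looper.
class LooperSetup {
    void setUp() {
        HandlerThread worker = new HandlerThread("worker");
        worker.start();

        Handler workerHandler = new Handler(worker.getLooper()) {
            @Override
            public void handleMessage(Message msg) {
                // each handleMessage invocation is one action on the worker looper
            }
        };

        Handler uiHandler = new Handler(Looper.getMainLooper());
        // Actions posted via uiHandler run on the main (UI) looper and
        // therefore cannot race with workerHandler's actions.
        uiHandler.post(() -> { /* UI-thread action */ });
        workerHandler.sendEmptyMessage(0);
    }
}
```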

Ruling out ordered accesses. Racy pairs (e.g., accesses αA and αB in actions A and

B, respectively) form the starting point for detecting races—these accesses are candidate

races unless we can refute that assumption, i.e., prove that αA and αB are ordered (we do so via symbolic execution, described next).

1  Timer.Runnable runner = {
2    void run() {                  // action A
3      if (mIsRunning) {
4        mAccumTime = ...          // αA
5        if (∗) {
6          ...
7          postDelayed(runner, ...);
8        }
9        else
10         mIsRunning = false;
11 }}

12 void stop() {                   // action B
13   if (mIsRunning) {
14     mIsRunning = false;
15     mAccumTime = ...            // αB
16   }
17 }

Figure 3.6: Refutation helps eliminate this false positive in the OpenSudoku app.

3.5 Symbolic Execution-based Refutation

A candidate race, e.g., accesses αA and αB in two unordered actions A and B, is not necessarily a true positive since accesses could be protected by ad-hoc synchronization [120]; such synchronization idioms are prevalent in event-driven systems to protect the event handler from executing unsafe paths.

Example. We show how SIERRA refutes a candidate race in the OpenSudoku app (Fig- ure 3.6). The run method on the left is from a Runnable object that is posted from the onResume callback. The stop method on the right is invoked from the onPause callback to stop the Runnable object.

These two actions do not have an HB edge and they both write to a shared field mAccumTime (lines 4 and 15). SIERRA starts by considering both orderings possible. Let us assume that action B occurs before action A. SIERRA performs backward symbolic analysis starting from αA (line 4 in action A). When the analysis reaches the if conditional

on line 3, it adds a path constraint {mIsRunning = true}, i.e., a precondition to reach αA. The backward analysis continues until reaching the boundary of the run method and proceeds

(assuming there are no conflicting constraints). Then SIERRA traces the path back to the exit block of the stop method in action B, and continues backward. When the path reaches line 12, SIERRA chooses to enter the block guarded by line 13 because the guard condition is consistent with the path constraint {mIsRunning = true}. Finally, a conflicting constraint is found when the path reaches line 14 which performs a strong update to mIsRunning. This strong update means the path constraint after this statement must be mIsRunning = false, which conflicts with our current path constraints. After searching all the possible paths,

SIERRA cannot find a feasible way to witness the backward path from αA to αB, thus the candidate race is refuted.

The backward analysis framework is based on Thresher [27], which we adapted to

fit our event-driven race detection scenario. Thresher is designed to perform precise heap refutation by traversing all the paths related to the candidate query back to the program’s entrypoint. SIERRA changes the refutation process to be witnessing a feasible path between a source and a sink. The candidate race is a true positive, iff in both orderings of actions

A and B, there does exist a feasible path from αA to αB, and vice versa.

On-demand constant propagation. When the action is Handler.handleMessage(Message m), program behavior depends on the values of Message's fields, e.g., the what field is an integer indicating the type of the message. To increase precision, we introduce constraints to check if any of these fields are constant integers used as guard conditions. SIERRA does on-demand constant propagation from the creation site of the action (i.e., handler.sendMessage) and checks if any of the message's fields are constant. If so, the corresponding constraints are added to the query of the backward symbolic executor.
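A hypothetical example of the pattern this targets is sketched below: the sender fixes msg.what to a constant, so only the matching branch of handleMessage is reachable for that action instance (the handler class and message codes are invented for illustration).

```java
import android.os.Handler;
import android.os.Looper;
import android.os.Message;

// Illustrative handler: msg.what is a constant at each creation site,
// so on-demand constant propagation can prune the unreachable branch.
class DownloadHandler extends Handler {
    static final int MSG_START = 1;
    static final int MSG_DONE  = 2;

    DownloadHandler() { super(Looper.getMainLooper()); }

    @Override
    public void handleMessage(Message msg) {
        if (msg.what == MSG_START) {
            // accesses here cannot be reached by an action created with
            // obtainMessage(MSG_DONE), since what is a known constant there
        } else if (msg.what == MSG_DONE) {
            // ...
        }
    }

    void requestStart() {
        sendMessage(obtainMessage(MSG_START));   // creation site with constant what
    }
}
```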

Caching. Refutation's running time varies depending on app complexity. A refutation could be terminated by the executor if the system runs out of memory or exceeds the maximum number of paths (we set this to 5,000 paths in SIERRA). In either case, SIERRA

soundly reports the race, though it might be a false positive. To prevent redundant com-

putation, SIERRA memoizes (caches) the call graph nodes visited in a refuted query. Later

queries first check the cache. If the current node in a path exists in the cache, then the

query stops immediately as the path is infeasible. This caching mechanism is particularly

useful where many race candidates are within the same call graph node or dominated by

that node in a refuted query.
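A minimal sketch of such a memoization cache, under the assumption that refuted queries simply record the call graph nodes they visited (the class and method names are illustrative, not SIERRA's actual code):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the refutation cache: call graph nodes visited
// while refuting one query are memoized, so later queries reaching those
// nodes can be cut off immediately as infeasible.
final class RefutationCache {
    private final Set<String> refutedNodes = new HashSet<>();

    void recordRefutedQuery(Iterable<String> visitedCallGraphNodes) {
        for (String node : visitedCallGraphNodes) {
            refutedNodes.add(node);
        }
    }

    // Consulted before exploring a node on a new backward path.
    boolean isKnownInfeasible(String callGraphNode) {
        return refutedNodes.contains(callGraphNode);
    }
}
```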

3.6 Evaluation

We have evaluated SIERRA in terms of effectiveness, i.e., how many potential races

it can find, and efficiency, i.e., how long it takes to analyze an app.

App datasets. We chose apps from various categories (news apps, video players, email

clients, etc.) and of various sizes. First, we reuse Gator [123]’s benchmark which contains

20 apps as listed in Table 3.2. We chose this dataset because all the apps are open-source

so that we can manually check SIERRA’s correctness. The center column of Table 3.2 shows

app popularity, retrieved from Google Play in August 2017. As we can see, 17 of these apps

have in excess of 100,000 installs; the number of installs was not available for Open Manager as it was retrieved from the alternative F-Droid market. The right column of Table 3.2 shows the bytecode size (.dex) for each app; this ranges between 63 KB and 5.4 MB. The experimental results on these 20 apps are discussed in Sections 3.6.1–3.6.5.

App             | Installs                | Bytecode size (KB)
APV             | 500,000–1,000,000       | 736
Astrid          | 100,000–500,000         | 5,400
Barcode Scanner | 100,000,000–500,000,000 | 808
Beem            | 50,000–100,000          | 1,700
ConnectBot      | 1,000,000–5,000,000     | 700
FBReader        | 10,000,000–50,000,000   | 1,013
K-9 Mail        | 5,000,000–10,000,000    | 2,800
KeePassDroid    | 1,000,000–5,000,000     | 489
Mileage         | 500,000–1,000,000       | 641
MyTracks        | 500,000–1,000,000       | 5,300
NPR News        | 1,000,000–5,000,000     | 1,500
NotePad         | 10,000,000–50,000,000   | 228
OpenManager     | N/A                     | 77
OpenSudoku      | 1,000,000–5,000,000     | 170
SipDroid        | 1,000,000–5,000,000     | 539
SuperGenPass    | 10,000–50,000           | 137
TippyTipper     | 100,000–500,000         | 79
VLC             | 100,000,000–500,000,000 | 1,100
VuDroid         | 100,000–500,000         | 63
XBMC remote     | 100,000–500,000         | 1,100

Table 3.2: App popularity and size for the 20-app dataset.

Next, we chose an additional 174 apps with a median size of 1.1MB from F-droid [9]

(an online open source repository for Android) for automatic testing. These results are discussed in Section 3.6.6.

Experimental setup. We ran our experiments on an 8-core hyper-threaded (hence 16 threads) Intel Xeon E5-2687W CPU at 3.4 GHz, with 64 GB memory. The server was running Ubuntu 14.04.1 LTS. We use DroidEL as a pre-processor to handle reflection and extract the app layout, and automatically create harnesses via our harness generator. WALA provides points-to information and call graph construction. The action-sensitive context selector is implemented as a WALA plugin. SIERRA modifies Thresher to run goal-directed, path-sensitive race refutation; Thresher in turn uses the Z3 SMT solver [45].

App             | Harnesses | Actions | HB Edges | Ordered (%) | Racy pairs w/o AS | Racy pairs with AS | After refutation | True races | FP  | EventRacer app races
APV             | 4         | 84      | 1,648    | 47          | 75                | 25                 | 10               | 8          | 2   | 3
Astrid          | 6         | 147     | 2,755    | 26          | 319               | 83                 | 54               | 37         | 17  | -
Barcode Scanner | 9         | 136     | 2,756    | 30          | 64                | 24                 | 15               | 11         | 4   | 7
Beem            | 12        | 169     | 3,724    | 26          | 467               | 73                 | 13               | 10         | 0   | 0
ConnectBot      | 11        | 171     | 4,829    | 33          | 567               | 96                 | 58               | 43         | 15  | 16
FBReader        | 27        | 259     | 4,710    | 14          | 836               | 285                | 106              | 93         | 13  | 5
K-9 Mail        | 29        | 312     | 5,725    | 12          | 1,347             | 370                | 89               | 72         | 17  | 1
KeePassDroid    | 15        | 216     | 4,076    | 18          | 266               | 61                 | 27               | 16         | 1   | 0
Mileage         | 50        | 331     | 8,498    | 16          | 496               | 195                | 36               | 33         | 3   | 1
MyTracks        | 8         | 198     | 6,826    | 35          | 634               | 174                | 80               | 75         | 5   | 34
NPR News        | 13        | 490     | 10,673   | 9           | 607               | 132                | 21               | 21         | 0   | 3
NotePad         | 9         | 72      | 609      | 24          | 436               | 65                 | 31               | 27         | 4   | 9
OpenManager     | 6         | 92      | 1,036    | 25          | 532               | 113                | 55               | 51         | 4   | 5
OpenSudoku      | 10        | 141     | 1,425    | 14          | 426               | 158                | 110              | 83         | 27  | 72
SipDroid        | 11        | 206     | 2,386    | 11          | 321               | 94                 | 27               | 17         | 10  | -
SuperGenPass    | 2         | 43      | 343      | 38          | 82                | 16                 | 6                | 6          | 0   | 3
TippyTipper     | 5         | 100     | 1,864    | 38          | 93                | 21                 | 9                | 7          | 2   | 1
VLC             | 13        | 151     | 2,349    | 20          | 202               | 78                 | 35               | 32         | 3   | 0
VuDroid         | 3         | 45      | 150      | 15          | 62                | 27                 | 10               | 10         | 0   | 5
XBMC            | 13        | 330     | 4,218    | 8           | 445               | 137                | 63               | 48         | 15  | 17
Median          | 10.5      | 160     | 2,755    | 22          | 431               | 80.5               | 33               | 29.5       | 8.5 | 4

Table 3.3: SIERRA effectiveness on the 20-app dataset.

3.6.1 Effectiveness

We present the results in Table 3.3. Per Section 3.2, SIERRA creates a harness method for each app activity which serves as the entrypoint of the static analysis (on average 10.5 harnesses per app). Next, we show the number of actions, i.e., SHBG nodes.

The number sums all the actions found in each harness—typically about 160 actions per app. Column 4 shows the total number of HB edges found by SIERRA, and column 5 shows the fraction of HB edges compared with the total number of edges (e.g., if the app has N actions, and all actions are in a happens-before relation, the transitively closed graph would have N∗(N−1)/2 edges); the higher this percentage, the less work later stages have to do at refuting potential races, and the lower the chance of false positives. Note how SIERRA manages to find 22% of the theoretical maximum number of edges.

Columns 6 and 7 show the number of racy pairs without and with action-sensitive contexts. The results demonstrate the effectiveness of action sensitivity, as action-sensitive contexts reduce racy pairs by a factor of 5, from 431 to 80.5, which then greatly reduce the number of races to be refuted by the backward symbolic executor. After refutation (column

8) the median number of races is reduced substantially, to just 33, which we believe is very effective for developers. Section 3.6.4 compares SIERRA’s results with EventRacer’s (last

column).

We have manually inspected the races reported by SIERRA and classified them into true races (median = 29.5) and false positives (median = 8.5), shown in columns 9 and 10.

Section 3.6.5 contains a detailed analysis of true/false positives.

3.6.2 Efficiency

Table 3.4 shows the results of efficiency experiments. For each app, we show the

time, in seconds, it took to run each analysis stage. The front-end analysis with WALA

typically takes 1,310 seconds per app (CG+PA column). SHBG construction takes 28.5 seconds

which is quite efficient. Unsurprisingly, refutation takes about 560.5 seconds per app due to

symbolic execution. In total, SIERRA takes about 1,899 seconds per app, which is acceptable for a static analysis.

App             | CG+PA | HBG  | Refutation | Total
APV             | 182   | 18   | 83         | 283
Astrid          | 325   | 24   | 938        | 1,287
Barcode Scanner | 173   | 29   | 247        | 449
Beem            | 397   | 36   | 1,664      | 2,097
ConnectBot      | 241   | 54   | 2,128      | 2,423
FBReader        | 1,058 | 85   | 1,687      | 2,830
K-9 Mail        | 2,936 | 113  | 2,759      | 5,808
KeePassDroid    | 136   | 33   | 288        | 457
Mileage         | 1,927 | 41   | 3,361      | 5,329
MyTracks        | 2,711 | 52   | 2,170      | 4,933
NPR News        | 562   | 46   | 1,546      | 2,153
NotePad         | 148   | 78   | 702        | 928
OpenManager     | 275   | 53   | 715        | 1,043
OpenSudoku      | 253   | 36   | 612        | 901
SipDroid        | 278   | 71   | 488        | 837
SuperGenPass    | 87    | 16   | 419        | 522
TippyTipper     | 133   | 32   | 285        | 450
VLC             | 738   | 30   | 793        | 1,561
VuDroid         | 67    | 29   | 405        | 501
XBMC            | 2,438 | 39   | 1,038      | 3,515
Median          | 1,310 | 28.5 | 560.5      | 1,899

Table 3.4: SIERRA efficiency on the 20-app dataset: running time for each stage and total, in seconds.

3.6.3 Harmful Race Example

The NPR News app contains a harmful event race that may result in incorrect view states – such a race is hard to detect dynamically. The NewsListActivity contains a ListView to show the news list. When new data must be loaded, the app creates background threads, via ImageLoaderTask, to load a list of news items – each item bundles images and text from a certain URL. Similar to the example in Figure 2.3, the program does not take scroll events into consideration. If a scroll event occurs before the background ImageLoaderTask posts back data, the ListView will create another ImageLoaderTask to load the new image. If the new image comes before the old one, the old image will replace the new one, hence displaying the incorrect image to the user. Triggering this race requires a specific event order – this order can easily elude dynamic race detectors. There are multiple ways to fix this bug. The key is to associate the ImageLoaderTask with the URL for each news item. If the downloaded image does not match the item's URL, then the image should not update the view.

App    | Bytecode size (KB) | Harnesses | Actions | HB edges | Ordered (%) | Racy pairs | After refutation | CG  | HBG | Refutation | Total
Median | 1,114              | 4.5       | 67.5    | 1,223    | 17.3        | 68         | 43.5             | 139 | 27  | 648        | 960

Table 3.5: SIERRA effectiveness and efficiency on the 174-app dataset (CG, HBG, Refutation, and Total are running times in seconds).
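Returning to the NPR News race above, the following is a hedged sketch of the fix just described, i.e., tagging the view with the URL it is expected to display and dropping stale results; the class structure and the downloadImage helper are illustrative, not the app's actual code.

```java
import android.graphics.Bitmap;
import android.os.AsyncTask;
import android.widget.ImageView;

// Illustrative sketch of the fix: remember which URL a view expects and
// ignore results that no longer match it.
class ImageLoaderTask extends AsyncTask<Void, Void, Bitmap> {
    private final ImageView view;
    private final String url;

    ImageLoaderTask(ImageView view, String url) {
        this.view = view;
        this.url = url;
        view.setTag(url);                 // the view now "belongs" to this URL
    }

    @Override
    protected Bitmap doInBackground(Void... params) {
        return downloadImage(url);        // hypothetical download helper
    }

    @Override
    protected void onPostExecute(Bitmap bitmap) {
        // Only update the view if it still belongs to this URL; a scroll may
        // have recycled it for a different news item in the meantime.
        if (url.equals(view.getTag()) && bitmap != null) {
            view.setImageBitmap(bitmap);
        }
    }

    private Bitmap downloadImage(String u) { return null; /* placeholder */ }
}
```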

3.6.4 Comparison with Dynamic Race Detection

We also ran EventRacer Android [25], the most advanced dynamic race detector to date, on our test apps. We show the dynamic detection results in the last column of

Table 3.3. Out of the 20 apps, we could not run EventRacer Android on Astrid and SipDroid.

We then considered the high priority races that occur in app code. After analyzing the 182 races reported by EventRacer Android in 18 apps, we found that 102 of them are false positives because they are protected by guard conditions. EventRacer Android uses a concept called "Race coverage" to filter ad-hoc synchronization races, but it only reasons about primitive type variables. Most of the 102 false positive races are protected by pointer-checking conditions (e.g., var != NULL or var == NULL). SIERRA can successfully filter out these false positives because it uses combined path and points-to queries. For such cases,

SIERRA has the ability to provide more accurate results than EventRacer.

There were also 15 races reported by EventRacer that SIERRA did not report because they could be ruled out; these races fall into two categories. First, EventRacer considers that UI actions can occur after Activity lifecycle callbacks (e.g., onClick after onStop). However, SIERRA rules this out because when an Android Activity goes to an invisible state (i.e., is stopped), a UI callback cannot be executed. Second, EventRacer considers pairs of UI actions as racy, but SIERRA can order UI events (Section 3.4.3). The remaining races missed by EventRacer are in actions the dynamic detector does not cover.

This demonstrates the soundness advantage of a static approach compared with a dynamic approach.

3.6.5 Discussion

False positives. Thanks to action-sensitivity, the SHBG, and symbolic execution, SIERRA can filter out a large number of false positives. However, we found some cases where false positives may happen. For example, in OpenManager, SIERRA reports a race as follows: both onCreate and onClick create a thread that fetches some data from disk and posts callbacks to update the ListView items. SIERRA considers thread callbacks as non-deterministic. But there is an implicit dependency in the app: onClick can only be triggered after the ListView is filled with data by the thread from onCreate. Such implicit dependencies are beyond the current capabilities of SIERRA. Another type of false positive comes from over-approximate merging in arrays or containers. SIERRA uses index-insensitive analysis to handle instances stored into an array or list. Finally, a symbolic executor timeout may also produce false positives – if we cannot refute within the time budget, we report a potential race to maintain

soundness.

Benign races. Symbolic execution is instrumental in filtering out the vast majority of candidate races. The reported races are true positives because SIERRA witnesses a feasible path. However, a true race is not necessarily harmful. In fact, the majority of the true races are due to guard variables in the control flow graph. For example, in Figure 3.6,

SIERRA reports a true race of reading mIsRunning on line 3 of action A and writing it on line 14 of action B. Note that mIsRunning is a guard variable to protect mAccumTime from incorrect access. If action A happens first, the read value is true in action A, and in the alternative order, the read value is false. Although this race is a true race, it is arguably benign. We have examined the race reports (“After Refutation” column) and found that

74.8% fit this pattern. For the remaining race reports, SIERRA witnesses different values of an instance variable being set in alternative order of the actions. To conclude, all the true races reported by SIERRA are due to bad programming practices and should be fixed, but the precise extent of the harm inflicted upon the app varies from race to race.

False negatives. SIERRA is sound up to reflection and native code, which is standard practice in Java static analysis. Reflection use beyond the capabilities of DroidEL and unsafe concurrent use of native code might result in false negatives, i.e., missed races.

3.6.6 Results on the 174 App Dataset

SIERRA’s results (medians) on the additional set of 174 apps which were not sub- ject to manual analysis are shown in Table 3.5. SIERRA typically reports 43.5 potential races per app, and the analysis takes 960 seconds. These results are mostly in line with the

20-app dataset but we believe the results are more indicative due to the larger set size (174 vs. 20).

3.7 Summary

In this chapter, we have presented SIERRA, the first (to our knowledge) approach for static event-based race detection in Android apps. We show that, by employing precise, automatically-constructed harnesses and a static happens-before graph, we can order actions quite effectively. Further, by employing action-sensitivity as well as symbolic execution we can eliminate a large percentage of false positives. Experiments reveal that our approach is effective at finding true races without a large number of false positives, yet has acceptable performance. In Chapter 4 we show how we have designed and implemented a record-and-replay tool which allows races to be reproduced, hence facilitating bug fixing.

Chapter 4

Race Reproduction via Record and

Replay

Given a potential event-driven race, e.g., as reported by SIERRA in the previous chapter, it is particularly useful to have a tool that can reproduce the race. In this chapter, we present a record-and-replay technique that helps developers debug the race.

While useful, reproducing executions has proven difficult: smartphone apps revolve around concurrent streams of events that have to be recorded and replayed with precise timing.

To keep overhead low, prior record-and-replay approaches for smartphones only capture GUI input [15, 13, 62, 86, 56, 68], which hurts accuracy as they cannot replay input from the network or sensors (e.g., GPS, camera, and microphone) that are used frequently by popular apps; or they only capture events, to reproduce event-based races [72, 99]. Prior work on record-and-replay for desktop and server platforms [48, 143, 107, 142, 66] has relied on techniques such as hardware changes, VM logging, or system call interception; using such techniques on Android is problematic, due to high overhead and wrong granularity—their word- or object-level granularity can be used for small, timing-insensitive programs, but not for Android apps.

In desktop/server programs input comes from the file system, network, mouse or keyboard; with the exception of drags and double-clicks, large changes to input timing between the record and replay executions are gracefully tolerated by the program. In contrast, on smartphones, input can come concurrently from the network, GPS, camera, microphone, touchscreen, accelerometer, compass, and other apps via IPC. Moreover, the timing of delivering these input events during replay must be extremely accurate, as even small time perturbations will cause record or replay to fail.

To address these challenges, we introduce a novel, sensor- and event-stream driven approach to record-and-replay; by focusing on sensors and event streams, rather than system calls or the instruction stream, our approach is effective yet lightweight. We have implemented our approach in a tool called VALERA (VersAtile yet Lightweight rEcord

and Replay for Android)1 that records and replays smartphone apps, by intercepting and recording input streams and events with minimal overhead and replaying them with exact timing.

Figure 4.1: Screenshots of correct execution (left) and divergent execution due to imprecise timing (right): (a) Shazam correct, (b) Shazam divergent, (c) QR Droid correct, (d) QR Droid divergent.

Figure 4.2: Schedule of correct execution (left) and divergent execution due to imprecise schedule replay (right): (a) Barcode Scanner correct, (b) Barcode Scanner divergent.

4.1 Motivation

4.1.1 Accurate Timing and Low Overhead

Consider two popular apps, Shazam and QR Droid, that use sensor stream inputs.

For each app, we first perform a record with VALERA; thanks to VALERA’s low overhead,

the support for record has no impact on the execution. Next, we replay the app with

VALERA normally, i.e., with precise timing (we call this the “correct” execution as there is

no visible difference between the recorded and replayed executions). Then, we replay the

app with VALERA again, but alter timing during the sensor replay phase, i.e., deliver events

slightly earlier or later than originally recorded. The effect is divergence (the app exhibits

different behavior), hence we call this the “divergent” execution. In Figure 4.1(a) we show

a screenshot of the Shazam music recognition app replayed with VALERA with exact audio

stream timing, and correctly recognizing a song; Figure 4.1(b) shows what happens—Shazam

fails to recognize the song, per the message on top “Sorry we couldn’t find a match”—

when we deliberately speed up the audio stream timing by 40%. Figure 4.1(c) contains a

screenshot of the QR Droid barcode scanner app replayed with VALERA with exact frame

buffer timing, correctly recognizing a book barcode (97881173660351); Figure 4.1(d) shows

that QR Droid fails to recognize the barcode, per the message on the left side “Place the

barcode inside the viewfinder rectangle to scan it”, when we deliberately introduce a 200

millisecond delay between frames.

The importance of overhead during record and replay. The previous examples

illustrate how the execution of mobile apps is sensitive to input stream timing, and how timing deviations lead to record or replay errors. We now discuss two more classes of timing errors in Android apps.

1Available at http://spruce.cs.ucr.edu/valera/

First, Android No Response (ANR) error: Android will raise an ANR and kill the app if the UI thread cannot handle an event within 5 seconds or if an IPC event gets no response for 10 seconds. Thus, if the recording overhead is high (e.g., recording every memory access), then the system is likely to raise an ANR error, terminating record or replay.

Second, the semantics of UI gestures may change [56]. When the user touches the screen, the hardware generates a series of motion events, starting with ACTION_DOWN and ending with ACTION_UP. The time between ACTION_DOWN and ACTION_UP is crucial for deciding whether the user input is a tap, a click, or a long click. High-overhead recording affects input timing, hence the semantics of the gesture change, e.g., a click becomes a long click.

This overhead issue affects the usability of PinPlay [113], a record-and-replay system that has been recently ported to Android. When using PinPlay to record app executions, the overhead is prohibitively high because PinPlay instruments the screen device driver in the OS, the OS code that dispatches the input events, and finally the Android

Java code that dispatches events. We have observed that, when performing a touchscreen gesture, apps respond very slowly, showing the ANR error; eventually the OS kills the app.

For example, when attempting to use PinPlay to record the Amazon Mobile app, sending

multiple clicks triggered time outs and eventually the OS killed the test app after 90 seconds.

In contrast, VALERA can handle Amazon Mobile essentially in real-time (1.7% overhead for

record and 2.34% overhead for replay.

4.1.2 Schedule Replay

Consider the popular Barcode Scanner app that can recognize various types of barcodes. Figure 4.2a illustrates the working model of Barcode Scanner. When the user

starts the app, the UI thread will load the CaptureActivity screen. In the onResume() method,

the UI thread forks a new thread, DecodeThread, that performs heavy-weight computation—

decoding the barcode—thus relieving the burden of the UI thread and ensuring the app is

responsive. After initialization, it waits to receive events from the UI thread.

In the onResume() method, the UI thread also opens the camera and registers

PreviewFrameCallback as a handler for callbacks from the camera; this handler is invoked by the

hardware periodically. If a frame, say frame1, arrives from the hardware, onPreviewFrame in

the UI thread handles this callback by sending a message with the “start decoding” flag

together with the frame buffer data to the decode thread. The decode thread starts to use

various complex algorithms to decode this frame. The UI thread continues to handle other

events. Suppose a second frame, frame2 arrives, but the decode thread has not finished yet.

The handler in the UI thread knows that the decoding thread is still working, so it discards

frame2.

Suppose the decode thread finishes decoding and fails to recognize the barcode

from the given frame. This is a normal case, due to various reasons such as the frame does

not contain any barcode, or the frame taken by the camera is blurred. When the UI thread

receives the “decode failure” message, it marks the decode thread to be ready for decoding

and continues to receive new frames from camera. Say frame3 comes next, and this frame

can be successfully decoded; frame4 would be skipped for the same reason as frame2. When the UI thread receives a "decode success" message, it updates the UI element to show the barcode result.

To successfully replay Barcode Scanner, the key is to enforce the same event order in the looper as in the record phase. Otherwise, the replay may diverge, as shown in Figure 4.2b and explained next. Suppose that among the four recorded frames, only frame3 can be successfully decoded. Since the Android camera does not guarantee a constant frame rate, the "decode failure" message due to frame1 could arrive after frame3 has been delivered to the UI thread, in which case frame4 will be sent to the decode thread, instead of frame3.

But frame4 could be a poor-quality frame that cannot be recognized; now the replay diverges as the app cannot successfully decode any of the four frames—this situation can happen in RERAN but not in VALERA. When using VALERA, the "decode failure" message will be delivered after frame2, not frame3, since we enforce schedule determinism (Section 4.5), hence avoiding divergence.

This example illustrates the importance of ensuring event order determinism in the replay of Android apps, as Android apps are mainly event-driven. In fact, as our study in Section 4.6.1 shows, the event stream is far more demanding than other streams, with typical burst rates in excess of 1,000 events/second, whereas the second-most demanding stream, the network, has typical burst rates of 207 events/second.

Table 4.1: Network and sensor API usage in top-11 apps in each of the 25 categories on Google Play.

           | Network | Location | Audio | Camera
percentage | 95%     | 60%      | 34%   | 34%

4.1.3 Network and Sensors

Supporting network and sensors is essential: the success of the smartphone platform is due, in no small part, to on-the-go network connectivity and the sensor-based context-aware capabilities offered by apps. To quantify the need to replay network and high-level sensors, in Table 4.1 we show their use frequency in the top-11 most popular apps across all 25 app categories on Google Play: 95% of the apps use the network, location is used by 60% of the apps, etc. Thus we argue that supporting network and sensor input is necessary for successfully recording-and-replaying Android apps.

4.2 Design Overview

We believe that, to be practical, a record-and-replay system for Android should meet several key desiderata:

1. Support I/O (sensors, network) and record system information required to achieve

high accuracy and replay popular, full-featured apps.

2. Accept APKs as input—this is how apps are distributed on Google Play—rather than

requiring access to the app source code.

3. Work with apps running directly on the phone, rather than on the Android emulator

which has limited support for only a subset of sensors.

4. Low overhead to avoid perturbing the app's execution.

5. Require no hardware, kernel, or VM changes.

Figure 4.3: Overview of the VALERA runtime.

We have designed VALERA specifically to meet these desiderata. Current approaches [15, 13, 62, 56] do not meet one or more of these desiderata (especially #1, which we believe is critical). We now turn to presenting overviews of the Android platform and our approach, then state VALERA's replay accuracy guarantees.

VALERA overview. VALERA consists of a runtime component and an API interception component. We first discuss the runtime component, shown in Figure 4.3 (the grey area on the right)—the interception component will be discussed in Section 4.3.2. The instrumented app runs on top of the instrumented AF, which in turn runs on top of unmodified versions of the VM and the kernel. App instrumentation, achieved via bytecode rewriting, is used to intercept the communication between the app and the AF to produce log files (values and timestamps) associated with network and high-level sensor input, such as GPS, microphone, and camera; intents are also intercepted at this point. AF instrumentation (which we performed manually) is used to log and replay the event schedule—see the ScheduleReplayer vertical box inside the AF. As the arrow directions indicate, during record the value/timestamp stream flows from left to right (toward the log files), and during replay from right to left (from the log files to the app/AF). To sum up, the VALERA runtime consists of record and replay code and the log files; this code runs inline in the app and AF, with no extra processes or threads needed. Other apps that execute concurrently run in their own address space, on their own VM copies; we omit them for clarity.

Note that, since VALERA uses bytecode rewriting and an instrumented AF, its operation is not affected by either the JIT compiler used in Android versions prior to 5.0 or the runtime/compiler combination (ART/AOT) used in Android versions 5.0 and beyond.

Replay accuracy. We define Externally Visible State as the subset of app state that might be accessed, or viewed, by the user; currently the EVS includes GUI objects (views, images) and Shared Preferences (a key-value store where apps can save private or public data).

We validated VALERA's replay fidelity via snapshot differencing, as follows: (1) during record, upon entering or leaving each activity (screen) A, we snapshot the EVS into EVSrec_A; (2) likewise, we snapshot the EVS during replay, into EVSrep_A; and (3) compare EVSrec_A and EVSrep_A to find differences—a faithful replay should show no difference, that is, the user cannot tell the difference between the record and replay executions. Note that record vs. replay differences might still exist in hidden state, e.g., memory contents or the

VM stream, but these differences are not our focus. Nevertheless, our fidelity guarantee is stronger than techniques used in prior approaches to compare app executions (which just compared screen contents [71]). The next sections show how we effectively implement record and replay in VALERA to achieve these accuracy guarantees.
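Under the assumption that each EVS snapshot can be reduced to a map of named properties per activity (GUI view state and SharedPreferences entries), the differencing step amounts to a simple comparison; the types below are illustrative, not VALERA's actual data structures.

```java
import java.util.Map;

// Illustrative sketch of snapshot differencing: a replay is considered
// faithful if every activity's externally visible properties match.
final class FidelityCheck {
    static boolean faithful(Map<String, Map<String, String>> evsRec,
                            Map<String, Map<String, String>> evsRep) {
        if (!evsRec.keySet().equals(evsRep.keySet())) {
            return false;                         // different set of activities visited
        }
        for (String activity : evsRec.keySet()) {
            if (!evsRec.get(activity).equals(evsRep.get(activity))) {
                return false;                     // some externally visible property diverged
            }
        }
        return true;
    }
}
```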

4.3 API Interception and Replay

We now present our API interception approach. The infrastructure for interception, record, and replay is generated automatically through app rewriting, based on an interceptor specification. While VALERA has a predefined set of interceptors, VALERA users can easily change or extend this specification, e.g., to add interceptors for new sensors or API calls, while still guaranteeing input determinism.

4.3.1 Example: Intercepting the Location Services

We first illustrate our technique by showing how we intercept, record and replay the

GPS location. The location API provides functionality for finding the current GPS location, as well as receiving notifications when the location changes, e.g., when the smartphone is in motion.

1 //class LocationActivity extends Activity

2 protected void onStart() {

3 location =mLocMgr.getLastKnownLocation(provider);

4 updateUILocation(location);

5 mLocMgr.requestLocationUpdates(provider,

6 TEN SECONDS, TEN METERS, listener);

7 }

8

9 protected void onStop() {

10 mLocMgr.removeUpdates(listener);

11 };

12

13 private LocationListener listener = new LocationListener() {

14 @Override

15 public void onLocationChanged(Location location) {

16 updateUILocation(location);

17 };};

Figure 4.4: Location API example.

Figure 4.4 shows an excerpt from a simple app that uses the location services to display the current location on the GUI and update the display whenever the location changes. When the app starts (method onStart()) it asks the manager for the last known location (getLastKnownLocation on line 3), updates the current location in the GUI (line 4) and directs the manager to provide location updates every 10 seconds or when the location has changed by 10 meters (lines 5–6). The location updates are provided via a callback mechanism: note how, on line 6, a listener is passed as an argument to requestLocationUpdates.

67 This location listener, initialized on line 13, has a method onLocationChanged() which is the callback that will be invoked whenever the location manager needs to inform the app about a location update—when that happens, the app updates the location on the GUI (line 16).

The getLastKnownLocation method returns a location object containing the last known location obtained from the given provider. We name such API calls downcalls: they are

initiated by the app and go downwards (i.e., into the AF), run synchronously with the

calling thread and return the result. Another kind of API calls are upcalls: in this case

the lower levels of the system software (i.e., the AF) invoke an app-defined callback hence

the direction of the call is upward. The onLocationChanged method is an upcall since after

registering the callback, the AF periodically sends back the updated location by invoking

onLocationChanged. By overriding this method, the app receives notifications whenever the

user’s location changes.

VALERA takes API annotations (e.g., upcalls, downcalls) as input and generates

the support for interception, record, and replay automatically. In our location example,

VALERA records the values exchanged in location API upcalls and downcalls and upon

replaying, feeds the app the recorded values—this way we can “trick” the app into believing

that the phone’s geographical location (or sequence of locations, if in motion), is the same

as during the record phase, even though the phone’s location could have changed since then.

For example, when a developer in New York wants to replay and debug a location-related

crash that happened on the phone of a user based in San Francisco, VALERA injects the

record-time GPS sequence (the San Francisco coordinates) into the app.

4.3.2 Automatic Interception through App Rewriting

Figure 4.5 presents the VALERA interception component, which performs auto- matic interception via app bytecode rewriting. While the rewriting is currently performed on a desktop or laptop, VALERA’s record-and-replay (runtime component) runs on the phone with no outside system necessary.

We use the bytecode rewriting capabilities of Redexer (an off-the-shelf Dalvik bytecode rewriting tool [87]) along with interceptor specifications to transform an off-the-shelf app into an app with intercept/record/replay capabilities, as explained next.

The Scanner takes as input the original app (APK file) along with an Interceptor specification and finds all the callsites in the bytecode that match the specification and should be intercepted. The specification consists of a list of API methods along with simple annotations on how the methods and their parameters should be treated from the point of view of intercept/record/replay (explained in Section 4.3.3). We have a predefined library of such calls for instrumenting commonly-used APIs (Section 4.3.4); users can expand the library with their own specifications. The Scanner first extracts the Dalvik bytecode

file from the APK, then finds matching method invocations (callsites). The Interceptor generator produces two parts: a dynamic intercepting module and a stub. The intercepting module is a plug-in for the Redexer that executes on each method that needs to be intercepted (i.e., the Redexer rewrites the call to go to the stub). Finally, the dynamic intercepting modules and stubs are passed on to the Redexer, which performs the bytecode rewriting to effect the interception and repackages the bytecode into an instrumented APK that now has interception/playback support.

Figure 4.5: Overview of VALERA’s automatic interception.

4.3.3 Interceptor Specification

The interceptors specify what to intercept and how to handle the intercepted values upon replay. VALERA supports three simple annotation keywords to describe Android API

methods. We borrow this idea from R2, an application-level record/replay framework that

allows users to choose the functions to be recorded and replayed [66].

Examples. We first provide examples of how users can easily specify interceptors to

achieve record-and-replay, and then discuss the general framework. In the examples, for

clarity, we leave out full package names and module annotations—these have to be specified

before passing the specification to VALERA but they are straightforward.

Figure 4.6 shows a specification snippet from our interceptors for the Location API.

[downcall]
public Location getLastKnownLocation(String provider);

[upcall]
public abstract void onLocationChanged(Location location);

Figure 4.6: Location interceptor example.

First, we specify that getLastKnownLocation is a downcall and that the provider argument cannot be modified during replay (i.e., it is not subject to record and replay). However, the return value of getLastKnownLocation is subject to record and replay, hence it will be recorded; during replay, VALERA will return to the app the location values that were saved during record, rather than the current location. The specification also mentions that onLocationChanged is an upcall, and that the location argument will be filled with the recorded value rather than the value provided by the framework.

We did not find it necessary to support in or out annotations on method pa-

rameters, because the stub implementations in our predefined stub library implement the

appropriate in or out semantics for each API call anyway.

Figure 4.7 shows a snippet from our interceptor specification for the Camera API:

first, we specify that onPictureTaken and onPreviewFrame are both upcalls, and the camera

argument is not subject to record and replay. The annotation on data is more interesting: it is

an array whose size varies between record and replay, hence the xpointer annotation (we will

provide details shortly).

[upcall]
public abstract void onPictureTaken(
    [xpointer] byte[] data,
    Camera camera);

[upcall]
public abstract void onPreviewFrame(
    [xpointer] byte[] data,
    Camera camera);

Figure 4.7: Camera interceptor example.

General annotation framework. Table 4.2 lists the annotations VALERA provides for constructing interceptors. There are two categories of keywords: parameter and function.

Parameter keywords describe whether the value can change in size from record to replay.

Function keywords label each method as downcall or upcall.

Xpointer is necessary when objects vary in size from record to replay. For example, in the previously-mentioned camera interceptor, onPictureTaken and onPreviewFrame take a byte[] data argument. Let us assume that during the record phase the user takes a 1 MB picture. However, during replay, the camera may take a 500 KB picture (while the image width and height do not change between record and replay, the camera captures different images in the two executions; hence, due to differences in pixel colors and the use of compression, the image sizes will likely be different). Since attempting to copy the 1 MB data from the log into a 500 KB byte array would cause a buffer overflow, VALERA offers an

xpointer annotation to deal with such cases. Instead of directly accessing the byte[] data reference, xpointer wraps the reference and provides get and set methods. During replay,

the stub updates the xpointer reference to point to the recorded stream data.

Table 4.2: Annotation keywords.
Annotation   Scope       Description
xpointer     parameter   pointer to reference
downcall     function    synchronous API call
upcall       function    asynchronous callback
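To make the xpointer mechanism concrete, the sketch below shows one possible shape of such a wrapper; the class and method names are illustrative, not VALERA’s actual implementation.

// Illustrative sketch of an xpointer-style wrapper: the stub hands the app a
// holder object instead of the raw byte[] reference, so the recorded data can
// be substituted even when its size differs from the live data.
public final class XPointer<T> {
    private T ref;                       // the wrapped reference, e.g., a byte[] camera frame

    public XPointer(T ref) { this.ref = ref; }

    public T get() { return ref; }       // the app reads the data through the wrapper

    public void set(T recorded) {        // during replay, the stub points the wrapper
        this.ref = recorded;             // at the recorded stream data
    }
}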

Downcall denotes a synchronous API method call. In Figure 4.6, getLastKnownLocation is annotated as a downcall method since the caller waits until it returns.

Upcall denotes an asynchronous callback. Android apps rely heavily on callbacks to improve system efficiency. For example, onLocationChanged, onPictureTaken, and onPreviewFrame in

Figures 4.6 and 4.7 are callback methods, thus they are marked with upcall.

4.3.4 Intercepting Events and Eliminating Nondeterminism

We now describe how VALERA intercepts sensor events and eliminates various sources of nondeterminism.

Motion and Key Events. Motion (i.e., touch screen) and Key events are the main sources of events that drive app execution. Prior manual replay tools such as Robotium [62] and Monkey Runner [13] provide scripting interfaces that allow the user to specify which

GUI objects to interact with. However, they only support basic GUI operations such as click or long click, whereas most mobile apps provide rich gestures such as zoom, pinch, and swipe. The RERAN [56] record-and-replay system for Android supports these complex gestures by recording the event streams at the OS driver level and replaying them back with precise timing. However, the drawback of RERAN is that it has no knowledge about the

app’s event order. For example, in the Barcode Scanner example in Figure 4.2a, suppose

that during recording, four frames were recorded, but during replay, due to the unpredictable

nature of the external events, the camera may invoke callbacks at a slower rate and deliver only

three frames. If the fourth frame is the one that yields a successful scan, replay will fail (diverge).

To address these issues, VALERA records motion and key events on the app’s side

instead. Whenever the Window Manager Service dispatches an event to the app, VALERA

intercepts the dispatchInputEvent method, recording the event data and the time since app

start. In addition, VALERA records the current window ID of the app because Android

dispatches motion and key events per window, and one app may have multiple windows

(e.g., Activity and Dialog).
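A record-side stub for this interception point could look roughly as follows; ReplayLog, appStartTime, and forwardToOriginalDispatch are hypothetical helpers, and only the general shape (log the event payload, the elapsed time since app start, and the window ID, then forward the event) follows the description above.

// Sketch of recording a motion/key event at dispatch time (assumed helpers).
class InputEventStub {
    static long appStartTime;   // set when the app starts

    static void onDispatchInputEvent(android.view.InputEvent event, int windowId) {
        long elapsed = android.os.SystemClock.uptimeMillis() - appStartTime; // time since app start
        ReplayLog.append(event, windowId, elapsed);      // record event data, window ID, and timing
        forwardToOriginalDispatch(event, windowId);      // then deliver the event as usual
    }
}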

Sensor Events. Mobile devices provide a richer set of sensors than desktop/server machines. They can be classified into two categories: low-level sensors and high-level sensors.

chines. They can be classified into two categories: low-level sensors and high-level sensors.

Low-level sensors, e.g., accelerometer, gravity, gyroscope, etc., provide streams of events

and invoke the app via the SensorManager API. VALERA records and replays the event and

its associated data.

High-level sensors such as GPS, Camera, and Audio are richer, as they provide

principled APIs for device access via upcalls and downcalls; we illustrate these using the

Location API.

The Location API offers the getLastKnownLocation() downcall—the app waits until

the system returns the last known location data. The Location API also provides an upcall,

onLocationChanged(): when the physical location has changed, the hardware GPS sensor

invokes this upcall on the UI thread’s Looper as an event. VALERA records and replays both

downcalls and upcalls.

Camera. Android apps can use the camera to take pictures in three ways. First, apps can use the preinstalled Camera app as a proxy by sending it an intent. The Camera app takes the picture and returns it via the intent mechanism. The intent recording mechanism, which we will describe shortly, ensures that pictures taken via intents will be replayed. The second way is to use the frame buffer, i.e., continuously read from the camera’s sensor, similar to using a camera’s preview feature. A typical example of such use is in barcode scanning apps, e.g., Barcode Scanner or RedLaser Barcode. These apps read from the frame buffer using the onPreviewFrame upcall, scan the frame for a barcode picture, and direct the user on how to properly expose the barcode so the app can get a good reading. The third way is to take a single picture, e.g., via the onPictureTaken() upcall. VALERA intercepts all necessary camera downcalls/upcalls and intents, hence the input data from the camera can be recorded and replayed.

Audio. Android provides audio services through two main API components:

MediaRecorder and AudioRecord. MediaRecorder is high-level, supports audio compression, and automatically saves the audio data to files. AudioRecord is lower-level because it captures raw audio data into a memory buffer and allows processing that data on-the-fly (akin to the camera’s frame buffer described previously). Different apps use different methods for audio manipulation: if the app just needs to record the audio, using MediaRecorder is easier, but for apps that require high-quality audio information and runtime processing, e.g., for audio recognition, AudioRecord is a better option. VALERA intercepts all necessary methods in the MediaRecorder and AudioRecord APIs.

Network non-determinism. Previous systems recorded and replayed network activity at the system call level, e.g., send() and recv(). However, in our experience, recording at the system call level is not appropriate on Android because the OS uses a socket pool: which socket is assigned to connect() is not deterministic. Our insight is that, instead, we need to record the network connections and the data transferred. Our implementation intercepts the

HTTP/HTTPS protocol APIs2 as follows. For each HTTP/HTTPS connection, VALERA

records the data sent and received, the timing of the network API calls, as well as any

exception encountered. During replay, VALERA feeds the app data (as well as error values or exceptions) from the log instead of sending/receiving real network data. Note that reading from a log file is faster than reading from the network. Thus VALERA needs to sleep an appropriate amount of time, according to the network connection log, to preserve precise replay timing.
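The sketch below illustrates this record/replay logic for a single HTTP request; ReplayLog, NetworkRecord, readFully, and sleepFor are assumed helpers, and exception logging is elided for brevity.

// Sketch of recording vs. replaying one HTTP response.
byte[] fetch(String url) throws Exception {
    if (recording) {
        long start = System.nanoTime();
        java.net.HttpURLConnection conn =
                (java.net.HttpURLConnection) new java.net.URL(url).openConnection();
        byte[] body = readFully(conn.getInputStream());                   // real network I/O
        ReplayLog.append(new NetworkRecord(url, body, System.nanoTime() - start));
        return body;
    }
    NetworkRecord rec = ReplayLog.nextNetworkRecord(url);                 // logged data, no network
    sleepFor(rec.durationNanos);                                          // preserve the recorded timing
    return rec.body;
}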

Another advantage of eliminating network nondeterminism is enabling replay for apps that use dynamic layouts (i.e., the GUI layout is determined by the server), which have posed problems for RERAN since it assumes the same GUI layout between record and replay [56].

Random number nondeterminism. The random number API is another possible source of non-determinism. Android provides two random number APIs: java.util.Random

and java.security.SecureRandom. The former is pseudo-random: VALERA just intercepts the

seed, hence subsequent random number calls are deterministic. The latter is a stronger

random number API, hence VALERA intercepts all the generated random numbers to ensure

2HTTP/HTTPS is the most widely used protocol for Android apps; VALERA can be easily extended to intercept other protocols.

accurate replay. If an app implements its own random number library, the corresponding

API has to be marked so VALERA replays it; however, we did not find any app, among our

50 examined apps, that defines a custom random number library.
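For illustration, the two cases can be handled roughly as sketched below; the ReplayLog helper is an assumption. For java.util.Random it suffices to capture and reuse the seed, whereas for SecureRandom each generated value is logged and played back.

// Sketch of random-number interception (ReplayLog is an assumed helper).
final class RandomStub {
    static java.util.Random newRandom(boolean recording) {
        long seed = recording ? System.currentTimeMillis() : ReplayLog.nextSeed();
        if (recording) ReplayLog.appendSeed(seed);
        return new java.util.Random(seed);         // same seed yields the same deterministic sequence
    }

    static int nextSecureInt(java.security.SecureRandom sr, boolean recording) {
        if (recording) {
            int v = sr.nextInt();                   // real secure random value
            ReplayLog.appendInt(v);                 // log it for replay
            return v;
        }
        return ReplayLog.nextInt();                 // during replay, feed back the logged value
    }
}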

4.3.5 Intercepting Intents

In Android, sensitive resources and devices are protected by permissions. Apps that want to use a resource directly must obtain an install-time permission, e.g., an app needs the ACCESS_FINE_LOCATION permission to access the GPS or the CAMERA permission

to access the camera directly.

Android also allows apps to use devices via a proxy app, in which case no permission

is required, as we described in the Camera app example. This is realized by

Android’s Intent mechanism, as follows. App A constructs an intent object with the

ACTION_IMAGE_CAPTURE action. Then A invokes the Camera app by calling startActivityForResult() with the intent object as parameter. After the picture is taken, the result comes back through the onActivityResult() method. We intercept such methods to log the Intent object data (in this case, the picture) and use this data in the replay phase. Thus, to replay intent-based sensor input carried through proxy apps, we must intercept the Intent API, even though intents are not sensors per se.
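A sketch of the corresponding interception, with hypothetical stub and helper names, is shown below: the stub logs (or, during replay, substitutes) the returned Intent before the app’s own callback runs.

// Sketch of intercepting proxy-app results (ReplayLog and
// originalOnActivityResult are assumed).
void onActivityResultStub(int requestCode, int resultCode, android.content.Intent data) {
    if (recording) {
        ReplayLog.appendIntent(requestCode, resultCode, data);   // log the returned Intent (e.g., the picture)
    } else {
        data = ReplayLog.nextIntent(requestCode);                // substitute the recorded Intent
        resultCode = ReplayLog.nextResultCode(requestCode);
    }
    originalOnActivityResult(requestCode, resultCode, data);     // invoke the app's callback
}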

4.3.6 Recording and Replaying

We now describe how record-and-replay is effectively achieved via auto-generated stubs, and how we control the timing and delivery of values and exceptions.

Stubs. API call interception is realized by redirecting the original API call to go to a stub. By default, VALERA auto-generates stubs that implement the record-and-replay functionality, as follows: during record, a stub saves parameters and return data in a log

file and then passes them through to the original callee; during replay, the stub code feeds the app recorded data from the log file instead of the “fresh” data coming from the sensors.

More concretely, for upcalls, the dynamic intercepting module will add pre-function stub code that executes before the intercepted method, and post-function stub code that executes after the intercepted method completes. For downcalls, the invocation instruction will be replaced to redirect to the stub code.
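As an illustration, an auto-generated downcall stub follows the general pattern sketched below; the names are illustrative, with the getLastKnownLocation interceptor from Figure 4.6 used as the example, and ReplayLog standing in for the generated logging code.

// Sketch of a generated downcall stub (ReplayLog is an assumed helper).
android.location.Location getLastKnownLocationStub(
        android.location.LocationManager lm, String provider) {
    if (recording) {
        android.location.Location loc = lm.getLastKnownLocation(provider); // call the original API
        ReplayLog.appendLocation(loc);                                      // save the return value
        return loc;
    }
    return ReplayLog.nextLocation();   // replay: feed the recorded value, ignore the live sensor
}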

Timing. In addition to logging values associated with API calls, VALERA records the timestamp of each intercepted method, so the method is replayed at the appropriate time.

We realized that precise timing control is crucial, because feeding the recorded values too early or too late will cause the replay execution to diverge from the record execution; this is especially true for apps using audio and image streams. During replay, it takes much less time to just read (and feed to the app) the recorded sensor data from the log file than to obtain fresh data from the sensors, hence VALERA sleeps for a calculated amount of time in the stub function to replicate the precise timing of the record run.
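Concretely, the generated stub can compute the required delay from the recorded timestamp along the lines of this sketch; recordedElapsedMs comes from the log and appStartMs is an assumed field.

// Sleep so the recorded value is delivered at the same relative time as during record.
long nowElapsedMs = android.os.SystemClock.uptimeMillis() - appStartMs;
long delayMs = recordedElapsedMs - nowElapsedMs;
if (delayMs > 0) {
    try { Thread.sleep(delayMs); } catch (InterruptedException ignored) { }  // wait out the difference
}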

Exceptions. VALERA has to record and replay any runtime exceptions. If an exception is logged during record, we have to re-throw it during replay.

Limitations. VALERA cannot handle nondeterminism in apps that perform customized rendering, e.g., games that do not use the Android UI toolkit; while VALERA will record and

replay the app, it does not guarantee that the nondeterminism in the customized rendering part will be eliminated.

Figure 4.8: Overview of SSDA.

4.4 Fuzzy Replay

Building on the ability to record and replay sensor input in Android apps, described earlier in this chapter, we present a new approach that helps automate app testing by capturing the input log (sensor stream) and using this log in two ways: first, we fuzz

(alter) the log in a semantically-meaningful way: by applying principled transformations

(e.g., changing GPS coordinates or navigation speed), a new input log is constructed, which represents a new test case. Second, we use the log captured in app A to test an app B which offers similar functionality, e.g., GPS navigation or image recognition.

4.4.1 Semantic Sensor Data Alteration

An overview of Semantic Sensor Data Alteration (SSDA) is provided in Figure 4.8.

We first record the execution of the app; the app is exercised in this phase via manual input, e.g., navigating to record a GPS trace, using the camera to record a set of camera inputs, or running an app such as Shazam to record microphone input. The result of the record phase

is a log, i.e., a stream of sensor inputs. Given the recorded log, we use semantic alteration to modify the recorded sensor inputs; this phase is performed off-line, on a desktop/laptop computer. SSDA provides a set of systematic altering methods for each sensor, described later. The result of SSDA is a set of fuzzed logs – streams of sensor inputs. We emphasize that the log alteration process is completely automated, sparing the user the burden of manually creating new logs. Finally, we replay the altered logs back into the app, observe the execution, and note differences from the original recorded execution. We now proceed to define the semantic transformations specific to each sensor.

Location. As mentioned previously, the location (GPS) sensor is used to provide location-sensitive services, and detect changes in the phone’s geographical location. We have identified three semantic alterations for the location:

1. Null location: we inject a null location reference into the logs to simulate scenarios

when the underlying GPS module has trouble acquiring a location. Note that null is

a valid location reply in the GPS API, and it is good programming practice to check

the location parameter before using it.

2. Map shift: this involves taking the series of coordinates ⟨(x1, y1), . . . , (xn, yn)⟩ from the

recorded execution and changing them by “shifting the map” by a (∆x, ∆y) factor so

that during replay the set of coordinates fed to the app is ⟨(x1 + ∆x, y1 + ∆y), . . . , (xn + ∆x, yn + ∆y)⟩.

The parameters (∆x, ∆y) can be provided manually by the developer

or generated randomly by SSDA.

3. Speed change: we replay a route at different speeds, e.g., to simulate driving vs.

driving faster (where speed limit alarms can go off). This is achieved by altering

the GPS coordinates according to the interval between two consecutive location API

upcalls. Note that during replay we cannot alter timing — the app expects the sensor

input at the same time as during record. Hence we alter the coordinates to simulate

a change in speed: we compute the first derivative of position and then during replay

alter the coordinates so that the coordinates change faster/slower.

To create valid routes for our SSDA experiments we have downloaded predefined routes from Google Earth [57], used the route’s locations as waypoints, then modified the route according to a given speed to simulate the required activity, e.g., driving.
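The map-shift and speed-change alterations reduce to simple per-waypoint transformations of the recorded trace; a minimal sketch is shown below, where GpsFix(lat, lon) is an assumed simple value type rather than part of VALERA.

// Sketch of two SSDA location transformations over a recorded GPS trace.
java.util.List<GpsFix> mapShift(java.util.List<GpsFix> trace, double dLat, double dLon) {
    java.util.List<GpsFix> out = new java.util.ArrayList<>();
    for (GpsFix p : trace) {
        out.add(new GpsFix(p.lat + dLat, p.lon + dLon));   // shift every waypoint by (dLat, dLon)
    }
    return out;
}

// Speed change: keep the upcall timing fixed and scale how far the position
// advances between consecutive fixes (factor > 1 simulates driving faster).
java.util.List<GpsFix> speedChange(java.util.List<GpsFix> trace, double factor) {
    java.util.List<GpsFix> out = new java.util.ArrayList<>();
    GpsFix cur = trace.get(0);
    out.add(cur);
    for (int i = 1; i < trace.size(); i++) {
        double dLat = (trace.get(i).lat - trace.get(i - 1).lat) * factor;  // scaled position delta
        double dLon = (trace.get(i).lon - trace.get(i - 1).lon) * factor;
        cur = new GpsFix(cur.lat + dLat, cur.lon + dLon);
        out.add(cur);
    }
    return out;
}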

Camera. In Android, the camera can take pictures one-by-one or continuously, using the frame buffer. We implement several image transformations.

1. Exposure: we alter the exposure (i.e., darken or lighten the image) to simulate a

day/night swap. Similarly, we alter the color balance to simulate different lighting

conditions, e.g., moving from sun to shade.

2. Size/blur: we alter image size and blur the image to simulate a poor-quality picture

or holding the phone incorrectly w.r.t. the object of the picture.

3. Rotation: we rotate the image at various angles; this transformation tests the ability of

image recognition algorithms to handle different orientations of objects in the picture.

In all cases, we have performed the semantic picture alteration by using the ImageMagick [3] image manipulation toolkit to alter the recorded picture, e.g., blur, rotate, darken.
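For instance, the alterations can be scripted by shelling out to ImageMagick’s convert tool, as in the sketch below; the exact flags in the comment are illustrative of the kinds of options used, not a record of our exact parameters.

// Sketch: apply an image alteration to a recorded frame via ImageMagick.
void alterImage(String inPath, String outPath, String... ops) throws Exception {
    java.util.List<String> cmd = new java.util.ArrayList<>();
    cmd.add("convert");
    cmd.add(inPath);
    cmd.addAll(java.util.Arrays.asList(ops));   // e.g., "-blur", "0x4"  or  "-rotate", "90"
    cmd.add(outPath);
    new ProcessBuilder(cmd).inheritIO().start().waitFor();
}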

Audio. Android apps use two API classes, MediaRecorder and AudioRecord, to process audio.

For SSDA we mainly focus on AudioRecord as it receives the raw audio stream from the microphone and is the preferred way of performing on-the-fly audio processing. We have applied several audio altering techniques:

1. Sample rate: we increase and decrease the sample rate (the sample rate determines

the quality of the audio stream as it defines the number of sound samples per unit of

time).

2. Noise: we add background noise, to simulate the situation that the audio is recorded

in a noisy environment.

We first used the PCM Recorder to record and save a song in raw audio format.

Next, we performed audio transformations by altering the recorded sound data via the SoX open source package [4] and finally sent the altered audio to the app via replay.

4.4.2 Cross-app Testing

Cross-app testing involves collecting a trace in one app (source) and replaying snippets of it in another app (destination). For example, one trace snippet can be a route recorded with the navigation app Navfree USA and then replayed on other navigation apps such as GPS Navigation & Maps and Waze Social GPS Maps; the latter apps will behave as if the phone follows the recorded route.

This is somewhat akin to test amplification, though it involves taking a test constructed for one app and using it to construct a test for another app.

Figure 4.9: Overview of cross-app testing.

Figure 4.9 provides an

overview of this process: given a log (collected with VALERA) from a “source” app App1, we extract a sub-log (e.g., only the GPS waypoints) and then replay that log into a “destination” app App2. The success of cross-app testing relies on two main prerequisites. First,

App1 and App2 must share the same functionality (e.g., navigation or barcode scanning).

Second, the state where record starts for App1 should be equivalent, at a high level, to the replay start state for App2 so that App2 can accept the input events in the sub-log.

We believe that cross-app replay has the potential to significantly improve the

state-of-the-practice in Android app testing: by collecting a library of sub-logs (trace

snippets) specific to each sensor, mobile app researchers and developers can replay the sub-logs

from the library on their own apps to augment their testing suites. We now proceed to

evaluate the applicability of cross-app testing.

4.5 Event Schedule Replay

In Section 4.3.4 we have shown how VALERA eliminates network and sensor nondeterminism. This, however, is not enough, as in Android, another important source of non-determinism is the event schedule. We now describe our approach for eliminating event schedule nondeterminism.

4.5.1 Recording the Event Schedule

VALERA records the event schedule by logging each message send and message processing operation into a trace file. Every time a thread sends a message, we record this operation as a tuple ⟨etype, eid, pid, tid, type, looper, caller⟩. Here the etype indicates whether this is an internal or external event, eid denotes the unique event identifier, pid is

the process id, tid is the thread id, type shows whether this event is a handler or a runnable

action, looper is the target Looper object, and caller records the caller method that has created

this event message. When the Looper begins to execute an event or finishes an event, VALERA

also saves this information into the event schedule.
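For clarity, the recorded tuple can be pictured as a simple record such as the sketch below; the field types are illustrative, not VALERA’s exact trace format.

// Sketch of one entry in the recorded event schedule.
final class ScheduleEntry {
    final String etype;    // internal or external event
    final long   eid;      // unique event identifier
    final int    pid;      // process id
    final long   tid;      // thread id
    final String type;     // handler message or runnable action
    final String looper;   // target Looper
    final String caller;   // method that created/posted this event

    ScheduleEntry(String etype, long eid, int pid, long tid,
                  String type, String looper, String caller) {
        this.etype = etype; this.eid = eid; this.pid = pid; this.tid = tid;
        this.type = type; this.looper = looper; this.caller = caller;
    }
}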

We found that certain types of events, e.g., Android’s FrameHandler background

events, do not affect the correctness of our event replay, hence they are not included in

the schedule. However, most events are relevant and are included, e.g., Activity’s lifecycle

events (Activity launch/stop/resume/etc.), user interaction events (e.g., touch or click a

button), UI update events (e.g., resize/hide a view object) and the app’s own messages.

Algorithm 1 Deterministic Event Order Replay
Input: ScheduleList, the total order of recorded events

1: procedure Looper.loop

2: while Looper not exit do

3: executable ← true

4: msg ← checkPendingEvent()

5: if msg is null then

6: msg ← MessageQueue.next()

7: executable ← checkExecutable(msg)

8: end if

9: execute msg if executable is true

10: end while

11: end procedure

12: procedure checkPendingEvent

13: msg ← PendingQueue.peek()

14: if No msg from PendingQueue then

15: return null

16: end if

17: if msg times out then ▷ msg considered as missing

18: Scheduler.turn++

19: PendingQueue.remove()

20: return null

21: end if

22: if Scheduler.turn == event.order then

23: Scheduler.turn++

24: PendingQueue.remove()

25: return event.msg

26: else ▷ must wait its turn

27: return null

28: end if

29: end procedure

30: procedure checkExecutable(msg)

31: for all Si in ScheduleList do

32: if Si match msg then

33: if Scheduler.turn == Si.order then

34: Scheduler.turn++

35: return true

36: else

37: Add Si to pending queue

38: return false

39: end if

40: end if

41: end for

42: return true ▷ let unrecorded event execute

43: end procedure

4.5.2 Replaying the Event Schedule

We now present our algorithm for deterministically replaying events; the algorithm is implemented in the ScheduleReplayer component shown in Figure 4.3. We illustrate event replay on the

Looper, though VALERA tracks and delivers other events as well.

Each event, either internal or external, is assigned a Lamport timestamp (logic order number [93]) in the schedule. At app startup time, we load the schedule into ScheduleReplayer’s linked list. Loopers run an infinite loop waiting for events and dispatching them to the target handler. In the infinite loop, the looper checks with the ScheduleReplayer to see if there is any recorded event whose Lamport timestamp indicates it is next; if there is such an event, it is replayed, and the current (replay) logic order number is increased; otherwise it is saved in a pending queue, waiting for its turn to be executed. If the current event has not been recorded before, the ScheduleReplayer simply dispatches it.

The pseudocode is shown in Algorithm 1. The input is a schedule file indicating the execution order of the recorded events on this looper. Each event is assigned a logic order number. Every time the looper tries to fetch a new event, it first checks whether there is any event in the pending queue (line 4). An event is added to the pending queue if it matches the event in the schedule, but its turn to execute has not come yet. If there is no event in the pending queue, the looper fetches the event from its message queue as usual

(line 6), then checks whether this event is executable or not.

In the CheckPendingEvent procedure, ScheduleReplayer first checks whether there is any event in the pending queue; if there is no event in this queue, the Looper will check its message queue. Otherwise, if an event exists, ScheduleReplayer checks the

event’s logic order number with the scheduler’s current turn number. If they match (i.e., it is the pending event’s turn to execute), the event is popped from the pending queue and returned (line 25). The scheduler’s global turn number is increased to indicate the next executable event.

In the checkExecutable procedure, the input parameter is the message event from the looper. ScheduleReplayer iterates through the schedule list and matches the given event. An event is matched with the recorded schedule if the tuple described in Section 4.5.1 matches. If the event matches a schedule and the global turn matches its logic order, then the procedure returns true indicating that this event can execute. Otherwise, the event is added to the pending queue (line 37). Note that if the event does not match any recorded schedule, ScheduleReplayer returns true to allow this event to run (line 42).

Handling external event replay. During replay, external events are delivered from the recorded log, instead of the underlying hardware; VALERA implements a controller for this purpose. The controller is a background thread which continuously sends internal events to the looper. The internal event wraps the logged external event data with a what field indicating its type (e.g., touchscreen, sensor, GPS or camera event). The ScheduleReplayer knows the logic order number of every external event and executes it in its turn. This way, event non-determinism such as the example shown in Figure 4.2 is eliminated. After the current event is consumed, the controller fires the next one.
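A sketch of such a controller, using a standard Android Handler to post the wrapped events to the main looper, is shown below; the ReplayLog helper, the LoggedEvent type, and awaitConsumed are assumptions used only to illustrate the structure.

// Sketch: background controller that feeds logged external events to the UI looper.
final class ReplayController extends Thread {
    private final android.os.Handler main =
            new android.os.Handler(android.os.Looper.getMainLooper());

    @Override public void run() {
        LoggedEvent e;
        while ((e = ReplayLog.nextExternalEvent()) != null) {   // assumed log iterator
            android.os.Message m = main.obtainMessage(e.what);  // 'what' encodes the event type
            m.obj = e.payload;                                  // wrapped sensor/GPS/camera data
            main.sendMessage(m);                                // delivered in its recorded turn
            e.awaitConsumed();                                  // fire the next one after this is consumed
        }
    }
}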

Handling missing and unrecorded events. While VALERA records and replays the events coming into or going out of the subject app, it cannot control the behavior of other

apps (for that, a system-wide approach would be needed). That might pose a problem if an external app sends an event during record but not during replay (or vice versa). For example, the system’s Activity Manager Service (AMS) can send a TRIM_MEMORY event and invoke the app’s onTrimMemory() callback if the AMS detects that the system is low on memory. Since VALERA does not control the AMS, the AMS might send a

TRIM_MEMORY event during record but not during replay. To handle this situation,

VALERA assigns a timeout value to each event in the schedule list. If the waiting time exceeds the timeout limit (line 17), VALERA regards the event as missing and removes it from the schedule list. Conversely, the TRIM_MEMORY event could come during replay without appearing in the record log. VALERA handles this case by allowing execution of any unrecorded event (line 42). In both of these cases, VALERA logs the missing or unrecorded events, and the user can decide how they should be handled.

4.6 Evaluation

4.6.1 Evaluation of VALERA

We now describe our experimental setup, then evaluate the effectiveness and efficiency of VALERA.

Environment. The smartphone we used for experiments was a Samsung Galaxy Nexus running Android version 4.3.0, Linux kernel version 3.0.31, on a dual core ARM Cortex-A9

processor running at 1.2 GHz.

Setup. The experimental setup involved three scenarios, i.e., three different executions for each app: baseline, record, and replay. To establish a baseline, we first ran each app without involving VALERA at all, that is, a human user ran the original, uninstrumented app; while the app was running, we recorded touchscreen events using RERAN (as RERAN has already been shown to have a low overhead, at most 1.1%). We call this scenario the baseline execution. Next, we ran the VALERA-instrumented versions of apps, with RERAN replaying, to inject the same touchscreen inputs as in the baseline run, while VALERA was

recording—we call this the record execution. Finally, we ran a replay execution in which

VALERA was set to replay mode (of course, no user interaction was necessary).

The user interacted with each app for about 70 seconds, since prior research has

shown that the average app usage session lasts 71.56 seconds [30]. The user was exercising

the relevant sensors for each app, e.g., scanning a barcode for the Barcode Scanner, Amazon Mobile and Walmart apps; playing a song externally so apps Shazam, Tune Wiki, or

SoundCloud would attempt to recognize it; driving a car to record a navigation route for

Waze, GPSNavig.&Maps, and NavFreeUSA. To record intents, in apps Twitter, PicsArt,

Craigslist we took pictures by invoking Android’s default Camera app (invoking the Camera

app and returning the resulting picture is achieved via intents); for Google Translate, eBay,

and Dictionary we used speech recognition which is also achieved via intents.

Effectiveness

We evaluate the effectiveness of our approach along three dimensions: (1) Is VALERA

capable of recording and replaying highly popular real-world apps?, (2) Is VALERA capable

90 Table 4.3: VALERA evaluation results: apps, popularity, streams and overhead. App # Down- Streams Time Space loads Baseline Record Replay Log size Log rate Time Time Overhead Time Overhead GPS Intent Network

(millions) Microphone (sec.) (sec.) (%) (sec.) (%) (KB) (KB/s) Camera(pic.) Booking.com 10–50 • Camera(fbuf.) • 74.38 75.26 1.18 75.86 1.99 1,246 16.56 GasBuddy∗ 10–50 • • 83.47 84.02 0.65 84.77 1.56 206 2.45 Sygic: GPS N.&M. 10–50 • • 94.11 97.46 3.56 98.03 4.17 6,538 68.08 TripAdvisor 10–50 • • 69.39 71.34 2.81 71.56 3.13 1,328 18.62 Waze Social GPS 10–50 • • 86.30 87.91 1.87 88.12 2.11 4,719 53.68 Yelp∗ 10–50 • • 75.40 76.13 0.97 76.24 1.11 867 11.50 Flixster∗ 10–50 • • 78.31 79.45 1.46 80.01 2.17 1,147 14.65 Hotels.com 5–10 • • 84.50 85.17 0.79 85.66 1.37 1,563 18.35 Priceline 1–5 • • 82.18 83.45 1.55 83.12 1.14 2,313 27.72 Scout GPS Navig. 1–5 • • 66.39 68.11 2.59 68.47 3.13 5,312 77.99 Route 66 Maps 1–5 • • 88.79 89.23 0.5 89.89 1.24 4,108 46.04 Restaurant Finder 1–5 • • 71.46 72.18 1.01 73.45 2.78 918 12.72 GPSNavig.&Maps 0.5–1 • • 72.19 73.58 1.93 73.45 1.75 5,177 71.71 Weather Whiskers 0.5–1 • • 65.43 65.67 0.37 66.01 0.89 31 0.47 NavFreeUSA 0.1–0.5 • • 63.81 64.37 0.88 65.11 2.03 75 1.17 Barcode Scanner 50–100 • • 69.29 71.43 3.01 71.37 3.00 145,271 2,033.75 Google Goggles 10–50 • • 73.10 74.12 1.40 74.87 2.42 106,121 1,451.72 Pudding Camera 10–50 • 61.26 61.38 0.20 61.91 1.06 7,488 121.99 Evernote∗ 10–50 • • 74.12 75.00 1.19 75.19 1.44 2,317 30.89 Amazon Mobile∗ 10–50 • • 85.31 86.77 1.71 87.31 2.34 41,071 473.33 QR Droid 10–50 • • 79.46 81.55 2.63 82.39 3.69 114,812 1,407.87 CamScanner 10–50 • 62.01 62.76 1.21 62.87 1.39 2,612 41.62 CamCard Free 1–5 • 61.49 62.38 1.45 62.82 2.16 4,501 72.15 RedLaser Barcode 1–5 • • 72.47 74.05 2.18 74.87 3.31 91,191 1,231.48 Walmart 1–5 • • 85.65 86.78 1.32 86.86 1.41 157,129 1,810.66 Camera Zoom Fx 1–5 • • 56.37 57.11 1.31 57.32 1.69 6,328 110.80 Horizon 1–5 • • 64.39 65.71 2.05 66.10 2.66 5,413 82.38 Shazam 50–100 • • 91.28 92.73 1.59 92.41 1.24 6,186 66.71 GO SMS Pro 50–100 • • 58.12 59.33 2.08 59.87 3.01 101 1.70 Tune Wiki∗ 10–50 • • 84.10 85.27 1.40 86.31 2.63 7,192 84.34 SoundCloud 10–50 • • 64.38 65.87 2.31 66.12 2.70 1,206 18.31 Ringtone Maker 10–50 • 67.30 68.11 1.20 68.73 2.12 2,490 36.56 musiXmatch 5–10 • • 73.28 74.01 0.99 74.35 1.46 651 8.80 Best Voice Changer 5–10 • 58.45 59.17 1.23 59.83 2.36 108 1.85 Smart Voice Rec. 5–10 • 51.39 53.12 3.37 53.81 4.71 97 1.89 PCM Recorder 1–5 • 46.28 48.12 3.98 48.73 5.23 2,418 52.25 RoboVox Lite 0.05–0.1 • • 68.10 68.95 1.25 69.27 1.72 2,617 37.96 Diktofon 0.01–0.05 • • 62.47 63.71 1.98 64.05 2.53 2,102 32.99 Twitter∗ 100–500 • • 81.19 83.45 2.78 84.57 4.16 835 10.01 Google Translate∗ 100–500 • • 69.36 70.48 1.61 71.02 2.39 49 0.70 Instagram∗ 100–500 • • 55.47 55.98 0.92 56.13 1.19 872 15.58 PicsArt 100–500 • 64.21 64.32 0.17 64.55 0.53 12 0.19 eBay∗ 50–100 • • 96.37 97.24 0.90 97.98 1.67 1,354 14.05 Bible∗ 10–50 • • 73.91 74.63 0.97 75.38 1.99 871 11.67 Craigslist∗ 10–50 • • 65.28 66.33 1.61 66.91 2.50 1,672 25.21 Dictionary∗ 10–50 • • 58.31 59.23 1.58 59.88 2.69 164 2.77 GO SMS Pro Emoji 10–50 • • 54.17 55.67 2.77 55.90 3.19 76 1.37 Weibo 5–10 • • 90.46 91.87 1.56 92.44 2.19 3,182 34.64 1Weather 5–10 • • 45.61 46.00 0.86 46.02 0.90 318 6.91 Weather 5–10 • • 87.31 88.45 1.31 88.19 1.01 673 7.61 Mean 71.36 72.49 1.01 72.92 1.02 15,101 208.32 ∗=VALERA can replay network, camera, GPS, microphone, intents, and schedule, while RERAN cannot

91 Table 4.4: VALERA event streams: number of events and burst event rate (events/second); a ‘-’ indicates that the app did not use that stream during our recording.

App Touchscreen GPS Camera (fbuf.)Camera (pic.) Audio Network Intent Scheduler events (types 1–6) #Evs. Rate #Evs. Rate #Evs. Rate #Evs. Rate #Evs. Rate #Evs. Rate #Evs. Rate 1 2 3 4 5 6 Rate Booking.com 356 62 20 62 ------162 500 - - 1,224 8,474 147 973 0 427 >1,000 Gasbuddy 606 58 5 58 ------256 500 - - 1,943 713 82 17 0 369 >1,000 Sygic: GPS N.&M. 307 47 81 100 ------35 333 - - 1,628 523 57 11 0 416 500 TripAdvisor 517 55 13 71 ------137 333 - - 982 1,268 168 249 0 344 >1,000 Waze Social GPS 253 41 73 200 ------51 200 - - 1,847 202 471 82 0 316 >1,000 Yelp∗ 493 62 11 166 ------93 333 - - 1,764 897 763 65 0 311 >1,000 Flixster∗ 390 55 7 166 ------123 333 - - 1,571 1,056 354 113 0 276 >1,000 Hotels.com 503 66 15 125 ------188 250 - - 1,603 841 504 59 0 384 >1,000 Priceline 652 62 9 142 ------138 90 - - 1,357 2,514 93 37 0 405 500 Scout GPS Navig. 207 62 78 90 ------36 71 - - 1,438 298 539 0 0 361 >1,000 Route 66 Maps 197 52 94 111 ------46 125 - - 1,883 3,617 836 18 0 314 >1,000 Restaurant Finder 468 47 12 100 ------127 333 - - 1,695 817 596 47 0 325 >1,000 GPSNavig.&Maps 296 52 42 166 ------12 47 - - 1,605 192 758 0 0 413 >1,000 Weather Whiskers 541 58 19 100 ------27 166 - - 894 1,056 74 81 0 366 >1,000 NavFreeUSA 303 43 84 58 ------8 200 - - 726 207 160 0 0 401 500 Barcode Scanner 64 41 - - 57 5.95 - - - - 5 333 - - 364 189 31 107 19 144 >1,000 Google Goggles 51 166 - - 52 5.95 - - - - 16 166 - - 307 216 15 52 0 158 >1,000 Pudding Camera 103 35 - - - - 5 0.46 ------798 341 45 96 0 230 500 Evernote∗ 315 66 - - - - 3 0.23 - - 23 250 - - 1,158 589 130 244 0 363 >1,000 Amazon Mobile∗ 590 58 - - 32 12.50 - - - - 64 500 - - 2,005 775 194 9 0 181 >1,000 QR Droid 83 43 - - 55 6.06 - - - - 6 66 - - 513 115 20 73 0 160 >1,000 CamScanner 119 45 - - - - 2 0.01 ------439 312 37 52 0 118 500 CamCard Free 76 55 - - - - 5 0.01 ------882 436 50 31 0 126 500 RedLaser Barcode 93 62 - - 41 5.95 - - - - 8 83 - - 375 231 25 66 0 132 >1,000 Walmart 139 62 - - 86 3.68 - - - - 35 200 - - 611 152 55 149 0 155 >1,000 Camera Zoom Fx 86 38 - - - - 3 - - - 5 62 - - 460 287 69 41 0 113 500 Horizon 73 55 - - - - 2 - - - 13 83 - - 512 319 79 53 0 146 >1,000 Shazam 27 71 ------560 71 33 333 - - 224 6,617 125 272 0 255 >1,000 GO SMS Pro 18 71 ------68 52 14 333 - - 128 117 25 12 0 212 >1,000 Tune Wiki∗ 86 83 ------386 66 36 200 - - 386 1,253 267 88 0 172 >1,000 SoundCloud 93 66 ------419 90 41 142 - - 513 420 86 77 0 269 500 Ringtone Maker 125 71 ------897 83 - - - - 756 138 217 93 0 315 >1,000 musiXmatch 119 62 ------288 71 39 250 - - 1,124 683 153 113 0 367 >1,000 Best Voice Changer 65 45 ------167 62 - - - - 335 517 80 155 0 381 >1,000 Smart Voice Rec. 
35 55 ------260 62 - - - - 297 513 85 98 0 285 500 PCM Recorder 26 50 ------613 66 - - - - 414 397 52 18 0 415 >1,000 RoboVox Lite 52 55 ------302 62 15 166 - - 326 238 47 56 0 248 500 Diktofon 69 62 ------286 41 13 90 - - 257 366 38 89 0 325 500 Twitter∗ 417 62 ------64 250 7 0.20 973 652 318 49 0 405 >1,000 Google Translate∗ 217 66 ------36 83 8 0.39 549 572 28 110 0 139 >1,000 Instagram∗ 536 71 ------12 166 12 0.17 1,839 416 150 217 0 315 >1,000 PicsArt 303 45 ------6 0.15 905 531 234 68 0 357 >1,000 eBay∗ 200 58 ------64 250 11 0.22 1,545 377 59 158 0 306 >1,000 Bible∗ 471 58 ------15 142 5 0.15 1,560 603 76 333 138 143 >1,000 Craigslist∗ 271 55 ------48 71 7 0.12 1,147 521 83 267 0 268 >1,000 Dictionary∗ 318 62 ------41 125 9 0.17 1,468 699 103 251 0 375 >1,000 GO SMS Pro Emoji 102 62 ------12 66 6 0.25 314 215 34 62 0 236 >1,000 Weibo 486 71 ------115 200 5 0.15 1,532 748 428 93 0 386 >1,000 1Weather 275 47 ------18 142 8 0.22 948 817 42 88 0 306 500 Weather 183 45 ------9 142 7 0.17 829 543 25 46 0 268 500

of recording and replaying high-throughput stream-oriented apps?, and (3) Is VALERA useful for reproducing event-driven races?

Recording and replaying popular apps. To demonstrate the importance of recording and replaying sensor data, we ran VALERA on a wide range of popular apps. The apps were chosen based on the following criteria: (a) the app must use at least one of the sensor

APIs described in Section 4.3.4, and (b) apps must come from a variety of categories, such as business, productivity and tools. Note that these apps were downloaded directly from

Google Play, the main Android app marketplace, which does not provide the app’s source code.

Table 4.3 lists the 50 apps that we chose for our evaluation. To answer questions

(1) and (2) above, we chose a mix of highly-popular apps and high-throughput stream apps.

The first column contains the app name, the second column indicates the popularity of the app, i.e., number of downloads (installs), while the “Streams” grouped columns show the streams used in that app.

For example, Sygic GPS, a popular navigation app with more than 10 million downloads, could be replayed with VALERA because VALERA replays GPS and network inputs. Similarly, popular apps with more than 50 million downloads, such as Barcode

Scanner, Shazam, Google Translate, and Twitter could be replayed thanks to VALERA’s support for replaying camera, microphone, and intent inputs.

Several apps, e.g., Amazon Mobile∗, are marked with an asterisk. For those apps, the most powerful Android record-and-replay system to date, RERAN, could only replay the

GUI interaction, but not the high-level sensors, network, or events. For example, Amazon

Mobile allows users to search by scanning a barcode or taking a picture of the item; RERAN cannot replay either of these actions. We discuss the rest of the columns in Section 4.6.1 where we examine VALERA’s efficiency.

Thus we can conclude that VALERA is effective at recording-and-replaying widely popular Android apps, which are drawn from a variety of app categories and use a variety of sensors.

Reproducing event-driven race bugs. We used VALERA to reproduce event-driven races in several open source apps. The races, due to cross-posting of events and co-enabled events, were discovered by Maiya et al. [99]. Note that current Android record-and-replay tools cannot reproduce these races as they cannot preserve event ordering due to nondeterministic thread scheduling.

NPR News. While loading, this app checks the time of the last news list update. Concurrently, when new stories are added, a Runnable thread from NewsListAdapter.addMoreStories makes a post update call to the main thread which updates the lastUpdate variable in an asynchronous manner. These two events are non-deterministic and not ordered by a happens-before relationship. We reproduced the race by alternating the order of the two events and replaying the app in that specific order.

Anymemo. This app helps users learn new words in different languages using flash cards. After the user finishes one set of cards, the app creates a background thread which calculates the score and updates the list of items to be shown on the UI. Usually the calculation is fast for a small working set and the updated result will show before the user switches back to the list view. However, the calculation and update operation are not ordered by

happens-before. Hence if the user switches back before the calculation is done, the update operation cannot get the result and will throw a null pointer exception. In the latest version of this app, the authors have fixed the race bug by simply ignoring the race with a try-catch block. Although this fixes the crash, the UI view shows an incorrect result.

My Tracks. This app exhibited a different type of race. Whenever users try to record their location, the app sends a bind request to the Binder thread, and the thread

eventually binds it to a recording service. The problem is that when the service is registered

with the request, the field providerUtils is updated. When the service thread gets destroyed

this field is set to null. But there is no happens-before order between the field update and

the service destruction. If the service is destroyed before the registration is executed, the

recording process will attempt to dereference a null pointer. This is a harmful race which

can be reproduced and replayed using VALERA.

Tomdroid. This race, which leads to a NullPointerException (Figure 2.4), was also

successfully reproduced.

Efficiency

To quantify the efficiency of VALERA we measured: (1) the time and space overhead

that VALERA imposes when recording and replaying our test apps; and (2) the streaming

requirements, in terms of events and event burst rate.

Time and Space Overhead

The “Time” and “Space” columns in Table 4.3 present the results of the measurements

in the “Baseline”, “Record”, and “Replay” scenarios; for the record and replay

scenarios, we also show the overhead, in percent, compared to the baseline. The last row shows geometric means computed across all 50 apps.

Based on the results in Table 4.3 we make several observations. First, note that record overhead is typically 1.01%, and replay overhead is typically 1.02% compared to the baseline, uninstrumented app run. This low overhead is critical for ensuring that sensor input, especially real-time streams, e.g., video/audio capture or complex touchscreen gestures, is delivered with precise timing so the recorded and replayed executions do not diverge. Second, note that several apps, e.g., Sygic, have overheads of around 4.17%: upon investigation, we found that the cause is record and replay of heavy network traffic. We also performed experiments without network replay (we omit the detailed results for brevity) and found the overhead to be much lower: at most 1.16%, typically 0.5%–1.0%. VALERA allows users to turn off network replay, e.g., if users wish to reduce overhead or let the app interact with “live” servers and services.

The “Space” grouped columns show the space overhead of our approach: the size of the recorded log (VALERA stores the log data on the phone’s SD card), and the required log rate. As the table shows, apps that use the frame buffer have the largest space overhead, e.g.,

Barcode Scanner’s log size is 145 MB, collected during an execution lasting 71.43 seconds.

The large log size is due to the frame buffer from the camera sensor continuously sending back image data (in this case, it fills the 1.3 MB buffer every 250 milliseconds). Walmart,

RedLaser Barcode, Google Goggles and QR Droid have large logs for the same reason. For the audio sensor experiments (e.g., PCM Recorder, Shazam), the log size is determined by user actions and the duration of the execution. Similarly, GPS-based apps (e.g., Navfree USA,

GasBuddy, or TripAdvisor) have smaller logs, as saving GPS coordinates and network traffic takes less space than video or audio streams. The smallest-footprint logs are observed for intent replay—unlike GPS and other sensors, intents are small in size and sparse. We do not expect the logging to be an issue in practice as long as the log file fits onto the SD card: the log rate (last column, always less than 1.8 MB/s) is well within modern SD cards’ throughput capabilities.

Streaming Rate

We now present a quantitative characterization of the streams that have to be replayed. In Table 4.4 we show the number of events for each sensor and the scheduler, as well as the burst event rate,3 in events per second. The burst event rate indicates the burden on the record-and-replay system during periods of high load.

The table suggests a natural grouping of sensors by burst rate. First, the camera

(in “taking pictures” mode) has the lowest burst rate, since the user must press the shutter which naturally limits the picture-taking rate. Similarly, the intent event rate is low since intents are usually triggered in response to user input. The camera (in “frame buffer” mode) has a low burst rate as well—at most 12.5 events per second; this is due to the app having to perform frame-buffer processing, namely image recognition, which is computationally intensive.

The touchscreen, GPS and audio have moderate burst rates, 41–200 events/second.

The network’s burst rate is higher, 207 events/second on average. Note, however, that the touchscreen and network are used by most apps, so their burst rates might need to be

3The event rate fluctuates during an execution. For each sensor’s events, we identified the burst periods as the top 25% intervals by event rate, and took the median event rate of those periods.

accommodated simultaneously.

Finally, the scheduler has the highest burst rate, typically in excess of 1,000 events/second (our timer had millisecond granularity, so for events separated by less than 1 millisecond, we rounded up to the ceiling value of 1 millisecond). More specifically, Type-1 and Type-2 events were the most prevalent, indicating that most messages are self-postings.

Thus we can conclude that, with respect to our chosen apps, VALERA is efficient at

record-and-replay in terms of time overhead, space overhead, and supporting high-rate burst

events.

4.6.2 Evaluation of Fuzzy Replay

We now evaluate another one of VALERA’s applications: SSDA techniques.

Platform. The smartphone we used for experiments was a Samsung Galaxy Nexus with

Android version 4.3.0, Linux kernel version 3.0.31. The phone has a dual core ARM Cortex-

A9 processor running at 1.2GHz. For app rewriting and semantic alteration we used a

MacBook Pro laptop (2.5GHz Intel Core i7 processor with 16 GB memory) running Mac

OS X 10.10.5.

Procedure. We first recorded a normal, non-crashing execution. Then we used SSDA

on the recorded logs, replayed the semantically altered logs via VALERA and noted any

difference in app behavior compared to the behavior during record—either crash or increased

coverage, as explained next.

Table 4.5: SSDA Evaluation results.

App Sensor SSDA Outcome Time Log Coverage (sec) Size in- (KB) crease (%) Null loca- CRASH 25.42 1 Yelp GPS tion Map shift Different search result shown 35.67 1 1 Speed Normal execution 36.10 1 1 change GPS Navig.& Null loca- CRASH 26.83 31 Map GPS tion Map shift Different map route shown 73.14 72 1 Speed Different driving speed shown 71.85 70 1 change Null loca- CRASH 21.43 26 Route 66 Map GPS tion Map shift Different map route shown 65.31 64 1 Speed Different driving speed shown 67.74 65 1 change Null loca- CRASH 32.02 32 NavFree USA GPS tion Speed Different driving speed shown 64.13 73 1 change Map shift Altered execution: “unknown coordi- 15.97 71 3.5 nates” error message Null loca- Altered execution: “Oops! device can- 12.73 1 1.2 GasBuddy GPS tion not find your location” error message Map shift Different search result shown 41.71 1 1.3 Speed Normal execution 42.92 1 1 change CamCard Blur Altered execution: fail to recognize 36.72 3,576 2 HD Free picture Camera Darken Altered execution: “Low Light” error 36.14 3,684 2 message Lighten Altered execution: “Blurry Image” er- 36.86 3,630 2 ror message Rotate Altered execution: fail to recognize 36.58 3,593 2 picture Blur Altered execution: fail to recognize 32.76 4,137 1 picture Barcode ScannerCamera Darken Altered execution: fail to recognize 32.58 4,186 1 picture Lighten Altered execution: fail to recognize 32.74 4,238 2 picture Rotate Altered execution: fail to recognize 32.87 4,194 5 picture Blur Altered execution: fail to recognize 36.59 5,186 1 picture Google Goggles Camera Darken Altered execution: fail to recognize 35.78 5,240 1 picture Lighten Altered execution: fail to recognize 36.81 5,159 2 picture Rotate Normal execution 36.04 5,206 1 Decrease Altered execution: fail to recognize 45.37 2,745 3 Shazam Audio sample rate song Adding Altered execution: fail to recognize 45.26 2,833 3 noise song

Location. The “GPS sensor” rows of Table 4.5 show the SSDA results when traces are altered in two ways: injecting a null location into the trace and map-shifting. We discovered that four popular apps crash when presented with a null location: GPS Navigation & Map,

Yelp, Route 66 Map, Navfree USA; GasBuddy handles this situation more gracefully, though the behavior is different compared to the original execution. Navfree USA exhibits a different

behavior when presented with a map-shifted list of locations.

Camera. The middle part of Table 4.5 shows the result of applying image SSDA to three

popular apps. CamCard HD Free helps users scan business cards. Barcode Scanner is used

to scan barcodes or QR codes. Google Goggles performs image search, by taking a picture

and searching Google Images for similar pictures. In all cases we apply the image SSDA

techniques (e.g., blur, darken, lighten, rotate). The “Outcome” column of Table 4.5 shows

the results: the app’s behavior is different in all cases, compared to the original execution—

displaying an error message or failing to recognize the image thus ending up in a different

state compared to the original app.

Audio. SSDA was effective at altering executions for Shazam as shown in the last row

of Table 4.5. The default sample rate used by Shazam is 44.1KHz, and its buffer size is

4,410 bytes. We changed a song that was successfully recognized during the record phase

by altering the song’s sample rate to 16KHz, 44.1KHz and 48KHz. The results indicate that

Shazam can recognize the song if the recorded audio stream’s sample rate is ≥ 44.1KHz,

but fails when the sample rate is below this value. This behavior is expected, because a low

sample rate means low audio quality which hinders Shazam’s song recognition efforts.

Surprisingly, we found that, when playing two songs simultaneously, Shazam can recognize one of the songs.

Performance. The “Time” and “Log Size” columns of Table 4.5 show the running time and log size. Thanks to VALERA’s low runtime overhead (about 1%) we did not experience any visible slowdown during execution. In the null location case, we inject the null object randomly into the middle of the original trace, and remove the trailing data. The four crashing apps exit early, thus their running time is short. The log size is small for GPS: the three navigation apps (GPS Navigation & Map, Route 66 Map, NavFree USA) log around 70

KB data as they continuously use the GPS sensor. Yelp and GasBuddy, on the other hand, log very little data because they use the GPS only when users search by location.

The log sizes of camera apps and audio apps are comparatively much larger: on average, the camera apps log 4 MB of data in 35 seconds, while the audio apps log 2.7 MB in 45 seconds. Though these log sizes are larger compared to GPS, they are very much manageable: modern smartphones’ storage capacity (typically 32GB) is large enough to hold the log data.

Coverage increase. We measured the increase in coverage as follows. We leverage Android’s default profiler to collect the executed method trace. The profiler is started and stopped at the entry and exit of each instrumented sensor API function. The trace data is saved to the SD card. Then we use Android’s dmtracedump to dump the executed method trace.

Finally, we compare the normal trace and the SSDA trace. The coverage increase captures the newly executed functions that do not appear in the normal trace. The results (method coverage

101 Table 4.6: Source and destination apps for cross-app replay.

Sensor Recording Recording Log Replaying app Replaying app Time (sec- Size Time (sec- onds) (KB) onds) GPS NavFreeUSA 72.13 68 GPS Navigation & 68.72 Maps Sygic GPS 75.48 Waze Social GPS 83.17 Maps Scout GPS Naviga- 69.44 tion Route 66 Maps 78.69 Camera Barcode Scan- 32.55 4,168 QR Droid 38.50 ner (Frame RedLaser Barcode 41.72 buffer) Goggle Goggles 37.25 Amazon Mobile 43.52 Walmart 39.69 Camera CamCard HD 45.81 3,610 CardToContact 41.02 Free (Take pic- Business Card Reader 48.60 ture) Lite ScanBizCards Lite 41.77 Google Goggles 38.96 Audio PCM Recorder 52.65 2,395 Shazam 34.28

increase due to SSDA) are in the last column of Table 4.5. Notice that coverage increase depends on the app. For example, while a Null location is effective at crashing Yelp, GPS

Navig.& Map, and Route 66 Map, the other GPS alterations do not increase coverage substantially in these apps. For NavFreeUSA and GasBuddy the cumulative increases are 3.5% and 2.5% respectively. Camera and audio SSDA, however, are more effective, yielding between 4% and 9% cumulative coverage increase. We believe that these coverage gain figures are acceptable given that the fuzzed logs are created automatically (no user involvement).

Now we describe the evaluation for cross-app testing. The platform and apps are the same as those discussed in the evaluation of SSDA.

102 Category 1: Navigation Apps. Navigation is a widely-used application of smart- phones, hence app marketplaces contain numerous navigation apps. Although each navi- gation app has its own feature such as voice command, turn-by-turn directions and social connection, their common functionality is to provide routes and guide users. Without our approach it would be tedious to generate location traces for each navigation app. With our API interception support, we can collect a location trace from one app and reuse it for others. Since apps receive location updates via the standard Location API, we could easily port the trace from one app to many others without any modification. As Table 4.6 indicates (the GPS row), in our experiments, we collected a routing trace from Navfree USA and have successfully replayed it on five other popular navigation apps.

Category 2: Barcode Scanning Apps. Barcode and QR code scanner apps are another popular application of smartphones, e.g., when users are shopping, as the app can recognize the item by its barcode and compare its price on different online shops. QR codes are widely used to entice users to scan the code and visit a certain URL. Table 4.6 (row “Camera

(Frame buffer)”) shows the result of cross-app testing for apps in this category. We collect the camera’s frame buffer trace from the Barcode Scanner app, and successfully replay it

back to five other apps that are also able to read barcodes.

However, unlike for the navigation apps above, we had to modify the trace data

before cross-app replay because different apps set different camera parameters. For instance,

Barcode Scanner sets the camera preview size to 1280x720 pixels while QR Droid sets it to

864x480. While we could not reuse the trace “verbatim”, we found out that modifying the

trace data to fit the current app could be automated and performed on-the-fly during replay

103 by simply examining the parameters with which the app has invoked the API, in this case the preview size. By modifying the preview size, we could successfully replay the Barcode

Scanner trace on the other five apps.

Category 3: Business Card Recognition Apps. These apps use pattern recognition to extract contact information after taking a picture of the card. Unlike the barcode scanning apps studied above, business card recognition apps mainly use the camera’s picture taking feature instead of frame buffers. We found that pictures are saved in a common format (JPEG) that makes it easy to reuse trace snippets from one app to another. As shown in Table 4.6 (row “Camera (Take picture)”) we take pictures of business cards from

CamCard HD Free and can successfully replay the trace in four other apps.

Category 4: Audio Apps. The cross-app testing for audio-related apps is interesting: we chose two apps, PCM Recorder and Shazam, which on the surface are not in the same category as the former allows users to record an audio stream while the latter recognizes songs. Since both use the AudioRecord API, it is possible to replay the audio data trace from one app to the other. In our experiment, as illustrated in the last row of Table 4.6, we collect a song’s audio stream in PCM Recorder and can successfully play it back on Shazam.

Cross-app testing time. The “Recording Time” and “Replaying Time” columns of

Table 4.6 show the time it took to record the log in App1 and replay the sub-log in App2, respectively; note that these times are low (less than 80 seconds in all cases).

4.7 Summary

In this chapter, we have presented VALERA, an approach and tool for versatile,

low-overhead, record-and-replay of Android apps. VALERA is based on the key observation

that sensor inputs, network activity and event schedules play a fundamental role in the

construction and execution of smartphone apps, hence recording and replaying these two

categories is sufficient for achieving high-accuracy replay. Further, the recorded input data

can be fuzzed in a semantically meaningful way to increase the coverage of testing. Exper-

iments with using VALERA on popular apps from Google Play, as well as replaying event

race bugs, show that our approach is effective, efficient, and widely applicable. VALERA’s

accuracy and low runtime overhead make it suitable as a platform for applications such as

profiling, monitoring, debugging, testing, or dynamic analysis. In Chapter 5 we show how

we have extended VALERA to filter out false races, and classify true races into benign and

harmful.

Chapter 5

Race Verification and Classification

In the previous chapters we have shown how to statically detect and replay event-driven races. In this chapter, we show how we improve the precision of the reported races via the replay technique described in Chapter 4. So far, state-of-the-art event-driven race detectors for Android, both static and dynamic, have two drawbacks: (1) they are prone to false positives, and (2) they cannot verify the effect of a race, i.e., whether it is benign or harmful. We now discuss these drawbacks and how the approach presented in this chapter helps address them.

False positives. Most Android apps use ad-hoc synchronization to protect shared variable accesses across asynchronous events. Therefore, race detectors can improve their precision by identifying a broad range of synchronization operations, to avoid reporting safe/synchronized accesses as races. In our experience, even the most precise race detector currently available, EventRacer, is still prone to false positives. EventRacer attempts to filter out false positives by applying a technique called “race coverage” [120], which was previously used for event-driven races in web applications. While race coverage can greatly reduce the false positive rate, it still fails to identify certain categories (types) of false positives. In Section 5.1 we describe these categories.

Harmful vs. benign races. The second problem with current tools is that for true positives – accesses unprotected by synchronization – they fail to distinguish between benign and harmful races. Our study shows that only a very small portion of reported races are harmful. Previous studies have reached similar conclusions for desktop applications [106].

Since analyzing races requires substantial human effort, an approach with a high rate of false positives or benign races is less likely to be adopted by developers, as it is a high-investment/low-return activity. Thus we argue that we need an automatic race verification tool that can distinguish between benign and harmful races. In Section 5.1 we define benign races, harmful races, and false positives.

To address these issues, we introduce ERVA (Event-race Reproducer and Verifier for Android)1, an automated approach and tool for verifying and reproducing event-based races in Android apps. ERVA, described in detail in Section 5.2, takes as input a report of a potential race and uses a suite of techniques to place the race into one of three categories. First, if the race is a false positive, it is reported as such. If the race can be confirmed, it is classified as benign or harmful. To support this classification, we introduce event dependency graphs (EDGs) and a novel definition of benign vs. harmful races in Android apps based on state comparison. If the race is harmful, ERVA automatically produces an event schedule that can be used to deterministically reproduce the race, so the developers can study the race, understand its cause, and fix it.

1Available at http://spruce.cs.ucr.edu/valera/erva.html

ERVA does not require access to the app source code, but rather relies on dynamic tracking of happens-before (HB) relationships, schedule replay, and “event flipping”. Given an app, ERVA proceeds in two stages. In the first stage, ERVA runs the app in the EventRacer [25] race detector to obtain a set of candidate races (pairs of racing events). While the app is running in EventRacer, ERVA records replay information (e.g., UI events, input streams, sensor streams) and synchronization information (e.g., begin/end of thread and synchronization actions, event posting, etc.); this information is used in later phases. For each candidate race from the report, ERVA’s post-run analysis confirms whether the candidate is indeed a race, to distinguish between false positives and true positives.

The second stage examines the true positives to further distinguish between benign and harmful races. ERVA replays the execution multiple times using the inputs recorded in the first stage, this time instrumenting the app to record app state. In each of these executions, ERVA “flips” – alternates the ordering of – the events to check their side effects, i.e., the effect of flipping on app state (app state includes all the UI view states, shared preferences, files, databases, and network traffic). If the flipping has no side effect, ERVA categorizes the race as benign; otherwise it is declared harmful. Since ERVA employs replay, developers have the opportunity to replay the app with those inputs and event schedules that lead to harmful races, to facilitate finding and fixing the cause of the race.

5.1 Motivating Examples

In this section, we demonstrate two types of false races and two types of benign races drawn from real-world apps encountered during our experiments.

5.1.1 False Positive Type-1: Imprecise Android Component Model

False positives may arise due to imprecise modeling of the Android components and their interaction. Figure 5.1 shows an example: a race reported by EventRacer in AnyMemo (a flashcard app) that is actually a false positive. The RecentListFragment is a subclass of Android’s Fragment component. In the onResume() callback, the app performs a database query and updates the fragment views with the recent items. Since the database access is time-consuming, to keep the app responsive the query is performed by a background task (lines 14–16). When the task is done, a callback is posted to the main thread, and the main thread updates the UI (lines 18–22).

The race detector reports a race between onCreateView() and onResume() callbacks.

Due to imprecise modeling, the race detector cannot find any HB relation between these callbacks. Hence, since onCreateView() writes the mAdapter variable and onResume reads the same variable, a read-write race is reported.

However, this race is actually a false positive. According to the Android documentation,2 a Fragment’s onCreateView() method is always invoked before its onResume() method. Thus this read-write race can never happen.

5.1.2 False Positive Type-2: Implicit Happens-before Relation

Another category of false positives is due to imprecise modeling of the happens-before relationship. Figure 5.2 shows an example of an FP caused by an implicit HB relation in Cool Reader, an eBook reader app.

2http://developer.android.com/guide/components/fragments.html

1  public class RecentListFragment extends Fragment {
2    private ArrayAdapter mAdapter = null;
3    private Handler mHandler = null;
4
5    @Override
6    public View onCreateView(...) {
7      ...
8      mHandler = new Handler();
9      mAdapter = new ArrayAdapter(...);
10   }
11
12   @Override
13   public void onResume() {
14     Thread thread = new Thread() {
15       public void run() {
16         // query database operations
17         mHandler.post(new Runnable() {
18           public void run() {
19             mAdapter.clear();
20             for (RecentItem ri : database)
21               mAdapter.insert(ri);
22           }
23         });
24       }};
25     thread.start();
26 }}

Figure 5.1: False positive type-1 in the AnyMemo app.

[CoolReaderActivity.java]
CoolReader.onStart() {
  ...
  waitForCRDService(new Runnable() {
    public void run() {
      Service.getHistory().loadFromDB(...);   // post(Runnable r)
      ...
      new CRRootView(...);                    // post(Runnable r)
    }
  });
}

[History.java]
void onRecentBooksListLoaded(List list) {
  ...
  mBooks = list;
  ...
}

[History.java]
void getOrLoadRecentBooks(...) {
  if (mBooks != null && mBooks.size() > 0) {
    // read and update mBooks
  }
}

Figure 5.2: False positive type-2 in the Cool Reader app.

EventRacer reports that the callbacks onRecentBooksListLoaded and getOrLoadRecentBooks have a race condition because they both access the shared mBooks object, but the tool cannot derive any HB relation between these two callbacks. The

CoolReaderActivity is an instance of an Android Activity subclass, i.e., a separate screen. Its

lifecycle starts with the onStart() callback invoked on the main thread. In onStart, the app first

starts the database service CRDBService. If the service starts successfully, a Runnable callback

will be posted back to the main thread indicating that the database service is ready. The

callback first tries to load some history records by invoking loadFromDB(), then creates a new

CRRootView object. These two methods will both post callbacks (detailed implementation

omitted).

Note that the loadFromDB and CRRootView initialization methods are invoked in

the same action on the Looper thread (i.e., the main thread). According to the looper

atomicity rule, loadFromDB happens before CRRootView; in other words, the calls loadFromDB

and CRRootView are in program order. In the implementation of these two methods, they both

use the Handler.post(Runnable r) to post callbacks. The post method inserts the actions into the

queue in FIFO order. Since loadFromDB posts the callback before CRRootView, the callback

onRecentBooksListLoaded will always happen before getOrLoadRecentBooks. However, EventRacer

misses this implicit HB relation and thus reports a race, which is a false positive in this

case.

5.1.3 Benign Race Type-1: Control Flow Protection

We now discuss a benign race in Volley, a popular HTTP library [60]. Figure 5.3 shows the relevant source code. EventRacer reports a race on the mRunnable object.

1  public class ImageLoader {
2    private Runnable mRunnable;
3
4    private void batchResponse(...) {
5      if (mRunnable == null) {
6        mRunnable = new Runnable() {
7          public void run() {
8            // deliver batched requests
9            mRunnable = null;
10         }
11       };
12       mHandler.post(mRunnable);
13 }}}

Figure 5.3: Benign race type-1 in the Volley library.

The method batchResponse on line 4 and the creation of the Runnable object on line 6 are two distinct actions executed on the main thread. On line 6, the mRunnable object is updated to point to a new Runnable object, while on line 9 it is set to null. Since EventRacer does not capture the HB relation between these two actions, it reports a write-write race; however, it is a benign race.

The null test on line 5 can be true or false depending on when the next batchResponse is executed. Usually, Runnable.run() is executed before the next batchResponse: mRunnable is set to null (line 9), hence the next batchResponse creates a new Runnable (line 6) and posts it to the main thread’s looper (line 12). However, when multiple batchResponse actions are queued and executed before Runnable.run(), the check on line 5 sees that mRunnable is already non-null, so the method does nothing. Thus the order in which batchResponse and the Runnable are executed does not matter, due to the control flow protection offered by the if on line 5. This race is classified as benign.

5.1.4 Benign Race Type-2: No State Difference

Figure 5.4 shows an example of benign race type-2 in the AnyMemo app. When QACardActivity is launched, it creates several loaders to load data from the database or configuration files. In the startLoading method, the global variable runningLoaderCount tracks how many active loaders are currently running. When a loader finishes, it posts an onLoadFinished callback to the main thread, which invokes the checkAllLoaderCompleted method. In this method, the variable runningLoaderCount is first decremented; if runningLoaderCount <= 0, the onAllLoaderComplete callback is invoked to signal that all the loaders have finished their job.

QACardActivity (Looper Thread):
  startLoading() {
    for (Loader loader : mLoaders) {
      loaderManager.initLoader(loader);
      runningLoaderCount++;
    }
  }

  // onLoadFinished posted by Loader 1
  checkAllLoaderCompleted() {
    runningLoaderCount--;
    if (runningLoaderCount <= 0) { onAllLoaderComplete(); }
  }

  // onLoadFinished posted by Loader 2
  checkAllLoaderCompleted() {
    runningLoaderCount--;
    if (runningLoaderCount <= 0) { onAllLoaderComplete(); }
  }

Figure 5.4: Benign race type-2 in the AnyMemo app.

Figure 5.5: Overview of ERVA. The race detection phase runs the app under EventRacer on an instrumented emulator while capturing inputs and events (producing an input log, an EDG, and a race report); the race verification phase replays the inputs on an emulator or phone, flipping events and comparing app state, and classifies each report as a false positive, a benign race, or a harmful race.

Since the time spent in the loader is unpredictable, the order of these two onLoadFinished callbacks executed on the main thread is not deterministic. The race detector reports this pair of callbacks as a race because it cannot find any HB relation and these callbacks do write to the same object, runningLoaderCount. Although this reported race is a true positive, it is actually harmless, because app state does not depend on the order in which the callbacks write to runningLoaderCount. ERVA can flip the execution order of the callbacks and does not

find any harmful effect (EVS difference). Thus this race is classified as benign.

5.2 Design Overview

Figure 5.5 shows an overview of ERVA; it consists of two phases, a race detection phase and a race verification phase. Note that ERVA relies on dynamic analysis and instrumentation, hence the app source code is not required.

In the race detection phase, we run the app on an instrumented platform; the traces described in Chapter 2 are collected at this stage. The platform’s instrumentation consists of three main modules: first, a publicly-available custom Android emulator3 with EventRacer running on top of it; second, an input capture module provided by the VALERA [74] record-and-replay tool; and third, an event capture module, shown with a thicker black line and font in Figure 5.5, as it is a contribution of this work, unlike EventRacer and VALERA which are off-the-shelf tools.

EventRacer runs the app and produces a race report. ERVA saves the results of the input capture and event capture modules as an input log and an EDG, respectively. With these logs at hand, we proceed to the race verification phase.

In the verification phase, we replay the execution multiple times, flipping event order. The platform used for this phase can be either the emulator or an actual Android phone. Using the input log collected during detection, we use the input replay support of VALERA to ensure that the input provided to the app in this phase is the same as during detection (record). Due to event flipping we obtain multiple executions; we capture app state from each execution and then use app state comparison to classify each potential race as either a false positive, a benign race, or a harmful race. We now provide details on each of these components.

5.2.1 Race Detection

We choose EventRacer [25] as the race detector in ERVA as it is publicly available and robust. Compared with CAFA [72] and DroidRacer [99], EventRacer’s HB model is more precise, while its race reports are easy to parse.

3http://eventracer.org/android/

5.2.2 Input Capture and Replay

To capture app input for subsequent replay, we leverage VALERA [74], a tool that can record and replay app input, e.g., touchscreen, network, GPS, etc. VALERA can also capture and replay event schedules, but it does so “blindly” — it does not capture event dependencies that are instrumental for this work.

IPC Events. Android apps use IPC heavily, for isolation reasons. For instance, the Input Method Manager Service (IMMS) uses IPC: when the user inputs text into the current window, the soft keyboard is actually a global system service rather than code running in the app’s address space. The IMMS receives the inputs from the user and dispatches them to the currently active window via IPC calls.

In Android, IPC is carried out via the Binder mechanism: each app has several Binder threads to handle incoming IPC calls. ERVA records the data and timing of each Binder transaction. The data contains two parts: primitive data and reference data. Reference data, such as Binder object references and OS file descriptors, is not deterministic across different runs. For primitive data, ERVA saves the concrete content, while for reference data ERVA just keeps a slot in the log; the concrete value is filled in during the actual execution.
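A minimal sketch of the per-transaction record this description implies is shown below; the class and field names are ours, not ERVA's actual data structures:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a recorded Binder transaction. Primitive data is
// logged verbatim; references (Binder objects, file descriptors) get empty
// slots that are bound to the concrete values observed during replay.
final class BinderTransactionRecord {
    final int transactionCode;        // which remote method was invoked
    final long timestampNanos;        // when the transaction was observed
    final byte[] primitivePayload;    // ints, strings, etc., copied from the Parcel
    final List<Object> referenceSlots = new ArrayList<>(); // filled at replay time

    BinderTransactionRecord(int code, long ts, byte[] payload, int referenceCount) {
        this.transactionCode = code;
        this.timestampNanos = ts;
        this.primitivePayload = payload;
        for (int i = 0; i < referenceCount; i++) referenceSlots.add(null);
    }

    // At replay time, bind the i-th reference slot to the object created in this run.
    void bindReference(int i, Object liveReference) {
        referenceSlots.set(i, liveReference);
    }
}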

Threads and Event Posting. ERVA intercepts thread and event operations to capture the HB relationships defined in Section 2.2.2. Android apps achieve asynchronous programming via message posting: when background threads need to update the UI, they post messages (events) to the Looper on the UI thread. As described in Section 5.2, there are three types of event posting. ERVA captures these messages by recording the relevant APIs, such as Handler.sendMessage and Handler.post.

Figure 5.6: Excerpt from the EDG of the TomDroid app. (The diagram shows hardware touchscreen and Back-key events feeding the UI-thread handlers ViewNoteActivity.onClick(), ViewNoteActivity.updateUI(), ViewNoteActivity.onBackPressed(), and NoteListActivity.onResume(), with edges to the SyncThread and to the Activity Manager Service's AMS.finishActivity().)

Figure 5.7: Performing event flipping on TomDroid's EDG in Figure 5.6. (Same diagram, with a "fake" dependency edge so that NoteListActivity.onResume() precedes ViewNoteActivity.updateUI().)
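For concreteness, the sketch below shows the standard posting idiom that these recorded APIs correspond to: a background thread hands a Runnable to the main thread's Looper via Handler.post() (the class and method names are illustrative):

import android.os.Handler;
import android.os.Looper;

// Illustrative sketch of the posting idiom that ERVA records: recording the
// Handler.post call yields the HB edge "background work happens-before the
// posted callback".
public final class PostingExample {
    private static final Handler mainHandler = new Handler(Looper.getMainLooper());

    public static void startBackgroundWork() {
        new Thread(() -> {
            final String result = doLongRunningWork();
            mainHandler.post(() -> updateUi(result)); // intercepted during recording
        }).start();
    }

    private static String doLongRunningWork() { return "done"; }
    private static void updateUi(String result) { /* runs on the UI thread */ }
}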

5.2.3 Event Dependency Graph

By capturing the external and internal events and their posting, ERVA builds an event dependency graph (EDG). Figure 5.6 illustrates EDGs by showing an excerpt from the actual EDG of the TomDroid app. Each edge in the graph describes the causal relationship between events. For example, in Figure 5.6, the user performs two interactions with the app. First, the user clicks the ‘Sync’ button on the ViewNoteActivity. The onClick listener will

create a background thread that performs a long-running task (data synchronization with the cloud). When the SyncThread is successfully created, the app will show an animation indicating that the task is running in the background. After the task is finished, the SyncThread will post an internal message to the UI thread to stop the animation. When the user is notified that the Sync task is done, she can use the ‘Back’ button to go back to the previous activity. The Back button press will trigger the onBackPressed() handler. The ViewNoteActivity sends an IPC Binder transaction to inform the Activity Manager Service (AMS) to finish the current activity. AMS handles this transaction and then sends back an IPC to the app telling the UI thread to switch to the NoteListActivity. This activity contains updated content, hence the user sees an updated screen.

The EDG precisely describes each event transition. Using the EDG, ERVA knows the root cause of a particular event. For instance, the ViewNoteActivity.updateUI() is triggered because the user has clicked the ‘Sync’ button and this event will create another background thread. During replay, an event can be ready to replay if and only if its recorded preceding event has been replayed. This is controlled by ERVA’s underlying scheduler which will be described next.
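A minimal sketch of the graph structure and readiness check implied by this description is given below; the class names and fields are our illustration, not ERVA's code:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative EDG sketch: nodes are recorded events, edges point from an
// event to the events it caused (posted). An event may be dispatched during
// replay only after all of its predecessors have been replayed.
final class EdgNode {
    final long eventId;
    final List<EdgNode> predecessors = new ArrayList<>();
    final List<EdgNode> successors = new ArrayList<>();
    EdgNode(long id) { this.eventId = id; }
}

final class EventDependencyGraph {
    private final Set<Long> replayed = new HashSet<>();

    void addEdge(EdgNode cause, EdgNode effect) {
        cause.successors.add(effect);
        effect.predecessors.add(cause);
    }

    boolean isReady(EdgNode e) {
        for (EdgNode pred : e.predecessors)
            if (!replayed.contains(pred.eventId)) return false;
        return true;
    }

    void markReplayed(EdgNode e) { replayed.add(e.eventId); }
}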

5.2.4 Event Flipping

In Android, each Looper has an associated Message Queue. The Looper runs in an infinite loop, waiting for incoming events and placing them in a queue. Messages (i.e., events) are dispatched by invoking the event handler’s callback. We changed the Looper implementation to support flipping “racy” (unordered by HB) pairs of events, as follows.

During replay, ERVA retrieves all recorded events (VALERA saves those to support event replay). Whenever our modified Looper receives an event, it checks whether this event is executable according to the EDG. If all the preceding events in the EDG have been executed, the message is dispatched to the handler as usual. Otherwise, the event is postponed (added to a “pending” queue) because it could potentially be flipped.
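A condensed sketch of this dispatch logic follows; it is our own simplification that reuses the EdgNode/EventDependencyGraph types from the sketch above, whereas ERVA's modified Looper lives inside the Android framework and differs in detail:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Illustrative flipping scheduler: dispatch an event only when all of its EDG
// predecessors have been replayed; otherwise park it in a pending queue and
// retry after every dispatched event.
final class FlippingScheduler {
    private static final class PendingEvent {
        final EdgNode node; final Runnable handler;
        PendingEvent(EdgNode n, Runnable h) { node = n; handler = h; }
    }

    private final EventDependencyGraph edg;
    private final Deque<PendingEvent> pending = new ArrayDeque<>();

    FlippingScheduler(EventDependencyGraph edg) { this.edg = edg; }

    void onEventArrived(EdgNode node, Runnable handler) {
        if (edg.isReady(node)) {
            dispatch(node, handler);
            drainPending();
        } else {
            pending.add(new PendingEvent(node, handler)); // may be flipped later
        }
    }

    private void dispatch(EdgNode node, Runnable handler) {
        handler.run();
        edg.markReplayed(node);
    }

    // Repeatedly dispatch any parked event whose predecessors have now run.
    private void drainPending() {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Iterator<PendingEvent> it = pending.iterator(); it.hasNext(); ) {
                PendingEvent p = it.next();
                if (edg.isReady(p.node)) {
                    it.remove();
                    dispatch(p.node, p.handler);
                    progress = true;
                    break; // restart the scan after the pending set changed
                }
            }
        }
    }
}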

For example, in Figure 5.6, ViewNoteActivity.updateUI() and NoteListActivity.onResume() do not have an HB relation according to the race detector, which means their execution order can be flipped. To flip the events, ERVA adds a “fake” dependency edge in the EDG, as shown in Figure 5.7. During replay, the updateUI event handler comes before the Back key handler, but this time updateUI cannot be executed because it has a preceding event (NoteListActivity.onResume) in the EDG. Thus the event is added to the pending queue. When the NoteListActivity is brought back into the foreground, the onResume callback is invoked. After onResume is finished, the looper scheduler notices that onResume has a succeeding edge in the EDG, i.e., updateUI. The scheduler then inspects the pending queue, finds updateUI, and allows it to execute. To summarize, this strategy guarantees that the order of events is flipped compared to the original (record) order.

5.2.5 State Recording and Comparison

Replay-based race classification has been used in prior tools [22, 106]: it starts from an execution that experienced one ordering of the racing accesses and re-runs it while enforcing another ordering, then compares the states of the program to check whether the race is benign or harmful. The main problem with these tools is that, by relying on instruction-level deterministic replay, their overhead is too high, and they would not work for Android apps for several reasons. First, Android devices usually have limited computation and storage resources. Second, whole-system instruction-level replay would be difficult on mobile devices without hardware changes. Third, Android apps are sensitive to timing: a large slowdown is likely to incur ANR (Application Not Responding) errors. Fourth, Android’s UI gestures are very sensitive to input timing, and large overhead may change gesture semantics, leading to replay divergence [56, 74].

We define Externally Visible State (EVS) as the subset of app state that might be accessed, or viewed, by the user; in ERVA the EVS consists of GUI objects (layouts, views, images) and Shared Preferences (a system-wide key-value store where apps can save private or public data [58]). The extent of the EVS can be customized by ERVA users. However, for this work we decided to limit the EVS to just the GUI and Shared Preferences for two reasons: (1) capturing more state, e.g., file contents, would incur higher overhead and lead to spurious differences; and (2) Android event-race bugs tend to manifest as GUI differences or crashes [99, 72, 120].

Hence, instead of recording and comparing whole-memory contents, ERVA finds state differences (hence harmful races) via EVS snapshot differencing, as follows: (1) in the original event-order execution, ERVA snapshots the EVS upon entering or leaving each activity into EVS_original; (2) likewise, ERVA snapshots the EVS after the event order is flipped, into EVS_alternate; and (3) ERVA compares EVS_original and EVS_alternate to find differences; a benign race should show no difference, that is, the user cannot tell the difference between the original and alternate executions. Note that some differences might still exist in hidden state, e.g., memory contents or the VM stream, but these differences are not our focus; in our experience, many are spurious. Rather, we expose those races that lead to visible EVS differences. In addition, ERVA allows the EVS definition (i.e., its extent) to be customized by the user.
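As an illustration of what such an EVS snapshot could capture and how two snapshots might be compared, consider the following sketch; the traversal and key naming are our own and do not reflect ERVA's actual snapshot format:

import android.app.Activity;
import android.content.SharedPreferences;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;

import java.util.Map;
import java.util.TreeMap;

// Illustrative EVS snapshot: flatten the activity's view hierarchy and the
// app's SharedPreferences into a sorted key/value map, so two executions can
// be compared with Map.equals().
final class EvsSnapshot {
    static Map<String, String> capture(Activity activity, SharedPreferences prefs) {
        Map<String, String> state = new TreeMap<>();
        walkViews(activity.getWindow().getDecorView(), "root", state);
        for (Map.Entry<String, ?> e : prefs.getAll().entrySet())
            state.put("pref:" + e.getKey(), String.valueOf(e.getValue()));
        return state;
    }

    private static void walkViews(View v, String path, Map<String, String> state) {
        state.put(path + ":visible", String.valueOf(v.getVisibility()));
        if (v instanceof TextView)
            state.put(path + ":text", ((TextView) v).getText().toString());
        if (v instanceof ViewGroup) {
            ViewGroup g = (ViewGroup) v;
            for (int i = 0; i < g.getChildCount(); i++)
                walkViews(g.getChildAt(i), path + "/" + i, state);
        }
    }

    static boolean differs(Map<String, String> original, Map<String, String> alternate) {
        return !original.equals(alternate);
    }
}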

5.2.6 Race Verification and Classification

As described in Section 5.1, ERVA classifies race reports into five bins: two types of false positives, two types of benign races, and harmful races. We now describe how ERVA performs this classification.

False positives of type-1 occur because race detectors do not model the app’s lifecycle callback events (or do not model them precisely). For such a pair, the result of event flipping is a deadlock: the lifecycle guarantees that the event with logical timestamp 1 always runs before the event with logical timestamp 2, so the flipped order can never be produced. Once ERVA detects a deadlock after flipping the events, we bin the report as false positive type-1.

For false positive type-2, the cause is missing implicit HB relations. ERVA detects this type of FP by analyzing the EDG. For example, for onRecentBooksListLoaded and getOrLoadRecentBooks (the racy pair of events in the Cool Reader example from Section 5.1.2), the EDG shows that the event posters are within the same event and are ordered by program order. Since Handler.post(Runnable) follows the FIFO property, these two events cannot be flipped. Note that, had one of the posters used postDelayed(Runnable r, long time), the events would be flippable.
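The FIFO argument can be made concrete with the following illustrative sketch: two Runnables posted from the same action via Handler.post always run in posting order, whereas replacing the first post with postDelayed would let the second overtake it, making the pair flippable:

import android.os.Handler;

// Illustrative sketch of the FIFO property used to rule out false positive
// type-2. Both callbacks are posted from the same action on the main thread.
public final class FifoExample {
    public static void postInOrder(Handler mainHandler) {
        mainHandler.post(() -> onRecentBooksListLoadedLike()); // enqueued first
        mainHandler.post(() -> getOrLoadRecentBooksLike());    // always runs second
    }

    public static void postFlippable(Handler mainHandler) {
        // With a delay, the second Runnable can overtake the first, so the
        // relative order of the two callbacks is no longer guaranteed.
        mainHandler.postDelayed(() -> onRecentBooksListLoadedLike(), 100);
        mainHandler.post(() -> getOrLoadRecentBooksLike());
    }

    private static void onRecentBooksListLoadedLike() { /* update mBooks */ }
    private static void getOrLoadRecentBooksLike() { /* read mBooks */ }
}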

For benign race type-1, memory accesses are protected by control flow. ERVA first tries to flip the racy pair and finds that the events can be flipped. Then, during event execution, ERVA enables instruction tracing to detect reads and writes protected by control flow. In the example shown in Figure 5.3, the access to mRunnable is protected by the branch condition on line 5. By analyzing the instruction trace of the flipped events, ERVA bins such reports as benign race type-1.

For benign race type-2, ERVA flips the order of events and does find that the memory value is different after flipping, thus this is a race. Next, ERVA dumps the state of the app (Section 5.2.5) and finds no difference. Thus ERVA considers this as a benign race type-2.

5.3 Evaluation

We now describe our experimental setup, then evaluate the effectiveness and efficiency of ERVA.

Environment. The race detector used in our experiments is the publicly-available EventRacer for Android [1]. ERVA is based on Android version 4.3.0. All the experiments were conducted on the Android emulator on top of an 8-core, 24 GB desktop machine running 64-bit Ubuntu 14.04.2 LTS.

We have evaluated ERVA along two dimensions: (1) effectiveness in verifying races and (2) efficiency, i.e., the time required to process an app.

App dataset. We ran ERVA on 16 real-world apps (column 1 of Table 5.1). These apps were chosen according to several criteria: (a) spanning various categories, from note-taking to flashcards to news and utilities; (b) reasonable popularity — column 2 shows the number of downloads, in thousands, according to Google Play; all but two apps have at least 10,000 downloads, while five apps have in excess of 1 million downloads; and (c) nontrivial size — column 3 shows their bytecode size, in KB.

Table 5.1: ERVA effectiveness. (Columns 4–6: EventRacer Android; columns 7–10: ERVA.)

App              | # Downloads (thousands) | Bytecode size (KB) | Race Reports | High Priority | Normal Priority | False Positives | True Positives | Benign Races | Harmful Races
AnyMemo          | 100–500       | 5,986 | 22    | 2      | 20     | 4      | 18     | 17     | 1
aLogCat          | 500–1,000     | 298   | 43    | 10     | 33     | 3      | 40     | 40     | 0
Aard Dictionary  | 10–50         | 3,466 | 6     | 1      | 5      | 1      | 5      | 5      | 0
AnStop Stopwatch | N/A           | 59    | 4     | 3      | 1      | 0      | 4      | 4      | 0
Cool Reader      | 10,000–50,000 | 854   | 35    | 15     | 20     | 20     | 15     | 15     | 0
DiskUsage        | 1,000–5,000   | 299   | 8     | 2      | 6      | 0      | 8      | 8      | 0
GhostCommander   | 1,000–5,000   | 1,699 | 30    | 8      | 22     | 11     | 19     | 18     | 1
GnuCash          | 50–100        | 5,511 | 26    | 2      | 24     | 8      | 18     | 18     | 0
Markers          | 500–1,000     | 89    | 8     | 2      | 6      | 0      | 8      | 8      | 0
Mirakel          | 10–50         | 6,191 | 20    | 1      | 19     | 8      | 12     | 12     | 0
Nori             | 1–5           | 1,045 | 5     | 2      | 3      | 2      | 3      | 3      | 0
NPR News         | 1,000–5,000   | 1,224 | 10    | 3      | 7      | 4      | 6      | 5      | 1
OI File Manager  | 5,000–10,000  | 675   | 8     | 3      | 5      | 2      | 6      | 5      | 1
OS Monitor       | 1,000–5,000   | 2,576 | 24    | 15     | 9      | 10     | 14     | 12     | 2
TextWarrior      | 50–100        | 237   | 5     | 3      | 2      | 0      | 5      | 4      | 1
Tomdroid         | 10–50         | 594   | 6     | 2      | 4      | 1      | 5      | 4      | 1
Total            |               |       | 260   | 74     | 186    | 74     | 186    | 178    | 8
(%)              |               |       | (100) | (28.5) | (71.5) | (28.5) | (71.5) | (68.5) | (3)

5.3.1 Effectiveness

An effective race verification and reproduction tool should support developers in triaging reported races and allowing them to focus on true races, in particular on harmful races — this helps bug finding and fixing. We quantified ERVA’s effectiveness on the 16 aforementioned apps.

We present the results of both EventRacer and ERVA in Table 5.1. The first set of grouped columns (columns 4–6) summarizes EventRacer’s output: the number of race reports and its breakdown into high and normal priority.4 For example, for the Cool Reader app, EventRacer reports 35 potential races; of these, 15 were high priority and 20 were normal priority. Presumably, the developer would proceed by trying to confirm the 15 high-priority races and then move on to the remaining 20 normal-priority races – this is likely to be quite time-consuming. The last two rows show totals and percentages across all apps: out of 260 race reports, 74 (28.5%) are high priority while 186 (71.5%) are normal priority.

4EventRacer classifies a report as high priority if the race is in app code and as medium priority if it is in the AF but invoked from the app.

The remaining columns (7–10) summarize ERVA’s output: the number of false positives, true positives, and, among the latter, how many of the races were benign or harmful. For example, in the Cool Reader app, ERVA found that of the 35 reports produced by EventRacer, 20 were false positives and 15 were true positives; however, none of the true positives were harmful races.

The last two rows, which show totals and percentages across all apps, reveal that out of 260 race reports, 74 (28.5%) were false positives, 186 (71.5%) were true positives, and the 71.5% true positives were split as 68.5% benign, 3% harmful. Note that harmful races make up only 3% of the total number of race reports, which underscores the importance of race verification. We now discuss harmful races in detail.

Harmful races. Since ERVA offers deterministic replay, the schedules that expose the harmful races can be replayed, which helps developers find and fix the root cause of the race. We used this facility to manually confirm that the 8 races reported by ERVA as harmful were indeed harmful. Harmful races manifest in various ways. For example, some harmful races crash the app. In the TomDroid example discussed in Section 5.2.3, if the SyncThread and BACK key events are flipped, the app will crash due to a null pointer exception.

Even if the app does not crash, the consequences can still be deleterious. For example, AnyMemo has a harmful race that leads to an exception and a different GUI state, and is caught by ERVA’s state differencing. An excerpt5 of the relevant code is shown next.

1  try {
2    // get data from database
3    adapter.insert(db.getData());
4  } catch (Exception e) {
5    Log.e("Exception Maybe caused by race condition. Ignored.");
6  }

If the event race occurs, the adapter object may be initialized improperly and its dereference will cause a NullPointerException. Interestingly, the developers are aware of the race, but they simply use a try ... catch to handle the exception, thereby masking the effect of the bug. ERVA detects this bug via state differencing and reports that the race will cause a difference in the state of the View object.

Hence ERVA is effective at helping developers verify their apps, as well as find and

fix races.

5.3.2 Efficiency

Since ERVA consists of detection and verification phases, we measured the time for

each phase. We present the results in Table 5.2, individually for each app and the average

across all apps in the last row. Recall that in the detection phase we run each app on

an instrumented platform – we call this “online time” (column 2) and it takes on average

34 seconds per app. Following the online stage, EventRacer performs an offline analysis

5https://code.google.com/p/anymemo/source/browse/src/org/liberty/android/fantastischmemo/ui/RecentListFragment.java

(column 3), which takes on average 54 seconds per app.

Table 5.2: ERVA efficiency: running time, in seconds.

App              | Detection: Online | Detection: Offline | Verification | Total
AnyMemo          | 26.32 | 53.18  | 1264.56 | 1344.06
aLogCat          | 18.80 | 26.50  | 1671.84 | 1717.14
Aard Dictionary  | 30.26 | 48.91  | 391.08  | 470.25
AnStop Stopwatch | 21.57 | 32.62  | 190.88  | 245.07
Cool Reader      | 43.88 | 102.47 | 3150.70 | 3297.05
DiskUsage        | 35.10 | 69.29  | 589.12  | 693.51
GhostCommander   | 27.53 | 44.70  | 1792.8  | 1865.03
GnuCash          | 38.81 | 57.29  | 2093.00 | 2189.10
Markers          | 25.08 | 36.74  | 435.84  | 497.66
Mirakel          | 39.78 | 65.12  | 1675.20 | 1780.1
Nori             | 48.15 | 57.32  | 507.80  | 613.27
NPR News         | 63.18 | 75.74  | 1305.60 | 1444.52
OI File Manager  | 37.56 | 41.03  | 632.32  | 710.91
OS Monitor       | 28.24 | 52.13  | 1414.56 | 1494.93
TextWarrior      | 30.11 | 47.50  | 320.10  | 397.71
Tomdroid         | 26.55 | 52.80  | 340.92  | 420.27
Average          | 33.80 | 53.95  | 1111.02 | 1198.78

The time for the verification phase is presented in column 4: on average 1,111 seconds. This is due to ERVA having to perform multiple executions to flip events and compare state; we believe this time can be reduced substantially by using checkpointing to only replay program regions rather than whole executions [140], an idea we leave to future work. Finally, in the last column we present the total time (the sum of the detection and verification phases) for each app: it ranges from 245 to 3,297 seconds (1,198 seconds on average), which we believe is acceptable.

5.4 Summary

In this chapter we have presented ERVA, an approach and tool for automatically verifying and reproducing event-based races in Android apps. ERVA addresses the imprecisions in current race detectors for Android by precisely modeling events and their dependencies, which allows it to categorize race reports and only point out those reports that are definite, harmful races. Experiments on 16 Android apps show that most races reported by race detectors are false positives or benign races, and that ERVA is an effective and efficient approach for automatically triaging and reproducing races.

Chapter 6

Related Work

This chapter surveys prior approaches in areas closely related to this dissertation: race detection, record-and-replay, and race classification. We compare our approaches with prior systems, focusing on the differences in functionality and application scenarios.

6.1 Race Detection

6.1.1 Race Detection for Thread-based Programs

Race detection has been widely studied in the past 30 years; proposed approaches were either static [137, 49, 105] or dynamic. Dynamic race detectors can be further classified as lockset-based [35, 126], causality-model based [51, 31], or hybrid [146, 116].

Chord [105] is a well-known static race detector for multi-threaded Java programs. It uses k-object-sensitivity to model abstract memory accesses and devises four static algorithms, i.e., may-alias analysis, thread-escape analysis, may-happen-in-parallel, and conditional must-not-alias analysis, to infer whether there exist two memory accesses (at least one a write) from different threads that are not protected by a shared lock. While impressive, Chord’s lockset-based analysis cannot be applied to event-driven programs, because event-driven races occur on the same thread, i.e., the looper thread, and the events from the same queue are processed non-preemptively. SIERRA borrows the idea of context-sensitive object abstraction from Chord, and further develops a hybrid context (action-sensitivity) that works particularly well for event-driven programs.

The dynamic race detectors based on the causality model, or happens-before model [93], define a partial order over the operations in a dynamic execution trace. They report a data race for a pair of memory accesses to the same location, with at least one of the accesses being a write, that are not ordered according to the defined happens-before model.

Happens-before data race detectors are usually considered more precise than lockset-based ones, but they cannot work on event-driven programs either. They assume that instructions executed on the same thread have program order. However, this is not true for event-driven programs due to their asynchronous programming model and Looper events arriving in non-deterministic order. SIERRA reifies the events and their processors as context-sensitive actions, and statically models the dependencies between actions as happens-before relations.

6.1.2 Race Detection for Event-Driven Programs

Several recent approaches have been proposed for detecting event-driven races.

EventRacer [23, 120] detects event-driven races in web applications, while EventRacer Android [25], CAFA [72] and DroidRacer [99] focus on Android apps. These approaches are all dynamic, hence prone to false negatives and dependent on high-quality inputs; these drawbacks are the main impetus for our work.

Static analysis for Android. Many static analysis approaches for Android have been proposed, with specific purposes such as constructing GUI models [123, 29] or information flow [19]. Hopper [28] also uses backward symbolic execution but with a different goal, finding null pointer dereferences. We employ an array of techniques that, while geared at finding races, we believe can also be used as a general, precise static analysis framework for Android apps.

6.2 Race Classification

Race detectors that support race classification and prioritization are more likely to be adopted by developers, because developers can decide how to prioritize investigating race reports. Narayanasamy et al. [106] use instruction-level record-and-replay to replay alternate schedules, then compare the register/memory state to classify the races. Kasikci et al. [22] apply symbolic execution to classify the consequences of races by comparing their symbolic output result. However, both approaches focus on multi-threaded desktop/server apps, hence they are not suitable for event-driven mobile applications. In contrast, ERVA captures and flips events according to the event dependency graph, rather than altering thread scheduling.

Model Checking. Model checking can help systematically explore all the nondeterministic schedules to find concurrency bugs. R4 [84] aims to find event-driven races in web applications; it uses dynamic partial order reduction (DPOR) [52] and bounded conflict-reversal to limit the total number of schedules to explore. Similarly, AsyncDroid [111] uses delay-bounded prioritized systematic exploration of the recorded schedule to find concurrency errors in Android apps. Unlike these model checking techniques, which target finding new buggy schedules, ERVA checks only the potential racing events reported by the race detector and aims to verify whether they are false positives or harmless races. Thus, ERVA cannot detect bugs in unexplored schedules. R4 can check harmless races due to ad-hoc synchronization, but directly applying it to Android seems problematic: Android provides a number of system callbacks that have implicit happens-before relations, and ignoring these callbacks could cause false positives, as the example of false positive type-1 in Section 5.1 shows. ERVA can check this type of false positive by flipping the events to see whether the system enters a deadlock condition. ERVA and model checkers could be combined. For example, the EDG from ERVA can be used as an auxiliary model for R4 and AsyncDroid in their exploration algorithms to check whether new schedules are feasible or not. Furthermore, the EVS can be used to check the harmfulness of newly explored schedules.

6.3 Record and Replay

Record-and-replay has been widely studied and implemented on various platforms.

On the smartphone platform, the most powerful, and most directly related effort is our group’s prior system RERAN [56], which has been used to record and replay GUI gestures in 86 out of the Top-100 most popular Android apps on Google Play. RERAN does not require app instrumentation (hence it can handle gesture nondeterminism in apps that perform GUI rendering in native code, such as Angry Birds) or AF changes. Mosaic [68]

extends RERAN with support for device-independent replay of GUI events (note that our

approach is device-independent as well). Mosaic has low overhead, typically less than 0.2%,

and has replayed GUI events in 45 popular apps from Google Play. However, RERAN and

Mosaic have several limitations: they do not support critical functionality (network, camera, microphone, or GPS), required by many apps; they do not permit record-and-replay of API calls or event schedules; their record-and-replay infrastructure is manual, which makes it hard to modify or extend to other sensors.

Android test automation tools such as Android Guitar [20], Robotium [62], or

Troyd [86] offer some support for automating GUI interaction, but require developers to extract a GUI model from the app and manually write test scripts to emulate user gestures.

In addition to the manual effort required to write scripts, these tools do not support replay for sensors or schedules. VALERA, on the other hand, records the GUI events on the fly while the app is being used, thus saving the effort of writing test scripts and allowing replay of complex gestures such as zoom, pinch, and scroll.

On non-smartphone platforms, record-and-replay tools have a wide range of applications: intrusion analysis [48], bug reproduction [107], debugging [131], etc. Hardware-based [142, 107] and virtual machine-based [48, 91] replay tools are often regarded as whole-system replay. Recording at this low level, e.g., memory access order and thread scheduling, allows them to eliminate all non-determinism. However, these approaches require special hardware support or virtual machine instrumentation, which might be prohibitive on current commodity smartphones.

Library-based approaches [55, 143, 122, 132] record the non-deterministic interaction between the program, its libraries, and the underlying operating system at a fixed interface. R2 [66] extends them by allowing developers to choose which interfaces they want to replay via a simple annotation specification language. VALERA borrows this idea from R2 (which targets the Windows kernel API) but applies it to sensor-rich, event-based Android.

Chapter 7

Conclusions

In this dissertation we have presented a suite of approaches that address various aspects of event-driven races – common types of concurrency errors in mobile systems. We have discussed issues with prior, dynamic, event-driven race detectors, e.g., false negatives due to the dynamic nature of these tools, false positives caused by the complex Android model and ad-hoc synchronization, and the inability to reproduce races and distinguish benign from harmful races. To address these problems, this dissertation proposes three approaches, ranging from static race detection to race reproduction to race classification, to handle event-driven races. Our thesis is that these approaches are effective and practical. To support our claims, Chapter 3 presented a novel static event-driven race detection algorithm focused on event abstraction, causality between abstracted events (happens-before relations), and path-sensitive refinement. The algorithm is implemented in a tool named SIERRA, the first static event-driven race detector for Android. The results show that SIERRA outperforms dynamic detectors in finding races, and that its precision and performance are acceptable.

Chapter 4 has presented a novel stream- and event-oriented record-and-replay system called VALERA that is lightweight and able to reproduce event races via deterministic schedule replay, helping developers fix bugs.

Lastly, Chapter 5 has presented an event-flipping technique that alternates racing event schedules to check whether a reported race is true. For true races, it further classifies the race as benign or harmful by comparing state differences. The technique is implemented in a tool named ERVA. The results showed that ERVA is effective in classifying races, and that only 3% of the races in our data set are truly harmful, which greatly reduces the burden on developers in fixing these races.

7.1 Future Work

Our presented approaches can be extended along several lines:

• Race detection for other event-driven systems. Many other systems employ

the event-driven model, e.g., web applications and SDN (software-defined networking)

elements. The basic idea of SIERRA can be applied to these systems, too. For example,

the action-sensitive abstraction for events and symbolic execution based refinements

can be applied. The happens-before relations are usually platform-dependent, thus

we would need to add a causality model for those platforms.

• Detecting other types of bugs caused by asynchronous programming. There are other types of errors in event-driven systems. Memory leak detection is one promising direction. If there exist unnecessary references to large objects, they can prevent the garbage collector from reclaiming memory. For example, consider an inner class that has an implicit reference to an Activity holding a large chunk of memory (see the sketch after this list). If that inner class is held by a thread and the activity is switched into the background, the garbage collector cannot release the memory for that Activity, because a strong reference still points to it, thus resulting in a memory leak. Some of the ideas introduced by SIERRA can help. Since SIERRA performs whole-program pointer analysis, we conservatively know how many references point to a specific object. The abstract actions in SIERRA can tell us which actions hold these references, so that we can report potential memory leak locations.

• Automatic event input generation. Automatic event input generation is another

promising research line. Currently, for a given report, we need to manually look into

the source code, and work out a sequence of events to trigger the racy schedule. It

would be very helpful to have a tool that can automatically generate the sequence.

• Commutativity analysis. Our benign/harmful race classification algorithm could be refined. Currently, ERVA reports a benign race if there is no externally visible state (EVS) difference after event flipping. The EVS approach relies on user-defined state (e.g., GUI state, file state, etc.), which is not sound. Commutativity analysis is a promising technique for race classification. Two operations are said to be commutative if executing them in a different order leads to the same program state. Dimitrov et al. [46] have introduced a set of high-level commutativity specifications for library functions. We believe their commutativity specification could be applied to our race classification context.
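As a concrete illustration of the leak pattern mentioned in the second item above, the following hedged sketch (class names are ours) shows a non-static inner class pinning its enclosing Activity:

import android.app.Activity;
import android.os.Bundle;

// Illustrative leak pattern: the anonymous Runnable is a non-static inner
// class, so it holds an implicit reference to the enclosing Activity. If the
// worker thread outlives the Activity (e.g., the user navigates away), the
// Activity and everything it references cannot be garbage-collected.
public class LeakyActivity extends Activity {
    private byte[] largeBuffer = new byte[8 * 1024 * 1024]; // memory retained by the leak

    @Override protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        new Thread(new Runnable() {
            @Override public void run() {
                try { Thread.sleep(10 * 60 * 1000); } catch (InterruptedException ignored) {}
                // 'LeakyActivity.this' is still reachable here, pinning largeBuffer.
            }
        }).start();
    }
}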

Bibliography

[1] EventRacer Android. http://eventracer.org/android/.

[2] FFmpeg. http://www.ffmpeg.org.

[3] ImageMagick. http://www.imagemagick.org/script/index.php.

[4] SoX - Sound eXchange. http://sox.sourceforge.net/Main/HomePage.

[5] YUV Format. http://en.wikipedia.org/wiki/YUV.

[6] RERAN: Record and Replay for Android Video Demo, August 2012. http://www. .com/user/RERAN2012.

[7] Mobile/Tablet Operating System Market Share, Oct 2015. https://www.netmarketshare.com/operating-system-market-share.aspx?qprid=8&qpcustomd=1.

[8] Digital in 2017: Global Overview, January 2017. https://wearesocial.com/special-reports/digital-in-2017-global-overview.

[9] F-Droid, 2017. https://f-droid.org/.

[10] U.S. consumers now spend 5 hours per day on mobile devices, March 2017. https://techcrunch.com/2017/03/03/u-s-consumers-now-spend-5-hours-per-day-on-mobile-devices/.

[11] A. Paiva, J. Faria, N. Tillmann, and R. Vidal. A Model-to-implementation Mapping Tool for Automated Model-based GUI Testing. In ICFEM ’05.

[12] Saswat Anand, Mayur Naik, Mary Jean Harrold, and Hongseok Yang. Automated concolic testing of smartphone apps. In FSE ’12, pages 1–11.

[13] Android Developers. MonkeyRunner, August 2012. http://developer.android. com/guide/developing/tools/monkeyrunner_concepts.html.

[14] Android Developers. MotionEvent, August 2012. http://developer.android.com/ reference/android/view/MotionEvent.html.

[15] Android Developers. UI/Application Exerciser Monkey, August 2012. http://developer.android.com/tools/help/monkey.html.

[16] Android Developers. App Components, 2017. https://developer.android.com/ guide/components/index.html.

[17] Android Reran. Android Reran Website, April 2014. http://www.androidreran. com.

[18] Androlib. Accumulated number of Application and Games in the Android Market, March 2013. http://www.androlib.com/appstats.aspx.

[19] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for an- droid apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 259–269, New York, NY, USA, 2014. ACM.

[20] Atif Memon. GUITAR, August 2012. guitar.sourceforge.net/.

[21] Tanzirul Azim and Iulian Neamtiu. Targeted and depth-first exploration for system- atic testing of android apps. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 641–660, New York, NY, USA, 2013. ACM.

[22] C. Zamfir B. Kasikci and G. Candea. Data races vs. data race bugs: telling the difference with portend. In ASPLOS’12, pages 185–198.

[23] M. Sridharan J. Dolby B. Petrov, M. Vechev. Race detection for web applications. PLDI’12, 2012.

[24] Pamela Bhattacharya, Liudmila Ulanova, Iulian Neamtiu, and Sai Charan Koduru. An empirical analysis of the bug-fixing process in open source android apps. In CSMR’13.

[25] Pavol Bielik, Veselin Raychev, and Martin Vechev. Scalable race detection for android applications. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 332–348, New York, NY, USA, 2015. ACM.

[26] Bitbar. Automated Testing Tool for Android - Testdroid., January 2013. http: //testdroid.com/.

[27] Sam Blackshear, Bor-Yuh Evan Chang, and Manu Sridharan. Thresher: Precise refutations for heap reachability. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 275–286, New York, NY, USA, 2013. ACM.

[28] Sam Blackshear, Bor-Yuh Evan Chang, and Manu Sridharan. Selective control-flow abstraction via jumping. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 163–182, New York, NY, USA, 2015. ACM.

[29] Sam Blackshear, Alexandra Gendreau, and Bor-Yuh Evan Chang. Droidel: A general approach to android framework modeling. In Proceedings of the 4th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis, SOAP 2015, pages 19–25, New York, NY, USA, 2015. ACM.

[30] Matthias Böhmer, Brent Hecht, Johannes Schöning, Antonio Krüger, and Gernot Bauer. Falling asleep with angry birds, facebook and kindle: a large scale study on mobile application usage. In MobileHCI ’11, pages 47–56.

[31] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. Pacer: Proportional detection of data races. SIGPLAN Not., 45(6), June 2010.

[32] Michael D. Bond, Milind Kulkarni, Man Cao, Minjia Zhang, Meisam Fathi Salmi, Swarnendu Biswas, Aritra Sengupta, and Jipeng Huang. Octet: Capturing and controlling cross-thread dependences efficiently. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 693–712, 2013.

[33] P. Brooks and A. M. Memon. Automated gui testing guided by usage profiles. In ASE ’07.

[34] Yinzhi Cao, Yanick Fratantonio, Antonio Bianchi, Manuel Egele, Christopher Kruegel, Giovanni Vigna, and Yan Chen. EdgeMiner: Automatically Detecting Implicit Control Flow Transitions through the Android Framework. In Proceedings of the ISOC Network and Distributed System Security Symposium (NDSS), 2015.

[35] Jong-Deok Choi, Keunwoo Lee, Alexey Loginov, Robert O’Callahan, Vivek Sarkar, and Manu Sridharan. Efficient and precise datarace detection for multithreaded object-oriented programs. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI ’02, pages 258–269, New York, NY, USA, 2002. ACM.

[36] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. Automated test input generation for android: Are we there yet? (e). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, pages 429–440, Washington, DC, USA, 2015. IEEE Computer Society.

[37] CNET. Android reclaims 61 percent of all U.S. smartphone sales, May 2012. http://news.cnet.com/8301-1023_3-57429192-93/android-reclaims-61-percent-of-all-u.s-smartphone-sales/.

[38] CNET. Android dominates 81 percent of world smartphone market, November 2013. http://www.cnet.com/news/android-dominates-81-percent-of-world-smartphone-market/.

[39] CNET. Android beat Apple in tablet sales last year – Gartner, March 2014. http://www.cnet.com/news/android-beat-apple-in-tablet-sales-last-year-gartner/.

[40] B. Cornelissen, A. Zaidman, A. van Deursen, L. Moonen, and R. Koschke. A system- atic survey of program comprehension through dynamic analysis. Software Engineer- ing, IEEE Transactions on, pages 684–702, 2009.

[41] H. Cui, J. Wu, C. Tsai, and J. Yang. Stable deterministic multithreading through schedule memoization. OSDI’10, 2010.

[42] D. Amalfitano, A. R. Fasolino, and P. Tramontana. A GUI Crawling-based technique for Android Mobile Application Testing. In ICSTW ’11.

[43] Dakshinamurthy Karra. Marathon, August 2012. http://www.marathontesting. com/.

[44] Daniel Herding and Christian Spannagel. Jacareto, August 2012. http:// sourceforge.net/apps/mediawiki/jacareto/.

[45] Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In Tools and Algorithms for the Construction and Analysis of Systems, 14th International Confer- ence, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Pro- ceedings, volume 4963 of Lecture Notes in Computer Science, pages 337–340. Springer, 2008.

[46] Dimitar Dimitrov, Veselin Raychev, Martin Vechev, and Eric Koskinen. Commu- tativity race detection. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 305–315, New York, NY, USA, 2014. ACM.

[47] Mian Dong and Lin Zhong. Self-constructive high-rate system energy modeling for battery-powered mobile systems. MobiSys ’11, pages 335–348, 2011.

[48] George W. Dunlap, Samuel T. King, Sukru Cinar, Murtaza A. Basrai, and Peter M. Chen. Revirt: enabling intrusion analysis through virtual-machine logging and replay. In OSDI’02.

[49] Dawson Engler and Ken Ashcraft. Racerx: Effective, static detection of race condi- tions and deadlocks. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 237–252, New York, NY, USA, 2003. ACM.

[50] Michael D. Ernst. Static and dynamic analysis: Synergy and duality. In WODA 2003: Workshop on Dynamic Analysis, pages 24–27, Portland, Oregon, May 9, 2003.

[51] Cormac Flanagan and Stephen N. Freund. Fasttrack: Efficient and precise dynamic race detection. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, 2009.

[52] Cormac Flanagan and Patrice Godefroid. Dynamic partial-order reduction for model checking software. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’05, pages 110–121, New York, NY, USA, 2005. ACM.

[53] Gartner, Inc. Gartner Highlights Key Predictions for IT Organizations and Users in 2010 and Beyond, January 2010. http://www.gartner.com/it/page.jsp?id= 1278413.

[54] Gartner, Inc. Gartner Says Worldwide PC Shipment Growth Was Flat in Second Quarter of 2012, July 2012. http://www.gartner.com/it/page.jsp?id=2079015.

[55] Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica. Replay debugging for distributed applications. In Proceedings of the Annual Conference on USENIX ’06 Annual Technical Conference, ATEC ’06, pages 27–27, Berkeley, CA, USA, 2006. USENIX Association.

[56] Lorenzo Gomez, Iulian Neamtiu, Tanzirul Azim, and Todd Millstein. Reran: Timing- and touch-sensitive record and replay for android. In Proceedings of the 2013 Interna- tional Conference on Software Engineering, ICSE ’13, pages 72–81, Piscataway, NJ, USA, 2013. IEEE Press.

[57] Google. Google Earth. http://earth.google.com.

[58] Google. SharedPreferences. https://developer.android.com/reference/ android/content/SharedPreferences.html.

[59] Google. The Activity Lifecycle. https://developer.android.com/guide/ components/activities/activity-lifecycle.html.

[60] Google. Volley, 2017. https://github.com/google/volley.

[61] Google Blog. 10 Billion Android Market downloads and counting, December 2011. http://googleblog.blogspot.com/2011/12/10-billion-android-market-downloads-and.html.

[62] Google Code. Robotium, August 2012. http://code.google.com/p/robotium/.

[63] Google Code. Androguard, January 2013. http://code.google.com/p/androguard/.

[64] Google Code. Android Assault, January 2013. http://code.google.com/p/android-assault/.

[65] Google Code. Droidbox, January 2013. http://code.google.com/p/droidbox/.

[66] Zhenyu Guo, Xi Wang, Jian Tang, Xuezheng Liu, Zhilei Xu, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. R2: An application-level kernel for record and replay. In OSDI’08.

[67] Jessica Guyn. Facebook users give iPhone app thumbs down. Los Angeles Times, Jul 21 2011. http://latimesblogs.latimes.com/technology/2011/07/facebook-users-give-iphone-app-thumbs-down.html.

[68] Matthew Halpern, Yuhao Zhu, and Vijay Janapa Reddi. Mosaic: Cross-platform user-interaction record and replay for the fragmented Android ecosystem. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2015, 2015.

[69] S. Hao, Ding Li, W.G.J. Halfond, and R. Govindan. Estimating Android applications’ CPU energy usage via bytecode profiling. In Green and Sustainable Software (GREENS), 2012 First International Workshop on, pages 1–7, 2012.

[70] Shin Hong and Moonzoo Kim. A survey of race bug detection techniques for multithreaded programs. Softw. Test. Verif. Reliab., 25(3):191–217, May 2015.

[71] Peter Hornyack, Seungyeop Han, Jaeyeon Jung, Stuart Schechter, and David Wetherall. These aren’t the droids you’re looking for: Retrofitting Android to protect data from imperious applications. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS ’11, pages 639–652, New York, NY, USA, 2011. ACM.

[72] Chun-Hung Hsiao, Jie Yu, Satish Narayanasamy, Ziyun Kong, Cristiano L. Pereira, Gilles A. Pokam, Peter M. Chen, and Jason Flinn. Race detection for event-driven mobile applications. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 326–336, New York, NY, USA, 2014. ACM.

[73] Cuixiong Hu and Iulian Neamtiu. Automating GUI testing for Android applications. In AST ’11, pages 77–83, 2011.

[74] Yongjian Hu, Tanzirul Azim, and Iulian Neamtiu. Versatile yet lightweight record-and-replay for Android. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 349–366, New York, NY, USA, 2015. ACM.

[75] Yongjian Hu, Tanzirul Azim, and Iulian Neamtiu. Versatile yet lightweight record-and-replay for Android. Technical report, University of California, Riverside, May 2015. http://www.cs.ucr.edu/~neamtiu/pubs/valera-tr.pdf.

[76] Yongjian Hu, Iulian Neamtiu, and Arash Alavi. Automatically verifying and reproducing event-based races in Android apps. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, pages 377–388, New York, NY, USA, 2016. ACM.

[77] Yongjian Hu, Iulian Neamtiu, and Arash Alavi. Automatically verifying and reproducing event-based races in Android apps. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, pages 377–388, New York, NY, USA, 2016. ACM.

[78] IBM. Rational Robot, August 2012. www.ibm.com/software/awdtools/tester/robot/.

[79] IBM T.J. Watson. WALA, 2017. http://wala.sourceforge.net/wiki/index.php/Main_Page.

[80] IDC. Android and iOS Surge to New Smartphone OS Record in Second Quarter, According to IDC, August 2012. http://www.idc.com/getdoc.jsp?containerId=prUS23638712.

[81] J. Devietti, B. Lucia, L. Ceze, and M. Oskin. DMP: Deterministic shared memory multiprocessing. In ASPLOS’09, 2009.

[82] J. Flinn and Z. M. Mao. Can Deterministic Replay be an Enabling Tool for Mobile Computing? In HotMobile ’11.

[83] J. Mickens, J. Elson, and J. Howell. Mugshot: Deterministic Capture and Replay for JavaScript Applications. In NSDI ’10.

[84] Casper S. Jensen, Anders Møller, Veselin Raychev, Dimitar Dimitrov, and Martin Vechev. Stateless model checking of event-driven applications. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 57–73, New York, NY, USA, 2015. ACM.

[85] Casper S. Jensen, Mukul R. Prasad, and Anders Møller. Automated testing with targeted event sequence generation. In Proceedings of the 2013 International Symposium on Software Testing and Analysis, pages 67–77, 2013.

[86] Jinseong Jeon and Jeffrey S. Foster. Troyd, January 2013. https://github.com/plum-umd/troyd.

[87] Jinseong Jeon, Kristopher Micinski, and Jeffrey S. Foster. Redexer. http://www.cs.umd.edu/projects/PL/redexer/index.html.

[88] Josh Constine, TechCrunch. Facebook Speeds Up Android App By Ditching HTML5 And Rebuilding It Natively Just Like The iOS Version, December 2012. http://techcrunch.com/2012/12/13/facebook-android-faster/.

[89] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In FSE’05, pages 263–272.

[90] George Kastrinis and Yannis Smaragdakis. Hybrid context-sensitivity for points-to analysis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 423–434, New York, NY, USA, 2013. ACM.

[91] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, pages 1–1, Berkeley, CA, USA, 2005. USENIX Association.

[92] L. B. Kara and T. F. Stahovich. An image-based, trainable symbol recognizer for hand-drawn sketches. Computers & Graphics, 29(4):501–517, 2005.

[93] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.

[94] M. Grechanik, Q. Xie, and C. Fu. Maintaining and Evolving GUI-directed Test Scripts. In ICSE ’09.

[95] M. Jovic, A. Adamoli, D. Zaparanuks, and M. Hauswirth. Automating Performance Testing of Interactive Java Applications. In AST ’10.

[96] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and reproducing Heisenbugs in concurrent programs. OSDI’08, 2008.

[97] M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient deterministic multithreading in software. ASPLOS’09, 2009.

[98] Aravind Machiry, Rohan Tahiliani, and Mayur Naik. Dynodroid: An input generation system for Android apps. In FSE ’13, 2013.

[99] Pallavi Maiya, Aditya Kanade, and Rupak Majumdar. Race detection for Android applications. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 316–325, New York, NY, USA, 2014. ACM.

[100] R. Majumdar and K. Sen. Hybrid concolic testing. In ICSE’07, pages 416–426.

[101] Matthew Pekar. Pounder - Java GUI Testing Utility, August 2012. http://pounder.sourceforge.net/.

[102] Atif M. Memon. An event-flow model of GUI-based applications for testing. Software Testing, Verification and Reliability, pages 137–157, 2007.

[103] Microsoft Surface. Recording and Playing Interactions, August 2012. http://msdn.microsoft.com/en-us/library/ee804844%28v=surface.10%29.aspx.

[104] Ana Milanova, Atanas Rountev, and Barbara G. Ryder. Parameterized object sensitivity for points-to analysis for Java. ACM Trans. Softw. Eng. Methodol., 14(1):1–41, January 2005.

[105] Mayur Naik, Alex Aiken, and John Whaley. Effective static race detection for Java. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’06, pages 308–319, New York, NY, USA, 2006. ACM.

[106] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In PLDI’07, pages 22–31.

[107] Satish Narayanasamy, Gilles Pokam, and Brad Calder. BugNet: Continuously recording program execution for deterministic replay debugging. In ISCA ’05.

[108] I. Neamtiu and T. Dumitras. Cloud software upgrades: Challenges and opportunities. In Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), 2011 International Workshop on the, pages 1–10, September 2011.

[109] Nielsen. Who’s Winning the U.S. Smartphone Market? http://www.nielsen.com/us/en/newswire/2013/whos-winning-the-u-s-smartphone-market-.html.

[110] Nodobo. Nodobo Capture: Mobile Data Recording for Analysing User Interactions in Context. 2011.

[111] Burcu Kulahcioglu Ozkan, Michael Emmi, and Serdar Tasiran. Computer Aided Verification: 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I, chapter Systematic Asynchrony Bug Exploration for Android Apps, pages 455–461. Springer International Publishing, Cham, 2015.

[112] P. Zhang and S. Elbaum. Amplifying tests to validate exception handling code. ICSE’12, 2012.

[113] Harish Patil, Cristiano Pereira, Mack Stallcup, Gregory Lueck, and James Cownie. PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’10, pages 2–11, New York, NY, USA, 2010. ACM.

[114] Pew Research Center. Report: Mobile Tablet Ownership 2013. http://pewinternet.org/Reports/2013/Tablet-Ownership-2013.aspx.

[115] Pew Research Center. Report: Smartphone Ownership 2013. http://pewinternet.org/Reports/2013/Smartphone-Ownership-2013.aspx.

[116] Eli Pozniansky and Assaf Schuster. Efficient on-the-fly data race detection in multithreaded C++ programs. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’03, pages 179–190, New York, NY, USA, 2003. ACM.

[117] PRNewswire. Strategy Analytics: Android Captures Record 80 Percent Share of Global Smartphone Shipments in Q2 2013. http://www.prnewswire.com/news-releases/strategy-analytics-android-captures-record-80-percent-share-of-global-smartphone-shipments-in-q2-2013-217941551.html.

[118] Ranorex. Android Test Automation - Automate your App Testing, January 2013. http://www.ranorex.com/mobile-automation-testing/android-test-automation.html.

[119] Vaibhav Rastogi, Yan Chen, and William Enck. AppsPlayground: Automatic security analysis of smartphone applications. In CODASPY, pages 209–220, 2013.

[120] Veselin Raychev, Martin Vechev, and Manu Sridharan. Effective race detection for event-driven programs. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 151–166, New York, NY, USA, 2013. ACM.

[121] Robolectric. Test-Drive Your Android Code, January 2013. http://pivotal.github.com/robolectric/.

[122] M. Ronsse and K. D. Bosschere. RecPlay: A fully integrated practical record/replay system. TOCS, pages 133–152, 1999.

[123] Atanas Rountev and Dacong Yan. Static reference analysis for GUI objects in Android software. In International Symposium on Code Generation and Optimization, pages 143–153, 2014.

[124] S. Khandkar and F. Maurer. A Domain Specific Language to Define Gestures for Multi-Touch Applications. In DSM ’10.

[125] S. Khandkar, S. M. Sohan, J. Sillito, and F. Maurer. Tool Support for Testing Complex Multi-Touch Gestures. In IST ’10.

[126] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst., 15(4):391–411, November 1997.

[127] Micha Sharir and Amir Pnueli. Two approaches to interprocedural data flow analysis. In Steven S. Muchnick and Neil D. Jones (eds.), Program Flow Analysis: Theory and Applications, Prentice-Hall, Englewood Cliffs, New Jersey, pages 189–234, 1981.

[128] Yannis Smaragdakis, Martin Bravenboer, and Ondřej Lhoták. Pick your contexts well: Understanding object-sensitivity. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’11, pages 17–30, New York, NY, USA, 2011. ACM.

[129] SONY. APK Analyzer, January 2013. http://developer.sonymobile.com/knowledge-base/tool-guides/analyse-your-apks-with-apkanalyser/.

[130] SourceForge. Android GUITAR, August 2012. http://sourceforge.net/apps/mediawiki/guitar/index.php?title=Android_GUITAR.

[131] Sudarshan M. Srinivasan, Srikanth Kandula, Christopher R. Andrews, and Yuanyuan Zhou. Flashback: A lightweight extension for rollback and deterministic replay for software debugging. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’04, pages 3–3, Berkeley, CA, USA, 2004. USENIX Association.

[132] John Steven, Pravir Chandra, Bob Fleck, and Andy Podgurski. jRapture: A capture/replay tool for observation-based testing. In Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’00, pages 158–167, New York, NY, USA, 2000. ACM.

[133] T. Takala, M. Katara, and J. Harty. Experiences of system-level model-based GUI testing of an Android application. In ICST ’11, pages 377–386.

[134] TestPlant. eggPlant for mobile testing, January 2013. http://www.testplant.com/products/eggplant/mobile/.

[135] Timothy Wall. Abbot framework for automated testing of Java GUI components and programs, August 2012. http://abbot.sourceforge.net.

[136] Various. SCanDroid, January 2013. https://github.com/scandroid/scandroid.

[137] Jan Wen Voung, Ranjit Jhala, and Sorin Lerner. RELAY: Static race detection on millions of lines of code. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, pages 205–214, New York, NY, USA, 2007. ACM.

[138] W. Enck, P. Gilbert, B. G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In OSDI, pages 393–407, 2010.

[139] Wind River. Wind River Framework for Automated Software Testing, January 2013. http://www.windriver.com/announces/fast/.

[140] Yan Wang, Harish Patil, Cristiano Pereira, Gregory Lueck, Rajiv Gupta, and Iulian Neamtiu. DrDebug: Deterministic replay based cyclic debugging with dynamic slicing. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pages 98:98–98:108, New York, NY, USA, 2014. ACM.

[141] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. ProfileDroid: Multi-layer profiling of Android applications. In MobiCom’12.

[142] Min Xu, Rastislav Bodik, and Mark D. Hill. A “flight data recorder” for enabling full-system multiprocessor deterministic replay. In ISCA ’03, pages 122–135.

[143] Y. Saito. Jockey: A user-space library for record-replay debugging. In AADEBUG’05.

[144] Wei Yang, Mukul Prasad, and Tao Xie. A grey-box approach for automated GUI-model generation of mobile applications. In FASE’13, pages 250–265.

[145] Zhemin Yang, Min Yang, Lvcai Xu, Haibo Chen, and Binyu Zang. ORDER: Object centric deterministic replay for Java. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’11, 2011.

[146] Yuan Yu, Tom Rodeheffer, and Wei Chen. RaceTrack: Efficient detection of data race conditions via adaptive tracking. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 221–234, New York, NY, USA, 2005. ACM.

[147] Xun Yuan and Atif M. Memon. Using GUI run-time state as feedback to generate test cases. In ICSE ’07, pages 396–405, 2007.

[148] Xun Yuan and Atif M. Memon. Generating event sequence-based test cases using GUI run-time state feedback. IEEE Transactions on Software Engineering, pages 81–95, 2010.

[149] B. Zhou, I. Neamtiu, and R. Gupta. Experience report: How do bug characteristics differ across severity classes: A multi-platform study. In Software Reliability Engineering (ISSRE), 2015 IEEE 26th International Symposium on, pages 507–517, Nov 2015.

[150] Bo Zhou, Iulian Neamtiu, and Rajiv Gupta. A cross-platform analysis of bugs and bug-fixing in open source projects: Desktop vs. Android vs. iOS. In 19th International Conference on Evaluation and Assessment in Software Engineering, EASE 2015, page 10, April 2015.
