The Pennsylvania State University
The Graduate School
College of Engineering

APPGRADER: AN APP QUALITY GRADING SYSTEM BASED

ON CODE-LEVEL FEATURES

A Thesis in Computer Science and Engineering by Xi Li

© 2018 Xi Li

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

August 2018

The thesis of Xi Li was reviewed and approved∗ by the following:

Sencun Zhu
Associate Professor of Computer Science and Engineering
Thesis Advisor

Danfeng Zhang
Assistant Professor of Computer Science and Engineering

Chita R. Das
Distinguished Professor of Computer Science and Engineering
Department Head of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

ABSTRACT

The app ranking systems currently applied by app markets are based mainly on app ratings and downloads. However, these systems have drawbacks in handling (i) apps with abnormally high ratings and fake downloads, and (ii) newly published apps with limited user feedback. The rankings of such apps may not accord with their actual quality, which misleads users. Therefore, in an attempt to explore app ranking systems and change their understudied status quo, we propose AppGrader, a novel app quality grading system that ranks apps within the same category based on app functionality as measured by code-level features. The system is inspired by our analysis of 18 million app reviews, which suggests that when giving ratings, most users consider the user interface and other features that can be extracted directly from app code. Accordingly, our system statically analyzes app code and generates a "feature view graph" for each app, which encodes the app's code-level features. For app ranking, we apply a Graph Convolutional Network to cluster apps into different classes based on the complexity of their feature view graphs, where each class indicates one level of app quality. Evaluated from two perspectives, system accuracy and label dissimilarity, AppGrader performs well on 1440 real-world apps, with an average accuracy of around 72% and a label dissimilarity of around 1, which indicates that AppGrader can be applied to evaluate apps with fake ratings as well as newly published apps.

Keywords: Ranking, App Quality Evaluation, App View, Graph Convolutional Network, Deep Learning

TABLE OF CONTENTS

List of Figures vi

List of Tables vii

Chapter 1 Introduction 1

Chapter 2 Related Work 6

Chapter 3 Background 11

Chapter 4 Methodology 13
4.1 System Overview ...... 13
4.2 App Code-Level Features ...... 15
4.2.1 User Experience of an App ...... 15
4.2.2 App Review Analysis ...... 17
4.3 System Architecture ...... 21
4.3.1 Code De-compiler ...... 22
4.3.2 Feature View Graph Generator ...... 22
4.3.2.1 Generate View Graph ...... 22
4.3.2.2 Generate Feature Vector ...... 25
4.3.3 Graph Convolutional Network ...... 27

Chapter 5 Implementation and Evaluation 31
5.1 Data Collection and Processing ...... 31
5.1.1 Data Collection ...... 31
5.1.2 Label Assigning ...... 32
5.1.3 GCN Inputs ...... 33
5.2 Experiment Results and Analysis ...... 34
5.2.1 Efficiency ...... 34
5.2.2 Effectiveness ...... 34
5.2.2.1 System Accuracy ...... 34
5.2.2.2 Label Dissimilarity ...... 39
5.2.2.3 Synthesized Result ...... 42
5.2.2.4 Case Study ...... 44

Chapter 6 Discussion 46
6.1 Limitation ...... 46
6.2 Future Work ...... 48

Chapter 7 Conclusion 50

Bibliography 51

Appendix A App Review Analysis 54

Appendix B Detailed Feature Vector 59

Appendix C Testing Result of AppGrader 63

LIST OF FIGURES

4.1 System architecture ...... 14
4.2 Interaction between a user and an app ...... 16
4.3 Distribution of top 500 high frequency terms ...... 18
4.4 Distribution of 50 topics among 18 million app reviews ...... 20
4.5 Proportion of topics indicating app features in total 50 topics ...... 21
4.6 Graph convolutional network for node classification ...... 28
4.7 Graph convolutional network for app classification ...... 29

5.1 Comparison result in terms of system accuracy under three groups of ground truth labels ...... 37
5.2 Comparison of propagation models of GCN (Kipf and Welling 2017) ...... 38
5.3 Comparison result in terms of label dissimilarity under three groups of ground truth labels ...... 42
5.4 Comparison of test accuracy and label dissimilarity of the first test (weights: 33%, 33%, 33%) ...... 43

LIST OF TABLES

4.1 Terms indicating app features in top 500 high-frequency terms ...... 17
4.2 Topics indicating app features ...... 19
4.3 Intent construction methods ...... 23
4.4 Android APIs for activity switching ...... 23
4.5 Android methods that listen for user inputs ...... 27

5.1 Data overview ...... 32
5.2 Summary of results in terms of app classification accuracy ...... 35
5.3 Summary of results in terms of average label dissimilarity for false predictions ...... 39
5.4 Mis-estimated camera apps ...... 44

A.1 High-frequency terms counting result (top 500) ...... 54

B.1 Detailed feature vector ...... 59

C.1 Testing result of AppGrader on 123 apps in the first test (weights: 33%, 33%, 33%) ...... 63

CHAPTER 1

INTRODUCTION

Recently, with the rapid development of smartphones and tablets, mobile application markets such as Google Play for Android and the App Store for iOS have been growing fast. For example, over 1 million apps were released in 2016¹, and the number of downloads exceeded 65 billion in the same year². By 2017, there were more than 3.5 million available apps³, and the number is still increasing. Far from being as simple as a few clicks to download an app, this huge number of apps is overwhelming users: even within a single category, there are 200 thousand apps on average to choose from. Therefore, an app quality evaluation system that ranks apps by quality is indispensable to liberate users from the flood of apps.

The current ranking strategies applied by app markets are based mainly on app ratings and downloads. Apart from the rankings, users may also refer to reviews written by people who have already tried an app. Nevertheless, these ranking strategies have drawbacks. Some app companies or developers manipulate the ratings and reviews of their apps to get more downloads from users and, in turn, higher revenues. It is not uncommon to find deceptively positive reviews and abnormally high ratings. Besides, newly published apps have limited user feedback. As a result, the rankings of these apps deviate from their real quality, which may mislead users. The most reliable way is to try each app yourself, but that is inconvenient and time-consuming. In this sea of mobile apps, it is hard for people to choose an appropriate one. Moreover, app developers are mired in the predicament brought about by ill-conceived ranking systems: the app rankings offered by app markets merely inform developers of the popularity of apps, not the reasons behind that popularity, which ultimately impedes the advent of excellent apps.

¹https://www.statista.com/statistics/742370/annual-new-apps-google-play/
²https://www.statista.com/statistics/281106/number-of-android-app-downloads-from-google-play/
³https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/

However, there is little research on app ranking systems. We only find systems that offer app store optimization rather than optimizing the app itself, such as RankMyApp [1] and Appfigures [2]. These systems conduct sentiment analysis on app reviews to mine keywords and popular topics, based on which they perform app store optimization. That is, they analyze your app, your competitors' apps, and the app market, and give advice on the app name, description, icon, and screenshots to improve your app's visibility in a certain category. However, they only modify the external parts of an app rather than evaluating the app itself, and therefore cannot help users find high-quality apps. These systems are superficial and do not solve the primary problem.

Compared with the relatively rare app ranking systems, there are plenty of app recommendation systems. Nevertheless, these recommendation systems do not solve the problems mentioned above either. Some take user privacy preferences into consideration. For instance, Liu et al. propose a system that recommends apps that reach a trade-off between app functionality and the user's privacy preference [3]. But the fact is that most users lack safety awareness: when receiving a permission request from an app to access sensitive data, most users thoughtlessly click "allow" with little regard for their privacy. Thus, this type of system cannot learn real user privacy preferences and recommend appropriate apps to users. Other systems are personal recommendation systems that analyze a user's usage history to recommend apps the user may like, such as [4, 5, 6, 7]. If a user is looking for an app in a specific category, their recommendations may miss because of limited training samples. In this case, the user has to resort to app ratings and reviews again, which, as mentioned before, are not trustworthy.

Given these observations, it is necessary to build an innovative app ranking system that is efficient and effective and, most importantly, solves the problems above. Such a system must give a relatively impartial evaluation of app quality and rank apps by quality level. The most challenging part is choosing the criterion for evaluating apps. Since external ratings and reviews are not applicable, we decide to focus on the app itself: the user interface and the app structure. The inspiration comes from our observation of the way people rate apps. Android apps are user-interaction intensive, so users take user-friendliness as a benchmark to gauge them. The rating of an app is based mainly on the user's experience with it, that is, the interaction between the user and the app, which in practice is the interaction between the user and the app's interfaces and overall structure. We therefore decide to evaluate app quality based on code-level features such as the user interface (UI) and the structure. This idea is supported by the app review analysis discussed in chapter 4, where we count term frequency and mine topics in about 18 million app reviews covering all app categories. From the high-frequency terms and some of the review topics, we notice that when giving ratings, users care about the design of the user interface (e.g., pictures and colors), the widgets used in an interface, the permissions an app may request (e.g., Wi-Fi and camera), and the advertisements contained in the app. Thus, it makes sense to evaluate apps by their code-level features, i.e., features that can be extracted directly from app code.

Based on the observations above, this paper proposes a novel app ranking system based on app quality as measured by code-level features such as the user interface structure, aiming to help users download apps that are worthy of their rankings. The system is an extension of ViewDroid [8], which uses app views to detect app repackaging. Since a graph can preserve an app's nature as much as possible, in our app quality grading system we represent an app as a graph, called the feature view graph, which encodes the code-level features of the app. Code-level features such as the UI and the overall app structure are extracted directly from the app code. In the feature view graph, each node is an interface and has its own feature vector, which encodes not only the widgets, colors, fonts, pictures, and interaction methods used in the corresponding interface but also some app-wide features. Each edge indicates a switch between two different interfaces. The feature view graph is generated by searching for and tracking Android-specific APIs in app code, such as startActivity() and setContentView(), which indicate interface switching and interface layout, respectively. For app ranking, we apply a classification model to cluster apps into different classes, where each class indicates one level of app quality. The reason we apply a classification model rather than a regression model is that there is barely any difference between apps at the same quality level; for example, it is hard to argue the superiority of Twitter over a comparable top app. As for the classification model, we apply a graph convolutional network (GCN) [9] on the feature view graphs to estimate the quality of the corresponding apps. GCN is a classification model for graph-based data; since we represent an app as a feature view graph and the app structure is itself an important feature, GCN fits our system well. It assigns a label (the level of app quality) to each app, by which we can rank apps. With AppGrader, we can evaluate app quality by functionality and make app recommendations to users, as well as give app developers feedback on app design.

Our paper makes the following contributions:

1. Our system extracts code-level features from an app and generates a feature view graph for each app, which preserves the nature of the app, such as its user interface and structure.

2. This paper implements AppGrader, a novel app quality grading system based on code-level features such as the user interface and app structure. To the best of our knowledge, it is the first system that evaluates apps based on the apps themselves. Besides, it employs the newly proposed graph convolutional network to perform the app quality grading.

3. Evaluation: AppGrader is trained and tested on over 1440 real-world apps from Google Play, covering 14 app categories. The performance of AppGrader is evaluated from two perspectives: system accuracy and label dissimilarity. AppGrader is an efficient app quality grading system, with an average accuracy of 72.54% and a label dissimilarity of around 1.
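The two evaluation metrics can be illustrated with a small sketch. This is a hypothetical reconstruction from the metric names alone (the thesis defines them formally in chapter 5); we assume here that label dissimilarity is the mean absolute distance between predicted and true quality levels over the misclassified apps.

```python
def accuracy(pred, true):
    """Fraction of apps whose predicted quality level equals the ground truth."""
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(pred)

def label_dissimilarity(pred, true):
    """Mean absolute distance between predicted and true quality levels,
    computed over the misclassified apps only (assumed definition)."""
    errors = [abs(p - t) for p, t in zip(pred, true) if p != t]
    return sum(errors) / len(errors) if errors else 0.0

# Hypothetical predictions for five apps with quality levels 1-3.
pred = [1, 2, 3, 1, 3]
true = [1, 2, 1, 1, 2]
acc = accuracy(pred, true)              # 3 of 5 correct -> 0.6
dis = label_dissimilarity(pred, true)   # errors 2 and 1 -> 1.5
```

Under this reading, a dissimilarity near 1 means that even when AppGrader misclassifies an app, the predicted level is typically only one step away from the true one.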

The remainder of the paper is organized as follows. Chapter 2 discusses related work. Chapter 3 introduces the background of the Android platform and apps, as well as the graph convolutional network. The design of AppGrader is described in chapter 4, followed by the implementation and evaluation in chapter 5. Chapter 6 presents a discussion, and we conclude in chapter 7.

CHAPTER 2

RELATED WORK

The systems that help users choose appropriate apps consist of app ranking systems and app recommendation systems. Research on app ranking systems for app markets is rare; we only find systems that help developers improve their apps' visibility in app markets, like RankMyApp [1] and Appfigures [2]. In contrast, there are numerous app recommendation systems that provide users with personal recommendations on what apps to choose, such as [10, 11, 12, 3, 13, 4, 5, 6, 7].

RankMyApp [1] targets app developers. It offers App Store Optimization (ASO) and Search Engine Optimization (SEO) to improve app visibility in app markets by modifying the keywords inserted in an app's name and description. Besides, it implements app review analysis by periodically checking app reviews, replying quickly to user queries, and converting user feedback into suggestions. Appfigures [2] is similar to RankMyApp, differing only in that it also takes users as target clients. For app developers, Appfigures offers mobile app store analytics, ranking, and review monitoring. It shows hourly rankings of your app to give a real-time view of its performance compared with your competitors'. It also monitors all your app reviews so that you can reply directly and record bugs and suggestions quickly. Besides, it compares ad revenue and spend across all ad networks to help you make more money. For app users, it offers comprehensive filters that match users' needs, from app name and price to SDKs, demographics, performance, and so on. It can even combine different filters in one search with AND/OR operators to provide precise search results, with views customized to present first the apps users are most interested in. However, these systems have essential problems. First, they do not solve the primary problems that cause app unpopularity and cannot radically improve app quality; they only modify the external parts of an app rather than improving the app itself. Second, although they consider many aspects of an app, the reports they offer focus only on the app being tracked, ignoring the other apps in the same category, especially the popular ones.

The other related systems that help users choose appropriate apps are app recommendation systems. Traditional app recommendation systems like [11, 12, 10] aim to cater to user interest. Some of them try to find the relationship between user interest and app functionality and recommend the apps whose functionality best matches the user's interest, such as the systems proposed by Koren et al. [11] and Salakhutdinov et al. [12]. They encode user interest and app functionality into vectors and compute the distance between them; if the distance is short, the corresponding app is recommended to the user. This type of recommendation system has been successfully applied to online shopping [14], book [15], music [16], and movie [17] recommendation. But it is not appropriate for app recommendation, because a user's interest in an app depends not only on its functionality but also on its design, the advertisements it contains, the permissions it requests, and so on. Other recommendation systems, like the one proposed by Bae et al. [10], analyze the user's app usage history to predict user interest and thereby recommend similar apps. They propose a graph-based technology for app recommendation on Android OS and address limitations of previous recommendation systems, such as cold start and domain disparity. However, the information extracted from usage history is limited, so such a system offers little help when users face hundreds of thousands of apps. Based on usage history, this system recommends several apps across many app categories at one time; if a user wants to search for apps in a certain category, the recommendations may miss because of the small amount of training data.

Along with the exploration of app recommendation systems, novel benchmarks for recommendation have been proposed, such as user privacy preference. Liu et al. put forward an app recommendation system that emphasizes the importance of user privacy preference, considering that apps often access users' sensitive data and that privacy preferences vary across users [3]. Besides identifying apps matching the user's interest, it also measures whether an app's behavior will meet the user's privacy requirements. It assumes two values used for app recommendation: functionality match, which denotes the degree of match between app functionality and user interest, and privacy respect, which shows the degree to which the app accords with the user's privacy preference. The system recommends apps that strike a balance between functionality match and privacy respect. Zhu et al. also focus on user privacy preference and propose an app recommender system with security and privacy awareness [13]. Different from the systems mentioned before, it is not a personalized system; it is not so much an app recommendation system as an app ranking system. In addition to app popularity, which is usually measured by ratings and downloads, the system provides users with other evaluation metrics like security and a hybrid of popularity and security. Further, when choosing security as the evaluation metric, users can select different security levels to get app ranking results at different granularities. These studies propose novel ideas on app recommendation. But the fact is that most users lack safety awareness: when receiving a permission request from an app to access sensitive data, most users dismiss it and click "allow". Thus, this type of system is unable to truly recommend appropriate apps to users.

Yin et al. propose an innovative benchmark for app recommendation [4]. They treat apps as continuous consumption and assume that the apps a user has already downloaded will prevent him from downloading similar apps, although in fact some users keep several similar apps on their phones. To predict whether a user will download a new app, they propose two values: (1) the existing apps have a satisfactory value, which measures whether they can satisfy the user's need; (2) the apps in the app market have a tempting value, which estimates the satisfactory value of the candidate apps. The system regards app replacement as a contest between the existing apps' satisfactory value and the candidates' tempting value. If the tempting value wins, the system may recommend that the user download the new apps; otherwise, no change is needed.

Han et al. develop another personal app recommendation system, based on spatio-temporal app usage logs [5]. Unlike the systems discussed before, which evaluate apps by discrete features, it recommends apps based on continuous random variables such as location and time, suggesting apps that suit the user's current time and location. This system is actually a variation of the usage-history-based systems. It treats app recommendation as a probabilistic problem, where (1) an app is a distribution of topics extracted from its description, and (2) the user's preference for an app is also a topic distribution, influenced by time and location. It collects the user's usage history and encodes the app usage log into a series of topics representing user preference; it then analyzes new apps and recommends those that topically accord with the user's preference at the current time and location.

Similar to Han et al., Lin et al. also regard apps and user usage history as sets of topics [6]. In the latter system, however, the same app may have more than one set of topics, since an app is no longer a static item: apps evolve continuously, and new features are introduced with each version. Thus, each version of an app has its own set of topics encoding its particular features, extracted from the app description and the version introduction. The system builds personal profiles for users. After generating the topic distributions for apps, it searches the user profile to recommend apps that topically accord with the user's preference: if the topic distribution of an app has a lot of overlap with the user's preference, the app is recommended. As a result, an app that was not appropriate for a user in the past may be recommended after a version update.

Most users may think that only explicit information like app ratings affects which apps they choose. However, research shows that implicit information also affects mobile app users' behaviors [18] and app usage [19]. Thus, apart from time and location, Karatzoglou et al. also consider implicit mobile context information, such as activity and social interactions, to facilitate app recommendations [7]. They propose the Djinn model: "a novel context-aware collaborative filtering method for implicit data that is based on tensor factorization" [7]. They give up explicit user feedback and build a model of user preference based on implicit contextual information. After that, the system does what the previous systems do: recommend the apps that best fit the user's interest.

Systems [4, 5, 6, 7] are personal recommenders, and all are variants of recommendation systems based on user usage history or something related to it. As mentioned before, the biggest drawback of such systems is the limited number of training samples. Users of this type of system thus receive limited recommendations for any specific category and are largely left on their own to select an appropriate app among countless candidates.

In a word, all the systems discussed in this chapter take great pains to mine the relationship between the external factors of an app and user interests. They care about everything about an app except the user experience and the app per se. These systems fail to recommend to users the apps that are commensurate with their rankings, and fail to reveal to developers what makes popular apps popular.

CHAPTER 3

BACKGROUND

Android apps are published and downloaded in a package file format, the Android Application Package (APK). Like a ZIP package, an APK file is a type of archive file. It contains the app's manifest file (AndroidManifest.xml), resource files, and executable program code (.dex files) [20]. The manifest file declares the Android versions the app is compatible with, the permissions it requests, the activities it registers, and so on. The resource folder contains the resource files needed by the app, like pictures, media files, and the XML files that describe the layout of interfaces. The executable program code has all the classes that implement the app's functionality. The APK file also contains other files, such as the libraries included in the app; these are unrelated to our system, so we omit them. Since the APK file only offers unreadable Dalvik Executable code, we need to de-compile the .dex files to get human-readable code, smali code. Smali is the intermediate code between Java source code and Dalvik Executable code.
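As a concrete illustration, the de-compilation step can be driven by a tool such as apktool, which decodes an APK into smali code and readable resource files. The sketch below only builds the command line, and runs it only when the tool and the APK actually exist; the file names are made up for illustration.

```python
import os
import shutil
import subprocess

def decompile_apk(apk_path, out_dir):
    """Build (and, when possible, run) the apktool command that unpacks
    an APK into smali code and decoded resources such as
    AndroidManifest.xml and the layout XML files."""
    # apktool d <apk> -o <dir> -f : decode, set output dir, overwrite.
    cmd = ["apktool", "d", apk_path, "-o", out_dir, "-f"]
    if shutil.which("apktool") and os.path.exists(apk_path):
        subprocess.run(cmd, check=True)  # only runs if apktool is installed
    return cmd

# Hypothetical file names for illustration.
cmd = decompile_apk("com.example.app.apk", "decoded/")
```

The decoded output directory would then contain the smali classes and resource files that the feature view graph generator consumes.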

App components are the cornerstone of an Android app and comprise activities, services, broadcast receivers, and content providers. An activity provides an entry point for users to interact with the app. There is only one main activity in an app, but there can be as many activities as needed. Usually, each activity has its own screen view, or interface. A service performs long-running work; it runs in the background and can also keep the app running in the background. A broadcast receiver helps the app receive system-wide notices and respond to them. A content provider manages shareable data stored in the file system, an SQL database, and so on. In our system, we only analyze activities to generate the feature view graph for an app, since only activities offer interaction with users.

When an app is launched, the main activity, which is declared in AndroidManifest.xml, is started first. The user may then trigger certain events to switch to different activities. This activity switching is done by invoking Android APIs such as startActivity() and startActivityForResult() inside activities. These methods take an Intent as a parameter, which indicates the target activity; the caller activity is usually the source activity. The layout file, which contains the view components and describes the activity's layout, is loaded to initialize the user interface when an activity is created. This is accomplished by invoking Android APIs like setContentView() in the activity's onCreate() method. Hence, by statically analyzing the program code and tracking these specific APIs, we can extract an activity switching graph from the activities and generate a feature vector for each activity from its layout file. The activity switching graph and the activity feature vectors are then integrated into one feature view graph, which encodes the user interface and the structure of the app.
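The API-tracking idea can be sketched as follows. This is a greatly simplified illustration, not the thesis's actual implementation: it only recognizes Intent targets named via const-class in the same method body as a startActivity() call, whereas the real system must handle all the Intent construction methods (Table 4.3). The smali snippet and class names are invented for the example.

```python
import re

# Hypothetical smali body of one activity method, of the kind a
# de-compiler emits; the class names are made up for illustration.
SMALI = """
.method private openSettings()V
    new-instance v0, Landroid/content/Intent;
    const-class v1, Lcom/example/SettingsActivity;
    invoke-direct {v0, p0, v1}, Landroid/content/Intent;-><init>(Landroid/content/Context;Ljava/lang/Class;)V
    invoke-virtual {p0, v0}, Lcom/example/MainActivity;->startActivity(Landroid/content/Intent;)V
.end method
"""

def extract_edges(smali_text, source_activity):
    """Find explicit Intent targets (const-class) that flow into a
    startActivity()/startActivityForResult() call in this activity's code.
    A simplification: real Intents can also be built from class-name
    strings or actions, which Table 4.3 enumerates."""
    edges = set()
    targets = re.findall(r"const-class \w+, L([\w/]+);", smali_text)
    if re.search(r"->startActivity(ForResult)?\(", smali_text):
        for t in targets:
            edges.add((source_activity, t.replace("/", ".")))
    return edges

edges = extract_edges(SMALI, "com.example.MainActivity")
# One directed edge: MainActivity -> SettingsActivity
```

Each edge recovered this way becomes one interface switch in the view graph; the same kind of scan over setContentView() links each activity to its layout file.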

For the classification model, we apply the graph convolutional network [9]. "GCN is a scalable approach for semi-supervised learning on graph-structured data. Initially, it is proposed to classify nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes" [9]. It takes an N × N adjacency matrix and an N × F feature matrix of a graph as inputs and outputs an N × L label matrix, which contains the one-hot label for each node (Figure 4.6). Here, however, we want to do graph classification, since we treat each app as a graph. Heuristically, to achieve graph classification, we can regard each graph as a subgraph, integrate the subgraphs into one large graph, and feed that graph into the GCN (Figure 4.7). The output label matrix still indicates node labels, but the nodes from the same graph should have similar labels; thus, we combine the labels of homogeneous nodes to get the graph label.
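The graph-classification heuristic can be sketched in a few lines. This is an illustrative reconstruction, not the thesis's code: it applies one GCN propagation step (following Kipf and Welling's renormalized propagation rule) to a block-diagonal adjacency matrix built from two small feature view graphs, then averages the node outputs of each subgraph to obtain a per-app label. The weights are random, so the labels themselves are meaningless here.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

# Two toy feature view graphs (3 views and 2 views) merged block-
# diagonally, so the node-classification GCN processes both at once
# while the zero off-diagonal blocks keep the apps independent.
A1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
A2 = np.array([[0, 1], [1, 0]], dtype=float)
A = np.block([[A1, np.zeros((3, 2))], [np.zeros((2, 3)), A2]])

H = np.random.rand(5, 4)   # N x F feature matrix (F = 4 here)
W = np.random.rand(4, 3)   # random weights; L = 3 quality levels
Z = gcn_layer(A, H, W)     # N x L node label scores

# Combine homogeneous nodes: average each app's node scores and take
# the argmax as that app's quality label.
app1_label = int(Z[:3].mean(axis=0).argmax())
app2_label = int(Z[3:].mean(axis=0).argmax())
```

Averaging is only one plausible way to "combine the labels of homogeneous nodes"; voting over per-node argmaxes would be another.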

CHAPTER 4

METHODOLOGY

4.1 System Overview

Confronted with countless apps, users need an efficient and effective app quality evaluation system to help them choose appropriate apps. Unfortunately, the current app ranking systems applied by app markets have drawbacks in handling (i) apps with abnormally high ratings and fake downloads, and (ii) newly published apps with limited user feedback. Besides, the existing related systems, including app ranking systems and app recommendation systems, fail to solve the problem. As a result, it is necessary to introduce an innovative app quality evaluation system. This paper proposes AppGrader, which evaluates app quality based on the app itself rather than on external ratings or downloads, which can be forged. Our system extracts static features directly from app code, such as the user interfaces and the overall app structure. This is inspired by our observation of the way people give ratings and by the results of the app review analysis, which will be detailed later. To represent an app by its UI and structure, we conceive of a graph called the feature view graph. In this graph, each node is an app interface, and edges indicate interface switching relationships. To better capture app features, we introduce a feature vector for each node, which encodes both activity features and app features. To ensure scalability, we extract app features and structure directly from app code by statically analyzing the de-compiled code and tracking specific Android APIs, and we encode them into the feature view graph. For app quality grading, we apply a classification model to cluster apps into different classes, where each class denotes one level of app quality. Here we apply the graph convolutional network for classification; GCN is developed for handling graph-based data and fits our system well, since in AppGrader apps are represented as graphs.

Figure 4.1: System architecture

Figure 4.1 shows the overall architecture of our system. AppGrader consists of three components: code de-compiler, feature view graph generator and graph con- volutional network. We will talk about our system and its components in detail later.

To further clarify our methodology, we introduce the following definitions:

DEFINITION 4.1 (View) A view is a user interface that a user can interact with. Each view is related to an activity, where the methods handling user inputs and making responses are defined. It is also associated with a layout file, which declares the view components, such as widgets, shown in that interface.

DEFINITION 4.2 (View Graph) A view graph is a directed graph G(V, E), where V is a set of vertices, each of which corresponds to a view, and E is a set of edges ⟨v1, v2⟩ such that v1 ∈ V, v2 ∈ V, and ⟨v1, v2⟩ is a switch from view v1 to view v2 triggered by some event.

DEFINITION 4.3 (Feature Vector) A feature vector encodes the app code-level features and is assigned to each view in the view graph. It includes: (1) the view feature, which encodes the view components contained in the current view; (2) the app feature, which stores app metadata.

The events that trigger switches between views would naturally be edge features; they are usually the event listener functions (e.g., onClick(), onLongClick(), onTouch(), etc.) that listen for user-generated events. However, the graph convolutional network we apply later cannot handle edge features. Therefore, instead of using edge features, we encode these event listener functions into the feature vector of the source view. This makes sense because these functions process user inputs and can thus be treated as interaction functions stored in the view feature vectors.

DEFINITION 4.4 (Feature View Graph) A feature view graph combines the view graph G(V, E) with the feature vectors: it attaches each feature vector to its corresponding node in the view graph.
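The definitions above can be sketched as a small data structure. This is an illustrative sketch only, not the thesis implementation; the names (FeatureViewGraph, add_view, add_switch) and the toy feature vectors are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureViewGraph:
    # view name -> feature vector (view features + app features, Definition 4.3)
    features: dict = field(default_factory=dict)
    # directed edges <v1, v2>: a switch from view v1 to view v2 (Definition 4.2)
    edges: set = field(default_factory=set)

    def add_view(self, view, feature_vector):
        self.features[view] = feature_vector

    def add_switch(self, src, dst):
        # both endpoints must already be known views
        assert src in self.features and dst in self.features
        self.edges.add((src, dst))

g = FeatureViewGraph()
g.add_view("MainActivity", [3, 1, 2])       # e.g. widget, color, picture counts
g.add_view("SettingsActivity", [1, 1, 0])
g.add_switch("MainActivity", "SettingsActivity")  # triggered by e.g. onClick()
```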

4.2 App Code-Level Features

Before processing data and evaluating apps, we need to determine which features to extract from an app. As mentioned in Chapter 1, we preserve the user interface and the app structure in the feature view graph. This is inspired by the observation that, when giving ratings, users make decisions based on their experience with the app. We also conduct an app review analysis to verify this idea and explore further possible code-level features.

4.2.1 User Experience of an App

Android apps are user-interaction intensive, which makes user-friendliness an essential factor of outstanding apps. The rating of an app is mainly based on user experience, i.e., the interaction between the user and the app. According to our observation, this interaction can be divided into four stages, as shown in Figure 4.2. A user uses an app by traversing different interfaces and touching the widgets on the screen. Therefore, the interaction between a user and an app is actually the interaction between the user and each interface, and further, between the user and the widgets on the current interface. Hence, widgets are the cornerstone of an app and serve as the first stage of interaction. The user may then switch to other interfaces and get to know the app structure (stage 2) and overall functionality (stage 3). Finally, some expert users may care about app performance (stage 4), such as CPU overhead and battery consumption.

[Figure: the four interaction stages. User interface: widgets such as Button, ListView, ImageView, WebView; structure: interfaces 1 to n; functionality: the app as a whole; performance: CPU, battery]

Figure 4.2: Interaction between a user and an app

The first two stages, with which a user can directly interact, can be represented as static features extracted directly from the app code. The last two stages are dynamic features, which cannot be taken into account in a fast, large-scale experiment. Thus, based on the first two stages, our system extracts the overall app structure and the design of the user interfaces, such as the number of each type of widget used and the number of colors, pictures, and fonts applied, as app features.

4.2.2 App Review Analysis

To verify our observation and determine what other factors impact the rating a user gives to an app, we analyze 18 million app reviews covering all app categories. We perform both term-frequency counting and topic mining on the reviews to find out what users discuss about the apps and thereby extract possible app features.

First, we preprocess the raw data. We expand all contractions in the reviews, since contractions like “don’t” are easy for humans to understand but hard for a computer. Then we break the texts into tokens using the general-purpose English tokenizer provided by the “sklearn” library. Finally, we remove stop words: words like articles and conjunctions that do not convey content meaning on their own. Here we apply the Terrier stop-word list [21].
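The preprocessing steps above can be sketched as follows. The contraction map and stop-word set here are tiny illustrative stand-ins (the thesis uses the full Terrier stop-word list and sklearn's tokenizer), and the simple regex tokenizer is an assumption, not the thesis code.

```python
import re

# Tiny stand-in tables; the real pipeline uses a full contraction map
# and the Terrier stop-word list.
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "and", "is", "do", "not", "it"}

def preprocess(review):
    review = review.lower()
    for contraction, full in CONTRACTIONS.items():
        review = review.replace(contraction, full)  # expand contractions
    tokens = re.findall(r"[a-z]+", review)          # simple word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("It's a great app but don't update"))
# → ['great', 'app', 'but', 'update']
```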

Table 4.1: Terms indicating app features in top 500 high-frequency terms

Term       Frequency | Term        Frequency | Term        Frequency
crashes    424664    | bugs        99249     | figure      66207
graphics   297585    | photos      97168     | camera      62514
screen     280734    | access      96990     | tap         62505
features   178181    | text        90920     | design      62220
crashing   163609    | word        89965     | internet    59336
feature    157911    | view        85423     | connection  56411
ads        133841    | pics        83931     | error       55946
crash      132612    | picture     81209     | connect     53777
touch      130092    | wifi        75958     | color       52650
pictures   126972    | click       75837     | battery     50480
button     118851    | photo       75056     | speed       47885
interface  117689    | bug         73533     |

Before mining topics in these reviews, we perform simple term-frequency counting. Over one million distinct words appear in the reviews. Although we have removed meaningless stop words, the result still contains a large number of words irrelevant to our research, such as [“app”, 5830765], [“game”, 5742955], [“fun”, 2467243], and [“awesome”, 1148299] (the number in brackets is the word frequency), as well as a lot of noise, like misspelled words. After filtering out these irrelevant terms, we take the words related to app features among the top 500 high-frequency words and show them in Table 4.1 (the whole table of top 500 high-frequency

terms is shown in appendix A). We divide the top 500 high-frequency terms into five classes: nouns; verbs; adjectives and adverbs; numbers and symbols; and terms indicating app features. The app-feature terms are further divided into static and dynamic features. Figure 4.3 shows the proportions of the five classes, from which we can see that the terms irrelevant to app features, such as verbs (31%), adjectives and adverbs (23%), and numbers and symbols (4%), account for 58% of the high-frequency words. Among the remaining nouns (35,935,345 occurrences), the words related to app features (4,305,447 occurrences) take up 12%. So it is safe to say that the words in Table 4.1, or the app features they indicate, matter to users when they rate. Among these feature-indicating words, however, terms such as [“crash”, 132612], [“bug”, 73533], and [“battery”, 50480] cannot be considered, since they represent dynamic features which are too hard to capture and are not appropriate for a fast, large-scale experiment.
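The counting-and-filtering step above can be sketched with Python's collections.Counter. The toy reviews and the irrelevant-term filter are illustrative only; the thesis filters terms such as “app”, “game”, “fun”, and “awesome” over 18 million real reviews.

```python
from collections import Counter

# Illustrative filter for terms unrelated to app features.
IRRELEVANT = {"app", "game", "fun", "awesome"}

# Toy corpus of already-tokenized reviews.
reviews = [
    ["app", "crashes", "on", "startup"],
    ["great", "graphics", "but", "app", "crashes"],
    ["love", "the", "interface"],
]

# Count term frequency across all reviews, skipping irrelevant terms.
counts = Counter(t for review in reviews for t in review if t not in IRRELEVANT)
print(counts.most_common(3))
```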

[Pie chart: verbs 31%, adjectives & adverbs 23%, numbers & symbols 4%, nouns 42% (non-featured nouns 37%, feature terms 5%)]

Figure 4.3: Distribution of top 500 high frequency terms

Then we mine the topics in the app reviews with the LDA model [22], a kind of topic model. It posits that “each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics” [22]. We set the number of topics to 50 (the default setting) and list the

topics whose key words indicate possible app features in Table 4.2 (here we only list the key words related to app features in each topic and put the whole topic mining result in appendix A). As mentioned before, there are many irrelevant terms and much noise in our dataset, so the topic mining result is not as ideal as we expected. Besides, due to the huge size of our dataset and the limitations of our computing environment, we did not run topic mining repeatedly to optimize the number of topics, which also contributes to the unclear topics.

Table 4.2: Topics indicating app features

Topic      Key words                            Frequency
Topic 2    version photo latest                 316069
Topic 11   update crashes problem crash video   573881
Topic 15   picture color light                  239741
Topic 18   ads crashed database                 311939
Topic 28   interface                            277807
Topic 32   pic finger                           228356
Topic 38   new graphics updated                 223547
Topic 44   photos words                         354814
Topic 49   pictures button                      335739

Figure 4.4 shows the frequency of each topic, where the y-axis denotes topic frequency and the x-axis denotes the topic. The topics indicating app features (topics 2, 11, 15, 18, 28, 32, 38, 44, and 49) are shown in solid fill. The histogram shows that: (1) topic 11 is the third most popular topic among the 18 million reviews; (2) topics 2, 18, 28, 44, and 49 are average; (3) topics 15, 32, and 38 are less considered by users. To emphasize the importance of the topics related to app features, we present their proportion among all 50 topics in Figure 4.5: they take up 16% of the overall app reviews. At first glance, nine feature-related topics accounting for only 16% of the reviews may seem too small a share to be convincing. However, there are facts we cannot ignore. First, most users write simple reviews like “good game” or “nice app” which do not contain useful information about app quality. Second, among the remaining analyzable reviews, users may not discuss the same topic. Besides, among the possible app features, we only focus on those which, from our perspective, can be extracted directly from app code and are appropriate for app quality evaluation. Thus, 16% is large enough to demonstrate the importance of the app features indicated by these topics.


Figure 4.4: Distribution of 50 topics among 18 million app reviews

Ignoring meaningless words like “nice”, “game”, “think”, and “lol”, we synthesize the results of term-frequency counting and topic mining and notice that, when giving ratings, users care about the parts they can interact with, such as the design of the user interface (picture and color), the widgets used in an interface (button), the permissions an app may request (wifi and camera), and the advertisements contained in an app (ads). This result accords with our observation. Although some app features, like whether the app crashes while being used (crash), also concern users, they are not considered, since they are dynamic features and not appropriate for a fast, large-scale experiment.

Based on our observation and the app review analysis, the feature vector for each view contains the following code-level features (the detailed feature vector is shown in appendix B):

• For each view (activity): the number of each type of widget used in the view; the number of colors, pictures, fonts, and interaction methods used in the view.

[Pie chart: other topics 84%; topic 2 2%, topic 11 3%, topic 15 1%, topic 18 2%, topic 28 2%, topic 32 1%, topic 38 1%, topic 44 2%, topic 49 2%]

Figure 4.5: Proportion of topics indicating app features in total 50 topics

• For each app: the number of each type of permission requested by the app; the proportion of advertisements in the app; the number of views and of switches between views in the view graph; the app size; and the number of files contained in the app.

4.3 System Architecture

As shown in Figure 4.1, AppGrader comprises three components: the code de-compiler, the feature view graph generator, and the graph convolutional network. The code de-compiler de-compiles the app APK file into human-readable smali code and resource files for further analysis. The feature view graph generator analyzes the resulting smali code and resource files to extract the feature vectors and the view graph, which are encoded into the feature view graph. Finally, the graph convolutional network evaluates all apps’ feature view graphs and outputs a matrix indicating the label (level of app quality) of each app. Detailed explanations of these components follow in the rest of this section.

4.3.1 Code De-compiler

The code de-compiler is apktool [23], an open-source tool for reversing Android APK files. Android apps are released and downloaded in APK format, which contains only unreadable Dalvik Executable code. For further analysis, the APK file is de-compiled into human-readable code by apktool. The resulting files include the Android manifest file, smali code, resource files, and so on. The Android manifest file stores app metadata and declares all registered activities in the app. Smali code, an intermediate representation between Java source code and machine-recognizable code, preserves the Android-specific APIs from which we can restore the app view graph and extract view features. The resource files include the layout files of each interface, pictures, and other media resources used by the app.

4.3.2 Feature View Graph Generator

To generate the feature view graph, we start by generating the app view graph and the view feature vectors. In this graph, each node is an app view, and edges denote view-switching relationships. An app view consists of an activity and a layout .xml file. The methods that process user inputs and make responses are defined in the activity, including Android-specific APIs which remain unchanged across code written by different developers. Therefore, we can extract nodes (app views) and edges (view switches) by tracking the Android APIs related to view switching in each activity, and thereby restore the view graph. The layout .xml file, from which we extract view features, declares the view components shown in its corresponding view. For app features, we explore the Android manifest file, which records the app metadata.

4.3.2.1 Generate View Graph

The view graph of an app presents all app views and their view-switching relationships. To restore the view graph, our system searches for and tracks the Android APIs related to app view switching. As mentioned before, a view has a corresponding activity,

which is de-compiled into smali code and contains the methods necessary for the app to run. Usually, an app has more than one view. To traverse among different views, the app has to resort to an Intent in the activity, which is used to deliver the information necessary for component communication. Some of the Intent construction methods used for activity switching are shown in Table 4.3. At the Android coding level, “activity” is preferred over “view” and is therefore used in the following discussion.

Table 4.3: Intent construction methods

Intent()
Intent(Intent o)
Intent(String action)
Intent(String action, Uri uri)
Intent(Context packageContext, Class<?> cls)
Intent(String action, Uri uri, Context packageContext, Class<?> cls)

As the last two construction methods in Table 4.3 show, two parameters are related to activity switching: “Context”, which denotes the context of the source activity that initiates the switch, and “Class”, which indicates the target activity. Several Android APIs take an Intent as a parameter for traversing among activities, as listed in Table 4.4.

Table 4.4: Android APIs for activity switching

startActivity(Intent)
startActivityForResult(Intent, n)
startActivityIfNeeded(Intent, n)
startNextMatchingActivity(Intent)
startActivityFromChild(ChildActivity, Intent, n, Bundle)
startActivityFromFragment(Fragment, Intent, n, Bundle)

Since Android APIs remain unchanged across code written by different developers, we can search for and track these APIs to identify the source and target activities and thereby restore the nodes and edges of the app view graph. The pseudocode for generating the app view graph is shown in Algorithm 1. Taking startActivity() as an example, we go through all activities and keep only the

instructions containing this API. Then, for each instruction, we search for the nearest Intent construction method and analyze the key words indicating an activity name, such as “String”, to get the name of the target activity. The source activity is usually the caller of startActivity(). The method that invokes startActivity() is also recorded as the edge feature and encoded in the source activity’s feature vector.

Algorithm 1: Generating view graph
Input: apk, de-compiled APK file
       func.list, a list of all Android APIs related to activity switching
Output: ViewGraph, a file storing the app view graph

Function View_Graph(apk, func.list):
    for each API i in func.list do
        src_ret ← all instructions invoking i in all activities in apk
        for each instruction j in src_ret do
            go to the activity containing j
            source_activity ← the current activity
            m ← the nearest Intent construction method to j
            target_activity ← the activity indicated in method m
            add (source_activity, target_activity) to file ViewGraph
        end
    end
    return ViewGraph

So far, we have introduced explicit Intents, which indicate the target activity with certainty. However, app developers sometimes use implicit Intents, such as the first four construction methods in Table 4.3, for activity switching. Node and edge extraction based on implicit Intents is similar to that based on explicit Intents, with only minor differences. The third and fourth Intent construction methods in Table 4.3 only specify an action in the Intent. An action informs the system which activities to invoke and is declared in activity intent filters, which are announced in the Android manifest file. When an implicit Intent is processed, the Android system analyzes the action in the Intent and chooses appropriate target activities by intent-filter matching. The target activity can be either inside or outside the app. If there is more than one candidate activity, the user has the right to choose. Our generator selects possible target activities

in the same way as the Android system and generates an edge for each of them (all such edges originate from the same source activity). In particular, it generates a single node, “others”, for all target activities outside the app. The first method in Table 4.3 initializes an empty Intent object; in this case, we search for methods such as setClass(), setComponent(), and setAction() around the Intent construction method rather than for key words indicating an activity name. As for the second method, which copies an existing Intent object into a new one, we trace the old Intent object and proceed as before to get the target activity.

Moreover, a fragment, which is very similar to an activity and is usually embedded in one or more activities, can also call these activity-switching APIs and will therefore be extracted as a node in the view graph. To avoid redundancy, the generator replaces all fragments in the resulting edges with their corresponding activities.

4.3.2.2 Generate Feature Vector

The feature vector consists of the view feature and the app feature. The pseudocode for generating feature vectors is shown in Algorithm 2. The view feature involves the widgets, colors, pictures, fonts, and interaction functions used in a view. As with the nodes and edges of the app view graph, we extract view features through specific Android APIs. To load widgets, colors, pictures, and fonts in a view, APIs such as setContentView() and inflate() are called in Android. Accordingly, we go through the activity of the current view and keep all instructions invoking these APIs to locate the layout file of the view, from which we extract the number of each type of widget used, as well as the colors, pictures, and fonts. Besides, other Android APIs in the activity, such as setTypeface(), setBackgroundResource(), setImageBitmap(), and setColor(), also let us extract colors, pictures, and fonts. The interaction methods are usually user-triggered events that listen for user inputs and make responses, such as those shown in Table 4.5. As mentioned before, the events that trigger switches among views are a subset of the methods listed in Table 4.5, so it makes sense to store the edge feature as one dimension of the view feature vector.

Algorithm 2: Generating feature vectors
Input: apk, de-compiled APK file
       widget, a list of all Android widgets
       permission, a list of all Android permissions
       interact_m, a list of all Android methods listening for user inputs
Output: Features, a file recording all feature vectors of an app

Function Feature_Vector(apk, widget, permission, interact_m):
    AppFeatures ← App_Feature(apk, permission, ViewGraph)
    for each activity in apk do
        ViewFeatures ← View_Feature(activity, widget, interact_m)
        FeatureVector ← ViewFeatures + AppFeatures
        add FeatureVector to file Features
    end
    return Features

Function App_Feature(apk, permission, ViewGraph):
    p ← the number of each type of permission announced in the Android manifest file in apk
    a ← the proportion of ads in apk
    s ← apk size
    f ← the number of files in apk
    n ← the number of nodes in ViewGraph
    e ← the number of edges in ViewGraph
    AppFeatures ← encode p, a, s, f, n, e into a vector
    return AppFeatures

Function View_Feature(activity, widget, interact_m):
    i ← instruction invoking setContentView() in activity
    layout ← layout file indicated by i
    w ← the number of each type of widget used in layout
    c ← the number of colors used in layout
    p ← the number of pictures used in layout
    f ← the number of fonts used in layout
    m ← the number of uses of each method in interact_m in activity
    ViewFeatures ← encode w, c, p, f, m into a vector
    return ViewFeatures

Table 4.5: Android methods that listen for user inputs

onClick()  onLongClick()  onTouch()
onTouchEvent()  onKey()  onKeyDown()
onKeyUp()  onFocusChange()  onCreateContextMenu()

Extracting the app features, which involve the permissions requested by an app, the proportion of advertisements, the app size, the number of files contained in the app, and the number of nodes and edges in its view graph, does not require resorting to Android APIs. Permissions must be announced in the Android manifest file, from which we collect all permissions requested by the app. To extract the proportion of ads, we examine all activities to get their package names and class names, which are then compared against those in a mobile app ads library. If any package name is found in the library, it is safe to say that the activities containing these packages are used for advertisement exhibition; we take the proportion of ad-containing activities among all activities as the proportion of ads. Here we use the ads library maintained by Theodore Book et al. [24]. The app size, the number of files in an app, and the number of edges and nodes in the app view graph can be obtained with a few simple shell-script statements.
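Collecting the requested permissions from the manifest can be sketched as below. The inline manifest and package name are illustrative; a real run would parse the AndroidManifest.xml produced by the de-compiler.

```python
import xml.etree.ElementTree as ET

# Toy manifest standing in for a de-compiled AndroidManifest.xml.
manifest = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.CAMERA"/>
</manifest>
"""

# Android attributes live in this XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def requested_permissions(xml_text):
    root = ET.fromstring(xml_text)
    return [p.get(ANDROID_NS + "name") for p in root.findall("uses-permission")]

print(requested_permissions(manifest))
# → ['android.permission.INTERNET', 'android.permission.CAMERA']
```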

4.3.3 Graph Convolutional Network

To achieve the ultimate goal of this study, evaluating app quality based on app functionality measured by code-level features, we apply a classification model that assigns each app a label denoting its level of app quality. We apply a classification model rather than a regression model because there is no significant difference between apps at the same quality level; for example, it is hard to tell the superiority of Twitter over Facebook, so it is meaningless to give each app an exact score. As the classification model, we choose the graph convolutional network (GCN), a neural network for graph-based data. Unlike other neural networks, GCN deals with nodes and takes graph structure into consideration when classifying. That is, when evaluating a node, in addition to the node itself, GCN also analyzes the information stored in the node’s 2-hop

neighbors. This network fits our system well, since we treat an app as a graph and the app structure is itself an important feature. Besides, different apps have varying numbers of views, each with its own feature vector, and therefore an unfixed number of features. Classification methods like SVM cannot be applied to data with varying numbers of features, which also motivates the use of GCN. With GCN we can handle any number of feature vectors, preserving the nature of apps. Although there are other graph-based networks, the graph convolutional network performs better and is simple to modify and use.

[Figure: an N-node network represented as an N×N adjacency matrix and an N×F feature matrix, fed to the GCN, which outputs an N×L label matrix]

Figure 4.6: Graph convolutional network for node classification

According to Kipf et al., GCN was initially developed for node classification, such as classifying documents in a citation network or users in a social network [9]. As Figure 4.6 shows, it represents an N-node network as an N × N adjacency matrix together with an N × F feature matrix, where F is the dimension of the feature vector. It takes these two matrices as input and outputs an N × L label matrix, where L is the number of classes. Each row of this label matrix is a one-hot label for its corresponding node. One-hot is an encoding for categorical data: a group of bits that are all “0” except for a single “1” indicating the category. For instance, with 5 classes, the one-hot label for class 1 is 00001 and the label for class 5 is 10000.
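The one-hot convention described above (class 1 of 5 as 00001, class 5 as 10000) can be sketched as:

```python
# One-hot labels with L classes: all zeros except a single one,
# with class 1 mapped to the rightmost bit, as in the text.
def one_hot(label, num_classes):
    bits = [0] * num_classes
    bits[num_classes - label] = 1  # class 1 -> last position
    return bits

assert one_hot(1, 5) == [0, 0, 0, 0, 1]
assert one_hot(5, 5) == [1, 0, 0, 0, 0]
print(one_hot(3, 5))
# → [0, 0, 1, 0, 0]
```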


[Figure: three apps’ adjacency matrices spliced into one block-diagonal N×N matrix, together with an N×F feature matrix, fed to the GCN to produce an N×L label matrix]

Figure 4.7: Graph convolutional network for app classification

However, what we need is app classification, so we must modify this network to fit our system. Heuristically, to achieve graph classification (as we represent each app as a graph), we can treat each graph as a subgraph and integrate these subgraphs into one large graph, which is input to GCN instead. The resulting label matrix still indicates node labels, but homogeneous nodes (nodes belonging to the same subgraph) should have similar labels. Hence, we can learn the subgraph label by combining the labels of the nodes in the same subgraph. Figure 4.7 applies this idea to app classification (the white parts of the matrices stand for zero). Assume there are three apps under the same category with n1, n2, and n3 nodes (views) respectively, whose sum is N, and that each node has an F-dimensional feature vector. We represent the apps as three feature view graphs, splice their adjacency matrices into one N × N matrix, and stack their feature vectors into an N × F matrix. GCN takes these two large matrices as input and outputs an N × L label matrix, where L is the number of app quality levels. In the label matrix, each row is a one-hot label for its corresponding node. The app label is predicted by combining the labels of all nodes in the same app. Besides, we rewrite the cost function and the way system accuracy

is computed in GCN, so that these computations are based on apps rather than nodes. As a result, AppGrader can evaluate apps under the same category by static features and grade them commensurately with their quality.
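The splicing-and-averaging setup can be sketched in a few lines of numpy. The per-node label scores here are hypothetical stand-ins for a GCN's output; the splice helper and node ranges are illustrative, not the thesis code.

```python
import numpy as np

def splice(adjs):
    # Combine per-app adjacency matrices into one block-diagonal N x N matrix.
    n = sum(a.shape[0] for a in adjs)
    big = np.zeros((n, n))
    offset = 0
    for a in adjs:
        k = a.shape[0]
        big[offset:offset + k, offset:offset + k] = a
        offset += k
    return big

a1 = np.array([[0, 1], [0, 0]])                    # app 1: 2 views
a2 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # app 2: 3 views
A = splice([a1, a2])                               # 5 x 5 block-diagonal matrix

# Hypothetical per-node label scores (L = 2 quality levels).
node_labels = np.array([[0.9, 0.1], [0.7, 0.3],
                        [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
node_ranges = [(0, 2), (2, 5)]  # which rows belong to which app (cf. NodeRange)

# App label = argmax of the average of its nodes' label scores.
app_labels = [node_labels[s:e].mean(axis=0).argmax() for s, e in node_ranges]
print(app_labels)
# → [0, 1]
```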

Thus far, we have discussed the principle of GCN at an abstract level and have only shown the input data in Figures 4.6 and 4.7. In practice, there are six input files for the graph convolutional network: graph, x, y, tx, ty, and NodeRange. First, we encode the adjacency matrix of the big graph containing all the apps under the same category as a dictionary and store it in the file named graph. Then we separate the feature vectors of training instances from those of testing instances by relegating them to files x and tx respectively. Likewise, we apply the same method to separate the app labels into files y and ty. The one-hot labels for training and testing instances are the ground-truth labels assigned to apps before app quality evaluation. Nodes in the same app are assigned the same label as the app, since GCN evaluates each node rather than an entire app. The ground-truth labels are determined by app rating, downloads, and number of reviews, as discussed in detail in Chapter 5. Note that the order of the feature vectors and app labels is in accordance with their corresponding nodes. The file NodeRange records which nodes each app has, for computing app labels: only by knowing which app each node belongs to can we synthesize the labels of homogeneous nodes into app labels. Specifically, we take the average of the labels of homogeneous nodes as the app label. Using NodeRange, we also modify the functions in GCN to compute the system accuracy and cost based on apps instead of nodes. The format of the input files is defined by Yang [25]; some of them are modified to fit our system.

CHAPTER 5

IMPLEMENTATION AND EVALUATION

Our system is developed in Python and shell script, consisting of about 1500 lines of Python code and 400 lines of shell-script code. The experiments are conducted on a MacBook Pro with a 2.7 GHz Intel Core i5 processor and 8 GB of memory.

This chapter has two parts. In the first part, we introduce the collected data, the ground-truth labels, and the methods by which we adapt the data into GCN inputs and implement app classification. In the second, we demonstrate the efficiency and effectiveness of AppGrader on over one thousand real-world apps covering 14 app categories.

5.1 Data Collection and Processing

5.1.1 Data Collection

The evaluation of AppGrader is based on Android apps. We therefore apply gplaycli [26], a command-line Google Play downloader, to crawl APK files over 14 app categories: social, browser, mailbox, communication, news, music, reader and books, video, shopping, navigation, food and drink, photography, camera, and travel. We fetch 100-200 apps per category, maintaining a dataset of 1440 apps. An overview of the apps is shown in Table 5.1, including, for each category, the number of apps and the ranges of app rating, downloads, and number of reviews. Gplaycli needs only one command specifying the key words of a certain app category and the number of apps to return a list containing app name, creator, size, downloads, ID, version, and rating; it then downloads these apps by the IDs provided in that list.

Table 5.1: Data overview

Category        Rating (max/min)  Downloads (max/min)      Reviews (max/min)   Apps
Browser         4.7 / 3.7         50,000,000 / 5,000       2,503,696 / 50       74
Camera          4.7 / 3.4         100,000,000 / 50,000     1,509,133 / 245      86
Communication   5.0 / 3.0         1,000,000,000 / 100      64,436,482 / 1      166
Food            4.8 / 2.1         10,000,000 / 100         1,012,627 / 4        88
Mailbox         4.7 / 2.3         1,000,000,000 / 1,000    4,412,334 / 6        68
Music           4.8 / 2.6         1,000,000,000 / 100,000  11,358,607 / 528    136
Navigation      5.0 / 2.4         1,000,000,000 / 100      9,102,685 / 2        63
News            4.8 / 2.7         500,000,000 / 1,000      1,264,518 / 23      107
Photography     4.8 / 3.1         100,000,000 / 50,000     7,447,667 / 59      131
Reader & books  5.0 / 2.0         1,000,000,000 / 100      2,855,294 / 1        96
Shopping        4.8 / 3.1         100,000,000 / 5,000      5,662,790 / 17       92
Social          4.9 / 2.0         1,000,000,000 / 1,000    10,987,443 / 14     105
Travel          5.0 / 2.4         100,000,000 / 50         1,754,207 / 1        98
Video           4.8 / 3.0         100,000,000 / 100,000    9,559,837 / 403     130

5.1.2 Label Assigning

To evaluate system performance, we first need to assign ground-truth labels to the apps. For each category, apps are divided into five classes; that is, we use a scale from one to five on which apps in class five have the best quality and apps in class one the worst. Since there are only 100-200 apps in each category, a relatively small number, it is reasonable to divide the apps into five classes. In a real app market, with hundreds of thousands of apps per category, more classes could be used to accommodate the apps.

The ground-truth label, which indicates real app quality, is usually taken to be the app rating. However, we notice that several popular apps have lower ratings than some little-noticed apps. Thus, we cluster the apps by K-means based on their rating, downloads, and number of reviews, and take the resulting labels as the ground-truth labels, so that GCN learns as accurately as possible. Since we do not know the absolute ground truth, to approximate real app quality and make the test results convincing, we assign three groups of weights to app rating, downloads, and number of reviews, obtaining three different groups of ground-truth labels, under each of which we evaluate the performance of AppGrader. The three groups of clustering weights for app rating, downloads, and number of reviews are: 1) 33%, 33%, and 33%; 2) 40%, 40%, and 20%; 3) 50%, 30%, and 20%, which emphasize the role of the app rating progressively. Afterwards, the ground-truth labels are encoded into one-hot labels and input to GCN.
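The weighted K-means labeling can be sketched as follows. The synthetic numbers, the min-max normalization, and the way the weights are applied (scaling normalized columns) are assumptions for illustration; the thesis clusters its 1440 crawled apps.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic rating / download / review data for 50 hypothetical apps.
rng = np.random.default_rng(0)
ratings = rng.uniform(2.0, 5.0, 50)
downloads = rng.uniform(1e2, 1e9, 50)
reviews = rng.uniform(1, 1e7, 50)

def normalize(x):
    # Min-max scale each attribute to [0, 1] so weights are comparable.
    return (x - x.min()) / (x.max() - x.min())

# One of the thesis's weight groups: 40% rating, 40% downloads, 20% reviews.
weights = np.array([0.4, 0.4, 0.2])
X = np.column_stack([normalize(ratings), normalize(downloads), normalize(reviews)]) * weights

# Cluster into 5 quality levels; cluster ids serve as ground-truth labels.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])
```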

5.1.3 GCN Inputs

The input data of GCN introduced before are merely abstract matrices. In practice, there are six input files for the graph convolutional network: graph, x, y, tx, ty, and NodeRange. The adjacency matrix of the graph containing the apps’ feature view graphs is represented as a dictionary stored in the file named graph. For each category, we take 90% of the apps for training and 10% for testing; accordingly, we separate the feature vectors of training instances from those of testing instances by relegating them to files x and tx respectively. Likewise, we apply the same method to divide the app ground-truth labels between files y and ty. The order of the feature vectors and app labels is in accordance with their corresponding nodes. Since GCN evaluates each node rather than an entire app, we compute app labels by combining the labels of homogeneous nodes; thus we need a file, NodeRange, to record the nodes each app has. With this file we also modify the functions in GCN to compute system accuracy and cost based on apps instead of nodes.

5.2 Experiment Results and Analysis

5.2.1 Efficiency

AppGrader is efficient and automatic from app crawling to app quality evaluation. In app crawling, gplaycli takes only 30 to 60 minutes to download 200 apps. Static app analysis is the most time-consuming stage, where analyzing one app may take 30 seconds to 3 minutes depending on the app's complexity. However, static analysis is conducted on each app only once, so it does not impact the overall system efficiency. App quality evaluation is the most efficient stage: we train a two-layer GCN and apply the default settings of GCN during testing, such as 200 epochs for each run. With two layers, for each node GCN only considers the information stored in its 2-hop neighbors, which is the optimized setting from [9]. As a result, GCN spends no more than one minute evaluating the candidate apps of a given category (the exact times are shown in Table 5.2).
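Since the two-layer setting is what limits each node to its 2-hop neighborhood, a minimal numpy sketch of the forward pass may clarify it. The renormalization trick follows [9]; the weight matrices here are untrained placeholders, not the parameters learned by AppGrader.

```python
import numpy as np

def normalized_adjacency(A):
    """Renormalization trick of [9]: D^{-1/2} (A + I) D^{-1/2} with self-loop degrees."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: softmax(A_hat @ relu(A_hat @ X @ W0) @ W1).

    With two stacked layers, each node's output depends only on features
    within its 2-hop neighborhood in the graph.
    """
    A_hat = normalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0)             # first layer + ReLU
    Z = A_hat @ H @ W1                            # second layer (class logits)
    Z = np.exp(Z - Z.max(axis=1, keepdims=True))  # numerically stable row softmax
    return Z / Z.sum(axis=1, keepdims=True)
```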

5.2.2 Effectiveness

In this section, we discuss the effectiveness of AppGrader from two perspectives: accuracy and label dissimilarity.

5.2.2.1 System Accuracy

We test AppGrader on 1440 real-world apps under three different groups of ground truth labels and list the testing results in Table 5.2. The three groups of weights for generating ground truth labels, assigned to app rating, downloads, and number of reviews respectively, are: (1) 33%, 33%, 33%; (2) 40%, 40%, 20%; (3) 50%, 30%, 20%. For each category, 90% of the apps are used for training and 10% for testing under all conditions. We compute the system accuracy by

Accuracy = True / (True + False)

where True and False denote the numbers of correct and incorrect predictions respectively. The detailed testing results are shown in Appendix C.

Table 5.2: Summary of results in terms of app classification accuracy

First test with weights: 33%, 33%, 33%

Category        | Number of apps | Accuracy | Time (s) | Epochs
Browser         |   74           |  85.71%  |  10.270  |  200
Camera          |   86           |  55.56%  |  14.577  |  200
Communication   |  166           |  73.33%  |  29.890  |  200
Food & drink    |   88           |  77.78%  |  15.335  |  200
Mailbox         |   68           |  85.71%  |  12.785  |  200
Music           |  136           |  69.23%  |  23.301  |  200
Navigation      |   63           |  83.33%  |  10.415  |  200
News            |  107           |  81.81%  |  19.991  |  200
Photography     |  131           |  61.54%  |  22.585  |  200
Reader & books  |   96           |  80.00%  |  16.263  |  220
Shopping        |   92           |  77.78%  |  17.725  |  200
Social          |  105           |  70.00%  |  21.695  |  220
Travel          |   98           |  70.00%  |  16.914  |  200
Video           |  130           |  76.92%  |  22.303  |  210
Total           | 1440           |  72.54%  |  –       |  –

Second test with weights: 40%, 40%, 20%

Category        | Number of apps | Accuracy | Time (s) | Epochs
Browser         |   74           |  71.43%  |  11.364  |  200
Camera          |   86           |  55.56%  |  15.602  |  220
Communication   |  166           |  66.67%  |  33.600  |  200
Food & drink    |   88           |  77.78%  |  15.426  |  200
Mailbox         |   68           |  85.71%  |  13.549  |  200
Music           |  136           |  61.54%  |  24.205  |  180
Navigation      |   63           | 100.00%  |  12.334  |  200
News            |  107           |  72.73%  |  19.195  |  200
Photography     |  131           |  61.54%  |  23.918  |  200
Reader & books  |   96           |  70.00%  |  18.309  |  220
Shopping        |   92           |  66.67%  |  17.062  |  200
Social          |  105           |  60.00%  |  18.527  |  200
Travel          |   98           |  66.67%  |  22.309  |  220
Video           |  130           |  61.54%  |  24.189  |  210
Total           | 1440           |  68.31%  |  –       |  –

Third test with weights: 50%, 30%, 20%

Category        | Number of apps | Accuracy | Time (s) | Epochs
Browser         |   74           |  57.14%  |  11.470  |  200
Camera          |   86           |  66.67%  |  14.773  |  200
Communication   |  166           |  66.67%  |  32.620  |  200
Food & drink    |   88           |  77.78%  |  15.719  |  220
Mailbox         |   68           | 100.00%  |  12.163  |  200
Music           |  136           |  69.23%  |  24.267  |  200
Navigation      |   63           |  66.67%  |  11.022  |  180
News            |  107           |  63.64%  |  19.377  |  220
Photography     |  131           |  53.85%  |  24.675  |  200
Reader & books  |   96           |  70.00%  |  15.657  |  200
Shopping        |   92           |  66.67%  |  18.097  |  210
Social          |  105           |  70.00%  |  18.657  |  210
Travel          |   98           |  66.67%  |  16.902  |  200
Video           |  130           |  69.23%  |  22.966  |  200
Total           | 1440           |  68.09%  |  –       |  –

We evaluate the system accuracy of AppGrader over 14 app categories in three tests, conducted under the ground truth labels generated by the three groups of weights; in the following discussion we call them the first, second, and third test for short. Figure 5.1 compares the resulting system accuracies by histogram, with test accuracy on the x-axis and app category on the y-axis.

Figure 5.1: Comparison result in terms of system accuracy under three groups of ground truth labels

As different groups of ground truth labels are applied, there are slight fluctuations in the system accuracies, which supports the idea that there are no absolute ground truth labels that are commensurate with real app quality and applicable to all apps. Therefore, the ground truth labels for different app categories cannot be determined by only one group of weights. Specifically, from the first test to the third, the ground truth labels place more emphasis on app rating as its weight increases. At the same time, the system accuracies in most app categories, such as social, communication, browser, shopping, news, music, video, photography, travel, and book, decrease, yet remain around 70%, which indicates that the ground truth label increasingly unmoors from the estimated app quality as it becomes dominated by app rating. Therefore, app quality cannot be represented by app rating alone; it is also impacted by app downloads and the number of reviews. By contrast, the accuracies of the second or third test in mailbox, camera, and navigation increase compared with those in the first test, which suggests that the app quality of most apps in these categories, especially the ones in our dataset, is commensurate with their ratings. It is safe to say that most apps in these categories have few fake ratings. Likewise, the accuracies of food apps in the first, second, and third tests are unchanged,

which indicates that most food apps in our dataset also have quality consistent with their ratings, downloads, and reviews. Surprisingly, the accuracy for mailbox apps in the third test and for navigation apps in the second test even reaches 100%. However, we believe these are coincidences, since (1) there is only a small amount of testing data in those two categories, and (2) the accuracies in the remaining two tests are also very high, around 80%.

Figure 5.2: Comparison of propagation models of GCN, reproduced from Kipf and Welling [9]

Description              | Citeseer | Cora | Pubmed
Chebyshev filter (K = 3) |   69.8   | 79.5 |  74.4
Chebyshev filter (K = 2) |   69.6   | 81.2 |  73.8
1st-order model          |   68.3   | 80.0 |  77.5
Single parameter         |   69.3   | 79.2 |  77.4
Renormalization trick    |   70.3   | 81.5 |  79.0
1st-order term only      |   68.7   | 80.5 |  77.8
Multi-layer perceptron   |   46.5   | 55.1 |  71.4

By and large, AppGrader performs best in the first test for most categories, where the weights of app rating, downloads, and number of reviews are equal. Hence, we focus our analysis on the results of the first test. AppGrader performs well with an average accuracy of 72.54% on 1440 apps, of which 1298 are used for training and 142 for testing. Specifically, AppGrader reaches its highest accuracy of 85.71% on browser and mailbox apps and performs satisfactorily on the remaining categories with accuracies around 70%, except for camera apps. Our system reaches its lowest accuracy of 55.56% on camera apps, which can be attributed to the following reasons: (1) the quality of camera apps relies more on their functionality, such as the resolution of the lens, which is a dynamic feature that cannot be captured by AppGrader; (2) the selected ground truth labels cannot fully represent the real quality of camera apps, leading to relatively inaccurate learning by the GCN. Detailed case studies on camera apps are presented later. Nevertheless, our testing result is reasonable relative to the performance of GCN itself. Figure 5.2 summarizes the performance of GCN under different variants of the per-layer propagation model. As it shows, GCN achieves its best performance, with the renormalization trick, at accuracies of 70.3%, 81.5%, and 79.0% on Citeseer, Cora, and Pubmed respectively (all citation networks). Comparing this result with that of AppGrader, whose accuracies are all around 70% and even reach 85% in some categories, we can say that our system performs well, since it reaches, and in some cases exceeds, the accuracy ceiling of GCN.

5.2.2.2 Label Dissimilarity

Test accuracy alone cannot evaluate the overall system performance. If a prediction label is very different from the ground truth label, such as assigning label 5 to an app with ground truth label 1, we cannot regard AppGrader as an ideal system. Therefore, in addition to analyzing the classification accuracy, we also measure the dissimilarity between false prediction labels and the ground truth labels. The label dissimilarity for an app is the absolute value of the difference between the ground truth label and the prediction label; for the former example, the label dissimilarity is 4. We measure the average dissimilarity for each category by

dissimilarity = ( Σ_{i=1}^{n} |ground truth label_i − prediction label_i| ) / (the number of false predictions)

where n is the number of apps in the category. As with system accuracy, we compute the average label dissimilarity for each category under the three different groups of ground truth labels; the results are summarized in Table 5.3 and compared by histogram in Figure 5.3.
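The measure can be computed directly from the label vectors; a small illustrative helper:

```python
def label_dissimilarity(ground_truth, predictions):
    """Average |ground truth - prediction| over the false predictions only.

    Correct predictions contribute 0 to the sum, so summing over all apps
    and dividing by the number of false predictions gives the average
    dissimilarity of the mis-graded apps. Returns 0 when every prediction
    is correct.
    """
    diffs = [abs(g - p) for g, p in zip(ground_truth, predictions)]
    n_false = sum(d > 0 for d in diffs)
    return sum(diffs) / n_false if n_false else 0.0
```

For the example from the text, an app with ground truth label 1 predicted as 5 yields a dissimilarity of 4.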

Table 5.3: Summary of results in terms of average label dissimilarity for false predictions

First test with weights: 33%, 33%, 33%

Category        | Testing apps | False predictions | Average label dissimilarity
Browser         |   7          |  1                | 2.00
Camera          |   9          |  4                | 1.75
Communication   |  15          |  4                | 1.25
Food & drink    |   9          |  2                | 1.00
Mailbox         |   7          |  1                | 2.00
Music           |  13          |  4                | 1.00
Navigation      |   6          |  1                | 1.00
News            |  11          |  2                | 1.00
Photography     |  13          |  5                | 1.40
Reader & books  |  10          |  2                | 1.00
Shopping        |   9          |  3                | 1.33
Social          |  10          |  3                | 1.33
Travel          |  10          |  3                | 1.00
Video           |  13          |  4                | 1.00
Total           | 142          | 39                | 1.26

Second test with weights: 40%, 40%, 20%

Category        | Testing apps | False predictions | Average label dissimilarity
Browser         |   7          |  2                | 1.00
Camera          |   9          |  4                | 1.75
Communication   |  15          |  5                | 1.40
Food & drink    |   9          |  2                | 1.00
Mailbox         |   7          |  1                | 2.00
Music           |  13          |  5                | 1.00
Navigation      |   6          |  0                | 0.00
News            |  11          |  3                | 1.00
Photography     |  13          |  5                | 1.60
Reader & books  |  10          |  3                | 1.67
Shopping        |   9          |  3                | 1.67
Social          |  10          |  4                | 1.25
Travel          |  10          |  3                | 1.67
Video           |  13          |  5                | 1.40
Total           | 142          | 45                | 1.40

Third test with weights: 50%, 30%, 20%

Category        | Testing apps | False predictions | Average label dissimilarity
Browser         |   7          |  3                | 1.33
Camera          |   9          |  3                | 1.33
Communication   |  15          |  5                | 1.20
Food & drink    |   9          |  2                | 2.00
Mailbox         |   7          |  0                | 0.00
Music           |  13          |  4                | 1.00
Navigation      |   6          |  2                | 2.00
News            |  11          |  4                | 1.75
Photography     |  13          |  6                | 1.50
Reader & books  |  10          |  3                | 1.33
Shopping        |   9          |  3                | 2.67
Social          |  10          |  2                | 2.00
Travel          |  10          |  3                | 1.33
Video           |  13          |  4                | 1.25
Total           | 142          | 45                | 1.53

Figure 5.3 compares the average label dissimilarity for each app category across the three tests; the x-axis denotes average label dissimilarity and the y-axis denotes app category. There are again slight fluctuations in label dissimilarity across tests, which again verifies that ground truth labels denoting real app quality are not fixed for all apps. Specifically, the label dissimilarities in the first test are smaller than those in the other tests for most app categories, which suggests that, apart from app rating, app downloads and reviews should also be considered when assigning ground truth labels. Only for browser, communication, and social apps are the label dissimilarities of the second or third test smaller than those of the first test, possibly because the quality of most apps in these categories is consistent with their ratings. Moreover, the range of label dissimilarities for each category is no larger than 1, which demonstrates the stability of our system under different conditions.

Similar to the comparison in terms of system accuracy, AppGrader roughly performs best in the first test, with an average label dissimilarity of 1.26. As shown in Table 5.3, the average label dissimilarities for most categories are around 1, except for browser and mailbox apps. Our system has the largest label dissimilarity of 2 in these two categories because of the small amount of testing data, which will be discussed in detail later. The label dissimilarities in Table 5.3 demonstrate the effectiveness of AppGrader from another perspective: although AppGrader makes false predictions, those predictions are not very different from the ground truth labels.


Figure 5.3: Comparison result in terms of label dissimilarity under three groups of ground truth labels

5.2.2.3 Synthesized Result

To analyze the performance of AppGrader more clearly, we combine the test accuracy with the label dissimilarity of the first test in Figure 5.4. The y-axis on the left is the scale of test accuracy; the y-axis on the right is the scale of label dissimilarity; the x-axis denotes app category. AppGrader performs well in most app categories, such as communication, shopping, news, video, navigation, book, and food, where our system reaches high accuracies of around 80% with small label dissimilarities of around 1. AppGrader reaches its highest accuracy but, at the same time, its largest label dissimilarity on browser and mailbox apps, which can be attributed to the small number of false predictions: since the test accuracy is high, AppGrader makes few false predictions, so the average label dissimilarity of the false predictions is dominated by a few testing apps, and a large label dissimilarity from just one testing app has an obvious impact on the average. Our system reaches low test accuracy but also small label dissimilarity on social, music, and travel apps, which suggests that although it makes relatively many false predictions, those predictions differ only slightly from real app quality. AppGrader performs worst on camera apps, with the lowest test accuracy and a large label dissimilarity. There are two possible reasons: first, the ground truth labels assigned to camera apps fail to represent real app quality; second, the quality of camera apps is dominated by dynamic features that are not captured by our system.


Figure 5.4: Comparison of test accuracy and label dissimilarity of the first test (weights: 33%, 33%, 33%)

5.2.2.4 Case Study

We now examine the camera apps to which AppGrader assigned false prediction labels, analyzing their descriptions and reviews to explore possible reasons for the false predictions. The four mis-estimated camera apps are listed in Table 5.4.

Table 5.4: Mis-estimated camera apps

App name                                           | Creator          | Rating | Downloads    | Number of reviews | Ground truth label | Prediction label
LINE Camera - Photo editor                         | LINE Corporation | 4.3    | 100,000,000+ | 1,514,678         | 5                  | 3
HD Câmera de Alta Qualidade nas Suas Fotos Full HD | iBahia Ti Soft   | 4.3    | 5,000,000+   | 19,991            | 3                  | 4
Free Camera                                        | Aleh Tsitou      | 4.1    | 100,000+     | 353               | 2                  | 5
Camera App                                         | Pixel Studio     | 3.8    | 1,000,000+   | 5,648             | 1                  | 2

LINE Camera, with a ground truth label of 5, is assigned quality level 3. Looking through its description on Google Play, we learn that LINE Camera provides only basic camera functions, such as a timer, flash, mirror mode, level, and grid; it is more like a photo editor offering a large number of filters, photo frames, and stickers. The reviews of LINE Camera show that most users are satisfied with its many beautiful stickers and filters, which contributes to its high rating and downloads. Thus, the ground truth label does represent the actual quality of LINE Camera. However, the filters, photo frames, and stickers offered by LINE Camera must be downloaded from an online library; they are therefore dynamic features not captured by our system, which leads to its low predicted quality level. Moreover, LINE Camera is not outstanding among camera apps, as its camera functions are uncompetitive. As a result, AppGrader gives it an ordinary grade.

HD Câmera de Alta Qualidade nas Suas Fotos Full HD (hereinafter referred to as HD Camera) has a ground truth label of 3 but is graded 4. On Google Play, it receives many positive reviews and 5-star ratings from users, which suggests that the ground truth label of 3 does not represent its real quality. This can possibly be attributed to its unpopularity: HD Camera is a Brazilian app and may not be known to many American users, which is reflected in its relatively few downloads. With few downloads, the ground truth label is not high enough to represent the quality of HD Camera and therefore does not accord with the grade given by AppGrader.

Compared with the other apps in Table 5.4, Free Camera has relatively few downloads and reviews, which results in a ground truth label of 2. However, it has a relatively high rating of 4.1 on Google Play. Checking the reviews of Free Camera, we learn that it is a clone of Open Camera, which is assigned a ground truth label of 4 with an app rating of 4.3, over 10 million downloads, and around 10 thousand reviews. Hence, the user praise that Free Camera receives in its reviews is owed to its functionality being similar to Open Camera's, on the basis of which AppGrader gives Free Camera a higher grade than its ground truth label. Yet the grade of 5 is still higher than Open Camera's ground truth label of 4, which can be attributed to AppGrader ignoring the negative impact of dynamic behaviors such as app crashes and bugs. A few users mention that this app is incompatible with their smartphones, and some report that settings offered by the app, such as "shutter sound", do not work; these are dynamic features not captured by our system. As a result, the grade given by AppGrader accords only with the app's static functionality and is therefore higher than its actual quality.

As shown in Table 5.1, app ratings of camera apps range from 3.4 to 4.7, so it is reasonable to assign a ground truth label of 1 to Camera App with its low rating of 3.8, while AppGrader classifies it to quality level 2 based on its functionality. Analyzing the reviews of Camera App, we find that it is a repackaged version of Open Camera with plenty of advertisements. AppGrader is unable to detect repackaged apps, so it treats this app as the original one and gives it a grade higher than its real quality. However, the prediction label of 2 for Camera App still differs from Open Camera's ground truth label of 4, which can possibly be attributed to the penalty brought by the numerous ads in Camera App.

CHAPTER 6

DISCUSSION

6.1 Limitation

AppGrader evaluates the quality of apps in the same category effectively and efficiently. However, the average system accuracy is not high enough and may fluctuate slightly across different groups of training data, which can be attributed to the following reasons:

1. We cannot perfectly restore the feature view graph for every app by analyzing the de-compiled smali code, which is not completely equivalent to the app's source code. First, some apps, especially those developed by major corporations like Google, are built with obfuscation technology, which is simply applied by modifying the environment settings at the app packaging stage. Under obfuscation, the classes, methods, and variables in an app are replaced by meaningless names, which does not affect a computer or smartphone but does affect our analysis. For example, a human can understand the method call() in class Cellphone but cannot understand method b() in class A, while a computer works normally without knowing what class Cellphone is. Our static app analysis works by tracking specific Android APIs and certain keywords. Unfortunately, if obfuscation is applied, these APIs and keywords are replaced by nonsense names and thereby offer no useful information. Second, some developers prefer to build their own methods and invoke Android APIs inside them. This obstructs our static analysis, since the APIs we trace only offer formal parameters, which are placeholders for arguments; the arguments, which contain the useful information, are passed at runtime and thereby cannot be captured during static analysis.

2. App repackaging and copying may add noise to our dataset. Repackaged and copied apps have structures and features similar to the original app's but carry different labels, which may confuse our system. To solve this problem, we can apply systems like ViewDroid [8] to our dataset prior to app quality evaluation to distinguish repackaged apps from normal apps.

3. The popularity of an app varies across countries, so the ground truth label generated from app rating, downloads, and reviews cannot always represent app quality, which also introduces noise into our dataset. For example, Weibo is popular and thereby has a high rating and many downloads in Chinese app markets. However, in US app markets it receives little attention because of competition from similar apps such as Twitter and Instagram. Although it is excellently designed, it has an ordinary ranking in Google Play, which confuses our system.

4. Fake app ratings and reviews in app markets introduce impurity into our dataset and thereby impact the performance of AppGrader. The ground truth labels we assign to apps are determined by app ratings, downloads, and number of reviews, and therefore deviate noticeably from real app quality in the presence of fake ratings, which causes inaccurate learning. However, this is not a serious problem: abused apps and fake ratings can be detected by systems like [27] before app quality evaluation.

5. Dynamic features such as app crashes and bugs are mentioned by plenty of users in app reviews but are not considered by our system, which contributes to prediction labels higher than the actual app quality. Since AppGrader evaluates app quality based only on code-level features, it grades an app according only to its static functionality, ignoring the negative impact of app crashes and bugs. As a result, the prediction labels of some apps are higher than their ground truth labels, which decreases the system accuracy.
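The obfuscation problem in the first limitation can be illustrated with a toy keyword scan over smali code. This is a hypothetical sketch: the tracked API list and the regular expression are our own, not the thesis implementation, and the obfuscated snippet assumes the framework call is hidden inside a renamed wrapper method.

```python
import re

# An illustrative subset of tracked Android APIs (the thesis tracks UI-related
# APIs and keywords; these two entries are our own examples).
TRACKED_APIS = [
    "Landroid/widget/Button;->setOnClickListener",
    "Landroid/content/Intent;-><init>",
]

def tracked_calls(smali_text):
    """Return the tracked API invocations found in de-compiled smali code."""
    calls = re.findall(r"invoke-\w+(?:/range)?\s+\{[^}]*\},\s+(\S+)", smali_text)
    return [c for c in calls if any(c.startswith(api) for api in TRACKED_APIS)]

# In readable smali, the listener registration is visible to the scan:
plain = """
invoke-virtual {v0, v1}, Landroid/widget/Button;->setOnClickListener(Landroid/view/View$OnClickListener;)V
"""

# After obfuscation, app-defined classes and methods collapse to names like
# La/a;->b(); when the tracked call sits inside such a renamed wrapper, the
# scan sees only the meaningless wrapper invocation and recovers nothing:
obfuscated = """
invoke-virtual {v0, v1}, La/a;->b(La/b;)V
"""
```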

6.2 Future Work

We have tested our system on 1440 apps covering 14 categories, which is not a large dataset. This can be attributed to two reasons. First, our dataset currently contains no game apps, since games are a far-ranging field whose popularity relies more on dynamic features such as game content and smooth interfaces. Second, we need to double-check the category and verify the feature view graph of each app because of the limitations of gplaycli and the drawbacks of decompiling mentioned before. Both limit the number of apps in our dataset. If we could obtain a complete list of apps for each category and their source code from the app market, we could conduct a larger-scale evaluation and obtain more convincing results.

As mentioned before, the performance of AppGrader is limited by the classification model we apply, the graph convolutional network. If we can find another classification model that fits our system better in the future, we believe we can reach new heights.

Currently the feature vector has only 140 dimensions, and we plan to mine more possible features to improve system accuracy. For instance, we can move beyond the constraint of code-level features and mine topics from app descriptions and reviews as app features, by which some dynamic features such as app crashes can be captured. Besides, when we synthesize homogeneous node labels to obtain app labels, we treat each node equally and compute the average of the homogeneous node labels as the app label. However, this is not the optimal app label: since each view plays a different role in an app, we should assign different weights to the node labels in the future. Furthermore, our system currently evaluates app quality based only on static features and helps users to download appropriate apps. In the next stage, we can dissect the weights of different features and thereby give feedback to app developers on how to design popular apps.
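The node-to-app label synthesis described above, together with the proposed weighted variant, can be sketched as follows. The weights are hypothetical, and rounding the fractional average to the nearest quality level is our assumption about how it becomes a discrete label.

```python
def app_label(node_labels, node_weights=None):
    """Combine per-node quality labels into one app-level label.

    With no weights this is the plain average currently used; `node_weights`
    sketches the proposed extension where important views (e.g. the main
    activity) count more. The result is rounded to the nearest quality level.
    """
    if node_weights is None:
        node_weights = [1.0] * len(node_labels)   # current scheme: equal weights
    total = sum(node_weights)
    score = sum(l * w for l, w in zip(node_labels, node_weights)) / total
    return round(score)
```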

Our system can also be applied to detect abnormal apps in app markets. Abnormal apps have faked high ratings and downloads, which heavily limits the effectiveness of current app ranking systems. In the future, AppGrader can evaluate the quality of apps in app markets and compare the results with the ratings provided by the app store; if the evaluation result differs dramatically from the app rating, it is safe to say the app is abnormal.
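As a sketch, this comparison could be a simple rule on the gap between the code-based grade and the rating-based label; the threshold of 2 levels is an assumed parameter, not a value from the thesis.

```python
def flag_abnormal(predicted_quality, rating_based_label, threshold=2):
    """Flag an app as potentially abnormal when its code-based quality grade
    and its rating-based label disagree by more than `threshold` levels.

    A large gap suggests the ratings are inflated (or deflated) relative to
    what the app's code-level features support.
    """
    return abs(predicted_quality - rating_based_label) > threshold
```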

CHAPTER 7

CONCLUSION

In this thesis, we propose AppGrader, an app quality grading system based on code-level features, to explore the possibility of evaluating app quality based on app functionality measured by static features such as user interface and app structure. We test the performance of AppGrader on 1440 real-world apps covering 14 app categories. The experimental results demonstrate the efficiency and effectiveness of our system: (i) AppGrader takes less than a minute to process a dataset of around a hundred apps; (ii) AppGrader performs well on over a thousand apps, with an average system accuracy of 72.54% and an average label dissimilarity of around 1. In summary, AppGrader is qualified to evaluate app quality and present apps to users in a relatively objective order, unaffected by fake ratings and limited user feedback.

BIBLIOGRAPHY

[1] “RankMyApp,” https://www.rankmyapp.com. [2] “Appfigures,” https://appfigures.com.

[3] Liu, B., D. Kong, L. Cen, N. Z. Gong, H. Jin, and H. Xiong (2015) “Personalized Mobile App Recommendation: Reconciling App Functionality and User Privacy Preference,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, Shanghai, China, February 2-6, 2015, pp. 315–324.

[4] Yin, P., P. Luo, W. Lee, and M. Wang (2013) “App recommendation: a contest between satisfaction and temptation,” in Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013, pp. 395–404.

[5] Han, Y., S. Park, and S. Park (2017) “Personalized app recommendation using spatio-temporal app usage log,” Inf. Process. Lett., 124, pp. 15–20.

[6] Lin, J., K. Sugiyama, M. Kan, and T. Chua (2014) “New and improved: modeling versions to improve app recommendation,” in The 37th Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast , QLD, Australia - July 06 - 11, 2014, pp. 647–656.

[7] Karatzoglou, A., L. Baltrunas, K. Church, and M. Böhmer (2012) “Climbing the app wall: enabling mobile app discovery through context-aware recommendations,” in 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, pp. 2527–2530.

[8] Zhang, F., H. Huang, S. Zhu, D. Wu, and P. Liu (2014) “ViewDroid: towards obfuscation-resilient mobile application repackaging detection,” in 7th ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec’14, Oxford, United Kingdom, July 23-25, 2014, pp. 25–36.

[9] Kipf, T. N. and M. Welling (2016) “Semi-Supervised Classification with Graph Convolutional Networks,” CoRR, abs/1609.02907, 1609.02907.

[10] Bae, D., K. Han, J. Park, and M. Y. Yi (2015) “AppTrends: A graph- based mobile app recommendation system using usage history,” in 2015 In- ternational Conference on Big Data and Smart Computing, BIGCOMP 2015, Jeju, South Korea, February 9-11, 2015, pp. 210–216.

[11] Koren, Y., R. M. Bell, and C. Volinsky (2009) “Matrix Factorization Techniques for Recommender Systems,” IEEE Computer, 42(8), pp. 30–37.

[12] Salakhutdinov, R. and A. Mnih (2008) “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo,” in Machine Learning, Pro- ceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pp. 880–887.

[13] Zhu, H., H. Xiong, Y. Ge, and E. Chen (2014) “Mobile app recommen- dations with security and privacy awareness,” in The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pp. 951–960.

[14] Linden, G., B. Smith, and J. York (2003) “Amazon.com Recommenda- tions: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, 7(1), pp. 76–80.

[15] Mooney, R. J. and L. Roy (2000) “Content-based book recommending using learning for text categorization,” in ACM DL, pp. 195–204.

[16] Aizenberg, N., Y. Koren, and O. Somekh (2012) “Build your own music recommender by modeling internet radio streams,” in Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pp. 1–10.

[17] Bell, R. M. and Y. Koren (2007) “Lessons from the Netflix prize challenge,” SIGKDD Explorations, 9(2), pp. 75–79.

[18] Church, K. and B. Smyth (2009) “Understanding the intent behind mobile information needs,” in Proceedings of the 14th International Conference on Intelligent User Interfaces, IUI 2009, Sanibel Island, Florida, USA, February 8-11, 2009, pp. 247–256.

[19] Böhmer, M., B. J. Hecht, J. Schöning, A. Krüger, and G. Bauer (2011) “Falling asleep with Angry Birds, Facebook and Kindle: a large scale study on mobile application usage,” in Proceedings of the 13th Conference on Human-Computer Interaction with Mobile Devices and Services, Mobile HCI 2011, Stockholm, Sweden, August 30 - September 2, 2011, pp. 47–56.

[20] “Wiki page of Android application package,” https://en.wikipedia.org/wiki/Android_application_package.

[21] “Terrier-stop-word-list,” https://github.com/RxNLP/nlp-cloud-apis/blob/master/terrier-stop-word-list.txt.

[22] Blei, D. M., A. Y. Ng, and M. I. Jordan (2003) “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3, pp. 993–1022.

[23] “Android-Apktool: A tool for reverse engineering Android apk files,” https: //ibotpeaches.github.io/Apktool/.

[24] Book, T., A. Pridgen, and D. S. Wallach (2013) “Longitudinal Analysis of Android Ad Library Permissions,” CoRR, abs/1303.0857.

[25] Yang, Z., W. W. Cohen, and R. Salakhutdinov (2016) “Revisiting Semi-Supervised Learning with Graph Embeddings,” in Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 40–48.

[26] “Gplaycli: a Google Play Downloader via Command line,” https://github. com/matlink/gplaycli.

[27] Xie, Z., S. Zhu, Q. Li, and W. Wang (2016) “You can promote, but you can’t hide: large-scale abused app detection in mobile app stores,” in Proceedings of the 32nd Annual Conference on Computer Security Applications, ACSAC 2016, Los Angeles, CA, USA, December 5-9, 2016, pp. 374–385.

APPENDIX A

APP REVIEW ANALYSIS

Table A.1: High-frequency terms counting result (top 500)

Term Frequency Term Frequency Term Frequency Term Frequency app 5830765 sure 155488 issue 89676 write 64008 game 5742955 store 154349 purchase 89534 seen 63893 fun 2467243 10 153867 \ue335 89431 night 63450 time 1316449 ios 153603 plus 89389 alot 63422 awesome 1148299 option 152334 longer 88996 kill 63262 play 1101118 bit 150618 kind 88936 stuck 62781 easy 819153 next 150557 email 88710 car 62678 update 727803 definitely 150091 apple 88447 camera 62514 way 642810 hope 148207 reason 88445 tap 62505 free 637549 controls 146112 quality 88273 sync 62492 cool 634538 load 143743 addicted 87543 system 62451 work 611489 bought 142543 frustrating 87054 seconds 62409 new 597276 helpful 142236 reading 87001 wow 62388 fix 592769 highly 141301 control 86553 spent 62221 5 578739 downloaded 140703 lost 86547 faster 62221 nice 571875 save 140015 points 86471 design 62220 games 555926 fixed 138316 99 86283 person 62123 works 552972 annoying 137192 awsome 86273 friend 62017 well 549613 trying 137186 view 85423 horrible 61626 \ud83d 530216 price 135850 accurate 85385 weather 61566 520472 show 135686 words 85309 100 61044 want 503626 support 134259 code 85252 call 60920 playing 465157 problems 133994 deleted 85182 cost 60867 see 452193 ads 133841 6 85069 terrible 60737 version 440695 review 133419 place 84871 message 60520 437430 slow 133355 cards 84209 friendly 60495 still 436954 working 132722 tell 84110 plz 60480 worth 433236 crash 132612 pics 83931 notes 60269 add 430946 maybe 131838 dont 83536 liked 60234 money 430016 thought 131539 end 83073 believe 60010 keep 429928 video 131243 wrong 82972 line 59804 crashes 424664 touch 130092 player 82188 pick 59675 wish 419174 watch 128303 picture 81209 loading 59477 needs 416630 track 127667 songs 80828 date 59414 Continued on next page Table A.1 – continued from previous page Term Frequency Term Frequency Term Frequency Term Frequency stars 413463 pictures 126972 ur 79766 order 59391 3 407436 helps 125915 name 79645 internet 59336 buy 
405212 making 125708 high 79632 song 59253 back 403914 list 125505 added 79284 20 59185 amazing 401465 funny 124536 sounds 79168 handy 59163 2 399102 paid 124085 easily 78255 8 59108 think 398963 easier 122505 learn 78117 extra 59084 addicting 398513 minutes 122383 log 77941 recommended 59014 apps 393781 totally 122079 original 77940 came 58783 first 392648 especially 121564 type 77802 ask 58693 pretty 384293 sound 121448 may 77466 okay 58636 people 367195 computer 121221 past 77195 saying 58560 little 349343 loved 120268 im 77052 cause 58321 give 346043 home 119876 couple 76953 maps 57814 know 334499 big 119143 second 76468 allows 57651 far 331034 worked 119018 choose 76427 half 56848 able 330480 days 118993 needed 76193 map 56755 4 313190 options 118991 wifi 75958 later 56455 try 306455 button 118851 week 75955 voice 56413 makes 302783 interface 117689 multiplayer 75903 connection 56411 graphics 297585 online 117548 tool 75901 gave 56303 find 296065 feel 116034 click 75837 space 56295 levels 291248 soon 115879 difficult 75479 single 56268 day 286870 away 115718 site 75059 puzzle 56209 screen 280734 gameplay 114708 photo 75056 seriously 56134 hard 279613 check 113911 goes 74982 error 55946 using 272768 user 113874 took 74443 today 55865 1 271598 quick 113574 family 74386 browser 55852 addictive 269344 ability 113569 website 74371 mean 55843 thanks 266677 updated 113555 coming 74267 sleep 55800 level 264403 boring 112843 listen 74223 pc 55698 keeps 264073 entertaining 112702 bug 73533 yes 55620 phone 263604 mode 112443 number 73442 30 55591 simple 263152 facebook 112037 ones 73007 convenient 54902 times 261923 looks 111196 multiple 72048 earn 54767 recommend 260357 upgrade 111187 exactly 71760 loves 54693 put 247008 shows 111150 wanted 71721 black 54607 played 241128 reviews 110823 rating 71659 weapons 54529 download 240903 search 110105 quickly 71581 missing 54476 right 239462 kids 110019 create 71501 crashed 54293 bad 234100 delete 109735 beautiful 71474 itunes 
54171 music 226679 absolutely 109385 service 71356 effects 54068 pay 223490 coins 108863 characters 71062 stay 53886 let 223218 year 108856 win 70831 connect 53777 friends 221352 fine 108691 extremely 70598 impossible 53672 enjoy 215354 life 108561 luv 70530 web 53306 problem 213896 turn 107228 players 70520 future 52789 perfect 211022 beat 106289 mind 70291 month 52703 open 210139 started 106073 everyday 70136 color 52650 long 209877 set 106016 data 70120 crap 52613 ipod 208452 stupid 106001 simply 70116 worst 52525 favorite 207401 lol 105460 tv 70116 amount 52344 going 206691 least 105260 buying 70034 purchased 52306 look 206217 \ud83c 104515 forward 69782 enjoyable 52106 say 204532 top 102753 mobile 69032 downloading 51783 made 203930 idea 102184 constantly 68935 sweet 51425 take 203885 pass 101892 close 68634 taking 51237 old 202918 application 101892 anymore 68601 enjoyed 50903 help 200663 issues 101367 comes 68568 concept 50821 real 198702 account 101233 freezes 68315 messages 50726 ok 198571 rate 101074 google 68213 battery 50480 super 195532 run 100763 completely 67776 huge 50365 start 188073 news 99679 understand 67675 title 50360 looking 186575 bugs 99249 complete 67177 background 50328 tried 185992 point 98792 kinda 67132 school 49893 Continued on next page

55 Table A.1 – continued from previous page Term Frequency Term Frequency Term Frequency Term Frequency star 185588 story 98318 bored 66901 progress 49789 challenging 184233 gives 97866 months 66611 edit 49511 useful 182915 items 97328 glad 66381 users 49397 read 182206 photos 97168 challenge 66286 paying 49392 says 180106 access 96990 latest 66208 follow 49006 features 178181 happy 96903 figure 66207 team 48961 thank 176455 fantastic 96631 left 66083 variety 48742 hours 169769 part 96517 hit 66033 thinking 48728 full 168724 videos 96104 books 65894 food 48589 takes 167615 wonderful 95914 small 65583 score 48393 fast 167058 7 94494 lose 65291 current 48356 updates 166184 whole 93470 content 65263 continue 48333 actually 165765 program 93282 fan 65241 enter 48328 stuff 164894 live 92994 short 65110 giving 48280 crashing 163609 daily 92922 puzzles 64975 function 48193 change 162333 card 92500 ago 64870 sign 48053 wait 159569 world 92439 useless 64615 speed 47885 waste 159068 move 92261 allow 64600 decent 47652 job 159017 spend 92090 disappointed 64453 birds 47644 cute 158966 page 91872 book 64295 possible 47628 last 158932 years 91236 guess 64292 tons 47520 excellent 158211 waiting 91227 experience 64272 angry 47440 feature 157911 text 90920 went 64233 product 47414 found 157888 word 89965 running 64164 truly 47327 guys 157685 finally 89734 device 64104 chat 47119 come 156353 interesting 89721 developers 64061 developer 47110
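The counts in Table A.1 come from a tokenize, stop-word-filter, and count pass over the review corpus. The sketch below is a minimal illustration rather than the thesis's actual pipeline: `count_terms`, the toy reviews, and the tiny stop set are hypothetical, standing in for the 18 million reviews and the Terrier stop-word list [21] used in the thesis.

```python
from collections import Counter
import re

def count_terms(reviews, stop_words):
    """Tokenize reviews, drop stop words, and tally term frequencies."""
    counts = Counter()
    for review in reviews:
        # Lowercase and keep alphanumeric tokens (apostrophes allowed),
        # roughly matching the mixed words/numbers seen in Table A.1.
        for token in re.findall(r"[a-z0-9']+", review.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts

# Hypothetical toy inputs; the real stop list is the Terrier list [21].
stop = {"the", "is", "a", "and", "it"}
reviews = ["The app is fun and easy", "Fun game, the app crashes"]
counts = count_terms(reviews, stop)
```

Over the full corpus, `counts.most_common(500)` would produce a top-500 list of the kind shown in Table A.1.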

Table A.2: Topic mining result (This list is in no particular order.)

Topic Key words 1 0.130∗sing, 0.101∗right, 0.038∗page, 0.035∗online, 0.034∗exactly, 0.032∗type, 0.026∗map, 0.020∗concept, 0.019∗eat, 0.016∗improve 2 0.095∗find, 0.085∗thanks, 0.069∗phone, 0.053∗fond, 0.042∗working, 0.030∗wrong, 0.025∗may, 0.025∗needed, 0.021∗songs, 0.021∗disappointed 3 0.156∗version, 0.070∗let, 0.048∗photo, 0.044∗application, 0.043∗maybe, 0.030∗rate, 0.029∗kind, 0.028∗6, 0.022∗score, 0.021∗latest 4 0.080∗featres, 0.050∗stpid, 0.050∗annoying, 0.048∗\e415, 0.041∗\e00e, 0.035∗controls, 0.027∗tracking, 0.027∗editing, 0.026∗line, 0.025∗\e421 5 0.066∗spport, 0.049∗search, 0.039∗pls, 0.035∗accont, 0.035∗high, 0.033∗longer, 0.029∗horrible, 0.024∗tap, 0.021∗anymore, 0.021∗code 6 0.103∗screen, 0.081∗take, 0.076∗save, 0.052∗helps, 0.051∗show, 0.035∗top, 0.034∗website, 0.032∗easily, 0.029∗week, 0.027∗notes 7 0.124∗know, 0.103∗4, 0.070∗thank, 0.055∗sre, 0.041∗fnny, 0.040∗feel, 0.033∗text, 0.028∗date, 0.027∗name, 0.024∗qestions 8 0.127∗ipad, 0.084∗simple, 0.064∗download, 0.049∗levels, 0.027∗words, 0.024∗voice, 0.019∗complete, 0.016∗content, 0.016∗player, 0.015∗hor 9 0.139∗3, 0.108∗makes, 0.058∗level, 0.048∗gys, 0.035∗next, 0.031∗added, 0.028∗scores, 0.024∗im, 0.023∗worst, 0.021∗enter 10 0.086∗ipod, 0.073∗toch, 0.022∗matter, 0.022∗baby, 0.022∗sleep, 0.021∗\0627, 0.017∗created, 0.016∗finished, 0.015∗means, 0.014∗\0644 11 0.133∗want, 0.099∗money, 0.093∗by, 0.048∗long, 0.048∗real, 0.039∗takes, 0.037∗pay, 0.030∗gives, 0.025∗daily, 0.022∗figre 12 0.140∗pdate, 0.108∗fix, 0.082∗crashes, 0.043∗problem, 0.037∗start, 0.032∗challenging, 0.029∗fixed, 0.029∗load, 0.028∗crash, 0.028∗video 13 0.111∗playing, 0.098∗1, 0.080∗sefl, 0.055∗check, 0.053∗definitely, 0.040∗weather, 0.032∗click, 0.031∗reason, 0.030∗r, 0.024∗recommended 14 0.237∗well, 0.056∗ability, 0.053∗set, 0.027∗calendar, 0.026∗song, 0.024∗kinda, 0.022∗i, 0.021∗alot, 0.021∗bored, 0.018∗writing 15 0.165∗see, 0.057∗option, 0.038∗idea, 0.036∗loves, 0.033∗come, 0.033∗absoltely, 0.030∗qality, 0.030∗wanted, 0.023∗design, 
0.021∗completely 16 0.176∗think, 0.104∗looking, 0.054∗pictre, 0.042∗trn, 0.035∗color, 0.032∗reading, 0.029∗extremely, 0.029∗pass, 0.018∗light, 0.018∗children 17 0.080∗helpfl, 0.061∗cte, 0.044∗slow, 0.038∗tell, 0.037∗access, 0.037∗started, 0.026∗convenient, 0.021∗settings, 0.020∗challenge, 0.019∗perfectly 18 0.197∗games, 0.110∗bad, 0.069∗options, 0.066∗wait, 0.040∗dont, 0.040∗lost, 0.034∗months, 0.030∗pick, 0.029∗pload, 0.019∗twice 19 0.278∗nice, 0.063∗items, 0.057∗ads, 0.037∗nmber, 0.031∗crashed, 0.028∗gameplay, 0.026∗device, 0.019∗gy, 0.016∗database, 0.014∗happens 20 0.051∗shows, 0.041∗prchase, 0.036∗qickly, 0.035∗control, 0.034∗sync, 0.031∗class, 0.031∗mean, 0.027∗taking, 0.025∗case, 0.023∗awsome 21 0.081∗sond, 0.056∗part, 0.049∗sonds, 0.028∗car, 0.024∗fnctionality, 0.019∗tracks, 0.019∗setting, 0.018∗shop, 0.017∗starts, 0.017∗link 22 0.117∗amazing, 0.077∗favorite, 0.057∗addictive, 0.045∗tool, 0.025∗missing, 0.023∗de, 0.021∗clear, 0.019∗system, 0.017∗corse, 0.013∗bible 23 0.063∗pics, 0.031∗original, 0.024∗example, 0.022∗clean, 0.020∗improvement, 0.018∗clock, 0.018∗workot, 0.016∗pdating, 0.012∗hd, 0.011∗drive 24 0.365∗awesome, 0.064∗ser, 0.045∗learning, 0.040∗data, 0.038∗forward, 0.038∗order, 0.035∗friendly, 0.030∗wifi, 0.029∗daghter, 0.029∗mobile 25 0.094∗recommend, 0.073∗say, 0.072∗made, 0.061∗change, 0.049∗highly, 0.043∗looks, 0.038∗pgrade, 0.031∗finally, 0.029∗took, 0.025∗glad Continued on next page

57 Table A.2 – continued from previous page Topic Key words 26 0.142∗needs, 0.056∗trying, 0.036∗view, 0.028∗allow, 0.027∗developers, 0.021∗extra, 0.021∗saying, 0.021∗basic, 0.018∗fish, 0.016∗paper 27 0.080∗addicting, 0.074∗friends, 0.069∗hope, 0.064∗10, 0.057∗especially, 0.049∗away, 0.041∗live, 0.033∗rn, 0.033∗99, 0.030∗mind 28 0.247∗apps, 0.098∗ok, 0.034∗liked, 0.028∗intitive, 0.021∗hand, 0.021∗creative, 0.018∗art, 0.018∗mood, 0.015∗starting, 0.015∗soo 29 0.132∗first, 0.047∗interface, 0.046∗hors, 0.039∗wonderfl, 0.037∗life, 0.037∗happy, 0.034∗mode, 0.030∗isse, 0.029∗end, 0.028∗second 30 0.095∗fll, 0.053∗qick, 0.032∗apple, 0.029∗simply, 0.027∗everyday, 0.025∗record, 0.023∗post, 0.022∗later, 0.020∗follow, 0.017∗beginning 31 0.250∗cool, 0.150∗day, 0.064∗loved, 0.047∗fantastic, 0.033∗crrent, 0.023∗plan, 0.022∗christmas, 0.022∗total, 0.020∗crazy, 0.018∗finish 32 0.171∗wish, 0.131∗people, 0.106∗old, 0.051∗problems, 0.041∗story, 0.030∗listen, 0.026∗school, 0.017∗expect, 0.015∗lets, 0.015∗lessons 33 0.195∗iphone, 0.100∗look, 0.098∗pt, 0.087∗going, 0.031∗site, 0.021∗pic, 0.019∗given, 0.015∗finger, 0.014∗desktop, 0.013∗taken 34 0.161∗worth, 0.070∗actally, 0.064∗program, 0.063∗downloaded, 0.048∗accrate, 0.041∗weight, 0.030∗entertaining, 0.029∗allows, 0.026∗believe, 0.025∗bying 35 0.609∗game, 0.305∗fn, 0.006∗informative, 0.005∗movie, 0.005∗morning, 0.005∗heard, 0.004∗pointless, 0.003∗professional, 0.003∗ppl, 0.003∗corses 36 0.776∗app, 0.021∗kids, 0.021∗list, 0.018∗store, 0.017∗sper, 0.013∗food, 0.008∗backgrond, 0.007∗helped, 0.006∗variety, 0.005∗random 37 0.066∗review, 0.057∗delete, 0.046∗compter, 0.045∗least, 0.044∗place, 0.037∗seless, 0.035∗ones, 0.028∗left, 0.023∗previos, 0.019∗tells 38 0.142∗5, 0.104∗add, 0.104∗stars, 0.091∗give, 0.084∗try, 0.066∗keeps, 0.060∗msic, 0.048∗enjoy, 0.037∗crashing, 0.027∗effects 39 0.183∗new, 0.131∗able, 0.077∗graphics, 0.047∗pdated, 0.040∗points, 0.037∗isses, 0.032∗bgs, 0.030∗freezes, 0.020∗keeping, 0.019∗adding 40 0.105∗back, 0.102∗2, 
0.055∗perfect, 0.031∗days, 0.028∗email, 0.025∗7, 0.021∗developer, 0.019∗went, 0.018∗comes, 0.016∗fnction 41 0.085∗featre, 0.066∗boght, 0.063∗ios, 0.041∗whole, 0.029∗gave, 0.022∗possible, 0.020∗men, 0.018∗0, 0.016∗omg, 0.016∗calories 42 0.140∗8.3d, 0.058∗track, 0.048∗says, 0.047∗8.3c, 0.045∗read, 0.026∗news, 0.023∗create, 0.021∗write, 0.017∗stay, 0.016∗constantly 43 0.129∗far, 0.105∗tried, 0.067∗pdates, 0.067∗worked, 0.057∗home, 0.057∗easier, 0.028∗book, 0.028∗share, 0.024∗software, 0.018∗animals 44 0.307∗time, 0.224∗easy, 0.156∗way, 0.032∗waste, 0.028∗learn, 0.020∗interesting, 0.019∗difficlt, 0.012∗alarm, 0.011∗trly, 0.011∗exercise 45 0.195∗work, 0.155∗works, 0.117∗keep, 0.045∗photos, 0.042∗excellent, 0.037∗making, 0.025∗move, 0.024∗word, 0.022∗cople, 0.021∗pzzles 46 0.228∗play, 0.182∗free, 0.052∗star, 0.043∗price, 0.040∗paid, 0.038∗reviews, 0.031∗point, 0.024∗edit, 0.024∗beat, 0.021∗rating 47 0.100∗job, 0.081∗stff, 0.061∗totally, 0.044∗goes, 0.036∗card, 0.036∗ftre, 0.029∗itnes, 0.027∗foods, 0.026∗hose, 0.026∗changes 48 0.107∗little, 0.085∗times, 0.085∗still, 0.044∗thoght, 0.044∗bit, 0.043∗big, 0.035∗boring, 0.030∗deleted, 0.028∗choose, 0.027∗mltiple 49 0.150∗pretty, 0.114∗hard, 0.072∗played, 0.072∗last, 0.072∗fast, 0.055∗fine, 0.045∗videos, 0.027∗enjoyed, 0.027∗filters, 0.021∗talking 50 0.076∗help, 0.072∗open, 0.061∗pictres, 0.052∗year, 0.042∗btton, 0.037∗mintes, 0.036∗watch, 0.034∗years, 0.028∗lol, 0.027∗beatifl

APPENDIX B

DETAILED FEATURE VECTOR

Table B.1: Detailed feature vector

TextView LinearLayout Button ScrollView AbsoluteLayout RadioButton ListView GridLayout ImageButton WebView AppBarLayout EditText GridView TableLayout Toolbar SurfaceView FrameLayout SeekBar ImageView TabLayout ProgressBar RecyclerView RelativeLayout RatingBar widgets HorizontalScrollView CoordinatorLayout ViewFlipper NestedScrollView SwipeRefreshLayout ViewStub AutoCompleteTextView RadioGroup ViewPager view features TextureView Gallery Spinner CardView CheckBox DatePicker VideoView SelfDefined TimePicker The number of colors used in the interface The number of fonts used in the interface The number of pictures used in the interface The number of interaction methods used in the interface android.permission.READ CALL LOG android.permission.WRITE CALL LOG com.android.launcher.permission.INSTALL SHORTCUT com.android.launcher.permission.UNINSTALL SHORTCUT android.permission.RECORD AUDIO android.permission.MODIFY AUDIO SETTINGS permissions

app features android.permission.VIBRATE android.permission.BAIDU LOCATION SERVICE android.permission.READ PHONE STATE android.permission.READ LOGS android.permission.BROADCAST STICKY android.permission.WRITE SETTINGS android.permission.WAKE LOCK android.permission.RESTART PACKAGES android.permission.KILL BACKGROUND PROCESSES android.webkit.permission.PLUGIN android.permission.DISABLE KEYGUARD com.android.browser.permission.WRITE HISTORY BOOKMARKS com.android.browser.permission.READ HISTORY BOOKMARKS android.permission.CAMERA android.permission.WRITE APN SETTINGS android.permission.GET TASKS android.permission.ACCESS CHECKIN PROPERTIES android.permission.ACCESS LOCATION EXTRA COMMANDS android.permission.ACCESS SURFACE FLINGER android.permission.ACCOUNT MANAGER android.permission.AUTHENTICATE ACCOUNTS android.permission.BATTERY STATS android.permission.BIND APPWIDGET android.permission.BIND DEVICE ADMIN android.permission.BIND INPUT METHOD android.permission.BIND REMOTEVIEWS android.permission.BIND WALLPAPER android.permission.BLUETOOTH android.permission.BLUETOOTH ADMIN android.permission.BRICK permissions app features android.permission.BROADCAST PACKAGE REMOVED android.permission.BROADCAST WAP PUSH android.permission.CHANGE COMPONENT ENABLED STATE android.permission.CHANGE CONFIGURATION android.permission.CHANGE NETWORK STATE android.permission.CHANGE WIFI MULTICAST STATE android.permission.CLEAR APP CACHE android.permission.CLEAR APP USER DATA android.permission.CWJ GROUP android.permission.CELL PHONE MASTER EX android.permission.CONTROL LOCATION UPDATES android.permission.DELETE CACHE FILES android.permission.DELETE PACKAGES android.permission.DEVICE POWER android.permission.DIAGNOSTIC android.permission.DUMP android.permission.EXPAND STATUS BAR android.permission.FACTORY TEST android.permission.FLASHLIGHT android.permission.FORCE BACK android.permission.GET ACCOUNTS

60 android.permission.GET PACKAGE SIZE android.permission.GLOBAL SEARCH android.permission.HARDWARE TEST android.permission.INJECT EVENTS android.permission.INSTALL LOCATION PROVIDER android.permission.INSTALL PACKAGES android.permission.INTERNAL SYSTEM WINDOW android.permission.MANAGE ACCOUNTS android.permission.MANAGE APP TOKENS android.permission.MTWEAK USER android.permission.MTWEAK FORUM android.permission.MASTER CLEAR android.permission.MODIFY PHONE STATE android.permission.MOUNT FORMAT FILESYSTEMS android.permission.NFC android.permission.PERSISTENT ACTIVITY android.permission.PROCESS OUTGOING CALLS android.permission.READ CALENDAR android.permission.READ INPUT STATE android.permission.READ SYNC SETTINGS android.permission.READ SYNC STATS android.permission.REBOOT android.permission.RECEIVE MMS android.permission.RECEIVE WAP PUSH android.permission.REORDER TASKS android.permission.SET ACTIVITY WATCHER com.android.alarm.permission.SET ALARM android.permission.SET ALWAYS FINISH permissions app features android.permission.SET ANIMATION SCALE android.permission.SET DEBUG APP android.permission.SET ORIENTATION android.permission.SET PREFERRED APPLICATIONS android.permission.SET PROCESS LIMIT android.permission.SET TIME android.permission.SET TIME ZONE android.permission.SET WALLPAPER android.permission.SET WALLPAPER HINTS android.permission.SIGNAL PERSISTENT PROCESSES android.permission.STATUS BAR android.permission.SUBSCRIBED FEEDS READ android.permission.SUBSCRIBED FEEDS WRITE android.permission.SYSTEM ALERT WINDOW android.permission.UPDATE DEVICE STATS android.permission.USE CREDENTIALS android.permission.USE SIP android.permission.WRITE CALENDAR android.permission.WRITE GSERVICES android.permission.WRITE SECURE SETTINGS android.permission.WRITE SYNC SETTINGS

android.permission.READ APP BADGE selfdefined The number of activities in the app The number of edges in the app The number of files in the app

app features The size of the app The proportion of ads in the app
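The features in Table B.1 can be flattened into a fixed-length numeric vector per app by counting occurrences against an ordered schema. The sketch below is illustrative only: `SCHEMA` is a tiny hypothetical subset of the table's widgets and permissions, and `to_feature_vector` is not the thesis's actual extraction code.

```python
from collections import Counter

# Hypothetical feature schema: an ordered subset of Table B.1's
# view widgets and permissions; the real vector covers every entry.
SCHEMA = [
    "TextView", "Button", "ListView", "WebView",
    "android.permission.CAMERA", "android.permission.BLUETOOTH",
]

def to_feature_vector(app_features):
    """Map an app's extracted widget/permission occurrences to counts
    in schema order; entries outside the schema are ignored."""
    counts = Counter(app_features)
    return [counts[name] for name in SCHEMA]

# Example: an app with two TextViews, one Button, and the CAMERA permission.
vec = to_feature_vector(["TextView", "TextView", "Button",
                         "android.permission.CAMERA"])
```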

APPENDIX C

TESTING RESULT OF APPGRADER

Table C.1: Testing result of AppGrader on 123 apps in the first test (weights: 33%, 33%, 33%)

Ground truth Prediction Ground truth Prediction App App label label label label musical-ly 5 5 LINE 5 4 Tango 5 5 drupe 4 4 Live-me 4 4 chomp-SMS 4 3 Waplog 4 4 textPlus 4 4 Moco 4 4 Rebtel 4 4 JoYo 3 4 YAATA 3 3 Friendthem 2 4 Messages 3 3 Node-All- 2 2 ReosSMS 3 3 socialmedia easymind 2 2 Pinngle 3 3 fbb-Social 1 2 PrivacyMessenger 3 4 AceMessenger 2 4 mCentBrowser 4 4 VoiceText 1 1 myMail 5 3 RaidCall 1 1 Solmail 4 4 FriendChatRandom 1 1 TempMail 4 4 stranger- 1 1 AstroMail 3 3 anonymous-chat DU-Browser 5 3 emailsuite 3 3 UTL-Browser 2 2 mailboxforOutlook 2 2 Fox-Browser 1 1 LotusiNotesClient 1 1 Continued on next page Table C.1 – continued from previous page Ground truth Prediction Ground truth Prediction App App label label label label Venus-Browser 3 3 letgo 5 3 Lynket-Browser 3 3 Poshmark 4 4 Hover-Browser 2 2 Carousell 4 4 Fancy 4 3 ZAKER 4 4 6PM 3 3 ZhiHuDaily 3 3 11street 4 3 WorldNews 3 3 Pinkoi 3 3 inklnews 2 3 Bnft 2 2 YiZhouKan 1 1 IKEAStore 1 1 VICENews 1 1 BBCNews 5 5 slacker-radio 5 5 Reddit 5 4 saavn-android 5 5 RTNews 4 4 datpiff-mobile 4 5 USATODAY 4 4 djit-apps-stream 4 4 VnExpress-net 3 3 doubleTwist- 4 5 androidPlayer dywx-larkplayer 4 4 MovieMakerApps- 4 4 VideoMaker- Slideshow carvalhosoftware- 3 4 MovieMakerApps- 4 4 musicplayer VideoShow- VideoSlide fotoable-mp3- 3 3 jaineel- 4 4 musicplayer videoconvertor rhmsoft-pulsar 3 3 jsn-hdvideoplayer 4 4 scn-musicplayer 3 3 uplayer-video- 3 4 player dek-music 2 2 ph-app- 3 3 birthdayvideomaker cloudaround 2 3 music-video-edit 3 4 org- 1 1 audiocutter- 2 3 newappscreator- videocutter- freemp3music audiovideocutter video-player-audio- 5 4 outthinking- 1 1 player-music vediocollage BestPhotoEditor- 4 4 outthinking- 1 1 MusicVideoMaker mutevideo BestPhotoEditor- 4 4 zentertain- 5 4 PhotoVideoMaker photoeditor image-photoedit- 4 3 tpc-photo-effects- 1 3 photogallery color-filter mobi-charmer- 4 4 jp-naver- 5 3 collagequick linecamera-android mobi-charmer- 4 4 flavionet-android- 4 4 fotocollage 
camera-lite pixelate- 3 3 br-ibahiatisoft-hd- 3 4 photoEditor- camhdquality photoBlender BeachPhoto- 3 3 perfect-collage-art- 3 3 Framesnnn effects Continued on next page

64 Table C.1 – continued from previous page Ground truth Prediction Ground truth Prediction App App label label label label BestPhotoEditor- 3 3 picocamp-b921cam- 3 3 PhotoCollage- b612-hd-camera- PhotoFrames perfectselfie standoffish-book- 3 3 net-sourceforge- 2 5 photo-frame freecamera universaldream- 3 3 net-sourceforge- 2 2 filterart- opencameraremote instastudio- photoeditorpro clicklab-nature- 2 2 co-akashi-app- 1 2 photo-frame camera codeadore-textgram 2 3 video-photo-hd- 1 1 camera tpc-color-effects- 1 3 waze 5 5 photo-editor net-daum-android- 4 4 tripwolf-miniapp- 2 2 map 472-free net-osmand 4 4 tripwolf-miniapp- 2 2 543-free es-naviontruck- 3 3 tripwolf-miniapp- 2 2 truck-bor1-es 660-free trending-corner- 2 1 tripwolf-miniapp- 2 2 gpsroutefinder 667-free spin-smart- 1 1 navitime-travel 1 2 gpsroutefinder tripit 5 5 de-travelapp 3 2 pt- 4 4 net-skyscanner- 5 5 turismodeportugal- android-main android- visitportugal aldygame-mytrips- 4 3 com-reubro- 3 4 cheap-hotel-flight- instafreebie cheaptotrips- payout-travelok ak-alizandro- 4 4 com-ace- 2 2 smartaudio- banglaebook bookplayer com-audiobooks- 4 4 com-altech- 2 2 androidapp ahadithbookreader com-rainmaker- 3 3 com-amaltaas- 2 2 books-selfhelp pustak com-realtime-free- 3 3 uk-co-lrb-mag 1 2 books com-google- 5 5 com-application- 5 5 android-apps-books zomato-ordering com-eat24-app 5 5 com-neighbfav- 4 4 neighborfavor com-yelp-android 5 5 com-foodfusion- 3 3 foodfusion com-mobilaurus- 4 4 com-free- 3 4 wingstopandroid foodrecipes Continued on next page

65 66

Table C.1 – continued from previous page Ground truth Prediction Ground truth Prediction App App label label label label com-ncconsulting- 4 4 com-foodplating- 1 2 skipthedishes- azt android