SENSITIVE TEXT ICON CLASSIFICATION

FOR ANDROID APPS

by

ZHIHAO CAO

Submitted in partial fulfillment of the requirements for the degree of

Master of Science

Thesis Advisor: Dr. Xusheng Xiao

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2018

CASE WESTERN RESERVE UNIVERSITY SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Zhihao Cao

candidate for the degree of Master of Science.

Committee Chair

Xusheng Xiao

Committee Member

Andy Podgurski

Committee Member

Ming-Chun Huang

Date of Defense

Nov. 30 2017

*We also certify that written approval has been obtained for any proprietary material contained therein.

Contents

List of Tables 3

List of Figures 4

List of Abbreviations 7

Abstract 8

1 Introduction 9

2 Background 16

2.1 Permission System in Android…………………………………………………..16

2.2 Sensitive UI Detection in Android……………………………………………….17

2.3 Pixel and Color Model…………………………………………………………...19

2.4 Optical Character Recognition…………………………………………………...21

3 Design of DroidIcon 23

3.1 Overview…………………………………………………………..……………..23

3.2 Image Mutation…………………………………………………………………..24

3.2.1 Image Scaling………………………………………………………24

3.2.2 Color Inversion…………………………………………………………….31

3.2.3 Opacity Conversion………………………………………………..33

3.2.4 Grayscale Conversion………..…………………………………………….37

3.2.5 Contrast Adjustment……………………………………………………….42

3.3 Text Icon Classification 48

3.3.1 Text Cleaning……………………...……………………………………….48

3.3.2 Keyword Dataset Construction…………………………………………….49

3.3.3 Classification Algorithm…………………………………………………...50

3.4 DroidIcon………..……………………………………………………………….58

4 Evaluations 60

4.1 Icon Dataset Construction………………………………………………………..60

4.2 Effectiveness of DroidIcon…..…………………………………………………..61

4.3 Case Study……………………………………………………………………….73

5 Related Work 76

6 Discussion and Conclusion 78

Bibliography 80


List of Tables

3.1 The pseudo code for image scaling……...……………………………….30

3.2 The pseudo code for color inversion……………………………………..32

3.3 The pseudo code for opacity conversion…………...……………………37

3.4 The pseudo code for grayscale conversion….…...………………………40

3.5 The pseudo code for contrast adjustment………………………………...46

3.6 Keyword Set……………………………………………………………...49

3.7 The pseudo code for text icon classification……………………………..57

3.8 The pseudo code for DroidIcon …………………………………………58


List of Figures

1.1 Motivation example of DroidIcon……………………..………………...13

2.1 Screenshots of Android Permission Requests………….………………...17

2.2 Example sensitive text label……………………………………………...18

2.3 Example of pixels in an image…………………………………………...19

2.4 RGB color model mapped to a cube……………………………………..20

3.1 Overview of DroidIcon……….………………………………………….23

3.2 Normalized sinc function………………………………………………...26

3.3 Lanczos window for a = 1, 2, 3…………………………………………..26

3.4 (a) Lanczos kernel for a = 2………………………………………………….28

3.4 (b) Lanczos kernel for a = 3………………………………………………….28

3.5 (a) Before scaling……...……………………………………………………..30

3.5 (b) After scaling……...………………………………………………………30

3.6 An example icon with bright characters and deep background………….31

3.7 The icon in Figure 3.6 after color inversion……..………………………33

3.8 User Interface containing ghost buttons………………………………....34

3.9 (a) Example ghost button……………………………………...……………..35

3.9 (b) Example ghost button without transparent ghost background...…………35

3.10 Converted icon with opacity mapped to RGB color……...……………...37

3.11(a) Image of colored bars…………………………………………………....39

3.11(b) Converted bars after using Intensity…………………………………..…39

3.11(c) Converted bars after using Luminance………………………………..…39

3.12(a) Example color icon that OCR fails to process……..…………………….40

3.12(b) The grayscale image for Figure 3.12 (a)……………………………...…41

3.12(c) Example icon after grayscale conversion and color inversion..…………41

3.13(a) Example of image with very low contrast……………………………….43

3.13(b) Example of image with very high contrast………………………………44

3.14(a) Example icon before contrast adjustment………………………….…….47

3.14(b) Example icon after contrast adjustment………...……………………….47

3.14(c) Example icon after contrast adjustment and color inversion….…………47

3.15 Example Email icon with extracted text “L\_/j Email”...………………..48

3.16 Demonstration of Levenshtein distance………………………………….52

4.1 Number of words in text icons………………………..………………….61

4.2 Recall of OCR…………………………………………………………...62

4.3 Recall of OCR + Classification ………………………………………...62

4.4 Comparison of the recalls of OCR and OCR + Classification……...….62

4.5 Recall of OCR + Classification + Image Scaling…………………...... 64

4.6 Recall of OCR + Classification + Color Inversion………………...….64

4.7 Recall of OCR + Classification + Opacity Conversion……………….64

4.8 Recall of OCR + Classification + Grayscale Conversion….………….65

4.9 Recall of OCR + Classification + Contrast Adjustment...... 65

4.10 Comparison of recalls among all the image mutation techniques……….66

4.11 Recall of DroidIcon…………………………...………………………...67

4.12 Comparison of recalls between OCR and DroidIcon……………….…..67

4.13 Recall, precision, accuracy, and F1-score achieved by DroidIcon……...69

4.14 Icons with unusual or decorative fonts…………………………………..69

4.15 Icons with unsuitable character and image size….………………………69

4.16 Scaled Icon from Figure 4.15……….……………………………………70

4.17 Email icon with image scaling and contrast adjustment.……………..….70

4.18 Icon with similar colors in text and background………………………....71

4.19 Icon in Figure 4.18 after contrast adjustment……………………………71

4.20 Comparison of DroidIcon’s performance using different similarity

thresholds……………………………………………………………...... 72

4.21 Influence of similarity threshold on effectiveness……………………….72

4.22 Case study: Text Icons for Location, Message, Email, Contracts and

Call……………………………………………………..………………...74

4.23 Case study: Text Icons for Messaging and Email

……………………………………………………………………………75


List of Abbreviations

UI User Interface

OCR Optical Character Recognition


Sensitive Text Icon Classification for Android Apps

Abstract

by

ZHIHAO CAO

As smartphones play an increasingly important role in people's daily lives, users' privacy and security have become a serious concern. Previous research efforts in improving mobile app security mainly focused on the predefined sources of sensitive information managed by smartphone platforms. To the best of our knowledge, text icons, a type of user interface element that may indicate uses of the users' sensitive information, have been largely neglected. In this thesis, we propose an approach to automatically identify text icons in the UIs of smartphone apps and classify them into predefined categories of sensitive information. In particular, we develop an algorithm, DroidIcon, based on OCR

(Optical Character Recognition) to determine whether the texts contained in text icons indicate uses of sensitive information. To evaluate the effectiveness of DroidIcon, we apply the algorithm to 707 text icons collected from 2000 popular Android apps. The algorithm achieves an accuracy of 90.52%, a precision of 91.28% and a recall of 88.25% for classifying text icons into pre-defined categories of sensitive information.


Chapter 1

Introduction

With the rapid development of mobile phones, smartphones have become more and more popular and are playing an important role in people's daily lives. Today, millions of mobile applications (i.e., apps) are available in app stores. These apps enable smartphones to address various kinds of user needs. In order to provide better services, apps use more and more of the users' sensitive information to customize their functionality. However, certain apps may have behaviors that are less than desirable or even harmful. For example, some apps obtain users' personal data such as GPS coordinates, contact lists, and e-mail addresses without consent from the users, and advertisers exploit such data as a marketing channel to bundle pushy ads with apps [27].

To protect users’ sensitive information in smartphones, a lot of research efforts have been spent in constraining the uses of private user data through a data-access control mechanism. That is, in order to access users’ sensitive information, apps need to request the corresponding permissions from the users. For example, to access the users’ contact list, the apps need to request the READ_CONTACTS permission. However, this kind of protection mechanism has shown limited success [28], since many apps have legitimate reasons to request users’ permissions in using their private data and it is difficult to distinguish such legitimate behaviors from the undersized behaviors. For example, apps

recommending restaurants use users' GPS data to suggest restaurants near the users, and apps providing travel planning services let users make phone calls or send messages.

To detect undesired behavior in mobile apps, we are motivated by the vision: can analysis of an app's program behavior be contrasted with the intents of the app to determine whether the app will perform within the user's expectation? In other words, we aim to automatically check the compatibility between the intents expressed by an app and its behind-the-scenes behaviors. For example, if an app's user interface (UI) has no texts or images to indicate that it will access users' GPS data (i.e., no intents for GPS data), but the app sends out users' GPS data when a button is pressed, then red flags should be raised.

Other useful scenarios include reading users’ contacts, sending SMS messages, and taking pictures.

Apps’ UIs contain various types of semantic information that express the intents of the apps. For example, a button with the text “Location” in the UI indicates that the app will access user’s sensitive location data once the user clicks the button. Therefore, understanding these types of semantic information provides us an important mechanism for automatically detecting apps’ intentions in using user’s sensitive data, which is the first step towards automatically checking the compatibility between apps’ behaviors and their intentions.

Existing research works [1][2] focus on detecting sensitive information via analyzing the texts in UIs, such as text labels and input fields. However, another important type of UI element, the icon, which also contains rich semantic information, has not been explored yet. Icons are an important component of UIs and have been widely used in mobile

apps. As mentioned in [26], designers replace text labels with icons because icons make the UI more stylish, save screen space, and are fast to recognize at a glance.

Among the icons used in apps' UIs, text icons, which refer to icons embedded with texts, are widely used to show the apps' intentions in using users' private information.

Unfortunately, existing works [1][2][29][30] focus on analyzing the textual artifacts of

Android apps, and have limited capability in analyzing the texts in icons to understand their semantic information. The reason is that these texts are represented using pixels in digital images, rather than texts that can be extracted directly from UI layout files [1][2]. Although these works may analyze the file names of icons to infer semantic information based on the keywords in the file names, many apps adopt file names such as “icon1.png” or “1.png”, which do not provide much semantic information and render these works ineffective.

To address the important problem of understanding the semantic information of icons in UIs, this thesis proposes an approach, Sensitive Text Icon Classification

(DroidIcon), that classifies text icons into one of the pre-defined semantic categories. More specifically, DroidIcon adapts Optical Character Recognition (OCR) techniques to extract characters from the icons, computes the similarity between the words formed by the extracted characters and the keywords in each semantic category, and classifies the icons to the semantic category based on the highest similarity.

In particular, since icons in Android apps are usually small, diversified, and partially or totally transparent, OCR techniques face challenges in recognizing the characters with high precision. In fact, directly applying existing OCR techniques can only infer semantic information from less than 10% of the studied icons. To address this challenge, we propose DroidIcon that explores the possibility of applying various image

mutation techniques to convert the icons into OCR-friendly images. Our algorithm significantly improves the precision of character recognition, and thus the overall effectiveness of our approach as well.

To determine whether the texts in text icons indicate the uses of sensitive information, we define 9 categories of semantic information based on the frequently used sensitive information in mobile apps. Among the 9 categories, 7 of them indicate different types of sensitive information: Camera, Contacts, Location, Email, Phone, Photo, and SMS, while the remaining two do not: Non-sensitive text and Non-text. The non-sensitive text category means an icon contains text but the text does not indicate the uses of users' sensitive information, and the non-text category refers to the icons with no embedded text. Based on the 9 predefined categories, given a text icon, our work aims to determine which category it belongs to. The semantic information provided by these 9 categories can be used by various types of privacy analysis, such as checking whether the semantic information represented by a category is compatible with the permissions requested by apps.

Based on our empirical study of icons from 2000 apps downloaded from Google

Play, most of the icons contain fewer than 3 words (Section 4.1). Thus, to classify a text icon into a semantic category, we adopt the keyword-based approach that compares the words formed by the extracted characters from the icon to the keywords used in each of the 7 sensitive semantic categories. If a match is found, then the icon is classified into the corresponding semantic category; otherwise, the icon is classified into the Non-sensitive text category. Icons without texts are classified into the Non-text category.

However, even though the precision of character recognition can be improved via iterative image mutations, it is still very difficult, if not impossible, to perfectly recognize

every character from text icons, since texts could be presented using custom fonts and styles. Thus, in many cases, OCR may extract part of the embedded text from the text icons, and obtain a set of incomplete words. To address this challenge, we propose an edit distance based algorithm to find the most similar keyword via computing the similarity between the extracted words and the keywords in each semantic category. If the similarity between an extracted word and a keyword is higher than a threshold, we consider it a match and the icon is classified into the corresponding semantic category.

Figure 1.1 Motivation example of DroidIcon

To better illustrate the motivation of DroidIcon, we show a real Android app, named

MyCityWay, whose UI is shown in Figure 1.1. This app provides information about local places to users. As we can see, there exist five icons (marked in red) in this UI. Among these icons, three of them contain texts that indicate the uses of sensitive information: "Call", "Direction", and "Map". The "Call" icon indicates it will access the user's phone call information. The "Direction" and "Map" icons indicate they will use the user's location information. The developer will get access to the sensitive data when a user clicks

the icons. This may lead to a potential risk of exploiting the user's sensitive information if the app abuses the users' phone call information or accesses other types of sensitive information contrary to what the users expect. Therefore, if we can detect the apps' intentions in using the user's sensitive information and classify them into the correct sensitive category, we can apply appropriate behavioral analysis to check whether the corresponding behaviors of the program are within the user's expectation. As shown in Figure 1.1, given an icon, our algorithm classifies it to one of 9 predefined semantic categories. The "Call" icon should be classified to the Phone category. The "Direction" and "Map" icons should be classified to the Location category. The "Home" icon should be classified to the Non-sensitive

Text category. The remaining icon should be classified to the Non-text category. The classification result will be used for further behavioral analysis.

To evaluate the effectiveness of DroidIcon, we apply DroidIcon to 707 text icons extracted from 2000 apps downloaded from Google Play. Among the 707 text icons, 332 positive icons contain sensitive texts and the other 375 negative icons either contain non-sensitive texts or do not embed any text. We compare DroidIcon with OCR in recognizing texts and classifying text icons, and the results show that DroidIcon correctly classifies

90.1% of the 332 positive icons, while the approach based on OCR alone correctly classifies less than 10% of them. We also measure the effectiveness of the different image mutation techniques adopted in DroidIcon, and show the improvement brought by each technique. Based on the results, we show that DroidIcon, which iteratively applies different image mutation techniques, achieves the best results.

The rest of the thesis is organized as follows. In chapter 2, we present related work about sensitive UI detection and basic background knowledge of our work. In chapter 3,

we present an overview of our algorithm and introduce each component of the algorithm in detail. In chapter 4, we conduct experiments to evaluate the effectiveness of our algorithm and present a case study. In chapter 5, we discuss related work, and in chapter 6, we conclude our work.


Chapter 2

Background

In this chapter, we first introduce the background of the Android permission system and related works on sensitive UI detection in Android. Then we provide background knowledge about pixels, color models, and Optical Character Recognition (OCR) for later use.

2.1 Permission System in Android

Android has become a very popular platform for third-party applications because of its unrestricted application market and open-source nature. It supports third-party development with an extensive API that includes access to phone hardware, settings, and user data [31].

Access to privacy- and security-relevant parts of Android's rich API is controlled by an install-time application permission system. This means each application must declare what permissions it needs and notify users during installation (Figure 2.1(a)). All applications can only access their own files by default. Therefore, in order to access system resources such as text messages, the list of contacts, and private images, third-party apps have to obtain the corresponding permissions from the user. For example, to access the list of contacts, the permission READ_CONTACTS must be requested and granted. A recent improvement of

Android’s permission system supports runtime permission requests (Figure 2.1(b)), which

pops up a dialog to request a permission the first time an app uses the user's protected information.


Figure 2.1 Screenshots of Android Permission Requests
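To make the permission model above concrete, the following sketch (in Java, using a hypothetical activity name and request code) shows how an app can check for the READ_CONTACTS permission and trigger the runtime dialog of Figure 2.1(b); the permission must also be declared in the app's AndroidManifest.xml.

import android.Manifest;
import android.content.pm.PackageManager;
import androidx.appcompat.app.AppCompatActivity;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

public class ContactsActivity extends AppCompatActivity {

    // Hypothetical request code used to match the result callback.
    private static final int REQUEST_READ_CONTACTS = 42;

    private void loadContactsIfPermitted() {
        // AndroidManifest.xml must also declare:
        // <uses-permission android:name="android.permission.READ_CONTACTS" />
        if (ContextCompat.checkSelfPermission(this, Manifest.permission.READ_CONTACTS)
                == PackageManager.PERMISSION_GRANTED) {
            readContacts();  // permission already granted
        } else {
            // Pops up the runtime permission dialog shown in Figure 2.1(b).
            ActivityCompat.requestPermissions(
                    this,
                    new String[]{Manifest.permission.READ_CONTACTS},
                    REQUEST_READ_CONTACTS);
        }
    }

    private void readContacts() {
        // Placeholder for the actual query against the contacts provider.
    }
}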

However, it is difficult for users to make decisions about whether to grant permissions to an app or not. The reason is that it is very difficult to distinguish the legitimate behaviors of benign apps from the undesired behaviors of malicious apps, since many benign apps request the same permissions as malicious apps. Therefore, if we can understand the intents expressed by an app and check whether the corresponding behaviors behind the screen are compatible with those intents, we can leverage such mismatches to detect the undesired behaviors.

2.2 Sensitive UI Detection in Android

In Android apps, UIs communicate the intents of the app through texts and images, and thus contain lots of semantic information that may indicate the uses of the user's sensitive information. There already exist some research efforts that focus on detecting the use of

17 user’s sensitive information using the semantic information in the UI. UIPicker [1] detects sensitive user information through applying supervised learning on the semantic information extracted from the program code of UI elements. Besides the features extracted from the texts and layout descriptions, UIPicker also considers the texts of the sibling elements in the layout file which could include unrelated texts as features.

SUPOR [2] leverages the semantic information from the text labels that are physically close to input fields on the screen. Generally, text labels are used in UIs as descriptions to guide users to enter the input. Therefore, understanding the semantics of text labels could help us determine whether the corresponding input fields need to access the user's sensitive information. Then we can analyze the corresponding program behind the screen to check whether the app behaves as expected or maliciously.

Figure 2.2 Example sensitive text label [2]

Figure 2.2 shows an example UI that requires users to input their sensitive information: User ID and Password. There are two text labels, "User ID" and "Password", in the UI that guide the user to enter the information in the input fields. SUPOR first parses the layout files and finds the text labels together with input fields. Then it compares the texts in the text labels against a predefined keyword dataset to determine the sensitiveness of the input fields. Finally, the result is sent to the privacy analysis part for behavioral analysis.

Both UIPicker and SUPOR have studied the possibility of detecting uses of the user's sensitive information from semantic information in texts. Images, especially icons, another important type of element in UIs, have been largely neglected by these approaches. Thus, UIs with sensitive text icons but without sensitive texts may be classified by these approaches as non-sensitive, causing lots of false negatives in privacy analysis. To address this important problem, we propose DroidIcon to detect sensitive text icons in the UIs of

Android apps.

2.3 Pixel and Color Model

In order to sense, represent, and display images in electronic systems, researchers proposed pixels, which are the smallest elements in a digital image. A digital image is a rectangular grid of pixels with fixed rows and columns. Figure 2.3 is an example of pixels

[4]. It shows an image with a portion enlarged, in which the individual pixels are rendered as small squares and can easily be seen.

Figure 2.3 Example of pixels in an image [4]

In an image, a pixel represents a single color dot. All the pixels arranged in a rectangular grid form a colorful image. To represent the full range of colors, researchers propose the color model, an abstract mathematical model that represents colors as tuples

of numeric values. The most commonly used color model is RGB, which is widely used in various digital image formats such as JPEG, PNG, etc. RGB is an additive color model, which means a color is created by mixing a number of different primary colors. RGB refers to the three primary colors "Red", "Green", and "Blue". We can create millions of colors based on these three colors. In this thesis, all the icon images are in either

JPEG or PNG format. Therefore, all of them adopt the RGB model.

A color in the RGB color model is expressed as an RGB triplet (r, g, b), where "r",

“g”, and “b” are the numeric values that describe how much red, green, and blue are included in the color, respectively. Each value for a primary color can vary from zero to a defined maximum value. If all the values are zero, the resulting color is black; if all the values are maximum, the resulting color is the brightest color, i.e., white. Therefore, the geometric representation of RGB color model is a cube, where each color is a point within the cube, on its face, or along its edges.

Figure 2.4 RGB color model mapped to a cube [5]

Figure 2.4 shows a cube that the RGB color model is mapped to. The horizontal x-axis represents the values for the red color, the y-axis represents the blue color, and the z-

axis represents the green color. The origin, representing the black color, is the vertex hidden from view.

The value of a primary color could be quantified in different ways. In computers, each primary color is often represented as an integer ranging from 0 to 255, so that it can be stored in a single 8-bit byte. For example, the RGB triplet value of black is (0, 0, 0), red is (255, 0, 0), and white is (255, 255, 255). In this thesis, we utilize this kind of representation to help us manipulate the colors of icons.

RGBA is a color space based on RGB that provides an extra alpha channel. The alpha channel is normally used to represent the degree of opacity of the color. The value can also be represented using an integer between 0 and 255. If a pixel has 0 in its alpha channel, it is fully transparent (invisible); if it has 255, the pixel has a fully opaque color, which is the same as traditional RGB.
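For illustration, the following Java sketch (using java.awt.image.BufferedImage, which we also rely on when describing the image mutation algorithms in Chapter 3) shows how the four RGBA channels can be unpacked from and repacked into the 32-bit ARGB integer that represents a pixel; the helper names are ours.

import java.awt.image.BufferedImage;

public final class PixelChannels {

    // Reads the four 8-bit channels of the pixel at (x, y) from the packed ARGB integer.
    static int[] readArgb(BufferedImage image, int x, int y) {
        int argb = image.getRGB(x, y);
        int alpha = (argb >> 24) & 0xFF;  // 0 = fully transparent, 255 = fully opaque
        int red   = (argb >> 16) & 0xFF;
        int green = (argb >> 8) & 0xFF;
        int blue  = argb & 0xFF;
        return new int[]{alpha, red, green, blue};
    }

    // Packs four 8-bit channel values back into a single ARGB integer.
    static int packArgb(int alpha, int red, int green, int blue) {
        return (alpha << 24) | (red << 16) | (green << 8) | blue;
    }
}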

In our studies, we find that many PNG images describe their contents via color opacity instead of using different colors. Thus, we need to transform the opacity differences to color differences to make the image content distinguishable by OCR.

2.4 Optical Character Recognition

Optical Character Recognition (OCR) is an important research field of computer vision. The task of OCR is to identify characters in images of printed or handwritten texts and convert them into machine-encoded text (e.g., ASCII), so that they can be recognized and edited by computer programs.

Currently, many types of OCR libraries are publicly available [7], such as

OCR, FreeOCR, Asprise OCR, etc. We build DroidIcon upon Asprise OCR, which provides high performance open source APIs for common OCR tasks. It has a very high

detection accuracy as mentioned in [8]: "By running a sample of 200 image e-mails, we determined that Asprise OCR was performing with an accuracy of 95%. It had the best detection rate among the approaches we analyzed". It supports various kinds of image formats such as JPEG, PNG, etc. It also provides different SDKs to support multiple programming languages and can recognize texts in more than 20 languages such as

English, Spanish, French, etc. In this thesis, we use its Java SDK and focus only on English texts.

Although Asprise OCR is introduced as a high-performance OCR engine, it does not perform well on Android icons for several reasons. First of all, due to the size limitation of smartphone screens, the icon size is usually small. For example, the smallest size is 48 x 48 in our collected icon dataset. This leads to low resolutions of the texts, which in turn affects the OCR accuracy. Second, the OCR engine works much better on icons with dark-colored characters on a bright background than on icons with bright-colored characters on a dark background. However, due to the diversified icon styles in Android apps, there are many icons that have bright-colored characters on a dark background, posing challenges for OCR.

Third, in order to provide a better user experience, there exist many icons with low contrast and ghost buttons (icons designed via opacity differences) in UIs. It is also difficult for the OCR engine to correctly recognize the texts embedded in these icons. Therefore, we propose to use different image mutations to convert all these kinds of images to OCR-friendly icons.


Chapter 3

Design of DroidIcon

In this chapter, we first present an overview of DroidIcon and then explain each component of DroidIcon in detail.

3.1 Overview

Figure 3.1 Overview of DroidIcon

Figure 3.1 is the overview of DroidIcon. It consists of three major components:

Image Mutation, Optical Character Recognition, and Text Icon Classification. DroidIcon takes an APK icon as input and outputs the classification of the icon.

The image mutation component accepts an APK icon image and applies different mutations to the icon iteratively. The mutated icons are used as the input to the optical character recognition component. The character recognition component detects and extracts the texts embedded in the icon. The extracted texts are sent to the text icon classification component. The text icon classification component determines the semantic category of the icon by checking the extracted texts against the keywords in each semantic category. We predefine 9 semantic categories, where 7 of them indicate the uses of sensitive information and the remaining two categories (Non-sensitive text and Non-text) indicate that there are no sensitive texts in the icon.

3.2 Image Mutation

To address the challenges faced by OCR, we leverage different image mutation techniques to convert icons to OCR-friendly images. As shown in Figure 3.1, we have five techniques: Image Scaling, Color Inversion, Opacity Conversion, Grayscale

Conversion, and Contrast Adjustment.

3.2.1 Image Scaling

Resolution (pixels per inch) [16] is an important factor to control the image quality, and thus directly affects the accuracy of OCR. Lower resolutions typically produce images where pixels of a character are condensed in a small region, compromising the accuracy of character recognition. On the other hand, higher resolutions produce larger images where pixels of a character are spread to different areas, also affecting the accuracy of character recognition. Therefore, it is important to scale the image size so that it is neither too big nor too small. We crawl icons from apps downloaded from Google Play, and these icons have variable resolutions. In our dataset, small icons have the size of only 48x48 and large

icons have the size of 300x300. Based on our empirical observations in applying OCR on these icons, the OCR engine performs better when the image size is around 100x100.

Thus, we adopt 100 pixels as the standard size for image scaling.

Enlarging or shrinking images can be interpreted as a form of resampling or image reconstruction. Currently, many image scaling algorithms have been proposed.

Theoretically, sinc cardinal resampling [10] provides the best performance. However, the assumptions behind sinc resampling are not completely met in real-world digital images.

Therefore, we implement Lanczos resampling, an approximation to the sinc cardinal method, as our image scaling algorithm, as it yields better results in practice than sinc cardinal resampling.

Lanczos resampling is an interpolation algorithm. Interpolation is a method of constructing new data points within the range of a discrete set of known data points. Given a set of input samples, the effect of each input sample on the interpolated values is defined by the reconstruction kernel L(x), called the Lanczos kernel. The kernel is composed of two parts: the normalized sinc function sinc(x), windowed (multiplied) by the Lanczos window sinc(x/a) for −a ≤ x ≤ a. It is defined as [11]:

L(x) = \operatorname{sinc}(x) \cdot \operatorname{sinc}(x/a) \text{ for } -a \le x \le a, \text{ and } 0 \text{ otherwise}    (3.1)

where a is the size of the kernel. The normalized sinc function is defined as:

\operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x}    (3.2)

and the window function is defined as:

\operatorname{sinc}(x/a) = \frac{a \sin(\pi x / a)}{\pi x}    (3.3)

Figure 3.2 is the plot of the normalized sinc function; we can see that the function is symmetric about x = 0. A lobe has a smaller peak absolute value when it is farther from the y-axis. The plot tells us that the farther a sample is, the lower its effect is.

Figure 3.2 Normalized sinc function

Figure 3.3 is the plot of the Lanczos window for a = 1, 2, 3. When x is outside the range [−a, a], the L(x) values are set to zero, limiting the support of the Lanczos kernel to [−a, a].

Figure 3.3 Lanczos window for a = 1, 2, 3

Based on (3.2) and (3.3), (3.1) can be written as [11]:

L(x) = \begin{cases} 1 & \text{if } x = 0 \\ \dfrac{a \sin(\pi x) \sin(\pi x / a)}{\pi^2 x^2} & \text{if } 0 < |x| \le a \\ 0 & \text{otherwise} \end{cases}    (3.4)

Based on (3.4), given one-dimensional input samples s_i, we can define the effect of all the input samples on the interpolated value S(x) for an arbitrary real argument x. It can be represented as the discrete convolution of these input samples with the Lanczos kernel [11]:

S(x) = \sum_{i = \lfloor x \rfloor - a + 1}^{\lfloor x \rfloor + a} s_i \, L(x - i)    (3.5)

where \lfloor x \rfloor is the floor function. Based on the one-dimensional Lanczos kernel, we extend it to a two-dimensional kernel [11]:

L(x, y) = L(x) \, L(y)    (3.6)

where x and y each represent a dimension. And we can conduct the two-dimensional discrete convolution S(x, y) based on (3.5) [11]:

S(x, y) = \sum_{i = \lfloor x \rfloor - a + 1}^{\lfloor x \rfloor + a} \; \sum_{j = \lfloor y \rfloor - a + 1}^{\lfloor y \rfloor + a} s_{ij} \, L(x - i) \, L(y - j)    (3.7)

An image represented in the RGB model can be interpreted as a two-dimensional grid, where each pixel is a point in this grid and the coordinates of the pixel correspond to the x (row) and y (column) values in the grid. In this way, we can apply the two-dimensional Lanczos kernel to scale the image to a given size.

As claimed by Turkowski and Gabriel [12], the Lanczos filter (with a = 2) achieves "the best compromise in terms of reduction of aliasing, sharpness, and minimal ringing". Therefore, we use the Lanczos kernel with a = 2 for scaling down icons.

According to Jim Blinn [13], the Lanczos kernel (with a = 3) "keeps low frequencies and rejects high frequencies better than any (achievable) filter we've seen so far”. Therefore, we use Lanczos kernel with a = 3 for scaling up icons. Figure 3.4(a) and

3.4(b) show the Lanczos kernel when a = 2 and a = 3.

Figure 3.4 (a) Lanczos kernel for a = 2 Figure 3.4 (b) Lanczos kernel for a = 3

To scale an input image with width W and height H, the image is represented as a discrete function I(i, j):

I(i, j) = p_{ij}    (3.8)

where i ∈ [0, W−1] and j ∈ [0, H−1]. Each pair (i, j) represents the coordinates of a pixel in the image. The pair (0, 0) represents the coordinates of the upper-left pixel and the pair (W−1, H−1) represents the coordinates of the lower-right pixel in the image. For each pixel, we denote the channels of its color as R(i, j) for red, G(i, j) for green, B(i, j) for blue, and A(i, j) for the alpha channel if the image has one.

Accordingly, the output image with width W' and height H' is defined as:

O(p, q) = p'_{pq}    (3.9)

where p ∈ [0, W'−1] and q ∈ [0, H'−1]. To obtain the pixels in the output image, we use the Lanczos kernel to compute each channel value of the pixel based on the corresponding channel of a set of pixels from the input image.

For example, to compute the red channel R'(p, q) of an output pixel, we first compute S_R(x, y):

S_R(x, y) = \sum_{i} \sum_{j} R(i, j) \, L(x - i) \, L(y - j)    (3.10)

where i ∈ [0, W−1] and j ∈ [0, H−1]. The position (x, y) is computed from (p, q) using the resampling ratio, i.e., x = p / ratio and y = q / ratio. Since R'(p, q) is the weighted average of S_R(x, y), we can compute R'(p, q) as:

R'(p, q) = \dfrac{S_R(x, y)}{totalWeight}    (3.11)

where totalWeight is:

totalWeight = \sum_{i} \sum_{j} L(x - i) \, L(y - j)    (3.12)

Then we can compute G'(p, q), B'(p, q), and A'(p, q) using the same approach. These values form the pixel (p, q) in the output image. We repeat this procedure for all (p, q) where p ∈ [0, W'−1] and q ∈ [0, H'−1], and then obtain the output image.

Algorithm 1 shows the detailed steps for scaling an image.

Figure 3.5 (a) is an example icon before scaling. The OCR engine cannot extract any text from it since the image size is 48x48, causing the characters to be too small for the OCR engine to recognize.


Figure 3.5 (a) Before scaling Figure 3.5 (b) After scaling

Algorithm 1 Image Scaling
Procedure SCALING(I, R)
Input: I as an input image, R as the scaling ratio
Output: O as the output image after scaling input I
  O.width ← I.width * R, O.height ← I.height * R
  if R > 1 then a ← 3 else a ← 2
  for all col ∈ O.width
    x ← col / R
    for all row ∈ O.height
      y ← row / R
      totalWeight ← 0
      let outputRGBA[] be an array of zeros, one entry per channel
      for all subRow ∈ (⌊y⌋ − a + 1, ⌊y⌋ + a)
        for all subCol ∈ (⌊x⌋ − a + 1, ⌊x⌋ + a)
          weight ← LanczosKernel(x − subCol) * LanczosKernel(y − subRow)
          if weight > 0 then
            totalWeight ← totalWeight + weight
            p ← I.pixel(subCol, subRow)
            for all channel c of p
              outputRGBA[c] ← outputRGBA[c] + p.RGBA[c] * weight
            end for
          end if
        end for
      end for
      for all channel c
        outputRGBA[c] ← outputRGBA[c] / totalWeight
      end for
      O.pixel(col, row) ← construct color using outputRGBA[]
    end for
  end for
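For reference, a minimal Java sketch of the LanczosKernel function invoked in Algorithm 1, implementing Equation (3.4), is given below; the method name is ours, and a full implementation would typically cache the computed weights.

public final class Lanczos {

    // One-dimensional Lanczos kernel L(x) from Equation (3.4):
    // 1 at x = 0, the windowed sinc value for 0 < |x| < a, and 0 outside [-a, a].
    static double lanczosKernel(double x, int a) {
        if (x == 0.0) {
            return 1.0;
        }
        if (x <= -a || x >= a) {
            return 0.0;
        }
        double px = Math.PI * x;
        return a * Math.sin(px) * Math.sin(px / a) / (px * px);
    }
}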


Figure 3.5 (b) shows the example icon after scaling. We can see that although the image is enlarged to around twice the original size, each character of the text is more distinguishable, with smoother boundaries than in the original. When we apply the OCR engine to extract the text from the resized icon, the resulting text is "SMS", which exactly matches one of the sensitive keywords for the SMS semantic category. Therefore, this icon is classified into the SMS category.

3.2.2 Color Inversion

Most OCR engines perform much better on images consisting of dark-colored characters on a bright background than on images consisting of bright characters on a dark background. In Android apps, many text icons are designed using a black theme, as shown in Figure 3.6. In the icon, the text has white characters on a black background. It is very easy for human beings to recognize these characters.

However, the OCR engine cannot extract any text from this image if not specially trained using the icons of this style.

Figure 3.6 An example icon with bright characters and deep background

In order to address this issue, we propose to convert the characters to a dark color and the background to a bright color. As mentioned in Chapter 2, the RGB color model uses different degrees of three primary colors (red, green, and blue) to compose a color.

Each primary color is often stored as an integer number in the range 0 to 255. The RGB value is (0, 0, 0) for the color black and (255, 255, 255) for the color white.

Given this property of RGB colors, we define the color inversion as:

C' = 255 − C    (3.13)

where C' is an RGB channel of the output pixel and C is the corresponding RGB channel of the input pixel.

As we have defined the input image and the output image in (3.8) and (3.9), we can apply the color inversion to a pixel in the output image:

C'_{channel}(p, q) = 255 − C_{channel}(i, j)    (3.14)

where i = p, j = q, and channel ∈ {red, green, blue}.

Based on the equation (3.14), we repeat this conversion for all the pixels in the output image. Algorithm 2 shows the detailed steps for the color inversion.

Algorithm 2 Color Inversion
Procedure INVERSION(I)
Input: I as an input image
Output: O as the output image after color inversion
  for all col ∈ I.width
    for all row ∈ I.height
      color ← I.getRGB(col, row)
      newRed ← 255 − color.getRed
      newGreen ← 255 − color.getGreen
      newBlue ← 255 − color.getBlue
      newColor ← construct color using newRed, newGreen, newBlue
      O.setRGB(col, row, newColor)
    end for
  end for
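A minimal Java sketch of Algorithm 2 on a BufferedImage, assuming the packed-ARGB pixel representation from Section 2.3 and leaving the alpha channel unchanged, could look as follows.

import java.awt.image.BufferedImage;

public final class ColorInversion {

    // Returns a copy of the input image with each RGB channel replaced by 255 - value.
    static BufferedImage invert(BufferedImage input) {
        BufferedImage output =
                new BufferedImage(input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int col = 0; col < input.getWidth(); col++) {
            for (int row = 0; row < input.getHeight(); row++) {
                int argb = input.getRGB(col, row);
                int alpha = (argb >> 24) & 0xFF;
                int red   = 255 - ((argb >> 16) & 0xFF);
                int green = 255 - ((argb >> 8) & 0xFF);
                int blue  = 255 - (argb & 0xFF);
                output.setRGB(col, row, (alpha << 24) | (red << 16) | (green << 8) | blue);
            }
        }
        return output;
    }
}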

Figure 3.7 is the example icon after the color inversion. The original image is shown in Figure 3.6. We can see that the characters now have the darker color and the background has the brighter color, which makes it easier for the OCR engine to recognize the characters. When

32 we apply the OCR engine on this converted icon, the extracted text is “_| email now>~<”.

Although this resulting text contains redundant characters, it includes all the characters in the original icon. Since the sensitive keyword “email” is included in the extracted text, this icon can be correctly classified to the Email category. We will introduce how we match the sensitive keywords in Section 3.3.3.

Figure 3.7 The icon in Figure 3.6 after color inversion

Since not all icons have a dark background and bright texts, given an icon, DroidIcon will apply the color inversion to obtain a converted icon, and then apply OCR on both the converted icon and the original icon. If any text extracted from these two icons matches any sensitive keyword in a semantic category, DroidIcon will classify the icon to the corresponding semantic category. If the texts from both icons cannot exactly match any sensitive keyword, DroidIcon will apply another image mutation technique.

3.2.3 Opacity Conversion

With the development of UI/UX design, the "ghost button" has become one of the dominant design trends in the UI design world for smartphone apps [15].


Figure 3.8 User Interface containing ghost buttons [14]

Ghost buttons are transparent buttons with a basic shape, such as a rectangle or a circle. They are generally bordered by a thin line, while the internal section consists of plain text. They have been widely used in many mobile apps.

Figure 3.8 is an example UI of an app. There are two rectangular ghost buttons in the UI: the "Email Us" button and the "Frequently Asked Questions" button. The contents of both buttons are transparent except for the texts. This means that the background areas of the buttons are transparent to some extent.

The ghost buttons look the same as other images that have no transparent background. However, when we apply OCR to the icons of the ghost buttons, we find the recognition accuracy is very low; the result is usually an empty text.

By carefully inspecting the colors of each pixel in these icons, we find that the icons are designed by adjusting the opacity of pixels. Some icons actually have the same color for all their pixels, but the pixels are set to totally transparent except for the pixels that form the texts and the borders. The other icons indeed have background colors different from the colors of the texts, but the pixels in the background are set to transparent. The reason why the OCR engine cannot recognize the text in these ghost button icons is that OCR detects characters in the icon based on color differences, but it cannot detect the opacity differences between pixels.

Figure 3.9 (a) Example ghost button

Figure 3.9 (b) Example ghost button without transparent background

Figure 3.9 (a) is an icon of a ghost button collected from an Android app. We cannot see the color of its background since the background is transparent. Figure 3.9 (b) is the converted icon after we set all the pixels to opaque. We can see that the background actually has a black color. Since the background color is darker than the text color, which is green, the OCR engine cannot extract the text effectively.

The similarly dark colors of both the background and the text cause the OCR engine to perform poorly. In order to address this problem, we propose an algorithm to map the degree of opacity to the degree of color, which transforms the differences in opacity into differences in color. The key idea of this algorithm is to map the value of the alpha channel to the color channels. The transform equation is:

C' = 255 − A    (3.15)

where C' is an RGB channel of the output pixel and A is the alpha channel of the input pixel.

We apply this equation to convert each channel of the pixel based on (3.15):

C'_{channel}(p, q) = 255 − A(i, j)    (3.16)

where i = p, j = q, A(i, j) is the alpha channel of the pixel (i, j) in the input image I, and channel ∈ {red, green, blue} is one of the RGB channels of the pixel (p, q) in the output image O.

We apply (3.16) to each RGB channel of the pixel, so the image will be converted to a black-and-white image. The alpha value is 0 for fully transparent and 255 for fully opaque. A transparent background will be mapped to a brighter color and opaque text will be mapped to a darker color. As mentioned in the last section, this kind of image is easier for the OCR engine to extract text from. Algorithm 3 shows the detailed steps for the opacity conversion.


Algorithm 3 Opacity Conversion
Procedure OPACITYCONVERSION(I)
Input: I as an input image
Output: O as a new image converting opacity differences to color differences
  for all col ∈ I.width
    for all row ∈ I.height
      color ← I.getRGB(col, row)
      alpha ← color.getAlpha
      newRed ← 255 − alpha
      newGreen ← 255 − alpha
      newBlue ← 255 − alpha
      newColor ← construct color using newRed, newGreen, newBlue
      O.setRGB(col, row, newColor)
    end for
  end for
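A minimal Java sketch of Algorithm 3 is shown below; it ignores the original colors entirely and maps the alpha value of each pixel to a gray level, so that opaque text becomes dark and a transparent background becomes bright (the class and method names are ours).

import java.awt.image.BufferedImage;

public final class OpacityConversion {

    // Maps the alpha channel of each pixel to a gray level (255 - alpha), per Equation (3.16):
    // a fully transparent background becomes white, fully opaque text becomes black.
    static BufferedImage alphaToGray(BufferedImage input) {
        BufferedImage output =
                new BufferedImage(input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int col = 0; col < input.getWidth(); col++) {
            for (int row = 0; row < input.getHeight(); row++) {
                int alpha = (input.getRGB(col, row) >> 24) & 0xFF;
                int gray = 255 - alpha;
                output.setRGB(col, row, (gray << 16) | (gray << 8) | gray);
            }
        }
        return output;
    }
}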

Figure 3.10 is the converted icon after applying Algorithm 3 to the icon in Figure

3.9 (a). The text now has the black color and the background has the white color. When we apply OCR on this converted icon, the extracted text is “Add photo”. Thus, the text is correctly recognized after the opacity is mapped to the RGB colors.

Figure 3.10 Converted icon with opacity mapped to RGB color

In our dataset, we have a certain number of ghost buttons or similar icons, and Algorithm 3 is very effective in handling these icons.

3.2.4 Grayscale Conversion

A grayscale [18] image is a type of image in which the value of each pixel represents only the amount of light. In other words, it carries only luminance information.

Generally, an image contains both luminance information and chrominance information.

By converting a color image to a grayscale image, the image preserves only the luminance information and loses the chrominance information.

Modern image recognition techniques often operate on grayscale images [17]. The main reason why grayscale images are preferred over the original color images is that grayscale representations reduce the noise that color introduces for these image recognition techniques. For many applications of image processing, color information is not always helpful, especially in distinguishing edges, while other features such as luminance are far more important to these techniques.

Converting color images to grayscale images simplifies a three-channel (RGB) image to a one-channel image, which removes the color information from the image.

Although this loses some information, it preserves the most useful information, luminance, which is the important information for OCR. In fact, without the color information, the noise that color introduces for OCR is eliminated, which improves the accuracy of OCR.

There exist many techniques for grayscale conversion [19]. One straightforward color-to-grayscale algorithm is Intensity [21]:

C' = \dfrac{C_r + C_g + C_b}{3}    (3.17)

where C' represents each RGB channel of the output pixel and C_r, C_g, C_b represent the red, green, and blue channels of the input pixel.

Based on Intensity, a more effective version of the averaging algorithm was proposed, Luminance [20]:

C' = 0.3\,C_r + 0.59\,C_g + 0.11\,C_b    (3.18)

Luminance is designed to match human brightness perception by using a weighted combination of the RGB channels. It performs better than other techniques [17] and is widely used in many image processing applications (e.g., GIMP, MATLAB). Therefore, we use it to generate grayscale images in DroidIcon.

Figure 3.11 (a) Image of colored bars

Figure 3.11 (b) Converted bars after using Intensity

Figure 3.11 (c) Converted bars after using Luminance

Figure 3.11 (a) shows example color bars. We apply both Intensity and Luminance to convert these bars to grayscale images. Figure 3.11 (b) shows the resulting color bars using Intensity. Clearly, we can observe that some bars with different colors in Figure 3.11

(a) are converted to the same color. This indicates that the loss of color information causes collisions in the grayscale images, which is not ideal. Figure 3.11 (c) shows the resulting bars using Luminance. We can see that the weighted conversion is a much more accurate

representation than the average conversion adopted by Intensity. The intensity levels change gradually from black to white. Based on (3.8) and (3.9), we can convert each channel of a pixel in the output image as:

C'_{channel}(p, q) = 0.3\,R(i, j) + 0.59\,G(i, j) + 0.11\,B(i, j)    (3.19)

where i = p, j = q, and channel ∈ {red, green, blue}.

Based on (3.19), all three color channels are set to the same value, and thus only luminance information is preserved. Algorithm 4 shows the detailed steps for the grayscale conversion:

Algorithm 4 Grayscale Conversion
Procedure GRAYSCALE(I)
Input: I as an input image
Output: O as a new grayscale image
  for all col ∈ I.width
    for all row ∈ I.height
      color ← I.getRGB(col, row)
      grayscale ← color.getRed * 0.3 + color.getGreen * 0.59 + color.getBlue * 0.11
      newRed ← grayscale
      newGreen ← grayscale
      newBlue ← grayscale
      newColor ← construct color using newRed, newGreen, newBlue
      O.setRGB(col, row, newColor)
    end for
  end for
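A compact Java counterpart of Algorithm 4, using the Luminance weights of Equation (3.18), might look as follows (names are ours).

import java.awt.image.BufferedImage;

public final class GrayscaleConversion {

    // Converts each pixel to gray using the Luminance weights 0.3, 0.59, 0.11 from Equation (3.18).
    static BufferedImage toGrayscale(BufferedImage input) {
        BufferedImage output =
                new BufferedImage(input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int col = 0; col < input.getWidth(); col++) {
            for (int row = 0; row < input.getHeight(); row++) {
                int argb = input.getRGB(col, row);
                int red   = (argb >> 16) & 0xFF;
                int green = (argb >> 8) & 0xFF;
                int blue  = argb & 0xFF;
                int gray = (int) Math.round(0.3 * red + 0.59 * green + 0.11 * blue);
                output.setRGB(col, row, (gray << 16) | (gray << 8) | gray);
            }
        }
        return output;
    }
}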

Figure 3.12 (a) shows an example icon in which the colors introduce noise for OCR in recognizing characters. This icon contains the text "SMS".

Figure 3.12 (a) Example color icon that OCR fails to process

It is very easy for a human to recognize the text "sms" in it. However, when we apply OCR to extract the text from it, the extracted text is "\fij", which is completely different from the correct text "sms". To reduce the noise introduced by the colors, we apply the grayscale conversion and obtain the converted icon shown in Figure 3.12 (b).

Figure 3.12 (b) The grayscale image for Figure 3.12 (a)

When we apply OCR on the icon shown in Figure 3.12 (b), the extracted text is empty. As explained in the previous section, the converted icon has a bright color for the text and a dark color for the background. Thus, it is not suitable for the OCR engine to recognize the characters. To address this issue, we apply color inversion to improve the accuracy of OCR.

Figure 3.12 (c) Example icon after grayscale conversion and color inversion

Figure 3.12 (c) shows the icon after applying color inversion on the icon shown in

Figure 3.12 (b). The characters are now in a dark color on a brighter background. When we apply OCR on this icon, the extracted text is "'r w LsmsJ,". Although there still exist some noisy characters in the resulting text, it contains the keyword "sms". Thus, this icon can be correctly matched with the sensitive category SMS in the text icon classification.

3.2.5 Contrast Adjustment

Contrast is the difference in luminance or color that makes an object (or its representation in an image or display) distinguishable [22]. The number of characters the

OCR engine can recognize from the input image depends on the quality of the image in terms of contrast [23]. Therefore, image contrast enhancement is important for achieving higher OCR accuracy.

There are many possible definitions of contrast. Some of the definitions include color; others do not. In Section 3.2.4, we show that grayscale images contain only luminance information, so that they have less noise for character recognition. Moreover, luminance information is more important in distinguishing edges and other features than chrominance information. Therefore, we define the contrast as the difference in luminance between objects in the image.

To measure the luminance information of an image, we introduce a concept named brightness, which is the perception elicited by the luminance of a visual target. In the RGB color space, the brightness of a pixel is defined as the arithmetic mean of the red, green, and blue channels:

C = \dfrac{C_r + C_g + C_b}{3}    (3.20)

where C is the brightness, and C_r, C_g, C_b represent the red, green, and blue channels.

The contrast of an image is the difference between the maximum and minimum pixel brightness in the image. Consider two images A and B. The brightness of pixels in the image A ranges from 100 to 200, while the brightness in the image B ranges from 50

to 250. That is, the contrasts of images A and B are 100 and 200, respectively. Thus, image B has a larger contrast than image A.

An image with an appropriate contrast improves the accuracy of OCR, while a too high or too low contrast compromises the effectiveness of OCR. Figure 3.13 (a) is an example image with a very low contrast. The low contrast indicates that the range of brightness is very small, and thus the pixels of the image tend to have similar colors. This poses challenges for OCR in identifying the edges or the contours of objects in the image.

Figure 3.13 (a) Example of image with very low contrast


Figure 3.13 (b) Example of image with very high contrast

Figure 3.13 (b) is an example image that has very high contrast. Many pixels in the image become either black, white, or blue. However, in the original image, they have different brightness, which means that some color information has been lost during the contrast adjustment. The reason is that the high contrast causes all the channels of different pixels to be increased to the maximum value (i.e., 255) or decreased to the minimum value

0. For example, two pixels, (150, 150, 150) and (200, 200, 200), represent different levels of gray. If the contrast is adjusted to a very large value, both of them are converted to the same color (255, 255, 255), resulting in information loss. Therefore, the contrast adjustment should be applied at a level such that it maximizes the accuracy of

OCR in recognizing the text but at the same time minimizes the information lost.

The purpose of contrast adjustment is to make bright pixels brighter and dark pixels darker under the premise that the average brightness of all pixels remains almost the same. To achieve this, we can perform the adjustment based on the difference between the brightness of a pixel and the average brightness of all pixels. It is defined as:

C' = B + F \cdot (C − B)    (3.21)

where C is a channel of the input pixel, C' is the corresponding channel of the output pixel, and B is the average brightness of all pixels. F is the contrast adjustment factor defined by the user, which can be represented as:

F = 1 + p    (3.22)

where p is the percentage of adjustment applied to the brightness difference C − B.

To compute the average brightness B of the image, we need to compute the brightness of each pixel. Generally, the average brightness of an image is between 100 and 150. Based on (3.8) and (3.9), we apply (3.21) to all the pixels, and convert each channel of each pixel as:

C'_{channel}(p, q) = B + (1 + p) \cdot (C_{channel}(i, j) − B)    (3.23)

where i = p, j = q, and channel ∈ {red, green, blue}.

DroidIcon applies equation (3.23) on each channel of each pixel with the input percentage p. The value of p should be set suitably to avoid either a too large or a too small contrast adjustment. In order to increase the differences between the image text and its background, we need to slightly increase the contrast of the image. The average brightness of an image is usually between 100 and 150, and most of the pixels have color values in the three channels that are close to the average brightness; only a few pixels have values that differ greatly from the average brightness. Therefore, we set p to 0.5 to enlarge the differences for more pixels. Algorithm 5 shows the detailed steps for the contrast adjustment technique.


Algorithm 5 Contrast Adjustment
Procedure CONTRAST(I, f)
Input: I as an input image, f as the adjustment percentage
Output: O as the output image
  count ← 0
  b ← 0
  for all col ∈ I.width
    for all row ∈ I.height
      count ← count + 1
      color ← I.getRGB(col, row)
      b ← b + (color.getRed + color.getGreen + color.getBlue) / 3
    end for
  end for
  avgb ← b / count
  for all col ∈ I.width
    for all row ∈ I.height
      color ← I.getRGB(col, row)
      factor ← 1 + f
      newRed ← Truncate(avgb + factor * (color.getRed − avgb))
      newGreen ← Truncate(avgb + factor * (color.getGreen − avgb))
      newBlue ← Truncate(avgb + factor * (color.getBlue − avgb))
      newColor ← construct color using newRed, newGreen, newBlue
      O.setRGB(col, row, newColor)
    end for
  end for

In the algorithm, the Truncate() function limits the value of a channel to be between

0 and 255: values smaller than 0 are truncated to 0, and values larger than 255 are truncated to 255.
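Putting Algorithm 5 and the Truncate() clamp together, the following is a minimal two-pass Java sketch of the contrast adjustment: the first pass computes the average brightness and the second pass scales each channel around it; with p = 0.5 as chosen above, the factor is 1.5 (names are ours).

import java.awt.image.BufferedImage;

public final class ContrastAdjustment {

    // Adjusts contrast around the average brightness, per Equation (3.23), with factor F = 1 + p.
    static BufferedImage adjust(BufferedImage input, double p) {
        int width = input.getWidth();
        int height = input.getHeight();

        // Pass 1: average brightness B over all pixels (mean of the per-pixel RGB means).
        double sum = 0;
        for (int col = 0; col < width; col++) {
            for (int row = 0; row < height; row++) {
                int argb = input.getRGB(col, row);
                sum += (((argb >> 16) & 0xFF) + ((argb >> 8) & 0xFF) + (argb & 0xFF)) / 3.0;
            }
        }
        double avg = sum / (width * height);

        // Pass 2: move each channel away from the average by the factor (1 + p), clamped to [0, 255].
        double factor = 1 + p;
        BufferedImage output = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int col = 0; col < width; col++) {
            for (int row = 0; row < height; row++) {
                int argb = input.getRGB(col, row);
                int red   = truncate(avg + factor * (((argb >> 16) & 0xFF) - avg));
                int green = truncate(avg + factor * (((argb >> 8) & 0xFF) - avg));
                int blue  = truncate(avg + factor * ((argb & 0xFF) - avg));
                output.setRGB(col, row, (red << 16) | (green << 8) | blue);
            }
        }
        return output;
    }

    // Clamps a channel value to the valid range [0, 255], like the Truncate() function above.
    private static int truncate(double value) {
        return (int) Math.max(0, Math.min(255, Math.round(value)));
    }
}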

Figure 3.14 (a) shows an example for contrast adjustment. It is an icon belonging to the SMS semantic category. It contains bright color texts and a dark color background.

Obviously, the OCR engine cannot extract the correct text from it. After inverting its color, the extracted result is still an empty text. Therefore, we adjust its contrast to see if it can improve the recognition result.


Figure 3.14 (a) Example icon before contrast adjustment

The average brightness of Figure 3.14 (a) is 138. We apply contrast adjustment and obtain the output image in Figure 3.14 (b). In this converted icon, we can see that the contour of the text becomes clearer and the background color becomes darker. We further apply color inversion on this converted icon.

Figure 3.14 (b) Example icon after contrast adjustment

Figure 3.14 (c) shows the icon after color inversion. When we apply OCR on this icon, the extracted text is “3,sms ‘”. While this text is not perfect, it suffices to make this icon correctly classified into the SMS semantic category.

Figure 3.14 (c) Example icon after contrast adjustment and color inversion

3.3 Text Icon Classification

3.3.1 Text Cleaning

Text cleaning is an important task to improve the accuracy of text icon classification.

Although OCR engines, such as Asprise [38], provide very high accuracy in character recognition, the extracted text often includes extra characters that are inferred from the shapes of objects rather than from the text in the icon. For example, Figure 3.15 shows an example email icon that contains not only the text “Email” but also an email sign. Unfortunately, the shape of the email sign is recognized by the OCR engine as extraneous characters. When we apply OCR on the icon, the extracted text is “L\_/j Email”, where the mail sign is recognized as “L\_/j”, introducing noise into the classification.

Figure 3.15 Example Email icon with extracted text “L\_/j Email”

Generally, the noise in the extracted texts falls into four types: extra alphabetic characters, digits, punctuation characters, and extra space characters. This noise is caused by extra graphical features, the original text layout, unusual text fonts, etc. To improve the accuracy of the subsequent keyword-based classification, we perform text cleaning to remove the digits, punctuation characters, and extra space characters in the extracted text.

After text cleaning, the extracted text “L\_/j Email” becomes “LjEmail”, where only the alphabetic characters are preserved. As we can see, the resulting text still contains both upper and lower case characters. In order to have a uniform letter case for keyword matching, we convert all the upper case characters to the corresponding lower case characters, and the resulting text becomes “ljemail”. We name this result the “candidate text”, which is used by the text icon classification component to determine the semantic category the icon belongs to.
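A minimal Java sketch of this cleaning step (assuming the rules above; the method name is illustrative) is:

    // Text cleaning: keep only alphabetic characters and normalize to lower case.
    static String cleanExtractedText(String ocrOutput) {
        String lettersOnly = ocrOutput.replaceAll("[^A-Za-z]", ""); // drop digits, punctuation, spaces
        return lettersOnly.toLowerCase();                           // uniform case for keyword matching
    }

With this helper, the extracted text “L\_/j Email” would be cleaned to “ljemail”, the candidate text used below.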

3.3.2 Keyword Dataset Construction

The goal of text icon classification is to determine the most probable semantic category the input icon belongs to. While there could be numerous types of semantics that can be expressed by icons, we define 9 semantic categories based on the sensitive information categories used in Android [2].

Table 3.6 Keyword Set

Category Sensitive Keywords

Camera camera, retake

Contacts contact, group

Email email, mail

Location location, locate, gps, map, address

Phone call, phone

Photo photo, image, video, audio

SMS sms, message

Among the 9 categories, 7 of them indicate 7 different types of sensitive information frequently used in Android apps. The 7 sensitive semantic categories are:

Camera, Contacts, Email, Location, Phone, Photo and SMS. These sensitive categories are commonly used in apps to access sensitive personal data, including users' private photos, messages, emails, current locations, etc. [2, 31]. The other two categories are non-sensitive semantic categories: non-sensitive text icons and non-text icons. We use the keywords from previous work [2] and further collect more keywords to form the keyword dataset for each sensitive category. Table 3.6 shows the keyword set for each sensitive category. All the keywords of a category are compared with the candidate text; if the text matches any keyword of the category, the icon is classified into that category.
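For illustration, the keyword set in Table 3.6 could be stored as a simple category-to-keywords map; the following Java sketch is only one possible representation, not the data structure used in DroidIcon:

    import java.util.List;
    import java.util.Map;

    // Sketch: the keyword dataset of Table 3.6 as a map from category to keywords.
    final class KeywordDataset {
        static final Map<String, List<String>> KEYWORDS = Map.of(
            "Camera",   List.of("camera", "retake"),
            "Contacts", List.of("contact", "group"),
            "Email",    List.of("email", "mail"),
            "Location", List.of("location", "locate", "gps", "map", "address"),
            "Phone",    List.of("call", "phone"),
            "Photo",    List.of("photo", "image", "video", "audio"),
            "SMS",      List.of("sms", "message")
        );
    }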

3.3.3 Classification Algorithm

To classify text icons to one of the 9 semantic categories, a straightforward approach is to perform keyword matching, as used in previous studies [2]. However, the extracted texts from the icons are often not perfect, which may include extra characters or miss some characters. Therefore, it is unlikely for the words in the extracted texts to exactly match a sensitive keyword from the semantic categories, rendering the classification ineffective.

To address the issues of extra characters and missing characters, instead of pursuing exact matching, we consider a word a match with a sensitive keyword if they are similar enough. More specifically, we develop an edit distance based algorithm to compute the similarities between words in the extracted texts and the keywords, and identify the most similar keyword based on the computed similarities; if the similarity is over a pre-defined threshold, we consider the extracted text to match the keyword and classify the icon into the corresponding semantic category.

Edit distance is a widely used approach to quantify how dissimilar two strings are by computing the minimum cost of operations required to transform one string to another.

Different edit distance algorithms have different definitions of the string operations. In this thesis, we use the Levenshtein distance, which considers three operations: removal, insertion, or substitution of a character in the string. All the operations have the same cost. Thus, the Levenshtein distance is equal to the minimum number of operations required to transform one string into another.

Dynamic programming is a frequently used and efficient way to compute edit distance. It is a method that solves a complex problem by breaking it into a collection of simpler subproblems and combines the solutions of the subproblems to solve the whole problem. Based on the idea of dynamic programming, the Levenshtein distance D between a = a_1 … a_m and b = b_1 … b_n is defined as [24]:

D[i][j] = D[i−1][j−1]                                            if a_i = b_j
D[i][j] = min( D[i−1][j] + 1, D[i][j−1] + 1, D[i−1][j−1] + 1 )   if a_i ≠ b_j
for 1 ≤ i ≤ m, 1 ≤ j ≤ n    (3.19)

where D[i][j] is the edit distance between the prefix a_1 … a_i of a and the prefix b_1 … b_j of b.

(3.19) is a recurrence that computes D[i][j] based on D[i−1][j], D[i][j−1], and D[i−1][j−1]. First, the algorithm computes D[i][j] for small i and j, and then computes larger D[i][j] based on the previously computed smaller values. After computing all the D[i][j] for 1 ≤ i ≤ m, 1 ≤ j ≤ n, we obtain the final result D[m][n], which is the minimum edit distance between a and b. The algorithm steps can be demonstrated using a two-dimensional matrix, as shown in Figure 3.16. We use this matrix to explain more details about the Levenshtein distance.


Figure 3.16 Demonstration of Levenshtein distance

Figure 3.16 is a matrix that demonstrates how the edit distance is computed between the two strings a = “lcafiion” and b = “location”. One axis of the matrix represents the string a and the other represents the string b. Each D[i][j] in the matrix is the minimum edit distance between the prefix a_1 … a_i and the prefix b_1 … b_j, where 1 ≤ i ≤ m, 1 ≤ j ≤ n. We can fill in all D[i][j] according to the recurrence relation (3.19). For each D[i][j], we consider three operations:

(1) Removing the i-th character of a, which corresponds to D[i][j] = D[i−1][j] + 1 in (3.19).

(2) Inserting the j-th character of b, which corresponds to D[i][j] = D[i][j−1] + 1 in (3.19).

(3) Substituting the i-th character of a with the j-th character of b. There are two cases for this operation: if a_i equals b_j, then D[i][j] = D[i−1][j−1]; otherwise, D[i][j] = D[i−1][j−1] + 1 in (3.19).

The initialization of the matrix is D[i][0] = i and D[0][j] = j for 1 ≤ i ≤ m, 1 ≤ j ≤ n. The initialization ensures that if the length of either a or b is zero, the edit distance equals the length of the other string. After the matrix initialization, to compute the minimum edit distance between a and b, the algorithm chooses the minimum value among the three operations for each cell. This step is repeated until all the D[i][j] have been filled.

In Figure 3.16, each operation is represented as a colored arrow, where orange up arrows represent removals, purple left arrows represent insertions, and blue diagonal arrows represent substitutions. All the arrows compose a backtrace path from D[m][n] to D[1][1], illustrating how the edit distance between a = “lcafiion” and b = “location” is computed step by step.

D[m][n] stores the final result. As shown in Figure 3.16, the edit distance between “lcafiion” and “location” is D[m][n] = 3.
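A compact Java sketch of this dynamic programming computation (a direct implementation of (3.19), not necessarily the exact code used in DroidIcon) is:

    // Levenshtein distance via dynamic programming, following recurrence (3.19).
    static int editDistance(String a, String b) {
        int m = a.length(), n = b.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i;   // distance to the empty string
        for (int j = 0; j <= n; j++) d[0][j] = j;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    d[i][j] = d[i - 1][j - 1];                         // characters match
                } else {
                    d[i][j] = 1 + Math.min(d[i - 1][j],                // removal
                                  Math.min(d[i][j - 1],                // insertion
                                           d[i - 1][j - 1]));          // substitution
                }
            }
        }
        return d[m][n];
    }

With this helper, editDistance("lcafiion", "location") returns 3, matching Figure 3.16.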

By using the Levenshtein distance, we can compute the edit distance between any word in the extracted text of an icon and any sensitive keyword. Two words are considered more similar if they have a smaller edit distance. However, the Levenshtein distance is an absolute distance, which depends on the maximum length of the strings: its bound is [0, max(|a|, |b|)], where |a| is the length of a and |b| is the length of b. This introduces two problems during the classification. We describe them below together with our solutions:

(1) The extracted text usually contains redundant words that do not express intentions of using sensitive information. For example, for a text icon that contains the text “enablegps”, the word “gps” indicates the intention to use users’ GPS data, which is sensitive, while the word “enable” is redundant. Redundant words may also come from inaccurate character recognition of the OCR engine. These redundant words make the words in the extracted text contain many more characters than the sensitive keywords, decreasing the accuracy of keyword matching. For example, the edit distance between “enablegps” and the keyword “gps” is 6, which is quite large and indicates that they are dissimilar. However, we know “enablegps” should match “gps” since “gps” is a substring of “enablegps”.

To address the issue of redundant words, we introduce the concept of n-grams [25]. An n-gram is a contiguous sequence of n items from a given sequence of text. For example, the 3-grams of a word include all the substrings of the word whose length is 3. If the length of the word is equal to or smaller than the length of the keyword, we compute the edit distance between the word and the keyword directly. Otherwise, we generate an n-gram list from the word where n ∈ [len(keyword) − 1, len(keyword) + 1]. For example, given the word “enablegps” and the keyword “gps”, the length of the keyword is 3, so the n-gram list contains all the 2-grams, 3-grams, and 4-grams: {“en”, “na”, “ab”, “bl”, “le”, “eg”, “gp”, “ps”, “ena”, “nab”, “abl”, “ble”, “leg”, “egp”, “gps”, “enab”, “nabl”, “able”, “bleg”, “legp”, “egps”}. We then compare each n-gram with the keyword. Since there exists an n-gram “gps” that exactly matches the keyword, the corresponding icon is classified into the “Location” category.

(2) The Levenshtein distance represents the absolute distance between two words. However, such an absolute distance may not reflect the similarity properly. For example, if a word has the same edit distance to two different keywords, it is unclear which keyword is more similar to the word. Additionally, a smaller Levenshtein distance does not necessarily indicate that two words are more similar. Consider another example. Given the word “locatl”, we compare it with the two keywords “locate” and “call”. The edit distance between “locatl” and “locate” is 1. For the keyword “call”, we generate an n-gram list that includes {“loca”, “ocat”, “catl”}, and the edit distance between “catl” and “call” is also 1. While both have the same distance, we know “locatl” is more similar to “locate” than to “call”. To address this issue, we propose to measure the similarity using a relative distance instead of an absolute one. The relative distance is computed as:

dist_rel = dist_abs / len(keyword)    (3.20)

where dist_rel is the relative edit distance, dist_abs is the absolute edit distance, and len(keyword) is the length of the keyword. The similarity between a word and a keyword is then computed based on (3.20):

similarity = 1 − dist_rel    (3.21)

Based on (3.21), we can compute the similarity between “locatl” and “locate” as 0.833, and the similarity between “catl” and “call” as 0.75. Therefore, the word “locatl” is more similar to “locate”, and the corresponding icon is classified into the “Location” category. A small code sketch of the n-gram generation and similarity computation is given after this list.
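As sketched below (illustrative Java, reusing the editDistance helper above; method names are hypothetical), the n-gram generation and the relative-similarity computation of (3.20) and (3.21) can be combined:

    import java.util.ArrayList;
    import java.util.List;

    // Generate all n-grams of `word` for n in [kwLen - 1, kwLen + 1].
    static List<String> generateNgrams(String word, int kwLen) {
        List<String> ngrams = new ArrayList<>();
        for (int n = kwLen - 1; n <= kwLen + 1; n++) {
            if (n < 1 || n > word.length()) continue;   // skip invalid lengths
            for (int start = 0; start + n <= word.length(); start++) {
                ngrams.add(word.substring(start, start + n));
            }
        }
        return ngrams;
    }

    // Similarity between a word and a keyword based on (3.20) and (3.21).
    static double similarity(String word, String keyword) {
        int kwLen = keyword.length();
        if (word.length() <= kwLen) {
            return 1.0 - editDistance(word, keyword) / (double) kwLen;
        }
        double best = 0.0;                               // word is longer: use n-grams
        for (String ngram : generateNgrams(word, kwLen)) {
            best = Math.max(best, 1.0 - editDistance(ngram, keyword) / (double) kwLen);
        }
        return best;
    }

Under these assumptions, similarity("locatl", "locate") is about 0.833 and similarity("enablegps", "gps") is 1, consistent with the examples above.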

To classify text icons, our approach compares each word from the extracted text of an icon with every keyword iteratively and identifies the most similar keyword. If there is a keyword that exactly matches any word in the extracted text, which means similarity = 1, our approach terminates and classifies the icon into the corresponding category.

However, this approach has a potential problem that may cause a large number of false positives (FPs). An FP means that an icon that should be classified into the non-text or the non-sensitive text category is classified as a sensitive text icon. The reason for FPs is that any non-empty extracted text is likely to contain a word that is somewhat similar to a keyword, even if the similarity is very small. If we consider any similarity larger than 0 a match, our approach could potentially generate many FPs.

Empirically, we observe that the extracted texts from sensitive icons have significantly higher similarities (close to 1) than those from non-sensitive text icons. Therefore, we propose to set a threshold for the similarity comparison. Only a word whose similarity is larger than the given threshold is considered sensitive. If the highest similarity is below the threshold but larger than 0, the corresponding icon is classified as a non-sensitive text icon. If the similarity is 0, the corresponding icon is classified as a non-text icon.

Besides affecting FPs, the threshold value also affects false negatives (FNs): a higher threshold typically produces more FNs. Therefore, there is a trade-off between FPs and FNs. We set the threshold value to 0.6 because this value achieves the highest F1 score according to our experiments; the details of the experiments are presented in the next chapter. Algorithm 6 shows the pseudo code for the text icon classification.


Algorithm 6 Text Icon Classification
Procedure CLASSIFICATION (S, C)
Input: S as candidate string, C as the map which contains all the categories and corresponding keywords
Output: P as the predicted category that S belongs to
1:  P ← null
2:  maxSimi ← 0
3:  for all category ∈ C
4:    for all kw ∈ category
5:      if S.length > kw.length then
6:        ngramList ← GenerateNgram(S)
7:        for all ngram ∈ ngramList
8:          simi ← 1 - EditDistance(ngram, kw) / kw.length
9:          if simi = 1 then return P ← C.getCategory(kw)
10:         if simi > maxSimi and simi > 0.5 then
11:           P ← C.getCategory(kw)
12:           maxSimi ← simi
13:         end if
14:       end for
15:     else
16:       simi ← 1 - EditDistance(S, kw) / kw.length
17:       if simi = 1 then return P ← C.getCategory(kw)
18:       if simi > maxSimi and simi > 0.5 then
19:         P ← C.getCategory(kw)
20:         maxSimi ← simi
21:       end if
22:     end if
23:   end for
24: end for
25: return P

The algorithm uses two global variables to keep track of the current maximum similarity and the most similar category. The helper function GenerateNgram() is used to generate the n-gram list of the extracted text, and EditDistance() is used to compute the Levenshtein distance between the extracted text and a keyword.

3.4 DroidIcon

In Section 3.2, we introduced a suite of image mutation techniques. However, different image mutation techniques work for different types of icons, and no single technique works well for all of them. Therefore, DroidIcon adopts an iterative process that tries different image mutation techniques on the input icon until the best classification result is obtained. DroidIcon first chooses an image mutation technique to apply to the input icon and tries to classify the converted icon into one of the 7 sensitive categories. If none matches, DroidIcon chooses another image mutation technique to apply. This process is repeated until the icon is classified into one of the 7 categories or all the techniques have been applied. Algorithm 7 shows the detailed steps of DroidIcon.

Algorithm 7 DroidIcon
Procedure DROIDICON (I)
Input: I as an input icon image, Options as four image mutation algorithms which are {ORIGINAL(I), OPACITYCONVERSION(I), GRAYSCALE(I), CONTRAST(I, 0.5)}
Output: P as the predicted category that icon I belongs to
1:  P ← null
2:  maxSimi ← 0
3:  R ← 100 / min(I.width, I.height)
4:  I ← SCALING(I, R)
5:  for all option ∈ Options
6:    apply option to I
7:    Text ← Extract text from I using OCR
8:    CLASSIFICATION(Text, C)
9:    if maxSimi = 1 return P
10:   INVERSION(I)
11:   Text ← Extract text from I using OCR
12:   CLASSIFICATION(Text, C)
13:   if maxSimi = 1 return P
14: end for
15: return P

The algorithm first defines two global variables, P and maxSimi, to keep track of the maximum similarity and the corresponding category. The scaling ratio is computed based on the icon's height and width (Line 3), and the icon is then scaled using Algorithm 1 (Image Scaling). The next step is to apply the image mutation algorithms to the icon iteratively (Lines 5-14). The image mutation techniques include Algorithm 3 (Opacity Conversion), Algorithm 4 (Grayscale Conversion), and Algorithm 5 (Contrast Adjustment). Before applying any of these algorithms, DroidIcon first extracts text from the original icon. If that text cannot be matched with any keyword, DroidIcon starts to apply the image mutation techniques. After an image mutation is applied, DroidIcon extracts text from the converted icon with the original colors first. DroidIcon then applies color inversion (Algorithm 2) to invert the colors of the converted icon and extracts text again. DroidIcon then applies Algorithm 6 (Text Icon Classification) on the extracted text to classify the icon. If the extracted text exactly matches any keyword, DroidIcon outputs the semantic category this keyword belongs to. Otherwise, the algorithm updates P and maxSimi if a more similar keyword is found. The procedure (Lines 5-14) repeats for all the image mutation algorithms. If the algorithm still does not find an exactly matching keyword, the most similar keyword is used to classify the icon.
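The following Java sketch illustrates this iterative loop under simplifying assumptions: scale(), invert(), extractText(), and bestCategory() are hypothetical stand-ins for Algorithms 1, 2, the OCR call, and Algorithm 6, and the maxSimi bookkeeping for the non-exact fallback is omitted. It is not the exact DroidIcon implementation.

    import java.awt.image.BufferedImage;
    import java.util.List;
    import java.util.function.UnaryOperator;

    // Sketch of DroidIcon's mutate-then-classify loop (Algorithm 7), simplified to
    // return on the first exact keyword match.
    final class DroidIconSketch {
        static String classifyIcon(BufferedImage icon,
                                   List<UnaryOperator<BufferedImage>> mutations) {
            double ratio = 100.0 / Math.min(icon.getWidth(), icon.getHeight());
            BufferedImage scaled = scale(icon, ratio);                 // Algorithm 1: image scaling
            for (UnaryOperator<BufferedImage> mutate : mutations) {    // original, opacity, grayscale, contrast
                BufferedImage converted = mutate.apply(scaled);
                String match = bestCategory(extractText(converted));   // converted icon, original colors
                if (match != null) return match;
                match = bestCategory(extractText(invert(converted)));  // Algorithm 2: color inversion
                if (match != null) return match;
            }
            return null;  // no exact match; fall back to the most similar keyword, as described above
        }

        // Hypothetical helpers standing in for Algorithms 1, 2, 6 and the OCR engine call.
        static BufferedImage scale(BufferedImage img, double ratio) { throw new UnsupportedOperationException(); }
        static BufferedImage invert(BufferedImage img) { throw new UnsupportedOperationException(); }
        static String extractText(BufferedImage img) { throw new UnsupportedOperationException(); }
        static String bestCategory(String text) { throw new UnsupportedOperationException(); }
    }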


Chapter 4

Evaluations

We first conduct comprehensive experiments on DroidIcon to demonstrate the effectiveness of each component. Then we conduct case studies on the UIs of two different apps to demonstrate how DroidIcon can be used to detect the intentions of using sensitive information in the UIs.

4.1 Icon Dataset Construction

To construct the testing dataset, we adopt a semi-automatic process, which is motivated by the observation that file names often indicate the semantics of the icons. For example, an icon that contains “email” in its file name is probably an email icon. We observe that icon images are usually stored in the resource folder of an Android APK. We crawl the APKs of the top 2000 apps from the official Google Play store, and then apply a tool to automatically extract the icons whose names contain sensitive keywords from the resource folders of these APKs. We then manually collect the icons that contain sensitive text and divide them into the corresponding sensitive categories. In total, we collect 332 sensitive text icons, which form the positive dataset. For the negative dataset, we collect 105 icons that contain text that is not sensitive, and another 270 icons with no text, for a total of 375 negative icons. These 707 icons are used in the experiments to evaluate the effectiveness of DroidIcon.


Figure 4.1 Number of Words in Text Icons

Figure 4.1 shows the number of words contained in the 332 text icons. Most of them contain three or fewer words. This indicates that most text icons do not contain sentences or paragraphs, but mostly phrases or single words. Therefore, keyword matching can be used to effectively classify the text icons.

4.2 Effectiveness of DroidIcon

DroidIcon is designed to be recall-oriented: it focuses on maximizing the number of successfully detected sensitive icons. As a side effect, this may decrease precision, since there is a trade-off between achieving higher recall and higher precision. Therefore, in our evaluations, we conduct experiments on the recalls to measure the effectiveness of DroidIcon.

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        1        0          4        3          2        0       7
FN        16       36         58       54         32       25      94
Recall    5.88%    0%         6.45%    5.26%      5.88%    0%      6.93%

Figure 4.2 Recall of OCR

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        2        4          13       24         8        5       21
FN        15       32         49       33         26       20      80
Recall    11.76%   11.11%     20.97%   42.11%     23.52%   20%     20.79%

Figure 4.3 Recall of OCR + Classification

Figure 4.4 Comparison of the recalls of OCR and OCR + Classification

First, we conduct an experiment to measure the effectiveness of the OCR engine on the collected icons. We name this approach OCR. In this experiment, we first extract the texts from the icons directly using the OCR engine. Only when a word in the extracted text exactly matches a sensitive keyword is the icon classified into the corresponding semantic category. Otherwise, if the extracted text is not empty, the icon is classified as a non-sensitive text icon; if the extracted text is an empty string, the icon is classified as a non-text icon.

Figure 4.2 shows the result of this experiment. We compute the recall of each category defined in (4.1):

Recall = TP / (TP + FN)    (4.1)
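For example, using the Camera column of Figure 4.2, Recall = 1 / (1 + 16) ≈ 5.88%.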

We can see that the recalls of all the categories are less than 10%, indicating that the OCR engine does not perform well on the icons. Only very few sensitive text icons are correctly detected.

In our second experiment, we improve OCR by using our edit distance based text icon classification technique. We name this approach OCR + Classification. Figure 4.3 shows the result of OCR + Classification. As we can see, the recall in each category ranges from 11% to about 40%. Such results show that the text icon classification technique significantly improves the recall performance for each category, especially in the Location category (from 5.26% to 42.11%).

To evaluate the effectiveness of different image mutation techniques, we conduct a series of experiments that apply each technique to improve text icon classification.

Therefore, we have five experiments:

(1) OCR + Classification + Image Scaling

(2) OCR + Classification + Color Inversion

(3) OCR + Classification + Opacity Conversion

(4) OCR + Classification + Grayscale Conversion

(5) OCR + Classification + Contrast Adjustment

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        4        4          19       23         9        3       20
FN        13       32         43       34         25       22      81
Recall    23.53%   11.11%     30.65%   40.35%     26.47%   12%     24.69%

Figure 4.5 Recall of OCR + Classification + Image Scaling

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        12       8          16       21         15       11      22
FN        5        28         46       36         19       14      79
Recall    70.59%   22.22%     25.81%   36.84%     44.12%   44%     21.78%

Figure 4.6 Recall of OCR + Classification + Color Inversion

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        3        13         7        4          4        7       10
FN        14       23         55       53         30       18      91
Recall    17.65%   36.11%     11.29%   7.02%      11.76%   28%     10.99%

Figure 4.7 Recall of OCR + Classification + Opacity Conversion

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        2        12         14       24         8        6       24
FN        15       24         48       33         26       19      77
Recall    11.76%   33.33%     22.58%   42.11%     23.52%   24%     24.76%

Figure 4.8 Recall of OCR + Classification + Grayscale Conversion

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        1        3          11       14         4        5       21
FN        16       33         51       43         30       20      80
Recall    5.88%    8.33%      17.74%   24.56%     11.76%   20%     20.79%

Figure 4.9 Recall of OCR + Classification + Contrast Adjustment


Figure 4.10 Comparison of recalls among all the image mutation techniques

Figure 4.5 to Figure 4.9 show the recalls for the five experiments. In Figure 4.10, we use a bar chart to compare the performance of each mutation technique. We can see that each mutation algorithm leads to a certain improvement in each category compared to OCR. However, we also find that, except for the Color Inversion technique, which achieves a recall of 70.59% on Camera icons, the recall of each technique in every category is no more than 50%. This demonstrates that a single mutation technique cannot work well for all icons, but can improve the OCR performance only on certain types of icons. This motivates DroidIcon to combine all the mutation algorithms, so that different types of text icons can be properly handled.

          Camera   Contacts   Email    Location   Phone    Photo   SMS
TP        16       34         51       56         29       23      84
FN        1        2          11       1          5        2       17
Recall    94.12%   94.44%     82.26%   98.25%     85.29%   92%     83.17%

Figure 4.11 Recall of DroidIcon

Figure 4.12 Comparison of recalls between OCR and DroidIcon

Figure 4.11 shows the recall of DroidIcon. We can see that the recalls are more than 80% for all the sensitive categories, and for the Camera, Contacts, Location and Photo categories, the recalls are above 90%. We compare these recalls with those achieved by OCR in Figure 4.12. It clearly shows that DroidIcon significantly improves the recall for each category, demonstrating the effectiveness of our algorithm.

Besides the recall, we also compute accuracy, precision, and F1-score as:

Accuracy = (TP + TN) / (Positive + Negative)    (4.2)

Precision = TP / (TP + FP)    (4.3)

F1 = 2 * Precision * Recall / (Precision + Recall)    (4.4)

TP: 293    FN: 39
FP: 24 (non-sensitive text) + 4 (non-text)
TN: 81 (non-sensitive text) + 266 (non-text)
Recall: 88.25%    Precision: 91.28%    Accuracy: 90.52%    F1-score: 89.7%

Figure 4.13 Recall, precision, accuracy, and F1-score achieved by DroidIcon
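As a quick consistency check of (4.1)-(4.4), the following small Java sketch recomputes the metrics from the counts in Figure 4.13 (the method name is illustrative):

    // Recompute the metrics in (4.1)-(4.4) from the counts reported in Figure 4.13.
    static void printMetrics() {
        int tp = 293, fn = 39;                      // sensitive text icons detected / missed
        int fp = 24 + 4, tn = 81 + 266;             // non-sensitive text + non-text icons
        double recall = tp / (double) (tp + fn);                      // ~0.8825
        double precision = tp / (double) (tp + fp);                   // ~0.9128
        double accuracy = (tp + tn) / (double) (tp + fn + fp + tn);   // ~0.9052
        double f1 = 2 * precision * recall / (precision + recall);    // ~0.897
        System.out.printf("recall=%.4f precision=%.4f accuracy=%.4f f1=%.4f%n",
                          recall, precision, accuracy, f1);
    }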

From Figure 4.13, we can see that our algorithm also achieves 91.28% precision, 90.52% accuracy, and an 89.7% F1-score. The reason our algorithm achieves such good performance is that Android icons usually have their own specific designs, such as small sizes and bright-colored characters on a dark background, and our algorithm is specifically designed to handle icons with these designs.

In terms of False Positives (FPs), there are 28 in total, where 24 of them are caused by non-sensitive text icons and 4 of them are caused by non-text icons. This shows that non-sensitive text icons are more likely to cause FPs. The reason is that non-text icons do not contain any text, and only the shapes of objects in the icons may cause extra characters to be extracted by the OCR engine, whereas non-sensitive text icons contain texts that can be used to compute similarities, which is more likely to cause FPs.

There are 39 False Negatives (FNs) produced by DroidIcon. There are three major reasons for FNs:

(1) The texts in the icons use unusual or decorative fonts that are more difficult for the OCR engine to recognize. As we have introduced, the OCR engine's performance highly depends on the font types of the input. The Asprise OCR engine achieves high performance on commonly used fonts such as Times, Arial, and Helvetica, but performs worse on unusual or decorative fonts, as shown in Figure 4.14.

Figure 4.14 Icons with unusual or decorative fonts

(2) Unsuitable character size or image size will also result in false negatives. Although DroidIcon already scales the image, this cannot work well for all icons. The image scaling algorithm we use is an interpolation algorithm, which means the color of a pixel in the converted icon is computed from the pixels of the original icon with different weights. Therefore, the scaling process may introduce more noise or information loss in the converted icons.

Figure 4.15 Icons with unsuitable character and image sizes

Figure 4.15 shows two icons that cause FNs due to unsuitable character and image sizes. The left icon's size is 57x57, which is very small. The text “GPS Off” is very small and barely visible, and the OCR engine does not extract any character from the original icon. Figure 4.16 is the scaled icon. Although the character size becomes larger, its resolution is very low. Therefore, DroidIcon fails to recognize the text, and the final output is still an empty text.

Figure 4.16 Scaled Icon from Figure 4.15

The OCR engine also does not extract any character from the right icon in Figure 4.15. Figure 4.17 is the converted icon after image scaling and contrast adjustment are applied. The character size is larger, but the character “i” in the original icon looks very similar to “l”. Also, there exists some noise in the character “a” caused by the scaling. In the end, the extracted string is “lllll”. This text causes the icon to be classified into the “Phone” category, causing an FN.

Figure 4.17 Email icon with image scaling and contrast adjustment

(3) Another reason for FNs is that the text and the background of some text icons may have similar colors. In Chapter 3, we introduced the contrast adjustment technique, which aims to increase the brightness difference between the text and the background. However, if the text and background have similar colors, the benefit of contrast adjustment is limited.

Figure 4.18 Icon with similar colors in the text and the background

Figure 4.18 shows an icon that has bright colors in both the text and the background. When we analyze the pixels of this icon, the average brightness is 199, the text color is around (255, 238, 158), and the background color is (255, 255, 255). According to the contrast adjustment technique, the adjustment is based on the difference between the average brightness and the value of each color channel. Therefore, similar colors will have similar adjustments. The red and green channels of both the text and the background become 255 after adjustment, and thus the difference between the text and the background can only be identified through the blue channel.

Figure 4.19 Icon in Figure 4.18 after contrast adjustment

Figure 4.19 shows the result after contrast adjustment is applied to the icon shown in Figure 4.18. We can see that the text color is bright yellow, which is still similar to the background color. The final output is still an empty text, causing an FN.

Threshold    0        0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
TP           307      307      306      304      301      301      293      266      255      246      246
FP           281      281      278      189      130      65       28       5        5        5        1
FN           25       25       26       28       31       31       39       66       77       86       86
Recall       92.47%   92.47%   92.17%   91.57%   90.66%   90.66%   88.25%   80.12%   76.81%   74.1%    74.1%
Precision    52.21%   52.21%   52.4%    61.66%   69.84%   82.24%   91.28%   98.15%   98.08%   98%      99.6%
F1-score     66.73%   66.73%   66.81%   73.7%    78.9%    86.24%   89.7%    88.22%   86.15%   84.39%   84.98%

Figure 4.20 Comparison of DroidIcon's performance using different similarity thresholds

Figure 4.21 Influences of similarity threshold on effectiveness

DroidIcon uses a similarity threshold to control the FPs and FNs. Figure 4.20 shows the performance of DroidIcon under different similarity thresholds. The recall decreases and the precision increases as the threshold becomes larger. Since precision and recall form a trade-off, we use the F1 score to identify the best trade-off. When the threshold is 0.6, the F1 score is the highest, 89.7%. Therefore, we use a similarity threshold of 0.6 in DroidIcon.

4.3 Case Study

To improve the awareness and understanding of sensitive information expressed in text icons, we conduct case studies on two UIs selected from two different apps. These case studies present how text icons express intentions of using users' sensitive information, and also demonstrate the usefulness of DroidIcon.


Figure 4.22 Case study: Text Icons for Location, Message, Email, Contacts and Call

Figure 4.22 is a UI from an app named MyCityWay. This app aims to help people find information about local places. This UI shows the information of a house and provides some options for the user to choose. We manually find seven sensitive icons in the UI, marked using red boxes. We apply DroidIcon to these icons, and the result shows that all of them are successfully detected. The “Call”, “Direction”, and “Map” icons are correctly classified after image scaling. The “MY PLACES”, “CONTACTS”, “Email”, and “SMS” icons are correctly classified after color inversion. We can further apply behavioral analysis to the behaviors triggered by clicking these sensitive icons to detect potential security risks.


Figure 4.23 Case study: Text Icons for Messaging and Email

Figure 4.23 shows a UI from a game app. This UI provides users two ways to contact the developers: through SMS or email. We apply DroidIcon to these two sensitive text icons, but DroidIcon fails to classify both of them because of their decorative fonts. This case shows that DroidIcon still has some limitations in handling decorative fonts.


Chapter 5

Related Work

There exist some works that leverage semantic information to detect the uses of users' sensitive information. SUPOR [2], UIPicker [1], and UiRef [32] are among the first works to analyze the semantics of the descriptive texts in apps' UIs, such as text labels, to determine whether the corresponding user inputs contain sensitive data. AsDroid [33] also checks whether the descriptive texts are compatible with the intents represented by the sensitive APIs. WHYPER [29] and AutoCog [34] utilize natural language processing techniques to analyze app descriptions and infer the mapping between sentences in app descriptions and the permissions claimed by the apps. Our approach focuses on analyzing the semantics of descriptive texts in icons, complementing these techniques to better understand apps' intents.

Many research efforts have been devoted to improving the performance of OCR on low-quality images. Li et al. [35] observe that OCR software typically can handle only images with black text on a white background. Thus, they propose to invert white text and enhance the resolution based on Shannon interpolation. Their experiments on digital video images demonstrate that their approach improves OCR accuracy considerably. Jacobs et al. [36] propose a machine learning approach based on a convolutional neural network to improve the resolution of low-resolution document images. Their approach converts the image to grayscale and achieves up to 95% accuracy in their experiments. Leung et al. [37] propose to enhance the contrast of images in order to reduce background noise and increase the readability of texts, thus improving the effectiveness of OCR. Inspired by these works, our approach combines different image mutation techniques according to the characteristics of Android icons, and iteratively applies the mutation techniques to produce OCR-friendly icons.


Chapter 6

Discussion and Conclusion

Although DroidIcon shows promising results in classifying sensitive text icons, it still has limitations in handling some text icons.

First, DroidIcon is designed based on OCR techniques. Existing OCR techniques are mainly based on machine learning, and thus their performance largely depends on the learning approach and training dataset they use. In Chapter 4, we show that the OCR engine does not perform well on unusual or decorative fonts. Therefore, we plan to introduce font mutation techniques to convert the fonts in future work. Since Android icons have their own specific design styles, we could also train an OCR engine specialized for Android icons for text icon identification in future work.

Second, we focus on detecting sensitive texts from text icons. Icons with no text may also indicate the uses of users' sensitive information, as pointed out in Chapter 4. For example, in Figure 3.15, besides the “Email” text, there is also a mail sign. This kind of information could be leveraged using computer vision techniques. We plan to address this issue in our future work.

Finally, after the sensitive text icons are identified, we can further apply behavioral analysis to detect the potential security risks in using users' sensitive information. For example, when a non-sensitive text icon is clicked, the app may still use users' sensitive information, which raises a red flag to some extent. We plan to develop program analysis techniques for analyzing the behaviors triggered by clicking these sensitive icons in our future work.

To conclude, we leverage a largely neglected resource, text icons, for detecting mobile apps' intentions of using users' sensitive information. In particular, we proposed DroidIcon, a text icon classification approach based on OCR (Optical Character Recognition) to determine whether the texts contained in text icons indicate the uses of sensitive information. To demonstrate the effectiveness of DroidIcon, we conduct experiments on 707 icons collected from the 2000 most popular Android apps. The results show that DroidIcon achieves an accuracy of 90.52%, a precision of 91.28%, and a recall of 88.25%. We also conduct two case studies that demonstrate the effectiveness of DroidIcon in detecting UIs that indicate the uses of users' sensitive information.


Bibliography

[1] Nan, Y., Yang, M., Yang, Z., Zhou, S., Gu, G., & Wang, X. (2015, August). UIPicker: User-Input Privacy Identification in Mobile Applications. In USENIX Security Symposium (pp. 993-1008).

[2] Huang, J., Li, Z., Xiao, X., Wu, Z., Lu, K., Zhang, X., & Jiang, G. (2015, August). SUPOR: Precise and Scalable Sensitive User Input Detection for Android Apps. In USENIX Security Symposium (pp. 977-992).

[3] Arzt, S., Rasthofer, S., Fritz, C., Bodden, E., Bartel, A., Klein, J., ... & McDaniel, P. (2014). FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Notices, 49(6), 259-269.

[4] https://en.wikipedia.org/wiki/Pixel

[5] https://en.wikipedia.org/wiki/RGB_color_model

[6] https://en.wikipedia.org/wiki/Optical_character_recognition

[7] Patel, C., Patel, A., & Patel, D. (2012). Optical character recognition by open source OCR tool Tesseract: A case study. International Journal of Computer Applications, 55(10).

[8] Youn, S., & McLeod, D. (2009, March). Improved spam filtering by extraction of information from text embedded image e-mail. In Proceedings of the 2009 ACM Symposium on Applied Computing (pp. 1754-1755). ACM.

[9] https://asprise.com/royalty-free-library/java-ocr-api-overview.html

[10] https://en.wikipedia.org/wiki/Image_scaling

[11] https://en.wikipedia.org/wiki/Lanczos_resampling

[12] Turkowski, K. (1990). Filters for common resampling tasks. Graphics Gems, 147-165.

[13] Blinn, J. (1998). Jim Blinn's corner: dirty pixels. Morgan Kaufmann.

[14] https://www.pinterest.co.uk/pin/357051076681322226/

[15] https://www.techinasia.com/talk/ghost-buttons-ux-design

[16] https://www.dynamsoft.com/blog/document-imaging/scan-settings-for-best-ocr-accuracy/

[17] Kanan, C., & Cottrell, G. W. (2012). Color-to-grayscale: does the method matter in image recognition? PLoS ONE, 7(1), e29740.

[18] https://en.wikipedia.org/wiki/Grayscale

[19] Čadík, M. (2008, October). Perceptual evaluation of color-to-grayscale image conversions. In Computer Graphics Forum (Vol. 27, No. 7, pp. 1745-1754). Blackwell Publishing Ltd.

[20] Pratt, W. K. (1978). Digital image processing. Wiley-Interscience, New York.

[21] Jack, K. (2011). Video demystified: a handbook for the digital engineer. Elsevier.

[22] https://en.wikipedia.org/wiki/Contrast_(vision)

[23] Leung, C. C., Chan, K. S., Chan, H. M., & Tsui, W. K. (2005). A new approach for image enhancement applied to low-contrast–low-illumination IC and document images. Pattern Recognition Letters, 26(6), 769-778.

[24] https://en.wikipedia.org/wiki/Edit_distance

[25] Kondrak, G. (2005). N-gram similarity and distance. In String Processing and Information Retrieval (pp. 115-126). Springer Berlin/Heidelberg.

[26] http://babich.biz/icons-as-part-of-an-awesome-user-experience/

[27] https://www.bullguard.com/bullguard-security-center/mobile-security/mobile-threats/android-malicious-apps.aspx

[28] Barrera, D., Kayacik, H. G., van Oorschot, P. C., & Somayaji, A. (2010). A methodology for empirical analysis of permission-based security models and its application to Android. In Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS) (pp. 73-84).

[29] Pandita, R., Xiao, X., Yang, W., Enck, W., & Xie, T. (2013). WHYPER: Towards automating risk assessment of mobile applications. In Proceedings of the 22nd USENIX Security Symposium (pp. 527-542). Washington, DC, August 2013.

[30] Gorla, A., Tavecchia, I., Gross, F., & Zeller, A. (2014). Checking app behavior against app descriptions. In Proceedings of the International Conference on Software Engineering (ICSE) (pp. 1025-1035).

[31] Felt, A. P., Chin, E., Hanna, S., Song, D., & Wagner, D. (2011, October). Android permissions demystified. In Proceedings of the 18th ACM Conference on Computer and Communications Security (pp. 627-638). ACM.

[32] Andow, B., Acharya, A., Li, D., Enck, W., Singh, K., & Xie, T. (2017, July). UiRef: analysis of sensitive user inputs in Android applications. In Proceedings of the 10th ACM Conference on Security and Privacy in Wireless and Mobile Networks (pp. 23-34). ACM.

[33] Huang, J., Zhang, X., Tan, L., Wang, P., & Liang, B. (2014, May). AsDroid: Detecting stealthy behaviors in Android applications by user interface and program behavior contradiction. In Proceedings of the 36th International Conference on Software Engineering (pp. 1036-1046). ACM.

[34] Qu, Z., Rastogi, V., Zhang, X., Chen, Y., Zhu, T., & Chen, Z. (2014, November). AutoCog: Measuring the description-to-permission fidelity in Android applications. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (pp. 1354-1365). ACM.

[35] Li, H., Kia, O. E., & Doermann, D. S. (1999, January). Text enhancement in digital video. In Proceedings of SPIE, The International Society for Optical Engineering (pp. 2-9). SPIE.

[36] Jacobs, C., Simard, P. Y., Viola, P., & Rinker, J. (2005, August). Text recognition of low-resolution document images. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on (pp. 695-699). IEEE.

[37] Leung, C. C., Chan, K. S., Chan, H. M., & Tsui, W. K. (2005). A new approach for image enhancement applied to low-contrast–low-illumination IC and document images. Pattern Recognition Letters, 26(6), 769-778.

[38] Asprise Software. Asprise: Java OCR SDK Library. (2017). https://asprise.com/royalty-free-library/java-ocr-api-overview.html