UNIVERSITY OF CALGARY

Towards cloud-based anti-malware protection for desktop and mobile platforms

by

Christopher Jarabek

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

CALGARY, ALBERTA

April, 2012

© Christopher Jarabek 2012

UNIVERSITY OF CALGARY

FACULTY OF GRADUATE STUDIES

The undersigned certify that they have read, and recommend to the Faculty of Graduate Studies for acceptance, a thesis entitled “Towards cloud-based anti-malware protection for desktop and mobile platforms” submitted by Christopher Jarabek in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE.

Supervisor, Dr. John D. Aycock Department of Computer Science

Internal Examiner, Dr. Michael E. Locasto Department of Computer Science

External Examiner, Dr. Behrouz Far Department of Electrical and Computer Engineering

Date

Abstract

Malware is a persistent and growing problem that threatens the privacy and property of computer users. In recent years, this threat has spread to mobile devices such as smartphones and tablet computers. At the same time, the main method for combating malware, anti-virus software, has grown in size and complexity to the point where the resource demands imposed by these security systems have become increasingly noticeable. In an effort to create a more transparent security system, it is possible to move the scanning of malware from the host computer to a scanning service in the cloud. This relocation could offer the security of conventional host-based scanning, without the resource demands involved with running a fully host-based anti-virus system.

This thesis shows that under the right circumstances, malware scanning services provided remotely are capable of replacing host-based anti-malware systems on desktop computers, although such a cloud-based security system is better suited to protecting smartphone users from malicious applications. To that end, a system was developed that provides anti-malware security for desktop computers by making use of pre-existing web-based file scanning services for malware detection. This system was evaluated and found to have variable performance, ranging from acceptable to very poor. The desktop scanning system was then augmented and adapted to serve as a mechanism for identifying malicious applications on Android smartphones. The evaluation of this latter system showed favorable results, indicating that it is an effective mechanism for combating the growing mobile malware threat.

Acknowledgements

“No man is an island”; as such, this body of research would not have been what it is without the help of several individuals. First and foremost, I would like to thank Dr. John Aycock for his guidance and advice during my studies. His open and approachable nature made him a pleasure to work with, and this research would not have reached its full potential without his direction.

I would like to express my gratitude to Dr. Michael Locasto and Dr. Behrouz Far for serving on my examination committee. I would also like to thank Dr. William Enck and Dave Barrera for their advice regarding Android development, as well as Erika Chin and Adrienne Porter Felt for their assistance with tools for data analysis.

Special thanks should also be given to my student colleagues, Daniel De Castro and Jonathan Gallagher, for offering up their company and enjoyable discussions.

Finally, I would like to thank my family: Chelsey Greene, and Patricia and Jim Jarabek. Words alone are insufficient to express the scale of my gratitude for the love, encouragement, and support they have shown me.


Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
1.1 Background
1.1.1 The Malware Threat
1.1.2 The Cloud
1.1.3 Smartphones
1.1.3.1 Mobile Malware
1.1.3.2 Android
1.2 Thesis Contributions
1.3 Thesis Outline
2 Related Work
2.1 Mobile Security and Malware
2.2 Cloud Based Anti-Malware
2.3 Device Based Mobile Anti-Malware
2.4 Non-Device Based Mobile Anti-Malware
2.5 Other Lightweight Anti-Virus Techniques
2.6 Summary
3 System Architecture
3.1 Scanning Services
3.1.1 Kaspersky
3.1.2 VirusChief
3.1.3 VirusTotal
3.1.4 Other Services
3.1.5 Terms of Service
3.2 Desktop Thin AV System
3.2.1 DazukoFS
3.2.2 File System Access Controller
3.2.3 Standalone Runner
3.2.4 Thin AV
3.2.5 Scanning Modules
3.2.6 System Circumvention
3.3 Mobile Thin AV System
3.3.1 Reuse of Existing Thin AV System
3.3.2 Android Specific Scanner
3.3.3 Safe Installer
3.3.4 Killswitch
3.3.5 System Circumvention
4 System Evaluation - Desktop Thin AV
4.1 Scanning Service Performance
4.1.1 Testing Protocol
4.1.2 Results
4.1.3 Discussion
4.2 Actual System Overhead
4.2.1 Testing Protocol
4.2.2 Results
4.2.3 Discussion
4.3 Predicted System Overhead
4.3.1 Testing Protocol
4.3.2 Results
4.3.3 Discussion
4.4 Large Scale System Simulations
4.4.1 Testing Protocol
4.4.2 Results
4.4.3 Discussion
5 System Evaluation - Mobile Thin AV
5.1 Data Set
5.2 Malware Detection
5.3 Emulator Performance
5.4 ComDroid Evaluation
5.4.1 Testing Protocol
5.4.2 Results
5.4.3 Discussion
5.5 Safe Installer Performance
5.6 Killswitch Cost
5.6.1 Testing Protocol
5.6.2 Results
5.6.3 Discussion
6 Discussion
6.1 Thin AV Performance and Feasibility
6.2 Ideal Deployment
6.2.1 Desktop Deployment
6.2.2 Mobile Deployment
6.3 Privacy
6.3.1 Desktop Privacy
6.3.2 Mobile Privacy
7 Conclusion
Bibliography
A Appendix

List of Tables

3.1 Thin AV security policy matrix
3.2 Speed comparison of the hashing functions available in Python

4.1 Kaspersky file scanning performance statistics
4.2 VirusChief file scanning performance statistics
4.3 VirusTotal file scanning performance statistics
4.4 VirusTotal file upload performance statistics
4.5 Linear equations for the three scanning services
4.6 Activities in the web and advanced workload scripts
4.7 Scenarios examined for assessing Thin AV overhead
4.8 General characteristics of testing workload scripts
4.9 Time to complete the three workload testing scripts while using Thin AV
4.10 Refined linear equations for each of the three scanning services
4.11 Simulation results of the Kaspersky service for three different activity logs
4.12 Simulation results of the VirusChief service for three different activity logs
4.13 Comparison of running time and simulation results for Kaspersky service
4.14 Comparison of running time and simulation results for VirusChief service

5.1 General file size characteristics of the Android test data set
5.2 Summary of malware found in the Market data set
5.3 Android emulator versus hardware performance comparison
5.4 Linear equation for the ComDroid scanning service
5.5 Summary of exposed communication in Google Market data set
5.6 Network speeds used for evaluating the mobile implementation of Thin AV
5.7 Thin AV safe installer cached performance summary
5.8 Thin AV safe installer uncached performance summary
5.9 Linear equations for generating a system fingerprint
5.10 Data consumption of Thin AV killswitch over different time periods
5.11 Fingerprint generation time for different conditions
5.12 Total upload sizes used for calculations of bulk scanning performance
5.13 Thin AV killswitch app upload times
5.14 Scan times for different numbers of apps

A.1 Raw data from Figure 4.6
A.2 Raw data from Figure 4.7
A.3 Raw data from Figures 4.8 and 4.9
A.4 Raw data from Figure 4.10
A.5 Raw data from Figure 4.11
A.6 File size characteristics of Android testing data set

List of Figures

3.1 System architecture for Thin AV
3.2 UML Class Diagram for Thin AV
3.3 System architecture diagram for the mobile implementation of Thin AV
3.4 User interfaces for the Android killswitch

4.1 Scan response time for the Kaspersky scanning service
4.2 Scan response time for the VirusChief scanning service
4.3 Scan response time for the VirusTotal scanning service
4.4 Upload response time for the VirusTotal scanning service
4.5 Example CDF of simulated files by size
4.6 Accesses which involved an uncached file versus Thin AV induced overhead
4.7 Number of file system accesses versus Thin AV induced overhead
4.8 File size in bytes versus Thin AV induced overhead
4.9 File size versus the proportion of accesses scanned by each scanning service
4.10 Proportion of file modifications versus Thin AV induced overhead
4.11 Average time between file accesses versus Thin AV overhead

5.1 Median file size of the Android test data set by category
5.2 Response time of the ComDroid service as a function of package size
5.3 Fingerprint generation time versus the number and size of packages

List of Abbreviations

AIDL ...... Android Interface Definition Language

AJAX ...... Asynchronous JavaScript and XML

API ...... Application Programming Interface

APK...... Android Application Package File

ARM ...... Advanced RISC Machine

ARP...... Address Resolution Protocol

AV ...... Anti-Virus

CPU...... Central Processing Unit

DLL ...... Dynamic-Link Library

DNS ...... Domain Name System

FFBF ...... Feed-Forward Bloom Filter

FSAC ...... File System Access Controller

HTML...... HyperText Markup Language

HTTP(S) ...... Hypertext Transfer Protocol (Secure)

IP ...... Internet Protocol

IPC...... Inter-Process Communication

LOC...... Lines of Code

OS ...... Operating System

RAM ...... Random Access Memory

RISC ...... Reduced Instruction Set Computer

VM ...... Virtual Machine

WEP ...... Wired Equivalent Privacy

XML ...... Extensible Markup Language


Chapter 1

Introduction

Computer malware (malicious software) is a persistent and evolving threat to the privacy and property of individuals and organizations. With software systems growing in complexity every year, the potential exploits of these systems are growing in kind. The most common technique for identifying and removing malware from computers is anti-virus software. However, anti-virus products that run on end-user computers have become increasingly bloated in recent years, as developers push to include features that will serve to differentiate their product in a competitive marketplace. This software bloat has a negative impact on the performance of computer systems and on users’ willingness to use anti-virus products to protect their computer systems. Recently, an idea has begun to develop which would see security offered as a cloud-based service. Although there are a variety of factors motivating the development of cloud-based security, from a customer’s perspective this shift towards cloud-based security ultimately means that the products that are currently used to ensure access, confidentiality, and integrity of both data and computer systems can be replaced with a cloud-based service. Such services are already being employed by security companies seeking to enhance their existing host-based anti-virus software with cloud-based features [46].

This thesis aims to show that under the right circumstances, malware scanning services provided remotely are capable of replacing host-based anti-malware systems on desktop computers, although such a cloud-based security system is better suited to protecting smartphone users from malicious applications. The evidence to support this thesis comes from the development and evaluation of Thin AV: a lightweight, cloud-based anti-malware system that was implemented for both Linux desktops and Android smartphones.

The remainder of this chapter is laid out as follows: Section 1.1 will broadly cover the background for the main concepts relevant to this thesis. Next, Section 1.2 will detail the exact contributions made by this thesis. Finally, Section 1.3 will describe the contents of the remainder of this document.

1.1 Background

This thesis ties together a variety of different topics, including malware (of both the mobile and non-mobile varieties), cloud computing, and smartphone security. Whereas Chapter 2 will discuss a wide range of academic research relating to these issues, this section is intended to serve as a general introduction to the relevant topics, and to provide context for the rest of the work contained within this thesis. The remainder of this section is outlined as follows: Section 1.1.1 will discuss malware and the threat it poses; Section 1.1.2 will discuss the concept of cloud computing; and finally, Section 1.1.3 will discuss smartphones, with a special focus on mobile and smartphone malware, as well as the Android smartphone operating system.

1.1.1 The Malware Threat

Malware is, in the broadest sense, a computer program that is designed to compromise, damage, exploit, or harm a computer system or the data residing on it [31]. While the term “virus” has become somewhat synonymous with malware, this is incorrect, as computer viruses constitute only a single type of malware. Malware refers to all varieties of malicious computer programs, which are typically categorized based on the specific malicious properties the program exhibits. In recent years the creation of new malware has seen tremendous growth [50], and while malware is created for a variety of reasons, the most prevalent incentive is financial gain [53].

The most common approach to combating malware is through anti-virus programs. These are programs that examine the files on a computer and locate files that look or behave like known malware samples [31]. While there are numerous companies that sell anti-virus products, and even a few anti-virus products that are given away for free, most of these products are fairly comparable in their ability to detect malware [82], at least when it comes to detecting malware that is currently circulating in the wild [106]. This has led to a scenario where companies have to continually add new features to their anti-virus products in order to stand out in a crowded marketplace. And while these features may have some security benefit, there is almost always an associated performance cost [105].
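As a rough illustration of the mechanics involved, the following sketch shows hash-based signature matching, the simplest form of such scanning. It is illustrative only: the one-entry signature database and the scanned directory are placeholders, and real engines match byte patterns within files rather than whole-file digests.

```python
import hashlib
from pathlib import Path

# Hypothetical signature database: digests of known malware samples.
# The single entry below is the widely published SHA-256 of the EICAR
# anti-virus test string, used here as a harmless stand-in.
KNOWN_MALWARE_HASHES = {
    "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f",
}

def scan_file(path: Path) -> bool:
    """Return True if the file's digest matches a known malware sample."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest in KNOWN_MALWARE_HASHES

def scan_tree(root: Path) -> list[Path]:
    """Scan every regular file under root, collecting suspected matches."""
    return [p for p in root.rglob("*") if p.is_file() and scan_file(p)]

if __name__ == "__main__":
    for hit in scan_tree(Path(".")):
        print(f"Possible malware: {hit}")
```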

1.1.2 The Cloud

New trends are emerging in computing that may offer a new direction in the fight against malware. Among these trends is the recent emergence of “cloud” computing. Cloud computing is not so much a new technology as it is a new business model for computing: the delivery of computation services as opposed to computation products. These services are typically delivered over a network such as the internet [78]. (The term “cloud” came about because, historically, a cloud shape is used to represent the internet in network topology diagrams [101].)

A major motivating factor behind the adoption of cloud computing is the potential for cost savings [63]. For example, rather than a company providing e-mail to its employees through its own local e-mail server, it could pay a subscription fee to a company that provides e-mail services over the internet. This service arrangement saves the company the cost of buying, maintaining, and administering its own e-mail server. The procurement, maintenance, and potentially the administration of these cloud e-mail servers are not the responsibility of the company, but rather the responsibility of the service provider. While the geographic location of the cloud servers is controlled by the service provider, the location is very much a point of interest to customers, as the location of a service provider’s cloud servers can significantly impact the performance of the service, as well as pose significant legal concerns for cloud customers [30].

The concept of cloud computing has its roots in the mainframe computers of a previous generation, but the technology to actually implement cloud computing really began to take shape when grid computing and operating system virtualization started seeing widespread successful applications. The success of these underlying technologies, coupled with a steady increase in the speed of internet connectivity around the globe, eventually allowed for computation services to be delivered over the internet [65, 52].

The notion of providing computation as a service can be broken down into a number of different service categories. Among the most common service offerings are Infrastructure-as-a-Service, where a company will offer shared hardware resources; Platform-as-a-Service, for developing and deploying applications; and Application-as-a-Service, which is similar to the e-mail example above [78]. The notion of offering Security-as-a-Service is a relatively new concept [29], yet the security company McAfee is already offering a cloud-based enterprise security service that includes malware protection [26], though the details pertaining to the architecture of this proprietary system are not publicly available.

1.1.3 Smartphones

Smartphones are fundamentally just mobile phones with some sort of personal computing functionality. This functionality typically includes the ability to run custom software, or applications, on top of purpose-built operating systems. It is somewhat difficult to specify the point at which mobile phones started widely being referred to as smartphones, as their development was simply the result of continual product evolution. However, it is safe to say that the variety of touch screen devices ushered in by Apple’s iPhone, and later, Google’s Android devices, can be classified as smartphones. The growth of smartphone sales has been extremely high, with sales reaching more than 115 million devices in the third quarter of 2011 [64].

1.1.3.1 Mobile Malware

Mobile malware is malware that has been written for a mobile device such as a tablet computer or a mobile phone. The problem of mobile malware has been around for more than a decade. Even in the pre-smartphone era there was considerable speculation as to when malware on mobile phones would become commonplace, and what the capabilities of said malware would be when it arrived [49]. As an emerging platform for malware, there were many factors that dictated when malware authors would be sufficiently motivated to begin writing mobile malware in earnest [85]. However, the tremendous increase in smartphone use [64], coupled with the fact that smartphones increasingly store large amounts of personal or private information, has been enough to push mobile malware from a curiosity to a full-fledged industry. In recent years the growth of mobile malware has been dramatic, with F-Secure reporting a nearly 400% increase in mobile malware between 2005 and 2007 [66], and McAfee Labs recording a doubling of mobile malware samples between the beginning of 2009 and the middle of 2011. Much like desktop malware, mobile malware ranges from mildly annoying to extremely insidious, and all major platforms have been affected [77, 71, 33, 32, 57, 100].

Combating malware is not trivial on high-resource desktop computers, and the resource constraints present on mobile devices only increase the challenge of this task. It is not simply that the processing and storage capacity of a mobile device is less than that of a contemporary desktop computer; it is the fact that the uptime of the device is limited by the available battery power. Thus, excessive computation caused either by malware or by anti-malware code running on the device will shorten the battery life and decrease the usefulness of the device [37].

1.1.3.2 Android

Given that a large portion of this research uses the Android operating system, it is worth discussing Android, as well as the Android security model and some of the issues surrounding Android security. (For the remainder of this section, unless otherwise stated, please refer to [2] for details pertaining to the Android operating system.) The selection of Android as the platform for this study was based on a variety of factors. When comparing the top smartphone operating systems (Android, iOS, Windows Phone, Symbian, and BlackBerry OS), Android is the only mainstream operating system which is open source, allowing for modification of the operating system. This, coupled with the rise of Android as a smartphone platform, made it the obvious choice [64].

Android is middleware developed by Google and built on top of Linux. It is targeted at mobile devices such as smartphones, tablets, and e-readers. Like many mobile operating systems, Android has been designed to provide developers with a rich environment in which to develop applications (or “apps”) that leverage the available physical hardware. Android apps are written in Java, but are not executed on a traditional Java Virtual Machine. Rather, Android includes a high-performance, mobile-specific VM called the Dalvik Virtual Machine that executes the compiled Android bytecode.

In order to create a secure operating environment, Android implements a high degree of process isolation between apps. When an app is launched, a new process is created for that app, owned by a user ID unique to that app. Within this process, a new Dalvik VM is launched, within which the desired app is run. This process isolation, in conjunction with Google’s design philosophy of “all apps are created equal”, is highly beneficial from a security perspective. It means that flaws or exploits in a given app cannot easily result in access to restricted data, processes or services. For example, a successful buffer overflow attack on a particular app would only provide access to the files and process owned by the compromised app [58], as well as any other public files present on the file system.

Another key component of the Android security system is the permissions model which, broadly speaking, defines what portion of the Android API a given app has access to, and what actions an app can perform when interacting with another app on the device [54, 59]. For example, at install time, an application could declare that it requires access to the internet and the ability to receive SMS messages. Before proceeding with the installation, the user must approve these permission requests. However, an application could potentially request a set of permissions that would allow for malicious behavior, such as monitoring phone conversations or tracking a user’s location without their knowledge [56].

This permissions model is further complicated by the addition of the inter-process communication model which provides a mechanism for passing messages and data between applications, or from the operating system to an app on the device. These messages are referred to as intents, and these intents can be explicit (app A sends a message to app B and only app B) or implicit (app A sends a message to any app which supports the desired operation). Unfortunately, both explicit and implicit intents allow for a scenario where an app can spoof an intent in an attempt to gain information from the target app. Additionally, the latter case creates a scenario where an intent can be intercepted by a malicious app, bypassing its intended target [45].

In light of the process isolation enforced by Android, it is becoming increasingly likely that malware in the conventional sense is being eclipsed by the issue of malicious apps which are unwittingly installed by a user [55]. These can be applications that ask for a specific collection of permissions that could enable malicious behavior [56] or applications which abuse Android’s message passing system for malicious purposes [45].

Apps for Android can be distributed in a variety of ways. The most common way is via an application market. A market is simply an app that runs on a device and allows a user to find and install other apps. The feature that differentiates the Android Market² model from other market models (most notably, the Apple App Store) is the fact that developer submissions to the Android marketplace are relatively unregulated. Submissions do not go through any sort of rigorous quality control checks. Specifically, apps are not manually reviewed for quality and content prior to release, a hallmark of the Apple App Store [42, 97]. While on one hand Google’s marketplace model provides developers with the ability to quickly take an app from development to deployment, it also means that developers of malicious apps have fewer obstacles to overcome when trying to quickly publish their apps to a wide audience. In order to combat this, both the Google and Apple markets contain a remote “killswitch” that allows not only for the removal of an app from their market, but also the remote removal of the app from a user’s device [76]. Additionally, Google has potentially staked the reputation of their brand on their Market, and so has a vested interest in preventing it from becoming filled with malicious apps. It is therefore not surprising, given their more permissive market model, that Google has actually had to use their killswitch to remove malicious apps [40]. Furthermore, Google has very recently announced that, due to the spate of malware on their market, they have developed their own internal anti-malware scanning system called Bouncer, which performs automated scanning of apps submitted to the market [73].

²As of March 6, 2012, Google has grouped the Android Market together with a number of their other commercial services, creating a new service called Google Play [91]. Any future references to the Google Market or the Android Market refer specifically to the market for Android apps that is now part of Google Play.

Android’s market model is further complicated by the fact that a user does not need to use the Android Market to install apps. Android allows the installation of apps downloaded from the web, attached in an e-mail, transferred via USB from another computer, or downloaded from any number of the third-party app stores that are available for Android. McDaniel and Enck provide a brief discussion of some of the security challenges presented in such a multi-market environment [76], arguing that markets by themselves do not fail at security, because markets don’t claim to provide security. Rather, the onus is on users to make informed decisions about what apps they install. To that end, it is suggested that what Android needs is a level of automated application certification in Android’s multi-market ecosystem. Thin AV, the system described in this thesis, is intended to be a step towards this goal.

While the official Android Market comes with a built-in killswitch for the removal of malicious apps, the only other high-profile Android market, the Amazon Appstore, does not [43]. Then there are the numerous other, less well-known Android application markets, some of which are targeted at specific geographic regions [10, 15], others that are targeted at specific hardware platforms [4], while others still are targeted at individuals with more salacious tastes [14]. There is even a market under development that focuses specifically on providing apps that have been banned by the official Google Market [87].

As the number of third-party app stores increases, it is likely that some of these markets will be more interested in the quantity of apps available for download than in the quality of those applications. It is possible that these unofficial application markets will become significant vectors for malicious applications in the years to come. The mobile anti-malware system, Thin AV, which is described in Chapter 3, is a step towards combating this malware vector. By combining an install-time application check with a market-independent killswitch capable of notifying users of malicious apps regardless of their source, it is possible that these non-Google Market sources can be made safer for mobile users.
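A minimal sketch of such an install-time check appears below. The blacklist service URL and its JSON response format are hypothetical placeholders for illustration, not the Thin AV interface described in Chapter 3.

```python
import hashlib
import json
import urllib.request

# Hypothetical market-independent blacklist service; the URL and the
# {"malicious": <bool>} response format are invented for this sketch.
BLACKLIST_URL = "https://example.org/check"

def apk_fingerprint(apk_path: str) -> str:
    """Hash the application package so it can be checked regardless of
    which market, e-mail attachment, or USB transfer it came from."""
    with open(apk_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def is_known_malicious(apk_path: str) -> bool:
    """Ask the remote service whether this package hash is blacklisted."""
    query = BLACKLIST_URL + "?sha256=" + apk_fingerprint(apk_path)
    with urllib.request.urlopen(query) as resp:
        return json.load(resp).get("malicious", False)

def safe_install(apk_path: str) -> None:
    """Gate installation on the remote verdict."""
    if is_known_malicious(apk_path):
        print("Blocked: package matches a known-malicious app.")
    else:
        print("No match found; proceeding with installation.")
```

Because only a hash leaves the device, the check stays cheap on a metered mobile connection; the same fingerprint also lets a killswitch service notify users after the fact if an already-installed package is later flagged.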

1.2 Thesis Contributions

The first main contribution of this research is the design and development of Thin AV, a system for providing anti-virus scanning for Linux-based desktop computers. Thin AV combines a set of pre-existing, third-party scanning services and offloads the scanning of files from the host computer to these services. The evaluation of Thin AV found that the performance of the system was highly dependent on the file system activity while the system was active, but that there were specific instances where the system performed well.

The findings from this research can help to address the performance concerns involved in cloud-based malware scanning. This could result in a system that would be capable of providing nearly transparent anti-malware protection from the cloud.

The second contribution of this thesis was an extension of the desktop version of Thin AV, specifically targeted at smartphones and other mobile devices. The system was designed and developed for the Android operating system, and the evaluation of the system showed favorable performance, suggesting that cloud-based anti-malware scanning may be a very good fit for providing a level of security to mobile devices.

Finally, this research includes a comprehensive examination and summary of the current body of academic research pertaining to cloud-based security for both desktop computers and mobile devices, as well as research regarding low-impact anti-malware techniques which might also be suitable for mobile devices.

1.3 Thesis Outline

This thesis is divided into chapters as follows: Chapter 2 will examine the existing research in the related fields of mobile malware and cloud-based anti-malware, as well as research into other lightweight anti-malware systems. Chapter 3 will introduce Thin AV, the system at the centre of this thesis, with a significant focus on the design and implementation of both the desktop and mobile versions of Thin AV. Chapter 4 will focus on the evaluation of the desktop version of Thin AV, while Chapter 5 will deal with the evaluation of the mobile version. Chapter 6 will discuss the results of the evaluation, as well as the areas in which Thin AV could be improved, giving specific attention to the privacy implications of Thin AV. Chapter 7 will conclude this thesis.

Chapter 2

Related Work

Most of the research related to Thin AV can be grouped into one of three categories: security and malware in mobile environments, which will be discussed in Section 2.1; cloud-based anti-virus systems, which will be discussed in Section 2.2; and mobile anti-virus systems, which are reviewed in Section 2.3. Section 2.4 contains a review of related research that can be found in the overlap between these research areas. Section 2.5 will discuss and critique the work that is relevant to Thin AV, but cannot be clearly classified into any of these previous areas. Finally, Section 2.6 will conclude this chapter.

2.1 Mobile Security and Malware

The problem of mobile malware has been around for more than a decade. In that time, the nature of the malware threat has shifted significantly. In the pre-smartphone era, most malware came in the form of viruses or Trojan horses [66], while in recent years most malware comes in the form of malicious applications [60]. However, Bickford et al. have shown the possibility of developing rootkits for a modern smartphone, though their work did not focus on a well-known smartphone operating system [38]. Additionally, [95] showed that smartphones are susceptible to more traditional denial of service attacks due to their lack of firewalls. The same study also raised the possibility of using smartphones as offensive platforms, though this is less promising due to their limited power.

Porter Felt et al. conducted a survey of malware found in the wild on Android, iOS and Symbian devices [60]. Their survey found that all instances of malware for Android devices used application packages as their vector, meaning that users were unknowingly installing the malware on their device. Interestingly, the only instances of malware on the iPhone occurred through an SSH exploit in rooted (or “jailbroken”) devices. The study went on to examine the incentives behind each piece of malware, most of which were financially based, and outlined a series of practical changes to each of the mobile platforms to help curb those incentives.

Given the current glut of mobile malware, and the rate at which smartphones are being adopted, it is clear that mobile security has become a pressing issue. Oberheide et al. provide an overview of security issues in mobile environments [83]. They point out that previous approaches to mobile security are either overly entrenched in desktop security practices, or argue for entirely new paradigms. Oberheide et al. suggest the truth lies somewhere in between. They discuss five issues that cause security on mobile platforms to be subtly different than in non-mobile environments: resource constraints, different attack strategies, different hardware architectures, platform/network obscurity, and usability.

Enck et al. performed a review of Android application security by developing a tool for reverse engineering Java code from the compiled Android bytecode, then performing static analysis [55]. The top 1,100 apps from the Android Market were downloaded and analyzed for a host of security flaws and poor programming practices. Enck et al. found a pervasive misuse of personally identifying information such as phone identifiers and location information, as well as evidence of poor programming practices such as the writing of sensitive data to Android’s public centralized log. Fortunately, no evidence was found of exploits in the Android framework, or of the presence of malware in the collection of analyzed apps. However, given that the apps selected for study were the top apps in the Android Market, this likely resulted in a bias towards higher quality code than might be found in a broader cross-section of apps.

Chin et al. performed a study of Android inter-process communication (IPC) that is complementary to the analysis in Enck et al. [45]. Using ComDroid, a custom static code analysis tool, one hundred of the top Android applications in the Android Market were examined for vulnerabilities in how they sent and received IPC messages (intents). Numerous vulnerabilities were identified, as well as several instances of misuse of the Android framework. These findings motivated a collection of programming best-practice guidelines for Android programmers.

The same 1,100 apps from [55] were also studied by Barrera et al. with the goal of understanding how the Android permission model is used in practice [35]. The study found that the use of Android permissions showed a distinctly heavy-tailed distribution, with some permissions being employed in most apps (e.g., access to the internet) while most other permissions were comparatively rare. Ultimately, it was concluded that the Android permissions model could be improved by subdividing certain broad permissions (e.g., internet access) to provide a more expressive model, while at the same time rarely used permissions with related functionality could be grouped together (e.g., install/uninstall applications). The findings of Barrera et al. are also in keeping with those of Ongtang et al. [86]. Here, various elements of the Android permissions model were enhanced and modified to accommodate a richer, more expressive set of permissions.

The Android permissions model was also studied by Felt et al., who examined the issue of overprivilege in applications [59]. By mapping the Android API, it was possible to determine which API calls required which permissions. Using this permissions map, they were able to build a tool, Stowaway, to examine several hundred Android applications, finding that almost a third of Android applications over-request permissions. Additionally, they found that the Android permissions model is severely under-documented, and in some cases, incorrectly documented.

2.2 Cloud Based Anti-Malware

The notion of cloud-based malware scanning was first posited in [81], addressed at length in [82], and was a significant source of inspiration for the creation of Thin AV. The system described by Oberheide et al. is called CloudAV. It involves running a local cloud service consisting of twelve parallel VMs, ten of which run different anti-virus engines, and two of which run behavioral detection engines. End hosts run a lightweight client (300 LOC in Linux, and 1,200 LOC in Windows) which tracks and suspends file access requests until the file has been scanned. The use of several heterogeneous scanning engines dramatically improved threat detection, with Oberheide et al. claiming a 98% detection rate when testing with the Ann Arbor Malware Library. Such a high detection rate does increase the risk of false positives. However, it was found that by requiring at least four of the scanning engines to flag a file as malware, false positives could be eliminated, while the overall detection rate only dropped by 4%.
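The threshold logic just described can be pictured as a simple vote across heterogeneous engines. The sketch below is a stand-in for illustration; in CloudAV the engines run in parallel VMs rather than as local function calls, and the engine names here are invented.

```python
# CloudAV-style threshold voting: a file is declared malicious only if at
# least `threshold` of the independent engines flag it, trading a small
# amount of detection for the elimination of false positives.

def aggregate_verdict(engine_results: dict[str, bool], threshold: int = 4) -> bool:
    """engine_results maps engine name -> whether that engine flagged the file."""
    detections = sum(flagged for flagged in engine_results.values())
    return detections >= threshold

results = {
    "engine_a": True, "engine_b": True, "engine_c": False,
    "engine_d": True, "engine_e": True, "engine_f": False,
}
print(aggregate_verdict(results))  # True: four engines agree
```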

Given that CloudAV was deployed with dedicated scanning servers in a LAN environment, the performance impediment from network latency and system load is minimal. This results in an average file scan time of just over one second. This process is sped up through the use of caching, which was shown to be highly effective, producing a 99.8% hit rate with a primed cache. The performance of Thin AV could give an indication as to how a remote scanning system like CloudAV would perform over a WAN, where network latency can be significant.
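Caching of this sort amounts to keying scan verdicts by a digest of the file contents, so an unchanged file never crosses the network twice. The following is a minimal sketch of the idea, not CloudAV's actual implementation.

```python
import hashlib

class ScanCache:
    """Memoize scan verdicts by content digest. On a hit, the verdict is a
    local dictionary lookup; only a miss pays the network round trip."""

    def __init__(self, remote_scan):
        self._remote_scan = remote_scan            # callable: bytes -> bool
        self._verdicts: dict[str, bool] = {}

    def scan(self, data: bytes) -> bool:
        key = hashlib.sha256(data).hexdigest()
        if key not in self._verdicts:              # cache miss
            self._verdicts[key] = self._remote_scan(data)
        return self._verdicts[key]                 # cache hit
```

With a 99.8% hit rate, almost every access resolves without the remote call, which is why a primed cache makes the LAN deployment feel nearly transparent.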

Following their success on the desktop, Oberheide et al. applied their strategy to a mobile environment [84]. Their results showed a marked reduction in power consumption and improved malware coverage. However, they failed to provide any information on how fast their solution operated in the lower-bandwidth, higher-latency mobile realm. Conversely, in their examination of the trade-offs between energy consumption and security, Bickford et al. showed that cloud-based anti-malware scanning is more energy intensive than host-based scanning when performed on a mobile device capable of running a VM hypervisor [37]. It should be noted, though, that the latter study was an examination of cloud-based rootkit detection, not virus detection, and the two implementations differed greatly. Therefore, the latter is not necessarily a refutation of the results of Oberheide et al.

A novel extension to cloud-based malware scanning was provided by Martignoni et al. [75]. They implemented a system wherein suspect executables are uploaded to a cloud-based analysis engine. The system executes the malware, intercepting the system calls generated by the execution, and where necessary, passing those system calls back to the original host. The rationale behind the approach is that most behavior-based malware detection engines rely on running malware samples in a highly synthetic environment, yet often the malicious characteristics of a piece of code are only triggered by a very specific processing environment on the target machine (e.g., visiting a specific banking web site). Like Thin AV, this approach reduces the user’s risk of infection, but it also provides the scanning service with a much more diverse set of computing environments in which to test potentially malicious code, thus improving coverage when seeking malicious behavior in a piece of code. Such a system, implemented in a VM, would make a compelling addition to other cloud-based anti-malware systems such as Thin AV or CloudAV.

Jakobsson and Juels described a strategy for malware scanning that also relies on external computing resources [70]. Their technique allows trusted servers to audit the activity logs of remote clients in an effort to establish the security posture of the clients.

The trusted servers, in most cases, would be owned and operated by institutions susceptible to malware-based fraud, such as banks. A client-based agent would be responsible for logging activity on the client, such as file downloads and installations. This log file could then be sent to the trusted server, which would allow the server to decide whether or not to proceed with the transaction with that particular client.

Jakobsson and Juels claim their technique is secure against log tampering because any events that could result in a malware infection occur only after the event in question has been logged, and the log has been locked. However, they do not address the case where their agent software would be installed on an already compromised machine. Because only logs are being processed, this technique is well suited to low-powered mobile environments, where bandwidth is limited. Additionally, because logs and not entire files are being transmitted, the privacy concerns are somewhat less than those presented by Thin AV, where whole files are transmitted.
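The ordering argument is easiest to see with a hash-chained log, in which each entry's digest commits to the entire history before it. The sketch below illustrates that general tamper-evidence property; it is not Jakobsson and Juels' exact construction.

```python
import hashlib

def chain_entry(prev_digest: str, event: str) -> str:
    """Each entry's digest covers the previous digest, and therefore the
    whole history that preceded it."""
    return hashlib.sha256((prev_digest + event).encode()).hexdigest()

def build_log(events: list[str]) -> list[str]:
    digests, prev = [], "0" * 64                   # fixed genesis value
    for event in events:
        prev = chain_entry(prev, event)
        digests.append(prev)
    return digests

# Rewriting any earlier event changes every digest after it, so a server
# that retains only the latest digest can detect tampering with history.
log = build_log(["download installer.exe", "execute installer.exe"])
print(log[-1])
```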

Clone Cloud [47] and MAUI [48] are both systems designed to enhance the processing capabilities of smartphones by offloading intensive processing to highly resourced cloud-based servers. The designers of Clone Cloud were the first to envision a system capable of offloading smartphone malware scanning onto more powerful cloud-based hardware. However, the ability to perform intensive malware scans was posited as only one of many possible applications of their approach. It should be noted, though, that the notion of moving intensive processing from mobile devices onto more powerful servers predates Clone Cloud by many years [94, 61]; Clone Cloud is simply the first system to apply this practice to modern smartphones, and the first to consider the potential security applications of such an approach.

Paranoid Android is an implementation of a cloud-based anti-malware system which follows very closely on the heels of Clone Cloud [89]. The technique involves replicating an entire mobile device in a virtualized server-based environment. System calls on the physical device are recorded and transmitted to the server, where the user’s behavior is replicated. This allows the server to maintain a faithful copy of the user’s device most of the time (barring network disruptions). This server-based replica can be scanned using traditional CPU-intensive techniques that would not be feasible on a mobile device. A major upside of this approach is that once a replica has been established on a server, the amount of traffic necessary to maintain a consistent state is quite small. The obvious downside, as with CloudAV and others, is the privacy concern involved in replicating a device which very likely contains personal information. However, such a solution would be ideal in a highly managed corporate environment where worker privacy on company-provided devices is not a given.

Finally, the private security company BitDefender also developed a cloud-based anti-malware product [46]. In their solution, they suggest that only the signature-based scanning portion of the malware scan should be offloaded to the cloud. Their reasoning is that more than 90% of the size of BitDefender is composed of the static signature-based scanning engine. Therefore, if the less intensive operations, such as heuristic scanning, remain on the client, and signature-based scanning is done remotely, then network traffic can be kept to a minimum. For privacy reasons, they also opt to have users upload only cryptographic hashes of their files for analysis, uploading the whole file only in the event that a hash cannot be matched. This is very similar to the approach used by Thin AV, which will be discussed in Chapter 3.
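The hash-first exchange can be sketched as follows; the two service callbacks are hypothetical stand-ins for the remote scanning interface.

```python
import hashlib

def scan_hash_first(data: bytes, lookup_hash, upload_and_scan) -> bool:
    """lookup_hash: digest -> verdict (True/False) if known, else None.
    upload_and_scan: full-file remote scan, expensive and privacy-sensitive."""
    digest = hashlib.sha256(data).hexdigest()
    verdict = lookup_hash(digest)
    if verdict is not None:          # common case: only a hash left the device
        return verdict
    return upload_and_scan(data)     # rare case: unknown hash, upload the file
```

The design choice here is the same one Thin AV makes: the cheap, privacy-preserving path handles the common case, and the expensive path is reserved for files the service has never seen.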

2.3 Device Based Mobile Anti-Malware

There are a host of anti-malware systems which are designed to run on resource-constrained mobile devices. VirusMeter is a proposed approach for general malware detection in a mobile environment [72]. The approach involves detecting malware by monitoring battery consumption. The assumption is that if the battery consumption of benign behavior can be adequately modeled, then deviations from that model will suggest the presence of unauthorized code. The key issue with their approach is that even the best-case scenario has more than a 4% false-positive rate, which is high for a malware scanner. More importantly, their system was prototyped on a comparatively old mobile device, and it is unclear if their approach would work effectively on a modern smartphone, which typically runs a diverse collection of rich-media applications capable of quickly draining a device’s battery.
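The underlying idea, modelling benign battery draw and flagging deviations from the model, can be illustrated with a simple statistical check. The samples and threshold below are invented for illustration; tuning exactly this kind of threshold is what drives the false-positive rate noted above.

```python
from statistics import mean, stdev

def is_anomalous(history_mw: list[float], observed_mw: float, k: float = 3.0) -> bool:
    """Flag the current power draw if it deviates from the benign model by
    more than k standard deviations. A smaller k catches more malware but
    raises the false-positive rate."""
    mu, sigma = mean(history_mw), stdev(history_mw)
    return abs(observed_mw - mu) > k * sigma

benign_draw = [210.0, 195.0, 220.0, 205.0, 199.0, 215.0]  # illustrative samples
print(is_anomalous(benign_draw, 480.0))  # True: far outside the benign model
```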

Heuristic-based anti-malware scanning is conducive to mobile platforms simply due to its reduced overhead. The approach in [102] identifies malware based on the pattern of DLL usage in a program. Venugopal et al. observed that many malware programs share similar behaviors, and these behaviors are accessed through DLLs. Furthermore, the spreading mechanisms and targeted exploits of viruses in the mobile domain are different from those in the desktop domain, so the heuristic methods from the latter domain cannot be applied to the former. By developing a heuristic system and training it on a collection of Symbian viruses, they were able to successfully identify 95% of other (non-training-set) Symbian malware, with no false positives. Much like VirusMeter, the most obvious problem with this solution is that it was developed in a pre-smartphone world. Smartphones now typically run a diverse, customized collection of mobile applications. In a software environment where new applications with novel functionality are being released on a daily basis, this raises questions about the efficacy of such a heuristic technique, or at the very least, about the rate of false positives in such an environment.
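In essence, the technique treats a program's imports as a feature set and scores them against patterns derived from known malware. The toy sketch below hard-codes two invented patterns; the real system trains a classifier on known samples rather than using fixed rules.

```python
# Toy import-based heuristic. The patterns are invented stand-ins for
# behavior combinations a trained classifier might learn from malware.
SUSPICIOUS_COMBINATIONS = [
    {"SendSMS", "ReadContacts"},      # e.g., self-propagation via contacts
    {"OpenBluetooth", "WriteFile"},   # e.g., spreading a payload over Bluetooth
]

def heuristic_flag(imported_functions: set[str]) -> bool:
    """Flag a program whose imports contain any full suspicious combination."""
    return any(combo <= imported_functions for combo in SUSPICIOUS_COMBINATIONS)

print(heuristic_flag({"SendSMS", "ReadContacts", "DrawScreen"}))  # True
print(heuristic_flag({"DrawScreen", "PlaySound"}))                # False
```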

A similar strategy for malware identification on Android-based mobile devices can be found in [98]. The strategy involves using Linux-based tools to analyze the low-level function calls of ELF files. They then use various heuristic techniques to classify a file as malicious or clean depending on the functions being called. They also suggested a technique for combating infection by having co-located mobile devices collaborate to identify malware. Prior to their work on Android, Schmidt et al. developed a technique for instrumenting Symbian and Windows Mobile devices with the intention of recording user behavior for the purposes of remote anomaly detection [99].

When scanning for malware, legitimate applications in RAM are swapped on the flash disk. Therefore, when the function is executed, if there is less RAM available than should exist (due to a piece of malware), the memory-printing function will take much longer than it would if no malware were present. Jakobsson et al. assert that because malware can only exist in secondary storage or in RAM, any malware that is not detected in the RAM scan will be found when secondary storage is scanned. Unfortunately, in this approach, secondary storage is still scanned via white/black lists, signatures or heuristics, which, as was pointed out in [83], are not efficient strategies in a mobile environment.

Finally, a more preventative approach to mobile malware can be seen in Kirin [56]. Kirin is a system developed from a security requirements analysis of the Android permissions system. It checks a collection of rules which, if violated, may indicate that an application being installed is capable of malicious activity. The study examined 311 Android applications and found ten positives, five of which were false. The key drawback of their system is that user intervention is still required to validate positive results. Unfortunately, Kirin only prevents the installation of malicious applications that are installed from third-party sources, and not from the official Google Market.
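Kirin's install-time check reduces to set containment over an app's requested permissions. The sketch below shows the shape of such a check with two invented rules; Kirin's actual rules were derived from its security requirements analysis.

```python
# Illustrative Kirin-style rules: each rule names a permission combination
# that is dangerous together. These example rules are invented for the sketch.
RULES = [
    {"RECEIVE_SMS", "SEND_SMS"},                       # SMS interception/relay
    {"RECORD_AUDIO", "INTERNET", "READ_PHONE_STATE"},  # call eavesdropping
]

def violated_rules(requested_permissions: set[str]) -> list[set[str]]:
    """Return every rule fully contained in the app's requested permissions."""
    return [rule for rule in RULES if rule <= requested_permissions]

app_perms = {"RECORD_AUDIO", "INTERNET", "READ_PHONE_STATE", "VIBRATE"}
for rule in violated_rules(app_perms):
    print("Potentially malicious combination:", sorted(rule))
```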

2.4 Non-Device Based Mobile Anti-Malware

Due to the processing and battery limitations of smartphones, there is a trend in research towards anti-malware solutions for mobile devices that rely on a remote server, be it a cloud service, a more conventional centralized server, or even a desktop or laptop.

The system is the first example of decoupling security from smartphones. It is targeted at the scenario where some smartphones do not have any AV software installed. Smart-

Siren consists of an agent that monitors phone behavior, and a proxy with whom the agent communicates. The agent monitors general behaviors such as SMS and Bluetooth traffic, as well as information about the phone, such as the cell towers to which it most frequently connects. The proxy receives reports from participating agents and aggregates these reports in an effort to find evidence of misbehaving smartphones. Their two key techniques for detecting malware are statistical and anomaly monitoring. The former looks to see if the phone’s capabilities are being used significantly more than would be expected based on historical data. The latter attempts to identify phone numbers (which would charge a fee to the phone’s owner) that are being contacted by the malware. As malware using these phone numbers spreads, calls and messages to such numbers would gradually rise in frequency. Alternatively, if the malware attempts to spread through the smartphone’s contact list, fake contact list entries are used to determine if a device is infected. Upon detection of an infection, the device owner is notified by SMS about the infection, and how to deal with it. Additionally, individuals on the user’s contact list are notified that they might be at risk of infection. Finally, all individuals who are signed up with the SmartSiren service and frequent the same cell towers as the infected device will be monitored.

One of the strengths of the work is its focus on the privacy of the individual. Reports can be submitted by users while divulging as little personally identifiable information as possible. There are some key drawbacks. First, it seems much more likely that most smartphones will not have any AV software, rather than the opposite. It is unclear how well the system would behave when it is operating on very incomplete information. Furthermore, the concept of statistical monitoring might falter in the high-traffic, high-noise environment generated by today’s rich-media apps. Additionally, it seems that the system is reliant on having a historical record of clean data against which it can compare current data to ferret out abnormalities. There is no indication as to how this system would cope when introduced into a potentially tainted environment. Finally, it is questionable how well the third-party notification system would work in reality. Years of exposure to adware have left users wary of seemingly inexplicable automated messages that tell them their electronic devices are at risk of being infected with malware. It is quite possible that such messages will be assumed to be spam and ignored. The worst possible scenario would be when these third-party notifications are believed to be an indication of an infection on the recipient’s own device, leading to unnecessary attempts at malware removal.

Dixon and Mishra describe a system which uses a desktop or laptop as the anti-malware analysis device [51]. Rather than relying on remote cloud-based services or other network-intensive techniques, their system validates the contents of a mobile device when it is connected to a user’s computer via USB. File hashes are used to identify files that have not been analyzed for malware (be they modified or new files), and only the files corresponding to novel hashes are sent to the desktop for analysis with standard anti-malware software. In order to combat a sophisticated attacker who could send false hashes to the validating system, a keyed hashing mechanism would be used, with the key being provided by the external validating system.
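The keyed-hashing defence works because the device cannot precompute or replay hashes without the fresh key supplied by the validator. A sketch of that exchange using an HMAC follows; the choice of primitive is an assumption made for illustration.

```python
import hashlib
import hmac
import secrets

def issue_challenge_key() -> bytes:
    """Validator side: a fresh key per session prevents hash replay."""
    return secrets.token_bytes(32)

def keyed_file_hash(key: bytes, file_bytes: bytes) -> str:
    """Device side: hash each file under the validator's session key."""
    return hmac.new(key, file_bytes, hashlib.sha256).hexdigest()

def verify(key: bytes, file_bytes: bytes, reported: str) -> bool:
    """Validator side: recompute and compare in constant time."""
    expected = keyed_file_hash(key, file_bytes)
    return hmac.compare_digest(expected, reported)

key = issue_challenge_key()
data = b"contents of some file on the phone"
print(verify(key, data, keyed_file_hash(key, data)))  # True
```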

A broad vision for hindering the spread of malicious apps can be found in Stratus, a theoretical system proposed by Barrera et al. [34]. Stratus is a system comprising a collection of information sources and services, such as developer registries, application databases and remote application killswitches. The purpose of these entities is to help provide some of the security guarantees that can be found in a single application market, but in a multi-market environment. The proposed system was discussed with Android in mind, both due to the increasing number of application markets available for Android, and also due to the increased prevalence of malicious apps available for Android.

Stratus is a well thought-out, but as of yet, unimplemented idea. The Stratus system could potentially provide a framework in which Thin AV could operate. Specifically,

Thin AV could operate in conjunction with either the application databases, or the kill switches. As such, the Stratus system is highly complementary to the goals of Thin AV in the mobile setting.

2.5 Other Lightweight Anti-Virus Techniques

Miretskiy et al. [79] offer a highly integrated anti-virus solution in Avfs, which makes two key contributions. In addition to modifying the open-source anti-virus system ClamAV to significantly increase the speed of scanning, they tied their modified ClamAV (called Oyster) into a stacked file system such that AV scanning is part of the file system, a technique adopted by Thin AV. This has several advantages. It allows the scanning of files at the earliest possible moment, as opposed to simply scanning when files are opened or closed, which introduces a window of vulnerability. Additionally, it speeds up scanning because scanning takes place at the kernel level, as opposed to intercepting system calls or message passing. To this end, Miretskiy et al. claim that their system demonstrates less than 15% overhead above a standard non-Avfs-based system. Finally, by integrating anti-virus scanning into the file system, scanning is completely transparent to the user, and any infections can be easily quarantined such that no system process can access the files. Their file system also supports a mode for built-in post-infection forensics. The key drawback of the system is its reliance on ClamAV, which compares somewhat unfavorably to commercial AV products [82]. While it is obvious that ClamAV was chosen because it is open-source, it would have been interesting to see how Avfs would operate when tied to a proprietary AV system.

A method for improving the speed of conventional malware scanning is provided by Cha et al. in the form of SplitScreen [41]. Both the number of signatures and the number of files that need to be scanned are dramatically reduced by using SplitScreen's two-pass technique. This is done by performing a first-pass scan of all the files using a feed-forward Bloom filter (FFBF) with a pre-calculated bit-vector hash of all known malware samples. The defining feature of any Bloom filter is that it is very fast; it may produce false positives, but it never produces false negatives. In this way, a system can quickly be scanned for potential malware. If a candidate file is found (i.e., one possibly infected with malware), the full anti-virus signature needed to positively identify the malware can be downloaded from an external repository, and these candidate files are rescanned using a conventional signature-based scanning approach. Because whole files are not transferred, less user data is exposed and privacy concerns are somewhat mitigated. Furthermore, Cha et al. claim their technique produces a doubling of scanning speed and a halving of memory usage. They suggest that this makes their technique well suited to adoption on mobile devices. However, they do not address the worst-case runtime of their solution, which appears to be worse than that of a standard AV scan. Furthermore, the speed of SplitScreen is heavily dependent on cache-based optimization. The study results show decreasing performance on CPUs with smaller L2 caches, yet the mobile devices their system is targeted at are not currently endowed with very large caches.

2.6 Summary

It is clear that malware is a problem for users of desktop computers. However, this problem has now spilled over into the potentially lucrative domain of smartphones. Given that conventional anti-malware strategies have resulted in unwieldy and low-performance software solutions, significant effort has been dedicated to non-conventional approaches to malware detection. The notion of using cloud resources to aid in malware detection is a relatively new concept. However, in reality many of these efforts make use of non-local resources which do not necessarily conform to the computing-as-a-service model which defines true cloud computing.

The resource limitations of mobile devices have necessitated new research efforts into mobile anti-malware strategies, some of which are device-based, and some of which rely on external computing resources. However, to date, no strategy for mobile malware detection has been shown to be clearly superior. This is why Thin AV is a compelling avenue of research. If it is possible to provide a reasonable level of protection for desktop computers and mobile devices using pre-existing shared resources for malware detection, it may be possible to significantly reduce the computing burden on these devices.

Chapter 3

System Architecture

This chapter presents an overview of the Thin AV anti-malware system, both on the desktop and mobile platforms. The system was designed to have a modular architecture, with separate modules for the individual scanning services. Section 3.1 will provide background on the different scanning services that Thin AV uses. Section 3.2 details the desktop-based implementation of Thin AV. The mobile implementation of Thin AV is discussed in Section 3.3.

3.1 Scanning Services

The goal of Thin AV was to develop an anti-malware solution that offloads the chore of scanning to third-party malware scanning services. At present, the system can scan files with one of three different scanning services that are freely available online: Kaspersky, VirusChief, and VirusTotal. These services all behave similarly insofar as a user can upload any type of file (executable, data, etc.) through the website of the service and receive a report as to any malware that might be contained in that file. Unfortunately, these scanning services are based on proprietary anti-malware engines, and as such, the exact details of the engines underpinning these services are very closely held trade secrets. Therefore, the exact capabilities and limitations of these services with respect to threat detection are not publicly known. Consumer testing of these anti-virus products can provide some clue as to their capabilities [93], though in order to be in compliance with the end-user license agreements (in the United States), these consumer tests must be limited to black-box testing methodologies [24].

3.1.1 Kaspersky

Kaspersky Lab [13] offers a free service for scanning individual files [12] that are 1 MB or smaller in size. The service scans files uploaded to the website using Kaspersky’s proprietary anti-malware engine and returns a diagnosis to the user’s browser.

3.1.2 VirusChief

VirusChief [21] is a multi-engine based malware scanning service with a 10 MB file size limit. Similar to Kaspersky, users upload files through their browser. Once received, the file is scanned using up to 13 different scanning engines. Results from each scanning engine are returned to the user’s browser via AJAX.

3.1.3 VirusTotal

VirusTotal [22] is a multi-engine based scanning service offered by Hispasec, which can also scan files up to 20 MB in size.¹ VirusTotal scans files with 42 different scanning engines, including many of the same engines found in VirusChief as well as the Kaspersky engine. VirusTotal is also unique in that in addition to their website, they offer a semi-public API for accessing their service. Individuals can apply for a key which can be used to call the API for the purpose of uploading files and retrieving reports. VirusTotal also attempts to increase performance by storing cryptographic hashes of all uploaded files. That way, if a file has been scanned previously, the report generated from the previous scan can be returned as opposed to completely re-scanning the file.

¹ The file size limit for VirusTotal was raised to 32 MB some time during early 2012. However, all Thin AV evaluation relevant to VirusTotal was completed when the 20 MB limit was still in place.

3.1.4 Other Services

Two additional scanning services were examined for inclusion in Thin AV, but were deemed to be inappropriate as Thin AV modules. VirScan [20] is a multi-engine web-based file analyzer. However, the scanning process used by VirScan checks the uploaded file sequentially with each of its 37 scanning engines. Although VirScan is designed in a way that could be used to power a Thin AV module, the sequential scanning results in extremely poor performance, and for that reason VirScan was not selected for inclusion in Thin AV.

FileAdvisor [8] by Bit9 was another service considered as a candidate for Thin AV. However, FileAdvisor is not actually a real-time scanning service. Rather, it is a large database of malware scan results. Users can upload a file, or simply a cryptographic hash of a target file, and FileAdvisor will return information regarding previous malware scans of that file. However, this is contingent upon a matching hash being found in the FileAdvisor database. This means that novel files will always fail to return results. Furthermore, although FileAdvisor boasts a database of more than 7 billion files [8], the database is heavily biased towards Windows files. Finally, this service offering has a limit of 15 lookups per day, making FileAdvisor a very poor candidate for inclusion in Thin AV.

3.1.5 Terms of Service

Of the three scanning services included in Thin AV, only VirusTotal provides a terms of service agreement for their offering. The pertinent language in the agreement as it applies to Thin AV is that users must “abstain from any activity that could damage, overload, harm or impede the normal functioning of Hispasec” [23]. To a large extent, this requirement is automatically enforced given that VirusTotal's API only allows a limited number of requests from each user over a given time period.

Given that Thin AV was developed as an experimental proof-of-concept, the workloads produced were comparatively small. Even so, attempts were made, where possible, to minimize traffic to these scanning services during testing and performance evaluation by making use of a simulator (discussed in Section 4.3.1) as opposed to actually uploading files for scanning. Given the configuration of Thin AV on the desktop, any sort of large scale deployment would be in violation of VirusTotal's terms of service, and would surely draw the ire of Kaspersky Lab and VirusChief. However, the configuration of the mobile version of Thin AV is considerably less taxing on the third-party services; as such, it might be possible to run a production-scale deployment of the mobile version of Thin AV without violating VirusTotal's terms of service, or overly taxing Kaspersky Lab and VirusChief.

3.2 Desktop Thin AV System

The desktop-based implementation of Thin AV was written in Python 2.7 and deployed on Ubuntu Linux 11.04 running a modified Linux kernel. The kernel which was originally packaged with this version of Ubuntu (2.6.39.0) was replaced with version 2.6.36.4 so as to be compatible with the stackable file system discussed in Section 3.2.1. The hardware platform for development and testing of Thin AV was a laptop with an Intel Core i5 M460 CPU (2.53 GHz) and 4 GB of RAM. The native operating systems were Windows 7 Professional SP1 64-bit, and Ubuntu 11.10. The desktop version of Thin AV was developed and tested on Ubuntu, which was deployed in a virtual machine running in VMWare Player in Windows 7.

The anti-malware system has three major components (Figure 3.1): the DazukoFS [7] stacked file system, the file system access controller, and the Thin AV anti-malware scanner. The first of these, DazukoFS, was originally developed by Avira Operations GmbH & Co. KG [5], before being made freely available under a BSD license. The latter two components are original developments. In addition to these components, a testing program (standalone runner) was developed which provided access to Thin AV independent of DazukoFS and the file system access controller. Each of these components will now be described in detail.

Figure 3.1: System architecture for Thin AV. (The diagram shows the three third-party scanning services and their corresponding Thin AV modules, the Thin AV cache, the file system access controller, the standalone runner, and the DazukoFS API in user space, sitting above the DazukoFS stacked file system and the Linux file system in the kernel.)

3.2.1 DazukoFS

DazukoFS is a stackable file system that allows user space programs to perform file access control at the kernel level. For the purposes of this work, version 3.1.4 of DazukoFS was used. This version of DazukoFS installs as a kernel module for version 2.6.36 of the Linux kernel. In addition to the kernel module, DazukoFS also provides an API library for interacting with the mounted file system.

DazukoFS can be mounted on top of any directory in the Linux file system with the exception of the root directory. Once mounted, all file access requests that take place in that directory (or any subdirectories) will be intercepted by DazukoFS. A user space program (using the DazukoFS API library) is then responsible for permitting or denying access to the requested file. In the case of Thin AV, if a file is deemed to contain malware, the file access operation is terminated and the user is informed that the file access was not permitted.
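To make this division of labour concrete, the following is a minimal sketch of such a user space access-control loop in Python. The library name and the dazukofs_* function signatures are assumptions modelled on the DazukoFS API's naming convention, not the actual interface; the real controller (Section 3.2.2) uses CTypes against the genuine API library.

    import ctypes

    # Assumed library name and entry points; illustrative only.
    libdaz = ctypes.CDLL("libdazukofs.so")

    def control_loop(scanner, mount_point=b"/home/user"):
        handle = libdaz.dazukofs_open(mount_point)       # attach to the mounted directory
        while True:
            fd = libdaz.dazukofs_get_access(handle)      # block until a file is accessed
            # Deny the access if the scanner reports malware, allow it otherwise.
            deny = 1 if scanner.scan_fd(fd) == "infected" else 0
            libdaz.dazukofs_return_access(handle, fd, deny)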

3.2.2 File System Access Controller

The file system access controller is a user space program that runs in the background and is responsible for allowing or denying file access requests. The controller for Thin AV was written in Python, with the CTypes module providing access to the DazukoFS API library. The controller creates an instance of the thinAv class, which in turn is responsible for telling the controller whether or not the file being accessed contains malware.

There are currently five different infection statuses. All three services can specify that a file is either clean or infected, meaning that the service returned a conclusive result. However, the API powering the VirusTotal module demanded the inclusion of three additional statuses: waiting, postponed, and questionable. “Waiting” indicates that a file has been uploaded, but that the service must be checked later for the completed report. “Postponed” means that an upload was denied because more than 20 files have been uploaded via the API in the last 5 minutes. Finally, because VirusTotal scans with more than 40 anti-malware engines, there is an increased risk of a file falsely being labeled as infected. Therefore, a threshold value is used to alleviate some of the false positives. If the number of scanning engines indicating a file is infected is less than four, but more than zero, the file will be labeled as “questionable”. For the desktop implementation of Thin AV the threshold value is hard-coded.

Scan Result         Permissive   Restrictive   Passive
Clean                   ✓             ✓            ✓
Infected                                           ✓
Questionable                                       ✓
Waiting                 ✓                          ✓
Error / Postponed       ✓                          ✓

Table 3.1: Thin AV security policy matrix. A check mark indicates whether file access is allowed under each security policy for a given scan result.
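As an illustration, the thresholding and status-mapping logic just described could look like the following sketch. The constant names are placeholders; only the threshold value of four comes from the text above.

    # Status codes are placeholders, not the actual Thin AV constants.
    CLEAN, INFECTED, QUESTIONABLE, WAITING, POSTPONED = range(5)
    THRESHOLD = 4  # hard-coded in the desktop implementation, as noted above

    def map_virustotal_result(positives):
        """Map the number of engines flagging a file to an infection status."""
        if positives == 0:
            return CLEAN
        if positives < THRESHOLD:
            return QUESTIONABLE  # flagged by one to three engines: possible false positive
        return INFECTED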

Once the file infection status has been returned to the controller, a determination must be made as to whether or not to allow access to the file. This determination is based on the security policy of the access controller, which is set at run-time via command line argument. There are three policies implemented in Thin AV: permissive, restrictive, and passive. The permissive policy will allow access to any file that is not explicitly infected or questionable (possibly infected). The restrictive policy will only allow access to files that are explicitly labeled as clean. Finally, the passive policy never prevents access to a file; rather, it simply alerts the user to the presence of any malware via the terminal. Table 3.1 outlines the various security policies. It should be noted that these policies apply only to the desktop version of Thin AV, and not to the mobile version. The reasons for this will be discussed in Section 3.3.3.
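The policy check itself reduces to a small lookup, sketched below; the policy and status names are illustrative rather than the identifiers used in the Thin AV source.

    # Which scan results each policy permits, per Table 3.1.
    ALLOWED = {
        "permissive":  {"clean", "waiting", "error_postponed"},
        "restrictive": {"clean"},
        # Passive never blocks access; it only reports malware to the user.
        "passive":     {"clean", "infected", "questionable", "waiting", "error_postponed"},
    }

    def allow_access(policy, scan_result):
        return scan_result in ALLOWED[policy]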

3.2.3 Standalone Runner

A small user space Python script was developed which is capable of instantiating a thinAv class and scanning a file, independent of the file system access controller or DazukoFS. This was developed for the purpose of debugging and performance evaluation (Section 4.1) of the Thin AV system, as well as on-demand scanning of individual files.

3.2.4 Thin AV

The Thin AV Python program is the core of the anti-malware scanning system (Figure 3.2). The thinAv class possesses a scan function, which takes a filename, and optionally, a file descriptor and the name of a specific scanning service, and returns a status code indicating whether or not the named file is free of malware. This status code is the result of either an online scan of the file, or a successful search through the local Thin AV cache.

The Thin AV local cache is simply a flat file of previous scan results, which is read after a scanning module is instantiated. Prior to uploading a file to one of the scanning services, the local cache is first checked. This way, if a file has already been scanned, a quick local lookup is all that is necessary to allow access to the file. The cache contains an MD5 hash of the file being analyzed, the full path to the file, the number of times Thin AV has been asked to analyze the file, the last time such an access occurred, the infection status of the file, a note for additional scan details, and the module that was used when the file was analyzed. Because files are identified by their MD5 hash, Thin AV avoids having to track and compare file modification times to determine if the file (in its current form) has already been scanned. MD5 was chosen, in spite of its flaws [104], as the hash function for file identification in Thin AV. This is primarily due to the speed of MD5 when compared with the other hashing functions available in the Python hashing module (Table 3.2). However, given that VirusTotal supports a variety of hashing functions, changing Thin AV to use an alternative hashing function would be trivial.
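A simplified sketch of this hash-keyed lookup is shown below; the tab-separated cache layout and the field positions are assumptions for illustration, not the actual Thin AV file format.

    import hashlib

    def md5_of_file(path, chunk_size=65536):
        """Hash the file contents, so renames and timestamp changes are irrelevant."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                digest.update(block)
        return digest.hexdigest()

    def lookup_cached_status(cache_path, file_path):
        key = md5_of_file(file_path)
        with open(cache_path) as cache:
            for line in cache:
                fields = line.rstrip("\n").split("\t")
                if fields[0] == key:   # content hash matches a previous scan
                    return fields[4]   # assumed position of the infection status
        return None                    # cache miss: an online scan is required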

                           MD5           SHA1          SHA256        SHA512
Time (seconds)             2.37 × 10⁻³   9.12 × 10⁻³   2.40 × 10⁻²   5.04 × 10⁻²
Overhead compared to MD5   N/A           384.66 %      1013.45 %     2125.71 %

Table 3.2: Speed comparison of the hashing functions available in the Python hashlib module. Speeds are based on the average time required to hash a 1 MB file of pseudo-randomly generated data with each hashing function. The average time is the result of 10 trials on the hardware described in Section 3.2.
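Measurements of this kind can be reproduced with a few lines of Python; the snippet below is a rough stand-in for the original benchmark, not the script actually used to produce Table 3.2.

    import hashlib
    import os
    import timeit

    data = os.urandom(1024 * 1024)  # 1 MB of pseudo-random bytes

    for name in ("md5", "sha1", "sha256", "sha512"):
        # Average over 10 trials, mirroring the methodology described above.
        elapsed = timeit.timeit(lambda: hashlib.new(name, data).digest(), number=10) / 10
        print("%s: %.2e seconds" % (name, elapsed))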

In order to determine whether or not a file contains malware, thinAv must instantiate a scanning module. At present, the desktop version of Thin AV has four scanning modules: one for each of the three scanning services described in Section 3.1, and a simulator module which is used for performance evaluation purposes (Section 4.3.1) and does not actually scan files for malware. All of the scanning modules inherit core functionality from the thinAvParent class, which is responsible for providing functions for interacting with the local cache, and for uploading multipart/form-data via HTTP POST requests.

At present, the choice as to which module will be used when scanning is based on the average amount of time each module takes to scan a file, with the fastest service (Kaspersky) being selected first, followed by VirusChief and finally VirusTotal. If any scanning module returns an error from an attempted online scan, then the next module in the priority sequence is selected. If all three scanning modules fail, a general error code is returned to the calling program (be it the file system access controller or the standalone runner). The performance measurements which formed the basis of this decision are discussed in Section 4.1.
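This fastest-first failover amounts to a simple loop over the modules in priority order; the sketch below uses a placeholder module interface and error code rather than the actual Thin AV classes.

    ERROR = -1  # placeholder for Thin AV's general error code

    def scan_with_failover(modules, filename):
        """Try each scanning module in priority order (fastest service first)."""
        for module in modules:  # e.g., [kaspersky, viruschief, virustotal]
            result = module.scan(filename)
            if result != ERROR:
                return result   # a conclusive (or waiting) status was obtained
        return ERROR            # all three scanning services failed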

3.2.5 Scanning Modules

Thin AV accesses the Kaspersky scanning service by simply constructing an HTTP POST request with the appropriate fields, and searching the body of the HTTP response for text strings which indicate whether Kaspersky has deemed the file to be clean or infected.

Using VirusChief for scanning in Thin AV is slightly more complicated than the method used with Kaspersky. First, VirusChief checks for a cookie prior to scanning; this means that Thin AV must initiate an HTTP GET request to the service in order to procure a valid session ID. The file of interest is then uploaded via HTTP POST, along with the session ID, and a report ID is returned. Because results are returned asynchronously via AJAX, Thin AV polls the service once every second to check for scan results. Once at least four scan results have been returned, the results are parsed, and the scan result is returned.

Because VirusTotal provides an API for interacting with their scanning service, the corresponding Thin AV module is somewhat different from the other scanning modules.

VirusTotal caches scan results, so prior to uploading a file, the hash of that file can be checked against the VirusTotal database. If a match is found, the full report for that file will be returned. If a match is not found, then the file can be uploaded, in full, via an HTTP POST request, returning a report ID which can be used to look up the scan report once it has been completed. Unfortunately, scan requests from the API are given the lowest priority by VirusTotal, making the response time for a file scan highly unpredictable. Although it would be possible to simply have the Thin AV module wait, and periodically poll VirusTotal for the report, this would result in impractically long wait times. Therefore, if a report is not immediately available for a file, the module will return a status code indicating it is waiting for a result. Finally, the VirusTotal API only allows 20 uploads every 5 minutes. If Thin AV exceeds this maximum, the module will return a status code indicating that VirusTotal is temporarily unavailable.
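The hash-before-upload flow can be sketched with the standard-library HTTP machinery available to Python 2.7. The endpoint URL, form fields, and reply format below are placeholders, not VirusTotal's actual API, which is documented separately.

    import hashlib
    import json
    import urllib
    import urllib2

    API_URL = "https://example.invalid/scan"  # placeholder endpoint

    def check_then_upload(path, api_key):
        md5 = hashlib.md5(open(path, "rb").read()).hexdigest()
        # 1. Ask for an existing report by hash; a hit avoids uploading the file.
        query = urllib.urlencode({"resource": md5, "key": api_key})
        reply = json.loads(urllib2.urlopen(API_URL + "/report", query).read())
        if reply.get("found"):
            return reply           # a cached report exists for this exact content
        # 2. No cached report: the whole file must be uploaded; the real module
        #    then returns the "waiting" status and polls for the finished report.
        return {"status": "waiting", "resource": md5}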

3.2.6 System Circumvention

In the current implementation of Thin AV, there are a number of security holes that would have to be addressed were a production-scale system to be implemented. The two most obvious avenues for attacking Thin AV are via a man-in-the-middle attack, and by attacking the Thin AV local cache.

Figure 3.2: UML class diagram for Thin AV. (The diagram shows the thinAv class and its scan functions, the thinAvParent class providing the shared caching and multipart upload functions, and the four scanning modules that inherit from it, thinKaspersky, thinVirusChief, thinVirusTotal, and thinSimulator, each of which implements search_online.)

Of the three scanning services, only VirusTotal allows users to upload files via HTTPS. This means that all traffic sent between Thin AV and both Kaspersky and VirusChief is sent in the clear (unencrypted). It is possible that an attacker might be able to intercept this traffic by taking advantage of weakly secured (WEP) or public Wi-Fi, or by using an attack such as ARP cache poisoning. If successful, an attacker could modify the results returned by the scanning services to indicate that the uploaded file is free of malware, when this might not be the case. Unfortunately, the only solution to this attack is to rely on communication via HTTPS, which is a decision that rests in the hands of the scanning service providers.

Additionally, because the Thin AV local cache is implemented as an unencrypted text file, it is a ripe target for any potential malware. If a piece of malware could edit the local cache, then subverting Thin AV would be as trivial as flagging known malicious files as clean. One possible solution to this problem would be to encrypt the Thin AV local cache file when it is not being written or read. Any changes to the encrypted file would then result in an inability to correctly read the cache file, which would indicate the presence of malware on the system. Unfortunately, key management under such a scenario would still be problematic.
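A lighter-weight variation on this idea, an integrity tag rather than full encryption, is sketched below; this is a related technique and not the proposal just described, and the key management problem noted above remains.

    import hashlib
    import hmac

    def seal_cache(cache_bytes, key):
        """Compute a keyed tag over the cache contents when writing it out."""
        return hmac.new(key, cache_bytes, hashlib.sha256).hexdigest()

    def verify_cache(cache_bytes, key, stored_tag):
        # A mismatch suggests the cache file was modified outside of Thin AV.
        return hmac.new(key, cache_bytes, hashlib.sha256).hexdigest() == stored_tag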

One possible, though highly improbable, attack on Thin AV would be to take advantage of the lack of collision resistance in the MD5 hashing algorithm used by Thin AV to identify files. Given that it is possible to construct two distinct inputs which produce the same MD5 hash [104], it is possible, though extremely unlikely, that an attacker could construct a piece of malware that, when hashed, produces the same output as a previously scanned benign file. Thin AV would then mistake this piece of malware for a previously scanned file, and allow the execution of the file. As mentioned earlier, changing the hash function used by Thin AV would be a trivial fix for this flaw.

A more problematic issue is that of file size. Because the largest file that can be scanned by Thin AV is 20 MB (using the VirusTotal service), any files in excess of 20 MB will be ignored by Thin AV. As such, an attacker could simply write a very large piece of malware in order to infect a system.

Finally, given that Thin AV is intended to operate only on files that a normal (non-root) user can edit (e.g., files in their home directory), any malware which makes use of a privilege escalation exploit, such as a rootkit, could potentially circumvent Thin AV in a variety of ways, from modifying the cache file (as mentioned above), to replacing the Python interpreter with a corrupted executable, thereby subverting the very behavior of Thin AV. However, given that this work is focused on the feasibility of a lightweight cloud-based anti-malware system, defending against such attacks is beyond the scope of this work.

3.3 Mobile Thin AV System

The mobile version of Thin AV was implemented on the Android platform. As mentioned in Chapter 1, the decision to use Android was based both on the modifiability and the widespread adoption of Android.

Because of the application isolation created by the combination of a unique user ID and process for each application, the key threat to the Android user comes not from traditional vectors for malware such as drive-by-downloads and application exploits, but rather from malicious apps that are unwittingly installed by a user [60]. In order to combat this threat, Thin AV on Android is application-centric and not file-centric (as on the desktop). Specifically, the mobile version of Thin AV is focused on non-system applications, that is, applications which were installed on the device after it was released by the manufacturer or carrier. This was done for two reasons: first, unless a user has “root” access on a device, the un-installation of system apps is not possible; second, it seems unlikely that a manufacturer would intentionally install a malicious application on their product.²

The implementation of Thin AV for Android is an extension of the desktop scanning system. The top-level portion of Thin AV as outlined in Figure 3.1 was, with minimal modification, re-tasked to act as a unified front-end and caching mechanism for a web-based scanning service used by the Android implementation. This web-based scanning service is then used in two different ways by the Android device: first, the “safe installer” provides a way of verifying Android applications (APKs) during installation, and second, the “killswitch” informs users when already installed applications have been found to be malicious. Figure 3.3 shows the overall system architecture of Thin AV for Android. Each of the key components of the mobile Thin AV system will be described in the following subsections.

² Recent events have shown this might not necessarily be the case [74].

All Android development for Thin AV was done on the same hardware platform described in Section 3.2 and deployed on a virtualized Android device running version 2.3.7 of Android, also referred to as Gingerbread. This version of Android was selected for development because 2.3.7 was the most up-to-date version of Gingerbread before the project was forked and Android 3.0 (Honeycomb) was developed specifically for tablets. The platform changes in Honeycomb and Gingerbread were merged back into a unified platform in version 4.0 (Ice Cream Sandwich) which, at the time of writing, is the most current version of Android. Ice Cream Sandwich was released in November of 2011, and as such, deployment of this version is extremely limited, while Gingerbread constitutes a large portion of the Android install base [92]. The availability of documentation and examples, and the pervasiveness of Gingerbread, made it the ideal version for Android development.

3.3.1 Reuse of Existing Thin AV System

The existing Thin AV system from the desktop implementation was modified to serve as a unified scanning service for the mobile implementation of Thin AV. This was beneficial because it not only built upon the work which had already been completed for the desktop version, but also provided two key benefits to the system architecture. First, it minimized the amount of code running on the Android device, and second, it put a layer of abstraction between the Android device and the third-party scanning services. This is beneficial because any changes to the scanning services can then be handled by modifying the Thin AV code running on the server, while the code running on the Android device would not require an update.

Figure 3.3: System architecture diagram for the mobile implementation of Thin AV. (The diagram shows the desktop Thin AV implementation, extended with a ComDroid module, behind a Thin AV web interface; on the Android device, the safe installer hooks the PackageInstaller while the killswitch checks installed applications, with application sources such as e-mail, USB, third-party markets, and the official Google Market feeding the device.)

A web front-end was created using Flask 0.8 [25], a web application micro-framework for Python. The web application creates a simple HTML form capable of receiving both HTTP GET and POST requests. A GET request is sent in one of two circumstances: first, if the Android Safe Installer attempts to check the cryptographic hash of a package prior to installation, the application will return a scan result if such a result exists (this will be discussed in greater detail in Section 3.3.3). Second, if the Android killswitch sends a system fingerprint, the web application will return a list of cryptographic hashes for applications which have been found to contain malware (this will be discussed in greater detail in Section 3.3.4). A POST request is sent in the circumstance where the cryptographic hash for an Android package was not found in the Thin AV cache, and the whole package must be uploaded to Thin AV.
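A minimal Flask sketch of such a front-end follows. Flask 0.8 is the framework named above, but the route, form field, and reply format are illustrative assumptions rather than the actual Thin AV interface.

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    CACHE = {}  # in-memory stand-in for the Thin AV cache of Section 3.2.4

    @app.route("/check", methods=["GET", "POST"])
    def check():
        if request.method == "GET":
            # Hash lookup: the safe installer (or killswitch) sends an MD5 first.
            report = CACHE.get(request.args.get("md5"))
            return jsonify(report if report else {"status": "unknown"})
        # POST: no cached report existed, so the whole package is uploaded.
        package_bytes = request.files["package"].read()
        # A real deployment would hand package_bytes to a scanning module here.
        return jsonify({"status": "waiting"})

    if __name__ == "__main__":
        app.run()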

3.3.2 Android Specific Scanner

One of the major benefits of Thin AV is its modular and extensible architecture. In order to further demonstrate this benefit, and to increase the functionality of Thin AV, an Android-specific scanning service, ComDroid, was added to Thin AV to complement the existing third-party anti-virus scanners. However, because of the modular design of Thin AV, theoretically any type of analysis module could be used to evaluate the safety of Android packages. An Android-specific anti-virus scanning service, permissions analyzers similar to Kirin [56] or Stowaway [59], or a social reputation analyzer such as was described in [34], are all compelling possibilities. In fact, ComDroid itself is not an anti-virus engine; rather, it is a static code analysis tool which can identify potential vulnerabilities in Android applications. The tool was developed by Chin et al. and described in detail in [45].

ComDroid is publicly available as a web-based service hosted at the University of California at Berkeley. Because ComDroid has a web interface, building a Thin AV scanning module to take advantage of ComDroid was relatively straightforward. Beyond developing a new scanning module called thinComDroid, which also inherits from thinAvParent (see Figure 3.2), the only internal change that was necessary was the addition of a new return status code. Because ComDroid identifies potentially exploitable apps and not malicious apps, the ComDroid module can identify an Android package as being “at risk” as opposed to being “infected”.

In the current deployment of Thin AV, a package will be prevented from being installed if ComDroid identifies vulnerable communication channels within the package. However, depending on the sensitivity of ComDroid, and the prevalence of potentially vulnerable apps, a more permissive strategy might be warranted. If one of the existing scanning modules identifies a malicious app, this status will supersede any status returned by ComDroid. It should also be noted that an Android package is scanned with ComDroid after scanning with the appropriate anti-virus scanner. The performance drawbacks of this configuration are obvious, and a production-scale deployment of Thin AV should incorporate the ability to perform simultaneous scans in parallel.

3.3.3 Safe Installer

In order to protect a device from malicious applications, a mechanism must exist for preventing the installation of malicious applications, and the Safe Installer is such a mechanism.

All applications not installed via Google's Market are installed using Android's Package Installer system, and so this was the target for injecting the application check for Thin AV. The technique used for hooking into the Package Installer was adapted from [56]. Unfortunately, Google's Market does not use the Package Installer for installing apps. This is because Google's Market application is signed with the same certificate that is used to sign the operating system; this gives the Market access to the highest level of system permissions, a level that no other third-party application is granted. Therefore, the Google Market has the ability to directly install and uninstall applications, bypassing the Package Installer. As mentioned in Section 1.1.3.2, Google has staked the reputation of their brand on the success of the Android Market, and thus has a vested interest in keeping it free of malware. Other application markets may not have such a prominent brand name to maintain.

In order to modify the Package Installer, the Android operating system source code was modified. The Package Installer is part of Android's Java middleware, which includes a broad selection of programs and libraries for use by application developers. The PackageInstallerActivity class was modified to make use of ThinAvService, a new service class which was added to the source code for the purpose of communicating with the Thin AV web application described in the previous section. The service provides a single public function, checkAPK, accessed via an interface defined using the Android Interface Definition Language (AIDL). When a package is to be installed by the Package Installer, the APK must already reside on the file system, having been downloaded by a third-party market or transferred via some other method. The checkAPK function takes the file system path of the APK being installed, reads the file, and creates an MD5 hash of the bytes of the APK. This hash is then sent to the Thin AV web application, which returns a scan report, if such a report exists. If no scan report exists, the APK is uploaded to Thin AV where it is passed off to one of the third-party scanning services. When a scan result is returned, that result is passed back to ThinAvService and checkAPK then returns a Boolean as to whether or not the installation should be allowed to proceed. The PackageInstallerActivity then allows or prevents the installation of the application, displaying the appropriate information dialogs to the user, where necessary.
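For illustration, the checkAPK flow can be sketched in Python (the actual implementation is a Java service inside the Android middleware, so this is a logical outline only; the URL and the upload helper are placeholders).

    import hashlib
    import urllib2

    THINAV_URL = "http://example.invalid/check"  # placeholder server address

    def upload_and_scan(apk_path):
        # Stub: the real service POSTs the APK to the Thin AV web application
        # and waits for a scan result to come back.
        return "clean"

    def check_apk(apk_path):
        """Return True if installation should proceed, False otherwise."""
        digest = hashlib.md5(open(apk_path, "rb").read()).hexdigest()
        reply = urllib2.urlopen("%s?md5=%s" % (THINAV_URL, digest)).read()
        if reply == "unknown":       # no cached report: upload the whole APK
            reply = upload_and_scan(apk_path)
        return reply == "clean"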

In some sense, the safe installer acts similarly to the file system access controller in the desktop version of Thin AV. Despite this, there is no concept of multiple security policies in the safe installer. All package installs are subject to scanning, and package installation will be terminated if Thin AV detects malware. Although different security policies could be added to the mobile system, the faster performance of the mobile system compared to the desktop system allowed for a single strict security policy without compromising system performance (Section 5.5).

At this point, it should be noted that given that the Package Installer is part of the Android source, the system just described cannot simply be installed on any Android-capable device. Replacing the operating system on a particular Android device requires that the device be unlocked or “rooted”, as most devices are locked by the manufacturer or service provider. Additionally, installing a new version of Android on a device voids the device warranty and can have compatibility issues [16].

3.3.4 Killswitch

The safe installer described in the previous section can prevent the installation of applications known to be malicious. However, two other scenarios must also be addressed: one where a malicious application has been installed on a device prior to the installation of Thin AV, and one where an application was installed on a device but was not flagged as malicious at the time of installation. A killswitch was developed that addresses these two scenarios. It operates independently of any specific application installation mechanism, making it ideal for the multi-market ecosystem available on Android devices.

Four different approaches were considered as potential alternatives for how to implement the killswitch. The first was to check for revocations at application launch time, by modifying the app being launched. However, this was not realistic, because even though tools exist for decoding Android packages [1], the lack of a main function in Android applications would require that a hook be inserted into every single activity which could be called by an intent. Furthermore, any code modifications would also invalidate the certificate that is packaged with the application. The second alternative was to hook into the application launcher. This would allow for the ability to interrupt launches from the Android home screen, but not launches from the system application list, nor would it catch launches caused by intents generated from other apps. The third option was to modify the actual program execution code which resides at a much lower level in the Android source code. This approach, while technically challenging, was feasible. However, from a software architecture perspective, it would very likely create undesirable cross-cutting concerns within the Android source code. The final option, which was ultimately selected, was to develop a scheduled service which periodically checks for revocations.

The killswitch was developed as a standard Android application capable of communicating with the Thin AV web application, similar to the safe installer. The killswitch has three different functions available to the user. It can upload all applications to Thin AV for analysis (if said applications are not in the Thin AV cache), it can manually check if any non-system applications on the device have been flagged as malicious, and finally, it can set up the killswitch to regularly check the device for malicious applications using a scheduled event (Figure 3.4). In the current implementation the killswitch is scheduled to run every fifteen minutes.

The feature to manually upload missing packages to Thin AV was left as a manual activity for the user. This decision was made due to the fact that it is possible that many or even most of the packages on a device may be missing from Thin AV. The upload and scanning of these apps, while a one-time activity, would still consume a great deal of time and bandwidth. Therefore, by leaving the decision to the user, they can opt to perform the upload when the device is connected to a WiFi network (as opposed to being charged for using their cellular data plan), or at another time when the upload process would not be an inconvenience, such as when the device is charging.

When the killswitch is checking for malicious apps, it uses the PackageManager class to locate all public Android packages installed on the device. These packages are stored on the device and are read-only, making them ideal for analysis. The meta-data of each of the packages is read, and if the package has not been previously seen by the killswitch, the bytes of the package are hashed and a collection of all package hashes is sent to the Thin AV web application via HTTP GET. Once a package has been hashed by the killswitch, the hash is stored in a file which is only accessible to the killswitch; this stored hash can then be retrieved much more quickly than recomputing the hash every time the device is fingerprinted. If any of the hashes sent to the Thin AV web application are found to be from a malicious app, the hashes corresponding to the malicious applications are returned to the killswitch. The user is then notified of the issue, and presented with a list of applications suspected to be malicious. The user can then choose to initiate the removal of those applications.
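The fingerprinting logic amounts to hash-once-then-reuse plus a single GET request, sketched below; the directory layout, URL, and JSON reply format are assumptions for illustration.

    import hashlib
    import json
    import os
    import urllib
    import urllib2

    SERVER = "http://example.invalid/killswitch"  # placeholder server address

    def find_flagged_packages(apk_dir, hash_cache):
        """Return the names of installed packages the server reports as malicious."""
        hashes = {}
        for name in os.listdir(apk_dir):
            path = os.path.join(apk_dir, name)
            if path not in hash_cache:  # packages are read-only, so hash only once
                hash_cache[path] = hashlib.md5(open(path, "rb").read()).hexdigest()
            hashes[hash_cache[path]] = name
        query = urllib.urlencode({"hashes": ",".join(hashes)})
        flagged = json.loads(urllib2.urlopen(SERVER + "?" + query).read())
        return [hashes[h] for h in flagged if h in hashes]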

It might be preferable to have a killswitch which was capable of removing or in some way quarantining a malicious application without the input or consent of a user. However, this would have required significant changes to the Android PackageManager, as well as possibly other lower-level components. This would have also created a potential security vulnerability insofar as it would have created a mechanism by which an ordinary application could uninstall other applications. This could result in a scenario where a malicious application could uninstall the anti-virus or security apps on a device.

Figure 3.4: User interfaces for the Android killswitch: (a) main screen, (b) prompt to upload missing packages, (c) notification of malware, (d) malicious application removal screen.

3.3.5 System Circumvention

The key drawback of Thin AV as it is designed and deployed for Android is the fact that install-time checking of applications can only be achieved by rooting a device and installing a custom operating system. As a prototype, this technique is adequate. However, this design is impractical for wide-scale deployment. A preferable scenario would be one in which the main source code trunk of Android was modified to create a generic hook in the PackageInstallerActivity class which would give ordinary apps the ability to allow or deny application installations. Unfortunately, such a hook in the PackageInstaller could very easily be abused by malicious apps looking to prohibit the installation of legitimate applications. A potential solution to this issue would be for Google to allow applications to use the hook only if the developer of the application is trusted or in some way certified by Google.

Finally, the mobile prototype of Thin AV is vulnerable to circumvention in much the same way as the desktop version, most notably through the lack of encryption on HTTP communication and the possibility of forged MD5 hashes.

One major improvement offered by the mobile version of Thin AV is the reduced privacy concerns, due to the fact that only Android applications, and not personal files, are being uploaded to Thin AV.

Chapter 4

System Evaluation - Desktop Thin AV

The evaluation of Thin AV was approached differently for the desktop-based version of the system (Chapter 4) and for the mobile implementation (Chapter 5). The evaluation of the desktop version of Thin AV focuses on system speed, and not detection rates. This is because the detection performance of Thin AV is reliant upon third-party proprietary software systems, and the detection performance of these systems is regularly evaluated by consumer product testing groups [93]. Additionally, [82] provides a detailed analysis of the detection capabilities of many consumer anti-virus products. Finally, because Thin AV relies on remote scanning services, the performance of the system would be a much larger barrier to adoption than the detection performance, which is generally high for all consumer-grade anti-virus products [106].

The evaluation of the desktop version of Thin AV was performed in four phases. The first phase in the evaluation was to assess the performance of the individual scanning services. The goal of this test was to determine the relationship, if any, between the size of files being uploaded, and the time required to receive a response from the various scanning services. This phase will be discussed in Section 4.1.

The second phase of testing involved determining the actual overhead caused by Thin AV. A series of workload scripts were used to generate file system activity while various parts of Thin AV were active. This way, the overhead incurred by each element of Thin AV could be determined. This phase is elaborated upon in Section 4.2.

The third phase of testing involved using the timing results from the first phase to produce a simulator which would predict the scanning time for a file of arbitrary size for each scanning service. The formulæ powering this simulator were then used to compute the predicted overhead of using Thin AV on a system while running the same workload script from the previous phase. By comparing the predicted overheads to the actual overheads measured in phase two, the formulæ for the scanning services could be iteratively refined until they predicted the actual overhead of Thin AV with a very high degree of accuracy. This phase will be detailed in Section 4.3.

Finally, with the response time formulæ refined, the simulator was improved, making it possible to simulate file system access on a large scale and determine the overhead of Thin AV under different file system access patterns. Simulation was chosen as the method for large scale testing because it would allow for testing a variety of file system access patterns, at very large scales, relatively quickly, and it would not draw the wrath of the various scanning service providers. This phase will be detailed in Section 4.4.

For each of the phases discussed below, the precise testing protocol will be described, followed by the results produced from the test, concluding with a brief discussion of the results as they pertain to the testing phase in question.

4.1 Scanning Service Performance

Each of the three scanning services is hosted by a different organization, with different hardware resources, and each service receives different loads. Therefore, it was first critical to determine the response times of the different services based on the size of files being uploaded.

4.1.1 Testing Protocol

In order to test the performance of the different scanning services, a small testing program was developed which would scan a series of files using a specific scanning service, and record the time necessary to complete the scanning operation. Consequently, these tests did not examine any latency that might be introduced by DazukoFS (Section 3.2.1) or the file system access controller (FSAC, Section 3.2.2). An unavoidable drawback of this black-box approach is that it was not possible to determine what portion of the response time was due to file uploading and what portion was attributable to file scanning.

Each execution of the testing program scanned 12 different files of the following sizes: 0 KB, 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB and 1023 KB. The Kaspersky scanning service will not scan files 1024 KB or larger, hence 1023 KB was chosen as the upper limit on file size. Additionally, files were skewed to the small end of the size spectrum because results from [96], [90] and [28] suggest that small files comprise the bulk of accesses during typical file system operation. Finally, the 0 KB size file was included in the test because it shows the best-case response time for each service.

The test files were generated by a script which produces files of a specified size filled with pseudo-random bits. The assumption was that a file generated in such a way would have an extremely low probability of being flagged as malware by one of the scanning services. Test files are uploaded in a pseudo-random order every time the test program is run. This was done to overcome any penalty that might be incurred against the first file being uploaded due to DNS lookups.
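A generator along these lines takes only a few lines of Python; this sketch is a stand-in for the original script, with an assumed file-naming scheme.

    import os

    SIZES_KB = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1023]

    def make_test_files(directory):
        """Write one file of pseudo-random bytes at each test size."""
        for kb in SIZES_KB:
            path = os.path.join(directory, "test_%04dKB.bin" % kb)
            with open(path, "wb") as f:
                # Random content: near-zero odds of matching a malware signature.
                f.write(os.urandom(kb * 1024))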

The testing program was run up to 8 times a day for each of the three scanning services (for a maximum of 24 runs per day). Testing took place on 8 different days: 28/08/11 to 2/9/11, and 8/9/11 to 10/9/11. Testing was done on several days in an attempt to give a fair representation of average service performance, in the event that one or more services were performing particularly poorly on any given day. The tests for each scanner took place between the hours of 9:00AM and 5:30PM MDT, and were spaced as close to one hour apart as possible, with Kaspersky tested first, followed by VirusChief, and finally VirusTotal. Although it would have been preferable to have a completely automated testing script which ran continuously for several days, the prospect of local network outages and remote service failures meant that the tests had to be run under human supervision. Testing was performed on the hardware platform described in Section 3.2. The laptop was connected to the internet via the University of Calgary's AirUC 802.11 wireless network.

Unfortunately, due to the low priority placed on scan requests to VirusTotal, the response times from this service were frequently excessive. As such, testing of all twelve files could often not be completed in the (approximately) 55 minutes between testing sessions. However, given that the test files were uploaded in a random order, it was possible to collect some data for all of the different file sizes. Additionally, because VirusTotal limits users to 20 API calls in any given 5 minute window, the testing program had to be modified to periodically poll VirusTotal for results. After uploading a file, the test program would sleep, first for 15, then 30, and finally 60 seconds, each time polling VirusTotal for a result after sleeping. The test program would then continue to sleep for 60 seconds between polling attempts until a result was returned. It is therefore the case that response time results for VirusTotal have up to a 60 second margin of error. Finally, it should be noted that during the period of testing, there were occasional failures of the VirusTotal service. These outages were always brief, each lasting less than 5 minutes. In the event of a failure, the test was stopped and restarted once the service became available.

Finally, as Section 4.1.2 will show, the response time for VirusTotal was extremely high, and not strongly correlated with the file size of the upload. In order to at least gain an understanding of the file-size-dependent portion of the VirusTotal scanning process, a test was run to determine the time required to upload a file of a given size to VirusTotal, and receive a response, without waiting for a scan result. For this experiment, eight files at each of the file sizes listed earlier were uploaded to VirusTotal, and the time required to receive a “waiting” response was measured.

4.1.2 Results

For each of the three scanning services, several hundred response time measurements were recorded. Tables 4.1, 4.2, and 4.3 detail the key statistical measurements for each of the three scanning services. Table 4.4 contains the measurements for the upload-only portion of the VirusTotal scanning service. A cursory review of the data showed a handful of extreme outliers for each service. As no standard technique exists for identifying outliers [103], standard deviation was chosen as the means by which outliers were identified. As such, any measurement beyond two standard deviations of the mean was classified as an outlier. This threshold was chosen because it eliminated the most extreme results, while retaining the vast majority of the data.
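In code, the two-standard-deviation rule is a one-pass filter; the sketch below uses plain Python rather than whatever tooling was actually used to clean the data.

    import math

    def remove_outliers(samples, k=2.0):
        """Drop any sample more than k standard deviations from the mean."""
        n = float(len(samples))
        mean = sum(samples) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
        return [x for x in samples if abs(x - mean) <= k * sd]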

A comparison of the measurements from the three services shows a clear difference of nearly an order of magnitude between the performance of VirusTotal and the other two scanning services. The average response times from Kaspersky and VirusChief range from 1.54 – 14.49 seconds and 6.82 – 28.70 seconds respectively, while the response times from VirusTotal range from 1.21 – 229.28 seconds (though the latter range becomes 148.92 – 229.28 seconds when only non-zero file sizes are considered). The upload portion of VirusTotal shows response times similar to Kaspersky, with response times ranging from 1.94 – 11.74 seconds.

With the outliers removed, the response time data was plotted in an attempt to determine what, if any, relationship exists between file size and service response time for the three scanning services. Figures 4.1, 4.2, and 4.3 show the upload file size plotted versus the response time for each of the three scanning services, and Figure 4.4 graphs the upload and response speed of VirusTotal.

Kaspersky
                     With Outliers                            Without Outliers
Upload size     n    µ (sec.)  x̃ (sec.)  σ (sec.)      n    µ (sec.)  x̃ (sec.)  σ (sec.)
0 KB           64      1.69      1.48      0.81       61      1.54      1.45      0.31
1 KB           69      1.96      1.48      2.42       67      1.61      1.48      0.43
2 KB           69      2.09      1.76      1.90       66      1.75      1.75      0.21
4 KB           70      2.00      1.73      0.90       68      1.87      1.71      0.47
8 KB           69      2.89      2.00      3.48       67      2.33      1.99      1.12
16 KB          70      2.49      2.01      2.45       69      2.22      2.00      0.76
32 KB          69      2.90      2.30      2.50       68      2.63      2.29      1.02
64 KB          70      3.07      2.62      1.13       63      2.74      2.58      0.54
128 KB         69      3.67      3.37      1.41       66      3.41      3.34      0.31
256 KB         70      5.58      4.70      3.50       68      5.03      4.67      1.28
512 KB         70      9.92      8.12      6.02       65      8.48      7.96      3.07
1023 KB        68     16.17     13.78      6.12       62     14.49     13.63      2.59

Table 4.1: Number of measurements (n), mean response time (µ), median response time (x̃), and standard deviation (σ) for each size of file uploaded to the Kaspersky scanning service.

VirusChief
                     With Outliers                            Without Outliers
Upload size     n    µ (sec.)  x̃ (sec.)  σ (sec.)      n    µ (sec.)  x̃ (sec.)  σ (sec.)
0 KB           63      7.38      6.46      3.72       60      6.82      6.40      2.75
1 KB           68     17.84     18.01      5.80       62     18.72     18.22      3.97
2 KB           68     18.04     18.79     10.36       62     16.94     18.69      6.10
4 KB           68     18.22     18.11     10.09       65     16.51     18.06      6.19
8 KB           67     18.46     18.57      8.76       65     17.46     18.45      6.16
16 KB          68     20.23     19.19     11.70       66     18.66     19.15      6.62
32 KB          68     19.48     20.18      6.99       63     20.24     21.13      5.58
64 KB          68     20.31     20.31      7.36       61     20.49     20.42      5.10
128 KB         68     19.24     19.85      5.97       60     20.17     20.07      3.28
256 KB         69     20.70     22.18      6.61       61     21.36     22.25      4.25
512 KB         68     23.99     24.72      6.93       61     25.20     24.91      4.25
1023 KB        67     28.53     28.71      7.96       60     28.70     28.73      5.76

Table 4.2: Number of measurements (n), mean response time (µ), median response time (x̃), and standard deviation (σ) for each size of file uploaded to the VirusChief scanning service.

VirusTotal
                     With Outliers                            Without Outliers
Upload size     n    µ (sec.)  x̃ (sec.)  σ (sec.)      n    µ (sec.)  x̃ (sec.)  σ (sec.)
0 KB           36      2.20      0.84      6.00       35      1.21      0.84      0.93
1 KB           34    201.65    116.27    246.64       32    148.92    116.08     91.64
2 KB           39    210.30    146.06    173.05       36    168.80    134.25     96.23
4 KB           37    283.89    170.51    289.37       34    212.95    150.50    152.34
8 KB           41    282.63    170.58    408.17       40    224.62    170.44    171.45
16 KB          36    166.83    143.48    102.82       35    152.68    139.54     58.87
32 KB          38    192.47    134.82    151.96       35    162.14    118.65    114.00
64 KB          37    213.12    135.46    197.75       35    173.57    135.42    106.86
128 KB         35    286.85    171.24    386.21       34    229.28    154.04    184.92
256 KB         40    219.71    148.96    228.65       37    160.77    136.61     68.63
512 KB         39    230.78    173.11    203.92       36    182.71    156.84    113.89
1023 KB        33    198.52    140.38    140.44       31    171.32    139.93     91.09

Table 4.3: Number of measurements (n), mean response time (µ), median response time (x̃), and standard deviation (σ) for each size of file uploaded to the VirusTotal scanning service and polling until a scan result is returned.

VirusTotal (Upload Only)
                     With Outliers                            Without Outliers
Upload size     n    µ (sec.)  x̃ (sec.)  σ (sec.)      n    µ (sec.)  x̃ (sec.)  σ (sec.)
1 KB            8      2.30      1.68      1.07        8      2.30      1.68      1.07
2 KB            8      2.77      1.69      3.02        7      1.71      1.59      0.23
4 KB            8      1.94      1.97      0.15        8      1.94      1.97      0.15
8 KB            8      2.55      2.20      0.88        7      2.28      2.02      0.50
16 KB           8      2.73      2.37      0.94        7      2.43      2.17      0.46
32 KB           8      2.42      2.36      0.22        8      2.42      2.36      0.22
64 KB           8      3.17      2.53      1.49        7      2.67      2.50      0.48
128 KB          8      3.10      2.84      0.53        8      3.10      2.84      0.53
256 KB          8      4.54      3.52      2.26        7      3.79      3.47      0.86
512 KB          8      5.28      5.10      0.82        7      5.00      5.10      0.27
1023 KB         8     11.74      7.43      7.97        8     11.74      7.43      7.97

Table 4.4: Number of measurements (n), mean response time (µ), median response time (x̃), and standard deviation (σ) for each size of file uploaded to the VirusTotal scanning service when no polling for scan results was performed.

Figure 4.1: Scan response time versus file upload size for the Kaspersky virus scanner service.

Kaspersky and VirusChief both show a similar positive correlation between file size and response time, with the VirusChief data being positively shifted on the y-axis by roughly fifteen seconds. VirusTotal, on the other hand, shows little if any relationship between file size and response time. The trend of the VirusTotal response time data is slightly negative, with a much larger y-intercept than either of the other two scanning services. Conversely, the upload portion of the VirusTotal scan shows a trend very similar to Kaspersky. With this data it was possible to produce a set of linear equations that approximate the performance of the scanning services as a function of the number of bytes in the file (Table 4.5). This was done using a linear regression in Microsoft's Excel spreadsheet program.
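The same fit can be reproduced without a spreadsheet; the least-squares sketch below (plain Python, with placeholder data) yields a slope and intercept in the form of Table 4.5.

    def fit_line(xs, ys):
        """Ordinary least-squares fit: returns (slope, intercept) for f(x) = slope*x + intercept."""
        n = float(len(xs))
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
                 sum((x - mx) ** 2 for x in xs))
        return slope, my - slope * mx

    # Placeholder data: file sizes in bytes versus response times in seconds.
    sizes = [0, 1024, 2048, 4096]
    times = [1.5, 1.6, 1.8, 1.9]
    print(fit_line(sizes, times))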

4.1.3 Discussion

It is not surprising that Kaspersky, which only scans with a single anti-virus engine, returns the fastest results, and that VirusChief, which scans with six anti-virus engines, is roughly fifteen seconds slower than Kaspersky when scanning a similarly sized file.

Figure 4.2: Scan response time versus file upload size for the VirusChief virus scanner service.

Figure 4.3: Scan response time versus file upload size for the VirusTotal virus scanner service.

Kaspersky                  f(x) = 10^-5 × x + 1.891
VirusChief                 f(x) = 10^-5 × x + 17.133
VirusTotal                 f(x) = -9 × 10^-6 × x + 182.98
VirusTotal (Upload Only)   f(x) = 9 × 10^-6 × x + 1.947

Table 4.5: Linear equations for each of the three scanning services derived from Figures 4.1, 4.2, 4.3, and 4.4. Equations calculate the response time for each scanning service for a file x bytes in size.
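The same kind of fit is easy to reproduce programmatically. The following minimal sketch uses NumPy's polyfit on the VirusChief mean response times from Table 4.2 (the array contents are taken from that table purely for illustration; the actual regression here was done in a spreadsheet) to recover a linear model of the same form as Table 4.5:

    import numpy as np

    # Upload sizes in bytes and mean response times in seconds (Table 4.2).
    sizes = np.array([1024, 2048, 4096, 8192, 16384, 32768, 65536])
    times = np.array([17.84, 18.04, 18.22, 18.46, 20.23, 19.48, 20.31])

    # Least-squares fit of f(x) = slope * x + intercept, as in Table 4.5.
    slope, intercept = np.polyfit(sizes, times, 1)
    print("f(x) = %.3g * x + %.3f" % (slope, intercept))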

Figure 4.4: Upload response time versus file upload size for the VirusTotal virus scanner service when uploading a file and not polling for a scan result.

It is the VirusTotal scanning service which behaves in a markedly different fashion. Again, this is not surprising: because VirusTotal assigns the lowest priority to requests sent via its formal API, the response time of the VirusTotal service depends not on the size of the uploaded file, but on how busy the VirusTotal scanning service is at any given moment. This distinction is made much clearer when comparing the time required to scan a file with VirusTotal against the time required merely to upload a file to VirusTotal.

Finally, it should be noted that because VirusTotal maintains a database of scan results, hashes that match entries in the database will return results much faster than files for which there is no matching hash. This is evident when looking at the 0 KB length files, which have a dramatically faster response time than files of any other size, even the 1 KB files. This is because the empty files always produce the same hash value, meaning that the result will be stored in VirusTotal's database, and a previous scan report can be returned immediately without the file being uploaded for analysis.
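As a small illustration of why this works (the snippet is illustrative and not part of Thin AV), any two empty files produce an identical digest, so a scan-result cache keyed on the file hash services every empty file from a single entry:

    import hashlib

    # Every zero-byte file hashes to the same value, so one cached scan
    # report covers all of them.
    print(hashlib.sha256(b"").hexdigest())
    # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855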

The results in this section describe the performance of the three malware scanning services during a specific window in time. Every effort was made to collect numerous measurements during that window in order to provide a fair and accurate representation of the performance of each service. However, it is unreasonable to assume that the results described herein will accurately reflect the performance of the scanning services in the long term. Changes in service loads, hardware configuration, and even changes in the service offerings themselves could alter the performance of the scanning services. Such possibilities are unavoidable, and by no means a reason to shy away from such avenues of research. Unless the changes in performance are drastic, they are unlikely to alter the conclusions made regarding the feasibility of Thin AV as a security apparatus.

4.2 Actual System Overhead

Having characterized the response times of the three different scanning services, the second step of evaluating the desktop implementation of Thin AV was to assess the performance impact of running Thin AV.

4.2.1 Testing Protocol

To calculate the actual overhead incurred by Thin AV, a series of workloads were used to generate file system activity. The technique used to generate workloads was modified from the approach used by Bickford et al. [37]. Here, the workload was created using a bash script to launch Firefox and navigate to certain websites. The CPU utilization of the Firefox process was monitored, and once it dropped below a specific threshold, the browser was terminated, as it was assumed that content had been successfully loaded and the browser was idle. The same process was applied to launching and terminating the Thunderbird e-mail client. The idea of using web browsing and e-mail as a workload is appealing because these activities are extremely common and well understood by users. However, testing of the workload script used by Bickford et al. showed some limitations.

Web                                         Advanced
1) Launch Firefox                           1) Compile GZip from source
2) Navigate to the following websites:      2) Compile a five page LaTeX paper
   http://www.google.ca                     3) Copy the directory containing the paper
   http://www.slashdot.org                  4) Rename the copied directory
   http://www.reddit.com                    5) Delete the copied directory
   http://www.youtube.org
   http://www.cbc.ca
3) Close Firefox
4) Open Thunderbird
5) Close Thunderbird

Table 4.6: Activities in the web and advanced workload scripts.

Most notably, the Firefox application would often be terminated before a page had been completely loaded and rendered. Therefore, a different technique was used to generate browser activity for Thin AV. Specifically, the automated testing tool Selenium was used to drive Firefox [19].

Three bash scripts were produced, each corresponding to a different workload. These workloads will be referred to as “web”, “advanced”, and “combined”, respectively. The web script was directly inspired by Bickford et al., and was complemented by the advanced script, which was designed to capture a snapshot of more developer-oriented behavior. Table 4.6 outlines the specific activities in the web and advanced scripts. The combined script includes the activities of the web script followed by the activities of the advanced script. In order to better characterize the activities in each of the workloads, a small Python program was written which uses the Linux inotify subsystem [11]; a sketch of this approach is shown below. Each of the workloads was run once, and the activities generated by the workload were recorded in order to help contextualize the results from this experiment.
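A minimal sketch of such a monitor, using the pyinotify bindings, follows; the watched path and output format are illustrative assumptions rather than the exact program used here:

    import pyinotify

    class Logger(pyinotify.ProcessEvent):
        # Record the two event types the workload analysis relies on.
        def process_IN_ACCESS(self, event):
            print("access", event.pathname)

        def process_IN_MODIFY(self, event):
            print("modify", event.pathname)

    wm = pyinotify.WatchManager()
    wm.add_watch("/home/user", pyinotify.IN_ACCESS | pyinotify.IN_MODIFY,
                 rec=True)
    pyinotify.Notifier(wm, Logger()).loop()  # log events until interrupted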

To assess the overhead of Thin AV, several different scenarios were examined in relation to a base case scenario. In every scenario the three workload scripts were run ten times, and the time required to complete the workload was recorded. The base case scenario was one in which no part of Thin AV was active while the workloads were being executed. Table 4.7 outlines each of the examined scenarios. Kaspersky and VirusChief were tested individually, with the Thin AV and browser caches being cleared between each of the ten runs. VirusTotal was not tested by itself because the average response time of the VirusTotal service was so large that it would be completely impractical as a standalone scanning service. Additionally, because VirusTotal remotely caches scan results, it would not have been possible to repeatedly test a workload on VirusTotal without retrieving cached results from the first run of that workload (as opposed to re-scanning files each time).

Scenario                 Description                                                    Caches
Kaspersky (uncached)     Thin AV running with only Kaspersky scanning files             Cleared
VirusChief (uncached)    Thin AV running with only VirusChief scanning files            Cleared
All scanners (cached)    Thin AV running in the typical configuration with all three    Not cleared
                         services scanning files of the appropriate size
Dazuko only              Only DazukoFS mounted, without Thin AV running                 N/A
FSAC and Dazuko          DazukoFS mounted with the file system access controller        N/A
                         approving all accesses without checking files with Thin AV

Table 4.7: Scenarios examined for assessing Thin AV overhead. The caches column specifies whether or not the Thin AV cache and browser cache were cleared between each of the ten runs of a given workload.

The “all scanners” scenario involved running Thin AV with all three scanning services in concert when the Thin AV cache was primed. This is representative of how Thin AV would normally operate. If a file cannot be found in the cache it will be scanned by Kaspersky. If the file cannot be scanned by Kaspersky due to service unavailability or file size, then Thin AV will attempt to scan with VirusChief, and finally VirusTotal, as sketched below. For this test, neither the Thin AV local cache nor the Firefox browser cache was deleted between runs. Because of the VirusTotal caching issue mentioned earlier, it was not possible to perform a test of all three scanners in concert with an empty Thin AV cache.
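A minimal sketch of this fall-through logic is given below. The helper names are hypothetical; only the file size limits (1 MB, 10 MB, and 20 MB, discussed later in this chapter) come from the services themselves. For simplicity the sketch falls through on size alone, although the real Thin AV also falls back when a service is unavailable:

    # Upload caps for the three services, in bytes.
    KASPERSKY_MAX = 1 * 1024 * 1024
    VIRUSCHIEF_MAX = 10 * 1024 * 1024
    VIRUSTOTAL_MAX = 20 * 1024 * 1024

    def scan(path, size, cache):
        digest = hash_file(path)               # hypothetical hashing helper
        if digest in cache:
            return cache[digest]               # local cache hit
        if size <= KASPERSKY_MAX:
            result = scan_with_kaspersky(path)     # hypothetical helpers
        elif size <= VIRUSCHIEF_MAX:
            result = scan_with_viruschief(path)
        elif size <= VIRUSTOTAL_MAX:
            result = scan_with_virustotal(path)    # upload only; "waiting"
        else:
            return "unscanned"                 # too large for every service
        cache[digest] = result
        return result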

It should also be noted that because some of the content on the websites in the web and combined workloads is dynamic, a primed Thin AV cache does not mean all files encountered in the workloads were previously cached by Thin AV.

For each of the scenarios where all or part of Thin AV was running, only the home directory of the active user was monitored. This decision was based on the fact that, as previously mentioned, the goal of Thin AV is to provide minimalist anti-malware protection. As such, areas of the file system which would normally be beyond the reach of a standard user will be ignored by Thin AV, in favor of monitoring areas of high user activity (i.e., the /home directory).

4.2.2 Results

Table 4.8 summarizes the characteristics of a representative run of each of the three workloads. Given that the combined workload contains all of the activities of the web and advanced workloads, it is sufficient to say that the characteristics of the combined workload are approximately the sum (events, accesses, modifications, modify / access events, unique files), average (mean file size), or maximum of the values from the advanced and web workloads. The only similarities between the web and advanced workloads are the number of unique files being accessed and the median size of those files. The advanced workload contains more than three times as many file events as the web workload, and those events are split fairly evenly between access events and modification events. The web workload, on the other hand, has almost three times as many access events as modification events. Based on the file size statistics, it is clear that all of the files in the advanced workload are relatively small (250 KB or less), whereas the web workload contains a handful of much larger files.

                                  Web        Advanced   Combined
Events                            158        529        698
Accesses                          116        245        354
Mean time between accesses (s)    0.157      0.010      0.064
Modifications                     42         284        344
Modify / access events            7          19         25
Unique files                      88         92         175
Mean file size (KB)               1138.35    17.74      581.59
Median file size (KB)             5.13       2.72       4.52
Maximum file size (KB)            53132.00   250.71     53132.00

Table 4.8: General characteristics for the three different workloads used for testing Thin AV.

                        Web                 Advanced             Combined
Base case               21.4                1.9                  21.3
Kaspersky (uncached)    194.9 (910.75%)     186.2 (9800.00%)     376.3 (1766.67%)
VirusChief (uncached)   1786.3 (8347.20%)   1864.2 (98115.79%)   3587.6 (16843.31%)
All scanners (cached)   59.0 (275.70%)      36.9 (1942.11%)      100.9 (473.71%)
Dazuko only             21.6 (100.93%)      1.8 (94.74%)         23.2 (108.92%)
FSAC and Dazuko         21.8 (101.87%)      2.8 (147.37%)        26.5 (124.41%)

Table 4.9: Average time (in seconds) to complete each of the three workloads while running different configurations of Thin AV. The average is based on ten runs of each workload for each Thin AV configuration. Percentage overhead is the result of dividing the average running time of each configuration by the average running time of the base case.

The timing results in Table 4.9 show a high degree of similarity both between workloads and between scanning services. The Kaspersky service differs by less than ten seconds for the web and advanced workloads (194.9 and 186.2 seconds, respectively). The results from VirusChief are roughly an order of magnitude larger than the Kaspersky timing results, and the corresponding difference between the two workloads is less than one hundred seconds. This similarity is in spite of the order of magnitude difference between the web and advanced workloads in the base case (21.4 seconds versus 1.9 seconds).

The running times for the workloads when Thin AV was running with a primed cache are much smaller than the running times when active scanning was taking place. However, the difference between the web and advanced workloads is somewhat larger (59.0 seconds for the web workload, and 36.9 for the advanced workload). A marginal increase in running times of the web and combined workloads can be seen when only the Dazuko file system is mounted, while the advanced workload actually shows a very small decrease. The increase in running times is slightly more apparent when both DazukoFS and the file system access controller are active, with the web and advanced workloads showing increases of 0.4 and 0.9 seconds over the baseline running times. Not surprisingly, in all cases, the timing results for the combined workload are approximately the sum of the results for the constituent workloads. Additionally, the percentage overheads follow a similar trend to the absolute timing results.

4.2.3 Discussion

Not surprisingly, the fastest service, Kaspersky, showed the smallest overheads across the three workloads, whereas VirusChief, which had a much slower response time, showed the highest overheads. In general, it is clear that the range of possible overheads for Thin AV is extreme. Interestingly, despite the highly distinct file system access characteristics of the web and advanced workloads, the ultimate running times of these workloads with both Kaspersky and VirusChief are somewhat similar. However, when the overhead is considered instead of simply the total running time, the advanced workload appears particularly ill-suited to being scanned by Thin AV.

It is clear that the vast majority of the overhead incurred by Thin AV is caused by uploading files to, and scanning them with, the remote services. The overheads caused by Dazuko and the file system access controller are negligible, though it is possible that this overhead could be further reduced if Thin AV were implemented in a higher performance language such as C/C++ as opposed to Python. It is likely that the case where the advanced workload ran faster than the baseline in the Dazuko-only scenario is the result of measurement error due to the granularity of timing measurements recorded by the workload script.

Based on the response time results from Section 4.1, one could have predicted that VirusChief would not, by itself, make a practical anti-malware scanning tool. The timing results from this experiment bear out that prediction. Furthermore, Kaspersky, while dramatically faster than VirusChief, is still prohibitively slow when all the files being encountered are novel. However, it was to be expected that Thin AV would perform poorly when a cache of previously scanned files was not available.

The three workloads were intended to be indicative of general desktop usage for some users. However, they are very short, and do not contain the wait time that would be present if a user were manually executing the actions in the workloads. Therefore, it is very likely that the length of the workloads had a major impact on the overhead of Thin AV. Specifically, because none of the workloads involved a high degree of repetition, the effectiveness of the Thin AV cache was somewhat lessened. Given that the workload run times for the fully cached scenario are approaching a much more reasonable overhead, it is quite probable that as Thin AV is active for longer periods of time, the cache will grow in size, increasing the likelihood of a local cache-hit, and thus decreasing the overhead of Thin AV. This scenario will be examined in greater detail in Section 4.4.

4.3 Predicted System Overhead

Given that it is the highly-cached scenario that provides the most compelling argument for Thin AV, the performance of Thin AV in the long term (when a high degree of caching is possible) must be assessed. This section details the development and refinement of a simulator for Thin AV, which was developed using the timing results from Sections 4.1 and 4.2. Once the simulator was developed and its results were shown to be consistent with the results from Section 4.2.2, larger scale simulations could be run to better understand the performance impact of Thin AV.

4.3.1 Testing Protocol

In order to predict the overhead of Thin AV, two elements are necessary: a simulator and a workload. In order to accurately calibrate the simulator, the workloads from the previous section were used. Based on the file accesses in the workload, it is possible to calculate how much time would be spent by Thin AV to scan the files being accessed. By dividing the sum of the time spent scanning with Thin AV and the time spent on non-Thin AV activities by the time spent on non-Thin AV activities, the overhead imposed by Thin AV can be predicted.
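Expressed as a formula, with t_scan denoting the total time Thin AV spends scanning and t_other the time spent on non-Thin AV activities:

    overhead = (t_scan + t_other) / t_other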

In order to calculate the overhead of Thin AV, a simulator was developed. The simulator reads a list of file system events (e.g., access or modification of one or more files), and based on the sizes of the different files, calculates the time that Thin AV would take to run if it were actually scanning the files in the list. Using the workload scripts and the inotify monitoring application discussed in Section 4.2.1, a log of file system activity was created for each of the workload scripts, which was then used to drive the simulator. In addition to driving the simulator with a file system activity log, the simulator can also be driven by a series of statistical parameters; this will be discussed in greater detail in Section 4.4.1.

The function calculating the wait time for accessing a specific file can be based on the size of the file being accessed, or not, depending on what kind of Thin AV activity is being simulated. In the case of VirusChief and Kaspersky, the scanning time is calculated based on a pair of linear equations. The functions in Table 4.5 provided the starting point. In a series of successive manual iterations, the simulator was run on each of the workloads and scanning was simulated. By simulating Thin AV running off of strictly Kaspersky or VirusChief, each of the linear equations could be tuned until the running time predictions produced by the simulator matched the actual Thin AV running times measured in Section 4.2.2.

Because the performance of VirusTotal is so poor, it is not a suitable candidate for real-time file scanning. As such, when Thin AV is operating under the permissive security policy (Table 3.1), the file is only uploaded to VirusTotal. Then, assuming the file has not been previously scanned by VirusTotal, a “waiting” status is returned and Thin AV then allows access to the file. In order to simulate this behavior, the linear equation which describes the time required to upload a file to VirusTotal was used (without modification) to determine the wait time for VirusTotal (Table 4.5). This allowed the simulator to accurately predict the overhead of Thin AV when running with all three scanning services under the permissive security policy.

When the simulator is processing the list of file system events, each time a unique file is accessed for the first time, the wait time imposed by Thin AV is calculated, based on the size of the file being accessed and the specific scanning service being simulated. Once a file is accessed, it is added to a list of known files. Any future access of a known file only incurs a wait time of 0.0002 seconds. This simulates the caching behavior of Thin AV. The wait time for the cache was arrived at by measuring 10,000 successive cache hits by Thin AV and calculating the average access time. Finally, if a known file is modified, it is removed from the list of known files, and a subsequent access of the modified file will once again incur a wait time for the simulated scanning of that file.
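The following minimal sketch captures this event loop; wait_time stands in for the per-service linear equations, and the event format is an illustrative simplification:

    CACHE_HIT_COST = 0.0002  # seconds, measured over 10,000 cache hits

    def simulate(events, wait_time):
        known = set()      # files whose scan results are currently cached
        total_wait = 0.0
        for kind, path, size in events:
            if kind == "modify":
                known.discard(path)    # modification invalidates the entry
            elif kind == "access":
                if path in known:
                    total_wait += CACHE_HIT_COST
                else:
                    total_wait += wait_time(size)  # simulated upload and scan
                    known.add(path)
        return total_wait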

Kaspersky    f(x) = 8.85 × 10^-5 × x + 1.34
VirusChief   f(x) = 8.31 × 10^-5 × x + 16.37

Table 4.10: Linear equations for the Kaspersky and VirusChief scanning services, manually tuned from the functions in Table 4.5.

4.3.2 Results

Table 4.10 contains the linear equations that were finally settled on for the simulation of Kaspersky and VirusChief. These formulæ were arrived at after approximately twenty iterations of running the simulation and adjusting the slope and y-intercept of each equation.

Tables 4.11 and 4.12 show the results from simulating Thin AV using only one of either Kaspersky or VirusChief. The total running times of these simulations are compared with the actual running times from Section 4.2.2 in Tables 4.13 and 4.14. As can be seen in the latter two tables, the refinement of the wait-time functions was highly successful, with the largest discrepancy between the actual and simulated run-times being only 0.47%.

As was mentioned earlier, the web workload contains a handful of files that are significantly larger than the bulk of the files in either the web or advanced workloads. This can be quantified more precisely by examining the number of un-scanned accesses in each workload. Un-scanned accesses are accesses of files that are too large for the scanner to process (1 MB for Kaspersky, and 10 MB for VirusChief). Here we see that in the web workload, eight files are too large to be scanned by Kaspersky, but only three are too big to be scanned by VirusChief. In the advanced workload, all of the files being accessed are small enough for scanning with both Kaspersky and VirusChief. This implies that the vast majority of files are capable of being scanned by Thin AV, and that most of these files would be scanned by Kaspersky, the fastest of the three scanning services.

Kaspersky
                                     Web        Advanced   Combined
Mean scanned file size (KB)          39.02      17.10      27.23
Median scanned file size (KB)        5.13       1.45       3.67
Maximum scanned file size (KB)       512.00     250.71     512.00
Un-scanned accesses                  8 (6.9%)   0 (0.0%)   8 (2.3%)
Total size of uploaded files (MB)    3.24       1.85       5.08
Cache hit rate                       21.3%      54.69%     44.80%
Time for non-AV activities (sec.)    18.20      2.33       22.53
Time for AV scanning (sec.)          176.47     184.57     354.06
Total time (sec.)                    194.67     186.9      376.59
Overhead from AV                     969.61%    7921.40%   1571.52%

Table 4.11: Simulation results of the Kaspersky service for three different activity logs.

The low cache-hit rates that were alluded to in Section 4.2.3 can be seen in the simulation results. The highest cache hit rate occurs in the advanced workload and is only 54%, while the lowest cache hit rate is just over 20% in the web workload. In spite of this, the runtimes for the web and advanced workloads are quite similar for both Kaspersky and VirusChief.

4.3.3 Discussion

The key finding from this experiment was that it was possible to predict, very accurately, the running time of Thin AV when using Kaspersky and VirusChief. Although a simulation could be run which predicts the running time of Thin AV using VirusTotal, it has already been established that VirusTotal is impractical for real-time scanning. Additionally, because no actual overhead measurements were made for VirusTotal, there would be no way of verifying the accuracy of the VirusTotal overhead predicted by the simulator. Furthermore, because the scanner priority for Thin AV during normal operation is Kaspersky, followed by VirusChief, then VirusTotal, it is somewhat less important to characterize the performance of VirusTotal: only files between ten and twenty megabytes will be uploaded to VirusTotal.

VirusChief
                                     Web        Advanced    Combined
Mean file size scanned (KB)          125.08     17.10       67.05
Median file size scanned (KB)        5.13       1.45        3.99
Maximum file size scanned (KB)       2205.19    250.71      2205.19
Un-scanned accesses                  3 (2.6%)   0 (0%)      3 (0.8%)
Total size of uploaded files (MB)    10.99      1.85        12.83
Cache hit rate                       20.35%     54.69%      44.16%
Time for non-AV activities (sec.)    18.20      2.33        22.53
Time for AV scanning (sec.)          1764.20    1866.14     3548.12
Total time (sec.)                    1782.40    1868.47     3570.65
Overhead from AV                     9693.41%   80092.03%   15748.44%

Table 4.12: Simulation results of the VirusChief service for three different activity logs.

Kaspersky
Workload    Running Time (s)   Simulated Time (s)   Difference
Web         194.90             194.67               -0.12%
Advanced    186.20             186.90               0.38%
Combined    376.30             376.59               0.08%

Table 4.13: Comparison of running time and simulation results for each of the workloads for the Kaspersky service.

VirusChief
Workload    Running Time (s)   Simulated Time (s)   Difference
Web         1786.30            1782.40              -0.22%
Advanced    1864.20            1868.50              0.23%
Combined    3587.63            3570.70              -0.47%

Table 4.14: Comparison of running time and simulation results for each of the workloads for the VirusChief service.

By examining the file system access characteristics (Table 4.8) and the simulation results (Tables 4.11 and 4.12), it is apparent that this is not a common occurrence. This assertion is also corroborated by the file system traces from [90], [96], and [28].

Although it may have been possible to further increase the accuracy of the simulator, either by more manual tuning of the wait-time functions or through an optimization algorithm, the differences between the simulated and actual results are sufficiently low as to afford a strong degree of confidence in the results produced by the simulator. It should also be noted that when tuning the wait-time functions, the goal was to approximate the total running time, and not the system overhead. This is because the system overhead can change dramatically as a result of minor fluctuations in the base case run times. Because the overhead is calculated by dividing the total running time by the base case running time, small changes in such a comparatively small divisor can result in a dramatic change in the resulting overhead. The advanced workload is an obvious example of this.

The promising finding from these results is that the workloads studied up to this point only produce a low to modest degree of caching in Thin AV, nowhere near the 98% cache hit rates reported by [82]. This means it is quite possible that the overhead of Thin AV can be reduced to a more acceptable level in the long term, when a greater degree of caching is possible. This scenario will be examined in the upcoming and final phase of the evaluation of the desktop implementation of Thin AV.

4.4 Large Scale System Simulations

Given that the overhead values produced by the Thin AV simulator accurately predict the actual running time of Thin AV, it is possible to examine different patterns of file access when using Thin AV, and see if there are conditions under which Thin AV particularly excels, or falls short. Although it may have been desirable to collect a set of actual usage patterns for exploring the behavior of Thin AV, using the simulator to study different scenarios is preferable for three key reasons. First, it allows for the examination of a wider range of longer running, high-activity file system access traces, in much less time than it would take to actually perform such traces on the implementation of Thin AV. Second, it allows for the examination of traces with very specific characteristics (file sizes, number of unique files, number of modifications, etc.); devising a series of actual workloads which would generate this desired level of activity would be extremely onerous. Finally, despite a concerted search in the areas of software engineering, human-computer interaction, and hardware systems research, it was not possible to locate a precedent for how to characterize “typical” user behavior upon which to base an actual large-scale workload. This is quite likely due to the fact that the term “typical” itself lacks a consistent definition when examining a wide cross-section of users.

4.4.1 Testing Protocol

The large-scale testing of Thin AV was done using the simulator described in Section 4.3.1. However, instead of using file activity traces produced by inotify, the activity was generated by a collection of probability distributions. A collection of key parameters controls the general characteristics of the file system activity generated by the simulator. The number of file system events controls the total number of file accesses and modifications that will occur in the simulation. The proportion of modifications specifies what proportion of events will be modifications versus accesses. The number of unique files specifies how many different files will be accessed throughout the lifetime of the simulation; a unique file can be thought of as the absolute path of a file (i.e., modifying a file does not make it a new unique file).

Other key simulation parameters, specifically file size and time between events, are drawn from exponential distributions. This distribution was chosen for file size generation because it closely fits the distribution of file sizes measured in [28] and [90]. Figure 4.5 provides an example of file sizes generated by this distribution. The distribution can be shifted left or right (i.e., towards smaller or larger files) as needed, depending on the parameter provided to the exponential number generator. The file sizes generated are bounded by a constant minimum and maximum; this was necessary to prevent the rare occurrence where an extremely large file size was generated that would overly skew the mean file size in the simulation. The exponential distribution was also chosen for generating the time between file events. This has the effect of producing activity that is “bursty”, with periods of high activity (small times between events) broken up by less frequent periods of inactivity (long times between events).
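A minimal sketch of such a generator is shown below; the parameter names and size bounds are illustrative assumptions, not the exact values used in the experiments:

    import random

    MIN_SIZE, MAX_SIZE = 16, 64 * 1024 * 1024  # clamp rare, extreme sizes

    def generate_trace(n_events, n_files, p_modify, mean_size, mean_gap):
        # Each unique file gets an exponentially distributed, bounded size.
        size = {i: min(max(int(random.expovariate(1.0 / mean_size)),
                           MIN_SIZE), MAX_SIZE)
                for i in range(n_files)}
        t, trace = 0.0, []
        for _ in range(n_events):
            t += random.expovariate(1.0 / mean_gap)  # bursty inter-event gaps
            kind = "modify" if random.random() < p_modify else "access"
            f = random.randrange(n_files)            # one of the unique files
            trace.append((t, kind, "file%d" % f, size[f]))
        return trace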

Ultimately the activity “log” produced in this way is in the same format as that produced by the inotify monitoring application used in Section 4.3.1. As such, it can be fed into the same simulation program that was used earlier, and the behavior of Thin AV for the simulated log can be characterized. The key simulation parameters were examined in turn by modifying the experimental parameter while holding all other parameters constant. Each simulation was run ten times and the average results were recorded. It should be noted that in performing these experiments, some of the simulations were run with activity logs that would not be likely to occur during any sort of “normal” file system activity. However, these results were included for the purpose of illuminating trends in the relationships between the characteristics of file system activity and the performance of Thin AV.

4.4.2 Results

The result of each experiment will be discussed in turn. For the sake of space, figures will only show relevant relationships between the independent and dependent variables.

Figure 4.5: Example CDF of simulated files by size.

Simulation input and output values that did not change for each experiment can be found in Tables A.1 through A.5 in Appendix A.

The first simulations examined the relationship between the number of novel files in an activity log and the performance of Thin AV. Figure 4.6 shows a strong positive relationship between the number of novel files and the overhead incurred by Thin AV. When only one in ten accesses involves a new file, Thin AV already shows a 5000% overhead, and this number increases almost eight-fold as the proportion of accesses involving new files moves to 1. Unless specifically mentioned, the simulated activity logs did not include any file modification events. This was done to prevent unnecessarily complicating the simulation results, as well as to help clarify the trends in the data.

An analogous trend can be seen when the number of access events is manipulated (Figure 4.7). The Thin AV overhead drops off dramatically as the number of system accesses increases. The cache hit rates follow similar trends in both Figures 4.6 and 4.7: as the ratio between the number of unique files and the number of accesses approaches one-to-one, the frequency of cache hits drops off precipitously.

The next series of simulations were designed to show the impact of file size on Thin AV performance.

Figure 4.6: Proportion of accesses which involved a unique (uncached) file (log10) versus Thin AV induced overhead (left axis), and Thin AV cache hit rate (i.e., the chance that an access was serviced by the cache versus online scanning, right axis). See Table A.1 for further details.

Figure 4.7: Number of file system accesses (log10) versus Thin AV induced overhead (left axis), and Thin AV cache hit rate (i.e., the chance that an access was serviced by the cache versus online scanning, right axis). See Table A.2 for further details.

Figure 4.8: File size in bytes (log10) versus Thin AV induced overhead (left axis), and Thin AV cache hit rate (i.e., the chance that an access was serviced by the cache versus online scanning, right axis). See Table A.3 for further details.

By changing the mean of the file size distribution, it can be shown that there is a distinct peak in the overhead of Thin AV. When the bulk of the files are very small (i.e., one hundred bytes or less), the overhead is negligible; then, as the average file size increases, so too does the overhead (Figure 4.8). This trend peaks around the 10 MB mark, at which point the system overhead begins to decline. At the same time, changing the average file size shows little impact on the cache hit rate. The reasons for this behavior will be discussed in Section 4.4.3.

Figure 4.9 shows the proportion of files which are scanned by each of the three scanning services as the mean file size changes. Recall that the scanner priority for Thin AV is Kaspersky, VirusChief, then VirusTotal. This is evident in the figure: the mean file size starts off small, and as such, all files are scanned by Kaspersky. Gradually, as the average file size increases, VirusChief and VirusTotal each become more prevalent. At the same time, the number of files which are too large to be scanned by any of the services begins to increase, slowly at first, but with a sharp increase at the 10 MB mark.

Figure 4.9: Mean file size in bytes (log10) versus the proportion of accesses scanned by each scanning service, and the proportion of accesses un-scanned. See Table A.3 for further details.

Figure 4.10: Proportion of file system events which modify files versus the Thin AV induced overhead, Thin AV cache hit rate (left axis), and average time between file accesses (right axis). See Table A.4 for further details.

Figure 4.11: Average time between file accesses in seconds, versus Thin AV induced overhead (both axes are log10 scale). See Table A.5 for further details.

Next, the impact of file modifications was examined. Here, the probability that controls whether a file event is a modification or an access was adjusted from zero to one, and the resulting impact on Thin AV behavior was recorded (Figure 4.10). The most direct relationship is the inverse relationship between the modification rate and the cache hit rate. Additionally, there is an overall decrease in Thin AV overhead as the file modification rate increases.

Finally, there exists a strong (non-linear) positive relationship between the modification rate and the time between file accesses. When the time between file accesses was manipulated directly, a very clear inverse linear relationship emerged between the mean inter-access time of a file trace and the overhead incurred by Thin AV (Figure 4.11).

4.4.3 Discussion

That the trends in Figures 4.6 and 4.7 are very similar suggests that one of the key variables in predicting Thin AV overhead is not the number of accesses or the number of unique files per se, but rather the ratio between the number of unique files and the number of accesses. As that ratio approaches one-to-one, the performance of Thin AV drops off dramatically. The reason for this is intuitive: as the ratio approaches one-to-one, most accesses will involve files that are not in the cache, and therefore must be uploaded and scanned. As a result, the Thin AV cache becomes increasingly ineffective.

When it comes to the size of the files being scanned, the overhead of Thin AV has much more to do with the specific speeds at which the individual scanning services can return a scan result. Figures 4.8 and 4.9 show that the overhead from Thin AV is relatively low when the file sizes being scanned are very small. This is because all of the files are being scanned with Kaspersky, the fastest of the three scanning services. As the file size increases, there is a large peak as VirusChief (a much slower scanner) becomes the dominant scanning engine. Ultimately, however, as the file size exceeds the 10 MB limit imposed by VirusChief, the overhead decreases, because VirusTotal simply accepts the upload and returns a “waiting” response to Thin AV, which is much faster than continually polling for a response. Finally, as the mean file size moves past the 20 MB mark, the Thin AV overhead drops to near zero, as none of the scanning services are capable of servicing the scan request, and the file access is permitted without scanning.

The results in Figure 4.10 are somewhat misleading. One might conclude that as the file modification rate increases, so does the performance of Thin AV. This trend is in spite of the fact that the cache hit rate also appears to be decreasing at the same time. In reality, it is the time between file accesses that is actually impacting the performance of Thin AV. As the proportion of file modifications increases, this naturally creates an increasing gap between file access events. Because Thin AV only scans on file accesses, this has the effect of reducing the overall amount of scanning that needs to be performed. This reduction in the need for scanning goes much farther towards reducing the overhead of Thin AV than any reduction in performance that occurs from the reduced effectiveness of the cache. This extremely tight relationship between inter-access time and system overhead can be seen in Figure 4.11.

In summary, these simulations have pointed to three key elements that heavily impact the performance of Thin AV. First, as the ratio between the number of unique files and the total number of file accesses approaches one-to-one, performance decreases. Next, when the mean file size falls into the 10 MB to 20 MB range, performance significantly decreases. Finally, as the gap between accesses increases, either due to a lack of activity or an increase in the frequency of file modifications, the overhead incurred by Thin AV decreases.

Given the large range of possible overheads and the variety of factors that influence those overheads, it is possible that a deployment scenario exists that would make Thin AV a practical anti-malware tool. First and foremost, such a system would require a substantial pool of dedicated resources on the server side of the transaction. The dramatic speed difference between scanning with Kaspersky versus the other services shows what is possible even on an unregulated, freely available resource. If a company were to offer a dedicated scanning service (either as a standalone offering, or as part of a larger suite of anti-malware products), it would likely be possible to achieve even faster scanning speeds than were displayed by Kaspersky. This would be because such a service would have a smaller user base, and if those users were paying for the service, there would be a greater incentive for the provider to ensure the service had adequate resources. If this service were paired with a fast server-side cache similar to VirusTotal's, the performance could be improved even more.

Unfortunately, it is not possible to define a relationship between the number of scanning engines running on a service and the performance of that service. Intuitively, it follows that more scanning engines would result in longer scanning times. This could be consistent with the performance differences between Kaspersky, VirusChief and VirusTotal, which have 1, 13 and 42 different scanning engines, respectively. However, without knowing what kind of hardware is underpinning these services, it is not possible to accurately gauge the relationship between the number of scanning engines and performance. The performance of CloudAV gives some indication of what might be possible in terms of performance when hardware constraints are eliminated [82]. In general, it is likely safe to assume that the best possible performance would be achieved on a system that only used a single scanning engine. As such, an ideal deployment of Thin AV should be limited to a single high performance scanning engine, until it can be established that additional engines can be added without a significant performance penalty.

The ideal deployment on the client computer is a somewhat more challenging issue, as it is largely out of the hands of the service provider. However, based on the file inter-access time results and the characteristics of the web and advanced workloads, it is possible to conclude that Thin AV is more conducive to systems that are typically used for casual internet activities as opposed to more developer-oriented activities. Beyond that, users could further improve performance by running Thin AV with the passive security policy (Table 3.1). This would offer a large performance boost, but would come at a significant price. Specifically, files infected with malware would be allowed to execute, and users would only be notified of the infection after the fact. Due to the extremely lax security guarantees offered under this policy, and the fact that Thin AV does not offer a mechanism for malware removal, this trade-off does not seem equitable. For this reason, the performance overhead of this scenario was not studied.

Chapter 5

System Evaluation - Mobile Thin AV

The evaluation of the mobile version of Thin AV was considerably more challenging than the evaluation of the desktop version. There are several reasons for this. There is no established library of Android malware for use by researchers. Porter Felt et al. have collected information on 18 malicious Android apps circulating in the wild [60], but their study did not involve the collection of actual malware samples. Therefore, evaluation of Thin AV was done with a collection of apps downloaded from the official Google Android market. This data set will be described in greater detail in Section 5.1. However, without an Android malware data set it was not possible to fully gauge the effectiveness of the third-party scanning engines when scanning Android malware. It was determined that several of the scanning engines used by Thin AV do, in fact, detect Android malware, and this issue will be discussed in Section 5.2.

All development and testing was done on the Android emulator provided in the SDK. This was because it allowed for rapid development on different versions of the Android operating system, and it allowed for changes to be made to the Android source code. While it may have been possible to meet these requirements on a rooted Android device, it would have been a significant gamble as to whether or not compatibility issues would have arisen, given that virtually all commercially available Android devices run a version of Android that has been modified by the device manufacturer, such as Samsung's TouchWiz or HTC's Sense UI. Further discussion on the performance of the Android emulator can be found in Section 5.3.

Working on the emulator also presents evaluation issues with respect to network performance. Because Thin AV is heavily reliant on the network, the connection speed can greatly impact the performance of Thin AV. On a mobile device like a cell phone, the speed of the cellular connection can be impacted by the location of the user, radio interference, the load on the cellular network, as well as other factors. Because of the challenges involved with network measurements, previous research results from Gass and Diot, comparing the speed of cellular and WiFi networks, were used instead [62].

The remainder of this chapter is laid out as follows. Section 5.4 will discuss the evaluation of the ComDroid scanning module. Section 5.5 will provide an evaluation of the best and worst case performance of the Thin AV safe installer. Finally, Section 5.6 will conclude with an analysis of the cost of running the Thin AV killswitch on an ongoing basis. Where relevant, the following sections are subdivided into the same protocol-results-discussion format used in Chapter 4.

5.1 Data Set

The data set used for testing consists of 1,022 apps downloaded from Google's Android market. To download the apps, a program was created which made use of the unofficial market API [9]. The API provided a means of retrieving the asset IDs of the packages to be downloaded. These asset IDs were then combined with a set of valid Google account credentials in order to download the actual packages. An attempt was made to download the top fifty free apps in each application category, as ranked by user votes on January 3, 2012. The majority of package downloads were successful, with 28 downloads causing repeated failures. This resulted in 1,022 apps spread across 21 application categories, with each category having between 46 and 50 packages.

Table 5.1 summarizes the key file size statistics of the data set, while Figure 5.1 shows the median file size for the apps broken down by application category.

Number of Apps               1022
Mean App Size                2.65 MB
Median App Size              1.78 MB
Minimum App Size             0.02 MB
Maximum App Size             37.06 MB
Proportion of Apps <1 MB     34.64%
Proportion of Apps <10 MB    97.16%
Proportion of Apps <20 MB    99.51%

Table 5.1: General file size characteristics of the Android test data set.

Figure 5.1: Median file size of the Android test data set packages for each Google Market application category.

5.2 Malware Detection

The entire collection of Android packages was uploaded to the VirusTotal scanning service. This was done for several reasons: first, it would show whether or not any of the 42 scanning engines used by VirusTotal detected any malware in the data set. Second, it had the potential to show how many of the engines were capable of even detecting Android malware. Third, because VirusTotal includes the Kaspersky scanning engine, and most of the scanning engines in VirusChief, it would have the effect of testing the data set on the scanning engines used by the other two third-party scanning services as well. Finally, if VirusTotal was capable of detecting malicious Android apps, it would be the most preferable of the three scanning services for the mobile implementation of Thin AV. This is because VirusTotal's slow response time is considerably less of an issue on Android: the set of possible inputs (packages downloaded from various markets) is tiny and relatively predictable in comparison to the near-infinite array of files that might be seen in the desktop implementation. This allows for the possibility of priming VirusTotal with packages from various markets. The details of this deployment scenario will be expanded upon in Section 6.2.

In a very surprising result, VirusTotal flagged several possible instances of malware in the data set downloaded from the Google market. Of the 1,022 apps uploaded, 1,019 were scanned (with three being skipped due to size restrictions). Of the 1,019 scanned packages, 27 were flagged as malware by at least one scanning engine. One package was flagged as malware by four different engines, nine packages were flagged by two engines, and the remaining seventeen packages were flagged as malware by a single engine. Table 5.2 provides details on some of the commonly flagged samples. The most commonly identified sample was from the Adware.Airpush family. However, the majority of these samples were identified by a single scanning engine (DrWeb), which raises the possibility of this being a false positive. The next most common sample was Plankton, which was identified by a variety of scanning engines. The remaining malware samples had far fewer occurrences in the data set.

Sample Name             Malware Type   Occurrences   Detection Engine(s)
Adware.Airpush (2, 3)   Adware         15            DrWeb, Kaspersky
Plankton (A, D, G)      Trojan         6             Kaspersky, Comodo, NOD32, TrendMicro
SmsSend (151, 261)      Dialer         2             DrWeb
Rootcager               Trojan         2             Symantec

Table 5.2: Most frequent samples of malware detected in the Google Market data set. Detection engine refers to which VirusTotal scanning engines detected the sample.

While Google has itself admitted to finding malicious apps in its market [40], it was very surprising to find numerous possible instances of malware in Google's official Android market. It is even more surprising considering that these apps were selected for the data set because they were among the fifty most popular apps in their respective categories on the day they were downloaded. This only serves to reinforce the problem presented by mobile malware: if the official Google Market can fall victim to this issue, it is worrying to consider the prevalence of malware in third-party markets.

The detection of malware in the data set shows that Thin AV, in its current form, can take advantage of existing third-party scanning services to prevent the installation of malware on Android devices. While the test data set only suggested that 6 of the AV engines used by VirusTotal are capable of detecting Android malware, follow-up research showed that as many as 26, or more than half of the scanning engines in VirusTotal, are capable of detecting some form of Android-specific malware [27].

5.3 Emulator Performance

As mentioned above, all development and evaluation of Thin AV was done on the Android emulator. In order to provide context for the performance results taken from the emulator, it was necessary to assess the performance of the Android emulator as compared to a physical Android device. The Java-based numerical benchmark SciMark [17] was ported to run on an Android device. The benchmark consists of five CPU-bound tasks: Fast Fourier Transforms (FFT), Jacobi Successive Over-relaxation (SOR), Monte Carlo integration (MC), sparse matrix-multiply (Sparse), and dense LU matrix factorization (LU). The specific details of each test can be found in [18]. The benchmark was then modified to test the speed of sequential reads and writes. This was necessary to include because the emulator uses the RAM and hard disk of the host device for storage, whereas Android devices use flash memory.

The benchmark was run on the Android emulator as well as three different physical Android devices. Table 5.3 shows the results of the benchmark for each device. It is clear from these results that any performance testing done on the Android emulator will represent a lower bound on the performance of a production deployment of Thin AV. In general, the Android emulator running on a modern computer is about an order of magnitude slower than the same operation computed on a contemporary Android device.

                Emulator   Samsung Galaxy S         HTC Desire             HTC Evo 3D
Hardware        See 3.2    Samsung Exynos 3110      Qualcomm QSD8250       Qualcomm MSM8660
                           ARM Cortex A8 @ 1 GHz,   Snapdragon @ 1 GHz,    Dual-Core @ 1.2 GHz,
                           512 MB RAM               576 MB RAM             1 GB RAM
OS Version      2.3.3      2.3.3                    2.2                    2.3.4
Average Score   3.039      19.457                   37.909                 41.771
FFT Score       1.750      6.357                    7.970                  5.433
SOR Score       4.584      12.542                   32.129                 35.406
MC Score        0.656      32.825                   72.464                 81.101
Sparse Score    4.290      15.398                   31.019                 26.114
LU Score        3.916      30.164                   45.965                 60.800
Write (MB/s)    11.062     9.597                    12.438                 37.879
Read (MB/s)     19.802     112.36                   121.951                192.308

Table 5.3: Comparison of benchmark scores for the Android SDK emulator and three different physical Android devices. The benchmark consists of five CPU-bound tasks: Fast Fourier Transforms (FFT), Jacobi Successive Over-relaxation (SOR), Monte Carlo integration (MC), Sparse matrix-multiply (Sparse), and dense LU matrix factorization (LU). The benchmark also measures the mean speed of sequential reading and writing from flash memory. For all benchmarks, higher values are better.

5.4 ComDroid Evaluation

The ComDroid scanning service was added to Thin AV both to further demonstrate the modularity and extensibility of the Thin AV architecture, and to add a scanning service that was specifically targeted at Android applications. This section discusses the evaluation of the ComDroid module.

ComDroid   f(x) = 0.0132 × x + 9.6893

Table 5.4: Linear equation for the ComDroid scanning service, estimating the response time in seconds for a package x kilobytes in size (cf. Figure 5.2).

5.4.1 Testing Protocol

The ComDroid module was tested in a manner somewhat similar to the other three scanning modules evaluated in Chapter 4. All 1,022 of the apps described in Section 5.1 were uploaded to the ComDroid scanning service. Roughly half of the uploads were performed on January 26, with the remainder being uploaded on January 27, 2012. For each upload, the time required for ComDroid to return a result was recorded; additionally, the scan report for each of the applications was saved.

5.4.2 Results

Of the 1,022 packages uploaded to ComDroid, 993 were scanned, with the remainder being rejected by the server due to a 10 MB size limitation. Of the 993 packages scanned by ComDroid, 8 returned a scan error, resulting in 985 valid scan results. The mean response time was 40.67 seconds (σ = 77.60 seconds), and the median response time was 18.63 seconds. Figure 5.2 shows the response time plotted as a function of the package size, and the exact function is specified in Table 5.4. It is clear there is some positive linear relationship between package size and scan time, although numerous outliers are apparent.

The vast majority of packages, 971 of 985 (or 98.6%), show some potential for exposed communication. Table 5.5 provides a summary of the exposed communication found within the testing data set.

Figure 5.2: Response time of the ComDroid service as a function of package size.

Type of Warning                                Packages      Occurrences   Average
Action Misuse                                  331 (33.6%)   5640          17.0
Possible Activity Hijacking                    961 (97.6%)   14671         15.3
Possible Malicious Activity Launch             481 (48.8%)   2200          4.6
Possible Broadcast Theft                       501 (50.9%)   4630          9.2
Possible Broadcast Injection                   613 (62.2%)   3703          6.0
Possible Service Hijacking                     261 (26.5%)   980           3.8
Possible Malicious Service Launch              167 (17.0%)   315           1.9
Protected System Broadcast w/o Action Check    108 (11.0%)   134           1.2

Table 5.5: Breakdown of exposed communication found by ComDroid in the testing data set. The packages column refers to the number of applications with at least one instance of a given warning type. The occurrences column refers to the total number of potentially exploitable attack surfaces that exist for a given warning type within the entire data set. The average column is the average number of occurrences per package. For a complete explanation of the types of warnings see [45].

5.4.3 Discussion

The performance of the ComDroid service is somewhat similar to both the Kaspersky and VirusChief services (Section 4.1.2); however, the linear trend is much less prominent. The most likely explanation lies in the nature of the analysis performed by ComDroid. ComDroid is a static code analysis tool, and as such, the time required to analyze an Android app depends more on the amount of code in the package than on the total size of the package. Given that many apps contain numerous resource files (images, sounds, video, etc.) which ComDroid does not scan, a package can be large yet contain relatively little code, resulting in a much faster scan than for an app with a large amount of code and few resource files. The observed linear trend is therefore most likely a result of upload time rather than the code-to-resource-file ratio of the package.

The prevalence of exposed communication within the data set is very high, with less than 3% of packages reporting no warnings. Interestingly, all values in the packages column of Table 5.5 are within 10% of the values reported by Chin et al. in [45], suggesting that their initial findings were fairly representative of a larger data set.

The pervasiveness of the programming errors detected by ComDroid suggests that, in its current form, simply flagging an application as "at risk" whenever it exhibits any instance of exposed communication would be overkill; it would effectively cripple the ability of users to install apps on their devices. As was pointed out in [45], a manual inspection of a subset of warnings found that only about 10-15% of warnings were genuine vulnerabilities. This suggests that there is a place for ComDroid in the Thin AV architecture, but that the behavior of this Thin AV module would likely have to be adjusted over time to prevent excessive false positives. This could be done by creating thresholds that flag a package as vulnerable only if it has significantly more exposed surfaces than average for a given type of warning, as sketched below.
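A minimal sketch of such a thresholding scheme follows, using the per-package averages from Table 5.5 as the baseline. The threshold multiplier is purely illustrative; choosing it well would require the kind of tuning described above.

    # Average occurrences per package for each warning type (Table 5.5).
    AVERAGE_OCCURRENCES = {
        "Action Misuse": 17.0,
        "Possible Activity Hijacking": 15.3,
        "Possible Malicious Activity Launch": 4.6,
        "Possible Broadcast Theft": 9.2,
        "Possible Broadcast Injection": 6.0,
        "Possible Service Hijacking": 3.8,
        "Possible Malicious Service Launch": 1.9,
        "Protected System Broadcast w/o Action Check": 1.2,
    }

    def flag_package(warning_counts, factor=2.0):
        """Return the warning types for which a package has significantly
        more exposed surfaces than the data set average.  warning_counts
        maps warning type -> occurrences in this package; unknown warning
        types are never flagged."""
        return [wtype for wtype, count in warning_counts.items()
                if count > factor * AVERAGE_OCCURRENCES.get(wtype, float("inf"))]

    # Example: a package with 40 activity-hijacking warnings is flagged.
    print(flag_package({"Possible Activity Hijacking": 40}))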

Network Configuration | Upload Speed (KBps) | Download Speed (KBps)
Typical 3G | 16.25 | 84.13
Ideal 3G | 1792.00 | 1792.00
Typical WiFi | 190.38 | 155.38
Ideal WiFi | 76800.00 | 76800.00

Table 5.6: Network speeds used for evaluating the mobile implementation of Thin AV.

5.5 Safe Installer Performance

The first line of defense provided by Thin AV is the safe installer, which checks for malicious apps at install time. The performance of the safe installer depends on three factors: the size of the package being scanned, the speed of the network to which the device is connected, and whether or not the package being installed has already been scanned by Thin AV.

For the purposes of evaluating the safe installer, three different file sizes (small, medium, and large) were chosen: 0.76 MB, 1.78 MB, and 3.56 MB, corresponding to the median size of apps in the category with the smallest median size (medical apps), the median size for the entire data set, and the median size of apps in the category with the largest median size (educational apps).

Additionally, four different network configurations will be examined. These will be referred to as “Ideal 3G”, “Typical 3G”, “Ideal WiFi”, and “Typical WiFi”. The speeds for each of these configurations (listed in Table 5.6) have been taken from [62] and [67].

The best case scenario for the performance of the safe installer is one in which the package being installed has already been scanned by Thin AV. There are several reasons why this could occur, and they will be discussed in Chapter 6. In this case, the cost for performing an install time check is equal to the time required to hash the installing application, send the hash to Thin AV, look up the scan result, and return the scan result.

Network Configuration | Small File | Medium File | Large File
Ideal 3G | 0.034 s | 0.232 s | 0.293 s
Typical 3G | 0.041 s | 0.239 s | 0.300 s
Ideal WiFi | 0.034 s | 0.231 s | 0.293 s
Typical WiFi | 0.035 s | 0.233 s | 0.294 s

Table 5.7: Time required to check a package in Thin AV, for three different file sizes and four different network configurations, assuming the scan result is already cached by Thin AV.

The time required to hash a small, medium, and large application on the Android emulator was measured, and the average of five runs was taken for each size. The small file took 0.033 seconds to hash, the medium file took 0.231 seconds, and the large file took 0.293 seconds. The total amount of data uploaded and downloaded for transmitting the hash and receiving the result was also recorded: approximately 200 bytes (100 up, 100 down), although this amount varied slightly with the file being scanned. Finally, the cost of Thin AV performing a cache lookup was examined in Section 4.3.1, and so here too the cost of a lookup from cache will be taken to be 0.0002 seconds. Table 5.7 summarizes the results for this best case scenario. In general, even the largest file over the slowest network takes only 0.3 seconds to check with Thin AV.
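This best case check time can be reproduced with a short calculation. The sketch below combines the network speeds of Table 5.6, the measured hash times, the roughly 100 bytes sent in each direction, and the 0.0002 second cache lookup; the KBps-to-bytes conversion factor of 1024 is an assumption, so the results agree with Table 5.7 only to within about a millisecond.

    # Network speeds from Table 5.6, converted to bytes per second.
    SPEEDS = {  # (upload, download)
        "Typical 3G":   (16.25 * 1024,    84.13 * 1024),
        "Ideal 3G":     (1792.00 * 1024,  1792.00 * 1024),
        "Typical WiFi": (190.38 * 1024,   155.38 * 1024),
        "Ideal WiFi":   (76800.00 * 1024, 76800.00 * 1024),
    }

    CACHE_LOOKUP = 0.0002  # seconds, from Section 4.3.1
    HASH_BYTES = 100       # approximate bytes sent and received per check

    def cached_check_time(hash_time, network):
        """Best case install-time check: hash the package, send the hash,
        look up the cached verdict, and download the result."""
        up, down = SPEEDS[network]
        return hash_time + HASH_BYTES / up + CACHE_LOOKUP + HASH_BYTES / down

    # Example: the small (0.76 MB) file, which took 0.033 s to hash.
    print(round(cached_check_time(0.033, "Typical 3G"), 3))  # roughly 0.04 s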

The worst case scenario is one in which the application being installed has not been scanned by Thin AV, and the whole package must be uploaded to Thin AV, which must then upload the package to one or more of the third-party scanning services. Using the formulæ in Tables 4.10 and 5.4, and the file sizes and network speeds above, it is possible to compute the time required to upload and scan these files at install time. Because the time spent uploading and scanning a file will dwarf the time required to hash an application and upload that hash, these costs will not be included in the calculation.

It should be noted that when calculating the time required to scan a package, the time to scan with the appropriate anti-virus scanning service and the time required for scanning with ComDroid must be added together. This is because, as currently configured, ComDroid is run in addition to the anti-virus scanner appropriate for the size of the package. The performance drawbacks of this configuration are obvious; however, it does mean that the results presented in this section represent a highly conservative estimate of possible Thin AV performance in a production deployment.

Network Configuration | Small File | Medium File | Large File
Ideal 3G | 36.56 s | 98.13 s | 170.29 s
Typical 3G | 84.66 s | 210.00 s | 394.14 s
Ideal WiFi | 36.13 s | 97.14 s | 168.31 s
Typical WiFi | 40.23 s | 106.68 s | 187.39 s

Table 5.8: Time required to check a package in Thin AV, for three different file sizes and four different network configurations, assuming the scan result is not cached by Thin AV.

Table 5.8 summarizes the results for this worst case scenario. In general, the time required to upload and scan an Android package ranges from a low of 36 seconds to a high of almost 400 seconds, depending on the size of the file and the speed of the network.

The best case scenario, where Thin AV already has a cached scan result, is extremely fast: at 0.3 seconds, this check would be unnoticeable to a user. On the other hand, if the file needs to be uploaded and scanned, the process could take as long as 400 seconds, or almost seven minutes. This could be a serious inconvenience to the user, but considering that this check would only take place when a user installs an unknown app, it is not likely to be a frequent occurrence. Additionally, given that Thin AV could be primed with packages from a variety of sources, including regular downloads of applications from various application markets, uploads of applications by developers, and uploads of applications by other users running Thin AV, the chance that a user would have to upload a package for scanning at install time could be made very rare. So, while the worst case scenario is not ideal, it is not likely to be frequent.

Finally, while not a specific performance test, an end-to-end functionality test was run in which the Thin AV safe installer correctly blocked the installation of an app from the testing data set which was flagged as malware by VirusTotal.

5.6 Killswitch Cost

During normal operation, the most frequently used functionality of Thin AV would typically be the killswitch service, which is periodically activated and checks for revoked apps. To evaluate the performance of the killswitch, several factors must be examined: the cost of hashing apps to generate a system fingerprint, the network cost associated with uploading the fingerprint, the cost of looking up the hashes in Thin AV, and the network cost associated with returning those hashes to the client. The last and most costly aspect of the killswitch is the manual upload feature, because this is the only time the killswitch should incur any cost for scanning a package; it is assumed that any other packages will already have been scanned by Thin AV when they were uploaded by the safe installer at install time.

This section will examine the cost of the killswitch under normal operation, as well as the cost of manually uploading missing packages. Normal operation will be assessed in two parts: the cost of generating a system fingerprint, and the cost of sending the fingerprint and receiving the server's response. The cost of Thin AV performing a cache lookup will again be taken to be 0.0002 seconds per cryptographic hash.

In general, the time required for the killswitch to perform a check for revoked apps will be:

Time = hashing time + (hash upload size / upload speed) + cache lookup time + (response download size / download speed)   (5.1)

Because the cost of performing a manual upload of missing packages is dominated by upload and scanning costs (similar to the safe installer above), only these costs will be included in the calculation. The time required for the killswitch to manually upload missing packages will be:

Time = (package upload size / upload speed) + scanning time   (5.2)

Similar to the safe installer, the time spent scanning is the sum of the time spent scanning with the appropriate anti-virus service and the time spent scanning with ComDroid.
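Expressed as code, Equations 5.1 and 5.2 amount to the following sketch. The function and parameter names are illustrative; the per-hash lookup cost defaults to the 0.0002 seconds measured in Section 4.3.1.

    def fingerprint_check_time(hashing_time, upload_bytes, response_bytes,
                               num_hashes, up_speed, down_speed,
                               lookup_time=0.0002):
        """Equation 5.1: total time for one killswitch fingerprint check.
        Speeds are in bytes per second; lookup_time is per hash."""
        return (hashing_time
                + upload_bytes / up_speed
                + num_hashes * lookup_time
                + response_bytes / down_speed)

    def manual_upload_time(package_bytes, up_speed, av_scan_time, comdroid_time):
        """Equation 5.2: time to upload a missing package and scan it with
        both the appropriate anti-virus service and ComDroid."""
        return package_bytes / up_speed + av_scan_time + comdroid_time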

5.6.1 Testing Protocol

To test the performance of the hashing function, the top five apps (by user popularity ranking) from each of the 21 market categories were installed on the Android emulator (i.e., first the top app from each category was installed, then the top two apps from each category, then the top three, and so on).

A complete system fingerprint was generated ten times, with the local fingerprint cache deleted after each run, and the average time was taken. This represents the worst case scenario: one in which none of the apps on the device have been hashed before, and all hashes must be computed. Next, another ten fingerprints were generated and the average was taken, but this time the cache was left intact. This represents the best case scenario: one in which all of the apps on the phone have already been hashed and the phone's fingerprint is stored locally. Along with the fingerprint generation time, the size of the fingerprint and the size of the server's response to that fingerprint were recorded, so that the data consumption of the killswitch could be evaluated.
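A sketch of fingerprint generation with a local hash cache is shown below. The use of SHA-1 and a cache keyed by file path and modification time are assumptions of this sketch; the prototype's actual hash algorithm and cache layout are not specified here.

    import hashlib
    import os

    def system_fingerprint(apk_paths, cache):
        """Hash every installed package, reusing cached digests.  The cache
        maps (path, mtime) -> hex digest, so a hash is recomputed only for
        packages that have not been seen before."""
        fingerprint = []
        for path in apk_paths:
            key = (path, os.path.getmtime(path))
            if key not in cache:
                digest = hashlib.sha1()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
                cache[key] = digest.hexdigest()
            fingerprint.append(cache[key])
        return fingerprint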

Uncached: f(x) = 0.0895 × x − 0.27
Cached: f(x) = 0.0028 × x + 0.164

Table 5.9: Linear equations for the time (in seconds) to generate a system fingerprint as a function of the total size (in MB) of apps on a device, for both the cached and uncached scenarios.

Under normal use, the typical scenario is likely to be the best case scenario, or very close to it. After the first fingerprint has been generated, the only time an app will have to be hashed is when it has not been seen by the killswitch before, meaning it has just been installed. Unless a user installs numerous apps between scheduled runs of the killswitch, the number of apps that need to be hashed will be near zero.

Combining the hashing performance with the file size data for the data set, the scanner performance functions in Tables 4.10 and 5.4, and the experimental network performance measurements from [62], the cost of performing manual uploads, as well as the cost of fingerprinting, can be calculated using Equations 5.1 and 5.2.

5.6.2 Results

Figure 5.3 shows the best (cached) and worst (uncached) case scenarios for the fingerprint generation time as a function of both the number of packages on the device and the total size of those packages. Table 5.9 shows the linear equations for both the cached and uncached cases as functions of the total size of apps on a device.

It is clear that the time to generate a system fingerprint grows mostly linearly with both the number and the total size of packages on the device. In the worst case, with 110 apps on the device, it takes only 29.95 seconds to generate a system fingerprint. The best case scenario is dramatically better, with a fingerprint being generated in 1.09 seconds for the same 110 apps when the hashes have been cached.

Figure 5.3: Time required to generate a complete system fingerprint as a function of (a) the number of packages installed on the device and (b) the total size of those packages. Both plots show the average time when all of the package hashes have been stored (cached) and when none of the package hashes are stored (uncached). Plot (a) also includes the number of bytes sent and received when communicating the fingerprint to the Thin AV server.

Interval | Data Consumption (5 Apps) | Data Consumption (110 Apps)
1 Day | 24.47 KB | 349.41 KB
1 Week | 171.28 KB | 2.39 MB
1 Month | 0.74 MB | 10.58 MB

Table 5.10: Data consumption of the Thin AV killswitch over different time periods (taking a month as 31 days), for 5 and 110 apps installed on the device, assuming the killswitch is scheduled to run every fifteen minutes.

Data usage grows linearly with the number of packages on the device: per killswitch run, consumption ranges from 261 bytes for 5 apps up to 3.64 KB for 110 apps. The majority of this transmission is the uploaded fingerprint itself, as the response from Thin AV is only a 70-byte download from the server. This is for a fingerprint containing no hashes corresponding to malicious apps, however.

Under the current configuration, the killswitch is scheduled to generate a system fingerprint every 15 minutes. Table 5.10 shows how much data would be consumed by Thin AV (as currently configured) over different lengths of time.
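The figures in Table 5.10 follow directly from the measured per-run totals and the fifteen-minute schedule, as the following sketch shows (assuming 1 KB = 1024 bytes and a 31-day month):

    RUNS_PER_DAY = 24 * 60 // 15   # killswitch runs every fifteen minutes

    def killswitch_data_kb(per_run_bytes, days):
        """Total killswitch traffic over a period, in KB."""
        return per_run_bytes * RUNS_PER_DAY * days / 1024.0

    # Measured per-run totals: 261 bytes for 5 apps, ~3.64 KB for 110 apps.
    for label, per_run in (("5 apps", 261), ("110 apps", 3.64 * 1024)):
        print(label,
              round(killswitch_data_kb(per_run, 1), 2), "KB/day,",
              round(killswitch_data_kb(per_run, 31) / 1024, 2), "MB/month")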

Using the same network measurements from Section 5.5, the measured fingerprint generation times, and the data transmission totals, it is possible to compute a variety of potential running times for the entire fingerprinting operation of the Thin AV killswitch using Equation 5.1. These values are summarized in Table 5.11.

For calculating the cost of manually uploading missing packages, two additional assumptions must be made: the number of packages being uploaded, and the size of those packages. The three package sizes will be the same as those used in Section 5.5. These app sizes will be used to examine cases where 10, 25, and 50 apps are uploaded to the Thin AV service for scanning. These scenarios will again be examined under the four network configurations from the previous section.

Table 5.12 summarizes the total amount of data that would be uploaded for different numbers of apps of different sizes. The key assumption underpinning this table is that the sizes of all apps in a given scenario are the same (e.g., ten small apps amount to 10 × 0.764 MB).

Scenario | Time (seconds)
110 apps over an ideal 3G connection with no hashes cached | 26.206
110 apps over a typical 3G connection with no hashes cached | 26.430
110 apps over an ideal WiFi connection with no hashes cached | 26.204
110 apps over a typical WiFi connection with no hashes cached | 26.223
110 apps over an ideal 3G connection with all hashes cached | 3.424
110 apps over a typical 3G connection with all hashes cached | 3.478
110 apps over an ideal WiFi connection with all hashes cached | 3.423
110 apps over a typical WiFi connection with all hashes cached | 3.428
26 apps over an ideal 3G connection with no hashes cached | 1.034
26 apps over a typical 3G connection with no hashes cached | 1.258
26 apps over an ideal WiFi connection with no hashes cached | 1.032
26 apps over a typical WiFi connection with no hashes cached | 1.051
26 apps over an ideal 3G connection with all hashes cached | 0.285
26 apps over a typical 3G connection with all hashes cached | 0.339
26 apps over an ideal WiFi connection with all hashes cached | 0.285
26 apps over a typical WiFi connection with all hashes cached | 0.290

Table 5.11: Time required to complete the fingerprinting operation for different numbers of applications, network performance, and caching scenarios. The definitions of "typical" and "ideal" for each connection type are the same as in Section 5.5.

Total Data Uploaded (MB):

Scenario | 10 Apps | 25 Apps | 50 Apps
Small Apps | 7.643 | 19.108 | 38.216
Medium Apps | 17.775 | 44.438 | 88.875
Large Apps | 35.570 | 88.925 | 177.850

Table 5.12: Total upload sizes used for calculations of bulk scanning performance.

Upload Time (Seconds):

Connection | Scenario | 10 Apps | 25 Apps | 50 Apps
Ideal 3G | Small Apps | 4.367 | 10.919 | 21.837
Ideal 3G | Medium Apps | 10.157 | 25.393 | 50.786
Ideal 3G | Large Apps | 20.326 | 50.814 | 101.629
Typical 3G | Small Apps | 485.360 | 1213.400 | 2426.801
Typical 3G | Medium Apps | 1128.781 | 2821.953 | 5643.907
Typical 3G | Large Apps | 2258.833 | 5647.082 | 11294.164
Ideal WiFi | Small Apps | 0.102 | 0.255 | 0.510
Ideal WiFi | Medium Apps | 0.237 | 0.593 | 1.185
Ideal WiFi | Large Apps | 0.474 | 1.186 | 2.371
Typical WiFi | Small Apps | 41.111 | 102.777 | 205.553
Typical WiFi | Medium Apps | 95.609 | 239.023 | 478.046
Typical WiFi | Large Apps | 191.326 | 478.315 | 956.630

Table 5.13: Upload times for the values in Table 5.12, for four different network configurations.

Scanning Time (Seconds):

Scenario | 10 Apps | 25 Apps | 50 Apps
Small Apps Scanned With Kaspersky | 161.021 | 402.552 | 805.104
Medium Apps Scanned With VirusChief | 634.032 | 1585.080 | 3170.159
Large Apps Scanned With VirusChief | 1104.893 | 2762.232 | 5524.465
Small Apps Scanned With ComDroid | 107.224 | 252.563 | 494.796
Medium Apps Scanned With ComDroid | 120.919 | 266.259 | 508.491
Large Apps Scanned With ComDroid | 144.972 | 290.312 | 532.544

Table 5.14: Scan times for different numbers of apps with small, medium, and large sizes, using conventional scanning engines (Kaspersky and VirusChief) and the Android-specific scanner, ComDroid.

Using these total upload sizes, the upload times can be calculated from the network speeds specified in the previous section; the upload times for the different numbers and sizes of apps are summarized in Table 5.13. Using the size and quantity of each app, the scanning time can then be computed using the equations in Tables 4.10 and 5.4; these results are summarized in Table 5.14. Finally, referring to Equation 5.2, it is possible to compute the time required to upload and scan missing apps under different scenarios.

The best case scenario is when ten small apps are uploaded and scanned over an ideal WiFi connection. In this case the total operation would take 289.2 seconds, or just under five minutes. The worst case scenario is one in which fifty large apps are uploaded and scanned over a typical 3G connection. This operation would take 17351.2 seconds, or nearly five hours. However, if the same operation is performed over a typical WiFi connection, the time required to complete this one-time operation drops by more than half, to 1.95 hours.

5.6.3 Discussion

From both a time and data consumption perspective, Thin AV has a relatively minor impact on an Android device. Fingerprinting is the only operation that would likely take place with any frequency during long-term use. Given the best case scenario for the killswitch, 1 second of computation followed by less than 4 KB of data transmission for all 110 apps, it is likely that this operation would be unnoticeable to a user. Furthermore, given that these tests were performed on the Android emulator, the fingerprinting would almost certainly take considerably less time on a physical Android device.

In terms of data consumption, the roughly 10.6 MB a month required to upload the fingerprint of 110 apps is not trivial. However, given that cellular carriers offer data plans ranging from 500 MB a month to unlimited data usage, the impact of Thin AV would represent a small fraction of a user's allotted data consumption for a given month. Furthermore, it would be possible to reduce the amount of data consumed by a large fraction simply by reducing the frequency with which the killswitch is run, and by removing extraneous bytes from the messages sent and received by Thin AV.

It should be noted that the above results assume that Thin AV has already scanned the packages present on the mobile device. This assumption is reasonable considering that the killswitch is intended to operate in conjunction with the safe installer: any app installed on a device that has not been scanned by Thin AV would be uploaded and scanned at install time, so the scan result would already be present in Thin AV when the killswitch is later run. The one exception would be if the killswitch is installed on a device after several other apps have already been installed. In this case, the one-time upload and scanning of missing apps must take place. The worst case performance for this operation is quite poor: assuming fifty apps were uploaded, each roughly 3.56 MB in size, over a typical 3G connection, it would take nearly five hours to complete the operation. However, the same operation over a typical WiFi connection would take less than two hours. Considering that this is a one-time operation, and that it is at the user's discretion when it takes place, these results are not unreasonable. A user could simply initiate the upload over their home or office WiFi network while their phone is charging.

In general, the long-term performance impact of using the Thin AV killswitch is quite favorable.

Chapter 6

Discussion

This chapter will discuss some of the broader issues pertaining to Thin AV and the possible use of Thin AV as a production-scale service. For discussions of specific experimental results, see the discussion subsections of Chapter 4. Section 6.1 of this chapter will discuss the feasibility of Thin AV, specifically where the system succeeds and where it fails. The ideal deployment scenarios for both the mobile and desktop versions of Thin AV will be discussed in Section 6.2. Finally, the privacy concerns that would come along with using Thin AV will be expanded upon in Section 6.3.

6.1 Thin AV Performance and Feasibility

After a thorough evaluation of Thin AV in both a mobile and a desktop environment, it can be concluded that the desktop prototype of Thin AV is, at best, marginally successful and, in its current form, not highly feasible. Conversely, the mobile prototype, even in its unpolished state, demonstrates a highly feasible mechanism for protecting smartphones from malware.

There are two key factors that seriously impact the performance of Thin AV, and keep it from being a truly feasible concept on the desktop: the size of the input space, and the frequency of file access. Because Thin AV is not selective about the files it scans, any file on the stacked file system will be uploaded to Thin AV if it is accessed. This creates an extraordinarily large input space, only a portion of which is even remotely predictable. This large input space presents a significant challenge by itself. However, when combined with the fact that the files uploaded to Thin AV are often accessed several at a time, and in rapid succession, such as when launching a program, it makes for a very underwhelming experience for a user.

Fortunately, the aspects of Thin AV that made for slow performance in the desktop implementation do not exist in the mobile environment. Because virtually all Android malware comes in the form of malicious applications, the scanner input space is massively reduced. It is not necessary to scan every individual file access, and even if it were necessary, it would not be possible without fundamentally violating the Android security model. Instead, only applications need to be scanned. This application-centric design does create a very different security model from the conventional file-scanning model present in the desktop version of Thin AV. However, enhancing the existing Android security framework is preferable to violating it in the hopes of creating a more direct comparison with the desktop security model. Furthermore, opting for an app-centric approach on Android and a file-centric approach on Linux allows two current and realistic system-security scenarios to be compared and contrasted, as opposed to a pair of hypothetical (but more similar) scenarios.

By modifying the Android package installation code, it was possible to check for malicious code in an application before it was installed on the phone. Furthermore, because of the vastly reduced input space, it is even possible to make predictions about which applications will be installed, namely, the applications that exist in major application markets, both official and third-party. Being able to predict which applications will be installed allows these apps to be proactively downloaded and scanned, allowing for the quick return of cached results when performing application checks. This safe installation mechanism and the background killswitch effectively work together to prevent the installation of malicious apps, and to prompt the removal of apps if they are found to be malicious after they have been installed, all with minimal ongoing cost in computing time and network bandwidth.

Another area in which the mobile version of Thin AV outperforms the desktop version is connectivity. While most desktop computers are increasingly moving towards persistent connectivity, this is not always guaranteed. When internet connectivity is unavailable, Thin AV can only function in a passive mode, simply allowing access to files without scanning them. It might be possible to offer some protection in this scenario by logging file accesses for later scanning when an internet connection is available, but this scenario was beyond the scope of this research. In a mobile environment, this issue does not exist: a smartphone is, by its very nature, intended to be a persistently connected device. While it is possible to lose data connectivity due to lack of service, this would not be problematic for Thin AV, because while it would not be possible to communicate with Thin AV, it would also not be possible to download packages requiring verification by Thin AV. One can envision a problematic scenario in which a user downloads an Android application but does not install it until a later time when network connectivity is unavailable. While such a scenario does present a problem for the safe installer, the killswitch would be capable of detecting a malicious app once connectivity was restored.

6.2 Ideal Deployment

The goal of this research has been to determine whether a cloud-based security service would be feasible or appropriate for providing protection from malware on either a desktop computer or a mobile device. The performance of the desktop and mobile Thin AV prototypes suggests that such a system is definitely feasible for mobile devices and, with significant changes, possibly feasible for desktop systems. Due to time and hardware limitations, the prototypes that were built are, at best, rough implementations of a much grander vision. To be truly useful as a mechanism for malware protection, a variety of changes would have to be made to both the desktop and mobile systems. Section 6.2.1 will discuss the ideal deployment scenario for the desktop version of Thin AV, while Section 6.2.2 will discuss the mobile version.

6.2.1 Desktop Deployment

It is clear from the performance experiments in Chapter 4 that the desktop version of Thin AV has some significant performance impediments. There are four areas in which the performance of Thin AV could be improved, potentially leading to a practical production deployment.

The most basic change would be the development platform for Thin AV. The prototype was built using Python, an interpreted language that is not ideal for performance-intensive tasks. Python was chosen because it allowed for rapid prototyping; additionally, Python provides a number of feature-rich libraries which greatly increased the speed of development. However, it is quite likely that some modest performance gains could be realized by re-developing Thin AV in a compiled language such as C or C++. These performance gains would be most noticeable when accessing files in the Thin AV cache, which is an extremely common occurrence.

The second area for improvement would be to allow the Thin AV client to selectively filter the files that it sends for scanning. Bayer et al. [36] provide an overview of the host and network behavior of a large corpus of Windows malware. If a comparable data set were available for Linux systems, it could be used to inform the development of a filter for Thin AV. Such a filter would cause Thin AV to be more selective about the files it scans, rarely scanning files which typically pose a low risk of containing malware, and more regularly scanning files which do carry such a risk.
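A minimal sketch of such a filter is shown below; the extension lists are placeholders, precisely because no Linux-specific behavioral data set yet exists to inform them.

    import os
    import random

    # Placeholder risk lists; a real filter would be informed by a study of
    # Linux malware behavior comparable to Bayer et al. [36] for Windows.
    LOW_RISK = {".txt", ".jpg", ".png", ".mp3", ".avi"}
    HIGH_RISK = {".so", ".bin", ".sh", ".py", ".pl", ""}  # "" catches extensionless files

    def should_scan(path, low_risk_sample_rate=0.01):
        """Decide whether an accessed file should be sent to Thin AV."""
        ext = os.path.splitext(path)[1].lower()
        if ext in HIGH_RISK:
            return True
        if ext in LOW_RISK:
            # Still scan a small random sample of low-risk files.
            return random.random() < low_risk_sample_rate
        return True  # unknown extensions are scanned by default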

Other areas for improving the performance of Thin AV are the network and scanning performance. CloudAV showed impressive speed when uploading and scanning files [82].

This is not surprising considering that CloudAV was deployed in a university computer lab: with a limited number of users accessing a dedicated service over a local area network, both file transfer times and file scanning times could be kept to a minimum. Thin AV was an attempt to see how well this concept could be extended to a wide area network. In order to realize a greater degree of performance, some of the benefits of CloudAV would have to be applied to Thin AV. Most notably, Thin AV could vastly benefit from running on a dedicated hardware platform with ample resources. In this scenario, Thin AV would no longer rely on the specific third-party scanning services used in the prototype, but would have a hardware and software configuration much more like CloudAV's. It is not unreasonable to imagine a major anti-virus vendor providing such a dedicated service to its customers, either over the internet or as a network appliance.

The last area in which Thin AV can improve is, not surprisingly, by operating over faster network connections. For years, internet connection speeds have been increasing for both home and business customers. While there are likely limitations to what sorts of speeds are ultimately feasible, it is safe to say that in the short term, the speed of the average internet connection will likely increase. Such speed increases can only serve to improve the performance of Thin AV.

Given the success of Thin AV on the mobile platform compared to the desktop environment discussed in Section 6.1, it is worth asking whether the desktop operating system could be made more like the mobile operating system, so that desktops could reap the benefits of Thin AV. In general this is not an unreasonable proposition. The desktop version of Thin AV was built on top of Ubuntu, while the mobile version was built on Android, both of which are Linux-based operating systems. The key advantage offered by Android is the application sandboxing, which prevents application vulnerabilities from compromising the entire system. Unfortunately, this sandboxing comes at a price: it limits the interactions that are possible between applications. Android partially solves this by providing a framework for application interaction. However, such a highly sandboxed desktop operating system would surely require a major shift in the mindset of users. Inroads are already being made in this general direction, with Apple's introduction of the Mac App Store [3] and Google's work on Chrome OS and the complementary Chrome Web Store [88, 6]. Both of these initiatives appear to be motivated by the desire to create a simpler, more application-centric, and more user-friendly desktop computing experience. However, this paradigm shift might also bring with it some very tangible security benefits.

6.2.2 Mobile Deployment

Unlike the desktop prototype of Thin AV, the mobile implementation is considerably closer to an effective production-scale system. Given the relatively low volume of files that need to be scanned, it is quite possible to use the existing third-party scanning services in a production capacity. Obviously this presents a variety of challenges, not least of which is the fact that Thin AV is completely reliant on the continued existence of these scanning services in order to provide continued protection. However, similar to the ideal desktop deployment scenario above, there is no reason why an anti-virus vendor couldn't provide a remote scanning service to subscribing customers. Even if this were not a feasible option, there are still several improvements which could allow Thin AV to function as a more complete system.

The greatest performance boost to Thin AV would come from having a pre-populated Thin AV cache. As stated previously, this is a much more realistic expectation on a mobile device than on a desktop computer. Because application markets (both official and third-party) will likely continue to be the first stop for users seeking applications, the apps users will be installing can be pre-scanned by Thin AV. In much the same way that a large selection of apps was downloaded from the Google Market and scanned with VirusTotal for the purposes of evaluating Thin AV, a system could be developed which would regularly crawl a variety of markets, download new and popular applications, and scan them with the different scanning services. This way, when Thin AV users go to install these apps, the scan results will already exist within the cache of the service, negating the need to upload and scan the file, and thus vastly increasing the performance of the safe installer and the killswitch. Another avenue for pre-populating the cache could be application developers: Thin AV could incorporate a tool allowing developers to upload their application packages as part of the publication process.

Another area in which Thin AV has potential is the extensibility of the system. Currently, the addition of a new scanning module requires some very limited code modification to the main Thin AV system, but it might be possible to modify Thin AV such that these modifications could be removed, or at least moved to an external configuration file. This would make it much easier to develop and add scanning modules for different services in the future. This, in turn, leads to a compelling scenario: one in which users, developers, and companies can create their own Thin AV-compatible scanning modules to interface with their various service offerings, be they application blacklists, static code analysis tools, application permission analyzers, social reputation tools, or mobile-specific anti-virus scanners. Thin AV could thus become a highly configurable service, with users configuring their Thin AV clients to specify which scanning modules Thin AV should use when uploading packages via the safe installer or killswitch.
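A minimal sketch of what such configuration-driven module loading might look like follows; the JSON layout and module names are assumptions, not details of the current prototype.

    import importlib
    import json

    def load_scanners(config_path):
        """Instantiate the scanning modules named in an external
        configuration file, so that adding a new service requires no
        change to the core Thin AV code.  Expects JSON of the form:
            {"scanners": [{"module": "thinav.scanners.kaspersky",
                           "class": "KasperskyScanner"}, ...]}
        """
        with open(config_path) as f:
            config = json.load(f)
        scanners = []
        for entry in config["scanners"]:
            module = importlib.import_module(entry["module"])
            scanners.append(getattr(module, entry["class"])())
        return scanners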

The final change that would be necessary in order to fully realize Thin AV on Android would be the addition of a mechanism that allowed Thin AV to interrupt and prevent package installations, without modifying the operating system source code. This is a very challenging requirement, as such a mechanism, if poorly implemented, could do far more harm than good, by offering a means for malware to prevent the installation of legitimate applications. A potential solution to this issue might be for Google to allow applications to use such a mechanism only if the developer of the application is trusted or in some way certified by Google.

6.3 Privacy

Making use of a third-party scanning service carries with it some privacy concerns. While the mobile version of Thin AV raises only very limited concerns, which will be outlined in Section 6.3.2, it is the desktop version that poses the most serious ones; these will be discussed in Section 6.3.1.

6.3.1 Desktop Privacy

Because the files scanned by Thin AV are passed along to third-party scanning services, users must accept the fact that the information contained within those files can be seen by the organizations operating those services. The websites for the three scanning services make no mention of what is done with files after they have been scanned. It is not safe to assume the files are destroyed; the best course of action is to operate under the assumption that they are retained by the scanning services. For many individuals, the prospect of putting a potentially large amount of private or personal data in the hands of such an organization is quite discomforting, and it may not be permissible for some individuals and organizations. This is further complicated by the fact that the desktop implementation of Thin AV communicates directly with the scanning services. This means that it would be possible for a scanning service to collate all of the uploads from a single IP address, and unfortunately, such in-depth information could be used for extremely nefarious purposes.

Given that in its current form Thin AV cannot offer any guarantees regarding the privacy of users' files, it seems that the most appropriate deployment environment would be one in which there is a reduced expectation of privacy, such as public desktop computers in libraries and other common areas. If Thin AV were deployed with a dedicated scanning service as described in Section 6.2.1, then the service provider could offer some privacy guarantees, making such a security arrangement more palatable.

In such a scenario, individuals or groups might still be reticent to provide so much personal information to a single organization. However, in recent years users have become quite tolerant of the idea of putting substantial amounts of personal, even highly private, information into the hands of companies. For example, Google has access to the internet searches, e-mail, and personal documents of users of its search, GMail, and Google Docs products. For a time, Google even provided storage for health records [39]. And while Google may be the largest possessor of personal information, it is hardly alone in this regard: companies like Facebook have amassed a wealth of data on their users, ranging from private chat logs to personal photographs. Despite the risk, users have become quite accustomed to willingly giving personal information to companies in exchange for a desirable service. Therefore, the privacy concerns of Thin AV may be serious, but they are still within the realm of reason for many users.

6.3.2 Mobile Privacy

The privacy concerns present in the desktop version of Thin AV are all but non-existent in the mobile version. Because the mobile version only uploads Android packages, most of which come from public markets, there is no risk of leaking personal or private information from a device to the service provider. Furthermore, because the mobile version of Thin AV includes the Thin AV web service, which acts as the aggregator for the various scanning services, it would not even be possible for the scanning services to collate uploads by IP address: all uploads, regardless of their original source, would appear to come from the IP of the Thin AV web application.

The last aspect of privacy is the issue of file retention. Again, because only packages are being uploaded, file retention is not a major concern. Furthermore, unlike the three anti-virus scanning services, ComDroid explicitly states that files are not retained after scanning. In general, there are no tangible privacy concerns when it comes to the mobile implementation of Thin AV.

Chapter 7

Conclusion

This thesis examined the concept of providing anti-malware protection to desktop computers and mobile devices through remote third-party services. Host-based anti-virus is the conventional answer to the malware problem, on desktops and even on smartphones. However, given the vast amount of new malware created on a daily basis, these anti-virus systems require perpetual signature library updates. Furthermore, anti-virus vendors must continually add features to their products in the hopes of standing out in such a crowded market segment. This has led to an array of functionally similar anti-virus products that are becoming increasingly bloated and resource-intensive. The problem is even more serious on smartphones, where computational resources are strictly limited by the available battery power.

Within the last decade, and particularly within the last five years, there has been a push to move computation away from end-host computers and towards high-capacity remote computational resources. This has led to the notion of cloud computing. Fundamentally, the cloud is a novel business model layered over the existing concepts of high-capacity grid computing and distributed computing. Cloud computing allows software products and services to be offered remotely, with sufficient capacity as to effectively eliminate the appearance of resource constraints from the perspective of the end user. The notion of cloud-based Security-as-a-Service has recently been examined as a possible way to address the burgeoning malware problem.

The first major contribution of this research was the design and development of Thin AV, a system for providing anti-virus scanning for Linux-based desktop computers by offloading the scanning of files from the host computer to a set of pre-existing third-party scanning services. The design of such a system was beneficial because it reduced the software footprint on the host computer to a fraction of what it would be if a full-fledged anti-virus product were installed. Additionally, it allowed files to be scanned with several different anti-virus engines, as opposed to the single engine that would be used by a host-based system. The key factor that differentiates Thin AV from earlier cloud-based anti-malware solutions is its reliance on existing scanning services accessed over the internet, as opposed to dedicated computing resources located on the same local area network. While the latter arrangement does provide tremendous performance benefits, it does not accurately represent the performance one would see if the service in question were offered remotely by a third party.

Thin AV was evaluated by directly measuring the performance of the scanning services, as well as by measuring the performance of the system when executing a series of scripted user behaviors. These performance measurements were then used to inform the development of a simulator which was used to test the limits of Thin AV under a variety of file system behaviors. It was found that in certain cases the performance of such a system was acceptable; in its current form, however, the worst case performance was noticeable to the point of being excessively disruptive to the user. In the future it may be possible to address these performance concerns, resulting in a system capable of providing nearly transparent anti-malware protection from the cloud.

The second major contribution of this thesis was an extension of the desktop version of Thin AV, specifically targeted at smartphones and tablets. In recent years, malware on mobile phones and smartphones has become a major issue, with virtually every smartphone platform being affected. The need to address mobile malware is pressing because of the extent to which smartphones are becoming integrated into the modern lifestyle. A substantial amount of personal and private information is often stored on a person's smartphone, and this presents an extremely tempting target to malware authors. Given the resource constraints that come with mobile devices, a remote anti-malware service appeared to be a good fit for addressing mobile security.

The desktop Thin AV system was extended and wrapped in a web application in order to serve as a unified interface for servicing anti-malware scan requests from Android mobile devices. Additionally, a system was developed which prevents the installation of malicious applications by intercepting application installation requests and sending them to the Thin AV web application for scanning. This was complemented by a background killswitch which can prompt the removal of pre-existing applications if they are found to be malicious. Both of these mechanisms rely on the third-party scanning services used in the desktop version of Thin AV. Because it was determined that the scanning services are capable of detecting Android malware, this made for a fully functioning system, capable of preventing the installation of malicious applications or removing malicious applications after installation. In order to further demonstrate the extensibility and modularity of Thin AV, a fourth scanning service, capable of performing static code vulnerability analysis, was added to the mobile version of Thin AV.

The evaluation of the mobile extension of Thin AV was done by assessing both the typical and best case run times of the safe installation mechanism and the killswitch. This was done by independently measuring the time requirements of various aspects of each system, and then calculating a range of possible running times based on a set of empirical measurements of 3G and WiFi network performance. However, because all experiments were run on the Android development emulator, they represent a lower bound on the actual performance. The evaluation showed the system to be highly practical, with the only major drawback being the need to manually upload pre-installed packages the first time Thin AV is run. The reason for the much improved performance of Thin AV on the smartphone was the nature of the malware threat on Android smartphones.

Unlike desktop systems, where malware can come in many forms, virtually all Android malware in the wild is spread through malicious applications. This meant that Thin AV only had to scan application packages, and not the entire smartphone file system. This significant reduction in input space, coupled with the fact that it is possible to pre-cache scan results for packages, meant that it was possible to get anti-malware protection at a very low cost in terms of running time.

The successful evaluation of Thin AV shows that the concept of providing security services from the cloud is a real possibility. The fact that Thin AV was built on top of shared public scanning services shows what can be achieved on the proverbial "shoestring" budget. Given sufficient dedicated resources, it is quite likely that cloud-based Security-as-a-Service could become a real alternative for desktop users who are tired of the ever-increasing size of anti-virus software or, more ideally, for smartphone users who want the confidence of knowing that the applications they are downloading are free of malware, without bearing the burden of a full host-based security system.

There are two main avenues for future work pertaining to Thin AV on the desktop. The first is practical changes and improvements to increase the speed and robustness of Thin AV. In terms of future research, however, the largest remaining question is how transparent anti-malware scanning can be made with the addition of dedicated scanning resources. It would be interesting to create a Thin AV-like system based not on freely available scanning services, but on private and dedicated scanning systems. Previous research has established the feasibility of this arrangement when the scanning hardware is co-located with the host computers. However, to truly make malware scanning a service, it needs to be seen how much performance is lost when such a dedicated service is offered over a wide area network. If such a service can be made effectively transparent to end users, then it might be viable for security companies to offer subscription-based cloud security on the desktop.

Another potential topic for future research which came out of the desktop evaluation is the notion of characterizing typical desktop software usage patterns. The scripts used to generate file system activity on the desktop were inspired by a single previous study. Despite an extensive search, there does not appear to be any major body of research describing the general patterns of software use on desktop computers. This is not surprising considering how vague and ill-defined the question is. Nevertheless, such a study would have greatly aided the evaluation of Thin AV on the desktop. One possible approach would be to generate a large range of possible operational profiles, each incorporating numerous different user activities [80]. This might help in more clearly defining the circumstances in which Thin AV excels.

The future direction for Thin AV on the smartphone is somewhat less clear. The most necessary research pertains to the Android operating system itself. Thin AV was able to interrupt and terminate the installation of malicious apps by modifying the operating system source code, which is not a practical long-term solution. Research must be undertaken to find a way for the operating system to allow applications to arrest the package installation procedure that is both safe and secure, because it would be very easy to use such a privileged operation for malicious purposes. While there are other improvements that could be made to Thin AV to improve its extensibility and performance, these would only be necessary when considering a production-scale deployment of Thin AV.

Another area of future work for the mobile version of Thin AV would be to assess the power consumption of Thin AV in comparison to other anti-virus systems available on the Android market. Due to the relatively small amount of processing and network traffic generated by Thin AV, it stands to reason that such a security mechanism would have a minimal impact on the battery life of a device in comparison to other anti-virus products available for Android. While such a comparison would have been a desirable addition to this research, it was impractical at this time for two reasons. First, such an experiment would have required a physical Android device capable of running the custom operating system developed with Thin AV and, as stated earlier, this is not trivial, as there are numerous compatibility issues involved in replacing the operating system on a physical device. Second, because virtually all of the anti-virus applications on the Google Market are proprietary, it would not be reasonable to compare their battery consumption with that of Thin AV without knowing what sort of processes are actually taking place within these proprietary systems. However, given that previous research on cloud-based anti-malware systems has shown mixed results when it comes to power consumption, it would be highly desirable to assess the power consumption of Thin AV before a final determination can be made about its suitability for mobile devices.

This thesis has examined the feasibility of using cloud-based security services to protect computer systems from malware. While the findings of this research show that a cloud-based approach has many benefits in the fight against malware, it is safe to predict that this will not be the ultimate solution to the malware problem. The continually evolving nature of the malware threat virtually guarantees that new systems and techniques will continually need to be developed. However, for the moment, cloud computing may offer a temporary respite from the storm of malware to which users are continually exposed.

Bibliography

[1] Android-APKTool. http://code.google.com/p/android-apktool/, Last Ac-

cessed: Jan. 2012.

[2] Android Developer Guide. http://developer.android.com/guide/index.html,

Last Accessed: Feb. 2012.

[3] Apple Mac app store. http://www.apple.com/mac/app-store/, Last Accessed:

Feb. 2012.

[4] AppsLib. http://appslib.com/, Last Accessed: Jan. 2012.

[5] Avira Operations GmbH & Co. KG. http://www.avira.com/, Last Accessed:

Sept. 2011.

[6] Chrome web store. https://chrome.google.com/webstore/category/home,

Last Accessed: Feb. 2012.

[7] DazukoFS. http://www.dazuko.org, Last Accessed: Sept. 2011.

[8] FileAdvisor by Bit9. http://fileadvisor.bit9.com, Last Accessed: Sept. 2011.

[9] Google market API. http://code.google.com/p/android-market-api/, Last

Accessed: Jan. 2012.

[10] Indiroid. https://indiroid.com/, Last Accessed: Jan. 2012.

[11] inotify. http://linux.die.net/man/7/inotify, Last Accessed: Sept. 2011.

[12] Kaspersky free virus scan. http://www.kaspersky.com/virusscanner, Last Ac-

cessed: Sept. 2011.

120 121

[13] Kaspersky Lab. http://www.kaspersky.com/, Last Accessed: Sept. 2011.

[14] MiKandi. http://www.mikandi.com/, Last Accessed: Jan. 2012.

[15] Nduoa. http://www.nduoa.com/, Last Accessed: Jan. 2012.

[16] Samsung Galaxy S Forums - pro’s and con’s of installing cus-

tom ROMs. http://samsunggalaxysforums.com/showthread.php/

7418-Pro-s-and-Con-s-of-Installing-custom-Roms, Last Accessed: Jan.

2012.

[17] SciMark. http://math.nist.gov/scimark2/, Last Accessed: Jan. 2012.

[18] SciMark Test Descriptions. http://math.nist.gov/scimark2/about.html, Last

Accessed: Jan. 2012.

[19] Selenium WebKit. http://code.google.com/p/selenium/, Last Accessed: Sept.

2011.

[20] VirScan. http://virscan.org, Last Accessed: Sept. 2011.

[21] VirusChief. http://www.viruschief.com/, Last Accessed: Sept. 2011.

[22] VirusTotal. http://www.virustotal.com/, Last Accessed: Sept. 2011.

[23] VirusTotal terms of service. http://www.virustotal.com/terms.html, Last Ac-

cessed: Sept. 2011.

[24] Bowers v. Baystate Technologies, Inc., 320 F. 3d 1317 - Court of Appeals, Federal

Circuit, 2003.

[25] Flask. http://flask.pocoo.org/, Last Accessed: Jan. 2012. 122

[26] McAfee SaaS endpoint protection suite. http://www.mcafee.com/us/products/

saas-endpoint-protection-suite.aspx, Last Accessed: Feb. 2012.

[27] VirusTotal scan result. https://www.virustotal.com/file/

7f0aaf040b475085713b09221c914a971792e1810b0666003bf38ac9a9b013e6/

analysis/, Last Accessed: Jan. 2012.

[28] Nitin Agrawal, William J. Bolosky, John R. Douceur, and Jacob R. Lorch. A

five-year study of file-system metadata. ACM Trans. Storage, 3:9:1–9:32, October

2007.

[29] Jerry Archer, Alan Boehme, Dave Cullinane, Nils Puhlmann, Paul Kurtz, and Jim

Reavis. Defined categories of service 2011. Technical report, Security as a service

working group. Cloud security alliance, 2011.

[30] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz,

Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei

Zaharia. A view of cloud computing. Commun. ACM, 53:50–58, April 2010.

[31] John Aycock. Computer Viruses and Malware, volume 22 of Advances in Informa-

tion Security. Springer, 2006.

[32] Mark Balanza. Android malware acts as an SMS relay. http://blog.trendmicro.

com/android-malware-acts-as-an-sms-relay/, June 2011.

[33] Mark Balanza. Android malware eavesdrops on users,

uses Google+ as disguise. http://blog.trendmicro.com/

android-malware-eavesdrops-on-users-uses-google-as-disguise/, Au-

gust 2011. 123

[34] David Barrera, William Enck, and Paul C. van Oorschot. Seeding a security-enhancing infrastructure for multi-market application ecosystems. Technical Report TR-11-06, Carleton University - School of Computer Science, 2011.

[35] David Barrera, H. Güneş Kayacik, Paul C. van Oorschot, and Anil Somayaji. A methodology for empirical analysis of permission-based security models and its application to Android. In Proceedings of the 17th ACM conference on Computer and communications security, CCS ’10, pages 73–84, New York, NY, USA, 2010. ACM.

[36] Ulrich Bayer, Imam Habibi, Davide Balzarotti, Engin Kirda, and Christopher Kruegel. A view on current malware behaviors. In Proceedings of the 2nd USENIX conference on Large-scale exploits and emergent threats: botnets, spyware, worms, and more, LEET’09, pages 8–8, Berkeley, CA, USA, 2009. USENIX Association.

[37] Jeffrey Bickford, H. Andrés Lagar-Cavilla, Alexander Varshavsky, Vinod Ganapathy, and Liviu Iftode. Security versus energy tradeoffs in host-based mobile malware detection. In Proceedings of the 9th international conference on Mobile systems, applications, and services, MobiSys ’11, pages 225–238, New York, NY, USA, 2011. ACM.

[38] Jeffrey Bickford, Ryan O’Hare, Arati Baliga, Vinod Ganapathy, and Liviu Iftode. Rootkits on smart phones: attacks, implications and opportunities. In Proceedings of the Eleventh Workshop on Mobile Computing Systems & Applications, HotMobile ’10, pages 49–54, New York, NY, USA, 2010. ACM.

[39] Aaron Brown and Bill Weihl. An update on Google Health and Google PowerMeter. http://googleblog.blogspot.com/2011/06/update-on-google-health-and-google.html, Last Accessed: Feb. 2012.

[40] Rich Canning. An update on Android market security. http://googlemobile.blogspot.com/2011/03/update-on-android-market-security.html, Last Accessed: Nov. 2011.

[41] Sang Kil Cha, Iulian Moraru, Jiyong Jang, John Truelove, David Brumley, and David G. Andersen. SplitScreen: enabling efficient, distributed malware detection. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, NSDI’10, pages 25–25, Berkeley, CA, USA, 2010. USENIX Association.

[42] Brian Chen. Want porn? Buy an Android phone, Steve Jobs says. http://www.wired.com/gadgetlab/2010/04/steve-jobs-porn/, April 2010.

[43] Brian Chen. Amazon app store requires security compromise. http://www.wired.com/gadgetlab/2011/03/amazon-app-store-security/, March 2011.

[44] Jerry Cheng, Starsky H.Y. Wong, Hao Yang, and Songwu Lu. SmartSiren: virus detection and alert for smartphones. In Proceedings of the 5th international conference on Mobile systems, applications and services, MobiSys ’07, pages 258–271, New York, NY, USA, 2007. ACM.

[45] Erika Chin, Adrienne Porter Felt, Kate Greenwood, and David Wagner. Analyzing inter-application communication in Android. In Proceedings of the 9th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2011.

[46] Mihai Chiriac. Tales from cloud nine. In Virus Bulletin Conference, pages 1–6, 2009.

[47] Byung-Gon Chun and Petros Maniatis. Augmented smartphone applications through clone cloud execution. In Proceedings of the 12th conference on Hot topics in operating systems, HotOS’09, pages 8–8, Berkeley, CA, USA, 2009. USENIX Association.

[48] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan Saroiu, Ranveer Chandra, and Paramvir Bahl. MAUI: making smartphones last longer with code offload. In Proceedings of the 8th international conference on Mobile systems, applications, and services, MobiSys ’10, pages 49–62, New York, NY, USA, 2010. ACM.

[49] David Dagon, Tom Martin, and Thad Starner. Mobile phones as computing devices: The viruses are coming! IEEE Pervasive Computing, 3:11–15, 2004.

[50] Toralv Dirro, Paula Greve, Rahul Kashyap, David Marcus, François Paget, Craig Schmugar, Jimmy Shah, and Adam Wosotowsky. McAfee threats report: second quarter 2011. Technical report, McAfee Labs, August 2011.

[51] B. Dixon and S. Mishra. On rootkit and malware detection in smartphones. In 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 162–163, July 2010.

[52] The Economist. Clash of the clouds. http://www.economist.com/node/14637206?story_id=14637206, October 2009.

[53] Marc Fossi (Editor). Symantec report on the underground economy. Technical report, Symantec Corporation, 2008.

[54] W. Enck, M. Ongtang, and P. McDaniel. Understanding Android security. IEEE Security & Privacy, 7(1):50–57, Jan.–Feb. 2009.

[55] William Enck, Damien Octeau, Patrick McDaniel, and Swarat Chaudhuri. A study of Android application security. In Proceedings of the 20th USENIX conference on Security, Berkeley, CA, USA, 2011. USENIX Association.

[56] William Enck, Machigar Ongtang, and Patrick McDaniel. On lightweight mobile phone application certification. In Proceedings of the 16th ACM conference on Computer and communications security, CCS ’09, pages 235–245, New York, NY, USA, 2009. ACM.

[57] Georgina Enzer. Android attacks on the up, says Trend Micro. http://www.itp.net/585773-android-attacks-on-the-up-says-trend-micro, August 2011.

[58] Independent Security Evaluators. Exploiting Android. http://securityevaluators.com/content/case-studies/android/index.jsp, November.

[59] Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, and David Wagner. Android permissions demystified. In Proceedings of the 18th ACM conference on Computer and communications security, CCS ’11, pages 627–638, New York, NY, USA, 2011. ACM.

[60] Adrienne Porter Felt, Matthew Finifter, Erika Chin, Steve Hanna, and David Wagner. A survey of mobile malware in the wild. In Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, SPSM ’11, pages 3–14, New York, NY, USA, 2011. ACM.

[61] J. Flinn, D. Narayanan, and M. Satyanarayanan. Self-tuned remote execution for pervasive computing. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, pages 61–66, May 2001.

[62] Richard Gass and Christophe Diot. An experimental performance comparison of 3G and Wi-Fi. In Arvind Krishnamurthy and Bernhard Plattner, editors, Passive and Active Measurement, volume 6032 of Lecture Notes in Computer Science, pages 71–80. Springer Berlin / Heidelberg, 2010.

[63] R.L. Grossman. The case for cloud computing. IT Professional, 11(2):23–27, March–April 2009.

[64] Gartner Group. Gartner says sales of mobile devices grew 5.6 percent in third quarter of 2011; smartphone sales increased 42 percent. http://www.gartner.com/it/page.jsp?id=1848514, Last Accessed: Nov. 2011.

[65] Gartner Group. Gartner says cloud computing will be as influential as e-business. http://www.gartner.com/it/page.jsp?id=707508, June 2008.

[66] Mikko Hypponen. The state of cell phone malware in 2007. http://www.usenix.org/events/sec07/tech/hypponen.pdf, August 2007.

[67] IEEE 802.11n-2009. Wireless LAN medium access control (MAC) and physical layer specifications: enhancements for higher throughput. IEEE, June 2009.

[68] Markus Jakobsson and Karl-Anders Johansson. Assured detection of malware with applications to mobile platforms. Technical report, DIMACS, February 2010.

[69] Markus Jakobsson and Karl-Anders Johansson. Retroactive detection of malware with applications to mobile platforms. In Proceedings of the 5th USENIX conference on Hot topics in security, HotSec’10, pages 1–13, Berkeley, CA, USA, 2010. USENIX Association.

[70] Markus Jakobsson and Ari Juels. Server-side detection of malware infection. In Proceedings of the 2009 workshop on New security paradigms workshop, NSPW ’09, pages 11–22, New York, NY, USA, 2009. ACM.

[71] Gregg Keizer. Spike in mobile malware doubles Android users’ chances of infection. http://www.computerworld.com/s/article/9218831/Spike_in_mobile_malware_doubles_Android_users_chances_of_infection, August 2011.

[72] Lei Liu, Guanhua Yan, Xinwen Zhang, and Songqing Chen. VirusMeter: Preventing your cellphone from spies. In Engin Kirda, Somesh Jha, and Davide Balzarotti, editors, Recent Advances in Intrusion Detection, volume 5758 of Lecture Notes in Computer Science, pages 244–264. Springer Berlin / Heidelberg, 2009. doi:10.1007/978-3-642-04342-0_13.

[73] Hiroshi Lockheimer. Android and security. http://googlemobile.blogspot.com/2012/02/android-and-security.html, February 2012.

[74] Zachary Lutz. Carrier IQ: What it is, what it isn’t, and what you need to know. http://www.engadget.com/2011/12/01/carrier-iq-what-it-is-what-it-isnt-and-what-you-need-to/, December 2011.

[75] Lorenzo Martignoni, Roberto Paleari, and Danilo Bruschi. A framework for behavior-based malware analysis in the cloud. In Atul Prakash and Indranil Sen Gupta, editors, Information Systems Security, volume 5905 of Lecture Notes in Computer Science, pages 178–192. Springer Berlin / Heidelberg, 2009. doi:10.1007/978-3-642-10772-6_14.

[76] P. McDaniel and W. Enck. Not so great expectations: Why application markets haven’t failed security. IEEE Security & Privacy, 8(5):76–78, Sept.–Oct. 2010.

[77] Jane McEntegart. Malicious iPhone virus takes control of your phone. http://www.tomshardware.com/news/iphone-virus-botnet-bank-details,9136.html, November 2009.

[78] Peter Mell and Timothy Grance. The NIST definition of cloud computing, September 2011.

[79] Yevgeniy Miretskiy, Abhijith Das, Charles P. Wright, and Erez Zadok. Avfs: an on-access anti-virus file system. In Proceedings of the 13th conference on USENIX Security Symposium - Volume 13, SSYM’04, pages 6–6, Berkeley, CA, USA, 2004. USENIX Association.

[80] John D. Musa. Software Reliability Engineering: More Reliable Software Faster and Cheaper. AuthorHouse, 2nd edition, 2004, Chapter 2.

[81] Jon Oberheide, Evan Cooke, and Farnam Jahanian. Rethinking antivirus: executable analysis in the network cloud. In Proceedings of the 2nd USENIX workshop on Hot topics in security, pages 5:1–5:5, Berkeley, CA, USA, 2007. USENIX Association.

[82] Jon Oberheide, Evan Cooke, and Farnam Jahanian. CloudAV: N-version antivirus in the network cloud. In Proceedings of the 17th Conference on Security, pages 91–106, Berkeley, CA, USA, 2008. USENIX Association.

[83] Jon Oberheide and Farnam Jahanian. When mobile is harder than fixed (and vice versa): demystifying security challenges in mobile environments. In Proceedings of the Eleventh Workshop on Mobile Computing Systems & Applications, HotMobile ’10, pages 43–48, New York, NY, USA, 2010. ACM.

[84] Jon Oberheide, Kaushik Veeraraghavan, Evan Cooke, Jason Flinn, and Farnam Jahanian. Virtualized in-cloud security services for mobile devices. In Proceedings of the First Workshop on Virtualization in Mobile Computing, MobiVirt ’08, pages 31–35, New York, NY, USA, 2008. ACM.

[85] A.J. O’Donnell. When malware attacks (anything but Windows). IEEE Security & Privacy, 6(3):68–70, May–June 2008.

[86] M. Ongtang, S. McLaughlin, W. Enck, and P. McDaniel. Semantically rich application-centric security in Android. In Computer Security Applications Conference, 2009. ACSAC ’09. Annual, pages 340–349, December 2009.

[87] Sarah Perez. Developer is building an app store for banned Android apps. http://techcrunch.com/2012/01/20/developer-is-building-an-app-store-for-banned-android-apps/, January 2012.

[88] Sundar Pichai. Introducing the Google Chrome OS. http://googleblog.blogspot.com/2009/07/introducing-google-chrome-os.html, Last Accessed: Feb. 2012.

[89] Georgios Portokalidis, Philip Homburg, Kostas Anagnostakis, and Herbert Bos. Paranoid Android: versatile protection for smartphones. In Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC ’10, pages 347–356, New York, NY, USA, 2010. ACM.

[90] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC ’00, pages 4–4, Berkeley, CA, USA, 2000. USENIX Association.

[91] Jamie Rosenberg. Introducing Google Play: All your entertainment, anywhere you go. http://googleblog.blogspot.com/2012/03/introducing-google-play-all-your.html, March 2012.

[92] Dan Rowinski. More than 50 percent of Android devices still running Froyo. http://www.readwriteweb.com/mobile/2011/09/more-than-50-of-android-device.php, Last Accessed: Jan. 2012.

[93] Neil Rubenking. Lab testing. http://www.pcmag.com/article2/0,2817,2358764,00.asp, September 2010.

[94] Alexey Rudenko, Peter Reiher, Gerald J. Popek, and Geoffrey H. Kuenning. Saving portable computer battery power through remote process execution. SIGMOBILE Mob. Comput. Commun. Rev., 2:19–26, January 1998.

[95] Steven Salerno, Ameya Sanzgiri, and Shambhu Upadhyaya. Exploration of attacks on current generation smartphones. Procedia Computer Science, 5:546–553, 2011. The 2nd International Conference on Ambient Systems, Networks and Technologies (ANT-2011) / The 8th International Conference on Mobile Web Information Systems (MobiWIS 2011).

[96] Sharun Santhosh. Factoring file access patterns and user behavior into caching design for distributed file systems. Master’s thesis, Wayne State University, Detroit, Michigan, 2004.

[97] James Schlichting, Federal Communications Commission. Google Voice and related iPhone applications. http://hraunfoss.fcc.gov/edocs_public/attachmatch/DA-09-1736A1.pdf, September 2009.

[98] A.-D. Schmidt, R. Bye, H.-G. Schmidt, J. Clausen, O. Kiraz, K.A. Yuksel, S.A. Çamtepe, and S. Albayrak. Static analysis of executables for collaborative malware detection on Android. In IEEE International Conference on Communications, 2009, pages 1–5, June 2009.

[99] Aubrey-Derrick Schmidt, Frank Peters, Florian Lamour, Christian Scheel, Seyit Çamtepe, and Sahin Albayrak. Monitoring smartphones for anomaly detection. Mobile Networks and Applications, 14:92–106, 2009. doi:10.1007/s11036-008-0113-x.

[100] Blake Stimac. Virus alert: Windows Mobile 6.5 virus found. http://www.intomobile.com/2010/04/15/virus-alert-windows-mobile-6-5-virus-found/, August 2010.

[101] Cisco Systems. Demystifying cloud computing: a three-minute tutorial. http://www.cisco.com/web/offer/fedbiz07/july2009/index.html, July 2009.

[102] Deepak Venugopal, Guoning Hu, and Nicoleta Roman. Intelligent virus detection on mobile devices. In Proceedings of the 2006 International Conference on Privacy, Security and Trust: Bridge the Gap Between PST Technologies and Business Services, PST ’06, pages 65:1–65:4, New York, NY, USA, 2006. ACM.

[103] Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, and Keying Ye. Probability & Statistics for Engineers & Scientists. Pearson Prentice Hall, 8th edition, 2007, p. 236.

[104] Xiaoyun Wang and Hongbo Yu. How to break MD5 and other hash functions. In Ronald Cramer, editor, Advances in Cryptology - EUROCRYPT 2005, volume 3494 of Lecture Notes in Computer Science, pages 561–561. Springer Berlin / Heidelberg, 2005. doi:10.1007/11426639_2.

[105] Oli Warner. What really slows Windows down. http://thepcspy.com/read/what_really_slows_windows_down/, September 2006.

[106] Joe Wells. A radical new approach to virus scanning. Technical report, CyberSoft, Inc., 1999.

Appendix A

Number of events processed         99999.7     99998        99994.2    99991.2    99921.8    99330.1    93426.7
Number of files processed          77691       50000        10000      5000       1000       100        9.5
Mean file size generated           977.002     976.4713     973.6049   977.5164   971.258    974.5984   927.2371
Median file size generated         677.3671    676.6826     672.2553   676.6572   659.8002   713.8306   813.637
Max file size generated            11539.964   10785.0526   9824.4234  9336.1715  7400.5678  5324.6606  2597.6249
Proportion of file modifications   0           0            0          0          0          0          0
Mean file size scanned             977.002     976.4713     973.6049   977.5164   971.258    974.5984   927.2371
Median file size scanned           677.3671    676.6826     672.2553   676.6572   659.8002   713.8306   813.637
Max file size scanned              11539.964   10785.0526   9824.4234  9336.1715  7400.5678  5324.6606  2597.6249
Un-scanned accesses                0           0            0          0          0          0          0
Cache Hit Rate                     22.31%      50.00%       90.00%     95.00%     99.00%     99.90%     99.99%
Time for AV scanning (sec.)        2325464.1   1495301.15   298412.16  149908.66  29661.74   3015.74    300.18
Time for non-AV activities (sec.)  6673.1      6663.59      6663.27    6663.91    6657.84    6608.88    6230.54
Total time (sec.)                  2332137.21  1501964.74   305075.42  156572.57  36319.57   9624.63    6530.72
AV Overhead                        34848.69%   22440.06%    4479.49%   2249.56%   445.51%    45.63%     4.83%
Average inter-access time          0.067       0.067        0.067      0.067      0.067      0.067      0.067

Table A.1: Raw data from Figure 4.6.
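The last three timing rows in these tables are derived from the two measured times: the total time is the sum of the AV-scanning and non-AV times, and, consistent with the tabulated values, the AV overhead is the scanning time expressed as a percentage of the non-AV time. A minimal sketch of that arithmetic in Python, using the first column of Table A.1 (the variable names are illustrative, not taken from the simulator; small discrepancies against the tabulated values come from rounding in the table):

    # Recompute the derived rows of Table A.1 from the measured ones,
    # using the first (22.31% cache hit rate) column.
    av_time = 2325464.1     # Time for AV scanning (sec.)
    non_av_time = 6673.1    # Time for non-AV activities (sec.)

    total_time = av_time + non_av_time            # ~2332137.2 (table: 2332137.21)
    av_overhead = 100.0 * av_time / non_av_time   # ~34848.3%  (table: 34848.69%)

    print("Total time:  %.2f sec." % total_time)
    print("AV overhead: %.2f%%" % av_overhead)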

Number of events processed         97.8      196.3     490.8     985.6     9896.4    99183.6   983779.3
Number of files processed          49.8      50        50        50        50        50        50
Mean file size generated           93.5559   99.2755   90.5267   89.4505   95.0418   92.7249   94.8326
Median file size generated         70.2959   75.4199   63.887    67.4278   73.7681   65.9585   69.502
Max file size generated            368.1267  386.5154  395.7792  453.9882  457.671   416.767   408.3396
Proportion of file modifications   0         0         0         0         0         0         0
Mean file size scanned             93.5559   99.2755   90.5267   89.4505   95.0418   92.7249   94.8326
Median file size scanned           70.2959   75.4199   63.887    67.4278   73.7681   65.9585   69.502
Max file size scanned              368.1267  386.5154  395.7792  453.9882  457.671   416.767   408.3396
Un-scanned accesses                0         0         0         0         0         0         0
Cache Hit Rate                     49.08%    74.53%    89.81%    94.93%    99.49%    99.95%    99.99%
Time for AV scanning (sec.)        154.6     160.66    152.46    151.55    158.6     174.27    353.18
Time for non-AV activities (sec.)  6.59      13.2      33.48     66.06     659.77    6605.13   65570.5
Total time (sec.)                  161.19    173.86    185.94    217.6     818.37    6779.41   65923.68
Average inter-access time          0.067     0.067     0.068     0.067     0.067     0.067     0.067
AV Overhead                        2352.20%  1221.41%  455.75%   229.57%   24.05%    2.64%     0.54%

Table A.2: Raw data from Figure 4.7.

Table A.3: Raw data from Figures 4.8 and 4.9. [The cell values of this rotated table did not survive extraction in a recoverable order. For each simulation run it records the same metrics as Tables A.1 and A.2, plus the percentage of scan requests served by Kaspersky, VirusChief, and VirusTotal, and the percentage left unscanned.]

Table A.4: Raw data from Figure 4.10. [Cell values not recoverable from the extraction. For each simulation run it records the Table A.1 metrics plus the absolute number of file modifications and the number of modify-then-use events, with the proportion of file modifications varied across runs.]

Table A.5: Raw data from Figure 4.11. [Cell values not recoverable from the extraction. For each simulation run it records the Table A.1 metrics, with the average inter-access time varied across runs.]

Table A.6: File size characteristics of Android testing data set. [The per-category cell values did not survive extraction in a recoverable order. For each of 21 application categories (Medical, Tools, Finance, Media and Video, Comics, Productivity, Business, Personalization, News and Magazines, Weather, Photography, Books and Reference, Shopping, Lifestyle, Music and Audio, Sports, Travel and Local, Social, Communication, Health and Fitness, and Education) the table reports the number of apps sampled (46 to 50 per category), the mean, median, minimum, and maximum file sizes in MB, and the percentage of apps smaller than 1 MB, 10 MB, and 20 MB.]
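For reference, summary statistics of the kind reported in Table A.6 can be reproduced from a list of per-category file sizes. A minimal sketch, assuming sizes are given in megabytes; the function name and the sample input below are made up for illustration and are not the thesis data set:

    import statistics

    def summarize(sizes_mb):
        """Table A.6-style summary statistics for one category of apps."""
        n = len(sizes_mb)
        def pct_under(limit_mb):
            return 100.0 * sum(1 for s in sizes_mb if s < limit_mb) / n
        return {
            "Number of Apps": n,
            "Mean Size (MB)": round(statistics.mean(sizes_mb), 2),
            "Median Size (MB)": round(statistics.median(sizes_mb), 2),
            "Minimum Size (MB)": min(sizes_mb),
            "Maximum Size (MB)": max(sizes_mb),
            "% < 1 MB": pct_under(1.0),
            "% < 10 MB": pct_under(10.0),
            "% < 20 MB": pct_under(20.0),
        }

    # Illustrative input only: a handful of hypothetical APK sizes (MB).
    print(summarize([0.4, 0.9, 1.6, 2.3, 5.7, 12.8]))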