Studying the Security-Centric Intelligence on Android Malware Detection

Xiao Chen

A Thesis

Submitted in fulfilment of the requirements for the Degree of

Doctor of Philosophy

Faculty of Science, Engineering and Technology

Swinburne University of Technology

Australia

June 2020

Abstract

Android malware detection has long been a critical challenge. Static analysis and machine learning based solutions have yielded promising performance on automatic malware detection. However, most existing research has not considered a real-world issue in malware detection: malware developers may be aware of the detection systems and take countermeasures against the detection approaches.

Existing works overlooked this issue and assumed that detectors work in a non-adversarial environment, which is unlikely in the real world.

Recent research in machine learning and deep learning has revealed that learning-based systems are vulnerable to carefully crafted adversarial examples.

Extensive efforts have been undertaken to investigate this issue in the computer vision area, but very few works studied its impact on malware detection systems.

Actually, adversaries always exist in malware detection/identification tasks.

For example, static analysis based malware identification suffers from the code obfuscation problem, which is also a countermeasure that malware authors take to bypass detection. This thesis presents a study of Android malware detection from both the detector's and the adversary's perspectives. Specifically, from the detector's perspective, we analyze how malware evolves. From the adversary's perspective, we study how malware camouflages itself and evades detection by both non-machine-learning and machine-learning-based detectors.

We first carry out a fine-grained and in-depth phylogenetic analysis of malware variants, to study the popular evolution patterns of malware samples.

We propose a method that initially clusters malware samples of a family into variant sets, and then systematically reveals the phylogenetic relationships among those sets for a more in-depth malware evolution analysis. Moreover, we summarise evolutionary patterns that shed light upon how malware samples evolve to bypass detection by anti-virus techniques. Such analysis can be of great benefit in understanding newly emerged malware variants and evolution-inspired evasion attacks.

Secondly, we propose a novel attacking method that generates adversarial examples of Android malware that can evade detection by current detection models. To this end, we propose a method of applying optimal perturbations onto an Android APK that can successfully deceive machine learning detectors. We develop an automated tool to generate the adversarial examples without human intervention.

Lastly, we demonstrate how to make use of a built-in voice assistant (VA) to compromise an Android phone while evading detection by anti-malware techniques. We propose a novel attack approach that crafts the user's activation voice by silently listening to the user's phone calls. Once the activation voice is formed, the attack can select a suitable occasion to launch itself. The attack embodies a machine learning model that learns proper attacking times to prevent itself from being noticed by the user. By raising awareness, we urge the community and manufacturers to revisit the risks of VAs and subsequently revise the activation logic to be resilient to the style of attacks proposed in this work.

We believe that understanding how malware samples evolve and how adversarial examples are crafted will help researchers and malware analysts better address the issue of detecting malware in an adversarial environment.

Acknowledgements

I would like to thank my supervisors Prof. Yang Xiang, A/Prof. Jun Zhang and Prof. Wanlei Zhou for their support and patience. I would like to thank my family for their love and understanding. I would like to thank Dr. Sheng Wen for his comments and encouragement. I would like to thank Chaoran, Lihong, Derek, and Junchen for the sleepless nights we spent working together to catch the deadlines.

I would like to thank all my friends from Swinburne and Deakin. The climb may be long, but the view is worth it.

Xiao Chen

July 10, 2019

Declaration

This is to certify that this thesis contains no material which has been accepted for the award of any other degree or diploma and that to the best of my knowledge this thesis contains no material previously published or written by another person except where due reference is made in the text of the thesis.

Xiao Chen

July 10, 2019

To my family.

Contents

Abstract ...... ii

Acknowledgements ...... v

Declaration ...... vi

List of Tables ...... xiii

List of Figures ...... xv

1 Introduction 1

1.1 Motivations ...... 1

1.2 Objectives and Contributions ...... 3

1.3 Thesis Organization ...... 5

2 Literature Review 6

2.1 Android Security Threats ...... 6

2.1.1 Malware Types ...... 8

2.1.2 Malware Penetration Techniques ...... 10

2.1.3 Malware Obfuscation Techniques ...... 11

2.2 Android Malware Detection ...... 12

2.2.1 Static Approaches ...... 13

2.2.2 Dynamic Approaches ...... 25

2.3 Android Malware Evasion ...... 31

2.3.1 Evading Traditional Non-Machine Learning Based Detection 31

2.3.2 Evading Machine Learning Based Detection ...... 33

2.4 Android Malware Evolution ...... 38

3 Investigating The Evolution Pattern of Android Malware 43

3.1 Introduction ...... 44

3.2 Approach ...... 48

3.2.1 Variant Sets Generation ...... 48

3.2.2 Formula Construction of the Variant Sets ...... 53

3.2.3 Phylogenetic Network Construction ...... 55

3.3 Evaluation ...... 57

3.3.1 Dataset ...... 57

3.3.2 Insights Into Variant Clustering ...... 60

3.3.3 Representativeness of the Variant Formula ...... 69

3.4 Inspection on Malware Evolution ...... 71

3.4.1 Phylogenetic Relationship ...... 71

3.4.2 Evolution Analysis ...... 75

3.5 Summary ...... 86

4 Repackaging Malware for Evading Machine-Learning Detection 87

4.1 Introduction ...... 89

4.2 Android Application Package ...... 94

4.3 Targeted Systems and Attack Scenarios ...... 96

4.3.1 MaMaDroid ...... 96

4.3.2 Drebin ...... 100

4.3.3 Attack Scenarios ...... 101

4.4 Attack on MaMaDroid ...... 102

4.4.1 Attack Algorithm ...... 102

4.4.2 APK Manipulation ...... 108

4.4.3 Experiment Settings ...... 113

4.4.4 Experiment Results ...... 117

4.5 Attack on Drebin ...... 124

4.5.1 Attack Algorithm ...... 124

4.5.2 APK Manipulation ...... 126

4.5.3 Experiments & Evaluations ...... 127

4.6 Discussion ...... 128

4.6.1 Comparison with Existing works ...... 128

4.6.2 Why We Are Successful ...... 129

4.6.3 Applicability of Our Attack ...... 130

4.6.4 Transferability ...... 131

4.6.5 Artifacts in Our Attack ...... 133

4.6.6 Defending Methods ...... 134

4.7 Summary ...... 137

5 A Stealthy Attack on Android Phones without Users' Awareness 138

5.1 Introduction ...... 139

5.2 Attacking Model: Vaspy ...... 144

5.3 Proof-of-Concept: A Spyware ...... 145

5.3.1 Activation Voice Manipulation ...... 145

5.3.2 Attacking Environment Sensing ...... 147

5.3.3 Post Attacks and Spyware Delivery ...... 151

5.4 Evaluation ...... 153

5.4.1 Evaluation of the Attacking Environment Sensing Module ...... 153

5.4.2 Evaluation of Real World Attack ...... 155

5.4.3 Capability of Attack ...... 162

5.4.4 Runtime Cost Analysis ...... 164

5.4.5 Resistance to Anti-Virus Tools ...... 167

5.5 Discussion ...... 169

5.5.1 Essential Factors for the Successful Attack ...... 169

5.5.2 Defense Approaches for Vaspy ...... 170

5.5.3 Lessons from This Work ...... 171

5.6 Summary ...... 171

6 Conclusion and Future Works 173

6.1 Conclusion ...... 173

6.2 Future Works ...... 175

References 177

List of Tables

3.1 Best Silhouette Values Achieved for Malware Families and Their Corresponding Distance Thresholds ...... 62

3.2 Malware samples from the same variant set and various labels provided by different Anti-Virus vendors. The malware samples from variant set a are depicted by a_0 to a_n. The Anti-Virus vendors are depicted by v_0 to v_m. l_nm denotes the variant label provided by vendor v_m for the malware a_n. The variant label l_an is unified from the labels given by the different Anti-Virus vendors ...... 67

3.3 Label Consistency for Four Well-Known Anti-Virus Vendors. The average consistencies are all above 80%, with the minimum no less than 50% ...... 69

4.1 Overview of Drebin feature set ...... 100

4.2 Attack Scenarios ...... 117

4.3 Number of features added in each set ...... 128

4.4 Comparison with existing works (Evasion Rate) ...... 129

5.1 Movement Intensity Features ...... 151

5.2 Average Accuracy Performance ...... 155

5.3 Post attack commands against VAs...... 163

List of Figures

3.1 Relationship between a malware family and its variant sets. The malware samples in the same column (colour) share similar code. The malware samples from different columns inside a family perform similar malicious behaviours ...... 49

3.2 Variant Set Generation Process Overview. 1) A distance matrix is generated by similarity analysis of a set of APKs from the same malware family, utilizing SimiDroid. 2) By applying the UPGMA clustering algorithm, a malware family dendrogram is generated, representing the hierarchical relationship among malware samples of the family. 3) After the distance threshold is determined, malware samples are then clustered into smaller clades, and each clade represents a variant set. Four variant sets (a, b, c, d) are generated in the example ...... 52

3.3 Silhouette and Method-Based Distance Threshold...... 54

3.4 Variant sets generation for Fakebank and Commplat families ...... 54

3.5 Formula Construction of the Variant Sets ...... 55

3.6 Malware Sample Distribution of Six Malware Families From 2008 to 2018. Different malware families have different life cycles. The number of malware samples in opfake, smsagent and autoins increased dramatically in recent years. On the contrary, the number of malware samples from droidkungfu decreased and then disappeared. Utchi and blouns were only discovered in specific years ...... 61

3.7 Radial Phylogram for Malware Family Boqx, Droidkungfu, and Kmin. Boqx (n = 80) achieves its highest silhouette coefficient 0.254 at the distance of 0.65. This family cannot be easily partitioned into different clusters. Droidkungfu (n = 82) derives its best silhouette coefficient at the distance of 0.3. Relatively clear groups of malware samples can be observed through its radial phylogram. Kmin (n = 140) can achieve a high silhouette value 0.928 at the distance of 0.2, and malware samples are more likely to be well-clustered into variant sets ...... 65

3.8 Standard Deviation of the Creation Date for Variant Sets. The standard deviation of the last modification date is three months and 18 days. 60% of the malware samples from the same variant set are created within a month, and 80.8% are created no later than four months ...... 66

3.9 Example PhyloNets for malware family kmin and geinimi. Each cluster represents a variant set. A smaller letter indicates an earlier-created variant set. Two variant sets (M and N, M ≠ N) are connected only if one of the PhyloScores (PS(M, F_N) or PS(N, F_M)) is larger than 0.25 ...... 72

3.10 Relations between variant sets. a) Subset and superset: B includes most of the essential genes from A. b) Intersecting: A and B share a relatively large number of the same essential genes. c) Non-intersecting: A and B share general family genes, but they are not highly related regarding their own important genes ...... 74

4.1 File structure of APK. AndroidManifest.xml declares the essential information; classes.dex contains the Dalvik Bytecode; resources.arsc holds the compiled resources in binary format; META-INF, lib, assets, and res folders include the metadata, libraries, assets, and resources of the application, respectively ...... 91

4.2 Process of feature extraction in MaMaDroid - source code . . . . . 98

4.3 Process of feature extraction in MaMaDroid - call graph ...... 98

4.4 Process of feature extraction in MaMaDroid - call sequence . . . . . 99

4.5 Process of feature extraction in MaMaDroid - Markov chain . . . . 99

4.6 The attack process: the dashed lines show the process of our attack algorithm, and the solid lines illustrate our APK manipulation procedure ...... 113

4.7 The evasion rate and average distortion of adversarial examples generated by JSMA in the family mode ...... 119

4.8 The evasion rate and average distortion of adversarial examples generated by C&W in the family mode ...... 120

4.9 The evasion rate and average distortion of adversarial examples generated by JSMA in the package mode ...... 121

4.10 The evasion rate and average distortion of adversarial examples generated by C&W in the package mode ...... 122

4.11 Comparison of applying the simple manipulation strategy and the sophisticated manipulation strategy in the family mode by C&W ...... 125

4.12 The average distortion and evasion rate of adversarial examples generated by JSMA on Drebin ...... 126

4.13 Empirical study of transferability on different numbers of modifiable features, where S1 includes the system-provided permissions, and S2 includes both system-provided and user-defined permissions ...... 132

4.14 F-measure of benign and malicious applications with the proposed adversarial training defending method on MaMaDroid, with varying percentage of adversarial malware samples added in the training set ...... 135

4.15 Evasion rate of applying the ensemble learning defence mechanism on MaMaDroid family mode in two scenarios (FTB and F). Base: the evasion rate before attack as a baseline; No Defence: attack the model without implementing the defence; Defence - Ensemble Training: attack the model with the ensemble learning method implemented as defence, in which each of 10 classifiers is trained with 1/10 of the training samples; Defence - Ensemble Feature: attack the model with the ensemble learning method implemented as defence, in which each of 10 classifiers is trained with 1/10 of the features ...... 136

5.1 The workflow of an example spyware based on Vaspy. Incoming/outgoing calls are monitored and recorded, and the activation voice is then synthesised. The user's environment is monitored by built-in sensors to determine a suitable attacking occasion. When launching the attack, text commands can be retrieved from Firebase [6] and converted to speech by a built-in Text-to-Speech (TTS) module in the smartphone ...... 143

5.2 RNN training data pre-processing. (A) raw audio signal as input, which contains the activation keywords; (B) spectrums converted from the raw audio signal; (C) a matrix that contains labeled starting and ending frames of the activation keywords ...... 147

5.3 Framework of Attacking Environment Sensing (AES) module . . . . 148

5.4 Data collected from scenario (a) walking on a quiet road; and (d) taking public transportation ...... 149

5.5 A snapshot of the proof-of-concept spyware. After the player clicks the start button, the rocket rises when the player blows or screams into the microphone. The rising speed depends on the volume of sound that the microphone receives ...... 152

5.6 Overview of the data collected in real-world scenarios...... 154

5.7 Evaluation Result of the Attacking Environment Sensing in scenario (a) ...... 156

5.8 Evaluation Result of the Attacking Environment Sensing in scenario (b) ...... 157

5.9 Evaluation Result of the Attacking Environment Sensing in scenario (c) ...... 158

5.10 Evaluation Result of the Attacking Environment Sensing in scenario (d) ...... 159

5.11 Evaluation Result of the Attacking Environment Sensing in scenario (e) ...... 160

5.12 Evaluation Result of the Attacking Environment Sensing in scenario (f) ...... 161

5.13 Power consumption of four phases: P1 (Phone call state monitoring), P2 (Recording and synthesising activation command), P3 (Environment monitoring), and P4 (Attacking via the speaker) ...... 165

5.14 Memory consumption of four phases: P1 (Phone call state monitoring), P2 (Recording and synthesising activation command), P3 (Environment monitoring), and P4 (Attacking via the speaker) ...... 165

5.15 A snapshot of the detection result in VirusTotal ...... 168

Chapter 1

Introduction

1.1 Motivations

Over the last ten years, Android's market share has increased from 0.66% to 74.85% [10]. Due to the convenience of a wide range of connectivity options, such as Wi-Fi, Bluetooth, NFC, and 3G/4G mobile networks, people tend to use smartphones for social activities, entertainment, and even work. According to [11], Google Play, the official Android application (app for short) store, hosts 2.7 million apps for users to download, covering almost every aspect of people's daily life, such as communicating with friends and making bank transactions.

The always-on Internet connectivity and personal information such as contacts, messages, personal credentials, and bank transactions have attracted more and more attention from malware developers and cyber attackers [88]. The open-source nature of Android and the mechanism of allowing installation of third-party apps on Android devices make Android smartphones more likely to have malware installed unintentionally [57]. The existence of third-party application stores also gives malware more opportunities to be downloaded by Android users.

According to a report from Symantec [9], by 2018 the total number of discovered Android malware variants had reached 28,748, spreading over 436 malware families. The number of discovered malware variants has doubled since 2015 [8]. Android attacks are becoming stealthier. For example, malware developers have started to obfuscate code to bypass signature-based anti-virus software. Additionally, before launching their attacks, some malware can now detect whether it is running on a real phone or in a sandbox that security researchers use to monitor it.

Many research works have been conducted on malware detection, including traditional signature/rule based detection methods [128, 117, 31] and machine learning based detection approaches [25, 102, 20, 108, 22]. While most of these works have claimed superior detection performance in a non-adversarial environment, their effectiveness in a more realistic adversarial environment has not been evaluated. Malware detection typically involves two parties: the detector, which aims to identify malware samples from a set of unlabelled apps, and the adversary, which intends to camouflage malware samples from being detected. For example, the detector extracts common patterns from malware samples to use as indicators (e.g., signatures) for identification. The adversary tries to hide its malicious activities from detection through malware evolution (e.g., obfuscation). While the research community mainly focuses on improving the performance of the detectors, the existence of the adversary has largely been overlooked.

1.2 Objectives and Contributions

As a fundamental step in evaluating and improving malware detection approaches under an adversarial environment, in this thesis we investigate malware detection in such an environment from both the detector's perspective and the adversary's perspective. Specifically, on the one hand, we explore the evolution of malware samples (cf. Chapter 3), which could help researchers better understand how malware samples evolve and assist them in designing effective anti-malware solutions. On the other hand, we propose and demonstrate how malware samples bypass machine-learning based detection (cf. Chapter 4) and non-learning based detection (cf. Chapter 5).

The major contributions of the thesis are outlined as follows:

• We design and implement PhyloNet, an approach for clustering malware samples from the same malware family into variant sets, and systematically identifying the phylogenetic relationships between variant sets inside a malware family (cf. Chapter 3).

• We demonstrate critical insights from the variant set clustering. We provide statistics at the variant level and extract variant labels from various anti-virus vendors. We also discuss the consistency issues of variant labeling (cf. Chapter 3).

• We conducted an in-depth analysis by manually vetting malware samples from different variant sets based on the phylogenetic relationships derived previously. We gained some interesting evolutionary insights and showed that malware evolution is closer to a directed acyclic net structure than to a hierarchical tree structure (cf. Chapter 3).

• We propose an innovative method of crafting adversarial examples against recent machine learning based detectors for Android malware. These detectors mainly collect features (either syntactic or semantic ones) from Dalvik bytecode to capture Android malware behaviors. This contribution is distinguishable from the existing works [45, 46, 51, 77] that can only target/protect detectors relying on syntactic features (cf. Chapter 4).

• We designed an automated tool to apply the method to real-world malware samples. The tool calculates the perturbations, modifies source files, and rebuilds the modified APK (cf. Chapter 4).

• We propose a novel attacking approach that can stealthily hack into Android phones via built-in VAs without users' awareness, and is resilient to current anti-malware solutions in industry as well as academia (cf. Chapter 5).

1.3 Thesis Organization

The rest of the thesis is organized as follows:

Chapter 2 provides an overview of recent research in Android malware detection and security analysis of these detection approaches.

Chapter 3 presents a systematic study on Android malware evolution.

Chapter 4 details an evasion attack against current machine learning based

Android malware detection systems.

Chapter 5 demonstrates a stealthy attack on Android smartphones via built-in voice assistants.

Chapter 6 concludes this thesis, and presents future research directions.

Chapter 2

Literature Review

This chapter presents a survey of relevant research works on the security of Android malware detection systems. The articles reviewed cover malware detection techniques, malware evasion techniques, and studies of malware evolution. Android security threats are also introduced in this chapter as preliminary background knowledge.

2.1 Android Security Threats

Android is an open-source operating system based on the Linux kernel, with several layers to facilitate the running of applications. The bottom layer is the Linux Kernel Layer. On top of the kernel layer lies a set of C/C++ native libraries and the Dalvik Virtual Machine [66]. The Dalvik Virtual Machine is a Java runtime that is created when running Java code, and it is optimized for the limited resource availability of the mobile platform. The Dalvik Virtual Machine relies on the underlying Linux kernel to handle low-level functionalities and executes .dex files, which are transformed from Java classes using the SDK tools. On top of the Dalvik VM layer lies the Application Framework Layer, which includes the Java core libraries. The top layer is the Application Layer, which contains the Java-based applications created on top of the Application Framework Layer.

The Android platform provides a set of security features to ensure that user data, apps, and the device are protected, including application sandboxing, permission-based access control, and secure Inter-Process Communication (IPC) [66]:

• Application Sandboxing: Android prevents an app from interfering with other apps or system services, as each process has its own copy of the Dalvik VM.

• Permission-based Access Control: Android provides a permission-based security model at the application framework layer to restrict access to important system features and data, such as the network and location. The required permissions need to be declared and shown to the user while installing the app. Android divides permissions into four security levels: Normal, Dangerous, Signature, and SignatureOrSystem. The Dangerous permissions need to be granted by the user at the time of installation, while the other permissions can be automatically granted as long as they match the criteria.

• Secure Inter-Process Communication: Android prevents an application from directly accessing another application's memory space and data. Such access needs to go through the IPC mechanism.

Despite the security mechanisms the Android system provides, many malware samples are still reported to be installed on Android users' phones and cause tremendous losses. In this section, we take a closer look at Android malware characteristics, including their purposes and the penetration techniques they employ.

2.1.1 Malware Types

According to the characteristics and purpose of the malware, Android malware can be classified into Virus, Trojan, Worm, Spyware, Backdoor, Rootkit, and Botnet [110].

Virus: A virus enters the smartphone system without the user's knowledge, then starts to duplicate itself and conduct the malicious activities it was programmed to do.

Trojan: Trojans usually embed themselves in benign apps, or masquerade as benign apps. Some Trojans are embedded in repackaged apps that provide the same functionality as the original app. They may try to steal confidential information of the user, such as passwords, credit card numbers, and contacts, or send premium-rate SMS messages without the user's acknowledgment.

Worm: A worm spreads itself over the network or removable media and infects new vulnerable victims. For example, after infecting a victim, it may send an SMS with a text like "is this your photo" and a link to the malicious APK package to all of the user's contacts.

Spyware: Spyware stealthily collects the user's personal information. The presence of spyware is typically hidden from users and is difficult to detect. Spyware can gather private information, monitor the user's activity, scan files stored on the smartphone, and change the smartphone's settings.

Backdoor: A backdoor tries to gain root-level privilege by using root-level exploits. It then "opens" a tunnel for directly accessing the victim device, bypassing all security procedures installed on it. It can then install any other malware into the victim system.

Rootkit: A rootkit allows the attacker to gain full control of a system by exploiting a known vulnerability or modifying the OS's kernel. It then hides its intrusion and maintains privileged access.

Botnet: A botnet compromises devices and controls them via a remote server. It can either send out the information stored on the device or organize a DoS attack using these compromised devices.

2.1.2 Malware Penetration Techniques

This section discusses and summarizes popular penetration techniques used by Android malware. These include repackaging popular apps, drive-by download, and dynamic payloads [50].

Repackaging: Repackaging is the process of adding a malicious payload to popular free/paid apps from one app store and distributing them through other third-party app stores. The attacker first disassembles the resources and dex files using apktool [3], and then inserts malicious Dalvik bytecode into the original Dalvik bytecode. The modified bytecode is then assembled into an APK package using apktool. Repackaging is one of the most popular penetration techniques currently in use [92].

Drive-by Download: The victim downloads the malware without understanding the consequences and installs it on their device. To perform this kind of attack, the attacker usually sends malicious URLs to users, employing social engineering or aggressive advertising to make users mistakenly download malicious apps. The downloaded malware may disguise itself as a benign app and obtain permission from the victim to install itself [155].

Dynamic Payload: Encrypted malicious APK/jar files can be embedded in the payload of a benign app. After the benign app is installed, the malicious payload decrypts itself and starts its installation process. It usually disguises itself as an essential update to make the user grant the permission to install it [82].

2.1.3 Malware Obfuscation Techniques

Malware developers usually leverage obfuscation techniques to hide the malware's malicious behaviors from analysis [53]. These techniques include code obfuscation (e.g., identifier renaming, string encryption, junk code insertion, and Java reflection) and packing.

Identifier renaming: The names of identifiers, such as variable names and function names, are usually meaningful. Following naming conventions helps developers of legitimate apps understand and maintain the code. In malware samples, however, renaming the identifiers to random characters degrades the readability of the code, making manual analysis harder. This approach only affects human analyzers, not static analysis tools such as control-flow-based analysis.
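The effect can be sketched with a small, hypothetical example (all names are invented for illustration): the two methods below are behaviourally identical, but the renamed version carries no hint of its purpose, which is exactly why control flow based tools are unaffected while human review suffers.

```java
// Hypothetical illustration of identifier renaming: the same routine
// before and after renaming. Behaviour is identical; only readability
// for a human analyst is lost.
public class RenamingExample {

    // Before renaming: intent is obvious from the names.
    static String buildPremiumSmsBody(String victimId) {
        return "SUBSCRIBE " + victimId;
    }

    // After renaming: same logic, meaningless identifiers.
    static String a(String b) {
        return "SUBSCRIBE " + b;
    }

    public static void main(String[] args) {
        // Both versions produce identical output: the control flow is
        // unchanged, so static analysis sees the same program either way.
        System.out.println(buildPremiumSmsBody("user42").equals(a("user42")));
    }
}
```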

String encryption: Strings are critical indicators for detecting malware. For example, the URLs of malicious servers are stored in strings and are usually used as fingerprints for malware detectors to identify maliciousness. In malware samples, such strings are usually encrypted to protect them from code scans and are restored at run-time. As a result, string encryption can hinder static code analysis.
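A minimal, hypothetical sketch of the idea (the XOR key, URL, and class name are invented; real malware uses stronger schemes and ships only the ciphertext literal in the binary): the plaintext URL never appears at the call site, only the decrypt routine does.

```java
import java.util.Base64;

// Hypothetical sketch of string encryption: a server URL is stored in
// encrypted form and recovered only at run-time, so a static string scan
// of the binary would not see the plaintext. For the sake of a runnable
// example, encrypt() is included here; real malware would ship only the
// pre-computed ciphertext literal.
public class StringEncryptionExample {

    static final String ENCRYPTED = encrypt("http://evil.example.com");

    // XOR each byte with a single-byte key, then Base64-encode.
    static String encrypt(String s) {
        byte[] b = s.getBytes();
        for (int i = 0; i < b.length; i++) b[i] ^= 0x2A;
        return Base64.getEncoder().encodeToString(b);
    }

    // Called just before the string is needed; XOR is its own inverse.
    static String decrypt(String enc) {
        byte[] b = Base64.getDecoder().decode(enc);
        for (int i = 0; i < b.length; i++) b[i] ^= 0x2A;
        return new String(b);
    }

    public static void main(String[] args) {
        // The plaintext exists in memory only at run-time.
        System.out.println(decrypt(ENCRYPTED));
    }
}
```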

Junk code insertion: Code that performs no operation, or will never be executed, can be inserted into the original app to make code analysis more complicated. This approach not only impacts the efficiency of a human analyzer but may also cheat static analysis tools.

Java reflection: Reflection is an advanced feature of Java that allows developers to inspect and modify run-time attributes of classes and methods. The actual classes or methods that are invoked can only be determined at run-time; therefore, reflection can evade static analysis approaches.
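The evasion mechanism can be sketched as follows (a hypothetical, benign example: in real malware the class and method name strings would typically arrive encrypted or be downloaded, so a statically built call graph contains no direct edge to the sensitive API):

```java
import java.lang.reflect.Method;

// Hypothetical sketch of reflection-based invocation: the target class
// and method are chosen from strings at run-time, so no direct call to
// them appears in the bytecode.
public class ReflectionExample {

    public static void main(String[] args) throws Exception {
        // In real malware these strings are often encrypted or fetched
        // remotely; they are plain literals here only to keep the
        // example runnable.
        String className = "java.lang.Runtime";
        String methodName = "availableProcessors";

        Class<?> clazz = Class.forName(className);
        // getRuntime() is static, hence the null receiver.
        Object runtime = clazz.getMethod("getRuntime").invoke(null);
        Method m = clazz.getMethod(methodName);

        // The target method is resolved and invoked only at run-time.
        System.out.println(m.invoke(runtime));
    }
}
```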

Packing [55, 137, 152]: Packing is a widely-adopted technique that protects apps from reverse engineering. App packers encrypt and pack APK files into an encrypted original APK and a wrapper APK. When the APK is executed at run-time, the wrapper decrypts the original APK and loads it into memory; the original APK is executed thereafter. As the packer hides the original APK behind a wrapper APK, apps can only be unpacked at run-time, which makes reverse engineering of the app harder.

2.2 Android Malware Detection

Based on whether the candidate app needs to be executed, Android malware detection and analysis techniques can be categorized into static approaches and dynamic approaches. Static approaches are based on analyzing the app's static features, such as the app's components, the permissions requested by the app, and the code itself.

2.2.1 Static Approaches

Based on the technique employed, we categorize static approaches into signature-based approaches and machine learning based approaches. Signature-based methods extract syntactic or semantic features to find signatures that match an existing malware database. Machine learning approaches extract features from the malware APK files and from other sources, such as the application store from which the app is downloaded [99]. Optionally, the extracted features may need to be processed to remove redundant and ineffective ones. The processed features are then used to train a classifier by applying various machine learning algorithms.

Signature Based Approaches

Batyuk et al. [30] argued that the information shown to the user when installing an Android app is not detailed enough to judge whether the app poses a threat of leaking the user's privacy. Before an app is installed, a list of permissions required by the app, obtained from the package's metadata, is displayed. However, the permission list is very coarse-grained and only provides the user with general information on the resources or hardware functionality that the app is allowed to access. No information is provided on what exactly the application intends to do with the obtained data. They proposed decompiling the bytecode with Apktool and using regular expressions to match and analyze the obtained source code tree. The proposed system can warn users what information is collected by an app and what risks installing it may pose. Furthermore, in some circumstances, it can also mitigate the threats by applying a patch to the decompiled binary without impacting its core functionality. The proposed method was evaluated on 1,865 top free Android apps from the Android Market, of which 167 apps access private identifiers such as the IMEI and phone number, and 114 of those write the private information to a stream immediately after reading it.
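The core of this analysis is pattern matching over the decompiled source tree. The following sketch illustrates the idea; the API names and regular expressions are illustrative assumptions, not the patterns actually used by Batyuk et al.

```python
import re

# Hypothetical patterns for privacy-sensitive Android API calls; a real
# scanner would cover many more sources of private identifiers.
SENSITIVE_PATTERNS = {
    "IMEI": re.compile(r"getDeviceId\s*\("),
    "phone number": re.compile(r"getLine1Number\s*\("),
    "SIM serial": re.compile(r"getSimSerialNumber\s*\("),
}

def scan_source(source):
    """Return the private identifiers a decompiled source snippet accesses."""
    return [name for name, pat in SENSITIVE_PATTERNS.items()
            if pat.search(source)]

# Example: a decompiled snippet that reads the IMEI.
snippet = 'String id = telephonyManager.getDeviceId();'
```

A warning can then be raised for each matched identifier before installation.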

Feng et al. [68] proposed a static signature-based approach to detect Android malware that belongs to known malware families. The proposed approach, named Apposcopy, incorporates a high-level language to specify signatures that describe the semantic characteristics of malware families. It extracts the data-flow and control-flow properties of Android apps by performing in-depth static analysis. Then, it uses the extracted information to identify whether a given app belongs to a known malware family. The authors tested Apposcopy on three datasets: 1) a known-malware dataset consisting of 1,027 malware samples from the Android Malware Genome project [2]; 2) a Google Play dataset including 11,215 Android apps downloaded from the official Google Play Store; and 3) an obfuscated-app dataset produced by applying the ProGuard obfuscator to the code of dataset 1). Though Apposcopy achieves high accuracy when identifying unobfuscated malware samples, it may be defeated by some high-level obfuscation techniques such as dynamic code loading. Besides, since it performs in-depth static analysis to uncover the semantic properties of an app, the large computational overhead makes it unsuitable for real-time detection.

Faruki et al. [65] proposed an automatic signature generation approach, AndroSimilar, which extracts syntactic features to detect Android malware. Unlike traditional signature-based malware detection mechanisms, AndroSimilar can detect malware obfuscated with string encryption, method renaming, junk method insertion, control-flow changes, etc. As a result, it can detect unknown variants of existing malware families. AndroSimilar is based on Similarity Digest Hash (SDHash), which is used in digital forensics to identify similar documents. The authors argued that completely unrelated apps should have a low probability of having common features; when two unrelated apps do share some features, using such features is likely to produce false positive detection results. Fixed-size byte-sequence features are extracted based on their entropy values, and popular features are then selected among them according to their rarity in the neighborhood. Signatures of known malware families are generated as a representative database, and a similarity score is calculated for each app to be detected. If the similarity score of an unknown app against any existing family signature exceeds a threshold, the app is labeled as malicious. The evaluation dataset includes 1,768 malware samples distributed across 49 malware families, downloaded from Contagio Dump [4]. An accuracy higher than 99% was reported when testing on apps downloaded from the Google Play store, and an average accuracy of 63.3% when testing on obfuscated malware, demonstrating a certain capability to detect obfuscated malware.
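The entropy-driven feature extraction and thresholded similarity scoring can be sketched as follows. This is a heavily simplified, illustrative take: the windowing scheme, feature keys, and Jaccard-style score below are assumptions for demonstration, not AndroSimilar's actual SDHash algorithm.

```python
import math
from collections import Counter

def entropy(block):
    """Shannon entropy (bits per byte) of a byte block."""
    counts = Counter(block)
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def digest_features(data, window=64):
    """Fixed-size byte-sequence features, keyed by quantized entropy."""
    feats = set()
    for i in range(0, len(data) - window + 1, window):
        block = data[i:i + window]
        feats.add((round(entropy(block), 1), block[:8]))
    return feats

def digest_similarity(a, b):
    """Overlap ratio of the two apps' feature sets, in [0, 1]."""
    fa, fb = digest_features(a), digest_features(b)
    return len(fa & fb) / max(len(fa | fb), 1)

# An obfuscated variant that keeps most byte blocks intact still scores
# above a chosen detection threshold (say, 0.5) against the family sample.
family_sig = bytes(range(256)) * 4
variant = family_sig[:960] + b"\x00" * 64
```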

Zhou et al. [154] proposed DroidMOSS, a prototype that detects app repackaging using semantic file features. More specifically, it extracts the DEX opcode sequence from an app and then generates a signature from it using a fuzzy hashing technique, adding developer certificate information into the signature. The signatures of two apps are compared using the edit distance algorithm to calculate a similarity score. The intuition behind DroidMOSS's use of opcodes alone as a feature is that it might be easy for adversaries to modify operands, but much harder to change the actual opcodes. This approach has several disadvantages. First, it considers only DEX bytecode, ignoring the app's native code and resources (as resources are unchanged most of the time). Second, the opcode sequence does not capture higher-level semantic knowledge; a smart adversary can easily evade this technique by using obfuscations that do not affect the app's behavior, such as junk bytecode insertion, method restructuring, and control flow alteration.
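The comparison step reduces to an edit distance over opcode sequences. A minimal sketch follows; the opcode sequences are hypothetical placeholders, and real DroidMOSS compares fuzzy hashes of the sequences rather than the raw sequences.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def opcode_similarity(orig_ops, repack_ops):
    """Similarity score in [0, 1] between two opcode sequences."""
    dist = edit_distance(orig_ops, repack_ops)
    return 1.0 - dist / max(len(orig_ops), len(repack_ops))

# Hypothetical sequences: the repackaged app inserts one payload call.
original = ["const", "invoke-virtual", "move-result", "return"]
repackaged = ["const", "invoke-virtual", "invoke-static",
              "move-result", "return"]
```

A score above a chosen threshold would flag the pair as a repackaging case.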

Enck et al. [60] proposed the Kirin security service for Android, which performs lightweight certification of applications to mitigate malware at install time. Kirin defines a set of rules over combinations of specific permissions requested by an app that could prove harmful to the user's device. If an app fails to satisfy those security rules, its installation is denied.
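Kirin's install-time check boils down to testing an app's requested permission set against a list of forbidden combinations. A minimal sketch, with illustrative rules rather than Kirin's actual rule set:

```python
# Each rule is a permission combination considered dangerous (illustrative).
RULES = [
    frozenset({"RECEIVE_SMS", "WRITE_SMS"}),
    frozenset({"RECORD_AUDIO", "INTERNET"}),
    frozenset({"ACCESS_FINE_LOCATION", "INTERNET",
               "RECEIVE_BOOT_COMPLETED"}),
]

def allow_install(requested):
    """Deny installation if the requested permissions match any rule."""
    return not any(rule <= requested for rule in RULES)
```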

Machine Learning Based Approaches

Schmidt et al. [119] proposed a framework that performs static analysis on executables to extract their function calls using the readelf command. This command returns detailed information on the relocation and symbol tables of each Executable and Linking Format (ELF) object file. The output of this analysis is the static list of referenced function calls for each system command. These calls are then compared with those of malware executables for classification. They created their benign training set by extracting a number of Linux system commands within Android. The malicious training set was generated by extracting the static lists of function calls from approximately 240 malware samples. They applied several state-of-the-art classifiers in Weka [14] to test the classification performance, using ten-fold cross-validation. The malware samples used to form the malicious training set are not specifically designed for Android's ARM architecture, limiting the contribution of this work.

Shabtai et al. [121] proposed to use machine learning techniques to classify two types of Android applications: games and tools. The features used in their approach are categorized as APK features, XML features, and DEX features. APK features are extracted from the Android app's APK file, a zip archive that holds all code and non-code resources of an app; they include the size of the file, the number of zip entries, the number of files of each file type, etc. XML features are parsed from the XML files used when installing or activating apps; they include features for each element name and attribute name, the permissions in the manifest, the count of XML elements, etc. DEX features are extracted from Dalvik Executables, such as strings, types, classes, prototypes, and methods. The feature selection measures Chi-square, Fisher Score, and Information Gain are applied to select effective features and remove meaningless ones. The approach was tested with several supervised machine learning algorithms, such as Decision Tree, Naive Bayes, Bayesian Networks, and Random Forest. The dataset used in the evaluation consists of 2,850 apps, which are either game apps or tool apps. An accuracy of 91.8% with a 17.2% false positive rate was reported.
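Of the three feature selection measures, Information Gain, for example, scores a binary feature by how much knowing its value reduces the entropy of the class label. A small self-contained sketch with a toy game/tool dataset:

```python
import math

def entropy_of(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(feature, labels):
    """IG of a binary feature: H(labels) minus conditional entropy."""
    n = len(labels)
    gain = entropy_of(labels)
    for value in (0, 1):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        if subset:
            gain -= len(subset) / n * entropy_of(subset)
    return gain

# Toy labels; the first feature perfectly separates the classes, the
# second carries no information (and would be removed).
labels = ["game", "game", "tool", "tool"]
perfect = [1, 1, 0, 0]
useless = [1, 0, 1, 0]
```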

Sahs et al. [116] applied the One-Class Support Vector Machine algorithm to detect Android malware automatically. The classifier was trained using only benign applications. The features were extracted from APK files with Androguard [1], an open-source tool; they comprise the permissions the app requests and the control flow graphs (CFGs) of each method in the app. The proposed method was evaluated on a dataset of 2,081 benign apps and 91 malicious apps. For each experiment, a random subset of the training applications was selected, and k-fold cross-validation was performed. The results yield a very stable and high recall (i.e., the ratio of correctly classified malware to the total malware). The precision (i.e., the ratio of correctly classified malware to the total number of instances classified as malware) decreased as the number of benign samples in the testing set increased, meaning the method introduces a large number of false positives. The features used are relatively limited compared with other machine learning based works using static features; in fact, more meaningful metadata can be extracted from the manifest file and the application code itself.
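The benign-only training idea can be illustrated with a much simpler stand-in for the One-Class SVM: fit a region around the benign feature vectors and flag anything falling outside it. The centroid-distance model and toy vectors below are assumptions for illustration only, not the kernel-based classifier Sahs et al. used.

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train_benign_model(benign_vectors):
    """Train on benign apps only: store the centroid and a distance bound."""
    dim = len(benign_vectors[0])
    centroid = [sum(v[i] for v in benign_vectors) / len(benign_vectors)
                for i in range(dim)]
    bound = max(dist(v, centroid) for v in benign_vectors)
    return centroid, bound

def is_malware(vector, model):
    """Anything outside the benign region is flagged as malware."""
    centroid, bound = model
    return dist(vector, centroid) > bound

# Toy permission-count vectors: benign apps request few permissions.
benign = [[1, 0], [0, 1], [1, 1]]
model = train_benign_model(benign)
```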

Sanz et al. [118] classified Android apps into several categories defined in the Android Market, e.g., Tools, Entertainment, and Communications, using a supervised machine learning approach. They extracted features from both the application itself and the Android Market. Specifically, the app features include the permissions defined in the AndroidManifest.xml file and the strings contained in the apps, which are extracted from the decompiled files and processed using Term Frequency (TF) and Inverse Document Frequency (IDF). The features from the Android Market include the number of ratings the application has obtained, the size of the application, and the users' ratings of the app. Redundant features, i.e., those with an Information Gain value of 0, were removed before classification. The authors collected 820 apps from 7 categories for the evaluation. Several state-of-the-art supervised machine learning algorithms, such as Decision Tree, K-Nearest Neighbour, Bayesian Networks, and Support Vector Machine, were tested over the dataset with ten-fold cross-validation. The highest Area Under the ROC Curve (AUC), 0.93, was achieved by the Bayesian TAN classifier. The authors proposed classifying Android malware as their future work. However, the features obtained from the Android Market may not be that effective; for instance, repackaged malware is likely to have a very similar size to the benign application, and users may not be able to distinguish benign applications from repackaged malware when rating them.
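The TF-IDF weighting of extracted strings works as in classic information retrieval: a string's weight in an app grows with its frequency in that app and shrinks with the number of apps containing it. A minimal sketch, with hypothetical string features:

```python
import math

def tf_idf(docs):
    """TF-IDF weights for string features; docs maps app name -> strings."""
    n = len(docs)
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    weights = {}
    for app, terms in docs.items():
        for t in set(terms):
            tf = terms.count(t) / len(terms)   # term frequency in this app
            idf = math.log(n / df[t])          # rarity across the corpus
            weights[(app, t)] = tf * idf
    return weights

# Hypothetical strings extracted from two decompiled apps.
docs = {
    "app_a": ["http", "score", "level", "score"],
    "app_b": ["http", "contact", "sms"],
}
w = tf_idf(docs)
```

A string that appears in every app ("http") gets weight 0 and carries no discriminative power.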

Yerima et al. [148] proposed and evaluated a Bayesian classification based approach to detect Android malware. The authors developed an APK analyzer to extract features from the APK file; in total, 58 features were extracted, including API calls, system commands, etc. The proposed method was evaluated on a dataset containing 1,000 malware samples from 49 Android malware families and 1,000 benign apps across various categories. The authors ranked their features and included different numbers of features in the experiments. Results show that 15 to 20 features are sufficient to provide optimum classification performance, reaching 0.92 in classification accuracy. This work contributes to the selection of the features used to feed the machine learning algorithm. The authors further proposed a multi-classifier based approach for Android malware detection [147], which uses the same feature set but combines multiple classifiers' results using various strategies.

Huang et al. [80] evaluated the effectiveness of using permissions as features when applying machine learning techniques to Android malware detection. They extracted 20 features from the APK files, including file-based features, such as the number of ELF files and the number of executable files, and permission-based features, such as the number of permissions requested by the developer and the number of permissions actually required while running the app (in most cases, developers tend to request more permissions than they actually require). The evaluation dataset includes 124,769 benign apps and 480 malware samples. Four machine learning algorithms were tested over the dataset: AdaBoost, Naive Bayes, Decision Tree, and Support Vector Machine. It is reported that, on average, 81% of malware samples can be correctly detected. Consequently, the authors suggested that permission-based machine learning approaches can only be used as a quick filter for malware; the results need to be further analyzed for more accurate detection outcomes.

Arp et al. [25] proposed DREBIN, a lightweight method for the detection of Android malware that infers detection patterns automatically and enables identifying malware directly on the smartphone. DREBIN first extracts different feature sets from an app's manifest and dex code. The features from the manifest include hardware components, requested permissions, app components, and filtered intents. The features from the dex code include restricted API calls, used permissions, suspicious API calls, and network addresses found in the code. The extracted feature sets are then mapped into a joint vector space, where patterns and combinations of the features can be analyzed geometrically. The feature sets are then fed to machine learning algorithms such as a linear Support Vector Machine. Besides, the features that contribute to the detection of a malware sample are identified and presented to the user to explain the detection. A self-collected dataset containing 123,453 benign apps and 5,560 malware samples is used for evaluation, one of the largest malware datasets that have been used to evaluate an Android malware detection approach. The results yield a detection rate of 94% with a false positive rate of 1%.
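The mapping of heterogeneous feature sets into a joint vector space can be sketched as follows: every feature string observed across the dataset becomes one binary dimension, and an app's vector has a 1 wherever the app contains that feature. The feature strings below are hypothetical examples in a DREBIN-like prefix style.

```python
def build_space(apps):
    """Map each app's feature-string set into a joint binary vector."""
    dims = sorted({f for feats in apps.values() for f in feats})
    index = {f: i for i, f in enumerate(dims)}
    vectors = {app: [0] * len(dims) for app in apps}
    for app, feats in apps.items():
        for f in feats:
            vectors[app][index[f]] = 1
    return dims, vectors

# Hypothetical extracted feature strings for two samples.
apps = {
    "sample1": {"permission::SEND_SMS", "api_call::sendTextMessage"},
    "sample2": {"permission::INTERNET", "url::example.com"},
}
dims, vectors = build_space(apps)
```

The resulting sparse binary vectors are what a linear SVM separates, and the learned per-dimension weights are what make the detection explainable.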

Wu et al. [133] proposed DroidMat, a static feature based mechanism for detecting Android malware. First, DroidMat extracts information such as requested permissions and Intent message passing from each app's manifest file, and regards components (Activities, Services, Receivers) as entry points for drilling down and tracing permission-related API calls. It then applies the K-means clustering algorithm to enhance the malware modeling capability. Finally, samples are classified as benign apps or malware using the K-Nearest Neighbor algorithm. The features used in DroidMat are permissions, components, intents, API calls from the bytecode, and Inter-Component Communications (ICC). They collected a dataset of 238 Android malware samples from 34 malware families and 1,500 benign apps from 30 categories. An average detection accuracy of 87.39% was reported. However, among the 34 malware families, the detection rate is 0 for 15 of them, as those families may use various obfuscation techniques to bypass the detection.
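The final K-Nearest Neighbor step can be sketched as follows; the toy binary feature vectors are illustrative, not DroidMat's actual features.

```python
from collections import Counter

def knn_classify(sample, training, k=3):
    """Label a sample by majority vote of its k nearest training vectors.
    `training` is a list of (vector, label) pairs; distance is Euclidean."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy binary vectors (e.g., permissions/intents present or not).
training = [([1, 1, 0], "malware"), ([1, 1, 1], "malware"),
            ([0, 0, 1], "benign"), ([0, 1, 0], "benign"),
            ([0, 0, 0], "benign")]
```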

Recently, a number of machine learning based detectors that rely on semantic features to model malware behaviors have been proposed. Fan et al. [62] proposed DAPASA, an approach to detect Android piggybacked applications through sensitive subgraph analysis. DAPASA takes an Android APK as input. The dex file in the APK is extracted and converted into Smali code; function lists and their calling relationships are then recovered from the decompiled Smali code and used to generate static function call graphs. DAPASA then calculates the sensitivity coefficient of each sensitive API based on a TF-IDF-like method, and mines the set of sensitive subgraphs. Five numerical features are constructed from the sensitive subgraphs, representing their statistical information, such as the sensitivity coefficient, the total sensitive distance, and the total number of sensitive motif instances. These features are then fed into a machine learning algorithm to detect whether an APK is piggybacked. The feature set was extracted from 6,154 malware samples and 44,921 benign apps. The proposed approach was evaluated on a dataset containing 500 piggybacked apps and 500 benign apps. Results show that the proposed method can effectively detect piggybacked apps, with an AUC of 0.99.

Xu et al. [135] leveraged inter-component communication patterns to detect Android malware and proposed ICCDetector. ICCDetector extracts ICC-related features by retrieving and analyzing all ICC-related sources and sinks. The features fall into four categories: component related features (such as component names and types, and the number of components), explicit intent related features (such as the number of explicit intents and the number of external explicit intents), implicit intent related features (such as the number of implicit intents and action strings), and intent filter related features. These features are further processed by correlation-based feature selection to remove redundant ones. Finally, the feature vectors are fed into machine learning algorithms to classify benign and malicious apps. A dataset of 5,264 malware samples and 12,026 benign apps is included in the evaluation; 97.4% accuracy with a false positive rate of 0.67% is reported. ICCDetector is more effective in identifying malware samples that require few suspicious resources but mainly rely on inter-component communication mechanisms.

Yang et al. [142] developed DroidMiner to scan suspicious applications and determine whether they contain malicious modalities. DroidMiner can also be used to diagnose the malware family. A similar idea has also been developed by Li et al. [93]. Du et al. adopted community structures of weighted function call graphs to detect Android malware. Zhang et al. [151] proposed a semantic-based approach to classify Android malware via dependency graphs. Gascon et al. [70] developed a method based on efficient embeddings of function call graphs with an explicit feature map. Furthermore, Yang et al. [143] considered user-event-driven components and the related sequences of callbacks from the Android framework to the application code; they developed a program representation to capture those callback sequences in order to differentiate Android malware from benign applications.

Li et al. [90] undertook a more in-depth investigation of using permissions as features to identify malware samples. Instead of including the full list of requested permissions in the model (e.g., [25]), the authors proposed to leverage only the most significant permissions. They manually selected 22 significant permissions that either create high-risk attack surfaces and are frequently requested by malware samples, or are rarely requested by malware samples. The detection accuracy using only the selected 22 permissions is comparable to that obtained with the full permission list.

Ma et al. [100] proposed an ensemble model based on three sub-models for detecting Android malware. The three sub-models are trained with API usage, API frequency, and API sequence feature sets, and the final results are combined with a soft voting scheme. While each individual sub-model achieves a 96% to 98% F-score, the ensemble model further improves the F-score to 99%.
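Soft voting combines the sub-models by averaging their predicted probabilities rather than their hard labels. A minimal sketch (the probabilities and the 0.5 threshold are illustrative assumptions, not values from the paper):

```python
def soft_vote(probabilities):
    """Average the sub-models' malware probabilities, then threshold."""
    mean = sum(probabilities) / len(probabilities)
    return ("malware" if mean >= 0.5 else "benign"), mean

# Hypothetical outputs of the API-usage, API-frequency, and API-sequence
# sub-models for one app.
label, score = soft_vote([0.9, 0.4, 0.7])
```

Note that the app is flagged even though one sub-model disagrees, because the other two are confident.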

Recently, researchers have leveraged deep learning algorithms in Android malware detection tasks. With deep learning, rather than a simple Boolean or numerical feature set, more complex features (i.e., the input neurons of the network) can be used, such as data flows [156] and opcode sequences [85]. Researchers have also fed traditional feature sets such as permissions into more complex deep learning models for better performance [37].

2.2.2 Dynamic Approaches

Dynamic approaches execute the app in a protected environment, provide all the emulated resources it needs, and observe its malicious activities. As the execution of Android apps is event-based, it is critical that proper inputs, such as taps, pinches, swipes, and keyboard events, be generated to trigger the app's various behaviors when running it in the sandbox.

Shabtai et al. [122] proposed a lightweight Android malware detection system, Andromaly, based on a machine learning approach. It performs real-time monitoring to collect various system metrics as features, such as CPU usage, the amount of data transferred through the network, the number of active processes, and battery usage. Andromaly collects features by communicating with the Android kernel and application framework; at regular intervals, feature collection is triggered to obtain new features. The processor, which employs machine learning algorithms, receives the feature vectors from the feature collectors, analyses them, and reports threat assessments to the Threat Weighting Unit (TWU). The TWU applies an ensemble algorithm to the analysis results received from all the processors to derive a final decision on device infection. To evaluate this on-device detection approach, the authors installed 20 benign game apps, 20 benign tool apps, and four self-developed malware samples, which were specifically designed to reflect some of the features used in the approach, such as high CPU usage. Testing on real malware samples still needs to be performed to evaluate its effectiveness.

Lindorfer et al. [94] proposed ANDRUBIS, a fully automated, publicly available platform for Android malware detection. ANDRUBIS employs both static and dynamic analysis in malware detection. Specifically, static features are extracted from the APK file and manifest, such as requested permissions, services, activities, broadcast receivers used to receive events from the system or other apps, the package name, and the SDK version. These static features are used to assist in automating the dynamic analysis, mainly during the stimulation of an app's components. A sandbox for the ARM platform is then built with a QEMU-based emulation environment to record activities that happen when executing the observed app, such as phone events and the loading of additional DEX or native code at runtime. Some existing tools have been employed by ANDRUBIS, such as DroidBox [5], TaintDroid [59], Apktool [3], and Androguard [1]. As a publicly available platform, ANDRUBIS collects the apps submitted by users. In four years, it collected over 1,000,000 apps, of which 27.9% are malware, 41.15% are benign, and 30.95% could not be conclusively labeled; the last group is usually removed when doing experiments.

Burguera et al. proposed Crowdroid [39], a behavior-based malware detection system. The system contains two hardware components: the user's mobile device and a remote detection server. A crowd-sourcing app installed on the user's device collects system call details and sends them, in the form of log files, to the remote server. At the server side, the received data is processed into feature vectors, which are then analyzed using the K-means clustering algorithm to classify whether an app is benign or malicious. Evaluations were done on self-written malware samples and real-world malware. The self-written trace includes 50 benign apps and ten malware samples developed by the authors; the experiments claim a 100% detection rate on these samples. In the real-world malware trace, two malware samples, PJApps and HongTouTou, were tested, with detection rates of 100% and 85%, respectively.
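The server-side clustering step can be sketched as a 2-means clustering of per-trace system-call count vectors; the trace data and the choice of initial centers are illustrative assumptions, not Crowdroid's actual configuration.

```python
def kmeans2(vectors, iters=10):
    """2-means clustering of system-call count vectors (minimal sketch)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [vectors[0], vectors[-1]]   # naive initialization
    assign = []
    for _ in range(iters):
        # Assign each trace to its nearest center, then recompute centers.
        assign = [min((0, 1), key=lambda c: dist(v, centers[c]))
                  for v in vectors]
        for c in (0, 1):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return assign

# Counts of (open, read, sendto) per execution trace: the last two traces
# issue unusually many network sends and form their own cluster.
traces = [[5, 9, 0], [6, 8, 1], [4, 10, 0], [5, 9, 40], [6, 8, 42]]
groups = kmeans2(traces)
```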

Yan et al. [141] proposed DroidScope, a Virtual Machine Introspection (VMI) based dynamic analysis framework for Android apps built upon the QEMU emulator. Unlike other dynamic analysis platforms, DroidScope does not reside inside the emulator; instead, it reconstructs OS-level and Dalvik Virtual Machine level semantics from outside the emulator. As a result, even privilege escalation attacks in the kernel can be detected. DroidScope also provides human analysts with a set of APIs to customize their analyses, such as callbacks for context switches, system calls, etc. The Android malware families DroidKungFu and DroidDream were analyzed and detected successfully; however, the effectiveness of DroidScope against other malware families needs to be further tested and evaluated.

Reina et al. [113] proposed CopperDroid, a system that performs system-call-centric dynamic analysis of Android apps using Virtual Machine Introspection. CopperDroid is built upon QEMU to automatically perform out-of-the-box dynamic behavioral analysis of Android malware. To address the path coverage problem, it supports the stimulation of events according to the specifications present in the app's manifest file. The evaluation is based on two collections of malware samples, with 1,200 samples from 49 Android malware families and 400 samples from 13 malware families. The authors have shown through experimentation that system-call-centric analysis can effectively detect malicious behavior, and they provide a web interface for other users to analyze apps.

Enck et al. [59] proposed TaintDroid, which extends the Android platform to track privacy-sensitive information flows within third-party apps and determine whether sensitive data leaves the device. When sensitive information leaves the device, TaintDroid records the label of the data, the app that sent the data, and its destination. Taint propagation is tracked at four levels of granularity: 1) variable level, 2) method level, 3) message level, and 4) file level. Variable-level tracking uses variable semantics, which provides the necessary context to avoid unnecessary taint propagation. In message-level monitoring, the taint on whole messages is tracked to limit IPC overhead. Method-level tracking is used for Android native libraries, which are not directly accessible to apps but are reachable through the modified firmware. File-level monitoring ensures the integrity of file-access activities by checking whether taint markings are retained. The process of detecting a leak (e.g., an untrusted app trying to access the data of a trusted app and send it over the network) is as follows. First, the information of the trusted app is labeled according to its context. A native method interfacing with the Dalvik VM interpreter stores taint markings in a virtual taint map, and every interpreter simultaneously propagates taint tags according to data-flow rules. TaintDroid's modified Binder library ensures that tainted data from the trusted application is sent as a parcel carrying a taint tag that reflects the combined taint markings of all contained data. The kernel transfers this parcel transparently to the Binder library instance at the untrusted app, where the taint tag is retrieved from the parcel and applied to all the contained data. The Dalvik bytecode interpreter then forwards these taint tags along with the requested data to the untrusted app's components. When that app calls a taint sink (for example, the network) library, the taint tag is retrieved and the app's activity is marked as malicious.
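Conceptually, taint markings behave like a small bit vector carried alongside each value, with propagation as a bitwise OR over the tags of an operation's inputs. The sketch below is a conceptual illustration of this bookkeeping, not TaintDroid's implementation.

```python
# Each sensitive source sets one bit in the tag (illustrative sources).
TAINT_IMEI = 1 << 0
TAINT_LOCATION = 1 << 1
TAINT_CONTACTS = 1 << 2

def propagate(*input_tags):
    """Tag of a computed value: the union of its operands' tags."""
    tag = 0
    for t in input_tags:
        tag |= t
    return tag

def is_leak(tag, sink_policy=TAINT_IMEI | TAINT_LOCATION):
    """Flag a network-sink write whose data carries monitored tags."""
    return bool(tag & sink_policy)

# The IMEI concatenated with an untainted constant stays tainted.
msg_tag = propagate(TAINT_IMEI, 0)
```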

Existing taint analysis tools do not support analyzing native code, i.e., code written in a native language (such as C) that interacts with the rest of the app through the Java Native Interface (JNI). Xue et al. [138] proposed NDroid, which implements a JNI tracker to deal with taint propagation in native code. Similar work [139] not only instruments and analyzes information leaks in native code, but also automatically explores the execution paths of such malicious activities.

To overcome the challenge that dynamic analysis may not trigger all paths, Wong et al. [130] proposed combining static analysis with dynamic testing to generate inputs that trigger targeted malicious activities. They first extract the event handlers and reconstruct the call paths containing each event handler, together with their constraints; this information is then used to generate inputs that trigger the targeted events. Further work [131] addresses language-obfuscation (e.g., reflection, native code, encryption) and run-time obfuscation (e.g., dynamic loading) problems.

2.3 Android Malware Evasion

2.3.1 Evading Traditional Non-Machine-Learning Based Detection

Anti-Malware Tools (AMTs) have shown their effectiveness in detecting current malware samples; however, their capability to detect future malware samples is unknown. Several works have proposed auditing AMTs by automatically generating new malware samples.

Zheng et al. [153] proposed ADAM, a platform to stress-test Android AMTs. ADAM automatically transforms existing malware samples into different variants with repackaging and obfuscation techniques. The repackaging techniques used include alignment, re-signing, and rebuilding; the obfuscation techniques include inserting defunct methods, renaming methods, changing control flow graphs, and encrypting constant strings. They applied the above techniques to 222 malware samples to generate malware variants, which were then submitted to VirusTotal to test commercial AMTs. The detection rates under each repackaging and obfuscation technique are reported; results show that most commercial AMTs are robust to repackaging techniques but vulnerable to obfuscation techniques. Similar approaches were also proposed by Rastogi et al. [111, 112] and Maiorca et al. [101].

Aydogan et al. [27] proposed a framework to automatically generate malware samples using genetic programming (GP), an evolutionary computation technique inspired by natural evolution. The proposed framework takes an Android APK as input. The dex file in the APK is converted to Smali code, and its control flow graphs (CFGs) are extracted. GP performs two operations on the CFGs, namely crossover and mutation. The crossover operation swaps two methods with the same return type and the same number of parameters, while the mutation operation applies one of five obfuscation techniques to mutate the original code, such as renaming local identifiers, code re-ordering, junk code insertion, and data encryption. The modified CFGs are then converted back to Smali code and finally reassembled into a new APK for testing.
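The two GP operations can be sketched over a simplified method representation (a signature plus a body); the method names, signatures, and Smali snippets below are hypothetical.

```python
import random

def crossover(methods_a, methods_b):
    """Swap the bodies of two methods sharing a signature (return type,
    arity) -- a minimal stand-in for the CFG-level crossover."""
    for name_a, (sig_a, body_a) in methods_a.items():
        for name_b, (sig_b, body_b) in methods_b.items():
            if sig_a == sig_b:
                methods_a[name_a] = (sig_a, body_b)
                methods_b[name_b] = (sig_b, body_a)
                return name_a, name_b
    return None

def mutate(smali_lines, rng=random.Random(0)):
    """Junk-code insertion, one of the mutation obfuscations: insert a
    nop at a random position."""
    pos = rng.randrange(len(smali_lines) + 1)
    return smali_lines[:pos] + ["    nop"] + smali_lines[pos:]

# Two hypothetical parents with a matching (void, 0-argument) method.
parent_a = {"run": (("void", 0), ["invoke-static {}, Lpayload;->a()V"])}
parent_b = {"main": (("void", 0), ["return-void"])}
swapped = crossover(parent_a, parent_b)
```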

There are a few existing works designed to attack voice assistants (VAs). For example, Diao et al. [52, 21] proposed an attacking method that makes use of the Android inter-component communication mechanism and the built-in speaker. To be stealthy, Diao et al. [52] designed the attack to be triggered at 3 am, a time when smartphones are expected to be unattended (e.g., while users sleep). A similar model to make the attack stealthy was adopted in Alepis et al.'s work [21]. However, these attacks require a specific API (the Intent 'ACTION VOICE SEARCH HANDS FREE'), which was only available in Google Assistant. This limits the applicability of the proposed attacking methods, e.g., on devices like Huawei's Xiao Yi and Xiaomi's Xiao Ai, which provide custom VAs for users. Besides, the stealthiness of the above methods is not complete; for example, the volume of the activation voice (e.g., 53 dB claimed in Table 4 of [52]) may be loud enough to wake the user, considering the quiet environment in the early morning [105].

Some other attacking methods focus on crafting special audio that can be recognized by smartphone VAs but not understood by humans [150, 41]. For example, the idea of Carlini et al. [41] is to obfuscate raw attack audio and make it sound like noise. Based on adversarial machine learning techniques [34, 127], the deliberately crafted audio can be recognized by smartphone VAs but is dismissed by smartphone users as incomprehensible noise. In another example, Zhang et al. proposed using ultrasound [150], as its frequency is higher than the upper limit of human hearing. However, the approach of Carlini et al. [41] requires access to the target voice recognition model, as either a black box or a white box, to run the audio crafting process iteratively. Moreover, the approach of Zhang et al. requires a special instrument (e.g., an ultrasound generator) [150]. Both premises are impractical in most real-world scenarios.

2.3.2 Evading Machine Learning Based Detection

Recently, some research works have studied the security of various machine learning based malware detectors; we give an overview of them as follows. Grosse et al. [77] extended an existing adversarial example crafting algorithm proposed by Papernot et al. [107] to the Android domain. They trained a deep feed-forward neural network classifier with the feature set adopted in Drebin, achieving detection performance comparable to Drebin's. Then, they launched a white-box attack on the DNN model. The forward derivative-based approach they proposed measures the sensitivity of the output label to each input component. The input features are ranked accordingly, and the highest-ranked features are iteratively perturbed until an evasion example is found. Chen et al.

[46] demonstrated a poisoning attack on Android malware detection systems that pollutes the training set of the original detectors, and proposed KuafuDet to detect malware samples in such an adversarial environment. They first hand-picked a number of benign and malicious apps located very far from the decision hyperplane; these samples are assumed not to be poisoned. They then leveraged a similarity-based approach to identify camouflaged malware in the training set poisoned by the adversary, and excluded the detected camouflaged malware from the training set. Results show that KuafuDet can increase the detection rate by at least 15% in an adversarial environment.

Meng et al. [104] proposed Mystique, a tool that automatically generates malware samples for auditing anti-malware tools. They first summarise various attack behaviors and evasion techniques into meta feature blocks of attack features and evasion features, respectively. These meta-features are then randomly combined as building blocks. Mystique mimics the evolution of malware by applying a multi-objective evolutionary algorithm: the next generation of malware variants is produced by performing crossover and mutation on current malware samples. An extended work by Xue et al. [140] further added a dynamic code generation and loading component to Mystique, giving it the capability to test anti-malware tools against dynamic loading techniques.

Similarly, inspired by the evolution of Android malware, Yang et al. [144] proposed to generate malware variants by incorporating evolution patterns into existing malware. An RTLD feature model was proposed to represent the evolution features: resource, temporal, locale, and dependency.

The resource feature describes the security-sensitive resources obtained by malicious behaviors. The temporal and locale features describe when and where the malicious behaviors take place, respectively. The dependency feature shows how malicious behaviors are triggered and controlled. A substitute model is trained with RTLD features to approximate the target model. Owing to the transferability between models, an adversarial example that evades the substitute model has a certain chance of also evading the target model. Two types of attacks are proposed, namely the confusion attack and the evolution attack. In the confusion attack, feature values from less differentiable malware samples are extracted and mutated into existing malware samples to generate their variants.

In the evolution attack, the authors constructed a phylogenetic evolutionary tree that shows the evolutionary relationships among malware samples. Feature sets that frequently appear in the evolution are then extracted as a candidate feature mutation set, and new malware variants are generated by mutating the features in this set into existing malware samples.

The proposed malware variant generation approach is further incorporated into a testing framework for learning-based malware detection systems [146].

Several works have also been proposed for attacking machine learning based malware detectors on platforms other than Android, such as PDF and Windows. Their fundamental ideas can also be applied to the Android platform; therefore, we give a brief introduction to these works. Srndic et al. [89] proposed an attack against PDFrate, an online malicious PDF file detection system. They modified fields in the PDF file that are not rendered by PDF readers but are extracted as features to discriminate malicious files from benign ones. Similar work was done by Biggio et al. [34], who leveraged a gradient descent attack to evade detection. Due to the relative simplicity of the PDF file structure, it is easy to alter the file without changing the original content. Rosenberg et al. [114] proposed a black-box attack against machine learning based malware detectors on Windows that analyse API calls. The attack algorithm iteratively adds no-op system calls (extracted from benign software) to the binary code. The proposed method can only be applied to detection systems that embed the call sequence into a feature vector; it does not work if the features are statistics extracted from the call sequence, such as a similarity score or probability.
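To make the call-sequence padding idea concrete, the following is a minimal sketch (all names hypothetical, not taken from [114]) of such a black-box evasion loop: no-op calls harvested from benign software are inserted at random positions until the classifier's verdict flips.

```python
import random

def evade_by_padding(call_seq, benign_calls, classify, max_inserts=100, seed=0):
    """Iteratively insert no-op API calls (harvested from benign software)
    into a call sequence until the black-box classifier flips its verdict.
    `classify` returns True if the sequence is flagged as malicious."""
    rng = random.Random(seed)
    seq = list(call_seq)
    for _ in range(max_inserts):
        if not classify(seq):
            return seq  # evasion succeeded
        pos = rng.randrange(len(seq) + 1)
        seq.insert(pos, rng.choice(benign_calls))
    return None  # gave up: classifier still flags the padded sequence
```

Note that, as discussed above, this only works when the detector embeds the raw call sequence; a detector built on sequence statistics would require a different perturbation strategy.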

Several works have been proposed to defend against evasion attacks on machine learning based Android malware classifiers. Chen et al. [45] proposed SecureDroid to enhance the robustness of machine learning malware detectors. SecureDroid includes a feature selection method named SecCLS and an ensemble learning approach named SecENS. To minimize the cost of evading detection, adversaries tend to perturb the features with a high impact on the machine learning model. SecCLS reduces the probability that high-impact features are selected for model construction, thereby increasing the cost of generating evasion examples. SecENS further aggregates a set of machine learning classifiers trained with features chosen by SecCLS to generate the final detection results. While SecureDroid effectively increases the cost of creating adversarial examples, sophisticated evasion examples can still bypass its detection.

Demontis et al. [51] proposed a secure version of the Drebin [25] detector. They improve the security of the linear classification algorithm by enforcing more evenly-distributed feature weights during learning. As a result, the adversary needs to manipulate a large number of features to evade detection, making it harder to build an evasion example while keeping the original malicious functionality.

Besides increasing the cost of crafting adversarial examples, another strategy is to pre-filter adversarial examples before feeding the test data into malware detectors. Grosse et al. [76] proposed to leverage statistical tests to identify adversarial inputs, based on the observation that adversarial examples are drawn from a distribution different from that of the original data. To this end, the authors trained a machine learning model that regards adversarial examples as an additional outlier class and classifies them into that class.

In addition, the works [77, 46, 79] used binary features to indicate the presence of a certain permission or API. Modifying these features usually does not affect the functionality of the application; for instance, the adversary can request a new permission in the manifest without implementing it in the code. Most recent works, however, adopt semantic features such as those extracted from control flow graphs, which require more caution to tamper with if the application's functionality is to be preserved.
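The asymmetry between binary and semantic features can be illustrated with a small sketch (the feature list and helper names are hypothetical): a binary permission vector can be perturbed additively, without touching any code, simply by requesting an extra, unused permission in the manifest.

```python
# Hypothetical feature list for illustration; real detectors use hundreds.
FEATURES = ['SEND_SMS', 'READ_CONTACTS', 'INTERNET', 'CAMERA']

def to_vector(manifest_permissions):
    """Map a manifest's permission set to a binary presence vector."""
    return [1 if f in manifest_permissions else 0 for f in FEATURES]

def add_unused_permission(manifest_permissions, permission):
    """Additive-only perturbation: request a permission the code never uses.
    Removing a permission could break the app, so we only ever add."""
    return set(manifest_permissions) | {permission}
```

Semantic features derived from control flow graphs offer no such free flips: changing them means changing code, which is exactly why they raise the attacker's cost.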

Existing works [45, 77, 46, 51, 145] therefore do not work properly against recent detectors that rely more on semantic features. In this thesis, we present an advanced method of crafting adversarial examples that applies perturbations directly to the APK's classes.dex file. The generated adversarial examples remain effective against recent detectors that rely more on semantic features.

2.4 Android Malware Evolution

Researchers across academia have conducted comprehensive studies on Android malware family analysis [60, 63] and evolution. However, previous systematic approaches rarely provide a fine-grained evolution analysis inside a malware family. In this section, we mainly look at malware categorizations based on dendrogram analysis and on ancestral relation analysis.

Several phylogenetic analysis works are based on a dendrogram of the malware samples inside a family or across different malware families. Such a dendrogram can infer that malware samples (or families) share a common ancestor, but it cannot determine the ancestral relations among them. For example, unknown malware samples can automatically be classified into candidate families based on a text mining approach, before a dendrogram-based phylogenetic analysis is performed among malware families [126]. This evolution analysis operates at the family level, and ancestral relationships between families are not discussed.

In [32], process mining techniques are used to identify recurring execution patterns in system call traces gathered from malware samples, which allows similarities across malware families to be identified and the families' dendrogram-based phylogenetic relationships to be discussed. The work in [144] obtains all pairwise distances among apps via coarse-grained and fine-grained analyses, first comparing permission and API values and then performing a fine-grained feature comparison between apps. When generating a candidate feature mutation set, this approach is mainly concerned with feasibility and statistical frequency, that is, whether and how frequently the features have been mutated. However, it cannot locate the newest evolved behavior, as other samples rarely mutate the newly updated features.

Two different feature types, n-grams and n-perms, can be used to compare programs in order to build phylogeny models [58]. Both feature extraction approaches permit permuted sequence matching based on document comparison techniques employing feature occurrence vector similarity measures. In [125], a systematic method is proposed that utilizes differential analysis to isolate software components irrelevant to a campaign, so that the behavior and evolution of the malicious riders can be studied alone. The work in [78] includes an intensive analysis of a malware family that utilizes phylogenetic trees to infer the intrinsic behaviors and relationships inside each variant set and among variant sets. It sets the distance threshold among malware samples to 0.5 and clusters them into different sets, without any further guidance on how to choose that distance. Moreover, it only discusses the distinguishing features of each variant set, regardless of the ancestral relationships among the sets, since those relationships were not the result it wanted to stress.

The work in [84] develops a framework for abstracting, aligning, and analyzing malware execution traces. It performs a preliminary exploration of phylogenetic methods, whose strengths lie in pattern recognition and visualization, to derive the statistical relationships within two malware families.

Several works discuss the ancestor-descendant relationship after hierarchical agglomerative clustering or similarity analysis; these works can reveal how malware samples are ancestrally related.

Oyen et al. [106] used statistical data derived from malware analysis to infer dependencies, while leveraging human knowledge about the direction of dependencies inside a malware family. Although this approach can infer the ancestral relationship inside a family, it is not feasible for extensive malware family analysis, since each malware sample is regarded as an individual malware type. Notably, it requires human knowledge and manual analysis to define the genetic sequence of the malware samples inside a family. Cimitile et al. [49] utilized system call traces to generate a formal model for verifying properties that characterize malware behaviors, and then used the formal model to identify the common behavior shared by malware samples from the same family. This approach further utilizes the formal model to infer ancestor-descendant relationships between malware families. However, it can only infer one type of phylogenetic relationship (subset and superset). Moreover, it overlooks some static malware updates, since it is based on dynamic features only; for example, junk code insertion and renaming, which are essential aspects of malware evolution, are missed.

Liu et al. [95] proposed a novel malware phylogeny construction method that takes advantage of the persistent phylogeny tree model, with nodes corresponding to input instances and edges representing the gain or loss of functional characters. It can not only depict directed ancestor-descendant relationships between malware instances but also show concrete function inheritance and variation between ancestor and descendant. Liu et al. [96] used phylogenetic networks based on the splits graph, without any prior ordering or post-processing, to model evolutionary relations among malware families, describe the temporal order among samples, define ancestor-descendant relationships, and unveil evolutionary trends.

Given the extent of these related works, our approach is the first to systematically cluster malware samples into variant sets, discuss their ancestral relationships, and give an in-depth analysis between variant sets in terms of how later samples have evolved from previously discovered ones.

Chapter 3

Investigating The Evolution

Pattern of Android Malware

Android malware is becoming stealthier and more sophisticated due to the evolution of anti-virus engines and the increased complexity of the Android platform. Previous research works have extensively studied Android malware categorization and behavior analysis, but merely at the family level, falling short of dividing malware samples into variant-sets or exploiting the phylogenetic relationships among them. A fine-grained and in-depth phylogenetic analysis approach can facilitate the evolutionary analysis of malware.

Such an approach can assist in analyzing newly discovered malware, detecting evolution-inspired adversarial attacks, and addressing the drift problems caused by malware evolution. New malicious samples usually belong to a previously discovered malware family: normally, thousands of malware samples are detected each year, but only a few new families are discovered among them, which shows why high-granularity malware categorization and phylogenetic investigation at the variant-set level is crucial. In this chapter, we propose a method that initially clusters the malware samples of a family into variant-sets, and then systematically reveals the phylogenetic relationships among those sets for a deeper malware evolution analysis. We define three metrics to evaluate the accuracy of variant-set clustering and show that our method clusters the variant sets well. Moreover, we present interesting evolutionary findings based on the derived phylogenetic relationships, answering the questions of how and why one variant set has evolved into another. To the best of our knowledge, our work is the first to provide a systematic approach to malware phylogenetic analysis at the variant-set level.

3.1 Introduction

Android malware has long been a critical and challenging problem. Over the last ten years, Android's market share has increased from 0.66% to 74.85% [10].

Meanwhile, a 2019 report by the anti-virus vendor Kaspersky indicates that the number of detected Android malware samples has doubled since the previous year.

Considering that this number was as large as 66.4 million in 2017 [44], one can imagine the magnitude of the threat landscape. Once a device is infected, malware will commit a series of illegal behaviours, such as stealing users' sensitive personal data

from the compromised devices. In fact, Android malware writers and defenders are engaged in an endless arms race: when new detection methods or new versions of Android are released, upgraded malware samples emerge with more aggressive and sophisticated designs. This process continues iteratively and keeps the ultimate solution out of reach.

In the race between malware writers and defenders, some researchers have attempted to group similar samples into families and make use of the knowledge learnt from them, so that the advancement of detection techniques outpaces the evolution of malware. For example, given family categorisation methods

(e.g., [40, 43, 61]), it is believed that 1) if a sample belongs to a known family, the same removal techniques can be re-used; and 2) security analysts can focus their manual investigation on the few new samples that do not belong to any known family, thus optimising their limited time and resources. However, these assumptions are often too idealistic to hold in practice, and sometimes even backfire. For example, learning-based detectors trained on older datasets often make poor decisions on newly emerged malware samples, even those belonging to known families. Empirical studies have shown that the

F1-score can drop to 30% in the worst case [109]. In addition, adversarial attackers can exploit the evolutionary features of malware [145] by locating useful features through malware evolutionary analysis and then automatically generating stealthier malware samples that can bypass detection within a short time. Therefore, designing a reliable malware detection

model relies on a comprehensive understanding of malware behaviour and its evolution drift. Understanding the evolved malicious behaviours in detail often requires manual inspection, which can be either complicated or redundant when confronted with a continuously growing stream of incoming malware samples. This motivates a systematic approach for a more comprehensive malware evolution study.

According to the 2019 security threat report published by Symantec, thousands of new mobile malware samples have been detected each year since 2016, while only a few new malware families are discovered every year [19]. This indicates that many new malware samples still fall into previously discovered Android malware families, and that evolution occurs more frequently inside a malware family.

Therefore, categorizing the malware samples of a family into smaller variant sets, and then providing a fine-grained evolution analysis among those variant sets, is more effective for observing evolutionary features than evolution analysis at the malware family level. Additionally, researchers often focus on profiling non-temporal malware family behaviours and extracting evolutionary statistics from a global malware landscape, so malware evolution analysis is only conducted at a coarse level. The phylogenetic relationships among malware samples are barely exploited at a fine-grained (variant-set) level, even though these relationships indicate the evolution patterns of malware samples, which is extremely helpful for quickly differentiating known specimens from novel, previously unseen

samples. With no phylogenetic relationship indicated, previous works are unable to reveal the evolved changes in depth or to infer the causes of malware evolution. A fine-grained and in-depth malware phylogenetic analysis approach is thus critical for facilitating malware evolutionary analysis and, further, for building more reliable detection models. To date, however, there is no systematic approach for malware phylogenetic analysis at the variant-set level.

In our approach, we construct a phylogenetic network (PhyloNet) for each malware family. PhyloNets are generated through a systematic approach that categorizes the malware samples of a family into different variant sets by utilizing their phylogenetic features. A weighted formula representing the characteristics of each variant set is then generated from call graph analysis at the variant-set level, and the malware samples of each variant set are compared against the formulas of the other variant sets. Our approach can systematically infer evolution patterns from the pairwise asymmetric results derived from this comparison. We evaluate the variant-set clustering by inspecting the silhouette coefficient, the distances between malware samples, the malware creation dates for each variant set, and the consistency of the variant labels retrieved from various anti-virus vendors. To further understand how malware evolves from one variant to another, and the causes that trigger these changes, we manually inspect pairs of variant sets based on their evolution patterns

derived from our approach, and present interesting findings regarding malware evolution.

3.2 Approach

To better understand the evolution inside a malware family, we build a phylogenetic network (PhyloNet) for the malware variants inside each family (refer to Fig. 3.1), aiming to reveal the relationships among malware variants from the same family. The phylogenetic network is built in three major steps: 1) Variant Set Generation: we calculate the pairwise code-level similarity between malware samples in the same family and cluster them into different variant sets; 2) Formula Construction for the Variant Sets: a formula representing the characteristics of each variant set is constructed using the most representative sensitive sub-graphs in the set; 3) Phylogenetic Network Construction: a PhyloNet that determines the phylogenetic relationships among variant sets is generated by comparing variants with each variant set's formula.

3.2.1 Variant Sets Generation

A malware variant is a tweaked sample based on previously encountered malware. Different variant sets inside the same family are similar in terms of malicious behaviors but vary in their code implementations.


Figure 3.1: Relationship between a malware family and its variant sets. The malware samples in the same column (colour) share similar code. The malware samples from different columns inside a family perform similar malicious behaviours.

In this study, we use code level similarity to measure the differences in their implementations.

Fig. 3.1 represents the relationship between a malware family F and its variant sets (a, b, c, d, ...). Each column denotes a variant set (< n0, n1, ..., nm >), in which malware samples share a similar implementation at the code level. While the malware samples from different variant sets perform common malicious behaviors, they differ in code implementation.

Given a malware family F with n malware samples m1, ..., mn, we first calculate the similarity score Sim< mi, mj > between each pair of malware samples mi and mj (i, j = 1, 2, ..., n). We utilize SimiDroid [91], a plugin-based framework, to calculate the pairwise similarities. SimiDroid integrates various comparison methods, including code-based comparison at both the statement level and the component level. Since statement-level comparison best represents malware behaviors, we use statement-level similarity in this study. Similarity scores are then converted into distances (Dist< mi, mj > = 1 − Sim< mi, mj >), as shown in Fig. 3.2(a).
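Assuming the pairwise similarity scores have been collected into a matrix (the SimiDroid output format is an assumption here), the conversion to the distance matrix of Fig. 3.2(a) is a minimal sketch:

```python
def to_distance_matrix(sim):
    """Convert a pairwise similarity matrix (values in [0, 1]) into the
    distance matrix Dist<mi, mj> = 1 - Sim<mi, mj> used for clustering.
    The diagonal stays 0 because every sample is fully similar to itself."""
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] for j in range(n)]
            for i in range(n)]
```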

In order to analyze the relationships between malware samples in the same family, we adopt the UPGMA algorithm [75], a simple bottom-up hierarchical clustering method: each observation starts in its own cluster, and pairs of clusters are iteratively merged while moving up the hierarchy [124]. UPGMA has been widely used for evolution and relationship analysis among individuals or groups of species in biology, as well as in malware evolution analysis [145, 95]. It takes a symmetric distance matrix as its input; in our case, we map the distance values derived from the SimiDroid similarity results into such a matrix. UPGMA then generates a dendrogram with branches and leaf nodes, in which each leaf node represents a malware sample and the lengths of branches depict the distances among connected malware samples or clades, as demonstrated in Fig. 3.2(b).
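For illustration, here is a minimal pure-Python UPGMA (size-weighted average linkage; a library implementation could be used instead) that turns a distance matrix into a nested-tuple dendrogram:

```python
def upgma(dist, labels):
    """Minimal UPGMA: repeatedly merge the closest pair of clusters,
    averaging distances weighted by cluster sizes. Returns a nested-tuple
    dendrogram, e.g. (('m1', 'm2'), 'm3'). `dist` is a symmetric matrix."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # size-weighted average of distances to the merged cluster
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(min(next_id, c), max(next_id, c))] = (na * dac + nb * dbc) / (na + nb)
        del d[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

On the distance matrix of Fig. 3.2, the nesting depth of each leaf mirrors the dendrogram's branch structure.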

A distance threshold is then applied to further separate/merge clusters into variant sets, as illustrated in Fig. 3.2(c). We expect a malware sample to be well matched to its own variant set and poorly matched to neighboring variant sets; in other words, we expect high cohesion among the malware samples inside a variant set and high separation among different variant sets, which corresponds to a high silhouette coefficient [115]. In a malware family F, the cohesion degree of a malware sample m in its variant set Cm is defined as

\[
coh(m) = \frac{1}{|C_m| - 1} \sum_{n \in C_m,\, n \neq m} dist(m, n) \tag{3.1}
\]

where dist(m, n) denotes the distance between m and another malware sample n in the same variant set. The separation degree is calculated as

\[
sep(m) = \min_{C_k \in F,\, m \notin C_k} \frac{1}{|C_k|} \sum_{k \in C_k} dist(m, k) \tag{3.2}
\]


Figure 3.2: Variant Set Generation Process Overview. 1) A distance matrix is generated by similarity analysis of a set of APKs from the same malware family, utilizing SimiDroid. 2) By applying the UPGMA clustering algorithm, a malware family dendrogram is generated, representing the hierarchical relationships among the malware samples of the family. 3) After the distance threshold is determined, malware samples are then clustered into smaller clades, each representing a variant set. Four variant sets (a, b, c, d) are generated in the example.

where k is a malware sample from a variant set Ck other than Cm. The silhouette coefficient of malware sample m can then be drawn as

\[
silhouette(m) =
\begin{cases}
\dfrac{sep(m) - coh(m)}{\max\{coh(m),\, sep(m)\}}, & \text{if } |C_m| > 1 \\
0, & \text{if } |C_m| = 1
\end{cases} \tag{3.3}
\]

By calculating the silhouette coefficient for each malware sample m in family F, we obtain the average silhouette coefficient of the family:

\[
silhouette_f(F) = \frac{1}{|C_F|} \sum_{m \in F} silhouette(m) \tag{3.4}
\]

where |C_F| denotes the number of malware samples in F.
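Eqs. (3.1)-(3.4) can be sketched directly in Python; the version below assumes a symmetric distance matrix and at least two variant sets in the family:

```python
def silhouette(dist, labels):
    """Family-level average silhouette. For each sample m, coh is the mean
    distance to samples in the same variant set (Eq. 3.1), sep the minimum
    mean distance to any other set (Eq. 3.2); singleton sets score 0."""
    n = len(labels)
    sets = {}
    for i, c in enumerate(labels):
        sets.setdefault(c, []).append(i)
    scores = []
    for m in range(n):
        own = sets[labels[m]]
        if len(own) == 1:
            scores.append(0.0)  # Eq. (3.3), singleton case
            continue
        coh = sum(dist[m][x] for x in own if x != m) / (len(own) - 1)
        sep = min(sum(dist[m][x] for x in members) / len(members)
                  for c, members in sets.items() if c != labels[m])
        scores.append((sep - coh) / max(coh, sep))  # Eq. (3.3)
    return sum(scores) / n  # Eq. (3.4)
```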

When determining the distance threshold for forming the variant sets, we require the similarity (i.e., 1 − dist) among samples in the same variant set to be higher than a threshold (e.g., 0.5), while maximizing the silhouette coefficient of the family. Examples of how the distance threshold influences the silhouette coefficient in the malware families Fakebank and Commplat can be found in Fig. 3.3 and Fig. 3.4. As shown in Fig. 3.3, for the Fakebank family, the maximum silhouette coefficient (0.735) is achieved when the distance threshold is set to 0.2. For the Commplat family, the best silhouette is achieved at a distance of 0.6; however, to maintain the similarity among malware samples from the same variant set, we choose the best silhouette value achieved at a distance of at most 0.5. The dendrograms in Fig. 3.4 illustrate the variant sets (shown in different colours) obtained when the dendrogram is partitioned by the chosen distance thresholds (i.e., 0.2 and 0.5 for the Fakebank and Commplat families, respectively).
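The threshold search can be sketched as follows (a naive average-linkage cut, not our exact implementation): cut the hierarchy at each candidate threshold no larger than 0.5, and keep the cut that maximizes the family silhouette according to a supplied scoring function.

```python
def cluster_at_threshold(dist, t):
    """Greedy average-linkage clustering: merge the closest pair of clusters
    while their average inter-cluster distance stays below threshold t.
    Returns a list of clusters, each a list of sample indices."""
    clusters = [[i] for i in range(len(dist))]
    def avg(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > 1:
        pairs = [(avg(a, b), ai, bi) for ai, a in enumerate(clusters)
                 for bi, b in enumerate(clusters) if ai < bi]
        best, ai, bi = min(pairs)
        if best > t:
            break
        clusters[ai] = clusters[ai] + clusters[bi]
        del clusters[bi]
    return clusters

def choose_threshold(dist, thresholds, score_fn, cap=0.5):
    """Pick the threshold (<= cap, so within-set similarity stays >= 0.5)
    that maximizes the family silhouette computed by `score_fn`."""
    candidates = [t for t in thresholds if t <= cap]
    return max(candidates, key=lambda t: score_fn(cluster_at_threshold(dist, t)))
```

Plugging in the silhouette computation from Eqs. (3.1)-(3.4) as `score_fn` reproduces the selection procedure applied to the Fakebank and Commplat families.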

3.2.2 Formula Construction of the Variant Sets

After variant sets are generated, we form a formula for each variant set V, representing characteristics that are unique across family F but common to the samples of V; this formula is further used in constructing the PhyloNet. Borrowing the frequent sub-graph construction method proposed in [61], our formula construction comprises the following five steps, as demonstrated in Fig. 3.5: 1) extracting sensitive API calls for each malware sample in family F; 2) assigning a weight to each sensitive API call s using a TF-IDF-like approach, where TF (Term Frequency) measures the frequency with which s appears in variant set V, and IDF (Inverse Document Frequency) measures the inverse frequency with which s appears across all malware samples in F; 3) extracting sensitive call graphs and partitioning them into sub-graphs using community detection methods (e.g., Infomap [36]), while excluding sub-graphs that do not contain any sensitive API; 4) grouping sensitive sub-graphs into clusters based on their similarity [61], and selecting from each cluster the most representative sub-graph (i.e., the sub-graph that appears in the most malware samples of the cluster), referred to as a gene; 5) assigning a weight to each selected frequent sub-graph to form the formula of V.

Figure 3.3: Silhouette coefficient under varying method-based distance thresholds.

Figure 3.4: Variant set generation for the Fakebank and Commplat families.

Figure 3.5: Formula Construction of the Variant Sets.
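Step 2's TF-IDF-like weighting might look as follows (a sketch; the exact weighting in our implementation may differ): TF is the fraction of V's samples containing the sensitive call s, and IDF is the log inverse document frequency of s over all samples of F, so calls common to the whole family receive weight 0.

```python
import math

def tfidf_weights(variant_set_apis, family_samples_apis):
    """Weight each sensitive API call s: TF counts how often s appears
    across the samples of variant set V; IDF down-weights calls common
    to the whole family F. Each sample is a set of API call names."""
    tf = {}
    for sample in variant_set_apis:
        for s in set(sample):
            tf[s] = tf.get(s, 0) + 1
    n = len(family_samples_apis)
    weights = {}
    for s, count in tf.items():
        df = sum(1 for sample in family_samples_apis if s in sample)
        idf = math.log(n / df)  # df >= 1 since V's samples are in F
        weights[s] = (count / len(variant_set_apis)) * idf
    return weights
```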

3.2.3 Phylogenetic Network Construction

Variant Phylogenetic Network (PhyloNet) is a network-structure diagram revealing the relationships among the different variants of a malware family. Each node in the PhyloNet represents a variant set, while the edges depict how the sets are related to each other. A PhyloNet example can be found in Fig. ??. The details of how we compare a set of malware samples with a variant-set formula are shown in Algorithm 1. Given a variant set M and a formula f(V) generated from a variant set V, we extract the genes from f(V) and from each malware sample m in M, denoted genes_v and genes_m, respectively. For each individual gene vg ∈ genes_v, we find the gene mg ∈ genes_m that matches vg with the highest similarity based on graph comparison. This similarity is regarded as an importance factor that is multiplied by the weight of vg to measure the importance of vg to m. The importance of f(V) to M is then calculated by averaging the importance of each vg ∈ genes_v over each m ∈ M. In this way, we reduce the effect of genes common across the entire malware family and emphasize the effect of the typical genes that only appear in f(V).

To infer the phylogenetic relationship between two variant sets, we introduce

PhyloScore, which indicates whether the samples from one variant set are phylogenetically related to another variant set. Suppose we have two variant sets A and B from the same family; PhyloScore is defined as PS(A, B) = fB(A)/fB(B), where fB(A) and fB(B) are the corresponding relevance scores calculated by Algorithm 1. PhyloScore indicates how close the genes from A are to B, compared with how close the genes from B are to itself. The detailed algorithm for calculating PhyloScore is shown in Algorithm 2.

Algorithm 1: Relevance Calculation Algorithm
Input: Malware samples of a variant set M; formula f(V) of a variant set V
Output: Relevance F_V(M) between M and f(V)

1  Function CalculateRelevanceScore(M, f(V)):
2      genes_v ← GetGenesFromFormula(f(V));
3      total_score ← 0;
4      foreach malware sample m_i ∈ M do
5          genes_m ← GetGenesFromMalware(m_i);
6          score ← 0;
7          foreach variant set gene vg_j ∈ genes_v do
8              weight ← GetWeight(f(V), vg_j);
9              similarity ← GetHighestSimilarity(vg_j, genes_m);
10             score ← score + similarity * weight;
11         end
12         total_score ← total_score + score;
13     end
14     sample_num ← GetSampleNum(M);
15     F_V(M) ← total_score / sample_num;
16     return F_V(M)
17 end
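Algorithm 1 can be sketched in Python as follows. Here genes are modelled as frozensets of features and gene similarity as Jaccard similarity; both are simplifying assumptions, as the thesis instead compares weighted sensitive-API-call subgraphs.

```python
# Sketch of Algorithm 1: relevance of a variant-set formula f(V) to a set
# of malware samples M. Gene representation (feature sets) and similarity
# (Jaccard) are stand-ins for the graph-based comparison in the text.
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def relevance_score(samples, formula):
    """samples: list of gene lists, one per malware sample (genes_m).
    formula: list of (gene, weight) pairs extracted from f(V) (genes_v)."""
    total_score = 0.0
    for genes_m in samples:
        score = 0.0
        for vg, weight in formula:
            # the best-matching sample gene acts as an importance factor
            similarity = max(jaccard(vg, mg) for mg in genes_m)
            score += similarity * weight
        total_score += score
    return total_score / len(samples)  # average over the samples in M
```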

Algorithm 2: PhyloScore Calculation Algorithm
Input: Malware samples M_A from variant set A; malware samples M_B from variant set B; formula F_A for variant set A; formula F_B for variant set B
Output: A pair of asymmetric PhyloScores

1  Function CalculatePhyloScore():
2      score_ab ← CalculateRelevanceScore(M_A, F_B);
3      score_bb ← CalculateRelevanceScore(M_B, F_B);
4      score_ba ← CalculateRelevanceScore(M_B, F_A);
5      score_aa ← CalculateRelevanceScore(M_A, F_A);
6      phylo_score_ab ← score_ab / score_bb;
7      phylo_score_ba ← score_ba / score_aa;
8      return phylo_score_ab, phylo_score_ba
9  end

3.3 Evaluation

In this section, we describe the construction of our dataset. We further evaluate PhyloNet by inspecting the variant clustering, and evaluate the representativeness of the formula generated for each variant set.

3.3.1 Dataset

We constructed a new Android malware dataset from VirusShare, a repository of malware samples from all platforms. We wrote a script to extract Android malware APK samples and to ensure that each compressed APK file is not broken, encrypted or hardened, so that it can be decompiled and analysed. A total of 81,636 Android malware samples from 2008 to 2018 were collected.

To assist in the phylogenetic analysis of this newly constructed malware dataset, we further obtained malware family information from 63 different Anti-Virus vendors via VirusTotal, a platform that aggregates Anti-Virus vendors and online scan engines to check files uploaded by users for viruses [13]. The obtained family information enables us to analyse the different variants inside each malware family and to infer the phylogenetic relationships among them.

However, a common problem with the VirusTotal malware family information is that different Anti-Virus vendors may use different family names for the same malware family, so the labels provided by different vendors may be inconsistent. In addition, some Anti-Virus vendors publish malware family aliases on their websites, while most do not.

To address these inconsistencies, we unified the labels using the open-source tools AVClass [120] and Euphony [81]. These tools use fine-grained labelling to unify family names obtained from VirusTotal, achieving high accuracy with F-measures of 93.5% and 95.5%, respectively. As the two tools use different approaches for family labelling, we took the results of both into consideration. To obtain a well-labelled Android malware dataset, we extracted the samples that are detected as malware by at least five vendors. If AVClass and Euphony derived a consistent family label, we included the labelled sample in our dataset. After attempting to assign family labels to the initial set of 81,636 collected malware samples, we eliminated the samples without a family label, as well as those for which AVClass and Euphony disagreed. This resulted in a carefully labelled Android malware dataset with 18,863 samples in total.
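The filtering steps above can be sketched as follows. The per-sample record fields (`positives`, `avclass`, `euphony`) are hypothetical names for illustration, not the actual data schema used in the thesis.

```python
# Keep a sample only if: (1) at least five vendors flag it as malware, and
# (2) AVClass and Euphony derive the same, non-empty family label.
def filter_labelled(samples, min_positives=5):
    labelled = {}
    for sha256, info in samples.items():
        if info["positives"] < min_positives:
            continue  # not confidently detected as malware
        family_a, family_e = info["avclass"], info["euphony"]
        if not family_a or family_a != family_e:
            continue  # missing or inconsistent family label
        labelled[sha256] = family_a
    return labelled
```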

Fig. 3.6 shows the malware sample distribution of selected malware families with timestamps from 2008 to 2018. We selected these families to represent three typical malware family life cycles. As we can infer from the figure, in the first type of life cycle, some families appear very early in the history of Android malware. Malware samples from these families have been created throughout the selected ten-year timespan (from 2008 to 2018), and the number of samples grew fast in recent years. These families have survived for a long time, and malware writers are willing to update old samples or produce new samples of these families. Indeed, some malware from these families was reportedly discovered years ago, as suggested by VirusTotal, while their last modification dates show they have recently been modified. On the one hand, this shows that malware samples from these families can still evade improved Anti-Virus detection algorithms; on the other hand, we can also observe that these families usually bring more financial benefit to malware writers (e.g. opfake, smsagent, autoins), as financial incentives have always been the dominant motivation for writing and spreading malware. The second type of life cycle shows malware creation only within specific years, with no discoveries in other years (e.g. utchi, blouns). The third type of life cycle shows that some families grow fast at the beginning, and then the number of samples decreases until the family disappears (e.g. droidkungfu).

3.3.2 Insights Into Variant Clustering

To evaluate whether the malware samples from the same family are properly clustered into different variant sets, we specifically investigated families with more than 100 malware samples. From the dataset of 18,863 malware samples, we identified 26 such families. In total, 3,221 samples and 380 variant sets were identified by our clustering approach. In this section, we aim to answer the following questions: 1) What is the highest silhouette coefficient achieved for each malware family? 2) At what distance threshold does each malware family achieve its highest silhouette coefficient? 3) Are the same variants in each malware family created during the same period, and do they share the same variant label obtained from VirusTotal vendors?

Figure 3.6: Malware Sample Distribution of Six Malware Families From 2008 to 2018. Different malware families have different life cycles. The number of malware samples in opfake, smsagent and autoins increased dramatically in recent years. On the contrary, the number of samples from droidkungfu decreased and then the family disappeared. Utchi and blouns were only discovered in specific years.

                   Dist (0-1)      Dist (0-0.5)
Malware Family     Silh    Dist    Silh    Dist
baiduprotect       0.912   0.4     0.912   0.4
boqx               0.254   0.65    0.246   0.5
boxer              0.805   0.1     0.805   0.1
commplat           0.632   0.6     0.615   0.45
dianjin            0.592   0.3     0.592   0.3
dianle             0.536   0.6     0.511   0.5
droidkungfu        0.501   0.3     0.501   0.3
fakebank           0.735   0.2     0.735   0.2
feejar             0.479   0.1     0.479   0.1
geinimi            0.455   0.5     0.455   0.5
gepew              0.73    0.45    0.73    0.45
joye               0.74    0.15    0.74    0.15
jsmshider          0.667   0.2     0.667   0.2
kingroot           0.613   0.3     0.613   0.3
kmin               0.928   0.2     0.928   0.2
kyview             0.44    0.1     0.44    0.1
mobwin             0.766   0.95    0.713   0.5
nandrobox          0.374   0.8     0.255   0.5
nineap             0.779   0.15    0.779   0.15
oimobi             0.981   0.3     0.981   0.3
qysly              0.536   0.15    0.536   0.15
rootnik            0.539   0.3     0.539   0.3
shedun             0.527   0.25    0.527   0.25
skplanet           0.428   0.25    0.428   0.25
sprovider          0.589   0.4     0.589   0.4
tencentprotect     0.815   0.4     0.815   0.4
Average            0.628   0.35    0.620   0.3

Table 3.1: Best Silhouette Values Achieved for Each Malware Family and the Corresponding Distance Thresholds

Silhouette Coefficient

We investigated the maximum silhouette coefficient when malware samples are clustered with a distance threshold ranging from 0 to 1 (0 < dist < 1), as well as within a shorter range (0 < dist ≤ 0.5). We recorded the maximum silhouette coefficient value for each family, as well as the distance at which it was achieved. Table 3.1 shows this information for the 26 identified malware families. 21 of the 26 malware families achieved their best silhouette coefficients within the distance range from 0 to 0.5 (inclusive), and only five malware families (highlighted in grey) obtained their best silhouette coefficients at a distance above 0.5. As Table 3.1 shows, the malware family boqx has the lowest best silhouette coefficient (silh = 0.254), while droidkungfu's best silhouette coefficient is 0.501. The malware family kmin has a relatively high silhouette coefficient of 0.928, indicating a better clustering result for this family.

Based on these values, we drew radial phylograms for boqx, droidkungfu and kmin (see Fig. 3.7). Radial phylograms clearly show the distribution of malware samples within a family: the branches depict the distances between samples, and the leaf nodes (the oval shapes at the ends of branches) denote the samples. The samples in family boqx, with a sample number (n) of 80 and a silhouette coefficient (silh) of 0.254, tend to be less well-partitioned, as the samples in this family are more diverse in terms of code similarity, which makes them hard to cluster into distinct sets. droidkungfu (n = 82) achieves its highest silhouette coefficient (silh = 0.501) at a distance of 0.3. The samples in kmin (n = 140), which has an even higher silhouette coefficient (silh = 0.928), can be clustered more easily into variant sets due to their discriminative distribution. In general, if the maximum silhouette coefficient is achieved at a shorter distance, a more discriminative variant set can be observed.

To avoid sacrificing the similarity among malware samples within the same variant set, we keep the distance among samples within 0.5 (inclusive). For example, if we set the distance threshold to 0.5 rather than 0.65 for the family boqx, the silhouette coefficient only decreases by 0.008. Gathering this information across the 26 malware families, we find that the average best silhouette coefficient is 0.62 and the average distance threshold at which it is achieved is 0.3. 20 of the 26 families achieve a silhouette value higher than 0.5 at a distance of 0.5 or below. We can therefore infer that the samples in most malware families can be well-clustered at a shorter distance. By using our method, we can ensure that the generated variant sets are the best clustering result each family can achieve based on its own samples' information, without sacrificing similarity among the malware samples.
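The threshold sweep described above can be sketched in plain Python. Single-linkage clustering via connected components and a hand-rolled silhouette coefficient are simplifying assumptions standing in for the thesis's actual clustering pipeline; the distances in the usage are fabricated.

```python
# Sweep distance thresholds and keep the one with the best mean silhouette.
def cluster_by_threshold(dist, n, t):
    """Single-linkage clustering: samples i and j join one cluster when a
    chain of pairwise distances <= t connects them (connected components)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), d in dist.items():
        if d <= t:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def silhouette(dist, labels):
    """Mean silhouette coefficient over all samples, given a pairwise
    distance dict {(i, j): d} with i < j."""
    n = len(labels)
    def d(i, j):
        return dist[(min(i, j), max(i, j))]
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # silhouette of a singleton is taken as 0
        a = sum(d(i, j) for j in own if j != i) / (len(own) - 1)
        b = min(sum(d(i, j) for j in c) / len(c)
                for lab, c in clusters.items() if lab != labels[i])
        total += (b - a) / max(a, b)
    return total / n

def best_threshold(dist, n, thresholds):
    scored = [(silhouette(dist, cluster_by_threshold(dist, n, t)), t)
              for t in thresholds]
    return max(scored)  # (best silhouette, distance threshold)
```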

Creation Date for Variant Sets

Figure 3.7: Radial Phylograms for Malware Families Boqx, Droidkungfu, and Kmin. Boqx (n = 80) achieves its highest silhouette coefficient, 0.254, at a distance of 0.65; this family cannot easily be partitioned into clusters. Droidkungfu (n = 82) derives its best silhouette coefficient at a distance of 0.3, and relatively clear groups of samples can be observed in its radial phylogram. Kmin (n = 140) achieves a high silhouette value, 0.928, at a distance of 0.2, and its samples are more likely to be well-clustered into variant sets.

We further gathered the creation date of each malware sample to investigate whether malware from the same variant set was created during the same time period. For this, two dates are identified. The "First Seen in the Wild" date is provided by VirusTotal and notes when a malware sample was first discovered by the community. This date is regarded as the date when the malware was first created, although malware can sometimes be discovered years after its creation. We therefore also consider the last modification date, which can be extracted systematically from the .dex file included in the malware APK; in some papers it is referred to as the 'dex date' or 'last compiled date'. We choose this date because malware is more likely to keep evolving than to remain unchanged. The last modification date reflects the exact time when the malware completed its latest implementation and is automatically recorded by the system when the malware writer finishes compiling the APK. It is unlikely to be faked, because doing so requires further steps and, as far as we know, there is no benefit

for malware writers to do this. Therefore, we regard the last modification date as a reliable data source for malware temporal analysis.

Figure 3.8: Standard Deviation of the Creation Dates for Variant Sets. The average standard deviation of the last modification dates is three months and 18 days. 60% of the malware samples from the same variant set are created within a month, and 80.8% are created no later than four months apart.

We recorded the last modification dates of all malware samples, and then calculated the standard deviation of the last modification dates within each variant set, across the 380 variant sets of the 26 malware families. As Fig. 3.8 shows, malware samples belonging to the same variant set are mostly modified around the same time. The samples in 60% of the variant sets were created within a month of each other, and in 80.8% within four months, with the average standard deviation per variant set being three months and 18 days. This reveals that malware writers usually duplicate malware samples with slight changes, so that they can produce a large number of samples in a short time and then distribute them through the internet.
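The temporal measurement above can be sketched as follows. The dates used in the usage are fabricated examples, and the use of the population standard deviation (`pstdev`) is an assumption about how the deviation was computed.

```python
# Standard deviation, in days, of the last-modification dates of one
# variant set, plus the fraction of sets that fall under a given spread.
from datetime import date
from statistics import pstdev

def date_stddev_days(dates):
    """Population standard deviation of a list of dates, in days."""
    base = min(dates)
    offsets = [(d - base).days for d in dates]
    return pstdev(offsets)

def sets_within(variant_sets, max_days):
    """Fraction of variant sets whose date std-dev is at most max_days."""
    flags = [date_stddev_days(ds) <= max_days for ds in variant_sets]
    return sum(flags) / len(flags)
```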

        v0     v1     ...    vm     unified
a0      l00    l01    ...    l0m    la0
a1      l10    l11    ...    l1m    la1
...
an      ln0    ln1    ...    lnm    lan

Table 3.2: Malware samples from the same variant set and the various labels provided by different Anti-Virus vendors. The malware samples from variant set a are depicted by a0 to an, and the Anti-Virus vendors by v0 to vm. lnm denotes the variant label provided by vendor vm for malware sample an. The variant label lan is unified from the labels given by the different Anti-Virus vendors.

Variant Labels from Anti-Virus Vendors

The third type of information that we gathered is the variant labels, which we map by malware sample and Anti-Virus vendor. Table 3.2 shows the variant label information for a variant set named a: an represents the n-th malware sample of the variant set, vm represents the m-th Anti-Virus vendor, and lnm represents the variant label of malware sample an provided by vendor vm. The labels in the last column are unified labels based on a voting rule. For example, lan is derived from the labels ln0 to lnm: the label with the largest proportion is assigned to lan. After obtaining all the label information, we analysed the variant labels from three different aspects, as highlighted in different colours in Table 3.2: 1) Do different vendors provide the same variant label for the same malware sample? (grey section) 2) Do malware samples from the same variant set have the same variant label? (red section) 3) If we specify a vendor, does it give the same label to different malware samples from the same variant set? (yellow section)
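A minimal sketch of the voting rule just described, assuming labels are plain strings (tie-breaking for equally frequent labels is an unspecified detail; `Counter` resolves ties by first-seen order):

```python
# Unify a sample's variant label as the label assigned by the largest
# number of vendors (l_n0 .. l_nm -> l_an in Table 3.2's notation).
from collections import Counter

def unify_label(vendor_labels):
    return Counter(vendor_labels).most_common(1)[0][0]
```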

We collected the information provided by the Anti-Virus vendors that specify the correct malware family label (the family label unified by AVClass and Euphony), and then further extracted the malware variant label from this information. The variant label usually appears after the malware family label, separated by a character such as a dot (.), a forward slash (/), or an exclamation mark (!). For example, Trojan:Android/Boxer.C specifies an Android malware sample from the boxer family, and the variant label C comes after the family label. We can therefore automatically extract the variant label from the information provided by the vendors.
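The extraction of the variant label can be sketched with a regular expression. The separator set (., /, !) follows the text, while real vendor detection strings vary more widely than this pattern assumes:

```python
import re

# Given a detection string and a known (unified) family name, pull out
# the variant label that follows the family name after '.', '/' or '!'.
def extract_variant_label(detection, family):
    pattern = re.escape(family) + r"[./!]([A-Za-z0-9]+)"
    match = re.search(pattern, detection, flags=re.IGNORECASE)
    return match.group(1) if match else None
```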

For every malware sample, we gathered the variant labels from different vendors to answer the first question: do different vendors provide the same variant label for the same malware sample? We found that the variant labels obtained from different vendors for the same sample are inconsistent. To answer the second question, we adopted the voting method and unified the label for each malware sample (assigning the variant label most commonly provided by different vendors). We found that different malware samples inside a variant set also do not share the same variant label. We further investigated the third question: whether a specific vendor assigns the same variant label to different malware samples from the same variant set. The results show that only in this case are the labels consistent. Based on the variant sets clustered by our approach, we first group labels by vendor, and then, within each vendor, we further group the labels based on the variant set they come from. 68% of vendor-based label groups achieve a 100% consistent variant labelling result; that is, all the malware samples from the same variant set are labelled as the same type of variant by the same vendor. 85.7% achieve a consistency above 70%, indicating that at least 70% of the labels are the same. 100% of the groups achieve a variant label consistency above 50%; that is, when the malware samples from the same variant set are detected by the same vendor, at least half of the labels are always the same. This demonstrates that we obtained a set of well-clustered malware variants. Table 3.3 presents the average and lowest label consistency for four well-known vendors. The list is not exhaustive, since some well-known vendors either do not provide the correct malware family label or provide vague family labels (e.g. Gen, Artemis).

Vendor        Average Consistency   Min Consistency
BitDefender   93.05%                53.8%
Symantec      100%                  100%
F-Secure      83.7%                 50%
Avira         100%                  100%

Table 3.3: Label Consistency for Four Well-Known Anti-Virus Vendors. The average consistencies are all above 80%, with minimums no less than 50%.

3.3.3 Representativeness of the Variant Formula

During the formula generation process, we applied the approach proposed by Ming et al. [61]. The resulting formula represents the malicious behaviour more accurately, as the inaccurate separation between the malicious components and the legitimate part of the malware is addressed by applying the clustering-based approach to extract the common malicious behaviour in each variant set. It utilises a weighted sensitive-API-call-based graph matching approach to calculate the similarity between graphs, making it resistant to typical obfuscation techniques (e.g. renaming, user-defined functions). Moreover, it emphasises the typical features that variant sets possess, rather than the common features shared by the entire family, by assigning a weight to each subgraph (gene). The complete implementation can be found in [61]. This approach was previously used for malware family classification; to validate its feasibility at the variant level, we analyse the representativeness of each malware variant formula.

For each variant set formula inside a malware family, we calculate the PhyloScores PS(M, N), where M and N (M ≠ N) are both variant sets belonging to the same malware family. If PS(M, N) is always lower than 1, it means that the variant formula F_N represents variant set N best in comparison with the other variant sets from the same family. For the 380 identified variant sets, we generated a formula for each set. From this, we obtained 2,926 PhyloScores in total through pairwise comparisons inside each family, of which 93.05% have a value lower than 1. This demonstrates that a formula can adequately represent the features of its variant set: when malware samples from other variant sets are applied to the formula, they obtain a lower value than when its own malware samples are applied. Statistically, the variant set formula can efficiently represent the characteristics of its own malware samples.

3.4 Inspection on Malware Evolution

In this section, we present our Android malware evolution analysis. Firstly, we discuss the different types of phylogenetic relationships (e.g. subset and superset), present global statistics on these relationships, and further investigate the evolutionary structure of malware families. Then, through manual analysis, we reveal the evolutionary changes, the causes of those changes, and the ancestral relationships between variant sets. The results indicate that malware evolution forms a directed acyclic network with multiple ascendant and descendant relationships, rather than a hierarchical tree structure in which each node is generated from a single parent node.

3.4.1 Phylogenetic Relationship

Based on our empirical study, if the asymmetric PhyloScores are both lower than 0.25, it is unlikely that the two variant sets are closely phylogenetically related. Therefore, we only discuss pairwise variant sets in which at least one of the PhyloScores is 0.25 or higher, and we connect such pairs when forming the PhyloNet for each malware family. Fig. 3.9 shows example PhyloNets generated by our approach; an alphabetically earlier label indicates an earlier creation date. Family kmin consists of four variant sets (vertices) and four phylogenetic relations (edges) in its PhyloNet. Variant set D is connected with three previously discovered variant sets, while A and B are phylogenetically related; C is not closely related to A or B. In geinimi, a more complex structure can be observed, with five connected variant sets and six edges in total. Each variant set's degree is no less than two, and variant sets C and B are each connected to three other variant sets in the family.

Figure 3.9: Example PhyloNets for malware families kmin and geinimi. Each cluster represents a variant set. An alphabetically earlier label indicates an earlier created variant set. Two variant sets M and N (M ≠ N) are connected only if one of the PhyloScores (PS(M, F_N) or PS(N, F_M)) is larger than 0.25.

A stealthier behaviour can be achieved by adding malicious code to previous samples, or by modifying or removing part of the malicious code. The work in [49] discussed only one possible ascendant or descendant phylogenetic relationship among malware families, which corresponds to the subset and superset relation in our approach. Instead of giving a specific ascendant or descendant sequence, we first discuss how variant sets evolved from one to another based on set relations, which reflects the evolutionary changes more accurately. Based on the pairwise asymmetric PhyloScores, we then discuss the type of phylogenetic relationship. We define three types of phylogenetic relationships between malware variant sets based on our empirical study: a) subset and superset, b) intersecting, and c) non-relevance. The set relations are illustrated in Fig. 3.10, and the explanations are as follows:

• Subset and superset: Variant set A is a subset of variant set B (see Fig. 3.10(a)). B has nearly all of A's important genes, and B also has a great number of its own important genes that A does not have. We can infer that 1) B might be built based on A, and A might be an ancestor of B, or 2) A eliminated unnecessary genes for some reason (e.g. an Android API update). In this case, at least one of the two PhyloScores is larger than 0.25, while the difference between the two PhyloScores (i.e., |PS(B, A) − PS(A, B)|) is greater than 0.5.

• Intersecting: Variant set A and variant set B share important genes of their own (genes that are important for both A and B and can be discovered in few other variant sets), but each also introduces its own new important genes that distinguish it from other variant sets (see Fig. 3.10(b)). 1) A and B might be siblings: they inherited a certain portion of genes from the same parent, but new genes also arose in each of them. 2) Alternatively, the later variant set modified part of the important genes from the earlier one and added its own new important genes, which can also indicate their ancestral sequence. In this case, the PhyloScores PS(A, F_B) and PS(B, F_A) are both larger than 0.25 and their difference is less than 0.5.

Figure 3.10: Relations between variant sets. a) Subset and superset: B includes most of the essential genes from A. b) Intersecting: A and B share a relatively large number of the same essential genes. c) Non-intersecting: A and B share general family genes, but are not highly related regarding their own important genes.

• Non-relevant: Variant sets A and B share some general genes that nearly the whole family has, but they are unlikely to share each other's important genes. This explains why they still exhibit the same family traits (malicious behaviours) while not being highly related, due to very different implementations. In this case, PS(A, F_B) and PS(B, F_A) are both less than 0.25.
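The three rules above can be condensed into a small classifier. The thresholds 0.25 and 0.5 come from the text, while the handling of borderline cases the text leaves open (e.g. only one score above 0.25 with a small difference) is an assumption of this sketch:

```python
# Map a pair of asymmetric PhyloScores to a phylogenetic relationship
# type, following the empirical thresholds described above.
def classify_relation(ps_ab, ps_ba, rel=0.25, diff=0.5):
    if ps_ab < rel and ps_ba < rel:
        return "non-relevant"        # neither formula explains the other
    if abs(ps_ab - ps_ba) > diff:
        return "subset/superset"     # strongly asymmetric relevance
    return "intersecting"            # mutual, comparable relevance
```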

Based on our results, we have 1,716 pairs of compared variant sets, of which 48.7% exhibit the first relationship, subset and superset, 13.8% have an intersecting relationship, and the remaining 37.5% are not phylogenetically related. This does not mean the latter pairs are completely irrelevant: they still share the common malicious behaviours of their family, but beyond this they are not closely related. From these results, we can see that malware writers are more likely to enrich the implementation of a malware sample than to remove or modify its core malicious behaviour. Gathering the information from the 26 family PhyloNets, 1,716 pairs of variant sets are phylogenetically related, with at least one of the PhyloScores higher than 0.25. On average, the degree (d) of each variant set is 1.4, and 20% of the variant sets are connected with more than 3 other variant sets in their own family, which indicates that a variant set can have multiple children or parents.

3.4.2 Evolution Analysis

Based on the malware family PhyloNets we generated and the phylogenetic relationships derived for each pair of connected variant sets, we conducted manual analysis to further investigate the evolution between variants. Malware evolution can occur actively and passively. In this section, we present the evolutionary behaviours, the causes of evolution, and the ancestral relationships between variant sets in depth. Variant sets are labelled with letters in our context, and an earlier created variant set is named with an alphabetically smaller letter within its family. For example, droidkungfu.A was created before droidkungfu.B.

Malware samples usually utilise renaming techniques, so we cannot easily guess the semantics of classes and methods. To cut down the manual work in this case, we specifically locate the critical sensitive API calls. We firstly 1) eliminate the general sensitive APIs used by all the variant sets; then, for a pair of variant sets, we 2) extract their remaining shared sensitive API calls and the sensitive API calls that belong to each set alone. Next, we 3) randomly select APKs from each variant set as representatives, and finally we 4) decompile the APKs to investigate the source code of the app.
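Steps 1) and 2) of this triage can be sketched with set operations. The API names and the flat-set representation are illustrative assumptions; the analysis in the text works on decompiled code, not bare API name sets:

```python
# Step 1: drop sensitive APIs common to every variant set of the family.
# Step 2: for a pair of sets, split what remains into shared calls and
# calls unique to each set.
def triage_apis(family_sets, a, b):
    common = set.intersection(*family_sets.values())
    apis_a = family_sets[a] - common
    apis_b = family_sets[b] - common
    return apis_a & apis_b, apis_a - apis_b, apis_b - apis_a
```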

Active Evolution

In this category, malware evolves by actively evading malware detection systems and enhancing its malicious behaviours. We regard this type of behaviour change as active evolution, since it is not necessarily influenced by an external force (e.g. an Android platform update).

Detection Evasion

Boxer is a family of malware that pretends to be an installer or application downloader but actually sends premium-rate SMS messages without the user's acknowledgement. Variant sets boxer.C and boxer.D originated from the same app base. Variant set C has its own distinguishing sensitive API call, sendTextMessage(), from the Android class android.telephony.SmsManager, which is also a significant family behaviour. However, sendTextMessage() is nowhere to be found in variant set D, and the malware samples in D do not utilise the system's built-in SMS application either. Instead, we noticed a method called java.io.OutputStream.write(). When we further examined the code, we found that boxer.D reads bytes via the openStream() method, retrieves the data payload into the program, and then writes it to an executable

jar file. After careful examination, we found a user-defined abstract method called sendSingleMessage() and the class (Sender) it belongs to. The Sender class is initialised through a class loader, DexClassLoader, which reads the previously generated executable jar file; the class it loads contains the system message-sending method, so the malware can still send messages secretly. DexClassLoader is a class loader that loads classes from .jar and .apk files containing a classes.dex entry, and it can be used to execute code not installed as part of an application. The malware writer can hide the sensitive API calls inside a separate external executable file, where each sensitive API call is wrapped in a self-defined method with a harmless method name. Therefore, when the class is initialised from a class loader, the malware can call the malicious method without being noticed. A source code snippet can be found in Listing 3.1.

From our manual analysis, we can conclude that even though boxer.C and boxer.D share common features, variant set boxer.D utilised a different way to perform its core malicious behaviour rather than merely enriching the implementation: it eliminated the essential sensitive API sendTextMessage() and invoked other sensitive APIs instead. From the average last modification dates of both sets, we can infer that boxer.D was created six months after boxer.C. Because boxer.C and boxer.D share the same basis, but boxer.D updated its malicious behaviour and made it much stealthier, we regard boxer.D as an evolved malware variant set. This evolution technique is called dynamic loading, and based on our inspection it also occurs frequently between the

This evolution is called dynamic loading, and it also occurs frequently between the

77 23 // dynamically load class from an external dex file 24 public static Sender getSenderInstance(Context paramContext, 25 String paramString) throws Exception 26 { 27 return (Sender)new DexClassLoader(paramString,paramContext. 28 getDir("outdex",0).getAbsolutePath(),null,paramContext. 29 getClassLoader()).loadClass("com.software.application. 30 lib.Msg").newInstance(); 31 } 32 ... 33 // hide malicious sensitive api call in a self define benign 34 // method 35 private void startSendingMessages(){ 36 this.sender.sendSingleMessage(NUMBER10, null, PORT_PREF + 37 "+" + (String)this.schemes.second, null, null); 38 } Listing 3.1: Dynamic Loading connected variant sets from other malware families (e.g. boqx, dianjin, fakebank and etc.) based on our inspection.

It is not hard to notice that if a malware sample does not directly invoke the core sensitive malicious code of the family it comes from, it is likely to take a detour and invoke other sensitive APIs instead, to cover its real malicious intent. Dynamic loading is a typical example of how malware tries to hide its original intent by utilising other sensitive API calls. For example, in order to load the sendTextMessage() malicious code dynamically, the malware has to check the internet status first, then write the bytes to the dex file, and finally load the malicious code from the dex file to reach its goal. It thereby performs different steps, and thus invokes different sensitive API calls.

Another example of malware evading detection is the addition of guarding if statements. Looking into the source code of variant sets boxer.A and boxer.B, it is easy to find that the malware samples from boxer.B are an updated version of the samples from boxer.A. boxer.B added an airplane-mode handler, and it also checks whether the device has a valid SIM card inserted. The app therefore will not execute when it detects that airplane mode is enabled by the user or the SIM card is not available; however, once the device is connected to the mobile network, it starts to process the sendTextMessage() method. This update can evade dynamic detection when the APK is tested and monitored in a simulator. Variant sets boxer.A and boxer.B both send messages by directly utilising the Android class SmsManager; boxer.B added some handler code snippets to the original malware samples without eliminating the original sensitive API calls. We further checked the last modification dates of these two variant sets: the average last modification date of boxer.A is 01/03/2012, and that of boxer.B is 24/10/2012. The dates also support the fact that boxer.B was created or modified after boxer.A. Our manual analysis suggests that boxer.B was built based on boxer.A; boxer.A is boxer.B's parent.

Enhanced Malicious Behaviour

Geinimi arrives on the device as part of repackaged versions of legitimate applications. In geinimi.C, geinimi is packaged with a game application, and it utilised an updated version of the geinimi library. Compared with the old version of the geinimi package, the newly updated package does not become stealthier, but it

enhanced its malicious functionalities, as shown in the code snippet in Listing 3.2. Compared with its connected variant sets geinimi.A and geinimi.B, geinimi.C initialises a location listener, which requests a location update from GPS once the device moves a certain distance from its original point. If GPS is not enabled by the user, it requests a current location update from the network, or at least obtains the last known location. geinimi.A and geinimi.B instead try to obtain the available location providers that satisfy their criteria: if there is a qualified location provider, the application chooses the best one and requests the current location only once at a time; otherwise, it retrieves the last known location. Apparently, geinimi.C has enhanced its malicious function and can listen for location changes and automatically update the user's location information.

The second example of enhanced malicious behaviour is the addition of more unwanted or malicious libraries. kmin.A, kmin.B and kmin.C are all device theme management applications. Variant set kmin.C, which is connected with both kmin.A and kmin.B, has a completely different host app and composition. Apart from the application itself, kmin.C adds two advertisement libraries called 'adwo' and 'wooboon' to boost its profits. The host application is believed to be harmless based on our manual inspection. Both libraries are known to be advertisement libraries that are bundled with certain

Android applications. These two libraries are classified into PUA (Potentially

Unwanted App) by some anti-virus vendors, because they usually require displaying advertisements, downloading and installing applications, and sending device

// Request for a location update
final class h extends Thread{
    public final void run(){
        ((LocationManager)k.a().getSystemService("location")).
            requestLocationUpdates("gps", 600000L, 0.0F, d.a);
    }
}

// Initialise a LocationListener
public final class d{
    public static void a(String paramString){
        b = paramString;
        paramString = (LocationManager)k.a().getSystemService("location");
        if (a == null){
            a = new f();
        }
        ...
}

// Listen for a location change
final class f implements LocationListener{
    public final void onLocationChanged(Location paramLocation){
        d.a(paramLocation.getLatitude(), paramLocation.getLongitude());
    }
    ...
}

Listing 3.2: Enhanced Malicious Behaviour

information such as the International Mobile Station Equipment Identity (IMEI), kernel version, phone manufacturer, or phone model details to a remote location. The same pattern can be found in the geinimi family. More than five advertisement libraries were wrapped into the application along with a benign game package

in geinimi.D, while the previously discovered variant sets that are connected with geinimi.D in the same family are only repackaged with one library called

”geinimi”. Apart from adding more unwanted or malicious libraries, malware writers can also add other components. Variant sets kmin.A and kmin.B both secretly send device information to a remote server while displaying decoy messages in an alert dialog. Moreover, compared with the earlier-created variant set kmin.A, kmin.B added more remote server addresses, so that kmin.B can send the user's information to more remote servers.

Passive Evolution

In this category, malware evolves due to external changes. Malware writers need to update the malware samples to make sure that they still work in the current context. This type of evolution is driven by external changes and does not happen spontaneously.

Driven by Android Platform Update

In both variant sets kmin.A and kmin.D, SMS messages are sent to a premium-rate number. kmin.A used sendTextMessage() in the android.telephony.gsm.SmsManager Android Java class, but due to an API update, that SmsManager class was deprecated in API level 4 and replaced by android.telephony.

SmsManager, which supports both GSM and CDMA. Therefore, kmin.D updated its code and changed the call to sendTextMessage() in the class android.telephony.

SmsManager. The evolution is driven by the Android API update. This update

cannot be overlooked, because it can also evade most malware detectors that feed static features into a machine learning classifier. If the trained model is not kept up to date and aligned with the Android API documentation, malware can still easily evade detection.

As for variant sets geinimi.B and geinimi.D, it is not surprising to see that in geinimi.D most classes and methods have been obfuscated: it uses single letters to name classes and methods, making it hard to locate the malicious code snippets, whereas geinimi.A and geinimi.B use meaningful method names. Moreover, geinimi.B listens for the completion of the device's BOOT action; once it is completed, the application checks whether the device is rooted before it starts automatically installing applications by executing shell commands, so that it can install the additional applications quietly in the background, as shown in Listing 3.3. In geinimi.D, however, the malware writers also started to target devices that are not rooted. This reflects the fact that

Google has been optimising and updating the Android platform, so rooting is only needed by a certain group of users who want to take the extra effort to receive what rooting offers; for most users, the process is more trouble than it is worth

[33]. Therefore, malware writers gave up targeting the shrinking number of rooted Android devices and came up with new strategies. geinimi.D removed the root checker and changed its strategy for installing APKs: it now requests installation directly from the Android system; however, instead of notifying the user while installing the APK, it shows an alert dialog with both YES and Cancel buttons. No matter which

button the user hits, both choices trigger APK auto-installation from a given APK file path. The code snippet can be found in Listing 3.4.

public static boolean hasRoot(){
    return exc("echo test");
}

public static boolean installApk(String paramString1, String paramString2){
    return exc("cat " + paramString1 + " > /data/app/" + paramString2 + ".apk");
}

Listing 3.3: Root Checker

Repackaged with the Newest Host

Based on our previous clustering process, kmin.A, kmin.B and kmin.D are believed to have the same origin, and they are all phylogenetically related. APKs from these three variant sets may pose as an Android app named ”KMHome”. In kmin.A, the app provides some simple functions (e.g., setting a picture as the wallpaper), and these functions are enriched in later samples from kmin.B and kmin.D. kmin.D, especially, introduces a greater variety of functions compared with kmin.A and kmin.B.

It collects human-device interaction data (e.g., onKeyDown()) so that it can provide a more user-friendly experience; these parts of the implementation belong to the benign part of the application. However, the malicious behaviour of these three variant sets is exactly the same based on our observation.

The same pattern can also be observed in the geinimi family. Since the malware samples in the geinimi family usually come as repackaged applications,

// renamed function a() installs apk programmatically
private void a(){
    Uri localUri = Uri.fromFile(new File(this.b));
    Intent localIntent = new Intent("android.intent.action.VIEW");
    localIntent.setFlags(268435456);
    localIntent.setDataAndType(localUri, "application/vnd.android.package-archive");
    startActivity(localIntent);
    finish();
}

public void onCancel(DialogInterface paramDialogInterface){
    paramDialogInterface.dismiss();
    a();
}

public void onClick(DialogInterface paramDialogInterface, int paramInt){
    paramDialogInterface.dismiss();
    a();
}

Listing 3.4: Cover malicious trace by displaying a decoy alert box

we can easily separate the host and the rider parts when manually inspecting variant sets geinimi.A, geinimi.B and geinimi.C. In this case, we can notice the changes made to the benign host (when the hosts are similar) as well as analyse the evolution of the malicious geinimi packages alone. geinimi.A and geinimi.B utilised exactly the same geinimi packages; as for the host, the applications in geinimi.A are only a subsection of those in geinimi.B, while geinimi.B performs a range of other functionalities.

These two are typical examples showing that malware writers implement their malicious code within newly updated host

apps, so that the malware can stay as long as possible on the victims' mobile devices. Besides, a more complex application helps cover the malicious trace.

3.5 Summary

We proposed an approach that reduces the manual effort required for malware analysis. Our approach clusters malware from the same family into different variant sets according to their code similarity, and we then manually observed their evolution features through the inferred phylogenetic relationships between the sets. We found that, by utilising our method, we can quickly notice the changes between malware samples and infer their evolved behaviours; the analysis also confirms that malware has become stealthier and more sophisticated.

Chapter 4

Repackaging Malware for Evading

Machine-Learning Detection

Machine learning based solutions have been successfully employed for automatic detection of malware in Android applications. However, machine learning models are known to lack robustness to adversarial examples, which are crafted by adding minor, yet carefully chosen, perturbations to normal inputs. So far, adversarial examples have only been able to deceive Android malware detectors that rely on syntactic features (e.g., requested permissions, specific API calls, etc.), and the perturbations could only be implemented by simply modifying the Android manifest.

While recent Android malware detectors rely more on semantic features from

Dalvik bytecode rather than the manifest, existing attacking/defending methods are no longer effective due to the rising challenge of adding perturbations to Dalvik bytecode without affecting the original functionality. In this chapter, we introduce a new, highly effective attack that generates adversarial examples of Android malware and evades detection by current models. To this end, we propose a method of applying optimal perturbations onto

Android APK using a substitute model (i.e., a Deep Neural Network). Based on the transferability concept, the perturbations that successfully deceive the substitute model are likely to deceive the original models as well (e.g., Support

Vector Machine in Drebin or Random Forest in MaMaDroid). We developed an automated tool that generates the adversarial examples and applies the attacks without human intervention. In contrast to existing works, the adversarial examples crafted by our method can also deceive recent machine learning based detectors that rely on semantic features such as the control flow graph. The perturbations can also be implemented directly on the APK's Dalvik bytecode rather than the Android manifest, to evade recent detectors. We evaluated the proposed manipulation methods for adversarial examples by using the same datasets that Drebin and MaMaDroid

(5879 malware examples) used. Our results show that the malware detection rates decreased from 96% to 0% in MaMaDroid, and from 97% to 0% in Drebin, with just a small distortion generated by our adversarial example manipulation method.

4.1 Introduction

With the growth of mobile applications and their users, security has increasingly become a great concern for various stakeholders. According to McAfee’s report

[103], the number of mobile malware samples increased to 22 million in the third quarter of 2017. Symantec further reported that on the Android platform, one in every five mobile applications is actually malware [132]. Hence, it is not surprising that the demand for automated tools for detecting and analysing mobile malware has also risen. Most researchers and practitioners in this area target the Android platform, which dominates the mobile OS market. To date, there has been a growing body of research in malware detection for Android.

Among all the proposed methods [64], machine learning based solutions have been increasingly adopted by anti-malware companies [83] due to their anti-obfuscation nature and their capability of detecting malware variants as well as zero-day samples. Despite the benefits of machine learning based detectors, it has been revealed that such detectors are vulnerable to adversarial examples [107, 42]. Such adversarial examples are crafted by adding carefully designed perturbations to the legitimate inputs that force machine learning models to output false predictions

[72, 107].

Analogously, adversarial examples for machine learning based detection are very much like HIV, which progressively disables the human immune system. We chose malware detection on the Android platform to assess the feasibility of using adversarial examples as a core security problem. In contrast to the same issue in other areas such as image classification, the span of acceptable perturbations is greatly reduced: an image is represented by pixel values in the feature space and the adversary can modify the feature vector arbitrarily, as long as the modified image is visually indistinguishable [149]; however, in the context of crafting adversarial examples for Android malware, a successful case must comply with the following restrictions, which are much more challenging than in the image classification problem: 1) the perturbation must not jeopardise the malware's original functions, and 2) the perturbation to the feature space must be practically implementable in the Android PacKage (APK), meaning that the perturbation can be realised in the program code of an unpacked malware sample and can also be repacked/rebuilt into an APK.

So far, there are already a few attempts on crafting/defending adversarial examples against machine learning based malware detection for Android platform.

However, the validity of these works is usually questionable due to their impracticality. For example, Chen et al. [46] proposed injecting crafted adversarial examples into the training dataset so as to reduce detection accuracy. This method is impractical because it is not easy for attackers to gain access to the training dataset in most use cases. Grosse et al. [77] explored the feasibility of crafting adversarial examples on the Android platform, but their malware-detecting classifier was limited to a Deep Neural Network (DNN). They could not guarantee the success of adversarial examples against traditional machine learning detectors such as Random Forest (RF) and Support Vector Machine (SVM). Demontis et

Figure 4.1: File structure of APK. AndroidManifest.xml declares the essential information; classes.dex contains the Dalvik bytecode; resources.arsc holds the compiled resources in binary format; the META-INF, lib, assets, and res folders include the meta-data, libraries, assets, and resources of the application, respectively.

al. [51] proposed a theoretically-sound learning algorithm to train linear classifiers with more evenly-distributed feature weights. This allows one to improve system security without significantly affecting computational efficiency. Chen et al.

[45] also developed an ensemble learning method against adversarial examples.

Yang et al. [145] constructed new malware variants for malware detectors to test and strengthen their detection signatures/models. According to our research, all these ideas [51, 45, 145] can only be applied to malware detectors that adopt syntactic features (e.g., permissions requested in the manifest or specific

APIs in the source code [133, 25, 26, 108]). However, almost all recent machine learning based detection methods rely more on the semantic features collected

from Dalvik bytecode (i.e., classes.dex). This disables existing methods of crafting/defending adversarial examples on the Android platform. Moreover, it is usually simple for existing methods to modify the manifest to generate adversarial examples; however, when the features are collected from the bytecode, it becomes very challenging to modify the bytecode without changing the original functionality, due to its programmatic complexity. Therefore, existing works are not of much value in providing proactive solutions to the ever-evolving adversarial examples in terms of Android malware variants [45, 77, 46, 51, 145].

In this chapter, we propose and study a highly-effective attack that generates adversarial malware examples in Android platform, which can evade being detected by current machine learning based detectors. In the real world, defenders and attackers are always engaged in a never-ending war. To increase the robustness of

Android malware detectors against malware variants, we need to be proactive and take potential adversarial scenarios into account while designing malware detectors. The work in this chapter envisions an advanced method to craft Android malware adversarial examples. The results can be used by Android malware detectors to identify malware variants with the manipulated features. For the convenience of description, we selected two typical

Android malware detectors, MaMaDroid [102] and Drebin [25]. These two detectors use semantic and syntactic features, respectively, to model malware behaviours.

We summarise the key contributions of this chapter as follows:

• Technically, we propose an innovative method of crafting adversarial

examples on recent machine learning based detectors for Android malware

(e.g., Drebin and MaMaDroid). These detectors mainly collect features (either

syntactic or semantic ones) from Dalvik bytecode to capture behaviors of

Android malware. This contribution is distinguishable from the existing

works [45, 46, 51, 77], which can only target/protect the detectors relying

on syntactic features.

• Practically, we designed an automated tool to apply the method to the

real-world malware samples. The tool calculates the perturbations, modifies

source files, and rebuilds the modified APK. This is a key contribution as

the developed tool adds the perturbations directly to APK’s classes.dex.

This is in contrast to the existing works (e.g., [46, 77]) that simply apply

perturbations in AndroidManifest.xml. Although manifest-level perturbations are easy to implement,

they cannot target/protect recent Android malware detectors (e.g., [54, 123])

which do not extract features from Manifest.

• We evaluated the proposed manipulation methods of adversarial examples

by using the same datasets that Drebin and MaMaDroid (5879 malware

samples) used [25, 129]. Our results show that the malware detection

rates decreased from 96% to 0% in MaMaDroid, and from 97% to 0% in

Drebin, with just a small distortion generated by our adversarial example

manipulation method.

4.2 Android Application Package

Android applications are packaged and distributed in the form of APK files.

The APK file is a jar-like archive that packs the application’s dexcode (.dex

files), resources, assets, and manifest file. The structure of an APK is shown in Fig. 4.1. In particular, AndroidManifest.xml is designed for meta-data such as the permissions requested and definitions of components like Activities, Services,

Broadcast Receivers and Content Providers. Classes.dex is used to store the

Dalvik bytecode to be executed on the Android Runtime environment. The res folder contains graphics, string resources, user interface layouts, etc. The assets folder includes non-compiled files, and META-INF stores the signatures and certificates.

The state-of-the-art detectors usually use machine learning based classifiers to categorize applications as either malicious or benign [25, 26, 102, 108, 133].

Features employed by such classifiers are extracted from the APK archive by performing static analysis on the manifest and dexcode.

Manifest introduces the components of an application as well as its requested permissions. Such information is presented in a binary XML format inside

AndroidManifest.xml.

Contents presented in the manifest are informative, implying the intentions and behaviours of an application. For instance, requesting the android.permission.SEND_SMS and android.permission.READ_CONTACTS permissions indicates that the application may send text messages to the user's contacts. Features retrieved from the manifest are usually constructed as a vector of binary values, where each value indicates the presence/absence of a certain element in the manifest. Dexcode, or Dalvik

Bytecode, is the operational code on the Android platform. All the Java source code is compiled and assembled into a single Dalvik Executable (classes.dex).

Features extracted from classes.dex, such as Control-Flow-Graph (CFG) and

Data-Dependency-Graph (DDG), contain rich semantic information and the logical structure of the application. They are usually presented in two forms: 1) the raw sequence of API calls, and 2) statistical information retrieved from the call graph (e.g., similarity scores between two graphs [54]). Such features have proven to have strong discriminating power for the identification of malware.

To evade detection by machine learning based detectors, a malware sample has to be manipulated so that the features extracted for the learning systems look benign. Intuitively, the target files to be modified are those from which the features are extracted, i.e., AndroidManifest.xml and/or classes.dex. While both of these files are in binary format and are not human-readable, decompiling tools such as apktool can convert them into a readable format. Specifically, the binary XML can be transformed into plain-text XML, and the Dalvik bytecode can be disassembled into smali files, which are more human-friendly intermediate representations of the bytecode. The processed manifest and smali files can then be edited and reassembled into an APK.
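As a rough illustration of this round trip, the decompile-edit-rebuild workflow could be driven from a script. The sketch below only assembles the apktool command lines; the file names and the dry_run switch are hypothetical, and the rebuilt APK would still need to be re-signed before installation.

```python
import subprocess

def decompile_edit_rebuild(apk_in, work_dir, apk_out, dry_run=True):
    """Build (and optionally run) the apktool commands for one round trip."""
    cmds = [
        # Disassemble: binary XML -> plain-text XML, classes.dex -> smali files
        ["apktool", "d", apk_in, "-o", work_dir, "-f"],
        # ... the smali and manifest files in work_dir would be edited here ...
        # Reassemble the edited sources into a new (unsigned) APK
        ["apktool", "b", work_dir, "-o", apk_out],
    ]
    if not dry_run:  # only invoke apktool when explicitly requested
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

In practice the editing step between the two commands is where the perturbations discussed later in this thesis would be applied.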

4.3 Targeted Systems and Attack Scenarios

We propose a framework to craft adversarial examples that can evade machine learning based detection. Generally, machine learning based malware detection methods leverage two types of features: static and dynamic features. Static features are collected from disassembled APKs. Examples of such features include requested permissions, API call sequences, and control flow graphs. Dynamic features, on the other hand, are collected during the execution of the applications by monitoring their behavior and communication patterns. Since dynamic features are collected by feeding random inputs, it is more challenging to alter dynamic features than static features. Therefore, we target static features in this work, and leave the dynamic case for future work. Specifically, we target two typical solutions which have been widely analysed in this field, i.e., MaMaDroid [102] and Drebin

[25]. The semantic features that MaMaDroid uses are extracted from dexcode, and the syntactic string values which are adopted by Drebin are retrieved from both dexcode and manifest. We provide an overview of MaMaDroid and Drebin below.

4.3.1 MaMaDroid

MaMaDroid extracts features from the CFG of an application. It uses the sequence of abstracted API calls rather than the frequency or presence of certain APIs, aiming at capturing the behavioural model of the mobile application. MaMaDroid

96 operates in two modes, namely family mode and package mode. API calls will be abstracted to either family level or package level according to their mode. For instance, the API call sendTextMessage() is abstracted as:

android.telephony.SmsManager: void sendTextMessage()

where android is the family, android.telephony.SmsManager is the package, and the full signature is the API call.

Family mode is more lightweight, while package mode is more fine-grained. We demonstrate the results of attacking both.

MaMaDroid first extracts the CFG from each application and obtains the sequences of API calls. Then, the API calls are abstracted using either of the above-mentioned modes. Finally, MaMaDroid constructs a Markov chain, with the transition probabilities between each family or package used as the feature vector to train a machine learning classifier. Figs. 4.2 to 4.5 illustrate the feature extraction process in MaMaDroid. Sub-figure (a) is a code snippet decompiled from a malicious application; sub-figure (b) shows the call graph extracted from the source code; sub-figure (c) is the abstracted call graph generated from (b); and finally, sub-figure (d) presents the Markov chain generated based on

(c).
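The abstraction and Markov-chain construction can be sketched as follows. This is a simplified illustration: the family list and the prefix matching are placeholders, not MaMaDroid's actual implementation.

```python
from collections import defaultdict

# Simplified family abstraction: map an API call to its family by name prefix.
FAMILIES = ["android", "java", "javax", "self-defined", "obfuscated"]

def abstract_call(api_call):
    prefix = api_call.split(".")[0]
    return prefix if prefix in FAMILIES else "self-defined"

def markov_features(call_sequence):
    """Transition probabilities between abstracted families (the feature vector)."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(call_sequence, call_sequence[1:]):
        counts[abstract_call(src)][abstract_call(dst)] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        # Each row of the Markov chain sums to 1
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs

seq = ["android.telephony.SmsManager.sendTextMessage",
       "java.lang.Thread.start",
       "android.location.LocationManager.requestLocationUpdates"]
features = markov_features(seq)
```

Package mode would work the same way, only abstracting each call to its package prefix instead of its family, giving a larger transition matrix.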

MaMaDroid recognises nine families and 338 packages from the official Android documentation. Packages defined by application developers or obfuscated with identifier mangling are abstracted as self-defined and obfuscated, respectively.

Overall, there are 340 possible packages and 11 families.

Figure 4.2: Process of feature extraction in MaMaDroid - source code

Figure 4.3: Process of feature extraction in MaMaDroid - call graph

Figure 4.4: Process of feature extraction in MaMaDroid - call sequence

Figure 4.5: Process of feature extraction in MaMaDroid - Markov chain

Given the extracted features, MaMaDroid leverages RF, KNN, and SVM to train the malware detector and test the performance on several datasets

(which were collected over different time periods). RF outperforms the other two classifiers, with its F-measure reaching 0.98 and 0.99 in the family and package modes, respectively.
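A minimal sketch of such a training setup using scikit-learn is shown below; the feature matrix and labels are random placeholders, not the MaMaDroid dataset, so the resulting score is only near chance level.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Placeholder: 200 apps x 121 features (11 families -> 11x11 transition
# matrix, flattened), with synthetic labels (0 = benign, 1 = malware)
X = rng.random((200, 11 * 11))
y = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
```

With real family-mode or package-mode features in place of the random matrix, this is the kind of pipeline whose F-measure MaMaDroid reports.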

4.3.2 Drebin

Drebin is an on-device lightweight Android malware detector. Drebin extracts features from both the manifest and the disassembled dexcode through a linear sweep over the manifest file and the disassembled smali files of the application.

The features such as permissions, activities, and API calls are presented as strings.

Eight sets of features are retrieved, as listed in Table 4.1.

Drebin feature sets

manifest:
  S1  Hardware components
  S2  Requested permissions
  S3  App components
  S4  Filtered intents

dexcode:
  S5  Restricted API calls
  S6  Used permissions
  S7  Suspicious API calls
  S8  Network addresses

Table 4.1: Overview of the Drebin feature sets

The extracted features are put into a multidimensional vector space S to create a

|S|-dimensional space, in which each dimension takes the value 0 or 1, indicating the presence or absence of the corresponding feature. The following shows an example of the feature vector ϕ(x) of a malicious application that sends premium

SMS messages and thus requests certain permissions and hardware components.

  ......        0  permission.SEND SMS        1  permission.RECORD AUDIO     ϕ(x) 7→  ... ...        1  hardware.camera        0  hardware.telephony     ......

After the features are retrieved, Drebin learns a linear SVM classifier to discriminate between benign and malicious applications. The classification performance of Drebin was evaluated on a dataset consisting of 5,560 malware samples and 123,453 benign applications, collected between August

2010 and October 2012.

4.3.3 Attack Scenarios

The knowledge of the target system obtained by the adversary may vary in different situations. This includes the feature set, the training set, the classification algorithm as well as the parameters. We argue that in the real world, it is not likely for the adversary to have full knowledge of the classification algorithm used

in the target detector. However, the adversary can probe the detector by feeding desired inputs and observing the corresponding outputs.

Knowing the feature set, as a baseline assumption for attacking learning systems, has been widely adopted in similar works in this field [34, 46, 77, 89].

Therefore, in this chapter, we consider the following four situations in our attack:

1) Scenario F : the adversary only knows the feature set; 2) Scenario FB: The adversary knows the feature set only, and can query the target detector as a black box; 3) Scenario FT : The adversary knows both the feature set and training set, but cannot query the target detector; and 4) Scenario FTB: The adversary knows both the feature set and the training set, and can also query the target system as a black box. Note that in the scenarios that allows querying the target system as a black-box (i.e., scenario FTB and FB), the only information that the adversary can get is the predicted label from the black-box oracle when given an input.

Also note that in the scenarios of having access to the training set (i.e., scenario

FTB and FT), the adversary can only have a copy of the training set, but he/she cannot inject new samples or modify the existing ones in the training set.

4.4 Attack on MaMaDroid

4.4.1 Attack Algorithm

We introduce an evasion attack on MaMaDroid in this section. The purpose is to make a piece of malware evasive with minimal API call injections into its

original smali code. We assume that we only have black-box access to the target

(MaMaDroid) detector. In other words, we can get output from MaMaDroid by feeding it input, but we do not know how it processes the input internally. There are two considerations for the features used in MaMaDroid. First, because the features are actually the state transition probabilities of the call graph, the probabilities of the transitions departing from the same node in the call graph must sum to 1. Second, each feature value should be bounded between 0 and 1. We address these considerations in our algorithms.

We employ two adversarial example crafting algorithms that have been widely adopted to generate evasive malware examples. To study a more effective way of attacking, we craft adversarial examples by either optimising an adversarial objective function (referred to as C&W) or perturbing influential features based on the indicative forward derivatives (referred to as JSMA). C&W and JSMA were originally developed for crafting adversarial image examples, which have continuous pixel values as features. In our case, we calculate the perturbation based on the number of API calls, which is discrete. Therefore, we refine the plain C&W and JSMA algorithms to cater to our needs. We construct a neural network F as a substitute model to launch the attack. In the malware detection case, F is a binary classifier with a 2-D output. Let the input features of the original malware form an n-dimensional vector, denoted as X.

Refined C&W

C&W crafts adversarial malware with a tunable attack confidence while optimising the distortion on the original malware features. We modify C&W to search for an adversarial malware sample by optimising an objective function with the following constraints:

min_δ ||δ||_2^2 + c · f(X + δ)

s.t. X + δ ∈ [0, 1]^n,                                    (4.1)

and ||X^g + δ^g||_1 = 1,  g ∈ 1…k.

Here, δ is the perturbation to be optimised and c is a constant balancing the two terms in the objective function. We use line search to determine the value of c. The first term in the objective function minimises the l2 distortion on the original features, which means the change to the MaMaDroid features should be small enough to limit the number of API calls we insert into the smali code. The second term is a specially designed adversarial loss function f. Suppose t is the ground-truth class of the current malware example X. Our goal is to make X incorrectly classified into the other class (in our case, the benign class). Thus, f takes the following form:

\[
f(X) = \max\left(Z(X)_t - \max\{Z(X)_i : i \neq t\}, \; -\kappa\right) \tag{4.2}
\]

in which Z(X) is the pre-softmax output of the substitute F, and κ is a hyper-parameter that adjusts the attack confidence; f maximises the loss between the output of the current model and the ground truth. To address the aforementioned considerations, we apply two constraints in the optimisation. First, each feature after perturbation should lie between 0 and 1. Second, the l1 norm of the features in each family/package group should equal 1.
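As a concrete illustration, the adversarial loss f in Eq. (4.2) can be sketched in Python as follows (a minimal NumPy sketch; the function name and example logits are ours, not part of the original implementation):

```python
import numpy as np

def cw_loss(z, t, kappa=0.0):
    """C&W adversarial loss: max(Z(X)_t - max{Z(X)_i : i != t}, -kappa).

    z: pre-softmax logits Z(X) of the substitute (1-D array),
    t: ground-truth class index, kappa: attack-confidence margin.
    The loss stays positive while the sample is still classified as t,
    and bottoms out at -kappa once the attack succeeds with margin kappa.
    """
    other = np.max(np.delete(z, t))
    return float(max(z[t] - other, -kappa))

# Malware (class 1) still classified as malware: positive loss to minimise.
print(cw_loss(np.array([0.2, 1.5]), t=1))              # ~1.3
# After a successful attack the loss is clipped at -kappa.
print(cw_loss(np.array([2.0, 0.5]), t=1, kappa=0.5))   # -0.5
```

Minimising this term (together with the distortion term) drives the malware logit below the benign logit by at least the margin κ.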

The objective function is optimised with AdaGrad [56]. The feature values are iteratively updated until the sample is misclassified. We use either the substitute model (in scenarios F and FT) or the MaMaDroid oracle (in scenarios FB and FTB), which we refer to as the pilot classifier, to determine whether a sample is misclassified.

Since the current feature vector X is a set of probabilities, to make the perturbation viable during the code injection into the original smali code, we change the optimisation variable from δ on X (the perturbation on the probabilities) to ω on A (the perturbation on the number of API calls). For the perturbation on the i-th feature in group g, we have:

\[
\delta_i^g = \frac{a_i^g + \omega_i^g}{a^g + \omega^g} - \frac{a_i^g}{a^g}. \tag{4.3}
\]

wherein ω^g = Σ_i ω_i^g and a_i^g is the number of API calls indicated by the i-th feature in the g-th group. We change the optimisation variable from δ to ω. Accordingly, we change the first term of the adversarial objective function to ‖ω‖₂², in order to minimise the total number of code injections.

Deleting code may jeopardise the functionality of the malware; therefore, we only inject code to make adversarial examples. We apply a ReLU function (i.e., ReLU(ω) = max(0, ω)) to clip ω to non-negative values after each iteration. As a result, the first constraint (i.e., (a_i^g + ω_i^g)/(a^g + ω^g) ∈ [0, 1]) is automatically satisfied. To satisfy the second constraint (the sum of the feature values in the same group being 1), we normalise (a_i^g + ω_i^g)/(a^g + ω^g) over each group after each gradient descent round. The detailed algorithm is given in Algorithm 3.
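The mapping from added call counts to perturbed features (Eq. (4.3)), including the ReLU clipping and the per-group renormalisation described above, can be sketched as follows (a minimal NumPy illustration; variable names are ours):

```python
import numpy as np

def perturb_group(a_g, omega_g):
    """Perturbed transition probabilities for one caller group g.

    a_g: original API-call counts towards each callee in the group,
    omega_g: the optimiser's (real-valued) perturbation on those counts.
    Returns the new feature values (a_i^g + w_i^g) / (a^g + w^g).
    """
    # Only code injection is allowed: clip negative perturbations (ReLU).
    omega_g = np.maximum(omega_g, 0.0)
    # Each feature lies in [0, 1] and the group sums to 1 by construction.
    return (a_g + omega_g) / (a_g.sum() + omega_g.sum())

a = np.array([3.0, 1.0, 0.0])          # calls towards three callee groups
x = perturb_group(a, np.array([0.0, -2.0, 4.0]))
print(x)          # values 0.375, 0.125, 0.5
print(x.sum())    # 1.0
```

Both constraints from Eq. (4.1) then hold automatically: clipping keeps every call count non-negative, and dividing by the group total renormalises the row of probabilities.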

Refined JSMA

JSMA finds adversarial examples using the forward derivatives of the classifier. It iteratively perturbs the most influential features, which are determined by the Jacobian matrix between the model inputs and outputs. The method first calculates the Jacobian matrix between the input features X and the outputs of F. In the case of MaMaDroid, we want the Jacobian between the API call numbers A and the outputs of F, given the relationship between the API call numbers and the probabilities (i.e., the input features). The Jacobian can be calculated as follows:

\[
J_F(A) = \left[\frac{\partial F(X)}{\partial X}\,\frac{\partial X}{\partial A}\right] = \left[\frac{\partial F_j(X)}{\partial x_i}\,\frac{\partial x_i}{\partial a_i}\right]_{i \in 1 \ldots n,\ j \in \{0, 1\}} \tag{4.4}
\]

wherein i is the index of the input feature and j is the index of the output class label (in our case the classification is binary). x_i is the i-th feature, a_i is the corresponding i-th API call, and F_j(X) is the output of the substitute for the j-th class. Suppose t is the ground-truth label. To craft an adversarial example, F_t(X) should decrease while the outputs of the other classes, F_j(X), j ≠ t, increase.

Algorithm 3: Refined C&W Attack Algorithm
Input: the substitute model F; feature vector X of the input malware sample; ground-truth label t of the input malware sample; number of each API call a; output label Y from the substitute given an input; constant c balancing the distortion and the adversarial loss; upper bound C of c in line-search; hyper-parameter κ controlling the strength of the attack; maximum number of iterations γ; step length α in gradient descent
Output: feature vector X* of the adversarial malware sample
 1: X* ← X
 2: max_iter ← γ
 3: Objective_adv ← ‖ω‖₂² + c · f((a_i^g + ω_i^g)/(a^g + ω^g) | F, κ)
 4: while c < C and Y = t do
 5:   while iter < max_iter do
 6:     // ω is the number of added API calls
 7:     compute gradients ∇_ω Objective_adv(X*)
 8:     ω ← ω + clip(α · ∇_ω Objective_adv(X*))
 9:     X*_i^g ← (a_i^g + ω_i^g)/(a^g + ω^g)
10:     iter++
11:   end while
12:   c ← c × 10
13: end while
14: return X*

Based on the calculated Jacobian, we can construct a saliency map S(A, t) to direct the perturbation. The value for feature i in the saliency map can be computed as:

  P 0, ifJit(A) > 0 or j6=t Jij(A) < 0, S(A, t)[i] = (4.5)  P |Jit(A)|( j6=t Jij(A)), otherwise.

According to the saliency map, we pick the API call i with the highest S(A, t)[i] value to perturb in each iteration. The maximum amount of allowed change is restricted to γ. The number of the selected API call is increased by a small amount θ in each iteration. The iteration terminates when the sample is misclassified by the pilot classifier, or when the maximum change number is reached. The detailed algorithm is given in Algorithm 4.
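A minimal Python sketch of the saliency computation in Eq. (4.5) for the binary case (the example Jacobian values are invented for illustration):

```python
import numpy as np

def saliency_map(J, t):
    """Saliency values S(A, t)[i] of Eq. (4.5).

    J: Jacobian of shape (n_features, n_classes), J[i, j] = dF_j / da_i,
    t: ground-truth class. A feature is ruled out (score 0) if adding
    calls to it would raise F_t or lower the other classes; otherwise it
    is scored by how strongly it pushes the prediction away from t.
    """
    S = np.zeros(len(J))
    for i, row in enumerate(J):
        jit = row[t]
        others = row.sum() - jit          # sum of J[i, j] over j != t
        if jit > 0 or others < 0:
            S[i] = 0.0
        else:
            S[i] = abs(jit) * others
    return S

J = np.array([[-0.2, 0.3],    # lowers F_t and raises the other class: kept
              [ 0.1, 0.4],    # would raise F_t: ruled out
              [-0.5, -0.1]])  # lowers the other class: ruled out
S = saliency_map(J, t=0)
best = int(np.argmax(S))      # the API call to perturb this iteration
print(S, best)
```

Only the first feature survives the two conditions, so it would be the call whose count is incremented by θ in this iteration.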

4.4.2 APK Manipulation

In our study, the development of the APK file modification method was guided by the following design goals: 1) the modified APK will keep its original functionality; and 2) the modification will not involve additional human efforts, i.e., it can be applied automatically via running scripts.

As introduced in Section 4.3, the feature vector that MaMaDroid uses consists of the transition probabilities between states (either families or packages). Intuitively, the modification approach we apply is to add a certain number of API calls from specific callers to callees into the code, so as to change the feature values in the feature space.

Algorithm 4: Refined JSMA-based Attack Algorithm
Input: the substitute model F; the target model F*; feature vector X of the input malware sample; ground-truth label t of the input malware sample; number of each API call A; total number of API calls in each family/package N; maximum number of iterations γ; number of API calls θ perturbed in each iteration
Output: feature vector X* of the adversarial malware sample
 1: X ← A / N
 2: X* ← X
 3: // search all features
 4: Γ ← {1 ... |X|}
 5: max_iter ← γ
 6: // current class label c
 7: c ← argmax F(X*)
 8: while c ≠ t and iter < max_iter and Γ ≠ ∅ do
 9:   compute forward derivative ∇F(X*)
10:   x1, x2 ← saliency_map(∇F(X*), Γ, Y)
11:   convert X* to the corresponding numbers of calls O
12:   perturb O at position x1 by adding θ
13:   recalculate X* using O
14:   perturb O at position x2 by adding θ
15:   recalculate X* using O
16:   remove x1 and x2 from Γ
17:   c ← argmax F(X*)
18:   iter++
19: end while
20: return X*

Since we can obtain, through static analysis, the total number of calls that go from any caller to any callee, we can calculate how much the feature values will be affected by adding a single call.

    package android.os.mypack;

    public class Myclass {
        public static void callee() {}
        public static void caller() {
            callee();
            callee();
        }
    }

Listing 4.1: Example of Java source code for adding two android-to-android calls

The APK manipulation process is designed with two strategies, namely simple manipulation strategy and sophisticated manipulation strategy. The following explains their details and limitations, respectively.

Simple manipulation strategy was motivated by the process by which MaMaDroid extracts and calculates its feature values. MaMaDroid extracts all API calls from classes.dex, and abstracts them to either their families or their packages based merely on the root domain of their package names. For instance, the self-defined class "Myclass" in a self-defined package such as android.os.mypack, and the system class "StorageManager" in the system package android.os.storage, will both be abstracted to the android family or the android.os package. By adding such self-defined classes, we are able to mislead MaMaDroid's abstraction of API calls.

Based on the above observation, we design code blocks that can include an arbitrary number of calls from any caller to any callee. The Java source code shown in Listing 4.1 is an example of adding two android-to-android calls.

An arbitrary number of calls can be added by simply invoking callee() multiple times in caller().

    .class public Landroid/os/mypack/Myclass;
    .source "Myclass.java"

    .method public static callee()V
        .locals 0
        return-void
    .end method

    .method public static caller()V
        .locals 0
        .line 6
        invoke-static {}, Landroid/os/mypack/Myclass;->callee()V
        return-void
    .end method

Listing 4.2: Example of Smali code for adding two android-to-android calls

Our approach proceeds by injecting the required self-defined classes into the source of the target APK, and invoking the corresponding caller methods in the onCreate() method of its entry-point activity class. The entry-point activity can be located by searching for "android.intent.action.MAIN" in the manifest. Since an APK cannot be perfectly reverse-engineered to Java source code, we perform the code insertion on the smali code. As mentioned in Section 4.2, the modified smali code can be rebuilt into an APK again. Listing 4.2 presents the smali code of the above Java source code (with constructor methods omitted).

The described modification process can add an arbitrary number of calls from any caller to any callee by simply running an automated script. It also ensures that the process will not affect the functionality of the original application. Meanwhile, it modifies the CFG that MaMaDroid extracts from the APK, and consequently modifies its feature values.

    const-string p0, ""
    const-string p1, ""

    .line 13
    invoke-static {p0, p1},
        Landroid/util/Log;->d(Ljava/lang/String;
        Ljava/lang/String;)I

Listing 4.3: Example of Smali code for adding a Log.d() call

Simple manipulation takes advantage of a design flaw in MaMaDroid's feature abstraction process, and thus can possibly be defended against by implementing a white-list filter that, during API abstraction, filters out API calls that are not in the standard Android SDK (such a filter is not implemented in MaMaDroid).

Sophisticated manipulation strategy is designed to bypass such a white-list filter: system-provided, non-functional API calls are inserted into the smali code. For instance, invoking a Log.d() method in the onCreate() method of the entry activity class (e.g., com.my.project.MainActivity) results in adding one self-defined-to-android call in the family mode, or one self-defined-to-android.util call in the package mode. Since the calls we insert reside in an activity class of the project, their callers are abstracted to self-defined or obfuscated according to MaMaDroid's abstraction rules. Therefore, with sophisticated manipulation, only calls originating from the self-defined or obfuscated family/package can be inserted. This limitation slightly decreased the evasion rate from 99% to 93% in our family-mode experiment. An example of the added smali code for a Log.d() call is presented in Listing 4.3.


Figure 4.6: The attack process: the dashed lines show the process of our attack algorithm, and the solid lines illustrate our APK manipulation procedure.

We developed a script to automatically perform the code insertion process. We first prepared the above-described no-op code blocks for each caller-to-callee pair. These code blocks are independent of the application, and thus can be reused across attacks. The numbers of calls to be inserted from specific callers to callees were calculated by our attack algorithms described in Section 4.4.1. Then, we used regular expressions to locate the onCreate() method in the smali code of the entry-point activity class, and added the necessary code blocks at the end of the onCreate() method. Fig. 4.6 demonstrates the attack process, in which the dashed lines show the process of our attack algorithm, and the solid lines illustrate our APK manipulation procedure.
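The smali-level insertion can be sketched with a short Python helper (a simplified, hypothetical script: the real tooling must also handle register usage and the full registry of pre-designed code blocks, which are omitted here):

```python
import re

# One pre-designed no-op invocation; Myclass is the injected helper class.
NOOP_CALL = "    invoke-static {}, Landroid/os/mypack/Myclass;->caller()V\n"

def inject_calls(smali, n_calls):
    """Insert n_calls no-op invocations into the onCreate() method of an
    entry-activity smali file, just before its return instruction so the
    injected calls are reachable."""
    pattern = re.compile(
        r"(\.method[^\n]*onCreate\(Landroid/os/Bundle;\)V.*?)(    return-void)",
        re.DOTALL)
    payload = NOOP_CALL * n_calls
    return pattern.sub(lambda m: m.group(1) + payload + m.group(2),
                       smali, count=1)

activity = (".method protected onCreate(Landroid/os/Bundle;)V\n"
            "    .locals 1\n"
            "    return-void\n"
            ".end method\n")
patched = inject_calls(activity, 2)
print(patched.count("Myclass;->caller()V"))  # 2
```

After patching, the file is recompiled into an APK as described above; the number passed as `n_calls` comes from the attack algorithm's optimised perturbation.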

4.4.3 Experiment Settings

The experiments to be presented in the following two subsections evaluate the effectiveness of crafted adversarial examples. More specifically, we are going to

answer the following two questions: 1) can the modified malware samples effectively evade the target detector? and 2) can the modification be easily applied to the original APK? For the convenience of the experiments, we built MaMaDroid based on the source code that the authors published online1.

Dataset

To evaluate the performance of the crafted adversarial examples, we use the same datasets that were used in MaMaDroid. First, the set of benign applications consists of 5,879 benign applications collected by PlayDrone [129] in 2014 (denoted by oldbenign in [102]). The set of malware includes 5,560 samples that were initially used in Drebin [25] and collected between 2010 and 2012 (denoted by drebin in [102]). The original experiments reported in [102] also tested several combinations of other old and new datasets collected over the years to evaluate the robustness of their approach. Using only one set of data does not affect our research target, i.e., to craft adversarial examples that can fool and evade the malware detector. The classification results on the chosen datasets are promising, with F-measures reaching 0.88 and 0.96 in the family and package modes, respectively. Since our work is to generate malware samples that evade detection, our test set contains only malware samples. We carefully prepared the test set by manually checking that every sample can be installed and launched on an Android smartphone. We randomly selected 1,000 qualified malware samples to form the test set, leaving the

1 https://bitbucket.org/gianluca_students/mamadroid_code

rest of the malware samples, together with the benign application samples, to form the training set.

As discussed in Section 4.3.3, to simulate the scenarios where the original training dataset of the target detector is unknown to the adversary (i.e., Scenarios F and FB), we collected a set of malware samples and a set of benign applications from VirusShare2 and APKPure3, respectively. The VirusShare dataset consists of 24,317 malware samples collected between May 2013 and March 2014, while the APKPure dataset consists of 10,000 applications we crawled from its website in January 2018. The applications from APKPure were submitted to VirusTotal to examine their benignity. We discarded the samples reported by at least one anti-virus engine as malicious. Finally, the APKPure dataset contains 9,664 applications. We randomly selected 4,560 malware samples and 5,879 benign applications from the VirusShare and APKPure datasets, respectively, to form the surrogate dataset (to eliminate the influence caused by different numbers of training samples in the original and surrogate datasets). In the FT and FTB scenarios, we use the original dataset to train both the target detector and our attack algorithm; while in the F and FB scenarios, we use the original dataset to train the target detector and the surrogate dataset to train the attack algorithm.

2 https://virusshare.com
3 https://apkpure.com

Experiment Work Flow

Given a malicious APK as input, we first decompiled it with apktool and constructed its feature vector. The attack algorithm then optimised the perturbations to be added to the feature vector, i.e., the number of calls added from each caller to each callee. Then, the corresponding pre-designed code blocks were inserted into the smali files, which were then recompiled into a new APK. The manipulated APK was submitted to the MaMaDroid oracle to get the classification result. The attack was declared successful if the modified APK was labelled as benign. This process makes sure that our attack method not only changes the feature vector, but also effectively modifies the APK. We additionally verified that all the modified APKs can be successfully installed and launched on an Android smartphone. It was difficult to verify whether the functionality was affected; however, we presume that since the calls we added are non-functional, they do not change the functionality of the original APK.

As explained before, we run experiments in the four designed scenarios (refer to Section 4.3.3). The details of the settings for each scenario are listed in Table 4.2. In the experiments, we train a substitute model to approximate MaMaDroid using AdaGrad. Accordingly, a multi-layer perceptron (MLP) model is employed. The model contains 2 fully connected hidden layers, each with 128 nodes. Each training batch contains 256 samples, and the substitute model is trained for 100 epochs. In addition, we introduce dropout after each hidden layer to prevent overfitting; we set the dropout rate to 0.5. Note that MaMaDroid trained with the original dataset is used as the benchmark for evaluation, and we only require black-box access to the pilot classifier (refer to the definition in Section 4.4.1).
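The substitute architecture described above can be sketched as a forward pass (a pure-NumPy illustration with randomly initialised weights; the AdaGrad training loop and the softmax/cross-entropy head are omitted, and the toy feature size is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def substitute_forward(x, params, train=False, drop=0.5):
    """Two fully connected hidden layers of 128 units with ReLU and
    dropout (rate 0.5) after each, and a 2-way pre-softmax output
    (benign vs. malware)."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
        if train:  # inverted dropout after each hidden layer
            mask = rng.random(h.shape) >= drop
            h = h * mask / (1.0 - drop)
    W, b = params[-1]
    return h @ W + b  # logits Z(X)

n_features = 9  # toy value, e.g. a 3x3 transition matrix, flattened
params = [(0.1 * rng.standard_normal((n_features, 128)), np.zeros(128)),
          (0.1 * rng.standard_normal((128, 128)), np.zeros(128)),
          (0.1 * rng.standard_normal((128, 2)), np.zeros(2))]
z = substitute_forward(rng.random(n_features), params)
print(z.shape)  # (2,)
```

The logits z correspond to the pre-softmax output Z(X) consumed by the C&W loss in Eq. (4.2).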

4.4.4 Experiment Results

In [102], MaMaDroid's performance was examined with three different machine learning classifiers: RF, SVM, and K-Nearest Neighbour (KNN). To be consistent with the experiments in [102], we also evaluate our proposed method on each of these classifiers. In addition, to investigate the robustness of Deep Neural Networks (DNN) in malware detection, we leverage the features of MaMaDroid to train a DNN-based detector. Specifically, our DNN-based detector consists of five hidden layers, with 128, 64, 64, 64, and 64 neurons, respectively.

The F-measures in family and package modes are 0.92 and 0.95, respectively, which is comparable to the state-of-the-art. The proposed attack methods are also evaluated on the DNN-based detector.

The effectiveness of the crafted adversarial examples is evaluated in terms of evasion rate and distortion. Evasion rate is defined as the ratio of malware samples that are misclassified as benign to the total number of malware samples in the testing set. Distortion is defined as the number of API calls added to the smali code of each malware sample.

Scenario   Pilot Classifier   Training Set
F          Substitute         Surrogate
FT         Substitute         Original
FB         MaMaDroid          Surrogate
FTB        MaMaDroid          Original

Table 4.2: Attack Scenarios

Overall results

The overall results of our attack are presented in Figs. 4.7-4.10. Specifically, Fig. 4.7 and Fig. 4.8 present the attack results of the family mode using the two attack algorithms, while Fig. 4.9 and Fig. 4.10 demonstrate the results of the package mode. We applied the attack to the aforementioned four machine learning algorithms (sub-figures (a)-(d)), under the four real-world scenarios (x-axes) discussed in Section 4.3.3. The simple manipulation strategy is applied in this experiment, while the sophisticated manipulation strategy is evaluated in Sub-section (4). The evasion rate before the attack is also reported and acts as a baseline. The evasion rate as well as the average distortion per sample is reported. The results indicate that the proposed attack methods effectively evade MaMaDroid in most of the real-world scenarios. For instance, the evasion rate on RF increased from 4% (before attack) to 56%-99% (after attack) in the family mode, and from 3% to 58%-99% in the package mode, depending on the scenario and the attack algorithm. It is worth noting that in scenario FTB, where the adversary gains the most knowledge of MaMaDroid, the evasion rate (C&W) reaches 100% on RF, 100% on SVM, 83% on 3-NN, and 100% on DNN, with on average 55, 2, 65, and 1 API calls added to each malware sample for these algorithms, respectively. Even when the adversary only

knows the feature set (i.e., scenario F), the evasion rates with JSMA reach 62%, 75%, 58%, and 91% on the above-mentioned algorithms, respectively.

Figure 4.7: The evasion rate and average distortion of adversarial examples generated by JSMA in the family mode.

Figure 4.8: The evasion rate and average distortion of adversarial examples generated by C&W in the family mode.

Evaluation results by scenarios

An important observation is the improvement of attack effectiveness with the increase of the adversary's knowledge of the target system. While the level of knowledge obtained by the adversary affects the evasion rate of both algorithms, the impact of each factor is different. As demonstrated in Fig. 4.8, in the scenarios in which black-box access to the MaMaDroid oracle is acquired (i.e., FB and FTB), the evasion rates of C&W on all four algorithms are significantly higher than in the scenarios in which black-box access is not granted (i.e., F and FT). Meanwhile, the possession of the training set (F versus FT, FB versus FTB) has little impact on the evasion rate. The evasion rates of JSMA (refer to Fig. 4.7) show the opposite behaviour: the possession of the training set influences the evasion rate significantly, while access to the black-box model is less important.

Figure 4.9: The evasion rate and average distortion of adversarial examples generated by JSMA in the package mode.

Figure 4.10: The evasion rate and average distortion of adversarial examples generated by C&W in the package mode.

Evaluation results by operation modes

As introduced in Section 4.3, MaMaDroid runs in either the family mode or the package mode. The family mode is more lightweight, while the package mode is more fine-grained. The original classification performance in the package mode is slightly better than that in the family mode, with the original (baseline) evasion rate falling in the range of 1%-6% on the various algorithms (compared with 4%-11% in the family mode). The results of the experiment indicate that the attack is more effective in the package mode than in the family mode, in terms of evasion rate. For instance, when attacking with JSMA, the evasion rate in the package mode with RF reaches 100% in scenario FTB (Fig. 4.9(a)), while it is 89% in the family mode in the same scenario (Fig. 4.7(a)). However, the average distortion of the adversarial examples in the package mode is significantly higher than in the family mode: on average, 17 API calls need to be added to each application in the family mode, while this number increases to 257 in the package mode. The results disclose that while using more fine-grained features slightly enhances the classification accuracy, its resistance to our attack, in terms of the required distortion, is significantly higher than that of highly abstracted features (i.e., the family mode), considering that more than 15 times as many calls need to be inserted for a successful evasion.

Evaluation results by manipulation strategy

As presented in Section 4.4.2, two strategies can be applied in the proposed APK manipulation method. In the simple manipulation strategy, API calls originating from any caller can be inserted into the smali code, while in the sophisticated manipulation strategy, only API calls originating from the android, google, self-defined and obfuscated families can be added. Thus, we examine the feasibility of the sophisticated manipulation strategy by restricting that only the values of calls originating from the aforementioned families can be modified in the feature space. Fig. 4.11 presents the evasion rates and the corresponding average distortions of applying the simple and the sophisticated manipulation strategies in scenario FTB, respectively. With the simple manipulation strategy, the evasion rates are 100%, 100%, 83%, and 100% on RF, SVM, 3-NN, and DNN, respectively, with 55, 2, 65, and 9 API calls on average to be added; with the sophisticated manipulation strategy, the evasion rates slightly decrease to 99%, 96%, 58%, and 100%, respectively. The numbers of API calls to be injected in the sophisticated manipulation strategy are on average 46, 14, 16, and 9, respectively. The results demonstrate that the sophisticated manipulation strategy can also achieve a high evasion rate with only a small number of API calls injected into the APK.

4.5 Attack on Drebin

4.5.1 Attack Algorithm

We adopt the Jacobian-based attack to craft an adversarial example for Drebin, since the features of Drebin are binary. JSMA perturbs a feature from 0 to 1 in


Figure 4.11: Comparison of applying the simple manipulation strategy and the sophisticated manipulation strategy in the family mode by C&W.

each iteration. Regarding the Jacobian for Drebin, we calculate it based on the following formula:

\[
J_F(X) = \left[\frac{\partial F(X)}{\partial X}\right] = \left[\frac{\partial F_j(X)}{\partial x_i}\right]_{i \in 1 \ldots n,\ j \in \{0, 1\}} \tag{4.6}
\]

wherein X is the binary feature vector for Drebin and j is the classification result (i.e., malware if j = 1). Based on the Jacobian matrix, we select the most influential feature to perturb in each iteration. In other words, we perturb the i-th feature for which i = argmax_{i ∈ 1...n, x_i = 0} F_0(x_i). We change the selected feature from 0 to 1 in each iteration, until the example is misclassified, or we reach the maximum amount of allowed change (i.e., γ).
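One iteration of this selection rule can be sketched as follows (a minimal NumPy sketch; the example gradient values are invented):

```python
import numpy as np

def drebin_jsma_step(x, grad_benign):
    """One refined-JSMA step on Drebin's binary features: among the
    features currently 0, flip the one whose forward derivative towards
    the benign class F_0 is largest. Features are only ever added
    (0 -> 1), never removed, so functionality is preserved."""
    candidates = np.where(x == 0)[0]
    i = candidates[np.argmax(grad_benign[candidates])]
    x = x.copy()
    x[i] = 1
    return x, int(i)

x = np.array([1, 0, 0, 1, 0])
grad = np.array([0.9, 0.1, 0.7, 0.3, 0.2])  # dF_0 / dx_i
x_new, i = drebin_jsma_step(x, grad)
print(i, x_new)  # 2 [1 0 1 1 0]
```

Repeating this step until misclassification, or until γ features have been flipped, reproduces the iteration described above.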


Figure 4.12: The average distortion and evasion rate of adversarial examples generated by JSMA on Drebin.

4.5.2 APK Manipulation

Drebin extracts features from both the manifest and the dexcode. Different from previous work that only modifies the features in the manifest [77], we analyse the capability of modifying the features obtained from the dexcode.

As explained in Section 4.3.2, Drebin retrieves features by applying a linear scan over the related source files (AndroidManifest.xml and smali files), which only searches for the presence of particular strings (e.g., the names of API calls), rather than examining whether the calls are actually executed. Therefore, our strategy is to add code that contains the required features but is never invoked or executed. Listing 4.4 presents an example of adding a "suspicious API: getSystemService()" feature to the smali code.

    .method private addSuspiciousApiFeature()V
        .locals 1
        const-string v0, "phone"
        .line 17
        invoke-virtual {p0, v0},
            La/test/com/myapp/MainActivity;->
            getSystemService(Ljava/lang/String;)
            Ljava/lang/Object;
        move-result-object v0
        check-cast v0, Landroid/telephony/TelephonyManager;
        return-void
    .end method

Listing 4.4: Example of Smali code for adding a "suspicious API: getSystemService()" feature
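Why such dead code works can be illustrated with a toy linear scan (a hypothetical Python sketch; the API list is a small invented subset of Drebin's S7 feature set):

```python
SUSPICIOUS_APIS = ("getSystemService", "getDeviceId", "sendTextMessage")

def scan_suspicious(smali_text):
    """Drebin-style string scan: a feature fires if the API name merely
    appears in the smali text, regardless of whether the enclosing
    method is ever invoked."""
    return {api: int(api in smali_text) for api in SUSPICIOUS_APIS}

# The dead method of Listing 4.4 is never called, yet it sets the feature.
dead_code = (".method private addSuspiciousApiFeature()V\n"
             "    invoke-virtual {p0, v0}, La/test/com/myapp/MainActivity;"
             "->getSystemService(Ljava/lang/String;)Ljava/lang/Object;\n"
             ".end method\n")
print(scan_suspicious(dead_code))
# {'getSystemService': 1, 'getDeviceId': 0, 'sendTextMessage': 0}
```

Because the scan never checks reachability, the injected method flips the feature bit without changing the application's behaviour.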

4.5.3 Experiments & Evaluations

We present our attack performance on Drebin by reporting the evasion rate and the average distortion in different real-world scenarios. The dataset described in Section 4.4.3 is used in the experiments.

Fig. 4.12 reports the results of our proposed attack. In scenario FTB, where the adversary has the most knowledge of Drebin (i.e., the feature set, the training set, and the output of the Drebin oracle), 99% of the malware samples in the testing set are misclassified after the attack, with on average 3.5 features added to each sample. In scenario F, where the adversary has the least knowledge of Drebin (i.e., only the feature set), 60% of the adversarial malware examples evade detection.

Table 4.3 presents the average number of features inserted into each malware sample, from which we observe that the most frequently added features are in the sets of restricted API calls and suspicious API calls.

Source File   Feature Set                Avg. Number Added
dexcode       S5  Restricted API calls   2.17
              S6  Used permissions       0.10
              S7  Suspicious API calls   1.21
              S8  Network addresses      0.02

Table 4.3: Number of features added in each set

4.6 Discussion

4.6.1 Comparison with Existing Works

We compare our attack method with two other works on evading machine learning based malware detection. Chen et al. [46] proposed to poison the training dataset to mislead machine learning detectors. Grosse et al. [77] proposed a white-box attack against deep learning based malware detection models. Both of these works require access to the feature set, the training set, and the machine learning model; we therefore compare them with our results in scenario FTB, which requires the same knowledge of the target model. However, the compared works make additional assumptions: [46] further assumes that the adversary is capable of injecting tainted samples into the training set, and [77] only considers the situation in which the adversary knows the detailed structure and parameters of the target machine learning model (i.e., white-box). These assumptions are more restrictive and unlikely to hold in real-world scenarios.

Table 4.4 presents the evasion rate of our methods and the methods proposed in [46] and [77]. The results show that our methods outperform the compared methods in terms of evasion rate.

                      MaMaDroid                          Drebin
                      RF      SVM      3-NN    DNN
Our JSMA Attack       89%     96%      86%     93%       99.4%
Our C&W Attack        100%    100%     83%     100%      —
Chen et al. [46]      —       68.95%   —       —         75.2%
Grosse et al. [77]    —       —        —       —         69.35%

Table 4.4: Comparison with existing works (Evasion Rate)

4.6.2 Why We Are Successful

A critical challenge in crafting adversarial malware examples is how to map the perturbed feature space back into the problem space. In other words, the challenge is to modify the APK in a way that reflects the desired features while the code functionality remains the same. Adding a specific API call into an application's call graph without affecting its original functionality is a non-trivial task. In our case, however, MaMaDroid does not match API calls explicitly; it makes use of their abstractions, and this is what allows us to realise the feature reflection. While using abstracted APIs may be more resilient to API changes and keeps the size of the feature set manageable, it also makes the above challenge solvable for the attacker.

We summarise the reasons for our success as follows. First, as described in Section 4.4.2, both of our proposed strategies can successfully apply the perturbed features to the application's smali code, and the manipulated code can be recompiled into an APK. Second, similar treatments can be applied to Drebin, as described in Section 4.5.2. This is one of the key reasons that leads the proposed method to success. In addition, there are other contributing aspects. For example, by taking advantage of the transferability of adversarial examples across machine learning models, we train a substitute model to approximate the target detector, on which the optimal perturbations in the feature space can then be calculated.

4.6.3 Applicability of Our Attack

A great number of machine learning based Android malware detection techniques have been proposed in the past few years. The main differences and key contributions of these techniques lie in the features they extract to profile malware samples. As we cannot demonstrate the effectiveness of the proposed method on every machine learning detector, we selected two typical detectors, Drebin and MaMaDroid, which use syntactic and semantic features, respectively. In prior works, they have been selected as baseline methods to evaluate the performance of adversarial attacks, e.g., [46, 77].

Our proposed attack framework can be applied to most machine learning based detectors that extract features from either the manifest or the bytecode of an application. We only need to refine the attack algorithm according to the constraints and inter-dependencies of the features used in the target detector, just as we have done for attacking MaMaDroid and Drebin.

4.6.4 Transferability

The proposed attack framework is inspired by the transferability of adversarial examples among different machine learning models. Previous works have demonstrated that adversarial examples crafted for one model may also be misclassified by another model [98]. We further investigate the limitation of transferability by varying the number of features that are allowed to be modified in the attack.

We conducted the experiments under the assumption that the adversary can modify only a subset of the full feature set S. Specifically, we limited the adversary to two subsets, denoted S1 and S2. In our settings, S1 consisted of 104 features indicating the presence of system-provided permissions in the manifest (e.g., android.permission.SEND_SMS), while S2 contained 1,450 features, covering system permissions as well as custom permissions defined by application developers (e.g., com.a.b.permission.DEADLY_ACTIVITY). The full set S was used to train both the original Drebin system and the substitute model, while only the features in S1 or S2, tested separately, were allowed to be modified.
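The restriction to a modifiable subset can be sketched as a masked, gradient-guided feature-addition loop. This is a minimal illustration, not our exact attack implementation: `grad_fn`, the flat feature-vector layout, and the greedy one-feature-at-a-time strategy are simplifying assumptions.

```python
import numpy as np

def craft_adversarial(x, grad_fn, modifiable_mask, max_changes=20):
    """Greedily add (0 -> 1) the modifiable features whose gradient most
    decreases the malware score; only adding features keeps the app functional."""
    x_adv = x.copy()
    for _ in range(max_changes):
        g = grad_fn(x_adv)  # gradient of the malware score w.r.t. features
        # consider only modifiable features that are not yet present
        gain = np.where(modifiable_mask & (x_adv == 0), -g, -np.inf)
        best = int(np.argmax(gain))
        if gain[best] <= 0:  # no feature flip reduces the score any further
            break
        x_adv[best] = 1
    return x_adv
```

With a smaller modifiable mask the loop simply runs out of helpful features earlier, which mirrors the S1 versus S2 behaviour discussed below.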


Figure 4.13: Empirical study of transferability with different numbers of modifiable features, where S1 includes the system-provided permissions, and S2 includes both system-provided and user-defined permissions.

Fig. 4.13 compares the two cases of modifying S1 and S2 when crafting adversarial examples. While the evasion rate on the substitute model was high in both cases (91% for S1 and 97.7% for S2), the transferability to the SVM was quite different. More specifically, when S2 was modified, the evasion rate decreased moderately from 97.7% to 74%, indicating that a large portion of the adversarial examples generated on the substitute model were also misclassified by Drebin. However, when only S1 was touched, the evasion rate dropped dramatically from 91% to 9%, showing that only a small portion of the adversarial examples remained effective against Drebin. This observation shows that the number of modifiable features has a significant impact on the transferability between models.

4.6.5 Artifacts in Our Attack

It could be argued that adding a certain number of dummy calls or no-op APIs, such as logging output and reading files, introduces artifacts into the APK, which may make the application look suspicious. To investigate whether our attack introduces such a side effect into the original APK, we examined the prevalence of such no-op API calls (e.g., android.util.Log) in applications in the wild. From our experiment dataset, we find that it is normal for an application, whether benign or malicious, to contain a considerable number of no-op API calls. For example, 17.9% of benign applications and 16.3% of malware samples in the dataset have more than 100 android.util.Log() calls in their source code.

The percentages further increase to 28.7% and 40% when counting applications with more than 50 android.util.Log() calls, for benign and malicious applications, respectively. There is no clear indication that either malware samples or benign applications tend to have more such calls than the other. Note that the average number of calls we insert to craft adversarial examples is only 17 for family mode (refer to Fig. 4.7), and 257 for package mode (refer to Fig. 4.9). Therefore, the injected code does not provide a strong signal that an application is benign or malicious.
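The prevalence measurement above amounts to counting Log invocations in decompiled smali code. A minimal sketch, assuming an apktool-style output directory (the helper name and regex are illustrative, not part of our toolchain):

```python
import re
from pathlib import Path

# matches smali invocations of android.util.Log methods, e.g.
# "invoke-static {v0, v1}, Landroid/util/Log;->d(...)"
LOG_CALL = re.compile(r"invoke-static.*Landroid/util/Log;->")

def count_log_calls(smali_dir):
    """Count android.util.Log invocations across all smali files of a
    decompiled APK (e.g., an apktool output directory)."""
    total = 0
    for smali in Path(smali_dir).rglob("*.smali"):
        total += len(LOG_CALL.findall(smali.read_text(errors="ignore")))
    return total
```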

4.6.6 Defending Methods

Adversarial training method

The idea of adversarial training is to recursively feed crafted adversarial examples into the training dataset to strengthen the robustness of the machine learning model. We evaluate the effectiveness of the adversarial training method against the proposed attack. Fig. 4.14 shows the F-measure of benign and malicious applications under adversarial training, with a varying percentage of adversarial malware samples added to the training set. Note that the malware test set contains an equal number of original and adversarial malware samples. The F-measure of benign and malicious samples increased from 69% to 80% and from 64% to 82.5%, respectively, by adding 1% of adversarial examples to the training dataset. Their F-measures further increased to 83% and 87%, respectively, when more adversarial examples were added. Although this defence is simple and effective, it strongly relies on a priori knowledge of the attack algorithm.
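The adversarial training procedure can be sketched as follows. This is a simplified illustration with a generic scikit-learn classifier; the function name and toy usage are assumptions, not our exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_training(clf, X_train, y_train, X_adv, fraction):
    """Augment the training set with a fraction of adversarial malware
    samples (which keep the malware label, 1) and retrain the classifier."""
    n = max(1, int(fraction * len(X_train)))
    X_aug = np.vstack([X_train, X_adv[:n]])
    y_aug = np.concatenate([y_train, np.ones(n)])
    clf.fit(X_aug, y_aug)
    return clf
```

In an iterative variant, new adversarial examples are re-crafted against the retrained model and fed back in, which is what "recursively" refers to above.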

Ensemble learning method

An ensemble of classifiers is one of the effective defences against black-box adversarial examples. Instead of training one classifier on the full feature set and all training samples, a number of sub-classifiers are trained with either a subset of the features or a subset of the training samples. The final classification result is then made by combining the decisions of the sub-classifiers under a specific rule, such as majority vote.
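A minimal sketch of the feature-subset ensemble with majority voting. This is illustrative only: we use logistic regression sub-classifiers for brevity, whereas any base learner could be plugged in, and the class name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class FeatureBaggingEnsemble:
    """Majority-vote ensemble in which each sub-classifier is trained on a
    disjoint 1/k slice of the feature set."""

    def __init__(self, k=10, seed=0):
        self.k, self.rng = k, np.random.default_rng(seed)

    def fit(self, X, y):
        idx = self.rng.permutation(X.shape[1])
        self.slices = np.array_split(idx, self.k)          # k feature slices
        self.models = [LogisticRegression(max_iter=1000).fit(X[:, s], y)
                       for s in self.slices]
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X[:, s])
                          for m, s in zip(self.models, self.slices)])
        return (votes.mean(axis=0) >= 0.5).astype(int)     # majority vote
```

The sample-subset variant is analogous: each sub-classifier sees all features but only 1/k of the training rows.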


Figure 4.14: F-measure of benign and malicious applications with the proposed adversarial training defending method on MaMaDroid, with varying percentage of adversarial malware samples added in the training set.

We demonstrate the effectiveness of the ensemble learning defence method against our attack. Two scenarios are considered: scenario F and scenario FTB. In scenario F, the attacker cannot query the target model and is therefore unaware of the existence of the defence mechanism. In scenario FTB, the attacker can query the target model with the defence mechanism implemented and obtain its prediction result.

Fig. 4.15 presents the results of applying the ensemble learning method to defend against our C&W attack on MaMaDroid (family mode) in the two aforementioned scenarios. Two ensemble strategies are implemented in both scenarios, in which each of 10 classifiers is trained with either 1/10 of the training samples or 1/10 of the features. Evasion rates with and without the defence are reported. The results suggest that the ensemble learning method is effective in defending against the attack when the attacker has the least knowledge of the target model (i.e., scenario F), where the evasion rate decreased from 90% to 40-59% under the different ensemble strategies.

Figure 4.15: Evasion rate of applying the ensemble learning defence mechanism on MaMaDroid family mode in two scenarios (F and FTB). Base: the evasion rate before the attack, as a baseline; No Defence: attacking the model without the defence; Defence - Ensemble Training: attacking the model with ensemble learning as the defence, in which each of 10 classifiers is trained with 1/10 of the training samples; Defence - Ensemble Feature: attacking the model with ensemble learning as the defence, in which each of 10 classifiers is trained with 1/10 of the features.

However, when the attacker is capable of querying the target model (e.g., when the model is implemented as an online service), the ensemble method cannot effectively defend against the proposed attack.

4.7 Summary

Recent studies in adversarial machine learning and computer security have shown that, due to its weakness in defending against adversarial examples, machine learning could be a potential weak point of a security system [29, 28, 134]. This vulnerability may further result in the compromise of the overall security system.

The underlying reason is that machine learning techniques are not originally designed to cope with intelligent and adaptive adversaries, who can manipulate input data to mislead the learning system.

The goal of this work has been, more specifically, to show that adversarial examples can be very effective against Android malware detectors. To this end, we first introduced a DNN-based substitute model to calculate optimal perturbations that also comply with the inter-dependencies among APK features. We then developed an automated tool to implement the perturbations in the source files (e.g., smali code) of a targeted malware sample. According to the evaluation results, the detection rate of MaMaDroid (i.e., a typical detector that uses semantic features) decreased from 96% to 0%. We also tested Drebin (i.e., a typical detector that uses syntactic features but also collects some features from classes.dex) and found that its detection rate decreased from 97% to 0%. To the best of our knowledge, our work is the first to overcome the challenge of targeting recent Android malware detectors, which mainly collect semantic features from the APK's 'classes.dex' rather than syntactic features from 'AndroidManifest.xml'.

Chapter 5

A Stealthy Attack on Android Phones without Users' Awareness

Voice Assistants (VAs) are increasingly popular for human-computer interaction (HCI) on smartphones. To help users automatically conduct various tasks, these tools usually come with high privileges and can access sensitive system resources. A compromised VA is a stepping stone for attackers to hack into users' phones. Prior work has experimentally demonstrated that VAs can be a promising attack point for HCI tools. However, the state-of-the-art approaches require ad-hoc mechanisms to activate VAs that are non-trivial to trigger in practice and are usually limited to specific mobile platforms. To mitigate these limitations, we propose a novel attack approach, named Vaspy, which crafts the user's "activation voice" by silently listening to the user's phone calls. Once the activation voice is formed, Vaspy can select a suitable occasion to launch an attack. Vaspy embodies a machine learning model that learns proper attacking times to prevent the attack from being noticed by the user. We implement proof-of-concept spyware and test it on a range of popular Android phones. The experimental results demonstrate that this approach can silently craft the user's activation voice and launch attacks. In the wrong hands, a technique like Vaspy can enable automated attacks on HCI tools. By raising awareness, we urge the community and manufacturers to revisit the risks of VAs and subsequently revise the activation logic to be resilient to the style of attacks proposed in this work.

5.1 Introduction

Voice assistants (VAs) have been widely used in smartphones, typically as human-computer interaction (HCI) mechanisms for device control and identity authentication. Popular examples from the market include Amazon Alexa [21],

Samsung Bixby [87], Google Assistant [52], and Apple Siri [24]. Because human beings can speak about 150 words per minute, much faster than typing (roughly 40 words per minute on average), VAs are very useful for transforming human speech into machine-actionable commands. This makes smartphones easier to use, especially for tasks that need lots of input or for scenarios where 'hands-free' operation is mandatory (e.g., making phone calls while driving). In order to support broad functionalities via voice, e.g., sending text messages, making phone calls, browsing the Internet, and playing music/videos, VAs are usually granted high-level privileges, including dangerous permissions [86] (e.g., ACCESS_COARSE_LOCATION, READ_CONTACTS).

Unfortunately, the VA technique is a double-edged sword. VAs not only bring great convenience to smartphone users but also offer a backdoor for hackers to gain entrance into mobile systems. Hackers can take advantage of the high privileges VAs hold in accessing various applications and system services to steal users' private information such as locations and device IDs [52], control smart home devices [16], forge emails, or even transfer money [17]. For example, after activating the Google Assistant with the keywords "OK Google", a hacker can play a crafted attacking voice that tricks the smartphone into sending the user's location to a specific number via SMS, with commands such as "send my location to 12345678" [52]. Given a list of VA-enabled functions [73], we can identify many potential attacks against users' smartphones.

Prior work has already demonstrated the feasibility of attacking smartphones via VAs [52, 21, 41, 150]. The key to the success of these approaches is to activate VAs stealthily. For example, W. Diao et al. [52] and E. Alepis et al. [21] utilise Android inter-component communication (ICC) to wake up the VA. To be stealthy, they propose launching attacks when smartphones are unattended or in the early morning (e.g., 3 am). However, this approach requires calling a specific API ('ACTION_VOICE_SEARCH_HANDS_FREE'), which is only available in Google Assistant. This excludes the use of the approach on some brands, such as Huawei and Xiaomi, which provide custom VAs other than Google Assistant. G. Zhang et al. [150] propose using inaudible ultrasound to activate VAs. The attacking commands are undetectable by users but can be recognized by the VAs on smartphones. However, this approach needs a special ultrasound generator on-site, which is not practical in the real world. There is another work under the same umbrella: N. Carlini et al. [41] apply adversarial machine learning techniques to craft attacking sounds against voice recognition systems. This approach requires hackers to have physical access to the targeted smartphones and to run the sound-crafting process iteratively, a premise that is also impractical in most real-world scenarios.

In this chapter, we propose a novel and practical stealthy attack against voice assistants on Android phones, named Vaspy. It learns from the user's regular dialogue to craft the activation voice for the VA and leverages the built-in speaker to play it and activate the VA. To make the attack stealthy, it is triggered only when the smartphone user is most likely to overlook the occurrence of the activation voice. The idea of Vaspy comes from two practical facts:

1) the built-in speaker can activate the VA of a phone [21]; and 2) a user in a noisy environment can easily neglect the ringtone of a phone.

We develop proof-of-concept spyware based on Vaspy. The spyware disguises itself as a popular microphone-controlled game to increase the chance of successful delivery to targeted Android phones1. The spyware records inbound/outbound calls and synthesises the activation keywords (e.g., 'OK Google') using speech recognition and voice cloning [23] techniques. This operation is necessary as state-of-the-art VAs are resilient to unauthenticated voiceprints. The proof-of-concept spyware sheds light on two advantages of Vaspy: 1) since the attacking process only makes use of a common component of an Android phone (i.e., the built-in speaker), Vaspy can be applied to most off-the-shelf Android phones that have built-in VAs; this overcomes the limitations of prior work, which either requires special equipment [41, 150] or can only be applied to Google Assistant [52, 21]; 2) Vaspy employs machine learning techniques to analyze data collected from various on-board sensors; this helps Vaspy identify the optimal attacking time, making it stealthier than prior work [52, 21].

1This is only an example of delivery. There are many other social engineering methods that could be used in the real world, e.g., [136].

Vaspy can be very dangerous to smartphone users, not only due to its stealthiness but also because of its resilience to state-of-the-art anti-virus tools. We test the proof-of-concept spyware on VirusTotal [12], a widely adopted industrial anti-virus platform. We also test the spyware against three state-of-the-art learning-based Android malware detectors, namely Drebin [25], DroidAPIMiner [20], and MaMaDroid [102]. The results indicate that the spyware based on Vaspy can evade their detection. In fact, Vaspy seldom invokes sensitive APIs [18] and uses the VA as a puppet to carry out malicious activities, making it resilient to those anti-virus tools.

We summarise the contributions of this chapter as follows.

• We propose a novel attacking approach called Vaspy, which can stealthily hack into Android phones via built-in VAs without users' awareness.

• We design a context-aware module in Vaspy, making it stealthier than prior work. This module provides intelligent environment detection to identify the optimal time to launch the attack, based on the data collected from various on-board sensors.

• We develop proof-of-concept spyware based on Vaspy to evaluate the attack in a real-world empirical study. The empirical results show that users cannot detect the spyware, and that the spyware does not significantly affect Android phones' performance. We also find that the spyware is resilient to common anti-virus tools from both industry and academia.

Figure 5.1: The workflow of an example spyware based on Vaspy. Incoming/outgoing calls are monitored and recorded, and the activation voice is then synthesised. The user's environment is monitored by built-in sensors to determine a suitable attacking occasion. When launching the attack, text commands can be retrieved from Firebase [6] and converted to speech by the built-in Text-to-Speech (TTS) module on the smartphone.

5.2 Attacking Model: Vaspy

The workflow of Vaspy is shown in Fig. 5.1. Vaspy's attacking approach includes two modules: 1) Activation Voice Manipulation, and 2) Attacking Environment Sensing. The first module synthesizes the commands (e.g., 'OK Google') that are required to activate the VA. Because most popular VAs can differentiate the voice of the genuine smartphone owner using artificial intelligence technologies [35], the activation voice in Vaspy is crafted from the targeted user's own voice. This ensures success in activating the smartphone's VA.

There are mainly two approaches to synthesizing the activation voice: 1) using the user's voice recordings to clone an activation voice [23]; and 2) extracting an activation voice from the user's voice recordings. For the first approach, we can adopt the voice cloning method [23] based on multi-speaker generative modeling [71] to generate the activation voice from a few of the user's own voice recordings. The method provides a trained multi-speaker model (fine-tuned) that takes a few audio-text pairs as input to simulate a new speaker. This approach requires a text input to encode the cloned voice. Alternatively, the second approach adopts speech recognition techniques/tools such as Recurrent Neural Networks (RNNs) [74] to retrieve/synthesize the vocal pieces of those particular words from the user's own voice. This approach has been widely used in commercial systems such as IBM Watson [7]. In Section 5.3, we implement an RNN-based method to synthesize users' voices in our proof-of-concept spyware, but alternative techniques/tools can also be integrated into Vaspy. In our implementation, a vocal corpus of the special words helps craft the activation voice, e.g., 'OK' plus 'Google' produces 'OK Google' as a whole activation voice piece for Google Assistant. However, this can be very challenging when the targeted user seldom speaks these special words. In this case, Vaspy synthesizes the vocal pieces of the special words from syllables captured from the user's voice [69], e.g., the first syllable of 'good' and the second syllable of 'single' can be concatenated to pronounce 'google'.

Once the activation commands are crafted, the second module collects environmental data such as light levels, noise levels, and motion states via on-board sensors. Vaspy employs machine learning techniques to decide the optimal time to launch the attack stealthily. The correctness of Vaspy's decisions is determined by the volume and quality of the contextual data collected to assess the attacking environment. After the second module identifies a suitable attacking time, the synthesized activation voice is played, followed by prepared attacking commands (e.g., "send my location to 123456"), causing harm to the targeted smartphone user. After the activation, successive attacking commands can easily be delivered to the VA to control the compromised phone.

5.3 Proof-of-Concept: A Spyware

5.3.1 Activation Voice Manipulation

We implement proof-of-concept spyware on Android to evaluate Vaspy in a series of real-world scenarios. The spyware disguises itself as a microphone-controlled game. When a user starts playing the game, Vaspy is activated in the background and stays active even if the game app is terminated.

Once launched, Vaspy registers itself as a foreground service2 that monitors phone call status. When there is an incoming or outgoing call, Vaspy starts recording the audio from the microphone. An audio clip is saved every 30s. It will be processed by the Activation Voice Manipulation module and then be deleted immediately to release the storage. The recording process stops either when the phone call ends, or when the activation keyword has been successfully synthesized.

We implement an RNN-based voice synthesis model in our proof-of-concept spyware. The RNN model is trained with audio clips containing both positive words (i.e., activation keywords) and negative words (i.e., non-activation words). Fig. 5.2 illustrates the process of preparing the training samples. The model takes a raw audio signal as input, which contains the activation keywords. A Fourier Transform [38] is then applied to the raw signal, converting it to a spectrogram. The spectrogram is a visual representation of the frequencies of a given signal over time. An RNN is then trained with the frequency information from the spectrogram and a pre-labeled matrix containing the starting and ending frames of the activation keywords, to extract the activation words from audio clips. We implement the Gated Recurrent Unit (GRU) as the core unit of our RNN [48]. There are 4,500 and 500 audio clips used in training and testing, respectively. The accuracy on the testing set is 93.4%.
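The spectrogram pre-processing step can be sketched as a short-time Fourier transform over windowed frames. This is an illustrative NumPy version; the frame length and hop size are assumptions, and the GRU training itself is omitted.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT: slice the raw audio into overlapping windowed frames
    and take the FFT of each, yielding the time-frequency matrix fed to the RNN."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # one row per frame, one column per frequency bin (frame_len // 2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each row of the resulting matrix is one RNN time step; the pre-labeled start/end frames of the activation keywords provide the per-step targets.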

2Android 9 disables background services from accessing user input and sensor data. Therefore, we use a foreground service and hide the notification icon by making it transparent [15].

Figure 5.2: RNN training data pre-processing. (A) raw audio signal as input, which contains the activation keywords; (B) spectrogram converted from the raw audio signal; (C) a matrix that contains the labeled starting and ending frames of the activation keywords.

Note that in our prototype implementation, the recorded audio clips must contain the activation keywords. However, this limitation can be removed by implementing the voice cloning technique [23], which requires only a few voice recordings of arbitrary content from the targeted user.

5.3.2 Attacking Environment Sensing

Figure 5.3: Framework of the Attacking Environment Sensing (AES) module.

The environment data that determines whether to launch the attack is collected from the smartphone's on-board sensors. Fig. 5.3 describes the Attacking Environment Sensing module. In particular, we extract movement intensity features from accelerometer readings, and environment features from microphone and light sensor readings. Since smartphones do not have built-in noise sensors, the noise level in decibels is calculated from the amplitude of the ambient sound gathered from the microphone, according to

L_dB = 10 lg (A1 / A0)^2 = 20 lg (A1 / A0)

wherein A1 is the amplitude of the recorded sound, and A0 is a standard amplitude that is usually set to one.
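For illustration, the noise-level computation reduces to a one-liner using the simplified 20 lg form of the formula above (the function name is illustrative):

```python
import math

def noise_level_db(a1, a0=1.0):
    """Noise level in decibels: L_dB = 10 * lg((A1 / A0)^2) = 20 * lg(A1 / A0)."""
    return 20 * math.log10(a1 / a0)
```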

Movement intensity features describe an overall perspective of the user's behavioral state. We divide human behaviors into three states: 1) the definite motion state, 2) the definite stationary state, and 3) the relative motion-stationary state. The definite motion state shows significant fluctuation in sensor readings, as in scenario (a) of Fig. 5.4. The definite stationary state shows consistent sensor readings, as in scenario (d) of Fig. 5.4. The sharp difference in readings between the definite motion state and the definite stationary state allows the classification model to recognize these behaviors with high accuracy. However, activities that do not show an apparent fluctuation may confuse the classification model. Therefore, to increase the classification accuracy, we define an intermediate state, the relative motion-stationary state, by which most of the confusing activities can be classified accurately. In this prototype, we use a Random Forest as our classification model because it can output class probabilities rather than only hard labels. We assign labels to instances according to whether the probability output by the Random Forest exceeds certain thresholds: a motion probability of over 60% is labeled a definite motion state, below 40% a definite stationary state, and between 40% and 60% a relative motion-stationary state (as shown in Table 5.1). As the movement intensity features are categorical, machine learning algorithms cannot work with them directly; therefore, we convert them to numerical values using one-hot encoding. The definite motion state is encoded as [0, 1], the definite stationary state as [1, 0], and the relative motion-stationary state as [1, 1].

Figure 5.4: Data collected from scenario (a) walking on a quiet road; and (d) taking public transportation.
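The threshold-based labeling and encoding can be sketched directly (the function name is illustrative):

```python
def encode_motion_state(p_motion):
    """Map the Random Forest's motion probability to the codes used here:
    definite motion [0, 1], definite stationary [1, 0],
    relative motion-stationary [1, 1]."""
    if p_motion > 0.60:
        return [0, 1]   # definite motion state
    if p_motion < 0.40:
        return [1, 0]   # definite stationary state
    return [1, 1]       # relative motion-stationary state
```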

Movement data is collected every 20 ms, while noise and light data are collected every 200 ms, as they are more stable over a short period of time. The noise and light data are re-sampled at a frequency of 50 Hz following the nearest-neighbour interpolation principle, and merged with the one-hot encoded movement intensity features to build the training matrices. The environment features provide more specific details on uncertain environmental factors, such as the noise level and light intensity, which also affect the decision about whether to launch a stealthy attack.

Table 5.1: Movement intensity features

Motion State   Stationary State   Movement Intensity
0.70           0.30               Definite motion state
0.56           0.44               Relative motion-stationary state
0.15           0.85               Definite stationary state
0.45           0.55               Relative motion-stationary state
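The re-sampling of the slower noise/light streams onto the 50 Hz movement timeline can be sketched as follows. This is an illustrative version (function name and parameters are assumptions); each 50 Hz tick takes the most recent slow-sensor reading, a zero-order-hold form of nearest-neighbour interpolation.

```python
import numpy as np

def resample_nn(values, src_hz, dst_hz, duration_s):
    """Re-sample a slow stream (e.g., 5 Hz noise or light readings) onto a
    faster timeline by picking the most recent source sample for each tick."""
    t_dst = np.arange(0, duration_s, 1 / dst_hz)
    idx = np.minimum((t_dst * src_hz).astype(int), len(values) - 1)
    return np.asarray(values)[idx]
```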

The collected raw signals usually contain noise generated by various sources, such as sensor miscalibration, sensor errors, and noisy environments. These noisy signals adversely affect signal segmentation and feature extraction, and further hamper activity prediction significantly. In our study, we use a fourth-order Butterworth low-pass filter for noise removal. Besides gathering data from the on-board sensors, we also collect smartphone usage status, such as the lock-screen on/off state and the Bluetooth/headphone connection status, using the corresponding APIs. These statuses indicate whether the smartphone is in use. Environment sensing is triggered only when the smartphone is not in use.
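A fourth-order Butterworth low-pass filter is available off the shelf; a minimal sketch using SciPy (the 5 Hz cutoff here is an illustrative value, not necessarily the one used in our prototype):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(signal, fs, cutoff_hz=5.0, order=4):
    """Fourth-order Butterworth low-pass filter, applied forwards and
    backwards (filtfilt) so the sensor stream suffers no phase shift."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal)
```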

5.3.3 Post Attacks and Spyware Delivery

Once the environment detector decides to launch the attack, the synthesized activation voice is played via the speaker of the victim's phone. Meanwhile, the attacking commands, which are in text format, are dynamically fetched from Firebase [21] and played via the smartphone's speaker using the Android built-in Text-to-Speech (TTS) service. Attackers can then manipulate the voice assistant to conduct further malicious activities such as leaking private information, sending malicious SMS/email, etc.

Figure 5.5: A snapshot of the proof-of-concept spyware. After the player clicks the start button, the rocket rises when the player blows or screams into the microphone. The rising speed depends on the volume of sound that the microphone receives.

Three permissions are required by Vaspy: RECORD_AUDIO (to record the activation voice of the user), INTERNET (to dynamically fetch attacking commands from the Firebase server and interact with the trained online model), and READ_PHONE_STATE (to monitor incoming/outgoing call status). Vaspy is disguised as a popular microphone-controlled game (as shown in Fig. 5.5) so that it can legitimately request the RECORD_AUDIO permission without arousing the user's suspicion. When a victim plays the game, the player is required to blow or scream into the microphone to raise a rocket; the higher the volume the microphone receives, the faster the rocket flies (a snapshot of the game can be found in Appendix ??). The game is highly deceptive to teenagers and kids. In fact, the spyware can be delivered in other forms, such as a malicious audio recorder. The READ_PHONE_STATE and INTERNET permissions are very commonly requested by Android games: 46 of the top 100 games on Google Play request the READ_PHONE_STATE permission, while all of the top 10 games request the INTERNET permission.

5.4 Evaluation

In this section, we evaluate the performance of our prototype spyware in terms of the attack success rate. The attack capabilities on VAs from various vendors (i.e., Google, Huawei, and Xiaomi) are also investigated. In addition, to examine its stealthiness, we evaluate the system overhead and test Vaspy against anti-virus tools/platforms.

5.4.1 Evaluation of the Attacking Environment Sensing Module

We evaluate the proposed attack on three VAs on four Android smartphones: Google Assistant on the Google Pixel 2 and Samsung Galaxy S9, Xiao Yi on the Huawei Mate 8, and Xiao Ai on the Xiaomi Mi 8. The smartphones are taken to various real-world scenarios for data collection. These scenarios include moving or stationary states, noisy or quiet environments, and carrying the smartphone in a pocket or holding it in the hand. Example scenarios are shown in Figure 5.6.

Figure 5.6: Overview of the data collected in real-world scenarios.

A participant carries each smartphone for data collection. An audio piece of synthesized activation voice is stored on each smartphone. These activation voices are tested in advance to make sure that they can successfully activate the voice assistant on each smartphone. Every two minutes, the activation voice followed by one random attacking voice command (e.g., "Send 'subscribe' to 1234567") is played via the smartphone's built-in speaker. If the participant does not notice the voice command, and the command is successfully executed, we label the attack as a success. Finally, the data we collect for training includes the readings from the smartphones' on-board sensors (i.e., microphone, accelerometer, and ambient light sensor) and the attack results (as the label set).

We train a Random Forest with the collected data and evaluate the model based on precision, recall, and F1 score. The results of 20-fold cross-validation are presented in Table 5.2, showing that the model is well trained.
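The evaluation step can be sketched with scikit-learn's cross-validation utilities. The data below is a synthetic stand-in; the real feature matrix comes from the sensor streams described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# hypothetical stand-in for the sensor feature matrix and attack labels
rng = np.random.default_rng(0)
X = rng.random((400, 6))                    # noise, light, movement features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # 1 = attack not noticed

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=20, scoring="f1")  # 20-fold CV
```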

5.4.2 Evaluation of Real World Attack

We further evaluate the effectiveness of the attack in real-world scenarios at different times of the day. Ten participants are recruited to carry one of the smartphones mentioned above to various real-world scenarios. Smartphone sensors collect real-time environment data and feed it to the trained machine learning model. Then, a probability of whether to launch an attack is obtained.

An attack is triggered if the probability exceeds a threshold (e.g., 80% in our experiment setting). We also impose a restriction that at most one attack is triggered every two minutes. Figs. 5.7 to 5.12 report the sensor readings and the output attacking probabilities in some typical scenarios, where "True" in the attack results indicates that the attack was triggered but not heard by the participant, while "False" indicates that the attack was triggered and heard by the participant. "N/A" means that no attack was triggered in the time slot, so it is excluded when calculating the success rate. As Figs. 5.7 to 5.12 show, the spyware based on Vaspy achieves a 100% success rate in the real-world attacks.

Table 5.2: Average accuracy performance

Invasion       Precision   Recall   F1 score
Unsuccessful   0.96        0.95     0.95
Successful     0.97        0.98     0.98
Avg            0.97        0.97     0.97

Figure 5.7: Evaluation result of the Attacking Environment Sensing in scenario (a).
Figure 5.8: Evaluation result of the Attacking Environment Sensing in scenario (b).
Figure 5.9: Evaluation result of the Attacking Environment Sensing in scenario (c).
Figure 5.10: Evaluation result of the Attacking Environment Sensing in scenario (d).
Figure 5.11: Evaluation result of the Attacking Environment Sensing in scenario (e).
Figure 5.12: Evaluation result of the Attacking Environment Sensing in scenario (f).

5.4.3 Capability of Attack

After activating the VA, the attacker may further acquire the victim’s private information, or conduct malicious activities on the infected smartphone, by remotely executing specific attacking commands.

In Table 5.3, we list and compare the potential attacks that can be launched on different VAs on victim smartphones, namely Google Assistant on the Pixel 2, Xiao Yi on the Huawei Mate 8, and Xiao Ai on the Xiaomi Mi 8. We also list the permissions that would be required if the corresponding information were queried by an ordinary app. None of these permissions are necessary for the proposed attack, since VAs are naturally granted the privilege to access such information.

While private information such as the location, calendar, and IP address can be queried locally, most of it cannot be sent out as text, with one exception: Google Assistant can send the user’s current location via SMS to an arbitrary number. However, this does not necessarily mean that attackers cannot access this information remotely. An attacker can manipulate the VA to start a phone call to

Category     Attack type        Permission(s) bypassed                    Google  Huawei  Xiaomi
Privacy      Query location     ACCESS_COARSE_LOCATION                    √       √       √
leak         Share location     ACCESS_COARSE_LOCATION, READ_CONTACTS,    √       ×       ×
                                SEND_SMS, WRITE_SMS
             Query calendar     READ_CALENDAR                             √       √       √
             Share calendar     READ_CALENDAR, READ_CONTACTS,             ×       ×       ×
                                SEND_SMS, WRITE_SMS
             Query IP address   ACCESS_COARSE_LOCATION, INTERNET          √       √       √
             Share IP address   ACCESS_COARSE_LOCATION, INTERNET,         ×       ×       ×
                                READ_CONTACTS, SEND_SMS, WRITE_SMS
Malicious    Phone call         READ_CONTACTS, CALL_PHONE                 √       √       √
activity     Send SMS           READ_CONTACTS, SEND_SMS, WRITE_SMS        √       √       √
             Send email         /                                         √       √       √
             Browse website     INTERNET                                  √       √       √
             Bluetooth control  BLUETOOTH                                 √       √       √
             Video call         INTERNET, CAMERA                          √       √       √

Table 5.3: Post attack commands against VAs.

him, and then query the private information during the phone call. The attacker can then hear the audio response from the VA.

Malicious activities such as making phone calls to premium numbers, sending SMS messages, and browsing malicious websites can be performed on all the VAs that we tested without requesting any permissions.

Given that VAs can be so easily controlled by attackers to perform malicious activities and acquire private information, we suggest that vendors rethink the privileges assigned to VAs.

5.4.4 Runtime Cost Analysis

We evaluate and analyze the runtime cost of the spyware, because a high runtime cost (e.g., CPU, memory) would reduce the stealthiness of the attack. We install the prototype spyware on a Google Pixel 2, Huawei Mate 8, Xiaomi Mi 8, and Samsung Galaxy S9. Since the spyware launches the attack in four distinct phases, we evaluate each phase individually: P1 (phone call state monitoring), P2 (recording and synthesizing the activation command), P3 (environment monitoring), and P4 (attacking via the speaker).

Power consumption analysis: Fig. 5.13 reports the power consumption per minute for the four attacking phases. We also compare it with the power consumption of playing 1080p video and music. The results show that in P1, P2, and P4, the power consumption per minute on all Android phones is very low. P3 has the highest power consumption, at approximately 0.8 mAh per minute. It is still

Figure 5.13: Power consumption of four phases: P1 (phone call state monitoring), P2 (recording and synthesising activation command), P3 (environment monitoring), and P4 (attacking via the speaker)

Figure 5.14: Memory consumption of four phases: P1(Phone call state monitoring), P2(Recording and synthesising activation command), P3(Environment monitoring), and P4(Attacking via the speaker)

negligible when compared with scenarios such as playing video or listening to music, which consume 6.1 mAh and 5.1 mAh per minute, respectively. We further reduce the frequency of collecting data from the sensors in P3 from 50 Hz to 10 Hz, which decreases the power consumption to 0.5 mAh per minute without affecting the success rate of the attack. The results suggest that the spyware consumes too little power to be noticed by the user.

Memory & CPU Analysis. Fig. 5.14 shows the average RAM usage of the four phases. The average RAM usage in P1, P2, and P4 is less than 5 MB. P3 uses the most memory (approximately 10 MB) because of its sensor utilization. Compared with scenarios like playing video or listening to music, which consume approximately 60 MB to 70 MB, the memory cost of our prototype spyware can hardly affect the performance of the hosting smartphone, and is therefore unlikely to be noticed by the user. We also evaluate the CPU cost and find that only P3 incurs noticeable CPU usage, consuming around 7% of the total capacity.

File size. In P2 and P3, the recorded voice pieces and sensor data are stored until they are uploaded to the server; no files need to be stored in P1 and P4. Therefore, only two categories of files are stored during the whole attacking process and then uploaded to the server: an audio file (*.wav) holding the synthesized activation voice, and three text files (*.txt) recording the sensor data. The average sizes of the voice, acceleration, light, and noise files are 180.9 KB, 91.7 KB, 4.4 KB, and 5.4 KB, respectively.

5.4.5 Resistance to Anti-Virus Tools

We test the spyware against industrial anti-malware tools as well as academic malware detection solutions. Android malware detection approaches can be categorized as static or dynamic, according to whether the candidate app needs to be executed. Static approaches analyze static features of the application, such as its components, the permissions it requests, and the code itself. Dynamic approaches execute the application in a protected environment, provide all the emulated resources it needs, and observe its malicious activities.

For industrial anti-virus products, we test the spyware on VirusTotal as well as against the ten most popular anti-virus tools on Google Play, such as Norton Security and Antivirus, Kaspersky Mobile Antivirus, and McAfee Mobile Security. None of them reported our spyware as a malicious app. We also submitted the spyware to the Google Play store, where submitted apps are tested against Google’s dynamic test platform, Google Bouncer; the spyware passed this detection as well. Note that we took the spyware down from Google Play immediately after it passed the test. The detection result from VirusTotal is presented in Figure 5.15.

We also test the spyware with three typical learning-based detectors from academia that rely on syntactic features (e.g., requested permissions, presence of specific API calls) as well as semantic features (e.g., sequences of API calls) extracted from the Android application package (APK), namely Drebin [25], DroidAPIMiner [20], and MaMaDroid [102]. We trained all the detectors with the 5,000 most recently discovered malware samples and 5,000 benign apps, collected from VirusShare (https://virusshare.com/) and the Google Play store, respectively, between August and October 2018. Our spyware is labeled as benign by all three detectors. These results demonstrate the resistance of the proposed attacking method to both industrial and academic malware detection tools.

Figure 5.15: A snapshot of the detection result in VirusTotal
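As background, a syntactic-feature detector in the spirit of Drebin can be approximated by encoding each app's requested permissions as a binary vector and training a linear classifier. The apps and labels below are toy stand-ins, not the actual training corpus or the published Drebin feature set.

```python
# Toy sketch: permission strings -> binary feature vectors -> linear SVM,
# approximating the syntactic-feature detectors evaluated in this section.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Each "document" is the set of permissions one app requests (toy data).
apps = [
    "INTERNET READ_SMS SEND_SMS READ_CONTACTS",   # malware-like
    "INTERNET SEND_SMS CALL_PHONE",               # malware-like
    "INTERNET ACCESS_NETWORK_STATE",              # benign-like
    "INTERNET VIBRATE",                           # benign-like
]
labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign

# binary=True gives presence/absence features over permission names.
vec = CountVectorizer(binary=True, token_pattern=r"\S+")
X = vec.fit_transform(apps)

clf = LinearSVC().fit(X, labels)
test = vec.transform(["INTERNET SEND_SMS READ_CONTACTS"])
print(clf.predict(test))  # likely flagged as malware-like
```

In the real systems, the vocabulary covers thousands of permissions, API calls, and other static features rather than the handful shown here.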

5.5 Discussion

In this section, we discuss the essential factors behind a successful attack, introduce promising defense approaches against Vaspy, and summarize the lessons learned from this work.

5.5.1 Essential Factors for the Successful Attack

We identify three essential factors for a successful attack: 1) success in delivery, 2) success in avoiding detection by anti-malware products, and 3) success in not being noticed by users.

1) The prevalent abuse of high-risk permissions in Android applications makes users insensitive to granting these permissions. For example, with the rising popularity of location-based applications and Augmented Reality (AR) games, Android users are becoming less cautious when granting access to the location and the camera, both of which are high-risk permissions that may leak users’ private information. In our proof-of-concept attack, we disguise Vaspy as a microphone-controlled game to trick the user into granting permission to access the microphone.

2) Current commercial anti-malware tools are based on known features of malware, such as signatures and sensitive operations. Vaspy matches no entries in the existing signature libraries of anti-malware tools, and all the sensitive operations in the attack, such as sending emails and making phone calls, are executed through the VA, which has the privilege to access sensitive data but is not monitored by anti-malware products.

3) In the proposed attack, we developed a stealthy attacking module inside Vaspy, which monitors the environment and looks for a suitable time to launch the attack. It also adjusts the volume of the voice commands so that they can be captured and recognized by the Android phone but cannot be heard by the user.

5.5.2 Defense Approaches for Vaspy

In this section, we outline three possible defense approaches against Vaspy: 1) identifying the source of the voice commands; 2) continuous authentication for VAs; and 3) distinguishing a human voice from a machine-based voice.

1) Identifying the source of the voice commands. In the proposed attack scenario, the voice commands are played via a speaker on the smartphone. Recent techniques [97] can locate the source of a sound and thereby determine whether it comes from the built-in speaker. VA vendors could disable our attack by configuring the VA to disregard any voice commands originating from the built-in speaker of its hosting smartphone.

2) Continuous authentication for VAs. Feng et al. [67] proposed a scheme that collects the body-surface vibrations of the user and matches them against the speech signal received from the microphone; the VA only executes commands that originate from the owner’s voice. While this may successfully defend against our attack, it also brings some inconvenience: users cannot activate the VA when they are not holding the smartphone, yet they tend to interact with the VA precisely when they cannot touch the screen, such as while driving.
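The matching idea behind this defense can be illustrated with a toy correlation check between the microphone signal and the body-surface vibration signal. The signals below are synthetic, and the scheme in [67] compares far richer features than a single correlation coefficient.

```python
# Toy sketch of continuous authentication: accept a voice command only if
# the body-surface vibration signal correlates with the microphone signal.
import numpy as np

rng = np.random.default_rng(0)
speech = rng.standard_normal(1000)  # synthetic speech picked up by the mic

# Owner speaking: vibration is a noisy copy of the speech.
owner_vibration = 0.8 * speech + 0.1 * rng.standard_normal(1000)
# Replayed/injected audio: no matching body vibration exists.
replayed_vibration = rng.standard_normal(1000)

def accept(mic_signal, vibration, threshold=0.5):
    # Pearson correlation between the two signals.
    r = np.corrcoef(mic_signal, vibration)[0, 1]
    return r > threshold

print(accept(speech, owner_vibration))     # owner speaking: accepted
print(accept(speech, replayed_vibration))  # spoofed command: rejected
```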

3) Distinguishing a human voice from a machine-based voice. Chen et al. [47] explored the difference between a human voice and a machine-based voice based on the magnetic field emitted by loudspeakers, which can detect machine-based voice impersonation attacks. However, this approach suffers a high false-positive rate when other devices that generate magnetic signals are nearby.

5.5.3 Lessons from This Work

The privileged position of VAs can be recognized as a vulnerability. Once activated, a VA can change smartphone settings and perform malicious activities that normally require high-level permissions, such as sending SMS messages or emails and making phone calls. Because of the privilege they hold to access system resources and private information, VAs can become a stepping stone for attackers to hack into Android phones. More secure mechanisms should be developed to improve the security of VAs, by either the research community or the VA vendors.

5.6 Summary

In this chapter, we propose Vaspy, a smart and stealthy attack targeting VAs on Android phones. With this attack, an attacker can forge voice commands to activate the VA and launch a number of attacks, including leaking private information, sending forged messages or emails, and calling arbitrary numbers. An Attacking Environment Sensing module is built into Vaspy to choose an optimal attacking time and voice volume so that the attack goes unnoticed by the user.

We build prototype spyware based on Vaspy and evaluate it with participants across various VAs on different Android phones. We demonstrate that Vaspy can launch attacks without being noticed by users. Moreover, our spyware is not detected by state-of-the-art anti-malware tools from either industry or academia. We also propose several potential solutions for detecting our attack. This research may inspire researchers to strengthen the security of VAs on Android phones.

Chapter 6

Conclusion and Future Works

6.1 Conclusion

Malware detection techniques are improving, and so are malware samples. Current work on Android malware detection assumes a non-adversarial environment, which is unrealistic in the real world: malware detectors should not overlook the fact that adversaries are actively updating their malware samples to bypass detection. This thesis studies the security challenges in current malware analysis/detection research to better understand how malware behaves to cheat detection systems, providing insight into how malware authors take countermeasures against current detection approaches. The work in this thesis answers three research questions: 1) How do malware samples evolve over time? 2) What are the vulnerabilities of machine-learning-based detectors, and how can malware exploit them? 3) What factors are overlooked in traditional non-learning-based malware detection approaches, and how can malware leverage them?

For the first research question, we conducted a systematic study of Android malware evolution. Specifically, we designed and implemented PhyloNet, a scheme that constructs a directed evolution graph of the malware variants inside a malware family. We further conducted case studies to analyze the in-depth evolution strategies of malware variants. Our findings show that: 1) most malware variants are generated from a previous version, with which they share more than 50% of their code; and 2) malware variants do update to enhance their malicious behaviors (e.g., stealing additional information from the user compared with the previous version), or to make themselves stealthier to bypass detection (e.g., using DexClassLoader to dynamically load malicious code at run time).
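The "more than 50% shared code" relation can be illustrated with a Jaccard similarity over the sets of methods two variants contain. This is a deliberate simplification for illustration, not PhyloNet's actual similarity metric.

```python
# Toy sketch: Jaccard similarity between the method sets of two variants.
# A high score suggests the child was derived from the parent.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical method names, for illustration only.
parent = {"onCreate", "sendSms", "readContacts", "uploadData"}
child = {"onCreate", "sendSms", "readContacts", "uploadData",
         "loadDexPayload"}  # evolution: dynamic code loading added

sim = jaccard(parent, child)
print(f"similarity: {sim:.2f}")  # 4 shared of 5 total methods -> 0.80
assert sim > 0.5  # treated as a parent-child evolution candidate
```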

For the second research question, we showed that machine-learning-based detectors using either syntactic features (e.g., requested permissions) or semantic features (e.g., control flow graphs) are vulnerable to adversarial attacks. To study how these attacks could occur, we proposed a novel method for attacking machine-learning-based detectors. Our approach adds pre-designed code blocks into the APK, changing the feature vector of the malware sample; the modified malware samples can therefore bypass machine-learning detection.
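The core idea can be illustrated against a hypothetical linear detector over binary features. The attacker may only add features (code can be inserted into an APK, but malicious functionality cannot be removed), so the sketch greedily sets the features with the most benign-leaning weights; this is an illustration, not the thesis's exact algorithm.

```python
# Sketch: evading a linear detector by only ADDING binary features,
# i.e., inserting benign-looking code blocks into the APK.
import numpy as np

w = np.array([2.0, 1.5, -1.0, -1.2, -0.8])  # detector weights (+ = malicious)
b = -0.6                                    # detector bias

x = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # malware: only malicious features set
print(w @ x + b > 0)  # True: detected as malware

# Flip 0 -> 1 on the most benign-leaning (most negative) weights first,
# stopping once the sample crosses the decision boundary.
for i in np.argsort(w):
    if x[i] == 0 and w @ x + b > 0:
        x[i] = 1.0
print(w @ x + b > 0)  # False: now classified as benign
```

Note that the malicious features stay set throughout: the payload is untouched, only camouflage code is added.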

Our attack demonstrated that adversarial examples can be very effective against machine-learning-based Android malware detectors, requiring only a few lines of code to be added to the APK. We also proposed possible methods to defend against such adversarial attacks.

For the third research question, we demonstrated via a case study that malicious activities can be conducted even without requesting any dangerous permissions or invoking any sensitive APIs. To this end, we proposed an attack that leverages the built-in voice assistant of Android smartphones, which has legitimate access to sensitive resources, to steal the user’s private information. We also proposed several potential solutions for detecting such attacks.

6.2 Future Works

There are a few future research directions that may be of interest.

Every machine-learning-based malware detection system suffers from the data drifting problem, in which detection accuracy decreases when detectors trained on old samples try to identify newly emerged malware.

A common solution to the data drifting problem is to periodically retrain the detection model with recent malware samples. However, such a retraining process requires security experts to manually label newly emerged samples, which usually happens only after the malware has spread widely. Based on our study of Android malware evolution patterns, we can proactively simulate Android malware’s evolution and generate “future” malware samples according to the common evolution strategies. These samples can then be used as training data, making the trained model resilient to the data drifting problem.
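The proposed mitigation can be sketched as a retraining pipeline in which the training corpus is augmented with simulated variants. `simulate_evolution` below is a hypothetical stand-in for a generator built from the mined evolution patterns, and the random feature vectors are placeholders for real samples.

```python
# Sketch: retrain a detector on the current corpus plus simulated
# "future" variants of the known malware samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def simulate_evolution(malware_features, n_variants=3, flip_rate=0.1):
    """Hypothetical generator: perturb binary feature vectors the way
    observed evolution patterns do (e.g., adding obfuscation features)."""
    variants = []
    for x in malware_features:
        for _ in range(n_variants):
            mask = rng.random(x.shape) < flip_rate
            variants.append(np.where(mask, 1 - x, x))  # flip a few features
    return np.array(variants)

# Placeholder corpus: 50 malware and 50 benign binary feature vectors.
malware = rng.integers(0, 2, size=(50, 20)).astype(float)
benign = rng.integers(0, 2, size=(50, 20)).astype(float)

future = simulate_evolution(malware)            # 150 simulated variants
X = np.vstack([malware, future, benign])
y = np.array([1] * (len(malware) + len(future)) + [0] * len(benign))

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(X.shape)  # (250, 20): original corpus plus simulated variants
```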

Current adversarial malware example crafting algorithms are developed under the assumption that the target machine learning model’s feature space is known to the adversary. This assumption is practical in some domains, such as image classification, where pixel values are used as features. However, in malware detection and other security-related tasks, the feature space is not always known to the adversary. A possible future work would be to develop a black-box attack that requires zero knowledge of the target machine learning model.

References

[1] Androguard. https://github.com/androguard/androguard. Accessed: 2019-07-01.

[2] Android malware genome project. http://www.malgenomeproject.org. Accessed: 2019-07-01.

[3] APKtool. https://ibotpeaches.github.io/Apktool/. Accessed: 2019-07-01.

[4] Contagio dump. http://contagiominidump.blogspot.com. Accessed: 2019-07-01.

[5] Droidbox: Android application sandbox. https://github.com/pjlantz/ droidbox. Accessed: 2019-07-01.

[6] Firebase. https://firebase.google.com. Accessed: 2019-07-01.

[7] IBM Watson. https://www.ibm.com/watson/. Accessed 2018-09-28.

[8] Internet security threat report, volume 21, April 2016. https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf. Accessed: 2019-07-01.

[9] Internet security threat report, volume 24, February 2019. https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf. Accessed: 2019-07-01.

[10] Mobile operating system market share worldwide. http://gs.statcounter.com/os-market-share/mobile/worldwide. Accessed: 2019-04-22.

[11] Number of available applications in the Google Play store from December 2009 to June 2019. https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/. Accessed: 2019-07-01.

[12] Virus total. https://www.virustotal.com. Accessed: 2018-09-29.

[13] Virustotal contributors. https://support.virustotal.com/hc/en-us/articles/115002146809-Contributors. Accessed: 2019-06-26.

[14] Weka. http://www.cs.waikato.ac.nz/ml/weka/. Accessed: 2019-07-01.

[15] Android 9 behavior changes. https://developer.android.com/about/versions/pie/android-9.0-changes-all, 2018. Accessed: 2019-04-01.

[16] Control google home by voice. https://support.google.com/googlehome/answer/7207759?hl=en-AU&ref_topic=7196346, 2018. Accessed: 2018-09-29.

[17] Now the google assistant can take care of your ious. https://www.blog.google/products/assistant/now-google-assistant-can-take-care-your-ious, 2018. Accessed: 2018-09-29.

[18] Permissions overview. https://developer.android.com/guide/topics/permissions/overview#dangerous-permission-prompt, 2018. Accessed: 2018-09-29.

[19] Internet security threat report. https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf, February 2019. Accessed: 2019-05-08.

[20] Yousra Aafer, Wenliang Du, and Heng Yin. Droidapiminer: Mining api-level features for robust malware detection in android. In International conference on security and privacy in communication systems, pages 86–103. Springer, 2013.

[21] Efthimios Alepis and Constantinos Patsakis. Monkey says, monkey does: Security and privacy on voice assistants. IEEE Access, 5:17841–17851, 2017.

[22] Brandon Amos, Hamilton Turner, and Jules White. Applying machine learning classifiers to dynamic android malware detection at scale. In 2013 9th international wireless communications and mobile computing conference (IWCMC), pages 1666–1671. IEEE, 2013.

[23] Sercan Arik, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems, pages 10019–10029, 2018.

[24] Jacob Aron. How innovative is apple’s new voice assistant, siri? New Scientist, 212(2836):24, 2011.

[25] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. Drebin: Effective and explainable detection of android malware in your pocket. In Ndss, volume 14, pages 23–26, 2014.

[26] Zarni Aung and Win Zaw. Permission-based android malware detection. International Journal of Scientific & Technology Research, 2(3):228–234, 2013.

[27] Emre Aydogan and Sevil Sen. Automatic generation of mobile malware using genetic programming. In European conference on the applications of evolutionary computation, pages 745–756. Springer, 2015.

[28] Marco Barreno, Blaine Nelson, Anthony D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, Nov 2010.

[29] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS ’06, pages 16–25, 2006.

[30] Leonid Batyuk, Markus Herpich, Seyit Ahmet Camtepe, Karsten Raddatz, Aubrey-Derrick Schmidt, and Sahin Albayrak. Using static analysis for automatic assessment and mitigation of unwanted and malicious activities within android applications. In 2011 6th International Conference on Malicious and Unwanted Software, pages 66–72. IEEE, 2011.

[31] Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. A survey on heuristic malware detection techniques. In The 5th Conference on Information and Knowledge Technology, pages 113–120. IEEE, 2013.

[32] Mario Luca Bernardi, Marta Cimitile, and Francesco Mercaldo. Process mining meets malware evolution: A study of the behavior of malicious code. In CANDAR, pages 616–622. IEEE Computer Society, 2016.

[33] Andy Betts. Do you still need to root your android phone? https://www.makeuseof.com/tag/need-root-android-phone/, April 2017. Accessed: 2019-05-08.

[34] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.

[35] Justin Binder, Samuel D Post, Onur Tackin, and Thomas R Gruber. Voice trigger for a digital assistant, August 7 2014. US Patent App. 14/175,864.

[36] Ludvig Bohlin, Daniel Edler, Andrea Lancichinetti, and Martin Rosvall. Community Detection and Visualization of Networks with the Map Equation Framework, pages 3–34. Springer International Publishing, Cham, 2014.

[37] Jarrett Booz, Josh McGiff, William G Hatcher, Wei Yu, James Nguyen, and Chao Lu. Tuning deep learning performance for android malware detection. In 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pages 140–145. IEEE, 2018.

[38] Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill New York, 1986.

[39] Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani. Crowdroid: behavior-based malware detection system for android. In Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, pages 15–26. ACM, 2011.

[40] Haipeng Cai, Na Meng, Barbara Ryder, and Daphne Yao. Droidcat: Effective android malware detection and categorization via app-level profiling. IEEE Transactions on Information Forensics and Security, 14(6):1455–1470, 2018.

[41] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David A. Wagner, and Wenchao Zhou. Hidden voice commands. In USENIX Security Symposium, 2016.

[42] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, May 2017.
[43] Tanmoy Chakraborty, Fabio Pierazzi, and VS Subrahmanian. Ec2: Ensemble clustering and classification for predicting android malware families. IEEE Transactions on Dependable and Secure Computing, 2017.

[44] Victor Chebyshev. Mobile malware evolution 2018. https://securelist.com/mobile-malware-evolution-2018/89689/, March 2019. Accessed: 2019-05-08.

[45] Lingwei Chen, Shifu Hou, and Yanfang Ye. Securedroid: Enhancing security of machine learning-based detection against adversarial android malware attacks. In Proceedings of the 33rd Annual Computer Security Applications Conference, ACSAC 2017, pages 362–372, 2017.

[46] Sen Chen, Minhui Xue, Lingling Fan, Shuang Hao, Lihua Xu, Haojin Zhu, and Bo Li. Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach. Computers & Security, 73:326–344, 2018.

[47] Si Chen, Kui Ren, Sixu Piao, Cong Wang, Qian Wang, Jian Weng, Lu Su, and Aziz Mohaisen. You can hear but you cannot steal: Defending against voice impersonation attacks on smartphones. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 183–195, 2017.

[48] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[49] Aniello Cimitile, Fabio Martinelli, Francesco Mercaldo, Vittoria Nardone, Antonella Santone, and Gigliola Vaglini. Model checking for mobile android malware evolution. In 2017 IEEE/ACM 5th International FME Workshop on Formal Methods in Software Engineering (FormaliSE), pages 24–30. IEEE, 2017.

[50] Vanessa N Cooper, Hossain Shahriar, and Hisham M Haddad. A survey of android malware characteristics and mitigation techniques. In 2014 11th International Conference on Information Technology: New Generations, pages 327–332. IEEE, 2014.
[51] Ambra Demontis, Marco Melis, Battista Biggio, Davide Maiorca, Daniel Arp, Konrad Rieck, Igino Corona, Giorgio Giacinto, and Fabio Roli. Yes, machine learning can be more secure! A case study on android malware detection. IEEE Transactions on Dependable and Secure Computing, 2017.

[52] Wenrui Diao, Xiangyu Liu, Zhe Zhou, and Kehuan Zhang. Your voice assistant is mine: How to abuse speakers to steal information and control your phone. In Proceedings of the 4th ACM Workshop on Security and Privacy in Smartphones & Mobile Devices, SPSM ’14, pages 63–74, 2014.

[53] Shuaike Dong, Menghao Li, Wenrui Diao, Xiangyu Liu, Jian Liu, Zhou Li, Fenghao Xu, Kai Chen, XiaoFeng Wang, and Kehuan Zhang. Understanding android obfuscation techniques: A large-scale investigation in the wild. In International Conference on Security and Privacy in Communication Systems, pages 172–192. Springer, 2018.

[54] Yao Du, Junfeng Wang, and Qi Li. An android malware detection approach using community structures of weighted function call graphs. IEEE Access, 5:17478–17486, 2017.

[55] Yue Duan, Mu Zhang, Abhishek Vasisht Bhaskar, Heng Yin, Xiaorui Pan, Tongxin Li, Xueqiang Wang, and XiaoFeng Wang. Things you may not know about android (un) packers: A systematic study based on whole-system emulation. In NDSS, 2018.

[56] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[57] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM computing surveys (CSUR), 44(2):6, 2012.

[58] Md Enamul Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. Malware phylogeny generation using permutations of code. Journal in Computer Virology, 1:13–23, 11 2005.

[59] William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N Sheth. Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (TOCS), 32(2):5, 2014.

[60] William Enck, Machigar Ongtang, and Patrick McDaniel. On lightweight mobile phone application certification. In Proceedings of the 16th ACM conference on Computer and communications security, pages 235–245. ACM, 2009.

[61] Ming Fan, Jun Liu, Xiapu Luo, Kai Chen, Zhenzhou Tian, Qinghua Zheng, and Ting Liu. Android malware familial classification and representative sample selection via frequent subgraph analysis. IEEE Transactions on Information Forensics and Security, 13(8):1890–1905, 2018.

[62] Ming Fan, Jun Liu, Wei Wang, Haifei Li, Zhenzhou Tian, and Ting Liu. Dapasa: detecting android piggybacked apps through sensitive subgraph analysis. IEEE Transactions on Information Forensics and Security, 12(8):1772–1785, 2017.

[63] Ming Fan, Xiapu Luo, Jun Liu, Meng Wang, Chunyin Nong, Qinghua Zheng, and Ting Liu. Graph embedding based familial analysis of android malware using unsupervised learning. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 771–782. IEEE, 2019.

[64] Parvez Faruki, Ammar Bharmal, Vijay Laxmi, Vijay Ganmoor, Manoj Singh Gaur, Mauro Conti, and Muttukrishnan Rajarajan. Android security: a survey of issues, malware penetration, and defenses. IEEE communications surveys & tutorials, 17(2):998–1022, 2014.

[65] Parvez Faruki, Vijay Laxmi, Ammar Bharmal, Manoj Singh Gaur, and Vijay Ganmoor. Androsimilar: Robust signature for detecting variants of android malware. Journal of Information Security and Applications, 22:66–80, 2015.

[66] Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, and David Wagner. Android permissions demystified. In Proceedings of the 18th ACM conference on Computer and communications security, pages 627–638. ACM, 2011.

[67] Huan Feng, Kassem Fawaz, and Kang G. Shin. Continuous authentication for voice assistants. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom ’17, pages 343–355, 2017.

[68] Yu Feng, Saswat Anand, Isil Dillig, and Alex Aiken. Apposcopy: Semantics-based detection of android malware through static analysis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 576–587. ACM, 2014.

[69] Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone, Mark Ordowski, and George R Doddington. Syllable-based large vocabulary continuous speech recognition. IEEE Transactions on speech and audio processing, 9(4):358–366, 2001.

[70] Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad Rieck. Structural detection of android malware using embedded call graphs. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security, pages 45–54. ACM, 2013.

[71] Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in neural information processing systems, pages 2962–2970, 2017.

[72] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. Computer Science, 2014.

[73] Google. What can your google assistant do? https://assistant.google. com/explore?hl=en-AU, 2018. [Accessed September 29, 2018].

[74] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

[75] Ilan Gronau and Shlomo Moran. Optimal implementations of upgma and other common clustering algorithms. Information Processing Letters, 104(6):205–210, 2007.

[76] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.

[77] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pages 62–79. Springer, 2017.

[78] Shun-Wen Hsiao, Yeali Sun, and Meng Chen. Virtual machine introspection based malware behavior profiling and family grouping. May 2017.

[79] Weiwei Hu and Ying Tan. Generating adversarial malware examples for black-box attacks based on gan. arXiv preprint arXiv:1702.05983, 2017.

[80] Chun-Ying Huang, Yi-Ting Tsai, and Chung-Han Hsu. Performance evaluation on permission-based detection for android malware. In Advances in Intelligent Systems and Applications-Volume 2, pages 111–120. Springer, 2013.

[81] Médéric Hurier, Guillermo Suarez-Tangil, Santanu Dash, Tegawendé Bissyandé, Yves Le Traon, Jacques Klein, and Lorenzo Cavallaro. Euphony: Harmonious unification of cacophonous anti-virus vendor labels for android malware. May 2017.

[82] Takamasa Isohara, Keisuke Takemori, and Ayumu Kubota. Kernel-based behavior analysis for android malware detection. In 2011 Seventh International Conference on Computational Intelligence and Security, pages 1011–1015. IEEE, 2011.

[83] Kaspersky. Machine learning for malware detection. Technical report, Kaspersky, 2017.

[84] Wei Ming Khoo and Pietro Liò. Unity in diversity: Phylogenetic-inspired techniques for reverse engineering and detection of malware families. In 2011 First SysSec Workshop, pages 3–10, 2011.

[85] TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, and Eul Gyu Im. A multimodal deep learning method for android malware detection using various features. IEEE Transactions on Information Forensics and Security, 14(3):773–788, 2018.

[86] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, CHIIR '16, pages 121–130, 2016.

[87] Robin Knote, Andreas Janson, Laura Eigenbrod, and Matthias Söllner. The what and how of smart personal assistants: Principles and application domains for IS research. 2018.

[88] Mariantonietta La Polla, Fabio Martinelli, and Daniele Sgandurra. A survey on security for mobile devices. IEEE Communications Surveys & Tutorials, 15(1):446–471, 2012.

[89] Pavel Laskov et al. Practical evasion of a learning-based classifier: A case study. In 2014 IEEE Symposium on Security and Privacy (SP), pages 197–211. IEEE, 2014.

[90] Jin Li, Lichao Sun, Qiben Yan, Zhiqiang Li, Witawas Srisa-an, and Heng Ye. Significant permission identification for machine-learning-based android malware detection. IEEE Transactions on Industrial Informatics, 14(7):3216–3225, 2018.

[91] Li Li, Tegawendé Bissyandé, and Jacques Klein. Simidroid: Identifying and explaining similarities in android apps. pages 136–143, August 2017.
[92] Li Li, Daoyuan Li, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon, David Lo, and Lorenzo Cavallaro. Understanding android app piggybacking: A systematic study of malicious code grafting. IEEE Transactions on Information Forensics and Security, 12(6):1269–1284, 2017.

[93] Yongfeng Li, Tong Shen, Xin Sun, Xuerui Pan, and Bing Mao. Detection, classification and characterization of android malware using api data dependency. In International Conference on Security and Privacy in Communication Systems, pages 23–40. Springer, 2015.

[94] Martina Lindorfer, Matthias Neugschwandtner, Lukas Weichselbaum, Yanick Fratantonio, Victor Van Der Veen, and Christian Platzer. Andrubis–1,000,000 apps later: A view on current android malware behaviors. In 2014 Third International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), pages 3–17. IEEE, 2014.

[95] Jing Liu, Pei Dai Xie, Meng Zhu Liu, and Yong Jun Wang. Having an insight into malware phylogeny: Building persistent phylogeny tree of families. IEICE Transactions on Information and Systems, 101(4):1199–1202, 2018.

[96] Jing Liu, Yuan Wang, Pei Dai Xie, and Yong Jun Wang. Inferring phylogenetic network of malware families based on splits graph. IEICE Transactions on Information and Systems, 100(6):1368–1371, 2017.

[97] Rui Liu, Cory Cornelius, Reza Rawassizadeh, Ron Peterson, and David Kotz. Poster: Vocal resonance as a passive biometric. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys ’17, pages 160–160, 2017.

[98] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.

[99] Long Lu, Zhichun Li, Zhenyu Wu, Wenke Lee, and Guofei Jiang. Chex: statically vetting android apps for component hijacking vulnerabilities. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 229–240. ACM, 2012.

[100] Zhuo Ma, Haoran Ge, Yang Liu, Meng Zhao, and Jianfeng Ma. A combination method for android malware detection based on control flow graphs and machine learning algorithms. IEEE access, 7:21235–21245, 2019.

[101] Davide Maiorca, Davide Ariu, Igino Corona, Marco Aresu, and Giorgio Giacinto. Stealth attacks: An extended insight into the obfuscation effects on android malware. Computers & Security, 51:16–31, 2015.

[102] Enrico Mariconti, Lucky Onwuzurike, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. Mamadroid: Detecting android malware by building markov chains of behavioral models. arXiv preprint arXiv:1612.04433, 2016.

[103] McAfee. Mcafee mobile threat report q1, 2018. Technical report, McAfee Labs, 2018.

[104] Guozhu Meng, Yinxing Xue, Chandramohan Mahinthan, Annamalai Narayanan, Yang Liu, Jie Zhang, and Tieming Chen. Mystique: Evolving android malware for auditing anti-malware tools. In Proceedings of the 11th ACM Asia Conference on Computer and Communications Security, pages 365–376. ACM, 2016.

[105] Alain Muzet. Environmental noise, sleep and health. Sleep medicine reviews, 11(2):135–142, 2007.

[106] Diane Oyen, Blake Anderson, and Christine M. Anderson-Cook. Bayesian networks with prior knowledge for malware phylogenetics. In AAAI Workshop: Artificial Intelligence for Cyber Security, 2016.

[107] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.

[108] Naser Peiravian and Xingquan Zhu. Machine learning for android malware detection using permission and api calls. In 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pages 300–305. IEEE, 2013.

[109] Feargus Pendlebury, Fabio Pierazzi, Roberto Jordaney, Johannes Kinder, and Lorenzo Cavallaro. Tesseract: Eliminating experimental bias in malware classification across space and time. CoRR, abs/1807.07838, 2018.

[110] Sancheng Peng, Shui Yu, and Aimin Yang. Smartphone malware and its propagation modeling: A survey. IEEE Communications Surveys & Tutorials, 16(2):925–941, 2013.

[111] Vaibhav Rastogi, Yan Chen, and Xuxian Jiang. Catch me if you can: Evaluating android anti-malware against transformation attacks. IEEE Transactions on Information Forensics and Security, 9(1):99–108, 2013.

[112] Vaibhav Rastogi, Yan Chen, and Xuxian Jiang. Droidchameleon: Evaluating android anti-malware against transformation attacks. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pages 329–334. ACM, 2013.

[113] Alessandro Reina, Aristide Fattori, and Lorenzo Cavallaro. A system call-centric analysis and stimulation technique to automatically reconstruct android malware behaviors. EuroSec, April, 2013.

[114] Ishai Rosenberg, Asaf Shabtai, Lior Rokach, and Yuval Elovici. Generic black-box end-to-end attack against rnns and other api calls based malware classifiers. arXiv preprint arXiv:1707.05970, 2017.

[115] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.

[116] Justin Sahs and Latifur Khan. A machine learning approach to android malware detection. In 2012 European Intelligence and Security Informatics Conference, pages 141–147. IEEE, 2012.

[117] Igor Santos, Felix Brezo, Javier Nieves, Yoseba K Penya, Borja Sanz, Carlos Laorden, and Pablo G Bringas. Idea: Opcode-sequence-based malware detection. In International Symposium on Engineering Secure Software and Systems, pages 35–43. Springer, 2010.

[118] Borja Sanz, Igor Santos, Carlos Laorden, Xabier Ugarte-Pedrero, and Pablo Garcia Bringas. On the automatic categorisation of android applications. In 2012 IEEE Consumer communications and networking conference (CCNC), pages 149–153. IEEE, 2012.

[119] A-D Schmidt, Rainer Bye, H-G Schmidt, Jan Clausen, Osman Kiraz, Kamer A Yuksel, Seyit Ahmet Camtepe, and Sahin Albayrak. Static analysis of executables for collaborative malware detection on android. In 2009 IEEE International Conference on Communications, pages 1–5. IEEE, 2009.

[120] Marcos Sebastián, Richard Rivera, Platon Kotzias, and Juan Caballero. Avclass: A tool for massive malware labeling. In RAID, 2016.

[121] Asaf Shabtai, Yuval Fledel, and Yuval Elovici. Automated static code analysis for classifying android applications using machine learning. In 2010 International Conference on Computational Intelligence and Security, pages 329–333. IEEE, 2010.

[122] Asaf Shabtai, Uri Kanonov, Yuval Elovici, Chanan Glezer, and Yael Weiss. Andromaly: A behavioral malware detection framework for android devices. Journal of Intelligent Information Systems, 38(1):161–190, 2012.

[123] Feng Shen, Justin Del Vecchio, Aziz Mohaisen, Steven Y Ko, and Lukasz Ziarek. Android malware detection using complex-flows. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 2430–2437. IEEE, 2017.

[124] Robert R. Sokal and Charles D. Michener. A statistical method for evaluating systematic relationships. 1958.

[125] Guillermo Suarez-Tangil and Gianluca Stringhini. Eight years of rider measurement in the android malware ecosystem: Evolution and lessons learned. January 2018.

[126] Guillermo Suarez-Tangil, Juan E Tapiador, Pedro Peris-Lopez, and Jorge Blasco. Dendroid: A text mining approach to analyzing and classifying code structures in android malware families. Expert Systems with Applications, 41(4):1104–1117, 2014.

[127] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[128] Deepak Venugopal and Guoning Hu. Efficient signature based malware detection on mobile devices. Mobile Information Systems, 4(1):33–49, 2008.

[129] Nicolas Viennot, Edward Garcia, and Jason Nieh. A measurement study of google play. In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 221–233. ACM, 2014.

[130] Michelle Y Wong and David Lie. Intellidroid: A targeted input generator for the dynamic analysis of android malware. In NDSS, volume 16, pages 21–24, 2016.

[131] Michelle Y Wong and David Lie. Tackling runtime-based obfuscation in android with TIRO. In 27th USENIX Security Symposium (USENIX Security 18), pages 1247–1262, 2018.

[132] Paul Wood. Internet security threat report. Technical report, Symantec, California, 2015.

[133] Dong-Jie Wu, Ching-Hao Mao, Te-En Wei, Hahn-Ming Lee, and Kuo-Ping Wu. Droidmat: Android malware detection through manifest and api calls tracing. In 2012 Seventh Asia Joint Conference on Information Security (Asia JCIS), pages 62–69. IEEE, 2012.

[134] Weimin Wu. Adversarial sample generation: Making machine learning systems robust for security. Technical report, Trend Micro, 2018.

[135] Ke Xu, Yingjiu Li, and Robert H Deng. Iccdetector: Icc-based malware detection on android. IEEE Transactions on Information Forensics and Security, 11(6):1252–1264, 2016.

[136] Nan Xu, Fan Zhang, Yisha Luo, Weijia Jia, Dong Xuan, and Jin Teng. Stealthy video capturer: a new video-based spyware in 3g smartphones. In Proceedings of the second ACM conference on Wireless network security, pages 69–78. ACM, 2009.

[137] Lei Xue, Xiapu Luo, Le Yu, Shuai Wang, and Dinghao Wu. Adaptive unpacking of android apps. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pages 358–369. IEEE, 2017.

[138] Lei Xue, Chenxiong Qian, Hao Zhou, Xiapu Luo, Yajin Zhou, Yuru Shao, and Alvin TS Chan. Ndroid: Toward tracking information flows across multiple android contexts. IEEE Transactions on Information Forensics and Security, 14(3):814–828, 2018.

[139] Lei Xue, Yajin Zhou, Ting Chen, Xiapu Luo, and Guofei Gu. Malton: Towards on-device non-invasive mobile malware analysis for ART. In 26th USENIX Security Symposium (USENIX Security 17), pages 289–306, 2017.

[140] Yinxing Xue, Guozhu Meng, Yang Liu, Tian Huat Tan, Hongxu Chen, Jun Sun, and Jie Zhang. Auditing anti-malware tools by evolving android malware and dynamic loading technique. IEEE Transactions on Information Forensics and Security, 12(7):1529–1544, 2017.

[141] Lok Kwong Yan and Heng Yin. Droidscope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic android malware analysis. In Presented as part of the 21st USENIX Security Symposium (USENIX Security 12), pages 569–584, 2012.

[142] Chao Yang, Zhaoyan Xu, Guofei Gu, Vinod Yegneswaran, and Phillip Porras. Droidminer: Automated mining and characterization of fine-grained malicious behaviors in android applications. In European Symposium on Research in Computer Security, pages 163–182. Springer, 2014.

[143] Shengqian Yang, Dacong Yan, Haowei Wu, Yan Wang, and Atanas Rountev. Static control-flow analysis of user-driven callbacks in android applications. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 89–99. IEEE, 2015.

[144] Wei Yang, Deguang Kong, Tao Xie, and Carl A Gunter. Malware detection in adversarial settings: Exploiting feature evolutions and confusions in android apps. In Proceedings of the 33rd Annual Computer Security Applications Conference, pages 288–302, 2017.

[145] Wei Yang, Deguang Kong, Tao Xie, and Carl A Gunter. Malware detection in adversarial settings: Exploiting feature evolutions and confusions in android apps. In Proceedings of the 33rd Annual Computer Security Applications Conference, pages 288–302. ACM, 2017.

[146] Wei Yang and Tao Xie. Telemade: a testing framework for learning-based malware detection systems. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[147] Suleiman Y Yerima and Sakir Sezer. Droidfusion: A novel multilevel classifier fusion approach for android malware detection. IEEE transactions on cybernetics, 49(2):453–466, 2018.

[148] Suleiman Y Yerima, Sakir Sezer, Gavin McWilliams, and Igor Muttik. A new android malware detection approach using bayesian classification. In 2013 IEEE 27th international conference on advanced information networking and applications (AINA), pages 121–128. IEEE, 2013.

[149] Xiaoyong Yuan, Pan He, Qile Zhu, Rajendra Rana Bhat, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. CoRR, abs/1712.07107, 2017.

[150] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 2017.

[151] Mu Zhang, Yue Duan, Heng Yin, and Zhiruo Zhao. Semantics-aware android malware classification using weighted contextual api dependency graphs. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1105–1116. ACM, 2014.

[152] Yueqian Zhang, Xiapu Luo, and Haoyang Yin. Dexhunter: Toward extracting hidden code from packed android applications. In European Symposium on Research in Computer Security, pages 293–311. Springer, 2015.

[153] Min Zheng, Patrick PC Lee, and John CS Lui. Adam: an automatic and extensible platform to stress test android anti-virus systems. In International conference on detection of intrusions and malware, and vulnerability assessment, pages 82–101. Springer, 2012.

[154] Wu Zhou, Yajin Zhou, Xuxian Jiang, and Peng Ning. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the second ACM conference on Data and Application Security and Privacy, pages 317–326. ACM, 2012.

[155] Yajin Zhou, Xinwen Zhang, Xuxian Jiang, and Vincent W Freeh. Taming information-stealing smartphone applications (on android). In International conference on Trust and trustworthy computing, pages 93–107. Springer, 2011.

[156] Dali Zhu, Hao Jin, Ying Yang, Di Wu, and Weiyi Chen. Deepflow: Deep learning-based malware detection by mining android application for abnormal usage of sensitive data. In 2017 IEEE symposium on computers and communications (ISCC), pages 438–443. IEEE, 2017.
