Xiao Chen Thesis
Studying the Security Centric Intelligence on Android Malware Detection

Xiao Chen

A Thesis Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy

Faculty of Science, Engineering and Technology
Swinburne University of Technology
Australia

June 2020

Abstract

Android malware detection has long been a critical challenge. Statistical analysis and machine learning based solutions have yielded promising performance on automatic malware detection. However, most existing research has not considered a real-world issue in malware detection: malware developers may be aware of the detection systems and take countermeasures against them. Existing works overlook this issue and assume that detectors operate in a non-adversarial environment, which is unlikely in the real world. Recent research in machine learning and deep learning has revealed that learning-based systems are vulnerable to carefully crafted adversarial examples. Extensive efforts have been undertaken to investigate this issue in the computer vision area, but very few works have studied its impact on malware detection systems. In fact, adversaries always exist in malware detection and identification tasks. For example, static analysis based malware identification suffers from the code obfuscation problem, which is itself a countermeasure that malware authors take to bypass detection.

This thesis presents a study of Android malware detection from both the detector's and the adversary's perspective. Specifically, from the detector's perspective, we analyze how malware evolves. From the adversary's perspective, we study how malware camouflages itself and evades detection by both non-machine-learning and machine-learning-based detectors.

We first carry out a fine-grained and in-depth phylogenetic analysis of malware variants to study the popular evolution patterns of malware samples.
We propose a method that first clusters the malware samples of a family into variant sets, and then systematically reveals the phylogenetic relationships among those sets for a more in-depth analysis of malware evolution. Moreover, we summarise evolutionary patterns that shed light on how malware samples evolve to bypass detection by anti-virus techniques. Such analysis can be of great benefit in understanding newly emerged malware variants and evolution-inspired evasion attacks.

Secondly, we propose a novel attack method that generates adversarial examples of Android malware that evade detection by current detection models. To this end, we propose a method of applying optimal perturbations to an Android APK that can successfully deceive machine learning detectors. We develop an automated tool that generates the adversarial examples without human intervention.

Lastly, we demonstrate how a smartphone's built-in voice assistant (VA) can be used to compromise an Android smartphone while evading detection by anti-malware techniques. We propose a novel attack approach that crafts the user's activation voice by silently listening to the user's phone calls. Once the activation voice is formed, the attack can select a suitable occasion to launch. The attack embodies a machine learning model that learns proper attack times to avoid being noticed by the user. By raising awareness, we urge the community and manufacturers to revisit the risks of VAs and subsequently revise the activation logic to be resilient to the style of attack proposed in this work.

We believe that understanding how malware samples evolve and how adversarial examples are crafted will help researchers and malware analysts to better address the issue of detecting malware in an adversarial environment.

Acknowledgements

I would like to thank my supervisors Prof. Yang Xiang, A/Prof. Jun Zhang and Prof. Wanlei Zhou for their support and patience.
I would like to thank my family for their love and understanding. I would like to thank Dr. Sheng Wen for his comments and encouragement. I would like to thank Chaoran, Lihong, Derek and Junchen for the sleepless nights we worked together to meet the deadlines. I would like to thank all my friends from Swinburne and Deakin.

The climb may be long, but the view is worth it.

Xiao Chen
July 10, 2019

Declaration

This is to certify that this thesis contains no material which has been accepted for the award of any other degree or diploma and that, to the best of my knowledge, this thesis contains no material previously published or written by another person except where due reference is made in the text of the thesis.

Xiao Chen
July 10, 2019

To my family.

Contents

Abstract
Acknowledgements
Declaration
List of Tables
List of Figures

1 Introduction
  1.1 Motivations
  1.2 Objectives and Contributions
  1.3 Thesis Organization

2 Literature Review
  2.1 Android Security Threats
    2.1.1 Malware Types
    2.1.2 Malware Penetration Techniques
    2.1.3 Malware Obfuscation Techniques
  2.2 Android Malware Detection
    2.2.1 Static Approaches
    2.2.2 Dynamic Approaches
  2.3 Android Malware Evasion
    2.3.1 Evading Traditional Non-Machine Learning Based Detection
    2.3.2 Evading Machine Learning Based Detection
  2.4 Android Malware Evolution

3 Investigating The Evolution Pattern of Android Malware
  3.1 Introduction
  3.2 Approach
    3.2.1 Variant Sets Generation
    3.2.2 Formula Construction of the Variant Sets
    3.2.3 Phylogenetic Network Construction
  3.3 Evaluation
    3.3.1 Dataset
    3.3.2 Insights Into Variant Clustering
    3.3.3 Representativeness of the Variant Formula
  3.4 Inspection on Malware Evolution
    3.4.1 Phylogenetic Relationship
    3.4.2 Evolution Analysis
  3.5 Summary

4 Repackaging Malware for Evading Machine-Learning Detection
  4.1 Introduction
  4.2 Android Application Package
  4.3 Targeted Systems and Attack Scenarios
    4.3.1 MaMaDroid
    4.3.2 Drebin
    4.3.3 Attack Scenarios
  4.4 Attack on MaMaDroid
    4.4.1 Attack Algorithm
    4.4.2 APK Manipulation
    4.4.3 Experiment Settings
    4.4.4 Experiment Results
  4.5 Attack on Drebin
    4.5.1 Attack Algorithm
    4.5.2 APK Manipulation
    4.5.3 Experiments & Evaluations
  4.6 Discussion
    4.6.1 Comparison with Existing works
    4.6.2 Why We Are Successful
    4.6.3 Applicability of Our Attack
    4.6.4 Transferability
    4.6.5 Artifacts in Our Attack
    4.6.6 Defending Methods
  4.7 Summary

5 A Stealthy Attack on Android phones without Users' Awareness
  5.1 Introduction
  5.2 Attacking Model: Vaspy
  5.3 Proof-of-Concept: A Spyware
    5.3.1 Activation Voice Manipulation
    5.3.2 Attacking Environment Sensing
    5.3.3 Post Attacks and Spyware Delivery
  5.4 Evaluation
    5.4.1 Evaluation of the Attacking Environment Sensing Module
    5.4.2 Evaluation of Real World Attack
    5.4.3 Capability of Attack
    5.4.4 Runtime Cost Analysis
    5.4.5 Resistance to Anti-Virus Tools
  5.5 Discussion
    5.5.1 Essential Factors for the Successful Attack
    5.5.2 Defense Approaches for Vaspy
    5.5.3 Lessons from This Work
  5.6 Summary

6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works

References

List of Tables

3.1 Best Silhouette Values Achieved for Malware Families and Their Corresponding Distance Thresholds
3.2 Malware samples from the same variant set and various labels provided by different Anti-Virus vendors. The malware samples from variant set a are denoted by a_0 to a_n. The Anti-Virus vendors are denoted by v_0 to v_m. l_nm denotes the variant label provided by vendor v_m for the malware a_n. The variant label l_an is unified from the labels given by the different Anti-Virus vendors.
3.3 Label Consistency for Four Well-Known Anti-Virus Vendors. The average consistencies are all above 80%, with the minimum no less than 50%.
4.1 Overview of Drebin feature set
4.2 Attack Scenarios
4.3 Number of features added in each set
4.4 Comparison with existing works (Evasion Rate)
5.1 Movement intensity Features
5.2 Average Accuracy Performance
5.3 Post attack commands against VAs.

List of Figures

3.1 Relationship between a malware family and its variant sets. The malware samples in the same column (colour) share similar code. The malware samples from different columns inside a family perform similar malicious behaviours.
3.2 Variant Set Generation Process Overview. 1) A distance matrix is generated by similarity analysis of a set of APKs from the same malware family using SimiDroid. 2) By applying the UPGMA clustering algorithm, a malware family dendrogram is generated, representing the hierarchical relationship among malware samples of the family. 3) After the distance threshold is determined, malware samples are then clustered into smaller clades, and each clade represents a variant set. Four variant sets (a, b, c, d) are generated in the example.
3.3 Silhouette and Method-Based Distance Threshold.
3.4 Variant sets generation for Fakebank and Commplat families
3.5 Formula Construction of the Variant Sets.
3.6 Malware Sample Distribution of Six Malware Families From 2008 to 2018. Different malware families have different life cycles. The number of malware samples in opfake, smsagent and autoins increased dramatically in recent years.