Android Malware Detection Using Data Mining Techniques on Process Control Block Information

by

Heba Ziad Alawneh

A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Auburn, Alabama
August 8, 2020

Keywords: Dynamic Malware Detection, Android, Recurrent Neural Networks, Convolutional Neural Networks, Long Short-Term Memory, Deep Learning

Copyright 2020 by Heba Ziad Alawneh

Committee Members

David Umphress, Chair, Professor of Computer Science and Software Engineering
Anh Nguyen, Assistant Professor of Computer Science and Software Engineering
Daniel Tauritz, Associate Professor of Computer Science and Software Engineering
Anthony Skjellum, Professor of Computer Science and Engineering, University of Tennessee at Chattanooga

Abstract

Because smartphones are increasingly becoming the mobile computing device of choice, we are experiencing an increase in the number and sophistication of mobile-computing-based malware attacks. Many of these attacks target users' sensitive information, such as banking usernames and passwords. One widespread type of malicious app encrypts user data, locks the device with a password, and demands money to decrypt it. Malicious apps can also illegitimately collect browsing-related information or install other apps. Available malware detection techniques can be categorized as dynamic or static based on the type of features used in the analysis. Using process behavior (as in dynamic analysis) to detect malware is generally more reliable than examining application files only (as in static analysis). Nonetheless, dynamic analysis is more time and computationally intensive; hence, real-time malware detection is considered a challenging task. The limitations of mobile devices, such as storage, computing capacity, and battery life, make the task even more challenging. In this research, we propose a dynamic malware detection approach that identifies malicious behavior using deep learning techniques on Process Control Block (PCB) information mined over the process execution time. Our mining approach is performed at the kernel level and synchronized with the process CPU utilization. It precisely tracks changes in PCB parameters over the execution time, and it represents not only the behavior of the process itself but also that of all threads created by the process. We then use the PCB sequence information to train a deep learning model to identify malicious behavior. We validated our approach using 2600 benign and 2500 malware-infested recent Android applications. Our mining approach successfully captured more than 99% of context switches for the vast majority of tested applications. Furthermore, our detection model was able to identify malicious behavior at various points of the process execution time using only 12 PCBs, with an F1-score of 95.8%. To the best of our knowledge, no available dynamic malware detection technique has achieved such minimal detection time. We also introduce a

closed dynamic malware analysis framework for application testing running on multiple Android phones concurrently.

Acknowledgments

First and foremost, I would like to thank God Almighty for giving me the strength, dedication, and ability to finish this work. I am also deeply thankful for those who supported me professionally and personally throughout my journey here at Auburn University. I am grateful, especially, to my mentor and advisor, Dr. David Umphress. Thank you for your effort and support. Mostly, thank you for encouraging me to freely pursue my research interests while guiding me to stay on course. I wish to express my sincere thanks to my dissertation committee members: Dr. Anthony Skjellum, for his support in starting my Ph.D. studies; Dr. Anh Nguyen; and Dr. Daniel Tauritz. Thank you all for your guidance and valuable input. Finally, I would like to thank my companion Hamza Alkofahi, for the endless support and for always pushing me to achieve my best; my kids Qais and Omar for keeping me happy; my parents Helwa and Ziad Alawneh for believing in me; my brothers and sisters; and all my friends and family. Thank you for your prayers, love, and support. Without you, my journey would not have been possible.

Table of Contents

Abstract
Acknowledgments
1 Introduction
2 Literature Review
2.1 Malware Targeting Android Smartphones
2.2 Malware Detection Techniques
2.3 Android PCB Features
3 Mining PCB Information & Building the Application Dataset for Android Malware Detection
3.1 Introduction
3.2 Application Data Collection
3.2.1 Benign Application Collection
3.2.2 Malware-infested Application Collection
3.2.3 The Android Application Dataset
3.3 Data Collection Architecture
3.3.1 Android Test Devices: Phone, Android, and Kernel Versions
3.3.2 Automated Input Generation for Android Application Testing
3.3.3 Network Configuration for Application Testing
3.4 Mining PCB Information
3.4.1 Tasks Management & Preparation
3.4.2 Application Testing & PCB Data Collection
3.4.3 Storing & Analyzing Results
3.5 The PCB Dataset
3.6 Discussion
3.7 Conclusion
4 Android Malware Detection using Deep Learning on PCB Information
4.1 Introduction
4.2 Using Deep Learning & PCB Information to Identify Android Malware Attacks
4.2.1 Data Preprocessing
4.2.2 Model Architecture
4.3 Experiments & Results
4.3.1 Dataset Design Impact on Classifier Performance
4.3.2 Measuring Features Importance
4.4 Conclusion
5 Conclusion & Future Work
5.1 Conclusion
5.2 Future Work
Appendices
A The Kernel-Space System Components Pseudocode
A.1 The Data Collector Pseudocode
A.2 The Modified CPU Scheduler Pseudocode
B Struct task_struct

List of Figures

2.1 (a) Android malware distribution in 2018. (b) Top 10 Android malware in 2018.
3.1 Summary statistics of the benign dataset: applications are distributed by (a) category, (b) application rating, and (c) minimum number of installs.
3.2 The distribution of the malware samples based on the year the malware sample was first seen.
3.3 Distribution of malware families within the malware dataset.
3.4 The network setup for malware dynamic analysis.
3.5 Main manager running tasks concurrently on all connected devices.
3.6 Sequence diagram of task management and preparation.
3.7 The sequence diagram of the application testing and PCB data collection workflow.
3.8 Sequence diagram of post application testing and data collection process.
3.9 The distribution of app tests based on the number of threads running each app.
3.10 Distribution of app tests for malware-infested and benign based on (a) the PCB miss ratio for all threads during the application test and (b) the PCB miss ratio of the main thread during the application test.
3.11 Distribution of app tests for malware-infested and benign based on (a) the number of context switches for all threads combined and (b) the number of context switches of the main thread that occurred during the application test.
4.1 The network architecture for detecting Android malware.
4.2 The F1-scores and their variations of the classifier models trained with the following PCB sequence settings: zero or random starting point and a sequence size of n PCBs.
4.3 PCB feature categories based on the Feature Importance Score (FI).
4.4 Heatmap of malware and benign processes for the top 10% of features ranked by FI score.

List of Tables

2.1 CPU scheduling sched_info Features within task_struct.
2.2 Memory management mm_struct Features within task_struct.
2.3 Signal information signal_struct Features within task_struct.
2.4 Process information within task_struct.
3.1 Comparing statistics of our proposed approach with a previous method.
3.2 Comparing the proposed mining approach with the available methods [70, 88, 96].
3.3 Comparing statistics of the datasets collected using our proposed mining approach and another available method.
4.1 Evaluating the random models on the random-starting point versus on the expanded test set.
4.2 Evaluating the classifier models trained and tested using the expanded dataset.

Chapter 1

Introduction

Today, mobile users number more than 5.1 billion [28], with smartphone penetration rates increasing as well. Many smartphone users are unaware of the fact that most malicious smartphone software targets users' private information without their permission. The majority of commercial anti-virus tools follow a signature-based malware detection approach (i.e., they look for specific fingerprints of already discovered malware) and are thus unable to detect new malware (also known as zero-day malware attacks) [104]. According to StatCounter [57], Android is the most popular mobile operating system worldwide. It is also the second most targeted platform after Microsoft Windows. However, only a few Android mobile devices provide effective virus protection, while the remaining majority are poorly protected. Moreover, the number and sophistication of new malware attacks grew significantly during the past two years [1], making it harder over time to detect such attacks. Over 90% of malware targeting Android platforms in 2018 were Trojan attacks [18]. A Trojan attack is a harmful app that appears legitimate but performs malicious activity unbeknownst to the user [43]. Trojans can be used for stealing confidential information, creating backdoors, and activating viruses or other malware. Banking and password Trojan attacks tripled within the first quarter of 2018 alone [18]. The latest state-of-the-art banking Trojan, named Gustuff, can steal the login information of over 100 banking apps and 30 cryptocurrency apps. Moreover, it launches other malware to transfer money from the stolen bank accounts to preconfigured accounts.

Traditional detection techniques such as static and signature-based detection suffer from concealment schemes (obfuscation and encryption), which limit or eliminate their effectiveness. Therefore, using dynamic detection techniques can be the best approach for real-time malware detection. Available malware detection techniques analyze malware either using system activities and hardware utilization or by tracing the information flow through the system. Recent research efforts proposed mining the Process Control Block (PCB) information to detect malicious behavior in Android applications [70, 88, 8]. However, available dynamic techniques are generally time-intensive and resource-consuming, which makes them unsuitable for detecting malware at run time. The process control block in an operating system is the data structure used to store all the information that the kernel needs about a specific running process at any given time. Benign processes follow the system's access control policies that are used to specify who can access information, where, and when. In contrast, malicious processes violate them by exploiting a system vulnerability (i.e., an unintended flaw) to steal information, spy, or attack the infected device. Therefore, some pieces of the information stored in the PCB are likely to reflect malicious behavior as it occurs. In this work, we consider the challenges of detecting malware dynamically on smartphone systems, as well as the limitations of available techniques, and make the following contributions. In Chapter 3, we propose a novel approach to mining PCB information at the kernel level synchronized with the process CPU utilization. The proposed approach precisely tracks changes in the PCB parameters over the process execution time. It represents not only the behavior of the process itself but also that of all threads created by the process, which enables detecting dropper malware. We also introduce a closed dynamic malware analysis framework for application testing running on multiple Android phones concurrently. We validated the approach by mining the PCB information collected from 2615 benign and 2502 malware-infested recent Android applications, spanning a wide range of application categories and malware families. In Chapter 4, we propose a novel approach for detecting Android malware using deep learning techniques on PCB information. Our detection model combines Recurrent Neural Networks (RNNs) with Convolutional Neural Networks (CNNs) and Deep Neural Networks

(DNNs) into one model to identify the malicious behavior of a process, given a sequence of 12 PCB records collected during its runtime. To the best of our knowledge, no available malware detection technique has achieved such minimal detection time. Using the PCB sequence information of 2615 benign and 2502 malware-infested applications, we show that our approach was able to identify malicious behavior at various points of the process execution time with a high F1-score.

Chapter 2

Literature Review

2.1 Malware Targeting Android Smartphones

Google's Android has dominated the global smartphone market for several years now [57]. As a consequence, it has become the second most targeted platform after Microsoft Windows [18]. Android platforms can be considered insecure for many reasons. The first reason is the widespread popularity of Android devices, which makes them more profitable to attack than other platforms. Another reason is that Android is an open-source platform [13], which means anyone can see the source code, modify it, and develop and distribute compatible apps for it. Moreover, although Google releases monthly security updates for the Android platform, it takes some time before the manufacturers of Android devices apply them [18]. There are a variety of malware attacks targeting Android, including backdoors, ransomware, spyware, worms, botnets, and trojans [108, 24, 15, 55, 14]. A backdoor [61] is a malicious program used by the attacker to gain unauthorized remote access to the infected device by exploiting specific software vulnerabilities. Examples of Android backdoors are Gingerbreak, DroidKungFu, and BaseBridge [32, 82, 108]. Spyware [17] is malware that appears to be legitimate but monitors the user's activities and steals sensitive information such as bank details, passwords, and much more. In 2009, Mobile Spy was announced as the first professional spyware for Android phones [24]. This monitoring app runs in the background without an icon visible to the user, tracking the user's SMS messages, GPS locations, photos, and more. In 2017, new spyware apps were detected that are used to analyze user habits, history, and locations for the customization of pop-up advertising [18].

Ransomware [79] is a category of malware that blocks users from accessing their data by locking the device until a ransom payment is made. In 2014, a significant ransomware attack known as ScarePackage infected nearly 900 thousand Android phones within only 30 days [15]. In 2017, the Lockscreen [59] ransomware represented nearly 8% of the overall Android malware [18]. This ransomware locks Android phones with a pop-over browser window that quickly reappears when closed. To unlock the device, the user has to make a payment in vouchers or cryptocurrency (e.g., Bitcoin) only, as such transactions cannot be traced back to the malware developer. Although the Lockscreen malware does not encrypt any data on the phone, removing it requires a factory reset that also removes all installed apps, files, and settings on the device. Worms and botnets [2, 33] are large-scale network attacks targeting Android phones. Worm [2] apps can replicate themselves and spread through shared website links, email attachments, and peer-to-peer file-sharing networks (i.e., networks of connected devices without a central server). Sending SMS messages is another way to spread worms, such as the Chinese worm that infected more than 500K Android devices in 2014 [55]. A botnet [33] is a network of compromised Android devices controlled by a remote server called the bot-master. Botnets can be used to perform Distributed Denial of Service (DDoS) attacks [29], such as the WireX botnet [71], which was hidden in 300 Google Play Store apps and infected more than 100k Android devices. The vast majority of the Android malware incidents in 2018 were trojan attacks [18], as shown in Figure 2.1a. A trojan is a harmful app that looks benign but, once activated, can carry out any number of attacks on the host, such as stealing confidential information, creating backdoors, and activating viruses or other malware. Figure 2.1b shows the top five Android malware in 2018 [18]. Shedun, Agent, SMSPay, and SMS are all trojan malware. SMSReg [65] is a riskware app that pretends to be a battery improvement app but collects other information unbeknownst to the user. Agent [14] is a malicious app that, once downloaded, runs silently in the background waiting for commands from a Command and Control (C&C) server. Typically, it is given the filename of a legitimate app, which makes it hard to identify. This trojan can use the infected device as part of a DDoS attack against some target. It can also steal the user's private information and send it to the remote server. Shedun [84] is a family of trojanized adware used to install secondary apps for advertisement purposes. This trojan poses as a legitimate app to trick the user into giving it root privileges. Such privileges allow it to install additional apps without the user's permission, further increasing the author's ad revenue. The Android SMS and SMSPay variants [74, 63, 64] are typically repacked or trojanized apps: legitimate apps that have been recompiled and distributed with additional malicious modules. SMSPay adds the functionality of silently sending premium-rate or spam SMS messages, which may add extra charges to the phone bill.

Figure 2.1: (a) Android malware distribution in 2018. (b) Top 10 Android malware in 2018.

2.2 Malware Detection Techniques

Methods proposed to detect Android malware [42, 93, 98, 31, 72, 109, 103, 70] can be categorized as either signature or non-signature based. A malware signature is created by extracting patterns from a sample (e.g., binary patterns and MD5/SHA1 hashes) [78]. Signature-based methods detect malware by examining the application files while looking for signatures of

known malware [42, 30, 105]. Such methods can only identify existing malware and fail to identify zero-day malware attacks. A zero-day exploit [104] uses a heretofore undiscovered software vulnerability (a security weakness) to perform various types of attacks before a fix is released. The extent of the effects of the attack depends on the nature of the vulnerability, but can be as significant as a complete takeover of the device, resulting in exfiltration of confidential information such as banking details, usernames, and passwords, or the installation of other malware. To overcome the problem of zero-day malware attacks against Android platforms, a number of non-signature based methods have been proposed [93, 98, 68, 31]. These methods can be categorized as either static or dynamic. Static techniques treat applications as artifacts that can be examined for inconsistencies, such as analyzing the accompanying manifest for unusual permissions [93, 45, 89] or analyzing the application's bytecode for instructions that, when executed, could produce an undesired result [5]. Because these techniques examine application files without actually running them, they are vulnerable to obfuscation techniques and fail to detect malicious code loaded at runtime [33]. Dynamic detection methods work by observing program execution carried out in a controlled environment [98], by monitoring system activities and hardware utilization [68], and by tracking information flow through the system [31]. Dynamic detection methods have some significant limitations, such as evasive malware [78], which detects that it is being run in an emulated environment and modifies its behavior accordingly, and limited code coverage (i.e., the ratio of the application code executed during the analysis) [34]. Dynamic analysis can only reveal malicious behavior once the corresponding code has executed. Apps that trigger malicious behavior only under certain conditions (e.g., a specific date or command) [40] can appear benign if the condition is not met. Other challenges facing dynamic detection systems are the additional computation overhead required to monitor execution and the possibility of delays in detecting issues [17]. A number of machine learning methods have been proposed to detect Android malware [94, 72, 109, 103, 101, 70, 96, 88]. Wu et al., Sheen et al., and Narudin et al. [94, 72, 56] have used standard classification algorithms with static-based malware detection methods [94,

72] and dynamic methods [56]. Wu et al. [94] used dataflow application program interfaces (APIs) to train the classifiers, while Sheen et al. [72] combined API call-based features with permission-based features. Narudin et al. [56], on the other hand, used anomaly-based detection utilizing dynamic network traffic. Although these models have high detection accuracy, their complexity and resource consumption are high as well [75, 99]. Yousefi-Azar et al. [102] hashed the Dalvik bytecode to extract static features and fed them to a neural network model to detect malware. Other research efforts used deep learning to detect malware [109, 103, 41]. The approach in [109] employs static dataflow analysis to detect Android malware. Dali Zhu et al. [41] statically extract the API method calls from the Android .dex file, map them using embedding models, and feed the output to a CNN network. Although the approach in [103] combines static and dynamic analysis, it is based on binary features only. Moreover, the majority of these methods were constructed based on outdated malware samples. Additionally, collected malware datasets may contain a considerable number of duplicate malware samples under unique file hashes, as a minimal modification to the Android Package (APK) file can produce a unique file hash. Including a large number of duplicate samples results in biased datasets and therefore overfitted models. A novel dynamic detection technique using the genetic footprint has been proposed [69, 70] to detect malware by mining the information in the process control block (PCB), a data structure maintained by the kernel to represent running processes. The hypothesis is that the state diagrams of malicious processes must differ from those of benign processes because of the difference in activities. Benign processes follow system access control policies that specify the conditions under which information can be accessed, whereas malicious processes violate those policies by stealing information, spying, or attacking the infected device by exploiting unintended flaws in the system. TstructDroid in [70] trains a decision-tree-based classifier on a dataset of 110 malware instances and 110 benign Android apps using a subset (genetic footprint) of 32 out of 99 PCB features. The method requires a process execution time of 3 s, in addition to the overhead of encoding the PCB features using Discrete Cosine Transforms while extracting their statistical features. Similar research [96, 88] presents a Spark-based malware detection framework, which

requires 15 seconds of process execution time prior to the detection process, resulting in a high detection delay and a large computation overhead. Our previous work [8] presented a detection framework that uses a similar PCB mining approach with an improved detection speed of 100 ms using a Back Propagation Neural Network (BPNN). These PCB-based methods trigger the PCB collection in user space, which means they can be delayed or interrupted more frequently, leading to a higher chance of missing PCB records. Our proposed mining approach overcomes these limitations by mining the PCB information at the kernel level, triggered by process context switching. In short, static malware detection can provide low detection delay but focuses on analyzing application code, which can be obfuscated. Dynamic detection methods can overcome this limitation at the cost of more time, more resource consumption, and limited code coverage. In this research, we present a dynamic malware detection method for Android platforms that is capable of detecting a variety of malware attacks at various points of process run-time with a low detection delay and high detection performance.
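To make the user-space limitation concrete, the sketch below polls a process's /proc entries on a 1 ms timer, the way the surveyed PCB-based methods sample the PCB, and compares the number of samples actually taken against the number of context switches the process performed in the same window; the gap between the two is the source of the missed records discussed above. This is an illustrative measurement of the polling strategy, not code from any of the cited systems.

```python
import time
from pathlib import Path

def total_ctxt_switches(pid):
    """Sum voluntary + involuntary context switches from /proc/<pid>/status."""
    total = 0
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if "ctxt_switches" in line:
            total += int(line.split()[1])
    return total

def polling_gap(pid, period=0.001, duration=1.0):
    """Poll the process state every `period` seconds (the user-space strategy)
    and return (samples taken, context switches performed) over `duration`."""
    before = total_ctxt_switches(pid)
    samples, deadline = 0, time.monotonic() + duration
    while time.monotonic() < deadline:
        Path(f"/proc/{pid}/stat").read_text()  # one user-space "PCB" sample
        samples += 1
        time.sleep(period)
    return samples, total_ctxt_switches(pid) - before
```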

2.3 Android PCB Features

The kernel stores information on running processes in a circular, doubly linked list called the task list [73], where each element is a process descriptor of type struct task_struct (the PCB in Linux-based operating systems). The task_struct contains all the information that the kernel needs about a specific process, such as the process's priority, state, address space, pending signals, and much more. In this research, a set of 120 numerical fields was extracted from the kernel PCB. The PCB contains many pieces of information associated with a specific process and its group. These parameters represent CPU scheduling, memory management, and signal information. Table 2.1 lists the sched_info cumulative counters of the process CPU scheduling. Table 2.2 lists the mm_struct parameters accessed through its pointer within the task_struct. These memory management pieces of information are used to trace memory usage, page faults, base and limit registers, memory resource demand, and their utilization.

Signal-related information, including the process and group resource utilization and time counters and the number of threads, can be found in the signal_struct. The extracted parameters are listed in Table 2.3. The task_struct includes different types of information describing the exiting state of the process, its parents, its children, and the dead threads in their group. Table 2.4 lists the numerical features of this struct. Some of these features are used in conjunction with other parameters within other structs to manage process execution.

Table 2.1: CPU scheduling sched_info Features within task_struct.

pcount, run_delay, last_arrival, last_queued

Table 2.2: Memory management mm_struct Features within task_struct.

tlb_flush_pending, hiwater_rss, stack_vm, start_stack, map_count, hiwater_vm, start_code, arg_start, vmacache_seqnum, total_vm, end_code, arg_end, mmap_base, locked_vm, start_data, env_start, mmap_legacy_base, pinned_vm, end_data, env_end, task_size, shared_vm, start_brk, saved_auxv[AT_VECTOR_SIZE], highest_vm_end, exec_vm, brk, flags

Table 2.3: Signal information signal_struct Features within task_struct.

utime, cnvcsw, cinblock, timer_id, stime, cnivcsw, coublock, leader, cutime, min_flt, maxrss, oom_score_adj, cstime, maj_flt, cmaxrss, oom_score_adj_min, gtime, cmin_flt, nr_threads, flags, cgtime, cmaj_flt, group_exit_code, is_child_subreaper, nvcsw, inblock, notify_count, has_child_subreaper, nivcsw, oublock, group_stop_count, sum_sched_runtime

Table 2.4: Process information within task_struct.

nvcsw, utime, init_load_pct, exit_signal, nivcsw, stime, vmacache_seqnum, pdeath_signal, wakee_flip_decay_ts, utimescaled, parent_exec_id, nr_dirtied_pause, jobctl, stimescaled, self_exec_id, pagefault_disabled, atomic_flags, gtime, on_cpu, nr_dirtied, stack_canary, acct_timexpd, wake_cpu, cpuset_slab_spread_rotor, min_flt, flags, on_rq, cpuset_mem_spread_rotor, maj_flt, ptrace, prio, start_time, last_switch_count, wakee_flips, static_prio, real_start_time, sas_ss_sp, rt_priority, normal_prio, acct_rss_mem1, ptrace_message, btrace_seq, nr_cpus_allowed, acct_vm_mem1, dirty_paused_when, policy, rcu_read_lock_nesting, timer_slack_ns, trace, personality, exit_state, default_timer_slack_ns, trace_recursion, tgid, exit_code, state
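Several of the counters listed in Tables 2.1 through 2.4 have user-space mirrors under /proc, which makes them easy to inspect even though the approach in this work reads them directly from task_struct in kernel space. The following is only a rough user-space illustration, not the dissertation's kernel-level collector; field positions follow the proc(5) layout.

```python
from pathlib import Path

def pcb_snapshot(pid):
    """Read a few task_struct-backed counters for one process from /proc."""
    proc = Path(f"/proc/{pid}")

    # /proc/<pid>/schedstat mirrors sched_info: cpu time, run_delay, pcount
    run_time, run_delay, pcount = map(int, (proc / "schedstat").read_text().split())

    # /proc/<pid>/stat: after the "(comm)" field, index 7 is min_flt and
    # index 9 is maj_flt (fields 10 and 12 in proc(5)'s 1-based numbering)
    stat = (proc / "stat").read_text()
    fields = stat[stat.rindex(")") + 2:].split()
    min_flt, maj_flt = int(fields[7]), int(fields[9])

    # nvcsw / nivcsw from /proc/<pid>/status
    nvcsw = nivcsw = 0
    for line in (proc / "status").read_text().splitlines():
        if line.startswith("voluntary_ctxt_switches"):
            nvcsw = int(line.split()[1])
        elif line.startswith("nonvoluntary_ctxt_switches"):
            nivcsw = int(line.split()[1])

    return {"pcount": pcount, "run_delay": run_delay,
            "min_flt": min_flt, "maj_flt": maj_flt,
            "nvcsw": nvcsw, "nivcsw": nivcsw}
```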

Chapter 3

Mining PCB Information & Building the Application Dataset for Android Malware Detection

3.1 Introduction

Previous research efforts have shown that mining the PCB information can be used to detect malicious behavior in Android applications [70, 96, 88, 8]. In this chapter, we propose a novel approach to mining PCB information at the kernel level, triggered by process context switches. The proposed approach guarantees synchronizing the PCB mining with process CPU utilization, thus minimizing duplicate and missing PCB records. We also introduce a closed dynamic malware analysis framework for application testing running on multiple Android phones concurrently. We validated the approach by mining PCB information collected from 2615 benign and 2502 malware-infested Android applications, spanning a wide range of application categories and malware families. We did not rely on available malware datasets [16, 11, 90, 41], as they were static-based, outdated, or unavailable by the time we started the research, or did not meet the sample selection criteria we determined would produce an efficient and reliable sample set. The rest of this chapter is organized as follows: Section 3.2 describes the criteria for building the Android (malware-infested and benign) application dataset and discusses its features. In Section 3.3, we describe the general system architecture, including the test devices, the input generation tool, and the testing environment setup. Section 3.4 describes in detail the stages of the PCB mining process and the components of the PCB data collector. Section 3.5 evaluates the collected PCB dataset, followed by a discussion in Section 3.6.

3.2 Application Data Collection

To train a machine learning model to detect malicious behavior, we collected a dataset of 2615 benign and 2502 malware-infested applications. The benign applications were collected from top-ranked Google Play Store apps, while the malware-infested applications were collected from the VirusTotal malware dataset. In this section, we explain the process and selection criteria for collecting both benign and malware samples.

3.2.1 Benign Application Collection

Benign samples were downloaded from the Google Play Store using an open-source tool called Raccoon [81], which enabled us to download the apks directly onto a PC. The apps were selected from the list of the most popular Google Play apps according to the open Android market data service AndroidRank (the oldest service to provide information and statistics on the Google Play store) [49]. The list contains more than 24.5k applications, from which 4020 apk apps were chosen based on the following criteria: (1) the application was not split into functionally duplicate apks, each configured to a specific device characteristic; (2) the application had at least a 4.0 out of 5.0 rating; (3) it had been updated since January 2019; (4) it had reached at least a million installs; (5) it was compatible with Android versions 4.0 and later; (6) it did not rely on complex real-time interactive input, such as that used in a game. Such apps were excluded due to a limitation of the automated input generation tool we used [85], which lacks support for non-deterministic user-interface events (such as random swipes or customized gestures). A sketch of this filter appears below.
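The following sketch shows how the six criteria translate into a mechanical filter over app metadata. The record fields (rating, min_installs, and so on) are hypothetical names standing in for whatever the AndroidRank export actually provides, not a real schema.

```python
def keep_benign_candidate(app):
    """Apply the six selection criteria to one app-metadata record.
    All field names are assumed; map them to the real metadata export."""
    return (
        not app["is_split_apk"]                  # (1) one universal apk
        and app["rating"] >= 4.0                 # (2) rated 4.0/5.0 or better
        and app["last_update"] >= "2019-01-01"   # (3) updated since Jan 2019
        and app["min_installs"] >= 1_000_000     # (4) at least 1M installs
        and app["min_sdk"] <= 14                 # (5) runs on Android 4.0+ (API 14)
        and not app["needs_realtime_input"]      # (6) no complex gesture input
    )

def select_benign(candidates):
    """Filter the parsed AndroidRank list (~24.5k records) down to keepers."""
    return [app for app in candidates if keep_benign_candidate(app)]
```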

3.2.2 Malware-infested Application Collection

Malware samples were collected and downloaded from the VirusTotal Private API [86] under an academic research access license. VirusTotal [37] is an online scanning service that analyzes files or URLs to detect viruses, trojans, and other types of malware. It aggregates the scanning results of more than 150 antivirus and malware scan engines [27], such as AhnLab Mobile Security [44], Avast Mobile Security Software [36], and Kaspersky Lab [3].

A total of 12060 Android malware samples were downloaded from the VirusTotal malware database. The collected samples were first seen within the time period of 2010 through 2019. Included with the malware samples were information on the malware category; MD5 and SHA256 hash values; the first time the malware was seen and scanned; the permissions requested by the malware-infested app; and the anti-malware engines that scanned the sample and their scanning results. For the Android dataset, VirusTotal aggregates the results of 75 different malware detection engines. Each sample in the database has been scanned by an average of 61 engines. Because these engines vary in performance and reliability, which impacts the reliability of the collected dataset, we ranked them based on the number of samples each recognized as malware. Our top list of malware detection engines consists of 11 sources that scanned and successfully detected more than 80% of the collected malware samples. Six of these engines were listed as the best Android anti-malware engines according to the AV-Test mobile security products evaluation of 2019 [80]. To ensure the reliability of the dataset, we collected the malware samples based on the following criteria: (1) more than 40% of the malware detection engines that scanned the sample classified it as malware, and (2) at least 20% of those detections came from the top-ranked engines. The sketch below expresses these two rules in code.
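Expressed as code, the two acceptance rules look as follows. The shapes of scan_results (engine name mapped to a boolean verdict) and top_engines (the set of 11 top-ranked engine names) are assumptions about how a VirusTotal report might be parsed, not the API's actual schema.

```python
def accept_malware_sample(scan_results, top_engines):
    """Rule (1): more than 40% of the engines that scanned the sample
    flagged it as malware.
    Rule (2): at least 20% of those detections came from the top engines."""
    if not scan_results:
        return False
    detections = {engine for engine, flagged in scan_results.items() if flagged}
    if len(detections) / len(scan_results) <= 0.40:                  # rule (1)
        return False
    return len(detections & top_engines) / len(detections) >= 0.20  # rule (2)
```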

3.2.3 The Android Application Dataset

4020 benign applications spanning 38 categories were selected from the list of the 24.5k most popular Android apps, as shown in Figure 3.1a. The dataset has an average rating of 4.4/5. The application distribution based on ranking and minimum number of installs is illustrated in Figures 3.1b and 3.1c. 6724 malware samples were selected from the 12060 samples collected. Upon further examination of the selected malware samples, we realized that a considerable number of these samples were different versions of the same set of apps. We included only one version per app to prevent creating bias toward a certain application behavior. Moreover, the selected samples included damaged apk files, as well as non-apk format apps (Dalvik format), which we also excluded from the dataset. The final malware dataset consists of 3562 unique malware-infested apps. Figure 3.2 illustrates that the majority of the collected dataset belongs to the set of newly developed malware.

Figure 3.1: Summary statistics of the benign dataset: applications are distributed by (a) category, (b) application rating, and (c) minimum number of installs.

Figure 3.2: The distribution of the malware samples based on the year the malware sample was first seen.
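The one-version-per-app rule above cannot rely on the APK file hash, since any repacked build hashes differently; hashing the archive's decompressed entries instead makes trivially rebuilt or re-signed copies collide under one key. The sketch below is our own illustration of that idea, not a procedure taken from VirusTotal.

```python
import hashlib
import zipfile

def apk_content_hash(path):
    """Hash the APK's entries (names + bytes) rather than the file itself,
    so re-zipped or re-signed copies of the same app map to one key."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(path) as apk:
        for name in sorted(apk.namelist()):
            if name.startswith("META-INF/"):   # skip signature files
                continue
            digest.update(name.encode())
            digest.update(apk.read(name))
    return digest.hexdigest()

def one_version_per_app(paths):
    """Keep the first sample seen for each content hash."""
    kept, seen = [], set()
    for path in paths:
        key = apk_content_hash(path)
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept
```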

Selected malware samples are categorized as Trojan, Riskware, Spyware, Backdoor, Adware, Ransomware, and Exploit [18, 62, 17, 61, 7, 79] and were distributed among 195 malware families. Figure 3.3 presents the distribution of the malware families within the dataset, where the trojanized Agent, the adware Shedun, and the spyware SMSSpy represent more than 50% of the collected samples [14, 84, 83]. According to the AV-Test security report, these malware families represented more than 42% of Android malware in 2018 alone.

3.3 Data Collection Architecture

3.3.1 Android Test Devices: Phone, Android, and Kernel Versions

All tests were performed using OnePlus 5t phones running Android version 8.1. A custom modified kernel (4.4.78) was used to enable module insertion and to root the phone. The kernel was also modified to store PCB information triggered by context switching. When we modified the official Android kernel for the OnePlus 5t, the WiFi stopped functioning. After further investigation, we found that the WiFi driver was not part of the kernel tree. Although we successfully built the WiFi module with the available kernel configuration, we failed to load the module into the kernel. We therefore used LineageOS [48], an open-source operating system based on the Android platform. LineageOS officially supports the OnePlus 5t, and its WiFi module is part of the kernel tree.

Figure 3.3: Distribution of malware families within the malware dataset.

3.3.2 Automated Input Generation for Android Application Testing

The goals of mobile application testing include detecting bugs, vulnerabilities, and security problems. Manual testing requires significant human resources; therefore, a considerable amount of effort has been, and still is being, devoted to automating input generation for application testing. Test inputs for mobile applications can be represented by possible interactions with their graphical user interface (GUI), such as clicks, touches, or gestures. An input generation tool produces a sequence of such actions to mimic human interaction. The efficiency of these tools can be measured by code coverage (i.e., the ratio of application code executed during the analysis). Available techniques can be categorized based on their exploration strategy into random, model-based, and systematic. monkeyrunner [85] is the most common tool for testing Android devices by generating a sequence of random interactions. Additionally, several research studies have been proposed to optimize the random approach [51, 21, 107]. Model-based techniques extract a GUI interaction model for each application [9, 106, 25] or a more general model to exercise different Android applications [39, 23, 47, 22, 46]. Systematic testing uses various algorithms, such as symbolic execution, data flow analysis, and evolutionary algorithms, to trigger application behaviors that can only be revealed with specific test inputs [10, 19, 52, 54, 53]. A number of empirical studies have been conducted to compare input generation tools. The authors in [26] used 68 benchmark apps to evaluate the main existing tools by the amount of code coverage achieved within the same time period. Their results show that monkeyrunner [85] achieved the best code coverage on average compared to the other tools [51, 10, 9, 39, 19]. It is also easy to use and has wide platform compatibility. Another study [87] compared the tools in [85, 107, 54, 47, 19, 77] using real-world industrial apps. The authors collected 68 apps of different categories from the top-recommended Google Play Store apps (at that time). Based on their study, monkeyrunner accomplished the highest number of covered activities and methods on average. Other main tools were excluded from consideration due to unavailability, limited testing space, version incompatibility, or failure to function on real Android devices or industrial apps [25, 100, 52, 23]. More recent efforts have used machine learning techniques to simulate human behavior while interacting with Android applications (Humanoid) [46]. The authors used 68 open-source apps and 200 popular Google Play Store apps to evaluate a number of state-of-the-art test generation tools [85, 39, 77, 21, 47, 54]. Their results confirm that monkeyrunner outperforms the other test generation tools except for Humanoid. However, the interactive speed of monkeyrunner is 10 times faster than that of Humanoid, and in this research we try to expose as much application behavior as possible in the shortest time frame. Hence, we chose monkeyrunner [85] as the input generation tool to automate our application testing: it is easy to use, is compatible with all Android versions, outperformed most of the available generation tools, and has the highest interactive speed. A minimal driver script is sketched below.
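For reference, monkeyrunner scripts are ordinary Python executed by the Jython interpreter bundled with the Android SDK. The sketch below connects to a device, launches an app, and replays pseudo-random touches; the package name, activity name, and screen size are placeholder assumptions, and the real test runs drove roughly 15k interactions per app.

```python
# Run with: monkeyrunner exercise_app.py com.example.target
# (com.example.target and .MainActivity are placeholder names.)
import random
import sys

from com.android.monkeyrunner import MonkeyRunner, MonkeyDevice

package = sys.argv[1]
device = MonkeyRunner.waitForConnection()

# Launch the app's main activity, then give it a moment to draw.
device.startActivity(component=package + "/.MainActivity")
MonkeyRunner.sleep(2)

# Replay pseudo-random taps; assumes a 1080x1920 screen.
for _ in range(500):
    x = random.randint(0, 1079)
    y = random.randint(0, 1919)
    device.touch(x, y, MonkeyDevice.DOWN_AND_UP)
```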

3.3.3 Network Configuration for Application Testing

Orchestrating the sample apps to obtain PCB information required that we also execute them in an environment that appeared as real as possible. The vast majority of the collected applications (both malware-infested and benign) required an internet connection to function properly. Hence, providing an internet connection improves the efficiency and correctness of the dynamic analysis. However, providing a real internet connection would require previous knowledge of the malware behavior to avoid participating in online attacks, such as DDoS attacks [12]. The malware could also reproduce and spread to other devices on the same network, or even through SMS [92]. Monitoring and controlling network traffic for a large number of applications requires a significant amount of time and effort. Therefore, we used the internet services simulation tool INetSim [91], as in [97, 60, 20], to trick the target apps into believing they are connected online. The INetSim tool simulates several internet services, such as HTTP/HTTPS, DNS, FTP, and IRC. It fabricates responses to the network requests made by the connected devices, spurring applications to expose more functionality. To simulate both WiFi and an internet connection, we set up a closed network, as shown in Figure 3.4. The network consisted of an ASUS RT-AC66U router running DD-WRT, the test Android devices, and a Raspberry Pi 3 running the internet simulation tool INetSim. The router was used as an internet gateway and had the AP-isolation setting enabled [6]. AP-isolation prevents devices connected through the wireless network from communicating directly with each other, preventing malware from spreading through the network. The Android devices were connected to the network through the router. The router was connected to the Raspberry Pi through an Ethernet cable and was configured to use it as a DNS server.
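A quick way to confirm that such a closed network behaves as intended is to check, from a machine on the test LAN, that an arbitrary hostname resolves to the Raspberry Pi and that INetSim's fake HTTP service answers. This is a sanity-check sketch; the Pi's address below is an assumed value for this setup.

```python
import socket

INETSIM_IP = "192.168.1.2"   # assumed LAN address of the Raspberry Pi

def closed_network_ok(host="example.com", port=80):
    """True if DNS points at the simulator and its HTTP service accepts."""
    if socket.gethostbyname(host) != INETSIM_IP:
        return False         # a real resolver answered: the network leaks
    with socket.create_connection((INETSIM_IP, port), timeout=5):
        return True
```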

3.4 Mining PCB Information

The first stage of the PCB data collection process was task management and preparation, where the main management unit assigned the tasks to the connected devices and managed them concurrently. The second stage was application testing and PCB data collection, in which the target application was exercised using monkeyrunner and its PCB information was collected over the application interaction time. The last stage was storing, analysis, and feedback, where the resulting PCB sequence was stored and analyzed, and the termination state of the process was reported back to the main manager.

Figure 3.4: The network setup for malware dynamic analysis.

3.4.1 Tasks Management & Preparation

The main manager, running on a Linux machine, was responsible for assigning tasks to the connected Android devices. It used a separate thread (task manager) to concurrently manage all connected devices, as shown in Figure 3.5. The task manager was responsible for ensuring the device was on and ready for use prior to starting a task. It monitored the battery level of the testing device, assigning new tasks if the battery level was above 40%; otherwise, it postponed the new task until the battery level was above 90%. The task manager also checked whether the user had issued a request to terminate or suspend the execution prior to starting a new task or while the phone was charging. Figure 3.6 illustrates the sequence diagram of managing the tasks performed on each device. The task manager starts by extracting the package name (which uniquely identifies the app on the device) from the target apk file. It then installs the target apk and grants the permissions requested by the installed application. Upon installation completion, the manager inserts the data collection modules into the device to prepare for the application testing and PCB mining process. Then, it passes the package name to the application testing unit (ATU) to start the application testing and PCB mining process. The task manager keeps track of the termination state of each task stage, reports task success or failure, and finally assigns a new task to the device. A sketch of the battery-gated preparation step follows the figure below.

Figure 3.5: Main manager running tasks concurrently on all connected devices.
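On the host side, the task manager's checks map onto plain adb invocations. The sketch below shows the battery gate and the install step; the 40% threshold mirrors the policy above, and adb install -g (grant all manifest permissions, available since Android 6.0) stands in for the permission-granting step. It is an illustration of the described behavior, not the actual manager code.

```python
import subprocess

def adb(serial, *args):
    """Run one adb command against a specific device and return its stdout."""
    result = subprocess.run(["adb", "-s", serial, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

def battery_level(serial):
    # "dumpsys battery" prints a "level: NN" line on stock Android
    for line in adb(serial, "shell", "dumpsys", "battery").splitlines():
        if "level:" in line:
            return int(line.split(":")[1])
    raise RuntimeError("battery level not reported")

def start_task(serial, apk_path):
    """Install the target apk if the device has enough charge."""
    if battery_level(serial) < 40:
        return False   # caller postpones until the device charges to 90%
    adb(serial, "install", "-g", apk_path)
    return True
```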

3.4.2 Application Testing & PCB Data Collection

Application Testing & PCB Data Collection Components

The operation of application testing and PCB data collection was performed on the Android device. The system components responsible for the operation were the application testing unit (ATU), the PCB buffer reader, the data collector, and the modified CPU scheduler. The first two components operated in user space, whereas the latter two operated in kernel space. The ATU was responsible for initiating the PCB buffer reader, sending the target package name (pname), and starting the application exerciser (monkeyrunner). After monkeyrunner finished execution, the ATU sent a null package name as a signal to the data collector to terminate the data collection process. It also terminated the PCB buffer reader. The PCB buffer reader was responsible for transferring the collected PCB records from kernel space to user space.

Figure 3.6: Sequence diagram of task management and preparation.

The data collector component was implemented as a kernel module [4] that was loaded at run-time to extend the functionality of the kernel. It created communication channels between the kernel and user space to exchange the information required for the data collection process. The data collector had another major role: guiding the modified CPU scheduler to collect the PCB records of only those processes running the target application. The CPU scheduler [50] is a major part of any complex operating system. It is responsible for optimizing the utilization of the CPU by managing when and for how long each thread can occupy a CPU core. When a new task is selected, the scheduler performs the context switching operation, that is, switching from one task to another. In order to collect PCB information for a running application, we modified the scheduler (including the context switch function) to identify any process running the target application. At each context switch, the scheduler compared the previously running thread with the target set of threads, and if a match was found, it appended the thread's PCB information to a predefined buffer. Ideally, the PCB sequence that represented the exact behavior of the application at run-time would capture PCB information after each instruction was executed. However, it was infeasible to rapidly transfer a large amount of data from the kernel to user space without highly degrading system performance. To overcome this problem, we considered the PCB record of a process being switched out to estimate the process behavior during its CPU-time slice, and the PCB sequence to represent the overall behavior of the process.

Application Testing & PCB Mining Workflow

The application testing and PCB data collection process was triggered by the main management unit sending the target package name to the ATU through an Android Debug Bridge (ADB) connection. Figure 3.7 illustrates the sequence of events between the on-device system components to perform the operation of testing the target application and collecting its PCB sequence.

Figure 3.7: The sequence diagram of the application testing and PCB data collection workflow.

The first step of application testing and PCB mining was activating the PCB buffer reader, which attempted to read the PCB buffer every 1 second until its termination. Secondly, the ATU passed the target package name to the data collector (using the write /proc file), which forwarded it to the CPU scheduler. The ATU then activated monkeyrunner, which launched the target app and simulated user interactions for n seconds. When the first thread to run the application was created, the scheduler identified it by comparing its package name to the target pname. It then used its pid (thread group id, tgid) to identify any child threads. Whenever the scheduler switched out one of the app threads, it appended the thread's PCB record to the PCB buffer before activating another task. Meanwhile, the data collector reset the PCB counter after the buffer reader read its content (using the read /proc file). Appending a PCB record and resetting the PCB counter were both non-preemptive operations (i.e., they could not be interrupted by other threads), which ensured synchronization between the app threads and prevented overwriting the PCB buffer. After monkeyrunner finished execution, the ATU sent an empty package name to the data collector to terminate the data collection process. The latter signaled the scheduler to destruct its pname and switch off its PCB data collection functionality. Finally, the ATU terminated the PCB buffer reader, the target application, and monkeyrunner. By the end of this process, a sequence of PCB records was stored as a text file on the sdcard of the device. The user-space side of this workflow is sketched below.
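The user-space half of this workflow reduces to two /proc files: one written once with the target pname, one drained once per second. A minimal sketch follows; the /proc entry names are placeholders, since the real names are defined by the data collector kernel module.

```python
import time

PNAME_PROC = "/proc/pcb_collector/pname"    # placeholder /proc entry names
BUFFER_PROC = "/proc/pcb_collector/buffer"

def collect_pcbs(package_name, out_path, stop, poll=1.0):
    """Announce the target package, then drain the kernel PCB buffer every
    second until `stop()` is true; an empty pname ends the collection."""
    with open(PNAME_PROC, "w") as f:
        f.write(package_name)               # forwarded to the scheduler
    with open(out_path, "ab") as out:
        while not stop():
            with open(BUFFER_PROC, "rb") as buf:
                records = buf.read()        # the module resets its counter here
            if records:
                out.write(records)
            time.sleep(poll)
    with open(PNAME_PROC, "w") as f:
        f.write("")                         # signals the collector to stop
```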

3.4.3 Storing & Analyzing Results

Upon successful completion of the application testing and data collection process, the manager pulled the resulting PCB file from the device, stored it, and ran a preliminary analysis on it. The analysis provided information about the target application's threads and the PCB sequence collected. For each thread, it reported the number of voluntary and involuntary context switches performed. Voluntary context switches occurred when the running thread blocked while waiting for unavailable resources, whereas involuntary ones occurred when the thread used up its CPU-time slice or a higher-priority thread requested the CPU. The analysis also reported the occurrence of missed PCB records (if any) and their ratio to the actual number of context switches performed.

Figure 3.8: Sequence diagram of post application testing and data collection process.

We found that PCB misses could occur during the execution of the compare-package-name function and before the scheduler identified the target tgid. The task manager kept track of the termination state of each task in case of success, as well as its cause in case of failure. If the task failed due to a monkeyrunner crash or an error in the analysis process, the manager re-ran the same task up to a maximum of r times; otherwise, it continued to the next task. Regardless of the success or failure of the task, the manager reset the device back to a clean state (repeating the reset if needed) before starting (or re-starting) a new task. The miss ratio reported by the analysis is the simple computation sketched below.
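The reported miss ratio is a direct arithmetic check: every context switch should have produced one PCB record, so any shortfall counts as a miss. A minimal version of the computation:

```python
def pcb_miss_ratio(records_captured, voluntary, involuntary):
    """Missed records over all context switches performed by the thread(s);
    one PCB record is expected per context switch."""
    total = voluntary + involuntary
    if total == 0:
        return 0.0
    return (total - records_captured) / total
```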

3.5 The PCB Dataset

We selected 3562 malware-infested and 4020 benign apps to perform the analysis and PCB mining. We successfully ran the analysis and stored the results for 2615 benign and 2502 malware-infested apps. Compared to benign apps, a large number of malware-infested apps either failed to install or failed to complete the analysis correctly. The PCB dataset comprised N PCB sequences collected for each application, where N is the number of threads used by the application during the analysis. Figure 3.9 shows the distribution of benign and malware-infested applications based on the number of threads. The figure shows that only a minority of applications used more than 200 threads during the application runtime. More than 63% of benign apps used 50-150 threads, while 46% of malware-infested apps tended to use fewer than 50 threads.

Figure 3.9: The distribution of app tests based on the number of threads running each app

Each application was executed with 15k interactions for an average time of 25 seconds. An average of 18430 and 19068 context switches occurred during the analysis for benign and malware-infested apps, respectively. Figure 3.10a illustrates how application tests were distributed based on the ratio of PCB misses for each application. The PCB miss ratio is the ratio of the number of PCB records that were not captured to the total number of context switches that occurred during the application test for all running threads. The figure shows that our approach successfully captured more than 99.5% of context switches for 70% of the tested applications, and more than 99% of context switches for 90% of the tested applications. Figure 3.10b shows the distribution of app tests based on the PCB miss ratio for the main thread running the application (the first thread). We can see that the PCB miss rate for the main thread is slightly higher than its value for all threads combined. As discussed earlier (in Subsection 3.4.3), misses are likely to happen before the scheduler identifies the tgid of the main thread.

Figure 3.10: Distribution of app tests for malware-infested and benign apps based on (a) the PCB miss ratio for all threads during the application test and (b) the PCB miss ratio of the main thread during the application test.

Figure 3.11 shows the distribution of malware-infested and benign apps based on the total number of context switches that occurred during the application test. Figure 3.11a shows that malware-infested applications were distributed over a wider range of total context switches compared to benign applications, which tend to have a similar total number of context switches overall. Figure 3.11b shows that the main thread in benign applications was more active (performed more context switches) than in malware-infested applications.

3.6 Discussion

Several research efforts have been directed toward mining PCB information to detect Android malware [70, 96, 88, 8]. In [70], the authors used customized system calls to extract PCB information from the target process at a rate of 1 ms. The target process is determined by matching the process name of the target application against the list of currently running processes. The research in [96, 88] proposed a similar approach of using kernel modules to transfer PCB information from kernel space to user space. They run the application and find the target process by matching its pid against the list of currently running processes. Mining PCB information in [70, 96, 88] is requested from user space every 1 ms, after finding the target process. However, user-space processes have a lower priority than kernel processes, which means they can be delayed or interrupted more frequently, leading to a higher chance of missing PCB records. Additionally, collecting PCB information on a fixed timer can lead to reading duplicate records. Moreover, their approach tracks the behavior of a single thread (the main process) only, whereas Android applications can use hundreds of threads during application runtime. Our proposed approach overcomes these limitations by mining PCB information at the kernel level, triggered by process context switching. The high priority of the CPU scheduler guarantees that the PCB data mining is synchronized with the process CPU time. It also allows us to collect the PCB information of all threads running the application. Table 3.2 compares our mining approach to the currently available methods [70, 96, 88]. We obtained access to the dataset collected by [96] and used it to statistically compare the performance of their mining approach (also followed in [88]) with our proposed mining approach, as shown in Table 3.3. Our approach successfully collected the PCB information for an average of 113 threads per application, while their approach focused on the main thread only.

Figure 3.11: Distribution of app tests for malware-infested and benign apps based on (a) the number of context switches for all threads combined and (b) the number of context switches of the main thread that occurred during the application test.

Table 3.1: Comparing statistics of our proposed approach with a previous method.

Table 3.2: Comparing the proposed mining approach with the available methods [70, 88, 96].

| | Shahzad et al. [70] | Xinning et al. [88, 96] | Our Approach |
| Android Version | Android 2.3.6 | - | Android 8.1 |
| Kernel Version | 2.6.35.7 | 3.4.0 | 4.4.78 |
| Data Collection Scope | Main process | Main process | Main process & all children |
| Requires Kernel Modification | Yes | No | Yes |
| Data Collection is Requested by | Userspace reading loop | Userspace reading loop | Kernel CPU-scheduler context switching |
| Data Collection is Triggered by | Target process name found | Launching the target & matching its process name | Target pid/tgid found |
| Data Collection & Transfer Rate | Every 1 ms | Every 0.75 ms | Every context switch (variable transfer rate) |
| Data Collection & Transfer Size | Single PCB record | Single PCB record | PCB patches |
| User & Kernel-space Communication Channel | System calls | Kernel module | Kernel module |

Table 3.3: Comparing statistics of the datasets collected using our proposed mining approach and another available method.

| | Xinning et al. [96] | Our Dataset |
| Number of Samples | 2550 | 5117 |
| AVG Number of Threads | 1 | 113 |
| AVG Performed Context Switches (Main) | 5647 | 5578 |
| AVG Performed Context Switches (All) | 5647 | 18431 |
| AVG Miss Ratio | 53% | 0.57% |
| Standard Deviation of Miss Ratio | 7.8 | 0.01 |
| AVG Duplication Ratio | 82% | 0% |
| Standard Deviation of Duplication Ratio | 8.1 | 0 |

The average number of context switches performed by the main thread in both datasets is similar. In our PCB dataset, the main thread's context switches on average represented only 40% of the total context switches performed by an application at runtime. The table shows the substantial improvement achieved in capturing unique PCB records with minimal miss and duplication rates.

3.7 Conclusion

Building a reliable and efficient machine learning model depends highly on the quality of its dataset. Therefore, we put great effort into building an Android application dataset that is recent, realistic, diverse, and non-biased. We successfully tested and collected the PCB information of 2615 unique benign and 2502 unique malware-infested apps. The benign apps were collected from the top-ranked Google Play Store apps, spanning 38 categories. Selected apps were updated since January 2019, have a minimum of 1M installs, and have an average ranking of 4.4/5. Malware-infested apps were collected from the VirusTotal Private API [37] and classified by a total of 75 Android malware detection engines. The selected samples are categorized as Trojan, Riskware, Spyware, Backdoor, Adware, Ransomware, and Exploit, and were distributed among 195 malware families, the majority of which were discovered recently (in 2018 and later).

We introduced a closed dynamic malware analysis framework for testing different applications concurrently on multiple physical Android phones. The framework employs monkeyrunner to exercise applications. It also provides a simulated internet connection over WiFi to ensure the security of local and online resources. We used the framework to test the selected apps and collect their PCB information.

We proposed a novel approach to mine PCB information over process execution time, triggered by process context switches. Our results show that the proposed approach was capable of precisely tracking an application's PCB information at runtime and can potentially represent its behavior efficiently. For the collected samples, we were able to track an average of 112 threads per application performing an average of 18430 context switches, with an average miss ratio of 0.57%. The proposed approach mined PCB information at the kernel level, synchronized with CPU utilization, to minimize data collection overhead as well as duplicate or missed PCB records. Lastly, it is worth mentioning that synchronizing PCB mining with CPU utilization requires modifying the Android kernel.

Chapter 4

Android Malware Detection using Deep Learning on PCB Information

4.1 Introduction

In this chapter, we propose a novel approach for detecting Android malware using deep learning techniques on PCB information. The proposed detection model uses Recurrent Neural Networks (RNNs) to identify the malicious behavior of a process, given a sequence of n PCB records collected during its runtime. The model was trained and tested using the PCB dataset of 2502 malware and 2615 benign samples collected in Chapter 3. Our results show that the proposed detection model was able to correctly identify a large percentage of malware and benign samples at various points of their execution time using only 12 PCBs, with an average F1-score of 95.8%.

We used an RNN architecture for modeling sequence data similar to the one proposed in [95]. The architecture combines RNNs with Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) into one model. CNN, Long Short-Term Memory (LSTM), and DNN models have been widely used in combination to solve machine learning problems with complex data, like sequences, time series, videos, and 2D maps [66, 58, 67]. There are a number of advantages to combining these models. CNNs have the ability to learn and extract important features without the need for hand-crafted feature selection. RNNs are capable of learning the temporal correlations of such complex data. Finally, DNNs abstract data at each layer, allowing the model to learn complex data hierarchically and produce accurate predictions.

Identifying the malicious behavior of applications at execution time can be challenging. The variable starting point and nature of malicious behavior among different malware, as well as the large number of features to consider, increase the complexity of the problem. To deal with this complexity, we introduce a combination of CNNs, LSTM, and DNNs into a classifier model to detect Android malware using PCB information collected over the process execution time.

4.2 Using Deep Learning & PCB information to Identify Android Malware Attacks

4.2.1 Data Preprocessing

Data preprocessing prepares the dataset for training the classifier. We collected the PCB information of 2502 malware and 2615 benign samples. Each sample activated between 10 and 250 threads, resulting in a large number of PCB sequences over the entire sample set. For training the classifier, only the main-thread PCB sequence of each sample was considered. The number of collected PCB records per thread was variable and depends on the lifetime of the thread.

The dataset used to train the classifier was formatted such that each sample has a sequence of n records, and each record consists of m features. Features of zero variance (i.e., observations that carry no value for the learning process) were eliminated, reducing the number of features from X to Y. The final step was data normalization, where all features were scaled from their original range to values between 0 and 1. For each feature x, we determined its minimum and maximum values and normalized each value as follows.

x′ = (x − min(x)) / (max(x) − min(x))        (4.1)
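A minimal sketch of these preprocessing steps (zero-variance elimination followed by the min-max scaling of Equation 4.1) might look as follows; the array shapes and the NumPy formulation are assumptions for illustration, not the code used in this work.

    # Sketch: drop zero-variance features, then scale each feature into [0, 1].
    # X is assumed to hold the dataset as a NumPy array of shape (samples, n, m).
    import numpy as np

    def preprocess(X):
        flat = X.reshape(-1, X.shape[-1])     # pool every PCB record, one row each
        X = X[..., flat.var(axis=0) > 0]      # zero-variance features carry no signal
        lo = X.min(axis=(0, 1))               # per-feature min(x)
        hi = X.max(axis=(0, 1))               # per-feature max(x); hi > lo after the drop
        return (X - lo) / (hi - lo)           # Equation (4.1)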

4.2.2 Model Architecture

Figure 4.1 depicts the architecture of the proposed Android malware detection model. As illustrated, the main sequence of n PCBs is divided into s consecutive blocks [X1, X2, ..., Xs] of length n/s and m PCB features. Each of these blocks is fed into one-dimensional convolutional layers for feature extraction, providing a better representation of its features. Specifically, we use two CNN layers with a kernel size of two sliding along the temporal dimension of the data, with padding.

Figure 4.1: The network architecture for detecting Android malware.

The activation unit used in the CNN layers is the ReLU activation [38]. For each block, the CNN layers produce 64 and 32 feature maps, respectively. It is common to follow two CNN layers with a dropout and a pooling layer. The dropout layer is used to prevent overfitting by randomly eliminating output links from one layer to the next during training [76]; a dropout rate of 50% is used, meaning half of the CNN output is ignored. A one-dimensional max-pooling layer is used to downsample the sequence while maintaining structural information, to prevent the model from learning the exact position of feature values within the sequence. This is important because the starting point of the malicious behavior differs from one sample to another. A window size of two is used (with a matching stride), halving the size of the CNN feature maps. These feature maps are then flattened into a single feature vector for each block.

The LSTM layer processes the sequence of s blocks and produces a single output of 100 cells (many to one). The ReLU activation is also used in the LSTM layer. Another dropout layer is applied after the LSTM layer to reduce overfitting to the training data. Finally, a fully connected layer is used to interpret the features extracted by the LSTM, followed by the output layer that makes the prediction.
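The architecture just described can be sketched in Keras roughly as follows. The values of n, s, and m, the post-LSTM dropout rate, and the width of the dense layer are placeholders (the text above fixes the convolution, pooling, per-block dropout, and LSTM settings, but not every layer size), so this is an illustration of the structure rather than the exact model trained here.

    # Sketch of the CNN-LSTM-DNN classifier: per-block Conv1D feature extraction,
    # a many-to-one LSTM over the block vectors, and dense layers for the decision.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    n, s, m = 12, 4, 120                      # sequence length, blocks, features (assumed)
    model = models.Sequential([
        layers.Input(shape=(s, n // s, m)),   # s blocks of n/s PCB records each
        layers.TimeDistributed(layers.Conv1D(64, kernel_size=2, padding="same",
                                             activation="relu")),
        layers.TimeDistributed(layers.Conv1D(32, kernel_size=2, padding="same",
                                             activation="relu")),
        layers.TimeDistributed(layers.Dropout(0.5)),     # 50% of CNN outputs ignored
        layers.TimeDistributed(layers.MaxPooling1D(2)),  # halve each block's length
        layers.TimeDistributed(layers.Flatten()),        # one feature vector per block
        layers.LSTM(100, activation="relu"),             # many-to-one over s blocks
        layers.Dropout(0.5),                             # post-LSTM rate (assumed)
        layers.Dense(32, activation="relu"),             # interpretation layer (assumed width)
        layers.Dense(1, activation="sigmoid"),           # malware vs. benign
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])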

4.3 Experiments & Results

This section discusses the experiments conducted to evaluate the proposed solution and produce the best possible detection model. First, we experiment with different dataset designs and compare the performance of the resulting models. Then we study the PCB features, their distributions among malware and benign processes, and their importance.

4.3.1 Dataset Design Impact on Classifier Performance

To maximize detection performance, we followed different techniques to construct the dataset and compared the corresponding performance experimentally. We used the PCB dataset collected in Chapter 3 to create three different versions, as follows. In the first dataset, the PCB sequences were truncated at n PCBs (the zero-starting point dataset). The second dataset was built by selecting a sub-sequence of n PCBs at a random point of the process execution (the random-starting point dataset). The third dataset was constructed by slicing each of the PCB sequences collected during the analysis into sub-sequences of n PCBs and treating them as separate samples (the expanded dataset); a sketch of all three constructions follows below.

When constructing the datasets, zero padding was used to extend sub-sequences of fewer than n PCB records. For simplicity, the PCB sub-sequences were labeled with their original sample label. The datasets were shuffled, then randomly split into 60% training and 40% test sets, and finally normalized before use. The third dataset was shuffled and split into training and test sets before expanding it, so that the model is evaluated on data samples it has not seen before. The training set was then shuffled again before training.

As noted previously, each data sample (a sequence of n PCBs) was divided into s blocks for feature extraction using the CNN layers before being fed into the LSTM layer. The sequence length n and the number of blocks s were determined experimentally. Five classifier models were trained for each n, such that n ∈ {12, 100, 200, 300, 400, 500, 1000} PCBs for the smaller datasets, while n ∈ {12, 100, 300} PCBs for the larger dataset. The number of blocks is s = 4 for n < 500 PCBs and s = 10 otherwise. The F1-score on the test set (i.e., the harmonic mean of precision and recall) was used to evaluate all models.
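The three constructions can be sketched as follows, assuming each collected sample is a (length × m) NumPy array of PCB records; function and variable names are illustrative, not the dissertation's code.

    # Sketch of the three dataset constructions: zero-starting point,
    # random-starting point, and expanded (non-overlapping n-PCB slices).
    import numpy as np

    def pad_to(seq, n):
        out = np.zeros((n, seq.shape[1]))            # zero padding for short sequences
        out[:min(n, len(seq))] = seq[:n]
        return out

    def zero_start(seq, n):
        return pad_to(seq, n)                        # first n PCBs of the execution

    def random_start(seq, n, rng=np.random.default_rng()):
        i = rng.integers(0, max(1, len(seq) - n + 1))
        return pad_to(seq[i:], n)                    # n PCBs from a random point

    def expanded(seq, n):
        return [pad_to(seq[i:i + n], n)              # every slice becomes a sample,
                for i in range(0, len(seq), n)]      # labeled with the sample's label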

Figure 4.2 shows the F1-score average and variation of the five classifier models trained using the zero-starting point and random-starting point datasets. The x- and y-axes correspond to PCB sequence size and F1-score, respectively. The bar color represents the starting point of the PCB sub-sequence.

Figure 4.2: The F1-scores and their variations for the classifier models trained with the following PCB sequence settings: zero or random starting point and a sequence size of n PCBs.

Figure 4.2 shows that when training the model to identify malicious behavior using the first n PCBs of its CPU time, the performance improved significantly with increasing sequence length. The model was able to detect some samples as early as 12 PCBs, and a sequence size of 1k PCBs improved the average F1-score to 95%.

When training the classifier using the random-starting point dataset, the model was able to achieve an F1-score of 93.3% with a sequence length of 12 PCBs. Although increasing the sequence size generally did not improve classifier performance, using a much larger sequence size of 1k PCBs improved the average F1-score to 94.7%, performing similarly to the 1k zero-starting point models. Figure 4.2 also shows that the random models with 100-500 PCBs had the highest variance among all models. On the other hand, the 12-PCB random classifier had higher stability, with nearly 0.4% variance. That classifier was able to generalize better in detecting malicious behavior at different points of the process execution time. This can be explained by the rich diversity of the random-starting point dataset, as well as the higher divergence between its classes, compared to the zero-starting point dataset, which is limited to the first n PCBs of the process execution time only.

To further evaluate the generalizability of the random models, we trained them using the random-starting point training set and evaluated them using the expanded test set. Table 4.1 shows that the random models produced close results when evaluated on the expanded dataset, compared to their results on the random-starting point dataset. Moreover, classifier models trained using 12 PCBs performed slightly better (more stable and with a higher F1-score) than the other models when tested on the expanded dataset. We believe this finding is caused by the labeling approach: all sub-sequences are labeled with their original sample label, instead of labeling only sub-sequences exhibiting malicious behavior as malware. The chance of creating falsely labeled samples therefore grows with the sequence length; the actual malicious behavior is contained in a smaller number of sub-sequences (which are correctly classified as malware), while the rest are falsely labeled. As a result, the random models were able to generalize better on the expanded dataset with shorter PCB sequences.

Table 4.1: Evaluating the random models on the random-starting point test set versus the expanded test set.

PCB Sequence Size | F1-Score on Random Test Set | Random Test Set Size | F1-Score on Expanded Test Set | Expanded Test Set Size
12 PCBs           | 93.3% ± 0.42%               | 2046                 | 92.5% ± 0.29%                 | 943670
100 PCBs          | 93.6% ± 0.83%               | 2046                 | 92% ± 0.32%                   | 113229
300 PCBs          | 94.5% ± 0.82%               | 2046                 | 92.3% ± 0.71%                 | 37787

Table 4.2 presents the average and standard deviation of the F1-scores of the five classifier models trained and tested using the expanded dataset. The resulting models outperformed the classifiers trained using the zero-starting point and random-starting point datasets.

Although malware samples generally tend to hide or limit their malicious behavior, the proposed detection model was able to detect a high percentage of the collected malware samples at various points of their execution time.

Table 4.2: Evaluating the classifier models trained and tested using the expanded dataset.

PCB Sequence Size | Training Set Size | Test Set Size | F1-Score
12 PCBs           | 1421709           | 943670        | 95.8% ± 0.2%
100 PCBs          | 170543            | 113229        | 95.5% ± 0.3%
300 PCBs          | 56901             | 37787         | 95.6% ± 0.4%

The conducted experiments show that the RNN classifier performed best, with an average F1-score of 95.8%, when trained using the expanded dataset with a sequence length of 12 PCBs.

4.3.2 Measuring Feature Importance

To evaluate PCB feature importance (i.e., each feature's contribution toward model performance), we used the Permutation Feature Importance algorithm [35]. The algorithm measures the increase in prediction loss after permuting feature vector j. First, we trained a classifier model using the original feature set on the expanded dataset of 12 PCBs. Then we used the test data to obtain the prediction error of the model, error_org. To measure the importance score of feature j, we permuted the feature vector by randomizing its values (within [0, 1)). Then, we evaluated the same classifier model on the modified test set and recorded the prediction error, error_j. The feature importance score is FI_j = error_j − error_org. Finally, we measured the importance scores of all non-constant features and sorted them in descending order.
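A sketch of this procedure follows, assuming a model evaluation function that returns the prediction error on a given test set; the names and shapes are illustrative, not the dissertation's code.

    # Sketch of Permutation Feature Importance: FI_j = error_j - error_org.
    # X_test has shape (samples, n, m); error_fn returns the model's prediction error.
    import numpy as np

    def permutation_importance(error_fn, X_test, y_test, rng=np.random.default_rng()):
        error_org = error_fn(X_test, y_test)           # error on the intact test set
        fi = np.empty(X_test.shape[-1])
        for j in range(X_test.shape[-1]):
            Xp = X_test.copy()
            Xp[..., j] = rng.random(Xp[..., j].shape)  # randomize feature j within [0, 1)
            fi[j] = error_fn(Xp, y_test) - error_org   # importance of feature j
        return np.argsort(fi)[::-1], fi                # features ranked by descending FI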

Figure 4.3 shows the four groups into which the PCB features were divided based on the distribution of their FI scores. The sig_str field is a pointer to the signal-handling information struct (signal_struct) within the process task_struct. The curr_mm field is a pointer to the memory management information struct (mm_struct), and the sched_str pointer points to the CPU scheduling information struct (sched_info). Features with no pointer prefix belong to the task_struct itself. It is worth mentioning that some features had common names across different structs but did not hold the same values.

Figure 4.3: PCB feature categories based on the Feature Importance (FI) score.

Figure 4.4 illustrates two-dimensional histogram plots of malware and benign processes for the top 10% of the PCB features ranked by their FI score. Samples within the dataset had variable sequence lengths. Therefore, we randomly selected a subset of 1000 malware and 1000 benign samples with a PCB sequence longer than 4000 PCBs (smaller than both medians: 4807 for malware and 5666 for benign). All PCB sequences were then truncated to 4000 PCBs. The number of bins for the x-axis is set to the PCB sequence size to compare the PCB sequences of both groups. The y-axis is shared between the subplots for each feature; therefore, some of the sub-figures have blank areas where one class has a wider range of values than the other.

Figure 4.4a illustrates the value of sig_str-nr_threads (the number of threads) over time for malware and benign processes. The figure shows that malware tended to have fewer threads than benign applications throughout their execution time. The sig_str-oublock feature in Figure 4.4d is the process counter for block output operations; malware processes showed much smaller overall values than benign processes.

The acct_rss_mem1 feature in Figure 4.4c is the cumulative counter for the Resident Set Size (RSS), which shows how much memory is allocated for the process. The figure shows that benign processes generally allocated more memory over time than malware processes. Figure 4.4e represents vmacache_seqnum, the per-thread virtual memory area (VMA) cache sequence number. This task_struct feature is used in conjunction with another VMA sequence number within the mm_struct to ensure the validity of cache results. The wakee_flip_decay_ts feature in Figure 4.4g represents the decay amount on switching in (flipping) the thread to use the CPU; this variable is CPU-scheduling related and is used to prevent starvation.

The curr_mm-locked_vm feature in Figure 4.4b shows the number of pages with the PG flag set. This flag is used to ensure exclusive access to pages involved in I/O operations. The curr_mm-hiwater_rss feature in Figure 4.4f is the thread's peak RSS value, for which benign processes had larger and wider ranges of values. Finally, the curr_mm-arg_end feature in Figure 4.4h, along with the curr_mm-arg_start feature, is used to find the command-line arguments of a process.

PCB features contributed differently toward making the classification decision. The top-ranked features belonged to different categories: memory management, signal information, scheduling information, and others. As can be seen in Figure 4.4, the most contributing features showed differences in the overall behavior of malware and benign processes. Nevertheless, these features alone could not achieve the same performance as the model with the full feature set.
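These heatmaps can be reproduced along the following lines; the construction (per-feature 2D histograms of value over PCB index for each class) follows the description above, while the exact plotting parameters are assumptions.

    # Sketch: a two-dimensional histogram of one feature's value over PCB index,
    # drawn separately for malware and benign samples (all truncated to n PCBs).
    import numpy as np
    import matplotlib.pyplot as plt

    def feature_heatmap(ax, samples, j, n=4000, value_bins=50):
        t = np.tile(np.arange(n), len(samples))           # PCB index (x-axis)
        v = np.concatenate([s[:n, j] for s in samples])   # feature value (y-axis)
        ax.hist2d(t, v, bins=(n, value_bins))             # x-bins = sequence size

    fig, (ax_mal, ax_ben) = plt.subplots(1, 2, sharey=True)
    # feature_heatmap(ax_mal, malware_samples, j=0); feature_heatmap(ax_ben, benign_samples, j=0)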

Figure 4.4: Heatmap of malware and benign processes for the top 10% of features ranked by FI score.

When training a model using the top 42 features, the model had an average F1-score of 94.5%, while training the model using only the top 22 features produced an F1-score of 94%.

4.4 Conclusion

In this chapter, we proposed a novel approach to detect Android malware using deep learning techniques on PCB information. The detection model combines CNN, LSTM, and DNN layers to extract important features, learn temporal correlations, and further abstract the deep network findings, respectively. The structure of the detection model can be used with different detection window sizes, as well as with PCB feature sets from other versions or other Linux-based operating systems.

The proposed detection model was able to correctly identify a large percentage of malware and benign samples at various points of their execution time using only 12 PCBs, with an average F1-score of 95.8%. To the best of our knowledge, no available dynamic malware detection technique has achieved such a minimal detection time. We validated our approach by training and testing the model using the PCB dataset of 2615 benign and 2502 malware-infested apps. For each setting, we trained and tested five classifier models and reported the average F1-score.

Three different dataset versions were constructed to train the detection model: the zero-starting point dataset, the random-starting point dataset, and the expanded dataset. The detection model improved with increasing sequence size when using the zero-starting point dataset, as the behavior of malware and benign processes tends to diverge over time. The random-starting point dataset provided a higher diversity of behaviors, enabling the detection model to perform better using short PCB sequences and to detect malicious behavior at various points of the process execution time. The expanded dataset models achieved the best performance, with an average F1-score of 95.8% using 12 PCBs.

When building the datasets, all PCB sub-sequences were labeled based on their original sample label. However, malware may act maliciously for only a limited time during its execution. We believe that our detection approach can be further improved by training the model using a PCB dataset in which only malicious behaviors are labeled as malware, and the rest benign, although such labeling can be costly.

Finally, we evaluated PCB feature importance using the Permutation Feature Importance algorithm. The feature set contained 120 features and was divided into four groups based on their score distribution. The top 10% included memory management, signal, and scheduling information. However, a classifier model trained using only the top feature set was not able to improve upon the full-feature-set classifier, implying that the PCB features contributed differently toward detecting malware.

Chapter 5

Conclusion & Future Work

The number and sophistication of new malware attacks targeting Android phones are continuously growing. However, the majority of Android phones are poorly protected, or not protected at all, against such attacks. Static malware detection techniques can provide low detection delay, but they analyze only the application code and are thus vulnerable to obfuscation techniques. While dynamic detection methods can cover these limitations, they are generally time-intensive and resource-consuming. In this research, we proposed a novel approach to detect Android malware dynamically using deep learning techniques on Process Control Block (PCB) information.

5.1 Conclusion

Employing PCB information to detect Android malware has been explored previously; in that work, mining PCB information was triggered in user space and performed for a minimum of three seconds at runtime to produce a prediction. The purpose of this dissertation was to propose a malware detection approach that can detect malicious behavior with minimal detection delay, minimal false alarms, and maximal true positives. It needs to be easily maintained and should be able to detect known malware as well as new attacks from a variety of malware.

The efficiency and reliability of any machine learning (ML) model depend highly on the quality of its dataset (training and testing). Therefore, the ML model needs to be evaluated using a recent, realistic, diverse, and non-biased dataset. We did not rely on available malware datasets [16, 11, 90, 41] as they were outdated, static-based, unavailable by the time we started the research, or did not meet our sample selection criteria for producing an efficient and non-biased sample set. For the same reasons, we could not directly compare our model with the work of others.

To minimize detection overhead, our proposed mining approach is performed at the kernel level. To ensure that the extracted PCB sequences represent process behavior efficiently, PCB information mining is triggered by process context switches and is thus synchronized with process CPU utilization. While an average of 5577 context switches were performed by the main thread during the application runtime (25 seconds), our detection method required only 12 PCBs to perform the detection. We validated our approach using 2600 benign and 2500 malware samples and achieved an average F1-score of 95.8%.

To build an efficient, reliable, and diverse dataset, we determined criteria for selecting malware and benign samples. Benign samples were downloaded from the Google Play Store and selected from the list of the most popular Android apps across 38 categories. The chosen applications have a minimum rating score of 4.0 out of 5, have reached at least five million installs, have been updated since January 2019, and are compatible with Android versions 4.0 and later. Malware samples were collected from the VirusTotal Private API under an academic research access license. Samples were chosen if more than 40% of the malware detection engines flagged them as malware, with at least 20% of those being top engines. Selected samples are categorized as Trojan, Riskware, Spyware, Backdoor, Adware, Ransomware, and Exploit, and were distributed among 195 malware families.

We introduced a closed dynamic malware analysis framework to facilitate the testing of different applications concurrently on multiple physical Android phones. The framework employs monkeyrunner to exercise applications. It also provides a simulated internet connection over WiFi to ensure the security of local and online resources. We used the framework to test 4020 benign and 3562 malware apps and collect their PCB information over application execution time. We successfully completed the analysis for 2600 benign and 2500 malware samples.

The core aims in designing the PCB information mining approach were to minimize data collection overhead as well as duplicate or missed PCB records, and to collect the PCB information of all threads running the application, not only the main thread, to enable detecting dropper malware. We implemented the PCB information mining approach using four system components: the application testing unit (UTA) and the PCB buffer reader operate in user space, while the data collector and the CPU scheduler operate in kernel space. Mining PCB information for an application starts with passing its package name from the UTA through to the CPU scheduler. Once the application starts, the CPU scheduler appends the PCB information of all group threads to the data collector's buffer at each context switch. The buffer is occasionally read by the PCB buffer reader and emptied by the data collector. At the end of application execution, the data collector destructs the package name to terminate data collection. Although the proposed approach was tested on Android phones, it can be utilized for any Linux-based operating system.

The PCB dataset comprised T PCB sequences per application, one PCB sequence per thread. The average number of threads used by an application during its runtime is 112, performing an average total of 18430 context switches. Results show that our mining approach successfully captured more than 99% of the context switches for the vast majority of tested applications.

To maximize detection performance, we introduced a detection model that combines CNN, LSTM, and DNN layers. The model uses the PCB information of the main thread over the application execution time to detect Android malware. The CNN extracts important features, eliminating the need for manual feature selection. The LSTM processes the data sequence and learns its temporal features. The DNN interprets the LSTM output and produces the final prediction. The structure of the detection model can be used with different detection window sizes, as well as with PCB feature sets from other versions or other Linux-based operating systems. To train the model, we constructed three different versions of the PCB dataset: the zero-starting point dataset (sequences truncated at n PCBs), the random-starting point dataset (sequences of n PCBs starting from a random point), and the expanded dataset (PCB sequences sliced into sub-sequences of size n). The classifier built using the expanded dataset produced the best average F1-score of 95.8% using 12 PCBs. The second-best F1-score of 94.7% was achieved using the random-starting point dataset with 1000 PCBs.

To understand the contribution of PCB features toward detecting Android malware, we evaluated PCB feature importance using the Permutation Feature Importance algorithm. The feature set was divided into four groups based on the score distribution. The top 10% of features include the number of threads, the block output operations counter, the peak RSS value, and others. The top 42 features represent memory management, signal, and scheduling information. Using the top-ranked features to train the classifier produced an average F1-score of 94.5%, compared to 95.8% for the full-feature-set classifier.

5.2 Future Work

Detecting Android malware is an ongoing race, and employing PCB information to perform the detection is a promising approach. Throughout the course of our work, we have come to believe that this approach has the potential to be improved further. Below, we briefly discuss exciting directions for future work.

Porting the solution into a standalone Android service

In Chapter 3, we proposed a novel approach to mine PCB information over process execution time, triggered by process context switches. We employed the proposed approach to collect the PCB sequences of 2600 benign and 2500 malware samples. In Chapter 4, we used the PCB dataset to train a classifier model to detect Android malware. Although the mining approach runs on physical Android phones, training and testing the classifier were performed on a separate machine. Future work would be to port the detection model to run completely on the Android platform and integrate it with the proposed mining approach.

Labeling malicious behavior

In Chapter 4, we presented a new Android malware detection model that employs the time sequence of PCB information to identify malicious behavior. The best model was achieved using the expanded dataset, where PCB sequences were sliced into sub-sequences labeled with their original sample label (due to limited time and resources). Generally, malware does not act maliciously throughout its entire execution time. Therefore, the labeling approach can mislead the classifier. It would thus be interesting to train the detection model using a dataset in which only malicious behavior is labeled as malware and the rest of the sequence is labeled as benign. One possible approach is to dynamically analyze the machine code of malware samples to identify the starting and ending points of the malicious behavior (using manual or automated techniques).

Utilizing child threads in detecting dropper malware

Our proposed mining approach collects the PCB information of all threads running the Android application. We used only the PCB information of the main thread to perform the detection and did not take into consideration the possibility of malware spawning threads to carry out malicious actions. For instance, dropper malware installs and launches other types of malware on the target system. Utilizing the PCB information of the other threads in performing the detection could therefore be interesting; however, this approach would significantly increase processing overhead due to the large number of threads used by Android applications.

Appendices

Appendix A

The Kernel-Space System Components Pseudocode

A.1 The Data Collector Pseudocode

#define PCBs_limit
#define PCB_buff_limit

// external functions are implemented in core.c
extern char *package_name;
extern int init_package_name(buf, count);
extern void distruct_package_name();
extern bool compare_package_name(p_task_struct);

// used to control buffer reading/emptying
int PCB_cnt = 0, PCB_read = 0;
PCBs[PCBs_limit][PCB_buff_limit];      // PCBs buffer

// a non-preemptive lock is used to manage read/write operations
define lock(slock);
// used for proc communication
define read_proc();
define write_proc();

// make it reachable from other parts of the kernel
export append_PCB_record(*p_task_struct);

append_PCB_record(*p_task_struct) {
    mm_struct *curr_mm;                // memory management info
    signal_struct *sig_str;            // signal handling info
    sched_info sched_str;              // CPU scheduling related
    PCB_record[PCB_buff_limit];

    // if the buffer is full, wait
    while (PCB_cnt >= PCBs_limit);

    // critical region to append a PCB record
    activate_lock(slock);
    // read PCB parameters
    read_task_struct(PCB_record, p_task_struct);
    read_mm_struct(PCB_record, curr_mm);
    read_signal_struct(PCB_record, sig_str);
    read_sched_info(PCB_record, sched_str);
    // append the record to the PCB buffer
    append_to_PCBs_buffer(PCB_record);
    PCB_cnt += 1;
    deactivate_lock(slock);
}

send_PCB_buffer() {
    counter = 0;
    // if the buffer is empty, return NULL
    if (PCB_cnt == 0)
        send_NULL_buffer();
    while (counter < PCB_cnt) {
        read_record(PCBs[counter]);
        PCB_read += 1;
        counter += 1;
    }
}

empty_buffer() {
    cnt = 0;
    // critical region to empty the buffer; lock emptying the buffer,
    // not the whole reading process, and keep PCBs appended while reading
    activate_lock(slock);
    if (PCB_read != PCB_cnt)
        cnt = append_missed_records(PCB_read, PCB_cnt);
    PCB_cnt = cnt;
    deactivate_lock(slock);
}

receive_package_name(*ubuf, count) {
    buf[PCB_buff_limit];
    // copy the data from user-space to kernel-space
    copy_from_user(buf, ubuf, count);

    // an empty package name is a signal
    // to terminate the data collection process
    if (count == 1) {
        distruct_package_name();
        return strlen(buf);
    }
    // otherwise, a signal to start collecting PCBs
    init_package_name(buf, count);
    return strlen(buf);
}

init() {
    create_read_proc(&readOp);
    create_write_proc(&writeOp);
}

cleanup() {
    remove_read_proc(&readOp);
    remove_write_proc(&writeOp);
}

A.2 The modified CPU Scheduler Pseudocode

#define MAXSIZE

char *package_name[MAXSIZE] = NULL;
// make it accessible to other kernel modules
EXPORT_SYMBOL(package_name);

bool firstTime = True;
pid_t main_pid = NULL;

int init_package_name(received_pname, count) {
    // make sure the local buffer is big enough
    if (count > MAXSIZE)
        return -1;
    package_name = received_pname;
    return 0;
}
EXPORT_SYMBOL(init_package_name);

void distruct_package_name() {
    package_name = NULL;
}
EXPORT_SYMBOL(distruct_package_name);

bool compare_package_name(*p_task_struct) {
    // compare the task's package name with the received one;
    // if they match, set its pid as the main pid
    get_task_comm(comm, p_task_struct);
    if (comm == package_name[0:len(comm)]) {
        if (firstTime) {
            main_pid = p_task_struct->pid;
            firstTime = False;
        }
        return 1;
    }
    return 0;
}
EXPORT_SYMBOL(compare_package_name);

context_switch(rq *rq, task_struct *prev, task_struct *next) {
    // prepare to switch the CPU
    prepare_task_switch(rq, prev, next);
    arch_start_context_switch(prev);
    // ... more preparation ...

    // if package_name is set --> the application is being executed
    if (package_name != NULL) {
        if (main_pid != NULL) {
            if (main_pid == prev->tgid)
                append_PCB_record(prev);
        } else if (compare_package_name(prev)) {
            append_PCB_record(prev);
        }
    }

    // switch memory space
    switch_mm(prev, next);
    // here we just switch the register state and the stack
    switch_to(prev, next, prev);
    // ... more switching ...
    return finish_task_switch(prev);
}

Appendix B

Struct task_struct

struct task_struct {
    volatile long state;        /* -1 unrunnable, 0 runnable, >0 stopped */
    void *stack;
    atomic_t usage;
    unsigned int flags;         /* per process flags, defined below */
    unsigned int ptrace;
#ifdef CONFIG_SMP
    struct llist_node wake_entry;
    int on_cpu;
    unsigned int wakee_flips;
    unsigned long wakee_flip_decay_ts;
    struct task_struct *last_wakee;
    int wake_cpu;
#endif
    int on_rq;
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
    const struct sched_class *sched_class;
    struct sched_entity se;
    struct sched_rt_entity rt;
#ifdef CONFIG_SCHED_HMP
    struct ravg ravg;
    /* 'init_load_pct' represents the initial task load assigned
       to children of this task */
    u32 init_load_pct;
    u64 last_wake_ts;
    u64 last_switch_out_ts;
    u64 last_cpu_selected_ts;
    struct related_thread_group *grp;
    struct list_head grp_list;
    u64 cpu_cycles;
#endif
    struct sched_dl_entity dl;
#ifdef CONFIG_BLK_DEV_IO_TRACE
    unsigned int btrace_seq;
#endif
    unsigned int policy;
    int nr_cpus_allowed;
    cpumask_t cpus_allowed;
#ifdef CONFIG_PREEMPT_RCU
    int rcu_read_lock_nesting;
    union rcu_special rcu_read_unlock_special;
    struct list_head rcu_node_entry;
    struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_SCHED_INFO
    struct sched_info sched_info;
#endif
    struct list_head tasks;
    struct mm_struct *mm, *active_mm;
    /* per-thread vma caching */
    u32 vmacache_seqnum;
    struct vm_area_struct *vmacache[VMACACHE_SIZE];
    /* task state */
    int exit_state;
    int exit_code, exit_signal;
    int pdeath_signal;          /* The signal sent when the parent dies */
    unsigned long jobctl;       /* JOBCTL_*, siglock protected */
    /* Used for emulating ABI behavior of previous Linux versions */
    unsigned int personality;
    /* scheduler bits, serialized by scheduler locks */
    unsigned sched_reset_on_fork:1;
    unsigned sched_contributes_to_load:1;
    unsigned sched_migrated:1;
    unsigned :0;                /* force alignment to the next boundary */
    /* unserialized, strictly 'current' */
    unsigned in_execve:1;       /* bit to tell LSMs we're in execve */
    unsigned in_iowait:1;
    unsigned long atomic_flags; /* Flags needing atomic access. */
    struct restart_block restart_block;
    pid_t pid;
    pid_t tgid;
#ifdef CONFIG_CC_STACKPROTECTOR
    /* Canary value for the -fstack-protector gcc feature */
    unsigned long stack_canary;
#endif
    /* pointers to (original) parent process, youngest child,
       younger sibling, older sibling, respectively.
       (p->father can be replaced with p->real_parent->pid) */
    struct task_struct __rcu *real_parent;   /* real parent process */
    /* recipient of SIGCHLD, wait4() reports */
    struct task_struct __rcu *parent;
    /* children/sibling forms the list of my natural children */
    struct list_head children;  /* list of my children */
    struct list_head sibling;   /* linkage in my parent's children list */
    struct task_struct *group_leader;        /* threadgroup leader */
    /* ptraced is the list of tasks this task is using ptrace on.
       This includes both natural children and PTRACE_ATTACH targets.
       p->ptrace_entry is p's link on the p->parent->ptraced list. */
    struct list_head ptraced;
    struct list_head ptrace_entry;
    /* PID/PID hash table linkage. */
    struct pid_link pids[PIDTYPE_MAX];
    struct list_head thread_group;
    struct list_head thread_node;
    struct completion *vfork_done;           /* for vfork() */
    int __user *set_child_tid;               /* CLONE_CHILD_SETTID */
    int __user *clear_child_tid;             /* CLONE_CHILD_CLEARTID */
    cputime_t utime, stime, utimescaled, stimescaled;
    cputime_t gtime;
    struct prev_cputime prev_cputime;
    unsigned long nvcsw, nivcsw;             /* context switch counts */
    u64 start_time;                          /* monotonic time in nsec */
    u64 real_start_time;                     /* boot based time in nsec */
    /* mm fault and swap info: this can arguably be seen as either
       mm-specific or thread-specific */
    unsigned long min_flt, maj_flt;
    struct task_cputime cputime_expires;
    struct list_head cpu_timers[3];
    /* process credentials */
    /* Tracer's credentials at attach */
    const struct cred __rcu *ptracer_cred;
    /* objective and real subjective task credentials (COW) */
    const struct cred __rcu *real_cred;
    /* effective (overridable) subjective task credentials (COW) */
    const struct cred __rcu *cred;
    char comm[TASK_COMM_LEN];   /* executable name excluding path
                                   - access with [gs]et_task_comm
                                     (which lock it with task_lock())
                                   - initialized normally by setup_new_exec */
    /* file system info */
    struct nameidata *nameidata;
#ifdef CONFIG_DETECT_HUNG_TASK
    /* hung task detection */
    unsigned long last_switch_count;
#endif
    /* filesystem information */
    struct fs_struct *fs;
    /* open file information */
    struct files_struct *files;
    /* namespaces */
    struct nsproxy *nsproxy;
    /* signal handlers */
    struct signal_struct *signal;
    struct sighand_struct *sighand;
    sigset_t blocked, real_blocked;
    /* restored if set_restore_sigmask() was used */
    sigset_t saved_sigmask;
    struct sigpending pending;
    unsigned long sas_ss_sp;
    size_t sas_ss_size;
    struct callback_head *task_works;
    struct audit_context *audit_context;
    struct seccomp seccomp;
    /* Thread group tracking */
    u32 parent_exec_id;
    u32 self_exec_id;
    /* Protection of (de-)allocation: mm, files, fs, tty, keyrings,
       mems_allowed, mempolicy */
    spinlock_t alloc_lock;
    /* Protection of the PI data structures: */
    raw_spinlock_t pi_lock;
    struct wake_q_node wake_q;
    /* journalling filesystem info */
    void *journal_info;
    /* stacked block device info */
    struct bio_list *bio_list;
    /* VM state */
    struct reclaim_state *reclaim_state;
    struct backing_dev_info *backing_dev_info;
    struct io_context *io_context;
    unsigned long ptrace_message;
    siginfo_t *last_siginfo;    /* For ptrace use. */
    struct task_io_accounting ioac;
#if defined(CONFIG_TASK_XACCT)
    u64 acct_rss_mem1;          /* accumulated rss usage */
    u64 acct_vm_mem1;           /* accumulated virtual memory usage */
    cputime_t acct_timexpd;     /* stime + utime since last update */
#endif
#ifdef CONFIG_CPUSETS
    nodemask_t mems_allowed;    /* Protected by alloc_lock */
    seqcount_t mems_allowed_seq;  /* Seqence no to catch updates */
    int cpuset_mem_spread_rotor;
    int cpuset_slab_spread_rotor;
#endif
    struct rcu_head rcu;
    /* cache last used pipe for splice */
    struct pipe_inode_info *splice_pipe;
    struct page_frag task_frag;
    /* when (nr_dirtied >= nr_dirtied_pause), it's time to call
       balance_dirty_pages() for some dirty throttling pause */
    int nr_dirtied;
    int nr_dirtied_pause;
    unsigned long dirty_paused_when;  /* start of a write-and-pause period */
    /* time slack values; these are used to round up poll() and
       select() etc timeout values. These are in nanoseconds. */
    u64 timer_slack_ns;
    u64 default_timer_slack_ns;
#ifdef CONFIG_TRACING
    /* state flags for use by tracers */
    unsigned long trace;
    /* bitmask and counter of trace recursion */
    unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
    int pagefault_disabled;
    /* CPU-specific state of this task */
    struct thread_struct thread;
    /*
     * WARNING: on x86, 'thread_struct' contains a variable-sized
     * structure. It *MUST* be at the end of 'task_struct'.
     * Do not put anything below here!
     */
};

Bibliography

[1] URL: https://www.kaspersky.com/about/press-releases/2017_kaspersky-lab-detects-360000-new-malicious-files-daily.

[2] URL: https://usa.kaspersky.com/resource-center/threats/computer-viruses-vs-worms.

[3] URL: https://usa.kaspersky.com/.

[4] 1.1. What Is A Kernel Module? URL: https://linux.die.net/lkmpg/x40.html.

[5] Yousra Aafer, Wenliang Du, and Heng Yin. “DroidAPIMiner: Mining API-level features for robust malware detection in android”. In: International conference on security and privacy in communication systems. Springer. 2013, pp. 86–103.

[6] Advanced wireless settings. URL: https://wiki.dd-wrt.com/wiki/index.php/Advanced_wireless_settings#AP_Isolation.

[7] Adware. URL: https://encyclopedia.kaspersky.com/knowledge/adware/.

[8] Heba Alawneh, David Umphress, and Anthony Skjellum. “Android Malware Detection Using Neural Networks & Process Control Block Information”. In: 2019 14th International Conference on Malicious and Unwanted Software (MALWARE). 2019, pp. 3–12. ISBN: 978-0-578-58383-9.

[9] Domenico Amalfitano et al. “MobiGUITAR: Automated model-based testing of mobile apps”. In: IEEE software 32.5 (2014), pp. 53–59.

[10] Saswat Anand et al. “Automated concolic testing of smartphone apps”. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. 2012, pp. 1–11.

[11] Android Malware Genome Project. URL: http://www.malgenomeproject.org/.

[12] Android malware on Google Play adds devices to botnet. URL: https://www.symantec.com/connect/blogs/android-malware-google-play-adds-devices-botnet-and-performs-ddos-attacks.

[13] Android Open Source Project. URL: https://source.android.com/.

[14] Android/Trojan.Agent. URL: https://blog.malwarebytes.com/detections/android-trojan-agent.

[15] Nicolò Andronio, Stefano Zanero, and Federico Maggi. “Heldroid: Dissecting and detecting mobile ransomware”. In: International workshop on recent advances in intrusion detection. Springer. 2015, pp. 382–404.

[16] Daniel Arp et al. “Drebin: Effective and explainable detection of android malware in your pocket.” In: Ndss. Vol. 14. 2014, pp. 23–26.

[17] Saba Arshad et al. “Android malware detection & protection: a survey”. In: International Journal of Advanced Computer Science and Applications 7.2 (2016), pp. 463–475.

[18] AV-TEST Security Report 2018/2019. Accessed: March 2020. July 2019. URL: https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2018-2019.pdf.

[19] Tanzirul Azim and Iulian Neamtiu. “Targeted and depth-first exploration for systematic testing of android apps”. In: Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications. 2013, pp. 641–660.

[20] Michael Bierma et al. “Andlantis: Large-scale Android dynamic analysis”. In: arXiv preprint arXiv:1410.7751 (2014).

[21] Nataniel P Borges Jr, Maria Gómez, and Andreas Zeller. “Guiding app testing with mined interaction models”. In: Proceedings of the 5th International Conference on Mobile Software Engineering and Systems. ACM. 2018, pp. 133–143.

[22] Nataniel P Borges, Maria Gómez, and Andreas Zeller. “Guiding app testing with mined interaction models”. In: 2018 IEEE/ACM 5th International Conference on Mobile Software Engineering and Systems (MOBILESoft). IEEE. 2018, pp. 133–143.

[23] Patrick Carter et al. “CuriousDroid: automated user interface interaction for android application analysis sandboxes”. In: International Conference on Financial Cryptography and Data Security. Springer. 2016, pp. 231–249.

[24] Carlos A Castillo et al. “Android malware past, present, and future”. In: White Paper of McAfee Mobile Security Working Group 1 (2011), p. 16.

[25] Wontae Choi, George Necula, and Koushik Sen. “Guided gui testing of android apps with minimal restart and approximate learning”. In: Acm Sigplan Notices 48.10 (2013), pp. 623–640.

[26] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. “Automated test input generation for android: Are we there yet? (e)”. In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE. 2015, pp. 429–440.

[27] Contributors. URL: https://support.virustotal.com/hc/en-us/articles/115002146809-Contributors.

[28] Definitive data and analysis for the mobile industry. Apr. 2018. URL: https://www.gsmaintelligence.com/.

[29] Christos Douligeris and Aikaterini Mitrokotsa. “DDoS attacks and defense mechanisms: classification and state-of-the-art”. In: Computer Networks 44.5 (2004), pp. 643–666.

[30] William Enck, Machigar Ongtang, and Patrick McDaniel. “On lightweight mobile phone application certification”. In: Proceedings of the 16th ACM conference on Computer and communications security. ACM. 2009, pp. 235–245.

[31] William Enck et al. “TaintDroid: an information-flow tracking system for realtime pri- vacy monitoring on smartphones”. In: ACM Transactions on Computer Systems (TOCS) 32.2 (2014), p. 5.

[32] Exploit:Android/GingerBreak. URL: https://www.f-secure.com/v-descs/exploit_android_gingerbreak.shtml.

[33] Parvez Faruki et al. “Android security: a survey of issues, malware penetration, and defenses”. In: IEEE communications surveys & tutorials 17.2 (2015), pp. 998–1022.

[34] Ali Feizollah et al. “A review on feature selection in mobile malware detection”. In: Digital investigation 13 (2015), pp. 22–37.

[35] Aaron Fisher, Cynthia Rudin, and Francesca Dominici. “Model class reliance: Variable importance measures for any machine learning model class, from the “Rashomon” perspective”. In: arXiv preprint arXiv:1801.01489 68 (2018).

[36] Free antivirus protection that never quits. URL: https://www.avast.com/en-us/index#pc.

[37] Free Online Virus, Malware and URL Scanner. URL: https://www.virustotal. com/en.

[38] Richard HR Hahnloser et al. “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit”. In: Nature 405.6789 (2000), p. 947.

[39] Shuai Hao et al. “PUMA: programmable UI-automation for large-scale dynamic analysis of mobile apps”. In: Proceedings of the 12th annual international conference on Mobile systems, applications, and services. 2014, pp. 204–217.

[40] Immanuel. What is a Logic Bomb? — How to Prevent this Logic Bomb Malware? Dec. 2018. URL: https://antivirus.comodo.com/blog/comodo-news/logic-bomb/.

[41] ElMouatez Billah Karbab et al. “MalDozer: Automatic framework for android malware detection using deep learning”. In: Digital Investigation 24 (2018), S48–S59.

[42] Hahnsang Kim, Joshua Smith, and Kang G Shin. “Detecting energy-greedy anomalies and mobile malware variants”. In: Proceedings of the 6th international conference on Mobile systems, applications, and services. ACM. 2008, pp. 239–252.

[43] Elijah Lawal. Shielding you from Potentially Harmful Applications. Feb. 2017. URL: https://www.blog.google/technology/safety-security/shielding-you-potentially-harmful-applications/.

[44] Leader in Cyber Threat Analysis and Response. URL: https://global.ahnlab.com/site/main.do.

[45] Jin Li et al. “Significant permission identification for machine-learning-based android malware detection”. In: IEEE Transactions on Industrial Informatics 14.7 (2018), pp. 3216–3225.

[46] Yuanchun Li et al. “A Deep Learning based Approach to Automated Android App Testing”. In: arXiv preprint arXiv:1901.02633 (2019).

[47] Yuanchun Li et al. “DroidBot: a lightweight UI-guided test input generator for Android”. In: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE. 2017, pp. 23–26.

[48] LineageOS. LineageOS Android Distribution. URL: https://lineageos.org/.

[49] List of Android Most Popular Google Play Apps. URL: https://www.androidrank.org/android-most-popular-google-play-apps.

[50] Robert Love. Process Scheduling. Addison-Wesley Publishing Company, 2010.

[51] Aravind Machiry, Rohan Tahiliani, and Mayur Naik. “Dynodroid: An input generation system for android apps”. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM. 2013, pp. 224–234.

[52] Riyadh Mahmood, Nariman Mirzaei, and Sam Malek. “Evodroid: Segmented evolutionary testing of android apps”. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 2014, pp. 599–609.

[53] Ke Mao, Mark Harman, and Yue Jia. “Crowd intelligence enhances automated mobile testing”. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE. 2017, pp. 16–26.

[54] Ke Mao, Mark Harman, and Yue Jia. “Sapienz: Multi-objective automated testing for Android applications”. In: Proceedings of the 25th International Symposium on Software Testing and Analysis. 2016, pp. 94–105.

[55] McAfee. Chinese Worm Infects Thousands of Android Phones. Aug. 2014. URL: https://www.mcafee.com/blogs/other-blogs/mcafee-labs/chinese-worm-infects-thousands-android-phones/.

[56] Fairuz Amalina Narudin et al. “Evaluation of machine learning classifiers for mobile malware detection”. In: Soft Computing 20.1 (2016), pp. 343–357.

[57] Operating System Market Share Worldwide. July 2018. URL: https://gs.statcounter.com/os-market-share.

[58] Francisco Javier Ordóñez and Daniel Roggen. “Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition”. In: Sensors 16.1 (2016), p. 115.

[59] Stelian Pilici. How To Remove the Android Lockscreen Ransomware (with Pictures). Aug. 2018. URL: https://malwaretips.com/blogs/android-police-virus-removal/.

[60] Radu S Pirscoveanu et al. “Analysis of malware behavior: Type classification using machine learning”. In: 2015 International conference on cyber situational awareness, data analytics and assessment (CyberSA). IEEE. 2015, pp. 1–7.

[61] James Raymond. Backdoor Virus — How to Remove a Backdoor Virus from Your System. Nov. 2018. URL: https://antivirus.comodo.com/blog/comodo-news/backdoor-virus/.

[62] RiskTool. URL: https://encyclopedia.kaspersky.com/knowledge/risktool/.

[63] Riskware:Android/SmsPay. URL: https://www.f-secure.com/sw-desc/riskware_android_smspay.shtml.

[64] Riskware:Android/SmsPay.variant!Online. URL: https://www.f-secure.com/sw-desc/riskware_android_smspay_online.shtml.

[65] Riskware:Android/Smsreg.variant!Online. URL: https://www.f-secure.com/sw-desc/riskware_android_smsreg_online.shtml.

[66] Tara N Sainath et al. “Convolutional, long short-term memory, fully connected deep neural networks”. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2015, pp. 4580–4584.

[67] Tara N Sainath et al. “Learning the speech front-end with raw waveform CLDNNs”. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[68] Asaf Shabtai et al. “Andromaly: a behavioral malware detection framework for android devices”. In: Journal of Intelligent Information Systems 38.1 (2012), pp. 161–190.

[69] Farrukh Shahzad, Muhammad Shahzad, and Muddassar Farooq. “In-execution dynamic malware analysis and detection by mining information in process control blocks of Linux OS”. In: Information Sciences 231 (2013), pp. 45–63.

[70] Farrukh Shahzad et al. “Tstructdroid: Realtime malware detection using in-execution dynamic analysis of kernel process control blocks on android”. In: National University of Computer & Emerging Sciences, Islamabad, Pakistan, Tech. Rep (2013).

[71] Shailaja Shankar et al. The Do You Knows of DDoS Attacks. Mar. 2018. URL: https://securingtomorrow.mcafee.com/consumer/mobile-ddos/.

[72] Shina Sheen, R Anitha, and V Natarajan. “Android based malware detection using a multifeature collaborative decision fusion approach”. In: Neurocomputing 151 (2015), pp. 905–912.

[73] Adrian J Shepherd. Second-order methods for neural networks: Fast and reliable training methods for multi-layer perceptrons. Springer Science & Business Media, 2012. Chap. Process Management.

[74] SMS Trojan. URL: https://blog.malwarebytes.com/threats/sms-trojan/.

[75] Alireza Souri and Rahil Hosseini. “A state-of-the-art survey of malware detection approaches using data mining techniques”. In: Human-centric Computing and Information Sciences 8.1 (2018), p. 3.

[76] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting”. In: The journal of machine learning research 15.1 (2014), pp. 1929–1958.

[77] Ting Su et al. “Guided, stochastic model-based GUI testing of Android apps”. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 2017, pp. 245–256.

[78] Kimberly Tam et al. “The evolution of android malware and android analysis techniques”. In: ACM Computing Surveys (CSUR) 49.4 (2017), p. 76.

[79] Technical Marketing Team. Ransomware Past, Present, and Future. URL: https : //documents.trendmicro.com/assets/wp/wp-ransomware-past-

present-and-future.pdf.

[80] Test antivirus software for Android - November 2019. Nov. 2019. URL: https://www.av-test.org/en/antivirus/mobile-devices/android/november-2019/.

[81] The APK Downloader for Windows, Linux and MacOS. URL: https://raccoon.onyxbits.de/.

[82] Trojan:Android/DroidKungFu.C Description — F-Secure Labs. Accessed: March 2019. URL: https://www.f-secure.com/v-descs/trojan_android_droidkungfu_c.shtml.

[83] Trojan:Android/SmsSpy. URL: https://www.f-secure.com/v-descs/trojan_android_smsspy.shtml.

[84] Trojanized adware family abuses accessibility service to install whatever apps it wants. URL: https://blog.lookout.com/shedun-trojanized-adware.

[85] UI/Application Exerciser Monkey — Android Developers. URL: https://developer.android.com/studio/test/monkey.

[86] VirusTotal. URL: https://developers.virustotal.com/reference.

[87] Wenyu Wang et al. “An empirical study of Android test generation tools in industrial cases”. In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 2018, pp. 738–748.

[88] Xinning Wang, Chong Li, and Dalei Song. “CrowdNet: Identifying Large-Scale Malicious Attacks Over Android Kernel Structures”. In: IEEE Access 8 (2020), pp. 15823–15837.

[89] Yang Wang et al. “Quantitative security risk assessment of Android permissions and applications”. In: IFIP Annual Conference on Data and Applications Security and Privacy. Springer. 2013, pp. 226–241.

[90] Fengguo Wei et al. “Deep ground truth analysis of current Android malware”. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer. 2017, pp. 252–276.

[91] Welcome to the INetSim project homepage! URL: https://www.inetsim.org/.

[92] Worm:Android/Samsapo. URL: https://www.f-secure.com/v-descs/worm_android_samsapo.shtml.

[93] Dong-Jie Wu et al. “Droidmat: Android malware detection through manifest and API calls tracing”. In: 2012 Seventh Asia Joint Conference on Information Security. IEEE. 2012, pp. 62–69.

[94] Songyang Wu et al. “Effective detection of Android malware based on the usage of data flow APIs and machine learning”. In: Information and Software Technology 75 (2016), pp. 17–25.

[95] Xingjian Shi et al. “Convolutional LSTM network: A machine learning approach for precipitation nowcasting”. In: Advances in Neural Information Processing Systems. 2015, pp. 802–810.

[96] Xinning Wang. “Android Malware Detection through Machine Learning on Kernel Task Structure”. PhD dissertation. Auburn University, 2017.

[97] Weilin Xu, Yanjun Qi, and David Evans. “Automatically evading classifiers”. In: Proceedings of the 2016 Network and Distributed Systems Symposium. Vol. 10. 2016.

[98] Lok Kwong Yan and Heng Yin. “Droidscope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic Android malware analysis”. In: Presented as part of the 21st USENIX Security Symposium (USENIX Security 12). 2012, pp. 569–584.

[99] Ping Yan and Zheng Yan. “A survey on dynamic mobile malware detection”. In: Software Quality Journal 26.3 (2018), pp. 891–919.

[100] Wei Yang, Mukul R Prasad, and Tao Xie. “A grey-box approach for automated GUI-model generation of mobile applications”. In: International Conference on Fundamental Approaches to Software Engineering. Springer. 2013, pp. 250–265.

[101] Suleiman Y Yerima et al. “A new Android malware detection approach using Bayesian classification”. In: 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA). IEEE. 2013, pp. 121–128.

[102] Mahmood Yousefi-Azar et al. “Malytics: a malware detection scheme”. In: IEEE Access 6 (2018), pp. 49418–49431.

[103] Zhenlong Yuan, Yongqiang Lu, and Yibo Xue. “Droiddetector: Android malware characterization and detection using deep learning”. In: Tsinghua Science and Technology 21.1 (2016), pp. 114–123.

[104] Zero-day vulnerability: What it is, and how it works. Accessed: March 2019. URL: https://us.norton.com/internetsecurity-emerging-threats-how-do-zero-day-vulnerabilities-work-30sectech.html.

[105] Min Zhao et al. “Antimaldroid: An efficient SVM-based malware detection framework for Android”. In: International Conference on Information Computing and Applications. Springer. 2011, pp. 158–166.

[106] Cong Zheng et al. “Smartdroid: an automatic system for revealing UI-based trigger conditions in Android applications”. In: Proceedings of the Second ACM Workshop on Security and Privacy in Smartphones and Mobile Devices. 2012, pp. 93–104.

[107] Haibing Zheng et al. “Automated test input generation for Android: Towards getting there in an industrial case”. In: 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). IEEE. 2017, pp. 253–262.

[108] Yajin Zhou and Xuxian Jiang. “Dissecting Android malware: Characterization and evolution”. In: 2012 IEEE Symposium on Security and Privacy. IEEE. 2012, pp. 95–109.

[109] Dali Zhu et al. “DeepFlow: Deep learning-based malware detection by mining Android application for abnormal usage of sensitive data”. In: 2017 IEEE Symposium on Computers and Communications (ISCC). IEEE. 2017, pp. 438–443.
