Automatic Identification and Recovery of Obfuscated Android Apps
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Identification and Recovery of Obfuscated Android Apps Glanz, Leonid (2020) DOI (TUprints): https://doi.org/10.25534/tuprints-00014647 License: CC-BY-SA 4.0 International - Creative Commons, Attribution Share-alike Publication type: Ph.D. Thesis Division: 20 Department of Computer Science Original source: https://tuprints.ulb.tu-darmstadt.de/14647 Automatic Identification and Recovery of Obfuscated Android Apps Vom Fachbereich Informatik der Technischen Universität Darmstadt genehmigte Dissertation zur Erlangung des akademischen Grades eines Doktor-Ingenieurs (Dr.-Ing.) vorgelegt von Leonid Glanz, M.Sc. geboren in Karabulak (Kasachstan). Referent: Prof. Dr.-Ing. Mira Mezini Korreferent: Prof. Dr.-Ing. Awais Rashid Datum der Einreichung: 15. Oktober 2020 Datum der mündlichen Prüfung: 25. November 2020 Erscheinungsjahr 2020 Darmstädter Dissertationen D17 . Glanz, Leonid : Automatic Identification and Recovery of Obfuscated Android Apps Darmstadt, Technische Universität Darmstadt Jahr der Veröffentlichung der Dissertation auf TUprints: 2020 URN: urn:nbn:de:tuda-tuprints-146479 URL: https://tuprints.ulb.tu-darmstadt.de/id/eprint/14647 Tag der mündlichen Prüfung: 25.11.2020 Veröffentlicht unter CC BY-SA 4.0 International https://creativecommons.org/licenses/ Preface While looking for a method to break big problems into small solvable units, I discovered programming for me. As a bachelor student, I became aware of the Goal-Question- Metric approach (GQM). It allowed me to make all goals accessible by asking detailed questions and using appropriate measurement methods to answer them. In the course of my master’s studies, I became aware of code analysis and obfuscation, which offered me an exciting challenge and demanded my knowledge. This ledtomy decision to pursue these two topics in my dissertation. The GQM approach allowed me to convert all goals into measurement parameters. However, I could not always define the exact thresholds for the measurements from which action should be taken. Since these thresholds often depended on unknown criteria, I taught myself to use machine learning techniques during my time as a Ph.D. student. With these techniques, I was able to extract the desired thresholds as patterns and rules from previous decisions and the collected data. Using all these techniques, I have designed and developed, in this dissertation, a collec- tion of deobfuscation approaches. These approaches support analysts in the automatic identification and recovering of obfuscated Android apps. I started my Ph.D. student job in the project of Ben Herman under the supervision of Professor Mira Mezini in the Software Technology Group (STG) at the Technische Universität Darmstadt. In this project, I learned more about vulnerability and malware detection, which helped me to find an application for my deobfuscation challenge. Afterward, I worked onother projects for several years, which deepened my knowledge and led to this dissertation. My life as well as this dissertation would have been more difficult without the help of others. In the following, I try to thank all the people who supported me on my way: First of all, I would like to thank my supervisor Professor Mira Mezini, who showed me new ways that demanded and enhanced my knowledge of software engineering and IT security. She helped me develop a clearer view of these topics and improve the structure of my thinking processes. Thank you for all your efforts that supported me to getto this point of the long doctoral process. I also want to thank Professor Awais Rashid for being the second examiner of my dissertation. I am grateful for the long hours you spent on carefully reviewing my work. My work would not have gotten this far without the help of people from the security subgroup with whom I could discuss my ideas. I would like to thank Lars Baumgärt- ner, Michael Eichberg, Dominik Helm, Ben Hermann, Florian Kübler, Johannes Lerch, Patrick Müller, Krishna Narasimhan, Michael Reif, and Anna-Katharina Wickert for your feedback that gave me new thinking impulses and for your patience when I pre- sented new ideas involving black magic (machine learning). Over the years, I enjoyed working with great students who helped me advance my 3 research projects and did a great deal of the implementation work. I would like to thank Florian Breitfelder, Mattis Manfred Kämmerer, Patrick Müller, and Jonas Schlitzer and hope that they continue to be successful in their careers. Of course, I also want to thank my colleagues, who took the time to proofread this thesis and gave me their feedback. First of all, I would like to thank Lars Baumgärtner, who read several chapters over and over again. Then I would like to thank Sven Amann, Dominik Helm, Ragnar Mogk, Patrick Müller, Krishna Narasimhan, Michael Reif, Aditya Oak, and Pascal Weisenburger. Their feedback helped me to improve the clarity of this thesis. The last years would not be as enjoyable and fruitful without my colleagues Sven Amann, Lars Baumgärtner, Andi Bejleri, Oliver Bracevac, Ervina Cergani, Joscha Drech- sler, Michael Eichberg, Matthias Eichholz, Sebastian Erdweg, Sylvia Grewe, Dominik Helm, Ben Hermann, Sven Keidel, Matthias Krebs, Mirko Köhler, Florian Kübler, Edlira Kuci, Johannes Lerch, Ingo Maier, Nafise Eskandani Masoule, Annette Miller, Ragnar Mogk, Patrick Müller, Sarah Nadi, Krishna Narasimhan, Aditya Oak, Sebastian Proksch, Michael Reif, David Richter, Tobias Roth, Guido Salvaneschi, Simon Schönwälder, Jan Sinschek, Daniel Sokolowski, Jurgen van Ham, Manuel Weiel, Pascal Weisenburger, and Anna-Katharina Wickert. Ultimately, I would like to thank Gudrun Harris and Claudia Roßman. Gudrun, you are the soul of the STG and a master of administration. I would surely have fallen into some pitfalls when it concerns paperwork. You ensured that we were well funded, fulfilled all regulations, and always had an open ear for our troubles. Claudia Roßman will be your successor, and I want to thank her for the help with the dissertation registration. I hope we will work together as well as we did with Gudrun. Many thanks to you both for the continued assistance. Last but not least, I would like to thank my wife, my family, and my friends. Thank you all for the support with small and bigger steps in my life, and I want to share many more amazing moments with you. Editorial notice: Throughout this thesis, I use the term “we” and “us” to describe my work. This is meant to underline that research is always a cooperative effort. I would have much less (if something at all) to present here if other people had not taken time off their own work to review, discuss, and contribute to mine. I am deeply gratefulfor their effort. 4 Abstract Every day, developers add new applications (apps) to the Google Play Store, which ease users’ lives and entertain them. The rapid development of these apps is only possible through the provision of software libraries whose functionality can be directly integrated into an app without creating it from scratch. For instance, some libraries provide ex- tended possibilities for displaying content or additional support for specific networking capabilities. All these libraries are bundled with the main application code into one binary to avoid delays due to the loading of external functionality. The wide distribution of Android apps and the high turnover in this market attracts criminal actors (attackers). These attackers decompile the apps, integrate additional ad libraries or malware, and republish them. Through this approach, they use the apps’ popularity to trick users into downloading their repackaged apps. To prevent such malicious practices, developers and producers of libraries have begun to obfuscate their apps to make the decompilation process more challenging. However, the obfuscation of apps is not only done by developers but also by attackers to make the detection of copyright infringement harder or hide their malicious intent. While analysts try to protect the developers’ copyright and the privacy and security of app users, code obfuscation hinders them from identifying libraries and repackaged apps, detecting obfuscated names and strings, and recovering them. The obfuscated code might contain not only malware but also vulnerabilities or unauthorized access to private data. This dissertation introduces different approaches that support the analyses mentioned above using static analysis, dynamic analysis, and machine learning. Since the obfusca- tion of repackaged apps makes it difficult to distinguish between library and app code, we present approaches for library detection, separation of app code, and mapping of library code. We evaluated the effectiveness of these approaches under the influence of different obfuscation techniques. Furthermore, we present our approach for identifying repackaged apps that uses our library identification to measure the similarity between repackaged and original apps without the influence of library code, which would distort the measurement. Further contributions support the recovery of names from obfuscated entities in library code. Finally, we presented an approach for identifying and recovering obfuscated strings that supports data-flow analyses. Using our approaches, we outperformed all state-of-the-art competitors. Furthermore, we analyzed in total over 100,000 apps for obfuscated names, obfuscated libraries, ob- fuscated strings, and repackaged apps. 5 Zusammenfassung Jeden Tag fügen Entwickler dem Google Play Store neue Anwendungen (Apps) hinzu, die das Leben