Data Structure Identification from Executions of Pointer Programs

Schriften aus der Fakultät Wirtschaftsinformatik und 41 Angewandte Informatik der Otto-Friedrich-Universität Bamberg Data Structure Identifcation from Executions of Pointer Programs Thomas Rupprecht 41 Schriften aus der Fakultät Wirtschaftsinformatik und Angewandte Informatik der Otto-Friedrich- Universität Bamberg Contributions of the Faculty Information Systems and Applied Computer Sciences of the Otto-Friedrich-University Bamberg Schriften aus der Fakultät Wirtschaftsinformatik und Angewandte Informatik der Otto-Friedrich- Universität Bamberg Contributions of the Faculty Information Systems and Applied Computer Sciences of the Otto-Friedrich-University Bamberg Band 41 2020 Data Structure Identifcation from Executions of Pointer Programs Thomas Rupprecht 2020 Bibliographische Information der Deutschen Nationalbibliothek Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Na- tionalbibliographie; detaillierte bibliographische Informationen sind im Internet über http://dnb.d-nb.de/ abrufbar. Diese Arbeit hat der Fakultät Wirtschaftsinformatik und Angewandte Informatik der Otto-Fried- rich-Universität Bamberg unter dem Titel „Data Structure Identifcation from Executions of Pointer Programs“ als Dissertation vorgelegen. 1. Gutachter: Prof. Dr. Gerald Lüttgen, University of Bamberg 2. Gutachter: Prof. Dr. Thomas Noll, RWTH Aachen University Tag der mündlichen Prüfung: 7. November 2019 Dieses Werk ist als freie Onlineversion über das Forschungsinformationssystem (FIS; fis. uni-bamberg.de/) der Universität Bamberg erreichbar. Das Werk – ausgenommen Cover, Zitate und Abbildungen – steht unter der CC-Lizenz CC-BY. Lizenzvertrag: Creative Commons Namensnennung 4.0 http://creativecommons.org/licenses/by/4.0 Herstellung und Druck: docupoint, Magdeburg Umschlaggestaltung: University of Bamberg Press © University of Bamberg Press, Bamberg 2020 http://www.uni-bamberg.de/ubp/ ISSN: 1867-7401 ISBN: 978-3-86309-717-2 (Druckausgabe) eISBN: 978-3-86309-718-9 (Online-Ausgabe) URN: urn:nbn:de:bvb:473-irb-473250 DOI: http://dx.doi.org/10.20378/irb-47325 Dedicated to my parents Elli and Klaus, my wife Steffi and our son Henrik. Acknowledgments I want to thank Dr. David White and Prof. Gerald Lüttgen for making this dissertation thesis possible. Further I am very glad about the collaboration with Prof. Herbert Bos together with Dr. Xi Chen from the VU Amsterdam in the Nether- lands and Dr. Tobias Mühlberg from the KU Leuven in Belgium. However, this work would not have come into existence without the great sup- port and patience throughout the years from my parents and my wife. Thank you! I also want to thank all my colleagues from the Software Technologies Research Group for lots of interesting discussions and good times, e.g., the country side tour will (mostly) be unforgotten and your wedding gift still resides prominently in our living room. Special thanks to Jan Boockmann from our group, for being an invaluable help on so many occasions. Additionally, I’m happy about my friends from my home town Hundelshausen and the neighbouring Altmannsdorf, who went with me through good and bad times since I was born. Further, I want to list some great human beings I have met during my life that are worth mentioning for reasons they know best them- selves (in order of appearance in my life): Peter Hofmann, Daniel Finster, Chris- tian Wiegand, Leslie Polzer and Tobias Zeck. Finally, I want to give my appreciation to some outstanding individuals that made or still make music that motivates me: Kevin, Lemmy, James. Abstract The reverse engineering of binaries is a tedious and time consuming task, yet mandatory when the need arises to understand the behaviour of a program for which source code is unavailable. Instances of source code loss for old arcade games1 and the steadily growing amount of malware2 are prominent use cases requiring reverse engineering. One of the challenges when dealing with binaries is the loss of low level type information, i.e., primitive and compound types, which even state-of-the-art type recovery tools often cannot reconstruct with full accuracy. Further programmers most commonly use high level data structures, such as linked lists, in addition to primitive types. Therefore detection of dynamic data structure shapes is an important aspect of reverse engineering. Though the recognition of dynamic data structure shapes in the presence of tricky programming concepts such as pointer arithmetic and casts – which are both fundamental concepts to enable, e.g., the frequently used Linux kernel list3 – also bring current shape detection tools to their limits. A recent approach called Data Structure Investigator (DSI)4, aims for the detection of dynamic pointer based data structures. While the approach is general in nature, a concrete realization for C programs requiring source code is envisioned as programming constructs such as type casts and pointer arithmetic will stress test the approach. Therefore, the first research question addressed in this dissertation is whether DSI can meet its goal in the presence of the sheer multitude of existing data structure implementations. The second research question is whether DSI can be opened up to reverse engineer C/C++ binaries, even in the presence of type information loss and the variety of C/C++ programming constructs. Both questions are answered positively in this dissertation. The first is answered by realizing the DSI source code approach, which requires detailing fundamental aspects of DSI’s theory to arrive at a working implementation, e.g., handling the consistency of DSI’s memory abstraction and quantifying the interconnections found within a dynamic pointer based data structure, e.g., a parent child nesting scenario, to allow for its detection. DSI’s utility is evaluated on an extensive 1http://kotaku.com/5028197/sega-cant-find-the-source-code-for-your-favorite-old-school- arcade-games 2https://www.av-test.org/en/statistics/malware/ 3https://github.com/torvalds/linux/blob/master/include/linux/list.h 4SWT Research Group, University Bamberg, DFG-Project LU 1748/4-1 viii benchmark including real world examples (libusb5, bash6) and shape analysis7,8 examples. The second question is answered through the development of a DSI prototype for binaries (DSIbin). To compensate for the loss of perfect type information found in source code, DSIbin interfaces with the state-of-the-art type recovery tool Howard9. Notably, DSIbin improves upon type information recov- ered by Howard. This is accomplished through a much improved nested struct detection and type merging algorithm, both of which are fundamental aspects for the reverse engineering of binaries. The proposed approach is again evaluated by a diverse benchmark containing real world examples such as, the VNC clipping library, The Computer Language Benchmarks Game and the Olden Benchmark, as well as examples taken from the shape analysis literature. In summary, this dissertation improves upon the state-of-the-art of shape detection and reverse engineering by (i) realizing and evaluating the DSI approach, which includes contributing to DSI’s theory and results in the DSI prototype; (ii) opening up DSI for C/C++ binaries so as to extend DSI to reverse engineering, resulting in the DSIbin prototype; (iii) handling data structures with DSIbin not covered by some related work such as skip lists; (iv) refining the nesting detection and performing type merging for types excavated by Howard. Further, DSIbin’s ultimate future use case of malware analysis is hardened by revealing the presence of dynamic data structures in multiple real world malware samples. In summary, this dissertation advanced the dynamic analysis of data structure shapes with the aforementioned contributions to the DSI approach for source code and further by transferring this new technology to the analysis of binaries. The latter resulted in the additional insight that high level dynamic data structure information can help to infer low level type information. 5http://libusb.info/ 6https://www.gnu.org/software/ bash/ 7Predator: http://www.fit.vutbr.cz/research/groups/verifit/tools/predator/ 8Forester: http://www.fit.vutbr.cz/research/groups/verifit/tools/forester/ 9http://www.cs.vu.nl/ herbertb/papers/dde_ndss11-preprint.pdf Zusammenfassung Reverse Engineering von Binärcode ist eine schwierige und zeitaufwändige Tätig- keit, die jedoch unabdingbar ist, wenn das Programmverhalten verstanden werden muss, ohne dass Quelltext zur Verfügung steht. Fälle von Quelltextverlust für alte Computerspiele10 und die stetig wachsende Anzahl von Schadsoftware11 sind daher prominente Anwendungsfälle für Reverse Engineering. Eine der Her- ausforderungen bei der Analyse von Binärcode ist der Verlust von Typinforma- tionen, wie zum Beispiel primitiven und komplexen Datentypen. Oftmals kön- nen diese Typinformationen von den aktuellen Werkzeugen zur Typrückgewin- nung, die den Stand der Technik repräsentieren, nicht vollumfänglich und korrekt rekonstruiert werden. Weiterhin verwenden Programme zusätzlich zu den primitiven und komplexen Datentypen meist höhere dynamische Datenstrukturen, wie zum Beispiel verkettete Listen. Daher ist die Erkennung der Form von dynamis- chen Datenstrukturen ein wichtiger Aspekt des Reverse Engineerings. Wobei die Erkennung der Formen dynamischer Datenstrukturen im Kontext von schwieri- gen Programmierkonzepten, wie Zeigerarithmetik und Typumwandlungen – bei- des fundamentale Konzepte um zum Beispiel die häufig verwendete Linux Kernel Liste12

Load more