Probabilistic Naming of Functions in Stripped Binaries
Total Page:16
File Type:pdf, Size:1020Kb
Probabilistic Naming of Functions in Stripped Binaries James Patrick-Evans Lorenzo Cavallaro Johannes Kinder Royal Holloway, University of London King’s College London Bundeswehr University Munich Egham, United Kingdom London, United Kingdom Munich, Germany [email protected] [email protected] [email protected] ABSTRACT software developers can use symbols to relate binary code to source- Debugging symbols in binary executables carry the names of func- level information such as file names, data structures and names tions and global variables. When present, they greatly simplify the of functions. When releasing software for distribution, however, process of reverse engineering, but they are almost always removed symbols are routinely stripped from the binaries; this decreases (stripped) for deployment. We present the design and implementa- the file size and impedes reverse engineering of proprietary code tion of punstrip, a tool which combines a probabilistic fingerprint (commercial or malicious). of binary code based on high-level features with a probabilistic State-of-the-art reverse engineering tools, such as the IDA Pro graphical model to learn the relationship between function names disassembler, use databases of function signatures to reliably iden- and program structure. As there are many naming conventions and tify standard functions such as those included by a statically linked developer styles, functions from different applications do not neces- C runtime. This works well for systems where libraries are stan- sarily have the exact same name, even if they implement the exact dardized and rarely recompiled. IDA Pro for example, maintains a same functionality. We therefore evaluate punstrip across three lev- directory of FLIRT signature files for the most common Windows els of name matching: exact; an approach based on natural language libraries replicated for the most common compilers and instruction processing of name components; and using Symbol2Vec, a new em- set architectures (ISAs). Reverse engineers also manually create bedding of function names based on random walks of function call custom databases of such signatures, because they can immedi- graphs. We show that our approach is able to recognize functions ately identify many functions which otherwise would have to be compiled across different compilers and optimization levels and rediscovered in new binaries through costly manual analysis. Such then demonstrate that punstrip can predict semantically similar signature-based mechanisms can allow for some variation in the function names based on code structure. We evaluate our approach exact byte sequence matched, but they do not go further than rela- over open source C binaries from the Debian Linux distribution tively simple wildcard mechanisms. and compare against the state of the art. The problem of matching sequences of binary code while al- lowing for variation presents itself in a number of domains. Code CCS CONCEPTS clone detection [9, 18, 24, 27, 41], vulnerable code identification [11], code searching [15], and software plagiarism detection [31] address • Security and privacy ! Software reverse engineering; • Com- the problem of finding exact matches between software compo- puting methodologies ! Machine learning. nents. They focus on finding a fixed set of previously seen functions KEYWORDS with the main contributions drawn from methods of identifying semantically-equivalent code that have undergone various soft- binaries, function names, machine learning ware transformations; these transformations are typical of source ACM Reference Format: code compiled with different compilers or compilation optimiza- James Patrick-Evans, Lorenzo Cavallaro, and Johannes Kinder. 2020. Proba- tions. Techniques typically adopt static or dynamic approaches that bilistic Naming of Functions in Stripped Binaries. In Annual Computer Secu- build features of a functions interpreted execution or rely on fixed rity Applications Conference (ACSAC 2020), December 7–11, 2020, Austin, USA. properties of compiler generated machine code. Patch code analy- ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3427228.3427265 sis [24, 52] borrows the same techniques from the problem domain for feature collection however it requires an existing executable 1 INTRODUCTION with prior information to perform analysis on differential updates. Reverse engineering is a crucial step in security audits of commer- Common to these domains and approaches is that they aim to cial software, forensic analysis of malware, software debugging identify exact function matches in isolation in previously seen exe- and exploit development. A main task in reverse engineering is to cutables. They do not provide a way to derive names for functions identify functional components in the software and discover the that do not have an exact (semantic or syntactic) match in the meaning behind different portions of binary code. When faced with set of known binaries. In contrast, our goal is to learn a general a flat region of executable code, it is difficult and time consuming to relationship between function names and their binary code. gain a high-level understanding of what it does. During debugging, We implement this approach with punstrip, our tool for revers- ing the stripping process and inferring symbol names in stripped ACSAC 2020, December 7–11, 2020, Austin, USA binaries. Punstrip builds a probabilistic model that learns how devel- © 2020 Association for Computing Machinery. opers use and name functions across a set of existing open source This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Annual Computer projects; using this model, we infer meaningful symbol informa- Security Applications Conference (ACSAC 2020), December 7–11, 2020, Austin, USA, tion based on similarities in program structure and semantics in https://doi.org/10.1145/3427228.3427265. previously unseen, stripped binaries. It is not necessary to discover 1 ACSAC 2020, December 7–11, 2020, Austin, USA James Patrick-Evans, Lorenzo Cavallaro, and Johannes Kinder Analysis Probabilistic Fingerprint Function Evaluation Boundaries Exact ELF Binaries Disassembly CRF ELF Exporter Features NLP VEX IR Symbol2Vec Feature Extraction Factor Graph Figure 1: A block diagram overview of the components in Punstrip. exactly the same identifiers that the developers used in the original structure for learning (§4). We then present our method for relat- program: for reverse engineering, we are interested in discovering ing function names, including Symbol2Vec (§5), before evaluating symbol names that are helpful to an analyst. With punstrip, reverse punstrip against previous work (§6). Finally, we discuss limitations engineers are able to pre-process an unknown binary to automati- of our approach (§7), contrast with related work (§8) and present cally annotate it with symbol information, saving them time and our conclusions (§9). preventing mistakes in doing further manual analysis. We make 1 punstrip available as open source . 2 OVERVIEW We make the following contributions: Figure 1 shows an architectural overview of the punstrip pipeline. • We present a novel approach to function identification and Punstrip takes as input a set of ELF binaries, which for training signature recognition for stripped binaries that uses features should be unstripped. In the initial analysis stage, punstrip extracts in a higher-level intermediate representation. This approach function symbols and their boundaries as defined in the symbol can scale to real world software and seeks to be agnostic to table, disassembles them, and lifts the instructions to the VEX inter- both compiler architectures, binary formats, and optimiza- mediate representation. From this representation and interproce- tions. dural control flow information among functions, punstrip extracts • We introduce a probabilistic graphical model for inferring a set of features that are stored in a database. Those features are function names in stripped binaries that compares the joint used to build a per-function fingerprint, as well as a factor graph probability of all unknown symbols simultaneously rather representing the relationships between functions and feature values than treating each function in isolation. The model builds on for each executable. Our probabilistic fingerprint and factor graph our probabilistic fingerprint and analysis between symbols are used to construct a Conditional Random Field (CRF) that learns in binaries. how individual functions interact with other code and data. • We describe Symbol2Vec, a new high dimensional vector After training a model on a large corpus of programs that include embedding for function symbols. We demonstrate that the symbol information, we are able to use the learned parameters embedding is meaningful by creating a set of relationships to infer the most likely function names in stripped binaries. The within the space of function names drawn from C bina- inferred function symbol names can then be added to the stripped ries distributed as part of Debian GNU/Linux. We use Sym- binary and used for debugging and reverse engineering purposes. In bol2Vec as one metric in the evaluation of punstrip to capture this paper, we focus exclusively on the problem of naming functions; relations between function names that do not share any lex- for detecting function boundaries in stripped