Native x86 Decompilation Using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring

Edward J. Schwartz, Carnegie Mellon University; JongHyup Lee, Korea National University of Transportation; Maverick Woo and David Brumley, Carnegie Mellon University

This paper is included in the Proceedings of the 22nd USENIX Security Symposium, August 14-16, 2013, Washington, D.C., USA. ISBN 978-1-931971-03-4. Open access to the Proceedings of the 22nd USENIX Security Symposium is sponsored by USENIX.

Abstract

There are many security tools and techniques for analyzing software, but many of them require access to source code. We propose leveraging decompilation, the study of recovering abstractions from compiled code, to apply existing source-based tools and techniques to compiled programs. To be used for security, a decompiler should focus on two properties. First, it should recover abstractions as much as possible to minimize the complexity that must be handled by the security analysis that follows. Second, it should aim to recover these abstractions correctly.

Previous work in control-flow structuring, an abstraction recovery problem used in decompilers, provides neither of these properties. Specifically, existing structuring algorithms are not semantics-preserving, which means that they cannot safely be used for decompilation without modification. Existing structural algorithms also miss opportunities for recovering control flow structure. In this paper we propose a new structuring algorithm that addresses these problems.

We evaluate our decompiler, Phoenix, and our new structuring algorithm on a set of 107 real-world programs from GNU coreutils. Our evaluation is an order of magnitude larger than previous systematic studies of end-to-end decompilers. We show that our decompiler outperforms the de facto industry standard decompiler Hex-Rays in correctness by 114%, and recovers 30× more control-flow structure than existing structuring algorithms in the literature.

1 Introduction

Security analyses are often faster and easier when performed on source code rather than on binary code. For example, while the runtime overhead introduced by source-based taint checkers can be as low as 0.65% [12], the overhead of the fastest binary-based taint checker is over 150% [8]. In addition, many security analyses described in the literature assume access to source code. For instance, there are numerous source-based static vulnerability finding tools such as KINT [40], RICH [9], and Coverity [6], but equivalent binary-only tools are scarce.

In many security scenarios, however, access to source code is simply not a reasonable assumption. Common counterexamples include analyzing commercial off-the-shelf software for vulnerabilities and reverse engineering malware. The traditional approach in security has been to directly apply some form of low-level binary analysis that does not utilize source-level abstractions such as types and functions [5, 7, 10, 24]. Not surprisingly, reasoning at such a low level makes binary analysis more complicated and less scalable than source analysis.

We argue that decompilation is an attractive alternative to traditional low-level binary-based techniques. At its surface, decompilation is the recovery of a program's source code given only its binary. Underneath, decompilation consists of a collection of abstraction recovery mechanisms such as indirect jump resolution, control flow structuring, and data type reconstruction, which recover high-level abstractions that are not readily available in the binary form. Our insight is that by reusing these mechanisms, we can focus our research effort on designing security analyses that take advantage of such abstractions for accuracy and efficiency. In fact, when taken to an extreme, we may even use decompilation to leverage an existing source-based tool—be it a vulnerability scanner [27], a taint engine [12], or a bug finder [6]—by applying it to the decompiled program code.
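To make these mechanisms concrete, here is a small illustration of our own (it does not come from the paper's evaluation, and all names are invented). The first C fragment mimics a direct, abstraction-free translation of binary code: control flow is a tangle of gotos over raw machine registers. The second fragment is the kind of output that control-flow structuring and data type reconstruction together aim to recover.

    /* Direct translation of the binary: conditional jumps become
       gotos, and registers become untyped integer temporaries. */
    int f(int a0) {
        int r1 = 0;                 /* accumulator */
        int r2 = 0;                 /* counter */
    L1: if (r2 >= a0) goto L2;      /* loop exit test */
        r1 += r2;
        r2 += 1;
        goto L1;                    /* back edge */
    L2: return r1;
    }

    /* After abstraction recovery: the same computation with its loop
       structure and types recovered (names invented for clarity). */
    int sum_below(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += i;
        return sum;
    }

A source-based tool pointed at the second fragment can reason about one loop with an obvious bound, instead of having to reconstruct that fact from the goto graph.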
Of course, decompilation is also extremely beneficial in situations where manual analysis is required. For example, practitioners often reverse-engineer program binaries to understand proprietary file formats, study vulnerabilities fixed in patches, and determine the exploitability of crashing inputs. Arguably, any one of these tasks becomes easier when given access to source code.

Unfortunately, current research in decompilation does not directly cater to the needs of many security applications. To be used for security, a decompiler should focus on two properties. First, it should recover abstractions as much as possible to minimize the complexity that must be handled by the actual security analysis that follows. Second, it should aim to recover these abstractions correctly. As surprising as it may sound, previous work on decompilation almost never evaluated correctness. For example, Cifuentes et al.'s pioneering work [13] and numerous subsequent works [11, 14, 16, 39] all measured either how much smaller the output C code was in comparison to the input assembly, or how the output scored on some subjective readability metric.

In this paper, we argue that source can be recovered in a principled fashion. As a result, security analyses can better take advantage of existing source-based techniques and tools both in research and practice. Security practitioners can also recover correct, high-level source code, which is easier to reverse engineer. In particular, we propose techniques for building a correct decompiler that effectively recovers abstractions. We implement our techniques in a new end-to-end binary-to-C decompiler called Phoenix and measure our results with respect to correctness and high-level abstraction recovery.

Phoenix makes use of existing research on principled abstraction recovery where possible. Source code reconstruction requires the recovery of two types of abstractions: data type abstractions and control flow abstractions. Recent work such as TIE [28], REWARDS [29], and Howard [38] has largely addressed principled methods for recovering data types. In this paper, we investigate new techniques for recovering high-level control structure.

1.1 The Phoenix Structural Analysis Algorithm

Previous work has proposed mechanisms for recovering high-level control flow based on the structural analysis algorithm and its predecessors [20, 23, 39]. However, they are problematic because they (1) do not feature a correctness property that is necessary to be safely used for decompilation, and (2) miss opportunities for recovering control flow structure. Unfortunately, these problems can cause a security analysis using the recovered control structures to become unsound or scale poorly. These problems motivated us to create our own control flow structuring algorithm for Phoenix. Our algorithm is based on structural analysis, but addresses the problems found in previous work. In particular, we identify a new property that structural analysis algorithms should have to be safely used for decompilation, called semantics-preservation. We also propose iterative refinement as a strategy for recovering additional structure.

Semantics Preservation. Structural analysis [32, p. 203] is a control flow structuring algorithm that was originally invented to help accelerate data flow analysis. Later, decompiler researchers adapted this algorithm to reconstruct high-level control flow structures such as if-then-else and do-while from a program's control flow graph (see §2.1). We propose that structuring algorithms should be semantics-preserving to be safely used in decompilers. A structuring algorithm is semantics-preserving if it always transforms the input program to a functionally equivalent program representation. Semantics-preservation is important for security analyses to ensure that the analysis of the structured program also applies to the original binary. Surprisingly, we discovered that common descriptions of structural analysis algorithms are not semantics-preserving. For example, in contrast to our natural loop schema in Table 4, other algorithms employ a schema that permits out-going edges (e.g., see [20, Figure 3]). This can lead to incorrect decompilation, such as the example in Figure 3. We demonstrate that fixing this and other schemas to be semantics-preserving increases the number of utilities that Phoenix is able to correctly decompile by 30% (see §4).
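To see why out-going edges are dangerous, consider the following sketch of our own (the helper functions are hypothetical, and this is not the program from Figure 3). In run_direct, the loop body has a second out-going edge leading to an error exit:

    extern int more_input(void);       /* hypothetical helpers */
    extern int read_failed(void);
    extern void process(void);

    /* Control flow as it appears in the binary: the loop body has a
       second out-going edge, to the error exit. */
    int run_direct(void) {
    head:
        if (!more_input()) goto done;
        if (read_failed()) goto error;  /* second out-going edge */
        process();
        goto head;
    done:
        return 0;
    error:
        return -1;
    }

    /* A loop schema that permits out-going edges may still match this
       region as a plain while loop; one possible incorrect result
       loses the error exit and is NOT functionally equivalent. */
    int run_unsound(void) {
        while (more_input())
            process();
        return 0;
    }

    /* A semantics-preserving structuring keeps the extra edge, for
       example as an explicit goto, so the output stays equivalent. */
    int run_sound(void) {
        while (more_input()) {
            if (read_failed()) goto error;
            process();
        }
        return 0;
    error:
        return -1;
    }

An analysis run on run_unsound would wrongly conclude that the error path cannot be reached; that mismatch between the structured program and the original binary is exactly what semantics-preservation rules out.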
Iterative Refinement. When structural analysis algorithms encounter unstructured code, they stop recovering structure in that part of the program. Our algorithm instead iteratively refines the graph to continue making progress. The basic idea is to select an edge in the graph that is preventing the algorithm from making progress, and to represent it using a goto in the decompiled output. This may seem counter-intuitive, since more gotos implies less structure recovered. However, by removing the edge from the graph, the algorithm can make more progress and recover more structure. We also show how refinement enables the recovery of switch structures. In our evaluation, we demonstrate that iterative refinement recovers 30× more structure than structural analysis algorithms that do not employ iterative refinement.
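As a sketch of how refinement plays out in the output (our own example, with invented node and helper names, not a figure from the paper): suppose one branch jumps into the middle of what would otherwise be an if-then-else, so no acyclic schema matches the region.

    extern int p(void), q(void);
    extern void A(void), B(void), C(void), D(void);

    /* Without refinement: no schema matches, the algorithm gives up
       on the region, and every edge is emitted as a goto. */
    void region_unrefined(void) {
        A();
        if (p()) goto n3;
        if (q()) goto n2;
        goto n3;
    n2: B();
        goto n4;
    n3: C();
    n4: D();
    }

    /* With refinement: the offending edge (into the else arm) is
       removed from the graph and represented by a single goto; the
       rest of the region now matches the if-then-else schema. */
    void region_refined(void) {
        A();
        if (p()) goto into_else;        /* virtualized edge */
        if (q()) {
            B();
        } else {
        into_else:
            C();
        }
        D();
    }

Both functions compute the same thing, but region_refined keeps only one goto instead of four: removing a single edge lets the algorithm recover strictly more structure from the rest of the region.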