Finding Substitutable Binary Code by Synthesizing Adapters
Total Page:16
File Type:pdf, Size:1020Kb
Finding Substitutable Binary Code By Synthesizing Adapters Vaibhav Sharma Kesha Hietala Stephen McCamant Department of Computer Science and Department of Computer Science Department of Computer Science and Engineering University of Maryland Engineering University of Minnesota College Park, MD 20742 University of Minnesota Minneapolis, MN 55455 [email protected] Minneapolis, MN 55455 [email protected] [email protected] Abstract—Independently developed codebases typically contain by comparing their behavior to known implementations, and many segments of code that perform same or closely related locating multiple versions of a function to be run in parallel to operations (semantic clones). Finding functionally equivalent provide security through diversity [4], and reverse engineering segments enables applications like replacing a segment by a more efficient or more secure alternative. Such related segments a fragment of code to its intended semantic functionality. often have different interfaces, so some glue code (an adapter) is Our technique works by searching for a wrapper that needed to replace one with the other. We present an algorithm can be added around one function’s interface to make it that searches for replaceable code segments at the function level equivalent to another function. We consider wrappers that by attempting to synthesize an adapter between them from some transform function arguments and return values. Listing 1 family of adapters; it terminates if it finds no possible adapter. We implement our technique using (1) concrete adapter enumeration shows implementations in two commonly-used libraries of based on Intel’s Pin framework (2) binary symbolic execution, the isalpha predicate, which checks if a character is a letter. and explore the relation between size of adapter search space and Both implementations follow the description of the isalpha total search time. We present examples of applying adapter syn- function as specified in the ISO C standard, but the glibc thesis for improving security and efficiency of binary functions, implementation signifies the input is a letter by returning 1024, deobfuscating binary functions, and switching between binary implementations of RC4. We present two large-scale evaluations, while the musl implementation returns 1 in that case. (1) we run adapter synthesis on more than 13,000 function int musl_isalpha(int c) { pairs from the Linux C library, (2) using more than 61,000 return ((unsigned)c|32)-’a’ < 26; fragments of binary code extracted from a ARM image built } for the iPod Nano 2g device and known functions from the VLC int glibc_isalpha(int c) { media player, we evaluate our adapter synthesis implementation return table[c] & 1024; on more than a million synthesis tasks . Our results confirm that } several instances of adaptably equivalent binary functions exist int adapted_isalpha(int c) { in real-world code, and suggest that adapter synthesis can be return (glibc_isalpha(c) != 0) ? 1 : 0; applied for reverse engineering and for constructing cleaner, less } buggy, more efficient programs. Listing 1. musl and glibc implementations of the isalpha predicate and a wrapper around the glibc implementation that is equivalent to the musl I. INTRODUCTION implementation When required to write an implementation for matrix The glibc implementation can be adapted to make it equivalent multiplication, the average programmer will come up with to the musl implementation by replacing its return value, if arXiv:1707.01536v3 [cs.SE] 30 Nov 2017 a naive implementation in a matter of minutes. However, non-zero, by 1 as shown by the adapted isalpha function. much research effort has been invested into creating more This illustrates the driving idea of our approach: to check efficient matrix multiplication algorithms [56], [8], [16]. On whether two functions f1 and f2 are different interfaces to the attempting to replace the naive implementation with an imple- same functionality, we can search for a wrapper that allows mentation of a more efficient matrix multiplication algorithm, f1 to be replaced by f2. the programmer is likely to encounter interface differences, We refer to the function being wrapped around as the inner such as taking its arguments in a different order. In this function and the function being emulated as the target func- paper we present a technique that automates the process of tion. In the example above, the inner function is glibc isalpha finding functions that match the behavior specified by an and the target function is musl isalpha. We refer to the existing function, while also discovering the necessary wrapper wrapper code automatically synthesized by our tool as an needed to handle interface differences between the original adapter. Our adapter synthesis tool searches in the space of and discovered functions. Other use cases for our technique in- all possible adapters allowed by a specified adapter family for clude replacing insecure dependencies of off-the-shelf libraries an adapter that makes the behavior of the inner function f2 with bug-free variants, deobfuscating binary-level functions equivalent to that of the target function f1. We represent that such an adapter exists by the notation f1 f2. Note that this Inputs: adaptability relationship may not be symmetric: a b does reference, target code not imply b a. To efficiently search for an adapter, we use adapter family FA The target code is counterexample guided inductive synthesis (CEGIS) [52]. An default adapter substitutable with the adapter family is represented as a formula for transforming empty test list current adapter values controlled by parameters: each setting of these pa- CheckAdapter rameters represents a possible adapter. Each step of CEGIS Q: Is there an input on which the A: No allows us to conclude that either a counterexample exists for reference and target code outputs differ with the current adapter? the previously hypothesized adapter, or that an adapter exists that will work for all previously found tests. We use binary symbolic execution both to generate counterexamples and to A: Yes, A is a A: Yes, x1, …, xn is find new candidate adapters; the symbolic execution engine suitable adapter a counterexample internally uses a satisfiability modulo theories (SMT) solver. (set the current (add x1, …, xn to adapter to A) the test list) We contrast the performance of binary symbolic execution for adapter search with an alternate approach that uses a randomly-ordered enumeration of all possible adapters. We SynthesizeAdapter always restrict our search to a specified finite family of Q: Is there an adapter such that the A: No adapters, and also bound the size of function inputs. output of the reference and target code is the same on all current tests? We show that adapter synthesis is useful for a variety of soft- The target code is not ware engineering tasks. One of our automatically synthesized substitutable with any adapters creates adaptable equivalence between a naive im- adapter in FA plementation of matrix multiplication and an implementation of Strassen’s matrix multiplication algorithm. We also demon- Fig. 1. Counterexample-guided adapter synthesis strate the application of adapter synthesis to deobfuscation by deobfuscating a heavily obfuscated implementation of CRC-32 checksum computation. We find adaptable equivalence modulo Section II presents our algorithm for adapter synthesis and a security bug caused by undefined behavior. Two other pairs describes our adapter families. Section III describes our im- of our automatically synthesized adapters create adaptable plementation, and Section IV presents examples of application equivalence between RC4 setup and encryption functions in of adapter synthesis, large-scale evaluations, and a comparison mbedTLS (formerly PolarSSL) and OpenSSL. We can use of two adapter search implementations. Section V discusses binary symbolic execution both to generate counterexamples limitations and future work, Section VI describes related work, and to find new candidate adapters. Our notion of adapter and Section VII concludes. correctness only considers code’s behavior, so we can de- II. ADAPTER SYNTHESIS tect substitutability between functions that have no syntactic similarity. We explore the trade-off between using concrete A. An Algorithm for adapter Synthesis enumeration and binary symbolic execution for adapter search. The idea of counterexample-guided synthesis is to al- Guided by this experiment, we show that binary symbolic ternate between synthesizing candidate adapter expressions, execution-based adapter synthesis can be used for reverse and checking whether those expressions meet the desired engineering at scale. We use the Rockbox project [49] to create specification. When a candidate adapter expression fails to an a ARM-based 3rd party firmware image for the iPod Nano meet the specification, a counterexample is produced to guide 2g device and identify more than 61,000 target code fragments the next synthesis attempt. We are interested in synthesizing from this image. We extract reference functions from the adapters that map the arguments of the target function to the VLC media player [57]. Using these target code fragments arguments of the inner function, and map the return value and reference functions, our evaluation completes more than of the inner function to that of the target function, in such 1.17 million synthesis tasks. Each synthesis task navigates an a way that the behavior of the two functions match. Our adapter search space of more than 1.353