Getting Revenge: a System for Analyzing Reverse Engineering Behavior
Total Page:16
File Type:pdf, Size:1020Kb
Getting RevEngE: A System for Analyzing Reverse Engineering Behavior Claire Taylor Christian Collberg Department of Computer Science University of Arizona Tucson, AZ 85721 [email protected], [email protected] Abstract Code obfuscation is a popular technique used by white- ⓵ ⓶ ⓷ Generate 1 Script2 Challenge Script Generate Obfuscate Compile as well as black-hat developers to protect their software Asset Random p0.c p1.c p2.exe Program Program Program from Man-At-The-End attacks. A perennial problem has Seed1 Seed2 been how to evaluate the power of different obfuscation Solve Challenge ⓸ Analyze Results techniques. However, this evaluation is essential in order Reward ($) ⓹ Correctness Virtual Machine p3.c to be able to recommend which combination of techniques & Precision Reverse User p2.exe Engineering to employ for a particular application, and to estimate how Monitor Tools User ⓺ Analysis & long protected assets will survive in the field. We describe a Actions Visualization system, RevEngE, that generates random obfuscated chal- lenges, asks reverse engineers to solve these challenges Figure 1: System Overview. (with promises of monetary rewards if successful), monitors and collects data on how the challenges were solved, and verifies the correctness of submitted solutions. Finally, the technique. system analyzes the actions of the engineers to determine Attacks on obfuscation (known as reverse engineering the sequence and duration of techniques employed, with the or deobfuscation) aim to defeat the obfuscating transforma- ultimate goal of learning the relative strengths of different tions by extracting a close facsimile of the asset a from P 0. combinations of obfuscations. Much work has gone into developing methods to evalu- ate obfuscating transformations. Fundamental questions we may want to ask include (a) “how long will asset a survive 1 Introduction in the field in a program P 0 that has been protected with Code Obfuscation seeks to protect valuable assets con- obfuscating transform T ?”; (b) “how do transformations T1 tained in software from those who have full access to it by and T2 compare with respect to their protective power and making the software difficult to analyze. An obfuscator O performance degradation?”; and (c) “what level of diversity applies a sequence of code transformations T1 ◦T2 ◦T3 ◦::: can transformation T induce?” to an input program P containing an asset a, transforming it Unfortunately, the validity of any theoretical models we into a program P 0 = O(P ), such that P and P 0 are seman- develop to answer such questions will ultimately depend on tically identical, but extracting a from P 0 is more difficult how we define the power of the adversary. than extracting a from P . Common assets include crypto- In the project we report on here, we present a system, graphic keys, proprietary algorithms, security checks, etc. RevEngE (Reverse Engineering Engine), designed to build Obfuscating transformations are also used to induce diver- models from the behavior of reverse engineers. Figure 1 sity: given an original program P , a diversifying obfusca- shows an overview of the system. Specifically, RevEngE tor generates a large number of semantically identical but generates a secret random program p0:c (point 1 in Fig- 0 0 syntactically different programs fP1;P2;:::g. Diversity is ure 1) and transforms it with a variety of obfuscating trans- a defense against class attacks, where a single attack can formations to a program p1:c (point 2) to create a collec- successfully target all programs protected with a particular tion of reverse engineering challenges. These challenge programs are made available to reverse engineering experts 2.2 Analyzing Obfuscation Resilience anywhere to solve (point 3). Experts download the chal- There exist many obfuscating transforms and reverse en- 2 lenge p :exe into a virtual machine (point 4) equipped with gineering methods. Some are known through the academic a data collection subsystem that monitors and stores the en- literature, others are kept proprietary by purveyors of obfus- gineers’ behavior as they attempt the challenges. The re- cation tools, nation-states, and hackers. Furthermore, there 3 verse engineers submit a de-obfuscated solution p :c which are innumerable ways to combine such methods [19]. is checked for correctness (point 5). To act as an incentive Evaluating the efficacy of strategies for obfuscation and for accomplished reverse engineers to participate, substan- reverse engineering has proven to be difficult: for a given tial monetary awards are handed out to successful partici- obfuscation, many reverse engineering techniques might pants. The recorded user actions are analyzed (point 6) to counter it, and for a given reverse engineering technique create reverse engineering attack models, such as visualiza- many obfuscations can render it ineffective. tions, trees, and Petri nets. The anonymized data set will be made available to the community for further analysis. 2.2.1 Normative Evaluation This paper is organized as follows. Section 2 discusses Some work attempts to reduce the complexity of an ob- prior work on code obfuscation. Section 3 presents the de- fuscation technique to a known hard problem [16]. For sign of RevEngE. Section 4 presents an evaluation of the example, it has been argued that opaque constants can be system, and Section 5 concludes. based on hard problems [31]. Unfortunately, this type of ar- gument is fraught with issues. For example, just because a 2 Related Work problem is hard in the worst case does not mean that a ran- dom instance will be, or that probabilistic algorithms cannot We will next discuss prior work on obfuscation, obfus- give an acceptable solution in reasonable amounts of time, cation evaluation techniques, and the use of challenge prob- or that the size of the problem will be sufficient. Addition- lems to advance computer security. ally, assumptions made for normative evaluation—such as assuming static analysis only—often do not align with real 2.1 Obfuscation world techniques. Finally, the human element and intuition is not captured by normative evaluation. Obfuscating code transformations to protect assets in software date back at least to the early 1990s [17]. Progress 2.2.2 Software Metrics was initially driven by the needs to protect cryptographic It has been argued [16] that some software engineer- operations in Digital Rights Management systems [20, 12] ing metrics, such as control flow complexity, are related to and to hide algorithms in easily decompilable languages the difficulty of reverse engineering an obfuscated program. like Java [16]. Diversification also motivated engineers, However, this notion does not hold much promise, as empir- both to protect operating system mono-cultures against at- ical studies have shown no correlation between such metrics tacks by malware [13], and, interestingly, to protect mal- and human ability to reverse engineer [30]. ware from detection [27]. The literature is rife with descriptions of obfuscating 2.2.3 Empirical Studies code transformations [15]. They typically fall in three cate- Empirical analysis seeks to demonstrate the resilience of gories: control flow, data, and abstraction transformations. an obfuscation technique by testing it against reverse engi- Examples of control flow transformations include control neering attacks. Generally, empirical analysis focuses on flow flattening which removes existing control flow [33] two questions: and opaque predicate insertion which adds bogus control flow [16]. An extreme version of flattening is virtualiza- 1. How well does an obfuscation technique resist an au- tion [36], in which the obfuscator generates a random vir- tomated attack, using a particular tool or code analysis tual instruction set (V-ISA) for the target program P , trans- algorithm? lates P into this V-ISA, and generates a interpreter that 2. How well does an obfuscation technique resist a hu- executes programs in the V-ISA. Data obfuscations trans- man attacker attempting to reverse engineer the pro- form variables into a different representation, for example tected code? by encrypting them, xor:ing them with a random constant, or converting them into a representation based on Mixed For the first question, researchers generate obfuscated Boolean Arithmetic [38]. Examples of abstraction transfor- code with an obfuscator and then run an automated tool on mations, finally, include class hierarchy flattening [18] and that code [2, 4, 3]. Success metrics include whether the au- traditional optimizing transformations such as inlining, out- tomated tool extracted the protected asset or not, and the lining, and loop unrolling. amount of resources (time and memory) consumed. 2.2.4 Human Studies with high quality subjects, incurs large costs, drastically Ceccato et al. [8] argue that there is a ”Need for More limiting scalability, both in terms of subject sample size and Human Studies to Assess Software Protection.” Several the diversity of the transforms tested. such experiments have been carried out. In [10], Ceccato et al. conducted experiments to analyze the potency of an 3 Methodology identifier renaming transform using Masters and PhD stu- Our goal is to collect and analyze data about the strate- dents as subjects. In [9] they expanded upon this earlier gies employed by reverse engineers. This data can be used work with more students and more transforms. Viticchie´ et to construct behavioral models which, in turn, can be used al. [32] conducted a similar