A Framework for Static Analysis of Dynamic Code Via Emulation
Submission Version

Emulization: a Framework for Static Analysis of Dynamic Code via Emulation

Abbas Naderi-Afooshteh, Anh Nguyen-Tuong, Jason D. Hiser, Jack W. Davidson
Department of Computer Science, University of Virginia

Abstract

Static analysis of dynamic scripting code is a very challenging problem due to the extensive use of dynamic features such as run-time code generation, dynamic aliasing, dynamic weak typing, and implicit object creation, to name a few. With the increasing popularity of dynamic languages, prior research has focused on theoretical, or practical but ad hoc, solutions to handle many of these dynamic features. Unfortunately, a comprehensive static analysis method that scales to popular real-world applications is lacking.

This work tackles the problem from a new perspective. Our approach enables us to have an accurate, scalable method for analysis of dynamic applications.

Our method, emulization, is based on a novel dynamic analysis powered by emulation, combined with static analysis. The combination provides a framework for context- and flow-sensitive dynamic and static analysis of dynamic applications.

To demonstrate its scalability and accuracy, emulization was applied to large and very popular dynamic applications such as WordPress and Joomla (262k and 472k lines of code respectively, with more than 40 different dynamic features). Our analysis of these applications resulted in the discovery of at least 7 bugs in PHP, at least 2 of which are confirmed security vulnerabilities. We also applied emulization to a benchmark group of applications commonly evaluated by prior work.

Finally, we demonstrate the simplicity of implementing a practical analysis on top of our framework by creating a taint tracker of 300 lines of code in less than 3 hours. The taint tracker confirmed several reported vulnerabilities from the benchmark and discovered two new security vulnerabilities in WordPress and Joomla.

Categories and Subject Descriptors F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—Program Analysis; D.3.4 [Programming Languages]: Processors—Interpreters; D.3.1 [Programming Languages]: Formal Definitions and Theory

General Terms Design, Security

Keywords Dynamic Analysis, Static Analysis, Emulation

1. Introduction

Static analysis of dynamic applications, such as those written in PHP, JavaScript, Python or Bash script, is an open research problem. Several challenging problems plague analysis of dynamic code. Constructing a sound or complete control flow graph (CFG), reasoning about dynamic evaluations (i.e., calls to eval), type inference, error handling and alias inference are some of these issues.

Most existing solutions try to extrapolate static analysis methods developed for strongly typed, static (or less dynamic) languages such as C or Java to dynamic languages, while providing novel solutions for some of the problems specific to dynamic languages (such as dynamic includes). None of these approaches has demonstrated analysis of dynamic real-world applications scaling above 100k lines of code.

Another sizable category of solutions uses a combination of static and dynamic analysis, while employing a wide range of ad hoc approaches to tackle challenges of statically analyzing dynamic code. These solutions either use dynamic analysis on small basic blocks, or obtain a trace of the application to help in the static analysis phase.

We believe neither of these approaches is satisfactory for analysis of dynamic code, especially when the code base is large (as is the case with most popular applications today). This belief is strengthened by the fact that none of the prior work could reason about such applications. In fact, many projects note that it is not possible to precisely analyze dynamic code via their approach [1, 3, 9, 10, 15, 22, 30, 36].
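As a minimal illustration of why such features resist purely static reasoning, consider the following Python sketch. It is not taken from this work: the class `Handlers` and the functions `dispatch` and `compute` are hypothetical names chosen for the example. The call target in `dispatch` and the code executed in `compute` both depend on run-time data, so neither appears as a literal in the source text that a static analyzer reads.

```python
# Illustrative sketch only; all names here are hypothetical.

class Handlers:
    def handle_view(self, item):
        return "viewing " + item

    def handle_delete(self, item):
        return "deleting " + item


def dispatch(action, item):
    # The method name is assembled from request data, so the callee is
    # not named literally at the call site below.
    handler = getattr(Handlers(), "handle_" + action)
    return handler(item)


def compute(expression, x):
    # Code arrives as data: the string only becomes executable code when
    # eval runs. Builtins are disabled to keep the example contained.
    return eval(expression, {"__builtins__": {}}, {"x": x})


if __name__ == "__main__":
    print(dispatch("view", "post-7"))
    print(compute("x * 2 + 1", 20))
```

Without knowing the possible values of `action` and `expression`, an analyzer must either over-approximate (any method, any code) or give up on the edge in the control flow graph, which is the trade-off the approaches above face.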
With respect to the implementation of analysis techniques, many solutions modify the interpreter [1, 28, 33]. Others instrument the code, either lightly or heavily, to extract the data and perform post-processing analysis [27, 29]. The remaining solutions use formal methods to parse and model dynamic code and then reason about it [14–16, 22].

[Copyright notice will appear here once 'preprint' option is removed.] 2016/11/12

We believe that these implementation approaches are not well suited for the analysis of dynamic applications. Modifying the interpreter is very hard, error prone, and requires modeling all functionality. Formal reasoning also requires modeling of all language features, which can take thousands of hours of work to provide reasonable completeness. That is why the majority of prior work notes that it is not complete, due to lack of modeling or implementation of various challenging dynamic features. Finally, instrumenting the code breaks metaprogramming and reflection, and incurs significant performance overhead.

In contrast, our solution is based on emulation, i.e., creating an interpreter of the dynamic language in itself. There are several benefits. Emulation supports proxying, allowing the implementation of necessary features with the desired granularity, while proxying all undesired features of the language to the underlying interpreter to preserve accuracy.

Since emulation is done in the dynamic language itself, low-level details of its semantics are abstracted away (e.g., memory management, optimizations and implicit type casts), significantly reducing the complexity of implementation while minimizing the risks associated with incorrect modeling or erroneous implementation of semantics [10, 17, 28].

Finally, emulation in the high-level language supports rapid development. Our prototype implementation of the PHP emulator consists of 4000 total lines of code, whereas the analyzer by Dahse et al. [10] is roughly 65,000 lines of code. The implementation of Weverca [15] is about 200,000 lines of code, while supporting much less of the PHP language than our tool. As far as we are aware, our work is the first to fully model its target language, including 44 dynamic features necessary for accurate analysis.

The primary drawback of emulation is that it adds another layer on top of the interpreter, which impacts performance. However, our evaluations show that the performance is competitive with other analysis methods.

Contributions. The main contributions of this research are as follows:

• A discussion of difficult-to-analyze dynamic features in dynamic programming languages, and potential solutions.
• A novel method for analysis of dynamic code, accompanied by a dynamic code analysis framework based on the method and powered by emulation.
• An open source, well-documented and extensible emulator for PHP supporting all features of the target language.
• An open source, extensible flow- and context-sensitive analysis framework for PHP.
• An evaluation of the framework and its

The paper is organized as follows. Section 2 provides the technical background and discusses why dynamic features present challenges for static analysis. Section 3 describes our method, which is based on emulation. Section 4 covers challenges of emulating dynamic code and our experience with addressing these challenges. Section 5 presents our analysis framework and its capabilities. The evaluation of our framework is contained in Section 6. It includes analysis coverage and performance measurements as well as case studies. Section 7 discusses the related work and contrasts it with our work, while Section 8 provides a summary.

2. Background

2.1 Dynamic Code Definition

Throughout this paper, dynamic code is defined to be any code that does not have a clear separation between data and executable code, relying on the interpreter regularly to execute data (e.g., variable contents) as code.

Based on this definition, we use dynamic programming languages to refer to languages which pervasively use dynamic features, such as PHP, Ruby, Python, Perl, JavaScript, Lua, Bash script and several other scripting languages, in contrast with languages such as C# and Java, which have an interpreter ready and provide dynamic coding features such as Reflection and CodeDOM, but are not typically used to create dynamic code. Dynamic languages are also frequently used for metaprogramming, i.e., programs that modify themselves during runtime and execute data as code.

The goal of this research is to provide a platform for analysis of dynamic applications developed using dynamic programming languages (which is the majority of applications developed using such languages).

2.2 Dynamic Features

In this section, we discuss some of the dynamic features common in dynamic code that make its analysis challenging:

2.2.1 Dynamic Include

Dynamic languages, just like any other programming language, benefit from structuring code into several files, organized into several directories. Many applications developed using these languages have one or a few entry points into the application, even though all their files can be executed individually. This design allows such applications to load libraries and code required to perform desired functionality, then internally dispatch the request to the respective individual script, enabling easier control over the flow of the application while reducing the complexity of individual scripts.

One of the most problematic features in analyzing dynamic languages is the dynamic include. Includes in dynamic languages, in contrast with static languages, are not prepro-