A Shadow Execution and Dynamic Analysis Framework for LLVM IR and JavaScript

Liang Gong and Cuong Nguyen∗
University of California, Berkeley

∗Names are in alphabetical order. Each author made an equal contribution to the paper. This work was done as a final project for CS262A in Fall 2013 at UC Berkeley.

CS262a '13, UC Berkeley

ABSTRACT

Dynamic program analysis is a widely used technique for analysing programs by executing them on a real or virtual processor. However, implementing a robust and scalable dynamic analysis tool for a full-fledged language such as C/C++ or JavaScript is challenged by many factors: arbitrary pointer arithmetic, un-interpreted functions, or arbitrary dynamic features. As a consequence, many existing dynamic analysis tools either ignore these features or do not support them completely. In this paper, we propose a robust and scalable dynamic analysis framework, based on shadow execution, that supports full-fledged LLVM IR and JavaScript. Our framework can be easily extended to implement almost any dynamic analysis of interest.

We implemented several dynamic analysis techniques on top of our framework, each in fewer than a few hundred lines of code. In particular, one of our analyses found bugs in real-world programs and websites, including Facebook and jQuery. Finally, we show that our framework is able to deal with real-world programs of medium and large size within an acceptable overhead, which is comparable to that of existing state-of-the-art dynamic analysis tools such as Pin and Valgrind.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging - symbolic execution, testing tools

General Terms
Verification

Keywords
Dynamic Analysis; LLVM; JavaScript

1. INTRODUCTION

Dynamic program analysis is the analysis of software applications by running the programs with a set of test inputs and observing their behaviours during execution. Dynamic program analysis is widely used for software testing, software profiling, and understanding software behaviours, among other things [12, 5, 9]. We observed that most program analysis tools are implemented in two phases: they first instrument the program at instructions of interest and then compute additional information at these instructions. However, implementing a robust and scalable dynamic analysis tool for a full-fledged language such as C/C++ or JavaScript is challenged by many factors: arbitrary pointer arithmetic, un-interpreted functions, or arbitrary dynamic features. As a consequence, many existing dynamic analysis tools either ignore these features or do not support them completely.

For example, on the C/C++ side, consider KLEE, a state-of-the-art test generation tool based on LLVM IR [7]. A recent study showed that the tool does not support several features of C programs: symbolic sizes passed to malloc, calls to unknown instructions, calls to an invalid function pointer, inlined assembly, and external functions with symbolic arguments [13]. Some other state-of-the-art analysis tools for C programs do not mention these issues at all [4, 1]. The analysis tools most similar to ours are Valgrind and Pin, which also allow code instrumentation at different levels of granularity (e.g., function, block, or instruction level) [15, 14]. However, they do not allow altering the semantics of the program and are thus less general than our approach.

On the JavaScript side, there exists no dynamic analysis framework for front-end, in-browser analysis of JavaScript similar to Valgrind [15], Pin [14], or DynamoRIO [6] for x86. Based on this general concept, we implement an analysis framework for front-end JavaScript. Our framework intercepts and transforms every JavaScript code snippet in either an HTML web page or an external JavaScript file.

Concretely, the contributions of this paper are as follows:

1. We design and implement a general LLVM IR and JavaScript analysis framework. To the best of our knowledge, this is the first general-purpose dynamic analysis framework for LLVM IR and front-end JavaScript based on shadow execution.

2. We demonstrate that our analysis framework preserves the semantics of the target code and does not cause obvious slowdown on real-world programs.

3. On top of our framework, we implement several simple but powerful dynamic analysis modules, which find bugs in real-world websites including jQuery and Facebook.

4. With our framework, analysis module developers are not required to have deep knowledge of low-level language and virtual machine mechanisms.

The rest of this paper is structured as follows. In Section 2 we introduce shadow execution and show an example in which we compute long double values along with concrete values. In Section 3 we go through the technical details of the LLVM and JavaScript shadow execution systems. We then show some dynamic analysis applications we built on top of our framework in Section 4. We then conclude with an evaluation (Section 5), related work (Section 6), and a conclusion (Section 7).

2. BACKGROUND

2.1 Shadow Execution

In shadow execution, each concrete value in the program has an associated shadow value. A shadow value can carry useful information about the concrete value, such as how an error was propagated, the array bound of a pointer, or the symbolic value of the concrete value. In our implementation, shadow execution is performed side by side with the concrete execution, and the concrete execution assists shadow execution to guarantee correctness. The concrete execution needs to assist shadow execution in two cases: to interpret the return value of an un-interpreted function, and to replicate the side effects of an un-interpreted function.

    Shadow Value                 Shadow Info
      VALUE  concrete              long double  value
      TYPE   primitive
      void*  shadow   (points to a Shadow Info object)

Figure 1: Class diagram of a shadow value.

Figure 1 depicts the design of our shadow value. Each shadow value maintains three fields: the concrete value as computed in the concrete execution, called concrete; the type of the value, called primitive; and a void pointer, called shadow. For modelling LLVM IR, we only need values of primitive types. Arrays, structures, and their combinations are modelled using primitive types. The shadow void pointer can point to any object defined by the user. It serves as a carrier for the extra information the user wants to tag along. The extra information differs depending on the analysis. In the example in Figure 1, the shadow information contains a long double value, which carries the concrete value computed in long double precision.

Figure 2 depicts an example of how shadow execution is performed for three LLVM instructions: alloca, store, and fadd. Firstly, at the alloca double %a instruction, the shadow execution creates a shadow value object of type double. Shadow information is not initialized at this step. Secondly, at the store %b, %a instruction, the shadow execution first assigns the concrete value of b's shadow object (my_b) to the concrete value of a's shadow object (my_a). It then initializes the shadow information of a's shadow object from the shadow information of my_b. Finally, for the %c = fadd %a, %b instruction, the shadow execution first creates a shadow object for c, called my_c. Then it assigns the concrete value of my_c to be the sum of the concrete values of my_a and my_b. Finally, it initializes the shadow information of my_c to be the sum of the shadow information of my_a and my_b. Recall that the shadow information in this case is the long double value of the concrete value. In this way, the long double value of the concrete value is computed along the execution.

    // alloca double %a
    my_alloca(double) {
      %my_a = new ShadowValue(primitive: double)
    }

    // store %b, %a
    my_store(%my_b, %my_a) {
      %my_a.concrete = %my_b.concrete
      %my_a.shadow = new ShadowInfo(%my_b.shadow.value)
    }

    // %c = fadd %a, %b
    my_fadd(%my_a, %my_b) {
      %my_c = new ShadowValue(primitive: %my_a.primitive)
      %my_c.concrete = %my_a.concrete + %my_b.concrete
      %my_c.shadow = new ShadowInfo(%my_a.shadow.value + %my_b.shadow.value)
    }

Figure 2: An example of shadow execution.

3. TECHNICAL DETAILS

We are now ready to discuss the technical details of our shadow execution systems. We implemented two shadow execution systems, one for LLVM IR and one for JavaScript. Both systems employ the same underlying shadow execution techniques, but they differ in the way language features are handled. In particular, the LLVM shadow execution system faces challenges in implementing a memory model for pointer arithmetic and in handling un-interpreted calls, among others, while the challenge for JavaScript is to address the dynamic nature of the language. This section discusses the technical details of the LLVM and JavaScript shadow execution systems in Sections 3.1 and 3.2, respectively.

3.1 LLVM Shadow Execution System

3.1.1 Overview

Figure 3 illustrates the architecture of our LLVM shadow execution framework. The input to our framework is the LLVM IR of the program of interest. The output of our framework is an instrumented LLVM IR program with some default hooks. By default, our hooks re-interpret the program using shadow execution. Users can extend these hooks to add more shadow information and to define the semantics of how this shadow information is computed, in order to implement their analysis of interest. As these hooks are implemented as dynamically loaded libraries, they can be plugged in (possibly several at the same time) or plugged out easily.
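
The shadow value of Figure 1 and the three hooks of Figure 2 can be made concrete with a small runnable sketch. The version below is only an illustration in Python, not the framework's actual implementation (which hooks into instrumented LLVM IR). It makes two assumptions: Python has no long double, so an exact `Fraction` stands in for the higher-precision shadow information, and `constant` is a hypothetical helper, not taken from the paper, used to create operands that already carry shadow information.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Any, Optional


@dataclass
class ShadowInfo:
    """Analysis-specific payload. The paper uses a long double here;
    this sketch substitutes an exact Fraction (an assumption)."""
    value: Fraction


@dataclass
class ShadowValue:
    primitive: str                # type tag of the concrete value, e.g. "double"
    concrete: float = 0.0         # value as computed by the concrete execution
    shadow: Optional[Any] = None  # stand-in for the void* shadow pointer


def my_alloca(primitive: str) -> ShadowValue:
    # alloca double %a: create the shadow value; shadow info is not initialized.
    return ShadowValue(primitive=primitive)


def my_store(my_b: ShadowValue, my_a: ShadowValue) -> None:
    # store %b, %a: copy the concrete value, then initialize a's shadow
    # information from b's shadow information.
    my_a.concrete = my_b.concrete
    my_a.shadow = ShadowInfo(my_b.shadow.value)


def my_fadd(my_a: ShadowValue, my_b: ShadowValue) -> ShadowValue:
    # %c = fadd %a, %b: add the concrete values and, separately, the shadows.
    my_c = ShadowValue(primitive=my_a.primitive)
    my_c.concrete = my_a.concrete + my_b.concrete
    my_c.shadow = ShadowInfo(my_a.shadow.value + my_b.shadow.value)
    return my_c


def constant(text: str) -> ShadowValue:
    # Hypothetical helper: a literal carries its double rounding as the
    # concrete value and its exact decimal value as shadow information.
    return ShadowValue("double", float(text), ShadowInfo(Fraction(text)))


a = my_alloca("double")
my_store(constant("0.1"), a)
b = my_alloca("double")
my_store(constant("0.2"), b)
c = my_fadd(a, b)

# Difference between the double result and the higher-precision shadow result.
rounding_error = Fraction(c.concrete) - c.shadow.value
print(c.concrete, c.shadow.value, float(rounding_error))
```

Because the shadow field is untyped, a different analysis could store something other than a `ShadowInfo` in it (for example a taint flag or a symbolic expression) without changing `ShadowValue` itself, which mirrors the role the void pointer plays in Figure 1.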
