A Shadow Execution and Dynamic Analysis Framework for LLVM IR and JavaScript

∗ Liang Gong, University of California, Berkeley
Cuong Nguyen, University of California, Berkeley

ABSTRACT

Dynamic analysis is a widely used technique for analysing programs by executing them on a real or virtual processor. However, implementing a robust and scalable dynamic analysis tool for a full-fledged language such as C/C++ or JavaScript is challenged by many factors: arbitrary pointer arithmetic, un-interpreted functions and arbitrary dynamic features. As a consequence, many existing dynamic analysis tools either ignore these features or do not support them completely. In this paper, we propose a robust and scalable dynamic analysis framework, based on shadow execution, that supports full-fledged LLVM IR and JavaScript. Our framework can be easily extended to implement almost any dynamic analysis of interest.

We implemented several dynamic analysis techniques on top of our framework, each in fewer than a few hundred lines of code. In particular, one of our analyses found bugs in real-world programs and websites including Facebook and jQuery. Finally, we show that our framework is able to deal with real-world programs of medium and large size within an acceptable overhead, which is comparable to existing state-of-the-art dynamic analysis tools such as Pin and Valgrind.

Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging - symbolic execution, testing tools

General Terms
Verification

Keywords
Dynamic Analysis; LLVM; JavaScript

∗Names are in alphabetical order. Each author makes equal contribution to the paper. This work was done as a final project for CS262A in Fall 2013 at UC Berkeley.

CS262a '13, UC Berkeley

1. INTRODUCTION

Dynamic program analysis is the analysis of applications by running the programs with a set of test inputs and observing their behaviours during execution. Dynamic program analysis is widely used for software testing, software profiling and understanding software behaviours, among other things [12, 5, 9]. We observed that most program analysis tools are implemented in two phases: first instrument the program at instructions of interest, then compute additional information at these instructions. However, implementing a robust and scalable dynamic analysis tool for a full-fledged language such as C/C++ or JavaScript is challenged by many factors: arbitrary pointer arithmetic, un-interpreted functions and arbitrary dynamic features. As a consequence, many existing dynamic analysis tools either ignore these features or do not support them completely.

On the C/C++ side, for example, consider Klee, a state-of-the-art test generation tool based on LLVM IR [7]. A recent study showed that the tool does not support several features of C programs: symbolic sizes passed to malloc, calls to unknown instructions, calls to an invalid function pointer, inlined assembly, and external functions with symbolic arguments [13]. Some other state-of-the-art analysis tools for C programs do not mention these issues at all [4, 1]. The analysis tools most similar to ours are Valgrind and Pin, which also allow code instrumentation at different levels of granularity (function, block, instruction, etc.) [15, 14]. However, they do not allow altering the semantics of the program, and are thus less general than our approach.

On the JavaScript side, there exists no dynamic analysis framework for front-end, in-browser analysis of JavaScript similar to Valgrind [15], Pin [14] or DynamoRIO [6]. Based on this general concept, we implement an analysis framework for front-end JavaScript. Our framework intercepts and transforms all possible JavaScript code snippets in either an HTML web page or an external JavaScript file.

Concretely, the contributions of this paper are as follows:

1. We design and implement a general LLVM IR and JavaScript analysis framework. To the best of our knowledge, this is the first general-purpose dynamic analysis framework for LLVM IR and front-end JavaScript based on shadow execution.

2. We demonstrate that our analysis framework preserves the semantics of the target code and does not cause an obvious slowdown on real-world programs.

3. On top of our framework, we implement several simple but powerful dynamic analysis modules which find bugs in real-world websites including jQuery and Facebook.

4. With our framework, analysis module developers are not required to have deep knowledge of the low-level language and mechanisms.

The rest of this paper is structured as follows. In Section 2 we introduce shadow execution and show an example in which we compute long double values along with concrete values. In Section 3 we go through the technical details of the LLVM and JavaScript shadow execution systems. We then show some dynamic analysis applications built on top of our framework in Section 4. We conclude with an evaluation (Section 5), related work (Section 6) and a conclusion (Section 7).

2. BACKGROUND

2.1 Shadow Execution

In shadow execution, each concrete value in the program has an associated shadow value. A shadow value can carry useful information about the concrete value, such as how an error was propagated, the array bound of a pointer, or the symbolic value of the concrete value. In our implementation, the shadow execution is performed side by side with the concrete execution, and the concrete execution assists the shadow execution to guarantee correctness. The concrete execution needs to assist the shadow execution in two cases: to interpret the return value of an un-interpreted function, and to replicate the side effects of an un-interpreted function.

Figure 1: Class diagram of a shadow value. A Shadow Value holds VALUE concrete, TYPE primitive and void* shadow; the shadow pointer refers to a Shadow Info object holding a long double value.

Figure 1 depicts the design of our shadow value. Each shadow value maintains three fields: the concrete value as computed in the concrete execution, called concrete; the type of the value, called primitive; and a void pointer called shadow. For modelling LLVM IR, we only need values of primitive types. Arrays, structures and their combinations are modelled using primitive types. The shadow void pointer can point to any object defined by the user. It serves as a carrier for the extra information the user wants to tag along, and this extra information differs depending on the analysis. In the example in Figure 1, the shadow information contains a long double value, which carries the concrete value computed in long double precision.

    // alloca double %a
    my_alloca(double) {
      %my_a = new ShadowValue(primitive: double)
    }

    // store %b, %a
    my_store(%my_b, %my_a) {
      %my_a.concrete = %my_b.concrete
      %my_a.shadow = new ShadowInfo(%my_b.shadow.value)
    }

    // %c = fadd %a, %b
    my_fadd(%my_a, %my_b) {
      %my_c = new ShadowValue(primitive: %my_a.primitive)
      %my_c.concrete = %my_a.concrete + %my_b.concrete
      %my_c.shadow = new ShadowInfo(%my_a.shadow.value + %my_b.shadow.value)
    }

Figure 2: An example of shadow execution.

Figure 2 depicts an example of how shadow execution is performed for three LLVM instructions: alloca, store and fadd. Firstly, at the alloca double %a instruction, the shadow execution creates a shadow value object of type double. Shadow information is not initialized at this step. Secondly, at the store %b %a instruction, the shadow execution first assigns the concrete value of b's shadow object (my_b) to the concrete value of a's shadow object (my_a). It then initializes the shadow information of a's shadow object to be equal to the shadow information of b's shadow object. Finally, for the %c = fadd %a %b instruction, the shadow execution first creates a shadow object for c, called my_c. Then it sets the concrete value of my_c to the sum of the concrete values of my_a and my_b. Finally, it initializes the shadow information of my_c to the sum of the shadow information of my_a and my_b. Recall that the shadow information in this case is the long double value of the concrete value. In this way, the long double value of each concrete value is computed along the execution.

3. TECHNICAL DETAILS

We are now ready to discuss the technical details of our shadow execution systems. We implemented two shadow execution systems, one for LLVM IR and one for JavaScript. Both systems employ the same underlying shadow execution techniques, but they differ in the way language features are handled. In particular, the LLVM shadow execution system faces challenges in implementing a memory model for pointer arithmetic and in handling un-interpreted calls, among others, while the challenge for JavaScript is to address the dynamic nature of the language. This section discusses the technical details of the LLVM and JavaScript shadow execution systems in Sections 3.1 and 3.2, respectively.

3.1 LLVM Shadow Execution System

3.1.1 Overview

Figure 3 illustrates the architecture of our LLVM shadow execution framework. The input to our framework is the LLVM IR of the program of interest. The output of our framework is an instrumented LLVM IR program with some default hooks. By default, our hooks re-interpret the program using shadow execution. Users can extend these hooks to add more shadow information and to define how this shadow information is computed, in order to implement their analysis of interest. As these hooks are implemented as dynamically loaded libraries, they can be plugged in (possibly several at the same time) or plugged out easily. Finally, with some available test scripts, users can execute the instrumented programs with hooks to obtain analysis results.
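To make the hooks of Figure 2 concrete, the following is a minimal C++ sketch of a shadow value and of the three hooks, assuming the long double analysis used as the running example. The names ShadowValue, ShadowInfo, my_alloca, my_store and my_fadd follow the figure; the Primitive enum, the my_constant helper and the use of std::shared_ptr in place of the paper's void* are assumptions made for illustration, not the framework's actual interface.

```cpp
#include <cassert>
#include <memory>

// Hypothetical type tag; the real framework covers all LLVM primitive types.
enum class Primitive { Double };

// Analysis-specific payload: the value recomputed in long double precision.
struct ShadowInfo {
    long double value;
};

// One shadow value per concrete value (cf. Figure 1).
struct ShadowValue {
    double concrete = 0.0;                    // value from the concrete run
    Primitive primitive = Primitive::Double;  // type of the value
    std::shared_ptr<ShadowInfo> shadow;       // stands in for the void* field
};

// Hypothetical helper: shadow value for a constant operand.
ShadowValue my_constant(double c) {
    ShadowValue v;
    v.concrete = c;
    v.shadow = std::make_shared<ShadowInfo>(
        ShadowInfo{static_cast<long double>(c)});
    return v;
}

// alloca double %a: create the shadow object, shadow info left unset.
ShadowValue my_alloca() { return ShadowValue(); }

// store %b, %a: copy the concrete value and the shadow information.
void my_store(const ShadowValue& b, ShadowValue& a) {
    assert(b.shadow && "stored value must carry shadow info");
    a.concrete = b.concrete;
    a.shadow = std::make_shared<ShadowInfo>(ShadowInfo{b.shadow->value});
}

// %c = fadd %a, %b: add concretely and, separately, in long double.
ShadowValue my_fadd(const ShadowValue& a, const ShadowValue& b) {
    assert(a.shadow && b.shadow);
    ShadowValue c;
    c.primitive = a.primitive;
    c.concrete = a.concrete + b.concrete;
    c.shadow = std::make_shared<ShadowInfo>(
        ShadowInfo{a.shadow->value + b.shadow->value});
    return c;
}
```

For exactly representable operands such as 1.5 and 2.25 both results agree; on platforms where long double is wider than double, operands such as 0.1 and 0.2 can make c.shadow->value differ from c.concrete in the low-order bits, which is the kind of information this analysis is meant to expose.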

Figure 3: Architecture of LLVM Shadow Execution System. The Instrumenter (an LLVM pass) turns LLVM bitcode into instrumented LLVM bitcode, in which each instruction (e.g. %a = alloca double, %1 = load %a, %3 = add %1 %2) is paired with a hook (my_alloca(double), my_load(%my_a), my_add(%my_1, %my_2)). The instrumented bitcode is run with test scripts and the dynamically loaded callback to produce the analysis result.

Our LLVM shadow execution system has two main components: the instrumenter, which is implemented as an LLVM pass, and a callback interpreter, which by default re-interprets the program using shadow execution. At a glance, these two components are analogous to a structure in which the instrumenter transforms the code into an intermediate representation (in this case an instrumented program) and the callback interpreter serves as a runtime environment that provides semantics for the intermediate code. We provide instrumentation and hooks for all 57 instructions of LLVM IR. In addition, the instrumenter computes the symbol table and the stack frame space for local and global variables. The symbol table also contains indices for local variables, which enable fast look-up at runtime. The callback interpreter, on the other hand, models a stack and runtime environment and defines semantics for all 57 instrumented hooks in the program.

Recall that our callback interpreter performs a shadow execution which deals with shadow values. In contrast to concrete values, shadow values have a fixed size. In addition, a shadow value is a more powerful abstraction than a concrete value, because it carries extra information such as type and shadow information. However, re-interpreting shadow values so that they have exactly the same semantics as the concrete values poses several challenges. Firstly, the memory models of the two executions are different, e.g. one uses fixed-size values while the other uses variable-size values. Secondly, shadow execution might not always have enough information about the code at hand, e.g. native functions which it cannot interpret. To produce a robust analysis framework for full-fledged LLVM IR, we attempt to support all features of the language. In the following sections, we describe our solutions for these problems: the memory model (Sections 3.1.2 and 3.1.3) and un-interpreted functions (Section 3.1.4).

3.1.2 Modelling Dynamic Memory Allocation

This section describes the challenges of modelling dynamic memory allocation for shadow values, and our solutions. To illustrate and motivate the problem, consider the following code example and the corresponding memory layout depicted in Figure 4. In this example, the concrete execution allocates a space of 24 bytes for three double values, but the shadow execution allocates a space of 3*n bytes for three shadow value objects, where n is the size of one shadow value object in bytes. Each double value is then associated with one shadow value object.

    // concrete code
    double *array = malloc(3 * sizeof(double));
    // shadow code
    ShadowValue *sarray = malloc(3 * sizeof(ShadowValue));

Figure 4: One shadow memory layout for three double values and six float values.

Now consider the following code, which casts the double pointer to a float pointer. The corresponding memory layout mapping between concrete and shadow execution is depicted in Figure 4. This mapping shows that six float values are now mapped to three shadow value objects, and it is unclear how their correspondence can be handled effectively. One solution would be to create another copy of the shadow pointer that points to six shadow value objects; however, maintaining multiple copies of the same objects and keeping them in sync is expensive.

    // concrete code
    float *farray = (float *) array;
    // shadow code
    ShadowValue *sfarray = ??;

To handle dynamic memory allocation and casting effectively, we add two more fields to the shadow value objects that correspond to pointers: dataSize and offset. The class diagram of shadow value objects for pointers is shown in Figure 5. For a pointer shadow object, the concrete field contains the base address of the shadow object this pointer points to. Notice that this value differs from the concrete address, which points to the concrete values. This design decision raises another issue, which we address in Section 3.1.3. The dataSize field contains the size in bytes of the value this pointer points to, e.g. if this is a float pointer, dataSize is 4 bytes. The offset field contains the difference in bytes between the address this pointer points to and the base pointer address (which is stored in the concrete field).

The values contained in the concrete and offset fields are used to dereference the pointer shadow object. If we dereference the address stored in the concrete field, we only get the base shadow object. To get the shadow object a pointer shadow object actually points to, we dereference the address at concrete + offset. We discuss dereferencing and pointer arithmetic in more detail in Section 3.1.3. In our example, the layout of the variables sarray and sfarray is as follows. Both variables point to the same three shadow value objects.

    ShadowValue* base = malloc(3 * sizeof(ShadowValue));

    sarray { // double pointer shadow value
      concrete: base;
      type: pointer;
      dataSize: 8;
      offset: 0;
    }

    sfarray { // float pointer shadow value
      concrete: base;
      type: pointer;
      dataSize: 4;
      offset: 0;
    }

Figure 5: Class diagram for shadow value for pointer. In addition to VALUE concrete, TYPE primitive and void* shadow, the Shadow Value holds int dataSize and int offset.

Figure 6: Pointer dereference of concrete and shadow pointer values. The three shadow value objects lie at distances 0, 8 and 16 from the base, each covering two float values.
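The pointer representation above can be sketched in C++ as follows. This is an illustrative sketch only: ShadowObject, ShadowPointer, shadow_cast, shadow_add and locate are hypothetical names, the base is modelled as a std::vector rather than a raw malloc'd block, and the look-up anticipates the index computation that Section 3.1.3 describes, using std::upper_bound as one possible binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One shadow object per originally allocated element. `distance` is its byte
// distance from the start of the concrete allocation (0, 8, 16 for 3 doubles).
struct ShadowObject {
    std::size_t distance;
};

// Shadow value of a pointer: base of the shadow objects plus the two
// extra fields dataSize and offset.
struct ShadowPointer {
    const std::vector<ShadowObject>* concrete;  // base shadow objects
    std::size_t dataSize;                       // bytes per pointee
    std::size_t offset;                         // byte distance from the base
};

// Cast: same base and offset, only the pointee size changes.
ShadowPointer shadow_cast(ShadowPointer p, std::size_t newDataSize) {
    p.dataSize = newDataSize;
    return p;
}

// Pointer arithmetic p + n: advance the byte offset by n pointees.
ShadowPointer shadow_add(ShadowPointer p, std::size_t n) {
    p.offset += n * p.dataSize;
    return p;
}

// Find the shadow object containing byte `offset`; returns
// (index of the object, internal byte offset inside that object).
std::pair<std::size_t, std::size_t> locate(const ShadowPointer& p) {
    const std::vector<ShadowObject>& objs = *p.concrete;
    // First object that starts strictly after `offset`; the one before it
    // is the object that contains the byte.
    auto it = std::upper_bound(
        objs.begin(), objs.end(), p.offset,
        [](std::size_t off, const ShadowObject& o) { return off < o.distance; });
    assert(it != objs.begin() && "offset lies before the first object");
    std::size_t index = static_cast<std::size_t>(it - objs.begin()) - 1;
    return {index, p.offset - objs[index].distance};
}
```

With three shadow objects at distances 0, 8 and 16, casting the double pointer to a float pointer and adding 3 yields offset 12, which locate maps to object index 1 with internal offset 4. Keeping a single set of shadow objects and recomputing the index on each dereference avoids the duplicated, separately synchronised copies that the text rules out as too expensive.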

3.1.3 Arithmetic of Memory Address

This section shows how pointer arithmetic is performed on shadow values using the design presented in Section 3.1.2. Consider the following code, which performs pointer arithmetic on the variable farray. In this case, a new pointer shadow object selem is created whose concrete, type and dataSize values are the same as those of the sfarray pointer shadow object. However, selem's offset is 12 instead of 0 (3 float elements of 4 bytes each from the base pointer).

    // concrete code
    float *elem = farray + 3;
    // shadow code
    ShadowValue *selem {
      concrete: base;
      type: pointer;
      dataSize: 4;
      offset: 12;
    }

Consider the following code, which dereferences the pointer, and Figure 6, which shows which value is dereferenced in the concrete and shadow execution. To dereference the shadow pointer value, we first compute the index of the initial shadow value. Each shadow value object has a field denoting its distance to the base pointer (in this case, the distances are 0, 8 and 16 respectively for the three shadow objects). Based on these distances and the offset value of 12, we compute that we want to dereference the shadow object with index 1 (8 bytes from the base pointer), with internal offset 4 (which is 12 - 8). From that internal offset, we want to read 4 bytes ahead (because the dataSize of the pointer is 4). Hence we call the function read(sarray[1], 4, 4), which returns the corresponding shadow value object.

Notice that looking up the variable index might be expensive if the array is long. We thus employ binary search as an optimization for looking up the variable index quickly when the array is long.

    // concrete code
    float result = *elem;
    // shadow code
    ShadowValue sresult = read(sarray[1], 4, 4);

3.1.4 Handling Uninterpreted Functions

In this section, we show how shadow execution deals with uninterpreted functions, i.e. functions for which we do not have the corresponding bitcode (native functions, dynamically loaded functions, etc.). The following code example uses the function sin from the C standard library, for which we do not have the corresponding bitcode.

    double a;
    ShadowValue sa = new ShadowValue(primitive: double);
    a = sin(pi);
    sa = my_call(arguments: {pi}, uninterpreted: true);

Recall that in our shadow execution framework, the concrete execution always runs side by side with the shadow execution. For uninterpreted functions, we use an idea from concolic execution: we use the value computed by the concrete execution to construct the shadow value. In the example shown above, we use the return value of sin(pi), which is 0, and construct the shadow value sa = new ShadowValue(primitive: double, concrete: 0).

Consider another uninterpreted call example, in which the call is memcpy. This code first initializes an array of three integers (lines 1-2). It then allocates memory space for another three integers (lines 4-5). Finally, it calls memcpy to copy the contents of a into b (lines 7-8). The memcpy function does not return any value, but it changes the content of the heap. Unfortunately, as we cannot interpret this function, we do not know how the heap was changed at this point. Later on, when b is used, the shadow execution will deviate because sb is different from the concrete value.

    1 int a[] = {1,2,3};
    2 ShadowValue* sa = my_alloca_array(1,2,3);
    3
    4 int* b = malloc(3*sizeof(int));
    5 ShadowValue* sb = malloc(3*sizeof(ShadowValue));
    6
    7 memcpy(b, a, 3*sizeof(int));
    8 my_call(arguments: {sa, sb}, uninterpreted: true);

To keep the concrete and shadow heaps in sync, we synchronize the concrete and shadow values at every load instruction if they are mismatched. For example, consider the following code, in which b's values are dereferenced. When we load the value of b[0], we get the value 1, but when we load the corresponding shadow value sb[0], we get the value 0. As the values are mismatched at the load, we create a shadow value with concrete value 1 and assign it to sb[0]: sb[0] = new ShadowValue(concrete: 1, primitive: int). Similarly, the shadow value of sb[2] is synced at line 5. Note that the value of sb[1] is never synced if it is never used. Although this mismatch might persist during the entire execution, as the value is never used there is no deviation between the semantics of the concrete and shadow executions.

    1 load b[0];
    2 load sb[0]; // syncing happens
    3
    4 load b[2];
    5 load sb[2]; // syncing happens

A final issue is that an array access might go out of bounds in the shadow execution because we do not sync the heap. Indeed, if an uninterpreted function enlarges b and the program later executes load b[4], the corresponding load in the shadow execution, load sb[4], results in an out-of-bounds access. In shadow execution, as we record the length of each pointer when we allocate space for it, we know when an out-of-bounds access occurs. We simply skip instructions that access elements beyond the array bound and use the concrete value to construct the corresponding shadow value.

3.2 JavaScript Shadow Execution System

3.2.1 Overview

One of the major challenges in implementing the JavaScript Shadow Execution Framework comes from the flexibility of JavaScript and the various ways in which JavaScript code can be executed in a web page. We also need to make sure that we completely detect and instrument all of the JavaScript code and, at the same time, do not interfere with the events triggered by the user and the browser.

We design and implement a dynamic program analysis framework for JavaScript. In comparison with existing dynamic analysis frameworks (e.g., Pin [14] and Valgrind [15]), our framework differs in the following aspects:

1. The analysis framework does code transformation rather than code instrumentation.

2. Both the transformation code and the analysis code are written completely in JavaScript, which makes it easy for JavaScript developers to analyse their own code.

3. The analysis framework allows the analysis code to be executed side by side with the target code, which does not require extra knowledge about the underlying virtual machine.

4. The analysis framework does not require modifying the [...]

[...]

• From an external file specified by the src attribute of a [...]

Furthermore, unlike in a server-side JavaScript environment (e.g., node.js), JavaScript execution and interaction in a web browser is more complicated. In a web browser, JavaScript code may be executed at different times:

• Some pieces of JavaScript code are executed during web page loading, even before another external JavaScript file is fully loaded.

• Some pieces of JavaScript code will not be executed until some specific events are triggered.

Our JavaScript Analysis Framework transforms JavaScript code imported in all of the above cases. To handle these challenges, we adopt an interception architecture to build our system (see Figure 7). We implement our framework as an observer that monitors every HTTP request sent by the browser; once it finds a response that contains JavaScript code, the framework transforms the code, constructs a new response containing the transformed code and passes it to the browser.

Figure 7: Architecture of our front-end JavaScript dynamic analysis Framework.

Specifically, when our framework intercepts a response that is an HTML web page, it first scans through the web page and transforms any embedded JavaScript code in both [...] tags. [...] and instrument the newly added code before it is executed. So we believe every piece of JavaScript code executed has been transformed by our analysis framework.