Constraint Programming for Dynamic Symbolic Execution of JavaScript

Roberto Amadini¹, Mak Andrlon¹, Graeme Gange², Peter Schachte¹, Harald Søndergaard¹, and Peter J. Stuckey²

¹ University of Melbourne, Victoria, Australia
² Monash University, Melbourne, Victoria, Australia

Abstract. Dynamic Symbolic Execution (DSE) combines concrete and symbolic execution, usually for the purpose of generating good test suites automatically. It relies on constraint solvers to solve path conditions and to generate new inputs to explore. DSE tools usually make use of SMT solvers for constraint solving. In this paper, we show that constraint programming (CP) is a powerful alternative or complementary technique for DSE. Specifically, we apply CP techniques for DSE of JavaScript, the de facto standard for web programming. We capture the JavaScript semantics with MiniZinc and integrate this approach into a tool we call Aratha. We use G-Strings, a CP solver equipped with string variables, for solving path conditions, and we compare the performance of this approach against state-of-the-art SMT solvers. Experimental results, in terms of both speed and coverage, show the benefits of our approach, thus opening new research vistas for using CP techniques in the service of program analysis.

1 Introduction

Dynamic symbolic execution (DSE), also known as concolic execution/testing, or directed automated random testing [21, 35], is a hybrid technique that integrates the concrete execution of a program with its symbolic execution [28]. The main application is the automated generation of test suites with high coverage relative to their size. In a nutshell, DSE collects the constraints (or path conditions) encountered at conditional statements during concrete execution; then, a constraint solver or theorem prover is used to detect alternative execution paths by systematically negating the path conditions.
This process is repeated until all the feasible paths are covered or a given threshold (e.g., a timeout) is exceeded.

Key factors for the success of DSE are the efficiency and the expressiveness of the underlying constraint solver. The significant advances made by satisfiability modulo theories (SMT) solvers over recent years have stimulated interest in DSE and led to the development of many popular tools [11, 15, 36, 44, 46, 48]. In particular, improvements in expressive power (due to the ability to combine different theories) and solver performance have made SMT solvers very attractive for DSE, to the point that they are considered the de facto standard for DSE tools. Alternatives such as constraint programming (CP) exist, however.

Constraint programming [40] is a declarative paradigm aimed at solving combinatorial problems consisting of variables (typically having finite domains) and constraints over those variables. CP is applied in fields like resource allocation, scheduling, and planning, but apart from some dedicated approaches [14, 16, 23], it has seen limited use in software analysis. Arguably, the main impediment has been lack of support for common data structures such as dynamic arrays, bit vectors, and strings.

In this paper, we show that DSE can benefit from modern CP solving. In particular, we apply CP techniques to solve the path conditions generated by the dynamic symbolic execution of JavaScript programs. JavaScript is nowadays the standard programming language of the web, extensively used by developers on both the client and server side, and supported by all common browsers. Its dynamic nature can easily lead to programming errors and security vulnerabilities. This makes the dynamic symbolic execution of JavaScript an important task, but also a highly challenging one. Hence, it is not surprising that only a small number of DSE tools are available for JavaScript.
To capture JavaScript semantics, we first modelled the main language constructs with the CP modelling language MiniZinc [38]. It is essential to note that we are using the MiniZinc extension with string variables defined by Amadini et al. [3]. Strings play a central role in JavaScript because each JavaScript object is a map from string keys to values, and hence coercions to strings frequently occur in JavaScript programs (notably, arrays are objects and hence array indices are converted to their corresponding string values). Moreover, JavaScript programs often use regular expressions to match string patterns [6].

We then developed Aratha, a DSE tool using the Jalangi analysis framework [45]. Aratha can generate path conditions in our MiniZinc encoding, and solve them with G-Strings [7], a recent extension of the CP solver Gecode [20] able to handle string variables. Aratha is also able to generate path conditions in the form of SMT-LIB assertions, allowing us to empirically evaluate our CP approach against the state-of-the-art SMT solvers CVC4 [32] and Z3str3 [13].

Results indicate that a CP approach can easily be competitive with SMT approaches, and in particular the techniques can be used in conjunction. We emphasize that this technique can be replicated and extended to analyze languages other than JavaScript by using different MiniZinc encodings and different solvers (MiniZinc is a solver-independent language). We are not aware of any similar existing approaches for dynamic symbolic execution.

Paper structure. Section 2 introduces the basics of CP and DSE. Section 3 explains how we use MiniZinc to model JavaScript semantics. Section 4 describes Aratha. Section 5 presents our experimental evaluation. Section 6 discusses related work. Section 7 concludes by outlining possible future research directions.

2 Preliminaries

We begin by summarizing some basic notions related to constraint programming, string solving, DSE, and JavaScript.
For a given finite alphabet $\Sigma$, we denote by $\Sigma^*$ the set of all finite strings over $\Sigma$. The length of a string $x \in \Sigma^*$ is denoted $|x|$.

2.1 Constraint Programming and String Constraint Solving

Constraint programming [40] comprises modelling and solving combinatorial problems. This often means defining and solving a constraint satisfaction problem (CSP), which is a triple $\langle \mathcal{X}, \mathcal{D}, \mathcal{C} \rangle$ where: $\mathcal{X} = \{x_1, \dots, x_n\}$ is a finite set of variables, $\mathcal{D} = \{D(x_1), \dots, D(x_n)\}$ is a set of domains, where each $D(x_i)$ is the set of the values that $x_i$ can take, and $\mathcal{C}$ is a set of constraints over the variables of $\mathcal{X}$ defining the feasible assignments of values to variables. The goal is typically to find an assignment $\xi \in D(x_1) \times \cdots \times D(x_n)$ of domain values to corresponding variables that satisfies all of the constraints of $\mathcal{C}$.

Most CSPs found in the literature are defined over finite domains, i.e., $\mathcal{D}$ only contains finite sets. This guarantees the decidability of these problems, which are in general NP-complete. Typically, only integer variables and constraints are considered. However, some variants have been proposed.

In this work, we also consider constraints over bounded-length strings. Fixing a finite alphabet $\Sigma$ and a maximum string length $\lambda \in \mathbb{N}$, a CSP with bounded-length strings contains a number $k > 0$ of string variables $\{x_1, \dots, x_k\} \subseteq \mathcal{X}$ such that $D(x_i) \subseteq \Sigma^*$ and $|x_i| \leq \lambda$. The set $\mathcal{C}$ contains a number of well-known string constraints, such as string length, (dis-)equality, membership in a regular language, concatenation, substring selection, and finding/replacing. In the following, we will refer to constraint solving involving string variables as string (constraint) solving.

Different approaches to string constraint solving have been proposed, based on: automata [25, 31, 47], word equations [13, 32], unfolding (using either bit-vector solvers [27, 42] or CP [43]), and dashed strings [7, 8].
In particular, dashed strings are a recent CP approach that can be seen as “lazy” unfolding. Thanks to dedicated propagation, dashed strings enable efficient “high-level” reasoning on string constraints, by weakening the dependence on $\lambda$ [5, 6].

Several modelling languages have been proposed for encoding CP problems into a format understandable by constraint solvers. One of the most popular nowadays is MiniZinc [38], which is solver-independent (the motto is “model once, solve anywhere”), enabling the separation of model and data. Each MiniZinc model (together with corresponding data, if any) is translated into FlatZinc, the solver-specific target language for MiniZinc, in the form required by a solver. From the same MiniZinc model, different FlatZinc instances can be derived.

MiniZinc was equipped with string variables and constraints by Amadini et al. [3]. A MiniZinc model with strings can be solved “directly” by CP solvers natively supporting string variables (Gecode+S [43] and G-Strings [7]) or “indirectly” via the static unfolding into integer variables. Clearly, direct resolution is generally more efficient, especially as $\lambda$ grows.

2.2 Dynamic Symbolic Execution

Symbolic execution is a static analysis technique that has its roots in the 1970s [28]. The idea of symbolic execution is to assume symbolic values for input and to interpret programs correspondingly, i.e., to use a concept of “value” that is in fact an expression over the variables representing possible input values. The symbolic interpreter can then explore the possible program paths by reasoning about the conditions under which execution will branch this way or that. The set of constraints leading to a particular path being taken is a path condition, so that a given path is feasible if and only if the corresponding constraint is satisfiable. The test for satisfiability (and the generation of a witness in the affirmative case) is delegated to a constraint solver.
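
The feasibility test described above can be illustrated with a minimal Python sketch. It treats a path condition as a conjunction of predicates over a single bounded-length string input and decides satisfiability by exhaustive enumeration. The alphabet, the bound `MAX_LEN`, the helper names `candidates` and `solve`, and the two example path conditions are all assumptions made for illustration; solvers such as G-Strings rely on propagation rather than enumeration.

```python
from itertools import product

ALPHABET = "ab"  # assumed toy alphabet, standing in for Sigma
MAX_LEN = 3      # assumed maximum string length, standing in for lambda


def candidates():
    """Yield every string over ALPHABET of length at most MAX_LEN."""
    for n in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def solve(path_condition):
    """Return a witness satisfying every constraint, or None if infeasible."""
    for s in candidates():
        if all(constraint(s) for constraint in path_condition):
            return s
    return None


# Hypothetical path condition: |x| = 2 and x starts with "b".
feasible = [lambda x: len(x) == 2, lambda x: x.startswith("b")]
# Hypothetical contradictory path condition: x = "a" and |x| > 1.
infeasible = [lambda x: x == "a", lambda x: len(x) > 1]

witness = solve(feasible)       # a concrete input that drives this path
no_witness = solve(infeasible)  # None: the path cannot be taken
```

The search space has size $\sum_{n=0}^{\lambda} |\Sigma|^n$, which is why this naive approach degrades quickly as the length bound grows.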
Symbolic execution can be useful to automatically prove a given property of interest, provided that: (i) the whole program—including libraries—is available to the interpreter, and (ii) the underlying constraint solver is expressive and efficient enough to handle the generated path conditions. Unfortunately, these conditions are often not met. Dynamic symbolic execution (DSE) is a software verification approach that performs symbolic execution along with concrete (or dynamic) execution of a given program.
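
To make the interplay of concrete and symbolic execution more tangible, here is a minimal Python sketch of the loop outlined in the introduction: run the program concretely, record the branch conditions, negate them one at a time, and ask a solver for an input that follows the new path. The toy `program`, the brute-force `solve`, the alphabet, and the length bound are hypothetical stand-ins chosen for illustration; they do not reflect how Aratha or any real DSE tool is implemented.

```python
from itertools import product

ALPHABET = "ab"  # assumed toy alphabet
MAX_LEN = 3      # assumed bound on input length


def program(x, trace):
    """Toy program under test; logs (predicate, outcome) at each branch."""
    def branch(pred):
        outcome = pred(x)
        trace.append((pred, outcome))
        return outcome

    if branch(lambda s: len(s) == 2):
        if branch(lambda s: s.startswith("b")):
            return "A"
        return "B"
    return "C"


def solve(goal):
    """Brute-force stand-in for a solver: find s with pred(s) == wanted."""
    for n in range(MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            s = "".join(chars)
            if all(pred(s) == wanted for pred, wanted in goal):
                return s
    return None


def explore(prog, seed):
    """Return one concrete input per distinct execution path found."""
    worklist, seen_paths, inputs = [seed], set(), []
    while worklist:
        x = worklist.pop()
        trace = []
        prog(x, trace)  # concrete run that also records the path condition
        path = tuple(outcome for _, outcome in trace)
        if path in seen_paths:
            continue
        seen_paths.add(path)
        inputs.append(x)
        # Keep a prefix of the path condition and negate the next branch.
        for i, (pred, outcome) in enumerate(trace):
            new_input = solve(trace[:i] + [(pred, not outcome)])
            if new_input is not None:
                worklist.append(new_input)
    return inputs


tests = explore(program, "")
results = sorted(program(t, []) for t in tests)
```

Starting from the empty string, the sketch ends with one input for each of the three paths of `program`. A real tool would also stop on a timeout, as noted in the introduction.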
