Accelerating Array Constraints in Symbolic Execution

Accelerating Array Constraints in Symbolic Execution David M. Perry∗ Andrea Mattavelli Xiangyu Zhang Cristian Cadar Purdue University Imperial College London Purdue University Imperial College London USA United Kingdom USA United Kingdom [email protected] [email protected] [email protected] [email protected] ABSTRACT 1 char b64[256] = {−1, −1, −1, ... 62, −1, −1, −1, 63, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, −1, −1, −1, −1, ..., −1 }; Despite significant recent advances, the effectiveness of symbolic 2 execution is limited when used to test complex, real-world software. 3 unsigned isBase64(unsigned char k) { 4 if (b64[k] >= 0) One of the main scalability challenges is related to constraint solv- 5 return 1; ing: large applications and long exploration paths lead to complex 6 else constraints, often involving big arrays indexed by symbolic expres- 7 return 0; 8 } sions. In this paper, we propose a set of semantics-preserving transformations for array operations that take advantage of contextual information collected during symbolic execution. Our transforma- Figure 1: A simplified excerpt from the base64 decoding rou- tions lead to simpler encodings and hence better performance in tine in Coreutils. Array b64 has positive values at offsets 43, constraint solving. The results we obtain are encouraging: we show, 47–57, 65–90, and 97–122. through an extensive experimental analysis, that our transformations help to significantly improve the performance of symbolic conditions. Symbolic execution employs satisfiability-modulo the- execution in the presence of arrays. We also show that our transfor- ory (SMT) constraint solvers to determine the feasibility of a path mations enable the analysis of new code, which would be otherwise condition and to generate concrete solutions for it. Despite the out of reach for symbolic execution. recent significant improvements in SMT solvers [12], constraint solving is still one of the main bottlenecks of symbolic execution. CCS CONCEPTS First, symbolic executors issue a huge number of queries to the • Software and its engineering → Software testing and de- constraint solver—for example those generated at every branch bugging; that depends on symbolic inputs—and second, symbolic execution generates large and complex constraints when applied to real-world KEYWORDS programs. As a result, constraint solving dominates runtime for the majority of non-trivial programs, in which often more than 90% of Symbolic execution, Constraint solving, Theory of arrays the time is spent in constraint solving activities [24, 26]. Authors ACM Reference format: of symbolic execution engines recognize this major bottleneck and David M. Perry, Andrea Mattavelli, Xiangyu Zhang, and Cristian Cadar. have invested significant effort in reducing the impact of constraint 2017. Accelerating Array Constraints in Symbolic Execution. In Proceedings solving using various techniques, such as caching of solutions and of 26th International Symposium on Software Testing and Analysis, Santa optimizing queries with subset/superset relations [6]. Nevertheless, Barbara, CA, USA, July 2017 (ISSTA’17), 11 pages. constraint solving still remains a showstopper for the application https://doi.org/ of symbolic execution to many real-world software. Some of the most expensive constraints in symbolic execution 1 INTRODUCTION involve arrays. Many programs take as inputs various forms of Symbolic execution is an approach at the core of many modern arrays, for example strings are encoded as arrays of characters, techniques to software testing, automatic program repair, and re- and developers extensively use arrays to implement various data verse engineering [6, 10, 20, 22, 25, 27]. At a high-level, symbolic structures (e.g. hash tables and vectors). Pointer operations in low- execution provides an automated mechanism for exploring multi- level code are often modelled using arrays [6, 7, 14]. As a result, ple paths in a program by constructing and solving symbolic path arrays are prevalent in symbolic path conditions. While array accesses with concrete indexes can be expressed and ∗The first two authors contributed equally to this paper. handled similarly to scalar variables, array accesses with symbolic indexes are more difficult to manage, as they could refer to poten- Permission to make digital or hard copies of all or part of this work for personal or tially every position in the array. Consider the code in Figure 1, classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation where the function isBase64 computes if a given integer input rep- on the first page. Copyrights for components of this work owned by others than the resents a value in the base64 encoding. The access b64»k¼ at line 4 author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or can refer to any of the 256 positions in the array. This implies that republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. all the array values have to be communicated to the constraint ISSTA’17, July 2017, Santa Barbara, CA, USA solver to determine the feasibility of b64»k¼ ≥ 0. Therefore, an © 2017 Copyright held by the owner/author(s). Publication rights licensed to Associa- SMT formula over arrays requires creating a variable for each off- tion for Computing Machinery. ACM ISBN 978-1-4503-5076-1/17/07...$15.00 set of the array and asserting the indexed value (more in §2.1 and https://doi.org/ §2.2). Such formulaic representation rapidly leads to inefficiencies ISSTA’17, July 2017, Santa Barbara, CA, USA David M. Perry, Andrea Mattavelli, Xiangyu Zhang, and Cristian Cadar in the constraint solving procedures that result in slowdowns and and performing the operation. Such selection may be very costly timeouts, thus hindering symbolic execution. in situations where both concrete and symbolic array indexes are In this paper, we propose a fully automatic technique to op- involved, requiring enumerating all possible concrete index values. timize constraints involving arrays, which are one of the major The array encoding in SMT solvers is generally founded on bottlenecks in symbolic execution. Our technique employs orthog- two axioms [16]. The first axiom enforces the offset-to-value cor- onal semantics-preserving transformations for array operations, by respondence in the array by creating a variable for each offset of taking advantage of contextual information during symbolic execu- the array and asserting that it is equal to the corresponding value tion. Our transformations lead to optimized constraint encodings, in the array. Note that the values may be symbolic or concrete, ultimately improving the efficiency of symbolic execution. They depending on whether the array is updated by some symbolic ex- rely on the key observation that when the array contains concrete pression at runtime. The second axiom enforces the index-to-offset values, we can transform the array operations into semantically- correspondence, which implies that if the symbolic index of the equivalent ones involving only the indexes satisfying the condi- array is equal to a specific offset, then the result of the read isthe tion, or the unique values in the array. For instance, the operation one enforced by the offset-to-value axiom. As such, the formulaic b64»k¼ ≥ 0 in Figure 1 can be transformed into the equivalent query representation referring to an array read (with a symbolic index) k = 43 _ k = 47 _ k = 48 _ :: that involves only the integer in- is an intertwinement of the two axioms. For example, the branch dex values satisfying the condition, for which constraint solvers condition at line 4 in Figure 1 is encoded in the following way, are highly optimized. Moreover, an array read with a symbolic where b640;b641; :::;b64255 represent the variables associated with index can be transformed to index range check operations that the concrete array reads b64»0¼;b64»1¼; :::;b64»255¼ and b64k is the group the repeated values in the array, reducing the size of the variable chosen to represent the symbolic array read b64[k]: formula in the solver. For example, the array indexing b64»k¼ can ¹b64 = −1º ^ ¹b64 = −1º ^ · · · ^ ¹b64 = −1º^ be rewritten using an if-then-else construct so that the indexes in 0 1 255 ¹0 ≤ k ≤ 42º _ ¹44 ≤ k ≤ 46º _ :: result in reading -1 from the array, ¹b64k ≥ 0º^ and so on. ¹k = 0 ! b64k = b640º ^ ::: ^ ¹k = 255 ! b64k = b64255ºº We develop a prototype implementation of our transformations within the KLEE symbolic executor [6]. Our experimental eval- Array writes are not independently encoded. Instead, they are uation demonstrates that the transformations can indeed lead to represented together with the enclosing read operations through read-over-write significant improvements in the performance of symbolic execution. the so-called transformation [3, 28]. For instance A»j¼ A»i¼ = v In summary, we make the following contributions: the read after a write is encoded as follows, where the uninterpreted functions read¹A;iº and write¹A;iº denote reading (1) We propose novel semantics-preserving transformations for and writing the ith element of array A, respectively, and ite denotes array operations to achieve better encodings of arrays. an if-then-else expression: (2) We precisely define where and how these transformations are applied during symbolic execution, which requires distinguish- read¹write¹A;i;vº; jº $ ite¹i = j;v;read¹A; jºº ing scenarios such as concrete arrays versus symbolic arrays That is, when a read has the same index as a preceding write, and concrete indexing versus symbolic indexing. the value obtained is the updated value. When the indexes differ, (3) We develop a prototype implementation inside the symbolic the value obtained is the one stored in the array before the update.

Accelerating Array Constraints in Symbolic Execution

The AWK Programming Language

GNU Coreutils Cheat Sheet (V1.00) Created by Peteris Krumins ([email protected], -- Good Coders Code, Great Coders Reuse)

Constraints in Dynamic Symbolic Execution: Bitvectors Or Integers?

07 07 Unixintropart2 Lucio Week 3

An Introduction to Ocaml

Gnu Coreutils Core GNU Utilities for Version 5.93, 2 November 2005

The Top 7000+ Pop Songs of All-Time 1900-2017

A Quickie Intro to UNIX, Linux, Macosx the Filesystem + Some Tools

Tsort: the Most Flexible Robotic Sortation System HOW FLEXIBLE IS YOUR SUPPLY CHAIN?

GPL-3-Free Replacements of Coreutils 1 Contents

LINUX Commands

GNU Coreutils Core GNU Utilities for Version 9.0, 20 September 2021