DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Quantitative Analysis of Exploration Schedules for Symbolic Execution

CHRISTOPH KAISER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master in Computer Science
Date: August 21, 2017
Supervisor: Cyrille Artho
Examiner: Mads Dam
Swedish title: Kvantitativ analys av utforskningsscheman för Symbolisk Exekvering

Abstract

Due to complexity in software, manual testing is not enough to cover all of its relevant behaviours. A different approach to this problem is Symbolic Execution. Symbolic Execution is a software testing technique that tests all possible inputs of a program in the hope of finding all bugs. Due to the often exponential increase in possible program paths, Symbolic Execution usually cannot test a program exhaustively. To nevertheless cover the most important or error-prone areas of a program, search strategies that prioritize these areas are used. Such search strategies navigate the program execution tree, analysing which paths seem interesting enough to execute and which to prune. These strategies are typically grouped into two categories: general purpose searchers, with no specific target but the aim to cover the whole program, and targeted searchers, which can be directed towards specific areas of interest. To analyse how different searching strategies in Symbolic Execution affect the finding of errors and how they can be combined to improve the general outcome, the experiments conducted consist of several different searchers and combinations of them, each run on the same set of test targets. This set of test targets contains, amongst others, one of the most heavily tested sets of open source tools, the GNU Coreutils. With these, the different strategies are compared in distinct categories such as the total number of errors found or the percentage of covered code. The results of this thesis show the potential of targeted searchers, with an example implementation of the Pathscore-Relevance strategy. Further, the results obtained from the conducted experiments endorse the use of combinations of search strategies. It is also shown that even simple combinations of strategies can be highly effective. For example, interleaving strategies can provide good results even if the underlying searchers might not perform well by themselves.

Sammanfattning

På grund av programvarukomplexitet är manuell testning inte tillräcklig för att täcka alla relevanta beteenden av programvaror. Ett annat tillvägagångssätt till detta problem är Symbolisk Exekvering (Symbolic Execution). Symbolisk Exekvering är en mjukvarutestningsteknik som testar alla möjliga inmatningar i ett program i hopp om att hitta alla buggar. På grund av den ofta exponentiella ökningen i möjliga programsökvägar kan Symbolisk Exekvering vanligen inte uttömmande testa ett program. För att ändå täcka de viktigaste eller felbenägna områdena i ett program, används sökstrategier som prioriterar dessa områden. Sådana sökstrategier navigerar i programexekveringsträdet genom att analysera vilka sökvägar som verkar intressanta nog att utföra och vilka att beskära. Dessa strategier grupperas vanligtvis i två kategorier, sökare med allmänt syfte, utan något specifikt mål förutom att täcka hela programmet, och riktade sökare som kan riktas mot specifika intresseområden. För att analysera hur olika sökstrategier i Symbolisk Exekvering påverkar upptäckandet av fel och hur de kan kombineras för att förbättra det allmänna utfallet, bestod de experiment som utfördes av flera olika sökare och kombinationer av dem, som alla kördes på samma uppsättning av testmål. Denna uppsättning av testmål innehöll bland annat en av de mest testade uppsättningarna av öppen källkod-verktyg, GNU Coreutils. Med dessa jämfördes de olika strategierna i distinkta kategorier såsom det totala antalet fel som hittats eller procenttalet av täckt kod. Med resultaten från denna avhandling visas potentialen hos riktade sökare, med ett exempel i form av implementeringen av Pathscore-Relevance strategin. Vidare stöder resultaten som erhållits från de utförda experimenten användningen av sökstrategikombinationer. Det visas också att även enkla kombinationer av strategier kan vara mycket effektiva. Interleaving-strategier kan till exempel ge bra resultat även om de underliggande sökarna kanske inte fungerar bra själva.

Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
  1.3 Ethics and sustainability
  1.4 Structure of this thesis

2 Background
  2.1 Symbolic Execution
  2.2 Search Strategies
    2.2.1 Depth-First
    2.2.2 Breadth-First
    2.2.3 Random
    2.2.4 Coverage-Optimized
    2.2.5 Others
  2.3 Meta Strategies
  2.4 KLEE
  2.5 Cluster

3 Methods
  3.1 Pathscore-Relevance
    3.1.1 Path Score
    3.1.2 Component Relevance
    3.1.3 Coalesce Pathscore-Relevance
  3.2 Random-Shuffle-Round-Robin
  3.3 Evaluation
    3.3.1 Metrics
    3.3.2 Test Design
    3.3.3 Evaluation on a Cluster
    3.3.4 Evaluating the Evaluation
  3.4 Test Setup
    3.4.1 Software
    3.4.2 Hardware
    3.4.3 Searchers
    3.4.4 Targets

4 Results
  4.1 Number of Found Errors
  4.2 Time until First Error
  4.3 Coverage
  4.4 Consistency of the Results
  4.5 Quality of Targeting for
  4.6 Cluster vs. Dedicated Machine

5 Related Work
  5.1 Automated Testing Techniques
  5.2 Symbolic Execution
  5.3 Solvers

6 Conclusion
  6.1 Discussion
  6.2 Future Work

Bibliography

A Appendix
  A.1 KLEE Test Arguments
  A.2 Reduction Proof
  A.3 List of GNU Coreutils
  A.4 Results

Chapter 1

Introduction

Software development is a complex process and often results in complex software as well. Since any complex process can easily lead to mistakes, some form of quality assurance is required. This is usually done by testing the product continuously along its development [1]. Most of today's software tests are implemented by the developers themselves. In many of these cases this is done by hand with respect to the intended outcome of a function or even a full program, which has several problems, the most important one being incompleteness. To avoid the exhausting process of writing test cases by hand, mechanisms for automated test case generation already exist. One of these is a method called Symbolic Execution.

Symbolic Execution [2] is a technique in software testing which analyses a given program automatically. To do so, the program's (or function's) inputs are represented as symbolic values and test cases that cover all possible combinations are generated automatically. By these means, a completely and successfully tested piece of code either yields errors for some specific inputs, which can then be used for further debugging, or yields no errors and is therefore proven to be correct with respect to the executed assertions. The last part also makes Symbolic Execution interesting for software verification, although this is mainly of theoretical interest, since analysing all possible paths of a program is hard to achieve in practice. Most of the time it is close to impossible with today's technology in any feasible amount of time. This problem is due to the fact that Symbolic Execution tries every possible path of a given piece of software to test it completely, and the number of paths in a program typically grows exponentially, which is often denoted as the path explosion problem [3]. For practical usage this therefore represents a huge problem when thinking about scalability.

With these problems of an otherwise great process in mind, one can see that the task of finding a good execution sequence is crucial for the practical use of this method. One possible approach, which will be followed in this thesis, is to build or modify the searcher in such a way that the choice of paths to take through the program is improved, and an error is therefore more likely to be found early on.


1.1 Research Question

The main research question this thesis answers is:

What effect do different searching heuristics in Symbolic Execution have on finding errors and how could they be combined or influence each other to improve the general outcome?

Because there exist two typical methods to approach that problem, the question can be broken down further into two distinct categories. The two categories typically followed when searching are directed search, where the goal is to navigate primarily towards a specified area, and general search, which aims to cover as much as possible. The same principles also apply to the automated search heuristics analysed in this thesis, which can be seen as part of either of the categories. A more specific research question results from that:

Are search heuristics of one category strictly better in their purpose than other search heuristics which do not share the same specialisation?

1.2 Scope

The scope of this thesis includes the identification and classification of existing heuristics for exploration strategies (searchers). This mainly focuses on the strategies available in KLEE, but also on prototypes of other strategies implemented within the course of this thesis. Furthermore, the overall evaluation process, including a quantitative analysis of a chosen set of promising searchers executed in a sufficiently large environment, is naturally within the scope of this thesis.

Symbolic Execution still has to deal with a certain number of unsolved problems, like the interaction with the general environment or dealing with parallelism. These problems are clearly out of the scope of this thesis and thus not addressed further.

1.3 Ethics and sustainability

This thesis follows all ethical standards to the best of the knowledge of its author.

Even though the purpose of this thesis has only the best intentions, it cannot be barred from profoundly dishonourable use. That is the case because Symbolic Execution is, generally speaking, an area within software testing and thus has the possibility to affect every software product. Thinking further, this not only allows improving the quality of society and in general making every technical product more secure, but could also improve the reliability of weapon systems. Such assumptions are, however, a bit far-fetched, and even if that were the case, the improvements to society seem more important and the accompanying increase in security is expected to even out possible negative side effects.

Sustainability and the development goals connected to it are formed by three pillars [4, 5]. These are economic development, social development and environmental protection. Out of these three main pillars, the one most related to the work within this thesis is the economic pillar. That is because the results from this thesis are structured to give an insight into the scheduling algorithms used for Symbolic Execution. These are typically used to reduce the execution time required for testing and thus also cause a reduction in the amount of required resources, leading to a more sustainable environment. Within this thesis, social development does not really play a huge role, even though some work defines the economic pillar as a subset of the social one [6]. Finally, environmental protection is covered by the argument that fewer bugs in code subsequently lead to fewer critical events which could otherwise have a huge impact on the environment. Besides that, faster and more effective software testing not only reduces the overall required resources like energy, but also saves material. That positive effect of software testing in general is due to the resulting lower likelihood of damaging components with inappropriate, faulty software, thus reducing the amount of generated waste.

1.4 Structure of this thesis

In Chapter 2, the necessary background for this thesis is explained, including the basics of Symbolic Execution as well as different search strategies that are used later in the evaluation. Chapter 3 presents the method used. The results are then presented in Chapter 4. Related work is presented in Chapter 5. As a conclusion, Chapter 6 discusses the main contributions of this thesis and points out certain highlights together with possible future work.

Chapter 2

Background

Within this chapter the overall background of this thesis' specific field, Symbolic Execution, is explained. It is a method of software testing by program analysis that uses symbolic input values instead of concrete input values and was first introduced in 1976 by King [2]. Further, a special focus is set on the exploration strategies most important for this thesis, which are also referred to as searchers. Additionally, an insight into the practical usage of Symbolic Execution is given by presenting KLEE [7], the Symbolic Execution engine used in this thesis. Finally, in the last section, a basic introduction to the evaluation environment, a cluster, is given.

2.1 Symbolic Execution

The basic idea and ultimate goal of Symbolic Execution is to test every single possible input value of a program to eventually find all bugs within it. This is, sadly, nearly impossible in practice. For example, a program taking a 20-character-long string as input would end up with 256^20 (= 2^160) different possibilities. In practice this means that not even all of the earth's computing power could brute-force such a Symbolic Execution in a lifetime. To put this in perspective, AES [8] has a typical key size of 128, 192 or 256 bit, which leads to 2^128, 2^192 or 2^256 possibilities respectively. It is currently a commonly used algorithm for encryption, which substantiates, based on general security requirements, the assumption that such immense numbers of possibilities are perceived as safe (at least against brute-force attacks, to which Symbolic Execution is basically comparable).
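Spelled out, the arithmetic behind this example (added here for clarity) is:

256^{20} = (2^8)^{20} = 2^{160}, \qquad \frac{2^{160}}{2^{128}} = 2^{32},

so the input space of the 20-character string is already 2^32 times larger than the key space of 128-bit AES.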

Now, instead of executing the program once for every single possible input, Symbolic Execution groups different inputs that show the same behaviour together to reduce the actual number of executions. To do so, all possible variable assignments are tracked in the form of so-called path constraints. Path constraints form a set alongside each path, consisting of all conditions that have to hold at the current depth of the path. In Figure 2.1b an example of how path conditions are structured can be seen next to each transition of the tree. These path constraints are required to keep track of all the different input groups that are formed during the exploration of the program. In Symbolic Execution these inputs are referred to as symbolic inputs, or more generally symbolic values, because the actual value is yet to be determined, which can be seen as the overall goal of Symbolic Execution.

An example of how symbolic input can be utilized within a program can be seen in line 3 of Listing 2.1a. Using symbolic values means that instead of reading a concrete value (e.g. 5) and assigning it to a variable a, it is assigned a symbolic value (e.g. α). Any further usage of the variable is treated as a function over the symbolic value. For example, a multiplication of a by 3, as in line 4 of Listing 2.1a, would change the assignment of a from α to 3 ∗ α. Another special case besides symbolic inputs are conditional statements (i.e. if, while, etc.). Because such conditions can have symbolic values as arguments, they have to be handled separately. This happens when, for example, a condition is reached as in line 5 of Listing 2.1a. In the example the condition checks whether a == 9, but a was, according to the previous example, defined as 3 ∗ α. Recall that α is the symbolic value read in line 3 of Listing 2.1a. So when checking the condition there are two possible ways of evaluating it: true if α assumes the value 3, and false in any other case. To still be able to explore both branches, the program state is forked. In the context of Symbolic Execution the fork is a bit different: even though it is in fact analogous to the more well-known (process-)fork, the target here is not the process. After the fork two individual paths exist, each consisting of a copy of the original program state and the added path constraints. In our example the path constraints would be 3 ∗ α = 9 for the true case and ¬(3 ∗ α = 9) for the false case. Before each further fork the new conditions are checked against the already gathered ones to see which of the branches are still satisfiable.
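To make the forking step more concrete, the following C++ sketch shows one way a state with its path constraints could be represented and forked at a symbolic branch. It is an illustration only; the Expr type, the placeholder solver call and the function names are assumptions and do not correspond to KLEE's actual data structures.

    #include <map>
    #include <string>
    #include <vector>

    // Illustrative sketch only: Expr and the solver call are placeholders.
    using Expr = std::string;

    struct State {
        std::map<std::string, Expr> symbolic;   // e.g. "a" -> "3*alpha"
        std::vector<Expr> pathConstraints;      // conditions collected along this path
    };

    Expr negate(const Expr &e) { return "!(" + e + ")"; }

    // A real engine would hand the conjunction of pc and cond to an SMT solver.
    bool maySatisfy(const std::vector<Expr> &pc, const Expr &cond) {
        (void)pc; (void)cond;
        return true;                            // placeholder answer
    }

    // At a branch on a symbolic condition, keep one copy of the state per feasible outcome.
    std::vector<State> forkAtBranch(const State &s, const Expr &cond) {
        std::vector<State> successors;
        if (maySatisfy(s.pathConstraints, cond)) {
            State t = s;                        // copy of the whole program state
            t.pathConstraints.push_back(cond);  // e.g. "3*alpha == 9"
            successors.push_back(t);
        }
        if (maySatisfy(s.pathConstraints, negate(cond))) {
            State f = s;
            f.pathConstraints.push_back(negate(cond));  // e.g. "!(3*alpha == 9)"
            successors.push_back(f);
        }
        return successors;
    }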

According to the program shown in Listing 2.1a, a Symbolic Execution Tree as in Figure 2.1b can be created. This tree visualizes the set of path constraints after every branching point alongside its edges. A branching point in this case specifies a point in the program that is able to alter the flow of control and thus decide which branch to execute next [2]. For example, the root node presents the condition of the branching point in line 5 of Listing 2.1a. Following this are two paths: one that is displayed in the tree to the left, if 3 ∗ a == 9, and one to the right, if the equation does not hold. From then on, after each branching point the new constraints are added to the set of already existing path constraints individually for each path and checked again when arriving at the next branching point. This process repeats for each path and goes on until a program exit point or a failure is reached.

During this process, Symbolic Execution needs to keep track of all gathered constraints and check at each branching point for further paths. This checking of constraints is done with the help of solvers, because such queries are proven to be NP-complete [9], making them obviously hard to solve. The commonly used standard solvers are typically satisfiability modulo theories (SMT) solvers. Generally speaking, SMT solvers are an extended version of SAT solvers, since they extend the Boolean logic of SAT solvers with first-order logic [10].

With the help of these solvers, Symbolic Execution can reveal under which conditions a path is reachable. This, in combination with the final step of picking a concrete value that follows all such gathered path constraints, leads to specific input values which are necessary to explore that path. Thus it inherently and automatically creates test cases for all such discovered paths. This helps to reproduce found errors, assisting in the development process.

Listing 2.1a:

 1  int main(void) {
 2      int a, b;
 3      a = read_symbolic();
 4      a = 3 * a;
 5      if (a == 9) {
 6          do {
 7              b = read_symbolic();
 8          } while (a != b);
 9      }
10      doStuff();
11      return 0;
12  }

(a) Example of a program using Symbolic Input Values

[Figure 2.1b, partial Symbolic Execution Tree: the root node (path constraints {}) branches at "if (a == 9)". The left edge, labelled {3 ∗ α = 9}, leads to the loop node "while (a != b)"; the right edge, labelled {¬(3 ∗ α = 9)}, leads to "doStuff()". From the loop node, the edge labelled {(3 ∗ α = 9) ∧ ¬(3 ∗ α = β)} leads back to "while (a != b)", while the edge labelled {(3 ∗ α = 9) ∧ (3 ∗ α = β)} leads to "doStuff()".]

(b) Partial Symbolic Execution Tree

Figure 2.1: An example program using Symbolic Input Values in the listing on the left and the Symbolic Execution Tree generated from it on the right, showing the program's search space

In the example tree, shown in Figure 2.1b, the set of path conditions (e.g. {3 ∗ α = 9}) of each path is displayed alongside the transitions from one state to another. An important aspect that has to be kept in mind when looking at the shown example of (3 ∗ a = 9 ⇔ a = 3) is that it is not always valid to cancel out factors (e.g. 4 ∗ a = 8 ⇔ a = 2 does not hold in 32-bit arithmetic). The important point here is that the constant factor has to be coprime to 2^32 in order to qualify for such cancelling. A more detailed proof using SMT queries can be found in Appendix A.1 and A.2.
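To spell out why coprimality matters (a short illustration added here; the thesis' own proof via SMT queries is in the appendix), consider the two equations over 32-bit machine arithmetic:

3\alpha \equiv 9 \pmod{2^{32}} \iff \alpha \equiv 3 \pmod{2^{32}}, \quad \text{since } \gcd(3, 2^{32}) = 1,

4a \equiv 8 \pmod{2^{32}} \iff a \equiv 2 \pmod{2^{30}}, \quad \text{i.e. } a \in \{2,\ 2 + 2^{30},\ 2 + 2^{31},\ 2 + 3 \cdot 2^{30}\},

so cancelling the factor 3 is valid, while cancelling the factor 4 would lose three of the four solutions.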

Symbolic Execution is a sound and, in the case of a finite execution, also complete method of locating all possible paths of a program [11]. This can easily be proven via induction. Looking at the root of the execution tree, no steps have been taken so far, so it is sound and complete in the beginning. From there on, two possible kinds of steps can be taken. The first possibility involves variable access, which requires Symbolic Execution to keep track of every such instruction. Because this requirement is fulfilled, the reasoning still holds. The other possible step is a conditional branching point. At such a point a decision has to be made whether to explore the left or the right path. As explained earlier, at these points queries are sent to and executed by the SMT solver to support the decision. This makes the soundness and completeness of Symbolic Execution dependent on the soundness and completeness of the underlying SMT solver.

One of the major challenges when working with Symbolic Execution is the path explosion problem. It occurs because today's software typically consists of thousands or even millions of lines of code [12] and therefore, presumably, also contains many branching points. For example, the average GNU Coreutil has between two and ten thousand lines of code [13], while more complex software like OpenOffice consists of around 9 million lines of code [14]. Due to the nature of branching points, which open up two new paths, this typically results in a lot of forks, thus letting the state space grow exponentially.

When recalling the previously shown example of the Symbolic Execution Tree in Figure 2.1b, one realizes that the loop containing symbolic values, as shown in lines 6 to 8 of Listing 2.1a, causes the tree to keep branching indefinitely, which results in an infinite state space. Every iteration of the loop requires a fork and a new symbolic value is read, and this keeps going on since there will always exist at least one possible assignment of the variables such that ¬(a = b) holds. The general question of termination when performing dynamic analysis, to which Symbolic Execution also belongs, is a well-known problem, also known as the Halting Problem [15].

Other unsolved problems, like general environment interaction or how to deal with parallelism, are still open modelling problems; they are not relevant for this thesis and will therefore not be explained any further. Real-world applications of course use all of these features, which reduces the applicability of Symbolic Execution to a subset of those applications.

Even though King originally did not define a specific approach for systematically exploring the tree, today search strategies are used to direct the execution towards the most promising paths. This is a necessity because of the rapidly increasing size of the execution tree, which often makes it impossible to cover all of it in reasonable time. Therefore searchers became a big topic in research on Symbolic Execution. This led to experiments with search strategies that had already been proven to be effective in other areas, but also caused the development of brand new ones.

2.2 Search Strategies

Starting with a set of so far unexplored paths, a search strategy is supposed to navigate through them as efficiently as possible. What effective navigation actually looks like depends a lot on the intended purpose. Within this thesis, therefore, the two main categories, consisting of targeted searchers and general purpose searchers, are analysed. In practice each searcher maintains a frontier of nodes that can be explored next. This frontier gets updated after every new exploration by removing the explored node and adding all newly discovered nodes that follow the explored one. Which of these nodes will be explored next is determined by the implemented strategy, ideally by choosing the most promising paths first. This, however, is not necessarily easy to decide, due to the differing definitions of what is promising and what is not. Thus, over the years, a lot of research has introduced novel search strategies aiming to solve this problem. Despite all this work, these strategies often perform very well only in certain areas, but lack efficiency in others. Hence there currently is no such thing as the optimal universal search strategy, which sets the base for this thesis. To create a fundamental understanding of the basic search principles, some common search strategies are briefly explained in the following:
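The following C++ sketch illustrates this generic select-execute-update loop around the frontier. It is not KLEE's actual searcher interface; selectNext and executeOneStep are simplified placeholders.

    #include <algorithm>
    #include <vector>

    struct State { bool terminated = false; };

    // Placeholder strategy: here simply the most recently added state (DFS-like).
    State *selectNext(std::vector<State *> &frontier) { return frontier.back(); }

    // Placeholder engine step: would run the state up to its next branch and
    // return all newly forked states.
    std::vector<State *> executeOneStep(State *s) { s->terminated = true; return {}; }

    void explore(std::vector<State *> frontier) {
        while (!frontier.empty()) {
            State *next = selectNext(frontier);                      // strategy picks a state
            std::vector<State *> discovered = executeOneStep(next);  // run until the next branch
            // Update the frontier: drop the explored node, add the newly discovered ones.
            frontier.erase(std::remove(frontier.begin(), frontier.end(), next), frontier.end());
            frontier.insert(frontier.end(), discovered.begin(), discovered.end());
        }
    }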

2.2.1 Depth-First

One of the longest existing search strategies is Depth-First Search (DFS). Its roots go back as far as the 19th century, when Charles Pierre Trémaux [16] supposedly already worked with a version of the algorithm.

The way DFS works is probably best described with a stack: it always navigates towards the most recently discovered new state, as this is the last one added to the search frontier. As a result, DFS typically reaches depth fast. Since it follows one path completely until reaching an end, DFS can get stuck in loops very easily. This is especially a problem in Symbolic Execution when reading user input in a loop, because it can result in an infinite loop due to the symbolism of the read input. Figure 2.1b shows an example where, under some circumstances, DFS would not be able to make any further progress.

2.2.2 Breadth-First

Originally, Breadth-First Search (BFS) was invented by Moore [17] to find the shortest path out of a maze. In parallel, Lee [18] also discovered this algorithm while trying to find a solution for routing wires on a circuit board.

Opposite to Depth-First Search, this strategy traverses the execution tree slowly, by exploring all states of the same depth before going further into depth. This behaviour can be compared to a queue, which usually also works in a First-In-First-Out (FIFO) manner. Because BFS does not focus on a specific path to follow all the way until the end, it does not show any loop-related problems like DFS. Nevertheless, a downside of BFS for this use case is the large state space in Symbolic Execution due to the path explosion problem. This makes it overall hard for BFS to reach certain depths.

2.2.3 Random

Another well-known approach is randomization. The most popular randomized search is also referred to as Random State Search. It typically chooses the next state to discover completely at random, according to a uniform distribution over all possible next states.

A modified version of the Random State Search is the Random Path Search. In contrast to the former, Random Path Search builds a binary tree which is traversed to determine the next state to discover. Thereby, nodes closer to the root of the execution tree are favoured, since they require fewer coin flips to be won. This change in distribution should increase the possibility of exploring new, maybe more interesting, paths quicker. It also reduces the chances of getting stuck in a loop like DFS, while still having a chance to reach depth quickly.
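A C++ sketch of the coin-flip descent described above is given below, under the simplifying assumption of a binary tree whose leaves hold the pending states; the node layout and names are assumptions, not KLEE's implementation.

    #include <random>

    struct State;                               // opaque execution state

    struct TreeNode {
        TreeNode *left = nullptr;
        TreeNode *right = nullptr;
        State *state = nullptr;                 // non-null only at leaves (pending states)
    };

    // Assumes every leaf holds a pending state.
    State *randomPathSelect(TreeNode *root, std::mt19937 &rng) {
        std::bernoulli_distribution coin(0.5);
        TreeNode *node = root;
        while (node->state == nullptr) {        // descend until a pending state is reached
            if (node->left && node->right)
                node = coin(rng) ? node->left : node->right;   // fair coin flip per branch
            else
                node = node->left ? node->left : node->right;  // single child: follow it
        }
        return node->state;  // a leaf behind d coin flips is picked with probability about 2^-d
    }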

2.2.4 Coverage-Optimized

As one can already tell by the name, this search preferably navigates towards states that have a higher chance of covering new code. The following explanation is based on KLEE's [7] default searcher, but similar methods are presented in MAYHEM [19] and S2E [20].

To determine whether states are likely to cover new code or not, heuristics are used. Such heuristics could be, for example, the minimum distance to a so far uncovered instruction, or whether the state recently covered new code. A combination of these two is part of the current default strategy in KLEE. Some others would be, for example, query cost, exponential depth or an instruction count. Based on such heuristics, weights are assigned to each possible next state. The next state is then chosen randomly, with a probability based on the weight of each state.
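The weighted choice itself can be sketched as follows in C++; how the weights are computed from the heuristics named above is deliberately left out, so this is only a minimal illustration.

    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw the index of the next state with probability proportional to its weight.
    std::size_t selectWeightedIndex(const std::vector<double> &weights, std::mt19937 &rng) {
        std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        return pick(rng);   // index into the frontier of candidate states
    }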

2.2.5 Others

In addition to the searchers listed above, which are the most relevant for this thesis, many other approaches exist. The most interesting and perhaps most diverse strategies are explained briefly in the following paragraphs:

One of the most classic approaches for a searcher is fitness guidance. A more precise use of such a searcher in the Symbolic Execution area was proposed by Xie et al. [21] with the Fitnex search strategy. Fitnex is a strategy making use of individually calculated fitness values, depending on the current state. This is done by taking the fitness value of the current path (F(path)) and subtracting the fitness gain for a branch (FGain(branch)) from it. Branches with a better (lower) composite fitness value then have priority when discovering new states. It was implemented for Pex [22], an automated structural testing tool developed at Microsoft Research.

Another similar strategy is the subpath-guided search (SGS), developed by Li et al. [23]. The main idea of this strategy is to exploit the length-n subpath program spectra, which comes down to focusing on less travelled parts of the program. This is done by using a frequency distribution of explored length-n subpaths and shall therefore improve the overall coverage and error detection.

Besides focusing on the execution tree, there are other methods aiming at program-specific calls. One example are the heuristics presented by Cha et al. [19], introduced with their Symbolic Execution engine MAYHEM. These heuristics prioritize paths containing symbolism, such as symbolic instruction pointers or symbolic memory accesses, which are supposed to be more likely to contain bugs.

In another tool named AEG [24], two new heuristics were also presented. The first heuristic is called Buggy-Path-First, which is similar to the earlier presented Coverage-Optimized ones, but continues the exploration with strong priority on paths which have already been proven to contain bugs. This rests upon the assumption that humans are likely to repeat mistakes throughout their work. The other heuristic, in contrast to many other heuristics, gives priority to exhausting loops and is thus called Loop Exhaustion. To avoid the problem of getting stuck deep in a loop, preconditioned symbolic execution alongside pruning is used.

2.3 Meta Strategies

As previously stated, most of the presented strategies have a typical domain connected to their strengths, but then fall off in other areas. To improve the general effectiveness of these search strategies, more than one of the strategies mentioned in the previous section can be combined. This opens up a completely new area of research, since such combinations can be done in many different ways. Currently there are not many such strategies, except for very basic ones, which are explained below:

Round-Robin A deterministic and rather simple technique is the concept of Round-Robin. Hereby each search strategy receives a fixed number of executions or even a dedicated timeslot before switching to the next. This Meta Strategy is currently the default choice in KLEE, where it is used to alternate between Random-Path Search and a form of the Coverage-Optimized Search explained earlier.

Random The random meta strategy is very similar to the Random State Search presented in the previous section. The only real difference is that the meta version chooses the next search strategy at random rather than the next state to discover. This is a fairly simple approach, but could help overcome local optima or specific deficits of deterministic searches by combining two or more strategies.

2.4 KLEE

KLEE is an open source Symbolic Execution engine, published in 2008 by Cristian Cadar, Daniel Dunbar and Dawson Engler [7] and written in C++. It operates on bytecode created by the LLVM compiler [25] for GNU C, by directly interpreting the created instruction set and mapping instructions to constraints with bit-level accuracy. These constraints are forwarded to solvers, where they are then checked for satisfiability. As for the state scheduling, KLEE determines at each instruction step which state will be discovered next, according to the chosen heuristic. By default that is Random Path interleaved with Coverage Optimized.

Besides currently only focusing on the language C, KLEE is, at the point this thesis is written, not able to support some other features, for example operations involving symbolic floating point values, threads and assembly code.

Once set up, KLEE itself is rather easy to start and test with. Before the actual KLEE run, the target code first needs to be compiled to bytecode using the LLVM compiler:

llvm-gcc --emit-llvm -c example.c -o example.o

After the bytecode was generated, KLEE can be run. Users can state the number, size and type of used symbolic input arguments by adding parameters to the call:

klee --max-time=3600 example.bc --sym-args 0 2 2 --sym-files 1 8

The very first option, --max-time, specifies the maximum time for which KLEE checks the targeted program. The latter two options describe parameters specifying the symbolic inputs. With the first symbolic input option, --sym-args 0 2 2, the user specifies the use of 0 to 2 symbolic inputs, each with a length of up to 2. The other option, --sym-files 1 8, specifies that, besides the already defined symbolic inputs, 1 file is used, holding symbolic data with a size of 8 bytes.

A more in-depth analysis of all arguments used within this thesis can be found in the Appendix A.1. Additionally, a more detailed explanation of the general usage of KLEE can be found on their respective website 1.

2.5 Cluster

A cluster, in the sense of a computer cluster, is a typically tightly connected set of computers which work together so that they can be viewed as a single system. The most obvious benefit is the greater computational power, because of the typically large number of mounted cores. This especially comes into play for programs or tests that can be scheduled in parallel, since a cluster typically has many nodes which, compared individually, are not necessarily the most powerful. Besides the benefits there also exist a couple of trade-offs one has to deal with. The following paragraphs point out some of the most important aspects when planning an evaluation like the one in this thesis.

The first and most important point concerns other users. One usually does not get a whole cluster for a single project without sharing some resources with others. This obviously leaves an open question about the quality of the results, which will be addressed later on in this thesis.

Another often occurring aspect is the privileges of the test user. Typically, users do not receive root access to the system, which makes setting up the necessary tools or working with them particularly difficult. In most cases a workaround exists, because this is a common problem, but one still has to keep it in mind.

Finally, when running tests on a cluster it is important to keep in mind that it is a distributed environment. Such an often heterogeneous environment requires in most cases a special setup to even be able to run tests. Many clusters make use of workload managers to queue and run commands on their machines. Thus it is important to be aware of the possible parallelism of the test setup and of how to best utilize the given tools.

1 http://klee.github.io/docs/options/

Chapter 3

Methods

Currently, a lot of research focuses on different Symbolic Execution engines and the development of search strategies which typically aim to achieve the best possible results single-handedly. Most of these searchers are either tuned for very specific use cases, but tend to leave out some parts of the program due to this specialisation, or are more general, exploring a lot of different paths and thus not necessarily good at finding deep errors in certain areas. Because of these often distinct approaches, an analysis of whether these assumptions really hold seems interesting. Additionally, the question arises whether a combination of such rather different approaches could be of any use.

The focus of this thesis therefore is to analyse what effects different search strategies in Symbolic Execution have on finding errors and how they could be combined or influence each other to improve the general outcome.

Before going into detail and presenting the test environment in section 3.3, one major search strategy, Pathscore-Relevance, which was implemented during the course of this thesis, as well as a novel approach for a meta searcher combining different strategies, will be explained.

3.1 Pathscore-Relevance

Besides the already existing searchers, another promising strategy was implemented during the course of this thesis. This search strategy is a special form of a weighted strategy and was initially presented by Andrica and Candea [26]. The original thought behind this strategy was to present a new metric to combine two usually different perspectives of a program.

The first perspective defines a component by its relevance, which can be represented as either the developer's or the end user's view of the program. This aspect is supposed to make sure that all interesting, critical or often used paths are prioritized, for obvious reasons. The other aspect is a static analysis of the program's execution paths, which is represented by the so-called path score.


3.1.1 Path Score

The path score weighs execution paths inversely to their length, thus encouraging the execution of shorter paths before longer ones. The underlying concept behind this strategy is that many longer paths are a combination of partially overlapping shorter paths anyway. This should help to reduce test redundancy, thus improving the overall efficiency. A possible formula following this strategy, as presented by the authors, could be denoted as:

P(c) = \sum_{L=1}^{\max} \frac{1}{2^L} \cdot \frac{pathsExercised_L(c)}{totalPathsExercised_L(c)}    (3.1)

In Formula 3.1, pathsExercised_L(c) stands for the number of exercised paths of length L in c, while totalPathsExercised_L(c) is the number of all possible execution paths of length L. In theory the length L ranges from 1 to max, but when executing according to this formula the factor of 1/2^L drastically limits the impact of longer paths. Therefore any paths of length L > 8 (1/2^8 < 0.4%) are unlikely to improve P in any significant way. Since this function has a very aggressive decay, it is possible to reach an overall value plateau before running out of resources. For a more extensive search, and to counter such an early plateau, the function can be changed to any slower (but still monotonically) decreasing function. It is also important that \sum_{l=1}^{\infty} f(l) = 1 still holds.
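One slower-decaying but still monotonically decreasing choice that keeps this property (an illustrative example added here, not taken from the original paper) is:

f(l) = \frac{1}{l(l+1)}, \qquad \sum_{l=1}^{\infty} \frac{1}{l(l+1)} = \sum_{l=1}^{\infty}\left(\frac{1}{l} - \frac{1}{l+1}\right) = 1 .

This function decays polynomially rather than exponentially, so longer paths retain noticeably more weight while the sum still telescopes to one.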

3.1.2 Component Relevance

The second part, the relevance, is determined by calculating the importance from a certain perspective (or several). This should therefore target more important functions, hence lead to better coverage around critical or often executed areas. As mentioned before, this could be determined implicitly by the end-users, which would lead to calculation based on e.g. the usage of certain parts of the program or set explicitly based on feedback.

The other perspective is the rating of a component's relevance based on the developer's point of view. Thereby the code itself is explicitly marked regarding different levels of interest (e.g. critical, standard, low, . . . ). Another possible approach from the developer's perspective could be to take time-related data like recent changes or updates into account. One example is a point system working with different levels of importance, leading to a formula like this:

R(c) = \frac{pointSum(c)}{\max_{c_i \in P} pointSum(c_i)}    (3.2)

Formula 3.2 makes sure to capture the relative importance of all components in a program. With that, it is ensured that the search navigates towards the more interesting parts of the program. In this case, pointSum can for example be based on new changes and gathered as pointSum = \sum_{l \in lines} newCode_l, or, with a (not-)important flag, as pointSum = \sum_{l \in lines} flagged_l.

A similar formula can be used when making use of the end users' usage information, except that the pointSum property would be usage-related (e.g. NumTimesExecuted).

3.1.3 Coalesce Pathscore-Relevance

The combined Pathscore-Relevance rating is built such that the higher the relevance of a component, the higher its path score must be in order to compete with other, less relevant components. This is achieved by combining the two parts, path score and relevance, to calculate the final weight of each component, as seen in Formula 3.3, and ranking all paths according to it.

PR(c) = \frac{P(c)}{R(c)}    (3.3)
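As an illustration of this competition (with made-up numbers, not taken from the source): under this definition a component with relevance R = 0.9 needs a path score of P = 0.6 to reach the same rating as a component with R = 0.3 and P = 0.2, since

\frac{0.6}{0.9} = \frac{0.2}{0.3} \approx 0.67 .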

The goal is to maximize the overall PR over all components (\sum_{c_i \in P} PR(c_i)). This can be achieved with a rather naive greedy approach, which guides the efforts primarily towards components with high relevance in order to improve their path score.

Even though Pathscore-Relevance is an innovative and quite promising approach, it of course also has some possible drawbacks, due to the way longer paths become gradually less likely to be visited. This can result in a problem if most of the bugs tend to lie on long paths within a component. The deliberation about whether most resources should be spent on finding a very deep bug or on continuing to search for more bugs in shallower paths is fully up to the component's relevance. This means that the overall quality of this searcher depends a lot on the quality of the relevance. If the developers misinterpret some important sections or simply ignore possibly error-prone parts of the program, the strategy will most likely miss out on some bugs, if they are not shallow enough to be caught by the general maximization of P.

Nevertheless, most of these problems heavily depend on human interaction and can be worked around. The strategy also provides the advantage that both aspects, the developer's estimates as well as the way customers use the system, have an impact on its decisions, in order to maximize its effectiveness. To do so it prioritizes execution based on usage information and potentially unstable or critical code.

3.2 Random-Shuffle-Round-Robin

In addition to the rather basic meta strategies introduced in the background chapter, another strategy combining more than one searcher has been developed during the course of this thesis. It was developed as a prototype not only to test the combination of different search strategies, but also to introduce a principle of scheduling said searchers in a reactive manner, based on some runtime information. For the sake of this thesis, newly discovered states, further referred to as coverage, were tracked as an example of such runtime information.

This strategy mirrors the behaviour of the basic Round-Robin strategy by default. While executing all queued searchers in this fashion, the generated coverage is tracked. This is necessary to base further decisions on the information generated during the run. If the search reaches a point where the searcher cannot cover any new code within a pre-specified number of steps, a random shuffle is executed. This random shuffle can be seen as one step of the Random State Search described in the previous chapter. The goal of this meta strategy is to combine the benefits of chaining specific search strategies after each other with the advantages gained by targeted randomness at otherwise occurring weak points.
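A C++ sketch of the described behaviour is given below. It is only a prototype-style illustration of the idea (round-robin with a stall counter that triggers one random pick); the class and member names are assumptions and do not reflect the actual implementation.

    #include <cstddef>
    #include <random>
    #include <vector>

    struct State;                                       // opaque execution state

    struct Searcher {
        virtual ~Searcher() = default;
        virtual State *selectNext(std::vector<State *> &frontier) = 0;
    };

    class RandomShuffleRoundRobin {
        std::vector<Searcher *> searchers;              // underlying strategies, used in turn
        std::size_t current = 0;                        // whose turn it is
        std::size_t stallSteps = 0;                     // steps without newly covered code
        std::size_t stallLimit;                         // pre-specified threshold for the shuffle
        std::mt19937 rng;
    public:
        RandomShuffleRoundRobin(std::vector<Searcher *> s, std::size_t limit)
            : searchers(std::move(s)), stallLimit(limit) {}

        // Assumes a non-empty frontier and at least one underlying searcher.
        State *selectNext(std::vector<State *> &frontier, bool coveredNewCode) {
            stallSteps = coveredNewCode ? 0 : stallSteps + 1;
            if (stallSteps >= stallLimit) {             // no progress: one Random State step
                stallSteps = 0;
                std::uniform_int_distribution<std::size_t> pick(0, frontier.size() - 1);
                return frontier[pick(rng)];
            }
            current = (current + 1) % searchers.size(); // otherwise plain round-robin
            return searchers[current]->selectNext(frontier);
        }
    };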

The shown example of tracking coverage is of course only one of many possible quality measurement methods and was used in this thesis as a proof of concept. Furthermore, this opens up the possibility of extending the meta search with nearly any other heuristic. Some fitting examples could be the query cost or the expected coverage, as mentioned earlier.

3.3 Evaluation

The goal of this thesis is to analyse what effects different search strategies in Symbolic Execution have on finding errors and how they can be combined or influence each other to improve their effectiveness. To measure such effectiveness, searchers can, as presented before, be divided into two groups, each having its own qualities. The first category are targeted searchers, which, as the name already tells, focus on a specific target that is supposed to be reached and explored quickly. The second category consists of so-called general purpose searchers, which typically explore the program more broadly to cover as much as possible, but are therefore not really able to explore long single paths to a deep extent.

A good targeted searcher by these means is supposed to navigate as directly as possible to the defined area within the code and explore that area quickly. Such a searcher typically requires at least some previous knowledge about the program to do well. If no such knowledge exists, these types of searchers are not as applicable, because the targeted area might be completely off, and the searcher would therefore spend too much time on a too deep analysis of a possibly uninteresting area. For such a case, on the other hand, there are general purpose searchers. The grade of a general purpose searcher is defined by how well it can cover as much of the program as possible in a short time. Thus it aims to execute all lines of the program at least once quickly, with the goal of reducing the chance of errors due to completely unexplored areas.

3.3.1 Metrics

Because of the quite different aspects denoted as effective for the two types, different metrics will also be used to classify them and their performance. To fit their supposed goals as well as possible, these metrics are as follows, displayed without any particular order:

• Time until first error
• Number of found errors
• Coverage
• Consistency

The first metric, time until first error, seems especially fitting for targeted searchers, since their purpose is to get quickly to an estimated error-prone area and explore it quickly; they are thus supposed to find possible errors quickly, if the chosen area indeed houses at least one. The number of found errors is a globally interesting factor, especially for practical use. An important note for that metric, though, is that these values can depend a lot on the test targets, since some could favour one specific type of searcher through the layout of their errors. Errors spread more widely throughout a program of course favour general purpose searchers more than programs that have all of their errors bunched in one specific area. Of course these two metrics only make sense in practice when tested on targets with already known errors to look for. This was taken care of by choosing a good and already well tested target set, as presented in section 3.4.4.

The next metric, coverage, was chosen because it gives a good estimate of how much of the program was visited at least once by a searcher, which is, as stated above, a quality label of general purpose searchers. The generated coverage will be measured as executable line coverage, reported by gcov [27]. This is surely a rather conservative measurement, but was chosen because it is widely understood, easy to measure and, due to its simplicity, fairly uncontroversial. Finally, the overall consistency of these results needs to be highlighted to figure out which of the tested searchers can provide high-quality results, aiming to eliminate one-hit wonders. This is necessary because it hardly seems acceptable to have to run several iterations of a searcher to find some errors, when others are usually able to find them in every iteration.

3.3.2 Test Design

To evaluate such a complex problem, a specific setup is required. In this case the environment needs to be even more special, because the number of different possible factors for the experiment is rather large and each run requires a lot of resources (i.e. computing power).

Such a test setup with two or more different factors, each with discrete possible values that are experimented with in every possible combination, is also known as Factorial Design [28]. The typical case of a Factorial Design, practised by the vast majority of factorial experiments, uses only two different levels for each factor, which is then called 2^k Factorial Design [28]. Already in the early 20th century, Ronald Fisher argued that such complex designs are more efficient than studying one factor at a time [29].

The factorial factors for this thesis are the distinct search strategies and all combinations of them, tested on various different targets. This obviously makes the number of tests that need to be executed rise quickly. In addition, each of these test runs has to be performed multiple times to create statistical significance. This is especially important because, as was presented in the background, most of the searchers rely on randomness to at least some extent. In the end the mean of all these results for each searcher is calculated. Since just the mean of such an evaluation does not contain enough information to be significant for the later analysis, the respective confidence interval is also calculated.

This can be done because of the Central Limit Theorem [30]. The Central Limit Theorem basically states that the sum of independent identically distributed random variables tends, for a sufficiently large number of summands, towards a normal distribution. Such independent and identically distributed random variables are generated as results from the experiments in this thesis by construction. Additionally, if all parts of a sum are independent, the sum is too, and division by a constant factor does not change these properties. Therefore the calculated mean is also independent and follows, approximately, a normal distribution. An even looser version of the CLT was given by Aleksandr Mikhailovich Lyapunov [31], stating that it is not even necessary to have identically distributed, but still independent, values.
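For completeness, the usual two-sided confidence interval for such a sample mean (a standard textbook formula, added here; the thesis does not spell it out at this point) is, for n runs with sample mean \bar{x}, sample standard deviation s and Student's t quantile:

\bar{x} \pm t_{n-1,\,1-\alpha/2} \cdot \frac{s}{\sqrt{n}} .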

As mentioned before, to make the analysis possible a sufficiently large number of iterations is required. In a perfect world, without any constrictions and resource limitations, the experiment would basically run close to forever. This would result in ongoing iterations until the value of the sample mean of the measured outcomes for each metric had fully converged towards the population mean, therefore representing the final result of the evaluation. It would also include every single possible searcher plus all possible combinations of them, with every target that can be found. Obviously this prospect is utopian and nowhere close to any doable evaluation.

In reality the number of iterations can already be seen as sufficient with around 20 to 30 runs per searcher configuration and test target. The final lists of chosen searchers (3.4.3) and test targets (3.4.4) are displayed and explained further below.

Even the above proposed and more realistic approach for the evaluation would exceed the time scope of this thesis. A rather typical personal computer these days has about four cores, which would allow the execution to be parallelised, thus saving time. The total load of an estimated 30,000 hours (10 searchers, 30 iterations, 100 hours of test targets) could therefore be split across these four cores, still requiring a runtime of roughly 7,500 hours in total. That is of course only the number in a perfect case with the personal computer being used to its full capacity all the time. In practice this was not applicable. Therefore the time spent for this evaluation needed to be reduced. Two possible methods to reduce the time taken came up:

1. Reduce the number of experiments (i.e. combinations, targets, runs)

2. Increase the amount of resources (i.e. core hours)

Since a further reduction of either combinations of searchers, testing targets or total runs would also mean a reduction in the quality of this evaluation, the obvious choice was to aim for the second method. To follow this method, two common solutions exist: either the use of shared computer processing resources, called cloud computing [32, 33], or the use of a tightly connected set of computing nodes, also known as a cluster [34]. In the case of this thesis the chosen hardware solution for running the evaluation is a cluster. This choice was made because the university was able to provide access to one without charging a fee.

3.3.3 Evaluation on a Cluster

One major concern when working with clusters, though, is the often old, outdated software or scattered, maybe even incomplete, toolchains. This especially becomes a problem when working with software that has a very specific set of dependencies. In the case of this thesis such a rather strict piece of software is KLEE. A detailed installation guide for KLEE can be found on the official KLEE website 1.

Within the course of creating a suitable environment for the evaluation, one goal was to create an easy-to-migrate solution, so that the same evaluation can preferably be run on several different machines and not be too attached to the currently chosen system. The thought behind that was not only backup reasons, but also to make the found results easily repeatable, which is sadly not too common within research. For example, the original plan for the evaluation also included KITE [35] and combinations of it with other searchers, but even with the help of the developer it could not be made to run on the provided machines. Part of the problem was an incomplete setup instruction.

One choice to make sure that a specific setup of tools or a complete toolchain also runs on different hardware is Docker [36]. Docker is an open platform for developers to build, ship and run various types of applications. For the purposes of this thesis Docker seemed perfect, due to the common difficulties in the KLEE setup and an already existing image free to use 2. Sadly, because of security issues through which one could hijack the host system, Docker is usually not standard software on clusters like the one used for the evaluation and could also not be made available. Therefore KLEE had to be set up differently for the use on a cluster.

To assist in setup processes, clusters often provide a module system which allows their users to load programs in different versions dynamically. This, of course, is not a reliable source for reproducibility, since it is very likely that different clusters use different module systems with more or less changing versions. Thus the complete setup had to be done manually to figure out pitfalls and general difficulties, which are explained below.

Macro problems when compiling KLEE

Listing 3.1: Error showing an undefined occurrence of PRIu64

MemoryManager.cpp:99:43: error: expected ')' before 'PRIu64'
    klee_warning_once(0, "Large alloc: %" PRIu64

Listing 3.1 shows, for the chosen setup, an undefined macro called PRIu64. It is typically used as a string replacement technique for printing uint64_t values. Such problems can happen because some header further up in the include chain already pulls in the necessary <inttypes.h> before __STDC_FORMAT_MACROS is defined. An alternative cause could also be an outdated GCC version (a minimum of GCC 4.4 is suggested). There exist two possible solutions to fix this kind of problem.

A more general but fundamental fix would be to make sure that the definition of __STDC_FORMAT_MACROS happens before doing any includes. This change, however, requires studying the include chain, which of course can be a bit tricky.

1 http://klee.github.io/build-llvm34/
2 https://hub.docker.com/r/klee/klee/
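A minimal C++ sketch of that first fix (assuming the standard <cinttypes> format macros; not taken from the KLEE sources):

    // Define the format-macro guard before the very first include,
    // so that the PRIu64 macro is actually exposed.
    #define __STDC_FORMAT_MACROS
    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint64_t bytes = 1ULL << 32;
        std::printf("Large alloc: %" PRIu64 " bytes\n", bytes);
        return 0;
    }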

For the setup on the cluster a simpler method was chosen. The same can also be achieved by replacing all the occurrences of PRIu64 with "llu". This signals the insertion of an unsigned long long, which can basically be seen as equivalent to uint64_t.

Versioning

Generally, a large problem during the manual setup of a number of tools are the different versions, and the dependencies each of the tools has itself do not simplify this. A good approach is to start from the final tool, in this case KLEE, and work towards the front of the dependency chain. This gives a higher chance of meeting all the criteria than starting somewhere in between. Additionally, it is recommended to compile all the manually built tools with the same compiler, if possible.

Libraries

Listing 3.2: Error showing absence of libcap

/usr/bin/ld: cannot find -lcap
collect2: error: ld returned 1 exit status

Another often occurring problem involves libraries. One example of such a problem is shown in Listing 3.2, where a commonly used library, libcap, could not be found.

Listing 3.3: Error showing conflicts with libcap version

cap_file.c: In function 'cap_get_fd':
cap_file.c:199:33: error: 'XATTR_NAME_CAPS' undeclared (first use in this function)
    sizeofcaps = fgetxattr(fildes, XATTR_NAME_CAPS,

Besides the total absence of, or inability to find, a library, compilation or version problems can also occur. The example in Listing 3.3 shows an error during the KLEE compilation, which apparently referenced the wrong version of the library.

3.3.4 Evaluating the Evaluation

Due to the earlier mentioned problems of a multi-user environment such as a cluster, the question arises how effective and how meaningful the obtained results actually are. To address these concerns, a smaller test sample, with a setup as close to the original as possible, was prepared on a separate, dedicated server. This separate setup is supposed to evaluate the effectiveness of running such tests on a cluster. This of course conflicts to some extent with the argument for choosing a cluster to reduce the required time, but to still stay within the time constraints of this thesis, the target set was reduced and fewer runs were executed.

With the use of a cluster as well as a dedicated server, both common and economical architectures for larger test environments are tested. This was important when planning the environment of the evaluation, because it means that the applicability at a large, but still realistic and practical, scale is measured.

Tool              Version
GCC               5.4.0
Python            2.7.12
LLVM              3.4.2
MiniSat           2.2.0
STP               2.1.2
uclibc & POSIX    klee_uclibc_v1.0.0
Z3                4.5.0
KLEE              1.3.0

Table 3.1: Tools in use for the evaluation, based on KLEE

3.4 Test Setup

Within this section all parts that contributed to the evaluation are listed. This is supposed to give a reader not only an insight into the dimension of such a project, but also to make a repetition of the experiments possible.

3.4.1 Software

To begin with, a list of the used software is given in Table 3.1, combined with the versions that were used during the experiments. It presents the most important tools required for this evaluation. The list is not exhaustive and only focuses on the specific needs of KLEE on the tested machines. More basic tools like build-essential or curl are also required, but these are usually expected to be present anyway.

3.4.2 Hardware

As the main test environment, all tests were run on a cluster. That cluster is named Hebbe [37] and was provided by the Chalmers University of Technology. It is built on Intel Xeon E5-2650v3 ("Haswell") CPUs and consists of 315 compute nodes, summing up to a total of 6300 cores, with at least 64 GiB of RAM per node, resulting in a total of 26 TiB of RAM. A more detailed list of the specifications can be found on the Hebbe Hardware website 3.

3 http://www.c3se.chalmers.se/index.php/Hardware_Hebbe

To evaluate the capability of running such evaluations on a cluster, additional runs on dedicated machines, provided by the Chair of Communication and Distributed Systems at RWTH Aachen University [38], were executed. Such a dedicated machine is built on two Intel Xeon E5-2643v4 ("Broadwell") CPUs, adding up to a total of 12 cores, with a total of 256 GiB of RAM.

Mnemonic    Searcher                                              Meta Strategy
COVNEW      Coverage Optimized: Cover New                         —
DEFAULT     Random Path, Coverage Optimized: Cover New            Round Robin
DFS         Depth-First Search                                    —
PR          Pathscore-Relevance                                   —
PRCOVNEW    Pathscore-Relevance, Coverage Optimized: Cover New    Round Robin
DFSBFS      DFS, BFS                                              Random-Shuffle-Round-Robin
QCMDU       QueryCost, Minimum Distance to Uncovered              Random-Shuffle-Round-Robin
RND         Random State Search                                   —
TRIPLE      Pathscore-Relevance, Coverage Optimized: Cover New,   Round Robin
            QueryCost

Table 3.2: List of all searchers (and combinations) included in the evaluation

Since the configuration of these machines can of course also impact certain parts of the evaluation, it is important to note that both of the chosen environments, the cluster as well as the dedicated server, do not allow Hyper-Threading 4, but make use of the Turbo Boost Technology 5.

4 http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html
5 http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html

3.4.3 Searchers

As described in the previous section, it was not possible to evaluate all presented searchers, including all of their possible combinations. In Table 3.2, the set of chosen searchers, including several combinations, is presented. The choice to limit the set to these was made either because they seemed promising due to previous work, like the different versions of Pathscore-Relevance, or because they are already well known, like DFS and BFS, or represent the current standard, like the KLEE default. To give a baseline to evaluate against, the naive random searcher was also included in the evaluation.

3.4.4 Targets

To create a well-balanced evaluation suite, the main target was the GNU Coreutils [13]. Over the last decades they have become the single most heavily tested set of open-source programs and a rather typical test target for any kind of work related to software testing. In addition, the GNU Coreutils provide a complete public error-tracking system, which allowed setting up a specific evaluation that tests the error detection rate of all featured searchers. To make the generated results easier to compare with related work, version 6.10, as in the classic KLEE evaluation [7], was used. The total test set of all GNU Coreutils used for the evaluation contains 88 stand-alone programs of the 101 officially listed in the most recent version. These include programs from all different categories, like file utilities, text utilities and shell utilities. Only utilities that function as wrapper calls to others, like arch (the same as uname -m), are left out, so the selection was not made with a preference for any of the tested searchers.

In addition, to also cover the most recent version at the time of this thesis, a subset of the GNU Coreutils in version 8.27 was used for evaluation. This aims to check whether the conclusions drawn for the older version also hold in the most recent one. A complete list of all tested GNU Coreutils can be found in Appendix A.2.

Another test subject chosen for the evaluation are benchmark verification tasks taken from the Competition on Software Verification at TACAS'17, part of the European Joint Conferences on Theory & Practice of Software [39]. The specific set of benchmarks, taken from the 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), consists of only a selected subset of the competition's tasks, since the overall competition targets a very wide spread of programs. This selected set contains three major tasks from the ReachSafety module, namely ReachSafety-Arrays, ReachSafety-ControlFlow and ReachSafety-Loops.

Chapter 4

Results

This chapter presents the results obtained from the experiments conducted during this thesis. All tested searchers and their combinations are analysed based on their effectiveness regarding the test targets presented in section 3.4.4, according to the four mentioned metrics. This includes the main test target, the GNU Coreutils 6.10, and comparisons with the current version 8.27. In addition to the chosen metrics, the impact of knowledge on a targeted searcher was also tested, with different settings for Pathscore-Relevance, to see how a well chosen target differs from a poorly chosen or a completely wrong one.

The results presented in the following are the outcome of at least 25 iterations on the cluster described in section 3.4.2 1. To verify the obtained results, an additional 20 iterations with a somewhat smaller sample size were executed on the dedicated machines, making sure the cluster's results are actually meaningful. All of these together sum up to a total of about 20,000 CPU hours. To help the readability of all further presented results, they include confidence intervals, unless clearly mentioned otherwise, showing the interval containing 99% of all samples.
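For orientation, one standard construction of such an interval for the mean of n iteration results x_1, ..., x_n is the t-based 99% confidence interval; the thesis does not spell out which construction was actually used, so this is only the textbook form under an approximate-normality assumption:

\bar{x} \pm t_{0.995,\,n-1} \frac{s}{\sqrt{n}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2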

First of all, one important outcome, before going more into depth, is that the combination of DFS and BFS with the Random-Shuffle-Round-Robin meta was not able to produce any meaningful results and was thus omitted from all following figures.

4.1 Number of Found Errors

To begin with, the very first metric to measure the tested searchers' quality is the number of found errors. Measuring this was possible because the used version 6.10 of the GNU Coreutils has a public error tracker with already known errors to validate against, as well as the errors reported in the original KLEE experiment [7]. Regarding the type of errors, it might be interesting that all of the found errors were memory errors that triggered an out-of-bounds pointer. Figure 4.1 includes all tools where at least one searcher was able to find an error.

1 The coverage-specific results are currently based on only around 10 iterations, because their calculation was much more expensive than expected and, at the moment this final draft was handed in, these calculations were still ongoing. Nevertheless, the remaining iterations are not expected to change the overall outcome by much, except to tighten the confidence intervals.


It displays the results for all searchers, grouped per tool. Because the combined number of searchers and tools where an error was found is too high for a single plot, the figure was split into three separate plots, limiting the number of tools per plot for easier readability.

One important aspect, presented by Figure 4.1, is that no searcher could manage to find all errors, thus no clear winner can be crowned. This is especially shown in the very first image of Figure 4.1, where a lot of spaces are left empty, indicating that no error was found, throughout all iterations, for a specific searcher. The searcher that was able to find errors in most of the tools was Pathscore-Relevance. This searcher only missed errors completely in 2 out of the 15 tools during the tests, which consisted of at least 25 iterations each.

Despite its weak point of not finding an error in any of the tools presented in the first image, the current default searcher of KLEE nearly outperformed all other searchers for the remaining tools. In every tool but one it found the most errors, if it managed to locate any at all. Taking into account its consistency, which is explained further below in section 4.4, the default searcher seems like a good overall choice. Pathscore-Relevance and its combination with Coverage-New, as well as the QueryCost and Minimum Distance to Uncovered combination with the Random-Shuffle-Round-Robin meta searcher, also score really well throughout the tests, and additionally detect errors in some of the tools that the default searcher leaves uncovered. To determine whether any of these similarly effective searchers is strictly better than the others, the Wilcoxon signed-rank test [40] was used. Its results showed no significant difference between the four mentioned top searchers, pr, default, qcmdurnd and prcovnew.
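For reference, the statistic behind the Wilcoxon signed-rank test in its standard textbook form (the exact variant and software implementation used are not detailed in the thesis): for paired per-tool results x_{1,i} and x_{2,i} of two searchers, with N_r the number of pairs with non-zero difference and R_i the rank of |x_{2,i} - x_{1,i}|,

W = \sum_{i=1}^{N_r} \operatorname{sgn}(x_{2,i} - x_{1,i}) \, R_i

and the observed W is compared against its distribution under the null hypothesis of no difference to obtain the reported p-values.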

Unlike these more sophisticated searchers with overall good results, other searchers did not score as well, but still draw interest because they found errors which the others could not. One such example is the pure Depth-First Search.

4.2 Time until First Error

The actual time until the first error is found is usually of similar interest as the total number of found errors. This is especially the case due to the iterative process that is software development. Further, this metric is also interesting because generally none of the chosen searchers was designed to ignore simple cases right away, and thus each searcher is expected to find an error for such cases within reasonable time. Figure 4.2 shows the average time until the first error was found for each searcher, grouped by the tools in which an error was found, as presented previously in Figure 4.1, including the confidence interval for 0.01 significance.

Similar to the average amount of found errors, Figure 4.2 leaves a blank gap for searchers that did not find any error at all. Apart from the resulting gaps in the plot, many different times were reported. However, most searchers typically are close to each other, supporting the assumption that every searcher has at least some general, broader exploration scheme that also discovers earlier errors, which are typically closer to the root of the execution tree. A counterexample to that is DFS, which, if it finds something, typically does so rather quickly. Other, obviously less predictable times come from the Random State Search.

Figure 4.1: Average amount of errors found per searcher for all tools in the GNU Coreutils version 6.10

Figure 4.2: Average time (in seconds) until first error found, per searcher for all tools in the GNU Coreutils version 6.10

An important note for this metric is that the total execution time of a searcher for each tool was stopped after 3600 seconds, and several of the searchers really needed this time to actually be successful in their attempt. Still, this maximum boundary seemed fitting, since most of the searchers were successful before meeting the deadline.

Also similar to the results presented for the average number of found errors, the default searcher as well as Pathscore-Relevance seem to again outperform their competitors slightly, even though these differences mostly do not even exceed the one-minute mark.

4.3 Coverage

The overall executable line coverage produced by a searcher is another really interesting metric that, depending on the use case, might be highly valued. Typically, high code coverage is used as an indicator for low error probability and is thus definitely desirable. To give an estimate of this specific quality, Figure 4.3 shows how the different searchers perform regarding one specific coverage metric, namely executable line coverage. The figure shows the achieved coverage in percent for each tool, ordered by coverage, which allows for assumptions about the generated coverage, but not for a comparison of the separate tools.

Figure 4.3: Average executable line coverage, per searcher for all tools in the GNU Coreutils version 6.10

At first glance the overall generated coverage, as presented in Figure 4.3, does not look too promising, with roughly 15 of all tested tools achieving less than 50% and only about 10 tools reaching a coverage of more than 90%. The majority, 40 to 50 tools, was covered between these 50% and 90%, with only DFS consistently showing results with about 10% less coverage than its competitors. The results for the measured executable line coverage were also tested with the Wilcoxon signed-rank test [40], to determine if any of the searchers was able to significantly outperform the others. Similar to the test conducted for the average number of found errors, the triple searcher was again less effective than pr, default, qcmdurnd and prcovnew; additionally, the test showed that the default searcher dominates every other implementation, with a p-value of at most 0.014.

Important for all of these results, however, is that this evaluation took place on a live system, without any sandbox mode. This means that in several tools some of the paths, or in a few cases even nearly the complete program, could not be executed for the calculation of coverage. Problematic tools are, for example, chcon and related tools with their direct influence on permission criteria. Other tools could also lead to problems when reaching certain protected areas of the system. Additionally, when comparing to the original KLEE test cases, one should keep in mind that in about ten cases their team decided to use different parameters which are more suitable for generating higher coverage, which was omitted for this evaluation. With all that in mind, the coverage generated by these searchers might not compare well to tests run in other evaluations, but since all of them were run in the same environment, they all have the same reference point.

4.4 Consistency of the Results

Another important aspect to point out is the consistency of the retrieved results. This is needed because most of the chosen searchers make use of heuristics, often including randomization, to increase their chances of success. Therefore an estimate of their consistency is required. To evaluate that, the average of the variance of all results, across every iteration, was calculated. The comparison between the mean variance of the different searchers is presented for the GNU Coreutils 6.10 in Figure 4.4 and for 8.27 in Figure 4.5.
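A plausible formalisation of this consistency measure, with symbols chosen here purely for illustration since the thesis does not state an explicit formula, is

\overline{s^2} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{n_t - 1} \sum_{i=1}^{n_t} \left( x_{t,i} - \bar{x}_t \right)^2,

where T is the set of tools, n_t the number of iterations for tool t, x_{t,i} the number of errors found in tool t during iteration i, and \bar{x}_t the corresponding mean over the iterations.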

As can be seen in Figure 4.4, the current default searcher of KLEE showed close to no variance in finding errors at all and also seems rather consistent in the timing when such an error is found. This consistency throughout all iterations can seem a bit surprising at first, because the searcher makes use of randomness in addition to the searcher that scored worst in this test, namely Coverage-New. Similarly, the qcmdurnd searcher combination also provided quite surprisingly consistent overall results. Besides these not necessarily expected results, all other results were as expected. For example, a pure random search can of course not be as consistent as a deterministic search like DFS. The same holds for the other searchers that all make use of randomness at some point.

The variance shown in Figure 4.5 for the results of the GNU Coreutils 8.27 was expected to be this low due to the overall small number of found errors. Nevertheless, a similar behaviour as in the results for the GNU Coreutils 6.10 can be observed, with the default strategy being the only searcher whose overall consistency barely changed.

Figure 4.4: Average variance of the amount of errors found per searcher for the GNU Coreutils in version 6.10

Figure 4.5: Average variance of the amount of errors found per searcher for the GNU Coreutils in version 8.27

4.5 Quality of Targeting for PR

Recalling that the major quality of targeted searchers is supposed to be their capability of quickly following a direction towards specified targets, another assumption was made. That assumption is that the effectiveness of targeted searchers depends highly on the quality of prior knowledge, which is supposed to lead the searcher in the correct direction for higher chances of finding an error. To evaluate this assumption, the example implementation of the Pathscore-Relevance searcher, a targeted searcher, developed during the course of this thesis, was tested with different parameters for guidance. Figures 4.6 and 4.7 show the impact of the three different forms of guidance used. These three different settings are:

• Distraction: directing the searcher towards an area with no errors
• Area: setting the target to a broader area in which an error is located
• Direct: aiming towards one specific line containing an error

The presented results were of course only possible to obtain by testing targets where the errors were already known, like the widely used GNU Coreutils. That knowledge was necessary to set up the different targets, directing the searcher towards a certain area. To keep the comparison as clean and easy to follow as possible, only results of targets where at least one of the settings was able to detect an error were included.

With the distracted setting the searcher received bad directions on purpose. This was of course expected to produce rather bad results, even though it did manage to find some errors that are located on a more global scale. The searcher still managed to find them because of the Pathscore part, which makes sure that the searcher does not get stuck in certain areas, even if they are defined as interesting. The second tested targeting is set to a broader area in which one of the errors could be located, rather than a very specific line of code. This pays off especially in tools like ptx, where most of the found errors are located back to back. Finally, the last tested variant was meant to snipe down one specific error by directing the searcher exactly towards one line where another tool had previously reported an error. The presented results, however, show that this was not completely followed by the searcher, most likely because of an unbalanced Pathscore-to-Relevance setting, which reduces the impact of Relevance too fast and thus aborts deeper searches too early.

As far as the time until the first error is concerned, Figure 4.7 only shows one real irregularity, namely for the pr tool. The error in pr is located fairly deep within the execution tree, which is typically hard to reach, leaving the distracted search completely in the dark. The reason that the areal setting scored way better than the direct approach lies in the broader field of relevance for the areal setting, in combination with the previously mentioned imbalance of the two factors, Pathscore and Relevance. In the case of the direct setting, the searcher only has one real target, which loses impact with every further step until all paths even out, while the areal setting features a larger area of interest, thus allowing for a slower decrease of the impact such interesting paths have.

Figure 4.6: Average number of found errors for different target settings for Pathscore-Relevance

Figure 4.7: Average time until first error for different target settings for Pathscore-Relevance

4.6 Cluster vs. Dedicated Machine

As already mentioned in section 2.5, clusters naturally come with some uncontrollable difficulties. Because of several of these possible negatively influencing factors, like multiple other users or the absence of real load control, an effort was made to evaluate how the results obtained from the cluster compare to those of a dedicated machine. Important to note for the experiments presented in the following is that the dedicated machine has stronger single cores, as presented in section 3.4.2. Therefore it covered, as expected, more instructions during the same time, which can be seen in Figure 4.8. This roughly 20% increase in executed instructions of course also makes it more likely to detect errors that the cluster might have missed within the set time constraint.

Figure 4.8: Average number of executed instructions for the dedicated machine and cluster

That this assumption holds is shown in Figure 4.9: the overall increase in executed instructions also led to an overall increase of about 17% in found errors on the dedicated machine, thus showing nearly perfect efficiency.

Figure 4.9: Average total of found errors for the dedicated machine and cluster

An arguably even more important result for the evaluation was that all the errors found on the cluster were also found on the dedicated machine, as displayed in Figure 4.10, thus validating the used methods. Besides that, the average time until the first error is compared in Figure 4.11. Important to notice in this figure is that many of the bars are very high, indicating a long time until an error was found, because of the low overall detection rate. This is due to the fact that, to be able to compare the two environments, every time no error was found the time was set to the maximum time (3600 s) + 1.
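Written out, the time value entering the comparison for each tool and searcher is

t = \begin{cases} t_{\text{first error}} & \text{if an error was found within the time limit,} \\ 3600\,\mathrm{s} + 1 & \text{otherwise,} \end{cases}

which is why tools with a low detection rate show average times close to the 3600 s limit.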

Figure 4.10: Average number of found errors for the dedicated machine and the cluster

Figure 4.11: Average time until first error for the dedicated machine and the cluster

Chapter 5

Related Work

During the last decade a lot of related work in the field of Symbolic Execution was published. Even though King published his ideas about four decades ago (1976), his vision appeared utopian in the past, because the machines back then could not handle the complexity. Time passed, the hardware became more powerful, and fundamental algorithms, like those of solvers, were improved. As a result, most of the research done in the domain of Symbolic Execution happened during the last decade.

This mainly shows itself in the number of distinct engines developed, each following different approaches to deal with scalability problems and secure efficiency in their own way. To begin with, however, other techniques that also perform automated testing will be discussed. Finally, this chapter also highlights some of the most interesting implementations of solvers used within these engines.

5.1 Automated Testing Techniques

Obviously Symbolic Execution is not the only existing method for ensuring software quality, by finding inputs to cover as much of the target code as possible. There are several other related areas which also operate towards a similar goal. These are displayed briefly in the following.

Probably the most common and well-known testing method is manual testing. A typical form of manual testing puts the developer in the role of an end-user to find defects. In the role of the end-user the developer then tries to use different parts of the program.

During the testing process this manual testing typically quickly becomes use-case testing, because most developers manually test with certain goals or according to a test plan rather than just randomly [41]. Such use-case testing is done by translating the specification into small test cases, hoping to cover most of the underlying code. Of course this helps to verify whether the right program is built and also allows for fast manual testing at every level of development. On the other hand, manual testing in any form is a rather expensive way of testing exhaustively and typically loses against automated testing mechanisms as soon as the program gets just a bit more complex [7]. Thus dynamic testing methods, as described next, should usually be pursued more.

Besides manual testing, which is a special form that can be applied to either of the categories, software testing for correctness is typically split into two different categories [42]: white-box testing [43], which aims to test the internal structure of a program, as opposed to black-box testing, which tests the functionality. Symbolic Execution itself falls into the category of black-box testing, thus the following methods are focused on black-box testing.

One rather simple, but still commonly used strategy in this area is random testing [44, 45, 46]. It is typically easy to implement for low-level data types like integers, but gets much harder for higher-level ones (e.g. graphs). Additionally, it has decreasing effectiveness with increasing dimensions of inputs, since exhaustively finding all paths connected to one input is obviously by far easier than finding them for ten. For example, ten 32-bit integer inputs would already lead to (2^32)^10 = 2^320 possible input values.

In contrast to the completely uninformed random testing, fuzzing or fuzz testing is based on some previous knowledge [47, 48, 49]. That previous knowledge is required because fuzzing starts with a specific input that gets slightly altered with every further run to explore as many paths in the targeted area as possible. Based on its behaviour, fuzzing can be seen as the step preceding Symbolic Execution, since Symbolic Execution advances on fuzz testing's methodology by not requiring a starting input.

Another form of black-box testing is boundary value testing. Instead of generating all inputs at random, boundary value testing defines a small set of so-called boundary values, which in the simplest case are the maximum and minimum. The problem here is the same as in Symbolic Execution: the state space rises exponentially, making it infeasible for large projects.

A more scalable solution is presented with pairwise testing. By making use of this testing technique, the state space is reduced to grow at most quadratically in the dimension of the input. Of course it is usually not possible to cover the same amount as with more accurate techniques. This, however, seems in many cases like a good trade-off according to Dunietz et al. [50], since the coverage still seems sufficient and bugs involving three or more parameters are progressively less common [51, 52, 53].

Besides the conventional testing methods described above, a more formal technique for program verification exists. This technique is called Model Checking; its concept was first presented by Queille and Sifakis [54] in the early 1980s and revived a decade later by Clarke et al. [55]. As the name suggests, verification is here based on a model, which can either be created on its own or, often, be automatically derived from the sources. Such a model describes the program's boundaries, and its behaviour can be determined via queries over this model, which are typically expressed as temporal logic properties.

The main difference to Symbolic Execution is that Model Checking typically operates on statements about behaviour over time (e.g. liveness), whereas Symbolic Execution's assumptions operate on a general (memory) state. On the other hand, Symbolic Execution as well as Model Checking both face the problem of path explosion and thus often lack scalability.

Some of the most well-known tools for today's Model Checking are the Java PathFinder [56] and NuSMV [57]. JPF probably comes closest to common Symbolic Execution engines as far as the application goes, because it also operates directly on target programs. Opposite to that stands NuSMV, which practises Model Checking in a more classic sense, directly on generated models. Similar to Symbolic Execution, this only shows a small selection; the actual list of tools is much longer.

5.2 Symbolic Execution

Within this section a small selection of the Symbolic Execution engines most interesting for this thesis will be presented. A more in-depth analysis of modern Symbolic Execution techniques was published by Cadar and Sen [58].

In this thesis several techniques based on the open source Symbolic Execution engine KLEE [7] are tested. KLEE itself has its roots in an engine called EXE [59], one of the first attempts to utilize symbolic execution for automatic test case creation. Further, KLEE provided a base for other research.

One of the engines that evolved from it over the last years is, for example, S2E [20]. S2E is, briefly summarized, a symbolic execution engine that utilizes bidirectional state conversion between symbolic and concrete states. This means that S2E is able to automatically convert data between the two domains, which is supposed to improve scalability.

A similar hybrid approach is followed with MAYHEM [19], a hybrid execution engine which combines classical Symbolic Execution with Concolic Execution. Even though MAYHEM and all of the above mentioned engines have some differences in their approaches and nature of execution, they all use built-in searchers based on the principle of maximizing code coverage.

Another KLEE-based engine is KITE [35]. Unlike the previous engines, KITE focuses on an algorithm named Conflict-Driven Symbolic Execution to maximize learning from previous conflicts. It was inspired by Conflict-Driven Clause Learning, which is a central feature of today's SMT solvers [60, 61, 62]. With this new conflict-driven approach, the knowledge gained from conflicts detected during Symbolic Execution promises to reduce the total number of paths to explore during a complete run. As a result, the required amount of resources is also reduced, which takes Symbolic Execution a step further towards scalability.

Besides the pure symbolic or hybrid execution engines shown above, pure concolic execution engines also exist. Examples are DART [63] and SAGE [64], which concretely execute only one path at a time, collecting all constraints while doing so. On the other hand, engines like CUTE [65] and CREST [66] only reason about equality constraints for pointers, but fall back to concretization for all other symbolic references.

The final notable engine is based on the core model checking framework JPF, which builds a bridge between Model Checking and Symbolic Execution with an extension called JPF-SE [67]. This extension to the regular model checker was developed as a prototype to also support the use of Symbolic Execution. Later this extension was developed further and put into a stand-alone project known as SPF [68].

In addition to the closely related engines just mentioned, a more exhaustive list featuring several others is presented in Appendix A.1.

5.3 Solvers

Even though search strategies are the main topic of this thesis, Symbolic Execution does not rely solely on them when testing a program. In addition to the necessary path finding, Symbolic Execution also depends a lot on the underlying solvers, which evaluate the found path constraints. An example showing the actual relevance of solvers is provided by Rakadjiev et al. [69]: with the constraint-solving approach proposed in that paper, the measured performance increase was a factor of 7 compared to regular solving strategies.

Probably one of the most well-known solvers is Z3 [70], which is widespread and used (or at least supported) by most engines, such as PEX, SAGE, KLEE and several others. Another rather famous combination is STP [71] with MiniSat [72], which is the default solver of engines like EXE and KLEE; engines that emerged from these, like KITE, utilize it as well.

In addition to the solvers most commonly used in Symbolic Execution, there are also numerous others. The following list is not fully exhaustive, but it represents a summary of most solvers used or supported in today's Symbolic Execution engines.

The previously mentioned JPF extension JPF-SE, for example, uses the Omega library [73], which specialises mainly in manipulating integer tuple relations and sets and would therefore count, opposite to the others presented here, as a rather unconventional solver.

Another big player in the field of solvers is CVC4 [74]. It has its roots in CVC-Lite [75], which is often compared with YICES [76], since both follow similar directions with their solvers. They also support a very similar set of operations, like both linear and non-linear arithmetic over integers and bit vectors.

Lastly, another interesting solver is RealPaver [77]. RealPaver is interesting because it allows constraints over floating-point numbers, whereas the majority of other solvers typically support at most (linear) arithmetic over integers.

Chapter 6

Conclusion

This chapter is primarily concerned with giving a conclusion on the findings regarding the primary research question defined in the introduction, including the main contributions of this thesis. Further, in section 6.1, the main findings of the conducted experiments are highlighted and critically evaluated against the background of the study's assumptions and limitations. The chapter ends with a brief outlook on possible future work in section 6.2.

The original problem addressed in this thesis is to find out what effect different search heuristics in Symbolic Execution have on finding errors and how they could be combined to improve the general outcome. To address this problem and the related research question, the following main contributions were made with this thesis:

• Identification and classification of existing heuristics for exploration strategies used in Symbolic Execution, with a primary focus on strategies already available in KLEE.

• Implementation of a promising targeted search strategy that combines two usually different perspectives of a program: a static analysis of the program's execution paths, expressed as a calculated pathscore, and a user-guided targeting mechanism, the relevance.

• Evaluation of the implementation of the Pathscore-Relevance searcher, together with others, in the example environment KLEE. Additionally, it was used to evaluate how the quality of prior knowledge affects the effectiveness of such a targeted searcher.

• Presentation of a novel meta strategy that combines multiple search strategies in a Round-Robin fashion. In addition, it tracks runtime information, like coverage, to analyse whether the exploration makes any progress and, if starvation is detected, switches to randomization to avoid getting stuck.

• Quantitative analysis of a chosen set, composed of promising searchers and combinations of them, to evaluate what effect different search strategies have on finding errors and how they can be combined to improve the general outcome, in experiments spanning around 20,000 CPU hours.


6.1 Discussion

The very first finding to point out features the effectiveness of the current KLEE default searcher, Coverage-New interleaved with Random Path Search. Even though it was known to perform quite well, the overall error detection rate, with especially high consistency compared to the others, was still surprising. Because of its reliance on random factors with the Random Path Search, slightly worse results were anticipated. Resulting from these findings, the default searcher indeed arguably makes the most sense as a default implementation for the general purpose, because it can cover most of the errors within reasonable time and additionally provides enough certainty of meaningful results after a single iteration.

A good alternative to that default implementation was presented with the Random-Shuffle-Round-Robin meta combination of the QueryCost and Minimum Distance to Uncovered searchers (qcmdurnd). Throughout all tests it provided a very similar detection rate as the default searcher, but in some instances managed to detect even more errors. On the other hand, the qcmdurnd searcher did not manage to cover as much as the default searcher and also usually took longer to find these errors. Based on these results a trade-off has to be made, depending on the preferences of the user. Overall, the qcmdurnd searcher definitely seems like a good alternative to the default implementation, especially when resources like computation time are not too limited.

When taking a look at the general outcome of the combination of three searchers, it became quite clear that including more searchers does not necessarily perform better than combinations of fewer. This can easily be seen when comparing the triple searcher with prcovnew. These results made clear that it is more important to care about which searchers to combine and how to schedule them. With rather simple meta implementations as used within this thesis, more searchers only increase the possible influence of a single one by a small margin, and their compatibility is therefore more important. With two searchers and combination strategies like the ones used here, it is most of the time already possible to overcome the weaknesses of the respective other, leaving little need for another searcher, which sometimes, as in the case of the triple searcher, even lowers the overall efficiency. On the other hand, with more intelligent meta strategies, as discussed in section 6.2, searcher combinations of more than just two could make sense.

Another interesting outcome of the experiments conducted in this thesis, even though it was not expected to score well, is that the combination of Depth-First Search and Breadth-First Search did not manage to perform at all for any of the chosen metrics. It was expected not to score well, similar to the single DFS, but not to miss all errors in every tool. The conclusion was that DFS and BFS work too differently and nearly against each other, such that the loosening introduced by the Random-Shuffle-Round-Robin meta strategy also did not manage to improve anything. This finally led to the decision not to display it further in the result chapter, due to the absence of meaningful results.

In addition to all these general-purpose search strategies, a different type of searcher was tested with the Pathscore-Relevance implementation. This searcher requires some prior knowledge about the program to perform best, which was verified by the presented results; otherwise it is rather bad and quite unreliable. Besides the definite requirement of a rough area of interest, further details depend a lot on the implementation. In the case of the implementation used for evaluation in this thesis, the guidance lost its impact too fast for smaller targets at a larger distance, which led to slightly worse results for the strict direct targeting than expected. On the other hand, broader areas, to some extent, still scored pretty well, even if they were deeper in the execution tree. In the end it most likely comes down to the requirements towards the searcher, so that it can be tuned in a way that more precise targets are located more easily, or to focus on slightly broader areas, which would reflect larger changes.

As a conclusion and answer to the main research question, an overall combination of different search strategies definitely seems worthwhile, with nearly all the combined variants outperforming their respective solo variants. The presented results also show that there lies even more potential in more complex ways of interleaving these strategies, as will also be outlined as possible future work in the next section.

6.2 Future Work

In the following, a brief overview of some possible future work is given, like further benchmarking or different, more advanced scheduling strategies within meta searchers.

For the analysis conducted in this thesis, a set of different promising searchers was chosen, but it clearly could not have been fully exhaustive. Although the searchers analysed in this thesis were picked with clear goals, as they seemed to be the most promising, resources were limited and thus not every possible combination has been tested. Therefore, a rather simple addition to this work would be to include several more searchers in the evaluation and analyse how they perform. Especially interesting here would be a couple of more distinct combinations of other searchers, but also a more exhaustive test of how the quality of targeting in searchers like Pathscore-Relevance affects such combinations.

With the Random-Shuffle-Round-Robin, this thesis proposes an approach that uses runtime information, like generated coverage, to create schedules for the implemented searchers. Following this thought of more intelligent meta searchers even further, it is possible to merge the field of Symbolic Execution with Machine Learning. This would mean using pattern recognition to extract patterns found during the execution, to later be able to identify exactly in which situations which searcher with which specific parameters will lead to the best possible output.

Another big topic in Symbolic Execution, even though they were not really in the focus of this thesis, are solvers. One possible relation that came up during the evaluation process was that the effectiveness of searchers might be coupled to specific solvers. This would mean that one searcher might end up scoring better than its competitors when used with some solvers, while underperforming with others. With more resources, future work could focus on this stated assumption and evaluate whether there indeed is a correlation between searchers and solvers and how to possibly make the best use of it.

Bibliography

[1] William E. Lewis and W. H. C. Bassetti. Software Testing and Continuous Quality Improve- ment, Second Edition. Auerbach Publications, Boston, MA, USA, 2004. ISBN 0849325242.

[2] James C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385–394, July 1976. ISSN 0001-0782. doi: 10.1145/360248.360252. URL http://doi.acm.org/ 10.1145/360248.360252.

[3] Peter Boonstoppel, Cristian Cadar, and Dawson Engler. Rwset: Attacking path ex- plosion in constraint-based test generation. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’08/ETAPS’08, pages 351–366, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540-78799-2, 978-3-540-78799-0. URL http://dl.acm.org/ citation.cfm?id=1792734.1792770.

[4] United Nations General Assembly. Resolution a/60/1, adopted by the gen- eral assembly. In 2005 World Summit Outcome, September 2005. URL http: //data.unaids.org/topics/universalaccess/worldsummitoutcome_ resolution_24oct2005_en.pdf.

[5] John Morelli. Environmental sustainability: A definition for environmental profession- als. In Journal of Environmental Sustainability, volume 1. Rochester Institute of Tech- nology, 2011. URL http://scholarworks.rit.edu/cgi/viewcontent.cgi? article=1007&context=jes.

[6] Molly Scott Cato. Green economics. london: Earthscan. Ecological Economics, 69(1):36– 37, 2009.

[7] Cristian Cadar, Daniel Dunbar, and Dawson Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008.

[8] Joan Daemen and Vincent Rijmen. Specification for the advanced encryption standard (aes). Federal Information Processing Standards Publication 197, 2001. URL http: //csrc.nist.gov/publications/fips/fips197/fips-197.pdf.

[9] Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM. doi: 10.1145/800157.805047. URL http://doi.acm. org/10.1145/800157.805047.


[10] Herbert B. Enderton. A Mathematical Introduction to Logic. Academic Press, second edi- tion, 2000.

[11] Roberto Baldoni, Emilio Coppa, Daniele Cono D’Elia, Camil Demetrescu, and Irene Finocchi. A survey of symbolic execution techniques. CoRR, abs/1610.00502, 2016. URL http://arxiv.org/abs/1610.00502.

[12] David McCandless, Pearl Doughty-White, and Miriam Quick. Codebases - mil- lions of lines of code. URL http://www.informationisbeautiful.net/ visualizations/million-lines-of-code/.

[13] GNU. Coreutils, . URL https://www.gnu.org/software/coreutils/ coreutils.html.

[14] Apache Open Office. Build faq. URL https://www.openoffice.org/FAQs/ build_faq.html#source.

[15] A. M. Turing. On computable numbers, with an application to the entscheidungsprob- lem. Proceedings of the London Mathematical Society, s2-42(1):230–265, 1937. ISSN 1460- 244X. doi: 10.1112/plms/s2-42.1.230. URL http://dx.doi.org/10.1112/plms/ s2-42.1.230.

[16] French engineer of the telegraph Charles Pierre Trémaux (1859–1882), École polytechnique of Paris (X:1876).

[17] E. F. Moore. The shortest path through a maze. In Proceedings of the International Sympo- sium on the Theory of Switching, pages 285–292. Harvard University Press, 1959.

[18] C. Y. Lee. An algorithm for path connections and its applications. IRE Transactions on Electronic Computers, EC-10(3):346–365, Sept 1961. ISSN 0367-9950. doi: 10.1109/TEC. 1961.5219222. URL https://doi.org/10.1109/TEC.1961.5219222.

[19] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley. Unleashing mayhem on binary code. In 2012 IEEE Symposium on Security and Privacy, pages 380–394, May 2012. doi: 10.1109/SP.2012.31. URL http://doi.acm.org/10.1109/SP.2012.31.

[20] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. The s2e platform: De- sign, implementation, and applications. ACM Trans. Comput. Syst., 30(1):2:1–2:49, Febru- ary 2012. ISSN 0734-2071. doi: 10.1145/2110356.2110358. URL http://doi.acm. org/10.1145/2110356.2110358.

[21] T. Xie, N. Tillmann, J. de Halleux, and W. Schulte. Fitness-guided path exploration in dynamic symbolic execution. In 2009 IEEE/IFIP International Conference on Dependable Systems Networks, pages 359–368, June 2009. doi: 10.1109/DSN.2009.5270315.

[22] Nikolai Tillmann and Jonathan De Halleux. Pex: White box test generation for .net. In Proceedings of the 2Nd International Conference on Tests and Proofs, TAP’08, pages 134–153, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540-79123-X, 978-3-540-79123-2. URL http://dl.acm.org/citation.cfm?id=1792786.1792798.

[23] You Li, Zhendong Su, Linzhang Wang, and Xuandong Li. Steering symbolic execution to less traveled paths. SIGPLAN Not., 48(10):19–32, October 2013. ISSN 0362-1340. doi: 10.1145/2544173.2509553. URL http://doi.acm.org/10.1145/2544173.2509553.

[24] T. Avgerinos, S. K. Cha, B. L. T. Hao, and D. Brumley. Aeg: automatic exploit generation. In In Proceedings of the Network and Distributed System Security Symposium,, 2011. URL http://www.isoc.org/isoc/conferences/ndss/11/pdf/5_5.pdf.

[25] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Pro- gram Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), Palo Alto, California, March 2004.

[26] Silviu Andrica and George Candea. Pathscore-relevance: A metric for improving test quality. 2009.

[27] GNU. Gcov, . URL https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.

[28] Ben James Winer, Donald R Brown, and Kenneth M Michels. Statistical principles in experimental design, volume 2. McGraw-Hill New York, 1971.

[29] R. A. Fisher. The Arrangement of Field Experiments, pages 82–91. Springer New York, New York, NY, 1992. ISBN 978-1-4612-4380-9. doi: 10.1007/978-1-4612-4380-9_8. URL http://dx.doi.org/10.1007/978-1-4612-4380-9_8.

[30] J.A. Rice. Mathematical Statistics and Data Analysis. Number p. 3 in Advanced series. Cengage Learning, 2006. ISBN 9780534399429. URL https://books.google.se/ books?id=EKA-yeX2GVgC.

[31] C. Berzin, A. Latour, and J.R. León. Inference on the Hurst Parameter and the Variance of Diffusions Driven by Fractional Brownian Motion, page 362. Lecture Notes in Statistics. Springer International Publishing, 2014. ISBN 9783319078755. URL https://books. google.se/books?id=6tTVBAAAQBAJ.

[32] Qusay F. Hassan. Demystifying cloud computing. In The Journal of De- fense Software Engineering, 2011. URL http://static1.1.sqspcdn.com/ static/f/702523/10181434/1294788395300/201101-Hassan.pdf?token= R1MJl7CfIESODLO4c1klLcRMtfw%3D.

[33] Peter Mell and Timothy Grance. The nist definition of cloud computing. In National Institute of Standards and Technology: U.S. Department of Commerce, September 2011. URL http://nvlpubs.nist.gov/nistpubs/Legacy/SP/ nistspecialpublication800-145.pdf.

[34] Mark Baker. Cluster computing white paper. CoRR, cs.DC/0004014, 2000. URL http: //arxiv.org/abs/cs.DC/0004014.

[35] Celina G. Val. Conflict-driven symbolic execution: How to learn to get better. Master’s thesis, 2014.

[36] Docker Inc. Docker. URL https://www.docker.com/.

[37] Chalmers Centre for Computational Science and Engineering. Hebbe. URL http://www.c3se.chalmers.se/index.php/Hebbe.

[38] RWTH Aachen University. Chair of communication and distributed systems. URL https://www.comsys.rwth-aachen.de/home/.

[39] European Joint Conference on Theory & Practice of Software. Benchmark verification tasks. URL https://sv-comp.sosy-lab.org/2017/benchmarks.php.

[40] Ronald H. Randles. Wilcoxon Signed Rank Test. John Wiley & Sons, Inc., 2004. ISBN 9780471667193. doi: 10.1002/0471667196.ess2935.pub2. URL http://dx.doi.org/ 10.1002/0471667196.ess2935.pub2.

[41] Ieee standard for software and system test documentation. IEEE Std 829-2008, pages 1–150, July 2008. doi: 10.1109/IEEESTD.2008.4578383.

[42] Mohd Ehmer Khan. Different forms of software testing techniques for finding errors. International Journal of Computer Science Issues, pages 11–16, 2010.

[43] Mohd Ehmer Khan. Different approaches to white box testing technique for finding errors. International Journal of Software Engineering and Its Applications, 5, July 2011.

[44] Joe W. Duran and Simeon Ntafos. A report on random testing. In Proceedings of the 5th International Conference on Software Engineering, ICSE ’81, pages 179–183, Piscataway, NJ, USA, 1981. IEEE Press. ISBN 0-89791-146-6. URL http://dl.acm.org/citation. cfm?id=800078.802530.

[45] Richard Hamlet. Random testing. In Encyclopedia of Software Engineering, pages 970–978. Wiley, 1994.

[46] Rainer Gerlich, Ralf Gerlich, and Thomas Boll. Random testing: From the classical approach to a global view and full test automation. In Proceedings of the 2Nd International Workshop on Random Testing: Co-located with the 22Nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), RT ’07, pages 30–37, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-881-7. doi: 10.1145/1292414.1292424. URL http:// doi.acm.org/10.1145/1292414.1292424.

[47] Ilja Van Sprundel. Fuzzing. In Chaos Communication Congress (CCC), 2005.

[48] Ari Takanen, Jared DeMott, and Charlie Miller. Fuzzing for Software Security Testing and Quality Assurance. Artech House, Inc., Norwood, MA, USA, 1 edition, 2008. ISBN 1596932147, 9781596932142.

[49] John Neystadt. Automated penetration testing with white-box fuzzing. Microsoft, February 2008.

[50] I. S. Dunietz, W. K. Ehrlich, B. D. Szablak, C. L. Mallows, and A. Iannino. Applying design of experiments to software testing: Experience report. In Proceedings of the 19th International Conference on Software Engineering, ICSE ’97, pages 205–215, New York, NY, USA, 1997. ACM. ISBN 0-89791-914-9. doi: 10.1145/253228.253271. URL http:// doi.acm.org/10.1145/253228.253271.

[51] K. Tatsumi, S. Watanabe, Y. Takeuchi, and H. Shimokawa. Conceptual support for test case design. Proc. 11th Intl. Computer Software and Applications Conf. (COMPSAC), pages 285–290, 1987. URL http://a-lifelong-tester.cocolog-nifty.com/publications/Conceptual_Support_for_Test_Case_Design-COMPSAC87.pdf.

[52] Alan W. Williams. Determination of test configurations for pair-wise interaction cov- erage. In Proceedings of the IFIP TC6/WG6.1 13th International Conference on Testing Communicating Systems: Tools and Techniques, TestCom ’00, pages 59–74, Deventer, The Netherlands, The Netherlands, 2000. Kluwer, B.V. ISBN 0-7923-7921-7. URL http: //dl.acm.org/citation.cfm?id=648131.748027.

[53] Jerry Huller. Reducing time to market with combinatorial design method testing. In Proceedings of the 2000 International Council on Systems Engineering (INCOSE) Conference, pages 16–20, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.8887&rep=rep1&type=pdf.

[54] Jean-Pierre Queille and Joseph Sifakis. Specification and verification of concurrent systems in CESAR. In Proceedings of the 5th Colloquium on International Symposium on Programming, pages 337–351, London, UK, UK, 1982. Springer-Verlag. ISBN 3-540-11494-7. URL http://dl.acm.org/citation.cfm?id=647325.721668.

[55] Edmund M. Clarke, Jr., Orna Grumberg, and Doron A. Peled. Model Checking. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-03270-8.

[56] Guillaume Brat, Klaus Havelund, Seungjoon Park, and Willem Visser. Java PathFinder - second generation of a Java model checker. In Proceedings of the Workshop on Advances in Verification, 2000.

[57] Alessandro Cimatti, Edmund M. Clarke, Fausto Giunchiglia, and Marco Roveri. NuSMV: A new symbolic model verifier. In Proceedings of the 11th International Conference on Computer Aided Verification, CAV ’99, pages 495–499, London, UK, UK, 1999. Springer-Verlag. ISBN 3-540-66202-2. URL http://dl.acm.org/citation.cfm?id=647768.733923.

[58] Cristian Cadar and Koushik Sen. Symbolic execution for software testing: Three decades later. Commun. ACM, 56(2):82–90, February 2013. ISSN 0001-0782. doi: 10.1145/2408776.2408795. URL http://doi.acm.org/10.1145/2408776.2408795.

[59] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: Automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security, CCS ’06, pages 322–335, New York, NY, USA, 2006. ACM. ISBN 1-59593-518-5. doi: 10.1145/1180405.1180445. URL http://doi.acm.org/10.1145/1180405.1180445.

[60] J. P. Marques Silva and K. A. Sakallah. GRASP - a new search algorithm for satisfiability. In Proceedings of the International Conference on Computer Aided Design, pages 220–227, Nov 1996. doi: 10.1109/ICCAD.1996.569607.

[61] J. P. Marques-Silva and K. A. Sakallah. GRASP: A search algorithm for propositional satisfiability. IEEE Transactions on Computers, 48(5):506–521, May 1999. ISSN 0018-9340. doi: 10.1109/12.769433.

[62] Roberto J. Bayardo Jr. and Robert C. Schrag. Using CSP look-back techniques to solve real-world SAT instances. In Proc. 14th Nat. Conf. on Artificial Intelligence (AAAI), pages 203–208, 1997.

[63] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: Directed automated random testing. SIGPLAN Not., 40(6):213–223, June 2005. ISSN 0362-1340. doi: 10.1145/1064978.1065036. URL http://doi.acm.org/10.1145/1064978.1065036.

[64] Patrice Godefroid, Michael Y. Levin, and David Molnar. SAGE: Whitebox fuzzing for security testing. Queue, 10(1):20:20–20:27, January 2012. ISSN 1542-7730. doi: 10.1145/2090147.2094081. URL http://doi.acm.org/10.1145/2090147.2094081.

[65] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: A concolic unit testing engine for C. SIGSOFT Softw. Eng. Notes, 30(5):263–272, September 2005. ISSN 0163-5948. doi: 10.1145/1095430.1081750. URL http://doi.acm.org/10.1145/1095430.1081750.

[66] J. Burnim and K. Sen. Heuristics for scalable dynamic test generation. In 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, pages 443–446, Sept 2008. doi: 10.1109/ASE.2008.69.

[67] Saswat Anand, Corina S. Păsăreanu, and Willem Visser. JPF-SE: A symbolic execution extension to Java PathFinder. In Proceedings of the 13th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’07, pages 134–138, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-71208-4. URL http://dl.acm.org/citation.cfm?id=1763507.1763523.

[68] Corina S. Păsăreanu and Neha Rungta. Symbolic PathFinder: Symbolic execution of Java bytecode. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE ’10, pages 179–180, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0116-9. doi: 10.1145/1858996.1859035. URL http://doi.acm.org/10.1145/1858996.1859035.

[69] Emil Rakadjiev, Taku Shimosawa, Hiroshi Mine, and Satoshi Oshima. Parallel SMT solving and concurrent symbolic execution. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 03, TRUSTCOM-BIGDATASE-ISPA ’15, pages 17–26, Washington, DC, USA, 2015. IEEE Computer Society. ISBN 978-1-4673-7952-6. doi: 10.1109/Trustcom-BigDataSe-ISPA.2015.608. URL http://dx.doi.org/10.1109/Trustcom-BigDataSe-ISPA.2015.608.

[70] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’08/ETAPS’08, pages 337–340, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540-78799-2, 978-3-540-78799-0. URL http://dl.acm.org/citation.cfm?id=1792734.1792766.

[71] Vijay Ganesh and David L. Dill. A decision procedure for bit-vectors and arrays. In Proceedings of the 19th International Conference on Computer Aided Verification, CAV’07, pages 519–531, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-73367-6. URL http://dl.acm.org/citation.cfm?id=1770351.1770421.

[72] Niklas Eén and Niklas Sörensson. An Extensible SAT-solver, pages 502–518. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-24605-3. doi: 10.1007/978-3-540-24605-3_37. URL http://dx.doi.org/10.1007/978-3-540-24605-3_37.

[73] William Pugh. The omega test: A fast and practical integer programming algorithm for dependence analysis. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing ’91, pages 4–13, New York, NY, USA, 1991. ACM. ISBN 0-89791-459-7. doi: 10.1145/125826.125848. URL http://doi.acm.org/10.1145/125826.125848.

[74] Clark Barrett, Christopher L. Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. CVC4. In Proceedings of the 23rd International Conference on Computer Aided Verification, CAV’11, pages 171–177, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-22109-5. URL http://dl.acm.org/citation.cfm?id=2032305.2032319.

[75] Clark Barrett and Sergey Berezin. CVC Lite: A New Implementation of the Cooperating Validity Checker, pages 515–518. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-27813-9. doi: 10.1007/978-3-540-27813-9_49. URL http://dx.doi.org/10.1007/978-3-540-27813-9_49.

[76] Bruno Dutertre and Leonardo de Moura. A fast linear-arithmetic solver for DPLL(T). In Proceedings of the 18th International Conference on Computer Aided Verification, CAV’06, pages 81–94, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-37406-X, 978-3-540-37406-0. doi: 10.1007/11817963_11. URL http://dx.doi.org/10.1007/11817963_11.

[77] Laurent Granvilliers and Frédéric Benhamou. Algorithm 852: RealPaver: An interval solver using constraint satisfaction techniques. ACM Trans. Math. Softw., 32(1):138–156, March 2006. ISSN 0098-3500. doi: 10.1145/1132973.1132980. URL http://doi.acm.org/10.1145/1132973.1132980.

Appendix A

Appendix

A.1 KLEE Test Arguments

Full argument list for all KLEE calls:

klee $SEARCHER --output-dir=$OUTPUTDIR --simplify-sym-indices --optimize --libc=uclibc --posix-runtime --only-output-states-covering-new --max-sym-array-size=4096 --max-instruction-time=10. --max-time=3600. $FILE.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout
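Here $OUTPUTDIR names the output directory of the run, $FILE.bc is the LLVM bitcode of the tool under test, and $SEARCHER selects the exploration schedule. For the stock KLEE searchers this is done with the --search flag, which interleaves searchers round-robin when given several times. The lines below are only illustrative substitutions; in particular, the flag name for the Pathscore-Relevance searcher is an assumption, since it depends on the modified KLEE build used in this work:

SEARCHER="--search=dfs"                               # plain depth-first search
SEARCHER="--search=random-path --search=nurs:covnew"  # KLEE's default round-robin interleaving
SEARCHER="--search=pathscore-relevance"               # hypothetical flag name for the PR searcher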

A.2 Reduction Proof

In the following two listings, A.1 and A.2, a proof for the claim (TODO) is given in the form of SMT queries. They can be verified by, e.g., copy-pasting them into the Z3 solver1.

Listing A.1: Proof for the reduction of 3 ∗ a = 9 to a = 3

(declare-fun a () (_ BitVec 32))
(assert (= #x00000009 (bvmul a #x00000003)))
(assert (not (= #x00000003 a)))
(check-sat)

Listing A.2: Proof that the reduction of 4 ∗ a = 8 to a = 2 is not valid

(declare-fun a () (_ BitVec 32))
(assert (= #x00000008 (bvmul a #x00000004)))
(assert (not (= #x00000002 a)))
(check-sat)
(get-model)

1 http://rise4fun.com/Z3/
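The same two queries can also be checked programmatically. The following is a minimal sketch using the Z3 Python bindings (an assumption: the z3-solver package is installed; the thesis itself only relies on the SMT-LIB listings above). The first check is unsatisfiable because 3 is odd and therefore invertible modulo 2^32, so a = 3 is the only 32-bit solution of 3 ∗ a = 9; the second is satisfiable because 4 ∗ a = 8 is, for example, also solved by a = 2 + 2^30.

from z3 import BitVec, Solver, sat

a = BitVec('a', 32)

# Listing A.1: 3*a = 9 together with a != 3 must be unsatisfiable.
s1 = Solver()
s1.add(3 * a == 9, a != 3)
print(s1.check())  # expected: unsat

# Listing A.2: 4*a = 8 together with a != 2 is satisfiable,
# so reducing 4*a = 8 to a = 2 is not valid in 32-bit arithmetic.
s2 = Solver()
s2.add(4 * a == 8, a != 2)
if s2.check() == sat:
    print(s2.model())  # prints a counterexample, e.g. a = 2 + 2**30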


Table A.1: Snapshot of current Symbolic Execution engines

Engine      URL
angr        http://angr.io/
BAP         https://users.ece.cmu.edu/~aavgerin/papers/bap-cav-11.pdf
BITBLAZE    http://bitblaze.cs.berkeley.edu/
CATG        https://people.eecs.berkeley.edu/~ksen/papers/tesma.pdf
CIVL        https://vsl.cis.udel.edu/civl/
CLOUD9      http://cloud9.epfl.ch/
CREST       https://www.burn.im/crest/
CUTE        http://dl.acm.org/citation.cfm?doid=1081706.1081750
DART        http://dl.acm.org/citation.cfm?id=1065036
FuzzBALL    http://bitblaze.cs.berkeley.edu/fuzzball.html
janala2     https://github.com/ksen007/janala2
Jalangi2    https://github.com/Samsung/jalangi2
jCUTE       https://github.com/osl/jcute
JDART       https://github.com/psycopaths/jdart
JPF(-SE)    https://babelfish.arc.nasa.gov/trac/jpf
KeY         https://www.key-project.org/
KITE        http://www.cs.ubc.ca/labs/isd/Projects/Kite/
KLEE        http://klee.github.io/
MAYHEM      http://forallsecure.com/mayhem.html
McVeto      https://research.cs.wisc.edu/wpis/papers/cav10-mcveto.pdf
MIASM       https://github.com/cea-sec/miasm
OTTER       https://bitbucket.org/khooyp/otter/overview
PATHGRIND   https://github.com/codelion/pathgrind
PEX         http://research.microsoft.com/en-us/projects/pex/
PYEXZ3      https://github.com/thomasjball/PyExZ3
PYSYMEMU    https://github.com/feliam/pysymemu/
RUBYX       http://www.cs.umd.edu/~avik/papers/ssarorwa.pdf
S2E         http://s2e.epfl.ch/
SAGE        http://patricegodefroid.github.io/public_psfiles/ndss2008.pdf
SMART       http://dl.acm.org/citation.cfm?id=1190226
SYMDROID    http://www.cs.umd.edu/~jfoster/papers/symdroid.pdf
SYMJS       http://www.cs.utah.edu/~ligd/publications/SymJS-FSE14.pdf
TRITON      https://triton.quarkslab.com/

A.3 List of GNU Coreutils

Table A.2: List of tested GNU Coreutils from version 6.10 and 8.27

File utilities
Name in 6.10 in 8.27 Short description
chcon X X Changes file security context
chgrp X X Changes file group ownership
chmod X X Changes the permissions of a file or directory
chown X X Changes file ownership
cp X X Copies a file or directory
dd X X Copies and converts a file
df X Shows disk free space on file systems
dircolors X X Sets up color for ls
ginstall X X Copies files and sets attributes
ln X X Creates a link to a file
ls X X Lists the files in a directory
mkdir X X Creates a directory
mkfifo X X Makes named pipes (FIFOs)
mknod X X Makes block or character special files
mktemp X X Creates a temporary file or directory
mv X X Moves or renames files
rm X Removes (deletes) files
rmdir X Removes empty directories
shred X Overwrites a file to hide its contents, and optionally deletes it
sync X Flushes file system buffers
touch X Changes file timestamps

Text utilities
Name in 6.10 in 8.27 Short description
base64 X X Base64 encodes or decodes data and prints to standard output
cat X X Concatenates and prints files on the standard output
cksum X X Checksums and counts the bytes in a file
comm X X Compares two sorted files line by line
csplit X X Splits a file into sections determined by context lines
cut X X Removes sections from each line of files
expand X X Converts tabs to spaces
fmt X X Simple optimal text formatter
fold X X Wraps each input line to fit in specified width
head X X Outputs the first part of files
join X X Joins lines of two files on a common field
md5sum X X Computes and checks MD5 message digest
nl X X Numbers lines of files
od X Dumps files in octal and other formats
paste X Merges lines of files

pr X Converts text files for printing
ptx X Produces a permuted index of file contents
shuf X Generates random permutations
sort X Sorts lines of text files
split X Splits a file into pieces
sum X Checksums and counts the blocks in a file
tac X Concatenates and prints files in reverse order line by line
tail X Outputs the last part of files
tr X Translates or deletes characters
tsort X Performs a topological sort
unexpand X Converts spaces to tabs
uniq X Removes duplicate lines from a sorted file
wc X Prints the number of bytes, words, and lines in files

Shell utilities
Name in 6.10 in 8.27 Short description
basename X X Removes the path prefix from a given pathname
chroot X X Changes the root directory
date X X Prints or sets the system date and time
dirname X X Strips non-directory suffix from file name
du X X Shows disk usage on file systems
echo X X Displays a specified line of text
env X X Displays and modifies environment variables
expr X X Evaluates expressions
factor X Factors numbers
false X X Does nothing.
hostid X X Prints the numeric identifier for the current host
id X X Prints real or effective UID and GID
link X X Creates a link to a file
logname X X Prints the user's login name
nice X X Modifies scheduling priority
nohup X Allows a command to continue running after logging out
pathchk X Checks whether file names are valid or portable
pinky X A lightweight version of finger
printenv X Prints environment variables
printf X Formats and prints data
pwd X Prints the current working directory
readlink X Displays value of a symbolic link
runcon X Runs a command with a specified security context
seq X Prints a sequence of numbers
sleep X Delays for a specified amount of time
stat X Returns data about an inode
stty X Changes and prints terminal line settings
tee X Sends output to multiple files
tty X Prints terminal name

uname X Prints system information
unlink X Removes the specified file
uptime X Tells how long the system has been running
users X X Prints the user names of users currently logged into the current host
whoami X Prints the effective userid
who X X Prints a list of all users currently logged in
yes X Prints a string repeatedly

Other utilities
Name in 6.10 in 8.27 Short description
[ X A synonym for test, which permits expressions like [ expression ]
kill X X Closes the target process
setuidgid X Runs another program under a specified account's uid and gid

A.4 Results

All results gathered and visualized over the course of this thesis are shown within this section.

Figure A.1: Amount of Errors for all searcher combinations for the GNU Coreutils 6.10 on Hebbe

(Figure content: eight panels, (a) Default, (b) PR, (c) Covnew, (d) DFS, (e) PR-Covnew, (f) QCMDUrnd, (g) Random, (h) Triple; each panel plots the average number of found errors per tool: cut, head, ls, md5sum, mkdir, mkfifo, mknod, nl, paste, pr, ptx, seq, shred, sort, touch.)

Figure A.2: Time to first Error for all searcher combinations for the GNU Coreutils 6.10 on Hebbe

(Figure content: eight panels, (a) Default, (b) PR, (c) Covnew, (d) DFS, (e) PR-Covnew, (f) QCMDUrnd, (g) Random, (h) Triple; each panel plots the average time until the first error (in s, capped at 3600) per tool: cut, head, ls, md5sum, mkdir, mkfifo, mknod, nl, paste, pr, ptx, seq, shred, sort, touch.)

Figure A.3: Executable Line Coverage for all searcher combinations for the GNU Coreutils 6.10 on Hebbe

(Figure content: eight panels, (a) Default, (b) PR, (c) Covnew, (d) DFS, (e) PR-Covnew, (f) QCMDUrnd, (g) Random, (h) Triple; each panel plots the average executable line coverage in % over the tested tools.)

Figure A.4: Amount of Errors for all searcher combinations for the GNU Coreutils 8.27 on Hebbe

(Figure content: eight panels, (a) Default, (b) PR, (c) Covnew, (d) DFS, (e) PR-Covnew, (f) QCMDUrnd, (g) Random, (h) Triple; each panel plots the average number of found errors per tool: chcon, chgrp, chroot, cut, dd, ls, mv, nl.)

Figure A.5: Time to first Error for all searcher combinations for the GNU Coreutils 8.27 on Hebbe

(Figure content: eight panels, (a) Default, (b) PR, (c) Covnew, (d) DFS, (e) PR-Covnew, (f) QCMDUrnd, (g) Random, (h) Triple; each panel plots the average time until the first error (in s, capped at 3600) per tool: chcon, chgrp, chroot, cut, dd, ls, mv, nl.)

Figure A.6: Amount of Errors for all searcher combinations for the GNU Coreutils 6.10 on a dedicated server

(Figure content: five panels, (a) Default, (b) PR, (c) PR-Covnew, (d) QCMDUrnd, (e) Triple; each panel plots the average number of found errors per tool: chgrp, cp, cut, du, ls, md5sum, mkdir, mkfifo, mknod, paste, pr, ptx, seq, shred, sort, touch.)

Figure A.7: Time to first Error for all searcher combinations for the GNU Coreutils 6.10 on a dedicated server

(Figure content: five panels, (a) Default, (b) PR, (c) PR-Covnew, (d) QCMDUrnd, (e) Triple; each panel plots the average time until the first error (in s, capped at 3600) per tool: chgrp, cp, cut, du, ls, md5sum, mkdir, mkfifo, mknod, paste, pr, ptx, seq, shred, sort, touch.)