Advanced Techniques for Search-Based Program Repair

Christopher Steven Timperley

Ph.D.

University of York, June 2017

Abstract

Debugging and repairing defects costs the global economy hundreds of billions of dollars annually, and accounts for as much as 50% of programmers' time. To tackle the burgeoning expense of repair, researchers have proposed the use of novel techniques to automatically localise and repair such defects. Collectively, these techniques are referred to as automated program repair.

Despite promising early results, recent studies have demonstrated that existing automated program repair techniques are considerably less effective than previously believed. Current approaches are limited either in terms of the number and kinds of bugs they can fix, the size of patches they can produce, or the programs to which they can be applied. To become economically viable, automated program repair needs to overcome all of these limitations.

Search-based repair is the only approach to program repair which may be applied to any bug or program, without assuming the existence of formal specifications. Despite its generality, current search-based techniques are restricted; they are either efficient, or capable of fixing multiple-line bugs—no existing technique is both. Furthermore, most techniques rely on the assumption that the material necessary to craft a repair already exists within the faulty program. By using existing code to craft repairs, the size of the search space is vastly reduced, compared to generating code from scratch. However, recent results, which show that almost all repairs generated by a number of search-based techniques can be explained as deletion, lead us to question whether this assumption is valid.

In this thesis, we identify the challenges facing search-based program repair, and demonstrate ways of tackling them. We explore if and how the knowledge of candidate patch evaluations can be used to locate the source of bugs. We use software repository mining techniques to discover the form of a better repair model capable of addressing a greater number of bugs. We conduct a theoretical and empirical analysis of existing search algorithms for repair, before demonstrating a more effective alternative, inspired by greedy algorithms. To ensure reproducibility, we propose and use a methodology for conducting high-quality automated program repair research. Finally, we assess our progress towards solving the challenges of search-based program repair, and reflect on the future of the field.

Contents

Abstract

Acknowledgements

Declaration

1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Research Questions
  1.4 Contributions
  1.5 Document Structure

2 Background
  2.1 Automated Program Repair
  2.2 Search-Based Repair
  2.3 Semantics-Based Repair
  2.4 Specification-Based Repair
  2.5 Related Techniques
  2.6 Concluding Remarks

3 Tools and Techniques
  3.1 Bug Scenarios
  3.2 Pythia
  3.3 RepairBox
  3.4 Methodology
  3.5 Conclusion

4 Fault Localisation
  4.1 Background
  4.2 Analysis
  4.3 Approach
  4.4 Discussion & Conclusion

5 Repair Model
  5.1 Related Work
  5.2 Motivation for Study
  5.3 Methodology
  5.4 Repair Model
  5.5 Approach
  5.6 Results
  5.7 Discussion & Conclusion

6 Search
  6.1 Related Work
  6.2 Theoretical Analysis
  6.3 Empirical Study
  6.4 Greedy Algorithm
  6.5 Future Work
  6.6 Conclusion

7 Conclusion
  7.1 Summary
  7.2 Future Work
  7.3 Concluding Remarks

Appendices

A Reproducibility
  A.1 Fault Localisation
  A.2 Repair Model
  A.3 Search

B Repair Action Mining
  B.1 AST and Edit Script Generation
  B.2 Detection Rules

C Additional Fault Localisation Results

Bibliography

List of Tables

2.1 Examples of different types of defect classes and the shared properties by which they are defined. Adapted from [Monperrus, 2014].
2.2 A list of the repair actions within the repair model for History Driven Program Repair, separated by their sources [Le et al., 2016].
2.3 MintHint's repair model, or hints, and the kinds of faults that each hint is designed to address. Taken from [Kaleeswaran et al., 2014].

4.1 The average precision of HybridMUSE as its mutant sampling rate is adjusted. Taken from [Moon et al., 2014b].
4.2 Details of the subjects we studied for our preliminary mutation analysis. KLOC measures the number of thousands of lines of C code in the program, as calculated by cloc. Tests states the average number of test cases used by bugs for that program.
4.3 Specifications for the Amazon EC2 instances used to perform mutation analysis on 15 artificial bugs.
4.4 Specifications of the Microsoft Azure compute instances used to perform analysis on 13 real-world bugs.
4.5 A summary of the mutation analysis results for each bug scenario. % Compiling specifies the percentage of mutants that successfully compiled. % Neutral describes the percentage of (compiling) mutants that had no effect on the outcome of the tests. % Lethal describes the percentage of (compiling) mutants (covering at least one positive test) that failed all of their covered tests. Mutants specifies the number of mutants generated within the 12-hour random walk. Sample Rate gives the average number of mutants per suspicious statement.
4.6 Comparison of fault localisation accuracies achieved by different approaches, where accuracy is measured by the probability of sampling a statement containing a fix from the resulting distribution. Results are given as percentages.

5.1 The number of instances of each repair action discovered across each of the mined bugs, together with the number (and percentage) of bugs that involve at least one repair action of that type.
5.2 The graftability of each repair action in the contexts of the concrete pool, containing the unchanged snippets from the file under repair, and the abstract pool, containing the unlabelled forms of the snippets from the file under repair.
5.3 A summary of the frequency of each of the proposed repair actions, measured by the percentage of bugs in which it is encountered, together with the graftability of that repair action when the abstract pool is used. Effectiveness, computed as the product of frequency and graftability, estimates the fraction of bugs for which a given repair action may graft a repair.

6.1 The baseline parameters of the algorithm used within each run.
6.2 Specifications of the Microsoft Azure F4 compute instances used to collect the data for this study.
6.3 A table of the subject programs used to conduct this study. # LOC describes the number of source code lines in the original program, as measured by cloc. # Stmts specifies the number of statements within GenProg's pre-processed AST representation of the program. # Tests gives the size of the original test suite for the program.
6.4 A summary of the bugs for which a repair was found at least once over the ten runs, for each configuration. ✓ indicates that at least one patch was found for the given bug-configuration.
6.5 A comparison of the reliability of different configurations for each bug scenario. The presence of an "—" symbol indicates that no patches were found for that given bug-configuration.
6.6 A summary of the median number of unique candidate patch evaluations required to find a repair, for each bug scenario. The presence of an "—" symbol indicates that no patches were found for that bug-configuration.
6.7 An overview of the cost of finding a patch for each bug-configuration, measured by the total number of unique candidate patch evaluations, across all runs, divided by the number of runs that were successful. Bug-configurations with an "—" symbol indicate that no patches were found during any of the runs.
6.8 A comparison of the bugs patched by each technique.
6.9 A comparison of the reliability achieved by the genetic and greedy search algorithms, measured by the fraction of runs wherein an acceptable repair was found. An "—" is used to denote bug-configurations where no repair was found across any of the runs.
6.10 A comparison of efficiency between the genetic search and greedy search algorithms, measured by the total wall-clock time across all runs, in seconds, divided by the number of successful runs. Reduction describes the reduction factor achieved by the greedy algorithm, compared to the genetic algorithm.

C.1 The effectiveness of various fault localisation schemes, measured by the probability of sampling a fixable statement.

List of Figures

2.1 The general automated repair process accepts the source code for a faulty program, together with a test suite, containing failed test cases, exposing the faults within the program. From these, the possible locations of the faults are determined, and for some repair approaches, a pool of donor code is generated from the input program. Using the contents of the donor pool, together with a number of basic repair actions, the search generates and evaluates candidate patches, until one is found that passes all tests within the suite.
2.2 An example bug scenario, adapted from the real-world Zune leap year bug [Coldewey, 2008]. The program accepts a date, given as the number of days since January 1st 1980, and should determine the year to which that date belongs. In the event that year is a leap year and days ever becomes 366, the program will enter an infinite loop.
2.3 Before the repair process can begin, each of the statements within the program is identified and annotated with a unique SID.
2.4 A test suite for our running example, described by inputs and expected outputs for the program. Failing tests are assigned labels starting with the letter "N", indicating that they are negative test cases. Conversely, passing tests are assigned a label starting with the letter "P", indicating that they are positive test cases.
2.5 An example of the output produced by the coverage generation process. The columns on the right show which statements are covered by a particular test. For the sake of brevity, we omit coverage for the majority of the test suite.
2.6 An example of a fault spectrum for the Zune bug, together with the suspiciousness values for each statement, as computed by GenProg's suspiciousness metric.
2.7 An illustration of GenProg's one-point crossover operator. Two parents A and B are accepted as input, and each is split into two parts at a random point. The first part of A is combined with the second part of B to form a child C. Similarly, the first part of B is combined with the second part of A to form a child D.
2.8 An example change graph produced during the offline, bug fix mining stage of history-driven program repair [Le et al., 2016]. Here, the change graph shows the name of a parameter being updated within the context of a method call.
2.9 The fix schemas implemented in AutoFix-E [Wei et al., 2010]. snippet is replaced with a sequence of routine calls that move the program from a faulty state into a desired state. old_stmt may either be a single statement, or the block to which a statement belongs. fail is used to monitor the conditions under which the fault manifests, and not to effect the appropriate action.

3.1 An overview of the inputs and outputs of Pythia's oracle generation process.
3.2 In this example, taken from the gzip object within the SIR, the test suite attempts to destructively compress a given file, deleting the original version of the file and replacing it with its zipped form.
3.3 An example test case description within a Pythia test manifest file. Each test case is described by its corresponding shell command, the contents of its sandbox directory, and an optional human-readable description.
3.4 An entry in an example Pythia oracle file, describing the expected behaviour of a test case from the corresponding manifest file.
3.5 Docker Containers (left) vs. Virtual Machines (right). Each container sits on top of the Docker runtime, which provides access to the kernel of the host machine. Each virtual machine virtualises its own stack, down to the level of the hardware, and sits on top of a hypervisor, which allows multiple VMs to run on the same machine.
3.6 Repair boxes can be used with repair tools by mounting the binaries, provided by a repair tool container, into the repair box, via volume mounting.
3.7 An example Dockerfile for one of the libtiff scenarios within RepairBox.
3.8 The Docker image for a given bug scenario is built as a series of layers. Each layer is shared by a number of images on the layer above, reducing disk usage and build time.
3.9 An example bug scenario manifest.
3.10 An example tool manifest describing GenProg.
3.11 Example uses of the repairbox list command.
3.12 An example use of the repairbox launch command.

4.1 A screenshot of the original Tarantula fault localisation visualisation tool. Lines coloured red are executed primarily by the failing test cases, suggesting a high suspiciousness. Lines that are mostly covered by the passing test cases are coloured green. Brightness is used to indicate a form of confidence in the suspiciousness attributed to a given line: the brightness of a line is given by the greater of the fraction of failing tests covered by the line, and the fraction of passing tests covered. Source: http://spideruci.org/fault-localization/ (May 2017).
4.2 An example of backwards static slicing on a small program. The source code on the right shows the sliced form of the source code on the left, where (20, lines) is set as the slicing criterion. Adapted from the example given in [Silva, 2012].
4.3 Comparing the mean pass-to-failure rate for applicable mutants, we observe different, but overlapping, distributions for correct statements and faulty statements (KS2 = 0.301; p = 0.003; A12 = 0.679 [medium effect]).
4.4 We observe similar distributions of mean f2p values for faulty and correct statements (KS2 = 0.185; p = 0.177). In both cases, more than half of the mutants at each statement did not pass any of the previously failing tests.

6.1 An illustration of the implicit search space defined by GenProg's representation and a more granular alternative proposed by Oliveira et al. [2016]. Whereas the type of operation, location, and donor statement used by an edit are all considered to be part of a discrete, evolvable unit within GenProg's representation, Oliveira et al. [2016]'s intermediate representation allows each of these attributes to be treated separately by a set of purpose-built crossover operators.
6.2 Patches tend to become longer over time.
6.3 Across all problems, we observe a monotonic increase in the total number of edit operations contained within the population, despite the ability of crossover to generate smaller individuals.
6.4 An example of the destructive edit bias. In this example, a single-edit patch is applied to the original AST, given on the left. This patch replaces the node at location 1 with the node at location 5. When the child tree is subsequently mutated, any changes to locations 1, 4 or 5 will have no effect. Note, the donor node, coloured black, may not be the subject of a future mutation operation. Thus, the effects of this replacement operation are permanent on all of its descendants (except in cases where crossover moves this operation to another child).
6.5 An example of the append bias. The AST in the top left shows the state of the original, faulty program. The AST in the top right shows the repaired version of that AST, containing two missing statements α and β. On the bottom half of the diagram, we show a repair scenario in which this bias is encountered. In the first generation, α is appended after the statement at location 4, matching its intended position in the repaired program. In the following generation, β is appended, yielding the incorrect sequence β; α. From the first mistake in the order of append operations, the search is unable to move backwards to find a correct order; incorrect edits will continue to accumulate at statement 4.
6.6 Our restricted representation implicitly encodes individuals as a fixed-length list of optional edits P. Each entry of the list, Pi, describes the edit, if any, that should be applied to the statement with number i.
6.7 Uniform crossover takes two parents, A and B, and generates a pair of symmetrical children, C and D.

Acknowledgements

First and foremost, I would like to thank my supervisor, Prof. Susan Stepney, for her continual support and guidance over the last five years, and for somehow always managing to give me feedback at a superhuman pace. I would also like to extend my thanks to the (former) members of the EvoEvo research group for many stimulating discussions: Dr. Simon Hickinbotham, Dr. Tim Hoverd, Dr. Paul Andrews, and Dr. Tim Taylor. A special thanks goes to Dr. Dan Franks for his invaluable advice and support, and whose teaching inspired me to begin this pursuit.

I am indebted to Prof. Claire Le Goues for her mentorship and advice, and for bringing me into her research group. Our paths may never have crossed were it not for the generous support of the William Gibbs Foundation, which made my visit to Prof. Le Goues's lab possible.

Thank you to Mum, Dad, Joe, Becky and Rick for always believing in me and for never letting me forget where I'm from, and to Millie for her cuddly companionship. And finally to Geraldine for living through every moment of this adventure with me, and giving me the courage to see its end. Nothing I can write here can begin to express my gratitude to you all. Thank you.


Declaration

I declare that this thesis is a presentation of original work, and that I am the sole author. This work has not previously been presented for an award at this, or any other, University. All sources are acknowledged as references.

The study of mutation analysis for automated program repair, presented in Chapter 4, has been accepted for publication and is due to appear in the proceedings of SSBSE'17: Christopher Steven Timperley, Susan Stepney, Claire Le Goues. An Investigation into the Use of Mutation Analysis for Automated Program Repair. SSBSE'17.


CHAPTER 1

Introduction

According to a 2013 study [Judge Business School, Cambridge University, 2013], the worldwide cost of debugging and repairing software bugs is estimated to be $312 billion per year; on average, programmers spend roughly 50% of their time finding and fixing bugs. Automated Program Repair (APR) is a new and emerging technique that attempts to reduce this burden by automatically localising and fixing arbitrary bugs.

In 2009, the research area of APR was born when GenProg [Weimer et al., 2009] demonstrated the ability to fix arbitrary bugs in real-world C programs, using only their provided test suites as an oracle. Although previous work had examined the possibility of automatically detecting, and in some cases, fixing bugs, none of those works considered the general case—each addressed a particular sub-set of bugs, under restrictive assumptions [Demsky and Rinard, 2003; Perkins et al., 2009].

To automatically repair programs, GenProg uses a form of genetic programming (GP). Whereas genetic programming is typically used to evolve entire programs from scratch—albeit small ones—GenProg evolves patches, represented as a sequence of edit operations, instead. These operations are represented by one of three statement-level program transformations: deletion, insertion, and replacement.1 To constrain the search space, and to transform program repair into a tractable problem, these operations are crafted using existing code, supplied by the program under repair, via a technique later referred to as plastic surgery [Barr et al., 2014]. Candidate patches are generated using this representation through a continual process of mutation, crossover, and selection. Patches that pass a greater number of tests, and especially those that pass previously failing tests, are identified as partial solutions by the search, and used as the basis of the next generation of patches. This process of generation and validation continues until a patch that passes the entire test suite is found.

At ICSE 2012 [Le Goues et al., 2012a], GenProg was shown to repair 55 out of 105 real-world bugs in large-scale C programs, including the PHP and Python interpreters, and libtiff. In the same year, at GECCO 2012, Le Goues et al. [2012c] demonstrated how modifications to GenProg's parameters could allow it to repair another five bugs within the same dataset, and to reduce the wall-clock time required to do so. Automated program repair—an idea dreamt about for more than half a century—appeared to have been conquered, overnight.

1 Originally, GenProg implemented a swap operation, rather than replacement. This operation was dropped from future versions when its authors found replacement to be a more effective alternative [Le Goues et al., 2012c].
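To make the shape of this search concrete, the following Python sketch models a GenProg-style patch as a list of statement-level edits, together with mutation and one-point crossover operators over that representation. It is an illustrative sketch only, not GenProg's implementation (which is written in OCaml): the names Edit, mutate and crossover, and the use of plain integers for statement identifiers, are assumptions made here.

    import random
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass(frozen=True)
    class Edit:
        """A single statement-level repair action (illustrative sketch)."""
        action: str            # "delete", "insert", or "replace"
        location: int          # SID of the statement being modified
        donor: Optional[int]   # SID of the donor statement (None for deletions)

    # A candidate patch is an ordered sequence of edits.
    Patch = List[Edit]

    def mutate(patch: Patch, suspicious: List[int], donor_pool: List[int]) -> Patch:
        """Append one randomly chosen edit: the location is drawn from the fault
        localisation, and the donor code from the plastic-surgery pool."""
        action = random.choice(["delete", "insert", "replace"])
        location = random.choice(suspicious)
        donor = None if action == "delete" else random.choice(donor_pool)
        return patch + [Edit(action, location, donor)]

    def crossover(a: Patch, b: Patch) -> Patch:
        """One-point crossover: a prefix of one parent joined to a suffix of the other."""
        i, j = random.randint(0, len(a)), random.randint(0, len(b))
        return a[:i] + b[j:]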


Two years later, at ICSE 2014, Qi et al. [2014] showed that a form of random search—restricted to a single-edit search space—outperformed the genetic algorithm used by GenProg, as measured by the average Number of Test Case Evaluations required to find a repair (NTCE). Whilst these results raised questions regarding the effectiveness of using genetic algorithms for program repair, they could also be explained by the aggressive optimisations made by the random search, and by its restriction to a single-edit search space.

Finally, at ISSTA 2015, Qi et al. [2015] examined the patches generated by GenProg for the ICSE 2012 dataset, and found that 104 of its 110 plausible patches could be explained by a single functionality-deleting modification (e.g., deletion of an If statement). For the majority of bug scenarios within the ICSE 2012 dataset,2 the test harness only checked that the exit status of the program was zero. As a result, many of the repairs reported by GenProg overfit to the test suite by simply ensuring the program produces the correct exit status, instead of fixing the underlying bug: most patches either deleted parts of the program, or inserted premature exit(0) or return statements. After addressing issues in the test harnesses, and extending the original test suites, Qi et al. [2015] found that GenProg produced correct fixes for 2 out of 105 bugs; in both cases, these bugs could be patched through functionality deletion alone (i.e., removing a set of statements from the program).

Motivated by these problems, at ESEC-FSE 2015, Long and Rinard [2015] introduced an alternative search-based repair technique: Staged Program Repair (SPR). To avoid overfitting3 due to functionality deletion, and to address a greater number of bugs, SPR uses a richer set of repair actions (i.e., types of transformation that can be applied to the program), without an explicit deletion action. Although SPR does not permit explicit functionality deletion, Mechtaev et al. [2016] demonstrated that SPR still produces such patches implicitly.

In addition to introducing a new repair model, SPR uses a novel search technique to reduce the cost of finding repairs: value search. Value search allows SPR to determine whether a particular kind of transformation (e.g., modification of an if condition) can be used to fix the program; only once the feasibility of a repair has been determined does SPR attempt to find the values required to concretely fix the bug (e.g., a fixed if condition).

Although SPR was shown to outperform GenProg in terms of the number of correct repairs found within a 12-hour window, it is unclear how much of this is due to SPR's larger repair model, and to what extent value search improves efficiency. Unfortunately, value search has not been compared to any alternative search techniques—using the same repair model—such as random search. Whilst SPR may offer a more efficient way of generating single-edit patches, it is difficult to see how its value search algorithm could be extended to cover multiple edits, as Long and Rinard [2015] note.

2 The ICSE 2012 dataset would go on to form a sub-set of the bug scenarios within the ManyBugs [Le Goues et al., 2015] dataset.
3 In the context of automated program repair, the term "overfitting" is used to refer to candidate repairs which pass the test suite but do not constitute correct repairs (i.e., one could produce a test that rejects the candidate repair).


Since these discoveries, automated program repair techniques have been categorised into three distinct approaches: search-based repair techniques, such as GenProg and SPR, which generate candidate repairs according to a search algorithm, and evaluate them until a patch is found which passes the entire test suite; semantics-based techniques, such as Angelix [Mechtaev et al., 2016] and SearchRepair [Ke et al., 2015], which use symbolic evaluation to guide patch synthesis; and specification-based approaches, such as AutoFix-E [Wei et al., 2010], which use formal specifications to synthesise fixes, rather than using test suites as their oracle.

Whilst semantics-based and specification-based approaches have demonstrated promising results, especially in terms of patch quality, their real-world use is highly restricted. Specification-based approaches rely on the existence of formal specifications, which are rarely found in practice [Kossak et al., 2014]. Although semantics-based approaches do not require the existence of formal specifications, instead relying on the existence of a test suite, the limits of symbolic execution prevent their application to real-world programs, in the general case. Existing semantics-based techniques are unable to deal with code that involves side-effects (e.g., reading from or writing to a file), non-primitive data types (e.g., structs), or looping and recursion.

Search-based program repair is the only approach capable of fixing arbitrary bugs, without restrictions on the program, or the need for formal specifications. To address a greater number of bugs, search-based techniques will need to expand their repair models (i.e., the operations used to construct patches), improve their ability to localise faults, and scale to bugs requiring multiple edits. In this thesis, we investigate the challenges facing search-based program repair, propose and demonstrate effective ways to tackle them, and increase our understanding of the problem along the way.

1.1. Motivation

In this section, we outline the motivation behind this thesis, and provide justification for our focus on search-based program repair of C programs.

Automated Repair of C Programs

Amongst other languages, automated program repair has been successfully applied to programs written in C [Le Goues et al., 2012a; Long and Rinard, 2015, 2016; Mechtaev et al., 2016; Tan and Roychoudhury, 2015], Java [Kim et al., 2013; Le et al., 2016; Xuan et al., 2017], and Python [Ackling et al., 2011]. Accompanying each of these languages is a unique set of opportunities and challenges:


• Being a dynamic language without the benefits of static type checking, bugs in Python may involve the use of unreferenced variables, passing (incorrect) arguments of the wrong type, or unintended implicit conversions. At the same time, being an interpreted language, with the capacity for introspection and self-modification, Python offers new avenues for (automated) debugging and patch generation.
• Java, on the other hand, employs static type checking, allowing it to avoid this class of run-time bugs. Additionally, unlike C, built-in memory management (provided by the garbage collector) helps to prevent the majority of memory leaks and segmentation faults.4
• In contrast to Java and Python, C provides no (built-in) memory management and thus it suffers from its own class of memory leaks, buffer overflows, segmentation faults and pointer misuses. Furthermore, unlike Java and Python, which (by default) provide the programmer with detailed information about a program failure (in the form of a stack trace and exception messages), C programs tend to be almost devoid of useful information in such cases. Making automated repair more difficult still, C also lacks a standard framework for testing. Projects tend to either provide an arbitrary set of scripts (usually written in bash or Perl), or in rarer cases, to use their own bespoke testing framework (PHP).

C, therefore, represents a challenging testing ground for automated program repair, where knowledge of the bug is limited to its bare minimum. By demonstrating the effectiveness of automated program repair techniques in such a barren environment, one gains an increased confidence in their applicability to a wider range of languages. Furthermore, C remains one of the most popular programming languages in the world (ranked at #2 in the January 2017 TIOBE popularity index5, representing 9.35% of the market), and is widely used within production legacy code, where specifications and test suites tend to be woefully lacking, and the original developers are no longer around. Most importantly, by using C as the focus of this thesis, we are able to build upon the large body of existing techniques that target this language.

Search-Based Program Repair

Having presented our case for using the C programming language as the focus of this thesis, below we put forward three reasons for pursuing search-based program repair, rather than semantics- or specification-based approaches.

4 Note, garbage collection does not necessarily ensure Java programs are invulnerable to memory leaks (or, in this case, heap overflows). A common way of introducing such bugs stems from careless storage of objects when using the singleton object pool pattern [Bloch, 2008].
5 https://www.tiobe.com/tiobe-index/


1. Generality of Fixes: A number of techniques exist that are theoretically capable of identifying and correcting bugs belonging to a particular class, such as memory leaks and integer overflows [Coker and Hafiz, 2013; Gao et al., 2015]. In practice, however, these techniques are only effective under limited circumstances, and unlike search-based repair, they have no utility beyond the specific bug class for which they were designed.

Given its ability to perform arbitrary modifications to the source code, the search-based approach is (in theory) capable of solving any kind of bug, including those that cannot be cleanly delineated into specific bug classes. In contrast, many semantics-based techniques are limited to performing very specific code transformations, such as the replacement of an if-condition, or the RHS of an assignment expression.

Additionally, search-based approaches can be used to produce fixes spanning multiple lines and files, unlike the majority of alternatives, which tend to be limited to modification of a single statement (or expression), or the introduction of a contiguous block of statements (in the case of SearchRepair). Of the non-search-based approaches to automated repair, Angelix is the only technique capable of producing fixes spanning more than a single statement. Unlike search-based approaches, such as GenProg, Angelix suffers a number of limitations with respect to the types of programs to which it can be applied, discussed below.

2. Generality of Subjects: In addition to being free from limitations with respect to the types of fixes that search-based repair can address (since the technique involves modification of the source code), search-based repair is free from any limitations on the types of programs on which it may be used. In contrast, a number of semantics-based techniques, including NOPOL, Angelix, relifix and SearchRepair, are unable to handle programs involving loops and recursion, or code involving I/O operations (e.g., database and file accesses). Furthermore, techniques based on symbolic evaluation are only able to handle code containing a limited set of primitive types (excluding floating-point numbers and structs).

3. Minimal requirements: Unlike specification-based approaches, search-based techniques do not assume the existence of any kind of contracts or specifications, which are seldom found in real-world code. Ideally, all software would be provided with at least a partial specification of its behaviour, but in reality, the uptake of such specifications within the software industry has been highly limited. Assuming the existence of such artefacts highly restricts the domain of problems to which automated repair may be applied. By forgoing this need, the search-based approach becomes particularly well suited to legacy systems, where knowledge is particularly limited.

Similarly, most search-based techniques do not require the existence of an oracle, unlike techniques such as relifix [Tan and Roychoudhury, 2015], which uses previous (functioning) versions of the program to fix regressions. The only requirements for repair are a test suite capable of exposing the bug (i.e., one with at least a single failing test case, covering the faulty code) and the source code of the program.


In summary, search-based automated program repair poses the fewest limitations and makes the fewest assumptions about the bug, allowing it to be applied in the widest set of scenarios.

1.2. Challenges

Despite the potential of search-based repair, and promising early results produced by GenProg, recent studies have highlighted a number of major challenges facing the approach:

1. Patch Quality: Of the challenges facing search-based repair, and the wider field in general, the problem of plausible, but incorrect, patches (i.e., those that pass the test suite as a result of overfitting) is perhaps the most controversial. Recent studies [Qi et al., 2015; Smith et al., 2015] have demonstrated that the majority of fixes reported for GenProg, the most popular search-based approach, were the result of overfitting to weak test suites (i.e., ones which fail to provide sufficient coverage or to thoroughly check the output of the program). On further inspection, Qi et al. [2015] found that almost all patches produced by GenProg were destructive (i.e., deletion or replacement of code).

Whilst most of the discussion of patch quality has focused on the shortcomings of the GenProg family of search-based approaches (i.e., GenProg, RSRepair, AE), this problem is by no means exclusive to them. The same phenomenon may also be observed in semantics-based approaches, such as SPR and Prophet.

These results also cast uncertainty on the findings of previous studies within this sub-field of program repair. In particular, it is no longer clear how effective the repair model used by these approaches is, nor is it clear that improvements and parameter tweaks of the search algorithms used by them are effective in producing correct repairs. To move forward with search-based repair, we must first take a step back to better understand its current abilities and limitations.

2. Effectiveness: The findings of the aforementioned studies on patch quality also raise questions regarding the actual effectiveness of the search algorithm and repair model used by GenProg. Assuming the existence of a more robust test suite, less susceptible to overfitting, it is no longer clear to what extent these components help GenProg to find correct repairs. For automated program repair to see wider acceptance within the software engineering community, techniques must demonstrate the ability to solve a significant number of bugs across a diversity of programs.


To be truly convincing, automated repair requires the ability to perform arbitrary code transformations—a feat that only search-based repair is capable of.

3. Efficiency: To become economically viable, repair techniques need to be able to discover repairs within a reasonable amount of time. Ideally, the time required to find a patch should be measured in hours, or even minutes, rather than days. This problem can be side-stepped through the use of cloud computing resources, which allow the repair process to be split across a large number of compute nodes for a short period of time. In this case, the repair process needs to be cost-effective.

By improving the efficiency of the search, the likelihood of encountering a patch within a fixed resource window is increased, leading to simultaneous gains in effectiveness. Moreover, increases in efficiency will allow larger, more complex search spaces to be explored, allowing the technique to solve a greater number of bugs.

4. Scalability: Future program repair techniques will require the ability to produce multiple-edit patches for bugs in arbitrary programs. Due to the limits of symbolic execution associated with semantics-based repair, search-based techniques are the only foreseeable means of tackling this challenge. Despite this, the majority of recent work in search-based program repair has focused on increasing the ability to discover single-edit patches [Long and Rinard, 2015, 2016; Qi et al., 2014; Weimer et al., 2013].

Genetic algorithms, such as those used by GenProg, PAR, and HDRepair, are the only proposed search technique capable of producing multi-edit patches, in theory. In practice, however, almost all patches generated by those tools can be reduced to a single edit; there is little evidence that this approach is effective at discovering multi-edit patches. With recent results showing that a form of random search outperformed the genetic algorithm used by GenProg, there is a clear need to establish whether GAs are a suitable algorithm for repair, and if not, what alternatives exist.

1.3. Research Questions

Motivated by the challenges facing search-based repair, this thesis seeks to increase our understanding of each of the components of the search, and to explore ways in which those components might be improved.

Motivated by recent results in the use of mutation analysis for fault localisation [Moon et al., 2014a; Papadakis and Le Traon, 2015], we ask the following question:


• RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

To understand how the repair model used by program repair techniques can be improved to allow a greater number of bugs to be repaired, we ask the following questions:

• RQ2: Is plastic surgery equally effective for all repair actions?
• RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

Given that GenProg is the only search-based repair technique capable of producing multiple-edit patches, we ask the following questions:

• RQ4: Do biases within GenProg's operators degrade its performance?
• RQ5: Does fitness help to guide GenProg's search algorithm towards solutions?

• RQ6: Is there a more effective search algorithm for generating multiple-edit patches than the genetic algorithm used by GenProg?

To answer these questions, and to avoid the pitfalls of previous research, we propose and use a novel methodology. Our methodology allows researchers to quickly and easily reproduce the results of our experiments. It also provides a platform for others to conduct high-quality program repair research.

1.4. Contributions

Below, we present a summary of the main research contributions made by this thesis.

• We identify a number of weaknesses and compromises in the current approach to conducting APR experiments. Motivated by those problems, we present a better methodology for conducting APR experiments, and a suite of tools for implementing that methodology.
• As part of our theoretical and empirical analyses of GenProg's search algorithm, we identify a number of phenomena and unintended biases, which hamper the efficiency and effectiveness of the search.
• We investigate the role of fitness within the search for repairs, and find that the search is more effective when restricted to consideration of non-destructive patches (i.e., patches that do not fail any of the previously passing tests). Further gains in performance are attained when the number of previously failing tests that are passed by a candidate patch is used to perform selection between non-destructive patches (i.e., those which do not fail any previously passing tests).


• We propose an alternative search algorithm for finding multiple-edit repairs, based on a greedy algorithm. We demonstrate that our algorithm outperforms GenProg's genetic algorithm in terms of effectiveness, efficiency, and reliability.
• We explore the feasibility of using test suite evaluations for candidate patches, gathered over the course of the search, to increase the accuracy of the fault localisation, online. In contrast to earlier studies of mutation-based fault localisation, conducted on artificial bugs in small programs, our results demonstrate that this information leads to little to no improvement for real-world bugs. We speculate on why these results might be the case, and outline directions for future research.
• Building on previous work concerning the effectiveness of plastic surgery, we investigate its effectiveness at the level of actual repair actions. We report the observed frequencies and graftabilities of a number of new and existing repair actions. As part of this study, we also provide a more formal definition of each of these repair actions.
• We find that repair actions that operate at finer levels of granularity (i.e., expressions) are more amenable to plastic surgery than repair actions with coarser levels of granularity (e.g., blocks, and to a lesser extent, statements). We also find that removing labels from donor code substantially increases the chances of successfully grafting a repair.
• We identify an unintended behaviour within GenProg's implementation, stemming from its use of CIL [Necula et al., 2002], which may severely impact GenProg's performance and prevent it from fixing certain otherwise fixable bugs.

In addition to these research contributions, we have developed a number of tools over the course of this thesis:

• RepairBox: a high-performance platform for conducting reproducible empirical studies of program repair, built on top of Docker. This tool was used to conduct all of the relevant experiments within this thesis.

• Pythia: a tool for sandboxing test executions and automatically increasing the quality of existing test suites. Pythia was used in combination with RepairBox to reduce the risk of overfitting during our experiments.

• BugHunter: a tool for automatically mining bug fixes from Git repositories. In addition to collecting (likely) bug-fixing commits, BugHunter also extracts repair action instances from those commits (e.g., a statement was inserted by the fix), and allows a variety of analyses to be conducted on those instances.

• GenProg: we performed an extensive refactoring of the GenProg source code.


New features include asynchronous test suite evaluation, JSON configuration files, and a component-based architecture, which allows GenProg to be easily extended with new operators, algorithms and fault localisation approaches.

1.5. Document Structure

The rest of this thesis is structured as follows:

• Chapter 2 – Background: provides the reader with the necessary background in automated program repair to understand the rest of the thesis.
• Chapter 3 – Tools and Techniques: discusses a number of problems in the current approach to conducting APR experiments, before presenting a set of tools to address them. These tools are used throughout the thesis, to ensure reproducibility of results.
• Chapter 4 – Fault Localisation: investigates the feasibility of using the results of candidate patch evaluations, gathered over the course of the search, to increase the accuracy of the fault localisation.
• Chapter 5 – Repair Model: measures the frequency and graftability of a set of new and existing repair actions, to improve our understanding of effective repair models, by examining bugs mined from version control histories.
• Chapter 6 – Search: conducts a theoretical and empirical analysis of GenProg's search algorithm, before presenting a new search algorithm, based on greedy search.
• Chapter 7 – Conclusion: summarises the findings of this thesis, and outlines several directions for future work.

CHAPTER 2

Background

In this chapter, we provide the reader with the necessary background to understand the rest of this thesis. We first provide an overview of the general concepts of program repair, before reviewing a number of the most popular search-based, semantics-based and specification-based repair techniques. Finally, we discuss a number of loosely related techniques covering the space of code transformation and repair more broadly.

2.1. Automated Program Repair

Provided the source code for a program, and an oracle capable of exposing its bugs, APR attempts to find a variant of the program that satisfies the oracle; assuming a sufficiently strong oracle, a repair to those bugs is the output of the repair process. Although a few repair techniques use formal specifications, the majority make use of existing test suites as their oracles instead: failing tests within these suites are used to expose bugs, whilst passing tests are used to prevent the destruction of existing, correct behaviours.

Repair approaches can be split into three categories: search-based, semantics-based, and specification-based. For the most part, each of these approaches shares the same high-level approach to locating and repairing bugs, outlined in Figure 2.1. In the following sections of this chapter, we discuss each of these approaches. For the rest of this section, we explore each of the high-level components in the repair process. To aid this discussion, we look at a fault in a small program, given in Figure 2.2, inspired by the real-world Zune leap year bug [Coldewey, 2008].

Input Program & Donor Pool

Before the repair process begins, each statement within the source code of the program (or the sub-set of files suspected to contain the bug) is identified and uniquely numbered with a statement identifier, or SID, as illustrated in Figure 2.3. Statements are assigned an integer, starting at one, based on the order in which they are encountered when walking through the collection of abstract syntax trees for the program (although the exact SID for a statement makes no difference, only that the SID is guaranteed to be unique). These SIDs are used to identify which areas of the program were executed by the passing and failing test cases during the fault localisation stage of the search.
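As an illustration of this numbering step (not the CIL-based implementation used by GenProg), the following Python sketch uses pycparser, an off-the-shelf C parser, to walk an abstract syntax tree in preorder and assign a SID to each statement encountered inside a function body; the particular set of node types treated as statements here is a simplifying assumption.

    from pycparser import c_parser, c_ast

    # An illustrative subset of node types treated as statements.
    STMT_TYPES = (c_ast.Decl, c_ast.Assignment, c_ast.If, c_ast.While, c_ast.Return)

    def assign_sids(source: str) -> dict:
        """Walk the AST in preorder, numbering statements from one."""
        ast = c_parser.CParser().parse(source)
        sids, counter = {}, 0

        def walk(node, in_body=False):
            nonlocal counter
            if in_body and isinstance(node, STMT_TYPES):
                counter += 1
                sids[counter] = node
            for _, child in node.children():
                walk(child, in_body or isinstance(node, c_ast.Compound))

        walk(ast)
        return sids

    example = ("int f(int days) { int year = 1980;"
               " while (days > 365) { days -= 365; year += 1; } return year; }")
    for sid, node in assign_sids(example).items():
        print(sid, type(node).__name__)   # 1 Decl, 2 While, 3 Assignment, ...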


Figure 2.1: The general automated repair process accepts the source code for a faulty program, together with a test suite, containing failed test cases, exposing the faults within the program. From these, the possible locations of the faults are determined, and for some repair approaches, a pool of donor code is generated from the input program. Using the contents of the donor pool, together with a number of basic repair actions, the search generates and evaluates candidate patches, until one is found that passes all tests within the suite.

Whilst this process of annotation occurs, some repair techniques, such as GenProg [Le Goues et al., 2012a] and SPR [Long and Rinard, 2015], extract each of the identified statements and place them into a donor code pool, ready to provide the necessary code to produce repair candidates during the search. Other techniques, such as SearchRepair [Ke et al., 2015], use a pre-computed donor code pool, built from foreign source code, rather than the source code of the program under repair. Others avoid the need for a donor pool entirely, synthesising fragments of code from scratch [Long and Rinard, 2015; Mechtaev et al., 2016; Xuan et al., 2017].

Test Suite

With the exception of techniques where an oracle is provided, either in the form of a (partial) formal specification [Wei et al., 2010], or an alternative version of the program [Tan and Roychoudhury, 2015], a patch is assumed to be correct if it passes all tests within the test suite. For the purposes of conducting research, the test suite is divided into a set of positive tests, representing the passing tests (and the behaviours that should be preserved during repair), and a set of negative tests, representing the failing tests. The need for this distinction is to ensure the same tests pass or fail when an experiment is repeated.


1   year = 1980;
2   days = atoi(argv[1]);
3   while (days > 365) {
4     if (isLeapYear(year)) {
5       if (days > 366) {
6         days -= 366;
7         year += 1;
8       }
9     } else {
10      days -= 365;
11      year += 1;
12    }
13  }
14  return year;

Figure 2.2: An example bug scenario, adapted from the real-world Zune leap year bug [Coldewey, 2008]. The program accepts a date, given as the number of days since January 1st 1980, and should determine the year to which that date belongs. In the event that year is a leap year and days ever becomes 366, the program will enter an infinite loop.

Fault Localisation

To reduce the scale of the search landscape and to bring automated repair into the realms of feasibility, the fault localisation stage of the repair process is used to select a (relatively) small number of suspicious locations within the program as candidates for modification. For most existing automated repair techniques, these locations correspond to statements within the program (rather than lines, expressions, or any other logical grouping of code).

To determine the set of faulty statements, and thus the candidates for modification, the first step of the fault localisation process is to compute coverage information for the program. This coverage information describes the statements that are executed by the program for each test within the suite. To generate coverage information for the program, its files (or a sub-set of them, believed to contain the fault) are subject to a process of source code instrumentation. Each statement of the program is wrapped by a coverage statement, responsible for monitoring and logging its execution. An example of the output of the coverage generation process is given in Figure 2.5.

From this coverage information, the next step is to aggregate the data into a fault spectrum. The fault spectrum concisely describes which statements were executed (or covered) by each test,1 and whether the result of that test was failure or success. An example of such a spectrum is given in Figure 2.6.

1 Optionally, rather than recording whether a statement was executed, each cell in the spectrum may be used to specify how many times a given statement was executed. Whilst this information may be useful in dealing with infinite loops, the need to log individual executions increases the burden on the instrumented program, causing it to run considerably slower. This slowdown is made worse when the user wishes to generate an execution path for each test, describing the order in which statements were executed. For most real-world programs, the disk requirements of this path file, potentially of the order of GBs per test, make such an approach infeasible.


#   Statement
1   year = 1980;
2   days = atoi(argv[1]);
3   while (days > 365) ...
4   if (isLeapYear(year)) ...
5   if (days > 366) ...
6   days -= 366;
7   year += 1;
8   else {}
9   else ...
10  days -= 365;
11  year += 1;
12  return year;

Figure 2.3: Before the repair process can begin, each of the statements within the program is identified and annotated with a unique SID.

Finally, using a lightweight form of fault localisation known as spectrum-based fault localisation, the fault spectrum is transformed into a set of suspiciousness values, encoding a measure of the belief that a given statement is responsible for producing the bug. This process of transformation is performed using a suspiciousness metric, $\mu(e_p, e_f, n_p, n_f)$, which takes a count of the number of times a given statement was (and was not) executed by passing and failing tests, respectively. For each statement, the suspiciousness metric transforms those counters into a real number, or suspiciousness value, where higher values indicate a greater degree of belief that the statement is (partly) responsible for producing the bug.

For the majority of automated repair techniques, a relatively simple suspiciousness metric is employed. In the case of GenProg, its metric assigns a suspiciousness of 1.0 to statements executed exclusively by failing tests, 0.1 to statements executed by both passing and failing tests, and 0.0 to statements that are not covered by a failing test (thus eliminating them from consideration by the repair). More details on the fault localisation process can be found in Chapter 4.
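As a concrete illustration of this metric (a minimal Python sketch, not GenProg's OCaml implementation), the function below reproduces the three-valued scheme just described, applied to a few rows of the fault spectrum in Figure 2.6; the function name and the dictionary layout are assumptions made here.

    def genprog_suspiciousness(e_p: int, e_f: int, n_p: int, n_f: int) -> float:
        """GenProg's metric as described above: 1.0 for statements covered only by
        failing tests, 0.1 for statements covered by both passing and failing tests,
        and 0.0 for statements not covered by any failing test."""
        if e_f == 0:
            return 0.0
        return 1.0 if e_p == 0 else 0.1

    # A few rows of the fault spectrum from Figure 2.6: SID -> (e_p, e_f, n_p, n_f).
    spectrum = {1: (8, 4, 0, 0),     # year = 1980;
                8: (0, 4, 0, 0),     # empty else branch of the inner if
                12: (8, 0, 0, 4)}    # return year;
    for sid, counts in spectrum.items():
        print(sid, genprog_suspiciousness(*counts))   # -> 0.1, 1.0 and 0.0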

Repair Candidates

Once the fault localisation and the donor code pool have been computed, the two are combined according to a repair model to generate the space of possible repairs.


Test   Input   Expected   Passed
P1     -366    1980       ✓
P2     -100    1980       ✓
P3     0       1980       ✓
P4     365     1980       ✓
P5     367     1981       ✓
P6     1000    1982       ✓
P7     1826    1984       ✓
P8     2000    1985       ✓
N1     10593   2008       ✗
N2     12054   2012       ✗
N3     1827    1984       ✗
N4     366     1980       ✗

Figure 2.4: A test suite for our running example, described by inputs and expected outputs for the program. Failing tests are assigned labels starting with the letter "N", indicating that they are negative test cases. Conversely, passing tests are assigned a label starting with the letter "P", indicating that they are positive test cases.

The repair model of a technique is composed of a series of repair actions, each of which abstractly describes the kinds of changes (or mutations) that can be made to the program. For instance, the repair model of GenProg contains repair actions that allow the insertion, replacement, and deletion of statements. A more detailed definition of repair models is provided in Chapter 5. An upper bound on the size of the search space S for a given problem and (search-based) repair technique is:

\[
S \leq \sum_{i=1}^{l} \bigl[ F \cdot (D \cdot R) \bigr]^{i} \tag{2.1}
\]

where F is the number of suspicious statements, D is the number of donor code fragments, R is the number of repair actions, and l is the maximum number of edits that may be contained within the patch.
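For instance, a direct computation of this bound for the running example might look as follows (a Python sketch); the concrete numbers—eleven suspicious statements, twelve donor statements, three repair actions, and patches of at most two edits—are illustrative values drawn from Figures 2.3 and 2.6 rather than the output of any tool.

    def search_space_bound(F: int, D: int, R: int, l: int) -> int:
        """Upper bound of Equation 2.1: the sum over patch lengths i = 1..l of
        (F * D * R) ** i, i.e. every combination of faulty location, donor
        fragment and repair action at every edit position."""
        return sum((F * D * R) ** i for i in range(1, l + 1))

    # Zune example: 11 statements with non-zero suspiciousness, a donor pool of
    # all 12 statements, 3 repair actions, and patches of up to 2 edits.
    print(search_space_bound(F=11, D=12, R=3, l=2))   # 396 + 396**2 = 157,212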

2.2. Search-Based Repair

Search-based repair techniques employ meta-heuristic search algorithms to explore the space of possible repairs. Some of these techniques (e.g., GenProg and PAR [Kim et al., 2013]) attempt to discover partial solutions during the search by treating repair as an optimisation problem. Others, such as RSRepair [Qi et al., 2014] and AE [Weimer et al., 2013], treat repair as a decision problem, opting for higher efficiencies in the evaluation of candidate patches over the identification and exploitation of partial solution information.


#   Statement                    P1   P6   N1   N4
1   year = 1980;                 •    •    •    •
2   days = atoi(argv[1]);        •    •    •    •
3   while (days > 365) ...       •    •    •    •
4   if (isLeapYear(year)) ...         •    •    •
5   if (days > 366) ...               •    •    •
6   days -= 366;                      •    •
7   year += 1;                        •    •
8   else {}                                •    •
9   else ...                          •    •
10  days -= 365;                      •    •
11  year += 1;                        •    •
12  return year;                 •    •

Figure 2.5: An example of the output produced by the coverage generation process. The columns on the right show which statements are covered by a particular test. For the sake of brevity, we omit coverage for the majority of the test suite.

In all cases, candidate repairs are generated by the search, and subject to evaluation. This process of evaluation involves compiling the resulting program, and determining the outcomes of (a sub-set of) its test suite. The processes of generation and evaluation continue until either an acceptable repair is found (i.e., one which passes all of the tests) or a resource limit is reached.

For the rest of this section, we review the majority of search-based repair techniques.
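Before turning to individual techniques, the following minimal Python sketch captures this shared generate-and-validate loop in its simplest, single-candidate form (closer to a random search than to GenProg's population-based algorithm); propose and run_tests are placeholders standing in for a technique's search operators and its test-suite evaluation.

    def generate_and_validate(empty_patch, propose, run_tests, budget: int):
        """Generate candidate patches and validate them against the test suite,
        stopping at the first plausible repair or when the evaluation budget
        is exhausted."""
        for _ in range(budget):
            candidate = propose(empty_patch)       # generate a candidate patch
            passed, total = run_tests(candidate)   # compile and run (a subset of) the tests
            if passed == total:                    # plausible repair: passes every test
                return candidate
        return None                                # resource limit reached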

Co-Evolutionary Program Repair

Arcuri and Yao [2008] introduce the first search-based program repair technique, a year prior to the publication of GenProg. Like GenProg, their technique uses evolutionary computation, but whereas GenProg assesses the fitness of an individual based on a fixed number of tests, their approach uses competitive co-evolution to evolve a more robust set of tests. To automatically generate test cases—and thereby facilitate co-evolution—users are required to provide a formal specification for the program.

Unlike all other search-based techniques, which operate on the abstract syntax trees of a real-world program, whether directly, through representations such as the AST/WP, or indirectly, through variants of the Patch representation, Arcuri and Yao [2008]'s approach operates on a bespoke Strongly-Typed Genetic Programming (STGP) representation [Montana, 1995]. As with traditional applications of GP, programs are operated upon directly, and mutations are synthesised from a set of primitives.


 #  Statement                  ep  ef  np  nf    µ
 1  year = 1980;                8   4   0   0  0.1
 2  days = atoi(argv[1]);       8   4   0   0  0.1
 3  while (days > 365) ...      8   4   0   0  0.1
 4  if (isLeapYear(year)) ...   4   4   4   0  0.1
 5  if (days > 366) ...         4   3   4   1  0.1
 6  days -= 366;                4   3   4   1  0.1
 7  year += 1;                  4   3   4   1  0.1
 8  else {}                     0   4   0   0  1.0
 9  else ...                    3   3   5   1  0.1
10  days -= 365;                3   3   5   1  0.1
11  year += 1;                  3   3   5   1  0.1
12  return year;                8   0   0   4  0.0

Figure 2.6: An example of a fault spectrum for the Zune bug, together with the suspiciousness values for each statement, as computed by GenProg’s suspiciousness metric.
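GenProg’s weighting scheme is commonly described as giving full weight to statements executed only by failing tests, a small constant to statements executed by both failing and passing tests, and no weight to the rest; the values in Figure 2.6 are consistent with that reading. The sketch below reproduces them under that assumption (the 0.1 constant is taken from the table, not from GenProg’s source).

    def genprog_suspiciousness(ep, ef, weight_shared=0.1):
        """Suspiciousness of a statement from its coverage spectrum.

        ep: number of passing (positive) tests that execute the statement
        ef: number of failing (negative) tests that execute the statement
        Statements executed only by failing tests are maximally suspicious;
        statements never executed by a failing test are ignored.
        """
        if ef == 0:
            return 0.0
        if ep == 0:
            return 1.0
        return weight_shared

    # Rows 1, 8, and 12 of Figure 2.6:
    print(genprog_suspiciousness(ep=8, ef=4))  # 0.1
    print(genprog_suspiciousness(ep=0, ef=4))  # 1.0
    print(genprog_suspiciousness(ep=8, ef=0))  # 0.0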

As with traditional applications of GP, programs are operated upon directly and mutations are synthesised from a set of primitives. Given the relative proximity of the buggy program to its (correctly) repaired form—compared to the large distance between the origin and a solution when attempting to synthesise a program de novo—Arcuri and Yao [2008] forgo the need for crossover. Although this representation is Turing-complete and incorporates a number of high-level operations similar to those in C and Java, it significantly lacks the expressiveness of those languages. In the general case, programs written in this dialect of STGP may be easily transformed into C or Java. Transforming C or Java programs into this STGP dialect is less trivial, however. Even if C or Java programs could be transformed into this dialect of STGP, representing entire programs—or even single files—would require vast amounts of memory and incur a significant performance overhead, as observed in earlier forms of GenProg using the AST/WP representation [Le Goues et al., 2012a]. Arcuri and Yao [2008]’s work is therefore a powerful proof of concept of search-based program repair, rather than its first practical application.

Rather than solely measuring fitness as a function of the number of passing and failing tests, Arcuri and Yao [2008] use a richer fitness function, defined in Equation 2.2:

$$f(g) = \frac{N(g)}{N(g) + 1} + \frac{E(g)}{E(g) + 1} + \sum_{t \in T_i} d_P(t, g(t)) \qquad (2.2)$$

where g is a candidate solution, N(g) is the number of nodes in its tree, E(g) is the number of exceptions encountered during its execution, and d_P(t, g(t)) is a measure of the distance between the expected and observed result for the execution of test t.

Unlike subsequent approaches, the number of nodes for a given individual N(g) is directly incorporated into the fitness function, acting as a form of bloat control; a selective advantage is awarded to programs that are shorter than others. If the size of the program is less than or equal to δN(buggy), where N(buggy) is the number of nodes in the original program, then this bloat term is replaced by the constant 1. The fitness function also incorporates a term based on the number of exceptions produced by a given individual, E(g). Sharing a similar intuition to later work on the use of failure-obliviousness for evolving Ruby programs [Timperley, 2013; Timperley and Stepney, 2014], fallback semantics are provided for unsafe operations. Specifically, both division by zero and out-of-bounds array accesses are set to return zero instead. In theory, and as demonstrated by [Timperley and Stepney, 2014], allowing execution to persist in the face of such exceptions creates a smoother fitness landscape, which may allow certain classes of multiple-line errors to be solved more easily. Finally, Arcuri and Yao [2008] harness the results of the test suite by summing the distance between the intended and observed outcomes d_P(t, g(t)) for each test t. Whereas later work treats the outcome of a test case as a binary variable (i.e., pass/fail), their distance metric has more in common with previous work in genetic programming, in that it measures the distance between the observed and intended return value of the function under repair.

• For tests where the function returns the intended outcome, d_P = 0.

• In cases where the function returns an unexpected outcome, d_P > 0. The measurement of the magnitude of the difference between the two outcomes is performed by a necessarily arbitrary distance function, as is the case with traditional genetic programming. This distance function is automatically generated from a formal specification for the program.

To avoid solutions overfitting to a fixed test suite, the tests used for evaluation are competitively co-evolved in a separate population (of size 32). For the particular program studied in the paper—an implementation of bubble-sort—each test within the population is represented by a variable-length list of integers, representing an input list to the sorting algorithm. For every 5 generations that the solution population is evolved, the test population is subject to 1024 generations of evolution against the best program in the current population. The fitness of a given test t is measured using the fitness function described in Equation 2.3:

$$f(t) = \sum_{g \in G_i} d_P(t, g(t)) \qquad (2.3)$$

where g ∈ G_i is an individual from the i-th generation, and d_P(t, g(t)) is the distance between the expected and observed outcome for this individual’s test execution.

At the end of each of these 1024 generations, the best test case within the population is added to a Hall of Fame [Rosin and Belew, 1997], which then becomes the test suite for evaluating candidate patches.
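A minimal sketch of these two fitness functions is given below, assuming the node count, exception count, and per-test distances for a candidate have already been computed; the threshold multiplier delta and the handling of the bloat term follow the description above and are otherwise assumptions.

    def program_fitness(n_nodes, n_exceptions, distances, n_buggy, delta=1.5):
        """Equation 2.2 for a single candidate.

        n_nodes and n_exceptions are N(g) and E(g); distances holds
        d_P(t, g(t)) for every test t in the current suite; n_buggy is the
        node count of the original, faulty program. Following the text, the
        bloat term becomes the constant 1 once the candidate is no larger
        than delta times the original program.
        """
        bloat = 1.0 if n_nodes <= delta * n_buggy else n_nodes / (n_nodes + 1)
        exceptions = n_exceptions / (n_exceptions + 1)
        return bloat + exceptions + sum(distances)

    def test_fitness(distances_against_population):
        """Equation 2.3: a test is scored by the total distance it induces
        across the current population of candidate programs."""
        return sum(distances_against_population)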

In summary, Arcuri and Yao [2008]’s approach introduces the breakthrough idea of using genetic programming to evolve patches, rather than entire programs—an idea that underpins the seminal work in the field. Many of the ideas introduced in their work, such as bloat control and the use of a richer fitness function, have proved to be ahead of their time. Unlike the other approaches reviewed in this section, all of which are usable repair tools, Arcuri and Yao [2008]’s work is instead a proof of concept.

GenProg

GenProg can be viewed as both the seminal approach to the automated repair of real-world programs and the first viable search-based repair technique. At its core, GenProg uses a form of genetic algorithm to search for its repairs. Like all other genetic algorithms, GenProg’s search process can be seen as a continual cycle of generation and evaluation, also referred to as validation within the automated repair literature [Le Goues et al., 2012a; Long and Rinard, 2015].

The generation stage is responsible for generating a population of potential solutions, based on (implicit) knowledge gained about previously encountered solutions through evaluation of their fitness. Solutions deemed to be closer to a repair, where distance is measured by a weighted sum of the numbers of positive and negative test cases passed by the candidate, are more likely to be used to inform the generation of future candidates. This assumes that fitness is correlated with edit distance, and that solutions which pass a higher proportion of the test suite are more likely to be close to a repair than solutions that pass fewer tests.
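A sketch of such a weighted-sum fitness function follows. The particular weights are illustrative assumptions rather than values taken from GenProg’s implementation; negative tests are usually weighted more heavily, since passing them is what distinguishes a candidate repair from the original, faulty program.

    def weighted_fitness(passed_pos, passed_neg, w_pos=1.0, w_neg=10.0):
        """Weighted sum over test outcomes: passed_pos and passed_neg are the
        numbers of positive and negative tests the candidate passes."""
        return w_pos * passed_pos + w_neg * passed_neg

    # A candidate that keeps all 8 positive tests passing and fixes 1 of the
    # 4 negative tests outscores one that merely preserves existing behaviour.
    print(weighted_fitness(8, 1))  # 18.0
    print(weighted_fitness(8, 0))  # 8.0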

From a high-level perspective, GenProg shares the same approach to generation as all other genetic algorithms. Following a stage of initialisation, wherein a population of single-edit patches is randomly sampled, the process of generation is sub-divided into sub-processes of selection, crossover, and mutation. The resulting search algorithm is given in Algorithm 1, together with a description of each of its stages.

Algorithm 1: GenProg’s search algorithm

    population ← initialise();
    while solution not found and resources not exhausted do
        parents ← select(population);
        population ← crossover(parents);
        population ← mutation(population);
        evaluate(population);
    end


Individual Representation

To represent a candidate repair, each individual within the population is encoded as a sequence of statement-level modifications to the abstract syntax tree of the faulty program. To generate the corresponding repair, each of these modifications, or edits, is applied in sequence. Each edit within this sequence, or patch, applies a particular transformation to the statement associated with a given SID. The particular repair model used by GenProg allows for three different kinds of statement transformation, described below (a small sketch of this patch representation follows the list):

• Deletion: The edit DELETE x permanently removes the statement with SID x from the program, together with any child statements that may be attached.

• Append: The edit APPEND x y copies the statement with SID y from the original program, and appends it immediately after the statement with SID x.

• Replacement: The edit REPLACE x y copies the statement with SID y from the original program, then permanently replaces the statement at SID x with the copied statement.
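The sketch below illustrates the Patch representation as a sequence of edits applied to a statement-indexed program. It is a deliberate simplification: GenProg itself operates over CIL abstract syntax trees, whereas here the program is treated as a flat list of statements, and deletion does not model attached child statements.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Edit:
        kind: str        # "DELETE", "APPEND", or "REPLACE"
        target: int      # SID x: the statement being edited
        donor: int = -1  # SID y: the statement being copied (unused by DELETE)

    def apply_patch(statements: List[str], patch: List[Edit]) -> List[str]:
        """Apply each edit in sequence; SIDs always refer to the original program."""
        # Each original SID maps to the list of statements now standing in its place.
        slots = {sid: [stmt] for sid, stmt in enumerate(statements)}
        for e in patch:
            if e.kind == "DELETE":
                slots[e.target] = []
            elif e.kind == "APPEND":
                slots[e.target].append(statements[e.donor])
            elif e.kind == "REPLACE":
                slots[e.target] = [statements[e.donor]]
        return [s for sid in sorted(slots) for s in slots[sid]]

    # DELETE 5 followed by APPEND 3 7: remove statement 5, then copy statement 7
    # so that it appears immediately after statement 3.
    program = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
    print(apply_patch(program, [Edit("DELETE", 5), Edit("APPEND", 3, 7)]))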

Selection

The selection phase within GenProg selects which candidate repairs from the population of the previous generation should be chosen as parents, whose edits will be used to construct the population for the next generation.

By default, this process of selection is performed using tournament selection. Tournament selection chooses n individuals as parents through a series of n tournaments. For each tournament, k individuals are chosen from the population to be participants, at random with replacement. The best individual within each tournament is granted entry into the next population.

The setting of k (a.k.a. tournament size) can be used to control selection pressure. Low values of k exhibit little selection pressure, allowing more subpar solutions to be accepted into the population. Conversely, high values of k increase the selection pressure, driving selection towards only the fittest solutions. By default, GenProg uses a minimal k = 2 (k = 1 would be equivalent to uniform random selection), preferring exploration of the search space over exploitation of its fittest solutions.
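A generic sketch of tournament selection, consistent with the description above, is given below; the fitness callable is an assumption standing in for GenProg’s test-based fitness evaluation.

    import random

    def tournament_select(population, fitness, n, k=2):
        """Select n parents via n independent tournaments of size k.

        Participants are drawn uniformly at random with replacement, and the
        fittest participant of each tournament becomes a parent. k controls
        selection pressure: k = 1 degenerates to uniform random selection,
        while larger k drives selection towards the fittest individuals.
        """
        parents = []
        for _ in range(n):
            contenders = [random.choice(population) for _ in range(k)]
            parents.append(max(contenders, key=fitness))
        return parents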

Crossover

Crossover within GenProg groups the selected parents into pairs and subjects each pair to a crossover operation with probability P_c, also known as the crossover rate. In the event of an application of the crossover operator, a pair of individuals is generated from the edits of the two selected parents and added to the offspring.


Figure 2.7: An illustration of GenProg’s one-point crossover operator. Two parents A and B are accepted as input, and each is split into two parts at a random point. The first part of A is combined with the second part of B to form a child C. Similarly, the first part of B is combined with the second part of A to form a child D.

After iterating across all pairs of parents, a new proto-population is formed from the union of the multi-set of parents and the multi-set of offspring. Unlike most genetic algorithms [Eiben and Smith, 2015], where parents are replaced by their offspring in the event of a crossover, GenProg allows parents to be retained. This behaviour is similar to a (µ + λ) evolutionary strategy [Eiben and Smith, 2015], except that parents may also be subject to further mutation.

By default, GenProg uses a variant of variable-length one-point crossover, wherein the lists of edits for each parent are split into two at a random point and then joined together, as illustrated in Figure 2.7. Alternatively, a patch sub-set operator may be used, which has previously been shown to improve performance [Le Goues et al., 2012c].
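The following is a minimal sketch of that variable-length one-point crossover over two edit lists; the choice of an independent cut point for each parent is an assumption about details the text leaves open.

    import random

    def one_point_crossover(parent_a, parent_b):
        """Cut each parent's edit list at a random point and swap tails.

        The head of one parent is joined to the tail of the other, yielding
        the two children labelled C and D in Figure 2.7; the parents
        themselves are left unchanged, mirroring GenProg's retention of
        parents in the proto-population.
        """
        cut_a = random.randint(0, len(parent_a))
        cut_b = random.randint(0, len(parent_b))
        child_c = parent_a[:cut_a] + parent_b[cut_b:]
        child_d = parent_b[:cut_b] + parent_a[cut_a:]
        return child_c, child_d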

Mutation

Following crossover, each individual within the proto-population is subject to mutation with probability P_m, also known as the mutation rate, to form the population. The mutation operator within GenProg accepts a single patch as its input, and returns a new patch, formed by appending a newly sampled edit to the input patch.

To generate the new edit, GenProg first selects a statement as the subject of the edit, using the distribution defined by the fault localisation information. After a statement has been selected as the site of the edit, GenProg decides whether the edit should be a replacement, insertion, or deletion. By default, each edit type has an equal probability of being selected, although their probabilities may be adjusted to improve performance [Le Goues et al., 2012c]. Once an edit location and type have been chosen, GenProg produces the set of all possible edits of that particular type at the given location, and selects one of them at random as the chosen edit.
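A sketch of this mutation operator is given below. Edits are represented here simply as (kind, target, donor) tuples, and the donor pool and suspiciousness mapping are assumed inputs; GenProg itself enumerates concrete edits over the CIL AST rather than sampling donors directly as done here.

    import random

    def mutate(patch, suspiciousness, donor_sids,
               edit_types=("DELETE", "APPEND", "REPLACE")):
        """Append one newly sampled edit to an existing patch.

        suspiciousness maps each SID to its fault-localisation weight, used
        as a sampling distribution over edit locations; donor_sids lists the
        statements that may serve as donor code. Edit types are chosen with
        equal probability, matching GenProg's default configuration.
        """
        sids = list(suspiciousness)
        weights = [suspiciousness[s] for s in sids]
        target = random.choices(sids, weights=weights, k=1)[0]
        kind = random.choice(edit_types)
        if kind == "DELETE":
            new_edit = (kind, target, None)
        else:
            new_edit = (kind, target, random.choice(donor_sids))
        return patch + [new_edit]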


Test Case Sampling

In an effort to reduce the number of test case evaluations required to find an acceptable patch, Fast et al. [2010] introduce the concept of test suite sampling to automated repair. Rather than computing the fitness of a candidate patch by subjecting it to the whole test suite, each candidate is instead subject to its own random sample of the test suite. Each sample is composed of all the negative tests, together with a random selection of n% of the positive test cases, where n% is usually set to 10% for larger problems, such as those in the ManyBugs benchmark set [Le Goues et al., 2015]. If the candidate passes all tests in the sample, then, and only then, is it subjected to the rest of the test suite, to determine whether it is an acceptable repair.
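The sketch below captures this two-stage scheme; candidate_passes is an assumed callable that runs a single test against the candidate program.

    import random

    def sample_test_suite(positive_tests, negative_tests, fraction=0.10):
        """Every negative test plus a random fraction of the positive tests."""
        k = min(len(positive_tests), max(1, round(fraction * len(positive_tests))))
        return list(negative_tests) + random.sample(positive_tests, k)

    def is_acceptable(candidate_passes, positive_tests, negative_tests):
        """Run the full suite only for candidates that pass their sample."""
        sample = sample_test_suite(positive_tests, negative_tests)
        if not all(candidate_passes(t) for t in sample):
            return False
        remaining = [t for t in positive_tests if t not in sample]
        return all(candidate_passes(t) for t in remaining)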

Fast et al. [2010] find that despite the noise introduced into the fitness function, the use of sampling increased the performance of GenProg by 81%, when measured by the number of test case evaluations required to find a repair.

Repair Minimisation

Upon finding an acceptable repair that passes all tests within the test suite, GenProg subjects the repair to a minimisation process, wherein redundant edits which have no effect upon the outcome of the patch are removed [Le Goues et al., 2012a]. To facilitate the minimisation process, the patch is first converted to an AST difference using a variant of the DiffX XML differencing algorithm [Al-Ekram et al., 2005], tailored to operate with CIL ASTs. This process transforms the patch into a series of structured tree operations, describing the steps required to transform the original program into its patched form, e.g., “Delete the node rooted at k”. The minimised repair is then computed by applying the delta debugging algorithm [Zeller and Hildebrandt, 2002] to the sequence of edits, reducing them to their minimal form.
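The sketch below conveys the idea with a greedy one-minimisation loop rather than the full ddmin delta debugging algorithm: it repeatedly drops any single edit whose removal still leaves a patch that passes the whole test suite. The still_repairs callable, which applies a subset of edits and runs the tests, is an assumption.

    def minimise_patch(edits, still_repairs):
        """Greedily reduce an edit sequence to a 1-minimal repair."""
        edits = list(edits)
        changed = True
        while changed:
            changed = False
            for i in range(len(edits)):
                reduced = edits[:i] + edits[i + 1:]
                if reduced and still_repairs(reduced):
                    edits = reduced   # the dropped edit was redundant
                    changed = True
                    break
        return edits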

Revealingly, in almost all cases, patches yielded by the minimisation process consist of only a single edit, indicating that the algorithm is not finding and combining partial edits, as would be expected of a traditional genetic algorithm [Holland, 1992, 2000].

RSRepair

Qi et al. [2014] introduce RSRepair, an alternative search-based repair tool, to assess whether GenProg’s search algorithm could be outperformed by random search. The RSRepair algorithm, given in Algorithm 2, uses the same fault localisation and search operators as GenProg, allowing a direct comparison between their search algorithms.


Algorithm 2: RSRepair algorithm

    while solution not found and resources not exhausted do
        location ← sample(faultLocalisation);
        edit ← sampleEdit(location);
        candidate ← apply(program, edit);
        outcomes ← evaluate(candidate);
        if candidate passes all tests then
            return candidate;
    end

As a result of abandoning the need for fitness, RSRepair is able to treat automated repair as a decision problem, rather than an optimisation problem. This allows the search to terminate the evaluation of a candidate patch on the first observation of failure, substantially reducing the number of test case evaluations required for each candidate. Additionally, this allows RSRepair to make use of test case prioritisation [Rothermel et al., 2001], whereby tests that have been observed to be more likely to fail over the course of the search are executed first, maximising the rate of incorrect patch detection (and by extension, improving efficiency, as measured by the number of test case evaluations).

Another important difference between RSRepair and GenProg is that RSRepair restricts itself to the much smaller space of single-edit patches; in contrast, GenProg patches may contain an arbitrary number of edits. Since most patches reduce to only a single edit, this allows RSRepair to consider substantially fewer candidates than GenProg. Results show that on 100 runs of a sub-set of the ManyBugs dataset, RSRepair achieves a higher success rate (measured by the percentage of runs that found a repair) than GenProg on 24/25 problems, and a lower mean number of test case evaluations for 24/25 problems [Qi et al., 2014].
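A sketch of decision-problem evaluation with test case prioritisation follows; candidate_passes is an assumed callable that runs one test, and failure_counts is a running tally maintained by the search.

    def evaluate_candidate(candidate_passes, tests, failure_counts):
        """Run the most failure-prone tests first and stop at the first failure."""
        ordered = sorted(tests, key=lambda t: failure_counts.get(t, 0), reverse=True)
        for t in ordered:
            if not candidate_passes(t):
                failure_counts[t] = failure_counts.get(t, 0) + 1
                return False  # reject immediately; skip the remaining tests
        return True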

AE

Weimer et al. [2013] introduce the “Adaptive Search, Program Equivalence” (AE) repair technique. Similar to RSRepair, AE operates using the same fault localisation method as GenProg, and with a sub-set of its mutation operators. By default, AE restricts itself to the consideration of single-edit patches, transforming the problem into a decision problem. AE searches for repairs by iterating through each of the possible Delete and Append operations at each statement within the program, in order of suspiciousness. By treating automated repair as a decision problem, AE is able to use test case prioritisation to increase the efficiency of the search. To achieve further efficiency gains, AE performs test suite reduction for each candidate patch, removing test cases whose results are not affected by the patch, by checking the statement coverage of each test.

In addition to its adaptive search process, AE introduces the notion of approximate program equivalence checking, which uses a series of syntactic and dataflow analyses to find and prune equivalent repairs from the search space. To determine approximate program equivalence, the following three techniques are used:

1. Syntactic equality: By exploiting the observation that programs that are syntactically identical are also semantically equivalent, AE removes duplicates of identical statements from the pool of donor statements (a small sketch of this pruning appears after the list).

2. Dead code elimination: Using CIL’s [Necula et al., 2002] dataflow analysis framework, AE is able to detect whether a particular mutation would insert dead code into the program; this information is used to eliminate variants from the search space.

3. Instruction scheduling: In many cases, a statement may be inserted into the program at a number of different points, all of which will yield the same semantics. AE identifies instances where two (or more) insertions would produce equivalent results, and removes redundant insertions from the search space.

Results taken from a comparison between AE and GenProg [Weimer et al., 2013], using the 55 bugs from the ManyBugs benchmark suite that were previously solved by GenProg, showed that AE was able to fix 53 of the 55 bugs. In total, AE required 186 test suite evaluations across the search, compared to the 3252 required by GenProg, giving an order of magnitude improvement in test case efficiency.
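As a small illustration of the first of these checks, the sketch below prunes syntactically identical donor statements; the whitespace-collapsing normalisation is an assumption standing in for AE’s comparison over CIL ASTs.

    def deduplicate_donor_pool(donor_statements):
        """Keep one representative of each syntactically identical statement.

        Inserting either of two identical statements yields the same candidate
        program, so discarding duplicates shrinks the search space without
        discarding any distinct repair.
        """
        seen = set()
        pool = []
        for stmt in donor_statements:
            key = " ".join(stmt.split())  # crude normalisation of whitespace
            if key not in seen:
                seen.add(key)
                pool.append(stmt)
        return pool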

PAR

Kim et al. [2013] propose PAR, a search-based technique for repairing Java programs, based on applying hand-crafted repair operators to generate repairs. With the exception of its repair operators, PAR is identical to GenProg in all other respects. These repair operators, or fix templates, described below, were produced through manual inspection of more than 60,000 human-written patches in open-source projects for common fix patterns.

1. Null checker: inserts an if-statement checking that all objects within context at the target statement are not null.

2. Parameter replacement: replaces a randomly selected parameter from an arbitrary function call at the fault location with a compatible variable or expression.

3. Method replacement: replaces the target method of a randomly selected method call at the fault location with the name of a visible method with compatible parameters and return type.

4. Parameter addition or removal: adds or removes a single parameter from a method call at the fault location, provided the method has a compatible overloaded sibling. As with the parameter replacement transformation, suitable variables and expressions are extracted from within the scope of the fault statement to serve as additional parameters.

5. Object initialisation: introduces a new assignment statement, assigning a newly-created object to a new local variable, before using the object as a parameter of a method invocation.

6. Sequence exchange: swaps the order of a sequence of identified statements within the program, where each statement is exchanged with its most similar counterpart.

7. Range checking: inserts a series of if-statements, ensuring that all index variables of array accesses are within lower and upper bounds.

8. Collection size checking: inserts an if-statement checking that an index variable is smaller than the size of a collection object if faulty statements have collection object references.

9. Lower bound setter: assigns a lower bound value to an index variable if the faulty statements have array accesses.

10. Upper bound setter: assigns an upper bound value to an index variable if the faulty statements contain array accesses.

11. Off-by-one mutator: modifies an index variable by 1 if the faulty statements have array accesses.

12. Class cast checker: inserts an if-statement, ensuring that castees implement appropriate types when casting operations are performed.

13. Caster mutator: replaces the casting type of a casting operator within the faulty statement with another type.

14. Castee mutator: replaces the subject of a casting operation within the faulty statement with a different variable, sharing a similar name.

15. Expression changer: replaces a conditional expression in the faulty statement with a similar conditional expression, taken from the same source code.

16. Expression adder: inserts a similar expression into an existing expression in a faulty statement, provided it contains a conditional expression.

Across 119 real-world bugs, Kim et al. [2013] found that PAR was able to produce repairs for 27 of them, compared to 16 bug fixes achieved by GenProg. Furthermore, the results of a human study involving 89 students and 164 developers found that patches generated by PAR were more likely to be accepted than those produced by GenProg.


Defect class according to...   Examples of defect class
The root cause                 Incorrect variable initialisation, incorrect configuration, etc.
The symptom                    Segmentation fault, null pointer exceptions, memory exhaustion, etc.
The fix                        Adding an input check, changing a method call, restoring an invariant, etc.

Table 2.1: Examples of different types of defect classes and the shared properties by which they are defined. Adapted from [Monperrus, 2014].

A year after Kim et al. [2013] introduced PAR, Monperrus [2014] delivered a critical review of the work, challenging both its methodology and results.

After introducing the concept of defect classes—a family of bugs sharing some common property, outlined in Table 2.1—Monperrus [2014] argues that to fairly evaluate repair techniques, one must compose the dataset according to the defect classes one believes the tool addresses. Similarly, to achieve a fair comparison between techniques, one must take care to account for the different defect classes addressed by each. Failing to do so, one may compose an unbalanced dataset, overrepresenting the defect classes handled by one technique, whilst underrepresenting those of another.

Monperrus [2014] highlights that Kim et al. [2013] fail to identify a priori, or to discuss a posteriori, the defect classes tackled by PAR—a problem common to a number of repair techniques. Whereas GenProg aims to provide a highly generic approach to program repair, theoretically capable of addressing all defect classes, PAR focuses on addressing common programmer mistakes fitting its predefined repair templates. Monperrus [2014] argues that Kim et al. [2013] evaluate on a dataset that may unintentionally favour PAR, and that, given the two techniques address different defect classes, such a comparison is meaningless. Upon closer inspection, Monperrus [2014] found that the majority of PAR’s repairs were produced using either the “Null Pointer Checker” or the “Expression Adder, Remover, Replacer”—the remaining operators fixed only a single bug each.

In addition to identifying problems in the (lack of) methodology used to construct the dataset, Monperrus [2014] recognises a number of flawed assumptions underlying the human study used to evaluate patch acceptability. Although the participants of the study included 67 developers, none of them were responsible for maintaining the programs used in the study. Therefore, the experiment assumes that the developer is able to judge the quality of a patch without knowledge of the codebase or the wider context of the bug. Monperrus [2014] argues that without prior experience with the codebase, participants are unlikely to accurately determine the quality and correctness of a patch within a reasonable amount of time (e.g., 30 minutes), armed only with the patch and a link to the bug report. Under these circumstances, then,


Monperrus [2014] believes that the participants conflate the notion of acceptability with understandability; that is, participants select patches on the basis of whether or not they “look good”. To explain why this conflation is a problem, Monperrus [2014] discusses the different requirements of autonomous program repair techniques and patch recommendation systems:

• Automated repair systems are intended to operate mostly at run-time, without human assistance, and their patches are only required to serve as a temporary solution.

• Conversely, recommendation systems are required to generate a set of (partial) transformations for a human developer during development or maintenance, who is given the responsibility of selecting an acceptable change from amongst them; the produced changes are expected to be long-lasting, rather than ephemeral.

As a consequence of the differing assumptions and requirements of these two approaches, each should be assessed with its own evaluation criteria, in terms of understandability, correctness, and completeness:²

• For fully automated repair techniques, such as GenProg and SPR, the only considerations should be that correct and complete solutions are generated. Given the temporary nature of their fixes, alien-looking solutions—lacking understandability—should be treated with equal consideration.

• In contrast, patch suggestion systems should concern themselves less with the completeness and correctness of a patch, and more with producing an understandable, possibly partial solution that is likely to persist.

Given that PAR is presented as an automated repair technique, and not as a fix suggestion tool, the methodology used to evaluate acceptability is compromised by the bias towards understandable repairs and against alien repairs.

Staged Program Repair

As a more efficient and higher-quality means of generating repairs for C programs, Long and Rinard [2015] introduce Staged Program Repair (SPR): a search-based repair technique which combines parameterised transformation schemas and abstract conditions to reduce the size of the search space. In lieu of GenProg’s generic statement-level repair operators, SPR uses a set of parameterised transformation schemas—each designed to tackle a particular class of defects—to generate its repairs. Given a particular transformation schema, SPR performs target value search to determine whether there exists a parameter which would cause the schema to yield an acceptable repair.

² In this context, “completeness” refers to the ability of the repair tool to generate an executable variant of the program (i.e., all of the information required to repair the program should be encoded within the patch).

This set of schemas is outlined below.

1. Condition Refinement: Transforms the condition of a given if-statement by conjoining or disjoining an abstract condition to its original condition.

2. Condition Introduction: Transforms a given statement such that the statement is only executed when an abstract condition is true.

3. Conditional Control Flow Introduction: Inserts a new control-flow statement (e.g., return, break, goto) that executes only when an abstract condition is true.

4. Insert Initialisation: Inserts a memory-initialisation statement before a selected statement.

5. Value Replacement: Given a target statement, this transformation performs one of the following replacements: one of its variables with another, an invoked function with another, or a constant with another constant.

6. Copy and Replace: Prepends an existing statement from elsewhere in the program to the program point immediately before a selected statement, before applying a value replacement transformation to the prepended statement.

After ranking the suspiciousness of statements according to spectrum-based fault localisation, SPR attempts each of these transformation schemas—in the order given above—at each statement, in order of their suspiciousness. For the first three of these transformation schemas, SPR uses angelic debugging [Chandra et al., 2011] to determine if there exists an abstract condition which would result in an acceptable repair. If so, SPR synthesises a suitable condition using the angelic values for that if condition with its condition synthesis algorithm. In cases where no abstract condition is found, SPR avoids the need to attempt concrete transformations, thus increasing the efficiency of the search. Similarly, SPR uses abstract expressions within print statements, allowing a suitable concrete expression to be synthesised from the observed, correct print behaviour.

Unlike GenProg, SPR is unable to compose multiple-edit fixes, opting to focus on single-edit patch generation instead. Long and Rinard [2015] state: “It is unclear how to combine multiple transformations and still eciently explore the enlargened space.” Intentionally, SPR lacks a statement deletion operator, due to the previous findings by Qi et al. [2015], showing that the majority of plausible, but incorrect, patches generated by GenProg were the result of deletion. Although this decision prevents explicit functionality deletion—and thus repairs which require such deletion—it does not prevent SPR from yielding implicit deletions, by replacing conditions with contradictions and tautologies [Mechtaev et al., 2016]. Mechtaev et al. [2016] found that 80% of the repairs generated by SPR for the libtiff bug scenarios involved implicit functionality deletion (whereas only 21% of repairs generated by Angelix were functionality-deleting).


To assess the efficacy of SPR, Long and Rinard [2015] evaluated it against 69 bugs from the ManyBugs suite.³ Given the tendency for automatically generated repairs to overfit the test suite, the authors themselves assessed the correctness of the patches produced by SPR. Whilst any form of patch quality assessment is better than none at all, using humans to evaluate correctness is fraught with potential biases, and the possibility of false positives and negatives—all the more so when evaluating one’s own patches against another system. A more robust solution—albeit far from perfect—is to use a held-out set of manually or automatically generated whitebox tests to independently verify the repairs.

Across the 69 bugs, SPR’s search space contained repairs for 19 of them, and for 11 of the 19, the first plausible repair that was presented by SPR was correct. By comparison, GenProg generated correct fixes for 2 of the 69 bugs [Qi et al., 2015].

In later work, Long and Rinard [2016] proposed a variant of SPR, known as Prophet, wherein each candidate patch within the search space is ranked according to a learned model of likely bug fixes. This model, learnt by applying machine learning over a corpus of several hundred historical bug fixes, accepts as its input a set of syntactic features for the original and modified versions of a line, each described as a binary variable, and produces a scalar value describing the likelihood that the given fix is correct. To attain scores for each candidate fix, and thus rank them, the resulting score from the model for a given fix is combined with its fault localisation suspiciousness score, by computing their product. Long and Rinard [2016] found that the use of Prophet reduced the average time taken to find a repair, and allowed two more bugs to be fixed within the 12-hour repair window.
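A sketch of that combined ranking follows; model_score and suspiciousness are assumed callables standing in, respectively, for Prophet’s learned model over binary syntactic features and for the spectrum-based suspiciousness of the patched location.

    def rank_candidates(candidates, model_score, suspiciousness):
        """Order candidate patches by model score times fault-localisation score."""
        scored = [(model_score(p) * suspiciousness(p), p) for p in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [p for _, p in scored]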

History-Driven Program Repair

To improve patch quality and search efficiency, Le et al. [2016] introduce a novel “history-driven” repair technique for Java programs: HDRepair. Like GenProg and PAR, this technique employs genetic algorithms to sample and combine edits into candidate patches. Whereas GenProg and PAR treat all edits equally, however, history-driven repair exploits knowledge mined from version control repositories to bias the search towards repairs that more closely resemble human-written repairs. To achieve this, 3000 historical bug fixes are first mined from GitHub, from across 700 Java projects, before being transformed into a graph-based representation. To produce this graph representation, an AST difference for each modified source file is first generated using GumTree [Falleri et al., 2014], a state-of-the-art, open-source source code differencing tool. Since the GumTree edit script still contains information specific to particular variable names, rather than capturing an abstract type of change (e.g., a change of method name), this script is transformed into a labelled, directed graph, capable of abstractly representing the change and capturing its surrounding context. Each edge within this graph {P, C} represents a modification to

³ “Bug scenarios” that were in fact feature additions were removed from consideration.


Figure 2.8: An example change graph produced during the offline bug-fix mining stage of history-driven program repair [Le et al., 2016]. Here, the change graph shows the name of a parameter being updated within the context of a method call.

a child node C within the context of its parent P. An example of such a change graph is given in Figure 2.8. Once each bug fix has been transformed into its corresponding abstract change graph, the set of graphs is subject to a process of graph mining using gSpan [Yan and Han, 2002], wherein the largest, most frequent bug fix patterns are mined. For each mined pattern, all of its vertices, edges, and the supergraphs that contain that pattern are recorded to disk, forming a historical bug fix database. Using this information, the number of supergraphs of a given pattern can be used to find its frequency.

During the online phase of the search, candidate repairs are generated stochastically, using a similar process of generating repairs to GenProg and PAR, wherein a single edit is added onto an existing patch at each mutation step. An outline of the search algorithm employed by History-Driven Program Repair is given in Algorithm 4.


Algorithm 4: The select function randomly selects a number of individuals from a population, either uniformly or weighted by a given function. The tunable parameters are: PopSize (size of population), M (number of desired solutions), E (size of the initial population), L (number of locations considered during mutation). Sourced from [Le et al., 2016].

    Input:  BugProg: Buggy program
    Input:  FaultLocs: Fault locations
    Input:  NegTests: Initially failing test cases
    Input:  ops: Possible operators
    Input:  params: Tunable parameters PopSize, M, E, L
    Output: A ranked list of possible solutions

    editFreq(cand) begin
        N ← |cand|;
        return (Σ_{i=0}^{N−1} FixPar(cand_i)) / N
    end

    mutate(cand) begin
        locs ← select(FaultLocs, L);
        pool ← ∅;
        foreach loc ∈ locs do
            op_loc ← ∪_{op ∈ applies(ops, loc)} instant(op, loc);
            cand′ ← cand + select(op_loc, 1);
            pool ← pool ∪ {cand′};
        end
        return select(pool, 1, editFreq)
    end

    search() begin
        Solutions ← ∅;
        Pop ← {E empty patches};
        while |Pop| < PopSize do
            Pop ← Pop ∪ mutate([]);
        end
        repeat
            foreach c ∈ Pop do
                if c ∉ Solutions then
                    if c passes NegTests then
                        Solutions ← Solutions ∪ {c};
                    else
                        c ← mutate(c);
                end
            end
        until |Solutions| = M;
        return Solutions
    end


As with GenProg and PAR, a population composed of single-edit solutions is generated during the initialisation stage, before being supplemented with a given number of empty patches. Unlike GenProg and PAR, there is no selection between individuals; rather, each individual within the population is subject to mutation and carried as-is into the next generation, without being subject to crossover. As a result, the algorithm ignores the outcome of the test suite for a given solution—except in cases where a repair is found—allowing it to terminate upon the first instance of failure, as with RSRepair and AE. Instead, a form of selection is introduced into the process of mutation, wherein several edits are generated and only a single edit is selected and added to the patch, based upon its likeness to historical bug fixes within the mined database.

This selection between edits uses stochastic universal sampling to pseudo-randomly sample an edit based upon its corresponding frequency within the historical bug fix database. Specifically, each candidate edit is assigned a score, according to Equation 2.4, based upon the mean frequency (within the database) of each of the edits in the mutant patch that contains the edit under consideration.

$$\mathit{score}(\mathit{patch}) = \frac{\sum_{i=0}^{N-1} \mathit{FixPar}(\mathit{patch}_i)}{N} \qquad (2.4)$$

The rationale given by Le et al. [2016] for using the mean frequency of the mutant patch—rather than the frequency of the singular edit—is that it allows the score for low-frequency edits to be dampened. This increases the likelihood of their selection (although the distribution given by considering singular edit frequencies is isotone).
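A minimal sketch of Equation 2.4 is given below; fix_frequency is an assumed lookup standing in for FixPar, the frequency of an edit’s change pattern within the mined bug-fix database.

    def patch_score(patch, fix_frequency):
        """Equation 2.4: the mean historical frequency of the edits in a patch."""
        if not patch:
            return 0.0
        return sum(fix_frequency(edit) for edit in patch) / len(patch)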

Rather than terminating upon the discovery of a repair, like PAR and GenProg, HDRepair continues to search until a specified number of solutions have been found, or a time limit has been reached. The user is then presented with each of the discovered fixes, ranked and ordered according to their mean frequencies. Unlike all other repair techniques, HDRepair does not validate its candidate repairs against the previously passing test cases. Consequently, repairs reported by HDRepair may introduce new bugs into the program. In addition to employing a novel search procedure, this repair technique also uses a combination of existing repair models and mutation testing operators to construct repairs, detailed in Table 2.2.

Results taken from an evaluation of the technique on 90 bugs taken from 5 programs within the Defects4J dataset [Just et al., 2014] demonstrate that HDRepair produces correct repairs for 23 bugs, compared to 4 and 1 fixed by PAR and GenProg, respectively. It is unclear whether off-the-shelf versions of PAR and GenProg were used; if so, then it is important to note that the definition of a repair is different between these tools—PAR and GenProg require that a patch passes all of the test suite, whereas HDRepair requires only that it pass the previously failing tests. Whilst these results may suggest a higher effectiveness (i.e., HDRepair fixes more bugs), they also increase the wall-clock time taken to find a repair.


GenProg Mutation Operators
    Insert Statement                    Insert a statement before or after a given statement
    Replace Statement                   Replaces a statement with another from the program
    Delete Statement                    Removes a statement from the program

Mutation Testing Operators
    Insert Type Cast                    Cast an object to a compatible type
    Delete Type Cast                    Remove a type cast from an object
    Change Type Cast                    Replace a type cast with a compatible cast
    Change Infix Expression             Changes the primitive operator within an infix expression
    Boolean Negation                    Negates a boolean expression

PAR Mutation Operators
    Replace Call Parameter              Replaces a method call parameter with a compatible one
    Replace Method Call Name/Invoker    Replaces the name of a method call, or a method-invoking expression, with a compatible name or expression
    Remove Condition                    Removes a boolean condition in an if-condition
    Add Condition                       Adds a boolean condition to an existing if-condition

Table 2.2: A list of the repair actions within the repair model for History Driven Program Repair, separated by their sources [Le et al., 2016].

Although those results are encouraging, the methodology used to perform the experiments makes it difficult to gauge whether the effectiveness of the technique is bolstered by its use of historical bug fix information, or whether the results are largely due to a larger repair model. Ideally, one would like to see the technique compared to a near-identical variant, in which the selection procedure is replaced by a random selection. Furthermore, the results for each technique were gathered over a single run, failing to account for their stochastic nature; consequently, the comparison may be somewhat unrepresentative of the average performance of the techniques involved.

No comparison is given between the effectiveness of using the historical bug fix database and that of the graphical model employed by Prophet. Although each of these techniques operates on a different programming language, their fix weighting components are both usable within C. Ideally, then, one would like to see whether the historical bug fix database performs any better than Prophet’s graphical model, given the significantly higher running costs associated with its memory and CPU consumption.


2.3. Semantics-Based Repair

Instead of generating and evaluating concrete patches, semantics-based approaches use symbolic execution to determine the intended semantics of the program, before using program synthesis to construct a patch based on that extracted semantic information. By avoiding the need to compile and execute the test suite for each candidate patch, semantics-based approaches are often significantly more efficient than their search-based counterparts. This speed typically comes at the cost of restricting the set of bugs and programs to which repair might be applied, however. In this section, we provide the reader with a brief review of several of the most popular semantics-based approaches to program repair.

MintHint

As an alternative to fully automated program repair, Kaleeswaran et al. [2014] propose a semi-automated technique, MintHint, which provides the developer with patch suggestions. By delegating the responsibility of patch validation to the developer, MintHint is able to avoid concerns of patch quality and overfitting—a major challenge for test-based repair techniques (i.e., repair techniques that rely on test suites as their oracles).

To identify candidate fix locations, MintHint uses Zoltar [Janssen et al., 2009] to compute a ranked list of suspicious statements within the program, using the Ochiai spectrum-based fault localisation metric. Like the majority of existing repair approaches, MintHint assumes that only a single statement within the program contains a fault.

For each of the top k statements within this list, MintHint generates a state transformer f: a function that accepts each of the program states that reach the candidate statement F and computes the correct output for each of them. To ascertain the correct output—with respect to the provided test suite—dynamic symbolic execution is performed using KLEE [Cadar et al., 2008]. To simplify hint generation, all statements in the program are converted to the form x := e through semantics-preserving program transformations. After applying this transformation, MintHint computes a state transformer f for each suspicious statement F as follows:

• To capture the existing, correct semantics of F, MintHint records the observed input/output states for each of the positive tests.

• To determine the intended semantics of the program over the negative test cases, MintHint first transforms the LHS variable x to a symbolic one. Using symbolic execution and constraint solving, MintHint finds concrete values of x that cause the program to yield the correct output for each of the negative tests. Finally, the program is re-run using computed values of x, in place of


Nature of hint   Targeted fault
Insert           Missing expression
Replace          Incorrect operator, constant, variable, etc.
Remove           Spurious expressions
Retain           Filtering out non-faulty statements
Compound         One or more occurrences of the above

Table 2.3: MintHint’s repair model, or hints, and the kinds of faults that each hint is designed to address. Taken from [Kaleeswaran et al., 2014].

its RHS e, allowing its input/output states to be extracted, thus giving the mapping f for the failing tests. In some cases, the symbolic execution process may fail to determine values for x, due to either time-out or an unsatisfiable constraint. If, for all of the negative tests, symbolic execution fails to generate values of x for some F within a predefined time limit, F is removed from consideration for the remainder of the repair process. Alternatively, if x is deemed to be unsatisfiable for all of the failing tests, then MintHint generates a “retain statement” hint for that statement, indicating that the statement is irrelevant to the fault and should be left unaltered.

Using the computed state transformer for a statement as its operational specification, MintHint exhaustively searches its repair space for expressions to serve as the arguments for its hints at that statement. The possible forms of these hints are listed in Table 2.3. This repair space is generated by iteratively constructing all possible expressions up to a pre-defined maximum length, using the in-scope variables at that statement, according to a provided grammar describing the syntax of the language. Additionally, the set of sub-expressions within the RHS e is incorporated into the search space, together with the RHS itself. By incorporating e and its constituent sub-expressions, MintHint is able to determine whether those elements are likely to be faulty, independent of the results of the fault localisation.

Each expression within the repair space of F is ranked according to its likelihood of occurring in the RHS of the repaired form of F. To determine this likelihood, MintHint measures the correlation between the outputs of a candidate expression and the correct values of the LHS, provided by f. Expressions that appear to be highly correlated with the values of f are assumed to indicate possible repairs, whereas lowly correlated expressions are assumed to be indicative of faulty expressions. By incorporating e and its constituent sub-expressions into the repair space, MintHint is able to operate effectively in the presence of poor fault localisation information. Statements that are correct should exhibit a high degree of correlation between e and the expected values of x. In such cases, MintHint will generate a “Retain” hint for that statement.

After ranking the expressions within the search space of F by their likelihood of appearing in the fixed RHS, MintHint generates repair hints using these expressions, as follows:

• MintHint exhaustively searches the space of expressions at a given statement, and filters out expressions with likelihood values below a certain threshold.

• For each candidate expression, MintHint uses pattern matching to locate matching expressions within the suspected statement. Based on the edit distance d between the candidate and matched expressions, MintHint suggests a different hint:

– If d ≤ k, where k is a tunable threshold (set to 2, by default), a hint is generated wherein the matched expression is replaced by the candidate expression.

– If d > k, an insertion hint is generated.

– If d = 0, the expression already exists within the statement.

MintHint also allows compound hints to be synthesised. Compound hints are generated by combining non-overlapping atomic hints.

To evaluate MintHint, Kaleeswaran et al. [2014] conducted a user study, in which ten participants (all of whom were either professional developers, or graduate students with prior industrial experience) were asked to debug and fix one of ten hand-seeded bugs, containing a single fault, taken from small C programs in the Software Infrastructure Repository. The time taken by users without the tool was compared to that taken by users equipped with the tool, to assess its efficacy. In an effort to reduce the difficulty of the task, the authors used Zoltar to find the five most suspicious lines within the program. The user was informed that the bug was contained on one of these lines; the lines presented were reviewed to ensure that this was the case. Kaleeswaran et al. [2014] found that the time taken by the user to find a repair was reduced from an average of 91 minutes (with 4 participants unable to complete the task) to 29 minutes when the tool was used (with all participants completing the task).

Although these results look encouraging, we have identified a number of potential issues in the design and evaluation of the study. No raw timing data, specifying the time taken by each user to complete the given task, is provided. Instead, this data is aggregated by computing the mean. By computing the mean, rather than the more statistically-robust median, the evaluation is prone to influence by outliers. The paper also fails to conduct statistical hypothesis testing, to demonstrate a significant difference between the control and the treatment, and to measure the effect size. The lack of statistics makes it difficult to assess the true effectiveness of the tool.

A deeper concern lies in the design of the user study itself; the ten participants in the control group were the same as those given the treatment. The order in which the subject is given the control and the treatment may produce an unintended bias

in the results. For the user study, participants were asked to fix a bug without using the tool first, before then being asked to fix a bug using the tool. The paper makes no mention of any steps taken to avoid introducing this bias.

From a technical perspective, questions remain over whether the techniques proposed by MintHint can be scaled to larger programs. For a number of bugs within small programs taken from the Software Infrastructure Repository, containing fewer than 20 KLOC, the symbolic execution phase timed out. The costs of applying the technique to programs containing hundreds of thousands of lines of code are likely to be several orders of magnitude higher.

In summary, MintHint transforms the exceptionally difficult problem of generating high-quality repairs into the significantly easier problem of suggesting bug fixes. In practice, there is no difference between the approach of hint generation tools and repair tools—the only difference lies in the way such tools are treated. There is nothing to prevent us from using GenProg, SPR, or any other repair technique as a hint generation tool.

SemFix, DirectFix and Angelix

SemFix [Nguyen et al., 2013], DirectFix [Mechtaev et al., 2015] and Angelix [Mechtaev et al., 2016] belong to the same family of semantics-based repair techniques. All of these techniques leverage symbolic execution and component-based program synthesis to craft provably minimal repairs to faulty program expressions. Below, we summarise the abilities and limitations of these techniques:

• SemFix, the first of these techniques to be proposed, allows the repair process to scale with the size of the program, through the use of controlled symbolic execution, but is restricted to generating single-edit patches.

• DirectFix, the successor to SemFix, allows multiple-edit patches to be generated, but does so by restricting the technique to small programs. DirectFix derives its ability to craft multiple-edit patches—and its associated limitations—from describing the semantics of the entire program in a single logical formula.

• Angelix, the newest of the three techniques, combines the strengths of SemFix and DirectFix by using lightweight angelic forests to describe the semantics of the areas of the program under repair. By using angelic forests to encode the necessary semantic information, Angelix is able to produce multiple-edit patches without restriction on the size of the program—the cost of the technique scales with the size of the patch, and not the program.

For the most part, the high-level approach employed by each of these techniques is split into the same four stages:

1. Preprocessing: A series of semantics-preserving code transformations are applied to the program under repair to allow it to address a greater number of bugs. These transformations wrap each unguarded statement in its own


if statement, allowing its execution to be controlled (e.g., x=y+1; becomes if (1) x=y+1;).

2. Fault Localisation: Like most search-based approaches, spectrum-based fault localisation is used to narrow down the possible locations of the faults, and to prioritise the efforts of the repair process. Fault localisation is performed at the level of expressions, rather than statements, since expressions are the target of the repair model shared by these techniques.

3. Symbolic Execution: Once a set of suspicious expressions has been determined, one (in the case of SemFix) or more (in the case of Angelix) expressions are selected as candidates for repair. Each of the selected expressions is then replaced with symbolic variables. Symbolic execution is then used to determine whether there exists a program path through which a given test is passed; SemFix and Angelix constrain this search to consider different outcomes associated with the symbolic variables, rather than the inputs to the program. This process is carried out for each test within the suite, until either a test is failed, or all tests are passed. In the event that a set of paths is discovered that results in success for each of the tests, a constraint solver is used to infer concrete values (referred to as angelic values) for each of the symbolic variables. Additionally, the state of the program at each visit to these locations is recorded as the angelic state. For Angelix, the sets of angelic values and angelic states are combined together to form the angelic forest.

4. Repair Synthesis: Using the semantic information extracted during symbolic execution, a provably minimal repair to the implicated expressions is constructed using a variation of component-based program synthesis [Jha et al., 2010]. Each component is represented by a variable, constant, or a term over components and supported operations (e.g., +, −, ∗), defined in the given background theory. Through the use of soft and hard Partial MaxSMT clauses, the solver finds a patch that satisfies the synthesis specification, provided by the semantic information, and that is syntactically closest to the original program. Underlying this decision is an assumption that smaller patches, closer to the original, buggy program, are more likely to be correct, and easier to maintain, than larger patches.

SemFix and DirectFix were evaluated on artificial bugs in small programs taken from the Siemens and SIR datasets, and a small number of real-world bugs in small programs from the Coreutils [Böhme and Roychoudhury, 2014] dataset [Mechtaev et al., 2015; Nguyen et al., 2013]. SemFix was found to repair more bugs than GenProg (48 vs. 16), and to do so in an average of 3.8 minutes, compared to the average of 6 minutes required by GenProg [Nguyen et al., 2013]. DirectFix was found to repair a greater number of bugs than SemFix; 53% of its repairs were found to be equivalent to those written by the programmer, compared to 17% for SemFix. In the evaluation of DirectFix, for all scenarios longer than 135 lines of code, the fault localisation information was manually—and substantially—improved by restricting the consideration of the tool to the single function known to contain the faults.


Angelix was evaluated against GenProg, AE, and SPR on a modified sub-set of the ManyBugs dataset, containing manual improvements to some of its weak test harnesses (i.e., those that only checked the exit status of the program). Due to limits of the symbolic execution engine used by Angelix, the fbc, Python and lighttpd subjects were omitted from the evaluation. Angelix was shown to fix 28 out of 82 bugs, compared to 31, 11 and 19 achieved by SPR, GenProg and AE, respectively. Angelix was able to fix six bugs that were unsolved by other techniques, whereas SPR exclusively fixed four bugs; none of the bugs fixed by AE or GenProg were exclusive. On average, Angelix took 32 minutes to find a patch.

Although Angelix shows promise as a viable approach for constructing multiple-edit patches, it also suffers from a number of limitations, also shared by SemFix and DirectFix. Some of these limitations are inherited from the limits of current symbolic evaluation engines. Angelix is unable to generate patches involving floating-point numbers, non-primitive types, or side effects, nor is it able to introduce new statements or variables into the program.

NOPOL

Xuan et al. [2017] propose NOPOL, a semantics-based repair technique designed to address a narrow class of defects in Java programs: faulty if-conditions, and missing if-conditions around individual statements.

To fix these bugs, NOPOL uses a form of angelic debugging, similar to that performed by SemFix and SPR, to determine whether there exists a set of branch outcomes for a particular if-statement that would allow the program to pass its test suite. To allow missing if-conditions to be repaired, NOPOL performs the same semantics-preserving program transformation as Angelix, wherein (unguarded) statements are wrapped in a trivial if-condition (e.g., x = 0; becomes if (true) x = 0;). Once a set of branch outcomes has been determined, NOPOL uses its knowledge of the program state at each branch evaluation to synthesise a branch condition over the in-scope variables that yields those outcomes.

Due to the nature of its implementation, however, the category of defects that NOPOL can actually fix is smaller than intended. NOPOL is unable to record information for branches that are visited more than once by the program; SPR and SemFix make no such assumption. Moreover, NOPOL assumes that the program contains only a single faulty condition, preventing it from scaling to multiple-line faults.
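The condition-synthesis step can be illustrated with a brute-force sketch: given the recorded variable snapshots at each branch evaluation and the desired (angelic) outcomes, enumerate simple candidate predicates over the in-scope variables and return one that agrees with every outcome. NOPOL itself uses SMT-based synthesis rather than enumeration; the variable names and the candidate grammar below are hypothetical.

    from itertools import product

    def synthesise_condition(observations):
        # observations: list of (snapshot, desired_outcome) pairs, where each
        # snapshot maps in-scope variable names to their values at one branch
        # evaluation. Returns a candidate condition consistent with every
        # desired outcome, or None.
        variables = sorted(observations[0][0])
        constants = [0, 1]
        candidates = []
        for v, k in product(variables, constants):
            candidates.append((f"{v} > {k}", lambda s, v=v, k=k: s[v] > k))
            candidates.append((f"{v} == {k}", lambda s, v=v, k=k: s[v] == k))
        for v1, v2 in product(variables, variables):
            if v1 != v2:
                candidates.append((f"{v1} < {v2}", lambda s, v1=v1, v2=v2: s[v1] < s[v2]))
        for text, pred in candidates:
            if all(pred(snap) == want for snap, want in observations):
                return text
        return None

    # Each entry records the state at one branch evaluation and the outcome
    # that would make the test suite pass.
    obs = [({"i": 0, "n": 3}, True), ({"i": 3, "n": 3}, False)]
    print(synthesise_condition(obs))   # a consistent condition, here "i == 0"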

SearchRepair

Ke et al. [2015] propose an automated repair technique driven by semantic code search [Stolee et al., 2014]—a method for identifying code by its behaviour, rather than its syntactic features. To compose repairs, SearchRepair searches through a

large, pre-computed database of code snippets—taken from existing programs—for a single snippet which satisfies a partial specification for the program. From a high-level perspective, the approach taken by SearchRepair to construct repairs is as follows:

1. A database of code snippets is constructed from a set of donor programs. To capture its semantics, each snippet is encoded as a set of satisfiability modulo theory (SMT) constraints over its input-output behaviour.

2. Tarantula, a technique for spectrum-based fault localisation used by a majority of repair approaches, is used to rank the suspiciousness of program fragments, rather than lines or statements.

3. For each suspicious fragment, a lightweight profile of its desired input-output behaviour—determined using symbolic execution [Cadar et al., 2008] over its test suite—is generated, in the form of a set of SMT constraints.

4. Using the Z3 SMT constraint solver [de Moura and Bjørner, 2008], the database of donor code is searched for snippets which satisfy the desired behaviour of the faulty program fragment. Matching snippets are thereafter contextualised, through a process of variable substitution, and transformed into a patch, wherein the original fragment is replaced by the discovered snippet. (A minimal sketch of this step is given at the end of this section.)

5. Finally, as with search-based methods, candidate patches are evaluated against the test suite to determine whether they represent a repair.

SearchRepair sits between search-based techniques, which randomly generate mutants and assess their correctness by evaluation against a test suite, and specification-based techniques, which use program synthesis to generate a repair that satisfies a partial specification.4

Like GenProg and its variants, SearchRepair relies on the plastic surgery hypothesis to find redundant code within programs from which to craft its repairs. Whereas GenProg operates on the level of statements, SearchRepair operates on more coarsely-grained entities, known as fragments. Fragments include the basic blocks within if- and while-statements, the if- and while-statements themselves, and sequences of one to five statements. By operating at a higher level of granularity, Ke et al. [2015] believe that SearchRepair is less susceptible to overfitting whilst remaining sufficiently expressive. Ke et al. [2015] state:

    Our core assumption is that a larger block of human-written code, such as a method body, that fits a given partial specification is more likely to satisfy the unwritten specification than a randomly chosen set of edits generated with respect to the same partial specification.

4 Note that patches generated by specification-based techniques are only guaranteed to be correct with respect to a partial specification, and not necessarily to the complete, intended specification of the program. Therefore, such methods are equally as susceptible to overfitting as search-based techniques.

Although results from the paper show that 97.3% of patches generated by SearchRepair over the IntroClass benchmarks are correct—with respect to an independent test suite—compared to 68.7% for GenProg, Ke et al. [2015] do not explicitly validate this hypothesis, and these results alone do not confirm it. Whilst almost all patches generated by SearchRepair were deemed correct, SearchRepair also fixed the fewest bugs, with respect to the original test suite.

Additionally, the benchmarks used consist of bugs in small programs, each containing fewer than 100 lines of code, and so questions remain over the generality of the technique and its effectiveness when applied to bugs in large-scale programs, such as those in the ManyBugs dataset. Moreover, due to the limitations of KLEE—the symbolic execution engine used to compute profiles—SearchRepair is unable to insert code containing loops, recursion, non-primitive types, side-effects, or I/O operations (e.g., file manipulation, database queries).

Despite suffering from a number of limitations, SearchRepair presents a promising approach to automated program repair by combining the strengths of search-based methods with semantics-based techniques.
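A minimal sketch of the semantic search step (step 4 of the list above) is given below, using the z3-solver Python bindings. Each snippet in a toy database is encoded as an SMT constraint over its input and output; a snippet matches the desired profile if, for every profiled input, it cannot produce anything other than the expected output. The snippets and the profile are hypothetical and far simpler than SearchRepair's encodings.

    from z3 import If, Int, Solver, unsat

    x, y = Int('x'), Int('y')   # snippet input / output

    # A tiny, hypothetical snippet database: each entry encodes a snippet's
    # input-output behaviour as an SMT constraint over (x, y).
    snippets = {
        'y = x + 1':  y == x + 1,
        'y = abs(x)': y == If(x >= 0, x, -x),
        'y = 2 * x':  y == 2 * x,
    }

    # Desired behaviour of the faulty fragment, derived from its tests.
    profile = [(3, 3), (-4, 4)]    # (input, expected output) pairs

    def matches(encoding):
        # A snippet matches if its encoding forces the expected output for
        # every profiled input, i.e. encoding AND x == i AND y != o is unsat.
        for i, o in profile:
            s = Solver()
            s.add(encoding, x == i, y != o)
            if s.check() != unsat:
                return False
        return True

    print([name for name, enc in snippets.items() if matches(enc)])  # ['y = abs(x)']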

2.4. Specification-Based Repair

Unlike search-based and semantics-based program repair, both of which rely on user-provided test suites to establish correctness, specification-based repair techniques assume the existence of formal specifications. Due to this requirement, the applicability of these approaches is far more limited; none of the datasets previously used to evaluate search-based or semantics-based repair approaches (e.g., SIR, ManyBugs, Coreutils) possesses any such specifications. On the other hand, specification-based approaches are able to leverage these specifications to generate higher-quality patches (that are likely to satisfy them). Concerns shift from the adequacy of the test suite to the soundness and completeness of the specifications that are used. Like semantics-based repair, specification-based approaches use program synthesis techniques to generate patches. Below, we briefly discuss AutoFix-E, one of the few and foremost approaches to specification-based program repair.

AutoFix-E

AutoFix-E [Wei et al., 2010] leverages partial specifications, provided in the form of contracts, to automatically repair faults in Eiffel classes. These contracts are used to specify preconditions, postconditions and intermediary assertions over Eiffel classes. Given a defective Eiffel class, AutoFix-E attempts to expose and repair the underlying bug, following the steps below:


1. Test Suite Generation: In contrast to almost all search-based and semantics-based repair approaches, AutoFix-E generates a test suite automatically, rather than relying on one provided by the programmer. AutoTest, an automated test generation tool for Eiffel, is used to generate as large a test suite as possible within a fixed window of time, covering the defective class. By evaluating the program under repair against this test suite, the tests are partitioned into passing runs and failing runs (equivalent to positive and negative tests). A run is determined to be a failure if it results in a contract violation.

2. Object State: To aid in determining the source of the fault, AutoFix-E examines object state by observing the output of argument-less, boolean-valued functions (referred to as boolean queries). Boolean queries are widely used in Eiffel contracts as a means of concisely capturing the key properties of object state. Together, the n boolean queries for a given class describe its 2^n abstract states. Although the resulting state space may seem intractable, a study of a large Eiffel library, conducted by Wei et al. [2010], suggests that the majority of Eiffel classes have 15 or fewer such queries.

Using the set of boolean queries for the class under repair, AutoFix-E uses contract mining to discover implications between queries. For each mined implication, AutoFix-E also generates three mutants of that implication: the negation of its antecedent, of its consequent, and of both. Together, the sets of boolean queries and implications form the predicate set P; this set describes a collection of simple facts about the class under repair, encoded as logical statements. To improve the efficiency of later stages in the repair process, the predicate set is pruned with the help of an automated theorem prover.

3. Fault Profiling: Using Daikon, a popular invariant mining tool, and a superset Π of the predicate set P, also containing the negation of each predicate, AutoFix-E generates a pair of invariants for each executed program location ℓ. This pair, (I_ℓ⁺, I_ℓ⁻), comprises the predicates I_ℓ⁺ ⊆ Π that hold for all passing runs, and the predicates I_ℓ⁻ ⊆ Π that hold for all failing runs. Together, this sequence of pairs forms a fault profile, describing the possible causes of the failed runs in terms of abstract state. (A minimal sketch of this profiling step is given after this list.)

4. Behaviour Model Generation: To serve as a source of ingredients for its repair process, AutoFix-E mines a simple finite-state behavioural model over all of its passing runs, encoded as a finite-state automaton. The states of this automaton are labelled with the predicates that hold in that state. Transitions are labelled with the name of a routine, and describe a particular input-output behaviour of that routine in terms of abstract state; implicitly, each transition describes a Hoare triple. The resulting model is used by AutoFix-E to determine how to reach a desired state (i.e., the intended state) from some current state (i.e., a faulty state). In general, the model is neither sound nor complete, since it is inferred over a finite set of executions. In practice, however, the model appears to be sufficiently precise for the purposes of finding a repair.


(a)  snippet            (b)  if fail then       (c)  if not fail then    (d)  if fail then
     old_stmt                  snippet                  old_stmt                 snippet
                              end                     end                      else
                              old_stmt                                           old_stmt
                                                                               end

Figure 2.9: The fix schemas implemented in AutoFix-E [Wei et al., 2010]. snippet is replaced with a sequence of routine calls that move the program from a faulty state into a desired state. old_stmt may either be a single statement, or the block to which a statement belongs. fail is used to monitor the conditions under which the fault manifests, and not to affect the appropriate action.

5. Fix Generation: Beginning at the location where the fault occurred and iterating backwards, AutoFix-E uses its fault profile and behaviour model to generate a set of candidate repairs. At each location, AutoFix-E finds all possible instantiations, within its model, of the four repair schemas outlined in Figure 2.9. Each of these candidate repairs uses the fault profile to move the program from a possibly faulty state to a possibly correct state. To achieve this change in state, AutoFix-E uses its behavioural model to find the sequence of routine calls that results in the desired post-condition.

A special fix schema is also introduced for repairing linear assertion violations (e.g., count ≥ 0).

6. Fix Validation: Once a set of candidate repairs has been generated, AutoFix-E evaluates them to determine the set of valid repairs (i.e., those which pass all of the tests). Finally, AutoFix-E uses a series of metrics to rank the valid repairs according to some proxy for quality. These metrics measure the size of the snippet, the distance in state, the number of old statements captured by the fix schema, and the number of branches required to reach old_stmt from the point of injection of the instantiated fix schema.
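As a concrete illustration of the fault-profiling step (step 3 above), the following sketch computes, for each location, the pair of predicate sets that hold in all passing runs and in all failing runs. The run and predicate representations are hypothetical; AutoFix-E derives its predicates from boolean queries and Daikon, which this sketch does not model.

    def fault_profile(passing_runs, failing_runs):
        # Each run maps a location to the set of predicates observed to hold
        # there. Returns {location: (I_plus, I_minus)} where I_plus holds in
        # all passing runs and I_minus holds in all failing runs.
        locations = set()
        for run in passing_runs + failing_runs:
            locations |= run.keys()
        profile = {}
        for loc in locations:
            i_plus = set.intersection(*(run.get(loc, set()) for run in passing_runs))
            i_minus = set.intersection(*(run.get(loc, set()) for run in failing_runs))
            profile[loc] = (i_plus, i_minus)
        return profile

    passing = [{'l1': {'not is_empty', 'count > 0'}}, {'l1': {'not is_empty'}}]
    failing = [{'l1': {'is_empty', 'count == 0'}}]
    print(fault_profile(passing, failing))
    # {'l1': ({'not is_empty'}, {'is_empty', 'count == 0'})}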

To evaluate AutoFix-E, Wei et al. [2010] ran it on a dataset of 42 faults detected by AutoTest in 10 data structure classes taken from popular, open-source Eiffel libraries. AutoFix-E was able to repair 16 of these faults. To assess the quality of AutoFix-E's patches, they manually inspected the top five patches reported for each bug to determine if the patch fixed the underlying bug without introducing a new defect. 13 of the 16 bugs fixed by AutoFix-E were found to satisfy this criterion.


2.5. Related Techniques

To conclude our review of the relevant literature, in this section we briefly discuss a number of techniques loosely related to general-purpose program repair:

• Reflective Grammatical Evolution: [Timperley, 2013; Timperley and Stepney, 2014] incorporate concepts from failure-obliviousness into genetic programming. After using computational reflection to create a failure-oblivious dialect of Ruby, the authors demonstrate a higher success rate and efficiency—measured by candidate evaluations—when programs are evolved with these measures enabled. These results suggest that allowing the program to persist in the presence of errors, through the use of such measures, may smoothen the fitness landscape and reduce the difficulty of the search.

• Data Structure Repair: [Demsky and Rinard, 2003, 2005] present techniques for automatically recovering from data structure corruption errors at runtime. Repair is achieved by enforcing a data structure consistency specification, which may be manually provided by the developer, or automatically inferred from correct program executions, using tools such as Daikon.

• Integer Bug Fixing: [Coker and Hafiz, 2013] propose a set of three program transformations for repairing all instances of integer bugs within C programs:

1. Add Integer Cast: adds explicit casts to disambiguate integer usage, and to address signedness and widthness problems.

2. Replace Arithmetic Operator: replaces arithmetic operations with safe equivalents, which detect overflows and underflows at runtime.

3. Change Integer Type: modifies the types of integers to avoid signedness and widthness problems.

Together, these three transformations fixed all 7,147 programs within NIST's SAMATE dataset [Black, 2007], covering over 15 million lines of code. Unlike general-purpose repair techniques, Coker and Hafiz's program transformations produce sound and complete repairs for a restricted class of defects, forgoing the need for test suite evaluation or symbolic execution. Although these transformations prevent integer bugs from manifesting, they do so by ensuring safe behaviour, rather than finding particular unsafe instances of integer usage and patching the source code directly (which would avoid the need for potentially expensive instrumentation).

• Bolt and Jolt: Bolt [Kling et al., 2012] and Jolt [Carbin et al., 2011] are techniques for monitoring the execution of a program, detecting whether an infinite loop occurs, and, if so, allowing the loop to be escaped or the program to be terminated, at the request of the user. Whereas Jolt requires source code instrumentation to inject monitoring code, Bolt obviates this need through on-demand dynamic binary instrumentation; the (unstripped) binaries remain


unmodified until the user attaches Bolt to the application. To detect the occurrence of (some) infinite loops, both techniques monitor the state of the program upon entry to the loop. If, upon the next iteration, the current state is the same as the previous state, an infinite loop is detected and the user is given the option to escape the loop or to terminate the program. (A minimal sketch of this detection rule is given after this list.) Across eight infinite loops in five small but real programs (grep, ctags, indent, ping, look), Jolt was able to detect an infinite loop in seven cases [Carbin et al., 2011]. Importantly, in each case, the program was allowed to continue executing, producing a more useful output than simply terminating the program [Carbin et al., 2011; Kling et al., 2012].

• CodePhage: Instead of addressing bugs via source code modification, CodePhage [Sidiroglou-Douskos et al., 2015] attempts to fix bugs by automatically identifying correct code in foreign, donor applications, and transferring that code into the faulty program. By operating at the binary level, CodePhage allows programs to be repaired without the source code of either the program under repair or the donor applications. In its evaluation, CodePhage was able to successfully repair five out of ten security bugs across seven different programs.

• ClearView: ClearView [Perkins et al., 2009] uses Daikon to infer likely invariants for a given system, based on data collected from several training executions. Prior to deployment, ClearView uses binary instrumentation to inject monitoring code into the program. Two of these monitors, HeapGuard and Determina Memory Firewall, check for out-of-bounds memory accesses and illegal control flow transfer errors, respectively. A third monitor, Shadow Stack, allows invariant violations along the call stack to be recorded. In the event that a monitor reports erroneous behaviour at run-time, ClearView attempts to generate a patch that restores the violated invariants and satisfies the monitor. The generated patches re-establish the inferred invariants by altering control flow, register values, and/or the values of memory locations.

To evaluate ClearView, Perkins et al. [2009] conducted a Red Team exercise, wherein members of the Red Team were asked to perform attacks on a program protected by ClearView, using exploits discovered on the unprotected version of the program. Firefox, a popular open-source web browser, was used as the target application for the exercise. The external Red Team was able to generate ten code-injection exploits in the target application. When ClearView was used, all of these attacks were detected by its monitors, and thus prevented. In seven out of ten cases, ClearView generated a patch that allowed the program to survive the attack and safely resume its execution.

• LeakFix: LeakFix [Gao et al., 2015] uses a series of program analyses to locate and repair (a sub-set of) memory leaks in C programs. Identified leaks are patched by inserting deallocation statements at appropriate points in the program. The resulting patches are guaranteed not to interrupt normal program execution.


In an evaluation on 15 programs, comprising 522 KLOC and each containing multiple leaks, LeakFix generated patches for 28% of the leaks. LeakFix exemplifies an alternative approach to automated program repair: tackling specific defect classes with high quality and accuracy. This approach is an appealing one, but in practice, it is difficult to cleanly assign bugs to any one particular defect class.

• Genetic Improvement: Search-based program repair can be viewed as part of the wider field of Genetic Improvement (GI) [Langdon, 2015], which seeks to apply machine learning and search techniques to improving existing programs, more generally. In contrast to Genetic Programming, which typically attempts to evolve a program from scratch, GI uses existing code as its seed. In addition to program repair, GI has been used to automatically improve run-time performance, reduce power consumption, port functionality, and more.
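The loop-detection rule used by Jolt and Bolt (described in the Bolt and Jolt item above) can be sketched as follows, assuming the loop's live state can be captured as a dictionary of variable values at the loop head. The real tools instrument compiled code; this is only an illustration of the detection rule, and the function names are hypothetical.

    def run_with_loop_monitor(step, state):
        # Run `step` (one loop iteration) repeatedly; report an infinite loop
        # if the state at the loop head repeats the previous iteration's state.
        previous = None
        while state.get('running', True):
            snapshot = dict(state)
            if snapshot == previous:
                print('infinite loop detected: state unchanged between iterations')
                return state          # here we simply escape the loop
            previous = snapshot
            step(state)
        return state

    # A buggy loop body that forgets to increment its counter.
    def buggy_iteration(state):
        if state['i'] >= state['n']:
            state['running'] = False
        # BUG: missing state['i'] += 1

    print(run_with_loop_monitor(buggy_iteration, {'i': 0, 'n': 3}))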

2.6. Concluding Remarks

In this chapter, we conducted a review of the majority of existing program repair techniques. Below, we summarise each of the three main approaches to repair:

• Semantics-based approaches, such as Angelix [Mechtaev et al., 2016], have been shown to be more efficient than search-based approaches and to successfully produce multi-line patches. The bugs and programs to which semantics-based repair can be applied are limited, however. Due to limitations in its underlying techniques, namely those inherited from symbolic execution, semantics-based approaches are unable to repair bugs involving side-effects, iteration, recursion, non-primitive data types, and more.

• Specification-based repair has demonstrated promising results, specifically AutoFix-E [Wei et al., 2010], but its reliance on formal specifications prevents it from being applied to the majority of bugs. Nonetheless, it may be possible to use contracts to guide search-based repair towards a repair, rather than to require them. The information provided by contracts could be integrated into a richer fitness function, similar to the one introduced by Arcuri and Yao [2008].

Alternatively, specification mining [Ernst et al., 2007] may be used to automatically infer partial specifications. Fast et al. [2010] attempt to realise this with their predicate-based fitness function (described in Section 6.1). However, their approach requires knowledge of at least one patch a priori, preventing it from being used in practice. B. Le et al. [2016] use mined specifications to improve the accuracy of function-level fault localisation (see Section 4.1.4 for more details).


Additionally, it may be possible to use the degree to which a potential solution conforms to an inferred specification as a proxy for the quality of a solution (i.e., whether the solution destroys existing functionality or overfits to the test suite). To our knowledge, no studies have explored a possible link between specification conformance and patch quality.

• Search-based repair is the only approach that neither requires formal specifications nor places any limitations on the bugs it can fix. GenProg, the seminal approach to search-based repair, is one of the few search-based techniques capable of generating multiple-edit patches—PAR and HDRepair, the other approaches capable of multiple-edit repair, share the same underlying genetic algorithm. Despite this capability, almost all patches generated by GenProg in previous studies were reduced to a single edit [Le Goues et al., 2012a; Qi et al., 2015]. Techniques such as PAR, AE, and SPR sacrifice the ability to generate multiple-edit patches in exchange for significant efficiency gains. It is unlikely that the algorithms used by these techniques (random search, exhaustive search and value search) can be scaled to generating multiple-edit patches, owing to the combinatorial explosion of the search space. To tackle the problem of multiple-edit repair, new active search algorithms are needed, capable of identifying and exploiting partial fixes, and of localising bugs over the course of the search.


CHAPTER 3

Tools and Techniques

Research in automated program repair is typically evaluated by applying a new technique to a set of real-world bugs, and comparing the results to those produced by prior techniques. Such an evaluation ideally involves a set of reproducible real-world defects. Open-source software repositories provide a plentiful source of such bugs, but reproducing them reliably can prove challenging. Program behaviour is often dependent upon the particular configuration of its host environment, such as the versions of particular libraries; many bugs only manifest under certain configurations. Additionally, because most program repair techniques involve evaluating candidate patches that may apply arbitrary changes to the program under repair, experiments must be appropriately sandboxed to ensure system safety and test idempotency.

What is needed, therefore, is both a dataset of bugs, and a platform for interacting with them that ensures reproducibility and safety. The former is provided by benchmarks such as ManyBugs [Le Goues et al., 2015], Defects4J [Just et al., 2014] and the Software Infrastructure Repository (SIR) [Do et al., 2005]. For Java bugs, Defects4J provides both a dataset of bugs and a platform for executing them, implicitly supplied by the Java Virtual Machine. For C bugs, this platform is either provided in the form of a monolithic virtual machine (VM) [Le Goues et al., 2015; Long and Rinard, 2015; Qi et al., 2015], shared by multiple bugs within the same dataset, or, in some cases, not provided at all [Kim et al., 2013].

Although the use of a VM as an execution platform provides reproducibility and safety, it does so at the cost of performance and extensibility. Running the experiment within a VM incurs the significant performance penalties associated with the overheads of virtualisation. Using a single VM to host both the bug scenario and the repair tool limits its future usage. Newer techniques may be unusable within the VM due to library incompatibilities, leading to the need to perform unsound modifications to the VM, which compromise its reproducibility.

Motivated by these problems and by the issue of repair quality, we present RepairBox, a platform for safely conducting controlled, reproducible program repair experiments with minimal compromise to performance. RepairBox operates by isolating individual bug scenarios and repair tools into separate, minimal Docker1 containers that can be easily inspected and extended. By packaging bug scenarios as containers, RepairBox attains close to bare-metal performance, an important trait when performing a large number of repeats of computationally-intensive program repair experiments.

1 https://www.docker.com. [Accessed: April, 2017].


A summary of the contributions of this chapter is given below:

• Identified previously unknown weaknesses in a number of publicly available APR benchmarks used to evaluate the performance of existing techniques.

• Developed a tool, Pythia, to automatically improve the reliability of existing test suites within the context of APR, ensuring that all passing tests meet a minimal set of quality requirements.

• Proposed a novel, efficient technique, RepairBox, for conducting controlled experiments into APR, together with a set of accompanying open-source tools.

• Collated and edited over 400 bug scenarios for C programs from a number of existing sources, including ManyBugs, GenProg's TSE benchmarks, and the Software Infrastructure Repository.

The rest of the chapter is structured as follows: In Section 3.1, we review a collection of publicly available datasets of bugs used to conduct APR experiments. In Section 3.2, we discuss the problem of repair quality and its importance to empirical studies of APR, before presenting Pythia, a tool to avoid this problem. In Section 3.3, we review existing approaches to reproducibility within APR, before proposing RepairBox, a high-performance, controlled platform for conducting APR experiments. Finally, in Section 3.4, we put forward an efficient and robust methodology for conducting APR experiments, based on these tools, which is used for the remainder of this thesis.

3.1. Bug Scenarios

In this section, we first identify and review a number of publicly-available datasets of bug scenarios, used to conduct empirical studies of program repair. We then discuss some of the current challenges involved in performing such studies.

• GenProg ICSE 2009, GECCO 2010, TSE 2012: each of these sets of bug scenarios was used to conduct early experiments on GenProg [Fast et al., 2010; Le Goues et al., 2012b; Weimer et al., 2009]. Across these three sets, 20 bugs, each from a different program, are covered. Of these 20 bugs, two are found in artificial programs consisting of fewer than 100 lines of code (gcd, zune). 17 of the remaining 18 bugs are sourced from genuine bugs in large, open-source projects (many taken from the results of Miller's work on fuzz testing [Miller et al., 1990]), with the other bug (atris) being sourced from a small game created by one of the authors of the paper.

Although the majority of these scenarios represent genuine bugs in large projects, each uses a very small, artificial test suite, hand-written by the authors. Each of these test suites contains fewer than 21 tests, designed solely to


expose the bug, rather than to thoroughly test the functionality of the program. This allows destructive changes to be silently introduced into the program. Whilst the test suites for these bugs are unrepresentative of standard test suites, their small sizes, together with the relatively short compilation times for most of their programs,2 make them ideal for high-cost, non-deterministic experiments, where large numbers of repeats are required to perform statistical hypothesis testing (e.g., to demonstrate a difference in performance between two search algorithms).

• ManyBugs: to validate the effectiveness of GenProg on a set of bug scenarios more representative of real-world environments, the ManyBugs [Le Goues et al., 2015] benchmarks contain 185 bug scenarios across 9 large, open-source programs, each with a genuine test suite. Since real test suites are provided with each bug, these benchmarks are appropriate for gauging and comparing the overall effectiveness of APR approaches in the wild.

Since real test suites are used in each of these bug scenarios, they tend to be several orders of magnitude larger than the artificial test suites found in the earlier GenProg datasets. For instance, each of the PHP bug scenarios contains around 8000 tests. As such, performing multiple runs of a given algorithm on one of these scenarios to achieve statistical significance can become particularly expensive, requiring (tens of) thousands of CPU hours.

• IntroClass: whilst ManyBugs and the earlier GenProg benchmarks are almost exclusively made up of large, open-source projects consisting of millions of lines of code, the IntroClass [Le Goues et al., 2015] bug scenarios provide an invaluable set of real-world bugs from small programs, consisting of fewer than 100 lines of code, taken from students' laboratory assignments. In total, this dataset contains 998 bugs. Each bug is accompanied by a pair of independent test suites containing 95 test cases in total: one suite could be used by the student during the development of their solution, whilst the other may be used to validate solutions and to reduce the likelihood of overfitting. Additionally, an oracle program is provided for each of the six assignments.

The small size and low complexity of this set of benchmarks make it particularly well suited to studies involving a large number of bug scenarios, or those in which certain levels of statistical significance must be achieved. Additionally, these bugs provide insight into the ability of APR techniques to generalise to smaller programs, written by novice developers. This particular trait may also be useful in assessing the effectiveness of plastic surgery-driven approaches on small code bases.

• Defects4J: Defects4J [Just et al., 2014] provides a database of high quality Java bug scenarios, designed for controlled studies of software testing. This database covers 395 real-world bugs across 6 open-source projects.

2For php and imagemagick, using the provided compilation scripts could result in compilation times taking as long as 8 minutes.


Another source of bugs within C programs, albeit one that has yet to be used by the APR community, is provided by the Software Infrastructure Repository (SIR) [Do et al., 2005]. SIR contains a collection of thousands of manually seeded bugs across 85 small to medium-sized programs (1 to 100 KLOC), written in C, C++, C#, and Java. Each program comes with a series of manually engineered test suites, designed to test the program at various levels of coverage. Although artificial in the nature of its tests and bugs, SIR offers a rich and plentiful source of bugs, suitable for conducting studies on large numbers of medium-sized programs.

The Siemens benchmarks [Hutchins et al., 1994] also provide a source of synthetic bugs within C programs. In comparison to the bugs within SIR, the Siemens programs span between 100 and 800 lines of code. Each bug scenario is accompanied by a test suite containing thousands of tests, providing a level of coverage rarely seen in real-world programs.

Importantly, while these datasets provide a plentiful source of bug scenarios, none, with the exception of Defects4J, provides an execution platform in which to reproduce these bugs; Defects4J implicitly provides such a platform in the form of the Java Virtual Machine. For programs written in compiled languages such as C, this lack remains an important and unsolved problem.

3.2. Pythia

In this section, we look at the problems that inadequate test suites cause when performing APR experiments, before introducing a lightweight, open-source tool to address the problem and to improve our confidence in the results of APR experiments.

Motivation

Recent findings [Qi et al., 2015; Smith et al., 2015] have demonstrated how inadequate test suites can lead to (unintentionally) misleading results and claims regarding the effectiveness of repair techniques. The majority of test suites provided within publicly-available automated repair benchmarks (for C) have been shown to suffer from serious problems of inadequacy. Qi et al. [2015] found that previously published results for GenProg and AE on the ManyBugs suite, wherein 55 of 105 bugs are solved, are mostly the result of overfitting. Following a manual inspection of the patches produced by GenProg and AE, they observe that only 2 of the 55 solved bugs yielded correct patches.

Upon closer examination of the ManyBugs test suite, they discovered that many of the tests failed to compare the output of the program against its expected

output. In some cases, the only requirement of the test was that the program should exit with a status of zero, allowing trivial patches such as exit(0); to be quickly discovered and accepted.

Motivated by those findings, we subjected the ManyBugs scenarios to further scrutiny, discovering a different class of weakness within certain test suites in the process. Specifically, for programs such as libtiff and gzip, the accompanying test harnesses neither check for expected side effects on the filesystem, such as the presence (or absence) of a particular file (e.g., file should be replaced by file.z), nor do they check the state of particular files. We also examined the GenProg TSE 2012 bug scenarios, and found that the majority of their tests also failed to validate on any criterion other than the exit status of the program. After we manually modified each of these test suites to compare program outputs against an expected output, GenProg (and AE) failed to yield a solution where previously they did.3

Such weaknesses compromise the validity of the conclusions drawn from APR experiments, and thus make performing experiments with such test suites exceptionally challenging. To reason about the effectiveness of APR approaches—and to avoid potentially misleading results—one must introduce further rigour to the test suites that are used. No existing, publicly available benchmarks for C programs are free from such weaknesses.

Existing Solutions

There are a number of existing techniques that one might use to tackle this problem. We critique each of them below.

Manual Extension and Modification

The simplest solution to the problem is to manually add more test cases to each of the test suites used within the experiment. However, doing so risks losing the real-world relevance of the results, as the test suites become more artificial and less realistic. One may argue that, in practice, the user of an automated repair technique could be expected to supply these additional test cases. In cases where bugs are less trivial, and the user is unable to fully comprehend them, such an assumption may prove impractical. Furthermore, such an approach would require a significant investment of time to carefully craft a higher-quality test suite for each problem.

Similarly, one may attempt to address the problem at its root by manually addressing each of the weaknesses in the test suites, thereby reducing the number of false positive results. By taking such measures, however, one's results become predicated on

3 For some bug scenarios, this was not possible since the program failed to produce any output when run on a 64-bit variant of Fedora 24. In these cases, the program appeared to have no side-effects on either the file system, the standard output, or the standard error.

the assumption that the test suite is of a high quality—an assumption that often fails to hold in the real world, as exemplified by the weaknesses within the ManyBugs benchmarks (e.g., PHP, libtiff, gzip). As with manually extending the test suite, this approach also requires a significant investment of time, as (thousands of) existing test suites need to be analysed and adjusted.

Automatic Extension

Alternatively, one may use the developer-provided patch for the bug as an oracle, allowing automated test generation [Cadar et al., 2008] to be used to supplement the test suite with more positive and negative tests. As a simple, albeit limited, criterion for the adequacy of a repair, one may use test case generation to increase statement coverage. In practice, aiming for complete branch, statement, or MC/DC coverage across the test suite is likely to prove intractable for all but the smallest and most artificial of problems. Even if one were to relax the coverage criterion to full coverage at the statement level (within a particular set of files), the resulting test suite may still prove too large as the size of the program grows. However, one may use test suite prioritisation and sampling to reduce the cost of this extended test suite.

For the purposes of assessing the potential effectiveness of automated repair techniques in the real world, where complete coverage (at any level) is scarce, augmenting the test suite in any way may produce overly optimistic results. In the general case, we believe that ideal, representative bug scenarios used within APR experiments should place no assumptions on the user extending the original test suite of the program, whether manually or automatically.

Pythia

To address the problem of inadequate test suites without the need for manual intervention, we introduce Pythia,4 an open-source, automated tool for improving the quality of APR experiments. Pythia takes a test suite manifest file (provided in a human-readable JSON format), together with an oracle program (i.e., the human-repaired form of the program), and generates an oracle file, describing the expected behaviour of the program for each test within the manifest. This process is illustrated in Figure 3.1. Pythia may also be used to improve the detection and automated repair of regression faults in real-world programs, by using an existing version of the program as the oracle. Once the oracle file has been generated, all tests (when performed using Pythia) are guaranteed to meet the following quality requirements:

4 The name “Pythia” comes from the name of the mythical Oracle at Delphi (https://en.wikipedia.org/wiki/Pythia). Pythia is available to download at: https://github.com/ChrisTimperley/Pythia.


Figure 3.1: An overview of the inputs and outputs of Pythia’s oracle generation process.

1. Checking of standard output: the standard output of the program should match that of the oracle.

2. Checking of standard error: the standard error of the program should match that of the oracle.

3. Checking of exit code: the program should exit with the same code as the oracle.

4. Checking of file system: the program should leave the relevant areas of the file system in the same state as the oracle (i.e., the content, presence, and absence of files are checked).

Rather than describing tests and their expected outputs within the same file, as is the typical approach within the ManyBugs and early GenProg benchmark sets, Pythia uses the description of a test to create a (faux) isolated environment for its execution (known as a sandbox), before executing its associated shell command within that sandbox and recording its various behaviours. After capturing the behaviour of a test execution, those behaviours are compared against those given in the oracle file—if the program exhibits all of the same behaviours, it passes the test; otherwise, it fails.
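A minimal sketch of this pass/fail decision is given below, assuming an oracle entry with the retval, out, and err fields shown later in Figure 3.4. The function and its arguments are hypothetical, are not Pythia's actual interface, and omit the file system check.

    import subprocess

    def passes(command, sandbox_dir, oracle_entry):
        # Run the test command inside its sandbox and capture its behaviour.
        result = subprocess.run(command, shell=True, cwd=sandbox_dir,
                                capture_output=True, text=True)
        # A full implementation would also compare the resulting sandbox state
        # against the oracle's recorded file hashes.
        return (result.returncode == oracle_entry['retval']
                and result.stdout == oracle_entry['out']
                and result.stderr == oracle_entry['err'])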

Sandbox Environment

Prior to the execution of each test, an isolated environment, or sandbox, is created for the execution. By reading the description of a test from its corresponding manifest file, Pythia copies the files relevant to this test into the sandbox, ensuring a minimal, working environment for the execution. In practice, sandboxes are implemented as temporary directories that are disposed of after the execution.


$ ./gzip ../inputs/testdir/file26 -c

Figure 3.2: In this example, taken from the gzip object within the SIR, the test suite attempts to destructively compress a given file, deleting the original version of the file and replacing it with its zipped form.

[
  ...,
  {
    "description": "Compresses a file at a given location",
    "command": "<> -c '<>/foo'",
    "sandbox": {
      "foo": "inputs/example.txt",
      "bar": "inputs/example2.txt"
    }
  },
  ...
]

Figure 3.3: An example test case description within a Pythia test manifest file. Each test case is described by its corresponding shell command, the contents of its sandbox directory, and an optional human-readable description.

Pythia includes this feature to avoid the side-effects of certain test executions on the file system. One example of where this behaviour is essential, amongst several, can be found in the gzip object from the Software Infrastructure Repository. An exemplar test from its original test suite is given in Figure 3.2. When one attempts to run this test a second time, the oracle program will produce a different result, as the input file to the test was destroyed during the initial invocation. By copying the necessary files to the sandbox, rather than using their original versions on the file system, such side-effects can be avoided.

Test Manifest File Format

Test manifest files within Pythia are described using a JSON array of objects, where each object describes a single test case by a shell command, the contents of its sandbox, and an optional human-readable description of the test case (useful for debugging). An example test case description within a manifest file is given in Figure 3.3.

To ensure the program behaves appropriately in all environments, two special variables may be used in the shell command for the test case: <>, which specifies the absolute path to the program executable, and <>, which gives the absolute path of the sandbox for the particular execution. Prior to execution, Pythia substitutes these special variables, replacing them with their appropriate values.


[
  ...,
  {
    "retval": 0,
    "time": 1.432,
    "out": "zipped f to f.z",
    "err": "",
    "sandbox": {
      "f.z": "cf23df2207d99a74fbe169e3eba035e633b65d94"
    }
  },
  ...
]

Figure 3.4: An entry in an example Pythia oracle file, describing the expected behaviour of a test case from the corresponding manifest file.

The contents of the sandbox for each test case are described using a dictionary, where each entry describes a file that should be copied into the sandbox. The value (i.e., right-hand side) of each entry specifies the location of the file that should be copied to the sandbox (the path must be either absolute, or relative to the location of the manifest file). The key of each entry (i.e., left-hand side) specifies the path that the file should be copied to, relative to the root of the sandbox directory. Note that no distinction is made between files and directories when copying; when a directory is specified by the value of a sandbox entry, that directory is copied recursively to a directory with the name given by the key of that entry.
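The copying behaviour described above might be sketched as follows; the helper name is hypothetical, and error handling is omitted.

    import os
    import shutil
    import tempfile

    def build_sandbox(sandbox_spec, manifest_dir):
        # sandbox_spec is the "sandbox" dictionary from a manifest entry:
        # keys are destinations relative to the sandbox root, values are
        # source paths (absolute, or relative to the manifest file).
        sandbox = tempfile.mkdtemp()
        for dest, src in sandbox_spec.items():
            src_path = src if os.path.isabs(src) else os.path.join(manifest_dir, src)
            dest_path = os.path.join(sandbox, dest)
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            if os.path.isdir(src_path):
                shutil.copytree(src_path, dest_path)   # directories copied recursively
            else:
                shutil.copy(src_path, dest_path)
        return sandbox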

Oracle File Format

Pythia uses oracle files to describe the expected behaviour of each test within a corresponding test suite. As with the test manifest, oracle files are encoded as a JSON array of objects, where the nth object describes the intended behaviour for the nth entry within its associated manifest. Each object specifies the expected exit code, standard output, standard error, the state of the sandbox after execution, and the time taken to reach completion.

An example oracle file entry is given in Figure 3.4. The retval, out and err properties record the exit code, and the state of the standard output and standard error, respectively. The resulting state of the sandbox is stored by computing a SHA1 hash of each file therein, recursively. The contents of each file may be described just as effectively by a hash as by its raw contents, since the state is only used for comparison. By recording the state in such a manner, Pythia is able to check for the presence and absence of files within the sandbox, and to ensure the correctness of their contents.

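The recorded sandbox state might be computed along the following lines; the helper name is hypothetical.

    import hashlib
    import os

    def sandbox_state(sandbox):
        # Map each file's path (relative to the sandbox root) to the SHA1
        # hash of its contents, walking the sandbox recursively.
        state = {}
        for root, _dirs, files in os.walk(sandbox):
            for name in files:
                path = os.path.join(root, name)
                with open(path, 'rb') as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                state[os.path.relpath(path, sandbox)] = digest
        return state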

The time property specifies the number of seconds taken for the test to finish its execution. Pythia uses this information to automatically set appropriate time limits for each test. Specifically, Pythia scales this value using the formula given in Equation 3.1:

t′ = P · G · t + ∆ (3.1)

where t′ is the calculated time limit for a given test, t is the time taken by the oracle to finish that test, ∆ is an offset, and P and G are platform and generic scaling factors. ∆ and G are used to control for variance in execution time. P is used to take the expected execution time on one machine and apply it to another; this term is used when the oracle is built on a faster machine than the one used to conduct repair. By assigning individual time limits to each test, rather than a single time limit for all tests—as is the current standard—Pythia significantly reduces the time taken to evaluate the entire test suite for worst-case repairs.
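Equation 3.1 can be read directly as a small helper; the default factor values below are illustrative, not those used by Pythia.

    def time_limit(oracle_time, platform_factor=1.0, generic_factor=2.0, offset=1.0):
        # t' = P * G * t + delta (Equation 3.1)
        return platform_factor * generic_factor * oracle_time + offset

    print(time_limit(1.432))   # 3.864 seconds for the oracle entry in Figure 3.4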

Limitations

Although Pythia addresses certain kinds of test suite weaknesses, allowing practitioners to gain a higher degree of confidence in the results of their experiments, it is not without its limitations and trade-offs:

• Non-determinism: in cases where some aspect of the behaviour of a particular test case is non-deterministic, correct results may fail to agree with the oracle, resulting in false negatives. To our knowledge, this problem does not affect any of the benchmarks within the benchmark sets discussed earlier; the SIR benchmarks, in particular, specifically go about removing sources of non-determinism from the program.

• Inadequate test coverage: although Pythia prevents certain common degenerate solutions from being accepted (e.g., early program exits), it is unable to prevent false positives stemming from a lack of coverage (e.g., a faulty if-condition that is only covered once may be replaced by a tautology or a fallacy).

• Inefficient copying of files: in cases where the program does not write to its input files, copying them to the sandbox is a waste of resources. However, in the general case, variants of the program may have the ability to write to that file, and so, to avoid side-effects, a copy of that file must be made. Alternatively, one could solve the problem by ensuring the file is read-only and creating a symbolic link within the sandbox.

• Ignorance of file attributes: properties of files, such as their permissions, ownership, and associated dates, are not recorded by Pythia. In general, we


observe few instances where such attributes are relevant to testing. Moreover, recording certain of these properties may introduce non-determinism.

• Modification of files outside of the sandbox: certain tests may involve files outside of the sandbox (e.g., htpasswd). To protect these files from being infected and left in an altered state following a test execution, one must take further steps to isolate the test execution. A particularly expensive option would be to launch a separate container (discussed further in Section 3.3) for each execution. An alternative—albeit complex—solution would be to (optionally) use the sandbox as a chroot jail for each test execution, guaranteeing a side-effect-free execution and a complete description of the state of the file system.

Despite these drawbacks, Pythia provides a lightweight way of improving the quality of existing test suites, and makes performing experiments easier and their results more robust. Future work may address some of these issues to further enhance the utility of Pythia.

3.3. RepairBox

Having studied and addressed the problem of inadequate test suites within APR experimentation, in this section we look at the related problem of reproducibility of results, before introducing a framework to eradicate the problem with minimal loss of efficiency.

Motivation

In addition to exposing weaknesses within publicly available bug scenarios, recent studies have also highlighted the difficulties in accurately reproducing the results of previous experiments. Whilst investigating the false positive repairs generated by AE and GenProg on the ManyBugs bug scenarios, Qi et al. [2015] were unable to reproduce the original (plausible, but incorrect) repairs generated by GenProg and AE in some instances. With no information about the environment of the experiment beyond a (partially configured) virtual machine, the configuration required to produce these original results is difficult to replicate.

Through personal experience with the ManyBugs bug scenarios, we have encountered difficulties in installing newer software on the virtual machine images provided by previous experiments. Given the lack of any additional documentation with these VM images, creating a new, minimal environment that replicates the bug can prove challenging.

In other work, where template-driven automated program repair was applied to Java programs [Kim et al., 2013], none of the source code used within the experiment was

made available; only informal, ambiguous descriptions of the system were given, making it impossible to reproduce the results of the original system.

Existing Solutions

In this section, we critique existing approaches to making the results of APR experiments reproducible.

Source Files

The simplest technique for achieving reproducibility is to specify the name of a bug scenario within a publicly available set of benchmarks and the release version of the operating system used by the experiment, and to provide a set of source code files for the faulty and fixed versions of the program, omitting all other details. Crucially, this approach misses details including the state of relevant environment variables, the packages installed on the system (and their versions), and the state of various configuration files (many of which may be unknown to the experimenter).

Running the experiments within the specified operating system is far from guaranteed to produce the same results, unless all other relevant setup stages are made clear and followed correctly. Additionally, in some cases, the correct functioning of the program also relies on the absence of certain files and packages from the system. In practice, this leads to needless hours spent trying to discover why a particular experiment will not reproduce the original results, and taking measures to try to align the results with those that are expected. In the worst case, the original experiment may have suffered from a mistake somewhere in its configuration, but without clear reference to the complete setup, this is unknowable.

Virtual Machine Image

The most common approach to the problem of reproducing results is to provide an accompanying virtual machine image, used to conduct the experiments. Conducting experiments within a VM introduces considerable performance overheads, significantly increasing the already long run-time of APR experiments. Additionally, such an approach substantially increases disk space and bandwidth requirements, as the user needs to download a complete image of the VM for each bug scenario whose results they wish to replicate.

Furthermore, such an approach may prove excessively cumbersome when one wishes to modify or extend the experiment. In particular, years later, aspects of the virtual machine may cease to work correctly (e.g., package managers), or additional, conflicting software may need to be installed to perform a modified version of the

experiment. Since the steps taken to create the VM are unclear, modification can prove difficult and impractical.5

Amazon EC2 AMI

Rather than supplying a virtual machine image in a common, open-source format, one may provide a machine image as a proprietary AMI (Amazon Machine Image),6 usable on Amazon's EC2 cloud computing service. This approach, used in later experiments performed with GenProg, avoids some of the overheads involved in running a standard VM on the user's hardware, allowing it to be run on a faster, purpose-built cloud computing platform. One may then provision (multiple) machines on the EC2 platform using the provided AMI to reproduce the results of an experiment.

Since this approach only provides an image of the system used to perform the experiment, without details of its creation, it also inherits the problems of extensibility and bandwidth from the VM approach. As the AMI file format is proprietary, the problem of extensibility is further exacerbated. This restriction also prevents the machine image from being run on locally available hardware (e.g., university compute clusters), and forces the user to use—and pay for—Amazon's compute services to reproduce the result.

Vagrant and Puppet

In earlier attempts to produce a suitable platform for program repair research, we explored the combined use of Vagrant7 and Puppet8. We used these technologies to generate and provision a VM from a Vagrantfile, describing a base VM in a human-readable format, and a set of Puppetfiles, providing instructions on how to configure the VM and install the necessary packages. Whilst this approach partially solved the problem of transparency encountered when using VM images, extending the VM remained overly tedious, and performance remained inferior to that of a container-based approach.

RepairBox

To avoid these problems, we present RepairBox, a platform for safely conducting controlled, reproducible program repair experiments, with minimal compromise to

5 We encountered this problem of extensibility when attempting to install a newer version of OCaml, required by the software we were using, onto the VM of GenProg's 2010 results. The package manager used by Fedora had transitioned from yum to dnf, and all of its sources were outdated. Further, packages installed on the system prevented this version of OCaml from being installed.
6 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html. [Accessed: May, 2017].
7 https://www.vagrantup.com/. [Accessed: May, 2017].
8 https://puppet.com/. [Accessed: May, 2017].


Figure 3.5: Docker Containers (left) vs. Virtual Machines (right). Each container sits on top of the Docker runtime, which provides access to the kernel of the host machine. Each virtual machine virtualises its own stack, down to the level of the hardware, and sits on top of a hypervisor, which allows multiple VMs to run on the same machine. performance. RepairBox operates by isolating individual bug scenarios and repair tools into separate, minimal Docker containers that can be easily inspected and ex- tended. By packaging bug scenarios as containers, RepairBox attains close to bare metal performance, an important trait when performing a large number of repeats of computationally-intensive program repair experiments.

As its means of packaging each bug scenario into a controlled, self-contained, exe- cutable environment, RepairBox uses Docker containers. Container images provide a lightweight means of encapsulating a piece of software and all of its dependencies into a single, portable executable package, referred to as a container.9 Unlike virtual machines, which virtualise at the hardware level, containers virtualise at the operat- ing system level (illustrated in Figure 3.5). To do so, containers share the operating system kernel of their host machine. As a result, containers use less compute and memory resources than their VM counterparts, giving them a signicantly higher level of performance [Felter et al., 2015]. Additionally, since containers are built as a series of le system layers, they typically size in the tens of MBs, rather than tens of GBs.

By packaging bug scenarios as Docker images, users can quickly download pre-built images from DockerHub,10 using the repairbox download command.

To conduct experiments using these repair boxes, users can combine them with containers holding the repair tools required for the experiment, as illustrated in Figure 3.6. By adopting this approach, users can provide transparent, high-performance, executable experiment designs as the replication package for their research.

9 https://www.docker.com/what-container. [Accessed: April, 2017].
10 https://hub.docker.com. [Accessed: April, 2017].


Figure 3.6: Repair boxes can be used with repair tools by mounting the binaries provided by a repair tool container into the repair box as a Docker volume.

Each bug scenario in the repository is accompanied by a Dockerfile. This file provides an executable set of instructions for constructing the corresponding image. An excerpt of the Dockerfile for one of the libtiff scenarios is given in Figure 3.7.

To compose the image, Docker uses each instruction in the Dockerfile (e.g., RUN, FROM, ADD) to construct a series of read-only layers. Each of these layers represents a filesystem difference.11 This approach allows Docker to cache layers, thus reducing build time. Importantly, it also allows images to be used as the base layer of another image, using the FROM instruction. RepairBox uses layering to efficiently provide minimal images for its bug scenarios. Each bug scenario is composed of a stack of images, illustrated in Figure 3.8. At the bottom of this stack sits the universal base image used by all bug scenarios, which provides common functionality to all repair boxes. Above this image there rests an image for the particular dataset to which the scenario belongs. On top of the dataset image sits an image for the particular program. Finally, at the top of the stack rests an image for the bug scenario.

Each repair box is structured identically, with all of its relevant files (e.g., source code, test suite, meta-data) within the /experiment directory. To allow test prioritisation, reduction, and sampling to be used, each scenario implements a simple bash-based test harness, which allows a single, specified test to be executed, rather than the entire suite.

For some bugs within RepairBox, the test suite is provided by the original dataset. For the rest, where we observe issues with test case interference and inadequate checking (i.e., only the exit status of the program), we use Pythia, described in Section 3.2, to generate a stronger oracle from the original test suite. For each test execution, Pythia creates a sandboxed environment (realised as a temporary directory) in which to perform the test. This behaviour allows us to avoid the problem of test interference.

11 https://www.docker.com/what-container. [Accessed: April, 2017].


    FROM squareslab/repairbox:manybugs64
    MAINTAINER Chris Timperley "[email protected]"

    # install dependencies
    RUN sudo apt-get update && \
        sudo apt-get install -y --force-yes \
          gcc-multilib psmisc zlib1g-dev && \
        sudo apt-get clean && \
        sudo apt-get autoremove && \
        sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

    # download Libtiff source code
    RUN git clone https://github.com/vadz/libtiff src
    ...

Figure 3.7: An example Dockerfile for one of the libtiff scenarios within RepairBox.

Containers

By using a Docker container as a means of accurately replicating a particular bug scenario, rather than employing a VM-based approach, we can create reproduction packages for APR experiments with the following advantages:

• Transparency: since the Dockerfile specifies the exact steps required to build the container, all of its details are available to the user, making future modification and bug understanding simpler.

• Minimality: each Dockerfile is based on a minimal base image (for either Ubuntu 14.04 or Fedora 23), between 200 and 400 MB in size, which is extended with the minimal set of packages and configuration required to replicate a particular bug and nothing more.

• Extendibility: Docker container images are built around the concept of layering, where the image for a container is built as a series of layer images, each of which is cached, starting from a base image. Each layer represents the resulting system image when a given instruction (e.g., apt-get install ...) is executed on the resulting container for the previous layer.

Instead of building each of the RepairBox bug scenarios from scratch (using the provided Makefile), the user may download a pre-built image from DockerHub using RepairBox's command-line interface.


Figure 3.8: The Docker image for a given bug scenario is built as a series of layers. Each layer is shared by a number of images on the layer above, reducing disk usage and build time.

• Performance: unlike traditional, hypervisor-based virtual machines, containers share the kernel of their host operating system, removing the need to virtualise the kernel and hardware stack. As a result, the performance of containers tends to be significantly better than that of VMs, using fewer clock cycles, and less memory and disk space.

Because containers share the kernel of the host machine, bugs that rely on a particular kernel are difficult, and in some cases impossible, to reproduce within a Docker container. This limitation prevents the inclusion of the valgrind bug scenarios from ManyBugs. A similar limitation applies to bug scenarios that only occur on 32-bit architectures. In this case, however, we can usually reproduce the bug by passing the appropriate flags to make, or by using a Docker image with 32-bit libraries as the base image.

At the cost of performance, those bug scenarios could be included in RepairBox by hosting them within a suitable virtual machine (possibly provided by boot2docker12). We leave this to future work. In most cases, however, we have found that a specific kernel is not necessary.

Composition

To simplify the management of repair boxes through the RepairBox command-line interface, discussed in Section 3.3.5, we require that repair boxes provide manifests. These manifests, encoded as YAML files, specify: the name of the bug scenario, the dataset (and optionally, the program within a dataset) to which they belong, the name of the Dockerfile used for their construction, as well as any build-time arguments that should be passed to Docker. An example manifest is given in Figure 3.9. Similar manifests must also be provided for datasets and programs.

    bug: 69223-69224
    build-arguments:
    scenario: python-bug-69223-69224
    dataset: manybugs
    dockerfile: Dockerfile.bug
    program: python
    version: 0

Figure 3.9: An example bug scenario manifest.

    version: 0
    tool: genprog
    image: christimperley/genprog:latest
    environment:
      PATH: "/opt/genprog3:${PATH}"

Figure 3.10: An example tool manifest describing GenProg.

All of these entities (bug scenarios, programs, datasets) must provide a version suffix parameter. Complete version numbers for entities are generated by concatenating these suffixes. For instance, for the bug scenario described in Figure 3.9, the complete version number will be given by "D.P.B", where D, P, and B are the version suffixes for the dataset, program, and bug, respectively. This information is used by RepairBox to determine whether a particular component is out-of-date and should be updated. Manifests may also be provided for containerised tools (e.g., GenProg or Daikon). An example tool manifest is given in Figure 3.10. This manifest specifies a name and version for the tool, the name of the Docker image for its container, and a dictionary describing which environment variables should be modified within the repair box upon launch.

12 http://boot2docker.io. [Accessed: April, 2017].
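To make the version-numbering scheme concrete, the following is a minimal sketch, not taken from the RepairBox implementation, of how a manifest might be read and the complete "D.P.B" version string derived. The file names and the use of PyYAML are assumptions made purely for illustration.

    import yaml  # assumes the PyYAML package is available

    def load_manifest(path):
        # Read a YAML manifest (bug scenario, program, or dataset) from disk.
        with open(path) as f:
            return yaml.safe_load(f)

    def complete_version(dataset, program, bug):
        # Concatenate the dataset, program, and bug version suffixes ("D.P.B").
        return "{}.{}.{}".format(dataset["version"], program["version"], bug["version"])

    if __name__ == "__main__":
        # Hypothetical file names; RepairBox's actual layout may differ.
        bug = load_manifest("bug.manifest.yml")
        program = load_manifest("program.manifest.yml")
        dataset = load_manifest("dataset.manifest.yml")
        print(complete_version(dataset, program, bug))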

Command-Line Interface

RepairBox hides the details of its implementation from its end-users through a simple, self-documenting command-line interface (CLI). Below, we briefly discuss the features of this CLI and provide examples of its use:

• repairbox -h: describes each of the commands provided by the RepairBox CLI.

• repairbox list: produces a list of all artefacts of a given kind (e.g., bug scenarios, datasets, tools). Example uses of this command are given in Figure 3.11.

    $ repairbox list bugs

    Bug                        Latest   Installed  Remote
    ------------------------------------------------------
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2006...   0.0.0.0  0.0.0.0    0.0.0.0
    ...

    $ repairbox list tools

    Tool          Image                            Installed
    ------------------------------------------------------
    genprog       christimperley/genprog:latest    True
    searchrepair  searchrepair:latest              True
    shuriken      shuriken:latest                  True
    ...

Figure 3.11: Example uses of the repairbox list command.

• repairbox install: installs a specified repair box (or collection of boxes). If the latest version of the repair box is hosted on DockerHub, its image will be downloaded. If not, the repair box and its dependencies will be built from scratch locally.

This command may also be used to install a registered tool onto the local machine by downloading its image from DockerHub.

• repairbox uninstall: removes a specified (set of) repair boxes from the local machine.

• repairbox build: constructs a named repair box (or collection of boxes) and their dependencies locally.

• repairbox download: downloads a pre-built image for a (set of) repair box(es) or a repair tool from DockerHub.

• repairbox launch: spawns an interactive container for a given repair box. The names of supported repair tools may also be passed via the --tools argument, causing the binaries for those tools to be imported into the container. Internally, this command oversees the generation, linking, and clean-up of the required containers. An example use of this command is given in Figure 3.12.

    host> repairbox launch manybugs:libtiff:2005-12-14-6746b87-0d3d51d \
             --with genprog daikon
    rbox> genprog problem.json
    ...

Figure 3.12: An example use of the repairbox launch command.

3.4. Methodology

Using the resources provided by the RepairBox project, together with Pythia, we put forward a methodology for automated program repair experiments that is used throughout the rest of this thesis.

Executable Experiment Designs

Experiment designs are described by simple scripts which interact with containerised bug scenarios and tools through the RepairBox interface. This approach allows experiments to be concisely described, more easily understood, and to be accurately reproduced. By imposing this requirement on the experiment, the benefits of using containerisation, namely performance, extendibility, and isolation, are inherited by the experiment. Since this script contains all of the steps necessary to run the experiment, it essentially becomes an executable experiment design.
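As an illustration only, the sketch below shows what such an executable experiment design might look like. The bug scenario name is taken from Figure 3.12, but the non-interactive use of repairbox launch, the log-file naming, and the number of repeats are illustrative assumptions rather than part of the methodology or any actual replication package.

    import subprocess

    BUG = "manybugs:libtiff:2005-12-14-6746b87-0d3d51d"  # example bug scenario
    REPEATS = 20                                          # repeats for later statistical testing

    def run_trial(trial):
        # Launch the repair box with GenProg's binaries mounted and record the
        # output of a single repair attempt for later analysis.
        cmd = ["repairbox", "launch", BUG, "--with", "genprog"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        with open("trial-{}.log".format(trial), "w") as log:
            log.write(result.stdout)

    if __name__ == "__main__":
        for trial in range(REPEATS):
            run_trial(trial)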

Host Platform

In the interest of identifying and avoiding rare bugs that only occur when particular kernel versions are used, the experiment must state the kernel version and operating system release used by the host machine. Additionally, where relevant, a description of the hardware specifications of the host machine should also be given.

For the majority of the experiments conducted in this thesis, we used cloud computing facilities to distribute the load across a large number of machines, hosted on Amazon AWS and Microsoft Azure. Directions on expediting the running of experiments within a distributed environment are provided with each experiment's reproduction package.

Justification of Benchmarks

In addition to supplying a simple script capable of running the experiment, this methodology requires that users justify their selection of benchmarks within the context of their research questions. Any weaknesses, trade-offs, or artificiality within the bug scenarios must be stated and mitigated.


Statistical Hypothesis Testing

Where possible, we advocate that statistical hypothesis testing be used to demonstrate the statistical significance of any results. In most cases, we do not have reason to assume that the results will fit a particular probability distribution, and so we must use nonparametric statistics to describe and test them [Downey, 2011]; nonparametric statistics should also be used when the median is a more appropriate measure of the central tendency of the data. Although nonparametric statistics allow us to describe and test data without assuming a particular distribution, they do so at the cost of decreased statistical power (i.e., the probability of correctly rejecting the null hypothesis). This decreased statistical power means that a greater number of samples is needed to increase the likelihood of detecting a significant effect when one exists. To test whether the performance of one search algorithm is stochastically greater than that obtained by another, a one-sided test for statistical significance should be used, such as the Mann-Whitney U test [Mann and Whitney, 1947]. To test whether two groups are drawn from the same distribution, one may use a two-sided Kolmogorov-Smirnov (KS-2) test [Downey, 2011].
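As a brief illustration, assuming SciPy is available and that a and b hold a performance measure (such as NCP) gathered from repeated runs of two search algorithms, the tests above might be applied as follows; the sample values are invented.

    from scipy.stats import mannwhitneyu, ks_2samp

    a = [41, 38, 45, 39, 44, 40, 42, 37, 46, 43]  # hypothetical results for algorithm A
    b = [29, 33, 27, 31, 30, 28, 35, 32, 26, 34]  # hypothetical results for algorithm B

    # One-sided Mann-Whitney U test: is A stochastically greater than B?
    u_statistic, u_p_value = mannwhitneyu(a, b, alternative='greater')

    # Two-sided Kolmogorov-Smirnov test: are A and B drawn from the same distribution?
    ks_statistic, ks_p_value = ks_2samp(a, b)

    print(u_p_value, ks_p_value)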

In addition to determining statistical significance, one should also measure effect size (i.e., how big is the difference between the two groups?). Vargha and Delaney [2000]'s A12 measure is a popular choice for nonparametric effect-size measurement. Intuitively, A12 describes the probability that a randomly selected value from group A is greater than one from group B [Neumann et al., 2015]. Vargha and Delaney [2000] provide guidelines for interpreting the size of the effect: 0.5 indicates no effect, 0.56 is a small effect, 0.64 is a medium effect, and 0.71 is a large effect.
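A12 can be computed directly from its definition; the sketch below is a straightforward counting implementation (ties counted as half), with invented sample data.

    def a12(group_a, group_b):
        # Probability that a randomly selected value from group A is greater
        # than a randomly selected value from group B, counting ties as 0.5.
        wins = sum(1.0 for a in group_a for b in group_b if a > b)
        ties = sum(0.5 for a in group_a for b in group_b if a == b)
        return (wins + ties) / (len(group_a) * len(group_b))

    print(a12([5, 6, 7], [1, 2, 3]))  # 1.0: every value in A exceeds every value in B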

3.5. Conclusion

In this chapter, we identified and discussed the problems of inadequate test suites and incomplete reproduction packages within studies of APR, before introducing Pythia and RepairBox to address them. Using both of these tools, we propose a more robust, efficient experiment methodology, allowing research to be confidently performed for a diversity of research questions using hundreds of bug scenarios, collated and edited from several existing datasets. RepairBox and Pythia are available to download at:

https://github.com/squaresLab/RepairBox

https://github.com/ChrisTimperley/Pythia

Over time, we intend to explore further avenues of improvement to RepairBox, allowing it to be used for a wider range of experiments within program testing and genetic improvement. Additionally, we plan to incorporate further sources of bugs into RepairBox, including the DARPA Cyber Grand Challenge13, IntroClass, and CVE14.

13 http://archive.darpa.mil/cybergrandchallenge/
14 https://cve.mitre.org/

CHAPTER 4

Fault Localisation

For automated program repair to become an economical and effective means of fixing bugs, and thus break through into the mainstream, techniques will need to be capable of solving a wide range of bugs within a reasonable resource window. To go beyond the small number of fixes possible with current techniques, search-based repair will need to incorporate a richer repair model, as discussed in Chapter 5. Additionally, techniques will need to be capable of generating patches spanning multiple lines, going beyond the single-line patches currently produced by all but two approaches [Le Goues et al., 2012a; Mechtaev et al., 2016]. As a result of these requirements—a richer repair model and the ability to produce multi-line fixes—the search space will undergo a combinatorial explosion. As one approach to tackling this growth, more accurate means of localising the source of faults will need to be discovered. At present, all repair techniques make use of relatively simple spectrum-based fault localisation techniques, which often prove ineffective at pinpointing suspicious statements within a basic block (since all statements within a basic block share the same coverage, except in the case of crashing and non-terminating faults).

Previous work by Qi et al. [2013] compared the effectiveness of several spectrum-based fault localisation techniques by measuring the average number of candidate patches (NCP score) that were evaluated before a repair was found (using GenProg). On a sub-set of the ManyBugs benchmarks [Le Goues et al., 2015], Qi et al. [2013] found that the Jaccard metric (discussed in more detail in Section 4.1) yielded the lowest NCP. Almost all of the solutions that were found during the search were subject to overfitting, however, and so it is unclear whether Jaccard was more effective at localising the fault or simply better at selecting areas prone to overfitting. Regardless, the accuracy of spectrum-based fault localisation is simply not good enough for the purposes of repair. Often, hundreds of statements will be assigned the same suspiciousness score—a value which determines the probability of a statement being selected as a subject for repair.

Whilst there are a number of complementary methods for fault localisation, discussed in Section 4.1, almost all of these are performed in an offline, preprocessing step. In this chapter, we turn our attention to the young and promising field of Mutation-Based Fault Localisation (MBFL) [Moon et al., 2014a; Papadakis and Le Traon, 2015] to determine whether information collected over the course of the search, in the form of mutant test suite results, can be harnessed to refine the fault localisation online. More specifically, we ask:


RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

To answer this question, we first perform an exploratory analysis of the test case results of randomly sampled single-edit mutants within GenProg's search space. Based on the findings of this analysis, we propose a number of fault localisation techniques which incorporate knowledge of mutant test suite evaluations. In contrast to the impressive results demonstrated by previous studies on MBFL, we find that the positive contributions of this knowledge to the accuracy of the fault localisation are negligible. In Section 4.4, we speculate on the reasons for this result and discuss alternative means of improving the fault localisation process.

In summary, the primary contributions of this chapter are the following:

• A detailed analysis of the test suite results of mutants sampled from GenProg’s search space, covering 28 bugs in six real-world C programs.

• An evaluation of several alternative fault localisation techniques that use the test case outcomes of mutants produced during the search.

• An informed discussion of the limitations of GenProg's statement-level repair actions in identifying faulty locations and how these limitations might be addressed through the use of more granular repair actions.

The rest of this chapter is structured as follows: Section 4.1 presents a brief review of several approaches to automated fault localisation, the majority of which have not been applied to automated program repair. Section 4.2 conducts an analysis of the test case outcomes of mutants randomly sampled from GenProg's search space, in an effort to determine whether test suite behaviour can be used to discriminate faulty statements from correct ones. Section 4.3 proposes and evaluates a number of fault localisation measures based on the findings of the previous section. Finally, Section 4.4 summarises the findings of the study and discusses a number of possible explanations for the lack of noticeable improvement, as well as future directions for research.

4.1. Background

In this section, we provide the reader with a brief introduction to a number of different methods for automated fault localisation.


Spectrum-Based Fault Localisation

Spectrum-Based Fault Localisation (SBFL), also referred to as statistical fault localisation [Landsberg et al., 2015], operates by assigning each statement in the program a suspiciousness value s ∈ R, encoding the likelihood that the statement is (partly) responsible for the fault. In most cases, SBFL proceeds to rank each statement within the program by its suspiciousness. An exception is automated repair, where most approaches use the suspiciousness values themselves to better decide exactly how resources should be spent during the search for a repair.

When used to manually identify and repair bugs, suspiciousness information is visualised by highlighting source code lines within a viewer with different brightness and hue components, allowing suspicious statements to be quickly identified, as illustrated in Figure 4.1 [Jones et al., 2002]. Through such visualisation, developers can better decide where to focus their search for the bug, leading to a substantial reduction in the time and effort involved in debugging [Jones et al., 2002].

To generate a set of suspiciousness values, SBFL first generates a program spectrum, concisely summarising the coverage data in the form of an n-by-4 array, where n is the number of statements within the program [Yoo, 2012]. Each row in the array encodes a summary of the execution data for a given statement, in the form of four integer variables, (ep, ef, np, nf). ep and ef state the number of times that the statement was executed by a passing and failing test case, respectively. Conversely, np and nf specify the number of times the statement was not executed by a passing and failing test case. These counters may record the number of times a statement was executed in total (i.e., multiple executions of a statement for a given test case may count), or, as is more often the case, the number of test cases that executed the given statement (i.e., only a single execution is recorded for each test case).

Once the program spectrum has been generated, SBFL proceeds to compute suspiciousness values for each statement through the use of a suspiciousness metric, also known as a risk evaluation formula. This metric maps each row in the program spectrum, (ep, ef, np, nf), to a suspiciousness value, represented by a real number. By applying this mapping across the entire spectrum, the array is reduced to a list describing the relative suspiciousness values of each statement in the program. An example of a program spectrum can be found in Section 2.1.
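The following sketch, assuming per-test statement coverage is already available as plain Python sets, shows how the (ep, ef, np, nf) spectrum described above can be assembled by counting test cases (one count per test); it is an illustration, not the implementation used elsewhere in this thesis.

    def build_spectrum(coverage, passed):
        # coverage: maps each test name to the set of statements it executes.
        # passed:   maps each test name to True (passing) or False (failing).
        statements = set().union(*coverage.values())
        spectrum = {}
        for s in statements:
            ep = sum(1 for t in coverage if s in coverage[t] and passed[t])
            ef = sum(1 for t in coverage if s in coverage[t] and not passed[t])
            n_p = sum(1 for t in coverage if s not in coverage[t] and passed[t])
            n_f = sum(1 for t in coverage if s not in coverage[t] and not passed[t])
            spectrum[s] = (ep, ef, n_p, n_f)
        return spectrum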

Spectrum

The term “spectrum” was introduced to the field by Reps et al. [1997], and later generalised by Harrold et al. [2000]. Generally, the term is taken to refer to the frequency of statement execution, whether the frequency be a measure of the number of test cases that execute a given statement, or the total number of times a statement is executed by all (positive or negative) test cases. Alternatively, when used as a debugging aid for humans, lines may be used as the entities of interest when computing the spectrum, rather than statements. Similarly, one can monitor the execution of function calls, program paths, n-grams, program slices, use-def chains, invariants, and more [Naish et al., 2011; Renieris and Reiss, 2003].

Figure 4.1: A screenshot of the original Tarantula fault localisation visualisation tool. Lines coloured red are executed primarily by the failing test cases, suggesting a high suspiciousness. Lines that are mostly covered by the passing test cases are coloured green. Brightness is used to indicate a form of confidence in the suspiciousness attributed to a given line: the brightness of a line is given by the greater of the fraction of failing tests covering the line and the fraction of passing tests covering it.

Source: http://spideruci.org/fault-localization/ (May 2017).

Suspiciousness Metric

As numerous studies have shown [Jones and Harrold, 2005; Qi et al., 2013; Yoo, 2012], the effectiveness of SBFL can be attributed to the choice of suspiciousness metric, of which there are many. Since the publication of Tarantula [Jones et al., 2002], there has been a proliferation of suspiciousness metrics. Over the rest of this section, we shall investigate a handful of the most popular and successful metrics put forth in the literature; a small code sketch implementing several of them follows the list.

• Set Union and Intersection: Two of the simplest suspiciousness metrics are the set union and set intersection metrics [Renieris and Reiss, 2003]. These metrics assign a unitary suspiciousness value to all statements that belong to a given set. The sets for the set union and set intersection metrics are given in Equations 4.1 and 4.2, respectively:

    f - \bigcup_{s \in S} s    (4.1)

    \bigcap_{s \in S} s - f    (4.2)

where f is the set of statements covered by a (singular) failing run, each s ∈ S is the set of statements covered by a given passing run, and − represents the non-symmetric set difference.

In its standard form, the set union metric assigns a unitary suspiciousness value to each statement within the difference between the set of statements executed by a single failing test case and the union of the statements executed by the passing test cases. In cases where the faulty statement is executed by both the passing and failing test cases, such as an incorrect conditional, the faulty statement will be missed entirely. The intuition behind the set union metric is that the bug is to be found within the statements uniquely executed by the failing test case.

The set intersection metric, on the other hand, is built upon the polar opposite intuition: the faulty statement is absent from the failing test case. Whilst the set difference may aid the programmer's understanding of the faulty test case behaviour, it also prevents the program from being repaired automatically, as all changes will go unnoticed (since none of the mutated statements are ever executed by the failing test case).

91 FAULT LOCALISATION

Pan and Spafford [1992] and Chen and Cheung [1997] propose soft extensions to these metrics, which incorporate frequency of execution, similar to the majority of other suspiciousness metrics.

• Nearest Neighbour: A slightly more advanced technique than the set union and set intersection methods discussed above is the nearest neighbour metric, introduced by Renieris and Reiss [2003].

This approach first determines the passing test case with the smallest distance to the failing test, where distance is measured as the cardinality of the non-symmetric difference between the failing test spectrum and the passing test spectrum, as shown by Equation 4.3:

F − P (4.3)

where P is the set of statements covered by a passing test, and F is the set of statements covered by the failing test.

Once the passing test with the closest behaviour to the failing test, as measured by the distance metric, has been found, all statements within F − P are assigned a unit suspiciousness value.

• Tarantula: A highly popular metric, credited as the first system to apply spectrum-based ranking to software diagnosis for an imperative language [Jones and Harrold, 2005; Naish et al., 2011]. Originally designed as a means of visualising suspicious locations within a faulty program [Jones et al., 2002], but later adapted and more widely used as a suspiciousness metric in its own right [Jones and Harrold, 2005]; an adapted form of this metric is given in Equation 4.4.

    \frac{\frac{e_f}{e_f + n_f}}{\frac{e_f}{e_f + n_f} + \frac{e_p}{e_p + n_p}}    (4.4)

Although not optimal, Tarantula has been shown to give good results for automated repair [Qi et al., 2013] and has served as the suspiciousness metric of choice for several automated repair systems [Ke et al., 2015; Kim et al., 2013; Le Goues et al., 2012a; Qi et al., 2014].

• GenProg: The default suspiciousness metric in GenProg, given by Equation 4.5, attaches a minimal, non-zero weight to statements that are executed by both the positive and negative test cases, and a maximal weighting to statements that are executed by the negative test exclusively. Statements that are not executed by the negative test case are assigned a suspiciousness of zero, and excluded from consideration.


    \begin{cases} 1.0 & \text{if } e_f > 0 \wedge e_p = 0 \\ 0.1 & \text{if } e_f > 0 \wedge e_p > 0 \\ 0 & \text{if } e_f = 0 \end{cases}    (4.5)

• Op1 and Op2: Op1 and Op2, given in Equations 4.6 and 4.7, respectively, are two suspiciousness metrics introduced by Naish et al. [2011]. Unlike previous metrics, Op1 and Op2 are designed to perform optimally for single-fault programs on a model program, ITE2, composed of two consecutive If-Then-Else statements. Beyond theoretical optimality, empirical observations have also demonstrated the strong performance of these metrics on a widely studied sub-set of single-fault bugs from the SIR dataset.

    \begin{cases} -1 & \text{if } n_f > 0 \\ n_p & \text{otherwise} \end{cases}    (4.6)

    e_f - \frac{e_p}{e_p + n_p + 1}    (4.7)

• Jaccard and Ochiai: Jaccard [Jaccard, 1901], given in Equation 4.8, and Ochiai [Ochiai, 1957], given in Equation 4.9, are both examples of suspiciousness metrics that have been adapted from existing similarity metrics in other fields [Abreu et al., 2007; Naish et al., 2011].

    \frac{e_f}{e_f + n_f + e_p}    (4.8)

    \frac{e_f}{\sqrt{(e_f + n_f) \cdot (e_f + e_p)}}    (4.9)

• Evolved: Yoo [2012] transforms the problem of deciding a suitable metric into a search problem, using genetic programming to evolve an effective metric for automated repair. The results of this search are a set of metrics that outperformed the most successful human-designed metrics proposed up to that point, including Op1, Op2, and Jaccard.
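The sketch below implements several of the metrics above as plain functions over a single spectrum row (ep, ef, np, nf); the guards against division by zero are an implementation convenience rather than part of the published definitions.

    import math

    def tarantula(ep, ef, n_p, n_f):
        # Equation 4.4.
        fail_ratio = ef / (ef + n_f) if (ef + n_f) else 0.0
        pass_ratio = ep / (ep + n_p) if (ep + n_p) else 0.0
        denominator = fail_ratio + pass_ratio
        return fail_ratio / denominator if denominator else 0.0

    def genprog_weighting(ep, ef, n_p, n_f):
        # Equation 4.5.
        if ef > 0 and ep == 0:
            return 1.0
        if ef > 0 and ep > 0:
            return 0.1
        return 0.0

    def op2(ep, ef, n_p, n_f):
        # Equation 4.7.
        return ef - ep / (ep + n_p + 1)

    def jaccard(ep, ef, n_p, n_f):
        # Equation 4.8.
        denominator = ef + n_f + ep
        return ef / denominator if denominator else 0.0

    def ochiai(ep, ef, n_p, n_f):
        # Equation 4.9.
        denominator = math.sqrt((ef + n_f) * (ef + ep))
        return ef / denominator if denominator else 0.0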

Comparison

Naish et al. [2011], and later Landsberg et al. [2015], proved that many of the proposed suspiciousness metrics are equivalent for the purposes of ranking, by exploiting their monotonicity, i.e., m1(x) < m1(y) =⇒ m2(x) < m2(y). Although this is an insightful result that goes some way towards explaining the successes of these metrics, for the purposes of automated repair, we are more interested in the absolute difference in scores between statements, so that we may more accurately decide how to divide our efforts.


To compare the effectiveness of different spectrum-based fault localisation techniques, Naish et al. [2012] formally introduce the Expense measure, which quantifies accuracy as the fraction of statements within the program that would need to be checked by the programmer before a faulty statement is found, were they to examine them in rank order.

Although Parnin and Orso [2011] find that ranking statements by their suspiciousness score allowed programmers to find a bug more quickly, work by Qi et al. [2013] demonstrates that suspiciousness metrics with a low Expense are not necessarily the best metrics for automated repair. As an alternative, Qi et al. [2013] propose measuring the number of candidate patches (NCP) produced by GenProg before a repair is found. To account for the stochastic nature of the search, the process must be repeated a fixed number of times and averaged.

Comparing the Expense and NCP measures of a number of proposed suspiciousness metrics across 11 benchmark programs taken from the GenProg ICSE 2012 benchmark suite [Le Goues et al., 2012a], Qi et al. [2013] found Jaccard to be the best performing metric, despite it previously being shown to produce sub-optimal rankings by Naish et al. [2012].

Following the proposal of the NCP measure, Moon et al. [2014a] introduced an alternative evaluation metric: Locality Information Loss, or LIL for short. Like NCP, LIL is able to approximate the difficulty of finding a repair for a given bug using APR techniques. Importantly, it does so without the need to perform expensive APR trials.

To determine the accuracy of a given set of suspiciousness values, LIL measures the distance between the probability distribution implicitly defined by those suspiciousness values, Pτ, and an ideal probability distribution, PL. Pτ and PL are used to describe the probability of an APR technique selecting a particular statement as the target location of the repair. Thus, the ideal probability distribution, PL, is defined as the distribution in which each of the faulty lines is assigned a maximal probability; all other correct lines are given an ε probability of selection, for reasons of numerical stability. The distance between Pτ and PL is computed using the Kullback-Leibler divergence, given in Equation 4.10:

    D_{KL}(P_L \| P_\tau) = \sum_{i} \ln\left(\frac{P_L(s_i)}{P_\tau(s_i)}\right) P_L(s_i)    (4.10)

where P(s_i) gives the probability of selection for the statement s_i.
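A minimal sketch of this computation is given below. It assumes suspiciousness values and the set of faulty statements are available as plain Python structures; normalising the suspiciousness values into a probability distribution is one reasonable reading of Pτ, not a prescription taken from the original paper.

    import math

    def kl_divergence(p_l, p_tau, eps=1e-9):
        # Equation 4.10: D_KL(P_L || P_tau); eps guards against division by zero.
        return sum(pl * math.log(pl / max(p_tau.get(s, eps), eps))
                   for s, pl in p_l.items() if pl > 0)

    def lil(suspiciousness, faulty_statements, eps=1e-9):
        # P_tau: suspiciousness values normalised into a selection distribution.
        total = sum(suspiciousness.values()) or eps
        p_tau = {s: max(v, eps) / total for s, v in suspiciousness.items()}
        # P_L: (near-)maximal probability on the faulty statements, eps elsewhere.
        p_l = {s: (1.0 if s in faulty_statements else eps) for s in suspiciousness}
        z = sum(p_l.values())
        p_l = {s: v / z for s, v in p_l.items()}
        return kl_divergence(p_l, p_tau)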

On a single bug from the GenProg TSE 2012 benchmarks (look-utx-4.3), Moon et al. [2014a] showed a correlation between LIL and NCP for a number of different spectrum-based fault localisation techniques. As previously demonstrated by Qi et al. [2013], Moon et al. [2014a] showed that the EXAM score appeared to have no predictive value.


Program Slicing

Program slicing is a technique for reducing programs to a minimal form, known as a slice, which still retains some given semantics of interest, known as the slicing criterion [Harman and Hierons, 2001]. Slices are generated by removing all parts of the given program which can be determined to have no effect on the slicing criterion. Slicing has been used for aiding program comprehension, debugging, dead code removal, cohesion analysis, and more [Silva, 2012]. Importantly, slicing may be used as a fault localisation technique by reducing the program to only those parts which have an effect on the location at which the failure was observed. Naturally, this prevents slicing from being used to find memory leaks, since the point at which the program crashes—when it runs out of memory—may have no relation to the fault.

Static Slicing

Program slicing was first introduced in 1979, as part of Mark Weiser's Ph.D. thesis [Weiser, 1979], in the form of a technique that would later become known as static program slicing [Weiser, 1981]. Static program slicing computes the set of statements S, referred to as the program slice, that may affect the value of a variable of interest v ∈ V at a statement x. Collectively, the set of variables of interest V and the statement of interest x are termed the slicing criterion, given as C = (x, V). The slice for C = (x, V) can also be computed from the union of the slices of each of its variables v ∈ V, as shown in Equation 4.11.

    C = (x, V) = \bigcup_{v \in V} (x, \{v\})    (4.11)

There are two types of program slice: the backward slice and the forward slice. A backward slice contains all statements in a given program P that may have an effect upon C, whilst a forward slice consists of the statements that C may affect. The program slice is usually computed by gradually removing statements from the program P which are irrelevant to the slicing criterion, leaving a minimal program which retains the desired properties of interest. By employing backwards slicing, we can remove all statements from the program which can be shown to have no (static) effect upon C, thus reducing the number of statements the programmer needs to consider and the debugging effort. An example of a backwards static slice is given in Figure 4.2.

Whilst backwards slicing can help to reduce the program to a smaller form, quite often the resulting slice remains of a considerable size. This is especially the case for large, well-constructed programs, where statements exhibit a high degree of cohesion; in such programs, the value of each variable is dependent on many others.


    Original program:

    1   read(text);
    2   read(n);
    3   int lines = 1;
    4   int chars = 1;
    5   string subtext = "";
    6   c = getChar(text);
    7   while (c != EOF) {
    8     if (c == '\n') {
    9       lines = lines + 1;
    10      chars = chars + 1;
    11    } else {
    12      chars = chars + 1;
    13      if (n != 0) {
    14        subtext = subtext << c;
    15        n = n - 1;
    16      }
    17    }
    18    c = getChar(text);
    19  }
    20  write(lines);
    21  write(chars);
    22  write(subtext);

    Backwards static slice for the criterion (20, lines):

    1   read(text);
    3   int lines = 1;
    6   c = getChar(text);
    7   while (c != EOF) {
    8     if (c == '\n') {
    9       lines = lines + 1;
    11    }
    18    c = getChar(text);
    19  }
    20  write(lines);

Figure 4.2: An example of backwards static slicing on a small program. The second listing shows the sliced form of the program in the first, where (20, lines) is set as the slicing criterion. Adapted from an example given in [Silva, 2012].


Dynamic Slicing

Whereas static slicing determines the set of statements that could affect the slicing criterion under all possible conditions, dynamic slicing [Korel and Rilling, 1998] produces the set of only those statements that are actually executed when the program is supplied with a certain input sequence [Harman and Hierons, 2001]. This technique builds upon static slicing, constructing a dynamic slice instead, according to a dynamic slicing criterion. The dynamic slicing criterion is almost identical to its static counterpart, except for the addition of an input sequence i, describing the particular set of inputs to the program for which a slice should be produced; it is described using the form C = (x, V, i). In the case of APR, each test case can be used to provide a particular input sequence.

Using a dynamic slice, rather than a static slice, tends to give a sizeable reduction in the number of statements that need to be considered when debugging the program for a particular failing test case. However, this reduction comes at the cost of having to compute the specific control and data-flow dependencies produced by the given input sequence; a problem which requires the resource-intensive construction of a complex data structure, such as a trace.

Other Forms of Slicing

A recent survey of program slicing techniques [Silva, 2012] listed 29 variants. Below, we briefly focus on the sub-set of these variants which are immediately applicable to fault localisation.

• Conditioned Slicing: Rather than slicing according to a specific execution of the program using a given input sequence, conditioned slicing [Ning et al., 1994] allows the slicing criterion to specify a boolean expression that must be satisfied by the program inputs. For instance, one may provide the boolean expression F: x = y − 1, instructing the slicer to consider only the possible executions of the program when its inputs x and y satisfy F. Using this information, the slicer is able to prune unreachable statements, producing a slice that is at least as small as the static slice.

• Relevant Slicing: Whilst dynamic slicing yields the slice of program statements that actually had an effect upon the slicing criterion, the relevant slice is a superset of the dynamic slice, including the set of statements that could have affected the slicing criterion [Agrawal et al., 1993]. The slicing criterion used remains the same as that used by dynamic slicing.

This technique can be used to find statements whose contamination prevents them from impacting upon the variables of interest within the slicing criterion. In such cases, where a statement does not affect the variables of interest, the dynamic slice will fail to include the faulty statement.


• Hybrid Slicing: Hybrid slicing [Gupta and Soffa, 1995] is an alternative slicing technique that avoids the need to compute the large data structures, on the order of gigabytes, incurred by dynamic slicing, whilst increasing precision over static slicing by still using dynamic information. To achieve this feat, hybrid slicing incorporates knowledge of which statements have been executed (acquired from the coverage information) into the static analysis, removing from the slice those statements which lie on a non-executed possible path of the computation [Silva, 2012].

• Interprocedural Slicing: The original form of program slicing has since been reclassified as intraprocedural slicing, as static slicing fails to exploit the boundaries of procedure calls, leaving statements within the slice that have no effect upon the criterion [Silva, 2012]. To address the loss of precision, Horwitz et al. [1990] use a System Dependence Graph (SDG) to capture dependencies within the program. The SDG allows information about the calling context of procedures to be taken into consideration by the slicer [Silva, 2012].

• Quasi-Static Slicing: The quasi-static slicing criterion [Venkatesh, 1991] allows a slice to be produced for a program where some of its inputs are fixed but the rest are unknown.

• Call-Mark Slicing: Similar to hybrid slicing, call-mark slicing provides a less expensive alternative to dynamic slicing by using the knowledge of which statements were executed, together with the program's input sequence, to yield a smaller program dependence graph (PDG) by pruning unvisited statements [Nishimatsu et al., 1999].

• Dependence-Cache Slicing: Similar to call-mark slicing, dependence-cache slicing uses dynamic information to remove irrelevant statements from the slice, via a richer PDG that exploits the possible data dependence relationships produced by the input data [Silva, 2012; Takada et al., 2002]. With this information, dependence-cache slicing is able to prune infeasible edges from the PDG, yielding a slice that, on average, tends to be smaller than a call-mark slice.

• Program Dicing: Uses the set difference of at least two slices to remove statements which appear to be correct (determined from a set of slices for passing inputs) from the slice of a failing input [Lyle, 1987], exploiting the intuition that statements belonging exclusively to the negative slice are more likely to contain the fault.

• Stop-List Slicing: Stop-list slicing [Gallagher et al., 2006] augments the slicing criterion with a stop list, specifying which variables are considered uninteresting and should be omitted from the slice. The stop-list is realised by removing all assignments to variables within the list and all data dependencies stemming from those assignments.

• Barrier Slicing: Allows the programmer to control which regions of the program may be considered when building the slice, and which areas may not, allowing the programmer to exclude trusted, correct code from the slice [Krinke, 2003, 2004]. This ability is made possible through the introduction of barriers in the slicing criterion, specifying which nodes or edges of the PDG may not be passed during the graph traversal [Silva, 2012].

• Concurrent Slicing: As concurrent programs cannot be represented using the PDG or SDG, concurrent slicing must operate on extended, thread-aware forms of the CFG and the PDG, known as the threaded CFG (tCFG) and threaded PDG (tPDG), respectively [Silva, 2012]. When threads are synchronised or allowed to communicate, a new level of complexity is added to the analysis in the form of the interference dependencies introduced between statements. Such dependencies highlight statements that share a data dependency and may be executed in parallel. Those dependencies are not transitive, however, unlike control and data dependencies, and so concurrent slicing techniques must treat them specially. For a recent overview and evaluation of concurrent slicing techniques, the reader is referred to [Giffhorn and Hammer, 2007].

• External State Aware Slicing: Sivagurunathan et al. [1997] demonstrated that standard slicing techniques fail to account for the modification of external state through I/O operations, potentially leading to incorrect slices. To address this shortcoming, they associated each I/O operation with its own special variable, allowing the external state to be observed by the slicer. To achieve this, however, the original program must be subject to a transformation which maps the original program to its equivalent in a special language that includes these external state variables. Later, further revisions to the slicing process were introduced by Tan and Ling [1998], and Willmor et al. [2004], which allowed the slicer to account for database operations.

Delta Debugging

Delta Debugging [Zeller and Hildebrandt, 2002] is an efficient method for determining the cause of regressions (i.e., changes to the program that introduce new bugs) between two software versions in worst-case linear time, with respect to the number of changes introduced between the two versions. By analysing the set of differences C, or deltas, between a passing version of the software, referred to as the baseline, and a failing version, the DD algorithm is capable of finding a minimal set of failure-inducing changes c ⊆ C. This technique allows users to find the minimal set of changes that explain a regression. DD is capable of finding the minimal set of failure-inducing changes c, where |c| > 1 and ∀c′ ⊂ c: result(c′) ≠ PASS, without having to test each of the 2^n possible sets of changes. To find the minimal sub-set, DD employs a variant of binary search to divide and conquer the set of all changes. At each step, the algorithm halves the set of changes into c1 and c2, and recursively searches the half which continues to fail, until c contains only a single edit. In the event where neither c1 nor c2 contains the failure-inducing changes, known as interference, the algorithm continues the binary search on each half without discarding any changes.
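The following is a minimal sketch of the divide-and-conquer strategy described above, not a faithful reproduction of Zeller and Hildebrandt's algorithm; it assumes a user-supplied fails oracle that applies a list of changes to the passing baseline and reports whether the failure is observed.

    def dd(changes, fails, context=()):
        # Return a small failure-inducing subset of `changes`, assuming that
        # `changes` applied together with `context` reproduces the failure.
        if len(changes) <= 1:
            return list(changes)
        mid = len(changes) // 2
        c1, c2 = list(changes[:mid]), list(changes[mid:])
        if fails(c1 + list(context)):
            return dd(c1, fails, context)
        if fails(c2 + list(context)):
            return dd(c2, fails, context)
        # Interference: neither half fails on its own, so minimise each half
        # while keeping the other applied, and combine the results.
        return (dd(c1, fails, tuple(c2) + tuple(context)) +
                dd(c2, fails, tuple(c1) + tuple(context)))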


A more advanced delta-debugging algorithm, DD+ [Zeller and Hildebrandt, 2002], removes a number of assumptions within DD to allow the approach to be applied in cases where certain changes may prevent the program from compiling or executing correctly in the absence of particular changes. To achieve this ability, the efficiency of the algorithm is degraded, but steps are taken to ensure that, in most cases, the algorithm is capable of finding results in a reasonable amount of time.

Although delta-debugging can be an effective technique for finding the cause of regressions, the associated cost of testing at least log n variants, and its limited scope, prevent it from being more widely used as a means of fault localisation within automated repair. However, through a slight rephrasing, the delta-debugging algorithm can be applied to a wider set of problems beyond fault localisation. Instead, we can use DD to find the minimal sub-set c ⊆ C which retains some arbitrary property of interest, I. Exploiting the wider applicability of this technique, GenProg uses the original delta-debugging algorithm as a post-processing step to minimise the sequence of edit operations within the patch produced by the search [Weimer et al., 2009].

Predicate-Based Fault Localisation

Whereas most fault localisation techniques attempt to localise the fault to a particular statement—especially those applied within the context of program repair—B. Le et al. [2016] propose a novel technique, Savant, for identifying faulty methods within Java programs. To perform localisation, Savant uses Daikon [Ernst et al., 2007], an off-the-shelf likely-invariant mining tool, to extract plausible invariants for each of the methods within the program. Using Daikon, Savant produces two sets of invariants: one based on observations of the passing test cases, and another for the negative tests. To determine the set of suspicious methods, Savant uses Daikon's Diff feature to find invariant changes (e.g., "x is less than zero" becomes "x is greater than zero") between the positive and negative invariant sets; methods which exhibit such changes in their invariants are deemed to be suspicious.

Given the prohibitive costs of execution trace collection and invariant mining for large-scale programs, Savant uses method clustering and test case selection heuristics to reduce the costs of these processes. To begin with, Savant excludes from consideration all methods that are not executed by any of the negative tests. The remaining methods are thereafter grouped into clusters on the basis of the positive tests that cover them; methods that are covered by the same positive tests are more likely to be grouped together. After identifying method clusters, Savant produces execution traces for each cluster—rather than an individual execution trace for each method—across all of the negative test cases and a sub-set of passing tests. This sub-set is selected for coverage adequacy using a greedy algorithm. Specifically, Savant finds a sub-set of positive test cases where each method within the cluster is covered by at least T test cases (where T is a tunable parameter). Using the generated execution traces for each cluster c, Savant generates the following sets of invariants across each of the methods contained within the cluster:


• inv(Fc), the set of invariants over the negative test cases that cover c.

• inv(Pc), the set of invariants over the positive test cases that cover c.

• inv(Fc ∪ Pc), the set of invariants over all test cases that cover c.

Using those invariant sets, Savant produces a series of invariant change features, usable by its learn-to-rank model. To generate these features, Savant uses Daikon's Invariant Diff tool to describe the addition, removal, and modification of invariants between the mined invariant sets. From these differences, Savant yields 290,163 features, each specifying the frequency of a particular kind of invariant change (e.g., NonZero to EqualZero) between a given pair of invariant sets, based on its 311 invariant types. Together with these invariant change features, Savant also uses a set of ten suspiciousness score features, where each represents the suspiciousness value computed for the particular method by a state-of-the-art SBFL metric. Before these features are used for learning or ranking, each of them is normalised within the range [0, 1].

To learn a model that ranks methods by their likelihood of being faulty, on the basis of their extracted features, Savant applies rankSVM [Joachims, 2002]—a popular, off-the-shelf learning-to-rank algorithm—to a corpus of bug fixes. The learning algorithm is fed the methods modified by the programmer as its ground truth for the location of the fault, together with the extracted feature set. Once the model has been learnt, Savant returns a ranked list of methods for a given program, computed using its extracted feature set.

Out of 357 bugs in 5 programs sourced from the Defects4J dataset [Just et al., 2014], Savant correctly localises 63.08, 101.72, and 122 buggy methods on average within the top 1, top 3, and top 5 of the ranked lists it generates, respectively. This corresponds to a 57.73%, 56.69%, and 43.13% improvement over the best performing SBFL techniques, at the top 1, top 3, and top 5 positions, respectively.

Although Savant provides a significant boost in the accuracy of method-level fault localisation, its rank-based nature prevents it from describing the degree of suspiciousness of each method. Studies by Qi et al. [2013] showed that the best-performing fault localisation technique for automated program repair, as measured by the (average) number of candidate patches until a repair is found, was sub-optimal in terms of its EXAM score. To efficiently find faults, automated program repair needs more detailed localisation information. It may be possible to replace the learning-to-rank algorithm used by Savant with a more conventional SVM algorithm, and to train the model to compute suspiciousness values directly. The objective could be to reduce the LIL score of the resulting suspiciousness value distribution—a metric that has been shown to be correlated with NCP [Moon et al., 2014b].


MUSE

Mutation-based Fault Localisation, or MUSE, is a relatively new fault localisation technique introduced by Moon et al. [2014a]. Taking inspiration from mutation testing, MUSE attempts to determine the location of a fault by mutating the program and observing the effect upon the results of the test suite. MUSE assumes that changes made to correct program statements, rather than faulty statements, are more likely to introduce new bugs, which can be detected by the positive test cases. MUSE attempts to exploit this assumption to distinguish between faulty and correct statements using the results of a mutation analysis. Empirical evaluation on a set of 12 bugs from the ICSE 2012 benchmarks showed MUSE to produce 25 times more precise information than the best SBFL technique in terms of Expense, and 2.29 times more precise than Op2 according to the more appropriate LIL metric.

Let P be a faulty program, which passes all positive test cases and fails all negative cases, and let m_f and m_c be mutants of P which mutate the faulty statement and a correct statement, respectively. Given these definitions, MUSE is founded on two conjectures.

The first conjecture is that negative test cases are more likely to pass on m_f than on m_c, since faulty programs are usually fixed by modifying the faulty statement (and are more difficult to fix through mutations to correct statements). This reasoning is based on the observation that m_f must satisfy exactly one of the following three outcomes:

1. Equivalent mutant: the mutant introduces a syntactic change to the program but retains the same semantics, in which case the result of the negative test case execution is unchanged, as the bug still remains.

2. Non-equivalent, faulty mutant: the mutant modifies the semantics of the program and continues to fail the negative test cases. The bug may or may not be the same as the original fault. Negative test cases are still considered more likely to fail than to pass on m_f.

3. Non-equivalent, non-faulty mutant: the mutant no longer fails the negative test cases, and the program is considered to be fixed, with respect to the test suite (the resulting mutant may cheat the test suite or introduce new bugs, but neither of these is detectable using the original test suite alone).

As a result of the above, and the observation that modifications to faulty statements within the program are more likely to yield a repair than changes to correct ones, the number of passing negative test cases should be larger for m_f than for m_c.

The second conjecture is that positive test cases are more likely to fail for m_c than for m_f. As with the first conjecture, this conjecture is based on the observation that m_c must exist in one of two states:

1. Equivalent (neutral) mutant: the mutant retains the same semantics as the original program, with respect to the test suite, and should therefore have the same test case results.


2. Non-equivalent mutant: by definition, a non-equivalent mutant must fail at least one of the positive test cases, thereby introducing a fault into the program.

As a result of these two conjectures, and based upon observations, it is believed that mutating the faulty statement is more likely to generate a (partial) fix, and that mutating a correct statement is more likely to fail positive test cases. By generating mutants and observing their test case results, MUSE exploits these conjectures to locate the source of the bug. MUSE incorporates this information into a new suspiciousness metric µ, given in Equations 4.12 and 4.13 (with minor changes in notation):

    \mu(s) = \frac{1}{|mut(s)|} \sum_{m \in mut(s)} \mu_m(s)    (4.12)

    \mu_m(s) = \frac{|f_P(s) \cap p_m|}{|f_P|} - \alpha \cdot \frac{|p_P(s) \cap f_m|}{|p_P|}    (4.13)

where s is a statement of P, f_P(s) is the set of negative tests covering s, and p_P(s) is the set of positive tests covering s. mut(s) defines the set of all non-neutral mutants of P with changes at s, {m_1 ... m_k}; note that neutral mutants are ignored, as they do not provide useful information based on the underlying conjectures. For each mutant m_i ∈ mut(s), let f_{m_i} and p_{m_i} be the set of failing and passing tests on m_i, respectively. Importantly, MUSE discards all neutral mutants from consideration, prior to calculating µ; these mutants are deemed to contain no relevant information.

µ_m(s) is used to calculate the contribution of knowledge from a mutant's test case results, based on the conjectures. Its first term describes the proportion of negative test cases that now pass on a mutant m, whilst its second term gives the proportion of positive test cases that now fail on a mutant m. When µ_m(s) is summed and averaged over mut(s), these terms become the probability of a change in test case result (i.e., positive test failure or negative test success), reflecting the intuition behind MUSE. Passing of a negative test case will increase the suspiciousness of s, whereas failing a positive test will decrease its suspiciousness. As it is more likely that a positive test case will fail for m than a negative test case will succeed, a weighting term α is introduced to balance these probabilities by compensating for the difference in their averages. The equation for α is given in Equation 4.14, where f2p and p2f describe the number of test case results which change across all mutants—note that α does not require a priori knowledge of the bug.

    \alpha = \frac{f2p}{|mut(P)| \cdot |f_P|} \cdot \frac{|mut(P)| \cdot |p_P|}{p2f}    (4.14)


As a result of this balancing, when the second term is subtracted from the first term, a baseline value of zero is produced. For the faulty statements, the first term is expected to be higher, whilst the second term is expected to be lower, than for the correct statements in P (w.r.t. the test suite). Equipped with this suspiciousness metric, MUSE generates suspiciousness values for each of the statements within P by following the steps below.

1. MUSE computes coverage information for P using a given test suite T, finding the set of statements that are executed by the negative test cases, since all of the (manifested) faults are guaranteed to be within this set. To calculate coverage information, MUSE uses gcov. This decision restricts MUSE from handling crashing programs, such as those with segmentation faults or memory leaks. Note, this limitation is purely a technical one, which could be overcome through the use of an alternative coverage monitoring tool. However, this decision also makes it unclear whether MUSE is equally as effective for crashing bugs.

2. A set of mutants is then generated by mutating each of the candidate statements identified in the previous step. These mutants are created by generating a single mutant at each of its identified mutation points using Proteum [Maldonado et al., 2001].

3. From the test case outcomes of all mutants, MUSE proceeds to compute µ(s) for each statement in the program.

To improve the effectiveness of MUSE in cases where mutants for a given statement are absent, and to make use of existing spectrum-based fault localisation, Moon et al. [2014b] introduce HybridMUSE. Given a set of suspiciousness values produced by MUSE, µ_MUSE, and a set of suspiciousness values generated by an SBFL technique, µ_SBFL, HybridMUSE combines these values for each statement, as illustrated by Equation 4.15:

\mu_{HybridMUSE}(s) = norm(\mu_{MUSE}, s) + norm(\mu_{SBFL}, s)    (4.15)

where norm(µ, s), described in Equation 4.16, is used to normalise a given suspiciousness value such that the minimum value is set to zero, and the maximum is set to one:

norm(\mu, s) = \frac{\mu(s) - \min(\mu)}{\max(\mu) - \min(\mu)}    (4.16)
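A minimal sketch of this combination, assuming the suspiciousness values are held in dictionaries keyed by statement (the function names norm and hybrid_muse are illustrative):

```python
from typing import Dict

def norm(mu: Dict[str, float]) -> Dict[str, float]:
    """Equation 4.16: rescale suspiciousness values so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(mu.values()), max(mu.values())
    if hi == lo:
        return {s: 0.0 for s in mu}   # degenerate case: all values identical
    return {s: (v - lo) / (hi - lo) for s, v in mu.items()}

def hybrid_muse(mu_muse: Dict[str, float], mu_sbfl: Dict[str, float]) -> Dict[str, float]:
    """Equation 4.15: sum of the normalised MUSE and SBFL scores for each statement."""
    n_muse, n_sbfl = norm(mu_muse), norm(mu_sbfl)
    return {s: n_muse[s] + n_sbfl[s] for s in mu_muse}
```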

Across 51 faulty versions of 5 programs taken from the Software Infrastructure Repository, Moon et al. [2014b] find that HybridMUSE is 4.08 times more precise than MUSE, according to the EXAM metric (the fraction of statements that would have to be considered until a fault is found, if one were to order all statements by their suspiciousness). Moon et al. [2014b] speculate that the relative strength of


Sample   EXAM   Rank     LIL
1%       6.20   123.50   6.26
10%      4.60    95.53   6.11
40%      2.79    61.21   5.96
70%      1.83    45.59   5.90
100%     1.65    41.60   5.86

Table 4.1: The average precision of HybridMUSE as its mutant sampling rate is adjusted. Taken from [Moon et al., 2014b].

HybridMUSE over the standard form of MUSE is observed in cases where all mutants at a faulty statement are discarded, causing those statements to be assigned a suspiciousness of zero. Interestingly, despite exhibiting a significant improvement in precision—as measured by the EXAM score—Moon et al. [2014b] find that MUSE is a factor of 1.12 more precise than HybridMUSE when the LIL metric is used instead; the authors choose to discount this result, given the improvements in EXAM. For the purposes of automated program repair, however, this result may prove to be of greater importance. Moon et al. [2014b] conduct a preliminary study into the correlation between EXAM, LIL, and NCP (the average number of candidate patches before a repair is found) on a single automated repair benchmark, “look-utx-4.3”, using a randomly generated test suite. On this single benchmark, they find that EXAM bears little relation to NCP, but that LIL appears to correlate well. Despite impressive gains in precision—as compared to SBFL techniques—both MUSE and HybridMUSE require a vast number of mutants to compute their suspiciousness scores; a lengthy and expensive process. To investigate this further, Moon et al. [2014b] measured the precision of HybridMUSE as the number of mutants sampled at each statement was varied. The results of their investigation, listed in Table 4.1, show a small improvement in LIL as the sample size is increased, and a much larger improvement in terms of EXAM and the rank of the faulty statement. At sample sizes below 40%, HybridMUSE performs worse than Jaccard (LIL: 6.04), the best performing SBFL technique. MUSE and HybridMUSE appear to be better suited to aiding human developers in determining the location of the fault than they are at reducing the difficulty of automated program repair.

Metallaxis

Prior to the introduction of MUSE, Papadakis and Le Traon [2015] proposed an alternative approach to Mutation Analysis Fault Localisation: Metallaxis. As with MUSE, Metallaxis assigns a suspiciousness to each statement within the program based on the test suite outcomes for a set of randomly generated mutants. Unlike MUSE, Metallaxis explicitly assigns suspiciousness to individual mutants—

building on the intuition that mutants whose test suite outcomes are closest to those of the fault are likely to be co-located with it. Each killed mutant m (i.e., a mutant which fails at least one test)1 is assigned a suspiciousness using the Ochiai similarity metric—a commonly used metric for spectrum-based fault localisation—given in Equation 4.17:2

\mu(m) = \frac{\#K_n}{\sqrt{\#K \cdot (\#K_p + \#K_n)}} = \frac{\#K_n}{\#K}    (4.17)

where K is the set of tests that kill m (which may be composed of both positive and negative tests), K_n is the set of (covered) negative tests that kill m, and K_p is the set of (covered) positive tests that kill m. Although MUSE does not explicitly compute suspiciousness values for its mutants, one can view its summation terms as implicit mutant suspiciousness scores. Comparing MUSE with Metallaxis, one observes that Metallaxis reduces the suspiciousness of mutants that are not killed by negative tests. In contrast, MUSE significantly increases the suspiciousness of such mutants. It seems, therefore, that whilst MUSE and Metallaxis share a similar high-level intuition—that mutants can be used to expose unlocated bugs—they each operate under a fundamentally different set of assumptions: MUSE sees failing-to-passing behaviour as a sign of a partial repair, whereas Metallaxis appears to treat it as a sign of overfitting. This difference in assumptions raises the question: “How should the outcomes of negative tests be treated?” From the suspiciousness values for each mutant, Metallaxis assigns a suspiciousness value to each statement within the program, using the formula in Equation 4.18.

\mu(s; M) = \begin{cases} \max_{m \in M_s} \mu(m) & M_s \neq \emptyset \\ 1.0 & \text{otherwise} \end{cases}    (4.18)
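The following sketch expresses Equations 4.17 and 4.18 in Python, assuming each mutant is summarised by the sets of positive and negative tests that kill it; the names are illustrative rather than drawn from the original Metallaxis implementation.

```python
from typing import Dict, List, Set, Tuple

def ochiai_mutant(killed_neg: Set[str], killed_pos: Set[str]) -> float:
    """Equation 4.17: Ochiai score of a killed mutant, which simplifies to the
    fraction of its killing tests that are negative."""
    k = len(killed_neg) + len(killed_pos)          # #K, the number of killing tests
    if k == 0:
        return 0.0                                 # unkilled mutants are not scored here
    return len(killed_neg) / (k * (len(killed_pos) + len(killed_neg))) ** 0.5

def metallaxis(mutants_at: Dict[str, List[Tuple[Set[str], Set[str]]]]) -> Dict[str, float]:
    """Equation 4.18: a statement is as suspicious as its most suspicious mutant;
    statements with no mutants default to the maximum suspiciousness of 1.0."""
    mu: Dict[str, float] = {}
    for s, mutants in mutants_at.items():
        scores = [ochiai_mutant(neg, pos) for neg, pos in mutants]
        mu[s] = max(scores) if scores else 1.0
    return mu
```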

Rather than computing the average over each mutant at a given statement, Metallaxis aggregates the suspiciousness of a statement's mutants by computing their maximum. An advantage of this approach—unmentioned by its authors—is that the cost of computing suspiciousness values is reduced, since values may only increase; there is no need to gather full test suite results for statements which have already been assigned the maximum suspiciousness. Thus, we may forgo the generation of mutants (for the purposes of fault localisation) at statements covered exclusively by negative test cases (since each µ(m) is guaranteed to be 1.0). Whilst this observation allows us

1 Provided that one only generates mutants at statements that are covered by at least one negative test case, the only mutants that are not killed are (partial) repairs.
2 Notation is changed slightly from its original form, to avoid confusion with similar spectrum-based fault localisation terms. The original form of the equation simplifies to the fraction of killed tests that are negative (since #K_p + #K_n = #K).

to reduce the cost of the fault localisation process, it fails to provide a means of discriminating between such statements, unlike MUSE. By using the maximum suspiciousness of a statement's mutants, Metallaxis is arguably more prone to the effects of outliers. For example, in the case that a neutral mutant m—one which passes and fails the same test cases as the original, faulty program—is generated at a statement s, µ(m) will go to 1.0, and by extension, µ(s) will be assigned the maximum suspiciousness of 1.0. If most statements within the program have neutral mutants, this behaviour becomes troubling; Metallaxis leaves us unable to discriminate between them. Given that previous research by Schulte et al. [2013] showed that roughly a third of mutants within GenProg's search space are neutral, this problem may manifest in practice. One can compare the behaviour of Metallaxis to that of MUSE by treating the inner term of µMUSE as the suspiciousness of a particular mutant. From this perspective, we notice differing behaviours and underlying assumptions in the way that MUSE and Metallaxis aggregate mutant results, and in how they treat failing-to-passing test outcomes:

• According to Metallaxis, a statement is as suspicious as its most suspicious mutant. This behaviour assumes that the search landscape is mostly composed of non-neutral mutants. If the search space contains a large number of neutral mutants [Schulte et al., 2013], Metallaxis assigns the maximum suspiciousness value to most statements. In contrast, MUSE actively discards its neutral mutants and computes the suspiciousness of a statement as the average suspiciousness of its mutants, indicating a robustness to sampling noise.

• Whereas MUSE significantly increases the suspiciousness of statements containing mutants which pass previously failing test cases, Metallaxis implicitly decreases the suspiciousness of such statements. These behaviours highlight a contradiction between the underlying assumptions of the techniques: MUSE sees partial solutions as signs of a repair, whilst Metallaxis views them as either irrelevant or the result of overfitting.

Both Metallaxis and MUSE have demonstrated significant improvements to fault localisation accuracy, but their evaluations have been limited to manually seeded faults in small-to-medium sized programs, sourced from the Siemens [Hutchins et al., 1994] and SIR [Do et al., 2005] datasets. Furthermore, in the evaluation of Metallaxis, the EXAM metric was used to measure accuracy—a metric which has been shown to be inappropriate for the purposes of automated program repair [Qi et al., 2013].

4.2. Analysis

In this section, we explore the feasibility of using the results of candidate patch evaluations to improve the accuracy of fault localisation. To determine feasibility, we use

a collection of historical, real-world bug fixes to assess whether mutants at statements where a repair was made exhibit different test suite behaviour to those at statements that were unchanged by the patch. In the context of automated program repair, each mutant can be viewed as a candidate repair. Evidence of differing test suite behaviours between the mutants of modified and non-modified statements would suggest that it is possible to (help to) identify the locations in the program at which a successful fix could be made. If no difference in test suite behaviour is detected, it would suggest that knowledge of the test outcomes for candidate patches is not useful in improving the accuracy of fault localisation. To focus our analysis, we use GenProg to generate candidate patches, but anticipate that the results of the analysis generalise to any search-based repair technique that follows a similar paradigm. Specifically, this analysis answers the following questions:

• RQ1A: Do the solutions found by the search occur at the same statement that was modified by the programmer? We ask this question to determine whether APR should restrict its attention to the locations modified by the programmer, rather than pursuing alien repairs at other locations.

• RQ1B: Can statements that were modified by the human be discriminated from those that were not, on the basis of the passing-to-failing (p2f) rates of their mutants?

• RQ1C: Can human-modified statements be distinguished from non-modified statements, based on the failing-to-passing (f2p) rates of their mutants? We ask RQ1B and RQ1C to assess the degree to which p2f and f2p contribute to identifying suitable fix locations, if they do at all; previous work has ignored the individual contributions of these parts [Moon et al., 2014a; Papadakis and Le Traon, 2015].

• RQ1D: Are statements covered by the fewest positive test cases the most likely to contain the fault? We ask this question to determine whether this observation could be exploited to improve offline fault localisation.

To answer these questions, we analyse test case results for a sample of single-edit mutants taken from 28 bugs across 6 real-world C programs. 15 of these bugs are artificial, injected into 3 small-to-medium programs sourced from the Software Infrastructure Repository [Do et al., 2005]—the same source of bugs used to evaluate MUSE and Metallaxis. We include these bugs to determine whether GenProg's repair operators may be used to perform MBFL, rather than the traditional mutation testing operators used by existing approaches [Moon et al., 2014a; Papadakis and Le Traon, 2015]. We supplement this dataset with 13 real-world bugs in 3 large-scale, real-world programs, to determine whether MBFL remains effective when applied in the wild.


Program   # Bugs   KLOC   Tests   Artificial?
gzip      6        6      104     Yes
grep      2        10     146     Yes
sed       7        14     255     Yes
OpenSSL   5        248    77      No
Python    4        446    344     No
PHP       4        789    8597    No

Table 4.2: Details of the subjects we studied for our preliminary mutation analysis. KLOC measures the number of thousands of lines of C code in the program, as calculated by cloc. Tests states the average number of test cases used by bugs for that program.

Instance Type:  c4.large
CPU:            Intel Xeon E5-2666 v3 (2 cores)
Base AMI:       Amazon Linux, 64-bit (ami-0b33d91d)
RAM:            3.75 GB
Storage Type:   io1
Storage Size:   8 GB (400 IOPs)
Cost:           > $0.1/hr

Table 4.3: Specifications of the Amazon EC2 instances used to perform mutation analysis on the 15 artificial bugs.

We obtained these 13 bugs by manually searching for recent patches within the version control histories of these projects. Table 4.2 provides a summary of the programs used for the study. To collect the necessary data for the analysis, we first generated a list of all the single-edit patches within GenProg's search space, before randomly shuffling that list and evaluating as many patches as possible within a 12-hour window. This 12-hour random walk was repeated for each of the bugs within the dataset. For historical reasons, and given its similarity to traditional mutation testing operators, we also ensured that a deletion was attempted at each statement.3 In keeping with our general methodology, outlined in Chapter 3, each bug was analysed using RepairBox. We used a C4.Large instance on Amazon EC2, described in Table 4.3, to collect data for the 15 artificial bugs, and a DS1_V2 instance on Microsoft Azure, described in Table 4.4, for the 13 real-world bugs. To fit as many patch evaluations as possible into the time window allotted to each random walk, we performed a number of optimisations to the search process, discussed below; a sketch of the overall sampling loop follows the list of optimisations.

3 In earlier experiments, using publicly available bug scenarios with weak test harnesses, we found that the deletion operator was highly effective at pruning the fault localisation.


Instance Type:  DS1_V2
CPU:            Intel Xeon E5-2673 v3 (1 core)
Host OS:        Ubuntu Server 16.04 LTS (64-bit)
Kernel:         4.4.0-77-generic
Docker:         1.12.6, build 78d1802
RAM:            3.5 GB
Storage:        7 GB (3200 IOPs)
Cost:           $0.249/hr

Table 4.4: Specifications of the Microsoft Azure compute instances used to perform analysis on the 13 real-world bugs.

• By limiting our attention to single-edit patches, we were able to exploit the assumption that the fix requires only a single edit: we restricted our attention to the sub-set of statements that are covered by all negative test cases as candidates for repair, rather than considering those that are covered by only a sub-set of the negative tests.

• For each mutant, we restricted its test suite evaluation to the sub-set of tests whose outcomes may have changed as a result of the mutation; to determine this, we found the sub-set of tests that cover the single statement modified by the mutant (see the sketch after this list). Whilst dynamic analysis techniques could be used to further reduce the evaluation effort required, we believe their associated costs are likely to outweigh any potential gains.

• Additionally, we implemented a form of parallel, asynchronous test case evaluation, which reduces CPU idle time, and improves throughput and efficiency.

• We also used a number of weak equivalency checking techniques introduced by Weimer et al. [2013] in AE, including liveness and scope checking, and the removal of duplicate statements from the donor pool.
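The sketch below illustrates the shape of this sampling procedure: single-edit patches are shuffled and evaluated until the time budget expires, and each patch is run only against the tests that cover its mutated statement. The callbacks covered_tests and run_test are placeholders for tooling that is not shown here.

```python
import random
import time
from typing import Callable, Dict, List, Set

def random_walk(patches: List[str],
                covered_tests: Callable[[str], Set[str]],
                run_test: Callable[[str, str], bool],
                budget_hours: float = 12.0) -> Dict[str, Dict[str, bool]]:
    """Evaluate as many randomly ordered single-edit patches as the time budget allows."""
    deadline = time.time() + budget_hours * 3600
    outcomes: Dict[str, Dict[str, bool]] = {}
    random.shuffle(patches)
    for patch in patches:
        if time.time() >= deadline:
            break
        # Only tests covering the mutated statement can change outcome,
        # so all other tests are skipped entirely.
        outcomes[patch] = {t: run_test(patch, t) for t in covered_tests(patch)}
    return outcomes
```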

Results

A brief summary of the results of the mutation analysis is given in Table 4.5. From inspection of these results, we observe that most (compiling) mutants exhibit an all-or-nothing behaviour: either their test outcomes remain unchanged (shown in the “% Neutral” column), or all of their tests fail (given by the “% Lethal” column). We also observe low sample rates (< 1 mutant per suspicious statement) for most Python and PHP bug scenarios. This is due to the substantial overheads involved in compiling mutants and, for statements covered by a large number of tests, the significant cost of evaluating hundreds or thousands of test cases for each mutant.


Scenario              % Compiling   % Neutral   % Lethal   Mutants   Sample Rate
mmb-openssl-0a2dcb6   100.00        14.06       50.66      377       2.48
mmb-openssl-4880672   90.72         22.27       35.51      1971      4.82
mmb-openssl-6979583   99.18         26.62       51.23      1097      13.06
mmb-openssl-8e3854a   93.66         38.14       25.59      3028      4.69
mmb-openssl-eddef30   100.00        55.97       16.84      2898      80.50
mmb-php-01c028a       96.43         7.14        0.00       28        0.06
mmb-php-11bdb85       96.88         25.00       0.00       32        0.07
mmb-php-1d6b3f1       88.66         14.57       25.78      741       8.72
mmb-php-9fb92ee       100.00        28.11       37.33      217       0.51
mmb-python-6c3d527    84.66         64.29       13.76      378       0.46
mmb-python-a93342b    81.51         53.42       0.68       146       0.37
mmb-python-b2f3c23    81.83         45.67       14.33      600       3.03
mmb-python-f584aba    32.74         10.50       5.05       733       0.71
sir-grep-v2-DG_1      98.73         39.17       35.99      628       1.00
sir-grep-v3-DG_3      67.06         31.92       31.19      2353      10.14
sir-gzip-v1-KL_2      91.41         23.60       48.79      1898      10.20
sir-gzip-v1-KP_1      92.09         41.11       36.77      2301      11.22
sir-gzip-v1-TW_3      91.34         27.85       40.57      2068      13.34
sir-gzip-v4-KL_1      91.27         43.17       28.83      2886      13.49
sir-gzip-v5-KL_1      90.26         66.75       16.96      6437      16.98
sir-gzip-v5-KL_8      98.68         24.01       31.26      1062      1.05
sir-sed-v2-AG_17      85.77         23.68       30.67      1630      1.52
sir-sed-v2-AG_19      53.50         8.73        19.93      3011      22.47
sir-sed-v3-AG_11      78.79         24.74       33.81      2579      7.39
sir-sed-v3-AG_15      80.71         24.92       26.02      1545      2.95
sir-sed-v3-AG_17      67.57         16.14       27.40      1332      3.95
sir-sed-v3-AG_18      67.85         16.92       24.13      1235      3.48
sir-sed-v3-AG_6       77.06         23.40       31.17      2615      5.56
Average               84.94         30.07       26.44      1637      8.72

Table 4.5: A summary of the mutation analysis results for each bug scenario. % Compiling specifies the percentage of mutants that successfully compiled. % Neutral gives the percentage of (compiling) mutants that had no effect on the outcome of the tests. % Lethal gives the percentage of (compiling) mutants (covering at least one positive test) that failed all of their covered tests. Mutants specifies the number of mutants generated within the 12-hour random walk. Sample Rate gives the average number of mutants per suspicious statement. The final row gives the average across all scenarios.


[Figure 4.3: box plots of the mean p2f of all mutants at a given statement, for faulty and correct statements; y-axis: p2f (0.0 to 1.0).]

Figure 4.3: Comparing the mean pass-to-failure rate for applicable mutants, we observe different, but overlapping, distributions for correct statements and faulty statements (KS2 = 0.301; p = 0.003; A12 = 0.679 [medium effect]).

RQ1A: Do the solutions found by the search occur at the same statement that was modified by the programmer?

Across the random walks for each of the 28 bugs, we found fixes at a total of 48 different statements. Only 5 of these 48 statements were also modified by the original patch. This finding suggests that GenProg crafts repairs unlike those that a human would make, supporting arguments made by Monperrus [2014] that automated repair should consider alien repairs, rather than restricting itself to producing human-like repairs.

RQ1B: Can statements that were modified by the human be discriminated from those that were not, on the basis of the passing-to-failing (p2f) rates of their mutants?

In line with our expectations and previous findings, we observe different mean pass-to-failure rates across all applicable mutants between the statements we assumed to be faulty and those we assumed to be correct, as shown in Figure 4.3. To avoid misleading results, only mutants that successfully compile and cover at least one positive test case are considered applicable. Using a two-sample Kolmogorov-Smirnov test, we reject the null hypothesis (with p < 0.05) that the samples for the faulty and correct statements are drawn from the same distribution. Between the p2f distributions for correct and faulty statements, we find an effect size of 0.679, indicating that correct statements tend to have higher p2f values than faulty statements. Details on the statistics used in this chapter may be found in Section 3.4.4.
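For reference, the comparison reported in Figure 4.3 can be reproduced with a two-sample Kolmogorov-Smirnov test and a Vargha-Delaney A12 effect size; the sketch below assumes the mean p2f values for faulty and correct statements are available as two lists of floats.

```python
from itertools import product
from typing import Sequence, Tuple
from scipy.stats import ks_2samp

def a12(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Vargha-Delaney A12: probability that a value from xs exceeds one from ys (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in product(xs, ys))
    return wins / (len(xs) * len(ys))

def compare_p2f(faulty: Sequence[float], correct: Sequence[float]) -> Tuple[float, float, float]:
    """Returns the KS statistic, its p-value, and the effect size of correct over faulty."""
    ks_stat, p_value = ks_2samp(faulty, correct)
    return ks_stat, p_value, a12(correct, faulty)
```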

The distributions suggest that faulty statements tend to have a lower pass-to-failure rate than those that are correct, supporting the intuition that modifications to a correct statement are likely to result in a greater degree of functionality loss. In answer to RQ1B, these results suggest that p2f information can be used to partially discriminate between statements that were modified by the programmer and those that were not. Thus, the results provide promise for the use of GenProg's mutation operators to localise the source of faults over the course of the search.

[Figure 4.4: two box plots of the mean f2p of all mutants at a given statement (without outliers), for faulty and correct statements; y-axis: f2p (0.0 to 1.0).]

Figure 4.4: We observe similar distributions of mean f2p values for faulty and correct statements (KS2 = 0.185; p = 0.177). In both cases, more than half of the mutants at each statement did not pass any of the previously failing tests.

RQ1C: Can human-modified statements be distinguished from non-modified statements, based on the failing-to-passing (f2p) rates of their mutants?

To answer this question, we first removed all non-compiling mutants from consideration, before also removing all acceptable solutions that were found during the random walk; this places the f2p values in the context of an ongoing search, allowing us to observe the most complete f2p distributions possible without having yet found a solution. We then compared the mean f2p of all mutants for statements that were modified by the human and those that were not, illustrated in Figure 4.4.

The results show that the mean f2p was zero for the majority of statements, regardless of whether they were modified by the human. On closer inspection, we find that only 2.09% of mutants have any impact on the outcomes of the negative tests (i.e., 97.91% of mutants fail all negative tests). These findings suggest that f2p information may not be particularly effective in distinguishing between faulty and correct statements.

RQ1D: Are statements covered by the fewest positive test cases the most likely to contain the fault?

In restricting the search to only those statements covered by all negative test cases, existing spectrum-based fault localisation techniques become partially redundant,


as the contribution of the e_n variable is zero. Instead, it may be preferable to quantify statement suspiciousness using a function of the number of executed and non-executed passing test cases. Thus, we used the results of the analysis to explore how many passing test cases were executed at each statement with an acceptable repair, and whether faulty statements covered fewer positive tests. To account for the varying sizes of the test suites, we looked at the fraction of the test suite covered by a faulty statement, or its passing test coverage. We computed an adjusted coverage level for each of the bugs, where the minimum number of positive tests covered by any statement under consideration was subtracted from the number of positive tests covered by a given statement. This allows the levels of coverage to be compared relative to the other candidate statements in the program. Measuring the adjusted coverage of each bug scenario, we find that 90% of the statements with an acceptable repair are covered by fewer than 2% of the positive test cases, mirroring the intuition behind the original suspiciousness metric used in GenProg.

4.3. Approach

Using the knowledge gained from our mutation analysis, we propose and evaluate two fault localisation strategies for search-based program repair, which may be aggregated. For our purposes, we aggregate localisations by computing the product of their suspiciousness values, although more intricate methods may yield better results. To use these layers in a noisy, online context, we ensure that each is numerically stable and that none assigns a suspiciousness of zero to any statement covered by a negative test case. We consider:

Coverage: based on our findings regarding the (adjusted) number of positive test cases covering human-repaired statements, this layer, described by Equation 4.19, creates a localisation under which there is a 90% probability that a statement with less than 2% adjusted coverage will be selected, and a 10% probability that a statement with a greater level of coverage will be chosen.

\mu_{Cov}(s) = \begin{cases} x & AdjustedCoverage(s) \leq 0.02 \\ y & \text{otherwise} \end{cases}    (4.19)

The adjusted coverage of a statement is computed using the formulae given in Equations 4.20 and 4.21:

AdjustedCoverage(s) = \frac{|PositiveTests(s)| - MinCoverage}{|PositiveTests|}    (4.20)

MinCoverage = \min_{s \in S} |PositiveTests(s)|    (4.21)

where |PositiveTests(s)| specifies the number of passing test cases covering a given statement, and |PositiveTests| specifies the number of passing tests overall. Let the statements that are at or below the adjusted coverage threshold be given by the set X, and let Y be the set of all other candidate statements; x and y are thus implicitly given by Equations 4.22 and 4.23.

\sum_{s \in X} \mu_{Cov}(s) = 0.90    (4.22)

\sum_{s \in Y} \mu_{Cov}(s) = 0.10    (4.23)
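A minimal sketch of this layer, assuming positive test coverage is given as a mapping from statements to the positive tests that execute them (the names are illustrative):

```python
from typing import Dict, Set

def coverage_layer(positive_cov: Dict[str, Set[str]],
                   num_positive_tests: int,
                   threshold: float = 0.02) -> Dict[str, float]:
    """Equations 4.19-4.23: statements at or below the adjusted coverage threshold
    share 90% of the probability mass; all remaining statements share 10%."""
    min_cov = min(len(tests) for tests in positive_cov.values())
    adjusted = {s: (len(tests) - min_cov) / num_positive_tests
                for s, tests in positive_cov.items()}
    low = [s for s, a in adjusted.items() if a <= threshold]
    high = [s for s, a in adjusted.items() if a > threshold]
    mu = {s: 0.90 / len(low) for s in low}
    mu.update({s: 0.10 / len(high) for s in high})
    return mu
```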

Pass-to-Fail: This layer, described in Equation 4.24, assigns values between zero and one to each statement, based on the pass-to-fail rates p2f of its mutants:

\mu_{p2f}(s) = \frac{1}{|mutants(s)| + 1} \cdot \left(1 + \sum_{m \in mutants(s)} (1 - p2f_m)\right)    (4.24)
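A sketch of this layer, together with the product aggregation used later in the evaluation, assuming the p2f rate of each mutant is pre-computed (the names are illustrative):

```python
import math
from typing import Dict, List

def p2f_layer(mutant_p2f: Dict[str, List[float]]) -> Dict[str, float]:
    """Equation 4.24: the +1 terms keep the score non-zero and numerically stable
    even when a statement has few or no mutants."""
    return {s: (1.0 + sum(1.0 - r for r in rates)) / (len(rates) + 1)
            for s, rates in mutant_p2f.items()}

def aggregate(*layers: Dict[str, float]) -> Dict[str, float]:
    """Naive aggregation of localisation layers by taking their product."""
    return {s: math.prod(layer[s] for layer in layers) for s in layers[0]}
```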

Evaluation

We now investigate the effectiveness of a fault localisation that combines the proposed layers. To assess the effectiveness of each candidate fault localisation technique, we use the mutants from the analysis—excluding any solutions, to avoid potential biases—to generate a set of suspiciousness values µ for each of the programs. We then measure the accuracy of a fault localisation technique by the probability p(µ) of selecting a statement that contains a fix found during the random walk, as described in Equation 4.25.4

p(\mu) = \sum_{s \in FixedStmts} \frac{\mu(s)}{\sum_{s' \in S} \mu(s')}    (4.25)
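A one-function sketch of this accuracy measure, assuming µ is a dictionary of suspiciousness values and fixed_stmts is the set of statements at which a fix was found:

```python
from typing import Dict, Set

def localisation_accuracy(mu: Dict[str, float], fixed_stmts: Set[str]) -> float:
    """Equation 4.25: probability of sampling a fixable statement when statements
    are drawn in proportion to their suspiciousness."""
    total = sum(mu.values())
    return sum(mu[s] for s in fixed_stmts) / total if total else 0.0
```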

Table 4.6 shows the results across the 11 bugs for which solutions were found during the random walk.

We observe that µCov is more accurate than GenProg's standard fault localisation, µGP, in 6 out of the 11 cases, marginally worse in 1 case, and considerably less accurate for the remaining 4 cases. When used in isolation, µp2f is beaten by µGP for 8 out of the 11 bugs, and for the remaining 3 bugs, µp2f provides only a marginal improvement in accuracy. This result indicates that passing-to-failing information, used alone, is not particularly effective at identifying suitable fix locations.

4 Note, we do not measure how well these techniques localise the statement modified in the human repair, since the patterns observed in this data were used to design these techniques.


Scenario              p(µGP)   p(µCov)   p(µp2f)   p(µCov × µp2f)
ct-openssl-0a2dcb6    1.227    0.142     0.907     0.046
ct-openssl-6979583    37.681   75.256    15.958    79.753
ct-openssl-8e3854a    0.714    0.106     1.005     0.100
ct-openssl-eddef30    66.667   76.036    45.920    77.583
sir-gzip-v1-KP_1      6.631    4.455     4.673     6.485
sir-gzip-v1-TW_3      9.202    14.211    4.689     15.324
sir-gzip-v4-KL_1      2.685    2.727     1.724     3.083
sir-gzip-v5-KL_1      0.400    0.326     0.360     0.394
sir-gzip-v5-KL_8      2.169    6.585     0.338     4.672
sir-sed-v2-AG_17      0.185    0.019     0.211     0.023
sir-sed-v3-AG_11      0.573    6.953     0.606     3.086

Table 4.6: Comparison of the fault localisation accuracies achieved by different approaches, where accuracy is measured by the probability of sampling a statement containing a fix from the resulting distribution. Results are given as percentages.

When µCov and µp2f are naively aggregated by computing their product, the resulting fault localisation is more accurate than µGP for 6 out of 11 bugs, marginally worse for 1, and considerably worse for 4. If the online modifications to µp2f are removed, and 1 − p2f(s) is used to compute suspiciousness instead, the resulting localisation outperforms GenProg's fault localisation in all cases. We also experimented with various ways of incorporating f2p information into the fault localisation, but found that the approach either attained near-perfect accuracy (since the only mutants to pass any negative tests were at statements where a solution was found), or substantially worse accuracy (since all mutants other than the solutions at the faulty statements failed all of the negative tests). A larger table of results can be found in Appendix C.

4.4. Discussion & Conclusion

From the results of our evaluation, we observe relatively little benefit in incorporating information learned from the evaluation of candidate patches into the fault localisation, in comparison to previous attempts to use mutation analysis to locate faults. We believe there may be a number of reasons for this result:

• Lack of mutants: for a number of bug scenarios, we found that excessively long test suite evaluation and compilation times prevented the search from producing an adequate sample of mutants at each statement.

• Lack of passing test coverage: in cases where a large number of statements are covered by no passing tests, all of these statements will be assigned a suspiciousness score of 1.0 by µp2f(s). Consequently, this layer will either suppress the fixed statement, if it is covered by any positive test cases, or it will fail to identify it amongst the many statements without positive test coverage.

• All-or-nothing f2p response: within GenProg's search space, we observe erratic negative test case behaviour. In some cases, the only statements to pass a negative test case were those at which a repair could be made. In other cases, negative test case passes were much more common, whilst no mutants at the statement that was repaired (excluding the solutions) passed any negative tests. If one knew which type of f2p response one was dealing with, a more accurate fault localisation might be possible. In the future, we plan to explore whether the rarity of negative test passes might be used to determine whether a given failing-to-passing event is a coincidence, or indicative of a potential repair at that statement.

• Coarsely-grained mutation operators: one explanation for the relative lack of success in incorporating the results of the mutation analysis into the fault localisation may be the coarsely-grained nature of the repair operators within GenProg's search space. With such actions, it may be difficult to expose subtle bugs within a statement that might otherwise be identified using finer-grained mutation testing operators. In our preliminary analysis, we find that most repairs tend either to have no effect on the outcomes of the test suite, or to cause all of their covering tests to fail; this all-or-nothing behaviour may be a consequence of the granularity of the search operators.

• Combining information: to combine each of the proposed layers of fault localisation in our evaluation, we computed the product of the layers; a necessarily arbitrary decision. A more meaningful, effective way of combining information from multiple sources, and of dealing with conflicting or corroborating suspiciousness values, is not immediately clear.

Going forward, to translate the potential of mutation analysis approaches such as MUSE into efficiency gains in automated program repair, we intend to explore the following:

• Richer repair models: in an effort to avoid the all-or-nothing behaviour exhibited by mutants generated using GenProg's coarsely-grained statement-level operators, we intend to explore the utility of lower-level repair operators including, but not limited to, those traditionally used within mutation testing. Beyond the use of mutation testing operators, we intend to explore whether expression-level operators, such as the replacement of the LHS or RHS of an assignment, or the modification of function call parameters, may be used to predict the location of the fault.

• Shape prediction: in addition to investigating whether mutants can be used to predict the location of the fault, we are interested in whether the results of particular types of mutants can be used to refine suspiciousness beyond the level of the statement, and whether they can be used to predict the type of repair that


might be needed (for instance, if replacing an if condition appears to have no effect, that might suggest that a replacement condition is needed).

• Ensemble learning: rather than combining fault localisation layers by taking their product, or using a simple weighted average, we may see better results—when using a different model—if ensemble learning techniques are used to find more effective ways of combining these multiple sources of information.

In conclusion, we find that mutation analysis appears to offer little potential for online fault localisation within GenProg's search space. From observation of the mutants, we see that GenProg exhibits an all-or-nothing landscape, where most edits are either neutral or fail all of their covering tests. Given the previous success of Metallaxis and MUSE, we believe that this all-or-nothing behaviour may be partly responsible for the lack of improvement. Alternatively, it may be that the results observed by Metallaxis and MUSE fail to scale to large, real-world programs. In future work, we intend to investigate the assumptions behind these techniques more deeply. To benefit from the knowledge of its mutants' test case results, we believe a set of more finely grained mutation operators is required; a requirement that will, most likely, also allow a larger number of bugs to be solved.

CHAPTER 5

Repair Model

Motivated by our findings and the findings of others—that the statement-level repair model used by GenProg is ineffective at finding repairs—in this chapter we conduct an empirical study of bugs in real-world C programs to determine a more effective repair model. Specifically, we explore the viability of extending the ideas of plastic surgery—using code from existing sources to craft the materials necessary for a repair—beyond the level of statements, to a larger set of richer, more granular changes, capable of fixing a greater number of bugs. Using a new bug fix mining tool, BugHunter, we automatically identify bug fixing commits within Git repositories, before extracting instances of AST-level repair actions and collecting a pool of donor code snippets from the program. Equipped with this information, we determine the fraction of bug fixes which involve a particular repair action, and the fraction of repair action instances that can be grafted [Barr et al., 2014] from existing code within the program. In an effort to reduce the search space, and backed by the findings of previous studies regarding the effectiveness of plastic surgery, we limit the membership of the donor code pool to snippets taken from the file where the fix occurred. To avoid the rejection of potential snippets due to differently labelled variables, we also explore the effectiveness of plastic surgery when such labels are removed; we refer to the labelled and unlabelled forms of the donor pool as the concrete and abstract donor pools, respectively. From analysis of the results, we find that more granular repair actions, at and below the level of statements, are better suited to plastic surgery than block-level changes. By removing labels from donor code snippets and treating them as templates, we observe a substantial increase in graftability, rising from 0–58% to 16–94%. To fix a larger number of bugs, we suggest incorporating a sub-set of the most frequent repair actions into the repair model, and using an abstract donor pool to craft repairs. The contributions of this chapter are as follows:

• We present BugHunter, a repair action mining tool, capable of identifying bug fixing commits within Git repositories and discovering potential instances of automated repair actions at the AST level.

• We build upon previous definitions of repair models [Martinez and Monperrus, 2013] and provide inference rules for the set of repair actions used to perform the study.

• We examine the frequency of 23 repair actions, inspired by repair models used within existing repair techniques, across 10,000 identified bug fixes in 200 open-source C projects.


• We determine the graftability [Barr et al., 2014] of each of these repair actions within a set of labelled and unlabelled donor code pools.

• From the observed frequencies and graftabilities of repair actions and donor code pools, we make a number of suggestions for constructing a future repair model, capable of addressing a greater number of bugs.

The rest of this chapter is structured as follows: Section 5.1 gives a brief review of the related literature. Section 5.2 elaborates on the motivation for this study and outlines our research questions. Section 5.3 discusses our methodology. Section 5.4 expands upon the definition of repair models and provides descriptions, in the form of inference rules, for each of the repair actions studied. Section 5.5 outlines our approach to mining repair action instances from real-world software projects. Section 5.6 presents and discusses the results of our study. Finally, Section 5.7 summarises our findings, provides suggestions for the construction of more effective repair models, and outlines future directions for research.

5.1. Related Work

In this section, we briefly discuss previous research of relevance to repair models and plastic surgery, and highlight where our study differs from and builds upon this body of work.

The Plastic Surgery Hypothesis

Barr et al. [2014] summarise the “plastic surgery hypothesis” as follows:

Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited.

The plastic surgery hypothesis underlies various genetic-programming-based approaches to automated program repair, optimisation, and improvement, all of which use existing code to find solutions. Given this definition, Barr et al. [2014] break the hypothesis down into two assumptions: (1) changes to the program are repetitive, relative to their parent, and (2) this repetitiveness may be efficiently exploited to construct those changes. In particular, techniques such as GenProg rely on the existence of donor snippets within the current, buggy form of the program. To test the first assumption, they measure the graftability of 15,273 commits, taken from several large-scale Java projects. The graftability of a change is defined as the number of its snippets for which a matching snippet can be found within the search space. For the purposes of the study, source

code lines (with whitespace removed), rather than AST-level entities, are treated as snippets. Barr et al. [2014] measure this quantity across three different search spaces, each containing snippets taken from one of the following sources:

• The parent of the change
• All non-parental ancestors of the change
• The most recent version of a foreign software project

To test the validity of the second assumption—that the space of donor snippets can be efficiently explored—Barr et al. [2014] also measure the density of matching snippets within each space. From the results of their analysis, Barr et al. [2014] found that, in most cases, donor grafts could be found within the current version of the program, and that it was rarely necessary to search non-parental ancestors for the graft. Moreover, 30% of grafts could be found within the same file at which the human-written change occurred. These results support both the plastic surgery hypothesis and GenProg's interpretation of that hypothesis: restricting the attention of the search to snippets within the current version of the faulty file allows a large number of grafts to be found much more efficiently. Our study builds upon the work by Barr et al. [2014] by investigating redundancy at the level of program repair actions, rather than at the level of source code lines. We choose to study source code redundancy within the context of particular repair actions, since this allows us to more accurately determine the effectiveness of plastic surgery when applied to program repair.
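As a rough illustration of this line-level notion of graftability (not the exact procedure used by Barr et al. [2014]), the sketch below reports the fraction of a change's whitespace-stripped lines that already appear in a donor pool:

```python
from typing import Iterable

def graftability(change_lines: Iterable[str], donor_lines: Iterable[str]) -> float:
    """Fraction of the change's snippets (whitespace-stripped lines) found in the donor pool."""
    strip = lambda line: "".join(line.split())
    pool = {strip(l) for l in donor_lines}
    snippets = [strip(l) for l in change_lines if strip(l)]
    if not snippets:
        return 0.0
    return sum(1 for s in snippets if s in pool) / len(snippets)
```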

Empirical Inquiry into Redundancy Assumptions of APR

Martinez et al. [2014] investigate the underlying assumption of plastic-surgery-driven APR techniques, such as GenProg, PAR, and SearchRepair: that the ingredients used by a fix already exist within the program. To investigate this assumption, the authors examine six open-source Java projects and determine the fraction of (version control) commits—including those which do not pertain to bug fixes—that are “temporally redundant”. A commit is deemed temporally redundant if it can be composed in its entirety from code introduced by previous commits. The authors measure temporal redundancy at both the line level and the token level, and within the same file (termed local temporal redundancy) and across all files (termed global temporal redundancy). The results of the study demonstrate a stark contrast in temporal redundancy at the line level and the token level: between 2–17% of commits are temporally redundant at the line level, whereas 8–52% are temporally redundant at the token level. This finding suggests that more granular changes to the program are easier to compose from previous versions of the program, although this comes at the cost of a significantly increased search space.


From analysis of global and local redundancy, the authors find that between 8–29% of tokens can be found within previous versions of the same file, compared to 31–52% across previous versions of all files. Importantly, the size of the local pool was between two and three orders of magnitude smaller than the global pool. The cost effectiveness of searching the local donor pool lends support to GenProg's decision to restrict the composition of its donor statements to those within the files under repair. Our study differs from that conducted by Martinez et al. [2014] in two important respects: firstly, we focus our attention on the effectiveness of plastic surgery within C programs, rather than Java programs. Secondly, we measure redundancy within the context of concrete repair actions, rather than generically measuring it at the line or statement level. For instance, we determine the graftability of a “Replace If Guard” action by measuring redundancy at the expression level.

Mining Software Repair Models

Martinez and Monperrus [2013] conduct an empirical study of the frequency of 41 different AST-level repair actions within real-world bug fixes in Java programs, and argue that not all probabilistic repair models are equally effective. As the dataset for their analysis, the authors automatically identify the sub-set of bug fixes in the CVS-Vintage dataset [Monperrus and Martinez, 2012]: a dataset containing 89,993 source-code versioning transactions across 14 open-source Java programs. To find AST-level changes for each fix, they employ ChangeDistiller [Gall et al., 2009], an AST differencing tool which describes modifications to ASTs using 41 different types of changes (e.g., “Statement Insertion”, “Statement Update”, “Statement Deletion”, “Addition of final to class declaration”). Martinez and Monperrus [2013] examine the different outcomes produced by using different methods of bug fix identification. They find that the number of source code changes is a good predictor of whether the changes within a transaction differ from those of normal software evolution; transactions with fewer changes tend to be the most different. In contrast, when transactions were selected purely based on the presence of indicators (e.g., “bug” or “fix”) within their associated messages, the observed distribution of repair action frequencies was almost identical to the distribution across all transactions, suggesting such measures are ineffective at isolating fixes to the source code. When considering the set of transactions that contain only a single change—as reported by ChangeDistiller—the top five change types were as follows: Statement Update (38%), Add Function (14%), Condition Change (13%), Statement Insertion (12%), and Statement Deletion (6%). Between them, these change types account for 83% of all changes. The focus of Martinez and Monperrus [2013]'s study is on the frequency of particular repair actions within Java projects. In contrast, although we also measure the

frequency of repair actions—albeit for C—our study is primarily focused on the effectiveness of plastic surgery in the context of APR.

Critical Review of PAR

As an alternative to GenProg's coarsely-grained repair model, Kim et al. [2013] propose a hand-crafted repair model, based on frequently observed bug fix patterns (e.g., insertion of a null check). They incorporate this repair model into PAR, an evolutionary program repair technique aimed at repairing Java bugs. Compared to an implementation of GenProg for Java, PAR was able to repair more bugs (27 out of 119, vs. 16). The authors also demonstrate the acceptability of PAR's patches through a human study. Despite achieving impressive results—and an ACM SIGSOFT Distinguished Paper Award—both PAR and the methodology used in its evaluation have since been the subject of criticism by Monperrus [2014]. This criticism has focused on both the unbalanced composition of the bug scenario benchmark used in the evaluation, and the methodology used to conduct the human study. The latter is of relevance to this study. Principally, Monperrus [2014] highlights potential biases within the human study, which may lead participants to judge the correctness of a patch based on its visual similarity to human-written bug fixes, rather than its semantics. He argues that conflating correctness with appearance may unfairly lead to the dismissal of more alien-looking but otherwise correct patches. In this study, we estimate the utility of a repair action (i.e., its ability to generate repairs) by its frequency within a corpus of mined human-written bug fixes. In practice, this decision may not accurately reflect the utility of some repair actions (e.g., it may be possible to fix a bug using a certain repair action, but such an action is unlikely to be performed by a human programmer, owing to its aesthetics or other non-functional properties). Nonetheless, our results provide a reasonable approximation of utility. For a more detailed discussion of both PAR and its critique, see Section 2.2.5.

Bug Fix Patterns in Java

In an effort towards realising a more effective repair model for Java programs, Soto et al. [2016] use software repository mining to determine the frequency—and to some extent, the composition—of certain bug fix patterns within human-written repairs for Java. As the corpus for their study, the authors use the publicly available September 2015/GitHub dataset, provided by Boa [Dyer et al., 2013], containing 4.5 million identified bug fixes in over 500,000 Java projects. As the first part of their study, Soto et al. [2016] assess how many files are changed by each fix, where file insertions, deletions, and modifications are all considered to be changes. Across all file types—including non-source-code files—the authors observe

a median of 2 changes, and a surprisingly high mean of 11.3 changes, suggesting a long-tailed distribution. When only Java source code files are considered, the authors find that each bug fix changes a mean of 4.47 files—the median number of changes is omitted from the paper. Although this adjusted mean is still high, the relatively low—and more relevant—median suggests that most bug fixes involve changes to two files or fewer. After investigating the number of files changed by each bug fix, Soto et al. [2016] look into the mean number of class, method, field, and variable introductions—high-level changes that are currently beyond the scope of all APR repair models. They find that, on average, each fix introduces 0.16 classes, 0.69 methods, 1.32 fields, and 0.20 variables (these figures fail to account for test fixtures, and should be considered an overestimate). Based on these findings, Soto et al. [2016] advocate for the inclusion of property introduction into (Java) repair models. For two reasons, we do not agree with this conclusion, nor with the methodology used to reach it:

• Within the context of learning suitable models for automated program repair, it does not make sense to base one's decisions on the mean number of introductions. This statistic is more susceptible to noise within the dataset—stemming from a risk of misclassification—and to the effects of large-scale refactorings; Soto et al. [2016] do not report the standard deviation or interquartile range needed to determine this risk. Instead, one should investigate the more statistically robust median, which is better suited to asking the (more relevant) question: “How many properties do repairs tend to introduce?”

• Not all modifications made by developers are strictly necessary to fix the bug. In the process of bug fixing, developers will sometimes refactor the original code to improve its maintainability and reduce its complexity.

Following their investigation of these source-level feature introductions, the authors attempt to determine the frequency of PAR's repair templates within the corpus. Using Boa, Soto et al. [2016] look for instances of eight of the ten templates used by PAR: neither the “Class Cast Checker” nor the “Expression Changer” pattern is investigated, due to technical limitations in Boa. Under the most lenient assumption, that no two of PAR's patterns co-occur within a bug fix, these templates are found to cover 14.78% of buggy files. In contrast, in the dataset used to evaluate the effectiveness of PAR, these repair templates cover 30% of the corpus. This discrepancy suggests that the dataset used to evaluate PAR was unrepresentative; bugs that are fixable by PAR are encountered twice as often in PAR's dataset as they are in general. Finally, the authors turn their attention to the statement-level repair model used by GenProg. After mining instances of GenProg's operators within the corpus, the authors determine the likelihood of a statement of a given type being replaced by one of another type. The results of this analysis demonstrate a non-uniform distribution, and show that certain kinds of statement are highly unlikely to be used as a replacement (e.g., TypeDecl, Label, Do, Assert). The analysis exclusively considers

cases where a statement is replaced by a statement of another type. It may be more informative, for the construction of a repair model, to determine how often a statement is replaced by another of the same type. From observations of the mined Delete and Insert operators, for each kind of statement the authors determine the fraction of fixes that involve deleting, inserting, or otherwise leaving statements of that kind unchanged. The most frequently modified kind of statement, in terms of both insertion and deletion, was found to be Expression statements (e.g., function calls, assignments, casts). If, Return, and For statement insertions and deletions were all found to occur in at least 15% of bug fixes. TypeDeclaration and Assert statements were found to have the lowest rates of insertion and deletion. Although our study shares a similar motivation to that of Soto et al. [2016]—to find a more effective repair model and so reduce the cost of APR—it aims to increase our understanding of repair models by considering the frequency of repair actions and their associated graftabilities.

Repair Quality

Recent studies by Qi et al. [2015] and Smith et al. [2015] demonstrate the tendency of APR techniques to yield plausible but incorrect patches when used with inadequate test suites. Qi et al. [2015] showed that all of the fixes generated by GenProg that were reported in [Le Goues et al., 2012a] were either the result of overfitting, or could be realised through code deletion alone. This finding raises questions regarding the true effectiveness of GenProg's repair model. Firstly, it causes us to ask: “Are GenProg's statement-level repair actions sufficiently expressive to repair most bugs?” Secondly, it leads us to the question: “Do GenProg's assumptions regarding redundancy—i.e., that the materials required for the repair can be found within the same file as the fault—hold in practice?” The questions raised by these studies serve as one of the driving motivations behind this study. By observing a large number of historical bug fixes, we hope to answer both of them. For a more in-depth discussion of the issues of repair quality raised by Qi et al. [2015] and Smith et al. [2015], see Section 3.2.1.

Anti-Pattern Avoidance

To improve the quality of patches produced by existing repair techniques (specifically, GenProg and SPR), Tan et al. [2016] take an orthogonal approach to the problem by introducing the notion of anti-patterns into the search space. Anti-patterns are used to actively prune patches whose resulting changes to the control-flow graph of the program are deemed likely to yield an incorrect or incomplete program, according to a set of heuristics. Each of the seven hand-crafted anti-patterns is briefly outlined below:


• A1 - Anti-delete CFG exit node: prevents the removal of return statements, exit calls, assertions, and calls to functions containing the word “error”.

• A2 - Anti-delete Control Statement: prevents the deletion of control-flow statements (e.g., if-statements, switch-statements and loops).

• A3 - Anti-delete Single-statement CFG: prevents statement deletion within CFG nodes that contain only a single statement.

• A4 - Anti-delete Set-Before-If: prohibits the deletion of a variable definition if that definition is immediately followed by an if-statement using the defined variable.

• A5 - Anti-delete Loop-Counter Update: forbids the deletion of an assignment statement contained within a loop if the set of variables on the LHS of the assignment intersects with the variables referenced by the loop invariant.

• A6 - Anti-append Early Exit: prevents the insertion of return and goto statements at all locations within the program, except for immediately after the last statement within a CFG node.

• A7 - Anti-append Trivial Conditions: prevents the insertion of if-conditions that can be shown, by static analysis of the program, to be either tautological or fallacious.

Motivated by a manual inspection of the patches generated by GenProg, AE, and SPR on the ManyBugs dataset, these anti-patterns are designed to address specific weaknesses that are exploited in producing incorrect patches:

• A1 and A6 are designed to overcome susceptibility to weak test oracles (i.e., those which only check the exit status of the program execution).

• A2, A3 and A4 are designed to compensate for inadequate test coverage, which may otherwise cause important, but untested, functionality to be destroyed.

• A4 is intended to prevent existing vulnerabilities, exposed by the test suite, from being masked rather than directly addressed.

• A5 is intended to prevent the occurrence of infinite loops, which significantly hinder the efficiency of the search and, in the worst case, may be accepted as correct patches by test suites that test for the absence of a particular bug.

• A7 aims to inhibit the implicit removal of functionality (observed when using tools such as SPR) through the insertion of a tautological or fallacious condition.

To demonstrate the effectiveness of these anti-patterns, Tan et al. [2016] produced modified versions of GenProg and SPR incorporating these rules, termed mGenProg and mSPR, respectively, and evaluated them against a sub-set of the CoREBench and ManyBugs datasets, using 86 bugs taken from 12 programs.


• Results demonstrated a 41% reduction of the search space for mGenProg, and a 27% reduction for mSPR, yielding 1.39x and 1.78x speed-ups, as measured by the average wall-clock time taken to find a plausible repair.

• mSPR yielded fewer plausible repairs than SPR in most cases; however, the omitted repairs were those (correctly) identified as erroneous by the anti-patterns. By producing fewer repairs, mSPR reduces the burden of manual inspection placed upon the user.

• Both mGenProg and mSPR removed fewer lines of code than their unmodified variants, demonstrating a lower rate of functionality deletion.

• In cases where the repair produced was incorrect, mGenProg and mSPR were able to localise the fault to a single statement within the program, substantially reducing the complexity of debugging for the user.

Although the incorporation of these anti-patterns into both GenProg and SPR improved both the efficiency of the tools and their ability to localise the fault, neither saw a considerable increase in the number of correct patches produced. For mSPR, a single additional correct repair was found, increasing the total number of correct repairs from 12 to 13 (out of 86). In the case of GenProg, mGenProg found one fewer repair, reducing the number of correct repairs found from 3 to 2. Given the significant space reduction achieved by mGenProg, these results seem to suggest that GenProg's repair model lacks the complexity required to tackle a greater number of bugs. Furthermore, although the results of this study are encouraging, one may question whether the technique generalises beyond the relatively small set of benchmarks used. Since the dataset used to perform the experiment was the same one from which the anti-patterns drew their motivation, the results may be subject to bias. Despite questions over its generality, and the lack of any significant improvement in the number of correct repairs found (which is arguably a reflection of the limitations of the underlying repair model), anti-patterns provide a cost-effective means of reducing the search space, increasing efficiency, and improving fault localisation. As repair models grow in size and complexity, such techniques may prove crucial in tackling the inevitable explosion of the search space.

5.2. Motivation for Study

To avoid producing low-quality patches, and to increase the number of correct repairs produced by search-based repair techniques, we must use alternative, richer repair models. Innovations from previous work, most notably the fix prediction model used by Prophet [Long and Rinard, 2016], can be used to increase the likelihood of selecting a correct repair over a feasible, but incorrect, one; however, to benefit from these additions, we must first possess a repair model capable of constructing a viable repair.

One way to tackle this problem is through the introduction of human repair templates, as proposed in PAR [Kim et al., 2013]. The sourcing of these templates remains a problem, however, and it is unclear how well they would apply to bugs beyond those encountered in the benchmark. From observation, the majority of (publicly visible) repairs for large-scale, open-source programs do not correspond to the common programmer errors encoded by these templates; although certain bug fixes may share similarities in their semantics, most are syntactically unique.

Alternatively, one could take a similar approach to most genetic programming systems beyond the domain of automated repair, and generate code for insertion from scratch, rather than copying it from the program under repair. However, whilst generating fragments of code from scratch may be acceptable for synthesis-based approaches, the resulting search space would almost certainly prove intractable for search-based repair.

Plastic surgery therefore presents a promising way of reducing the search space to a more tractable size, and gives us a more general means of constructing repairs than relying on the utility of manually constructed repair templates. Despite previous studies demonstrating the effectiveness of plastic surgery [Barr et al., 2014; Martinez et al., 2014], its particular realisation within the repair model of GenProg has been shown to struggle to discover a correct repair for most problems to which it has been applied, as both ourselves and others have found [Qi et al., 2015]. (For more details, see Section 3.2.1.)

For the remainder of this chapter, the effectiveness of plastic surgery within the domain of automated repair is explored at finer levels of granularity, beyond the coarsely-grained statement-level repair model used by GenProg. Additionally, the benefit of incorporating a number of different repair actions into the repair model is explored in tandem. Specifically, we ask the following research questions:

• RQ2: Is plastic surgery equally effective for all repair actions?

• RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

5.3. Methodology

Traditionally, when comparing repair techniques, one would make use of an existing set of pre-processed bugs from the literature, such as ManyBugs or GenProg's TSE 2012 benchmarks. The efficacy of repair actions would then be evaluated by determining the number of bugs for which a repair can be found within a fixed resource window [Le Goues et al., 2012a; Long and Rinard, 2015; Mechtaev et al., 2016]. However, such an approach is inappropriate when assessing repair models, for a number of reasons:

1. Diversity: To produce a fair comparison of techniques, one must subject the repair models to a large sample of bugs, taken from a diversity of programs, in order to minimise the effects of sampling errors, and to mitigate the potential for any bias. At the present time, the only publicly available benchmark suites for automated program repair presented in the literature (ManyBugs, IntroClass, GenProg TSE 2012, Defects4J, SPR) consist of at most several hundred bugs, across fewer than 15 different programs. Furthermore, the composition of these benchmarks is unlikely to be truly random, and may be subject to unintentional selection biases; for example, one may omit bug scenarios that are particularly expensive or difficult to replicate, such as bugs within the Linux kernel.

2. Overfitting: Operating on a relatively small set of bugs allows one to unintentionally overfit to the specifics of that sample. By analysing the composition of the corpus, one may quite simply introduce a number of specific repair actions, tailored to fix certain bugs. Such criticism may be levelled at the approach taken by PAR to the design and evaluation of its repair actions, each of which bears the watermarks of solving particular bugs from the benchmark on which it was tested. By evaluating its repair model on such a small and well-studied set of benchmarks, we are unable to determine whether the repair model continues to hold the same utility in a more general context.

A potential solution to this problem of overfitting would be to use separate datasets for training (i.e., learning a suitable repair model) and testing (i.e., assessing the generality of the learned repair model). To avoid misleading results, such a dataset would need to represent a truly random, uniform sampling of real-world bug scenarios. To our knowledge, no such dataset is publicly available. Existing real-world datasets are either hand-picked (e.g., ManyBugs), and thus subject to selection bias, or represent a narrow category of bugs (e.g., IntroClass contains bugs in simple programming assignments). Given the costs and complexities associated with sourcing and evaluating a suitable dataset, the only feasible way to assess the effectiveness of a repair model, in general, is through the use of bug fix mining.

3. Repair Quality: Instead of determining the ability of repair actions to aid in the construction of high-quality fixes, one may unintentionally end up finding the repair actions which best exploit weaknesses in the test suites used by each of the bug scenarios. For example, one may trivially introduce an "Append Exit" repair action, which appends a statement containing exit(0); after the selected statement. For many of the bugs within the TSE benchmarks, and a sub-set of the ManyBugs scenarios, this would yield an acceptable fix, as only the exit status of the program is checked, rather than its outputs. (For more details, see Section 3.2.1.)

4. Feasibility: Even if one were to possess a sufficiently rich variety of bugs, evaluating the effectiveness of each repair action by using each of them to perform search may take prohibitively long. One could sample a sub-set of the candidate fixes generated by each repair action to reduce the running time, but doing so would degrade the accuracy of the evaluation, and would tie the performance of the repair actions to the underlying search technique.

Given the difficulties involved in this approach, we opt to assess the prevalence and graftability of repair actions through software repository mining instead, allowing a far larger corpus to be analysed without the need for expensive test suite evaluations. The steps of our analysis are as follows (a high-level sketch of this pipeline is given after the list):

1. A corpus of human bug fixes is mined from over 200 of the most popular repositories on GitHub containing C source files, using a custom-written, open-source repository mining tool, BugHunter.

2. A set of abstract syntax trees (ASTs) and AST differences is computed for each of the files modified by each fix, using GumTree [Falleri et al., 2014].

3. From these ASTs and their associated differences, a set of repair action instances is mined using the detection rules.

4. For each modified AST, a series of abstract and concrete donor pools is generated from its contents. Using these donor pools, we determine the graftability of each of the proposed repair actions.
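To make the sequence of stages concrete, the sketch below shows how such an analysis might be orchestrated. It is a minimal illustration only: the stage functions passed as parameters (compute_ast_diff, detect_repair_actions, build_donor_pools, is_graftable) are hypothetical placeholders for the corresponding BugHunter and GumTree stages, not their real interfaces.

```python
# Hypothetical orchestration of the analysis; each stage function passed in is
# a placeholder for the corresponding BugHunter/GumTree step, not a real API.
from collections import Counter

def analyse_corpus(bug_fixes, compute_ast_diff, detect_repair_actions,
                   build_donor_pools, is_graftable):
    instances = Counter()   # total instances of each repair action kind
    usage = Counter()       # bug fixes containing at least one instance of a kind
    grafts = Counter()      # instances whose material is present in the donor pool

    for fix in bug_fixes:                                            # step 1
        buggy_ast, fixed_ast, diff = compute_ast_diff(fix)           # step 2
        actions = detect_repair_actions(buggy_ast, fixed_ast, diff)  # step 3
        pools = build_donor_pools(buggy_ast)                         # step 4
        for action in actions:
            instances[action.kind] += 1
            if is_graftable(action, pools):
                grafts[action.kind] += 1
        for kind in {action.kind for action in actions}:
            usage[kind] += 1

    return instances, usage, grafts
```

Frequencies of the kind reported later in Table 5.1 then follow by dividing usage counts by the total number of mined fixes.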

Trade-Offs

Although this approach overcomes the identified problems involved in evaluating repair actions by incorporating them into the search procedure, it also comes with its own set of trade-offs. These trade-offs, and the steps taken to minimise their effects upon the results of the analysis, are as follows:

• Precision: Although this approach allows us to evaluate repair actions across a much larger corpus of bugs, it comes with the drawback that it excludes correct repairs that are not syntactically equivalent to the human bug fix. To partially mitigate this problem, we check whether the human repair is of the same kind as the action (e.g., does it modify an if guard?) in addition to checking whether the exact repair can be crafted from the elements of the program. This step also allows us to reason about the effectiveness of synthesis-driven automated repair approaches, and generative repair models, more like those from traditional applications of genetic programming.

• Irrelevant Changes: Human repairs may often contain modifications to the source code that are irrelevant to the bug fix. These changes might include aesthetic or structural changes, or they may have been bundled together with the bug fix commit. In theory, one could alleviate this problem by using delta-debugging to minimise the AST difference. In practice, however, there may be no (readily accessible) test suite, and so this technique is unusable. Instead, in this study, only bug fixes pertaining to a single file and function are used to perform the analysis of repair actions.

Although the results of this analysis are likely to underestimate the utility of certain repair actions, the overall results help us to understand the composition of human repairs, the effectiveness of the plastic surgery hypothesis, and how we might go about designing a more effective repair model.

5.4. Repair Model

In this section, we provide a more rigorous definition of "repair models", building upon a previous, informal definition given by Martinez and Monperrus [2013]. Following this definition and a brief discussion of its related terms, we propose a series of 23 repair actions, inspired by the plastic surgery hypothesis and existing repair models.

Definitions

The term "repair model" appears to have been introduced by Martinez and Monperrus [2013] in an empirical study of search spaces within APR problems. They provide the following informal definitions:

• a repair action is defined as a "kind of modification on source code that is made to fix bugs", such as "changing the initialisation of a variable" or "adding a method call".

• a repair model is defined as a set of repair actions that are used to construct repairs.

• a repair is defined as a concrete application (or instance) of a particular repair action to source code, such as "adding the method call fun(x)".

We further refine the definition provided by Martinez and Monperrus [2013]:

Host Program, P: is the program subject to repair, whose behaviour is defined to be incorrect (faulty) with respect to a given test suite T. P is represented by the set of abstract syntax trees belonging to the source code files under repair. For the sake of brevity, x ∈ P is taken to refer to the existence of an AST node x within P.


Editable Locations, L: the locations within the program at which a repair action may be applied. For the majority of repair systems, L is comprised of the sub-set of statements within P that are covered by at least one failing test case.

Fault Localisation, µ : L → R: the suspiciousness of each editable location, µ(l) ≥ 0, where larger values of µ(l) imply a higher degree of suspicion, and µ(l) = 0 indicates the belief that l does not contain a fault.

Donor Pool, D: the set of AST sub-trees that may be combined with a repair action to construct an edit.

Repair Actions, a ∈ A: are used to describe a kind of AST transformation that may be performed. Each repair action is described using the form shown in Equation 5.1.

\[
\frac{\mathrm{statement}(s_1) \qquad s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_1;\, s_2]} \tag{5.1}
\]

To the left of the transformation arrow, below the horizontal line, is the matching signature; this uses the BNF for the language of the host program to specify the shape of AST nodes that may be subject to this repair action, and to capture their contents. In the example above, the matching signature is simply s1, allowing the action to be applied to any AST node (before consideration of the rest of the rule).

The right-hand side of the transformation arrow is used to specify the shape of the node that the left-hand side should be transformed into. In the given example, the right-hand side [s1; s2] states that the matching statement should be replaced by a block containing the matching statement and another statement, s2.

Finally, the clauses above the horizontal line describe the transformation criteria that must be satisfied by all instances of this repair action. In the example above, the transformation criteria state that the matching node should not be an empty block, and that the second statement of the right-hand side, s2, should be a non-empty block contained within the donor pool.

Edit, e: represents a particular application of a repair action by specifying the matching node within the host program, together with the values of each of the free variables (e.g., s2 in the previous example).

Edit Space, E: defines the set of all edits that may be applied to P.

Edit Localisation, ω : E → R: is used to assign a weight to each edit, ω(e) ≥ 0, in a similar fashion to fault localisation. In this case, larger weights are used to encode a belief that a particular edit is more likely to be correct.


Repair, R: represents a candidate repair as a set of non-overlapping edits, which yield the program P′ when applied to P. For the sake of our analysis, and in the interest of bounding the size of the repair space, the non-overlapping constraint is strengthened, such that no two edits within a given patch may be applied at the same statement:

\[
\forall e_1, e_2 \in R \;|\; e_1 \neq e_2 \implies \mathrm{nearestStmt}(e_1) \neq \mathrm{nearestStmt}(e_2)
\]

From observation, most human repairs may still be expressed when this constraint is applied.
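A minimal sketch of how this strengthened constraint might be checked for a candidate repair is given below; nearest_stmt is an assumed helper that maps an edit to the (hashable) statement at which it is applied, mirroring nearestStmt above.

```python
def satisfies_non_overlap(repair, nearest_stmt):
    """Return True iff no two edits in `repair` apply at the same statement."""
    seen = set()
    for edit in repair:
        stmt = nearest_stmt(edit)   # statement node targeted by this edit
        if stmt in seen:
            return False            # two edits would touch the same statement
        seen.add(stmt)
    return True
```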

Donor Pool

Associated with each repair model is a donor pool, containing the fragments of code (i.e., AST sub-trees) from which repairs may be crafted. In the case of synthesis-based or generative repair models, such as those used by SPR, PAR, and NOPOL, this donor pool is implicit, and is lazily generated during the synthesis (or generation) procedure. For tools relying on the plastic surgery hypothesis, such as GenProg and SearchRepair, the donor pool explicitly describes the fragments that may be used to form repairs. Typically, the contents of the donor pool are sourced from the program under repair, as is the case with AE, GenProg, Prophet, and others; however, foreign programs may also be used to construct the donor pool, as demonstrated by SearchRepair [Ke et al., 2015].

In either case, one must usually perform optimisations in order to reduce the size of the donor pool to a more scalable one. Whilst these techniques, described below, may lead to efficiency gains, they do so at the cost of potentially solving fewer bugs.

• In the case of GenProg (and its variants), the donor pool is composed from the sub-set of files that are subject to the repair, rather than the entire program. Previous studies [Barr et al., 2014; Martinez et al., 2014] have shown that this restriction causes a minimal reduction in the number of solvable problems. However, these studies failed to account for the particulars of any repair model, potentially reporting a higher number of successes than would otherwise be possible.

• GenProg further restricts the donor pool to only contain code fragments that are executed by at least one positive test case.

• AE reduces the number of redundant entries in the donor pool by treating it as a set, rather than a collection, removing all duplicates of syntactically equivalent statements.

• Rather than restricting the donor pool to the contents of the file under repair, Prophet retains the n-most executed statements from the entire program within its donor pool.

One may also reduce the donor pool further still, through the introduction of compiler optimisation techniques (similar to those used by AE to prune edits from the search space), such as canonicalisation and constant folding.

For each of the reviewed repair techniques which rely upon plastic surgery, the donor pool is used only at the statement level, and so it solely consists of statements from the program. Here, we extend the donor pool, allowing it to operate with a number of finer-grained repair actions, by introducing the following sub-pools (a sketch of how these sub-pools might be populated from an AST is given after the list):

• Dstmt containing all the statements in the program.

• Dblock containing all control-flow blocks within the program, with the exception of Switch blocks.

• Dexp containing all expressions within the program. In practice, one could further divide this sub-pool by type, allowing for more efficient repair action queries.

• Dguard containing all boolean expressions that are used as guards for all if-, while-, do-while- and for-statements within the program.

• Dlhs contains all of the left-hand side expressions of each assignment within the program.

• Drhs contains all of the right-hand side expressions of each assignment within the program.

• Darg contains all of the function call arguments within the program.
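The sketch below illustrates one plausible way of populating these sub-pools in a single pass over the host program's AST. The node interface it assumes (a `kind` string plus `guard`, `lhs`, `rhs`, and `args` attributes) is invented for the example and stands in for whichever AST library is actually used.

```python
def build_donor_pools(nodes):
    """Partition the AST nodes of the host program into the donor sub-pools.

    `nodes` is assumed to be an iterable over every node in the program's AST;
    each node exposes a `kind` string and, where relevant, `guard`, `lhs`,
    `rhs` and `args` attributes. These are illustrative assumptions only.
    """
    pools = {name: [] for name in
             ("stmt", "block", "exp", "guard", "lhs", "rhs", "arg")}
    guarded_kinds = ("If", "While", "DoWhile", "For")
    for node in nodes:
        if node.kind == "Statement":
            pools["stmt"].append(node)
        if node.kind == "Block" and getattr(node, "parent_kind", None) != "Switch":
            pools["block"].append(node)            # exclude Switch blocks
        if node.kind == "Expression":
            pools["exp"].append(node)
        if node.kind in guarded_kinds:
            pools["guard"].append(node.guard)      # boolean guard expressions
        if node.kind == "Assign":
            pools["lhs"].append(node.lhs)
            pools["rhs"].append(node.rhs)
        if node.kind == "FunCall":
            pools["arg"].extend(node.args)         # individual call arguments
    return pools
```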

Repair Actions

Having provided a definition of repair models, repair actions, and their associated terms, in this section we define a number of repair actions, inspired by existing repair techniques. Each repair action is tailored to exploit the plastic surgery hypothesis, in order to maintain a tractable edit space, through the incorporation of the extended donor pool described in Section 5.4.2. Below, we describe each of the 23 repair actions, using the repair action notation introduced in Section 5.4.1. To aid our descriptions, we introduce the following definitions:

• [s1; ... ; sn] denotes a block of statements.

• [] denotes the empty block.

• HasKind(node, kind) determines whether a given AST node belongs to a specified kind (e.g., statement, expression, function call). A node may have multiple kinds.


Generic Statement Actions

This first group of repair actions is identical to the original actions used in the GenProg repair model, with the addition of empty statement checking, a prepend action, and the removal of the Swap action, present in earlier versions of GenProg.

• Delete Statement: removes a non-empty statement from the program.
\[
\frac{s \neq [\,]}{s \;\to\; [\,]}
\]

• Append Statement: appends a statement from the donor pool immediately after an existing statement (see the sketch after this list).
\[
\frac{s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_1;\, s_2]}
\]

• Prepend Statement: clones a statement from the donor pool at random and prepends it immediately before an existing statement in the program.
\[
\frac{s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_2;\, s_1]}
\]

• Replace Statement: replaces an existing statement within the program with a randomly selected clone, taken from the donor pool of statements.
\[
\frac{s \neq [\,] \qquad s' \neq [\,] \qquad s' \neq s \qquad s' \in D}{s \;\to\; s'}
\]
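As a concrete illustration of the Append Statement action, the sketch below applies it to a toy representation in which a block is simply a Python list of statements and the empty block is []; this simplification is ours and does not reflect the internals of any particular repair tool.

```python
import copy
import random

def append_statement(block, index, donor_pool, rng=random):
    """Append Statement: clone a donor statement and insert it immediately
    after the statement at block[index].

    Blocks are modelled as plain Python lists and the empty block as [];
    this is an illustrative simplification, not a real AST.
    """
    donors = [s for s in donor_pool if s != []]   # criterion: s2 != []
    if block[index] == [] or not donors:          # criterion: s1 != []
        return None
    clone = copy.deepcopy(rng.choice(donors))     # criterion: s2 in D
    block.insert(index + 1, clone)                # s1 -> [s1; s2]
    return clone
```

For example, calling append_statement(stmts, 0, donor_pool) on stmts = ["x = 1;", "return x;"] would splice a cloned donor statement between the two existing statements.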

If-Statement-Related Actions

• Wrap Statement: replaces a compatible statement with an if-statement whose body contains the original statement, and whose condition is taken from the donor pool (a sketch of this rule as executable code is given after this list).
\[
\frac{s \in S \qquad NC = \{\mathrm{ForLoop}, \mathrm{WhileLoop}, \mathrm{DoWhileLoop}, \mathrm{Switch}, \mathrm{IfThen}\} \qquad \forall k \in NC \;|\; \neg\mathrm{HasKind}(s, k) \qquad c \in D}{s \;\to\; \mathrm{IfThen}(c, [s], [\,])}
\]

• Unwrap Statement: replaces an if-statement that contains no else-branch by its then-branch.
\[
\frac{else = [\,] \qquad then \neq [\,]}{\mathrm{IfThen}(c, then, else) \;\to\; then}
\]

• Replace If-Condition: replaces the condition of an existing if-statement within the program with a boolean expression taken from the donor pool.
\[
\frac{c' \in D \qquad c' \neq c}{\mathrm{IfThen}(c, then, else) \;\to\; \mathrm{IfThen}(c', then, else)}
\]

• Replace Then-Branch: replaces the then-branch of an existing if-statement with a (compatible) block taken from the donor pool.
\[
\frac{then' \in D \qquad then' \neq then \qquad then' \neq [\,] \qquad then' \neq else}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then', else)}
\]

• Replace Else-Branch: replaces the else-branch of an existing if-statement with a (compatible) block taken from the donor pool.
\[
\frac{else' \in D \qquad else' \neq [\,] \qquad else \neq [\,] \qquad else' \neq then}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then, else')}
\]

• Remove Else-Branch: removes the else-branch of an existing if-statement, provided that the else-branch does not contain an if-statement, and is not empty.
\[
\frac{else \neq [\,] \qquad \mathrm{kind}(else) = \mathrm{Block}}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then, [\,])}
\]

• Insert Else-Branch: clones an existing block from the donor pool, and inserts it into a given if-statement (which lacks an else-branch) as its else-branch.
\[
\frac{else \in D \qquad else \neq [\,] \qquad else \neq then}{\mathrm{If}(guard, then, [\,]) \;\to\; \mathrm{If}(guard, then, else)}
\]

• Insert Else-If-Branch: selects both a randomly-selected block and guard from the donor pool, before joining them into an if-statement, and replacing the empty else-branch of an if-statement with the created statement.
\[
\frac{g_2 \in D \qquad b_2 \in D \qquad b_2 \neq [\,] \qquad b_2 \neq b_1 \qquad g_2 \neq g_1 \qquad elif = \mathrm{If}(g_2, b_2, [\,])}{\mathrm{If}(g_1, b_1, [\,]) \;\to\; \mathrm{If}(g_1, b_1, elif)}
\]

• Add Guard to Else-Branch: adds a guard to an unguarded else-branch within the program.
\[
\frac{else \neq [\,] \qquad \neg\mathrm{HasKind}(else, \mathrm{If}) \qquad g_2 \in D \qquad else' = \mathrm{If}(g_2, else, [\,])}{\mathrm{If}(g_1, then, else) \;\to\; \mathrm{If}(g_1, then, else')}
\]
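To show how rules in this notation might be realised programmatically, the sketch below encodes the Wrap Statement action as a matching predicate plus a transformation over a toy Node type; the Node class and its fields are invented for the example, and, for simplicity, each toy node carries a single kind, whereas the rule above permits multiple kinds.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy AST node used only for this sketch."""
    kind: str
    children: list = field(default_factory=list)

# Kinds excluded by the Wrap Statement matching criteria (the set NC above).
EXCLUDED_KINDS = {"ForLoop", "WhileLoop", "DoWhileLoop", "Switch", "IfThen"}

def wrap_statement_applies(stmt: Node) -> bool:
    """Matching criteria: the statement belongs to no excluded kind."""
    return stmt.kind not in EXCLUDED_KINDS

def wrap_statement(stmt: Node, guard_pool: list, rng=random) -> Node:
    """Transformation: s -> IfThen(c, [s], []) for a guard c drawn from D.

    Assumes a non-empty guard_pool (the Dguard sub-pool).
    """
    guard = rng.choice(guard_pool)
    return Node("IfThen", [guard, Node("Block", [stmt]), Node("Block", [])])
```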

Switch-Related Actions

• Replace Switch-Expression: replaces the expression of a switch statement with a compatible expression from elsewhere in the program.
\[
\frac{exp' \in D \qquad exp' \neq exp}{\mathrm{Switch}(exp, block) \;\to\; \mathrm{Switch}(exp', block)}
\]


Loop-Related Actions

• Replace Loop-Guard: replaces the guard of an existing for-, while- or do-while-loop with a guard cloned from the donor pool.
\[
\frac{g' \in D \qquad g' \neq g}{\mathrm{While}(g, do) \;\to\; \mathrm{While}(g', do)}
\qquad
\frac{g' \in D \qquad g' \neq g}{\mathrm{DoWhile}(g, do) \;\to\; \mathrm{DoWhile}(g', do)}
\qquad
\frac{g' \in D \qquad g' \neq g}{\mathrm{For}(init, g, incr, do) \;\to\; \mathrm{For}(init, g', incr, do)}
\]

• Replace Loop-Body: replaces the body of an existing for-, while- or do-while-loop with a block cloned from the donor pool.
\[
\frac{do' \in D \qquad do' \neq do}{\mathrm{While}(g, do) \;\to\; \mathrm{While}(g, do')}
\qquad
\frac{do' \in D \qquad do' \neq do}{\mathrm{DoWhile}(g, do) \;\to\; \mathrm{DoWhile}(g, do')}
\qquad
\frac{do' \in D \qquad do' \neq do}{\mathrm{For}(init, g, incr, do) \;\to\; \mathrm{For}(init, g, incr, do')}
\]

Assignment-Related Actions

• Replace RHS of Assignment: replaces the right-hand side of an assignment statement with a compatible expression, cloned from the donor pool.
\[
\frac{rhs' \in D \qquad rhs' \neq rhs}{\mathrm{Assign}(lhs, op, rhs) \;\to\; \mathrm{Assign}(lhs, op, rhs')}
\]

• Replace LHS of Assignment: replaces the left-hand side of an assignment with a compatible left-hand side, cloned from the donor pool.
\[
\frac{lhs' \in D \qquad lhs' \neq lhs}{\mathrm{Assign}(lhs, op, rhs) \;\to\; \mathrm{Assign}(lhs', op, rhs)}
\]

Function-Call-Related Actions

• Replace Target of Function Call: replaces the target of a function call with one cloned from the donor pool.
\[
\frac{target' \in D \qquad target' \neq target}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target', args)}
\]

• Modify Arguments of Function Call:
\[
\frac{args' \in D \qquad args' \neq args}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Insert Argument into Function Call:
\[
\frac{arg' \in D \qquad args = l \oplus r \qquad args' = l \oplus (arg') \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Replace Argument of Function Call:
\[
\frac{arg' \in D \qquad args = l \oplus (arg) \oplus r \qquad args' = l \oplus (arg') \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Remove Argument of Function Call:
\[
\frac{args = l \oplus (arg) \oplus r \qquad args' = l \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

5.5. Approach

Below, we briefly describe the steps used by our tool, BugHunter, to mine repair action instances in Git repositories:

1. Repository Sourcing: We use GitHub's API (https://github.com) to identify and download the top 200 most-starred projects whose source code includes C, using stars as a crowd-sourced indicator of the popularity of a project. Notable projects include the Linux kernel, Redis, Git, PHP, Vim, tmux, CCV, and Curl.

2. Bug Fix Detection: We discover commits that are suspected to be bug fixes across all of the 200 Git repositories. Possible bug fixes are found by iterating over commit histories (using the open-source GitPython API: https://github.com/gitpython-developers/GitPython), and identifying commits that satisfy all of the following criteria (a sketch of these checks, assuming the GitPython API, is given after this list):

(a) The commit message must contain at least one of the following words: bug(s), fix, fixed, fixes, fault(y), defect, repair(ed), patch(ed).

(b) To avoid compilation-related source changes, or artefacts from unrelated version control actions, the commit message for a potential bug fix should not contain any of the following words: compile, compilation, merge, revert.

(c) The commit for a potential bug fix must modify at least one .c source file, and should not modify any header files (i.e., .h files), nor should it modify the source code of files written in another language (that can be identified; e.g., .java, .py, .cc).

(d) Any modified .c source code files within the commit of a potential bug fix must also exist in the previous version of the program (i.e., no new .c source files may be introduced).

(e) The commit does not delete any .c files.

Although there may be more sophisticated ways to detect bug-fixing commits, we believe our approach is suitable for this study. Our relatively strict criteria are likely to result in a high number of false negatives, which should have little to no effect on the outcome of the study. Importantly, these criteria should maintain a low false positive rate. That is, the commits within our dataset are likely to be bug fixes, but the size of the dataset may be much smaller than the set of all bug-fixing commits. We address concerns of size by examining commits from 200 projects.

Rather than dealing with the overheads and limitations of the GitHub API, BugHunter downloads given Git repositories (which may be hosted at locations other than GitHub, if one so wishes) to the host machine and interacts with the repositories directly. This avoids the need to re-download the source code for a particular pair of commits, as the faulty and patched files can be quickly acquired using the git branch command.

In total, we collected 372,463 bug-fixing commits. To reduce the costs of mining, we use a random sample of 10,000 commits for the subsequent stages of the mining process.

3. AST and Edit Script Generation: We use GumTree [Falleri et al., 2014] to generate the abstract syntax trees (ASTs) of the buggy and fixed versions of each modified file. We also use GumTree to compute an edit script between the two trees, describing a set of primitive operations required to transform the buggy AST into the fixed AST. Finally, we apply post-processing steps to these ASTs and edit scripts to ease their integration into later stages of the mining process. See Appendix B.1 for more details on this stage of the mining process.

4. Donor Pool Extraction: We compute the set of donor pools, outlined in Section 5.4.2, for each of the buggy ASTs. To reduce the memory burden of maintaining multiple donor pools across thousands of fixes, we hash code snippets and save them to disk.

5. Repair Action Detection: We use inference rules, described in Appendix B.2, to mine instances of the repair actions described in Section 5.4.3.

6. Graft Detection: We determine which of the mined repair action instances can be grafted from the contents of each donor pool.
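As promised in step 2, the sketch below shows how criteria (a)-(e) might be checked using the GitPython API (git.Repo, iter_commits, and commit diffs). It approximates the filtering logic only; the actual BugHunter implementation may handle edge cases (merge commits, renames, message encodings) differently.

```python
import re
import git  # GitPython

FIX_RE = re.compile(r"\b(bugs?|fix(es|ed)?|fault(y)?|defect|repair(ed)?|patch(ed)?)\b",
                    re.IGNORECASE)
VETO_RE = re.compile(r"\b(compile|compilation|merge|revert)\b", re.IGNORECASE)
OTHER_EXTS = (".h", ".java", ".py", ".cc")

def candidate_bug_fixes(repo_path, rev=None):
    """Yield commits in the given repository that satisfy criteria (a)-(e)."""
    repo = git.Repo(repo_path)
    for commit in repo.iter_commits(rev):
        msg = commit.message
        if not FIX_RE.search(msg) or VETO_RE.search(msg):        # (a), (b)
            continue
        if not commit.parents:                                   # skip root commits
            continue
        diffs = commit.parents[0].diff(commit)
        paths = [(d.b_path or d.a_path or "") for d in diffs]
        modifies_c = any(p.endswith(".c") for p in paths)
        touches_other = any(p.endswith(OTHER_EXTS) for p in paths)
        adds_c = any(d.new_file and (d.b_path or "").endswith(".c") for d in diffs)
        dels_c = any(d.deleted_file and (d.a_path or "").endswith(".c") for d in diffs)
        if modifies_c and not (touches_other or adds_c or dels_c):  # (c), (d), (e)
            yield commit
```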


5.6. Results

In this section we present the findings of our analysis. Firstly, we look at the frequency of each of the proposed actions, before determining the extent to which instances of these repair actions can be grafted from existing source code within the same file.

Repair Action Frequency

To determine the utility of each repair action outside of the context of plastic surgery, Table 5.1 shows the number of instances of each repair action that were found across each of the mined bug fixes, together with the number of bug fixes for which a repair action of that kind could have been used.

We find that statement modification is by far the most frequently-encountered repair action, used by over 73% of bug fixes. This result, and the relatively low rate of statement deletion, confirms our hypothesis that the majority of changes to the program occur below the statement level used by existing techniques, such as SPR and GenProg. To fix more bugs, we can either restrict the consideration of replacement statements to those that are structurally similar to the ones being replaced, or we can extend the repair model with actions that operate below the level of statements.

We also find that statement deletion occurs in almost a third of bug fixes, suggesting that attempts to remove this repair action in the hope of avoiding the low-quality repairs discovered within GenProg may be misguided and detrimental to the effectiveness of the search technique. The high frequency of statement deletion may also be (partly) due to the way that statement replacement events are handled (as separate deletion and insertion actions). Nevertheless, we find that statement insertion occurs in over half of bug fixes, lending further support to the statement-level repair model.

Upon first glance, the fraction of bug fixes that involve replacement of large blocks of code, such as if-branches and loop bodies, may seem particularly high. This number is likely due to the fact that we attempt to find all possible ways of interpreting a repair; a modification to a particular statement within a block may also be achieved by replacing that entire block with another that contains the necessary changes.

Amongst the least frequent repair actions (< 2%), we find "Replace Assignment Operator", "Guard Else Branch", "Remove Call Argument", "Replace Switch Expression", "Insert Else-If Branch", "Unwrap Statement", and "Insert Else Branch". Although "Remove Call Argument" and "Unwrap Statement" are relatively rare, neither of these repair actions requires any donor code, and so they may be added to a repair model to solve a small number of bugs at a relatively low additional cost. In contrast, each of the other repair actions may potentially have a large number of applicable potential snippets, and thus they may represent a poor value proposition within a context where compute resources prevent a large number of candidate repairs from being tested.

Repair Action                 Instances   Usage       %
Modify Statement                 27,989   2,027   73.07
Insert Statement                  9,749   1,510   54.43
Replace Then Branch               9,835   1,136   40.95
Modify Call                       6,473     934   33.67
Delete Statement                  6,702     915   32.98
Modify Assignment                 5,869     843   30.39
Replace If Condition              1,946     648   23.36
Replace Call Argument             2,898     546   19.68
Replace Assignment RHS            1,668     535   19.29
Replace Loop Body                   986     473   17.05
Replace Else Branch               3,481     408   14.71
Replace Call Target                 828     201    7.25
Replace Assignment LHS              429     115    4.15
Wrap Statement                      148     105    3.79
Insert Call Argument                388      82    2.96
Replace Loop Guard                  134      69    2.49
Remove Else Branch                   76      62    2.24
Insert Else Branch                   54      45    1.62
Unwrap Statement                     38      28    1.01
Insert Else-If Branch                19      17    0.61
Replace Switch Expression            21      16    0.58
Remove Call Argument                 23       8    0.29
Guard Else Branch                     2       2    0.07
Replace Assignment Operator           0       0    0.00

Table 5.1: The number of instances of each repair action discovered across each of the mined bugs, together with the number (and percentage) of bugs that involve at least one repair action of that type.

Repair Action Graftability

Having determined the frequency of repair actions, we now turn our attention to whether the snippets used within those repair actions can be discovered within the (buggy version of the) source code of the same file. Specifically, we measure the graftability of each of the different types of repair action (i.e., the fraction of the repair action instances whose materials can be found in the donor pool). Since the graftability of a repair action is dependent upon the contents of the donor pool, we measure graftability for both the concrete and abstract donor pools. The results of this analysis can be found in Table 5.2. Note that actions which purely involve deletion, such as "Delete Statement" and "Remove Else Branch", are completely graftable regardless of donor pool, since they do not introduce code.

                                       Graftability
Type                          Concrete Pool   Abstract Pool
Delete Statement                    100.00%         100.00%
Guard Else Branch                     0.00%          50.00%
Insert Call Argument                 44.85%          88.92%
Insert Else Branch                   20.37%          25.93%
Insert Else-If Branch                 5.26%          15.79%
Insert Statement                     27.17%          37.90%
Modify Assignment                    58.00%          71.63%
Modify Call                           4.55%          45.54%
Modify Call Arguments                 4.72%          37.22%
Modify Statement                     47.40%          57.54%
Remove Call Argument                100.00%         100.00%
Remove Else Branch                  100.00%         100.00%
Replace Assignment LHS               17.25%          49.18%
Replace Assignment RHS               10.07%          45.50%
Replace Call Arg                     22.60%          52.45%
Replace Call Target                  18.84%          94.08%
Replace Else Branch                   0.92%          37.92%
Replace If Condition                 34.33%          55.65%
Replace Loop Body                     1.11%          15.52%
Replace Loop Guard                   12.69%          38.06%
Replace Switch Expression             9.52%          42.86%
Replace Then Branch                  47.25%          51.46%
Unwrap Statement                    100.00%         100.00%
Wrap Statement                       35.81%          54.05%

Table 5.2: The graftability of each repair action in the contexts of the concrete pool, containing the unchanged snippets from the file under repair, and the abstract pool, containing the unlabelled forms of the snippets from the file under repair.
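The sketch below shows one plausible way in which the graftability of a mined repair-action instance could be decided against a donor pool: serialise the snippets the human fix introduced and test whether identical serialisations occur in the pool. The serialise helper is assumed to produce a stable string for an AST snippet; it is not part of any real tool's interface.

```python
def is_graftable(instance, donor_pool, serialise):
    """Decide whether the donor material required by a repair-action instance
    can be found, verbatim, in the given donor pool.

    `instance.required_snippets` is assumed to list the AST snippets that the
    human fix introduced; pure deletions have an empty list and are therefore
    always graftable.
    """
    available = {serialise(snippet) for snippet in donor_pool}
    return all(serialise(s) in available for s in instance.required_snippets)
```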

Concrete Donor Pool

Using the contents of the concrete pool, we find that between 0–58% of instances of particular repair action types can be grafted in their entirety. The large degree of variance in graftability between different kinds of repair action suggests that plastic surgery is more effective within the context of some kinds of repair action, and less so in others.

At the statement level, we find that 27% of statement insertions may be grafted; a similar rate to previous studies [Barr et al., 2014; Martinez et al., 2014]. Encouragingly, we find that statement modification (the most frequently encountered repair action, at 73%) is the second-most graftable repair action, with a graftability of 47%; this result suggests that the modified form of a statement is likely to already exist within the donor pool. We find that 58% of "Modify Assignment" actions are graftable, hinting that varying degrees of graftability may be seen between different kinds of statements; we leave investigation of this possibility, and how it may be harnessed by the repair model, to future work.

In general, we find that repair actions that accept a block of statements as their input are amongst the least graftable: "Replace Loop Body" (1%), "Replace Else Branch" (1%) and "Insert Else-If Branch" (5%). Two surprising exceptions to this rule are "Insert Else Branch" (20%) and "Replace Then Branch" (47%).

Below the statement level, we observe a graftability between 5% and 45%. Again, we observe substantial variance between the graftability of actions at this level. Amongst the most graftable repair actions are "Replace If Condition" (34%), "Wrap Statement" (36%), and "Insert Call Argument" (45%).

Abstract Donor Pool

Comparing the abstract pool to the concrete pool, we see a substantial increase in graftability, rising from 0–58% to 16–94%. These results show that snippets structurally identical to those used by the repair are likely to already exist, and as such, serve as an effective pool of donor code from which to automatically craft repairs. This increase in graftability is most prominent in the block-level actions, which were amongst the most difficult to graft using the concrete donor pool: "Replace Loop Body" (1% to 16%), "Replace Else Branch" (1% to 38%), and "Insert Else-If Branch" (5% to 16%). Likewise, "Modify Call" and "Modify Call Arguments" rise from 5% each to 46% and 37%, respectively. When one removes block-level repair actions from consideration, one finds that graftability increases from 16–94% to 37–94%, a reasonable figure when attempting to automatically generate repairs.
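A minimal sketch of the kind of unlabelling that distinguishes the abstract pool from the concrete pool is shown below. It operates on a token sequence and replaces identifiers with a placeholder; the token representation and the decision of what counts as an identifier are assumptions made for the example, as the real analysis works over AST labels.

```python
def abstract_tokens(tokens):
    """Replace identifier tokens with a placeholder so that two snippets that
    differ only in variable or function names compare as equal.

    `tokens` is assumed to be a sequence of (kind, text) pairs produced by a
    lexer that distinguishes identifiers from keywords, operators and literals;
    this representation is an assumption made purely for the example.
    """
    return tuple("<ID>" if kind == "identifier" else text
                 for kind, text in tokens)

def abstract_pool(snippets, tokenise):
    """Build the abstract donor pool as a set of unlabelled token sequences."""
    return {abstract_tokens(tokenise(snippet)) for snippet in snippets}
```

Membership of a human-introduced snippet in such an abstract pool then corresponds to the abstract-pool graftability reported in Table 5.2.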

5.7. Discussion & Conclusion

Building upon previous work on plastic surgery and the redundancy assumptions of automated program repair, we find that the effectiveness of plastic surgery varies according to the particular repair actions to which it is applied. In general, we find that plastic surgery is most effective in the context of more granular repair actions, at and below the statement level, and less so at the block level. We find that our odds of discovering a graft within the previous version of the file under repair are significantly enhanced when labels (i.e., variable and function names) are removed from consideration.

In Table 5.3, we summarise the findings of our study by giving the frequency and graftability of each of the proposed repair actions, together with a measure of its effectiveness, computed as the product of frequency and graftability. To achieve a balance between search efficiency and the number of bugs that can be solved by a given tool, we suggest that the search should devote the bulk of its attention towards repair actions that are both frequent and highly graftable. We also suggest swapping GenProg's "Replace Statement" operator with a "Modify Statement" operator, or otherwise tuning the "Replace Statement" operator to heavily favour the replacement of a statement with a structurally similar one.


Type                          Frequency   Graftability   Effectiveness
Modify Statement                 73.00%         57.54%          42.00%
Delete Statement                 32.98%        100.00%          32.98%
Modify Assignment                30.39%         71.63%          21.77%
Replace Then Branch              40.95%         51.46%          21.07%
Insert Statement                 54.43%         37.90%          20.63%
Remove Call Argument             19.68%        100.00%          19.68%
Modify Call                      33.67%         45.54%          15.33%
Remove Else Branch               14.71%        100.00%          14.71%
Replace If Condition             23.36%         55.65%          13.00%
Replace Call Arg                 19.68%         52.45%          10.32%
Replace Assignment RHS           19.29%         45.50%           8.78%
Modify Call Arguments            19.68%         37.22%           7.32%
Replace Call Target               7.25%         94.08%           6.82%
Replace Else Branch              14.71%         37.92%           5.58%
Replace Loop Body                17.05%         15.52%           2.65%
Insert Call Argument              2.96%         88.92%           2.63%
Wrap Statement                    3.79%         54.05%           2.05%
Replace Assignment LHS            4.15%         49.18%           2.04%
Unwrap Statement                  1.01%        100.00%           1.01%
Replace Loop Guard                2.49%         38.06%           0.95%
Insert Else Branch                1.62%         25.93%           0.42%
Replace Switch Expression         0.58%         42.86%           0.25%
Insert Else-If Branch             0.61%         15.79%           0.10%
Guard Else Branch                 0.07%         50.00%           0.04%

Table 5.3: A summary of the frequency of each of the proposed repair actions, measured by the percentage of bugs in which it is encountered, together with the graftability of that repair action when the abstract pool is used. Effectiveness, computed as the product of frequency and graftability, estimates the fraction of bugs for which a given repair action may graft a repair.
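The effectiveness measure used in Table 5.3 is simply the product of the two quantities already reported. The sketch below computes it and ranks the repair actions, which is also one way a search might derive frequency-and-graftability-aware sampling weights; the two input dictionaries are assumed to map action names to fractions in [0, 1].

```python
def rank_by_effectiveness(frequency, graftability):
    """Rank repair actions by effectiveness = frequency x graftability.

    `frequency` and `graftability` map repair-action names to fractions in
    [0, 1]; actions missing a graftability entry are treated as ungraftable.
    """
    scores = {action: freq * graftability.get(action, 0.0)
              for action, freq in frequency.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example with two of the values from Tables 5.1 and 5.2 (abstract pool):
ranking = rank_by_effectiveness(
    {"Modify Statement": 0.73, "Wrap Statement": 0.0379},
    {"Modify Statement": 0.5754, "Wrap Statement": 0.5405})
# [('Modify Statement', 0.420...), ('Wrap Statement', 0.020...)]
```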

RQ2: Is plastic surgery equally effective for all repair actions?

We discovered that plastic surgery is less effective when applied at the block level. This suggests that excluding such edits from the search space may increase efficiency with minimal compromise to repairability (i.e., the bugs that can be fixed). We found that statement deletion is the second most common repair action, occurring in roughly a third of bug fixes. This result suggests that excluding statement deletion from the repair model to avoid overfitting [Long and Rinard, 2015; Qi et al., 2015] may be misguided and likely to reduce the ability to fix bugs.

Given the success of repair techniques which use repair actions below the statement level, it is crucial that such actions be incorporated into scalable, search-based repair.


Our results demonstrate that plastic surgery can be successfully applied at this level of granularity—by doing so, we can allow search-based repair to address a greater number of bugs.

We also found that, in the majority of bug-fixing commits, the changes occur below the statement level. In 58% of cases, the modified variant of the statement could be found somewhere else in the program. This result could be used to increase the efficiency of program repair by focusing statement replacement on structurally similar snippets. This result builds on previous work by Soto et al. [2016], showing that certain kinds of statement are more likely to be replaced by certain kinds of donor statements (i.e., not all statement replacements are equally likely).

RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

We found that the effectiveness of plastic surgery can be boosted significantly through the use of abstract donor pools. Our results suggest that incorporating abstract donor pools into repair techniques could provide a substantial increase in the number of fixable bugs.

Future Work

Below, we discuss a number of ways in which this analysis could be extended to gain knowledge about the likelihood and composition of repair actions; knowledge that may be leveraged to produce a more predictive, effective, and efficient repair model, allowing more bugs to be solved in fewer candidate evaluations.

• Limitations of Plastic Surgery: We found that the success of plastic surgery is dependent upon both the repair actions to which it is applied, and their associated level of granularity. Building on this, one could further explore a more detailed set of repair actions to determine whether type, or other attributes associated with code fragments, affect graftability. In particular, it may be fruitful to explore the graftability of particular kinds of statements and expressions, and whether certain fragments are unlikely to be found within existing code (e.g., print statements and strings). By identifying the strengths and weaknesses of plastic surgery, we can tune our repair model accordingly, and prune certain portions of the search landscape, leading to a more efficient search process.

• Domain-Specific Languages: Although we use inference rules to formalise the semantics of repair actions within this analysis, further work could be performed to allow a compact domain-specific language to be used to describe (and implement) repair actions (in terms of constrained source code transformations). Beyond improvements to software quality, this addition would allow potential repair actions to be machine generated, rather than being translated by hand, opening up the possibility of using evolutionary computation and knowledge discovery to construct and evaluate novel repair actions.

• Graft Localisation: Although we may use the frequency of repair actions to generate a more effective probabilistic repair model, this step only allows us to differentiate between repair actions, rather than candidate grafts. One solution to this would be to incorporate an additional layer into the probabilistic model, describing the likely type of the donor code snippet. Soto et al. [2016] achieve this by looking at which kind of statement is most likely to replace a statement of a given kind. Their approach could be further extended to statement insertions, or it could use other attributes of a snippet, such as its size or complexity.

More generally, it may be productive to apply machine learning to the problem of graft selection, for both automated repair and otherwise. Ideally, such an approach would allow us to rank grafts by their likelihood. In addition to potentially improving the quality of repairs and the efficiency of the search process, this step would allow us to remove the non-deterministic aspects of the search with minimal impact on results, eliminating the costly need to gauge performance over a number of repeated runs.

• Discovery of Repair Actions: Instead of mining a pre-specified set of repair actions, one may attempt to discover common, predictive repair actions within the corpus, through the application of machine learning and graph mining techniques. In particular, one may combine a variant of LASE [Meng et al., 2013], a method for discovering context-aware, systematic edits, with BugHunter to find (and constrain) common repair actions across a wide and diverse corpus of programs.

• Bug-Fix Detection and Categorisation: To identify a larger number of bug-fixing commits, and to reduce the number of false positives, one could explore the use of more sophisticated bug fix identification criteria. Whilst incorporating such measures would be unlikely to significantly alter the outcome of the research questions posed in this chapter, it may allow BugHunter to be used to probe the composition of specific repair actions, allowing the findings to be exploited to produce a more effective repair technique. By improving bug fix detection accuracy, one may use existing techniques [Thung et al., 2012] to predict the category of bug under repair, potentially allowing the exploitation of that information to suggest which repair actions are most likely to be used and where the fix is most likely to reside.


CHAPTER 6

Search

Although search-based program repair techniques such as GenProg, HDRepair, and PAR are designed to be capable of multiple-edit repair, rarely is this ability used in practice. Underlying all of these techniques is a common search algorithm based on genetic programming; HDRepair and PAR can be viewed as extensions to the original GenProg system. Despite possessing the ability to construct multiple-line patches, in practice, the majority of patches generated by GenProg can be reduced to a single edit [Qi et al., 2015]. Moreover, almost all of the plausible patches generated by GenProg on the ManyBugs dataset have been shown to be equivalent to functionality deletion [Qi et al., 2015]; only 2 of the 55 reported bug fixes are correct with respect to the intended program semantics. Whilst these results suggest that GenProg, as a whole, struggles to craft repairs, it is unclear whether this is due to an inadequate repair model, a poor search algorithm, or both.

Since the introduction of GenProg, work on search-based program repair has focused primarily on the efficiency of the repair process, to the detriment of scalability (i.e., the ability to repair multiple-line bugs). We briefly discuss three subsequent search-based repair techniques with alternative search algorithms:

• AE, proposed by a sub-set of GenProg's authors, replaces GenProg's search algorithm with an exhaustive search. By default, this search is conducted within a single-edit search space. This search space is obtained by discovering, and discarding, a sub-set of provably equivalent patches, through the use of various compiler optimisations and program analysis techniques. Furthermore, by treating the search process as a decision problem (i.e., is this patch correct?), rather than an optimisation one (i.e., how close is this patch to being correct?), AE uses test case prioritisation and regression test selection techniques to reduce the expected number of test case evaluations. (Note that regression test selection, unlike test prioritisation, could be used by GenProg to reduce the test suite for a candidate patch.)

• RSRepair, a variant of GenProg, showed that a form of random search attained a higher efficiency and success rate than GenProg's genetic algorithm [Qi et al., 2014]. However, RSRepair constrains itself to a single-edit search space, and the ManyBugs dataset was used to conduct its evaluation. Thus, it may be that RSRepair simply converges to a plausible but incorrect solution faster than GenProg does. Authors have incorrectly cited this study as evidence that GenProg is merely equivalent to a less efficient form of random search [Long and Rinard, 2015, 2016]. Whilst this may be the case, the evaluation of RSRepair does nothing to prove it, nor does it claim to. Rather, RSRepair shows the effectiveness of using test case prioritisation to reduce the cost of finding single-edit patches.

• SPR introduces both a new, substantially larger repair model, intentionally lacking a statement deletion repair action, and a new search algorithm, used for a sub-set of its patches, known as value search. SPR manages to find correct patches for 16 bug scenarios within the ManyBugs dataset, compared to 2 found by GenProg. Despite the intentional exclusion of explicit functionality deletion, Mechtaev et al. [2016] highlight that SPR implicitly generates many such patches through the introduction of tautological and contradictory branch conditions. For one of the libtiff subjects within ManyBugs, 80% of SPR's reported patches were found to be functionality deleting.

To accommodate the increased search space accompanying its enriched repair model, SPR uses value search to soundly reduce the number of patch evaluations. Instead of validating concrete patches, value search first determines whether a set of outcomes exists for a given branch condition that would cause the test suite to pass. If, and when, such a set of outcomes is found, SPR uses value search to synthesise a suitable branch condition, using its knowledge of the program state and the intended outcome at each evaluation of the branch condition.

Although all of these techniques outperform GenProg in terms of efficiency, and in some cases effectiveness (i.e., the number of bugs for which a repair can be found), none is capable, by default, of producing multiple-edit patches. In other words, since GenProg, researchers have proposed significantly more efficient techniques for finding single-edit repairs, but none has proposed better (search-based) techniques for generating multiple-edit patches.

SPR, RSRepair, and AE demonstrate that treating program repair as a decision problem, rather than an optimisation problem, is an effective strategy for decreasing the cost of finding single-edit patches. This treatment, however, is not scalable; the size of the search space increases exponentially with respect to the size of the patch. Even if decision techniques can substantially reduce the cost of evaluating a single candidate, the combinatorial number of patches renders this advantage more or less useless. If search-based program repair is to scale to multiple lines, active search algorithms capable of discovering partial fixes and promising repair locations will be needed.

Motivated by this need for better search algorithms for multiple-edit repair, in this chapter we conduct a theoretical and empirical analysis of GenProg's search technique (GenProg being the only search-based technique capable of multiple-edit repair), in order to gain a clearer understanding of its strengths and weaknesses. Based on our findings, we propose and demonstrate an alternative search technique better suited to the nuances of program repair, inspired by a greedy algorithm. In summary, the main contributions of this chapter are as follows:


• We conduct a theoretical analysis of GenProg's search algorithm, identifying its underlying assumptions and a number of potential weaknesses.

• Based on our theoretical analysis, we conduct an empirical analysis to test our theories and to determine the contribution of fitness to the success of the search.

• Motivated by the findings of these analyses, we propose and implement an alternative search technique, capable of multi-edit repair, based on a greedy algorithm.

• We demonstrate that our proposed technique outperforms GenProg's search algorithm across a number of performance metrics.

The remainder of the chapter is structured as follows: Section 6.1 provides a review of the related literature. Sections 6.2 and 6.3 conduct theoretical and empirical analyses of GenProg's search algorithm, respectively. Section 6.4 introduces and evaluates an alternative search algorithm for program repair, inspired by greedy algorithms. Section 6.5 outlines directions for future work. Finally, Section 6.6 summarises the findings of this chapter and provides concluding remarks.

6.1. Related Work

In this section, we briefly discuss previous and related work concerning the performance of genetic algorithms for automated program repair, as well as relevant studies on the matter of search landscapes for programs.

Random Search for Program Repair—RSRepair

In an effort to gauge the effectiveness of GenProg's search algorithm, Qi et al. [2014] explored the effects of replacing GenProg's genetic algorithm with a random search. The authors found that, over 100 runs on a sub-set of the ManyBugs benchmarks, their variant of random search, RSRepair, achieved both a higher success rate and a lower mean number of test case evaluations than GenProg on 24 out of 25 problems. For more technical details on RSRepair, see Section 2.2.3.

Although these results demonstrate the (relative) effectiveness of random search compared to genetic algorithms for the purposes of program repair, they do not necessarily show that genetic algorithms are detrimental to the search. Importantly, RSRepair is able to use test case prioritisation and to terminate on the first instance of failure within the test suite, whereas GenProg must evaluate a fixed number of tests in order to produce a fitness value, regardless of their outcomes. As such, RSRepair is able to achieve a far higher test case efficiency than GenProg; in most cases, the candidate repair fails the first negative test case. In addition to evaluating candidates in a different way, RSRepair also operates in a single-edit search space, giving it a distinct advantage over GenProg, which considers patches of an arbitrary length. For repairs that require only a single edit, there is little room for genetic algorithms to compose a solution, thus leading to a partial explanation of the strength of RSRepair. Conversely, problems requiring multiple edits cause us to question whether genetic algorithms are capable of identifying and combining partial solutions; no studies have investigated this assumption.

Representation and Operators

Le Goues et al. [2012c] conducted a study of the impact of various parameter choices on the efficiency of GenProg and the total number of bugs it is able to repair (within a fixed window of time). Specifically, the authors considered the representation used, the choice of crossover operator, the probabilities associated with the selection of each mutation operator, and the weighting factor used by the fault localisation to bias the search towards mutation of statements executed solely by the failing test cases. Through their investigation, the authors discovered:

• The use of the Patch representation, in which individuals are represented as a sequence of edit operators, yielded a higher success rate than the original AST/WP representation.

• The time taken to find a repair (when successful) was lowest when no crossover operator was used, although this resulted in the worst success rate (54.4%) of the options studied. Of the two remaining options, one-point crossover and sub-set crossover, one-point crossover was found to have the higher success rate (65.2% vs. 61.1%), and the lower number of on-average fitness evaluations (118.20 vs. 163.05).

• From observation of the composition of (Insert, Delete, Replace) operations within the minimised form of (plausible) repairs generated by GenProg, it was found that each of these edit types appeared in 13%, 51%, and 57% of repairs, respectively. In contrast, the associated weightings for each mutation operator cause the different types of edit to appear with equal frequency, despite the prevalence of deletion and replacement edits.

• The authors also measured the proportion of repairs that exclusively modified statements executed by the negative test cases to the repairs that modified statements executed by both the positive and negative test cases. Only 35% of modified statements were executed exclusively by the negative tests, despite the fact that GenProg's default fault localisation weightings attribute 90% of modifications to statements executed solely by the negative tests.

Based on the results of the investigation, Le Goues et al. [2012c] changed the parameters used by GenProg, and evaluated the resulting algorithm on a set of 105 bugs (which now forms a sub-set of the ManyBugs benchmarks). These changes allowed GenProg to repair 5 additional bugs, and decreased the time taken to find a repair by 17–43% for some of the more difficult bug scenarios.

Qi et al. [2015] found that, of the 55 bugs that were reported to have been repaired by GenProg, only 2 were correctly repaired. In most cases, patch evaluation only considered the exit status of the program and not its outputs, leading to patches that inserted code such as exit(0); being accepted. The findings of that study cast uncertainty on any conclusions made regarding the effectiveness and efficiency of GenProg's search algorithm on the basis of inadequate test suites. Results reported to increase the success rate or reduce the cost of finding a repair may in fact be pushing the search towards areas likely to contain a plausible but incorrect repair.

Predicate-based Fitness Functions

Fast et al. [2010] propose an alternative fitness function for GenProg based on the similarity of learned program invariants (logical assertions at points within a program that hold true over an associated set of executions, also referred to as predicates) between the original program and candidate solutions. For example, an invariant may specify that an integer argument x, belonging to some given function, is always non-negative (i.e., x ≥ 0 holds over all observed executions).

Fast et al. [2010] use Daikon [Ernst et al., 2007], a popular, off-the-shelf invariant mining tool, to infer these predicates. This process works by first instrumenting the program to monitor invariants over observed program values, before executing the program and observing which predicates hold. Sets of observed predicates are collected for executions of each of the test cases within the test suite.

Building on work by Liblit et al. [2005] on the use of invariants for statistical fault localisation, the following statistics are calculated for each observed predicate P:

• Failure(P): the probability that the program will fail, given P.

• Context(P): the probability that the program will fail, given that the line on which P is tested is reached.

• Increase(P): the amount by which P being true increases the probability of failure.

The resulting statistics describe a set of observations of program behaviour that are associated with failure (e.g., the branch condition at line 5 only evaluates to 0 when the program fails). Using these statistics, the following two predicate sets are determined:

• Increase Set: all P such that Increase(P) > 0.

• Context Set: all P such that Context(P) > 0.

The following universal sets of predicates are also constructed:

• all P that hold for every execution,


• all P that fail to hold for every execution,

• all P that hold only for the positive tests,

• all P that fail to hold only for the negative tests.
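To make the construction above concrete, the following Python sketch computes the per-predicate statistics and the two derived sets from a table of per-test observations. The Observation record and the use of Increase(P) = Failure(P) − Context(P) are illustrative simplifications, not the implementation used by Fast et al.[2010].

from collections import namedtuple

# A hypothetical record of one test execution: which predicates were reached,
# which of those held (were observed to be true), and whether the test failed.
Observation = namedtuple("Observation", ["reached", "held", "failed"])

def predicate_statistics(predicates, observations):
    """Liblit-style statistics for each predicate, estimated from observations."""
    stats = {}
    for p in predicates:
        reached = [o for o in observations if p in o.reached]
        held = [o for o in reached if p in o.held]
        # Failure(P): probability of failure, given that P held.
        failure = (sum(o.failed for o in held) / len(held)) if held else 0.0
        # Context(P): probability of failure, given that the line testing P was reached.
        context = (sum(o.failed for o in reached) / len(reached)) if reached else 0.0
        # Increase(P) = Failure(P) - Context(P).
        stats[p] = {"Failure": failure, "Context": context,
                    "Increase": failure - context}
    return stats

def predicate_sets(stats):
    """The Increase and Context sets used to characterise program behaviour."""
    return {
        "increase_set": {p for p, s in stats.items() if s["Increase"] > 0},
        "context_set": {p for p, s in stats.items() if s["Context"] > 0},
    }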

Together, these six sets represent a baseline B, characterising the behaviour of the buggy program. To compare the behaviour of a candidate patch i to the original program, its six invariant sets V_i are computed and compared against B. This comparison is performed by computing a number of statistics:

• The number of predicates that once predicted failure (i.e., the set of predicates that hold only for the negative tests) but no longer hold on failing runs in variant i.

• The number of predicates that did not hold on failing runs for the original program but hold on failing runs for variant i.

• The cardinality of the set differences, for each of the six predicate sets.

• The weighted sum of the Context scores of predicates in the context set. (This weighting is learnt using linear regression, discussed next.)

• The weighted sum of the Increase scores of predicates in the context set.

These statistics are combined with the outcomes of the test suite for the candidate patch to form a characteristic vector of scalar features f_i. To transform these vectors into scalar fitness values, linear regression is used to learn a suitable global intercept c_0, and a linear weighting of features, given in Equation 6.1.

predicate_fitness(i) = c_0 + Σ_j c_j · f_{i,j}    (6.1)

As its training set, the learner requires a large number of evaluated patches for the bug under repair—in the evaluation performed by Fast et al.[2010], 1772 such patches were generated—together with their test suite outcomes, and the state of their invariants. Assuming that at least one solution to the bug is found amongst these patches, D_opt, the shortest edit distance from a given patch to a known repair, is calculated for each entry in the training set. Linear regression is then used to calculate a global intercept and a vector of feature weights that approximates the edit distance from a given patch to a potential solution.
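As a rough sketch of this learning step, and under the assumption that feature vectors and D_opt values have already been gathered for a set of evaluated patches, the intercept and weights of Equation 6.1 could be fitted by ordinary least squares; the function names below are illustrative, not those of Fast et al.[2010].

import numpy as np

def learn_predicate_fitness(features, d_opt):
    """Fit c_0 and c_j (Equation 6.1) so that the learned fitness approximates
    the edit distance from a patch to a known repair (D_opt)."""
    X = np.asarray(features, dtype=float)          # one row of features per evaluated patch
    y = np.asarray(d_opt, dtype=float)             # shortest edit distance to a known repair
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of ones for the intercept
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    c0, c = coeffs[0], coeffs[1:]
    return c0, c

def predicate_fitness(c0, c, f_i):
    """Evaluate Equation 6.1 for a candidate patch with feature vector f_i."""
    return c0 + float(np.dot(c, f_i))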

Although the idea of using learned invariants as part of an alternative, more amenable fitness function sounds promising, the practical utility of the fitness function proposed by Fast et al.[2010] is unclear. The technique requires that a highly expensive process of learning be carried out before the search process begins; for larger programs, such as PHP and Python, gathering the necessary number of patches would likely take longer than 12 hours. More importantly, however, this approach requires

that at least one repair to the bug is known before beginning the search. This requirement raises the question: “If a repair to the bug is already known, then why are we searching?”

To evaluate the quality of the fitness function, the authors measure its fitness–distance correlation [Jones and Forrest, 1995] for a single bug scenario, taken from a precursor to the GenProg TSE 2012 dataset. The fitness–distance metric approximates the difficulty of a search problem as the correlation between (scalar) fitness values and the edit distance to the nearest (known) solution. For a single bug, the authors found a moderately strong correlation between the predicate-based fitness function and D_opt. In contrast, GenProg's original fitness function demonstrated little to no correlation. We do not believe that these results should come as a surprise, nor do they demonstrate the superiority of the predicate-based fitness function; the results are a reflection of the fact that D_opt was used to train the fitness function.
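For reference, the fitness–distance correlation of a sample is simply the Pearson correlation between fitness values and distances to the nearest known solution; a minimal sketch, assuming both quantities have already been computed for each sampled patch:

import numpy as np

def fitness_distance_correlation(fitness, distance):
    """Pearson correlation between fitness values and distances to the
    nearest known solution (Jones and Forrest, 1995)."""
    f = np.asarray(fitness, dtype=float)
    d = np.asarray(distance, dtype=float)
    return float(np.corrcoef(f, d)[0, 1])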

Test Case Sampling

In addition to conducting an initial study into the feasibility of using learned program invariants as part of an alternative fitness function, Fast et al.[2010] investigated the use of test suite sampling to reduce the cost of fitness evaluations. Instead of running all tests within the suite for each fitness evaluation, the authors observed the performance of the search—measured by the wall-clock time taken to find a repair—when a random sample of tests is used to calculate fitness.2 Fast et al.[2010] found that the use of sampling, combined with a safe-impact analysis responsible for determining whether the outcomes of a particular test would be affected by a given patch, resulted in an 81% speed-up. Notably, the benchmarks used to conduct this study are amongst those that we found to have weak oracles (e.g., only checking the exit status of the program; see Section 3.2.1). As such, it is unclear whether test case sampling would produce the same gains when used with stronger test suites. The authors found little difference in performance between the use of random selection and Walcott et al.[2006]'s time-aware test case prioritisation technique when selecting the tests within the sample.
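A hedged sketch of sampled fitness evaluation is shown below. The 10% sample size follows the footnote above; the run_test helper and the use of the pass rate over the sample as the fitness value are assumptions made for illustration, not details taken from Fast et al.[2010].

import random

def sampled_fitness(patch, tests, run_test, sample_fraction=0.1):
    """Estimate fitness as the fraction of a random sample of tests that pass;
    run_test(patch, test) -> bool is assumed to be provided by the test harness."""
    k = max(1, int(len(tests) * sample_fraction))
    sample = random.sample(tests, k)
    return sum(run_test(patch, t) for t in sample) / k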

Crossover

Oliveira et al.[2016] argue that GenProg's implicit conflation of operator type, fault location, and fix statement into a single discrete unit within the genome results in a landscape that is more difficult to traverse. With the goal of allowing more granular building blocks to be identified and exploited, the authors propose explicitly separating these aspects into their own sub-spaces. To this end, six new crossover operators are proposed, all of which act upon an intermediate representation that explicitly separates these sub-spaces, illustrated in Figure 6.1.

2No mention of the size of this sample is made in the paper. Both subsequent experiments [Le Goues et al., 2012a] and the default settings within GenProg use a 10% sample.


Figure 6.1: An illustration of the implicit search space defined by GenProg's representation and a more granular alternative proposed by Oliveira et al.[2016]. Whereas the type of operation, location, and donor statement used by an edit are all considered to be part of a discrete, evolvable unit within GenProg's representation, Oliveira et al.[2016]'s intermediate representation allows each of these attributes to be treated separately by a set of purpose-built crossover operators.

These six crossover operators are composed from three base operators, with and without an “auxiliary memorisation component”:

• One-Point Crossover on a Single Subspace (OP1Space) applies variable-length one-point crossover to a single subspace, selected at random, leaving the rest of the intermediate representations of the parents unchanged.

• Uniform Single Subspace (Unif1Space) selects a subspace at random and swaps entries between components according to a randomly-generated mask.

• One-Point Across All Subspaces (OPAllS) performs a variable-length crossover across the entire intermediate representations of both parents, allowing all subspaces to be mixed simultaneously.

To allow this intermediate representation to be used in tandem with the original Patch representation, individuals are subject to an encoding and subsequent decoding phase. During the encoding phase, the edits within each patch are exploded into each of the three subspaces of the intermediate representation. After crossover is applied across the three subspaces, each intermediate representation is decoded to form a valid patch. As a consequence of allowing individual subspaces to be blindly modified without consideration of the well-formedness of the patch, invalid edits may be introduced. The role of decoding, therefore, is to identify and remove these invalid edits.
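The following sketch illustrates one of these operators, Unif1Space, over a simplified intermediate representation in which each individual is a dictionary of three parallel lists (one per subspace), together with a decode step that filters out invalid edits. The representation and the is_valid_edit helper are assumptions for illustration; this is not Oliveira et al.[2016]'s implementation.

import random

def unif1space(parent_a, parent_b):
    """Uniform crossover on a single, randomly chosen subspace.

    Each parent is a dict of three parallel lists (assumed, for simplicity,
    to be of equal length): 'operators', 'locations', 'donors'. Entries of
    the chosen subspace are swapped according to a random mask."""
    child_a = {k: list(v) for k, v in parent_a.items()}
    child_b = {k: list(v) for k, v in parent_b.items()}
    subspace = random.choice(["operators", "locations", "donors"])
    for i in range(min(len(child_a[subspace]), len(child_b[subspace]))):
        if random.random() < 0.5:  # randomly generated mask
            child_a[subspace][i], child_b[subspace][i] = \
                child_b[subspace][i], child_a[subspace][i]
    return child_a, child_b

def decode(child, is_valid_edit):
    """Decode an intermediate representation back into a patch, discarding any
    edits made invalid by blind subspace mixing; is_valid_edit is assumed."""
    edits = list(zip(child["operators"], child["locations"], child["donors"]))
    return [e for e in edits if is_valid_edit(e)]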

Optionally, a memorisation component may be employed to avoid removing invalid edits produced during crossover, thus limiting the destruction of search information. This component caches the details of each edit encountered over the course of the search, allowing recovery queries to be made to find valid edits at a particular location and/or with a given donor statement. When enabled, memorisation will attempt to use part of the information provided by the invalid edit to find an edit within the cache that shares some of its details.

To assess the effectiveness of these operators, the authors compared the resulting performance of each operator against the performance achieved through the

use of GenProg's original one-point crossover operator. Across 43 bug scenarios, taken from small programs in the IntroClass and GenProg TSE 2012 benchmarks, Unif1Space without memorisation yielded a 34% improvement in the success rate of the search. Accompanying this improvement in the success rate, however, was a significant degradation of efficiency, as measured by the average number of test suite evaluations (15.59 vs. 29.32).

Mutational Robustness

Schulte et al.[2013] investigated the widely-held view that the majority of software is highly fragile and that small changes are likely to lead to substantial and undesirable changes in behaviour. The authors found that over 30% of (single-edit) modifications—generated using GenProg's mutation operators—resulted in no change to the outcomes of the test suite. The results were found to hold across 22 programs at both the source code and assembly instruction level, as well as across multiple programming languages, including C, C++, Haskell, and OCaml. Based on their findings, the authors proposed the use of neutral variants as a means of proactively generating software diversity, allowing techniques such as N-version programming [Avizienis, 1985] to be used to supply an alternative version of a function when it ceases to work under some conditions. To demonstrate the practical utility of such an ability, the authors seeded five latent faults into the original versions of 11 different programs, and evaluated whether a fix to the fault could be found from amongst a sample of 5,000 of its single-edit neutral variants. With respect to an extended test suite, containing tests that were unused in the generation of neutral variants, plausible repairs for 12 out of 55 faults were found.
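A minimal sketch of how mutational robustness might be estimated is given below; the sample_single_edit_mutant and passes_all_tests helpers are assumed to be provided by the repair framework, and this is an illustration rather than Schulte et al.[2013]'s tooling.

def mutational_robustness(program, sample_single_edit_mutant, passes_all_tests,
                          num_samples=1000):
    """Estimate the fraction of single-edit mutants that are neutral, i.e., that
    leave the outcome of every test unchanged (assuming the original program
    passes its test suite)."""
    neutral = 0
    for _ in range(num_samples):
        mutant = sample_single_edit_mutant(program)
        if passes_all_tests(mutant):
            neutral += 1
    return neutral / num_samples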

6.2. Theoretical Analysis

In this section, we identify and qualitatively discuss a number of phenomena exhibited by GenProg's search algorithm, which we believe to be detrimental to its repair process. As part of this discussion, we speculate on both the immediate and wider ramifications of these phenomena to the efficiency and success of the search. Following this discussion, we investigate the role of fitness information within the search. The discussions in this section are supported by a number of graphs, generated over 20 repeats of a sub-set of the GenProg TSE 2012 dataset, with improvements to its weak output checking.

Bloat

The first, and perhaps most apparent, indicator of unintended behaviour in GenProg is the continual growth of its patches over time—a widespread phenomenon within the field of genetic programming, known as bloat [Langdon and Poli, 1998]. As with other applications of genetic programming, we observe a steady and sustained increase in the size of individuals with each generation, measured by their number of edit operations (Figure 6.2). More specifically, we observed a monotonic increase in the total number of edit operations within the population with each generation (Figure 6.3).3

Figure 6.2: Patch length vs. generation. Patches tend to become longer over time.

Upon closer inspection of GenProg's search algorithm, this phenomenon is not particularly surprising, since the mutation operator appends a newly-sampled edit to the end of each child with a probability equal to the mutation rate; by default, the mutation rate is set to 1.0, causing all children to grow by a single edit.
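A minimal sketch of this mutation behaviour, with a stand-in edit sampler rather than GenProg's actual OCaml implementation:

import random

def mutate(patch, sample_edit, mutation_rate=1.0):
    """GenProg-style mutation on the Patch representation: with probability
    equal to the mutation rate, append one newly sampled edit to the child.
    With the default rate of 1.0, every child grows by exactly one edit."""
    child = list(patch)
    if random.random() < mutation_rate:
        child.append(sample_edit())
    return child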

In practice, GenProg partially overcomes this problem through the use of delta-debugging as a minimisation technique (see Section 4.1.3 for more details). However, this process takes the form of a post-processing step, occurring only once an acceptable solution has been found by the search. In most cases, this minimisation stage reduces the size of the patch to a single edit. For the rest of the search process, however, GenProg takes no active steps to control or reduce bloating within the population.

3Note that this phenomenon persists regardless of whether crossover is enabled. In fact, this phenomenon is further exacerbated when crossover is disabled, since all individuals within the population are guaranteed to be the same size and to be one edit longer than their predecessors.



Figure 6.3: Across all problems, we observe a monotonic increase in the total number of edit operations contained within the population despite the ability of crossover to generate smaller individuals.

One may argue that the phenomenon of bloating is irrelevant to the ability of the search to generate and discover correct solutions, and that the search remains unhindered in its presence. Based on observation and analysis of GenProg's search algorithm, we believe this is not the case, and that bloat is detrimental to the search in multiple ways, for the following reasons:

1. Although the edits within the patch may have no effect on the outcomes of the positive test cases, this does not mean that the patched program retains its original or intended semantics. When combined with a weak testing suite, this phenomenon allows destructive changes to be introduced into the population without being detected, and subsequently retained and propagated (since all edits are conserved).

Over time, as the search finds and exploits more of the weaknesses within the testing suite, the individuals within the population inevitably end up accruing more of these silently destructive edits, and thus end up further away from the semantics of the original program. This behaviour places the search at odds with the implicit assumption that makes automated program repair a tractable problem: that the semantics of the repaired program lie only a short distance away from those of the faulty program.


2. Even in the few cases where the testing suite is perfect at identifying destructive mutants, the search ends up performing redundant work by evaluating edits that appear to be neutral. In these cases, the search is unable to use the results within the cache, and so it must evaluate the mutant as if it were any other. As a result, we believe the search to be spending most of its limited resources in evaluating larger, semantically-equivalent patches.

3. Finally, a human study by Fry et al.[2012] demonstrates that the likelihood of a patch being accepted by a human developer sharply diminishes as the size of the patch increases, and that such increases in patch length are also associated with higher maintenance costs.

Accumulation of Neutral Edits

As bloat causes candidate solutions to move syntactically further away from the original program, the introduction of new edits also drives the program further away from its original and intended semantics. Moreover, this drift is likely to be felt most in the bug-affected regions of the program, covered by few, if any, positive test cases, where almost all changes have no effect on the outcomes of the test suite. In such areas, the search will reduce to a random walk and, as a result of bloat, will continue to accumulate random edits within this region. Consequently, the majority of the search is spent evaluating either bloated but semantically-equivalent candidate patches, or patches that involve so many changes that they are likely to be far away from the intended semantics of the program.

In the absence of effective bloat controls, an implicit selective advantage is created towards larger, more bloated, neutral variants, whose neutral edits are more likely to survive a crossover event. Consequently, these neutral edits will tend to be selected together with other such genes, to the effect that the genome as a whole becomes selected for bloat. Whilst the minimal tournament size (k = 2) used by default in GenProg's tournament selection operator should weaken this pressure towards bloat, it does so at the cost of allowing overtly destructive edits to enter the population.

In summary, without effective bloat controls, patches will, in the best case, tend to accumulate edits that are unhelpful within the context of repair. In the worst case, patches will accumulate detrimental edits that silently destroy functionality within the program and prevent solutions from being found.

Operator Biases

In addition to the problems produced by bloat, we believe that the way in which GenProg allows multiple edits to be performed at a single statement within the program leads to subtle biases that ultimately harm its ability to find repairs. Since the search permits multiple edits to a given statement, a somewhat arbitrary decision must be made as to how these edits should be interpreted when applying the patch.


Figure 6.4: An example of the destructive edit bias. In this example, a single-edit patch is applied to the original AST, given on the left. This patch replaces the node at location 1 with the node at location 5. When the child tree is subsequently mutated, any changes to locations 1, 4 or 5 will have no effect. Note that the donor node, coloured black, may not be the subject of a future mutation operation. Thus, the effects of this replacement operation are permanent on all of its descendants (except in cases where crossover moves this operation to another child).

GenProg deals with this dilemma in perhaps the most intuitive way—applying each edit within the patch in the order in which it appears. However, since the mutation operator may only introduce new edits at the end of the patch, the set of programs that may be reached from a given point becomes constrained by the existing contents of that patch.

The most destructive bias introduced by this behaviour is observed when a deletion or replacement operation is applied to a given statement. In this case, the statement is no longer addressable by future edits, causing the modification to become frozen. As a result, the search is allowed to accumulate non-coding edits at the affected site, since all edits after the Delete or Replace are ignored. An example of this behaviour is illustrated in Figure 6.4. Since all edits are conserved during crossover and mutation, and given the relatively low selective pressure, this can also result in functionality loss, particularly within regions lacking positive test case coverage. This functionality loss may only be recovered through an increasingly-unlikely crossover event.
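The frozen-edit behaviour can be illustrated with a small sketch of in-order patch application. The program is simplified to a list of (statement id, code) pairs, and the edit records are hypothetical; this is not GenProg's implementation.

def apply_patch(statements, edits):
    """Apply edits in patch order over a simplified program, represented as a
    list of (statement_id, code) pairs. Once a statement has been deleted or
    replaced, its identifier no longer appears in the program, so any later
    edit addressed to it is silently ignored ("frozen")."""
    program = list(statements)
    for op, target, donor in edits:
        index = next((i for i, (sid, _) in enumerate(program) if sid == target), None)
        if index is None:
            continue  # target already deleted or replaced: this edit has no effect
        if op == "delete":
            del program[index]
        elif op == "replace":
            program[index] = (None, donor)             # original id is gone
        elif op == "append":
            program.insert(index + 1, (None, donor))   # donor inserted directly after target
    return program

On this simplified model, any later edit addressed to a statement that has already been deleted or replaced silently becomes a no-op, mirroring the frozen modifications described above.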

Another less destructive, but nonetheless hindering, consequence of this behaviour stems from the treatment of append operations. In the event that a donor statement X is appended after a selected statement S, a successive append of statement Y at S will result in the statement sequence S; Y; X at the original location of S. Whilst the semantics of this operation seem acceptable, since edits can only be introduced at the end of the patch, the set of reachable programs becomes increasingly constrained with each append. In the case of the example above, if X and Y are appended in the wrong order, or an incorrect statement is appended instead, that

mistake becomes frozen, save for an unlikely re-arrangement of the genome through a series of crossover events. An illustration of this bias is given in Figure 6.5.


Figure 6.5: An example of the append bias. The AST in the top left shows the state of the original, faulty program. The AST in the top right shows the repaired version of that AST, which contains two statements, α and β, that are missing from the original. On the bottom half of the diagram, we show a repair scenario in which this bias is encountered. In the first generation, α is appended after the statement at location 4, matching its intended position in the repaired program. In the following generation, β is appended, yielding the incorrect sequence β; α. From the first mistake in the order of append operations, the search is unable to move backwards to find a correct order; incorrect edits will continue to accumulate at statement 4.


In summary, by allowing multiple edits at a single site, a number of subtle biases are introduced into the search, limiting its ability to reach solutions as generations pass.

Limited, Short-Term Memory

As a consequence of the relatively small size of its population, the ability of the search to implicitly retain useful information about the problem and the possible nature of a solution is fundamentally and severely restricted. It is therefore the responsibility of the selection operator to manage the storage of this information, and to hold onto information useful for solving the problem whilst discarding misleading or irrelevant information.

For most optimisation problems where population-based algorithms are used, the role of selection is to identify and exploit high-quality individuals, based on the assumption that such individuals tend to be co-located with other such high-quality solutions. In these cases, selection serves primarily as a positive feedback mechanism aimed at promoting higher-quality solutions. In the case of automated repair, however, we are purely interested in finding a correct solution, and from the perspective of the end-user, there is little to no concept of a partial solution. Here, fitness purely serves as a means of more efficiently generating potential solutions, rather than assessing the quality of a solution. To help the search craft potential solutions that are more likely to be correct, we believe that fitness combines elements of both positive and negative feedback:

• Positive Feedback: in theory, fitness helps the search to identify and exploit partial solutions, identified by their passing of a sub-set of the negative test cases. In practice, when adequate test suites are used, patches that pass any of the negative test cases are very rarely encountered.

• Negative Feedback: whilst positive feedback remains largely unused for the majority of the search, the primary feedback mechanism comes from the avoidance of destructive patches, which cause previously passing test cases to fail.

Despite the fact that the search largely makes use of negative feedback in selecting parents, the population-based nature of the algorithm is almost entirely oriented towards positive feedback, which is rarely used. Populations serve relatively well as retainers of positive information (i.e., potential partial solutions), since they are very likely to persist from one generation to the next, and thus be exploited to find promising solutions. However, populations also make for a woefully-ineffective means of storing and exploiting useful information for problems that are largely driven by negative feedback mechanisms. The reason for this ineffectiveness is that negative feedback is only used to influence what is not contained in the population. Given the vast size of the search space, the very small size of the population (40 individuals, by default), and the substantial cost of evaluating candidate solutions, all of

this negative feedback information—indicating that certain edits are destructive—is lost between generations. One approach to tackling this problem would be to avoid solutions that are known to be destructive, in a similar manner to Tabu search [Glover, 1986]. For reasons given in Section 6.4, we decided not to introduce such functionality into GenProg.

The consequences of this lack of memory are most obvious during mutation, whereupon previously encountered destructive edits are given equal consideration when appending a new edit to the patch as those that have yet to be explored. As a result, the algorithm reintroduces known (but unrecorded) destructive edits into the population, causing potentially useful information about partial solutions to be corrupted and discarded. Even in cases where unexplored edits are added to the patch during mutation, if those edits are destructive—which approximately two thirds of edits are [Schulte et al., 2013]—then information will also be corrupted.

In summary, given the nature of the automated program repair search landscape and its heavy emphasis on the avoidance of destructive patches—rather than the pursuit of promising patches—using a population to record problem information appears to be misguided.
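Purely as an illustration of the kind of explicit memory discussed above, and not a feature we added to GenProg (see Section 6.4), such a mechanism might resemble the following sketch, in which edits observed to break positive tests are excluded from future sampling:

def sample_non_tabu_edit(sample_edit, tabu, max_attempts=100):
    """Sample an edit, rejecting any that has previously been observed to be
    destructive; tabu is a set of edits known to break positive tests."""
    for _ in range(max_attempts):
        edit = sample_edit()
        if edit not in tabu:
            return edit
    return None  # the reachable neighbourhood may be exhausted

def record_outcome(edit, broke_positive_tests, tabu):
    """Update the memory after evaluating a single-edit change."""
    if broke_positive_tests:
        tabu.add(edit)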

Summary

In combination, we believe that these phenomena result in a number of unintended effects, all of which combine to significantly harm the effectiveness of the search algorithm in numerous ways:

• Inability to reach certain repairs: as a consequence of operator biases, further compounded by the problem of bloat, it becomes increasingly difficult for the search to reach certain repairs with each passing generation, since the presence of certain edits within a patch precludes the existence of others. Furthermore, mutations in areas of the program that lack positive test case coverage, which also happen to be the most likely to be selected, may silently destroy functionality, preventing a repair from being found.

• Sensitivity to initial conditions: as a side effect of the search's inability to reach certain repairs over time, the success of the search becomes highly dependent on the outcome of a series of chance events in the early generations of the search. If destructive and restrictive edits successfully invade the population, the search is unlikely to be able to recover.

• Inefficient usage of test case evaluations: the inability of the search to explicitly remember and avoid destructive edits, combined with the highly limited memory capacity of its population, leads to corruption of partial solutions and discarding of useful negative-space information.

• Majority of resources are spent on unlikely repairs: due to bloat, the search spends the majority of its time evaluating long and highly unlikely


patches, rather than focusing its efforts on patches that are closer to the original program.

• Poor reliability: as a result of the above symptoms of these phenomena, we expect to see a decreasing (marginal) probability of success as the search progresses, leading to lower levels of reliability.

Ultimately, these symptoms stem from certain design decisions and trade-offs within GenProg's algorithm, including its ability to perform multiple edits at a given site, and its treatment of the problem as an optimisation problem, rather than a decision problem.

6.3. Empirical Study

Having highlighted a number of potential issues and inefficiencies associated with GenProg's search algorithm—an algorithm similar in all important aspects to most other evolutionary APR techniques—here we conduct an empirical study to explore whether these claims hold in practice. In doing so, we seek to produce a better understanding of the workings (and shortcomings) of the evolutionary repair process.

To investigate the effects, if any, of bloat and operator biases on the performance of the search, we answer the following questions:

• RQ4A: Does bloat hinder the ability of the search to efficiently find repairs?

• RQ4B: Does removing the ability to perform multiple edits at a given site improve the efficiency and reliability of the search?

After exploring these effects, we turn our attention to the role of fitness information within the search for multiple-edit repairs. To conduct this investigation, we deconstruct the notion of fitness in a systematic, stepwise manner by answering each of the research questions below:

• RQ5A: Does selecting individuals to serve as parents at random, rather than in accordance with their fitness, have a measurable impact on the performance of the search?

• RQ5B: Does restricting parent selection to non-destructive individuals (i.e., those which pass all of the positive tests), and choosing from amongst them at random, affect the performance of the search?

• RQ5C: Does selecting non-destructive individuals as parents on the basis of the number of negative tests that they pass—instead of choosing between them at random—affect the performance of the search?


Methodology

To answer each of these research questions, we extended GenProg, and produced a suitable configuration for each question. To reduce the cost of answering these questions, we used asynchronous patch evaluation and weak equivalency checking, described in Section 4.2. Using these configurations, we collected performance data for each research question by performing a number of repeat repair trials over a set of 21 bug scenarios. To balance the accuracy of the results against the staggering time and computational resources required to conduct extensive program repair trials, we performed 10 repeat trials for each bug-configuration; in total, we conducted 1260 individual trials.

For each research question, we measure the performance of the associated configuration against GenProg's standard configuration, in terms of the metrics described below:

• We measure the effectiveness of the configuration by the fraction of bug scenarios for which it was able to find a repair.

• For bug-configurations where a repair was found, we measure the reliability as the fraction of runs for that bug-configuration that yielded a repair.

• Additionally, we measure a modified version of the previously used NCP metric, a measure of the average number of patch evaluations required to find a repair. The first of these metrics, NCP_U, measures the average number of unique candidate patch evaluations. By measuring unique evaluations, rather than all evaluations, we gain a more representative measure of performance for configurations that are more likely to revisit previously evaluated patches. The second, ER, measures the efficiency of the search by the number of unique patch evaluations per run (including runs for which a repair was not found). This measure gives a more reliable estimate of the efficiency of a configuration over multiple runs.

Except where explicitly stated, we used identical configurations and termination criteria for each experiment. An outline of this configuration is provided in Table 6.1. The settings within this configuration are based on the default settings for GenProg, which have been previously used to conduct research [Le Goues et al., 2012a,c]. For the purposes of this study, we chose to enforce an explicit limit on the number of unique candidate patch evaluations—rather than restrict the number of generations, as is de facto practice [Le Goues et al., 2012a,b,c; Weimer et al., 2009]—in order to more fairly represent the performance of algorithms that revisit solutions.

As with all of the experiments conducted within this thesis, a complete replication package is provided using RepairBox. Details on replicating the results of this study, along with the other studies within the thesis, can be found in Appendix A. By using RepairBox, we are able to provide a transparent and controlled execution environment without compromising performance.
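For clarity, the two efficiency metrics can be computed as in the sketch below, over hypothetical per-run records of the form (unique candidate evaluations, repaired?); averaging NCP_U over successful runs is our reading of the definition above.

def ncp_u(runs):
    """Average number of unique candidate patch evaluations over the runs that
    found a repair; each run is a (unique_evaluations, repaired) pair."""
    successful = [n for n, repaired in runs if repaired]
    return sum(successful) / len(successful) if successful else float("inf")

def expense(runs):
    """ER: unique candidate patch evaluations across all runs, successful or not,
    divided by the number of successful runs."""
    total = sum(n for n, _ in runs)
    successes = sum(1 for _, repaired in runs if repaired)
    return total / successes if successes else float("inf")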


Population Size: 40
Fault Localisation: GenProg (10/1)
Selection: Tournament (k = 2)
Crossover Rate: 100%
Mutation Rate: 100%
Termination Criteria: Num. Unique Candidates = 400

Table 6.1: The baseline parameters of the algorithm used within each run.

CPU: Intel Xeon E5-2673 v3 (4 cores)
OS: Ubuntu Server 16.04 LTS (64-bit)
Kernel: 4.4.0-77-generic
Docker: 1.12.6, build 78d1802
RAM: 8 GB
Storage: 64 GB, SSD
Cost: $0.249/hr

Table 6.2: Specifications of the Microsoft Azure F4 compute instances used to collect the data for this study.

To run these experiments, we used a large number of parallel F4 compute instances on the Microsoft Azure cloud computing platform. Specifications for these instances are given in Table 6.2.

Benchmarks

In selecting a suitable set of bug scenarios for this study, we primarily considered three factors. The first is that performing 1260 repeat repair trials over the bug scenarios should be neither prohibitively time-consuming, nor exorbitantly expensive. The second is that the test suite should meet the minimal set of requirements laid out in Chapter 3; bug scenarios that fail to thoroughly check the outputs and resulting state of the program should not be considered. From earlier experiments, we found that including such benchmarks—as have been used to conduct previous studies on the efficiency of GenProg [Le Goues et al., 2012a,c; Oliveira et al., 2016; Weimer et al., 2009]—can produce misleading results. In such cases, variants of GenProg which push the search towards regions of the space likely to contain plausible, but incorrect, solutions are assumed to be better-performing variants. When one controls for test suite quality, one finds that these variants lead to a lower performance. The third and final factor is that the bug scenarios should include, but not necessarily be restricted to, bugs whose patches require multiple edits. Without the inclusion of such bugs, one cannot assess the effectiveness of the search technique when it comes to multi-edit bugs—bugs which have largely been beyond the reach of

search-based repair.

With these considerations in mind, we first assessed a number of publicly available datasets of real-world bugs: ManyBugs [Le Goues et al., 2015], Codeflaws [Tan et al., 2017], and the GenProg TSE 2012 [Le Goues et al., 2012b] benchmarks. We found that the bug scenarios within the ManyBugs dataset exhibited excessively-long compilation and test suite evaluation times, and that the majority of its test suites failed to meet our minimum quality requirements (many exclusively consider the exit status of the program). We encountered similar concerns of quality in the GenProg TSE 2012 benchmarks. On inspection of the subjects within the Codeflaws dataset, we determined that GenProg would be highly unlikely to find repairs for any of its bugs, owing to the small size of the programs (mostly less than 20 lines of code).

Given the extensive difficulties involved in using naturally-occurring bugs, we turned our attention to synthetic bugs. Since the research questions asked in this study are purely concerned with the process of searching for patches, rather than the ability to solve real bugs, we believe this decision to be sound. We first explored using a sub-set of the C bug scenarios within the Software-artifact Infrastructure Repository (SIR), but found that their vast search space reduced the probability of repair to such a point that any significant difference in performance between search techniques would be impossible to gauge. Although the programs used by these bug scenarios are smaller than those used in previous APR research, it is important to note that in those cases the search was restricted to a single file within the project—usually containing a few hundred lines of code (and not the millions of lines of code used by the entire project). In contrast, the programs within the SIR are collapsed into a single file containing thousands of lines of code.


Algorithm 5: Test Suite Reduction

ReduceTestSuite(tests, lines, test_cov, k)
begin
    selection ← ∅ ;
    line_cov ← {(l, 0) | l ∈ lines} ;
    for i in 1..k do
        for l1 in lines do
            candidates ← {t | t ∈ tests ∧ Covers(t, l1)} ;
            poor_cov ← {l2 | l2 ∈ lines ∧ line_cov[l2] < i} ;
            if line_cov[l1] ≥ i or candidates = ∅ then
                continue
            end
            best_candidates ← ∅ ;
            best_score ← 0 ;
            for t in candidates do
                score ← #{l2 | l2 ∈ poor_cov ∧ Covers(t, l2)} ;
                if score > best_score then
                    best_score ← score ;
                    best_candidates ← {t} ;
                else if score = best_score then
                    best_candidates ← best_candidates ∪ {t} ;
                end
            end
            t ← RandomChoice(best_candidates) ;
            UpdateCoverage(line_cov, test_cov[t]) ;
            selection ← selection ∪ {t} ;
            tests ← tests \ {t} ;
        end
    end
    return selection
end

Ultimately, we found that when combined with Pythia (see Section 3.2) to produce a high-quality oracle, the programs within the Siemens benchmarks provide all of the desired characteristics. Although each program is only several-hundred lines of code in length, the corresponding size of the search space matches those used in previous studies [Le Goues et al., 2012a, 2015; Long and Rinard, 2015; Mechtaev et al., 2016]. The test suites used by these programs, containing thousands of tests, are unlike those used in previous work, however. To reduce the burden involved in evaluating these test suites, and to produce a test suite closer to those encountered in real-world programs, we devised and used a simple test suite reduction heuristic. This algorithm, Algorithm 5, uses a fast, stochastic greedy search to ensure that, where possible, each line within the program is covered by at least k = 5 tests. Using this algorithm, we were able to reduce the size of the test suites by roughly a factor of 100. This substantially diminished level of coverage appeared to have


Program         # LOC   # Stmts   # Tests   # Tests′
tcas              135       136      1608        20
totinfo           230       271      1052        25
replace           494       372      5542        42
printtokens       342       292      4073        14
printtokens2      388       340      4115        18
schedule          249       194      2650        15
schedule2         273       217      2710        19

Table 6.3: The subject programs used to conduct this study. # LOC describes the number of source code lines in the original program, as measured by cloc. # Stmts specifies the number of statements within GenProg's pre-processed AST representation of the program. # Tests gives the size of the original test suite for the program, and # Tests′ the size of the reduced test suite.

no discernible effect on the quality of patches produced during the study—a phenomenon we intend to investigate more closely in future work. Details of the seven programs used as subjects for this study, together with the sizes of their original and reduced test suites, and their number of lines of code, are given in Table 6.3.

Upon inspection of the original bug scenarios within the Siemens subject programs, we found that GenProg was unable to fix most of them—owing to the granularity of its repair model—and that most fixes required only a single edit. To gain an understanding of the ability of GenProg's algorithm to compose multiple-edit patches, and to ensure our understanding was based on bugs that are solvable within GenProg's search space, we manually seeded bugs into these programs. For each of the seven programs, we seeded a one-line bug, a two-line bug, and a three-line bug, giving a total of 21 bug scenarios, balanced across the different bug sizes.

To ensure that the seeded bugs could be solved, we ensured that a fix could be generated from the source code in the faulty file. For example, to seed a bug scenario that required the insertion of a missing statement, we deleted a statement in the program, and ensured that a redundant copy of that statement could be found elsewhere (and that the redundant copy was covered by the test suite). That is, for each seeded bug scenario, we devised a repair within GenProg's search space, and then applied the inverse of the fix to the program. We also ensured that all mutations to the program resulted in the failure of at least one test within the reduced test suite.

Instructions on finding and replicating these bug scenarios are given in Appendix A.


Results

A summary of the results across all of the studied configurations is aggregated into the following tables: Table 6.4 describes which bugs were successfully repaired at least once for each configuration. We find that the baseline configuration is unable to repair any bugs that contain more than a single fault. Across all configurations, only one bug containing two faults was repaired, and no bugs containing three faults were repaired. Table 6.5 shows the observed reliability of each configuration. Table 6.6 gives the median number of unique candidate patches required to find a repair for each bug-configuration. Table 6.7 specifies the cost of finding a repair, measured by the number of unique candidate patch evaluations across all runs, including those which failed to yield a patch, divided by the number of successful runs. For the remainder of this section, we use these results to provide answers to each of our research questions.

Table 6.4: A summary of the bugs for which a repair was found at least once over the ten runs, for each configuration; an entry indicates that at least one patch was found for the given bug-configuration.

173 SEARCH ————————————————————————————————————————————————— —————————————— ————————————— —— 10%—————— 10% — — 10% — — 10% 10% — 10% — 30%60% 30% 60% 20%20% 20% 70%30% 100% 60% 20% 70% 20% 20% 100% 20% 30% 80% 20% 50% 30% 20% 30% 100% 100% 90% 100% 100% 90% tcas-two replace-two replace-three printtokens-one printtokens-two printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-two schedule-three schedule2-one schedule2-two schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-three Bug Scenarioreplace-one Baseline Bloat Control Restricted Shadow Neutral Neutral Negative Table 6.5: A comparison of thepatches reliability were of found dierent for congurations that for given each bug-conguration. bug scenario. The presence of an “—” symbol indicates that no

174 6.3. EMPIRICAL STUDY ————————————————————————————————————————————————— —————————————— ————————————— —— 325.0—————— 109.0 — — 296.0 — — 198.0 83.0 183.0 — — 13.5 13.531.0 14.039.0 13.5 31.0 14.0 34.5 31.026.5 31.0 39.0 37.0 31.0 60.0 26.5 39.0 20.0 31.0 26.5 39.0 26.5 20.0 166.0 85.0 63.5 93.5 144.0 81.5 replace-two replace-three printtokens-one printtokens-two printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-two schedule-three schedule2-one schedule2-two schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-two tcas-three Bug Scenarioreplace-one Baseline Bloat Control Restricted Shadow Neutral Neutral Negative Table 6.6: A summary ofpresence of the an median “—” number symbol of indicates unique that no candidate patches patch were evaluations found required for to that nd bug-conguration. a repair, for each bug-scenario. The

175 SEARCH ————————————————————————————————————————————————— —————————————— ————————————— — 3925.00— 3709.00—————— — 3896.00 — — 3798.00 — 3683.00 3783.00 — — 58.00 40.70 61.33 36.40 45.30 35.33 331.50 336.33 277.86 310.67972.67 264.00 1634.50 173.13 972.67 477.80 976.67 972.67 1087.33 1089.00 1663.50 1693.50 167.401631.00 1631.00 115.00 1631.00 1631.00 1631.00 1631.00 printtokens-two schedule2-two tcas-two replace-three printtokens-one printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-three schedule2-one schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-three Bug Scenarioreplace-one replace-two Standard Bloat Control Restricted Shadow Neutral Neutral Negative schedule-two Table 6.7: An overview ofevaluations, the across cost all of runs, nding a dividedpatches patch by were for the found each number during bug-conguration, any of measured of runs by the that the runs. were total successful. number of Bug-congurations unique with candidate an patch “—” symbol indicate that no


RQ4A: Does bloat hinder the ability of the search to efficiently find repairs?

We first introduce a variant of the search algorithm, fitted with active bloat controls. One way of doing this is to break the conservation of edits within the mutation operator by allowing edits to be removed, through the introduction of an Undo operator. To add such an operator, however, we would need to arbitrarily determine a suitable probability of its application, which opens the door to overfitting the particular benchmarks being studied. Since the majority of problems that GenProg has been applied to can be solved within a single edit, one could engineer an Undo probability such that all patches within the population remain at this particular size.

A simpler method to achieve bloat control, less prone to overfitting, is to replace the selection operator with a double-tournament selection operator, based on the non-parametric parsimony pressure techniques introduced by Luke and Panait[2002]. Rather than simply drawing k individuals at random from the population and choosing the best amongst them to act as a parent for the next generation—as in standard tournament selection—the individuals that compete in the final “fitness” tournament are themselves chosen through a series of size-based “qualifier” tournaments:

• k “qualifier” tournaments are carried out, in order to select the k individuals that will compete in the final “fitness” tournament.

• Each “qualifier” tournament selects two individuals at random from the population, and chooses the smallest as the winner, with probability S_p/2, where S_p is a real number between 1.0 and 2.0. Setting S_p to 2.0 ensures the smallest individual is always picked, whilst 1.0 chooses between the individuals at random.

• The “fitness” tournament selects the individual with the highest fitness as the winner, and uses random selection as a tie-breaker.

After introducing double-tournament selection, we compared the performance metrics of the resulting algorithm against the baseline. We found that double-tournament selection was able to fix an additional bug scenario compared to the baseline (5 vs. 6), as shown in Table 6.4. Furthermore, all bugs that are repaired by the baseline are also repaired when double-tournament selection is used; double-tournament selection dominates the baseline in terms of effectiveness. We observe little to no difference in reliability between double-tournament selection and the baseline (Table 6.5). For scenarios where reliability can be compared, we see a minimal 10 percentage point difference.

In terms of efficiency, as measured by NCP_U, double-tournament selection dominates the baseline. Double-tournament selection exhibits a lower NCP_U for two of the five comparable bugs, and an identical NCP_U for the remaining three. When measured by expense, the results are less clear. Double-tournament selection is marginally worse in two cases (1087.33 vs. 1089.00, and 331.50 vs. 336.33; Table



Figure 6.6: Our restricted representation implicitly encodes individuals as a fixed-length list of optional edits P. Each entry of the list, P_i, describes the edit, if any, that should be applied to the statement with number i.

6.7), substantially worse in one (972.67 vs. 1634.50), identical in one, and improved in one (58.00 vs. 40.70). Taken together, these results demonstrate little to no improvement in performance, suggesting that traditional bloat control measures may be ineffective within GenProg's search space. This may be down to the need for more careful, but expensive, parameter tuning. Alternatively, it may be that bloat controls further hinder the memory capabilities of the population: as the population contains fewer edits, it also retains less information about the solution space.

In summary, although the search appears to be most effective during its earliest stages, the use of traditional bloat control measures appears to be unsuccessful in increasing reliability or efficiency.

RQ4B: Does removing the ability to perform multiple edits at a given site improve the efficiency and reliability of the search?

To determine the effects of removing the ability to perform multiple edits at a single location, we needed to introduce a slightly-modified representation, and an associated set of operators, all ensuring that no more than a single edit could be made at a given site. Instead of encoding the patch as an arbitrary-length sequence of edit operations, we effectively represent the patch as a fixed-length list of edits P, as illustrated in Figure 6.6. Each entry of the list, P_i, describes the edit e_i that should be applied at a particular location i. The absence of an edit at a particular site is denoted by a NULL edit. In practice, to avoid high memory costs, this representation is realised by placing certain constraints on variable-length edit sequences.

To accommodate these changes, we replace the standard mutation operator used by GenProg with a uniform mutation operator, which replaces any existing edit at a given statement with a newly sampled one. Similarly, we replace GenProg's standard, variable-length one-point crossover with a symmetrical uniform-crossover operator, outlined in Figure 6.7. Each crossover event takes a pair of parent patches as input, (x, y), and generates a pair of children, (a, b), by iterating through each modifiable statement s and assigning the edit at s from either parent to a, and the other to b, selecting at random. (A sketch of these operators is given at the end of this subsection.)

Compared against the baseline effectiveness, we find that the restricted configuration solved the same set of bugs as the baseline. We observe little to no difference in reliability when the restricted configuration is used: for two out of five bugs, reliability is the same; for one out of five, the restricted configuration is minimally better; and for the remaining two bugs, the baseline configuration is minimally better.



Figure 6.7: Uniform crossover takes two parents, A and B, and generates a pair of symmetrical children, C and D.

In terms of efficiency, when measured by NCP_U, we observe an identical efficiency for two bugs, an improvement for two bugs (166.0 vs. 63.5, and 20.0 vs. 26.5; Table 6.6), and a slight reduction for one bug (13.5 vs. 14.0). When efficiency is measured using the expense metric, we see a worse performance for two bugs (1087.33 vs. 1663.50, and 58.00 vs. 61.33; Table 6.7), an identical performance for two bugs, and an improvement for one bug (331.50 vs. 277.86).

In summary, whilst restricting the search to only a single edit at each site fails to produce any sizeable improvements in performance, it also appears to have little negative effect. This small change substantially reduces the size of the search landscape, and in practice prevents few, if any, bugs from being patched. This observation can be better exploited by search techniques which are not population-based.
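To make the restricted representation and its operators concrete (as mentioned earlier in this subsection), the sketch below treats a patch as a fixed-length list with one optional edit per modifiable statement, using None for the NULL edit; it is a simplified illustration, not the implementation used in this study.

import random

def uniform_crossover(parent_x, parent_y):
    """Symmetrical uniform crossover over fixed-length patches: for each
    modifiable statement, one child receives the edit from one parent and the
    other child receives the edit from the other parent, chosen at random."""
    child_a, child_b = [], []
    for edit_x, edit_y in zip(parent_x, parent_y):
        if random.random() < 0.5:
            child_a.append(edit_x)
            child_b.append(edit_y)
        else:
            child_a.append(edit_y)
            child_b.append(edit_x)
    return child_a, child_b

def uniform_mutation(patch, sample_edit_at):
    """Replace the edit (possibly None) at one randomly chosen statement with a
    newly sampled edit for that location; sample_edit_at is an assumed helper."""
    child = list(patch)
    i = random.randrange(len(child))
    child[i] = sample_edit_at(i)
    return child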

RQ5A: Does selecting individuals to serve as parents at random, rather than in accordance with their fitness, have a measurable impact on the performance of the search?

Building on previous studies showing that random search can outperform genetic algorithms, here we determine whether this result reflects an absence of exploitable information in fitness values, or whether it was largely due to differing search spaces—in the previous study of random search [Qi et al., 2014], RSRepair only considered single-edit patches—and the ability to terminate on the first instance of failure.

To perform this experiment, we simply ran an otherwise identical version of GenProg’s search algorithm, except with its parent selection replaced by random selection.

From the results in Table 6.4, we see that random search dominates the baseline in terms of effectiveness; random search fixes all the bugs fixed by the baseline, plus tcas-two. Furthermore, of all the configurations we investigated, random search

was the only technique that found a solution to a multi-edit bug. It may be the case that the bug in question, tcas-two, exhibits a particularly deceptive search space, ill-suited to optimisation techniques.

In terms of reliability, we observe little difference between random search and the baseline: random search was minimally better for one bug, minimally worse for two, and identical for the remaining two bugs on which it can be compared.

When comparing the efficiency of random search and the baseline, we see mixed results. When measured by NCP_U, random search is better in one case (166.0 vs. 93.5; Table 6.6), worse in one (39.0 vs. 60.0), and identical for the rest. In terms of the expense metric, the baseline is outperformed in three cases (331.50 vs. 310.67, 972.67 vs. 477.80, and 58.00 vs. 36.40; Table 6.7), better in one case (1087.33 vs. 1693.50), and identical in one case.

In summary, these results suggest that GenProg’s search algorithm is no more ef- fective than selecting patches at random. Overall, random search appears to exhibit a stronger performance than the baseline.

RQ5B: Does restricting parent selection to non-destructive individuals (i.e., those which pass all of the positive tests), and choosing from amongst them at random, affect the performance of the search?

Having shown that fitness information either has no effect on the search, or that it actively misleads it, we explore two variants of GenProg's fitness function, designed to more aggressively prune and exploit the search space. In this experiment, we examine whether the algorithm would be better served by strictly selecting neutral edits. Or, put another way, we ask whether tolerating destructive patches (i.e., patches which cause passing tests to fail) is beneficial to the search. To do so, we introduce a variant of GenProg wherein the default parent selection method is replaced by a pair of successive selection operators: firstly, the population is reduced to those candidates which pass all of the positive test cases; from this restricted pool, n individuals are sampled at random, with replacement.

When compared to the baseline, we find that this neutral configuration finds patches for a greater number of bugs (5 vs. 7; Table 6.4). Specifically, the neutral configuration successfully finds patches for all of the single-edit bugs, which includes the set of all bugs repaired by the baseline. In all cases, the reliability of the neutral configuration is higher than or equal to that of the baseline configuration (30% vs. 100%, and 60% vs. 70%; Table 6.5). In terms of NCP_U, we observe little difference in efficiency, but a more pronounced improvement is observed when efficiency is measured by expense: for the five bugs on which efficiency can be compared, the neutral configuration is better for three (1087.33 vs. 167.40, 331.50 vs. 264.00, and 972.67 vs. 477.80; Table 6.7), and identical for the remaining two.

The substantial performance improvements stemming from restricting selection to only those solutions which do not fail any positive test cases suggest that detailed

positive test case information is not particularly useful to the search. This finding, that the search should only pursue non-lethal changes, offers room for significant optimisation: one can use test case prioritisation to order the positive test cases, thus minimising the number of test case evaluations (i.e., regret) necessary to identify a lethal edit.
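A minimal sketch of this two-stage parent selection, with assumed helper names rather than the exact interface used in our implementation:

import random

def select_parents(population, passes_all_positive_tests, n):
    """Restrict the candidate pool to non-destructive individuals (those that
    pass every positive test), then sample n parents uniformly at random,
    with replacement."""
    pool = [p for p in population if passes_all_positive_tests(p)]
    if not pool:
        pool = population  # fall back to the full population (an assumption)
    return [random.choice(pool) for _ in range(n)]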

RQ5C: Does selecting non-destructive individuals as parents on the basis of the number of negative tests that they pass—instead of choosing between them at random—affect the performance of the search?

Building on our findings from the previous experiment, showing that the search is more performant when restricted to selecting from amongst patches which do not fail any of the positive tests, we explore whether negative test case outcomes contribute to the success of the search. To determine the contribution of negative test outcomes, if any, we modified the configuration introduced in the previous experiment to perform a tournament selection between non-destructive patches, based on the number of passing negative tests.

In terms of effectiveness, we find no difference between the neutral negative and neutral configurations: each technique solves the same set of bugs. Similarly, we find little difference between the reliability of the two configurations: in one case, the neutral negative configuration is minimally better; in another case, the neutral configuration is superior; and for the remaining cases, both configurations are identical. We observe differences in the efficiency of the two configurations, as measured by the expense metric. For five out of seven bugs, the neutral negative configuration is superior; in one case it is worse; and for the remaining case, it is identical.

In summary, these results suggest that knowledge of negative test outcomes is useful and that it may be used to enhance the efficiency of the search. When the search is used with a higher resource limit, beyond the default limit used by GenProg, this increased efficiency may also translate into a higher effectiveness.

6.4. Greedy Algorithm

As we demonstrated in our empirical study, the performance characteristics of the search can be improved through certain modifications to its various parameters and components. However, such parameter tweaks are a cheap way of addressing the deficiencies of GenProg's genetic algorithm—they ignore the deeper cause of those deficiencies. Improving the genetic algorithm along one dimension often degrades it along another. For instance, we can bolster the memory of the search by increasing the size of the population, but doing so is likely to reduce the rate of convergence. Similarly, we can control bloat by introducing controls such as double-tournament selection, but doing so also causes information about the problem, implicitly encoded in the population, to be lost.


Crucially, our study shows that, for the most part, the search performs better when it only considers whether a candidate solution passes or fails any of its positive test cases, and how many of its negative test cases it passes. Restricting the search to patches which do not degrade existing functionality, as measured by the positive test cases, allows it to operate with greater effectiveness and efficiency. We also found that selecting from amongst potential parents on the basis of the number of negative test cases that they pass can lead to further gains in effectiveness and efficiency.

Motivated by these findings, here we propose the use of a form of greedy algorithm to conduct search-based program repair. Since we found little to no degradation in performance when the search was restricted to considering only a single edit at each location, our algorithm exploits this observation to reduce the size of its search space.⁴ Our technique allows the space to be explored more efficiently, with the bulk of its efforts spent looking at the smallest possible patches, in which a solution is more likely to be encountered, rather than evaluating increasingly large patches over time. In doing so, we find ourselves closer to the driving motivation behind search-based program repair: patches are only a short distance away from the buggy program. At the same time, to allow the technique to scale to multiple-edit patches and to speed up convergence, our algorithm exploits promising combinations of edits as soon as they are encountered, as explained next.

To achieve these behaviours, our algorithm divides the search space into two parts: an exploration space and an exploitation space. The former is composed, initially, of the set of all possible single-edit modifications to indicted statements within the program. As the search progresses, edits are sampled without replacement from this space, in an effort to discover partial solutions. Over time, this space shrinks monotonically, until no edits remain. The exploitation space contains the set of all partial solutions discovered over the course of the search. When a partial repair is discovered, it is added to this space. The algorithm then determines whether any other partial repairs may be combined with this newly added partial solution, and if so, pushes their concatenations onto the exploitation stack. Whilst this stack is non-empty, the algorithm evaluates each of its combinations of partial solutions. If the combination passes all of the negative test cases passed by each of its parts, the combination is added to the exploitation space. If not, this is treated as a sign that at least one of the two partial solutions involved in the combination is incorrect, and their combination is discarded. This process continues until either an acceptable solution is found, a resource limit is reached, or both the exploration space and the exploitation stack are empty.

⁴ GenProg and, to some degree, MintHint are the only two repair techniques to allow multiple edits to a single program location. From observation of GenProg's patches, we find no cases where this ability is used in an acceptable, minimised repair.


Algorithm 6: Greedy Search Program Repair

explore ← GenerateExplorationSpace();
exploit ← ∅;
combinations ← [];
while resources not exhausted do
    /* perform exploitation when possible */
    if combinations ≠ [] then
        xy ← Pop(combinations);
        x, y ← Parts(xy);   /* the two partial solutions from which xy was formed */
        /* xy should pass all tests passed by x and y */
        Evaluate(xy, TN);
        promising ← true;
        foreach t : PassedTests(x) ∪ PassedTests(y) do
            if not PassesTest(xy, t) then
                promising ← false;
                break;
            end
        end
        /* xy should also pass all positive tests */
        if promising and PassesAll(xy, TP) then
            if PassesAll(xy, TN) then
                return xy;
            else
                ReportPartialSolution(xy, exploit, combinations);
            end
        end
    /* check if exploration space is empty */
    else if explore = ∅ then
        return ⊥;
    /* explore single-edit space */
    else
        cand ← Sample(explore);
        if PassesAny(cand, TN) and PassesAll(cand, TP) then
            if PassesAll(cand, TN) then
                return cand;
            end
            ReportPartialSolution(cand, exploit, combinations);
        end
    end
end


Pseudocode for our greedy algorithm is given in Algorithm 6. As an initialisation step, GenerateExplorationSpace populates a memory- and look-up-efficient data structure with the set of all possible single-edit patches to the program, at locations implicated by the fault localisation. The main loop of the algorithm continues to iterate until a solution is found, a resource limit is hit, or the search space has been exhausted.
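One way of realising such a structure is sketched below: a mapping from each implicated statement to the single-edit patches that remain untried at it. This is our own illustration under assumed names (implicated_stmts, operators), not GenProg's data structure.

    def generate_exploration_space(implicated_stmts, operators):
        # operators maps an operator name to a function that enumerates the
        # candidate edits of that type at a given statement
        space = {}
        for stmt in implicated_stmts:
            edits = {name: list(enumerate_edits(stmt))
                     for name, enumerate_edits in operators.items()}
            if any(edits.values()):
                space[stmt] = edits
        return space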

At each iteration, the algorithm checks to see whether there is a promising combination of edits waiting to be evaluated. If so, the algorithm evaluates the combination on the (covering) negative tests. The combination xy is saved to the exploit space, using ReportPartialSolution, provided it passes all of the tests that are passed by each of its parts, x and y. For example, if x passes {N1}, and y passes {N2}, then xy should pass {N1, N2}, as well as all of the positive tests. As an optimisation, the combination is only evaluated against the positive tests if it satisfies this invariant over the negative test cases. A further optimisation, left for future work, would be to evaluate the union of the negative tests passed by x and y using prioritisation, terminating on the first instance of failure.
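A minimal sketch of this invariant check, assuming each patch carries the set of negative tests that it passes (the field names are illustrative):

    def combination_is_promising(xy, x, y):
        # the combination must pass every negative test passed by either part;
        # only then is it worth the cost of running the positive tests
        return (x.passed_negatives | y.passed_negatives) <= xy.passed_negatives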

In cases where the combinations stack is empty, the algorithm proceeds to sample an edit from its explore space via the Sample function. This sampling process first selects a subject statement from the program, according to its fault localisation. For the sake of a fair comparison with GenProg's existing search algorithm, we used GenProg's default 10:1 fault localisation. Once a statement has been selected, the algorithm chooses the type of edit at random; in accordance with GenProg's default mutation operator probabilities, we give each operator an equal probability of selection. Finally, the algorithm samples an edit of the chosen type at the selected location at random, without replacement (i.e., the edit is removed from the exploration space). Upon sampling an edit, the algorithm updates the state of its explore space, removing the chosen statement from consideration by the fault localisation if no edits remain at that statement.
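The sampling step might be realised as follows, reusing the structure sketched above; weights gives each statement its fault-localisation weight (e.g., a 10:1 weighting). Again, this is a sketch under assumed names rather than the actual implementation.

    import random

    def sample_edit(explore, weights, rng=random):
        if not explore:
            return None
        stmts = list(explore)
        # pick a statement in proportion to its suspiciousness
        stmt = rng.choices(stmts, weights=[weights[s] for s in stmts], k=1)[0]
        # pick an edit type uniformly amongst those with edits remaining
        op = rng.choice([name for name, edits in explore[stmt].items() if edits])
        # sample without replacement: the chosen edit leaves the space
        edit = explore[stmt][op].pop(rng.randrange(len(explore[stmt][op])))
        # drop the statement once all of its edits are exhausted
        if not any(explore[stmt].values()):
            del explore[stmt]
        return edit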

Once an edit has been sampled, the algorithm evaluates that edit on all of its covering negative tests, in parallel. If the edit fails all of the negative tests, it is discarded. If, and only if, the edit passes at least one negative test is it evaluated against the covering positive tests. To increase the efficiency of the search, the set of covering positive tests is ordered by observed likelihood of failure prior to evaluation; this minimises the expected number of test evaluations required to reject an edit. Once sorted, tests are then evaluated in parallel. If at any point a positive test is failed, the search ceases to evaluate the rest of the covering positive tests and discards that edit. In the event that the edit passes all of its covering positive tests and at least one of its covering negative tests, that edit is recorded as a partial solution by ReportPartialSolution.
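Sequentially, the evaluation of a sampled edit can be sketched as below; the real implementation runs tests in parallel, and run_test and the failure_rate attribute are assumed helpers.

    def evaluate_edit(edit, covering_negatives, covering_positives, run_test):
        # negative tests first: an edit that passes none of them is discarded
        passed_negatives = {t for t in covering_negatives if run_test(edit, t)}
        if not passed_negatives:
            return None
        # positive tests, most failure-prone first, stopping at the first failure
        for t in sorted(covering_positives, key=lambda t: t.failure_rate, reverse=True):
            if not run_test(edit, t):
                return None
        # the edit is a partial (or complete) solution
        return passed_negatives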

When a partial solution is discovered by the search, ReportPartialSolution traverses its exploit space to find promising combinations of edits that include the edits within the reported partial solution. Combinations, each formed of two partial solutions within the exploit space which satisfy certain criteria, are pushed to the combinations stack; this behaviour causes the algorithm to pause its exploration of the single-edit space and to immediately exploit the combination. Pseudocode for the ReportPartialSolution procedure is given in Algorithm 7.


Algorithm 7: Greedy Search, Partial Solution Reporting

ReportPartialSolution(x, exploit, combinations)
begin
    stmtsx ← CoveredStmts(x);
    passesx ← PassedTests(x);
    /* Look for promising combinations including this patch */
    foreach y : exploit do
        /* Find any overlap in edited statements */
        stmtsy ← CoveredStmts(y);
        stmtsxy ← stmtsx ∩ stmtsy;
        /* Look for mutually exclusive test passes */
        passesy ← PassedTests(y);
        passes∆ ← (passesx \ passesy ≠ ∅) ∧ (passesy \ passesx ≠ ∅);
        /* Determine the length of the resulting patch */
        length ← Length(x) + Length(y);
        if (stmtsxy = ∅) ∧ passes∆ ∧ (length ≤ lengthmax) then
            xy ← Combine(x, y);
            Push(combinations, xy);
        end
    end
    Add(exploit, x);
end

A partial solution x can be combined with partial solution y to form xy, provided that the following criteria are satisfied:

1. The sets of statements modified by x and y do not overlap.

2. Let Tx and Ty be the sets of tests passed by x and y, respectively. The following must hold:

(Tx \ Ty ≠ ∅) ∧ (Ty \ Tx ≠ ∅)

In other words, x must pass at least one test that is not passed⁵ by y, and y must pass at least one test that is not passed by x.

3. The length of the resulting combination xy does not exceed the maximum allowed patch length. For this study, we set this limit to three edits.

⁵ Note, this includes tests that are not covered by a given patch, as well as those that fail.


After finding satisfactory combinations for the reported partial solution, ReportPartialSolution records the solution to the exploit space.

Evaluation

To determine whether our proposed algorithm outperforms that used by GenProg, we evaluated each against the bug scenarios used in our empirical study. From observation, we deemed that the artificial candidate evaluation limits imposed by GenProg were far too low for use in multiple-edit scenarios: it is highly unlikely that all of the partial fixes will be encountered, let alone combined, within such a limit. Thus, to account for the low probability of encountering the components of a partial repair, and to gain a better understanding of the real performance of each configuration, we evaluated each using a time limit, rather than a limit on the number of unique candidate evaluations. These time limits were set according to the number of seeded faults in the bug scenario, as outlined below.

• One Fault: 30 minutes.
• Two Faults: 90 minutes.
• Three Faults: 240 minutes.

Given a limited budget and the increased demands of evaluating patches for such lengths of time, we performed five repeat runs for each bug-algorithm pairing. To conduct this evaluation, we once again made use of the cloud compute nodes described in Table 6.2, for a total of over 300 hours of compute time. From the results of this evaluation, we compared the effectiveness and reliability of each algorithm using the same metrics used in the empirical study. To compare efficiency, we measured the cost of finding an acceptable patch as the total time spent searching, across all runs, divided by the number of runs wherein a repair was found.

A comparison of effectiveness and the bugs fixed by each algorithm is given in Table 6.8. A summary of the reliability of each technique is provided by Table 6.9. Details of the efficiency of each technique are listed in Table 6.10.

Our results show that the greedy algorithm successfully fixes more bugs than the baseline (14 vs. 8; Table 6.8). The genetic algorithm is able to find a solution to replace-three, whereas the greedy algorithm fails. On closer inspection of the seeded fault, we see that the canonical patch requires exploration of the neutral edit space; two of the edits within this patch do not pass any of the negative test cases, and thus do not appear as partial solutions to the greedy search.

Comparing the reliability of the two techniques, shown in Table 6.9, we find that the greedy algorithm performs at the same level or better than the genetic algorithm for 20 of the 21 bugs. For 10 of the 14 bugs solved by the greedy algorithm, it achieves a reliability of 100%. By comparison, the genetic algorithm only achieves 100% reliability for one bug, tcas-one.


Bug Scenario          Genetic   Greedy
replace-one              ✓         ✓
replace-two                        ✓
replace-three            ✓
printtokens-one          ✓         ✓
printtokens-two
printtokens-three
printtokens2-one                   ✓
printtokens2-two
printtokens2-three       ✓         ✓
schedule-one                       ✓
schedule-two                       ✓
schedule-three
schedule2-one            ✓         ✓
schedule2-two                      ✓
schedule2-three          ✓         ✓
totinfo-one                        ✓
totinfo-two                        ✓
totinfo-three
tcas-one                 ✓         ✓
tcas-two
tcas-three               ✓         ✓
Successful:              8        14

Table 6.8: A comparison of the bugs patched by each technique.

In the seven cases where the efficiency of the two search algorithms can be compared, the greedy algorithm is faster in six, by a factor of between 1.4 and 57. For tcas-three, the greedy algorithm is roughly a factor of 2.7 slower.

On closer inspection of the results of the greedy algorithm for tcas-three, we observe an abnormally large number of partial solutions, numbering close to 1000. Given the opportunistic nature of the greedy algorithm, much of its time is spent attempting to combine these partial solutions, rather than exploring the single-edit search space more broadly.

In summary, our greedy algorithm outperforms the genetic algorithm used by GenProg in terms of effectiveness, reliability and efficiency. Moreover, our approach is simpler and does not require (expensive) parameter tuning.


Bug Scenario          Genetic   Greedy
replace-one              80%      80%
replace-two               —      100%
replace-three            20%       —
printtokens-one          80%     100%
printtokens-two           —        —
printtokens-three         —        —
printtokens2-one          —       40%
printtokens2-two          —        —
printtokens2-three       40%     100%
schedule-one              —      100%
schedule-two              —      100%
schedule-three            —        —
schedule2-one            40%     100%
schedule2-two             —       80%
schedule2-three          60%     100%
totinfo-one               —      100%
totinfo-two               —      100%
totinfo-three             —        —
tcas-one                100%     100%
tcas-two                  —        —
tcas-three               40%      20%

Table 6.9: A comparison of the reliability achieved by the genetic and greedy search algorithms, measured by the fraction of runs wherein an acceptable repair was found. An “—” is used to denote bug-configurations where no repair was found across any of the runs.

6.5. Future Work

Although our proposed greedy search algorithm outperforms the baseline algorithm used by GenProg, there are a number of modifications that we plan to make to improve it further still.

A number of existing, complementary techniques, described below, could be applied to the algorithm with relative ease to increase its efficiency and reliability. For the purposes of this study, we omitted these improvements in order to focus our attention on the relative performance of our greedy algorithm over GenProg's baseline genetic algorithm.

• A more effective fault localisation approach could be used, allowing the exploration space to be pruned and focused.

• A probabilistic model could be used to assign different probabilities to individual edits within the repair space, in a similar fashion to Prophet [Long and Rinard, 2016]. Edits that are deemed more likely to occur could be given priority over less likely edits. One could also use the model to create a deterministic ordering over the exploration space, as a pre-processing step. By doing so, the algorithm would become deterministic, allowing it to be evaluated without the need for repeated runs.


Bug Scenario          Genetic      Greedy    Reduction
replace-one            712.14      528.20       1.35
replace-two               —        438.33        —
replace-three       57,875.16         —          —
printtokens-one        495.01        8.69       56.95
printtokens-two           —           —          —
printtokens-three         —           —          —
printtokens2-one          —       3,919.53       —
printtokens2-two          —           —          —
printtokens2-three  21,700.96      974.10       22.28
schedule-one              —        886.24        —
schedule-two              —      2,399.63        —
schedule-three            —           —          —
schedule2-one         2,959.97     328.98        9.00
schedule2-two             —      4,637.24        —
schedule2-three       9,697.29   2,117.91        4.58
totinfo-one               —        167.02        —
totinfo-two               —        773.71        —
totinfo-three             —           —          —
tcas-one                23.01        7.37        3.12
tcas-two                  —           —          —
tcas-three           21,651.88  59,104.75        0.37

Table 6.10: A comparison of efficiency between the genetic search and greedy search algorithms, measured by the total wall-clock time across all runs, in seconds, divided by the number of successful runs. Reduction describes the reduction factor achieved by the greedy algorithm, compared to the genetic algorithm.


• Anti-patterns, proposed by Tan et al. [2016], could be incorporated into the exploration space, allowing edits associated with low-quality patches to be pruned.

Further efficiency and reliability improvements may be achieved through modifications to the algorithm itself:

• Instead of combining and exploiting complementary partial solutions at the earliest moment, the algorithm could defer this process until a combination is found whose constituent edits, between them, pass all of the negative tests.


A formal expression of this criterion is given in Equation 6.2, where C denotes a potential combination of edits e ∈ C, and Passes(e, t) is used to assert that the singleton patch for e passes a given test t; a sketch of this check appears at the end of this section.

∀t ∈ TN : ∃e ∈ C : Passes(e, t) (6.2)

This behaviour would dramatically improve the efficiency of the search in cases where large numbers of partial but misleading solutions are reported, such as the tcas bugs.

• In the current version of the algorithm, patches are first evaluated against the negative test suite, to check that they pass at least one negative test, or all of the negatives passed by their constituent parts, before being subjected to the positive tests. To reduce the number of test evaluations, let us assume that the number of patches which pass at least one negative test case is low, and that patches which pass negative tests are unlikely to fail their covering positive tests. We can exploit these assumptions by deferring evaluation of the positive test cases until a patch is found which passes all of the negative tests. In practice, the positive tests vastly outnumber the negative tests, and so the gains from such an optimisation could be significant. For the majority of programs that we have encountered, synthetic and organic, these assumptions hold; patches which pass any of the negative tests are relatively rare. In a small number of cases, however, such patches are frequently encountered. To handle this, a smart repair technique could measure the observed frequency of such patches against an expected frequency, and if the expected frequency is exceeded, a different set of optimisations could be used instead.

• To allow the algorithm to solve a greater number of bugs, albeit at some cost to efficiency, the algorithm could be tuned to exploit neutral edits, in addition to partial solutions. One way of exploiting neutral edits builds on the optimisations described above. After exhausting the exploration space without finding an acceptable solution, a neutral space could be created. The set of neutral edits could then be found and saved to this space, using test prioritisation to reduce the expected cost of doing so. Once the neutral space is created, the algorithm could exhaustively combine pairs of neutral edits in the search for partial solutions. Partial solutions discovered during this phase would be recorded to the exploitation space, as usual, and combined with complementary partial solutions.

In addition to the above optimisations, we plan to explore whether the inclusion of certain heuristics can be used to prune the search space. Specifically, we are interested in answering the following question: “Given two unique patches, A and B, with known test suite outcomes, does the knowledge of the test suite outcome for their combination AB provide us with (usable) information about the solution?” To answer this question, we intend to explore whether certain test suite outcomes for AB can be used to identify parts of the solution and to prune certain edits.
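The deferred criterion from Equation 6.2 could be checked as sketched below, assuming passed_negatives(e) returns the set of negative tests passed by the singleton patch for edit e (names are ours).

    def covers_all_negatives(combination, negative_tests, passed_negatives):
        # every negative test must be passed by at least one edit in the combination
        covered = set()
        for e in combination:
            covered |= passed_negatives(e)
        return covered >= set(negative_tests)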


6.6. Conclusion

In this chapter, we conducted a theoretical and empirical analysis of the search algorithm used by GenProg, one of the few repair techniques (theoretically) capable of performing multiple-edit repair. We identified a number of fundamental weaknesses within GenProg's algorithm, stemming from its generational, population-based nature; fixing any one of these issues causes another to be exacerbated. We also explored the role of fitness information within the search and found that the search was more efficient, effective and reliable when selection was restricted to solutions which pass all of their covering positive tests. Modifying this selection between non-destructive patches to favour patches that pass previously failing tests was found to produce further performance gains.

Motivated by these results, we proposed and evaluated an alternative search algorithm, based on a greedy search. Across a benchmark of 21 bug scenarios of varying size, we found that our algorithm beat GenProg's in terms of every performance metric we measured. Although our aggressive restriction of the search space produced better performance overall, the greedy search was unable to patch one of the scenarios repaired by GenProg. Additionally, the high frequency of negative test passes on another bug scenario caused the efficiency of our algorithm to drop below that of GenProg. We outlined a number of improvements that we intend to make to the algorithm in the future, which may alleviate these issues and increase its performance further.

In summary, through theory and observation, we showed that GenProg's search algorithm has deficiencies. Existing approaches, such as RSRepair, AE and SPR, are significantly more efficient due to aggressive optimisations, but all are limited to single-edit patches. We proposed a novel search algorithm for performing multiple-edit repair, which demonstrates promising results. In the future, we plan to optimise this algorithm further and to evaluate it against a dataset of real-world bugs in large-scale programs.


CHAPTER 7

Conclusion

In this chapter, we present a summary of our findings, in terms of the research questions presented in Chapter 1, and briefly outline directions for future research. Finally, we provide a set of concluding remarks.

7.1. Summary

RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

Inspired by recent results in applying mutation analysis to the problem of fault localisation, in Chapter 4 we explored whether the test outcomes of candidate patches could be used to identify promising fix locations, online. To answer this question, we conducted a 12-hour random walk of 28 different bugs, 15 of which were artificial, and the remaining 12 organic. Our results demonstrate a statistically significant difference (with a medium-sized effect) between the passing-to-failing distributions of statements that were changed by the developer, and those that were not.

No significant difference was found between the failing-to-passing distributions of human-modified and non-modified statements, suggesting that this failing-to-passing information has little to no use in detecting potential fix locations. We proposed and evaluated a number of online fault localisation measures based on our findings, but found that none of them was able to consistently outperform offline localisation; MUSE and Metallaxis were also unable to outperform offline localisation.

There may be a number of reasons for the lack of a significant improvement in the accuracy of the fault localisation:

• Our evaluation naively combined the results of the mutation analysis by simply computing their product. Determining a meaningful way of combining multiple sources of fault localisation appears to be a non-trivial problem. The use of machine learning techniques may be required to discover effective ways of combining information.


• We observe a search landscape in which most repairs either had no impact on the outcomes of the test suite, or failed all of their passing tests. The lack of diversity in test suite outcomes hinders the effectiveness of the analysis and increases the difficulty of discriminating between fixable and non-fixable program locations. The shape of this landscape may be a result of GenProg's coarsely-grained statement-level operators. Alternatively, it may be that real-world test suites for C programs tend to exhibit a largely all-or-nothing behaviour. Upon inspection of Python's tests, for instance, we observe that each test case in fact corresponds to a whole suite of atomic tests for one particular module. The results of mutation analysis may be improved through the use of test case atomisation, a technique which cannot be applied to C programs, due to the lack of a ubiquitous testing framework. It may also be that mutation analysis performs better when combined with purpose-built, automatically generated test suites.

• To be effective, mutation analysis may require a large number of mutants per program location. Due to the substantial overheads associated with compilation, we were able to generate relatively few mutants within our 12-hour window. Previous mutation analysis results involve the generation of tens of thousands of mutants, a feat that would require days, if not weeks, of compute time to realise in large-scale, real-world programs.

RQ2: Is plastic surgery equally effective for all repair actions?

In Chapter 5, we mined thousands of real-world bug fixes from hundreds of the most popular open-source projects on GitHub. After extracting instances of 23 repair actions within these mined fixes, we determined how many of those instances could be grafted from the file under repair. We found that repair actions at the block level (e.g., Insert Else-If Branch) were amongst the least likely to be grafted, whereas those operating at finer levels of granularity (e.g., Modify Assignment) exhibited the highest levels of graftability. To improve the efficiency of automated repair, future repair tools may exploit this observation by simply omitting (or de-prioritising) block-level repair actions.

We find that the majority of repairs occur below the statement level, an encouraging result, given that plastic surgery appears to be most effective at finer levels of granularity. In 58% of cases, the modified version of a statement could be found elsewhere in the file (when abstract code snippets were used). We can use this finding to increase the effectiveness of future repair tools by focusing statement replacement on structurally similar statements.

As part of our analysis, we also found that statement deletion was the second most common repair action, occurring in around a third of bug fixes. This suggests that preventing explicit statement deletion [Long and Rinard, 2015; Qi et al., 2015] may be misguided and likely to reduce the number of bugs that can be repaired.


RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

In Chapter 5, we also investigated the effectiveness of plastic surgery when variable labels are stripped from snippets. When plastic surgery is performed with these abstract code snippets, we find that graftability is substantially increased. To increase the number of repairable bugs, we strongly advocate the inclusion of abstract donor code snippets in future search-based repair techniques.

In the future, we plan to implement our proposed repair actions and donor code snippets in a new repair tool. Building on previous work by Long and Rinard [2016] and Soto et al. [2016], we could feed the patches mined by BugHunter to a machine learning technique, to produce a model for performing fix localisation (i.e., to assign probabilities to the likelihood of certain bug fixes, based on those observed in historical patches).

RQ4: Do biases within GenProg’s operators degrade its performance?

In Chapter 6, we identified a number of potential biases and inefficiencies in GenProg's search algorithm as part of a larger theoretical analysis. We showed how allowing multiple edits at a single location may prevent the search from reaching certain areas of the search space, and how patches tend to grow in size over the course of the search (i.e., bloat occurs).

As part of an empirical study, we examined whether the performance of the search could be improved by fixing these problems:

• To address the problem of bloat, we replaced GenProg's standard selection operator with a double-tournament selection. We found little to no difference in the performance of the search, despite the observation that most patches are found earlier in the search (when patches are smaller). This result may be due to the lack of tuned parameters, or it may be that bloat controls further reduce the memory of the search.

• As a simple fix to the identified operator biases, we assessed the performance of GenProg when its algorithm is limited to a single edit per location. As with the use of bloat controls, we found no discernible improvement in performance.

None of our changes to GenProg in response to these biases had a noticeable effect on its performance. The lack of any response to these changes, be it positive or negative, suggests that the search is largely insensitive to important components and parameters of its algorithm. This observation, together with our findings regarding GenProg's fitness function, suggests that the search is not operating as intended.


RQ5: Does fitness help to guide GenProg's search algorithm towards solutions?

As part of our analysis of GenProg's search algorithm in Chapter 6, we identified the implicit positive and negative feedback mechanisms provided by fitness information: the test outcomes of previously passing tests serve to prevent the degradation of existing functionality, whereas the outcomes of previously failing tests serve to promote the exploitation of partial solutions.

Through an empirical analysis, we demonstrated that the search performs at least as well as the baseline when selection is restricted to non-destructive patches (i.e., patches that do not fail previously passing test cases). Therefore, gathering information for all of an individual's previously passing tests is redundant. The search can operate more efficiently, as measured by the number of test evaluations, by performing test case prioritisation over the previously passing tests, and terminating on the first instance of failure.

To determine the contribution of negative test outcomes, we introduced a variant of GenProg wherein selection was restricted to non-destructive individuals, but selection between those individuals was performed on the basis of the number of negative test passes. In most cases, we found that this algorithm allowed the search to find a solution in fewer candidate patch evaluations. For scenarios with particularly weak test suites, where a large number of negative test passes were observed, this algorithm led to a reduced performance.

RQ6: Is there a more effective search algorithm for generating multiple-edit patches than the genetic algorithm used by GenProg?

Motivated by the results of our analysis of GenProg's search algorithm, in Chapter 6 we proposed an alternative search algorithm for multiple-edit repair, based on a greedy algorithm. Our proposed algorithm was shown to beat GenProg's genetic algorithm across all of our performance metrics (efficiency, effectiveness, reliability). These results show promise for the future of multi-edit, search-based program repair. In future work, we plan to explore a number of optimisations to this algorithm, and to assess its effectiveness on a larger set of real-world, multi-line bugs.

7.2. Future Work

In this section, we briefly discuss a number of promising areas for future research. A more detailed discussion of future research regarding repair models, fault localisation and search algorithms can be found at the end of each of the respective chapters.

• Platform for Automated Program Repair: Part of the reason for GenProg's perceived lack of effectiveness may not be that its repair model lacks expressiveness,


as suggested by others [Long and Rinard, 2015]. Rather, its problems may be the consequence of a number of unintended code transformations performed by CIL, the AST analysis framework upon which GenProg is based, which may make the program unsuitable for repair. Below, we describe two examples of this behaviour:

– If-statements with multiple terms (e.g., if (x && y && z)) are exploded into a sequence of side-effect-free, single-term if-statements, shown below.

if (x) {
    if (y) {
        if (z) {
            ...
        }
    }
}

– Statements with possible side effects are transformed into a sequence of side-effect-free statements, through the introduction of temporary variables. For example, the statement:

int z = f(x) + g(z);

is transformed into the following:

int t1 = f(x);
int t2 = g(z);
int z = t1 + t2;

As a result of these transformations, the program inflates in size. In the case of the SIR bug scenarios, we observed single statements that were exploded into more than ten. Since all of these statements belong to the same basic block, there is no potential improvement to the fault localisation (assuming spectrum-based fault localisation is used). This linear increase in the number of statements within the program leads to a quadratic increase in the size of the search space: an average inflation factor of 10 could result in a 100-fold increase in the size of the search space. Accompanying this growth in the search space is an increase in the size of the solution. Where previously a patch could be generated with a single edit (e.g., insertion of the statement from the example above), now a set of statements must be introduced, in an acceptable order.

In addition to increasing the difficulty of the problem along these two dimensions, inflation introduces other unintended issues through its use of temporary variables. Crafting a repair now requires that a particular temporary variable is in scope.


In ongoing work, we are building a new framework for automated program repair of C and C++ on top of Clang, a modern program analysis toolchain that preserves the original source code of the program. As part of this framework, we plan to integrate a DSL for formally defining repair actions in terms of program transformations. Additionally, to reduce the considerable overheads associated with compilation, we plan to introduce super-mutation, i.e., multiple mutants are compiled once and selected between at run-time.

• Test Suites for Automated Program Repair: During our empirical analysis of GenProg's search algorithm, we employed a simple test suite reduction technique to minimise the original test suites for the Siemens benchmarks by two orders of magnitude. Despite the large reduction, from thousands of tests to tens, we observed few cases of overfitting. These anecdotal results would seem to suggest that any improvements to repair quality stemming from the use of larger test suites diminish quickly as the size of the test suite increases. In future work, we plan to explore exactly what makes a good test suite for automated program repair. These qualities may involve mutation score, coverage, the novelty of execution paths, or data-flow invariants.

Interestingly, we found that many low-quality repairs associated with GenProg are found in test suites which exhibit a low entropy across their possible outcomes. Test harnesses that simply check the exit status of a program, such as the PHP scenarios within ManyBugs, exhibit the lowest possible entropies. From the perspective of the test harness, most test cases appear to have identical outcomes. By using Pythia to automatically generate an oracle that checks the standard output, standard error, and resulting state of the sandbox, the number of identical outcomes tends to drop considerably (i.e., this enhanced test harness exhibits a higher degree of entropy across the behaviours of its tests). When this enhanced oracle is used, the number of plausible solutions for the TSE 2012 benchmarks drops from thousands to fewer than ten.

If we can find a criterion that expresses the suitability of a test suite for APR, then we could use that criterion as an objective function when performing test-suite reduction. Alternatively, we could use this criterion to guide the automatic generation of a highly effective, minimal test suite.

• Fault Localisation: In Chapter 4, we found that candidate patches generated using GenProg's statement-level operators were ineffective at localising the fault when used to perform mutation-based fault localisation (MBFL). Upon closer investigation, we found that the majority of the candidate patches generated using GenProg's coarsely-grained mutation operators exhibited an "all-or-nothing" behaviour, where the mutation either caused all covering tests to fail, or it had no effect on their outcomes at all. Given the previous successes of using MBFL with traditional mutation testing operators [Moon et al., 2014a; Papadakis and Le Traon, 2015], which typically operate at the level of expressions rather than statements, it may be that mutation-based fault localisation is only effective when used with mutants generated by finely-grained


operators. Motivated by the lack of effectiveness of GenProg's statement-level operators, we plan to investigate the effectiveness of a wide range of mutation operators at varying levels of coarseness. We also plan to explore whether the test suite behaviours of particular kinds of mutants can be used to predict the shape of underlying faults. For instance, if mutation of a particular if-condition appears to have no effect on the outcomes of the test suite, it may suggest a faulty if-condition. By probing the shape of the fault, the search space can be vastly reduced to the small subset of repairs that fit a believed shape. As well as exploring the behaviour of MBFL using different mutation operators, we intend to assess whether ensemble learning techniques may be used to better aggregate mutant information and offline, spectrum-based fault localisation information.

• Repair Model: In Chapter 5, we investigated and demonstrated the effectiveness of plastic surgery in the context of actual repair operators. Since we found that levels of graftability (i.e., the likelihood of discovering a suitable code snippet within the corpus) varied highly between different repair operators, a natural next step would be to explore whether other factors affect graftability. For instance, we may wish to determine whether missing print statements and strings are likely to be discovered within the code. By better understanding the strengths and limitations of plastic surgery, we can design repair tools that are better able to repair the subset of faults that are most amenable to plastic surgery; unlikely repairs could be pruned from the search entirely.

Using the tool we developed to conduct the analysis in Chapter 5, BugHunter, we could build upon work by Soto et al. [2016] by devising techniques for graft localisation (i.e., determining the likelihood that certain snippets form part of the repair). Soto et al. [2016] look at the kind of statement that is most likely to replace a statement of a given kind. Their work could be extended by looking at likely insertions, as well as by examining a richer set of features, beyond just the kind of the statement. A richer set of features, closer to those used in [Long and Rinard, 2016], could look at the variables used by the snippet (and the extent to which they overlap with, or complement, the original statement). Machine learning techniques could be applied to that set of features to learn an effective graft localisation function.

The results of our analysis could also be bolstered by the development and subsequent use of more sophisticated bug-fix identification procedures. One might also attempt to automatically categorise bugs, and to use that information to determine whether certain kinds of bugs are more amenable to repair, and to improve graft localisation by finding snippets that are most likely to be used to repair faults of that kind. For more details on future work related to the work conducted in Chapter 5, see Section 5.7.1.


• Search: In Chapter 6, we showed that a form of greedy search is better suited to automated repair than the genetic algorithm used by GenProg. Although our greedy search algorithm was substantially more efficient than the genetic algorithm, its performance could be further improved by a number of small optimisations. Instead of combining and exploiting complementary partial solutions at the earliest possible moment, an optimised version of the algorithm could wait until a combination of partial solutions, whose constituent edits pass all of the negative tests, is found. By waiting until a (potential) complete solution is found, the search should show a vastly improved performance on deceptive problems, where many partial solutions are incorrectly reported (such as the tcas bugs). Further gains in efficiency could be made by deferring the evaluation of the positive tests until a combination of edits that passes all of the negative tests is found.

Rather than pursuing improvements in efficiency, one may adapt our algorithm to be less greedy, to allow a greater number of bugs to be fixed. Instead of solely combining and exploiting partial solutions, the search could also combine neutral edits.

In addition to improvements to efficiency and effectiveness, we plan to explore whether knowledge of the test suite outcomes for combinations of patches can be used to probabilistically determine whether the individual edits within a patch are parts of the solution or whether they are misleading. More specifically, we seek to answer: "Given two unique patches, A and B, with known test suite outcomes, does the knowledge of the test suite outcome for their combination AB provide us with (usable) information about the solution?" Technical details on potential optimisations to our search algorithm can be found in Section 6.5.

7.3. Concluding Remarks

Realising the long-held goal of automatically repairing programs poses enormous challenges, far exceeding those that early results would seem to suggest. Despite the daunting vastness of the problem, in less than a decade, researchers have demonstrated techniques capable of fixing a small, but considerable, subset of single-edit bugs in real-world, large-scale code. Over the next decade, researchers face an even greater set of challenges, as we attempt to scale techniques to a larger number of bugs, spanning multiple lines, all whilst remaining cost-efficient.

The sheer magnitude of the challenges ahead may, understandably, cause some to doubt that program repair will ever become an integral part of real-world processes. To those that doubt, recall that a decade prior to the introduction of GenProg, the concept of automated program repair would have been

deemed purely science fiction. Since then, the question has shifted from "is automated program repair possible?" to "which bugs can we automatically repair?"

In this thesis, we explored each of the challenges facing search-based program repair over the coming decade, and in the process discovered hints of what a next-generation repair technique might look like: we devised a more effective algorithm for multiple-edit repairs, found ways of exploiting plastic surgery to fix a greater number of bugs, and presented a more robust platform for conducting program repair research. Not all of our results were positive: contrary to our expectations, and to the findings of previous research, we found that the use of mutation-based fault localisation had no impact on fault localisation accuracy. In such an early field of research, seemingly promising ideas are likely to fail; a small price to pay for the privilege of pioneering. These results help us to understand where best to focus our attention, or rather, where not to waste it, and raise interesting questions regarding the cases in which mutation analysis can be used to improve fault localisation.

An exciting decade of research lies ahead; we'll see you on the other side.


Appendices


APPENDIX A

Reproducibility

A.1. Fault Localisation

Raw data, analysis scripts and a replication package for the results presented in Chapter 4 can be downloaded at:

https://bitbucket.org/ChrisTimperley/ssbse-2017-data

Source code for the modified version of GenProg used in these experiments (and those in Chapter 6) can be found at:

https://bitbucket.org/ChrisTimperley/GP3

A.2. Repair Model

A replication package for the results in Chapter 5 can be found at:

https://bitbucket.org/ChrisTimperley/phd-repair-model

Note, raw results for this study are not available online, due to size limitations (the raw data is several hundred GBs in size). BugHunter, the tool used to mine bug-fixing commits and repair actions from Git repositories, can be downloaded at:

https://github.com/ChrisTimperley/BugHunter

A.3. Search

Raw data, analysis scripts and a replication package for the results in Chapter 6 can be found at:

https://bitbucket.org/ChrisTimperley/phd-search


APPENDIX B

Repair Action Mining

B.1. AST and Edit Script Generation

After identifying the bug-fixing commits within a Git repository, BugHunter proceeds to generate the abstract syntax trees for each of the modified source files, in both the faulty and fixed versions of each identified bug fix, using GumTree [Falleri et al., 2014]. Within each generated AST, each node is assigned a unique integer identifier n ≥ 0, given by its post-order position within the tree. Additionally, for each pair of faulty and fixed versions of a source code file (F, F′), a set of AST node mappings M(F, F′) and an AST edit script ∆(F, F′) are generated, using a modified version of GumTree [Falleri et al., 2014].

The mapping M(F, F′) is composed of a sequence of ordered pairs (nf, nf′), where nf specifies the position of a node in F, and nf′ specifies the location of the matching node in F′. In the case where the node located at nf in F has no matching node in F′, nf′ is assigned ⊥. Similarly, when a node exists in F′, but no matching node is found in F, nf is assigned ⊥.

Each edit script ∆(F, F′) describes a sequence of low-level operations that may be applied to the faulty source file F to transform it into its patched form F′. Unfortunately, GumTree provides no formal definition of its edit script language, and so, for the purposes of this analysis, we define its operations as follows:

• Delete(X) removes the node located at position X within P.

• Insert(X′, Y′, k) describes the insertion of the node located at X′ in P′ as the k-th child of the node located at Y′ in P′. Note that Y′ is not necessarily contained in P.

• Update(X, ℓ′) updates the label ℓ of the node located at X in P with the label ℓ′.

• Move(X, X′) moves the node located at position X within P to location X′ in P′. X′ is obtained by finding the corresponding mapping (X, X′) ∈ M(F, F′).

By performing the analysis on the ASTs of each repaired source file, and their accompanying edit scripts, rather than the plain text and diff outputs between each version of the file, aesthetic and irrelevant changes to the code, including the modification of comments, can safely be ignored. More importantly, this treatment allows each of the described repair actions to be detected using a strict set of formal rules


(based upon M(F, F′) and ∆(F, F′)), without the need to resort to unsound, ad-hoc approaches based on analysing the diff.

In a preliminary version of this analysis, BugHunter's (heuristic) ability to perform pre-processing on arbitrary C projects was used to handle pre-processor macros (e.g., #ifdef, #define), potentially allowing the analysis to deal with a wider range of bugs. However, pre-processing added a significant overhead to the process of AST generation (to the extent that certain projects, such as the Linux kernel, were no longer feasible to analyse), and made little difference to the detection of repair actions. To reduce the cost of the analysis, and to allow a larger, wider body of programs to be incorporated, this pre-processing stage was dropped from the final analysis.
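For concreteness, the four edit-script operations defined above can be represented as simple records; this encoding is a sketch of our own and not GumTree's API.

    from dataclasses import dataclass

    @dataclass
    class Delete:
        node: int         # position of the deleted node in P

    @dataclass
    class Insert:
        node: int         # position of the inserted node in P'
        parent: int       # position of its parent in P'
        k: int            # index of the node amongst its parent's children

    @dataclass
    class Update:
        node: int         # position of the relabelled node in P
        label: str        # the new label

    @dataclass
    class Move:
        source: int       # position of the node in P
        destination: int  # position of the matching node in P'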

Edit Script Post-Processing

To reduce the complexity of the analysis, we perform a number of post-processing steps on the edit scripts and accompanying node mappings produced by GumTree:

• Node Mapping, M: using the mapping of node numbers in P to P′, Mnum, we generate a mapping of nodes from P to P′ by their pointers. In cases where a node does not exist in either P or P′, the unpaired node n is mapped to ⊥ (i.e., (n, ⊥) or (⊥, n)).

• Edit Script Nodes: each node number in the edit script is replaced by a pointer to its corresponding node.

• Before Statements, S: computes the set of statements in P.

• After Statements, S′: computes the set of statements in P′.

B.2. Detection Rules

In this section, we provide the rules we devised to detect instances of the repair actions listed in Section 5.4.3. Below, we describe a number of helper functions that we use to reduce the complexity of these detection rules.

We introduce the function Parent(node) to find the parent of a given node within its AST. If the node has no parent (i.e., it is the root of the AST), ⊥ is returned. To simplify finding the nearest ancestor of a given node n that belongs to a certain kind k, we introduce the function NAK:

NAK(n, k) = ⊥                    if n = ⊥
            n                    if HasKind(n, k)
            NAK(Parent(n), k)    otherwise                      (B.1)


Note that this function treats its argument as the zeroth ancestor, allowing n itself to be returned as its result. We introduce a helper function, NAO, that finds the nearest ancestor of a given node n′ in P′ with a matching node in P, and returns that matching ancestral node:

NAO(n′) = n                   if (n, n′) ∈ Mnode
          NAO(Parent(n′))     if (⊥, n′) ∈ Mnode                (B.2)

We introduce a function, Modified, that accepts an edit e and returns the nearest node of a certain kind k within P to which that edit was applied:

Modified(e, k) = {NAK(n, k)}                       if e = Delete(n)
                 {NAK(n, k)}                       if e = Update(n, ℓ′)
                 {NAK(NAO(n′), k)}                 if e = Insert(n′, p′, k)
                 {NAK(n, k), NAK(NAO(n′), k)}      if e = Move(n, n′)        (B.3)
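Transcribed into executable form, NAK and NAO might look as follows, assuming nodes expose parent and kind attributes and that mapping maps matched nodes in P′ back to their counterparts in P; these names are ours.

    def nak(node, kind):
        # nearest ancestor of the given kind, treating node itself as the zeroth ancestor
        while node is not None:
            if node.kind == kind:
                return node
            node = node.parent
        return None

    def nao(node, mapping):
        # nearest ancestor of a node in P' with a match in P; returns the matching node
        while node is not None:
            if node in mapping:
                return mapping[node]
            node = node.parent
        return None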

Statement-Related Actions

DelStmt(stmt):
    stmt ∈ S
    parent = Parent(stmt)
    (stmt, ⊥) ∈ Mnode
    (parent, parent′) ∈ Mnode
    parent′ ≠ ⊥

To detect statement deletions, we iterate through each of the statements in the original AST (i.e., stmt ∈ S) and check that there is no matched node for that statement within the modified AST (indicating that it has been deleted). Redundant statement deletions (i.e., those implied by the deletion of their parent statement) are removed by ensuring that the parent of stmt maps to a node in the modified AST.

Next, we mine InsertStatement actions, which are used to represent both PrependStatement and AppendStatement actions. As with the detection of statement deletions, redundancy checks are used to ensure that nested actions are modelled using a single insertion.

 0  stmt ∈ S  0   parent = Parent(stmt)    InsStmt(stmt, parent) (⊥, stmt) ∈ Mnode 0  (parent, parent ) ∈ Mnode     parent 6= ⊥ 


To simplify the detection of sub-statement-level changes, an intermediary ModifyStatement action is introduced, which describes a series of edits applied to a given statement. To avoid the parent of a modified statement from triggering the ModifyStatement action, edits are only associated with their nearest statement in the original form of the AST.

 0  (s, s ) ∈ Mnode  0 0   s ∈ S ∧ s ∈ S   0  ModStmt(s, s , edits) edits = {e ∈ E | Modied(e, Stmt) = s}

 edits 6= ∅     s 6= s0 

If-Statement-Related Actions

 0  (s, s ) ∈ Mnode   Wrap(s, w, g) w = If(g, s, [])

 (⊥, if) ∈ M 

 0  s = If(g, s , [])  0  Unwrap(s, s ) s ∈ SP ∧ s∈ / SP 0 0  s ∈ SP 0 

 0  (s, s ) ∈ M    0 0 s = If(c, then, else)  ReplaceIfCond(s, s , c, c ) s0 = If(c0, then, else)    c 6= c0 

 0  (s, s ) ∈ M    s = If(c, then, else)   0 0  ReplaceThen(s, s , then, then ) s0 = If(c, then0, else)

 then 6= then0     then0 6= [] 

 0  (s, s ) ∈ M    s = If(c, then, else)   0 0  ReplaceElse(s, s , els, els ) s0 = If(c, then, else0)

 else 6= else0     else0 6= [] 

 0  (s, s ) ∈ M    0 s = If(c, then, els)  RemoveElse(s, s , els) s0 = If(c, then, [])    els 6= [] 


 0  (s, s ) ∈ M    0 s = If(c, then, [])  InsertElse(s, s , els) s0 = If(c, then, els)    els 6= [] 

 0  (s, s ) ∈ M    s = If(c1, then1, [])   0   s = If(c1, then1, els)   0  InsertElseIf(s, s , elif, c2, then2) elif = If(c2, then2, [])    c2 6= c1     then2 6= []    (⊥, then2) ∈ M

 0  (s, s ) ∈ M    0 s = If(g1, then, els)  GuardElse(s, s , g2) 0 0  s = If(g1, then, els )   0   els = If(g2, els, []) 

Switch-Related Actions

 0  (s, s ) ∈ M    0 0 s = Switch(exp, blk)  RepSwitchExp(s, s , exp, exp ) s0 = Switch(exp0, blk)    exp 6= exp0 

Loop-Related Actions

To detect RepLoopGuard repair actions, three intermediate repair actions, corresponding to the different loop types, are introduced. RepLoopGuard actions are then found by computing the union of the sets for these intermediate repair actions.

 0  (s, s ) ∈ M    0 0 s = For(init, g, incr, blk)  RepForGuard(s, s , g, g ) s0 = For(init, g0, incr, blk)    g 6= g0 

 0  (s, s ) ∈ M    0 0 s = While(g, blk)  RepWhileGuard(s, s , g, g ) s0 = While(g0, blk)    g 6= g0 


 0  (s, s ) ∈ M    0 0 s = DoWhile(g, blk)  RepDoWhileGuard(s, s , g, g ) s0 = DoWhile(g0, blk)    g 6= g0 

ARepLoopGuard = ARepForGuard ∪ ARepWhileGuard ∪ ARepDoWhileGuard

Like RepLoopGuard, RepLoopBody is composed of the union of three intermediate repair actions, which are otherwise ignored for the rest of the analysis.

 0  (s, s ) ∈ M    0 0 s = For(init, g, incr, b)  RepForBody(s, s , b, b ) s0 = For(init, g, incr, b0)    b 6= b0 ∧ b0 6= [] 

 0  (s, s ) ∈ M    0 0 s = While(g, b)  RepWhileBody(s, s , b, b ) s0 = While(g, b0)    b 6= b0 ∧ b0 6= [] 

 0  (s, s ) ∈ M    0 0 s = DoWhile(g, b)  RepDoWhileBody(s, s , b, b ) s0 = DoWhile(g, b0)    b 6= b0 ∧ b0 6= [] 

ARepLoopBody = ARepForBody ∪ ARepWhileBody ∪ ARepDoWhileBody

Assignment-Related Actions

 0  (s, s ) ∈ M    0 0 s = Assign(lhs, op, rhs)  RepRHS(s, s , rhs, rhs ) s0 = Assign(lhs, op, rhs0)    rhs 6= rhs0 

 0  (s, s ) ∈ M    0 0 s = Assign(lhs, op, rhs)  RepLHS(s, s , lhs, lhs ) s0 = Assign(lhs0, op, rhs)    lhs 6= lhs0 


Function-Call-Related Actions

As with other types of repair action, an intermediate action is introduced for all repair actions related to function calls. This intermediate action, ModCall, is used to quickly identify that the only modification to a given statement is rooted at a given function call (or, to be more precise, the function call closest to all edits made at that statement).

 0  ModStmt(s, s ,E) ∈ AModStmt    call = Call(t, args)   0 0 0   0 0 call = Call(t , args )  ModCall(s, s , c, c ,E) call 6= call0    (call, call0) ∈ M     ∀e ∈ E | Nearest(e, Call) = call 

 0 0  ModCall(s, s , c, c ,E) ∈ A  ModCall   0 0 call = Call(t, args)  RepCallTarg(s, s , t, t ) call0 = Call(t0, args)    t 6= t0 

 0 0  ModCall(s, s , c, c ,E) ∈ A  ModCall   0 call = Call(t, args)  ModCallArgs(args, args ) call0 = Call(t, args0)    args 6= args0 

 0  ModCall(args, args ) ∈ AModCallArgs    0 args = l ⊕ arg ⊕ r  RepCallArg(args, arg, arg ) args0 = l ⊕ arg0 ⊕ r    arg 6= arg0 

 0  ModCall(args, args ) ∈ AModCallArgs  0  InsertCallArg(args, arg ) args = l ⊕ r

 args0 = l ⊕ arg0 ⊕ r 

 0  ModCall(args, args ) ∈ A  ModCallArgs  RemoveCallArg(args, arg) args = l ⊕ arg ⊕ r

 args0 = l ⊕ r 


APPENDIX C

Additional Fault Localisation Results

In Table C.1, we report additional results from our comparison of the effectiveness of various fault localisation approaches in Section 4.3.1.

Table C.1: The effectiveness of various fault localisation schemes (GenProg, Adj. Cov., P2F, P2F-Cov, Metallaxis, MUSE, Jaccard, Ochiai, Tarantula), measured by the probability of sampling a fixable statement, for the scenarios ct-openssl-0a2dcb6, ct-openssl-6979583, ct-openssl-8e3854a, ct-openssl-eddef30, sir-gzip-v1-KP_1, sir-gzip-v1-TW_3, sir-gzip-v4-KL_1, sir-gzip-v5-KL_1, sir-gzip-v5-KL_8, sir-sed-v2-AG_17, and sir-sed-v3-AG_11.

Bibliography

Abreu, R., Zoeteweij, P., and van Gemund, A. J. C. (2007). On the Accuracy of Spectrum-based Fault Localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, TAICPART-MUTATION ’07, pages 89–98.

Ackling, T., Alexander, B., and Grunert, I. (2011). Evolving Patches for Software Repair. In Conference on Genetic and Evolutionary Computation, GECCO ’11, pages 1427–1434.

Agrawal, H., Horgan, J. R., Krauser, E. W., and London, S. (1993). Incremental Regression Testing. In Conference on Software Maintenance, ICSM ’93, pages 348–357.

Al-Ekram, R., Adma, A., and Baysal, O. (2005). diffX: An Algorithm to Detect Changes in Multi-version XML Documents. In Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’05, pages 1–11.

Arcuri, A. and Yao, X. (2008). A Novel Co-evolutionary Approach to Automatic Software Bug Fixing. In Congress on Evolutionary Computation, CEC ’08, pages 162–168.

Avizienis, A. (1985). The N-Version Approach to Fault-Tolerant Software. IEEE Transactions on Software Engineering, 11(12):1491–1501.

B. Le, T.-D., Lo, D., Le Goues, C., and Grunske, L. (2016). A Learning-to-rank Based Fault Localization Approach Using Likely Invariants. In International Symposium on Software Testing and Analysis, ISSTA ’16, pages 177–188.

Barr, E. T., Brun, Y., Devanbu, P., Harman, M., and Sarro, F. (2014). The Plastic Surgery Hypothesis. In International Symposium on Foundations of Software Engineering, FSE ’14, pages 306–317.

Black, P. E. (2007). Software assurance with SAMATE reference dataset, tool standards, and studies. In Digital Avionics Systems Conference, DASC ’07, pages 6.C.1–1–6.C.1–6.

Bloch, J. (2008). Effective Java. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition.

Böhme, M. and Roychoudhury, A. (2014). CoREBench: Studying Complexity of Regression Errors. In International Symposium on Software Testing and Analysis, ISSTA ’14, pages 105–115.

Cadar, C., Dunbar, D., and Engler, D. (2008). KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. In Conference on Operating Systems Design and Implementation, OSDI ’08, pages 209–224.


Carbin, M., Misailovic, S., Kling, M., and Rinard, M. C. (2011). Detecting and Escaping Infinite Loops with Jolt. In European Conference on Object-oriented Programming, ECOOP ’11, pages 609–633.

Chandra, S., Torlak, E., Barman, S., and Bodik, R. (2011). Angelic Debugging. In International Conference on Software Engineering, ICSE ’11, pages 121–130.

Chen, T. Y. and Cheung, Y. Y. (1997). On Program Dicing. Journal of Software Maintenance: Research and Practice, 9(1):33–46.

Coker, Z. and Hafiz, M. (2013). Program Transformations to Fix C Integers. In International Conference on Software Engineering, ICSE ’13, pages 792–801.

Coldewey, D. (2008). Zune bug explained in detail. https://techcrunch.com/2008/12/31/zune-bug-explained-in-detail/. Accessed May, 2017.

de Moura, L. and Bjørner, N. (2008). Z3: An Efficient SMT Solver, pages 337–340. ETAPS ’08.

Demsky, B. and Rinard, M. (2003). Automatic Detection and Repair of Errors in Data Structures. In Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’03, pages 78–95.

Demsky, B. and Rinard, M. (2005). Data Structure Repair using Goal-Directed Reasoning. In International Conference on Software Engineering, ICSE ’05, pages 176–185.

Do, H., Elbaum, S., and Rothermel, G. (2005). Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering, 10(4):405–435.

Downey, A. B. (2011). Think Stats: Probability and Statistics for Programmers. O’Reilly Media, 2nd edition.

Dyer, R., Nguyen, H. A., Rajan, H., and Nguyen, T. N. (2013). Boa: A Language and Infrastructure for Analyzing Ultra-large-scale Software Repositories. In International Conference on Software Engineering, ICSE ’13, pages 422–431.

Eiben, A. E. and Smith, J. E. (2015). Introduction to Evolutionary Computing. Springer Publishing Company, Incorporated, 2nd edition.

Ernst, M. D., Perkins, J. H., Guo, P. J., McCamant, S., Pacheco, C., Tschantz, M. S., and Xiao, C. (2007). The Daikon System for Dynamic Detection of Likely Invariants. Science of Computer Programming, 69(1-3):35–45.

Falleri, J., Morandat, F., Blanc, X., Martinez, M., and Monperrus, M. (2014). Fine-grained and accurate source code differencing. In International Conference on Automated Software Engineering, ASE ’14, pages 313–324.


Fast, E., Le Goues, C., Forrest, S., and Weimer, W. (2010). Designing Better Fitness Functions for Automated Program Repair. In Annual Conference on Genetic and Evolutionary Computation, GECCO ’10, pages 965–972.

Felter, W., Ferreira, A., Rajamony, R., and Rubio, J. (2015). An updated performance comparison of virtual machines and Linux containers. In International Symposium on Performance Analysis of Systems and Software, ISPASS ’15, pages 171–172.

Fry, Z. P., Landau, B., and Weimer, W. (2012). A Human Study of Patch Maintainability. In International Symposium on Software Testing and Analysis, ISSTA ’12, pages 177–187.

Gall, H. C., Fluri, B., and Pinzger, M. (2009). Change Analysis with Evolizer and ChangeDistiller. IEEE Software, 26(1):26–33.

Gallagher, K., Binkley, D., and Harman, M. (2006). Stop-List slicing. In International Workshop on Source Code Analysis and Manipulation, SCAM ’06, pages 11–20.

Gao, Q., Xiong, Y., Mi, Y., Zhang, L., Yang, W., Zhou, Z., Xie, B., and Mei, H. (2015). Safe Memory-Leak Fixing for C Programs. In International Conference on Software Engineering, volume 1 of ICSE ’15, pages 459–470.

Giffhorn, D. and Hammer, C. (2007). An Evaluation of Slicing Algorithms for Concurrent Programs. In International Working Conference on Source Code Analysis and Manipulation, SCAM ’07, pages 17–26.

Glover, F. (1986). Future Paths for Integer Programming and Links to Artificial Intelligence. Computers and Operations Research, 13(5):533–549.

Gupta, R. and Soffa, M. L. (1995). Hybrid Slicing: An Approach for Refining Static Slices Using Dynamic Information. SIGSOFT Software Engineering Notes, 20(4):29–40.

Harman, M. and Hierons, R. M. (2001). An Overview of Program Slicing. Software Focus, 2(3):85–92.

Harrold, M. J., Rothermel, G., Sayre, K., Wu, R., and Yiz, L. (2000). An Empirical Investigation of the Relationship Between Spectra Differences and Regression Faults. Journal of Software Testing, Verification, and Reliability, 10(3).

Holland, J. H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA.

Holland, J. H. (2000). Building Blocks, Cohort Genetic Algorithms, and Hyperplane-Defined Functions. Evolutionary Computing, 8(4):373–391.

Horwitz, S., Reps, T., and Binkley, D. (1990). Interprocedural Slicing Using Dependence Graphs. ACM Transactions on Programming Languages and Systems, 12(1):26–60.


Hutchins, M., Foster, H., Goradia, T., and Ostrand, T. (1994). Experiments of the Effectiveness of Dataflow- and Controlflow-based Test Adequacy Criteria. In International Conference on Software Engineering, ICSE ’94, pages 191–200.

Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579.

Janssen, T., Abreu, R., and v. Gemund, A. J. C. (2009). Zoltar: A Toolset for Automatic Fault Localization. In International Conference on Automated Software Engineering, ASE ’09, pages 662–664.

Jha, S., Gulwani, S., Seshia, S. A., and Tiwari, A. (2010). Oracle-guided Component-based Program Synthesis. In International Conference on Software Engineering, ICSE ’10, pages 215–224.

Joachims, T. (2002). Optimizing Search Engines Using Clickthrough Data. In International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 133–142.

Jones, J. A. and Harrold, M. J. (2005). Empirical Evaluation of the Tarantula Automatic Fault-localization Technique. In International Conference on Automated Software Engineering, ASE ’05, pages 273–282.

Jones, J. A., Harrold, M. J., and Stasko, J. (2002). Visualization of Test Information to Assist Fault Localization. In International Conference on Software Engineering, ICSE ’02, pages 467–477.

Jones, T. and Forrest, S. (1995). Fitness Distance Correlation As a Measure of Problem Difficulty for Genetic Algorithms. In International Conference on Genetic Algorithms, ICGA ’95, pages 184–192.

Judge Business School, Cambridge University (2013). Cambridge University Study States Software Bugs Cost Economy $312 Billion Per Year. http://www.prweb.com/releases/2013/1/prweb10298185.htm. Accessed April, 2017.

Just, R., Jalali, D., and Ernst, M. D. (2014). Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In International Symposium on Software Testing and Analysis, ISSTA ’14, pages 437–440.

Kaleeswaran, S., Tulsian, V., Kanade, A., and Orso, A. (2014). MintHint: Automated Synthesis of Repair Hints. In International Conference on Software Engineering, ICSE ’14, pages 266–276.

Ke, Y., Stolee, K. T., Le Goues, C., and Brun, Y. (2015). Repairing Programs with Semantic Code Search. In International Conference on Automated Software Engineering, ASE ’15, pages 295–306.


Kim, D., Nam, J., Song, J., and Kim, S. (2013). Automatic Patch Generation Learned from Human-written Patches. In International Conference on Software Engineering, ICSE ’13, pages 802–811.

Kling, M., Misailovic, S., Carbin, M., and Rinard, M. (2012). Bolt: On-demand Infinite Loop Escape in Unmodified Binaries. In International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’12, pages 431–450.

Korel, B. and Rilling, J. (1998). Dynamic program slicing methods. Information and Software Technology, 40(11–12):647–659.

Kossak, F., Mashkoor, A., Geist, V., and Illibauer, C. (2014). Improving the Understandability of Formal Specifications: An Experience Report, pages 184–199. REFSQ ’14.

Krinke, J. (2003). Barrier slicing and chopping. In International Workshop on Source Code Analysis and Manipulation, SCAM ’03, pages 81–87.

Krinke, J. (2004). Slicing, Chopping, and Path Conditions with Barriers. Software Quality Journal, 12(4):339–360.

Landsberg, D., Chockler, H., Kroening, D., and Lewis, M. (2015). Evaluation of Measures for Statistical Fault Localisation and an Optimising Scheme. In Egyed, A. and Schaefer, I., editors, International Conference on Fundamental Approaches to Software Engineering, FASE ’15, pages 115–129.

Langdon, W. B. (2015). Genetically Improved Software, pages 181–220. Springer International Publishing, Cham.

Langdon, W. B. and Poli, R. (1998). Fitness Causes Bloat, pages 13–22. Springer London, London.

Le, X.-B. D., Lo, D., and Le Goues, C. (2016). History Driven Program Repair. In International Conference on Software Analysis, Evolution, and Reengineering, SANER ’16.

Le Goues, C., Dewey-Vogt, M., Forrest, S., and Weimer, W. (2012a). A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In International Conference on Software Engineering, ICSE ’12, pages 3–13.

Le Goues, C., Holtschulte, N., Smith, E. K., Brun, Y., Devanbu, P., Forrest, S., and Weimer, W. (2015). The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Transactions on Software Engineering, 41(12):1236–1256.

Le Goues, C., Nguyen, T., Forrest, S., and Weimer, W. (2012b). GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering, 38(1):54–72.


Le Goues, C., Weimer, W., and Forrest, S. (2012c). Representations and Operators for Improving Evolutionary Software Repair. In Conference on Genetic and Evolutionary Computation, GECCO ’12, pages 959–966, New York, NY, USA.

Liblit, B., Naik, M., Zheng, A. X., Aiken, A., and Jordan, M. I. (2005). Scalable Statistical Bug Isolation. In Conference on Programming Language Design and Implementation, PLDI ’05, pages 15–26.

Long, F. and Rinard, M. (2015). Staged Program Repair with Condition Synthesis. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pages 166–178.

Long, F. and Rinard, M. (2016). Automatic patch generation by learning correct code. In Symposium on Principles of Programming Languages, POPL ’16, pages 298–312.

Luke, S. and Panait, L. (2002). Fighting Bloat with Nonparametric Parsimony Pressure. In International Conference on Parallel Problem Solving from Nature, PPSN VII, pages 411–421.

Lyle, J. R. (1987). Automatic program bug location by program slicing. International Conference on Computers.

Maldonado, J. C., Delamaro, M. E., Fabbri, S. C. P. F., Simão, A. d. S., Sugeta, T., Vincenzi, A. M. R., and Masiero, P. C. (2001). Mutation Testing for the New Century. chapter Proteum: A Family of Tools to Support Specification and Program Testing Based on Mutation, pages 113–116. Kluwer Academic Publishers, Norwell, MA, USA.

Mann, H. B. and Whitney, D. R. (1947). On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18(1):50–60.

Martinez, M. and Monperrus, M. (2013). Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1):176–205.

Martinez, M., Weimer, W., and Monperrus, M. (2014). Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches. In International Conference on Software Engineering, ICSE ’14, pages 492–495.

Mechtaev, S., Yi, J., and Roychoudhury, A. (2015). DirectFix: Looking for Simple Program Repairs. In International Conference on Software Engineering, ICSE ’15, pages 448–458.

Mechtaev, S., Yi, J., and Roychoudhury, A. (2016). Angelix: scalable multiline program patch synthesis via symbolic analysis. In International Conference on Software Engineering, ICSE ’16, pages 691–701.


Meng, N., Kim, M., and McKinley, K. S. (2013). LASE: Locating and Applying Systematic Edits by Learning from Examples. In International Conference on Software Engineering, ICSE ’13, pages 502–511.

Miller, B. P., Fredriksen, L., and So, B. (1990). An Empirical Study of the Reliability of UNIX Utilities. Communications of the ACM, 33(12):32–44.

Monperrus, M. (2014). A critical review of automatic patch generation learned from human-written patches: essay on the problem statement and the evaluation of automatic software repair. In International Conference on Software Engineering, ICSE ’14, pages 234–242.

Monperrus, M. and Martinez, M. (2012). CVS-Vintage: A Dataset of 14 CVS Repositories of Java Software. Technical Report hal-00769121, INRIA.

Montana, D. J. (1995). Strongly Typed Genetic Programming. Evolutionary Computation, 3(2):199–230.

Moon, S., Kim, Y., Kim, M., and Yoo, S. (2014a). Ask the Mutants: Mutating Faulty Programs for Fault Localization. In International Conference on Software Testing, Verification and Validation, ICST ’14, pages 153–162.

Moon, S., Kim, Y., Kim, M., and Yoo, S. (2014b). Hybrid-MUSE: Mutating Faulty Programs for Precise Fault Localization. Technical report, KAIST.

Naish, L., Lee, H. J., and Ramamohanarao, K. (2011). A Model for Spectra-based Software Diagnosis. ACM Transactions on Software Engineering and Methodology, 20(3):11:1–11:32.

Naish, L., Lee, H. J., and Ramamohanarao, K. (2012). Spectral Debugging: How Much Better Can We Do? In Australasian Computer Science Conference, ACSC ’12, pages 99–106.

Necula, G. C., McPeak, S., Rahul, S. P., and Weimer, W. (2002). CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In International Conference on Compiler Construction, CC ’02, pages 213–228.

Neumann, G., Harman, M., and Poulding, S. (2015). Transformed Vargha-Delaney Effect Size. In International Symposium on Search Based Software Engineering, SSBSE ’15, pages 318–324.

Nguyen, H. D. T., Qi, D., Roychoudhury, A., and Chandra, S. (2013). SemFix: Program Repair via Semantic Analysis. In International Conference on Software Engineering, ICSE ’13, pages 772–781.

Ning, J. Q., Engberts, A., and Kozaczynski, W. V. (1994). Automated Support for Legacy Code Understanding. Communications of the ACM, 37(5):50–57.

Nishimatsu, A., Jihira, M., Kusumoto, S., and Inoue, K. (1999). Call-mark slicing: an efficient and economical way of reducing slice. In International Conference on Software Engineering, ICSE ’99, pages 422–431.


Ochiai, A. (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Nippon Suisan Gakkai Shi, 22(9):526–530.

Oliveira, V. P. L., Souza, E. F. D., Le Goues, C., and Camilo-Junior, C. G. (2016). Improved Crossover Operators for Genetic Programming for Program Repair. In Sarro, F. and Deb, K., editors, International Symposium on Search Based Software Engineering, SSBSE ’16, pages 112–127.

Pan, H. and Spafford, E. H. (1992). Heuristics for Automatic Localization of Software Faults. Technical Report SERC-TR-116-P, Software Engineering Research Center, Purdue University, West Lafayette, Indiana, USA.

Papadakis, M. and Le Traon, Y. (2015). Metallaxis-FL: mutation-based fault localization. Software Testing, Verification, and Reliability, 25(5-7):605–628.

Parnin, C. and Orso, A. (2011). Are Automated Debugging Techniques Actually Helping Programmers? In International Symposium on Software Testing and Analysis, ISSTA ’11, pages 199–209.

Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. (2009). Automatically Patching Errors in Deployed Software. In Symposium on Operating Systems Principles, SOSP ’09, pages 87–102.

Qi, Y., Mao, X., Lei, Y., Dai, Z., and Wang, C. (2014). The Strength of Random Search on Automated Program Repair. In International Conference on Software Engineering, ICSE ’14, pages 254–265.

Qi, Y., Mao, X., Lei, Y., and Wang, C. (2013). Using Automated Program Repair for Evaluating the Effectiveness of Fault Localization Techniques. In International Symposium on Software Testing and Analysis, ISSTA ’13, pages 191–201.

Qi, Z., Long, F., Achour, S., and Rinard, M. (2015). An Analysis of Patch Plausibility and Correctness for Generate-and-validate Patch Generation Systems. In International Symposium on Software Testing and Analysis, ISSTA ’15, pages 24–36.

Renieris, M. and Reiss, S. P. (2003). Fault Localization With Nearest Neighbor Queries. In International Conference on Automated Software Engineering, ASE ’03, pages 30–39.

Reps, T., Ball, T., Das, M., and Larus, J. (1997). The Use of Program Profiling for Software Maintenance with Applications to the Year 2000 Problem. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’97, pages 432–449, New York, NY, USA. Springer-Verlag New York, Inc.

Rosin, C. D. and Belew, R. K. (1997). New Methods for Competitive Coevolution. Evolutionary Computation, 5(1):1–29.


Rothermel, G., Untch, R. H., Chu, C., and Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948.

Schulte, E., Fry, Z. P., Fast, E., Weimer, W., and Forrest, S. (2013). Software mutational robustness. Genetic Programming and Evolvable Machines, 15(3):281–312.

Sidiroglou-Douskos, S., Lahtinen, E., Long, F., and Rinard, M. (2015). Automatic Error Elimination by Horizontal Code Transfer Across Multiple Applications. In Conference on Programming Language Design and Implementation, PLDI ’15, pages 43–54.

Silva, J. (2012). A Vocabulary of Program Slicing-based Techniques. ACM Computing Surveys, 44(3):12:1–12:41.

Sivagurunathan, Y., Harman, M., and Danicic, S. (1997). Slicing, I/O and the Implicit State. In International Workshop on Automated Debugging, pages 59–65.

Smith, E. K., Barr, E., Le Goues, C., and Brun, Y. (2015). Is the Cure Worse than the Disease? Overfitting in Automated Program Repair. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pages 532–543, Bergamo, Italy.

Soto, M., Thung, F., Wong, C.-P., Le Goues, C., and Lo, D. (2016). A deeper look into bug fixes: patterns, replacements, deletions, and additions. In International Conference on Mining Software Repositories, MSR ’16, pages 512–515.

Stolee, K. T., Elbaum, S., and Dobos, D. (2014). Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM), 23(3):26.

Takada, T., Ohata, F., and Inoue, K. (2002). Dependence-cache slicing: a program slicing method using lightweight dynamic information. In International Workshop on Program Comprehension, IWPC ’02, pages 169–177.

Tan, H. B. K. and Ling, T. W. (1998). Correct program slicing of database operations. IEEE Software, 15(2):105.

Tan, S. H. and Roychoudhury, A. (2015). Relifix: Automated Repair of Software Regressions. In International Conference on Software Engineering, ICSE ’15, pages 471–482.

Tan, S. H., Yi, J., Yulis, Mechtaev, S., and Roychoudhury, A. (2017). Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools. In ICSE ’17 Poster. To appear.

Tan, S. H., Yoshida, H., Prasad, M. R., and Roychoudhury, A. (2016). Anti-patterns in Search-Based Program Repair. In International Symposium on Foundations of Software Engineering, FSE ’16.

Thung, F., Lo, D., and Jiang, L. (2012). Automatic Defect Categorization. In Working Conference on Reverse Engineering (WCRE), WCRE ’12, pages 205–214. IEEE.


Timperley, C. S. (2013). Reflective Method Matching for Object-Oriented Programs. Master’s thesis, University of York, York, England.

Timperley, C. S. and Stepney, S. (2014). Reflective Grammatical Evolution. In ALife XIV, pages 71–78. MIT Press.

Vargha, A. and Delaney, H. D. (2000). A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132.

Venkatesh, G. A. (1991). The Semantic Approach to Program Slicing. SIGPLAN Notices, 26(6):107–119.

Walcott, K. R., Soffa, M. L., Kapfhammer, G. M., and Roos, R. S. (2006). Time-Aware Test Suite Prioritization. In International Symposium on Software Testing and Analysis, ISSTA ’06, pages 1–12.

Wei, Y., Pei, Y., Furia, C. A., Silva, L. S., Buchholz, S., Meyer, B., and Zeller, A. (2010). Automated Fixing of Programs with Contracts. In International Symposium on Software Testing and Analysis, ISSTA ’10, pages 61–72.

Weimer, W., Fry, Z. P., and Forrest, S. (2013). Leveraging program equivalence for adaptive program repair: Models and first results. In International Conference on Automated Software Engineering, ASE ’13, pages 356–366.

Weimer, W., Nguyen, T., Le Goues, C., and Forrest, S. (2009). Automatically Finding Patches Using Genetic Programming. In International Conference on Software Engineering, ICSE ’09, pages 364–374.

Weiser, M. (1979). Program slices: formal, psychological, and practical investigations of an automatic program abstraction method. PhD thesis, University of Michigan.

Weiser, M. (1981). Program Slicing. In International Conference on Software Engineering, ICSE ’81, pages 439–449.

Willmor, D., Embury, S. M., and Shao, J. (2004). Program Slicing in the Presence of a Database State. In International Conference on Software Maintenance, ICSM ’04, pages 448–452.

Xuan, J., Martinez, M., DeMarco, F., Clément, M., Marcote, S. L., Durieux, T., Berre, D. L., and Monperrus, M. (2017). Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering, 43(1):34–55.

Yan, X. and Han, J. (2002). gSpan: Graph-Based Substructure Pattern Mining. In International Conference on Data Mining, ICDM ’02, pages 721–.

Yoo, S. (2012). Evolving Human Competitive Spectra-Based Fault Localisation Techniques. In Search Based Software Engineering, Lecture Notes in Computer Science, pages 244–258. Springer Berlin Heidelberg.

Zeller, A. and Hildebrandt, R. (2002). Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering, 28(2):183–200.
