Advanced Techniques for Search-Based Program Repair

Christopher Steven Timperley

Ph.D.

University of York, June 2017

Abstract

Debugging and repairing defects costs the global economy hundreds of billions of dollars annually, and accounts for as much as 50% of programmers' time. To tackle the burgeoning expense of repair, researchers have proposed the use of novel techniques to automatically localise and repair such defects. Collectively, these techniques are referred to as automated program repair.

Despite promising early results, recent studies have demonstrated that existing automated program repair techniques are considerably less effective than previously believed. Current approaches are limited either in terms of the number and kinds of bugs they can fix, the size of patches they can produce, or the programs to which they can be applied. To become economically viable, automated program repair needs to overcome all of these limitations.

Search-based repair is the only approach to program repair which may be applied to any bug or program, without assuming the existence of formal specifications. Despite its generality, current search-based techniques are restricted; they are either efficient, or capable of fixing multiple-line bugs—no existing technique is both. Furthermore, most techniques rely on the assumption that the material necessary to craft a repair already exists within the faulty program. By using existing code to craft repairs, the size of the search space is vastly reduced, compared to generating code from scratch. However, recent results, which show that almost all repairs generated by a number of search-based techniques can be explained as deletion, lead us to question whether this assumption is valid.

In this thesis, we identify the challenges facing search-based program repair, and demonstrate ways of tackling them. We explore if and how the knowledge of candidate patch evaluations can be used to locate the source of bugs. We use software repository mining techniques to discover the form of a better repair model capable of addressing a greater number of bugs. We conduct a theoretical and empirical analysis of existing search algorithms for repair, before demonstrating a more effective alternative, inspired by greedy algorithms. To ensure reproducibility, we propose and use a methodology for conducting high-quality automated program repair research. Finally, we assess our progress towards solving the challenges of search-based program repair, and reflect on the future of the field.

Contents

Abstract

Acknowledgements

Declaration

1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Research Questions
  1.4 Contributions
  1.5 Document Structure

2 Background
  2.1 Automated Program Repair
  2.2 Search-Based Repair
  2.3 Semantics-Based Repair
  2.4 Specification-Based Repair
  2.5 Related Techniques
  2.6 Concluding Remarks

3 Tools and Techniques
  3.1 Bug Scenarios
  3.2 Pythia
  3.3 RepairBox
  3.4 Methodology
  3.5 Conclusion

4 Fault Localisation
  4.1 Background
  4.2 Analysis
  4.3 Approach
  4.4 Discussion & Conclusion

5 Repair Model
  5.1 Related Work
  5.2 Motivation for Study
  5.3 Methodology
  5.4 Repair Model
  5.5 Approach
  5.6 Results
  5.7 Discussion & Conclusion

6 Search
  6.1 Related Work
  6.2 Theoretical Analysis
  6.3 Empirical Study
  6.4 Greedy Algorithm
  6.5 Future Work
  6.6 Conclusion

7 Conclusion
  7.1 Summary
  7.2 Future Work
  7.3 Concluding Remarks

Appendices

A Reproducibility
  A.1 Fault Localisation
  A.2 Repair Model
  A.3 Search

B Repair Action Mining
  B.1 AST and Edit Script Generation
  B.2 Detection Rules

C Additional Fault Localisation Results

Bibliography

List of Tables

2.1 Examples of different types of defect classes and the shared properties by which they are defined. Adapted from [Monperrus, 2014].
2.2 A list of the repair actions within the repair model for History Driven Program Repair, separated by their sources [Le et al., 2016].
2.3 MintHint's repair model, or hints, and the kinds of faults that each hint is designed to address. Taken from [Kaleeswaran et al., 2014].

4.1 The average precision of HybridMUSE as its mutant sampling rate is adjusted. Taken from [Moon et al., 2014b].
4.2 Details of the subjects we studied for our preliminary mutation analysis. KLOC measures the number of thousands of lines of C code in the program, as calculated by cloc. Tests states the average number of test cases used by bugs for that program.
4.3 Specifications for the Amazon EC2 instances used to perform mutation analysis on 15 artificial bugs.
4.4 Specifications of the Microsoft Azure compute instances used to perform analysis on 13 real-world bugs.
4.5 A summary of the mutation analysis results for each bug scenario. % Compiling specifies the percentage of mutants that successfully compiled. % Neutral describes the percentage of (compiling) mutants that had no effect on the outcome of the tests. % Lethal describes the percentage of (compiling) mutants (covering at least one positive test) that failed all of their covered tests. Mutants specifies the number of mutants generated within the 12-hour random walk. Sample Rate gives the average number of mutants per suspicious statement.
4.6 Comparison of fault localisation accuracies achieved by different approaches, where accuracy is measured by the probability of sampling a statement containing a fix from the resulting distribution. Results are given as percentages.

5.1 The number of instances of each repair action discovered across each of the mined bugs, together with the number (and percentage) of bugs that involve at least one repair action of that type.
5.2 The graftability of each repair action in the contexts of the concrete pool, containing the unchanged snippets from the file under repair, and the abstract pool, containing the unlabelled forms of the snippets from the file under repair.
5.3 A summary of the frequency of each of the proposed repair actions, measured by the percentage of bugs in which it is encountered, together with the graftability of that repair action when the abstract pool is used. Effectiveness, computed as the product of frequency and graftability, estimates the fraction of bugs for which a given repair action may graft a repair.

6.1 The baseline parameters of the algorithm used within each run.
6.2 Specifications of the Microsoft Azure F4 compute instances used to collect the data for this study.
6.3 A table of the subject programs used to conduct this study. # LOC describes the number of source code lines in the original program, as measured by cloc. # Stmts specifies the number of statements within GenProg's pre-processed AST representation of the program. # Tests gives the size of the original test suite for the program.
6.4 A summary of the bugs for which a repair was found at least once over the ten runs, for each configuration. ✓ indicates that at least one patch was found for the given bug-configuration.
6.5 A comparison of the reliability of different configurations for each bug scenario. The presence of an "—" symbol indicates that no patches were found for that given bug-configuration.
6.6 A summary of the median number of unique candidate patch evaluations required to find a repair, for each bug scenario. The presence of an "—" symbol indicates that no patches were found for that bug-configuration.
6.7 An overview of the cost of finding a patch for each bug-configuration, measured by the total number of unique candidate patch evaluations, across all runs, divided by the number of runs that were successful. Bug-configurations with an "—" symbol indicate that no patches were found during any of the runs.
6.8 A comparison of the bugs patched by each technique.
6.9 A comparison of the reliability achieved by the genetic and greedy search algorithms, measured by the fraction of runs wherein an acceptable repair was found. An "—" is used to denote bug-configurations where no repair was found across any of the runs.
6.10 A comparison of efficiency between the genetic search and greedy search algorithms, measured by the total wall-clock time across all runs, in seconds, divided by the number of successful runs. Reduction describes the reduction factor achieved by the greedy algorithm, compared to the genetic algorithm.

C.1 The effectiveness of various fault localisation schemes, measured by the probability of sampling a fixable statement.

List of Figures

2.1 The general automated repair process accepts the source code for a faulty program, together with a test suite, containing failed test cases, exposing the faults within the program. From these, the possible locations of the faults are determined, and for some repair approaches, a pool of donor code is generated from the input program. Using the contents of the donor pool, together with a number of basic repair actions, the search generates and evaluates candidate patches, until one is found that passes all tests within the suite.
2.2 An example bug scenario, adapted from the real-world Zune leap year bug [Coldewey, 2008]. The program accepts a date, given as the number of days since January 1st 1980, and should determine the year to which that date belongs. In the event that year is a leap year and days ever becomes 366, the program will enter an infinite loop.
2.3 Before the repair process can begin, each of the statements within the program is identified and annotated with a unique SID.
2.4 A test suite for our running example, described by inputs and expected outputs for the program. Failing tests are assigned labels starting with the letter "N", indicating that they are negative test cases. Conversely, passing tests are assigned a label starting with the letter "P", indicating that they are positive test cases.
2.5 An example of the output produced by the coverage generation process. The columns on the right show which statements are covered by a particular test. For the sake of brevity, we omit coverage for the majority of the test suite.
2.6 An example of a fault spectrum for the Zune bug, together with the suspiciousness values for each statement, as computed by GenProg's suspiciousness metric.
2.7 An illustration of GenProg's one-point crossover operator. Two parents A and B are accepted as input, and each is split into two parts at a random point. The first part of A is combined with the second part of B to form a child C. Similarly, the first part of B is combined with the second part of A to form a child D.
2.8 An example change graph produced during the offline, bug fix mining stage of history-driven program repair [Le et al., 2016]. Here, the change graph shows the name of a parameter being updated within the context of a method call.
2.9 The fix schemas implemented in AutoFix-E [Wei et al., 2010]. snippet is replaced with a sequence of routine calls that move the program from a faulty state into a desired state. old_stmt may either be a single statement, or the block to which a statement belongs. fail is used to monitor the conditions under which the fault manifests, and not to effect the appropriate action.

3.1 An overview of the inputs and outputs of Pythia's oracle generation process.
3.2 In this example, taken from the gzip object within the SIR, the test suite attempts to destructively compress a given file, deleting the original version of the file and replacing it with its zipped form.
3.3 An example test case description within a Pythia test manifest file. Each test case is described by its corresponding shell command, the contents of its sandbox directory, and an optional human-readable description.
3.4 An entry in an example Pythia oracle file, describing the expected behaviour of a test case from the corresponding manifest file.
3.5 Docker Containers (left) vs. Virtual Machines (right). Each container sits on top of the Docker runtime, which provides access to the kernel of the host machine. Each virtual machine virtualises its own stack, down to the level of the hardware, and sits on top of a hypervisor, which allows multiple VMs to run on the same machine.
3.6 Repair boxes can be used with repair tools by mounting the binaries, provided by a repair tool container, into the repair box, via volume mounting.
3.7 An example Dockerfile for one of the libtiff scenarios within RepairBox.
3.8 The Docker image for a given bug scenario is built as a series of layers. Each layer is shared by a number of images on the layer above, reducing disk usage and build time.
3.9 An example bug scenario manifest.
3.10 An example tool manifest describing GenProg.
3.11 Example uses of the repairbox list command.
3.12 An example use of the repairbox launch command.

4.1 A screenshot of the original Tarantula fault localisation visualisation tool. Lines coloured red are executed primarily by the failing test cases, suggesting a high suspiciousness. Lines that are mostly covered by the passing test cases are coloured green. Brightness is used to indicate a form of confidence in the suspiciousness attributed to a given line: the brightness of a line is given by the greater of the fraction of failing tests covered by the line, and the fraction of passing tests covered. Source: http://spideruci.org/fault-localization/ (May 2017).
4.2 An example of backwards static slicing on a small program. The source code on the right shows the sliced form of the source code on the left, where (20, lines) is set as the slicing criterion. Adapted from the example given in [Silva, 2012].
4.3 Comparing the mean pass-to-failure rate for applicable mutants, we observe different, but overlapping, distributions for correct statements and faulty statements (KS2 = 0.301; p = 0.003; A12 = 0.679 [medium effect]).
4.4 We observe similar distributions of mean f2p values for faulty and correct statements (KS2 = 0.185; p = 0.177). In both cases, more than half of the mutants at each statement did not pass any of the previously failing tests.

6.1 An illustration of the implicit search space defined by GenProg's representation and a more granular alternative proposed by Oliveira et al. [2016]. Whereas the type of operation, location, and donor statement used by an edit are all considered to be part of a discrete, evolvable unit within GenProg's representation, Oliveira et al. [2016]'s intermediate representation allows each of these attributes to be treated separately by a set of purpose-built crossover operators.
6.2 Patches tend to become longer over time.
6.3 Across all problems, we observe a monotonic increase in the total number of edit operations contained within the population, despite the ability of crossover to generate smaller individuals.
6.4 An example of the destructive edit bias. In this example, a single-edit patch is applied to the original AST, given on the left. This patch replaces the node at location 1 with the node at location 5. When the child tree is subsequently mutated, any changes to locations 1, 4 or 5 will have no effect. Note, the donor node, coloured black, may not be the subject of a future mutation operation. Thus, the effects of this replacement operation are permanent on all of its descendants (except in cases where crossover moves this operation to another child).
6.5 An example of the append bias. The AST in the top left shows the state of the original, faulty program. The AST in the top right shows the repaired version of that AST, containing two missing statements α and β. On the bottom half of the diagram, we show a repair scenario in which this bias is encountered. In the first generation, α is appended after the statement at location 4, matching its intended position in the repaired program. In the following generation, β is appended, yielding the incorrect sequence β; α. From the first mistake in the order of append operations, the search is unable to move backwards to find a correct order; incorrect edits will continue to accumulate at statement 4.
6.6 Our restricted representation implicitly encodes individuals as a fixed-length list of optional edits P. Each entry of the list, Pi, describes the edit, if any, that should be applied to the statement with number i.
6.7 Uniform crossover takes two parents, A and B, and generates a pair of symmetrical children, C and D.

Acknowledgements

First and foremost, I would like to thank my supervisor, Prof. Susan Stepney, for her continual support and guidance over the last five years, and for somehow always managing to give me feedback at a superhuman pace. I would also like to extend my thanks to the (former) members of the EvoEvo research group for many stimulating discussions: Dr. Simon Hickinbotham, Dr. Tim Hoverd, Dr. Paul Andrews, and Dr. Tim Taylor. A special thanks goes to Dr. Dan Franks for his invaluable advice and support, and whose teaching inspired me to begin this pursuit.

I am indebted to Prof. Claire Le Goues for her mentorship and advice, and for bringing me into her research group. Our paths may never have crossed were it not for the generous support of the William Gibbs Foundation, which made my visit to Prof. Le Goues's lab possible.

Thank you to Mum, Dad, Joe, Becky and Rick for always believing in me and for never letting me forget where I'm from, and to Millie for her cuddly companionship. And finally to Geraldine for living through every moment of this adventure with me, and giving me the courage to see its end. Nothing I can write here can begin to express my gratitude to you all. Thank you.


Declaration

I declare that this thesis is a presentation of original work, and that I am the sole author. This work has not previously been presented for an award at this, or any other, University. All sources are acknowledged as references.

The study of mutation analysis for automated program repair, presented in Chapter 4, has been accepted for publication and is due to appear in the proceedings of SSBSE'17: Christopher Steven Timperley, Susan Stepney, Claire Le Goues. An Investigation into the Use of Mutation Analysis for Automated Program Repair. SSBSE'17.


CHAPTER 1

Introduction

According to a 2013 study [Judge Business School, Cambridge University, 2013], the worldwide cost of debugging and repairing software bugs is estimated to be $312 billion per year; on average, programmers spend roughly 50% of their time finding and fixing bugs. Automated Program Repair (APR) is a new and emerging technique that attempts to reduce this burden by automatically localising and fixing arbitrary bugs.

In 2009, the research area of APR was born when GenProg [Weimer et al., 2009] demonstrated the ability to fix arbitrary bugs in real-world C programs, using only their provided test suites as an oracle. Although previous work had examined the possibility of automatically detecting, and in some cases, fixing bugs, none of those works considered the general case—each addressed a particular sub-set of bugs, under restrictive assumptions [Demsky and Rinard, 2003; Perkins et al., 2009].

To automatically repair programs, GenProg uses a form of genetic programming (GP). Whereas genetic programming is typically used to evolve entire programs from scratch—albeit small ones—GenProg evolves patches, represented as a sequence of edit operations, instead. These operations are represented by one of three statement-level program transformations: deletion, insertion, and replacement.1 To constrain the search space, and to transform program repair into a tractable problem, these operations are crafted using existing code, supplied by the program under repair, via a technique later referred to as plastic surgery [Barr et al., 2014]. Candidate patches are generated using this representation through a continual process of mutation, crossover, and selection. Patches that pass a greater number of tests, and especially those that pass previously failing tests, are identified as partial solutions by the search, and used as the basis of the next generation of patches. This process of generation and validation continues until a patch that passes the entire test suite is found.

At ICSE 2012 [Le Goues et al., 2012a], GenProg was shown to repair 55 out of 105 real-world bugs in large-scale C programs, including the PHP and Python interpreters, and libtiff. In the same year, at GECCO 2012, Le Goues et al. [2012c] demonstrated how modifications to GenProg's parameters could allow it to repair another five bugs within the same dataset, and to reduce the wall-clock time required to do so. Automated program repair—an idea dreamt about for more than half a century—appeared to have been conquered, overnight.

1 Originally, GenProg implemented a swap operation, rather than replacement. This operation was dropped from future versions when its authors found replacement to be a more effective alternative [Le Goues et al., 2012c].
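To make the shape of this search concrete, the following Python sketch models a GenProg-style patch as a list of statement-level edits, together with mutation and one-point crossover operators over that representation. It is an illustrative sketch only, not GenProg's implementation (which is written in OCaml): the names Edit, mutate and crossover, and the use of plain integers for statement identifiers, are assumptions made here.

    import random
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass(frozen=True)
    class Edit:
        """A single statement-level repair action (illustrative sketch)."""
        action: str            # "delete", "insert", or "replace"
        location: int          # SID of the statement being modified
        donor: Optional[int]   # SID of the donor statement (None for deletions)

    # A candidate patch is an ordered sequence of edits.
    Patch = List[Edit]

    def mutate(patch: Patch, suspicious: List[int], donor_pool: List[int]) -> Patch:
        """Append one randomly chosen edit: the location is drawn from the fault
        localisation, and the donor code from the plastic-surgery pool."""
        action = random.choice(["delete", "insert", "replace"])
        location = random.choice(suspicious)
        donor = None if action == "delete" else random.choice(donor_pool)
        return patch + [Edit(action, location, donor)]

    def crossover(a: Patch, b: Patch) -> Patch:
        """One-point crossover: a prefix of one parent joined to a suffix of the other."""
        i, j = random.randint(0, len(a)), random.randint(0, len(b))
        return a[:i] + b[j:]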


Two years later, at ICSE 2014, Qi et al. [2014] showed that a form of random search—restricted to a single-edit search space—outperformed the genetic algorithm used by GenProg, as measured by the average Number of Test Case Evaluations required to find a repair (NTCE). Whilst these results raised questions regarding the effectiveness of using genetic algorithms for program repair, they could also be explained by the aggressive optimisations made by the random search, and by its restriction to a single-edit search space.

Finally, at ISSTA 2015, Qi et al. [2015] examined the patches generated by GenProg for the ICSE 2012 dataset, and found that 104 of its 110 plausible patches could be explained by a single functionality-deleting modification (e.g., deletion of an If statement). For the majority of bug scenarios within the ICSE 2012 dataset,2 the test harness only checked that the exit status of the program was zero. As a result, many of the repairs reported by GenProg overfit to the test suite by simply ensuring the program produces the correct exit status, instead of fixing the underlying bug: most patches either deleted parts of the program, or inserted premature exit(0) or return statements. After addressing issues in the test harnesses, and extending the original test suites, Qi et al. [2015] found that GenProg produced correct fixes for 2 out of 105 bugs; in both cases, these bugs could be patched through functionality deletion alone (i.e., removing a set of statements from the program).

Motivated by these problems, at ESEC-FSE 2015, Long and Rinard [2015] introduced an alternative search-based repair technique: Staged Program Repair (SPR). To avoid overfitting3 due to functionality deletion, and to address a greater number of bugs, SPR uses a richer set of repair actions (i.e., types of transformation that can be applied to the program), without an explicit deletion action. Although SPR does not permit explicit functionality deletion, Mechtaev et al. [2016] demonstrated that SPR still produces such patches implicitly.

In addition to introducing a new repair model, SPR uses a novel search technique to reduce the cost of finding repairs: value search. Value search allows SPR to determine whether a particular kind of transformation (e.g., modification of an if condition) can be used to fix the program; only once the feasibility of a repair has been determined does SPR attempt to find the values required to concretely fix the bug (e.g., a fixed if condition).

Although SPR was shown to outperform GenProg in terms of the number of correct repairs found within a 12-hour window, it is unclear how much of this is due to SPR's larger repair model, and to what extent value search improves efficiency. Unfortunately, value search has not been compared to any alternative search techniques—using the same repair model—such as random search. Whilst SPR may offer a more efficient way of generating single-edit patches, it is difficult to see how its value search algorithm could be extended to cover multiple edits, as Long and Rinard [2015] note.

2 The ICSE 2012 dataset would go on to form a sub-set of the bug scenarios within the ManyBugs [Le Goues et al., 2015] dataset.
3 In the context of automated program repair, the term "overfitting" is used to refer to candidate repairs which pass the test suite but do not constitute correct repairs (i.e., one could produce a test that rejects the candidate repair).


Since these discoveries, automated program repair techniques have been categorised into three distinct approaches: search-based repair techniques, such as GenProg and SPR, which generate candidate repairs according to a search algorithm, and evaluate them until a patch is found which passes the entire test suite; semantics-based techniques, such as Angelix [Mechtaev et al., 2016] and SearchRepair [Ke et al., 2015], which use symbolic evaluation to guide patch synthesis; and specification-based approaches, such as AutoFix-E [Wei et al., 2010], which use formal specifications to synthesise fixes, rather than using test suites as their oracle.

Whilst semantics-based and specification-based approaches have demonstrated promising results, especially in terms of patch quality, their real-world use is highly restricted. Specification-based approaches rely on the existence of formal specifications, which are rarely found in practice [Kossak et al., 2014]. Although semantics-based approaches do not require the existence of formal specifications, instead relying on the existence of a test suite, the limits of symbolic execution prevent their application to real-world programs, in the general case. Existing semantics-based techniques are unable to deal with code that involves side-effects (e.g., reading from or writing to a file), non-primitive data types (e.g., structs), or looping and recursion.

Search-based program repair is the only approach capable of fixing arbitrary bugs, without restrictions on the program, or the need for formal specifications. To address a greater number of bugs, search-based techniques will need to expand their repair models (i.e., the operations used to construct patches), improve their ability to localise faults, and scale to bugs requiring multiple edits. In this thesis, we investigate the challenges facing search-based program repair, propose and demonstrate effective ways to tackle them, and increase our understanding of the problem along the way.

1.1. Motivation

In this section, we outline the motivation behind this thesis, and provide justification for our focus on search-based program repair of C programs.

Automated Repair of C Programs

Amongst other languages, automated program repair has been successfully applied to programs written in C [Le Goues et al., 2012a; Long and Rinard, 2015, 2016; Mechtaev et al., 2016; Tan and Roychoudhury, 2015], Java [Kim et al., 2013; Le et al., 2016; Xuan et al., 2017], and Python [Ackling et al., 2011]. Accompanying each of these languages is a unique set of opportunities and challenges:


• Being a dynamic language without the benefits of static type checking, bugs in Python may involve the use of unreferenced variables, passing (incorrect) arguments of the wrong type, or unintended implicit conversions. At the same time, being an interpreted language, with the capacity for introspection and self-modification, Python offers new avenues for (automated) debugging and patch generation.
• Java, on the other hand, employs static type checking, allowing it to avoid this class of run-time bugs. Additionally, unlike C, built-in memory management (provided by the garbage collector) helps to prevent the majority of memory leaks and segmentation faults.4
• In contrast to Java and Python, C provides no (built-in) memory management and thus it suffers from its own class of memory leaks, buffer overflows, segmentation faults and pointer misuses. Furthermore, unlike Java and Python, which (by default) provide the programmer with detailed information about a program failure (in the form of a stack trace and exception messages), C programs tend to be almost devoid of useful information in such cases. Making automated repair more difficult still, C also lacks a standard framework for testing. Projects tend to either provide an arbitrary set of scripts (usually written in bash or Perl), or in rarer cases, to use their own bespoke testing framework (PHP).

C, therefore, represents a challenging testing ground for automated program repair, where knowledge of the bug is limited to its bare minimum. By demonstrating the effectiveness of automated program repair techniques in such a barren environment, one gains an increased confidence in their applicability to a wider range of languages. Furthermore, C remains one of the most popular programming languages in the world (ranked at #2 in the January 2017 TIOBE popularity index5, representing 9.35% of the market), and is widely used within production legacy code, where specifications and test suites tend to be woefully lacking, and the original developers are no longer around. Most importantly, by using C as the focus of this thesis, we are able to build upon the large body of existing techniques that target this language.

Search-Based Program Repair

Having presented our case for using the C programming language as the focus of this thesis, below we put forward three reasons for pursuing search-based program repair, rather than semantics- or specification-based approaches.

4 Note, garbage collection does not necessarily ensure Java programs are invulnerable to memory leaks (or, in this case, heap overflows). A common way of introducing such bugs stems from careless storage of objects when using the singleton object pool pattern [Bloch, 2008].
5 https://www.tiobe.com/tiobe-index/


1. Generality of Fixes: A number of techniques exist that are theoretically capable of identifying and correcting bugs belonging to a particular class, such as memory leaks and integer overflows [Coker and Hafiz, 2013; Gao et al., 2015]. In practice, however, these techniques are only effective under limited circumstances, and unlike search-based repair, they have no utility beyond the specific bug class for which they were designed.

Given its ability to perform arbitrary modifications to the source code, the search-based approach is (in theory) capable of solving any kind of bug, including those that cannot be cleanly delineated into specific bug classes. In contrast, many semantics-based techniques are limited to performing very specific code transformations, such as the replacement of an if-condition, or the RHS of an assignment expression.

Additionally, search-based approaches can be used to produce fixes spanning multiple lines and files, unlike the majority of alternatives, which tend to be limited to modification of a single statement (or expression), or the introduction of a contiguous block of statements (in the case of SearchRepair). Of the non-search-based approaches to automated repair, Angelix is the only technique capable of producing fixes spanning more than a single statement. Unlike search-based approaches, such as GenProg, Angelix suffers a number of limitations with respect to the types of programs to which it can be applied, discussed below.

2. Generality of Subjects: In addition to being free from limitations with respect to the types of fixes that search-based repair can address (since the technique involves modification of the source code), search-based repair is free from any limitations on the types of programs on which it may be used. In contrast, a number of semantics-based techniques, including NOPOL, Angelix, relifix and SearchRepair, are unable to handle programs involving loops and recursion, or code involving I/O operations (e.g., database and file accesses). Furthermore, techniques based on symbolic evaluation are only able to handle code containing a limited set of primitive types (excluding floating-point numbers and structs).

3. Minimal requirements: Unlike specification-based approaches, search-based techniques do not assume the existence of any kind of contracts or specifications, which are seldom found in real-world code. Ideally, all software would be provided with at least a partial specification of its behaviour, but in reality, the uptake of such specifications within the software industry has been highly limited. Assuming the existence of such artefacts highly restricts the domain of problems to which automated repair may be applied. By forgoing this need, the search-based approach becomes particularly well suited to legacy systems, where knowledge is particularly limited.

Similarly, most search-based techniques do not require the existence of an oracle, unlike techniques such as relifix [Tan and Roychoudhury, 2015], which uses previous (functioning) versions of the program to fix regressions. The only requirements for repair are a test suite capable of exposing the bug (i.e., one with at least a single failing test case, covering the faulty code) and the source code of the program.


In summary, search-based automated program repair poses the fewest limitations and makes the fewest assumptions about the bug, allowing it to be applied in the widest set of scenarios.

1.2. Challenges

Despite the potential of search-based repair, and promising early results produced by GenProg, recent studies have highlighted a number of major challenges facing the approach:

1. Patch Quality: Of the challenges facing search-based repair, and the wider field in general, the problem of plausible, but incorrect, patches (i.e., those that pass the test suite as a result of overfitting) is perhaps the most controversial. Recent studies [Qi et al., 2015; Smith et al., 2015] have demonstrated that the majority of fixes reported for GenProg, the most popular search-based approach, were the result of overfitting to weak test suites (i.e., ones which fail to provide sufficient coverage or to thoroughly check the output of the program). On further inspection, Qi et al. [2015] found that almost all patches produced by GenProg were destructive (i.e., deletion or replacement of code).

Whilst most of the discussion of patch quality has focused on the shortcomings of the GenProg family of search-based approaches (i.e., GenProg, RSRepair, AE), this problem is by no means exclusive to them. The same phenomenon may also be observed in semantics-based approaches, such as SPR and Prophet.

These results also cast uncertainty on the findings of previous studies within this sub-field of program repair. In particular, it is no longer clear how effective the repair model used by these approaches is, nor is it clear that improvements and parameter tweaks of the search algorithms used by them are effective in producing correct repairs. To move forward with search-based repair, we must first take a step back to better understand its current abilities and limitations.

2. Effectiveness: The findings of the aforementioned studies on patch quality also raise questions regarding the actual effectiveness of the search algorithm and repair model used by GenProg. Assuming the existence of a more robust test suite, less susceptible to overfitting, it is no longer clear to what extent these components help GenProg to find correct repairs. For automated program repair to see wider acceptance within the software engineering community, techniques must demonstrate the ability to solve a significant number of bugs across a diversity of programs.


To be truly convincing, automated repair requires the ability to perform arbitrary code transformations—a feat that only search-based repair is capable of.

3. Efficiency: To become economically viable, repair techniques need to be able to discover repairs within a reasonable amount of time. Ideally, the time required to find a patch should be measured in hours, or even minutes, rather than days. This problem can be side-stepped through the use of cloud computing resources, which allow the repair process to be split across a large number of compute nodes for a short period of time. In this case, the repair process needs to be cost-effective.

By improving the efficiency of the search, the likelihood of encountering a patch within a fixed resource window is increased, leading to simultaneous gains in effectiveness. Moreover, increases in efficiency will allow larger, more complex search spaces to be explored, allowing the technique to solve a greater number of bugs.

4. Scalability: Future program repair techniques will require the ability to produce multiple-edit patches for bugs in arbitrary programs. Due to the limits of symbolic execution associated with semantics-based repair, search-based techniques are the only foreseeable means of tackling this challenge. Despite this, the majority of recent work in search-based program repair has focused on increasing the ability to discover single-edit patches [Long and Rinard, 2015, 2016; Qi et al., 2014; Weimer et al., 2013].

Genetic algorithms, such as those used by GenProg, PAR, and HDRepair, are the only proposed search technique capable of producing multi-edit patches, in theory. In practice, however, almost all patches generated by those tools can be reduced to a single edit; there is little evidence that this approach is effective at discovering multi-edit patches. With recent results showing that a form of random search outperformed the genetic algorithm used by GenProg, there is a clear need to establish whether GAs are a suitable algorithm for repair, and if not, what alternatives exist.

1.3. Research Questions

Motivated by the challenges facing search-based repair, this thesis seeks to increase our understanding of each of the components of the search, and to explore ways in which those components might be improved.

Motivated by recent results in the use of mutation analysis for fault localisation [Moon et al., 2014a; Papadakis and Le Traon, 2015], we ask the following question:


• RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

To understand how the repair model used by program repair techniques can be improved to allow a greater number of bugs to be repaired, we ask the following questions:

• RQ2: Is plastic surgery equally effective for all repair actions?
• RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

Given that GenProg is the only search-based repair technique capable of producing multiple-edit patches, we ask the following questions:

• RQ4: Do biases within GenProg's operators degrade its performance?
• RQ5: Does fitness help to guide GenProg's search algorithm towards solutions?

• RQ6: Is there a more effective search algorithm for generating multiple-edit patches than the genetic algorithm used by GenProg?

To answer these questions, and to avoid the pitfalls of previous research, we propose and use a novel methodology. Our methodology allows researchers to quickly and easily reproduce the results of our experiments. It also provides a platform for others to conduct high-quality program repair research.

1.4. Contributions

Below, we present a summary of the main research contributions made by this thesis.

• We identify a number of weaknesses and compromises in the current approach to conducting APR experiments. Motivated by those problems, we present a better methodology for conducting APR experiments, and a suite of tools for implementing that methodology.
• As part of our theoretical and empirical analyses of GenProg's search algorithm, we identify a number of phenomena and unintended biases, which hamper the efficiency and effectiveness of the search.
• We investigate the role of fitness within the search for repairs, and find that the search is more effective when restricted to consideration of non-destructive patches (i.e., patches that do not fail any of the previously passing tests). Further gains in performance are attained when the number of previously failing tests that are passed by a candidate patch is used to perform selection between non-destructive patches (i.e., those which do not fail any previously passing tests).


• We propose an alternative search algorithm for finding multiple-edit repairs, based on a greedy algorithm. We demonstrate that our algorithm outperforms GenProg's genetic algorithm in terms of effectiveness, efficiency, and reliability.
• We explore the feasibility of using test suite evaluations for candidate patches, gathered over the course of the search, to increase the accuracy of the fault localisation, online. In contrast to earlier studies of mutation-based fault localisation, conducted on artificial bugs in small programs, our results demonstrate that this information leads to little to no improvement for real-world bugs. We speculate on why these results might be the case, and outline directions for future research.
• Building on previous work concerning the effectiveness of plastic surgery, we investigate its effectiveness at the level of actual repair actions. We report the observed frequencies and graftabilities of a number of new and existing repair actions. As part of this study, we also provide a more formal definition of each of these repair actions.
• We find that repair actions that operate at finer levels of granularity (i.e., expressions) are more amenable to plastic surgery than repair actions with coarser levels of granularity (e.g., blocks, and to a lesser extent, statements). We also find that removing labels from donor code substantially increases the chances of successfully grafting a repair.
• We identify an unintended behaviour within GenProg's implementation, stemming from its use of CIL [Necula et al., 2002], which may severely impact GenProg's performance and prevent it from fixing certain otherwise fixable bugs.

In addition to these research contributions, we have developed a number of tools over the course of this thesis:

• RepairBox: a high-performance platform for conducting reproducible empirical studies of program repair, built on top of Docker. This tool was used to conduct all of the relevant experiments within this thesis.

• Pythia: a tool for sandboxing test executions and automatically increasing the quality of existing test suites. Pythia was used in combination with RepairBox to reduce the risk of overfitting during our experiments.

• BugHunter: a tool for automatically mining bug fixes from Git repositories. In addition to collecting (likely) bug-fixing commits, BugHunter also extracts repair action instances from those commits (e.g., a statement was inserted by the fix), and allows a variety of analyses to be conducted on those instances.

• GenProg: we performed an extensive refactoring of the GenProg source code.


New features include asynchronous test suite evaluation, JSON configuration files, and a component-based architecture, which allows GenProg to be easily extended with new operators, algorithms and fault localisation approaches.

1.5. Document Structure

The rest of this thesis is structured as follows:

• Chapter 2 – Background: provides the reader with the necessary background in automated program repair to understand the rest of the thesis.
• Chapter 3 – Tools and Techniques: discusses a number of problems in the current approach to conducting APR experiments, before presenting a set of tools to address them. These tools are used throughout the thesis, to ensure reproducibility of results.
• Chapter 4 – Fault Localisation: investigates the feasibility of using the results of candidate patch evaluations, gathered over the course of the search, to increase the accuracy of the fault localisation.
• Chapter 5 – Repair Model: measures the frequency and graftability of a set of new and existing repair actions, to improve our understanding of effective repair models, by examining bugs mined from version control histories.
• Chapter 6 – Search: conducts a theoretical and empirical analysis of GenProg's search algorithm, before presenting a new search algorithm, based on greedy search.
• Chapter 7 – Conclusion: summarises the findings of this thesis, and outlines several directions for future work.

CHAPTER 2

Background

In this chapter, we provide the reader with the necessary background to understand the rest of this thesis. We first provide an overview of the general concepts of program repair, before reviewing a number of the most popular search-based, semantics-based and specification-based repair techniques. Finally, we discuss a number of loosely related techniques covering the space of code transformation and repair more broadly.

2.1. Automated Program Repair

Provided the source code for a program, and an oracle capable of exposing its bugs, APR attempts to find a variant of the program that satisfies the oracle; assuming a sufficiently strong oracle, a repair to those bugs is the output of the repair process. Although a few repair techniques use formal specifications, the majority make use of existing test suites as their oracles instead: failing tests within these suites are used to expose bugs, whilst passing tests are used to prevent the destruction of existing, correct behaviours.

Repair approaches can be split into three categories: search-based, semantics-based, and specification-based. For the most part, each of these approaches shares the same high-level approach to locating and repairing bugs, outlined in Figure 2.1. In the following sections of this chapter, we discuss each of these approaches. For the rest of this section, we explore each of the high-level components in the repair process. To aid this discussion, we look at a fault in a small program, given in Figure 2.2, inspired by the real-world Zune leap year bug [Coldewey, 2008].

Input Program & Donor Pool

Before the repair process begins, each statement within the source code of the program (or the sub-set of files suspected to contain the bug) is identified and uniquely numbered with a statement identifier, or SID, as illustrated in Figure 2.3. Statements are assigned an integer, starting at one, based on the order in which they are encountered when walking through the collection of abstract syntax trees for the program (although the exact SID for a statement makes no difference, only that the SID is guaranteed to be unique). These SIDs are used to identify which areas of the program were executed by the passing and failing test cases during the fault localisation stage of the search.
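As an illustration of this numbering step (not the CIL-based implementation used by GenProg), the following Python sketch uses pycparser, an off-the-shelf C parser, to walk an abstract syntax tree in preorder and assign a SID to each statement encountered inside a function body; the particular set of node types treated as statements here is a simplifying assumption.

    from pycparser import c_parser, c_ast

    # An illustrative subset of node types treated as statements.
    STMT_TYPES = (c_ast.Decl, c_ast.Assignment, c_ast.If, c_ast.While, c_ast.Return)

    def assign_sids(source: str) -> dict:
        """Walk the AST in preorder, numbering statements from one."""
        ast = c_parser.CParser().parse(source)
        sids, counter = {}, 0

        def walk(node, in_body=False):
            nonlocal counter
            if in_body and isinstance(node, STMT_TYPES):
                counter += 1
                sids[counter] = node
            for _, child in node.children():
                walk(child, in_body or isinstance(node, c_ast.Compound))

        walk(ast)
        return sids

    example = ("int f(int days) { int year = 1980;"
               " while (days > 365) { days -= 365; year += 1; } return year; }")
    for sid, node in assign_sids(example).items():
        print(sid, type(node).__name__)   # 1 Decl, 2 While, 3 Assignment, ...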


Figure 2.1: The general automated repair process accepts the source code for a faulty program, together with a test suite, containing failed test cases, exposing the faults within the program. From these, the possible locations of the faults are determined, and for some repair approaches, a pool of donor code is generated from the input program. Using the contents of the donor pool, together with a number of basic repair actions, the search generates and evaluates candidate patches, until one is found that passes all tests within the suite.

Whilst this process of annotation occurs, some repair techniques, such as GenProg [Le Goues et al., 2012a] and SPR [Long and Rinard, 2015], extract each of the identified statements and place them into a donor code pool, ready to provide the necessary code to produce repair candidates during the search. Other techniques, such as SearchRepair [Ke et al., 2015], use a pre-computed donor code pool, built from foreign source code, rather than the source code of the program under repair. Others avoid the need for a donor pool entirely, synthesising fragments of code from scratch [Long and Rinard, 2015; Mechtaev et al., 2016; Xuan et al., 2017].

Test Suite

With the exception of techniques where an oracle is provided, either in the form of a (partial) formal specification [Wei et al., 2010], or an alternative version of the program [Tan and Roychoudhury, 2015], a patch is assumed to be correct if it passes all tests within the test suite. For the purposes of conducting research, the test suite is divided into a set of positive tests, representing the passing tests (and the behaviours that should be preserved during repair), and a set of negative tests, representing the failing tests. The need for this distinction is to ensure the same tests pass or fail when an experiment is repeated.


1   year = 1980;
2   days = atoi(argv[1]);
3   while (days > 365) {
4     if (isLeapYear(year)) {
5       if (days > 366) {
6         days -= 366;
7         year += 1;
8       }
9     } else {
10      days -= 365;
11      year += 1;
12    }
13  }
14  return year;

Figure 2.2: An example bug scenario, adapted from the real-world Zune leap year bug [Coldewey, 2008]. The program accepts a date, given as the number of days since January 1st 1980, and should determine the year to which that date belongs. In the event that year is a leap year and days ever becomes 366, the program will enter an infinite loop.

Fault Localisation

To reduce the scale of the search landscape and to bring automated repair into the realms of feasibility, the fault localisation stage of the repair process is used to select a (relatively) small number of suspicious locations within the program as candidates for modification. For most existing automated repair techniques, these locations correspond to statements within the program (rather than lines, expressions, or any other logical grouping of code).

To determine the set of faulty statements, and thus the candidates for modification, the first step of the fault localisation process is to compute coverage information for the program. This coverage information describes the statements that are executed by the program for each test within the suite. To generate coverage information for the program, its files (or a sub-set of them, believed to contain the fault) are subject to a process of source code instrumentation. Each statement of the program is wrapped by a coverage statement, responsible for monitoring and logging its execution. An example of the output of the coverage generation process is given in Figure 2.5.

From this coverage information, the next step is to aggregate the data into a fault spectrum. The fault spectrum concisely describes which statements were executed (or covered) by each test,1 and whether the result of that test was failure or success. An example of such a spectrum is given in Figure 2.6.

1 Optionally, rather than recording whether a statement was executed, each cell in the spectrum may be used to specify how many times a given statement was executed. Whilst this information may be useful in dealing with infinite loops, the need to log individual executions increases the burden on the instrumented program, causing it to run considerably slower. This slowdown is made worse when the user wishes to generate an execution path for each test, describing the order in which statements were executed. For most real-world programs, the disk requirements of this path file, potentially of the order of GBs per test, make such an approach infeasible.


#   Statement
1   year = 1980;
2   days = atoi(argv[1]);
3   while (days > 365) ...
4   if (isLeapYear(year)) ...
5   if (days > 366) ...
6   days -= 366;
7   year += 1;
8   else {}
9   else ...
10  days -= 365;
11  year += 1;
12  return year;

Figure 2.3: Before the repair process can begin, each of the statements within the program is identified and annotated with a unique SID.

Finally, using a lightweight form of fault localisation known as spectrum-based fault localisation, the fault spectrum is transformed into a set of suspiciousness values, encoding a measure of the belief that a given statement is responsible for producing the bug. This process of transformation is performed using a suspiciousness metric, $\mu(e_p, e_f, n_p, n_f)$, which takes a count of the number of times a given statement was (and was not) executed by passing and failing tests, respectively. For each statement, the suspiciousness metric transforms those counters into a real number, or suspiciousness value, where higher values indicate a greater degree of belief that the statement is (partly) responsible for producing the bug.

For the majority of automated repair techniques, a relatively simple suspiciousness metric is employed. In the case of GenProg, its metric assigns a suspiciousness of 1.0 to statements executed exclusively by failing tests, 0.1 to statements executed by both passing and failing tests, and 0.0 to statements that are not covered by a failing test (thus eliminating them from consideration by the repair). More details on the fault localisation process can be found in Chapter 4.
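As a concrete illustration of this metric (a minimal Python sketch, not GenProg's OCaml implementation), the function below reproduces the three-valued scheme just described, applied to a few rows of the fault spectrum in Figure 2.6; the function name and the dictionary layout are assumptions made here.

    def genprog_suspiciousness(e_p: int, e_f: int, n_p: int, n_f: int) -> float:
        """GenProg's metric as described above: 1.0 for statements covered only by
        failing tests, 0.1 for statements covered by both passing and failing tests,
        and 0.0 for statements not covered by any failing test."""
        if e_f == 0:
            return 0.0
        return 1.0 if e_p == 0 else 0.1

    # A few rows of the fault spectrum from Figure 2.6: SID -> (e_p, e_f, n_p, n_f).
    spectrum = {1: (8, 4, 0, 0),     # year = 1980;
                8: (0, 4, 0, 0),     # empty else branch of the inner if
                12: (8, 0, 0, 4)}    # return year;
    for sid, counts in spectrum.items():
        print(sid, genprog_suspiciousness(*counts))   # -> 0.1, 1.0 and 0.0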

Repair Candidates

Once the fault localisation and the donor code pool have been computed, the two are combined according to a repair model to generate the space of possible repairs.


Test   Input   Expected   Passed
P1     -366    1980       ✓
P2     -100    1980       ✓
P3     0       1980       ✓
P4     365     1980       ✓
P5     367     1981       ✓
P6     1000    1982       ✓
P7     1826    1984       ✓
P8     2000    1985       ✓
N1     10593   2008       ✗
N2     12054   2012       ✗
N3     1827    1984       ✗
N4     366     1980       ✗

Figure 2.4: A test suite for our running example, described by inputs and expected outputs for the program. Failing tests are assigned labels starting with the letter "N", indicating that they are negative test cases. Conversely, passing tests are assigned a label starting with the letter "P", indicating that they are positive test cases.

The repair model of a technique is composed of a series of repair actions, each of which abstractly describes the kinds of changes (or mutations) that can be made to the program. For instance, the repair model of GenProg contains repair actions that allow the insertion, replacement, and deletion of statements. A more detailed definition of repair models is provided in Chapter 5. An upper bound on the size of the search space S for a given problem and (search-based) repair technique is:

\[
S \leq \sum_{i=1}^{l} \bigl[ F \cdot (D \cdot R) \bigr]^{i} \tag{2.1}
\]

where F is the number of suspicious statements, D is the number of donor code fragments, R is the number of repair actions, and l is the maximum number of edits that may be contained within the patch.
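For instance, a direct computation of this bound for the running example might look as follows (a Python sketch); the concrete numbers—eleven suspicious statements, twelve donor statements, three repair actions, and patches of at most two edits—are illustrative values drawn from Figures 2.3 and 2.6 rather than the output of any tool.

    def search_space_bound(F: int, D: int, R: int, l: int) -> int:
        """Upper bound of Equation 2.1: the sum over patch lengths i = 1..l of
        (F * D * R) ** i, i.e. every combination of faulty location, donor
        fragment and repair action at every edit position."""
        return sum((F * D * R) ** i for i in range(1, l + 1))

    # Zune example: 11 statements with non-zero suspiciousness, a donor pool of
    # all 12 statements, 3 repair actions, and patches of up to 2 edits.
    print(search_space_bound(F=11, D=12, R=3, l=2))   # 396 + 396**2 = 157,212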

2.2. Search-Based Repair

Search-based repair techniques employ meta-heuristic search algorithms to explore the space of possible repairs. Some of these techniques (e.g., GenProg and PAR [Kim et al., 2013]) attempt to discover partial solutions during the search by treating repair as an optimisation problem. Others, such as RSRepair [Qi et al., 2014] and AE [Weimer et al., 2013], treat repair as a decision problem, opting for higher efficiencies in the evaluation of candidate patches over the identification and exploitation of partial solution information.


#   Statement                    P1   P6   N1   N4
1   year = 1980;                 •    •    •    •
2   days = atoi(argv[1]);        •    •    •    •
3   while (days > 365) ...       •    •    •    •
4   if (isLeapYear(year)) ...         •    •    •
5   if (days > 366) ...               •    •    •
6   days -= 366;                      •    •
7   year += 1;                        •    •
8   else {}                                •    •
9   else ...                          •    •
10  days -= 365;                      •    •
11  year += 1;                        •    •
12  return year;                 •    •

Figure 2.5: An example of the output produced by the coverage generation process. The columns on the right show which statements are covered by a particular test. For the sake of brevity, we omit coverage for the majority of the test suite.

In all cases, candidate repairs are generated by the search, and subject to evaluation. This process of evaluation involves compiling the resulting program, and determining the outcomes of (a sub-set of) its test suite. The processes of generation and evaluation continue until either an acceptable repair is found (i.e., one which passes all of the tests) or a resource limit is reached.

For the rest of this section, we review the majority of search-based repair techniques.
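Before turning to individual techniques, the following minimal Python sketch captures this shared generate-and-validate loop in its simplest, single-candidate form (closer to a random search than to GenProg's population-based algorithm); propose and run_tests are placeholders standing in for a technique's search operators and its test-suite evaluation.

    def generate_and_validate(empty_patch, propose, run_tests, budget: int):
        """Generate candidate patches and validate them against the test suite,
        stopping at the first plausible repair or when the evaluation budget
        is exhausted."""
        for _ in range(budget):
            candidate = propose(empty_patch)       # generate a candidate patch
            passed, total = run_tests(candidate)   # compile and run (a subset of) the tests
            if passed == total:                    # plausible repair: passes every test
                return candidate
        return None                                # resource limit reached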

Co-Evolutionary Program Repair

Arcuri and Yao [2008] introduce the first search-based program repair technique, a year prior to the publication of GenProg. Like GenProg, their technique uses evolutionary computation, but whereas GenProg assesses the fitness of an individual based on a fixed number of tests, their approach uses competitive co-evolution to evolve a more robust set of tests. To automatically generate test cases—and thereby facilitate co-evolution—users are required to provide a formal specification for the program.

Unlike all other search-based techniques, which operate on the abstract syntax trees of a real-world program, whether directly, through representations such as the AST/WP, or indirectly, through variants of the Patch representation, Arcuri and Yao [2008]'s approach operates on a bespoke Strongly-Typed Genetic Programming (STGP) representation [Montana, 1995]. As with traditional applications of GP, programs are operated upon directly, and mutations are synthesised from a set of primitives.


 #  Statement                  ep  ef  np  nf    µ
 1  year = 1980;                8   4   0   0  0.1
 2  days = atoi(argv[1]);       8   4   0   0  0.1
 3  while (days > 365) ...      8   4   0   0  0.1
 4  if (isLeapYear(year)) ...   4   4   4   0  0.1
 5  if (days > 366) ...         4   3   4   1  0.1
 6  days -= 366;                4   3   4   1  0.1
 7  year += 1;                  4   3   4   1  0.1
 8  else {}                     0   4   0   0  1.0
 9  else ...                    3   3   5   1  0.1
10  days -= 365;                3   3   5   1  0.1
11  year += 1;                  3   3   5   1  0.1
12  return year;                8   0   0   4  0.0

Figure 2.6: An example of a fault spectrum for the Zune bug, together with the suspiciousness values for each statement, as computed by GenProg’s suspiciousness metric.
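GenProg’s weighting scheme is commonly described as giving full weight to statements executed only by failing tests, a small constant to statements executed by both failing and passing tests, and no weight to the rest; the values in Figure 2.6 are consistent with that reading. The sketch below reproduces them under that assumption (the 0.1 constant is taken from the table, not from GenProg’s source).

    def genprog_suspiciousness(ep, ef, weight_shared=0.1):
        """Suspiciousness of a statement from its coverage spectrum.

        ep: number of passing (positive) tests that execute the statement
        ef: number of failing (negative) tests that execute the statement
        Statements executed only by failing tests are maximally suspicious;
        statements never executed by a failing test are ignored.
        """
        if ef == 0:
            return 0.0
        if ep == 0:
            return 1.0
        return weight_shared

    # Rows 1, 8, and 12 of Figure 2.6:
    print(genprog_suspiciousness(ep=8, ef=4))  # 0.1
    print(genprog_suspiciousness(ep=0, ef=4))  # 1.0
    print(genprog_suspiciousness(ep=8, ef=0))  # 0.0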

As with traditional applications of GP, programs are operated upon directly and mutations are synthesised from a set of primitives. Given the relative proximity of the buggy program to its (correctly) repaired form—compared to the large distance between the origin and a solution when attempting to synthesise a program de novo—Arcuri and Yao [2008] forgo the need for crossover. Although this representation is Turing-complete and incorporates a number of high-level operations similar to those in C and Java, it significantly lacks the expressiveness of those languages. In the general case, programs written in this dialect of STGP may be easily transformed into C or Java. Transforming C or Java programs into this STGP dialect is less trivial, however. Even if C or Java programs could be transformed into this dialect of STGP, representing entire programs—or even single files—would require vast amounts of memory and incur a significant performance overhead, as observed in earlier forms of GenProg using the AST/WP representation [Le Goues et al., 2012a]. Arcuri and Yao [2008]’s work is therefore a powerful proof of concept of search-based program repair, rather than its first practical application.

Rather than solely measuring fitness as a function of the number of passing and failing tests, Arcuri and Yao [2008] use a richer fitness function, defined in Equation 2.2:

$$f(g) = \frac{N(g)}{N(g) + 1} + \frac{E(g)}{E(g) + 1} + \sum_{t \in T_i} d_P(t, g(t)) \qquad (2.2)$$

where g is a candidate solution, N(g) is the number of nodes in its tree, E(g) is the number of exceptions encountered during its execution, and d_P(t, g(t)) is a measure of the distance between the expected and observed result for the execution of test t.

Unlike subsequent approaches, the number of nodes for a given individual N(g) is directly incorporated into the fitness function, acting as a form of bloat control; a selective advantage is awarded to programs that are shorter than others. If the size of the program is less than or equal to δN(buggy), where N(buggy) is the number of nodes in the original program, then this bloat term is replaced by the constant 1. The fitness function also incorporates a term based on the number of exceptions produced by a given individual, E(g). Sharing a similar intuition to later work on the use of failure-obliviousness for evolving Ruby programs [Timperley, 2013; Timperley and Stepney, 2014], fallback semantics are provided for unsafe operations. Specifically, both division by zero and out-of-bounds array accesses are set to return zero instead. In theory, and as demonstrated by [Timperley and Stepney, 2014], allowing execution to persist in the face of such exceptions creates a smoother fitness landscape, which may allow certain classes of multiple-line errors to be solved more easily. Finally, Arcuri and Yao [2008] harness the results of the test suite by summing the distance between the intended and observed outcomes d_P(t, g(t)) for each test t. Whereas later work treats the outcome of a test case as a binary variable (i.e., pass/fail), their distance metric has more in common with previous work in genetic programming, in that it measures the distance between the observed and intended return value of the function under repair.

• For tests where the function returns the intended outcome, d_P = 0.

• In cases where the function returns an unexpected outcome, d_P > 0. The measurement of the magnitude of the difference between the two outcomes is performed by a necessarily arbitrary distance function, as is the case with traditional genetic programming. This distance function is automatically generated from a formal specification for the program.

To avoid solutions overfitting to a fixed test suite, the tests used for evaluation are competitively co-evolved in a separate population (of size 32). For the particular program studied in the paper—an implementation of bubble-sort—each test within the population is represented by a variable-length list of integers, representing an input list to the sorting algorithm. For every 5 generations that the solution population is evolved, the test population is subject to 1024 generations of evolution against the best program in the current population. The fitness of a given test t is measured using the fitness function described in Equation 2.3:

$$f(t) = \sum_{g \in G_i} d_P(t, g(t)) \qquad (2.3)$$

where g ∈ G_i is an individual from the i-th generation, and d_P(t, g(t)) is the distance between the expected and observed outcome for this individual’s test execution.

At the end of each of these 1024 generations, the best test case within the population is added to a Hall of Fame [Rosin and Belew, 1997], which then becomes the test suite for evaluating candidate patches.
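A minimal sketch of these two fitness functions is given below, assuming the node count, exception count, and per-test distances for a candidate have already been computed; the threshold multiplier delta and the handling of the bloat term follow the description above and are otherwise assumptions.

    def program_fitness(n_nodes, n_exceptions, distances, n_buggy, delta=1.5):
        """Equation 2.2 for a single candidate.

        n_nodes and n_exceptions are N(g) and E(g); distances holds
        d_P(t, g(t)) for every test t in the current suite; n_buggy is the
        node count of the original, faulty program. Following the text, the
        bloat term becomes the constant 1 once the candidate is no larger
        than delta times the original program.
        """
        bloat = 1.0 if n_nodes <= delta * n_buggy else n_nodes / (n_nodes + 1)
        exceptions = n_exceptions / (n_exceptions + 1)
        return bloat + exceptions + sum(distances)

    def test_fitness(distances_against_population):
        """Equation 2.3: a test is scored by the total distance it induces
        across the current population of candidate programs."""
        return sum(distances_against_population)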

In summary, Arcuri and Yao [2008]’s approach introduces the breakthrough idea of using genetic programming to evolve patches, rather than entire programs—an idea that underpins the seminal work in the field. Many of the ideas introduced in their work, such as bloat control and the use of a richer fitness function, have proved to be ahead of their time. Unlike the other approaches reviewed in this section, all of which are usable repair tools, Arcuri and Yao [2008]’s work is instead a proof of concept.

GenProg

GenProg can be viewed as both the seminal approach to the automated repair of real-world programs and the first viable search-based repair technique. At its core, GenProg uses a form of genetic algorithm to search for its repairs. Like all other genetic algorithms, GenProg’s search process can be seen as a continual cycle of generation and evaluation, also referred to as validation within the automated repair literature [Le Goues et al., 2012a; Long and Rinard, 2015].

The generation stage is responsible for generating a population of potential solutions, based on (implicit) knowledge gained about previously encountered solutions through evaluation of their fitness. Solutions deemed to be closer to a repair, where distance is measured by a weighted sum of the numbers of positive and negative test cases passed by the candidate, are more likely to be used to inform the generation of future candidates. This assumes that fitness is correlated with edit distance, and that solutions which pass a higher proportion of the test suite are more likely to be close to a repair than solutions that pass fewer tests.
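A sketch of such a weighted-sum fitness function follows. The particular weights are illustrative assumptions rather than values taken from GenProg’s implementation; negative tests are usually weighted more heavily, since passing them is what distinguishes a candidate repair from the original, faulty program.

    def weighted_fitness(passed_pos, passed_neg, w_pos=1.0, w_neg=10.0):
        """Weighted sum over test outcomes: passed_pos and passed_neg are the
        numbers of positive and negative tests the candidate passes."""
        return w_pos * passed_pos + w_neg * passed_neg

    # A candidate that keeps all 8 positive tests passing and fixes 1 of the
    # 4 negative tests outscores one that merely preserves existing behaviour.
    print(weighted_fitness(8, 1))  # 18.0
    print(weighted_fitness(8, 0))  # 8.0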

From a high-level perspective, GenProg shares the same approach to generation as all other genetic algorithms. Following a stage of initialisation, wherein a population of single-edit patches is randomly sampled, the process of generation is sub-divided into sub-processes of selection, crossover, and mutation. The resulting search algorithm is given in Algorithm 1, together with a description of each of its stages.

Algorithm 1: GenProg’s search algorithm

    population ← initialise();
    while solution not found and resources not exhausted do
        parents ← select(population);
        population ← crossover(parents);
        population ← mutation(population);
        evaluate(population);
    end


Individual Representation

To represent a candidate repair, each individual within the population is encoded as a sequence of statement-level modifications to the abstract syntax tree of the faulty program. To generate the corresponding repair, each of these modifications, or edits, is applied in sequence. Each edit within this sequence, or patch, applies a particular transformation to the statement associated with a given SID. The particular repair model used by GenProg allows for three different kinds of statement transformation, described below (a small sketch of this patch representation follows the list):

• Deletion: The edit DELETE x permanently removes the statement with SID x from the program, together with any child statements that may be attached.

• Append: The edit APPEND x y copies the statement with SID y from the original program, and appends it immediately after the statement with SID x.

• Replacement: The edit REPLACE x y copies the statement with SID y from the original program, then permanently replaces the statement at SID x with the copied statement.
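The sketch below illustrates the Patch representation as a sequence of edits applied to a statement-indexed program. It is a deliberate simplification: GenProg itself operates over CIL abstract syntax trees, whereas here the program is treated as a flat list of statements, and deletion does not model attached child statements.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Edit:
        kind: str        # "DELETE", "APPEND", or "REPLACE"
        target: int      # SID x: the statement being edited
        donor: int = -1  # SID y: the statement being copied (unused by DELETE)

    def apply_patch(statements: List[str], patch: List[Edit]) -> List[str]:
        """Apply each edit in sequence; SIDs always refer to the original program."""
        # Each original SID maps to the list of statements now standing in its place.
        slots = {sid: [stmt] for sid, stmt in enumerate(statements)}
        for e in patch:
            if e.kind == "DELETE":
                slots[e.target] = []
            elif e.kind == "APPEND":
                slots[e.target].append(statements[e.donor])
            elif e.kind == "REPLACE":
                slots[e.target] = [statements[e.donor]]
        return [s for sid in sorted(slots) for s in slots[sid]]

    # DELETE 5 followed by APPEND 3 7: remove statement 5, then copy statement 7
    # so that it appears immediately after statement 3.
    program = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
    print(apply_patch(program, [Edit("DELETE", 5), Edit("APPEND", 3, 7)]))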

Selection

The selection phase within GenProg selects which candidate repairs from the population of the previous generation should be chosen as parents, whose edits will be used to construct the population for the next generation.

By default, this process of selection is performed using tournament selection. Tournament selection chooses n individuals as parents through a series of n tournaments. For each tournament, k individuals are chosen from the population to be participants, at random with replacement. The best individual within each tournament is granted entry into the next population.

The setting of k (a.k.a. tournament size) can be used to control selection pressure. Low values of k exhibit little selection pressure, allowing more subpar solutions to be accepted into the population. Conversely, high values of k increase the selection pressure, driving selection towards only the fittest solutions. By default, GenProg uses a minimal k = 2 (k = 1 would be equivalent to uniform random selection), preferring exploration of the search space over exploitation of its fittest solutions.
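A generic sketch of tournament selection, consistent with the description above, is given below; the fitness callable is an assumption standing in for GenProg’s test-based fitness evaluation.

    import random

    def tournament_select(population, fitness, n, k=2):
        """Select n parents via n independent tournaments of size k.

        Participants are drawn uniformly at random with replacement, and the
        fittest participant of each tournament becomes a parent. k controls
        selection pressure: k = 1 degenerates to uniform random selection,
        while larger k drives selection towards the fittest individuals.
        """
        parents = []
        for _ in range(n):
            contenders = [random.choice(population) for _ in range(k)]
            parents.append(max(contenders, key=fitness))
        return parents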

Crossover

Crossover within GenProg groups the selected parents into pairs and subjects each pair to a crossover operation with probability P_c, also known as the crossover rate. In the event of an application of the crossover operator, a pair of individuals is generated from the edits of the two selected parents and added to the offspring.


Figure 2.7: An illustration of GenProg’s one-point crossover operator. Two parents A and B are accepted as input, and each is split into two parts at a random point. The first part of A is combined with the second part of B to form a child C. Similarly, the first part of B is combined with the second part of A to form a child D.

After iterating across all pairs of parents, a new proto-population is formed from the union of the multi-set of parents and the multi-set of offspring. Unlike most genetic algorithms [Eiben and Smith, 2015], where parents are replaced by their offspring in the event of a crossover, GenProg allows parents to be retained. This behaviour is similar to a (µ + λ) evolutionary strategy [Eiben and Smith, 2015], except that parents may also be subject to further mutation.

By default, GenProg uses a variant of variable-length one-point crossover, wherein the lists of edits for each parent are split into two at a random point and then joined together, as illustrated in Figure 2.7. Alternatively, a patch sub-set operator may be used, which has previously been shown to improve performance [Le Goues et al., 2012c].
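The following is a minimal sketch of that variable-length one-point crossover over two edit lists; the choice of an independent cut point for each parent is an assumption about details the text leaves open.

    import random

    def one_point_crossover(parent_a, parent_b):
        """Cut each parent's edit list at a random point and swap tails.

        The head of one parent is joined to the tail of the other, yielding
        the two children labelled C and D in Figure 2.7; the parents
        themselves are left unchanged, mirroring GenProg's retention of
        parents in the proto-population.
        """
        cut_a = random.randint(0, len(parent_a))
        cut_b = random.randint(0, len(parent_b))
        child_c = parent_a[:cut_a] + parent_b[cut_b:]
        child_d = parent_b[:cut_b] + parent_a[cut_a:]
        return child_c, child_d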

Mutation

Following crossover, each individual within the proto-population is subject to mutation with probability P_m, also known as the mutation rate, to form the population. The mutation operator within GenProg accepts a single patch as its input, and returns a new patch, formed by appending a newly sampled edit to the input patch.

To generate the new edit, GenProg first selects a statement as the subject of the edit, using the distribution defined by the fault localisation information. After a statement has been selected as the site of the edit, GenProg decides whether the edit should be a replacement, insertion, or deletion. By default, each edit type has an equal probability of being selected, although their probabilities may be adjusted to improve performance [Le Goues et al., 2012c]. Once an edit location and type have been chosen, GenProg produces the set of all possible edits of that particular type at the given location, and selects one of them at random as the chosen edit.
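A sketch of this mutation operator is given below. Edits are represented here simply as (kind, target, donor) tuples, and the donor pool and suspiciousness mapping are assumed inputs; GenProg itself enumerates concrete edits over the CIL AST rather than sampling donors directly as done here.

    import random

    def mutate(patch, suspiciousness, donor_sids,
               edit_types=("DELETE", "APPEND", "REPLACE")):
        """Append one newly sampled edit to an existing patch.

        suspiciousness maps each SID to its fault-localisation weight, used
        as a sampling distribution over edit locations; donor_sids lists the
        statements that may serve as donor code. Edit types are chosen with
        equal probability, matching GenProg's default configuration.
        """
        sids = list(suspiciousness)
        weights = [suspiciousness[s] for s in sids]
        target = random.choices(sids, weights=weights, k=1)[0]
        kind = random.choice(edit_types)
        if kind == "DELETE":
            new_edit = (kind, target, None)
        else:
            new_edit = (kind, target, random.choice(donor_sids))
        return patch + [new_edit]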


Test Case Sampling

In an effort to reduce the number of test case evaluations required to find an acceptable patch, Fast et al. [2010] introduce the concept of test suite sampling to automated repair. Rather than computing the fitness of a candidate patch by subjecting it to the whole test suite, each candidate is instead subject to its own random sample of the test suite. Each sample is composed of all the negative tests, together with a random selection of n% of the positive test cases, where n% is usually set to 10% for larger problems, such as those in the ManyBugs benchmark set [Le Goues et al., 2015]. If the candidate passes all tests in the sample, then, and only then, is it subjected to the rest of the test suite, to determine whether it is an acceptable repair.
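The sketch below captures this two-stage scheme; candidate_passes is an assumed callable that runs a single test against the candidate program.

    import random

    def sample_test_suite(positive_tests, negative_tests, fraction=0.10):
        """Every negative test plus a random fraction of the positive tests."""
        k = min(len(positive_tests), max(1, round(fraction * len(positive_tests))))
        return list(negative_tests) + random.sample(positive_tests, k)

    def is_acceptable(candidate_passes, positive_tests, negative_tests):
        """Run the full suite only for candidates that pass their sample."""
        sample = sample_test_suite(positive_tests, negative_tests)
        if not all(candidate_passes(t) for t in sample):
            return False
        remaining = [t for t in positive_tests if t not in sample]
        return all(candidate_passes(t) for t in remaining)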

Fast et al. [2010] find that despite the noise introduced into the fitness function, the use of sampling increased the performance of GenProg by 81%, when measured by the number of test case evaluations required to find a repair.

Repair Minimisation

Upon finding an acceptable repair that passes all tests within the test suite, GenProg subjects the repair to a minimisation process, wherein redundant edits which have no effect upon the outcome of the patch are removed [Le Goues et al., 2012a]. To facilitate the minimisation process, the patch is first converted to an AST difference using a variant of the DiffX XML differencing algorithm [Al-Ekram et al., 2005], tailored to operate with CIL ASTs. This process transforms the patch into a series of structured tree operations, describing the steps required to transform the original program into its patched form, e.g., “Delete the node rooted at k”. The minimised repair is then computed by applying the delta debugging algorithm [Zeller and Hildebrandt, 2002] to the sequence of edits, reducing them to their minimal form.
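The sketch below conveys the idea with a greedy one-minimisation loop rather than the full ddmin delta debugging algorithm: it repeatedly drops any single edit whose removal still leaves a patch that passes the whole test suite. The still_repairs callable, which applies a subset of edits and runs the tests, is an assumption.

    def minimise_patch(edits, still_repairs):
        """Greedily reduce an edit sequence to a 1-minimal repair."""
        edits = list(edits)
        changed = True
        while changed:
            changed = False
            for i in range(len(edits)):
                reduced = edits[:i] + edits[i + 1:]
                if reduced and still_repairs(reduced):
                    edits = reduced   # the dropped edit was redundant
                    changed = True
                    break
        return edits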

Revealingly, in almost all cases, patches yielded by the minimisation process consist of only a single edit, indicating that the algorithm is not finding and combining partial edits, as would be expected of a traditional genetic algorithm [Holland, 1992, 2000].

RSRepair

Qi et al. [2014] introduce RSRepair, an alternative search-based repair tool, to assess whether GenProg’s search algorithm could be outperformed by random search. The RSRepair algorithm, given in Algorithm 2, uses the same fault localisation and search operators as GenProg, allowing a direct comparison between their search algorithms.


Algorithm 2: RSRepair algorithm

    while solution not found and resources not exhausted do
        location ← sample(faultLocalisation);
        edit ← sampleEdit(location);
        candidate ← apply(program, edit);
        outcomes ← evaluate(candidate);
        if candidate passes all tests then
            return candidate;
    end

As a result of abandoning the need for fitness, RSRepair is able to treat automated repair as a decision problem, rather than an optimisation problem. This allows the search to terminate the evaluation of a candidate patch on the first observation of failure, substantially reducing the number of test case evaluations required for each candidate. Additionally, this allows RSRepair to make use of test case prioritisation [Rothermel et al., 2001], whereby tests that have been observed to be more likely to fail over the course of the search are executed first, maximising the rate of incorrect patch detection (and by extension, improving efficiency, as measured by the number of test case evaluations).

Another important difference between RSRepair and GenProg is that RSRepair restricts itself to the much smaller space of single-edit patches; in contrast, GenProg patches may contain an arbitrary number of edits. Since most patches reduce to only a single edit, this allows RSRepair to consider substantially fewer candidates than GenProg. Results show that on 100 runs of a sub-set of the ManyBugs dataset, RSRepair achieves a higher success rate (measured by the percentage of runs that found a repair) than GenProg on 24/25 problems, and a lower mean number of test case evaluations for 24/25 problems [Qi et al., 2014].
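A sketch of decision-problem evaluation with test case prioritisation follows; candidate_passes is an assumed callable that runs one test, and failure_counts is a running tally maintained by the search.

    def evaluate_candidate(candidate_passes, tests, failure_counts):
        """Run the most failure-prone tests first and stop at the first failure."""
        ordered = sorted(tests, key=lambda t: failure_counts.get(t, 0), reverse=True)
        for t in ordered:
            if not candidate_passes(t):
                failure_counts[t] = failure_counts.get(t, 0) + 1
                return False  # reject immediately; skip the remaining tests
        return True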

AE

Weimer et al. [2013] introduce the “Adaptive Search, Program Equivalence” (AE) repair technique. Similar to RSRepair, AE operates using the same fault localisation method as GenProg, and with a sub-set of its mutation operators. By default, AE restricts itself to the consideration of single-edit patches, transforming the problem into a decision problem. AE searches for repairs by iterating through each of the possible Delete and Append operations at each statement within the program, in order of suspiciousness. By treating automated repair as a decision problem, AE is able to use test case prioritisation to increase the efficiency of the search. To achieve further efficiency gains, AE performs test suite reduction for each candidate patch, removing test cases whose results are not affected by the patch, by checking the statement coverage of each test.

In addition to its adaptive search process, AE introduces the notion of approximate program equivalence checking, which uses a series of syntactic and dataflow analyses to find and prune equivalent repairs from the search space. To determine approximate program equivalence, the following three techniques are used:

1. Syntactic equality: By exploiting the observation that programs that are syntactically identical are also semantically equivalent, AE removes duplicates of identical statements from the pool of donor statements (a small sketch of this pruning appears after the list).

2. Dead code elimination: Using CIL’s [Necula et al., 2002] dataflow analysis framework, AE is able to detect whether a particular mutation would insert dead code into the program; this information is used to eliminate variants from the search space.

3. Instruction scheduling: In many cases, a statement may be inserted into the program at a number of different points, all of which will yield the same semantics. AE identifies instances where two (or more) insertions would produce equivalent results, and removes redundant insertions from the search space.

Results taken from a comparison between AE and GenProg [Weimer et al., 2013], using the 55 bugs from the ManyBugs benchmark suite that were previously solved by GenProg, showed that AE was able to fix 53 of the 55 bugs. In total, AE required 186 test suite evaluations across the search, compared to the 3252 required by GenProg, giving an order of magnitude improvement in test case efficiency.
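As a small illustration of the first of these checks, the sketch below prunes syntactically identical donor statements; the whitespace-collapsing normalisation is an assumption standing in for AE’s comparison over CIL ASTs.

    def deduplicate_donor_pool(donor_statements):
        """Keep one representative of each syntactically identical statement.

        Inserting either of two identical statements yields the same candidate
        program, so discarding duplicates shrinks the search space without
        discarding any distinct repair.
        """
        seen = set()
        pool = []
        for stmt in donor_statements:
            key = " ".join(stmt.split())  # crude normalisation of whitespace
            if key not in seen:
                seen.add(key)
                pool.append(stmt)
        return pool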

PAR

Kim et al. [2013] propose PAR, a search-based technique for repairing Java programs, based on applying hand-crafted repair operators to generate repairs. With the exception of its repair operators, PAR is identical to GenProg in all other respects. These repair operators, or fix templates, described below, were produced through manual inspection of more than 60,000 human-written patches in open-source projects for common fix patterns.

1. Null checker: inserts an if-statement checking that all objects within context at the target statement are not null.

2. Parameter replacement: replaces a randomly selected parameter from an arbitrary function call at the fault location with a compatible variable or expression.

3. Method replacement: replaces the target method of a randomly selected method call at the fault location with the name of a visible method with compatible parameters and return type.

4. Parameter addition or removal: adds or removes a single parameter from a method call at the fault location, provided the method has a compatible overloaded sibling. As with the parameter replacement transformation, suitable variables and expressions are extracted from within the scope of the fault statement to serve as additional parameters.

5. Object initialisation: introduces a new assignment statement, assigning a newly-created object to a new local variable, before using the object as a parameter of a method invocation.

6. Sequence exchange: swaps the order of a sequence of identified statements within the program, where each statement is exchanged with its most similar counterpart.

7. Range checking: inserts a series of if-statements, ensuring that all index variables of array accesses are within lower and upper bounds.

8. Collection size checking: inserts an if-statement checking that an index variable is smaller than the size of a collection object if faulty statements have collection object references.

9. Lower bound setter: assigns a lower bound value to an index variable if the faulty statements have array accesses.

10. Upper bound setter: assigns an upper bound value to an index variable if the faulty statements contain array accesses.

11. Off-by-one mutator: modifies an index variable by 1 if the faulty statements have array accesses.

12. Class cast checker: inserts an if-statement, ensuring that castees implement appropriate types when casting operations are performed.

13. Caster mutator: replaces the casting type of a casting operator within the faulty statement with another type.

14. Castee mutator: replaces the subject of a casting operation within the faulty statement with a different variable, sharing a similar name.

15. Expression changer: replaces a conditional expression in the faulty statement with a similar conditional expression, taken from the same source code.

16. Expression adder: inserts a similar expression into an existing expression in a faulty statement, provided it contains a conditional expression.

Across 119 real-world bugs, Kim et al. [2013] found that PAR was able to produce repairs for 27 of them, compared to 16 bug fixes achieved by GenProg. Furthermore, the results of a human study involving 89 students and 164 developers found that patches generated by PAR were more likely to be accepted than those produced by GenProg.


Defect class according to...   Examples of defect class
The root cause                 Incorrect variable initialisation, incorrect configuration, etc.
The symptom                    Segmentation fault, null pointer exceptions, memory exhaustion, etc.
The fix                        Adding an input check, changing a method call, restoring an invariant, etc.

Table 2.1: Examples of different types of defect classes and the shared properties by which they are defined. Adapted from [Monperrus, 2014].

A year after Kim et al. [2013] introduced PAR, Monperrus [2014] delivered a critical review of the work, challenging both its methodology and results.

After introducing the concept of defect classes—a family of bugs sharing some common property, outlined in Table 2.1—Monperrus [2014] argues that to fairly evaluate repair techniques, one must compose the dataset according to the defect classes one believes the tool addresses. Similarly, to achieve a fair comparison between techniques, one must take care to account for the different defect classes addressed by each. Failing to do so, one may compose an unbalanced dataset, overrepresenting the defect classes handled by one technique, whilst underrepresenting those of another.

Monperrus [2014] highlights that Kim et al. [2013] fail to identify a priori, or to discuss a posteriori, the defect classes tackled by PAR—a problem common to a number of repair techniques. Whereas GenProg aims to provide a highly generic approach to program repair, theoretically capable of addressing all defect classes, PAR focuses on addressing common programmer mistakes fitting its predefined repair templates. Monperrus [2014] argues that Kim et al. [2013] evaluate on a dataset that may unintentionally favour PAR, and that, given the two techniques address different defect classes, such a comparison is meaningless. Upon closer inspection, Monperrus [2014] found that the majority of PAR’s repairs were produced using either the “Null Pointer Checker” or the “Expression Adder, Remover, Replacer”—the remaining operators fixed only a single bug each.

In addition to identifying problems in the (lack of) methodology used to construct the dataset, Monperrus [2014] recognises a number of flawed assumptions underlying the human study used to evaluate patch acceptability. Although the participants of the study included 67 developers, none of them were responsible for maintaining the programs used in the study. Therefore, the experiment assumes that the developer is able to judge the quality of a patch without knowledge of the codebase or the wider context of the bug. Monperrus [2014] argues that without prior experience with the codebase, participants are unlikely to accurately determine the quality and correctness of a patch within a reasonable amount of time (e.g., 30 minutes), armed only with the patch and a link to the bug report. Under these circumstances, then,


Monperrus [2014] believes that the participants conflate the notion of acceptability with understandability; that is, participants select patches on the basis of whether or not they “look good”. To explain why this conflation is a problem, Monperrus [2014] discusses the different requirements of autonomous program repair techniques and patch recommendation systems:

• Automated repair systems are intended to operate mostly at run-time, without human assistance, and their patches are only required to serve as a temporary solution.

• Conversely, recommendation systems are required to generate a set of (partial) transformations for a human developer during development or maintenance, who is given the responsibility of selecting an acceptable change from amongst them; the produced changes are expected to be long-lasting, rather than ephemeral.

As a consequence of the differing assumptions and requirements of these two approaches, each should be assessed with its own evaluation criteria, in terms of understandability, correctness, and completeness:²

• For fully automated repair techniques, such as GenProg and SPR, the only considerations should be that correct and complete solutions are generated. Given the temporary nature of their fixes, alien-looking solutions—lacking understandability—should be treated with equal consideration.

• In contrast, patch suggestion systems should concern themselves less with the completeness and correctness of a patch, and more with producing an understandable, possibly partial solution that is likely to persist.

Given that PAR is presented as an automated repair technique, and not as a fix suggestion tool, the methodology used to evaluate acceptability is compromised by the bias towards understandable repairs and against alien repairs.

Staged Program Repair

As a more efficient and higher-quality means of generating repairs for C programs, Long and Rinard [2015] introduce Staged Program Repair (SPR): a search-based repair technique which combines parameterised transformation schemas and abstract conditions to reduce the size of the search space. In lieu of GenProg’s generic statement-level repair operators, SPR uses a set of parameterised transformation schemas—each designed to tackle a particular class of defects—to generate its repairs. Given a particular transformation schema, SPR performs target value search to determine whether there exists a parameter which would cause the schema to yield an acceptable repair.

² In this context, “completeness” refers to the ability of the repair tool to generate an executable variant of the program (i.e., all of the information required to repair the program should be encoded within the patch).

This set of schemas is outlined below.

1. Condition Refinement: Transforms the condition of a given if-statement by conjoining or disjoining an abstract condition to its original condition.

2. Condition Introduction: Transforms a given statement such that the statement is only executed when an abstract condition is true.

3. Conditional Control Flow Introduction: Inserts a new control-flow statement (e.g., return, break, goto) that executes only when an abstract condition is true.

4. Insert Initialisation: Inserts a memory-initialisation statement before a selected statement.

5. Value Replacement: Given a target statement, this transformation performs one of the following replacements: one of its variables with another, an invoked function with another, or a constant with another constant.

6. Copy and Replace: Prepends an existing statement from elsewhere in the program to the program point immediately before a selected statement, before applying a value replacement transformation to the prepended statement.

After ranking the suspiciousness of statements according to spectrum-based fault localisation, SPR attempts each of these transformation schemas—in the order given above—at each statement, in order of their suspiciousness. For the first three of these transformation schemas, SPR uses angelic debugging [Chandra et al., 2011] to determine if there exists an abstract condition which would result in an acceptable repair. If so, SPR synthesises a suitable condition using the angelic values for that if condition with its condition synthesis algorithm. In cases where no abstract condition is found, SPR avoids the need to attempt concrete transformations, thus increasing the efficiency of the search. Similarly, SPR uses abstract expressions within print statements, allowing a suitable concrete expression to be synthesised from the observed, correct print behaviour.

Unlike GenProg, SPR is unable to compose multiple-edit fixes, opting to focus on single-edit patch generation instead. Long and Rinard [2015] state: “It is unclear how to combine multiple transformations and still eciently explore the enlargened space.” Intentionally, SPR lacks a statement deletion operator, due to the previous findings by Qi et al. [2015], showing that the majority of plausible, but incorrect, patches generated by GenProg were the result of deletion. Although this decision prevents explicit functionality deletion—and thus repairs which require such deletion—it does not prevent SPR from yielding implicit deletions, by replacing conditions with contradictions and tautologies [Mechtaev et al., 2016]. Mechtaev et al. [2016] found that 80% of the repairs generated by SPR for the libtiff bug scenarios involved implicit functionality deletion (whereas only 21% of repairs generated by Angelix were functionality-deleting).


To assess the efficacy of SPR, Long and Rinard [2015] evaluated it against 69 bugs from the ManyBugs suite.³ Given the tendency for automatically generated repairs to overfit the test suite, the authors themselves assessed the correctness of the patches produced by SPR. Whilst any form of patch quality assessment is better than none at all, using humans to evaluate correctness is fraught with potential biases, and the possibility of false positives and negatives—all the more so when evaluating one’s own patches against another system. A more robust solution—albeit far from perfect—is to use a held-out set of manually or automatically generated whitebox tests to independently verify the repairs.

Across the 69 bugs, SPR’s search space contained repairs for 19 of them, and for 11 of the 19, the first plausible repair that was presented by SPR was correct. By comparison, GenProg generated correct fixes for 2 of the 69 bugs [Qi et al., 2015].

In later work, Long and Rinard [2016] proposed a variant of SPR, known as Prophet, wherein each candidate patch within the search space is ranked according to a learned model of likely bug fixes. This model, learnt by applying machine learning over a corpus of several hundred historical bug fixes, accepts as its input a set of syntactic features for the original and modified versions of a line, each described as a binary variable, and produces a scalar value describing the likelihood that the given fix is correct. To attain scores for each candidate fix, and thus rank them, the resulting score from the model for a given fix is combined with its fault localisation suspiciousness score, by computing their product. Long and Rinard [2016] found that the use of Prophet reduced the average time taken to find a repair, and allowed two more bugs to be fixed within the 12-hour repair window.
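A sketch of that combined ranking follows; model_score and suspiciousness are assumed callables standing in, respectively, for Prophet’s learned model over binary syntactic features and for the spectrum-based suspiciousness of the patched location.

    def rank_candidates(candidates, model_score, suspiciousness):
        """Order candidate patches by model score times fault-localisation score."""
        scored = [(model_score(p) * suspiciousness(p), p) for p in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [p for _, p in scored]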

History-Driven Program Repair

To improve patch quality and search efficiency, Le et al. [2016] introduce a novel “history-driven” repair technique for Java programs: HDRepair. Like GenProg and PAR, this technique employs genetic algorithms to sample and combine edits into candidate patches. Whereas GenProg and PAR treat all edits equally, however, history-driven repair exploits knowledge mined from version control repositories to bias the search towards repairs that more closely resemble human-written repairs. To achieve this, 3000 historical bug fixes are first mined from GitHub, from across 700 Java projects, before being transformed into a graph-based representation. To produce this graph representation, an AST difference for each modified source file is first generated using GumTree [Falleri et al., 2014], a state-of-the-art, open-source source code differencing tool. Since the GumTree edit script still contains information specific to particular variable names, rather than capturing an abstract type of change (e.g., a change of method name), this script is transformed into a labelled, directed graph, capable of abstractly representing the change and capturing its surrounding context. Each edge within this graph {P, C} represents a modification to

³ “Bug scenarios” that were in fact feature additions were removed from consideration.


Figure 2.8: An example change graph produced during the offline bug-fix mining stage of history-driven program repair [Le et al., 2016]. Here, the change graph shows the name of a parameter being updated within the context of a method call.

a child node C within the context of its parent P. An example of such a change graph is given in Figure 2.8. Once each bug fix has been transformed into its corresponding abstract change graph, the set of graphs is subject to a process of graph mining using gSpan [Yan and Han, 2002], wherein the largest, most frequent bug fix patterns are mined. For each mined pattern, all of its vertices, edges, and the supergraphs that contain that pattern are recorded to disk, forming a historical bug fix database. Using this information, the number of supergraphs of a given pattern can be used to find its frequency.

During the online phase of the search, candidate repairs are generated stochastically, using a similar process of generating repairs to GenProg and PAR, wherein a single edit is added onto an existing patch at each mutation step. An outline of the search algorithm employed by History-Driven Program Repair is given in Algorithm 4.


Algorithm 4: The select function randomly selects a number of individuals from a population, either uniformly or weighted by a given function. The tunable parameters are: PopSize (size of population), M (number of desired solutions), E (size of the initial population), L (number of locations considered during mutation). Sourced from [Le et al., 2016].

    Input:  BugProg: Buggy program
    Input:  FaultLocs: Fault locations
    Input:  NegTests: Initially failing test cases
    Input:  ops: Possible operators
    Input:  params: Tunable parameters PopSize, M, E, L
    Output: A ranked list of possible solutions

    editFreq(cand) begin
        N ← |cand|;
        return (Σ_{i=0}^{N−1} FixPar(cand_i)) / N
    end

    mutate(cand) begin
        locs ← select(FaultLocs, L);
        pool ← ∅;
        foreach loc ∈ locs do
            op_loc ← ∪_{op ∈ applies(ops, loc)} instant(op, loc);
            cand′ ← cand + select(op_loc, 1);
            pool ← pool ∪ {cand′};
        end
        return select(pool, 1, editFreq)
    end

    search() begin
        Solutions ← ∅;
        Pop ← {E empty patches};
        while |Pop| < PopSize do
            Pop ← Pop ∪ mutate([]);
        end
        repeat
            foreach c ∈ Pop do
                if c ∉ Solutions then
                    if c passes NegTests then
                        Solutions ← Solutions ∪ {c};
                    else
                        c ← mutate(c);
                end
            end
        until |Solutions| = M;
        return Solutions
    end


As with GenProg and PAR, a population composed of single-edit solutions is generated during the initialisation stage, before being supplemented with a given number of empty patches. Unlike GenProg and PAR, there is no selection between individuals; rather, each individual within the population is subject to mutation and carried as-is into the next generation, without being subject to crossover. As a result, the algorithm ignores the outcome of the test suite for a given solution—except in cases where a repair is found—allowing it to terminate upon the first instance of failure, as with RSRepair and AE. Instead, a form of selection is introduced into the process of mutation, wherein several edits are generated and only a single edit is selected and added to the patch, based upon its likeness to historical bug fixes within the mined database.

This selection between edits uses stochastic universal sampling to pseudo-randomly sample an edit based upon its corresponding frequency within the historical bug fix database. Specifically, each candidate edit is assigned a score, according to Equation 2.4, based upon the mean frequency (within the database) of each of the edits in the mutant patch that contains the edit under consideration.

$$\mathit{score}(\mathit{patch}) = \frac{\sum_{i=0}^{N-1} \mathit{FixPar}(\mathit{patch}_i)}{N} \qquad (2.4)$$

The rationale given by Le et al. [2016] for using the mean frequency of the mutant patch—rather than the frequency of the singular edit—is that it allows the score for low-frequency edits to be dampened. This increases the likelihood of their selection (although the distribution given by considering singular edit frequencies is isotone).
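A minimal sketch of Equation 2.4 is given below; fix_frequency is an assumed lookup standing in for FixPar, the frequency of an edit’s change pattern within the mined bug-fix database.

    def patch_score(patch, fix_frequency):
        """Equation 2.4: the mean historical frequency of the edits in a patch."""
        if not patch:
            return 0.0
        return sum(fix_frequency(edit) for edit in patch) / len(patch)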

Rather than terminating upon the discovery of a repair, like PAR and GenProg, HDRepair continues to search until a specified number of solutions have been found, or a time limit has been reached. The user is then presented with each of the discovered fixes, ranked and ordered according to their mean frequencies. Unlike all other repair techniques, HDRepair does not validate its candidate repairs against the previously passing test cases. Consequently, repairs reported by HDRepair may introduce new bugs into the program. In addition to employing a novel search procedure, this repair technique also uses a combination of existing repair models and mutation testing operators to construct repairs, detailed in Table 2.2.

Results taken from an evaluation of the technique on 90 bugs taken from 5 programs within the Defects4J dataset [Just et al., 2014] demonstrate that HDRepair produces correct repairs for 23 bugs, compared to 4 and 1 fixed by PAR and GenProg, respectively. It is unclear whether off-the-shelf versions of PAR and GenProg were used; if so, then it is important to note that the definition of a repair is different between these tools—PAR and GenProg require that a patch passes all of the test suite, whereas HDRepair requires only that it pass the previously failing tests. Whilst these results may suggest a higher effectiveness (i.e., HDRepair fixes more bugs), they also increase the wall-clock time taken to find a repair.


GenProg Mutation Operators
    Insert Statement                    Insert a statement before or after a given statement
    Replace Statement                   Replaces a statement with another from the program
    Delete Statement                    Removes a statement from the program

Mutation Testing Operators
    Insert Type Cast                    Cast an object to a compatible type
    Delete Type Cast                    Remove a type cast from an object
    Change Type Cast                    Replace a type cast with a compatible cast
    Change Infix Expression             Changes the primitive operator within an infix expression
    Boolean Negation                    Negates a boolean expression

PAR Mutation Operators
    Replace Call Parameter              Replaces a method call parameter with a compatible one
    Replace Method Call Name/Invoker    Replaces the name of a method call, or a method-invoking expression, with a compatible name or expression
    Remove Condition                    Removes a boolean condition in an if-condition
    Add Condition                       Adds a boolean condition to an existing if-condition

Table 2.2: A list of the repair actions within the repair model for History Driven Program Repair, separated by their sources [Le et al., 2016].

Although those results are encouraging, the methodology used to perform the experiments makes it difficult to gauge whether the effectiveness of the technique is bolstered by its use of historical bug fix information, or whether the results are largely due to a larger repair model. Ideally, one would like to see the technique compared to a near-identical variant, in which the selection procedure is replaced by a random selection. Furthermore, the results for each technique were gathered over a single run, failing to account for their stochastic nature; consequently, the comparison may be somewhat unrepresentative of the average performance of the techniques involved.

No comparison is given between the effectiveness of using the historical bug fix database and that of the graphical model employed by Prophet. Although each of these techniques operates on a different programming language, their fix weighting components are both usable within C. Ideally, then, one would like to see whether the historical bug fix database performs any better than Prophet’s graphical model, given the significantly higher running costs associated with its memory and CPU consumption.


2.3. Semantics-Based Repair

Instead of generating and evaluating concrete patches, semantics-based approaches use symbolic execution to determine the intended semantics of the program, before using program synthesis to construct a patch based on that extracted semantic information. By avoiding the need to compile and execute the test suite for each candidate patch, semantics-based approaches are often significantly more efficient than their search-based counterparts. This speed typically comes at the cost of restricting the set of bugs and programs to which repair might be applied, however. In this section, we provide the reader with a brief review of several of the most popular semantics-based approaches to program repair.

MintHint

As an alternative to fully automated program repair, Kaleeswaran et al. [2014] propose a semi-automated technique, MintHint, which provides the developer with patch suggestions. By delegating the responsibility of patch validation to the developer, MintHint is able to avoid concerns of patch quality and overfitting—a major challenge for test-based repair techniques (i.e., repair techniques that rely on test suites as their oracles).

To identify candidate fix locations, MintHint uses Zoltar [Janssen et al., 2009] to compute a ranked list of suspicious statements within the program, using the Ochiai spectrum-based fault localisation metric. Like the majority of existing repair approaches, MintHint assumes that only a single statement within the program contains a fault.

For each of the top k statements within this list, MintHint generates a state transformer f: a function that accepts each of the program states that reach the candidate statement F and computes the correct output for each of them. To ascertain the correct output—with respect to the provided test suite—dynamic symbolic execution is performed using KLEE [Cadar et al., 2008]. To simplify hint generation, all statements in the program are converted to the form x := e through semantics-preserving program transformations. After applying this transformation, MintHint computes a state transformer f for each suspicious statement F as follows:

• To capture the existing, correct semantics of F, MintHint records the observed input/output states for each of the positive tests.

• To determine the intended semantics of the program over the negative test cases, MintHint first transforms the LHS variable x to a symbolic one. Using symbolic execution and constraint solving, MintHint finds concrete values of x that cause the program to yield the correct output for each of the negative tests. Finally, the program is re-run using computed values of x, in place of


Nature of hint   Targeted fault
Insert           Missing expression
Replace          Incorrect operator, constant, variable, etc.
Remove           Spurious expressions
Retain           Filtering out non-faulty statements
Compound         One or more occurrences of the above

Table 2.3: MintHint’s repair model, or hints, and the kinds of faults that each hint is designed to address. Taken from [Kaleeswaran et al., 2014].

its RHS e, allowing its input/output states to be extracted, thus giving the mapping f for the failing tests. In some cases, the symbolic execution process may fail to determine values for x, due to either time-out or an unsatisfiable constraint. If, for all of the negative tests, symbolic execution fails to generate values of x for some F within a predefined time limit, F is removed from consideration for the remainder of the repair process. Alternatively, if x is deemed to be unsatisfiable for all of the failing tests, then MintHint generates a “retain statement” hint for that statement, indicating that the statement is irrelevant to the fault and should be left unaltered.

Using the computed state transformer for a statement as its operational specification, MintHint exhaustively searches its repair space for expressions to serve as the arguments for its hints at that statement. The possible forms of these hints are listed in Table 2.3. This repair space is generated by iteratively constructing all possible expressions up to a pre-defined maximum length, using the in-scope variables at that statement, according to a provided grammar describing the syntax of the language. Additionally, the set of sub-expressions within the RHS e is incorporated into the search space, together with the RHS itself. By incorporating e and its constituent sub-expressions, MintHint is able to determine whether those elements are likely to be faulty, independent of the results of the fault localisation.

Each expression within the repair space of F is ranked according to its likelihood of occurring in the RHS of the repaired form of F. To determine this likelihood, MintHint measures the correlation between the outputs of a candidate expression and the correct values of the LHS, provided by f. Expressions that appear to be highly correlated with the values of f are assumed to indicate possible repairs, whereas lowly correlated expressions are assumed to be indicative of faulty expressions. By incorporating e and its constituent sub-expressions into the repair space, MintHint is able to operate effectively in the presence of poor fault localisation information. Statements that are correct should exhibit a high degree of correlation between e and the expected values of x. In such cases, MintHint will generate a “Retain” hint for that statement.

After ranking the expressions within the search space of F by their likelihood of appearing in the fixed RHS, MintHint generates repair hints using these expressions, as follows:

• MintHint exhaustively searches the space of expressions at a given statement, and filters out expressions with likelihood values below a certain threshold.

• For each candidate expression, MintHint uses pattern matching to locate matching expressions within the suspected statement. Based on the edit distance d between the candidate and matched expressions, MintHint suggests a different hint:

– If d ≤ k, where k is a tunable threshold (set to 2, by default), a hint is generated wherein the matched expression is replaced by the candidate expression.

– If d > k, an insertion hint is generated.

– If d = 0, the expression already exists within the statement.

MintHint also allows compound hints to be synthesised. Compound hints are generated by combining non-overlapping atomic hints.

To evaluate MintHint, Kaleeswaran et al. [2014] conducted a user study, in which ten participants (all of whom were either professional developers, or graduate students with prior industrial experience) were asked to debug and fix one of ten hand-seeded bugs, containing a single fault, taken from small C programs in the Software Infrastructure Repository. The time taken by users without the tool was compared to that taken by users equipped with the tool, to assess its efficacy. In an effort to reduce the difficulty of the task, the authors used Zoltar to find the five most suspicious lines within the program. The user was informed that the bug was contained on one of these lines; the lines presented were reviewed to ensure that this was the case. Kaleeswaran et al. [2014] found that the time taken by the user to find a repair was reduced from an average of 91 minutes (with 4 participants unable to complete the task) to 29 minutes when the tool was used (with all participants completing the task).

Although these results look encouraging, we have identified a number of potential issues in the design and evaluation of the study. No raw timing data, specifying the time taken by each user to complete the given task, is provided. Instead, this data is aggregated by computing the mean. By computing the mean, rather than the more statistically-robust median, the evaluation is prone to influence by outliers. The paper also fails to conduct statistical hypothesis testing, to demonstrate a significant difference between the control and the treatment, and to measure the effect size. The lack of statistics makes it difficult to assess the true effectiveness of the tool.

A deeper concern lies in the design of the user study itself; the ten participants in the control group were the same as those given the treatment. The order in which the subject is given the control and the treatment may produce an unintended bias

in the results. For the user study, participants were asked to fix a bug without using the tool first, before then being asked to fix a bug using the tool. The paper makes no mention of any steps taken to avoid introducing this bias.

From a technical perspective, questions remain over whether the techniques proposed by MintHint can be scaled to larger programs. For a number of bugs within small programs taken from the Software Infrastructure Repository, containing fewer than 20 KLOC, the symbolic execution phase timed out. The costs of applying the technique to programs containing hundreds of thousands of lines of code are likely to be several orders of magnitude higher.

In summary, MintHint transforms the exceptionally difficult problem of generating high-quality repairs into the significantly easier problem of suggesting bug fixes. In practice, there is no difference between the approach of hint generation tools and repair tools—the only difference lies in the way such tools are treated. There is nothing to prevent us from using GenProg, SPR, or any other repair technique as a hint generation tool.

SemFix, DirectFix and Angelix

SemFix [Nguyen et al., 2013], DirectFix [Mechtaev et al., 2015] and Angelix [Mechtaev et al., 2016] belong to the same family of semantics-based repair techniques. All of these techniques leverage symbolic execution and component-based program synthesis to craft provably minimal repairs to faulty program expressions. Below, we summarise the abilities and limitations of these techniques:

• SemFix, the first of these techniques to be proposed, allows the repair process to scale with the size of the program, through the use of controlled symbolic execution, but is restricted to generating single-edit patches.

• DirectFix, the successor to SemFix, allows multiple-edit patches to be generated, but does so by restricting the technique to small programs. DirectFix derives its ability to craft multiple-edit patches—and its associated limitations—from describing the semantics of the entire program in a single logical formula.

• Angelix, the newest of the three techniques, combines the strengths of SemFix and DirectFix by using lightweight angelic forests to describe the semantics of the areas of the program under repair. By using angelic forests to encode the necessary semantic information, Angelix is able to produce multiple-edit patches without restriction on the size of the program—the cost of the technique scales with the size of the patch, and not the program.

For the most part, the high-level approach employed by each of these techniques is split into the same four stages:

1. Preprocessing: A series of semantics-preserving code transformations are applied to the program under repair to allow it to address a greater number of bugs. These transformations wrap each unguarded statement in its own


if statement, allowing its execution to be controlled (e.g., x=y+1; becomes if (1) x=y+1;).

2. Fault Localisation: Like most search-based approaches, spectrum-based fault localisation is used to narrow down the possible locations of the faults, and to prioritise the efforts of the repair process. Fault localisation is performed at the level of expressions, rather than statements, since expressions are the target of the repair model shared by these techniques.

3. Symbolic Execution: Once a set of suspicious expressions has been determined, one (in the case of SemFix) or more (in the case of Angelix) expressions are selected as candidates for repair. Each of the selected expressions is then replaced with symbolic variables. Symbolic execution is then used to determine whether there exists a program path through which a given test is passed; SemFix and Angelix constrain this search to consider different outcomes associated with the symbolic variables, rather than the inputs to the program. This process is carried out for each test within the suite, until either a test is failed, or all tests are passed. In the event that a set of paths is discovered that results in success for each of the tests, a constraint solver is used to infer concrete values (referred to as angelic values) for each of the symbolic variables. Additionally, the state of the program at each visit to these locations is recorded as the angelic state. For Angelix, the sets of angelic values and angelic states are combined together to form the angelic forest.

4. Repair Synthesis: Using the semantic information extracted during symbolic execution, a provably minimal repair to the implicated expressions is constructed using a variation of component-based program synthesis [Jha et al., 2010]. Each component is represented by a variable, constant, or a term over components and supported operations (e.g., +, −, ∗), defined in the given background theory. Through the use of soft and hard Partial MaxSMT clauses, the solver finds a patch that satisfies the synthesis specification, provided by the semantic information, and that is syntactically closest to the original program. Underlying this decision is an assumption that smaller patches, closer to the original, buggy program, are more likely to be correct, and easier to maintain, than larger patches.

SemFix and DirectFix were evaluated on artificial bugs in small programs taken from the Siemens and SIR datasets, and a small number of real-world bugs in small programs from the Coreutils [Böhme and Roychoudhury, 2014] dataset [Mechtaev et al., 2015; Nguyen et al., 2013]. SemFix was found to repair more bugs than GenProg (48 vs. 16), and to do so in an average of 3.8 minutes, compared to the average of 6 minutes required by GenProg [Nguyen et al., 2013]. DirectFix was found to repair a greater number of bugs than SemFix; 53% of its repairs were found to be equivalent to those written by the programmer, compared to 17% for SemFix. In the evaluation of DirectFix, for all scenarios longer than 135 lines of code, the fault localisation information was manually—and substantially—improved by restricting the consideration of the tool to the single function known to contain the faults.


Angelix was evaluated against GenProg, AE, and SPR on a modified sub-set of the ManyBugs dataset, containing manual improvements to some of its weak test harnesses (i.e., those that only checked the exit status of the program). Due to limits of the symbolic execution engine used by Angelix, the fbc, Python and lighttpd subjects were omitted from the evaluation. Angelix was shown to fix 28 out of 82 bugs, compared to 31, 11 and 19 achieved by SPR, GenProg and AE, respectively. Angelix was able to fix six bugs that were unsolved by other techniques, whereas SPR exclusively fixed four bugs; none of the bugs fixed by AE or GenProg were exclusive. On average, Angelix took 32 minutes to find a patch.

Although Angelix shows promise as a viable approach for constructing multiple-edit patches, it also suffers from a number of limitations, also shared by SemFix and DirectFix. Some of these limitations are inherited from the limits of current symbolic evaluation engines. Angelix is unable to generate patches involving floating-point numbers, non-primitive types, or side effects, nor is it able to introduce new statements or variables into the program.

NOPOL

Xuan et al. [2017] propose NOPOL, a semantics-based repair technique designed to address a narrow class of defects in Java programs: faulty if-conditions, and missing if-conditions around individual statements.

To fix these bugs, NOPOL uses a form of angelic debugging, similar to that performed by SemFix and SPR, to determine whether there exists a set of branch outcomes for a particular if-statement that would allow the program to pass its test suite. To allow missing if-conditions to be repaired, NOPOL performs the same semantics-preserving program transformation as Angelix, wherein (unguarded) statements are wrapped in a trivial if-condition (e.g., x = 0; becomes if (true) x = 0;). Once a set of branch outcomes has been determined, NOPOL uses its knowledge of the program state at each branch evaluation to synthesise a branch condition over the in-scope variables that yields those outcomes.

Due to the nature of its implementation, however, the category of defects that NOPOL can actually fix is smaller than intended. NOPOL is unable to record information for branches that are visited more than once by the program; SPR and SemFix make no such assumption. Moreover, NOPOL assumes that the program contains only a single faulty condition, preventing it from scaling to multiple-line faults.
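The condition-synthesis step can be illustrated with a brute-force sketch: given the recorded variable snapshots at each branch evaluation and the desired (angelic) outcomes, enumerate simple candidate predicates over the in-scope variables and return one that agrees with every outcome. NOPOL itself uses SMT-based synthesis rather than enumeration; the variable names and the candidate grammar below are hypothetical.

    from itertools import product

    def synthesise_condition(observations):
        # observations: list of (snapshot, desired_outcome) pairs, where each
        # snapshot maps in-scope variable names to their values at one branch
        # evaluation. Returns a candidate condition consistent with every
        # desired outcome, or None.
        variables = sorted(observations[0][0])
        constants = [0, 1]
        candidates = []
        for v, k in product(variables, constants):
            candidates.append((f"{v} > {k}", lambda s, v=v, k=k: s[v] > k))
            candidates.append((f"{v} == {k}", lambda s, v=v, k=k: s[v] == k))
        for v1, v2 in product(variables, variables):
            if v1 != v2:
                candidates.append((f"{v1} < {v2}", lambda s, v1=v1, v2=v2: s[v1] < s[v2]))
        for text, pred in candidates:
            if all(pred(snap) == want for snap, want in observations):
                return text
        return None

    # Each entry records the state at one branch evaluation and the outcome
    # that would make the test suite pass.
    obs = [({"i": 0, "n": 3}, True), ({"i": 3, "n": 3}, False)]
    print(synthesise_condition(obs))   # a consistent condition, here "i == 0"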

SearchRepair

Ke et al. [2015] propose an automated repair technique driven by semantic code search [Stolee et al., 2014]—a method for identifying code by its behaviour, rather than its syntactic features. To compose repairs, SearchRepair searches through a

large, pre-computed database of code snippets—taken from existing programs—for a single snippet which satisfies a partial specification for the program. From a high-level perspective, the approach taken by SearchRepair to construct repairs is as follows:

1. A database of code snippets is constructed from a set of donor programs. To capture its semantics, each snippet is encoded as a set of satisfiability modulo theory (SMT) constraints over its input-output behaviour.

2. Tarantula, a technique for spectrum-based fault localisation used by a majority of repair approaches, is used to rank the suspiciousness of program fragments, rather than lines or statements.

3. For each suspicious fragment, a lightweight profile of its desired input-output behaviour—determined using symbolic execution [Cadar et al., 2008] over its test suite—is generated, in the form of a set of SMT constraints.

4. Using the Z3 SMT constraint solver [de Moura and Bjørner, 2008], the database of donor code is searched for snippets which satisfy the desired behaviour of the faulty program fragment. Matching snippets are thereafter contextualised, through a process of variable substitution, and transformed into a patch, wherein the original fragment is replaced by the discovered snippet. (A minimal sketch of this step is given at the end of this section.)

5. Finally, as with search-based methods, candidate patches are evaluated against the test suite to determine whether they represent a repair.

SearchRepair sits between search-based techniques, which randomly generate mutants and assess their correctness by evaluation against a test suite, and specification-based techniques, which use program synthesis to generate a repair that satisfies a partial specification.4

Like GenProg and its variants, SearchRepair relies on the plastic surgery hypothesis to find redundant code within programs from which to craft its repairs. Whereas GenProg operates on the level of statements, SearchRepair operates on more coarsely-grained entities, known as fragments. Fragments include the basic blocks within if- and while-statements, the if- and while-statements themselves, and sequences of one to five statements. By operating at a higher level of granularity, Ke et al. [2015] believe that SearchRepair is less susceptible to overfitting whilst remaining sufficiently expressive. Ke et al. [2015] state:

    Our core assumption is that a larger block of human-written code, such as a method body, that fits a given partial specification is more likely to satisfy the unwritten specification than a randomly chosen set of edits generated with respect to the same partial specification.

4 Note that patches generated by specification-based techniques are only guaranteed to be correct with respect to a partial specification, and not necessarily to the complete, intended specification of the program. Therefore, such methods are equally as susceptible to overfitting as search-based techniques.

Although results from the paper show that 97.3% of patches generated by SearchRepair over the IntroClass benchmarks are correct—with respect to an independent test suite—compared to 68.7% for GenProg, Ke et al. [2015] do not explicitly validate this hypothesis, and these results alone do not confirm it. Whilst almost all patches generated by SearchRepair were deemed correct, SearchRepair also fixed the fewest bugs, with respect to the original test suite.

Additionally, the benchmarks used consist of bugs in small programs, each containing fewer than 100 lines of code, and so questions remain over the generality of the technique and its effectiveness when applied to bugs in large-scale programs, such as those in the ManyBugs dataset. Moreover, due to the limitations of KLEE—the symbolic execution engine used to compute profiles—SearchRepair is unable to insert code containing loops, recursion, non-primitive types, side-effects, or I/O operations (e.g., file manipulation, database queries).

Despite suffering from a number of limitations, SearchRepair presents a promising approach to automated program repair by combining the strengths of search-based methods with semantics-based techniques.
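A minimal sketch of the semantic search step (step 4 of the list above) is given below, using the z3-solver Python bindings. Each snippet in a toy database is encoded as an SMT constraint over its input and output; a snippet matches the desired profile if, for every profiled input, it cannot produce anything other than the expected output. The snippets and the profile are hypothetical and far simpler than SearchRepair's encodings.

    from z3 import If, Int, Solver, unsat

    x, y = Int('x'), Int('y')   # snippet input / output

    # A tiny, hypothetical snippet database: each entry encodes a snippet's
    # input-output behaviour as an SMT constraint over (x, y).
    snippets = {
        'y = x + 1':  y == x + 1,
        'y = abs(x)': y == If(x >= 0, x, -x),
        'y = 2 * x':  y == 2 * x,
    }

    # Desired behaviour of the faulty fragment, derived from its tests.
    profile = [(3, 3), (-4, 4)]    # (input, expected output) pairs

    def matches(encoding):
        # A snippet matches if its encoding forces the expected output for
        # every profiled input, i.e. encoding AND x == i AND y != o is unsat.
        for i, o in profile:
            s = Solver()
            s.add(encoding, x == i, y != o)
            if s.check() != unsat:
                return False
        return True

    print([name for name, enc in snippets.items() if matches(enc)])  # ['y = abs(x)']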

2.4. Specification-Based Repair

Unlike search-based and semantics-based program repair, both of which rely on user-provided test suites to establish correctness, specification-based repair techniques assume the existence of formal specifications. Due to this requirement, the applicability of these approaches is far more limited; none of the datasets previously used to evaluate search-based or semantics-based repair approaches (e.g., SIR, ManyBugs, Coreutils) possesses any such specifications. On the other hand, specification-based approaches are able to leverage these specifications to generate higher-quality patches (that are likely to satisfy them). Concerns shift from the adequacy of the test suite to the soundness and completeness of the specifications that are used. Like semantics-based repair, specification-based approaches use program synthesis techniques to generate patches. Below, we briefly discuss AutoFix-E, one of the few and foremost approaches to specification-based program repair.

AutoFix-E

AutoFix-E [Wei et al., 2010] leverages partial specifications, provided in the form of contracts, to automatically repair faults in Eiffel classes. These contracts are used to specify preconditions, postconditions and intermediary assertions over Eiffel classes. Given a defective Eiffel class, AutoFix-E attempts to expose and repair the underlying bug, following the steps below:


1. Test Suite Generation: In contrast to almost all search-based and semantics-based repair approaches, AutoFix-E generates a test suite automatically, rather than relying on one provided by the programmer. AutoTest, an automated test generation tool for Eiffel, is used to generate as large a test suite as possible within a fixed window of time, covering the defective class. By evaluating the program under repair against this test suite, the tests are partitioned into passing runs and failing runs (equivalent to positive and negative tests). A run is determined to be a failure if it results in a contract violation.

2. Object State: To aid in determining the source of the fault, AutoFix-E examines object state by observing the output of argument-less, boolean-valued functions (referred to as boolean queries). Boolean queries are widely used in Eiffel contracts as a means of concisely capturing the key properties of object state. Together, the n boolean queries for a given class describe its 2^n abstract states. Although the resulting state space may seem intractable, a study of a large Eiffel library, conducted by Wei et al. [2010], suggests that the majority of Eiffel classes have 15 or fewer such queries.

Using the set of boolean queries for the class under repair, AutoFix-E uses contract mining to discover implications between queries. For each mined implication, AutoFix-E also generates three mutants of that implication: the negation of its antecedent, of its consequent, and of both. Together, the sets of boolean queries and implications form the predicate set P; this set describes a collection of simple facts about the class under repair, encoded as logical statements. To improve the efficiency of later stages in the repair process, the predicate set is pruned with the help of an automated theorem prover.

3. Fault Profiling: Using Daikon, a popular invariant mining tool, and a superset Π of the predicate set P, also containing the negation of each predicate, AutoFix-E generates a pair of invariants for each executed program location ℓ. This pair, (I_ℓ⁺, I_ℓ⁻), comprises the predicates I_ℓ⁺ ⊆ Π that hold for all passing runs, and the predicates I_ℓ⁻ ⊆ Π that hold for all failing runs. Together, this sequence of pairs forms a fault profile, describing the possible causes of the failed runs in terms of abstract state. (A minimal sketch of this profiling step is given after this list.)

4. Behaviour Model Generation: To serve as a source of ingredients for its repair process, AutoFix-E mines a simple finite-state behavioural model over all of its passing runs, encoded as a finite-state automaton. The states of this automaton are labelled with the predicates that hold in that state. Transitions are labelled with the name of a routine, and describe a particular input-output behaviour of that routine in terms of abstract state; implicitly, each transition describes a Hoare triple. The resulting model is used by AutoFix-E to determine how to reach a desired state (i.e., the intended state) from some current state (i.e., a faulty state). In general, the model is neither sound nor complete, since it is inferred over a finite set of executions. In practice, however, the model appears to be sufficiently precise for the purposes of finding a repair.


(a)  snippet            (b)  if fail then       (c)  if not fail then    (d)  if fail then
     old_stmt                  snippet                  old_stmt                 snippet
                              end                     end                      else
                              old_stmt                                           old_stmt
                                                                               end

Figure 2.9: The fix schemas implemented in AutoFix-E [Wei et al., 2010]. snippet is replaced with a sequence of routine calls that move the program from a faulty state into a desired state. old_stmt may either be a single statement, or the block to which a statement belongs. fail is used to monitor the conditions under which the fault manifests, and not to affect the appropriate action.

5. Fix Generation: Beginning at the location where the fault occurred and iterating backwards, AutoFix-E uses its fault profile and behaviour model to generate a set of candidate repairs. At each location, AutoFix-E finds all possible instantiations, within its model, of the four repair schemas outlined in Figure 2.9. Each of these candidate repairs uses the fault profile to move the program from a possibly faulty state to a possibly correct state. To achieve this change in state, AutoFix-E uses its behavioural model to find the sequence of routine calls that results in the desired post-condition.

A special fix schema is also introduced for repairing linear assertion violations (e.g., count ≥ 0).

6. Fix Validation: Once a set of candidate repairs has been generated, AutoFix-E evaluates them to determine the set of valid repairs (i.e., those which pass all of the tests). Finally, AutoFix-E uses a series of metrics to rank the valid repairs according to some proxy for quality. These metrics measure the size of the snippet, the distance in state, the number of old statements captured by the fix schema, and the number of branches required to reach old_stmt from the point of injection of the instantiated fix schema.
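As a concrete illustration of the fault-profiling step (step 3 above), the following sketch computes, for each location, the pair of predicate sets that hold in all passing runs and in all failing runs. The run and predicate representations are hypothetical; AutoFix-E derives its predicates from boolean queries and Daikon, which this sketch does not model.

    def fault_profile(passing_runs, failing_runs):
        # Each run maps a location to the set of predicates observed to hold
        # there. Returns {location: (I_plus, I_minus)} where I_plus holds in
        # all passing runs and I_minus holds in all failing runs.
        locations = set()
        for run in passing_runs + failing_runs:
            locations |= run.keys()
        profile = {}
        for loc in locations:
            i_plus = set.intersection(*(run.get(loc, set()) for run in passing_runs))
            i_minus = set.intersection(*(run.get(loc, set()) for run in failing_runs))
            profile[loc] = (i_plus, i_minus)
        return profile

    passing = [{'l1': {'not is_empty', 'count > 0'}}, {'l1': {'not is_empty'}}]
    failing = [{'l1': {'is_empty', 'count == 0'}}]
    print(fault_profile(passing, failing))
    # {'l1': ({'not is_empty'}, {'is_empty', 'count == 0'})}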

To evaluate AutoFix-E, Wei et al. [2010] ran it on a dataset of 42 faults detected by AutoTest in 10 data structure classes taken from popular, open-source Eiffel libraries. AutoFix-E was able to repair 16 of these faults. To assess the quality of AutoFix-E's patches, they manually inspected the top five patches reported for each bug to determine if the patch fixed the underlying bug without introducing a new defect. 13 of the 16 bugs fixed by AutoFix-E were found to satisfy this criterion.


2.5. Related Techniques

To conclude our review of the relevant literature, in this section we briefly discuss a number of techniques loosely related to general-purpose program repair:

• Reflective Grammatical Evolution: [Timperley, 2013; Timperley and Stepney, 2014] incorporate concepts from failure-obliviousness into genetic programming. After using computational reflection to create a failure-oblivious dialect of Ruby, the authors demonstrate a higher success rate and efficiency—measured by candidate evaluations—when programs are evolved with these measures enabled. These results suggest that allowing the program to persist in the presence of errors, through the use of such measures, may smoothen the fitness landscape and reduce the difficulty of the search.

• Data Structure Repair: [Demsky and Rinard, 2003, 2005] present techniques for automatically recovering from data structure corruption errors at runtime. Repair is achieved by enforcing a data structure consistency specification, which may be manually provided by the developer, or automatically inferred from correct program executions, using tools such as Daikon.

• Integer Bug Fixing: [Coker and Hafiz, 2013] propose a set of three program transformations for repairing all instances of integer bugs within C programs:

1. Add Integer Cast: adds explicit casts to disambiguate integer usage, and to address signedness and widthness problems.

2. Replace Arithmetic Operator: replaces arithmetic operations with safe equivalents, which detect overflows and underflows at runtime.

3. Change Integer Type: modifies the types of integers to avoid signedness and widthness problems.

Together, these three transformations fixed all 7,147 programs within NIST's SAMATE dataset [Black, 2007], covering over 15 million lines of code. Unlike general-purpose repair techniques, Coker and Hafiz's program transformations produce sound and complete repairs for a restricted class of defects, forgoing the need for test suite evaluation or symbolic execution. Although these transformations prevent integer bugs from manifesting, they do so by ensuring safe behaviour, rather than finding particular unsafe instances of integer usage and patching the source code directly (which would avoid the need for potentially expensive instrumentation).

• Bolt and Jolt: Bolt [Kling et al., 2012] and Jolt [Carbin et al., 2011] are techniques for monitoring the execution of a program, detecting whether an infinite loop occurs, and, if so, allowing the loop to be escaped or the program to be terminated, at the request of the user. Whereas Jolt requires source code instrumentation to inject monitoring code, Bolt obviates this need through on-demand dynamic binary instrumentation; the (unstripped) binaries remain


unmodified until the user attaches Bolt to the application. To detect the occurrence of (some) infinite loops, both techniques monitor the state of the program upon entry to the loop. If, upon the next iteration, the current state is the same as the previous state, an infinite loop is detected and the user is given the option to escape the loop or to terminate the program. (A minimal sketch of this detection rule is given after this list.) Across eight infinite loops in five small but real programs (grep, ctags, indent, ping, look), Jolt was able to detect an infinite loop in seven cases [Carbin et al., 2011]. Importantly, in each case, the program was allowed to continue executing, producing a more useful output than simply terminating the program [Carbin et al., 2011; Kling et al., 2012].

• CodePhage: Instead of addressing bugs via source code modification, CodePhage [Sidiroglou-Douskos et al., 2015] attempts to fix bugs by automatically identifying correct code in foreign, donor applications, and transferring that code into the faulty program. By operating at the binary level, CodePhage allows programs to be repaired without the source code of either the program under repair or the donor applications. In its evaluation, CodePhage was able to successfully repair five out of ten security bugs across seven different programs.

• ClearView: ClearView [Perkins et al., 2009] uses Daikon to infer likely invariants for a given system, based on data collected from several training executions. Prior to deployment, ClearView uses binary instrumentation to inject monitoring code into the program. Two of these monitors, HeapGuard and Determina Memory Firewall, check for out-of-bounds memory accesses and illegal control flow transfer errors, respectively. A third monitor, Shadow Stack, allows invariant violations along the call stack to be recorded. In the event that a monitor reports erroneous behaviour at run-time, ClearView attempts to generate a patch that restores the violated invariants and satisfies the monitor. The generated patches re-establish the inferred invariants by altering control flow, register values, and/or the values of memory locations.

To evaluate ClearView, Perkins et al. [2009] conducted a Red Team exercise, wherein members of the Red Team were asked to perform attacks on a program protected by ClearView, using exploits discovered on the unprotected version of the program. Firefox, a popular open-source web browser, was used as the target application for the exercise. The external Red Team was able to generate ten code-injection exploits in the target application. When ClearView was used, all of these attacks were detected by its monitors, and thus prevented. In seven out of ten cases, ClearView generated a patch that allowed the program to survive the attack and safely resume its execution.

• LeakFix: LeakFix [Gao et al., 2015] uses a series of program analyses to locate and repair (a sub-set of) memory leaks in C programs. Identified leaks are patched by inserting deallocation statements at appropriate points in the program. The resulting patches are guaranteed not to interrupt normal program execution.


In an evaluation on 15 programs, comprising 522 KLOC and each containing multiple leaks, LeakFix generated patches for 28% of the leaks. LeakFix exemplifies an alternative approach to automated program repair: tackling specific defect classes with high quality and accuracy. This approach is an appealing one, but in practice, it is difficult to cleanly assign bugs to any one particular defect class.

• Genetic Improvement: Search-based program repair can be viewed as part of the wider field of Genetic Improvement (GI) [Langdon, 2015], which seeks to apply machine learning and search techniques to improving existing programs, more generally. In contrast to Genetic Programming, which typically attempts to evolve a program from scratch, GI uses existing code as its seed. In addition to program repair, GI has been used to automatically improve run-time performance, reduce power consumption, port functionality, and more.
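The loop-detection rule used by Jolt and Bolt (described in the Bolt and Jolt item above) can be sketched as follows, assuming the loop's live state can be captured as a dictionary of variable values at the loop head. The real tools instrument compiled code; this is only an illustration of the detection rule, and the function names are hypothetical.

    def run_with_loop_monitor(step, state):
        # Run `step` (one loop iteration) repeatedly; report an infinite loop
        # if the state at the loop head repeats the previous iteration's state.
        previous = None
        while state.get('running', True):
            snapshot = dict(state)
            if snapshot == previous:
                print('infinite loop detected: state unchanged between iterations')
                return state          # here we simply escape the loop
            previous = snapshot
            step(state)
        return state

    # A buggy loop body that forgets to increment its counter.
    def buggy_iteration(state):
        if state['i'] >= state['n']:
            state['running'] = False
        # BUG: missing state['i'] += 1

    print(run_with_loop_monitor(buggy_iteration, {'i': 0, 'n': 3}))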

2.6. Concluding Remarks

In this chapter, we conducted a review of the majority of existing program repair techniques. Below, we summarise each of the three main approaches to repair:

• Semantics-based approaches, such as Angelix [Mechtaev et al., 2016], have been shown to be more efficient than search-based approaches and to successfully produce multi-line patches. The bugs and programs to which semantics-based repair can be applied are limited, however. Due to limitations in its underlying techniques, namely those inherited from symbolic execution, semantics-based approaches are unable to repair bugs involving side-effects, iteration, recursion, non-primitive data types, and more.

• Specification-based repair has demonstrated promising results, specifically AutoFix-E [Wei et al., 2010], but its reliance on formal specifications prevents it from being applied to the majority of bugs. Nonetheless, it may be possible to use contracts to guide search-based repair towards a repair, rather than to require them. The information provided by contracts could be integrated into a richer fitness function, similar to the one introduced by Arcuri and Yao [2008].

Alternatively, specification mining [Ernst et al., 2007] may be used to automatically infer partial specifications. Fast et al. [2010] attempt to realise this with their predicate-based fitness function (described in Section 6.1). However, their approach requires knowledge of at least one patch a priori, preventing it from being used in practice. B. Le et al. [2016] use mined specifications to improve the accuracy of function-level fault localisation (see Section 4.1.4 for more details).


Additionally, it may be possible to use the degree to which a potential solution conforms to an inferred specification as a proxy for the quality of a solution (i.e., whether the solution destroys existing functionality or overfits to the test suite). To our knowledge, no studies have explored a possible link between specification conformance and patch quality.

• Search-based repair is the only approach that neither requires formal specifications nor places any limitations on the bugs it can fix. GenProg, the seminal approach to search-based repair, is one of the few search-based techniques capable of generating multiple-edit patches—PAR and HDRepair, the other approaches capable of multiple-edit repair, share the same underlying genetic algorithm. Despite this capability, almost all patches generated by GenProg in previous studies were reduced to a single edit [Le Goues et al., 2012a; Qi et al., 2015]. Techniques such as PAR, AE, and SPR sacrifice the ability to generate multiple-edit patches in exchange for significant efficiency gains. It is unlikely that the algorithms used by these techniques (random search, exhaustive search and value search) can be scaled to generating multiple-edit patches, owing to the combinatorial explosion of the search space. To tackle the problem of multiple-edit repair, new active search algorithms are needed, capable of identifying and exploiting partial fixes, and of localising bugs over the course of the search.


CHAPTER 3

Tools and Techniques

Research in automated program repair is typically evaluated by applying a new technique to a set of real-world bugs, and comparing the results to those produced by prior techniques. Such an evaluation ideally involves a set of reproducible real-world defects. Open-source software repositories provide a plentiful source of such bugs, but reproducing them reliably can prove challenging. Program behaviour is often dependent upon the particular configuration of its host environment, such as the versions of particular libraries; many bugs only manifest under certain configurations. Additionally, because most program repair techniques involve evaluating candidate patches that may apply arbitrary changes to the program under repair, experiments must be appropriately sandboxed to ensure system safety and test idempotency.

What is needed, therefore, is both a dataset of bugs, and a platform for interacting with them that ensures reproducibility and safety. The former is provided by benchmarks such as ManyBugs [Le Goues et al., 2015], Defects4J [Just et al., 2014] and the Software Infrastructure Repository (SIR) [Do et al., 2005]. For Java bugs, Defects4J provides both a dataset of bugs and a platform for executing them, implicitly supplied by the Java Virtual Machine. For C bugs, this platform is either provided in the form of a monolithic virtual machine (VM) [Le Goues et al., 2015; Long and Rinard, 2015; Qi et al., 2015], shared by multiple bugs within the same dataset, or, in some cases, not provided at all [Kim et al., 2013].

Although the use of a VM as an execution platform provides reproducibility and safety, it does so at the cost of performance and extensibility. Running the experiment within a VM incurs the significant performance penalties associated with the overheads of virtualisation. Using a single VM to host both the bug scenario and the repair tool limits its future usage. Newer techniques may be unusable within the VM due to library incompatibilities, leading to the need to perform unsound modifications to the VM, which compromise its reproducibility.

Motivated by these problems and by the issue of repair quality, we present RepairBox, a platform for safely conducting controlled, reproducible program repair experiments with minimal compromise to performance. RepairBox operates by isolating individual bug scenarios and repair tools into separate, minimal Docker1 containers that can be easily inspected and extended. By packaging bug scenarios as containers, RepairBox attains close to bare-metal performance, an important trait when performing a large number of repeats of computationally-intensive program repair experiments.

1 https://www.docker.com. [Accessed: April, 2017].


A summary of the contributions of this chapter is given below:

• Identified previously unknown weaknesses in a number of publicly available APR benchmarks used to evaluate the performance of existing techniques.

• Developed a tool, Pythia, to automatically improve the reliability of existing test suites within the context of APR, ensuring that all passing tests meet a minimal set of quality requirements.

• Proposed a novel, efficient technique, RepairBox, for conducting controlled experiments into APR, together with a set of accompanying open-source tools.

• Collated and edited over 400 bug scenarios for C programs from a number of existing sources, including ManyBugs, GenProg's TSE benchmarks, and the Software Infrastructure Repository.

The rest of the chapter is structured as follows: In Section 3.1, we review a collection of publicly available datasets of bugs used to conduct APR experiments. In Section 3.2, we discuss the problem of repair quality and its importance to empirical studies of APR, before presenting Pythia, a tool to avoid this problem. In Section 3.3, we review existing approaches to reproducibility within APR, before proposing RepairBox, a high-performance, controlled platform for conducting APR experiments. Finally, in Section 3.4, we put forward an efficient and robust methodology for conducting APR experiments, based on these tools, which is used for the remainder of this thesis.

3.1. Bug Scenarios

In this section, we first identify and review a number of publicly-available datasets of bug scenarios, used to conduct empirical studies of program repair. We then discuss some of the current challenges involved in performing such studies.

• GenProg ICSE 2009, GECCO 2010, TSE 2012: each of these sets of bug scenarios was used to conduct early experiments on GenProg [Fast et al., 2010; Le Goues et al., 2012b; Weimer et al., 2009]. Across these three sets, 20 bugs, each from a different program, are covered. Of these 20 bugs, two are found in artificial programs consisting of fewer than 100 lines of code (gcd, zune). 17 of the remaining 18 bugs are sourced from genuine bugs in large, open-source projects (many taken from the results of Miller's work on fuzz testing [Miller et al., 1990]), with the other bug (atris) being sourced from a small game created by one of the authors of the paper.

Although the majority of these scenarios represent genuine bugs in large projects, each uses a very small, artificial test suite, hand-written by the authors. Each of these test suites contains fewer than 21 tests, designed solely to


expose the bug, rather than to thoroughly test the functionality of the program. This allows destructive changes to be silently introduced into the program. Whilst the test suites for these bugs are unrepresentative of standard test suites, their small sizes, together with the relatively short compilation times for most of their programs,2 make them ideal for high-cost, non-deterministic experiments, where large numbers of repeats are required to perform statistical hypothesis testing (e.g., to demonstrate a difference in performance between two search algorithms).

• ManyBugs: to validate the effectiveness of GenProg on a set of bug scenarios more representative of real-world environments, the ManyBugs [Le Goues et al., 2015] benchmarks contain 185 bug scenarios across 9 large, open-source programs, each with a genuine test suite. Since real test suites are provided with each bug, these benchmarks are appropriate for gauging and comparing the overall effectiveness of APR approaches in the wild.

Since real test suites are used in each of these bug scenarios, they tend to be several orders of magnitude larger than the artificial test suites found in the earlier GenProg datasets. For instance, each of the PHP bug scenarios contains around 8000 tests. As such, performing multiple runs of a given algorithm on one of these scenarios to achieve statistical significance can become particularly expensive, requiring (tens of) thousands of CPU hours.

• IntroClass: whilst ManyBugs and the earlier GenProg benchmarks are almost exclusively made up of large, open-source projects consisting of millions of lines of code, the IntroClass [Le Goues et al., 2015] bug scenarios provide an invaluable set of real-world bugs from small programs, consisting of fewer than 100 lines of code, taken from students' laboratory assignments. In total, this dataset contains 998 bugs. Each bug is accompanied by a pair of independent test suites containing 95 test cases in total: one suite could be used by the student during the development of their solution, whilst the other may be used to validate solutions and to reduce the likelihood of overfitting. Additionally, an oracle program is provided for each of the six assignments.

The small size and low complexity of this set of benchmarks make it particularly well suited to studies involving a large number of bug scenarios, or those in which certain levels of statistical significance must be achieved. Additionally, these bugs provide insight into the ability of APR techniques to generalise to smaller programs, written by novice developers. This particular trait may also be useful in assessing the effectiveness of plastic surgery-driven approaches on small code bases.

• Defects4J: Defects4J [Just et al., 2014] provides a database of high quality Java bug scenarios, designed for controlled studies of software testing. This database covers 395 real-world bugs across 6 open-source projects.

2For php and imagemagick, using the provided compilation scripts could result in compilation times taking as long as 8 minutes.


Another source of bugs within C programs, albeit one that has yet to be used by the APR community, is provided by the Software Infrastructure Repository (SIR) [Do et al., 2005]. SIR contains a collection of thousands of manually seeded bugs across 85 small to medium-sized programs (1 to 100 KLOC), written in C, C++, C#, and Java. Each program comes with a series of manually engineered test suites, designed to test the program at various levels of coverage. Although artificial in the nature of its tests and bugs, SIR offers a rich and plentiful source of bugs, suitable for conducting studies on large numbers of medium-sized programs.

The Siemens benchmarks [Hutchins et al., 1994] also provide a source of synthetic bugs within C programs. In comparison to the bugs within SIR, the Siemens programs span between 100 and 800 lines of code. Each bug scenario is accompanied by a test suite containing thousands of tests, providing a level of coverage rarely seen in real-world programs.

Importantly, while these datasets provide a plentiful source of bug scenarios, none, with the exception of Defects4J, provides an execution platform in which to reproduce these bugs; Defects4J implicitly provides such a platform in the form of the Java Virtual Machine. For programs written in compiled languages such as C, this lack remains an important and unsolved problem.

3.2. Pythia

In this section, we look at the problems that inadequate test suites cause when performing APR experiments, before introducing a lightweight, open-source tool to address the problem and to improve our confidence in the results of APR experiments.

Motivation

Recent findings [Qi et al., 2015; Smith et al., 2015] have demonstrated how inadequate test suites can lead to (unintentionally) misleading results and claims regarding the effectiveness of repair techniques. The majority of test suites provided within publicly-available automated repair benchmarks (for C) have been shown to suffer from serious problems of inadequacy. Qi et al. [2015] found that previously published results for GenProg and AE on the ManyBugs suite, wherein 55 of 105 bugs are solved, are mostly the result of overfitting. Following a manual inspection of the patches produced by GenProg and AE, they observe that only 2 of the 55 solved bugs yielded correct patches.

Upon closer examination of the ManyBugs test suite, they discovered that many of the tests failed to compare the output of the program against its expected

output. In some cases, the only requirement of the test was that the program should exit with a status of zero, allowing trivial patches such as exit(0); to be quickly discovered and accepted.

Motivated by those findings, we subjected the ManyBugs scenarios to further scrutiny, discovering a different class of weakness within certain test suites in the process. Specifically, for programs such as libtiff and gzip, the accompanying test harnesses neither check for expected side effects on the filesystem, such as the presence (or absence) of a particular file (e.g., file should be replaced by file.z), nor do they check the state of particular files. We also examined the GenProg TSE 2012 bug scenarios, and found that the majority of their tests also failed to validate on any criterion other than the exit status of the program. After we manually modified each of these test suites to compare program outputs against an expected output, GenProg (and AE) failed to yield a solution where previously they did.3

Such weaknesses compromise the validity of the conclusions drawn from APR experiments, and thus make performing experiments with such test suites exceptionally challenging. To reason about the effectiveness of APR approaches—and to avoid potentially misleading results—one must introduce further rigour to the test suites that are used. No existing, publicly available benchmarks for C programs are free from such weaknesses.

Existing Solutions

There are a number of existing techniques that one might use to tackle this problem. We critique each of them below.

Manual Extension and Modification

The simplest solution to the problem is to manually add more test cases to each of the test suites used within the experiment. However, doing so risks losing the real-world relevance of the results, as the test suites become more artificial and less realistic. One may argue that, in practice, the user of an automated repair technique could be expected to supply these additional test cases. In cases where bugs are less trivial, and the user is unable to fully comprehend them, such an assumption may prove impractical. Furthermore, such an approach would require a significant investment of time to carefully craft a higher-quality test suite for each problem.

Similarly, one may attempt to address the problem at its root by manually addressing each of the weaknesses in the test suites, thereby reducing the number of false positive results. By taking such measures, however, one's results become predicated on

3 For some bug scenarios, this was not possible since the program failed to produce any output when run on a 64-bit variant of Fedora 24. In these cases, the program appeared to have no side-effects on either the file system, the standard output, or the standard error.

the assumption that the test suite is of a high quality—an assumption that often fails to hold in the real world, as exemplified by the weaknesses within the ManyBugs benchmarks (e.g., PHP, libtiff, gzip). As with manually extending the test suite, this approach also requires a significant investment of time, as (thousands of) existing test suites need to be analysed and adjusted.

Automatic Extension

Alternatively, one may use the developer-provided patch for the bug as an oracle, allowing automated test generation [Cadar et al., 2008] to be used to supplement the test suite with more positive and negative tests. As a simple, albeit limited, criterion for the adequacy of a repair, one may use test case generation to increase statement coverage. In practice, aiming for complete branch, statement, or MC/DC coverage across the test suite is likely to prove intractable for all but the smallest and most artificial of problems. Even if one were to relax the coverage criterion to full coverage at the statement level (within a particular set of files), the resulting test suite may still prove too large as the size of the program grows. However, one may use test suite prioritisation and sampling to reduce the cost of this extended test suite.

For the purposes of assessing the potential effectiveness of automated repair techniques in the real world, where complete coverage (at any level) is scarce, augmenting the test suite in any way may produce overly optimistic results. In the general case, we believe that ideal, representative bug scenarios used within APR experiments should place no assumptions on the user extending the original test suite of the program, whether manually or automatically.

Pythia

To address the problem of inadequate test suites without the need for manual intervention, we introduce Pythia,4 an open-source, automated tool for improving the quality of APR experiments. Pythia takes a test suite manifest file (provided in a human-readable JSON format), together with an oracle program (i.e., the human-repaired form of the program), and generates an oracle file, describing the expected behaviour of the program for each test within the manifest. This process is illustrated in Figure 3.1. Pythia may also be used to improve the detection and automated repair of regression faults in real-world programs, by using an existing version of the program as the oracle. Once the oracle file has been generated, all tests (when performed using Pythia) are guaranteed to meet the following quality requirements:

4 The name “Pythia” comes from the name of the mythical Oracle at Delphi (https://en.wikipedia.org/wiki/Pythia). Pythia is available to download at: https://github.com/ChrisTimperley/Pythia.


Figure 3.1: An overview of the inputs and outputs of Pythia’s oracle generation process.

1. Checking of standard output: the standard output of the program should match that of the oracle.

2. Checking of standard error: the standard error of the program should match that of the oracle.

3. Checking of exit code: the program should exit with the same code as the oracle.

4. Checking of file system: the program should leave the relevant areas of the file system in the same state as the oracle (i.e., the content, presence, and absence of files are checked).

Rather than describing tests and their expected outputs within the same file, as is the typical approach within the ManyBugs and early GenProg benchmark sets, Pythia uses the description of a test to create a (faux) isolated environment for its execution (known as a sandbox), before executing its associated shell command within that sandbox and recording its various behaviours. After capturing the behaviour of a test execution, those behaviours are compared against those given in the oracle file—if the program exhibits all of the same behaviours, it passes the test; otherwise, it fails.
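A minimal sketch of this pass/fail decision is given below, assuming an oracle entry with the retval, out, and err fields shown later in Figure 3.4. The function and its arguments are hypothetical, are not Pythia's actual interface, and omit the file system check.

    import subprocess

    def passes(command, sandbox_dir, oracle_entry):
        # Run the test command inside its sandbox and capture its behaviour.
        result = subprocess.run(command, shell=True, cwd=sandbox_dir,
                                capture_output=True, text=True)
        # A full implementation would also compare the resulting sandbox state
        # against the oracle's recorded file hashes.
        return (result.returncode == oracle_entry['retval']
                and result.stdout == oracle_entry['out']
                and result.stderr == oracle_entry['err'])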

Sandbox Environment

Prior to the execution of each test, an isolated environment, or sandbox, is created for the execution. By reading the description of a test from its corresponding manifest file, Pythia copies the files relevant to this test into the sandbox, ensuring a minimal, working environment for the execution. In practice, sandboxes are implemented as temporary directories that are disposed of after the execution.


$ ./gzip ../inputs/testdir/file26 -c

Figure 3.2: In this example, taken from the gzip object within the SIR, the test suite attempts to destructively compress a given file, deleting the original version of the file and replacing it with its zipped form.

[
  ...,
  {
    "description": "Compresses a file at a given location",
    "command": "<> -c '<>/foo'",
    "sandbox": {
      "foo": "inputs/example.txt",
      "bar": "inputs/example2.txt"
    }
  },
  ...
]

Figure 3.3: An example test case description within a Pythia test manifest file. Each test case is described by its corresponding shell command, the contents of its sandbox directory, and an optional human-readable description.

Pythia includes this feature to avoid the side-effects of certain test executions on the file system. One example of where this behaviour is essential, amongst several, can be found in the gzip object from the Software Infrastructure Repository. An exemplar test from its original test suite is given in Figure 3.2. When one attempts to run this test a second time, the oracle program will produce a different result, as the input file to the test was destroyed during the initial invocation. By copying the necessary files to the sandbox, rather than using their original versions on the file system, such side-effects can be avoided.

Test Manifest File Format

Test manifest files within Pythia are described using a JSON array of objects, where each object describes a single test case by a shell command, the contents of its sandbox, and an optional human-readable description of the test case (useful for debugging). An example test case description within a manifest file is given in Figure 3.3.

To ensure the program behaves appropriately in all environments, two special variables may be used in the shell command for the test case: <>, which specifies the absolute path to the program executable, and <>, which gives the absolute path of the sandbox for the particular execution. Prior to execution, Pythia substitutes these special variables, replacing them with their appropriate values.


[
  ...,
  {
    "retval": 0,
    "time": 1.432,
    "out": "zipped f to f.z",
    "err": "",
    "sandbox": {
      "f.z": "cf23df2207d99a74fbe169e3eba035e633b65d94"
    }
  },
  ...
]

Figure 3.4: An entry in an example Pythia oracle file, describing the expected behaviour of a test case from the corresponding manifest file.

The contents of the sandbox for each test case are described using a dictionary, where each entry describes a file that should be copied into the sandbox. The value (i.e., right-hand side) of each entry specifies the location of the file that should be copied to the sandbox (the path must be either absolute, or relative to the location of the manifest file). The key of each entry (i.e., left-hand side) specifies the path that the file should be copied to, relative to the root of the sandbox directory. Note that no distinction is made between files and directories when copying; when a directory is specified by the value of a sandbox entry, that directory is copied recursively to a directory with the name given by the key of that entry.
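The copying behaviour described above might be sketched as follows; the helper name is hypothetical, and error handling is omitted.

    import os
    import shutil
    import tempfile

    def build_sandbox(sandbox_spec, manifest_dir):
        # sandbox_spec is the "sandbox" dictionary from a manifest entry:
        # keys are destinations relative to the sandbox root, values are
        # source paths (absolute, or relative to the manifest file).
        sandbox = tempfile.mkdtemp()
        for dest, src in sandbox_spec.items():
            src_path = src if os.path.isabs(src) else os.path.join(manifest_dir, src)
            dest_path = os.path.join(sandbox, dest)
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            if os.path.isdir(src_path):
                shutil.copytree(src_path, dest_path)   # directories copied recursively
            else:
                shutil.copy(src_path, dest_path)
        return sandbox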

Oracle File Format

Pythia uses oracle files to describe the expected behaviour of each test within a corresponding test suite. As with the test manifest, oracle files are encoded as a JSON array of objects, where the nth object describes the intended behaviour for the nth entry within its associated manifest. Each object specifies the expected exit code, standard output, standard error, the state of the sandbox after execution, and the time taken to reach completion.

An example oracle file entry is given in Figure 3.4. The retval, out and err properties record the exit code, and the state of the standard output and standard error, respectively. The resulting state of the sandbox is stored by computing a SHA1 hash of each file therein, recursively. The contents of each file may be described just as effectively by a hash as by its raw contents, since the state is only used for comparison. By recording the state in such a manner, Pythia is able to check for the presence and absence of files within the sandbox, and to ensure the correctness of their contents.

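The recorded sandbox state might be computed along the following lines; the helper name is hypothetical.

    import hashlib
    import os

    def sandbox_state(sandbox):
        # Map each file's path (relative to the sandbox root) to the SHA1
        # hash of its contents, walking the sandbox recursively.
        state = {}
        for root, _dirs, files in os.walk(sandbox):
            for name in files:
                path = os.path.join(root, name)
                with open(path, 'rb') as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                state[os.path.relpath(path, sandbox)] = digest
        return state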

The time property specifies the number of seconds taken for the test to finish its execution. Pythia uses this information to automatically set appropriate time limits for each test. Specifically, Pythia scales this value using the formula given in Equation 3.1:

t′ = P · G · t + ∆ (3.1)

where t′ is the calculated time limit for a given test, t is the time taken by the oracle to finish that test, ∆ is an offset, and P and G are platform and generic scaling factors. ∆ and G are used to control for variance in execution time. P is used to take the expected execution time on one machine and apply it to another; this term is used when the oracle is built on a faster machine than the one used to conduct repair. By assigning individual time limits to each test, rather than a single time limit for all tests—as is the current standard—Pythia significantly reduces the time taken to evaluate the entire test suite for worst-case repairs.
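Equation 3.1 can be read directly as a small helper; the default factor values below are illustrative, not those used by Pythia.

    def time_limit(oracle_time, platform_factor=1.0, generic_factor=2.0, offset=1.0):
        # t' = P * G * t + delta (Equation 3.1)
        return platform_factor * generic_factor * oracle_time + offset

    print(time_limit(1.432))   # 3.864 seconds for the oracle entry in Figure 3.4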

Limitations

Although Pythia addresses certain kinds of test suite weaknesses, allowing practitioners to gain a higher degree of confidence in the results of their experiments, it is not without its limitations and trade-offs:

• Non-determinism: in cases where some aspect of the behaviour of a particular test case is non-deterministic, correct results may fail to agree with the oracle, resulting in false negatives. To our knowledge, this problem does not affect any of the benchmarks within the benchmark sets discussed earlier; the SIR benchmarks, in particular, specifically go about removing sources of non-determinism from the program.

• Inadequate test coverage: although Pythia prevents certain common degenerate solutions from being accepted (e.g., early program exits), it is unable to prevent false positives stemming from a lack of coverage (e.g., a faulty if-condition that is only covered once may be replaced by a tautology or a fallacy).

• Inefficient copying of files: in cases where the program does not write to its input files, copying them to the sandbox is a waste of resources. However, in the general case, variants of the program may have the ability to write to that file, and so, to avoid side-effects, a copy of that file must be made. Alternatively, one could solve the problem by ensuring the file is read-only and creating a symbolic link within the sandbox.

• Ignorance of file attributes: properties of files, such as their permissions, ownership, and associated dates, are not recorded by Pythia. In general, we


observe few instances where such attributes are relevant to testing. Moreover, recording certain of these properties may introduce non-determinism.

• Modification of files outside of the sandbox: certain tests may involve files outside of the sandbox (e.g., htpasswd). To protect these files from being infected and left in an altered state following a test execution, one must take further steps to isolate the test execution. A particularly expensive option would be to launch a separate container (discussed further in Section 3.3) for each execution. An alternative—albeit complex—solution would be to (optionally) use the sandbox as a chroot jail for each test execution, guaranteeing a side-effect-free execution and a complete description of the state of the file system.

Despite these drawbacks, Pythia provides a lightweight way of improving the quality of existing test suites, and makes performing experiments easier and their results more robust. Future work may address some of these issues to further enhance the utility of Pythia.

3.3. RepairBox

Having studied and addressed the problem of inadequate test suites within APR experimentation, in this section we look at the related problem of reproducibility of results, before introducing a framework to eradicate the problem with minimal loss of efficiency.

Motivation

In addition to exposing weaknesses within publicly available bug scenarios, recent studies have also highlighted the difficulties in accurately reproducing the results of previous experiments. Whilst investigating the false positive repairs generated by AE and GenProg on the ManyBugs bug scenarios, Qi et al. [2015] were unable to reproduce the original (plausible, but incorrect) repairs generated by GenProg and AE in some instances. With no information about the environment of the experiment beyond a (partially configured) virtual machine, the configuration required to produce these original results is difficult to replicate.

Through personal experience with the ManyBugs bug scenarios, we have encountered difficulties in installing newer software on the virtual machine images provided by previous experiments. Given the lack of any additional documentation with these VM images, creating a new, minimal environment that replicates the bug can prove challenging.

In other work, where template-driven automated program repair was applied to Java programs [Kim et al., 2013], none of the source code used within the experiment was

made available; only informal, ambiguous descriptions of the system were given, making it impossible to reproduce the results of the original system.

Existing Solutions

In this section, we critique existing approaches to making the results of APR experiments reproducible.

Source Files

The simplest technique for achieving reproducibility is to specify the name of a bug scenario within a publicly available set of benchmarks and the release version of the operating system used by the experiment, and to provide a set of source code files for the faulty and fixed versions of the program, omitting all other details. Crucially, this approach misses details including the state of relevant environment variables, the packages installed on the system (and their versions), and the state of various configuration files (many of which may be unknown to the experimenter).

Running the experiments within the specified operating system is far from guaranteed to produce the same results, unless all other relevant setup stages are made clear and followed correctly. Additionally, in some cases, the correct functioning of the program also relies on the absence of certain files and packages from the system. In practice, this leads to needless hours spent trying to discover why a particular experiment will not reproduce the original results, and taking measures to try to align the results with those that are expected. In the worst case, the original experiment may have suffered from a mistake somewhere in its configuration, but without clear reference to the complete setup, this is unknowable.

Virtual Machine Image

The most common approach to the problem of reproducing results is to provide an accompanying virtual machine image, used to conduct the experiments. Conducting experiments within a VM introduces considerable performance overheads, significantly increasing the already long run-time of APR experiments. Additionally, such an approach substantially increases disk space and bandwidth requirements, as the user needs to download a complete image of the VM for each bug scenario whose results they wish to replicate.

Furthermore, such an approach may prove excessively cumbersome when one wishes to modify or extend the experiment. In particular, years later, aspects of the virtual machine may cease to work correctly (e.g., package managers), or additional, conflicting software may need to be installed to perform a modified version of the

experiment. Since the steps taken to create the VM are unclear, modification can prove difficult and impractical.5

Amazon EC2 AMI

Rather than supplying a virtual machine image in a common, open-source format, one may provide a machine image as a proprietary AMI (Amazon Machine Image),6 usable on Amazon's EC2 cloud computing service. This approach, used in later experiments performed with GenProg, avoids some of the overheads involved in running a standard VM on the user's hardware, allowing it to be run on a faster, purpose-built cloud computing platform. One may then provision (multiple) machines on the EC2 platform using the provided AMI to reproduce the results of an experiment.

Since this approach only provides an image of the system used to perform the experiment, without details of its creation, it also inherits the problems of extensibility and bandwidth from the VM approach. As the AMI file format is proprietary, the problem of extensibility is further exacerbated. This restriction also prevents the machine image from being run on locally available hardware (e.g., university compute clusters), and forces the user to use—and pay for—Amazon's compute services to reproduce the result.

Vagrant and Puppet

In earlier attempts to produce a suitable platform for program repair research, we explored the combined use of Vagrant7 and Puppet8. We used these technologies to generate and provision a VM from a Vagrantfile, describing a base VM in a human-readable format, and a set of Puppetfiles, providing instructions on how to configure the VM and install the necessary packages. Whilst this approach partially solved the problem of transparency encountered when using VM images, extending the VM remained overly tedious, and performance remained inferior to that of a container-based approach.

RepairBox

To avoid these problems, we present RepairBox, a platform for safely conducting controlled, reproducible program repair experiments, with minimal compromise to

5 We encountered this problem of extensibility when attempting to install a newer version of OCaml, required by the software we were using, onto the VM of GenProg's 2010 results. The package manager used by Fedora had transitioned from yum to dnf, and all of its sources were outdated. Further, packages installed on the system prevented this version of OCaml from being installed.
6 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html. [Accessed: May, 2017].
7 https://www.vagrantup.com/. [Accessed: May, 2017].
8 https://puppet.com/. [Accessed: May, 2017].


Figure 3.5: Docker Containers (left) vs. Virtual Machines (right). Each container sits on top of the Docker runtime, which provides access to the kernel of the host machine. Each virtual machine virtualises its own stack, down to the level of the hardware, and sits on top of a hypervisor, which allows multiple VMs to run on the same machine. performance. RepairBox operates by isolating individual bug scenarios and repair tools into separate, minimal Docker containers that can be easily inspected and ex- tended. By packaging bug scenarios as containers, RepairBox attains close to bare metal performance, an important trait when performing a large number of repeats of computationally-intensive program repair experiments.

As its means of packaging each bug scenario into a controlled, self-contained, exe- cutable environment, RepairBox uses Docker containers. Container images provide a lightweight means of encapsulating a piece of software and all of its dependencies into a single, portable executable package, referred to as a container.9 Unlike virtual machines, which virtualise at the hardware level, containers virtualise at the operat- ing system level (illustrated in Figure 3.5). To do so, containers share the operating system kernel of their host machine. As a result, containers use less compute and memory resources than their VM counterparts, giving them a signicantly higher level of performance [Felter et al., 2015]. Additionally, since containers are built as a series of le system layers, they typically size in the tens of MBs, rather than tens of GBs.

By packaging bug scenarios as Docker images, users can quickly download pre-built images from DockerHub,10 using the repairbox download command.

To conduct experiments using these repair boxes, users can combine them with containers holding the repair tools required for the experiment, as illustrated in Figure 3.6. By adopting this approach, users can provide transparent, high-performance, executable experiment designs as the replication package for their research.

9 https://www.docker.com/what-container. [Accessed: April, 2017].
10 https://hub.docker.com. [Accessed: April, 2017].


Figure 3.6: Repair boxes can be used with repair tools by mounting the binaries provided by a repair tool container into the repair box as a Docker volume.

Each bug scenario in the repository is accompanied by a Dockerfile. This file provides an executable set of instructions for constructing the corresponding image. An excerpt of the Dockerfile for one of the libtiff scenarios is given in Figure 3.7.

To compose the image, Docker uses each instruction in the Dockerfile (e.g., RUN, FROM, ADD) to construct a series of read-only layers. Each of these layers represents a filesystem difference.11 This approach allows Docker to cache layers, thus reducing build time. Importantly, it also allows images to be used as the base layer of another image, using the FROM instruction. RepairBox uses layering to efficiently provide minimal images for its bug scenarios. Each bug scenario is composed of a stack of images, illustrated in Figure 3.8. At the bottom of this stack sits the universal base image used by all bug scenarios, which provides common functionality to all repair boxes. Above this image there rests an image for the particular dataset to which the scenario belongs. On top of the dataset image sits an image for the particular program. Finally, at the top of the stack rests an image for the bug scenario.

Each repair box is structured identically, with all of its relevant files (e.g., source code, test suite, meta-data) within the /experiment directory. To allow test prioritisation, reduction, and sampling to be used, each scenario implements a simple bash-based test harness, which allows a single, specified test to be executed, rather than the entire suite.

For some bugs within RepairBox, the test suite is provided by the original dataset. For the rest, where we observe issues with test case interference and inadequate checking (i.e., only the exit status of the program), we use Pythia, described in Section 3.2, to generate a stronger oracle from the original test suite. For each test execution, Pythia creates a sandboxed environment (realised as a temporary directory) in which to perform the test. This behaviour allows us to avoid the problem of test interference.

11 https://www.docker.com/what-container. [Accessed: April, 2017].


    FROM squareslab/repairbox:manybugs64
    MAINTAINER Chris Timperley "[email protected]"

    # install dependencies
    RUN sudo apt-get update && \
        sudo apt-get install -y --force-yes \
          gcc-multilib psmisc zlib1g-dev && \
        sudo apt-get clean && \
        sudo apt-get autoremove && \
        sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

    # download Libtiff source code
    RUN git clone https://github.com/vadz/libtiff src
    ...

Figure 3.7: An example Dockerfile for one of the libtiff scenarios within RepairBox.

Containers

By using a Docker container as a means of accurately replicating a particular bug scenario, rather than employing a VM-based approach, we can create reproduction packages for APR experiments with the following advantages:

• Transparency: since the Dockerfile specifies the exact steps required to build the container, all of its details are available to the user, making future modification and bug understanding simpler.

• Minimality: each Dockerfile is based on a minimal base image (for either Ubuntu 14.04 or Fedora 23), between 200 and 400 MB in size, which is extended with the minimal set of packages and configuration required to replicate a particular bug and nothing more.

• Extendibility: Docker container images are built around the concept of layering, where the image for a container is built as a series of layer images, each of which is cached, starting from a base image. Each layer represents the resulting system image when a given instruction (e.g., apt-get install ...) is executed on the resulting container for the previous layer.

Instead of building each of the RepairBox bug scenarios from scratch (using the provided Makefile), the user may download a pre-built image from DockerHub using RepairBox's command-line interface.


Figure 3.8: The Docker image for a given bug scenario is built as a series of layers. Each layer is shared by a number of images on the layer above, reducing disk usage and build time.

• Performance: unlike traditional, hypervisor-based virtual machines, containers share the kernel of their host operating system, removing the need to virtualise the kernel and hardware stack. As a result, the performance of containers tends to be significantly better than that of VMs, using fewer clock cycles, and less memory and disk space.

Because containers share the kernel of the host machine, bugs that rely on a particular kernel are difficult, and in some cases impossible, to reproduce within a Docker container. This limitation prevents the inclusion of the valgrind bug scenarios from ManyBugs. A similar limitation applies to bug scenarios that only occur on 32-bit architectures. In this case, however, we can usually reproduce the bug by passing the appropriate flags to make, or by using a Docker image with 32-bit libraries as the base image.

At the cost of performance, those bug scenarios could be included in RepairBox by hosting them within a suitable virtual machine (possibly provided by boot2docker12). We leave this to future work. In most cases, however, we have found that a specific kernel is not necessary.

Composition

To simplify the management of repair boxes through the RepairBox command-line interface, discussed in Section 3.3.5, we require that repair boxes provide manifests. These manifests, encoded as YAML files, specify: the name of the bug scenario, the dataset (and optionally, the program within a dataset) to which they belong, the name of the Dockerfile used for their construction, as well as any build-time arguments that should be passed to Docker. An example manifest is given in Figure 3.9. Similar manifests must also be provided for datasets and programs.

    bug: 69223-69224
    build-arguments:
    scenario: python-bug-69223-69224
    dataset: manybugs
    dockerfile: Dockerfile.bug
    program: python
    version: 0

Figure 3.9: An example bug scenario manifest.

    version: 0
    tool: genprog
    image: christimperley/genprog:latest
    environment:
      PATH: "/opt/genprog3:${PATH}"

Figure 3.10: An example tool manifest describing GenProg.

All of these entities (bug scenarios, programs, datasets) must provide a version suffix parameter. Complete version numbers for entities are generated by concatenating these suffixes. For instance, for the bug scenario described in Figure 3.9, the complete version number will be given by "D.P.B", where D, P, and B are the version suffixes for the dataset, program, and bug, respectively. This information is used by RepairBox to determine whether a particular component is out-of-date and should be updated. Manifests may also be provided for containerised tools (e.g., GenProg or Daikon). An example tool manifest is given in Figure 3.10. This manifest specifies a name and version for the tool, the name of the Docker image for its container, and a dictionary describing which environment variables should be modified within the repair box upon launch.

12 http://boot2docker.io. [Accessed: April, 2017].
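To make the version-numbering scheme concrete, the following is a minimal sketch, not taken from the RepairBox implementation, of how a manifest might be read and the complete "D.P.B" version string derived. The file names and the use of PyYAML are assumptions made purely for illustration.

    import yaml  # assumes the PyYAML package is available

    def load_manifest(path):
        # Read a YAML manifest (bug scenario, program, or dataset) from disk.
        with open(path) as f:
            return yaml.safe_load(f)

    def complete_version(dataset, program, bug):
        # Concatenate the dataset, program, and bug version suffixes ("D.P.B").
        return "{}.{}.{}".format(dataset["version"], program["version"], bug["version"])

    if __name__ == "__main__":
        # Hypothetical file names; RepairBox's actual layout may differ.
        bug = load_manifest("bug.manifest.yml")
        program = load_manifest("program.manifest.yml")
        dataset = load_manifest("dataset.manifest.yml")
        print(complete_version(dataset, program, bug))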

Command-Line Interface

RepairBox hides the details of its implementation from its end-users through a simple, self-documenting command-line interface (CLI). Below, we briefly discuss the features of this CLI and provide examples of its use:

• repairbox -h: describes each of the commands provided by the RepairBox CLI.

• repairbox list: produces a list of all artefacts of a given kind (e.g., bug scenarios, datasets, tools). Example uses of this command are given in Figure 3.11.

    $ repairbox list bugs

    Bug                        Latest   Installed  Remote
    ------------------------------------------------------
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2005...   0.0.0.0  0.0.0.0    0.0.0.0
    manybugs:libtiff:2006...   0.0.0.0  0.0.0.0    0.0.0.0
    ...

    $ repairbox list tools

    Tool          Image                            Installed
    ------------------------------------------------------
    genprog       christimperley/genprog:latest    True
    searchrepair  searchrepair:latest              True
    shuriken      shuriken:latest                  True
    ...

Figure 3.11: Example uses of the repairbox list command.

• repairbox install: installs a specified repair box (or collection of boxes). If the latest version of the repair box is hosted on DockerHub, its image will be downloaded. If not, the repair box and its dependencies will be built from scratch locally.

This command may also be used to install a registered tool onto the local machine by downloading its image from DockerHub.

• repairbox uninstall: removes a specified (set of) repair boxes from the local machine.

• repairbox build: constructs a named repair box (or collection of boxes) and their dependencies locally.

• repairbox download: downloads a pre-built image for a (set of) repair box(es) or a repair tool from DockerHub.

• repairbox launch: spawns an interactive container for a given repair box. The names of supported repair tools may also be passed via the --tools argument, causing the binaries for those tools to be imported into the container. Internally, this command oversees the generation, linking, and clean-up of the required containers. An example use of this command is given in Figure 3.12.

    host> repairbox launch manybugs:libtiff:2005-12-14-6746b87-0d3d51d \
             --with genprog daikon
    rbox> genprog problem.json
    ...

Figure 3.12: An example use of the repairbox launch command.

3.4. Methodology

Using the resources provided by the RepairBox project, together with Pythia, we put forward a methodology for automated program repair experiments that is used throughout the rest of this thesis.

Executable Experiment Designs

Experiment designs are described by simple scripts which interact with containerised bug scenarios and tools through the RepairBox interface. This approach allows experiments to be concisely described, more easily understood, and to be accurately reproduced. By imposing this requirement on the experiment, the benefits of using containerisation, namely performance, extendibility, and isolation, are inherited by the experiment. Since this script contains all of the steps necessary to run the experiment, it essentially becomes an executable experiment design.
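As an illustration only, the sketch below shows what such an executable experiment design might look like. The bug scenario name is taken from Figure 3.12, but the non-interactive use of repairbox launch, the log-file naming, and the number of repeats are illustrative assumptions rather than part of the methodology or any actual replication package.

    import subprocess

    BUG = "manybugs:libtiff:2005-12-14-6746b87-0d3d51d"  # example bug scenario
    REPEATS = 20                                          # repeats for later statistical testing

    def run_trial(trial):
        # Launch the repair box with GenProg's binaries mounted and record the
        # output of a single repair attempt for later analysis.
        cmd = ["repairbox", "launch", BUG, "--with", "genprog"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        with open("trial-{}.log".format(trial), "w") as log:
            log.write(result.stdout)

    if __name__ == "__main__":
        for trial in range(REPEATS):
            run_trial(trial)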

Host Platform

In the interest of identifying and avoiding rare bugs that only occur when particular kernel versions are used, the experiment must state the kernel version and operating system release used by the host machine. Additionally, where relevant, a description of the hardware specifications of the host machine should also be given.

For the majority of the experiments conducted in this thesis, we used cloud computing facilities to distribute the load across a large number of machines, hosted on Amazon AWS and Microsoft Azure. Directions on expediting the running of experiments within a distributed environment are provided with each experiment's reproduction package.

Justification of Benchmarks

In addition to supplying a simple script capable of running the experiment, this methodology requires that users justify their selection of benchmarks within the context of their research questions. Any weaknesses, trade-offs, or artificiality within the bug scenarios must be stated and mitigated.


Statistical Hypothesis Testing

Where possible, we advocate that statistical hypothesis testing be used to demonstrate the statistical significance of any results. In most cases, we do not have reason to assume that the results will fit a particular probability distribution, and so we must use nonparametric statistics to describe and test them [Downey, 2011]; nonparametric statistics should also be used when the median is a more appropriate measure of the central tendency of the data. Although nonparametric statistics allow us to describe and test data without assuming a particular distribution, they do so at the cost of decreased statistical power (i.e., the probability of correctly rejecting the null hypothesis). This decreased statistical power means that a greater number of samples is needed to increase the likelihood of detecting a significant effect when one exists. To test whether the performance of one search algorithm is stochastically greater than that obtained by another, a one-sided test for statistical significance should be used, such as the Mann-Whitney U test [Mann and Whitney, 1947]. To test whether two groups are drawn from the same distribution, one may use a two-sided Kolmogorov-Smirnov (KS-2) test [Downey, 2011].
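As a brief illustration, assuming SciPy is available and that a and b hold a performance measure (such as NCP) gathered from repeated runs of two search algorithms, the tests above might be applied as follows; the sample values are invented.

    from scipy.stats import mannwhitneyu, ks_2samp

    a = [41, 38, 45, 39, 44, 40, 42, 37, 46, 43]  # hypothetical results for algorithm A
    b = [29, 33, 27, 31, 30, 28, 35, 32, 26, 34]  # hypothetical results for algorithm B

    # One-sided Mann-Whitney U test: is A stochastically greater than B?
    u_statistic, u_p_value = mannwhitneyu(a, b, alternative='greater')

    # Two-sided Kolmogorov-Smirnov test: are A and B drawn from the same distribution?
    ks_statistic, ks_p_value = ks_2samp(a, b)

    print(u_p_value, ks_p_value)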

In addition to determining statistical significance, one should also measure effect size (i.e., how big is the difference between the two groups?). Vargha and Delaney [2000]'s A12 measure is a popular choice for nonparametric effect-size measurement. Intuitively, A12 describes the probability that a randomly selected value from group A is greater than one from group B [Neumann et al., 2015]. Vargha and Delaney [2000] provide guidelines for interpreting the size of the effect: 0.5 indicates no effect, 0.56 is a small effect, 0.64 is a medium effect, and 0.71 is a large effect.
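A12 can be computed directly from its definition; the sketch below is a straightforward counting implementation (ties counted as half), with invented sample data.

    def a12(group_a, group_b):
        # Probability that a randomly selected value from group A is greater
        # than a randomly selected value from group B, counting ties as 0.5.
        wins = sum(1.0 for a in group_a for b in group_b if a > b)
        ties = sum(0.5 for a in group_a for b in group_b if a == b)
        return (wins + ties) / (len(group_a) * len(group_b))

    print(a12([5, 6, 7], [1, 2, 3]))  # 1.0: every value in A exceeds every value in B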

3.5. Conclusion

In this chapter, we identified and discussed the problems of inadequate test suites and incomplete reproduction packages within studies of APR, before introducing Pythia and RepairBox to address them. Using both of these tools, we propose a more robust, efficient experiment methodology, allowing research to be confidently performed for a diversity of research questions using hundreds of bug scenarios, collated and edited from several existing datasets. RepairBox and Pythia are available to download at:

https://github.com/squaresLab/RepairBox

https://github.com/ChrisTimperley/Pythia

Over time, we intend to explore further avenues of improvement to RepairBox, allowing it to be used for a wider range of experiments within program testing and genetic improvement. Additionally, we plan to incorporate further sources of bugs into RepairBox, including the DARPA Cyber Grand Challenge13, IntroClass, and CVE14.

13 http://archive.darpa.mil/cybergrandchallenge/
14 https://cve.mitre.org/

CHAPTER 4

Fault Localisation

For automated program repair to become an economical and effective means of fixing bugs, and thus break through into the mainstream, techniques will need to be capable of solving a wide range of bugs within a reasonable resource window. To go beyond the small number of fixes possible with current techniques, search-based repair will need to incorporate a richer repair model, as discussed in Chapter 5. Additionally, techniques will need to be capable of generating patches spanning multiple lines, going beyond the single-line patches currently produced by all but two approaches [Le Goues et al., 2012a; Mechtaev et al., 2016]. As a result of these requirements—a richer repair model and the ability to produce multi-line fixes—the search space will undergo a combinatorial explosion. As one approach to tackling this growth, more accurate means of localising the source of faults will need to be discovered. At present, all repair techniques make use of relatively simple spectrum-based fault localisation techniques, which often prove ineffective at pinpointing suspicious statements within a basic block (since all statements within a basic block share the same coverage, except in the case of crashing and non-terminating faults).

Previous work by Qi et al. [2013] compared the effectiveness of several spectrum-based fault localisation techniques by measuring the average number of candidate patches (NCP score) that were evaluated before a repair was found (using GenProg). On a sub-set of the ManyBugs benchmarks [Le Goues et al., 2015], Qi et al. [2013] found that the Jaccard metric (discussed in more detail in Section 4.1) yielded the lowest NCP. Almost all of the solutions that were found during the search were subject to overfitting, however, and so it is unclear whether Jaccard was more effective at localising the fault or simply better at selecting areas prone to overfitting. Regardless, the accuracy of spectrum-based fault localisation is simply not good enough for the purposes of repair. Often, hundreds of statements will be assigned the same suspiciousness score—a value which determines the probability of a statement being selected as a subject for repair.

Whilst there are a number of complementary methods for fault localisation, discussed in Section 4.1, almost all of these are performed in an offline, preprocessing step. In this chapter, we turn our attention to the young and promising field of Mutation-Based Fault Localisation (MBFL) [Moon et al., 2014a; Papadakis and Le Traon, 2015] to determine whether information collected over the course of the search, in the form of mutant test suite results, can be harnessed to refine the fault localisation online. More specifically, we ask:


RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

To answer this question, we first perform an exploratory analysis of the test case results of randomly sampled single-edit mutants within GenProg's search space. Based on the findings of this analysis, we propose a number of fault localisation techniques which incorporate knowledge of mutant test suite evaluations. In contrast to the impressive results demonstrated by previous studies on MBFL, we find that the positive contributions of this knowledge to the accuracy of the fault localisation are negligible. In Section 4.4, we speculate on the reasons for this result and discuss alternative means of improving the fault localisation process.

In summary, the primary contributions of this chapter are the following:

• A detailed analysis of the test suite results of mutants sampled from GenProg’s search space, covering 28 bugs in six real-world C programs.

• An evaluation of several alternative fault localisation techniques that use the test case outcomes of mutants produced during the search.

• An informed discussion of the limitations of GenProg's statement-level repair actions in identifying faulty locations and how these limitations might be addressed through the use of more granular repair actions.

The rest of this chapter is structured as follows: Section 4.1 presents a brief review of several approaches to automated fault localisation, the majority of which have not been applied to automated program repair. Section 4.2 conducts an analysis of the test case outcomes of mutants randomly sampled from GenProg's search space, in an effort to determine whether test suite behaviour can be used to discriminate faulty statements from correct ones. Section 4.3 proposes and evaluates a number of fault localisation measures based on the findings of the previous section. Finally, Section 4.4 summarises the findings of the study and discusses a number of possible explanations for the lack of noticeable improvement, as well as future directions for research.

4.1. Background

In this section, we provide the reader with a brief introduction to a number of different methods for automated fault localisation.


Spectrum-Based Fault Localisation

Spectrum-Based Fault Localisation (SBFL), also referred to as statistical fault localisation [Landsberg et al., 2015], operates by assigning each statement in the program a suspiciousness value s ∈ R, encoding the likelihood that the statement is (partly) responsible for the fault. In most cases, SBFL proceeds to rank each statement within the program by its suspiciousness. An exception is automated repair, where most approaches use the suspiciousness values themselves to better decide exactly how resources should be spent during the search for a repair.

When used to manually identify and repair bugs, suspiciousness information is visualised by highlighting source code lines within a viewer with different brightness and hue components, allowing suspicious statements to be quickly identified, as illustrated in Figure 4.1 [Jones et al., 2002]. Through such visualisation, developers can better decide where to focus their search for the bug, leading to a substantial reduction in the time and effort involved in debugging [Jones et al., 2002].

To generate a set of suspiciousness values, SBFL first generates a program spectrum, concisely summarising the coverage data in the form of an n-by-4 array, where n is the number of statements within the program [Yoo, 2012]. Each row in the array encodes a summary of the execution data for a given statement, in the form of four integer variables, (ep, ef, np, nf). ep and ef state the number of times that the statement was executed by a passing and failing test case, respectively. Conversely, np and nf specify the number of times the statement was not executed by a passing and failing test case. These counters may record the number of times a statement was executed in total (i.e., multiple executions of a statement for a given test case may count), or, as is more often the case, the number of test cases that executed the given statement (i.e., only a single execution is recorded for each test case).

Once the program spectrum has been generated, SBFL proceeds to compute suspiciousness values for each statement through the use of a suspiciousness metric, also known as a risk evaluation formula. This metric maps each row in the program spectrum, (ep, ef, np, nf), to a suspiciousness value, represented by a real number. By applying this mapping across the entire spectrum, the array is reduced to a list describing the relative suspiciousness values of each statement in the program. An example of a program spectrum can be found in Section 2.1.
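The following sketch, assuming per-test statement coverage is already available as plain Python sets, shows how the (ep, ef, np, nf) spectrum described above can be assembled by counting test cases (one count per test); it is an illustration, not the implementation used elsewhere in this thesis.

    def build_spectrum(coverage, passed):
        # coverage: maps each test name to the set of statements it executes.
        # passed:   maps each test name to True (passing) or False (failing).
        statements = set().union(*coverage.values())
        spectrum = {}
        for s in statements:
            ep = sum(1 for t in coverage if s in coverage[t] and passed[t])
            ef = sum(1 for t in coverage if s in coverage[t] and not passed[t])
            n_p = sum(1 for t in coverage if s not in coverage[t] and passed[t])
            n_f = sum(1 for t in coverage if s not in coverage[t] and not passed[t])
            spectrum[s] = (ep, ef, n_p, n_f)
        return spectrum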

Spectrum

The term “spectrum” was introduced to the field by Reps et al. [1997], and later generalised by Harrold et al. [2000]. Generally, the term is taken to refer to the frequency of statement execution, whether the frequency be a measure of the number of test cases that execute a given statement, or the total number of times a statement is executed by all (positive or negative) test cases. Alternatively, when used as a debugging aid for humans, lines may be used as the entities of interest when computing the spectrum, rather than statements. Similarly, one can monitor the execution of function calls, program paths, n-grams, program slices, use-def chains, invariants, and more [Naish et al., 2011; Renieris and Reiss, 2003].

Figure 4.1: A screenshot of the original Tarantula fault localisation visualisation tool. Lines coloured red are executed primarily by the failing test cases, suggesting a high suspiciousness. Lines that are mostly covered by the passing test cases are coloured green. Brightness is used to indicate a form of confidence in the suspiciousness attributed to a given line: the brightness of a line is given by the greater of the fraction of failing tests covering the line and the fraction of passing tests covering it.

Source: http://spideruci.org/fault-localization/ (May 2017).

Suspiciousness Metric

As numerous studies have shown [Jones and Harrold, 2005; Qi et al., 2013; Yoo, 2012], the effectiveness of SBFL can be attributed to the choice of suspiciousness metric, of which there are many. Since the publication of Tarantula [Jones et al., 2002], there has been a proliferation of suspiciousness metrics. Over the rest of this section, we shall investigate a handful of the most popular and successful metrics put forth in the literature; a small code sketch implementing several of them follows the list.

• Set Union and Intersection: Two of the simplest suspiciousness metrics are the set union and set intersection metrics [Renieris and Reiss, 2003]. These metrics assign a unitary suspiciousness value to all statements that belong to a given set. The sets for the set union and set intersection metrics are given in Equations 4.1 and 4.2, respectively:

    f - \bigcup_{s \in S} s    (4.1)

    \bigcap_{s \in S} s - f    (4.2)

where f is the set of statements covered by a (singular) failing run, each s ∈ S is the set of statements covered by a given passing run, and − represents the non-symmetric set difference.

In its standard form, the set union metric assigns a unitary suspiciousness value to each statement within the difference between the set of statements executed by a single failing test case and the union of the statements executed by the passing test cases. In cases where the faulty statement is executed by both the passing and failing test cases, such as an incorrect conditional, the faulty statement will be missed entirely. The intuition behind the set union metric is that the bug is to be found within the statements uniquely executed by the failing test case.

The set intersection metric, on the other hand, is built upon the polar opposite intuition: the faulty statement is absent from the failing test case. Whilst the set difference may aid the programmer's understanding of the faulty test case behaviour, it also prevents the program from being repaired automatically, as all changes will go unnoticed (since none of the mutated statements are ever executed by the failing test case).

91 FAULT LOCALISATION

Pan and Spafford [1992] and Chen and Cheung [1997] propose soft extensions to these metrics, which incorporate frequency of execution, similar to the majority of other suspiciousness metrics.

• Nearest Neighbour: A slightly more advanced technique than the set union and set intersection methods discussed above is the nearest neighbour metric, introduced by Renieris and Reiss [2003].

This approach first determines the passing test case with the smallest distance to the failing test, where distance is measured as the cardinality of the non-symmetric difference between the failing test spectrum and the passing test spectrum, as shown by Equation 4.3:

F − P (4.3)

where P is the set of statements covered by a passing test, and F is the set of statements covered by the failing test.

Once the passing test with the closest behaviour to the failing test, as measured by the distance metric, has been found, all statements within F − P are assigned a unit suspiciousness value.

• Tarantula: A highly popular metric, credited as the first system to apply spectrum-based ranking to software diagnosis for an imperative language [Jones and Harrold, 2005; Naish et al., 2011]. Originally designed as a means of visualising suspicious locations within a faulty program [Jones et al., 2002], but later adapted and more widely used as a suspiciousness metric in its own right [Jones and Harrold, 2005]; an adapted form of this metric is given in Equation 4.4.

    \frac{\frac{e_f}{e_f + n_f}}{\frac{e_f}{e_f + n_f} + \frac{e_p}{e_p + n_p}}    (4.4)

Although not optimal, Tarantula has been shown to give good results for automated repair [Qi et al., 2013] and has served as the suspiciousness metric of choice for several automated repair systems [Ke et al., 2015; Kim et al., 2013; Le Goues et al., 2012a; Qi et al., 2014].

• GenProg: The default suspiciousness metric in GenProg, given by Equation 4.5, attaches a minimal, non-zero weight to statements that are executed by both the positive and negative test cases, and a maximal weighting to statements that are executed by the negative test exclusively. Statements that are not executed by the negative test case are assigned a suspiciousness of zero, and excluded from consideration.


    \begin{cases} 1.0 & \text{if } e_f > 0 \wedge e_p = 0 \\ 0.1 & \text{if } e_f > 0 \wedge e_p > 0 \\ 0 & \text{if } e_f = 0 \end{cases}    (4.5)

• Op1 and Op2: Op1 and Op2, given in Equations 4.6 and 4.7, respectively, are two suspiciousness metrics introduced by Naish et al. [2011]. Unlike previous metrics, Op1 and Op2 are designed to perform optimally for single-fault programs on a model program, ITE2, composed of two consecutive If-Then-Else statements. Beyond theoretical optimality, empirical observations have also demonstrated the strong performance of these metrics on a widely studied sub-set of single-fault bugs from the SIR dataset.

    \begin{cases} -1 & \text{if } n_f > 0 \\ n_p & \text{otherwise} \end{cases}    (4.6)

    e_f - \frac{e_p}{e_p + n_p + 1}    (4.7)

• Jaccard and Ochiai: Jaccard [Jaccard, 1901], given in Equation 4.8, and Ochiai [Ochiai, 1957], given in Equation 4.9, are both examples of suspiciousness metrics that have been adapted from existing similarity metrics in other fields [Abreu et al., 2007; Naish et al., 2011].

    \frac{e_f}{e_f + n_f + e_p}    (4.8)

    \frac{e_f}{\sqrt{(e_f + n_f) \cdot (e_f + e_p)}}    (4.9)

• Evolved: Yoo [2012] transforms the problem of deciding a suitable metric into a search problem, using genetic programming to evolve an effective metric for automated repair. The results of this search are a set of metrics that outperformed the most successful human-designed metrics proposed up to that point, including Op1, Op2, and Jaccard.
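The sketch below implements several of the metrics above as plain functions over a single spectrum row (ep, ef, np, nf); the guards against division by zero are an implementation convenience rather than part of the published definitions.

    import math

    def tarantula(ep, ef, n_p, n_f):
        # Equation 4.4.
        fail_ratio = ef / (ef + n_f) if (ef + n_f) else 0.0
        pass_ratio = ep / (ep + n_p) if (ep + n_p) else 0.0
        denominator = fail_ratio + pass_ratio
        return fail_ratio / denominator if denominator else 0.0

    def genprog_weighting(ep, ef, n_p, n_f):
        # Equation 4.5.
        if ef > 0 and ep == 0:
            return 1.0
        if ef > 0 and ep > 0:
            return 0.1
        return 0.0

    def op2(ep, ef, n_p, n_f):
        # Equation 4.7.
        return ef - ep / (ep + n_p + 1)

    def jaccard(ep, ef, n_p, n_f):
        # Equation 4.8.
        denominator = ef + n_f + ep
        return ef / denominator if denominator else 0.0

    def ochiai(ep, ef, n_p, n_f):
        # Equation 4.9.
        denominator = math.sqrt((ef + n_f) * (ef + ep))
        return ef / denominator if denominator else 0.0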

Comparison

Naish et al. [2011], and later Landsberg et al. [2015], proved that many of the proposed suspiciousness metrics are equivalent for the purposes of ranking, by exploiting their monotonicity, i.e., m1(x) < m1(y) =⇒ m2(x) < m2(y). Although this is an insightful result that goes some way towards explaining the successes of these metrics, for the purposes of automated repair, we are more interested in the absolute difference in scores between statements, so that we may more accurately decide how to divide our efforts.


To compare the effectiveness of different spectrum-based fault localisation techniques, Naish et al. [2012] formally introduce the Expense measure, which quantifies accuracy as the fraction of statements within the program that would need to be checked by the programmer before a faulty statement is found, were they to examine them in rank order.

Although Parnin and Orso [2011] find that ranking statements by their suspiciousness score allowed programmers to find a bug more quickly, work by Qi et al. [2013] demonstrates that suspiciousness metrics with a low Expense are not necessarily the best metrics for automated repair. As an alternative, Qi et al. [2013] propose measuring the number of candidate patches (NCP) produced by GenProg before a repair is found. To account for the stochastic nature of the search, the process must be repeated a fixed number of times and averaged.

Comparing the Expense and NCP measures of a number of proposed suspiciousness metrics across 11 benchmark programs taken from the GenProg ICSE 2012 benchmark suite [Le Goues et al., 2012a], Qi et al. [2013] found Jaccard to be the best performing metric, despite it previously being shown to produce sub-optimal rankings by Naish et al. [2012].

Following the proposal of the NCP measure, Moon et al. [2014a] introduced an alternative evaluation metric: Locality Information Loss, or LIL for short. Like NCP, LIL is able to approximate the difficulty of finding a repair for a given bug using APR techniques. Importantly, it does so without the need to perform expensive APR trials.

To determine the accuracy of a given set of suspiciousness values, LIL measures the distance between the probability distribution implicitly defined by those suspiciousness values, Pτ, and an ideal probability distribution, PL. Pτ and PL are used to describe the probability of an APR technique selecting a particular statement as the target location of the repair. Thus, the ideal probability distribution, PL, is defined as the distribution in which each of the faulty lines is assigned a maximal probability; all other correct lines are given an ε probability of selection, for reasons of numerical stability. The distance between Pτ and PL is computed using the Kullback-Leibler divergence, given in Equation 4.10:

    D_{KL}(P_L \| P_\tau) = \sum_{i} \ln\left(\frac{P_L(s_i)}{P_\tau(s_i)}\right) P_L(s_i)    (4.10)

where P(s_i) gives the probability of selection for the statement s_i.
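A minimal sketch of this computation is given below. It assumes suspiciousness values and the set of faulty statements are available as plain Python structures; normalising the suspiciousness values into a probability distribution is one reasonable reading of Pτ, not a prescription taken from the original paper.

    import math

    def kl_divergence(p_l, p_tau, eps=1e-9):
        # Equation 4.10: D_KL(P_L || P_tau); eps guards against division by zero.
        return sum(pl * math.log(pl / max(p_tau.get(s, eps), eps))
                   for s, pl in p_l.items() if pl > 0)

    def lil(suspiciousness, faulty_statements, eps=1e-9):
        # P_tau: suspiciousness values normalised into a selection distribution.
        total = sum(suspiciousness.values()) or eps
        p_tau = {s: max(v, eps) / total for s, v in suspiciousness.items()}
        # P_L: (near-)maximal probability on the faulty statements, eps elsewhere.
        p_l = {s: (1.0 if s in faulty_statements else eps) for s in suspiciousness}
        z = sum(p_l.values())
        p_l = {s: v / z for s, v in p_l.items()}
        return kl_divergence(p_l, p_tau)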

On a single bug from the GenProg TSE 2012 benchmarks (look-utx-4.3), Moon et al. [2014a] showed a correlation between LIL and NCP for a number of different spectrum-based fault localisation techniques. As previously demonstrated by Qi et al. [2013], Moon et al. [2014a] showed that the EXAM score appeared to have no predictive value.


Program Slicing

Program slicing is a technique for reducing programs to a minimal form, known as a slice, which still retains some given semantics of interest, known as the slicing criterion [Harman and Hierons, 2001]. Slices are generated by removing all parts of the given program which can be determined to have no effect on the slicing criterion. Slicing has been used for aiding program comprehension, debugging, dead code removal, cohesion analysis, and more [Silva, 2012]. Importantly, slicing may be used as a fault localisation technique by reducing the program to only those parts which have an effect on the location at which the failure was observed. Naturally, this prevents slicing from being used to find memory leaks, since the point at which the program crashes—when it runs out of memory—may have no relation to the fault.

Static Slicing

Program slicing was first introduced in 1979, as part of Mark Weiser's Ph.D. thesis [Weiser, 1979], in the form of a technique that would later become known as static program slicing [Weiser, 1981]. Static program slicing computes the set of statements S, referred to as the program slice, that may affect the value of a variable of interest v ∈ V at a statement x. Collectively, the set of variables of interest V and the statement of interest x are termed the slicing criterion, given as C = (x, V). The slice for C = (x, V) can also be computed from the union of the slices of each of its variables v ∈ V, as shown in Equation 4.11.

    C = (x, V) = \bigcup_{v \in V} (x, \{v\})    (4.11)

There are two types of program slice: the backward slice and the forward slice. A backward slice contains all statements in a given program P that may have an effect upon C, whilst a forward slice consists of the statements that C may affect. The program slice is usually computed by gradually removing statements from the program P which are irrelevant to the slicing criterion, leaving a minimal program which retains the desired properties of interest. By employing backwards slicing, we can remove all statements from the program which can be shown to have no (static) effect upon C, thus reducing the number of statements the programmer needs to consider and the debugging effort. An example of a backwards static slice is given in Figure 4.2.

Whilst backwards slicing can help to reduce the program to a smaller form, quite often the resulting slice remains of a considerable size. This is especially the case for large, well-constructed programs, where statements exhibit a high degree of cohesion; in such programs, the value of each variable is dependent on many others.


    Original program:

    1   read(text);
    2   read(n);
    3   int lines = 1;
    4   int chars = 1;
    5   string subtext = "";
    6   c = getChar(text);
    7   while (c != EOF) {
    8     if (c == '\n') {
    9       lines = lines + 1;
    10      chars = chars + 1;
    11    } else {
    12      chars = chars + 1;
    13      if (n != 0) {
    14        subtext = subtext << c;
    15        n = n - 1;
    16      }
    17    }
    18    c = getChar(text);
    19  }
    20  write(lines);
    21  write(chars);
    22  write(subtext);

    Backwards static slice for the criterion (20, lines):

    1   read(text);
    3   int lines = 1;
    6   c = getChar(text);
    7   while (c != EOF) {
    8     if (c == '\n') {
    9       lines = lines + 1;
    11    }
    18    c = getChar(text);
    19  }
    20  write(lines);

Figure 4.2: An example of backwards static slicing on a small program. The second listing shows the sliced form of the program in the first, where (20, lines) is set as the slicing criterion. Adapted from an example given in [Silva, 2012].


Dynamic Slicing

Whereas static slicing determines the set of statements that could affect the slicing criterion under all possible conditions, dynamic slicing [Korel and Rilling, 1998] produces the set of only those statements that are actually executed when the program is supplied with a certain input sequence [Harman and Hierons, 2001]. This technique builds upon static slicing, constructing a dynamic slice instead, according to a dynamic slicing criterion. The dynamic slicing criterion is almost identical to its static counterpart, except for the addition of an input sequence i, describing the particular set of inputs to the program for which a slice should be produced; it is described using the form C = (x, V, i). In the case of APR, each test case can be used to provide a particular input sequence.

Using a dynamic slice, rather than a static slice, tends to give a sizeable reduction in the number of statements that need to be considered when debugging the program for a particular failing test case. However, this reduction comes at the cost of having to compute the specific control and data-flow dependencies produced by the given input sequence; a problem which requires the resource-intensive construction of a complex data structure, such as a trace.

Other Forms of Slicing

A recent survey of program slicing techniques [Silva, 2012] listed 29 variants. Below, we briefly focus on the sub-set of these variants which are immediately applicable to fault localisation.

• Conditioned Slicing: Rather than slicing according to a specific execution of the program using a given input sequence, conditioned slicing [Ning et al., 1994] allows the slicing criterion to specify a boolean expression that must be satisfied by the program inputs. For instance, one may provide the boolean expression F: x = y − 1, instructing the slicer to consider only the possible executions of the program when its inputs x and y satisfy F. Using this information, the slicer is able to prune unreachable statements, producing a slice that is at least as small as the static slice.

• Relevant Slicing: Whilst dynamic slicing yields the slice of program statements that actually had an effect upon the slicing criterion, the relevant slice is a superset of the dynamic slice, including the set of statements that could have affected the slicing criterion [Agrawal et al., 1993]. The slicing criterion used remains the same as that used by dynamic slicing.

This technique can be used to find statements whose contamination prevents them from impacting upon the variables of interest within the slicing criterion. In such cases, where a statement does not affect the variables of interest, the dynamic slice will fail to include the faulty statement.


• Hybrid Slicing: Hybrid slicing [Gupta and Soffa, 1995] is an alternative slicing technique that avoids the need to compute the large data structures, on the order of gigabytes, incurred by dynamic slicing, whilst increasing precision over static slicing by still using dynamic information. To achieve this feat, hybrid slicing incorporates knowledge of which statements have been executed (acquired from the coverage information) into the static analysis, removing from the slice those statements which lie on a non-executed possible path of the computation [Silva, 2012].

• Interprocedural Slicing: The original form of program slicing has since been reclassified as intraprocedural slicing, as static slicing fails to exploit the boundaries of procedure calls, leaving statements within the slice that have no effect upon the criterion [Silva, 2012]. To address the loss of precision, Horwitz et al. [1990] use a System Dependence Graph (SDG) to capture dependencies within the program. The SDG allows information about the calling context of procedures to be taken into consideration by the slicer [Silva, 2012].

• Quasi-Static Slicing: The quasi-static slicing criterion [Venkatesh, 1991] allows a slice to be produced for a program where some of its inputs are fixed but the rest are unknown.

• Call-Mark Slicing: Similar to hybrid slicing, call-mark slicing provides a less expensive alternative to dynamic slicing by using the knowledge of which statements were executed, together with the program's input sequence, to yield a smaller program dependence graph (PDG) by pruning unvisited statements [Nishimatsu et al., 1999].

• Dependence-Cache Slicing: Similar to call-mark slicing, dependence-cache slicing uses dynamic information to remove irrelevant statements from the slice, via a richer PDG that exploits the possible data dependence relationships produced by the input data [Silva, 2012; Takada et al., 2002]. With this information, dependence-cache slicing is able to prune infeasible edges from the PDG, yielding a slice that, on average, tends to be smaller than a call-mark slice.

• Program Dicing: Uses the set difference of at least two slices to remove statements which appear to be correct (determined from a set of slices for passing inputs) from the slice of a failing input [Lyle, 1987], exploiting the intuition that statements belonging exclusively to the negative slice are more likely to contain the fault.

• Stop-List Slicing: Stop-list slicing [Gallagher et al., 2006] augments the slicing criterion with a stop list, specifying which variables are considered uninteresting and should be omitted from the slice. The stop-list is realised by removing all assignments to variables within the list and all data dependencies stemming from those assignments.

• Barrier Slicing: Allows the programmer to control which regions of the program may be considered when building the slice, and which areas may not, allowing the programmer to exclude trusted, correct code from the slice [Krinke, 2003, 2004]. This ability is made possible through the introduction of barriers in the slicing criterion, specifying which nodes or edges of the PDG may not be passed during the graph traversal [Silva, 2012].

• Concurrent Slicing: As concurrent programs cannot be represented using the PDG or SDG, concurrent slicing must operate on extended, thread-aware forms of the CFG and the PDG, known as the threaded CFG (tCFG) and threaded PDG (tPDG), respectively [Silva, 2012]. When threads are synchronised or allowed to communicate, a new level of complexity is added to the analysis in the form of the interference dependencies introduced between statements. Such dependencies highlight statements that share a data dependency and may be executed in parallel. Those dependencies are not transitive, however, unlike control and data dependencies, and so concurrent slicing techniques must treat them specially. For a recent overview and evaluation of concurrent slicing techniques, the reader is referred to [Giffhorn and Hammer, 2007].

• External State Aware Slicing: Sivagurunathan et al. [1997] demonstrated that standard slicing techniques fail to account for the modification of external state through I/O operations, potentially leading to incorrect slices. To address this shortcoming, they associated each I/O operation with its own special variable, allowing the external state to be observed by the slicer. To achieve this, however, the original program must be subject to a transformation which maps the original program to its equivalent in a special language that includes these external state variables. Later, further revisions to the slicing process were introduced by Tan and Ling [1998], and Willmor et al. [2004], which allowed the slicer to account for database operations.

Delta Debugging

Delta Debugging [Zeller and Hildebrandt, 2002] is an efficient method for determining the cause of regressions (i.e., changes to the program that introduce new bugs) between two software versions in worst-case linear time, with respect to the number of changes introduced between the two versions. By analysing the set of differences C, or deltas, between a passing version of the software, referred to as the baseline, and a failing version, the DD algorithm is capable of finding a minimal set of failure-inducing changes c ⊆ C. This technique allows users to find the minimal set of changes that explain a regression. DD is capable of finding the minimal set of failure-inducing changes c, where |c| > 1 and ∀c′ ⊂ c: result(c′) ≠ PASS, without having to test each of the 2^n possible sets of changes. To find the minimal sub-set, DD employs a variant of binary search to divide and conquer the set of all changes. At each step, the algorithm halves the set of changes into c1 and c2, and recursively searches the half which continues to fail, until c contains only a single edit. In the event where neither c1 nor c2 contains the failure-inducing changes, known as interference, the algorithm continues the binary search on each half without discarding any changes.
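The following is a minimal sketch of the divide-and-conquer strategy described above, not a faithful reproduction of Zeller and Hildebrandt's algorithm; it assumes a user-supplied fails oracle that applies a list of changes to the passing baseline and reports whether the failure is observed.

    def dd(changes, fails, context=()):
        # Return a small failure-inducing subset of `changes`, assuming that
        # `changes` applied together with `context` reproduces the failure.
        if len(changes) <= 1:
            return list(changes)
        mid = len(changes) // 2
        c1, c2 = list(changes[:mid]), list(changes[mid:])
        if fails(c1 + list(context)):
            return dd(c1, fails, context)
        if fails(c2 + list(context)):
            return dd(c2, fails, context)
        # Interference: neither half fails on its own, so minimise each half
        # while keeping the other applied, and combine the results.
        return (dd(c1, fails, tuple(c2) + tuple(context)) +
                dd(c2, fails, tuple(c1) + tuple(context)))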


A more advanced delta-debugging algorithm, DD+ [Zeller and Hildebrandt, 2002], removes a number of assumptions within DD to allow the approach to be applied in cases where certain changes may prevent the program from compiling or executing correctly in the absence of particular changes. To achieve this ability, the efficiency of the algorithm is degraded, but steps are taken to ensure that, in most cases, the algorithm is capable of finding results in a reasonable amount of time.

Although delta-debugging can be an effective technique for finding the cause of regressions, the associated cost of testing at least log n variants, and its limited scope, prevent it from being more widely used as a means of fault localisation within automated repair. However, through a slight rephrasing, the delta-debugging algorithm can be applied to a wider set of problems beyond fault localisation. Instead, we can use DD to find the minimal sub-set c ⊆ C which retains some arbitrary property of interest, I. Exploiting the wider applicability of this technique, GenProg uses the original delta-debugging algorithm as a post-processing step to minimise the sequence of edit operations within the patch produced by the search [Weimer et al., 2009].

Predicate-Based Fault Localisation

Whereas most fault localisation techniques attempt to localise the fault to a particular statement—especially those applied within the context of program repair—B. Le et al. [2016] propose a novel technique, Savant, for identifying faulty methods within Java programs. To perform localisation, Savant uses Daikon [Ernst et al., 2007], an off-the-shelf likely-invariant mining tool, to extract plausible invariants for each of the methods within the program. Using Daikon, Savant produces two sets of invariants: one based on observations of the passing test cases, and another for the negative tests. To determine the set of suspicious methods, Savant uses Daikon's Diff feature to find invariant changes (e.g., "x is less than zero" becomes "x is greater than zero") between the positive and negative invariant sets; methods which exhibit such changes in their invariants are deemed to be suspicious.

Given the prohibitive costs of execution trace collection and invariant mining for large-scale programs, Savant uses method clustering and test case selection heuristics to reduce the costs of these processes. To begin with, Savant excludes from consideration all methods that are not executed by any of the negative tests. The remaining methods are thereafter grouped into clusters on the basis of the positive tests that cover them; methods that are covered by the same positive tests are more likely to be grouped together. After identifying method clusters, Savant produces execution traces for each cluster—rather than an individual execution trace for each method—across all of the negative test cases and a sub-set of passing tests. This sub-set is selected for coverage adequacy using a greedy algorithm. Specifically, Savant finds a sub-set of positive test cases where each method within the cluster is covered by at least T test cases (where T is a tunable parameter). Using the generated execution traces for each cluster c, Savant generates the following sets of invariants across each of the methods contained within the cluster:


• inv(Fc), the set of invariants over the negative test cases that cover c.

• inv(Pc), the set of invariants over the positive test cases that cover c.

• inv(Fc ∪ Pc), the set of invariants over all test cases that cover c.

Using those invariant sets, Savant produces a series of invariant change features, usable by its learn-to-rank model. To generate these features, Savant uses Daikon's Invariant Diff tool to describe the addition, removal, and modification of invariants between the mined invariant sets. From these differences, Savant yields 290,163 features, each specifying the frequency of a particular kind of invariant change (e.g., NonZero to EqualZero) between a given pair of invariant sets, based on its 311 invariant types. Together with these invariant change features, Savant also uses a set of ten suspiciousness score features, where each represents the suspiciousness value computed for the particular method by a state-of-the-art SBFL metric. Before these features are used for learning or ranking, each of them is normalised within the range [0, 1].

To learn a model that ranks methods by their likelihood of being faulty, on the basis of their extracted features, Savant applies rankSVM [Joachims, 2002]—a popular, off-the-shelf learning-to-rank algorithm—to a corpus of bug fixes. The learning algorithm is fed the methods modified by the programmer as its ground truth for the location of the fault, together with the extracted feature set. Once the model has been learnt, Savant returns a ranked list of methods for a given program, computed using its extracted feature set.

Out of 357 bugs in 5 programs sourced from the Defects4J dataset [Just et al., 2014], Savant correctly localises 63.08, 101.72, and 122 buggy methods on average within the top 1, top 3, and top 5 of the ranked lists it generates, respectively. This corresponds to a 57.73%, 56.69%, and 43.13% improvement over the best performing SBFL techniques, at the top 1, top 3, and top 5 positions, respectively.

Although Savant provides a significant boost in the accuracy of method-level fault localisation, its rank-based nature prevents it from describing the degree of suspiciousness of each method. Studies by Qi et al. [2013] showed that the best-performing fault localisation technique for automated program repair, as measured by the (average) number of candidate patches until a repair is found, was sub-optimal in terms of its EXAM score. To efficiently find faults, automated program repair needs more detailed localisation information. It may be possible to replace the learning-to-rank algorithm used by Savant with a more conventional SVM algorithm, and to train the model to compute suspiciousness values directly. The objective could be to reduce the LIL score of the resulting suspiciousness value distribution—a metric that has been shown to be correlated with NCP [Moon et al., 2014b].


MUSE

Mutation-based Fault Localisation, or MUSE, is a relatively new fault localisation technique introduced by Moon et al. [2014a]. Taking inspiration from mutation testing, MUSE attempts to determine the location of a fault by mutating the program and observing the effect upon the results of the test suite. MUSE assumes that changes made to correct program statements, rather than faulty statements, are more likely to introduce new bugs, which can be detected by the positive test cases. MUSE attempts to exploit this assumption to distinguish between faulty and correct statements using the results of a mutation analysis. Empirical evaluation on a set of 12 bugs from the ICSE 2012 benchmarks showed MUSE to produce 25 times more precise information than the best SBFL technique in terms of Expense, and 2.29 times more precise than Op2 according to the more appropriate LIL metric.

Let P be a faulty program, which passes all positive test cases and fails all negative cases, and let m_f and m_c be mutants of P which mutate the faulty statement and a correct statement, respectively. Given these definitions, MUSE is founded on two conjectures.

The first conjecture is that negative test cases are more likely to pass on m_f than on m_c, since faulty programs are usually fixed by modifying the faulty statement (and are more difficult to fix through mutations to correct statements). This reasoning is based on the observation that m_f must satisfy exactly one of the following three outcomes:

1. Equivalent mutant: the mutant introduces a syntactic change to the program but retains the same semantics, in which case the result of the negative test case execution is unchanged, as the bug still remains.

2. Non-equivalent, faulty mutant: the mutant modifies the semantics of the program and continues to fail the negative test cases. The bug may or may not be the same as the original fault. Negative test cases are still considered more likely to fail than to pass on m_f.

3. Non-equivalent, non-faulty mutant: the mutant no longer fails the negative test cases, and the program is considered to be fixed, with respect to the test suite (the resulting mutant may cheat the test suite or introduce new bugs, but neither of these is detectable using the original test suite alone).

As a result of the above, and the observation that modifications to faulty statements within the program are more likely to yield a repair than changes to correct ones, the number of passing negative test cases should be larger for m_f than for m_c.

The second conjecture is that positive test cases are more likely to fail for m_c than for m_f. As with the first conjecture, this conjecture is based on the observation that m_c must exist in one of two states:

1. Equivalent (neutral) mutant: the mutant retains the same semantics as the original program, with respect to the test suite, and should therefore have the same test case results.


2. Non-equivalent mutant: by definition, a non-equivalent mutant must fail at least one of the positive test cases, thereby introducing a fault into the program.

As a result of these two conjectures, and based upon observations, it is believed that mutating the faulty statement is more likely to generate a (partial) fix, and that mutating a correct statement is more likely to fail positive test cases. By generating mutants and observing their test case results, MUSE exploits these conjectures to locate the source of the bug. MUSE incorporates this information into a new suspiciousness metric µ, given in Equations 4.12 and 4.13 (with minor changes in notation):

    \mu(s) = \frac{1}{|mut(s)|} \sum_{m \in mut(s)} \mu_m(s)    (4.12)

    \mu_m(s) = \frac{|f_P(s) \cap p_m|}{|f_P|} - \alpha \cdot \frac{|p_P(s) \cap f_m|}{|p_P|}    (4.13)

where s is a statement of P, f_P(s) is the set of negative tests covering s, and p_P(s) is the set of positive tests covering s. mut(s) defines the set of all non-neutral mutants of P with changes at s, {m_1 ... m_k}; note that neutral mutants are ignored, as they do not provide useful information based on the underlying conjectures. For each mutant m_i ∈ mut(s), let f_{m_i} and p_{m_i} be the set of failing and passing tests on m_i, respectively. Importantly, MUSE discards all neutral mutants from consideration, prior to calculating µ; these mutants are deemed to contain no relevant information.

µ_m(s) is used to calculate the contribution of knowledge from a mutant's test case results, based on the conjectures. Its first term describes the proportion of negative test cases that now pass on a mutant m, whilst its second term gives the proportion of positive test cases that now fail on a mutant m. When µ_m(s) is summed and averaged over mut(s), these terms become the probability of a change in test case result (i.e., positive test failure or negative test success), reflecting the intuition behind MUSE. Passing of a negative test case will increase the suspiciousness of s, whereas failing a positive test will decrease its suspiciousness. As it is more likely that a positive test case will fail for m than a negative test case will succeed, a weighting term α is introduced to balance these probabilities by compensating for the difference in their averages. The equation for α is given in Equation 4.14, where f2p and p2f describe the number of test case results which change across all mutants—note that α does not require a priori knowledge of the bug.

    \alpha = \frac{f2p}{|mut(P)| \cdot |f_P|} \cdot \frac{|mut(P)| \cdot |p_P|}{p2f}    (4.14)


As a result of this balancing, when the second term is subtracted from the first term, a baseline value of zero is produced. For the faulty statements, the first term is expected to be higher, whilst the second term is expected to be lower, than for the correct statements in P (w.r.t. the test suite). Equipped with this suspiciousness metric, MUSE generates suspiciousness values for each of the statements within P by following the steps below.

1. MUSE computes coverage information for P using a given test suite T, finding the set of statements that are executed by the negative test cases, since all of the (manifested) faults are guaranteed to be within this set. To calculate coverage information, MUSE uses gcov. This decision restricts MUSE from handling crashing programs, such as those with segmentation faults or memory leaks. Note, this limitation is purely a technical one, which could be overcome through the use of an alternative coverage monitoring tool. However, this decision also makes it unclear whether MUSE is equally as effective for crashing bugs.

2. A set of mutants is then generated by mutating each of the candidate statements identified in the previous step. These mutants are created by generating a single mutant at each of its identified mutation points using Proteum [Maldonado et al., 2001].

3. From the test case outcomes of all mutants, MUSE proceeds to compute µ(s) for each statement in the program.

To improve the effectiveness of MUSE in cases where mutants for a given statement are absent, and to make use of existing spectrum-based fault localisation, Moon et al. [2014b] introduce HybridMUSE. Given a set of suspiciousness values produced by MUSE, µ_MUSE, and a set of suspiciousness values generated by an SBFL technique, µ_SBFL, HybridMUSE combines these values for each statement, as illustrated by Equation 4.15:

\mu_{HybridMUSE}(s) = norm(\mu_{MUSE}, s) + norm(\mu_{SBFL}, s)    (4.15)

where norm(µ, s), described in Equation 4.16, is used to normalise a given suspiciousness value such that the minimum value is set to zero, and the maximum is set to one:

norm(\mu, s) = \frac{\mu(s) - \min(\mu)}{\max(\mu) - \min(\mu)}    (4.16)
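A minimal sketch of this combination, assuming the suspiciousness values are held in dictionaries keyed by statement (the function names norm and hybrid_muse are illustrative):

```python
from typing import Dict

def norm(mu: Dict[str, float]) -> Dict[str, float]:
    """Equation 4.16: rescale suspiciousness values so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(mu.values()), max(mu.values())
    if hi == lo:
        return {s: 0.0 for s in mu}   # degenerate case: all values identical
    return {s: (v - lo) / (hi - lo) for s, v in mu.items()}

def hybrid_muse(mu_muse: Dict[str, float], mu_sbfl: Dict[str, float]) -> Dict[str, float]:
    """Equation 4.15: sum of the normalised MUSE and SBFL scores for each statement."""
    n_muse, n_sbfl = norm(mu_muse), norm(mu_sbfl)
    return {s: n_muse[s] + n_sbfl[s] for s in mu_muse}
```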

Across 51 faulty versions of 5 programs taken from the Software Infrastructure Repository, Moon et al. [2014b] find that HybridMUSE is 4.08 times more precise than MUSE, according to the EXAM metric (the fraction of statements that would have to be considered until a fault is found, if one were to order all statements by their suspiciousness). Moon et al. [2014b] speculate that the relative strength of


Sample   EXAM   Rank     LIL
1%       6.20   123.50   6.26
10%      4.60    95.53   6.11
40%      2.79    61.21   5.96
70%      1.83    45.59   5.90
100%     1.65    41.60   5.86

Table 4.1: The average precision of HybridMUSE as its mutant sampling rate is adjusted. Taken from [Moon et al., 2014b].

HybridMUSE over the standard form of MUSE is observed in cases where all mutants at a faulty statement are discarded, causing those statements to be assigned a suspiciousness of zero. Interestingly, despite exhibiting a significant improvement in precision—as measured by the EXAM score—Moon et al. [2014b] find that MUSE is a factor of 1.12 more precise than HybridMUSE when the LIL metric is used instead; the authors choose to discount this result, given the improvements in EXAM. For the purposes of automated program repair, however, this result may prove to be of greater importance. Moon et al. [2014b] conduct a preliminary study into the correlation between EXAM, LIL, and NCP (the average number of candidate patches before a repair is found) on a single automated repair benchmark, “look-utx-4.3”, using a randomly generated test suite. On this single benchmark, they find that EXAM bears little relation to NCP, but that LIL appears to correlate well. Despite impressive gains in precision—as compared to SBFL techniques—both MUSE and HybridMUSE require a vast number of mutants to compute their suspiciousness scores; a lengthy and expensive process. To investigate this further, Moon et al. [2014b] measured the precision of HybridMUSE as the number of mutants sampled at each statement was varied. The results of their investigation, listed in Table 4.1, show a small improvement in LIL as the sample size is increased, and a much larger improvement in terms of EXAM and the rank of the faulty statement. At sample sizes below 40%, HybridMUSE performs worse than Jaccard (LIL: 6.04), the best performing SBFL technique. MUSE and HybridMUSE appear to be better suited to aiding human developers in determining the location of the fault than they are at reducing the difficulty of automated program repair.

Metallaxis

Prior to the introduction of MUSE, Papadakis and Le Traon [2015] proposed an alternative approach to Mutation Analysis Fault Localisation: Metallaxis. As with MUSE, Metallaxis assigns a suspiciousness to each statement within the program based on the test suite outcomes for a set of randomly generated mutants. Unlike MUSE, Metallaxis explicitly assigns suspiciousness to individual mutants—

building on the intuition that mutants whose test suite outcomes are closest to those of the fault are likely to be co-located with it. Each killed mutant m (i.e., a mutant which fails at least one test)1 is assigned a suspiciousness using the Ochiai similarity metric—a commonly used metric for spectrum-based fault localisation—given in Equation 4.17:2

\mu(m) = \frac{\#K_n}{\sqrt{\#K \cdot (\#K_p + \#K_n)}} = \frac{\#K_n}{\#K}    (4.17)

where K is the set of tests that kill m (which may be composed of both positive and negative tests), K_n is the set of (covered) negative tests that kill m, and K_p is the set of (covered) positive tests that kill m. Although MUSE does not explicitly compute suspiciousness values for its mutants, one can view its summation terms as implicit mutant suspiciousness scores. Comparing MUSE with Metallaxis, one observes that Metallaxis reduces the suspiciousness of mutants that are not killed by negative tests. In contrast, MUSE significantly increases the suspiciousness of such mutants. It seems, therefore, that whilst MUSE and Metallaxis share a similar high-level intuition—that mutants can be used to expose unlocated bugs—they each operate under a fundamentally different set of assumptions: MUSE sees failing-to-passing behaviour as a sign of a partial repair, whereas Metallaxis appears to treat it as a sign of overfitting. This difference in assumptions raises the question: “How should the outcomes of negative tests be treated?” From the suspiciousness values for each mutant, Metallaxis assigns a suspiciousness value to each statement within the program, using the formula in Equation 4.18.

\mu(s; M) = \begin{cases} \max_{m \in M_s} \mu(m) & M_s \neq \emptyset \\ 1.0 & \text{otherwise} \end{cases}    (4.18)
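The following sketch expresses Equations 4.17 and 4.18 in Python, assuming each mutant is summarised by the sets of positive and negative tests that kill it; the names are illustrative rather than drawn from the original Metallaxis implementation.

```python
from typing import Dict, List, Set, Tuple

def ochiai_mutant(killed_neg: Set[str], killed_pos: Set[str]) -> float:
    """Equation 4.17: Ochiai score of a killed mutant, which simplifies to the
    fraction of its killing tests that are negative."""
    k = len(killed_neg) + len(killed_pos)          # #K, the number of killing tests
    if k == 0:
        return 0.0                                 # unkilled mutants are not scored here
    return len(killed_neg) / (k * (len(killed_pos) + len(killed_neg))) ** 0.5

def metallaxis(mutants_at: Dict[str, List[Tuple[Set[str], Set[str]]]]) -> Dict[str, float]:
    """Equation 4.18: a statement is as suspicious as its most suspicious mutant;
    statements with no mutants default to the maximum suspiciousness of 1.0."""
    mu: Dict[str, float] = {}
    for s, mutants in mutants_at.items():
        scores = [ochiai_mutant(neg, pos) for neg, pos in mutants]
        mu[s] = max(scores) if scores else 1.0
    return mu
```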

Rather than computing the average over each mutant at a given statement, Metallaxis aggregates the suspiciousness of a statement's mutants by computing their maximum. An advantage of this approach—unmentioned by its authors—is that the cost of computing suspiciousness values is reduced, since values may only increase; there is no need to gather full test suite results for statements which have already been assigned the maximum suspiciousness. Thus, we may forgo the generation of mutants (for the purposes of fault localisation) at statements covered exclusively by negative test cases (since each µ(m) is guaranteed to be 1.0). Whilst this observation allows us

1 Provided that one only generates mutants at statements that are covered by at least one negative test case, the only mutants that are not killed are (partial) repairs.
2 Notation is changed slightly from its original form, to avoid confusion with similar spectrum-based fault localisation terms. The original form of the equation simplifies to the fraction of killed tests that are negative (since #K_p + #K_n = #K).

to reduce the cost of the fault localisation process, it fails to provide a means of discriminating between such statements, unlike MUSE. By using the maximum suspiciousness of a statement's mutants, Metallaxis is arguably more prone to the effects of outliers. For example, in the case that a neutral mutant m—one which passes and fails the same test cases as the original, faulty program—is generated at a statement s, µ(m) will go to 1.0, and by extension, µ(s) will be assigned the maximum suspiciousness of 1.0. If most statements within the program have neutral mutants, this behaviour becomes troubling; Metallaxis leaves us unable to discriminate between them. Given that previous research by Schulte et al. [2013] showed that roughly a third of mutants within GenProg's search space are neutral, this problem may manifest in practice. One can compare the behaviour of Metallaxis to that of MUSE by treating the inner term of µMUSE as the suspiciousness of a particular mutant. From this perspective, we notice differing behaviours and underlying assumptions in the way that MUSE and Metallaxis aggregate mutant results, and in how they treat failing-to-passing test outcomes:

• According to Metallaxis, a statement is as suspicious as its most suspicious mutant. This behaviour assumes that the search landscape is mostly composed of non-neutral mutants. If the search space contains a large number of neutral mutants [Schulte et al., 2013], Metallaxis assigns the maximum suspiciousness value to most statements. In contrast, MUSE actively discards its neutral mutants and computes the suspiciousness of a statement as the average suspiciousness of its mutants, indicating a robustness to sampling noise.

• Whereas MUSE significantly increases the suspiciousness of statements containing mutants which pass previously failing test cases, Metallaxis implicitly decreases the suspiciousness of such statements. These behaviours highlight a contradiction between the underlying assumptions of the techniques: MUSE sees partial solutions as signs of a repair, whilst Metallaxis views them as either irrelevant or the result of overfitting.

Both Metallaxis and MUSE have demonstrated significant improvements to fault localisation accuracy, but their evaluations have been limited to manually seeded faults in small-to-medium sized programs, sourced from the Siemens [Hutchins et al., 1994] and SIR [Do et al., 2005] datasets. Furthermore, in the evaluation of Metallaxis, the EXAM metric was used to measure accuracy—a metric which has been shown to be inappropriate for the purposes of automated program repair [Qi et al., 2013].

4.2. Analysis

In this section, we explore the feasibility of using the results of candidate patch evaluations to improve the accuracy of fault localisation. To determine feasibility, we use

a collection of historical, real-world bug fixes to assess whether mutants at statements where a repair was made exhibit different test suite behaviour to those at statements that were unchanged by the patch. In the context of automated program repair, each mutant can be viewed as a candidate repair. Evidence of differing test suite behaviours between the mutants of modified and non-modified statements would suggest that it is possible to (help to) identify the locations in the program at which a successful fix could be made. If no difference in test suite behaviour is detected, it would suggest that knowledge of the test outcomes for candidate patches is not useful in improving the accuracy of fault localisation. To focus our analysis, we use GenProg to generate candidate patches, but anticipate that the results of the analysis generalise to any search-based repair technique that follows a similar paradigm. Specifically, this analysis answers the following questions:

• RQ1A: Do the solutions found by the search occur at the same statement that was modified by the programmer? We ask this question to determine whether APR should restrict its attention to the locations modified by the programmer, rather than pursuing alien repairs at other locations.

• RQ1B: Can statements that were modified by the human be discriminated from those that were not, on the basis of the passing-to-failing (p2f) rates of their mutants?

• RQ1C: Can human-modified statements be distinguished from non-modified statements, based on the failing-to-passing (f2p) rates of their mutants? We ask RQ1B and RQ1C to assess the degree to which p2f and f2p contribute to identifying suitable fix locations, if they do at all; previous work has ignored the individual contributions of these parts [Moon et al., 2014a; Papadakis and Le Traon, 2015].

• RQ1D: Are statements covered by the fewest positive test cases the most likely to contain the fault? We ask this question to determine whether this observation could be exploited to improve offline fault localisation.

To answer these questions, we analyse test case results for a sample of single-edit mutants taken from 28 bugs across 6 real-world C programs. 15 of these bugs are artificial, injected into 3 small-to-medium programs sourced from the Software Infrastructure Repository [Do et al., 2005]—the same source of bugs used to evaluate MUSE and Metallaxis. We include these bugs to determine whether GenProg's repair operators may be used to perform MBFL, rather than the traditional mutation testing operators used by existing approaches [Moon et al., 2014a; Papadakis and Le Traon, 2015]. We supplement this dataset with 13 real-world bugs in 3 large-scale, real-world programs, to determine whether MBFL remains effective when applied in the wild.


Program   # Bugs   KLOC   Tests   Artificial?
gzip      6        6      104     Yes
grep      2        10     146     Yes
sed       7        14     255     Yes
OpenSSL   5        248    77      No
Python    4        446    344     No
PHP       4        789    8597    No

Table 4.2: Details of the subjects we studied for our preliminary mutation analysis. KLOC measures the number of thousands of lines of C code in the program, as calculated by cloc. Tests states the average number of test cases used by bugs for that program.

Instance Type:  c4.large
CPU:            Intel Xeon E5-2666 v3 (2 cores)
Base AMI:       Amazon Linux, 64-bit (ami-0b33d91d)
RAM:            3.75 GB
Storage Type:   io1
Storage Size:   8 GB (400 IOPs)
Cost:           > $0.1/hr

Table 4.3: Specifications of the Amazon EC2 instances used to perform mutation analysis on the 15 artificial bugs.

We obtained these 13 bugs by manually searching for recent patches within the version control histories of these projects. Table 4.2 provides a summary of the programs used for the study. To collect the necessary data for the analysis, we first generated a list of all the single-edit patches within GenProg's search space, before randomly shuffling that list and evaluating as many patches as possible within a 12-hour window. This 12-hour random walk was repeated for each of the bugs within the dataset. For historical reasons, and given its similarity to traditional mutation testing operators, we also ensured that a deletion was attempted at each statement.3 In keeping with our general methodology, outlined in Chapter 3, each bug was analysed using RepairBox. We used a C4.Large instance on Amazon EC2, described in Table 4.3, to collect data for the 15 artificial bugs, and a DS1_V2 instance on Microsoft Azure, described in Table 4.4, for the 13 real-world bugs. To fit as many patch evaluations as possible into the time window allotted to each random walk, we performed a number of optimisations to the search process, discussed below; a sketch of the overall sampling loop follows the list of optimisations.

3 In earlier experiments, using publicly available bug scenarios with weak test harnesses, we found that the deletion operator was highly effective at pruning the fault localisation.


Instance Type:  DS1_V2
CPU:            Intel Xeon E5-2673 v3 (1 core)
Host OS:        Ubuntu Server 16.04 LTS (64-bit)
Kernel:         4.4.0-77-generic
Docker:         1.12.6, build 78d1802
RAM:            3.5 GB
Storage:        7 GB (3200 IOPs)
Cost:           $0.249/hr

Table 4.4: Specifications of the Microsoft Azure compute instances used to perform analysis on the 13 real-world bugs.

• By limiting our attention to single-edit patches, we were able to exploit the assumption that the fix requires only a single edit: we restricted our attention to the sub-set of statements that are covered by all negative test cases as candidates for repair, rather than considering those that are covered by only a sub-set of the negative tests.

• For each mutant, we restricted its test suite evaluation to the sub-set of tests whose outcomes may have changed as a result of the mutation; to determine this, we found the sub-set of tests that cover the single statement modified by the mutant (see the sketch after this list). Whilst dynamic analysis techniques could be used to further reduce the evaluation effort required, we believe their associated costs are likely to outweigh any potential gains.

• Additionally, we implemented a form of parallel, asynchronous test case evaluation, which reduces CPU idle time, and improves throughput and efficiency.

• We also used a number of weak equivalency checking techniques introduced by Weimer et al. [2013] in AE, including liveness and scope checking, and the removal of duplicate statements from the donor pool.
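The sketch below illustrates the shape of this sampling procedure: single-edit patches are shuffled and evaluated until the time budget expires, and each patch is run only against the tests that cover its mutated statement. The callbacks covered_tests and run_test are placeholders for tooling that is not shown here.

```python
import random
import time
from typing import Callable, Dict, List, Set

def random_walk(patches: List[str],
                covered_tests: Callable[[str], Set[str]],
                run_test: Callable[[str, str], bool],
                budget_hours: float = 12.0) -> Dict[str, Dict[str, bool]]:
    """Evaluate as many randomly ordered single-edit patches as the time budget allows."""
    deadline = time.time() + budget_hours * 3600
    outcomes: Dict[str, Dict[str, bool]] = {}
    random.shuffle(patches)
    for patch in patches:
        if time.time() >= deadline:
            break
        # Only tests covering the mutated statement can change outcome,
        # so all other tests are skipped entirely.
        outcomes[patch] = {t: run_test(patch, t) for t in covered_tests(patch)}
    return outcomes
```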

Results

A brief summary of the results of the mutation analysis is given in Table 4.5. From inspection of these results, we observe that most (compiling) mutants exhibit an all-or-nothing behaviour: either their test outcomes remain unchanged (shown in the “% Neutral” column), or all of their tests fail (given by the “% Lethal” column). We also observe low sample rates (< 1 mutant per suspicious statement) for most Python and PHP bug scenarios. This is due to the substantial overheads involved in compiling mutants and, for statements covered by a large number of tests, the significant cost of evaluating hundreds or thousands of test cases for each mutant.


Scenario              % Compiling   % Neutral   % Lethal   Mutants   Sample Rate
mmb-openssl-0a2dcb6   100.00        14.06       50.66      377       2.48
mmb-openssl-4880672   90.72         22.27       35.51      1971      4.82
mmb-openssl-6979583   99.18         26.62       51.23      1097      13.06
mmb-openssl-8e3854a   93.66         38.14       25.59      3028      4.69
mmb-openssl-eddef30   100.00        55.97       16.84      2898      80.50
mmb-php-01c028a       96.43         7.14        0.00       28        0.06
mmb-php-11bdb85       96.88         25.00       0.00       32        0.07
mmb-php-1d6b3f1       88.66         14.57       25.78      741       8.72
mmb-php-9fb92ee       100.00        28.11       37.33      217       0.51
mmb-python-6c3d527    84.66         64.29       13.76      378       0.46
mmb-python-a93342b    81.51         53.42       0.68       146       0.37
mmb-python-b2f3c23    81.83         45.67       14.33      600       3.03
mmb-python-f584aba    32.74         10.50       5.05       733       0.71
sir-grep-v2-DG_1      98.73         39.17       35.99      628       1.00
sir-grep-v3-DG_3      67.06         31.92       31.19      2353      10.14
sir-gzip-v1-KL_2      91.41         23.60       48.79      1898      10.20
sir-gzip-v1-KP_1      92.09         41.11       36.77      2301      11.22
sir-gzip-v1-TW_3      91.34         27.85       40.57      2068      13.34
sir-gzip-v4-KL_1      91.27         43.17       28.83      2886      13.49
sir-gzip-v5-KL_1      90.26         66.75       16.96      6437      16.98
sir-gzip-v5-KL_8      98.68         24.01       31.26      1062      1.05
sir-sed-v2-AG_17      85.77         23.68       30.67      1630      1.52
sir-sed-v2-AG_19      53.50         8.73        19.93      3011      22.47
sir-sed-v3-AG_11      78.79         24.74       33.81      2579      7.39
sir-sed-v3-AG_15      80.71         24.92       26.02      1545      2.95
sir-sed-v3-AG_17      67.57         16.14       27.40      1332      3.95
sir-sed-v3-AG_18      67.85         16.92       24.13      1235      3.48
sir-sed-v3-AG_6       77.06         23.40       31.17      2615      5.56
Average               84.94         30.07       26.44      1637      8.72

Table 4.5: A summary of the mutation analysis results for each bug scenario. % Compiling specifies the percentage of mutants that successfully compiled. % Neutral gives the percentage of (compiling) mutants that had no effect on the outcome of the tests. % Lethal gives the percentage of (compiling) mutants (covering at least one positive test) that failed all of their covered tests. Mutants specifies the number of mutants generated within the 12-hour random walk. Sample Rate gives the average number of mutants per suspicious statement. The final row gives the average across all scenarios.


[Figure 4.3: box plots of the mean p2f of all mutants at a given statement, for faulty and correct statements; y-axis: p2f (0.0 to 1.0).]

Figure 4.3: Comparing the mean pass-to-failure rate for applicable mutants, we observe different, but overlapping, distributions for correct statements and faulty statements (KS2 = 0.301; p = 0.003; A12 = 0.679 [medium effect]).

RQ1A: Do the solutions found by the search occur at the same statement that was modified by the programmer?

Across the random walks for each of the 28 bugs, we found fixes at a total of 48 different statements. Only 5 of these 48 statements were also modified by the original patch. This finding suggests that GenProg crafts repairs unlike those that a human would make, supporting arguments made by Monperrus [2014] that automated repair should consider alien repairs, rather than restricting itself to producing human-like repairs.

RQ1B: Can statements that were modified by the human be discriminated from those that were not, on the basis of the passing-to-failing (p2f) rates of their mutants?

In line with our expectations and previous findings, we observe different mean pass-to-failure rates across all applicable mutants between the statements we assumed to be faulty and those we assumed to be correct, as shown in Figure 4.3. To avoid misleading results, only mutants that successfully compile and cover at least one positive test case are considered applicable. Using a two-sample Kolmogorov-Smirnov test, we reject the null hypothesis (with p < 0.05) that the samples for the faulty and correct statements are drawn from the same distribution. Between the p2f distributions for correct and faulty statements, we find an effect size of 0.679, indicating that correct statements tend to have higher p2f values than faulty statements. Details on the statistics used in this chapter may be found in Section 3.4.4.
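For reference, the comparison reported in Figure 4.3 can be reproduced with a two-sample Kolmogorov-Smirnov test and a Vargha-Delaney A12 effect size; the sketch below assumes the mean p2f values for faulty and correct statements are available as two lists of floats.

```python
from itertools import product
from typing import Sequence, Tuple
from scipy.stats import ks_2samp

def a12(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Vargha-Delaney A12: probability that a value from xs exceeds one from ys (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in product(xs, ys))
    return wins / (len(xs) * len(ys))

def compare_p2f(faulty: Sequence[float], correct: Sequence[float]) -> Tuple[float, float, float]:
    """Returns the KS statistic, its p-value, and the effect size of correct over faulty."""
    ks_stat, p_value = ks_2samp(faulty, correct)
    return ks_stat, p_value, a12(correct, faulty)
```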

The distributions suggest that faulty statements tend to have a lower pass-to-failure rate than those that are correct, supporting the intuition that modifications to a correct statement are likely to result in a greater degree of functionality loss. In answer to RQ1B, these results suggest that p2f information can be used to partially discriminate between statements that were modified by the programmer and those that were not. Thus, the results provide promise for the use of GenProg's mutation operators to localise the source of faults over the course of the search.

[Figure 4.4: two box plots of the mean f2p of all mutants at a given statement (without outliers), for faulty and correct statements; y-axis: f2p (0.0 to 1.0).]

Figure 4.4: We observe similar distributions of mean f2p values for faulty and correct statements (KS2 = 0.185; p = 0.177). In both cases, more than half of the mutants at each statement did not pass any of the previously failing tests.

RQ1C: Can human-modified statements be distinguished from non-modified statements, based on the failing-to-passing (f2p) rates of their mutants?

To answer this question, we first removed all non-compiling mutants from consideration, before also removing all acceptable solutions that were found during the random walk; this places the f2p values in the context of an ongoing search, allowing us to observe the most complete f2p distributions possible without having yet found a solution. We then compared the mean f2p of all mutants for statements that were modified by the human and those that were not, illustrated in Figure 4.4.

The results show that the mean f2p was zero for the majority of statements, regardless of whether they were modified by the human. On closer inspection, we find that only 2.09% of mutants have any impact on the outcomes of the negative tests (i.e., 97.91% of mutants fail all negative tests). These findings suggest that f2p information may not be particularly effective in distinguishing between faulty and correct statements.

RQ1D: Are statements covered by the fewest positive test cases the most likely to contain the fault?

In restricting the search to only those statements covered by all negative test cases, existing spectrum-based fault localisation techniques become partially redundant,


as the contribution of the e_n variable is zero. Instead, it may be preferable to quantify statement suspiciousness using a function of the number of executed and non-executed passing test cases. Thus, we used the results of the analysis to explore how many passing test cases were executed at each statement with an acceptable repair, and whether faulty statements covered fewer positive tests. To account for the varying sizes of the test suites, we looked at the fraction of the test suite covered by a faulty statement, or its passing test coverage. We computed an adjusted coverage level for each of the bugs, where the minimum number of positive tests covered by any statement under consideration was subtracted from the number of positive tests covered by a given statement. This allows the levels of coverage to be compared relative to the other candidate statements in the program. Measuring the adjusted coverage of each bug scenario, we find that 90% of the statements with an acceptable repair are covered by fewer than 2% of the positive test cases, mirroring the intuition behind the original suspiciousness metric used in GenProg.

4.3. Approach

Using the knowledge gained from our mutation analysis, we propose and evaluate two fault localisation strategies for search-based program repair, which may be aggregated. For our purposes, we aggregate localisations by computing the product of their suspiciousness values, although more intricate methods may yield better results. To use these layers in a noisy, online context, we ensure that each is numerically stable and that none assigns a suspiciousness of zero to any statement covered by a negative test case. We consider:

Coverage: based on our findings regarding the (adjusted) number of positive test cases covering human-repaired statements, this layer, described by Equation 4.19, creates a localisation under which there is a 90% probability that a statement with less than 2% adjusted coverage will be selected, and a 10% probability that a statement with a greater level of coverage will be chosen.

\mu_{Cov}(s) = \begin{cases} x & AdjustedCoverage(s) \leq 0.02 \\ y & \text{otherwise} \end{cases}    (4.19)

The adjusted coverage of a statement is computed using the formulae given in Equations 4.20 and 4.21:

AdjustedCoverage(s) = \frac{|PositiveTests(s)| - MinCoverage}{|PositiveTests|}    (4.20)

MinCoverage = \min_{s \in S} |PositiveTests(s)|    (4.21)

where |PositiveTests(s)| specifies the number of passing test cases covering a given statement, and |PositiveTests| specifies the number of passing tests overall. Let the statements that are at or below the adjusted coverage threshold be given by the set X, and let Y be the set of all other candidate statements; x and y are thus implicitly given by Equations 4.22 and 4.23.

\sum_{s \in X} \mu_{Cov}(s) = 0.90    (4.22)

\sum_{s \in Y} \mu_{Cov}(s) = 0.10    (4.23)
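A minimal sketch of this layer, assuming positive test coverage is given as a mapping from statements to the positive tests that execute them (the names are illustrative):

```python
from typing import Dict, Set

def coverage_layer(positive_cov: Dict[str, Set[str]],
                   num_positive_tests: int,
                   threshold: float = 0.02) -> Dict[str, float]:
    """Equations 4.19-4.23: statements at or below the adjusted coverage threshold
    share 90% of the probability mass; all remaining statements share 10%."""
    min_cov = min(len(tests) for tests in positive_cov.values())
    adjusted = {s: (len(tests) - min_cov) / num_positive_tests
                for s, tests in positive_cov.items()}
    low = [s for s, a in adjusted.items() if a <= threshold]
    high = [s for s, a in adjusted.items() if a > threshold]
    mu = {s: 0.90 / len(low) for s in low}
    mu.update({s: 0.10 / len(high) for s in high})
    return mu
```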

Pass-to-Fail: This layer, described in Equation 4.24, assigns values between zero and one to each statement, based on the pass-to-fail rates p2f of its mutants:

\mu_{p2f}(s) = \frac{1}{|mutants(s)| + 1} \cdot \left(1 + \sum_{m \in mutants(s)} (1 - p2f_m)\right)    (4.24)
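A sketch of this layer, together with the product aggregation used later in the evaluation, assuming the p2f rate of each mutant is pre-computed (the names are illustrative):

```python
import math
from typing import Dict, List

def p2f_layer(mutant_p2f: Dict[str, List[float]]) -> Dict[str, float]:
    """Equation 4.24: the +1 terms keep the score non-zero and numerically stable
    even when a statement has few or no mutants."""
    return {s: (1.0 + sum(1.0 - r for r in rates)) / (len(rates) + 1)
            for s, rates in mutant_p2f.items()}

def aggregate(*layers: Dict[str, float]) -> Dict[str, float]:
    """Naive aggregation of localisation layers by taking their product."""
    return {s: math.prod(layer[s] for layer in layers) for s in layers[0]}
```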

Evaluation

We now investigate the effectiveness of a fault localisation that combines the proposed layers. To assess the effectiveness of each candidate fault localisation technique, we use the mutants from the analysis—excluding any solutions, to avoid potential biases—to generate a set of suspiciousness values µ for each of the programs. We then measure the accuracy of a fault localisation technique by the probability p(µ) of selecting a statement that contains a fix found during the random walk, as described in Equation 4.25.4

p(\mu) = \sum_{s \in FixedStmts} \frac{\mu(s)}{\sum_{s' \in S} \mu(s')}    (4.25)
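A one-function sketch of this accuracy measure, assuming µ is a dictionary of suspiciousness values and fixed_stmts is the set of statements at which a fix was found:

```python
from typing import Dict, Set

def localisation_accuracy(mu: Dict[str, float], fixed_stmts: Set[str]) -> float:
    """Equation 4.25: probability of sampling a fixable statement when statements
    are drawn in proportion to their suspiciousness."""
    total = sum(mu.values())
    return sum(mu[s] for s in fixed_stmts) / total if total else 0.0
```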

Table 4.6 shows the results across the 11 bugs for which solutions were found during the random walk.

We observe that µCov is more accurate than GenProg's standard fault localisation, µGP, in 6 out of the 11 cases, marginally worse in 1 case, and considerably less accurate for the remaining 4 cases. When used in isolation, µp2f is beaten by µGP for 8 out of the 11 bugs, and for the remaining 3 bugs, µp2f provides only a marginal improvement in accuracy. This result indicates that passing-to-failing information, used alone, is not particularly effective at identifying suitable fix locations.

4 Note, we do not measure how well these techniques localise the statement modified in the human repair, since the patterns observed in this data were used to design these techniques.


Scenario              p(µGP)   p(µCov)   p(µp2f)   p(µCov × µp2f)
ct-openssl-0a2dcb6    1.227    0.142     0.907     0.046
ct-openssl-6979583    37.681   75.256    15.958    79.753
ct-openssl-8e3854a    0.714    0.106     1.005     0.100
ct-openssl-eddef30    66.667   76.036    45.920    77.583
sir-gzip-v1-KP_1      6.631    4.455     4.673     6.485
sir-gzip-v1-TW_3      9.202    14.211    4.689     15.324
sir-gzip-v4-KL_1      2.685    2.727     1.724     3.083
sir-gzip-v5-KL_1      0.400    0.326     0.360     0.394
sir-gzip-v5-KL_8      2.169    6.585     0.338     4.672
sir-sed-v2-AG_17      0.185    0.019     0.211     0.023
sir-sed-v3-AG_11      0.573    6.953     0.606     3.086

Table 4.6: Comparison of the fault localisation accuracies achieved by different approaches, where accuracy is measured by the probability of sampling a statement containing a fix from the resulting distribution. Results are given as percentages.

When µCov and µp2f are naively aggregated by computing their product, the resulting fault localisation is more accurate than µGP for 6 out of 11 bugs, marginally worse for 1, and considerably worse for 4. If the online modifications to µp2f are removed, and 1 − p2f(s) is used to compute suspiciousness instead, the resulting localisation outperforms GenProg's fault localisation in all cases. We also experimented with various ways of incorporating f2p information into the fault localisation, but found that the approach either attained near-perfect accuracy (since the only mutants to pass any negative tests were at statements where a solution was found), or substantially worse accuracy (since all mutants other than the solutions at the faulty statements failed all of the negative tests). A larger table of results can be found in Appendix C.

4.4. Discussion & Conclusion

From the results of our evaluation, we observe relatively little benefit in incorporating information learned from the evaluation of candidate patches into the fault localisation, in comparison to previous attempts to use mutation analysis to locate faults. We believe there may be a number of reasons for this result:

• Lack of mutants: for a number of bug scenarios, we found that excessively long test suite evaluation and compilation times prevented the search from producing an adequate sample of mutants at each statement.

• Lack of passing test coverage: in cases where a large number of statements are covered by no passing tests, all of these statements will be assigned a suspiciousness score of 1.0 by µp2f(s). Consequently, this layer will either suppress the fixed statement, if it is covered by any positive test cases, or it will fail to identify it amongst the many statements without positive test coverage.

• All-or-nothing f2p response: within GenProg's search space, we observe erratic negative test case behaviour. In some cases, the only statements to pass a negative test case were those at which a repair could be made. In other cases, negative test case passes were much more common, whilst no mutants at the statement that was repaired (excluding the solutions) passed any negative tests. If one knew which type of f2p response one was dealing with, a more accurate fault localisation might be possible. In the future, we plan to explore whether the rarity of negative test passes might be used to determine whether a given failing-to-passing event is a coincidence, or indicative of a potential repair at that statement.

• Coarsely-grained mutation operators: one explanation for the relative lack of success in incorporating the results of the mutation analysis into the fault localisation may be the coarsely-grained nature of the repair operators within GenProg's search space. With such actions, it may be difficult to expose subtle bugs within a statement that might otherwise be identified using finer-grained mutation testing operators. In our preliminary analysis, we find that most repairs tend either to have no effect on the outcomes of the test suite, or to cause all of their covering tests to fail; this all-or-nothing behaviour may be a consequence of the granularity of the search operators.

• Combining information: to combine each of the proposed layers of fault localisation in our evaluation, we computed the product of the layers; a necessarily arbitrary decision. A more meaningful, effective way of combining information from multiple sources, and of dealing with conflicting or corroborating suspiciousness values, is not immediately clear.

Going forward, to translate the potential of mutation analysis approaches such as MUSE into efficiency gains in automated program repair, we intend to explore the following:

• Richer repair models: in an effort to avoid the all-or-nothing behaviour exhibited by mutants generated using GenProg's coarsely-grained statement-level operators, we intend to explore the utility of lower-level repair operators including, but not limited to, those traditionally used within mutation testing. Beyond the use of mutation testing operators, we intend to explore whether expression-level operators, such as the replacement of the LHS or RHS of an assignment, or the modification of function call parameters, may be used to predict the location of the fault.

• Shape prediction: in addition to investigating whether mutants can be used to predict the location of the fault, we are interested in whether the results of particular types of mutants can be used to refine suspiciousness beyond the level of the statement, and whether they can be used to predict the type of repair that


might be needed (for instance, if replacing an if condition appears to have no effect, that might suggest that a replacement condition is needed).

• Ensemble learning: rather than combining fault localisation layers by taking their product, or using a simple weighted average, we may see better results—when using a different model—if ensemble learning techniques are used to find more effective ways of combining these multiple sources of information.

In conclusion, we find that mutation analysis appears to offer little potential for online fault localisation within GenProg's search space. From observation of the mutants, we see that GenProg exhibits an all-or-nothing landscape, where most edits are either neutral or fail all of their covering tests. Given the previous success of Metallaxis and MUSE, we believe that this all-or-nothing behaviour may be partly responsible for the lack of improvement. Alternatively, it may be that the results observed by Metallaxis and MUSE fail to scale to large, real-world programs. In future work, we intend to investigate the assumptions behind these techniques more deeply. To benefit from the knowledge of its mutants' test case results, we believe a set of more finely grained mutation operators is required; a requirement that will, most likely, also allow a larger number of bugs to be solved.

CHAPTER 5

Repair Model

Motivated by our findings and the findings of others—that the statement-level repair model used by GenProg is ineffective at finding repairs—in this chapter we conduct an empirical study of bugs in real-world C programs to determine a more effective repair model. Specifically, we explore the viability of extending the ideas of plastic surgery—using code from existing sources to craft the materials necessary for a repair—beyond the level of statements, to a larger set of richer, more granular changes, capable of fixing a greater number of bugs. Using a new bug fix mining tool, BugHunter, we automatically identify bug fixing commits within Git repositories, before extracting instances of AST-level repair actions and collecting a pool of donor code snippets from the program. Equipped with this information, we determine the fraction of bug fixes which involve a particular repair action, and the fraction of repair action instances that can be grafted [Barr et al., 2014] from existing code within the program. In an effort to reduce the search space, and backed by the findings of previous studies regarding the effectiveness of plastic surgery, we limit the membership of the donor code pool to snippets taken from the file where the fix occurred. To avoid the rejection of potential snippets due to differently labelled variables, we also explore the effectiveness of plastic surgery when such labels are removed; we refer to the labelled and unlabelled forms of the donor pool as the concrete and abstract donor pools, respectively. From analysis of the results, we find that more granular repair actions, at and below the level of statements, are better suited to plastic surgery than block-level changes. By removing labels from donor code snippets and treating them as templates, we observe a substantial increase in graftability, rising from 0–58% to 16–94%. To fix a larger number of bugs, we suggest incorporating a sub-set of the most frequent repair actions into the repair model, and using an abstract donor pool to craft repairs. The contributions of this chapter are as follows:

• We present BugHunter, a repair action mining tool, capable of identifying bug fixing commits within Git repositories and discovering potential instances of automated repair actions at the AST level.

• We build upon previous definitions of repair models [Martinez and Monperrus, 2013] and provide inference rules for the set of repair actions used to perform the study.

• We examine the frequency of 23 repair actions, inspired by repair models used within existing repair techniques, across 10,000 identified bug fixes in 200 open-source C projects.


• We determine the graftability [Barr et al., 2014] of each of these repair actions within a set of labelled and unlabelled donor code pools.

• From the observed frequencies and graftabilities of repair actions and donor code pools, we make a number of suggestions for constructing a future repair model, capable of addressing a greater number of bugs.

The rest of this chapter is structured as follows: Section 5.1 gives a brief review of the related literature. Section 5.2 elaborates on the motivation for this study and outlines our research questions. Section 5.3 discusses our methodology. Section 5.4 expands upon the definition of repair models and provides descriptions, in the form of inference rules, for each of the repair actions studied. Section 5.5 outlines our approach to mining repair action instances from real-world software projects. Section 5.6 presents and discusses the results of our study. Finally, Section 5.7 summarises our findings, provides suggestions for the construction of more effective repair models, and outlines future directions for research.

5.1. Related Work

In this section, we briefly discuss previous research of relevance to repair models and plastic surgery, and highlight where our study differs from and builds upon this body of work.

The Plastic Surgery Hypothesis

Barr et al. [2014] summarise the “plastic surgery hypothesis” as follows:

Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited.

The plastic surgery hypothesis underlies various genetic-programming-based approaches to automated program repair, optimisation, and improvement, all of which use existing code to find solutions. Given this definition, Barr et al. [2014] break the hypothesis down into two assumptions: (1) changes to the program are repetitive, relative to their parent, and (2) this repetitiveness may be efficiently exploited to construct those changes. In particular, techniques such as GenProg rely on the existence of donor snippets within the current, buggy form of the program. To test the first assumption, they measure the graftability of 15,273 commits, taken from several large-scale Java projects. The graftability of a change is defined as the number of its snippets for which a matching snippet can be found within the search space. For the purposes of the study, source

code lines (with whitespace removed), rather than AST-level entities, are treated as snippets. Barr et al. [2014] measure this quantity across three different search spaces, each containing snippets taken from one of the following sources:

• The parent of the change
• All non-parental ancestors of the change
• The most recent version of a foreign software project

To test the validity of the second assumption—that the space of donor snippets can be efficiently explored—Barr et al. [2014] also measure the density of matching snippets within each space. From the results of their analysis, Barr et al. [2014] found that, in most cases, donor grafts could be found within the current version of the program, and that it was rarely necessary to search non-parental ancestors for the graft. Moreover, 30% of grafts could be found within the same file at which the human-written change occurred. These results support both the plastic surgery hypothesis and GenProg's interpretation of that hypothesis: restricting the attention of the search to snippets within the current version of the faulty file allows a large number of grafts to be found much more efficiently. Our study builds upon the work by Barr et al. [2014] by investigating redundancy at the level of program repair actions, rather than at the level of source code lines. We choose to study source code redundancy within the context of particular repair actions, since this allows us to more accurately determine the effectiveness of plastic surgery when applied to program repair.
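As a rough illustration of this line-level notion of graftability (not the exact procedure used by Barr et al. [2014]), the sketch below reports the fraction of a change's whitespace-stripped lines that already appear in a donor pool:

```python
from typing import Iterable

def graftability(change_lines: Iterable[str], donor_lines: Iterable[str]) -> float:
    """Fraction of the change's snippets (whitespace-stripped lines) found in the donor pool."""
    strip = lambda line: "".join(line.split())
    pool = {strip(l) for l in donor_lines}
    snippets = [strip(l) for l in change_lines if strip(l)]
    if not snippets:
        return 0.0
    return sum(1 for s in snippets if s in pool) / len(snippets)
```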

Empirical Inquiry into Redundancy Assumptions of APR

Martinez et al. [2014] investigate the underlying assumption of plastic-surgery-driven APR techniques, such as GenProg, PAR, and SearchRepair: that the ingredients used by a fix already exist within the program. To investigate this assumption, the authors examine six open-source Java projects and determine the fraction of (version control) commits—including those which do not pertain to bug fixes—that are “temporally redundant”. A commit is deemed temporally redundant if it can be composed in its entirety from code introduced by previous commits. The authors measure temporal redundancy at both the line level and the token level, and within the same file (termed local temporal redundancy) and across all files (termed global temporal redundancy). The results of the study demonstrate a stark contrast in temporal redundancy at the line level and the token level: between 2–17% of commits are temporally redundant at the line level, whereas 8–52% are temporally redundant at the token level. This finding suggests that more granular changes to the program are easier to compose from previous versions of the program, although this comes at the cost of a significantly increased search space.


From analysis of global and local redundancy, the authors find that between 8–29% of tokens can be found within previous versions of the same file, compared to 31–52% across previous versions of all files. Importantly, the size of the local pool was between two and three orders of magnitude smaller than the global pool. The cost effectiveness of searching the local donor pool lends support to GenProg's decision to restrict the composition of its donor statements to those within the files under repair. Our study differs from that conducted by Martinez et al. [2014] in two important respects: firstly, we focus our attention on the effectiveness of plastic surgery within C programs, rather than Java programs. Secondly, we measure redundancy within the context of concrete repair actions, rather than generically measuring it at the line or statement level. For instance, we determine the graftability of a “Replace If Guard” action by measuring redundancy at the expression level.

Mining Software Repair Models

Martinez and Monperrus [2013] conduct an empirical study of the frequency of 41 different AST-level repair actions within real-world bug fixes in Java programs, and argue that not all probabilistic repair models are equally effective. As the dataset for their analysis, the authors automatically identify the sub-set of bug fixes in the CVS-Vintage dataset [Monperrus and Martinez, 2012]: a dataset containing 89,993 source-code versioning transactions across 14 open-source Java programs. To find AST-level changes for each fix, they employ ChangeDistiller [Gall et al., 2009], an AST differencing tool which describes modifications to ASTs using 41 different types of changes (e.g., “Statement Insertion”, “Statement Update”, “Statement Deletion”, “Addition of final to class declaration”). Martinez and Monperrus [2013] examine the different outcomes produced by using different methods of bug fix identification. They find that the number of source code changes is a good predictor of whether the changes within a transaction differ from those of normal software evolution; transactions with fewer changes tend to be the most different. In contrast, when transactions were selected purely based on the presence of indicators (e.g., “bug” or “fix”) within their associated messages, the observed distribution of repair action frequencies was almost identical to the distribution across all transactions, suggesting such measures are ineffective at isolating fixes to the source code. When considering the set of transactions that contain only a single change—as reported by ChangeDistiller—the top five change types were as follows: Statement Update (38%), Add Function (14%), Condition Change (13%), Statement Insertion (12%), and Statement Deletion (6%). Between them, these change types account for 83% of all changes. The focus of Martinez and Monperrus [2013]'s study is on the frequency of particular repair actions within Java projects. In contrast, although we also measure the

frequency of repair actions—albeit for C—our study is primarily focused on the effectiveness of plastic surgery in the context of APR.

Critical Review of PAR

As an alternative to GenProg's coarsely-grained repair model, Kim et al. [2013] propose a hand-crafted repair model, based on frequently observed bug fix patterns (e.g., insertion of a null check). They incorporate this repair model into PAR, an evolutionary program repair technique aimed at repairing Java bugs. Compared to an implementation of GenProg for Java, PAR was able to repair more bugs (27 out of 119, vs. 16). The authors also demonstrate the acceptability of PAR's patches through a human study. Despite achieving impressive results—and an ACM SIGSOFT Distinguished Paper Award—both PAR and the methodology used in its evaluation have since been the subject of criticism by Monperrus [2014]. This criticism has focused on both the unbalanced composition of the bug scenario benchmark used in the evaluation, and the methodology used to conduct the human study. The latter is of relevance to this study. Principally, Monperrus [2014] highlights potential biases within the human study, which may lead participants to judge the correctness of a patch based on its visual similarity to human-written bug fixes, rather than its semantics. He argues that conflating correctness with appearance may unfairly lead to the dismissal of more alien-looking but otherwise correct patches. In this study, we estimate the utility of a repair action (i.e., its ability to generate repairs) by its frequency within a corpus of mined human-written bug fixes. In practice, this decision may not accurately reflect the utility of some repair actions (e.g., it may be possible to fix a bug using a certain repair action, but such an action is unlikely to be performed by a human programmer, owing to its aesthetics or other non-functional properties). Nonetheless, our results provide a reasonable approximation of utility. For a more detailed discussion of both PAR and its critique, see Section 2.2.5.

Bug Fix Patterns in Java

In an effort towards realising a more effective repair model for Java programs, Soto et al. [2016] use software repository mining to determine the frequency—and to some extent, the composition—of certain bug fix patterns within human-written repairs for Java. As the corpus for their study, the authors use the publicly available September 2015/GitHub dataset, provided by Boa [Dyer et al., 2013], containing 4.5 million identified bug fixes in over 500,000 Java projects. As the first part of their study, Soto et al. [2016] assess how many files are changed by each fix, where file insertions, deletions, and modifications are all considered to be changes. Across all file types—including non-source-code files—the authors observe

a median of 2 changes, and a surprisingly high mean of 11.3 changes, suggesting a long-tailed distribution. When only Java source code files are considered, the authors find that each bug fix changes a mean of 4.47 files—the median number of changes is omitted from the paper. Although this adjusted mean is still high, the relatively low—and more relevant—median suggests that most bug fixes involve changes to two files or fewer. After investigating the number of files changed by each bug fix, Soto et al. [2016] look into the mean number of class, method, field, and variable introductions—high-level changes that are currently beyond the scope of all APR repair models. They find that, on average, each fix introduces 0.16 classes, 0.69 methods, 1.32 fields, and 0.20 variables (these figures fail to account for test fixtures, and should be considered an overestimate). Based on these findings, Soto et al. [2016] advocate for the inclusion of property introduction into (Java) repair models. For two reasons, we do not agree with this conclusion, nor with the methodology used to reach it:

• Within the context of learning suitable models for automated program repair, it does not make sense to base one's decisions on the mean number of introductions. This statistic is more susceptible to noise within the dataset—stemming from a risk of misclassification—and to the effects of large-scale refactorings; Soto et al. [2016] do not report the standard deviation or interquartile range needed to determine this risk. Instead, one should investigate the more statistically robust median, which is better suited to asking the (more relevant) question: “How many properties do repairs tend to introduce?”

• Not all modifications made by developers are strictly necessary to fix the bug. In the process of bug fixing, developers will sometimes refactor the original code to improve its maintainability and reduce its complexity.

Following their investigation of these source-level feature introductions, the authors attempt to determine the frequency of PAR's repair templates within the corpus. Using Boa, Soto et al. [2016] look for instances of eight of the ten templates used by PAR: neither the “Class Cast Checker” nor the “Expression Changer” pattern is investigated, due to technical limitations in Boa. Under the most lenient assumption, that no two of PAR's patterns co-occur within a bug fix, these templates are found to cover 14.78% of buggy files. In contrast, in the dataset used to evaluate the effectiveness of PAR, these repair templates cover 30% of the corpus. This discrepancy suggests that the dataset used to evaluate PAR was unrepresentative; bugs that are fixable by PAR are encountered twice as often in PAR's dataset as they are in general. Finally, the authors turn their attention to the statement-level repair model used by GenProg. After mining instances of GenProg's operators within the corpus, the authors determine the likelihood of a statement of a given type being replaced by one of another type. The results of this analysis demonstrate a non-uniform distribution, and show that certain kinds of statement are highly unlikely to be used as a replacement (e.g., TypeDecl, Label, Do, Assert). The analysis exclusively considers

cases where a statement is replaced by a statement of another type. It may be more informative, for the construction of a repair model, to determine how often a statement is replaced by another of the same type. From observations of the mined Delete and Insert operators, for each kind of statement the authors determine the fraction of fixes that involve deleting, inserting, or otherwise leaving statements of that kind unchanged. The most frequently modified kind of statement, in terms of both insertion and deletion, was found to be Expression statements (e.g., function calls, assignments, casts). If, Return, and For statement insertions and deletions were all found to occur in at least 15% of bug fixes. TypeDeclaration and Assert statements were found to have the lowest rates of insertion and deletion. Although our study shares a similar motivation to that of Soto et al. [2016]—to find a more effective repair model and so reduce the cost of APR—it aims to increase our understanding of repair models by considering the frequency of repair actions and their associated graftabilities.

Repair Quality

Recent studies by Qi et al. [2015] and Smith et al. [2015] demonstrate the tendency of APR techniques to yield plausible but incorrect patches when used with inadequate test suites. Qi et al. [2015] showed that all of the fixes generated by GenProg that were reported in [Le Goues et al., 2012a] were either the result of overfitting, or could be realised through code deletion alone. This finding raises questions regarding the true effectiveness of GenProg's repair model. Firstly, it causes us to ask: “Are GenProg's statement-level repair actions sufficiently expressive to repair most bugs?” Secondly, it leads us to the question: “Do GenProg's assumptions regarding redundancy—i.e., that the materials required for the repair can be found within the same file as the fault—hold in practice?” The questions raised by these studies serve as one of the driving motivations behind this study. By observing a large number of historical bug fixes, we hope to answer both of them. For a more in-depth discussion of the issues of repair quality raised by Qi et al. [2015] and Smith et al. [2015], see Section 3.2.1.

Anti-Pattern Avoidance

To improve the quality of patches produced by existing repair techniques (specifically, GenProg and SPR), Tan et al. [2016] take an orthogonal approach to the problem by introducing the notion of anti-patterns into the search space. Anti-patterns are used to actively prune patches whose resulting changes to the control-flow graph of the program are deemed likely to yield an incorrect or incomplete program, according to a set of heuristics. Each of the seven hand-crafted anti-patterns is briefly outlined below:


• A1 - Anti-delete CFG exit node: prevents the removal of return statements, exit calls, assertions, and calls to functions containing the word “error”.

• A2 - Anti-delete Control Statement: prevents the deletion of control-flow statements (e.g., if-statements, switch-statements and loops).

• A3 - Anti-delete Single-statement CFG: prevents statement deletion within CFG nodes that contain only a single statement.

• A4 - Anti-delete Set-Before-If: prohibits the deletion of a variable definition if that definition is immediately followed by an if-statement using the defined variable.

• A5 - Anti-delete Loop-Counter Update: forbids the deletion of an assignment statement contained within a loop if the set of variables on the LHS of the assignment intersects with the variables referenced by the loop invariant.

• A6 - Anti-append Early Exit: prevents the insertion of return and goto statements at all locations within the program, except for immediately after the last statement within a CFG node.

• A7 - Anti-append Trivial Conditions: prevents the insertion of if-conditions that can be shown, by static analysis of the program, to be either tautological or fallacious.

Motivated by a manual inspection of the patches generated by GenProg, AE, and SPR on the ManyBugs dataset, these anti-patterns are designed to address specific weaknesses that are exploited in producing incorrect patches:

• A1 and A6 are designed to overcome susceptibility to weak test oracles (i.e., those which only check the exit status of the program execution).

• A2, A3 and A4 are designed to compensate for inadequate test coverage, which may otherwise cause important, but untested, functionality to be destroyed.

• A4 is intended to prevent existing vulnerabilities, exposed by the test suite, from being masked rather than directly addressed.

• A5 is intended to prevent the occurrence of infinite loops, which significantly hinder the efficiency of the search and, in the worst case, may be accepted as correct patches by test suites that test for the absence of a particular bug.

• A7 aims to inhibit the implicit removal of functionality (observed when using tools such as SPR) through the insertion of a tautological or fallacious condition.

To demonstrate the effectiveness of these anti-patterns, Tan et al. [2016] produced modified versions of GenProg and SPR incorporating these rules, termed mGenProg and mSPR, respectively, and evaluated them against a sub-set of the CoREBench and ManyBugs datasets, using 86 bugs taken from 12 programs.


• Results demonstrated a 41% reduction of the search space for mGenProg, and a 27% reduction for mSPR, yielding 1.39x and 1.78x speed-ups, as measured by the average wall-clock time taken to find a plausible repair.

• mSPR yielded fewer plausible repairs than SPR in most cases; however, the omitted repairs were those (correctly) identified as erroneous by the anti-patterns. By producing fewer repairs, mSPR reduces the burden of manual inspection placed upon the user.

• Both mGenProg and mSPR removed fewer lines of code than their unmodified variants, demonstrating a lower rate of functionality deletion.

• In cases where the repair produced was incorrect, mGenProg and mSPR were able to localise the fault to a single statement within the program, substantially reducing the complexity of debugging for the user.

Although the incorporation of these anti-patterns into both GenProg and SPR improved both the efficiency of the tools and their ability to localise the fault, neither saw a considerable increase in the number of correct patches produced. For mSPR, a single additional correct repair was found, increasing the total number of correct repairs from 12 to 13 (out of 86). In the case of GenProg, mGenProg found one fewer repair, reducing the number of correct repairs found from 3 to 2. Given the significant space reduction achieved by mGenProg, these results seem to suggest that GenProg's repair model lacks the complexity required to tackle a greater number of bugs. Furthermore, although the results of this study are encouraging, one may question whether the technique generalises beyond the relatively small set of benchmarks used. Since the dataset used to perform the experiment was the same one from which the anti-patterns drew their motivation, the results may be subject to bias. Despite questions over its generality, and the lack of any significant improvement in the number of correct repairs found (which is arguably a reflection of the limitations of the underlying repair model), anti-patterns provide a cost-effective means of reducing the search space, increasing efficiency, and improving fault localisation. As repair models grow in size and complexity, such techniques may prove crucial in tackling the inevitable explosion of the search space.

5.2. Motivation for Study

To avoid producing low-quality patches, and to increase the number of correct repairs produced by search-based repair techniques, we must use alternative, richer repair models. Innovations from previous work, most notably the fix prediction model used by Prophet [Long and Rinard, 2016], can be used to increase the likelihood of selecting a correct repair over a feasible, but incorrect, one; however, to benefit from these additions, we must first possess a repair model capable of constructing a viable repair.

One way to tackle this problem is through the introduction of human repair templates, as proposed in PAR [Kim et al., 2013]. The sourcing of these templates remains a problem, however, and it is unclear how well they would apply to bugs beyond those encountered in the benchmark. From observation, the majority of (publicly visible) repairs for large-scale, open-source programs do not correspond to the common programmer errors encoded by these templates; although certain bug fixes may share similarities in their semantics, most are syntactically unique.

Alternatively, one could take a similar approach to most genetic programming systems beyond the domain of automated repair, and generate code for insertion from scratch, rather than copying it from the program under repair. However, whilst generating fragments of code from scratch may be acceptable for synthesis-based approaches, the resulting search space would almost certainly prove intractable for search-based repair.

Plastic surgery therefore presents a promising way of reducing the search space to a more tractable size, and gives us a more general means of constructing repairs than relying on the utility of manually constructed repair templates. Despite previous studies demonstrating the effectiveness of plastic surgery [Barr et al., 2014; Martinez et al., 2014], its particular realisation within the repair model of GenProg has been shown to struggle to discover a correct repair for most problems to which it has been applied, as both ourselves and others have found [Qi et al., 2015]. (For more details, see Section 3.2.1.)

For the remainder of this chapter, the effectiveness of plastic surgery within the domain of automated repair is explored at finer levels of granularity, beyond the coarsely-grained statement-level repair model used by GenProg. Additionally, the benefit of incorporating a number of different repair actions into the repair model is explored in tandem. Specifically, we ask the following research questions:

• RQ2: Is plastic surgery equally effective for all repair actions?

• RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

5.3. Methodology

Traditionally, when comparing repair techniques, one would make use of an existing set of pre-processed bugs from the literature, such as ManyBugs or GenProg's TSE 2012 benchmarks. The efficacy of repair actions would then be evaluated by determining the number of bugs for which a repair can be found within a fixed resource window [Le Goues et al., 2012a; Long and Rinard, 2015; Mechtaev et al., 2016]. However, such an approach is inappropriate when assessing repair models, for a number of reasons:

1. Diversity: To produce a fair comparison of techniques, one must subject the repair models to a large sample of bugs, taken from a diversity of programs, in order to minimise the effects of sampling errors, and to mitigate the potential for any bias. At the present time, the only publicly available benchmark suites for automated program repair presented in the literature (ManyBugs, IntroClass, GenProg TSE 2012, Defects4J, SPR) consist of at most several hundred bugs, across fewer than 15 different programs. Furthermore, the composition of these benchmarks is unlikely to be truly random, and may be subject to unintentional selection biases; for example, one may omit bug scenarios that are particularly expensive or difficult to replicate, such as bugs within the Linux kernel.

2. Overfitting: Operating on a relatively small set of bugs allows one to unintentionally overfit to the specifics of that sample. By analysing the composition of the corpus, one may quite simply introduce a number of specific repair actions, tailored to fix certain bugs. Such criticism may be levelled at the approach taken by PAR to the design and evaluation of its repair actions, each of which bears the watermarks of solving particular bugs from the benchmark on which it was tested. By evaluating its repair model on such a small and well-studied set of benchmarks, we are unable to determine whether the repair model continues to hold the same utility in a more general context.

A potential solution to this problem of overfitting would be to use separate datasets for training (i.e., learning a suitable repair model) and testing (i.e., assessing the generality of the learned repair model). To avoid misleading results, such a dataset would need to represent a truly random, uniform sampling of real-world bug scenarios. To our knowledge, no such dataset is publicly available. Existing real-world datasets are either hand-picked (e.g., ManyBugs), and thus subject to selection bias, or represent a narrow category of bugs (e.g., IntroClass contains bugs in simple programming assignments). Given the costs and complexities associated with sourcing and evaluating a suitable dataset, the only feasible way to assess the effectiveness of a repair model, in general, is through the use of bug fix mining.

3. Repair Quality: Instead of determining the ability of repair actions to aid in the construction of high-quality fixes, one may unintentionally end up finding the repair actions which best exploit weaknesses in the test suites used by each of the bug scenarios. For example, one may trivially introduce an "Append Exit" repair action, which appends a statement containing exit(0); after the selected statement. For many of the bugs within the TSE benchmarks, and a sub-set of the ManyBugs scenarios, this would yield an acceptable fix, as only the exit status of the program is checked, rather than its outputs. (For more details, see Section 3.2.1.)

4. Feasibility: Even if one were to possess a sufficiently rich variety of bugs, evaluating the effectiveness of each repair action by using each of them to perform search may take prohibitively long. One could sample a sub-set of the candidate fixes generated by each repair action to reduce the running time, but doing so would degrade the accuracy of the evaluation, and would tie the performance of the repair actions to the underlying search technique.

Given the difficulties involved in this approach, we opt to assess the prevalence and graftability of repair actions through software repository mining instead, allowing a far larger corpus to be analysed without the need for expensive test suite evaluations. The steps of our analysis are as follows (a high-level sketch of this pipeline is given after the list):

1. A corpus of human bug fixes is mined from over 200 of the most popular repositories on GitHub containing C source files, using a custom-written, open-source repository mining tool, BugHunter.

2. A set of abstract syntax trees (ASTs) and AST differences is computed for each of the files modified by each fix, using GumTree [Falleri et al., 2014].

3. From these ASTs and their associated differences, a set of repair action instances is mined using the detection rules.

4. For each modified AST, a series of abstract and concrete donor pools is generated from its contents. Using these donor pools, we determine the graftability of each of the proposed repair actions.
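To make the sequence of stages concrete, the sketch below shows how such an analysis might be orchestrated. It is a minimal illustration only: the stage functions passed as parameters (compute_ast_diff, detect_repair_actions, build_donor_pools, is_graftable) are hypothetical placeholders for the corresponding BugHunter and GumTree stages, not their real interfaces.

```python
# Hypothetical orchestration of the analysis; each stage function passed in is
# a placeholder for the corresponding BugHunter/GumTree step, not a real API.
from collections import Counter

def analyse_corpus(bug_fixes, compute_ast_diff, detect_repair_actions,
                   build_donor_pools, is_graftable):
    instances = Counter()   # total instances of each repair action kind
    usage = Counter()       # bug fixes containing at least one instance of a kind
    grafts = Counter()      # instances whose material is present in the donor pool

    for fix in bug_fixes:                                            # step 1
        buggy_ast, fixed_ast, diff = compute_ast_diff(fix)           # step 2
        actions = detect_repair_actions(buggy_ast, fixed_ast, diff)  # step 3
        pools = build_donor_pools(buggy_ast)                         # step 4
        for action in actions:
            instances[action.kind] += 1
            if is_graftable(action, pools):
                grafts[action.kind] += 1
        for kind in {action.kind for action in actions}:
            usage[kind] += 1

    return instances, usage, grafts
```

Frequencies of the kind reported later in Table 5.1 then follow by dividing usage counts by the total number of mined fixes.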

Trade-Offs

Although this approach overcomes the identified problems involved in evaluating repair actions by incorporating them into the search procedure, it also comes with its own set of trade-offs. These trade-offs, and the steps taken to minimise their effects upon the results of the analysis, are as follows:

• Precision: Although this approach allows us to evaluate repair actions across a much larger corpus of bugs, it comes with the drawback that it excludes correct repairs that are not syntactically equivalent to the human bug fix. To partially mitigate this problem, we check whether the human repair is of the same kind as the action (e.g., does it modify an if guard?) in addition to checking whether the exact repair can be crafted from the elements of the program. This step also allows us to reason about the effectiveness of synthesis-driven automated repair approaches, and generative repair models, more like those from traditional applications of genetic programming.

• Irrelevant Changes: Human repairs may often contain modifications to the source code that are irrelevant to the bug fix. These changes might include aesthetic or structural changes, or they may have been bundled together with the bug fix commit. In theory, one could alleviate this problem by using delta-debugging to minimise the AST difference. In practice, however, there may be no (readily accessible) test suite, and so this technique is unusable. Instead, in this study, only bug fixes pertaining to a single file and function are used to perform the analysis of repair actions.

Although the results of this analysis are likely to underestimate the utility of certain repair actions, the overall results help us to understand the composition of human repairs, the effectiveness of the plastic surgery hypothesis, and how we might go about designing a more effective repair model.

5.4. Repair Model

In this section, we provide a more rigorous definition of "repair models", building upon a previous, informal definition given by Martinez and Monperrus [2013]. Following this definition and a brief discussion of its related terms, we propose a series of 23 repair actions, inspired by the plastic surgery hypothesis and existing repair models.

Definitions

The term "repair model" appears to have been introduced by Martinez and Monperrus [2013] in an empirical study of search spaces within APR problems. They provide the following informal definitions:

• a repair action is defined as a "kind of modification on source code that is made to fix bugs", such as "changing the initialisation of a variable" or "adding a method call".

• a repair model is defined as a set of repair actions that are used to construct repairs.

• a repair is defined as a concrete application (or instance) of a particular repair action to source code, such as "adding the method call fun(x)".

We further refine the definition provided by Martinez and Monperrus [2013]:

Host Program, P: is the program subject to repair, whose behaviour is defined to be incorrect (faulty) with respect to a given test suite T. P is represented by the set of abstract syntax trees belonging to the source code files under repair. For the sake of brevity, x ∈ P is taken to refer to the existence of an AST node x within P.


Editable Locations, L: the locations within the program at which a repair action may be applied. For the majority of repair systems, L is comprised of the sub-set of statements within P that are covered by at least one failing test case.

Fault Localisation, µ : L → R: the suspiciousness of each editable location, µ(l) ≥ 0, where larger values of µ(l) imply a higher degree of suspicion, and µ(l) = 0 indicates the belief that l does not contain a fault.

Donor Pool, D: the set of AST sub-trees that may be combined with a repair action to construct an edit.

Repair Actions, a ∈ A: are used to describe a kind of AST transformation that may be performed. Each repair action is described using the form shown in Equation 5.1.

\[
\frac{\mathrm{statement}(s_1) \qquad s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_1;\, s_2]} \tag{5.1}
\]

To the left of the transformation arrow, below the horizontal line, is the matching signature; this uses the BNF for the language of the host program to specify the shape of AST nodes that may be subject to this repair action, and to capture their contents. In the example above, the matching signature is simply s1, allowing the action to be applied to any AST node (before consideration of the rest of the rule).

The right-hand side of the transformation arrow is used to specify the shape of the node that the left-hand side should be transformed into. In the given example, the right-hand side [s1; s2] states that the matching statement should be replaced by a block containing the matching statement and another statement, s2.

Finally, the clauses above the horizontal line describe the transformation criteria that must be satisfied by all instances of this repair action. In the example above, the transformation criteria state that the matching node should not be an empty block, and that the second statement of the right-hand side, s2, should be a non-empty block contained within the donor pool.

Edit, e: represents a particular application of a repair action by specifying the matching node within the host program, together with the values of each of the free variables (e.g., s2 in the previous example).

Edit Space, E: defines the set of all edits that may be applied to P.

Edit Localisation, ω : E → R: is used to assign a weight to each edit, ω(e) ≥ 0, in a similar fashion to fault localisation. In this case, larger weights are used to encode a belief that a particular edit is more likely to be correct.


Repair, R: represents a candidate repair as a set of non-overlapping edits, which yield the program P′ when applied to P. For the sake of our analysis, and in the interest of bounding the size of the repair space, the non-overlapping constraint is strengthened, such that no two edits within a given patch may be applied at the same statement:

\[
\forall e_1, e_2 \in R \;|\; e_1 \neq e_2 \implies \mathrm{nearestStmt}(e_1) \neq \mathrm{nearestStmt}(e_2)
\]

From observation, most human repairs may still be expressed when this constraint is applied.
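A minimal sketch of how this strengthened constraint might be checked for a candidate repair is given below; nearest_stmt is an assumed helper that maps an edit to the (hashable) statement at which it is applied, mirroring nearestStmt above.

```python
def satisfies_non_overlap(repair, nearest_stmt):
    """Return True iff no two edits in `repair` apply at the same statement."""
    seen = set()
    for edit in repair:
        stmt = nearest_stmt(edit)   # statement node targeted by this edit
        if stmt in seen:
            return False            # two edits would touch the same statement
        seen.add(stmt)
    return True
```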

Donor Pool

Associated with each repair model is a donor pool, containing the fragments of code (i.e., AST sub-trees) from which repairs may be crafted. In the case of synthesis-based or generative repair models, such as those used by SPR, PAR, and NOPOL, this donor pool is implicit, and is lazily generated during the synthesis (or generation) procedure. For tools relying on the plastic surgery hypothesis, such as GenProg and SearchRepair, the donor pool explicitly describes the fragments that may be used to form repairs. Typically, the contents of the donor pool are sourced from the program under repair, as is the case with AE, GenProg, Prophet, and others; however, foreign programs may also be used to construct the donor pool, as demonstrated by SearchRepair [Ke et al., 2015].

In either case, one must usually perform optimisations in order to reduce the size of the donor pool to a more scalable one. Whilst these techniques, described below, may lead to efficiency gains, they do so at the cost of potentially solving fewer bugs.

• In the case of GenProg (and its variants), the donor pool is composed from the sub-set of files that are subject to the repair, rather than the entire program. Previous studies [Barr et al., 2014; Martinez et al., 2014] have shown that this restriction causes a minimal reduction in the number of solvable problems. However, these studies failed to account for the particulars of any repair model, potentially reporting a higher number of successes than would otherwise be possible.

• GenProg further restricts the donor pool to only contain code fragments that are executed by at least one positive test case.

• AE reduces the number of redundant entries in the donor pool by treating it as a set, rather than a collection, removing all duplicates of syntactically equivalent statements.

• Rather than restricting the donor pool to the contents of the file under repair, Prophet retains the n-most executed statements from the entire program within its donor pool.

One may also reduce the donor pool further still, through the introduction of compiler optimisation techniques (similar to those used by AE to prune edits from the search space), such as canonicalisation and constant folding.

For each of the reviewed repair techniques which rely upon plastic surgery, the donor pool is used only at the statement level, and so it solely consists of statements from the program. Here, we extend the donor pool, allowing it to operate with a number of finer-grained repair actions, by introducing the following sub-pools (a sketch of how these sub-pools might be populated from an AST is given after the list):

• Dstmt containing all the statements in the program.

• Dblock containing all control-flow blocks within the program, with the exception of Switch blocks.

• Dexp containing all expressions within the program. In practice, one could further divide this sub-pool by type, allowing for more efficient repair action queries.

• Dguard containing all boolean expressions that are used as guards for all if-, while-, do-while- and for-statements within the program.

• Dlhs contains all of the left-hand side expressions of each assignment within the program.

• Drhs contains all of the right-hand side expressions of each assignment within the program.

• Darg contains all of the function call arguments within the program.
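The sketch below illustrates one plausible way of populating these sub-pools in a single pass over the host program's AST. The node interface it assumes (a `kind` string plus `guard`, `lhs`, `rhs`, and `args` attributes) is invented for the example and stands in for whichever AST library is actually used.

```python
def build_donor_pools(nodes):
    """Partition the AST nodes of the host program into the donor sub-pools.

    `nodes` is assumed to be an iterable over every node in the program's AST;
    each node exposes a `kind` string and, where relevant, `guard`, `lhs`,
    `rhs` and `args` attributes. These are illustrative assumptions only.
    """
    pools = {name: [] for name in
             ("stmt", "block", "exp", "guard", "lhs", "rhs", "arg")}
    guarded_kinds = ("If", "While", "DoWhile", "For")
    for node in nodes:
        if node.kind == "Statement":
            pools["stmt"].append(node)
        if node.kind == "Block" and getattr(node, "parent_kind", None) != "Switch":
            pools["block"].append(node)            # exclude Switch blocks
        if node.kind == "Expression":
            pools["exp"].append(node)
        if node.kind in guarded_kinds:
            pools["guard"].append(node.guard)      # boolean guard expressions
        if node.kind == "Assign":
            pools["lhs"].append(node.lhs)
            pools["rhs"].append(node.rhs)
        if node.kind == "FunCall":
            pools["arg"].extend(node.args)         # individual call arguments
    return pools
```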

Repair Actions

Having provided a definition of repair models, repair actions, and their associated terms, in this section we define a number of repair actions, inspired by existing repair techniques. Each repair action is tailored to exploit the plastic surgery hypothesis, in order to maintain a tractable edit space, through the incorporation of the extended donor pool described in Section 5.4.2. Below, we describe each of the 23 repair actions, using the repair action notation introduced in Section 5.4.1. To aid our descriptions, we introduce the following definitions:

• [s1; ... ; sn] denotes a block of statements.

• [] denotes the empty block.

• HasKind(node, kind) determines whether a given AST node belongs to a specified kind (e.g., statement, expression, function call). A node may have multiple kinds.


Generic Statement Actions

This first group of repair actions is identical to the original actions used in the GenProg repair model, with the addition of empty statement checking, a prepend action, and the removal of the Swap action, present in earlier versions of GenProg.

• Delete Statement: removes a non-empty statement from the program.
\[
\frac{s \neq [\,]}{s \;\to\; [\,]}
\]

• Append Statement: appends a statement from the donor pool immediately after an existing statement (see the sketch after this list).
\[
\frac{s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_1;\, s_2]}
\]

• Prepend Statement: clones a statement from the donor pool at random and prepends it immediately before an existing statement in the program.
\[
\frac{s_1 \neq [\,] \qquad s_2 \neq [\,] \qquad s_2 \in D}{s_1 \;\to\; [s_2;\, s_1]}
\]

• Replace Statement: replaces an existing statement within the program with a randomly selected clone, taken from the donor pool of statements.
\[
\frac{s \neq [\,] \qquad s' \neq [\,] \qquad s' \neq s \qquad s' \in D}{s \;\to\; s'}
\]
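As a concrete illustration of the Append Statement action, the sketch below applies it to a toy representation in which a block is simply a Python list of statements and the empty block is []; this simplification is ours and does not reflect the internals of any particular repair tool.

```python
import copy
import random

def append_statement(block, index, donor_pool, rng=random):
    """Append Statement: clone a donor statement and insert it immediately
    after the statement at block[index].

    Blocks are modelled as plain Python lists and the empty block as [];
    this is an illustrative simplification, not a real AST.
    """
    donors = [s for s in donor_pool if s != []]   # criterion: s2 != []
    if block[index] == [] or not donors:          # criterion: s1 != []
        return None
    clone = copy.deepcopy(rng.choice(donors))     # criterion: s2 in D
    block.insert(index + 1, clone)                # s1 -> [s1; s2]
    return clone
```

For example, calling append_statement(stmts, 0, donor_pool) on stmts = ["x = 1;", "return x;"] would splice a cloned donor statement between the two existing statements.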

If-Statement-Related Actions

• Wrap Statement: replaces a compatible statement with an if-statement whose body contains the original statement, and whose condition is taken from the donor pool (a sketch of this rule as executable code is given after this list).
\[
\frac{s \in S \qquad NC = \{\mathrm{ForLoop}, \mathrm{WhileLoop}, \mathrm{DoWhileLoop}, \mathrm{Switch}, \mathrm{IfThen}\} \qquad \forall k \in NC \;|\; \neg\mathrm{HasKind}(s, k) \qquad c \in D}{s \;\to\; \mathrm{IfThen}(c, [s], [\,])}
\]

• Unwrap Statement: replaces an if-statement that contains no else-branch by its then-branch.
\[
\frac{else = [\,] \qquad then \neq [\,]}{\mathrm{IfThen}(c, then, else) \;\to\; then}
\]

• Replace If-Condition: replaces the condition of an existing if-statement within the program with a boolean expression taken from the donor pool.
\[
\frac{c' \in D \qquad c' \neq c}{\mathrm{IfThen}(c, then, else) \;\to\; \mathrm{IfThen}(c', then, else)}
\]

• Replace Then-Branch: replaces the then-branch of an existing if-statement with a (compatible) block taken from the donor pool.
\[
\frac{then' \in D \qquad then' \neq then \qquad then' \neq [\,] \qquad then' \neq else}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then', else)}
\]

• Replace Else-Branch: replaces the else-branch of an existing if-statement with a (compatible) block taken from the donor pool.
\[
\frac{else' \in D \qquad else' \neq [\,] \qquad else \neq [\,] \qquad else' \neq then}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then, else')}
\]

• Remove Else-Branch: removes the else-branch of an existing if-statement, provided that the else-branch does not contain an if-statement, and is not empty.
\[
\frac{else \neq [\,] \qquad \mathrm{kind}(else) = \mathrm{Block}}{\mathrm{If}(guard, then, else) \;\to\; \mathrm{If}(guard, then, [\,])}
\]

• Insert Else-Branch: clones an existing block from the donor pool, and inserts it into a given if-statement (which lacks an else-branch) as its else-branch.
\[
\frac{else \in D \qquad else \neq [\,] \qquad else \neq then}{\mathrm{If}(guard, then, [\,]) \;\to\; \mathrm{If}(guard, then, else)}
\]

• Insert Else-If-Branch: selects both a randomly-selected block and guard from the donor pool, before joining them into an if-statement, and replacing the empty else-branch of an if-statement with the created statement.
\[
\frac{g_2 \in D \qquad b_2 \in D \qquad b_2 \neq [\,] \qquad b_2 \neq b_1 \qquad g_2 \neq g_1 \qquad elif = \mathrm{If}(g_2, b_2, [\,])}{\mathrm{If}(g_1, b_1, [\,]) \;\to\; \mathrm{If}(g_1, b_1, elif)}
\]

• Add Guard to Else-Branch: adds a guard to an unguarded else-branch within the program.
\[
\frac{else \neq [\,] \qquad \neg\mathrm{HasKind}(else, \mathrm{If}) \qquad g_2 \in D \qquad else' = \mathrm{If}(g_2, else, [\,])}{\mathrm{If}(g_1, then, else) \;\to\; \mathrm{If}(g_1, then, else')}
\]
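To show how rules in this notation might be realised programmatically, the sketch below encodes the Wrap Statement action as a matching predicate plus a transformation over a toy Node type; the Node class and its fields are invented for the example, and, for simplicity, each toy node carries a single kind, whereas the rule above permits multiple kinds.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy AST node used only for this sketch."""
    kind: str
    children: list = field(default_factory=list)

# Kinds excluded by the Wrap Statement matching criteria (the set NC above).
EXCLUDED_KINDS = {"ForLoop", "WhileLoop", "DoWhileLoop", "Switch", "IfThen"}

def wrap_statement_applies(stmt: Node) -> bool:
    """Matching criteria: the statement belongs to no excluded kind."""
    return stmt.kind not in EXCLUDED_KINDS

def wrap_statement(stmt: Node, guard_pool: list, rng=random) -> Node:
    """Transformation: s -> IfThen(c, [s], []) for a guard c drawn from D.

    Assumes a non-empty guard_pool (the Dguard sub-pool).
    """
    guard = rng.choice(guard_pool)
    return Node("IfThen", [guard, Node("Block", [stmt]), Node("Block", [])])
```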

Switch-Related Actions

• Replace Switch-Expression: replaces the expression of a switch statement with a compatible expression from elsewhere in the program.
\[
\frac{exp' \in D \qquad exp' \neq exp}{\mathrm{Switch}(exp, block) \;\to\; \mathrm{Switch}(exp', block)}
\]


Loop-Related Actions

• Replace Loop-Guard: replaces the guard of an existing for-, while- or do-while-loop with a guard cloned from the donor pool.
\[
\frac{g' \in D \qquad g' \neq g}{\mathrm{While}(g, do) \;\to\; \mathrm{While}(g', do)}
\qquad
\frac{g' \in D \qquad g' \neq g}{\mathrm{DoWhile}(g, do) \;\to\; \mathrm{DoWhile}(g', do)}
\qquad
\frac{g' \in D \qquad g' \neq g}{\mathrm{For}(init, g, incr, do) \;\to\; \mathrm{For}(init, g', incr, do)}
\]

• Replace Loop-Body: replaces the body of an existing for-, while- or do-while-loop with a block cloned from the donor pool.
\[
\frac{do' \in D \qquad do' \neq do}{\mathrm{While}(g, do) \;\to\; \mathrm{While}(g, do')}
\qquad
\frac{do' \in D \qquad do' \neq do}{\mathrm{DoWhile}(g, do) \;\to\; \mathrm{DoWhile}(g, do')}
\qquad
\frac{do' \in D \qquad do' \neq do}{\mathrm{For}(init, g, incr, do) \;\to\; \mathrm{For}(init, g, incr, do')}
\]

Assignment-Related Actions

• Replace RHS of Assignment: replaces the right-hand side of an assignment statement with a compatible expression, cloned from the donor pool.
\[
\frac{rhs' \in D \qquad rhs' \neq rhs}{\mathrm{Assign}(lhs, op, rhs) \;\to\; \mathrm{Assign}(lhs, op, rhs')}
\]

• Replace LHS of Assignment: replaces the left-hand side of an assignment with a compatible left-hand side, cloned from the donor pool.
\[
\frac{lhs' \in D \qquad lhs' \neq lhs}{\mathrm{Assign}(lhs, op, rhs) \;\to\; \mathrm{Assign}(lhs', op, rhs)}
\]

Function-Call-Related Actions

• Replace Target of Function Call: replaces the target of a function call with one cloned from the donor pool.
\[
\frac{target' \in D \qquad target' \neq target}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target', args)}
\]

• Modify Arguments of Function Call:
\[
\frac{args' \in D \qquad args' \neq args}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Insert Argument into Function Call:
\[
\frac{arg' \in D \qquad args = l \oplus r \qquad args' = l \oplus (arg') \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Replace Argument of Function Call:
\[
\frac{arg' \in D \qquad args = l \oplus (arg) \oplus r \qquad args' = l \oplus (arg') \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

• Remove Argument of Function Call:
\[
\frac{args = l \oplus (arg) \oplus r \qquad args' = l \oplus r}{\mathrm{FunCall}(target, args) \;\to\; \mathrm{FunCall}(target, args')}
\]

5.5. Approach

Below, we briefly describe the steps used by our tool, BugHunter, to mine repair action instances in Git repositories:

1. Repository Sourcing: We use GitHub's API (https://github.com) to identify and download the top 200 most-starred projects whose source code includes C, using stars as a crowd-sourced indicator of the popularity of a project. Notable projects include the Linux kernel, Redis, Git, PHP, Vim, tmux, CCV, and Curl.

2. Bug Fix Detection: We discover commits that are suspected to be bug fixes across all of the 200 Git repositories. Possible bug fixes are found by iterating over commit histories (using the open-source GitPython API: https://github.com/gitpython-developers/GitPython), and identifying commits that satisfy all of the following criteria (a sketch of these checks, assuming the GitPython API, is given after this list):

(a) The commit message must contain at least one of the following words: bug(s), fix, fixed, fixes, fault(y), defect, repair(ed), patch(ed).

(b) To avoid compilation-related source changes, or artefacts from unrelated version control actions, the commit message for a potential bug fix should not contain any of the following words: compile, compilation, merge, revert.

(c) The commit for a potential bug fix must modify at least one .c source file, and should not modify any header files (i.e., .h files), nor should it modify the source code of files written in another language (that can be identified; e.g., .java, .py, .cc).

(d) Any modified .c source code files within the commit of a potential bug fix must also exist in the previous version of the program (i.e., no new .c source files may be introduced).

(e) The commit does not delete any .c files.

Although there may be more sophisticated ways to detect bug-fixing commits, we believe our approach is suitable for this study. Our relatively strict criteria are likely to result in a high number of false negatives, which should have little to no effect on the outcome of the study. Importantly, these criteria should maintain a low false positive rate. That is, the commits within our dataset are likely to be bug fixes, but the size of the dataset may be much smaller than the set of all bug-fixing commits. We address concerns of size by examining commits from 200 projects.

Rather than dealing with the overheads and limitations of the GitHub API, BugHunter downloads given Git repositories (which may be hosted at locations other than GitHub, if one so wishes) to the host machine and interacts with the repositories directly. This avoids the need to re-download the source code for a particular pair of commits, as the faulty and patched files can be quickly acquired using the git branch command.

In total, we collected 372,463 bug-fixing commits. To reduce the costs of mining, we use a random sample of 10,000 commits for the subsequent stages of the mining process.

3. AST and Edit Script Generation: We use GumTree [Falleri et al., 2014] to generate the abstract syntax trees (ASTs) of the buggy and fixed versions of each modified file. We also use GumTree to compute an edit script between the two trees, describing a set of primitive operations required to transform the buggy AST into the fixed AST. Finally, we apply post-processing steps to these ASTs and edit scripts to ease their integration into later stages of the mining process. See Appendix B.1 for more details on this stage of the mining process.

4. Donor Pool Extraction: We compute the set of donor pools, outlined in Section 5.4.2, for each of the buggy ASTs. To reduce the memory burden of maintaining multiple donor pools across thousands of fixes, we hash code snippets and save them to disk.

5. Repair Action Detection: We use inference rules, described in Appendix B.2, to mine instances of the repair actions described in Section 5.4.3.

6. Graft Detection: We determine which of the mined repair action instances can be grafted from the contents of each donor pool.
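As promised in step 2, the sketch below shows how criteria (a)-(e) might be checked using the GitPython API (git.Repo, iter_commits, and commit diffs). It approximates the filtering logic only; the actual BugHunter implementation may handle edge cases (merge commits, renames, message encodings) differently.

```python
import re
import git  # GitPython

FIX_RE = re.compile(r"\b(bugs?|fix(es|ed)?|fault(y)?|defect|repair(ed)?|patch(ed)?)\b",
                    re.IGNORECASE)
VETO_RE = re.compile(r"\b(compile|compilation|merge|revert)\b", re.IGNORECASE)
OTHER_EXTS = (".h", ".java", ".py", ".cc")

def candidate_bug_fixes(repo_path, rev=None):
    """Yield commits in the given repository that satisfy criteria (a)-(e)."""
    repo = git.Repo(repo_path)
    for commit in repo.iter_commits(rev):
        msg = commit.message
        if not FIX_RE.search(msg) or VETO_RE.search(msg):        # (a), (b)
            continue
        if not commit.parents:                                   # skip root commits
            continue
        diffs = commit.parents[0].diff(commit)
        paths = [(d.b_path or d.a_path or "") for d in diffs]
        modifies_c = any(p.endswith(".c") for p in paths)
        touches_other = any(p.endswith(OTHER_EXTS) for p in paths)
        adds_c = any(d.new_file and (d.b_path or "").endswith(".c") for d in diffs)
        dels_c = any(d.deleted_file and (d.a_path or "").endswith(".c") for d in diffs)
        if modifies_c and not (touches_other or adds_c or dels_c):  # (c), (d), (e)
            yield commit
```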


5.6. Results

In this section we present the findings of our analysis. Firstly, we look at the frequency of each of the proposed actions, before determining the extent to which instances of these repair actions can be grafted from existing source code within the same file.

Repair Action Frequency

To determine the utility of each repair action outside of the context of plastic surgery, Table 5.1 shows the number of instances of each repair action that were found across each of the mined bug fixes, together with the number of bug fixes for which a repair action of that kind could have been used.

We find that statement modification is by far the most frequently-encountered repair action, used by over 73% of bug fixes. This result, and the relatively low rate of statement deletion, confirms our hypothesis that the majority of changes to the program occur below the statement level used by existing techniques, such as SPR and GenProg. To fix more bugs, we can either restrict the consideration of replacement statements to those that are structurally similar to the ones being replaced, or we can extend the repair model with actions that operate below the level of statements.

We also find that statement deletion occurs in almost a third of bug fixes, suggesting that attempts to remove this repair action in the hope of avoiding the low-quality repairs discovered within GenProg may be misguided and detrimental to the effectiveness of the search technique. The high frequency of statement deletion may also be (partly) due to the way that statement replacement events are handled (as separate deletion and insertion actions). Nevertheless, we find that statement insertion occurs in over half of bug fixes, lending further support to the statement-level repair model.

Upon first glance, the fraction of bug fixes that involve replacement of large blocks of code, such as if-branches and loop bodies, may seem particularly high. This number is likely due to the fact that we attempt to find all possible ways of interpreting a repair; a modification to a particular statement within a block may also be achieved by replacing that entire block with another that contains the necessary changes.

Amongst the least frequent repair actions (< 2%), we find "Replace Assignment Operator", "Guard Else Branch", "Remove Call Argument", "Replace Switch Expression", "Insert Else-If Branch", "Unwrap Statement", and "Insert Else Branch". Although "Remove Call Argument" and "Unwrap Statement" are relatively rare, neither of these repair actions requires any donor code, and so they may be added to a repair model to solve a small number of bugs at a relatively low additional cost. In contrast, each of the other repair actions may potentially have a large number of applicable potential snippets, and thus they may represent a poor value proposition within a context where compute resources prevent a large number of candidate repairs from being tested.

Repair Action                 Instances   Usage       %
Modify Statement                 27,989   2,027   73.07
Insert Statement                  9,749   1,510   54.43
Replace Then Branch               9,835   1,136   40.95
Modify Call                       6,473     934   33.67
Delete Statement                  6,702     915   32.98
Modify Assignment                 5,869     843   30.39
Replace If Condition              1,946     648   23.36
Replace Call Argument             2,898     546   19.68
Replace Assignment RHS            1,668     535   19.29
Replace Loop Body                   986     473   17.05
Replace Else Branch               3,481     408   14.71
Replace Call Target                 828     201    7.25
Replace Assignment LHS              429     115    4.15
Wrap Statement                      148     105    3.79
Insert Call Argument                388      82    2.96
Replace Loop Guard                  134      69    2.49
Remove Else Branch                   76      62    2.24
Insert Else Branch                   54      45    1.62
Unwrap Statement                     38      28    1.01
Insert Else-If Branch                19      17    0.61
Replace Switch Expression            21      16    0.58
Remove Call Argument                 23       8    0.29
Guard Else Branch                     2       2    0.07
Replace Assignment Operator           0       0    0.00

Table 5.1: The number of instances of each repair action discovered across each of the mined bugs, together with the number (and percentage) of bugs that involve at least one repair action of that type.

Repair Action Graftability

Having determined the frequency of repair actions, we now turn our attention to whether the snippets used within those repair actions can be discovered within the (buggy version of the) source code of the same file. Specifically, we measure the graftability of each of the different types of repair action (i.e., the fraction of the repair action instances whose materials can be found in the donor pool). Since the graftability of a repair action is dependent upon the contents of the donor pool, we measure graftability for both the concrete and abstract donor pools. The results of this analysis can be found in Table 5.2. Note that actions which purely involve deletion, such as "Delete Statement" and "Remove Else Branch", are completely graftable regardless of donor pool, since they do not introduce code.

                                       Graftability
Type                          Concrete Pool   Abstract Pool
Delete Statement                    100.00%         100.00%
Guard Else Branch                     0.00%          50.00%
Insert Call Argument                 44.85%          88.92%
Insert Else Branch                   20.37%          25.93%
Insert Else-If Branch                 5.26%          15.79%
Insert Statement                     27.17%          37.90%
Modify Assignment                    58.00%          71.63%
Modify Call                           4.55%          45.54%
Modify Call Arguments                 4.72%          37.22%
Modify Statement                     47.40%          57.54%
Remove Call Argument                100.00%         100.00%
Remove Else Branch                  100.00%         100.00%
Replace Assignment LHS               17.25%          49.18%
Replace Assignment RHS               10.07%          45.50%
Replace Call Arg                     22.60%          52.45%
Replace Call Target                  18.84%          94.08%
Replace Else Branch                   0.92%          37.92%
Replace If Condition                 34.33%          55.65%
Replace Loop Body                     1.11%          15.52%
Replace Loop Guard                   12.69%          38.06%
Replace Switch Expression             9.52%          42.86%
Replace Then Branch                  47.25%          51.46%
Unwrap Statement                    100.00%         100.00%
Wrap Statement                       35.81%          54.05%

Table 5.2: The graftability of each repair action in the contexts of the concrete pool, containing the unchanged snippets from the file under repair, and the abstract pool, containing the unlabelled forms of the snippets from the file under repair.
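The sketch below shows one plausible way in which the graftability of a mined repair-action instance could be decided against a donor pool: serialise the snippets the human fix introduced and test whether identical serialisations occur in the pool. The serialise helper is assumed to produce a stable string for an AST snippet; it is not part of any real tool's interface.

```python
def is_graftable(instance, donor_pool, serialise):
    """Decide whether the donor material required by a repair-action instance
    can be found, verbatim, in the given donor pool.

    `instance.required_snippets` is assumed to list the AST snippets that the
    human fix introduced; pure deletions have an empty list and are therefore
    always graftable.
    """
    available = {serialise(snippet) for snippet in donor_pool}
    return all(serialise(s) in available for s in instance.required_snippets)
```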

Concrete Donor Pool

Using the contents of the concrete pool, we find that between 0–58% of instances of particular repair action types can be grafted in their entirety. The large degree of variance in graftability between different kinds of repair action suggests that plastic surgery is more effective within the context of some kinds of repair action, and less so in others.

At the statement level, we find that 27% of statement insertions may be grafted; a similar rate to previous studies [Barr et al., 2014; Martinez et al., 2014]. Encouragingly, we find that statement modification (the most frequently encountered repair action, at 73%) is the second-most graftable repair action, with a graftability of 47%; this result suggests that the modified form of a statement is likely to already exist within the donor pool. We find that 58% of "Modify Assignment" actions are graftable, hinting that varying degrees of graftability may be seen between different kinds of statements; we leave investigation of this possibility, and how it may be harnessed by the repair model, to future work.

In general, we find that repair actions that accept a block of statements as their input are amongst the least graftable: "Replace Loop Body" (1%), "Replace Else Branch" (1%) and "Insert Else-If Branch" (5%). Two surprising exceptions to this rule are "Insert Else Branch" (20%) and "Replace Then Branch" (47%).

Below the statement level, we observe a graftability between 5% and 45%. Again, we observe substantial variance between the graftability of actions at this level. Amongst the most graftable repair actions are "Replace If Condition" (34%), "Wrap Statement" (36%), and "Insert Call Argument" (45%).

Abstract Donor Pool

Comparing the abstract pool to the concrete pool, we see a substantial increase in graftability, rising from 0–58% to 16–94%. These results show that snippets structurally identical to those used by the repair are likely to already exist, and as such, serve as an effective pool of donor code from which to automatically craft repairs. This increase in graftability is most prominent in the block-level actions, which were amongst the most difficult to graft using the concrete donor pool: "Replace Loop Body" (1% to 16%), "Replace Else Branch" (1% to 38%), and "Insert Else-If Branch" (5% to 16%). Likewise, "Modify Call" and "Modify Call Arguments" rise from 5% each to 46% and 37%, respectively. When one removes block-level repair actions from consideration, one finds that graftability increases from 16–94% to 37–94%, a reasonable figure when attempting to automatically generate repairs.
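A minimal sketch of the kind of unlabelling that distinguishes the abstract pool from the concrete pool is shown below. It operates on a token sequence and replaces identifiers with a placeholder; the token representation and the decision of what counts as an identifier are assumptions made for the example, as the real analysis works over AST labels.

```python
def abstract_tokens(tokens):
    """Replace identifier tokens with a placeholder so that two snippets that
    differ only in variable or function names compare as equal.

    `tokens` is assumed to be a sequence of (kind, text) pairs produced by a
    lexer that distinguishes identifiers from keywords, operators and literals;
    this representation is an assumption made purely for the example.
    """
    return tuple("<ID>" if kind == "identifier" else text
                 for kind, text in tokens)

def abstract_pool(snippets, tokenise):
    """Build the abstract donor pool as a set of unlabelled token sequences."""
    return {abstract_tokens(tokenise(snippet)) for snippet in snippets}
```

Membership of a human-introduced snippet in such an abstract pool then corresponds to the abstract-pool graftability reported in Table 5.2.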

5.7. Discussion & Conclusion

Building upon previous work on plastic surgery and the redundancy assumptions of automated program repair, we find that the effectiveness of plastic surgery varies according to the particular repair actions to which it is applied. In general, we find that plastic surgery is most effective in the context of more granular repair actions, at and below the statement level, and less so at the block level. We find that our odds of discovering a graft within the previous version of the file under repair are significantly enhanced when labels (i.e., variable and function names) are removed from consideration.

In Table 5.3, we summarise the findings of our study by giving the frequency and graftability of each of the proposed repair actions, together with a measure of its effectiveness, computed as the product of frequency and graftability. To achieve a balance between search efficiency and the number of bugs that can be solved by a given tool, we suggest that the search should devote the bulk of its attention towards repair actions that are both frequent and highly graftable. We also suggest swapping GenProg's "Replace Statement" operator with a "Modify Statement" operator, or otherwise tuning the "Replace Statement" operator to heavily favour the replacement of a statement with a structurally similar one.


Type                          Frequency   Graftability   Effectiveness
Modify Statement                 73.00%         57.54%          42.00%
Delete Statement                 32.98%        100.00%          32.98%
Modify Assignment                30.39%         71.63%          21.77%
Replace Then Branch              40.95%         51.46%          21.07%
Insert Statement                 54.43%         37.90%          20.63%
Remove Call Argument             19.68%        100.00%          19.68%
Modify Call                      33.67%         45.54%          15.33%
Remove Else Branch               14.71%        100.00%          14.71%
Replace If Condition             23.36%         55.65%          13.00%
Replace Call Arg                 19.68%         52.45%          10.32%
Replace Assignment RHS           19.29%         45.50%           8.78%
Modify Call Arguments            19.68%         37.22%           7.32%
Replace Call Target               7.25%         94.08%           6.82%
Replace Else Branch              14.71%         37.92%           5.58%
Replace Loop Body                17.05%         15.52%           2.65%
Insert Call Argument              2.96%         88.92%           2.63%
Wrap Statement                    3.79%         54.05%           2.05%
Replace Assignment LHS            4.15%         49.18%           2.04%
Unwrap Statement                  1.01%        100.00%           1.01%
Replace Loop Guard                2.49%         38.06%           0.95%
Insert Else Branch                1.62%         25.93%           0.42%
Replace Switch Expression         0.58%         42.86%           0.25%
Insert Else-If Branch             0.61%         15.79%           0.10%
Guard Else Branch                 0.07%         50.00%           0.04%

Table 5.3: A summary of the frequency of each of the proposed repair actions, measured by the percentage of bugs in which it is encountered, together with the graftability of that repair action when the abstract pool is used. Effectiveness, computed as the product of frequency and graftability, estimates the fraction of bugs for which a given repair action may graft a repair.
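The effectiveness measure used in Table 5.3 is simply the product of the two quantities already reported. The sketch below computes it and ranks the repair actions, which is also one way a search might derive frequency-and-graftability-aware sampling weights; the two input dictionaries are assumed to map action names to fractions in [0, 1].

```python
def rank_by_effectiveness(frequency, graftability):
    """Rank repair actions by effectiveness = frequency x graftability.

    `frequency` and `graftability` map repair-action names to fractions in
    [0, 1]; actions missing a graftability entry are treated as ungraftable.
    """
    scores = {action: freq * graftability.get(action, 0.0)
              for action, freq in frequency.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example with two of the values from Tables 5.1 and 5.2 (abstract pool):
ranking = rank_by_effectiveness(
    {"Modify Statement": 0.73, "Wrap Statement": 0.0379},
    {"Modify Statement": 0.5754, "Wrap Statement": 0.5405})
# [('Modify Statement', 0.420...), ('Wrap Statement', 0.020...)]
```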

RQ2: Is plastic surgery equally effective for all repair actions?

We discovered that plastic surgery is less effective when applied at the block level. This suggests that excluding such edits from the search space may increase efficiency with minimal compromise to repairability (i.e., the bugs that can be fixed). We found that statement deletion is the second most common repair action, occurring in roughly a third of bug fixes. This result suggests that excluding statement deletion from the repair model to avoid overfitting [Long and Rinard, 2015; Qi et al., 2015] may be misguided and likely to reduce the ability to fix bugs.

Given the success of repair techniques which use repair actions below the statement level, it is crucial that such actions be incorporated into scalable, search-based repair.


Our results demonstrate that plastic surgery can be successfully applied at this level of granularity—by doing so, we can allow search-based repair to address a greater number of bugs.

We also found that, in the majority of bug-fixing commits, the changes occur below the statement level. In 58% of cases, the modified variant of the statement could be found somewhere else in the program. This result could be used to increase the efficiency of program repair by focusing statement replacement on structurally similar snippets. This result builds on previous work by Soto et al. [2016], showing that certain kinds of statement are more likely to be replaced by certain kinds of donor statements (i.e., not all statement replacements are equally likely).

RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

We found that the effectiveness of plastic surgery can be boosted significantly through the use of abstract donor pools. Our results suggest that incorporating abstract donor pools into repair techniques could provide a substantial increase in the number of fixable bugs.

Future Work

Below, we discuss a number of ways in which this analysis could be extended to gain knowledge about the likelihood and composition of repair actions; knowledge that may be leveraged to produce a more predictive, effective, and efficient repair model, allowing more bugs to be solved in fewer candidate evaluations.

• Limitations of Plastic Surgery: We found that the success of plastic surgery is dependent upon both the repair actions to which it is applied, and their associated level of granularity. Building on this, one could further explore a more detailed set of repair actions to determine whether type, or other attributes associated with code fragments, affect graftability. In particular, it may be fruitful to explore the graftability of particular kinds of statements and expressions, and whether certain fragments are unlikely to be found within existing code (e.g., print statements and strings). By identifying the strengths and weaknesses of plastic surgery, we can tune our repair model accordingly, and prune certain portions of the search landscape, leading to a more efficient search process.

• Domain-Specific Languages: Although we use inference rules to formalise the semantics of repair actions within this analysis, further work could be performed to allow a compact domain-specific language to be used to describe (and implement) repair actions (in terms of constrained source code transformations). Beyond improvements to software quality, this addition would allow potential repair actions to be machine generated, rather than being translated by hand, opening up the possibility of using evolutionary computation and knowledge discovery to construct and evaluate novel repair actions.

• Graft Localisation: Although we may use the frequency of repair actions to generate a more effective probabilistic repair model, this step only allows us to differentiate between repair actions, rather than candidate grafts. One solution to this would be to incorporate an additional layer into the probabilistic model, describing the likely type of the donor code snippet. Soto et al. [2016] achieve this by looking at which kind of statement is most likely to replace a statement of a given kind. Their approach could be further extended to statement insertions, or it could use other attributes of a snippet, such as its size or complexity.

More generally, it may be productive to apply machine learning to the problem of graft selection, for both automated repair and otherwise. Ideally, such an approach would allow us to rank grafts by their likelihood. In addition to potentially improving the quality of repairs and the efficiency of the search process, this step would allow us to remove the non-deterministic aspects of the search with minimal impact on results, eliminating the costly need to gauge performance over a number of repeated runs.

• Discovery of Repair Actions: Instead of mining a pre-specified set of repair actions, one may attempt to discover common, predictive repair actions within the corpus, through the application of machine learning and graph mining techniques. In particular, one may combine a variant of LASE [Meng et al., 2013], a method for discovering context-aware, systematic edits, with BugHunter to find (and constrain) common repair actions across a wide and diverse corpus of programs.

• Bug-Fix Detection and Categorisation: To identify a larger number of bug-fixing commits, and to reduce the number of false positives, one could explore the use of more sophisticated bug fix identification criteria. Whilst incorporating such measures would be unlikely to significantly alter the outcome of the research questions posed in this chapter, it may allow BugHunter to be used to probe the composition of specific repair actions, allowing the findings to be exploited to produce a more effective repair technique. By improving bug fix detection accuracy, one may use existing techniques [Thung et al., 2012] to predict the category of bug under repair, potentially allowing the exploitation of that information to suggest which repair actions are most likely to be used and where the fix is most likely to reside.


CHAPTER 6

Search

Although search-based program repair techniques such as GenProg, HDRepair, and PAR are designed to be capable of multiple-edit repair, rarely is this ability used in practice. Underlying all of these techniques is a common search algorithm based on genetic programming; HDRepair and PAR can be viewed as extensions to the original GenProg system. Despite possessing the ability to construct multiple-line patches, in practice, the majority of patches generated by GenProg can be reduced to a single edit [Qi et al., 2015]. Moreover, almost all of the plausible patches generated by GenProg on the ManyBugs dataset have been shown to be equivalent to functionality deletion [Qi et al., 2015]; only 2 of the 55 reported bug fixes are correct with respect to the intended program semantics. Whilst these results suggest that GenProg, as a whole, struggles to craft repairs, it is unclear whether this is due to an inadequate repair model, a poor search algorithm, or both.

Since the introduction of GenProg, work on search-based program repair has focused primarily on the efficiency of the repair process, to the detriment of scalability (i.e., the ability to repair multiple-line bugs). We briefly discuss three subsequent search-based repair techniques with alternative search algorithms:

• AE, proposed by a sub-set of GenProg's authors, replaces GenProg's search algorithm with an exhaustive search. By default, this search is conducted within a single-edit search space. This search space is obtained by discovering, and discarding, a sub-set of provably equivalent patches, through the use of various compiler optimisations and program analysis techniques. Furthermore, by treating the search process as a decision problem (i.e., is this patch correct?), rather than an optimisation one (i.e., how close is this patch to being correct?), AE uses test case prioritisation and regression test selection techniques to reduce the expected number of test case evaluations. (Note that regression test selection, unlike test prioritisation, could be used by GenProg to reduce the test suite for a candidate patch.)

• RSRepair, a variant of GenProg, showed that a form of random search attained a higher efficiency and success rate than GenProg's genetic algorithm [Qi et al., 2014]. However, RSRepair constrains itself to a single-edit search space, and the ManyBugs dataset was used to conduct its evaluation. Thus, it may be that RSRepair simply converges to a plausible but incorrect solution faster than GenProg does. Authors have incorrectly cited this study as evidence that GenProg is merely equivalent to a less efficient form of random search [Long and Rinard, 2015, 2016]. Whilst this may be the case, the evaluation of RSRepair does nothing to prove it, nor does it claim to. Rather, RSRepair shows the effectiveness of using test case prioritisation to reduce the cost of finding single-edit patches.

• SPR introduces both a new, substantially larger repair model, intentionally lacking a statement deletion repair action, and a new search algorithm, used for a sub-set of its patches, known as value search. SPR manages to find correct patches for 16 bug scenarios within the ManyBugs dataset, compared to 2 found by GenProg. Despite the intentional exclusion of explicit functionality deletion, Mechtaev et al. [2016] highlight that SPR implicitly generates many such patches through the introduction of tautological and contradictory branch conditions. For one of the libtiff subjects within ManyBugs, 80% of SPR's reported patches were found to be functionality deleting.

To accommodate the increased search space accompanying its enriched repair model, SPR uses value search to soundly reduce the number of patch evaluations. Instead of validating concrete patches, value search first determines whether a set of outcomes exists for a given branch condition that would cause the test suite to pass. If, and when, such a set of outcomes is found, SPR uses value search to synthesise a suitable branch condition, using its knowledge of the program state and the intended outcome at each evaluation of the branch condition.

Although all of these techniques outperform GenProg in terms of efficiency, and in some cases effectiveness (i.e., the number of bugs for which a repair can be found), none is capable, by default, of producing multiple-edit patches. In other words, since GenProg, researchers have proposed significantly more efficient techniques for finding single-edit repairs, but none has proposed better (search-based) techniques for generating multiple-edit patches.

SPR, RSRepair, and AE demonstrate that treating program repair as a decision problem, rather than an optimisation problem, is an effective strategy for decreasing the cost of finding single-edit patches. This treatment, however, is not scalable; the size of the search space increases exponentially with respect to the size of the patch. Even if decision techniques can substantially reduce the cost of evaluating a single candidate, the combinatorial number of patches renders this advantage more or less useless. If search-based program repair is to scale to multiple lines, active search algorithms capable of discovering partial fixes and promising repair locations will be needed.

Motivated by this need for better search algorithms for multiple-edit repair, in this chapter we conduct a theoretical and empirical analysis of GenProg's search technique (GenProg being the only search-based technique capable of multiple-edit repair), in order to gain a clearer understanding of its strengths and weaknesses. Based on our findings, we propose and demonstrate an alternative search technique better suited to the nuances of program repair, inspired by a greedy algorithm. In summary, the main contributions of this chapter are as follows:


• We conduct a theoretical analysis of GenProg's search algorithm, identifying its underlying assumptions and a number of potential weaknesses.

• Based on our theoretical analysis, we conduct an empirical analysis to test our theories and to determine the contribution of fitness to the success of the search.

• Motivated by the findings of these analyses, we propose and implement an alternative search technique, capable of multi-edit repair, based on a greedy algorithm.

• We demonstrate that our proposed technique outperforms GenProg's search algorithm across a number of performance metrics.

The remainder of the chapter is structured as follows: Section 6.1 provides a review of the related literature. Sections 6.2 and 6.3 conduct theoretical and empirical analyses of GenProg's search algorithm, respectively. Section 6.4 introduces and evaluates an alternative search algorithm for program repair, inspired by greedy algorithms. Section 6.5 outlines directions for future work. Finally, Section 6.6 summarises the findings of this chapter and provides concluding remarks.

6.1. Related Work

In this section, we briefly discuss previous and related work concerning the performance of genetic algorithms for automated program repair, as well as relevant studies on the matter of search landscapes for programs.

Random Search for Program Repair—RSRepair

In an effort to gauge the effectiveness of GenProg's search algorithm, Qi et al. [2014] explored the effects of replacing GenProg's genetic algorithm with a random search. The authors found that, over 100 runs on a sub-set of the ManyBugs benchmarks, their variant of random search, RSRepair, achieved both a higher success rate and a lower mean number of test case evaluations than GenProg on 24 out of 25 problems. For more technical details on RSRepair, see Section 2.2.3.

Although these results demonstrate the (relative) effectiveness of random search compared to genetic algorithms for the purposes of program repair, they do not necessarily show that genetic algorithms are detrimental to the search. Importantly, RSRepair is able to use test case prioritisation and to terminate on the first instance of failure within the test suite, whereas GenProg must evaluate a fixed number of tests in order to produce a fitness value, regardless of their outcomes. As such, RSRepair is able to achieve a far higher test case efficiency than GenProg; in most cases, the candidate repair fails the first negative test case. In addition to evaluating candidates in a different way, RSRepair also operates in a single-edit search space, giving it a distinct advantage over GenProg, which considers patches of an arbitrary length. For repairs that require only a single edit, there is little room for genetic algorithms to compose a solution, thus leading to a partial explanation of the strength of RSRepair. Conversely, problems requiring multiple edits cause us to question whether genetic algorithms are capable of identifying and combining partial solutions; no studies have investigated this assumption.

Representation and Operators

Le Goues et al. [2012c] conducted a study of the impact of various parameter choices on the efficiency of GenProg and the total number of bugs it is able to repair (within a fixed window of time). Specifically, the authors considered the representation used, the choice of crossover operator, the probabilities associated with the selection of each mutation operator, and the weighting factor used by the fault localisation to bias the search towards mutation of statements executed solely by the failing test cases. Through their investigation, the authors discovered:

• The use of the Patch representation, in which individuals are represented as a sequence of edit operators, yielded a higher success rate than the original AST/WP representation.

• The time taken to find a repair (when successful) was lowest when no crossover operator was used, although this resulted in the worst success rate (54.4%) of the options studied. Of the two remaining options, one-point crossover and sub-set crossover, one-point crossover was found to have the higher success rate (65.2% vs. 61.1%), and the lower number of on-average fitness evaluations (118.20 vs. 163.05).

• From observation of the composition of (Insert, Delete, Replace) operations within the minimised form of (plausible) repairs generated by GenProg, it was found that each of these edit types appeared in 13%, 51%, and 57% of repairs, respectively. In contrast, the associated weightings for each mutation operator cause the different types of edit to appear with equal frequency, despite the prevalence of deletion and replacement edits.

• The authors also measured the proportion of repairs that exclusively modified statements executed by the negative test cases to the repairs that modified statements executed by both the positive and negative test cases. Only 35% of modified statements were executed exclusively by the negative tests, despite the fact that GenProg's default fault localisation weightings attribute 90% of modifications to statements executed solely by the negative tests.

Based on the results of the investigation, Le Goues et al. [2012c] changed the parameters used by GenProg, and evaluated the resulting algorithm on a set of 105 bugs (which now forms a sub-set of the ManyBugs benchmarks). These changes allowed GenProg to repair 5 additional bugs, and decreased the time taken to find a repair by 17–43% for some of the more difficult bug scenarios.

Qi et al. [2015] found that, of the 55 bugs that were reported to have been repaired by GenProg, only 2 were correctly repaired. In most cases, patch evaluation only considered the exit status of the program and not its outputs, leading to patches that inserted code such as exit(0); being accepted. The findings of that study cast uncertainty on any conclusions made regarding the effectiveness and efficiency of GenProg's search algorithm on the basis of inadequate test suites. Results reported to increase the success rate or reduce the cost of finding a repair may in fact be pushing the search towards areas likely to contain a plausible but incorrect repair.

Predicate-based Fitness Functions

Fast et al. [2010] propose an alternative fitness function for GenProg based on the similarity of learned program invariants (logical assertions at points within a program that hold true over an associated set of executions, also referred to as predicates) between the original program and candidate solutions. For example, an invariant may specify that an integer argument x, belonging to some given function, is always non-negative (i.e., x ≥ 0 holds over all observed executions).

Fast et al. [2010] use Daikon [Ernst et al., 2007], a popular, off-the-shelf invariant mining tool, to infer these predicates. This process works by first instrumenting the program to monitor invariants over observed program values, before executing the program and observing which predicates hold. Sets of observed predicates are collected for executions of each of the test cases within the test suite.

Building on work by Liblit et al. [2005] on the use of invariants for statistical fault localisation, the following statistics are calculated for each observed predicate P:

• Failure(P): the probability that the program will fail, given P.

• Context(P): the probability that the program will fail, given that the line on which P is tested is reached.

• Increase(P): the amount by which P being true increases the probability of failure.

The resulting statistics describe a set of observations of program behaviour that are associated with failure (e.g., the branch condition at line 5 only evaluates to 0 when the program fails). Using these statistics, the following two predicate sets are determined:

• Increase Set: all P such that Increase(P) > 0.

• Context Set: all P such that Context(P) > 0.

The following universal sets of predicates are also constructed:

• all P that hold for every execution,


• all P that fail to hold for every execution,

• all P that hold only for the positive tests,

• all P that fail to hold only for the negative tests.
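To make the construction above concrete, the following Python sketch computes the per-predicate statistics and the two derived sets from a table of per-test observations. The Observation record and the use of Increase(P) = Failure(P) − Context(P) are illustrative simplifications, not the implementation used by Fast et al.[2010].

from collections import namedtuple

# A hypothetical record of one test execution: which predicates were reached,
# which of those held (were observed to be true), and whether the test failed.
Observation = namedtuple("Observation", ["reached", "held", "failed"])

def predicate_statistics(predicates, observations):
    """Liblit-style statistics for each predicate, estimated from observations."""
    stats = {}
    for p in predicates:
        reached = [o for o in observations if p in o.reached]
        held = [o for o in reached if p in o.held]
        # Failure(P): probability of failure, given that P held.
        failure = (sum(o.failed for o in held) / len(held)) if held else 0.0
        # Context(P): probability of failure, given that the line testing P was reached.
        context = (sum(o.failed for o in reached) / len(reached)) if reached else 0.0
        # Increase(P) = Failure(P) - Context(P).
        stats[p] = {"Failure": failure, "Context": context,
                    "Increase": failure - context}
    return stats

def predicate_sets(stats):
    """The Increase and Context sets used to characterise program behaviour."""
    return {
        "increase_set": {p for p, s in stats.items() if s["Increase"] > 0},
        "context_set": {p for p, s in stats.items() if s["Context"] > 0},
    }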

Together, these six sets represent a baseline B, characterising the behaviour of the buggy program. To compare the behaviour of a candidate patch i to the original program, its six invariant sets V_i are computed and compared against B. This comparison is performed by computing a number of statistics:

• The number of predicates that once predicted failure (i.e., the set of predicates that hold only for the negative tests) but no longer hold on failing runs in variant i.

• The number of predicates that did not hold on failing runs for the original program but hold on failing runs for variant i.

• The cardinality of the set differences, for each of the six predicate sets.

• The weighted sum of the Context scores of predicates in the context set. (This weighting is learnt using linear regression, discussed next.)

• The weighted sum of the Increase scores of predicates in the context set.

These statistics are combined with the outcomes of the test suite for the candidate patch to form a characteristic vector of scalar features f_i. To transform these vectors into scalar fitness values, linear regression is used to learn a suitable global intercept c_0, and a linear weighting of features, given in Equation 6.1.

predicate_fitness(i) = c_0 + Σ_j c_j · f_{i,j}    (6.1)

As its training set, the learner requires a large number of evaluated patches for the bug under repair—in the evaluation performed by Fast et al.[2010], 1772 such patches were generated—together with their test suite outcomes, and the state of their invariants. Assuming that at least one solution to the bug is found amongst these patches, D_opt, the shortest edit distance from a given patch to a known repair, is calculated for each entry in the training set. Linear regression is then used to calculate a global intercept and a vector of feature weights that approximates the edit distance from a given patch to a potential solution.
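As a rough sketch of this learning step, and under the assumption that feature vectors and D_opt values have already been gathered for a set of evaluated patches, the intercept and weights of Equation 6.1 could be fitted by ordinary least squares; the function names below are illustrative, not those of Fast et al.[2010].

import numpy as np

def learn_predicate_fitness(features, d_opt):
    """Fit c_0 and c_j (Equation 6.1) so that the learned fitness approximates
    the edit distance from a patch to a known repair (D_opt)."""
    X = np.asarray(features, dtype=float)          # one row of features per evaluated patch
    y = np.asarray(d_opt, dtype=float)             # shortest edit distance to a known repair
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of ones for the intercept
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    c0, c = coeffs[0], coeffs[1:]
    return c0, c

def predicate_fitness(c0, c, f_i):
    """Evaluate Equation 6.1 for a candidate patch with feature vector f_i."""
    return c0 + float(np.dot(c, f_i))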

Although the idea of using learned invariants as part of an alternative, more amenable fitness function sounds promising, the practical utility of the fitness function proposed by Fast et al.[2010] is unclear. The technique requires that a highly expensive process of learning be carried out before the search process begins; for larger programs, such as PHP and Python, gathering the necessary number of patches would likely take longer than 12 hours. More importantly, however, this approach requires

that at least one repair to the bug is known before beginning the search. This requirement raises the question: “If a repair to the bug is already known, then why are we searching?”

To evaluate the quality of the fitness function, the authors measure its fitness–distance correlation [Jones and Forrest, 1995] for a single bug scenario, taken from a precursor to the GenProg TSE 2012 dataset. The fitness–distance metric approximates the difficulty of a search problem as the correlation between (scalar) fitness values and the edit distance to the nearest (known) solution. For a single bug, the authors found a moderately strong correlation between the predicate-based fitness function and D_opt. In contrast, GenProg's original fitness function demonstrated little to no correlation. We do not believe that these results should come as a surprise, nor do they demonstrate the superiority of the predicate-based fitness function; the results are a reflection of the fact that D_opt was used to train the fitness function.
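For reference, the fitness–distance correlation of a sample is simply the Pearson correlation between fitness values and distances to the nearest known solution; a minimal sketch, assuming both quantities have already been computed for each sampled patch:

import numpy as np

def fitness_distance_correlation(fitness, distance):
    """Pearson correlation between fitness values and distances to the
    nearest known solution (Jones and Forrest, 1995)."""
    f = np.asarray(fitness, dtype=float)
    d = np.asarray(distance, dtype=float)
    return float(np.corrcoef(f, d)[0, 1])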

Test Case Sampling

In addition to conducting an initial study into the feasibility of using learned program invariants as part of an alternative fitness function, Fast et al.[2010] investigated the use of test suite sampling to reduce the cost of fitness evaluations. Instead of running all tests within the suite for each fitness evaluation, the authors observed the performance of the search—measured by the wall-clock time taken to find a repair—when a random sample of tests is used to calculate fitness.2 Fast et al.[2010] found that the use of sampling, combined with a safe-impact analysis responsible for determining whether the outcomes of a particular test would be affected by a given patch, resulted in an 81% speed-up. Notably, the benchmarks used to conduct this study are amongst those that we found to have weak oracles (e.g., only checking the exit status of the program; see Section 3.2.1). As such, it is unclear whether test case sampling would produce the same gains when used with stronger test suites. The authors found little difference in performance between the use of random selection and Walcott et al.[2006]'s time-aware test case prioritisation technique when selecting the tests within the sample.
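A hedged sketch of sampled fitness evaluation is shown below. The 10% sample size follows the footnote above; the run_test helper and the use of the pass rate over the sample as the fitness value are assumptions made for illustration, not details taken from Fast et al.[2010].

import random

def sampled_fitness(patch, tests, run_test, sample_fraction=0.1):
    """Estimate fitness as the fraction of a random sample of tests that pass;
    run_test(patch, test) -> bool is assumed to be provided by the test harness."""
    k = max(1, int(len(tests) * sample_fraction))
    sample = random.sample(tests, k)
    return sum(run_test(patch, t) for t in sample) / k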

Crossover

Oliveira et al.[2016] argue that GenProg's implicit conflation of operator type, fault location, and fix statement into a single discrete unit within the genome results in a landscape that is more difficult to traverse. With the goal of allowing more granular building blocks to be identified and exploited, the authors propose explicitly separating these aspects into their own sub-spaces. To this end, six new crossover operators are proposed, all of which act upon an intermediate representation that explicitly separates these sub-spaces, illustrated in Figure 6.1.

2No mention of the size of this sample is made in the paper. Both subsequent experiments [Le Goues et al., 2012a] and the default settings within GenProg use a 10% sample.


Figure 6.1: An illustration of the implicit search space defined by GenProg's representation and a more granular alternative proposed by Oliveira et al.[2016]. Whereas the type of operation, location, and donor statement used by an edit are all considered to be part of a discrete, evolvable unit within GenProg's representation, Oliveira et al.[2016]'s intermediate representation allows each of these attributes to be treated separately by a set of purpose-built crossover operators.

These six crossover operators are composed from three base operators, with and without an “auxiliary memorisation component”:

• One-Point Crossover on a Single Subspace (OP1Space) applies variable-length one-point crossover to a single subspace, selected at random, leaving the rest of the intermediate representations of the parents unchanged.

• Uniform Single Subspace (Unif1Space) selects a subspace at random and swaps entries between components according to a randomly-generated mask.

• One-Point Across All Subspaces (OPAllS) performs a variable-length crossover across the entire intermediate representations of both parents, allowing all subspaces to be mixed simultaneously.

To allow this intermediate representation to be used in tandem with the original Patch representation, individuals are subject to an encoding and subsequent decoding phase. During the encoding phase, the edits within each patch are exploded into each of the three subspaces of the intermediate representation. After crossover is applied across the three subspaces, each intermediate representation is decoded to form a valid patch. As a consequence of allowing individual subspaces to be blindly modified without consideration of the well-formedness of the patch, invalid edits may be introduced. The role of decoding, therefore, is to identify and remove these invalid edits.
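The following sketch illustrates one of these operators, Unif1Space, over a simplified intermediate representation in which each individual is a dictionary of three parallel lists (one per subspace), together with a decode step that filters out invalid edits. The representation and the is_valid_edit helper are assumptions for illustration; this is not Oliveira et al.[2016]'s implementation.

import random

def unif1space(parent_a, parent_b):
    """Uniform crossover on a single, randomly chosen subspace.

    Each parent is a dict of three parallel lists (assumed, for simplicity,
    to be of equal length): 'operators', 'locations', 'donors'. Entries of
    the chosen subspace are swapped according to a random mask."""
    child_a = {k: list(v) for k, v in parent_a.items()}
    child_b = {k: list(v) for k, v in parent_b.items()}
    subspace = random.choice(["operators", "locations", "donors"])
    for i in range(min(len(child_a[subspace]), len(child_b[subspace]))):
        if random.random() < 0.5:  # randomly generated mask
            child_a[subspace][i], child_b[subspace][i] = \
                child_b[subspace][i], child_a[subspace][i]
    return child_a, child_b

def decode(child, is_valid_edit):
    """Decode an intermediate representation back into a patch, discarding any
    edits made invalid by blind subspace mixing; is_valid_edit is assumed."""
    edits = list(zip(child["operators"], child["locations"], child["donors"]))
    return [e for e in edits if is_valid_edit(e)]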

Optionally, a memorisation component may be employed to avoid removing invalid edits produced during crossover, thus limiting the destruction of search information. This component caches the details of each edit encountered over the course of the search, allowing recovery queries to be made to find valid edits at a particular location and/or with a given donor statement. When enabled, memorisation will attempt to use part of the information provided by the invalid edit to find an edit within the cache that shares some of its details.

To assess the effectiveness of these operators, the authors compared the resulting performance of each operator against the performance achieved through the

use of GenProg's original one-point crossover operator. Across 43 bug scenarios, taken from small programs in the IntroClass and GenProg TSE 2012 benchmarks, Unif1Space without memorisation yielded a 34% improvement in the success rate of the search. Accompanying this improvement in the success rate, however, was a significant degradation of efficiency, as measured by the average number of test suite evaluations (15.59 vs. 29.32).

Mutational Robustness

Schulte et al.[2013] investigated the widely-held view that the majority of software is highly fragile and that small changes are likely to lead to substantial and undesirable changes in behaviour. The authors found that over 30% of (single-edit) modifications—generated using GenProg's mutation operators—resulted in no change to the outcomes of the test suite. The results were found to hold across 22 programs at both the source code and assembly instruction level, as well as across multiple programming languages, including C, C++, Haskell, and OCaml. Based on their findings, the authors proposed the use of neutral variants as a means of proactively generating software diversity, allowing techniques such as N-version programming [Avizienis, 1985] to be used to supply an alternative version of a function when it ceases to work under some conditions. To demonstrate the practical utility of such an ability, the authors seeded five latent faults into the original versions of 11 different programs, and evaluated whether a fix to the fault could be found from amongst a sample of 5,000 of its single-edit neutral variants. With respect to an extended test suite, containing tests that were unused in the generation of neutral variants, plausible repairs for 12 out of 55 faults were found.
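A minimal sketch of how mutational robustness might be estimated is given below; the sample_single_edit_mutant and passes_all_tests helpers are assumed to be provided by the repair framework, and this is an illustration rather than Schulte et al.[2013]'s tooling.

def mutational_robustness(program, sample_single_edit_mutant, passes_all_tests,
                          num_samples=1000):
    """Estimate the fraction of single-edit mutants that are neutral, i.e., that
    leave the outcome of every test unchanged (assuming the original program
    passes its test suite)."""
    neutral = 0
    for _ in range(num_samples):
        mutant = sample_single_edit_mutant(program)
        if passes_all_tests(mutant):
            neutral += 1
    return neutral / num_samples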

6.2. Theoretical Analysis

In this section, we identify and qualitatively discuss a number of phenomena exhibited by GenProg's search algorithm, which we believe to be detrimental to its repair process. As part of this discussion, we speculate on both the immediate and wider ramifications of these phenomena to the efficiency and success of the search. Following this discussion, we investigate the role of fitness information within the search. The discussions in this section are supported by a number of graphs, generated over 20 repeats of a sub-set of the GenProg TSE 2012 dataset, with improvements to its weak output checking.

Bloat

The first, and perhaps most apparent, indicator of unintended behaviour in GenProg is the continual growth of its patches over time—a widespread phenomenon within the field of genetic programming, known as bloat [Langdon and Poli, 1998]. As with other applications of genetic programming, we observe a steady and sustained increase in the size of individuals with each generation, measured by their number of edit operations (Figure 6.2). More specifically, we observed a monotonic increase in the total number of edit operations within the population with each generation (Figure 6.3).3

Figure 6.2: Patch length vs. generation. Patches tend to become longer over time.

Upon closer inspection of GenProg's search algorithm, this phenomenon is not particularly surprising, since the mutation operator appends a newly-sampled edit to the end of each child with a probability equal to the mutation rate; by default, the mutation rate is set to 1.0, causing all children to grow by a single edit.
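A minimal sketch of this mutation behaviour, with a stand-in edit sampler rather than GenProg's actual OCaml implementation:

import random

def mutate(patch, sample_edit, mutation_rate=1.0):
    """GenProg-style mutation on the Patch representation: with probability
    equal to the mutation rate, append one newly sampled edit to the child.
    With the default rate of 1.0, every child grows by exactly one edit."""
    child = list(patch)
    if random.random() < mutation_rate:
        child.append(sample_edit())
    return child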

In practice, GenProg partially overcomes this problem through the use of delta-debugging as a minimisation technique (see Section 4.1.3 for more details). However, this process takes the form of a post-processing step, occurring only once an acceptable solution has been found by the search. In most cases, this minimisation stage reduces the size of the patch to a single edit. For the rest of the search process, however, GenProg takes no active steps to control or reduce bloating within the population.

3Note that this phenomenon persists regardless of whether crossover is enabled. In fact, this phenomenon is further exacerbated when crossover is disabled, since all individuals within the population are guaranteed to be the same size and to be one edit longer than their predecessors.



Figure 6.3: Across all problems, we observe a monotonic increase in the total number of edit operations contained within the population despite the ability of crossover to generate smaller individuals.

One may argue that the phenomenon of bloating is irrelevant to the ability of the search to generate and discover correct solutions, and that the search remains unhindered in its presence. Based on observation and analysis of GenProg's search algorithm, we believe this is not the case, and that bloat is detrimental to the search in multiple ways, for the following reasons:

1. Although the edits within the patch may have no effect on the outcomes of the positive test cases, this does not mean that the patched program retains its original or intended semantics. When combined with a weak testing suite, this phenomenon allows destructive changes to be introduced into the population without being detected, and subsequently retained and propagated (since all edits are conserved).

Over time, as the search finds and exploits more of the weaknesses within the testing suite, the individuals within the population inevitably end up accruing more of these silently destructive edits, and thus end up further away from the semantics of the original program. This behaviour places the search at odds with the implicit assumption that makes automated program repair a tractable problem: that the semantics of the repaired program lie only a short distance away from those of the faulty program.


2. Even in the few cases where the testing suite is perfect at identifying destructive mutants, the search ends up performing redundant work by evaluating edits that appear to be neutral. In these cases, the search is unable to use the results within the cache, and so it must evaluate the mutant as if it were any other. As a result, we believe the search to be spending most of its limited resources in evaluating larger, semantically-equivalent patches.

3. Finally, a human study by Fry et al.[2012] demonstrates that the likelihood of a patch being accepted by a human developer sharply diminishes as the size of the patch increases, and that such increases in patch length are also associated with higher maintenance costs.

Accumulation of Neutral Edits

As bloat causes candidate solutions to move syntactically further away from the original program, the introduction of new edits also drives the program further away from its original and intended semantics. Moreover, this drift is likely to be felt most in the bug-affected regions of the program, covered by few, if any, positive test cases, where almost all changes have no effect on the outcomes of the test suite. In such areas, the search will reduce to a random walk and, as a result of bloat, will continue to accumulate random edits within this region. Consequently, the majority of the search is spent evaluating either bloated but semantically-equivalent candidate patches, or patches that involve so many changes that they are likely to be far away from the intended semantics of the program.

In the absence of effective bloat controls, an implicit selective advantage is created towards larger, more bloated, neutral variants, whose neutral edits are more likely to survive a crossover event. Consequently, these neutral edits will tend to be selected together with other such genes, to the effect that the genome as a whole becomes selected for bloat. Whilst the minimal tournament size (k = 2) used by default in GenProg's tournament selection operator should weaken this pressure towards bloat, it does so at the cost of allowing overtly destructive edits to enter the population.

In summary, without effective bloat controls, patches will, in the best case, tend to accumulate edits that are unhelpful within the context of repair. In the worst case, patches will accumulate detrimental edits that silently destroy functionality within the program and prevent solutions from being found.

Operator Biases

In addition to the problems produced by bloat, we believe that the way in which GenProg allows multiple edits to be performed at a single statement within the program leads to subtle biases that ultimately harm its ability to find repairs. Since the search permits multiple edits to a given statement, a somewhat arbitrary decision must be made as to how these edits should be interpreted when applying the patch.


Figure 6.4: An example of the destructive edit bias. In this example, a single-edit patch is applied to the original AST, given on the left. This patch replaces the node at location 1 with the node at location 5. When the child tree is subsequently mutated, any changes to locations 1, 4 or 5 will have no effect. Note that the donor node, coloured black, may not be the subject of a future mutation operation. Thus, the effects of this replacement operation are permanent on all of its descendants (except in cases where crossover moves this operation to another child).

GenProg deals with this dilemma in perhaps the most intuitive way—applying each edit within the patch in the order in which it appears. However, since the mutation operator may only introduce new edits at the end of the patch, the set of programs that may be reached from a given point becomes constrained by the existing contents of that patch.

The most destructive bias introduced by this behaviour is observed when a deletion or replacement operation is applied to a given statement. In this case, the statement is no longer addressable by future edits, causing the modification to become frozen. As a result, the search is allowed to accumulate non-coding edits at the affected site, since all edits after the Delete or Replace are ignored. An example of this behaviour is illustrated in Figure 6.4. Since all edits are conserved during crossover and mutation, and given the relatively low selective pressure, this can also result in functionality loss, particularly within regions lacking positive test case coverage. This functionality loss may only be recovered through an increasingly-unlikely crossover event.
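The frozen-edit behaviour can be illustrated with a small sketch of in-order patch application. The program is simplified to a list of (statement id, code) pairs, and the edit records are hypothetical; this is not GenProg's implementation.

def apply_patch(statements, edits):
    """Apply edits in patch order over a simplified program, represented as a
    list of (statement_id, code) pairs. Once a statement has been deleted or
    replaced, its identifier no longer appears in the program, so any later
    edit addressed to it is silently ignored ("frozen")."""
    program = list(statements)
    for op, target, donor in edits:
        index = next((i for i, (sid, _) in enumerate(program) if sid == target), None)
        if index is None:
            continue  # target already deleted or replaced: this edit has no effect
        if op == "delete":
            del program[index]
        elif op == "replace":
            program[index] = (None, donor)             # original id is gone
        elif op == "append":
            program.insert(index + 1, (None, donor))   # donor inserted directly after target
    return program

On this simplified model, any later edit addressed to a statement that has already been deleted or replaced silently becomes a no-op, mirroring the frozen modifications described above.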

Another less destructive, but nonetheless hindering, consequence of this behaviour stems from the treatment of append operations. In the event that a donor statement X is appended after a selected statement S, a successive append of statement Y at S will result in the statement sequence S; Y; X at the original location of S. Whilst the semantics of this operation seem acceptable, since edits can only be introduced at the end of the patch, the set of reachable programs becomes increasingly constrained with each append. In the case of the example above, if X and Y are appended in the wrong order, or an incorrect statement is appended instead, that

mistake becomes frozen, save for an unlikely re-arrangement of the genome through a series of crossover events. An illustration of this bias is given in Figure 6.5.


Figure 6.5: An example of the append bias. The AST in the top left shows the state of the original, faulty program. The AST in the top right shows the repaired version of that AST, which contains two statements, α and β, that are missing from the original. On the bottom half of the diagram, we show a repair scenario in which this bias is encountered. In the first generation, α is appended after the statement at location 4, matching its intended position in the repaired program. In the following generation, β is appended, yielding the incorrect sequence β; α. From the first mistake in the order of append operations, the search is unable to move backwards to find a correct order; incorrect edits will continue to accumulate at statement 4.


In summary, by allowing multiple edits at a single site, a number of subtle biases are introduced into the search, limiting its ability to reach solutions as generations pass.

Limited, Short-Term Memory

As a consequence of the relatively small size of its population, the ability of the search to implicitly retain useful information about the problem and the possible nature of a solution is fundamentally and severely restricted. It is therefore the responsibility of the selection operator to manage the storage of this information, and to hold onto information useful for solving the problem whilst discarding misleading or irrelevant information.

For most optimisation problems where population-based algorithms are used, the role of selection is to identify and exploit high-quality individuals, based on the assumption that such individuals tend to be co-located with other such high-quality solutions. In these cases, selection serves primarily as a positive feedback mechanism aimed at promoting higher-quality solutions. In the case of automated repair, however, we are purely interested in finding a correct solution, and from the perspective of the end-user, there is little to no concept of a partial solution. Here, fitness purely serves as a means of more efficiently generating potential solutions, rather than assessing the quality of a solution. To help the search craft potential solutions that are more likely to be correct, we believe that fitness combines elements of both positive and negative feedback:

• Positive Feedback: in theory, fitness helps the search to identify and exploit partial solutions, identified by their passing of a sub-set of the negative test cases. In practice, when adequate test suites are used, patches that pass any of the negative test cases are very rarely encountered.

• Negative Feedback: whilst positive feedback remains largely unused for the majority of the search, the primary feedback mechanism comes from the avoidance of destructive patches, which cause previously passing test cases to fail.

Despite the fact that the search largely makes use of negative feedback in selecting parents, the population-based nature of the algorithm is almost entirely oriented towards positive feedback, which is rarely used. Populations serve relatively well as retainers of positive information (i.e., potential partial solutions), since they are very likely to persist from one generation to the next, and thus be exploited to find promising solutions. However, populations also make for a woefully-ineffective means of storing and exploiting useful information for problems that are largely driven by negative feedback mechanisms. The reason for this ineffectiveness is that negative feedback is only used to influence what is not contained in the population. Given the vast size of the search space, the very small size of the population (40 individuals, by default), and the substantial cost of evaluating candidate solutions, all of

this negative feedback information—indicating that certain edits are destructive—is lost between generations. One approach to tackling this problem would be to avoid solutions that are known to be destructive, in a similar manner to Tabu search [Glover, 1986]. For reasons given in Section 6.4, we decided not to introduce such functionality into GenProg.

The consequences of this lack of memory are most obvious during mutation, whereupon previously encountered destructive edits are given equal consideration when appending a new edit to the patch as those that have yet to be explored. As a result, the algorithm reintroduces known (but unrecorded) destructive edits into the population, causing potentially useful information about partial solutions to be corrupted and discarded. Even in cases where unexplored edits are added to the patch during mutation, if those edits are destructive—which approximately two thirds of edits are [Schulte et al., 2013]—then information will also be corrupted.

In summary, given the nature of the automated program repair search landscape and its heavy emphasis on the avoidance of destructive patches—rather than the pursuit of promising patches—using a population to record problem information appears to be misguided.
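Purely as an illustration of the kind of explicit memory discussed above, and not a feature we added to GenProg (see Section 6.4), such a mechanism might resemble the following sketch, in which edits observed to break positive tests are excluded from future sampling:

def sample_non_tabu_edit(sample_edit, tabu, max_attempts=100):
    """Sample an edit, rejecting any that has previously been observed to be
    destructive; tabu is a set of edits known to break positive tests."""
    for _ in range(max_attempts):
        edit = sample_edit()
        if edit not in tabu:
            return edit
    return None  # the reachable neighbourhood may be exhausted

def record_outcome(edit, broke_positive_tests, tabu):
    """Update the memory after evaluating a single-edit change."""
    if broke_positive_tests:
        tabu.add(edit)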

Summary

In combination, we believe that these phenomena result in a number of unintended effects, all of which combine to significantly harm the effectiveness of the search algorithm in numerous ways:

• Inability to reach certain repairs: as a consequence of operator biases, further compounded by the problem of bloat, it becomes increasingly difficult for the search to reach certain repairs with each passing generation, since the presence of certain edits within a patch precludes the existence of others. Furthermore, mutations in areas of the program that lack positive test case coverage, which also happen to be the most likely to be selected, may silently destroy functionality, preventing a repair from being found.

• Sensitivity to initial conditions: as a side effect of the search's inability to reach certain repairs over time, the success of the search becomes highly dependent on the outcome of a series of chance events in the early generations of the search. If destructive and restrictive edits successfully invade the population, the search is unlikely to be able to recover.

• Inefficient usage of test case evaluations: the inability of the search to explicitly remember and avoid destructive edits, combined with the highly limited memory capacity of its population, leads to corruption of partial solutions and discarding of useful negative-space information.

• Majority of resources are spent on unlikely repairs: due to bloat, the search spends the majority of its time evaluating long and highly unlikely


patches, rather than focusing its efforts on patches that are closer to the original program.

• Poor reliability: as a result of the above symptoms of these phenomena, we expect to see a decreasing (marginal) probability of success as the search progresses, leading to lower levels of reliability.

Ultimately, these symptoms stem from certain design decisions and trade-offs within GenProg's algorithm, including its ability to perform multiple edits at a given site, and its treatment of the problem as an optimisation problem, rather than a decision problem.

6.3. Empirical Study

Having highlighted a number of potential issues and inefficiencies associated with GenProg's search algorithm—an algorithm similar in all important aspects to most other evolutionary APR techniques—here we conduct an empirical study to explore whether these claims hold in practice. In doing so, we seek to produce a better understanding of the workings (and shortcomings) of the evolutionary repair process.

To investigate the effects, if any, of bloat and operator biases on the performance of the search, we answer the following questions:

• RQ4A: Does bloat hinder the ability of the search to efficiently find repairs?

• RQ4B: Does removing the ability to perform multiple edits at a given site improve the efficiency and reliability of the search?

After exploring these effects, we turn our attention to the role of fitness information within the search for multiple-edit repairs. To conduct this investigation, we deconstruct the notion of fitness in a systematic, stepwise manner by answering each of the research questions below:

• RQ5A: Does selecting individuals to serve as parents at random, rather than in accordance with their fitness, have a measurable impact on the performance of the search?

• RQ5B: Does restricting parent selection to non-destructive individuals (i.e., those which pass all of the positive tests), and choosing from amongst them at random, affect the performance of the search?

• RQ5C: Does selecting non-destructive individuals as parents on the basis of the number of negative tests that they pass—instead of choosing between them at random—affect the performance of the search?


Methodology

To answer each of these research questions, we extended GenProg, and produced a suitable configuration for each question. To reduce the cost of answering these questions, we used asynchronous patch evaluation and weak equivalency checking, described in Section 4.2. Using these configurations, we collected performance data for each research question by performing a number of repeat repair trials over a set of 21 bug scenarios. To balance the accuracy of the results against the staggering time and computational resources required to conduct extensive program repair trials, we performed 10 repeat trials for each bug-configuration; in total, we conducted 1260 individual trials.

For each research question, we measure the performance of the associated configuration against GenProg's standard configuration, in terms of the metrics described below:

• We measure the effectiveness of the configuration by the fraction of bug scenarios for which it was able to find a repair.

• For bug-configurations where a repair was found, we measure the reliability as the fraction of runs for that bug-configuration that yielded a repair.

• Additionally, we measure a modified version of the previously used NCP metric, a measure of the average number of patch evaluations required to find a repair. The first of these metrics, NCP_U, measures the average number of unique candidate patch evaluations. By measuring unique evaluations, rather than all evaluations, we gain a more representative measure of performance for configurations that are more likely to revisit previously evaluated patches. The second, ER, measures the efficiency of the search by the number of unique patch evaluations per run (including runs for which a repair was not found). This measure gives a more reliable estimate of the efficiency of a configuration over multiple runs.

Except where explicitly stated, we used identical configurations and termination criteria for each experiment. An outline of this configuration is provided in Table 6.1. The settings within this configuration are based on the default settings for GenProg, which have been previously used to conduct research [Le Goues et al., 2012a,c]. For the purposes of this study, we chose to enforce an explicit limit on the number of unique candidate patch evaluations—rather than restrict the number of generations, as is de facto practice [Le Goues et al., 2012a,b,c; Weimer et al., 2009]—in order to more fairly represent the performance of algorithms that revisit solutions.

As with all of the experiments conducted within this thesis, a complete replication package is provided using RepairBox. Details on replicating the results of this study, along with the other studies within the thesis, can be found in Appendix A. By using RepairBox, we are able to provide a transparent and controlled execution environment without compromising performance.
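For clarity, the two efficiency metrics can be computed as in the sketch below, over hypothetical per-run records of the form (unique candidate evaluations, repaired?); averaging NCP_U over successful runs is our reading of the definition above.

def ncp_u(runs):
    """Average number of unique candidate patch evaluations over the runs that
    found a repair; each run is a (unique_evaluations, repaired) pair."""
    successful = [n for n, repaired in runs if repaired]
    return sum(successful) / len(successful) if successful else float("inf")

def expense(runs):
    """ER: unique candidate patch evaluations across all runs, successful or not,
    divided by the number of successful runs."""
    total = sum(n for n, _ in runs)
    successes = sum(1 for _, repaired in runs if repaired)
    return total / successes if successes else float("inf")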


Population Size: 40
Fault Localisation: GenProg (10/1)
Selection: Tournament (k = 2)
Crossover Rate: 100%
Mutation Rate: 100%
Termination Criteria: Num. Unique Candidates = 400

Table 6.1: The baseline parameters of the algorithm used within each run.

CPU: Intel Xeon E5-2673 v3 (4 cores)
OS: Ubuntu Server 16.04 LTS (64-bit)
Kernel: 4.4.0-77-generic
Docker: 1.12.6, build 78d1802
RAM: 8 GB
Storage: 64 GB, SSD
Cost: $0.249/hr

Table 6.2: Specifications of the Microsoft Azure F4 compute instances used to collect the data for this study.

To run these experiments, we used a large number of parallel F4 compute instances on the Microsoft Azure cloud computing platform. Specifications for these instances are given in Table 6.2.

Benchmarks

In selecting a suitable set of bug scenarios for this study, we primarily considered three factors. The first is that performing 1260 repeat repair trials over the bug scenarios should be neither prohibitively time-consuming, nor exorbitantly expensive. The second is that the test suite should meet the minimal set of requirements laid out in Chapter 3; bug scenarios that fail to thoroughly check the outputs and resulting state of the program should not be considered. From earlier experiments, we found that including such benchmarks—as have been used to conduct previous studies on the efficiency of GenProg [Le Goues et al., 2012a,c; Oliveira et al., 2016; Weimer et al., 2009]—can produce misleading results. In such cases, variants of GenProg which push the search towards regions of the space likely to contain plausible, but incorrect, solutions are assumed to be better-performing variants. When one controls for test suite quality, one finds that these variants lead to a lower performance. The third and final factor is that the bug scenarios should include, but not necessarily be restricted to, bugs whose patches require multiple edits. Without the inclusion of such bugs, one cannot assess the effectiveness of the search technique when it comes to multi-edit bugs—bugs which have largely been beyond the reach of

search-based repair.

With these considerations in mind, we first assessed a number of publicly available datasets of real-world bugs: ManyBugs [Le Goues et al., 2015], Codeflaws [Tan et al., 2017], and the GenProg TSE 2012 [Le Goues et al., 2012b] benchmarks. We found that the bug scenarios within the ManyBugs dataset exhibited excessively-long compilation and test suite evaluation times, and that the majority of its test suites failed to meet our minimum quality requirements (many exclusively consider the exit status of the program). We encountered similar concerns of quality in the GenProg TSE 2012 benchmarks. On inspection of the subjects within the Codeflaws dataset, we determined that GenProg would be highly unlikely to find repairs for any of its bugs, owing to the small size of the programs (mostly less than 20 lines of code).

Given the extensive difficulties involved in using naturally-occurring bugs, we turned our attention to synthetic bugs. Since the research questions asked in this study are purely concerned with the process of searching for patches, rather than the ability to solve real bugs, we believe this decision to be sound. We first explored using a sub-set of the C bug scenarios within the Software-artifact Infrastructure Repository (SIR), but found that their vast search space reduced the probability of repair to such a point that any significant difference in performance between search techniques would be impossible to gauge. Although the programs used by these bug scenarios are smaller than those used in previous APR research, it is important to note that in those cases the search was restricted to a single file within the project—usually containing a few hundred lines of code (and not the millions of lines of code used by the entire project). In contrast, the programs within the SIR are collapsed into a single file containing thousands of lines of code.


Algorithm 5: Test Suite Reduction

ReduceTestSuite(tests, lines, test_cov, k)
begin
    selection ← ∅ ;
    line_cov ← {(l, 0) | l ∈ lines} ;
    for i in 1..k do
        for l1 in lines do
            candidates ← {t | t ∈ tests ∧ Covers(t, l1)} ;
            poor_cov ← {l2 | l2 ∈ lines ∧ line_cov[l2] < i} ;
            if line_cov[l1] ≥ i or candidates = ∅ then
                continue
            end
            best_candidates ← ∅ ;
            best_score ← 0 ;
            for t in candidates do
                score ← #{l2 | l2 ∈ poor_cov ∧ Covers(t, l2)} ;
                if score > best_score then
                    best_score ← score ;
                    best_candidates ← {t} ;
                else if score = best_score then
                    best_candidates ← best_candidates ∪ {t} ;
                end
            end
            t ← RandomChoice(best_candidates) ;
            UpdateCoverage(line_cov, test_cov[t]) ;
            selection ← selection ∪ {t} ;
            tests ← tests \ {t} ;
        end
    end
    return selection
end

Ultimately, we found that when combined with Pythia (see Section 3.2) to produce a high-quality oracle, the programs within the Siemens benchmarks provide all of the desired characteristics. Although each program is only several-hundred lines of code in length, the corresponding size of the search space matches those used in previous studies [Le Goues et al., 2012a, 2015; Long and Rinard, 2015; Mechtaev et al., 2016]. The test suites used by these programs, containing thousands of tests, are unlike those used in previous work, however. To reduce the burden involved in evaluating these test suites, and to produce a test suite closer to those encountered in real-world programs, we devised and used a simple test suite reduction heuristic. This algorithm, Algorithm 5, uses a fast, stochastic greedy search to ensure that, where possible, each line within the program is covered by at least k = 5 tests. Using this algorithm, we were able to reduce the size of the test suites by roughly a factor of 100. This substantially diminished level of coverage appeared to have


Program         # LOC   # Stmts   # Tests   # Tests′
tcas              135       136      1608        20
totinfo           230       271      1052        25
replace           494       372      5542        42
printtokens       342       292      4073        14
printtokens2      388       340      4115        18
schedule          249       194      2650        15
schedule2         273       217      2710        19

Table 6.3: The subject programs used to conduct this study. # LOC describes the number of source code lines in the original program, as measured by cloc. # Stmts specifies the number of statements within GenProg's pre-processed AST representation of the program. # Tests gives the size of the original test suite for the program, and # Tests′ the size of the reduced test suite.

no discernible effect on the quality of patches produced during the study—a phenomenon we intend to investigate more closely in future work. Details of the seven programs used as subjects for this study, together with the sizes of their original and reduced test suites, and their number of lines of code, are given in Table 6.3.

Upon inspection of the original bug scenarios within the Siemens subject programs, we found that GenProg was unable to fix most of them—owing to the granularity of its repair model—and that most fixes required only a single edit. To gain an understanding of the ability of GenProg's algorithm to compose multiple-edit patches, and to ensure our understanding was based on bugs that are solvable within GenProg's search space, we manually seeded bugs into these programs. For each of the seven programs, we seeded a one-line bug, a two-line bug, and a three-line bug, giving a total of 21 bug scenarios, balanced across the different bug sizes.

To ensure that the seeded bugs could be solved, we ensured that a fix could be generated from the source code in the faulty file. For example, to seed a bug scenario that required the insertion of a missing statement, we deleted a statement in the program, and ensured that a redundant copy of that statement could be found elsewhere (and that the redundant copy was covered by the test suite). That is, for each seeded bug scenario, we devised a repair within GenProg's search space, and then applied the inverse of the fix to the program. We also ensured that all mutations to the program resulted in the failure of at least one test within the reduced test suite.

Instructions on finding and replicating these bug scenarios are given in Appendix A.


Results

A summary of the results across all of the studied configurations is aggregated into the following tables: Table 6.4 describes which bugs were successfully repaired at least once for each configuration. We find that the baseline configuration is unable to repair any bugs that contain more than a single fault. Across all configurations, only one bug containing two faults was repaired, and no bugs containing three faults were repaired. Table 6.5 shows the observed reliability of each configuration. Table 6.6 gives the median number of unique candidate patches required to find a repair for each bug-configuration. Table 6.7 specifies the cost of finding a repair, measured by the number of unique candidate patch evaluations across all runs, including those which failed to yield a patch, divided by the number of successful runs. For the remainder of this section, we use these results to provide answers to each of our research questions.

Table 6.4: A summary of the bugs for which a repair was found at least once over the ten runs, for each configuration; an entry indicates that at least one patch was found for the given bug-configuration.

173 SEARCH ————————————————————————————————————————————————— —————————————— ————————————— —— 10%—————— 10% — — 10% — — 10% 10% — 10% — 30%60% 30% 60% 20%20% 20% 70%30% 100% 60% 20% 70% 20% 20% 100% 20% 30% 80% 20% 50% 30% 20% 30% 100% 100% 90% 100% 100% 90% tcas-two replace-two replace-three printtokens-one printtokens-two printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-two schedule-three schedule2-one schedule2-two schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-three Bug Scenarioreplace-one Baseline Bloat Control Restricted Shadow Neutral Neutral Negative Table 6.5: A comparison of thepatches reliability were of found dierent for congurations that for given each bug-conguration. bug scenario. The presence of an “—” symbol indicates that no

174 6.3. EMPIRICAL STUDY ————————————————————————————————————————————————— —————————————— ————————————— —— 325.0—————— 109.0 — — 296.0 — — 198.0 83.0 183.0 — — 13.5 13.531.0 14.039.0 13.5 31.0 14.0 34.5 31.026.5 31.0 39.0 37.0 31.0 60.0 26.5 39.0 20.0 31.0 26.5 39.0 26.5 20.0 166.0 85.0 63.5 93.5 144.0 81.5 replace-two replace-three printtokens-one printtokens-two printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-two schedule-three schedule2-one schedule2-two schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-two tcas-three Bug Scenarioreplace-one Baseline Bloat Control Restricted Shadow Neutral Neutral Negative Table 6.6: A summary ofpresence of the an median “—” number symbol of indicates unique that no candidate patches patch were evaluations found required for to that nd bug-conguration. a repair, for each bug-scenario. The

175 SEARCH ————————————————————————————————————————————————— —————————————— ————————————— — 3925.00— 3709.00—————— — 3896.00 — — 3798.00 — 3683.00 3783.00 — — 58.00 40.70 61.33 36.40 45.30 35.33 331.50 336.33 277.86 310.67972.67 264.00 1634.50 173.13 972.67 477.80 976.67 972.67 1087.33 1089.00 1663.50 1693.50 167.401631.00 1631.00 115.00 1631.00 1631.00 1631.00 1631.00 printtokens-two schedule2-two tcas-two replace-three printtokens-one printtokens-three printtokens2-one printtokens2-two printtokens2-three schedule-one schedule-three schedule2-one schedule2-three totinfo-one totinfo-two totinfo-three tcas-one tcas-three Bug Scenarioreplace-one replace-two Standard Bloat Control Restricted Shadow Neutral Neutral Negative schedule-two Table 6.7: An overview ofevaluations, the across cost all of runs, nding a dividedpatches patch by were for the found each number during bug-conguration, any of measured of runs by the that the runs. were total successful. number of Bug-congurations unique with candidate an patch “—” symbol indicate that no


RQ4A: Does bloat hinder the ability of the search to efficiently find repairs?

We first introduce a variant of the search algorithm, fitted with active bloat controls. One way of doing this is to break the conservation of edits within the mutation operator by allowing edits to be removed, through the introduction of an Undo operator. To add such an operator, however, we would need to arbitrarily determine a suitable probability of its application, which opens the door to overfitting the particular benchmarks being studied. Since the majority of problems that GenProg has been applied to can be solved within a single edit, one could engineer an Undo probability such that all patches within the population remain at this particular size.

A simpler method to achieve bloat control, less prone to overfitting, is to replace the selection operator with a double-tournament selection operator, based on the non-parametric parsimony pressure techniques introduced by Luke and Panait[2002]. Rather than simply drawing k individuals at random from the population and choosing the best amongst them to act as a parent for the next generation—as in standard tournament selection—the individuals that compete in the final “fitness” tournament are themselves chosen through a series of size-based “qualifier” tournaments:

• k “qualifier” tournaments are carried out, in order to select the k individuals that will compete in the final “fitness” tournament.

• Each “qualifier” tournament selects two individuals at random from the population, and chooses the smallest as the winner, with probability S_p/2, where S_p is a real number between 1.0 and 2.0. Setting S_p to 2.0 ensures the smallest individual is always picked, whilst 1.0 chooses between the individuals at random.

• The “fitness” tournament selects the individual with the highest fitness as the winner, and uses random selection as a tie-breaker.

After introducing double-tournament selection, we compared the performance metrics of the resulting algorithm against the baseline. We found that double-tournament selection was able to fix an additional bug scenario compared to the baseline (5 vs. 6), as shown in Table 6.4. Furthermore, all bugs that are repaired by the baseline are also repaired when double-tournament selection is used; double-tournament selection dominates the baseline in terms of effectiveness. We observe little to no difference in reliability between double-tournament selection and the baseline (Table 6.5). For scenarios where reliability can be compared, we see a minimal 10 percentage point difference.

In terms of efficiency, as measured by NCP_U, double-tournament selection dominates the baseline. Double-tournament selection exhibits a lower NCP_U for two of the five comparable bugs, and an identical NCP_U for the remaining three. When measured by expense, the results are less clear. Double-tournament selection is marginally worse in two cases (1087.33 vs. 1089.00, and 331.50 vs. 336.33; Table



Figure 6.6: Our restricted representation implicitly encodes individuals as a fixed-length list of optional edits P. Each entry of the list, P_i, describes the edit, if any, that should be applied to the statement with number i.

6.7), substantially worse in one (972.67 vs. 1634.50), identical in one, and improved in one (58.00 vs. 40.70). Taken together, these results demonstrate little to no improvement in performance, suggesting that traditional bloat control measures may be ineffective within GenProg's search space. This may be down to the need for more careful, but expensive, parameter tuning. Alternatively, it may be that bloat controls further hinder the memory capabilities of the population: as the population contains fewer edits, it also retains less information about the solution space.

In summary, although the search appears to be most effective during its earliest stages, the use of traditional bloat control measures appears to be unsuccessful in increasing reliability or efficiency.

RQ4B: Does removing the ability to perform multiple edits at a given site improve the efficiency and reliability of the search?

To determine the effects of removing the ability to perform multiple edits at a single location, we needed to introduce a slightly-modified representation, and an associated set of operators, all ensuring that no more than a single edit could be made at a given site. Instead of encoding the patch as an arbitrary-length sequence of edit operations, we effectively represent the patch as a fixed-length list of edits P, as illustrated in Figure 6.6. Each entry of the list, P_i, describes the edit e_i that should be applied at a particular location i. The absence of an edit at a particular site is denoted by a NULL edit. In practice, to avoid high memory costs, this representation is realised by placing certain constraints on variable-length edit sequences.

To accommodate these changes, we replace the standard mutation operator used by GenProg with a uniform mutation operator, which replaces any existing edit at a given statement with a newly sampled one. Similarly, we replace GenProg's standard, variable-length one-point crossover with a symmetrical uniform-crossover operator, outlined in Figure 6.7. Each crossover event takes a pair of parent patches as input, (x, y), and generates a pair of children, (a, b), by iterating through each modifiable statement s and assigning the edit at s from either parent to a, and the other to b, selecting at random. (A sketch of these operators is given at the end of this subsection.)

Compared against the baseline effectiveness, we find that the restricted configuration solved the same set of bugs as the baseline. We observe little to no difference in reliability when the restricted configuration is used: for two out of five bugs, reliability is the same; for one out of five, the restricted configuration is minimally better; and for the remaining two bugs, the baseline configuration is minimally better.



Figure 6.7: Uniform crossover takes two parents, A and B, and generates a pair of symmetrical children, C and D.

In terms of efficiency, when measured by NCP_U, we observe an identical efficiency for two bugs, an improvement for two bugs (166.0 vs. 63.5, and 20.0 vs. 26.5; Table 6.6), and a slight reduction for one bug (13.5 vs. 14.0). When efficiency is measured using the expense metric, we see a worse performance for two bugs (1087.33 vs. 1663.50, and 58.00 vs. 61.33; Table 6.7), an identical performance for two bugs, and an improvement for one bug (331.50 vs. 277.86).

In summary, whilst restricting the search to only a single edit at each site fails to produce any sizeable improvements in performance, it also appears to have little negative effect. This small change substantially reduces the size of the search landscape, and in practice prevents few, if any, bugs from being patched. This observation can be better exploited by search techniques which are not population-based.
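To make the restricted representation and its operators concrete (as mentioned earlier in this subsection), the sketch below treats a patch as a fixed-length list with one optional edit per modifiable statement, using None for the NULL edit; it is a simplified illustration, not the implementation used in this study.

import random

def uniform_crossover(parent_x, parent_y):
    """Symmetrical uniform crossover over fixed-length patches: for each
    modifiable statement, one child receives the edit from one parent and the
    other child receives the edit from the other parent, chosen at random."""
    child_a, child_b = [], []
    for edit_x, edit_y in zip(parent_x, parent_y):
        if random.random() < 0.5:
            child_a.append(edit_x)
            child_b.append(edit_y)
        else:
            child_a.append(edit_y)
            child_b.append(edit_x)
    return child_a, child_b

def uniform_mutation(patch, sample_edit_at):
    """Replace the edit (possibly None) at one randomly chosen statement with a
    newly sampled edit for that location; sample_edit_at is an assumed helper."""
    child = list(patch)
    i = random.randrange(len(child))
    child[i] = sample_edit_at(i)
    return child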

RQ5A: Does selecting individuals to serve as parents at random, rather than in accordance with their fitness, have a measurable impact on the performance of the search?

Building on previous studies showing that random search can outperform genetic algorithms, here we determine whether this result reflects an absence of exploitable information in fitness values, or whether it was largely due to differing search spaces—in the previous study of random search [Qi et al., 2014], RSRepair only considered single-edit patches—and the ability to terminate on the first instance of failure.

To perform this experiment, we simply ran an otherwise identical version of GenProg’s search algorithm, except with its parent selection replaced by random selection.

From the results in Table 6.4, we see that random search dominates the baseline in terms of effectiveness; random search fixes all the bugs fixed by the baseline, plus tcas-two. Furthermore, of all the configurations we investigated, random search

was the only technique that found a solution to a multi-edit bug. It may be the case that the bug in question, tcas-two, exhibits a particularly deceptive search space, ill-suited to optimisation techniques.

In terms of reliability, we observe little difference between random search and the baseline: random search was minimally better for one bug, minimally worse for two, and identical for the remaining two bugs on which it can be compared.

When comparing the efficiency of random search and the baseline, we see mixed results. When measured by NCP_U, random search is better in one case (166.0 vs. 93.5; Table 6.6), worse in one (39.0 vs. 60.0), and identical for the rest. In terms of the expense metric, the baseline is outperformed in three cases (331.50 vs. 310.67, 972.67 vs. 477.80, and 58.00 vs. 36.40; Table 6.7), better in one case (1087.33 vs. 1693.50), and identical in one case.

In summary, these results suggest that GenProg’s search algorithm is no more ef- fective than selecting patches at random. Overall, random search appears to exhibit a stronger performance than the baseline.

RQ5B: Does restricting parent selection to non-destructive individuals (i.e., those which pass all of the positive tests), and choosing from amongst them at random, affect the performance of the search?

Having shown that fitness information either has no effect on the search, or that it actively misleads it, we explore two variants of GenProg's fitness function, designed to more aggressively prune and exploit the search space. In this experiment, we examine whether the algorithm would be better served by strictly selecting neutral edits. Or, put another way, we ask whether tolerating destructive patches (i.e., patches which cause passing tests to fail) is beneficial to the search. To do so, we introduce a variant of GenProg wherein the default parent selection method is replaced by a pair of successive selection operators: firstly, the population is reduced to those candidates which pass all of the positive test cases; from this restricted pool, n individuals are sampled at random, with replacement.

When compared to the baseline, we find that this neutral configuration finds patches for a greater number of bugs (5 vs. 7; Table 6.4). Specifically, the neutral configuration successfully finds patches for all of the single-edit bugs, which includes the set of all bugs repaired by the baseline. In all cases, the reliability of the neutral configuration is higher than or equal to that of the baseline configuration (30% vs. 100%, and 60% vs. 70%; Table 6.5). In terms of NCP_U, we observe little difference in efficiency, but a more pronounced improvement is observed when efficiency is measured by expense: for the five bugs on which efficiency can be compared, the neutral configuration is better for three (1087.33 vs. 167.40, 331.50 vs. 264.00, and 972.67 vs. 477.80; Table 6.7), and identical for the remaining two.

The substantial performance improvements stemming from restricting selection to only those solutions which do not fail any positive test cases suggest that detailed

positive test case information is not particularly useful to the search. This finding, that the search should only pursue non-lethal changes, offers room for significant optimisation: one can use test case prioritisation to order the positive test cases, thus minimising the number of test case evaluations (i.e., regret) necessary to identify a lethal edit.
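A minimal sketch of this two-stage parent selection, with assumed helper names rather than the exact interface used in our implementation:

import random

def select_parents(population, passes_all_positive_tests, n):
    """Restrict the candidate pool to non-destructive individuals (those that
    pass every positive test), then sample n parents uniformly at random,
    with replacement."""
    pool = [p for p in population if passes_all_positive_tests(p)]
    if not pool:
        pool = population  # fall back to the full population (an assumption)
    return [random.choice(pool) for _ in range(n)]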

RQ5C: Does selecting non-destructive individuals as parents on the basis of the number of negative tests that they pass—instead of choosing between them at random—affect the performance of the search?

Building on our findings from the previous experiment, showing that the search is more performant when restricted to selecting from amongst patches which do not fail any of the positive tests, we explore whether negative test case outcomes contribute to the success of the search. To determine the contribution of negative test outcomes, if any, we modified the configuration introduced in the previous experiment to perform a tournament selection between non-destructive patches, based on the number of passing negative tests.

In terms of effectiveness, we find no difference between the neutral negative and neutral configurations: each technique solves the same set of bugs. Similarly, we find little difference between the reliability of the two configurations: in one case, the neutral negative configuration is minimally better; in another case, the neutral configuration is superior; and for the remaining cases, both configurations are identical. We observe differences in the efficiency of the two configurations, as measured by the expense metric. For five out of seven bugs, the neutral negative configuration is superior; in one case it is worse; and for the remaining case, it is identical.

In summary, these results suggest that knowledge of negative test outcomes is useful and that it may be used to enhance the efficiency of the search. When the search is used with a higher resource limit, beyond the default limit used by GenProg, this increased efficiency may also translate into a higher effectiveness.

6.4. Greedy Algorithm

As we demonstrated in our empirical study, the performance characteristics of the search can be improved through certain modifications to its various parameters and components. However, such parameter tweaks are a cheap way of addressing the deficiencies of GenProg's genetic algorithm—they ignore the deeper cause of those deficiencies. Improving the genetic algorithm along one dimension often degrades it along another. For instance, we can bolster the memory of the search by increasing the size of the population, but doing so is likely to reduce the rate of convergence. Similarly, we can control bloat by introducing controls such as double-tournament selection, but doing so also causes information about the problem, implicitly encoded in the population, to be lost.


Crucially, our study shows that, for the most part, the search performs better when it only considers whether a candidate solution passes or fails any of its positive test cases, and how many of its negative test cases it passes. Restricting the search to patches which do not degrade existing functionality, as measured by the positive test cases, allows it to operate with greater effectiveness and efficiency. We also found that selecting from amongst potential parents on the basis of the number of negative test cases that they pass can lead to further gains in effectiveness and efficiency.

Motivated by these findings, here we propose the use of a form of greedy algorithm to conduct search-based program repair. Since we found little to no degradation in performance when the search was restricted to considering only a single edit at each location, our algorithm exploits this observation to reduce the size of its search space.⁴ Our technique allows the space to be explored more efficiently, with the bulk of its efforts spent looking at the smallest possible patches, in which a solution is more likely to be encountered, rather than evaluating increasingly large patches over time. In doing so, we find ourselves closer to the driving motivation behind search-based program repair: patches are only a short distance away from the buggy program. At the same time, to allow the technique to scale to multiple-edit patches and to speed up convergence, our algorithm exploits promising combinations of edits as soon as they are encountered, as explained next.

To achieve these behaviours, our algorithm divides the search space into two parts: an exploration space and an exploitation space. The former is composed, initially, of the set of all possible single-edit modifications to indicted statements within the program. As the search progresses, edits are sampled without replacement from this space, in an effort to discover partial solutions. Over time, this space shrinks monotonically, until no edits remain. The exploitation space contains the set of all partial solutions discovered over the course of the search. When a partial repair is discovered, it is added to this space. The algorithm then determines whether any other partial repairs may be combined with this newly added partial solution, and if so, pushes their concatenations onto the exploitation stack. Whilst this stack is non-empty, the algorithm evaluates each of its combinations of partial solutions. If the combination passes all of the negative test cases passed by each of its parts, the combination is added to the exploitation space. If not, this is treated as a sign that at least one of the two partial solutions involved in the combination is incorrect, and their combination is discarded. This process continues until either an acceptable solution is found, a resource limit is reached, or both the exploration space and the exploitation stack are empty.

⁴ GenProg and, to some degree, MintHint are the only two repair techniques to allow multiple edits to a single program location. From observation of GenProg's patches, we find no cases where this ability is used in an acceptable, minimised repair.


Algorithm 6: Greedy Search Program Repair

explore ← GenerateExplorationSpace();
exploit ← ∅;
combinations ← [];
while resources not exhausted do
    /* perform exploitation when possible */
    if combinations ≠ [] then
        xy ← Pop(combinations);
        x, y ← Parts(xy);   /* the two partial solutions from which xy was formed */
        /* xy should pass all tests passed by x and y */
        Evaluate(xy, TN);
        promising ← true;
        foreach t : PassedTests(x) ∪ PassedTests(y) do
            if not PassesTest(xy, t) then
                promising ← false;
                break;
            end
        end
        /* xy should also pass all positive tests */
        if promising and PassesAll(xy, TP) then
            if PassesAll(xy, TN) then
                return xy;
            else
                ReportPartialSolution(xy, exploit, combinations);
            end
        end
    /* check if exploration space is empty */
    else if explore = ∅ then
        return ⊥;
    /* explore single-edit space */
    else
        cand ← Sample(explore);
        if PassesAny(cand, TN) and PassesAll(cand, TP) then
            if PassesAll(cand, TN) then
                return cand;
            end
            ReportPartialSolution(cand, exploit, combinations);
        end
    end
end


Pseudocode for our greedy algorithm is given in Algorithm 6. As an initialisation step, GenerateExplorationSpace populates a memory- and look-up-efficient data structure with the set of all possible single-edit patches to the program, at locations implicated by the fault localisation. The main loop of the algorithm continues to iterate until a solution is found, a resource limit is hit, or the search space has been exhausted.
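One way of realising such a structure is sketched below: a mapping from each implicated statement to the single-edit patches that remain untried at it. This is our own illustration under assumed names (implicated_stmts, operators), not GenProg's data structure.

    def generate_exploration_space(implicated_stmts, operators):
        # operators maps an operator name to a function that enumerates the
        # candidate edits of that type at a given statement
        space = {}
        for stmt in implicated_stmts:
            edits = {name: list(enumerate_edits(stmt))
                     for name, enumerate_edits in operators.items()}
            if any(edits.values()):
                space[stmt] = edits
        return space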

At each iteration, the algorithm checks to see whether there is a promising combination of edits waiting to be evaluated. If so, the algorithm evaluates the combination on the (covering) negative tests. The combination xy is saved to the exploit space, using ReportPartialSolution, provided it passes all of the tests that are passed by each of its parts, x and y. For example, if x passes {N1}, and y passes {N2}, then xy should pass {N1, N2}, as well as all of the positive tests. As an optimisation, the combination is only evaluated against the positive tests if it satisfies this invariant over the negative test cases. A further optimisation, left for future work, would be to evaluate the union of the negative tests passed by x and y using prioritisation, terminating on the first instance of failure.
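A minimal sketch of this invariant check, assuming each patch carries the set of negative tests that it passes (the field names are illustrative):

    def combination_is_promising(xy, x, y):
        # the combination must pass every negative test passed by either part;
        # only then is it worth the cost of running the positive tests
        return (x.passed_negatives | y.passed_negatives) <= xy.passed_negatives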

In cases where the combinations stack is empty, the algorithm proceeds to sample an edit from its explore space via the Sample function. This sampling process first selects a subject statement from the program, according to its fault localisation. For the sake of a fair comparison with GenProg's existing search algorithm, we used GenProg's default 10:1 fault localisation. Once a statement has been selected, the algorithm chooses the type of edit at random; in accordance with GenProg's default mutation operator probabilities, we give each operator an equal probability of selection. Finally, the algorithm samples an edit of the chosen type at the selected location at random, without replacement (i.e., the edit is removed from the exploration space). Upon sampling an edit, the algorithm updates the state of its explore space, removing the chosen statement from consideration by the fault localisation if no edits remain at that statement.
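The sampling step might be realised as follows, reusing the structure sketched above; weights gives each statement its fault-localisation weight (e.g., a 10:1 weighting). Again, this is a sketch under assumed names rather than the actual implementation.

    import random

    def sample_edit(explore, weights, rng=random):
        if not explore:
            return None
        stmts = list(explore)
        # pick a statement in proportion to its suspiciousness
        stmt = rng.choices(stmts, weights=[weights[s] for s in stmts], k=1)[0]
        # pick an edit type uniformly amongst those with edits remaining
        op = rng.choice([name for name, edits in explore[stmt].items() if edits])
        # sample without replacement: the chosen edit leaves the space
        edit = explore[stmt][op].pop(rng.randrange(len(explore[stmt][op])))
        # drop the statement once all of its edits are exhausted
        if not any(explore[stmt].values()):
            del explore[stmt]
        return edit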

Once an edit has been sampled, the algorithm evaluates that edit on all of its covering negative tests, in parallel. If the edit fails all of the negative tests, it is discarded. If, and only if, the edit passes at least one negative test is it evaluated against the covering positive tests. To increase the efficiency of the search, the set of covering positive tests is ordered by observed likelihood of failure prior to evaluation; this minimises the expected number of test evaluations required to reject an edit. Once sorted, tests are then evaluated in parallel. If at any point a positive test is failed, the search ceases to evaluate the rest of the covering positive tests and discards that edit. In the event that the edit passes all of its covering positive tests and at least one of its covering negative tests, that edit is recorded as a partial solution by ReportPartialSolution.
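Sequentially, the evaluation of a sampled edit can be sketched as below; the real implementation runs tests in parallel, and run_test and the failure_rate attribute are assumed helpers.

    def evaluate_edit(edit, covering_negatives, covering_positives, run_test):
        # negative tests first: an edit that passes none of them is discarded
        passed_negatives = {t for t in covering_negatives if run_test(edit, t)}
        if not passed_negatives:
            return None
        # positive tests, most failure-prone first, stopping at the first failure
        for t in sorted(covering_positives, key=lambda t: t.failure_rate, reverse=True):
            if not run_test(edit, t):
                return None
        # the edit is a partial (or complete) solution
        return passed_negatives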

When a partial solution is discovered by the search, ReportPartialSolution traverses its exploit space to find promising combinations of edits that include the edits within the reported partial solution. Combinations, each formed of two partial solutions within the exploit space which satisfy certain criteria, are pushed to the combinations stack; this behaviour causes the algorithm to pause its exploration of the single-edit space and to immediately exploit the combination. Pseudocode for the ReportPartialSolution procedure is given in Algorithm 7.


Algorithm 7: Greedy Search, Partial Solution Reporting

ReportPartialSolution(x, exploit, combinations)
begin
    stmtsx ← CoveredStmts(x);
    passesx ← PassedTests(x);
    /* Look for promising combinations including this patch */
    foreach y : exploit do
        /* Find any overlap in edited statements */
        stmtsy ← CoveredStmts(y);
        stmtsxy ← stmtsx ∩ stmtsy;
        /* Look for mutually exclusive test passes */
        passesy ← PassedTests(y);
        passes∆ ← (passesx \ passesy ≠ ∅) ∧ (passesy \ passesx ≠ ∅);
        /* Determine the length of the resulting patch */
        length ← Length(x) + Length(y);
        if (stmtsxy = ∅) ∧ passes∆ ∧ (length ≤ lengthmax) then
            xy ← Combine(x, y);
            Push(combinations, xy);
        end
    end
    Add(exploit, x);
end

A partial solution x can be combined with partial solution y to form xy, provided that the following criteria are satisfied:

1. The sets of statements modified by x and y do not overlap.

2. Let Tx and Ty be the sets of tests passed by x and y, respectively. The following must hold:

(Tx \ Ty ≠ ∅) ∧ (Ty \ Tx ≠ ∅)

In other words, x must pass at least one test that is not passed⁵ by y, and y must pass at least one test that is not passed by x.

3. The length of the resulting combination xy does not exceed the maximum allowed patch length. For this study, we set this limit to three edits.

⁵ Note, this includes tests that are not covered by a given patch, as well as those that fail.


After finding satisfactory combinations for the reported partial solution, ReportPartialSolution records the solution to the exploit space.

Evaluation

To determine whether our proposed algorithm outperforms that used by GenProg, we evaluated each against the bug scenarios used in our empirical study. From observation, we deemed that the artificial candidate evaluation limits imposed by GenProg were far too low for use in multiple-edit scenarios: it is highly unlikely that all of the partial fixes will be encountered, let alone combined, within such a limit. Thus, to account for the low probability of encountering the components of a partial repair, and to gain a better understanding of the real performance of each configuration, we evaluated each using a time limit, rather than a limit on the number of unique candidate evaluations. These time limits were set according to the number of seeded faults in the bug scenario, as outlined below.

• One Fault: 30 minutes.
• Two Faults: 90 minutes.
• Three Faults: 240 minutes.

Given a limited budget and the increased demands of evaluating patches for such lengths of time, we performed five repeat runs for each bug-algorithm pairing. To conduct this evaluation, we once again made use of the cloud compute nodes described in Table 6.2, for a total of over 300 hours of compute time. From the results of this evaluation, we compared the effectiveness and reliability of each algorithm using the same metrics used in the empirical study. To compare efficiency, we measured the cost of finding an acceptable patch as the total time spent searching, across all runs, divided by the number of runs wherein a repair was found.

A comparison of effectiveness and the bugs fixed by each algorithm is given in Table 6.8. A summary of the reliability of each technique is provided by Table 6.9. Details of the efficiency of each technique are listed in Table 6.10.

Our results show that the greedy algorithm successfully fixes more bugs than the baseline (14 vs. 8; Table 6.8). The genetic algorithm is able to find a solution to replace-three, whereas the greedy algorithm fails. On closer inspection of the seeded fault, we see that the canonical patch requires exploration of the neutral edit space; two of the edits within this patch do not pass any of the negative test cases, and thus do not appear as partial solutions to the greedy search.

Comparing the reliability of the two techniques, shown in Table 6.9, we find that the greedy algorithm performs at the same level or better than the genetic algorithm for 20 of the 21 bugs. For 10 of the 14 bugs solved by the greedy algorithm, it achieves a reliability of 100%. By comparison, the genetic algorithm only achieves 100% reliability for one bug, tcas-one.


Bug Scenario          Genetic   Greedy
replace-one              ✓         ✓
replace-two                        ✓
replace-three            ✓
printtokens-one          ✓         ✓
printtokens-two
printtokens-three
printtokens2-one                   ✓
printtokens2-two
printtokens2-three       ✓         ✓
schedule-one                       ✓
schedule-two                       ✓
schedule-three
schedule2-one            ✓         ✓
schedule2-two                      ✓
schedule2-three          ✓         ✓
totinfo-one                        ✓
totinfo-two                        ✓
totinfo-three
tcas-one                 ✓         ✓
tcas-two
tcas-three               ✓         ✓
Successful:              8        14

Table 6.8: A comparison of the bugs patched by each technique.

In the seven cases where the efficiency of the two search algorithms can be compared, the greedy algorithm is faster in six, by a factor of between 1.4 and 57. For tcas-three, the greedy algorithm is roughly a factor of 2.7 slower.

On closer inspection of the results of the greedy algorithm for tcas-three, we observe an abnormally large number of partial solutions, numbering close to 1000. Given the opportunistic nature of the greedy algorithm, much of its time is spent attempting to combine these partial solutions, rather than exploring the single-edit search space more broadly.

In summary, our greedy algorithm outperforms the genetic algorithm used by GenProg in terms of effectiveness, reliability and efficiency. Moreover, our approach is simpler and does not require (expensive) parameter tuning.


Bug Scenario          Genetic   Greedy
replace-one              80%      80%
replace-two               —      100%
replace-three            20%       —
printtokens-one          80%     100%
printtokens-two           —        —
printtokens-three         —        —
printtokens2-one          —       40%
printtokens2-two          —        —
printtokens2-three       40%     100%
schedule-one              —      100%
schedule-two              —      100%
schedule-three            —        —
schedule2-one            40%     100%
schedule2-two             —       80%
schedule2-three          60%     100%
totinfo-one               —      100%
totinfo-two               —      100%
totinfo-three             —        —
tcas-one                100%     100%
tcas-two                  —        —
tcas-three               40%      20%

Table 6.9: A comparison of the reliability achieved by the genetic and greedy search algorithms, measured by the fraction of runs wherein an acceptable repair was found. An “—” is used to denote bug-configurations where no repair was found across any of the runs.

6.5. Future Work

Although our proposed greedy search algorithm outperforms the baseline algorithm used by GenProg, there are a number of modifications that we plan to make to improve it further still.

A number of existing, complementary techniques, described below, could be applied to the algorithm with relative ease to increase its efficiency and reliability. For the purposes of this study, we omitted these improvements in order to focus our attention on the relative performance of our greedy algorithm over GenProg's baseline genetic algorithm.

• A more effective fault localisation approach could be used, allowing the exploration space to be pruned and focused.

• A probabilistic model could be used to assign different probabilities to individual edits within the repair space, in a similar fashion to Prophet [Long and Rinard, 2016]. Edits that are deemed more likely to occur could be given priority over less likely edits. One could also use the model to create a deterministic ordering over the exploration space, as a pre-processing step. By doing so, the algorithm would become deterministic, allowing it to be evaluated without the need for repeated runs.


Bug Scenario          Genetic      Greedy    Reduction
replace-one            712.14      528.20       1.35
replace-two               —        438.33        —
replace-three       57,875.16         —          —
printtokens-one        495.01        8.69       56.95
printtokens-two           —           —          —
printtokens-three         —           —          —
printtokens2-one          —       3,919.53       —
printtokens2-two          —           —          —
printtokens2-three  21,700.96      974.10       22.28
schedule-one              —        886.24        —
schedule-two              —      2,399.63        —
schedule-three            —           —          —
schedule2-one         2,959.97     328.98        9.00
schedule2-two             —      4,637.24        —
schedule2-three       9,697.29   2,117.91        4.58
totinfo-one               —        167.02        —
totinfo-two               —        773.71        —
totinfo-three             —           —          —
tcas-one                23.01        7.37        3.12
tcas-two                  —           —          —
tcas-three           21,651.88  59,104.75        0.37

Table 6.10: A comparison of efficiency between the genetic search and greedy search algorithms, measured by the total wall-clock time across all runs, in seconds, divided by the number of successful runs. Reduction describes the reduction factor achieved by the greedy algorithm, compared to the genetic algorithm.


• Anti-patterns, proposed by Tan et al. [2016], could be incorporated into the exploration space, allowing edits associated with low-quality patches to be pruned.

Further efficiency and reliability improvements may be achieved through modifications to the algorithm itself:

• Instead of combining and exploiting complementary partial solutions at the earliest moment, the algorithm could defer this process until a combination is found whose constituent edits, between them, pass all of the negative tests.


A formal expression of this criterion is given in Equation 6.2, where C denotes a potential combination of edits e ∈ C, and Passes(e, t) is used to assert that the singleton patch for e passes a given test t; a sketch of this check appears at the end of this section.

∀t ∈ TN : ∃e ∈ C : Passes(e, t) (6.2)

This behaviour would dramatically improve the efficiency of the search in cases where large numbers of partial but misleading solutions are reported, such as the tcas bugs.

• In the current version of the algorithm, patches are first evaluated against the negative test suite, to check that they pass at least one negative test, or all of the negatives passed by their constituent parts, before being subjected to the positive tests. To reduce the number of test evaluations, let us assume that the number of patches which pass at least one negative test case is low, and that patches which pass negative tests are unlikely to fail their covering positive tests. We can exploit these assumptions by deferring evaluation of the positive test cases until a patch is found which passes all of the negative tests. In practice, the positive tests vastly outnumber the negative tests, and so the gains from such an optimisation could be significant. For the majority of programs that we have encountered, synthetic and organic, these assumptions hold; patches which pass any of the negative tests are relatively rare. In a small number of cases, however, such patches are frequently encountered. To handle this, a smart repair technique could measure the observed frequency of such patches against an expected frequency, and if the expected frequency is exceeded, a different set of optimisations could be used instead.

• To allow the algorithm to solve a greater number of bugs, albeit at some cost to efficiency, the algorithm could be tuned to exploit neutral edits, in addition to partial solutions. One way of exploiting neutral edits builds on the optimisations described above. After exhausting the exploration space without finding an acceptable solution, a neutral space could be created. The set of neutral edits could then be found and saved to this space, using test prioritisation to reduce the expected cost of doing so. Once the neutral space is created, the algorithm could exhaustively combine pairs of neutral edits in the search for partial solutions. Partial solutions discovered during this phase would be recorded to the exploitation space, as usual, and combined with complementary partial solutions.

In addition to the above optimisations, we plan to explore whether the inclusion of certain heuristics can be used to prune the search space. Specifically, we are interested in answering the following question: “Given two unique patches, A and B, with known test suite outcomes, does the knowledge of the test suite outcome for their combination AB provide us with (usable) information about the solution?” To answer this question, we intend to explore whether certain test suite outcomes for AB can be used to identify parts of the solution and to prune certain edits.
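The deferred criterion from Equation 6.2 could be checked as sketched below, assuming passed_negatives(e) returns the set of negative tests passed by the singleton patch for edit e (names are ours).

    def covers_all_negatives(combination, negative_tests, passed_negatives):
        # every negative test must be passed by at least one edit in the combination
        covered = set()
        for e in combination:
            covered |= passed_negatives(e)
        return covered >= set(negative_tests)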


6.6. Conclusion

In this chapter, we conducted a theoretical and empirical analysis of the search algorithm used by GenProg, one of the few repair techniques (theoretically) capable of performing multiple-edit repair. We identified a number of fundamental weaknesses within GenProg's algorithm, stemming from its generational, population-based nature; fixing any one of these issues causes another to be exacerbated. We also explored the role of fitness information within the search and found that the search was more efficient, effective and reliable when selection was restricted to solutions which pass all of their covering positive tests. Modifying this selection between non-destructive patches to favour patches that pass previously failing tests was found to produce further performance gains.

Motivated by these results, we proposed and evaluated an alternative search algorithm, based on a greedy search. Across a benchmark of 21 bug scenarios of varying size, we found that our algorithm beat GenProg's in terms of every performance metric we measured. Although our aggressive restriction of the search space produced better performance overall, the greedy search was unable to patch one of the scenarios repaired by GenProg. Additionally, the high frequency of negative test passes on another bug scenario caused the efficiency of our algorithm to drop below that of GenProg. We outlined a number of improvements that we intend to make to the algorithm in the future, which may alleviate these issues and increase its performance further.

In summary, through theory and observation, we showed that GenProg's search algorithm has deficiencies. Existing approaches, such as RSRepair, AE and SPR, are significantly more efficient due to aggressive optimisations, but all are limited to single-edit patches. We proposed a novel search algorithm for performing multiple-edit repair, which demonstrates promising results. In the future, we plan to optimise this algorithm further and to evaluate it against a dataset of real-world bugs in large-scale programs.


CHAPTER 7

Conclusion

In this chapter, we present a summary of our findings, in terms of the research questions presented in Chapter 1, and briefly outline directions for future research. Finally, we provide a set of concluding remarks.

7.1. Summary

RQ1: Can the results of candidate patch evaluations, gathered over the course of the search, be used to improve the accuracy of the fault localisation, online?

Inspired by recent results in applying mutation analysis to the problem of fault localisation, in Chapter 4 we explored whether the test outcomes of candidate patches could be used to identify promising fix locations, online. To answer this question, we conducted a 12-hour random walk of 28 different bugs, 15 of which were artificial, and the remaining 12 organic. Our results demonstrate a statistically significant difference (with a medium-sized effect) between the passing-to-failing distributions of statements that were changed by the developer, and those that were not.

No significant difference was found between the failing-to-passing distributions of human-modified and non-modified statements, suggesting that this failing-to-passing information has little to no use in detecting potential fix locations. We proposed and evaluated a number of online fault localisation measures based on our findings, but found that none of them was able to consistently outperform offline localisation; MUSE and Metallaxis were also unable to outperform offline localisation.

There may be a number of reasons for the lack of a significant improvement in the accuracy of the fault localisation:

• Our evaluation naively combined the results of the mutation analysis by simply computing their product. Determining a meaningful way of combining multiple sources of fault localisation appears to be a non-trivial problem. The use of machine learning techniques may be required to discover effective ways of combining information.


• We observe a search landscape in which most repairs either had no impact on the outcomes of the test suite, or failed all of their passing tests. The lack of diversity in test suite outcomes hinders the effectiveness of the analysis and increases the difficulty of discriminating between fixable and non-fixable program locations. The shape of this landscape may be a result of GenProg's coarsely-grained statement-level operators. Alternatively, it may be that real-world test suites for C programs tend to exhibit a largely all-or-nothing behaviour. Upon inspection of Python's tests, for instance, we observe that each test case in fact corresponds to a whole suite of atomic tests for one particular module. The results of mutation analysis may be improved through the use of test case atomisation, a technique which cannot be applied to C programs, due to the lack of a ubiquitous testing framework. It may also be that mutation analysis performs better when combined with purpose-built, automatically generated test suites.

• To be effective, mutation analysis may require a large number of mutants per program location. Due to the substantial overheads associated with compilation, we were able to generate relatively few mutants within our 12-hour window. Previous mutation analysis results involve the generation of tens of thousands of mutants, a feat that would require days, if not weeks, of compute time to realise in large-scale, real-world programs.

RQ2: Is plastic surgery equally effective for all repair actions?

In Chapter 5, we mined thousands of real-world bug fixes from hundreds of the most popular open-source projects on GitHub. After extracting instances of 23 repair actions within these mined fixes, we determined how many of those instances could be grafted from the file under repair. We found that repair actions at the block level (e.g., Insert Else-If Branch) were amongst the least likely to be grafted, whereas those operating at finer levels of granularity (e.g., Modify Assignment) exhibited the highest levels of graftability. To improve the efficiency of automated repair, future repair tools may exploit this observation by simply omitting (or de-prioritising) block-level repair actions.

We find that the majority of repairs occur below the statement level, an encouraging result, given that plastic surgery appears to be most effective at finer levels of granularity. In 58% of cases, the modified version of a statement could be found elsewhere in the file (when abstract code snippets were used). We can use this finding to increase the effectiveness of future repair tools by focusing statement replacement on structurally similar statements.

As part of our analysis, we also found that statement deletion was the second most common repair action, occurring in around a third of bug fixes. This suggests that preventing explicit statement deletion [Long and Rinard, 2015; Qi et al., 2015] may be misguided and likely to reduce the number of bugs that can be repaired.


RQ3: Can the effectiveness of plastic surgery be increased through the use of unlabelled code snippets?

In Chapter 5, we also investigated the effectiveness of plastic surgery when variable labels are stripped from snippets. When plastic surgery is performed with these abstract code snippets, we find that graftability is substantially increased. To increase the number of repairable bugs, we strongly advocate the inclusion of abstract donor code snippets in future search-based repair techniques.

In the future, we plan to implement our proposed repair actions and donor code snippets in a new repair tool. Building on previous work by Long and Rinard [2016] and Soto et al. [2016], we could feed the patches mined by BugHunter to a machine learning technique, to produce a model for performing fix localisation (i.e., to assign probabilities to the likelihood of certain bug fixes, based on those observed in historical patches).

RQ4: Do biases within GenProg’s operators degrade its performance?

In Chapter 6, we identified a number of potential biases and inefficiencies in GenProg's search algorithm as part of a larger theoretical analysis. We showed how allowing multiple edits at a single location may prevent the search from reaching certain areas of the search space, and how patches tend to grow in size over the course of the search (i.e., bloat occurs).

As part of an empirical study, we examined whether the performance of the search could be improved by fixing these problems:

• To address the problem of bloat, we replaced GenProg's standard selection operator with a double-tournament selection. We found little to no difference in the performance of the search, despite the observation that most patches are found earlier in the search (when patches are smaller). This result may be due to the lack of tuned parameters, or it may be that bloat controls further reduce the memory of the search.

• As a simple fix to the identified operator biases, we assessed the performance of GenProg when its algorithm is limited to a single edit per location. As with the use of bloat controls, we found no discernible improvement in performance.

None of our changes to GenProg in response to these biases had a noticeable effect on its performance. The lack of any response to these changes, be it positive or negative, suggests that the search is largely insensitive to important components and parameters of its algorithm. This observation, together with our findings regarding GenProg's fitness function, suggests that the search is not operating as intended.


RQ5: Does fitness help to guide GenProg's search algorithm towards solutions?

As part of our analysis of GenProg's search algorithm in Chapter 6, we identified the implicit positive and negative feedback mechanisms provided by fitness information: the test outcomes of previously passing tests serve to prevent the degradation of existing functionality, whereas the outcomes of previously failing tests serve to promote the exploitation of partial solutions.

Through an empirical analysis, we demonstrated that the search performs at least as well as the baseline when selection is restricted to non-destructive patches (i.e., patches that do not fail previously passing test cases). Therefore, gathering information for all of an individual's previously passing tests is redundant. The search can operate more efficiently, as measured by the number of test evaluations, by performing test case prioritisation over the previously passing tests, and terminating on the first instance of failure.

To determine the contribution of negative test outcomes, we introduced a variant of GenProg wherein selection was restricted to non-destructive individuals, but selection between those individuals was performed on the basis of the number of negative test passes. In most cases, we found that this algorithm allowed the search to find a solution in fewer candidate patch evaluations. For scenarios with particularly weak test suites, where a large number of negative test passes were observed, this algorithm led to a reduced performance.

RQ6: Is there a more effective search algorithm for generating multiple-edit patches than the genetic algorithm used by GenProg?

Motivated by the results of our analysis of GenProg's search algorithm, in Chapter 6 we proposed an alternative search algorithm for multiple-edit repair, based on a greedy algorithm. Our proposed algorithm was shown to beat GenProg's genetic algorithm across all of our performance metrics (efficiency, effectiveness, reliability). These results show promise for the future of multi-edit, search-based program repair. In future work, we plan to explore a number of optimisations to this algorithm, and to assess its effectiveness on a larger set of real-world, multi-line bugs.

7.2. Future Work

In this section, we briefly discuss a number of promising areas for future research. A more detailed discussion of future research regarding repair models, fault localisation and search algorithms can be found at the end of each of the respective chapters.

• Platform for Automated Program Repair: Part of the reason for GenProg's perceived lack of effectiveness may not be that its repair model lacks expressiveness,


as suggested by others [Long and Rinard, 2015]. Rather, its problems may be the consequence of a number of unintended code transformations performed by CIL, the AST analysis framework upon which GenProg is based, which may make the program unsuitable for repair. Below, we describe two examples of this behaviour:

– If-statements with multiple terms (e.g., if (x && y && z)) are exploded into a sequence of side-effect-free, single-term if-statements, shown below.

if (x) {
    if (y) {
        if (z) {
            ...
        }
    }
}

– Statements with possible side effects are transformed into a sequence of side-effect-free statements, through the introduction of temporary variables. For example, the statement:

int z = f(x) + g(z);

is transformed into the following:

int t1 = f(x);
int t2 = g(z);
int z = t1 + t2;

As a result of these transformations, the program inflates in size. In the case of the SIR bug scenarios, we observed single statements that were exploded into more than ten. Since all of these statements belong to the same basic block, there is no potential improvement to the fault localisation (assuming spectrum-based fault localisation is used). This linear increase in the number of statements within the program leads to a quadratic increase in the size of the search space: an average inflation factor of 10 could result in a 100-fold increase in the size of the search space. Accompanying this growth in the search space is an increase in the size of the solution. Where previously a patch could be generated with a single edit (e.g., insertion of the statement from the example above), now a set of statements must be introduced, in an acceptable order.

In addition to increasing the difficulty of the problem along these two dimensions, inflation introduces other unintended issues through its use of temporary variables. Crafting a repair now requires that a particular temporary variable is in scope.


In ongoing work, we are building a new framework for automated program repair of C and C++ on top of Clang, a modern program analysis toolchain that preserves the original source code of the program. As part of this framework, we plan to integrate a DSL for formally defining repair actions in terms of program transformations. Additionally, to reduce the considerable overheads associated with compilation, we plan to introduce super-mutation, i.e., multiple mutants are compiled once and selected between at run-time.

• Test Suites for Automated Program Repair: During our empirical analysis of GenProg's search algorithm, we employed a simple test suite reduction technique to minimise the original test suites for the Siemens benchmarks by two orders of magnitude. Despite the large reduction, from thousands of tests to tens, we observed few cases of overfitting. These anecdotal results would seem to suggest that any improvements to repair quality stemming from the use of larger test suites diminish quickly as the size of the test suite increases. In future work, we plan to explore exactly what makes a good test suite for automated program repair. These qualities may involve mutation score, coverage, the novelty of execution paths, or data-flow invariants.

Interestingly, we found that many low-quality repairs associated with GenProg are found in test suites which exhibit a low entropy across their possible outcomes. Test harnesses that simply check the exit status of a program, such as the PHP scenarios within ManyBugs, exhibit the lowest possible entropies. From the perspective of the test harness, most test cases appear to have identical outcomes. By using Pythia to automatically generate an oracle that checks the standard output, standard error, and resulting state of the sandbox, the number of identical outcomes tends to drop considerably (i.e., this enhanced test harness exhibits a higher degree of entropy across the behaviours of its tests). When this enhanced oracle is used, the number of plausible solutions for the TSE 2012 benchmarks drops from thousands to fewer than ten.

If we can find a criterion that expresses the suitability of a test suite for APR, then we could use that criterion as an objective function when performing test-suite reduction. Alternatively, we could use this criterion to guide the automatic generation of a highly effective, minimal test suite.

• Fault Localisation: In Chapter 4, we found that candidate patches generated using GenProg's statement-level operators were ineffective at localising the fault when used to perform mutation-based fault localisation (MBFL). Upon closer investigation, we found that the majority of the candidate patches generated using GenProg's coarsely-grained mutation operators exhibited an "all-or-nothing" behaviour, where the mutation either caused all covering tests to fail, or it had no effect on their outcomes at all. Given the previous successes of using MBFL with traditional mutation testing operators [Moon et al., 2014a; Papadakis and Le Traon, 2015], which typically operate at the level of expressions rather than statements, it may be that mutation-based fault localisation is only effective when used with mutants generated by finely-grained


operators. Motivated by the lack of effectiveness of GenProg's statement-level operators, we plan to investigate the effectiveness of a wide range of mutation operators at varying levels of coarseness. We also plan to explore whether the test suite behaviours of particular kinds of mutants can be used to predict the shape of underlying faults. For instance, if mutation of a particular if-condition appears to have no effect on the outcomes of the test suite, it may suggest a faulty if-condition. By probing the shape of the fault, the search space can be vastly reduced to the small subset of repairs that fit a believed shape. As well as exploring the behaviour of MBFL using different mutation operators, we intend to assess whether ensemble learning techniques may be used to better aggregate mutant information and offline, spectrum-based fault localisation information.

• Repair Model: In Chapter 5, we investigated and demonstrated the effectiveness of plastic surgery in the context of actual repair operators. Since we found that levels of graftability (i.e., the likelihood of discovering a suitable code snippet within the corpus) varied highly between different repair operators, a natural next step would be to explore whether other factors affect graftability. For instance, we may wish to determine whether missing print statements and strings are likely to be discovered within the code. By better understanding the strengths and limitations of plastic surgery, we can design repair tools that are better able to repair the subset of faults that are most amenable to plastic surgery; unlikely repairs could be pruned from the search entirely.

Using the tool we developed to conduct the analysis in Chapter 5, BugHunter, we could build upon work by Soto et al. [2016] by devising techniques for graft localisation (i.e., determining the likelihood that certain snippets form part of the repair). Soto et al. [2016] look at the kind of statement that is most likely to replace a statement of a given kind. Their work could be extended by looking at likely insertions, as well as by examining a richer set of features, beyond just the kind of the statement. A richer set of features, closer to those used in [Long and Rinard, 2016], could look at the variables used by the snippet (and the extent to which they overlap with, or complement, the original statement). Machine learning techniques could be applied to that set of features to learn an effective graft localisation function.

The results of our analysis could also be bolstered by the development and subsequent use of more sophisticated bug-fix identification procedures. One might also attempt to automatically categorise bugs, and to use that information to determine whether certain kinds of bugs are more amenable to repair, and to improve graft localisation by finding snippets that are most likely to be used to repair faults of that kind. For more details on future work related to the work conducted in Chapter 5, see Section 5.7.1.


• Search: In Chapter 6, we showed that a form of greedy search is better suited to automated repair than the genetic algorithm used by GenProg. Although our greedy search algorithm was substantially more efficient than the genetic algorithm, its performance could be further improved by a number of small optimisations. Instead of combining and exploiting complementary partial solutions at the earliest possible moment, an optimised version of the algorithm could wait until a combination of partial solutions, whose constituent edits pass all of the negative tests, is found. By waiting until a (potential) complete solution is found, the search should show a vastly improved performance on deceptive problems, where many partial solutions are incorrectly reported (such as the tcas bugs). Further gains in efficiency could be made by deferring the evaluation of the positive tests until a combination of edits that passes all of the negative tests is found.

Rather than pursuing improvements in efficiency, one may adapt our algorithm to be less greedy, to allow a greater number of bugs to be fixed. Instead of solely combining and exploiting partial solutions, the search could also combine neutral edits.

In addition to improvements to efficiency and effectiveness, we plan to explore whether knowledge of the test suite outcomes for combinations of patches can be used to probabilistically determine whether the individual edits within a patch are parts of the solution or whether they are misleading. More specifically, we seek to answer: "Given two unique patches, A and B, with known test suite outcomes, does the knowledge of the test suite outcome for their combination AB provide us with (usable) information about the solution?" Technical details on potential optimisations to our search algorithm can be found in Section 6.5.

7.3. Concluding Remarks

Realising the long-held goal of automatically repairing programs poses enormous challenges, far exceeding those that early results would seem to suggest. Despite the daunting vastness of the problem, in less than a decade, researchers have demonstrated techniques capable of fixing a small, but considerable, subset of single-edit bugs in real-world, large-scale code. Over the next decade, researchers face an even greater set of challenges, as we attempt to scale techniques to a larger number of bugs, spanning multiple lines, all whilst remaining cost-efficient.

The sheer magnitude of the challenges ahead may, understandably, cause some to doubt that program repair will ever become an integral part of real-world processes. To those that doubt, recall that a decade prior to the introduction of GenProg, the concept of automated program repair would have been

deemed purely science fiction. Since then, the question has shifted from "is automated program repair possible?" to "which bugs can we automatically repair?"

In this thesis, we explored each of the challenges facing search-based program repair over the coming decade, and in the process discovered hints of what a next-generation repair technique might look like: we devised a more effective algorithm for multiple-edit repairs, found ways of exploiting plastic surgery to fix a greater number of bugs, and presented a more robust platform for conducting program repair research. Not all of our results were positive: contrary to our expectations, and to the findings of previous research, we found that the use of mutation-based fault localisation had no impact on fault localisation accuracy. In such an early field of research, seemingly promising ideas are likely to fail; a small price to pay for the privilege of pioneering. These results help us to understand where best to focus our attention, or rather, where not to waste it, and raise interesting questions regarding the cases in which mutation analysis can be used to improve fault localisation.

An exciting decade of research lies ahead; we'll see you on the other side.


Appendices


APPENDIX A

Reproducibility

A.1. Fault Localisation

Raw data, analysis scripts and a replication package for the results presented in Chapter 4 can be downloaded at:

https://bitbucket.org/ChrisTimperley/ssbse-2017-data

Source code for the modified version of GenProg used in these experiments (and those in Chapter 6) can be found at:

https://bitbucket.org/ChrisTimperley/GP3

A.2. Repair Model

A replication package for the results in Chapter 5 can be found at:

https://bitbucket.org/ChrisTimperley/phd-repair-model

Note, raw results for this study are not available online, due to size limitations (the raw data is several hundred GBs in size). BugHunter, the tool used to mine bug-fixing commits and repair actions from Git repositories, can be downloaded at:

https://github.com/ChrisTimperley/BugHunter

A.3. Search

Raw data, analysis scripts and a replication package for the results in Chapter 6 can be found at:

https://bitbucket.org/ChrisTimperley/phd-search


APPENDIX B

Repair Action Mining

B.1. AST and Edit Script Generation

After identifying the bug-fixing commits within a Git repository, BugHunter proceeds to generate the abstract syntax trees for each of the modified source files, in both the faulty and fixed versions of each identified bug fix, using GumTree [Falleri et al., 2014]. Within each generated AST, each node is assigned a unique integer identifier n ≥ 0, given by its post-order position within the tree. Additionally, for each pair of faulty and fixed versions of a source code file (F, F′), a set of AST node mappings M(F, F′) and an AST edit script ∆(F, F′) are generated, using a modified version of GumTree [Falleri et al., 2014].

The mapping M(F, F′) is composed of a sequence of ordered pairs (nf, nf′), where nf specifies the position of a node in F, and nf′ specifies the location of the matching node in F′. In the case where the node located at nf in F has no matching node in F′, nf′ is assigned ⊥. Similarly, when a node exists in F′, but no matching node is found in F, nf is assigned ⊥.

Each edit script ∆(F, F′) describes a sequence of low-level operations that may be applied to the faulty source file F to transform it into its patched form F′. Unfortunately, GumTree provides no formal definition of its edit script language, and so, for the purposes of this analysis, we define its operations as follows:

• Delete(X) removes the node located at position X within P.

• Insert(X′, Y′, k) describes the insertion of the node located at X′ in P′ as the k-th child of the node located at Y′ in P′. Note that Y′ is not necessarily contained in P.

• Update(X, ℓ′) updates the label ℓ of the node located at X in P with the label ℓ′.

• Move(X, X′) moves the node located at position X within P to location X′ in P′. X′ is obtained by finding the corresponding mapping (X, X′) ∈ M(F, F′).

By performing the analysis on the ASTs of each repaired source file, and their accompanying edit scripts, rather than the plain text and diff outputs between each version of the file, aesthetic and irrelevant changes to the code, including the modification of comments, can safely be ignored. More importantly, this treatment allows each of the described repair actions to be detected using a strict set of formal rules


(based upon M(F, F′) and ∆(F, F′)), without the need to resort to unsound, ad-hoc approaches based on analysing the diff.

In a preliminary version of this analysis, BugHunter's (heuristic) ability to perform pre-processing on arbitrary C projects was used to handle pre-processor macros (e.g., #ifdef, #define), potentially allowing the analysis to deal with a wider range of bugs. However, pre-processing added a significant overhead to the process of AST generation (to the extent that certain projects, such as the Linux kernel, were no longer feasible to analyse), and made little difference to the detection of repair actions. To reduce the cost of the analysis, and to allow a larger, wider body of programs to be incorporated, this pre-processing stage was dropped from the final analysis.
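For concreteness, the four edit-script operations defined above can be represented as simple records; this encoding is a sketch of our own and not GumTree's API.

    from dataclasses import dataclass

    @dataclass
    class Delete:
        node: int         # position of the deleted node in P

    @dataclass
    class Insert:
        node: int         # position of the inserted node in P'
        parent: int       # position of its parent in P'
        k: int            # index of the node amongst its parent's children

    @dataclass
    class Update:
        node: int         # position of the relabelled node in P
        label: str        # the new label

    @dataclass
    class Move:
        source: int       # position of the node in P
        destination: int  # position of the matching node in P'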

Edit Script Post-Processing

To reduce the complexity of the analysis, we perform a number of post-processing steps on the edit scripts and accompanying node mappings produced by GumTree:

• Node Mapping, M: using the mapping of node numbers in P to P′, Mnum, we generate a mapping of nodes from P to P′ by their pointers. In cases where a node does not exist in either P or P′, the unpaired node n is mapped to ⊥ (i.e., (n, ⊥) or (⊥, n)).

• Edit Script Nodes: each node number in the edit script is replaced by a pointer to its corresponding node.

• Before Statements, S: computes the set of statements in P.

• After Statements, S′: computes the set of statements in P′.

B.2. Detection Rules

In this section, we provide the rules we devised to detect instances of the repair actions listed in Section 5.4.3. Below, we describe a number of helper functions that we use to reduce the complexity of these detection rules.

We introduce the function Parent(node) to find the parent of a given node within its AST. If the node has no parent (i.e., it is the root of the AST), ⊥ is returned. To simplify finding the nearest ancestor of a given node n that belongs to a certain kind k, we introduce the function NAK:

NAK(n, k) = ⊥                    if n = ⊥
            n                    if HasKind(n, k)
            NAK(Parent(n), k)    otherwise                      (B.1)


Note that this function treats its argument as the zeroth ancestor, allowing n itself to be returned as its result. We introduce a helper function, NAO, that finds the nearest ancestor of a given node n′ in P′ with a matching node in P, and returns that matching ancestral node:

NAO(n′) = n                   if (n, n′) ∈ Mnode
          NAO(Parent(n′))     if (⊥, n′) ∈ Mnode                (B.2)

We introduce a function, Modified, that accepts an edit e and returns the nearest node of a certain kind k within P to which that edit was applied:

Modified(e, k) = {NAK(n, k)}                       if e = Delete(n)
                 {NAK(n, k)}                       if e = Update(n, ℓ′)
                 {NAK(NAO(n′), k)}                 if e = Insert(n′, p′, k)
                 {NAK(n, k), NAK(NAO(n′), k)}      if e = Move(n, n′)        (B.3)
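Transcribed into executable form, NAK and NAO might look as follows, assuming nodes expose parent and kind attributes and that mapping maps matched nodes in P′ back to their counterparts in P; these names are ours.

    def nak(node, kind):
        # nearest ancestor of the given kind, treating node itself as the zeroth ancestor
        while node is not None:
            if node.kind == kind:
                return node
            node = node.parent
        return None

    def nao(node, mapping):
        # nearest ancestor of a node in P' with a match in P; returns the matching node
        while node is not None:
            if node in mapping:
                return mapping[node]
            node = node.parent
        return None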

Statement-Related Actions

DelStmt(stmt):
    stmt ∈ S
    parent = Parent(stmt)
    (stmt, ⊥) ∈ Mnode
    (parent, parent′) ∈ Mnode
    parent′ ≠ ⊥

To detect statement deletions, we iterate through each of the statements in the original AST (i.e., stmt ∈ S) and check that there is no matched node for that statement within the modified AST (indicating that it has been deleted). Redundant statement deletions (i.e., those implied by the deletion of their parent statement) are removed by ensuring that the parent of stmt maps to a node in the modified AST.

Next, we mine InsertStatement actions, which are used to represent both PrependStatement and AppendStatement actions. As with the detection of statement deletions, redundancy checks are used to ensure that nested actions are modelled using a single insertion.

 0  stmt ∈ S  0   parent = Parent(stmt)    InsStmt(stmt, parent) (⊥, stmt) ∈ Mnode 0  (parent, parent ) ∈ Mnode     parent 6= ⊥ 


To simplify the detection of sub-statement-level changes, an intermediary ModifyStatement action is introduced, which describes a series of edits applied to a given statement. To avoid the parent of a modified statement from triggering the ModifyStatement action, edits are only associated with their nearest statement in the original form of the AST.

 0  (s, s ) ∈ Mnode  0 0   s ∈ S ∧ s ∈ S   0  ModStmt(s, s , edits) edits = {e ∈ E | Modied(e, Stmt) = s}

 edits 6= ∅     s 6= s0 

If-Statement-Related Actions

 0  (s, s ) ∈ Mnode   Wrap(s, w, g) w = If(g, s, [])

 (⊥, if) ∈ M 

 0  s = If(g, s , [])  0  Unwrap(s, s ) s ∈ SP ∧ s∈ / SP 0 0  s ∈ SP 0 

 0  (s, s ) ∈ M    0 0 s = If(c, then, else)  ReplaceIfCond(s, s , c, c ) s0 = If(c0, then, else)    c 6= c0 

 0  (s, s ) ∈ M    s = If(c, then, else)   0 0  ReplaceThen(s, s , then, then ) s0 = If(c, then0, else)

 then 6= then0     then0 6= [] 

 0  (s, s ) ∈ M    s = If(c, then, else)   0 0  ReplaceElse(s, s , els, els ) s0 = If(c, then, else0)

 else 6= else0     else0 6= [] 

 0  (s, s ) ∈ M    0 s = If(c, then, els)  RemoveElse(s, s , els) s0 = If(c, then, [])    els 6= [] 


 0  (s, s ) ∈ M    0 s = If(c, then, [])  InsertElse(s, s , els) s0 = If(c, then, els)    els 6= [] 

 0  (s, s ) ∈ M    s = If(c1, then1, [])   0   s = If(c1, then1, els)   0  InsertElseIf(s, s , elif, c2, then2) elif = If(c2, then2, [])    c2 6= c1     then2 6= []    (⊥, then2) ∈ M

 0  (s, s ) ∈ M    0 s = If(g1, then, els)  GuardElse(s, s , g2) 0 0  s = If(g1, then, els )   0   els = If(g2, els, []) 

Switch-Related Actions

 0  (s, s ) ∈ M    0 0 s = Switch(exp, blk)  RepSwitchExp(s, s , exp, exp ) s0 = Switch(exp0, blk)    exp 6= exp0 

Loop-Related Actions

To detect RepLoopGuard repair actions, three intermediate repair actions, corresponding to the different loop types, are introduced. RepLoopGuard actions are then found by computing the union of the sets for these intermediate repair actions.

 0  (s, s ) ∈ M    0 0 s = For(init, g, incr, blk)  RepForGuard(s, s , g, g ) s0 = For(init, g0, incr, blk)    g 6= g0 

 0  (s, s ) ∈ M    0 0 s = While(g, blk)  RepWhileGuard(s, s , g, g ) s0 = While(g0, blk)    g 6= g0 


 0  (s, s ) ∈ M    0 0 s = DoWhile(g, blk)  RepDoWhileGuard(s, s , g, g ) s0 = DoWhile(g0, blk)    g 6= g0 

ARepLoopGuard = ARepForGuard ∪ ARepWhileGuard ∪ ARepDoWhileGuard

Like RepLoopGuard, RepLoopBody is composed of the union of three intermediate repair actions, which are otherwise ignored for the rest of the analysis.

 0  (s, s ) ∈ M    0 0 s = For(init, g, incr, b)  RepForBody(s, s , b, b ) s0 = For(init, g, incr, b0)    b 6= b0 ∧ b0 6= [] 

 0  (s, s ) ∈ M    0 0 s = While(g, b)  RepWhileBody(s, s , b, b ) s0 = While(g, b0)    b 6= b0 ∧ b0 6= [] 

 0  (s, s ) ∈ M    0 0 s = DoWhile(g, b)  RepDoWhileBody(s, s , b, b ) s0 = DoWhile(g, b0)    b 6= b0 ∧ b0 6= [] 

ARepLoopBody = ARepForBody ∪ ARepWhileBody ∪ ARepDoWhileBody

Assignment-Related Actions

 0  (s, s ) ∈ M    0 0 s = Assign(lhs, op, rhs)  RepRHS(s, s , rhs, rhs ) s0 = Assign(lhs, op, rhs0)    rhs 6= rhs0 

 0  (s, s ) ∈ M    0 0 s = Assign(lhs, op, rhs)  RepLHS(s, s , lhs, lhs ) s0 = Assign(lhs0, op, rhs)    lhs 6= lhs0 


Function-Call-Related Actions

As with other types of repair action, an intermediate action is introduced for all repair actions related to function calls. This intermediate action, ModCall, is used to quickly identify that the only modification to a given statement is rooted at a given function call (or, to be more precise, the function call closest to all edits made at that statement).

 0  ModStmt(s, s ,E) ∈ AModStmt    call = Call(t, args)   0 0 0   0 0 call = Call(t , args )  ModCall(s, s , c, c ,E) call 6= call0    (call, call0) ∈ M     ∀e ∈ E | Nearest(e, Call) = call 

 0 0  ModCall(s, s , c, c ,E) ∈ A  ModCall   0 0 call = Call(t, args)  RepCallTarg(s, s , t, t ) call0 = Call(t0, args)    t 6= t0 

 0 0  ModCall(s, s , c, c ,E) ∈ A  ModCall   0 call = Call(t, args)  ModCallArgs(args, args ) call0 = Call(t, args0)    args 6= args0 

 0  ModCall(args, args ) ∈ AModCallArgs    0 args = l ⊕ arg ⊕ r  RepCallArg(args, arg, arg ) args0 = l ⊕ arg0 ⊕ r    arg 6= arg0 

 0  ModCall(args, args ) ∈ AModCallArgs  0  InsertCallArg(args, arg ) args = l ⊕ r

 args0 = l ⊕ arg0 ⊕ r 

 0  ModCall(args, args ) ∈ A  ModCallArgs  RemoveCallArg(args, arg) args = l ⊕ arg ⊕ r

 args0 = l ⊕ r 


APPENDIX C

Additional Fault Localisation Results

In Table C.1, we report additional results from our comparison of the effectiveness of various fault localisation approaches in Section 4.3.1.

Table C.1: The effectiveness of various fault localisation schemes (GenProg, Adj. Cov., P2F, P2F-Cov, Metallaxis, MUSE, Jaccard, Ochiai, Tarantula), measured by the probability of sampling a fixable statement, for the scenarios ct-openssl-0a2dcb6, ct-openssl-6979583, ct-openssl-8e3854a, ct-openssl-eddef30, sir-gzip-v1-KP_1, sir-gzip-v1-TW_3, sir-gzip-v4-KL_1, sir-gzip-v5-KL_1, sir-gzip-v5-KL_8, sir-sed-v2-AG_17, and sir-sed-v3-AG_11.

Bibliography

Abreu, R., Zoeteweij, P., and van Gemund, A. J. C. (2007). On the Accuracy of Spectrum-based Fault Localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, TAICPART-MUTATION ’07, pages 89–98.

Ackling, T., Alexander, B., and Grunert, I. (2011). Evolving Patches for Software Repair. In Conference on Genetic and Evolutionary Computation, GECCO ’11, pages 1427–1434.

Agrawal, H., Horgan, J. R., Krauser, E. W., and London, S. (1993). Incremental Regression Testing. In Conference on Software Maintenance, ICSM ’93, pages 348–357.

Al-Ekram, R., Adma, A., and Baysal, O. (2005). diffX: An Algorithm to Detect Changes in Multi-version XML Documents. In Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’05, pages 1–11.

Arcuri, A. and Yao, X. (2008). A Novel Co-evolutionary Approach to Automatic Software Bug Fixing. In Congress on Evolutionary Computation, CEC ’08, pages 162–168.

Avizienis, A. (1985). The N-Version Approach to Fault-Tolerant Software. IEEE Transactions on Software Engineering, 11(12):1491–1501.

B. Le, T.-D., Lo, D., Le Goues, C., and Grunske, L. (2016). A Learning-to-rank Based Fault Localization Approach Using Likely Invariants. In International Symposium on Software Testing and Analysis, ISSTA ’16, pages 177–188.

Barr, E. T., Brun, Y., Devanbu, P., Harman, M., and Sarro, F. (2014). The Plastic Surgery Hypothesis. In International Symposium on Foundations of Software Engineering, FSE ’14, pages 306–317.

Black, P. E. (2007). Software assurance with SAMATE reference dataset, tool standards, and studies. In Digital Avionics Systems Conference, DASC ’07, pages 6.C.1–1–6.C.1–6.

Bloch, J. (2008). Effective Java. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition.

Böhme, M. and Roychoudhury, A. (2014). CoREBench: Studying Complexity of Regression Errors. In International Symposium on Software Testing and Analysis, ISSTA ’14, pages 105–115.

Cadar, C., Dunbar, D., and Engler, D. (2008). KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. In Conference on Operating Systems Design and Implementation, OSDI ’08, pages 209–224.


Carbin, M., Misailovic, S., Kling, M., and Rinard, M. C. (2011). Detecting and Escaping Infinite Loops with Jolt. In European Conference on Object-oriented Programming, ECOOP ’11, pages 609–633.

Chandra, S., Torlak, E., Barman, S., and Bodik, R. (2011). Angelic Debugging. In International Conference on Software Engineering, ICSE ’11, pages 121–130.

Chen, T. Y. and Cheung, Y. Y. (1997). On Program Dicing. Journal of Software Maintenance: Research and Practice, 9(1):33–46.

Coker, Z. and Hafiz, M. (2013). Program Transformations to Fix C Integers. In International Conference on Software Engineering, ICSE ’13, pages 792–801.

Coldewey, D. (2008). Zune bug explained in detail. https://techcrunch.com/2008/12/31/zune-bug-explained-in-detail/. Accessed May, 2017.

de Moura, L. and Bjørner, N. (2008). Z3: An Efficient SMT Solver, pages 337–340. ETAPS ’08.

Demsky, B. and Rinard, M. (2003). Automatic Detection and Repair of Errors in Data Structures. In Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA ’03, pages 78–95.

Demsky, B. and Rinard, M. (2005). Data Structure Repair using Goal-Directed Reasoning. In International Conference on Software Engineering, ICSE ’05, pages 176–185.

Do, H., Elbaum, S., and Rothermel, G. (2005). Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering, 10(4):405–435.

Downey, A. B. (2011). Think Stats: Probability and Statistics for Programmers. O’Reilly Media, 2nd edition.

Dyer, R., Nguyen, H. A., Rajan, H., and Nguyen, T. N. (2013). Boa: A Language and Infrastructure for Analyzing Ultra-large-scale Software Repositories. In International Conference on Software Engineering, ICSE ’13, pages 422–431.

Eiben, A. E. and Smith, J. E. (2015). Introduction to Evolutionary Computing. Springer Publishing Company, Incorporated, 2nd edition.

Ernst, M. D., Perkins, J. H., Guo, P. J., McCamant, S., Pacheco, C., Tschantz, M. S., and Xiao, C. (2007). The Daikon System for Dynamic Detection of Likely Invariants. Science of Computer Programming, 69(1-3):35–45.

Falleri, J., Morandat, F., Blanc, X., Martinez, M., and Monperrus, M. (2014). Fine-grained and accurate source code differencing. In International Conference on Automated Software Engineering, ASE ’14, pages 313–324.


Fast, E., Le Goues, C., Forrest, S., and Weimer, W. (2010). Designing Better Fitness Functions for Automated Program Repair. In Annual Conference on Genetic and Evolutionary Computation, GECCO ’10, pages 965–972.

Felter, W., Ferreira, A., Rajamony, R., and Rubio, J. (2015). An updated performance comparison of virtual machines and Linux containers. In International Symposium on Performance Analysis of Systems and Software, ISPASS ’15, pages 171–172.

Fry, Z. P., Landau, B., and Weimer, W. (2012). A Human Study of Patch Maintainability. In International Symposium on Software Testing and Analysis, ISSTA ’12, pages 177–187.

Gall, H. C., Fluri, B., and Pinzger, M. (2009). Change Analysis with Evolizer and ChangeDistiller. IEEE Software, 26(1):26–33.

Gallagher, K., Binkley, D., and Harman, M. (2006). Stop-List slicing. In International Workshop on Source Code Analysis and Manipulation, SCAM ’06, pages 11–20.

Gao, Q., Xiong, Y., Mi, Y., Zhang, L., Yang, W., Zhou, Z., Xie, B., and Mei, H. (2015). Safe Memory-Leak Fixing for C Programs. In International Conference on Software Engineering, volume 1 of ICSE ’15, pages 459–470.

Giffhorn, D. and Hammer, C. (2007). An Evaluation of Slicing Algorithms for Concurrent Programs. In International Working Conference on Source Code Analysis and Manipulation, SCAM ’07, pages 17–26.

Glover, F. (1986). Future Paths for Integer Programming and Links to Artificial Intelligence. Computers and Operations Research, 13(5):533–549.

Gupta, R. and Soffa, M. L. (1995). Hybrid Slicing: An Approach for Refining Static Slices Using Dynamic Information. SIGSOFT Software Engineering Notes, 20(4):29–40.

Harman, M. and Hierons, R. M. (2001). An Overview of Program Slicing. Software Focus, 2(3):85–92.

Harrold, M. J., Rothermel, G., Sayre, K., Wu, R., and Yiz, L. (2000). An Empirical Investigation of the Relationship Between Spectra Differences and Regression Faults. Journal of Software Testing, Verification, and Reliability, 10(3).

Holland, J. H. (1992). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge, MA, USA.

Holland, J. H. (2000). Building Blocks, Cohort Genetic Algorithms, and Hyperplane-Defined Functions. Evolutionary Computing, 8(4):373–391.

Horwitz, S., Reps, T., and Binkley, D. (1990). Interprocedural Slicing Using Dependence Graphs. ACM Transactions on Programming Languages and Systems, 12(1):26–60.


Hutchins, M., Foster, H., Goradia, T., and Ostrand, T. (1994). Experiments of the Effectiveness of Dataflow- and Controlflow-based Test Adequacy Criteria. In International Conference on Software Engineering, ICSE ’94, pages 191–200.

Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579.

Janssen, T., Abreu, R., and v. Gemund, A. J. C. (2009). Zoltar: A Toolset for Automatic Fault Localization. In International Conference on Automated Software Engineering, ASE ’09, pages 662–664.

Jha, S., Gulwani, S., Seshia, S. A., and Tiwari, A. (2010). Oracle-guided Component-based Program Synthesis. In International Conference on Software Engineering, ICSE ’10, pages 215–224.

Joachims, T. (2002). Optimizing Search Engines Using Clickthrough Data. In International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 133–142.

Jones, J. A. and Harrold, M. J. (2005). Empirical Evaluation of the Tarantula Automatic Fault-localization Technique. In International Conference on Automated Software Engineering, ASE ’05, pages 273–282.

Jones, J. A., Harrold, M. J., and Stasko, J. (2002). Visualization of Test Information to Assist Fault Localization. In International Conference on Software Engineering, ICSE ’02, pages 467–477.

Jones, T. and Forrest, S. (1995). Fitness Distance Correlation As a Measure of Problem Difficulty for Genetic Algorithms. In International Conference on Genetic Algorithms, ICGA ’95, pages 184–192.

Judge Business School, Cambridge University (2013). Cambridge University Study States Software Bugs Cost Economy $312 Billion Per Year. http://www.prweb.com/releases/2013/1/prweb10298185.htm. Accessed April, 2017.

Just, R., Jalali, D., and Ernst, M. D. (2014). Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In International Symposium on Software Testing and Analysis, ISSTA ’14, pages 437–440.

Kaleeswaran, S., Tulsian, V., Kanade, A., and Orso, A. (2014). MintHint: Automated Synthesis of Repair Hints. In International Conference on Software Engineering, ICSE ’14, pages 266–276.

Ke, Y., Stolee, K. T., Le Goues, C., and Brun, Y. (2015). Repairing Programs with Semantic Code Search. In International Conference on Automated Software Engineering, ASE ’15, pages 295–306.


Kim, D., Nam, J., Song, J., and Kim, S. (2013). Automatic Patch Generation Learned from Human-written Patches. In International Conference on Software Engineering, ICSE ’13, pages 802–811.

Kling, M., Misailovic, S., Carbin, M., and Rinard, M. (2012). Bolt: On-demand Infinite Loop Escape in Unmodified Binaries. In International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’12, pages 431–450.

Korel, B. and Rilling, J. (1998). Dynamic program slicing methods. Information and Software Technology, 40(11–12):647–659.

Kossak, F., Mashkoor, A., Geist, V., and Illibauer, C. (2014). Improving the Understandability of Formal Specifications: An Experience Report, pages 184–199. REFSQ ’14.

Krinke, J. (2003). Barrier slicing and chopping. In International Workshop on Source Code Analysis and Manipulation, SCAM ’03, pages 81–87.

Krinke, J. (2004). Slicing, Chopping, and Path Conditions with Barriers. Software Quality Journal, 12(4):339–360.

Landsberg, D., Chockler, H., Kroening, D., and Lewis, M. (2015). Evaluation of Measures for Statistical Fault Localisation and an Optimising Scheme. In Egyed, A. and Schaefer, I., editors, International Conference on Fundamental Approaches to Software Engineering, FASE ’15, pages 115–129.

Langdon, W. B. (2015). Genetically Improved Software, pages 181–220. Springer International Publishing, Cham.

Langdon, W. B. and Poli, R. (1998). Fitness Causes Bloat, pages 13–22. Springer London, London.

Le, X.-B. D., Lo, D., and Le Goues, C. (2016). History Driven Program Repair. In International Conference on Software Analysis, Evolution, and Reengineering, SANER ’16.

Le Goues, C., Dewey-Vogt, M., Forrest, S., and Weimer, W. (2012a). A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In International Conference on Software Engineering, ICSE ’12, pages 3–13.

Le Goues, C., Holtschulte, N., Smith, E. K., Brun, Y., Devanbu, P., Forrest, S., and Weimer, W. (2015). The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs. IEEE Transactions on Software Engineering, 41(12):1236–1256.

Le Goues, C., Nguyen, T., Forrest, S., and Weimer, W. (2012b). GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering, 38(1):54–72.


Le Goues, C., Weimer, W., and Forrest, S. (2012c). Representations and Operators for Improving Evolutionary Software Repair. In Conference on Genetic and Evolutionary Computation, GECCO ’12, pages 959–966, New York, NY, USA.

Liblit, B., Naik, M., Zheng, A. X., Aiken, A., and Jordan, M. I. (2005). Scalable Statistical Bug Isolation. In Conference on Programming Language Design and Implementation, PLDI ’05, pages 15–26.

Long, F. and Rinard, M. (2015). Staged Program Repair with Condition Synthesis. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pages 166–178.

Long, F. and Rinard, M. (2016). Automatic patch generation by learning correct code. In Symposium on Principles of Programming Languages, POPL ’16, pages 298–312.

Luke, S. and Panait, L. (2002). Fighting Bloat with Nonparametric Parsimony Pressure. In International Conference on Parallel Problem Solving from Nature, PPSN VII, pages 411–421.

Lyle, J. R. (1987). Automatic program bug location by program slicing. International Conference on Computers.

Maldonado, J. C., Delamaro, M. E., Fabbri, S. C. P. F., Simão, A. d. S., Sugeta, T., Vincenzi, A. M. R., and Masiero, P. C. (2001). Mutation Testing for the New Century. chapter Proteum: A Family of Tools to Support Specification and Program Testing Based on Mutation, pages 113–116. Kluwer Academic Publishers, Norwell, MA, USA.

Mann, H. B. and Whitney, D. R. (1947). On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics, 18(1):50–60.

Martinez, M. and Monperrus, M. (2013). Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering, 20(1):176–205.

Martinez, M., Weimer, W., and Monperrus, M. (2014). Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches. In International Conference on Software Engineering, ICSE ’14, pages 492–495.

Mechtaev, S., Yi, J., and Roychoudhury, A. (2015). DirectFix: Looking for Simple Program Repairs. In International Conference on Software Engineering, ICSE ’15, pages 448–458.

Mechtaev, S., Yi, J., and Roychoudhury, A. (2016). Angelix: scalable multiline program patch synthesis via symbolic analysis. In International Conference on Software Engineering, ICSE ’16, pages 691–701.


Meng, N., Kim, M., and McKinley, K. S. (2013). LASE: Locating and Applying Systematic Edits by Learning from Examples. In International Conference on Software Engineering, ICSE ’13, pages 502–511.

Miller, B. P., Fredriksen, L., and So, B. (1990). An Empirical Study of the Reliability of UNIX Utilities. Communications of the ACM, 33(12):32–44.

Monperrus, M. (2014). A critical review of automatic patch generation learned from human-written patches: essay on the problem statement and the evaluation of automatic software repair. In International Conference on Software Engineering, ICSE ’14, pages 234–242.

Monperrus, M. and Martinez, M. (2012). CVS-Vintage: A Dataset of 14 CVS Repositories of Java Software. Technical Report hal-00769121, INRIA.

Montana, D. J. (1995). Strongly Typed Genetic Programming. Evolutionary Computation, 3(2):199–230.

Moon, S., Kim, Y., Kim, M., and Yoo, S. (2014a). Ask the Mutants: Mutating Faulty Programs for Fault Localization. In International Conference on Software Testing, Verification and Validation, ICST ’14, pages 153–162.

Moon, S., Kim, Y., Kim, M., and Yoo, S. (2014b). Hybrid-MUSE: Mutating Faulty Programs for Precise Fault Localization. Technical report, KAIST.

Naish, L., Lee, H. J., and Ramamohanarao, K. (2011). A Model for Spectra-based Software Diagnosis. ACM Transactions on Software Engineering and Methodology, 20(3):11:1–11:32.

Naish, L., Lee, H. J., and Ramamohanarao, K. (2012). Spectral Debugging: How Much Better Can We Do? In Australasian Computer Science Conference, ACSC ’12, pages 99–106.

Necula, G. C., McPeak, S., Rahul, S. P., and Weimer, W. (2002). CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In International Conference on Compiler Construction, CC ’02, pages 213–228.

Neumann, G., Harman, M., and Poulding, S. (2015). Transformed Vargha-Delaney Effect Size. In International Symposium on Search Based Software Engineering, SSBSE ’15, pages 318–324.

Nguyen, H. D. T., Qi, D., Roychoudhury, A., and Chandra, S. (2013). SemFix: Program Repair via Semantic Analysis. In International Conference on Software Engineering, ICSE ’13, pages 772–781.

Ning, J. Q., Engberts, A., and Kozaczynski, W. V. (1994). Automated Support for Legacy Code Understanding. Communications of the ACM, 37(5):50–57.

Nishimatsu, A., Jihira, M., Kusumoto, S., and Inoue, K. (1999). Call-mark slicing: an efficient and economical way of reducing slice. In International Conference on Software Engineering, ICSE ’99, pages 422–431.


Ochiai, A. (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Nippon Suisan Gakkai Shi, 22(9):526–530.

Oliveira, V. P. L., Souza, E. F. D., Le Goues, C., and Camilo-Junior, C. G. (2016). Improved Crossover Operators for Genetic Programming for Program Repair. In Sarro, F. and Deb, K., editors, International Symposium on Search Based Software Engineering, SSBSE ’16, pages 112–127.

Pan, H. and Spafford, E. H. (1992). Heuristics for Automatic Localization of Software Faults. Technical Report SERC-TR-116-P, Software Engineering Research Center, Purdue University, West Lafayette, Indiana, USA.

Papadakis, M. and Le Traon, Y. (2015). Metallaxis-FL: mutation-based fault localization. Software Testing, Verification, and Reliability, 25(5-7):605–628.

Parnin, C. and Orso, A. (2011). Are Automated Debugging Techniques Actually Helping Programmers? In International Symposium on Software Testing and Analysis, ISSTA ’11, pages 199–209.

Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. (2009). Automatically Patching Errors in Deployed Software. In Symposium on Operating Systems Principles, SOSP ’09, pages 87–102.

Qi, Y., Mao, X., Lei, Y., Dai, Z., and Wang, C. (2014). The Strength of Random Search on Automated Program Repair. In International Conference on Software Engineering, ICSE ’14, pages 254–265.

Qi, Y., Mao, X., Lei, Y., and Wang, C. (2013). Using Automated Program Repair for Evaluating the Effectiveness of Fault Localization Techniques. In International Symposium on Software Testing and Analysis, ISSTA ’13, pages 191–201.

Qi, Z., Long, F., Achour, S., and Rinard, M. (2015). An Analysis of Patch Plausibility and Correctness for Generate-and-validate Patch Generation Systems. In International Symposium on Software Testing and Analysis, ISSTA ’15, pages 24–36.

Renieris, M. and Reiss, S. P. (2003). Fault Localization With Nearest Neighbor Queries. In International Conference on Automated Software Engineering, ASE ’03, pages 30–39.

Reps, T., Ball, T., Das, M., and Larus, J. (1997). The Use of Program Profiling for Software Maintenance with Applications to the Year 2000 Problem. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’97, pages 432–449, New York, NY, USA. Springer-Verlag New York, Inc.

Rosin, C. D. and Belew, R. K. (1997). New Methods for Competitive Coevolution. Evolutionary Computation, 5(1):1–29.


Rothermel, G., Untch, R. H., Chu, C., and Harrold, M. J. (2001). Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948.

Schulte, E., Fry, Z. P., Fast, E., Weimer, W., and Forrest, S. (2013). Software mutational robustness. Genetic Programming and Evolvable Machines, 15(3):281–312.

Sidiroglou-Douskos, S., Lahtinen, E., Long, F., and Rinard, M. (2015). Automatic Error Elimination by Horizontal Code Transfer Across Multiple Applications. In Conference on Programming Language Design and Implementation, PLDI ’15, pages 43–54.

Silva, J. (2012). A Vocabulary of Program Slicing-based Techniques. ACM Computing Surveys, 44(3):12:1–12:41.

Sivagurunathan, Y., Harman, M., and Danicic, S. (1997). Slicing, I/O and the Implicit State. In International Workshop on Automated Debugging, pages 59–65.

Smith, E. K., Barr, E., Le Goues, C., and Brun, Y. (2015). Is the Cure Worse than the Disease? Overfitting in Automated Program Repair. In Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ESEC/FSE ’15, pages 532–543, Bergamo, Italy.

Soto, M., Thung, F., Wong, C.-P., Le Goues, C., and Lo, D. (2016). A deeper look into bug fixes: patterns, replacements, deletions, and additions. In International Conference on Mining Software Repositories, MSR ’16, pages 512–515.

Stolee, K. T., Elbaum, S., and Dobos, D. (2014). Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM), 23(3):26.

Takada, T., Ohata, F., and Inoue, K. (2002). Dependence-cache slicing: a program slicing method using lightweight dynamic information. In International Workshop on Program Comprehension, IWPC ’02, pages 169–177.

Tan, H. B. K. and Ling, T. W. (1998). Correct program slicing of database operations. IEEE Software, 15(2):105.

Tan, S. H. and Roychoudhury, A. (2015). Relifix: Automated Repair of Software Regressions. In International Conference on Software Engineering, ICSE ’15, pages 471–482.

Tan, S. H., Yi, J., Yulis, Mechtaev, S., and Roychoudhury, A. (2017). Codeflaws: A Programming Competition Benchmark for Evaluating Automated Program Repair Tools. In ICSE ’17 Poster. To appear.

Tan, S. H., Yoshida, H., Prasad, M. R., and Roychoudhury, A. (2016). Anti-patterns in Search-Based Program Repair. In International Symposium on Foundations of Software Engineering, FSE ’16.

Thung, F., Lo, D., and Jiang, L. (2012). Automatic Defect Categorization. In Working Conference on Reverse Engineering (WCRE), WCRE ’12, pages 205–214. IEEE.


Timperley, C. S. (2013). Reflective Method Matching for Object-Oriented Programs. Master’s thesis, University of York, York, England.

Timperley, C. S. and Stepney, S. (2014). Reflective Grammatical Evolution. In ALife XIV, pages 71–78. MIT Press.

Vargha, A. and Delaney, H. D. (2000). A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132.

Venkatesh, G. A. (1991). The Semantic Approach to Program Slicing. SIGPLAN Notices, 26(6):107–119.

Walcott, K. R., Soffa, M. L., Kapfhammer, G. M., and Roos, R. S. (2006). Time-Aware Test Suite Prioritization. In International Symposium on Software Testing and Analysis, ISSTA ’06, pages 1–12.

Wei, Y., Pei, Y., Furia, C. A., Silva, L. S., Buchholz, S., Meyer, B., and Zeller, A. (2010). Automated Fixing of Programs with Contracts. In International Symposium on Software Testing and Analysis, ISSTA ’10, pages 61–72.

Weimer, W., Fry, Z. P., and Forrest, S. (2013). Leveraging program equivalence for adaptive program repair: Models and first results. In International Conference on Automated Software Engineering, ASE ’13, pages 356–366.

Weimer, W., Nguyen, T., Le Goues, C., and Forrest, S. (2009). Automatically Finding Patches Using Genetic Programming. In International Conference on Software Engineering, ICSE ’09, pages 364–374.

Weiser, M. (1979). Program slices: formal, psychological, and practical investigations of an automatic program abstraction method. PhD thesis, University of Michigan.

Weiser, M. (1981). Program Slicing. In International Conference on Software Engineering, ICSE ’81, pages 439–449.

Willmor, D., Embury, S. M., and Shao, J. (2004). Program Slicing in the Presence of a Database State. In International Conference on Software Maintenance, ICSM ’04, pages 448–452.

Xuan, J., Martinez, M., DeMarco, F., Clément, M., Marcote, S. L., Durieux, T., Berre, D. L., and Monperrus, M. (2017). Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering, 43(1):34–55.

Yan, X. and Han, J. (2002). gSpan: Graph-Based Substructure Pattern Mining. In International Conference on Data Mining, ICDM ’02, pages 721–.

Yoo, S. (2012). Evolving Human Competitive Spectra-Based Fault Localisation Techniques. In Search Based Software Engineering, Lecture Notes in Computer Science, pages 244–258. Springer Berlin Heidelberg.

Zeller, A. and Hildebrandt, R. (2002). Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering, 28(2):183–200.
