PRECISION IMPROVEMENT AND COST REDUCTION FOR

DEFECT MINING AND TESTING

By

BOYA SUN

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. H. Andy Podgurski

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2012

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Boya Sun ______

Doctor of Philosophy candidate for the ______ degree*.

Andy Podgurski (signed)______(chair of the committee)

Gultekin Ozsoyoglu ______

Soumya Ray ______

M. Cenk Cavusoglu ______

______

______

10/20/2011 (date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

TABLE OF CONTENTS

Table of Contents
List of Tables
List of Figures
Acknowledgements
Abstract
Chapter One. Introduction
1.1 Precision improvement and cost reduction for defect mining
1.1.1 Overview of defect mining approaches
1.1.2 Costs of defect mining
1.1.3 Proposed approaches
1.2 Precision improvement and cost reduction for operational software testing
1.3 Contributions
Chapter Two. Related work
2.1 Bug detection by mining frequent code patterns
2.2 Bug detection by employing revision histories
2.3 Classifying and ranking static warnings
2.4 Application and augmentation of static analysis tools
2.5 Considering cost in software testing and reliability
2.6 Test case clustering and classification
Chapter Three. Background
3.1 Program dependence graph and system dependence graph
3.2 Dependence graph based bug mining
3.3 Dependence graph based bug fix propagation
3.4 Cost-sensitive active learning
3.4.1 Active learning
3.4.2 Cost-sensitive active learning
Chapter Four. Improving precision of dependence graph based defect mining: a machine learning approach
4.1 Introduction
4.2 Previously proposed classification and ranking techniques
4.3 Proposed Solution
4.3.1 Classifying and Ranking Rules
4.3.2 Classifying and Ranking Violations
4.4 Empirical Study
4.4.1 Methodology
4.4.2 Summary of the trained Rule/Violation models
4.4.3 HP-1: Comparing our rule model with the baseline rule models
4.4.4 HP-2: Comparing our violation model with the baseline violation models
4.4.5 HP-3: Learning curves
Chapter Five. Extending static analysis by automatically mining project-specific rules
5.1 Introduction
5.2 The Rule Mining Tool and Static Analysis Tool Used in This Work
5.2.1 Mining Frequent Code Patterns
5.2.2 Static Analysis Tools and Custom Checkers
5.3 Automatic P2C (Pattern to Checker) Converter
5.3.1 Rule Extractor
5.3.2 Checker Generator
5.4 Empirical Study
5.4.1 Preparing patterns for analysis
5.4.2 R-1: Generality of generated checkers
5.4.3 R-2: Effectiveness of the generated checkers
5.5 Lessons Learned
Chapter Six. Bug fix propagation with fast subgraph matching
6.1 Introduction
6.2 GADDI: index-based fast subgraph matching algorithm
6.3 Specifics of Our Approach
6.3.1 Base graph generation
6.3.2 Generating a query graph from a bug fix: the PatternBuild tool
6.3.3 Applying the GADDI Algorithm
6.4 Empirical evaluation
6.4.1 Study design
6.4.2 Results
6.4.3 Threats to Validity
Chapter Seven. CARIAL: Cost-Aware Reliability Improvement with Active Learning
7.1 Introduction
7.2 Operational distribution and failure rates
7.3 The CARIAL Framework
7.3.1 Cost estimation
7.3.2 Risk Estimation of potential defects
7.3.3 Test case selection
7.4 The Cost-Sensitive Active Learner: CostHSAL
7.4.1 Motivation on using HSAL
7.4.2 HSAL (Hierarchical Sampling for Active Learning)
7.4.3 CostHSAL
7.5 Empirical Study
7.5.1 Dataset
7.5.2 Methodology
7.5.3 RQ-1: Accuracy of Risk Estimation
7.5.4 RQ-2: Effectiveness of Test Case Selection on Risk Reduction
7.5.5 Threats to Validity
Chapter Eight. Conclusions and future work
References

LIST OF TABLES

Table 4.1 Features for Rule Model
Table 4.2 Features for Violation Model
Table 4.3 Summary of Dataset
Table 4.4 Coefficients of Learned Rule Models
Table 4.5 Coefficients of Learned Violation Models
Table 4.6 Metrics for Our Rule Model and Base Rule Models
Table 4.7 Metrics for Our Violation Model and Base Violation Models
Table 5.1 Simple Rules
Table 5.2 Distribution of Mined Patterns
Table 5.3 Characterization of Checkers for System X
Table 5.4 Characterization of Checkers for System Y
Table 5.5 Characterization of Warnings in System X
Table 5.6 Characterization of Warnings in System Y
Table 6.1 Project Versions and Sizes
Table 6.2 Bug Fixes
Table 6.3 Recall and Precision on BF*
Table 6.4 Precision on BF
Table 6.5 Efficiency of GADDI
Table 6.6 Text-Based Search
Table 7.1 Summary of Subject Programs

LIST OF FIGURES

Figure 3.1 Examples of Programming Rules
Figure 3.2 Outline of fix propagation approach
Figure 4.1 Precision-Recall curves for rule/violation model
Figure 4.2 Learning Curves
Figure 5.1 Framework
Figure 5.2 Example of Klocwork Path Checker
Figure 5.3 Automatic P2C Converter
Figure 5.4 Example of XML RuleSpec
Figure 6.1 Framework of Bug Fix Propagation
Figure 6.2 Bug and fix patterns of Example 2
Figure 7.1 Illustration of operational distribution
Figure 7.2 Framework of CARIAL, consisting of cost estimation, risk estimation and test case selection
Figure 7.3 Screenshot of the JavaPDG program
Figure 7.4 Example cluster tree
Figure 7.5 RQ-1: The Cost-Against-ERE Curves
Figure 7.6 RQ-2: The Cost-Against-Risk-Reduction Curves for JavaPDG with different B and λ
Figure 7.7 RQ-2: The Cost-Against-Risk-Reduction Curves for ROME with different B and λ
Figure 7.8 RQ-2: The Cost-Against-Risk-Reduction Curves for Xerces2 with different B and λ

ACKNOWLEDGEMENTS

First of all, I would like to show sincere appreciation to my advisor, Prof. Podgurski, for his support, help and guidance during my graduate studies. He has served as a role model for me by being enthusiastic about his research, always exploring new techniques and ideas, and working hard to keep up with new developments in research. Many ideas in this dissertation were inspired by him during our weekly meetings, and he provided a great deal of help and instruction in the process of pursuing, implementing and validating those ideas. Most importantly, I want to thank Prof. Podgurski for always being supportive of my research work and having faith in me, turning me from a graduate student with zero research experience into a junior software engineering researcher.

I would also like to thank my committee members, Prof. Gultekin Ozsoyoglu, Prof. Soumya Ray, and Prof. Cenk Cavusoglu, for their guidance and help with my research work. The courses I took from them helped me to build solid foundations for my research. I especially thank Prof. Ray for his guidance on the machine learning aspects of my research, which made my dissertation more complete and in-depth. I would also like to thank Brian Robinson, our industrial partner in the cooperative project with the ABB Corporate Research Center, for his help and support during my internship in his group and during our collaboration.

A very special thank you to my current and former colleagues in the EECS department, Ray-young Chang, Vinay Augustine, Gang Shu, and Zhuofu Bai. Dr. Ray-young Chang inspired me with the ideas in Chapter Six; Dr. Vinay Augustine helped me to collect the dataset for the research work in Chapter Seven; and Gang Shu helped me with experiments, data collection and writing for Chapters Six and Seven.

Finally, I would like to thank my parents, Prof. Hexu Sun and Prof. Yan Lin, for their love and support through the years, and for always having faith in me. I also want to thank my husband, Dr. Yan Liu, for always being there for me. Lastly, I would like to thank my three-month-old girl, Zilin S. Liu, for being such a sweet baby during the writing of this dissertation.

Precision Improvement and Cost Reduction for Defect Mining and Testing

ABSTRACT

by

BOYA SUN

A significant body of recent research applies static program analysis techniques in combination with data mining techniques to discover latent software bugs. These techniques complement traditional debugging approaches by detecting a larger variety of defects. Although these approaches have successfully revealed many interesting bugs in real software applications, they often generate false positive results, and therefore programmers' time is wasted on evaluating invalid outputs.

In this dissertation, our goal is to reduce the cost associated with manual examination of the results.

This is done by (1) applying machine learning techniques to improve the precision of generated warnings, so that programmers do not waste time reviewing false positive warnings, and (2) applying improved defect discovery techniques in place of heuristic ones to increase precision as well as decrease execution time. We have investigated two kinds of data mining approaches: one is based on the idea of finding bugs as deviations from programming rules; the other discovers overlooked instances of bugs fixed elsewhere in the code. We have proposed innovative methods for precision improvement and cost reduction for each of these two kinds of defect mining approaches, which have successfully improved their efficiency and effectiveness.

A similar problem of balancing cost and benefit exists for operational software testing. Before releasing software, operational testing should be performed to discover bugs that would occur in the field and to improve software reliability. To maximize software reliability, it is desirable to run as many test cases as possible. However, programmers have only limited time for testing. We have proposed a machine learning framework to allow programmers to balance the costs and benefits of operational testing.

The framework helps programmers select test cases intelligently, so that the chosen test cases are cheap to analyze but reveal high-impact bugs.


CHAPTER ONE. INTRODUCTION

1.1 PRECISION IMPROVEMENT AND COST REDUCTION FOR DEFECT MINING

1.1.1 Overview of defect mining approaches

Defect mining approaches combine static program analysis with data mining techniques to discover latent software bugs. These techniques complement traditional debugging approaches by detecting a larger variety of bugs. They are based on the idea that clues about bugs are often distributed throughout the code base, and that investigating different kinds of clues can lead to the discovery of potential software defects.

In this work, we investigated defect mining approaches that are based on two different kinds of clues: frequent code patterns and bug fixes. Approaches that utilize frequent code patterns as clues are based on the idea that "bugs are violations of programming rules" [4][11][12][13][14][59][60][62][87][97][98][99][100]. The first step of these approaches is frequent pattern mining, in which frequent code patterns are mined from the code base as potential programming rules. The second step is violation detection, in which pattern matching algorithms or other heuristic algorithms are used to discover violations of the programming rules. We have recently developed another defect mining approach, which is based on bug fix clues [90]. When a bug is fixed in one place, our approach calls for the programmer to specify, with automated support, a programming rule that the fix helps enforce. Once a rule is specified, our approach employs a heuristic graph-matching algorithm to search for violations of the rule elsewhere in the code base, in order to discover additional instances of the bug. Violations are then displayed so that the programmer can fix them.

The above defect mining approaches have discovered many bugs that were not found by other software debugging approaches. However, they incur both short-term and long-term costs, which are discussed in the next section.

1.1.2 Costs of defect mining

It is common for defect mining approaches to generate false positive results. Therefore, all these techniques require developers to confirm, reject, or possibly edit reported rules and warnings, which is a time-consuming and error-prone process, and programmers waste much time reviewing false positive results. Chang et al's work [11][12][13][14] has demonstrated that the application of graph mining techniques, such as frequent subgraph mining and subgraph matching, to a system dependence graph (SDG) [24][33][40] representation of a project's code can improve the precision of rule and violation discovery, since SDG subgraphs reflect the semantics of many kinds of rules and violations better than textual or syntactic representations and since they are insensitive to statement reorderings that respect dependences. However, even SDG-based rule and defect discovery suffers from significant numbers of false positive reports of rules and violations. Since developer effort is a costly resource, it is desirable to further improve the precision and recall of rule and defect discovery, so that programmers spend time only on useful rules and warnings.

Moreover, many defect mining approaches use heuristic algorithms for violation detection and warning generation [4][11][12][59][60][62][98]. The problems with heuristic defect discovery algorithms are that (1) they can be inefficient if they use complex heuristics, so there is extra cost in running them, and (2) they are mostly incomplete, i.e., they are not guaranteed to find all latent violations. As a result, latent bug instances remain in the code base, and programmers incur extra cost maintaining the software.

In general, current defect mining suffers from two types of costs: short-term cost, in the form of programmers' time wasted on reviewing false-positive results, and long-term cost, in the form of extra software maintenance effort caused by latent violations that are left undiscovered by heuristic violation detection algorithms.

1.1.3 Proposed approaches

To reduce the above-mentioned costs incurred by defect mining, we need to (1) improve the precision of the output of defect mining approaches, to reduce the short-term cost of evaluating false-positive results, and (2) improve the violation detection algorithms to discover more underlying defects, to reduce the long-term cost of software maintenance. We have proposed the following three approaches to reduce these short-term and long-term costs:

• We applied a supervised-learning approach to post-process the results output by defect mining approaches. The results are classified into false positives and true positives, and false positives are removed before the results are shown to programmers.

• We applied a fast and complete subgraph matching algorithm to the problem of propagating bug fixes. The new algorithm replaces the previous heuristic violation detection algorithm and is guaranteed to discover all bugs that match a bug pattern.

• We applied an industrial alternative for violation detection that is both efficient and accurate, and that also provides industrial users with a way of applying defect mining research prototypes in their working environment.

1.2 PRECISION IMPROVEMENT AND COST REDUCTION FOR OPERATIONAL SOFTWARE TESTING

Software testing and debugging are time-consuming and costly. This is especially the case for software that produces complex output, since, without an automated oracle, test output examination is done manually and adds significantly to the overall cost of the debugging process. Further, the selection of test cases to examine is not generally guided by estimates of the expected reduction in risk if a test case is analyzed.

We consider the problem of maximizing expected risk reduction while minimizing the cost of examining and labeling software output when selecting test cases to examine. To do this, we use a

budgeted active learning strategy. Our approach queries the developer for a limited number of low-cost annotations to program outputs from a test suite. We use these annotations to construct a mapping between test outputs and possible defects and so empirically estimate the reduction in risk if a defect is debugged. Next, we select test cases from this mapping. These are chosen so that they are inexpensive to examine and also maximize risk reduction. We evaluate our approach on three subject programs and show that, with only a few low-cost annotations, our approach (i) produces a reasonable estimate of risk reduction that can be used to guide test case selection, and (ii) improves reliability significantly for all subject programs with low developer effort.

1.3 CONTRIBUTIONS

The following contributions are made in this dissertation:

• Improved the precision of defect mining approaches based on frequent code patterns by using supervised learning models to classify and rank the output of defect mining tools.

• Proposed an approach to propagating bug fixes by using a fast, precise subgraph matching algorithm, in order to improve the precision, efficiency and generality of our previous approach.

• Proposed a semi-automated approach to integrating a defect mining prototype with commercial static program analysis tools to improve the precision of defect discovery.

• Designed a cost-aware system, CARIAL (Cost-Aware Reliability Improvement with Active Learning), to balance the costs and benefits of operational software testing.

CHAPTER TWO. RELATED WORK

2.1 BUG DETECTION BY MINING FREQUENT CODE PATTERNS

Engler et al were among the first to propose the idea of treating bugs as violations of programming rules [23]. They employed "checkers" to find bugs by matching rule templates created by programmers. Li and Zhou proposed an approach to finding programming rules and rule violations that is based on frequent itemset mining [59]. Based on the idea that variables that appear frequently together are probably semantically correlated, Lu et al employed frequent itemset mining in a tool called MUVI to detect bugs involving inconsistent updates of correlated global variables or structure fields [62]. Ramanathan et al presented a tool called CHRONICLER that employs interprocedural path-sensitive static analysis to automatically infer function precedence protocols [79]. They also proposed another approach that employs control and data flow analysis and analysis of program predicates to discover both function precedence protocols and pre-conditions for function calls [80].

Acharya et al presented an interprocedural approach to discovering client-side API usage patterns [4] by mining execution traces generated by an adapted model checker. Shoham et al presented an approach to mining client-side API usage rules with static interprocedural analysis [87], in which event sequences involving objects of a particular type are abstracted and modeled in the form of automata. Wasylkowski et al developed a tool called JADET, based on intraprocedural analysis, to detect ordering patterns among method calls of objects by mining finite state automata [100].

Thummalapenta et al developed a tool called PARSEWeb [97], which extracts method invocation sequences from code samples retrieved with the Google Code Search engine, to assist programmers in using existing frameworks or libraries. This approach extracts programming patterns from multiple applications. In later work, they developed a tool called Alattin [99], which is used to find alternative rules. They also proposed mining conditional association rules to discover exception-handling rules [98].

In this work, we mainly investigated the defect mining approaches proposed by Chang et al [11][12][13][14].

Chang presented an approach that employs frequent subgraph mining techniques to mine programming rules that are represented by frequent graph minors of system dependence graphs [11].

In subsequent work, they improved this approach by employing enhanced procedure dependence graphs and algorithms that better exploit their structure [12]. In a recent work [13], they further extended this approach to perform interprocedural analysis on rule mining and violation detection.

This work mines rules that may contain rule instances that cross function boundaries, and it uses multiple heuristics based on interprocedural analysis to reduce false positive violations.

2.2 BUG DETECTION BY EMPLOYING REVISION HISTORIES

Williams and Hollingsworth investigated an approach to finding bugs that involves first doing a manual inspection of a project's bug database and CVS commits to find the most frequently appearing bugs [105][106]. Since their study indicated that the most common type of bug involved failure to check the return value of a function, a "function return value checker" was implemented. Kim et al [54] implemented a bug finding tool, BugMem, to find application-specific bugs using "bug fix memories", i.e., a project-specific bug-and-fix knowledge base. This knowledge base is used to distinguish between good and bad source code. Livshits et al [60] focused on bugs that can be corrected with a one-line code change, such as failing to use the free() method to deallocate a data structure. They proposed analyzing software revision histories to find highly correlated method pairs and then using runtime analysis to look for pattern violations. Dallmeier et al [19] implemented a bug fix benchmark, iBugs, for bug-localization tools. They extract bugs by first identifying bug fixes, and then running test suites on the version of the program that contains the bug. Kim and Notkin [47] implemented LSDiff to infer systematic changes as logic rules, which are used to detect changes that violate the rules. In contrast to their approach, our approach in Chapter Six tries to discover potential bug instances in the code base, not in newly committed changes. Nguyen et al [68] present evidence that recurring bug fixes may occur in code peers, which are classes or methods with similar functionality, and they implemented a tool, FixWizard, to recommend fixes in code peers.

2.3 CLASSIFYING AND RANKING STATIC WARNINGS

Kremenek et al [52] propose a probabilistic model, Feedback-Rank, to rank warnings adaptively. The intuition behind their work is that warnings are highly correlated with code locality. Also, when a warning is examined by programmers, the ranking is dynamically updated accordingly. Our model in Chapter Four is static while theirs is dynamic, but we consider more fine-grained properties of warnings based on the SDG, which we think is more suitable for estimating the correlation between violations than code locality is. It might be worthwhile to combine the two techniques, so that correlations are estimated more accurately and the model can be adapted to newly labeled warnings. Kim et al [46] rank warnings using software change history, based on the idea that warnings that are removed quickly by programmers are important and should be ranked higher. The resulting precision is only 17% to 67%. Our approach in Chapter Four exhibits higher precision, presumably because it is based on program dependences. Ruthruff et al [85] propose a logistic regression model to classify both accurate and actionable static warnings that are produced by the FindBugs tool [41]. Heckman et al [39] propose a model building process to identify actionable static warnings. They observe that the selected features and models could be different for different projects, which is also reflected in our models in Chapter Four. Although both approaches demonstrate good performance, they could not be used directly for the violations in our work, since some of the features are not available, and code metrics and code churn are overly general in our setting and do not consider the relationship between a violation and the rule it violates.

2.4 APPLICATION AND AUGMENTATION OF STATIC ANALYSIS TOOLS

Both open source and commercial static analysis tools have been applied for early bug detection, and researchers have reported their experiences in various studies. Guo et al [36] applied the commercial Coverity tool to the Linux kernel and indicated that such a static analysis tool can help


programmers by focusing on source code with high bug density. Krishnan et al [53] applied the

Klocwork Inforce tool to enforce secure coding standards in Motorola's products. Nagappan et al [65] reported experience in applying two static analysis tools, PREfix and PREfast, at Microsoft to predict pre-release defect density. Zheng et al [111] applied a set of static analysis tools, including

Klocwork, FlexLint and Reasoning, to an industrial software system developed at Nortel Networks, and showed that the tools provide affordable means for software fault detection.

Some other research work addresses augmentation of static analysis tools. Csallner et al [15] combined static analysis with a concrete test-case generation tool to eliminate spurious warnings.

Nanda et al [66] described a tool, Khasiana, which was developed at IBM to combine the functionality of three static analysis tools: FindBugs, SAFE and XYLEM. Phang et al [76] created a tool called Path Projection on top of the static analysis tool Locksmith in order to visualize and navigate program paths. Ruthruff et al [85] propose a logistic regression model to classify both accurate and actionable static warnings produced by the FindBugs tool. All of the above-mentioned work augments static analysis tools by enhancing the accuracy of the warnings produced or by making the results more understandable. However, none of this work addresses extending the tools to find project-specific defects, as is done in our work in Chapter Five.

2.5 CONSIDERING COST IN SOFTWARE TESTING AND RELIABILITY

Testing cost has been considered in much research on reliability estimation, optimization and test resource allocation. Pham et al [73] proposed a cost model for software reliability growth models, taking the cost of testing, the cost of removing faults, and the risk cost due to faults into consideration, and proposed a linear model using these three cost factors. This work requires a large amount of historical data to infer the coefficients of the linear model, which could be considered for the cost estimation in our model if historical data were available to us. Gokhale et al [31] consider the problem of maximizing reliability with a given amount of testing effort, and they proposed an evolutionary algorithm, combined with architectural analysis of the software, to solve the problem. In contrast, Huang et al [42] assume that the reliability objective is given, and aim at achieving an optimal allocation of testing effort to software modules. The costs used in the above two approaches are based on software modules, and a number of factors from the software development and testing processes are considered in their cost models. In contrast, the costs in our approach in Chapter Seven are based on individual test cases. Brown et al's work [8] aims at balancing the cost of testing and the cost of defects by determining the optimal number of software test cases. Unlike in our work in Chapter Seven, the cost of testing is assumed to be linear in the number of test cases, i.e., all test cases have a uniform cost, and the cost of a defect is directly associated with its frequency, so severity is not taken into consideration as it is in our work.

Cost is also considered in regression testing. Many approaches, such as test case prioritization, test case selection and test case reduction, have been proposed to reduce the number of test cases that must be run in regression testing. However, these techniques normally come with an analysis cost in addition to the testing cost. Some research [58][64][84] aims to predict the cost-effectiveness of various regression testing techniques, and a number of factors, such as test execution cost, result analysis cost, test selection cost and code coverage information, are considered in these works. In these works, cost is assumed to be uniform among test cases, which is a different assumption from the one made in our work in Chapter Seven.

2.6 TEST CASE CLUSTERING AND CLASSIFICATION

Test case clustering and classification have been used to obtain a partitioning of test cases. Such a partitioning can be used to estimate reliability, identify failures or reduce the size of a test set.

Podgurski et al [74] proposed to use stratified sampling to estimate reliability based on cluster analysis.

Test cases were first automatically clustered into different groups; stratified sampling was then applied to obtain a sample based on the clusters. Dickinson et al [21] applied cluster analysis to discover failures among test cases. Their work made the assumption that failures are likely to be isolated in small clusters, so that if programmers devote their effort to small clusters, they are likely to discover more failures. Bowring et al [7] proposed to use active learning approaches to classify program behavior as "pass" or "fail". None of the above works differentiates failures according to the defects that trigger them, which is different from the goal of our work in Chapter Seven: we need to map test cases to their underlying defects in order to estimate the relative benefit a test would contribute to reliability improvement.


Other work tries to cluster failed test cases according to the defects they trigger. Podgurski et al proposed automated support for classifying failure reports caused by similar defects, in order to reduce the number of failure groups programmers need to review [75]. The failure reports were provided by users, and it was assumed that these reports could be replayed to collect execution profiles. Two techniques, pattern classification and multivariate visualization, are applied to identify failure groups. Users are required to refine the failure groups manually to get a more precise result. Later work by Francis et al [26] proposed tree-based approaches to refine the failure groups. A tree is built using either hierarchical clustering or a decision tree, and one can then examine the test cases to decide whether to split or merge clusters. The above two approaches classify only failures, while in our work in Chapter Seven we classify a mixture of failed and passed test cases, and the status of the test cases is unknown without manual examination of the test outputs. This is a much harder problem, especially because the proportion of failures is very small. Another difference is that the above work first automatically clusters test cases and then refines the clusters with manual evaluation, while we directly make use of the manual effort in the active learning algorithm to get a precise clustering. There are other differences between our work in Chapter Seven and the above clustering work: (1) when test case evaluation is required, the cost of evaluation is not taken into consideration by the above work; (2) we have an additional goal in clustering test cases: evaluating the risk reduction incurred by removing a particular possible defect.

CHAPTER THREE. BACKGROUND

3.1 PROGRAM DEPENDENCE GRAPH AND SYSTEM DEPENDENCE GRAPH

Dependence graphs represent the essential ordering constraints between program statements and permit instances of programming rules to be recognized in different contexts, despite semantics-preserving reordering of their elements and interleaving with unrelated elements. A program dependence graph (PDG) [24] is a labeled directed graph that models dependences between the statements of a program. Vertices represent simple program statements, such as expressions, predicates, calls and actual in/out parameters. Vertices have attributes such as their abstract syntax tree and their source code location. Two types of dependences are typically represented: a statement s1 is data dependent on a statement s2 if there is a variable x and a control flow path P from s2 to s1 such that x is defined (given a value) at s2, used at s1, and not redefined along P; s1 is control dependent on s2 if s2 is a control point that directly controls whether or not s1 is executed. A procedure dependence graph (pDG) is a dependence graph for a single procedure. We obtain an extended pDG (EpDG) [12] by adding directed edges called shared data dependence edges (SDDEs) between pairs of procedure elements that use the same variable definition and are connected by a control flow path. A system dependence graph (SDG) [33][40] connects the EpDGs of a software system by interprocedural control and data dependence edges, and thus represents the entire dependence structure of the system.
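For illustration, the following is a minimal Python sketch of one way such a dependence graph could be represented in memory; the class and attribute names are illustrative assumptions and do not correspond to the actual data structures used by the tools discussed in this dissertation.

    from dataclasses import dataclass, field

    # Edge kinds distinguished in an (enhanced) dependence graph.
    DATA, CONTROL, SDDE = "data", "control", "sdde"

    @dataclass
    class Vertex:
        """A program element: expression, predicate, call, or actual in/out parameter."""
        vid: int
        kind: str        # e.g., "callsite", "actual-in", "actual-out", "control-point"
        ast: str         # serialized abstract syntax tree of the statement
        location: str    # source location, e.g., "file_io.c:10"

    @dataclass
    class DependenceGraph:
        """A pDG/EpDG; an SDG would link several of these with interprocedural edges."""
        vertices: dict = field(default_factory=dict)   # vid -> Vertex
        edges: set = field(default_factory=set)        # (src_vid, dst_vid, edge_kind)

        def add_vertex(self, v: Vertex) -> None:
            self.vertices[v.vid] = v

        def add_edge(self, src: int, dst: int, kind: str) -> None:
            assert kind in (DATA, CONTROL, SDDE)
            self.edges.add((src, dst, kind))

    # Toy usage: "rv = apr_file_open(...)" followed by a check of rv.
    g = DependenceGraph()
    g.add_vertex(Vertex(1, "callsite", "call apr_file_open", "file_io.c:10"))
    g.add_vertex(Vertex(2, "actual-out", "rv", "file_io.c:10"))
    g.add_vertex(Vertex(3, "control-point", "rv != APR_SUCCESS", "file_io.c:11"))
    g.add_edge(1, 2, CONTROL)   # the call site is connected to its actual-out parameter
    g.add_edge(2, 3, DATA)      # the returned value is used in the check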


Precondition Rule (Apache):
    if (file->thlock)
        rv = apr_thread_mutex_destroy(file->thlock);

Postcondition Rule (Apache):
    rv = apr_file_open(…);
    if (rv != APR_SUCCESS)
        return rv;

Call Sequence Rule (Python):
    PyArena *arena = PyArena_New();
    …
    PyArena_Free(arena);

Figure 3.1 Examples of Programming Rules


3.2 DEPENDENCE GRAPH BASED BUG MINING

Chang et al [11][12][13][14] applied frequent subgraph mining (FSM) to SDGs to discover neglected conditions and other software defects. Their approach involves first generating a collection of graphs from dependence spheres, which are dependence subgraphs of limited radius expanded from a central function call; then an FSM algorithm is used to discover rules corresponding to frequent subgraphs of these spheres. An approximate graph matching algorithm is used to find violations, which are small deviations from a rule subgraph.
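As a rough illustration of this two-step idea (mine frequent patterns, then flag near misses), the sketch below simplifies each dependence sphere to a set of labeled edges and uses set containment in place of true subgraph isomorphism; the simplification and all names are assumptions made for illustration only, not the algorithms of Chang et al.

    # Each "sphere" is simplified to a frozenset of labeled edges (src_label, dst_label, kind).
    def support(candidate: frozenset, spheres: list) -> int:
        """Number of spheres that contain every edge of the candidate pattern."""
        return sum(1 for s in spheres if candidate <= s)

    def near_misses(candidate: frozenset, spheres: list, max_missing: int = 1) -> list:
        """Spheres that match the pattern except for a few missing edges (potential violations)."""
        flagged = []
        for s in spheres:
            missing = candidate - s
            if 0 < len(missing) <= max_missing:
                flagged.append((s, missing))
        return flagged

    # Toy data: a "rule" saying the return value of open() must flow into a check.
    rule = frozenset({("call open", "rv", "control"), ("rv", "rv != NULL", "data")})
    spheres = [
        frozenset({("call open", "rv", "control"), ("rv", "rv != NULL", "data")}),   # follows rule
        frozenset({("call open", "rv", "control")}),                                 # check missing
    ]
    print(support(rule, spheres))           # 1
    print(len(near_misses(rule, spheres)))  # 1 potential violation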

Figure 3.1 shows examples of function precondition, postcondition and call sequence rules along with their underlying graph structures, which were mined from the Apache [5] or Python [77] code bases.

Precondition and postcondition rules indicate that the input parameters or the returned value, respectively, of a function should be checked; a call sequence rule indicates that a set of functions should be called in a certain order. The underlined statements shown in Figure 3.1 correspond to the dependence graph nodes with dark borders. If those nodes are missing, the graphs are flagged as violations. Although Chang et al's work focuses on detecting neglected conditions and function-call sequence rules and violations, their approach is capable of discovering any rule or violation that can be represented as an SDG subgraph. In this dissertation, we investigate this approach and try to improve the precision of the results it produces.


Figure 3.2 Outline of fix propagation approach:
Step 1: A bug is fixed (for example, a check "if (ctx == NULL) ..." is added after "ctx = SSL_CTX_new();").
Step 2: Extract a rule from the bug fix, represented as a dependence subgraph.
Step 3: Find violations of the rule to locate additional bug instances (for example, a second call to SSL_CTX_new() whose result is never checked).
Step 4: Make corrections.

3.3 DEPENDENCE GRAPH BASED BUG FIX PROPAGATION

Sun et al [90] proposed an approach to help programmers propagate many bug fixes completely, which is based on the idea of treating bugs as violations of programming rules. When a bug is fixed in one place, our approach calls for the programmer to specify, with automated support, a programming rule that the fix helps enforce. Our approach attempts to automatically map the fix to one of three basic kinds of function usage rules: precondition rules, postcondition rules, and call-pair rules, as shown in Figure 3.1. We call these three kinds of rules default rules. If possible, a default rule is extracted automatically from the code change; however, the programmer may edit an extracted rule or create one from scratch, if necessary. Once a rule is specified, our approach employs a heuristic graph-matching algorithm to search for violations of the rule elsewhere in the code base.


Violations are then displayed so that the programmer can fix them. The framework of this approach is shown in Figure 3.2. Note that with our approach, unlike some approaches to mining programming rules [4][11][12][23][59][60][62][98], instances of extracted rules need not be frequent in the code base.
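The rule-extraction step can be pictured roughly as follows: given the dependence edges of the fixed code and the edges introduced by the fix, the bug pattern to search for is the fixed-code pattern with the added elements removed. The edge-set representation and the function names below are illustrative assumptions, not the actual implementation of this approach.

    def extract_patterns(fixed_edges: frozenset, added_by_fix: frozenset):
        """Split the fixed code's dependence edges into a 'fix pattern' and a 'bug pattern'.

        fix_pattern: what correct code should look like (context plus the added check).
        bug_pattern: the context alone; places matching it but lacking the added
                     elements are candidate unfixed bug instances.
        """
        fix_pattern = fixed_edges
        bug_pattern = fixed_edges - added_by_fix
        return fix_pattern, bug_pattern

    def candidate_instances(bug_pattern, fix_pattern, spheres):
        """Code regions that contain the buggy context but not the full fix."""
        return [s for s in spheres if bug_pattern <= s and not (fix_pattern <= s)]

    # Toy example mirroring Figure 3.2: a NULL check is added after SSL_CTX_new().
    fixed = frozenset({("call SSL_CTX_new", "ctx", "control"),
                       ("ctx", "ctx == NULL", "data")})
    added = frozenset({("ctx", "ctx == NULL", "data")})
    fix_pat, bug_pat = extract_patterns(fixed, added)

    spheres = [frozenset({("call SSL_CTX_new", "ctx", "control")}),   # second, unfixed instance
               fixed]                                                  # the fixed instance
    print(len(candidate_instances(bug_pat, fix_pat, spheres)))         # 1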

Although this work has successfully propagated many bug fixes, it suffers from the following problems: (1) it has limited generality, since it addresses only bug fixes that can be specified with one of the three rule templates; (2) the approach is strictly intraprocedural, so it cannot find bugs that cross function boundaries, and it may generate false alarms when the "missing part" of a reported bug can actually be found in callers or callees; (3) the graph-matching algorithm used is relatively slow and heuristic, so it is not guaranteed to find all violations. In this dissertation, we further improve this approach to address these issues.

3.4 COST SENSITIVE ACTIVE LEARNING

Active learning and cost-sensitive analysis are used in the CARIAL framework for balancing costs and benefits in operational testing. We give a brief introduction to these two machine learning techniques below.

3.4.1 Active learning

For some problems, such as speech recognition and information retrieval, it is very expensive and difficult to obtain class labels [86]. Active learning aims at using as few labeled examples as possible to achieve high classification accuracy. Unlike passive learning, in which labeled examples are provided all at once, active learning iteratively selects a batch of unlabeled examples and asks an oracle for their labels. The batch is chosen by the active learner because its labels are expected to help improve classification accuracy.

In our work, class labels of test cases are assigned by post-processing feedback provided by programmers, which is quite expensive. Therefore, we use active learning to make as few queries as possible.

Active learning is normally used for classification rather than clustering. In this work, we use active learning to discover test cases that are likely to trigger different underlying defects, and to obtain an overall mapping between test cases and possible defects. We adopted an active learning algorithm called hierarchical sampling for active learning (HSAL) [20], which is suitable for this purpose, and we revised it to incorporate cost-sensitive analysis.
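To make the pool-based query loop concrete, here is a small self-contained Python sketch; the nearest-centroid learner and the margin-based uncertainty criterion are simplifications chosen for illustration and are not the HSAL algorithm adopted in this work.

    import random

    def train_centroids(points, labels):
        """Fit a nearest-centroid classifier: mean feature vector per class."""
        sums, counts = {}, {}
        for x, y in zip(points, labels):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] = counts.get(y, 0) + 1
        return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def margin(x, centroids):
        """Small gap between the two closest centroids means high uncertainty."""
        dists = sorted(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids.values())
        return dists[1] - dists[0] if len(dists) > 1 else float("inf")

    def active_learn(pool, oracle, seed_ids, rounds=5, batch=2):
        """Iteratively query the oracle for the labels the learner is least sure about."""
        labeled = {i: oracle(i) for i in seed_ids}
        for _ in range(rounds):
            centroids = train_centroids([pool[i] for i in labeled], list(labeled.values()))
            unlabeled = [i for i in range(len(pool)) if i not in labeled]
            if not unlabeled:
                break
            unlabeled.sort(key=lambda i: margin(pool[i], centroids))   # most uncertain first
            for i in unlabeled[:batch]:
                labeled[i] = oracle(i)                                 # ask the oracle (programmer)
        return labeled

    # Toy usage: two well-separated clusters; the oracle stands in for the programmer.
    random.seed(0)
    pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)] + \
           [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(20)]
    oracle = lambda i: 0 if i < 20 else 1
    print(len(active_learn(pool, oracle, seed_ids=[0, 20])))   # number of labels obtained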

3.4.2 Cost-sensitive active learning

Labeling comes with a cost. In cost-insensitive learning, costs are assumed to be uniform across all examples, which is not true in reality. Cost-sensitive active learning [38] takes labeling cost into consideration, and it aims at achieving high classification accuracy with as little labeling cost as possible.


Costs spent on reviewing test case output are not trivial for software that produces complex outputs.

Therefore, we used cost-sensitive active learning to further reduce the amount of effort needed from programmers in the clustering algorithm. We have proposed a simple cost estimation scheme to estimate the cost of obtaining the class labels of test cases, and we applied cost-sensitive active learning based on these estimates to further reduce the effort required of programmers.
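Under the cost-sensitive variant, only the selection step of the query loop changes: candidates are scored by expected benefit per unit labeling cost, and querying stops when a labeling budget is exhausted. The greedy ratio rule below is an illustrative assumption, not the actual CostHSAL selection criterion described in Chapter Seven.

    def select_queries(candidates, uncertainty, cost, budget):
        """Greedily pick examples with the best uncertainty-to-cost ratio within a budget.

        candidates:  list of example ids
        uncertainty: id -> estimated benefit of knowing the label (higher is better)
        cost:        id -> estimated cost of labeling (e.g., time to inspect the output)
        budget:      total labeling cost the programmer is willing to spend
        """
        ranked = sorted(candidates, key=lambda i: uncertainty[i] / cost[i], reverse=True)
        chosen, spent = [], 0.0
        for i in ranked:
            if spent + cost[i] <= budget:
                chosen.append(i)
                spent += cost[i]
        return chosen

    # Toy usage: example 2 is very informative but also very expensive to review.
    ids = [0, 1, 2]
    u = {0: 0.9, 1: 0.8, 2: 1.0}
    c = {0: 1.0, 1: 1.0, 2: 10.0}
    print(select_queries(ids, u, c, budget=3.0))   # [0, 1]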


CHAPTER FOUR. IMPROVING PRECISION OF DEPENDENCE GRAPH BASED DEFECT MINING: A MACHINE LEARNING APPROACH

4.1 INTRODUCTION

A significant body of recent research applies program analysis techniques in combination with data mining techniques such as frequent itemset mining, frequent sequence mining and frequent subgraph mining in order to discover implicit programming rules and rule violations in a software code base

(e.g., [4][11][12][59][60][62][98]). The discovered rules correspond to recurrent programming patterns, and the rule violations correspond to infrequent small deviations from these patterns, which may correspond to defects. These techniques have been shown to be able to discover many and varied software defects in actual software projects.

All such techniques require developers to confirm, reject, or possibly edit reported rules and violations, which is a time-consuming and error-prone process. Since developer effort is a costly resource, it is desirable to further improve the precision and recall of rule and defect discovery.

When developers confirm, reject, or edit reported rules or violations, they implicitly label them.

Hence it is natural to consider using supervised learning techniques to exploit these labels (though they may be noisy) in order to more precisely classify or rank candidate rules and violations in the future.


We present and empirically evaluate logistic regression models [102] for classifying and ranking candidate rule and violation subgraphs, respectively, which are based on features of dependence graphs that are semantically important yet are general (i.e., they are not tied to specific rule patterns).

Logistic regression is a robust and widely-used statistical technique that provides a number of advantages for our purposes. Fitted model coefficients and associated diagnostics, such as standard errors and significance levels, indicate the relative importance or unimportance of different features

(predictor variables). Logistic models output conditional probability estimates, which may be used directly for ranking or may be thresholded for classification purposes and to trade off precision against recall. Finally, robust and efficient open-source logistic regression software is available.

A number of other metrics and heuristics (some of them rule-specific) have been used for ranking candidate rules and violations discovered by non-SDG-based approaches [4][59][60][62][98]. We empirically compare our approach to them in the context of SDG-based defect discovery. The results indicate that our approach exhibits superior precision and recall. The contributions of this work [92] can be summarized as follows:

• We propose supervised learning models for classifying and ranking both rules and violations generated by dependence-based defect mining.

• We empirically compare our models with existing heuristic approaches and with a static-warning classifier, in terms of accuracy, precision, and recall.

• We compute learning curves for our approach to see how many labeled examples are needed to achieve a desired level of performance.

4.2 PREVIOUSLY PROPOSED CLASSIFICATION AND RANKING TECHNIQUES

In this section we describe some commonly used techniques for classifying and ranking candidate programming rules and rule violations.

Heuristic approaches for rules. For classifying and ranking candidate programming rules, pattern statistics are commonly used. They are heuristic in the sense that they do not involve learning from developer feedback about candidate rules; that is, they are purely descriptive statistics. Pattern statistics are used generally in data mining to measure pattern quality. Support and confidence are the two most widely used pattern statistics, and they have been used in defect mining work

[4][59][60][62][98] to rank and to prune rules. Interprocedural analysis is also used to rank programming rules. The MUVI tool [62] gives more weight to direct access to a variable than it does to indirect access, e.g., via function parameters. Lastly, rule verification is also used to prune false positive rules, and both static (e.g., Mine-Verify [4]) and dynamic (e.g., DynaMine [60]) rule verification algorithms have been proposed.

Heuristic approaches for violations. For classifying and ranking candidate rule violations, heuristic measures of rule strength, such as support and confidence [59][62][98], have been used.


Engler et al [23] proposed the z-statistic to measure rule strength. Intuitively, this statistic gives a mined rule a higher rank if the number of instances of the rule is large and the number of violations is small. Interprocedural analysis is also used as a heuristic in ranking or pruning violations. Since most violations are rules with missing elements [11][59][60][62], interprocedural analysis examines the callers and callees of a function where a possible violation was discovered. If the missing elements are found, the candidate violation is flagged as a false positive.

Problems with heuristic approaches. For measures like pattern statistics and rule strength there is always the problem of how to pick the threshold value used to decide which rules or violations are reported, so as to achieve an appropriate tradeoff between precision and recall. Interprocedural analysis is very useful in recognizing false positives; however, if most of the instances are intraprocedural, this heuristic alone is not adequate. Rule verification is a conceptually straightforward way to prune false positives, but it can be expensive, since it requires either executing the program on chosen test cases or employing a model checker. Moreover, verification will not work in all cases.

For example, the Mine-Verify algorithm [4] mentioned above randomly splits the dataset into two sets, and one set is verified against the other. The algorithm will work only if the subsets resulting from a split are well matched. Another issue is that heuristics that improve precision may sacrifice recall.

Chang [14] employed interprocedural analysis to divide reported violations into three groups: "likely bugs", "probable bugs", and "false positives". A significant improvement in precision was observed for "likely bugs"; however, half of the actual violations were not so labeled. Other defect mining

papers have generally not discussed the recall of rule/violation discovery.

We believe that a more general statistical model, which is fitted with a sample of multivariate predictor values and corresponding response values, is needed to improve the precision of rule and violation discovery without sacrificing recall. To our knowledge, no such model has been proposed for classifying or ranking candidate rules or violations discovered in defect mining. However, related models have been proposed for classifying warnings produced by static analysis tools such as

FindBugs [41]. We call these models static-warning classifiers and briefly discuss them below.

Static-warning classifiers [39][46][52][85] are related models that classify static warnings produced by static analysis tools such as FindBugs. The most commonly used features in these models are warning type, history, code metrics and code churn. Warning type represents a set of pre-defined bug pattern categories detected by the static analysis tool. History describes any previously generated warnings involving the code location referenced by the current warning. Code metrics are used to characterize properties of the source code referenced by a warning. Code metrics can be computed for just the warning or for the file, package, or project referenced by it. Code churn measures the amount of change to the relevant file, package or project.

Problems with static-warning classifiers. Such classifiers cannot be directly applied to candidate rule violations generated by defect mining, for several reasons. First, some features cannot always be obtained. Bugs discovered by defect mining do not necessarily belong to a predefined bug

category, since some defect mining algorithms (e.g., [59]) do not strongly constrain the types of bugs that may be discovered. The history of past warnings is not available if the defect mining technique is being applied to a project for the first time. Secondly, code metrics and code churn for files, packages, or projects may not be strong predictors since they characterize the context of the violations but not the violations themselves. For instance, file length is a commonly used code metric that is positively correlated with the number of bugs in a file. However, whether a file contains a bug at a particular line may be unrelated to the length of the file.

4.3 PROPOSED SOLUTION

In our approach, we exploit some useful features from the heuristic approaches described above and from static-warning classifiers, and we supplement them with new, general features characterizing dependence subgraphs. These features are used by our logistic regression models to classify candidate rules and violations.

4.3.1 Classifying and Ranking Rules

For classifying and ranking rules, we consider both pattern statistics and interprocedural analysis.

We do not consider rule verification, since it requires test executions or model checking, both of which are costly. Besides these two features, we added two kinds of new features:


• Structure metrics are similar to code metrics, but more fine-grained. They characterize the graphical structure of a rule rather than the code itself.

• Whereas in data mining the support for a rule is the proportion of its instances in the mined dataset, we define the structural support of a rule as a measure of the degree of homogeneity among the rule's instances. Intuitively, instances of a valid rule should be structurally very similar to one another.

Table 4.1 Features for Rule Model
    Pattern Statistics: Support
    Interprocedural Analysis: %Inter, HasInter
    Structure Metrics (Size): |V|, |E|
    Structure Metrics (Complexity): |V|/|E|, MaxOut, MaxIn, AveOut, AveIn
    Structure Metrics (Edge Distribution): %Data, %SDDE
    Structural Support: RIR

The features we used for the rule model are summarized in Table 4.1, with 4 categories as described

above and 13 features altogether. We explain them in detail below.

Pattern statistics. Since rules mined with our approach are frequent subgraphs, we use rule support as a pattern statistic. Confidence is not used in our model, since it is intended for use with association rules [37].

Interprocedural analysis. For each rule, there could be both intraprocedural rule instances, which are instances whose program elements all are located in the same function, and interprocedural instances, which are rule instances whose program elements cross function boundaries. It seems plausible that a candidate rule with mostly interprocedural rule instances is less likely to be valid than candidate rules with mostly intraprocedural instances, since programmers may be more likely to implement a rule within a function than across multiple functions, to make the rule code easier to understand and check.

We use two features for interprocedural analysis: percentage of interprocedural instances of the rule

(%Inter) and a binary indicator representing whether the rule has interprocedural instances (HasInter).

Structure metrics. We defined the following structure metrics on the SDG subgraphs corresponding to rules:

Size: We use the number of vertices |V| and the number of edges |E| to represent the size of the subgraph, where V is the set of vertices and E is the set of edges.


Complexity: Intuitively, if a candidate rule graph is too simple or too complex, it is unlikely to be valid, because it is too general or too specific. We defined several metrics to measure the complexity of the subgraph: the ratio |V|/|E|, the average and maximum in-degrees of vertices (AveIn, MaxIn), and the average and maximum out-degrees of vertices (AveOut, MaxOut).

Edge Distribution: There can be three kinds of edges in an (enhanced) SDG subgraph: data dependence edges, control dependence edges, and SDDEs. To characterize their relative frequency, we use the percentage of data dependences (%Data) and percentage of SDDEs (%SDDE) as predictors in our model. Note that the percentage of control dependences need not be specified since it is implied by the other two values.

Structural support. Each candidate rule is supported by a group of candidate rule instances. These instances share exactly the same dependence structure. However, we expand vertices in a rule instance by replacing them with their abstract syntax trees, to permit more fine-grained comparisons.

Instances of a valid rule should be very similar to one another even after vertices are expanded in this way.

We defined a metric called rule instance radius (RIR) to measure the similarity among instances of a candidate rule. It is defined to be the average Euclidean distance from each rule instance to the centroid of the set of instances. In order to compute RIR, we transform each rule instance into a binary vector as follows (a short sketch of this computation appears after the steps below):

1. Each vertex in the rule instance is expanded to its AST vector [28]. The AST vector is a bit vector, and each bit indicates whether a certain kind of AST vertex appears in the AST. AST vertices are operators, variable types, etc.

2. The AST vectors of the vertices are concatenated in order of the vertices’ indices.
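The sketch below illustrates the RIR computation just described: each instance becomes a concatenation of per-vertex AST bit vectors, and RIR is the average Euclidean distance of those vectors from their centroid. The small fixed vocabulary of AST node kinds is a simplified stand-in for the real AST vector encoding [28].

    import math

    AST_KINDS = ["call", "assign", "if", "return", "ptr-deref", "int", "struct"]  # toy vocabulary

    def ast_bits(ast_node_kinds):
        """Bit vector: which AST node kinds appear in one dependence-graph vertex."""
        return [1 if k in ast_node_kinds else 0 for k in AST_KINDS]

    def instance_vector(instance):
        """Concatenate the AST bit vectors of an instance's vertices, in vertex-index order."""
        vec = []
        for _, kinds in sorted(instance.items()):     # instance: vertex index -> set of AST kinds
            vec.extend(ast_bits(kinds))
        return vec

    def rule_instance_radius(instances):
        """Average Euclidean distance from each instance vector to the centroid."""
        vectors = [instance_vector(inst) for inst in instances]
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        dists = [math.sqrt(sum((v[i] - centroid[i]) ** 2 for i in range(dim))) for v in vectors]
        return sum(dists) / len(dists)

    # Toy usage: two nearly identical instances of a rule with three vertices each.
    inst_a = {1: {"call", "assign"}, 2: {"if"}, 3: {"return"}}
    inst_b = {1: {"call", "assign"}, 2: {"if", "int"}, 3: {"return"}}
    print(round(rule_instance_radius([inst_a, inst_b]), 3))   # small radius = homogeneous instances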

4.3.2 Classifying and Ranking Violations

For classifying and ranking violations, we consider both rule strength and interprocedural analysis from the heuristic approaches. In contrast to static-warning classifiers, we consider only metrics characterizing the violation itself, rather than ones characterizing its context. As described above, warning type and history information are not generally available, and code metrics and code churn for files, packages, and projects do not seem very relevant, so we do not include them in our model.

We do not directly apply code metrics to the violation itself; instead, we define delta metrics, which are obtained by computing differences between a candidate violation and the rule it violates. Intuitively, a candidate is more likely to be a true violation if it is only slightly different from the rule. Moreover, characteristics of the differences might also be predictive of whether it is a defect or not.

The features we used in the violation model are summarized in Table 4.2, with 3 categories and 22 features. We explain them below.


Table 4.2 Features for Violation Model
    Rule Strength: RuleFit, %Miss
    Interprocedural Analysis: AnyCallee, AllCallers
    Delta Metrics (Delta size): #ΔV, %ΔV, #ΔE, %ΔE
    Delta Metrics (Delta edge types): 14 delta edge types

Rule strength. This category of features represents how likely it is that the rule an example violates is valid. If a rule is unlikely to be valid, its “violations” are also unlikely to be actual bugs. We use the following two features to estimate rule strength:

Fitted value of the violated rule (RuleFit): This is obtained directly from the logistic regression model of the rules. The fitted value represents the probability that the rule is valid, so this is a natural representation of rule strength.

Percentage of mismatches (%Miss): Violations are subgraphs that are similar to but not exactly the same as the corresponding rule. Hence, they can be seen as mismatches of the rule. The instances of the rule, on the other hand, are exact matches of the rule. %Miss is computed as the fraction of mismatches among all mismatches and exact matches of the rule. We assume that in a mature program,

rules are followed most of the time, and violations happen only occasionally. Hence one would expect a valid rule to have a small percentage of mismatches.

Interprocedural analysis. This is performed to try to find the missing elements of the example by comparing the rule it violates to the code in the callers and callees of the function where the violation is reported. Here we borrow the heuristics used in PR-Miner [59] and develop two features:

Existence of the missing element in all of the callers (AllCallers): This feature considers all calling contexts of the current example. Its value is 1 when the missing element is found in all calling contexts; otherwise its value is 0.

Existence of the missing element in any of the callees (AnyCallee): If the missing element is found in any of the callees, the value of this feature is 1 and thus the violation is considered likely to be invalid, otherwise its value is 0.

Delta metrics. We use two groups of features to represent the size and the content of the missing element.

Delta size: We use the number and percentage of missing vertices (#ΔV, %ΔV) and number and percentage of missing edges (#ΔE, %ΔE) to indicate the size of the missing element.

Delta edge types: We consider different types of edges that could be missing in the enhanced SDG.

Since our rules are derived from Chang et al's work [12], the rule graph contains callsites (Cs), actual-in parameters (Ai), actual-out parameters (Ao) and control points (Cp), and three types of dependence edges: Data, Control, and SDDE. With these vertex and edge types, there are 14 possible edge types in the SDG, whose presence or absence in the missing element we indicate with binary features. The 14 types of edges are:

Data(AiAi), Data(AiAo), Data(AiCp), Data(AoAi), Data(AoAo), Data(AoCp), Control(CpAi), Control(CpAo), Control(CpCs), Control(CpCp), Control(CsAi), Control(CsAo), SDDE(AiAi), SDDE(CpAi).

4.4 EMPIRICAL STUDY

The goal for our empirical evaluation was to test the following hypotheses:

• HP-1: Our rule model will exhibit better accuracy, precision and recall compared to existing heuristic approaches.

• HP-2: Our violation model will exhibit better accuracy, precision and recall compared to existing heuristic approaches and to static-warning classification.

• HP-3: Both the rule model and the violation model will require only a moderate number of labeled examples in order to achieve desired performance.

4.4.1 Methodology

A Dataset


Table 4.3 Summary of Dataset

          LOC    Rules       Violations
Openssl   352K   206 (75%)   142 (46%)
Apache    317K   166 (54%)   26 (54%)
Snmp      334K   233 (66%)   63 (51%)
Python    479K   346 (62%)   497 (23%)

We applied Chang et al’s [12] rule mining and violation detection algorithms on four large open source projects: Openssl [70], Apache [5], Snmp [67] and Python [77]. The sizes of the projects, the numbers of rules and violations, and the percentages of positive (valid) examples of rules and violations are shown in Table 4.3. The results were reviewed and labeled by Chang with help of project documentation and source code history. We also reviewed the rules and violations to double-check the labels.

B Training and evaluating the rule and violation models.

We used the R statistical computing environment [78] to train and test classifiers. We used the glm function to train logistic regression models. For each dataset, we applied backward elimination [102], a greedy feature selection algorithm, to select a subset of predictive features. Thus, after backward elimination each project had a different rule model or violation model (See Section 4.4.2).
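As a rough illustration of this training step, the following Python sketch uses statsmodels in place of the R glm code actually used in the study; the 0.1 p-value cutoff, the feature names, and the synthetic data are assumptions made for the example.

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, features, threshold=0.1):
    selected = list(features)
    while selected:
        cols = [features.index(f) for f in selected]
        design = sm.add_constant(X[:, cols])
        model = sm.Logit(y, design).fit(disp=0)
        pvalues = np.asarray(model.pvalues)[1:]   # skip the intercept
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= threshold:
            return selected, model                # every remaining feature is significant
        del selected[worst]                       # drop the least significant feature
    return selected, None

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
print(backward_elimination(X, y, ["Support", "%Inter", "RIR"])[0])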


We evaluated the models with respect to their accuracy, precision, and recall. We also computed the F1 measure, which is the harmonic mean of precision and recall. Finally, we plotted precision-recall curves [63], which depict how precision varies with recall. Different precision and recall pairs were obtained by varying the threshold used to decide whether an instance is in the target class or not. For the logistic regression model, the threshold takes on every unique fitted value Pr(Y = 1 | x) to yield different precision/recall pairs. For each baseline model consisting of a simple descriptive statistic, the threshold takes on all unique values of the statistic.
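The following minimal sketch (our own illustration, not the original R code) shows how precision/recall pairs can be generated by sweeping the threshold over the unique fitted values.

def precision_recall_pairs(fitted, labels):
    pairs = []
    for threshold in sorted(set(fitted)):
        predicted = [p >= threshold for p in fitted]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        if tp + fp > 0:
            pairs.append((tp / (tp + fp), tp / (tp + fn)))  # (precision, recall)
    return pairs

# Example with four fitted probabilities and their true labels.
print(precision_recall_pairs([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 1]))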

C Baseline rule models.

We compared our approach with two baseline heuristic approaches:

Support model: Support is by far the most commonly used heuristic in defect mining research, and it is used as our first baseline rule model. When mining rules, we chose 0.8 as the minimum support for patterns, as in Chang's original work [12]. We chose 0.9 as the threshold, since it is halfway between 0.8 and 1.0. All patterns having support greater than 0.9 were considered to be valid and others were considered to be invalid. Thus, with no prior knowledge about the dataset, we assumed that half of the mined rules or violations were false positives.

z-statistic model: Engler [23] used the z-statistic to evaluate rule strength. The z-statistic is defined as follows:


Eq 4.1: z(n, e) = (e/n − p0) / √(p0(1 − p0)/n)

Here e is the number of exact matches of a rule, and n is the total number of exact matches and mismatches of a rule. The fraction e/n equals 1 − %Miss in our case. p0 is a constant representing the expected ratio of exact matches.

We set p0 to be 0.9 to compute the metrics, which is the threshold for the support model. We set the threshold of the z-statistic to be 0 to compute precision and recall.
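For concreteness, a minimal sketch of Eq 4.1 follows; the example counts are hypothetical. A rule that is followed 95 times out of 100 scores above the zero threshold, while one followed only 80 times out of 100 scores below it.

import math

def z_statistic(e, n, p0=0.9):
    # e exact matches out of n exact matches plus mismatches; p0 is the expected ratio.
    return (e / n - p0) / math.sqrt(p0 * (1 - p0) / n)

print(z_statistic(95, 100), z_statistic(80, 100))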

D Baseline violation models.

We compared our violation model with both heuristic approaches and with a static-warning classifier.

Support model and z-statistic model. Rule strength is the most commonly used heuristic approach for ranking and pruning violations. We again chose support and the z-statistic as two ways of estimating rule strength.

Screening-vio model. Ruthruff et al [85] proposed a logistic regression model to classify static warnings generated by FindBugs [41]. We used the ideas of their screening model for classifying false positives to obtain a baseline violation model. The following features are used in their model: FindBugs Pattern, FindBugs Priority, BugRank, Project Staleness, File churn: added, Code: warning depth, Code: indentation, Code: FileLength. Since not all features in their model are available for our violations, we constructed a counterpart to their model, which we call screening-vio, as follows:

• The values of FindBugs Pattern are pre-defined bug patterns. We used the types of violated rules as a counterpart. Since Chang et al's approach to defect mining focuses on rules involving function calls and predicates, there can be function precondition rules, function postcondition rules and call sequence rules, or hybrids of these rule types.

• FindBugs' Priority and BugRank use heuristics and warning history to rank warnings. We use RuleFit as a counterpart, since it estimates rule strength and thus indirectly estimates violation strength.

• Project Staleness measures the number of days since the last FindBugs warning reported in the project. We could not calculate this since it requires the mining tool to be applied on a regular basis.

• We retained the other features of the original screening model, which are general code metrics and code churn.

In the remainder of this section, Section 4.4.2 summarizes the trained rule and violation models after backward elimination; Sections 4.4.3, 4.4.4, and 4.4.5 each address one of the hypotheses listed at the beginning of this section.

4.4.2 Summary of the trained Rule/Violation models

A Our rule model.

Table 4.4 Coefficients of Learned Rule Models

Openssl Apache Snmp Python

Support -6.8 ** 3.42 * 0.72 *

%Inter -2.7 *** -5.8 *** -1.8 ***

|V| 0.96** 0.3 *

|E| -2.0 *** -0.3 ***

|V|/|E| -1.9 *** -0.84 *

Max Out -1.6 * -1.2 *

Max In 2.6 ** 1.4 ** -3.8 ** 0.1 #

Ave In -4.3 # -0.67 **

%SDDE 4.64 * -1.1 ** -1.2 * -1.4 ***

RIR -0.56 * -1.1 *** -0.33 **

Significance codes: # (p > 0.1), * (0.05 < p < 0.1), ** (0.01 < p < 0.05), *** (p < 0.01)

Table 4.4 summarizes our rule models for the four projects, showing the coefficient and significance level of each feature selected in each model. Missing entries indicate features that were removed by backward elimination. We omitted the intercept β0 and some features that were selected in only one project. The p-values of coefficients were obtained with the Wald test [102] and indicate the significance level of the coefficients. The features that were selected in at least three of the four projects are Support, %Inter, Max In, %SDDE, and RIR. All of these features have a high significance level.

Table 4.5 Coefficients of Learned Violation Models

Openssl Apache Snmp Python

Rule Fit 10.2 * 3.5 **

%Miss -7.3 ** 5.9 * -0.0005 **

AllCallers -2.3 ** -55 #

%ΔV -12 *** -6640 # -2027 # 6.5 ***
#ΔE -2.0 ** -545 # -0.3 ***
%ΔE 3620 # 2027 #

Edge features are omitted

Significance codes: # (p > 0.1), * (0.05 < p < 0.1), ** (0.01 < p < 0.05), *** (p < 0.01)

B Our violation model.

Table 4.5 shows the coefficients and significance levels of features selected in our violation model for each project. The features that were selected in at least three of the four projects are %Miss, %ΔV, #ΔE, and indicators for the edge types Data(AoAi), Data(AoCp), and SDDE(AiAi). For Apache and Snmp, most features were insignificant, which is likely due to the small sample size. For the other two projects, all features were significant. The selected edge features, namely Data(AoAi), Data(AoCp), and SDDE(AiAi), had positive coefficients in 2 out of 3 projects. This indicates that different kinds of violations have different proportions of false positive examples. If an edge of type Data(AoCp) is omitted in a violation, then a postcondition rule, which requires checking the return value of a function, is violated. If edges of type Data(AoAi) or SDDE(AiAi) are omitted, then a call sequence rule is violated. Hence candidate violations of postcondition rules and call sequence rules are more likely to be valid than candidate violations of precondition rules.

4.4.3 HP-1: Comparing our rule model with the baseline rule models

A Results

Table 4.6 summarizes the accuracy, precision, recall and F1 of our rule model and the two baseline heuristic rule models, namely support and the z-statistic. The best values for each subject program and metric are shown in boldface.

Our model was superior in all cases except with respect to precision for Python. For every project, the accuracy of our model was higher than the maximum of the proportion of positive and negative examples, indicating that our model performed better than a naïve classifier which assigns each example the same label. Our model’s precision for Openssl and Snmp was higher than for Apache and Python. For all four projects, its recall was high – above 85%. The values of the F1 measure, which balances precision and recall, were good for all projects, ranging from 77% to 92%.


Table 4.6 Metrics for Our Rule Model and Base Rule Models

           Our Rule Model                Support                      z-statistic
           Accur  Prec   Recal  F1       Accur  Prec   Recal  F1      Accur  Prec   Recal  F1
Openssl    0.87   0.88   0.95   0.92     0.32   0.63   0.22   0.33    0.65   0.8    0.73   0.76
Apache     0.71   0.69   0.88   0.77     0.53   0.59   0.52   0.55    0.66   0.68   0.76   0.72
Snmp       0.84   0.85   0.94   0.89     0.66   0.81   0.68   0.74    0.77   0.78   0.91   0.84
Python     0.73   0.75   0.85   0.8      0.52   0.7    0.41   0.52    0.51   0.79   0.43   0.56

Table 4.7 Metrics for Our Violation Model and Base Violation Models

           Our Violation Model           Screening-vio                Support                       z-statistic
           Accur  Prec   Recal  F1       Accur  Prec   Recal  F1      Accur  Prec   Recal  F1        Accur  Prec   Recal  F1
Openssl    0.82   0.79   0.81   0.8      0.59   0.6    0.36   0.45    0.43   0.14   0.045  0.06      0.55   0.62   0.076  0.14
Apache     0.69   0.69   0.79   0.73     0.58   0.62   0.57   0.59    0.5    0.56   0.36   0.43      0.58   0.67   0.43   0.52
Snmp       0.84   0.79   0.94   0.85     0.74   0.8    0.65   0.71    0.46   0.45   0.38   0.35      0.49   0.5    0.31   0.38
Python     0.81   0.64   0.58   0.61     0.75   0.53   0.41   0.46    0.65   0.19   0.12   0.15      0.7    0.22   0.059  0.09

B Precision-Recall Curve

The precision-recall curves for the rule models are shown in Figure 4.1 (a). The curves for our rule model are quite similar across all four projects, with precision decreasing very slowly as recall increases. For Openssl, the curves for our rule model (drawn in black) lie entirely above the other two curves, which indicates that the rule model achieved better precision than the other models for all recall values. For Snmp, our rule model exhibited superior precision when recall was greater than 80%. For the Apache project, our model performed similarly to the z-statistic. For Python, however, the z-statistic performed slightly better than our rule model.

4.4.4 HP-2: Comparing our violation model with the baseline violation models

A Results

Table 4.7 summarizes the accuracy, precision, recall and F1 results for our violation model and the baseline violation models: screening-vio, support, and the z-statistic.

Our violation model performed better than the other models in all cases except with respect to precision for Snmp. Again all of the models outperformed the naïve classifier in terms of the accuracy measure. The precision, recall, and F1 values achieved with the violation models are higher for Openssl and Snmp than for Apache and Python. All of the models performed poorly with Python. This may be due to the fact that the Python dataset is imbalanced, with only 23% positive examples.

Figure 4.1 Precision-recall curves for the rule models (a) and violation models (b). Our model (black), Support (blue), z-statistic (red), Screening-vio (green)


B Precision-Recall Curve

The precision-recall curves for the violation models are shown in Figure 4.1 (b). Our violation model performs better than the heuristic models, as indicated by the fact that its curve lies mostly above the other curves for all four projects. The screening-vio model, which is represented by the green curve, performs second-best for Snmp and Python. This shows that overall, the logistic regression models outperformed the single heuristics. The blue curves, which indicate support, are always lowest for recall approaching 1.

4.4.5 HP-3: Learning curves

We use learning curves [63] to estimate how many examples should be labeled in order to achieve good performance with our models. To generate a learning curve, we increased the sample size in steps of 20. For each sample size n, we randomly selected 10 stratified samples of size n from the original dataset, computed accuracy, precision, and recall for each sample by 10-fold cross-validation, and then averaged the accuracy, precision, and recall over the 10 samples.
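The following sketch illustrates this procedure with scikit-learn rather than the R code used in the study; the synthetic data and the use of plain accuracy as the cross-validation score are assumptions made for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

def learning_curve_points(X, y, sizes, repeats=10):
    points = []
    for n in sizes:
        splitter = StratifiedShuffleSplit(n_splits=repeats, train_size=n, random_state=0)
        # Score each stratified subsample of size n by 10-fold cross-validation.
        scores = [cross_val_score(LogisticRegression(max_iter=1000),
                                  X[idx], y[idx], cv=10).mean()
                  for idx, _ in splitter.split(X, y)]
        points.append((n, float(np.mean(scores))))
    return points

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 0).astype(int)
print(learning_curve_points(X, y, sizes=[40, 80, 120]))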

A Learning curve for our rule model

Figure 4.2 (a) shows the learning curves for our rule model. For all four projects, the curve indicates that model performance levels off at a sample size in the range 80–120. This indicates that only a few rules need to be labeled on average to obtain reasonably accurate rule models. This is a consequence of the simplicity of our logistic models, since the complexity of the hypothesis space influences the number of training samples needed. However, in this case we also get good performance with simple models.

B Learning curve of our violation model

Figure 4.2 (b) shows learning curves for the violation models. We show the curves only for Openssl and Python, since Apache and Snmp have fewer examples. For Python, model performance levels off at around 100 to 120 examples. For Openssl, the rate of increase becomes lower at between 60 and 70 examples and thereafter performance continues to increase gradually. As before, this also indicates that our violation models can be learned with only a few labeled examples.


Figure 4.2 Learning curves for the rule model (a) and violation model (b). Accuracy (black), Precision (red), Recall (green)


CHAPTER FIVE. EXTENDING STATIC ANALYSIS BY AUTOMATICALLY MINING PROJECT-SPECIFIC RULES

Commercial static program analysis tools can be used to detect many defects that are common across applications. However, such tools currently have limited ability to reveal defects that are specific to individual projects, unless specialized checkers are devised and implemented by tool users.

Developers do not typically exploit this capability. By contrast, defect mining tools developed by researchers can discover project-specific defects, but they require specialized expertise to employ and they may not be robust enough for general use. We present a hybrid approach in which a sophisticated rule mining tool is used to discover project-specific programming rules, which are then automatically transformed into checkers that a commercial static analysis tool can run against a code base to reveal defects. In this way, a commercial static analysis tool is extended to find project-specific bugs, and, together with the checkers that come with the tool, defect discovery becomes both more efficient and more complete.

5.1 INTRODUCTION

Static analysis tools are commonly used by software development teams to detect defects and enforce coding standards. The most basic tools, such as PC Lint [72], are essentially advanced compiler front-ends that check for violations of simple rules about proper use of particular language constructs.

Figure 5.1 Framework

Advanced static analysis tools such as Klocwork [48] and Coverity [16] can detect more complicated defects, such as ones that cause memory leaks or invalid data flows, by checking rules involving control flow and data flow paths, which may be intraprocedural or interprocedural. Advanced static analysis tools find potential defects by using checkers, which are written to search for specific patterns or pattern violations in source code. Issues are reported to the tool user, who must decide if each issue is a real defect or a false positive.

Advanced static analysis tools are designed to work on source code written with different design paradigms (object-oriented or procedural, etc.), different operating systems, and different application domains (real-time, embedded, server, web, etc). For this reason, the checkers that come with a tool are intended to reveal defects that are common across many types of software systems. Ideally, the checkers should collectively find a high proportion of the actual defects in a system (exhibit high recall) and generate relatively few false positive indications (exhibit high precision). In reality, recall and precision must be traded off to some extent.

Developers dislike using tools with low precision [6]; therefore, commercial static analysis tools are typically configured to use general checkers having high precision across varied types of software.

While general checkers are useful, their recall may be quite low, since there can be many


programming rules that are specific to a software project. For example, implicit programming rules of a project may require that an open routine be called before a connect routine is called, or they may require that the return value of a given function be checked for error codes before another function is called. Commercial static analysis tools do not check such rules by default, because they are not common between projects.

Advanced static analysis tools allow users to create new, “custom” checkers, based on project-specific rules, to detect additional defects. These checkers can be created manually for rules based on control flow or data flow paths. Unfortunately, project-specific rules are rarely documented. Instead, they are known to at least some developers and are implicit in the source code. In order to improve the defect detection of static analysis tools, these rules must be identified and converted into checkers that the tool can use. Due to the difficulty of obtaining project-specific rules and manually writing checkers for them, the advanced functionality of creating customized checkers is not fully utilized by software development organizations today.

This paper presents a framework for automatically identifying project-specific rules and creating custom checkers for a commercial static analysis tool. The framework is illustrated in Figure 5.1.

It employs a recently developed dependence-based pattern mining technique to identify frequent code patterns that are likely to represent programming rules. These patterns are currently transformed automatically into customized checkers for Klocwork, a commercial static analysis tool used by ABB,


although our framework can be adapted to work with other tools. Finally, the Klocwork analysis engine uses these checkers to find rule violations in the code base, which are then examined by developers. To evaluate this framework, we applied it to the code bases of two large ABB software products. The results indicate that: (1) Klocwork checkers can be created for most of the identified rules and (2) the transformed checkers accurately report defects involving violations of project-specific rules, which were not detected by Klocwork’s built-in checkers.

5.2 THE RULE MINING TOOL AND STATIC ANALYSIS TOOL USED IN THIS WORK

In this work, we used Chang et al’s [11][12][13][14] dependence graph based defect mining tool to mine frequent code patterns as programming rules. We then transform these mined frequent patterns into Klocwork path checkers [48][51]. In this section, we give an introduction on the frequent patterns mined from Chang et al’s work [11][12][13][14], and then introduce Klocwork, the commercial static analysis tool used in this study, and describe how we use the mined rules to create new checkers.

5.2.1 Mining Frequent Code Patterns

In Section 3.2, we introduced the techniques used in [11][12][13][14], which mine frequent code patterns from the SDG (System Dependence Graph) as programming rules, and showed examples of function precondition, postcondition and call sequence rules mined with this approach. We will give a more detailed explanation of the different types of programming rules mined with this approach.

Chang et al's work focused on mining programming rules around function calls. Therefore, most of the mined frequent patterns express rules of the form:

f(…) is called ⇒ constraint C is satisfied

where ⇒ means "implies". Three basic kinds of rules, as introduced in Section 3.2, involving different kinds of constraints C, are precondition rules, postcondition rules, and call-sequence rules.

A precondition rule requires programmers to ensure that a certain input parameter is valid. To do that, a conditional check of the parameter’s validity is usually performed before passing it to the function.

Such a check can be omitted only if the value is certain to be valid, e.g., when a valid constant is assigned to the variable before it is passed to the function. Postcondition rules take one of two different forms. In a Type 1 postcondition rule, one should perform a conditional check on the returned value before it is used elsewhere in the program, unless the return value is guaranteed to be correct. In a Type-2 postcondition rule, the return value is a constant representing the completion status of the function call, which should be checked so that abnormal completion can be handled appropriately. A call sequence rule specifies that a set of functions should be called in a certain order.

A call-pair rule is a simple call sequence rule requiring that when a given function f is called, another function g should be called before or afterward.


Table 5.1 Simple Rules

Rule type: Precondition
  Rule: f(x) ⇒ ensure x is valid
  Example: if(x <= MAX) f(x);
  XML RuleSpec: Value constraint
  Checker: Data flow

Rule type: Postcondition (Type 1)
  Rule: y = f(x) ⇒ ensure y is valid
  Example: y = f(x); if(y == null) return;
  XML RuleSpec: Value constraint
  Checker: Data flow

Rule type: Postcondition (Type 2)
  Rule: status = f(x) ⇒ ensure status is checked to handle different completion statuses
  Example: status = f(x); if(status == FAIL) //do something
  XML RuleSpec: Conditional constraint
  Checker: Control flow

Rule type: Call pair
  Rule: f(x) ⇒ ensure function g(..) is called before or after f()
  Example-1: y = f(x); g(y);   Example-2: g(x); f(x);
  XML RuleSpec: Call constraint
  Checker: Control flow

If a programming rule consists of a single precondition rule, postcondition rule, or a call pair rule, we call it a simple rule. If a rule involves several pre or post conditions, more than two functions, or a combination of simple rules, we call it a hybrid rule. Table 5.1 shows examples of simple rules.


The conditional check or additional function call shown in each example implements the consequent (action part) of the corresponding rule.

5.2.2 Static Analysis Tools and Custom Checkers

For this study, we selected Klocwork [48], a popular static program analysis tool that can detect security vulnerabilities, quality defects, and architectural issues in C, C++, Java, and C# programs.

Klocwork has default checkers [50] which detect common issues in C and C++ programs, such as possible buffer overflow vulnerabilities, null-pointer dereferences, and memory leaks. Klocwork was selected for this study since it is used by development teams at ABB.

Klocwork provides the Klocwork Extensibility Interface [49] for creating custom checkers. Other static analysis tools also support the creation of custom checkers. For example, Coverity provides a

Static Analysis SDK [17] for this purpose. Each of these tools provides two types of checkers: AST checkers and path checkers. AST checkers are used for code-style analysis, while path checkers are used to perform control flow and data flow analysis. Writing effective checkers requires skill, even when the rules to be checked are well specified. We transform mined programming rules, corresponding to SDG subgraphs, into path checkers so that we can employ Klocwork to find violations across the entire code base.


Figure 5.2 Example of a Klocwork path checker. If the analysis engine detects a data flow from the source to the sink, an error will be reported at the sink node.

Klocwork path checkers perform automatic source-to-sink data flow and control flow analysis. The source and sink are program elements of interest, such as program variables. Users specify them using the Klocwork C/C++ Path API, so each checker is a piece of C++ code. Once the source and the sink are specified, the path checker analysis engine checks whether there is a data flow or control flow from the source to the sink, and, if one exists, it reports an issue. Figure 5.2 shows an example path checker that performs data flow analysis. Assume that we want to check whether a null pointer is used as the input parameter to function f(). We could specify the source as any variable which is assigned a NULL pointer, and the sink as the input parameter to a call to f(). If the Klocwork analysis engine detects a data flow from the source to the sink, it indicates that function f() uses a null pointer as an input parameter. Therefore an error will be reported at the sink node, which is the call to f().

A path checker may also employ the following capabilities provided by the Klocwork C/C++ Path API:


• Control-flow traversal – the checker can traverse the control flow graph of a function.

• Value tracking – the checker can determine whether two variables contain the same value or point to the same memory block.

• Value constraints – the checker can determine if certain constraints hold on values in memory, such as upper and lower bounds.

Figure 5.3 Automatic P2C Converter

5.3 AUTOMATIC P2C (PATTERN TO CHECKER) CONVERTER

In this section, we describe the Pattern-to-Checker (P2C) Converter, which is a prototype tool we have implemented to automatically transform frequent code patterns, mined using Chang et al’s approach [11][12][13][14], into Klocwork path checkers. The P2C Converter consists of two main components, which carry out the transformation illustrated in Figure 5.3. The Rule Extractor extracts programming rules from the mined patterns, which are in the form of generic SDG subgraphs.

To obtain rules of the forms shown in Table 5.1, the Rule Extractor analyzes the dependence graph structure of a pattern and the abstract syntax trees (AST) of its statement instances. The Rule Extractor generates XML rule specifications, called XML RuleSpecs, conforming to an XML schema that we defined.

The other main P2C component, called the Checker Generator, transforms XML RuleSpecs into Klocwork checkers. This design makes our framework more flexible, since programmers can use XML RuleSpecs to directly specify other programming rules they wish to enforce and then use the Checker Generator to transform them into checkers, obviating the need to write the checkers themselves. We have provided a web demo to illustrate the transformation of an XML RuleSpec into Klocwork checkers [101].

5.3.1 Rule Extractor

Since the mined patterns do not represent programming rules directly in the forms shown in Table 5.1, we need a scheme to automatically extract programming rules of those forms from the patterns.

These rules will be expressed using XML RuleSpecs. We will first describe the nature of XML RuleSpecs; then we will describe how rules are extracted from simple and complex patterns and expressed as XML RuleSpecs.

A XML RuleSpecs

As described in Section 5.2.1, the programming rules listed in Table 5.1 specify what constraint should be satisfied when a function is called. There are three types of constraints on calling a function f(): value constraints, conditional constraints and call constraints. We describe these constraint types below.


A value constraint on the value of an input or output parameter of a function is a constraint that can be checked by the Klocwork engine (statically). For example, a value constraint can specify upper and lower bounds for an input or output parameter or specify that a pointer cannot be NULL.

Precondition rules and Type 1 postcondition rules specify value constraints on input or output parameters.

A conditional constraint requires the presence, in the program code, of a conditional check on an input or output parameter before or after a given function is called. A Type 2 postcondition rule specifies such a constraint.

Conditional constraints can also be used when value constraints cannot be checked statically. For example, to ensure that a parameter value is within a certain range that can be determined only at runtime, a conditional constraint can be written that requires the presence of a conditional check on the parameter value.

A call constraint requires the presence of another function call before or after a function f() is called.

Call sequence rules specify call constraints involving two or more function calls.

Constraints for different programming rules are summarized in Table 5.1. The schema for XML RuleSpecs supports specification of value, conditional, and call constraints on a call to a given function f(). An XML RuleSpec has two major elements: the first element indicates the function f() whose calls are subject to constraints, and the second element specifies one or several constraints as described above. The three types of constraints are expressed by the following three types of specification elements:

Value constraint element: This element indicates which input or output parameter is involved in the constraint, and it specifies value constraints on this parameter. Currently we support constraints on numerical values, including upper bounds, lower bounds, ranges, legal values, and illegal values (including NULL for pointers).

Conditional constraint element: This element indicates which input or output parameter should be checked and whether the check should appear before or after the call to the function.

Call constraint element: This element specifies the other function g() that should be called together with f() and indicates whether it should be called before or after f(). Relationships between parameters and return values of the two functions can also be specified. For example, a call constraint for the call sequence rule y = f() ⇒ ensure g(y) is called afterward indicates that g() should use the return value of f() as its input parameter.

Figure 5.4 shows an example XML RuleSpec, which specifies a value constraint, a conditional constraint and a call constraint on calling a function f().

B Generate XML RuleSpecs from mined patterns


Mined patterns can be either simple patterns or hybrid patterns. A simple pattern involves only one precondition or postcondition or involves only one pair of functions. Thus a precondition, postcondition or call sequence rule can be extracted directly from a simple pattern. Hybrid patterns are more complex patterns which involve multiple conditions and/or functions. A hybrid pattern must be decomposed into simple patterns to extract rules.

A simple pattern can be categorized as a precondition, postcondition, or call sequence rule by analyzing its dependence graph structure. To extract a value constraint from a precondition rule or Type 1 postcondition rule, we also need to analyze the abstract syntax tree of the conditional check involved in occurrences of the pattern. For call sequence patterns involving one pair of functions, it is necessary to decide which function should be on the left hand side of the rule and which should be on the right hand side. In such cases, we generate programming rules for both cases, and let the programmer decide whether to keep both of the rules or remove one of them.

By analyzing the dependence graph structure of a hybrid pattern, we can decompose it into a conjunction of simple rules:

[f1(…) ⇒ C1] ∧ [f2(…) ⇒ C2] ∧ … ∧ [fn(…) ⇒ Cn]

Such a rule conjunction is satisfied if each constituent rule is satisfied. If any of the constituent rules is violated, then the conjunction is violated. We extract an XML RuleSpec for each of the constituent simple rules.

Figure 5.4 Example of an XML RuleSpec

5.3.2 Checker Generator

An XML RuleSpec may specify one or several constraints; a checker will be generated for each constraint to check whether it is violated. As described in Section 5.2.2, Klocwork path checkers perform source-to-sink data flow or control flow analysis, and an error is flagged at the sink if a data flow or control flow is detected from the source to the sink. For a value constraint, the checker generator will produce a checker based on data flow analysis from the source and the sink. The main idea is that if an illegal value reaches the sink from the source via a data flow then an error will be flagged. For conditional constraint or call constraint, the checker generator will produce a checker based on control flow analysis from the source to the sink. To address such constraints, the source is specified as the entry node to any function, and the sink is a call to a function f() such that a required conditional check or a required call to another function g() is missing. If there is a control flow from the source to the sink, we have detected a call to f() with a missing constraint. To write such checkers, we need to traverse the control flow graph backwards or forwards from the source, and do possible value tracking using the Klocwork C/C++ Path APIs [49].

Next we describe the path checker mechanism, before showing how checkers can be written for different constraints. For each checker, an example is used to show what the source and the sink might look like. We use superscripts source and sink to annotate the expressions representing the

extracted source and sink.

Path checker mechanism [51]. The checker generator needs to implement two functions to extract sources and sinks. These are SourceTrigger::extract(node, result), and SinkTrigger::extract(node, result), respectively. Basically, these two functions let the path checker interact with the backend path checker analysis engine. The engine traverses the control flow graph of the current function under analysis and passes the current node being traversed, represented below as node, to both of the two functions; the functions check individually whether node is of interest and if so extract expression(s) of interest from it; the extracted expression(s) will be returned to the engine by being added to result. As long as one data or control flow exists between the specified sources and sinks, a defect will be reported.

Check value constraint for precondition rule. Assume that we want to check whether the ith parameter of function f(…) is valid. A checker can be written as follows:

SourceTrigger::extract(node_t node, TriggerResult* result)

/*Source: Anywhere a variable could be defined, or any formal-in parameter */

Code omitted

SinkTrigger:: extract(node_t node, TriggerResult* result)

/*Sink: ith parameter of f(), whose value might be illegal */


if node is a call to f()

expr ← ith parameter of f()

if the value of expr is not legal

Add expr to result set

Example

h(long xSource){

/*if(x <= MAX) is missing,

*so x could be > MAX */

f(xSink);

}

Check value constraint for postcondition rule. Assume that we want to check whether the returned value of function f(…) is valid. A checker can be written as follows:

SourceTrigger::extract(node_t node, TriggerResult* result)

/*Source: Return value of f() */

if node is a call to f()

expr ← returned value of f()


Add expr to result set

SinkTrigger:: extract(node_t node, TriggerResult* res)

/*Sink: anywhere a variable could be used whose value might be an illegal return value of f()

*/

for each variable var used in node

expr ← var

if expr might be an illegal return value of f()

Add expr to result set

Example

ySource = f(x);

/* if(y == null) is missing,

*so y could be null */

z = ySink->a;

Check conditional constraint. Assume that we want to check whether a conditional check on an input or output parameter of f() is missing. The checker could be written as follows:


SourceTrigger::extract(node_t node, TriggerResult* result)

/*Source: entry of a function */

Code omitted

SinkTrigger:: extract(node_t node, TriggerResult* result)

/*Sink: call to f(), for which a conditional check is missing*/

if node is a call to f()

expr ← specified input or output parameter of f()

Traverse the control flow graph in the forward or backward direction:

if the current node is a check of expr, return;

Add call to f() to result set

Example

hSource(long x){

status = fSink (x);

//if(status == FAIL) is missing
}

Check call constraint. Assume that we want to check whether a call g(…, y, ..) is missing before or after f(…, x, …) is called. Let x be the ith parameter of f(), and y be the jth parameter of g(). A


checker can be written as follows (note that Example 1 for the call sequence rule in Table 5.1 could be written similarly):

SourceTrigger::extract(node_t node, TriggerResult* result)

/*Source: entry of a function */

Code omitted

SinkTrigger:: extract(node_t node, TriggerResult* result)

/*Sink: call to f(), for which a call to another function g() is missing*/

if node is a call to f()

expr ← ith parameter of f()

traverse the control flow graph in the forward or backward direction:

if the current node is a call to g(), and the jth parameter is the same as expr, return

Add call to f() to result set

Example

hSource (long x){

fSink (x);

/*g(x) is missing*/

}


5.4 EMPIRICAL STUDY

We conducted an empirical study to assess the practicality and effectiveness of our automated framework for deriving project-specific rule checkers. The study addressed the following research questions:

R1: How general is our automatic framework for deriving checkers? Specifically,

• R1-1 ("Precision" of checker generation): what percentage of automatically generated checkers are actually valid?

• R1-2 ("Recall" of checker generation): how many rules (XML RuleSpecs) cannot be transformed into checkers using the framework, so that programmers have to write checkers for them manually?

R2: How effective are the generated checkers? Specifically,

• R2-1 (Precision of warnings produced): what percentage of checker-generated warnings is valid?

• R2-2: Compared to Klocwork default checkers, how often are new valid warnings produced using the generated checkers?

In this section, we first describe the dataset used in the experiments (Section 5.4.1). We then present and analyze the experimental results for R1 and R2 in Sections 5.4.2 and 5.4.3, respectively.


5.4.1 Preparing patterns for analysis

A Mining Patterns from the Code Base

We applied our framework to the code bases of two industrial software systems, which we shall call

System X and System Y, which were developed by separate product groups at ABB. Both systems contain a mixture of C and C++ code. System X is the larger system, with more than 5 million

SLOC. Chang et al’s pattern mining tool [11][12][13][14] was applied to one of its largest components, which contains 2.7 million SLOC and which we call Component A. The other software system, System Y, contains around 2 million SLOC, and we applied the mining tool to the entire system. We used CodeSurfer [32] to generate SDGs for the two systems. Chang et al’s tool mined a total of 1112 frequent code patterns from Component A of System X, and it mined 748 patterns from

System Y.

B Sampling from the Mined Patterns

To reduce the amount of effort needed to review the patterns, generated checkers, and reported warnings, we did not apply our framework to the entire pattern set. Instead, we selected random samples of 200 patterns from System X and 150 patterns from System Y. To obtain a diverse and representative set of patterns, a stratified random sampling scheme was employed. We partitioned the entire pattern set into precondition patterns, postcondition patterns, call sequence patterns, and hybrid patterns, and we sampled a number of patterns from each category according to its size relative to the entire set.

Table 5.2 Distribution of Mined Patterns

           Useful patterns   Usage patterns   False Positives
System X   103               34               13
System Y   127               39               34

C Filtering the sampled patterns

We manually examined each of the patterns and put them into one of the following three categories:

Useful patterns are those from which a rule or a set of rules like those shown in Table 5.1 could be extracted. Common usage scenarios are patterns that do not satisfy the criteria for useful patterns but that still represent common coding scenarios. One example is that a get function is often called after a set function. Although this is an interesting pattern, it is not a useful call pair rule, since get and set can be invoked independently. False positive patterns are those with no apparent practical significance. Checkers were generated for just the useful patterns.

To make sure that we obtained a fair categorization of the patterns, the first two authors of the paper reviewed disjoint subsets of the sampled patterns, each comprising 45% of the total, and they both reviewed the remaining 10% of the sampled patterns to see whether they generally agreed on the categorization. They disagreed on only one of the latter patterns. The distribution of the patterns is


shown in Table 5.2.

To better evaluate the precision of mined patterns, we conducted a small survey of developers of

System Y. We created three surveys, each containing 10 randomly selected patterns, and we asked three different groups of developers to classify the patterns. The first group of developers classified all patterns to be usage patterns; the second group of developers classified 5 patterns as useful, 4 as usage patterns, and 1 as a false positive; the third group of developers classified 7 patterns as useful, 2 as usage patterns, and 1 as a false positive. It seems that the first group of developers were more conservative, since their comments on some patterns indicated that they are useful under certain circumstances. The second and third group of developers thought the majority of patterns were useful ones.

5.4.2 R-1: Generality of generated checkers

In total, 161 checkers were generated for System X and 148 checkers were generated for System Y.

To assess the generality of the checkers, we reviewed each of the generated checkers to decide whether it was precise, imprecise, or a false positive. A checker for a rule was considered precise if it would precisely check the required constraint(s), whereas an imprecise checker may produce false positive warnings. A checker for a rule was considered a false positive if it did not actually check the constraints in the rule. That is, the rule is one that cannot be addressed using our current XML RuleSpec framework, though it might be possible to create a suitable checker manually.

Table 5.3 Characterization of Checkers for System X

                             Value         Conditional   Call          Total
                             Constraints   Constraints   Constraints
Generated   Precise          34            70            22            126
            Imprecise        0             13            0             13
            False Positive   1             0             21            22
Manually Created             1             0             3             4
Total                        36            83            46            165
CG Precision                 126/(126 + 13 + 22) = 78.3%
CG Recall                    126/(126 + 4) = 96.9%

Table 5.4 Characterization of Checkers for System Y

                             Value         Conditional   Call          Total
                             Constraints   Constraints   Constraints
Generated   Precise          70            36            14            120
            Imprecise        0             9             0             9
            False Positive   4             1             14            19
Manually Created             0             0             4             4
Total                        74            46            32            152
CG Precision                 120/(120 + 9 + 19) = 81.1%
CG Recall                    120/(120 + 4) = 96.8%

Based on the above definitions, we can define the precision and recall of the checker generator (CG) as follows:

Eq 5.1: CG Precision = |precise checkers| / |all generated checkers|

Eq 5.2: CG Recall = |precise checkers| / (|precise checkers| + |manually created checkers|)

The generated checkers are characterized in Table 5.3 for System X, and Table 5.4, for System Y.

For both projects, around 80% of the generated checkers were precise checkers, and we needed to manually create only four checkers for each project. Next, we discuss the sources of false positives and imprecision:

Sources of false positives:

(1) Incomplete constraints. Some patterns could not be fully described using the three types of constraints described in this paper. For example, we observed some patterns expressing the following programming rule: f(…) is called and f(…) returns a value V ⇒ g(…) should be called.

This rule could not be fully expressed by our call constraint, since it has an additional requirement on the returned value of f(). For such rules, we must manually create checkers to fully express the constraints.

(2) Reversed call constraint. If the mined pattern is a call pair rule involving two function calls f() and g(), there are two possible rules: f() called ⇒ ensure g() called afterward; or g() called ⇒ ensure f() called before. There is no way to decide which one is the correct programming rule, or whether both of them are correct. Therefore, we generated checkers expressing both of the rules automatically, and let programmers decide which one they want to use. If only one of them is correct, the other one will be a false positive.

Sources of imprecision:

(1) Runtime behavior. Some constraints could not be checked statically by the checkers. For example, the following rule requires a parameter to be within the range of an array: sort(list, size) called ⇒ ensure size is within the range (0, length(list)); however, the range of the array could not be determined statically. In this case, we can write a checker based on a conditional constraint instead of a value constraint, that is, check for a runtime test on the value of size. This could produce false positive warnings if the parameter is sure to be within the range and no check is needed.

(2) OR rules. Sometimes two constraints work as alternatives to each other. For example, we found the following OR rule in our experiment: speed returned by f(…) ⇒ speed does not exceed limit OR speed*time does not exceed limit. In this case we have to write two separate checkers to check the two constraints "speed does not exceed limit" and "speed*time does not exceed limit" when f() is called. Therefore, false positive warnings will be produced when code satisfies only one of the constraints but not the other.

(3) Value constraints on structure fields or class members. We found some rules that require value constraints on a field of a structure or a member of a class. Our current framework implementation does not support such constraints, which require traversing the memory blocks of a structure or a class.


We have left this feature for future work.

5.4.3 R-2: Effectiveness of the generated checkers

To assess the effectiveness of the generated checkers, we estimated the precision of the reported warnings, and we compared the generated checkers with the Klocwork default checkers to see how many new valid warnings were produced.

A Precision of the warnings

Ideally, precision of the warnings should be the proportion of warnings indicating actual bugs. This proportion could be obtained only by submitting all warnings to programmers and asking them to decide whether or not each warning is a bug they should fix. This was infeasible, due to the large number of warnings produced, so we estimated precision by the proportion of valid warnings. Valid warnings represent true violations of the corresponding rule, while invalid warnings represent spurious violations. For each warning, we manually determined whether it was valid or invalid. Our estimate is likely to be an overestimate of the true precision. However, the true violations at least indicate dubious programming practices that programmers should look into.


Table 5.5 Characterization of Warnings in System X

          Value         Conditional   Call          Total   %
          Constraints   Constraints   Constraints
Valid     36            16            42            94      94%
Invalid   3             0             3             6       6%
Total     39            16            45            100     100%

Table 5.6 Characterization of Warnings in System Y

          Value         Conditional   Call          Total   %
          Constraints   Constraints   Constraints
Valid     123           19            27            169     84.5%
Invalid   23            8             0             31      15.5%
Total     146           27            27            200     100%

Our approach produced altogether 691 warnings for System X and 2335 warnings for System Y. Due to the amount of manual work required to examine the rules and violations, we selected a random sample of 100 warnings from System X and 200 warnings from System Y, and we examined each of them to decide whether they are valid.

Table 5.5 and Table 5.6 characterize the precision of the warnings issued for the two systems.

The warnings for System X and System Y achieved precision of 94% and 84.5%, respectively, which are both high levels of precision. Moreover, precision differs across the types of checkers; call constraint checkers seem to have slightly better precision than the other two types of checkers.


Sources of the imprecision that did occur are as follows:

(1) Imprecise data flow analysis of Klocwork. For example, an invalid warning concerning the validity of the return value of f() was as follows:

1: x->a = f(…);

2: g(x); //Warning reported

The warning was reported because Klocwork detected a data flow from x->a to x. This indicates that the analysis engine cannot distinguish a structure variable from its fields.

(2) Interprocedural analysis. Klocwork provides interprocedural data flow analysis, but the control flow analysis is intraprocedural. Since checking of conditional constraints and call constraints is based on control flow analysis, an invalid warning may be produced due to lack of interprocedural analysis. For example, although g() is not called after f() in a function h(), it might be called in a caller or callee of h(). A false positive warning will be reported in this case.

B Comparison between warnings produced by default checkers and generated checkers.

We ran default checkers against the code base, and 109 warnings and 1146 warnings were produced for System X and System Y, respectively. We compared them with the set of warnings produced by the generated checkers.


There was no overlap between the two sets of warnings for System Y, and there were only 4 warnings for System X produced by both kinds of checkers. The latter warnings were produced by: the default checker NPD.FUN.MUST, which checks whether the return value of a function is NULL; and by generated checkers which check value constraints on the return values of functions to make sure that they are not NULL. One thing to note here is that NPD.FUN.MUST can check only for functions that explicitly return a NULL pointer, such as “return 0”, while generated value constraint checkers can check any functions that could possibly return a NULL pointer.

5.5 LESSONS LEARNED

At the outset of this work, our approach was to manually write checkers for mined rules. The first author sampled around 40 mined patterns and wrote checkers for them. We found that this approach requires expertise with both the mining tool and the static program analysis tool. First of all, one needs to understand and be able to translate the mined patterns into programming rules. In our experiment, this requires one to also have knowledge about the underlying dependence graph structure of the mined patterns. Secondly, one also needs to know the checker extension functionality of the static analysis tool very well, since many details need to be handled. The following are some examples of the subtleties that arise when writing checkers: (1) traversing the abstract syntax tree of a statement to obtain a certain parameter; (2) extracting value constraints for an arbitrary parameter, including value constraints for a memory block referenced by a pointer; (3) deciding whether two


parameters are the same, or whether they point to the same memory block; (4) handling cases in which the return value of a function is omitted. We concluded that either a special static analysis specialist or an automatic framework was needed to do the work. In our current automatic checker generation framework, we handled all subtleties mentioned above, and programmers do not need to understand a lot about the mined rules. Therefore, we think that such a framework will greatly improve the practicality of checking project-specific rules with extended static analysis and will also save companies the cost of hiring an expert to do the work.

We have also learned important lessons from a survey we conducted to assess the quality of the mined rules. In the survey, we asked programmers to classify patterns into: (1) must-rules [23], which are programming rules that programmers should follow under all circumstances; (2) may-rules [23], which are rules that programmers should follow under certain conditions that depend on the calling context of the function; (3) common usage scenarios, which are common patterns of use but not programming rules; and (4) false positives, which are none of the above. The survey contained 32 randomly selected patterns, and programmers classified 14 of them as must-rules, 1 as a may-rule, 16 as common usage scenarios, and 1 as a false positive. We learned from programmers' responses to the survey that it was hard for them to draw a fine line between a programming rule and a common usage scenario, even if they fully understood the semantics of the pattern. For example, they categorized some patterns as common usage scenarios when they were actually may-rules, since their comments indicated that the pattern should be followed under certain conditions. One explanation for this is that programmers need stronger evidence to decide the usefulness of a pattern than is provided by a couple of code examples. Therefore, we think that a better approach is to let programmers decide the usefulness of patterns over the course of development. Mined patterns could be placed on a

“watch list”, which is updated automatically when new evidence about a pattern’s status is obtained.

For example, the priority of a pattern can be elevated if a defect is found, by any means, in code that matches the pattern.


CHAPTER SIX. BUG FIX PROPAGATION WITH FAST SUBGRAPH MATCHING

In a previous study [90], we presented empirical evidence that when programmers fix a bug, they often fix it only in the code they are responsible for or are most familiar with, and hence they fail to propagate the fix to all the places where it is applicable. In the study, 59% of the Openssl [70] bug fixes examined and 43% of the Apache Http Server [5] bug fixes examined were not propagated completely. This work successfully discovered many ignored bug fixes, but it suffered from several problems. The work is limited to only three pre-defined bug patterns, and its precision and efficiency needs to be further improved.

To address the problems of the above work, we proposed a powerful and efficient approach to propagate a bug fix to all the locations in a code base to which it applies. This approach represents bug and fix patterns as subgraphs of a system dependence graph, and it employs a fast, index-based subgraph matching algorithm to discover unfixed bug-pattern instances remaining in a code base.

We have also developed a graphical tool to help programmers specify bug patterns and fix patterns easily. We evaluated our approach by applying it to bug fixes in four large open-source projects.

The results indicate that the approach exhibits good recall and precision and excellent efficiency.

6.1 INTRODUCTION

Debugging is one of the most time-consuming activities that software developers engage in. In order


to identify the cause of a bug, programmers may need to intensively inspect code, employ a debugger, communicate with fellow programmers, write test cases, apply possible fixes, and do regression tests.

A NIST report [96] estimated that debugging costs industry billions of dollars each year. Given the cost of finding a fix to an existing bug, it is desirable that the fix be applied in all the places where the bug actually occurs. Nevertheless, programmers often fail to do so.

There are various reasons this occurs. Two very common scenarios are described below.

Example 1: Bug fix to copy-pasted code segments. Programmers tend to copy and paste code segments in order to save time and effort [62]. When a bug occurs in one location where a copied code segment was pasted, it is likely to occur in other such locations. A programmer may fix one or more of the locations but fail to apply the fix to all of them. The following example from the Python bug database illustrates this scenario. In this example, a bug is fixed in posix_getcwd(), and another instance of the bug is found in posix_getcwdu(). These two functions are similar and contain copy-pasted code fragments.

Bug 2722: “os.getcwd fails for long path names on linux.”

getcwd() is used to get the current working directory. In this bug fix, instead of using a fixed-length buffer of size 1026 in getcwd(), the code dynamically allocates a buffer, growing it until it is large enough to hold the current path name.


Bug fix in Modules/posixmodule.c, posix_getcwd():

- char buf[1026];
+ int bufsize_incr = 1026;
+ int bufsize = 0;
  char *res;
+ char *tmpbuf = NULL;
+ do {
+     bufsize = bufsize + bufsize_incr;
+     tmpbuf = malloc(bufsize);
+     if (tmpbuf == NULL) {
+         break;
+     }
-     res = getcwd(buf, sizeof buf);
+     res = getcwd(tmpbuf, bufsize);
+     if (res == NULL)
+         free(tmpbuf);
+ } while ((res == NULL));

Overlooked bug instance in Modules/posixmodule.c, posix_getcwdu():

  char buf[1026];
  res = getcwd(buf, sizeof buf);

Example 2: Bug fix that changes an existing usage pattern or adds a new usage pattern. A large number of usage patterns exist for commonly used functions, macros, data structures, etc. [4][11][12][59][60][62][98]. Although most of these patterns are undocumented, they are often followed by programmers when writing new code. When an existing usage pattern is changed or an entirely new usage pattern is introduced, it is often important that the new pattern be followed wherever it is applicable in the code base. The following example from Httpd [5] illustrates such a scenario:

Bug 39518: "Change some 'apr_palloc / memcpy' construction into a single apr_pmemdup which is easier to read."

Bug fix in modules/http/mod_mime.c:

- extension_info *new_info = apr_palloc(p, sizeof(extension_info));
- memcpy(new_info, base_info, sizeof(extension_info));
+ extension_info *new_info = apr_pmemdup(p, base_info, sizeof(extension_info));

Overlooked bug instance in modules/http/mod_mime.c:

  exinfo = (extension_info *)apr_palloc(p, sizeof(*exinfo));
  memcpy(exinfo, copyinfo, sizeof(*exinfo));

The bug fix changes the previous usage pattern involving the apr_palloc/memcpy construct to a new usage pattern involving apr_pmemdup. Our approach has discovered several bugs that involve failure to update an old usage pattern.¹

In previous work [90], we presented a preliminary approach to providing semi-automated support for propagating fixes completely. That approach relies on templates that are defined in terms of program dependences and that match fixes of three particular kinds of bugs: a missing precondition check for a function call; a missing postcondition check for a function call; and an omitted function call of a call-pair usage pattern. With that approach, a fix pattern is extracted from a bug fix instance, and a heuristic graph-matching algorithm is used to find violations of the fix pattern, which may correspond to previously undiscovered bug instances.

Our previous approach has three major limitations. (1) It has limited generality, since it addresses only bug fixes that can be specified with one of the three rule templates. (2) The approach is strictly intraprocedural, so it cannot find bugs that cross function boundaries, and it may generate false alarms when the "missing part" of a reported bug can actually be found in callers or callees. (3) The graph-matching algorithm used is relatively slow and heuristic, so it is not guaranteed to find all violations.

¹ Example 2 is more of a code refactoring than a bug fix. Our approach is applicable not only to coding errors but also to bad programming practices that programmers want to change. So in the context of this work, "bug" indicates either a coding error or a bad programming practice.


In this work, we extend our previous approach substantially to address all of the aforementioned issues: the new approach applies to more bug fixes, it is interprocedural, and it efficiently finds all instances of a bug pattern. This is done by converting the bug fix propagation problem into an exact subgraph matching problem, as follows.

A bug fix involves a buggy version and a fixed version of a project; we supply a tool called PatternBuild [71] to help programmers specify a bug pattern from the buggy version and a fix pattern from the fixed version. Each pattern takes the form of a generic dependence graph. The fixed version of the project is transformed into its system dependence graph (SDG), which is then augmented with additional transitive edges and additional node and edge labels (we call this augmented SDG the augSDG). The bug pattern, and sometimes also the fix pattern, is used as a query graph, and we search for all instances of the bug pattern in the augSDG of the fixed version, in order to find remaining instances of the bug. Finally, the bug instances are reported to programmers, who can use the fix pattern as a reference to confirm and fix potential bugs. The framework of our approach is shown in Figure 6.1.



Figure 6.1 Framework of Bug Fix Propagation

Note that although subgraph matching (subgraph isomorphism) is a computationally difficult problem, it is not desirable to abandon the characterization of bugs and fixes in terms of program dependences. Program and system dependence graphs represent the essential ordering constraints between program statements, and they permit instances of programming rules to be recognized in different contexts, despite semantics-preserving reorderings of their elements and interleaving with unrelated elements. Fortunately, recent developments in data mining technology, namely fast, index-based subgraph matching algorithms, make abandoning this characterization unnecessary.

To identify bug pattern instances, we use a fast subgraph matching algorithm called GADDI that we developed recently [110]. GADDI is very efficient and scalable, and it finds all instances of a query subgraph. Our PatternBuild tool is designed to make it easy for programmers to specify both bug and fix patterns. Programmers work directly with the source code instead of with the SDG.

PatternBuild's back end handles the work of maintaining graph data structures and performing graph-related calculations. The amount of manual work required when using PatternBuild is small, as illustrated in Section 6.4. Moreover, with our previous approach, programmers sometimes also needed to manually edit the automatically extracted patterns. PatternBuild makes it unnecessary to use restrictive fix templates as in our previous approach.

The main contributions of this work [91] are:

 We present a powerful and general approach to propagating bug fixes, including interprocedural ones.

 We apply the state-of-the-art fast subgraph-matching algorithm GADDI to find instances of bugs.

 We developed a tool, PatternBuild, to help programmers specify bug fixes easily and intuitively. (A demo of the tool is available on our PatternBuild project website [71].)

 We present empirical results indicating that our approach is effective and efficient. A sample of bug and fix patterns, along with the bugs discovered, can also be found on the PatternBuild site [71].

6.2 GADDI: INDEX BASED FAST SUBGRAPH MATCHING ALGORITHM

Our approach requires an efficient, scalable, and complete graph matching algorithm. To achieve efficiency, we considered the state-of-the-art index-based graph query algorithms, including GraphGrep [34], TALE [94], and GADDI [110]. The index size of GraphGrep increases dramatically with the database graph size, and it fails to build the index structure when the graph has thousands of vertices; the SDG normally contains hundreds of thousands of vertices, which GraphGrep cannot handle. TALE is an approximate method, so it cannot be guaranteed to find all occurrences of a pattern. In comparison to these approaches, GADDI is both scalable and complete; hence we chose it for our approach.

GADDI [110] is designed to solve the problem of finding all exact matches of a query graph in a single large base graph. GADDI uses a new indexing technique based on neighboring discriminating substructure distance (NDS). The number of indexing units of NDS is proportional to the number of neighboring vertices in the database, which allows the index to grow in a controlled way, and thus it is scalable to very large graphs. Previous experimental results showed that GADDI works efficiently and accurately, and that it scales to base graphs containing thousands of vertices and hundreds of thousands of edges [110]. Since our SDGs can contain hundreds of thousands of vertices and millions of edges, we made the following modification to GADDI so that it can handle larger SDGs: instead of keeping the shortest distances between all pairs of vertices, we create for each vertex v a set containing all vertices that are within distance 5 of v. This reduces the space complexity from O(n²) to O(n), where n is the total number of vertices, and thus allows us to handle larger graphs.
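To make the modification concrete, the following is a minimal Python sketch (not the actual GADDI implementation) of how such bounded neighbor sets could be computed; the adjacency-list representation and the radius parameter are assumptions of the sketch.

from collections import deque

def bounded_neighbor_sets(adj, radius=5):
    """For each vertex v, collect all vertices reachable within `radius` hops.

    `adj` is an adjacency-list dict {vertex: iterable of neighbors}. Storing one
    bounded set per vertex keeps memory proportional to the number of nearby
    vertices instead of the O(n^2) all-pairs shortest-distance table.
    """
    neighbor_sets = {}
    for source in adj:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            if seen[v] == radius:
                continue                      # do not expand beyond the radius
            for w in adj.get(v, ()):
                if w not in seen:
                    seen[w] = seen[v] + 1
                    queue.append(w)
        neighbor_sets[source] = set(seen) - {source}
    return neighbor_sets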

6.3 SPECIFICS OF OUR APPROACH


In order to use GADDI to identify all latent instances of a bug that has been fixed in at least one code location, three steps are needed to transform the bug fix propagation problem into a subgraph matching problem: first, we transform the source code into its SDG and then transform the SDG into an augSDG; second, we extract from a bug fix both a bug pattern and a fix pattern, each represented by a generic dependence graph; finally, we apply GADDI to find all latent bug instances and report them to programmers. We explain the three steps below.

6.3.1 Base graph generation

The base graph is the augSDG. To obtain the augSDG from the original SDG produced by CodeSurfer, two steps are needed: (1) assigning identical node labels to semantically equivalent vertices where feasible; (2) adding edges representing transitive dependences. These two steps are explained below.

A Node and edge labeling

There are four types of dependences in the original SDG, namely intraprocedural and interprocedural data dependences and control dependences. Since we want to perform subgraph matching across procedure boundaries, we do not discriminate between intraprocedural and interprocedural dependences. In the original SDG, nodes are labeled by the kind of program elements they represent, which is a rather coarse level of labeling. In order to improve the precision of subgraph matching, we relabeled the different types of vertices as follows:

Vertices of type call-site, actual-in, actual-out: Call-site vertices represent function calls, and actual-in and actual-out vertices represent the actual input and output parameters of a function call. We label these vertices by examining interprocedural data and control dependences, as described in [11][12][13][14]. If a call-site vertex represents a call to a function f, then the call-site is labeled by the entry point of the procedure dependence graph of f; actual-in vertices are labeled by the corresponding formal-in vertices of f; and an actual-out vertex is labeled by the corresponding formal-out vertex of f. In this way, the elements in each of the following sets of vertices receive the same label: call-site vertices representing calls to a given function; actual-in vertices corresponding to a particular formal input parameter; and actual-out vertices corresponding to a particular output parameter.

Vertices of type control-point and expression: A control-point vertex represents an if, for, or while statement, and an expression vertex represents an expression or assignment. We label these two kinds of vertices by their abstract syntax trees (ASTs), as described in [11][12][13][14]. The advantage of this labeling scheme is that variable names are ignored, so that, for example, c = a + b will have the same label as z = x + y. However, the labeling scheme might give semantically equivalent vertices different labels, for example, if(x < 0) and if(x >= 0). In our previous work [11][12][13][14] we have observed that many reported false positive violations are due to this labeling scheme, especially the labeling of null checks. In order to reduce false positives, we use the simple heuristic of giving the following four kinds of null checks the same label: if(x == NULL), if(x != NULL), if(!x), and if(x).

Vertices of type jump, switch-case, and label: These vertices are labeled by the code of the program elements they represent. For example, all switch-case vertices with code “case ALG_APMD5:” will have the same label.

Vertices of type declaration: Declaration vertices are labeled by the type of the variable declared.

Other: The remaining vertices are labeled by their vertex types from the SDG.
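The labeling rules above can be summarized in code. The following Python sketch is illustrative only: the vertex attributes (kind, callee, formal, ast_hash, code, var_type) are hypothetical stand-ins for whatever the SDG builder actually exposes, and the null-check normalization uses a simple regular expression.

import re

def relabel(vertex):
    """Assign a matching label to an SDG vertex (simplified sketch).

    The attribute names used here are placeholders, not the CodeSurfer API:
    kind, callee, formal, ast_hash (AST with identifiers abstracted away),
    code, and var_type.
    """
    if vertex.kind == "call-site":
        return ("entry", vertex.callee)                  # label by callee's entry point
    if vertex.kind in ("actual-in", "actual-out"):
        return ("formal", vertex.callee, vertex.formal)  # label by matching formal parameter
    if vertex.kind in ("control-point", "expression"):
        # Normalize the four common null-check forms to a single label.
        if vertex.kind == "control-point" and re.fullmatch(
                r"if\s*\(\s*!?\s*\w+\s*(==|!=)?\s*(NULL)?\s*\)", vertex.code):
            return ("null-check",)
        return ("ast", vertex.ast_hash)                  # variable names are ignored
    if vertex.kind in ("jump", "switch-case", "label"):
        return ("code", vertex.code)
    if vertex.kind == "declaration":
        return ("decl", vertex.var_type)
    return ("kind", vertex.kind)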

B Add transitive dependence edges

Both intraprocedural and interprocedural transitive dependence edges are added to the original SDG.

We compute the transitive closures of the data dependence subgraph and the control dependence subgraph of each procedure dependence graph, where the former contains all and only the pDG’s data dependences and the latter contains all and only the pDG’s control dependences. For each pair of direct caller and callee functions in the system call graph, we add interprocedural transitive data dependences between the caller and the callee.
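A rough Python sketch of this step follows, under the assumption that each dependence subgraph is given as a set of (source, target) edge pairs; it is illustrative only and not the implementation used in the tool.

def add_transitive_edges(edges):
    """Compute the transitive closure of one dependence subgraph (sketch).

    `edges` is an iterable of (src, dst) pairs for a single kind of dependence
    (data or control) within one procedure dependence graph. The same routine
    would be applied to each pDG's data and control subgraphs, and to the data
    dependences crossing each direct caller/callee pair.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    closure = set(edges)
    for start in list(adj):
        stack, seen = list(adj[start]), set()
        while stack:                      # DFS from `start`, adding an edge to
            v = stack.pop()               # every reachable vertex
            if v in seen:
                continue
            seen.add(v)
            closure.add((start, v))
            stack.extend(adj.get(v, ()))
    return closure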

One reason for adding these edges is to increase the recall of bug pattern matching. Adding them can increase the likelihood of finding bug instances, since an edge in the query graph may be equivalent to a path in a pattern instance. For example, "if (!f())" is equivalent to "ret = f(); if(!ret)". The first of these two code fragments gives rise to a single data dependence edge from the return value of f() to the if statement, while the second gives rise to a path containing two data dependence edges.

Adding transitive dependence edges can also increase the precision of bug pattern matching when the bug pattern is a subgraph of the fix pattern. As will be explained in Section 6.3.3, in this case we first find all fix pattern instances and remove them from the code base. Because paths are contracted into edges in the augSDG, more fix instances (both inter- and intraprocedural) are discovered and removed from the code base. Therefore, in the second step of subgraph matching, which matches the bug pattern in the pruned augSDG, bug instances that are part of a fix instance are not discovered, since they were removed in the first step.

As will be discussed in the next subsection, the PatternInduce algorithm is used to automatically generate a pattern from the vertices specified by the programmer. The pattern should be a connected generic dependence graph. The vertices specified by programmers might be connected indirectly by a dependence path instead of an edge. Therefore, PatternInduce is invoked on the augSDG instead of the original SDG.

6.3.2 Generating a query graph from a bug fix: the PatternBuild tool


The PatternBuild tool [71] is designed to be very easy for developers to use. It provides a GUI front end with which they can specify bug patterns and fix patterns with just a few mouse clicks. The back end handles all graph-related calculations and data structures. It consists of three components.

CodeHighlight is used to highlight differences between the buggy version and the fixed version, as well as the code that is affected by the changes. PatternEditGUI provides a GUI for programmers to add program elements to the bug pattern and fix pattern. PatternInduce takes the program elements that are specified by PatternEditGUI as input and automatically builds an induced subgraph from them. A demo of the tool is shown in [71].

A motivating example. Before getting into the details of these components, we present a scenario based on Example 2 that illustrates how a programmer can apply PatternBuild to specify a pattern. The bug and fix patterns, along with their dependence graph structures, are shown in Figure 6.2. First, PatternBuild is used to display the exact code changes between the buggy version and the fixed version.

Second, after viewing these changes, the programmer starts adding program elements to the pattern. For example, to add the line

new_info = apr_palloc(p, sizeof(extension_info))

to the pattern, he can right-click on this line, and a dialog will pop up; he may then add program elements by simply selecting them. When an element is selected, the corresponding code is highlighted automatically. Lastly, after the programmer specifies all the nodes that he wants to add to the pattern, he can save the pattern, and the background algorithm will automatically induce a graph. The entire process typically takes one to three minutes.

Bug Pattern:

new_info = apr_palloc(p, sizeof(extension_info));
memcpy(new_info, base_info, sizeof(extension_info));

Fix Pattern:

new_info = apr_pmemdup(p, base_info, sizeof(extension_info));

Figure 6.2 Bug and fix patterns of Example 2

We explain each component in turn below:

CodeHighlight. We use the diff [22] tool together with a graph edit distance algorithm to extract modifications. We first use diff to extract statement-level changes. For each change hunk reported by diff, vertices that appear on the "-" lines are collected as vertexSet_b, and vertices that appear on the "+" lines are collected as vertexSet_f. The two vertex sets, along with their SDGs, are the inputs to a graph edit distance algorithm, Adjacency_Munkre [81]. CodeHighlight highlights both modified program elements and affected program elements, which are elements that are connected to changed program elements in the SDG.

Combining diff and Adjacency_Munkre enables us to characterize changes at a finer level of granularity. For example, if the statement “if(a==b)” is changed to “if((a==b)||(c==d))”, diff will flag this modification as a “change” (delete and add), whereas the graph edit distance algorithm will flag this as an “addition of a control point”. Moreover, the graph edit distance algorithm also helps to ensure that changes have semantic significance. For example, unlike diff, it does not consider splitting a line of code into two lines to be a change.
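The first, diff-based step can be sketched as follows. This is a simplified illustration: the maps from source line numbers to SDG vertices are assumed to be available from the SDG builder, and the Adjacency_Munkre edit-distance computation that follows is not shown.

def changed_vertex_sets(diff_text, buggy_line_to_vertices, fixed_line_to_vertices):
    """Collect vertexSet_b ('-' lines) and vertexSet_f ('+' lines) from a unified diff.

    `buggy_line_to_vertices` / `fixed_line_to_vertices` map a 1-based line
    number to the SDG vertices on that line; both maps are assumed to be
    produced elsewhere. Pairing the two sets via graph edit distance is omitted.
    """
    vertex_set_b, vertex_set_f = set(), set()
    old_line = new_line = 0
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Hunk header: @@ -old_start,old_len +new_start,new_len @@
            parts = line.split()
            old_line = int(parts[1].lstrip("-").split(",")[0])
            new_line = int(parts[2].lstrip("+").split(",")[0])
        elif line.startswith("-") and not line.startswith("---"):
            vertex_set_b.update(buggy_line_to_vertices.get(old_line, ()))
            old_line += 1
        elif line.startswith("+") and not line.startswith("+++"):
            vertex_set_f.update(fixed_line_to_vertices.get(new_line, ()))
            new_line += 1
        elif not line.startswith(("---", "+++")):
            old_line += 1    # context line: advances in both versions
            new_line += 1
    return vertex_set_b, vertex_set_f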

PatternEditGUI. This component provides a GUI for developers to specify the program elements they want to add to their bug or fix pattern. By right-clicking on a line in the code, a drop-down list will show all the program elements on that line. A developer can then click on check boxes to select program elements to add to or delete from the bug or fix pattern. The back end maintains the sets of program elements (vertices) belonging to the bug and fix patterns, PE_b and PE_f respectively, which are passed to PatternInduce to generate the bug and fix patterns.


PatternInduce. Given a subset VS of the vertices of a graph G, the subgraph of G induced by VS has VS as its vertex set and contains exactly those edges of G whose endpoints are both in VS. The algorithm PatternInduce takes the vertex sets PE_b and PE_f as inputs and extracts the corresponding vertex-induced subgraphs of the augSDG as the bug and fix patterns, respectively.
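A minimal sketch of PatternInduce's core operation, assuming the augSDG is available as an edge list; the actual tool of course works on CodeSurfer's data structures.

def pattern_induce(aug_sdg_edges, pattern_vertices):
    """Build the vertex-induced subgraph used as a bug or fix pattern (sketch).

    `aug_sdg_edges` is an iterable of (src, dst, label) triples from the augSDG,
    and `pattern_vertices` is the set PE_b or PE_f chosen through PatternEditGUI.
    The induced pattern keeps exactly the augSDG edges whose endpoints were both
    selected by the programmer.
    """
    vs = set(pattern_vertices)
    edges = [(a, b, lbl) for (a, b, lbl) in aug_sdg_edges if a in vs and b in vs]
    return vs, edges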

6.3.3 Applying the GADDI Algorithm

After we have the bug pattern and the augSDG of the fixed version VER_f, we use GADDI to identify any unfixed instances of the bug pattern in VER_f. There is one subtlety involved. Consider the common case in which the bug pattern is a subgraph of the fix pattern. In this case, a discovered match of the bug pattern might be part of an instance of the fix pattern, so reporting it as a bug instance would be a false alarm, which would decrease the precision of our approach. To handle this problem, we first check whether the bug pattern is a subgraph of the fix pattern. If so, two passes of queries are done. In the first pass, we search for all instances of the fix pattern and remove their occurrences from the augSDG of the fixed version; in the second pass, we search for all instances of the bug pattern in the remainder of the augSDG. By doing this, we ensure that the reported bug instances are not part of any fix instance. Since GADDI is a very fast algorithm, running it one more time does not significantly increase computation time.
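The two-pass logic can be sketched as follows. The helper functions match_all (standing in for GADDI), is_subgraph, and the graph/embedding methods are assumptions of this sketch, not the actual interfaces.

def propagate_fix(augsdg, bug_pattern, fix_pattern, match_all, is_subgraph):
    """Two-pass search for unfixed bug-pattern instances (sketch).

    `match_all(graph, pattern)` stands in for GADDI and returns every embedding
    of `pattern` in `graph`; `is_subgraph(p, q)` tests whether p is a subgraph
    of q. Both, along with the graph/embedding methods below, are placeholders.
    """
    graph = augsdg
    if is_subgraph(bug_pattern, fix_pattern):
        # Pass 1: remove every fix instance so it cannot be reported as a bug.
        for embedding in match_all(graph, fix_pattern):
            graph = graph.without_vertices(embedding.matched_vertices())
    # Pass 2: remaining matches of the bug pattern are candidate unfixed bugs.
    return match_all(graph, bug_pattern)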

6.4 EMPIRICAL EVALUATION


The goals of our empirical evaluation were to:

 Determine how many bug fixes our approach is applicable to.

 Determine how often bug fixes are incompletely propagated.

 Evaluate the precision and recall of our approach.

 Evaluate the efficiency of GADDI when applied to the bug propagation problem.

 Compare the approach with a baseline approach based on text search.

 Compare the new approach to our previous one.

6.4.1 Study design

We applied our approach to four open source projects: Apache Httpd [5], Net-snmp [67], Openssl [70], and Python [77]. All of them are large and mature. For each project, we first collected the bug fixes, denoted by BF, committed between the release of an older version V_O and a newer version V_N. After that, the first and second authors acted as developers to build bug and fix patterns from each bug fix. Finally, GADDI was applied to search for potential bugs in V_N. We submitted some of the discovered bugs to the project developers. The project versions used and their sizes are shown in Table 6.1. All computations were run on a Dell PowerEdge 2950 with two 3.0 GHz dual-core CPUs and 16 GB of main memory. The following subsections describe how we extracted bug fixes and how we evaluated precision and recall.


Table 6.1 Project Versions and Sizes

Project    V_O       V_N       LOC¹      Vertices²
Httpd      2.2.6     2.2.11    106366    159955
Python     2.5.2     2.6.2     121142    230958
Openssl    0.9.8c    0.9.8g    221930    356255
Snmp       5.3.2     5.3.3     199507    340375

¹ ² LOC and Vertices are averaged over V_O and V_N.

A Extraction of bug fixes

We extracted bug fixes using the SZZ algorithm [61][112]. We first downloaded the SVN [93] or CVS [18] logs from between V_O and V_N; then we used the SZZ algorithm to analyze the logs and discover bug fixes from them. For most of the bug fixes discovered, a bug number is indicated in the log. We built the bug and fix patterns according to the log message and, if a bug number was associated with the fix, the information in the bug database.

B Measurement of recall and precision

Recall. Recall is defined as the proportion of actual bug instances discovered among all latent bug instances in V_N. However, it is impossible to know the true number of latent bug instances in V_N, so we developed another method to estimate recall. We select the subset of the collected bug fixes, denoted by BF*, for which each fix was applied to the same bug in multiple places. An example of such a fix is adding a postcondition check at several calls to a certain function. For each bf* in BF*, we first build bug and fix patterns from an arbitrary instance bf_i of bf*. Then we go back to the buggy version V_O and see how many bug instances other than bf_i can be found using the bug and fix patterns built from bf_i. We denote the set of "hit" (found) bug instances by HitBI*, and we denote by BI* the set of bug instances in BF* minus the ones from which we built the patterns.

Recall is defined as follows:

Eq 6.1: $Recall = \frac{|HitBI^*|}{|BI^*|}$, where $|BI^*| = \sum_{bf^* \in BF^*} \left( |BugInstances(bf^*)| - 1 \right)$

Precision. Precision is defined as the ratio of the number of potential bug instances (PBI) to the number of all bug instances reported by our tool (RBI). Our preferred way to evaluate precision is to submit all reported bug instances to open source project developers and await their feedback.

However, the developers do not always reply and it was not practical to submit every bug instance we discovered to them anyway. Consequently, we examined each reported bug manually to determine whether it belonged to PBI or not.

We evaluate precision over two datasets, BF* and BF (the set of all the bug fixes collected between V_O and V_N):

Evaluation over BF*. The purpose is to evaluate precision and recall using the same dataset and the same procedure, in order to show the tradeoff between precision and recall. Assume that, using the procedure described in the Recall section, our tool reported a set of bug instances RBI*, and that we identified PBI* as the set of potential bug instances among them. We report the following two proportions:

Eq 6.2: $\%Hit^* = \frac{|HitBI^*|}{|RBI^*|}$

Eq 6.3: $\%PBI^* = \frac{|PBI^*|}{|RBI^*|}$

The first proportion is a lower bound on the estimated precision, since the instances in HitBI* are known to be real bug instances.

Evaluation over BF. The purpose is to examine a wider range of bug fixes, including fixes that occur in only one place. For this experiment we report the following proportion:

Eq 6.4: $\%PBI = \frac{|PBI|}{|RBI|}$
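For concreteness, the four estimates above can be computed directly from the corresponding instance sets; a small Python sketch follows (the set names are those defined in the text, and how the sets are collected is assumed to happen elsewhere).

def evaluation_metrics(hit_bi, bi, rbi_star, pbi_star, rbi, pbi):
    """Compute the estimates defined in Eq 6.1-6.4 from labeled instance sets."""
    recall       = len(hit_bi)   / len(bi)        # Eq 6.1
    pct_hit      = len(hit_bi)   / len(rbi_star)  # Eq 6.2
    pct_pbi_star = len(pbi_star) / len(rbi_star)  # Eq 6.3
    pct_pbi      = len(pbi)      / len(rbi)       # Eq 6.4
    return recall, pct_hit, pct_pbi_star, pct_pbi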

Moreover, we also analyze in detail the reasons for false positive bug instances, denoted FPBI.

Since the fixes in BF are propagated on V_N, we pick the most likely bug instances, denoted LBI, and submit them to the programmers.

6.4.2 Results

A Distribution of bug fixes

We analyzed the distribution of bug fixes in order to answer the following two questions: (1) Out of all the bug fixes collected, how many fixes does our tool apply to? That is, how many bug fixes can we build bug and fix patterns from? (2) How often do developers fail to propagate bug fixes?

Table 6.2 Bug Fixes

Project    Total    BFs applicable to               BFs not applicable to
                    total        ip/cp¹             total       ic/vc/ot²
Httpd      46       40 (0.87)    6/34 (0.15/0.85)   6 (0.13)    3/3/0 (0.50/0.5/0)
Python     68       48 (0.71)    8/40 (0.17/0.83)   20 (0.29)   12/4/4 (0.6/0.2/0.2)
Openssl    29       21 (0.72)    6/15 (0.29/0.71)   8 (0.28)    4/3/1 (0.5/0.38/0.12)
Snmp       15       13 (0.87)    3/10 (0.23/0.77)   2 (0.13)    1/1/0 (0.5/0.5/0)

¹ ip (incompletely propagated bug fixes), cp (completely propagated bug fixes)
² ic (insufficient context), vc (value change), ot (other)

The results of this analysis are summarized in Table 6.2. Regarding question (1), it can be seen that our approach is applicable to most of the bug fixes. When PatternBuild occasionally cannot build a pattern from a bug fix, it is usually due to insufficient context or a change to a literal constant. There is insufficient context when the bug fix adds some code that has few or no data or control dependences with the surrounding code. In such cases, it is hard for programmers to build the bug pattern, since it is not obvious which program elements are related to the bug fix. For example, one fix in the Python project adds the call "fprint(fp, …)" to output information to a file "fp", and the only affected node is the declaration vertex of the variable "fp". In such cases, there is not enough evidence to build bug patterns or fix patterns.


Table 6.3 Recall and Precision on BF*

Project    Recall (|HitBI*|/|BI*|)    %Hit* (|HitBI*|/|RBI*|)    %PBI* (|PBI*|/|RBI*|)
Httpd      87.5% (14/16)              15% (14/94)                58.5% (55/94)
Python     88% (37/42)                27% (37/137)               43% (59/137)
Openssl    75% (12/16)                3.8% (12/320)              57% (183/320)
Snmp       100% (4/4)                 16% (4/25)                 52% (13/25)

Table 6.4 Precision on BF

Project    RBI    PBI total    PBI LBI    FPBI total    FPBI cr-1/cr-2/ot¹    %PBI
Httpd      88     44           18         44            13/30/1               50%
Python     136    73           11         63            43/20/0               52%
Openssl    493    336          199        157           61/90/6               68%
Snmp       23     11           2          0             0/12/0                48%

¹ cr-1 (violation of Criterion 1), cr-2 (violation of Criterion 2), ot (other)

Changing the value of a literal constant is not reflected in the SDG. For example, a bug fix that changes "i=0;" to "i=1;" is not reflected in the SDG, since these two expressions have the same AST and thus receive the same label.


Regarding question (2), it can be seen that a significant number of bug fixes were incompletely propagated in each of the four projects evaluated, leaving bugs (or code exhibiting bad programming practices) in the code base. As a result, a tool like ours is needed to eliminate latent bug instances when a bug fix is being applied.

B Recall and precision

Recall. The second column of Table 6.3 shows the results of evaluating the recall of our approach using the method described earlier in this section. The mean recall over the four projects was 88%, which is excellent in this context. False negatives were mainly caused by labeling semantically equivalent vertices with different labels.

Precision. The results of evaluating precision on BF* are shown in Table 6.3, and those for BF are displayed in Table 6.4. The means of %Hit*, %PBI*, and %PBI are 15.4%, 52.6%, and 54.5%, respectively. %Hit* is a lower bound on precision, while %PBI* and %PBI are alternative, context-dependent estimates of precision. We determined PBI* and PBI manually, with the help of bug databases, documentation, revision logs, and mailing lists. In order to perform a fair experiment, we flagged a bug instance as a potential bug instance only if the following two criteria held:

 Criterion 1: the bug instance apparently exhibits the same semantics as the bug pattern;

 Criterion 2: given that Criterion 1 holds, the bug instance also involves the same kind of semantic problem as the bug from which the bug pattern was extracted (a counterexample will be shown later in this section).

Potential bug instances whose context is very similar to that of the buggy code used to build the bug pattern are flagged as likely bug instances (LBI in Table 6.4). For these instances, we have stronger evidence indicating that they are real bugs. But it is also worthwhile for programmers to examine the rest of the potential bug instances, since they are similar to known buggy code and may cause problems in the future.

We submitted to developers the likely bug instances that were not fixed in the latest project version. For Httpd and Python, we added comments and patches to the corresponding bug entries in the bug database instead of creating new bugs; for Openssl and Snmp, we sent the newly discovered bugs to the developer mailing list. In the Httpd project, 4 out of 18 potential bug instances were already fixed in the most recent trunk; we submitted the remaining 14 instances to programmers. They reopened one closed bug (39518) out of the four bugs we commented on, but did not respond about the other three (47753, 31440, 39722). Python programmers reopened two closed bugs (2620, 2722) and created a new bug (6873) for another commented bug (5705), for which a programmer applied a patch in a recent revision, and labeled the last one (3139) as "theoretically true, but may not happen in practice". The Snmp programmers were not sure about bug instances 2184039 and 1912647, and the Openssl programmers have not responded yet.

We now analyze the sources of the false positive bug instances (FPBI). False positive bug instances violate either Criterion 1 or Criterion 2. Violations of Criterion 1 are mainly caused by node labeling issues in the SDG. One labeling issue is that two nodes with the same semantics may be labeled differently. For example, the same predicate can be represented as an if statement, a switch-case branch, or a statement using the conditional operator "?:". When the bug pattern is a subgraph of the fix pattern, this causes some fix instances to be left erroneously in the augSDG, and as a result some reported bug instances are actually parts of those fix instances.

Another case of erroneous labeling occurs when two nodes with the same label have different semantics. For example, this happens when a function uses a variable-length argument list, such as "func(TYPE1 arg1, TYPE2 arg2, …)". In the definition of the function, the variable-length argument list is represented by one ellipsis "…", so there is only one formal input parameter for the list in the pDG of the function. This formal input parameter is data dependent on all actual input parameters in a variable-length argument list in a call to func(). As a result, all actual input parameters that are part of the argument list in a call to func() are given the same label, although they are not semantically equivalent.

Even if a reported bug instance satisfies Criterion 1, it may not be an actual bug, because it does not satisfy Criterion 2. For example, consider the following original bug fix:

keyToken = OPENSSL_malloc(…);
if (!keyToken)
{
+     ECerr(EC_F_EC_WNAF_MUL, ERR_R_MALLOC_FAILURE);

The call to ECerr() was added to report an error when OPENSSL_malloc() fails. The bug pattern can be described as "the error function ECerr() is not called when OPENSSL_malloc() fails." The following code exhibits the same omission as the bug pattern, since ECerr() is not called when OPENSSL_malloc(…) returns NULL:

hashBuffer = OPENSSL_malloc(…);
if (!hashBuffer)
{
+     CCA4758err(CCA4758_F_IBM_4758_LOAD_PUBKEY, ERR_R_MALLOC_FAILURE);

However, it does not cause a problem, since an alternative error function CCA4758err() is used to report the error.

C Efficiency of GADDI

Table 6.5 Efficiency of GADDI

Project    Index Build Time (sec)    Mean Query Time (sec)
Httpd      40                        3.0
Python     79                        6.7
Openssl    242                       9.4
Snmp       555                       19.5

Table 6.6 Text-Based Search

Project    Precision    Recall
Httpd      16%          100%
Python     5%           100%
Openssl    61%          94%
Snmp       35%          100%

When GADDI is used for subgraph matching, it first builds an index structure for the base graph; once the index is built, it can be reused for matching any subgraph against the same base graph. We evaluated the efficiency of GADDI by measuring the index build-time and the query time after the index was built. Table 6.5 summarizes the index build time and mean query time for the four projects. It can be seen that GADDI is efficient in practice; it takes only a few minutes to build the index and on average only a few seconds to query.

D Comparison with text-based code search

We compared our approach to the following text-based search procedure: (Step 1) a keyword is selected from the bug pattern to be propagated; this is the source code of the most distinctive vertex in the bug pattern, i.e., the vertex whose code has the fewest occurrences in the code base among all the vertices in the bug pattern. (Step 2) A text search for the keyword is performed and the results are examined.² The precision of this procedure was estimated using the set of incompletely propagated bug fixes (ip) indicated in Table 6.2. In order to make objective decisions, we determined whether each match was a bug or not by considering only Criterion 1 from subsection B. Note that this results in an overestimate of precision, since we did not consider Criterion 2. Recall was estimated using the procedure described in subsection B.

The precision and recall results are summarized in Table 6.6. As can be seen, the precision of the text search procedure was substantially lower than with our approach for all projects but Openssl.

One reason for this is that text search involves fewer constraints than graph search. For example, for Bug 39518 of Httpd, shown in the Introduction, we used memcpy as the keyword; this returned 441 matches in total. By contrast, the graph representation allowed more constraints to be specified, and it returned only 99 matches. Another reason for the superior precision of our approach is that when a bug pattern is a subgraph of the rule (fix) pattern, text search may find rule instances. Normally, the number of rule instances exceeds the number of bug instances if the rule is commonly applied. In such cases, many of the text search matches are rule instances, which leads to low precision. However, in the Openssl project, two such bug fixes were poorly propagated, resulting in more bug instances than rule instances. That is why the precision of text search was better for Openssl than for the other projects.

² Doing a keyword search might be the first thing programmers try in order to find additional bug instances. They might also do more sophisticated searches, such as using regular expressions, which we did not attempt to simulate.

The recall of the text-based search procedure was higher than that of our approach. Since text search has fewer constraints than our approach, it should be able to find more instances. The only exceptions occurred when the keyword was an expression and the missed bug instance used differently named variables in that expression. However, the number of returned matches is often far too large for programmers to examine them all. For example, for the Python project, text search returned 5353 matches, while our approach returned only 136 matches (Table 6.4). Text search is not feasible for bug fix propagation in such cases.

E Comparison with the previous study

Our previous approach [90] to bug-fix propagation was evaluated on two projects, Openssl [70] and Httpd [5]. We employed two more projects, Net-snmp [67] and Python [77], in the study reported here. The new study demonstrated that the new approach is superior in three important respects:

(1) The new approach applies to more fixes. Our previous approach can only be applied to bug fixes that add if statements or function calls. The proportions of such bug fixes were 17%, 24%, 24%, and 13% in Httpd, Python, Openssl, and Snmp, respectively. By contrast, the new approach applied to 87%, 71%, 72%, and 87% of the fixes, as shown in Table 6.2.


(2) Many more bug instances are discovered with the new approach. Our previous approach discovers bug instances by finding violations of an extracted fix pattern, using a heuristic graph matching algorithm due to Chang et al. [11]. That algorithm is not certain to find all violations, whereas GADDI is certain to find all occurrences of a bug pattern. We applied both approaches to the bug fixes in the Openssl project to which both are applicable. The previous approach discovered 16 bug instances altogether, whereas the new approach discovered 85. The precision of the new approach was 100%, whereas the precision of the previous approach was 94%.

(3) Computation time is significantly reduced. We used the same bug fixes to compare the total time spent by the two approaches in discovering bug instances. The previous approach took 1347 seconds altogether, whereas the new approach took only 31 seconds. Thus, the new approach is more than 43 times faster.

6.4.3 Threats to Validity

A Human factors involved in building patterns.

In the experiments we acted as programmers to build bug and fix patterns. The following human factors might have an impact on the experimental results:

(1) Knowledge of the bug fix.


Since programmers understand their own code better than we do, the patterns we built might not represent the programmers' intentions. To address this, we studied each bug fix carefully using the following information: revision logs, discussions in the bug database, documentation, and communications between programmers via mailing lists.

(2) Familiarity with the tool.

We are more familiar with our tool than programmers using it for the first time would be. However, since the tool is quite intuitive to use, we do not think this had a large impact on the evaluation.

B Sensitivity of the approach to patterns.

Since our approach relies on an exact graph matching algorithm, it is sensitive to the quality of the patterns. If a pattern is too specific, our approach is likely to miss real bug instances; if it is too general, our approach is likely to report more false positives. In the experiments, we tended to build more general patterns when we were not sure of the purpose of a bug fix. As a result, our evaluation might underestimate precision and overestimate recall.

C Evaluation of recall and precision.

We used the manually constructed BF* to evaluate recall, which might not be the most representative dataset. When evaluating precision, we relied on our manual labeling, which might be noisy. To address this, we collected evidence from code inspection, from reading documentation, and from the code history.

D Choice of subject projects.

The projects we chose for this work are quite large and mature, and they address different application areas. However, we may need to extend our experiments to more projects, for example, commercial projects or GUI-intensive projects, in order to better understand the approach's advantages and disadvantages.


CHAPTER SEVEN. CARIAL: COST-AWARE RELIABILITY IMPROVEMENT WITH ACTIVE LEARNING

Software testing and debugging are time-consuming and costly. This is especially the case for software that produces complex output, since, without an automated oracle, test output examination is done manually and adds significantly to the overall cost of the debugging process. Further, the selection of test cases to examine is not generally guided by estimates of the expected reduction in risk if a test case is analyzed. We consider the problem of maximizing expected risk reduction while minimizing the cost in terms of examining and labeling software output when selecting test cases to examine. To do this, we use a budgeted active learning strategy. Our approach queries the developer for a limited number of low-cost annotations to program outputs from a test suite. We use these annotations to construct a mapping between test outputs and possible defects and so empirically estimate the reduction in risk if a defect is debugged. Next, we select test cases from this mapping. These are chosen so that they are inexpensive to examine and also maximize risk reduction. We evaluate our approach on three subject programs and show that, with only a few low-cost annotations, our approach (i) produces a reasonable estimate of risk reduction that can be used to guide test case selection, and (ii) improves reliability significantly for all subject programs with low developer effort.

This problem is similar to the problem of using supervised learning techniques for ranking and filtering rules and violations discovered by defect mining, as shown in Chapter Four, where we needed to balance the cost of examining rules and violations against the gain in classification performance. Therefore, we adapt and extend this idea to balance the cost of software testing against the benefit of improved software reliability, using a machine learning framework called CARIAL.

7.1 INTRODUCTION

For a given software system or application, there are often substantial differences among potential test inputs in the costs associated with evaluating the input-output behavior they induce. Test evaluation typically requires developers either to check actual output manually or to determine expected output in advance so that it can be compared automatically to actual output. In general, the cost of evaluating a test increases with the size and complexity of the test input and the output it generates. Among test inputs that induce software failures, there are often large differences in the benefits obtained, in terms of improved operational reliability, by finding and repairing the software defects they reveal. This is because some defects are triggered more frequently in the field than others and because they cause failures of different severity.

Few proposed software testing techniques are guided explicitly by consideration of these differences in the costs and benefits associated with individual test inputs, apparently because it is often difficult to predict them prior to testing. However, testers do commonly favor small and relatively simple test cases [95][103], presumably to reduce the effort they must expend in constructing tests, checking their results, and debugging failures. This practice runs the risk of missing defects that will be triggered by more complex inputs from end users. The difficulty of evaluating complex outputs helps explain why developers do not routinely capture inputs in the field [89] when it is feasible, which would enable them to replay and check user executions themselves and to reuse the inputs for realistic regression testing. It is much more common for developers to rely mainly on users to discover and report field failures, though users often perform these tasks poorly [25].

In this work, we propose a novel, proactive approach to improving the reliability of software that has already been deployed, at least for beta testing, and that possibly has also been modified. The approach, which is called Cost-Aware Reliability Improvement with Active Learning (CARIAL), is intended to balance the costs of evaluating different outputs against the reliability improvements expected to result from fixing the defects they may reveal. It does so by guiding developers, using a cost-sensitive active learning technique, to investigate a sample of software executions induced by inputs captured in the field and to provide feedback to the learning algorithm. The learning algorithm uses this feedback to construct a mapping between test outputs and possible defects, which is used in turn to empirically estimate the reduction in risk if a defect is debugged. This mapping later guides programmers in selecting test cases so as to balance cost and risk reduction. Moreover, it may also help to decide whether sufficient effort has been spent on debugging and the software is ready for release.

CARIAL does not require feedback from end users (though it can exploit such feedback), and it does not require developers to assign specific costs to inputs or to software failures. It does require a means of roughly predicting the relative costs of checking individual test outputs, based on properties such as their sizes or those of the corresponding inputs. It also requires that a significant and representative sample of complete program inputs be captured in the field for subsequent reuse offline. Finally, it requires that developers check executions induced by a sub-sample of these inputs, group failures with common symptoms, and possibly assign levels of relative severity to those symptoms. We shall argue that in many scenarios, each of these requirements is reasonable and well justified by the potential benefits of our method.

Our work makes the following contributions:

 We propose a general framework, CARIAL, which is the first to balance the cost of test output examination against the benefit of reliability improvement.

 We propose a cost-sensitive active learning algorithm, CostHSAL, to estimate the reduction in risk if a defect is debugged, with the help of a limited amount of low-cost programmer feedback. We consider two factors in risk estimation: the severity of the defect and the probability that the defect is triggered in the field.

 We have conducted experiments on three software projects to show the effectiveness of the CARIAL framework. We have demonstrated that:

o With some feedback from programmers, CARIAL provides reasonable estimates of risk reduction.

o With CARIAL's risk estimates, programmers can improve reliability with relatively little effort.

7.2 OPERATIONAL DISTRIBUTION AND FAILURE RATES

The input domain of a program follows an operational distribution [104], which is the probability distribution that describes how the program is used in the field. Each point in the input domain has a probability of selection according to this distribution. The input domain contains one success region and several failure regions. The success region contains all inputs that allow the program to run normally. Each failure region contains the inputs that cause the program to fail due to a certain defect. Let Q be the input domain of the program. The input domain could be infinite, but we assume that only a finite number of inputs have nonzero probability. Let S be the success region, and let F_i be the failure region due to defect d_i. Assuming there are k underlying defects, Q can be decomposed as follows:

Q = S ∪ F_1 ∪ F_2 ∪ … ∪ F_k

Figure 7.1 (a) illustrates the input domain with a success region and several failure regions. Note that the F_i might overlap, since an input may trigger several defects and thus belong to several failure regions. We describe how we deal with this issue in Section 7.3.2D.


Figure 7.1 Illustration of the operational distribution: (a) success region and overlapping failure regions; (b) success region and non-overlapping failure regions

Based on the operational distribution, we could define the concepts of program failure rate and defect failure rate as follows:

Program failure rate: the probability of selecting an input that causes the program to fail. According to Figure 7.1 (a), and assuming that each input is equally likely to be selected, the program failure rate is defined as:

Eq 7.1: $\frac{|\bigcup_i F_i|}{|Q|}$

Defect failure rate: the probability of selecting an input that causes the program to fail due to a certain defect. According to Figure 7.1 (a), the defect failure rate due to defect d_i is defined as:

Eq 7.2: $dfr(d_i) = \frac{|F_i|}{|Q|}$

Our work applies to test cases that are captured in the field; therefore, they follow the operational distribution. We use active learning to obtain a reasonable estimate of the operational distribution and the defect failure rates, which are later used to guide programmers in selecting test cases that will yield greater reliability improvement.


7.3 THE CARIAL FRAMEWORK

In this section, we describe the CARIAL framework, which we propose to balance cost and benefit in reliability improvement. In general, the framework estimates the cost of evaluating test cases and the risk of potential defects, and then uses the resulting estimates in a test case selection scheme to guide programmers in the testing and debugging process. The framework is illustrated in Figure 7.2. It consists of three components: cost estimation, risk estimation, and test case selection. The cost estimation component estimates the cost of evaluating each test case's output. The risk estimation component uses a cost-sensitive active learning approach to estimate the success region and the failure regions, and it estimates the risk of each estimated failure region based on its estimated defect failure rate and severity. The test case selection component applies the information provided by the other two components to guide programmers in test case selection, such that the selected test cases are inexpensive to evaluate while providing large risk reduction. We describe each of the three components in detail below.

7.3.1 Cost estimation

Cost estimation involves predicting the time programmers will spend testing and debugging a test case. Testing and debugging is a complex procedure that requires running test cases, evaluating test output, and locating bugs. For software that produces complex output, test output evaluation is more costly than the other testing activities, since it is hard to automate and requires much effort from programmers. Therefore, in this work we provide cost estimation only for test output evaluation, and leave cost estimation of the other activities to future work.

Figure 7.2 Framework of CARIAL, consisting of cost estimation, risk estimation, and test case selection

The cost of test output evaluation is estimated as follows. Intuitively, programmers must spend more effort evaluating large and complex outputs, and less effort evaluating small and simple outputs. For example, if the output is a graph, programmers spend more time evaluating dense graphs than sparse graphs. Therefore, the cost should be approximately linear in the size or complexity of the test output. We therefore propose the following simple cost estimation scheme, a linear function with slope α:

Eq 7.3: $cost(t_i) = \alpha \cdot |output(t_i)|$ or $cost(t_i) = \alpha \cdot complexity(output(t_i))$
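A minimal Python sketch of this cost model follows; the slope alpha and the optional complexity measure are parameters chosen by the tester, not values prescribed by the approach.

def estimate_cost(test_output, alpha=1.0, complexity=None):
    """Estimate the output-evaluation cost of a test case (Eq 7.3, sketch).

    `complexity`, if given, is a caller-supplied function measuring output
    complexity (e.g. graph density); otherwise the raw output size is used.
    """
    size = complexity(test_output) if complexity else len(test_output)
    return alpha * size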

7.3.2 Risk Estimation of potential defects

A Definition of risk

Intuitively, the risk of a defect depends on two major factors: the severity level of the defect and the defect failure rate. The severity level is determined by the nature of the defect. For example, a defect that causes a cosmetic issue is less severe than a defect that causes a deadlock. Most bug database systems provide built-in severity levels such as trivial, normal, critical, or blocker. The defect failure rate is the other important factor: defects with a high failure rate cause the software to fail frequently, and therefore they carry higher risk.

We define the risk of a defect as the expected loss incurred if the defect remains in the released software. We assume that the loss incurred by a defect is directly associated with its severity level; therefore, the loss due to a defect d_i can be written as ℒ(severity(d_i)), where ℒ is a loss function and severity(d_i) is the only parameter used to decide the loss of the defect. The risk of a defect is thus defined as:

Eq 7.4: $\mathcal{R}(d_i) = \mathcal{L}(severity(d_i)) \cdot dfr(d_i)$

The risk of the whole program is thus:

Eq 7.5: $\sum_i \mathcal{R}(d_i)$

Obviously, with the limited manual effort that can be spent on software testing and debugging, one would prefer to choose test cases from high-risk failure regions in order to reduce the total risk of the program and thus improve overall software reliability. We next describe how a cost-sensitive active learner can be used to estimate the risk of individual defects.
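The risk definitions in Eq 7.4 and Eq 7.5 translate directly into code. In the following Python sketch, the loss function is an assumption (for example, a lookup table from severity level to a numeric loss) and the defect failure rates are taken as given.

def defect_risk(severity, dfr, loss):
    """Risk of a single defect: R(d_i) = L(severity(d_i)) * dfr(d_i) (Eq 7.4)."""
    return loss(severity) * dfr

def program_risk(defects, loss):
    """Total program risk as the sum of the per-defect risks (Eq 7.5).

    `defects` is an iterable of (severity, dfr) pairs; `loss` maps a severity
    level to a numeric loss, e.g. a hypothetical table such as
    {"trivial": 1, "normal": 5, "critical": 20, "blocker": 50}.
    """
    return sum(defect_risk(severity, dfr, loss) for severity, dfr in defects)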

B Cost-sensitive active learner

(1) Intuition on using cost-sensitive active learning.

According to Eq 7.4, we need to know the severity level and the failure rate of a defect in order to estimate its risk. If, as is often the case, an automatic test oracle does not exist, programmers need to evaluate some test cases in order to obtain estimates of defect failure rates and severities. Intuitively, programmers or testers could use random sampling to obtain a sample of test cases, and then they could: (1) observe the outputs of the test cases to decide whether they succeeded or failed; (2) for failed test cases, in many cases group the test cases into different failure regions according to their failure symptoms, assuming that test cases triggering the same defect share the same symptoms; and (3) assign severity levels according to the symptoms. By following such a procedure, they can estimate the failure rate and severity level of each estimated failure region.

A problem with random sampling is that a relatively large sample is needed to obtain an accurate estimate. For operational testing, it is likely that failed test cases are relatively infrequent, making it more difficult to estimate failure regions using random sampling. Moreover, random sampling will not take the cost of evaluation into consideration, and hence it is likely to select test cases that are too costly to evaluate (unless they are rare).

Based on the above arguments, we choose to use active learning and cost-sensitive analysis to estimate the operational distribution. Active learning can infer the class labels (i.e., failure regions) of all test cases using as few queries as possible. If cost estimates for test cases are available, we can use them to further reduce the cost of test case evaluation by preferring test cases that are cheaper to evaluate.

(2) Getting feedback from programmers.

We ask programmers to provide two pieces of information about a test case output: a symptom profile and a severity. Severity can be specified directly using bug database severity levels such as "trivial", "normal", or "critical". We describe symptom profiles next.

As the name indicates, a symptom profile describes the symptoms of a failure. Symptom profiles can be provided as symptom vectors or as symptom signatures. Symptom vectors can be provided through a well-designed user feedback mechanism. Some software provides an advanced user feedback mechanism that allows users to answer a set of pre-defined questions customized to the particular software. For example, the JavaPDG project [44] developed in our group provides such functionality. The output of JavaPDG is a complex graph. Programmers can review an erroneous output graph and provide information on problematic nodes or edges using the user feedback panel that comes with the software; a screenshot of the user feedback panel is shown in Figure 7.3. Since the questions are pre-defined, it is easy to extract the feedback into a feature vector. Secondly, if a structured user feedback mechanism does not exist, a symptom profile can be provided as a symptom signature, which is an easily checkable property of the failures caused by a defect. Sometimes such a property can even be checked automatically. For example, we use ROME [82], an RSS reader, as a subject program in our experiments. The symptom of defect-37 [83] can be described as " is a relative URL instead of an absolute one". Burger et al. [10] used a similar idea for non-crashing bugs: they inserted predicates into the code to describe the symptoms of a failure, such as "attribute name of object with id 13 has value "UTC"", and the predicate serves as a signature for the symptom.

Figure 7.3 Screenshot of the JavaPDG program. The left window shows the program under analysis; the middle window shows the visualization of the system dependence graph of the program; the right window shows the user feedback panel.

Symptom profiles serve as class labels of test cases. That is, test cases that share the same symptom profiles will be assigned the same class label, and they will be put into the same failure region.

C Learning algorithm in a nutshell.

We give a brief introduction to the cost-sensitive active learner here, leaving the details to Section 7.4. For each test case, execution profiles are collected and used by the learning algorithm to group test cases that share the same underlying defects. Execution profiles can be function coverage, line coverage, branch coverage, etc. The learner works as follows: we use active learning to iteratively select some unlabeled test cases and present them to programmers. Test cases are selected based on the currently estimated failure regions and cost-sensitive analysis, such that they are cheap to label and their class labels help improve estimation accuracy. After reviewing the test outputs, programmers provide symptom profiles and the corresponding severities of the selected test cases.

This information is used to further update the estimated failure regions. The iteration stops when the budget runs out and we cannot afford any more manual labor.

The learner we use is called CostHSAL; it is derived from the original Hierarchical Sampling for Active Learning (HSAL) algorithm [20]. We introduce the algorithm in detail in Section 7.4.

D Risk estimation

The cost-sensitive active learner groups test cases into an estimated success region (Ŝ) and several estimated failure regions (F̂_i). Each F̂_i is associated with a severity level. Assuming that there are n test cases (not necessarily distinct) and that L is a customized loss function, the risk of each failure region is estimated as:

Eq 7.6: $\hat{\mathcal{R}}(\hat{F}_i) = \mathcal{L}(\widehat{\mathrm{severity}}(\hat{F}_i)) \cdot \frac{|\hat{F}_i|}{n}$

Note that there is one subtlety with the above estimation. As mentioned in Section 7.2, failure regions might overlap, since some test cases trigger a set of defects rather than a single defect. Test cases triggering different sets of defects are likely to have different symptoms. Since symptom profiles serve as class labels of test cases, the estimated failure regions F̂_i form a set of non-overlapping failure regions (NOL FRs), each of which corresponds to a set of defects rather than a single defect. This is illustrated in Figure 7.1 (b). Likewise, the severity of F̂_i is the highest severity of its underlying defects.

7.3.3 Test case selection

We propose a two-stage test case selection scheme based on the cost estimates of test cases and the risk estimates of possible defects, aiming to reduce risk at a fast rate. In the first stage, we rank the estimated failure regions by estimated risk from high to low. In the second stage, we rank the test cases within each failure region by their estimated cost. Using these two rankings, programmers can first select a failure region from the region list and then select a low-cost test case within that region. Once they have observed enough failed test cases to debug the defects in the current region, they can move on to the next region and continue working on other defects.

In the test case selection step, we allow programmers to reuse labeled test cases. These test cases are assigned cost 0, since their outputs have already been evaluated. Therefore, when test cases are ranked within each estimated failure region, the labeled test cases are always ranked at the top of the list.
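As an illustration only, the following is a minimal sketch of this two-stage selection scheme, assuming that each estimated failure region carries a risk estimate, each test case carries a cost estimate, and already-labeled test cases are given cost 0; the data structures and names are hypothetical, not part of the original implementation.

    def order_test_cases(regions, labeled):
        """Two-stage ordering: regions by estimated risk (descending),
        test cases within each region by estimated cost (ascending).

        regions: list of dicts {"risk": float, "tests": [(test_id, cost), ...]}
        labeled: set of test ids whose outputs were already evaluated (cost 0).
        """
        plan = []
        # Stage 1: rank estimated failure regions by risk, highest first.
        for region in sorted(regions, key=lambda r: r["risk"], reverse=True):
            # Stage 2: rank test cases within the region by cost, cheapest first;
            # labeled test cases get cost 0 and therefore float to the top.
            tests = [(tid, 0.0 if tid in labeled else cost)
                     for tid, cost in region["tests"]]
            plan.extend(tid for tid, _ in sorted(tests, key=lambda t: t[1]))
        return plan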

7.4 THE COST-SENSITIVE ACTIVE LEARNER: COSTHSAL

In this section, we first explain our motivation for using the HSAL (Hierarchical Sampling for Active Learning) algorithm [20] in the CARIAL framework. Then we describe how we modified it into CostHSAL to incorporate cost-sensitive analysis.

7.4.1 Motivation for using HSAL

We have two concerns in applying active learning to infer failure regions. (1) Most active learning algorithms are applied to classification problems in which the number of classes is known. However, we have no idea how many underlying defects there are; even if we did, we still would not know how many non-overlapping failure regions there are, as illustrated in Figure 7.1 (b). (2) Most algorithms assume that data with the same class label are similar to each other, that is, close to each other in the feature space. However, this is not necessarily true for test cases with execution-profile features.

For example, successful test cases all share the same label, yet they may have very different features: the ROME project uses different classes and functions to parse different types of feeds, such as RSS 1.0, RSS 2.0 and Atom 1.0, so successful test cases that parse different types of feeds form several clusters. The same holds for groups of failed test cases. For example, if a defect appears in a commonly used function, such as a print function, test cases may have very different code coverage even if they cover the same defective function(s).

For these reasons, we choose the HSAL algorithm. As its name suggests, it is an active learning algorithm that exploits the cluster structure in data. The algorithm works with a hierarchical clustering of the unlabeled dataset, and its goal is to quickly find a pruning (or cut) of the cluster tree, that is, a partition of the dataset, such that its constituent clusters are fairly pure in their class labels. Although the algorithm is designed for classification, there is no need to specify the number of classes: we can start with a single class and update the number of classes each time the user returns a new class label. Moreover, the algorithm can handle the case in which data with the same class label form several different clusters. We describe the HSAL algorithm below, followed by how we revised it into CostHSAL.

Figure 7.4 Example cluster tree

7.4.2 HSAL (Hierarchical Sampling for Active Learning)

The framework of the algorithm is as follows. It first takes the root of a hierarchical clustering tree as the current pruning P, and works through the hierarchy in a top-down manner. Given the current pruning P, the algorithm samples some nodes to query for class labels using a two-stage sampling scheme, and updates the related statistics. Based on these statistics, the algorithm identifies clusters that are impure, that is, composed of different classes. These clusters are split into smaller and purer clusters based on the cluster tree, yielding an updated P. The process continues until the clusters in the current pruning are fairly pure, or until the budget is exhausted and we cannot afford further queries.

The framework is shown in Algorithm 1.

We now describe in detail the two-stage sampling scheme used in the algorithm. Assume that there are n unlabeled data points. In the first stage (SelectCluster), the algorithm chooses a cluster v_i in P with probability proportional to ω_i (1 − p_i), where ω_i is the weight of v_i, that is, ω_i = |v_i| / n, and p_i is the fraction of the majority label among the labeled examples in v_i. The closer p_i is to one, the more likely it is that v_i is a pure cluster. Therefore, this selection scheme favors clusters that are relatively large and impure. In the second stage (SampleNode), a node is sampled from the cluster subtree rooted at v_i: starting from v_i, a child ch of the current node is selected with probability proportional to the number of unlabeled data points in ch, and the iteration continues until a leaf node is reached, which is then returned.
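As a toy illustration of the first-stage selection rule only, the following sketch draws a cluster index with probability proportional to ω_i (1 − p_i); the cluster sizes and majority-label fractions are invented, not taken from the experiments.

    import random

    def select_cluster_hsal(clusters, n, rng=random):
        """clusters: list of (size, majority_fraction) pairs for the current pruning P."""
        # Weight each cluster by omega_i * (1 - p_i): large, impure clusters dominate.
        weights = [(size / n) * (1.0 - p) for size, p in clusters]
        return rng.choices(range(len(clusters)), weights=weights)[0]

    # Three clusters over n = 100 points: the large impure cluster (index 0) is
    # selected far more often than the small or nearly pure ones.
    print(select_cluster_hsal([(60, 0.5), (10, 0.5), (30, 0.95)], n=100))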

We now show an example using Figure 7.4. At the beginning we have P = {v1}. After sampling some tests and determining their labels, we find that v1 is impure, so we split v1 into v2 and v3. Since the sampling scheme favors impure clusters, the algorithm is likely to sample more from v2 than from v3. A few more samples will reveal that v2 is quite impure and needs to be split. Now we have obtained a good pruning P = {v4, v5, v3}. If the budget allows, we could go further and split v3 into v6 and v7 to get a perfect clustering.

7.4.3 CostHSAL

To use HSAL in the CARIAL framework, we need to revise it to incorporate cost-sensitive analysis. We first define the mean sampling cost (MeanSampCost) of a cluster: given a cluster v, this is the average cost of the nodes sampled by SampleNode(v). In the first stage, we penalize ω_i (1 − p_i) by the mean sampling cost of the cluster, so that clusters that are large and impure but also cheap to label on average are selected. An intuitive approach is to select a cluster v with probability proportional to ω_i (1 − p_i) / MeanSampCost(v). However, since ω_i (1 − p_i) is a probability and MeanSampCost(v) is not, it is hard to put these two quantities on the same scale. We therefore propose a heuristic scheme based on ranking clusters: we rank the clusters according to ω_i (1 − p_i) and walk down the list until the MeanSampCost starts to increase, returning the cluster with the lowest cost observed so far. In the second stage, we use an iterative procedure similar to that of HSAL, except that in each iteration we select a child ch with probability proportional to TotalCost(v) / MeanSampCost(ch), so that a child with lower mean sampling cost is more likely to be selected. The entire algorithm is shown in Algorithm 1.

Algorithm 1 CostHSAL
Input: hierarchical clustering of all unlabeled points; batch size B; budget G
Algorithm:
    P ← {root}                 (current pruning of the tree)
    L(root) ← 1                (arbitrary label for the root of the tree)
    Repeat until budget G is exhausted:
        For i = 1 to B:
            v ← SelectCluster(P)
            node ← SampleNode(v)
            L ← Query(node)
            Update the related statistics according to the label
        End For
        Split impure clusters in the current P
        Update P to the current pruning
    End Repeat
    For each cluster v_i ∈ P, assign all of its members the majority label
Functions:
    SelectCluster(P):
        Rank the clusters according to ω_j (1 − p_j)
        c_min ← ∞; v_min ← NULL
        Repeat:
            Remove v from the top of the list
            c ← MeanSampCost(v)
            if c < c_min then c_min ← c; v_min ← v; continue
            else return v_min
        End Repeat
    SampleNode(v):
        tmp ← v
        Repeat:
            Select a child ch of tmp with probability proportional to TotalCost(tmp) / MeanSampCost(ch)
            if ch is a leaf node then return ch
            else tmp ← ch; continue
        End Repeat
    MeanSampCost(v):
        Call SampleNode(v) k times and return the average cost
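For concreteness, here is a compact sketch of the CostHSAL sampling primitives in Python, assuming a cluster tree with per-node lists of member test cases and per-test-case cost estimates; the class and attribute names are illustrative rather than the original implementation, and the child-selection weight follows the TotalCost/MeanSampCost heuristic described above. Embedding these functions in the batch/budget loop of Algorithm 1 (query, record labels, split impure clusters) yields the full learner; in practice the mean sampling costs would be cached rather than recomputed.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Cluster:
        tests: list                                   # member test-case ids
        cost: dict                                    # test id -> estimated evaluation cost
        children: list = field(default_factory=list)  # child Cluster objects (empty for leaves)
        labels: dict = field(default_factory=dict)    # queried test id -> symptom-profile label

        def weight(self, n):
            # omega_i = |v_i| / n
            return len(self.tests) / n

        def majority_fraction(self):
            # p_i: fraction of the majority label among queried members (0 if none queried yet)
            if not self.labels:
                return 0.0
            counts = {}
            for lab in self.labels.values():
                counts[lab] = counts.get(lab, 0) + 1
            return max(counts.values()) / len(self.labels)

        def total_cost(self):
            return sum(self.cost[t] for t in self.tests)

    def mean_samp_cost(v, k=5):
        # Average cost of k nodes drawn by sample_node(v).
        return sum(v.cost[sample_node(v)] for _ in range(k)) / k

    def sample_node(v):
        # Walk down the subtree, preferring children that are cheap to label on average.
        tmp = v
        while tmp.children:
            weights = [tmp.total_cost() / max(mean_samp_cost(ch), 1e-9)
                       for ch in tmp.children]
            tmp = random.choices(tmp.children, weights=weights)[0]
        return random.choice(tmp.tests)

    def select_cluster(pruning, n):
        # Rank by omega_i * (1 - p_i); walk the list until MeanSampCost starts to rise,
        # returning the cluster with the lowest mean sampling cost observed so far.
        ranked = sorted(pruning,
                        key=lambda c: c.weight(n) * (1.0 - c.majority_fraction()),
                        reverse=True)
        best, best_cost = None, float("inf")
        for v in ranked:
            c = mean_samp_cost(v)
            if c < best_cost:
                best, best_cost = v, c
            else:
                break
        return best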

7.5 EMPIRICAL STUDY


We conducted an empirical study to assess the effectiveness of the CARIAL framework in balancing cost and benefit in reliability improvement. The study addresses the following two research questions.

• RQ-1: How accurate is the CARIAL framework in risk estimation?

• RQ-2: With a risk estimation and cost estimation provided by CARIAL, how effective is the test case selection scheme in reducing overall risk?

7.5.1 Dataset

We applied our approach to three open source software systems, JavaPDG [44], ROME [82] and Xerces2 [108], all of which produce complex output and for which an automated oracle does not exist.

JavaPDG [44] is a software system developed by our group; it performs interprocedural program analysis on Java programs. Its inputs are Java programs, and its outputs are System Dependence Graphs [40], in which vertices are program elements such as declarations, expressions or predicates, and edges are data or control dependences.


Table 7.1 Summary of Subject Programs

Program   #Test Cases   %Failures   #Defects   #NOL FRs   Cost Estimation
JavaPDG   1295          3.9%        9          12         α · #Edges
ROME      5425          5.2%        6          13         α · Size of output XML
Xerces2   4773          5.4%        8          14         α · Size of output XML

ROME [82] is an open source Java library for parsing, generating and publishing RSS and Atom feeds. We constructed a testing program with the APIs provided by the ROME library; it takes RSS and Atom feeds as input, parses the feeds, and writes the parsed feeds to an XML file.

Xerces2 [108] is an open source Java XML parser. As for ROME, we wrote a testing program that takes an arbitrary XML file as input, parses the file, and then outputs the parsed XML.

A summary of the dataset is presented in Table 7.1. Details of the dataset are explained in the following sections.

A Test case collection

We captured test inputs in the field for each of the three projects. For JavaPDG, we used functions in the Spring Framework [88] as inputs to the program; altogether 1295 test cases were collected. For the ROME and Xerces2 projects, we reused the test cases from Augustine et al.'s work [1]. For the ROME project, 8,000 Atom and RSS files were downloaded from Google Search results [30] using a custom web crawler as test inputs. For the Xerces2 project, 9,630 files were collected from the system directories of an Ubuntu Linux 7.04 [107] machine and from Google Search results.

B Defects

We collected defects from the bug databases [45][83][109] of the three projects: 9 defects for the JavaPDG program, 6 for the ROME project and 8 for the Xerces2 project. For each defect, a failure checker was inserted into the program. The JavaPDG test set triggered 3.9% failures, while the ROME and Xerces2 test sets triggered more than 30% failures. In operational testing, it is unlikely that the test set would exhibit such a high failure rate, so to make the setting more realistic we reduced the failure rate to 5.2% for the ROME project and to 5.4% for Xerces2. For each of the projects, some test cases trigger multiple defects, so the failure regions overlap; there are 12, 13 and 14 non-overlapping failure regions (NOL FRs) in the JavaPDG, ROME and Xerces2 test sets, respectively.

To better demonstrate the strength of our approach, we assigned high severity levels to defects with low failure rates, and low severity levels to defects with high failure rates. This mimics the situation in which a defect is rarely triggered but causes severe damage, while commonly triggered defects are less important. We will show that in such situations the CARIAL framework outperforms common testing approaches.

C Execution profiles

We used function coverage profiles as execution profiles in our experiment, collected with the Java Interactive Profiler [43]. The number of times each function is invoked in an execution is recorded, along with a binary indicator of whether the function is covered at all. Therefore, the execution profile of a program run is represented by a feature vector whose length is twice the number of functions.
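As a minimal illustration, assuming a fixed ordering of the program's functions and a per-run map from function name to invocation count (the function names below are hypothetical), the profile vector described above could be assembled as follows.

    def profile_vector(functions, invocation_counts):
        """Build an execution-profile feature vector of length 2 * len(functions):
        the invocation count of each function, followed by a 0/1 coverage flag."""
        counts = [invocation_counts.get(f, 0) for f in functions]
        covered = [1 if c > 0 else 0 for c in counts]
        return counts + covered

    # Usage sketch: one run of a program with three functions.
    functions = ["parseFeed", "resolveUri", "writeXml"]
    run = {"parseFeed": 4, "writeXml": 1}
    assert profile_vector(functions, run) == [4, 0, 1, 1, 0, 1]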

D Symptom profiles

For the JavaPDG project, we used the built-in user feedback panel to specify symptom vectors as symptom profiles. For the ROME and Xerces2 projects, we used symptom signatures as symptom profiles. For all three projects, the symptoms are distinct for the individual non-overlapping failure regions.

7.5.2 Methodology

A Cost Estimation

As described in Section 7.3.1, we assumed that the cost of evaluating a test output is linear in the size or complexity of that output. For the JavaPDG project, we assumed that the cost is linear in the number of edges of the output system dependence graph. For the ROME and Xerces2 projects, we assumed that the cost is linear in the size of the output XML file in bytes. This is summarized in Table 7.1.

B Risk Estimation

In our experiment, we used the built-in importance levels of Bugzilla [2] as our severity levels. There are seven pre-defined importance levels, namely enhancement, trivial, minor, normal, major, critical and blocker. We assigned severity levels from 1 to 7 according to these importance levels.


We defined a loss function L for a defect as follows:

Eq 7.7: $\mathcal{L}(d_i) = \mu \cdot 2^{\mathrm{severity}(d_i)}$

Here μ is the unit dollar cost incurred by a defect, and the definition assumes that the loss increases exponentially with the severity level. According to Eq 7.6 and Eq 7.7, the risk of an inferred failure region is then:

Eq 7.8: $\hat{\mathcal{R}}(\hat{F}_i) = \mu \cdot 2^{\widehat{\mathrm{severity}}(\hat{F}_i)} \cdot \frac{|\hat{F}_i|}{n}$
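A small numerical sketch of Eq 7.7 and Eq 7.8 follows; the region sizes and severities are made up for illustration, and μ = 1000 matches the unit cost used later in the experiments.

    def loss(severity, mu=1000.0):
        # Eq 7.7: loss grows exponentially with the severity level (1..7).
        return mu * 2 ** severity

    def region_risk(severity, region_size, n, mu=1000.0):
        # Eq 7.8: loss at the region's (highest) severity level,
        # weighted by the region's estimated failure rate |F_i| / n.
        return loss(severity, mu) * region_size / n

    # Example with 5000 test cases: a severity-6 region with 25 failures
    # carries more risk than a severity-2 region with 300 failures.
    print(region_risk(6, 25, 5000))   # 320.0
    print(region_risk(2, 300, 5000))  # 240.0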

C Test Case Selection

The test case selection experiment is affected by two parameters: the cost budget for risk estimation, denoted B, and the number of failures programmers need to review in order to successfully remove a defect, denoted X. The former is provided by programmers to decide how much effort they are willing to spend on test case evaluation for risk estimation; it is directly associated with the accuracy of risk estimation. The latter is related to risk reduction: assuming that programmers need to review X test cases in order to successfully fix a defect d_i, the overall risk of the program is reduced by R̂(d_i) once X failed test cases caused by d_i have been reviewed. In our experiment, we assume that X follows a Poisson distribution with mean λ.

D Baseline Approaches


We use two baseline approaches for comparison with the CARIAL framework: random sampling and smallest-first sampling. Smallest-first sampling always selects an unobserved test case with the smallest cost, which mimics programmers' practice in the real world.

In the risk estimation step, random sampling or smallest-first sampling is used to obtain a sample for programmers to evaluate. The proportion of each class of failures in this labeled sample is used directly as its failure rate, and risk is then computed using Eq 7.8. In the test case selection step, since a partitioning of test cases is available only for this labeled sample, one can first select test cases from the labeled sample according to the selection scheme proposed in Section 7.3.3; then random sampling or smallest-first sampling is used to explore the rest of the test cases.
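For reference, the two baseline selection rules amount to the following sketch, where `costs` is assumed to map the remaining unobserved test ids to their estimated evaluation costs.

    import random

    def smallest_first(costs):
        # Always pick the cheapest unobserved test case.
        return min(costs, key=costs.get)

    def random_choice(costs, rng=random):
        # Pick an unobserved test case uniformly at random.
        return rng.choice(list(costs))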

7.5.3 RQ-1: Accuracy of Risk Estimation

We define the error of risk estimation (ERE) as the absolute difference between the estimated overall risk and the actual risk, namely:

Eq 7.9: $ERE = \left| \sum_i \hat{\mathcal{R}}(\hat{F}_i) - \sum_i \mathcal{R}(d_i) \right|$

We used $1000 as the unit dollar cost μ incurred by a defect.


Figure 7.5 RQ-1: The Cost-Against-ERE Curves. Red-CARIAL; Green-Random Sampling; Black-Smallest First Sampling

To evaluate the accuracy of risk estimation, we plotted the Cost-Against-ERE curves for CARIAL, random sampling and smallest-first sampling on all three projects; they show how ERE decreases as more test cases are evaluated. The x-axis represents the percentage of the cumulative total cost spent so far, so the constant α used in cost estimation cancels out. The curves are shown in Figure 7.5. For the JavaPDG and Xerces2 projects, CARIAL is the most accurate of the three for most cost values, and its ERE decreases faster as more test cases are evaluated. For the ROME project, the ERE for CARIAL decreases a little more slowly than that of smallest-first sampling at the beginning, but catches up once more than 20% of the cost has been spent. This is likely because the failed test cases in ROME have relatively small sizes.

7.5.4 RQ-2: Effectiveness of Test Case Selection on Risk Reduction


Figure 7.6 RQ-2: The Cost-Against-Risk-Reduction curves for JavaPDG with different B and λ. (a) λ = 3, B = {0.05, 0.10, 0.15}; (b) λ = 10, B = {0.05, 0.10, 0.15}. Red-CARIAL; Green-Random Sampling; Black-Smallest First Sampling

To evaluate the effectiveness of test case selection, we plotted the Cost-Against-Risk-Reduction curves, which show how risk is reduced as more test cases are evaluated following the proposed test case selection scheme. The performance of test case selection is affected by the two parameters described in Section 7.5.2C: the cost budget B and the mean λ of the Poisson distribution. We plotted the curves for different combinations of values of the two parameters. For the cost budget B, we chose three values, 0.05, 0.10 and 0.15, since programmers are unlikely to spend too much effort on risk estimation. For λ, we chose two values, 3 and 10, to show how the three approaches behave when different numbers of test cases must be evaluated in order to remove a defect. The Cost-Against-Risk-Reduction curves for JavaPDG, ROME and Xerces2 are shown in Figure 7.6, Figure 7.7 and Figure 7.8, respectively.

Figure 7.7 RQ-2: The Cost-Against-Risk-Reduction curves for ROME with different B and λ. (a) λ = 3, B = {0.05, 0.10, 0.15}; (b) λ = 10, B = {0.05, 0.10, 0.15}. Red-CARIAL; Green-Random Sampling; Black-Smallest First Sampling

Below we use the JavaPDG program as an example to explain the effect of the two parameters on risk reduction performance; the other two projects show similar trends.


Figure 7.8 RQ-2: The Cost-Against-Risk-Reduction curves for Xerces2 with different B and λ. (a) λ = 3, B = {0.05, 0.10, 0.15}; (b) λ = 10, B = {0.05, 0.10, 0.15}. Red-CARIAL; Green-Random Sampling; Black-Smallest First Sampling

Effect of λ. If λ is larger, meaning that more test cases must be evaluated in order to remove a defect, the CARIAL framework is less affected. This is because CARIAL produces a partitioning of the test cases with relatively good accuracy, so test cases for a given possible defect can be retrieved directly from the corresponding failure region; the baseline approaches, lacking such a partitioning, are affected more. This can be seen qualitatively in Figure 7.6. For example, at a cost of 10%, the black curve representing smallest-first sampling and the green curve representing random sampling are "lifted" quite a bit when λ increases to 10, whereas the red curve representing CARIAL stays roughly stable for both values of λ.

Effect of B. With a larger budget, active learning can make a better prediction of the operational distribution, and thus a more precise risk estimate. This trend can be seen in Figure 7.6: with λ fixed at 3 or 10, the risk curve for CARIAL decreases at a faster rate than those of the other two approaches as B increases.

7.5.5 Threats to Validity

Cost Estimation. We proposed a simple linear cost estimation scheme in the current framework. It is reasonable in that the time spent on test case evaluation is directly associated with the size or complexity of the test output; however, more sophisticated and realistic models that take other factors into account should be considered in the future.

Risk Estimation. Our loss function assumes that the loss increases exponentially with the severity level, to reflect that high-severity bugs cause much more damage than low-severity bugs. More sophisticated models might estimate risk more realistically in practice.

Assumptions on Symptom Profiles. The defects we collected in all three projects exhibit distinct failure symptoms. However, there may be cases in which one defect triggers different symptoms, or two defects share the same symptom. This could be addressed by collecting and studying more defects with such symptoms.


CHAPTER EIGHT. CONCLUSIONS AND FUTURE WORK

This thesis focused on improving precision and reducing cost for defect mining and operational software testing. For precision improvement and cost reduction of defect mining, we first proposed to use supervised learning to improve the precision of dependence graph based defect mining. We used features describing dependence graph structure, code metrics and pattern statistics to represent the programming rules and violations discovered by mining tools, and we applied a logistic regression model to filter false positives. The results show that our approach has both high precision and high recall, and that it requires only around a quarter of the labeled examples to train the classifier. We then applied the fast subgraph matching algorithm GADDI to propagate bug fixes. In this approach, we developed a prototype tool, PatternBuild, to help programmers easily specify a pattern for a bug; the bug pattern is represented by a generic dependence graph, and GADDI is applied to find matches of the pattern.

Experimental results show that our approach is general enough to handle a variety of bugs, has good precision and high recall in discovering latent bugs, and achieves high efficiency in defect discovery. Finally, we proposed an approach to extend commercial static analysis tools by discovering project-specific bugs. We extended an industrial tool, Klocwork, to allow it to find project-specific bugs: we developed a tool that automatically transforms a mined programming rule into a Klocwork path checker, which is then used to discover violations of the rule. The results show that the tool can be applied to most of the mined rules, and that most of the detected warnings are valid violations.

For precision improvement and cost reduction of software testing, we proposed a framework, CARIAL (Cost-Aware Reliability Improvement with Active Learning), to balance cost and benefit in operational testing. The goal of CARIAL is to provide a test case selection scheme based on cost estimates for individual test cases and risk estimates for potential defects, so that test cases that are inexpensive to examine but maximize risk reduction are selected. We proposed a cost estimation scheme that is linear in the size of the test output. Active learning is applied to obtain a mapping of test cases to potential defects; with such a mapping, we can obtain the severity and failure rate of defects, which are used for risk estimation. Empirical results show that our approach is superior in risk estimation accuracy to the baseline approaches, and that our test case selection scheme also achieves a faster risk reduction rate than the baselines.

We propose to further investigate the topics in the following aspects:

Improving precision of defect mining with supervised learning. We would like to further develop this work in two directions. The first direction is to develop more flexible models that can be used with datasets from different projects or even from different mining approaches. Hierarchical linear models [29] are possible choices for this purpose, since they support multi-level modeling.

The second direction is to design even better ways to obtain labeled examples from programmers.

Although our results indicate that our models require only small sets of labeled examples to achieve good performance, developer time is a costly resource, and further improvements are desirable.


Bug-fix propagation. One issue with the current fix propagation work is its precision, which averaged a little more than 50%. Although this is acceptable, better precision is desirable. The main cause of false positives is improper node labeling; to help ensure that semantically equivalent nodes are assigned the same label, supervised learning approaches seem appropriate. Features can be selected to characterize all the factors that can affect labeling: ASTs, surrounding dependences, the text of the source code, etc. Another cause of false positives is that some bug and fix patterns are not universally applicable, as shown in Section 6.4.2B. To address this, we can provide functionality that lets programmers define additional constraints on the bug and fix patterns, which can be used to prune the results. We also hope to add support for semi-automatic correction of buggy code. This seems plausible because we currently highlight code changes according to the graph edit distance algorithm, so we can obtain an edit script relating the bug pattern and the fix pattern, which might be used to change a potential bug instance into a fix instance. Finally, we would like to extend the current work to Java programs, in combination with the JavaPDG tool.

Extending static analysis tools by mining project-specific rules. Our current framework applies to function precondition and postcondition rules as well as call sequence rules, regardless of how they are mined from the code base. We plan to extend the current work to address more types of programming rules in the future. Moreover, we would like to further improve the framework to make it applicable in the field, and to conduct extensive user studies involving programmers of commercial software. With such experience, we could better assess the performance of the framework and obtain valuable programmer feedback.

Cost-aware reliability improvement. In the future, we may need more realistic models for both cost and risk estimation. Our simple linear cost model for estimating test case evaluation cost considers only the size or complexity of the output; we may extend it to consider other factors customized to the application. Moreover, we may extend the work to consider other costs incurred in the testing and debugging process, such as test case execution time and fault localization effort. We considered two factors in risk estimation, defect severity and defect failure rate; in future work we may consider more realistic models for estimating loss in terms of dollars. Finally, our current work focused on proposing the general framework and the learning algorithm, and we made simple assumptions about failure symptoms. In the future, we will conduct more experiments to analyze the relationship between failure symptoms and underlying defects using more defect data.


REFERENCES

[1] V. Augustine, “Exploiting User Feedback to Facilitate Observation-based Testing,” Ph.D.

Dissertation, EECS, CWRU, 2009.

[2] Bugzilla. http://www.bugzilla.org/

[3] The ABB Group. http://www.abb.com/

[4] M. Acharya, T. Xie, J. Pei, and J. Xu, “Mining API Patterns as Partial Orders from Source Code:

from Usage Scenarios to Specifications,” The 6th joint meeting of European Softw. Eng. and

ACM SIGSOFT Symp. Found. of Softw. Eng., Dubrovnik, Croatia, 2007, pages 25-34.

[5] Apache HTTP Server Project, http://httpd.apache.org/.

[6] A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S.

McPeak and D. Engler, “A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs

in the Real World,” Communications of the ACM, Vol. 53 Issue 2, pages 66-75, Feb. 2010.

[7] J. F. Bowring, J. M. Rehg, and M. J. Harrold, “Active learning for automatic classification of

software behavior,” In Proceedings of the International Symposium on Software Testing and

Analysis, Jul, 2004, pages 195–205.

[8] D. B. Brown, S. Maghsoodloo, and W. H. Deason, “A cost model for determining the optimal

number of software test cases,” IEEE Transactions on Software Engineering vol. 15(2), 1989,

pages 218–221.

[9] Bugzilla. http://www.bugzilla.org/

[10] M. Burger and A. Zeller, "Minimizing Reproduction of Software Failures," in Proc. of ISSTA, Toronto, Canada, Jul. 2011.

[11] R. Chang, A. Podgurski, J. Yang, “Finding What’s Not There: A New Approach to Revealing

Neglected Conditions in Software”, Proceedings of the 2007 International Symposium on

Software Testing and Analysis, London, UK, July 2007, pages 163-173.

[12] R. Chang, A. Podgurski, J. Yang, "Discovering Neglected Conditions in Software by Mining

Dependence Graphs,", IEEE TSE, Volume 34, Issue 5, September 2008, pages 579-596.

[13] R. Chang, A. Podgurski, “Discovering programming rules and violations by mining

interprocedural dependences,”, in Journal of Software Maintenance and Evolution: Research and

Practice, Feb 2011.

[14] R. Chang, “Discovering Neglected Conditions in Software by Mining Program Dependence

Graphs,” Ph.D. Dissertation, EECS, CWRU, August 2008.

[15] C. Csallner and Y. Smaragdakis, “Check’n Crash: Combining Static Checking and Testing,” in

Proceedings of 27th International Conference on Software Engineering, ACM. 2005.

[16] Coverity. www.coverity.com.

[17] Coverity Extend: http://www.coverity.com/products/static-analysis-extend.html

[18] CVS: http://www.nongnu.org/cvs/

[19] V. Dallmeier, T. Zimmerman, “Extraction of Bug Localization Benchmarks from History”, 22nd

IEEE/ACM Int’l Conf. on Automated Softw. Eng., Atlanta, Georgia, USA, November 4-9, 2007,


pages 433-436.

[20] S. Dasgupta and D. Hsu. “Hierarchical sampling for active learning,” In Proc. of the 25th

International Conference on Machine learning, 2008, pages 208–215.

[21] W. Dickinson, D. Leon, and A. Podgurski, “Finding failures by cluster analysis of execution

profiles,” in Proceeding of 23rd Intl. Conf. on Software Engineering, Toronto, May 2001, pages

339-348.

[22] Diff: http://www.gnu.org/software/diffutils/

[23] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as Deviant Behavior: A General

Approach to Inferring Errors in System Code”, 18th ACM Symp. on Operating Systems

Principles, Banff, Canada, Oct. 2001, pages 57-72.

[24] J. Ferrante, K.J. Ottenstein, and J.D. Warren, “The Program Dependence Graph and its Use in

Optimization,” ACM Trans. Prog. Lang. Syst., vol. 9, 1987, pages 319-349.

[25] M. R. Fine, “Beta Testing for Better Software,” Wiley, 2002.

[26] P. Francis, D. Leon, M. Minch, A. Podgurski, “Tree-Based Methods for Classifying Software

Failures,” in Proceedings of the 15th International Symposium on Software Reliability

Engineering, Nov. 2004, pages 451-462.

[27] P. G. Frankl, R. G. Hamlet and B. Littlewood, “Evaluating Testing Methods by Delivered

Reliability”, Transaction on Software Engineering, Vol. 24, August, 1998. pages 586-601.

[28] M. Gabel, L. Jiang, and Z. Su, "Scalable Detection of Semantic Clones," The 31st ICSE, Vancouver, Canada, May 2009, pages 321-330.

[29] A. Gelman and J. Hill, "Data Analysis Using Regression and Multilevel/Hierarchical Models," Cambridge University Press, 2007.

[30] Google, Inc. Google search. www.google.com.

[31] S. Gokhale. “Cost–constrained reliability maximization of software systems”. In Proc. of Annual

Reliability and Maintainability Symposium, Los Angeles, CA, January 2004, pages 195-200.

[32] Grammatech, CodeSurfer: www.grammatech.com/products/codesurfer/overview.html.

[33] Grammatech, “Dependence Graphs and Program Slicing”.

www.grammatech.com/research/slicing/slicingWhitepaper.html.

[34] R. Giugno, D. Shasha, "GraphGrep: A Fast and Universal Method for Querying Graphs," in Proc. of ICPR, 2002.

[35] Grammatech, CodeSurfer: www.grammatech.com/products/codesurfer/overview.html.

[36] P. J. Guo, and D. Engler, “ Linux kernel developer responses to static analysis bug reports,” in

Proceedings of the 2009 USENIX Annual Technical Conference, 2009, pages 285–292.

[37] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” Morgan Kaufmann Publishers,

2006.

[38] R. A. Haertel, K. D. Seppi, E. K. Ringger, and J. L. Carroll. “Return on investment for active

learning,” In Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 2008.

[39] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Warnings," in Proceedings of the 2009 ICST, Denver, Colorado, USA, April 2009, pages 161-170.

[40] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural Slicing Using Dependence Graphs,” ACM

Trans. Program. Lang. Syst. 12, 1, Jan, 1990, pages 26-60.

[41] D. Hovemeyer and W. Pugh, “Finding bugs is easy,” ACM SIGPLAN Notices, v.39 n.12,

December 2004

[42] C.Y. Huang, J.H. Lo, S.Y. Kuo, M.R. Lyu, "Optimal allocation of testing-resource considering cost, reliability, and testing-effort," Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing, March 3-5, 2004, pages 103-112.

[43] Java Interactive Profiler. http://jiprof.sourceforge.net/.

[44] JavaPDG Project. http://selserver.case.edu:8080/javapdg/index.htm

[45] JavaPDG Project Bug Database. http://selserver.case.edu/bugzilla/

[46] S. Kim and M. D. Ernst, “Which Warnings Should I Fix First?” in Proc. 6th Joint

ESEC/SIGSOFT Foundations of Softw. Eng., 2007, pages 45–54.

[47] M. Kim and D. Notkin, “Discovering and Representing Systematic Code Changes,” in Proc. of

International Conference on Software Engineering, Vancouver, Canada, May 2009, pages

309-319.

[48] Klocwork. klocwork.com

[49] Klocwork C/C++ Path API Reference: http://www1.klocwork.com/products/documentation/barracuda/images/b/b1/Klocwork_C_Cxx_Path_API_Reference.pdf

[50] Klocwork detected C/C++ Issues:

http://www1.klocwork.com/products/documentation/Insight-9.1/Detected_C/C%2B%2B_Issues

[51] Klocwork KAST and Path Checkers: http://www1.klocwork.com/products/documentation/Insight-9.1/Writing_custom_C/C%2B%2B_checkers

[52] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, “Correlation Exploitation in Error Ranking,”

in Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of

software engineering, Newport Beach, CA, USA, 2004, pages 83-93.

[53] R. Krishnan, M. Nadworny and N. Bharill, “Static Analysis Tools for Security Checking in Code

at Motorola,” ACM SIGAda Ada Letters. 2008.

[54] S. Kim, K. Pan, and E.E. James Whitehead, “Memories of Bug Fixes,” in Proceedings of the 14th

ACM SIGSOFT Int’l Symp. on Foundations of Softw. Eng., Portland, OR, USA, Nov 5-11, 2006,

pages 35-45.

[55] S. Kim and M. D. Ernst, “Which Warnings Should I Fix First?” in Proc. 6th Joint

ESEC/SIGSOFT Foundations of Softw. Eng., 2007, pages 45–54.

[56] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking,"

in Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of

software engineering, Newport Beach, CA, USA, 2004, pages 83-93.

[57] I. Lee, R. K. Iyer, "Diagnosing Rediscovered Software Problems Using Symptoms," in IEEE Transactions on Software Engineering, Vol. 26, No. 2, Feb. 2000, pages 113-127.

[58] H. K. N. Leung, L. White, "A Cost Model to Compare Regression Test Strategies," in Proc. Conf. Softw. Maint., Nov. 1990, pages 290–300.

[59] Z. Li and Y. Zhou, "PR-Miner: Automatically Extracting Implicit Programming Rules and Detecting Violations in Large Software Code," ESEC-FSE '05, Lisbon, Portugal, Sept 2005, pages 306-315.

[60] B. Livshits and T. Zimmermann, “DynaMine: Finding Common Error Patterns by Mining

Software Revision Histories”, The joint meeting of European Softw. Eng. and ACM SIGSOFT

Symp. Found. of Softw. Eng., pages 296-305.

[61] J. Śliwerski, T. Zimmermann and A. Zeller, "When Do Changes Induce Fixes?" in Proc. of the 2005 Int'l Workshop on Mining Software Repositories, Saint Louis, Missouri, USA, May 2005, pages 24-28.

[62] S. Lu, S. Park, C. Hu, X. Ma, W. Jiang, Z. Li, R. A. Popa, and Y. Zhou. “Muvi: Automatically

Inferring Multi-variable Access Correlations and Detecting Related Semantic and Concurrency

Bugs,” SOSP07, Stevenson, Washington, USA, October 14-17, 2007, pages 103-116.

[63] T. Mitchell, “Machine Learning,” McGraw Hill, 1997


[64] G. Malishevsky, G. Rothermel, and S. Elbaum, “Modeling the Cost-Benefits Tradeoffs for

Regression Testing Techniques,” in Proc. International Conference on Software Maintenance,

Montreal, Quebec, Canada, Oct. 2002, pages 230-240.

[65] N. Nagappan and T. Ball. “Static analysis tools as early indicators of pre-release defect density,”

Proceedings of the 27th international conference on Software engineering, pages 580–586. 2005.

[66] G. M. Nanda, M. Gupta, S. Sinha, S. Chandra, D. Schmidt and P. Balachandran, "Making defect-finding tools work for you," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, Cape Town, South Africa, 2010, pages 99-108.

[67] Net-snmp: http://net-snmp.sourceforge.net/

[68] T. Nguyen, H. Nguyen, N. Pham, J. Al-Kofahi, and T. Nguyen, “Recurring Bug Fixes in

Object-Oriented Programs,” in Proc. of International Conference on Software Engineering

2010, Cape Town, South Africa, May 2010.

[69] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, “Where the Bugs Are,” In Proceedings of ACM

SIGSOFT Int’l Symp. on Softw. Testing and Analysis, 2004, pages 86–96,

[70] Openssl, http://www.openssl.org.

[71] PatternBuild: http://selserver.case.edu/patternbuild/index.htm

[72] PC-lint. http://www.gimpel.com/html/pcl.htm

[73] H. Pham, X. Zhang, "NHPP software reliability and cost models with testing coverage," in European Journal of Operational Research, vol. 145, no. 2, pages 443–454, Mar. 2003.


[74] A. Podgurski, W. Masri, Y. McCleese, F. G. Wolff, and C. Yang, "Estimation of software reliability by stratified sampling," in ACM Transactions on Software Engineering and Methodology 8, 9 (July 1999), pages 263-283.

[75] A. Podgurski, D. Leon, P. Francis, M. Minch, J. Sun, B. Wang and W. Masri, “Automated support

for classifying software failure reports,” in Proceedings of 25th International Conference on

Software Engineering, Portland, OR, May 2003.

[76] K. Y. Phang, J. S. Foster, M. Hicks, and V. Sazawak, “Path projection for user-centered static

analysis tools,” in PASTE’08: Proceedings of the 8th ACM workshop on Program analysis for

software tools and engineering. 2008.

[77] Python: http://www.python.org/

[78] The R Project: http://www.r-project.org

[79] M. Ramanathan, A. Grama, and S. Jagannathan, "Path Sensitive Inference of Function Precedence Protocols," 29th Int'l Conf. on Softw. Eng., 2007, pages 240-250.

[80] M. Ramanathan, A. Grama, and S. Jagannathan, "Static Specification Inference Using Predicate Mining," ACM SIGPLAN 2007 Int'l Conf. on Prog. Lang. Design and Impl., 2007, pages 123-134.

[81] K. Riesen, M. Neuhaus and H. Bunke, “Bipartite Graph Matching for Computing the Edit

Distance of Graphs,” in Graph-Based Representations in Pattern Recognition, 2007, pages 1-12.

[82] ROME Project. http://java.net/projects/rome/


[83] ROME Project Bug Database. http://java.net/jira/browse/ROME

[84] D. S. Rosenblum, E. J. Weyuker, "Using Coverage Information to Predict the Cost-Effectiveness of Regression Testing Strategies," in IEEE Transactions on Software Engineering, Vol. 23, No. 3, Mar. 1997, pages 146-156.

[85] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel, "Predicting Accurate

and Actionable Static Analysis Warnings: An Experimental Approach," in 30th International

Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pages 341-350.

[86] B. Settles, “Active learning literature survey,”Technical Report 1648, University of Wisconsin-

Madison. 2009.

[87] S. Shoham, E. Yahav, S. Fink, and M. Pistoia, “Static Specification Mining using

Automata-based Abstractions,” ACM Int’l Conf. on Softw. Testing and Analysis, 2007, pages

174-184.

[88] Spring. Spring Framework 3.0.2.RELEASE, www.springframework.org.

[89] J. Steven, P. Chandra, B. Fleck, A. Podgurski, “Rapture: A Capture/Replay tool for

observation-based testing,” SIGSOFT Softw. Eng. Notes, vol. 25, 2000, pages. 158-167.

[90] B. Sun, X. Chen, R. Chang and A. Podgurski, "Automated Support for Propagating Bug Fixes," in Proc. of the 19th International Symposium on Software Reliability Engineering, Seattle, Washington, USA, Nov 2008, pages 187-196.

[91] B. Sun, G. Shu, A. Podgurski, S. Li, S. Zhang, J. Yang, "Propagating Bug Fixes with Fast Subgraph Matching," in Proc. of the 21st International Symposium on Software Reliability Engineering, San Jose, California, USA, Nov 2010, pages 21-30.

[92] B. Sun, A. Podgurski, S. Ray, "Improving the Precision of Dependence-Based Defect Mining by Supervised Learning of Rule and Violation Graphs," in Proc. of the 21st International Symposium on Software Reliability Engineering, San Jose, California, USA, Nov 2010, pages 21-30.

[93] SVN: http://subversion.tigris.org/

[94] Y. Tian and J. Patel, “TALE: A Tool for Approximate Large Graph Matching,” Proc. of ICDE,

2008.

[95] J. Tian and J. Palma, “Test workload measurement and reliability analysis for large commercial

software systems,” Annals of Software Engineering 4 (1997), pages 201-222.

[96] G. Tassey, “The economic impacts of inadequate infrastructure for software testing,” National

Institute of Standards and Technology, Planning Report 02-3, 2002.

http://www.nist.gov/director/prog-ofc/report02-3.pdf

[97] S. Thummalapenta and T. Xie, "PARSEWeb: A Programmer Assistant for Reusing Open Source Code on the Web," 22nd IEEE/ACM Int'l Conf. on Automated Softw. Eng. (ASE 2007), Atlanta, Georgia, Nov. 2007, pages 204-213.

[98] S. Thummalapenta and T. Xie, "Mining Exception-Handling Rules as Conditional Association Rules," The 31st ICSE, Vancouver, Canada, May 2009, pages 496-506.

[99] S. Thummalapenta and T. Xie, “Mining alternative patterns for detecting neglected conditions,”


in Proceedings of 24th IEEE/ACM International Conference on Automated Software Engineering

(ASE 2009), pages 283–294. 2009.

[100] A. Wasylkowski, A. Zeller, and C. Lindig, “Detecting Object Usage Anomalies,” ESEC/FSE,

Dubrovnik, Croatia, 2007, pages 35-44.

[101] Web Demo of Automatic P2C Converter:

http://selserver.case.edu:8080/autochecker/index.htm

[102] S. Weisberg, "Applied Linear Regression," New York: Wiley.

[103] E. J. Weyuker, "On Testing Non-Testable Programs," The Computer Journal, 25(4), 1982, pages 465-470.

[104] E. J. Weyuker, "Using operational distributions to judge testing progress," in Proc. of the 2003 ACM Symposium on Applied Computing, Melbourne, Florida, March 9-12, 2003, pages 1118-1122.

[105] C. C. Williams and J. K. Hollingsworth, "Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques," IEEE Trans. on Softw. Eng., Vol. 31, June 2005, pages 466-480.

[106] C. C. Williams, and J. K. Hollingsworth, “Bug Driven Bug Finders”, Proc. Intl Workshop

Mining Software Repositories, Edinburgh, Scotland, UK, May 2004, pages 70-74.

[107] Ubuntu Development Team. Ubuntu Linux 7.04, April 2007. URL http://www.ubuntu.com/.

[108] Xerces2 Project. http://xerces.apache.org/xerces2-j/


[109] Xerces2 Bug Database. https://issues.apache.org/jira/secure/IssueNavigator.jspa

[110] S. Zhang, S. Li and J. Yang, “GADDI: Distance Index based Subgraph Matching in

Biological Networks,” in Proc. of 12th EDBT, Saint Petersburg, Russia, Mar 2009, pages

192-203.

[111] J. Zheng, L. Williams, N. Nagappan, W. Snipes, J. Hudepohl, and M. Vouk, “On the Value of

Static Analysis for Fault Detection in Software,” in IEEE Transactions on Software Engineering,

vol. 32, no. 4, pages 240-253. 2006.

[112] T. Zimmermann and P. Weissgerber, “Preprocessing CVS data for fine-grained analysis,”

in Proc. of International Workshop on Mining Software Repositories, Edinburgh, Scotland, UK,

May 2004, pages 2–6.
