IMPROVED METHODS FOR MINING SOFTWARE REPOSITORIES TO DETECT EVOLUTIONARY COUPLINGS

A dissertation submitted

to Kent State University in partial

fulfillment of the requirements for the

degree of Doctor of Philosophy

by

Abdulkareem Alali

August, 2014

Dissertation written by

Abdulkareem Alali

B.S., Yarmouk University, Jordan, 2002

M.S., Kent State University, USA, 2008

Ph.D., Kent State University, USA, 2014

Approved by

Dr. Jonathan I. Maletic, Chair, Doctoral Dissertation Committee

Dr. Feodor F. Dragan, Members, Doctoral Dissertation Committee

Dr. Hassan Peyravi

Dr. Michael L. Collard

Dr. Joseph Ortiz

Dr. Declan Keane

Accepted by

Dr. Javed Khan, Chair, Department of Computer Science

Dr. James Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
1.1 Motivation and Problem
1.2 Research Overview
1.3 Contributions
1.4 Organization
CHAPTER 2 BACKGROUND AND RELATED WORK
2.1 Impact Analysis
2.2 Evolutionary Couplings
2.3 Static Program Analysis
2.4 Changesets
2.5 Failure Prediction and Maintenance Effort Using Metrics
2.5.1 Code Metrics
2.5.2 Code Change Metrics
2.5.3 Previous Changes and Defects
2.5.4 Collaboration Metrics
CHAPTER 3 USING CHANGE METRICS TO IMPROVE THE DETECTION OF EVOLUTIONARY COUPLINGS
3.1 Introduction
3.2 Detecting Evolutionary Coupling
3.3 Change Metrics
3.4 Adding Change Metrics
3.4.1 Discrete Change Metrics
3.4.2 Change Metrics + Mining
3.4.3 Implementation
3.5 Evaluation
3.5.1 Evaluation Using Prediction
3.5.2 Interestingness Measures
3.5.3 Manual Validation of Patterns
3.6 Threats to Validity
3.7 Discussion
CHAPTER 4 CHANGE PATTERNS INTERACTIVE TOOL AND VISUALIZER
4.1 Introduction
4.2 Controls
4.3 Summary
CHAPTER 5 DISTRIBUTION AND CORRELATION OF CODE, CHANGE AND COLLABORATION METRICS
5.1 Introduction
5.2 Software Metrics
5.3 Data Collection
5.4 Time Window
5.5 Metrics Distribution
5.5.1 Frequency Histogram
5.5.2 Frequency Histogram on a Log-Log Plot
5.5.3 Complementary Cumulative Distribution Function
5.6 Metrics Correlation
5.7 Discussion
CHAPTER 6 USING AGE AND DISTANCE TO IMPROVE THE DETECTION OF EVOLUTIONARY COUPLINGS
6.1 Introduction
6.2 Frequent Pattern Mining
6.3 Data Collection and Patterns Generation
6.4 Pattern Distance
6.5 Pattern Age
6.6 Evaluation Using Interestingness Measures
6.7 Summary
CHAPTER 7 ASSESSING TIME WINDOW SIZE IN THE MINING OF SOFTWARE REPOSITORIES FOR EVOLUTIONARY COUPLINGS
7.1 Introduction
7.2 Evolutionary Couplings
7.3 Approach & Setup of the Study
7.3.1 Experimental Data
7.3.2 Patterns Generation
7.3.3 Design of the Evaluation
7.3.4 Evaluation Using Prediction
7.4 Empirical Study
7.4.1 Time Windows Comparison
7.4.2 Time Window Cross Prediction
7.4.3 Combining Time Windows
7.5 Threats to Validity
7.6 Discussion
CHAPTER 8 PREDICTION PARAMETERS ON THE DETECTION OF EVOLUTIONARY COUPLINGS
8.1 Introduction
8.2 Data Collection
8.3 Patterns and Association Rules Generation
8.4 Multiple Regression Model Setup
8.5 Experiment and Results
8.6 Summary
CHAPTER 9 CONCLUSIONS AND FUTURE WORK
9.1 Synergic Approaches
9.2 The Analysis of Data Mining Parameters
9.3 Future Work
REFERENCES

LIST OF FIGURES

Figure 1. Changed lines histogram for 2001-2010 KOffice, where the X-axis is bins of size two and the Y-axis is the frequency of lines changed in each bin.
Figure 2. Distributions of LOCC, HC, and FC metrics for the KOffice repository. The data are plotted on a log-log scale.
Figure 3. Distributions of LOCC, HC, and FC metrics for the KOffice repository. The same data as Figure 2, binned and plotted on log-log scales.
Figure 4. Size ratios (total change / no. of files / no. of years), where change is the collected L, F, and H for the studied projects. Revisions (rev).
Figure 5. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice. CEC patterns were generated by enforcing LOCC consistencies.
Figure 6. The final scores of the seven open source systems for confidence values of the association rules generated from the CEC, LOCC, and EC patterns.
Figure 7. The final scores of the seven open source systems for lift values of the association rules generated from the CEC, HC, and EC patterns.
Figure 8. KouplerVis2 snapshot; the system is KOffice [2000-2009], the minimum support is 2%, which is 5 weeks out of 269 weeks for the selected range 2000-2005. Itemset number 37 has four files, and this pattern consistently appeared 5 times, a total of 405 ECs.
Figure 9. Activity chart for the ten free and open source systems. Each bar represents the ratio of the number of commits over the number of years; all ratios are then normalized by dividing each by the highest ratio (the MAX).
Figure 10. Two KOffice baskets of six files, with their attributes and values stored in an XML file, encoding effort measurements for each file during an observed time unit of a week; the date is the first day of the week and the basket name.
Figure 11. Frequency distribution of three selected metrics from each category of code, change, and collaboration for KOffice [2001-2010], collected on weeklong time units.
Figure 12. A log-log plot of the probability distribution of three selected metrics from each category of code, change, and collaboration for KOffice [2001-2010] on weeklong time units. Same data as used in Figure 11.
Figure 13. CCDF shapes of lognormal, double Pareto, and Pareto distributions on a log-log plot [Mitzenmacher 2004].
Figure 14. A log-log plot of the complementary cumulative distribution function of three selected metrics from each category of code, change, and collaboration for KOffice [2001-2010] on weeklong time units. Same data as used in Figure 11 and Figure 12.
Figure 15. A log-log plot of the CDF and CCDF of LOC Churn for KOffice [2001-2010] on weeklong time units.
Figure 16. A log-log plot of the CCDF for the number of commits on bins of size 2 for KOffice [2001-2010] on weeklong time units.
Figure 17. A log-log plot of the CCDF for code metrics (LOC and CC) on bins of size 2 over five systems on weeklong time units.
Figure 18. A log-log plot of the CCDF for code metrics (LOC and CC) on bins of size 2 over the other five systems on weeklong time units; this completes Figure 17.
Figure 19. A log-log plot of the CCDF for change metrics (LOC and Functions Churn) on bins of size 2 over five systems on weeklong time units.
Figure 20. A log-log plot of the CCDF for change metrics (LOC and Functions Churn) on bins of size 2 for the five other systems on weeklong time units; this completes Figure 19.
Figure 21. A log-log plot of the CCDF for change metrics (Hunks Churn, CC Diff, LOC Diff) on bins of size 2 over ten systems on weeklong time units.
Figure 22. A log-log plot of the CCDF for change metrics (Hunks Churn, CC Diff, LOC Diff) on bins of size 2 over ten systems on weeklong time units; this completes Figure 21.
Figure 23. A log-log plot of the CCDF for collaboration metrics (Commits and Authors) on bins of size 2 over ten systems on weeklong time units.
Figure 24. A log-log plot of the CCDF for collaboration metrics (Commits and Authors) on bins of size 2 over ten systems on weeklong time units; this completes Figure 23.
Figure 25. Size ratios (total commits / years / MAX), where commits and years are as reported in Table 21. MAX is the highest commits/years ratio, which is for Chrome, it being relatively 100% active.
Figure 26. Distribution of distances between pattern pairs. Patterns generated for KOffice [2001-2010] for files.
Figure 27. Distribution of the differences, in days, between pattern reoccurrences for KOffice [2001-2010] over files.
Figure 28. 10-percentile distribution of pattern age and decile widths, in days, for KOffice [2001-2010] for files.
Figure 29. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice, comparing low-age patterns vs. high-age patterns.
Figure 30. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice, comparing low-age patterns vs. high-age patterns.
Figure 31. Activity plot based on the ratio commits/years divided by the maximum of all commits/years ratios for the thirteen studied systems.
Figure 32. Distribution of time window sortings over the thirteen systems. Commit, Hour, Day, and Week are the different labels.
Figure 33. Improvement range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = hdw and Te = c.
Figure 34. Improvement range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = c and Te = hdw.
Figure 35. Improvement range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = c and Te = hdw.
Figure 36. Precision coverage over all systems.
Figure 37. Recall coverage over all systems.
Figure 38. F-measure coverage over all systems.
Figure 39. Extent of the representation from the total variation of model analysis outcomes over all systems.
Figure 40. Distribution of the extent of the representation for the models over precision, recall, and F-measure for all systems.

LIST OF TABLES

Table 1. Characteristics of the seven open source systems used in the study, including number of years, baskets (revisions), files, and changed lines, functions, and hunks.
Table 2. Prediction accuracies and completeness over seven open source systems. P = Precision, R = Recall, Fm = F-Measure, LFH = LOCC  FC  HC.
Table 3. Seven open source systems and the number (#I) of generated CEC patterns using changed lines (L), functions (F), and hunks (H), the EC patterns, and their minimum supports (mS).
Table 4. Deciles of confidence values for KOffice and their win-lose scores.
Table 5. All variations (intersections and unions) of change metrics for CEC patterns and their precision, recall, and F-measure values.
Table 6. All metrics used in the study of code, change, and collaboration. Four columns give each metric's Category, Name, Abbreviation, and Description.
Table 7. Ten free and open source systems used in this study; all systems are written in C/C++ and cover a range from 3 to 12 years. A brief description of each system is included.
Table 8. Legend key for Table 9 through Table 18. Spearman rho values shown in white on a black background indicate a strong relationship, black on a yellow background a moderate relationship, and black on a white background a weak relationship between a pair of variables. The p-value is a very small number for all calculated rho values.
Table 9. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for Chrome.
Table 10. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for GCC.
Table 11. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for KOffice.
Table 12. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for KDElibs.
Table 13. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for LLVM.
Table 14. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for OpenMPI.
Table 15. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for Python.
Table 16. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for Quantlib.
Table 17. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for Ruby.
Table 18. Spearman's rho values between every possible pair of the metrics presented in this study, on the data collected for Xapian.
Table 19. Three counters (frequencies) summarizing all Spearman's rho test results. Each number counts a strength level (strong, moderate, or weak) among all possible pairs of the studied set of metrics from different categories.
Table 20. Ratios and distributions for each Spearman's rho test for all systems. Each percentage represents the ratio of each strength level (strong, moderate, and weak) among all possible pairs of the studied set of metrics from different categories.
Table 21. Characteristics of the eleven open source systems used in the study, including years, commits, files, selected minimum supports, itemsets generated, and maximum itemset sizes (L).
Table 22. Typical distances for file granularity.
Table 23. Typical age distribution for files. The threshold represents the age at which most of the data is accounted for.
Table 24. Eleven open source systems, the number of generated itemsets, and their minimum supports; the number of association rules generated using a 10% minimum confidence threshold; and the number of files included in the evaluation, where the date range of all systems is from April 01, 2009 to August 01, 2009.
Table 25. Deciles of confidence values for KOffice and their win-lose scores comparing low-age patterns vs. high-age patterns.
Table 26. Deciles of confidence values for KOffice and their win-lose scores comparing zero-distant patterns vs. non-zero-distant patterns.
Table 27. The final confidence and lift comparison scores of the eleven open source systems on the association rules generated for low-age patterns vs. high-age patterns.
Table 28. The final confidence and lift comparison scores of the eleven open source systems on the association rules generated for zero-distant patterns vs. non-zero-distant patterns.
Table 29. Characteristics of the studied systems: for each system, the number of files, LOC, committed files, and years.
Table 30. The number of commits, hours, days, and weeks used in this study over the thirteen open source systems.
Table 31. Patterns uncovered for the training set (first 75% of transactions) over thirteen open source systems and their minimum support counts. ECs = Evolutionary Couplings, mS = minimum support count.
Table 32. Prediction accuracies and completeness over six open source systems. P = Precision, R = Recall, Fm = F-Measure, Tr = time window of the training set, Te = time window of the test set.
Table 33. Prediction accuracies and completeness over seven open source systems. P = Precision, R = Recall, Fm = F-Measure, Tr = time window of the training set, Te = time window of the test set. This completes Table 32.
Table 34. Cross prediction F-measure. Fm is F-Measure; Tr = hdw is the time window of the training set, where training is hour, day, or week; Te = c is the time window of the test set, where commit is the test set time window.
Table 35. Cross prediction F-measure values. Fm is F-Measure; Tr = c is the training set time window, where commit is the time window; Te = hdw is the test set time window, where the test set is hour, day, or week.
Table 36. Prediction precisions, recalls, and F-measures for the OSG open source system. Precision (P), Recall (R), F-Measure (Fm), training set time window (Tr), test set time window, where commit is the time window (Te = c).
Table 37. Prediction improvements in precisions, recalls, and F-measures for the OSG open source system. Precision (P), Recall (R), F-Measure (Fm), training set time window (Tr), test set time window, where commit is the time window (Te = c).
Table 38. Highest (MAX) and lowest (MIN) improvement percentages for the different time window pattern set conjectures over the thirteen open source systems, where Te = c.
Table 39. Characteristics of the eleven open source systems used in the study, including years, commits, and files.
Table 40. Prediction parameters (support count and training data ratio): min and max values, unit size, and trials.
Table 41. Prediction parameters (years and confidence): min and max values, unit size, and trials.
Table 42. Prediction parameters: total trials.
Table 43. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for KOffice.
Table 44. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for KDELibs.
Table 45. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Httpd.
Table 46. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Subversion.
Table 47. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Ruby.
Table 48. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Chrome.
Table 49. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for OpenMPI.
Table 50. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Python.
Table 51. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for LLVM.
Table 52. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for GCC.
Table 53. Regression analysis results (R Square, Adjusted R Square, and Significance F), and a magnitude value and p-value for the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept), for the dependent variables Precision, Recall, and F-measure, for Xapian.

ACKNOWLEDGEMENTS

I want to acknowledge my "dissertation family" (in no particular order): SDML

(Research Laboratory), Wedad Alali (Mother), Jonathan I. Maletic (Supervisor), Qasem

Alali (Father), Walid Alali (Brother), Feodor F. Dragan (Committee Member), Sireen

Abu-Khafajah (Wife), Bacim Alali (Brother), Kent State University, Andrew Sutton

(Research Partner), City of Kent, Michael L. Collard (Committee Member), Carmela

Alali (Daughter), SDML members, Computer Science Department (at KSU), Hassan

Peyravi (Committee Member), Declan Keane (Committee Member), Huzefa Kagdi

(Research Partner), Brian Bartman (Research Partner), Saleh M. Alnaeli (Research

Partner), Joseph Ortiz (Committee Member), Christian D. Newman (Research Partner).

Also, thanks for the support of (in no particular order): Tharwa Alali (Sister), Daleh

Abu-Khafajah (Mother in Law), Feras Alali (Brother), Mysoon Alali (Sister), Ahmad

Abu-Khafajah (Father in law), Muad Abu-Ata (Friend), Ismaeel Alali (Brother), Aya

Alali (Sister), Mohammad Abu-Khafajah (Brother in Law), Shatha Abu-Khafajah (Sister in Law), Salem Othman (Friend), Jehad Rababah (Friend), Maen Hammad (Friend).

I am thankful for the limitless help of my work colleagues at Nichevision, Inc.: Luigi Armogida, Vic Meles, Tom Faris, and Tracy Bauer.

Lastly:

Thank you for keeping me in your thoughts and prayers, Mom and Dad.

Highest gratitude and appreciation to the best friend-like, father-like teacher and advisor, Dr. Maletic.

Thanks to the sweetest girl, eyes of glory, sliver of my heart, my angel, Carmela.

Thanks to the closest soul, second me, new mama, love of my life, Sireen.

Abdulkareem Alali

July, 2014, Kent, Ohio


CHAPTER 1

INTRODUCTION

During system evolution, modifications are made to fix bugs, add new features, adapt the system to new application programming interfaces (APIs), and so on. These changes can affect a broad range of the software system. Determining which parts of the system are potentially affected by a given modification is called software change impact analysis [Arnold, Bohner 1996]. The ideal impact analysis technique produces the exact set of functions (or files) that require modification due to a given change. Unfortunately, no such technique exists. Current technology gives us an impact set that can be missing important items or contain a large number of irrelevant items (items reported as impacted and in need of updating that in practice go unmodified). These latter cases are false positives and can be quite problematic. Large numbers of false positives make a tool difficult to use and decrease the likelihood that it will be adopted by software developers [Cordy 2003].

Current methods for impact analysis typically use traditional static and dynamic analysis methods as the basis for identifying impact sets. More recently, other approaches, such as information retrieval and data mining, have been used to address this problem.

These alternative approaches have focused on identifying relationships in the software that are difficult for traditional analysis methods to uncover. Here, we describe experiments combining traditional static information (i.e., change measures) with data mining techniques in order to identify evolutionary couplings uncovered in the version history of the system. Our goal is to improve the accuracy of the impact set by reducing the number of false positives.

Evolutionary couplings [Gall, Hajek, Jazayeri 1998; Gall, Jazayeri, Krajewski 2003] are patterns of co-changing items in software that are uncovered from the version history. That is, if two functions repeatedly change together during software evolution, then they are evolutionarily coupled for some reason. This coupling may exist for several reasons, and a static relationship is not necessarily one of them. The software items in a change pattern can be related through static dependencies [Fenton, Pfleeger 1996; Offen, Jeffery 1997], such as a function call. Others can be related through implicit or hidden dependencies (e.g., potentially misplaced related artifacts in a software system with no clear static relation [D'Ambros, Lanza 2006]).

Gall et al. [Gall, Hajek, Jazayeri 1998] introduced the concept of evolutionary coupling (aka logical coupling) to uncover and describe these implicit dependencies. Implicitly coupled artifacts typically require more time and effort to maintain because there is no explicit way to identify them. Evolutionary couplings are identified using a data mining technique called frequent pattern mining [Agrawal, Srikant 1994], a commonly applied method in Mining Software Repositories (MSR). Frequent pattern (aka itemset) mining is applied to sets of files committed within a given unit of time. Unfortunately, this approach typically suffers from high false-positive rates. For example, files may be committed together many times due to a code license change, branch merges, or, most commonly, developers making commits that cut across logical changesets (changes that can be grouped together as an atomic unit because they address a single task [Hassan, Holt 2004b]). Often, a pattern of three items is also detected only because one item is dependent on and coupled with the other two, even though those two items are completely decoupled from each other.
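To make the mining step concrete, the sketch below (in Python, with hypothetical file names, a hand-picked minimum support, and a brute-force enumeration rather than a full Apriori implementation) illustrates how frequent co-change patterns are extracted from commit transactions; it is only a minimal illustration of the general frequent itemset mining idea [Agrawal, Srikant 1994], not the implementation used in this work.

    from itertools import combinations
    from collections import Counter

    # Each transaction is the set of files committed together in one time window.
    # The file names are hypothetical placeholders.
    transactions = [
        {"ui/dialog.cpp", "ui/dialog.h", "core/model.cpp"},
        {"ui/dialog.cpp", "ui/dialog.h"},
        {"core/model.cpp", "core/model.h"},
        {"ui/dialog.cpp", "ui/dialog.h", "core/model.h"},
    ]

    def frequent_itemsets(transactions, min_support, max_size=3):
        """Return itemsets (size 2..max_size) that occur in at least
        min_support transactions -- the candidate evolutionary couplings."""
        frequent = {}
        for size in range(2, max_size + 1):
            counts = Counter()
            for t in transactions:
                for itemset in combinations(sorted(t), size):
                    counts[itemset] += 1
            frequent.update({i: s for i, s in counts.items() if s >= min_support})
        return frequent

    for itemset, support in frequent_itemsets(transactions, min_support=3).items():
        print(itemset, "support =", support)
    # ('ui/dialog.cpp', 'ui/dialog.h') appears in 3 of the 4 transactions.

The example also shows the false-positive problem in miniature: any file that rides along in enough commits becomes part of a pattern, whether or not a real dependency exists.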

1.1 Motivation and Problem

The identification of evolutionary couplings is important for a number of maintenance tasks. It allows developers to consider refactoring and reengineering solutions that minimize the impact of co-changes. These relationships can also be used to support the enforcement of evolution or maintenance constraints on artifacts.

Such patterns also help developers avoid introducing bugs into the system during maintenance activities performed unaware of these dependencies and their propagating effects. Software developers and engineers know many implicit and explicit dependencies through their mental models [LaToza, Venolia, DeLine 2006], but in large systems it becomes hard to keep track of all software elements and their relationships. Knowledge of systematic change patterns helps developers follow change impacts. Awareness of such patterns also supports, through impact analysis, smarter testing: a developer can focus on the smaller group of tests covering the range of predicted changes rather than dealing with the whole system.

Explicit static relationships (e.g., call and include graphs) are dependencies among the artifacts that compose the system. Static measures are based on software source code, which is usually very large, so the number of explicit links uncovered is enormous and hard to follow. Moreover, the existence of a physical dependency between artifacts does not guarantee that it contributes to a change pattern; filtering out the critical links requires the archived historical maintenance information. Static measures also do not reveal all dependencies, i.e., the implicit/hidden ones that are not documented anywhere [Gall, Jazayeri, Krajewski 2003]. As an example of such hidden dependencies, if two classes with no explicit physical link between them are both responsible for the implementation of some feature, a change to one class will probably propagate a change to the other.

Logically coupled artifacts typically require more time and effort to maintain because of the lack of explicit static relationships, and it is difficult to develop models and tools that identify implicitly coupled artifacts or enforce traceability constraints between them. Previous work on the detection of implicit logical coupling has most often applied frequent itemset mining to sets of files committed within a given unit of time. Unfortunately, these approaches typically suffer from high false-positive rates because they do not distinguish between coincidentally changing files and those that are actually logically coupled. Evolutionary coupling emerged with data mining methodologies: artifacts are considered co-changed if they were frequently committed to the revision repository together. The transaction here is derived through sliding-window approaches over commit metadata, such as author identity or log messages in a version repository (e.g., CVS), or, with more recent versioning systems, as a changeset (e.g., Subversion, Git). The detected change patterns uncover interesting dependencies that are unseen by classical static analysis approaches; these types of implicit dependencies are called logical couplings [Gall, Hajek, Jazayeri 1998]. Evolutionary couplings can produce high-quality patterns, but this requires high frequencies, which in turn require a long maintenance history that may not be available [Robbes, Pollet, Lanza 2008]. The work presented here addresses the problem of reducing the number of false positives by extending previous approaches for the identification of implicit and explicit couplings using synergic approaches and via the tuning of pattern mining parameters.

Both of the presented methodologies, static analysis and evolutionary couplings, suffer mainly from high false positives; in large-scale systems the false positives can be numerous and the patterns long, which makes them difficult for developers to investigate. Without following such critical dependencies, bugs may not be avoided and reengineering decisions may come too late.

1.2 Research Overview

Impact analysis and change propagation classically use the software engineering approaches of static and dynamic program analysis. More recently, evolutionary coupling via mining software maintenance history (repositories) has shown promising results [Li, Sun, Leung, Zhang 2013]. A number of mining software repositories techniques are borrowed from large-scale data mining [Agrawal, Imieliński, Swami 1993; Agrawal, Srikant 1994]. Static and dynamic analysis (program analysis) are heavyweight techniques that require a lot of time, processing, and analysis. They also produce a lot of information, such as dependencies, with high accuracy. Data mining approaches, on the other hand, are lighter in nature, but suffer from many false-positive dependencies. Such approaches are blind to the structural, syntactic, and semantic relations among software elements: a dependency among the items of a frequent pattern is claimed to exist solely because these items co-changed together more times than some arbitrary threshold (the minimum support). Each side (program analysis vs. mining software history) has drawbacks and limitations in accuracy and usability. In this work, we bridge the gap with a two-fold approach:

• Synergic Approaches. We use hybrid approaches to improve the data mining methods that detect evolutionary couplings, using program information from static analysis, metrics, and historical meta-data. We further study the collected data in more depth to understand its nature and value.

• The Analysis of Data Mining Parameters. We study the effect of changeset size variation. The changeset, time window, or transaction size (called a basket in data mining) is one of the main parameters used in data mining. We further study a collective set of these parameters and build a prediction model around them to uncover their magnitude and effect on the generated patterns and association rules.

The hypothesis is that the combination of these orthogonal approaches will produce more accurate results.

1.3 Contributions

We present several approaches that improve the data mining techniques used to uncover evolutionary couplings. Specifically, we make the following contributions:

• Combining static analysis with data mining to reduce false-positive dependencies.

• A tool to detect change patterns using a set of parameters, with a visual presentation of the results.

• Metrics distribution and correlation analysis for bug prediction and effort estimation.

• Historical meta-data on pattern age and the distance between a pattern's items, used to rank and filter false-positive evolutionary couplings.

• An assessment of the impact of different time window sizes on the detection of evolutionary couplings, followed by a study of cross prediction between patterns generated from different windows and of combinations (intersections and unions) of patterns generated from different window sizes to enhance the quality of the selected change patterns.

• A study of a collective set of data mining parameters and their effect on the prediction quality of the generated association rules, using multiple regression analysis to build a prediction model around this set of parameters.

1.4 Organization

The dissertation is organized as follows. Chapter 2 presents related work on impact analysis, evolutionary couplings, static program analysis, changesets, predicting failures, and effort estimation using metrics. Following that are two main components: 1) research on static analysis and meta-data plus data mining techniques (Chapters 3 to 6); and 2) examination and analysis of repository mining parameters (Chapters 7 and 8).

The work first focuses on combining software information with frequent pattern mining techniques. Chapter 3 uses change measures to improve the detection of evolutionary couplings (CEC) [Alali, Sutton, Maletic 2014a]. Chapter 4 continues this work and presents the CEC technique as a change pattern detection tool and visualizer [Alali, Maletic 2015]. Chapter 5 is a broader and deeper study of the change measures (metrics) that were used in Chapter 3 [Alali, Maletic 2014]. Chapter 6 uses meta-data collected from repositories and structural information, namely pattern age and distance, to improve the detection of evolutionary couplings [Alali, Bartman, Newman, Maletic 2013].

The next focus of the research is data mining parameter analysis. Chapter 7 is a study of time window size; an empirical validation assessing different time window sizes and their effect on detecting evolutionary couplings is presented [Alali, Sutton, Maletic 2014b]. Chapter 8 is a broader study relative to Chapter 7. It investigates a group of parameters using prediction and regression analysis, and assesses the collective effectiveness of all parameters [Alali, Bartman, Newman, Maletic 2015]. Conclusions and future work are given in Chapter 9.

CHAPTER 2

BACKGROUND AND RELATED WORK

We cover a range of topics in this chapter. We start at the highest level with the broad umbrella of impact analysis in Section 2.1. We then turn to our main focus and give a brief background on evolutionary couplings in Section 2.2. Section 2.3 presents related work and background on static program analysis. The study of changesets plays an important role in evolutionary couplings research: the better the changesets approximated from version repositories, the higher the quality of the uncovered evolutionary couplings (Section 2.4). Finally, Section 2.5 presents background and related work on the use of code, change, and collaboration metrics to build fault prediction and effort estimation models.

2.1 Impact Analysis

Dependency and traceability analysis are methodologies for performing impact analysis. Dependency analysis traces change impacts horizontally, among software artifacts at the same level of abstraction (function to function), while traceability analysis runs vertically, across software artifacts at different levels of abstraction (code to design) [Kagdi, Gethers, Poshyvanyk, Collard 2010]. The literature is rich with dependency analysis methods that uncover relations using call graphs [Ryder 1979], program slicing [Gallagher, Lyle 1991], hidden dependency analysis [Chen, Rajlich 2001; Rajlich 1997; Yu, Rajlich 2001], lightweight static analysis approaches [Moonen 2002; Petrenko, Rajlich 2009], concept analysis [Tonella 2003], dynamic analysis [Law, Rothermel 2003], unified modeling language (UML) models [Briand, Labiche, Soccar 2002], and information retrieval [Antoniol, Canfora, Casazza, De Lucia 2000].
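As a minimal illustration of dependency-based impact analysis, the sketch below (Python, over a small hypothetical call graph) computes the impact set of a changed function as everything that transitively calls it. Real dependency analyses, such as the slicing and concept analysis approaches cited above, are far more precise; this is only a reachability example.

    from collections import defaultdict, deque

    # Hypothetical call graph: caller -> set of callees.
    calls = {
        "main": {"parse", "render"},
        "parse": {"read_file"},
        "render": {"draw", "read_file"},
        "draw": set(),
        "read_file": set(),
    }

    # Invert the graph (callee -> callers), since impact propagates to callers.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    def impact_set(changed):
        """All functions that transitively call the changed function."""
        impacted, queue = set(), deque([changed])
        while queue:
            fn = queue.popleft()
            for caller in callers[fn]:
                if caller not in impacted:
                    impacted.add(caller)
                    queue.append(caller)
        return impacted

    print(impact_set("read_file"))   # {'parse', 'render', 'main'}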

Programmers maintain software through source code changes, generally to fix a bug or add a feature to an existing system. A change to an artifact usually requires the developer to identify the other parts of the code that also require a change due to existing dependencies (structural, logical, conceptual, etc.). Bohner [Bohner 1996] characterized impact analysis as an activity that estimates all components to be changed. One impact analysis technique was proposed by Queille et al. [Queille, Voidrot, Wilde, Munro 1994], who suggested an interactive process in which the programmer, guided by dependencies among program components, inspects components and identifies the ones that are going to change; this process involves both searching and browsing activities. This interactive process was later supported by a formal model based on graph rewriting rules [Chen, Rajlich 2000].

Shawn, Robillard, and Hill et al. [Robillard 2005; Shawn, Gracanin 2003; Hill, Pollock, Vijay-Shanker 2007] proposed tools to navigate and prioritize system dependencies during various software maintenance tasks. Hill et al. [Hill, Pollock, Vijay-Shanker 2007] use lexical clues from the source code to identify related methods. Rountev et al. [Rountev, Milanova, Ryder 2001] estimated the impact of a change on tests. A comparison of different impact analysis algorithms is provided in [Orso et al. 2004]. Coupling measures have been used to support impact analysis in object-oriented systems [Briand, Wuest, Lounis 1999; Wilkie, Kitchenham 2000]. Briand et al. [Briand, Wuest, Lounis 1999] investigated the use of coupling measures and derived decision models for identifying classes likely to be changed during impact analysis. Their empirical investigation of structural coupling measures and their combinations showed that coupling measures can be used to focus the underlying dependency analysis and reduce impact analysis effort. Poshyvanyk et al. [Poshyvanyk, Marcus, Ferenc, Gyimóthy 2009] presented alternative sources of information (i.e., text in identifiers and comments) to capture dependencies that are not captured by the existing structural coupling measures.

2.2 Evolutionary Couplings

Evolutionary coupling is a dependency between software artifacts that have been observed to change together frequently, with probably identical changing behavior, during the evolution of a system [D'Ambros, Lanza 2006; Gall, Hajek, Jazayeri 1998]. Software elements are evolutionarily coupled if they changed together within a defined time window at least a minimum support number of times. Evolutionary coupling is a lightweight approach that uses software repositories to detect artifacts coupled as observed in history, versus the costly static analysis. It can find hidden dependencies and traceability links that are hard or impossible to find through conventional static analysis; it also uncovers explicit dependencies, as static analysis methodologies do, but without specific knowledge of the type(s) of those dependencies.


Evolutionary coupling methods based on history and maintenance activities are capable of uncovering dependencies that are part of change patterns. In addition, hidden dependencies among artifacts are hard to detect using static analysis because of the lack of direct static dependencies; such implicit dependencies are known as logical coupling [Gall, Hajek, Jazayeri 1998]. Due to the uncertainty on both sides, research has recently started to focus on hybrid approaches for better control in detecting change patterns and for higher accuracy. The following sections state the dimensions of our detection problem.

Ball et al. [Ball, Porter, Siy 1997] were among the first to introduce the idea and importance of co-changed artifacts. They introduced a visualization of co-changed classes to detect clusters of classes that frequently changed together during the evolution of a system. Gall et al. [Gall, Hajek, Jazayeri 1998] introduced the concept of evolutionary coupling and presented CAESAR, an approach to detect implicit relationships between modules using historical data extracted from the Concurrent Versions System (CVS) for a large telecommunications software system, and validated these potential dependencies by examining change reports that contain specific change information for a release. They provided insights concerning the architecture of the system; the uncovered dependencies identified modules that should be restructured and reengineered. Additionally [Gall, Jazayeri, Krajewski 2003], at a lower level of abstraction (classes), they were able to detect evolutionarily coupled items that indicate architectural weaknesses such as poorly designed interfaces and inheritance hierarchies.

Traditional evolutionary coupling techniques mostly use a small time window. The most commonly used versioning systems so far are CVS and Subversion (SVN). For CVS and similar systems, researchers use a sliding-window approach to define the mining basket, where two subsequent changes committed by the same author with the same log message are part of one transaction if they were at most 200 seconds apart [Fogel, O'Neill 2002; Zimmermann, Weisgerber, Diehl, Zeller 2004]. In Subversion, a revision is the time window. A revision, or commit, is a set of modified files submitted and merged together into the source tree of files and folders [Collins-Sussman, Fitzpatrick, Pilato 2004]. Such a unit is used to cluster software elements based on the hope, or the assumption, that a commit is logically atomic, which is an ideal that does not hold in practice. Even so, the committed files are to some degree related and responsible for a change or a task, and the frequent occurrence of a group of artifacts raises confidence that there is an actual dependency among its elements. We decided to use an arbitrary coarse-grained window, e.g., a revision, and then measure the change with counting metrics, based on such a window, over events that happened during maintenance, to support the coupling.
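The 200-second sliding-window heuristic mentioned above can be sketched roughly as follows (Python, with hypothetical check-in data): consecutive per-file CVS check-ins are folded into one transaction when they share the author and log message and each check-in follows the previous one by at most 200 seconds. This is only a sketch of the grouping idea, not the tooling used in this dissertation.

    WINDOW = 200  # seconds

    # Hypothetical per-file CVS check-ins: (timestamp, author, log message, file).
    checkins = [
        (1000, "alice", "fix crash", "a.c"),
        (1100, "alice", "fix crash", "a.h"),
        (1250, "alice", "fix crash", "main.c"),
        (5000, "bob",   "new feature", "b.c"),
    ]

    def sliding_window_transactions(checkins, window=WINDOW):
        """Group check-ins into transactions: same author and log message,
        each check-in at most `window` seconds after the previous one."""
        transactions, current, last = [], [], None
        for time, author, log, path in sorted(checkins):
            if (last is not None
                    and (author, log) == (last[1], last[2])
                    and time - last[0] <= window):
                current.append(path)
            else:
                if current:
                    transactions.append(set(current))
                current = [path]
            last = (time, author, log)
        if current:
            transactions.append(set(current))
        return transactions

    print(sliding_window_transactions(checkins))
    # [{'a.c', 'a.h', 'main.c'}, {'b.c'}]  (element order may vary)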

Evolutionary couplings support static analysis by empirically assessing the importance of dependencies that can be detected statically between software elements. Despite the ease and speed with which they discover change patterns, however, evolutionary couplings suffer mainly from high false positives [D'Ambros, Lanza 2006; Robbes, Pollet, Lanza 2008], and such inaccuracies are problematic. Evolutionary coupling techniques produce change patterns without any hint of the reasons behind their existence. To learn the reasons behind a detected pattern, a developer needs to investigate it manually or use static techniques to look for implicit or explicit dependencies.

Both evolutionary and static coupling are able to detect software change patterns. Detection is faster and easier with evolutionary coupling [Hassan 2008] but has low accuracy: on average, historically co-changed entities are correct 30% of the time (precision), and 44% of the entities that must co-change are correctly proposed (recall) [Hassan, Holt 2004a; Sayyad, Lethbridge 2001; Ying, Murphy, Ng, Chu-Carroll 2004; Zimmermann, Weisgerber, Diehl, Zeller 2004]. Static analysis, on the other hand, is a more time-consuming approach that produces an enormous number of coupling links among artifacts. Static analysis is code-based: it presents all dependencies and coupling links without indicating which of these dependencies are the reason behind the existence of change patterns.
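The precision and recall figures quoted above are computed in the usual way; the small sketch below (Python, with hypothetical file sets) shows how a predicted impact set is scored against the entities that actually co-changed, which is also how prediction quality is evaluated later in this dissertation.

    def score(predicted, actual):
        """Precision, recall, and F-measure of a predicted impact set
        against the set of entities that actually co-changed."""
        tp = len(predicted & actual)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(actual) if actual else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    predicted = {"a.cpp", "a.h", "b.cpp"}        # suggested by mined rules
    actual = {"a.cpp", "a.h", "c.cpp", "d.cpp"}  # what really changed together
    print(score(predicted, actual))  # (0.667, 0.5, 0.571), approximately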

A technique similar to that of Gall et al. was applied by Ratzinger et al. [Ratzinger, Fischer, Gall 2005] at the class level to identify code smells [Van Emden, Moonen 2002] based on the evolutionary coupling between classes of the system. At a finer granularity, Zimmermann et al. [Zimmermann, Weisgerber, Diehl, Zeller 2004] presented ROSE to find change patterns among entities (files, classes, functions, and lines of code) using a sliding-window approach. ROSE suggests further changes to be made and warns about missing changes. This led to work on automatic bug prediction (e.g., [Kim, Zimmermann, Pan, Whitehead 2006; Ying, Murphy, Ng, Chu-Carroll 2004]). Ying et al. [Ying, Murphy, Ng, Chu-Carroll 2004] proposed data mining techniques that recommend relevant source code to a developer performing a modification task.

D'Ambros et al. [D'Ambros, Lanza 2006; D'Ambros, Lanza, Lungu 2009] presented Evolution Radar, a visualization technique for analyzing evolutionarily coupled items to detect architecture decay and coupled components. Pinzger et al. [Pinzger, Gall, Fischer, Lanza 2005] proposed a visualization method to detect potential refactoring candidates; they present module change couplings as Kiviat diagrams with edges connecting the modules. Beyer and Hassan [Beyer, Hassan 2006] proposed a visualization technique called Evolution Storyboards, a sequence of animated panels showing CVS repository files in which the distance between two files is computed from their change coupling values to spot clusters of related files.

Robbes et al. [Robbes, Lanza 2005; Robbes, Pollet, Lanza 2008] presented a fine-grained logical/evolutionary coupler whose goal is to overcome the shortcomings of version control systems (VCSs). Most VCSs are file-based rather than entity-based, and snapshot-based rather than change-based, i.e., changes that happen between two subsequent snapshots are lost. As a result, all files in a revision have their coupling increased by the same amount, regardless of how and how much they changed.

Kagdi et al. [Kagdi, Gethers, Poshyvanyk, Collard 2010] presented a conceptual and evolutionary coupling technique to improve change impact analysis in software source code. They use evolutionary couplings and mine source code commits using information retrieval to build conceptual couplings from the source code of a release of a software system. The premise is that such combined methods improve the accuracy of impact sets. A limitation of mining techniques is that a long period of history is necessary to produce quality change patterns with high support and confidence. Canfora et al. [Canfora, Ceccarelli, Cerulo, Di Penta 2010] addressed this problem by combining a mining approach with a statistical learning algorithm. The method uses the Granger causality test to infer consequent changes from evolutionary dependencies between multiple time series. The precision achieved is between 25% and 60% over four systems, while recall is between 15% and 60%. Our work differs in that a simple static measure is used instead of a learning algorithm.

2.3 Static Program Analysis

Source code artifacts and their development are recorded carefully via commits to version control systems. Source code repositories allow access to the source code at any point in time; such detailed, fine-grained evolutionary steps are a rich source for high-quality research. This kind of analysis is used to extract information for comparing system versions. In [Williams 2005; Williams, Hollingsworth 2004], information was automatically mined from the source code repository to support locating and fixing bugs; the aim was to detect function return values that introduce bugs into the system. Another bug type is incomplete refactoring, e.g., adding or removing parameters and renaming methods at different levels of the class hierarchy [Görg, Weißgerber 2005]. Incomplete refactorings may not be detected by compilers (non-overridden methods) and may harm the system; locating such bugs early reduces later complications. Kim et al. [Kim, Whitehead, Bevan 2005] presented a tool to identify function signatures across versions. Function signatures were traced for changes across versions and classified based on the data flow between called/calling functions.

RefactoringCrawler [Taneja, Dig, Xie 2007] and UMLDiff [Xing, Stroulia 2005] analyze code relationship similarity between versions of programs, looking for refactorings at a high level of granularity. [Kim, Pan, Whitehead 2005] presented an automated implementation of the origin analysis approach for Java, but reduced the number of change types the technique detects to the renaming and moving of code artifacts. [Kim, Notkin, Grossman 2007] proposed a technique for detecting change patterns by leveraging the similarity of program element names. Such techniques make intensive use of static analysis methods, which fit the third dimension, static analysis information, of the presented multidimensional hybrid detector.

Dex [Raghavan et al. 2004] detects syntactic and semantic changes from version history. Each version of the C code (C programming language) is represented as an abstract semantic graph (ASG). A differencing algorithm is applied to identify the nodes that are added, deleted, modified, or moved to obtain one ASG from another. Each pair of contiguous ASGs is analyzed to answer questions about specific changes, e.g., how many functions and function calls are inserted, added, or modified. Collard et al. [Collard, Decker, Maletic 2011; Maletic, Collard 2004] presented a syntactic meta-differencing approach to highlight syntax differences after encoding the source code as ASTs in an XML format, namely srcML. Added, deleted, or modified sections are marked in an extended srcML format, namely srcDiff, and the types of syntactic changes are then computed; this allows queries to be performed as XPath expressions on the srcDiff format. Weißgerber et al. [Weißgerber, Diehl 2006] presented a technique for identifying changes that are refactorings.

Software metrics are used to quantitatively assess software quality. Such quantities can be the size, effort, and complexity [Kagdi, Collard, Maletic 2007b] of software artifacts. Bieman et al. [M. Bieman, Andrews, Yang 2003] presented a metrics-based approach for detecting change-prone classes, i.e., classes that co-change frequently and are likely to change again in the future; visualization was used to understand these clusters of classes.

Origin analysis by Tu et al. [Tu, Godfrey 2002] determines whether entities were added or deleted from one version to the next. Evolution metrics such as lines of code, S-complexity, cyclomatic complexity, and the number of function parameters were measured for each artifact in a version and stored in a vector of evolution metrics. The similarity between two artifacts in different versions was represented by the Euclidean distance between their vectors. The similarity values were used for origin analysis of a given entity, where a smaller distance means it is more probable that an artifact in the current version originated from the other artifact in a previous version.
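A minimal version of that similarity computation might look like the sketch below (Python, with hypothetical metric vectors): each entity is described by a vector of evolution metrics, and the closest entity in the previous version is proposed as its origin. The specific metric values and entity names are illustrative only.

    import math

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Hypothetical evolution-metric vectors: (LOC, cyclomatic complexity, #parameters).
    previous_version = {
        "util.c::pack":   (120, 14, 3),
        "util.c::unpack": (95, 10, 2),
    }
    current_name, current_vector = "util.c::pack_v2", (118, 15, 3)

    # Smallest distance = most likely origin of the current entity.
    origin = min(previous_version,
                 key=lambda e: euclidean(previous_version[e], current_vector))
    print(current_name, "most likely originated from", origin)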

2.4 Changesets

In the literature, changesets have mixed and varied definitions, primarily based on the unit of time under which a set of files is related. The traditional time window is 200 seconds [Zimmermann, Weisgerber, Diehl, Zeller 2004; Zimmermann, Weissgerber 2004], with the related files having the same author and log message in CVS repositories. Later, with the emergence of transactional commits, e.g., in Subversion, the time window was reduced to a single transaction [Kagdi, Collard, Maletic 2007b; Kagdi, Maletic 2007; Kagdi, Yusuf, Maletic 2006; Moser, Pedrycz, Succi 2008]. Rastkar and Murphy [Rastkar, Murphy 2009] defined changesets as transactions in which the files committed together relate to the same task.

Estublier et al. [Estublier et al. 2005] define a changeset as the files changed and committed in association with a feature or task. McNair et al. [McNair, German, Weber-Jahnke 2007] defined a changeset as a logical commit, i.e., the files modified for a single modification request. Hassan and Holt [Hassan, Holt 2004a] introduced what would be a gold-standard definition of a changeset: the set of changes responsible for a single maintenance task. This could be a single commit, part of a commit, or spread over multiple commits. Unfortunately, this ideal set is hard to define without disciplined versioning policies, which would require developers to commit only changes to files that are related to a given task. As reported by Vanya et al. [Vanya, Premraj, Vliet 2011], developers often commit changed files together that relate to multiple tasks. Systems supporting task-oriented versioning would also help in this regard.

Based on our empirical evaluation of typical Subversion commit sizes, we found that the majority of commits are small, while a few are massive, thus exhibiting a power-law-like distribution [Alali, Kagdi, Maletic 2008]. Similar observations were reported by other studies [Arafat, Riehle 2009; Riehle, Kolassa, Salim 2012]. This observation follows the convention of committing changes incrementally. It encouraged us to study the quality of changesets generated by grouping transactional commits from Subversion repositories using different time windows. We use the traditional evolutionary dependencies and their ability to predict change impacts. We use fine- to coarse-grained time windows: a single commit, an hour, a day, and a week of transactional commits. We further study how effectively the different time windows cross predict change impact against each other, and then how the combined evolutionary dependencies generated by different time windows can improve change predictability measures.
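A rough sketch of how such fine- to coarse-grained windows can be formed is shown below (Python, with hypothetical commit data): commits are bucketed by commit id, hour, day, or week, and the union of files in each bucket becomes one mining transaction. This illustrates the grouping idea only, under assumed data, not the exact tooling used in the later chapters.

    from collections import defaultdict
    from datetime import datetime

    # Hypothetical commit log: (revision, ISO timestamp, files touched).
    commits = [
        (101, "2009-04-01T10:05:00", {"a.cpp", "a.h"}),
        (102, "2009-04-01T10:40:00", {"a.cpp", "b.cpp"}),
        (103, "2009-04-02T09:00:00", {"c.cpp"}),
    ]

    def transactions(commits, window="day"):
        """Merge commits into transactions keyed by commit, hour, day, or week."""
        buckets = defaultdict(set)
        for rev, stamp, files in commits:
            t = datetime.fromisoformat(stamp)
            key = {
                "commit": rev,
                "hour": (t.year, t.month, t.day, t.hour),
                "day": (t.year, t.month, t.day),
                "week": tuple(t.isocalendar()[:2]),  # (ISO year, ISO week)
            }[window]
            buckets[key] |= files
        return list(buckets.values())

    print(transactions(commits, "day"))
    # [{'a.cpp', 'a.h', 'b.cpp'}, {'c.cpp'}]  (element order may vary)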

2.5 Failure Prediction and Maintenance Effort Using Metrics

Metrics of code, change, and collaboration have recently been used for failure prediction and maintenance effort estimation. Many models have been proposed using a single metric or a mixture of metrics, targeting higher failure prediction accuracy and better effort estimation. Some models also learn from previous and recent changes and defects.

2.5.1 Code Metrics

Code metrics, such as size and complexity metrics, are widely used for defect prediction and effort estimation, based on the idea that more complicated code is more likely to produce bugs because it is more sensitive to modification. In the 1970s, McCabe [McCabe 1976] and Halstead [Halstead 1977] defined static code attributes that were shown to assess code complexity. Basili et al. [Basili, Briand, Melo 1996] used the Chidamber and Kemerer metrics [Chidamber, Kemerer 1994], and Ohlsson et al. used McCabe's cyclomatic complexity on an Ericsson telecom system, for defect prediction. El Emam et al. used cyclomatic complexity metrics and Briand's coupling metrics [Briand, Daly, K. Wüst 1999] to predict faults in a commercial Java system [El Emam, Melo, Machado 2001]. Subramanyam et al. used cyclomatic complexity metrics on a commercial C++/Java system [Subramanyam, Krishnan 2003].

Gyimothy et al. [Gyimóthy, Ferenc, Siket 2005] used the Mozilla open source code and investigated, with statistical and machine learning methods, how strongly code size and cyclomatic complexity metrics correlate with the fault-proneness of classes. Nagappan et al. [Nagappan, Ball 2005a; Nagappan, Ball, Zeller 2006] used static analysis tools to predict pre-release defect density and showed a strong correlation with the actual pre-release defect density. Later [Nagappan, Ball, Zeller 2006], they used code metrics to predict post-release defects in Microsoft systems; the main finding was that predictors do work on an individual project, but no single predictor fits all. Zimmermann et al. [Zimmermann, Premraj, Zeller 2007] found a significant correlation between code metrics and pre- and post-release defects over three releases of Eclipse; they used logistic regression models to predict defects at the package and file level with high accuracy. Menzies et al. [Menzies, Greenwald, Frank 2007] used the Naïve Bayes learner and information theory for defect prediction on the NASA MDP repository. They observed that no set of code metrics consistently outperforms the rest for defect prediction; the best choice depends on the characteristics of the individual data sets, the information-theoretic feature selection methods, and the machine learners.

In [Herraiz, Hassan 2010], Herraiz et al. used one open source system to study the relationships between different size and complexity metrics. The results show that for non-header files written in C, all the complexity metrics are highly correlated with lines of code, and therefore the more complex metrics provide no further


information that could not be measured simply with lines of code. More recently,

[Herraiz, German, Hassan 2011] studied the distribution of software source code size and showed that it follows a double Pareto distribution, while it had been reported to be lognormal in previous literature [Concas, Marchesi, Pinna, Serra 2007]. They note that a lognormal can underestimate the size of large files, as in GNU/, where large files add up to 40% of the whole source code. Such underestimation could impact the accuracy of size estimation of an overall system, as in [Zhang, Tan, Marchesi 2009].

2.5.2 Code Change Metrics

Recently, change-history-based metrics have featured more prominently in defect and effort prediction research. Code churn, which is the amount of changed code collected from archived repositories, was proposed by Munson et al. [Munson, Elbaum 1998].

Nagappan et al. [Nagappan, Ball 2005b] showed that a relative code churn metric (e.g., normalized by a file's LOC) is very effective for defect prediction and better than absolute code churn, with accuracy up to 80%. Hayes et al. [Hayes, Patel, Zhao 2004] derived a model for estimating adaptive maintenance effort. They found that code and change metrics are strongly correlated with maintenance effort. Hayes et al. [Jane Huffman, Zhao 2005] used maintenance effort to build a maintainability prediction model. The model possesses predictive accuracy of up to 83%.

Ramil et al. [Ramil, Lehman 2000] presented models to predict maintenance effort using metrics of software evolution. They found that models based on coarse-granularity measures (subsystem level) perform similarly to those based on finer ones (module level). Graves et al.

[Graves, Karr, Marron, Siy 2000] show that change data (number of modifications, age of


files, size of changes, etc.) are better defect predictors than code metrics such as

McCabe’s cyclomatic complexity. De Lucia et al. [De Lucia, Pompella, Stefanucci 2002;

2005] presented a corrective maintenance effort estimation model. In their studies, they used multiple linear regression analysis to construct the effort estimation models and validated them against real maintenance project data. They found that the performance of the maintenance effort model can be improved if the types of the different maintenance tasks are taken into account. Ostrand et al. [Ostrand, Weyuker, Bell 2005] obtained highly accurate results for predicting the most fault prone files in a large software system:

20% of the files with the highest predicted number of faults contained on average 83% of the faults that were actually detected. In contrast to the study of Bell et al., file size was found to be an important fault predictor.

Knab et al. [Knab, Pinzger, Bernstein 2006] used static code attributes, in particular software size, together with a set of metrics derived from the change history of the

Mozilla project in order to build classification trees and obtained promising results (up to

59% of correctly classified instances). Bell et al. [Bell, Ostrand, Weyuker 2006] successfully applied negative binomial regression models to identify the most fault-prone files (20% of the files, which contain on average 75% of the total number of faults) in an industrial software system. They used lines of code, file age, change history, and programming language type as independent variables. They found that change data significantly improved (by a factor of two) prediction accuracy compared to a model that uses only lines of code as a predictor variable.


Ratzinger et al. [Ratzinger, Pinzger, Gall 2007] used 63 predictors including various size metrics, measures of the change history, and other process-related metrics (problem difficulty, team structure, etc.) for short-term defect prediction. They found that size and complexity measures do not dominate defect-proneness prediction; many people-related issues are more important. Graves et al. [Graves, Karr, Marron, Siy 2000; Moser, Pedrycz, Succi 2008; Ratzinger, Pinzger, Gall 2007] confirmed that change data and process metrics contain more discriminatory and meaningful information about the defect distribution in software than the source code itself. Moser et al. [Moser, Pedrycz, Succi 2008] used the number of revisions, previous fixes, age of file, and authors as independent variables for a defect prediction model. Hassan [Hassan 2009] used the entropy of changes to measure the complexity of code changes. He found that entropy is often a better bug predictor than change and code metrics.

Arisholm et al. [Arisholm, Briand, Fuglerud 2007] examined several data mining techniques used for fault prediction and validated their work on a large telecommunications product. The authors also discuss techniques of data collection, model selection, and model validation. Some of the discussed data mining techniques include logistic regression, neural networks, and decision trees. D'Ambros et al. [D'Ambros, Lanza, Robbes 2010a] conducted an extensive comparison of existing bug prediction approaches using source code metrics, change history metrics, past defects, and entropy of change metrics. They also proposed two novel metrics: churn and entropy of source code metrics.


Weyuker et al. [Weyuker, Ostrand 2010] at AT&T studied large closed systems over extended periods of years. The goal was to build a tool and a statistical model to identify buggy files prior to the system testing phase. The study confirms previous findings in the literature regarding the Pareto distribution of faulty files, where 80% of bugs originate from 20% of the files. A negative binomial regression was used to build the prediction model, using metrics as independent variables. The variables were a combination of code and change metrics (LOC of each file, added files, changes and faults in recent releases, and the programming language used in development). Weyuker et al. [Weyuker, Ostrand, Bell 2007] found that developer information helped to improve their prediction model based on file size and change data. Kim et al. [Kim et al. 2011] trained a machine learner on the features of top crashes of past releases that was able to effectively predict the top crashes well before a new release comes out.

2.5.3 Previous Changes and Defects

Hassan and Holt [Hassan, Holt 2005] presented a top ten list approach that locates the most defect-prone entities among the most recently changed and fixed files, using the defect repositories of six open source systems. They found that recently modified and fixed entities were the most defect-prone. Ostrand et al. [Ostrand, Weyuker, Bell 2005] predicted faults on two industrial systems using change and defect data.

Kim et al. [Kim, Zimmermann, Whitehead, Zeller 2007] presented a bug cache approach that uses recent changes and defects, assuming that faults occur in bursts. The approach approximates faults with bug-introducing changes [Śliwerski, Zimmermann, Zeller 2005]. They processed different artifact granularities: at the file level, the cache covers about 73-95% of future faults; at the function/method level, it covers 46-72% of future faults with a cache size of only 10%. Seven open-source systems were used to validate the findings. Bernstein et al. [Bernstein, Ekanayake, Pinzger 2007] used bug and change information in non-linear prediction models, where six Eclipse plug-ins were used to validate the approach.

2.5.4 Collaboration Metrics

Zimmermann and Nagappan [Zimmermann, Nagappan 2008] predicted defects in Windows Server 2003 using network analysis among binaries. They applied network analysis to dependency graphs for predicting failures in files. By applying metrics of centrality and network motifs to the directed dependency graphs of source code, they found that central components were more failure-prone. Furthermore, network metrics identified 60% of the critical, failure-prone binaries, which was better than object-oriented complexity metrics that identified only 30%. In addition to using the centrality metrics of closeness and betweenness, Zimmermann and Nagappan used statistical regression techniques similar to the ones we use in our analysis.

Bacchelli et al. [Bacchelli, D’Ambros, Lanza 2010] proposed popularity metrics based on e-mail archives. They assumed the most discussed files are more defect-prone.

Meneely et al. [Meneely, Williams, Snipes, Osborne 2008] proposed developer social network based metrics to predict defects. These proposed metrics play an important role in defect prediction, and yield reasonable prediction accuracy. However, they do not capture developers' direct interactions.


Weyuker et al. [Weyuker, Ostrand, Bell 2007] examined various releases of a large industrial software system to predict which files are most likely to contain the largest number of faults. Inspection guidance and automated testing efforts are among the applications intended for their fault prediction model. Their model is based on the negative binomial distribution, and its variables, based on developer information, attempt to capture the number and type of developers who have worked on any given file. Validation of their model included a comparison with a working model based on static code metrics and churn information. Weyuker et al. reported finding 84.9% of the faults in 20% of the files with the developer information, whereas without the developer information 83.9% of the faults were found.

Mockus and Weiss [Mockus, Weiss, Zhang 2003] used metrics based on developer information for failure prediction to assess risk in a large industrial software system.

Developer metrics included counts of distinct developers and a quantitative measurement of developer experience in terms of recent changes in the current project, experience in the subsystem, and experience in the product overall. They used stepwise variable selection to construct a logistic regression model for estimating post-release failures.

Arisholm and Briand [Arisholm, Briand 2006] identified developer experience and skill level as fundamental factors affecting fault-proneness in an object-oriented system.

Since they had no data on skills and experience of developers, they did not consider developer information in their model. Nonetheless, they used a stepwise logistic regression model and a cross-validation classification analysis to validate their results.

Most of the variables in their model could be classified in the categories of object-


oriented metrics and code churn information. Their results from cross validation analysis showed less than 20% false positives and false negatives, with an estimated verification effort savings of 29%.

Lopez-Fernandez et al. [Lopez-Fernandez, Robles, Gonzalez-Barahona 2004] proposed the idea of creating developer networks from source repositories as a method of characterizing projects. Their main focus was to organize open source projects into various categories based on models of collaboration. The developer networks that Gonzalez-Barahona and Lopez-Fernandez propose are constructed in a manner similar to ours, except that the edges of the graph are weighted by the number of files the pair has collaborated on. The weights in their network introduce variations on the centrality and connectivity metrics, such as a “clustering coefficient”. In addition to a developer network, they used a module network, where two modules were connected if they were committed together.

Huang and Liu [Huang, Liu 2005] used SNA based on source repositories to examine the learning process in Open Source projects. Their primary analysis involved using Legitimate Peripheral Participants, a network-based theory proposed by Lave and

Wenger [Lave, Wenger 1991]. Huang and Liu concluded that developers could be divided into core and non-core groups, which loosely affected a “project’s vitality and popularity” [Huang, Liu 2005].

Hudepohl et al. [Hudepohl et al. 1996] used developer information in combination with various other metrics to create a risk assessment tool at Nortel called EMERALD.

The developer information was a measurement of experience similar to the variables used


by Mockus and Weiss [Mockus, Weiss, Zhang 2003]. EMERALD’s developer variables, however, incorporated developer experience in terms of Nortel career, as opposed to specific projects. For example, one of the experience measurements was the count of the number of developers who were within their first ten code updates while working at

Nortel as a way to identify inexperienced developers. EMERALD’s other variables included complexity metrics, customer usage metrics, churn information, and past failure counts from both testing and post-release phases. Hudepohl et al. reported that over half of the field failure patches were correctly identified as “red” (highest risk) in 20% of the files.

Based on a study of the correlation of simple code metrics and maintenance effort, Polo et al. [Polo, Piattini, Ruiz 2001] demonstrated that it is possible to estimate the maintenance effort in the initial stages of a maintenance project, when the maintenance contract is being prepared and there is very little information available on the software to be modified.

CHAPTER 3

USING CHANGE METRICS TO IMPROVE THE DETECTION OF

EVOLUTIONARY COUPLINGS

This chapter describes an approach to improve the accuracy of evolutionary couplings uncovered from version history. Change metrics are combined with traditional methods for computing evolutionary couplings with the goal of reducing the number of false positives (i.e., inaccurate or irrelevant claims of coupling). The standard method of computing evolutionary couplings using data mining techniques is compared with an approach that incorporates change metrics with data mining.

Three different methods are used for the evaluation. The predictive ability of each approach is compared over seven systems. Next, the data mining interest measures of lift and confidence are used to compare the quality of the resulting rules produced. And, finally, a manual examination of a small code base is done to identify relevant couplings.

These results are used to compare the other two approaches via precision and recall. The results demonstrate that the use of change measures reduces the number of false positives produced.

3.1 Introduction

The work presented here addresses the problem of reducing the number of false positives by using statically derived change metrics, which quantify the extent of change to an artifact over time, to help filter irrelevant items. In particular, we evaluate the


efficacy of change metrics of line of code (LOCC), hunk (HC), and function (FC) churns to reduce false positives in the automatic identification of explicit and implicit couplings.

The change metrics used to identify couplings are logarithmically scaled to eliminate skewness in the analysis of real-world datasets, since the observed changes follow a power-law distribution.

Our evaluation confirms our hypothesis: change metrics improve the quality of results when detecting evolutionary couplings. Items coupled by their extent of change consistencies have stronger relationships than coupled items found without incorporating change metrics.

For example, consider two artifacts foo and bar. We find that they historically change together very often. But we also find that the number of lines changed in the two artifacts varies greatly. That is, sometimes foo changes by 10 lines and other times it changes by 100 lines; likewise for bar. This implies that the particular changes taking place each time may be quite different and possibly unrelated at times.

However, if we take the size of the change into account when constructing itemsets, we find that the pattern of foo changing by 6 lines and bar changing by 7 lines happens quite often in proportion to other combinations. Since the magnitude of the change is similar each time, we have more confidence that it is the same type of related change occurring and, hence, it is less likely to be a false positive.

The chapter is organized as follows. Section 3.2 presents evolutionary coupling and our time-window choice. Section 3.3 defines the change metrics and presents an empirical survey of change metrics distributions in the context of software evolution.


The change measure distributions are then used in Section 3.4 as a basis for our approach to detect change patterns. An evaluation of the approach is presented in Section 3.5.

Three different approaches for validation are used: manual inspection, an automated comparison of the predictive quality of the generated rules, and the use of interestingness measures from association rule mining to assess quality. Discussion and conclusions follow.

3.2 Detecting Evolutionary Coupling

Evolutionary couplings are determined by examining the history of a software system [D'Ambros, Lanza 2006; Gall, Hajek, Jazayeri 1998]. It is an alternative approach to traditional static or dynamic analysis. The intent is to find hidden dependencies and traceability links that are difficult to identify through structural or physical dependencies in the code. Software elements are evolutionarily coupled if they are frequently co-changed within a defined time window, with at least a minimum support, over a given duration of history. The typical manner to uncover these couplings is to search for patterns of frequently co-changing items in the history. Thus, data mining techniques are commonly used to address this problem. We use the Apriori algorithm for frequent pattern (or itemset) mining to uncover evolutionary couplings [Agrawal, Srikant 1994].

The technique searches for patterns of co-changing items. The underlying idea is that if items change together on a very frequent basis then they must be related to one another in either an explicit or implicit manner.
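As an illustration of this mining step, the following minimal Python sketch counts co-changing file pairs across changesets and keeps those that meet a minimum support; it covers only itemsets of size two (full Apriori extends the idea to larger itemsets), and the file names and support value are hypothetical.

from itertools import combinations
from collections import Counter

# Each changeset is the set of files modified in one Subversion revision.
changesets = [
    {"Sheet.cpp", "Sheet.h", "Cell.cpp"},
    {"Sheet.cpp", "Sheet.h"},
    {"Cell.cpp", "CellView.cpp"},
    {"Sheet.cpp", "Sheet.h", "CellView.cpp"},
]

def frequent_pairs(changesets, min_support):
    """Count co-changing file pairs and keep those meeting the minimum support count."""
    counts = Counter()
    for changeset in changesets:
        for pair in combinations(sorted(changeset), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(changesets, min_support=2))
# {('Sheet.cpp', 'Sheet.h'): 3}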

The technique has a number of parameters that must be selected for the particular problem and data set. The first is the size of a transaction (or changeset). The


transaction size dictates which items “change together”. Since logical commits and physical commits are typically not mapped one-to-one, a time duration/window is needed to group physical commits into logical ones. Prior studies conducted on CVS repositories typically employ a sliding window approach in which two subsequent changes committed by the same author with the same log message are part of one transaction if they are at most 200 seconds apart [Zimmermann, Weisgerber, Diehl, Zeller 2004].

Studies targeting Subversion base their differences on a revision. A revision is a commit that can incorporate multiple files. Revisions in Subversion have been used to cluster software elements based on the assumption that a commit is logically atomic [Collins-

Sussman, Fitzpatrick, Pilato 2004]. As such, we choose to use the traditional time window of a revision since all the systems studied are in Subversion.

The next parameter that must be selected is the minimum support. This value regulates the lower bound on the frequency of the patterns (itemsets) produced. Low values of minimum support generate larger numbers of patterns, while higher values produce fewer patterns but with more support. Lastly, we must select the granularity of the changing item. For source code this could be a line of code, a function, a class, a file, a module, or a subsystem. In our case we use the file level. This is a common level of granularity to use and maps well to commits. Each granularity level has pros and cons. For example, module dependencies uncover architectural issues and likely would miss detailed low-level dependencies. Patterns of co-changing methods result in finer dependencies but miss the big picture. It would be interesting to research the effectiveness of synergic


patterns from different granularities. We do not attempt this because it is beyond the scope of this project.

3.3 Change Metrics

We now introduce the use of change metrics to help filter out coincidentally co- changing artifacts. Here, a change metric is defined to be the extent of change to an artifact during a unit of time (e.g., a revision). This definition attempts to capture the idea that evolutionarily coupled artifacts require similar levels of change over a given unit of time. Pattern mining techniques for detection of evolutionarily coupled artifacts often produce misleading results. For example, a developer who modifies a file in a system may forget to modify related files because they are placed in other subsystems or packages [D'Ambros, Lanza 2006]. This is largely due to the fact that the couplings are derived entirely from the temporal relationship of commits of the files involved.

The extent of change of an artifact can be measured in a number of different ways.

Specifically, we measure the cumulative changed lines (i.e., the number added, deleted, or modified), the changed hunks (modified), and the changed functions (added, deleted, and modified) over a given period of time. A hunk is a standard term from differencing and is a contiguous group of changed lines along with contextual unchanged lines. In order to develop a method for identifying change patterns that incorporates measures of change, we must first understand how changes are distributed across files. This motivates our decisions about the analysis techniques presented later. We collected data on the number of lines of code changed during the lifetime of a number of projects. This information is readily available from the version history of the project. We sorted each


transaction (revision) by changed lines to produce a histogram. The histogram for KDE’s

KOffice (years 2001-2010) is shown in Figure 1.

Figure 1. Changed lines histogram for 2001-2010 KOffice, where the X-axis is bins of size two, Y-axis is the frequency of lines changed in each bin

The extent of change is clearly a power-law like distribution [Newman 2005].

Likewise, these distributions were observed for the number of modified hunks and functions. Figure 2 and Figure 3 show the distribution of the lines, hunks, and functions change metrics for KOffice. All three measures follow power-law distributions. The raw data is shown in a log-log plot in Figure 2. The same data is re-plotted in Figure 3, where the x-axis is binned logarithmically [Newman 2005] to smooth the otherwise noisy tails. The change values (the y-axis) are also scaled logarithmically to produce a log-log plot. So the LOC churn frequency plot in Figure 2 (a) is the same as Figure 3 (a), but with a smoother line; the other metrics follow similarly. We also observed similar power-law distributions in the change metrics of the other systems in our study (Table 1).
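The logarithmic binning used for Figure 3 can be sketched as follows; the bin base and the sample values are assumptions for illustration, not the exact procedure used to produce the plots.

import math
from collections import Counter

def log_binned_histogram(values, base=2.0):
    """Group positive values into logarithmic bins and normalize each count by bin width."""
    bins = Counter()
    for v in values:
        bins[int(math.log(v, base))] += 1              # bin index grows with log(value)
    histogram = {}
    for index, count in sorted(bins.items()):
        low, high = base ** index, base ** (index + 1)
        histogram[(low, high)] = count / (high - low)  # density, so wide tail bins stay comparable
    return histogram

changed_lines = [1, 2, 2, 3, 5, 8, 13, 40, 120, 900]   # hypothetical per-revision LOC churn values
for (low, high), density in log_binned_histogram(changed_lines).items():
    print(f"[{low:g}, {high:g}): {density:.3f}")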


(Figure 2 panels: a) LOC Churn, b) Function Churn, c) Hunks Churn)

Figure 2. Distributions of LOCC, HC, and FC metrics for KOffice repository. The data plotted on a log-log scale.


(Figure 3 panels: a) LOC Churn, b) Functions Churn, c) Hunks Churn)

Figure 3. Distributions of LOCC, HC, and FC metrics for the KOffice repository. The same data as in Figure 2, binned and plotted on log-log scales.

From this data, it is clear that the largest changes are relegated to a small number of transactions. The majority of the actual work being done is found in smaller but vastly more frequent transactions. This is a fact that can be exploited when discretizing the data to the levels of change, as we will see in the next section.


Table 1. Characteristics of the seven open source systems used in the study including number of years, baskets (revisions), files, and changed lines, functions, and hunks.

System Years Revisions Files Lines Functions Hunks

KDELibs 04-10 2,191 290 78,577 6,122 7,860

KOffice 01-10 16,919 2367 1,040,674 87,549 98,255

Httpd 01-10 6,776 247 334,883 18,069 29,190

Subversion 01-10 14,983 448 938,332 63,601 88,613

Ruby 01-10 9,243 249 521,388 40,653 60,894

Python 01-10 7,049 337 452,345 31,048 38,421

Xapian 01-10 3,251 254 115,275 8,576 12,341

Previously [Alali, Kagdi, Maletic 2008], we studied the size of typical commits

(Subversion revisions) and the correlation between the extent of change for lines and hunks. We observed that 75% of commits modify between 2 and 4 files, and modify around 50 lines of code with approximately 8 different hunks. Again, power-law distributions were observed for these measures. There is a strong positive correlation (up to 0.75) between changed lines and hunks. Despite the high correlation between the two variables, measuring both changed lines and hunks as distinct change metrics remains an important component of the analysis. It shows not only the editorial extent of the change

(LOCC) but also gives an indication about the distribution of that change over the source file (HC).


Our previous work [Alali, Kagdi, Maletic 2008] did not examine measures on functions. We hypothesize a correlation similar to that of changed lines and hunks, thus inferring similar meaning. The extent of functional change is a measure of the degree to which an implementation has changed, and not simply a source file. No single measure can encode the extent of change to a file. However, a combination can yield more descriptive measures than individual measures.

Table 1 presents the open source projects used in this work; the number of files in each system, along with the changed lines, functions, and hunks over the duration of the study, are given. Using this data we generated the chart in Figure 4 to present the collected measures. The chart shows four groups of bars: the three change metrics and the revisions for each project, where the last group of bars is the same ratio computed for revisions. The data must be properly normalized for comparison because the systems have differing numbers of files and the observed total numbers of changes were generated over different time spans. We thus calculated the total number of changes per file per year, and present the results on a log scale.

The following observations can be made. We see that Ruby is the most active and changing project, followed by Subversion. KDELibs, Xapian, and KOffice are the least changing development projects. Also, we clearly see a size correlation between the three measures. KOffice has the highest change values (Table 1); however, it has the lowest committing rate and it is not that active. This is in part due to its large number of files.

Note that we use log10 for the Y-axis for a better normalization with respect to the other data values in the chart.


Figure 4. Size ratios (total change / no. of files / no. of years), where change is the collected LOCC, FC, and HC for the studied projects, plus the same ratio for revisions (rev).

3.4 Adding change metrics

In this section, we present our approach to using change metrics to support the detection of evolutionary change patterns between software artifacts. The use of change measures for filtering out irrelevant patterns differentiates this work from previous approaches. Recently, it has been shown [Arisholm, Briand, Johannessen 2010; Kamei et al. 2010; Menzies et al. 2010; Thilo, Koschke 2009] that bug prediction models perform significantly better when effort measures are taken into account. In these cases, a simple measure of LOC (of a file) was used to represent effort. However, we feel that our change metrics play a major role in determining evolutionary couplings, although we apply them here to a different problem.

Simply put, if two files change frequently together and the changes are similar in the extent of change each time, there is a higher likelihood that it is the same kind of change


each time. That is, if a small number of lines (say 5) change in one file and a medium number (say 15) change in another file frequently together, it is more likely to be the same change taking place each time. If the number of lines changing differs greatly each time, the likelihood of other types of changes or unrelated changes is higher. More succinctly, the core idea is: given a frequent pattern of co-changing items, we measure the extent of change for each instance of the co-changes. If all instances change with the same levels each time, we say that the pattern has consistency with respect to change.

This means that each time the changes occurred, they happened in about the same manner with respect to the level of change. One thing we can infer from this is that these changes are happening for the same reason each time. This consistency of the extent of change forms our heuristic to identify higher quality patterns and filter out false positives.

Co-changing artifacts with inconsistent levels of change are assumed to not represent an actual dependency (i.e., the co-change is coincidental).

As such, we need to discretize the change metrics into a set of levels or categories so general comparisons can be made between them. Because of the power-law distribution, we logarithmically scale the data into three discrete levels of change. In practice, any number of levels could be used; the more levels used, the finer the change granularity.

3.4.1 Discrete Change metrics

We define the extent of change on an artifact as a vector,

C = <LOCC, FC, HC>, where LOCC is the number of changed lines (modified, added or deleted), FC is the number of changed functions (modified, added or deleted), and HC is the number of changed hunks (modified) in an artifact during the observed time period.

Because of the observed distributions of these values reported in Section 3.3, simply constructing the vectors over raw data will skew subsequent analyses. In order to reduce the effect of skew caused by the power-law distributions, we logarithmically scale and discretize the observed data into the integer range [1, 3] such that 1 indicates the least amount of change and 3 indicates the greatest amount of change. The scaling is defined as:

s(x) = ⌊ q × ( log(x) / log(max) ) ⌋ + 1      (1)

From Equation 1, x is the actual change value and s(x) is the scaled change value, max is the maximum observed change measure value over the observed history of the project, and q is the number of division points between the minimum and the maximum observed values. This gives the required number of change levels: q = number of change levels − 1; here, we assign q = 2, which produces 3 change levels. For example, the maximum number of changed lines for a file in the observed history of KOffice was 6,085 source lines. In one revision, modifications to the file abiwordimport.cc changed 129 lines. For that file, its scaled change level is 2, a medium change.
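The scaling of Equation 1 translates directly into code. The sketch below assumes the natural logarithm (any base gives the same ratio) and clamps the raw value to at least 1; that clamp is our reading, not part of the original definition.

import math

def scaled_change_level(x, max_change, q=2):
    """Discretize a raw change value into q + 1 levels using Equation 1."""
    x = max(x, 1)  # guard: a touched artifact changes by at least one unit (our assumption)
    return math.floor(q * math.log(x) / math.log(max_change)) + 1

# KOffice example from the text: max changed lines = 6,085; abiwordimport.cc changed 129 lines.
print(scaled_change_level(129, 6085))   # 2, a medium change
print(scaled_change_level(6085, 6085))  # 3, the largest observed change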

3.4.2 Change metrics + Mining

We now extend traditional techniques for detecting evolutionary coupling with the change vectors described. The approach takes the change values into consideration when determining the frequently occurring patterns. The same Apriori algorithm for frequent


pattern (or itemset) mining [Agrawal, Srikant 1994] is used. The only difference is the input into that algorithm.

For each changing item (i.e., file) we compute the extent of change. The change measure value is 1, 2, or 3. The item (file) name is relabeled to reflect the change value.

For example, say a file named foo.cpp has a change level of 3 in a particular revision. For this particular transaction the file name is augmented with the change measure value and relabeled as foo.cpp3. This added information translates one item into possibly three specialized items, each corresponding to a file with a particular change level. The frequent pattern-mining algorithm is run on this augmented data, resulting in a set of new patterns that reflect the change measure values. This type of relabeling approach is a common practice in data mining [Witten, Frank, Hall 2011].

The patterns generated tell us two things: 1) what changed (a particular file) and 2) how it changed. Let us now look at a simple example. Say there are 40 occurrences of

(foo, bar) co-changing in our history. We use a minimum support of 10, so this represents a pattern. Adding the change data we have 10 occurrences of (foo1, bar3), 6 of

(foo2, bar1), 7 of (foo1, bar2), 16 of (foo2, bar3) and 1 of (foo1, bar1). No other combinations occur. With a minimum support of 10 we get two patterns (foo1, bar3) and

(foo2, bar3), but these two only cover 26 occurrences of the more general (foo, bar) pattern. The technique filters out occurrences of (foo, bar) that did not co-change with the same consistent change levels at the given minimum support. This heuristic attempts to identify only the files that co-change together consistently in the same (approximate) manner.
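A small sketch of the relabeling and re-mining step is given below; the history, change levels, and minimum support are hypothetical and simply mirror the foo/bar discussion.

from collections import Counter
from itertools import combinations

def relabel(changeset_with_levels):
    """Append the discretized change level (1-3) to each file name, e.g. foo.cpp -> foo.cpp3."""
    return {f"{name}{level}" for name, level in changeset_with_levels.items()}

# Hypothetical history: each entry maps a file to its change level in that revision.
history = [
    {"foo.cpp": 2, "bar.cpp": 3},
    {"foo.cpp": 2, "bar.cpp": 3},
    {"foo.cpp": 1, "bar.cpp": 3},
    {"foo.cpp": 1, "bar.cpp": 1},
]

counts = Counter()
for changeset in history:
    for pair in combinations(sorted(relabel(changeset)), 2):
        counts[pair] += 1

min_support = 2
print({p: n for p, n in counts.items() if n >= min_support})
# {('bar.cpp3', 'foo.cpp2'): 2} -- only the consistently co-changing combination survives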


We note that the itemsets covered by a pattern created using change measure information are a subset of those covered by a pattern created without change information

(when the minimum supports are the same). That is, the patterns of change-based evolutionary couplings are a subset of the patterns of evolutionary couplings.

3.4.3 Implementation

The implementation of the approach is achieved with two tools: Δshopper (pronounced “delta shopper”) and Koupler2 (read “coupler squared”). These tools target data available from the Subversion version control system. Δshopper is responsible for extracting metadata and differences from a Subversion repository. The program takes a unit of time (δ) and extracts information about the modification of artifacts over the course of the history. The program identifies modifications to files within each time δ and computes, for each file, its observed change measure values for LOCC, FC, and HC.

To collect LOCC, Δshopper uses Subversion’s unified diff output to count the added and deleted source lines by counting the ‘+’ and ‘–’ markers. Likewise, the HC is easily computed by counting the number of ‘@@’ hunk delimiters.
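A minimal sketch of this counting over a unified diff follows; it assumes Subversion's standard unified diff format and is not the actual Δshopper implementation.

def count_locc_and_hc(diff_text):
    """Count changed lines ('+'/'-') and hunks ('@@') in a unified diff for one file."""
    locc = hc = 0
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            hc += 1
        elif line.startswith("+") and not line.startswith("+++"):  # skip the +++ file header
            locc += 1
        elif line.startswith("-") and not line.startswith("---"):  # skip the --- file header
            locc += 1
    return locc, hc

diff = """\
--- foo.cpp (revision 41)
+++ foo.cpp (revision 42)
@@ -10,4 +10,5 @@
 context line
-old line
+new line
+added line
"""
print(count_locc_and_hc(diff))  # (3, 1)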

To collect FC, we first convert each versioned source code file to srcML [Maletic,

Collard 2004] and find each function’s region, the contiguous lexical range of lines in which the function is declared or defined. We map diff hunks (change regions) onto the function regions to find overlapping regions of modification. If a change region overlaps with a function’s region, then the function must have been changed by the commit. The data is


output as an XML file such that each time δ defines a transaction consisting of artifact/extent-of-change pairs.
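The overlap test between change regions and function regions amounts to interval intersection, sketched below; in the real tool chain the function regions come from srcML, but here they are hypothetical (start, end) line ranges.

def changed_functions(function_regions, change_regions):
    """Return the names of functions whose line range overlaps at least one changed region."""
    changed = set()
    for name, (f_start, f_end) in function_regions.items():
        for c_start, c_end in change_regions:
            if c_start <= f_end and f_start <= c_end:  # standard interval-overlap test
                changed.add(name)
                break
    return changed

# Hypothetical function regions (as srcML would yield) and hunk regions (from the diff).
functions = {"Sheet::recalc": (10, 40), "Sheet::paint": (50, 90)}
hunks = [(35, 38), (120, 125)]
print(changed_functions(functions, hunks))  # {'Sheet::recalc'}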

The Koupler2 program takes the generated XML file as input and computes scaled and discretized change vectors for each artifact in each time period as described in Section 3.4.1. Frequent pattern mining (via the Apriori algorithm [Agrawal, Srikant 1994]) is applied with an initial minimum support of s/T, where T is the number of observed time periods, here the number of revisions. Initially we start with a high value of s such that no patterns are produced. If the algorithm does not yield at least N frequent patterns, the minimum support is decremented by one and the algorithm is rerun. This continues until the minimum required number of patterns N has been found or the minimum support drops below 5/T. This is a standard manner of determining a minimum support [Kagdi,

Maletic, Sharif 2007; Zimmermann, Weisgerber, Diehl, Zeller 2004].
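The support-lowering loop described above can be summarized as follows; mine_patterns stands in for the Apriori call and is a placeholder, and the default constants simply mirror the values in the text.

def find_patterns(transactions, mine_patterns, n_required=1000, s_start=50, s_floor=5):
    """Lower the minimum support count until at least n_required patterns are found
    or the support floor is reached, as described for Koupler2."""
    total = len(transactions)   # T, the number of observed time periods (revisions)
    s = s_start
    patterns = []
    while s >= s_floor:
        patterns = mine_patterns(transactions, min_support=s / total)
        if len(patterns) >= n_required:
            break
        s -= 1                  # decrement the support count and rerun the mining
    return s, patterns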

The heuristics guiding the search were determined experimentally from the datasets; they may vary for different units of time δ. Koupler2 computes both evolutionary couplings and change-based evolutionary couplings for the specified time window δ. The tools generate results for both approaches. For change-based evolutionary coupling, any combination of change metrics can be used; namely LOCC, FC, HC, LOCC + FC, etc.

3.5 Evaluation

To validate the approach, we need to verify that Change-based Evolutionary

Coupling (CEC) patterns produce fewer false positives than Evolutionary Coupling (EC) patterns. We examine seven open source projects with version histories spanning 7 to 10 years. A list of these projects was given previously in Table 1. Notice that our approach


only considers files that existed on the final date of the time period. If a file did not exist at the end of the time period it is not relevant to future versions.

Evaluating the approach is difficult because of the lack of a gold standard. That is, we do not have a correct list of all the coupled artifacts. To overcome this limitation, we evaluate our claim using three different methods from three different angles to solidify our study. We hypothesize that patterns produced using change-based evolutionary coupling have fewer false positives and higher quality than patterns produced using the classical evolutionary couplings.

We first use the traditional method to evaluate the presented approach, as in

[Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller

2004]. We compare the two approaches using the predictive ability of the association rules produced, by training a model and using it to predict future changes. Precision and recall are used to compare the approaches [Buckland, Gey 1994].

Second, we use the standard data mining interestingness measures of confidence and lift to assess the quality of the rules that are generated from the change patterns. This is a more theoretical means to compare the two approaches. These measures give an assessment of the interest and quality of the association rules generated from the patterns.

Lastly, we evaluate our approach by manually determining the correct set of patterns for a small system. Doing this for a large system is not very practical due to the large amount of time necessary to inspect and learn a system. Specifically we examined a subsystem of KOffice to determine the correctness of the patterns produced by each approach. Precision and recall are used to compare the approaches.


3.5.1 Evaluation Using Prediction

A traditional means of validating the quality of association rules is to generate them from a part of the history (training set) and then see how well they predict future changes in a later part of the history (test set). In [Zimmermann, Weisgerber, Diehl, Zeller 2004], the authors used co-change information (evolutionary coupling) to predict entities

(classes, methods, fields, etc.) that are likely to be modified. A prediction model correlates with the quality of the association rules generated from the frequent patterns.

Here, for a given sequence of time-ordered transactions of a system, we divide the sequence into a training set and a test set. We use 75% of the transactions as our splitting point; the training set is the first 75% and the test set is the remainder. We use the training set to generate the patterns of change using Koupler2. After generating all EC and CEC patterns (with a selected change measure), we generate all possible association rules with at least 50% confidence.


Table 2. Prediction accuracy and completeness over seven open source systems. P = Precision, R = Recall, Fm = F-Measure, LFH = LOCC ∪ FC ∪ HC

OSS          EC Type     k = i+0                 k = i+5
                         P      R      Fm        Avg P   Avg R     Fm
Httpd        EC          0.46   0.04   0.07      0.10    0.001     0.003
             CEC, FC     0.42   0.03   0.05      0.25    0.003     0.007
             CEC, LFH    0.40   0.05   0.08      0.30    0.003     0.007
Python       EC          0.30   0.01   0.01      0.00    0.00003   0.0001
             CEC, FC     0.22   0.01   0.01      0.07    0.001     0.002
             CEC, LFH    0.13   0.02   0.03      0.07    0.001     0.002
Ruby         EC          0.28   0.03   0.06      0.45    0.005     0.01
             CEC, FC     0.33   0.03   0.0       0.34    0.003     0.005
             CEC, LFH    0.18   0.05   0.08      0.40    0.004     0.009
Subversion   EC          0.40   0.03   0.06      0.20    0.001     0.002
             CEC, FC     0.34   0.04   0.07      0.21    0.001     0.003
             CEC, LFH    0.28   0.05   0.09      0.22    0.002     0.003
KDELibs      EC          0.31   0.02   0.03      0.20    0.002     0.004
             CEC, FC     0.20   0.01   0.02      0.42    0.003     0.006
             CEC, LFH    0.15   0.01   0.03      0.39    0.003     0.006
KOffice      EC          0.30   0.03   0.05      0.15    0.0002    0.0005
             CEC, FC     0.23   0.02   0.04      0.30    0.0003    0.0006
             CEC, LFH    0.25   0.04   0.07      0.31    0.0004    0.0008
Xapian       EC          0.42   0.04   0.08      0.42    0.004     0.0078
             CEC, FC     0.30   0.02   0.05      0.63    0.004     0.009
             CEC, LFH    0.24   0.04   0.07      0.44    0.004     0.007

The patterns are generated as described in Section 3.4.3 (Implementation). The tool chain (Δshopper + Koupler2) was run on the studied systems. CEC is generated with FC as the change measure. Given the training set, we configured Koupler2 with an initial minimum support of 50/T and then lowered it until N patterns were produced. We selected N to be 1,000 as this value allows for the generation of enough patterns to compare the two approaches. We perform an evaluation similar to what Zimmermann et al. [Zimmermann, Weisgerber, Diehl, Zeller 2004] and Canfora et al. [Canfora,

Ceccarelli, Cerulo, Di Penta 2010] did in their work.

We compare the predictability (precision) of the association rules generated by both approaches and also investigate their completeness (recall) [Buckland, Gey 1994]. We examine the left-hand side (X) and right-hand side (Y) of a rule against the test set transactions, as in [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004]. Such cases are called the same subsequent (k = i+0), where i is the index of a transaction that contains X and k is the index of the subsequent transaction. In [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004] the authors also looked for a subsequent of k = i+5, that is, Y matches within five subsequent transactions from transaction i.
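Our reading of this matching procedure is sketched below; rules are (antecedent, consequent) pairs, the transactions are the time-ordered test set, and horizon = 0 and horizon = 5 correspond to k = i+0 and k = i+5. It is an illustration, not the exact evaluation harness.

def rule_hits(rules, transactions, horizon=0):
    """For each transaction containing a rule's antecedent, check whether the consequent
    appears in that transaction or within `horizon` subsequent transactions."""
    hits = misses = 0
    for i, transaction in enumerate(transactions):
        window = set().union(*transactions[i:i + horizon + 1])
        for antecedent, consequent in rules:
            if antecedent <= transaction:        # the rule applies to this transaction
                if consequent <= window:         # prediction confirmed within the window
                    hits += 1
                else:
                    misses += 1
    return hits, misses                          # precision of the rule set is hits / (hits + misses)

rules = [(frozenset({"Sheet.cpp"}), frozenset({"Sheet.h"}))]
test = [{"Sheet.cpp", "Sheet.h"}, {"Cell.cpp"}, {"Sheet.cpp"}, {"Sheet.h"}]
print(rule_hits(rules, test, horizon=0))  # (1, 1)
print(rule_hits(rules, test, horizon=5))  # (2, 0)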

Table 2 shows the results from running a prediction experiment on EC patterns and two variations of CEC patterns, using FC and the union of LOCC, FC, and HC. We used subsequents k = i+0 and k = i+5, and computed the precision, recall, and F-Measure (a weighted average of the precision and recall) [Blair 1979]. Our values here are not that high in comparison with the work in [Canfora, Ceccarelli, Cerulo, Di Penta 2010;

Zimmermann, Weisgerber, Diehl, Zeller 2004], and we have two main reasons for that.

First, unlike [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004], where only the top 10 (or top-N) rules ranked by confidence are considered for evaluation, we use all rules with at least 50% confidence, a large difference in the number of rules evaluated. Second, we are using a time period approximately 10 times longer (7 to 10 years). In [Canfora, Ceccarelli,

Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004] the study was conducted over only a few months. Regardless, our goal, here, is to compare EC vs. CEC and we see that CEC produces better quality patterns.

The main observation from Table 2 is that change-based evolutionary coupling consistently has higher precision for k = i+5, while the classical evolutionary coupling is better for k = i+0, except that it is the other way around for Ruby. Predicting subsequences farther ahead is more crucial for impact analysis.

3.5.2 Interestingness Measures

In our first evaluation we covered implicit and explicit dependencies by assessing the predictive ability of the association rules, training a model and using it to predict future changes over the seven systems. In the second evaluation, we use data mining techniques to assess the quality of a pattern in the context of change impact analysis [Arnold, Bohner

1996]. Again, we use an approach similar to the work in [Zimmermann, Weisgerber,

Diehl, Zeller 2004] on change prediction in the context of assessing association rules quality. Here, we assess the interestingness of the mined association rules.

It is quantitatively sufficient to measure the quality of the generated rules using a combination of Agrawal et al.’s [Agrawal, Imieliński, Swami 1993] support-confidence framework with lift or leverage [Damaševičius 2009; Soman, Diwakar, Ajay 2006].


Here, we generate association rules from our CEC and EC patterns, then compare the values of both confidence and lift.

confidence(X → Y) = P(Y | X) = supp(X ∪ Y) / supp(X)      (2)

Confidence and lift are measures of significance and interestingness for association rule mining. Given a rule X → Y, the confidence of the rule [Agrawal, Imieliński, Swami 1993] is defined as the probability of having the rule's consequent (Y) under the condition that the transactions also contain the antecedent (X). Confidence is evaluated using the conditional probability P(Y|X); lift is the ratio of the actual probability observed to that expected if X and Y were independent. Confidence is a commonly used measure for the quality of an association rule, see Equation 2.

lift(X → Y) = supp(X ∪ Y) / (supp(X) × supp(Y))      (3)

Again, the tool chain (Δshopper + Koupler2) was run on the studied systems and the results for both approaches, CEC and EC, are shown in Table 3. CEC is generated with LOCC as the change measure. We configured Koupler2 with an initial minimum support of 50/T and then lowered it until N patterns were produced. Again we selected N to be

1,000 as this value allows for the generation of enough patterns to compare the two approaches. This is somewhat of an arbitrary value and in practice almost any value could be used. However, as can be seen in Table 3, the value of 1000 is within the range of the number of patterns produced with minimum support values determined using the


standard data mining practice. This value appears to result in the identification of many

(possibly all) relevant patterns along with irrelevant ones. Had we selected a very low value (say 100), we would have missed a large number of relevant patterns. The patterns generated in Table 3 can be aligned with the activity plot in Figure 4. Ruby and Subversion are the most active and at the same time have the highest minimum supports and generate around 1,000 patterns.

Table 3. Seven open source systems and the number (#I) of generated CEC patterns using changed lines (L), functions (F), and hunks (H), EC patterns and their minimum supports (mS)

System        #revs     CEC, LOCC      CEC, FC        CEC, HC        EC
                        mS     #I      mS     #I      mS     #I      mS     #I
KDELibs       2,191     5      34      5      62      5      48      5      414
KOffice       16,919    5      550     5      8,949   5      695     9      1,187
Httpd         6,776     5      187     5      562     5      270     6      1,126
Subversion    14,983    7      5,397   10     1,359   8      1,064   20     1,036
Ruby          9,243     5      1,541   9      1,061   5      1,603   19     1,196
Python        7,049     5      569     7      1,034   5      598     15     1,213
Xapian        3,251     5      132     5      308     5      122     5      837

Using the patterns in Table 3, we generate all possible association rules by constructing all the combinations obtained by splitting each pattern into two subsets. This can produce an enormous number of rules, so we set constraints on the generation of rules.

Let A and B be disjoint subsets of a pattern I (EC or CEC) such that A ∪ B = I. From Equation 3, lift(A → B) is equal to lift(B → A), which is not the case for confidence, see Equation 2. We generated only the association rules A → B where |A| ≤ |B|, without duplications. To further reduce the problem, we only consider rules where confidence(A → B) ≥ 0.9.

We now address which approach (EC vs. CEC) produces change patterns with higher confidence values. For the patterns generated in Table 3, we computed confidence values using Equation 2. For each approach we sort the values and then break them down into 10 deciles (the 0th through 100th percentile points), where each part represents 1/10 of the observed values. Then we compare the first decile value (0th percentile), the minimum value from each side, then the second point (10th percentile), and so on until the 100th percentile (i.e., the maximum value). At each decile comparison we compute a win or lose point (a score).

To assign a score at each percentile, the following rule is used: assign 0 to the loser and 1 + 1 × k to the winner, where k is the percentile expressed as a fraction (0.0 to 1.0). If the values are equal we use 0 for both, as there is no winner. The reasoning behind such a rule is that a win at a higher percentile is more valuable than a win at a lower percentile (much like a weighted score). Notice the last two columns of Table 4. The final score (Table 4) is the total of the scores at each point. Figure 5 shows Table 4 as a win-lose plot, where high is a win and low is a loss.
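The scoring rule can be written compactly as below, using the KOffice confidence deciles of Table 4; the helper name and list form are ours, and both inputs are assumed to already be reduced to the eleven percentile points (0th through 100th).

def win_lose_scores(values_a, values_b):
    """Score two lists of eleven percentile values (0th..100th) with the weighted win-lose rule."""
    score_a = score_b = 0.0
    for i, (a, b) in enumerate(zip(values_a, values_b)):
        k = i / 10.0            # percentile expressed as a fraction, 0.0 .. 1.0
        if a > b:
            score_a += 1 + k
        elif b > a:
            score_b += 1 + k
        # equal values score zero for both sides
    return score_a, score_b

cec = [1.0] * 11
ec = [0.9, 0.9, 0.9, 0.9, 0.91, 0.94, 1.0, 1.0, 1.0, 1.0, 1.0]
print(win_lose_scores(cec, ec))  # (7.5, 0.0) up to floating-point rounding, matching Table 4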

When two points overlap it represents a match (a tie), where each point represents a percentile.

Applying the confidence measure of Equation 2, we use the win-lose plot to compare the confidence distributions of the two approaches. Figure 5 presents the win-lose chart of the ordered values using a 10-point percentile (decile) approach. Here, for KOffice 2001-2010, CEC has a better distribution, with higher values. This implies CEC has better


predictive rules with higher confidence. From Figure 6 we can see that the overall confidence clearly leans toward CEC. Xapian is the only win for EC (and only by a slight difference). Now we rerun the same procedure for lift.

Confidence is sensitive to high frequencies of the consequent Y; consequents that have higher support will automatically produce higher confidence values even if there exists no association between the items [Hahsler, Hornik, Reutterer 2006]. Lift overcomes this sensitivity and is a measure of how many times more often X and Y occur together than expected if they were statistically independent. Lift tells us how much better a rule is at predicting the result than just assuming the result.
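Both measures of Equations 2 and 3 can be computed directly from transaction counts, as in this generic sketch; the transactions and items are hypothetical and the code is not tied to the tool chain.

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Equation 2: supp(X U Y) / supp(X)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Equation 3: supp(X U Y) / (supp(X) * supp(Y)); symmetric in X and Y."""
    return (support(antecedent | consequent, transactions) /
            (support(antecedent, transactions) * support(consequent, transactions)))

transactions = [{"foo", "bar"}, {"foo", "bar"}, {"foo"}, {"baz"}]
x, y = frozenset({"foo"}), frozenset({"bar"})
print(confidence(x, y, transactions))  # 0.666...
print(lift(x, y, transactions))        # 1.333...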


Table 4. Deciles of confidence values for KOffice and their win-lose scores.

Kth Decile   CEC, LOCC   EC     CEC, LOCC score   EC score
0            1.0         0.9    1                 0.0
1            1.0         0.9    1.1               0.0
2            1.0         0.9    1.2               0.0
3            1.0         0.9    1.3               0.0
4            1.0         0.91   1.4               0.0
5            1.0         0.94   1.5               0.0
6            1.0         1.0    0.0               0.0
7            1.0         1.0    0.0               0.0
8            1.0         1.0    0.0               0.0
9            1.0         1.0    0.0               0.0
10           1.0         1.0    0.0               0.0
Total Final Score               7.5               0.0

Figure 5. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice. CEC patterns were generated by enforcing LOCC consistencies.


For a rule A → B, a higher lift implies that the existence of A and B co-occurring in a transaction is not a random coincidence but due to some relationship between them.

The purpose of such models is to identify a subgroup (target) from a larger population.

The target members selected are those likely to respond positively to a marketing offer.

A model is successful if the response within the target is much better than average for the population as a whole. Lift is computed as the ratio of these values: target response divided by average response, see Equation 3 for formal definition of lift [Agrawal,

Imieliński, Swami 1993]. As with confidence, we calculate the lift values for the generated rules; here, CEC is generated with HC as the change measure.

Figure 7 shows the final lift win/lose results for all studied systems. Again, with Xapian as the only EC win, CEC clearly produces association rules of higher interest and better predictability from the selected patterns.


Figure 6. The final confidence scores of the seven open source systems for the association rules generated from the CEC (LOCC) and EC patterns.

Figure 7. The final lift scores of the seven open source systems for the association rules generated from the CEC (HC) and EC patterns.


3.5.3 Manual Validation of Patterns

We compute precision and recall [Buckland, Gey 1994] of the patterns produced by both CEC and EC for KSpread, a subsystem of KOffice. KSpread (recently renamed to KCells) is a spreadsheet program with 409 stable C++ source files. The study covers 1,556 revisions between 2006 and 2010. To do this, we manually identify all the relevant evolutionarily coupled artifacts. This is a non-trivial task since there are 83,436 pairs of files to examine. We determined the set of evolutionarily coupled artifacts by first finding all EC patterns and then manually identifying the relevant evolutionarily coupled files, thereby determining the false positives.

The EC and CEC patterns are generated with the same minimum support. Again, we are using a time window of one revision. We consider the EC patterns generated to be the space of all possible patterns because the patterns produced by CEC are a subset of those. So we assume the recall for the EC experiment is 100% and determine the precision from the manual inspection. Clearly, the EC approach may miss some actual couplings, but in comparing the two approaches this makes little difference.

We determine a minimum support that generates a number of EC patterns that is reasonable for manually filtering the relevant evolutionarily coupled patterns and at the same time generates a non-empty set of CEC patterns. For KSpread, we used a minimum support count of 6 and generated 308 EC patterns of size two or more. This gave us a manageable number of patterns to examine for both approaches.

We identify the relevant EC patterns from the 308 generated to form our baseline.

Due to the downward closure property (anti-monotonicity), which states that all subsets of a frequent pattern are also frequent [Agrawal, Srikant 1994], we only need to filter the patterns of size two; any pattern that contains an irrelevant pair is then also irrelevant. This reduced the space to 148 patterns of size two. We manually checked the changes to determine which patterns reflect actual relevant couplings.
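The automatic filtering of larger patterns by their constituent pairs can be sketched as follows; relevant_pairs stands for the manually validated pairs and the file names are hypothetical.

from itertools import combinations

def filter_patterns(patterns, relevant_pairs):
    """Keep only patterns in which every pair of items is a validated coupling."""
    kept = []
    for pattern in patterns:
        pairs = {frozenset(p) for p in combinations(pattern, 2)}
        if pairs <= relevant_pairs:
            kept.append(pattern)
    return kept

relevant_pairs = {frozenset({"Sheet.cpp", "Sheet.h"}),
                  frozenset({"Sheet.cpp", "SheetView.cpp"}),
                  frozenset({"Sheet.h", "SheetView.cpp"})}
patterns = [{"Sheet.cpp", "Sheet.h", "SheetView.cpp"},
            {"Sheet.cpp", "Sheet.h", "Cell.cpp"}]
print(filter_patterns(patterns, relevant_pairs))
# [{'Sheet.cpp', 'Sheet.h', 'SheetView.cpp'}] -- the second pattern contains an unvalidated pair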

For each of the 148 patterns of size two we examined the code for physical dependencies between files, either direct or with up to two levels of indirection. If a link existed between a pair of files then it was deemed a relevant pattern (e.g., Sheet.cpp and Sheet.h). Additionally, we looked for files that have very similar code structure (clones). If two files were very similar we considered this a relevant pattern [Geiger, Fluri, Gall, Pinzger 2006]. Overall, this task took approximately 46 hours. We identified 96 relevantly coupled pairs from the 148 patterns of size two. These results were used to filter out the irrelevant patterns from the larger sized patterns; this was done automatically. Filtering out the clearly irrelevant ones resulted in 136 out of 308 potentially relevant EC patterns. Thus, the precision for the EC method is 0.44 and the recall is 1.0.

Now, we examine the recall and precision values of change-based evolutionary coupling (CEC). For the change measure, we use all the possible combinations of patterns to show how the different change metrics influence the quality of patterns. Note that for each change measure we generate a set of evolutionary couplings. Table 5 contains the number of CEC patterns for each variation (intersections and unions); it also contains the precision, recall, and F-Measure values. We take the value of 136 as the total number of potentially relevant patterns in the system.


Table 5. All variations (intersections and unions) of change metrics for CEC patterns and their precision, recall, and F-Measure values

Change, C             #CEC    Precision        Recall          F-Measure

LOCC                  26      22/26   0.85     22/136   0.16   0.27
FC                    54      33/54   0.61     33/136   0.24   0.35
HC                    33      31/33   0.93     31/136   0.23   0.37
LOCC ∩ FC             12      10/12   0.83     10/136   0.07   0.14
LOCC ∩ HC             14      14/14   1.0      14/136   0.10   0.19
FC ∩ HC               18      16/18   0.89     16/136   0.12   0.20
LOCC ∩ FC ∩ HC        8       8/8     1.0      8/136    0.06   0.11
LOCC ∪ FC             60      37/60   0.62     37/136   0.27   0.38
LOCC ∪ HC             43      37/43   0.86     37/136   0.27   0.41
FC ∪ HC               62      41/62   0.66     41/136   0.30   0.41
LOCC ∪ FC ∪ HC        66      43/66   0.65     43/136   0.32   0.43

We see a large jump in the precision values of the change-based evolutionary coupling method in Table 5. The top three rows are CEC patterns and clearly all have precision values in the range of 0.61 to 0.93, much higher than the precision of EC patterns (i.e., 136/308 = 0.44). We cannot compare CEC vs. EC recall values since the EC patterns are our baseline. But we can compare the CEC patterns generated by each variation of change metrics.

The bottom part of Table 5 is the variant unions of CEC patterns produced using the three change metrics. These rows show the highest recalls (0.27 to 0.32) and F-measures

(0.38 to 0.43) and medium-high precisions (0.62 to 0.86). The middle of the table has the


variant intersections of CEC patterns generated using the three change metrics. These rows show the highest precisions (0.83 to 1.0) and medium-low recalls (0.06 to 0.12) and F-measures (0.11 to 0.20). These results cover a range of precision and recall values, but change-based evolutionary couplings are favored over the classic evolutionary couplings. Particularly interesting is that high recall and precision values are produced through the combination of the different CEC patterns generated by the change metrics.

3.6 Threats to Validity

Selecting a time window is a challenging problem as a logical commit can be spread out over multiple physical commits across several days, or one physical commit could have multiple logical commits. Traditionally, the identification of evolutionary coupling uses a small time window or a discrete version as committed. This is an open problem and continued research is required for a complete answer.

We note that the manual inspection was done by the authors themselves and as such represents a threat to the validity of this evaluation. Alternatively, we could have hired an outside individual to conduct the work, but this was not practical at the time. Thus, to expand our evaluation to more systems, we apply another evaluation method.

3.7 Discussion

All three evaluation methods give strong evidence that our Change-based

Evolutionary Coupling (CEC) approach produces fewer false positives than using the traditional approach to compute Evolutionary Coupling (EC).

Comparing the two approaches using the prediction capabilities of the association rules, as in Table 2, change-based evolutionary coupling is better at predicting future subsequences.

Using the interestingness measures (lift and confidence) showed that the interestingness and the quality of the association rules generated from CEC patterns are better than those of EC patterns. Specifically, the change patterns identified using our CEC approach are more likely to represent actual evolutionary dependencies rather than coincidentally co-changing artifacts.

The manual experiment on a subsystem was able to assess both the actual implicit and explicit reasons behind the coupling of the artifacts examined. We observed high precision values, some above 90%. Of course, this evaluation approach was very time consuming and impractical to apply to all seven systems, as it would require deep knowledge of the systems and the change patterns of their artifacts. As Table 5 shows, the CEC measures gave better precision than EC alone, and the recall values were not that low. With the combinations of CEC patterns over the different change metrics (LOCC, FC, and HC) we tuned the recall and precision values to a better balance.

An intersection among the CEC patterns of the different change metrics produces very high precision (0.89 to 1.0) and low recall (0.06 to 0.12). A union among the CEC patterns of the different change metrics produces moderate recalls (0.27 to 0.32), moderate precisions (0.62 to 0.86), and well balanced F-measures (0.38 to 0.43).

In the first and third evaluations, the low recall values are of interest. The premise here is that it is better to uncover correct patterns and miss some, rather than uncovering all patterns but including many false positives. Therefore, we chose to give more weight to the prediction accuracy (precision) of the uncovered patterns rather than to their completeness (recall). Many approaches have low precision, which makes them unattractive for developers to adopt.

The main contribution of this work is that our approach can detect high quality patterns with low minimum supports (Table 3). This means a pattern can be of high quality without a large amount of maintenance activity or a large number of occurrences.

The evaluation demonstrates that the addition of easily extracted information, such as change metrics (effort), has a substantial impact on the quality of the resultant patterns.

The approach is straightforward and requires little additional overhead compared to the traditional approach. The same underlying data mining technique is used so the only variable in the evaluation is the addition of the effort values.

We can only say with certainty that our choice of parameters resulted in higher quality couplings than previous efforts. However, we make no claim that the parameters we have chosen are optimal. We plan to empirically identify work patterns from change patterns in order to find which time windows produce better results (as in the work of

Vanya et al. [Vanya, Premraj, Vliet 2011]). We also plan to further this work by investigating finer artifact granularities such as methods and classes.

CHAPTER 4

CHANGE PATTERNS INTERACTIVE TOOL AND VISUALIZER

In this chapter we present an interactive visualization tool called KouplerVis2.

The tool is an extension of Chapter 3's tool, Koupler2. Koupler2 is an enhanced frequent pattern mining technique based on the Apriori algorithm [Agrawal, Imieliński, Swami 1993; Agrawal, Srikant 1994]. The tool uses a SeeSoft-style display of the detected change patterns [Eick, Steffen, Eric E. Sumner 1992; Maletic et al. 2011].

4.1 Introduction

KouplerVis2 is a graphical, interactive tool written in C++. The idea behind a GUI application is to let developers easily benefit from using the tool. Developers can apply their own experience in choosing the right set of parameters. With only a few parameters the tool is able to produce high-quality change patterns. Manually tuning the tool based on the developers' familiarity with their system can help them understand large-system evolution, architecture, and maintenance activities. A user can interact with KouplerVis2 and read the mining results visually on the fly.

Users can assign different minimum support values, different time ranges, and different approaches. Each selection will likely generate a different set of itemsets.

The KouplerVis2 presentation is a matrix filled with black squares on white edges. Each square represents a unique file that appears at least once during the specified time period. Each itemset is a group of such files, and each square is colored according to its change behavior level. A spin control can traverse all generated itemsets. The itemsets are ranked based on a proposed weight measure; see Figure 8.

4.2 Controls

Here we will discuss each visual element of KouplerVis2. Each control represents a mining parameter. The visual sets of controls are:

File open: the file menu button in the tool bar opens a dialog and reads a baskets XML file of the format given in Figure 8.

Controls group: in the left panel of the screen shot in Figure 8, a set of input and output controls that help the user interact with KouplerVis2:

• Efforts: A set of check boxes that lets the user select which change metrics to include in the filtration. KouplerVis2 processes with no filtration by default.

• Minimum support: A spin control where a user can assign a percentage to be used as the frequency threshold, i.e., the minimum support ratio.

• Minimum support/Weeks: A read-only textbox that reflects the corresponding number of weeks out of the total number of weeks in the given period.

• Support: A read-only textbox that shows the actual frequency of the displayed itemset.

• Weight: A read-only textbox that holds the computed weight value of the displayed itemset. The weight is a quality measure calculated based on the filtration of the change metrics.

• Itemset Size (L): A read-only textbox that displays the number of files (the size) in the currently viewed itemset.

• From and To: Two combo boxes for selecting the range of years out of the whole period.

• Calculate: A button that starts the process.

• ELC and TLC: Two radio buttons that specify which algorithm to run: effort-based or time-based evolutionary coupling. When TLC is selected, the weight control shows no value, since weight is an ELC-only property, and the Efforts control is disabled. The files in a TLC itemset are colored in white, and the resulting itemsets are ranked only by itemset size in descending order.

• Runs: A read-only multiline textbox that gives more detailed results and the ongoing processing status for the different phases of the frequent itemset mining algorithm, including the itemsets generated for each size and the processing times.

Frequent itemsets group: in the right panel of the screen shot in Figure 8, a set of input and output controls:

• Itemset number: A spin control that loops over all the resulting itemsets; when its value changes, the displayed itemset changes accordingly. The number also represents the itemset's rank, which is based on the computed weight value.

• Total number of itemsets: A read-only textbox that shows the last itemset number, i.e., the maximum value the itemset number control can take.

The Black Matrix: The matrix of black boxes, where each box represents a unique file that appeared in the system during the selected period of time, ordered left to right. The few colored boxes display the detected pattern of change.

Figure 8. KouplerVis2 snapshot. The system is KOffice [2000-2009]; the minimum support is 2%, which is 5 weeks out of the 269 weeks of the selected range 2000-2005. Itemset number 37 has four files, and this pattern appeared 5 times, out of a total of 405 ECs.


4.3 Summary

KouplerVis2 is an interactive tool that follows the SeeSoft display approach. In the backend it uses the CEC algorithm from Chapter 3 to detect high quality frequent patterns. The tool can be adjusted to any set of parameter values, and the results are displayed in a single window of boxes. The user can iterate over all the detected patterns with a few simple controls.

CHAPTER 5

DISTRIBUTION AND CORRELATION OF CODE, CHANGE AND

COLLABORATION METRICS

In this chapter we present a large-scale investigation to understand the distribution of, and the correlation between, code, change, and collaboration metrics. Such metrics have shown promising results for fault prediction and maintenance estimation. The literature has used both single metrics and mixtures of metrics to build such models. A lack of understanding of these metrics has led to conflicting results based on incorrect assumptions and misestimation of the distributions of the different types of metrics. Such distributions help us understand the nature of systems and their evolution. We then study the correlations between all selected metrics. This information assists in choosing the right set of metrics for future prediction models or studies. We use two code metrics (size and complexity), four change metrics (a variety of churn metrics), and two collaboration metrics (authorship and commits).

5.1 Introduction

Over the years, software metrics have played, and still play, a crucial role in assessing software source code quality and complexity. Recently, focus has turned strongly toward predicting software fault locations and estimating maintenance effort. Several models have been proposed based on statistical techniques (e.g., logistic regression [Zimmermann, Nagappan 2008]), machine learning techniques (e.g., Naive Bayesian networks [Menzies, Greenwald, Frank 2007]), and information theory techniques (e.g., [Hassan 2009]). The approaches proposed to tackle the problem rely on diverse information, such as code metrics, change metrics, collaboration metrics, recent activity information, and previous defect data [D'Ambros, Lanza, Robbes 2010a].

Such metrics serve as independent variables in these models to predict fault-prone locations in a system at different artifact granularities. Previous work has shown that there is no perfect set of metrics that works best for all systems [Nagappan, Ball, Zeller 2006]. At the same time, hybrid approaches using multivariate metrics have shown better consistency and accuracy than individual metrics, where the latter suffer from inconsistent prediction behavior [D'Ambros, Lanza, Robbes 2010a]. Research in effort and bug prediction still shows conflicting results on which metrics are the best predictors overall and which mathematical prediction model is best.

Understanding the distribution of complexity code metrics plays a major role in effort estimation and failure prediction studies. The distributions presented in the literature focus on code metrics, while little has been done regarding change metrics. As we noted before, change metrics have proven to be better estimators than complexity metrics; at the same time, individual metrics are no better than hybrid metrics of different types, and there is no single best set of metrics that works well for all software systems [D'Ambros, Lanza, Robbes 2010a].

Stepping back, we are dealing with this data without knowing its statistical distribution or how the collected metrics correlate with each other. Such information would help us understand systems' maintenance activities and estimation models. It would also clarify the relation between the different sets of metrics.

We empirically investigate ten open source systems with up to ten years of recorded Subversion revisions. Using a week-long delta as the unit of code change, at the end of each week (Sundays) we collect the different types of metrics. From this data we obtain selected representative metrics of each category: code, change, and collaboration. Using simple plotting techniques, we show the distribution of this detailed information. Then we conduct a cross correlation between the collected metrics.

Code metrics have long been reported to be log-normal [Concas, Marchesi, Pinna, Serra 2007] or double Pareto [Herraiz Tabernero, German, Hassan 2011]. Those code metrics are collected from a single snapshot in time during the life of a system, while here we are more interested in all metrics reported over long histories. Such distributions are more reliable than studying a random point in time. We note that no change or collaboration metric distributions have been investigated.

The chapter is organized as follows. Section 5.2 presents an introduction to software metrics. Section 5.3 describes the approach used to collect the data from Subversion repositories for the analysis. Section 5.4 discusses the rationale behind picking a week as the time window size. Section 5.5 is the metrics distribution study, and Section 5.6 follows with the metrics correlation analysis. We end with a final discussion in Section 5.7.

5.2 Software Metrics

It is almost impossible to study all the kinds of code, change, and collaboration metrics in the literature, so we pick a few from across these main metric categories. We place some extra focus on change metrics, since they have been reported to be the best predictors in bug prediction [D'Ambros, Lanza, Robbes 2010a].

Table 6 summarizes the selected metrics. We use a week as our basket, so, starting sequentially at some date in a system's history, we report these metrics on a week-by-week basis.

Table 6. All metrics used in the study of code, change and collaboration. Four columns representing each Metric Category, Name, Abbreviation, and Description

Metric Category         Metric Name       Abbreviation   Description (file granularity)
Code Metrics            LOC               LOC            Total number of lines of code
                        CC                CC             McCabe Cyclomatic complexity
Change Metrics          LOC Churn         LOCC           Total number of lines that are modified, deleted, or added over a time unit
                        Functions Churn   FC             Total number of functions/methods that are modified, deleted, or added over a time unit
                        Hunks Churn       HC             Total number of hunks that are modified, deleted, or added over a time unit
                        CC Diff           CCD            Difference in McCabe Cyclomatic complexity over a time unit
                        LOC Diff          LOCD           Difference in total number of lines of code over a time unit
Collaboration Metrics   Commits           C              Number of developers' commits over a unit of time
                        Authors           A              Number of committers over a unit of time

5.3 Data Collection

We collect our data using an in-house tool called shopper (pronounced "delta shopper"). This tool targets data available from the Subversion version control system. The shopper tool is responsible for extracting metadata and differences from a Subversion repository. The program takes a unit of time (Δ) and extracts information about the modification of artifacts over the course of the history. The program identifies modifications to files within each time Δ and computes, for each file, its observed values for the metrics in Table 6.

We use McCabe's definition [McCabe 1976] of cyclomatic complexity for a piece of source code, which is the count of the number of linearly independent paths through the source code. At the end of every week, and for each file, we subtract the previous CC from the new CC to get CC Diff. Similarly for LOC Diff, we subtract a file's LOC at the end of the previous week from its LOC at the end of the current week.
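As an illustration only (not the shopper implementation), the following sketch computes LOC Diff and CC Diff from two consecutive weekly snapshots; the file name and values are hypothetical.

# Illustrative sketch of the week-over-week differencing (LOC Diff, CC Diff).
# `weekly` maps week -> {file: (LOC, CC)}; the data below is hypothetical.
weekly = {
    "2010-12-20": {"Sheet.cpp": (4300, 980)},
    "2010-12-27": {"Sheet.cpp": (4448, 1026)},
}

def weekly_diffs(prev, curr):
    """For each file present in both snapshots, return (LOC Diff, CC Diff)."""
    diffs = {}
    for path, (loc, cc) in curr.items():
        if path in prev:
            prev_loc, prev_cc = prev[path]
            diffs[path] = (loc - prev_loc, cc - prev_cc)
    return diffs

print(weekly_diffs(weekly["2010-12-20"], weekly["2010-12-27"]))
# -> {'Sheet.cpp': (148, 46)}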

To collect the LOC churn, shopper uses Subversion's unified diff output to count the added and deleted source lines by counting the '+' and '–' markers.

Likewise, the hunks churn is easily computed by counting the number of '@@' hunk delimiters. A hunk is a contiguous group of changed lines along with contextual unchanged lines [Alali, Kagdi, Maletic 2008].
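The counting rules above can be illustrated with a short sketch; this is not the shopper tool itself, and the sample diff text is a hypothetical placeholder.

# Sketch: count LOC churn and hunk churn per file from unified diff text.
from collections import defaultdict

def churn_from_unified_diff(diff_text):
    """Return {file: (loc_churn, hunk_churn)} for a unified diff string."""
    counts = defaultdict(lambda: [0, 0])
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):            # new-file header names the modified file
            current = line[4:].split("\t")[0]
        elif line.startswith("--- "):
            continue                           # old-file header, not a deleted line
        elif current is None:
            continue
        elif line.startswith("@@"):            # one '@@' delimiter per hunk
            counts[current][1] += 1
        elif line.startswith(("+", "-")):      # added or deleted source line
            counts[current][0] += 1
    return {f: tuple(c) for f, c in counts.items()}

sample = """--- Sheet.cpp\t(revision 100)
+++ Sheet.cpp\t(revision 101)
@@ -5,7 +5,8 @@
-old line
+new line
+another new line
"""
print(churn_from_unified_diff(sample))   # {'Sheet.cpp': (3, 1)}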

To collect functions churn, we first convert each versioned source code file to srcML

[Maletic, Collard 2004] and find each function’s region, the contiguous lexical range of lines in which the function is declared or defined. We map diff hunks (change regions) onto the function’s to find overlapping regions of modification. If a change region overlaps with a function’s region, the then function must have been changed by the commit. For example, if version n of a function f has the region f.c:5-10 (lines 5-10 in file f.c), and a commit for revision n + 1 modifies f.c:4-7, then function f has changed, and f.c has at least 1 function change during the time  in which the change occurred.
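A minimal sketch of this overlap test follows; in the dissertation the function regions come from srcML, whereas here they are hypothetical hard-coded line ranges.

# Sketch of the function-churn overlap test (function regions are hypothetical).
def overlaps(region_a, region_b):
    """True if two inclusive line ranges (start, end) share at least one line."""
    return region_a[0] <= region_b[1] and region_b[0] <= region_a[1]

# Function regions in f.c: name -> (first line, last line)
functions = {"f": (5, 10), "g": (12, 30)}

# Changed regions taken from the commit's diff hunks for f.c
changed_regions = [(4, 7)]

churned = {name for name, region in functions.items()
           if any(overlaps(region, ch) for ch in changed_regions)}
print(churned)   # {'f'}  -> f.c has at least 1 function change in this time unit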

Using metadata from the repository log, we collect the collaboration metrics, which are the number of commits and committers touching a file over a week. The data collected is output as an XML file such that each week defines a transaction consisting of artifact/metric and value pairs.

Table 7. Ten free and open source systems used in this study. All systems are written in C/C++ and cover histories ranging from 3 to 12 years. A brief description of each system is included.

Open Source System (C/C++)   Years            Description
Chrome                       2008-2011 (3)    Web Browser
GCC                          2001-2010 (10)   GNU Compiler Collection
KOffice                      2001-2010 (10)   KDE Office Suite
KDELibs                      2005-2010 (6)    Central KDE programs/libraries
LLVM                         2001-2011 (11)   Compiler and tool chain technologies
Open MPI                     2005-2011 (7)    Supercomputer Message Passing Interface library
Python                       2001-2010 (10)   Python Language Implementation
Quantlib                     2000-2011 (12)   Library for Quantitative Finance
Ruby                         2001-2010 (10)   Ruby Language Implementation
Xapian                       2001-2010 (10)   Search Engine Library

Table 7 lists the systems; we investigate three to twelve years of maintenance history for ten open source systems. The selected free and open source systems cover different types of applications with a broad range of designs, architectures, and functionalities.

shopper produces XML datasets; a sample is shown in Figure 10. The process of collecting such data is very time consuming, for each system it can take days to finish, which is expected since we are collecting an enormous amount of data. Figure 10 shows two sample baskets for two weeks, the weeks of December 20 and 27, 2010, of data collected from KOffice. Each basket contains files paired with their actual values for seven of the selected change and collaboration metrics. The LOC and CC metrics are collected at a snapshot.

Figure 9. Activity chart for the ten free and open source systems. Each bar represents the ratio of the number of commits to the number of years, normalized by dividing each ratio by the highest ratio (the MAX).

Table 7 presents the open source projects we use in this work; the number of files in each system and the total number of LOC, functions, and hunks that are modified, deleted, or added over the duration of the study are also collected. Using this data we generated the chart in Figure 9 to present the collected measures. The chart shows three groups of bars, one per effort measure, for each project. The systems are of different sizes (files) and the duration periods are not equal, so we use a ratio to draw a better comparison between the systems: we divide each cumulative effort measure for the whole range (years) by the number of files and by the number of years.

The following observations can be made. Chrome is the most active and changing project, followed by KDELibs, whereas Xapian is the least changing development project. We also clearly see a size correlation between the three measures. The chart in Figure 9 represents the ratio of the number of commits to the number of years, where all ratios are then normalized by dividing each by the maximum observed ratio (MAX). Chrome is at 100% relative to the others.
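The normalization behind Figure 9 is simple arithmetic; the following sketch illustrates it with hypothetical commit and year counts (not the measured values).

# Sketch of the Figure 9 normalization: commits-per-year ratio for each system,
# divided by the maximum ratio so the most active system reads as 100%.
systems = {"Chrome": (35650, 3), "Xapian": (4703, 10)}   # hypothetical counts

ratios = {name: commits / years for name, (commits, years) in systems.items()}
max_ratio = max(ratios.values())
normalized = {name: r / max_ratio for name, r in ratios.items()}
print(normalized)   # most active system -> 1.0 (100%), the rest relative to it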


Figure 10. Two KOffice baskets of six files, with their attributes and values, stored in an XML file encoding the effort measurements for each file during an observed time Δ = week; the date of the week's first day is the basket name.

5.4 Time Window

We choose a time unit of one week as the start and end points for collecting our metric values. A week is a fairly coarse-grained unit of time, and the choice is motivated by how developers work. A developer's work plans often revolve around the workweek and multiple commits within a week, where a commit (Subversion revision) can incorporate multiple files. Week-long work plans can include fixing a set of bugs, developing a new feature, or some non-trivial refactoring of a critical component. We feel that such behavior is intentionally created by a developer's schedule and to-do list. In their comparative study of the main bug prediction techniques, D'Ambros et al. [D'Ambros, Lanza, Robbes 2010a] used two-week work logs to collect their metrics.

Common process practices such as agile methods promote continuous integration and frequent releases, as do many open source development processes. It is also common to have iterations of one or two weeks in agile development cycles [Abbas 2010].

5.5 Metrics Distribution

Distribution studies of code complexity metrics serve as a basis for models of effort estimation and failure prediction. The distributions presented in the literature focus on code metrics, while little has been done regarding change metrics. Change metrics have proven to be better estimators than complexity metrics; at the same time, individual metrics are no better than hybrid metrics of different types, and individual metrics show inconsistencies [D'Ambros, Lanza, Robbes 2010b]. Adding to that, there is no best set of metrics that works well for all software systems.

These issues encouraged us to approach the problem from a different angle: where most of the literature tests metrics to predict failures and estimate maintenance costs, we take metrics of different types and study their distributions and the correlations among them. We study data that were collected in a weekly fashion over three to twelve years for ten open source systems.

Figure 10 shows a sample of our collected data. This data represents all the observed metric values, at the end of each week, for code, change, and collaboration. We are interested in how the distribution of this data compares against the literature, where the complexity (LOC) data is collected from one snapshot at some point in a system's life. What we have here are all the observed metrics from Table 6. Such data represents the actual maintenance process, and its shape tells us how maintenance activities, and their complexity, happen.

5.5.1 Frequency Histogram

We start our analysis with simple observations and then build on them. Figure 11 (a, c, e) (left side plots) shows a discrete frequency histogram (probability distribution) for one metric from each category in Table 6, collected for KOffice [2001-2010]. The X-axis holds bins of size two sorted in ascending order; the Y-axis is the frequency of each bin for the metric values, on a log base-10 scale. Note that the first bin range is [1, 3), the second is [3, 5), and so on.

Figure 11 (b, d, f) (right side plots) is more of a zoomed-out picture, breaking the data into four 25% quartiles (as in the boxplot five-point summary). Such a plot makes no assumptions about the statistical distribution. Each quartile's size helps uncover the dispersion, skewness, and outliers in the data [Hoaglin 1983].
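The following sketch (assuming numpy and matplotlib, with synthetic heavy-tailed data rather than the KOffice measurements) shows how such a bin-width-2 frequency histogram with a log-10 count axis can be produced.

# Sketch of a bin-width-2 frequency histogram (Figure 11 a, c, e style):
# bins [1,3), [3,5), ... with the count axis on a log-10 scale.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
loc_churn = rng.pareto(1.5, 10_000).astype(int) + 1     # synthetic stand-in data

bins = np.arange(1, loc_churn.max() + 2, 2)             # edges 1, 3, 5, ...
counts, edges = np.histogram(loc_churn, bins=bins)

plt.bar(edges[:-1], counts, width=2, align="edge")
plt.yscale("log")
plt.xlabel("LOC Churn (bins of size 2)")
plt.ylabel("Frequency (log10)")
plt.show()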

[Figure 11 panels: (a) LOC Churn on bins of size 2; (b) LOC Churn on five-point summary and ranges; (c) Number of Commits on bins of size 2; (d) Number of Commits on five-point summary and ranges; (e) LOC on bins of size 2; (f) LOC on five-point summary and ranges]

Figure 11. Frequency distribution of three selected metrics from each category of code, change and collaboration for KOffice [2001-2010] collected on weeklong time units

Simple histograms of the collected data tell a lot, though not everything. Figure 11 (a, b) shows the skewness of the LOC Churn values, where the majority, up to 75%, are between [1, 34] churned lines (modified, deleted, and added), and the top 25% are between [35, 11,217]. This clearly shows that KOffice developers' lines-of-code churns are of small size most of the time. Figure 11 (c, d) is the frequency histogram of the developers' activity metric, obtained by counting the number of commits that touched a file in week-long periods. Here, the majority (75%) of files changed only once or twice a week, while the top 25% are between [3, 56] commits. Figure 11 (e, f) shows a somewhat different distribution: Figure 11 (e) has a spike at the beginning before the sharp fall, where the two previous metrics show only a sharp fall right away. Figure 11 (f) fails to show that due to the coarse-grained clustering into 25% quartiles, but it shows that 75% of the source files appear in the history with small and medium sizes (LOC), between [1, 679], for KOffice [2001-2010].

We must note here that the other metrics of each category show frequency histograms similar to the selected metric. For example, cyclomatic complexity (CC) from the code metrics (Table 6) is very similar to the LOC frequency histograms. The same holds for the functions churn metric, which shows frequency histograms similar to LOC churn. Moreover, the remaining nine systems show, in general, a picture very close to KOffice.

5.5.2 Frequency Histogram on a Log-Log Plot

From Figure 11, due to the L-shaped curve of the probability distribution plots, we can sense the existence of power-law probability distributions. We can see that clearly for the change and collaboration metrics in Figure 11 (a, c), while it is not so clear for the code metric in Figure 11 (e). Code metrics have long been reported to be log-normal [Concas, Marchesi, Pinna, Serra 2007] or double Pareto [Herraiz Tabernero, German, Hassan 2011], but that was for a snapshot in time, while here we are more interested in all metrics reported over long histories. Such distributions are more reliable than studying a random point in time. To the best of our knowledge, no change or collaboration metric distributions have been investigated before.

[Figure 12 panels: (a) LOC Churn on bins of size 2 (log-log plot); (b) Number of Commits on bins of size 2 (log-log plot); (c) LOC on bins of size 2 (log-log plot)]

Figure 12. A log-log plot of probability distribution of three selected metrics from each category of code, change and collaboration for KOffice [2001-2010] on weeklong time units. Same data used in Figure 11.

A simple, graphical approach to detect distributions for such types of data uses log-log plots. We take the probability distribution function plots and move them to log-log axes by adjusting the X- and Y-axes to their base-10 logs. Then we try to find a linear fit for the resulting figure. There are two common ways to find a fit: we can use the complementary cumulative distribution function [Monti 1995] or logarithmic binning [Newman 2005].

Figure 12 shows a log-log plot of the probability density functions presented in Figure 11 (a, c, e). Here the distributions look linear (a first-degree polynomial) for the LOC churn and number of commits metrics. This is the characteristic signature of a power law [Newman 2005], while LOC gives more of a curvy, normal-like shape on log-log scales.

5.5.3 Complementary Cumulative Distribution Function

To estimate the shape of the statistical distributions of the code, change, and collaboration metrics, we plot the complementary cumulative distribution function (CCDF). The cumulative distribution function (CDF) is the accumulation of the probability density function (PDF); its range is [0, 1], and the probabilities add up to 1. The CCDF is 1 − CDF. A PDF, CDF, and CCDF hold the same information, but each has different characteristics and uses. On a log-log scale, a power law distribution appears as a straight line in a CCDF, while a lognormal appears as a curve. So the CCDF can be used to distinguish between a power law and other kinds of distributions.
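A minimal sketch of an empirical CCDF on log-log axes follows; it assumes numpy and matplotlib and uses synthetic heavy-tailed data rather than data from the studied systems.

# Sketch: empirical CCDF on log-log axes. A power law shows up as a straight
# line, a lognormal as a curve.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = np.sort(rng.pareto(1.6, 10_000) + 1)

# CCDF: fraction of observations greater than or equal to each sorted value.
ccdf = 1.0 - np.arange(len(values)) / len(values)

plt.loglog(values, ccdf, marker=".", linestyle="none")
plt.xlabel("metric value")
plt.ylabel("P(X >= x)   (CCDF)")
plt.show()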


In a CCDF, the double Pareto distribution appears as a curve with two straight segments, one at the low values side, and another one at the high values side. The difference between a lognormal and a power law at very low values is negligible, and therefore imperceptible in a plot. This means that in a CCDF plot the main difference between a lognormal and a double Pareto is only spotted at high values. In any case, for our purposes, it is more important to focus on the high values side.

Figure 13. CCDF Shapes of lognormal, double Pareto, and Pareto distributions on a log-log plot [Mitzenmacher 2004].

A key characteristic of the double Pareto distribution is that it has a power law at both tails. That is, if we look at the cumulative distribution function (CDF) on a log-log plot, it will also have a linear tail (for the small files). This provides a test for seeing whether a distribution is a double Pareto: look at both the CCDF and the CDF on log-log plots for linear tails.

Figure 13 is taken from an estimation presented by [Mitzenmacher 2004] where it shows how the double Pareto distribution falls between the lognormal distribution and the

Pareto distribution. Like the Pareto distribution, it is a power law distribution. But while the log-log plot of the density of the Pareto distribution is a single straight line, for the double Pareto distribution the log-log plot of the density consists of two straight line segments that meet at a transition point. This is similar to the lognormal distribution, which has a transition point around its median. Hence, an appropriate double Pareto distribution can closely match the body of a lognormal distribution and the tail of a

Pareto distribution.

For example, Figure 13 shows the complementary cumulative distribution function for a lognormal, double Pareto, and Pareto distribution. (These graphs have only been minimally tuned to give a reasonable pictorial match; they could be made to match more closely.) The lognormal and double Pareto distributions match quite well with a standard scale for probabilities, but on the log-log scale in Figure 13 one can see the difference in the tail behavior, where the double Pareto more closely matches the Pareto.

Due to the noise we see in the log-log plots, and especially in the tails, we need to smooth these lines or curves for better judgment. Based on the work of [Mitzenmacher 2004], and although that distribution analysis was conducted on snapshot data for file sizes, we expect our data to fit between a Pareto, a double Pareto, and a lognormal distribution. Simple graphical approaches have been formed to distinguish among such highly right-skewed data.

For a formal mathematical and detailed explanation of the Pareto, double Pareto, and lognormal distributions and their characteristics, the reader can refer to the references mentioned above. We now move from Figure 12 to Figure 14, where we use the complementary cumulative distribution function to clean up the plots of Figure 12.

[Figure 14 panels: (a) LOC Churn on bins of size 2 (log-log plot); (b) Number of Commits on bins of size 2 (log-log plot); (c) LOC on bins of size 2 (log-log plot)]

Figure 14. A log-log plot of complementary cumulative distribution function of three selected metrics from each category of code, change and collaboration for KOffice [2001-2010] on weeklong time units. Same data used in Figure 11 and Figure 12.

From Figure 14, and with such a small sample, we notice a trend that needs to be confirmed with the other metrics and systems. Based on the graphical estimation presented, Figure 14 (a) shows a lognormal body with a Pareto tail, which is a double Pareto.

If y is the frequency of occurrences and x is the end point of each bin, where each bin is of size 2, then in a power law we have

y = C x^{a}    (4)

so that log y = log C + a log x.

So a power law with exponent a is seen as a straight line with slope a on a log-log plot. As presented earlier and by [Mitzenmacher 2004], the key characteristic of a double Pareto distribution is that it has a power law at both tails. To test whether Figure 14 (a) is a double Pareto distribution, we need to look at both the CCDF and the CDF on log-log plots for linear tails. In Figure 15 (a, b) we split Figure 14 (a) around the median and fit the tails, on a CCDF for the large values and on a CDF for the small ones. Although the median of the KOffice [2001-2010] churn metric is 10, we can see that at around bin 100 the line deviates from the curvy fall and straightens to become linear.

[Figure 15 panels: (a) CDF of LOC Churn on bins of size 2, range [1-100], on a log-log plot; (b) CCDF of LOC Churn on bins of size 2, range [100-10000], on a log-log plot]

Figure 15. A log-log plot of CDF and CCDF of LOC Churn for KOffice [2001-2010] on weeklong time units.

Using standard least-squares line fitting [Bretscher 1997], Figure 15 (a) shows that the lognormal body of Figure 14 (a) appears as a straight line on the CDF with a slope of a = 0.15 (it is positive because the chart's X-axis is decreasing) and R² = 0.99 (the coefficient of determination for goodness of fit), which is almost a perfect fit. Figure 15 (b) is the other straight line, on a CCDF, with a = 1.64 and R² = 0.99. A lognormal body and a Pareto tail in Figure 14 (a) are enough to validate a double Pareto, and here we have it further confirmed; the number of commits follows in Figure 16.
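A sketch of such a tail fit is shown below: it takes the base-10 logs of the values and of the CCDF, fits a line by least squares, and reports the slope a and the determination coefficient R². The data is synthetic and numpy is assumed; it is not the exact fitting procedure used for Figure 15.

# Sketch of the log-log tail fit: least-squares line through (log x, log CCDF),
# reporting slope a and determination coefficient R^2. Data is synthetic.
import numpy as np

rng = np.random.default_rng(2)
values = np.sort(rng.pareto(1.64, 50_000) + 1)
ccdf = 1.0 - np.arange(len(values)) / len(values)

# Fit only the upper tail (values above the median), as done for Figure 15 (b).
tail = values > np.median(values)
log_x, log_y = np.log10(values[tail]), np.log10(ccdf[tail])

slope, intercept = np.polyfit(log_x, log_y, 1)
pred = slope * log_x + intercept
r2 = 1.0 - np.sum((log_y - pred) ** 2) / np.sum((log_y - np.mean(log_y)) ** 2)
print(f"a = {-slope:.2f}, R^2 = {r2:.2f}")   # power-law exponent and fit quality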

Figure 16. A log-log plot of CCDF for the number of commits on Bins of Size 2 for KOffice [2001-2010] on weeklong time units

In Figure 14 (b) the data points are few due to the nature of the metric. For KOffice [2001-2010], the maximum observed number of commits that touched a file during a week-long unit is 56. Fitting this plot gives a high R² = 0.95 for a line with slope a = -2.1, although the very first point deviates far from the line of fit. This point is the most important one, as it holds the highest frequency and representation.

What we see here is a key Pareto characteristic. An observation on a single system is not good enough for any generalization; we need to look at more systems and metrics to confirm our results. Figure 14 (c) shows a clear lognormal plot, where the curve bends throughout and then drops vertically with some noise along the way.

Now we use the same approach and plot the CCDFs for all metrics for all systems. Such a large-scale investigation helps us uncover the patterns needed to understand the distribution of each metric and each category. Figures 16 through 24 contain all of these plots.

[Figure 17 panels: rows Chrome, GCC, KOffice, KDELibs, LLVM; columns LOC, CC]

Figure 17. A log-log plot of CCDF for code metrics (LOC and CC) on bins of size 2 over five systems on weeklong time units.

[Figure 18 panels: rows OpenMPI, Python, Quantlib, Ruby, Xapian; columns LOC, CC]

Figure 18. A log-log plot of CCDF for code metrics (LOC and CC) on bins of size 2 over the other five systems on week-long time units; this completes Figure 17.

[Figure 19 panels: rows Chrome, GCC, KOffice, KDELibs, LLVM; columns LOC Churn, Functions Churn]

Figure 19. A log-log plot of CCDF for change metrics (LOC, Functions Churn) on bins of size 2 over five systems on weeklong time units.

[Figure 20 panels: rows OpenMPI, Python, Quantlib, Ruby, Xapian; columns LOC Churn, Functions Churn]

Figure 20. A log-log plot of CCDF for change metrics (LOC, Functions Churn) on bins of size 2 for the five other systems on weeklong time units, this completes Figure 19.

[Figure 21 panels: rows Chrome, GCC, KOffice, KDELibs, LLVM; columns Hunks Churn, CC Diff, LOC Diff]

Figure 21. A log-log plot of CCDF for change metrics (Hunks Churn, CC Diff, LOC Diff) on bins of size 2 over five systems on week-long time units.

[Figure 22 panels: rows OpenMPI, Python, Quantlib, Ruby, Xapian; columns Hunks Churn, CC Diff, LOC Diff]

Figure 22. A log-log plot of CCDF for change metrics (Hunks Churn, CC Diff, LOC Diff) on bins of size 2 over the other five systems on week-long time units; this completes Figure 21.

[Figure 23 panels: rows Chrome, GCC, KOffice, KDELibs, LLVM; columns Commits, Authors]

Figure 23. A log-log plot of CCDF for collaboration metrics (Commits and Authors) on bins of size 2 over five systems on week-long time units.

[Figure 24 panels: rows OpenMPI, Python, Quantlib, Ruby, Xapian; columns Commits, Authors]

Figure 24. A log-log plot of CCDF for collaboration metrics (Commits and Authors) on bins of size 2 over the other five systems on week-long time units; this completes Figure 23.

Figure 17 and Figure 18 show the code metrics (LOC and CC) distributions over all systems, which clearly appear to be double Pareto plots. Similar results have been reported previously by [Herraiz, German, Hassan 2011]; the difference here is that our code metrics data is not one snapshot at some point in time, it covers all reported sizes every week over a range of multiple years. Still, we have the same observation confirmed with far richer data. Other studies have reported a lognormal distribution for code metrics. We agree with [Herraiz, German, Hassan 2011] that such studies have underestimated the large file sizes. As we mentioned, our scale of work is much larger: historical LOC values are investigated here, not just one snapshot but as many as hundreds to thousands of snapshots.

Figures 19 through 22 and Figures 23 through 24 are for the change and collaboration metrics, respectively. All plots systematically show a curvy start and then an (approximately) straight line with some noise at the ending tail. Such behavior has been reported to represent a Pareto distribution, which agrees with the results of [Herraiz Tabernero, German, Hassan 2011] for change metrics. The distribution of collaboration metrics has never been reported before in the literature.

The reported distributions help better estimate maintenance effort and uncover possible flaws in all reported systems. For example, the change and collaboration metric distributions show that very few files take up most of the work and effort to be maintained. In Chrome, in the week of 2011-08-29, the file "/trunk/src/chrome/browser/ui/browser.cc" received a week-long churn of 201 lines, 50 functions, and 39 hunks through 24 commits and 18 authors, where the file size is 4448 lines with a cyclomatic complexity of 1026, while the total change (delta) is zero lines and zero cyclomatic complexity. Again, in one week, one very big file was hammered by eighteen developers, an average of 2.5 developers a day touching this file. Such changes require a lot of management around developers and careful testing.

5.6 Metrics Correlation

In previous work [Alali, Kagdi, Maletic 2008], we studied the size of typical Subversion commits and the correlation between the extents of change for LOC churn and hunks churn. We observed that 75% of commits modify between 2 and 4 files, churn around 50 LOC, and touch approximately 8 different hunks. There is a strong positive correlation (up to 0.75) between the LOC churn size measure and the number of hunks.

Despite the high correlation between the two variables, measuring both LOC Churn and hunks Churn as distinct effort units remains an important component of the analysis. It shows not only the editorial extent of the change (LOC Churn) but also gives an indication about the distribution of that change over the source file (hunks Churn).

Our previous work [Alali, Kagdi, Maletic 2008] did not examine the number of metrics that we have in this study. In that work we used the parametric Pearson's measure [Johnson, Wichern 1998], a linear correlation coefficient whose values range between -1 and +1 and which measures the strength and direction of a linear relationship between two variables. Before, we also used the simple boxplot five-point summary [Johnson, Wichern 1998] to study the distribution of such metrics; here, since we are dealing with non-normal distributions, we use the non-parametric Spearman rank correlation method. Spearman's rank correlation coefficient is a nonparametric (distribution-free) rank statistic proposed as a measure of the strength of the association between two variables. It is a measure of a monotone association that is used when the distribution of the data makes Pearson's correlation coefficient undesirable or misleading.

Spearman’s coefficient assesses the ability of an arbitrary monotonic function can describe the relationship between two variables, with no pre assumptions about the frequency distribution of the two variables involved. Unlike the commonly used

Pearson’s product-moment correlation coefficient, it does not require the pre assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales [Hauke, Tomasz 2011].

For a sample of size N, the N raw scores Xi, Yi are converted to ranks xi, yi and

Spearman’s coefficient rho is computed from these:

\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}    (5)

The sign of the Spearman correlation indicates the direction of the association between an independent variable and a dependent variable. A positive coefficient implies that the dependent variable increases when the independent variable increases; a negative coefficient implies that the dependent variable decreases as the independent variable increases.

A Spearman correlation of zero indicates that the dependent variable neither increases nor decreases as the independent variable increases. The Spearman correlation increases in magnitude (closer to 1.0) as the two variables get closer to being perfect monotone functions of each other. When the two variables are perfectly monotonic, the Spearman correlation coefficient becomes 1.0. A perfect monotone increasing relationship implies that, for any two pairs of data values (xi, yi) and (xj, yj), the differences xi − xj and yi − yj always have the same sign. A perfect monotone decreasing relationship implies that these differences always have opposite signs [Lehman 2005].

The Spearman correlation coefficient is often described as being non-parametric. A perfect Spearman correlation results when two variables are related by any monotonic function, whereas with Pearson correlation the relationship has to be linear to yield a perfect correlation. Spearman correlation's exact sampling distribution can be calculated without having to know the joint probability distribution of its variables.

If there is no correlation, or only a weak one, rho (Equation 5) is close to 0. A value near zero means that the relationship between the two variables is essentially random. A correlation greater than 0.6 is generally described as strong to very strong, a correlation between 0.4 and 0.6 is moderate, and a correlation below 0.4 is generally described as weak (negative ranges follow the same scheme). By convention, we accept a p-value of 0.05 or less as statistically significant, in which case we reject the null hypothesis of no correlation between the two variables. All of the correlation values presented here are statistically significant, with p-values far smaller than 0.05.

Now let us look at the data in an exhaustive analysis and address the question: is there a correlation between any two metrics from a specific metric category? This is calculated as follows (a sketch of the computation follows the list):

• We take any two metrics.

• The first metric from the chosen category is x and the other is y.

• We then calculate rho and the p-value for x and y using the R function cor.test(a, b, method="spearman"), where a and b are two vectors of the raw aligned data [Best 1975; Hollander, Wolfe, Chicken 2013].

• We then do this for the remaining change categories, assigning new x and y, until we finish all metrics.

• We then repeat this for the other combinations.
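The dissertation uses R's cor.test for this step; the following equivalent sketch uses Python's scipy.stats.spearmanr over synthetic metric vectors and labels each pair with the strength levels of Table 8.

# Sketch: exhaustive pairwise Spearman correlation over a set of metric vectors.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n = 5_000
loc = rng.pareto(1.5, n)
metrics = {                                    # synthetic stand-in metric vectors
    "LOC":  loc,
    "CC":   loc * 0.2 + rng.normal(0.0, 0.1, n),   # made to correlate with LOC
    "LOCC": rng.pareto(1.6, n),
    "C":    rng.integers(1, 10, n).astype(float),
}

def strength(rho):
    rho = abs(rho)
    return "strong" if rho >= 0.6 else "moderate" if rho >= 0.4 else "weak"

for a, b in combinations(metrics, 2):
    rho, p = spearmanr(metrics[a], metrics[b])
    print(f"{a:>4} vs {b:<4}  rho = {rho:+.2f}  p = {p:.1e}  ({strength(rho)})")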

Table 8. Legend key for Table 9 through Table 18. Spearman rho values shown in white on a black background indicate a strong relationship, black on a yellow background a moderate relationship, and black on a white background a weak relationship between a pair of variables. The p-value is a very small number for all calculated rho values.

Legend:
[0.6, 1.0]    Strong
[0.4, 0.6)    Moderate
[-0.4, 0.4)   Weak
p-value is 2.2e-16 over all

Table 9. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for Chrome (LOC, CC: code metrics; LOCC, FC, LOCD, HC, CCD: change metrics; C, A: collaboration metrics).

Chrome  LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    0.9    0.2    0.2    0.1    0.3    0.0    0.2    0.2
CC             1.0    0.2    0.3    0.0    0.2    0.1    0.2    0.2
LOCC                  1.0    0.6    0.5    0.8    0.5    0.5    0.4
FC                           1.0    0.3    0.6    0.3    0.5    0.4
LOCD                                1.0    0.3    0.8    0.0    0.0
HC                                         1.0    0.2    0.6    0.5
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.8
A                                                               1.0

Table 10. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for GCC.

GCC     LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.4    0.4    0.3    0.4    0.3    0.3    0.3
CC             1.0    0.4    0.4    0.3    0.4    0.3    0.3    0.3
LOCC                  1.0    0.6    0.6    0.8    0.5    0.5    0.4
FC                           1.0    0.4    0.7    0.3    0.5    0.4
LOCD                                1.0    0.3    0.8    0.1    0.0
HC                                         1.0    0.3    0.6    0.5
CCD                                               1.0    0.1    0.0
C                                                        1.0    0.8
A                                                               1.0

Table 11. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for KOffice.

KOffice LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.2    0.2    0.1    0.3    0.1    0.1    0.1
CC             1.0    0.2    0.2    0.1    0.3    0.1    0.1    0.1
LOCC                  1.0    0.7    0.5    0.8    0.5    0.5    0.2
FC                           1.0    0.3    0.8    0.3    0.5    0.2
LOCD                                1.0    0.3    0.8    0.1    0.0
HC                                         1.0    0.3    0.6    0.3
CCD                                               1.0    0.1    0.0
C                                                        1.0    0.5
A                                                               1.0

Table 12. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for KDElibs.

KDElibs LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.1    0.2    0.1    0.2    0.1    0.1    0.1
CC             1.0    0.1    0.2    0.1    0.2    0.1    0.1    0.1
LOCC                  1.0    0.7    0.5    0.8    0.5    0.5    0.3
FC                           1.0    0.3    0.8    0.3    0.5    0.3
LOCD                                1.0    0.3    0.8    0.0    0.0
HC                                         1.0    0.2    0.6    0.3
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.6
A                                                               1.0

Table 13. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for LLVM.

LLVM    LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.3    0.3    0.2    0.3    0.1    0.2    0.3
CC             1.0    0.2    0.3    0.2    0.3    0.1    0.2    0.3
LOCC                  1.0    0.8    0.5    0.8    0.4    0.6    0.4
FC                           1.0    0.3    0.8    0.2    0.7    0.4
LOCD                                1.0    0.2    0.8    0.1    0.0
HC                                         1.0    0.2    0.7    0.5
CCD                                               1.0    0.1    0.0
C                                                        1.0    0.6
A                                                               1.0

Table 14. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for OpenMPI.

OpenMPI LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.2    0.2    0.2    0.3    0.1    0.2    0.1
CC             1.0    0.2    0.2    0.2    0.3    0.2    0.1    0.1
LOCC                  1.0    0.6    0.6    0.8    0.5    0.4    0.2
FC                           1.0    0.3    0.6    0.2    0.4    0.3
LOCD                                1.0    0.3    0.8    0.0    0.0
HC                                         1.0    0.3    0.5    0.3
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.6
A                                                               1.0

Table 15. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for Python.

Python  LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.1    0.2    0.1    0.2    0.1    0.2    0.1
CC             1.0    0.1    0.2    0.1    0.2    0.1    0.2    0.1
LOCC                  1.0    0.7    0.5    0.8    0.4    0.5    0.3
FC                           1.0    0.3    0.8    0.3    0.5    0.4
LOCD                                1.0    0.3    0.8    0.0    0.0
HC                                         1.0    0.2    0.6    0.4
CCD                                               1.0    -0.1   0.0
C                                                        1.0    0.7
A                                                               1.0

Table 16. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for Quantlib.

Quantlib LOC   CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    0.8    0.2    0.3    0.0    0.3    -0.1   0.1    0.1
CC             1.0    0.2    0.3    0.0    0.3    0.0    0.1    0.1
LOCC                  1.0    0.6    0.4    0.7    0.3    0.4    0.3
FC                           1.0    0.2    0.7    0.1    0.4    0.3
LOCD                                1.0    0.3    0.8    0.0    0.0
HC                                         1.0    0.1    0.5    0.3
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.6
A                                                               1.0

Table 17. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for Ruby.

Ruby    LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    1.0    0.2    0.2    0.2    0.2    0.2    0.2    0.2
CC             1.0    0.2    0.2    0.2    0.2    0.2    0.2    0.2
LOCC                  1.0    0.6    0.4    0.8    0.3    0.5    0.4
FC                           1.0    0.3    0.6    0.2    0.5    0.3
LOCD                                1.0    0.2    0.8    0.0    0.0
HC                                         1.0    0.2    0.6    0.4
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.7
A                                                               1.0

Table 18. Spearman's rho values for every possible pair of the studied metrics, computed on the data collected for Xapian.

Xapian  LOC    CC     LOCC   FC     LOCD   HC     CCD    C      A
LOC     1.0    0.9    0.2    0.1    0.2    0.2    0.1    0.1    0.1
CC             1.0    0.2    0.1    0.1    0.2    0.1    0.1    0.1
LOCC                  1.0    0.6    0.5    0.8    0.4    0.5    0.2
FC                           1.0    0.2    0.7    0.2    0.5    0.2
LOCD                                1.0    0.2    0.8    0.0    0.0
HC                                         1.0    0.2    0.6    0.2
CCD                                               1.0    0.0    0.0
C                                                        1.0    0.4
A                                                               1.0

Table 9 through Table 18 give the Spearman rho correlation results from running the correlation test using the R function cor.test [Best 1975; Hollander, Wolfe, Chicken 2013]. Table 8 is the legend key that we use to describe the colorings and strength levels. Cells with a black background represent a strong to very strong correlation between the pair of tested metrics, cells with a yellow background a moderate correlation, and white cells a weak to very weak relationship. Besides running a correlation test among pairs of metrics from the same category, we also compute cross correlations between metrics from different groups. For example, we find the correlation between LOC and CC from the code metrics, then between LOC and LOCC across the code and change metrics, and so on. We compute the correlation exhaustively over all possible combinations of the metrics.

Table 19. A summary of all Spearman's rho test results. Each cell holds three counters (frequencies), counting how many systems fall into each strength level (strong, moderate, weak) for that pair of metrics.

(strong, moderate, weak)  CC       LOCC     FC       LOCD     HC       CCD      C        A
LOC                       10,0,0   0,1,9    0,1,9    0,0,10   0,1,9    0,0,10   0,0,10   0,0,10
CC                                 0,1,9    0,1,9    0,0,10   0,1,9    0,0,10   0,0,10   0,0,10
LOCC                                        10,0,0   2,6,0    10,0,0   0,8,2    1,9,0    0,4,6
FC                                                   0,1,9    10,0,0   0,0,10   1,9,0    0,4,6
LOCD                                                          0,0,10   10,0,0   0,0,10   0,0,10
HC                                                                     0,0,10   8,2,0    0,6,4
CCD                                                                             0,0,10   0,0,10
C                                                                                        9,1,0

Table 20. Ratios (percentages) for the Spearman's rho tests over all systems. Each cell gives the percentage of systems at each strength level (strong / moderate / weak) among all possible pairs of the studied metrics, i.e., the counts of Table 19 expressed as percentages.

(% strong/moderate/weak)  CC        LOCC      FC        LOCD      HC        CCD       C         A
LOC                       100/0/0   0/10/90   0/10/90   0/0/100   0/10/90   0/0/100   0/0/100   0/0/100
CC                                  0/10/90   0/10/90   0/0/100   0/10/90   0/0/100   0/0/100   0/0/100
LOCC                                          100/0/0   20/60/0   100/0/0   0/80/20   10/90/0   0/40/60
FC                                                      0/10/90   100/0/0   0/0/100   10/90/0   0/40/60
LOCD                                                              0/0/100   100/0/0   0/0/100   0/0/100
HC                                                                          0/0/100   80/20/0   0/60/40
CCD                                                                                   0/0/100   0/0/100
C                                                                                               90/10/0

A metric against itself is a perfect 1.0 (100%), which is the diagonal in the group of tables from Table 9 to Table 18. Take, for example, Table 9 for Chrome; we can see that rho for LOC vs. CC is a very strong correlation of up to 0.9 (90%). Such a highly correlated relationship tells us a lot: the more lines of code a file has, the more complicated it is and the higher its CC. CC is the Cyclomatic Complexity of a program, which represents the number of linearly independent paths through a program's source code. Such a high correlation tells us that LOC and CC behave almost like the same metric. So, among the code metrics, using LOC alone in a fault prediction model will not be affected or enhanced by adding the cyclomatic complexity metric to it.

Reading through Table 9 to Table 18, we can see a very interesting pattern recurring: metrics from the same category are highly correlated with each other, while cross-category correlations are largely weak. This main observation validates our intuition and the intention behind this study. A synergy of metrics from each category is extremely important when building effort and fault prediction models. Most of the literature, as reported in the related work in Chapter 2, relies on single-category metrics, either one metric or a variety from the same category. This work stands behind hybrid combinations of metrics across categories to build more accurate prediction models. Each type of metric has shown promising results, but no work yet has combined such a set of metrics.

To validate our claim regarding the correlation results, Table 19 and Table 20 summarize the empirical assessment of the relations among all metrics. Table 19 has the same cell structure as Tables 9 to 18, but now each cell holds a count tuple for each correlation level. So, LOC vs. CC is 10,0,0, meaning all ten systems agree that the correlation between LOC and CC is a strong to very strong relationship, and there are zero moderate or weak correlations between LOC and CC. The same reading applies to the rest of the table. Table 20 follows the same structure as Table 19, but this time each correlation level is given as a ratio (percentage).

5.7 Discussion

In this chapter we presented a large-scale investigation to understand the distribution of, and the correlation between, code, change, and collaboration metrics. Such metrics have shown promising results for fault prediction and maintenance estimation. The literature has used both single metrics and mixtures of metrics to build such models. The lack of understanding of these metrics has led to conflicting results based on wrong assumptions and underestimation of the distributions of the different metric categories. We then studied the correlation relationships among all the metrics. Such information helps in deciding on the right choice of metrics for any future prediction models. We used two code metrics (size and complexity), four change metrics (churn metrics), and two collaboration metrics (authorship and commits).

We can summarize our results in two points. First, code metrics follow a double Pareto distribution, while change and collaboration metrics follow a Pareto distribution. Earlier research [Herraiz Tabernero, German, Hassan 2011] has reported similar results, but not at this variety of metrics and scale. Second, metrics within the same group are highly correlated, while cross-category correlations are weak. For example, code metrics correlate weakly with change metrics, while LOC and CC from the code metrics are very strongly correlated, in the range of 0.9. That means either of these two metrics adds little information to the other if both are used in combination. This important observation recommends using a variety of categories rather than focusing on multiple metrics of the same category.


CHAPTER 6

USING AGE AND DISTANCE TO IMPROVE THE DETECTION OF

EVOLUTIONARY COUPLINGS

In this chapter we discuss an approach to improve the accuracy of evolutionary couplings uncovered from version history. Two measures, namely the age of a pattern and the distance among items within a pattern, are defined and used with the traditional methods for computing evolutionary couplings. The goal is to reduce the number of false positives (i.e., inaccurate or irrelevant claims of coupling). We first discuss the characteristics of these measures and highlight a few observations on applying them to the ranking and filtering of evolutionary couplings. Then we use eleven large open source systems to validate our claim. The results show that age is a decisive filter for false patterns, while distance is not.

6.1 Introduction

The work presented here investigates the problem of reducing the number of false positives by ranking the patterns. Two measures of a pattern, namely distance and age, are introduced. Our hypothesis is that these measures will help rank correct and important patterns higher than less meaningful or false-positive patterns. This has the potential to improve the usefulness of the patterns for developers.

We now present our observations, validation, and experiments on this topic in the following manner. Section 6.2 presents a brief discussion of evolutionary coupling and how we mine the patterns. Section 6.3 covers data collection, and the following sections cover the distance measure (Section 6.4) and the age measure (Section 6.5); each defines the measure and presents initial observations. Section 6.6 presents the experiment and validation. Following that are the conclusions and future work.

6.2 Frequent Pattern Mining

We use the Eclat algorithm for frequent pattern (or itemset) mining [Zaki,

Parthasarathy, Li 1997] to uncover evolutionary couplings. The technique searches for patterns of co-changing items. The underlying idea is that if items co-change together on a very frequent basis then they must be related to one another in either an explicit or implicit manner.

The technique has a number of parameters that must be selected for the particular problem and data set. The first is the size of a transaction (or change set). The transaction size dictates which items “change together”. Since logical commits and physical commits are typically not mapped one-to-one, a time duration/window is needed to group physical commits into logical ones. Selecting the time window is an open problem as a logical commit can be spread out over multiple physical commits across days.

The next parameter that must be selected is the minimum support. This value regulates the lower bound on the frequency of the patterns (itemsets) produced. Low values of minimum support generate larger numbers of patterns, while higher values produce fewer patterns but with more support. Lastly, we must select the granularity of the changing item. For source code this could be a line of code, a function, a file, or a subsystem. In


our case we use the file level. This is a common level of granularity to use and maps well to commits.
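As a concrete illustration of the mining step just described, the following is a minimal, brute-force sketch in Python of counting co-changing file itemsets over commit-level transactions with a minimum support threshold. It is not the Eclat implementation used by srcMiner; the function name and the toy data are illustrative only.

from itertools import combinations
from collections import Counter

def mine_cochange_patterns(transactions, min_support, max_size=3):
    """Count how often each set of files changes together.

    transactions : list of sets of file paths, one set per change set
                   (here, one Subversion commit per transaction).
    min_support  : minimum number of transactions a pattern must appear in.
    max_size     : largest itemset enumerated; a real miner (Eclat/Apriori)
                   prunes the search space instead of enumerating blindly.
    """
    counts = Counter()
    for files in transactions:
        for size in range(2, max_size + 1):
            for itemset in combinations(sorted(files), size):
                counts[itemset] += 1
    # Keep only itemsets that reach the minimum support.
    return {itemset: supp for itemset, supp in counts.items() if supp >= min_support}

# Toy usage: three commits at file-level granularity.
commits = [
    {"ui/panel.cpp", "ui/panel.h"},
    {"ui/panel.cpp", "ui/panel.h", "core/model.cpp"},
    {"ui/panel.cpp", "ui/panel.h"},
]
print(mine_cochange_patterns(commits, min_support=2))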

6.3 Data Collection and Patterns Generation

The implementation of the approach is achieved with two tools: ΔShopper (pronounced "delta shopper") and srcMiner (read "source miner"). These tools target data available from the Subversion version control system.

The ΔShopper tool is responsible for extracting metadata and differences from a Subversion repository. The program takes a unit of time as a parameter and extracts information about the modification of artifacts over the course of the history. It identifies modifications to files and functions within each time unit and generates datasets of buckets of co-changing files and functions; the time unit used here is a Subversion commit. To collect the files we use the archived logs of co-changing files.


Table 21. Characteristics of the eleven open source systems used in the study, including years, commits, files, selected minimum supports, itemsets generated, and maximum itemset sizes (L).

System      Years       Commits  Files   Min. Support  Itemsets  L
KDELibs     01-10 (10)  54,189   14,748  9             13,342    10
KOffice     01-10 (10)  55,651   21,857  12            47,444    13
Httpd       99-11 (13)  11,264   763     5             146,363   16
Subversion  00-11 (12)  23,420   1,485   10            93,701    16
Ruby        00-11 (12)  12,439   834     9             82,005    14
Chrome      08-11 (4)   35,650   16,358  12            315,986   18
QuantLib    00-10 (11)  7,791    5,340   9             266,424   18
OpenMPI     03-11 (9)   11,682   6,583   6             328,963   18
LLVM        01-10 (10)  50,327   4,266   12            55,435    13
GCC         01-10 (10)  50,145   26,154  12            50,050    11
Xapian      00-10 (11)  4,703    1,302   5             16,764    12


Figure 25. Size ratios (total commits/years/MAX), where commits and years are as reported in Table 21. MAX is the highest commits/years ratio, which belongs to Chrome, making it relatively 100% active.

Table 21 presents the open source projects we use in this work, the range of years, the number of files in each system, the commits, the minimum support counts, the generated itemsets, and the maximum observed itemset size. We use srcMiner to process our datasets and generate frequent patterns. We developed srcMiner based on the Eclat frequent itemset mining algorithm [Zaki, Parthasarathy, Li 1997].

6.4 Pattern Distance

Here, distance represents the position of multiple files relative to one another within the directory tree. Software projects normally split source files into a logical tree


structure that represents abstract modules. For example, all source code that deals with

UI management might be under the same sub-directory. We take advantage of this and use tree distance as our measure. A tree distance is the number of unmatched edges between any two leaf-nodes (files) on the file structure tree.

For example, the three files below represent an itemset with a support count of 12. Two of them differ by three folders, so the distance between them is 6, which is the maximum distance in this pattern:

/koffice//colors/gray_u16/kis_gray_u16_colorspace.h

/koffice/plugins/colors/gray_u8/kis_gray_colorspace.h

/koffice/plugins/colors/gray_u16/kis_gray_u16_colorspace.h

We use the definition of tree distance between any pair of files to define a distant pattern. A distant pattern is a pattern whose maximum tree distance between its items exceeds a given threshold. We empirically assign the threshold based on the distribution of tree distances over all collected patterns. Specifically, we choose the point where the distance becomes relatively non-typical, an outlier. For example, from Table 22 we can see distances of zero and two occurring for multiple systems, and one and four for a couple, which together represent the majority (above 75%) of the data. We consider distances above that to be outliers.
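The following sketch shows how the tree distance and the distant-pattern test described above could be computed; it counts only unmatched directory edges, so two files in the same folder have distance 0, as in the KOffice example. The helper names are illustrative and not part of our tooling.

from itertools import combinations

def tree_distance(path_a, path_b):
    """Unmatched directory edges between two files in the directory tree.

    Only the directory parts are compared, so two files that are children
    of the same folder have distance 0.
    """
    dirs_a = path_a.strip("/").split("/")[:-1]
    dirs_b = path_b.strip("/").split("/")[:-1]
    common = 0
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        common += 1
    return (len(dirs_a) - common) + (len(dirs_b) - common)

def pattern_distance(pattern):
    """Maximum pairwise tree distance among the files of a pattern."""
    return max(tree_distance(a, b) for a, b in combinations(pattern, 2))

def is_distant(pattern, threshold):
    """A distant pattern exceeds the empirically chosen distance threshold."""
    return pattern_distance(pattern) > threshold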

Distant patterns represent changes occurring across a broader portion of the system.

Our hypothesis is that these patterns are more likely to represent hidden dependencies and may be otherwise difficult to uncover. Such patterns may be an indication of code


that is a candidate for refactoring or reengineering. The premise is that, since the majority of co-changes are localized, such outlier behavior is often problematic.

Figure 26. Distribution for Distances between Pattern Pairs. Patterns Generated for KOffice [2001-2010] for Files.

The distribution of distances between itemset pairs is presented using a histogram for the eleven systems. The data mined (Table 21) is split into deciles that are collapsed if two deciles have the same value. Figure 26 presents the histogram created using these deciles, with the file distances on the x-axis and the frequency of files at each decile on a log scale on the y-axis.

It can be seen that typical distances for KOffice are zero and two for files, where the zero distance is dominating the set (zero means the files are children of the same folder).

This implies that most evolutionary couplings occur within the same subfolder and remain localized. This isn’t entirely unexpected, as changes to modules should happen,


most of the time, in source files local to the module, as opposed to outside of the module.

With this in mind, outliers in this data (couplings with a distance higher than three or four) may represent cross-module couplings and require further analysis by software engineers.

For the other 10 systems we examined, most followed the same trend as KOffice with

0 dominating the set of distances. A distance of 2 took second place most of the time.

Table 22 gives a partial listing of distances that, when combined, represent more than

75% of the data collected for the systems.

6.5 Pattern Age

Age represents the duration of an evolutionary coupling. For impact analysis, the age of an evolutionary coupling may help to highlight both explicit and implicit couplings between files by showing that they have been co-changed extensively and persisted unbroken during maintenance activities for a long time.


Table 22. Typical Distances for File Granularity.

            Majority
OSS         Distance   %     Distance   %     Total %
KDELibs     0          86%                    86%
KOffice     0          97%                    97%
Httpd       0          84%                    84%
Subversion  0          99%                    99%
Ruby        0          99%                    99%
Chrome      0          32%   1          44%   76%
QuantLib    0          99%                    99%
OpenMPI     2          81%                    81%
LLVM        2          63%   4          13%   76%
GCC         0          92%                    92%
Xapian      0          27%   2          65%   91%

We define a pattern's age as the difference in days between the date on which a given pattern first appears and the date on which it last appears during the selected study period. We refer to this difference as age because patterns exist over an interval of time: they are created through maintenance tasks, persist for a period of time through further maintenance, and eventually, through inaction or maintenance, their appearance becomes infrequent, perhaps denoting their end. The age does not necessarily reflect the end, or death, of a pattern, because it may reoccur during maintenance outside our range of study.

To compute the age of an evolutionary coupling, we find the date on which it first appeared and subtract this from the date on which it last appeared. We did this for eleven systems in order to get an overview of what pattern age looks like for various software projects. For example, take a coupling that first appears on Oct 6th and last appears on Nov 1st of the same year, with multiple revisions containing the coupling between those dates (Oct 6th, Oct 10th, Oct 19th, Oct 30th, and Nov 1st). The age of this pattern is the difference between Oct 6th and Nov 1st, which is 26 days.
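A minimal sketch of this computation, assuming the appearance dates of a pattern have already been collected, reproduces the 26-day example (the year used below is arbitrary):

from datetime import date

def pattern_age(appearance_dates):
    # Age is the number of days between the first and last appearance of a
    # pattern in the studied history; intermediate occurrences do not matter.
    return (max(appearance_dates) - min(appearance_dates)).days

occurrences = [date(2010, 10, 6), date(2010, 10, 10), date(2010, 10, 19),
               date(2010, 10, 30), date(2010, 11, 1)]
print(pattern_age(occurrences))  # 26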

A pattern appears during maintenance due to dependencies that exist between multiple items. Often, when a pattern appears, there is a ripple effect that forces the same pattern to reappear within a short interval of time. This can be due to an unfinished task, a bug created or uncovered by the maintenance performed, a system-wide API change taking place, etc. The pattern then sleeps for a short time and reappears once again.

For example, Figure 27 shows deciles for the difference in time from one appearance of a pattern to the next; the bars are the sizes of these differences in days. We notice that there is a 40% chance that, for every appearance of a pattern, the next appearance will be within four days. We have made similar observations with the other systems used in our experiments.


Figure 27. Distribution of Patterns Reoccurring Differences in Days for KOffice [2001-2010] over Files.

In order to get an idea of what a typical age is, we computed the distribution of ages for the eleven systems. The data contained in the graphs are split into deciles that are collapsed if two deciles represent the same value.

The distributions indicate that the age of coupled files tends to be short in most systems. This is presented in Table 23 and holds true for more than half of the systems we examined. For example, QuantLib's maximum age is 3051 days, but 90% of the couplings have ages at or below 399 days.


Table 23. Typical age distribution For files. The threshold represents the age at which most of the data is accounted for.

System      Max Age (days)  Threshold (days)  Data %
KDELibs     2046            1027              60%
KOffice     3311            1456              60%
Httpd       4491            1030              90%
Subversion  4121            975               70%
Ruby        3591            3302              90%
Chrome      878             103               80%
QuantLib    3051            399               90%
OpenMPI     1990            929               90%
LLVM        3482            928               80%
GCC         3642            2810              60%
Xapian      3458            1206              60%

Here, a short age is defined as half of the age of the longest evolutionary coupling within the system. Out of the eleven systems we studied, only three did not show this trend at file granularity. In Figure 28, we show the distribution of age for ten years of the KOffice software package at file granularity. The x-axis represents the range of ages, the y-axis represents the frequency of couplings that fall within each range, and the second bar is the width of each decile.


Figure 28. 10-Percentile Distribution for Patterns Age and Decile Widths in days for KOffice [2001-2010] for Files

We notice a normal-like distribution for age, where the middle deciles are narrow and contain most of the age frequencies, while both tails cover long ranges of days and are less frequent. For example, from Figure 28, we see that 60% of the data collected has an age of 1456 days or less, and the largest age for an evolutionary coupling is 3311 days for files. This follows the trend for short ages, because 1456 is less than half of 3311 and 60% of the data have an age of 1456 days or less.

Several of the other systems had similar distributions. These include OpenMPI,

QuantLib, Httpd, Subversion, LLVM, Xapian, and Chrome and most of their evolutionary couplings were well below half of the maximum age. The only systems that did not show this trend are KDELibs, GCC, and Ruby.


6.6 Evaluation Using Interestingness Measures

We use data mining techniques to assess the quality of a pattern in the context of change impact analysis [Arnold, Bohner 1996] by assessing the interestingness of the mined association rules. It is quantitatively sufficient to measure the quality of the generated rules using a combination of the support-confidence framework of Agrawal et al. [Agrawal, Imieliński, Swami 1993] with lift or leverage [Damaševičius 2009; Soman, Diwakar, Ajay 2006]. Here, we generate association rules from our patterns, compare high ages against low ages and zero distances against higher distances, and then compare the values of both lift and confidence.

Confidence and lift are measures of significance and interestingness for association rule mining. Given a rule X → Y, the confidence of the rule [Agrawal, Imieliński, Swami 1993] is defined as the probability of the rule's consequent (Y) under the condition that the transaction also contains the antecedent (X). Confidence can be evaluated using the conditional probability P(Y|X); lift, in contrast, is the ratio of the actual probability observed to that expected if X and Y were independent. Confidence is a commonly used measure for the quality of an association rule, see Equation 1.
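Equations 1 through 3 are referenced here but not reproduced in this section; the standard support-confidence definitions, which those equations are assumed to follow, are, where supp(·) denotes the fraction of transactions containing an itemset:

\mathrm{confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}

\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)} = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{supp}(Y)}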


Table 24. Eleven open source systems with the number of generated itemsets and their minimum supports, the number of association rules generated using a 10% minimum confidence threshold, and the number of files included in the evaluation. The date range for all systems is April 01, 2009 to August 01, 2009.

System      Min. Support  Number of Files  Itemsets  20% of Itemsets  Association Rules
KDELibs     9             635              13,342    2,002            169,210
KOffice     13            912              23,215    3,483            583,688
Httpd       7             189              8,379     1,257            132,755
Subversion  12            291              14,912    2,237            292,639
Ruby        11            85               17,668    2,651            372,225
Chrome      15            732              19,031    2,855            588,822
QuantLib    12            136              1,804     271              30,262
OpenMPI     8             372              12,757    1,914            587,797
LLVM        18            348              12,563    1,885            677,698
GCC         15            423              21,631    3,245            359,726
Xapian      6             179              6,378     957              257,278

Using the patterns in Table 24, we generate all possible association rules by constructing every combination obtained by splitting each pattern into two subsets. This can produce an enormous number of rules, so we set constraints on rule generation. Again, the tool chain (ΔShopper + srcMiner) was run on the studied systems, and the results are shown in Table 24. We use an arbitrary minimum support (see Table 24) and a minimum confidence of 10%, both selected with the intent of limiting the enormous number of association rules. These numbers come out of experience, given the lack of research on golden rules for selecting the right minimum support and other interest measures (e.g., confidence).


We collected the data in Table 24 using five months of history across all systems. Each system has its own nature of growth and size (Figure 25), which correlates with the number of patterns generated, as more active systems mean more commits and more changes with richer histories. GCC, LLVM, KDELibs, and KOffice, as shown in Figure 25, are the fastest evolving systems among our selected open source systems. This is clearly reflected in the fact that each is among the highest minimum supports while still producing the highest numbers of association rules.

Let A and B be disjoint subsets of a pattern I, with A ∪ B = I. From Equation 2, lift(A → B) is equal to lift(B → A), which is not the case for confidence, see Equation 1. We generated only the association rules A → B where |A| ≥ |B|, without duplications. Then, to filter out low quality rules and reduce the number of association rules, we only consider rules where confidence(A → B) ≥ 0.1.
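A sketch of this rule-generation step is shown below, assuming a support lookup table that contains the support count of every relevant sub-itemset; the size ordering used to avoid duplicate orientations is an assumption, and the names are illustrative.

from itertools import combinations

def rules_from_pattern(pattern, supp, min_confidence=0.1):
    """Enumerate association rules A -> B from one frequent pattern I.

    pattern : tuple of files forming the itemset I.
    supp    : dict mapping frozensets of files to their support counts
              (assumed to contain I and every antecedent considered).
    Splits I into two disjoint subsets A and B with A ∪ B = I, keeps a
    single orientation per split, and prunes rules below the threshold.
    """
    items = frozenset(pattern)
    rules = []
    for k in range(1, len(pattern)):
        for antecedent in combinations(sorted(pattern), k):
            a = frozenset(antecedent)
            b = items - a
            # Keep one orientation per split (assumed |A| >= |B|, with a
            # tie-break on equal sizes to avoid duplicates).
            if len(a) < len(b) or (len(a) == len(b) and min(a) > min(b)):
                continue
            confidence = supp[items] / supp[a]
            if confidence >= min_confidence:
                rules.append((a, b, confidence))
    return rules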


Table 25. Deciles of confidence values for KOffice and their win-lose scores comparing low-age-patterns vs. high-age-patterns

kth Decile  High-age-patterns  Low-age-patterns  High-age-patterns score  Low-age-patterns score
0           10                 5                 1                        0
1           5                  10                0                        1.1
2           5                  10                0                        1.2
3           10                 5                 1.3                      0
4           10                 5                 1.4                      0
5           10                 5                 1.5                      0
6           10                 5                 1.6                      0
7           10                 5                 1.7                      0
8           5                  5                 0                        0
9           5                  5                 0                        0
10          5                  5                 0                        0
Total Final Score                                8.5                      2.3


Figure 29. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice, comparing low-age-patterns vs. high-age-patterns.


Table 26. Deciles of confidence values for KOffice and their win-lose scores comparing zero-distant-patterns vs. non-zero-distant-patterns

kth Decile  Non-zero-distant-patterns  Zero-distant-patterns  Non-zero-distant-patterns score  Zero-distant-patterns score
0           10                         5                      1                                0
1           10                         5                      1.1                              0
2           10                         5                      1.2                              0
3           10                         5                      1.3                              0
4           10                         5                      1.4                              0
5           10                         5                      1.5                              0
6           10                         5                      1.6                              0
7           10                         5                      1.7                              0
8           10                         5                      1.8                              0
9           10                         5                      1.9                              0
10          5                          10                     0                                2
Total Final Score                                             14.5                             2


Figure 30. The win-lose plot and score for the 10-point percentile distribution of confidence for rules generated from the patterns for KOffice, comparing zero-distant-patterns vs. non-zero-distant-patterns.

Our initial aim is to validate that age and distance are two characteristics that can be used to filter out patterns, or to classify patterns as high or low quality, when used for impact analysis and code change prediction. We compare younger patterns (low-age-patterns) against older patterns (high-age-patterns). To do this, we take all generated association rules, sort them by age, and split the rules in half into two groups: one with the higher-aged patterns and the other with the shorter-aged patterns. For distance (which is the maximum distance observed for any pair of files in a pattern, based on our earlier definition of tree distance), and for the same rules from Table 24, we again split into two groups, but this time one group with zero distances only (zero-distant-patterns) and the other with the rest of the distances (non-zero-distant-patterns). We noticed that zero-distant patterns are the majority, which is not very surprising, since co-changing files are usually placed in the same folder.

We now address which group produces change patterns with higher confidence values. So we compare low-age-patterns against high-age-patterns and, for the second characteristic, distance, we compare zero-distant-patterns against non-zero-distant-patterns.

For the generated patterns in Table 24, we computed confidence values using Equation 1. For each approach we sort the values and then break them down into 10 deciles (11 percentile points, the 0th through the 100th), where each part represents 1/10 of the observed values. Then we compare the first point (the 0th percentile, i.e., the minimum value from each side), then the second point (the 10th percentile), and so on until the 100th percentile (i.e., the maximum value). We compare at each point and use the results to build a win-lose plot (a score).

To assign a score at each percentile, the following rule is used: assign 0 to the loser and 1 + 1 × k to the winner, where k is the percentile expressed as a fraction (so a win at the 10th percentile scores 1.1 and a win at the 100th percentile scores 2). If the values are equal we assign 0 to both, as there is no winner. The reasoning behind this rule is that a win at a higher percentile is more valuable than a win at a lower one (much like a weighted score). Notice the last two columns of Table 25; the final score is the total of the scores at each point.

Figure 29 shows a win-lose plot generated from Table 25; high is a win and low is a loss. When two points overlap it represents a tie. Each point represents a percentile.
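A small sketch of this scoring scheme, assuming the two groups' confidence values are available as plain lists (numpy is used only for the percentile points):

import numpy as np

def decile_points(values):
    """The 0th, 10th, ..., 100th percentile points of a list of confidences."""
    return [np.percentile(values, 10 * k) for k in range(11)]

def win_lose_score(values_a, values_b):
    """Weighted win-lose comparison of two confidence distributions.

    At percentile point k (k = 0..10) the winner receives 1 + k/10 points,
    the loser 0, and ties score 0 for both, as described above.
    """
    score_a = score_b = 0.0
    for k, (a, b) in enumerate(zip(decile_points(values_a),
                                   decile_points(values_b))):
        if a > b:
            score_a += 1 + k / 10.0
        elif b > a:
            score_b += 1 + k / 10.0
    return score_a, score_b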

Applying confidence as in Equation 1, we use the win-lose plot to compare the confidence distributions of the two approaches. Figure 29 and Figure 30 present the win-lose charts of the ordered values using a 10-point percentile (decile) approach. For KOffice, in Table 25 and Figure 29, high-age-patterns have a better distribution, with higher values than low-age-patterns. For KOffice, in Table 26 and Figure 30, non-zero-distant-patterns have a better distribution, with higher values than zero-distant-patterns.

It implies that high-age-patterns have better predictive rules with higher confidence than low-age-patterns. Also, non-zero-distant-patterns have better predictive rules with higher confidence than zero-distant-patterns.

Confidence is sensitive to high frequencies of the consequent Y; consequents that have higher support will automatically produce higher confidence values even if there exists no association between the items [Hahsler, Hornik, Reutterer 2006]. Lift overcomes such sensitivity and is a measure of how many times more often X and Y occur together than expected if they were statistically independent. Lift tells us how much better a rule is at predicting the result than just assuming the result.

For a rule A → B, a higher lift implies that A and B co-occurring in a transaction is not a random coincidence but is due to some relationship between them. Lift originates in predictive modeling, where the purpose is to identify a subgroup (target) from a larger population, for example members likely to respond positively to a marketing offer. A model is successful if the response within the target is much better than the average for the population as a whole; lift is the ratio of these values, target response divided by average response. See Equation 3 for the formal definition of lift [Agrawal, Imieliński, Swami 1993]. As with confidence, we calculate the lift values for the generated rules.


Table 27. The final confidence and lift comparison scores of the eleven open source systems on the generated association rules for the generated low-age-patterns vs. high-age-patterns

System      High-age-patterns Confidence  Low-age-patterns Confidence  High-age-patterns Lift  Low-age-patterns Lift
KDELibs     1.1                           13.4                         2.3                     14.2
KOffice     8.5                           2.3                          1                       15.5
Httpd       0                             10.9                         0                       16.5
Subversion  0                             14.5                         1                       15.5
Ruby        1                             9.8                          1                       15.5
Chrome      1.2                           11.4                         0                       16.5
QuantLib    1                             8.1                          2.1                     14.4
OpenMPI     1                             13.5                         1                       15.5
LLVM        6.6                           6                            0                       16.5
GCC         0                             14.5                         0                       16.5
Xapian      1.6                           7.8                          1                       15.5
Wins        3                             8                            0                       11
            27%                           73%                          0%                      100%


Table 28. The final confidence and lift comparison scores of the eleven open source systems on the generated association rules for the generated zero-distant-patterns vs. non-zero-distant-patterns.

System      Non-zero-distant-patterns Confidence  Zero-distant-patterns Confidence  Non-zero-distant-patterns Lift  Zero-distant-patterns Lift
KDELibs     1                                     7.1                               1                               15.5
KOffice     1                                     9.8                               1                               15.5
Httpd       5                                     4.2                               15.5                            1
Subversion  0                                     12.6                              0                               16.5
Ruby        1                                     15.5                              1                               15.5
Chrome      8                                     4.6                               15.5                            1
QuantLib    1                                     6.5                               2.3                             14.2
OpenMPI     6.8                                   2.3                               15.5                            1
LLVM        13.5                                  1                                 15.5                            1
GCC         10.9                                  2.3                               0                               16.5
Xapian      9.8                                   1                                 15.5                            1
Wins        6                                     5                                 5                               6
            55%                                   45%                               45%                             55%


From Table 27 we can see that confidence leans clearly toward low-age-patterns overall; KOffice and LLVM are the only wins for high-age-patterns. Rerunning the same procedure for lift, lift (also in Table 27) has eleven wins to zero for low-age-patterns, agreeing with confidence overall.

Table 28 has the comparison results for distance. Due to the inconsistency between the confidence and lift results, and the slight win on each side, the results do not favor either group over the other. For confidence, the score is six to five, slightly in favor of the non-zero-distant-patterns over the zero-distant-patterns across the studied systems. Lift (Table 28) has zero-distant-patterns slightly better, with six wins against five for the non-zero-distant-patterns, which is the reverse of the confidence result. This confirms that there is no decisive winner: confidence gives non-zero-distant-patterns a one-system advantage, and lift gives the same advantage to zero-distant-patterns. We can say that association rules generated from low-age-patterns are better at producing rules of higher interest, and therefore better predictability; we cannot say the same for non-zero-distant-patterns against zero-distant-patterns.

Our initial expectation was that patterns with larger distances, and older patterns, would produce higher quality rules. On the other hand, one could argue that patterns with shorter distances should have a clear edge in quality over patterns with longer distances, since files that are close together change together more often and tend to have physical dependencies among them.


Regarding the age characteristic, we initially thought that older patterns would produce higher quality association rules, since they have existed longer and have likely occurred more frequently. At the same time, older patterns suffer from having many broken paths, which makes them harder to trace, and developers usually have better awareness of new, fresh dependencies emerging from the code than of older ones. Such results were initially unexpected and of interest.

6.7 Summary

Our goal is to investigate how effective ranking the patterns using the measures of age and distance is in reducing the number of false positives. So far we have observed interesting characteristics for each measure. We noticed that patterns are most often localized changes (above 75%), with distances of 0 and 2 between co-evolving files. We believe the outliers are of great interest, as they may indicate hidden dependencies, for example. Ranking patterns based on distance can help direct developers to unusual couplings. With respect to the age of a pattern, we observed that there is a 40% chance that a pattern will occur again within 4 days or less (in the systems examined). Furthermore, it appears that the older a pattern is, the more rooted it is and the more frequently it has likely occurred.

We then validated the effect of rankings based on these measures using interestingness measures from data mining. The aim is to help raise engineers' awareness of important patterns and reduce the problem of too many false patterns being examined. We found several interesting and somewhat unexpected results.

We observed that distance is not a good measure or characteristic for ranking or filtering out false evolutionary couplings. On the other hand, age is: younger patterns are fresher and better predictors of change.

We believe our empirical validation scales well enough to address our research questions. Nevertheless, as future work, we plan to validate our results further using different experiments and validation approaches on a broader range of systems.

CHAPTER 7

ASSESSING TIME WINDOW SIZE IN THE MINING OF SOFTWARE

REPOSITORIES FOR EVOLUTIONARY COUPLINGS

We present an empirical study that assesses the effect time window size (i.e., commit, hour, day, and week) has on mining software repositories for evolutionary change patterns. The results of using different time windows in the detection of evolutionary couplings are compared to determine which is best at predicting future changes. Three assessment methods are used: the first compares each size against the others, the next uses the results of one time window size to predict another, and the last compares the intersection and union of all possible combinations to determine if any combination improves results. Thirteen open source systems are used in the assessment. The results show that the time window of a week appears to be the best for prediction, and that combining different time windows improves predictive ability.

7.1 Introduction

A classic problem in mining software repositories (MSR) is the mapping of physical commits (in version control systems) to logical change-sets. Physical commits are such things as atomic multi-change commits, change lists, check-ins, or patches. A logical changeset is a set of commits/modifications to one or more files that are related to one maintenance task. The task can be such things as a bug fix, the entire response to a


modification request [Mockus, Votta 2000], or a small adaptive API [Collard, Maletic,

Robinson 2010] change.

A logical changeset can be spread over many commits across days or weeks. The individual commits may also include changes that relate to other tasks and may be committed by different developers. The development process (e.g., weekly iterations, daily builds, etc.) can also impact how physical commits map to logical changesets.

To address this problem, researchers have used various heuristics, mainly based on a sliding time window approach, to group supposedly related commits into a logical changeset. We are particularly interested in the problem of uncovering evolutionary couplings [Gall, Hajek, Jazayeri 1998; Gall, Jazayeri, Krajewski 2003] and what impact the size of the time window has on the quality of the couplings produced. The definition of the changeset has a major effect on the detection of evolutionary couplings. That is, the better the mapping between physical commits and logical changesets, the higher the quality of the detected evolutionary couplings.

Traditionally, mining for evolutionary coupling uses a small time window or a discrete version as committed. Many prior studies employ a sliding window approach where two subsequent changes committed by the same author with the same log message are part of one transaction if they are at most 200 seconds apart [Zimmermann, Weisgerber, Diehl, Zeller 2004]. Other studies use a single commit as the changeset (i.e., a zero time window). However, the selection of the time window size is in general an open problem and has not been systematically studied.
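A minimal sketch of the 200-second sliding-window heuristic attributed to Zimmermann et al. is shown below; the commit tuple layout is illustrative and not the format of any specific tool.

def group_commits(commits, window_seconds=200):
    """Group physical commits into logical changesets with a sliding window.

    commits : list of (timestamp_seconds, author, log_message, files) tuples,
              ordered by time.
    Two subsequent commits by the same author with the same log message are
    merged into one transaction if they are at most window_seconds apart.
    """
    transactions = []
    for time, author, message, files in commits:
        if (transactions
                and author == transactions[-1]["author"]
                and message == transactions[-1]["message"]
                and time - transactions[-1]["last_time"] <= window_seconds):
            transactions[-1]["files"] |= set(files)
            transactions[-1]["last_time"] = time
        else:
            transactions.append({"author": author, "message": message,
                                 "last_time": time, "files": set(files)})
    return [t["files"] for t in transactions]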


In this work, we aim to find an optimal time window size for mining evolutionary couplings. We investigate four basic time window sizes, namely a single commit, one hour, one day, and one week of commits. In the context of detecting evolutionary couplings using frequent pattern mining, we investigate the following questions:

Q1 Which time window size (i.e., single commit, hour, day, and week of commits)

gives the best results for predicting future changes?

Q2 Can one time window size be predictive for another?

Q3 Can the combined use of different time window sizes improve the results?

In a recent survey by Li et al. [Li, Sun, Leung, Zhang 2013] of papers published between 1997 and 2010 concerning change impact analysis, the distribution of papers on static analysis was 70% and 30% on dynamic analysis. Co-change coupling to predict future changes is reported to be the most used technique under static analysis. These studies rely on some definition for a time size window to determine a changeset of co- changes. The size of 200 seconds for the sliding window (or SVN commit) is the most used figure. However, our results show that a time window size of a week appears to generate the most predictive evolutionary couplings for the systems we examined (Q1).

This chapter is organized as follows. Section 7.2 briefly revisits evolutionary couplings. The comparative analysis approach and the setup of the study are presented in Section 7.3. We empirically validate our approaches in Section 7.4, and we close with threats to validity, conclusions, and discussion in Sections 7.5 and 7.6.


7.2 Evolutionary Couplings

The intent of evolutionary coupling, as presented in Section 2.2, is to find hidden dependencies and traceability links that are difficult to identify through physical dependencies in the code. Software elements are evolutionarily coupled if they frequently co-change over a defined time window, with a minimum support, for a given duration of history. We use the frequent pattern mining algorithm presented in Section 6.2.

7.3 Approach & Setup of the Study

The goal of this work is to study different choices of time window size for producing evolutionary couplings. This section describes the approach we take, the data used, and the tools used to conduct the empirical study.

7.3.1 Experimental Data

Table 29 and Table 30 present the 13 open source projects used in this investigation along with their characteristics. The tables are sorted by number of commits. Note that all systems are primarily implemented in C or C++. Included are the number of years of collected history, the number of files and LOC of the most recent release (in the studied history), and the number of distinct files committed. Notice that the number of files in the most recent version may be more than the number of files that were committed in the studied period. For example, GCC has around 30K files in the most recent copy we used, while in the years 2007-2010 the number of files committed was around 12K. The hours/days/weeks columns represent the number of hours/days/weeks that have at least one commit.

Table 29. Characteristics of The Studied Systems. For each system there is number of Files, LOC, Committed Files, and Years.

System      Files(1)  LOC        Committed Files(2)  Years
QuantLib    1,940     449,139    3,257               07 - 10
Python      769       527,342    474                 07 - 10
OSG         1,982     509,088    2,235               07 - 10
Xapian      930       178,877    773                 07 - 10
Httpd       370       213,425    409                 07 - 10
Open MPI    3,814     887,728    2,840               07 - 10
GCC         30,794    3,783,430  12,313              07 - 10
LLVM        1,735     728,641    2,789               07 - 10
KDELibs     5,177     1,306,065  7,996               07 - 10
KOffice     6,043     1,264,067  13,076              07 - 10
Subversion  687       425,549    1,405               01 - 10
Ruby        396       351,062    829                 01 - 10
Chrome      13,324    2,682,856  27,473              08 - 11

(1) Number of files in the most recent release. (2) Total number of distinct files in the observed history.


Table 30. The Number of Commits, Hours, Days, and Weeks Used In This Study Over The Thirteen Open Source Systems.

System      Commits  Hours(1)  Days(1)  Weeks(1)
QuantLib    3,182    2,000     721      192
Python      2,145    1,788     797      180
OSG         4,033    2,569     902      199
Xapian      2,658    1,961     731      187
Httpd       2,396    1,763     781      204
Open MPI    4,902    3,502     1,128    208
GCC         18,128   11,574    1,452    208
LLVM        30,624   13,432    1,445    209
KDELibs     21,829   11,889    1,457    209
KOffice     24,206   11,569    1,441    208
Subversion  18,809   13,267    3,146    522
Ruby        12,439   9,085     2,897    517
Chrome      59,342   16,993    1,227    179

1 The number of hours/days/weeks having one or more commits.

Version data is extracted using our tool ΔShopper (pronounced "delta shopper"). ΔShopper extracts metadata from a Subversion repository, partitioning the data into configurable time windows. The program identifies the modified files within each time window, builds the transactions, and generates datasets as XML documents.


Figure 31. Activity Plot based on the ratio commits/years divided by the maximum of all commits/years ratios for the studied thirteen systems.

Figure 31 shows activity rates, defined as the number of commits over the number of years. The commits/years ratio is normalized by dividing it by the maximum observed ratio among the systems. Chrome is, relatively, 100% active since it has the highest activity ratio. This ratio draws a distinction among the presented systems in terms of how active each system has been. We have noticed, in comparison with Table 31, that the more a system is changed, the larger the number of patterns. That is, a large number of files are modified, so there tend to be more co-changing dependencies. We addressed this earlier in Section 5.3.


7.3.2 Patterns Generation

The Koupler2 tool takes the data generated by ΔShopper and computes scaled and discretized change vectors for each file in each time period. Frequent pattern mining using the Apriori algorithm [Agrawal, Srikant 1994] is applied with an initial minimum support of s/T, where T is the number of observed time periods and s is a parameter of the algorithm. Initially we start with a high value of s such that no patterns are produced. If the algorithm does not yield at least N frequent patterns, the minimum support is decremented by one and the algorithm is rerun. This continues until the minimum required number of patterns N has been found or the minimum support drops below 5/T. This is a standard technique for determining a minimum support [Kagdi, Maletic, Sharif 2007; Zimmermann, Weisgerber, Diehl, Zeller 2004]. The heuristics guiding the search were determined experimentally from the datasets; they may vary for different units of time. Koupler2 computes the evolutionary couplings for the specified time window.
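The support-lowering search can be sketched as follows, assuming a callable miner; the defaults mirror the 100/T starting point used in Section 7.3.4 and the 5/T floor described above.

def search_min_support(transactions, mine, n_patterns, T, start=100, floor=5):
    """Lower the minimum support until at least n_patterns patterns appear.

    mine(transactions, min_support) is assumed to run the frequent pattern
    miner and return the patterns it finds; T is the number of observed
    time periods.
    """
    patterns = []
    for s in range(start, floor - 1, -1):       # decrement the support count
        patterns = mine(transactions, min_support=s / T)
        if len(patterns) >= n_patterns:
            return patterns, s / T
    return patterns, floor / T                  # fewer than n_patterns even at the floor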

7.3.3 Design of the Evaluation

Evaluating the approach is difficult because of the lack of a gold standard; that is, we do not have a correct list of all the coupled artifacts. To evaluate the approach, we compare the prediction abilities of the patterns produced, using traditional association rules, over different time windows. This evaluation method can be applied automatically to the systems under examination to give a broad comparison.

High accuracy in predicting the impact of a change is crucial to building usable tools; the existence of large numbers of false positives discourages developers from using such tools. Here, we define the quality of patterns based on their prediction abilities. So, patterns with higher accuracy (precision) and coverage (recall) values are of better quality: since changes inside these patterns are more likely to be predictive, they contain fewer false positives.

7.3.4 Evaluation Using Prediction

A traditional means of validating the quality of association rules is to generate them from a part of the history (training set) and then see how well they predict future changes in a later part of the history (test set). In [Zimmermann, Weisgerber, Diehl, Zeller 2004], the authors used co-change information (evolutionary coupling) to predict entities

(classes, methods, fields, etc.) that are likely to be modified. A prediction model correlates with the quality of the association rules generated from frequent patterns.

Here, for a given sequence of transactions for a system, ordered by time, we divide the sequence into a training set (Tr) and a test set (Te). We use 75% of the transactions as the splitting point; the training set is the first 75% and the test set is the remainder. The training set is used to generate the patterns of change using Koupler2. After generating all evolutionary couplings using the different time windows, we generate all possible association rules with 50% confidence.

The patterns are generated as described in Section 7.3.2. The tool chain (ΔShopper + Koupler2) was run on the studied systems. Patterns are generated for each system and each time window. Given the training set, we configured Koupler2 with an initial minimum support of 100/T and then lowered it until the top N patterns were produced. We selected N to be 2,000, as this value allows for the generation of enough patterns to compare predictability using different time windows. We perform the same evaluation as Zimmermann et al. [Zimmermann, Weisgerber, Diehl, Zeller 2004] and Canfora et al. [Canfora, Ceccarelli, Cerulo, Di Penta 2010].

We examine the left hand side (LHS) and right hand side (RHS) of the test set transactions, as done in [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004]. The case where the RHS is sought in the same transaction is called the same subsequent (k = i+0), where i is the index of a transaction that contains the LHS and k is the index of the transaction in which the RHS is sought. We define a hit as finding a rule in a transaction in the test set. In [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004] the authors also looked for the subsequent case k = i+5, where the RHS matches within the five transactions following transaction i.
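The hit counting can be sketched as follows; this simplification checks the same transaction for k = i+0 and any of the five following transactions for k = i+5, while the per-offset averaging described later in this section is omitted. Names are illustrative.

def count_hits(rules, test_transactions, k_offset=0):
    """Count LHS hits and LHS-and-RHS hits of rules over a test set.

    rules             : list of (lhs, rhs) pairs of frozensets of files.
    test_transactions : time-ordered list of sets of co-changed files.
    k_offset          : 0 looks for the RHS in the same transaction (k = i+0);
                        5 accepts an RHS match in the five following ones.
    """
    lhs_hits = lhs_rhs_hits = 0
    for i, tx in enumerate(test_transactions):
        window = [tx] if k_offset == 0 else test_transactions[i + 1:i + k_offset + 1]
        for lhs, rhs in rules:
            if lhs <= tx:                                  # antecedent changed
                lhs_hits += 1
                if any(rhs <= later for later in window):  # consequent follows
                    lhs_rhs_hits += 1
    return lhs_hits, lhs_rhs_hits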


Table 31. Patterns Uncovered for the Training Set (first 75% of transactions) over Thirteen Open Sources and Their Minimum Support count, ECs = Evolutionary Couplings, mS = minimum support count.

            commit          hour            day             week
OSS         mS     ECs      mS     ECs      mS     ECs      mS     ECs
QuantLib    5      1,296    5      1,336    5      2,361    6      2,811
Python      5      84       5      87       5      167      5      1,996
OSG         5      468      5      677      5      809      6      2,608
Xapian      5      546      5      2,194    5      2,323    7      3,124
httpd       5      255      5      271      5      397      5      634
OpenMPI     7      3,628    7      2,971    8      3,365    11     2,338
GCC         11     3,225    11     3,254    17     2,348    45     2,011
LLVM        19     2,458    21     2,756    28     2,099    52     2,073
KDELibs     7      3,731    6      3,300    8      2,020    16     2,120
KOffice     10     2,734    10     3,121    11     2,062    16     2,013
Subversion  19     2,098    19     2,205    23     2,256    45     2,150
Ruby        12     2,007    16     2,443    22     2,251    38     2,128
Chrome      24     2,138    26     2,115    50     3,380    57     2,290


Precision = (number of LHS-and-RHS hits) / (number of LHS hits)                                  (6)

Recall = (number of files correctly predicted) / (number of changed files that occur in the training set)    (7)

F-measure = 2 × (Precision × Recall) / (Precision + Recall)                                      (8)
Precision, recall, and F-measure are then computed as in equations (6), (7), and (8).

Precision (Equation 6) is the ratio of the number of LHS-and-RHS hits to the number of LHS hits. Recall (Equation 7) is the number of files predicted over all files that occur within the training set. There may be files in the test set that do not occur in the training set because they were added afterwards; we cannot predict a change to a file that did not appear in the training set. Equation 8 is the F-measure (also F-score), which is an accuracy measure; it can be interpreted as the weighted average or harmonic mean of precision and recall. All these statistical measures range between [0, 1], where 0 is the worst and 1 (100%) is the best. Regarding precision for k = i+5, we compute precision on i+1, i+2, i+3, i+4, and i+5 and then divide by 5 to get the average precision; we use average precision for k = i+5 subsequent changes.

7.4 Empirical Study

Using this approach we generated four sets of evolutionary couplings for each system (Table 29 and Table 30) with the following time windows: 1) a single commit, 2) an hour, 3) a day, and 4) a week of commits. For each system, the different pattern sets


were compared against each other. We now compare the quality of patterns from each set via their future predictive abilities.

7.4.1 Time Windows Comparison

For each system and time window, the top 2,000-plus patterns were collected, or patterns were generated using a minimum support count of 5. Table 31 has the number of patterns generated using this configuration. The variations in Table 31 are due to the different nature of each project's development and maintenance process. For example, we see that the Chrome itemsets for the commit time window exceed our preconfigured minimum number (2,000) with a high minimum support count (24). This implies that Chrome has a very active development environment and that changes touch a large number of files in almost every single commit. LLVM also produces more than 2,000 itemsets with a high minimum support count (19). These observations are also clearly visible in Figure 31, where Chrome and LLVM have the most relative activity.

Additionally, the variation can be attributed to the fact that each system has a different architecture and design, which leads to different patterns of change. This could be interpreted as a sign that a large number of the co-changed files are highly coupled: many changes touch a number of different files, requiring more effort to evolve the system. From small to large time windows, the minimum support count increases monotonically, and the different time window sizes clearly have an effect on the results. There is no perfect, universally applicable time window for all project histories; different software architectures and designs and different work habits can result in dramatically varied results.


Table 32 and Table 33 present the results of running a prediction experiment on evolutionary coupling patterns and the four time windows. We used subsequent k = i+0 and k = i+5, and computed the precisions, recalls, and F-Measures. The time window column is the training (patterns) and test (transactions) set time windows. The highlighted cells are the best among the four time window sizes for each system.

Previous work of this type [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004] used only the top 10 rules (or top-N), ranked by confidence, while we use all rules with 50% confidence for pruning, a large difference in the number of rules evaluated. We are also using a time period approximately four times longer (3 to 10 years) than in [Canfora, Ceccarelli, Cerulo, Di Penta 2010; Zimmermann, Weisgerber, Diehl, Zeller 2004], which were conducted over a few months.

There are a number of interesting observations that can be seen in Table 32 and Table 33. For the week time window, Chrome, Ruby, Subversion, KOffice, KDELibs, LLVM, and GCC have very high predictability (near 100%). The others, OpenMPI, httpd, Xapian, OSG, Python, and QuantLib, range from 60% to 88%. Week is an appealing window size as it groups work activities that are likely semantically related, because it is one of the commonly used release durations. A developer's work plans often revolve around the workweek, and multiple commits within a week are often logically related (however, this is not universally the case by any means). Weeklong work plans can include fixing a set of bugs, developing a new feature, or some non-trivial refactoring of a critical component.


Table 32. Prediction Accuracies and Completeness Over Six Open Sources. P = Precision, R = Recall, Fm = F-Measure, Tr = Time window of Training Set, Te = Time window of Test Set.

Tr, k = i+0 k = i+5 OSS Te P% R% Fm% P% R% Fm% c 56.62 15.60 24.46 62.63 2.38 4.59 h 45.66 15.83 23.51 23.91 0.29 0.57 QuantLib d 56.32 27.22 36.70 74.23 4.80 9.02 w 67.65 48.94 56.79 66.21 1.35 2.65 c 50.00 3.92 7.27 56.10 0.93 1.82 h 52.17 5.38 9.76 52.78 0.91 1.78 Python d 42.00 10.55 16.87 52.80 0.88 1.73 w 69.77 68.18 68.97 56.55 1.07 2.10 c 40.43 5.65 9.92 10.08 0.08 0.16 h 47.79 8.41 14.30 30.08 0.31 0.62 OSG d 47.06 21.33 29.36 23.28 0.31 0.62 w 73.33 67.35 70.21 64.38 0.74 1.46 c 61.18 15.66 24.94 61.43 1.30 2.54 h 54.93 15.92 24.68 40.71 1.34 2.60 Xapian d 57.29 30.22 39.57 20.21 1.12 2.13 w 60.98 54.35 57.47 27.39 0.51 1.00 c 47.37 9.03 15.17 35.71 0.79 1.54 h 48.74 13.18 20.75 47.34 0.77 1.52 httpd d 57.55 31.28 40.53 45.67 0.73 1.44 w 79.17 76.00 77.55 71.76 1.36 2.66 c 42.20 9.71 15.79 81.86 2.32 4.51 h 45.10 13.14 20.35 84.19 6.06 11.30 OpenMPI d 55.71 27.76 37.05 18.58 0.27 0.53 w 88.24 88.24 88.24 84.54 1.50 2.95


Table 33. Prediction Accuracies and Completeness Over Seven Open Sources. P = Precision, R = Recall, Fm = F-Measure, Tr = Time window of Training Set, Te = Time window of Test Set. This Completes Table 32.

Tr, k = i+0 k = i+5 OSS Te P% R% Fm% P% R% Fm% c 55.93 13.11 21.24 7.91 0.03 0.06 h 59.73 20.26 30.25 11.77 0.04 0.08 GCC d 81.44 75.14 78.16 93.35 0.37 0.74 w 100 100 100 99.62 1.79 3.51 c 60.64 15.41 24.58 17.63 0.07 0.14 h 62.70 22.73 33.36 38.90 0.25 0.49 LLVM d 84.86 82.27 83.54 90.31 0.47 0.94 w 98.08 98.08 98.08 83.22 2.98 5.76 c 66.22 14.33 23.56 22.76 0.04 0.08 h 62.73 18.24 28.26 38.63 0.10 0.20 KDELibs d 76.78 68.13 72.20 67.41 0.23 0.45 w 100 100 100 94.50 0.51 1.01 c 61.51 14.00 22.81 41.88 0.09 0.17 h 55.25 18.19 27.37 36.03 0.10 0.19 KOffice d 80.45 69.72 74.70 70.30 0.27 0.55 w 100 100 100 92.41 0.41 0.82 c 62.66 13.95 22.82 45.28 0.38 0.75 h 59.79 20.81 30.87 56.52 0.47 0.92 Subversion d 72.02 59.29 65.04 91.25 0.64 1.27 w 97.69 97.69 97.69 90.30 1.47 2.89 c 78.59 25.02 37.96 81.98 1.48 2.90 h 28.83 6.38 10.45 23.22 0.35 0.69 Ruby d 89.06 80.94 84.80 87.52 1.38 2.72 w 99.22 99.22 99.22 99.38 4.23 8.11 c 68.96 21.86 33.19 65.12 0.11 0.22 h 75.80 46.39 57.55 55.26 0.12 0.23 Chrome d 90.06 85.56 87.75 74.15 0.63 1.25 w 100 100 100 82.57 8.45 15.34


To address our first question in this study, about which time window best predicts future changes, we summarize part of Table 32 and Table 33 in Figure 32. We sort the time windows from high to low using their F-measures and label each system with that ordering. For example, Chrome has the sorting wdhc, where week is best and commit is lowest in predictability. Over the thirteen systems we plot the distribution of the distinct observed sortings for the two prediction configurations k = i+0 and k = i+5. We observe that wdhc is the most prevalent, with 10 systems having it for k = i+0 and 6 for k = i+5. Week is the most predictive window 23 times, day once, and hour twice. Interestingly, the commit window size is never the top.

Figure 32. Time window sortings distribution over the thirteen systems. Commit, Hour, Day, Week are the different labels.


Examining Table 31, Table 32, and Table 33, we see that systems with high minimum support also have high predictive measures (100% in the case of a week time window). High minimum support is clearly an indication of high quality patterns. Such patterns with high support values are not common, as a long period of archived maintenance is required for them to exist [Robbes, Pollet, Lanza 2008]. Hence, relying on high support is not very practical. For example, from Table 31 we see that Chrome's top patterns with a week time window and a minimum support of 24 need at least 24 weeks of co-changes to exist (one change per week). In practice this would most likely require much more than 24 weeks of maintenance for that many co-changes to occur. We try to overcome this limitation in Section 7.4.3 by looking for combinations of different time window sizes.

While the week window size produces the patterns with the best predictability in most of the systems examined here, predicting what is going to happen next week is not always very useful. In conducting impact analysis, predicting what is going to happen in the next commit, the next hour, or maybe the next day is typically more practical. In the next two sections, we show how to use patterns generated with a week time window size to enhance the prediction of the other time windows.

7.4.2 Time Window Cross Prediction

Now, for each system, we use the evolutionary dependencies generated for one time window size to try to predict future changes for another time window size. This novel approach is called cross prediction and gives insight into which of the time window sizes subsumes, in terms of prediction, another. It is difficult to do all possible cross predictions, so we focus on one case and its opposite. Since the transactional commit is the most popular time window in recent studies, we examine how the other time window sizes (hour, day, and week) perform against the commit size, and then the commit size against the others.

Table 34. Cross Prediction F-Measure. Fm is F-Measure, Tr=hdw is Time window of Training Set, where training is hour, day, or week. Te = c is Time window of Test Set, where commit is the Test set Time window.

Tr, k = i+0 k = i+5 k = i+0 k = i+5 OSS OSS Te=c Fm% Fm% Fm% Fm% h 32.98 0.62 26.90 0.41 QuantLib d 29.54 0.59 17.93 0.60 LLVM w 22.16 0.82 5.63 5.28 h 7.89 1.81 33.14 1.19 Python d 7.71 1.66 27.89 0.25 KDELibs w 7.10 1.00 14.73 0.59 h 21.32 0.24 34.18 0.19 OSG d 20.56 0.40 29.81 0.29 KOffice w 13.97 0.33 17.55 0.54 h 34.09 3.28 23.54 1.18 Xapian d 31.99 2.30 17.48 1.20 Subversion w 26.33 2.36 14.06 2.61 h 17.23 1.50 44.13 3.01 httpd d 19.02 2.05 34.14 3.54 Ruby w 17.24 2.53 29.63 7.75 h 16.28 3.60 34.37 0.32 OpenMPI d 18.13 3.30 18.55 1.00 Chrome w 16.01 1.19 6.61 13.41 h 21.11 0.06 24 26 Tr = hdw GCC d 19.44 0.38 15 13 Te = c w 12.30 3.11


Figure 33. Improvements range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = hdw and Te = c.

To do this cross prediction, we use the evolutionary couplings generated from the training set of each system from Table 31. Then we generate the rules with 60% confidence and look for hits on cross time windows. Table 34 presents the cross prediction of the commit time window size as the test set (Te = c), with h, d, and w as the training sets (Tr). We use the evolutionary couplings generated for hour, day, and week time window sizes to predict change in the subsequent k = i+0 and k = i+5. In

Table 34 the bold cells are the improved F-measure values. The bottom right corner of the table summarizes the results.


We compared the F-measure values for commit as both training and test (from Table 32 and Table 33) to the F-measure values for hour, day, or week as training and commit as test. For example, the k = i+0 F-measure value for QuantLib is 24.46 in Table 32; in Table 34, under k = i+0 for QuantLib, we bolded the higher F-measure values, which are 32.98 for hour and 29.54 for day, but not 22.16 for week, and so on. The results show that the hour, day, and week time window sizes predict current and subsequent changes of the commit time window size better than the commit time window size does for itself.


Table 35. Cross Prediction F-Measure values. Fm is F-Measure. Tr = c is Training Set Time window, where commit is the Time window , Te=hdw is Test Set Time window, where test set is hour, day, or week.

Tr=c, k = i+0 k = i+5 k = i+0 k = i+5 OSS OSS Te Fm% Fm% Fm% Fm% h 48.72 1.12 39.06 0.31 QuantLib d 56.73 4.05 92.09 0.52 LLVM w 78.05 2.84 100 0.65 h 9.02 1.76 39.22 0.07 Python d 17.09 2.11 84.96 1.15 KDELibs w 45.71 1.99 100 0.94 h 31.91 0.89 50.15 0.41 OSG d 54.96 0.99 91.71 0.33 KOffice w 88.17 1.37 98.04 0.52 h 32.87 0.65 30.07 0.81 Xapian d 47.33 1.45 72.23 0.97 Subversion w 65.00 1.00 98.46 1.29 h 24.24 1.68 12.23 0.49 httpd d 40.14 2.60 43.29 1.60 Ruby w 78.72 2.38 96.50 1.85 h 19.91 13.02 65.09 0.17 OpenMPI d 35.73 0.48 94.62 0.29 Chrome w 70.71 1.54 100 0.29 h 30.44 0.09 38 25 Tr = c GCC d 82.61 2.97 1 14 Te = hdw w 100 0.44


Figure 34. Improvements range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = c and Te = hdw.

The k = i+0 prediction is improved in 67% of the cases, and the k = i+5 prediction in 33%. Figure 33 shows the minimum and maximum improvement ratios from this cross prediction for precision, recall, and F-measure values. For example, GCC has improvement ratios between 0% and 66%; the 66% comes from subtracting the k = i+5 precision of 7.91 (Table 32), with commit as training and test, from the k = i+5 precision of 74.89, with week as training and commit as test.

We now examine the reverse case and use the commit time window size to predict current and future changes for the hour, day, and week time windows. Table 35 presents this data. The commit time window size was able to improve the predictability by 15% for k = i+0 and 31% for k = i+5 relative to the other time windows (week, day, and hour). Figure 34 shows the improvement range of precision, recall, and F-measure values for this cross prediction.

Each system responds differently to cross prediction, and no generalization about the best cross prediction can be made here. However, we are able to say that time window cross prediction does produce better results (Q2). It appears that the base time window (commit) contributes little to the prediction in a cross prediction setting.

7.4.3 Combining Time Windows

The novelty of the idea behind combining the evolutionary coupling sets of different time windows is that an intersection will uncover consistent patterns that are frequent over multiple time windows; such an approach is able to generate patterns with higher accuracy. A union will bring in the interesting patterns that a single time window did not detect by itself; however, this greatly reduces accuracy. To overcome this problem we filter out patterns that generate association rules with low confidence.

To do this, we assess all possible combinations. For each system we have four pattern sets from the four time windows. We form all possible combinations involving the commit window (7): commit × hour, commit × day, …, commit × hour × day × week. Each has two cases, union and intersection, which results in 14 combinations. This is done for the thirteen open source systems. We present the precision, recall, and F-measure values in detail for a single system and, due to space limitations, summarize the remainder. For the patterns in Table 31, we experiment with all the possible combinations for each open source system over pattern sets generated from different time windows.


Table 36. Prediction Precisions, Recalls, and F-measures for the OSG Open Source System. Precision (P), Recall (R), F-Measure (Fm), Training Set Time Window (Tr), Test Set Time Window, where commit is the Time Window (Te = c).

OpenSceneGraph (OSG)
                                  k = i+0                     k = i+5
               Tr, Te = c         P%      R%      Fm%         P%      R%      Fm%
               c                  40.43   5.65    9.92        10.08   0.08    0.16
Unions         c h                44.58   14.29   21.64       19.31   0.12    0.24
               c d                34.00   15.08   20.89       29.82   0.22    0.44
               c w                22.67   11.61   15.35       25.48   0.24    0.48
               c h d              33.05   15.58   21.17       24.30   0.18    0.36
               c h w              27.92   15.18   19.67       22.88   0.17    0.34
               c d w              28.26   14.19   18.89       25.96   0.21    0.42
               c h d w            29.90   14.98   19.96       23.92   0.19    0.37
Intersections  c h                45.74   5.85    10.38       15.11   0.11    0.22
               c d                37.50   5.95    10.27       17.70   0.13    0.27
               c w                29.12   5.26    8.91        16.73   0.14    0.28
               c h d              33.53   5.75    9.82        21.94   0.17    0.33
               c h w              32.92   5.26    9.07        19.62   0.16    0.31
               c d w              30.91   5.06    8.70        14.68   0.12    0.24
               c h d w            32.08   5.06    8.74        19.00   0.16    0.31


Table 37. Prediction Improvements on Precisions, Recalls, and F-measures for the OSG Open Source System. Precision (P), Recall (R), F-Measure (Fm), Training Set Time Window (Tr), Test Set Time Window (Te), where commit is the test time window (Te = c)

OpenSceneGraph (OSG) — Improvements From Combinations
Tr (Te = c)     | k = i+0: P%, R%, Fm%   | k = i+5: P%, R%, Fm%
Unions:
c h             | 4.16, 8.63, 11.72      | 9.23, 0.04, 0.08
c d             | None, 9.42, 10.97      | 19.75, 0.14, 0.28
c w             | None, 5.95, 5.43       | 15.41, 0.16, 0.32
c h d           | None, 9.92, 11.25      | 14.22, 0.10, 0.20
c h w           | None, 9.52, 9.74       | 12.80, 0.09, 0.18
c d w           | None, 8.53, 8.97       | 15.88, 0.13, 0.26
c h d w         | None, 9.33, 10.04      | 13.84, 0.11, 0.22
Intersections:
c h             | 5.31, 0.20, 0.46       | 5.04, 0.03, 0.07
c d             | None, 0.30, 0.35       | 7.62, 0.05, 0.11
c w             | None, None, None       | 6.66, 0.06, 0.13
c h d           | None, 0.10, None       | 11.86, 0.09, 0.17
c h w           | None, None, None       | 9.55, 0.08, 0.15
c d w           | None, None, None       | 4.60, 0.04, 0.08
c h d w         | None, None, None       | 8.93, 0.08, 0.15

Clearly, with a union the generated set will be at least as large as any of its inputs; with an intersection it will be at most as large. When a pattern appears in sets from different time windows, its support value may differ, and we take the lowest. Taking the least support assures that such a frequent pattern exists in the combined set of patterns. For example, given the pattern sets for some system, where Xr, Xd, and Xw are the pattern sets generated for the commit, day, and week time windows, and the pairs are patterns and their support counts:

Pi = Set( filei1, filei2, filei3, … ), where Pi is a pattern

Xr = Set( (P9, 23), (P10, 44), (P11, 10) )
Xd = Set( (P1, 25), (P2, 12), (P3, 9) )
Xw = Set( (P10, 30), (P11, 7), (P12, 10) )

A union and an intersection are:

Xr ∪ Xd ∪ Xw = Set( (P1, 25), (P2, 12), (P3, 9), (P9, 23), (P10, 30), (P11, 7), (P12, 10) )
Xr ∩ Xd ∩ Xw = Set( (P10, 30), (P11, 7) )
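
The combination step itself can be sketched as follows. This is a minimal illustration (not the ΔShopper/srcMiner implementation), assuming each pattern set is stored as a dictionary mapping a pattern identifier to its support count; the keys here (P1, P9, ...) are just the labels from the example above, while in practice they would be frozensets of co-changed file paths.

```python
def combine(pattern_sets, mode='union'):
    """Combine pattern sets from different time windows.

    pattern_sets: list of dicts mapping a pattern to its support count.
    When a pattern occurs in more than one set, the lowest support is kept.
    """
    if mode == 'union':
        keys = set().union(*pattern_sets)                         # in any window
    else:
        keys = set.intersection(*(set(s) for s in pattern_sets))  # in all windows
    return {p: min(s[p] for s in pattern_sets if p in s) for p in keys}

# Toy data mirroring the example above.
Xr = {'P9': 23, 'P10': 44, 'P11': 10}
Xd = {'P1': 25, 'P2': 12, 'P3': 9}
Xw = {'P10': 30, 'P11': 7, 'P12': 10}

print(combine([Xr, Xd, Xw], 'union'))      # P10 -> 30 and P11 -> 7, plus the rest
print(combine([Xr, Xw], 'intersection'))   # {'P10': 30, 'P11': 7}
```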

Since commit is the traditional time window (or 200 seconds on CVS), we experiment with combinations of the four pattern sets (Tr = rhdw) while the test set is the commit time window (Te = c). We believe that such combinations will enhance change impact predictions. Table 36 and Table 37 show the results for one open source system only; we randomly picked OSG to show the detailed picture as a case study.

We use the evolutionary couplings generated from the commit, hour, day, and week time windows to predict changes in the subsequent commits. Under the combinations column, we have all the variations for unions and for intersections. For k = i+0 and k = i+5 in the non-shaded area, the precision, recall, and F-measure values are computed. For k = i+0 and k = i+5 in the shaded area, each cell shows the enhancement, that is, the difference between the combination's precision, recall, or F-measure value and the corresponding value from the c row (bold font). For example, the intersection of c and h achieves 45.74% precision; the corresponding value under the c row is 40.43%, so the enhancement is 5.31%. Note that cells in the shaded area with the value None mean the value is no better than the c prediction values; some are even worse. Note the bold cells in the shaded area: for k = i+0 the maximum enhancement is 11.72%, and for k = i+5 it is 19.75%.


Table 38. Highest (MAX) and Lowest (MIN) Improved percentages for different time window pattern sets conjectures over the thirteen open sources, where Te = c

OSS        | k = i+0 MIN% | k = i+0 MAX% | k = i+5 MIN% | k = i+5 MAX%
QuantLib   | 1.09         | 12.60        | 0.00         | 0.00
Python     | 0.17         | 4.55         | 0.01         | 14.08
OSG        | 0.10         | 11.72        | 0.03         | 19.75
Xapian     | 0.90         | 11.75        | 0.07         | 31.15
httpd      | 1.50         | 4.50         | 0.03         | 3.31
OpenMPI    | 0.03         | 2.45         | 2.04         | 5.99
GCC        | 0.79         | 0.79         | 0.002        | 59.32
LLVM       | 0.71         | 1.72         | 0.02         | 50.00
KDELibs    | 0.31         | 8.36         | 0.05         | 56.77
KOffice    | 0.98         | 10.05        | 0.01         | 14.68
Subversion | 0.40         | 1.70         | 0.02         | 40.23
Ruby       | 0.26         | 5.40         | 0.04         | 14.32
Chrome     | 0.33         | 17.10        | 0.002        | 19.11


Figure 35. Improvements range (MIN-MAX) of precision, recall, and F-measure values over a cross prediction where Tr = c and Te = hdw

Table 38 has the minimum and maximum percentage improvements that we obtained from the time window pattern combinations over our thirteen open source systems. From the highlighted cells, we can see that for k = i+0 the minimum improvement ratio is for OpenMPI with 0.03% and the maximum is for Chrome with 17.10%. For k = i+5, the minimum improvement ratio is for QuantLib with 0% and the maximum is for GCC with 59.32%. Such a scale of improvement is impressive, especially for the subsequent k = i+5 case, where it predicts change impact further into future changes.


7.5 Threats To Validity

Generating evolutionary couplings relies on the choice of the parameters involved. For example, files were our main artifact, which limits our study to that granularity. A study on different artifact granularities (file, class, function) would better support the generalization of any results.

The minimum support parameter has a large effect on detecting patterns of change, though we used a consistent configuration over the thirteen open source systems. Still, each system has different characteristics (architecture, design, size, growth pace, etc.), which makes it difficult to compare the results against each other.

7.6 Discussion

We presented an empirical experiment studying how different time windows impact the quality and predictability of evolutionary dependencies. We conclude by answering our initial questions:

Q1) Which time window size (i.e., single commit, hour, day, week of commits) gives the best results for predicting future changes?

We found that larger time windows have better prediction accuracies and completeness, with week-long windows giving the best results and individual commits, the worst. This is an important result for the MSR community since window size is one of the main parameters used to detect evolutionary couplings.

Note that we have not examined windows larger than a week, nor different kinds of units. Logical work units are of interest (e.g., releases or sub-releases). We believe broader studies evaluating time windows of this greater granularity would be of interest.


Q2) Can one time window size be predictive for another?

Based on our examination of cross time window prediction over different time windows, our answer is "yes". For example, in Table 32, QuantLib commit patterns were predicted with 56.62% precision on k = i + 0, while using a week-long time window the precision rose to 78.05% in Table 34. Over the thirteen systems we observed that the hour, day, and week time window sizes predict current and subsequent changes of the commit time window better than the commit time window predicts itself (k = i+0 improved by 67% and k = i+5 improved by 33%). For the reverse case, where the commit time window size predicts current and future changes for the hour, day, and week time windows, we observed improvements of 15% on k = i+0 and 31% on k = i+5. One reason behind the improvement is that each time window carries patterns undetected by the other time windows.

Q3) Can the combined use of different time window sizes improve the results?

Again the answer is "yes". After cross prediction, we studied pattern combinations and how such an approach affects the prediction measures. In Table 38, for k = i+0, the improvement ranged from 0.03% to 17.10%. For k = i+5, the improvement was higher, reaching 59.32% at its best.

One main observation across the different experiments is that future subsequent changes (k = i+5) are affected more by cross prediction and pattern combinations than the immediate changes (k = i+0). Such cases show up when predicting against the commit time window (commit is the testing set) using larger time windows.


In general, the study shows that larger time windows can more accurately predict evolutionary couplings. However, this will, to an extent, depend on the commit policies of developers. Also, due to different development practices and environments, the accuracy of our approach may vary. In the future, we will investigate further enhancements of our approach by studying additional commit parameters and looking for synergistic approaches to enhance the analysis methods.

CHAPTER 8

PREDICTION PARAMETERS ON THE DETECTION OF EVOLUTIONARY COUPLINGS

This chapter presents a study on the set of parameters that we use, and that are used in the literature, to control the detection of evolutionary couplings. We use a subset of the parameters that are known to have an effect on the outcomes of any frequent pattern mining algorithm; specifically, we study minimum support, training data length and ratio, and confidence. These variables are used to build a regression model to help us understand their effect on, and importance to, the generated patterns and association rules. We apply our procedure to eleven free and open source systems to study this phenomenon.

8.1 Introduction

The work presented here investigates a set of arbitrary parameters that have a minor to major effect on the quality of the detected patterns. Researchers and data miners in software engineering, and specifically in the mining software repositories (MSR) community, select arbitrary thresholds that help filter out evolutionary couplings [Zimmermann, Weisgerber, Diehl, Zeller 2004]. Here, we select a subset of these parameters, vary their values over short to long ranges, and then use the collected data to build a multiple regression model around these independent, manipulated variables. For the set of selected parameters (minimum support, data range, training to testing data ratio, and confidence), such a model will help us answer these two questions:

A) To what extent can the total variation of the outcomes be explained by a prediction model using the independent variables of minimum support, data range, training to testing data ratio, and confidence?

B) Which coefficient is the dominating factor in the quality of the generated rules?

We use eleven free and open source systems (FOSSs) to collect the data and build a model for each system. Such a study helps us select the right parameters for detecting patterns of change and avoid arbitrary picks. Such awareness helps reduce the number of false positives and rank correct/important patterns higher than less meaningful or false-positive patterns. This has the potential to improve the usefulness of the patterns for developers.

The remaining sections of this chapter are organized in the following manner. Section 8.2 presents the data collection approach. A discussion of evolutionary coupling, its parameters, and how we mine the change patterns using mining software repositories techniques is presented in Section 8.3. After that, we present the regression analysis methodology and the experiment setup in Section 8.4. The experiment and its results are presented in Section 8.5 and, finally, conclusions are presented in Section 8.6.


8.2 Data Collection

The implementation of the approach is achieved with two tools: ΔShopper (pronounced "delta shopper") and srcMiner (reads "source miner"). These tools target data available from the Subversion version control system.

Table 39. Characteristics of the eleven open source systems used in study including years, commits, files.

System     | Years      | Commits | Files
KDELibs    | 01-10 (10) | 54,189  | 14,748
KOffice    | 01-10 (10) | 55,651  | 21,857
Httpd      | 99-11 (13) | 11,264  | 763
Subversion | 00-11 (12) | 23,420  | 1,485
Ruby       | 00-11 (12) | 12,439  | 834
Chrome     | 08-11 (4)  | 35,650  | 16,358
OpenMPI    | 03-11 (9)  | 11,682  | 6,583
LLVM       | 01-10 (10) | 50,327  | 4,266
GCC        | 01-10 (10) | 50,145  | 26,154
Xapian     | 00-10 (11) | 4,703   | 1,302
Python     | 01-10 (10) | 13,401  | 867

The Shopper tool is responsible for extracting metadata and differences from a

Subversion repository. The program takes a unit of time () and extracts information about the modification of artifacts over the course of the history. The program identifies modifications to files and functions within each time  and generates the datasets of buckets of co-changing files and functions, our  here is a Subversion commit. To collect the files we use the archived logs for co-changing files.


Table 39 presents the open source projects we use in this work, the range of years, the number of commits, and the number of files in each system.

8.3 Patterns and Association Rules Generation

We use srcMiner to process our datasets and generate frequent patterns. We developed srcMiner based on the Eclat frequent itemset data mining algorithm [Zaki, Parthasarathy, Li 1997].
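
For illustration, the core Eclat idea (vertical tid-lists intersected depth-first) can be sketched in a few lines of Python. This is a simplified toy, not the srcMiner implementation; the file names and the minimum support value below are made up for the example.

```python
from collections import defaultdict

def eclat(transactions, min_support):
    """Toy Eclat-style miner: transactions are sets of file names; returns
    {frozenset(items): support} for every itemset meeting min_support."""
    tidlists = defaultdict(set)            # item -> ids of commits containing it
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            new_tids = (prefix_tids & tids) if prefix else tids
            if len(new_tids) >= min_support:
                itemset = prefix | {item}
                frequent[frozenset(itemset)] = len(new_tids)
                extend(itemset, new_tids, candidates[i + 1:])

    seeds = sorted(((it, tl) for it, tl in tidlists.items()
                    if len(tl) >= min_support), key=lambda x: len(x[1]))
    extend(set(), set(), seeds)
    return frequent

# toy commit history: each commit is a basket of co-changed files
commits = [{'a.cpp', 'a.h'}, {'a.cpp', 'a.h', 'b.cpp'}, {'a.cpp', 'b.cpp'}]
print(eclat(commits, min_support=2))
```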

We use the data in Table 39 to generate the patterns of change, and from each pattern we generate all possible non-redundant association rules. As in Section 2.2, we use a set of required parameters to generate the evolutionary couplings and then the association rules: time window, minimum support, artifact granularity, training set size, and an interestingness measure filter. We set an arbitrary value for some and iterate the values of the others, and build a regression model on top of this variation. The fixed ones are: the time window size is Subversion's revision (commit) and the artifact granularity is a file. The rest vary within a selected range.

The most common approach used to validate the predictability of rules is to split the data by some ratio, where the first portion becomes the training set and the rest is the test data. Then, we use precision, recall, and F-measure to assess the accuracy of the hits and the coverage. We add this ratio to the set of varied measures (independent variables). The varied parameters are support count, training data ratio, training set size (measured in years), and confidence (that is, minimum support, confidence, duration, and training size).


Trials = (Max − Min) / Unit + 1    (9)

Table 40 and Table 41 contain each parameter with its minimum value (start point) to the maximum value (end point), the unit size, and the number of trials for each parameter for each FOSS, as in Equation 9.

For example, the support count for KDELibs has Min = 11 and Max = 24, where the unit is an increase of 1 in the support count on every iteration. So the number of trials is (24 − 11) / 1 + 1 = 14. For KDELibs, the support count has 14 trials, the training data ratio has 5 trials, years has 8 trials, and confidence has 5 trials. Permuting over all possible combinations [Biggs 1979], we have 14 × 5 × 8 × 5 = 2800 different combinations. Table 42 has the final number of combinations for each FOSS, where the number of trials over KDELibs is 2799. As can be seen, the numbers in Table 42 are not exactly as just calculated using Equation 9. That is due to the cases where zero rules were generated for some parameter combinations.
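
As a small sketch (assuming the ranges from Table 40 and Table 41; the helper names are ours, not part of the tooling), the number of trials per parameter and the full parameter grid can be computed as follows.

```python
from itertools import product

def trials(lo, hi, unit):
    # Number of swept values for one parameter, as in Equation (9).
    return int(round((hi - lo) / unit)) + 1

def sweep(lo, hi, unit):
    return [round(lo + i * unit, 10) for i in range(trials(lo, hi, unit))]

# KDELibs ranges taken from Table 40 and Table 41.
support    = sweep(11, 24, 1)         # 14 trials
ratio      = sweep(0.75, 0.95, 0.05)  # 5 trials
years      = sweep(1, 8, 1)           # 8 trials
confidence = sweep(0.5, 0.9, 0.1)     # 5 trials

configs = list(product(support, ratio, years, confidence))
print(len(configs))                   # 14 * 5 * 8 * 5 = 2800 candidate configurations
```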


Table 40. Prediction Parameters (Support Count and Training Data Ratio) Min and Max Values, Unit Size, and Trials

FOSSs      | Support Count: Min, Max, Unit, Trials | Training Data Ratio: Min, Max, Unit, Trials
KDELibs    | 11, 24, 1, 14 | 0.75, 0.95, 0.05, 5
KOffice    | 14, 24, 1, 11 | 0.8, 0.95, 0.05, 5
Httpd      | 10, 24, 1, 15 | 0.8, 0.95, 0.05, 5
Subversion | 12, 24, 1, 13 | 0.8, 0.95, 0.05, 5
Ruby       | 9, 24, 1, 16  | 0.8, 0.95, 0.05, 5
Chrome     | 14, 24, 1, 11 | 0.8, 0.95, 0.05, 5
OpenMPI    | 17, 24, 1, 8  | 0.8, 0.95, 0.05, 5
LLVM       | 16, 24, 1, 9  | 0.8, 0.95, 0.05, 5
GCC        | 12, 24, 1, 13 | 0.8, 0.95, 0.05, 5
Xapian     | 5, 24, 1, 20  | 0.8, 0.95, 0.05, 5
Python     | 14, 24, 1, 11 | 0.8, 0.95, 0.05, 4


Table 41. Prediction Parameters (Years and Confidence) Min and Max Values, Unit Size, and Trials

FOSSs      | Years: Min, Max, Unit, Trials | Confidence: Min, Max, Unit, Trials
KDELibs    | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
KOffice    | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
Httpd      | 1, 8, 1, 8 | 0.8, 0.9, 0.1, 2
Subversion | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
Ruby       | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
Chrome     | 1, 3, 1, 3 | 0.5, 0.9, 0.1, 5
OpenMPI    | 1, 8, 1, 8 | 0.9, 0.9, 0.1, 1
LLVM       | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
GCC        | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
Xapian     | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5
Python     | 1, 8, 1, 8 | 0.5, 0.9, 0.1, 5

Table 42. Prediction Parameters Total Trials.

FOSSs      | Total Trials
KDELibs    | 2799
KOffice    | 2199
Httpd      | 869
Subversion | 2599
Ruby       | 2824
Chrome     | 824
OpenMPI    | 318
LLVM       | 1799
GCC        | 2599
Xapian     | 3401
Python     | 2199


8.4 Multiple Regression Model Setup

Regression analysis is a statistical process to estimate the relationship between one or more independent variables and a dependent variable. It produces an equation that predicts a dependent variable using one or more independent variables. This equation has the form of:

Y = Intercept + b1·X1 + b2·X2 + … + bn·Xn    (10)

where Y is the dependent variable, the predicted variable, and X1, X2, etc. are the independent variables. The model looks for b1, b2, etc., the coefficients that describe the size of the effect the independent variables have on the dependent variable Y. The intercept is the constant value predicted for Y when all the independent variables are zero. A prediction equation is useful if the independent variables have some kind of correlation with the dependent variable. The coefficients that are produced tell us the strength of the relationship between the independent variables and the dependent variable.

When running a regression analysis, it looks for coefficients for the independent variables that help us reject the claim that their effect is no different from zero (the null hypothesis), that is, that the independent variables have an actual effect on the dependent variable. The null hypothesis is always that each independent variable has absolutely no effect (has a coefficient of zero), and we are looking for a reason to reject this claim.

The p-value is the probability of seeing a result in a collection of random data in which the variable had no effect. A probability of 5% or less is the generally accepted point at which to reject the null hypothesis. With a p-value of 5% (or .05) there is only a 5% chance that the observed result would have come up in a random distribution, so one can say with 95% probability of being correct that the variable has some effect. The size of the p-value for a coefficient says nothing about the size of the effect that variable has on the dependent variable.

Regarding the coefficients, in a multiple linear regression the size of the coefficient for each independent variable gives the size of the effect that variable has on the dependent variable, and the sign of the coefficient (positive or negative) gives the direction of the effect. The coefficient tells us how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant.

The R-squared of the regression is the fraction of the variation in the dependent variable that is accounted for (or predicted by) the set of independent variables. The p-value tells us how confident we can be that each individual variable has some correlation with the dependent variable, which is the important thing. Significance F is a p-value for the regression as a whole; the independent variables may be individually correlated but insignificant while the regression as a whole is significant [Armstrong 2011].
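
As an illustration of this setup (a minimal sketch under assumed data, not the exact analysis tool used in this chapter), such a multiple regression can be fit with an ordinary least squares package; the synthetic values below stand in for the per-configuration outcomes.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data: one row per parameter configuration, columns are the
# independent variables (support count, training data ratio, years, confidence),
# and y is the observed precision for that configuration.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 0.6 * X[:, 3] + 0.25 * X[:, 1] + 0.05 * rng.random(200)

X = sm.add_constant(X)              # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.rsquared)               # R-squared
print(model.rsquared_adj)           # adjusted R-squared
print(model.f_pvalue)               # Significance F for the regression as a whole
print(model.params)                 # intercept and coefficients b1..b4
print(model.pvalues)                # per-coefficient p-values
```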

8.5 Experiment and Results

From Table 40 and Table 41, we use all possible combinations of the set of parameters and run srcMiner for each. For KDELibs, the first combination is (Support Count = 11, Training Data Ratio = 0.75, Years = 1, Confidence = 0.5), which is the start point in the range of each parameter. The next one is (Support Count = 11, Training Data Ratio = 0.75, Years = 1, Confidence = 0.6), and so on. For each combination, our tool generates the evolutionary couplings and then, from each coupling, all possible irredundant rules that satisfy the parameters.

From Table 39, using the commits as our data mining baskets, or transactions, with a minimum support, we generate the patterns. Then we generate all possible association rules by constructing the combinations obtained by halving each pattern into two subsets. Let A and B be disjoint subsets of a pattern I such that A ∪ B = I. We generate only the association rules A → B where |A| ≥ |B|, without duplications. Then we use the confidence measure (Equation 2) to further filter the rules, so we keep only the rules where confidence(A → B) ≥ the minimum confidence.
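
A minimal sketch of this rule generation (ours, for illustration; the confidence is computed directly from the transactions, and the min_conf threshold and toy file names are assumptions) is:

```python
from itertools import combinations

def rules_from_pattern(pattern, pattern_support, transactions, min_conf):
    """Split a frequent pattern I into disjoint antecedent A and consequent B
    (A -> B with |A| >= |B|) and keep rules with sufficient confidence,
    where confidence(A -> B) = support(A U B) / support(A)."""
    items = sorted(pattern)
    rules = []
    for size_a in range((len(items) + 1) // 2, len(items)):
        for A in map(frozenset, combinations(items, size_a)):
            B = frozenset(items) - A
            support_a = sum(1 for t in transactions if A <= t)
            confidence = pattern_support / support_a if support_a else 0.0
            if confidence >= min_conf:
                rules.append((A, B, confidence))
    return rules

# toy commit history and one frequent pattern with support 2
commits = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}]
for A, B, conf in rules_from_pattern({'a', 'b', 'c'}, 2, commits, min_conf=0.5):
    print(set(A), '->', set(B), round(conf, 2))
```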

Precision, recall, and F-measure are then computed as in Equations (6), (7), and (8). Precision (Equation 6) is the ratio of the number of A and B hits over the number of A hits. Recall (Equation 7) is the number of files predicted over all files that occur within the training set. There may be files in the test set that do not occur in the training set because they were added afterwards; we cannot predict a change to a file that did not appear in the training set. Equation 8 is the F-measure (also F-score), which is an accuracy measure. The F-measure can be interpreted as the weighted average, or harmonic mean, of precision and recall. All these statistical measures range between [0, 1], where 0 is the worst and 1 (100%) is the best. Regarding precision for k = i+5, we compute the precision on i+1, i+2, i+3, i+4, and i+5, and then divide by 5 to get the average precision. We use average precision for the k = i+5 subsequent changes (commits).
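
The following is a minimal sketch of these measures using the standard set-based definitions of precision, recall, and the harmonic-mean F-measure, together with the averaging of precision over the next five commits; the toy file names are ours.

```python
def precision_recall_f(predicted, actual):
    """predicted and actual are sets of files; returns (precision, recall, F)."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def average_precision_k5(predicted, next_commits):
    """Average the precision over the next five commits (k = i+1 .. i+5)."""
    ps = [precision_recall_f(predicted, commit)[0] for commit in next_commits[:5]]
    return sum(ps) / 5

predicted = {'a.cpp', 'a.h'}
future = [{'a.cpp'}, {'a.h', 'b.cpp'}, {'c.cpp'}, {'a.cpp', 'a.h'}, {'b.cpp'}]
print(average_precision_k5(predicted, future))   # 0.4 for this toy example
```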


For the generated rules we compute the hits and the misses on k = i + 5. We then compute precision, recall, and F-measure as in equations (6), (7), and (8), respectively.

Precision, recall, and F-measure are our dependent variables. As in Section 8.4, for each system we run a regression analysis [Freedman 2009] to find the best linear equation that fits the outcomes, making it possible to predict a dependent variable using the set of independent variables.

Table 43. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for KOffice

KOffice             | Precision      | Recall         | F-Measure
R Square            | 1.00           | 0.84           | 0.87
Adjusted R Square   | 1.00           | 0.84           | 0.86
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 6E-270   | 0.0, 2E-99     | 0.00, 8E-109
Training Data Ratio | 0.26, 1E-144   | 0.04, 3E-117   | 0.07, 5E-128
Years               | -0.02, 0       | 0.01, 0        | 0.01, 0
Confidence          | 0.63, 0        | -0.03, 2E-87   | -0.05, 4E-94
Intercept           | 0.00, #N/A     | 0.00, #N/A     | 0.00, #N/A

Table 43 shows the result of running the regression analysis tool on the 2799 outcomes for the dependent variables of precision, recall, and F-measure. As in Section 8.4, we interpret the table as follows. For precision, we have a 100% R-Square value, which means these four independent variables are able to explain 100% of the variation in the dependent variable precision, 84% of recall, and 85% of F-measure. The Significance F is 0, which is the p-value for the whole model; it is well below the predetermined significance levels of 0.05 and 0.01, which correspond to 95% and 99% confidence in the results, respectively. So we can say the model is statistically significant.

The independent variable coefficients follow as well; looking at the p-values of each, all yield statistically significant results. We can say that the null hypothesis, that there is no relationship between the independent variables and the dependent variable, is rejected. The coefficient values are of great interest; notice that under precision the coefficient value of Confidence is 0.63, the highest among the variables. Such a value means that each unit of change in that parameter has an effect of 0.63 on the outcome, the precision. From this observation on precision, and only for KOffice and for the collected data with the set of arbitrary ranges and the selected values for the non-variant parameters, confidence is the most important factor affecting the quality of the generated rules. Next is training data ratio with 0.26, then Support Count with 0.01, and Years with an impact of -0.02. The Years coefficient has a negative impact of -0.02 for every unit increase in years, meaning that by decreasing the number of training years we would be able to increase precision. Notice that we highlight the highest coefficient value under each dependent variable to isolate its importance. The intercept is 0.

Our model for precision for KOffice, as in Equation (10), would be:

Precision = 0 + 0.01·SupportCount + 0.26·TrainingDataRatio − 0.02·Years + 0.63·Confidence    (11)


Table 44. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for KDELibs

KDELibs             | Precision      | Recall         | F-Measure
R Square            | 1.00           | 0.84           | 0.85
Adjusted R Square   | 1.00           | 0.83           | 0.85
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 4E-186   | 0.0, 3E-233    | 0.00, 7E-243
Training Data Ratio | 0.65, 0        | 0.03, 2E-260   | 0.05, 2E-272
Years               | 0.003, 0.8     | 0.00, 0        | 0.01, 0
Confidence          | 0.33, 0        | -0.02, 5E-132  | -0.03, 5E-138
Intercept           | 0.00, #N/A     | 0.00, #N/A     | 0.00, #N/A

Table 45. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Httpd

Httpd               | Precision      | Recall         | F-Measure
R Square            | 0.99           | 0.91           | 0.91
Adjusted R Square   | 0.99           | 0.91           | 0.91
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 2E-52    | 0.0, 3E-164    | 0.00, 9E-165
Training Data Ratio | -0.06, 0.06    | 0.05, 4E-118   | 0.10, 2E-118
Years               | 0.01, 1E-10    | 0.00, 3E-81    | 0.00, 3E-81
Confidence          | 0.90, 4E-131   | -0.02, 1E-28   | -0.04, 3E-28
Intercept           | 0.00, #N/A     | 0.00, #N/A     | 0.00, #N/A


Table 46. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Subversion

Subversion          | Precision      | Recall         | F-Measure
R Square            | 0.61           | 0.91           | 0.93
Adjusted R Square   | 0.60           | 0.91           | 0.93
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.00, 0.006    | 0.0, 0         | -0.01, 0
Training Data Ratio | 0.27, 7E-70    | -0.01, 0.0007  | -0.01, 0.015
Years               | -0.01, 3E-99   | 0.02, 0        | 0.03, 0
Confidence          | 0.41, 0        | -0.08, 0       | -0.13, 0
Intercept           | 0.40, 2E-149   | 0.14, 5E-241   | 0.23, 2E-288

Table 47. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Ruby

Ruby                | Precision      | Recall         | F-Measure
R Square            | 0.55           | 0.76           | 0.79
Adjusted R Square   | 0.55           | 0.76           | 0.79
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.02, 1E-201   | -0.01, 1E-305  | -0.01, 0
Training Data Ratio | -0.22, 8E-09   | -0.22, 8E-62   | -0.33, 1E-89
Years               | -0.06, 0       | 0.04, 0        | 0.05, 0
Confidence          | 0.53, 7E-145   | -0.19, 3E-164  | -0.23, 7E-165
Intercept           | 0.50, 2E-38    | 0.37, 5E-157   | 0.50, 6E-198


Table 48. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Chrome

Chrome              | Precision      | Recall         | F-Measure
R Square            | 0.68           | 0.73           | 0.74
Adjusted R Square   | 0.67           | 0.73           | 0.74
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.00, 7E-06    | 0.0, 7E-07     | -0.01, 3E-05
Training Data Ratio | -0.58, 8E-101  | 0.2, 0.02      | -0.13, 0.08
Years               | 0.01, 7E-08    | 0.36, 4E-236   | 0.31, 2E-241
Confidence          | 0.38, 5E-149   | -0.2, 0.0004   | 0.01, 0.84
Intercept           | 0.99, 1E-199   | 0.01, 0.87     | 0.24, 0.002

Table 49. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for OpenMPI

OpenMPI             | Precision      | Recall         | F-Measure
R Square            | 1.00           | 0.80           | 0.81
Adjusted R Square   | 0.99           | 0.79           | 0.81
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.00, 0.06     | 0.0, 7E-06     | -0.01, 9E-05
Training Data Ratio | 0.23, 1E-07    | 0.04, 0.42     | 0.07, 0.39
Years               | -0.01, 2E-24   | 0.04, 2E-79    | 0.07, 2E-83
Confidence          | 0.95, 3E-52    | 0.02, 0.76     | -0.018, 0.85
Intercept           | 0.00, #N/A     | 0.00, #N/A     | 0.00, #N/A


Table 50. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Python

Python              | Precision      | Recall         | F-Measure
R Square            | 0.39           | 0.51           | 0.48
Adjusted R Square   | 0.39           | 0.51           | 0.48
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | -0.10, 3E-99   | 0.00, 3E-148   | 0.00, 4E-138
Training Data Ratio | -1.56, 1E-13   | -0.01, 2E-23   | -0.01, 9E-32
Years               | 0.13, 2E-78    | 0.00, 2E-172   | 0.00, 1E-139
Confidence          | -2.24, 1E-92   | -0.01, 8E-99   | -0.01, 2E-82
Intercept           | 4.00, 3E-72    | 0.01, 2E-130   | 0.03, 4E-135

Table 51. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for LLVM

LLVM                | Precision      | Recall         | F-Measure
R Square            | 0.93           | 0.87           | 0.91
Adjusted R Square   | 0.93           | 0.87           | 0.91
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 2E-15    | 0.0, 8E-40     | 0.00, 1E-30
Training Data Ratio | 0.13, 1E-4     | 0.20, 1E-78    | 0.21, 7E-79
Years               | -0.06, 4E-192  | 0.0, 0         | 0.03, 0
Confidence          | 0.75, 2E-151   | -0.1, 7E-60    | -0.09, 8E-29
Intercept           | 0.00, #N/A     | 0.00, #N/A     | 0.00, #N/A


Table 52. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for GCC

GCC                 | Precision      | Recall         | F-Measure
R Square            | 0.81           | 0.72           | 0.76
Adjusted R Square   | 0.81           | 0.72           | 0.76
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 6E-177   | 0.0, 6E-98     | 0.00, 1E-105
Training Data Ratio | -0.68, 2E-238  | 0.0, 0.47      | 0.00, 0.99
Years               | -0.02, 6E-160  | 0.0, 0         | 0.02, 0
Confidence          | 0.82, 0        | -0.03, 3E-54   | -0.05, 3E-54
Intercept           | 0.54, 7E-167   | 0.03, 3E-13    | 0.05, 9E-15

Table 53. Regression analysis results (R Square, Adjusted R Square, and Significance F) and the magnitude and p-value of the coefficients (Support Count, Training Data Ratio, Years, Confidence, and Intercept) for the dependent variables Precision, Recall, and F-measure, for Xapian

Xapian              | Precision      | Recall         | F-Measure
R Square            | 0.47           | 0.69           | 0.73
Adjusted R Square   | 0.47           | 0.69           | 0.73
Significance F      | 0.00           | 0.00           | 0.00
Coefficients        | Value, P-value | Value, P-value | Value, P-value
Support Count       | 0.01, 5E-264   | 0.0, 0         | -0.01, 0
Training Data Ratio | 0.45, 7E-58    | -0.11, 5E-145  | -0.19, 9E-175
Years               | -0.03, 6E-148  | 0.01, 0        | 0.01, 0
Confidence          | 0.50, 4E-236   | -0.06, 8E-161  | -0.10, 1E-182
Intercept           | 0.08, 0.003    | 0.17, 0        | 0.29, 0


Figure 36. Precision coverage over all systems.

Table 44 through Table 53 give the regression analysis results for the rest of the systems, and the same interpretations follow for each model. Table 44 gives the model values for KDELibs; we notice that under precision, for the coefficient Years, we have a p-value of 0.8. Because this is far higher than the predetermined significance level of 0.05 (95% confidence in the results), we cannot reject the null hypothesis, and that coefficient value is insignificant. The whole model p-value (Significance F) is 0, which means the model is significant, but we cannot trust and use the Years coefficient in our model. If the p-value falls between 0.01 and 0.05, we say that we have only 95% confidence in our coefficient.


In Table 44 through Table 53, the highlighted numbers mark the highest factor in each model. To give an overall understanding and allow a comparison of the dominating factors over all models for precision, recall, and F-measure, Figure 36 presents this dominance visually using colors. Each circle represents a system, and each circle has four colors, each representing the coefficient of one of the independent variables. We can see that for precision, over all systems, confidence is the dominant factor, followed by training data ratio, then years, and, lastly, support count. Figure 37 and Figure 38 show recall and F-measure. Confidence, with less dominance, is the main factor and coefficient for recall and F-measure as well.

Figure 37. Recall coverage over all systems.


Figure 38. F-Measure coverage over all systems.

8.6 Summary

In this study we empirically examined the effect of the mining software repositories initial parameters of minimum support, data range, training to testing data ratio, and confidence. We built regression models for eleven systems, for each of the dependent variables of precision, recall, and F-measure. These measures are commonly used to assess the accuracy and coverage of detecting patterns of change from co-changing artifacts in software maintenance, via mining historical repositories. We presented an approach that uses data mining techniques to mine the repositories of eleven large-scale open source systems, and then builds multiple regression models around that data to study the effect of the initial parameters on the final outcome, the generated association rules. Now we can answer our questions from the introduction, Section 8.1.

A) To what extent can the total variation of the outcomes be explained by a prediction model using the independent variables of minimum support, data range, training to testing data ratio, and confidence?

Figure 39. Extent of the representation from the total variation of model analysis outcomes over all systems


Figure 40. Distribution of the extent of the representation for the models over precision, recall, and F-measure for all systems

Figure 39 and Figure 40 clearly answer the question. Figure 39 shows the R-square values over all systems for the dependent variables of precision, recall, and F-measure. Figure 40 is a more zoomed-out picture that shows the final distribution of how much of the total variation in the model analysis outcomes is represented. We can see that more than 90% of the models (27% + 64% = 91%) have a representation of 50% or above, while 64% of the models hold a representation of 75% or above.

B) Which coefficient is the dominating factor in the quality of the generated rules?

From Figure 36, Figure 37, and Figure 38 we can visually see the domination of confidence as the main factor among the coefficients. After confidence, we have training data ratio, then years, and, lastly, support count.


Our future work would be to repeat the study on wider ranges of parameters to include artifact granularity (sub-system, model, class, file, function, and line of code) and the time window, which is the transaction size. Also, we plan to widen the ranges for each parameter and add more open source systems for better representation.

CHAPTER 9

CONCLUSIONS AND FUTURE WORK

This dissertation focused on investigating the problem of uncovering evolutionary couplings in large software systems. New methods, using data mining approaches, to uncover evolutionary couplings were presented that reduce the number of false positives. The novelty of this work is the synergistic use of statically derived information in combination with data mining techniques. Moreover, an empirical approach was used to better understand the foundations of applying data mining techniques to software repositories in the context of uncovering evolutionary dependencies. While the field of mining software repositories (MSR) is relatively mature at this point (10+ years old), many basic assumptions have not been deeply studied. The empirical work done here has broad impacts on the field of MSR beyond the problem of uncovering evolutionary couplings. The contributions and results are summarized below.

9.1 Synergic Approaches

Change-based evolutionary coupling (Chapter 3) is a novel approach that gives more weight and importance to evolutionary couplings with consistent types of changes (i.e., line, function, hunk, and churn metrics) among co-changing files. Under three evaluation methods (prediction analysis, interestingness measures, and manual validation), our change-based evolutionary coupling approach produces fewer false positives and higher quality patterns than the traditional approach to computing evolutionary coupling. We observed high precision values, some above 90%, while recall values are generally low. The low recall values are of interest. The premise here is that it is better to uncover correct patterns and miss some, rather than uncover all patterns but include many false positives. Therefore, we chose to give more weight to the accuracy (precision) of the uncovered patterns rather than to the completeness (recall).

In Chapter 4, we present KouplerVis2, an interactive tool that uses a SeeSoft display metaphor. It uses the improved change-based evolutionary coupling techniques from Chapter 3 to detect high quality frequent patterns. The results are displayed to highlight intensity levels as an indication of change activity.

A large-scale investigation (Chapter 5) of the distribution of, and correlation between, code, change, and collaboration metrics was undertaken. We can summarize our results in two points. First, code metrics follow a double Pareto distribution, while both change and collaboration metrics follow a Pareto distribution. Earlier research [Herraiz Tabernero, German, Hassan 2011] reported similar results, but not on this variety of metrics and at such a scale. Second, metrics of the same category are highly correlated, while there is only a weak correlation across categories. For example, LOC and CC from the code metrics are very strongly correlated (90%). This is an important result and implies that we should use a variety of metric categories rather than focusing on multiple metrics of the same category. This is contrary to the dominant approach in the literature on fault prediction and effort estimation.

By collecting historical and structural information on pattern age and the distance between pattern items, we were able to experiment with a filtration approach to reduce false positives.


We validated the effect of rankings based on these measures using interestingness measures taken from the data mining literature. We found that distance is not a good measure to rank or filter out false evolutionary couplings. However, using age we observed that newly emerging patterns have higher accuracy and coverage in prediction propagation. The details of this are presented in Chapter 6.

9.2 The Analysis of Data Mining Parameters

In Chapter 7, we undertook the first systematic study of the effect of varying time window size on the detection of evolutionary coupling. We empirically studied how different time windows impact the quality and predictability of evolutionary dependencies. It was found that larger time windows have better prediction accuracy and completeness, with week-long windows giving the best results and individual commits being the worst. This is contrary to what is used in most studies. We further validated the idea of cross time window prediction over different time windows, and over the thirteen systems we observed that the hour, day, and week time window sizes predict current and subsequent changes of the commit time window better than the commit time window predicts itself, with F-measure improvements of 33% to 67%. Finally, we combined (via union and intersection) the patterns produced by different time window sizes and obtained higher F-measures, with improvements from 17.10% to 59.32%.

In Chapter 8, we empirically examined the effect of the mining software repositories initial parameters of minimum support, data range, training to testing data ratio, and confidence on the final generated patterns. Using multiple regression model analysis on eleven systems, we used the initial parameters as our independent variables and the quality measures of precision, recall, and F-measure as the dependent variables. Our main results, on the eleven open source systems, are that 91% of the regression models were able to represent 50% or more of the outcome data (dependent variables), while 64% of the models have a representation of 75% or more. It was found that the confidence parameter (independent variable) is the most dominant and effective parameter among the coefficients with respect to the final outcome results (precision, recall, and F-measure). After confidence, we rank training data ratio, then years, and, lastly, support count.

9.3 Future Work

Call-graph and conceptual dependencies were combined with evolutionary couplings to reduce false positives in the work of [Kagdi, Collard, Maletic 2007a] [Kagdi, Gethers, Poshyvanyk, Collard 2010]. We used metrics (Chapter 3), structural distance (Chapter 6), and historical meta-data (Chapter 6). Other lightweight approaches could be used that we feel would improve pattern detection. To mention a few: like age in Chapter 6, we could use authorship. We could also use cloning genealogies [Kim, Sazawal, Notkin, Murphy 2005] to trace dependencies and use them to enhance the pattern detection process. Such approaches could enhance the detection but also uncover different types of patterns. A dependency that is frequent and also part of a call graph is different from a dependency that is frequent and conceptually related. Each would serve a different problem inside the code.

One of the most important parameters in MSR is artifact granularity (line of code, function, file, class, model, or subsystem). No studies have yet compared which granularity yields the highest accuracy and coverage. Another open problem is detecting patterns with mixed granularities, where a pattern is a mix of classes, functions, and lines of code.

We have compared the quality of patterns generated using different time window sizes (a single Subversion commit, an hour, a day, and a week of commits), as in Chapter 7, using predictability validation. We need to repeat the study and validate our results using different approaches. Our results support wider time windows, whereas most work in the MSR community supports the use of smaller time windows (a Subversion revision or a 200-second sliding window [Zimmermann, Weisgerber, Diehl, Zeller 2004; Zimmermann, Weissgerber 2004]). Also, our results contradict other studies such as [Vanya, Premraj, Vliet 2011].

In Chapter 8, we used regression analysis to study the effect of the initial parameters on the quality of the generated association rules used in frequent pattern mining. Our future work would be to repeat the study on wider ranges for each parameter and include the missing ones, namely artifact granularity (sub-system, model, class, file, function, and line of code) and time window size. Also, we plan to widen the range of the collected data and add more open source systems. Such a study would help the MSR community pick its parameters more carefully.

Rather than using a time window size to slice and cluster history into co-changing groups of items, we need to investigate the logical unit of change. We need to find approaches that infer which code changes map together as addressing the same modification task. Given a history of commits, the technique needs to group together changes that appear to be related to the same modification task. Commits can be composed of changes in support of multiple modification tasks. A single modification task (e.g., a modification request) is typically implemented over a period of time and over multiple commits. The commits leading to a single modification task are typically not contiguous, and there are often commits dealing with unrelated modification tasks interlaced among them. The interest is to tease apart the related changes from the unrelated, and to produce a logical unit of change based on the inferred modification task. Why is this important? It is a known problem that using a single commit as a unit of change is flawed in the context of a single modification task. Current techniques use bug tracking ids or analysis of commit messages to approximate a logical unit. These techniques have not proven to be very accurate.

REFERENCES

[Abbas 2010] Abbas, N., (2010), "Using Factor Analysis To Generate Clusters Of Agile

Practices (A Guide For Agile Process Improvement)", In Proceedings Of Agile

Conference, Pp. 11-20.

[Agrawal, Imieliński, Swami 1993] Agrawal, R., Imieliński, T., And Swami, A., (1993),

"Mining Association Rules Between Sets Of Items In Large Databases", In

Proceedings Of The 1993 Acm Sigmod International Conference On Management

Of Data. Washington, D.C., United States: Acm.

[Agrawal, Srikant 1994] Agrawal, R. And Srikant, R., (1994), "Fast Algorithms For

Mining Association Rules In Large Databases", In Proceedings Of The 20th

International Conference On Very Large Data Bases: Morgan Kaufmann

Publishers Inc.


[Alali, Bartman, Newman, Maletic 2013] Alali, A., Bartman, B., Newman, C. D., And

Maletic, J. I., (2013), "A Preliminary Investigation Of Using Age And Distance

Measures In The Detection Of Evolutionary Couplings", In Proceedings Of

Mining Software Repositories (Msr), 2013 10th Ieee Working Conference On, Pp.

169-172.

[Alali, Bartman, Newman, Maletic 2015] Alali, A., Bartman, B., Newman, C. D., And

Maletic, J. I., (2015), "Prediction Parameters On The Detection Of Evolutionary

Couplings", In Proceedings Of International Conference On Software

Engineering.

[Alali, Kagdi, Maletic 2008] Alali, A., Kagdi, H., And Maletic, J. I., (2008), "What's A

Typical Commit? A Characterization Of Open Source Software Repositories", In

Proceedings Of 6th International Conference On Program Comprehension,

Amsterdam, The Netherlands, June 10-13, Pp. 182-191.

[Alali, Maletic 2014] Alali, A. And Maletic, J. I., (2014), "Distribution And Correlation

Of Code, Change And Collaboration Metrics", The Journal Of Empirical

Software Engineering, Vol. To Be Submitted.


[Alali, Maletic 2015] Alali, A. And Maletic, J. I., (2015), "Change Patterns Interactive

Tool And Visualizer", In Proceedings Of Working Conference On Software

Visualization.

[Alali, Sutton, Maletic 2014a] Alali, A., Sutton, A., And Maletic, J. I., (2014a), "Using

Change Measures To Improve The Detection Of Evolutionary Couplings", The

Journal Of Software Maintenance And Evolution, Vol. To Be Submitted.

[Alali, Sutton, Maletic 2014b] Alali, A., Sutton, A., And Maletic, J. I., (2014b), "Which

Time Window Size Is Best For Evolutionary Couplings?", The Journal Of

Software Maintenance And Evolution, Vol. To Be Submitted.

[Antoniol, Canfora, Casazza, De Lucia 2000] Antoniol, G., Canfora, G., Casazza, G.,

And De Lucia, A., (2000), "Identifying The Starting Impact Set Of A

Maintenance Request: A Case Study", In Proceedings Of The Conference On

Software Maintenance And Reengineering: Ieee Computer Society, Pp. 227.

[Arafat, Riehle 2009] Arafat, O. And Riehle, D., (2009), "The Commit Size Distribution

Of Open Source Software", In Proceedings, Pp. 1-8.


[Arisholm, Briand 2006] Arisholm, E. And Briand, L. C., (2006), "Predicting Fault-

Prone Components In A Java Legacy System", In Proceedings Of The 2006

Acm/Ieee International Symposium On Empirical Software Engineering. Rio De

Janeiro, Brazil: Acm, Pp. 8-17.

[Arisholm, Briand, Fuglerud 2007] Arisholm, E., Briand, L. C., And Fuglerud, M.,

(2007), "Data Mining Techniques For Building Fault-Proneness Models In

Telecom Java Software", In Proceedings Of The The 18th Ieee International

Symposium On Software Reliability: Ieee Computer Society, Pp. 215-224.

[Arisholm, Briand, Johannessen 2010] Arisholm, E., Briand, L. C., And Johannessen, E.

B., (2010), "A Systematic And Comprehensive Investigation Of Methods To

Build And Evaluate Fault Prediction Models", Journal Of Systems And Software,

Vol. 83, No. 1, Pp. 2-17.

[Armstrong 2011] Armstrong, J. S., (2011), "Illusions In Regression Analysis".

[Arnold, Bohner 1996] Arnold, R. And Bohner, S.,(1996),Software Change Impact

Analysis, Los Alamitos, Ca, Ieee Computer Society.


[Bacchelli, D’ambros, Lanza 2010] Bacchelli, A., D’ambros, M., And Lanza, M., (2010),

"Are Popular Classes More Defect Prone?", In Fundamental Approaches To Software

Engineering, D. Rosenblum And G. Taentzer, Eds., Springer Berlin / Heidelberg, Pp. 59-73.

[Ball, Porter, Siy 1997] Ball, T., Porter, J.-M. K. A. A., And Siy, H. P., (1997), "If Your

Version Control System Could Talk ...", In Proceedings Of Workshop On Process

Modeling And Empirical Studies Of Software Engineering, Boston, Ma.

[Basili, Briand, Melo 1996] Basili, V. R., Briand, L. C., And Melo, W. L., (1996), "A

Validation Of Object-Oriented Design Metrics As Quality Indicators", Ieee Trans.

Softw. Eng., Vol. 22, No. 10, Pp. 751-761.

[Bell, Ostrand, Weyuker 2006] Bell, R. M., Ostrand, T. J., And Weyuker, E. J., (2006),

"Looking For Bugs In All The Right Places", In Proceedings Of The 2006

International Symposium On Software Testing And Analysis. Portland, Maine,

Usa: Acm, Pp. 61-72.


[Bernstein, Ekanayake, Pinzger 2007] Bernstein, A., Ekanayake, J., And Pinzger, M.,

(2007), "Improving Defect Prediction Using Temporal Features And Non Linear

Models", In Ninth International Workshop On Principles Of Software Evolution:

In Conjunction With The 6th Esec/Fse Joint Meeting. Dubrovnik, Croatia: Acm,

Pp. 11-18.

[Best 1975] Best, D., (1975), "89: The Upper Tail Probabilities Of Spearman's Rho",

Appl. Stat., Vol. 24, Pp. 377-379.

[Beyer, Hassan 2006] Beyer, D. And Hassan, A. E., (2006), "Animated Visualization Of

Software History Using Evolution Storyboards", In Proceedings Of The 13th

Working Conference On Reverse Engineering: Ieee Computer Society.

[Biggs 1979] Biggs, N. L., (1979), "The Roots Of Combinatorics", Historia

Mathematica, Vol. 6, No. 2, Pp. 109-136.

[Blair 1979] Blair, D. C., (1979), "Information Retrieval, 2nd Ed. C.J. Van Rijsbergen.

London: Butterworths; 1979: 208 Pp. Price: $32.50", Journal Of The American

Society For Information Science, Vol. 30, No. 6, Pp. 374-375.


[Bohner 1996] Bohner, S. A., (1996), "Impact Analysis In The Software Change

Process: A Year 2000 Perspective", In Proceedings Of The 1996 International

Conference On Software Maintenance: Ieee Computer Society, Pp. 42-51.

[Bretscher 1997] Bretscher, O.,(1997),Linear Algebra With Applications, Prentice Hall.

[Briand, Daly, K. Wüst 1999] Briand, L. C., Daly, J. W., And K. Wüst, J., (1999), "A

Unified Framework For Coupling Measurement In Object-Oriented Systems",

Ieee Trans. Softw. Eng., Vol. 25, No. 1, Pp. 91-121.

[Briand, Labiche, Soccar 2002] Briand, L. C., Labiche, Y., And Soccar, G., (2002),

"Automating Impact Analysis And Regression Test Selection Based On Uml

Designs", In Proceedings Of The International Conference On Software

Maintenance (Icsm'02): Ieee Computer Society, Pp. 252.

[Briand, Wuest, Lounis 1999] Briand, L. C., Wuest, J., And Lounis, H., (1999), "Using

Coupling Measurement For Impact Analysis In Object-Oriented Systems", In

Proceedings Of The Ieee International Conference On Software Maintenance:

Ieee Computer Society, Pp. 475.


[Buckland, Gey 1994] Buckland, M. And Gey, F., (1994), "The Relationship Between

Recall And Precision", Journal Of The American Society For Information

Science, Vol. 45, No. 1, Pp. 12-19.

[Canfora, Ceccarelli, Cerulo, Di Penta 2010] Canfora, G., Ceccarelli, M., Cerulo, L.,

And Di Penta, M., (2010), "Using Multivariate Time Series And Association

Rules To Detect Logical Change Coupling: An Empirical Study", In Proceedings

Of The 2010 Ieee International Conference On Software Maintenance: Ieee

Computer Society, Pp. 1-10.

[Chen, Rajlich 2000] Chen, K. And Rajlich, V., (2000), "Case Study Of Feature

Location Using Dependence Graph", In Proceedings Of The 8th International

Workshop On Program Comprehension: Ieee Computer Society, Pp. 241.

[Chen, Rajlich 2001] Chen, K. And Rajlich, V., (2001), "Ripples: Tool For Change In

Legacy Software", In Proceedings Of The Ieee International Conference On

Software Maintenance (Icsm'01): Ieee Computer Society, Pp. 230.

[Chidamber, Kemerer 1994] Chidamber, S. R. And Kemerer, C. F., (1994), "A Metrics

Suite For Object Oriented Design", Ieee Transactions On Software Engineering,

Vol. 20, No. 6, Pp. 476-493.


[Collard, Decker, Maletic 2011] Collard, M. L., Decker, M. J., And Maletic, J. I., (2011),

"Lightweight Transformation And Fact Extraction With The Srcml Toolkit", In

International Working Conference On Source Code Analysis And Manipulation.

[Collard, Maletic, Robinson 2010] Collard, M. L., Maletic, J. I., And Robinson, B. P.,

(2010), "A Lightweight Transformational Approach To Support Large Scale

Adaptive Changes", In Proceedings Of The 2010 Ieee International Conference

On Software Maintenance: Ieee Computer Society, Pp. 1-10.

[Collins-Sussman, Fitzpatrick, Pilato 2004] Collins-Sussman, B., Fitzpatrick, B. W., And

Pilato, C. M.,(2004),Version Control With Subversion, O'reilly Media.

[Concas, Marchesi, Pinna, Serra 2007] Concas, G., Marchesi, M., Pinna, S., And Serra,

N., (2007), "Power-Laws In A Large Object-Oriented Software System", Ieee

Trans. Softw. Eng., Vol. 33, No. 10, Pp. 687-708.

[Cordy 2003] Cordy, J. R., (2003), "Comprehending Reality - Practical Barriers To

Industrial Adoption Of Software Maintenance Automation", In Proceedings Of

The 11th Ieee International Workshop On Program Comprehension: Ieee

Computer Society, Pp. 196.


[D'ambros, Lanza 2006] D'ambros, M. And Lanza, M., (2006), "Reverse Engineering

With Logical Coupling", In Proceedings Of The 13th Working Conference On

Reverse Engineering: Ieee Computer Society.

[D'ambros, Lanza, Lungu 2009] D'ambros, M., Lanza, M., And Lungu, M., (2009),

"Visualizing Co-Change Information With The Evolution Radar", Ieee Trans.

Softw. Eng., Vol. 35, No. 5, Pp. 720-735.

[D'ambros, Lanza, Robbes 2010a] D'ambros, M., Lanza, M., And Robbes, R., (2010a),

"An Extensive Comparison Of Bug Prediction Approaches", In Proceedings Of

Msr, Cape Town, South Africa, Pp. 31-41.

[D'ambros, Lanza, Robbes 2010b] D'ambros, M., Lanza, M., And Robbes, R., (2010b),

"An Extensive Comparison Of Bug Prediction Approaches", In Proceedings Of

Mining Software Repositories (Msr), 2010 7th Ieee Working Conference On, Pp.

31-41.

[Damaševičius 2009] Damaševičius, R., (2009), "Analysis Of Academic Results For

Informatics Course Improvement Using Association Rule Mining", In

Information Systems Development, Springer Us, Pp. 357-363.


[De Lucia, Pompella, Stefanucci 2002] De Lucia, A., Pompella, E., And Stefanucci, S.,

(2002), "Effort Estimation For Corrective Software Maintenance", In Proceedings

Of The 14th International Conference On Software Engineering And Knowledge

Engineering. Ischia, Italy: Acm, Pp. 409-416.

[De Lucia, Pompella, Stefanucci 2005] De Lucia, A., Pompella, E., And Stefanucci, S.,

(2005), "Assessing Effort Estimation Models For Corrective Maintenance

Through Empirical Studies", Information And Software Technology, Vol. 47, No.

1, Pp. 3-15.

[Eick, Steffen, Eric E. Sumner 1992] Eick, S. G., Steffen, J. L., And Eric E. Sumner, J.,

(1992), "Seesoft-A Tool For Visualizing Line Oriented Software Statistics", Ieee

Trans. Softw. Eng., Vol. 18, No. 11, Pp. 957-968.

[El Emam, Melo, Machado 2001] El Emam, K., Melo, W., And Machado, J. C., (2001),

"The Prediction Of Faulty Classes Using Object-Oriented Design Metrics",

Journal Of Systems And Software, Vol. 56, No. 1, Pp. 63-75.


[Estublier Et Al. 2005] Estublier, J., Leblang, D., Van Der Hoek, A., Conradi, R.,

Clemm, G., Tichy, W., And Wiborg-Weber, D., (2005), "Impact Of Software

Engineering Research On The Practice Of Software Configuration Management",

Acm Transactions On Software Engineering And Methodology, Vol. 14, No. 4,

Pp. 383-430.

[Fenton, Pfleeger 1996] Fenton, N. E. And Pfleeger, S. L.,(1996),Software Metrics: A

Rigorous And Practical Approach, International Thomson Computer Press.

[Fogel, O’neill 2002] Fogel, K. And O’neill, M., (2002), "Cvs2cl.Pl: A Script For

Converting Cvs Log Messages To Changelog Files",

Http://Www.Redbean.Com/Cvs2cl/.

[Freedman 2009] Freedman, D.,(2009),Statistical Models: Theory And Practice,

Cambridge University Press.

[Gall, Hajek, Jazayeri 1998] Gall, H., Hajek, K., And Jazayeri, M., (1998), "Detection

Of Logical Coupling Based On Product Release History", In Proceedings Of The

International Conference On Software Maintenance: Ieee Computer Society.

[Gall, Jazayeri, Krajewski 2003] Gall, H., Jazayeri, M., and Krajewski, J., (2003), "CVS Release History Data for Detecting Logical Couplings", in Proceedings of the 6th International Workshop on Principles of Software Evolution: IEEE Computer Society.

[Gallagher, Lyle 1991] Gallagher, K. B. and Lyle, J. R., (1991), "Using Program Slicing in Software Maintenance", IEEE Trans. Softw. Eng., vol. 17, no. 8, pp. 751-761.

[Geiger, Fluri, Gall, Pinzger 2006] Geiger, R., Fluri, B., Gall, H., and Pinzger, M., (2006), "Relation of Code Clones and Change Couplings", in Proceedings of the 9th International Conference on Fundamental Approaches to Software Engineering.

[Görg, Weißgerber 2005] Görg, C. and Weißgerber, P., (2005), "Error Detection by Refactoring Reconstruction", SIGSOFT Softw. Eng. Notes, vol. 30, no. 4, pp. 1-5.

[Graves, Karr, Marron, Siy 2000] Graves, T. L., Karr, A., Marron, J. S., and Siy, H., (2000), "Predicting Fault Incidence Using Software Change History", IEEE Transactions on Software Engineering, vol. 26, no. 7, pp. 653-661.

[Gyimóthy, Ferenc, Siket 2005] Gyimóthy, T., Ferenc, R., and Siket, I., (2005), "Empirical Validation of Object-Oriented Metrics on Open Source Software for Fault Prediction", IEEE Transactions on Software Engineering, vol. 31, pp. 897-910.

[Hahsler, Hornik, Reutterer 2006] Hahsler, M., Hornik, K., and Reutterer, T., (2006), "Implications of Probabilistic Data Modeling for Mining Association Rules", in From Data and Information Analysis to Knowledge Engineering, Springer Berlin Heidelberg, pp. 598-605.

[Halstead 1977] Halstead, M. H., (1977), Elements of Software Science, Elsevier.

[Hassan 2008] Hassan, A. E., (2008), "The Road Ahead for Mining Software Repositories", in Proceedings of Frontiers of Software Maintenance (FoSM 2008), pp. 48-57.

[Hassan 2009] Hassan, A. E., (2009), "Predicting Faults Using the Complexity of Code Changes", in Proceedings of the 31st International Conference on Software Engineering: IEEE Computer Society, pp. 78-88.

[Hassan, Holt 2004a] Hassan, A. E. and Holt, R. C., (2004a), "Predicting Change Propagation in Software Systems", in Proceedings of the 20th IEEE International Conference on Software Maintenance: IEEE Computer Society, pp. 284-293.

[Hassan, Holt 2004b] Hassan, A. E. and Holt, R. C., (2004b), "Predicting Change Propagation in Software Systems", in Proceedings of Software Maintenance, 2004. Proceedings. 20th IEEE International Conference on, pp. 284-293.

[Hassan, Holt 2005] Hassan, A. E. and Holt, R. C., (2005), "The Top Ten List: Dynamic Fault Prediction", in Proceedings of the 21st IEEE International Conference on Software Maintenance: IEEE Computer Society, pp. 263-272.

[Hauke, Tomasz 2011] Hauke, J. and Tomasz, K., (2011), "Comparison of Values of Pearson's and Spearman's Correlation Coefficients on the Same Sets of Data", Quaestiones Geographicae, vol. 30, no. 2, p. 7.

[Hayes, Patel, Zhao 2004] Hayes, J. H., Patel, S. C., and Zhao, L., (2004), "A Metrics-Based Software Maintenance Effort Model", in Proceedings of the Eighth Euromicro Working Conference on Software Maintenance and Reengineering (CSMR'04): IEEE Computer Society, p. 254.

[Herraiz, German, Hassan 2011] Herraiz, I., German, D. M., and Hassan, A. E., (2011), "On the Distribution of Source Code File Sizes", in International Conference on Software and Data Technologies. Seville, Spain.

[Herraiz, Hassan 2010] Herraiz, I. and Hassan, A. E., (2010), "Beyond Lines of Code: Do We Need More Complexity Metrics?", in Making Software: What Really Works, and Why We Believe It, A. Oram and G. Wilson, Eds., Sebastopol, CA: O'Reilly Media, Inc., pp. 125-141.

[Herraiz Tabernero, German, Hassan 2011] Herraiz Tabernero, I., German, D. M., and Hassan, A. E., (2011), "On the Distribution of Source Code File Sizes".

[Hill, Pollock, Vijay-Shanker 2007] Hill, E., Pollock, L., and Vijay-Shanker, K., (2007), "Exploring the Neighborhood with Dora to Expedite Software Maintenance", in Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering. Atlanta, Georgia, USA: ACM, pp. 14-23.

[Hollander, Wolfe, Chicken 2013] Hollander, M., Wolfe, D. A., and Chicken, E., (2013), Nonparametric Statistical Methods, John Wiley & Sons.

[Huang, Liu 2005] Huang, S.-K. and Liu, K.-M., (2005), "Mining Version Histories to Verify the Learning Process of Legitimate Peripheral Participants", SIGSOFT Softw. Eng. Notes, vol. 30, no. 4, pp. 1-5.

[Hudepohl et al. 1996] Hudepohl, J. P., Aud, S. J., Khoshgoftaar, T. M., Allen, E. B., and Mayrand, J., (1996), "Emerald: Software Metrics and Models on the Desktop", IEEE Softw., vol. 13, no. 5, pp. 56-60.

[Jane Huffman, Zhao 2005] Jane Huffman, H. and Zhao, L., (2005), "Maintainability Prediction: A Regression Analysis of Measures of Evolving Systems", in Proceedings, pp. 601-604.

[Johnson, Wichern 1998] Johnson, R. A. and Wichern, D. W., (1998), Applied Multivariate Statistical Analysis, 4th ed., Prentice Hall.

[Kagdi, Collard, Maletic 2007a] Kagdi, H., Collard, M. L., and Maletic, J. I., (2007a), "Comparing Approaches to Mining Source Code for Call-Usage Patterns", in Proceedings of the Fourth International Workshop on Mining Software Repositories: IEEE Computer Society, p. 20.

[Kagdi, Collard, Maletic 2007b] Kagdi, H., Collard, M. L., and Maletic, J. I., (2007b), "A Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution", J. Softw. Maint. Evol., vol. 19, no. 2, pp. 77-131.

[Kagdi, Gethers, Poshyvanyk, Collard 2010] Kagdi, H., Gethers, M., Poshyvanyk, D., and Collard, M. L., (2010), "Blending Conceptual and Evolutionary Couplings to Support Change Impact Analysis in Source Code", in Proceedings of the Working Conference on Reverse Engineering, pp. 119-128.

[Kagdi, Maletic 2007] Kagdi, H. and Maletic, J. I., (2007), "Combining Single-Version and Evolutionary Dependencies for Software-Change Prediction", in Proceedings of the Fourth International Workshop on Mining Software Repositories: IEEE Computer Society, p. 17.

[Kagdi, Maletic, Sharif 2007] Kagdi, H., Maletic, J. I., and Sharif, B., (2007), "Mining Software Repositories for Traceability Links", in Proceedings of the 15th IEEE International Conference on Program Comprehension: IEEE Computer Society, pp. 145-154.

[Kagdi, Yusuf, Maletic 2006] Kagdi, H., Yusuf, S., and Maletic, J. I., (2006), "Mining Sequences of Changed-Files from Version Histories", in Proceedings of the 2006 International Workshop on Mining Software Repositories. Shanghai, China: ACM, pp. 47-53.

[Kamei et al. 2010] Kamei, Y., Matsumoto, S., Monden, A., Matsumoto, K.-I., Adams, B., and Hassan, A. E., (2010), "Revisiting Common Bug Prediction Findings Using Effort-Aware Models", in Proceedings of the 26th IEEE International Conference on Software Maintenance: IEEE Computer Society, pp. 1-10.

[Kim et al. 2011] Kim, D., Wang, X., Kim, S., Zeller, A., Cheung, S. C., and Park, S., (2011), "Which Crashes Should I Fix First?: Predicting Top Crashes at an Early Stage to Prioritize Debugging Efforts", IEEE Transactions on Software Engineering, vol. 37, pp. 430-447.

[Kim, Notkin, Grossman 2007] Kim, M., Notkin, D., and Grossman, D., (2007), "Automatic Inference of Structural Changes for Matching Across Program Versions", in Proceedings of the 29th International Conference on Software Engineering: IEEE Computer Society, pp. 333-343.

[Kim, Sazawal, Notkin, Murphy 2005] Kim, M., Sazawal, V., Notkin, D., and Murphy, G., (2005), "An Empirical Study of Code Clone Genealogies", in Proceedings of ACM SIGSOFT Software Engineering Notes, pp. 187-196.

[Kim, Whitehead, Bevan 2005] Kim, M., Whitehead, E. J., and Bevan, J., (2005), "Analysis of Signature Change Patterns", in Proceedings of the 2005 International Workshop on Mining Software Repositories. St. Louis, Missouri: ACM, pp. 1-5.

[Kim, Pan, Whitehead 2005] Kim, S., Pan, K., and Whitehead, E. J., Jr., (2005), "When Functions Change Their Names: Automatic Detection of Origin Relationships", in Proceedings, pp. 143-152.

[Kim, Zimmermann, Pan, Whitehead 2006] Kim, S., Zimmermann, T., Pan, K., and Whitehead, E. J. J., (2006), "Automatic Identification of Bug-Introducing Changes", in Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering: IEEE Computer Society.

[Kim, Zimmermann, Whitehead, Zeller 2007] Kim, S., Zimmermann, T., Whitehead, E. J., Jr., and Zeller, A., (2007), "Predicting Faults from Cached History", in Proceedings of the 29th International Conference on Software Engineering: IEEE Computer Society, pp. 489-498.

[Knab, Pinzger, Bernstein 2006] Knab, P., Pinzger, M., and Bernstein, A., (2006), "Predicting Defect Densities in Source Code Files with Decision Tree Learners", in Proceedings of the 2006 International Workshop on Mining Software Repositories. Shanghai, China: ACM, pp. 119-125.

[LaToza, Venolia, DeLine 2006] LaToza, T. D., Venolia, G., and DeLine, R., (2006), "Maintaining Mental Models: A Study of Developer Work Habits", in Proceedings of the 28th International Conference on Software Engineering, pp. 492-501.

[Lave, Wenger 1991] Lave, J. and Wenger, E., (1991), Situated Learning: Legitimate Peripheral Participation, Cambridge University Press.

[Law, Rothermel 2003] Law, J. and Rothermel, G., (2003), "Whole Program Path-Based Dynamic Impact Analysis", in Proceedings of the 25th International Conference on Software Engineering. Portland, Oregon: IEEE Computer Society, pp. 308-318.

[Lehman 2005] Lehman, A., (2005), JMP for Basic Univariate and Multivariate Statistics: A Step-by-Step Guide, SAS Institute.

[Li, Sun, Leung, Zhang 2013] Li, B., Sun, X., Leung, H., and Zhang, S., (2013), "A Survey of Code-Based Change Impact Analysis Techniques", Software Testing, Verification and Reliability, vol. 23, no. 8, pp. 613-646.

[Lopez-Fernandez, Robles, Gonzalez-Barahona 2004] Lopez-Fernandez, L., Robles, G., and Gonzalez-Barahona, J. M., (2004), "Applying Social Network Analysis to the Information in CVS Repositories", in Proceedings of the Mining Software Repositories Workshop, 26th International Conference on Software Engineering, Edinburgh, Scotland.

[M. Bieman, Andrews, Yang 2003] Bieman, J. M., Andrews, A. A., and Yang, H. J., (2003), "Understanding Change-Proneness in OO Software through Visualization", in Proceedings of the 11th IEEE International Workshop on Program Comprehension: IEEE Computer Society, p. 44.

[Maletic, Collard 2004] Maletic, J. I. and Collard, M. L., (2004), "Supporting Source Code Difference Analysis", in Proceedings of the 20th IEEE International Conference on Software Maintenance: IEEE Computer Society.

[Maletic et al. 2011] Maletic, J. I., Mosora, D. J., Newman, C. D., Collard, M. L., Sutton, A., and Robinson, B. P., (2011), "MosaiCode: Visualizing Large Scale Software: A Tool Demonstration", in VISSOFT: IEEE, pp. 1-4.

[McCabe 1976] McCabe, T. J., (1976), "A Complexity Measure", in Proceedings of the 2nd International Conference on Software Engineering. San Francisco, California, United States: IEEE Computer Society Press, p. 407.

[McNair, German, Weber-Jahnke 2007] McNair, A., German, D. M., and Weber-Jahnke, J., (2007), "Visualizing Software Architecture Evolution Using Change-Sets", in Proceedings of the 14th Working Conference on Reverse Engineering: IEEE Computer Society, pp. 130-139.

[Meneely, Williams, Snipes, Osborne 2008] Meneely, A., Williams, L., Snipes, W., and Osborne, J., (2008), "Predicting Failures with Developer Networks and Social Network Analysis", in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Atlanta, Georgia: ACM, pp. 13-23.

[Menzies, Greenwald, Frank 2007] Menzies, T., Greenwald, J., and Frank, A., (2007), "Data Mining Static Code Attributes to Learn Defect Predictors", IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2-13.

[Menzies et al. 2010] Menzies, T., Jalali, O., Hihn, J., Baker, D., and Lum, K., (2010), "Stable Rankings for Different Effort Models", Automated Software Engg., vol. 17, no. 4, pp. 409-437.

[Mitzenmacher 2004] Mitzenmacher, M., (2004), "A Brief History of Generative Models for Power Law and Lognormal Distributions", Internet Mathematics, vol. 1, no. 2, pp. 226-251.

[Mockus, Votta 2000] Mockus, A. and Votta, L. G., (2000), "Identifying Reasons for Software Changes Using Historic Databases", in Proceedings of the International Conference on Software Maintenance (ICSM'00): IEEE Computer Society, p. 120.

[Mockus, Weiss, Zhang 2003] Mockus, A., Weiss, D. M., and Zhang, P., (2003), "Understanding and Predicting Effort in Software Projects", in Proceedings of the 25th International Conference on Software Engineering. Portland, Oregon: IEEE Computer Society, pp. 274-284.

[Monti 1995] Monti, K. L., (1995), "Folded Empirical Distribution Function Curves - Mountain Plots", The American Statistician, vol. 49, no. 4, pp. 342-345.

[Moonen 2002] Moonen, L., (2002), "Lightweight Impact Analysis Using Island Grammars", in Proceedings of the 10th International Workshop on Program Comprehension: IEEE Computer Society, p. 219.

[Moser, Pedrycz, Succi 2008] Moser, R., Pedrycz, W., and Succi, G., (2008), "A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction", in Proceedings of the 30th International Conference on Software Engineering. Leipzig, Germany: ACM, pp. 181-190.

[Munson, Elbaum 1998] Munson, J. C. and Elbaum, S. G., (1998), "Code Churn: A Measure for Estimating the Impact of Code Change", in Proceedings of the International Conference on Software Maintenance: IEEE Computer Society, p. 24.

[Nagappan, Ball 2005a] Nagappan, N. and Ball, T., (2005a), "Static Analysis Tools as Early Indicators of Pre-Release Defect Density", in Proceedings of the 27th International Conference on Software Engineering. St. Louis, MO, USA: ACM, pp. 580-586.

[Nagappan, Ball 2005b] Nagappan, N. and Ball, T., (2005b), "Use of Relative Code Churn Measures to Predict System Defect Density", in Proceedings of the 27th International Conference on Software Engineering. St. Louis, MO, USA: ACM, pp. 284-292.

[Nagappan, Ball, Zeller 2006] Nagappan, N., Ball, T., and Zeller, A., (2006), "Mining Metrics to Predict Component Failures", in Proceedings of the 28th International Conference on Software Engineering. Shanghai, China: ACM, pp. 452-461.

[Newman 2005] Newman, M. E. J., (2005), "Power Laws, Pareto Distributions and Zipf's Law", in Contemporary Physics, vol. 46: Taylor & Francis, pp. 323-351.

[Offen, Jeffery 1997] Offen, R. J. and Jeffery, R., (1997), "Establishing Software Measurement Programs", IEEE Softw., vol. 14, no. 2, pp. 45-53.

[Orso et al. 2004] Orso, A., Apiwattanapong, T., Law, J., Rothermel, G., and Harrold, M. J., (2004), "An Empirical Comparison of Dynamic Impact Analysis Algorithms", in Proceedings of the 26th International Conference on Software Engineering: IEEE Computer Society, pp. 491-500.

[Ostrand, Weyuker, Bell 2005] Ostrand, T. J., Weyuker, E. J., and Bell, R. M., (2005), "Predicting the Location and Number of Faults in Large Software Systems", IEEE Transactions on Software Engineering, vol. 31, no. 4, pp. 340-355.

[Petrenko, Rajlich 2009] Petrenko, M. and Rajlich, V., (2009), "Variable Granularity for Improving Precision of Impact Analysis", in Proceedings of Program Comprehension, 2009. ICPC '09. IEEE 17th International Conference on, pp. 10-19.

[Pinzger, Gall, Fischer, Lanza 2005] Pinzger, M., Gall, H., Fischer, M., and Lanza, M., (2005), "Visualizing Multiple Evolution Metrics", in Proceedings of the 2005 ACM Symposium on Software Visualization. St. Louis, Missouri: ACM.

[Polo, Piattini, Ruiz 2001] Polo, M., Piattini, M., and Ruiz, F., (2001), "Using Code Metrics to Predict Maintenance of Legacy Programs: A Case Study", in Proceedings of the IEEE International Conference on Software Maintenance (ICSM'01): IEEE Computer Society, p. 202.

[Poshyvanyk, Marcus, Ferenc, Gyimóthy 2009] Poshyvanyk, D., Marcus, A., Ferenc, R., and Gyimóthy, T., (2009), "Using Information Retrieval Based Coupling Measures for Impact Analysis", Empirical Softw. Engg., vol. 14, no. 1, pp. 5-32.

[Queille, Voidrot, Wilde, Munro 1994] Queille, J.-P., Voidrot, J.-F., Wilde, N., and Munro, M., (1994), "The Impact Analysis Task in Software Maintenance: A Model and a Case Study", in Proceedings of the International Conference on Software Maintenance: IEEE Computer Society, pp. 234-242.

[Raghavan et al. 2004] Raghavan, S., Rohana, R., Leon, D., Podgurski, A., and Augustine, V., (2004), "Dex: A Semantic-Graph Differencing Tool for Studying Changes in Large Code Bases", in Proceedings of Software Maintenance, 2004. Proceedings. 20th IEEE International Conference on, pp. 188-197.

[Rajlich 1997] Rajlich, V., (1997), "A Model for Change Propagation Based on Graph Rewriting", in Proceedings of the International Conference on Software Maintenance: IEEE Computer Society, pp. 84-91.

[Ramil, Lehman 2000] Ramil, J. F. and Lehman, M. M., (2000), "Metrics of Software Evolution as Effort Predictors - A Case Study", in IEEE International Conference on Software Maintenance, p. 163.

[Rastkar, Murphy 2009] Rastkar, S. and Murphy, G. C., (2009), "On What Basis to Recommend: Changesets or Interactions?", in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories: IEEE Computer Society, pp. 155-158.

[Ratzinger, Fischer, Gall 2005] Ratzinger, J., Fischer, M., and Gall, H., (2005), "Improving Evolvability through Refactoring", SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1-5.

[Ratzinger, Pinzger, Gall 2007] Ratzinger, J., Pinzger, M., and Gall, H., (2007), "EQ-Mine: Predicting Short-Term Defects for Software Evolution", in Fundamental Approaches to Software Engineering, M. Dwyer and A. Lopes, Eds., Springer Berlin / Heidelberg, pp. 12-26.

[Riehle, Kolassa, Salim 2012] Riehle, D., Kolassa, C., and Salim, M. A., (2012), "Developer Belief vs. Reality: The Case of the Commit Size Distribution", in Proceedings of Software Engineering, pp. 59-70.

[Robbes, Lanza 2005] Robbes, R. and Lanza, M., (2005), "Versioning Systems for Evolution Research", in Proceedings of the Eighth International Workshop on Principles of Software Evolution: IEEE Computer Society.

[Robbes, Pollet, Lanza 2008] Robbes, R., Pollet, D., and Lanza, M., (2008), "Logical Coupling Based on Fine-Grained Change Information", in Proceedings of the 2008 15th Working Conference on Reverse Engineering: IEEE Computer Society.

[Robillard 2005] Robillard, M. P., (2005), "Automatic Generation of Suggestions for Program Investigation", in Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Lisbon, Portugal: ACM, pp. 11-20.

[Rountev, Milanova, Ryder 2001] Rountev, A., Milanova, A., and Ryder, B. G., (2001), "Points-to Analysis for Java Using Annotated Constraints", in Proceedings of the 16th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications. Tampa Bay, FL, USA: ACM, pp. 43-55.

[Ryder 1979] Ryder, B. G., (1979), "Constructing the Call Graph of a Program", IEEE Trans. Softw. Eng., vol. 5, no. 3, pp. 216-226.

[Sayyad, Lethbridge 2001] Sayyad, J. and Lethbridge, C., (2001), "Supporting Software Maintenance by Mining Software Update Records", in Proceedings of the IEEE International Conference on Software Maintenance (ICSM'01): IEEE Computer Society, p. 22.

[Shawn, Gracanin 2003] Shawn, A. B. and Gracanin, D., (2003), "Software Impact Analysis in a Virtual Environment", in Proceedings, p. 143.

[Śliwerski, Zimmermann, Zeller 2005] Śliwerski, J., Zimmermann, T., and Zeller, A., (2005), "When Do Changes Induce Fixes?", SIGSOFT Softw. Eng. Notes, vol. 30, no. 4, pp. 1-5.

[Soman, Diwakar, Ajay 2006] Soman, K. P., Diwakar, S., and Ajay, V., (2006), Insight into Data Mining: Theory and Practice, Prentice-Hall of India.

[Subramanyam, Krishnan 2003] Subramanyam, R. and Krishnan, M. S., (2003), "Empirical Analysis of CK Metrics for Object-Oriented Design Complexity: Implications for Software Defects", IEEE Transactions on Software Engineering, vol. 29, no. 4, pp. 297-310.

[Taneja, Dig, Xie 2007] Taneja, K., Dig, D., and Xie, T., (2007), "Automated Detection of API Refactorings in Libraries", in Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering. Atlanta, Georgia, USA: ACM, pp. 377-380.

[Thilo, Koschke 2009] Thilo, M. and Koschke, R., (2009), "Revisiting the Evaluation of Defect Prediction Models", in Proceedings of the 5th International Conference on Predictor Models in Software Engineering. Vancouver, British Columbia, Canada: ACM, pp. 1-10.

[Tonella 2003] Tonella, P., (2003), "Using a Concept Lattice of Decomposition Slices for Program Understanding and Impact Analysis", IEEE Trans. Softw. Eng., vol. 29, no. 6, pp. 495-509.

[Tu, Godfrey 2002] Tu, Q. and Godfrey, M. W., (2002), "An Integrated Approach for Studying Architectural Evolution", in Proceedings of the 10th International Workshop on Program Comprehension: IEEE Computer Society, p. 127.

[Van Emden, Moonen 2002] Van Emden, E. and Moonen, L., (2002), "Java Quality Assurance by Detecting Code Smells", in Proceedings of Reverse Engineering, 2002. Proceedings. Ninth Working Conference on, pp. 97-106.

[Vanya, Premraj, Vliet 2011] Vanya, A., Premraj, R., and Vliet, H. V., (2011), "Approximating Change Sets at Philips Healthcare: A Case Study", in Proceedings of the European Conference on Software Maintenance and Reengineering, pp. 121-130.

[Weißgerber, Diehl 2006] Weißgerber, P. and Diehl, S., (2006), "Are Refactorings Less Error-Prone Than Other Changes?", in Proceedings of the 2006 International Workshop on Mining Software Repositories. Shanghai, China: ACM, pp. 112-118.

[Weyuker, Ostrand 2010] Weyuker, E. J. and Ostrand, T. J., (2010), "An Automated Fault Prediction System", in Making Software: What Really Works, and Why We Believe It, A. Oram and G. Wilson, Eds., Sebastopol, CA: O'Reilly Media, Inc., pp. 145-160.

[Weyuker, Ostrand, Bell 2007] Weyuker, E. J., Ostrand, T. J., and Bell, R. M., (2007), "Using Developer Information as a Factor for Fault Prediction", in Proceedings of the Third International Workshop on Predictor Models in Software Engineering: IEEE Computer Society, p. 8.

[Wilkie, Kitchenham 2000] Wilkie, F. G. and Kitchenham, B. A., (2000), "Coupling Measures and Change Ripples in C++ Application Software", J. Syst. Softw., vol. 52, no. 2-3, pp. 157-164.

[Williams 2005] Williams, C. C., (2005), "Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques", IEEE Transactions on Software Engineering, vol. 31, pp. 466-480.

[Williams, Hollingsworth 2004] Williams, C. C. and Hollingsworth, J. K., (2004), "Bug Driven Bug Finders", in International Workshop on Mining Software Repositories. University of Waterloo, Waterloo, ON, pp. 70-74.

[Witten, Frank, Hall 2011] Witten, I. H., Frank, E., and Hall, M. A., (2011), Data Mining: Practical Machine Learning Tools and Techniques, Elsevier Science & Technology.

[Xing, Stroulia 2005] Xing, Z. and Stroulia, E., (2005), "UMLDiff: An Algorithm for Object-Oriented Design Differencing", in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering. Long Beach, CA, USA: ACM, pp. 54-65.

[Ying, Murphy, Ng, Chu-Carroll 2004] Ying, A. T. T., Murphy, G. C., Ng, R., and Chu-Carroll, M. C., (2004), "Predicting Source Code Changes by Mining Change History", IEEE Transactions on Software Engineering, vol. 30, no. 9, pp. 574-586.

[Yu, Rajlich 2001] Yu, Z. and Rajlich, V., (2001), "Hidden Dependencies in Program Comprehension and Change Propagation", in Proceedings of the 9th International Workshop on Program Comprehension: IEEE Computer Society, p. 293.

[Zaki, Parthasarathy, Li 1997] Zaki, M. J., Parthasarathy, S., and Li, W., (1997), "A Localized Algorithm for Parallel Association Mining", in Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures. Newport, Rhode Island, United States: ACM, pp. 321-330.

[Zhang, Tan, Marchesi 2009] Zhang, H., Tan, H. B. K., and Marchesi, M., (2009), "The Distribution of Program Sizes and Its Implications: An Eclipse Case Study", Computing Research Repository, vol. abs/0905.2.

[Zimmermann, Nagappan 2008] Zimmermann, T. and Nagappan, N., (2008), "Predicting Defects Using Network Analysis on Dependency Graphs", in Proceedings of the 30th International Conference on Software Engineering. Leipzig, Germany: ACM, pp. 531-540.

[Zimmermann, Premraj, Zeller 2007] Zimmermann, T., Premraj, R., and Zeller, A., (2007), "Predicting Defects for Eclipse", in Proceedings of the Third International Workshop on Predictor Models in Software Engineering: IEEE Computer Society, p. 9.

[Zimmermann, Weisgerber, Diehl, Zeller 2004] Zimmermann, T., Weisgerber, P., Diehl, S., and Zeller, A., (2004), "Mining Version Histories to Guide Software Changes", in Proceedings of the 26th International Conference on Software Engineering: IEEE Computer Society.

[Zimmermann, Weissgerber 2004] Zimmermann, T. and Weissgerber, P., (2004), "Preprocessing CVS Data for Fine-Grained Analysis", in Proceedings of the International Workshop on Mining Software Repositories (MSR).