AN INVESTIGATION OF ROUTINE REPETITIVENESS IN OPEN-SOURCE PROJECTS

Mohd Arafat

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2018

Committee:

Robert Dyer, Advisor

Robert C. Green II

Raymon Kresman

Copyright © August 2018 Mohd Arafat. All rights reserved.

ABSTRACT

Robert Dyer, Advisor

Many programming languages contain a way to provide a sub-portion of the source code that performs a specific and often independent behavior. Depending on the language this is called a (sub-)routine, method, function, procedure, etc. One of the main purposes of creating a routine is to enable re-use. As devised, routines are intended to be called from multiple places within a program. Sometimes, however, the same code is repeated within a project or across projects. In this work, we investigate how often such routines are repeated in a large-scale corpus of open source software. This work attempts to independently reproduce a prior research result by Nguyen et al., building the analysis framework from the ground up and analyzing a different and very large set of open source software projects. In this work, we use the Boa infrastructure to investigate routine repetitiveness by analyzing over 300k open source projects from GitHub. Similar to the prior work, we first compute the program dependence graphs (PDGs) for each routine in the dataset, perform normalization on the PDGs, and look for repetitions both within and across projects. Our experiment shows that about 16.4% of routines repeat within a project and approximately 11% of routines repeat across at least two different projects. We then perform static program slicing on the PDGs, slicing the graph on each routine argument to obtain subroutines and look for repetitiveness once again. We observe that approximately 17% of all subroutines repeat within a project and 11% repeat across projects. Finally, we investigate if the size of the PDG or the number of control nodes has any impact on the repetitiveness of routines. Overall, our results confirm the trends shown in the prior study, though with differences in the size of the results.

ACKNOWLEDGMENTS

I would like to express my gratitude to a number of people for their support during my time at BGSU. First and foremost, I would like to thank my advisor Dr. Dyer for being a source of ideas and for providing valuable guidance throughout my master’s journey. I have benefited immensely from his thoughtful suggestions and technical discussions. His insights on programming languages and writing clean and efficient code have made me a much better researcher and a programmer. I also have to thank him for being patient with me throughout my masters and especially in the last few weeks.

The rest of my thesis committee members, Dr. Kresman and Dr. Green, also helped me greatly during my years as a graduate student. I would like to thank Dr. Kresman for teaching me the building blocks of algorithm design and databases. I was fortunate to have taken Dr. Green's artificial intelligence class, which introduced me to the new world of heuristic and metaheuristic algorithms. I thank both of them for their valuable suggestions and feedback to improve the thesis.

Lastly, this work was supported by the National Science Foundation (NSF) under grant CCF-1518776.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION 1

CHAPTER 2 RELATED APPROACHES 4

2.1 Control Flow Graphs ...... 4

2.2 Post Dominator Trees ...... 7

2.3 Control Dependence Graphs ...... 8

2.4 Data Dependence Graphs ...... 9

2.5 Program Dependence Graphs ...... 11

2.6 Program Slicing ...... 12

2.6.1 Forward Static Slicing ...... 13

CHAPTER 3 RELATED EMPIRICAL STUDIES 15

3.1 Code Repetition Studies ...... 15

3.2 Code Uniqueness Studies ...... 16

CHAPTER 4 METHODOLOGY 17

4.1 Augmented Graphs ...... 17

4.2 Normalization ...... 17

4.3 Per-Argument Slicing subGraphs (PASG) ...... 21

4.4 Implementation in Boa ...... 21

4.4.1 Boa Graphs API ...... 23

4.5 Boa Queries ...... 24

CHAPTER 5 EVALUATION 29

5.1 Dataset ...... 29

5.2 Post-processing ...... 29

5.3 RQ1: Repetitiveness of Routines ...... 33

5.3.1 RQ1a: Repetitiveness Within a Project ...... 33

5.3.2 RQ1b: Repetitiveness Across Projects ...... 34

5.3.3 RQ1c: Repetitiveness by Graph Size ...... 35

5.3.4 RQ1d: Repetitiveness by Number of Control Nodes ...... 36

5.4 RQ2: Repetitiveness of PASGs ...... 37

5.4.1 RQ2a: Repetitiveness Within a Project ...... 37

5.4.2 RQ2b: Repetitiveness Across Projects ...... 37

5.4.3 RQ2c: Repetitiveness by Graph Size ...... 39

5.4.4 RQ2d: Repetitiveness by Number of Control Nodes ...... 39

5.5 Threats to Validity ...... 40

CHAPTER 6 CONCLUSION 41

BIBLIOGRAPHY 42

LIST OF FIGURES

2.1 An example Java method to iteratively compute factorials...... 4

2.2 A control flow graph for the example code shown in Figure 2.1...... 5

2.3 The CFG in Figure 2.2 augmented with an ENTRY node...... 6

2.4 The data-flow equations for computing dominators...... 7

2.5 A post dominator tree for the augmented CFG shown in Figure 2.3...... 8

2.6 A control dependence graph for the PDT shown in Figure 2.5...... 9

2.7 The data-flow equations for computing def-use chains...... 10

2.8 A data dependence graph for the augmented CFG shown in Figure 2.3...... 10

2.9 A program dependence graph for the augmented CDG in Figure 2.6 and DDG in Figure 2.8...... 11

2.10 An example forward static slice criteria, with all statements contributing to the slice highlighted...... 13

2.11 A forward static slice using criteria on the PDG shown in Figure 2.9...... 14

4.1 The augmented CFG based on Figure 2.3...... 18

4.2 The augmented PDG based on Figure 2.9...... 19

4.3 The normalized PDG based on Figure 4.2...... 20

4.4 The PASG for the PDG in Figure 4.2...... 21

4.5 The normalized PASG based on Figure 4.4...... 22

4.6 The Boa query for RQ1a (repetition within project)...... 24

4.7 The Boa query for RQ1b (repetition across projects)...... 25

4.8 The Boa query for RQ1c (repetition by graph size)...... 26

4.9 The Boa query for RQ1d (repetition by number of control nodes)...... 26

4.10 The Boa query for RQ2a (repetition within project)...... 27

4.11 The Boa query for RQ2b (repetition across projects)...... 27

4.12 The Boa query for RQ2c (repetition by graph size)...... 28

4.13 The Boa query for RQ2d (repetition by number of control nodes)...... 28

5.1 Boa query to generate dataset statistics...... 30

5.2 Python script to compute parts a/b from RQ1 and RQ2...... 31

5.3 Python script to compute parts c/d from RQ1 and RQ2...... 31

5.4 Python script to plot graphs...... 32

5.5 Main portion of Python script to perform post-processing...... 32

5.6 % of entire routines realized elsewhere within a project...... 33

5.7 % of entire routines realized elsewhere in other projects...... 34

5.8 Repetitiveness by graph size (V+E) in PDGs...... 35

5.9 Repetitiveness by number of control nodes in PDGs...... 36

5.10 % of entire subroutines realized elsewhere within a project...... 38

5.11 % of subroutines realized elsewhere in other projects...... 38

5.12 % Repetitiveness by Graph Size...... 39

5.13 % Repetitiveness by number of control nodes in subroutines...... 40

LIST OF TABLES

5.1 Analyzed Dataset Statistics ...... 29

CHAPTER 1. INTRODUCTION

The word repetition or duplication refers to the occurrence of something which has already been said or done. Repetition occurs in all kinds of systems and, depending on the context, can be good or bad. For example, repetition codes are used as an error correction mechanism in communication systems and therefore constitute a good use of repetition. A bad example would be copying somebody else's work without citation. Software systems are also not immune to duplication and tend to repeat both in terms of architectural design and code syntax and semantics. Software repetition, though mostly considered undesirable [19], is sometimes needed to enhance the performance of some parts of the system.

In this work, we deal only with code repetitions [14, 17, 31] in software systems which are maintained in repositories. Code repetition is not always intentional and can also be accidental, especially if it occurs across different projects. There are many reasons why source code repetition occurs. First of all, some people find it easier to copy and paste a working or complex piece of code. Automation tools or compilers which translate source code from one language to another can also produce similar-looking code. Sometimes efficient and reusable APIs or routines are copied for performance reasons. Other reasons include the implementation of similar algorithms, naming and style conventions [31], and the use of similar design patterns. There are also distributed Source Code Management (SCM) systems such as git whose philosophies are based on allowing clones or copies of projects so that people can experiment with and customize their individual copies as per their needs.

There are many disadvantages associated with code repetition, specifically if it is a result of copy and paste. If the copied code has defects, the defects also get replicated, and if the code is further copied, it may lead to defect propagation. Repeated code, especially within a project, can cause maintenance issues and drive up its cost. This is because if a change is required in the code, it has to be carried out at all the clone locations, which requires more time and effort and can result in messy code. Though mostly discouraged, repeated code can in some cases provide certain advantages. A repeated method with specifications can help in synthesizing specifications for rarely used similar APIs using fine-grained slicing techniques. Repetition can also help code suggestion tools by aiding in generating better models for suggestions.

The repetition of code can occur at the lexical, syntactic, and semantic levels [4, 14, 17]. Lexical techniques like token-for-token similarity and syntactic techniques like statement-for-statement similarity have been used quite extensively in previous research, but there has been a recent increase in semantics-based techniques. Nguyen et al. [31] published the first semantic-analysis-based large scale study on the repetitiveness, composition, and containment of source code. They used a technique based on program dependence graphs and slicing to find the percentage of semantically similar routines or methods. Their work answered five research questions, such as the repetition rate of routines within as well as across projects, the repetition of routines as parts of other routines, the repetition of parts of routines from other places, the repetition of multiple routines together, and the repetitiveness of routines (or parts of routines) from common libraries. The data used for the research is a snapshot of the SourceForge repository consisting of under 10K projects and just over 17M methods. The program dependence graphs are used to compare routines for similarity and their slices are used to answer questions on routine composability, which is defined as the number of per-variable slicing subgraphs (PVSG) composing a method.

In this work, we repeat the work of Nguyen et al. [31] and answer two of their five research questions, which are as follows.

RQ1. How likely is a routine for a task to be repeated exactly elsewhere in its entirety?

RQ2. How often are portions of a routine repeated or repeated together; what is the unique set of all such portions?

One problem with the earlier approach is that it is difficult to replicate or extend the research because of the heavy lifting required to set up an infrastructure which can mine on a scale of 100K projects. In this work we leverage the existing Boa [11] infrastructure, which is capable of mining ultra-large software repositories. The Boa infrastructure consists of a domain-specific language, a compiler for translating it, multiple datasets such as GitHub and SourceForge, a back-end based on map-reduce for parallel distributed processing of the datasets, and a front-end for writing Boa queries. For this research we used the GitHub-2015 dataset. To carry out the analysis we extended the Boa infrastructure with graph analysis capabilities. Boa now supports the construction of control flow graphs, dominator and post-dominator trees, control dependence graphs, data dependence graphs, and program dependence graphs, and also enables slicing control flow graphs and program dependence graphs. With the enhanced infrastructure, performing highly scalable research on code repetitiveness and other areas requiring large scale analysis is as easy as writing 30-40 line Boa queries.

CHAPTER 2. RELATED APPROACHES

This chapter describes various graphs and trees which are intermediate data structures needed for building a program slicer. The following sections present different graph representations of the example method shown in Figure 2.1.

1 void factorial(int arg) {
2     int count = 1;
3     int fact = 1;
4     while (count <= arg) {
5         fact = fact * count;
6         count = count + 1;
7     }
8     return fact;
9 }

Figure 2.1: An example Java method to iteratively compute factorials.

This is a simple Java method that takes an integer as input and returns its factorial. The method initializes two local variables, count and fact, to 1. The variable fact is used to store the result of the factorial of arg. The variable count generates integers from 1 to arg, which are multiplied with fact one at a time. When all the integers have been multiplied, control comes out of the loop and the factorial is returned to the caller.

2.1 Control Flow Graphs

A control flow graph (CFG) [5] is a directed graph that represents all possible execution paths in a program. A node can represent a single expression/statement or a basic block, which consists of a straight-line sequence of expressions/statements with no branching. A basic block representation is more compact because of the reduced number of nodes and edges and therefore has a smaller memory footprint. An expression/statement representation can simplify analysis in some cases but has a higher memory footprint and is less efficient to build. In this work, we refer to a single expression/statement as a node for convenience. An edge represents a possible control flow path from one node to another node in the CFG. A CFG can be constructed from any suitable intermediate representation of the source code, such as an abstract syntax tree (AST).

[Figure: the control flow graph, with nodes [0] START, [1] int count = 1, [2] int fact = 1, [3] count <= arg, [4] fact = fact * count, [5] count = count + 1, [6] return fact, and [7] STOP, and labeled T/F/B edges.]

Figure 2.2: A control flow graph for the example code shown in Figure 2.1.

Figure 2.2 shows the control flow graph of the method factorial shown in Figure 2.1. The nodes START and STOP in the figure are special kinds of nodes added to the CFG to show entry into and exit from the method, respectively. The numbers inside brackets in each box show the node ids, which can be mapped to the statement numbers in the method by decreasing one from them. The labels T and F correspond to the true and false branches of a predicate node. The label B represents a back-edge of a loop. Before constructing a post dominator tree from the CFG, we augment it with a new predicate node named ENTRY [13]. The ENTRY node models the condition which triggers the method call, i.e., entry into the method. It has two branches, the true branch going to the START node and the false branch going to the STOP node. The newly augmented CFG is shown in Figure 2.3.
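The augmented CFG just described can be encoded as a small labeled adjacency structure. The sketch below (in Python; the encoding itself is our illustration, not part of Boa) uses the node ids of Figure 2.3, keeps ENTRY as a named node, and records each edge with its T/F/B label:

```python
# Augmented CFG of Figure 2.3. Each entry maps a node to its
# (successor, edge-label) pairs; '' marks an unlabeled fall-through edge.
cfg = {
    "ENTRY": [(0, "T"), (7, "F")],  # ENTRY models the condition triggering the call
    0: [(1, "")],                   # START
    1: [(2, "")],                   # int count = 1
    2: [(3, "")],                   # int fact = 1
    3: [(4, "T"), (6, "F")],        # count <= arg (predicate)
    4: [(5, "")],                   # fact = fact * count
    5: [(3, "B")],                  # count = count + 1 (back-edge to loop header)
    6: [(7, "")],                   # return fact
    7: [],                          # STOP
}

# Basic well-formedness checks: every edge target is a known node, and the
# only branching nodes are the two predicates (ENTRY and node 3).
assert all(dst in cfg for succs in cfg.values() for dst, _ in succs)
branching = {n for n, succs in cfg.items() if len(succs) > 1}
```

The later graph constructions in this chapter all start from an edge list of this shape.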

[Figure: the CFG of Figure 2.2 with the new ENTRY predicate node added; its T edge goes to [0] START and its F edge to [7] STOP.]

Figure 2.3: The CFG in Figure 2.2 augmented with an ENTRY node.

2.2 Post Dominator Trees

A post dominator tree (PDT) is a tree representation in which the parent of a node represents its immediate post dominator. To understand a PDT it is important to first understand the concept of post dominance, which can be defined as follows: a node n2 post dominates a node n1 if all paths from n1 to the STOP node go through n2. The post dominator set for each node can be computed by applying the data-flow equations in Figure 2.4 to the reverse CFG (a reverse CFG is the original CFG with all its edges reversed, so that control enters from the STOP node and exits through the START node).

Dom(n0) = {n0}

Dom(n) = ( ⋂_{p ∈ pred(n)} Dom(p) ) ∪ {n}

Figure 2.4: The data-flow equations for computing dominators.

From the post dominator set of a node, it is easy to determine its immediate post dominator, which post dominates only the concerned node and no other node from the set. A PDT can then be constructed by connecting each node to its immediate post dominator, and since each node has exactly one immediate post dominator (except the STOP node, which has none), the resulting connections form a tree. Alternatively, there is a direct and more efficient algorithm by Lengauer and Tarjan [27] for building a PDT. Due to its complexity, we chose not to use this alternate algorithm.

Figure 2.5 shows the PDT for the augmented control flow graph in Figure 2.3. Given the tree, it is easy to find the post dominators and immediate post dominators of a node. To get the post dominator set for a node, traverse from the parent of the given node all the way up to the STOP node. For example, the post dominator set of node 2 is {3, 6, 7}. As can be seen in the figure, the START node does not post dominate any node and the STOP node post dominates all other nodes in the graph.
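The fixed-point computation above can be sketched directly. The snippet below (our illustration; node ids follow Figure 2.3, with STOP = 7 as the exit) applies the dominator equations of Figure 2.4 to the reverse CFG by iterating post-dominator sets until they stabilize:

```python
# Successor map of the augmented CFG (Figure 2.3). Intersecting the
# post-dominator sets of a node's successors here is equivalent to
# intersecting predecessor dominator sets on the reverse CFG.
succ = {"ENTRY": [0, 7], 0: [1], 1: [2], 2: [3], 3: [4, 6], 4: [5], 5: [3], 6: [7], 7: []}
EXIT = 7  # STOP: the entry node of the reverse CFG

nodes = list(succ)
pdom = {n: set(nodes) for n in nodes}  # initialize to the full node set
pdom[EXIT] = {EXIT}

changed = True
while changed:
    changed = False
    for n in nodes:
        if n == EXIT:
            continue
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n], changed = new, True

def ipdom(n):
    """Immediate post dominator: the strict post dominator of n whose own
    post dominator set equals the rest of n's strict set (the PDT parent)."""
    strict = pdom[n] - {n}
    for m in strict:
        if pdom[m] == strict:
            return m
```

On the example, pdom[2] − {2} is {3, 6, 7}, matching the post dominator set of node 2 read off the PDT in Figure 2.5, and ipdom(2) is 3.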

[Figure: the post dominator tree, rooted at [7] STOP with children [6] return fact and ENTRY; [6] has child [3] count <= arg; [3] has children [2] int fact = 1 and [5] count = count + 1; [2] has child [1] int count = 1; [5] has child [4] fact = fact * count; [1] has child [0] START.]

Figure 2.5: A post dominator tree for the augmented CFG shown in Figure 2.3.

2.3 Control Dependence Graphs

A control dependence graph (CDG) represents control dependencies among CFG nodes. The labels of control edges correspond to the predicate being true (T) or false (F) at the control node. A CDG can be built by walking up a PDT from the destination node to the source node for certain control edges in the augmented CFG. For example, consider the control edge 3 → 4 in Figure 2.3, where the source node is 3 and the destination node is 4. On traversing the PDT in Figure 2.5 from node 4 to node 3, we get nodes 4 and 5 as the dependent nodes. We do not include those control edges where the source node is post dominated by the destination node. For example, we skip the edge 3 → 6 because 3 is post dominated by 6. The construction of CDGs is explained in much more detail by Ferrante et al. [13].

The CDG presented in this work differs from that given in [13] as it does not include the concept of region nodes. A CDG can also be built using other approaches, such as building a post dominance frontier [10] and then reversing it to get the control dependences. Both approaches are quite easy to follow, but in this work we have used Ferrante's approach.

[Figure: the CDG, with [0] ENTRY having T edges to [1] int count = 1, [2] int fact = 1, [3] count <= arg, and [6] return fact; [3] having T edges to [4] fact = fact * count and [5] count = count + 1, and a T self edge.]

Figure 2.6: A control dependence graph for the PDT shown in Figure 2.5.

Figure 2.6 shows a CDG constructed from the PDT in Figure 2.5. For constructing a CDG, only the true branch from the ENTRY node is considered and the false branch is ignored because the STOP node post dominates the ENTRY node. Every node in a CDG is either directly or transitively dependent on the ENTRY node because it models the external condition which triggers the method. In Figure 2.6, we can see the dependency of different nodes on the two predicate nodes 0 and 3. In a CDG, the successors of each node show its dependent nodes. For example, nodes 4 and 5 are control dependent on node 3. In other words, the execution of nodes 4 and 5 depends on node 3. Notice also that node 3 is control dependent on itself.
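The PDT walk just described can be sketched as follows (our illustration; the PDT parent map is read off Figure 2.5 and the labeled edges off Figure 2.3, with the synthetic START node dropped from the result as in Figure 2.6):

```python
# PDT parent map from Figure 2.5 (7 = STOP is the root and has no parent).
parent = {"ENTRY": 7, 0: 1, 1: 2, 2: 3, 3: 6, 4: 5, 5: 3, 6: 7}

# Labeled edges of the augmented CFG in Figure 2.3.
cfg_edges = [("ENTRY", 0, "T"), ("ENTRY", 7, "F"), (0, 1, ""), (1, 2, ""),
             (2, 3, ""), (3, 4, "T"), (3, 6, "F"), (4, 5, ""), (5, 3, "B")]

def pdt_ancestors(n):
    """All proper ancestors of n in the PDT, i.e., its strict post dominators."""
    out = set()
    while n in parent:
        n = parent[n]
        out.add(n)
    return out

cdg = {}
for a, b, label in cfg_edges:
    if b in pdt_ancestors(a):    # skip edges whose target post-dominates the source
        continue
    n, stop = b, parent[a]       # walk up the PDT from b to a's parent (exclusive)
    while n != stop:
        if n != 0:               # drop the synthetic START node, as in Figure 2.6
            cdg.setdefault(a, set()).add((n, label))
        n = parent[n]
```

Node 3 ends up control dependent on itself, and nodes 1, 2, 3, and 6 on ENTRY, matching Figure 2.6.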

2.4 Data Dependence Graphs

A data dependence graph (DDG) represents data dependencies among CFG nodes. A DDG can be built by connecting variable definitions to their uses; the connections are known as def-use chains. Def-use chains can be computed using the same data-flow equations that are used for live variable analysis, with one modification to keep track of the node id alongside each variable in the sets. So while for live variable analysis out[n] and in[n] are sets of variables, in this case they are sets of (variable, node) pairs. A DDG is usually a disconnected graph.

in[n] = use[n] ∪ (out[n] − def[n])

out[n] = ⋃_{s ∈ succ[n]} in[s]

Figure 2.7: The data-flow equations for computing def-use chains.

[Figure: the DDG, with count edges from [1] int count = 1 and [5] count = count + 1 to nodes [3], [4], and [5], and fact edges from [2] int fact = 1 and [4] fact = fact * count to nodes [4] and [6].]

Figure 2.8: A data dependence graph for the augmented CFG shown in Figure 2.3.

Figure 2.8 shows a DDG constructed by applying the data-flow equations in Figure 2.7 to the augmented CFG from Figure 2.3. The label of each edge shows the dependent variable for the successor node. Here the ENTRY node is added just to be consistent with the previous graphs. The node, however, does not represent any data dependence for the successor nodes, which is also evident from the empty labels of the edges originating from it.
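A minimal sketch of this modified liveness analysis (ours, not the Boa implementation) propagates (variable, use-node) pairs backward through the CFG and then connects each definition to the uses that reach it:

```python
# CFG of Figure 2.2 (ENTRY carries no data dependence, so it is omitted).
# defs/uses per node follow the example factorial method.
succ = {0: [1], 1: [2], 2: [3], 3: [4, 6], 4: [5], 5: [3], 6: [7], 7: []}
defs = {1: {"count"}, 2: {"fact"}, 4: {"fact"}, 5: {"count"}}
uses = {3: {"count", "arg"}, 4: {"fact", "count"}, 5: {"count"}, 6: {"fact"}}

# Backward fixpoint over sets of (variable, use-node) pairs:
#   in[n]  = use[n] ∪ (out[n] minus pairs whose variable is defined at n)
#   out[n] = union of in[s] over successors s
IN = {n: set() for n in succ}
OUT = {n: set() for n in succ}
changed = True
while changed:
    changed = False
    for n in succ:
        out_n = set().union(*(IN[s] for s in succ[n])) if succ[n] else set()
        in_n = {(v, n) for v in uses.get(n, set())} | \
               {(v, m) for (v, m) in out_n if v not in defs.get(n, set())}
        if out_n != OUT[n] or in_n != IN[n]:
            OUT[n], IN[n], changed = out_n, in_n, True

# A def-use edge connects a definition of v at node n to every
# use of v recorded in out[n].
du_edges = {(n, m) for n, dvars in defs.items()
            for (v, m) in OUT[n] if v in dvars}
```

The resulting edges, e.g. count flowing from node 1 to nodes 3, 4, and 5, correspond to the labels in Figure 2.8, including a loop-carried edge from node 5 to itself.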

2.5 Program Dependence Graphs

A program dependence graph [13] (PDG) represents both control and data dependency among CFG nodes. A PDG is a composition of a CDG and a DDG. To construct a PDG, first a CDG is constructed and then edges representing data dependencies (def-use chains) are overlaid on it. A PDG is a very sophisticated data structure and is used in various complex analyses and compiler optimizations.

[Figure: the PDG, overlaying the CONTROL:T edges of the CDG in Figure 2.6 with the DATA:count and DATA:fact edges of the DDG in Figure 2.8.]

Figure 2.9: A program dependence graph for the augmented CDG in Figure 2.6 and DDG in Figure 2.8.

Figure 2.9 shows a PDG constructed from the augmented CDG and DDG of Figure 2.6 and Figure 2.8, respectively. Each edge shows either a control or data dependency, which is made explicit by the labels. The total number of edges in a PDG is the sum of the edges in the CDG and the DDG which are used to compose it. As mentioned earlier, we do not include edges originating from the ENTRY node in the DDG, and therefore we only see control edges emanating from the ENTRY node of the PDG.
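Since a PDG is just the overlay of the two edge sets, its construction reduces to a labeled union. A sketch (ours; the edge sets are read off Figures 2.6 and 2.8, in the CDG numbering where node 0 is ENTRY):

```python
# Control edges from the CDG in Figure 2.6 (label = branch taken).
cdg_edges = {(0, 1, "T"), (0, 2, "T"), (0, 3, "T"), (0, 6, "T"),
             (3, 3, "T"), (3, 4, "T"), (3, 5, "T")}

# Data edges from the DDG in Figure 2.8 (label = the flowing variable).
ddg_edges = {(1, 3, "count"), (1, 4, "count"), (1, 5, "count"),
             (5, 3, "count"), (5, 4, "count"), (5, 5, "count"),
             (2, 4, "fact"), (2, 6, "fact"), (4, 4, "fact"), (4, 6, "fact")}

# The PDG keeps both kinds of edges, tagging each with its origin,
# exactly as the CONTROL:/DATA: labels in Figure 2.9.
pdg_edges = {(s, d, "CONTROL:" + l) for s, d, l in cdg_edges} | \
            {(s, d, "DATA:" + l) for s, d, l in ddg_edges}
```

The edge count of the merged graph is the sum of the two inputs, as stated above.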

2.6 Program Slicing

Program slicing is a method of filtering out certain statements of interest from a program based on some slicing criteria. The term was first defined in 1981 by Weiser [41], who compared it to the mental abstraction of relevant code statements that programmers make when debugging a piece of code. Program slicing can be classified as intra-procedural, when a graph does not cross the boundary of the method, or inter-procedural [18, 41], when a graph of the entire program or project is constructed. The graphs can be of different kinds, such as control flow graphs, program dependence graphs, system dependence graphs, etc., and some of these are explained in the previous sections. The other major classifications include forward and backward slices; static, quasi-static, and dynamic slices; and executable and closure slices [38].

Forward slicing is used to determine the effect of certain selected variables (aka the slicing criteria) on other statements in the program, whereas backward slicing produces all the statements which can affect the variables at the slicing criteria location. The slicing criteria of dynamic slicing [1, 7, 15, 23, 43], unlike static slicing, takes into account the input values of the variables, which enables dynamic slicing to generate significantly smaller slices by excluding unnecessary branches. The dynamic slicing technique has been applied to a number of problems and has shown promising results, especially in the field of software testing and debugging [24, 29, 39]. When the sliced statements form an executable subgraph, the slice produced is said to be executable. On the other hand, if the subgraph is not executable then the slice is called a closure slice. While most of the earlier work on program slicing focused on approaches which preserved the syntactic structure of the program, Harman and Danicic presented a technique for amorphous slicing [6, 12, 16] which is semantics preserving but performs necessary transformations of the source code statements to reduce the size of the slices. In this work we only deal with forward static slicing.

2.6.1 Forward Static Slicing

Forward static slicing is used to generate the statements which are affected by the variables in the slicing criteria. For this reason it is also called ripple effect slicing. The slicing criteria for forward static slicing is a pair of a program location and a set of variables, and unlike dynamic slicing, does not include input values of the variables. The method in Figure 2.1 is sliced using a slicing criteria on the local variable fact, and the slice obtained is shown highlighted in Figure 2.10. The slice shows all the statements which are affected by the local variable fact.

1 void factorial(int arg) {
2     int count = 1;
3     int fact = 1; // criteria
4     while (count <= arg) {
5         fact = fact * count;
6         count = count + 1;
7     }
8     return fact;
9 }

Figure 2.10: An example forward static slice criteria, with all statements contributing to the slice highlighted.

To compute slices, it is more convenient to first translate the code into a suitable intermediate representation, such as a PDG, and then slice it with a given slicing criteria to obtain the desired slice. Figure 2.11 shows the slice obtained by slicing the PDG from Figure 2.9 with the slicing criteria. The slice obtained with a PDG captures both control and data dependencies, unlike a slice on a CFG, which only captures control dependencies.
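On a PDG, computing a forward static slice amounts to collecting every node reachable from the criterion node along control and data edges. A sketch (ours; the PDG successors are read off Figure 2.9 with ENTRY as node 0, and the criterion is taken to be the definition of fact at node 2):

```python
from collections import deque

# Successor map of the PDG in Figure 2.9, control and data edges combined.
pdg = {0: [1, 2, 3, 6], 3: [3, 4, 5],   # CONTROL edges
       1: [3, 4, 5], 5: [3, 4, 5],      # DATA:count edges
       2: [4, 6], 4: [4, 6]}            # DATA:fact edges

def forward_slice(succ, criterion):
    """All nodes affected by the criterion node: forward reachability."""
    seen, work = {criterion}, deque([criterion])
    while work:
        n = work.popleft()
        for m in succ.get(n, []):
            if m not in seen:
                seen.add(m)
                work.append(m)
    return seen
```

forward_slice(pdg, 2) returns {2, 4, 6}, i.e., statements 3, 5, and 8 of Figure 2.1, the statements highlighted in Figure 2.10.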

[Figure: the sliced PDG, containing [1] arg = k, [4] count <= arg, [5] fact = fact * count, [6] count = count + 1, and [7] return fact, with their CONTROL:T and DATA edges.]

Figure 2.11: A forward static slice using criteria on the PDG shown in Figure 2.9.

CHAPTER 3. RELATED EMPIRICAL STUDIES

Code repetitiveness and code uniqueness have been studied quite extensively in previous research. In this chapter we discuss prior research in these fields.

3.1 Code Repetition Studies

Techniques for finding repetition in source code across software systems include tree based, graph based, token or n-gram based, heuristic based, and machine learning approaches. Here we overview a few of these approaches.

There have been numerous studies on the identification of source code clones [21, 22, 25, 26] across software systems, which resulted in different taxonomies and classifications of clones, such as true clones and accidental clones [2]. Jiang and Su [20] presented an algorithm to search for similar code fragments of arbitrary sizes at the functional level using the technique of random testing. Researchers have employed syntax-based, semantics-based, as well as many hybrid techniques for searching for similar code. Reiss [36] presented a semantics-based code search technique to generate user-requested functions or classes using a set of program transformations. Several heuristics and meta-heuristics [8, 40] have also been tried for the large scale detection of repetitiveness in source code.

Hindle et al. [17] introduced an n-gram based approach to find the repetitive nature of software when broken into its constituent pieces at the source level. Since the n-gram model alone fails to capture the localness of software, Tu et al. [37] came up with an approach that augments the n-gram model with a cache component to exploit the locality of the software. Nguyen et al. [32] described a syntax-similarity based approach to find repetitive code changes in software evolution using differences in sub-AST structures for a pair of code pieces. They later presented an approach which uses semantic information to detect source code repetitiveness [33]. This is different from an n-gram model, which uses lexemes or syntactic similarity to determine the uniqueness of source code.

White et al. [42] employed a deep learning based technique for training neural networks on code fragments to identify code clones. The technique exploits both lexical and syntactic structures to link patterns. Cesare and Xiang [9] presented a framework for finding package-level clones using pattern classification techniques and proposed a number of systems based on it. Mahmoud and Bradshaw [30] described an approach that takes into account domain-specific terms in the source code to augment semantics-based code analysis techniques. The authors used the Normalized Software Distance (NSD) method, which captures semantic relatedness in the source code.

Finally, Nguyen et al. [31] presented a large scale study on open source projects to find code repetitiveness by comparing PDGs and subgraphs extracted from PDGs using slicing. This work is the focus of our reproduction study.

3.2 Code Uniqueness Studies

The techniques used for identifying code uniqueness are similar to those used for identifying repetitiveness. Gabel and Su [14] presented the first study on code uniqueness, which answers questions about software uniqueness at a given level of granularity. Ray et al. [35] presented a definition of unique changes and proposed a methodology to identify them. The study establishes how prevalent unique code is and investigates where it occurs most commonly. Petke et al. [34] performed a comprehensive survey on the improvement of software using genetic techniques. It covers various techniques, such as parameter tuning and slicing, that have been used to improve or transform software. Alexandru et al. [3] designed a generic framework called LISA (Lean Language-Independent Software Analyzer) which represents and analyzes multi-revisioned software artifacts and employs a unique, redundancy-free, multi-revision representation to avoid re-computation of the same artifacts.

CHAPTER 4. METHODOLOGY

This work reproduces two of the five research questions from the previous work by Nguyen et al. [31], but using the Boa infrastructure and dataset. Similar to their study, this work constructs program dependence graphs and then uses them to compare either the entire methods or the per-argument slicing subgraphs (PASG) obtained by slicing a method's PDG on each argument. The latter is different from the per-variable slicing subgraphs (PVSG) used in the prior work, where a PDG is sliced on every local variable. This is the main difference between our approach and their approach, meaning our RQ2 is slightly modified to use PASGs instead of PVSGs.

4.1 Augmented Graphs

To enable PASGs in Boa, we augment control flow graphs with new argument assignment statements that are placed at the top immediately after the START node. Since all remaining graphs in Boa are built starting from the control flow graph, these new nodes cascade to all other graph kinds. An example of the new control flow graph with augmented nodes for the CFG in Figure 2.3 is shown in Figure 4.1.

As can be seen by comparing with the code example in Figure 2.1, the argument arg of the method factorial is now a node in the control flow graph. This argument is assigned an arbitrary value (k). Since we are doing static analysis, the choice of the arbitrary value does not affect the results. We skip showing the other augmented intermediary graphs, as they are similar, and directly show the final augmented PDG in Figure 4.2.
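The augmentation step itself is mechanical: one synthetic assignment per argument is spliced in immediately after START, and every later node shifts by the number of arguments. A sketch over a simple statement list (our illustration, not Boa's actual implementation):

```python
def augment_with_args(stmts, params, value="k"):
    """Insert one synthetic 'param = value' node per method argument
    immediately after the START node (index 0); later nodes are
    renumbered implicitly by their new list positions."""
    # The assigned value is arbitrary: the analysis is static, so it never runs.
    assigns = [f"{p} = {value}" for p in params]
    return stmts[:1] + assigns + stmts[1:]

# Statement list of the factorial CFG in Figure 2.2 (node id = list index).
stmts = ["START", "int count = 1", "int fact = 1", "count <= arg",
         "fact = fact * count", "count = count + 1", "return fact", "STOP"]
augmented = augment_with_args(stmts, ["arg"])
```

The resulting numbering, with arg = k as node 1 and the loop predicate shifted to node 4, matches Figure 4.1.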

4.2 Normalization

To normalize a program dependence graph or a PDG slice, the normalizing algorithm traverses the graph in a depth-first order. At each node it sorts all the node's successors and all its outgoing edges using their respective node/edge ids. The normalized program dependence graphs and slices also

[0] START

T

[1] arg = k

ENTRY

[2] int count = 1

[3] int fact = 1

[4] count <= arg F

T

B [5] fact = fact * count

[7] return fact

[6] count = count + 1

[8] STOP

Figure 4.1: The augmented CFG based on Figure 2.3. 19

[0] ENTRY

CONTROL:T CONTROL:T

[2] int count = 1 CONTROL:T [1] arg = k

CONTROL:T DATA:count DATA:arg

DATA:count [4] count <= arg CONTROL:T

CONTROL:T DATA:count DATA:count CONTROL:T

[3] int fact = 1 [6] count = count + 1 DATA:count CONTROL:T

DATA:fact DATA:count

DATA:fact [5] fact = fact * count DATA:fact

DATA:fact

[7] return fact

Figure 4.2: The augmented PDG based on Figure 2.9. 20

[0] ENTRY

CONTROL:T CONTROL:T

[2] int var$2 = 1 CONTROL:T [1] var$1 = k

CONTROL:T DATA:var$2 DATA:var$1

DATA:var$2 [4] var$2 <= var$1 CONTROL:T

CONTROL:T DATA:var$2 DATA:var$2 CONTROL:T

[3] int var$3 = 1 [6] var$2 = var$2 + 1 DATA:var$2 CONTROL:T

DATA:var$3 DATA:var$2

DATA:var$3 [5] var$3 = var$3 * var$2 DATA:var$3

DATA:var$3

[7] return var$3

Figure 4.3: The normalized PDG based on Figure 4.2. have all their variable names replaced with the pattern var$(number). For example the normal- ized version of the expression x = y + 5 will be var$1 = var$2 + 5. The normalization algorithm keeps a map of already seen variables and their normalized names. For the previous expression, the map will contain {x: var$1, y: var$2}.

The algorithm traverses the program dependence graph (or its slice), and when it reaches a node containing an expression it first looks up whether the variables in the expression have already been assigned names. If a variable is present in the map, the algorithm replaces the variable with its normalized name; if it is not present, the algorithm generates the next sequential name, uses it to replace the variable, and stores the new mapping. For example, the normalized version of the program dependence graph in Figure 4.2 is shown in Figure 4.3.

Figure 4.4: The PASG for the PDG in Figure 4.2.
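The renaming half of normalization can be sketched in Python. This is a deliberately minimal model that works on statement strings rather than PDG nodes, and it naively treats every identifier as a variable; a real implementation would exclude keywords, type names, and literals:

```python
import re

def normalize_names(statements):
    """Rewrite identifiers to var$1, var$2, ... in first-seen order,
    keeping a map from original names to normalized names."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "var$%d" % (len(mapping) + 1)
        return mapping[name]

    identifier = re.compile(r"\b[A-Za-z_]\w*\b")
    return [identifier.sub(rename, s) for s in statements], mapping
```

For the running example, normalize_names(["x = y + 5"]) yields ["var$1 = var$2 + 5"] with the map {x: var$1, y: var$2}.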

4.3 Per-Argument Slicing subGraphs (PASG)

A per-argument slicing subgraph (PASG) is obtained when a program dependence graph is sliced on one of the method’s arguments. The subgraph obtained is used to answer research questions related to the containment and composability of routines and their slices. The PASG of the program dependence graph of Figure 4.2 on its only argument arg is shown in Figure 4.4, and its normalized version is shown in Figure 4.5.
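Viewing slicing as reachability over dependence edges, a PASG can be sketched in Python. This is a simplified model: a PDG is assumed to be a list of (source, target, label) edge triples, and the slice is computed as a forward traversal from the argument-assignment node. In Boa the actual work is done by getpdgslice:

```python
from collections import defaultdict

def slice_on_argument(edges, arg_node):
    """Keep the PDG nodes reachable from the argument node along
    dependence edges, plus the edges between retained nodes."""
    successors = defaultdict(list)
    for src, dst, _label in edges:
        successors[src].append(dst)
    kept, stack = set(), [arg_node]
    while stack:
        node = stack.pop()
        if node not in kept:
            kept.add(node)
            stack.extend(successors[node])
    return kept, [e for e in edges if e[0] in kept and e[1] in kept]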

4.4 Implementation in Boa

Boa [11] is an enabling language and infrastructure for mining ultra-large-scale software repositories. It consists of a domain-specific language, a compiler for the language, multiple datasets each containing almost a million open-source projects, a back-end based on MapReduce for parallel distributed processing of the datasets, and a front-end for writing programs in the Boa language. Boa’s domain-specific language is a statically typed query language which tightly integrates with the Boa infrastructure. The compiler compiles a query to a Java program which acts as a map task in the MapReduce model, and each map task operates on one open-source project. The number of map tasks running concurrently depends on the back-end and can be adjusted according to the core count of Boa’s back-end cluster, so depending on the core count it is possible to achieve a very high degree of parallelism. Boa’s datasets are based on open-source repositories like GitHub, SourceForge, etc. Each project’s code in the repository is parsed and made available to users for analysis using Boa queries. Boa gets new datasets each year while preserving the datasets from previous years so that they remain available to researchers for experimenting with and verifying previous results.

Figure 4.5: The normalized PASG based on Figure 4.4.

4.4.1 Boa Graphs API

The Boa graph API provides access to several kinds of graphs and trees. It includes control flow graphs, control dependence graphs, data dependence graphs, program dependence graphs, dominator trees, post-dominator trees, and two slicers: one for program dependence graphs and another for control flow graphs. This Boa graph library was implemented as part of this research.

To get a graph object, one of several builtin functions is called. For example, to get a control flow graph for a method a user calls getcfg(Method). Here, Method is a domain-specific type representing a Java method in Boa’s abstract syntax tree. Similarly, there are functions for the other graph kinds:

∙ getcfg(Method): returns a control flow graph

∙ getcdg(Method): returns a control dependence graph

∙ getddg(Method): returns a data dependence graph

∙ getpdg(Method): returns a program dependence graph

Some functions take optional arguments for normalizing the graph. Currently, only program dependence graphs and program dependence graph slices can be normalized in this way. We also provide a normalize() function to normalize any graph object.

There are also custom traversals available which can be used to traverse the graphs using standard traversal algorithms such as depth-first, post-order, worklist, etc. Boa also provides a way to visualize graphs: the dot(Graph) function returns a string representation of the Graph object. This representation uses GraphViz DOT, a graph description language, and GraphViz provides tools to turn a DOT description into a graph image. All graphs in this paper were produced using this Boa function and GraphViz.
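A minimal Python stand-in for dot(Graph), assuming a graph is given as (source, target, label) edge triples; the exact node formatting Boa emits may differ:

```python
def to_dot(name, edges):
    """Emit a GraphViz DOT digraph for a list of (src, dst, label) edges."""
    lines = ["digraph %s {" % name]
    for src, dst, label in edges:
        lines.append('  "%s" -> "%s" [label="%s"];' % (src, dst, label))
    lines.append("}")
    return "\n".join(lines)
```

The resulting string can then be rendered with, e.g., dot -Tpdf graph.dot -o graph.pdf.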

1 out: output sum[string][string] of int;
2 visit(input, visitor{ # last snapshot only
3   before c: CodeRepository -> {
4     if(len(c.revisions) >= 50) {
5       snapshot := getsnapshot(c, "SOURCE_JAVA_JLS");
6       if(len(snapshot) >= 50)
7         foreach (i: int; def(snapshot[i]))
8           visit(snapshot[i]);
9     }
10     stop;
11   }
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       pdg := getpdg(m, true);
15       normalize(pdg);
16       hash := getcrypthash(pdg, "SHA-1");
17       out[input.id][hash] << 1;
18     }
19   }
20 });

Figure 4.6: The Boa query for RQ1a (repetition within project).

4.5 Boa Queries

The Boa language can be used easily for writing queries to mine software repositories. For a single run, only one dataset can be queried. The language provides standard programming constructs such as loops (called quantifiers in Boa), conditionals, functions, and a function library for writing time-efficient queries that are also very readable. The output of a query can be combined using one of several Boa aggregators such as top, bottom, sum, and collection. In this work we created one query for each research sub-question.

For example, Figure 4.6 shows the query answering the first part of RQ1, that is, the repetition of methods within a project. The query declares a global out at the top, an output aggregator used to collect the results. Lines 2–20 show a custom visitor, invoked on line 2, which performs a custom traversal of the project source code represented as an abstract syntax tree.

1 out: output sum[string] of int;
2 seen: set of string;
3 visit(input, visitor{
4   before c: CodeRepository -> { # last snapshot only
5     ...
13   before m: Method -> {
14     if(len(m.statements) > 0) {
15       pdg := getpdg(m, true);
16       normalize(pdg);
17       hash := getcrypthash(pdg, "SHA-1");
18       if(!contains(seen, hash)) {
19         add(seen, hash);
20         out[hash] << 1;
21       }
22     }
23   }
24 });

Figure 4.7: The Boa query for RQ1b (repetition across projects).

The visitor provides access to each node in the tree through several domain-specific types. The two visitor constructs before and after perform the specified operations at each node of the tree before visiting the children of the node and after visiting the children of the node, respectively.

In this work we consider only Java projects, and among those we filter out projects with fewer than 50 revisions or fewer than 50 files, as shown on lines 3–11 of Figure 4.6. Line 13 of that figure filters out abstract methods (methods with no body), and the query then emits a SHA-1 cryptographic hash of each remaining method’s PDG. The across-projects variant in Figure 4.7 additionally declares a global set seen, which holds the hashes of PDGs already seen in the current project so that the same hash is not emitted twice for one project. The queries for the remaining RQs and their parts are shown in Figures 4.7 through 4.13.
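The hash-and-count idea behind these queries can be sketched in Python, with strings standing in for normalized PDGs; the Boa queries hash the actual graph object via getcrypthash and let the sum aggregator do the counting:

```python
import hashlib
from collections import Counter

def repetition_counts(normalized_pdgs):
    """Count occurrences of each normalized PDG, keyed by its SHA-1 hash;
    a count of n means the routine is repeated n - 1 times."""
    counts = Counter()
    for text in normalized_pdgs:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        counts[digest] += 1
    return counts
```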

1 out: output sum[string][string] of int;
2 visit(input, visitor{
3   before c: CodeRepository -> { # last snapshot only
4     ...
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       pdg := getpdg(m, true);
15       normalize(pdg);
16       graph_size := gettotalnodes(pdg) + gettotaledges(pdg);
17       hash := getcrypthash(pdg, "SHA-1");
18       out[graph_size][hash] << 1;
19     }
20   }
21 });

Figure 4.8: The Boa query for RQ1c (repetition by graph size).

1 out: output sum[string][string] of int;
2 visit(input, visitor{
3   before c: CodeRepository -> { # last snapshot only
4     ...
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       pdg := getpdg(m, true);
15       normalize(pdg);
16       num_control_nodes := gettotalcontrolnodes(pdg);
17       hash := getcrypthash(pdg, "SHA-1");
18       out[num_control_nodes][hash] << 1;
19     }
20   }
21 });

Figure 4.9: The Boa query for RQ1d (repetition by number of control nodes).

1 out: output sum[string][string] of int;
2 visit(input, visitor{
3   before c: CodeRepository -> { # last snapshot only
4     ...
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       foreach (i: int; def(m.arguments[i])) {
15         pdg := getpdg(m, true);
16         slice := getpdgslice(pdg, i, true);
17         slicehash := getcrypthash(slice, "SHA-1");
18         out[input.id][slicehash] << 1;
19       }
20     }
21   }
22 });

Figure 4.10: The Boa query for RQ2a (repetition within project).

1 out: output sum[string] of int;
2 seen: set of string;
3 visit(input, visitor{
4   before c: CodeRepository -> { # last snapshot only
5     ...
13   before m: Method -> {
14     if(len(m.statements) > 0) {
15       foreach (i: int; def(m.arguments[i])) {
16         pdg := getpdg(m, true);
17         slice := getpdgslice(pdg, i, true);
18         slicehash := getcrypthash(slice, "SHA-1");
19         if(!contains(seen, slicehash)) {
20           add(seen, slicehash);
21           out[slicehash] << 1;
22         }
23       }
24     }
25   }
26 });

Figure 4.11: The Boa query for RQ2b (repetition across projects).

1 out: output sum[string][string] of int;
2 visit(input, visitor{
3   before c: CodeRepository -> { # last snapshot only
4     ...
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       foreach (i: int; def(m.arguments[i])) {
15         pdg := getpdg(m, true);
16         slice := getpdgslice(pdg, i, true);
17         graph_size := gettotalnodes(slice) + gettotaledges(slice);
18         slicehash := getcrypthash(slice, "SHA-1");
19         out[graph_size][slicehash] << 1;
20       }
21     }
22   }
23 });

Figure 4.12: The Boa query for RQ2c (repetition by graph size).

1 out: output sum[string][string] of int;
2 visit(input, visitor{
3   before c: CodeRepository -> { # last snapshot only
4     ...
12   before m: Method -> {
13     if(len(m.statements) > 0) {
14       foreach (i: int; def(m.arguments[i])) {
15         pdg := getpdg(m, true);
16         slice := getpdgslice(pdg, i, true);
17         num_control_nodes := gettotalcontrolnodes(slice);
18         slicehash := getcrypthash(slice, "SHA-1");
19         out[num_control_nodes][slicehash] << 1;
20       }
21     }
22   }
23 });

Figure 4.13: The Boa query for RQ2d (repetition by number of control nodes).

CHAPTER 5. EVALUATION

In this section we present our results and compare them with those of Nguyen et al. [31]. In this work we answered two research questions, RQ1 and RQ2, which correspond to RQ1 and RQ4 of Nguyen et al. [31]. To help with comparisons, we kept the figure captions the same as in Nguyen et al. [31].

5.1 Dataset

The Boa dataset used is 2015 September/GitHub. The important statistics of the dataset are shown in Table 5.1, and the query used to generate them is shown in Figure 5.1. These statistics are after filtering out toy projects; almost 38k projects are analyzed.

Table 5.1: Analyzed Dataset Statistics

Total Projects            37,955
Total Classes             12,572,376
Total Methods             77,035,440
Total Statements          559,100,406
Total extracted PDGs      161,560,232
Total extracted PASGs     154,487,449

5.2 Post-processing

The post-processing script takes the data produced by the Boa queries (about 15GB total) and converts it to the SFrame [28] format. SFrame is a mutable dataframe object that can scale to big data. We used it because it stores data on disk and is not constrained by memory size. Once we have the data in the required format, we perform a number of transformations on the SFrame object.

Figure 5.4 shows the script used to plot the graphs from the data. Figure 5.2 shows the portion of the script used to compute parts A and B of both RQ1 and RQ2, and Figure 5.3 shows the portion used to compute parts C and D. Finally, Figure 5.5 shows the main portion of the script.
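The heart of the rq_AorB step, turning per-hash occurrence counts into the percentages plotted for parts A and B, can be sketched without SFrame as follows (a simplified model; the thesis scripts additionally filter rows by research question and pass the result to the plotting function):

```python
def repetition_percentages(occurrence_counts, max_reps=20):
    """Given the occurrence count of each distinct normalized PDG,
    return the percentage repeated n additional times (n = count - 1)."""
    total = len(occurrence_counts)
    percentages = dict((n, 0.0) for n in range(1, max_reps + 1))
    for count in occurrence_counts:
        n = count - 1  # a PDG seen 'count' times is repeated n times
        if 1 <= n <= max_reps:
            percentages[n] += 100.0 / total
    return percentages
```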

1 p: output sum of int; c: output sum of int;
2 m: output sum of int; s: output sum of int;
3 has := false; st: stack of bool;
4 has2 := false; st2: stack of bool;
5 # filter out projects that cause memory issues
6 bad := {"10455994", "4858268", "4857247", "2051480"};
7 ifall (i: int; input.id != bad[i])
8   visit(input, visitor{
9     before node: CodeRepository -> {
10       if(len(node.revisions) >= 50) {
11         snapshot := getsnapshot(node);
12         if(len(snapshot) >= 50) {
13           foreach (i: int; def(snapshot[i]))
14             visit(snapshot[i]);
15           p << 1;
16         }
17       }
18       stop;
19     }
20     before Declaration -> {
21       push(st2, has2);
22       has2 = false;
23     }
24     after Declaration -> {
25       if (has2) c << 1;
26       has2 = pop(st2);
27     }
28     before Method -> {
29       push(st, has);
30       has = false;
31     }
32     after Method -> {
33       if (has) {
34         m << 1;
35         has2 = true;
36       }
37       has = pop(st);
38     }
39     before Statement -> {
40       has = true;
41       s << 1;
42     }
43   });

Figure 5.1: Boa query to generate dataset statistics.

def rq_AorB(sf, rq, x_lower=1, x_upper=20, x_step=1, y_step=0.25,
            x_label='Number of Repetition'):
    sf_rq = sf[sf['hash'].contains(rq)]
    total_methods = sf_rq.num_rows()
    occurrence_count = sf_rq.groupby(key_columns='repeat',
            operations={'counts': graphlab.aggregate.COUNT()}).sort('repeat')
    rep_count = occurrence_count.to_dataframe().set_index('repeat') \
            .to_dict(orient='dict')['counts']
    rep_percent = dict()
    for k, v in rep_count.iteritems():
        if k > x_upper + 1: break
        if k > x_lower: rep_percent[k - 1] = v * 100.00 / total_methods
    for i in range(x_lower, x_upper + 1):
        if i not in rep_percent.keys(): rep_percent[i] = 0
    y_max = max(rep_percent.iteritems(), key=operator.itemgetter(1))[0]
    plot_graph(rep_percent, [x_lower, x_upper], [0, rep_percent[y_max] + 1],
               x_step, y_step, x_label, rq, 'o')

Figure 5.2: Python script to compute parts a/b from RQ1 and RQ2.

def rq_CorD(sf, rq, x_lower=3, x_upper=123, x_step=10, y_step=5, pad=5,
            x_label='Size'):
    sf_rq = sf[sf['hash'].contains(rq)]
    sf_rq['temp'] = sf_rq.apply(lambda x: x['hash'].lstrip(rq).strip('[').strip(']'))
    sf_rq['graph_property'] = sf_rq.apply(lambda x: int(re.sub('].*', ' ', x['temp'])))
    sf_rq.remove_columns(['temp', 'hash'])
    gproperty_rcount = sf_rq.groupby(key_columns='graph_property',
            operations={'counts': graphlab.aggregate.COUNT()}).to_dataframe() \
            .set_index('graph_property').to_dict(orient='dict')['counts']
    sf_rq = sf_rq[sf_rq['repeat'] > 1].groupby(key_columns='graph_property',
            operations={'counts': graphlab.aggregate.COUNT()}).to_dataframe() \
            .set_index('graph_property').to_dict(orient='dict')['counts']
    rep_percent = OrderedDict()
    for k, v in sorted(sf_rq.iteritems()):
        if int(k) > x_upper: break
        rep_percent[k] = v * 100.00 / gproperty_rcount[k]
    for k in range(0, x_lower):
        if k in rep_percent: del rep_percent[k]
    y_max = max(rep_percent.iteritems(), key=operator.itemgetter(1))[0]
    plot_graph(rep_percent, [x_lower, x_upper], [0, rep_percent[y_max] + pad],
               x_step, y_step, x_label, rq)

Figure 5.3: Python script to compute parts c/d from RQ1 and RQ2.

import re
import numpy
import operator
import graphlab
import graphlab.aggregate
import matplotlib.pylab
from collections import OrderedDict
from matplotlib.ticker import PercentFormatter

def plot_graph(repetition_percent_map, x_range=[0, 20], y_range=[0, 10],
               x_step=1, y_step=1, x_label='Number', fig_name='Repetition',
               mark='', col='blue'):
    lists = repetition_percent_map.items()
    x, y = zip(*lists)
    matplotlib.pylab.xlim(xmin=x_range[0], xmax=x_range[1])
    matplotlib.pylab.ylim(ymin=y_range[0], ymax=y_range[1])
    matplotlib.pylab.xticks(numpy.arange(x_range[0], x_range[1] + 1, x_step))
    matplotlib.pylab.yticks(numpy.arange(int(y_range[0]), y_range[1] + 1, y_step))
    matplotlib.pylab.xlabel(x_label, fontsize=15)
    matplotlib.pylab.ylabel('Percentage', fontsize=15)
    matplotlib.pylab.gca().yaxis.set_major_formatter(
        PercentFormatter(xmax=100, decimals=0))
    matplotlib.pylab.grid(True)
    matplotlib.pylab.plot(x, y, linewidth=3, color=col, marker=mark)
    matplotlib.pylab.savefig(fig_name + '.pdf')

Figure 5.4: Python script to plot graphs.

sf = graphlab.SFrame.read_csv('full.out.txt', delimiter="",
        usecols=['hash', 'equal', 'repeat']).remove_column('equal')
rq_AorB(sf, 'rq1a', 1, 20, 1, 1, 'Number of Repetitions (Repeated Routines)')
rq_AorB(sf, 'rq1b', 1, 20, 1, 1, 'Number of Repetitions (Repeated Routines)')
rq_CorD(sf, 'rq1c', 3, 123, 10, 10, 10, 'Graph Size')
rq_CorD(sf, 'rq1d', 0, 20, 2, 10, 10, 'Number of Control Nodes')
rq_AorB(sf, 'rq2a', 1, 20, 1, 1, 'Number of Repetitions (Repeated Subroutines)')
rq_AorB(sf, 'rq2b', 1, 20, 1, 1, 'Number of Repetitions (Repeated Subroutines)')
rq_CorD(sf, 'rq2c', 3, 53, 10, 10, 10, 'Graph Size')
rq_CorD(sf, 'rq2d', 0, 20, 2, 10, 10, 'Number of Control Nodes')

Figure 5.5: Main portion of Python script to perform post-processing.

Figure 5.6: % of entire routines realized elsewhere within a project.

5.3 RQ1: Repetitiveness of Routines

The first research question investigates the repetitiveness of entire routines within and across projects. The question has four parts which are evaluated in the following sub-sections.

5.3.1 RQ1a: Repetitiveness Within a Project

The repetitiveness of routines within a project is shown in Figure 5.6. The figure shows that the percentage of repeating routines decreases as the repetition count increases. Our experiment shows that 16.4% of routines repeat once, 4.2% twice, and 2.2% thrice within a project. The corresponding percentages from Nguyen et al. [31] are 6.7%, 2%, and 1.1%, respectively. For routines with more than 7 repetitions, the repeat percentage is less than 0.1% in our case as well as theirs. About 24% of routines are repeated up to 7 times in our case, which is higher than the 12.01% reported by Nguyen et al. [31].

Notice that the general trend we observed is the same as in the prior work. However, while the shape of the curve is the same, our percentages are roughly double and thus the line is shifted upward.

Figure 5.7: % of entire routines realized elsewhere in other projects.

5.3.2 RQ1b: Repetitiveness Across Projects

The repetitiveness of routines across projects is shown in Figure 5.7. This figure, like the last one, shows a sharply decreasing trend in the repetition percentage as the number of repetitions increases. The repeat percentages for one, two, and three repetitions are 10.8%, 3.8%, and 1.83%, respectively. The corresponding percentages from Nguyen et al. [31] are 2%, 0.5%, and 0.3%. Our result shows that for routines with more than 12 repetitions, the repeat percentage is less than 0.1%. About 19% of routines are repeated between 1 and 12 times across different projects.

Figure 5.8: Repetitiveness by graph size (V+E) in PDGs.

5.3.3 RQ1c: Repetitiveness by Graph Size

In this part we measure repetitiveness based on the size of the PDG, taken as the sum of its nodes and edges. Repetition is calculated for each group of methods with the same PDG size. For example, in Figure 5.8 the repetition for methods with a PDG of size 4 is about 58%. Methods with smaller PDGs, in the size range of 3–5, tend to repeat more, primarily because getters and setters fall into that size range. We have not included sizes 1 and 2 because they represent methods with empty bodies, having only entry and exit nodes. Both our results and those of Nguyen et al. [31] initially show a decreasing trend with PDG size, but their decrease is very sharp in the size range of 3–5. As can be seen in the figure, the repeat percentage is much less affected by graph size beyond 40. The Nguyen et al. [31] result confirms this trend.
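The grouping behind parts C and D can be sketched as follows. The input is hypothetical: for each graph size, the occurrence counts of the distinct PDGs in that size group; the real computation is the rq_CorD script shown in Figure 5.3:

```python
def repetition_by_group(counts_by_group):
    """For each group (e.g. PDG size V+E), compute the percentage of
    distinct PDGs in that group that occur more than once."""
    return dict(
        (group, 100.0 * sum(1 for c in counts if c > 1) / len(counts))
        for group, counts in counts_by_group.items()
    )
```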

Figure 5.9: Repetitiveness by number of control nodes in PDGs.

5.3.4 RQ1d: Repetitiveness by Number of Control Nodes

The number of control nodes in a program is a measure of its complexity: a program with more control nodes is considered more complex. In this part of RQ1 we calculate repetitiveness based on the number of control nodes in a method. The PDGs of the methods in the entire corpus are first grouped by the number of control nodes, and then repetition is measured within each group. The result in Figure 5.9 shows that methods with no control nodes repeat about 43% of the time. There is a decreasing repetition trend for the first six control nodes, after which repetition becomes effectively indifferent to the count, the same as in Nguyen et al. [31]. The comparison between our graph and that of Nguyen et al. [31] shows that both have a similar trend, but our average repetition percentage across all node counts is comparatively higher.

5.4 RQ2: Repetitiveness of PASGs

As explained in the methodology chapter, RQ2 is not exactly the same as that of Nguyen et al. [31] because we use PASGs instead of PVSGs. For these questions we first slice each routine’s PDG on each argument, producing what we call a PASG, and then find the repetitiveness of the slices in the source code dataset. These questions cover the same four parts as RQ1, but looking at repetition of subroutines instead.

5.4.1 RQ2a: Repetitiveness Within a Project

The trend for subroutine repetitiveness, shown in Figure 5.10, is very similar to the trend in the first part of RQ1. The figure shows an exponentially decreasing trend in repetitiveness as the repetition count increases. Our experiment shows about 17% of subroutines repeat once, 4.3% twice, and 2.6% thrice within a project. The corresponding repeat percentages are not reported by Nguyen et al. [31]. About 24% of subroutines are repeated between 1 and 8 times, and the repeat percentage for subroutines with more than 8 repetitions is less than 0.1%.

5.4.2 RQ2b: Repetitiveness Across Projects

The repetitiveness of subroutines across projects is shown in Figure 5.11. This figure also shows a decreasing trend with increasing repetition count. However, the percentage of subroutines that repeat once across projects is 10.9%, which is less than the percentage for the same repeat count within a project. The trend is similar to what we observed for entire routines in RQ1, where the within-project percentage exceeds the across-projects percentage for a single repetition. Again, most repetitions occur for repetition counts of at most 13, beyond which the repetition percentage approaches zero. From this we can conclude that the repetition trends for routines and subroutines are quite similar, with only small variations.

Figure 5.10: % of entire subroutines realized elsewhere within a project.

Figure 5.11: % of subroutines realized elsewhere in other projects.

Figure 5.12: % Repetitiveness by Graph Size.

5.4.3 RQ2c: Repetitiveness by Graph Size

Subroutines of size three repeat more than 45% of the time, as shown in Figure 5.12. Subroutines with smaller sizes are more likely to be repeated, as can be seen in the figure. As the size increases the repetition percentage does not vary much, though the decrease and the variations are larger than what we observed for routines in RQ1. Compared to Nguyen et al. [31], our curve decreases more smoothly and has fewer spikes or sudden variations. For both, the maximum repetition occurs when the graph size is under 10.

5.4.4 RQ2d: Repetitiveness by Number of Control Nodes

The trend for repetition by the number of control nodes is smoother than the repetition by graph size, and the decrease after 8 control nodes is much smaller than before it, as can be seen in Figure 5.13. Subroutines with no control nodes tend to repeat the most, with repetition as high as 44%. From this we can conclude that less complex subroutines are more likely to be repeated, but beyond a certain complexity level repetition becomes almost independent of the number of control nodes in the method.

Figure 5.13: % Repetitiveness by number of control nodes in subroutines.

5.5 Threats to Validity

The GitHub dataset we used includes data from branches other than master. As a result, it may lead to a higher repetition percentage than actually exists and may skew the final results slightly. A second threat comes from the hashing algorithm: we used SHA-1, which is 160 bits and has a nonzero collision probability that could affect the repetition percentage, though the chances of a collision are minuscule.
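The SHA-1 collision risk can be put in perspective with the standard birthday bound (a back-of-the-envelope sketch, not part of the thesis tooling): for n uniformly random b-bit hashes, the probability of any collision is approximately n^2 / 2^(b+1).

```python
def birthday_collision_probability(n_hashes, bits=160):
    """Birthday-bound approximation of the chance that any two of
    n uniformly random b-bit hashes collide: n^2 / 2^(b+1)."""
    return float(n_hashes) ** 2 / 2.0 ** (bits + 1)
```

For the roughly 161.5 million PDGs in this study the bound is far below 10^-30, so hash collisions cannot meaningfully distort the repetition percentages.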

CHAPTER 6. CONCLUSION

In this work we presented a large-scale study on routine repetitiveness in open source projects. The study is a partial reproduction of a prior study by Nguyen et al. [31] and looks at how often (sub)routines repeat both within a project and across projects. Our results show that routines are more likely to be repeated within a project than across projects: a total of 24% of routines repeat within a project, while a total of 19% repeat across projects. We also performed a similar study on subroutines, obtained by slicing routine PDGs on their arguments, which we call per-argument slicing subgraphs. The trend of subroutine repetition is similar to that of full routine repetition: a total of 24% of subroutines repeat within a project, while only 21% repeat across projects. The results based on graph size and number of control nodes also present interesting insights about the repetition of source code.

When compared with Nguyen et al. [31], we see similar trends, but the exact repetition percentages are higher in our case for both RQs. For RQ1, about 24% and 19% of routines repeat 2–7 times within and across projects, respectively, in our case; the corresponding percentages from Nguyen et al. [31] are 12.1% and 4%. The average repetition percentage for RQ1 (excluding trivial cases) based on graph size and number of control nodes is 35% in our case and 9% in theirs. For RQ2, Nguyen et al. [31] report only repetition by graph size, which is 4% on average; for us it is close to 35%. The differences between the two studies indicate the need for further exploration.

A direct application of these results could be in the synthesis of behavioral specifications of API methods. If an API method is composed of subroutines and those subroutines already have specifications in another project, perhaps those specifications could be re-used for the target API. In the future we plan to use this insight to automatically generate specifications for such API methods.

BIBLIOGRAPHY

[1] H. Agrawal and J. R. Horgan, “Dynamic program slicing,” in Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, ser. PLDI ’90. New York, NY, USA: ACM, 1990, pp. 246–256. [Online]. Available: http://doi.acm.org/10.1145/93542.93576

[2] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey, “Cloning by accident: an empirical study of source code cloning across software systems,” in 2005 International Symposium on Empirical Software Engineering. IEEE, 2005, 10 pp.

[3] C. V. Alexandru, S. Panichella, S. Proksch, and H. C. Gall, “Redundancy-free analysis of multi-revision software artifacts.”

[4] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, “Learning natural coding conventions,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014, pp. 281–293.

[5] F. E. Allen, “Control flow analysis,” in ACM Sigplan Notices, vol. 5, no. 7. ACM, 1970, pp. 1–19.

[6] D. Binkley, M. Harman, L. R. Raszewski, and C. Smith, “An empirical study of amorphous slicing as a program comprehension support tool,” in Proceedings IWPC 2000, 8th International Workshop on Program Comprehension, 2000, pp. 161–170.

[7] D. Binkley, S. Danicic, T. Gyimóthy, M. Harman, A. Kiss, and B. Korel, “Theoretical foundations of dynamic program slicing,” Theor. Comput. Sci., vol. 360, no. 1, pp. 23–41, Aug. 2006. [Online]. Available: http://dx.doi.org/10.1016/j.tcs.2006.01.012

[8] S. Bouktif, G. Antoniol, E. Merlo, and M. Neteler, “A novel approach to optimize clone refactoring activity,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 1885–1892.

[9] S. Cesare and Y. Xiang, Software similarity and classification. Springer Science & Business Media, 2012.

[10] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently computing static single assignment form and the control dependence graph,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 13, no. 4, pp. 451–490, 1991.

[11] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen, “Boa: A language and infrastructure for analyzing ultra-large-scale software repositories,” in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 422–431.

[12] D. Fatiregun, M. Harman, and R. M. Hierons, “Search-based amorphous slicing,” in 12th Working Conference on Reverse Engineering (WCRE’05), Nov 2005, pp. 10 pp.–12.

[13] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program dependence graph and its use in optimization,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 9, no. 3, pp. 319–349, 1987.

[14] M. Gabel and Z. Su, “A study of the uniqueness of source code,” in Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE ’10. New York, NY, USA: ACM, 2010, pp. 147–156. [Online]. Available: http://doi.acm.org/10.1145/1882291.1882315

[15] R. Gopal, “Dynamic program slicing based on dependence relations,” in Proceedings. Con- ference on Software Maintenance 1991, Oct 1991, pp. 191–200.

[16] M. Harman, D. Binkley, and S. Danicic, “Amorphous program slicing,” J. Syst. Softw., vol. 68, no. 1, pp. 45–64, Oct. 2003. [Online]. Available: http://dx.doi.org/10.1016/S0164-1212(02)00135-8

[17] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 837–847.

[18] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural slicing using dependence graphs,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 12, no. 1, pp. 26–60, 1990.

[19] A. Hunt, D. Thomas, and W. Cunningham, The pragmatic programmer: from journeyman to master. Addison-Wesley Professional, 2000.

[20] L. Jiang and Z. Su, “Automatic mining of functionally equivalent code fragments via random testing,” in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ser. ISSTA ’09. New York, NY, USA: ACM, 2009, pp. 81–92. [Online]. Available: http://doi.acm.org/10.1145/1572272.1572283

[21] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “Deckard: Scalable and accurate tree-based detection of code clones,” in Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 2007, pp. 96–105.

[22] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner, “Do code clones matter?” in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on. IEEE, 2009, pp. 485–495.

[23] B. Korel and J. Laski, “Dynamic program slicing,” Inf. Process. Lett., vol. 29, no. 3, pp. 155–163, Oct. 1988. [Online]. Available: http://dx.doi.org/10.1016/0020-0190(88)90054-3

[24] B. Korel and J. Rilling, “Application of dynamic slicing in program debugging,” in Proceedings of the 3rd International Workshop on Automatic Debugging; 1997 (AADEBUG-97), no. 001. Linköping University Electronic Press, 1997, pp. 43–57.

[25] R. Koschke, “Survey of research on software clones,” in Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2007.

[26] J. Krinke, “A study of consistent and inconsistent changes to code clones,” in Reverse Engi- neering, 2007. WCRE 2007. 14th Working Conference on. IEEE, 2007, pp. 170–178.

[27] T. Lengauer and R. E. Tarjan, “A fast algorithm for finding dominators in a flowgraph,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 1, no. 1, pp. 121–141, 1979.

[28] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.

[29] A. D. Lucia, “Program slicing: methods and applications,” in Proceedings First IEEE Inter- national Workshop on Source Code Analysis and Manipulation, 2001, pp. 142–149.

[30] A. Mahmoud and G. Bradshaw, “Estimating semantic relatedness in source code,” ACM Trans. Softw. Eng. Methodol., vol. 25, no. 1, pp. 10:1–10:35, Dec. 2015. [Online]. Available: http://doi.acm.org/10.1145/2824251

[31] A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “A large-scale study on repetitiveness, containment, and composability of routines in open-source projects,” in Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on. IEEE, 2016, pp. 362– 373.

[32] H. A. Nguyen, A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, and H. Rajan, “A study of repetitiveness of code changes in software evolution,” in Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 180–190. [Online]. Available: https://doi.org/10.1109/ASE.2013.6693078

[33] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “A statistical semantic language model for source code,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013. New York, NY, USA: ACM, 2013, pp. 532–542. [Online]. Available: http://doi.acm.org/10.1145/2491411.2491458

[34] J. Petke, S. Haraldsson, M. Harman, D. White, J. Woodward et al., “Genetic improvement of software: a comprehensive survey,” IEEE Transactions on Evolutionary Computation, 2017.

[35] B. Ray, M. Nagappan, C. Bird, N. Nagappan, and T. Zimmermann, “The uniqueness of changes: Characteristics and applications,” in Proceedings of the 12th Working Conference on Mining Software Repositories, ser. MSR ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 34–44. [Online]. Available: http://dl.acm.org/citation.cfm?id=2820518.2820526

[36] S. P. Reiss, “Semantics-based code search,” in Proceedings of the 31st International Conference on Software Engineering, ser. ICSE ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 243–253. [Online]. Available: http://dx.doi.org/10.1109/ICSE. 2009.5070525

[37] Z. Tu, Z. Su, and P. Devanbu, “On the localness of software,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. New York, NY, USA: ACM, 2014, pp. 269–280. [Online]. Available: http://doi.acm.org/10.1145/2635868.2635875

[38] G. A. Venkatesh, “The semantic approach to program slicing,” in Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, ser. PLDI ’91. New York, NY, USA: ACM, 1991, pp. 107–119. [Online]. Available: http://doi.acm.org/10.1145/113445.113455

[39] ——, “Experimental results from dynamic slicing of C programs,” ACM Trans. Program. Lang. Syst., vol. 17, no. 2, pp. 197–216, Mar. 1995. [Online]. Available: http://doi.acm.org/10.1145/201059.201062

[40] T. Wang, M. Harman, Y. Jia, and J. Krinke, “Searching for better configurations: a rigorous approach to clone evaluation,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 2013, pp. 455–465.

[41] M. Weiser, “Program slicing,” in Proceedings of the 5th International Conference on Software Engineering, ser. ICSE ’81. Piscataway, NJ, USA: IEEE Press, 1981, pp. 439–449. [Online]. Available: http://dl.acm.org/citation.cfm?id=800078.802557

[42] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ser. ASE 2016. New York, NY, USA: ACM, 2016, pp. 87–98. [Online]. Available: http://doi.acm.org/10.1145/2970276.2970326

[43] X. Zhang and R. Gupta, “Cost effective dynamic program slicing,” in Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, ser. PLDI ’04. New York, NY, USA: ACM, 2004, pp. 94–106. [Online]. Available: http://doi.acm.org/10.1145/996841.996855