<<

Some From Here, Some From There: Cross-Project Code Reuse in GitHub

Mohammad Gharehyazie Baishakhi Ray Vladimir Filkov University of California, Davis University of Virginia University of California, Davis [email protected] [email protected][email protected]

Abstract—Code reuse has well-known benefits on code quality, Social coding ecosystems like GitHub present a fundamen- coding efficiency, and maintenance. Open Source (OSS) tal shift in reuse opportunities. With millions of available programmers gladly share their own code and they happily OSS projects hosted in GitHub, among which thousands reuse others’. Social programming platforms like GitHub have normalized code foraging via their common platforms, enabling of developers freely migrate, GitHub provides an excellent code search and reuse across different projects. Removing project opportunity for collaboration and code reuse [8], [28]; in fact, borders may facilitate more efficient code foraging, and conse- GitHub actively facilitates the process by providing features quently faster programming. But looking for code across projects like advanced code search and GitHub gist. The reused code takes longer and, once found, may be more challenging to tailor varies in size, ranging from few lines of code to methods, and to one’s needs. Learning how much code reuse goes on across projects, and identifying emerging patterns in past cross-project even larger—one can copy and reuse an entire project or a search behavior may help future foraging efforts. set of files and make smaller modifications to cater to local To understand cross-project code reuse, here we present an in- needs [23]1. Thus, ecosystem level code sharing and reuse may depth study of cloning in GitHub. Using Deckard, a clone finding be different from within project code reuse—one would expect tool, we identified copies of code fragments across projects, and limited methods and file copies in a within project setting, investigate their prevalence and characteristics using statistical resulting in new kind of practices. In and network science approaches, and with multiple case studies. By triangulating findings from different methods, we find that this paper we focus on studying such ecosystem level code cross-project cloning is prevalent in GitHub, ranging from sharing and reuse across multiple GitHub projects. cloning few lines of code to whole project repositories. Some An established way to study code reuse is by analyzing code of the projects serve as popular sources of clones, and others clones [19], [24]. A “clone” is a snippet of code somewhere seem to contain more clones than their fair share. Moreover, that is surprisingly similar to code elsewhere, and can arise we find that ecosystem cloning follows an onion model: most clones come from the same project, then from projects in the from copy-pasting practices [15], automatic program gener- same application domain, and finally from projects in different ators [10], [17], [18], or even emerge independently, due to domains. Our results show directions for new tools that can the predictable syntax of underlying programming languages facilitate code foraging and sharing within GitHub. and API usage [4]. Clones have been studied extensively in the literature, primarily within single-project boundaries [14], I.INTRODUCTION [16], as at the time, in the absence of the Web 2.0 and environment-unifying source forges, searching and reusing Coding from scratch is expensive, both in time and effort. code from other projects would have been challenging. In the So, programmers opportunistically reuse code and do so past 5-8 years, GitHub has emerged as a universal platform frequently [13]. Often, programming of well-defined problems for maintaining repositories, with minimal boundaries between amounts to simple look-up [29], first in one’s own, and then in projects. It, and other source forges have increased code others’ code repositories, followed by judicious copying and sharing and reuse, and by virtue of their historical maintenance pasting. The bigger the corpus of code available, the more features, are now enabling us to study code cloning across likely it is that one will find what one is looking for [9]. projects. The advent of software forges like GitHub and Google In this paper, we focus on studying cross-project cloning Code, and Q&A sites like Stack Overflow, has significantly in GitHub projects in order to understand the nature of code enhanced this kind of opportunistic code reuse [5], [21], [26], sharing and reuse in an ecosystem. To our knowledge, this is [31]; one can readily forage online for code and ideas, and the first study of cross-project cloning in an ecosystem setting. often reuse existing code. In fact, per Open Source ideology, We selected 5, 753 Java projects from GitHub and analyzed developers not only look for the opportunity to reuse code but how much of their code are clones that are shared with other also advertise their own high-quality code for others to reuse 1See StackExchange question: http://programmers.stackexchange.com/- in paste-bin like web applications such as GitHub Gist [2], questions/-193415/best-practices-for-sharing-tiny-snippets-of-code-across- Pastebin [3], and Codeshare [1]. projects projects, as compared to code that has clones within the project other projects. Likewise, some projects, e.g., device drivers itself. We used Deckard [12], an established clone detection heavy code, can borrow a lot from other code using similar tool, to detect both within and cross-project clones. We found drivers, and thus may be more dependent on reusing code than the following. expected. Knowing which project has borrowed code from • Cross-project clones comprise between 10% and 30% of which other projects, we link projects based on shared clones all code clones within projects, and up to 5% of the code to help us answer the following: base. • Cross-project clones exist due to different reasons ranging Research Question 2: Is there an asymmetry among the from implementing similar functionalities, to structural projects, i.e., are there projects that serve as clone sources similarity, to being auto-generated. A considerable pro- more than their fair share? Moreover, are some projects portion of all cross-project clones are utility clones, reusing other project’s code at a greater rate? where entire files, directories or even projects are copied with a minimum amount of change. They serve different Lastly, we investigate the ecosystem nature of GitHub in purposes such as exposing APIs and utility code reuse. order to understand the mechanisms behind where clones • Many projects are disproportionately sources for clones, originate, and if there is a social mechanism related to their and many others contain surprising number of clones propagation. Being a platform with uniformity across projects from different projects. The majority of projects provide for uploading and code search, GitHub enables easy searches more clones than they obtain. beyond one’s own project’s repository. In fact, GitHub, via • Code-cloning seems to follow an onion model: most its gamified, uniform social coding environment, encourages clones come from the same project, then from projects in people to code in multiple projects at the same time. In fact, the same application domain, and lastly from projects in it is not uncommon to find people who code on multiple different domains. projects in the same day [32]. Code cloning certainly helps • We find evidence suggesting cloning is higher among with speeding up coding, and perhaps makes this possible. more experienced and active/diverse developers, and be- So, naturally, we wonder if shared people between projects tween projects with shared developers. correlates with sharing clones among the projects, i.e., is there The rest of the paper is organized as follows. In Section II congruence between the network of shared clones and the we motivate our research and follow with a review of prior network of shared developers? work in Section III. The methods are described in Section IV, Additionally along these lines, we wanted to see if clones followed by our results in Section V. In Section VI we discuss are confined within certain functional domains. It is reasonable potential threats that can affect our findings, and conclude to expect that projects in the same application domain, e.g., in Section VII. web development, share the same essential coding problems, and use the same approaches to solve them. So, we wonder if II.RESEARCH QUESTIONS cross-project clones are more frequent between projects in the As a first step towards understanding cross-project cloning, same application domain? All these lead us to the following we seek to identify the rate of code reuse within the GitHub research question: ecosystem. In particular, we want to know how many clones are there, what is the rate of within vs. cross-project clones, Research Question 3: Can we find support for existing how many lines are cloned on average, and if there is a mechanisms which promote cloning, such as multi-project correlation between project metrics and clone density. Once we developers or specific application domains? know the extent of cross-project clones, we are curious what they look like, i.e., whether certain kind of code is cloned more across the project boundaries. We ask all these in the III.RELATED WORK following question. In general, code clones have been considered to be similar code fragments [14], although the notion of similarity widely Research Question 1: What is the prevalence of within varies and mostly depends on the underlying code clone and across project cloning, and what kind of code is detectors. For example, the widely popular CCfinder [14] cloned? detects clones based on lexical and syntactic analysis, while Deckard [12] relies on Abstract Syntax Trees (AST) matching. Once we know the extent and nature of cross-project clones, Rattan et al. have studied and summarized the majority of we focus next on where these clones originate from and end up clone detection techniques in their comprehensive review in in. It is conceivable that some projects, by design or otherwise, 2013 [22]. serve mostly as libraries or frameworks, and thus can be good By applying different clone detection techniques, previous sources of cloned code, i.e., their code is being reused in many studies reported that as much as 10% to 30% of code in many large scale projects may be clones (e.g., gcc-8.7%, JDK- We consider exact tree matches to study intentional cloning 29% [14], and Linux-22.7%). Gabel and Su studied source rather than accidental ones [4] as much as possible. Note that, code uniqueness across a large code corpus of SourceForge once copied, a clone will likely change as it gets adapted to suit projects [9] and found a lack of uniqueness in code at a the programmer’s goal. Thus, an exact tree-match algorithm granularity of 1 to 7 lines of source code. Software is not will be underestimating actual code reuse, but it will provide only repetitive, but also changes in a repetitive way, as found a solid lower bound for it. We do allow in the following, for by multiple studies based on different Microsoft projects, exact clones of different thresholds (lengths), which allows as Linux, and multiple open source Java projects [6], [19], [24]. to relax this definition a bit.2 However, none of these studies have focused on cross-project Our methodology involved several steps, including project code cloning in a large scale ecosystem. selection, repository cloning, clone mining, domain analysis, The most closely related work to ours, by Ossher et al. and statistical analysis of the results. [20], studied cross-project cloning in files across 13,000 Java A. Project Selection projects, from multiple ecosystems. However, they were lim- ited to measuring cloning at the granularity of files, vs short We used the January 4th, 2014 dump of the GHTorrent code snippets, and did not differentiate between within and database [11] to mine the list of users, commit history and cross-project cloning. In this work we focus on both file and other meta-data about projects in GitHub. We selected Java code fragment cloning. We further analyzed clones in multiple projects that had at least 2 developers, were at least 1 year old, temporal, spatial, and developer dimensions to understand and had more than 10 commits. These criteria remove smaller which projects in an ecosystem contributed most clones, or and younger projects, most of which have a single developer, whether certain developers promote more clones than others. and usually do not contribute much to the ecosystem, and Among the other cross-project cloning studies, Ray and would have skewed our results [7]. We also eliminated projects Kim found evidence of significant similar changes among the that were forked using GitHub interfaces, as they would highly BSD product family [23]. Al-Ekram et al. studied cloning skew our findings due to their . in similar software and concluded that many times clones Overall, 8, 599 projects matched our criteria above and were develop accidentally due to several reasons such as following still viable at the time of our data gathering (Sept. 2015). protocols [4]. We also found evidences of accidental cloning After running the initial clone detection algorithm on those, as in cross-project settings. described later in the paper, we were left with 5, 753 projects in Nguyen et al. [19] reported up to 70-100% of repetitive which we found code clones. Between them they have 23,000 source code changes across 2,841 open source Java projects, developers, 1.04M files, 105M LOC and are in age between and the repetitiveness is higher and more stable in across 1 and 6 yrs old. projects vs. within projects. However, code changes are not B. Identifying Project Domains similar to static code and thus, study of repetitive changes are different than code clone study. For example, a study of To identify the domain of a project, we employed a tool de- repetitive changes may ignore utility files as they are seldom veloped by Ray et al. [25]. It uses Latent Dirichlet Allocation changed in software evolution. (LDA), a popular topic analysis algorithm, on the projects’ readme text and GitHub description, to identify the topic of each project. First, we detected 30 distinct topics and estimated IV. METHODOLOGY the probability of each project belonging to these topics. There are a number of ways to define code clones. In From each topic, we then removed project-specific keywords (e.g., facebook, slime) and focused only on the keywords general, two non-overlapping code fragments C1 and C2 are representing underlying functionalities. Next, we manually clones, if C1 and C2 are “similar by some given definition of inspected the resulting topics and found appropriate domain similarity” [27]. More formally, to decide if C1 and C2 are names that describe the topics. Finally, we found six main clones, we need a similarity function F, F(C1,C2) ≥ σ, where σ is a similarity threshold [30]. domains corresponding to these topics: Application domain In this work, we use Deckard [12], an abstract syntax tree (end user programs), Database, CodeAnalyzer (e.g., compiler, (AST) based clone detection tool to detect clones. So, we parser, etc.), Middleware (e.g., Operating Systems, Virtual operationalize our definition of clones as follows: Machines, etc.), code (e.g., APIs), and Framework (e.g., SDKs).

If the AST structures of code fragments C1 and C2 are C. Identifying Clones exact matches, as per Deckard (irrespective of identifier We used Git to gather the repositories for all 8, 599 projects, names and literal values), we consider the two code and rolled back each repository to January 4th, 2014. We fragments clones of each other. In this case, σ represents the size of the two AST trees. 2Since shorter exact clones can capture, to some extent, more variability during longer code evolution. TABLE I. Parsed clone information based on Deckard output.

CloneID CloneSubID ProjectName FileName From Length 01 01 Distrotech cyrus-sasl java/CyrusSasl/SaslOutputStream.java 48 32 01 02 mb-linux-msli uClinux-dist/lib/libcyrussasl/java/CyrusSasl/SaslOutputStream.java 48 32 02 01 encog-java-examples src/main/java/org/encog/examples/neural/gui/ocr/Entry.java 200 32 02 02 encog-java-workbench src/main/java/org/encog/workbench/tabs/query/ocr/DrawingEntry.java 194 32 03 01 agit agit-test-utils/src/main/java/com/madgag/agit/matchers/GitTestHelper.java 46 3 03 02 nexus gateway/src/main/java/org/isolution/nexus/xml/soap/SOAPMessageUtil.java 50 3 03 03 nexus gateway/src/main/java/org/isolution/nexus/xml/soap/SOAPMessageUtil.java 89 3 then used the Deckard tool [12] to identify clones present cases are indeed examples of cross-project code reuse, their in this collection. We chose Deckard because of its notable borrowing dynamics arguably differs from that of regular performance on large data sets and because we had access clones. We therefore study them separately from other clones. to its authors, that provided us the chance to fully utilize the tool’s potential and get more accurate results. E. Constructing Ecosystem Graphs In Deckard, there are 3 required parameters to specify To assess the overlap between cross-project cloning and clones and their accuracy: “Stride”, “Similarity”, and “Token shared people between projects, we construct two graphs Size”. For our study, we set Stride to infinity and Similarity from the datasets we gathered. The first is a co-clone graph to 1, in order to find exact clones. Those correspond to which has links between projects that share code clones. The Deckard’s defaults, recommended in the original paper. For second, a co-developer graph, links projects with developers Token Size, which specifies the minimum (token) code length in common between them. to be accepted as a clone, we used three different values: 20, More specifically, to construct the co-clone graph, we first 30 and 50. We chose those clone lengths so as to capture the create an empty graph with each project as a node. We then add average length of a few lines of code, a code block, and a a weighted edge between two projects if they contain instances method, respectively. With these choices, we are able to study of the same clone, where the weight is the number of clones the effect of this parameter on our findings and observe how in common between the two projects. We also keep track of various clone sizes differ in occurrence. Although Deckard is the oldest of all instances of a clone using git blame, by very efficient, it only works on a single project. To identify ranking the dates of the blame output for the cloned lines. We cross-project clones, we placed all projects’ repositories in call that oldest instance the source and add an edge from all one umbrella directory, with one folder per project, and then projects containing more recent instances of that clone to the treated this as one large project. Once the cloning results were project that contains the source. In this way, with respect to a gathered, we used the directory information of each file in clone, all projects point to an “original” source of information, the results to de-convolve the found clones back into their in a star-like topology. Because of the exact clone approach we respective projects. adopted, a star topology is the best resolution we can get on After parsing Deckard’s output, the results are stored in a the history of a clone, since we can’t deduce more information table, an excerpt of which is shown in Table I. We call a than the existance of the oldest instance, and all the copies. clone a cross-project clone, if for the corresponding cloneID, Figure 1 shows the co-clone graph for clones with token size at least two of its instances come from two different projects. 50. In Table I all three clones are considered cross-project clones. We construct the co-developer graphs as follows. As above, We call a clone a within-project clone if at least two of its we have a graph with projects as nodes, and two projects instances are from the same project. A clone can be both cross- are connected if there is a developer that has been active project and within-project; we call those hybrid clones (e.g., in both. This results in a K-clique for each developer active cloneID 3 in Table I). across K projects. Developers are not usually active to the same extent in all the projects they participate in; they may D. Extracting Utility Clones be heavily invested in some, and occasionally or even rarely During our case-studies of cross-project clones, we found contribute to the rest. Thus, we introduce different cutoffs for multiple instances of entire files, directories, or in extreme the relative level of contribution by defining a contribution cases an entire project, present in multiple projects, with or threshold θ. We consider developer d as a contributor to project without minor changes. Such cloned artifacts often serve as p at some fractional level θ, if the ratio of his/her contribution utility or library code and we call them as utility clones. We (in terms of number of commits) to project p to his/her overall extracted such utility clones based on their file names and contribution to all the projects P is at least θ. In other words: directory structures—they often share similar nomenclature as NComd,p reported by Ossher et al. [20]. For example, we identified files P ≥ θ. NComd,p engine/src/main/java/org/json/JSONArray.j- pi∈P i ava and src/org/json/JSONArray.java of projects We have constructed the co-developer graphs for the values HomeSnap and ModDamage as utility clones. While these of θ: 0%, 5%, 10%, 20%, 30%. To compare co-clone graphs TABLE II. The resulting size of co-developer graphs with varying entire code base of these projects, and their average size ranges θ. Values greater than 5% filter out too many edges. from 5 to 15 LOC. We discuss the clones per categories of θ 20 30 50 interest, as defined in Sect. IV-C and Sect. IV-D. No. of Edges 0 91703 79384 552 5 8304 7623 60 TABLE III. Extent of Cloning across GitHub 10 4562 4170 35 20 2074 1885 12 30 984 910 4 Token Size No. of Nodes 5643 5282 435 20 30 50 Projects with Clones 5753 5533 628 #Unique clones 354,268 243,150 13,908 #Unique within-project clones 246,390 170,729 12,664 with co-developer graphs, we induce the co-developer graph (69%) (70%) (91%) #Unique cross-project clones 121,565 59,589 1,037 by the co-clone graph. Table II shows the resulting graph sizes (34%) (25%) (7%) for our choices of θ. Figure 1 shows the induced co-developer #Unique hybrid clones 53,512 21,276 389 #Unique utility clones 39,825 34,108 596 graphs for θ=5%. (11%) (14%) (4%)

Size of projects with clone * 104.85 103.39 13.27 Size of clones 9.86 8.31 0.67 Size of within-project clones 7.64 6.17 0.59 Size of cross-Project clones 4.85 2.93 0.1 Size of utility clones 0.93 1.09 0.04

Mean within-project clone size † 5.63 9.51 15.6 Mean cross-project clone size 5.21 9.63 20.8 Mean utility clone size 9.91 13.59 23.89 Median within-project clone size 4 7 11 Median cross-project clone size 4 6 13 Median utility clone size 6 8 12 * Total size of projects and clones are in Million LOCs (MLOC). † Mean and median size of clones are in LOC.

(i) Within-project Clones. These are the most common type of code cloning (69% to 91% of all clones) and have been widely studied in the past [9], [14], [24]. In this paper, we will focus mostly on the following, much less understood cloning. (ii) Cross-project Clones. We find a large amount of cloned code across multiple projects. At a smaller token size of 20, nearly 34% of all identified clones come from other projects. As the token size increases, the extent of cross-project clones drops to nearly 7% at token size 50. This suggests that larger chunks of code are less likely to be copied verbatim from other projects, or at the least, they are less likely to have remained unchanged after copying. The median size of within-project and cross-project clones are statistically identical for 20 and 30 token clones but they start to diverge in 50 token clones, with cross-project clones being larger than their counterparts. We further look at how cross-project clone prevalence Fig. 1. Top: The co-clone graph for 50 Token clones. The coloring depends on project size, i.e., whether larger projects have corresponds to project domains. Bottom: The co-clone graph more cross-project clones. For each project, we use two induced co-developer graph, θ = 5%. metrics as proxies for clone prevalence: the total number of cloned lines and the total number of files having cloned code. Figure 2 shows the distribution of cloned lines w.r.t. project V. RESULTS AND DISCUSSION size measured in LOC and the top subfigure of Figure 3 plots cloned files vs. total files.3 The first plot shows that number We begin by characterizing the number of cross-project of cloned lines slowly increases as project size grows, but clones. with a sublinear trend. The latter plot shows the number of RQ1: Cloning Prevalence in GitHub cloned files increases linearly as the number of files increases Table III presents a summary of the extent of cloned code in in a project. Since project size is highly correlated with total selected GitHub projects. In total, we find 354, 268, 243, 150, number of files, such a trend can be explained if clones are and 13, 908 unique clones, for token sizes 20, 30, and 50, 3We also investigated the correlation between clone density and project respectively. Code clones make up between 5% to 10% of the size. The results were similar to those in Fig. 2. 50 Tokens 20 Tokens 30 Tokens

Counts Counts Counts 1500 4 86 1500 74 1500 81 69 4 75 65 4 70 60 3 65 56 3 59 1000 51 1000 3 1000 54 47 3 49 42 3 44 38 2 38 33 2 500 33 500 28 500 Clones LOC 2 28 24 2 22 19 2 17 15 2 12 10 0 6 1 0 6 0 1 1 1 0 1000 2000 3000 4000 5000 1 0 1000 2000 3000 4000 5000 1000 2000 3000 4000 5000 Project Size (LOC) Project Size (LOC) Project Size (LOC)

Fig. 2. Size of cloned lines in a project vs. projects’ size (LOC); we observe a sublinear trend.

with a peak at 2.5% clone density indicating that cross-project clone density varies significantly across files. This observation Counts 72 holds for clones across the studied ranges of token sizes. 6 68 63 To understand the nature of cross-project clones, we man- 59 ually studied 90 unique clone instances, chosen randomly, 54 50 that were detected at token size 30 (see Table IV). 66% 4 45 of these clones appear as developers often reuse code to 41 36 implement similar functionalities, either by reusing smaller 32 code fragments (38%), or cloning an entire method’s body 28 2 23 (28%) (rows a and b in Table IV). For example, we find 7 19 unique clone instances where developers perform operations

log(Files with clones Per Project) log(Files with clones Per 14 10 like null checks and object initializations on a method’s 0 5 arguments before calling the method. There are other cases 1 0 2 4 6 8 where a similar code fragment is reused for adding new log(Files Per Project) objects to an object queue, putting elements in a hash map, performing object serializations (e.g., toString method), calling APIs in similar ways, etc. Such clones are closely tied to data structure/API usage. Developers also clone entire meth- ods to implement higher level functionalities (row b). Such cloned methods can appear either in related, but non-identical files (file/class names or directory structures resemble each Frequency other), or in apparently unrelated files. For instance, method connectSocket, which implements a socket connection 0.00 0.02 0.04 0.06 0.08 02.5 20 40 60 80 100 routine in Android applications for a given IP address and Percentage of cloned lines w.r.t total lines in a file port, is implemented identically in three files, AgeClient.java, FakeSocketFactory.java, EasySSLSocketFactory.java, in the Fig. 3. Top: Files having cross-project clones vs. total files per projects App-Growth-Engine-Android-SDK, ReGalAndroid, project showing a linear trend. Bottom: Percentage of cloned lines and cmsandroid, respectively. Such semantic clones can be per file showing a heavily skewed trend with a peak frequency good candidates for pastebin like GitHub Gist [2] where peo- at clone density 2.5%. Both the figures consider clones at token ple frequently share their commonly used code snippets for fu- size 30. ture use. In fact, the above example of the connectSocket method is shared by developer darrikmazey in his GitHub gist https://gist.github.com/darrikmazey/9521134. Further, (c) 32% non-uniformly distributed across files. To confirm that, the of clones across different projects show structural similarities bottom subfigure of Figure 3 presents a histogram of clone without deeper semantic resemblance. Iterating over while/for density (i.e., number of cloned lines w.r.t. total lines) per file loops, similar if-else structures, and array initializations are across all studied projects. Here we ignore the files that do few common examples in this category. Finally, we find (d) not have a single cloned line. The histogram is heavily skewed two instances of clones where whole files are auto generated. TABLE IV. Cross-Project Cloning: a case study of 90 uniquely cloned instances

Type of cross-project clones count patterns (individual count) (a) Code fragments implementing similar functionalities 34 (38%) add new objects to object queue (2), wrapper methods to call set of other methods (2), wrapper methods to null-check on method arguments or create the argument object or initialize the arguments (7), iterate over an array to access a particular element (2), Object serialization (3), putting elements in HashMap (2), read I/O (2), recursively calling similar function (1), release resources/clean up routine (2), initializing memory elements (2), similar API call (4), testing object initialization (1), similar routine (4) (b) Methods implementing similar functionalities 25 (28%) similar function in related files (18), similar functions in different files (7) (c) Clones due to structural similarity 29 (32%) iterating in loop structure (6), array initialization (3), if-else structure (8), calling methods with similar signature (6), similar class definition (3), long return statement (3) (d) Autogenrated files 2 (2%) WSDL files (1), ANTLR Translator Generator (1)

We note that clones may exist both within and across multiple projects. We call such clones “hybrid” clones. In our 100 Counts 1391 study, we find 53k, 21k, and 389 of those, respectively, for 1304 80 1217 our range of token sizes. 1130 1044 (iii) Utility Clones. We also notice a substantial amount of 60 957 870 clones where whole files or modules are reused with little to no 783 696 modification. As shown in Table III, such clones contribute to 40 609 % Change 522 11%, 14% and 4% of all unique clone instances at token sizes 435 20 348 20, 30, and 50, respectively. We find that such files/modules 262 175 are usually part of utility or library code and developers often 0 88 reuse them across multiple projects. We collectively call such 1 2 4 6 8 clones utility clones. Note that the utility clones are also log(File Size) instances of cross-project clones where instead of copying code fragments developers reuse entire files or modules across Fig. 4. The distribution of change percentage vs file size in projects. To investigate in details, we manually studied 264 of utility clones at toke size 30. such files that are cloned up to 9 projects and distinguish five types of utility clones among them (see Table V): (a) cloning library source code to expose APIs: Developers often copy indicates that even such files are limited to a small amount the entire source code of popular libraries to use their APIs. of changes, with median of 3.95%. These results show that For example, we find several copies of ActionBarSherlock, utility clones are mostly static across different projects. an Android library for using action bar design patterns. (b) Cloning Files/Modules to reuse utility code: Developers often clone single utility files or modules to reuse some common Result 1: A significant amount of code is reused across functionalities such as HTML or JSON file parser. (c) We GitHub projects, and the corresponding clones follow find several instances of Java standard library (e.g., java.io, fixed patterns. java.lang, and java.util code) code reused multiple times across different projects. (d) Another common source of utility clones Now that we have seen that cross-project cloning is quite is extensions of Java by third party vendors like Test Driven prevalent across GitHub projects, we proceed to analyze where Development (TDD) in Action, etc. (e) There are some other these clones are coming from. In the rest of this work, we only sources of utility clones, like related projects, example or demo present our results on cross-project clones. code, etc. Ossher et al. [20] also reported similar types of file- level cloning activities in SourceForge, Apache, Java.net and RQ2: Sources of Clones Google Code projects; thus, our findings confirm theirs. We investigate whether cloning is a uniform and symmetric Once cloned in different projects, the utility files are seldom process. Recall that in the co-clone graphs, edges are directed changed. In fact, among all cloned files at token size 30, from a project with a newer instance of the clone to the project 56.12% of files were never modified, and 75% of files are with the oldest instance (as per git blame). For clarity of changed only up to 2.94% w.r.t. the respective file sizes. For discussion, we refer to out-edges as ”obtain” edges, and in- utility files that differ across multiple instances by at least one edges as ”provide” edges, and to projects with more ”obtain” line, Figure 4 plots the percentage of changed lines w.r.t. total edges as ”obtainers” and more ”provide” edges as ”providers”. file size. The brighter region at the bottom in the hexbinplot We of course recognize that our methodology does not allow TABLE V. Utility Cloning: a case study of 264 cloned files

Type of Utility Clones Examples (a) Cloning library source code to expose APIs Mobile development APIs like ActionbasSharlock, Game development APIs like Forestry, Security protocol related APIs like SSL (b) Cloning Files/Modules to reuse utility code Binary file processing utility, Android ListView, Date and Time Processing utility, Encoder/De- coder, IO, HTML/JSON parser, utility for network protocols like DNS, HTTP, etc. (c) Java Standard Library java.io, java.lang, java.util code, etc. (d) Java Extensions Java extensions by some third party vendor like Test Driven Development (TDD) in Action , etc. (e) Other Related/Duplicate projects, Example Code/Demo etc.

20 Tokens 30 Tokens 50 Tokens that helps the GitHub ecosystem grow, and that projects in−degree that are clone super-sources may be especially important for out−degree contributing to increasing the code quality in the GitHub ecosystem. Density 0.00 0.05 0.10 0.15 0.00 0.05 0.10 0.15 0.00 0.05 0.10 0.15 0.20 0.25 0 5 10 15 0 5 10 15 0 5 10 Result 2: Cross-system cloning is a directed and non- Log(Weighted Degree) Log(Weighted Degree) Log(Weighted Degree) uniform phenomenon and most projects “obtain” more Fig. 5. The weighted in-degree (indicating clone provider) and clones than “provide” them. There are also projects that weighted out-degree (indicating clone forage) distribution of co- clone graphs with different token sizes. Each node represents serve as hubs or “super-sources” of clones to other a project, and edge exists if two projects share a clone. The projects. weight of the edge represents number of unique clones between two project nodes. We observe an overall rightwards shift of RQ3: Mechanisms of Cloning the peak in the out-degree (i.e., forage) indicating most projects forage clones than provide them. Next, we look for evidence of the existence of several poten- tial mechanisms of cloning within GitHub: cloning within the boundaries of domains, cloning among neighboring projects, us to be able to precisely pinpoint all causality implied by these and finally, cloning by experienced and active members. words. However, we maintain that if such causal relationships Cloning within Domain Boundaries: We selected cross- were to exist, these words would capture them, and thus this project clones and checked whether the projects which share terminology is not arbitrary. clones are within the same domain or not. The results are Figure 5 shows the distribution of weighted provide edges presented in Table VII which shows that cross-domain clones (in-degree) and obtain edges (out-degrees) of projects in the outnumber within-domain clones by almost 2 to 1. co-clone graph, as described in Section IV-E. We observe an overall rightwards shift of the peak in the out-degrees. We TABLE VII. The statistics of cross-domain and within-domain know that for directed graphs the sum of in-degrees over clones based on the co-clone graphs all nodes equals the sum of their out-degrees. Thus, a peak 20 30 50 All Edges 808623 244948 4011 in the out-degrees (obtain) implies a heavier tail in the in- Within-Domain Edges 281700 96466 1400 degrees (provide) distribution. This indicates, in general, that Cross-Domain Edges 526923 148482 2611 Within/Cross Ratio 0.53 0.65 0.54 most projects obtain more code clones than provide them. The heavier tail in in-degree distribution also implies that there are We further investigate this issue and quantify the frequency few projects that act as super-sources of clones. of clones across different domains. For each domain we To investigate this further, Table VI shows the 10 highest normalize the number of clones it shares with other domains ranking projects in terms of weighted in-degree (provide) (including itself) by the size of both domains i.e., the number and out-degree (obtain). As expected, we observe that the of projects in that domain: IN top 10 lists in all three cases completely differ from the OUT, i.e., major clone providers are different from the P Clones(Pk,Pl) greatest obtainers. For example, at token size 30, projects Clones(D ,D ) = Pk∈Di,Pl∈Dj . i j |D | × |D | like acceleo, android-sdk, and rwiki are top provider i j of clones while projects android_platform, wl-rwiki, Here Clones(Pk,Pl) denotes the number of clones between and ecl are top obtainers. projects Pk and Pl, and |Di| denotes the number of projects 4 We conclude that cloning is not an uniform process and in domain Di . Such normalization is important because it most projects obtain more clones than provide them. We 4This number is among the set of the projects that had any clones at all. So hypothesize that this asymmetry may be an important feature the total sum of all domain sizes adds up to the first row numbers of Table III. TABLE VI. Projects with highest weighted in- and out-degree in co-clone graphs. Project names are trimmed to fit on page.

20 30 50 Rank IN (Provide) OUT (Forage) IN (Provide) OUT (Forage) IN (Provide) OUT (Forage) 1 acceleo SIREn acceleo android platform nuxeo-features GoodData-CL 2 com.idega.block. EclipsePlugin android-sdk wl-rwiki slps Henshin-Editor 3 xDoc RobotML-SDK rwiki ecl OpenFaces cropinformatics- 4 mvdetsen jnr-x86asm is.idega.idegawe EclipsePlugin plexus-utils b3log-latke 5 com.idega.core OpenFaces jdnssec-dnsjava gatein-shindig ActionBarSherloc plexus-container 6 jedit-ruby-plugi ecl shindig mahout amplafi-json andlytics 7 VUE android platform mahout-commits log4jna MSMB coverity-plugin 8 rwiki wl-rwiki log4j Henshin-Editor jbidwatcher spring-ide 9 android-sdk osate2-ocarina android packages osate2-ocarina wl-calendar WISE-Portal 10 luaj Henshin-Editor com.idega.block. jgit jbosstools-base HomeSnap

removes the bias of larger domains having a larger code base, 30 Tokens and hence large amount of clones (see Figure 2). Finally, for each domain, we measure the percentage of clones that come Counts 169 from within and other domains. The results are in Figure 6. 158 Despite the initial appearance in Table VII, once we control 6 148 138 for domain size, there is a clear pattern that shows greater 127 concentration of clones within domains vs. across them. There 116 4 106 are a few exceptions in the heat maps, especially at token size 96 85 50, but the overall trend is more or less clear. We conclude 74 that cross-project clones are more likely to happen within the 64 Number of Clones 2 54 boundaries of domains than across. 43 Cloning Among Neighboring Projects: A plausible cloning 32 22 mechanism between two projects is via the developers shared 0 12 between them, who, by actively participating in both projects, 1 0 2 4 6 bring code from one to the other. If that were the case, Number of Developers our co-clone and co-developer graphs should show non-trivial overlap. To study the overlap of the co-clone and co-developer networks we use a simple graph congruence metric [33], Fig. 7. Cross-project clone density vs. # developers in each measuring the weighted edge overlap between the graphs: project (30 token clones). Axes are logged. We observe clones superlinearly increasing with number of developers. X X Cong(A, B) = W eight(Ei)/ W eight(Ei). E ∈A∩B E ∈A i i of distinct clones that was authored by the developer, (ii) We observed little to no congruence w.r.t. co-clone graphs Clone Size. The total size (in LOC) of all clone instances (Cong < 0.07), but we did observe some congruence w.r.t. co- written by her, (iii)Number of Projects. The number of projects developer graphs ranging between 0.16 and 0.25 depending on the developer has participated in, (iv) Age. The first date she token size and other parameters such as contribution threshold has contributed to any of the mentioned projects, (v) Number θ (see Section IV-E). We compared this against a baseline of commits. The number of times she has committed to the distribution obtained through evaluating the congruence be- mentioned projects. tween randomized co-clone graphs and co-developer graphs Table VIII contains the Spearman correlation coefficient and found that our results are not significantly different from between these features across our dataset, and show some the baseline distribution. Thus, we do not find evidence for interesting correlations. Most evidently, prolific developers in this cross-project cloning mechanism. terms of number of clones may also clone the largest of them. Cloning by Senior and Expert Developers: We investigated The correlation with the number of commits is not clear. the correlation between development team size and clone density, depicted in Figure 7. A weak sublinear trend is TABLE VIII. Correlation between developer features. # & size evident, as the clone density slowly increases with the number are based on 20 token clones. The results were similar for larger of developers in a project. But, are all developers equally token sizes. For all values, p-val < 0.001. participating in cloning, or are some more active than others, Clone Size #Projects #Commits Age #Clones 0.94 0.18 0.41 0.20 and can we find out who? Clone Size 0.17 0.39 0.20 To study whether certain attributes of developers would #Projects 0.51 0.33 be indicative of their rate of code cloning we gathered the #Commits 0.33 following measures for each developer who was the author of at least one clone: (i) Number of Clones. The total number To further examine the effect of some of the above factors aaeesta hncmie ol rvd ute insight. further provide other would be combined log-linear may when or There that linear parameters parameters. a these not only least given at relationship, cloning, of frequency no the is expertise,and or there activity, above, seniority, between noted different relationship correlation at immediate high clone the developers beyond that and conclude rates, We R- models. different an the a had of than with these higher each of fit none regressions, However, squared linear variables. of dependent of set factors, set a other the ran values. for lower also controling indicate while we colors clones, colder of number values, the higher on indicate orange and Red both. of size by normalized domains, two between clones 6. Fig. a o etecs u osvrlraossc sdiscovery as such That reasons several source. to first/oldest due case the the from be not “obtained” may are copies all domains. some several since to Do- especially belong rare. perfect, conceptually being be could renaming projects not such may and assignments towards copied point main results being our clones files However, such entire names. of number file a different missed with have may and nomenclature, study. our of that scale all the clones for given Ray smaller detection in miss clone as will Running changes, region. subsequently. analysis cloned changed blame unchanged were the git the detect our changed, to if was Thus, able However, portion be small clones. after a will and as Deckard changed longer is them are region sink code clones detect cloned and some won’t source if we of conservative: copying, detection of is the rates clones Thus, of possible. the as exclude under-reporting much to made as in was decision resulted This cloning. have may clones lya biul motn atri cloning. in not factor does important clones obviously these an of play Authorship boundary. domain a o h osrcino h ocoegah,w assumed we graphs, co-clone the of construction the For on based solely was clones utility of identification The Validity. Construct to Threats (i) eut3: Result

APPLICATION etmp fCosdmi ln rqec o oe ie 0 0ad5,lf orgt ahsos%o cross-domain of % shows Each right. to left 50, and 30 20, sizes token for frequency clone Cross-domain of maps Heat

DB

EDUCATION

rs-rjc lnn smr rlfi within prolific more is cloning Cross-project GUI I T VI.

MID−TIER tal. et

RASTO HREATS MOBILE 0 . 2] ol o aebe feasible, been have not would [24], 09 PROG. ANLYS.

niaigpo rdcievalue predictive poor indicating WEB

LIBRARY LIBRARY WEB PROG. ANLYS. MOBILE MID−TIER GUI EDUCATION DB APPLICATION V ALIDITY u titdfiiinof definition strict Our cietlclones accidental

APPLICATION

DB

EDUCATION

GUI

MID−TIER

MOBILE irre,a rtapoiaint nefciecross-project effective reuse. an and to search utility approximation code and first super-sources, a favoring as and libraries, approach layered a on be applications. paste-bin can calls for observation and tool sharing—this recommending projects a patterns code many for code for across candidates super-sources. certain used suitable are as commonly there serve are indicates often that facilitate study that to our projects used Further, in be and same can within domains search more findings code prioritize share should These follow discovery—one code projects others. and some than domain Moreover, in code similar a patterns. are of fixed in clones projects project certain found to Cross are restricted code. clones OSS general of cross-project portion that significant evidence find We ipmyehbtdfeetpten fcloning. or of Perl patterns different e.g., exhibit ones, may Lisp non-object-oriented programming specially Different JavaScript). languages, (after popular GitHub second-most in to the language applicable projects, Java less studied findings only our We platform others. make and may (GitHub) facilities ecosystem and culture one within cloning project of categorization. feedback our the validate personal that to used to developer is we prone external issue issue is an this Another study address case dataset. To the our interpretation. in in classification factors, capture clone unknown not the to do due be we may that for this support any but find mechanisms, to certain unable were we mechanisms, cloning of would impossible. solution practically is only which the developers, of and query source, but direct would threat, “real” be that valid available the a information identify is no help is This there dataset. knowledge our our source of to existing scope an the even of outside and preference, personal limitations,

PROG. ANLYS. uuewr ilfcso eet fgie oaig based foraging, guided of benefits on focus will work Future GitHub. in cloning cross-project studied we paper this In ii het oEtra Validity. External to Threats (iii) Validity. Internal to Threats (ii)

WEB

LIBRARY LIBRARY WEB PROG. ANLYS. MOBILE MID−TIER GUI EDUCATION DB APPLICATION I.C VII.

APPLICATION

DB ONCLUSION

EDUCATION

GUI norsac o evidence for search our In u td icse cross- discusses study Our MID−TIER

MOBILE

PROG. ANLYS.

WEB

LIBRARY LIBRARY WEB PROG. ANLYS. MOBILE MID−TIER GUI EDUCATION DB APPLICATION REFERENCES 2013 International Conference on Software Engineering, pages 502– 511. IEEE Press, 2013. [1] codeshare: https://codeshare.io/. https://codeshare.io/. [19] H. A. Nguyen, A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, and [2] Github gist : https://gist.github.com/. https://gist.github.com/. H. Rajan. A study of repetitiveness of code changes in software [3] pastebin: http://pastebin.com/. http://pastebin.com/. evolution. In Proceedings of the 28th International Conference on [4] R. Al-Ekram, C. Kapser, R. Holt, and M. Godfrey. Cloning by accident: Automated Software Engineering, ASE, 2013. an empirical study of source code cloning across software systems. In [20] J. Ossher, H. Sajnani, and C. Lopes. File cloning in open source java Empirical Software Engineering, 2005. 2005 International Symposium projects: The good, the bad, and the ugly. In Software Maintenance on, pages 10–pp. IEEE, 2005. (ICSM), 2011 27th IEEE International Conference on, pages 283–292. [5] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and IEEE, 2011. C. Lopes. Sourcerer: a search engine for open source code supporting [21] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. structure-based search. In Companion to the 21st ACM SIGPLAN Mining stackoverflow to turn the ide into a self-confident programming symposium on Object-oriented programming systems, languages, and prompter. In Proceedings of the 11th Working Conference on Mining applications, pages 681–682. ACM, 2006. Software Repositories, pages 102–111. ACM, 2014. [6] E. T. Barr, Y. Brun, P. Devanbu, M. Harman, and F. Sarro. The [22] D. Rattan, R. Bhatia, and M. Singh. Software clone detection: A plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT systematic review. Information and Software Technology, 55(7):1165– International Symposium on Foundations of Software Engineering, pages 1199, 2013. 306–317. ACM, 2014. [23] B. Ray and M. Kim. A case study of cross-system porting in forked [7] V. Bogdan, D. Posnett, B. Ray, M. v. d. Brand, Filkov, A. Serebrenik, projects. In Proceedings of the ACM SIGSOFT 20th International D. Premkumar, and V. Filkov. Gender and tenure diversity in github Symposium on the Foundations of Software Engineering, page 53. ACM, teams. CHI ’15. ACM, 2015. 2012. [8] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb. Social coding in [24] B. Ray, M. Nagappan, C. Bird, N. Nagappan, and T. Zimmermann. github: transparency and collaboration in an open software repository. The uniqueness of changes: Characteristics and applications. Technical In Proceedings of the ACM 2012 conference on Computer Supported report, Microsoft Research Technical Report, 2014. Cooperative Work, pages 1277–1286. ACM, 2012. [25] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A large scale study of [9] M. Gabel and Z. Su. A study of the uniqueness of source code. In programming languages and code quality in github. In Proceedings of Proceedings of the eighteenth ACM SIGSOFT international symposium the 22nd ACM SIGSOFT International Symposium on Foundations of on Foundations of software engineering, pages 147–156. ACM, 2010. Software Engineering, pages 155–165. ACM, 2014. [10] C. L. Goues, T. Nguyen, S. Forrest, and W. Weimer. Genprog: A [26] S. P. Reiss. Semantics-based code search. In Proceedings of the generic method for automatic software repair. Software Engineering, 31st International Conference on Software Engineering, pages 243–253. IEEE Transactions on, 38(1):54–72, 2012. IEEE Computer Society, 2009. [11] G. Gousios. The ghtorent dataset and tool suite. In Proceedings of [27] C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation the 10th Working Conference on Mining Software Repositories, pages of code clone detection techniques and tools: A qualitative approach. 233–236. IEEE Press, 2013. Science of , 74(7):470–495, 2009. [12] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and [28] W. Scacchi. Collaboration practices and affordances in free/open source accurate tree-based detection of code clones. In Proceedings of the 29th software development. In Collaborative software engineering, pages international conference on Software Engineering, pages 96–105. IEEE 307–327. Springer, 2010. Computer Society, 2007. [29] S. E. Sim, C. L. Clarke, and R. C. Holt. Archetypal source code searches: [13] E. Juergens, F. Deissenboeck, B. Hummel, and S. Wagner. Do code A survey of software developers and maintainers. In Program Compre- clones matter? In Proceedings of the 31st International Conference hension, 1998. IWPC’98. Proceedings., 6th International Workshop on, on Software Engineering, ICSE ’09, pages 485–495, Washington, DC, pages 180–187. IEEE, 1998. USA, 2009. IEEE Computer Society. [30] F.-H. Su, J. Bell, K. Harvey, S. Sethumadhavan, G. Kaiser, and T. Jebara. [14] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: a multilinguistic Code relatives: detecting similarly behaving software. In Proceedings of token-based code clone detection system for large scale source code. the 2016 24th ACM SIGSOFT International Symposium on Foundations Software Engineering, IEEE Transactions on, 28(7):654–670, 2002. of Software Engineering, pages 702–714. ACM, 2016. [15] M. Kim, L. Bergman, T. Lau, and D. Notkin. An ethnographic [31] S. Thummalapenta and T. Xie. Parseweb: a programmer assistant for study of copy and paste programming practices in oopl. In Empirical reusing open source code on the web. In Proceedings of the twenty- Software Engineering, 2004. ISESE’04. Proceedings. 2004 International second IEEE/ACM international conference on Automated software Symposium on, pages 83–92. IEEE, 2004. engineering, pages 204–213. ACM, 2007. [16] M. Kim, V. Sazawal, D. Notkin, and G. Murphy. An empirical study of [32] B. Vasilescu, K. Blincoe, Q. Xuan, C. Casalnuovo, D. Damian, P. De- code clone genealogies. In ACM SIGSOFT Software Engineering Notes, vanbu, and V. Filkov. The sky is not the limit: Multitasking on GitHub volume 30, pages 187–196. ACM, 2005. projects. In International Conference on Software Engineering, ICSE, [17] N. Meng, M. Kim, and K. S. McKinley. Systematic editing: generating 2016. to appear. program transformations from an example. In ACM SIGPLAN Notices, [33] Q. Xuan, A. Okano, P. Devanbu, and V. Filkov. Focus-shifting patterns volume 46, pages 329–342. ACM, 2011. of oss developers and their congruence with call graphs. In Proceedings [18] N. Meng, M. Kim, and K. S. McKinley. Lase: locating and applying of the 22nd ACM SIGSOFT International Symposium on Foundations of systematic edits by learning from examples. In Proceedings of the Software Engineering, pages 401–412. ACM, 2014.