Open-Source Tools and Benchmarks for Code-Clone Detection: Past, Present, and Future Trends
Andrew Walker, Tomas Cerny, Eungee Song
Computer Science, ECS, Baylor University
One Bear Place #97141, Waco, TX 76798

ABSTRACT
A fragment of source code that is identical or similar to another is a code-clone. Code-clones make it difficult to maintain applications as they create multiple points within the code where bugs must be fixed, new rules enforced, or design decisions imposed. As applications grow larger and larger, the pervasiveness of code-clones likewise grows. To face the code-clone related issues, many tools and algorithms have been proposed to find and document code-clones within an application. In this paper, we present the historical trends in code-clone detection tools to show how we arrived at the current implementations. We then present our results from a systematic mapping study on current (2009-2019) code-clone detection tools with regards to technique, open-source nature, and language coverage. Lastly, we propose future directions for code-clone detection tools. This paper provides the essentials to understanding the code-clone detection process and the current state-of-the-art solutions.

CCS Concepts
• Software and its engineering → Formal software verification; Software maintenance tools; Software verification and validation; Parsers

Keywords
Code Clone, Clone Detection, Mapping Study, Survey

Copyright is held by the authors. This work is based on an earlier work: RACS'19 Proceedings of the 2019 ACM Research in Adaptive and Convergent Systems, Copyright 2019 ACM 978-1-4503-6843-8. http://doi.acm.org/10.1145/3338840.3355654

APPLIED COMPUTING REVIEW DEC. 2019, VOL. 19, NO. 4

1. INTRODUCTION
Code-clone detection is the process of finding exact or similar pieces of code known as code clones within an application. Code-clones are introduced in a multitude of ways, including one of the most significant ways, which is through code reuse by developers. This involves a developer copying either pre-existing fragments, coding style, or both. Another way is through repeated computation using duplicated functions with slight changes and variations in variables or data structures used. This is also done for the purposes of enhancements or customization [10]. These changes that are so often used for modification and performance enhancement of a software system lead to code clones. Despite some initial surface-level benefits of code-clones, they, in fact, make the source files very hard to modify consistently. For instance, consider a software system that has several cloned subsystems created by duplication with slight modification. When a fault is found in one subsystem, caution must be used to modify all other subsystems [44] or risk the persistence of the bug into deployment. If the existence of clones has been documented and maintained properly, the modification would be relatively easy; however, keeping all clone information is a generally expensive process, especially for a large and complex system.

In this paper, we provide the roadmap to existing research and state-of-the-art code clone detection. We searched the IEEE Xplore, ACM Digital Library, and SpringerLink indexers to identify existing work since 2009. From 3,056 found papers, we recognized and reviewed 67 papers that provide tools and benchmarks that could be used to detect code-clones. We identify and classify the techniques used by modern clone detection tools. We also observe and identify open-source tools and provide references. Finally, we determine the coverage by modern tools across existing programming languages. Out of our assessment, we compile together future trends. The reader of this paper will understand recent research in this area with respect to tools and benchmarks and thus be able to work on top of existing artifacts instead of reinventing the wheel.

The rest of the paper is organized as follows. Section 2 presents the background on code clones and clone types, as well as an overview of the code-clone detection process. Section 3 presents our process and results from a mapping study on current trends in code-clone detection tools. Section 4 discusses future trends for code-clone detection tools, as found through our comprehensive study. This is followed by threats to validity and our conclusion.

2. BACKGROUND
In this section, we present a background on code-clones and their detection. A large issue within the field of code-clone detection is the definition of what constitutes a code-clone. Firstly, we cover the generally accepted types of code-clones. We then present a summary of historical trends in code-clone detection. While this section does not fully cover every tool, we believe the tools we cover give a good representation of the historical trends. Lastly, we cover the benchmarks that are used to test code-clone detection tools.

2.1 Basic Definitions
Throughout this paper, we use the following well-accepted basic definitions from the previous surveys on code clones [11, 66, 75]:

Code Fragment: A continuous segment of the source code, specified by $(l, s, e)$, including the source file $l$, the line the fragment starts on, $s$, and the line it ends on, $e$.

Clone Pair: A pair of code fragments that are similar, specified by $(f_1, f_2, \phi)$, including the similar code fragments $f_1$ and $f_2$, and their clone type $\phi$.

Clone Class: A set of code fragments that are similar, specified by the tuple $(f_1, f_2, \ldots, f_n, \phi)$. Each pair of distinct fragments is a clone pair: $(f_i, f_j, \phi)$, $i, j \in 1 \ldots n$, $i \neq j$.

Code Block: A sequence of code statements within braces.

2.2 Types of Clones
The high-level recognized clone classification is broken into two categories: syntactic and semantic clones. Syntactic clones are two code fragments that are similar based on their text [67, 3], while semantic clones are two code fragments that are similar based on their functions [21]. Furthermore, from a more detailed perspective, there are four types of code clones, where the first three types fit the syntactic clone category and the fourth one fits the semantic clone category.

Type-1: A type-1 code-clone is one in which the two fragments are exactly identical. However, the two code fragments do not need to be precisely the same with regards to whitespace, blanks, and comments, as these are generally removed for the code-clone detection process.

Type-2: A type-2 code-clone is one in which two code fragments are similar except for the renaming of some unique identifiers such as function/class names and variable identifiers. In a seminal paper on type-2 clones, Baker identifies the replacement of these unique identifiers as "parameterizing" the code fragment [8, 6].

Type-3: A type-3 code-clone is essentially a type-2 code-clone; however, the fragments may be modified. This includes adding and removing portions of the code from the two fragments or reordering statements within a code block.

source code that have no bearing in the comparison process. Second, it transforms source code into units by dividing it into separate fragments such as classes, functions, begin-end blocks, or statements. This is done in a variety of ways, including lexical or Abstract Syntax Tree (AST) analysis. These units are used to check for the existence of direct code-clone relations. Last, this process will define the comparison units. For instance, source units can be divided into tokens.

Transformation: This process transforms the source code into a corresponding Intermediate Representation (IR) for the comparison. There are many types of representations that can be constructed from the source code, such as token streams, in which each line of source code is converted into a sequence of tokens. Another common construct is the AST, in which all of the parsed source code is transformed into an abstract syntax tree or parse tree for sub-tree comparisons. Additionally, source code can be extracted into a Program Dependency Graph (PDG), which is used to represent control and data dependencies. A PDG is usually made using semantics-aware techniques from the source code for sub-graph comparison.

Match Detection: This process compares every transformed fragment of code to all other fragments to find similar source code fragments. The output is a set of similar code fragments either in a clone pair list such as (c1,c2) or a set of combined clone pairs in one class or one group such as (c1,c2,c3).

2.4 Historical Trends
In the next section of this paper, we will present our findings from a mapping study on modern code-clone detection tools. This study is limited to papers within the past decade (2009-2019) to focus exclusively on modern trends. However, there were many established, well-known tools before 2009, which are worth discussing. We discuss those tools briefly below although they fall outside of the scope of our mapping study.

Some of the earliest and most seminal work on code-clone detection was done in the early 1990s. Baker proposed in 1992 a code-clone detection tool [5] that was based on line-by-line comparison. For the purposes of comparison, whitespace and comments were removed from source-code files. In 1995, this algorithm was updated into a tool called Dup [7], which used the idea of "parameterization" to allow the discovery of type 1 and type 2 clones. To "parameterize" the source code, all unique identifiers (e.g., variable names,
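
The syntactic clone types and the idea of "parameterizing" identifiers can be illustrated with a minimal Python sketch. It is not the algorithm used by Dup or by any tool discussed in this paper. The keyword set, the regular-expression tokenizer, the function names, and the 0.7 similarity threshold are all assumptions chosen for illustration, and the tokenizer ignores string literals for brevity.

```python
import re
from difflib import SequenceMatcher

# Assumed keyword set for a small C-like language; a real tool would use a full lexer.
KEYWORDS = {"int", "void", "return", "if", "else", "for", "while"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Drop comments and whitespace, then split the code into tokens."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    return TOKEN_RE.findall(code)


def parameterize(tokens):
    """Replace each distinct identifier with a placeholder numbered by first use."""
    placeholders = {}
    result = []
    for token in tokens:
        if re.match(r"[A-Za-z_]", token) and token not in KEYWORDS:
            placeholders.setdefault(token, "P%d" % len(placeholders))
            result.append(placeholders[token])
        else:
            result.append(token)
    return result


def classify(a, b, threshold=0.7):
    """Return 'type-1', 'type-2', 'type-3', or None for two code fragments.

    The threshold for type-3 is an illustrative assumption, not a standard value.
    """
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if tokens_a == tokens_b:
        return "type-1"  # identical up to whitespace and comments
    params_a, params_b = parameterize(tokens_a), parameterize(tokens_b)
    if params_a == params_b:
        return "type-2"  # identical up to consistent identifier renaming
    if SequenceMatcher(None, params_a, params_b).ratio() >= threshold:
        return "type-3"  # similar, with added, removed, or changed tokens
    return None


original = "int sum(int a, int b) {\n    return a + b;  // add\n}"
reformatted = "int sum(int a,int b){ /* add */ return a+b; }"
renamed = "int total(int x, int y) { return x + y; }"
extended = "int total(int x, int y) { log(x); return x + y; }"

for candidate in (reformatted, renamed, extended):
    print(classify(original, candidate))  # type-1, then type-2, then type-3
```

Numbering placeholders by first use means that only consistent renamings compare equal: a fragment that returns `a + a` instead of `a + b` no longer matches at the type-2 level and falls through to the similarity check.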