
Open-Source Tools and Benchmarks for Code-Clone Detection: Past, Present, and Future Trends

Andrew Walker, Tomas Cerny, Eungee Song
Computer Science, ECS, Baylor University
One Bear Place #97141, Waco, TX 76798
[email protected] [email protected] [email protected]

ABSTRACT
A fragment of source code that is identical or similar to another is a code-clone. Code-clones make it difficult to maintain applications as they create multiple points within the code where bugs must be fixed, new rules enforced, or design decisions imposed. As applications grow larger and larger, the pervasiveness of code-clones likewise grows. To face the code-clone related issues, many tools and algorithms have been proposed to find and document code-clones within an application. In this paper, we present the historical trends in code-clone detection tools to show how we arrived at the current implementations. We then present our results from a systematic mapping study on current (2009-2019) code-clone detection tools with regard to technique, open-source nature, and language coverage. Lastly, we propose future directions for code-clone detection tools. This paper provides the essentials to understanding the code-clone detection process and the current state-of-the-art solutions.

CCS Concepts
• Software and its engineering → Formal software verification; Software maintenance tools; Software verification and validation; Parsers;

Keywords
Code Clone, Clone Detection, Mapping Study, Survey

1. INTRODUCTION
Code-clone detection is the process of finding exact or similar pieces of code, known as code clones, within an application. Code-clones are introduced in a multitude of ways. One of the most significant is code reuse by developers, in which a developer copies pre-existing fragments, coding style, or both. Another way is through repeated computation using duplicated functions with slight changes and variations in the variables or data structures used. This is also done for the purposes of enhancement or customization [10]. These changes, so often used for the modification and performance enhancement of a system, lead to code clones. Despite some initial surface-level benefits of code-clones, they in fact make the source files very hard to modify consistently. For instance, consider a software system that has several cloned subsystems created by duplication with slight modification. When a fault is found in one subsystem, caution must be used to modify all other subsystems [44] or risk the persistence of the bug into deployment. If the existence of clones has been documented and maintained properly, the modification would be relatively easy; however, keeping all clone information is a generally expensive process, especially for a large and complex system.

In this paper, we provide a roadmap to existing research and state-of-the-art code-clone detection. We searched the IEEE Xplore, ACM Digital Library, and SpringerLink indexers to identify existing work since 2009. From 3,056 found papers, we recognized and reviewed 67 papers that provide tools and benchmarks that could be used to detect code-clones. We identify and classify the techniques used by modern clone detection tools. We also observe and identify open-source tools and provide references. Finally, we determine the coverage by modern tools across existing programming languages. Out of our assessment, we compile future trends. The reader of this paper will understand recent research in this area with respect to tools and benchmarks and thus be able to work on top of existing artifacts instead of reinventing the wheel.

The rest of the paper is organized as follows. Section 2 presents the background on code clones and clone types, as well as an overview of the code-clone detection process. Section 3 presents our process and results from a mapping study on current trends in code-clone detection tools. Section 4 discusses future trends for code-clone detection tools, as found through our comprehensive study. This is followed by threats to validity and our conclusion.

Copyright is held by the authors. This work is based on an earlier work: RACS'19 Proceedings of the 2019 ACM Research in Adaptive and Convergent Systems, Copyright 2019 ACM 978-1-4503-6843-8. http://doi.acm.org/10.1145/3338840.3355654

2. BACKGROUND
In this section, we present a background on code-clones and their detection. A large issue within the field of code-clone detection is the definition of what constitutes a code-clone. Firstly, we cover the generally accepted types of code-clones. We then present a summary of historical trends in code-clone detection. While this section does not fully cover every tool, we believe the tools we cover give a good representation of the historical trends.

APPLIED COMPUTING REVIEW DEC. 2019, VOL. 19, NO. 4

Lastly, we cover the benchmarks that are used to test code-clone detection tools.

2.1 Basic Definitions
Throughout this paper, we use the following well-accepted basic definitions from the previous surveys on code clones [11, 66, 75]:

Code Fragment: A continuous segment of source code, specified by (l, s, e), including the source file l, the line the fragment starts on, s, and the line it ends on, e.

Clone Pair: A pair of code fragments that are similar, specified by (f1, f2, φ), including the similar code fragments f1 and f2 and their clone type φ.

Clone Class: A set of code fragments that are similar, specified by the tuple (f1, f2, ..., fn, φ). Each pair of distinct fragments is a clone pair: (fi, fj, φ), i, j ∈ 1...n, i ≠ j.

Code Block: A sequence of code statements within braces.

2.2 Types of Clones
The high-level recognized clone classification is broken into two categories: syntactic and semantic clones. Syntactic clones are two code fragments that are similar based on their text [67, 3], while semantic clones are two code fragments that are similar based on their function [21]. From a more detailed perspective, there are four types of code clones, where the first three types fit the syntactic category and the fourth fits the semantic category.

Type-1: A type-1 code-clone is one in which the two fragments are exactly identical. However, the two code fragments do not need to be precisely the same with regards to whitespace, blanks, and comments, as these are generally removed for the code-clone detection process.

Type-2: A type-2 code-clone is one in which two code fragments are similar except for the renaming of some unique identifiers such as function/class names and variable identifiers. In a seminal paper on type-2 clones, Baker identifies the replacement of these unique identifiers as "parameterizing" the code fragment [8, 6].

Type-3: A type-3 code-clone is essentially a type-2 code-clone in which the fragments may be further modified. This includes adding and removing portions of the code from the two fragments or reordering statements within a code block.

Type-4: A type-4 code clone is different from the previous three in that a type-4 clone is semantically similar but not syntactically similar. These are much more difficult to find, and code-clone detection tools generally focus either on types 1-3 or on type 4.

2.3 The Code-Clone Detection Process
A code-clone detector is a tool that reads one or more source files and finds similarities among fragments of code or text in the files. In every implementation of code-clone detection algorithms, there are common steps performed. The generalized code-clone detection process is described as follows:

Preprocessing: This process first removes all elements in the source code that have no bearing on the comparison process. Second, it transforms source code into units by dividing it into separate fragments such as classes, functions, begin-end blocks, or statements. This is done in a variety of ways, including lexical or Abstract Syntax Tree (AST) analysis. These units are used to check for the existence of direct code-clone relations. Last, this process will define the comparison units. For instance, source units can be divided into tokens.

Transformation: This process transforms the source code into a corresponding Intermediate Representation (IR) for the comparison. There are many types of representations that can be constructed from the source code, such as token streams, in which each line of source code is converted into a sequence of tokens. Another common construct is the AST, in which all of the parsed source code is transformed into an abstract syntax tree or parse tree for sub-tree comparisons. Additionally, source code can be extracted into a Program Dependency Graph (PDG), which is used to represent control and data dependencies. A PDG is usually made using semantics-aware techniques from the source code for sub-graph comparison.

Match Detection: This process compares every transformed fragment of code to all other fragments to find similar source code fragments. The output is a set of similar code fragments, either in a clone pair list such as (c1, c2) or a set of combined clone pairs in one class or group such as (c1, c2, c3).

2.4 Historical Trends
In the next section of this paper, we will present our findings from a mapping study on modern code-clone detection tools. This study is limited to papers within the past decade (2009-2019) to focus exclusively on modern trends. However, there were many established, well-known tools before 2009 which are worth discussing. We discuss those tools briefly below, although they fall outside of the scope of our mapping study.

Some of the earliest and most seminal work on code-clone detection was done in the early 1990s. Baker proposed in 1992 a code-clone detection tool [5] that was based on line-by-line comparison. For the purposes of comparison, whitespace and comments were removed from source-code files. In 1995, this algorithm was updated into a tool called Dup [7], which used the idea of "parameterization" to allow the discovery of type 1 and type 2 clones. To "parameterize" the source code, all unique identifiers (e.g., variable names, method names, etc.) were replaced with a unique character before comparison. Dup was also updated to use a "parameterized" suffix tree for comparison.

An algorithm using the novel approach of comparing source code ASTs was proposed in 1998 with the tool CloneDR [10]. This tool used tree matching on ASTs extracted from source code to find exact or near-exact matches of code fragments. CloneDR was limited to type 1 and type 2 code clones.

In 2002, the tool CCFinder [35] was introduced. This built upon earlier work and expanded the lexical nature of code-clone detection tools. CCFinder tokenized source code files and used a set of language-specific lexical rules to transform the token stream. Much like Baker's previous work, these transformations include the parameterization of some tokens as well as the removal of unnecessary tokens.

Using a prefix-tree and a novel matching algorithm, common prefixes of the tokens were calculated to find code duplication. CCFinder was able to parse millions of lines of code in minutes and had strong results with types 1-2 code-clones. Furthermore, it is able to extract code-clones from C, C++, Java, and COBOL.

CP-Miner [49] was introduced in 2004 as a successor to CCFinder. This tool included a vast speed increase as well as accuracy improvements over CCFinder. The main feature was the addition of bug-detection among code-clones. This tool was applied to Linux and the Apache web server and discovered bugs that were later patched.

A novel approach to code-clone detection was introduced in 2007 with the Deckard [31] code-clone detection tool. Deckard was tree-based, similar to CloneDR, but took an innovative method of transforming the ASTs into characteristic vectors. The vectors are then clustered to reduce comparisons. Deckard has impressive results when applied to large repositories, including the Linux kernel, and was able to detect types 1-3 code-clones.

2.5 Benchmarks
Beyond just the tools themselves, there have been a number of developments within the area of benchmarking and validity testing for code-clone detection tools.

One of the first code-clone benchmarks was Bellon's benchmark [11], introduced in 2007. This benchmark was collected by running six different code-clone tools against two small C programs and two small Java programs. These results were compared against a corpus of real code-clones to create the dataset of code-clones. Most recently, another framework for assessing code-clone detection tools has emerged. BigCloneBench [76] is a collection of eight million validated clones within IJaDataset-2.0, a big-data software repository containing 25,000 open-source Java systems. BigCloneBench contains both intra-project and inter-project clones of all four primary clone types. Recall results for a number of historical tools on the BigCloneBench benchmark are in Table 1. Recall is defined by BigCloneBench using six categories:

• Type-1 (T1) - A type 1 clone
• Type-2 (T2) - A type 2 clone
• Very Strongly Type-3 (VST3) - A type 3 clone with a syntactical similarity of 90-100%
• Strongly Type-3 (ST3) - A type 3 clone with a syntactical similarity of 70-90%
• Moderately Type-3 (MT3) - A type 3 clone with a syntactical similarity of 50-70%
• Weakly Type-3/Type-4 (WT3/T4) - A type 3 clone with a syntactical similarity of 0-50% or a type 4 clone

Table 1: BigCloneBench Recall Results
Tool     | T1  | T2 | VST3 | ST3 | MT3 | WT3/T4
CCFinder | 100 | 93 | 62   | 15  | 1   | 0
CloneDR  | 100 | 94 | 71   | 21  | 1   | 0
Deckard  | 60  | 58 | 62   | 31  | 12  | 1

3. SYSTEMATIC LITERATURE REVIEW
In this section, we present our findings from a comprehensive and systematic review of existing literature in the field of code-clone detection. We will first present the methodology we used to collect the papers we reviewed and then break down the results to answer a few key questions about recent trends in code-clone detection. By answering these questions, we provide a comprehensive view of the nature of the tools themselves. The questions we examined in this mapping study are as follows:

R1. What techniques are used by modern code-clone detection tools?

R2. How many of these modern code-clone detection tools are open-source or publicly available?

R3. What is the language coverage of modern code-clone detection tools?

3.1 Previous Studies
There have been many studies done in the past on code-clone detection tools, but our study provides a unique perspective on understanding trends among these tools.

A study in 2002 [12] compared five state-of-the-art tools in order to determine which tool was the best. This study only focused on tools up until 2002 and did not address any definable research questions beyond which tool was the "best". A further study in 2004 [82] focused on which specific technique was the best, as opposed to which tool was the best. To answer this question, five state-of-the-art tools were analyzed. This study was limited through its choice of tools, as it was only able to evaluate three techniques for code-clone detection: line-based matching, parameterized tree matching, and metric fingerprinting. Furthermore, like the previous study, this one only focused on tools up until 2002.

One study by Koschke in 2007 [42] only focuses on five well-known tools and does not have any overlap with the questions we propose in our study. In fact, this study doesn't identify the languages covered by the tools at all. Furthermore, this study only covers tools up until 2007, well outside of our study's range.

Another study in 2007 [66] was one of the first to provide a meaningful study of code-clone techniques. This study examined 23 tools to account for a wide variety of tools and techniques. However, while this study did list the languages for each tool, it did not contribute any aggregate statistics or analysis of the language coverage of these tools. This was a missed opportunity, given that this study was one of the first to utilize a large group of tools.

We are aware of only two studies that have attempted to study as many tools as ours. The first of these studies [68] used over 40 tools to provide a comprehensive study on the effectiveness of the tools and techniques. This study is limited to tools up until 2009.
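The type-3 similarity bands that BigCloneBench uses (listed in Section 2.5) amount to a simple threshold function. The sketch below is illustrative only: the `similarity` stand-in uses Python's difflib ratio over whitespace-split tokens, whereas the benchmark computes syntactical similarity with its own line- and token-level measurements:

```python
from difflib import SequenceMatcher

def t3_category(sim: float) -> str:
    # Bucket a syntactical similarity score in [0.0, 1.0] into the
    # BigCloneBench type-3/type-4 categories.
    if sim >= 0.9:
        return "VST3"
    if sim >= 0.7:
        return "ST3"
    if sim >= 0.5:
        return "MT3"
    return "WT3/T4"

def similarity(frag1: str, frag2: str) -> float:
    # Stand-in similarity measure: diff ratio over whitespace tokens
    # (an assumption; not the benchmark's own measure).
    return SequenceMatcher(None, frag1.split(), frag2.split()).ratio()

sim = similarity("int a = b + c ;", "int a = b + c + d ;")
print(t3_category(sim))  # ST3: one statement extended slightly
```

This also explains why recall in Table 1 collapses toward MT3/WT3: the more a fragment is edited, the further its similarity falls below the thresholds that syntactic tools can reliably match.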

Figure 1: Classification of Code-Clone Techniques

Another study was done in 2013 [64] using well over 200 tools. This study provided a novel, detailed analysis of state-of-the-art tools and a comprehensive breakdown of the techniques used in code-clone detection. Our study's R1 overlaps with their R1.1 and R1.2; however, like previous studies, their study lacks a modern perspective, since they limited their study to tools up until 2012. Furthermore, noticeably lacking from their study is an analysis of language coverage, despite a focus on which benchmark applications are used. We will remedy these gaps with our own study.

Lastly, a further comprehensive study was done in 2016 [75]. This study focused on the techniques of code-clone detection tools up until 2015. This study lacks a varied pool of tools, limiting itself to only 25 tools. Furthermore, despite allowing for tools up until 2015, not even half of the tools mentioned are later than 2009.

Our study is the first of its kind to focus on modern advancements in code-clone detection. We utilize a significantly higher number of papers for our study than the average for previous studies. Furthermore, there have been no previous studies that seek to answer R2/R3.

3.2 Methodology
We used three indexing sites and portals (indexers): IEEE Xplore, ACM Digital Library (DL), and SpringerLink. We tailored our search queries to look for papers related to code-clone detection tools. To help refine our query, we included the terms 'algorithm' and 'tool' to focus the search on the tools themselves and hopefully avoid case studies and empirical evaluations. Since R2 focuses on the open-source nature of the tools, we included that in the query as well to make sure any open-source focused papers were included. Lastly, we weren't interested in studies on existing tools or evaluations of existing tools, so we excluded those from our query. The query we used is shown in Listing 1.

    ('code clone' OR 'clone detection') AND
    ('algorithm' OR 'tool' OR application OR
    'open-source' OR 'open source')
    NOT ('study' OR 'evaluation')

    Listing 1: Search Query for the Indexers

Each indexer had a specific syntax for applying the generalized query, and in total, the query returned a large number of results from the three indexers we used. We then filtered the initial results further by examining only journal/conference full-text papers and limiting the date range to between 2009 and the present (2019). Each paper after the initial filtering was then evaluated on its title and abstract to determine if it was a fit for the papers we were looking for - i.e., papers regarding code-clone detection tools or algorithms. If it was not possible to disqualify a paper from the title and abstract alone, its content was evaluated as well for further insight into the nature of the paper. A breakdown of the number of initial results and post-filtering results is shown in Table 2. Papers that were cross-indexed between indexers only appear under whichever indexer was searched first. Any paper that made it through the second phase of filtering was then thoroughly studied and evaluated based on research questions R1-R3.

To evaluate the research questions we wished to answer, we collected certain meta-data about the tools mentioned within the reviewed papers. If there was a name attached to the tool, that was collected; otherwise, we used the author's name and year as a generalized title. We also recorded all methods used by the tool for code-clone detection. At this point, we didn't filter the tools into specific categories but instead gathered notes about all of their methodologies to use for sorting later. We recorded which languages were covered by the tools or, in the case of some tools, designated them as either language-agnostic or unspecified if the paper didn't mention which languages they worked for. Lastly, we recorded whether the tool was publicly available, in binary or some other format, or open-source. If the tools had names, a simple internet search usually returned the downloadable tool or open-source repository. Without a name, we defined a search query for locating the tool online. We searched each link on the first page of results for reference to a downloadable/open-source tool or the tool itself. That query can be seen in Listing 2.

    % name % year (code-clone OR clone)
    (detect* OR '') (tool OR '')
    (git OR 'open-source' OR repository)

    Listing 2: Search Query for Open-Source Tools

3.3 Techniques of Modern Tools
We divided the code-clone detection techniques into five main classes: textual, token, syntactic, semantic, and learning. Previous studies [75] on code-clone techniques have divided the techniques into only four classes - textual, lexical, syntactic, and semantic.

In our classification, token-based techniques are equivalent to lexical techniques. Furthermore, we saw such a high number of learning-based approaches, focused on machine learning and data mining, that we included a separate section for them in our classification. Similar to previous studies, we further divided syntactic approaches into tree-based and metric-based techniques. We also distinguished, within the semantic approaches, those that used graph-based techniques. Our classification hierarchy can be seen in Fig. 1. In this section, the code-clone detection techniques are presented and compared.

Table 2: Results from Chosen Indexers
Indexer      | Initial Results | Filtered | Relevant
IEEE Xplore  | 280             | 224      | 58
ACM DL       | 1,695           | 807      | 7
SpringerLink | 1,081           | 73       | 2
Total        | 3,056           | 1,104    | 67

Textual Approaches: In this approach, two code fragments are compared with each other in the form of text/strings/lexemes and are only found to be clones if the two code fragments are literally identical in terms of textual content. Text-based clone detection techniques are easy to implement and are completely independent of language. The code-clone detection tools using text-based techniques are available in Table 3.

Table 3: Text-Based Code-Clone Detection Tools
Tool                     | OSS? | Languages
BCFinder (Tang2018) [78] | Yes  | Binary
Ito2017 [27]             | No   | Any
Kodhai2010 [41]          | No   | C
VFDETECT (Liu2017) [51]  | No   | Java
Gupta2018 [22]           | No   | Any
LePalex (Maeda2010) [52] | No   | Java
CCCD (Krutz2013) [43]    | Yes  | C
Cuomo2012 [16]           | No   | Java
SimCad (Uddin2013) [81]  | No   | Java
Park2013 [60]            | No   | C
Lim2011 [50]             | No   | C
Lesner2010 [46]          | No   | C
Iwamoto2013 [29]         | No   | Java / C
Agrawal2013 [2]          | No   | Java
Nicad (Cordy2011) [15]   | Yes  | C / C# / Java / Python / WSDL

Token-Based Approaches: In these techniques, all source code lines are divided into a sequence of tokens, as during the lexical analysis phase of a compiler. All tokens are then converted back into token sequence lines, and the token sequence lines are matched to identify and report code-clones. The code-clone detection tools using token-based techniques are available in Table 4.

Table 4: Token-Based Code-Clone Detection Tools
Tool                            | OSS? | Languages
Poon2012 [62]                   | No   | Java
CtCompare (Toomey2012) [80]     | Yes  | Unspecified
CloneWorks (Svajlenko2017) [77] | Yes  | Java
SHINOBI (Kawaguchi2009) [37]    | Yes  | Any
CodeEase (Abid2017) [1]         | Yes  | Unspecified
SaCD (Shi2013) [63]             | No   | Java / C / C++
Boreas (Yuan2012) [89]          | No   | Java
SourcererCC (Sajnani2016) [70]  | Yes  | Java
CPDP (Muddu2013) [57]           | No   | Java
SourcererCC-I (Saini2016) [69]  | No   | Java
Semura2018 [74]                 | No   | Any
Lavoie2012 [45]                 | No   | Java
Iwamoto2012 [28]                | No   | Java / C
Gode2009 [23]                   | No   | Java / C
Dexsim (Elsabagh2018) [18]      | No   | Java
Hummel2010 [26]                 | No   | C
Dong2012 [17]                   | No   | Binary
CCFinderSW (Semura2017) [73]    | Yes  | Any
Agrawal2013 [2]                 | No   | Java
Mahajan2014 [53]                | No   | C / C++
Keivanloo2013 [38]              | No   | Unspecified
Lesner2010 [46]                 | No   | C

Syntactical Approaches: Syntactical approaches are classified into two broad kinds of techniques: tree-based techniques and metric-based techniques. In tree-based code-clone detection techniques, an extracted AST is used for sub-tree comparisons to identify similar regions. These similar regions constitute a code-clone. In metric-based clone detection techniques, individual vectors are created for each code fragment, using metrics gathered from the source code. These vectors are then compared to find regions of similar code. The code-clone detection tools using AST-based techniques are available in Table 5 and those using metric-based techniques are available in Table 6.

Table 5: AST-Based Code-Clone Detection Tools
Tool                              | OSS? | Languages
Merlo2009 [54]                    | No   | Java
Chodarev2015 [13]                 | No   | Haskell
Clone Merge (Narasimhan2015) [58] | Yes  | C / C++
Yang2018 [87]                     | No   | Java
Zeng2019 [90]                     | No   | Java
Li2014 [47]                       | No   | Java / PHP
Li2011 [48]                       | No   | Erlang
Nichols2019 [59]                  | No   | C# / C++ / Go / Java / JavaScript / Python / Swift

Semantic Approaches: A semantic approach detects two fragments of code that perform the same computation but have differently structured code. Semantic code-clone detection can take many different approaches; however, one of the predominant methods is the use of graph-based techniques. In graph-based clone detection, a program dependency graph (PDG) is constructed to represent the control and data flow dependencies of a function from source code. A comparison of two PDGs is used to identify syntactic and semantic differences between two versions of a program. For the purposes of the mapping study, we distinguish PDG-based semantic code-clone detection and generalize all other methods into a generic "semantic" methodology category. The code-clone detection tools using PDG-based techniques are available in Table 8 and those using semantic techniques are available in Table 7.
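As a toy illustration of the tree-based idea (using Python's own ast module; the surveyed tools target other languages with their own parsers), the sketch below serializes each function's sub-tree by node types only, ignoring identifiers and constants, so structurally identical fragments, i.e. type-1/type-2 candidates, land in the same bucket without pairwise tree matching:

```python
import ast
from collections import defaultdict

def subtree_shape(node: ast.AST) -> str:
    # Serialize a sub-tree by node types only, ignoring identifier
    # names and constant values, so renamed clones get the same key.
    children = ",".join(subtree_shape(c) for c in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"

def ast_clones(source: str, min_size: int = 4):
    # Group function sub-trees by shape; any bucket with more than
    # one entry is a candidate clone class. min_size filters out
    # trivially small sub-trees.
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and len(list(ast.walk(node))) >= min_size:
            buckets[subtree_shape(node)].append(node.name)
    return [names for names in buckets.values() if len(names) > 1]

code = """
def add(a, b):
    return a + b

def plus(x, y):
    return x + y

def greet(name):
    print("hi", name)
"""
print(ast_clones(code))  # [['add', 'plus']]
```

Hashing shapes this way avoids the quadratic sub-tree comparison that naive tree matching requires, which is the same motivation behind Deckard's characteristic vectors.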

Learning Approaches: Learning approaches are oftentimes extremely varied within the code-clone detection domain. We classify learning approaches as those using machine learning or other learning-based techniques for detecting code-clones. Similar to learning approaches are the approaches that use data mining in their code-clone detection process. All in all, learning approaches have been the least explored methodology for modern code-clone detection tools. The code-clone detection tools using learning-based techniques are available in Table 9 and those using data-mining techniques are available in Table 10.

Table 6: Metric-Based Code-Clone Detection Tools
Tool              | OSS? | Languages
Fukushima2009 [20]| No   | Java
Higo2011 [24]     | No   | Java
Roopam2017 [65]   | No   | Java

Table 7: Semantic Code-Clone Detection Tools
Tool                         | OSS? | Languages
Elva2012 [19]                | No   | Java
Avetisyan2015 [4]            | No   | C
Sarala2013 [72]              | No   | C#
Tekchandani2013 [79]         | No   | Unspecified
Agec (Kamiya2013) [34]       | Yes  | Java
LICCA (Vislavski2018) [83]   | Yes  | Java / C / JavaScript
SeByte (Keivanloo2012) [39]  | No   | Java
Yu2019 [88]                  | No   | Java
Mahajan2014 [53]             | No   | C / C++
Keivanloo2013 [38]           | No   | Unspecified

Table 8: PDG-Based Code-Clone Detection Tools
Tool                              | OSS? | Languages
Kamiya2012 [33]                   | No   | C
Kamalpriya2017 [32]               | No   | Java / C
CCSharp (Wang2017) [84]           | No   | Java
Higo2011 [25]                     | No   | Java
Clone-Differentiator (Xing2011) [86] | Yes | Java
Cholakov2015 [14]                 | No   | Java
Higo2011 [24]                     | No   | Java
Roopam2017 [65]                   | No   | Java
Kim2016 [40]                      | No   | C / Binary

Table 9: Learning-Based Code-Clone Detection Tools
Tool                             | OSS? | Languages
Mostaeen2018 [55]                | Yes  | C / Java
Jadon2016 [30]                   | No   | C
White2016 [85]                   | No   | Java
CloneCognition (Mostaeen2019) [56] | Yes | Java
Kim2016 [40]                     | No   | C / Binary

Table 10: Data Mining-Based Code-Clone Detection Tools
Tool                      | OSS? | Languages
Kaur2016 [36]             | No   | Unspecified
Clone Miner (Basit2009) [9] | No | Java

3.4 Open-Source Nature of Modern Tools
The second question we wished to answer was the open-source nature of modern code-clone detection tools. We examined each tool discussed within our mapping study and looked to see if it was either open-source or publicly available. A breakdown of the open-source count and percentage of the tools is available in Table 11.

Table 11: Open-Source Results
Open-Source | Count | Percentage
No          | 52    | 78%
Yes         | 15    | 22%

In answer to R2, we found that almost none of the tools we examined were open-source or publicly available. In fact, tools that were not available at all were more prevalent in a nearly 4:1 ratio against the open-source tools. Table 12 shows the open-source/publicly available tools along with links to their repositories or download pages (as of November 2019).

3.5 Language Coverage by Modern Tools
The last question we wished to answer was the coverage of languages by modern code-clone detection tools. We found that, by a very large margin, the most popular language for code-clone detection tools was Java, but that there indeed existed a wide range of languages covered. The second most popular language was C, followed by C++; however, a large disparity exists between Java/C and the rest of the tools. It's possible that the reason for such a high number of Java-based and C-based code-clone detection tools is that the two predominant benchmark systems, Bellon's and BigCloneBench, are limited to C/Java and Java, respectively. As the standardized benchmarks for testing the accuracy and recall of code-clone detection tools and algorithms, they have an influence on the types of languages code-clone tools are created for. A full breakdown of the languages covered by the tools we examined is available in Fig. 2.

4. FUTURE TRENDS
While the area of code-clone detection has been thoroughly explored in the past, most especially within the past decade, there exist a number of areas that must be improved on. In this section, we discuss the future directions for three aspects of code-clone detection tools: techniques, open-source nature, and, lastly, the benchmarks used.

4.1 Techniques
The predominant code-clone detection technique was the token-based approach. This is more than likely due to the rise of lexical libraries such as ANTLR [61], which make it easier than ever to tokenize using simple lexical grammar files. Many of the new approaches do not have the wide-ranging effect that they likely should.


Figure 2: Language Coverage of Modern Code-Clone Tools

Table 12: Available Open-Source Tools
Tool                 | Link
SHINOBI              | https://sdlab.naist.jp/projects/shinobi.html
Clone Merge          | https://krishnanm86.wixsite.com/clonerge
CloneCognition       | https://github.com/pseudoPixels/CloneCognition
CtCompare            | https://github.com/awakecoding/ctcompare
CloneWorks           | https://github.com/jeffsvajlenko/CloneWorks
CodeEase             | https://github.com/shamsa-abid/CodeEase
SourcererCC          | https://github.com/Mondego/SourcererCC
CCFinderSW           | https://github.com/YuichiSemura/CCFinderSW
Nicad                | https://www.txl.ca/txl-nicaddownload.html
Clone-Differentiator | https://sites.google.com/site/yinxingxue/home/projects/clonedifferentiator
Agec                 | https://github.com/tos-kamiya/agec
LICCA                | https://github.com/tvislavski/licca
Mostaeen2018         | https://github.com/pseudoPixels/ML_CloneValidationFramework
BCFinder             | http://www.imm.dtu.dk/~sljo/bcfinder/gettingstarted.html
CCCD                 | http://www.se.rit.edu/~dkrutz/CCCD/index.html

Recent trends in data-mining and machine learning approaches have gone largely unnoticed; however, these techniques have shown incredible success at code-clone detection. More care must be taken in the future to look at all avenues of code-clone detection as a step back from the lexical analysis that is already so thoroughly explored.

4.2 Open-Source
The tools we examined in this study were largely unavailable either as open-source or even as a downloadable tool for use. For both the academic and industry community, this is a loss. We noted that tools that were open-source showed a higher propensity towards being continually updated and improved externally, as opposed to the closed-source tools we reviewed. Tools such as SourcererCC [71] were able to be extended into SourcererCC-I [70]. Similarly, CCFinder [35] has been extended into the CCFinderSW [73] and BCFinder [78] tools. Furthermore, tools that were closed-source were oftentimes only improved by the original researcher or group of researchers, as opposed to the open-source tools, which showed collaboration among many diverse groups of researchers. The future of code-clone detection tools relies on the extension of existing techniques, but removing these tools from the public makes this collaborative improvement impossible. Tools must be made open-source to allow the use of the tool and outside collaboration on improvements. The world of computer science is moving towards open-source, and code-clone detection should not be left in the past.

4.3 Benchmarks
The last observation we made about the tools is that nearly every tool covered either Java or C/C++. Our examination of the benchmarks used highlights that this is more than likely due to the fact that no sufficient benchmarks exist for other languages. BigCloneBench and Bellon's benchmark both provide good benchmarks for C/Java, but there are noticeable gaps for up-and-coming languages like Python and JavaScript.

APPLIED COMPUTING REVIEW DEC. 2019, VOL. 19, NO. 4 34
are language agnostic still tended to use the existing benchmarks, and so they are unable to fully demonstrate their capabilities for other languages. To further the code-clone detection area of research, fully-formed benchmarks must be created in other languages to facilitate full testing and a push towards language-agnostic code-clone detection tools.

5. THREATS TO VALIDITY
The main threat to the validity of the mapping study is the exclusion of papers relevant to the topic at hand. We tried to mitigate this as much as possible by making our query broad towards code-clone detection to include as many relevant papers as possible. Also, we chose the time frame of a decade (2009-2019) to cover modern trends; however, arguments can be made for either lengthening or shortening the time frame. While it's possible we didn't include every relevant paper, we found overwhelming statistics within our mapping study, and it's unlikely the inclusion or exclusion of a couple of papers would have had a meaningful impact. We feel the paper set we examined is sufficient to gather an overview of modern code-clone detection.

Another area of possible threats to validity is the filtering process. This was done manually, and while every paper was thoroughly reviewed, it's possible one was excluded at this stage unfairly. As previously mentioned, however, for the purposes of answering R1-R3, a couple of papers would not have had a significant impact due to the overwhelming nature of our results. Regardless, as much care as could reasonably be expected was taken to examine each paper at length before determining whether it was to be included in the study.

6. CONCLUSIONS
In this paper, we provided a comprehensive overview of the necessary information for code-clone detection. Starting with the background and basic definitions, we presented the historical trends in code-clone detection. We then discussed our methodology and results for a comprehensive study of modern code-clone detection tools. Our results highlighted interesting findings with regard to the technique, open-source nature, and language coverage of the tools we examined. We used these results to present future directions we believe will be beneficial to the area of code-clone detection. This paper serves as a roadmap towards more accessible and resilient tools for detecting code-clones within applications.

7. ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. 1854049.

8. REFERENCES
[1] S. Abid, S. Javed, M. Naseem, S. Shahid, H. A. Basit, and Y. Higo. Codeease: harnessing method clone structures for reuse. In 2017 IEEE 11th International Workshop on Software Clones (IWSC), pages 1–7, Feb 2017.
[2] A. Agrawal and S. K. Yadav. A hybrid-token and textual based approach to find similar code segments. In 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pages 1–4, July 2013.
[3] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.
[4] A. Avetisyan, S. Kurmangaleev, S. Sargsyan, M. Arutunian, and A. Belevantsev. Llvm-based code clone detection framework. In 2015 Computer Science and Information Technologies (CSIT), pages 100–104, Sep. 2015.
[5] B. S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 1992.
[6] B. S. Baker. A theory of parameterized pattern matching: Algorithms and applications. In Proceedings of the Twenty-fifth Annual ACM Symposium on Theory of Computing, STOC '93, pages 71–80, New York, NY, USA, 1993. ACM.
[7] B. S. Baker. On finding duplication and near-duplication in large software systems. In Proceedings of 2nd Working Conference on Reverse Engineering, pages 86–95, July 1995.
[8] B. S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996.
[9] H. A. Basit and S. Jarzabek. A data mining approach for detecting higher-level clones in software. IEEE Transactions on Software Engineering, 35(4):497–514, July 2009.
[10] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), pages 368–377, Nov 1998.
[11] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9):577–591, Sep. 2007.
[12] E. Burd and J. Bailey. Evaluating clone detection tools for use during preventative maintenance. In Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation, pages 36–43, Oct 2002.
[13] S. Chodarev, E. Pietriková, and J. Kollár. Haskell clone detection using pattern comparing algorithm. In 2015 13th International Conference on Engineering of Modern Electric Systems (EMES), pages 1–4, June 2015.
[14] T. Cholakov and D. Birov. Duplicate code detection algorithm. In Proceedings of the 16th International Conference on Computer Systems and Technologies, CompSysTech '15, pages 104–111, New York, NY, USA, 2015. ACM.
[15] J. R. Cordy and C. K. Roy. The nicad clone detector. In 2011 IEEE 19th International Conference on Program Comprehension, pages 219–220, June 2011.
[16] A. Cuomo, A. Santone, and U. Villano. A novel approach based on formal methods for clone detection. In 2012 6th International Workshop on Software Clones (IWSC), pages 8–14, June 2012.
[17] M. Dong, H. Zhuang, R. Zhang, S. Bi, X. Zeng, S. Guo, W. Cai, and Z. Tang. A new method of software clone detection based on binary instruction structure analysis. In 2012 8th International Conference on Wireless Communications, Networking and Mobile Computing, pages 1–4, Sep. 2012.
[18] M. Elsabagh, R. Johnson, and A. Stavrou. Resilient and scalable cloned app detection using forced execution and compression trees. In 2018 IEEE Conference on Dependable and Secure Computing (DSC), pages 1–8, Dec 2018.
[19] R. Elva and G. T. Leavens. Semantic clone detection using method ioe-behavior. In 2012 6th International Workshop on Software Clones (IWSC), pages 80–81, June 2012.
[20] Y. Fukushima, R. Kula, S. Kawaguchi, K. Fushida, M. Nagura, and H. Iida. Code clone graph metrics for detecting diffused code clones. In 2009 16th Asia-Pacific Software Engineering Conference, pages 373–380, Dec 2009.
[21] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In 2008 ACM/IEEE 30th International Conference on Software Engineering, pages 321–330, May 2008.
[22] N. Gupta, V. Gandhi, C. Hariya, and V. Shelke. Detection of code clones. In 2018 International Conference on Smart City and Emerging Technology (ICSCET), pages 1–4, Jan 2018.
[23] N. Göde and R. Koschke. Incremental clone detection. In 2009 13th European Conference on Software Maintenance and Reengineering, pages 219–228, March 2009.
[24] Y. Higo and S. Kusumoto. Code clone detection on specialized pdgs with heuristics. In 2011 15th European Conference on Software Maintenance and Reengineering, pages 75–84, March 2011.
[25] Y. Higo, U. Yasushi, M. Nishino, and S. Kusumoto. Incremental code clone detection: A pdg-based approach. In 2011 18th Working Conference on Reverse Engineering, pages 3–12, Oct 2011.
[26] B. Hummel, E. Juergens, L. Heinemann, and M. Conradt. Index-based code clone detection: incremental, distributed, scalable. In 2010 IEEE International Conference on Software Maintenance, pages 1–9, Sep. 2010.
[27] K. Ito, T. Ishio, and K. Inoue. Web-service for finding cloned files using b-bit minwise hashing. In 2017 IEEE 11th International Workshop on Software Clones (IWSC), pages 1–2, Feb 2017.
[28] M. Iwamoto, S. Oshima, and T. Nakashima. Token-based code clone detection technique in a student's programming exercise. In 2012 Seventh International Conference on Broadband, Wireless Computing, Communication and Applications, pages 650–655, Nov 2012.
[29] M. Iwamoto, S. Oshima, and T. Nakashima. A token-based illicit copy detection method using complexity for a program exercise. In 2013 Eighth International Conference on Broadband and Wireless Computing, Communication and Applications, pages 575–580, Oct 2013.
[30] S. Jadon. Code clones detection using machine learning technique: Support vector machine. In 2016 International Conference on Computing, Communication and Automation (ICCCA), pages 399–403, April 2016.
[31] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In 29th International Conference on Software Engineering (ICSE'07), pages 96–105, May 2007.
[32] C. M. Kamalpriya and P. Singh. Enhancing program dependency graph based clone detection using approximate subgraph matching. In 2017 IEEE 11th International Workshop on Software Clones (IWSC), pages 1–7, Feb 2017.
[33] T. Kamiya. Conte*t clones or re-thinking clone on a call graph. In 2012 6th International Workshop on Software Clones (IWSC), pages 74–75, June 2012.
[34] T. Kamiya. Agec: An execution-semantic clone detection tool. In 2013 21st International Conference on Program Comprehension (ICPC), pages 227–229, May 2013.
[35] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7):654–670, July 2002.
[36] H. Kaur and R. Maini. Identification of recurring patterns of code to detect structural clones. In 2016 IEEE 6th International Conference on Advanced Computing (IACC), pages 398–403, Feb 2016.
[37] S. Kawaguchi, T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, and H. Iida. Shinobi: A tool for automatic code clone detection in the ide. In 2009 16th Working Conference on Reverse Engineering, pages 313–314, Oct 2009.
[38] I. Keivanloo and J. Rilling. Semantic-enabled clone detection. In 2013 IEEE 37th Annual Computer Software and Applications Conference, pages 393–398, July 2013.
[39] I. Keivanloo, C. K. Roy, and J. Rilling. Sebyte: A semantic clone detection tool for intermediate languages. In 2012 20th IEEE International Conference on Program Comprehension (ICPC), pages 247–249, June 2012.
[40] J. Kim, H. Choi, H. Yun, and B.-R. Moon. Measuring source code similarity by finding similar subgraph with an incremental genetic algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO '16, pages 925–932, New York, NY, USA, 2016. ACM.
[41] E. Kodhai, S. Kanmani, A. Kamatchi, R. Radhika, and B. V. Saranya. Detection of type-1 and type-2 code clones using textual analysis and metrics. In 2010 International Conference on Recent Trends in Information, Telecommunication and Computing, pages 241–243, March 2010.
[42] R. Koschke. Survey of research on software clones. In R. Koschke, E. Merlo, and A. Walenstein, editors, Duplication, Redundancy, and Similarity in Software, number 06301 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2007. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
[43] D. E. Krutz and E. Shihab. Cccd: Concolic code clone detection. In 2013 20th Working Conference on Reverse Engineering (WCRE), pages 489–490, Oct 2013.
[44] B. Lague, D. Proulx, J. Mayrand, E. M. Merlo, and J. Hudepohl. Assessing the benefits of incorporating function clone detection in a development process. In 1997 Proceedings International Conference on Software Maintenance, pages 314–321, Oct 1997.
[45] T. Lavoie and E. Merlo. An accurate estimation of the levenshtein distance using metric trees and manhattan distance. In 2012 6th International Workshop on Software Clones (IWSC), pages 1–7, June 2012.
[46] B. Lesner, R. Brixtel, C. Bazin, and G. Bagan. A novel framework to detect source code plagiarism: Now, students have to work for real! In Proceedings of the 2010 ACM Symposium on Applied Computing, SAC '10, pages 57–58, New York, NY, USA, 2010. ACM.
[47] D. Li, M. Piao, H. S. Shon, K. H. Ryu, and I. Paik. One pass preprocessing for token-based source code clone detection. In 2014 IEEE 6th International Conference on Awareness Science and Technology (iCAST), pages 1–6, Oct 2014.
[48] H. Li and S. Thompson. Incremental clone detection and elimination for erlang programs. In D. Giannakopoulou and F. Orejas, editors, Fundamental Approaches to Software Engineering, pages 356–370, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[49] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. Cp-miner: A tool for finding copy-paste and related bugs in operating system code. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI'04, pages 20–20, Berkeley, CA, USA, 2004. USENIX Association.
[50] J.-S. Lim, J.-H. Ji, H.-G. Cho, and G. Woo. Plagiarism detection among source codes using adaptive local alignment of keywords. In Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication, ICUIMC '11, pages 24:1–24:10, New York, NY, USA, 2011. ACM.
[51] Z. Liu, Q. Wei, and Y. Cao. Vfdetect: A vulnerable code clone detection system based on vulnerability fingerprint. In 2017 IEEE 3rd Information Technology and Mechatronics Engineering Conference (ITOEC), pages 548–553, Oct 2017.
[52] K. Maeda. An extended line-based approach to detect code clones using syntactic and lexical information. In 2010 Seventh International Conference on Information Technology: New Generations, pages 1237–1240, April 2010.
[53] G. Mahajan and M. Bharti. Implementing a 3-way approach of clone detection and removal using pc detector tool. In 2014 IEEE International Advance Computing Conference (IACC), pages 1435–1441, Feb 2014.
[54] E. Merlo and T. Lavoie. Computing structural types of clone syntactic blocks. In 2009 16th Working Conference on Reverse Engineering, pages 274–278, Oct 2009.
[55] G. Mostaeen, J. Svajlenko, B. Roy, C. K. Roy, and K. A. Schneider. [Research paper] On the use of machine learning techniques towards the design of cloud based automatic code clone validation tools. In 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 155–164, Sep. 2018.
[56] G. Mostaeen, J. Svajlenko, B. Roy, C. K. Roy, and K. A. Schneider. Clonecognition: Machine learning based code clone validation tool. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, pages 1105–1109, New York, NY, USA, 2019. ACM.
[57] B. Muddu, A. Asadullah, and V. Bhat. Cpdp: A robust technique for plagiarism detection in source code. In 2013 7th International Workshop on Software Clones (IWSC), pages 39–45, May 2013.
[58] K. Narasimhan. Clone merge – an eclipse plugin to abstract near-clone c++ methods. In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 819–823, Nov 2015.
[59] L. Nichols, M. Emre, and B. Hardekopf. Structural and nominal cross-language clone detection. In R. Hähnle and W. van der Aalst, editors, Fundamental Approaches to Software Engineering, pages 247–263, Cham, 2019. Springer International Publishing.
[60] S. Park, S. Ko, J. Choi, H. Han, S.-J. Cho, and J. Choi. Detecting source code similarity using code abstraction. In Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC '13, pages 74:1–74:9, New York, NY, USA, 2013. ACM.
[61] T. Parr. Antlr, 2019.
[62] J. Y. Poon, K. Sugiyama, Y. F. Tan, and M.-Y. Kan. Instructor-centric source code plagiarism detection and plagiarism corpus. In Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education, ITiCSE '12, pages 122–127, New York, NY, USA, 2012. ACM.
[63] Q. Q. Shi, L. P. Zhang, F. J. Meng, and D. S. Liu. A novel detection approach for statement clones. In 2013 IEEE 4th International Conference on Software Engineering and Service Science, pages 27–30, May 2013.
[64] D. Rattan, R. Bhatia, and M. Singh. Software clone detection: A systematic review. Information and Software Technology, 55(7):1165–1199, 2013.
[65] Roopam and G. Singh. To enhance the code clone detection algorithm by using hybrid approach for detection of code clones. In 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), pages 192–198, June 2017.
[66] C. K. Roy and J. R. Cordy. A survey on software clone detection research. School of Computing TR 2007-541, Queen's University, 115, 2007.
[67] C. K. Roy and J. R. Cordy. A mutation/injection-based automatic framework for evaluating code clone detection tools. In 2009 International Conference on Software Testing, Verification, and Validation Workshops, pages 157–166, April 2009.
[68] C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program., 74(7):470–495, May 2009.
[69] V. Saini, H. Sajnani, J. Kim, and C. Lopes. Sourcerercc and sourcerercc-i: Tools to detect clones in batch mode and during software development. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), pages 597–600, May 2016.
[70] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes. Sourcerercc: Scaling code clone detection to big-code. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pages 1157–1168, May 2016.
[71] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes. Sourcerercc: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 1157–1168, New York, NY, USA, 2016. ACM.
[72] S. Sarala and M. Deepika. Unifying clone analysis and refactoring activity advancement towards C# applications. In 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pages 1–5, July 2013.
[73] Y. Semura, N. Yoshida, E. Choi, and K. Inoue. Ccfindersw: Clone detection tool with flexible multilingual tokenization. In 2017 24th Asia-Pacific Software Engineering Conference (APSEC), pages 654–659, Dec 2017.
[74] Y. Semura, N. Yoshida, E. Choi, and K. Inoue. Multilingual detection of code clones using antlr grammar definitions. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), pages 673–677, Dec 2018.
[75] A. Sheneamer and J. K. Kalita. A survey of software clone detection techniques. International Journal of Computer Applications, 137(10):1–21, March 2016. Published by Foundation of Computer Science (FCS), NY, USA.
[76] J. Svajlenko and C. K. Roy. Evaluating clone detection tools with bigclonebench. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 131–140, Sep. 2015.
[77] J. Svajlenko and C. K. Roy. Cloneworks: A fast and flexible large-scale near-miss clone detection tool. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pages 177–179, May 2017.
[78] W. Tang, D. Chen, and P. Luo. Bcfinder: A lightweight and platform-independent tool to find third-party components in binaries. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), pages 288–297, Dec 2018.
[79] R. Tekchandani, R. K. Bhatia, and M. Singh. Semantic code clone detection using parse trees and grammar recovery. In Confluence 2013: The Next Generation Information Technology Summit (4th International Conference), pages 41–46, Sep. 2013.
[80] W. Toomey. Ctcompare: Code clone detection using hashed token sequences. In 2012 6th International Workshop on Software Clones (IWSC), pages 92–93, June 2012.
[81] M. S. Uddin, C. K. Roy, and K. A. Schneider. Simcad: An extensible and faster clone detection tool for large scale software systems. In 2013 21st International Conference on Program Comprehension (ICPC), pages 236–238, May 2013.
[82] F. Van Rysselberghe and S. Demeyer. Evaluating clone detection techniques from a refactoring perspective. In Proceedings. 19th International Conference on Automated Software Engineering, 2004., pages 336–339, Sep. 2004.
[83] T. Vislavski, G. Rakić, N. Cardozo, and Z. Budimac. Licca: A tool for cross-language clone detection. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 512–516, March 2018.
[84] M. Wang, P. Wang, and Y. Xu. Ccsharp: An efficient three-phase code clone detector using modified pdgs. In 2017 24th Asia-Pacific Software Engineering Conference (APSEC), pages 100–109, Dec 2017.
[85] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk. Deep learning code fragments for code clone detection. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 87–98, Sep. 2016.
[86] Z. Xing, Y. Xue, and S. Jarzabek. Clonedifferentiator: Analyzing clones by differentiation. In 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), pages 576–579, Nov 2011.
[87] Y. Yang, Z. Ren, X. Chen, and H. Jiang. Structural function based code clone detection using a new hybrid technique. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), volume 01, pages 286–291, July 2018.
[88] D. Yu, J. Yang, X. Chen, and J. Chen. Detecting java code clones based on bytecode sequence alignment. IEEE Access, 7:22421–22433, 2019.
[89] Y. Yuan and Y. Guo. Boreas: an accurate and scalable token-based approach to code clone detection. In 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 286–289, Sep. 2012.
[90] J. Zeng, K. Ben, X. Li, and X. Zhang. Fast code clone detection based on weighted recursive autoencoders. IEEE Access, 7:125062–125078, 2019.

ABOUT THE AUTHORS:

Andrew Walker is a senior computer science undergraduate at Baylor University. His areas of research are verification of distributed systems, static-code analysis, and code-clone detection. He is a member of Upsilon Pi Epsilon and the ACM.

Tomas Cerny is a Professor of Computer Science at Baylor University. His areas of research are software engineering, code analysis, security, aspect-oriented programming, user-interface engineering, and enterprise application design. He received his Master's and Ph.D. degrees from the Faculty of Electrical Engineering at the Czech Technical University in Prague, and an M.S. degree from Baylor University.

Eunjee Song is an Associate Professor & Graduate Program Director in the Department of Computer Science at Baylor University. Her general research interests are in the field of software engineering, with a focus on pattern specification, software analysis and testing, and applying aspect-oriented modeling (AOM) and model-driven engineering techniques to specifying and analyzing complex systems. She received her Ph.D. and M.S. degrees in Computer Science from Colorado State University, and her B.S. degree in Computer Engineering and B.S. degree in Architecture from Seoul National University in Korea. Prior to her graduate study, she worked at IBM Korea as a software engineer for more than five years. She is a member of the IEEE and the ACM SIGAPP.
