The Impact of Programming Languages in Code Cloning
Total Page:16
File Type:pdf, Size:1020Kb
The Impact of Programming Languages in Code Cloning Jaime Filipe Jorge1 and António Menezes Leitão2 1 Instituto Superior Técnico Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal [email protected] 2 Instituto Superior Técnico Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal [email protected] Abstract Code cloning is a duplication of source code fragments that frequently occurs in large software systems. Although different studies exist that evidence cloning benefits, several others expose its harmfulness, specifically upon inconsistent clone management. One important cause for the creation of software clones is the inherent abstraction capabilities and terseness of the programming language being used. This paper focuses on the features of two different programming languages, namely Java and Scala, and studies how different language constructs can induce or reduce code cloning. This study was further developed using our tool Kamino which provided clone detection and concrete values. 1998 ACM Subject Classification K.6.3 Software Management Keywords and phrases Clone Detection, Software Engineering, Programming Languages, Soft- ware Management Digital Object Identifier 10.4230/OASIcs.SLATE.2012.107 1 Introduction There are a number of bad smells in the process of writing software systems[1]. Code cloning is one of them. A code clone is a duplication of source code where programs are copied from one location and subsequently pasted onto other locations. Studies found that code cloning is ubiquitous in software systems[2][3][4]. Some authors consider cloning to be harmful to software development [2][3][5][6]. A study[7] reports that in some systems half of the changes to code clone groups are inconsistent and that corrective changes following inconsistent changes are rare. Its existence may be due to a number of reasons. Clones may be caused by programmer inexperience or programming methodology. In this paper we will focus in programming language constructs as causes for code cloning. Subject to our study are two programming languages: Java and Scala. The Java pro- gramming language is one of the de facto standards for writing object-oriented systems. Its language constructs are result of years of best practices and it is still passing the test of time by having a large of companies using it. We will provide insights as to how these language constructs influence code cloning. Scala is a more recent statically typed JVM language that provides different abstraction mechanisms, which are a result of learning from Java’s wrong and right doings. © Jaime Filipe Jorge and António Menezes Leitão; licensed under Creative Commons License NC-ND 1st Symposium on Languages, Applications and Technologies (SLATE’12). Editors: Alberto Simões, Ricardo Queirós, Daniela da Cruz; pp. 107–122 OpenAccess Series in Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany 108 The Impact of Programming Languages in Code Cloning We provide clone analysis using these two programming languages and we will conclude how abstraction capabilities and verbosity/terseness influence code cloning. This study originated from the development of our tool, Kamino, which is a software clone evolution analyzer that provides developer with means to acknowledge clone duplication, observe and mitigate its consequences. The rest of the document is structured as follows: section 2 describes software clones and programming language impact, section 3 describes a case study where we instantiate some of the Java problems, section 4 describes in detail some of the problematic situations which fuel software clones and how they can be mitigated using Scala, section 5 will map the scenarios described earlier to real values obtained using our tool Kamino and section 6 presents our conclusions. 2 Software Clones A code clone is a duplication of source code in which fragments are copied from one location and subsequently pasted onto other locations, which can then be modified or not. Upon duplication, it is often difficult to differentiate the original fragments from duplicated ones, hence the two fragments are uniquely called clones. Studies show that cloning is ubiquitous in software systems. For instance, Baker et al.[2] found that, on large systems, 13% to 20% of source code can be cloned code. Baxter et al.[3] reported 12.7% source code clones whereas Mayrand et al.[4] experienced 5% to 20%. Well known software systems have been subject to clone study. In terms of their per- centage of code cloning, GCC averages 8,7%[8], X Windows has 19%[2] whereas Linux and JDK have , respectively, 22,7% and 29%[9]. From the developers perspective, there are a number of factors which propel code cloning. On the one hand, we can think of limitations inherent to the programmer himself. For example, when he needs to work with a new system. Whether because it is a large system (which induces difficulty in mastering its domain) or the programmer is dealing with code which he does not own, both aspects tend to promote a kind of example-driven programming approach, in which developers copy fragments of existing functionality and modify them to solve the problem at hands, i.e., each developer’s task. Logically, this methodology enables clone propagation. In what regards programming methodology, there are also reasons which motivate clon- ing. Systems that rely heavily on generated code (generative programming), tend to have higher clone numbers, since they actually enforce a template which can be seen as a frame- work of a clone. Programming languages can ultimately induce code redundancy. Less expressive lan- guages promote code cloning by requiring greater verbosity and need for boilerplate in programming. This code verbosity may hinder the original developers intent. Other prob- lem related to programming languages are lack of reusing mechanisms. Examples of such mechanisms are inheritance composition, generic functions, higher order functions, macros, type classes, etc. Upon nonexistence of such mechanisms, programmers are motivated to copy and paste code. Software duplication has a function of cost, i.e. , implementing a refactored alternative to a cloned code fragment is not always clearly beneficial [10]. This may be caused by difficulty in creating or understanding the refactored version or by unwanted size increase that it would imply in the system. Both of these causes can be correlated to the original premise that programming languages may induce cloning namely because refactoring may be complex to J. F. Jorge and A. M. Leitão 109 create/understand and may create larger versions of the original cloned fragments. Hence, we can also create the parameters on which we can grade programming languages, based on their capabilities towards mitigating code cloning: Abstraction Capabilities This criteria judges the idiomatic and simple nature of abstraction possibilities to given cloned fragments in the programming language itself. It may be possible to create such abstraction constructs but they may not be valuable in the cost function described above (namely in complexity in construction and understanding). These abstraction capabilities may include method extractions, inheritance, mixins, high- order functions or anonymous inner classes, duck typing, etc. Language Verbosity This criteria evaluates the size of a refactored alternative to cloning as well as the overall boilerplate syntax required to implement a given feature. These map to the cost function described earlier: if a developer prefers cloning to a refactored version that is larger in size, then one can infer that the language is proliferating code cloning by being verbose. To instantiate these concerns, we now present a case study of code cloning in which we observe problematic situations of lack of abstraction and a surplus of verbosity. 3 Case Study To motivate the study of programming languages as fueling agents for software clones we now provide a case study of a mature Java project. Containing approximately 7000 Java classes, with a grand total of 775921 LoC. The project was subject to a clone detection experiment, using our tool Kamino, which analyzed several hundred revisions and provided feedback on possible inconsistent cloning situations. We used a clone detector named Simian1 to provide information about clone detection. Simian is a text based clone detector that provides fast execution and some level of token insensitivity (i.e. mark two code fragments as equal even though they present different text tokens). More specifically, Simian found 57672 duplicate lines in 5230 blocks in 2031 files and processed a total of 385622 significant (775921 raw) lines in 7136 files. Dividing the duplic- ated lines by the total lines we 7.4%. Furthermore, if we divide the number of classes that contain duplicated code by the total number of Java classes, we obtain 28.5%. Hence, our study using Simian found that, 7.4% of all lines codes are duplicated and that 28.5% Program 1 shows two classes where code duplication is evident. We can observe that there are two major types of similarities: (1) Classes have the same amount of duplicate fragments with the only difference being the type of the object in consideration (ChangeCanProc on the left, TransferCanProc on the right); (2) if we compare the methods (of either right and left programs) getValidCanProc() and getValidExtCanProc(), we can observe that the only difference resides in the if condition: one has the negation of the other’s boolean expression. We can also