Archeology of Code Duplication: Recovering Duplication Chains From Small Duplication Fragments

Richard Wettel Radu Marinescu LOOSE Research Group Institute e-Austria Timis¸oara, Romania {wettel,radum}@cs.utt.ro

Abstract only once, the duplicated code has to be refactored. Spot- ting it can sometimes be obvious, but most of the time it is Code duplication is a common problem, and a well- more subtle or easy to miss, especially with industrial-size known sign of bad design. As a result of that, in the last systems. Detection and analysis of the code dupli- decade, the issue of detecting code duplication led to cation in legacy systems without powerful and reliable tool various solutions and tools that can automatically find support is hard to imagine. This is why we consider that de- duplicated blocks of code. However, duplicated fragments tecting clones and quantifying them is essential for further rarely remain identical after they are copied; they are design improvement. oftentimes modified here and there. This adaptation usually Throughout the last decade, there have been many ap- “scatters” the duplicated code block into a large amount proaches to detect duplicates. Some of the actual tools are of small “islands” of duplication, which detected and able to detect slightly adapted code (variables, constants analyzed separately hide the real magnitude and impact and methods renaming), while still overlooking larger-scale of the duplicated block. In this paper we propose a novel, adaptations, i.e., inserted or deleted lines of code. automated approach for recovering duplication blocks, by In a typical lightweight line-based approach, large composing small isolated fragments of duplication into blocks of code affected by various modifications (from larger and more relevant duplication chains. We validate renaming operations to statement insertions or removals) both the efficiency and the scalability of the approach by would wrongly be identified as small, less important frag- applying it on several well known open-source case-studies ments of duplicated code, apparently not related to each and discussing some relevant findings. By recovering other. such duplication chains, the maintenance engineer is To address this issue, we propose an approach that provided with additional cases of duplication that can lead merges such small fragments that belong together, thus to relevant refactorings, and which are usually missed by providing the maintainer with some additional duplication other detection methods. blocks, otherwise granted with less importance or not de- tected at all (due to filtering). Keywords: code duplication, design flaws, quality assurance 1.1 Outline

1 Introduction The paper is further structured as follows: Section 2 presents the concept of duplication chain and other related Duplicating code, while easy and cheap during the de- terms, placed in the archeology metaphor context. Section 3 velopment phase, moves the burden towards the already is a detailed presentation of our approach. Section 4 walks overloaded and much more expensive maintenance phase. the steps we took to validate the approach, proving that it Fowler and Beck ranks it first in their list of “bad smells in brings real benefits to maintenance. Section 5 will make a code” [6] and we strongly believe they were right. There- brief description of the related work and current concerns in fore we don’t intend to emphasize the consequences of in- this field of research, and finally Section 6 will point out the troducing duplicated code anymore. advantages and disadvantages of the presented approach, In order to ensure that the code says everything once and ending with a short presentation of our future work. 2 The Archeology Metaphor 2.3 Anatomy of a duplication chain

Like an archeologist who puts together all the ruins of an A duplication chain can be a complex element (the rep- ancient village in order to build a complete picture, rather resentation of the recovered duplicated code block), com- than analyzing each artifact separately, we try to recover a posed of a number of smaller exact clones (further referred close representation of every scattered duplicated block, in as exact chunks), separated pairwisely by non-matching order to make the right refactoring decisions. gaps. Figure 2 illustrates the previous example’s scatter- plot representation, where each marked cell corresponds to 2.1 It started with a Scatter-Plot a match between the pair of lines of code intersecting in that precise point. The scatter-plot approach was successfully applied in the code duplication detection field starting with the early ’90s exact chunk (2) [1], [4], [5]. These approaches provide the maintenance en- exact gineer with textual information (sets of longest matches), chunk (3) but mainly with a visual representation in form of a scatter- non-matching exact plot (or dotplot), which can draw attention over “grey” areas gap (1, deleted) chunk (3) of the matrix, possibly hosting duplicated code. From here, the specialist would further investigate the results. non-matching gap (1, modified) The scatter-plot inspired us in the first place and our al- gorithm is based on this visual concept. It also provides an easy way to present the ideas behind our approach through- Figure 2. Duplication chain out this paper. An exact chunk is a non-altered part of a duplicated 2.2 Need for duplication chains block, or in the context of the archeology metaphor an arti- fact found in its place. Exact chunks appear in a scatter-plot Imagine we have the two pieces of Java code from Fig- as continuous diagonals, as it can be seen in Figure 2. ure 1, which is a trivial example of scattered duplication A non-matching gap reflects the changes that have been belonging to a single duplicated block. Despite the fact that made to the originating duplicated block, in terms of lines of it seems obvious that they have common origins, due to the code (insertion, deletion, modification). Thus, while appar- deleted and modified lines of code, they could be detected ently less important in clone detection, these non-matching as 3 smaller clones, which is rather false. In a more pes- parts provide us with extra information about the adaptation simistic scenario, they would be filtered out by the mini- process. In a scatter-plot representation, non-matching gaps mum length threshold. One could rightly argue that there appear as shortest non-marked paths linking two consecu- are approaches which can detect variables renaming. What tive diagonals (Figure 2). if the lines of code are modified further than just variables Two closely related characteristics of a duplication or if there are lines appearing in only one of the two code chain, directly influenced by the adaptation process are the fragments? type and the signature.Thetype of the duplication chain provides a summary of the adaptations made to the dupli-

initSensors(tSensors); initSensors(tSensors); cation block. We identified the following types of dupli- readSensors(tSensors); readSensors(tSensors); cation chains: exact (regular clones, which did not suffer lcd.init(); any adaptation), modified (duplication chain made of exact int i = 0; int i = 0; while(i