Master of Science in Software Engineering September 2020

The Role of Duplicated Code in Software Readability and Comprehension

Xuan Liao Linyao Jiang

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Software Engineering. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s): Xuan Liao E-mail: [email protected]

Linyao Jiang E-mail: [email protected]

University advisor: Deepika Badampudi, Department of Software Engineering

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. Readability and comprehension are critical to software development and maintenance. Many researchers have pointed out that duplicated code has an effect on software maintainability, but there is little research on how duplicated code affects software readability and comprehension, which are parts of maintainability. Objectives. In this thesis, we first briefly summarize the impact of duplicated code and its typical types according to current work; our goal is then to explore whether duplicated code is a factor that influences readability and comprehension. Methods. We conducted a background survey asking background questions of forty-two subjects to help us classify them, and then conducted an experiment with these subjects to explore the role of duplicated code in perceived readability and comprehension. Perceived readability and comprehension were measured by a perceived-readability scale, reading time and the accuracy of a cloze test. Results. The experimental data show that code with duplication has higher perceived readability and better comprehension, although these differences are not significant. Code with duplication costs less reading time than code without duplication, and this difference is significant. Duplication type is strongly associated with perceived readability, and reading time is significantly associated with duplication type and code size. There is no significant correlation between the programming experience of subjects and perceived readability or comprehension, and according to our data there is also no significant relation between perceived readability and comprehension, size or CC. Conclusions. According to the reading-time results, code with duplication has significantly higher software readability. Code with duplication also shows higher comprehension than code without duplication, but that difference is not statistically significant in our experiment. Longer code increases reading time, and duplication type also influences perceived readability; the three duplication types we examined show these relationships clearly.

Keywords: Duplicate code, Software readability, Comprehension, Experiment, Survey

Acknowledgments

We are very grateful to our supervisor Deepika Badampudi. Her careful guidance of our master thesis significantly improved our understanding of academic writing and taught us many specific research skills. We faced a lot of problems with the duplicate code type classification sections; she helped us check the related literature and provided alternative solutions that guided the thesis in the right direction. We also sincerely thank the participants who sacrificed their spare time to attend our experiment with excellent cooperation.


Contents

Abstract

Acknowledgments

1 Introduction
   1.1 Background
   1.2 Defining the scope of the thesis
   1.3 Outline

2 Related Work
   2.1 Readability and comprehension
   2.2 Duplicated code
   2.3 Identification of gap

3 Method
   3.1 Research Question
   3.2 Alternative method
       3.2.1 Survey
       3.2.2 Case Study
   3.3 Experiment
       3.3.1 Subjects
       3.3.2 Experiment Materials
       3.3.3 Dependent and Independent Variables
       3.3.4 Tasks
       3.3.5 Experiment Design
       3.3.6 Piloting
       3.3.7 Experiment Execution

4 Results and Analysis
   4.1 Results
   4.2 Analysis
       4.2.1 Analysis of duplicate code overall
       4.2.2 Analysis of code snippets
       4.2.3 Analysis of subjects' characteristics

5 Discussion
   5.1 Research question
   5.2 Discussion of the experiment data
   5.3 Comparison of experimental results and related literature
   5.4 What a developer should do with duplicate code
   5.5 Validity Threats

6 Conclusions and Future Work
   6.1 Conclusion
   6.2 Limitation and Future Work

References

A Background/Task-specific Questions

B Code snippet

C Cloze test

Chapter 1 Introduction

1.1 Background

Maintainability plays a significant role in the product life cycle: the most costly part of software development is product maintenance[8][21]. The main reason is that programmers cannot understand the project accurately[33]. It has been pointed out that code comprehension takes up more than half of the life-cycle cost in software maintenance[10] and that readability is a critical point of software development and maintenance[3]; both the readability of the documentation and the readability of the source code are vital to the maintainability of a software project[6].

Knight and Myers indicated that we should check the readability of source code in the initial phase of the software inspection stage[29], which benefits maintainability, reusability and other quality attributes. Chanchal K. Roy et al.[25] also claimed that the most time-consuming part of all maintenance activities is reading and comprehending the code. People have therefore been committed to researching what can affect readability and comprehension.

Various programming styles and coding guidelines have been proposed to enhance the comprehension and readability of code[7][31], and other good practices and designs, such as well-chosen identifier names[18][15][22] and avoiding code smells[11], can also help. However, various factors such as the education of reviewers, context[27], the length of code snippets and the coding experience of readers are intertwined and affect each other, so it can be quite hard to analyze these factors individually.

1.2 Defining the scope of the thesis

In our study, we are interested in the impact of duplicated code on software readability and comprehension. Duplicate code is defined as similar code found in two or more methods, or as two snippets that execute the same function with different code, in the same class or in different classes[23]. It can generally be classified into four types; the classification standards are presented in Section 3.3.2, and the snippet selection standards are presented in Table 3.1. Duplicated code is regarded as a code smell[11], since it can make code longer and require additional maintenance cost if one of the duplicated fragments has defects[25], but it is also a benefit for avoiding repeating the same mistakes as before and for decoupling to keep components independent. Some studies provide strong empirical evidence that duplicated code can have a positive impact and should not be refactored: duplicate code can be more stable in general than code without duplication, though less stable with respect to deletions, and large duplicated fragments are less stable than small ones[12]. Duplicate code is also often used as a development pattern, and the places that could cause bugs are almost always handled correctly[1]. In addition, in some situations cloning a subsystem is one way to introduce experimental modifications to a core subsystem; this can improve the code by testing it in the clone and finally introducing it into a stable code base, which is reasonable and helpful[17]. The main treatments for duplicate code are the extract-method refactoring for duplication within the same class and the extract-class refactoring for duplication across classes[11]. Clone tracking is another way to handle duplicated code when refactoring is impractical[23].
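To illustrate the extract-method treatment mentioned above, the following hypothetical Python sketch shows duplicated aggregation logic being moved into one shared helper; the class and method names are invented for illustration and are not taken from our experiment materials.

```python
# Hypothetical before/after sketch of the "extract method" treatment for
# duplicated code inside one class; all names here are invented.
from collections import namedtuple

Order = namedtuple("Order", "amount")

class ReportBefore:
    def daily_summary(self, orders):
        total = 0
        for order in orders:       # duplicated aggregation logic ...
            total += order.amount
        return f"daily: {total}"

    def weekly_summary(self, orders):
        total = 0
        for order in orders:       # ... repeated here verbatim
            total += order.amount
        return f"weekly: {total}"

class ReportAfter:
    def _total(self, orders):
        # Extracted method: the duplicated loop now lives in one place,
        # so a defect in it needs to be fixed only once.
        return sum(order.amount for order in orders)

    def daily_summary(self, orders):
        return f"daily: {self._total(orders)}"

    def weekly_summary(self, orders):
        return f"weekly: {self._total(orders)}"

orders = [Order(10), Order(5)]
assert ReportBefore().daily_summary(orders) == ReportAfter().daily_summary(orders)
```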

In this research, software readability is defined as an inherent property of the text, and comprehension is the reader's understanding of what the text describes. Readability is a precondition of comprehensibility. The motivation for these definitions is detailed in Section 2.1.

We list factors that affect software comprehension from the people, code and tool aspects (see Figure 1.1). People factors are described in terms of people's subjective and objective elements, and code factors contain properties of the code snippet. The project factor describes the project environment and the plugins of different tools that help users understand the code.

Figure 1.1: Factors affecting the ease of program comprehension

Our contribution is to verify whether the readability and comprehension of code snippets are statistically influenced by removing duplicated code, so that we can help programmers decide whether to refactor duplicated code from the readability and comprehension perspectives.

Readability is considered a main factor affecting software maintenance and development[8][21][33], and it has been suggested that we should check the readability of source code in the initial phase of the software inspection stage.

Although readability and comprehension are correlated concepts, there is a subtle difference between them[28]. Comprehension involves understanding facts, while readability is defined as what makes texts easy to read. Readability is a necessary condition for comprehension, but it does not by itself guarantee comprehensibility.

1.3 Outline

The rest of this thesis is organized like common empirical studies[3]. We review related work from the perspectives of readability, comprehension and duplicated code in Chapter 2. The research questions are outlined in Section 3.1. The materials used in the experiment, the criteria and the specific implementation steps are described in Section 3.3. We analyze the experimental data in detail and discuss it further in Chapter 4 and Chapter 5, respectively. Threats to validity are also addressed in Chapter 5. Conclusions and future work are presented in Chapter 6.

Chapter 2 Related Work

2.1 Readability and comprehension

Buse and Weimer[4] explored the concept of code readability and investigated its relation to software quality. They constructed a metric, based on predicting readability judgments, to measure software readability. The snippets they selected for the experiment were relatively short and logically coherent, so that they could aid feature discrimination and the annotators could judge the readability of the context.

Jürgen Börstler et al.[3] conducted an experiment investigating the role of method chains and comments in software readability and comprehension. They used perceived readability, reading time and performance on a simple cloze test as indicators to measure software readability and comprehension. The results show that only code comments make a statistically significant difference in perceived readability, and that subjects with different skill levels differ substantially in perceived readability and performance on the cloze tests.

Smith and Taffler[27] indicated that in the study of text readability, people often confuse readability with comprehension and use comprehension in place of readability. However, readability and comprehension are different concepts; comprehension is also affected by various other factors, such as context and education.

Only a few studies have investigated how to measure software readability[5][6]. Raymond defined readability as "a human judgment of how easy a text is to understand"[6] and linked software quality with readability metrics. He invited 120 students to assess the readability of a series of code snippets themselves, then generated a model considering many aspects, such as line length and identifier length, based on the results of the experiment, and finally built an automated readability metric.

Binkley et al.[2] conducted five different experiments using eye-tracking to study the impact of naming conventions on readability and comprehension, and concluded that camel case is more suitable. Other papers have discussed related aspects such as the length of identifier names[15].

In our research, we consider software readability to be an inherent property of the text that readers need to actively perceive, and comprehension to be the reader's understanding of what the text describes.

2.2 Duplicated code

Duplicated code is commonly described as a bad code smell that should be removed or refactored, and people are interested in how to find and eliminate it. At the same time, some studies have investigated the impact of duplicate code on software quality, and most of them concern maintainability[16][19][25][29].

The empirical studies of Keisuke Hotta et al.[16] have shown that duplicate code does not seriously affect software evolution; code without duplication is modified more frequently than duplicate code. They also found that an indicator based on modification places is more accurate for evaluating the influence of duplicate code than one based on the ratio of modified lines. In addition, clone code can be more understandable and modifiable than complex and unintuitive abstractions[30].

Lozano et al.[19] developed a tool to trace which methods contain duplicate code and which methods were modified in each revision. Their experiment shows that methods with duplicate code tend to be modified more frequently than methods without, which implies that duplicate code increases the cost of software modification.

Chanchal et al. made a significant contribution by surveying the state of clone management[25], analyzing many studies that investigate copied code. They pointed out that, on the one hand, duplicated code helps developers avoid repeating the same mistakes as before, reduces the time spent designing logic and typing the wanted code, and supports decoupling to keep components independent. On the other hand, duplication is harmful in many situations: it makes the code much longer and requires additional cost if defects are detected in one of the duplicated fragments.

Mondal et al.[23] did a comprehensive study of duplicated code, clone refactoring and clone tracking, which shows that clone code has positive impacts on software maintainability, such as duplicate code being more stable than non-duplicate code[12], but also negative impacts; for example, whenever a shared clone method is modified, the workload tends to increase, depending on the percentage of affected systems[20]. Developers usually use code refactoring and clone tracking to modify and maintain duplicate code, maximizing the benefits of cloned code and reducing its negative impact during the maintenance phase. The researchers divided duplicated code into four types: exactly similar, syntactically similar, gapped clones and semantically similar, which we adopt as our classification in the experiment section.

However, Kapser and other researchers countered the popular belief by claiming that duplicated code can be helpful[17]. They pointed out that even projects such as the Apache web server contain a lot of clone code; developers clone their code as a design pattern when they think its benefits outweigh the risks. The researchers divided duplicated code into 11 different categories along both structure and granularity, made subjective judgments on each category's motivation, advantages, disadvantages and long-term problems, and recorded the frequency of these categories in medium-sized projects. The results show that 71% of the duplicated code is helpful for projects; in their paper, "helpful" means that the code subjectively has a positive impact on the maintainability and evolution of the software and can also improve certain quality attributes, such as comprehensibility, with the positive impact outweighing the negative. Notably, Parameterized Code (duplicate code differing only in identifiers or literals), which is also one of the duplicate code types we analyze in Section 3.3.2, was initially expected to improve comprehensibility, but when the Apache httpd project was analyzed, only 25% of such clones turned out to be good. The duplicated code was considered good when the fragments were small and cloned very few times, while this type of duplicate code was regarded as bad when it involved trivially abstracted and simple code, which would be more understandable if the duplication were removed. In the Gnumeric project, parameterized code clones were harmful in 76% of the cases, and in the large clone sample 71% of the cases were considered harmful. Among the fifteen situations considered beneficial, the duplicate code changed identifiers to reflect the standard notation of the mathematical function represented by the variable, which increased traceability.

2.3 Identification of gap

Some empirical studies have investigated the effects of duplicate code on software quality, and most of this research focuses on overall maintainability and modification cost. The relevance of duplicated code to readability and comprehension has so far been supported only by subjective judgments, not by experiments. In our present research, we aim to determine the impact of duplicated code on software readability and comprehension and thereby fill this gap. The contribution is to support the decision making of reviewers, from the perspective of software readability and comprehension, when they get feedback from duplicate code detection tools.

Chapter 3 Method

3.1 Research Question

RQ1: How does duplicate code affect software readability?

RQ2: How does duplicate code affect comprehension?

Motivation: These research questions aim to understand and identify the role of duplicated code in software readability and comprehension, with particular attention to open source projects. By answering these research questions, programmers can decide whether to refactor duplicated code from the software readability and comprehension perspectives. To answer them, we need to analyze how duplicate code affects software readability and comprehension from various perspectives.

3.2 Alternative method

3.2.1 Survey

The first methodology we considered was a survey, because readability and comprehension seem like subjective attributes. Survey studies focus on people's feelings and opinions[13] and are most suitable for descriptive research, trying to answer "what" questions about respondents. We could give respondents some snippets that do or do not contain duplicate code (variants of the same snippet) and ask them which snippets are more readable and easier to understand. However, when we considered how to analyze the data, we found some serious problems. It is tough to evaluate actual readability and comprehension from open questions in a survey, because each respondent has their own standard for what is good or bad, and the reading sequence would also severely affect the result. Thus a survey is inappropriate for our research questions, and we realized it would be better to conduct a controlled experiment or a case study. A survey is, however, suitable for collecting subjects' personal attributes, so we conduct one to select appropriate subjects based on their background/task-specific experience. The details can be seen in Section 3.3.2.

3.2.2 Case Study

We then considered a case study, which investigates and learns from real-life cases and can explain how and why certain phenomena occur; the results of a case study can be reliable and convincing[26]. For our research, it could be suitable to follow a real case from an industry project across version iterations, including operations like refactoring and removing code smells. We could collect professional, real data from internal employees to get accurate results. But the project might be mature enough that only a few low-level issues remain, so it could be hard and time-consuming to collect code snippets with the specific issues we want. Furthermore, after some inquiries, we found that people are mostly unwilling to provide internal code for fear of code leakage or for other reasons, which is also a problem for us.

3.3 Experiment

"A controlled experiment is the study of testable hypotheses in which one or more independent variables are manipulated to measure the influence on one or more de- pendent variables[26].". The experiments allowed us to control variables to get the effect on other variables we want to measure in a controlled environment. Now we have clear hypotheses and variables, which are needed in the experiment, so it is suitable for us to choose an experiment to conduct our research. Our experiments were based and improved on the study of Jürgen Börstler[3].

We describe subject and material selection, as well as the dependent and independent variables, tasks, experiment design and experiment execution in the following parts. What we want to know is to what extent the subjects comprehended the code snippets after reading them and how well they can summarize these snippets. We therefore prepare some open questions and cloze tests as measurements.

3.3.1 Subjects

The subjects in our background survey and formal experiment were students majoring in computer science or software engineering at Blekinge Tekniska Högskola and Zhejiang University of Technology. They were classified as novice programmers or professional programmers (see the classification details in Section 3.3.2). Participation was mainly voluntary.

3.3.2 Experiment Materials

The code snippets, formal experiment questions and subject background questions are described in the following sections.

Code snippets

The code snippets selected for the experiment should be sufficiently general and representative, and neither overly simple nor overly difficult; subjects should not need additional domain-specific knowledge to understand them. To make the code fragments cover various types, properties such as the length, complexity and type of duplication should differ[24]. The detailed characteristics of the code snippets used in this experiment are shown in Table 3.1.

Snippet  Source              Type    LOC  CC (avg)  Description
S1       WumpusWorld         Type 1  52   4         Medium-size snippet. Completely similar code
                                                    fragments. Simple and short comments. Most
                                                    complex according to CC.
S2       JFreeChart (1.5.0)  Type 2  26   1         Shortest snippet. Syntactically similar code
                                                    fragments; the data types of local variables
                                                    in the two methods differ. Minimal comments.
                                                    No conditionals.
S3       Apache Ant (1.7.0)  Type 3  79   2         Longest method. Gapped clones; modifications
                                                    between the two methods. Sufficient comments.
                                                    Low complexity.

Table 3.1: Key Characteristics of the Code Snippets Used in the Experiment

We also chose the code types that Kapser[17] pointed out can improve comprehensibility, as well as typical code types, to eliminate bias. Thus we chose code snippets from open source projects and from other papers about duplicate code, and then modified them as follows:

· Delete unrelated methods and structure in case they affect the readability and comprehension of the subjects.
· Make identifiers uniform using camelCase style.
· Remove unnecessary comments.
· Keep LOC below 80.

These operations and modifications minimize the impact of other factors on our experimental results. We strive to analyze the effect of different types of duplicated code on software readability and comprehension under the condition of consistent code fragment format and structure. For the classification standard of our code snippets, we chose a classification that is widely recognized in the industry[23]; a schematic diagram is shown in Figure 3.1:

* Type 1 Clones: Completely similar code fragments or blocks, except for comments and identifiers.
* Type 2 Clones: Syntactically similar code, which evolves from Type 1 but differs in data types, parameters and identifiers.
* Type 3 Clones: Clones that evolve from Type 1 and Type 2 but add, delete or modify the source code. They are also called gapped clones.
* Type 4 Clones: Semantically similar code; the logic of the code fragments differs, but the functions they implement and the tasks they complete are consistent.

To make these distinctions concrete, a hypothetical sketch follows Figure 3.1 below.

Figure 3.1: Types of clone code
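The following hypothetical Python sketch illustrates the first three clone types; the functions are invented for illustration (our actual experiment snippets were Java code from the sources in Table 3.1). Since Python variables are untyped, the Type 2 example varies parameter names and the accessed attribute rather than declared data types.

```python
# Hypothetical illustration of clone Types 1-3; these fragments are not
# taken from the experiment materials.

def total_price(items):
    """Original fragment: sums the price of each item."""
    total = 0
    for item in items:
        total += item.price
    return total

def sum_of_prices(elements):
    # Type 1 clone: identical logic; only identifiers and comments differ.
    total = 0
    for element in elements:
        total += element.price
    return total

def total_weight(parcels):
    # Type 2 clone: syntactically similar, but parameters, identifiers and
    # the accessed data differ (in a typed language, the data types of the
    # local variables could differ, as in our snippet S2).
    total = 0
    for parcel in parcels:
        total += parcel.weight
    return total

def total_price_after_discount(items, discount):
    # Type 3 (gapped) clone: the copied logic gains an extra statement.
    total = 0
    for item in items:
        total += item.price
    total -= discount  # the "gap": an added modification
    return total
```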

Figure 3.2: Original code and variant of duplicate code type 1

Figure 3.3: Original code and variant of duplicate code type 2

Figure 3.4: Original code and variant of duplicate code type 3

The Type 2 clone is what Kapser[17] calls Parameterized Code: identifiers are changed so that variable names closely match the semantics of the required data. Kapser initially and subjectively believed that this type of duplicated code can improve comprehensibility in some cases, but it turned out to be harmful most of the time; see Chapter 2.

In our experiment, we decided not to consider Type 4. Firstly, no mature clone detection or clone tracking tool can detect Type 4 clones[23], which means it is hard for developers to capture semantically similar code during the maintenance phase, so our research could not help them much there. Secondly, the first three clone types are mostly identical code in which only some areas have been modified, so subjects only need to focus on the differing parts and can skip the identical parts when reading the duplicated snippets; the difference in time spent on the two variants would therefore not be enormous (see the example in Figure 3.3). Type 4 clones, by contrast, behave in entirely different ways while implementing the same tasks; subjects would have to read both methods from beginning to end and could not skip anything, so subjects who read only one method (the non-duplicate snippet) would certainly spend less time and energy on reading.

In our experiment, we prepared code snippets of Types 1, 2 and 3 as mentioned above, and each has two variants: one with duplicate code and one without. Thus we have six variants for the subjects altogether.

Comprehension Questions

To measure the comprehension of subjects, we prepared a cloze test for each code snippet. After reading the complete code and summarizing the main steps of the snippet, the subjects' next task is the cloze test, which shows the same code but with some blanks to fill in; subjects need to fill the blanks with suitable answers (which do not need to be entirely consistent with the source code). For each code snippet, we leave 2-3 blanks in key positions and 1-2 blanks in unrelated areas to increase the difficulty of the cloze; the number of blanks depends on the complexity and length of the specific snippet. If subjects have a good understanding of the main steps and purposes of a snippet, they can fill the gaps efficiently and correctly, and the cloze provides more accurate and standard results than open-answer questions. Cloze tests are frequently used in experiments on text understanding and have been experimentally shown to be suitable for program understanding[14].
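As a rough sketch of how a cloze variant can be derived from a snippet, the code below blanks out selected tokens and scores answers by exact match; the snippet, blank positions and exact-match scoring are our own simplifications (in the real experiment, answers that were semantically suitable but not identical to the source were also accepted).

```python
# Minimal sketch of building a cloze test from a code snippet by blanking
# selected tokens. The snippet and blank positions are hypothetical.

SNIPPET = """total = 0
for item in items:
    total += item.price
return total"""

# (line index, token to blank): blanks in key positions plus some in
# less central positions, as described above.
BLANKS = [(1, "items"), (2, "price")]

def make_cloze(snippet: str, blanks):
    lines = snippet.splitlines()
    answers = []
    for line_no, token in blanks:
        lines[line_no] = lines[line_no].replace(token, "____", 1)
        answers.append(token)
    return "\n".join(lines), answers

def accuracy(expected, given):
    # ACC: the fraction of blanks filled with an acceptable answer
    # (here simplified to exact matches).
    correct = sum(1 for e, g in zip(expected, given) if e == g.strip())
    return correct / len(expected)

cloze, answers = make_cloze(SNIPPET, BLANKS)
print(cloze)
print("ACC:", accuracy(answers, ["items", "price"]))  # -> ACC: 1.0
```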

Background/Task-specific Questions

Subjects were asked to answer some background and task-specific experience questions to help us estimate their programming skills. The questions in this survey were designed according to J. Feigenspan et al.; we extracted a five-factor model with the factors experience with mainstream languages, professional experience, functional experience, experience from education and logical experience[9]. Subjects who had not passed any related programming course, or who had a reading problem, did not continue with the rest of the experiment. A subject is classified as a professional programmer if they have typed more than 40,000 lines of code, or fewer than 40,000 but with more than three years of programming experience and at least two of the background question scales at level 5; the remaining subjects are classified as novice programmers[32]. We also classified subjects by overall task-specific experience: if any task-specific scale is higher than 3, the subject is considered to have a high level of task-specific experience; if all task-specific scales are lower than 3, a low level; and the rest a medium level[3]. This decision procedure is summarized in the sketch below.
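The classification rules above amount to a small decision procedure, sketched here under the assumption that every questionnaire scale is an integer from 1 to 5; the function and field names are ours, not from the questionnaire.

```python
# Sketch of the subject-classification rules described above.
# The scale encoding (integers 1-5) and all names are our assumptions.

def programming_level(lines_typed, years_experience, background_scales):
    """Classify a subject as 'professional' or 'novice'."""
    if lines_typed > 40000:
        return "professional"
    if years_experience > 3 and sum(1 for s in background_scales if s == 5) >= 2:
        return "professional"
    return "novice"

def task_specific_level(task_scales):
    """Classify overall task-specific experience as high/medium/low."""
    if any(s > 3 for s in task_scales):
        return "high"
    if all(s < 3 for s in task_scales):
        return "low"
    return "medium"

print(programming_level(52000, 2, [3, 4, 2, 3, 3]))  # -> professional
print(task_specific_level([3, 3, 2]))                # -> medium
```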

3.3.3 Dependent and Independent Variables

The independent variable is the variant of the code snippet (with or without duplicated code).

The dependent variables are perceived readability (R), reading time (in seconds) and the accuracy rate of answering the comprehension questions (ACC).

Perceived readability (R) is how hard subjects feel the code snippet is to read, i.e. their subjective impression after reading the code; it is rated on five scales, from very difficult to very easy. Reading time (measured by the subjects themselves) also reflects the readability of the code indirectly: a longer time indicates harder, less readable code. We also noted whether the fragments that a subject found unreadable contained duplicated parts or not. The summary of steps (m) shows whether the subjects understood the code snippet, and the accuracy rate (ACC) is a more in-depth indication of the degree of comprehension. Furthermore, we collected insensitive personal data (gender, reading problems) and the subjects' programming levels, which may affect their performance in the experiment. An overview of all variables is shown in Figure 3.5.

3.3.4 Tasks

We conducted a background survey by email, asking subjects background questions to classify them as professional or novice programmers and to determine their task-specific levels. Subjects were then divided into six groups, and each group read three code snippets (see the allocation details in Table 3.2). Subjects were not told whether the code snippets they read contained duplicate code. At the beginning, they were asked to read a code snippet carefully and rate its readability (very difficult, difficult, neutral, easy, very easy) as points 1 to 5[3]. They recorded the reading time in seconds for each snippet themselves. Subjects then had to summarize the main steps of the snippet so that we could judge whether the perceived readability scale they provided was acceptable.

Figure 3.5: Overview of variables

         Code Snippet 1  Code Snippet 2  Code Snippet 3
Group 1  S1_1            S2_2            S3_1
Group 2  S1_1            S2_2            S3_2
Group 3  S1_1            S2_1            S3_2
Group 4  S1_2            S2_1            S3_1
Group 5  S1_2            S2_1            S3_2
Group 6  S1_2            S2_2            S3_1

Table 3.2: Group Allocation

After that, we conducted a simple cloze test (see Section 3.3.2): the same code snippet was shown to subjects again, but with some parts blanked out, and they were asked to fill in all the blanks within 180 seconds. Finally, we asked them which part of each code snippet was the most difficult to understand.

3.3.5 Experiment Design

In the background survey, subjects were asked to answer background questions (see Section 3.3.2) via email. In the formal experiment, we used a 2x3 factorial design with blocking. Subjects were allocated to six groups according to their characteristics, with more than six subjects per group. Each subject read at least one snippet with duplicate code and at least one snippet without. Sm_n denotes the variant of snippet Sm with or without duplicate code: _1 means the variant with duplicate code, while _2 means the variant without (see Section 3.3.2). The group allocation details are shown in Table 3.2; the sketch below checks this blocking property.
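The blocking property of Table 3.2 can be checked mechanically; this small sketch (with the allocation transcribed from the table) verifies that every group reads at least one duplicate and one non-duplicate variant.

```python
# Verify the counterbalancing property of the group allocation in Table 3.2:
# every group must see at least one _1 (duplicate) and one _2 (no duplicate).

GROUPS = {
    1: ["S1_1", "S2_2", "S3_1"],
    2: ["S1_1", "S2_2", "S3_2"],
    3: ["S1_1", "S2_1", "S3_2"],
    4: ["S1_2", "S2_1", "S3_1"],
    5: ["S1_2", "S2_1", "S3_2"],
    6: ["S1_2", "S2_2", "S3_1"],
}

for group, snippets in GROUPS.items():
    variants = {s.split("_")[1] for s in snippets}
    assert variants == {"1", "2"}, f"group {group} lacks a variant"
print("every group reads both duplicate and non-duplicate variants")
```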

3.3.6 Piloting

We invited six postgraduate students majoring in computer science (one allocated to each group) to complete the experiment following the experimental steps. They reported that some of the question descriptions were not clear and that the font of the code snippets was too small, which made them tired and resistant to reading. After the pilot study, we modified some questions to make them more accurate and resolved the code formatting issues so that subjects could complete the experiment better.

3.3.7 Experiment Execution

The experiment was conducted as an online questionnaire administered via wenjuan, a questionnaire research website. Personal background and task-specific questionnaires (see Section 3.3.2) were completed by subjects in their free time on wenjuan before the formal experiment. The time for reading the code snippets was unlimited, but we asked subjects to save time as much as possible and to record their reading time; the cloze test for each snippet was limited to 180 seconds. Subjects were told that their answers and times would only be used to analyze the readability and comprehension of the code, not to assess their personal performance. Before the formal experiment, the following rules were introduced to all subjects.

· Subjects can not take any notes or copies manually or electronically.

· Subjects can not go back in the questionnaire.

· Subjects can only pause when they see pause signal(appears after all questions of a specific snippet being completed) but the pause time should be less than 120 seconds.


Figure 3.6: Experiment execution steps

Chapter 4 Results and Analysis

4.1 Results

Forty-two subjects attended our background survey online. Forty of them met our experiment conditions and were allowed to continue with the formal experiment, and thirty-seven of those completed the formal experimental questions. After assessing the validity of their answers, we removed three rows of per-snippet experiment data because they gave a scale without an accurate description of the snippet. In total, we obtained one hundred and eight valid data points.

*&!!(' '%  ,% *&!!%&$&"" #,% #&                       

   $+  )"   ,%&( $*! ! "!

" #%&&#'(-!  



 



 



 $%&&# &&)#&'$& &&)#&'$&'(&$#!- &&"! ' &&"! ''(&$#!-

Figure 4.1: Subject classification (37 subjects who completed the whole experiment)

By summarizing the valid background survey data (subjects who completed the whole experiment), we obtain Figure 4.1, which shows the distribution of subjects by overall programming level, overall task-specific level (low, medium, high), gender and naming preference style (no preference, under_score preference, camelCase preference).

Snippet  N   R (avg)  T (avg seconds)  ACC (avg)
S1_1     20  4.25     220.60           88.35%
S1_2     16  3.63     244.94           85.42%
S2_1     17  2.94     294.18           94.14%
S2_2     18  2.56     292.61           88.89%
S3_1     17  2.00     398.12           82.35%
S3_2     20  1.85     577.90           78.75%

Table 4.1: Perceived Readability (R), Reading Time (T) and answering accuracy for the blanks in each snippet

Table 4.1 lists the number (N) of subjects for each code snippet, the average perceived readability (R), the average reading time (T, in seconds) and the answering accuracy for the blanks of each snippet in the formal experiment. In our research, we take perceived readability and reading time as indicators of readability, and ACC as the indicator of comprehension.
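For instance, the overall average readability scores quoted in Section 4.2.1 (3.13 with duplication vs. 2.61 without) can be reproduced from Table 4.1 as subject-weighted means; a minimal Python sketch:

```python
# Reproduce the overall perceived-readability averages from Table 4.1 as
# subject-weighted means (values quoted in Section 4.2.1: 3.13 vs. 2.61).
table_4_1 = {          # snippet: (N, R_avg)
    "S1_1": (20, 4.25), "S1_2": (16, 3.63),
    "S2_1": (17, 2.94), "S2_2": (18, 2.56),
    "S3_1": (17, 2.00), "S3_2": (20, 1.85),
}

def weighted_r(variant_suffix):
    rows = [(n, r) for s, (n, r) in table_4_1.items() if s.endswith(variant_suffix)]
    return sum(n * r for n, r in rows) / sum(n for n, _ in rows)

print(round(weighted_r("_1"), 2))  # 3.13 (with duplication)
print(round(weighted_r("_2"), 2))  # 2.61 (without duplication)
```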

4.2 Analysis

In this section, we first make a preliminary analysis of the experimental results and present the methods we use to calculate the relationships between factors and variables. We focus mainly on the impact on perceived readability and comprehension of: duplicate code; duplication type (completely similar code fragments or blocks, syntactically similar code, gapped clones); the characteristics of subjects, such as overall programming experience, task-specific experience and naming preference; and the characteristics of each code snippet, including LOC, CC and type of duplication. In addition, we analyze the impact of duplicate code within each duplication type on perceived readability, comprehension and reading time to see whether any relationships exist, and whether the various attributes affect each other. The relationships between all experiment variables are summarized in Figure 4.2.

We use bar charts to show the percentage distribution of the different readability scores (perceived readability) in each group (every bar is 100%), with the percentages marked (as in Figures 4.3 and 4.7). Box plots, as in Figures 4.8, 4.9 and 4.10, show the distributions of R, ACC and reading time in each type of code snippet, with the corresponding statistical correlation results. Regarding the four statistical methods we used: Spearman's rank correlation coefficient is used to evaluate relationships between ordinal variables; ANOVA is used to estimate the relationship between a continuous dependent variable and a categorical independent variable; two or more categories (groups) per variable are tested by chi-square tests (χ2); and if any expected count T < 1 or n < 40, Fisher's exact test is used instead. The sketch after Figure 4.2 illustrates these tests on fabricated data.

Figure 4.2: Summary of relationships between experiment variables.
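As an illustration of how these four approaches operate on data shaped like ours, the sketch below applies SciPy's standard implementations to fabricated numbers; none of the values are from our experiment, and the 2x2 Fisher table is invented purely to show the fallback.

```python
# Illustrative use of the four statistical approaches on fabricated data;
# none of these numbers come from the actual experiment.
from scipy import stats

# Perceived readability (1-5) vs. duplication: chi-square on a 2x5
# contingency table (rows: with/without duplication, columns: scale 1-5).
table = [[2, 3, 10, 15, 12],
         [4, 6, 14, 10, 8]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# If expected counts are too small (or n < 40), fall back to Fisher's
# exact test, which SciPy supports for 2x2 tables.
odds, p_fisher = stats.fisher_exact([[8, 2], [1, 5]])

# Reading time (continuous) vs. duplication (categorical): one-way ANOVA.
time_dup = [220, 245, 260, 230]
time_nodup = [290, 310, 280, 330]
f_stat, p_anova = stats.f_oneway(time_dup, time_nodup)

# Ordinal-ordinal relationships (e.g. R vs. LOC): Spearman's rank
# correlation coefficient.
rs, p_spearman = stats.spearmanr([4, 3, 2, 5, 1], [52, 79, 79, 26, 79])

print(chi2, p_chi2, f_stat, p_anova, rs, p_spearman)
```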

4.2.1 Analysis of duplicate code overall

For RQ1, our data (see details in Figure 4.3, Figure 4.4 and Table 4.2) show that across all code snippets, code with duplication has higher perceived readability and accuracy but lower reading time. The chi-square test does not indicate a statistically significant relationship between duplicate code and perceived readability (χ2 = 6.044, p = 0.196), although subjects in the no-duplication groups tended to rate the code as less readable (average readability score: 2.61) than subjects in the duplication groups (average readability score: 3.13).

Figure 4.3: Overview

For RQ2, the relationship between the accuracy of the cloze test and duplicate code is statistically insignificant (p > 0.3) according to ANOVA.

It is also worth noting that reading time is significantly associated with duplicate code (p = 0.05), as calculated by ANOVA on our data: subjects spent more time reading code without duplication than reading code with duplication.

Figure 4.4: The distribution of experiment results

Independent variable  Dependent variable     Method      H0                                         H1
Duplicate code        Perceived readability  chi-square  There is no significant relationship       There is a significant relationship
                                                         between duplicate code and perceived       between duplicate code and perceived
                                                         readability.                               readability.
Duplicate code        ACC                    ANOVA       The mean (average ACC) is the same for     The mean is not the same for all groups.
                                                         all groups (with or without duplication).
Duplicate code        Reading time           ANOVA       The mean (average reading time) is the     The mean is not the same for all groups.
                                                         same for all groups (with or without
                                                         duplication).

Table 4.2: Analysis and statistical methods regarding duplicate code

4.2.2 Analysis of code snippets

For the different types of code snippets (see details in Figure 4.6 and Figure 4.7): 40 percent of the expected values are less than five, so the chi-square test cannot be used, and when we attempted Fisher's exact test, SPSS reported that there was not enough memory for the calculation. We therefore use the Spearman approach to estimate the relationships between the different code snippets.

R (perceived readability) has five options (1 to 5, from very difficult to very easy). The relationship between R and the type of code snippet (comparison between groups) is significant according to the chi-square test (χ2 = 86.667, p < 0.001). Reading time is also significantly associated with the type of code snippet according to ANOVA (p < 0.001), but accuracy is not (p = 0.085), which may be because the parts removed for the cloze test in code snippet 2 were not chosen well enough, resulting in small differences in accuracy. This is nevertheless enough to justify using snippet type as an independent variable in the data analysis.

For duplication type (see details in Figure 4.5), we analyze the differences between S1_1, S2_1 and S3_1, representing duplication types 1, 2 and 3 respectively. Our data show that type 1 has the highest perceived readability and type 3 the lowest, and the difference is strongly significant (χ2 = 43.980, p < 0.001). For reading time, subjects spent the shortest time reading type 1 snippets, while type 3 snippets cost more time than the other two duplication types; this difference is also significant according to ANOVA (p = 0.021). Type 2 has the highest ACC of the three types and type 3 the lowest, but this difference is not significant (p = 0.22).

Figure 4.5: The distribution of R, ACC, Reading time in each duplication type

For the code snippets overall, the dataset indicates that reading time is moderately negatively associated with perceived readability (Spearman's correlation coefficient rs = -0.47, p < 0.001), which is significant, and reading time is also significantly associated with CC (rs = 0.428, p < 0.01).

Figure 4.6: The distribution of R, ACC, Reading time in each type of snippet

Figure 4.7: The experiment results in each type of code snippet

Figure 4.7: The experiment results in each type of code snippets 4.2. Analysis 27 characters of code snippets, there is a weak and significant connection between line of code(LOC) and perceived readability(rs = -0.298,p = 0.025). This may be due to incomplete code interception and the existence of unexplained classes. And our data show that there is moderate connection between R and read time, code length, or , so the results claim that perceived readability has moderate correlation with some traditional attributes used to judge software readability. Thus maybe perceived readability may need to be analyzed with other methods to evaluate software readability.

It is also notable that accuracy is weakly but significantly related to LOC (rs = -0.200, p = 0.038) and weakly but significantly associated with reading time (rs = -0.214, p = 0.026). The data also show that reading time is weakly associated with LOC (rs = 0.273, p < 0.01), and this association is statistically significant.

So in our research, larger code needs more reading time and has lower perceived readability, though the correlations between them are only weak (albeit significant). We allowed subjects to read the source code for as long as they wanted, filling in their start and end reading times themselves; with no time limit, subjects facing a snippet with many lines of code can increase their reading time to shrink the negative impact of code size on readability and comprehension. In general, spending longer reading the code means subjects can understand it more deeply, which improves the perceived readability of the code; conversely, a shorter reading time may reduce perceived readability. Since this could also affect the ACC results, we limited the time for answering the cloze test to 3 minutes.

For the individual code snippets (each type of duplicate code), when we analyze the relationship between duplicate code and perceived readability, it is notable that the distribution of perceived readability differs considerably for duplication type 1: code with duplication has higher perceived readability than code without, although the difference is not significant, just as for the other two types, according to Spearman's correlation coefficient (see details in Figure 4.8, Figure 4.9 and Figure 4.11).

Figure 4.8: The distribution of R in individual snippet result

Figure 4.9: The distribution of ACC in individual snippet result

Figure 4.10: The distribution of reading time in individual snippet result

When it comes to ACC and duplicate code within each code snippet, duplication types 1 and 3 both have similar distributions of ACC, while subjects achieved high ACC (more than half scored 100%) in group S2_1 even though the mean perceived readability of code snippet 2 is much lower than that of code snippet 1; this may be due to insufficient difficulty of the blanked parts of that cloze test, which we analyze in detail in the discussion chapter. For types 1 and 3, the relationship between ACC and duplicate code is not significant either.

However, there is a considerable and significant difference in reading time between the duplicate and non-duplicate variants of code snippet 3 (duplication type 3) (p = 0.041): the difference in average reading time is nearly 200 seconds (see details in Figure 4.10 and Figure 4.11). For types 1 and 2, the effect of duplicated code on reading time is minimal. It is notable that the code length of types 1 and 2 is relatively small compared to type 3, and their perceived readability is higher; focusing future research on large and obscure code fragments could therefore be worthwhile, as these are likely to be important factors affecting perceived readability.

When we analyze the relationship between ACC and reading time within each type of code snippet, the data show no significant relationship. There is also no significant relationship between R and ACC for snippet types one (completely similar clone) and two (syntactically similar clone), but R is significantly associated with ACC (p < 0.01), tested by ANOVA, for snippet type three (gapped clone). The correlation between ACC and reading time is weak and likewise not statistically significant.

Figure 4.11: The experiment results in each code snippet

4.2.3 Analysis of subjects' characteristics

For the subjects' characteristics, the relationship between overall programming level and R is statistically insignificant, and overall task-specific experience is not strongly related to R either. However, subjects with a high overall programming level gave higher perceived readability ratings than those with a lower level, and the same holds for task-specific experience.

Figure 4.12 shows that groups with high task-specific experience achieved higher accuracy in the cloze test than those with medium and low experience, but according to ANOVA neither overall programming level (p = 0.068) nor task-specific experience is significantly associated with ACC. It is notable, however, that subjects with high overall programming experience also tend to have high task-specific experience (χ2 = 15.213, p = 0.001).

The ANOVA also shows that subjects with high overall programming experience spent less reading time than those with low experience, although the difference between them is insignificant. Reading time is, however, strongly associated with task-specific experience (p = 0.015): reading the code snippets cost less time for the high task-specific experience group.

In addition, subjects with no naming preference considered the code snippets less readable than subjects with a naming preference, and also had lower accuracy. There are also differences in R, accuracy and reading time between the two naming preferences (camelCase and underscore), but these gaps are not statistically significant.

According to Figure 4.12, males showed higher accuracy and higher perceived readability than females, but the differences are not statistically significant. The data do show that males spent less time reading the code snippets than females (p = 0.032), measured by ANOVA.



#"          

 

  "              

 

          

             

 

%*"               

  

 ,&')           

  

          

             

 

  %&'       



$'(%'&'               





#"(&'          

 

          

             



!               

 

               



%+              

          

             

Figure 4.12: Distribution of subjects characters for perceived readability 32 Chapter 4. Results and Analysis

                         

  $)  *%&'    $("   !          " #" " #"

Figure 4.13: Comparison between Gender experience Chapter 5 Discussion

5.1 Research question

For RQ1: How does duplicate code affect software readability?

Taking perceived readability as the indicator, code with duplication has higher software readability than code without duplication according to our results, although the difference is not significant. Taking reading time as the indicator, it takes less time to read code with duplication than without, and this effect is significant, both overall and within each duplication type. In code snippet 3 (duplication type 3), which has a large code size, the reading-time difference between the variants is more pronounced, which suggests that for large code of duplication type 3 the impact of duplicate code on readability can be stronger; whether duplication type or code size is the main factor affecting readability could be researched in the future. Thus perceived readability probably cannot be used alone as a measure of readability; it needs to be combined with other indicators. It is also worth noting that different duplication types show different perceived readability: our data show that type 3 (gapped clone) has the lowest perceived readability and type 1 (completely similar) the highest, and this difference is strongly significant.

For RQ2: How does duplicate code affect comprehension?

According to our results, code with duplication yields higher comprehension than code without duplication, but the difference is not significant, either overall or within each duplication type.

5.2 Discussion of the experiment data

Beyond the results reported in the Results chapter, some phenomena are worth discussing.

Regarding the relationships and correlations among other factors, note that we chose three different types of duplicate code: completely similar code fragments or blocks, syntactically similar code and gapped clones (Section 3.3.2). Each type of duplicate code shows a different distribution of perceived readability and reading time, which means duplication type can be an important factor influencing both. However, we selected only one representative code snippet per type, and each snippet has a different degree of difficulty, so the correctness of this result remains open to discussion; more code of various types should be used to enhance the accuracy of the experiment. In addition, we changed duplication type 1 (completely similar code fragments or blocks) and type 2 (syntactically similar code) only slightly to obtain the code variants, while type 3 (gapped clone) was changed relatively more, which means the variant of duplication type 3 is more obscure than the variants of types 1 and 2. This suggests that the difficulty of the code snippet might itself be a factor influencing perceived readability and reading time.

For the individual code snippets (each duplication type), our experimental data show that perceived readability and ACC are both weakly and negatively associated with LOC from an overall perspective. It is noteworthy, however, that the perceived readability of code snippet 2 (type 2) is lower than that of code snippet 1 (type 1), i.e. snippet 2 is harder to read although it has the fewest lines of code. We infer that this is because the class calls some methods or other classes whose identification is not shown in S2_1, which may decrease perceived readability. S2_1 even has the highest ACC despite its relatively low readability, contrary to the results of the other two groups; we think this may be due to the selection of the removed parts for the cloze test, and because readers can easily guess the specific purpose of the method from its name, so these factors may influence readability but not comprehension. Because code snippet 2 is so short, the candidate parts for removal were limited, and our selection might not have been good enough to ensure sufficient difficulty. Besides, compared with the other two duplication types, duplication type 3 has a relatively large reading-time gap; we assume this is because it has the most code lines and the highest overall difficulty among the three snippets. Subjects need more time to connect and understand the context when gapped clones are removed by extracting the common part (see the code snippet details in Figure 3.4).

5.3 Comparison of experimental results and related literature

As mentioned in Chapter 2, Kapser[17] made subjective judgments on 11 different categories of duplicate code, considering motivation, advantages, disadvantages, long-term problems and frequency in medium-sized projects. They initially thought that Parameterized Code (the same as type 2 in our study) could improve comprehensibility, but their results show that only 25% of such clones were beneficial in the Apache httpd project; this type of duplicate code was regarded as bad when it involved trivially abstracted and simple code, which would be more understandable if the duplication were removed. "Understandable" is their subjective notion, and it focuses more on comprehension than on software readability. In our research, comprehension is the reader's understanding of the text description, and readability is a necessary condition for comprehension but does not by itself guarantee comprehensibility; software readability is an inherent property of the text that readers need to actively perceive. For duplication type 2 in our experiment, the data show that code with duplication has higher comprehension than code without, though not significantly, which differs from Kapser's opinion.

Some researchers[16][19] pointed out that duplicate code has an effect on software maintainability, and Chanchal K. Roy et al.[25] claimed that reading and understanding code is the most time-consuming part of all maintenance activities, which suggests that perceived readability and comprehension are two factors in maintainability. However, our experimental results show no significant relationship between duplication and perceived readability or comprehension, so duplicate code appears to affect maintainability through something other than perceived readability and comprehension. Which maintainability factors are affected by duplicate code is worth exploring in future work.

5.4 What a developer should do with duplicate code

Based on our results, code with duplication needs less reading time, which increases readability. The subjects in our experiment spent less time reading the code snippets with duplicated code, especially for the gapped clone, which has the largest size, the highest complexity and the lowest perceived readability. So gapped clone snippets, especially those with large size and high complexity, should not be refactored from the readability perspective, because the duplicated code can increase the readability of the code to some degree. For completely similar and syntactically similar clones, more code snippet types (longer and more complex) are needed to estimate whether those clone types should also be left unrefactored from a readability perspective. Duplicate code is not associated with comprehension when we choose the ACC of the cloze test as the indicator of software comprehension, so we cannot advise developers on whether to modify duplicated code from the comprehensibility perspective. In addition, developers can add good comments or documentation to improve software readability and comprehension[3].

5.5 Validity Threats

The average duration of the experiment was 40 minutes, and fatigue effects increase gradually over such a period. To minimize fatigue effects, we allowed the subjects to suspend the experiment and rest for 120 seconds before the beginning of the following snippet. Furthermore, we put the least complex and shortest code snippet in the middle of the experiment to lighten the fatigue effects. However, the last code snippet can still be long and obscure.

We analyze the relationship between perceived readability and overall programming experience as well as task-specific experience in Chapter 5. However, we cannot control each subject to apply the same self-evaluation standard in the background survey, so we cannot obtain a perfectly accurate characterization of each subject.

As discussed in Section 3.3.4, although we told the subjects that this experiment studies the relationship between duplicate code and readability/comprehension and that there is no reason to cheat, we still cannot prevent fraud, such as using phones or notes to record the code snippet, which would help with the cloze test. To detect cheating or inaccurate assessments of perceived readability, we set a descriptive question about the main steps or primary functions of each snippet to evaluate whether the perceived readability rating is acceptable.

Because of the coronavirus pandemic, we changed the field experiment to an online experiment, which means we could not monitor the subjects while they answered or require them to take the experiment at a uniform location, date, and time. We originally wanted the subjects to answer the questions at the same time, but they came from different countries, and some participants only completed the experiment after multiple email reminders, so we ultimately failed to confine the answering times to the same moment or even to nearby dates; these factors may affect the results of the experiment.

We can only guarantee that the programming experience of the subjects is generally similar, not identical; the remaining differences still introduce errors into the experimental results. The programming level of the participants is not the same, and they may also make errors in their self-evaluations, which affects the results of the experiment.

There are four types of duplicate code in total, and our experiment only contains three of them (see Section 3.3.2). Furthermore, the limited number of code snippets used in our experiment is a threat to generalization to other programming language environments and to more complex situations (code containing a variety of different code smells). However, we believe that the snippets we selected are representative of these three duplication types, because we considered length, complexity, and type of repetition to cover various kinds of code, and we used both the code type that Kapser points out can improve comprehensibility and more usual code types to eliminate bias (see Section 3.3.2).

We also strive to counter the effects of reading order and of the different number of duplications in each group by dividing the subjects into six groups (see Table 3.2).

Chapter 6 Conclusions and Future Work

6.1 Conclusion

In this thesis, we have collected relevant information about duplicate code, software readability, and comprehension, and analyzed the impact of duplicate code on perceived readability and comprehension. In our Related Work chapter, it was generally believed that duplicate code reduces the readability of software, but some researchers speculated that certain types of duplicate code would improve readability, so we designed an experiment to estimate whether different types of duplicate code have an impact on readability and comprehension.

Based on the data and results we gathered, duplicate code affects the time needed to read a code snippet, but it influences neither the perceived readability of the snippet nor the accuracy of the cloze test, which represents comprehension of the code. Code with duplication requires less reading time, and the difference in reading time is conspicuous for the code snippet that has a large size and lower perceived readability. So we need to focus more on such code in the future and consider whether code type or code size is the main factor that affects reading time.

6.2 Limitations and Future Work

One reason for the ambiguity of our results may be that the number of participants was not large enough: only 37 participants completed our full experiment. With insufficient data, the reliability of the results may be affected. Because the survey was conducted online, we could not monitor the behavior of the participants; they may have recorded the code to obtain a higher cloze accuracy rate, or distorted the reading time, and this was beyond our control. Besides, because we needed to keep the code snippets reasonably short, the excerpted code contains undefined classes and variables, which causes some reading difficulty. We should provide more comprehensive, higher-quality code snippets within a specified code length to probe the relationship between duplicate code and perceived readability and comprehension.

So in future work, each study should focus on a particular type of duplicate code. Since we selected only one piece of code per duplication type as the experimental object, multiple pieces of each duplication type, covering different levels of difficulty and code length, should be selected to avoid coincidental results.

Secondly, our experimental data shows that duplication itself (whether the code has duplication or not) is weakly and insignificantly associated with readability and comprehension, while the correlations between duplication type and readability and comprehension are significant. We need more subjects to provide data points to reduce the ambiguities of the experiment in the future.

Thirdly, different types of duplicate code may differ in difficulty, especially gapped clones. Research on how different types of duplicate code impact perceived readability and comprehension is expected.

References

[1] Lerina Aversano, Luigi Cerulo, and Massimiliano Di Penta. How clones are maintained: An empirical study. In 11th European Conference on Software Maintenance and Reengineering (CSMR’07), pages 81–90. IEEE, 2007.

[2] Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I Maletic, Christopher Morrell, and Bonita Sharif. The impact of identifier style on effort and comprehension. Empirical Software Engineering, 18(2):219–276, 2013.

[3] Jürgen Börstler and Barbara Paech. The role of method chains and comments in software readability and comprehension—an experiment. IEEE Transactions on Software Engineering, 42(9):886–898, 2016.

[4] R. P. L. Buse and W. R. Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4):546–558, July 2010.

[5] Raymond PL Buse and Westley R Weimer. A metric for software readability. In Proceedings of the 2008 international symposium on Software testing and analysis, pages 121–130, 2008.

[6] Raymond PL Buse and Westley R Weimer. Learning a metric for code read- ability. IEEE Transactions on Software Engineering, 36(4):546–558, 2009.

[7] Tom Cargill. C++ programming style. Addison-Wesley Massachusetts, 1992.

[8] Cecil Eng Huang Chua, Sandeep Purao, and Veda C Storey. Developing maintainable software: The readable approach. Decision Support Systems, 42(1):469–491, 2006.

[9] J. Feigenspan, C. Kästner, J. Liebig, S. Apel, and S. Hanenberg. Measuring programming experience. In 2012 20th IEEE International Conference on Program Comprehension (ICPC), pages 73–82, June 2012.

[10] John R Foster. Cost factors in software maintenance. PhD thesis, Durham University, 1993.

[11] Martin Fowler. Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018.

[12] Nils Göde and Jan Harder. Clone stability. In 2011 15th European Conference on Software Maintenance and Reengineering, pages 65–74. IEEE, 2011.


[13] Robert M Groves, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau. Survey methodology, volume 561. John Wiley & Sons, 2011.

[14] William E Hall and Stuart H Zweben. The cloze procedure and software comprehensibility measurement. IEEE Transactions on Software Engineering, (5):608–623, 1986.

[15] Johannes C Hofmeister, Janet Siegmund, and Daniel V Holt. Shorter identifier names take longer to comprehend. Empirical Software Engineering, 24(1):417–443, 2019.

[16] Keisuke Hotta, Yui Sasaki, Yukiko Sano, Yoshiki Higo, and Shinji Kusumoto. An empirical study on the impact of duplicate code. Advances in Software Engineering, 2012, 2012.

[17] Cory J Kapser and Michael W Godfrey. “Cloning considered harmful” considered harmful: patterns of cloning in software. Empirical Software Engineering, 13(6):645, 2008.

[18] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What’s in a name? a study of identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC’06), pages 3–12. IEEE, 2006.

[19] A. Lozano, M. Wermelinger, and B. Nuseibeh. Evaluating the harmfulness of cloning: A change based experiment. In Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), pages 18–18, May 2007.

[20] Angela Lozano and Michel Wermelinger. Assessing the effect of clones on changeability. In 2008 IEEE International Conference on Software Maintenance, pages 227–236. IEEE, 2008.

[21] Xu Luo, Zhexue Ge, Fengjiao Guan, and Yongmin Yang. A method for the maintainability assessment at design stage based on maintainability attributes. In 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), pages 187–192. IEEE, 2017.

[22] Robert C Martin. Clean code: a handbook of agile software craftsmanship. Pearson Education, 2009.

[23] Manishankar Mondal, Chanchal K Roy, and Kevin A Schneider. A survey on clone refactoring and tracking. Journal of Systems and Software, 159:110429, 2020.

[24] Daryl Posnett, Abram Hindle, and Premkumar Devanbu. A simpler model of software readability. In Proceedings of the 8th Working Conference on Mining Software Repositories, pages 73–82, 2011.

[25] Chanchal K Roy, Minhaz F Zibran, and Rainer Koschke. The vision of software clone management: Past, present, and future (keynote paper). In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pages 18–33. IEEE, 2014.

[26] Forrest Shull, Janice Singer, and Dag IK Sjøberg. Guide to advanced empirical software engineering. Springer, 2007.

[27] Malcolm Smith and Richard Taffler. Readability and understandability: Different measures of the textual complexity of accounting narrative. Accounting, Auditing & Accountability Journal, 5(4):0–0, 1992.

[28] M.-A. Storey. Theories, methods and tools in program comprehension: past, present and future. In 13th International Workshop on Program Comprehension (IWPC’05), pages 181–191, May 2005.

[29] Herb Sutter. C++ coding standards: 101 rules, guidelines, and best practices. Pearson Education India, 2004.

[30] Michael Toomim, Andrew Begel, and Susan L Graham. Managing duplicated code with linked editing. In 2004 IEEE Symposium on Visual Languages-Human Centric Computing, pages 173–180. IEEE, 2004.

[31] Allan Vermeulen, Scott W Ambler, Greg Bumgardner, Eldon Metz, Trevor Misfeldt, Patrick Thompson, and Jim Shur. The elements of Java (tm) style, volume 15. Cambridge University Press, 2000.

[32] A. Von Mayrhauser and A. M. Vans. Program comprehension during software maintenance and evolution. Computer, 28(8):44–55, Aug 1995.

[33] Stephen S. Yau and James S. Collofello. Design stability measures for software maintenance. IEEE Transactions on Software Engineering, (9):849–856, 1985.

Appendix A Background/Task-specific Questions

1. Do you have reading difficulty? A: Yes B: No

2. What’s your gender? A: Male B: Female

3. In which year did you enroll in university (undergraduate)?

4. How many years have you been programming?

5. How many related programming courses did you complete?

6. How do you estimate your Java programming experience (scale 1-5, 1 means worst and 5 means best)?

7. How do you estimate your other programming experience (scale 1-5, 1 means worst and 5 means best)?

8. How do you estimate your Java programming experience compared to your classmates (scale 1-5, 1 means worst and 5 means best)?

9. How many lines of code have you programmed? A: Less than 900 lines B: More than 900 and less than 40000 C: More than 40000

10. Which naming convention do you prefer: under_score (like sold_card) or camelCase (like soldCard)? A: I prefer under_score style B: I prefer camelCase style C: I have no preference

11. How do you rate your competence in principles of object-oriented design? (scale 1-5, 1 means no competence and 5 means expert-level competence)

12. How do you rate your competence in model design? (scale 1-5, 1 means no competence and 5 means expert-level competence)

13. How do you rate your competence in code smells and refactoring? (scale 1-5, 1 means no competence and 5 means expert-level competence)


14. How do you rate your competence in programming Eclipse plug-ins? (scale 1-5, 1 means no competence and 5 means expert-level competence)

Appendix B Code snippet

Code snippet 1_1 (With duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (w[x][y].contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 1_2 (Non-duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!valid(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!valid(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

private boolean valid(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    return true;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (w[x][y].contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 2_1 (With duplicated code)

DayTest.java

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Day d = new Day(1, 3, 1970);
    assertEquals(5094000000L, d.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

HourTest.java

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Hour h = new Hour(15, 1, 4, 2006);
    assertEquals(1143900000000L, h.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

Code snippet 2_2 (Non-duplicated code)

public void extracted(RegularTimePeriod arg0, long arg1) {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    RegularTimePeriod d = arg0;
    assertEquals(arg1, d.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    extracted(new Day(1, 3, 1970), 5094000000L);
}

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    extracted(new Hour(15, 1, 4, 2006), 1143900000000L);
}

Code snippet 3_1 (With duplicated code)

LineContains.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which contain all of the specified words.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                String containsStr = (String) contains.elementAt(i);
                matches = line.indexOf(containsStr) >= 0;
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContainsRegExp.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
                Regexp re = regexp.getRegexp(getProject());
                matches = re.matches(line);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

Code snippet 3_2 (Non-duplicated code)

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions or contain all of the specified words.
 *
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int extracted(Vector vector, Function matcher) {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = vector.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                matches = (Boolean) matcher.apply(i);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContains.java

public int read() {
    return extracted(contains, (Integer i) -> {
        String containsStr = (String) contains.elementAt(i);
        return line.indexOf(containsStr) >= 0;
    });
}

LineContainsRegExp.java

public int read() {
    return extracted(regexps, (Integer i) -> {
        RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
        Regexp re = regexp.getRegexp(getProject());
        return re.matches(line);
    });
}

Appendix C Cloze test

Code snippet 1_1 (With duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!A:_____)
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!B:_____)
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (C:_____.contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 1_2 (Non-duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!A:_____)
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!B:_____)
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

private boolean valid(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    return true;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (C:_____.contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 2_1 (With duplicated code)

DayTest.java

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Day d = new Day(1, 3, 1970);
    assertEquals(5094000000L, A:______.getFirstMillisecond());
    Locale.setDefault(B:______);
    TimeZone.setDefault(savedZone);
}

HourTest.java

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Hour h = new Hour(15, 1, 4, 2006);
    assertEquals(1143900000000L, h.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(C:______);
}

Code snippet 2_2 (Non-duplicated code)

public void extracted(RegularTimePeriod arg0, long arg1) {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    RegularTimePeriod d = arg0;
    assertEquals(arg1, d.getFirstMillisecond());
    Locale.setDefault(A:______);
    TimeZone.setDefault(B:______);
}

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    extracted(new C:______(1, 3, 1970), 5094000000L);
}

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    extracted(new Hour(15, 1, 4, 2006), 1143900000000L);
}

Code snippet 3_1 (With duplicated code)

LineContains.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which contain all of the specified words.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                String containsStr = (String) contains.elementAt(i);
                matches = line.indexOf(containsStr) >= 0;
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return D:_____;
}

LineContainsRegExp.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
                Regexp re = regexp.getRegexp(getProject());
                matches = re.matches(line);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

Code snippet 3_2 (Non-duplicated code)

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions or contain all of the specified words.
 *
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int extracted(Vector vector, Function matcher) {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = vector.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                matches = (Boolean) matcher.apply(i);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContains.java

public int read() {
    return extracted(contains, (Integer i) -> {
        String containsStr = (String) contains.elementAt(i);
        return line.indexOf(C:_____) >= 0;
    });
}

LineContainsRegExp.java

public int read() {
    return extracted(regexps, (Integer i) -> {
        RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
        Regexp re = D:_____.getRegexp(getProject());
        return re.matches(line);
    });
}
