Master of Science in Software Engineering September 2020

The Role of Duplicated Code in Software Readability and Comprehension

Xuan Liao Linyao Jiang

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Software Engineering. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information: Author(s): Xuan Liao E-mail: [email protected]

Linyao Jiang E-mail: [email protected]

University advisor: Deepika Badampudi, Department of Software Engineering

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. Readability and comprehension are critical to software development and maintenance. Many researchers have pointed out that duplicated code has an effect on software maintainability, but there is little research on how duplicated code affects software readability and comprehension, which are parts of maintainability. Objectives. In this thesis, we first briefly summarize the impact of duplicated code and its typical types according to current work; our goal is then to explore whether duplicated code is a factor that influences readability and comprehension. Methods. We conducted a background survey asking background questions of forty-two subjects to help us classify them, and then conducted an experiment with these subjects to explore the role of duplicated code in perceived readability and comprehension. Perceived readability and comprehension were measured by a perceived-readability scale, reading time and the accuracy of a cloze test. Results. The experimental data show that code with duplication has higher perceived readability and better comprehension, although these differences are not significant. Code with duplication costs less reading time than code without duplication, and this difference is significant. Duplication type is strongly associated with perceived readability, and reading time is significantly associated with duplication type and code size. There is no significant correlation between the programming experience of subjects and perceived readability or comprehension, and according to our data there is also no significant relation between perceived readability and comprehension, size or CC. Conclusions. According to the reading-time results, code with duplication has significantly higher software readability. Code with duplication also shows higher comprehension than code without duplication, but that difference is not statistically significant in our experiment. Longer code increases reading time, and duplication type also influences perceived readability; the three duplication types we examined show these relationships clearly.

Keywords: Duplicate code, Software readability, Comprehension, Experiment, Survey

Acknowledgments

We are very grateful to our supervisor Deepika Badampudi. Her careful guidance of our master thesis significantly improved our understanding of academic writing and taught us many specific research skills. We faced a lot of problems with the duplicate code type classification sections; she helped us check the related literature and provided alternative solutions that guided the thesis in the right direction. We also sincerely thank the participants who sacrificed their spare time to attend our experiment with excellent cooperation.


Contents

Abstract

Acknowledgments

1 Introduction
   1.1 Background
   1.2 Defining the scope of the thesis
   1.3 Outline

2 Related Work
   2.1 Readability and comprehension
   2.2 Duplicated code
   2.3 Identification of gap

3 Method
   3.1 Research Question
   3.2 Alternative method
       3.2.1 Survey
       3.2.2 Case Study
   3.3 Experiment
       3.3.1 Subjects
       3.3.2 Experiment Materials
       3.3.3 Dependent and Independent Variables
       3.3.4 Tasks
       3.3.5 Experiment Design
       3.3.6 Piloting
       3.3.7 Experiment Execution

4 Results and Analysis
   4.1 Results
   4.2 Analysis
       4.2.1 Analysis of duplicate code overall
       4.2.2 Analysis of code snippets
       4.2.3 Analysis of subjects' characteristics

5 Discussion
   5.1 Research question
   5.2 Discussion of the experiment data
   5.3 Comparison of experimental results and related literature
   5.4 What a developer should do with duplicate code
   5.5 Validity Threats

6 Conclusions and Future Work
   6.1 Conclusion
   6.2 Limitation and Future Work

References

A Background/Task-specific Questions

B Code snippet

C Cloze test

Chapter 1 Introduction

1.1 Background

Maintainability plays a significant role in the product life cycle: the most costly part of software development is product maintenance[8][21]. The main reason is that programmers cannot understand the project accurately[33]. It has been pointed out that code comprehension takes up more than half of the life-cycle cost in software maintenance[10] and that readability is a critical point of software development and maintenance[3]; both the readability of the documentation and the readability of the source code are vital to the maintainability of a software project[6].

Knight and Myers indicated that we should check the readability of source code in the initial phase of the software inspection stage[29], which benefits maintainability, reusability and other quality attributes. Chanchal K. Roy et al.[25] also claimed that the most time-consuming part of all maintenance activities is reading and comprehending the code. People have therefore been committed to researching what can affect readability and comprehension.

Various programming styles and coding guidelines have been proposed to enhance the comprehension and readability of code[7][31], and other good practices and designs, such as well-chosen identifier names[18][15][22] and avoiding code smells[11], can also help. However, various factors such as the education of reviewers, context[27], the length of code snippets and the coding experience of readers are intertwined and affect each other, so it can be quite hard to analyze these factors individually.

1.2 Defining the scope of the thesis

In our study, we are interested in the impact of duplicated code on software readability and comprehension. Duplicate code is defined as similar code found in two or more methods, or as two snippets that execute the same function with different code, in the same class or in different classes[23]. It can generally be classified into four types; the classification standards are presented in Section 3.3.2, and the snippet selection standards are presented in Table 3.1. Duplicated code is regarded as a code smell[11], since it can make code longer and require additional maintenance cost if one of the duplicated fragments has defects[25], but it is also a benefit for avoiding repeating the same mistakes as before and for decoupling to keep components independent. Some studies provide strong empirical evidence that duplicated code can have a positive impact and should not be refactored: duplicate code can be more stable in general than code without duplication, though less stable with respect to deletions, and large duplicated fragments are less stable than small ones[12]. Duplicate code is also often used as a development pattern, and the places that could cause bugs are almost always handled correctly[1]. In addition, in some situations cloning a subsystem is one way to introduce experimental modifications to a core subsystem; this can improve the code by testing it in the clone and finally introducing it into a stable code base, which is reasonable and helpful[17]. The main treatments for duplicate code are the extract-method refactoring for duplication within the same class and the extract-class refactoring for duplication across classes[11]. Clone tracking is another way to handle duplicated code when refactoring is impractical[23].
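To illustrate the extract-method treatment mentioned above, the following hypothetical Python sketch shows duplicated aggregation logic being moved into one shared helper; the class and method names are invented for illustration and are not taken from our experiment materials.

```python
# Hypothetical before/after sketch of the "extract method" treatment for
# duplicated code inside one class; all names here are invented.
from collections import namedtuple

Order = namedtuple("Order", "amount")

class ReportBefore:
    def daily_summary(self, orders):
        total = 0
        for order in orders:       # duplicated aggregation logic ...
            total += order.amount
        return f"daily: {total}"

    def weekly_summary(self, orders):
        total = 0
        for order in orders:       # ... repeated here verbatim
            total += order.amount
        return f"weekly: {total}"

class ReportAfter:
    def _total(self, orders):
        # Extracted method: the duplicated loop now lives in one place,
        # so a defect in it needs to be fixed only once.
        return sum(order.amount for order in orders)

    def daily_summary(self, orders):
        return f"daily: {self._total(orders)}"

    def weekly_summary(self, orders):
        return f"weekly: {self._total(orders)}"

orders = [Order(10), Order(5)]
assert ReportBefore().daily_summary(orders) == ReportAfter().daily_summary(orders)
```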

In this research, software readability is defined as an inherent property of the text, and comprehension is the reader's understanding of what the text describes. Readability is a precondition of comprehensibility. The motivation for these definitions is detailed in Section 2.1.

We list factors that affect software comprehension from the people, code and tool aspects (see Figure 1.1). People factors are described in terms of people's subjective and objective elements, and code factors contain properties of the code snippet. The project factor describes the project environment and the plugins of different tools that help users understand the code.

Figure 1.1: Factors affecting the ease of program comprehension

Our contribution is to verify whether the readability and comprehension of code snippets are statistically influenced by removing duplicated code, so that we can help programmers decide whether to refactor duplicated code from the readability and comprehension perspectives.

Readability is considered a main factor affecting software maintenance and development[8][21][33], and it has been suggested that we should check the readability of source code in the initial phase of the software inspection stage.

Although readability and comprehension are correlated concepts, there is a subtle difference between them[28]. Comprehension involves understanding facts, while readability is defined as what makes texts easy to read. Readability is a necessary condition for comprehension, but it does not by itself guarantee comprehensibility.

1.3 Outline

The rest of this thesis is organized like common empirical studies[3]. We review related work from the perspectives of readability, comprehension and duplicated code in Chapter 2. The research questions are outlined in Section 3.1. The materials used in the experiment, the criteria and the specific implementation steps are described in Section 3.3. We analyze the experimental data in detail and discuss it further in Chapter 4 and Chapter 5, respectively. Threats to validity are also addressed in Chapter 5. Conclusions and future work are presented in Chapter 6.

Chapter 2 Related Work

2.1 Readability and comprehension

Buse and Weimer[4] explored the concept of code readability and investigated its relation to software quality. They constructed a metric, based on predicting readability judgments, to measure software readability. The snippets they selected for the experiment were relatively short and logically coherent, so that they could aid feature discrimination and the annotators could judge the readability of the context.

Jürgen Börstler et al.[3] conducted an experiment investigating the role of method chains and comments in software readability and comprehension. They used perceived readability, reading time and performance on a simple cloze test as indicators to measure software readability and comprehension. The results show that only code comments make a statistically significant difference in perceived readability, and that subjects with different skill levels differ substantially in perceived readability and performance on the cloze tests.

Smith and Taffler[27] indicated that in the study of text readability, people often confuse readability with comprehension and use comprehension in place of readability. However, readability and comprehension are different concepts; comprehension is also affected by various other factors, such as context and education.

Only a few studies have investigated how to measure software readability[5][6]. Raymond defined readability as "a human judgment of how easy a text is to understand"[6] and linked software quality with readability metrics. He invited 120 students to assess the readability of a series of code snippets themselves, then generated a model considering many aspects, such as line length and identifier length, based on the results of the experiment, and finally built an automated readability metric.

Binkley et al.[2] conducted five different experiments using eye-tracking to study the impact of naming conventions on readability and comprehension, and concluded that camel case is more suitable. Other papers have discussed related aspects such as the length of identifier names[15].

In our research, we consider software readability to be an inherent property of the text that readers need to actively perceive, and comprehension to be the reader's understanding of what the text describes.

2.2 Duplicated code

Duplicated code is commonly described as a bad code smell that should be removed or refactored, and people are interested in how to find and eliminate it. At the same time, some studies have investigated the impact of duplicate code on software quality, and most of them concern maintainability[16][19][25][29].

The empirical studies of Keisuke Hotta et al.[16] have shown that duplicate code does not seriously affect software evolution; code without duplication is modified more frequently than duplicate code. They also found that an indicator based on modification places is more accurate for evaluating the influence of duplicate code than one based on the ratio of modified lines. In addition, clone code can be more understandable and modifiable than complex and unintuitive abstractions[30].

Lozano et al.[19] developed a tool to trace which methods contain duplicate code and which methods were modified in each revision. Their experiment shows that methods with duplicate code tend to be modified more frequently than methods without, which implies that duplicate code increases the cost of software modification.

Chanchal et al. made a significant contribution by surveying the state of clone management[25], analyzing many studies that investigate copied code. They pointed out that, on the one hand, duplicated code helps developers avoid repeating the same mistakes as before, reduces the time spent designing logic and typing the wanted code, and supports decoupling to keep components independent. On the other hand, duplication is harmful in many situations: it makes the code much longer and requires additional cost if defects are detected in one of the duplicated fragments.

Mondal et al.[23] did a comprehensive study of duplicated code, clone refactoring and clone tracking, which shows that clone code has positive impacts on software maintainability, such as duplicate code being more stable than non-duplicate code[12], but also negative impacts; for example, whenever a shared clone method is modified, the workload tends to increase, depending on the percentage of affected systems[20]. Developers usually use code refactoring and clone tracking to modify and maintain duplicate code, maximizing the benefits of cloned code and reducing its negative impact during the maintenance phase. The researchers divided duplicated code into four types: exactly similar, syntactically similar, gapped clones and semantically similar, which we adopt as our classification in the experiment section.

However, Kapser and other researchers countered the popular belief by claiming that duplicated code can be helpful[17]. They pointed out that even projects such as the Apache web server contain a lot of clone code; developers clone their code as a design pattern when they think its benefits outweigh the risks. The researchers divided duplicated code into 11 different categories along both structure and granularity, made subjective judgments on each category's motivation, advantages, disadvantages and long-term problems, and recorded the frequency of these categories in medium-sized projects. The results show that 71% of the duplicated code is helpful for projects; in their paper, "helpful" means that the code subjectively has a positive impact on the maintainability and evolution of the software and can also improve certain quality attributes, such as comprehensibility, with the positive impact outweighing the negative. Notably, Parameterized Code (duplicate code differing only in identifiers or literals), which is also one of the duplicate code types we analyze in Section 3.3.2, was initially expected to improve comprehensibility, but when the Apache httpd project was analyzed, only 25% of such clones turned out to be good. The duplicated code was considered good when the fragments were small and cloned very few times, while this type of duplicate code was regarded as bad when it involved trivially abstracted and simple code, which would be more understandable if the duplication were removed. In the Gnumeric project, parameterized code clones were harmful in 76% of the cases, and in the large clone sample 71% of the cases were considered harmful. Among the fifteen situations considered beneficial, the duplicate code changed identifiers to reflect the standard notation of the mathematical function represented by the variable, which increased traceability.

2.3 Identification of gap

Some empirical studies have investigated the effects of duplicate code on software quality, and most of this research focuses on overall maintainability and modification cost. The relevance of duplicated code to readability and comprehension has so far been supported only by subjective judgments, not by experiments. In our present research, we aim to determine the impact of duplicated code on software readability and comprehension and thereby fill this gap. The contribution is to support the decision making of reviewers, from the perspective of software readability and comprehension, when they get feedback from duplicate code detection tools.

Chapter 3 Method

3.1 Research Question

RQ1: How does duplicate code affect software readability?

RQ2: How does duplicate code affect comprehension?

Motivation: These research questions aim to understand and identify the role of duplicated code in software readability and comprehension, with particular attention to open source projects. By answering these research questions, programmers can decide whether to refactor duplicated code from the software readability and comprehension perspectives. To answer them, we need to analyze how duplicate code affects software readability and comprehension from various perspectives.

3.2 Alternative method

3.2.1 Survey

The first methodology we considered was a survey, because readability and comprehension seem like subjective attributes. Survey studies focus on people's feelings and opinions[13] and are most suitable for descriptive research, trying to answer "what" questions about respondents. We could give respondents some snippets that do or do not contain duplicate code (variants of the same snippet) and ask them which snippets are more readable and easier to understand. However, when we considered how to analyze the data, we found some serious problems. It is tough to evaluate actual readability and comprehension from open questions in a survey, because each respondent has their own standard for what is good or bad, and the reading sequence would also severely affect the result. Thus a survey is inappropriate for our research questions, and we realized it would be better to conduct a controlled experiment or a case study. A survey is, however, suitable for collecting subjects' personal attributes, so we conduct one to select appropriate subjects based on their background/task-specific experience. The details can be seen in Section 3.3.2.

3.2.2 Case Study

We then considered a case study, which investigates and learns from real-life cases and can explain how and why certain phenomena occur; the results of a case study can be reliable and convincing[26]. For our research, it could be suitable to follow a real case from an industry project across version iterations, including operations like refactoring and removing code smells. We could collect professional, real data from internal employees to get accurate results. But the project might be mature enough that only a few low-level issues remain, so it could be hard and time-consuming to collect code snippets with the specific issues we want. Furthermore, after some inquiries, we found that people are mostly unwilling to provide internal code for fear of code leakage or for other reasons, which is also a problem for us.

3.3 Experiment

"A controlled experiment is the study of testable hypotheses in which one or more independent variables are manipulated to measure the influence on one or more de- pendent variables[26].". The experiments allowed us to control variables to get the effect on other variables we want to measure in a controlled environment. Now we have clear hypotheses and variables, which are needed in the experiment, so it is suitable for us to choose an experiment to conduct our research. Our experiments were based and improved on the study of Jürgen Börstler[3].

We describe subject and material selection, as well as the dependent and independent variables, tasks, experiment design and experiment execution in the following parts. What we want to know is to what extent the subjects comprehended the code snippets after reading them and how well they can summarize these snippets. We therefore prepare some open questions and cloze tests as measurements.

3.3.1 Subjects

The subjects in our background survey and formal experiment were students majoring in computer science or software engineering at Blekinge Tekniska Högskola and Zhejiang University of Technology. They were classified as novice programmers or professional programmers (see the classification details in Section 3.3.2). Participation was mainly voluntary.

3.3.2 Experiment Materials

The code snippets, formal experiment questions and subject background questions are described in the following sections.

Code snippets

The code snippets selected for the experiment should be sufficiently general and representative, and neither overly simple nor overly difficult; subjects should not need additional domain-specific knowledge to understand them. To make the code fragments cover various types, properties such as the length, complexity and type of duplication should differ[24]. The detailed characteristics of the code snippets used in this experiment are shown in Table 3.1.

Snippet  Source              Type    LOC  CC (avg)  Description
S1       WumpusWorld         Type 1  52   4         Medium-size snippet. Completely similar code
                                                    fragments. Simple and short comments. Most
                                                    complex according to CC.
S2       JFreeChart (1.5.0)  Type 2  26   1         Shortest snippet. Syntactically similar code
                                                    fragments; the data types of local variables
                                                    in the two methods differ. Minimal comments.
                                                    No conditionals.
S3       Apache Ant (1.7.0)  Type 3  79   2         Longest method. Gapped clones; modifications
                                                    between the two methods. Sufficient comments.
                                                    Low complexity.

Table 3.1: Key Characteristics of the Code Snippets Used in the Experiment

We also chose the code types that Kapser[17] pointed out can improve comprehensibility, as well as typical code types, to eliminate bias. Thus we chose code snippets from open source projects and from other papers about duplicate code, and then modified them as follows:

· Delete unrelated methods and structure in case they affect the readability and comprehension of the subjects.
· Make identifiers uniform using camelCase style.
· Remove unnecessary comments.
· Keep LOC below 80.

These operations and modifications minimize the impact of other factors on our experimental results. We strive to analyze the effect of different types of duplicated code on software readability and comprehension under the condition of consistent code fragment format and structure. For the classification standard of our code snippets, we chose a classification that is widely recognized in the industry[23]; a schematic diagram is shown in Figure 3.1:

* Type 1 Clones: Completely similar code fragments or blocks, except for comments and identifiers.
* Type 2 Clones: Syntactically similar code, which evolves from Type 1 but differs in data types, parameters and identifiers.
* Type 3 Clones: Clones that evolve from Type 1 and Type 2 but add, delete or modify the source code. They are also called gapped clones.
* Type 4 Clones: Semantically similar code; the logic of the code fragments differs, but the functions they implement and the tasks they complete are consistent.

To make these distinctions concrete, a hypothetical sketch follows Figure 3.1 below.

Figure 3.1: Types of clone code
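The following hypothetical Python sketch illustrates the first three clone types; the functions are invented for illustration (our actual experiment snippets were Java code from the sources in Table 3.1). Since Python variables are untyped, the Type 2 example varies parameter names and the accessed attribute rather than declared data types.

```python
# Hypothetical illustration of clone Types 1-3; these fragments are not
# taken from the experiment materials.

def total_price(items):
    """Original fragment: sums the price of each item."""
    total = 0
    for item in items:
        total += item.price
    return total

def sum_of_prices(elements):
    # Type 1 clone: identical logic; only identifiers and comments differ.
    total = 0
    for element in elements:
        total += element.price
    return total

def total_weight(parcels):
    # Type 2 clone: syntactically similar, but parameters, identifiers and
    # the accessed data differ (in a typed language, the data types of the
    # local variables could differ, as in our snippet S2).
    total = 0
    for parcel in parcels:
        total += parcel.weight
    return total

def total_price_after_discount(items, discount):
    # Type 3 (gapped) clone: the copied logic gains an extra statement.
    total = 0
    for item in items:
        total += item.price
    total -= discount  # the "gap": an added modification
    return total
```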

Figure 3.2: Original code and variant of duplicate code type 1

Figure 3.3: Original code and variant of duplicate code type 2

Figure 3.4: Original code and variant of duplicate code type 3

The Type 2 clone is what Kapser[17] calls Parameterized Code: identifiers are changed so that variable names closely match the semantics of the required data. Kapser initially and subjectively believed that this type of duplicated code can improve comprehensibility in some cases, but it turned out to be harmful most of the time; see Chapter 2.

In our experiment, we decided not to consider Type 4. Firstly, no mature clone detection or clone tracking tool can detect Type 4 clones[23], which means it is hard for developers to capture semantically similar code during the maintenance phase, so our research could not help them much there. Secondly, the first three clone types are mostly identical code in which only some areas have been modified, so subjects only need to focus on the differing parts and can skip the identical parts when reading the duplicated snippets; the difference in time spent on the two variants would therefore not be enormous (see the example in Figure 3.3). Type 4 clones, by contrast, behave in entirely different ways while implementing the same tasks; subjects would have to read both methods from beginning to end and could not skip anything, so subjects who read only one method (the non-duplicate snippet) would certainly spend less time and energy on reading.

In our experiment, we prepared code snippets of Types 1, 2 and 3 as mentioned above, and each has two variants: one with duplicate code and one without. Thus we have six variants for the subjects altogether.

Comprehension Questions

To measure the comprehension of subjects, we prepared a cloze test for each code snippet. After reading the complete code and summarizing the main steps of the snippet, the subjects' next task is the cloze test, which shows the same code but with some blanks to fill in; subjects need to fill the blanks with suitable answers (which do not need to be entirely consistent with the source code). For each code snippet, we leave 2-3 blanks in key positions and 1-2 blanks in unrelated areas to increase the difficulty of the cloze; the number of blanks depends on the complexity and length of the specific snippet. If subjects have a good understanding of the main steps and purposes of a snippet, they can fill the gaps efficiently and correctly, and the cloze provides more accurate and standard results than open-answer questions. Cloze tests are frequently used in experiments on text understanding and have been experimentally shown to be suitable for program understanding[14].
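As a rough sketch of how a cloze variant can be derived from a snippet, the code below blanks out selected tokens and scores answers by exact match; the snippet, blank positions and exact-match scoring are our own simplifications (in the real experiment, answers that were semantically suitable but not identical to the source were also accepted).

```python
# Minimal sketch of building a cloze test from a code snippet by blanking
# selected tokens. The snippet and blank positions are hypothetical.

SNIPPET = """total = 0
for item in items:
    total += item.price
return total"""

# (line index, token to blank): blanks in key positions plus some in
# less central positions, as described above.
BLANKS = [(1, "items"), (2, "price")]

def make_cloze(snippet: str, blanks):
    lines = snippet.splitlines()
    answers = []
    for line_no, token in blanks:
        lines[line_no] = lines[line_no].replace(token, "____", 1)
        answers.append(token)
    return "\n".join(lines), answers

def accuracy(expected, given):
    # ACC: the fraction of blanks filled with an acceptable answer
    # (here simplified to exact matches).
    correct = sum(1 for e, g in zip(expected, given) if e == g.strip())
    return correct / len(expected)

cloze, answers = make_cloze(SNIPPET, BLANKS)
print(cloze)
print("ACC:", accuracy(answers, ["items", "price"]))  # -> ACC: 1.0
```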

Background/Task-specific Questions

Subjects were asked to answer some background and task-specific experience questions to help us estimate their programming skills. The questions in this survey were designed according to J. Feigenspan et al.; we extracted a five-factor model with the factors experience with mainstream languages, professional experience, functional experience, experience from education and logical experience[9]. Subjects who had not passed any related programming course, or who had a reading problem, did not continue with the rest of the experiment. A subject is classified as a professional programmer if they have typed more than 40,000 lines of code, or fewer than 40,000 but with more than three years of programming experience and at least two of the background question scales at level 5; the remaining subjects are classified as novice programmers[32]. We also classified subjects by overall task-specific experience: if any task-specific scale is higher than 3, the subject is considered to have a high level of task-specific experience; if all task-specific scales are lower than 3, a low level; and the rest a medium level[3]. This decision procedure is summarized in the sketch below.
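The classification rules above amount to a small decision procedure, sketched here under the assumption that every questionnaire scale is an integer from 1 to 5; the function and field names are ours, not from the questionnaire.

```python
# Sketch of the subject-classification rules described above.
# The scale encoding (integers 1-5) and all names are our assumptions.

def programming_level(lines_typed, years_experience, background_scales):
    """Classify a subject as 'professional' or 'novice'."""
    if lines_typed > 40000:
        return "professional"
    if years_experience > 3 and sum(1 for s in background_scales if s == 5) >= 2:
        return "professional"
    return "novice"

def task_specific_level(task_scales):
    """Classify overall task-specific experience as high/medium/low."""
    if any(s > 3 for s in task_scales):
        return "high"
    if all(s < 3 for s in task_scales):
        return "low"
    return "medium"

print(programming_level(52000, 2, [3, 4, 2, 3, 3]))  # -> professional
print(task_specific_level([3, 3, 2]))                # -> medium
```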

3.3.3 Dependent and Independent Variables

The independent variable is the variant of the code snippet (with or without duplicated code).

The dependent variables are perceived readability (R), reading time (in seconds) and the accuracy rate of answering the comprehension questions (ACC).

Perceived readability (R) is how hard subjects feel the code snippet is to read, i.e. their subjective impression after reading the code; it is rated on five scales, from very difficult to very easy. Reading time (measured by the subjects themselves) also reflects the readability of the code indirectly: a longer time indicates harder, less readable code. We also noted whether the fragments that a subject found unreadable contained duplicated parts or not. The summary of steps (m) shows whether the subjects understood the code snippet, and the accuracy rate (ACC) is a more in-depth indication of the degree of comprehension. Furthermore, we collected insensitive personal data (gender, reading problems) and the subjects' programming levels, which may affect their performance in the experiment. An overview of all variables is shown in Figure 3.5.

3.3.4 Tasks

We conducted a background survey by email, asking subjects background questions to classify them as professional or novice programmers and to determine their task-specific levels. Subjects were then divided into six groups, and each group read three code snippets (see the allocation details in Table 3.2). Subjects were not told whether the code snippets they read contained duplicate code. At the beginning, they were asked to read a code snippet carefully and rate its readability (very difficult, difficult, neutral, easy, very easy) as points 1 to 5[3]. They recorded the reading time in seconds for each snippet themselves. Subjects then had to summarize the main steps of the snippet so that we could judge whether the perceived readability scale they provided was acceptable.

Figure 3.5: Overview of variables

         Code Snippet 1  Code Snippet 2  Code Snippet 3
Group 1  S1_1            S2_2            S3_1
Group 2  S1_1            S2_2            S3_2
Group 3  S1_1            S2_1            S3_2
Group 4  S1_2            S2_1            S3_1
Group 5  S1_2            S2_1            S3_2
Group 6  S1_2            S2_2            S3_1

Table 3.2: Group Allocation

After that, we conducted a simple cloze test (see Section 3.3.2): the same code snippet was shown to subjects again, but with some parts blanked out, and they were asked to fill in all the blanks within 180 seconds. Finally, we asked them which part of each code snippet was the most difficult to understand.

3.3.5 Experiment Design

In the background survey, subjects were asked to answer background questions (see Section 3.3.2) via email. In the formal experiment, we used a 2x3 factorial design with blocking. Subjects were allocated to six groups according to their characteristics, with more than six subjects per group. Each subject read at least one snippet with duplicate code and at least one snippet without. Sm_n denotes the variant of snippet Sm with or without duplicate code: _1 means the variant with duplicate code, while _2 means the variant without (see Section 3.3.2). The group allocation details are shown in Table 3.2; the sketch below checks this blocking property.
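The blocking property of Table 3.2 can be checked mechanically; this small sketch (with the allocation transcribed from the table) verifies that every group reads at least one duplicate and one non-duplicate variant.

```python
# Verify the counterbalancing property of the group allocation in Table 3.2:
# every group must see at least one _1 (duplicate) and one _2 (no duplicate).

GROUPS = {
    1: ["S1_1", "S2_2", "S3_1"],
    2: ["S1_1", "S2_2", "S3_2"],
    3: ["S1_1", "S2_1", "S3_2"],
    4: ["S1_2", "S2_1", "S3_1"],
    5: ["S1_2", "S2_1", "S3_2"],
    6: ["S1_2", "S2_2", "S3_1"],
}

for group, snippets in GROUPS.items():
    variants = {s.split("_")[1] for s in snippets}
    assert variants == {"1", "2"}, f"group {group} lacks a variant"
print("every group reads both duplicate and non-duplicate variants")
```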

3.3.6 Piloting

We invited six postgraduate students majoring in computer science (one allocated to each group) to complete the experiment following the experimental steps. They reported that some of the question descriptions were not clear and that the font of the code snippets was too small, which made them tired and resistant to reading. After the pilot study, we modified some questions to make them more accurate and resolved the code formatting issues so that subjects could complete the experiment better.

3.3.7 Experiment Execution

The experiment was conducted as an online questionnaire administered via wenjuan, a questionnaire research website. Personal background and task-specific questionnaires (see Section 3.3.2) were completed by subjects in their free time on wenjuan before the formal experiment. The time for reading the code snippets was unlimited, but we asked subjects to save time as much as possible and to record their reading time; the cloze test for each snippet was limited to 180 seconds. Subjects were told that their answers and times would only be used to analyze the readability and comprehension of the code, not to assess their personal performance. Before the formal experiment, the following rules were introduced to all subjects.

· Subjects can not take any notes or copies manually or electronically.

· Subjects can not go back in the questionnaire.

· Subjects can only pause when they see pause signal(appears after all questions of a specific snippet being completed) but the pause time should be less than 120 seconds.


Figure 3.6: Experiment execution steps

Chapter 4 Results and Analysis

4.1 Results

Forty-two subjects attended our background survey online. Forty of them met our experiment conditions and were allowed to continue with the formal experiment, and thirty-seven of those completed the formal experimental questions. After assessing the validity of their answers, we removed three rows of per-snippet experiment data because they gave a scale without an accurate description of the snippet. In total, we obtained one hundred and eight valid data points.

*&!!(' '%  ,% *&!!%&$&"" #,% #&                       

   $+  )"   ,%&( $*! ! "!

" #%&&#'(-!  



 



 



 $%&&# &&)#&'$& &&)#&'$&'(&$#!- &&"! ' &&"! ''(&$#!-

Figure 4.1: Subject classification (37 subjects who completed the whole experiment)

By summarizing the valid background survey data (subjects who completed the whole experiment), we obtain Figure 4.1, which shows the distribution of subjects by overall programming level, overall task-specific level (low, medium, high), gender and naming preference style (no preference, under_score preference, camelCase preference).

Snippet  N   R (avg)  T (avg seconds)  ACC (avg)
S1_1     20  4.25     220.60           88.35%
S1_2     16  3.63     244.94           85.42%
S2_1     17  2.94     294.18           94.14%
S2_2     18  2.56     292.61           88.89%
S3_1     17  2.00     398.12           82.35%
S3_2     20  1.85     577.90           78.75%

Table 4.1: Perceived Readability (R), Reading Time (T) and answering accuracy for the blanks in each snippet

Table 4.1 lists the number (N) of subjects for each code snippet, the average perceived readability (R), the average reading time (T, in seconds) and the answering accuracy for the blanks of each snippet in the formal experiment. In our research, we take perceived readability and reading time as indicators of readability, and ACC as the indicator of comprehension.
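For instance, the overall average readability scores quoted in Section 4.2.1 (3.13 with duplication vs. 2.61 without) can be reproduced from Table 4.1 as subject-weighted means; a minimal Python sketch:

```python
# Reproduce the overall perceived-readability averages from Table 4.1 as
# subject-weighted means (values quoted in Section 4.2.1: 3.13 vs. 2.61).
table_4_1 = {          # snippet: (N, R_avg)
    "S1_1": (20, 4.25), "S1_2": (16, 3.63),
    "S2_1": (17, 2.94), "S2_2": (18, 2.56),
    "S3_1": (17, 2.00), "S3_2": (20, 1.85),
}

def weighted_r(variant_suffix):
    rows = [(n, r) for s, (n, r) in table_4_1.items() if s.endswith(variant_suffix)]
    return sum(n * r for n, r in rows) / sum(n for n, _ in rows)

print(round(weighted_r("_1"), 2))  # 3.13 (with duplication)
print(round(weighted_r("_2"), 2))  # 2.61 (without duplication)
```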

4.2 Analysis

In this section, we first make a preliminary analysis of the experimental results and present the methods we use to calculate the relationships between factors and variables. We focus mainly on the impact on perceived readability and comprehension of: duplicate code; duplication type (completely similar code fragments or blocks, syntactically similar code, gapped clones); the characteristics of subjects, such as overall programming experience, task-specific experience and naming preference; and the characteristics of each code snippet, including LOC, CC and type of duplication. In addition, we analyze the impact of duplicate code within each duplication type on perceived readability, comprehension and reading time to see whether any relationships exist, and whether the various attributes affect each other. The relationships between all experiment variables are summarized in Figure 4.2.

We use bar charts to show the percentage distribution of the different readability scores (perceived readability) in each group (every bar is 100%), with the percentages marked (as in Figures 4.3 and 4.7). Box plots, as in Figures 4.8, 4.9 and 4.10, show the distributions of R, ACC and reading time in each type of code snippet, with the corresponding statistical correlation results. Regarding the four statistical methods we used: Spearman's rank correlation coefficient is used to evaluate relationships between ordinal variables; ANOVA is used to estimate the relationship between a continuous dependent variable and a categorical independent variable; two or more categories (groups) per variable are tested by chi-square tests (χ2); and if any expected count T < 1 or n < 40, Fisher's exact test is used instead. The sketch after Figure 4.2 illustrates these tests on fabricated data.

Figure 4.2: Summary of relationships between experiment variables.
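As an illustration of how these four approaches operate on data shaped like ours, the sketch below applies SciPy's standard implementations to fabricated numbers; none of the values are from our experiment, and the 2x2 Fisher table is invented purely to show the fallback.

```python
# Illustrative use of the four statistical approaches on fabricated data;
# none of these numbers come from the actual experiment.
from scipy import stats

# Perceived readability (1-5) vs. duplication: chi-square on a 2x5
# contingency table (rows: with/without duplication, columns: scale 1-5).
table = [[2, 3, 10, 15, 12],
         [4, 6, 14, 10, 8]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# If expected counts are too small (or n < 40), fall back to Fisher's
# exact test, which SciPy supports for 2x2 tables.
odds, p_fisher = stats.fisher_exact([[8, 2], [1, 5]])

# Reading time (continuous) vs. duplication (categorical): one-way ANOVA.
time_dup = [220, 245, 260, 230]
time_nodup = [290, 310, 280, 330]
f_stat, p_anova = stats.f_oneway(time_dup, time_nodup)

# Ordinal-ordinal relationships (e.g. R vs. LOC): Spearman's rank
# correlation coefficient.
rs, p_spearman = stats.spearmanr([4, 3, 2, 5, 1], [52, 79, 79, 26, 79])

print(chi2, p_chi2, f_stat, p_anova, rs, p_spearman)
```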

4.2.1 Analysis of duplicate code overall

For RQ1, our data (see details in Figure 4.3, Figure 4.4 and Table 4.2) show that across all code snippets, code with duplication has higher perceived readability and accuracy but lower reading time. The chi-square test does not indicate a statistically significant relationship between duplicate code and perceived readability (χ2 = 6.044, p = 0.196), although subjects in the no-duplication groups tended to rate the code as less readable (average readability score: 2.61) than subjects in the duplication groups (average readability score: 3.13).

Figure 4.3: Overview

For RQ2, the relationship between the accuracy of the cloze test and duplicate code is statistically insignificant (p > 0.3) according to ANOVA.

It is also worth noting that reading time is significantly associated with duplicate code (p = 0.05), as calculated by ANOVA on our data: subjects spent more time reading code without duplication than reading code with duplication.

Figure 4.4: The distribution of experiment results

Independent variable  Dependent variable     Method      H0                                         H1
Duplicate code        Perceived readability  chi-square  There is no significant relationship       There is a significant relationship
                                                         between duplicate code and perceived       between duplicate code and perceived
                                                         readability.                               readability.
Duplicate code        ACC                    ANOVA       The mean (average ACC) is the same for     The mean is not the same for all groups.
                                                         all groups (with or without duplication).
Duplicate code        Reading time           ANOVA       The mean (average reading time) is the     The mean is not the same for all groups.
                                                         same for all groups (with or without
                                                         duplication).

Table 4.2: Analysis and statistical methods regarding duplicate code

4.2.2 Analysis of code snippets

For the different types of code snippets (see details in Figure 4.6 and Figure 4.7): 40 percent of the expected values are less than five, so the chi-square test cannot be used, and when we attempted Fisher's exact test, SPSS reported that there was not enough memory for the calculation. We therefore use the Spearman approach to estimate the relationships between the different code snippets.

R (perceived readability) has five options (1 to 5, from very difficult to very easy). The relationship between R and the type of code snippet (comparison between groups) is significant according to the chi-square test (χ2 = 86.667, p < 0.001). Reading time is also significantly associated with the type of code snippet according to ANOVA (p < 0.001), but accuracy is not (p = 0.085), which may be because the parts removed for the cloze test in code snippet 2 were not chosen well enough, resulting in small differences in accuracy. This is nevertheless enough to justify using snippet type as an independent variable in the data analysis.

For duplication type (see details in Figure 4.5), we analyze the differences between S1_1, S2_1 and S3_1, representing duplication types 1, 2 and 3 respectively. Our data show that type 1 has the highest perceived readability and type 3 the lowest, and the difference is strongly significant (χ2 = 43.980, p < 0.001). For reading time, subjects spent the shortest time reading type 1 snippets, while type 3 snippets cost more time than the other two duplication types; this difference is also significant according to ANOVA (p = 0.021). Type 2 has the highest ACC of the three types and type 3 the lowest, but this difference is not significant (p = 0.22).

Figure 4.5: The distribution of R, ACC, Reading time in each duplication type

For the code snippets overall, the dataset indicates that reading time is moderately negatively associated with perceived readability (Spearman's correlation coefficient rs = -0.47, p < 0.001), which is significant, and reading time is also significantly associated with CC (rs = 0.428, p < 0.01).

Figure 4.6: The distribution of R, ACC, Reading time in each type of snippet

Figure 4.7: The experiment results in each type of code snippet

Figure 4.7: The experiment results in each type of code snippets 4.2. Analysis 27 characters of code snippets, there is a weak and significant connection between line of code(LOC) and perceived readability(rs = -0.298,p = 0.025). This may be due to incomplete code interception and the existence of unexplained classes. And our data show that there is moderate connection between R and read time, code length, or , so the results claim that perceived readability has moderate correlation with some traditional attributes used to judge software readability. Thus maybe perceived readability may need to be analyzed with other methods to evaluate software readability.

It is also notable that accuracy is weakly but significantly related to LOC (rs = -0.200, p = 0.038) and weakly but significantly associated with reading time (rs = -0.214, p = 0.026). The data also show that reading time is weakly associated with LOC (rs = 0.273, p < 0.01), and this association is statistically significant.

So in our research, larger code needs more reading time and has lower perceived readability, though the correlations between them are only weak (albeit significant). We allowed subjects to read the source code for as long as they wanted, filling in their start and end reading times themselves; with no time limit, subjects facing a snippet with many lines of code can increase their reading time to shrink the negative impact of code size on readability and comprehension. In general, spending longer reading the code means subjects can understand it more deeply, which improves the perceived readability of the code; conversely, a shorter reading time may reduce perceived readability. Since this could also affect the ACC results, we limited the time for answering the cloze test to 3 minutes.

For the individual code snippets (each type of duplicate code), when we analyze the relationship between duplicate code and perceived readability, it is notable that the distribution of perceived readability differs considerably for duplication type 1: code with duplication has higher perceived readability than code without, although the difference is not significant, just as for the other two types, according to Spearman's correlation coefficient (see details in Figure 4.8, Figure 4.9 and Figure 4.11).

Figure 4.8: The distribution of R in individual snippet result

Figure 4.9: The distribution of ACC in individual snippet result

Figure 4.10: The distribution of reading time in individual snippet result

When it comes to ACC and duplicate code within each code snippet, duplication types 1 and 3 both have similar distributions of ACC, while subjects achieved high ACC (more than half scored 100%) in group S2_1 even though the mean perceived readability of code snippet 2 is much lower than that of code snippet 1; this may be due to insufficient difficulty of the blanked parts of that cloze test, which we analyze in detail in the discussion chapter. For types 1 and 3, the relationship between ACC and duplicate code is not significant either.

However, there is a considerable and significant difference in reading time between the duplicate and non-duplicate variants of code snippet 3 (duplication type 3) (p = 0.041): the difference in average reading time is nearly 200 seconds (see details in Figure 4.10 and Figure 4.11). For types 1 and 2, the effect of duplicated code on reading time is minimal. It is notable that the code length of types 1 and 2 is relatively small compared to type 3, and their perceived readability is higher; focusing future research on large and obscure code fragments could therefore be worthwhile, as these are likely to be important factors affecting perceived readability.

When we analyze the relationship between ACC and reading time within each type of code snippet, the data show no significant relationship. There is also no significant relationship between R and ACC for snippet types one (completely similar clone) and two (syntactically similar clone), but R is significantly associated with ACC (p < 0.01), tested by ANOVA, for snippet type three (gapped clone). The correlation between ACC and reading time is weak and likewise not statistically significant.

Figure 4.11: The experiment results in each code snippet

4.2.3 Analysis of subjects' characteristics

For the subjects' characteristics, the relationship between overall programming level and R is statistically insignificant, and overall task-specific experience is not strongly related to R either. However, subjects with a high overall programming level gave higher perceived readability ratings than those with a lower level, and the same holds for task-specific experience.

Figure 4.12 shows that groups with high task-specific experience achieved higher accuracy in the cloze test than those with medium and low experience, but according to ANOVA neither overall programming level (p = 0.068) nor task-specific experience is significantly associated with ACC. It is notable, however, that subjects with high overall programming experience also tend to have high task-specific experience (χ2 = 15.213, p = 0.001).

The ANOVA also shows that subjects with high overall programming experience spent less reading time than those with low experience, although the difference between them is insignificant. Reading time is, however, strongly associated with task-specific experience (p = 0.015): reading the code snippets cost less time for the high task-specific experience group.

In addition, subjects with no naming preference considered the code snippets less readable than subjects with a naming preference, and also had lower accuracy. There are also differences in R, accuracy and reading time between the two naming preferences (camelCase and underscore), but these gaps are not statistically significant.

According to Figure 4.12, males showed higher accuracy and higher perceived readability than females, but the differences are not statistically significant. The data do show that males spent less time reading the code snippets than females (p = 0.032), measured by ANOVA.



#"          

 

  "              

 

          

             

 

%*"               

  

 ,&')           

  

          

             

 

  %&'       



$'(%'&'               





#"(&'          

 

          

             



!               

 

               



%+              

          

             

Figure 4.12: Distribution of subjects characters for perceived readability 32 Chapter 4. Results and Analysis

                         

  $)  *%&'    $("   !          " #" " #"

Figure 4.13: Comparison between Gender experience Chapter 5 Discussion

5.1 Research question

For RQ1: How does duplicate code affect software readability?

Taking perceived readability as the indicator, code with duplication has higher software readability than code without duplication according to our results, although the difference is not significant. Taking reading time as the indicator, it takes less time to read code with duplication than without, and this effect is significant, both overall and within each duplication type. In code snippet 3 (duplication type 3), which has a large code size, the reading-time difference between the variants is more pronounced, which suggests that for large code of duplication type 3 the impact of duplicate code on readability can be stronger; whether duplication type or code size is the main factor affecting readability could be researched in the future. Thus perceived readability probably cannot be used alone as a measure of readability; it needs to be combined with other indicators. It is also worth noting that different duplication types show different perceived readability: our data show that type 3 (gapped clone) has the lowest perceived readability and type 1 (completely similar) the highest, and this difference is strongly significant.

For RQ2: How does duplicate code affect comprehension?

According to our results, code with duplication yields higher comprehension than code without duplication, but the difference is not significant, either overall or within each duplication type.

5.2 Discussion of the experiment data

Beyond the results reported in the Results chapter, some phenomena are worth discussing.

Regarding the relationships and correlations among other factors, note that we chose three different types of duplicate code: completely similar code fragments or blocks, syntactically similar code and gapped clones (Section 3.3.2). Each type of duplicate code shows a different distribution of perceived readability and reading time, which means duplication type can be an important factor influencing both. However, we selected only one representative code snippet per type, and each snippet has a different degree of difficulty, so the correctness of this result remains open to discussion; more code of various types should be used to enhance the accuracy of the experiment. In addition, we changed duplication type 1 (completely similar code fragments or blocks) and type 2 (syntactically similar code) only slightly to obtain the code variants, while type 3 (gapped clone) was changed relatively more, which means the variant of duplication type 3 is more obscure than the variants of types 1 and 2. This suggests that the difficulty of the code snippet might itself be a factor influencing perceived readability and reading time.

For the individual code snippets (each duplication type), our experimental data show that perceived readability and ACC are both weakly and negatively associated with LOC from an overall perspective. It is noteworthy, however, that the perceived readability of code snippet 2 (type 2) is lower than that of code snippet 1 (type 1), i.e. snippet 2 is harder to read although it has the fewest lines of code. We infer that this is because the class calls some methods or other classes whose identification is not shown in S2_1, which may decrease perceived readability. S2_1 even has the highest ACC despite its relatively low readability, contrary to the results of the other two groups; we think this may be due to the selection of the removed parts for the cloze test, and because readers can easily guess the specific purpose of the method from its name, so these factors may influence readability but not comprehension. Because code snippet 2 is so short, the candidate parts for removal were limited, and our selection might not have been good enough to ensure sufficient difficulty. Besides, compared with the other two duplication types, duplication type 3 has a relatively large reading-time gap; we assume this is because it has the most code lines and the highest overall difficulty among the three snippets. Subjects need more time to connect and understand the context when gapped clones are removed by extracting the common part (see the code snippet details in Figure 3.4).

5.3 Comparison of experimental results and related literature

As mentioned in Chapter 2, Kapser[17] made subjective judgments on 11 different categories of duplicate code, considering motivation, advantages, disadvantages, long-term problems and frequency in medium-sized projects. They initially thought that Parameterized Code (the same as type 2 in our study) could improve comprehensibility, but their results show that only 25% of such clones were beneficial in the Apache httpd project; this type of duplicate code was regarded as bad when it involved trivially abstracted and simple code, which would be more understandable if the duplication were removed. "Understandable" is their subjective notion, and it focuses more on comprehension than on software readability. In our research, comprehension is the reader's understanding of the text description, and readability is a necessary condition for comprehension but does not by itself guarantee comprehensibility; software readability is an inherent property of the text that readers need to actively perceive. For duplication type 2 in our experiment, the data show that code with duplication has higher comprehension than code without, though not significantly, which differs from Kapser's opinion.

Some researchers[16][19] pointed out that duplicate code has an effect on software maintainability, and Chanchal K. Roy et al.[25] claimed that reading and understanding code is the most time-consuming part of all maintenance activities, which suggests that perceived readability and comprehension are two factors in maintainability. However, our experimental results show no significant relationship between duplication and perceived readability or comprehension, so duplicate code appears to affect maintainability through something other than perceived readability and comprehension. Which maintainability factors are affected by duplicate code is worth exploring in future work.

5.4 What a developer should do with duplicate code

Based on our results, code with duplication needs less reading time, which increases readability. The subjects in our experiment spent less time reading the code snippets with duplicated code, especially for the gapped clone, which has the largest size, the highest complexity and the lowest perceived readability. So gapped clone snippets, especially those with large size and high complexity, should not be refactored from the readability perspective, because the duplicated code can increase the readability of the code to some degree. For completely similar and syntactically similar clones, more code snippet types (longer and more complex) are needed to estimate whether those clone types should also be left unrefactored from a readability perspective. Duplicate code is not associated with comprehension when we choose the ACC of the cloze test as the indicator of software comprehension, so we cannot advise developers on whether to modify duplicated code from the comprehensibility perspective. In addition, developers can add good comments or documentation to improve software readability and comprehension[3].

5.5 Validity Threats

The average duration of the experiment was 40 minutes, and fatigue effects increase gradually over such a period. To minimize fatigue effects, we allowed the subjects to suspend the experiment and rest for 120 seconds before the beginning of the following snippet. Furthermore, we put the least complex and shortest code snippet in the middle of the experiment to lighten the fatigue effects. However, the last code snippet can still be long and obscure.

We analyze the relationship between perceived readability and overall programming experience as well as task-specific experience in Chapter 5. However, we cannot control each subject to apply the same self-evaluation standard in the background survey, so we cannot obtain a perfectly accurate characterization of each subject.

As discussed in Section 3.3.4, although we told the subjects that this experiment studies the relationship between duplicate code and readability/comprehension and that there is no reason to cheat, we still cannot prevent fraud, such as using phones or notes to record the code snippet, which would help with the cloze test. To detect cheating or inaccurate assessments of perceived readability, we set a descriptive question about the main steps or primary functions of each snippet to evaluate whether the perceived readability rating is acceptable.

Because of the coronavirus pandemic, we changed the field experiment to an online experiment, which means we could not monitor the subjects while they answered or require them to take the experiment at a uniform location, date, and time. We originally wanted the subjects to answer the questions at the same time, but they came from different countries, and some participants only completed the experiment after multiple email reminders, so we ultimately failed to confine the answering times to the same moment or even to nearby dates; these factors may affect the results of the experiment.

We can only guarantee that the programming experience of the subjects is generally similar, not identical; the remaining differences still introduce errors into the experimental results. The programming level of the participants is not the same, and they may also make errors in their self-evaluations, which affects the results of the experiment.

There are four types of duplicate code in total, and our experiment only contains three of them (see Section 3.3.2). Furthermore, the limited number of code snippets used in our experiment is a threat to generalization to other programming language environments and to more complex situations (code containing a variety of different code smells). However, we believe that the snippets we selected are representative of these three duplication types, because we considered length, complexity, and type of repetition to cover various kinds of code, and we used both the code type that Kapser points out can improve comprehensibility and more usual code types to eliminate bias (see Section 3.3.2).

We also strive to counter the effects of reading order and of the different number of duplications in each group by dividing the subjects into six groups (see Table 3.2).

Chapter 6 Conclusions and Future Work

6.1 Conclusion

In this thesis, we have collected relevant information about duplicate code, software readability, and comprehension, and analyzed the impact of duplicate code on perceived readability and comprehension. In our Related Work chapter, it was generally believed that duplicate code reduces the readability of software, but some researchers speculated that certain types of duplicate code would improve readability, so we designed an experiment to estimate whether different types of duplicate code have an impact on readability and comprehension.

Based on the data and results we gathered, duplicate code affects the time needed to read a code snippet, but it influences neither the perceived readability of the snippet nor the accuracy of the cloze test, which represents comprehension of the code. Code with duplication requires less reading time, and the difference in reading time is conspicuous for the code snippet that has a large size and lower perceived readability. So we need to focus more on such code in the future and consider whether code type or code size is the main factor that affects reading time.

6.2 Limitations and Future Work

One reason for the ambiguity of our results may be that the number of participants was not large enough: only 37 participants completed our full experiment. With insufficient data, the reliability of the results may be affected. Because the survey was conducted online, we could not monitor the behavior of the participants; they may have recorded the code to obtain a higher cloze accuracy rate, or distorted the reading time, and this was beyond our control. Besides, because we needed to keep the code snippets reasonably short, the excerpted code contains undefined classes and variables, which causes some reading difficulty. We should provide more comprehensive, higher-quality code snippets within a specified code length to probe the relationship between duplicate code and perceived readability and comprehension.

So in future work, each study should focus on a particular type of duplicate code. Since we selected only one piece of code per duplication type as the experimental object, multiple pieces of each duplication type, covering different levels of difficulty and code length, should be selected to avoid coincidental results.

Secondly, our experimental data shows that duplication itself (whether the code has duplication or not) is weakly and insignificantly associated with readability and comprehension, while the correlations between duplication type and readability and comprehension are significant. We need more subjects to provide data points to reduce the ambiguities of the experiment in the future.

Thirdly, different types of duplicate code may differ in difficulty, especially gapped clones. Research on how different types of duplicate code impact perceived readability and comprehension is expected.

References

[1] Lerina Aversano, Luigi Cerulo, and Massimiliano Di Penta. How clones are maintained: An empirical study. In 11th European Conference on Software Maintenance and Reengineering (CSMR’07), pages 81–90. IEEE, 2007.

[2] Dave Binkley, Marcia Davis, Dawn Lawrie, Jonathan I Maletic, Christopher Morrell, and Bonita Sharif. The impact of identifier style on effort and comprehension. Empirical Software Engineering, 18(2):219–276, 2013.

[3] Jürgen Börstler and Barbara Paech. The role of method chains and comments in software readability and comprehension—an experiment. IEEE Transactions on Software Engineering, 42(9):886–898, 2016.

[4] R. P. L. Buse and W. R. Weimer. Learning a metric for code readability. IEEE Transactions on Software Engineering, 36(4):546–558, July 2010.

[5] Raymond PL Buse and Westley R Weimer. A metric for software readability. In Proceedings of the 2008 international symposium on Software testing and analysis, pages 121–130, 2008.

[6] Raymond PL Buse and Westley R Weimer. Learning a metric for code read- ability. IEEE Transactions on Software Engineering, 36(4):546–558, 2009.

[7] Tom Cargill. C++ programming style. Addison-Wesley Massachusetts, 1992.

[8] Cecil Eng Huang Chua, Sandeep Purao, and Veda C Storey. Developing maintainable software: The readable approach. Decision Support Systems, 42(1):469–491, 2006.

[9] J. Feigenspan, C. Kästner, J. Liebig, S. Apel, and S. Hanenberg. Measuring programming experience. In 2012 20th IEEE International Conference on Program Comprehension (ICPC), pages 73–82, June 2012.

[10] John R Foster. Cost factors in software maintenance. PhD thesis, Durham University, 1993.

[11] Martin Fowler. Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018.

[12] Nils Göde and Jan Harder. Clone stability. In 2011 15th European Conference on Software Maintenance and Reengineering, pages 65–74. IEEE, 2011.


[13] Robert M Groves, Floyd J Fowler Jr, Mick P Couper, James M Lepkowski, Eleanor Singer, and Roger Tourangeau. Survey methodology, volume 561. John Wiley & Sons, 2011.

[14] William E Hall and Stuart H Zweben. The cloze procedure and software comprehensibility measurement. IEEE Transactions on Software Engineering, (5):608–623, 1986.

[15] Johannes C Hofmeister, Janet Siegmund, and Daniel V Holt. Shorter identifier names take longer to comprehend. Empirical Software Engineering, 24(1):417–443, 2019.

[16] Keisuke Hotta, Yui Sasaki, Yukiko Sano, Yoshiki Higo, and Shinji Kusumoto. An empirical study on the impact of duplicate code. Advances in Software Engineering, 2012, 2012.

[17] Cory J Kapser and Michael W Godfrey. “Cloning considered harmful” considered harmful: patterns of cloning in software. Empirical Software Engineering, 13(6):645, 2008.

[18] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What’s in a name? a study of identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC’06), pages 3–12. IEEE, 2006.

[19] A. Lozano, M. Wermelinger, and B. Nuseibeh. Evaluating the harmfulness of cloning: A change based experiment. In Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), pages 18–18, May 2007.

[20] Angela Lozano and Michel Wermelinger. Assessing the effect of clones on changeability. In 2008 IEEE International Conference on Software Maintenance, pages 227–236. IEEE, 2008.

[21] Xu Luo, Zhexue Ge, Fengjiao Guan, and Yongmin Yang. A method for the maintainability assessment at design stage based on maintainability attributes. In 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), pages 187–192. IEEE, 2017.

[22] Robert C Martin. Clean code: a handbook of agile software craftsmanship. Pearson Education, 2009.

[23] Manishankar Mondal, Chanchal K Roy, and Kevin A Schneider. A survey on clone refactoring and tracking. Journal of Systems and Software, 159:110429, 2020.

[24] Daryl Posnett, Abram Hindle, and Premkumar Devanbu. A simpler model of software readability. In Proceedings of the 8th Working Conference on Mining Software Repositories, pages 73–82, 2011.

[25] Chanchal K Roy, Minhaz F Zibran, and Rainer Koschke. The vision of software clone management: Past, present, and future (keynote paper). In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pages 18–33. IEEE, 2014.

[26] Forrest Shull, Janice Singer, and Dag IK Sjøberg. Guide to advanced empirical software engineering. Springer, 2007.

[27] Malcolm Smith and Richard Taffler. Readability and understandability: Different measures of the textual complexity of accounting narrative. Accounting, Auditing & Accountability Journal, 5(4):0–0, 1992.

[28] M.-A. Storey. Theories, methods and tools in program comprehension: past, present and future. In 13th International Workshop on Program Comprehension (IWPC’05), pages 181–191, May 2005.

[29] Herb Sutter. C++ coding standards: 101 rules, guidelines, and best practices. Pearson Education India, 2004.

[30] Michael Toomim, Andrew Begel, and Susan L Graham. Managing duplicated code with linked editing. In 2004 IEEE Symposium on Visual Languages-Human Centric Computing, pages 173–180. IEEE, 2004.

[31] Allan Vermeulen, Scott W Ambler, Greg Bumgardner, Eldon Metz, Trevor Misfeldt, Patrick Thompson, and Jim Shur. The elements of Java (tm) style, volume 15. Cambridge University Press, 2000.

[32] A. Von Mayrhauser and A. M. Vans. Program comprehension during software maintenance and evolution. Computer, 28(8):44–55, Aug 1995.

[33] Stephen S. Yau and James S. Collofello. Design stability measures for software maintenance. IEEE Transactions on Software Engineering, (9):849–856, 1985.

Appendix A Background/Task-specific Questions

1. Do you have reading difficulty? A: Yes B: No

2. What’s your gender? A: Male B: Female

3. In which year did you enroll in university (undergraduate)?

4. How many years have you been programming?

5. How many related programming courses did you complete?

6. How do you estimate your Java programming experience (scale 1-5, 1 means worst and 5 means best)?

7. How do you estimate your other programming experience (scale 1-5, 1 means worst and 5 means best)?

8. How do you estimate your Java programming experience compared to your classmates (scale 1-5, 1 means worst and 5 means best)?

9. How many lines of code have you programmed? A: Less than 900 lines B: More than 900 and less than 40000 C: More than 40000

10. Which naming convention do you prefer: under_score (like sold_card) or camelCase (like soldCard)? A: I prefer under_score style B: I prefer camelCase style C: I have no preference

11. How do you rate your competence in principles of object-oriented design? (scale 1-5, 1 means no competence and 5 means expert-level competence)

12. How do you rate your competence in model design? (scale 1-5, 1 means no competence and 5 means expert-level competence)

13. How do you rate your competence in code smells and refactoring? (scale 1-5, 1 means no competence and 5 means expert-level competence)


14. How do you rate your competence in programming Eclipse plug-ins? (scale 1-5, 1 means no competence and 5 means expert-level competence)

Appendix B Code snippet

Code snippet 1_1 (With duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (w[x][y].contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 1_2 (Non-duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!valid(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!valid(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

private boolean valid(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    return true;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (w[x][y].contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 2_1 (With duplicated code)

DayTest.java

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Day d = new Day(1, 3, 1970);
    assertEquals(5094000000L, d.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

HourTest.java

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Hour h = new Hour(15, 1, 4, 2006);
    assertEquals(1143900000000L, h.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

Code snippet 2_2 (Non-duplicated code)

public void extracted(RegularTimePeriod arg0, long arg1) {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    RegularTimePeriod d = arg0;
    assertEquals(arg1, d.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(savedZone);
}

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    extracted(new Day(1, 3, 1970), 5094000000L);
}

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    extracted(new Hour(15, 1, 4, 2006), 1143900000000L);
}

Code snippet 3_1 (With duplicated code)

LineContains.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which contain all of the specified words.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                String containsStr = (String) contains.elementAt(i);
                matches = line.indexOf(containsStr) >= 0;
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContainsRegExp.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
                Regexp re = regexp.getRegexp(getProject());
                matches = re.matches(line);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

Code snippet 3_2 (Non-duplicated code)

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions or contain all of the specified words.
 *
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int extracted(Vector vector, Function matcher) {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = vector.size();
        for (line = readLine(); line != null; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                matches = (Boolean) matcher.apply(i);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContains.java

public int read() {
    return extracted(contains, (Integer i) -> {
        String containsStr = (String) contains.elementAt(i);
        return line.indexOf(containsStr) >= 0;
    });
}

LineContainsRegExp.java

public int read() {
    return extracted(regexps, (Integer i) -> {
        RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
        Regexp re = regexp.getRegexp(getProject());
        return re.matches(line);
    });
}

Appendix C Cloze test

Code snippet 1_1 (With duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!A:_____)
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!B:_____)
        return false;
    if (isUnknown(x, y))
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (C:_____.contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 1_2 (Non-duplicated code)

/**
 * Checks if a square has a pit. Returns false
 * if the position is invalid, or if the square is
 * unknown.
 *
 * @param x X position
 * @param y Y position
 * @return True if the square has a pit
 */
public boolean hasPit(int x, int y) {
    if (!A:_____)
        return false;
    if (w[x][y].contains(PIT))
        return true;
    else
        return false;
}

// Checks if the Wumpus is in a square.
public boolean hasWumpus(int x, int y) {
    if (!B:_____)
        return false;
    if (w[x][y].contains(WUMPUS))
        return true;
    else
        return false;
}

private boolean valid(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (isUnknown(x, y))
        return false;
    return true;
}

// Checks if a square is valid.
public boolean isValidPosition(int x, int y) {
    if (x < 1)
        return false;
    if (y < 1)
        return false;
    if (x > size)
        return false;
    if (y > size)
        return false;
    return true;
}

// Checks if a square is unknown.
public boolean isUnknown(int x, int y) {
    if (!isValidPosition(x, y))
        return false;
    if (C:_____.contains(UNKNOWN))
        return true;
    else
        return false;
}

Code snippet 2_1 (With duplicated code)

DayTest.java

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Day d = new Day(1, 3, 1970);
    assertEquals(5094000000L, A:______.getFirstMillisecond());
    Locale.setDefault(B:______);
    TimeZone.setDefault(savedZone);
}

HourTest.java

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    Hour h = new Hour(15, 1, 4, 2006);
    assertEquals(1143900000000L, h.getFirstMillisecond());
    Locale.setDefault(saved);
    TimeZone.setDefault(C:______);
}

Code snippet 2_2 (Non-duplicated code)

public void extracted(RegularTimePeriod arg0, long arg1) {
    Locale saved = Locale.getDefault();
    Locale.setDefault(Locale.UK);
    TimeZone savedZone = TimeZone.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
    RegularTimePeriod d = arg0;
    assertEquals(arg1, d.getFirstMillisecond());
    Locale.setDefault(A:______);
    TimeZone.setDefault(B:______);
}

// Test method getFirstMillisecond() by creating a Day type object
public void testGetFirstMillisecond() {
    extracted(new C:______(1, 3, 1970), 5094000000L);
}

// Test method getFirstMillisecond() by creating an Hour type object
public void testGetFirstMillisecond() {
    extracted(new Hour(15, 1, 4, 2006), 1143900000000L);
}

Code snippet 3_1 (With duplicated code)

LineContains.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which contain all of the specified words.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                String containsStr = (String) contains.elementAt(i);
                matches = line.indexOf(containsStr) >= 0;
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return D:_____;
}

LineContainsRegExp.java

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions.
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int read() {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = contains.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
                Regexp re = regexp.getRegexp(getProject());
                matches = re.matches(line);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

Code snippet 3_2 (Non-duplicated code)

/**
 * Returns the next character in the filtered stream, only including
 * lines from the original stream which match all of the specified
 * regular expressions or contain all of the specified words.
 *
 * @return the next character in the resulting stream, or -1
 * if the end of the resulting stream has been reached
 */
public int extracted(Vector vector, Function matcher) {
    if (!getInitialized()) {
        initialize();
        setInitialized(true);
    }

    int ch = -1;
    if (line != null) {
        ch = line.charAt(0);
        if (line.length() == 1) {
            line = null;
        } else {
            line = line.substring(1);
        }
    } else {
        final int containsSize = vector.size();
        for (line = readLine(); line != A:_____; line = readLine()) {
            boolean matches = true;
            for (int i = 0; matches && i < containsSize; i++) {
                matches = (Boolean) matcher.apply(i);
            }
            if (matches ^ isNegated()) {
                break;
            }
        }
        if (line != null) {
            return read();
        }
    }
    return ch;
}

LineContains.java

public int read() {
    return extracted(contains, (Integer i) -> {
        String containsStr = (String) contains.elementAt(i);
        return line.indexOf(C:_____) >= 0;
    });
}

LineContainsRegExp.java

public int read() {
    return extracted(regexps, (Integer i) -> {
        RegularExpression regexp = (RegularExpression) regexps.elementAt(i);
        Regexp re = D:_____.getRegexp(getProject());
        return re.matches(line);
    });
}
