Bachelor Degree Project

Evolution of Software Documentation Over Time
An analysis of the quality of software documentation

Author: Helena Tevar Hernandez
Supervisor: Francis Palma
Semester: VT/HT 2020
Subject: Computer Science

Abstract

Software developers, maintainers, and testers rely on documentation of the code they are working with. However, software documentation is often perceived as a waste of effort because it is usually outdated. How documentation evolves through a set of releases may show whether there is any relationship between time and quality. The results could help future developers and managers to improve the quality of their documentation and decrease the time developers spend analyzing code. Previous studies showed that documentation used to be scarce and low in quality; thus, this research has investigated different variables to check whether the quality of the documentation changes over time. To that end, we created a tool that extracts and calculates the quality of the comments in code blocks, classes, and methods. The results agree with the previous studies: the quality of the documentation is affected to some extent through the releases, with a tendency to decrease.

Keywords: Software documentation, Source code documentation, Code conventions, Source code summarizing, Documentation, Textual similarity.

Preface

I would like to thank the teachers, readers, and friends that followed me in this project, especially my supervisor Francis Palma and coordinator Diego Perez Palacín, my colleague and personal natural language parser Dustin Payne, and the person that helped me during three years managing courses and schedules, Ewa Püschl. I would also like to thank the open-source community; thanks to them, this research was possible.

Contents

1 Introduction
1.1 Background
1.1.1 Quality Definition
1.1.2 Jaccard ratio and Cosine similarity
1.1.3 Java Language
1.2 Related work
1.3 Problem formulation
1.4 Motivation
1.5 Research Questions and Objectives
1.6 Scope/Limitation
1.7 Target group
1.8 Outline

2 Method
2.1 Natural Language Processing
2.2 Reliability and Validity
2.3 Ethical Considerations

3 Implementation
3.1 Extraction
3.1.1 Extracting comments
3.1.2 Extracting classes
3.1.3 Extracting methods
3.2 Cohesion calculation
3.2.1 Parsing and normalizing strings
3.2.2 Jaccard algorithm
3.2.3 Cosine algorithm
3.3 Results of the extraction

4 Results
4.1 RQ 1: What is the proportion of code blocks with and without documentation?
4.2 RQ 2: What is the proportion of new code blocks with and without documentation?
4.3 RQ 3: Does the code blocks documentation quality improve across the releases?
4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?

5 Analysis

6 Discussion

7 Conclusion
7.1 Future work

References

A Appendix — Selection of projects

B Appendix — Evolution of quality

C Appendix — Lists of stop words
C.1 NLTK stop words
C.2 Extra stop words
C.3 Java Keywords as stop words

1 Introduction

Developers usually rely on low-level documentation, especially class- and method-level documentation, to comprehend, modify, and maintain a system that is continuously evolving. The documentation has to be related to the class or method where it is located, reflecting what they do and how they should be maintained. While creating and maintaining software is the job of developers, updating the documentation is often not seen as an important task [1, 2]; thus, it is common to find documentation that has not been updated and does not reflect the actual functionality of the method or class it describes. Because software evolves continuously, this study examines the cohesion between documentation and source code as a factor of the quality of the documentation.

1.1 Background

During the process of developing source code artifacts, developers need to understand the functions of said artifacts by using source code documentation. This kind of documentation includes comments in source code that are used to explain blocks of code such as methods and classes. While good comments help developers with their jobs, the act of documenting is often seen as counterproductive and time-consuming [1, 2], especially for projects developed within the Agile principles, which require fast-paced programming and continuous delivery. In other cases, the comments are outdated or difficult to create for legacy code [3], and changes are added in an undisciplined manner [4]. This creates problems for the future implementer and for other stakeholders that also work with the same code, such as testers and maintainers [2, 5]. Changes in code documentation and some aspects of quality have been studied previously [6, 7]; the research by Schreck, Dallmeier, and Zimmermann studied the quality of documentation through similarity ratios between natural language and source code, among other values [8]. Knowing the previous results, this research focuses on the similarity ratio, using different algorithms, expecting to see how the documentation quality evolves through time on a large sample of projects.

1.1.1 Quality Definition

The American Society for Quality accepts that quality is not a static value and that it is different for every person or organization; however, it gives a guideline to define quality as 'the characteristics of a product or service that bear on its ability to satisfy stated or implied needs' [9]. Sommerville [1] suggested different requirements for all the documents associated with a software project: they act as the communication medium between members of the team and as an information repository that helps the development process, and they should tell users how to use and administer the system. There is a subjective component that

is inherent to the discussion of quality; for instance, whether a text is difficult to understand is not universal to all humans. Metrics should include human insights [10], but that adds complexity to the studies. More objective variables that are related to factors of quality in the documentation are coherence, usefulness, completeness, and consistency, as mentioned by Steidl [11]. Coherence covers how comment and code are related and is, thus, measurable. The relation between the comments and code can be studied as the ability to paraphrase the machine language in natural language in order to give context to the source code. In that case, the documentation should reflect the contents of the code. This was already stated by McBurney and McMillan [12]: source code documentation should use keywords from the source code. For this reason, a way to investigate how the documentation refers to the source code is to measure the similarity between them.

In order to check the similarity between two texts, many algorithms have been developed. In the research made by Steidl, Hummel, and Juergens, the similarity ratio used was the Levenshtein ratio [11]. The Levenshtein ratio defines the distance between two strings by counting the minimum number of operations needed to transform one string into the other [13]. There are two main branches of similarity ratios: string-based and corpus-based measures [13]. Corpus-based measures work best with large texts, which is not the case for this study; string-based measures are better fitted for small strings. This kind of algorithm includes character- and term-based ratios. Character-based algorithms measure the distance between characters in two strings, like the Levenshtein ratio. For instance, words like "Sam" and "Samuel" would be similar under character measures because they share three characters; however, term-based ratios would treat them as two different, unrelated words. Term-based similarity is the approach that can show the similarity between the developers' comments and the programming code. In this research, we elaborate on two algorithms for calculating the similarity ratio that have not been used in this context before: the Jaccard ratio and the Cosine similarity.

1.1.2 Jaccard ratio and Cosine similarity

Jaccard index similarity is calculated as the size of the intersection of two sets divided by the size of the union of the sets, where each set includes the words of a string [14]. The Jaccard ratio calculates the similarity between sets of words, meaning that the repetition of words is ignored. Two strings that contain the same set of words will result in a Jaccard index of 1, because the sets fully overlap, while two strings with no words in common will result in an index of 0.

J(A, B) = |A ∩ B| / |A ∪ B|

The Cosine similarity [15] calculates the cosine of the angle between two vectors, where each analyzed string forms a vector.

This ratio takes into account the repetition of words to create the required vectors. When two strings share the same words with the same repetitions, the cosine of the angle will be close to 1; in other words, the angle will be close to 0°. On the contrary, when two strings differ in words and repetitions, the cosine will be close to 0, so the angle formed by the two vectors will be close to 90°.

C(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖) = ( Σ_{i=1}^{n} A_i B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
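To make the difference between the two ratios concrete, consider a hypothetical comment reduced to the words 'parse input file' and a code block reduced to 'parse file file buffer'. As sets, A = {parse, input, file} and B = {parse, file, buffer}, so |A ∩ B| = 2, |A ∪ B| = 4, and J(A, B) = 2/4 = 0.5. As frequency vectors over the vocabulary (parse, input, file, buffer), A = (1, 1, 1, 0) and B = (1, 0, 2, 1), so A · B = 3, ‖A‖ = √3, ‖B‖ = √6, and C(A, B) = 3/√18 ≈ 0.71. The repeated word 'file' raises the cosine value but leaves the Jaccard index untouched.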

1.1.3 Java Language

Java has a particular syntax that developers have to follow to be able to compile an application. In this case, the most important parts for the research were the comment, class declaration, and method declaration syntax [16].

The comments are the main source of documentation for Java. They follow a clear syntax where a set of symbols written before a string makes the compiler ignore it, while developers can still use it to add extra information. The symbols used for comments in Java are: /*, /**, *, //, and */. Even though comments can be used anywhere in the code, Oracle designates the space before class and method declarations as the position for source code documentation, written as Javadoc comments [17]. However, the background of the developer may affect the way of writing source code documentation. For instance, Java's parent language, C++, uses block comments as source code documentation, and the Java compiler admits those C++ conventions in the language.

The class declaration syntax is structured around two mandatory elements. The first is one of the following keywords: class, enum, or interface. After the keyword, the Java compiler needs an identifier, the actual name of the class: a sequence of Unicode characters that cannot begin with a digit. The declaration is followed by an opening curly bracket that marks the beginning of the contents of the class. Those are the minimum mandatory requirements to declare a class in Java. Additionally, developers may add modifiers before the class declaration, for example public, private, or protected. When there is no modifier present in the code, Java assumes that the class is public within the corresponding package but private outside of it [16].

In contrast to class declarations, method declarations have a more flexible syntax. The only requirement to declare a method in Java is an identifier and a pair of parentheses. The Java compiler will use the default values needed to declare the method: public within the package and private otherwise in the case of the modifiers, and void for the return type [16]. All the possible terms used to declare a method in Java appear, in strict

order: modifiers, return data types, identifiers, parameters, and exceptions. Similarly to class declarations, there can be one or more modifiers. Unlike the modifiers, there must be only one return type; however, the data type can be any of the built-in Java data types or custom-made data types, in their single or array form. The parameters' syntax is an input data type followed by its identifier.

1.2 Related work

The current state of the art shows an opposition of forces between those who consider source code documentation a reliable source of information [1, 18] and those who, while still agreeing on the importance of documentation, try to automate its creation so developers can avoid the task [19]. However, automatic documentation does not go without criticism.

Even though tools have been developed to create source code documentation [3], studies have shown that automatically generated summaries were more inconsistent and less similar to the source code [12]. Those results suggest that source code documentation achieves better perceived similarity when it is written by the developers. However, one of the biggest complaints about documentation is how badly maintained it is and how most of the time it is out of date. Studies have shown that JavaDoc comments change over time, especially when developers want to elaborate on usage tips in the JavaDoc annotations, but there is no information about how the quality of such documentation changed over the releases [6].

For the case of quality in the documentation, the tool JavaMiner studies this topic, among other variables [7]. However, like the previous study, the research behind JavaMiner only works with the JavaDoc comments. This research was continued by Steidl, Hummel, and Juergens, who developed a project using machine learning on projects in Java and C/C++ that compares different quality aspects through source code exploration combined with interviews with developers during the research [11]. The study behind the tool QUASOLEDO includes the documentation created by JavaDoc as well as the comments used in C++: block comments and inline comments. The QUASOLEDO research studied variables related to the ratio of documented code blocks and the quantity of the words used in the documentation. The results pointed out that only 12.1% of the changes made in the code were modifications of both the comment and the code content of a block, 2.1% of the commits changed only comments, and 67.4% changed only code; changes to documentation made up only 32% of the total changes on a project [8]. The study points out how sub-optimal this is for development. The study made by Steidl [11], covering all types of comments in five projects, used the Levenshtein distance, similar to the Jaccard similarity, as one aspect of quality. The research concluded that documentation had a low quality ratio: only 37% of all files studied presented header (declaration) comments, and between 18% and 49% of these were copyright comments.

1.3 Problem formulation

This study aims to research the cohesion between source code documentation and the code. The use of similarity ratios for all kinds of comments has not been studied before and will add information to the field of study. We will study the source code and documentation of a group of projects through a range of releases, together with the quality and cohesion of said documentation. We plan to use cohesive metrics of text similarity as a factor of the quality of the comments in consecutive releases, to find how much the documentation quality changes through time.

1.4 Motivation

This research contributes to showing the behavior of source code documentation for classes and methods from the perspective of string similarity. The results may help engineers, technical writers, and project managers to understand the behavior of software documentation and its evolution, and to better plan for software documentation maintenance according to the results.

1.5 Research Questions and Objectives

After reviewing the literature, we have not found related work that fully explores cohesion and similarity using term-based algorithms as a factor of quality. We will study that aspect of quality and its variation over the consecutive releases of a project. The following research questions were used to plan the research:

• RQ 1: What is the proportion of code blocks with and without documentation? We investigate whether the projects are documented in a large or small proportion.

• RQ 2: What is the proportion of new code blocks with and without documentation? How much source code is documented at the beginning of its life will show how developers prioritize documentation during implementation.

• RQ 3: Does the code blocks' documentation quality improve across the releases? We calculate the ratios selected as a factor of quality and study them over the time the releases were made, in order to see any change that may show a relation between documentation quality and time of release.

• RQ 4: Is there any relation between lines of code and quality of the documentation? The quality ratios will be studied against the lines of code per code block, in order to learn what may affect the quality of the documentation.

In order to answer these research questions, the objectives presented in Table 1.1 were formulated:

O1  Study the difference between documented and non-documented code blocks among different releases and in total numbers.
O2  Calculate the cohesion ratios, Jaccard and cosine, of all the code blocks for each release.
O3  Perform statistical analysis to compare two sets of cohesion ratios for methods and classes for a release and its consecutive release.
O4  Perform statistical analysis to compare cohesion ratios with the lines of code of methods and classes.

Table 1.1: Thesis Project Objectives

1.6 Scope/Limitation

The data used for this project was limited to open-source projects, to have free access to the source code. No specific requirements, such as organization or size of the project, were followed in the selection of the projects studied. The only requirement was to have at least 10 releases per project available, which was exceeded in all of the projects. The projects were also recently updated, the oldest release studied having been uploaded in 2017. The programming language studied was Java, one of the most widely used languages. The natural language was English, because it is the most common language in the technical field and because it shares keywords with Java. Those three factors were used to select the projects used in this research.

1.7 Target group

The results of this research will be useful to different roles in a development team, such as project managers, implementers, and researchers. For project managers and implementers, the findings may reveal whether the quantity and quality of documentation are relevant for their projects, or whether they should create guidelines to help the team create and maintain their code documentation. On the other hand, researchers will have data related to natural language and cohesion, which will help to continue studies on the evolution of source code documentation.

1.8 Outline

The rest of this report comprises the following sections:

Section 2 - Method: We approach different possible methods to resolve the research questions, elaborate on natural language processing, and explain the threats to validity and reliability we encountered.

Section 3 - Implementation: The implementation phase was done using a Python 3 application that extracted the required data. This section explains how this process was implemented.

Section 4 - Results: The data gathered during implementation is shown without further analysis. The objective is to display objective data that will be used in the analysis.

Section 5 - Analysis: This section gives answers to the research questions by using the results as a source.

Section 6 - Discussion: This section continues the discussion of the results, including the results from the previous research reviewed in the related work.

Section 7 - Conclusion: This section concludes the research and gives an introduction to what could be done to further continue with the topic of quality of documentation.

2 Method

Previous researchers have established a pattern for studying documentation in source code: extract the declaration identifiers and their comments, and then process the extracted data. Similar studies, such as Steidl [11], followed this pattern and also used similarity ratios between the comment and the content of a block. The similarity ratios we decided to use as a measure of coherence, i.e., how comment and code are related, were term-based similarity ratios. However, to be able to compare two words in the most accurate way, we decided to use natural language processing to eliminate words that are not related or useful to the research, as well as to reduce inflected words to their lemmas so that two words can be compared. In the next step, the processed data was used with two different term-based similarity algorithms. While previous work studied the similarity between characters, we decided to study how similar the whole terms and words are between comment and source code. These two algorithms were selected among multiple options. First, term-based algorithms were better fitted because code and comment that share some relationship are an indicator of a meaningful code block [11], and term-based algorithms fit that requirement. One single algorithm could answer these questions, but the difference between the Jaccard and Cosine similarity ratios concerning word repetitions made it clear that we could use both, covering more possibilities.

The planning for the method displayed in Figure 2.1 shows how the methodology was applied. After reviewing existing tools and packages, we decided to create a tool that would fit our particular requirements and calculations. The tool requirements were to read all the files of a project, extract comments and their code blocks, use a natural language process to parse the extracted strings, calculate both similarity ratios and the size in lines of code, and finally save the raw data in a CSV (Comma Separated Values) file for the subsequent data analysis. Because the detection of code blocks included detecting classes and methods, we included that division to extract more refined data. The tool was tested with test files covering different cases that could create false positives, such as conditional blocks, nested classes and methods, throw statements, or LaTeX expressions. The results would then be used for further study.

In order to answer our research questions, we needed to know the percentage of code that was documented and not documented, as well as how much of the newly added code was documented. The study has a particular interest in the changes in quality, not quality itself; for that reason, we used the variation ratio of the similarity ratios to express the changes over time clearly. The variation values were split into discrete groups using size percentiles, and a value of 1.0 was used for the first release.

[Figure 2.1: Flowchart of the method]

2.1 Natural Language Processing

Comments and some parts of the source code are written by developers in a natural language; for instance, identifiers and variables are mostly written in common English. The natural language selected for this research was English, segmented by spaces. To get two strings that can be compared, the strings have to be parsed so that only the most meaningful words remain. For that, we created a set of stop words from common English words, like prepositions or pronouns, that give little to no meaning to a string, so they were removed from the study. In the particular case of the computer language, keywords and typical words from the Java language were also removed, to extract only the relevant words from the string [20].
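As an illustration, such a filter could be assembled with the NLTK package used later in this work; the short JAVA_KEYWORDS list below is a made-up excerpt, while the full lists actually used are given in Appendix C:

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# Illustrative excerpt only; the complete stop word lists are in Appendix C.
JAVA_KEYWORDS = ['public', 'private', 'static', 'void', 'class', 'return', 'new']
STOP_WORDS = set(stopwords.words('english')) | set(JAVA_KEYWORDS)

def remove_stop_words(text):
    # Keep only the words that carry meaning for the similarity comparison.
    return ' '.join(w for w in text.lower().split() if w not in STOP_WORDS)

print(remove_stop_words('public void print the number of users'))
# -> 'print number users'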

2.2 Reliability and Validity

To have a good representation of the results, this study used a total of ten projects from open source organizations, listed in Appendix A. The projects vary in owner organization and total size. After the selection of the projects, the research uses a total of 10 consecutive releases for each project, which means that this research includes 100 repository snapshots. Despite that, the results may not be representative, since we are using only open-source projects. Private repositories may behave differently and may involve different variables that affect the maintenance of source code documentation.

While Oracle names the comments created for JavaDoc as the main source of documentation in source code, this study did not omit other types of comments. Developers come from different backgrounds and may use different code conventions. In the particular case of Java, this research accepted the code conventions of C++ and included inline and block comments written above class and method declarations as source code documentation. The main reason was that, as a descendant of C++, Java still admits C++ style comments in the language. However, the results may differ when using only Oracle JavaDoc comments.

To ensure reliability, this research provides information on the implementation and the possibility to access the raw data extracted from the projects, as well as the data resulting from further analysis, in Appendix A.

2.3 Ethical Considerations

This research did not require any special consideration related to personal information, because all the data gathered was taken from open source projects. We neither altered nor used the code for purposes other than education and research, which is allowed by the existing open source licenses. However, the results may help to change project planning processes and project code conventions in the future, which could affect the workload of the respective future projects.

3 Implementation

The particularities of the requirements for this research made it easier to create our own tool that would include all kinds of comments written above class and method declarations as source code documentation. Other packages researched would have difficulties reading non-JavaDoc comments, or including non-declarative comments as documentation. By doing our own implementation, we ensured that our documentation definition, which was the main requirement, was met. The application created for this research was developed using Python 3 with the packages numpy (https://numpy.org/) for analysis and matplotlib (https://matplotlib.org/) for plotting the results. The database used for this application is based on CSV (Comma Separated Values) files.

The application required a database that mapped the location of all the repositories studied. It then iterated over the projects' source code and ran the calculations required to get the information needed for the research, as seen in Figure 3.1. After the calculations, the results were written to multiple CSV files, one per release. The files included the basic identification for each block, the identification of its parent block (for classes at the top of the file tree this would be its own name), and the parent block location. The information about the parent block was used to avoid problems with name duplication when comparing two classes or methods that had the same name. When a block finished, that is, when the application encountered a closing bracket, the application saved the last part of the information: the lines of code of the closed block, the declaration comments, and the content of the block. The comments and contents then went through the natural language process of eliminating stop words, lemmatizing the strings and, in the case of identifiers, separating words written in several naming conventions. After this process, the two strings were ready to go through the similarity algorithms. The results were saved in the CSV file under the block's name, together with the Jaccard ratio and the Cosine ratio.

To ensure that the application worked properly, it was run on a test file that included the most controversial comments and structures that could cause parsing problems. For instance, conditional blocks could be mistaken for methods without modifiers or return types, so a group of keywords was included to differentiate a method from conditionals, throw statements, or nested lambda methods. After running the test file, the smallest project was run several times to verify that no keyword or illegal character was detected as a method declaration by the application. For instance, the example in Listing 3.1 should not be detected:

MyClass
    .DoSomething(); // This line should be avoided

Listing 3.1: Example of a chained call that should not be detected as a declaration


[Figure 3.1: The workflow for breaking lines in classes and methods]

3.1 Extraction

The process followed to create the resulting database was to create an algorithm that reads the Java files line by line and proceeds with the calculations. This algorithm used multiple string and regex operations to recognize whether a line was a block declaration, its type, or a comment, as seen in Figure 3.1. In the first instance, the application gathered all the Java file paths and iterated over them to get an array of strings with the contents of each Java file. For each string, the application decided whether it was a comment, a block declaration (class or method), or the contents of a block. The decision was made using a regex in the case of class declarations and comments, while the decision for method declarations required multiple string operations. Any line matching none of these checks was treated as the contents of a block.
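A minimal sketch of this per-line dispatch is shown below; the order and the two regexes are simplified stand-ins for the actual checks detailed in Sections 3.1.1 to 3.1.3, not the original implementation:

import re

# Simplified stand-ins for the real checks described in the next subsections.
COMMENT_RE = re.compile(r'^\s*(/\*|\*|//)')
CLASS_RE = re.compile(r'\b(class|interface|enum)\s+\w+')

def classify_line(line):
    # Decide what a line of Java source contributes to the current block.
    if COMMENT_RE.match(line):
        return 'comment'
    if CLASS_RE.search(line):
        return 'class declaration'
    if re.search(r'\w+\s*\([^)]*\)', line):
        return 'possible method declaration'
    return 'block content'

print(classify_line('// a comment'))       # comment
print(classify_line('class Foo {'))        # class declaration
print(classify_line('void bar(int x) {'))  # possible method declaration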

3.1.1 Extracting comments

Comments are delimited by the Java language as all those lines or blocks of lines that include the symbols /*, /**, *, //, or */. For instance, the code in Listing 3.2 shows the types of comments in Java:

/**
 * This is a Javadoc.
 * @returns void
 */

// This is an inline comment example.

/*
 * This is a comment block.
 */

/*
   This kind of comment is also valid for Java.
 */

      // Any number of blank spaces before the comment are also valid.

Listing 3.2: Comment example in Java

The extraction of comments required, in the first place, a positive match with a regex. The regex checked whether comment symbols appeared at the beginning of a string. The regex had to be flexible about the position and number of white spaces, because Java ignores the number of them between asterisks and slashes. The absence of comment symbols does not mean that the line is not a comment: it could be part of a comment block whose lines do not begin with an asterisk, as shown in the example above. For that reason, if the line failed the check made by the regex, the next step was to check whether a block of comments had been closed or was still open. To do so, the

algorithm checked whether the last comment saved was an open block, meaning the current line is included in the comment block, or whether, on the contrary, the last saved comment was an inline comment or included a closing comment-block symbol. The final regex used for detecting comments in strings is presented in Listing 3.3:

'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*'

Listing 3.3: Regex for comments
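As a quick illustration (outside the tool itself), the regex can be exercised on a few hypothetical sample lines with Python's re module:

import re

COMMENT_RE = re.compile(r'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*')

for line in ['/** Javadoc start', '   * body line', '// inline', 'int x = 0;']:
    print(bool(COMMENT_RE.match(line)), repr(line))
# Prints True for the three comment lines and False for the code line.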

3.1.2 Extracting classes

Considering that class declarations in Java have a formal syntax, the extraction of the data from them was similar to the comment extraction. A string has to match a regex built around the class syntax in order to be used for data extraction. In this particular case, the only requirement for a class to be declared is to include one of the class keywords ('class', 'enum' or 'interface') and an identifier. Additionally, modifiers, superclasses, and interfaces can be added to the class declaration, but they are not mandatory. The list of modifiers is a fixed group of keywords that can be used in a regex, because there is no possibility to create personalized modifiers. As a result, finding the identifier in a class declaration is straightforward: the identifier of a class is the word located after one of the class keywords. Anything declared after the identifier was consumed by a greedy operator in the regex, because it gives no more information about the class identifier, as demonstrated in Listing 3.4:

class ClassExample {
    // This is a valid class declaration
}

public final class ClassExample extends SuperClass implements IClass {
    // This is also a valid class declaration
}

Listing 3.4: Class example in Java

In addition, regex in Python can be used to extract parts of a string. In this case, we wanted to extract the identifier of the class from the string. For that reason, the next word after the class keywords was extracted by using a named-group tag in the Python regex. The tag (?P<id>...) marks a part of the pattern with a group name, in this case 'id', that can be referenced to extract its contents. The final regex used is shown in Listing 3.5:

'^((Annotation|public|protected|private|static|abstract|final|native
|synchronized|transient|volatile|strictfp)\s+)*(class|interface|enum)
\s+(?P<id>[a-zA-Z\_0-9]+)'

Listing 3.5: Regex for class declarations
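A quick demonstration of the named group (again outside the tool itself): given the regex as reconstructed above, the identifier can be pulled out through match.group('id'):

import re

CLASS_RE = re.compile(
    r'^((Annotation|public|protected|private|static|abstract|final|native'
    r'|synchronized|transient|volatile|strictfp)\s+)*'
    r'(class|interface|enum)\s+(?P<id>[a-zA-Z\_0-9]+)')

match = CLASS_RE.match('public final class ClassExample extends SuperClass {')
print(match.group('id'))  # -> ClassExample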

3.1.3 Extracting methods

In the case of a method, more precise string manipulation was required. The only requirement for a method to be accepted is an identifier, a pair of parentheses, and a body. Without modifiers and return types, the Java compiler uses the default values: public within the package and private outside of it for the modifier, and void for the return type. It is important to point out that the return type does not only include the Java data types, but also custom data types imported into the Java project. Moreover, the opening bracket of the method body is not required to be on the same line as the method declaration, and some coding conventions and developers place it below the method declaration, as in Listing 3.6:

class ClassExample {

    public static myDataType myMethod (myDataType arg1, int arg2)
    {
        // This is a valid method
    }

    myOtherMethod(){
        // This is also a valid public method that returns void
    }
}

Listing 3.6: Method example in Java

The syntax of methods is similar to the syntax of conditional and loop blocks. Without constraints before the identifier and the parameters, an 'if' statement can create false positives. One possibility was to check the parent class but, because of the existence of nested methods and classes, this was difficult to do. The solution found for this issue was to create a list of keywords for loops, conditional blocks, test asserts, and symbols that cannot be used in method declarations. For instance, it is not possible to have arithmetic operations in a method declaration, so symbols like addition and equals mark a string as not being a method declaration. A regex could have been a solution, but the flexibility of none or many modifiers, custom data types, and similar constructs made such a regex possible but slow to process. Some of the problematic pieces of code found followed a syntax similar to Listing 3.7:

class FalsePositives {

    public static myDataType myMethod (myDataType arg1, int arg2)
    {
        if (arg1) {
            // This line should not be accepted
        }
    }

    myOtherMethod(){
        assertThat(test)      // This line should not be accepted
        method.something()    // This line should not be accepted
    }

    hashCodeImpl(Object content, String mimeType, String language,
        URL url, URI uri, String name, String path, boolean internal,
        boolean interactive, boolean cached, boolean legacy) {

        // This method should be accepted
    }
}

Listing 3.7: Examples of false positives in Java

The solution that avoided the maximum number of false positives and, at the same time, avoided the use of regex, was to create a process that would split, trim, and extract the needed sub-string, as seen in Figure 3.2. The requirement was to get the identifier, parentheses, and parameters of the string, ignoring modifiers, return data types, or any other information to the right-hand side of the parameters. For that, the string was first processed in reverse, finding the closing parenthesis of the parameters. Any information between the closing parenthesis and the end of the string is ignored. After this trim of the last part of the string, the string is iterated from the beginning to find the first open parenthesis. The part from the first character to the first parenthesis is the section of the declaration that should include modifiers, data types, and the identifier. When creating an array of the words in the string with the method 'split', the last word of the array is the identifier. The identifier plus the contents of the parameters form the final parsed string that includes the information required.

[Figure 3.2: Trim of method strings: 'public Collection getUrlPrefixes (String a, String b) // comment' is trimmed and split into 'getUrlPrefixes (String a, String b)']
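A compact sketch of that split-and-trim process (assuming, as in Figure 3.2, that the whole parameter list sits on one line; this is illustrative code, not the original implementation):

def extract_method_signature(line):
    # Scan from the right for the parenthesis that closes the parameter
    # list and drop everything after it (trailing comments, '{', ...).
    end = line.rfind(')')
    if end == -1:
        return None
    head, sep, params = line[:end + 1].partition('(')
    words = head.split()
    if not sep or not words:
        return None
    # The last word before '(' is the identifier; modifiers and return
    # types to its left are ignored.
    return words[-1] + '(' + params

print(extract_method_signature(
    'public Collection getUrlPrefixes (String a, String b) // comment'))
# -> getUrlPrefixes(String a, String b)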

3.2 Cohesion calculation

Before the cohesion calculations, a normalization of the two strings to be studied was required. For that, the two strings went through a natural language process. After the normalization of the strings, the cohesion ratios were calculated.

3.2.1 Parsing and normalizing strings

When encountering a closing bracket, the application knew that a code block was closing and that, for that reason, the block was ready to have its comments and contents processed. The research planned two cohesion ratios to calculate: Jaccard similarity and Cosine similarity. Before calculating the ratios, it was necessary to process the comments and contents through a natural language process (NLP from now on) to avoid bloating the calculations with common words; however, even before processing a string, it was required to parse the method and class identifiers. Until this moment, the comments and contents of a block were stored as lists. For the remaining calculations, the lists were combined into one string for comments and one string for content. After that, the strings still required small adjustments before being used for calculations.

Naming a class or a method is not standardized, nor is it mandatory to follow any guideline. Developers have total freedom to name their code as they please; even so, some common naming conventions are used among developers. For this research, three naming conventions were used to divide multiple words from the method and class identifiers: camel case, Pascal case (also known as upper camel case), and underscores, as exemplified in Listing 3.8. All the identifiers that followed those three conventions were sliced into multiple words. In doing so, the number of words in the content increased.

class Naming {
    public void camelCase(){
        // Became: camel case
    }

    public void DromedaryCase(){
        // Became: dromedary case
    }

    public void under_score(){
        // Became: under score
    }
}

Listing 3.8: Naming conventions example
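A sketch of this splitting step, covering the three conventions from Listing 3.8 (illustrative, not the original code):

import re

def split_identifier(identifier):
    # Replace underscores, then break at each lower/upper-case boundary.
    spaced = identifier.replace('_', ' ')
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', spaced)
    return spaced.lower()

print(split_identifier('camelCase'))      # -> camel case
print(split_identifier('DromedaryCase'))  # -> dromedary case
print(split_identifier('under_score'))    # -> under score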

For this research, the package used to process natural language was NLTK version 3 (https://www.nltk.org/). With NLTK, the two strings were parsed to remove words that are common in the English language, such as pronouns and prepositions, as well as numbers, and to lemmatize the remaining words. An example of lemmatization is changing 'numbers' to 'number'. In this way, the two strings were normalized as much as possible to include only the most important and relevant words in the same form. When this step was complete for both strings, they were ready for the calculation of their cohesion ratios.
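For example, NLTK's WordNet lemmatizer performs exactly this kind of reduction (shown here as an illustration; the thesis does not specify which NLTK lemmatizer was used):

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('numbers'))  # -> number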

3.2.2 Jaccard algorithm

The Jaccard similarity works with sets of words, with no repetitions allowed, so the Python algorithm was defined as in Listing 3.9:

def calc_Jaccard(self):
    comment_set = set(self.comment.split())
    content_set = set(self.content.split())

    intersection = comment_set.intersection(content_set)
    denominator = (len(comment_set) + len(content_set) - len(intersection))

    if not denominator:
        self.Jaccard = 0.0
    else:
        self.Jaccard = float(len(intersection)) / denominator

Listing 3.9: Python Jaccard similarity algorithm

3.2.3 Cosine algorithm

The Cosine similarity algorithm was also implemented in Python by using the mathematical formula described in Section 1.1.2, resulting in the algorithm in Listing 3.10:

import math  # needed for the square roots in the denominator

def calc_cosine(self):
    comment_vector = self.text_to_vector(self.comment)
    content_vector = self.text_to_vector(self.content)

    intersection = set(comment_vector.keys()) & set(content_vector.keys())

    numerator = sum([comment_vector[x] * content_vector[x]
                     for x in intersection])

    sum1 = sum([comment_vector[x] ** 2 for x in comment_vector.keys()])
    sum2 = sum([content_vector[x] ** 2 for x in content_vector.keys()])

    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        self.cosine = 0.0
    else:
        self.cosine = float(numerator) / denominator

Listing 3.10: Python cosine similarity algorithm
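The helper text_to_vector is referenced but not shown in the listings. A minimal implementation consistent with the formula in Section 1.1.2 could simply count word frequencies; the version below is an assumption, not the original code:

from collections import Counter

def text_to_vector(text):
    # Assumed helper: map a normalized string to a word-frequency vector.
    return Counter(text.split())

print(text_to_vector('parse file file'))  # Counter({'file': 2, 'parse': 1})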

3.3 Results of the extraction

After the cohesion calculations, the information of each code block was saved in the database with the following fields: project name, release name, release date, identifier, type (class or method), lines of code, owner (parent class), Jaccard ratio, Cosine ratio, comments (filtered string), and content (filtered string). The data obtained was used to perform statistical studies using Numpy and spreadsheets. The analysis of the results was documented in three CSV files, in three steps: one for the number of documented blocks, one for average and percentile calculations, and a final variation ratio document.

Each project had 10 releases, so the output of raw data was 10 CSV files with classes, methods, and similarity ratios. By comparing the names and parents of each block, we calculated, for each release, how many blocks were documented and how many were not, as well as how many were newly added and how many of those were documented at their creation. A second step calculated the percentiles of the lines of code of the blocks by type (class or method). We used the Numpy library for Python to make the calculations. The percentiles documented in the result CSV file were percentiles 0, 5, 25, 50, 75, 95, and 100. Finally, to know how the similarity ratios evolved through the releases, we calculated the average of the similarity ratios of each release. We used four discrete groups for the size of the block by using four percentiles: 25, 50, 75, and 95. In total, we got the average of the Jaccard and cosine ratios, for each type of block, by four percentiles, for 10 releases.

To make the data more readable, and because the similarity ratios by themselves were not interesting for the research, but rather their variation over time, one more calculation step was taken. The final CSV file was modified to calculate the variation ratio values for each release using the formula v_n / v_{n-1}, where v_n indicates the similarity ratio value for release n (the value 1.0 was used as the result for the initial step). This way, the resulting values over 1.0 indicated improvement and the values under 1.0 indicated a decrease in quality.
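These two post-processing steps can be sketched as follows (the numbers are made up for illustration):

import numpy as np

# Average similarity ratio per release (made-up values).
jaccard_by_release = [0.30, 0.31, 0.29, 0.28]

# Variation ratio v_n / v_(n-1), with 1.0 used for the first release.
variation = [1.0] + [jaccard_by_release[n] / jaccard_by_release[n - 1]
                     for n in range(1, len(jaccard_by_release))]
print([round(v, 3) for v in variation])  # [1.0, 1.033, 0.935, 0.966]

# Discrete size groups from lines-of-code percentiles.
lines_of_code = np.array([3, 8, 15, 40, 120, 400])
print(np.percentile(lines_of_code, [25, 50, 75, 95]))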

4 Results

The raw data used in this research was uploaded to the repository referenced in Appendix A. It contains the information of all code blocks, their length, and their cohesion ratios. Due to the size of the database, it is not directly included in this report. This study used a group of ten open-source projects and ten consecutive releases per project. In total, 100 releases were studied. The projects are diverse in length and ownership to provide a better representation of the data. More information about the projects used and all the data, raw and processed, can be found in Appendix A.

4.1 RQ 1: What is the proportion of code blocks with and without documentation?

To answer this question, we needed to find all the code blocks that contain any documentation or comment. The blocks were divided by type of block: class or method. The results calculated over the projects defined before are shown in Table 4.1. The raw total numbers of documented code blocks can also be found in the repository available in Appendix A.

Project         Total documented  Classes  Methods  Total blocks
Maven                     19.54%    9.75%    9.78%          6871
Jmeter                    34.57%    6.59%   27.98%         13940
Che                       21.78%    6.95%   14.78%         14608
Tomcat                    22.39%    5.87%   16.53%         28882
Springboot                13.86%    8.15%    5.71%         35666
CXF                        8.56%    2.44%    6.12%         58581
Guava                     11.41%    2.59%    8.82%         61671
Graal                      9.24%    2.96%    6.27%         90915
Elasticsearch             14.18%    3.73%   10.45%        128208
Netbeans                  21.45%    7.06%   14.39%        451083
Average                   17.70%    5.61%   12.09%

Table 4.1: Percentage of code documented by type of block and total

4.2 RQ 2: What is the proportion of new code blocks with and without documentation?

For this research question, we calculated which code blocks were added to a repository in release n when compared with release n−1. That sub-set of code blocks was used to calculate the percentage of new code blocks that were added with documentation, as seen in Figure 4.1.
[Figure 4.1: Percentage of new documented blocks (share of newly added classes and methods that arrive with documentation, per project)]
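The underlying comparison can be sketched as a set difference between consecutive releases; the keys and flags below are made up for illustration, but mirror the identifier and parent-block information stored in the tool's CSV files:

# A block counts as new in release n if its (identifier, parent) key was
# absent from release n-1; the values are "documented at creation" flags.
previous_release = {('Foo', 'Foo.java'): True, ('bar', 'Foo'): False}
current_release = {('Foo', 'Foo.java'): True, ('bar', 'Foo'): True,
                   ('baz', 'Foo'): True}

new_blocks = set(current_release) - set(previous_release)
documented = sum(current_release[key] for key in new_blocks)
print(100.0 * documented / len(new_blocks))  # % of new blocks documented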

4.3 RQ 3: Does the code blocks' documentation quality improve across the releases?

A statistical study of the raw results was done to get the outcome needed to answer the third research question: does the code blocks' documentation quality improve across the releases? For each project, we averaged the Jaccard and Cosine ratios using the discrete size groups given by the percentiles q25, q50, q75, and q95, excluding the lowest and highest 5%. The aim of this research is to find how the quality of the documentation evolves over time; for this reason, we used the variation ratio of the results instead of the similarity values, to improve readability and clarity. The similarity ratios were translated to their variation ratio using the formula v_n / v_{n-1}, where v_n indicates the similarity ratio value for release n. Because the first release has no previous data, we used a value of 1.0 as the first value. The results were plotted in four diagrams for better clarity, dividing the results between type of block and type of similarity ratio, as seen in Figure 4.2.
[Figure 4.2: Evolution of similarity rates for project Maven (Jaccard and cosine variation across releases, for classes and methods, by size percentile q25, q50, q75, q95)]

The results of the evolution and variance can be seen in Appendix B; they are used to continue the study in the next research question through Table 4.2 and Table 4.3.

4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?

The results of the previous research question were refined to get the answer. The result is shown in Table 4.2 and Table 4.3. Variation ratio results show values over 1.0 in those cases where the similarity ratios improved over time, and values under 1.0 in those cases
where the similarity ratios decreased over time. The ten projects studied resulted in an average quality ratio that is presented in Figure 4.3, Figure 4.4, and Table 4.4.
Block type            Classes                            Methods
Percentile         25      50      75      95        25      50      75      95
Maven          1.0589  1.0212  1.0323  1.0242    0.9967  1.0073  1.0323   0.997
Jmeter            1.0   0.994  1.0001  1.0002    1.0001     1.0  1.0001  0.9997
Che            1.0003  1.0023  0.9999  0.9999    1.0015  1.0007  0.9999  0.9991
Tomcat         0.9981  0.9997  0.9997  0.9991    0.9999  1.9991  0.9997  0.9988
Springboot     1.0025  0.9932  1.0001  0.9966     0.999  0.9888  1.0001   0.993
Cxf            0.9957  0.9966  1.0033  0.9975    0.9969  0.9987  1.0033  0.9975
Guava          0.9913  0.9983  0.9993  0.9983    0.9965  1.0026  0.9993  0.9953
Graal          0.9952  0.9927  0.9998  1.0031    1.0021     1.0  0.9998   0.999
Elasticsearch  1.0002  0.9974  0.9959  0.9984    0.9942  0.9942  0.9959  0.9957
Netbeans        0.997  0.9996  1.0001     1.0    0.9992  0.9994  1.0001  0.9998

Table 4.2: Jaccard variation ratios by percentile

[Figure 4.3: Average variation for Jaccard ratios (classes vs. methods, by percentile 25, 50, 75, 95)]
Block type            Classes                            Methods
Percentile         25      50      75      95        25      50      75      95
Maven          1.0461  1.0142  1.0315  1.0402    0.9968  1.0037  0.9948   0.997
Jmeter            1.0  0.9994     1.0     1.0    1.0001     1.0  1.0003  0.9997
Che            0.9993   1.001     1.0  0.9984    1.0015  1.0006  1.0015  0.9989
Tomcat         0.9983  0.9998  0.9994  0.9982       1.0  1.0001  0.9994  0.9993
Springboot     1.0018  0.9924  1.0006   0.997    0.9978  0.9889  0.9989  0.9989
Cxf            0.9957  0.9985  1.0017  0.9976     0.997  0.9991  1.0008  0.9976
Guava          0.9915  0.9992  1.0002  0.9984    0.9969  1.0017  0.9978  0.9951
Graal          0.9972  0.9914  1.0002  1.0036    1.0037     1.0  0.9933  0.9976
Elasticsearch     1.0   0.997  0.9961  0.9979    0.9942  0.9937  0.9947  0.9955
Netbeans       0.9996  0.9995  0.9999     1.0    0.9992  0.9994  0.9995  0.9997

Table 4.3: Cosine variation ratios for each project

[Figure 4.4: Average variation for Cosine ratios (classes vs. methods, by percentile 25, 50, 75, 95)]
Project        Cosine variation  Jaccard variation  Project size (LOC)
Maven                    1.0091             1.0056               7,167
Jmeter                   1.0000             1.0000              13,943
Che                      1.0002             1.0004              19,951
Tomcat                   0.9994             0.9995              29,269
Springboot               0.9965             0.9965              41,318
Cxf                      0.9984             0.9986              61,264
Guava                    0.9979             0.9978              63,893
Graal                    0.9993             0.9985             109,920
Elasticsearch            0.9955             0.9952             146,415
Netbeans                 0.9996             0.9995             452,863
Average                  0.9996             0.9992

Table 4.4: Average variation data for each project compared with the project size

5 Analysis

According to the results gathered from the 100 releases studied, we do not see a special pattern or distribution that affects the quality of the documentation over time. The variation of the quality of the documentation remains close to 1.0, meaning that there is no change in our similarity ratios.

RQ 1: What is the proportion of code blocks with and without documentation? The data extracted was divided to count how many of the code blocks had documentation, by code type: class or method. The results were averaged for each release and for all releases to have a single data point for classes, methods, and in total, creating a percentage of the code that was documented, as displayed in Table 4.1. The results show a tendency to document the methods more than the classes. It could show a pattern where developers tend to document functionality over objects. Only an average of 5.61% of the classes have been documented, against 12.09% of the methods; the total number of blocks documented was on average 17.70%. This suggests a low intention of documenting the projects.

RQ 2: What is the proportion of new code blocks with and without documentation? The research covered 10 consecutive releases, so we had data over time to work with. For each release, we looked for code blocks that were not present in the previous release and observed whether those blocks were added to the project with documentation. The general tendency of the results shows that classes tend to be more documented than methods at the beginning of their life, as presented in Figure 4.1. However, as seen in the previous question, during their lifetime, methods outgrow the classes and end up being the majority of the documented blocks. It could be understood as the tendency of developers to document their classes when they first create them; over time, they keep adding comments mainly to methods. This shows how documentation does not happen in one step: classes and methods are not documented at the same moment, and developers add documentation over time.

RQ 3: Does the code blocks' documentation quality improve across the releases? The general quality of the documentation decreases over time. On average, the variation ratios of the cosine and Jaccard ratios decrease: the comments get less similar both as sets of words and counting repetitions. However, the ratios for this result are close to 1.0, so even when there is a deterioration in the quality, it is small, as can be seen in Appendix B. The resulting average for all blocks is 0.9996 for the Cosine ratio and 0.9992 for the Jaccard ratio. This shows a decrease in quality, but one so small that it could be assumed that there is no variation.

RQ 4: Is there any relation between lines of code and quality of the documentation? There is no relationship between the size of the block and the cohesion ratios. The Jaccard ratio for classes is over 1.0 for percentiles 25, 75, and 95 (note that this ratio works with sets of words, without repetitions). Percentile 50 is close to, but not yet at, 1.0. However, the range of variation goes between 1.0039 and 0.9995, i.e., less than 5% of change. The range is so close to 1.0 that we could assume that no variation is related to size. For the Jaccard ratio for methods, it is relevant that percentile 50 does show a visible increment, especially since all the other ratios remain close to 1.0. Percentile 50 has a value of 1.09; it is close to no variation at all, but the other values move around the range of 1.0991–0.9975. We could assume that the small size of the range does not give enough variation to be considered as such, but it is necessary to point out that, even though the other ratios are similar between them, percentile 50 shows a different behavior. In the case of the Cosine ratios, classes show an improvement in cohesion for 3 of 4 percentiles; again, percentile 50, in this case for classes, decreases. Again, the extent of the results is extremely close to 1.0, within the range 0.9987–1.0031, i.e., no relevant changes. In the case of methods, all percentiles decrease in quality. There is a general deterioration in quality; even if it is small, it may be showing up in the consequent releases as the projects grow in complexity and age.

6 Discussion

The literature showed concern about the quality of source code documentation. There is a paradox between developers' complaints about how poorly documentation is maintained [5] while they themselves are responsible for that maintenance. Multiple studies have found that documentation only changes significantly when big changes are made in a project [6], and a generally low performance of the existing documentation for some aspects of quality [8, 11]. This research confirmed, by using a bigger set of data than previous research, that the quality of the documentation does not improve over time. Whatever the cohesion ratio of a project is at the beginning of the study, it does not show improvement but a small decrease in quality over time. A low quantity of documentation was also shown in the work by Steidl [11], where the five projects studied had between 5% and 20% of the classes documented and between 28% and 49% of the methods documented. The results presented in Table 4.1 also show a low tendency to document in general, but especially in regard to classes.

We also confirmed that developers do not implement with documentation but rather implement first and add documentation, especially on methods, in later steps. As seen in Figure 4.1, classes are usually the most commented block when they are newly added to a project, but Table 4.1 shows how this changes in favor of methods. That shows how developers first document classes and add method documentation in later releases. However, limiting the sample to open source projects may lead to unrepresentative results. It is possible to find different results if this research is continued with private projects.

The results gathered by the research on source code documentation provide objective data for a problem that was already known: developers do not spend enough time on documentation, which makes future development and maintenance more difficult. This problem resembles the situation that led to the creation of the Unified Modeling Language, which gave developers a guideline to transmit information. In the same way, many other software artifacts have guidelines and international standards to work with. It would be interesting to study whether the area of computer science has reached the point of needing a standard on source code documentation to try to improve developers' productivity.

7 Conclusion

This research aimed to understand how source code documentation evolves over time. To that end, we formulated four research questions, which led to four objectives used to answer them. The data set comprised 100 releases from 10 open-source projects.

The first and second research questions asked what proportion of code blocks, and of newly added code blocks, have documentation. Accordingly, the aim of Objective 1 was to study the difference between documented and undocumented code blocks across releases and in total numbers. The results showed a higher number of documented methods than documented classes, and low documentation coverage overall, with an average of 17.70% of code blocks documented. We also confirmed that documenting source code happens in two steps: classes are documented first and methods later.

The third research question asked whether the documentation quality of code blocks improves across releases. For this we planned two objectives. Objective 2 led us to calculate the cohesion ratios, Jaccard and cosine, of all code blocks for each release, and the aim of Objective 3 was to perform statistical analysis on those ratios. The results pointed out that there is no improvement, but rather a slight decrease, in the quality of the documentation.

The last research question asked whether there is any relation between lines of code and documentation quality, which was addressed with the last objective. Objective 4 required us to perform a statistical analysis comparing cohesion ratios with the lines of code of methods and classes. The results showed no relationship between the size of a block and its cohesion ratios.

The research uses a large data set, but all the projects are open source, which limits the results to the particularities of our sample. More extensive work could study private repositories, where other variables, for example project deadlines, may affect the maintenance of the documentation.

7.1 Future work

During this research, and after studying the results obtained, three particular projects stood out. Their evolution data, presented in Figure 4.2 as well as Appendix B (Figure B.1 and Figure B.2), differs from the rest of the data gathered. The projects Maven, JMeter, and Che displayed an increase in quality, and the variable they have in common is their project size, as Table 4.4 shows; all the subsequent, larger projects decrease in quality over their releases. This suggests that an improvement in documentation quality is possible when the total number of code blocks does not exceed the size of the project Tomcat. It would be interesting to extend the research on documentation quality by using the similarity ratios and the project size as variables, to check whether this behavior also holds for other projects.

Although no obvious changes have been observed in this research, other external factors may affect the quality of the documentation; for instance, delivery deadlines may limit the time resources available for improving documentation. Studies suggest that even outdated documentation has value [2]. Further research could therefore compare voluntary and mandatory documentation practices, using controlled experiments to gather evidence of possible differences in quality.

References

[1] I. Sommerville, “Software documentation,” in Software Engineering, vol. 2: The Supporting Processes, R. Thayer and M. Christensen, Eds. Wiley-IEEE, 2001, pp. 143–154. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.8853

[2] T. C. Lethbridge, J. Singer, and A. Forward, “How software engineers use documentation: The state of the practice,” IEEE Software, vol. 20, no. 6, pp. 35–39, 2003. [Online]. Available: https://doi.org/10.1109/MS.2003.1241364

[3] L. Moreno, A. Marcus, L. Pollock, and K. Vijay-Shanker, “JSummarizer: An automatic generator of natural language summaries for Java classes,” in Proceedings of the 21st International Conference on Program Comprehension, ser. ICPC ’13. IEEE, 2013, pp. 230–232. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613855

[4] K. D. Welker, P. W. Oman, and G. G. Atkinson, “Development and application of an automated source code maintainability index,” Journal of Software Maintenance: Research and Practice, vol. 9, no. 3, pp. 127–159, 1997. [Online]. Available: https://doi.org/10.1002/(SICI)1096-908X(199705)9:3<127::AID-SMR149>3.0.CO;2-S

[5] I. Sommerville, Software Engineering, ser. International Computer Science Series. Pearson, 2011. [Online]. Available: https://books.google.se/books?id=l0egcQAACAAJ

[6] L. Shi, H. Zhong, T. Xie, and M. Li, “An empirical study on evolution of API documentation,” in Fundamental Approaches to Software Engineering, D. Giannakopoulou and F. Orejas, Eds. Springer Berlin Heidelberg, 2011, pp. 416–431. [Online]. Available: https://doi.org/10.1007/978-3-642-19811-3_29

[7] N. Khamis, J. Rilling, and R. Witte, “Assessing the quality factors found in in-line documentation written in natural language: The JavadocMiner,” Data & Knowledge Engineering, vol. 87, pp. 19–40, 2013. [Online]. Available: https://doi.org/10.1016/j.datak.2013.02.001

[8] D. Schreck, V. Dallmeier, and T. Zimmermann, “How documentation evolves over time,” in Proceedings of the Ninth International Workshop on Principles of Software Evolution (In Conjunction with the 6th ESEC/FSE Joint Meeting), ser. IWPSE ’07. ACM, 2007, pp. 4–10. [Online]. Available: https://doi.org/10.1145/1294948.1294952

[9] American Society for Quality. (2020, Feb. 13) Quality glossary. [Online]. Available: https://asq.org/quality-resources/quality-glossary/q

[10] A. Wingkvist, M. Ericsson, R. Lincke, and W. Löwe, “A metrics-based approach to technical documentation quality,” in Proceedings of the 2010 Seventh International Conference on the Quality of Information and Communications Technology, ser. QUATIC ’10. IEEE, 2010, pp. 476–481. [Online]. Available: https://doi.org/10.1109/QUATIC.2010.88

[11] D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in Proceedings of the 2013 21st International Conference on Program Comprehension, ser. ICPC ’13. IEEE, 2013, pp. 83–92. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613836

[12] P. W. McBurney and C. McMillan, “An empirical study of the textual similarity between source code and source code summaries,” Empirical Software Engineering, vol. 21, pp. 17–42, 2016. [Online]. Available: https://doi.org/10.1007/s10664-014-9344-6

[13] W. H. Gomaa and A. Fahmy, “A survey of text similarity approaches,” International Journal of Computer Applications, vol. 68, pp. 13–18, 2013. [Online]. Available: https://doi.org/10.5120/11638-7118

[14] P. Jaccard, “Étude comparative de la distribution florale dans une portion des Alpes et des Jura,” in Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, pp. 547–579. [Online]. Available: https://ci.nii.ac.jp/naid/10019961020/en/

[15] A. Singhal, “Modern information retrieval: A brief overview,” IEEE Data Engineering Bulletin, vol. 24, Jan. 2001. [Online]. Available: http://sites.computer.org/debull/a01dec/a01dec-cd.pdf#page=37

[16] Oracle. (2020, Feb. 13) Java syntax. [Online]. Available: https://docs.oracle.com/javase/specs/jls/se7/html/jls-18.html

[17] Oracle. (2020, Feb. 13) Java code conventions. [Online]. Available: https://www.oracle.com/java/technologies/javase/codeconventions-comments.html

[18] J. Raskin, “Comments are more important than code,” Queue, vol. 3, no. 2, pp. 64–65, Mar. 2005. [Online]. Available: https://doi.org/10.1145/1053331.1053354

[19] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus, “On the use of automated text summarization techniques for summarizing source code,” in Proceedings of the 2010 17th Working Conference on Reverse Engineering. IEEE, 2010, pp. 35–44. [Online]. Available: https://doi.org/10.1109/WCRE.2010.13

[20] Oracle. (2020, Feb. 13) Java keywords. [Online]. Available: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/_keywords.html

A Appendix — Selection of projects

The projects used in this study were selected from open source repositories: five projects from the Apache organization and five from diverse sources. The projects are ordered by size, from the smallest (Apache Maven) to the largest (Apache Netbeans). All the raw data extracted from these projects can be accessed in the following repository: https://gitlab.com/HelenaTevar/documentation-evolution

Project: Apache Maven
Repository: https://github.com/apache/Maven
Description: Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
Releases: 3.5.0-beta-1 (20 03 2017), 3.5.0 (03 04 2017), 3.5.1 (10 09 2017), 3.5.2 (18 10 2017), 3.5.3 (14 02 2018), 3.5.4 (17 06 2018), 3.6.0 (24 10 2018), 3.6.1 (04 04 2019), 3.6.2 (27 08 2019), 3.6.3 (19 11 2019)

Project: Apache JMeter
Repository: https://github.com/apache/jmeter
Description: Apache JMeter may be used to test performance both on static and dynamic resources and Web dynamic applications. It can be used to simulate a heavy load on a server, group of servers, network or object to test its strength or to analyze overall performance under different load types.
Releases: 5.2-rc1 (07 10 2019), 5.2-rc2 (09 10 2019), 5.2-rc3 (15 10 2019), 5.2-rc4 (18 10 2019), 5.2-rc5 (29 10 2019), rel-v5.2 (03 11 2019), 5.2.1-rc1 (12 11 2019), 5.2.1-rc4 (16 11 2019), 5.2.1-rc5 (20 11 2019), rel-v5.2.1 (24 11 2019)

Project: Che
Repository: https://github.com/eclipse/che
Description: Next-generation container development platform, developer workspace server and cloud IDE. Che is Kubernetes-native and places everything the developer needs into containers in Kube pods, including dependencies, embedded containerized runtimes, a web IDE, and project code.
Releases: 7.9.2 (21 03 2020), 7.9.1 (06 03 2020), 7.9.0 (24 02 2020), 7.8.0 (30 01 2020), 7.7.1b (20 01 2020), 7.7.1 (17 01 2020), 7.7.0 (10 01 2020), 7.6.0 (19 12 2019), 7.5.1 (03 12 2019), 7.5.0 (28 11 2019)

Project: Apache Tomcat
Repository: https://github.com/apache/tomcat
Description: The Apache Tomcat® software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies. The Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket specifications are developed under the Java Community Process.
Releases: 9.0.22 (04 07 2019), 9.0.23 (14 08 2019), 9.0.24 (14 08 2019), 9.0.25 (16 09 2019), 9.0.26 (16 09 2019), 9.0.27 (07 10 2019), 9.0.28 (14 11 2019), 9.0.29 (16 11 2019), 9.0.30 (07 12 2019), 9.0.31 (05 02 2020)

Project: Springboot (Spring)
Repository: https://github.com/spring-projects/spring-boot
Description: Spring Boot makes it easy to create Spring-powered, production-grade applications and services with absolute minimum fuss. It takes an opinionated view of the Spring platform so that new and existing users can quickly get to the bits they need.
Releases: 2.1.6.RELEASE (19 06 2019), 2.1.7.RELEASE (06 08 2019), 2.1.8.RELEASE (05 09 2019), 2.1.9.RELEASE (02 10 2019), 2.2.0.RELEASE (16 10 2019), 2.2.1.RELEASE (02 11 2019), 2.2.2.RELEASE (06 12 2019), 2.2.3.RELEASE (16 01 2020), 2.2.4.RELEASE (20 01 2020), 2.2.5.RELEASE (27 02 2020)

Project: Apache CXF
Repository: https://github.com/apache/cxf
Description: Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS or JBI.
Releases: 3.2.5 (18 06 2018), 3.2.6 (08 08 2018), 3.2.7 (24 10 2018), 3.2.8 (24 01 2019), 3.3.0 (24 01 2019), 3.3.1 (28 02 2019), 3.3.2 (10 05 2019), 3.3.3 (08 08 2019), 3.3.4 (21 10 2019), 3.3.5 (10 01 2020)

Project: Google Guava
Repository: https://github.com/google/guava
Description: Guava is a set of core Java libraries from Google that includes new collection types (such as multimap and multiset), immutable collections, a graph library, and utilities for concurrency, I/O, hashing, caching, primitives, strings, and more! It is widely used on most Java projects within Google, and widely used by many other companies as well.
Releases: 24.1 (14 03 2018), 25.0 (26 04 2018), 25.1 (23 05 2018), 26.0 (01 08 2018), 27.0 (18 10 2018), 27.0.1 (19 11 2018), 27.1 (08 03 2019), 28.0 (12 06 2019), 28.1 (28 08 2019), 28.2 (27 12 2019)

Project: Oracle Graal
Repository: https://github.com/oracle/graal
Description: GraalVM is a universal virtual machine for running applications written in JavaScript, Python, Ruby, R, JVM-based languages like Java, Scala, Clojure, Kotlin, and LLVM-based languages such as C and C++.
Releases: 19.0.0 (09 05 2019), 19.0.2 (14 06 2019), 19.1.0 (27 06 2019), 19.1.1 (13 07 2019), 19.2.0 (19 08 2019), 19.2.1 (12 09 2019), 19.3.0 (15 11 2019), 19.3.0.2 (20 12 2019), 19.3.1 (14 01 2020), 20.0.0 (14 02 2020)

Project: Elastic ElasticSearch
Repository: https://github.com/elastic/elasticsearch
Description: Elasticsearch is a distributed RESTful search engine built for the cloud.
Releases: 7.3.0 (31 07 2019), 7.3.1 (22 08 2019), 7.3.2 (12 09 2019), 7.4.0 (01 10 2019), 7.4.1 (23 10 2019), 7.4.2 (31 10 2019), 7.5.0 (02 12 2019), 7.5.1 (18 12 2019), 7.5.2 (21 01 2020), 7.6.0 (11 02 2020)

Project: Apache Netbeans
Repository: https://github.com/apache/netbeans
Description: Apache NetBeans is an open source development environment, tooling platform, and application framework.
Releases: 11.1 (20 07 2019), 11.2-beta1 (25 09 2019), 11.2-beta2 (07 10 2019), 11.2-beta3 (17 10 2019), 11.2-vc1 (20 10 2019), 11.2 (25 10 2019), 11.2-u1 (01 12 2019), 11.3 (24 02 2020), 12.0-beta1 (10 05 2020), 12.0-beta2 (24 05 2020)

B Appendix — Evolution of quality

[Figure B.1: Evolution of similarity rates for project JMeter. Four panels: Jaccard variation for classes and for methods, and cosine variation for classes and for methods. Each panel plots the variation of percentiles q25, q50, q75, and q95 across releases 0–9.]

[Figure B.2: Quality evolution for project Che. Same four-panel layout as Figure B.1.]

[Figure B.3: Quality evolution for project Tomcat. Same four-panel layout as Figure B.1.]

[Figure B.4: Quality evolution for project Springboot. Same four-panel layout as Figure B.1.]

[Figure B.5: Quality evolution for project CXF. Same four-panel layout as Figure B.1.]

[Figure B.6: Quality evolution for project Guava. Same four-panel layout as Figure B.1.]

[Figure B.7: Quality evolution for project Graal. Same four-panel layout as Figure B.1.]

[Figure B.8: Quality evolution for project ElasticSearch. Same four-panel layout as Figure B.1.]

[Figure B.9: Quality evolution for project NetBeans. Same four-panel layout as Figure B.1.]

C Appendix — Lists of stop words

For this research, we used a natural language processing step that skips words that add little or no meaning to the text under study. The words skipped are listed below; a short sketch of how the lists can be combined follows the last list.

C.1 NLTK stop words

’i’, ’me’, ’my’, ’myself’, ’we’, ’our’, ’ours’, ’ourselves’, ’you’, "you’re", "you’ve", "you’ll", "you’d", ’your’, ’yours’, ’yourself’, ’yourselves’, ’he’, ’him’, ’his’, ’himself’, ’she’, "she’s", ’her’, ’hers’, ’herself’, ’it’, "it’s", ’its’, ’itself’, ’they’, ’them’, ’their’, ’theirs’, ’themselves’, ’what’, ’which’, ’who’, ’whom’, ’this’, ’that’, "that’ll", ’these’, ’those’, ’am’, ’is’, ’are’, ’was’, ’were’, ’be’, ’been’, ’being’, ’have’, ’has’, ’had’, ’having’, ’do’, ’does’, ’did’, ’doing’, ’a’, ’an’, ’the’, ’and’, ’but’, ’if’, ’or’, ’because’, ’as’, ’until’, ’while’, ’of’, ’at’, ’by’, ’for’, ’with’, ’about’, ’against’, ’between’, ’into’, ’through’, ’during’, ’before’, ’after’, ’above’, ’below’, ’to’, ’from’, ’up’, ’down’, ’in’, ’out’, ’on’, ’off’, ’over’, ’under’, ’again’, ’further’, ’then’, ’once’, ’here’, ’there’, ’when’, ’where’, ’why’, ’how’, ’all’, ’any’, ’both’, ’each’, ’few’, ’more’, ’most’, ’other’, ’some’, ’such’, ’no’, ’nor’, ’not’, ’only’, ’own’, ’same’, ’so’, ’than’, ’too’, ’very’, ’s’, ’t’, ’can’, ’will’, ’just’, ’don’, "don’t", ’should’, "should’ve", ’now’, ’d’, ’ll’, ’m’, ’o’, ’re’, ’ve’, ’y’, ’ain’, ’aren’, "aren’t", ’couldn’, "couldn’t", ’didn’, "didn’t", ’doesn’, "doesn’t", ’hadn’, "hadn’t", ’hasn’, "hasn’t", ’haven’, "haven’t", ’isn’, "isn’t", ’ma’, ’mightn’, "mightn’t", ’mustn’, "mustn’t", ’needn’, "needn’t", ’shan’, "shan’t", ’shouldn’, "shouldn’t", ’wasn’, "wasn’t", ’weren’, "weren’t", ’won’, "won’t", ’wouldn’, "wouldn’t".

C.2 Extra stop words

’aboard’, ’according’, ’across’, ’along’, ’alongside’, ’amid’, ’anti’, ’around’, ’aside’, ’atop’, ’behind’, ’beneath’, ’beside’, ’besides’, ’beyond’, ’concerning’, ’considering’, ’despite’, ’excepting’, ’excluding’, ’following’, ’inside’, ’instead’, ’minus’, ’near’, ’onto’, ’opposite’, ’outside’, ’past’, ’plus’, ’prior’, ’regarding’, ’save’, ’since’, ’throughout’, ’till’, ’toward’, ’towards’, ’underneath’, ’unlike’, ’upon’, ’versus’, ’via’, ’within’, ’without’, ’we’, ’they’.

C.3 Java Keywords as stop words

’ArrayList’, ’LinkedList’, ’true’, ’false’, ’abstract’, ’assert’, ’boolean’, ’break’, ’byte’, ’case’, ’catch’, ’char’, ’class’, ’const’, ’continue’, ’default’, ’do’, ’double’, ’else’, ’enum’, ’extends’, ’final’, ’finally’, ’float’, ’for’, ’goto’, ’if’, ’implements’, ’import’, ’instanceof’, ’interface’, ’int’, ’long’, ’native’, ’new’, ’package’, ’private’, ’protected’, ’public’, ’return’, ’short’, ’static’, ’strictfp’, ’super’, ’switch’, ’synchronized’, ’this’, ’throw’, ’throws’, ’transient’, ’try’, ’void’, ’volatile’, ’while’.
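As an illustration of how the three lists above might be combined in the normalization step, the following is a minimal Python sketch using NLTK. The constant names and the filter_tokens helper are our own illustrative assumptions, not the tool's actual code, and only small subsets of the lists are shown:

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # One-time downloads of NLTK data packages (package names may
    # vary slightly between NLTK versions).
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    # Illustrative subsets of the lists in C.2 and C.3 above.
    EXTRA_STOP_WORDS = {"aboard", "according", "across", "toward", "via"}
    JAVA_KEYWORDS = {"class", "public", "return", "void", "static", "final"}

    # C.1 comes from NLTK's built-in English stop-word corpus.
    STOP_WORDS = set(stopwords.words("english")) | EXTRA_STOP_WORDS | JAVA_KEYWORDS

    def filter_tokens(text):
        # Lower-case, tokenize, and keep alphabetic tokens that are
        # neither stop words nor Java keywords.
        tokens = word_tokenize(text.lower())
        return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

    print(filter_tokens("public static void main returns the user name"))
    # ['main', 'returns', 'user', 'name']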
