Bachelor Degree Project

Evolution of Software Documentation Over Time
An analysis of the quality of software documentation

Author: Helena Tevar Hernandez
Supervisor: Francis Palma
Semester: VT/HT 2020
Subject: Computer Science

Abstract

Software developers, maintainers, and testers rely on documentation of the code they are working with. However, software documentation is often perceived as a waste of effort because it is usually outdated. How documentation evolves through a set of releases may show whether there is any relationship between time and quality. The results could help future developers and managers to improve the quality of their documentation and decrease the time developers spend analyzing code. Previous studies showed that documentation used to be scarce and low in quality; thus, this research has investigated different variables to check whether the quality of the documentation changes over time. To that end, we created a tool that extracts and calculates the quality of the comments in code blocks, classes, and methods. The results agree with the previous studies: the quality of the documentation is affected to some extent through the releases, with a tendency to decrease.

Keywords: Software documentation, Source code documentation, Code conventions, Source code summarizing, Documentation, Textual similarity.

Preface

I would like to thank the teachers, readers, and friends that followed me in this project, especially my supervisor Francis Palma and coordinator Diego Perez Palacín, my colleague and personal natural language parser Dustin Payne, and the person that helped me during three years managing courses and schedules, Ewa Püschl. I would also like to thank the open-source community; thanks to them, this research was possible.

Contents

1 Introduction
1.1 Background
1.1.1 Quality Definition
1.1.2 Jaccard ratio and Cosine similarity
1.1.3 Java Language
1.2 Related work
1.3 Problem formulation
1.4 Motivation
1.5 Research Questions and Objectives
1.6 Scope/Limitation
1.7 Target group
1.8 Outline

2 Method
2.1 Natural Language Processing
2.2 Reliability and Validity
2.3 Ethical Considerations

3 Implementation
3.1 Extraction
3.1.1 Extracting comments
3.1.2 Extracting classes
3.1.3 Extracting methods
3.2 Cohesion calculation
3.2.1 Parsing and normalizing strings
3.2.2 Jaccard algorithm
3.2.3 Cosine algorithm
3.3 Results of the extraction

4 Results
4.1 RQ 1: What is the proportion of code blocks with and without documentation?
4.2 RQ 2: What is the proportion of new code blocks with and without documentation?
4.3 RQ 3: Does the code blocks documentation quality improve across the releases?
4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?

5 Analysis

6 Discussion

7 Conclusion
7.1 Future work

References

A Appendix — Selection of projects

B Appendix — Evolution of quality

C Appendix — Lists of stop words
C.1 NLTK stop words
C.2 Extra stop words
C.3 Java Keywords as stop words

1 Introduction

Developers usually rely on low-level documentation, especially class- and method-level documentation, to comprehend, modify, and maintain a system that is continuously evolving. The documentation has to be related to the class or method where it is located, reflecting what they do and how they should be maintained. While creating and maintaining software is the job of developers, updating the documentation is often not seen as an important task [1, 2]; thus, it is common to find documentation that has not been updated and does not reflect the actual functionality of the method or class it describes. Because software evolves continuously, this study examines the cohesion between documentation and source code as a factor of the quality of the documentation.

1.1 Background

During the process of developing source code artifacts, developers need to understand the functions of said artifacts by using source code documentation. This kind of documentation includes comments in source code that are used to explain blocks of code such as methods and classes. While good comments help developers with their jobs, the act of documenting is often seen as counterproductive and time-consuming [1, 2], especially for projects developed within the Agile principles, which require fast-paced programming and continuous delivery. In other cases, the comments are outdated or difficult to create for legacy code [3], and changes are added in an undisciplined manner [4]. This creates problems for the future implementer and for other stakeholders that also work with the same code, such as testers and maintainers [2, 5]. Changes in code documentation and some aspects of quality have been studied previously [6, 7]; the research by Schreck, Dallmeier, and Zimmermann studied the quality of documentation through similarity ratios between natural language and source code, among other values [8]. Knowing the previous results, this research focuses on the similarity ratio, using different algorithms, expecting to see how the documentation quality evolves through time on a large sample of projects.

1.1.1 Quality Definition

The American Society for Quality accepts that quality is not a static value and that it is different for every person or organization; however, it gives a guideline to define quality as 'the characteristics of a product or service that bear on its ability to satisfy stated or implied needs' [9]. Sommerville [1] suggested different requirements for all the documents associated with a software project: they act as the communication medium between members of the team and as an information repository that helps the development process, and they should tell users how to use and administer the system. There is a subjective component that

is inherent to the discussion of quality; for instance, whether a text is difficult to understand is not universal to all humans. Metrics should include human insights [10], but that adds complexity to the studies. More objective variables that are related to factors of quality in the documentation are coherence, usefulness, completeness, and consistency, as mentioned by Steidl [11]. Coherence covers how comment and code are related and is, thus, measurable. The relation between the comments and code can be studied as the ability to paraphrase the machine language in natural language in order to give context to the source code. In that case, the documentation should reflect the contents of the code. This was already stated by McBurney and McMillan [12]: source code documentation should use keywords from the source code. For this reason, a way to investigate how the documentation refers to the source code is to measure the similarity between them.

In order to check the similarity between two texts, many algorithms have been developed. In the research made by Steidl, Hummel, and Juergens, the similarity ratio used was the Levenshtein ratio [11]. The Levenshtein ratio defines the distance between two strings by counting the minimum number of operations needed to transform one string into the other [13]. There are two main branches of similarity ratios: string-based and corpus-based measures [13]. Corpus-based measures work best with large texts, which is not the case for this study; string-based measures are better fitted for small strings. This kind of algorithm includes character- and term-based ratios. Character-based algorithms measure the distance between characters in two strings, like the Levenshtein ratio. For instance, words like "Sam" and "Samuel" would be similar under character measures because they share three characters; however, term-based ratios would treat them as two different, unrelated words. Term-based similarity is the approach that can show the similarity between the developers' comments and the programming code. In this research, we elaborate on two algorithms for calculating the similarity ratio that have not been used in this context before: the Jaccard ratio and the Cosine similarity.

1.1.2 Jaccard ratio and Cosine similarity

Jaccard index similarity is calculated as the size of the intersection of two sets divided by the size of the union of the sets, where each set includes the words of a string [14]. The Jaccard ratio calculates the similarity between sets of words, meaning that the repetition of words is ignored. Two strings that contain the same set of words will result in a Jaccard index of 1, because the sets fully overlap, while two strings with no words in common will result in an index of 0.

J(A, B) = |A ∩ B| / |A ∪ B|

The Cosine similarity [15] calculates the cosine of the angle between two vectors, where each analyzed string forms a vector.

This ratio takes into account the repetition of words to create the required vectors. When two strings share the same words with the same repetitions, the cosine of the angle will be close to 1; in other words, the angle will be close to 0°. On the contrary, when two strings differ in words and repetitions, the cosine will be close to 0, so the angle formed by the two vectors will be close to 90°.

C(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖) = ( Σ_{i=1}^{n} A_i B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
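To make the difference between the two ratios concrete, consider a hypothetical comment reduced to the words 'parse input file' and a code block reduced to 'parse file file buffer'. As sets, A = {parse, input, file} and B = {parse, file, buffer}, so |A ∩ B| = 2, |A ∪ B| = 4, and J(A, B) = 2/4 = 0.5. As frequency vectors over the vocabulary (parse, input, file, buffer), A = (1, 1, 1, 0) and B = (1, 0, 2, 1), so A · B = 3, ‖A‖ = √3, ‖B‖ = √6, and C(A, B) = 3/√18 ≈ 0.71. The repeated word 'file' raises the cosine value but leaves the Jaccard index untouched.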

1.1.3 Java Language

Java has a particular syntax that developers have to follow to be able to compile an application. In this case, the most important parts for the research were the comment, class declaration, and method declaration syntax [16].

The comments are the main source of documentation for Java. They follow a clear syntax where a set of symbols written before a string makes the compiler ignore it, while developers can still use it to add extra information. The symbols used for comments in Java are: /*, /**, *, //, and */. Even though comments can be used anywhere in the code, Oracle designates the space before class and method declarations as the position for source code documentation, written as Javadoc comments [17]. However, the background of the developer may affect the way of writing source code documentation. For instance, Java's parent language, C++, uses block comments as source code documentation, and the Java compiler admits those C++ conventions in the language.

The class declaration syntax is structured around two mandatory elements. The first is one of the following keywords: class, enum, or interface. After the keyword, the Java compiler needs an identifier, the actual name of the class: a sequence of Unicode characters that cannot begin with a digit. The declaration is followed by an opening curly bracket that marks the beginning of the contents of the class. Those are the minimum mandatory requirements to declare a class in Java. Additionally, developers may add modifiers before the class declaration, for example public, private, or protected. When there is no modifier present in the code, Java assumes that the class is public within the corresponding package but private outside of it [16].

In contrast to class declarations, method declarations have a more flexible syntax. The only requirement to declare a method in Java is an identifier and a pair of parentheses. The Java compiler will use the default values needed to declare the method: public within the package and private otherwise in the case of the modifiers, and void for the return type [16]. All the possible terms used to declare a method in Java appear, in strict

order: modifiers, return data types, identifiers, parameters, and exceptions. Similarly to class declarations, there can be one or more modifiers. Unlike the modifiers, there must be only one return type; however, the data type can be any of the built-in Java data types or custom-made data types, in their single or array form. The parameters' syntax is an input data type followed by its identifier.

1.2 Related work

The current state of the art shows an opposition of forces between those who consider source code documentation a reliable source of information [1, 18] and those who, while still agreeing on the importance of documentation, try to automate its creation so developers can avoid the task [19]. However, automatic documentation does not go without criticism.

Even though tools have been developed to create source code documentation [3], studies have shown that automatically generated summaries were more inconsistent and less similar to the source code [12]. Those results suggest that source code documentation achieves better perceived similarity when it is written by the developers. However, one of the biggest complaints about documentation is how badly maintained it is and how most of the time it is out of date. Studies have shown that JavaDoc comments change over time, especially when developers want to elaborate on usage tips in the JavaDoc annotations, but there is no information about how the quality of such documentation changed over the releases [6].

For the case of quality in the documentation, the tool JavaMiner studies this topic, among other variables [7]. However, like the previous study, the research behind JavaMiner only works with the JavaDoc comments. This research was continued by Steidl, Hummel, and Juergens, who developed a project using machine learning on projects in Java and C/C++ that compares different quality aspects through source code exploration combined with interviews with developers during the research [11]. The study behind the tool QUASOLEDO includes the documentation created by JavaDoc as well as the comments used in C++: block comments and inline comments. The QUASOLEDO research studied variables related to the ratio of documented code blocks and the quantity of the words used in the documentation. The results pointed out that only 12.1% of the changes made in the code were modifications of both the comment and the code content of a block, 2.1% of the commits changed only comments, and 67.4% changed only code; changes to documentation made up only 32% of the total changes on a project [8]. The study points out how sub-optimal this is for development. The study made by Steidl [11], covering all types of comments in five projects, used the Levenshtein distance, similar to the Jaccard similarity, as one aspect of quality. The research concluded that documentation had a low quality ratio: only 37% of all files studied presented header (declaration) comments, and between 18% and 49% of these were copyright comments.

1.3 Problem formulation

This study aims to research the cohesion between source code documentation and the code. The use of similarity ratios for all kinds of comments has not been studied before and will add information to the field of study. We will study the source code and documentation of a group of projects through a range of releases, together with the quality and cohesion of said documentation. We plan to use cohesive metrics of text similarity as a factor of the quality of the comments in consecutive releases, to find how much the documentation quality changes through time.

1.4 Motivation

This research contributes to showing the behavior of source code documentation for classes and methods from the perspective of string similarity. The results may help engineers, technical writers, and project managers to understand the behavior of software documentation and its evolution, and to better plan for software documentation maintenance according to the results.

1.5 Research Questions and Objectives

After reviewing the literature, we have not found related work that fully explores cohesion and similarity using term-based algorithms as a factor of quality. We will study that aspect of quality and its variation over the consecutive releases of a project. The following research questions were used to plan the research:

• RQ 1: What is the proportion of code blocks with and without documentation? We investigate whether the projects are documented in a large or small proportion.

• RQ 2: What is the proportion of new code blocks with and without documentation? How much source code is documented at the beginning of its life will show how developers prioritize documentation during implementation.

• RQ 3: Does the code blocks' documentation quality improve across the releases? We calculate the ratios selected as a factor of quality and study them over the time the releases were made, in order to see any change that may show a relation between documentation quality and time of release.

• RQ 4: Is there any relation between lines of code and quality of the documentation? The quality ratios will be studied against the lines of code per code block, in order to learn what may affect the quality of the documentation.

In order to answer these research questions, the objectives presented in Table 1.1 were formulated:

O1  Study the difference between documented and non-documented code blocks among different releases and in total numbers.
O2  Calculate the cohesion ratios, Jaccard and cosine, of all the code blocks for each release.
O3  Perform statistical analysis to compare two sets of cohesion ratios for methods and classes for a release and its consecutive release.
O4  Perform statistical analysis to compare cohesion ratios with the lines of code of methods and classes.

Table 1.1: Thesis Project Objectives

1.6 Scope/Limitation

The data used for this project was limited to open-source projects, to have free access to the source code. No specific requirements, such as organization or size of the project, were followed in the selection of the projects studied. The only requirement was to have at least 10 releases per project available, which was exceeded in all of the projects. The projects were also recently updated, the oldest release studied having been uploaded in 2017. The programming language studied was Java, one of the most widely used languages. The natural language was English, because it is the most common language in the technical field and because it shares keywords with Java. Those three factors were used to select the projects used in this research.

1.7 Target group

The results of this research will be useful to different roles in a development team, such as project managers, implementers, and researchers. For project managers and implementers, the findings may reveal whether the quantity and quality of documentation are relevant for their projects, or whether they should create guidelines to help the team create and maintain their code documentation. On the other hand, researchers will have data related to natural language and cohesion, which will help to continue studies on the evolution of source code documentation.

1.8 Outline

The rest of this report comprises the following sections:

Section 2 - Method: We approach different possible methods to resolve the research questions, elaborate on natural language processing, and explain the threats to validity and reliability we encountered.

Section 3 - Implementation: The implementation phase was done using a Python 3 application that extracted the required data. This section explains how this process was implemented.

Section 4 - Results: The data gathered during implementation is shown without further analysis. The objective is to display objective data that will be used in the analysis.

Section 5 - Analysis: This section gives answers to the research questions by using the results as a source.

Section 6 - Discussion: This section continues the discussion of the results, including the results from the previous research reviewed in the related work.

Section 7 - Conclusion: This section concludes the research and gives an introduction to what could be done to further continue with the topic of quality of documentation.

2 Method

Previous researchers have established a pattern for studying documentation in source code: extract the declaration identifiers and their comments, and then process the extracted data. Similar studies, such as Steidl [11], followed this pattern and also used similarity ratios between the comment and the content of a block. The similarity ratios we decided to use as a measure of coherence, i.e., how comment and code are related, were term-based similarity ratios. However, to be able to compare two words in the most accurate way, we decided to use natural language processing to eliminate words that are not related or useful to the research, as well as to reduce inflected words to their lemmas so that two words can be compared. In the next step, the processed data was used with two different term-based similarity algorithms. While previous work studied the similarity between characters, we decided to study how similar the whole terms and words are between comment and source code. These two algorithms were selected among multiple options. First, term-based algorithms were better fitted because code and comment that share some relationship are an indicator of a meaningful code block [11], and term-based algorithms fit that requirement. One single algorithm could answer these questions, but the difference between the Jaccard and Cosine similarity ratios concerning word repetitions made it clear that we could use both, covering more possibilities.

The planning for the method displayed in Figure 2.1 shows how the methodology was applied. After reviewing existing tools and packages, we decided to create a tool that would fit our particular requirements and calculations. The tool requirements were to read all the files of a project, extract comments and their code blocks, use a natural language process to parse the extracted strings, calculate both similarity ratios and the size in lines of code, and finally save the raw data in a CSV (Comma Separated Values) file for the subsequent data analysis. Because the detection of code blocks included detecting classes and methods, we included that division to extract more refined data. The tool was tested with test files covering different cases that could create false positives, such as conditional blocks, nested classes and methods, throw statements, or LaTeX expressions. The results would then be used for further study.

In order to answer our research questions, we needed to know the percentage of code that was documented and not documented, as well as how much of the newly added code was documented. The study has a particular interest in the changes in quality, not quality itself; for that reason, we used the variation ratio of the similarity ratios to express the changes over time clearly. The variation values were split into discrete groups using size percentiles, and a value of 1.0 was used for the first release.

[Figure 2.1: Flowchart of the method]

2.1 Natural Language Processing

Comments and some parts of the source code are written by developers in a natural language; for instance, identifiers and variables are mostly written in common English. The natural language selected for this research was English, segmented by spaces. To get two strings that can be compared, the strings have to be parsed so that only the most meaningful words remain. For that, we created a set of stop words from common English words, like prepositions or pronouns, that give little to no meaning to a string, so they were removed from the study. In the particular case of the computer language, keywords and typical words from the Java language were also removed, to extract only the relevant words from the string [20].
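As an illustration, such a filter could be assembled with the NLTK package used later in this work; the short JAVA_KEYWORDS list below is a made-up excerpt, while the full lists actually used are given in Appendix C:

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# Illustrative excerpt only; the complete stop word lists are in Appendix C.
JAVA_KEYWORDS = ['public', 'private', 'static', 'void', 'class', 'return', 'new']
STOP_WORDS = set(stopwords.words('english')) | set(JAVA_KEYWORDS)

def remove_stop_words(text):
    # Keep only the words that carry meaning for the similarity comparison.
    return ' '.join(w for w in text.lower().split() if w not in STOP_WORDS)

print(remove_stop_words('public void print the number of users'))
# -> 'print number users'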

2.2 Reliability and Validity

To have a good representation of the results, this study used a total of ten projects from open source organizations, listed in Appendix A. The projects vary in owner organization and total size. After the selection of the projects, the research uses a total of 10 consecutive releases for each project, which means that this research includes 100 repository snapshots. Despite that, the results may not be representative, since we are using only open-source projects. Private repositories may behave differently and may involve different variables that affect the maintenance of source code documentation.

While Oracle names the comments created for JavaDoc as the main source of documentation in source code, this study did not omit other types of comments. Developers come from different backgrounds and may use different code conventions. In the particular case of Java, this research accepted the code conventions of C++ and included inline and block comments written above class and method declarations as source code documentation. The main reason was that, as a descendant of C++, Java still admits C++ style comments in the language. However, the results may differ when using only Oracle JavaDoc comments.

To ensure reliability, this research provides information on the implementation and the possibility to access the raw data extracted from the projects, as well as the data resulting from further analysis, in Appendix A.

2.3 Ethical Considerations

This research did not require any special consideration related to personal information, because all the data gathered was taken from open source projects. We neither altered nor used the code for purposes other than education and research, which is allowed by the existing open source licenses. However, the results may help to change project planning processes and project code conventions in the future, which could affect the workload of the respective future projects.

3 Implementation

The particularities of the requirements for this research made it easier to create our own tool that would include all kinds of comments written above class and method declarations as source code documentation. Other packages researched would have difficulties reading non-JavaDoc comments, or including non-declarative comments as documentation. By doing our own implementation, we ensured that our documentation definition, which was the main requirement, was met. The application created for this research was developed using Python 3 with the packages numpy (https://numpy.org/) for analysis and matplotlib (https://matplotlib.org/) for plotting the results. The database used for this application is based on CSV (Comma Separated Values) files.

The application required a database that mapped the location of all the repositories studied. It then iterated over the projects' source code and ran the calculations required to get the information needed for the research, as seen in Figure 3.1. After the calculations, the results were written to multiple CSV files, one per release. The files included the basic identification for each block, the identification of its parent block (for classes at the top of the file tree this would be its own name), and the parent block location. The information about the parent block was used to avoid problems with name duplication when comparing two classes or methods that had the same name. When a block finished, that is, when the application encountered a closing bracket, the application saved the last part of the information: the lines of code of the closed block, the declaration comments, and the content of the block. The comments and contents then went through the natural language process of eliminating stop words, lemmatizing the strings and, in the case of identifiers, separating words written in several naming conventions. After this process, the two strings were ready to go through the similarity algorithms. The results were saved in the CSV file under the block's name, together with the Jaccard ratio and the Cosine ratio.

To ensure that the application worked properly, it was run on a test file that included the most controversial comments and structures that could cause parsing problems. For instance, conditional blocks could be mistaken for methods without modifiers or return types, so a group of keywords was included to differentiate a method from conditionals, throw statements, or nested lambda methods. After running the test file, the smallest project was run several times to verify that no keyword or illegal character was detected as a method declaration by the application. For instance, the example in Listing 3.1 should not be detected:

MyClass
    .DoSomething(); // This line should be avoided

Listing 3.1: Example of a chained call that should not be detected as a declaration


[Figure 3.1: The workflow for breaking lines in classes and methods]

3.1 Extraction

The process followed to create the resulting database was to create an algorithm that reads the Java files line by line and proceeds with the calculations. This algorithm used multiple string and regex operations to recognize whether a line was a block declaration, its type, or a comment, as seen in Figure 3.1. In the first instance, the application gathered all the Java file paths and iterated over them to get an array of strings with the contents of each Java file. For each string, the application decided whether it was a comment, a block declaration (class or method), or the contents of a block. The decision was made using a regex in the case of class declarations and comments, while the decision for method declarations required multiple string operations. Any line matching none of these checks was treated as the contents of a block.
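A minimal sketch of this per-line dispatch is shown below; the order and the two regexes are simplified stand-ins for the actual checks detailed in Sections 3.1.1 to 3.1.3, not the original implementation:

import re

# Simplified stand-ins for the real checks described in the next subsections.
COMMENT_RE = re.compile(r'^\s*(/\*|\*|//)')
CLASS_RE = re.compile(r'\b(class|interface|enum)\s+\w+')

def classify_line(line):
    # Decide what a line of Java source contributes to the current block.
    if COMMENT_RE.match(line):
        return 'comment'
    if CLASS_RE.search(line):
        return 'class declaration'
    if re.search(r'\w+\s*\([^)]*\)', line):
        return 'possible method declaration'
    return 'block content'

print(classify_line('// a comment'))       # comment
print(classify_line('class Foo {'))        # class declaration
print(classify_line('void bar(int x) {'))  # possible method declaration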

3.1.1 Extracting comments

Comments are delimited by the Java language as all those lines or blocks of lines that include the symbols /*, /**, *, //, or */. For instance, the code in Listing 3.2 shows the types of comments in Java:

/**
 * This is a Javadoc.
 * @returns void
 */

// This is an inline comment example.

/*
 * This is a comment block.
 */

/*
   This kind of comment is also valid for Java.
 */

      // Any number of blank spaces before the comment are also valid.

Listing 3.2: Comment example in Java

The extraction of comments required, in the first place, a positive match with a regex. The regex checked whether comment symbols appeared at the beginning of a string. The regex had to be flexible about the position and number of white spaces, because Java ignores the number of them between asterisks and slashes. The absence of comment symbols does not mean that the line is not a comment: it could be part of a comment block whose lines do not begin with an asterisk, as shown in the example above. For that reason, if the line failed the check made by the regex, the next step was to check whether a block of comments had been closed or was still open. To do so, the

algorithm checked whether the last comment saved was an open block, meaning the current line is included in the comment block, or whether, on the contrary, the last saved comment was an inline comment or included a closing comment-block symbol. The final regex used for detecting comments in strings is presented in Listing 3.3:

'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*'

Listing 3.3: Regex for comments
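As a quick illustration (outside the tool itself), the regex can be exercised on a few hypothetical sample lines with Python's re module:

import re

COMMENT_RE = re.compile(r'^\s*((\*\s*/)+|(/\s*\*)|(/\s*\*\s*\*)|(\*\s*)|(/\s*/)).*')

for line in ['/** Javadoc start', '   * body line', '// inline', 'int x = 0;']:
    print(bool(COMMENT_RE.match(line)), repr(line))
# Prints True for the three comment lines and False for the code line.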

3.1.2 Extracting classes

Considering that class declarations in Java have a formal syntax, the extraction of the data from them was similar to the comment extraction. A string has to match a regex built around the class syntax in order to be used for data extraction. In this particular case, the only requirement for a class to be declared is to include one of the class keywords ('class', 'enum' or 'interface') and an identifier. Additionally, modifiers, superclasses, and interfaces can be added to the class declaration, but they are not mandatory. The list of modifiers is a fixed group of keywords that can be used in a regex, because there is no possibility to create personalized modifiers. As a result, finding the identifier in a class declaration is straightforward: the identifier of a class is the word located after one of the class keywords. Anything declared after the identifier was consumed by a greedy operator in the regex, because it gives no more information about the class identifier, as demonstrated in Listing 3.4:

class ClassExample {
    // This is a valid class declaration
}

public final class ClassExample extends SuperClass implements IClass {
    // This is also a valid class declaration
}

Listing 3.4: Class example in Java

In addition, regex in Python can be used to extract parts of a string. In this case, we wanted to extract the identifier of the class from the string. For that reason, the next word after the class keywords was extracted by using a named-group tag in the Python regex. The tag (?P<id>...) marks a part of the pattern with a group name, in this case 'id', that can be referenced to extract its contents. The final regex used is shown in Listing 3.5:

'^((Annotation|public|protected|private|static|abstract|final|native
|synchronized|transient|volatile|strictfp)\s+)*(class|interface|enum)
\s+(?P<id>[a-zA-Z\_0-9]+)'

Listing 3.5: Regex for class declarations
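A quick demonstration of the named group (again outside the tool itself): given the regex as reconstructed above, the identifier can be pulled out through match.group('id'):

import re

CLASS_RE = re.compile(
    r'^((Annotation|public|protected|private|static|abstract|final|native'
    r'|synchronized|transient|volatile|strictfp)\s+)*'
    r'(class|interface|enum)\s+(?P<id>[a-zA-Z\_0-9]+)')

match = CLASS_RE.match('public final class ClassExample extends SuperClass {')
print(match.group('id'))  # -> ClassExample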

3.1.3 Extracting methods

In the case of a method, more precise string manipulation was required. The only requirement for a method to be accepted is an identifier, a pair of parentheses, and a body. Without modifiers and return types, the Java compiler uses the default values: public within the package and private outside of it for the modifier, and void for the return type. It is important to point out that the return type does not only include the Java data types, but also custom data types imported into the Java project. Moreover, the opening bracket of the method body is not required to be on the same line as the method declaration, and some coding conventions and developers place it below the method declaration, as in Listing 3.6:

class ClassExample {

    public static myDataType myMethod (myDataType arg1, int arg2)
    {
        // This is a valid method
    }

    myOtherMethod(){
        // This is also a valid public method that returns void
    }
}

Listing 3.6: Method example in Java

The syntax of methods is similar to the syntax of conditional and loop blocks. Without constraints before the identifier and the parameters, an 'if' statement can create false positives. One possibility was to check the parent class but, because of the existence of nested methods and classes, this was difficult to do. The solution found for this issue was to create a list of keywords for loops, conditional blocks, test asserts, and symbols that cannot be used in method declarations. For instance, it is not possible to have arithmetic operations in a method declaration, so symbols like addition and equals mark a string as not being a method declaration. A regex could have been a solution, but the flexibility of none or many modifiers, custom data types, and similar constructs made such a regex possible but slow to process. Some of the problematic pieces of code found followed a syntax similar to Listing 3.7:

class FalsePositives {

    public static myDataType myMethod (myDataType arg1, int arg2)
    {
        if (arg1) {
            // This line should not be accepted
        }
    }

    myOtherMethod(){
        assertThat(test)      // This line should not be accepted
        method.something()    // This line should not be accepted
    }

    hashCodeImpl(Object content, String mimeType, String language,
        URL url, URI uri, String name, String path, boolean internal,
        boolean interactive, boolean cached, boolean legacy) {

        // This method should be accepted
    }
}

Listing 3.7: Examples of false positives in Java

The solution that avoided the maximum number of false positives and, at the same time, avoided the use of regex, was to create a process that would split, trim, and extract the needed sub-string, as seen in Figure 3.2. The requirement was to get the identifier, parentheses, and parameters of the string, ignoring modifiers, return data types, or any other information to the right-hand side of the parameters. For that, the string was first processed in reverse, finding the closing parenthesis of the parameters. Any information between the closing parenthesis and the end of the string is ignored. After this trim of the last part of the string, the string is iterated from the beginning to find the first open parenthesis. The part from the first character to the first parenthesis is the section of the declaration that should include modifiers, data types, and the identifier. When creating an array of the words in the string with the method 'split', the last word of the array is the identifier. The identifier plus the contents of the parameters form the final parsed string that includes the information required.

[Figure 3.2: Trim of method strings: 'public Collection getUrlPrefixes (String a, String b) // comment' is trimmed and split into 'getUrlPrefixes (String a, String b)']
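A compact sketch of that split-and-trim process (assuming, as in Figure 3.2, that the whole parameter list sits on one line; this is illustrative code, not the original implementation):

def extract_method_signature(line):
    # Scan from the right for the parenthesis that closes the parameter
    # list and drop everything after it (trailing comments, '{', ...).
    end = line.rfind(')')
    if end == -1:
        return None
    head, sep, params = line[:end + 1].partition('(')
    words = head.split()
    if not sep or not words:
        return None
    # The last word before '(' is the identifier; modifiers and return
    # types to its left are ignored.
    return words[-1] + '(' + params

print(extract_method_signature(
    'public Collection getUrlPrefixes (String a, String b) // comment'))
# -> getUrlPrefixes(String a, String b)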

3.2 Cohesion calculation

Before the cohesion calculations, a normalization of the two strings to be studied was required. For that, the two strings went through a natural language process. After the normalization of the strings, the cohesion ratios were calculated.

3.2.1 Parsing and normalizing strings

When encountering a closing bracket, the application knew that a code block was closing and that, for that reason, the block was ready to have its comments and contents processed. The research planned two cohesion ratios to calculate: Jaccard similarity and Cosine similarity. Before calculating the ratios, it was necessary to process the comments and contents through a natural language process (NLP from now on) to avoid bloating the calculations with common words; however, even before processing a string, it was required to parse the method and class identifiers. Until this moment, the comments and contents of a block were stored as lists. For the remaining calculations, the lists were combined into one string for comments and one string for content. After that, the strings still required small adjustments before being used for calculations.

Naming a class or a method is not standardized, nor is it mandatory to follow any guideline. Developers have total freedom to name their code as they please; even so, some common naming conventions are used among developers. For this research, three naming conventions were used to divide multiple words from the method and class identifiers: camel case, Pascal case (also known as upper camel case), and underscores, as exemplified in Listing 3.8. All the identifiers that followed those three conventions were sliced into multiple words. In doing so, the number of words in the content increased.

class Naming {
    public void camelCase(){
        // Became: camel case
    }

    public void DromedaryCase(){
        // Became: dromedary case
    }

    public void under_score(){
        // Became: under score
    }
}

Listing 3.8: Naming conventions example
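A sketch of this splitting step, covering the three conventions from Listing 3.8 (illustrative, not the original code):

import re

def split_identifier(identifier):
    # Replace underscores, then break at each lower/upper-case boundary.
    spaced = identifier.replace('_', ' ')
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', spaced)
    return spaced.lower()

print(split_identifier('camelCase'))      # -> camel case
print(split_identifier('DromedaryCase'))  # -> dromedary case
print(split_identifier('under_score'))    # -> under score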

For this research, the package used to process natural language was NLTK version 3 (https://www.nltk.org/). With NLTK, the two strings were parsed to remove words that are common in the English language, such as pronouns and prepositions, as well as numbers, and to lemmatize the remaining words. An example of lemmatization is changing 'numbers' to 'number'. In this way, the two strings were normalized as much as possible to include only the most important and relevant words in the same form. When this step was complete for both strings, they were ready for the calculation of their cohesion ratios.
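For example, NLTK's WordNet lemmatizer performs exactly this kind of reduction (shown here as an illustration; the thesis does not specify which NLTK lemmatizer was used):

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('numbers'))  # -> number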

3.2.2 Jaccard algorithm

The Jaccard similarity works with sets of words, with no repetitions allowed, so the Python algorithm was defined as in Listing 3.9:

def calc_Jaccard(self):
    comment_set = set(self.comment.split())
    content_set = set(self.content.split())

    intersection = comment_set.intersection(content_set)
    denominator = (len(comment_set) + len(content_set) - len(intersection))

    if not denominator:
        self.Jaccard = 0.0
    else:
        self.Jaccard = float(len(intersection)) / denominator

Listing 3.9: Python Jaccard similarity algorithm

3.2.3 Cosine algorithm

The Cosine similarity algorithm was also implemented in Python by using the mathematical formula described in Section 1.1.2, resulting in the algorithm in Listing 3.10:

import math  # needed for the square roots in the denominator

def calc_cosine(self):
    comment_vector = self.text_to_vector(self.comment)
    content_vector = self.text_to_vector(self.content)

    intersection = set(comment_vector.keys()) & set(content_vector.keys())

    numerator = sum([comment_vector[x] * content_vector[x]
                     for x in intersection])

    sum1 = sum([comment_vector[x] ** 2 for x in comment_vector.keys()])
    sum2 = sum([content_vector[x] ** 2 for x in content_vector.keys()])

    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        self.cosine = 0.0
    else:
        self.cosine = float(numerator) / denominator

Listing 3.10: Python cosine similarity algorithm
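The helper text_to_vector is referenced but not shown in the listings. A minimal implementation consistent with the formula in Section 1.1.2 could simply count word frequencies; the version below is an assumption, not the original code:

from collections import Counter

def text_to_vector(text):
    # Assumed helper: map a normalized string to a word-frequency vector.
    return Counter(text.split())

print(text_to_vector('parse file file'))  # Counter({'file': 2, 'parse': 1})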

3.3 Results of the extraction

After the cohesion calculations, the information of each code block was saved in the database with the following fields: project name, release name, release date, identifier, type (class or method), lines of code, owner (parent class), Jaccard ratio, Cosine ratio, comments (filtered string), and content (filtered string). The data obtained was used to perform statistical studies using Numpy and spreadsheets. The analysis of the results was documented in three CSV files, in three steps: one for the number of documented blocks, one for average and percentile calculations, and a final variation ratio document.

Each project had 10 releases, so the output of raw data was 10 CSV files with classes, methods, and similarity ratios. By comparing the names and parents of each block, we calculated, for each release, how many blocks were documented and how many were not, as well as how many were newly added and how many of those were documented at their creation. A second step calculated the percentiles of the lines of code of the blocks by type (class or method). We used the Numpy library for Python to make the calculations. The percentiles documented in the result CSV file were percentiles 0, 5, 25, 50, 75, 95, and 100. Finally, to know how the similarity ratios evolved through the releases, we calculated the average of the similarity ratios of each release. We used four discrete groups for the size of the block by using four percentiles: 25, 50, 75, and 95. In total, we got the average of the Jaccard and cosine ratios, for each type of block, by four percentiles, for 10 releases.

To make the data more readable, and because the similarity ratios by themselves were not interesting for the research, but rather their variation over time, one more calculation step was taken. The final CSV file was modified to calculate the variation ratio values for each release using the formula v_n / v_{n-1}, where v_n indicates the similarity ratio value for release n (the value 1.0 was used as the result for the initial step). This way, the resulting values over 1.0 indicated improvement and the values under 1.0 indicated a decrease in quality.
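These two post-processing steps can be sketched as follows (the numbers are made up for illustration):

import numpy as np

# Average similarity ratio per release (made-up values).
jaccard_by_release = [0.30, 0.31, 0.29, 0.28]

# Variation ratio v_n / v_(n-1), with 1.0 used for the first release.
variation = [1.0] + [jaccard_by_release[n] / jaccard_by_release[n - 1]
                     for n in range(1, len(jaccard_by_release))]
print([round(v, 3) for v in variation])  # [1.0, 1.033, 0.935, 0.966]

# Discrete size groups from lines-of-code percentiles.
lines_of_code = np.array([3, 8, 15, 40, 120, 400])
print(np.percentile(lines_of_code, [25, 50, 75, 95]))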

4 Results

The raw data used in this research was uploaded to the repository referenced in Appendix A. It contains the information of all code blocks, their length, and their cohesion ratios. Due to the size of the database, it is not directly included in this report. This study used a group of ten open-source projects and ten consecutive releases per project. In total, 100 releases were studied. The projects are diverse in length and ownership to provide a better representation of the data. More information about the projects used and all the data, raw and processed, can be found in Appendix A.

4.1 RQ 1: What is the proportion of code blocks with and without documentation?

To answer this question, we needed to find all the code blocks that contain any documentation or comment. The blocks were divided by type of block: class or method. The results calculated over the projects defined before are shown in Table 4.1. The raw total numbers of documented code blocks can also be found in the repository available in Appendix A.

Project         Total documented  Classes  Methods  Total blocks
Maven                     19.54%    9.75%    9.78%          6871
Jmeter                    34.57%    6.59%   27.98%         13940
Che                       21.78%    6.95%   14.78%         14608
Tomcat                    22.39%    5.87%   16.53%         28882
Springboot                13.86%    8.15%    5.71%         35666
CXF                        8.56%    2.44%    6.12%         58581
Guava                     11.41%    2.59%    8.82%         61671
Graal                      9.24%    2.96%    6.27%         90915
Elasticsearch             14.18%    3.73%   10.45%        128208
Netbeans                  21.45%    7.06%   14.39%        451083
Average                   17.70%    5.61%   12.09%

Table 4.1: Percentage of code documented by type of block and total

4.2 RQ 2: What is the proportion of new code blocks with and without documentation?

For this research question, we calculated which code blocks were added to a repository in release n when compared with release n−1. That sub-set of code blocks was used to calculate the percentage of new code blocks that were added with documentation, as seen in Figure 4.1.
[Figure 4.1: Percentage of new documented blocks (share of newly added classes and methods that arrive with documentation, per project)]
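The underlying comparison can be sketched as a set difference between consecutive releases; the keys and flags below are made up for illustration, but mirror the identifier and parent-block information stored in the tool's CSV files:

# A block counts as new in release n if its (identifier, parent) key was
# absent from release n-1; the values are "documented at creation" flags.
previous_release = {('Foo', 'Foo.java'): True, ('bar', 'Foo'): False}
current_release = {('Foo', 'Foo.java'): True, ('bar', 'Foo'): True,
                   ('baz', 'Foo'): True}

new_blocks = set(current_release) - set(previous_release)
documented = sum(current_release[key] for key in new_blocks)
print(100.0 * documented / len(new_blocks))  # % of new blocks documented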

4.3 RQ 3: Does the code blocks' documentation quality improve across the releases?

A statistical study of the raw results was done to get the outcome needed to answer the third research question: does the code blocks' documentation quality improve across the releases? For each project, we averaged the Jaccard and Cosine ratios using the discrete size groups given by the percentiles q25, q50, q75, and q95, excluding the lowest and highest 5%. The aim of this research is to find how the quality of the documentation evolves over time; for this reason, we used the variation ratio of the results instead of the similarity values, to improve readability and clarity. The similarity ratios were translated to their variation ratio using the formula v_n / v_{n-1}, where v_n indicates the similarity ratio value for release n. Because the first release has no previous data, we used a value of 1.0 as the first value. The results were plotted in four diagrams for better clarity, dividing the results between type of block and type of similarity ratio, as seen in Figure 4.2.
[Figure 4.2: Evolution of similarity rates for project Maven (Jaccard and cosine variation across releases, for classes and methods, by size percentile q25, q50, q75, q95)]

The results of the evolution and variance can be seen in Appendix B; they are used to continue the study in the next research question through Table 4.2 and Table 4.3.

4.4 RQ 4: Is there any relation between lines of code and quality of the documentation?

The results of the previous research question were refined to get the answer. The result is shown in Table 4.2 and Table 4.3. Variation ratio results show values over 1.0 in those cases where the similarity ratios improved over time, and values under 1.0 in those cases
where the similarity ratios decreased over time. The ten projects studied resulted in an average quality ratio that is presented in Figure 4.3, Figure 4.4, and Table 4.4.
Block type            Classes                            Methods
Percentile         25      50      75      95        25      50      75      95
Maven          1.0589  1.0212  1.0323  1.0242    0.9967  1.0073  1.0323   0.997
Jmeter            1.0   0.994  1.0001  1.0002    1.0001     1.0  1.0001  0.9997
Che            1.0003  1.0023  0.9999  0.9999    1.0015  1.0007  0.9999  0.9991
Tomcat         0.9981  0.9997  0.9997  0.9991    0.9999  1.9991  0.9997  0.9988
Springboot     1.0025  0.9932  1.0001  0.9966     0.999  0.9888  1.0001   0.993
Cxf            0.9957  0.9966  1.0033  0.9975    0.9969  0.9987  1.0033  0.9975
Guava          0.9913  0.9983  0.9993  0.9983    0.9965  1.0026  0.9993  0.9953
Graal          0.9952  0.9927  0.9998  1.0031    1.0021     1.0  0.9998   0.999
Elasticsearch  1.0002  0.9974  0.9959  0.9984    0.9942  0.9942  0.9959  0.9957
Netbeans        0.997  0.9996  1.0001     1.0    0.9992  0.9994  1.0001  0.9998

Table 4.2: Jaccard variation ratios by percentile

[Figure 4.3: Average variation for Jaccard ratios (classes vs. methods, by percentile 25, 50, 75, 95)]
Block type            Classes                            Methods
Percentile         25      50      75      95        25      50      75      95
Maven          1.0461  1.0142  1.0315  1.0402    0.9968  1.0037  0.9948   0.997
Jmeter            1.0  0.9994     1.0     1.0    1.0001     1.0  1.0003  0.9997
Che            0.9993   1.001     1.0  0.9984    1.0015  1.0006  1.0015  0.9989
Tomcat         0.9983  0.9998  0.9994  0.9982       1.0  1.0001  0.9994  0.9993
Springboot     1.0018  0.9924  1.0006   0.997    0.9978  0.9889  0.9989  0.9989
Cxf            0.9957  0.9985  1.0017  0.9976     0.997  0.9991  1.0008  0.9976
Guava          0.9915  0.9992  1.0002  0.9984    0.9969  1.0017  0.9978  0.9951
Graal          0.9972  0.9914  1.0002  1.0036    1.0037     1.0  0.9933  0.9976
Elasticsearch     1.0   0.997  0.9961  0.9979    0.9942  0.9937  0.9947  0.9955
Netbeans       0.9996  0.9995  0.9999     1.0    0.9992  0.9994  0.9995  0.9997

Table 4.3: Cosine variation ratios for each project

[Figure 4.4: Average variation for Cosine ratios (classes vs. methods, by percentile 25, 50, 75, 95)]
Project        Cosine variation  Jaccard variation  Project size (LOC)
Maven                    1.0091             1.0056               7,167
Jmeter                   1.0000             1.0000              13,943
Che                      1.0002             1.0004              19,951
Tomcat                   0.9994             0.9995              29,269
Springboot               0.9965             0.9965              41,318
Cxf                      0.9984             0.9986              61,264
Guava                    0.9979             0.9978              63,893
Graal                    0.9993             0.9985             109,920
Elasticsearch            0.9955             0.9952             146,415
Netbeans                 0.9996             0.9995             452,863
Average                  0.9996             0.9992

Table 4.4: Average variation data for each project compared with the project size

5 Analysis

According to the results gathered from the 100 releases studied, we do not see a special pattern or distribution that affects the quality of the documentation over time. The variation of the quality of the documentation remains close to 1.0, meaning that there is no change in our similarity ratios.

RQ 1: What is the proportion of code blocks with and without documentation? The data extracted was divided to count how many of the code blocks had documentation, by code type: class or method. The results were averaged for each release and for all releases to have a single data point for classes, methods, and in total, creating a percentage of the code that was documented, as displayed in Table 4.1. The results show a tendency to document the methods more than the classes. It could show a pattern where developers tend to document functionality over objects. Only an average of 5.61% of the classes have been documented, against 12.09% of the methods; the total number of blocks documented was on average 17.70%. This suggests a low intention of documenting the projects.

RQ 2: What is the proportion of new code blocks with and without documentation? The research covered 10 consecutive releases, so we had data over time to work with. For each release, we looked for code blocks that were not present in the previous release and observed whether those blocks were added to the project with documentation. The general tendency of the results shows that classes tend to be more documented than methods at the beginning of their life, as presented in Figure 4.1. However, as seen in the previous question, during their lifetime, methods outgrow the classes and end up being the majority of the documented blocks. It could be understood as the tendency of developers to document their classes when they first create them; over time, they keep adding comments mainly to methods. This shows how documentation does not happen in one step: classes and methods are not documented at the same moment, and developers add documentation over time.

RQ 3: Does the code blocks' documentation quality improve across the releases? The general quality of the documentation decreases over time. On average, the variation ratios of the cosine and Jaccard ratios decrease: the comments get less similar both as sets of words and counting repetitions. However, the ratios for this result are close to 1.0, so even when there is a deterioration in the quality, it is small, as can be seen in Appendix B. The resulting average for all blocks is 0.9996 for the Cosine ratio and 0.9992 for the Jaccard ratio. This shows a decrease in quality, but one so small that it could be assumed that there is no variation.

RQ 4: Is there any relation between lines of code and quality of the documentation? There is no relationship between the size of the block and the cohesion ratios. The Jaccard ratio for classes is over 1.0 for percentiles 25, 75, and 95 (note that this ratio works with sets of words, without repetitions). Percentile 50 is close to, but not yet at, 1.0. However, the range of variation goes between 1.0039 and 0.9995, i.e., less than 5% of change. The range is so close to 1.0 that we could assume that no variation is related to size. For the Jaccard ratio for methods, it is relevant that percentile 50 does show a visible increment, especially since all the other ratios remain close to 1.0. Percentile 50 has a value of 1.09; it is close to no variation at all, but the other values move around the range of 1.0991–0.9975. We could assume that the small size of the range does not give enough variation to be considered as such, but it is necessary to point out that, even though the other ratios are similar between them, percentile 50 shows a different behavior. In the case of the Cosine ratios, classes show an improvement in cohesion for 3 of 4 percentiles; again, percentile 50, in this case for classes, decreases. Again, the extent of the results is extremely close to 1.0, within the range 0.9987–1.0031, i.e., no relevant changes. In the case of methods, all percentiles decrease in quality. There is a general deterioration in quality; even if it is small, it may be showing up in the consequent releases as the projects grow in complexity and age.

6 Discussion

The literature showed concern about the quality of source code documentation. There is a paradox between developers' complaints about how poorly documentation is maintained [5] while they themselves are responsible for that maintenance. Multiple studies have found that documentation only changes significantly when big changes are made in a project [6], and a generally low performance of the existing documentation for some aspects of quality [8, 11]. This research confirmed, by using a bigger set of data than previous research, that the quality of the documentation does not improve over time. Whatever the cohesion ratio of a project is at the beginning of the study, it does not show improvement but a small decrease in quality over time. A low quantity of documentation was also shown in the work by Steidl [11], where the five projects studied had between 5% and 20% of the classes documented and between 28% and 49% of the methods documented. The results presented in Table 4.1 also show a low tendency to document in general, but especially in regard to classes.

We also confirmed that developers do not implement with documentation but rather implement first and add documentation, especially on methods, in later steps. As seen in Figure 4.1, classes are usually the most commented block when they are newly added to a project, but Table 4.1 shows how this changes in favor of methods. That shows how developers first document classes and add method documentation in later releases. However, limiting the sample to open source projects may lead to unrepresentative results. It is possible to find different results if this research is continued with private projects.

The results gathered by the research on source code documentation provide objective data for a problem that was already known: developers do not spend enough time on documentation, which makes future development and maintenance more difficult. This problem resembles the situation that led to the creation of the Unified Modeling Language, which gave developers a guideline to transmit information. In the same way, many other software artifacts have guidelines and international standards to work with. It would be interesting to study whether the area of computer science has reached the point of needing a standard on source code documentation to try to improve developers' productivity.

7 Conclusion

This research aimed to understand how source code documentation evolves over time. To that end, we formulated four research questions, which led to four objectives used to answer them. The data set comprised 100 releases from 10 open-source projects.

The first and second research questions asked what proportion of code blocks, and of newly added code blocks, have documentation. Accordingly, the aim of Objective 1 was to study the difference between documented and undocumented code blocks across releases and in total numbers. The results showed a higher number of documented methods than documented classes, and low documentation coverage overall, with an average of 17.70% of code blocks documented. We also confirmed that documenting source code happens in two steps: classes are documented first and methods later.

The third research question asked whether the documentation quality of code blocks improves across releases. For this we planned two objectives. Objective 2 led us to calculate the cohesion ratios, Jaccard and cosine, of all code blocks for each release, and the aim of Objective 3 was to perform statistical analysis on those ratios. The results pointed out that there is no improvement, but rather a slight decrease, in the quality of the documentation.

The last research question asked whether there is any relation between lines of code and documentation quality, which was addressed with the last objective. Objective 4 required us to perform a statistical analysis comparing cohesion ratios with the lines of code of methods and classes. The results showed no relationship between the size of a block and its cohesion ratios.

The research uses a large data set, but all the projects are open source, which limits the results to the particularities of our sample. More extensive work could study private repositories, where other variables, for example project deadlines, may affect the maintenance of the documentation.

7.1 Future work

During this research, and after studying the results obtained, three particular projects stood out. Their evolution data, presented in Figure 4.2 as well as Appendix B (Figure B.1 and Figure B.2), differs from the rest of the data gathered. The projects Maven, JMeter, and Che displayed an increase in quality, and the variable they have in common is their project size, as Table 4.4 shows; all the subsequent, larger projects decrease in quality over their releases. This suggests that an improvement in documentation quality is possible when the total number of code blocks does not exceed the size of the project Tomcat. It would be interesting to extend the research on documentation quality by using the similarity ratios and the project size as variables, to check whether this behavior also holds for other projects.

Although no obvious changes have been observed in this research, other external factors may affect the quality of the documentation; for instance, delivery deadlines may limit the time resources available for improving documentation. Studies suggest that even outdated documentation has value [2]. Further research could therefore compare voluntary and mandatory documentation practices, using controlled experiments to gather evidence of possible differences in quality.

References

[1] I. Sommerville, “Software documentation,” in Software Engineering, vol. 2: The Supporting Processes, R. Thayer and M. Christensen, Eds. Wiley-IEEE, 2001, pp. 143–154. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.8853

[2] T. C. Lethbridge, J. Singer, and A. Forward, “How software engineers use documentation: The state of the practice,” IEEE Software, vol. 20, no. 6, pp. 35–39, 2003. [Online]. Available: https://doi.org/10.1109/MS.2003.1241364

[3] L. Moreno, A. Marcus, L. Pollock, and K. Vijay-Shanker, “JSummarizer: An automatic generator of natural language summaries for Java classes,” in Proceedings of the 21st International Conference on Program Comprehension, ser. ICPC ’13. IEEE, 2013, pp. 230–232. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613855

[4] K. D. Welker, P. W. Oman, and G. G. Atkinson, “Development and application of an automated source code maintainability index,” Journal of Software Maintenance: Research and Practice, vol. 9, no. 3, pp. 127–159, 1997. [Online]. Available: https://doi.org/10.1002/(SICI)1096-908X(199705)9:3<127::AID-SMR149>3.0.CO;2-S

[5] I. Sommerville, Software Engineering, ser. International Computer Science Series. Pearson, 2011. [Online]. Available: https://books.google.se/books?id=l0egcQAACAAJ

[6] L. Shi, H. Zhong, T. Xie, and M. Li, “An empirical study on evolution of API documentation,” in Fundamental Approaches to Software Engineering, D. Giannakopoulou and F. Orejas, Eds. Springer Berlin Heidelberg, 2011, pp. 416–431. [Online]. Available: https://doi.org/10.1007/978-3-642-19811-3_29

[7] N. Khamis, J. Rilling, and R. Witte, “Assessing the quality factors found in in-line documentation written in natural language: The JavadocMiner,” Data & Knowledge Engineering, vol. 87, pp. 19–40, 2013. [Online]. Available: https://doi.org/10.1016/j.datak.2013.02.001

[8] D. Schreck, V. Dallmeier, and T. Zimmermann, “How documentation evolves over time,” in Proceedings of the Ninth International Workshop on Principles of Software Evolution (In Conjunction with the 6th ESEC/FSE Joint Meeting), ser. IWPSE ’07. ACM, 2007, pp. 4–10. [Online]. Available: https://doi.org/10.1145/1294948.1294952

[9] American Society for Quality. (2020, Feb. 13) Quality glossary. [Online]. Available: https://asq.org/quality-resources/quality-glossary/q

[10] A. Wingkvist, M. Ericsson, R. Lincke, and W. Löwe, “A metrics-based approach to technical documentation quality,” in Proceedings of the 2010 Seventh International Conference on the Quality of Information and Communications Technology, ser. QUATIC ’10. IEEE, 2010, pp. 476–481. [Online]. Available: https://doi.org/10.1109/QUATIC.2010.88

[11] D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in Proceedings of the 2013 21st International Conference on Program Comprehension, ser. ICPC ’13. IEEE, 2013, pp. 83–92. [Online]. Available: https://doi.org/10.1109/ICPC.2013.6613836

[12] P. W. McBurney and C. McMillan, “An empirical study of the textual similarity between source code and source code summaries,” Empirical Software Engineering, vol. 21, pp. 17–42, 2016. [Online]. Available: https://doi.org/10.1007/s10664-014-9344-6

[13] W. H. Gomaa and A. Fahmy, “A survey of text similarity approaches,” International Journal of Computer Applications, vol. 68, pp. 13–18, 2013. [Online]. Available: https://doi.org/10.5120/11638-7118

[14] P. Jaccard, “Étude comparative de la distribution florale dans une portion des Alpes et des Jura,” in Bulletin de la Société Vaudoise des Sciences Naturelles, 1901, pp. 547–579. [Online]. Available: https://ci.nii.ac.jp/naid/10019961020/en/

[15] A. Singhal, “Modern information retrieval: A brief overview,” IEEE Data Engineering Bulletin, vol. 24, Jan. 2001. [Online]. Available: http://sites.computer.org/debull/a01dec/a01dec-cd.pdf#page=37

[16] Oracle. (2020, Feb. 13) Java syntax. [Online]. Available: https://docs.oracle.com/javase/specs/jls/se7/html/jls-18.html

[17] Oracle. (2020, Feb. 13) Java code conventions. [Online]. Available: https://www.oracle.com/java/technologies/javase/codeconventions-comments.html

[18] J. Raskin, “Comments are more important than code,” Queue, vol. 3, no. 2, pp. 64–65, Mar. 2005. [Online]. Available: https://doi.org/10.1145/1053331.1053354

[19] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus, “On the use of automated text summarization techniques for summarizing source code,” in Proceedings of the 2010 17th Working Conference on Reverse Engineering. IEEE, 2010, pp. 35–44. [Online]. Available: https://doi.org/10.1109/WCRE.2010.13

[20] Oracle. (2020, Feb. 13) Java keywords. [Online]. Available: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/_keywords.html

A Appendix — Selection of projects

The projects used in this study were selected from open source repositories: five projects from the Apache organization and five from diverse sources. The projects are ordered by size, from the smallest (Apache Maven) to the largest (Apache Netbeans). All the raw data extracted from these projects can be accessed in the following repository: https://gitlab.com/HelenaTevar/documentation-evolution

Project: Apache Maven
Repository: https://github.com/apache/Maven
Description: Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
Releases: 3.5.0-beta-1 (20 03 2017), 3.5.0 (03 04 2017), 3.5.1 (10 09 2017), 3.5.2 (18 10 2017), 3.5.3 (14 02 2018), 3.5.4 (17 06 2018), 3.6.0 (24 10 2018), 3.6.1 (04 04 2019), 3.6.2 (27 08 2019), 3.6.3 (19 11 2019)

Project: Apache JMeter
Repository: https://github.com/apache/jmeter
Description: Apache JMeter may be used to test performance both on static and dynamic resources and Web dynamic applications. It can be used to simulate a heavy load on a server, group of servers, network or object to test its strength or to analyze overall performance under different load types.
Releases: 5.2-rc1 (07 10 2019), 5.2-rc2 (09 10 2019), 5.2-rc3 (15 10 2019), 5.2-rc4 (18 10 2019), 5.2-rc5 (29 10 2019), rel-v5.2 (03 11 2019), 5.2.1-rc1 (12 11 2019), 5.2.1-rc4 (16 11 2019), 5.2.1-rc5 (20 11 2019), rel-v5.2.1 (24 11 2019)

Project: Che
Repository: https://github.com/eclipse/che
Description: Next-generation container development platform, developer workspace server and cloud IDE. Che is Kubernetes-native and places everything the developer needs into containers in Kube pods, including dependencies, embedded containerized runtimes, a web IDE, and project code.
Releases: 7.9.2 (21 03 2020), 7.9.1 (06 03 2020), 7.9.0 (24 02 2020), 7.8.0 (30 01 2020), 7.7.1b (20 01 2020), 7.7.1 (17 01 2020), 7.7.0 (10 01 2020), 7.6.0 (19 12 2019), 7.5.1 (03 12 2019), 7.5.0 (28 11 2019)

Project: Apache Tomcat
Repository: https://github.com/apache/tomcat
Description: The Apache Tomcat® software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies. The Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket specifications are developed under the Java Community Process.
Releases: 9.0.22 (04 07 2019), 9.0.23 (14 08 2019), 9.0.24 (14 08 2019), 9.0.25 (16 09 2019), 9.0.26 (16 09 2019), 9.0.27 (07 10 2019), 9.0.28 (14 11 2019), 9.0.29 (16 11 2019), 9.0.30 (07 12 2019), 9.0.31 (05 02 2020)

Project: Springboot (Spring)
Repository: https://github.com/spring-projects/spring-boot
Description: Spring Boot makes it easy to create Spring-powered, production-grade applications and services with absolute minimum fuss. It takes an opinionated view of the Spring platform so that new and existing users can quickly get to the bits they need.
Releases: 2.1.6.RELEASE (19 06 2019), 2.1.7.RELEASE (06 08 2019), 2.1.8.RELEASE (05 09 2019), 2.1.9.RELEASE (02 10 2019), 2.2.0.RELEASE (16 10 2019), 2.2.1.RELEASE (02 11 2019), 2.2.2.RELEASE (06 12 2019), 2.2.3.RELEASE (16 01 2020), 2.2.4.RELEASE (20 01 2020), 2.2.5.RELEASE (27 02 2020)

Project: Apache CXF
Repository: https://github.com/apache/cxf
Description: Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS or JBI.
Releases: 3.2.5 (18 06 2018), 3.2.6 (08 08 2018), 3.2.7 (24 10 2018), 3.2.8 (24 01 2019), 3.3.0 (24 01 2019), 3.3.1 (28 02 2019), 3.3.2 (10 05 2019), 3.3.3 (08 08 2019), 3.3.4 (21 10 2019), 3.3.5 (10 01 2020)

Project: Google Guava
Repository: https://github.com/google/guava
Description: Guava is a set of core Java libraries from Google that includes new collection types (such as multimap and multiset), immutable collections, a graph library, and utilities for concurrency, I/O, hashing, caching, primitives, strings, and more! It is widely used on most Java projects within Google, and widely used by many other companies as well.
Releases: 24.1 (14 03 2018), 25.0 (26 04 2018), 25.1 (23 05 2018), 26.0 (01 08 2018), 27.0 (18 10 2018), 27.0.1 (19 11 2018), 27.1 (08 03 2019), 28.0 (12 06 2019), 28.1 (28 08 2019), 28.2 (27 12 2019)

Project: Oracle Graal
Repository: https://github.com/oracle/graal
Description: GraalVM is a universal virtual machine for running applications written in JavaScript, Python, Ruby, R, JVM-based languages like Java, Scala, Clojure, Kotlin, and LLVM-based languages such as C and C++.
Releases: 19.0.0 (09 05 2019), 19.0.2 (14 06 2019), 19.1.0 (27 06 2019), 19.1.1 (13 07 2019), 19.2.0 (19 08 2019), 19.2.1 (12 09 2019), 19.3.0 (15 11 2019), 19.3.0.2 (20 12 2019), 19.3.1 (14 01 2020), 20.0.0 (14 02 2020)

Project: Elastic ElasticSearch
Repository: https://github.com/elastic/elasticsearch
Description: Elasticsearch is a distributed RESTful search engine built for the cloud.
Releases: 7.3.0 (31 07 2019), 7.3.1 (22 08 2019), 7.3.2 (12 09 2019), 7.4.0 (01 10 2019), 7.4.1 (23 10 2019), 7.4.2 (31 10 2019), 7.5.0 (02 12 2019), 7.5.1 (18 12 2019), 7.5.2 (21 01 2020), 7.6.0 (11 02 2020)

Project: Apache Netbeans
Repository: https://github.com/apache/netbeans
Description: Apache NetBeans is an open source development environment, tooling platform, and application framework.
Releases: 11.1 (20 07 2019), 11.2-beta1 (25 09 2019), 11.2-beta2 (07 10 2019), 11.2-beta3 (17 10 2019), 11.2-vc1 (20 10 2019), 11.2 (25 10 2019), 11.2-u1 (01 12 2019), 11.3 (24 02 2020), 12.0-beta1 (10 05 2020), 12.0-beta2 (24 05 2020)

B Appendix — Evolution of quality

[Figure B.1: Evolution of similarity rates for project JMeter. Four panels: Jaccard variation for classes and for methods, and cosine variation for classes and for methods. Each panel plots the variation of percentiles q25, q50, q75, and q95 across releases 0–9.]

[Figure B.2: Quality evolution for project Che. Same four-panel layout as Figure B.1.]

[Figure B.3: Quality evolution for project Tomcat. Same four-panel layout as Figure B.1.]

[Figure B.4: Quality evolution for project Springboot. Same four-panel layout as Figure B.1.]

[Figure B.5: Quality evolution for project CXF. Same four-panel layout as Figure B.1.]

[Figure B.6: Quality evolution for project Guava. Same four-panel layout as Figure B.1.]

[Figure B.7: Quality evolution for project Graal. Same four-panel layout as Figure B.1.]

[Figure B.8: Quality evolution for project ElasticSearch. Same four-panel layout as Figure B.1.]

[Figure B.9: Quality evolution for project NetBeans. Same four-panel layout as Figure B.1.]

C Appendix — Lists of stop words

For this research, we used a natural language processing step that skips words that add little or no meaning to the text under study. The words skipped are listed below; a short sketch of how the lists can be combined follows the last list.

C.1 NLTK stop words

’i’, ’me’, ’my’, ’myself’, ’we’, ’our’, ’ours’, ’ourselves’, ’you’, "you’re", "you’ve", "you’ll", "you’d", ’your’, ’yours’, ’yourself’, ’yourselves’, ’he’, ’him’, ’his’, ’himself’, ’she’, "she’s", ’her’, ’hers’, ’herself’, ’it’, "it’s", ’its’, ’itself’, ’they’, ’them’, ’their’, ’theirs’, ’themselves’, ’what’, ’which’, ’who’, ’whom’, ’this’, ’that’, "that’ll", ’these’, ’those’, ’am’, ’is’, ’are’, ’was’, ’were’, ’be’, ’been’, ’being’, ’have’, ’has’, ’had’, ’having’, ’do’, ’does’, ’did’, ’doing’, ’a’, ’an’, ’the’, ’and’, ’but’, ’if’, ’or’, ’because’, ’as’, ’until’, ’while’, ’of’, ’at’, ’by’, ’for’, ’with’, ’about’, ’against’, ’between’, ’into’, ’through’, ’during’, ’before’, ’after’, ’above’, ’below’, ’to’, ’from’, ’up’, ’down’, ’in’, ’out’, ’on’, ’off’, ’over’, ’under’, ’again’, ’further’, ’then’, ’once’, ’here’, ’there’, ’when’, ’where’, ’why’, ’how’, ’all’, ’any’, ’both’, ’each’, ’few’, ’more’, ’most’, ’other’, ’some’, ’such’, ’no’, ’nor’, ’not’, ’only’, ’own’, ’same’, ’so’, ’than’, ’too’, ’very’, ’s’, ’t’, ’can’, ’will’, ’just’, ’don’, "don’t", ’should’, "should’ve", ’now’, ’d’, ’ll’, ’m’, ’o’, ’re’, ’ve’, ’y’, ’ain’, ’aren’, "aren’t", ’couldn’, "couldn’t", ’didn’, "didn’t", ’doesn’, "doesn’t", ’hadn’, "hadn’t", ’hasn’, "hasn’t", ’haven’, "haven’t", ’isn’, "isn’t", ’ma’, ’mightn’, "mightn’t", ’mustn’, "mustn’t", ’needn’, "needn’t", ’shan’, "shan’t", ’shouldn’, "shouldn’t", ’wasn’, "wasn’t", ’weren’, "weren’t", ’won’, "won’t", ’wouldn’, "wouldn’t".

C.2 Extra stop words

’aboard’, ’according’, ’across’, ’along’, ’alongside’, ’amid’, ’anti’, ’around’, ’aside’, ’atop’, ’behind’, ’beneath’, ’beside’, ’besides’, ’beyond’, ’concerning’, ’considering’, ’despite’, ’excepting’, ’excluding’, ’following’, ’inside’, ’instead’, ’minus’, ’near’, ’onto’, ’opposite’, ’outside’, ’past’, ’plus’, ’prior’, ’regarding’, ’save’, ’since’, ’throughout’, ’till’, ’toward’, ’towards’, ’underneath’, ’unlike’, ’upon’, ’versus’, ’via’, ’within’, ’without’, ’we’, ’they’.

C.3 Java Keywords as stop words

’ArrayList’, ’LinkedList’, ’true’, ’false’, ’abstract’, ’assert’, ’boolean’, ’break’, ’byte’, ’case’, ’catch’, ’char’, ’class’, ’const’, ’continue’, ’default’, ’do’, ’double’, ’else’, ’enum’, ’extends’, ’final’, ’finally’, ’float’, ’for’, ’goto’, ’if’, ’implements’, ’import’, ’instanceof’, ’interface’, ’int’, ’long’, ’native’, ’new’, ’package’, ’private’, ’protected’, ’public’, ’return’, ’short’, ’static’, ’strictfp’, ’super’, ’switch’, ’synchronized’, ’this’, ’throw’, ’throws’, ’transient’, ’try’, ’void’, ’volatile’, ’while’.
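As an illustration of how the three lists above might be combined in the normalization step, the following is a minimal Python sketch using NLTK. The constant names and the filter_tokens helper are our own illustrative assumptions, not the tool's actual code, and only small subsets of the lists are shown:

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # One-time downloads of NLTK data packages (package names may
    # vary slightly between NLTK versions).
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    # Illustrative subsets of the lists in C.2 and C.3 above.
    EXTRA_STOP_WORDS = {"aboard", "according", "across", "toward", "via"}
    JAVA_KEYWORDS = {"class", "public", "return", "void", "static", "final"}

    # C.1 comes from NLTK's built-in English stop-word corpus.
    STOP_WORDS = set(stopwords.words("english")) | EXTRA_STOP_WORDS | JAVA_KEYWORDS

    def filter_tokens(text):
        # Lower-case, tokenize, and keep alphabetic tokens that are
        # neither stop words nor Java keywords.
        tokens = word_tokenize(text.lower())
        return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

    print(filter_tokens("public static void main returns the user name"))
    # ['main', 'returns', 'user', 'name']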
