How to Identify Class Comment Types? a Multi-Language Approach for Class Comment Classification
Total Page:16
File Type:pdf, Size:1020Kb
How to Identify Class Comment Types? A Multi-language Approach for Class Comment Classification Pooja Rania, Sebastiano Panichellab, Manuel Leuenbergera, Andrea Di Sorboc, Oscar Nierstrasza aSoftware Composition Group, University of Bern, Switzerland bZurich University of Applied Science, Switzerland cDepartment of Engineering, University of Sannio, Italy Abstract Most software maintenance and evolution tasks require developers to understand the source code of their software systems. Software developers usually inspect class comments to gain knowledge about program behavior, regardless of the programming language they are using. Unfortunately, (i) different programming languages present language-specific code commenting notations and guidelines; and (ii) the source code of software projects often lacks comments that adequately describe the class behavior, which complicates program comprehension and evolution activities. To handle these challenges, this paper investigates the different language-specific class commenting practices of three program- ming languages: Python, Java, and Smalltalk. In particular, we systematically analyze the similarities and differences of the information types found in class comments of projects developed in these languages. We propose an approach that leverages two techniques — namely Natural Language Processing and Text Analysis — to auto- matically identify class comment types, i.e., the specific types of semantic information found in class comments. To the best of our knowledge, no previous work has provided a comprehensive taxonomy of class comment types for these three programming languages with the help of a common automated approach. Our results confirm that our approach can classify frequent class com- ment information types with high accuracy for the Python, Java, and Smalltalk programming languages. We believe this work can help in monitoring and assessing the quality and evolution of code comments in different programming languages, and thus support maintenance and evolution tasks. Keywords: Natural Language Processing Technique, Code Comment Analysis, Software Documentation 1. Introduction other forms of documentation when they try to answer program comprehension questions [5,6,4]. In addition, recent work Software maintenance and evolution tasks require developers has also demonstrated that “code documentation” is the most arXiv:2107.04521v2 [cs.SE] 25 Jul 2021 to perform program comprehension activities [1,2]. To under- used source of information for bug fixing, implementing fea- stand a software system, developers usually refer to the soft- tures, communication, and even code review [7]. In particular, ware documentation of the system [3,4]. Previous studies have well-documented code simplifies software maintenance activi- demonstrated that developers trust code comments more than ties, but many programmers often overlook or delay code com- menting tasks [8]. Email addresses: [email protected] (Pooja Rani), [email protected] (Sebastiano Panichella), Class comments play an important role in obtaining a high- [email protected] (Manuel Leuenberger), [email protected] (Andrea Di Sorbo), level overview of the classes in object-oriented languages [9]. [email protected] (Oscar Nierstrasz) In particular, when applying code changes, developers using Preprint submitted to Journal of Systems and Software July 27, 2021 object-oriented programming languages can inspect class com- they contain high-level design details as well as low-level im- ments to achieve most or the majority of the high-level insights plementation details of the class, e.g., rationale behind the class, about the software system design, which is critical for program its instance variables, and important implementation points of comprehension activities [10, 11, 12]. Class comments contain its public methods. By analyzing multiple languages that vary various types of information related to the usage of a class or its in the details of their class comments, we can provide a more implementation [13], which can be useful for other developers general overview of class commenting practices than by focus- performing program comprehension [6] and software mainte- ing on a single language. nance tasks [4]. Unfortunately, (i) different object-oriented pro- Programming languages offer various conventions and tools gramming languages adopt language-specific code commenting to express these different information types in class comments. notations and guidelines [14], (ii) they embed different kinds For example, Java supports Javadoc documentation comments of information in the comments [15, 16]; and (iii) software in which developers use specific annotations, such as @param projects commonly lack comments to adequately describe the and @author to denote a given information type. In Python, de- behavior of classes, which complicates program comprehen- velopers write class comments as docstrings containing similar sion and evolution activities [17, 18, 19]. The objective of annotations, such as param: and args: to denote various infor- our work is to examine developer class commenting practices mation types, and they use tools such as Pydoc and Sphinx to (e.g., comment content and style) across multiple languages, process these docstrings. In contrast to Java and Python, class investigating the way in which developers write information in comments in Smalltalk neither use annotations nor the writing comments, and to establish an approach to identify that infor- style of Javadoc or Pydoc, thus presenting a rather different per- mation in a language-independent manner. To achieve this ob- spective on commenting practices, and particular challenges for jective, we select three representative programming languages existing information identification approaches. Additionally, based on i) whether the language is statically- or dynamically- Smalltalk is considered to be a pure object-oriented program- typed, ii) the level of detail embedded in its class comments, ming language and supports live programming since its incep- iii) whether it supports live programming, and iv) the availabil- tion, therefore, it can present interesting insights into code doc- ity of a code comment taxonomy for the language (investigated umentation in live programming environments. in previous work). The motivation behind each criterion is ex- We argue that the extent to which class commenting prac- plained in the following paragraphs. tices vary across different languages is an aspect only partially As mentioned, programming languages adopt different con- investigated in previous work. Given the multi-language nature ventions to embed various types of information in class com- of contemporary software systems, this investigation is critical ments. For example, in Java (a statically-typed language), a to monitor and assess the quality and evolution of code com- class comment provides a high-level outline of a class, e.g., a ments in different programming languages, which is relevant to summary of the class, what the class actually does, and other support maintenance and evolution tasks. Therefore, we formu- classes it cooperates with [11]. In Python (a dynamically-typed late the following research question: language), the class comment guidelines suggest adding low- level details about public methods, instance variables, and sub- RQ1 “What types of information are present in class classes, in addition to the high-level summary of the class [20].1 comments? To what extent do information types vary In Smalltalk (another dynamically-typed language), class com- across programming languages?” ments are a primary source of code documentation, and thus We focus on addressing this question, as extracting class 1https://www.python.org/dev/peps/pep-0257/ comment information types (or simply class comment types) 2 can help in providing custom details to both novice and ex- composing CCTM. pert developers, and can assist developers at different stages Our results confirm that the use of these techniques allows of development. Hence, we report the first empirical study class comments to be classified with high accuracy, for all in- investigating the different language-specific class comment- vestigated languages. As our solution enables the presence or ing practices of three different languages: Python, Java, and absence of different types of comment information needed for Smalltalk. Specifically, we quantitatively and qualitatively ana- program comprehension to be determined automatically, we be- lyze the class comments that characterize the information types lieve it can serve as a crucial component for tools to assess the typically found in class comments of these languages. Thus, quality and evolution of code comments in different program- we present a taxonomy of class comment types based on the ming languages. Indeed, this information is needed to improve mapping of existing comment taxonomies, relevant for pro- code comment quality and to facilitate subsequent maintenance gram comprehension activities in each of these three languages, and evolution tasks. called CCTM (Class Comment Type Model), mined from the In summary, this paper makes the following contributions: actual commenting practices of developers. Four authors ana- 1. an empirically validated taxonomy, called CCTM, char- lyzed the content of comments using card sorting and pair sort- acterizing the information types found in class comments ing [21] to build and validate the comment taxonomy.