
How to Identify Class Comment Types? A Multi-language Approach for Class Comment Classification

Pooja Rani (a), Sebastiano Panichella (b), Manuel Leuenberger (a), Andrea Di Sorbo (c), Oscar Nierstrasz (a)

(a) Software Composition Group, University of Bern, Switzerland
(b) Zurich University of Applied Science, Switzerland
(c) Department of Engineering, University of Sannio, Italy

Email addresses: [email protected] (Pooja Rani), [email protected] (Sebastiano Panichella), [email protected] (Manuel Leuenberger), [email protected] (Andrea Di Sorbo), [email protected] (Oscar Nierstrasz)

Abstract

Most maintenance and evolution tasks require developers to understand the source code of their software systems. Software developers usually inspect class comments to gain knowledge about program behavior, regardless of the programming language they are using. Unfortunately, (i) different programming languages present language-specific code commenting notations and guidelines; and (ii) the source code of software projects often lacks comments that adequately describe the class behavior, which complicates program comprehension and evolution activities. To handle these challenges, this paper investigates the different language-specific class commenting practices of three programming languages: Python, Java, and Smalltalk. In particular, we systematically analyze the similarities and differences of the information types found in class comments of projects developed in these languages. We propose an approach that leverages two techniques — namely Natural Language Processing and Text Analysis — to automatically identify class comment types, i.e., the specific types of semantic information found in class comments. To the best of our knowledge, no previous work has provided a comprehensive taxonomy of class comment types for these three programming languages with the help of a common automated approach. Our results confirm that our approach can classify frequent class comment information types with high accuracy for the Python, Java, and Smalltalk programming languages. We believe this work can help in monitoring and assessing the quality and evolution of code comments in different programming languages, and thus support maintenance and evolution tasks.

Keywords: Natural Language Processing Technique, Code Comment Analysis, Software Documentation

1. Introduction

Software maintenance and evolution tasks require developers to perform program comprehension activities [1,2]. To understand a software system, developers usually refer to the software documentation of the system [3,4]. Previous studies have demonstrated that developers trust code comments more than other forms of documentation when they try to answer program comprehension questions [5,6,4]. In addition, recent work has demonstrated that "code documentation" is the most used source of information for bug fixing, implementing features, communication, and even code review [7]. In particular, well-documented code simplifies software maintenance activities, but many developers often overlook or delay code commenting tasks [8].

Class comments play an important role in obtaining a high-level overview of the classes in object-oriented languages [9].

In particular, when applying code changes, developers using object-oriented programming languages can inspect class comments to achieve most of the high-level insights about the software system design, which is critical for program comprehension activities [10, 11, 12]. Class comments contain various types of information related to the usage of a class or its implementation [13], which can be useful for other developers performing program comprehension [6] and software maintenance tasks [4]. Unfortunately, (i) different object-oriented programming languages adopt language-specific code commenting notations and guidelines [14], (ii) they embed different kinds of information in the comments [15, 16]; and (iii) software projects commonly lack comments that adequately describe the behavior of classes, which complicates program comprehension and evolution activities [17, 18, 19]. The objective of our work is to examine developer class commenting practices (e.g., comment content and style) across multiple languages, investigating the way in which developers write information in comments, and to establish an approach to identify that information in a language-independent manner. To achieve this objective, we select three representative programming languages based on i) whether the language is statically or dynamically typed, ii) the level of detail embedded in its class comments, iii) whether it supports live programming, and iv) the availability of a code comment taxonomy for the language (investigated in previous work). The motivation behind each criterion is explained in the following paragraphs.

As mentioned, programming languages adopt different conventions to embed various types of information in class comments. For example, in Java (a statically-typed language), a class comment provides a high-level outline of a class, e.g., a summary of the class, what the class actually does, and other classes it cooperates with [11]. In Python (a dynamically-typed language), the class comment guidelines suggest adding low-level details about public methods, instance variables, and subclasses, in addition to the high-level summary of the class [20].^1 In Smalltalk (another dynamically-typed language), class comments are a primary source of code documentation, and thus they contain high-level design details as well as low-level implementation details of the class, e.g., the rationale behind the class, its instance variables, and important implementation points of its public methods. By analyzing multiple languages that vary in the details of their class comments, we can provide a more general overview of class commenting practices than by focusing on a single language.

Programming languages offer various conventions and tools to express these different information types in class comments. For example, Java supports Javadoc documentation comments in which developers use specific annotations, such as @param and @author, to denote a given information type. In Python, developers write class comments as docstrings containing similar annotations, such as param: and args:, to denote various information types, and they use tools such as Pydoc and Sphinx to process these docstrings. In contrast to Java and Python, class comments in Smalltalk neither use annotations nor the writing style of Javadoc or Pydoc, thus presenting a rather different perspective on commenting practices, and particular challenges for existing information identification approaches. Additionally, Smalltalk is considered to be a pure object-oriented programming language and has supported live programming since its inception; it can therefore present interesting insights into code documentation in live programming environments.

We argue that the extent to which class commenting practices vary across different languages is an aspect only partially investigated in previous work. Given the multi-language nature of contemporary software systems, this investigation is critical to monitor and assess the quality and evolution of code comments in different programming languages, which is relevant to support maintenance and evolution tasks. Therefore, we formulate the following research question:

RQ1: "What types of information are present in class comments? To what extent do information types vary across programming languages?"

We focus on addressing this question, as extracting class comment information types (or simply class comment types) can help in providing custom details to both novice and expert developers, and can assist developers at different stages of development. Hence, we report the first empirical study investigating the different language-specific class commenting practices of three different languages: Python, Java, and Smalltalk. Specifically, we quantitatively and qualitatively analyze the class comments that characterize the information types typically found in class comments of these languages. Thus, we present a taxonomy of class comment types based on the mapping of existing comment taxonomies, relevant for program comprehension activities in each of these three languages, called CCTM (Class Comment Type Model), mined from the actual commenting practices of developers. Four authors analyzed the content of comments using card sorting and pair sorting [21] to build and validate the comment taxonomy. In the cases where a code comment taxonomy was already available from previous works [22, 23, 24], we used that taxonomy and refined it according to our research goals.

Our work provides important insights into the types of information found in the class comments of the investigated languages, highlighting their differences and commonalities. In this context, we conjecture that these identified information types can be used to explore automated methods capable of classifying comments according to CCTM, which is a relevant step towards the automated assessment of code comment quality in different languages. Thus, based on this consideration, we formulate a second research question:

RQ2: "Can machine learning be used to automatically identify class comment types according to CCTM?"

To answer this research question, we propose an approach that leverages two techniques — namely Natural Language Processing (NLP) and Text Analysis (TA) — to automatically classify class comment types according to CCTM. Specifically, by analyzing comments of different types using NLP and TA, we infer relevant features characterizing the class comment types. These features are then used to train machine learning models enabling the automated classification of the comment types composing CCTM.

Our results confirm that the use of these techniques allows class comments to be classified with high accuracy, for all investigated languages. As our solution enables the presence or absence of different types of comment information needed for program comprehension to be determined automatically, we believe it can serve as a crucial component for tools to assess the quality and evolution of code comments in different programming languages. Indeed, this information is needed to improve code comment quality and to facilitate subsequent maintenance and evolution tasks.

In summary, this paper makes the following contributions:

1. an empirically validated taxonomy, called CCTM, characterizing the information types found in class comments written by developers in three different programming languages,

2. an automated approach (available for research purposes) able to classify class comments with high accuracy according to CCTM, and

3. a publicly available dataset of manually dissected and categorized class comments in the replication package.^2

^1 https://www.python.org/dev/peps/pep-0257/
^2 https://github.com/poojaruhal/RP-class-comment-classification

2. Background

Code commenting practices vary across programming languages, depending on the language paradigm, the communities involved, the purpose of the language, and its usage in different domains.

While Java is a general-purpose, statically-typed, object-oriented programming language with wide adoption in industry, Python is dynamically-typed and supports object-oriented, functional, and procedural programming. We can observe differences in the notations used by Java and Python developers for commenting source code elements. For instance, in Java, a class comment, as shown in Figure 1, is usually written above the class declaration using annotations (e.g., @param, @version, etc.), whereas a class comment in Python, as seen in Figure 2, is typically written below the class declaration as a "docstring".^3 In Java, developers use dedicated Javadoc annotations, such as @author and @see, to denote a given information type. Similarly, in Python, developers use similar annotations in docstrings, such as See also:, Example:, and Args:, and they use tools such as Pydoc and Sphinx to process them.

[Figure 1: A class comment in Java]
[Figure 2: A class comment in Python]

^3 https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
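For illustration, the following hypothetical Python class docstring (in the Google/NumPy style processed by Sphinx) shows how such section headers mark distinct information types; the class and its wording are invented for this example:

    class BernoulliGenerator:
        """Generate pseudo-random Bernoulli trials.

        Args:
            p (float): Success probability of each trial.

        Note:
            Instances are not thread-safe.

        See also:
            :class:`RandomGenerator`
        """
    # The first line is a high-level summary; "Args:" marks parameter details,
    # "Note:" a warning, and "See also:" a pointer to a related class -- the
    # docstring counterparts of Javadoc's @param, @deprecated, and @see.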

Smalltalk is a pure, object-oriented, dynamically-typed, reflective programming language. Pharo is an open-source, live development environment incorporating a Smalltalk dialect. The Pharo ecosystem includes a significant number of projects used in research and industry [25]. A typical class comment in the Smalltalk environment (Pharo) is a source of high-level design information about the class as well as low-level implementation details. For example, the class comment of the class MorphicAlarm in Figure 3 documents (i) the intent of the class (mentioned in the first line), (ii) a code example to instantiate the class, (iii) a note explaining the corresponding comparison, and (iv) the features of the alarm system (in the last paragraph).

[Figure 3: A class comment in Smalltalk]

The class comment in Pharo appears in a separate pane instead of being woven into the source code of the class. The pane contains a default class comment template, which follows a CRC (Class-Responsibility-Collaboration) model, but no other standard guidelines are offered for the structure and style of the comments. The comments follow a different and informal writing style compared to Java, Python, and C/C++. For instance, a class comment uses complete sentences, often written in the first-person form, and does not use any kind of annotations, such as @param or @see, to mark the information type, as opposed to class comments in other languages [11, 26, 23]. As a descendant of Smalltalk-80, Smalltalk has a long history of class comments being isolated from the source code [27]. Class comments are the main source of documentation in Smalltalk. Additionally, Smalltalk has supported live programming for more than three decades, and can therefore present interesting insights into code documentation in a live programming environment.

Given the different commenting styles and the different types of information found in class comments from heterogeneous programming languages, the current study has the aim of (i) systematically examining the class commenting practices of different environments in detail, and (ii) proposing an automated approach to identify the information contained in class comments.

3. Study Design

The goal of our study is to understand the class commenting practices of developers across different object-oriented programming languages. With the obtained knowledge, we aim to build a recommender system that can automatically identify the different types of information in class comments. Such a system can provide custom details to both novice and expert developers, and assist them at various stages of development. Figure 4 illustrates the research approach we followed to answer research questions RQ1 and RQ2.

3.1. Data Collection

For our investigation, we selected (i) Java and Python, two of the top programming languages according to the Google Trend popularity index^4 and the TIOBE index^5, and (ii) Smalltalk, as its class commenting practices emerged from Smalltalk-80 [25, 28]. Other criteria for selecting these languages are explained in the Introduction (section 1) and Background (section 2). Smalltalk is still widely used, and it gained second place for most loved programming language in the Stack Overflow survey of 2017.^6 For each language, we chose popular, open-source, and heterogeneous projects. Such projects vary in terms of size, contributors, domains, ecosystems, and coding style guidelines (or comment guidelines).

^4 https://pypl.github.io/PYPL.html
^5 https://www.tiobe.com/tiobe-index/
^6 https://insights.stackoverflow.com/survey/2017/ (verified on 4 Feb 2020)

As not all classes contain a class comment, we identified the classes having class comments for each project. Afterwards, we extracted a statistically significant sample of class comments to conduct a manual analysis. This analysis aimed at identifying the semantic information developers embed in class comments. To determine the minimum size of the statistically significant sample for each language dataset, we set the confidence level to 95% and the margin of error to 5% [29].

After determining the sample size, we selected the number of comments to consider from each project based on the proportion of the project's class comments among all comments (from all projects). For instance, class comments from the Eclipse project contribute 29% of the total comments from all Java projects; therefore we selected the same proportion of sample comments, i.e., 29% of the Java sample size, from the Eclipse project (110 class comments), as shown in Table 1. To select representative sample comments from each project, we applied a stratified random sampling strategy and selected a proportional number of comments from each stratum. The strata were defined based on the length (in terms of lines) of comments. In particular, for each project, we first computed the quintiles of the distribution of comment lengths and treated the quintiles as strata. For example, to choose the 110 sample comments for Eclipse shown in Table 1, we explored the distribution of comment lines and obtained the quintiles 1, 3, 4, 5, 7, and 1473. Hence the five strata are 1-3, 4-4, 5-5, 6-7, and 8-1473. Then from each stratum, we selected the proportional number of comments.
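The sample sizes and strata can be reproduced with a short script. The sketch below is ours (not part of the paper's replication package): it uses Cochran's formula with a finite-population correction for the 95%/5% setting, and the standard-library quantiles function for the length-based strata; the exact rounding the authors used is an assumption.

    import math
    import random
    from statistics import quantiles

    def sample_size(population: int, z: float = 1.96, margin: float = 0.05) -> int:
        """Cochran's formula (p = 0.5) with finite-population correction."""
        n0 = z ** 2 * 0.25 / margin ** 2
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    def stratified_sample(lengths: list[int], per_stratum: int) -> list[int]:
        """Split comment indices into five strata by length quintiles, then
        draw per_stratum comments from each stratum without replacement."""
        cuts = quantiles(lengths, n=5)                  # four quintile cut points
        strata = [[] for _ in range(5)]
        for idx, length in enumerate(lengths):
            strata[sum(length > c for c in cuts)].append(idx)
        return [i for s in strata for i in random.sample(s, min(per_stratum, len(s)))]

    # 3,790 Python class comments in total (Table 2) yield the reported sample size
    print(sample_size(3790))  # -> 349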

[Figure 4: Overview of the research approach — taxonomy definition (RQ1) followed by automated classification of comment types (RQ2): projects, CCTM, natural language processing and textual analysis techniques yielding NLP RULES and TEXT FEATURES, and a learning phase with J48, Naive Bayes, and Random Forest models, followed by evaluation]

Java: We selected six open-source projects analyzed in previous work [22] to ease the comparison of our work with previous achievements.^7 Modern complex projects are commonly developed using multiple programming languages. For example, Apache Spark contains 72% Scala classes, 9% Java classes, and 19% classes from other languages.^8 In the context of our study, we considered the classes from the language under investigation, and discarded the classes from other programming languages. For each class, we parsed the Java code and extracted the code comments preceding the class definition using an AST (Abstract Syntax Tree) based parser. During extraction, we found instances of block comments (comments starting with /*) in addition to Javadoc class comments (comments starting with the /** symbol) before the class definition. In such cases, the AST-based parser detects the immediately preceding comment (be it a block comment or a Javadoc comment) as the class comment and treats the other comment above it as a dangling comment. To not miss any kind of class comment, we adapted our parser to join both comments (the AST-detected class comment and the dangling comment) into a whole class comment.

^7 Folder "RP/Dataset/RQ1/Java" in the replication package
^8 https://github.com/apache/spark

We present the overview of the projects selected for Java in Table 1 and the raw files in the replication package.^9 We established 376 class comments as the statistically significant sample size based on the total number of classes with comments. For each project, the sample of class comments (the column #Sample comments) is measured based on the proportion of the project's class comments in the total dataset of class comments (shown in the column "% of dataset" of Table 1). The number of lines in class comments varies from 1 to 4605 in Java projects. For each project, we established strata based on the quintiles identified from the project's distribution. From each stratum, we picked an equal number of comments using random sampling without replacement. For example, in the Eclipse project, we identified the five strata 1-3, 4-4, 5-5, 6-7, and 8-1473, and picked 22 comments from each stratum, summing to the required 110 sample comments. We followed the same approach for Python and Smalltalk to select the representative sample comments.

^9 Folder "RP/Dataset/RQ1/Java" in the replication package

Table 1: Overview of Java Projects

Project   % of Java classes   #Java classes   #Class comments   % of dataset   #Sample comments
Eclipse   98%                 9,128           6,253             29%            110
Spark     9.3%                1,090           740               3.4%           13
Guava     100%                3,119           2,858             13%            50
Guice     99%                 552             466               2.1%           10
Hadoop    92%                 11,855          8,846             41%            155
Vaadin    55%                 5,867           2,335             11%            41

Python: We selected seven open-source projects, also analyzed in previous work [23], to ease the comparison of our work with previous achievements. To extract class comments from Python classes, we implemented a Python AST-based parser and extracted the comments of the class definition. The metadata for the selected Python projects are reported in Table 2, while the class comments are found in our replication package.^10 We measured 349 sample comments to be the statistically significant sample size based on the total number of classes with comments.

^10 Folder "RP/Dataset/RQ1/Python" in the replication package
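As an illustration of this extraction step, the following self-contained sketch (ours, not the authors' parser, which also handles comments placed around the class declaration) collects class docstrings with Python's standard-library ast module:

    import ast

    def class_comments(source: str):
        """Yield (class name, docstring) pairs for every commented class."""
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.ClassDef):
                doc = ast.get_docstring(node)
                if doc is not None:          # skip classes without a class comment
                    yield node.name, doc

    module = '''
    class SequenceReader:
        """Reads records from a sequence file."""
    '''
    print(dict(class_comments(module)))      # {'SequenceReader': 'Reads records ...'}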

Table 2: Overview of Python Projects

Project    #Python classes   #Class comments   % of dataset   #Sample comments
Requests   79                43                1.1%           4
Pandas     1,753             377               9.9%           35
Mailpile   521               283               7.5%           26
IPython    509               240               6.3%           22
Django     8,750             1,164             30%            107
Pipenv     1,866             1,163             30%            107
Pytorch    2,699             520               13%            48

Smalltalk: We selected seven open-source projects. As with Java and Python, we selected the same projects investigated in previous research [24]. Table 3 shows the details of each project, with the total number of classes, the number of classes with comments,^11 their proportion of the total comments, and the number of sample comments selected for our manual analysis. We extracted the stable version of each project compatible with Pharo version 7, except for GToolkit, due to its lack of backward compatibility; we therefore used Pharo 8 for GToolkit.

^11 Folder "RP/Dataset/RQ1/Pharo" in the replication package

Table 3: Overview of Smalltalk Projects

Project       Total classes   #Classes with comments   % of dataset   #Sample comments
GToolkit      4,191           1,315                    43%            148
Seaside       841             411                      14%            46
Roassal       830             493                      16%            56
Moose         1,283           316                      10%            36
PolyMath      300             155                      5%             17
PetitParser   191             99                       3%             11
Pillar        420             237                      8%             27

3.2. Analysis Method

RQ1: What types of information are present in class comments? To what extent do information types vary across programming languages?

The purpose of RQ1 is to quantitatively and qualitatively investigate the language-specific class commenting practices characterizing programs written in Python, Java, and Smalltalk. Figure 5 depicts the research approach followed to answer RQ1. The outcome of this research, as explained later, consists of a mapping taxonomy and a comprehensive taxonomy of class comment types, called CCTM (Class Comment Type Model), mined from the actual commenting practices of developers (see Figure 5).

[Figure 5: Taxonomy study (RQ1) — data sources (20 GitHub projects, 37,446 class comments), data selection (1,066 comments), classification, validation (review others' classification, evaluator accepts or rejects reviews, conflicts discussed), and mapping of categories into the Class Comment Type Model (CCTM)]

Mapping categories: Before preparing the CCTM, we analyzed various earlier comment taxonomies, such as the code comment taxonomies for Java [22] and Python [23], and the class comment taxonomy for Smalltalk [24], to analyze their reported categories. We mapped the categories from each existing taxonomy, since they were formulated using different approaches and based on different comment types and categories. For instance, the Python code comment taxonomy by Zhang et al. [23] is inspired by the Java code comment taxonomy [22], whereas the Smalltalk class comment taxonomy is formulated using an open-card sorting technique [24]. Given the importance of mapping categories from heterogeneous environments [30], we established semantic interoperability [31] of the categories from each taxonomy in this step. One evaluator mapped the categories from Smalltalk to the Java and Python categories. Two other evaluators validated the mapping by reviewing each mapping and proposing changes. The original evaluator accepted or rejected the changes. All disagreement cases were reviewed by the fourth evaluator and discussed among all evaluators to reach a consensus. The categories that did not map to another taxonomy we added as new categories to that taxonomy. For example, the Precondition category from Smalltalk did not map to Java and Python, and thus we added it as a new category in Java and Python. Thus, we proposed the CCTM taxonomy, highlighting the existing and new categories for class comments, in the next step, Preparing CCTM.

Preparing CCTM: In this step, we analyzed various ecosystems from each language. We built our dataset^12 by mining data from 20 GitHub projects (see subsection 3.1 for further details). In the Data Selection step, we extracted the classes and their comments, collecting a total of 37,446 class comments. We selected a statistically significant sample set for each language, summing to 1,066 total class comments. We then qualitatively classified the selected class comments (see the following Classification step) and validated them (see the following Validation step) by reviewing and refining the categorization.

^12 Folder "RP/Dataset" in the replication package

Classification: Four evaluators (two Ph.D. candidates and two faculty members, all authors of this paper), each having at least four years of programming experience, participated in the study. We partitioned the Java, Python, and Smalltalk comments equally among all evaluators based on the distribution of each language's dataset to ensure the inclusion of comments from all projects and of diversified lengths. Each evaluator classified the assigned class comments according to the CCTM taxonomy of Java, Python, and Smalltalk [22, 23, 24]. For example, the Python class comment in Figure 2 is classified into categories such as Summary, Warnings, and Parameters, as shown in Figure 6.

[Figure 6: Class comment of Python (Figure 2) classified into various categories: Summary, Expand, Development Notes, Warnings, Links, Usage, Parameters]

Validation: After completing their individual evaluations, the evaluators continued with the validation step. The evaluators adopted a three-step method to validate the correctness of the performed class comment classification, whose final decision rule is sketched below. In the first iteration, called "Review others' classification" in Figure 5, every evaluator was tasked to review 50% of the comments (randomly assigned and) classified by the other evaluators. This step allowed us to confirm that each evaluator's classification is checked by at least one of the other evaluators. In reviewing the classifications, the reviewers indicated their judgement by labeling each comment with agree or disagree labels. In the second iteration (called "Evaluator accepts or rejects reviews" in Figure 5), the original evaluator examined the disagreements and the proposed changes. They indicated their opinion by accepting or rejecting each change, stating the reason. In case the reviewer's changes were accepted, the classification was directly fixed; otherwise the disagreements were carried to the next iteration. The third iteration assigned all remaining disagreements for review to a new evaluator who had not yet looked at the classification. Based on a majority voting mechanism, a decision was made and the classification was fixed according to the agreed changes. The levels of agreement and disagreement among the evaluators for each project and language are available in the replication package.^13

^13 Folder "RP/Result/RQ1/Manually-classified-comments" in the replication package
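The decision rule of the third iteration can be summarized in a few lines of code (a hypothetical helper, not taken from the replication package):

    from collections import Counter

    def final_label(votes: list[str]) -> str | None:
        """Majority vote over the evaluators' labels for one comment;
        ties return None and are resolved by discussion among all evaluators."""
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return None
        return ranked[0][0]

    print(final_label(["Summary", "Summary", "Expand"]))  # -> Summary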

After arriving at a decision on all comment classifications, we merged overlapping categories or renamed them by applying the majority voting mechanism, thus converging on a final version of the taxonomy, i.e., CCTM. This way, all categories were discussed by all evaluators to select the best naming convention and, whenever required, unnecessary categories were removed and duplicates were merged.

RQ2: Automated Classification of Class Comment Types in Different Programming Languages

Motivation. Previous work has focused on identifying information types from code comments scattered throughout the source code, from high-level class overviews to low-level method details [12, 22, 23, 32]. These works have focused individually on code comments in Java, Python, C++, and COBOL. Differently from our study, none of these works attempted to identify information types in class comments automatically across multiple languages. In our work, we are interested in exploring strategies that are able to achieve this goal in multiple languages such as Python, Java, and Smalltalk. Hence, for this research question, we explore to what extent term-based strategies and techniques based on NLP patterns [33, 34, 35] help to automatically identify the different information types composing CCTM.

Automated classification of class comment types. After the definition of CCTM, we propose an approach, called TESSERACT (auTomated multi-languagE claSSifiER of clAss CommenTs), which automatically classifies class comments according to CCTM. To achieve this goal, we considered all the comments manually validated in answering RQ1. Specifically, our approach leverages machine learning (ML) techniques and consists of four main steps:

1. Preprocessing: All the manually-labeled class comments from RQ1 in our dataset are used as ground truth to classify the unseen class comments.^14 It is important to mention that, since the classification is sentence-based, we split the comments into sentences. As a common preprocessing step, we change the sentences to lower case and remove all special characters.^15 We apply typical preprocessing steps on sentences [36] (e.g., stop-word removal) for TEXT features, but not for NLP features, in order to preserve the word order and capture the important n-gram patterns observed in the class comments [24].

2. NLP Feature Extraction: In this phase we focus on extracting NLP features to add to an initial term-by-document matrix M, shown in Figure 7, where each row represents a comment sentence (i.e., a sentence belonging to our language dataset composing CCTM) and each column represents an extracted feature. To extract the NLP features, we use NEON, a tool proposed in previous work [33], which is able to automatically detect NLP patterns (i.e., predicate-argument structures recurrently used for specific intents [37]) present in the natural language descriptions composing various types of software artifacts (e.g., mobile user reviews, emails, etc.) [33]. In our study, we use NEON to infer all NLP patterns characterizing the comment sentences modeled in the matrix M. Then, we add the identified NLP patterns as feature columns in M, where each column models the presence or absence (as a binomial feature) of an NLP pattern in the comment sentences. More formally, the presence or absence of the j-th NLP pattern (or feature) in a generic i-th sentence in M is modeled by the values 1 (presence) and 0 (absence), respectively. The output of this phase consists of the matrix M, where each i-th row represents a comment sentence and each j-th column represents an NLP feature.

^14 File "RP/Dataset/RQ2/Java/raw.csv" in the replication package
^15 File "RP/Results/RQ2/All-steps-result." in the replication package

[Figure 7: Matrix representation for a classifier — rows i0...in are comment sentences, columns j0...jn-1 hold the weights of the NLP and TEXT features, and the last column jn holds the boolean label stating whether sentence i belongs to category C[x] of language L[y]]
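NEON's parse-tree-based matching is out of scope here, but the shape of the resulting binomial features can be illustrated with regular expressions standing in for its predicate-argument patterns (the patterns below are invented stand-ins, not NEON's actual rules):

    import re

    # Hypothetical stand-ins for NEON patterns such as "represents [something]";
    # NEON derives these from parse trees rather than regular expressions.
    NLP_PATTERNS = [
        re.compile(r"\brepresents\b", re.IGNORECASE),
        re.compile(r"\bsee\b", re.IGNORECASE),
        re.compile(r"\bhas\b", re.IGNORECASE),
    ]

    def nlp_feature_row(sentence: str) -> list[int]:
        """One matrix row: 1 if the pattern occurs in the sentence, else 0."""
        return [int(bool(p.search(sentence))) for p in NLP_PATTERNS]

    M = [nlp_feature_row(s) for s in (
        "Represents an alarm that can be scheduled.",
        "See the subclasses for concrete implementations.",
    )]
    print(M)  # [[1, 0, 0], [0, 1, 0]]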

3. TEXT Features: In this phase, we add additional features (TEXT features) to the matrix M. To get the TEXT features, we preprocess the comment sentences by applying stop-word removal^16 and stemming [38].^17 The output of this phase corresponds to the matrix M where each row represents a comment sentence (i.e., a sentence belonging to our language dataset composing CCTM) and each column represents a term contained in it. More formally, each entry M[i, j] of the matrix represents the weight (or importance) of the j-th term contained in the i-th comment sentence. For the TEXT features, the terms in M are weighted using the tf-idf score [36], which identifies the most important terms in the sentences in matrix M. In particular, we use tf-idf as it downscales the weights of frequent terms that appear in many sentences. Such a weighting scheme has been successfully adopted in recent work [39] for performing code comment classification. The output of this phase consists of the weighted matrix M where each row represents a comment sentence and each column represents the weighted terms contained in it. It is important to note that a generic i-th comment sentence can be classified into multiple categories according to CCTM. To model this, we prepare the matrix M for each category of each language. The generic (last) column M[j_n] of the matrix M (where n-1 is the total number of features extracted from all sentences) represents the category C[x] of a language L[y], as shown in Figure 7. More formally, each entry M[i, j_n] of the matrix represents the boolean value stating whether the i-th sentence belongs to the category C[x] (1) or not (0).

4. Classification: We automatically classify class comments by adopting various machine learning models and a 10-fold cross-validation strategy. These machine learning models are fed with the aforementioned matrix M. Specifically, to increase the generalizability of our findings, we experiment (relying on the Weka tool^18) with several machine learning techniques, namely the standard probabilistic Naive Bayes classifier, the J48 tree model, and the Random Forest model. It is important to note that the choice of these techniques is not random, but based on their successful usage in recent work on code comment analysis [12, 22, 23, 40] and on the classification of unstructured texts for software maintenance purposes [35, 34].

^16 http://www.cs.cmu.edu/~mccallum/bow/rainbow/
^17 https://weka.sourceforge.io/doc.stable/weka/core/stemmers/IteratedLovinsStemmer.html
^18 http://waikato.github.io/weka/

Evaluation Metrics & Statistical Tests. To evaluate the performance of the tested ML techniques, we adopt well-known information retrieval metrics, namely precision, recall, and F-measure [36]. During our empirical investigation, we focus on finding the best configuration of features and machine learning models, as well as on alleviating concerns related to overfitting and selection bias. Specifically, (i) we investigate the classification results of the aforementioned machine learning models with different combinations of features (NLP, TEXT, and TEXT+NLP features), adopting a 10-fold validation strategy on the term-by-document matrix M; and (ii) to avoid potential bias or overfitting problems, we train the models only for the categories having at least 40 manually validated instances in our dataset. The average number of comments belonging to a category varies from 43 to 46 comments across all languages; we therefore selected the categories with a minimum of 40 comment instances. The top categories selected from each language, with their numbers of comments, are shown in Table 4. In order to determine whether the differences between the different input features and classifiers were statistically significant, we performed a Friedman test, followed by a post-hoc Nemenyi test, as recommended by Demšar [41]. Results concerning RQ2 are reported in section 4.
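For illustration, the pipeline of steps 1-4 can be approximated with scikit-learn in place of Weka (our sketch, with toy data; the actual study used Weka's implementations and the validated CCTM sentences):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Toy ground truth: 1 = sentence belongs to the category (e.g., Summary), 0 = not
    sentences = [
        "Represents an alarm that can be scheduled.",
        "See the subclasses for concrete implementations.",
    ] * 10
    labels = [1, 0] * 10

    # tf-idf TEXT features; binomial NLP-pattern columns would be appended in
    # the real setup to obtain the NLP+TEXT configuration.
    pipeline = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    scores = cross_val_score(pipeline, sentences, labels, cv=10, scoring="f1")
    print(scores.mean())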

Table 4: Top frequent categories (at least 40 comments) selected for each language

Language    Category                    #Comments
Java        Summary                     336
            Expand                      108
            Ownership                   97
            Pointer                     88
            Usage                       87
            Deprecation                 84
            Rationale                   50
Python      Summary                     318
            Usage                       92
            Expand                      87
            Development Notes           67
            Parameters                  57
Smalltalk   Responsibility              240
            Intent                      190
            Collaborator                91
            Examples                    81
            Class Reference             57
            Key Message                 48
            Key Implementation Point    46
4. Results

This section discusses the results of our empirical study.

4.1. RQ1: Class Comment Categories in Different Programming Languages

As a first step in formulating the CCTM taxonomy, we systematically map the available taxonomies from previous works and identify the unmapped categories, as shown in Figure 8. The mapping taxonomy shows a number of cases in which the Java and Python taxonomies do not entirely fit the Smalltalk taxonomy, as shown by the pink and violet edges in Figure 8. The figure shows that information types particular to a language (highlighted with red edges and unmapped nodes), such as Subclass Explanations, Observation, Precondition, and Extension, are found in the Smalltalk taxonomy but not in Python and Java. However, our analysis shows that these information types are present in the class comments of Java and Python projects. We introduce such categories into the existing taxonomies of Java and Python, and highlight them in green in Figure 9. On the other hand, categories such as Commented code, Exception, and Version are found in Java and Python class comments but not in Smalltalk. One reason can be that commented code is generally found in inline comments instead of documentation comments; information about Exception and Version, however, is found in class comments of Java and Python but not in Smalltalk.

[Figure 8: Mapping of Smalltalk categories to Java and Python]

The mapping taxonomy also highlights cases where categories from different taxonomies match partially. We define such categories as subset categories and highlight them with violet edges. For example, the Deprecation category in Java and the Version category in Python fall under the Warning category in Smalltalk, but their descriptions given in the respective earlier works cover only a subset of that information type according to our investigation. Pascarella et al. define the Deprecation category as "it contains explicit warnings used to inform the users about deprecated interface artifacts. The tag comment such as @version, @deprecated, or @since is used", whereas Zhang et al. define the Version category as "identifies the applicable version of some libraries" but do not mention the deprecation information in this category or any other category of their taxonomy [22, 23]. Thus, we define the Version category as a subset, or partial match, of the Deprecation category. On the other hand, the Warning category in Smalltalk ("Warn readers about various implementation details of the class") covers a broader aspect of warnings than the Deprecation category but does not mention the Version information type [24]. We mark these categories as subset categories, and highlight them with violet edges. Similarly, the Collaborator category in Smalltalk partially matches Expand and Links in Python. Categories such as Expand, Links, and Development Notes in Python combine several types of information compared to Smalltalk. For example, Expand includes the collaborators of a class, key methods, instance variables, and implementation-specific details. Such categories create challenges in identifying a particular type of information from the comment.

Finding 1. The Python taxonomy focuses more on high-level categories, combining various types of information into each category, whereas the Smalltalk taxonomy is more specific to the information types.

Using the categories from the mapping taxonomy, we analyzed the class comments of the various languages and formulated the taxonomy for each language to answer RQ1. Figure 9 shows the frequency of information types per language per project. The categories shown in green are the newly-added categories in each taxonomy. The categories in each heatmap are sorted according to the total frequency of their occurrence. For example, in Java, Summary appeared in a total of 336 comments (88%) of the 378 comments of the six Java projects. Pascarella et al. [22] proposed a hierarchical taxonomy, grouping lower-level categories within higher-level categories, e.g., grouping the categories Summary, Rationale, and Expand within Purpose. We show only the lower-level categories that correspond to identified information types from other languages.

[Figure 9: Categories found in class comments (CCTM) of the various projects (shown on the y-axis) of each programming language, with a color scale according to the percentage of comments falling into a category. The x-axis shows the categories inspired from existing work (highlighted in black) and the new categories (highlighted in green).]

Figure 9 shows that a few types of information, such as the summary of the class (Summary, Intent), the responsibility of the class (Responsibility, Summary), links to other classes or sources (Pointer, Collaborator, Links), developer notes (Todo, Development Notes), and warnings about the class (Warning), are found in the class comments of all languages. Summary being the most prevalent category in all languages affirms the value of investing effort in summarizing classes automatically [2, 42, 43, 17, 44]. As the majority of summarization techniques focus on generating the intent and responsibilities of a class for program comprehension tasks [17], other information types such as Warning, Recommendation, and Usage are generally ignored, even though coding style guidelines suggest writing them in documentation comments. Our results indicate that developers mention them frequently, but whether they find these information types important to support specific development tasks, or write them just to adhere to the coding guidelines, requires more analysis. Such information types present an interesting aspect to investigate in future work. For example, the usage of a class (Usage), its key responsibilities (Responsibility), warnings about it (Warning), and its collaborators (Collaborator) are found in significant numbers of comments in all languages. These information types are often suggested by the coding guidelines, as they can support developers in various development and maintenance tasks, and they can be included in customized code summaries based on the development task a developer is performing. For example, a developer seeking dependent classes can quickly find such classes from the class comment without reading the whole comment. Similarly, a developer expected to refactor a legacy class can quickly go through the warnings, if present, to better understand the specific conditions, and can thus save time. We plan to investigate such information types with developers. Specifically, we are interested in exploring how various categories are useful to developers, and for which kinds of tasks, e.g., program comprehension tasks or maintenance tasks.

Finding 2. Developers embed various types of information in class comments, varying from a high-level overview of the class to its low-level implementation details, across the investigated languages.

According to Nurvitadhi et al. [11], a class comment in Java should describe the purpose of the class, its responsibilities, and its interactions with other classes. In our study, we observe that class comments in Java often contain the purpose and responsibilities of the class (Summary and Expand), but its interactions with other classes (Pointer) less often. In Smalltalk, on the other hand, the information about interactions with other classes, i.e., Collaborator, is the third most frequent information type, after Intent and Responsibility. One of the reasons can be that Smalltalk class comments are guided by a CRC (Class, Responsibility, Collaborator) design template, and developers follow the template in writing these information types [24]. Class comments in Java also contain many other types of information. The most frequent type of information present in class comments is Summary, which shows that developers summarize the classes in the majority of the cases. Pascarella et al. found Usage to be the second most prevalent category in code comments, while we find Expand to be the second most prevalent category in class comments overall, and Usage to be the fifth most prevalent type of information [22]. However, the most prevalent categories vary across the projects of a language, and also across programming languages. For example, Usage is mentioned more often than Expand in Google projects (Guice and Guava), whereas in Apache projects (Spark, Hadoop) it is not. In contrast to Java, Python class comments contain Expand and Usage equally frequently, showing that Python targets both end-user developers and internal developers. We notice that Python and Smalltalk class comments contain more low-level implementation details about the class than Java. For example, Python class comments describe the class attributes and instance variables of a class under a header "Attributes" or "Parameters", its public methods under a header "Methods", and its constructor arguments in the Parameters and Expand categories. Additionally, Python class comments often contain explicit warnings about the class (with a header "warning:" or "note:" on a new line), making the information easily noticeable, whereas such behavior is rarely observed in Java. Whether such variations in the categories across projects and languages are due to different project comment style guidelines or to developers' personal preferences is not yet known.

We observe that developers use common natural language patterns to write similar types of information. For example, a Smalltalk developer described the collaborator class of the PMBernoulliGeneratorTest class as shown in Listing 1, and a Java developer did likewise in Listing 2, revealing a pattern "[This class] for [other class]" used to describe collaborating classes. This information type is captured in the Collaborator category in Smalltalk and the Pointer category in Java.

    A BernoulliGeneratorTest is a test class for testing
    the behavior of BernoulliGenerator

Listing 1: Collaborator mentioned in the PMBernoulliGeneratorTest class in Smalltalk

    An {@link RecordReader} for {@link SequenceFile}s.

Listing 2: Collaborator mentioned in the SequenceFileRecordReader class in Java

Identifying such patterns can help in extracting this type of information from the comment easily, and can support developers by highlighting the information necessary for a particular task, e.g., modifying the dependent classes in a maintenance task.

In contrast to earlier studies, we observe that developers mention details of subclasses in a parent class comment in all languages. We group this information under the Subclass Explanation category. In the Javadoc guidelines, this information is generally indicated by a special @inherit tag in method comments, but we did not find such a guideline for Java class comments. Similarly, we found no such guideline for describing subclasses in Smalltalk class comments or method comments. In contrast, the standard Python style guideline [20] suggests adding this information to the class comment, but other Python style guidelines, such as those from Google^19 and Numpy^20, do not mention this information type. However, we find instances of class comments containing subclass information in the IPython and Pytorch projects, which follow the Numpy and Google style guidelines respectively. Investigating which information types the style guidelines of each project suggest for comments, and to what extent developers follow these style guidelines in writing class comments, is future work.

^19 https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
^20 https://numpydoc.readthedocs.io/en/latest/format.html

Finding 3. Not all information types found in class comments are suggested by the corresponding project style guidelines.

Discussion: In the process of validating the taxonomies, the reviewers (evaluators reviewing the classification) marked their disagreement with a classification, stating their reason and proposing changes. A majority of the disagreements in the first Smalltalk iteration were due to long sentences containing different types of information (the Intent and Responsibility information types were commonly interweaved), and to the assignment of information to the categories Key Implementation Point and Collaborator. In Java and Python, we observed that disagreements were due to broad and loosely defined categories, such as Expand in Java and Development Notes in Python. Several information types, such as Warning, Recommendation, Observation, and Development Notes, are not structured by special tags and thus pose a challenge for automatic identification and extraction. On the other hand, a few categories, such as Example and Instance Variable in Smalltalk, and Deprecation, Links, and Parameters in Java and Python, are explicitly marked by headers or tags such as Usage, Instance Variable, @since, @see, and @params respectively. We observed that developers use common keywords across languages to indicate a particular information type. For example, notes are mentioned in the comment with the keyword "note" as a header, as shown in Listing 3, Listing 4, and Listing 5.

    * Note that even though these methods use {@link URL} parameters,
    * they are usually not appropriate for HTTP or other
    * non-classpath resources.

Listing 3: Explicit note mentioned in the Resources class in Java

    .. note::
        Depending of the size of your kernel, several (of the last)
        columns of the input might be lost, because it is a valid
        `cross-correlation`_, and not a full `cross-correlation`_.
        It is up to the user to add proper padding.

Listing 4: Explicit note mentioned in the Conv3d class in Python

    Note: position may change even if an element has no parent

Listing 5: Explicit note mentioned in the BlElementPositionChangedEvent class in Smalltalk

Several information types are suggested by the project-specific style guidelines, but their exact syntax is not, whereas a large number of information types are not mentioned by them at all. Due to the lack of conventions for these information types, developers use their own conventions to write them in comments.

Maalej et al. [5] demonstrate that developers consult comments in order to answer their questions regarding program comprehension. However, different types of information are interweaved in class comments, and not all developers need all types of information. Cioch et al. presented the documentation information needs of developers depending on their stage of expertise [45]. They showed that experts need design details and low-level details, whereas novice developers require a high-level overview with examples. Therefore, identifying and separating these information types is essential to address documentation needs. Rajlich presented a tool that gathers important information, such as a class's responsibilities, its dependencies, its member functions, and the authors' comments, to facilitate the developer's need to access particular information types [46]. We advance this work by automatically identifying and extracting several other frequent information types from class comments.

4.2. RQ2: Automated Classification of Class Comment Categories in Different Programming Languages

Haiduc et al. [2] performed a study on automatically generating summaries for classes and methods and found that the experimented summarization techniques work better on methods than on classes. More specifically, they discovered that while developers generally agree on the important attributes that should be considered in method summaries, there were conflicts concerning the types of information (i.e., class attributes and method names) that should appear in class summaries. In our study, we found that while Smalltalk and Python developers frequently embed class attributes or method names in class comments, this rarely happens in Java. Automatically identifying various kinds of information in comments can enable the generation of customized summaries based on what information individual developers consider relevant for the task at hand (e.g., a maintenance task). To this aim, as described in subsection 3.2, we empirically experiment with a machine-learning-based multi-language approach to automatically recognize the types of information available in class comments.

Table 5 provides an overview of the average precision, recall, and F-measure results considering the (top frequent) categories for all languages shown in Table 4. The results are obtained using multiple machine learning models and various combinations of features^21: (i) TEXT features only, (ii) NLP features only, and (iii) both NLP and TEXT features.^22 The results in Table 5 show that the NLP+TEXT configuration achieves the best results with the Random Forest algorithm, with relatively high precision (ranging from 78% to 92% for the selected languages), recall (ranging from 86% to 92%), and F-measure (ranging from 77% to 92%). Figure 10 shows the performance of the different algorithms with NLP+TEXT features for the most frequent categories of each language.

^21 File "RP/Result/RQ2/CV-10-results.xlsx" in the replication package
^22 File "RP/Result/RQ2/All-steps-results.sqlite" in the replication package

Table 5: Results for Java, Python, and Smalltalk obtained through different machine learning models and features

                             TEXT                          NLP                           NLP + TEXT
Language    ML Model         Precision  Recall  F-Measure  Precision  Recall  F-Measure  Precision  Recall  F-Measure
Java        J48              0.91       0.92    0.91       0.84       0.81    0.81       0.89       0.90    0.88
            Naive Bayes      0.86       0.83    0.83       0.84       0.81    0.81       0.86       0.84    0.84
            Random Forest    0.91       0.92    0.91       0.84       0.87    0.82       0.92       0.92    0.92
Python      J48              0.73       0.83    0.73       0.68       0.80    0.66       0.81       0.83    0.79
            Naive Bayes      0.78       0.69    0.72       0.75       0.77    0.75       0.79       0.72    0.74
            Random Forest    0.84       0.85    0.83       0.78       0.81    0.78       0.85       0.86    0.84
Smalltalk   J48              0.60       0.87    0.58       0.61       0.88    0.59       0.72       0.88    0.70
            Naive Bayes      0.84       0.80    0.82       0.85       0.83    0.83       0.86       0.82    0.84
            Random Forest    0.75       0.90    0.73       0.82       0.88    0.82       0.78       0.90    0.77

[Figure 10: Performance of the different classifiers (Naive Bayes, J48, Random Forest) based on F-measure for the TEXT + NLP feature set, over the top categories of Java, Python, and Smalltalk]
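The Friedman/Nemenyi procedure mentioned in subsection 3.2 can be reproduced along these lines (our sketch; the F-measures below are illustrative values taken from the TEXT columns of Table 5 rather than the full result sheets in the replication package):

    from scipy.stats import friedmanchisquare

    # One F-measure per block (here: one per language, TEXT features, Table 5)
    j48           = [0.91, 0.73, 0.58]
    naive_bayes   = [0.83, 0.72, 0.82]
    random_forest = [0.91, 0.83, 0.73]

    stat, p = friedmanchisquare(j48, naive_bayes, random_forest)
    print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")
    # When p is significant, a post-hoc Nemenyi test (e.g., scikit-posthocs'
    # posthoc_nemenyi_friedman) locates the classifier pairs that differ.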

Finding 4. Our results suggest that the Random Forest algorithm, fed with the combination of NLP+TEXT features, achieves the best classification performance over the different programming languages.

According to Table 5, NLP features alone achieve the lowest classification performance for both Java and Python, while we observe that this type of feature works well when dealing with Smalltalk class comments. On the one hand, class comments can often contain mixtures of structured information (e.g., code elements, such as class and attribute names) and unstructured information (i.e., natural language). NEON (i) leverages models trained on general-purpose natural language sentences to construct the parse tree of a sentence, and (ii) relies on the generated parse trees to identify common NLP patterns [33]. Therefore, the presence of code elements degrades NEON's capability to generate accurate parse trees, and consequently complicates its pattern recognition task. On the other hand, code in Smalltalk resembles natural language (English) phrases: e.g., the method "addString: using:" shown in Listing 6 takes the two parameters "string" and "CamelCaseScanner new", written in a natural language sentence style. Similarly, class comments in Smalltalk are written in a more informal writing style (often using the first-person form, and written in complete sentences, as shown in Figure 3) compared to Java and Python (whose guidelines suggest writing in a formal way using the third-person form). As demonstrated in previous work [35, 37], the usage of predicate-argument patterns is particularly well suited to classification problems in highly unstructured and informal contexts.

For the sake of brevity, we base the following analysis on the NLP+TEXT configuration when used with the Random Forest classifier. Table 6, Table 7, and Table 8 respectively report the precision, recall, and F-Measure for the top frequent categories (see subsection 4.2) obtained through the Random Forest algorithm with the NLP+TEXT features.

Table 6: Results for Java using the Random Forest classification model (NLP + TEXT features)

Category      Precision  Recall  F-Measure
Summary       0.87       0.88    0.87
Expand        0.86       0.87    0.86
Ownership     0.99       0.99    0.99
Pointer       0.91       0.91    0.91
Usage         0.88       0.88    0.87
Deprecation   0.98       0.98    0.98
Rationale     0.95       0.95    0.95

In the case of Java, as shown in Table 6, Deprecation, Ownership, and Rationale achieve high F-measure scores (≥ 95%), while Expand, Summary, and Usage are the categories with the lowest F-measure values (but still higher than 85%). This means that for most Java categories Random Forest achieves very accurate classification results. However, we observe a certain variability in the results, depending on the category (see Table 4). While in our manually validated sample Summary and Usage occur more frequently than other categories, we observe that they achieve a classification performance lower than Deprecation and Ownership (Table 6). This outcome can be due to the presence of specific annotations or words that often occur in sentences belonging to the Deprecation (e.g., @since) and Ownership (e.g., @author) categories. Although we removed all the special characters (annotation symbols) from the sentences, the techniques based on NLP+TEXT features can well capture the specific terms that frequently occur and are useful for identifying these categories. For instance, in the Vaadin project, the Ownership category always contains "Vaadin" in the @author field. Similarly, in the other Java projects, author name patterns are included in the NLP+TEXT feature set.

In contrast with the Deprecation and Ownership categories, we do not observe recurrent annotations or words introducing sentences of the Rationale type. However, such sentences are more accurately classified than sentences in the Summary and Expand categories. This could depend on the quantity and quality of the NLP features captured in these categories, jointly with a lower variability in the structure of comments falling in the Rationale class. In particular, we observed that for the Rationale category twelve unique NLP features have information gain values higher than 0.01, whereas two unique NLP features for the Summary category and only one NLP feature for the Expand category have information gain scores higher than 0.01. In terms of quality, the top-ranked NLP feature (or pattern) of the Summary category (i.e., "Represents [something]") occurs in only 3% of the overall comments falling in this category, whereas the top-ranked feature of the Rationale category (i.e., "Has [something]") occurs in 8% of the total comments belonging to that category. Nevertheless, the NLP heuristics that occur in the sentences belonging to the Expand class are also frequent in instances of the Pointer and Usage categories, making it harder to correctly predict the type of the sentences falling in these categories. Specifically, the NLP feature with the highest information gain score (i.e., "See [something]") for the Expand category (information gain = 0.00676) is also relevant for identifying sentences of the Pointer category, exhibiting an information gain value higher than the one observed for the Expand category (i.e., 0.04104).

Table 7: Results for Python using the Random Forest classification model (NLP + TEXT features)

Category           Precision  Recall  F-Measure
Summary            0.86       0.86    0.85
Usage              0.83       0.83    0.82
Expand             0.83       0.86    0.83
Development Notes  0.87       0.89    0.87
Parameters         0.86       0.86    0.85
In the case of Python (see Table 7), the F-measure results are still positive (> 80%) for all the considered categories. Similar to Java, the more frequent categories do not achieve the best performance. For example, the Parameters category is the least frequent among the categories considered, but achieves higher performance than most of the categories, as shown in Table 7. In contrast to Java, Python developers frequently use specific words (e.g., "params", "args", or "note"), rather than annotations, to denote a specific information type. We observe that these words frequently appear in sentences of the Parameters and Development Notes categories, and these terms are captured in the related feature sets. In the Python classification, the Usage category reports the lowest F-measure due to its ratio of incorrectly classified instances, at 17% the highest among all categories. This outcome can be partially explained by the small number of captured NLP heuristics (only one heuristic, "[something] defaults", is selected according to the information gain measure with threshold 0.005). We also observe that instances of the Usage category often contain code snippets mixed with informal text, making it hard to identify features that would be good predictors of this class. Similarly, the instances in the Expand category also contain mixtures of natural language and code snippets. Separating code snippets from natural language elements and treating each portion of the mixed text with a proper approach can help (i) to build more representative feature sets for these types of class comment sentences, and (ii) to improve overall classification performance.

Table 8: Results for Smalltalk using the Random Forest classification model (NLP + TEXT features)

Category                   Precision  Recall  F-Measure
Responsibility             0.79       0.82    0.78
Intent                     0.92       0.92    0.90
Collaborator               0.83       0.94    0.83
Example                    0.85       0.84    0.85
Class References           0.29       0.98    0.29
Key Messages               0.92       0.92    0.89
Key Implementation Points  0.87       0.89    0.85

Concerning Smalltalk, the Random Forest model provides slightly less stable results compared to Python and Java (see Table 5). As shown in Table 8, in the case of Smalltalk, Intent is the comment type with the highest F-measure. However, for most categories the F-measure results are still positive (> 78%), except for the Class References category, which captures the other classes referred to in the class comment. Random Forest is the ML algorithm that achieves the worst results for this category (F-Measure of 29%), whereas the Naive Bayes algorithm achieves an F-Measure score of 93% for the same category. Similarly, for the Collaborator category the Naive Bayes model achieved better results than the Random Forest model. Both categories can contain similar information, i.e., the names of other classes the class interacts with. We observe that in Smalltalk, class names are generally split into separate words in comments, thus making it hard to identify them as classes from the text. Nevertheless, as demonstrated in previous work [22], the Naive Bayes algorithm can achieve high performance in classifying information chunks in code comments containing code elements (e.g., Pointer), while its performance degrades when dealing with less structured texts (e.g., Rationale). We observe similar behavior for the Links category in the Python taxonomy. Figure 8 shows that all these categories, such as Links, Pointer, Collaborator, and Class References, contain similar types of information. In contrast, developers follow specific patterns in structuring sentences belonging to other categories such as Intent and Example, as reported in previous work [24].
In future work, we plan to explore combining various machine learning algorithms to improve the results [47], as sketched below.
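One candidate for such a combination, in line with the sum and product rules of [47], is Weka's Vote meta-classifier over the three algorithms compared in Figure 10. The following is a hypothetical sketch of the idea, not an evaluated configuration:

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.SelectedTag;

public class EnsembleSketch {
    public static Vote buildEnsemble() {
        Vote vote = new Vote();
        // Combine the three classifiers evaluated in this study.
        vote.setClassifiers(new Classifier[] {
            new NaiveBayes(), new J48(), new RandomForest() });
        // Average of class probabilities; Vote.PRODUCT_RULE selects the
        // product-rule variant instead.
        vote.setCombinationRule(new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));
        return vote;
    }
}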

To qualitatively corroborate the quantitative results and understand the importance of each considered NLP heuristic, we computed the popular statistical measure information gain for the NLP features in each category, and ranked these features based on their scores. We use the default implementation of the information gain algorithm and the ranker available in Weka with the threshold value 0.005 [48]. Interestingly, for each category, the heuristics having the highest information gain values also exhibit easily explainable relations with the intent of the category itself. For instance, for the Responsibility category in Smalltalk (which lists the responsibilities of the class) we observe that "Allows [something]" and "Specifies [something]" are among the best-ranked heuristics. Similarly, we report that "[something] is used" and "[something] is applied" are among the heuristics having the best information gain values for the Collaborator category (i.e., interactions of the class with other classes), while the heuristics "[something] is class for [something]" and "Represents [something]" have higher information gain values when used to identify comments of the Intent type (i.e., describing the purpose of the class). These heuristics confirm the patterns identified by Rani et al. in their manual analysis of Smalltalk class comments [24]. Similar results are observed for the other languages. More specifically, for the Summary category in Java, the heuristics "Represents [something]" and "[something] tests" are among the NLP features with the highest information gain. Instead, in the case of the Expand and Pointer categories, we observe a common relevant heuristic: "See [something]".
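This ranking step can be reproduced with Weka's attribute selection API. The sketch below assumes a hypothetical ARFF file containing the vectorized sentences of one language, with the comment category as the class attribute:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FeatureRankingSketch {
    public static void main(String[] args) throws Exception {
        // Vectorized sentences (NLP-heuristic indicators plus TEXT terms);
        // "features.arff" is a hypothetical placeholder file name.
        Instances data = DataSource.read("features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        InfoGainAttributeEval infoGain = new InfoGainAttributeEval();
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.005); // keep features with information gain > 0.005

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(infoGain);
        selection.setSearch(ranker);
        selection.SelectAttributes(data);

        // Each row: [attribute index, information gain score], sorted by score.
        for (double[] ranked : selection.rankedAttributes()) {
            System.out.printf("%s -> %.5f%n",
                data.attribute((int) ranked[0]).name(), ranked[1]);
        }
    }
}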
The analysis of the top heuristics for the considered categories highlights that developers follow similar patterns (e.g., "[verb] [something]") to summarize the purpose and responsibilities of the class across the different programming languages. However, no common patterns are found when discussing specific implementation details (Expand and Usage in Java and Python, and Example in Smalltalk).

Finding 6. In all the considered programming languages, developers follow similar patterns to summarize the purpose and the responsibilities of a class. No common patterns are observed when implementation details are discussed.

Discussion: To further confirm the reliability of our results, we complement the previous results with relevant statistical tests. In particular, the Friedman test reveals that the differences in performance among the classifiers are statistically significant in terms of F-Measure. Thus, we can conclude that when comparing the performance of classifiers and using different input configurations, the choice of the classifier significantly affects the results. Specifically, to gain further insights about the groups that statistically differ, we performed the Nemenyi test. Results of the test suggest that the Naive Bayes and the J48 models do not statistically differ in terms of F-Measure, while the Random Forest model is the best performing model, with statistical evidence (p-value < 0.05).

To analyze how the usage of different features (NLP, TEXT, NLP+TEXT) affects the classification results, we also executed a Friedman test on the F-Measure scores obtained by the Random Forest algorithm for each possible input combination. The test concluded that the difference in the results with different inputs is statistically significant (p-value < 0.05). To gain further insight into the groups that statistically differ, we performed a Nemenyi test. The test revealed that there is a significant difference between the NLP and NLP+TEXT combinations (p-value < 0.05). This result confirms the importance of both NLP and TEXT features when classifying class comment types in different languages. The input data and the scripts used for the tests are provided in the Replication package.23 Currently, we use the tf-idf weighting scheme for TEXT features, but we plan to experiment with other weighting schemes in future work, for instance TF-IDF-ICSDF (Inverse Class Space Density Frequency), as it also considers the distribution of inter-class documents when calculating the weight of each term [49].

23 Folder "RP/Result/RQ2/Statistical-analysis" in the Replication package
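For reference, the Friedman statistic behind these tests can be computed directly from a matrix of F-Measure scores. The sketch below implements the textbook formula; for brevity it ignores tied scores (a full implementation would assign average ranks, and a Nemenyi post-hoc comparison would follow), and the score matrix is made-up illustrative data rather than our measured results.

import java.util.Arrays;

public class FriedmanSketch {

    // scores[i][j]: F-Measure of classifier j on problem instance i.
    static double friedmanChiSquare(double[][] scores) {
        int n = scores.length;
        int k = scores[0].length;
        double[] rankSum = new double[k];
        for (double[] row : scores) {
            Integer[] order = new Integer[k];
            for (int j = 0; j < k; j++) order[j] = j;
            // Higher score gets the better (lower) rank; ties are ignored here.
            Arrays.sort(order, (a, b) -> Double.compare(row[b], row[a]));
            for (int pos = 0; pos < k; pos++) rankSum[order[pos]] += pos + 1;
        }
        double sumSqAvgRanks = 0;
        for (double s : rankSum) {
            double avg = s / n;
            sumSqAvgRanks += avg * avg;
        }
        // Friedman statistic, chi-square distributed with k-1 degrees of freedom.
        return 12.0 * n / (k * (k + 1)) * (sumSqAvgRanks - k * (k + 1) * (k + 1) / 4.0);
    }

    public static void main(String[] args) {
        // Made-up illustrative F-Measures (rows: problems; columns: NB, J48, RF).
        double[][] scores = { {0.80, 0.82, 0.87}, {0.79, 0.84, 0.86}, {0.90, 0.95, 0.99} };
        System.out.printf("Friedman chi^2 = %.3f%n", friedmanChiSquare(scores));
    }
}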

4.3. Reproducibility

In order to automate the study, we developed a Command Line Interface (CLI) application in Java. The application integrates the external tool NEON and the required external libraries (Stanford NLP to process the data, and WEKA to analyze it) using Maven. The input parameters for the application are the languages (e.g., Java, Python) to analyze, their input dataset path, and the tasks to perform. Various tasks fetch the required input data from the database, perform the analysis, and store the processed output data back in it. The results of intermediate steps are stored in the database,24 and the final results are exported as CSV automatically using Apache Commons CSV.

24 File "RP/Results/RQ2/All-steps-result.sqlite" in the Replication package
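As an illustration of the export step, a minimal Apache Commons CSV sketch looks as follows; the file name, header, and record values are hypothetical placeholders rather than the exact schema used in the replication package:

import java.io.FileWriter;
import java.io.IOException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class ResultExportSketch {
    public static void main(String[] args) throws IOException {
        // Header columns mirror the per-category results reported in Tables 6-8.
        CSVFormat format = CSVFormat.DEFAULT.withHeader(
                "language", "category", "precision", "recall", "fmeasure");
        try (CSVPrinter printer = new CSVPrinter(new FileWriter("results.csv"), format)) {
            printer.printRecord("Java", "Ownership", 0.99, 0.99, 0.99);
            printer.printRecord("Smalltalk", "Intent", 0.92, 0.92, 0.90);
        }
    }
}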
5. Threats to validity

We now outline potential threats to the validity of our study.

Threats to construct validity mainly concern the measurements used in the evaluation. To answer the RQs, we did not consider the full ecosystem of projects in each language, but selected a sample of projects for each language. To alleviate this concern to some extent, we selected heterogeneous projects used in the earlier comment analysis work on Java, Python, and Smalltalk [22, 23, 24]. The projects in each language focus on different domains, such as visualization, data analysis, or development frameworks. They originate from different ecosystems, such as Google and Apache in Java, or foundation and community projects in Python. Thus, the projects follow different comment guidelines (or coding style guidelines). Additionally, the projects are developed by many contributors (developers), which further lowers the risk of a bias towards a specific developer commenting style.

Another important issue could be due to the fact that we sampled only a subset of the extracted class comments. However, the sample size limits the estimation imprecision to 5% of error for a confidence level of 95%. To further mitigate concerns related to subjectiveness and bias in the evaluation, the truth set was built based on the judgement of four annotators (four authors of this work) who manually analyzed the resulting sample. Moreover, an initial set of 50 elements for each language was preliminarily labelled by all annotators, and all disagreements were discussed between them. To reduce the likelihood that the chosen sample comments are not representative of the whole population, we used a stratified sampling approach to choose the sample comments from the dataset, thus considering the quartiles of the comment distribution for each language.25

Another threat to construct validity concerns the definition of the CCTM and the mappings of the different language taxonomies performed by four human subjects. To counteract this issue, we used the categories defined by earlier works in comment analysis [22, 23, 24].

Threats to internal validity concern confounding factors that could influence our results. The main threat to internal validity in our study is related to the manual analysis carried out to prepare the CCTM and the mapping taxonomy. Since it is performed by human subjects, it could be biased. Indeed, there is a level of subjectivity in deciding whether a comment type belongs to a specific category of the taxonomy or not, and whether a category of one language taxonomy maps to a category in another language taxonomy or not. To counteract this issue, the evaluators of this work were two Ph.D. candidates and two faculty members, each having at least four years of programming experience. All the decisions made during the evaluation process and validation steps are reported in the replication package (to provide evidence of the non-biased evaluation), and described in detail in the paper.26 Also, we performed a two-level validation step. This validation step involved further discussion among the evaluators, whenever their opinions diverged, until they reached a final consensus.

Threats to conclusion validity concern the relationship between treatment and outcome. Appropriate statistical procedures have been adopted to draw our conclusions. To answer RQ2, we investigated whether the differences in the performance achieved by the different machine learning models with different combinations of features were statistically significant. To perform this task, we used the Friedman test, followed by a post-hoc Nemenyi test, as recommended by Demšar [41].

25 File "RP/Dataset/RQ1/Java/projects-distribution.pdf" in the Replication package
26 Folder "RP/Result/RQ1/Manually-classified-comments" in the Replication package

Threats to external validity concern the generalization of our results. The main aim of this paper is to investigate the class commenting practices for the selected programming languages. The proposed approach may achieve different results in other programming languages or projects. To limit this threat, we considered both statically- and dynamically-typed object-oriented programming languages having different commenting styles and guidelines. To reduce the threat related to the project selection, we chose diverse projects, used in previous studies about comment analysis and assessment. The projects vary in terms of size, domain, contributors, and ecosystems. Finally, during the definition of our taxonomy (i.e., CCTM) we mainly rely on a quantitative analysis of class comments, without directly involving the actual developers of each programming language. Specifically, for future work, we plan to involve developers via surveys and interviews. This step is particularly important to improve the results of our work and to design and evaluate further automated approaches that can help developers achieve a high quality of comments.

6. Related Work

Comment Classification. Comments contain various information types [16] useful to support numerous software development tasks. Recent work has categorized the links found in source code comments [50], and proposed comment categories based on the actual meaning of comments [26]. Similar to these studies, our work is aimed at supporting developers in discovering important types of information from class comments.

Steidl et al. classified the comments in Java and C/C++ programs automatically using machine learning approaches. The proposed categories are based on the position and syntax of the comments, e.g., inline comments, block comments, header comments etc. [12]. Differently from Steidl et al., our work focuses on analyzing and identifying the semantic information found in class comments in Java, Python, and Smalltalk.
Pascarella et al. presented a taxonomy of code comments for Java projects [22]. In the case of Java, we used the taxonomy from Pascarella et al. to build our Java CCTM categories. However, our work is different from the one of Pascarella et al., as it focuses on class comments in three different languages, which makes our work broader in terms of studied languages and more specific in terms of the type of code comments studied. Complementary to Pascarella et al.'s work, Zhang et al. [23] reported a code comment taxonomy in Python. Compared to the Python comment taxonomy of Zhang et al., we rarely observed the Version, Todo, and Noise categories in our Python class comment taxonomy. More importantly, we found other types of information that developers embed in Python class comments but that were not included in the Python comment taxonomy of Zhang et al., such as the Warning, Observation, Recommendation, and Precondition categories of the proposed CCTM. More generally, our work complements and extends the studies of Pascarella et al. and Zhang et al. by focusing on class comments in three different languages, which makes our work broader in terms of studied languages as well as the types of code comments reported and automatically classified.

Several studies have experimented with numerous approaches to identify the different types of information in comments [44, 51, 40, 32]. For instance, Dragan et al. used a rule-based approach to identify the stereotype of a class based on the class signature [44]. Their work is aimed at recognizing the class type (e.g., data class, controller class) rather than the type of information available within class comments, which is the focus of our work. Shinyama et al. [40] focused on discovering specific types of local comments (i.e., explanatory comments) that explain how the code works at a microscopic level inside the functions. Similar to our work, Shinyama et al. and Geist et al. considered recurrent patterns, but crafted them manually as extra features to train the classifier. Thus, our approach is different, as it is able to automatically extract different natural language patterns (heuristics), combining them with other textual features, to classify class comment types of different languages.

Further Comment Analysis. Apart from identifying information types within comments, analyzing comments for other purposes has also gained a lot of attention in the research community in the past years. Researchers are investigating methods and techniques for generating comments [2, 52], assessing their quality [10, 12, 53], detecting inconsistency between code and comments [54, 55, 56, 57], examining the co-evolution of code and comments [58, 59, 60, 61], identifying bugs using comments [62], or establishing traceability between code and comments [63, 64].

Other work has focused on experimenting with rule-based and machine-learning-based techniques and identified further program comprehension challenges. For instance, for labeling source code, De Lucia et al. [65] found that simple heuristic approaches work better than more complex approaches, e.g., LSI and LDA. Moreno et al. [17] proposed NLP and heuristic-based techniques to generate summaries for Java classes. Finally, recent research by Fucci et al. [66] studied how well modern text classification approaches can automatically identify the information types in API documentation.
Their results have shown how neural networks outperform traditional machine learning approaches and naive baselines in identifying multiple information types (multi-label classification). In the future, we plan to experiment with class comment classification using neural networks.

7. Conclusion

Class comments provide a high-level understanding of the program and help one to understand a complex program. Different programming languages have their own commenting guidelines and notations, thus identifying a particular type of information in them for a specific task, or evaluating them from the content perspective, is a non-trivial task.

To handle these challenges, we investigate the class commenting practices of a total of 20 projects from three programming languages: Java, Python, and Smalltalk. We identify more than 17 types of information in the class comments of each programming language. To automatically identify the most frequent information types characterizing these languages, we propose an approach based on natural language processing and text analysis that classifies, with high accuracy, the most frequent information types of all the investigated languages.

The contributions of our work are (i) an empirically validated taxonomy for class comments in three programming languages; (ii) a mapping of taxonomies from previous works; (iii) a common automated classification approach able to classify class comments according to a broad class comment taxonomy using various machine learning models trained on top of different feature sets; and (iv) a publicly available dataset of 37 446 class comments from 20 projects and 1 066 manually classified class comments.27

Our results highlight the different kinds of information class comments contain across languages. We found many instances of specific information types that are not suggested or mentioned by the respective coding style guidelines. To what extent developer commenting practices adhere to these guidelines is not known yet, and is part of our future work agenda. We argue that such an analysis can help in evaluating the quality of comments, as also suggested by previous works [26, 13, 12]. To investigate this aspect, we plan to extract the comment-related coding style guidelines of heterogeneous projects and compare the extracted guidelines with the identified comment information types (using our proposed approach).

Moreno et al. used a template-based approach to generate Java comments [67], where the template includes specific types of information they deem important for developers to understand a class. Our results show that frequent information types vary across systems. Using our approach to identify frequent information types, researchers can customize the summarization templates based on their software system.

27 Folder "RP/Dataset/RQ1/Java" in the Replication package

Acknowledgments

We gratefully acknowledge the financial support of the Swiss National Science Foundation for the project "Agile Software Assistance" (SNSF project No. 200020-181973, Feb. 1, 2019 - April 30, 2022). We also thank Ivan Kravchenko for helping us to extract the data for Java and Python.

References

[1] R. K. Fjeldstad, W. T. Hamlen, Application Program Maintenance Study: Report to Our Respondents, in: Proceedings GUIDE 48, 1983.
[2] S. Haiduc, J. Aponte, L. Moreno, A. Marcus, On the use of automated text summarization techniques for summarizing source code, in: 2010 17th Working Conference on Reverse Engineering, IEEE, 2010, pp. 35–44.
[3] G. Bavota, G. Canfora, M. Di Penta, R. Oliveto, S. Panichella, An empirical investigation on documentation usage patterns in maintenance tasks, in: 2013 IEEE International Conference on Software Maintenance, IEEE, 2013, pp. 210–219.
[4] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, in: Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, SIGDOC '05, ACM, New York, NY, USA, 2005, pp. 68–75. doi:10.1145/1085313.1085331.
[5] W. Maalej, R. Tiarks, T. Roehm, R. Koschke, On the comprehension of program comprehension, ACM TOSEM 23 (4) (2014) 31:1–31:37. doi:10.1145/2622669. URL http://mobis.informatik.uni-hamburg.de/wp-content/uploads/2014/06/TOSEM-Maalej-Comprehension-PrePrint2.pdf
[6] S. N. Woodfield, H. E. Dunsmore, V. Y. Shen, The effect of modularization and comments on program comprehension, in: Proceedings of the 5th international conference on Software engineering, IEEE Press, 1981, pp. 215–223.
[7] S. C. Müller, T. Fritz, Stakeholders' information needs for artifacts and their dependencies in a real world context, in: Proceedings of the 2013 IEEE International Conference on Software Maintenance, ICSM '13, IEEE Computer Society, Washington, DC, USA, 2013, pp. 290–299. doi:10.1109/ICSM.2013.40.
[8] A. Curiel, C. Collet, Sign language lexical recognition with propositional dynamic logic, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers, The Association for Computer Linguistics, 2013, pp. 328–333. URL https://www.aclweb.org/anthology/P13-2059/
[9] A. Cline, Testing thread, in: Agile Development in the Real World, Springer, 2015, pp. 221–252.
[10] N. Khamis, R. Witte, J. Rilling, Automatic quality assessment of source code comments: the JavadocMiner, in: International Conference on Application of Natural Language to Information Systems, Springer, 2010, pp. 68–79.
[11] E. Nurvitadhi, W. W. Leung, C. Cook, Do class comments aid Java program understanding?, in: 33rd Annual Frontiers in Education, FIE 2003, Vol. 1, IEEE, 2003, pp. T3C–T3C.
[12] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: Program Comprehension (ICPC), 2013 IEEE 21st International Conference on, IEEE, 2013, pp. 83–92.
[13] D. Haouari, H. A. Sahraoui, P. Langlais, How good is your comment? A study of comments in Java programs, in: Proceedings of the 5th International Symposium on Empirical Software Engineering and Measurement, ESEM 2011, Banff, AB, Canada, September 22-23, 2011, IEEE Computer Society, 2011, pp. 137–146. doi:10.1109/ESEM.2011.22.
[14] M. Farooq, S. Khan, K. Abid, F. Ahmad, M. Naeem, M. Shafiq, A. Abid, Taxonomy and design considerations for comments in programming languages: A quality perspective, Journal of Quality and Technology Management 10 (2).
[15] R. Scowen, B. A. Wichmann, The definition of comments in programming languages, Software: Practice and Experience 4 (2) (1974) 181–188.
[16] A. T. T. Ying, J. L. Wright, S. Abrams, Source code that talks: An exploration of Eclipse task comments and their implication to repository mining, SIGSOFT Softw. Eng. Notes 30 (4) (2005) 1–5. doi:10.1145/1082983.1083152.
[17] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. L. Pollock, K. Vijay-Shanker, Automatic generation of natural language summaries for Java classes, in: IEEE 21st International Conference on Program Comprehension, ICPC 2013, San Francisco, CA, USA, 20-21 May, 2013, 2013, pp. 23–32.
[18] S. Panichella, A. Panichella, M. Beller, A. Zaidman, H. C. Gall, The impact of test case summaries on bug fixing performance: An empirical investigation, in: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), 2016, pp. 547–558. doi:10.1145/2884781.2884847.
[19] S. Panichella, J. Aponte, M. D. Penta, A. Marcus, G. Canfora, Mining source code descriptions from developer communications, in: D. Beyer, A. van Deursen, M. W. Godfrey (Eds.), IEEE 20th International Conference on Program Comprehension, ICPC 2012, Passau, Germany, June 11-13, 2012, IEEE Computer Society, 2012, pp. 63–72. doi:10.1109/ICPC.2012.6240510.
[20] Python documentation guidelines, https://www.python.org/doc/, verified on 10 April 2020 (2020).
[21] A. Guzzi, A. Bacchelli, M. Lanza, M. Pinzger, A. van Deursen, Communication in open source software development mailing lists, in: Proceedings of the 10th Working Conference on Mining Software Repositories, IEEE Press, 2013, pp. 277–286.
[22] L. Pascarella, A. Bacchelli, Classifying code comments in Java open-source software systems, in: Proceedings of the 14th International Conference on Mining Software Repositories, MSR '17, IEEE Press, 2017, pp. 227–237. doi:10.1109/MSR.2017.63.
[23] J. Zhang, L. Xu, Y. Li, Classifying python code comments based on supervised learning, in: X. Meng, R. Li, K. Wang, B. Niu, X. Wang, G. Zhao (Eds.), Web Information Systems and Applications - 15th International Conference, WISA 2018, Taiyuan, China, September 14-15, 2018, Proceedings, Vol. 11242 of Lecture Notes in Computer Science, Springer, 2018, pp. 39–47. doi:10.1007/978-3-030-02934-0_4.
[24] P. Rani, S. Panichella, M. Leuenberger, M. Ghafari, O. Nierstrasz, What do class comments tell us? An investigation of comment evolution and practices in Pharo Smalltalk, arXiv preprint arXiv:2005.11583. To be published in Empirical Software Engineering.
[25] Pharo consortium, verified on 10 Jan 2020 (2020). URL http://consortium.pharo.org
[26] Y. Padioleau, L. Tan, Y. Zhou, Listening to programmers — taxonomies and characteristics of comments in operating system code, in: Proceedings of the 31st International Conference on Software Engineering, IEEE Computer Society, 2009, pp. 331–341.
[27] A. Goldberg, D. Robson, Smalltalk 80: the Language and its Implementation, Addison Wesley, Reading, Mass., 1983. URL http://stephane.ducasse.free.fr/FreeBooks/BlueBook/Bluebook.pdf
[28] I. P. Goldstein, D. G. Bobrow, Extending object-oriented programming in Smalltalk, in: Proceedings of the Lisp Conference, 1980, pp. 75–81.
[29] M. Triola, Elementary Statistics, Addison-Wesley, 2006.
[30] N. Choi, I.-Y. Song, H. Han, A survey on ontology mapping, ACM Sigmod Record 35 (3) (2006) 34–41.
[31] J. Euzenat, Towards a principled approach to semantic interoperability, in: Proc. IJCAI 2001 Workshop on Ontology and Information Sharing, Seattle, United States, 2001, pp. 19–25. URL https://hal.inria.fr/hal-00822909
[32] V. Geist, M. Moser, J. Pichler, S. Beyer, M. Pinzger, Leveraging machine learning for software redocumentation, in: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2020, pp. 622–626.
[33] A. Di Sorbo, S. Panichella, C. A. Visaggio, M. Di Penta, G. Canfora, H. C. Gall, Exploiting natural language structures in software informal documentation, IEEE Transactions on Software Engineering.
[34] A. Di Sorbo, S. Panichella, C. A. Visaggio, M. Di Penta, G. Canfora, H. Gall, DECA: development emails content analyzer, in: 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), IEEE, 2016, pp. 641–644.
[35] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, H. C. Gall, How can I improve my app? Classifying user reviews for software maintenance and evolution, in: 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2015, pp. 281–290.
[36] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999. URL http://sunsite.dcc.uchile.cl/irbook
[37] A. Di Sorbo, S. Panichella, C. A. Visaggio, M. Di Penta, G. Canfora, H. C. Gall, Development emails content analyzer: Intention mining in developer discussions (T), in: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2015, pp. 12–23.
[38] J. B. Lovins, Development of a stemming algorithm, Mech. Transl. Comput. Linguistics 11 (1-2) (1968) 22–31.
[39] V. Misra, J. S. K. Reddy, S. Chimalakonda, Is there a correlation between code comments and issues?: an exploratory study, in: SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, online event [Brno, Czech Republic], March 30 - April 3, 2020, 2020, pp. 110–117. doi:10.1145/3341105.3374009.
[40] Y. Shinyama, Y. Arahori, K. Gondow, Analyzing code comments to boost program comprehension, in: 2018 25th Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2018, pp. 325–334.
[41] J. Demšar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006) 1–30.
[42] N. Nazar, Y. Hu, H. Jiang, Summarizing software artifacts: A literature review, Journal of Computer Science and Technology 31 (5) (2016) 883–909.
[43] D. Binkley, D. Lawrie, E. Hill, J. Burge, I. Harris, R. Hebig, O. Keszocze, K. Reed, J. Slankas, Task-driven software summarization, in: 2013 IEEE International Conference on Software Maintenance, IEEE, 2013, pp. 432–435.
[44] N. Dragan, M. L. Collard, J. I. Maletic, Automatic identification of class stereotypes, in: Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM '10, IEEE Computer Society, USA, 2010, pp. 1–10. doi:10.1109/ICSM.2010.5609703.

[45] F. A. Cioch, M. Palazzolo, S. Lohrer, A documentation suite for maintenance programmers, in: Proceedings of the 1996 International Conference on Software Maintenance, ICSM '96, IEEE Computer Society, Washington, DC, USA, 1996, pp. 286–295. URL http://dl.acm.org/citation.cfm?id=645544.655870
[46] V. Rajlich, Incremental redocumentation using the web, IEEE Software 17 (5) (2000) 102–106.
[47] L. A. Alexandre, A. C. Campilho, M. Kamel, On combining classifiers using sum and product rules, Pattern Recognition Letters 22 (12) (2001) 1283–1289.
[48] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106.
[49] T. Dogan, A. K. Uysal, A novel term weighting scheme for text classification: Tf-mono, Journal of Informetrics 14 (4) (2020) 101076. doi:10.1016/j.joi.2020.101076.
[50] H. Hata, C. Treude, R. G. Kula, T. Ishio, 9.6 million links in source code comments: Purpose, evolution, and decay, in: Proceedings of the 41st International Conference on Software Engineering, IEEE Press, 2019, pp. 1211–1221.
[51] A. T. Ying, M. P. Robillard, Selection and presentation practices for code example summarization, in: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 460–471.
[52] S. Nielebock, D. Krolikowski, J. Krüger, T. Leich, F. Ortmeier, Commenting source code: Is it worth it for small programming tasks?, Empirical Software Engineering 24 (3) (2019) 1418–1457.
[53] H. Yu, B. Li, P. Wang, D. Jia, Y. Wang, Source code comments quality assessment method based on aggregation of classification algorithms, J. Comput. Appl. 36 (12) (2016) 3448–3453.
tion, in: Proceedings of the International Conference on Software Main- Comput. Appl. 36 (12) (2016) 3448–3453. tenance (ICSM 2000), 2000, pp. 40–49. doi:10.1109/ICSM.2000. [54] I. K. Ratol, M. P. Robillard, Detecting fragile comments, in: Proceedings 883003. of the 32Nd IEEE/ACM International Conference on Automated Soft- [65] A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, S. Panichella, Using ware Engineering, IEEE Press, 2017, pp. 112–122. IR methods for labeling source code artifacts: Is it worthwhile?, in: 2012 [55] F. Wen, C. Nagy, G. Bavota, M. Lanza, A large-scale empirical study on 20th IEEE International Conference on Program Comprehension (ICPC), code-comment inconsistencies, in: Proceedings of the 27th International IEEE, 2012, pp. 193–202. Conference on Program Comprehension, IEEE Press, 2019, pp. 53–64. [66] D. Fucci, A. Mollaalizadehbahnemiri, W. Maalej, On using machine [56] Y. Zhou, R. Gu, T. Chen, Z. Huang, S. Panichella, H. Gall, Analyzing learning to identify knowledge in API reference documentation, in: Pro- APIs documentation and code to detect directive defects, in: Proceed- ceedings of the 2019 27th ACM Joint Meeting on European Software ings of the 39th International Conference on Software Engineering, IEEE Engineering Conference and Symposium on the Foundations of Software Press, 2017, pp. 27–37. Engineering, 2019, pp. 109–119. [67] L. Moreno, A. Marcus, Automatic software summarization: the state of the art, in: Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, 2017, pp. 511–512.
