The Commenting Practice of Open Source

Oliver Arafat
Siemens AG, Corporate Technology
Otto-Hahn-Ring 6, 81739 München

Dirk Riehle
SAP Research, SAP Labs LLC
3410 Hillview Ave, Palo Alto, CA 94304, USA

Abstract

The development processes of open source software are different from traditional closed source development processes. Still, open source software is frequently of high quality. This raises the question of how and why open source software creates high quality and whether it can maintain this quality for ever larger project sizes. In this paper, we look at one particular quality indicator, the density of comments in open source software code. We find that successful open source projects follow a consistent practice of documenting their source code, and we find that the comment density is independent of team and project size.

Categories and Subject Descriptors D.2.8 [Metrics]: Software Science

General Terms: Measurement, Documentation, Languages

OOPSLA 2009, October 25-29, 2009, Orlando, Florida, USA. Copyright © 2009 ACM 978-1-60558-768-4/09/10…$10.00.

1. Introduction

Open source software has become an important part of commercial software development and use. Its continued growth emphasizes this importance [1]. Projects like the Linux kernel and the Apache web server demonstrate that open source software can be of high quality. Most interestingly, open source projects have reached a size and complexity that rivals the size of some of the largest commercial projects [2], yet they are being developed in a manner quite different from traditional software engineering processes.

Our research goal is to understand the processes and practices of open source software development and to assess whether they can be applied in a corporate environment. This has become particularly important because most well-known processes find it hard to scale up to larger project sizes. Traditional life-cycle processes like the waterfall model are best used in contexts where the problem domain is well understood [15]. Agile software development methods can cope with changing requirements and poorly understood problem domains, but typically require co-location of developers and fail to scale to large project sizes [16].

A host of successful open source projects in both well and poorly understood problem domains and of small to large sizes suggests that open source can cope both with changing requirements and large project sizes. In this paper we focus on one particular code metric, the comment density, and assess it across 5,229 active open source projects, representing about 30% of all active open source projects. We show that commenting source code is an ongoing and integrated practice of open source software development that is consistently found across all these projects. This practice is independent of the chosen programming language, the age of the project, the size of the project in lines of code, and the team size.

The contributions of this paper are the following:

• It assesses the metric of comment density for the first time for open source projects on a broad scale;
• It shows that commenting source code is a consistently exercised practice of open source software development;
• It reviews a variety of dependencies between properties of open source projects and their comment density.

The paper is organized as follows. Section 2 reviews our data source and the approach taken. Section 3 gives an aggregate overview of comment density in open source projects, discusses how it varies by programming language, and shows how commenting source code is a consistently followed practice in open source. Section 4 reviews the dependencies of comment density on multiple variables relevant to scaling up projects. Section 5 summarizes our conclusions and discusses threats to their validity. Section 6 reviews related work and Section 7 ends the paper with some final conclusions and an outlook on future work.

2. Data source, filters, and definitions

Our analyses use the database of the open source analytics firm Ohloh, Inc. [9]. The data is accessible through an API [10]. We work with a database snapshot of March 2008, but have cut off all analysis data after December 31st, 2007. The database contains data from about 10,000 open source projects, including project name, description, committer information, and the code contribution history of a project.

In this paper we are interested in active well-working open source projects, not dead projects. We define and apply an active project filter to let a project pass only if it is at least two years old and if the code activity of the last year has been at least 60% of the activity of the previous year. This active project filter reduces the original 10,000 projects to 5,229 projects. Using a comparable approach, Daffara estimates that there were about 18,000 active open source projects in the world by August 2007 [7], so our sample size represents about 30% of the total population.

The code contribution history is a time series of commits (code contributions) to the source code repository. A commit represents a set of changes to the source code performed as one chunk of work. When analyzing commits we apply filters to improve data quality. For example, we filter out file rename and move operations where no real work has been done.

A commit consists of multiple diffs. A diff describes the differences between two consecutive versions of the same file as changed in the commit. It is split into three parts: the number of lines of source code that have been added to the file or removed from it, the number of comment lines that have been added or removed, and the number of empty lines that have been added or removed.

• A source line of code, or SLoC, is a physical line in a source file that contains source code.
• A comment line, or CL, is a physical line in a source file that represents a comment.
• A line of code, or LoC, is either a source line of code or a comment line.
• An empty line is just that.

The Ohloh diff tool recognizes every comment character or character sequence that is defined and valid within a particular programming language, such as the triple quotes in Ruby. Furthermore, it also accounts for external markup languages such as Plain Old Documentation (POD), which is widely used in Perl. Additionally, the Ohloh diff tool recognizes comments that span multiple lines [11]. It does not, however, recognize whether a code line was changed; rather, it counts a changed line as an old line removed and a new line added. While it is not possible to determine a posteriori whether a line was changed or removed and then added, heuristics exist to predict which variant was the case. Most variants of the Unix tool diff, for example, implement such a heuristic by solving the Longest Common Subsequence problem. We have developed a statistic that determines the probabilities of whether a line was changed or removed and added [12]. Our algorithms use this statistic to determine aggregate values like commit sizes and comment densities.

The commit size of a commit is the number of lines of code affected in a commit, whether added, removed, or changed. When calculating commit sizes we apply the statistic explained above.

The comment density of a file or a group of files or the whole source code base of a project is defined as the number of comment lines divided by the number of lines of code of the same code body [4].

[Figure 1: Comment density as a function of lines of code for a given project. X-axis: Project Size in Lines of Code (LoC = CL + SLoC), logarithmic scale; y-axis: Comment Density, 0% to 100%. Annotated statistics: mean = 0.1867, median = 0.1674, stdev = 0.1088, correl = -0.00787.]

[Figure with three panels, caption not recoverable: "Comment lines as a function of project size", "Comment density as a function of project size", and "Binned distribution of projects as a function of comment density". X-axes: Project Size in Lines of Code (CL + SLoC) for the first two panels; Comment Density in bins of width 0.1 for the third. Annotated statistics: y = 0.1516x, R2 = 0.8863, mean = 0.1760, median = 0.1652, stdev = 0.0832.]
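The definitions from Section 2 (LoC as the sum of comment lines and source lines, comment density as CL divided by LoC, and the active project filter) can be illustrated with a minimal Python sketch. The class and function names below are hypothetical and are not part of the Ohloh API or of the authors' tooling. The measure of "code activity" and the treatment of an empty code body are assumptions, since this section does not specify them.

```python
from dataclasses import dataclass


@dataclass
class LineCounts:
    """Line counts for a code body (hypothetical container, not Ohloh's schema)."""
    sloc: int       # source lines of code
    cl: int         # comment lines
    empty: int = 0  # empty lines; they do not count toward LoC

    @property
    def loc(self) -> int:
        # LoC = CL + SLoC, following the definitions in Section 2
        return self.cl + self.sloc


def comment_density(counts: LineCounts) -> float:
    """Comment lines divided by lines of code of the same code body."""
    if counts.loc == 0:
        # Assumption: the text does not say how an empty code body is handled.
        return 0.0
    return counts.cl / counts.loc


def is_active(age_years: float, activity_last_year: int,
              activity_previous_year: int) -> bool:
    """Active project filter: at least two years old, and last year's code
    activity at least 60% of the previous year's activity.

    Assumption: activity is given as a non-negative integer count (for
    example commits or affected lines); the unit is not stated in the text.
    """
    # Integer arithmetic avoids floating-point rounding at the 60% boundary.
    return age_years >= 2 and 100 * activity_last_year >= 60 * activity_previous_year
```

For example, a code body with 80 source lines and 20 comment lines has 100 lines of code and a comment density of 0.2, regardless of how many empty lines it contains.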
