What Does It Take to Develop a Million Lines of Open Source Code?

Juan Fernandez-Ramil (1,3), Daniel Izquierdo-Cortazar (2), and Tom Mens (3)

1 Université de Mons, Mons, Belgium. {j.f.ramil, tom.mens}@umons.ac.be
2 Universidad Rey Juan Carlos, Madrid, Spain. [email protected]
3 The Open University, Milton Keynes, U.K. [email protected]

Abstract. This article presents a preliminary and exploratory study of the relationship between size, on the one hand, and effort, duration and team size, on the other, for 11 Free/Libre/Open Source (FLOSS) projects with current size ranging between 0.6 and 5.3 million lines of code (MLOC). Effort was operationalised based on the number of active committers per month. The extracted data did not fit well an early version of the COCOMO cost estimation model for proprietary software, overall suggesting that, at least to some extent, FLOSS communities are more productive than closed-source teams. This also motivated the need for FLOSS-specific effort models. As a first approximation, we evaluated 16 linear regression models involving different pairs of attributes. One of our experiments was to calculate the net size, that is, to remove any suspiciously large outliers or jumps in the growth trends. The best model we found involved effort against net size, accounting for 79 percent of the variance. This model was based on data excluding a possible outlier (Eclipse), the largest project in our sample. This suggests that different effort models may be needed for certain categories of FLOSS projects. Incidentally, for each of the 11 individual FLOSS projects we were able to model the net size trends with very high accuracy (R2 ≥ 0.98). Of the 11 projects, 3 have grown superlinearly, 5 linearly and 3 sublinearly, suggesting that in the majority of the cases accumulated complexity is either well controlled or does not constitute a growth-constraining factor.

Keywords: baselines, complexity, COCOMO, cost, economics, effort estimation, effort operationalisation, empirical studies, large software, free software, metrics, open source, productivity, software evolution.

1 Introduction

Software development productivity and cost estimation have been research topics for more than three decades (e.g., [21] [3] [18]). The vast majority of these studies involved data from closed-source projects. Little seems to be known about effort models for large, long-lived FLOSS projects. The present paper^1 presents our initial exploratory results in the study of the relationship between size, effort, duration and number of contributors in eleven large, long-lived FLOSS projects.

FLOSS communities do not, in general, use effort estimation models or other measurement-based models [1]. As a research topic, the exploration and quantification of effort and productivity may help in comparing FLOSS projects and proprietary systems. This research may provide a basis for baselines, in order to evaluate the possible impact of changes in the processes, methods and tools used by FLOSS communities. It may also help, in some way, in understanding and better planning the future evolution of FLOSS communities and their projects^2.

Before presenting our approach and results, an important limitation of this type of research must be mentioned. It has been pointed out (e.g., [11] [13]) that it is difficult to measure effort accurately in FLOSS. This is so for several reasons: for example, we do not know the degree of involvement of each contributor (Was it full-time? Was it some level of part-time?) and contributors may have widely different roles. Despite our best efforts to measure effort in a meaningful way, we are aware of the limitations of our approach, and this is why the results reported here must be seen as exploratory and preliminary. Additional research is needed to achieve more reliable measures of the effort involved in FLOSS projects. We hope, however, that our contribution can help establish FLOSS effort modelling as a line of research and, eventually, lead to scientific insights and practical tools for the FLOSS communities and other interested stakeholders.

2 Selected projects and measurements

Our research question was: how much effort would be required to develop (via evolution) a FLOSS project of one MLOC (i.e., 1000 KLOC)? We chose eleven projects, all of which exceed 1 MLOC (one 0.6 MLOC project in our sample achieved this size at some point before decreasing) and have code repositories compatible with our data extraction tools. Table 1 shows the names of the eleven FLOSS projects we considered, the programming language(s) primarily used in each system, a short description and the life span, that is, the time in years between the first publicly recorded commit to the code repository (CVS or Subversion) and the data extraction date. This was October 2008, with the exception of one system (Eclipse), for which it was April 2008.

The measurements extracted for this study are listed in Table 2. We used the SLOCCount tool^3 to measure lines of source code. It automatically identifies code developed in a variety of programming languages and, for each of them, counts the lines of code present in a directory. We used CVSAnalY^4, which stores information extracted from the version control log (CVS or Subversion) in a MySQL database.

1 This is a revised and extended version of [7].
2 See [12] for a fuller justification for research into FLOSS effort models.
3 www.dwheeler.com/sloccount/
4 svn.forge.morfeo-project.org/svn/libresoft-tools/cvsanaly

Table 1. FLOSS systems studied and some of their characteristics

name         primary language  description                                     life span in years
Blender      C/C++             cross-platform tool suite for 3D animation      6
Eclipse      Java              IDE and application framework                   6.9
FPC          Pascal            Pascal compiler                                 3.4
GCC          C/Java/Ada        GNU Compiler Collection                         19.9
GCL          C/Lisp/ASM        GNU Common Lisp                                 8.8
GDB          C/C++             GNU Debugger                                    9.5
GIMP         C                 GNU Image Manipulation Program                  10.9
GNUBinUtils  C/C++             collection of binary tools                      9.4
NCBITools    C/C++             libraries for biology applications              15.4
WireShark    C                 network traffic analyser                        10
XEmacs       Lisp/C            text editor and application development system  12.2

Specifically, we used the data from CVSAnalY indirectly, by downloading databases from the FLOSSMetrics project^5.

Table 2. Measured attributes for each system

abbreviation  description                                              extracted using
KLOC          physical lines of source code (in thousands)             SLOCCount
FILES         total number of files in the code repository             CVSAnalY
EFFORT        'contributor-years' (based on active committer-months)   CVSAnalY
DUR           time length of 'active' evolution in years               CVSAnalY
DEV           number of distinct contributors                          CVSAnalY

In order to operationalise the effort variable, we proceeded as follows. First, we ran a query against the CVSAnalY database of each system to determine the number of distinct person ids contributing to its repository in a given month. This value was taken as the number of person-months for that month. The total number of person-months, accumulated over the lifetime of the code repository, was then divided by 12 to obtain the equivalent in person-years, which is the value we assume to be the total effort spent on a system since code commits started to be recorded (EFFORT).

As said in Section 1, measuring effort in FLOSS projects is notably difficult. Our effort measurement is likely to be, in some cases, an overestimation, since not all contributors are likely to work full-time on a project (with the exception of some company-supported projects). Interestingly, in some projects one or several gatekeepers submit code developed by others; where this happens, our EFFORT values become an underestimation of the real effort. This is clearly a topic for further research. We present our operationalisation of effort as a starting point that can be improved in the future by using additional information about, for example, the social networks and work patterns of contributors in a given project, and eventually by advanced tools that register effort automatically.
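As an illustration of this operationalisation, the sketch below computes EFFORT from a CVSAnalY-style MySQL database. The table and column names (scmlog, committer_id, date) and the connection details are assumptions for illustration only; the actual CVSAnalY schema may differ.

```python
import pymysql  # assumed MySQL client library


def contributor_years(db_name: str) -> float:
    """EFFORT in 'contributor-years': each committer active in a month
    counts as one person-month; person-months are accumulated over the
    whole repository history and divided by 12."""
    conn = pymysql.connect(host="localhost", user="user",
                           password="secret", database=db_name)
    try:
        with conn.cursor() as cur:
            # One row per (month, committer) pair; counting the rows
            # gives the total number of person-months.
            cur.execute("""
                SELECT COUNT(*) FROM (
                    SELECT DISTINCT DATE_FORMAT(date, '%Y-%m') AS month,
                                    committer_id
                    FROM scmlog
                ) AS active_committer_months
            """)
            person_months = cur.fetchone()[0]
    finally:
        conn.close()
    return person_months / 12.0
```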

5 data.flossmetrics.org


Fig. 1. Trend in FILES (left) and netFILES (right) per month for GCL. The right-hand-side plot corresponds to the central part of the left-hand-side plot, after major ‘jumps’ have been removed.

In 9 out of the 11 projects studied (that is, all except Eclipse and Wireshark), the plot of the total number of files (FILES) in the repository per month shows a few suspiciously large outliers or jumps. This is illustrated for one of the systems (GCL) on the left of Figure 1. It is unlikely that the team size or the productivity of the contributors increased so suddenly in a particular month. It could be that the code added in a jump was created and evolved in a private space (or in a branch separate from the main version or trunk) for which we have no active-committer information. It could also be that each jump corresponds to an external event, such as the repository receiving chunks of externally generated code. An additional hypothesis is that many non-code files were added to the repository, since FILES includes all files, code and non-code.^6 Whatever the actual explanation, the implication for our research is that measuring the total size, including jumps, may lead to overly optimistic estimates (productivity may appear higher than it really is). We filtered out the jumps by removing their corresponding size increments, that is, by calculating what we call the net size. We also subtracted the initial size value, since we have no effort records for the initial commit. Figure 1 shows the result for GCL: on the left, the trend in FILES with several large jumps; on the right, the filtered trend, which we call netFILES.

We measured project duration DUR as the time between the earliest and most recent commit, excluding periods with no commits or with a number of commits much lower than during other, clearly active, periods. We did this in order not to penalise projects with periods of inactivity when modelling the relationship between size and duration. This filtering determined the value of DUR that we assumed for the GCL project (4.3 years instead of 8.8 years, as can be seen on the right of Figure 1). We did a similar filtering for GCC, with a resulting DUR of 11.2 instead of 19.9 years.

For calculating the number of contributors DEV, we obtained the number of people having made at least one commit since the start of the repository.
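A minimal sketch of the jump filtering described above, applied to a monthly FILES series. The fixed threshold is an assumption made for illustration; in our study the jumps were identified visually rather than by a cut-off.

```python
def net_growth(monthly_files: list[int], jump_threshold: int = 500) -> list[int]:
    """Cumulative 'net' size series (netFILES): monthly increments at or
    above the threshold are treated as jumps and discarded. Starting the
    accumulator at zero also subtracts the initial size, for which no
    effort records exist."""
    net, total = [], 0
    for prev, curr in zip(monthly_files, monthly_files[1:]):
        increment = curr - prev
        if increment < jump_threshold:  # keep ordinary growth only
            total += increment
        net.append(total)               # jumps contribute nothing
    return net
```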

6 SLOCCount automatically discards non-code files but we applied this tool purely to measure the size of the latest repository for each system. It would have been convenient to apply the tool to obtain monthly measurements of size but this was beyond our capability.

Table 3. Best fit models for monthly growth trends in netFILES

system       best growth trend model  R2
Blender      linear                   0.99
Eclipse      linear                   0.99
FPC          linear                   0.98
GCC          superlinear              0.99
GCL          sublinear                0.98
GDB          linear                   0.99
GIMP         sublinear                0.99
GNUBinUtils  linear                   0.99
NCBITools    superlinear              0.98
WireShark    superlinear              0.99
XEmacs       sublinear                0.98

We computed two variants, one measuring all the contributors, called DEV[100], and one measuring the number of developers who provided at least 80% of the effort in contributor-months, called DEV[80]. We did this in order to distinguish the core team, which generates and evolves most of the code, from other, less active, contributors who may have contributed only a few items (e.g., defect fixes) over the lifetime of the project.
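One plausible reading of the DEV[80] computation is sketched below: contributors are ranked by their contributor-months and the smallest group accounting for at least 80% of the total is taken as the core team. The exact selection rule is not spelled out above, so this is an interpretation, not the study's code.

```python
def dev_80(months_per_contributor: dict[str, int]) -> int:
    """Size of the core team: the smallest set of contributors that
    accounts for at least 80% of all contributor-months."""
    total = sum(months_per_contributor.values())
    covered, core = 0, 0
    for months in sorted(months_per_contributor.values(), reverse=True):
        covered += months
        core += 1
        if covered >= 0.8 * total:
            break
    return core

# DEV[100] is simply len(months_per_contributor): everyone who made
# at least one commit since the start of the repository.
```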

3 Results

3.1 Growth models

We modeled the netFILES, the filtered growth trends, over months. We found a very good fit to linear, quadratic or exponential models, with R2 values7 ranging from 0.98 to 0.99, as can be seen in Table 3. It is worth noting that for GCC and GCL, the growth model is fitted only over the period of active evolution). Five of the systems have trends that follow linear models, three follow sublinear (quadratic) models and three superlinear models (2 quadratic and 1 exponential). The high goodness of fit suggests that work in these FLOSS communities happens at a regular, very predictable rate. The net growth trends can be used as a simple estimation model of the amount of new functionality implemented per month, assuming that there are no radical changes such as, for example, the departure of key members in the core team, or an important architectural restructuring. Careful projection of these growth trend models into the future could be helpful for a given project and can be the basis for project specific effort models. In the next sections we examine possible generic models, that are based on the data from the 11 FLOSS projects studied and may have potential to be generalised, that is, used across systems.
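A sketch of the trend-fitting procedure just described. The three candidate forms (linear, quadratic, exponential) come from the text; the use of scipy, the starting values and the selection by R2 are illustrative choices, not the original analysis, which was done with spreadsheet software (cf. Section 5.2).

```python
import numpy as np
from scipy.optimize import curve_fit


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def best_growth_model(netfiles):
    """Fit the three candidate growth forms to a monthly netFILES series
    and return the best-fitting form with all R2 scores. Whether a
    quadratic fit counts as sub- or superlinear depends on the sign of
    its curvature, which is inspected separately."""
    t = np.arange(1, len(netfiles) + 1, dtype=float)
    y = np.asarray(netfiles, dtype=float)
    candidates = {
        "linear":      (lambda t, a, b: a * t + b, (1.0, 1.0)),
        "quadratic":   (lambda t, a, b, c: a * t ** 2 + b * t + c,
                        (1.0, 1.0, 1.0)),
        "exponential": (lambda t, a, b: a * np.exp(b * t),
                        (max(y[0], 1.0), 0.01)),
    }
    scores = {}
    for name, (f, p0) in candidates.items():
        try:
            params, _ = curve_fit(f, t, y, p0=p0, maxfev=10000)
            scores[name] = r_squared(y, f(t, *params))
        except (RuntimeError, ValueError):  # fit failed to converge
            scores[name] = float("-inf")
    return max(scores, key=scores.get), scores
```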

7 In this paper we used R2, the coefficient of determination, as our exploratory measure of goodness of fit, where a value of 1 indicates a perfect fit and a value of 0 indicates no fit. Further work should evaluate other measures such as the adjusted R2, the mean (or median) magnitude of relative error (MMRE) or the Akaike Information Criterion (AIC).

3.2 Comparison of FLOSS data with the COCOMO model

The COnstructive COst MOdel (COCOMO) is a classic estimation model developed by Barry Boehm and colleagues, based on data from closed-source systems. Some of its assumptions are described in [12]. As a starting point, we used here its earliest and possibly simplest version [3], called COCOMO 81, published in 1981. Given that the COCOMO model has itself several variants, we used the COCOMO predictions generated by default by the SLOCCount tool. In this comparison, we are assuming that COCOMO is representative in the range of size values similar to those of the studied FLOSS systems^8. Figures 2 to 5 show the studied FLOSS systems as squares, and the COCOMO model as a solid line. In the four figures, the x-axis represents size in KLOC and the y-axis represents EFFORT, DUR, DEV[100] and DEV[80], respectively. For illustration purposes and as an initial exploration, we superimposed a linear regression model on the FLOSS data, shown as a dashed line. The figures display the mathematical expressions for COCOMO (power models) and for the linear models based on the FLOSS data. Examining Figures 2 to 5, one can make the following observations:

• 10 out of 11 projects display EFFORT lower than that predicted by COCOMO, overall suggesting higher productivity (as KLOC per contributor-month) in FLOSS than in closed-source. Only the project with the smallest KLOC in the sample (GIMP) is close to COCOMO's prediction. There is evidence in [10] that GIMP grew up to 1 MLOC in the past and then shrank to its current size (646 KLOC). This could explain why GIMP is the project with the lowest productivity in the sample.

• 8 out of 11 projects display values of DUR higher than the COCOMO predictions. This may be explained by the fact that the FLOSS projects are systems evolving over multiple releases, while COCOMO represents software built from scratch to achieve its first release. FLOSS projects do not have, in general, a fixed schedule and budget, and they continue to evolve as long as there is interest and motivation in their communities.

• 6 out of 11 projects show values of DEV[100] higher than COCOMO. When looking at DEV[80], however (i.e., our approximation to the number of people in the core team), only 1 out of 11 projects (namely GIMP) exceeds the team size predicted by COCOMO.

• Overall, COCOMO 81 is not a good model for the FLOSS data.

• The goodness of fit of the linear models fitted to the FLOSS data is poor. The best linear model is the one relating EFFORT to KLOC, with an R2 value of 0.34. This means that only 34 percent of the variance in the data is accounted for by the linear model.
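For reference, the sketch below reproduces the basic 'organic' COCOMO 81 equations, which are what SLOCCount reports by default: effort in person-months is 2.4 * KLOC^1.05 and schedule in months is 2.5 * effort^0.38. Only the unit conversions and the function name are ours.

```python
def cocomo81_organic(kloc: float) -> dict[str, float]:
    """Basic COCOMO 81, 'organic' mode, converted to the units used in
    this paper (years rather than months)."""
    effort_pm = 2.4 * kloc ** 1.05         # person-months
    schedule_m = 2.5 * effort_pm ** 0.38   # months
    return {
        "effort_person_years": effort_pm / 12,
        "duration_years": schedule_m / 12,
        "avg_team_size": effort_pm / schedule_m,
    }

# cocomo81_organic(1000) gives roughly 283 person-years over 4.6 years
# with an average team of about 62 developers: the COCOMO 81 figures
# quoted in Section 3.4 for a 1 MLOC system.
```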

8 Unfortunately we could not check this when finalising this paper. There is a risk that in our analysis we used COCOMO 81 beyond its range of validity.


Fig. 2. EFFORT vs KLOC: FLOSS data (N=11) and COCOMO model.


Fig. 3. DUR vs KLOC: FLOSS data (N=11) and COCOMO model.

3.3 FLOSS-based estimation models

In search of possible simple FLOSS-based generic estimation models, we further explored the linear correlation between size (measured not only in KLOC but also in FILES) and EFFORT, DUR and DEV. We also defined

netKLOC = KLOC * (netFILES / FILES)

and evaluated whether the net size values (netFILES and netKLOC) provided an improvement in the regression results obtained using KLOC and FILES. Tables 4 and 5 show the parameters of linear models of the form y = (a * size) + b and the corresponding R2 values, both for all the systems and excluding Eclipse; Table 4 covers EFFORT and DUR, and Table 5 the DEV metrics. Parameters a and b are not reported when R2 is less than 0.1, because they are then considered not meaningful. Figure 6 shows a scatter diagram with one of the linear models superimposed as a line.
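The following sketch shows the two ingredients of this model family: the netKLOC scaling and an ordinary least-squares fit of y = a * size + b with its R2, as reported in Tables 4 and 5. The use of numpy is our illustrative choice; the original fits were produced with spreadsheet software (cf. Section 5.2).

```python
import numpy as np


def net_kloc(kloc: float, net_files: int, files: int) -> float:
    # Scale KLOC by the fraction of files that survives jump filtering.
    return kloc * net_files / files


def fit_linear(size, y):
    """Ordinary least-squares fit of y = a * size + b, returning the
    parameters a, b and the coefficient of determination R2."""
    size = np.asarray(size, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = np.polyfit(size, y, deg=1)
    y_hat = a * size + b
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return a, b, 1.0 - ss_res / ss_tot
```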


Fig. 4. DEV[100] vs KLOC: FLOSS data (N=11) and COCOMO model.


Fig. 5. DEV[80] vs KLOC: FLOSS data (N=11) and COCOMO model.

In Figure 6 it is possible to see that Eclipse, represented by the square closest to the right side of the plot, is one of the possible outliers. 13 out of 16 models improved, in terms of R2, when Eclipse was excluded from the dataset. In Tables 4 and 5, the best models are marked with an asterisk. The best linear model, shown in Figure 7, is the one involving EFFORT as a linear function of netFILES, excluding Eclipse (R2 value of 0.797). These results suggest that it may be helpful to have a separate set of models for FLOSS systems with particular characteristics (e.g., very large systems, such as those greater than 5 MLOC; systems based on different variants of the FLOSS process, company-led as opposed to community-led; or particular technologies, since Eclipse was the only Java-based system in the sample).

Table 4. Linear regression results: parameters a, b and R2 values (best models marked with *)

                   EFFORT (a, b, R2)         DUR - duration (a, b, R2)
including Eclipse
  KLOC             0.086, 29.909, 0.339      -, -, 0.012
  FILES            0.0039, 119.58, 0.390     -, -, 0.001
  netKLOC          0.076, 110.39, 0.326      -, -, 0.048
  netFILES         0.0035, 156.61, 0.313     -, -, 8.8E-05
excluding Eclipse
  KLOC             0.1327, -53.323, 0.387    0.0015, 6.1233, 0.153
  FILES            0.0093, 3123, 0.66        -, -, 0.06
  netKLOC          0.1699, 14.247, 0.499     0.0032, 5.4474, 0.525 *
  netFILES         0.0139, 51.455, 0.797 *   -, -, 0.094

Table 5. Linear regression results for the DEV metrics (best models marked with *)

                   DEV[100] (a, b, R2)       DEV[80] (a, b, R2)
including Eclipse
  KLOC             0.027, 88.54, 0.101       0.0095, 29.57, 0.133
  FILES            0.0017, 103.97, 0.219     0.0006, 35.14, 0.285
  netKLOC          0.0287, 106.56, 0.139     0.0105, 35.38, 0.196
  netFILES         0.0016, 119.2, 0.185      0.006, 40.07, 0.258
excluding Eclipse
  KLOC             -, -, 0.088               -, -, 0.07
  FILES            0.004, 62.831, 0.391      0.0012, 24.787, 0.367
  netKLOC          0.0626, 71.879, 0.196     0.0183, 27.401, 0.185
  netFILES         0.0143, -19.009, 0.565 *  0.002, 25.998, 0.506 *

Contrary to what was expected, the removal of jumps in the monthly growth of the studied systems did not lead to visible improvements in the goodness of fit of the linear models. The best models involved EFFORT vs size, but still with a low R2 value of around 0.3. The worst models were obtained for DUR: as shown in Figure 3, the studied FLOSS projects do not seem to have reached their current sizes at similar rates.

3.4 Calculating the ‘cost’ of 1 MLOC in FLOSS

In order to address our research question, given in the title of this paper and, more precisely, in Section 2, we present some simple calculations based on the best models identified in the previous section. Since our best model is based on file counts, we need to convert 1 MLOC into a number of files. For the 11 FLOSS projects studied, the average file size varied between 50 and 284 lines of code, with an overall average of 125 lines of code per file.


Fig. 6. EFFORT vs netFILES: FLOSS data (N=11) and linear regression model.


Fig. 7. EFFORT vs netFILES: FLOSS data, excluding Eclipse (N=10) and linear regression model.

Using this average^9, we assume that 1 MLOC will be equivalent to some 8,000 code files. Taking this value and using our best model, 'EFFORT vs netFILES excluding Eclipse', we obtain an estimate of 162.6 contributor-years. Applying the model 'DUR vs netKLOC excluding Eclipse', we obtain an estimated duration of 8.6 years. Dividing 162.6 by 8.6 gives a core team size of 18.8 contributors. Alternatively, if we use the model 'DEV[80] vs netFILES excluding Eclipse', we obtain a team size of 41.9 core developers. Dividing 162.6 by 41.9 gives an estimated duration of 3.8 years. The two values of DEV and DUR can be seen, respectively, as lower and upper bounds of the estimate. These values differ from the COCOMO 81 estimates. For a 1 MLOC project, COCOMO 81 predicts the need for 283 contributor-years (higher than the FLOSS estimates), with 62 developers (also higher than FLOSS) working during 4.6 years (within range, towards the lower end). Our simple models are in line with Section 3.2, suggesting that FLOSS projects require less total effort and a smaller number of developers (core contributors) than closed-source. Our primitive analysis also suggests that closed-source would perform comparatively better than FLOSS for duration.

9 This is a simple calculation for illustration. In general, average values of software metrics should either be used with care or not used at all, because data distributions may be highly skewed and in that case averages are not meaningful [5].
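The arithmetic of this section, reproduced below with the 'excluding Eclipse' coefficients from Tables 4 and 5 (a sketch for checking, not new analysis; small differences in the last decimal come from rounding).

```python
AVG_LOC_PER_FILE = 125                    # overall sample average
net_files = 1_000_000 / AVG_LOC_PER_FILE  # 1 MLOC ~ 8,000 code files

effort = 0.0139 * net_files + 51.455      # EFFORT vs netFILES: ~162.6 years
duration = 0.0032 * 1000 + 5.4474         # DUR vs netKLOC (1 MLOC): ~8.6 years
core_team = 0.002 * net_files + 25.998    # DEV[80] vs netFILES: ~42 developers

print(f"{effort:.1f} contributor-years over {duration:.1f} years "
      f"-> core team of {effort / duration:.1f}")   # ~18.8 contributors
print(f"with {core_team:.1f} core developers "
      f"-> duration of {effort / core_team:.1f} years")  # ~3.9 years
```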

4 Related work

In recent years, researchers have increasingly extracted data from FLOSS artefacts, code repositories, defect databases and mailing lists in order to find interesting facts (e.g., [17] [20]). To our knowledge, there have been no previous similar studies of large FLOSS software attempting to quantify the relationships between size, effort, duration and team size. In a position paper, Amor et al. [1] discussed the problem of effort estimation in a FLOSS project and how classical estimation models may be improved to be applicable to FLOSS projects. Liguo Yu [23] presented an indirect approach to estimating maintenance effort in FLOSS, based on closed-source data from the NASA SEL. He measured effort indirectly by looking at the number of lines of code (and modules) added, deleted and modified. His work focused on the amount of effort for individual maintenance tasks and, for this reason, is not directly comparable to ours, which focused on total effort at system level. Koch [11] [13] studied the impact of tool usage on the efficiency of 30 FLOSS projects from SourceForge.net, finding, in most of the cases, an intriguing negative relationship. In a different paper, the same author [12] reported on the testing of two hypotheses involving different types of effort models, including COCOMO, in a sample of more than 8,000 projects from SourceForge.net. Differences in approach, and the fact that the vast majority of the projects in that study were small (involving fewer than 10 developers), make it difficult to compare to ours, which focused on larger projects.

The majority of software estimation research has been conducted for closed-source software and initial development (e.g., [18]). One frequently cited software estimation model, based on closed-source software, is COCOMO [3]. We have compared COCOMO 81, the earliest and simplest version, to our FLOSS data, and found that it is not appropriate for modelling the productivity of our subset of large FLOSS projects. Much more research needs to be done (e.g., repeating the comparison using COCOMO II [4] and other more recent models), but our initial finding suggests that FLOSS-specific models are advisable. This is not surprising if one considers the particular characteristics of FLOSS (e.g., [19]).

The proportion of linear, superlinear and sublinear growth trends (cf. Table 3) is generally in agreement with some previous empirical studies of FLOSS growth trends [9]. The predominance of linear and superlinear trends might call into question the assumed role of accumulated complexity [15] [16] as a growth-constraining factor. An alternative hypothesis to be studied is that linearly and superlinearly growing projects are effective at controlling complexity. As far as we know, our work is the first to explore the impact of removing outliers or jumps in the monthly growth trends. In future work this type of outlier detection needs to be done more systematically than our simple visual detection.

5 Threats to Validity

Our empirical study is subject to many threats to validity. A list of threats that are likely to apply in general to empirical studies of FLOSS is given in [8]. One can identify three sets of threats: threats to construct validity, internal validity and external validity.

5.1 Construct validity

Our estimation of effort is far from perfect: accurately measuring effort in open source projects is difficult. We counted the number of active committers (i.e., developers who checked code into the repositories in a given month) and assumed that each contributed one 'contributor-month'. Many developers are part-time volunteers, and others may be paid employees of sponsoring companies working part-time or full-time on a given FLOSS project. It is known that in some cases code is committed to the repository by a gatekeeper, not by the actual developer; this may also bias the effort measures. One way to improve the measurement of effort would be to conduct a survey of FLOSS contributors to better understand their work patterns and use this knowledge to adjust our measurements. Surveys, however, may require considerable research time and resources, including the willingness of FLOSS contributors to participate. Our effort measurement may be more accurate for FLOSS projects like Eclipse, where a portion of the contributors are full-time employees of a sponsoring company (in this case, IBM).

Our measures of size (KLOC and FILES) may include automatically generated code, which may have biased the results: a system will be bigger than it would have been if all its code had been written and evolved manually. The measured repositories may include external libraries, with or without modification. Code may also be ported from one FLOSS project to another, biasing productivity measures; we were not able to quantify how many lines of code were ported, as opposed to written from scratch. In addition, our FILES measure includes all files in the repository. We plan, in the future, to measure only the code files and check whether the modelling results improve. Cloning, or code duplication within the same project [2], is also a phenomenon present in open source. Code cloning may increase productivity, but it may also have the opposite effect, since simultaneously evolving many clones may slow down progress.

5.2 External and internal validity

The sample of projects that we studied is very small when compared with the total number of open source projects. A popular open source hosting website, sourceforge.net, currently lists more than 300,000 projects. However, only a small fraction of the total number of open source projects can be considered successful [6], and even smaller is the number of projects that have reached 1 MLOC. The sample we studied is not truly random: we selected projects that were feasible to analyse, based on their availability in the FLOSSMetrics^10 project database.

10 flossmetrics.org

The tools we used in this study to extract and analyse the data may contain defects that could have affected the results. We collected, analysed and plotted the data using spreadsheet software which, despite our best efforts, is also error-prone. Unfortunately we did not have enough time for an independent data extraction and analysis, but we hope that other researchers will replicate and extend this work in the future. We are not aware of any other possible threats to internal validity apart from the measurement issues already mentioned.

6 Further work

This research could be continued in a number of ways. For example, our FILES measurement considers all files in the repository (e.g., code, configuration, data, web pages); it is likely that better results will be achieved by considering code files only. Better results may also be achievable by using robust regression [14], as sketched at the end of this section. Modelling techniques other than regression have been tried in classical (i.e., closed-source or proprietary) cost estimation (e.g., [18]), and these could also be applied to FLOSS data. We would also like to exclude any automatically generated files, since they will bias the results (e.g., productivity may appear higher than it is).

Another topic for further work is experimentation with different approaches to extracting the outliers in growth trends. In Figure 1 (right) one can identify 'jumps' that were not apparent when looking at the unfiltered data (left). Open questions are how to formally define a growth outlier, how to identify its nature, and how frequently such outliers occur in FLOSS evolution. It would not be surprising if the presence of jumps is typical of FLOSS processes, where external code is 'borrowed' and re-used.

By measuring the amount and type of refactoring work (e.g., [22]), code transformations that preserve functionality and decrease complexity, in FLOSS projects, and combining this data with effort measures, we may be able to better understand the impact of accumulated complexity and why the size of some projects has been growing linearly or superlinearly.

In order to improve this research (e.g., increase the validity of the findings), one should analyse a larger number of FLOSS projects. The linear models we used are simple and do not reflect any theory about FLOSS. Theories based on a deeper understanding of FLOSS (e.g., [19]) and new metrics (e.g., [10]) may lead to more accurate and helpful effort models.
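As an example of the robust-regression direction mentioned above, the sketch below fits a Huber M-estimator, which down-weights outlying projects (such as Eclipse) instead of removing them by hand. The use of statsmodels is our assumption; any robust fitting package would serve equally well.

```python
import numpy as np
import statsmodels.api as sm


def fit_robust(size, effort):
    """Huber M-estimated line effort = a * size + b; outlying projects
    are down-weighted rather than excluded from the sample."""
    X = sm.add_constant(np.asarray(size, dtype=float))
    model = sm.RLM(np.asarray(effort, dtype=float), X,
                   M=sm.robust.norms.HuberT())
    return model.fit().params  # [intercept b, slope a]
```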

7 Conclusions

Getting people to develop (via evolution) software systems of one million lines of code (1 MLOC) or more, at no cost, as a hobby or leisure activity, is an impressive achievement of a number of FLOSS communities. Rather than studying motivational factors, in this paper we took a strictly quantitative point of view, gathering data on eleven large FLOSS systems and studying the relationship between size, on the one hand, and effort, duration and team size, on the other. Productivity for large FLOSS systems like these is a topic that does not seem to have been studied empirically.

The comparison between data from the 11 FLOSS projects and an early version of COCOMO [3] suggested that FLOSS might be more effective than closed-source projects. It also suggested that FLOSS-based effort estimation models are needed. Our best model to date, using simple linear regression, still leaves much room for improvement (R2 = 0.79). Better models may be obtained by exploring theory-based models, improving the measurement approach, adding more FLOSS systems to the sample and using techniques more advanced than linear regression.

One of the most interesting findings of our study was that, when removing suspiciously large jumps in the growth trends of FLOSS projects over time (i.e., calculating the net size), simple regression trends can model this 'net' growth very well. In agreement with previous studies [9], these trend models revealed either linear, sublinear or superlinear growth (depending on the system studied). These regular growth trends could be a surprise to FLOSS community leaders and stakeholders and may motivate them to quantitatively examine the evolution of their projects.

We showed how our best models could be used to estimate effort, duration and team size for the development (via evolution) of a 1 MLOC system. Since FLOSS projects seek mainly to attract volunteers and not to hire paid professionals, the current practical use of effort models will be very limited. In the future, however, this type of measurements and models may find their way into helping FLOSS communities achieve their goals in a more satisfactory manner, probably in a yet unforeseen way.

Acknowledgements

We are grateful to Andrea Capiluppi for comments on an early draft of this paper. We thank the anonymous reviewers for their helpful comments. This work has been funded in part by the European Commission, under the FLOSSMETRICS (FP6-IST-5-033547) and QUALOSS (FP6-IST-5-033547) projects, and by the Spanish CICyT, project SobreSalto (TIN2007-66172). The research reported here was carried out in the context of the Action de Recherche Concertée AUWB-08/12-UMH 19 funded by the Ministère de la Communauté française - Direction générale de l'Enseignement non obligatoire et de la Recherche scientifique. We are grateful to the Belgian F.R.S.-F.N.R.S. for funding the work of one co-author (JFR) through postdoctoral scholarship 2.4519.05.

References

1. Amor, J.J., Robles, G., Gonzalez-Barahona, J.M.: Effort estimation by characterizing developer activity. In: EDSER '06: Proceedings of the 2006 International Workshop on Economics Driven Software Engineering Research, pp. 3–6. ACM, New York, NY, USA (2006)
2. Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E.M.: Comparison and evaluation of clone detection tools. IEEE Trans. Software Engineering, pp. 577–591 (2007)
3. Boehm, B.: Software Engineering Economics. Prentice Hall (1981)
4. Boehm, B.W., Abts, C., Brown, A.W., Chulani, S., Clark, B.K., Horowitz, E., Madachy, R., Reifer, D.J., Steece, B.: Software Cost Estimation with COCOMO II. Prentice Hall (2000)
5. Concas, G., Marchesi, M., Pinna, S., Serra, N.: Power-laws in a large object-oriented software system. IEEE Trans. Software Engineering 33(10), 687–708 (2007)
6. Feitelson, D.G., Heller, G.Z., Schach, S.R.: An empirically-based criterion for determining the success of an open-source project. In: Proc. Australian Software Engineering Conf. (ASWEC), 6 pp. (2006)
7. Fernandez-Ramil, J., Izquierdo-Cortazar, D., Mens, T.: Relationship between size, effort, duration and number of contributors in large FLOSS projects. In: Proc. of BENEVOL 2008, 7th BElgian-NEtherlands software eVOLution workshop, Technical Report. Eindhoven University of Technology, Eindhoven, The Netherlands (2008)
8. Fernandez-Ramil, J., Lozano, A., Wermelinger, M., Capiluppi, A.: Empirical studies of open source evolution. In: T. Mens, S. Demeyer (eds.) Software Evolution, pp. 263–288. Springer-Verlag (2008)
9. Herraiz, I., Robles, G., Gonzalez-Barahona, J.M., Capiluppi, A., Ramil, J.F.: Comparison between SLOCs and number of files as size metrics for software evolution analysis. In: Proc. European Conf. Software Maintenance and Reengineering (CSMR), pp. 206–213. Bari, Italy (2006)
10. Izquierdo-Cortazar, D., Robles, G., Ortega, F., Gonzalez-Barahona, J.: Using software archaeology to measure knowledge loss in software projects due to developer turnover. In: Proceedings of the Hawaii International Conference on System Sciences (HICSS-42). Hawaii, USA (2009)
11. Koch, S.: Open Source Development, Adoption and Innovation, IFIP International Federation for Information Processing, vol. 234, chap. Exploring the Effects of Coordination and Communication Tools on the Efficiency of Open Source Projects using Data Envelopment Analysis, pp. 97–108. Springer, Boston (2007)
12. Koch, S.: Effort modelling and programmer participation in open source software projects. Information Economics and Policy 20(4), 345–355 (2008)
13. Koch, S.: Exploring the effects of SourceForge.net coordination and communication tools on the efficiency of open source projects using data envelopment analysis. Empirical Software Engineering (2009). (forthcoming, DOI: 10.1007/s10664-008-9086-4)
14. Lawrence, K.D., Arthur, J.L.: Robust Regression: Analysis and Applications. CRC Press (1990)
15. Lehman, M.M., Belady, L.A. (eds.): Program Evolution: Processes of Software Change. APIC Studies in Data Processing. Academic Press (1985). (electronically available at http://w3.umh.ac.be/evol/publications/books.html)
16. Lehman, M.M., Ramil, J.F., Sandler, U.: An approach to modelling long-term growth trends in software systems. In: Proc. Int'l Conf. Software Maintenance (ICSM), pp. 219–228 (2001). DOI 10.1109/ICSM.2001.972735
17. Mockus, A., Fielding, R.T., Herbsleb, J.D.: Two case studies of Open Source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology 11(3), 309–346 (2002)
18. Molokken, K., Jorgensen, M.: A review of surveys of software effort estimation. In: ISESE '03: Proceedings of the 2003 International Symposium on Empirical Software Engineering, pp. 223–230. IEEE Computer Society, Washington, DC, USA (2003)
19. Scacchi, W., Feller, J., Fitzgerald, B., Hissam, S., Lakhani, K.: Understanding free/open source software development processes. Software Process: Improvement and Practice 11(2), 95–105 (2006)
20. Van Rysselberghe, F., Rieger, M., Demeyer, S.: Detecting move operations in versioning information. In: Proc. European Conf. Software Maintenance and Reengineering (CSMR), pp. 271–278. IEEE Computer Society Press (2006)
21. Wolverton, R.W.: The cost of developing large-scale software. IEEE Trans. Computers C-23(6), 615–636 (1974)
22. Xing, Z., Stroulia, E.: Refactoring practice: How it is and how it should be supported - an Eclipse case study. In: Proc. Int'l Conf. Software Maintenance (ICSM), pp. 458–468. IEEE Computer Society Press (2006)
23. Yu, L.: Indirectly predicting the maintenance effort of open-source software. Journal of Software Maintenance and Evolution: Research and Practice 18(5), 311–332 (2006)