Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution
Total Page:16
File Type:pdf, Size:1020Kb
David R. Cheriton School of Computer Science, University of Waterloo, Ontario, Canada Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution Jingwei Wu 1 and Richard Holt 2 Software Architecture Group (SWAG) David R. Cheriton School of Computer Science University of Waterloo Waterloo, ON, Canada Abstract We examine eleven open source software systems and present empirical evidence for the existence of fractal structures in software evolution. In our study, fractal structures are measured as power laws through the lifetime of a software system. We describe two specific power law related phenomena: the probability distribution of software changes decreases as a power function of change sizes; and the time series of software change exhibits long range correlations with power law behavior. The existence of such spatial (across the system) and temporal (over the system lifetime) power laws suggests that Self-Organized Criticality (SOC) occurs in the evolution of open source software systems. As a consequence, SOC may be useful as a conceptual framework for understanding software evolution dynamics (the cause and mechanism of change or growth). We give a qualitative explanation of software evolution based on SOC. We also discuss some potential implications of SOC to current software practices. 1 Introduction Laws of software evolution formulated by Lehman [Leh97] represent a major intellectual contribution to understanding software evolution dynamics (the underlying cause and mechanism of software change or growth). Lehman’s eight laws are empirically grounded on observing how closed source industrial software systems such as IBM OS/360 were developed and maintained within a single company using conventional management techniques [LRW+97]. The laws suggest that software systems must be continuously adapted to respond to external forces such as new functional requirements and hardware upgrade and to maintain stakeholder satisfaction. 1 Email: [email protected] 2 Email: [email protected] Viewing software evolution as a phenomenon driven by various external forces can make it difficult to accommodate and explain conflicting find- ings across different software systems or domains. For example, open source projects in different domains have been found to grow at different rates in- cluding super-linear, sub-linear and linear rates [GT00][RAGBH05]. A simple, unified way for explaining evolution dynamics is needed. This paper looks into Complex Systems theories to seek new ways for describing and explaining soft- ware evolution. A wealth of knowledge has been gained in understanding the change be- havior of complex systems as diverse as sandpiles [BTW87], power black- outs [CNDP04], earthquakes [Sch90] and even biological evolution [SMBB97]. These systems are complex in the sense that no single characteristic event size can control their changes or responses over time. That is, changes in a complex system can be of any size and occur any time. For example, a power blackout can strike a street, a town, a state and all the way up to a country. An intriguing aspect of complex systems is that their statistical properties can be measured as power laws in space and in time. In 1987, Bak, Tang and Wiesenfeld proposed Self-Organized Criticality (SOC) to explain typical power law behavior [BTW87]. SOC was set out as an ambitious effort for explaining the existence of ubiquitous fractal structures in nature. SOC has two important signatures: (1) the power law distribution of dynamical responses, and (2) long range correlations with power law behavior in time series of response [CNDP04][SMBB97]. Section 2 will provide a brief introduction to fractals, power laws and long range correlations and explain how these concepts are relevant to SOC. Does a software system follow the SOC dynamics during its evolution? This question arises from several previous studies which in particular include: ² Software evolution as punctuated equilibrium Wu et. al examined the structural evolution of software systems at the level of source files. They observed that three open source systems (OpenSSH, PostgreSQL and Linux Kernel) evolved through an alternation between long periods of small changes and short periods of large avalanche changes [WSHH04]. ² Biological evolution as self-organized criticality Researchers have found that fluctuations in fossil record exhibit long range correlations with power law behavior [SMBB97]. The existence of such fractal structures means that, when examining a given time frame, some basic properties such as mean and standard variance remain the same as those obtained from the whole time series if a change of scale is performed. SOC is suggested as a useful way of understanding how long periods of small extinctions are interrupted by mass extinctions [BS93][SMBB97]. The structural evolution of a software system exhibits similar characteristics of punctuation as fossil record, suggesting that SOC may also be useful for 2 explaining software evolution. ² Overall open source movement as a self-organizing collaborative social net- work In comparison to traditional industrial systems, open source software sys- tems are largely developed based on a less strict control and manage- ment model [Ope05][Ray99]. Spontaneous collaboration is promoted and backed by a decentralized developer community across the Internet [Ray99]. Researchers including Madey and Koch have suggested that open source projects can be seen as a self-organizing phenomenon featuring the self- selection of tasks, spontaneous collaboration and leadership [Koc04][MFT02]. The main empirical evidence they present is power law distribution of open source project sizes (the numbers of developers) and the power law distri- bution of developer contributions (the number of commits to the source control repository). Motivated by these promising results, we feel that SOC may be established as a useful conceptual framework for describing and explaining software evo- lution. In this paper, we describe our effort in seeking evidence for SOC in open source software systems and also discuss some implications of SOC to software practices. The rest of this paper is organized as follows. Section 2 introduces useful terms and concepts in brief. Section 3 explains empirical data collection in regard to software changes and time series of change. Section 4 investigates the existence of power laws in the evolution of open source software systems. Section 5 provides a qualitative explanation of software evolution based on SOC. Section 6 discusses some threats to the validity of our work. Section 7 considers related work and Section 8 concludes this paper. 2 Background This section provides a brief introduction to fractals, power laws, R/S time series analysis and SOC. The reader familiar with these concepts can skip to the next section. 2.1 Fractals Fractals are mathematical or natural objects that are made of parts similar to the whole in “some” way. A fractal has a self-similar structure that occurs at different scales. For example, a small branch of a tree looks like the whole tree due to the existence of branching structures. When the length of a shoreline is measured using the box counting method, the length of any segment can cover the same number of mesh boxes as the whole shoreline if a change of scale is performed. For a detailed explanation of fractals, the reader can refer to Mandelbrot’s book Fractal Geometry of Nature [Man82]. 3 2.2 Power Law A power law is a relationship between two scalar variables x and y, which can be written as follows: y = C¢xk where C is the constant of proportionality and k is the exponent of the power law. Such a power law relationship shows as a straight line on a log-log plot since, taking logs of both sides, the above equation is equivalent to log(y) = k¢log(x) + log(C) which has the same form as the equation for a straight line Y = k¢X + c The equation f(x) = C¢xk has a property that relative scale change f(sx)=f(x) = sk is independent of x. In this sense, f(x) lacks a characteristic scale or is scale invariant. Consequently f(x) can be related to fractals because of its scale invariance. Power Law Distribution This paper is concerned with a special kind of distribution called power law distribution, in which the Probability Density Function (PDF) of size s is specified as P (s) » s¡® and the tail Cumulative Distribution Function (CDF) of size s is specified as D(s) = P (x¸s) » s¡¯. The relationship between ® and ¯ is ¯ = ® ¡ 1 [New05]. Because ¯ can be conveniently estimated using linear regression on a log-log plot without binning data points, we choose to estimate ¯ rather than ® in our study. Power laws have been observed in many fields such as physics, economics, geography and sociology [New05]. For example, the distribution of earth- quakes is found to follow P (E)»E¡® where E is the amount of energy. The exponent ® exhibits some geographical dependence and is found to be in the interval from approximately 1:8 to 2:2 [Jen98]. The distributions of firm sizes (measured as the number of employees) [Axt01][GGP03] and of open source project sizes (measured as the number of developers or lines of code) [HJ02][Koc04] also follow power laws. 2.3 Time Series Analysis Time series are commonly used to characterize the evolution of software sys- tems. For example, Lehman et al. studied the evolution of IBM’s operating system OS/360 by means of observing system growth measured in terms of the number of source modules and number of modules changed for each release [BL76][LB85]. Turski performed a regression analysis of results from these case studies and proposed the inverse-square model, which suggests that sys- tem growth is inversely proportional to system complexity and that system 4 complexity is proportional to the square of system size [Tur02]. Such regres- sion analysis of time series is useful for understanding the nature of software evolution.