arXiv:0807.0014v1 [physics.data-an] 30 Jun 2008 ieo rqec,ehbtn h oe a dependence law power the exhibiting frequency, or size ilatcmn”t xli h cl-rentok found networks scale-free the explain “preferen- to attachment” name the model tial new under This rediscovered of grows. been industry recently entry overall has the the as for time accounting mod- over [10] firms by the Simon model In sizes, Gibrat’s firm on). theified of distribution so of the and size of popularity the context website of of firm, independent (city, are ran- realizations entity that a stochastic rates is growth successive growth the mod- with that A is process, law entrants. dom Gibrat’s new of with formulation in model ern implemented growth on [11] stochastic based effect law a proportionate Zipf’s of [10] law for Simon Gibrat’s mechanism basis, simple this a such On articulated between growth. links and important distributions size are there that recognized references and [7] therein). (see informetrics, science been bibliometrics, library char- in also and as traffic scientometrics, well has Internet as law and 5] 13] [6, 4, statistics Zipf’s acteristics [3, access Recently, Web sizes firm in world. as found well the as over [2] all sizes the city for well of remarkably distribution accounts relationship natural law in Zipf’s words rank-frequency [1], of languages commonness a relative the as quantifying formulated Initially cecs ifslwuulyrfr opoaiiydensity probability to few refers the usually functions of law social one Zipf’s the as in sciences. role found particular regularities a reproducible plays quantitative self- law characteris- such Zipf’s their many tics, the in Among sta- self-similar distributions. either their tistical and/or properties, geometry fractal scale-free organizing exhibit ten trigwt ue[]adShmee 9,i snow is it [9], Schumpeter and [8] Yule with Starting ope dpiessesi aueadsceyof- society and nature in systems adaptive Complex miia et fZp’ a ehns nOe oreLinux Source Open In Mechanism law Zipf’s of Tests Empirical p ( x fagoigcmlxsl-raiigaatv ytm exhi system, adaptive self-organizing complex growing a of i)teaeaegot nrmn ftenme fin-direct of obey number packages the of of ∆ (i) links increment growth in-directed law: average of Zipf’s the of number (ii) origin the the of at releases of be ingredients to assumed conjectured usually the previously of tests three present We itiuin a altinrta ifslw iha exp an ∆ with time law, the appearing Zipf’s as packages than 1 new thinner value of tail a links has in-directed distributions of number the of ASnmes 97.k 97.a 02.50.Ey 89.75.Da; 89.75.Ak; numbers: PACS p fsm tcatcvariable stochastic some of ) t ( h vlto foe oresfwr rjcsi iu dist in projects software source open of evolution The x spootoa o∆ to proportional is ) ∼ 1 2 /x hi fSrtgcMngmn n noain eateto Department Innovation, and Management Strategic of Chair 1+ ehooyadEoois T uih H80 uih Switze Zurich, CH-8001 Zurich, ETH Economics, and Technology ehooyadEoois T uih H80 uih Switze Zurich, CH-8001 Zurich, ETH Economics, and Technology µ .Maillart, T. 1 hi fEternuilRss eateto Management, of Department Risks, Entrepreneurial of Chair with t ewe eessincreases. releases between t hl t tnaddvaini rprinlto proportional is deviation standard its while , µ 1 = 1 .Sornette, D. . x sal a usually , Dtd coe 6 2018) 26, October (Dated: (1) 1 .Spaeth, S. . spootoa to proportional is diin evrf httedsrbto ftenme of number the of In distribution the model. that growth verify we stochastic addition, standard a in implemented e fi-ietdlnso akgsoe ieinterval time a over packages of num- links ∆ the of in-directed increment of em- growth confirm ber average we criti- the process, additional that growth an stochastic pirically the As Gibrat’s of approximation. test obeys cal good packages a of the with of links law releases in-directed successive of between number that explicitly observed verify growth over then We law the magnitudes. Zipf’s of De- obeys orders successive four precisely in distributions packages Linux of bian links distri- the in-directed that of find We bution other as package. given that to a connections refer to we or have packages measure links pack- a in-directed other of routine, of number their number the in it the “cen- call is the that package of ages given measure com- a A a of trality” form inter-dependencies. of which the web applications, including plex packages, and connected system contain of operating typically thousands distributions of source Linux tens open Large on based an of Linux) softwares. growth ( the For of system analysis law. operating empirical Zipf’s an exhibiting provide system we this, a in models present growth indeed stochastic are present of the we ingredients that law, showing assumed study usually Zipf’s empirical consistent explain fully to first the relevance conjec- its have and on contexts, tured various in deviations) its (and 16]. law 15, Zipf’s 14, yields ingre- [2, that additional law different growth Gibrat’s many the complementing the cells in dients of entrants biological one new just in of is existence process other the But each 13]. networks with [12, or reacting web, proteins world-wide of the communities, social in t hl eea ok aetse irtslwdirectly law Gibrat’s tested have works several While spootoa o∆ to proportional is 2 tcatcgot oesta aebeen have that models growth stochastic h rwhosre ewe successive between observed growth the iigZp’ a vrfu uldecades. full four over law Zipf’s biting dlnso akgsoe ieinterval time a over packages of links ed nn hc ovre oteZp’ law Zipf’s the to converges which onent irtslwo rprinlgrowth; proportional of law Gibrat’s s n .VnKrogh Von G. and iuin ffr eakbeexample remarkable a offers ributions neovn esoso einLinux Debian of versions evolving in √ √ ∆ Management, f ∆ t t speitdfo irtslaw Gibrat’s from predicted as , ii h distribution the (iii) ; t hl t tnaddeviation standard its while , rland rland 2 Distribution P preprint APS 2

[20]. Its evolution is recorded by a chronological series of stable and unstable releases: new packages enter, some disappear, others gain or lose connectivity. Here, we study the following sequence of Debian releases: Woody: 19.07.2002; Sarge: 0.6.06.2005; Etch: 15.08.2007; Lenny (unstable version): 15.12.2007; several other Lenny ver- sions from 18.03.2008 to 05.05.2008 in intervals of 7 days. Figure 1 shows the number of packages in the first four successive versions of Debian Linux with more than C in-directed links, which is nothing but the un-normalized complementary cumulative (or survival) distribution of package numbers of in-directed links. Zipf’s law is con- firmed over four full decades, for each of the four releases. Notwithstanding the large modifications between releases and the multiplication of the number of packages by a fac- tor of three between Woody and Lenny, the distributions shown in Fig.1 are all consistent with Zipf’s law. It is re- markable that no noticeable cut-off or change of regimes occurs neither at the left nor at the right end-parts of the FIG. 1: (Color Online) Log-log plot of the number of distributions shown in Fig.1. Our results extend those packages in four Debian Linux Distributions with more conjectured in Ref. [21] for Red Hat Linux. By using than C in-directed links. The four Debian Linux Distri- Debian Linux, which is better suited for the sampling of butions are Woody (19.07.2002) (orange diamonds), Sarge (06.06.2005) (green crosses), Etch (15.08.2007) (blue circles), projects than the often used SourceForge collaboration Lenny (15.12.2007) (black +’s). The inset shows the Maxi- platform, we avoid biases and gather unique information mum Likelihood estimate (MLE) of the exponent µ together only available in an integrated environment [22]. with two boundaries defining its 95% confidence interval (ap- To understand the origin of this Zipf’s law, we use the proximately given by 1 2/√n, where n is the number of data ± general framework of stochastic growth models, and we points using in the MLE), as a function of the lower threshold. track the time evolution of a given package via its num- The MLE has been modified from the standard Hill estimator ber C of in-directed links connecting it to other packages to take into account the discreteness of C. within Debian Linux. The increment dC of the number of in-directed links to a given package over a small time in-directed links of new packages appearing in evolving interval dt is assumed to be the sum of two contributions, version of Debian Linux distributions has a tail thinner defining a generalized diffusion process: than Zipf’s law, confirming that Zipf’s law in this system dC = r(C) dt + σ(C) dW , (2) is controlled by the growth process. The Linux Kernel was created in 1991 by Linus Tor- with r(C) is the average deterministic growth of the in- valds as a clone of the proprietary Unix directed link number, σ(C) is the standard deviation of [17, 18], and was licensed under GNU General Public Li- the stochastic component of the growth process and dW cense. Its code and open source license had immediately is the increment of the Wiener process (with dW = 0 2 h i a strong appeal to other developpers who contributed to and dW = dt where the brackets denote perform- h i its further development. Quickly, the community of open ing the statistical average). Zipf’s law has been shown source developers started to run other open source pro- to arise under a variety of conditions associated with grams on this new operating system. In 1993, Debian Gibrat’s law. The simplest implementation of Gibrat’s Linux [19] became the first non-commercial successful law writes that both r(C) and σ(C) are proportional to general distribution of an open source operating system. C, While continuously evolving, it remains up to the present r(C)= r C , σ(C)= σ C, (3) the “mother” of a dominant Linux branch, competing × × with a growing number of derived distributions (, with proportionality coefficients r and σ obeying the fol- Dreamlinux, Damn Small Linux, , Kanotix, and lowing inequality r < σ. This later inequality expresses so on). that the proportional growth is dominated by its stochas- From a few tens to hundreds of packages (474 in 1996 tic component [23]. Accordingly, the heavy tail structure (v1.1)), Debian has expanded to include more than about of Zipf’s law can be thought of as the result of large 18’000 packages in 2007, with many intricate dependen- stochastic multiplicative excursions. The rest of the let- cies between them, that can be represented by complex ter is devoted to testing and validating this model. functional networks. Debian offers a remarkable exam- First, we measure the time evolution of the in-directed ple of a growing complex self-organizing adaptive system links of all packages in the successive Debian releases, 3

FIG. 2: Left panel: Plots of ∆C versus C from the Etch FIG. 3: Dependence of R(∆t) and Σ(∆t) defined respectively release (15.08.2007) to the latest Lenny version (05.05.2008) in by R(∆t) ∆C/C and (5) as a function of their time inter- ≡h i double logarithmic scale. Only positive values are displayed. val ∆t for the 66 time intervals that can be formed between all The linear regression ∆C = R C + C0 is significant at the the Debian releases in our database (which includes the four × 95% confidence level, with a small value C0 = 0.3 at the major Debian releases from 19.07.2002 to 15.12.2007 as well origin and R = 0.09. Right panel: same as left panel for the as the several Lenny releases from 18.03.2008 to 05.05.2008 in standard deviation of ∆C. intervals of 7 days). The error bars show the 95% confidence intervals, obtained by shuffling 1000 times the linear regres- sion residuals. The straight lines represent the best linear by retrieving the network of dependencies following the fits. The existence of a genuine linear dependence of R as a methodology explained in Ref. [22]. For packages which function of ∆t cannot be rejected (p < 0.05) and has a high signifiance level (square of correlation coefficient 2 = 0.93). are common to successive releases, we find that their con- R nectivity, measured for instance by their number C of The regression of Σ versus √∆t enjoys the same high statis- tical confidence (p< 0.05 and 2 = 0.97). in-directed links, increases on average albeit with con- R siderable fluctuations. Consider for instance the up- date from Etch (15.08.2007) to the latest Lenny version This last result derives from the properties of the Wiener (05.05.2008). For each package i which is common to process increments dW . We test these two predictions these two versions, we measure the increment ∆C of the i (4) and (6) as follows. Out of the four major Debian re- number C of in-directed links to that package from Etch i leases from 19.07.2002 to 15.12.2007 as well as the several to the latest Lenny version. The left panel of Fig.2 plots Lenny releases from 18.03.2008 to 05.05.2008 in intervals these increments ∆C as a function of C . This figure i i of 7 days, 66 different time intervals can be formed. For is typical of the results obtained on the increments ∆C i each time interval, we calculate the average growth rate between other pairs of Debian releases. The scatter plot defined by R(∆t) ∆C/C and its standard deviation confirms the existence of an approximate proportionality defined by (5). Technically,≡ h i we estimate R(∆t) (respec- between ∆C and C , especially for the largest C val- i i i tively Σ(∆t)) as the slope (respectively the standard de- ues, in agreement with the first equation of (3). The viation of the residuals) of the linear regression of ∆C right panel of Fig.2 shows the standard deviation of ∆C as a function of C. This method allows us to construct as a function of C, confirming the second equation of confidence bounds by bootstrapping (we reshuffle 1000 (3). These two panels are nothing but direct evidence times the linear regression residuals). The left (resp. of Gibrat’s law for package connectivities, which consti- right) panel of figure 3 shows the 66 values of R(∆t) tutes an essential ingredient of stochastic growth models (resp. Σ(∆t)) as a function of their corresponding time of Zipf’s law [2, 14, 15, 16]. Notice that the large scat- interval ∆t (resp. square-root of ∆t), providing a strong ter decorating the approximate proportionality between validation of the stochastic growth model (2) and (3). ∆Ci and Ci observed in Fig. 2 and quantified in the right panel of Fig.2 is an essential ingredient for Zipf’s law to We now address the question of how the increase of appear [23]. the number of packages interacts with the growth pro- We then combine (2) and (3) to predict that, over a cess of the number of links between packages. This is- not too large time interval ∆t, (i) the average growth sue has been considered in the context of firms (see Ref. rate R(∆t) ∆C/C should be given by [23] for a detailed presentation and summary of the lit- ≡ h i erature), and applies to the case of packages as follows. R(∆t)= r ∆t , (4) Most stochastic growth models of firms based on Gibrat’s × principle attempt to derive the distribution of the cross- and (ii) the standard deviation of the growth rate section of firm sizes directly from the distribution of the 2 1 asset value of a single firm as a function of time. In- Σ(∆t) [∆C/C] 2 (5) ≡ h i deed, many models start with the implicit or explicit should be equal to assumption that the set of firms was born at the same origin of time. This approach is mathematically equiv- Σ(∆t)= σ √∆t . (6) alent to considering that the universe is made only of × 4

of in-directed links of newly born packages has a tail thin- ner than Zipf’s law, and converges progressively to Zipf’s law as the time elapsed between two releases increases, reflecting the increasing impact of the stochastic multi- plicative growth process. This confirms that Zipf’s law results indeed from the stochastic multiplicative growth process at the level of individual packages in the presence of the birth-death of packages.

FIG. 4: The right panel shows that the exponent µ of the distribution of C’s of new packages appearing between suc- cessive unstable Lenny releases separated by one week is a [1] Zipf, G.K., Human behavior and the principle of least power law with exponent µ 1.5; the left panel show that ≃ effort, Addison-Wesley Press Cambridge, Mass. (1949). the same power law has a smaller exponent closer to 1 as one [2] Gabaix, X., The Quarterly Journal of Economics 114 (3), considers the new packages appearing between two more dis- 739-767 (1999). tant releases. We have verified that this effect is systematic [3] Simon, H. and C. Bonini, The American Economic Re- in our database. The exponents µ are obtained by maximum view 48(4), 607-617 (1958). likelihood, adapted to the discreteness of C values. The thin [4] Ijri, Y. and H. A. Simon, Skew Distributions and the Sizes lines defined the 95% confidence intervals. of Business Firms (North-Holland, New York, 1977). [5] Axtell, R.L., Science 293 (5536), 1818-1820 (2001). [6] Adamic, L.A. and B.A. Huberman, Quarterly Journal of one single entity. Therefore, the distribution of firm sizes Electronic Commerce 1, 5-12 (2000). can reach a steady-state if and only if the distribution [7] Adamic, L.A. and B.A. Huberman, Glottometrics 3, 143- of the asset value of a single firm reaches a steady state, 150 (2002). which is counterfactual. A more correct model is to take [8] Yule, G. U., Phil. Trans. R. Soc. London B 213, 21-83 into account the fact that firms do not appear all at the (1924). same time but are born according to a more or less reg- [9] Schumpeter, J. (1934) Theory of Economic Development, Harvard University Press, Cambridge. ular flow of newly created firms. Competing with the [10] Simon, H. A., Biometrika 52, 425-440 (1955). birth process, firms also disappear at a surprisingly high [11] Gibrat, R., Les In´egalit´es Economiques; Applications rate. Similarly, the evolution of successive Debian re- aux in´egaliti´es des richesses, `ala concentration des en- leases is punctuated by additions and deletions of many treprises, aux populations des villes, aux statistiques des packages. For instance, at the release of the latest stable familles, etc., d’une loi nouvelle, la loi de l’effet propor- release (Lenny, 15.12.2007), 885 packages disappeared, tionnel (Librarie du Recueil Sirey, Paris, 1931). partly merged, or were renamed while 2983 packages ap- [12] Barabasi, A.-L., and R. Albert, Science 286, 509-512 (1999). peared compared to the precedent release. Clearly, the [13] Barabasi, A.-L., and R. Albert, Reviews of Modern dynamics of the connectivity between packages depends Physics 74, 47-97 (2002). on the birth as well as demise of packages. Therefore, the [14] Champernowne, D., Economic Journal 63, 318-351 stochastic growth model (2) must be supplemented by a (1953). model of the birth and death of packages. For this, we use [15] Kesten, H., Acta Math. 131, 207-248 (1973). Saichev et al. [23]’s approach, who showed that Gibrat’s [16] Sornette, D., Physica A 250, 295-314 (1998) . law of proportionate growth does not need to be strictly [17] Linux Kernel, www.kernel.org. [18] Torvalds, L., Communications of the ACM 42 (4), 38-39 satisfied in the presence of the birth and death of entities (1999). following the stochastic growth process (2): as long as the [19] Debian Linux, www.debian.org. volatility Σ(∆t) defined by (5) increases asymptotically [20] Myers, C.R., Phys. Rev. E 68, 046116 (2003). proportionally to C and that the instantaneous growth [21] Challet, D. and A. Lombardoni, Physical Review E 70, rate increases not faster than the volatility, the distribu- 046109 (2004). tion of sizes follows Zipf’s law. This suggests that the [22] Spaeth, S., M. Stuermer, S .Haefliger and G. von Krogh, occurrence of very large firms in the distribution of sizes Proceedings of the 40th Annual Hawaii International Conference on System Science - 2007 (HICSS’O7) described by Zipf’s law is more a consequence of random [23] Y. Malevergne, A. Saichev and D. Sornette, Zipf’s growth than systematic returns. Likewise, in particu- Law for Firms: Relevance of Birth and Death lar for packages with large connectivities, volatility can Processes, Journal of Economic Theory (2008), dominate over the instantaneous growth rate. (http://ssrn.com/abstract=1083962 Figure 4 verifies that the distribution of the numbers C