Downloads Distribution
Total Page:16
File Type:pdf, Size:1020Kb
FLOSSSim: Understanding the Free/Libre Open Source Software (FLOSS) Development Process through Agent-Based Modeling by Nicholas Patrick Radtke A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Approved October 2011 by the Graduate Supervisory Committee: James S. Collofello, Co-Chair Marco A. Janssen, Co-Chair Hessam S. Sarjoughian Hari Sundaram ARIZONA STATE UNIVERSITY December 2011 ABSTRACT Free/Libre Open Source Software (FLOSS) is the product of volunteers collaborating to build software in an open, public manner. The large number of FLOSS projects, combined with the data that is inherently archived with this online process, make studying this phe- nomenon attractive. Some FLOSS projects are very functional, well-known, and successful, such as Linux, the Apache Web Server, and Firefox. However, for every successful FLOSS project there are 100’s of projects that are unsuccessful. These projects fail to attract suf- ficient interest from developers and users and become inactive or abandoned before useful functionality is achieved. The goal of this research is to better understand the open source development process and gain insight into why some FLOSS projects succeed while others fail. This dissertation presents an agent-based model of the FLOSS development pro- cess. The model is built around the concept that projects must manage to attract contri- butions from a limited pool of participants in order to progress. In the model developer and user agents select from a landscape of competing FLOSS projects based on perceived utility. Via the selections that are made and subsequent contributions, some projects are propelled to success while others remain stagnant and inactive. Findings from a diverse set of empirical studies of FLOSS projects are used to for- mulate the model, which is then calibrated on empirical data from multiple sources of pub- lic FLOSS data. The model is able to reproduce key characteristics observed in the FLOSS domain and is capable of making accurate predictions. The model is used to gain a bet- ter understanding of the FLOSS development process, including what it means for FLOSS projects to be successful and what conditions increase the probability of project success. It is shown that FLOSS is a producer-driven process, and project factors that are important i for developers selecting projects are identified. In addition, it is shown that projects are sensitive to when core developers make contributions, and the exhibited bandwagon effects mean that some projects will be successful regardless of competing projects. Recommen- dations for improving software engineering in general based on the positive characteristics of FLOSS are also presented. ii To my family who have supported me on my Ph.D. journey, that I might one day be able to return the favor, and to Larry Caves who has provided a welcome and much-needed distraction when graduate work was overwhelming. iii ACKNOWLEDGMENTS I would like to thank Dr. James Collofello, who saw potential in me as an undergraduate student completing an honors thesis. I would not be where I am today save for his encour- agement to seek a higher degree and guidance thereafter. In addition to providing advice when I was unsure of the direction of my research, he has done his best to make sure that I have been financially supported throughout my graduate journey. I am also grateful to my co-chair Dr. Marco Janssen, who was willing to take on a student outside of his field. He has introduced me to modeling and helped me navigate the social science theory and other concepts necessary for creating a high quality model but unfamiliar to a computer scientist. I appreciate the many brainstorming sessions with him that helped push me off dead center when ideas were running scarce and his general availability between regularly scheduled meetings, should the need for consultation arise. iv TABLE OF CONTENTS Page LIST OF TABLES .................................. xii LIST OF FIGURES ................................. xiv CHAPTER 1 INTRODUCTION ............................... 1 1.1 Research Motivation . 4 1.1.1 Positive Characteristics of FLOSS . 4 1.1.2 Benefits of Predicting FLOSS . 8 1.2 Understanding FLOSS . 12 2 BACKGROUND ON EXISTING FLOSS MODELS ............. 16 2.1 Existing FLOSS Models . 17 2.1.1 Statistical Models . 18 2.1.2 Machine Learning . 19 2.1.3 System Dynamics Models . 22 2.1.4 Dynamic System Models . 22 2.1.5 Agent-Based Models . 25 2.2 Comparison of FLOSSSim to Existing Models . 29 2.3 Recommendations for Modeling FLOSS . 32 2.4 Choosing a Modeling Technique . 37 2.5 Conclusion . 38 3 QUANTIFYING FLOSS ........................... 40 3.1 Measuring Success . 40 v CHAPTER Page 3.1.1 Traditional Software Engineering Success Metrics . 41 3.1.2 Proposed FLOSS Success Metrics . 44 3.2 Factors Influencing FLOSS Project Success . 56 3.2.1 Types of Factors . 56 3.2.1.1 Technical Factors . 57 3.2.1.2 Social Factors . 57 3.2.1.2.1 FLOSS as a Public Good . 59 3.2.1.2.2 The Tragedy of the Commons . 61 3.2.2 Existing Research on Factors . 63 3.2.2.1 Licensing . 63 3.2.2.2 Organization Sponsorship . 69 3.2.2.3 Target Audience . 74 3.2.2.4 Governance and Coordination . 75 3.2.2.5 Documentation . 78 3.2.2.6 Systematic Testing . 79 3.2.2.7 Quality . 81 3.2.2.8 Programming Language . 82 3.2.2.9 Target Operating System . 83 3.2.2.10 Portability . 84 3.2.2.11 Version Control and Software Configuration Management . 85 3.2.2.12 Mailing Lists and Forums . 85 vi CHAPTER Page 3.2.2.13 Development Stage . 87 3.2.2.14 Activity Level . 88 3.2.2.15 Number of Developers . 89 3.3 Developer Motivation . 90 3.3.1 Similarity . 92 3.3.2 Current Resources . 94 3.3.3 Cumulative Resources . 96 3.3.4 Download Count . 97 3.3.5 Maturity . 99 3.4 Conclusion . 103 4 DATA ..................................... 105 4.1 Data Sources . 105 4.1.1 Surveys and Literature . 106 4.1.2 FLOSS Hosting Sites . 107 4.1.3 Databases . 112 4.1.3.1 SourceForge Research Data Archive . 113 4.1.3.2 FLOSSmole . 114 4.1.3.3 FLOSSMetrics . 116 4.1.4 Extraction Tools . 117 4.1.5 Data Sources Used . 119 4.2 Data Caveats . 119 4.2.1 Problems with Online Data . 119 vii CHAPTER Page 4.2.2 Problems with FLOSS Data . 121 4.2.2.1 Historical Data . 121 4.2.2.2 Cleansing Data . 125 4.2.2.3 Missing and Misleading Data . 126 4.2.2.4 Integrating Data . 132 4.3 Conclusion . 133 5 FLOSSSIMPLE ................................ 134 5.1 Characteristics of FLOSS Contributions . 134 5.2 Model Description . 135 5.3 Model Analysis . 140 5.3.1 Cumulative Resources and Consumer and Producer Distributions 140 5.3.2 Projects’ Needs Vector Distributions . 146 5.4 Conclusion . 152 6 FLOSSSIM DESCRIPTION ......................... 153 6.1 Model Description . 153 6.2 Model Evaluation . 164 6.2.1 Validation Method . 166 6.2.2 Calibration Method . 169 6.2.2.1 Mined Values . 169 6.2.2.1.1 Maturity Stage Importance . 169 6.2.2.1.2 Maturity Stage Thresholds . 173 6.2.2.1.3 New Project Creation Rate . 175 viii CHAPTER Page 6.2.2.1.4 New Project Cumulative Re- sources and Maturity . 176 6.2.2.2 Defined Parameter Values . 177 6.2.2.2.1 m . 177 6.2.2.2.2 Needs Vector Dimension . 178 6.2.2.2.3 Starting Memory Size . 179 6.2.2.2.4 Memory Change Probability . 179 6.2.2.2.5 e . 179 6.2.2.2.6 Number of Agents and Projects . 180 6.2.2.2.7 Maximum Resources . 181 6.2.2.3 Genetically Evolved Values . 181 6.2.3 Setup for Testing . 183 6.3 Modeling Environment . 185 6.3.1 Modeling Platform . 185 6.3.2 Execution Environment . 188 6.3.3 Verification . 189 7 FLOSSSIM ANALYSIS ............................ 190 7.1 Results . 190 7.1.1 Matching Distributions . 190 7.1.2 Predictive Validity . 195 7.1.2.1 Project Development Stage and Developers per Project Distributions . 195 ix CHAPTER Page 7.1.2.2 Downloads Distribution . 197 7.1.3 Evolved Parameters . 200 7.1.3.1 Utility Weights . 201 7.1.3.1.1 w1 Similarity . 204 7.1.3.1.2 w2 Current Resources . 205 7.1.3.1.3 w3 Cumulative Resources . 205 7.1.3.1.4 w4 Downloads . 206 7.1.3.1.5 w5 Maturity . 206 7.1.3.2 Producer and Consumer Numbers . 207 7.1.3.3 Maximum Number of Projects Producing and Consuming . 209 7.1.4 Success Metrics . 210 7.1.4.1 Comparing Success Metrics . 210 7.1.4.2 Target Audience Size versus Success . 213 7.2 Sensitivity Analysis . 220 7.2.1 m ..................................222 7.2.2 Needs Vector Dimension . 225 7.2.3 Starting Memory Size . 225 7.2.4 Memory Change Probability . 227 7.2.5 Number of Agents and Projects . 227 7.3 Scenario Analysis . 229 7.3.1 Effects of Consumers ..