Software Process Recovery: Recovered Unified Process Views
Total Page:16
File Type:pdf, Size:1020Kb
Evidence-based Software Process Recovery by Abram Hindle A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science Waterloo, Ontario, Canada, 2010 c Abram Hindle 2010 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Abstract Developing a large software system involves many complicated, varied, and inter- dependent tasks, and these tasks are typically implemented using a combination of defined processes, semi-automated tools, and ad hoc practices. Stakeholders in the development process | including software developers, managers, and customers | often want to be able to track the actual practices being employed within a project. For example, a customer may wish to be sure that the process is ISO 9000 compliant, a manager may wish to track the amount of testing that has been done in the current iteration, and a developer may wish to determine who has recently been working on a subsystem that has had several major bugs appear in it. However, extracting the software development processes from an existing project is ex- pensive if one must rely upon manual inspection of artifacts and interviews of developers and their managers. Previously, researchers have suggested the live observation and in- strumentation of a project to allow for more measurement, but this is costly, invasive, and also requires a live running project. In this work, we propose an approach that we call software process recovery that is based on after-the-fact analysis of various kinds of software development artifacts. We use a variety of supervised and unsupervised techniques from machine learning, topic analysis, natural language processing, and statistics on software repositories such as version control systems, bug trackers, and mailing list archives. We show how we can combine all of these methods to recover process signals that we map back to software development processes such as the Unified Process. The Unified Process has been visualized using a time-line view that shows effort per parallel discipline occurring across time. This visualization is called the Unified Process diagram. We use this diagram as inspiration to produce Recovered Unified Process Views (RUPV) that are a concrete version of this theoretical Unified Process diagram. We then validate these methods using case studies of multiple open source software systems. iii Acknowledgements I would like to recognize my coauthors on much of this work: • Michael W. Godfrey (PhD Supervisor) • Richard C. Holt (PhD Supervisor) • Daniel German (Masters Supervisor) • Neil Ernst (Fellow PhD Student and collaborator) I would also like to acknowledge those who awarded me with scholarships: • NSERC, as I was funded by a NSERC PGS-D Scholarship • David Cheriton and the David Cheriton School of Computer Science for the David Cheriton scholarship I received. I would like to thank all the people who helped motivate this work: • Lixin Luo • Ron and Faye Hindle • Michael W. Godfrey and Richard C. Holt iv Dedication This is dedicated to my loving wife Lixin and to principles of Free Software as prescribed by the Free Software Foundation. v Contents List of Tables xii List of Figures xv 1 Introduction 1 1.1 Relationship to Mining Software Repositories . .2 1.2 Application of Software Process Recovery . .3 1.2.1 Stakeholders . .4 1.3 Conceptual View of Software Process Recovery . .6 1.4 Summary . .9 2 Related Research 11 2.1 Stochastic Processes, Business Processes and Software Development Processes 11 2.1.1 Stochastic Processes . 12 2.1.2 Business Processes . 12 2.1.3 Software Development Processes . 13 2.1.4 Process Summary . 15 2.2 Data Analysis . 15 2.2.1 Statistics . 15 2.2.2 Time-series analysis . 16 2.2.3 Machine Learning and Sequence Mining . 18 2.2.4 Natural Language Processing . 20 vi 2.2.5 Social Network Analysis . 23 2.2.6 Data Analysis Summary . 23 2.3 Mining Software Repositories . 23 2.3.1 Fact extraction . 24 2.3.2 Prediction . 24 2.3.3 Metrics . 25 2.3.4 Querying Repositories . 27 2.3.5 Statistics and Time-series analysis . 28 2.3.6 Visualization . 28 2.3.7 Social Aspects . 31 2.3.8 Concept Location and Topic Analysis . 31 2.3.9 MSR Summary . 32 2.4 Software Process Recovery . 32 2.4.1 Process Mining: Business Processes . 32 2.4.2 Process Discovery . 33 2.4.3 Process Recovery . 33 2.4.4 Software Process Recovery Summary . 34 2.5 Summary . 34 3 Software Process Recovery: A Roadmap 36 3.1 Software Artifact Perspective . 40 3.1.1 Source Control Systems . 40 3.1.2 Mailing list and Bug Trackers . 41 3.2 Process Perspective . 41 3.2.1 Evidence of Software Development Processes . 42 3.2.2 Concurrent Effort . 43 3.2.3 Process and Behaviour . 43 3.3 Software Development Perspective . 43 3.3.1 Requirements and Design . 44 3.3.2 Implementation, Testing, and Maintenance . 44 3.3.3 Deployment, Project Management, and Quality Assurance . 45 3.4 Summary . 46 vii 4 Evidence of Process 47 4.1 Release Patterns . 48 4.1.1 Background . 49 4.1.2 Terminology . 50 4.2 Methodology . 51 4.2.1 Extraction . 52 4.2.2 Partitioning . 53 4.2.3 Aggregates . 53 4.2.4 Analysis . 54 4.2.5 STBD Notation . 55 4.3 Case Study . 55 4.3.1 Questions and Predictions . 56 4.3.2 Tools and Data-sets . 56 4.4 Results . 58 4.4.1 Indicators of Process . 61 4.4.2 Linear Regression Perspective . 61 4.4.3 Release Perspective . 62 4.4.4 Interval Length Perspective . 62 4.4.5 Project Perspective . 63 4.4.6 Revision Class Perspective . 64 4.4.7 Zipf Alpha Measure . 65 4.4.8 Answers to Our Questions . 65 4.4.9 Validity Threats . 67 4.5 Possible Extension . 68 4.6 Conclusions . 69 5 Learning the Reasons behind Changes 70 5.1 What About Large Commits? . 71 5.1.1 Previous Work . 71 viii 5.2 Methodology . 72 5.3 Results . 77 5.3.1 Themes of the Large Commits . 77 5.3.2 Quantitative Analysis . 79 5.4 Analysis and Discussion . 88 5.4.1 Threats to Validity . 90 5.5 Conclusions . 90 6 Classifying Changes by Rationale 92 6.1 On the Classification of Large Commits . 93 6.2 Previous Work . 94 6.3 Methodology . 95 6.3.1 Projects . 95 6.3.2 Creating the training set . 95 6.3.3 Features used for classification . 95 6.3.4 Machine Learning Algorithms . 96 6.4 Results . 97 6.4.1 Learning from Decision Trees and Rules . 98 6.4.2 Authors . 101 6.4.3 Accuracy . 101 6.4.4 Discussion . 105 6.5 Validity Threats . 106 6.6 Possible Extension . 107 6.7 Conclusions . 107 7 Recovering Developer Topics 108 7.1 Latent Dirichlet Allocation and Developer Topics . 109 7.1.1 Topics of Development . 109 7.1.2 Background . 111 7.1.3 Preliminary Case Study . ..