Corpus-Based Empirical Research in Software Engineering
Total Page:16
File Type:pdf, Size:1020Kb
Ekaterina Pek Corpus-based Empirical Research in Software Engineering Dissertation Koblenz, October 2013 Department of Computer Science Universität Koblenz-Landau Ekaterina Pek Corpus-based Empirical Research in Software Engineering When someone is seeking, it happens quite easily that he only sees the thing that he is seeking; that he is unable to find anything, unable to absorb anything, because he is only thinking of the thing he is seeking, because he has a goal, because he is obsessed with his goal. Seeking means: to have a goal; but finding means: to be free, to be receptive, to have no goal. Herman Hesse, Siddhartha1 1 Translation by Hilda Rosner Acknowledgments First of all, I am very grateful to my supervisor, Prof. Dr. Ralf Lämmel, for his guid- ance, support, and patience during these years. I am very glad that I had a chance to work with Ralf, learn from his experience and wisdom, and grow professionally under his supervision, as well as appreciate his friendly personality. I am grateful to my co-authors: Jean-Marie Favre, Dragan Gaševic,´ Rufus Linke, Jürgen Starek, and Andrei Varanovich, working with whom was a great intellectual pleasure. No man is an island: I would like to thank my ex-boyfriend, Vladimir, for his great help and support during my PhD; my then-colleagues at the University of Koblenz for their company, thoughts, and time: my gratitude goes especially to An- drei Varanovich, Claudia Schon, and Sabine Hülstrunk. I gratefully acknowledge the support of my work by the Rhineland Palatinate’s Research Initiative 2008-2011, the ADAPT research focus, and the University of Koblenz-Landau. August 2013 Ekaterina Pek Abstract In the recent years, Software Engineering research has shown the rise of interest in the empirical studies. Such studies are often based on empirical evidence derived from corpora—collections of software artifacts. While there are established forms of carrying out empirical research (experiments, case studies, surveys, etc.), the com- mon task of preparing the underlying collection of software artifacts is typically ad- dressed in ad hoc manner. In this thesis, by means of a literature survey we show how frequently Software Engineering research employs software corpora and using a developed classification scheme we discuss their characteristics. Addressing the lack of methodology, we suggest a method of corpus (re-)engineering and apply it to an existing collection of Java projects. We report two extensive empirical studies, where we perform a broad and di- verse range of analyses on the language for privacy preferences (P3P) and on object- oriented application programming interfaces (APIs). In both cases, we are driven by the data at hand—by the corpus itself—discovering the actual usage of the languages. Zusammenfassung In den letzten Jahren gibt es im Bereich Software Engineering ein steigendes Interesse an empirischen Studien. Solche Studien stützen sich häufig auf empirische Daten aus Corpora—Sammlungen von Software-Artefakten. Während es etablierte Formen der Durchführung solcher Studien gibt, wie z.B. Experimente, Fallstudien und Umfragen, geschieht die Vorbereitung der zugrunde liegenden Sammlung von Software-Artefakten in der Regel ad hoc. In der vorliegenden Arbeit wird mittels einer Literaturrecherche gezeigt, wie häufig die Forschung im Bereich Software Engineering Software Corpora benutzt. Es wird ein Klassifikationsschema entwickelt, um Eigenschaften von Corpora zu beschreiben und zu diskutieren. Es wird auch erstmals eine Methode des Corpus (Re-)Engineering entwickelt und auf eine bestehende Sammlung von Java-Projekten angewendet. Die Arbeit legt zwei umfassende empirische Studien vor, in denen eine umfang- reiche und breit angelegte Analysenreihe zu den Sprachen Privacy Preferences (P3P) und objektorientierte Programmierschnittstellen (APIs) durchgeführt wird. Beide Studien stützen sich allein auf die vorliegenden Daten der Corpora und decken da- durch die tatsächliche Nutzung der Sprachen auf. Contents 1 Introduction ................................................... 1 1.1 Research Context. .1 1.2 Problem Statement . .3 1.3 Outline of the Thesis . .5 1.4 Contributions of the Thesis . .8 1.5 Supporting Publications . .9 Part I Prerequisites 2 Essential Background .......................................... 13 2.1 Software Linguistics . 14 2.1.1 Stances in Software Language Engineering . 14 2.1.2 Examples of Linguistic Sub-disciplines . 15 2.2 Actual Usage of Language . 16 2.2.1 Prescriptive and Descriptive Linguistics . 16 2.2.2 Measuring Actual Usage . 17 2.3 Languages Under Study . 21 2.3.1 Platform for Privacy Preferences . 21 2.3.2 Application Programming Interfaces . 22 2.4 Corpus Engineering. 24 2.4.1 Corpora in Natural Linguistics . 24 2.4.2 Corpora in Software Linguistics . 25 2.4.3 Corpora versus Software Repositories . 28 2.5 Literature Surveys . 29 2.5.1 Systematic Literature Reviews . 29 2.5.2 Content and Meta-analysis . 30 2.5.3 Grounded Theory . 30 2.5.4 Survey Process . 31 XIV Contents Part II Language Usage 3 A Study of P3P Language ....................................... 35 3.1 Introduction . 36 3.2 Methodology of the Study . 36 3.2.1 Research Questions . 36 3.2.2 Corpus under Study . 38 3.2.3 Leveraged Analyses . 40 3.3 The Essence of P3P . 41 3.3.1 Language versus Platform . 42 3.3.2 Syntax of P3P . 42 3.3.3 A Normal Form . 46 3.3.4 Degree of Exposure . 50 3.4 Analyses of the Study . 52 3.4.1 Analysis of Vocabulary . 52 3.4.2 Analysis of Constraints . 57 3.4.3 Analysis of Metrics . 63 3.4.4 Analysis of Cloning . 73 3.4.5 Analysis of Extensions. 81 3.5 Threats to Validity . 85 3.6 Related Work . 86 3.7 Conclusion . 89 4 A Study of APIs ................................................ 91 4.1 Introduction . 92 4.2 Java APIs . 92 4.2.1 Overview of the Approach . 93 4.2.2 A Study of SourceForge . 94 4.2.3 Examples of API-usage Analysis . 97 4.3 .NET framework . 105 4.3.1 Methodology . 106 4.3.2 Reuse-related Metrics for Frameworks . 108 4.3.3 Classification of Frameworks . 112 4.3.4 Comparison of Potential and Actual Reuse . 116 4.4 Multi-dimensional Exploration . 119 4.4.1 An Exploration Story . 120 4.4.2 Basic Concepts . 121 4.4.3 Exploration Insights . 124 4.4.4 Exploration Views . 130 4.4.5 The EXAPUS Exploration Platform . 134 4.5 Threats to Validity . 135 4.6 Related Work . 136 4.7 Conclusion . 139 Contents XV Part III Corpus Engineering 5 Literature Survey of Empirical Software Engineering . 143 5.1 Introduction . 144 5.2 Pilot Studies . 144 5.2.1 Survey on Empirical Language Analysis . 145 5.2.2 Survey on Corpora Usage . 149 5.3 A Survey on Empirical Software Engineering . 153 5.3.1 Methodology . 153 5.3.2 Results . 154 5.4 Threats to Validity . 168 5.5 Related Work . 169 5.6 Conclusion . 172 6 Corpus (Re-)Engineering ....................................... 175 6.1 Introduction . 176 6.1.1 Benefits of Using an Established Corpus . 176 6.1.2 Obstacles to Corpus Adoption . 177 6.2 A Method for Corpus (Re-)Engineering . 177 6.2.1 Underlying Concepts . ..