
Investigating topic modeling techniques for historical feature location.

Lukas Schulte

Faculty of Health, Science and Technology
Master thesis in Second Cycle, 30 hp (ECTS)
Supervisor: Dr. Sebastian Herold, Karlstad University
Examiner: Dr. Muhammad Ovais Ahmad
Karlstad, June 28th, 2021

I Abstract


Software maintenance and the understanding of where in the source code features are implemented are two strongly coupled tasks that make up a large portion of the effort spent on developing applications. The concept of feature location investigated in this thesis can serve as a supporting factor in those tasks, as it facilitates the automation of otherwise manual searches for source code artifacts. Challenges in this subject area include the aggregation and composition of a training corpus for models from historical codebase data as well as the integration and optimization of qualified topic modeling techniques. Building on previous research, this thesis provides a comparison of two different techniques and introduces a toolkit that can be used to reproduce and extend the results discussed. Specifically, this thesis pursues a changeset-based approach to feature location and applies it to a large open-source Java project. The project is used to optimize and evaluate the performance of Latent Dirichlet Allocation models and Pachinko Allocation models, as well as to compare the accuracy of the two models with each other. As discussed at the end of the thesis, the results do not indicate a clear favorite between the models. Instead, the outcome of the comparison depends on the metric and viewpoint from which it is assessed.

Keywords: feature location, topic modeling, changesets, latent Dirichlet allocation, Pachinko allocation, mining software repositories, source code comprehension


II Acknowledgments

First, I would like to acknowledge the work that the responsible instances at Karlstad University and the University of Applied Sciences Osnabrück have put into their partnership within the ERASMUS program, which made my thesis possible. Further, I would like to give special thanks to the supervisor of my thesis, Dr. Sebastian Herold, for his support and help. In the same way, I would like to thank Dr. Muhammad Ovais Ahmad for taking the role of examiner for my work. Finally, I would like to express my gratitude to my family and friends who provided support and distraction during the time I worked on this thesis.


III Table of Contents

1 INTRODUCTION ...... 1
1.1 BACKGROUND ...... 1
1.2 PROBLEM DESCRIPTION ...... 1
1.3 THESIS GOAL ...... 2
1.4 THESIS OBJECTIVE ...... 2
1.5 ETHICS AND SUSTAINABILITY ...... 2
1.6 METHODOLOGY ...... 3
1.7 STAKEHOLDERS ...... 4
1.8 DELIMITATIONS ...... 4
1.9 OUTLINE ...... 4
2 BACKGROUND AND RELATED WORK ...... 6
2.1 FEATURE LOCATION ...... 6
2.1.1 Definition and Taxonomy ...... 6
2.1.2 Tools for Feature Location ...... 7
2.1.3 Datasets for Benchmarking ...... 8
2.2 TEXT MINING ...... 8
2.3 TOPIC MODELING ...... 9
2.4 DIRICHLET DISTRIBUTIONS ...... 11
2.5 LATENT DIRICHLET ALLOCATION AND PACHINKO ALLOCATION ...... 13
2.6 RELATED WORK ...... 15
3 METHODS ...... 18
3.1 OVERVIEW ...... 18
3.2 DATA PREPARATION ...... 19
3.2.1 Data Mining ...... 19
3.2.2 Text Cleaning ...... 19
3.3 TOPIC MODELING ...... 20
3.3.1 Parameters ...... 20
3.3.2 Hyperparameter Tuning ...... 21
3.4 FEATURE LOCATION ...... 21
3.4.1 Search Query ...... 22
3.4.2 Performance Metrics ...... 22
3.4.3 Goldset-based Validation ...... 23
4 IMPLEMENTATION ...... 25
4.1 GOALS AND CONSTRAINTS ...... 25
4.2 GENERAL SOLUTION STRATEGY ...... 26
4.3 IMPORTER APPLICATION ...... 29
4.3.1 Context, Scope, and Solution Strategy ...... 29
4.3.2 Building Block View ...... 32
4.4 FEATURE LOCATION ...... 39
4.4.1 Context, Scope, and Solution Strategy ...... 39
4.4.2 Building Block View ...... 42
5 EVALUATION ...... 47
5.1 SETUP ...... 47


5.1.1 General Data Structures ...... 47
5.1.2 Target System ...... 50
5.2 FEATURE LOCATION TECHNIQUE ANALYSIS ...... 50
5.2.1 Execution Time ...... 51
5.2.2 Improvements through Hyperparameter Tuning ...... 53
5.2.3 Obstacles in the Feature Location Process ...... 57
5.3 SEARCH RESULT PERFORMANCE ...... 58
5.3.1 Objects of Evaluation ...... 58
5.3.2 Performance ...... 59
6 DISCUSSION ...... 61
6.1 DISCUSSION OF THE FEATURE LOCATION TECHNIQUE (RQ1) ...... 61
6.2 DISCUSSION OF THE SEARCH RESULT PERFORMANCE (RQ2) ...... 62
6.3 THREATS TO VALIDITY ...... 64
6.3.1 External Validity ...... 64
6.3.2 Internal Validity ...... 64
7 CONCLUSIONS AND FUTURE WORK ...... 66
7.1 FUTURE WORK ...... 67
8 BIBLIOGRAPHY ...... 68
9 APPENDIX ...... 72
A.1 GOLDSET COMPARISON ...... 72
A.2 LOOKUP TABLE ...... 73
A.3 RANKS AND RECIPROCAL RANKS CALCULATED FROM THE DISCUSSED MODELS ...... 74
A.4 INTERMEDIATE VALUES OF WILCOXON'S SIGNED-RANK TEST ...... 76


IV List of Figures

FIGURE 1: ILLUSTRATION OF TOPIC MODELING APPLIED TO AN ARTICLE USING LDA [BL12] ...... 10
FIGURE 2: DENSITY PLOTS [BLUE = LOW, RED = HIGH] [FKG10] ...... 12
FIGURE 3: VISUAL ABSTRACTION OF THE RELATIONSHIP BETWEEN DOCUMENTS AND WORDS IN LDA ...... 13
FIGURE 4: MODEL STRUCTURES OF LDA AND A FOUR-LEVEL PACHINKO ALLOCATION [LM06] ...... 14
FIGURE 5: COMPARISON BETWEEN LDA, PACHINKO ALLOCATION (PAM), AND TWO OTHER TECHNIQUES: LEFT: PACHINKO PERFORMS BETTER WITH A HIGHER NUMBER OF TOPICS; RIGHT: PACHINKO PERFORMS BETTER WITH MORE TRAINING DATA [LM06] ...... 15
FIGURE 6: CORPUS GENERATION FROM ISSUE DESCRIPTIONS AND CHANGESETS ON CLASS LEVEL ...... 18
FIGURE 7: LDA REPRESENTED AS A GRAPHICAL MODEL WITH THE HIDDEN PARAMETERS α AND η AND THE OBSERVED PARAMETERS [SS18] ...... 21
FIGURE 8: IMPORTER APPLICATION CONTEXT ...... 31
FIGURE 9: UML DIAGRAM OF THE BUILDING BLOCKS IN THE IMPORTER APPLICATION ...... 32
FIGURE 10: IMPORT PROCESS ...... 32
FIGURE 11: DATABASE DIAGRAM ...... 33
FIGURE 12: TASKS OF THE FEATURE LOCATION PROCESS ...... 39
FIGURE 13: FEATURE-LOCATION APPLICATION CONTEXT ...... 40
FIGURE 14: UML DIAGRAM OF THE BUILDING BLOCKS IN THE FEATURE-LOCATION APPLICATION ...... 42
FIGURE 15: FLOW CHART OF THE TRAIN TASK ...... 44
FIGURE 16: FLOW CHART OF THE EVALUATION TASK ...... 45
FIGURE 17: FLOW CHART OF THE VALIDATE TASK ...... 46
FIGURE 18: PROGRAMMING LANGUAGE COMPOSITION OF THE ZOOKEEPER REPOSITORY BY LINES OF CODE ...... 50
FIGURE 19: TRAINING TIME OF AN LDA AND A PACHINKO ALLOCATION MODEL; LDA 350 TOPICS; PA 250 + 50 TOPICS; 100 ITERATIONS; AMD RYZEN 2700X (8-CORE) @ 3.70GHZ – WIN 10 ...... 51
FIGURE 20: PERFORMANCE OF LDA AND PA TRAINING ON DIFFERENT COMPUTER SYSTEMS. LDA (LEFT): 500 TOPICS, 100 ITERATIONS, AND A BURN-IN OF 10; PA (RIGHT): 100 TOPICS AND 100 SUB-TOPICS, 100 ITERATIONS, AND A BURN-IN OF 10. EACH CPU SUPPORTS HYPERTHREADING FOR 2 LOGICAL CORES ...... 52
FIGURE 21: EVALUATION TIME OF AN LDA AND A PACHINKO ALLOCATION MODEL; LDA 350 TOPICS; PA 250 + 50 TOPICS; 100 ITERATIONS; AMD RYZEN 2700X (8-CORE) @ 3.70GHZ – WIN 10 ...... 52
FIGURE 22: MRR AND LOG-LIKELIHOOD OBSERVED DURING HYPERPARAMETER TUNING OF AN LDA MODEL WITH 100 ITERATIONS AND A BURN-IN OF 10 ...... 53
FIGURE 23: MRR AND LOG-LIKELIHOOD OBSERVED DURING HYPERPARAMETER TUNING OF A PA MODEL WITH 100 ITERATIONS AND A BURN-IN OF 10 ...... 54
FIGURE 24: LDA AND PA AVERAGE LOGARITHMIC LIKELIHOOD WHILE TUNING THE NUMBER OF ITERATIONS. BURN-IN FACTOR 10; LDA 350 TOPICS; PA 100 + 100 TOPICS ...... 55
FIGURE 25: LDA AND PA AVERAGE LOGARITHMIC LIKELIHOOD WHILE TUNING THE BURN-IN FACTOR. 100 ITERATIONS; LDA 350 TOPICS; PA 100 + 100 TOPICS ...... 55
FIGURE 26: TUNING OF ALPHA (α) AND ETA (η) PARAMETERS OF BOTH LDA AND PA: LDA 350 TOPICS; PA 100 + 100 TOPICS; OTHER PARAMETERS ACCORDING TO PREVIOUS CHAPTERS ...... 56
FIGURE 27: HISTOGRAM OF THE DISTRIBUTION OF SEARCH RESULT RANKS EXTRACTED FROM THE VALIDATION PROCESS OF FEATURE LOCATION RESULTS BASED ON THE LDA AND PACHINKO ALLOCATION MODELS ...... 60
FIGURE 28: FREQUENCIES AND PERCENTAGES OF CURSOR HOVERS AND CLICKS OCCURRING ON THE SEARCH RESULTS. PERCENTAGES REFLECT THE PROPORTION OF HOVER OR CLICK EVENTS OVER ALL 10 RESULTS [HWD11] ...... 63


FIGURE 29: MEAN NUMBER OF SEARCH RESULTS HOVERED OVER BEFORE USERS CLICKED ON A RESULT (ABOVE AND BELOW THAT RESULT). RESULT CLICKS ARE RED CIRCLES, RESULT HOVERS ARE BLUE LINES [HWD11]...... 63


V List of Tables

TABLE 1: ACCURACIES [LM06] ...... 15
TABLE 2: OVERVIEW OF TOPIC MODELING LIBRARIES ...... 27
TABLE 3: COMPOSITION OF PROJECTS IN THE GOLDSETS ...... 30
TABLE 4: DESCRIPTION OF COMPONENTS IN THE CONTEXT OF THE IMPORTER APPLICATION ...... 31
TABLE 5: OVERVIEW OF DATA PRODUCED BY IMPORTER STEPS ...... 38
TABLE 6: DESCRIPTION OF COMPONENTS IN THE CONTEXT OF THE FEATURE-LOCATION APPLICATION ...... 41
TABLE 7: COMMAND LINE OPTIONS OF THE APP.PY FILE ...... 43
TABLE 8: STRUCTURE OF THE LERO GOLDSET ...... 48
TABLE 9: STRUCTURE OF THE CORLEY GOLDSET ...... 48
TABLE 10: STRUCTURE OF GIT AND JIRA DATA ...... 49
TABLE 11: CONFIGURATIONS OF LDA AND PA USED FOR EVALUATING THE SEARCH RESULT PERFORMANCE ...... 58
TABLE 12: WORD DISTRIBUTION ASSOCIATED WITH ONE OF THE TOPICS/SUBTOPICS OF THE LDA AND PACHINKO ALLOCATION MODELS USED FOR THE PERFORMANCE EVALUATION ...... 58
TABLE 13: METRICS CALCULATED DURING THE EVALUATION OF THE SEARCH RESULT PERFORMANCE ...... 59


VI Table of Listings

LISTING 1: REGULAR EXPRESSION APPROXIMATING THE DETECTION OF JAVA METHOD NAMES ...... 8
LISTING 2: EXAMPLE OF PATTERN MATCHING USING A REGULAR EXPRESSION TO DETECT A METHOD NAME ...... 8
LISTING 3: GIT LOG COMMAND ...... 34
LISTING 4: GIT LOG OUTPUT OF A COMMIT TO THE JABREF REPOSITORY (SHORTENED) ...... 34
LISTING 5: GIT COMMAND FOR A COMMIT TO THE JABREF REPOSITORY ...... 35
LISTING 6: GIT DIFF OUTPUT OF A COMMIT TO THE JABREF REPOSITORY (SHORTENED) ...... 35
LISTING 7: EXTRACT FROM THE REGULAR EXPRESSIONS USED FOR METHOD AND CLASS DETECTION BY THE INTERPRETER.PY ...... 36
LISTING 8: SAMPLE URL FOR ACCESSING AN ISSUE FROM A JIRA INSTANCE ...... 37
LISTING 9: BEGINNING OF A CLASS-BASED SEARCH RESULT FOR THE ZOOKEEPER REPOSITORY ...... 45
LISTING 10: START COMMAND FOR TRAINING, EVALUATION, AND VALIDATION OF LDA AND PA MODELS ...... 51
LISTING 11: TRAINING, EVALUATION, AND VALIDATION OF AN LDA MODEL THAT PERFORMS DIFFERENTLY EVEN THOUGH THE SAME PARAMETERS AND SEED VALUE HAVE BEEN USED ...... 57


VII List of Abbreviations

FL     Feature Location
IDE    Integrated Development Environment
IR     Information Retrieval
ITS    Issue Tracking System
LDA    Latent Dirichlet Allocation
MRR    Mean Reciprocal Rank
NLP    Natural Language Processing
PA     Pachinko Allocation
RegEx  Regular Expression
VCS    Version Control System


1 Introduction

1.1 Background

In software development, adding new features to a system is not the developer task that takes up most of the time and budget. Rather, 60% to 80% of the budget is spent on software maintenance, a broad activity that includes work done after a feature or a whole software system becomes operational [Ch01]. Since further research shows that about 50% of the time spent on maintenance is invested into understanding existing code, tools that support this process can have a significant effect on the productivity of developers [Bo14]. Specifically, research in the field has shown that as much as 88% of manual searches for a feature in an unfamiliar project return no relevant results [Ko06].

So far, manual searches in the source code, supported by tools like grep and the text editor search, have been used by developers in the task of understanding source code [Jo15]. In such tasks, and when required to find a solution to a maintenance problem, those tools already allow the developer to use wildcards and pipes to connect searches based on the code in the codebase [Jo15]. Beyond this, however, new techniques that integrate further existing information into the data pool are being explored in current research. One relevant field is feature location as a search engine for source code, as it can help developers find relevant parts of the source code by simply entering a natural language search query into the engine. Techniques used under the hood to enable such functionality can include source code analysis, information retrieval, or natural language processing.

1.2 Problem Description

For developers unfamiliar with a codebase, the process of locating features based purely on textual searches in the source code has shown a low success rate [Ko06]. Therefore, advanced techniques for historical feature location have previously been explored as a possible solution. The concept was introduced and debated in previous research and relies on topic modeling fed with a codebase's historic data and descriptions collected from issue tracking systems [CKK15], [CDK20], [CEB15], [CEB17]. Together, the data can form feature descriptions on which topic modeling can create clusters used to match search queries to search results. Yet, there are no ready-to-use tools that can do this reliably. Even those prototypes provided by researchers do not yet generate results that could rival the modern search engines popular in non-coding contexts. Further, those prototypes so far struggle to provide reproducible results and benchmarks upon which investigations in the field can be extended [Ma18].


For those reasons, the question of whether newer topic modeling techniques bring an advantage over the commonly used Latent Dirichlet Allocation (LDA) must be discussed and a solution for reproducible results and benchmarks needs to be provided.

1.3 Thesis Goal

The goal of this thesis and its research is based on data that describes a software system's evolution. Such data includes the source code history as tracked by version control systems like Git as well as issue descriptions from issue tracking systems like Jira. In the context of this thesis, this kind of data will be called codebase history data. With this, it is the goal of this thesis to answer the following research questions:

(RQ1) How can different feature location techniques be applied to codebase history data while optimizing results for the comparison of different approaches as well as for reproducibility?

(RQ2) How does the Pachinko Allocation Model perform on the task of feature location compared to the more popular Latent Dirichlet Allocation Model?

1.4 Thesis Objective

The thesis is structured following two objectives that mirror the previously defined research questions. While the first objective focuses on a solution that can provide a basis for comparison between topic modeling techniques in the context of feature location, the second one deals specifically with the comparison of the two selected techniques, the Latent Dirichlet Allocation and the Pachinko Allocation.

(O1) Design and implementation of a toolkit that supports the application of feature location to historical codebase data. It is part of this objective to achieve flexibility through modularization for the exchangeability of specific technologies as well as reproducibility of results through machine- and human-readable outputs.

(O2) Comparison of the Pachinko Allocation with the Latent Dirichlet Allocation in terms of performance and accuracy when applied in the context of feature location. This objective contains the requirement to investigate the effect that varying configurations can have on the performance and accuracy of the models. Further, a direct comparison between two optimized models in terms of locating features in historic codebase data is part of this objective.

1.5 Ethics and Sustainability

Technologies like machine learning, clustering, or topic modeling use generalization as a basis to provide their functionality. In contexts where persons or their work are involved, this generalization can raise ethical concerns. The data imported from the version control system and issue tracker includes such data, but those data points that could allow a link between authors and code are not processed or presented by the tools. Further, the manually conducted analyses of the data presented in this thesis do not contain information on such links either. Therefore, the subject of this thesis is believed not to raise ethical concerns.

Its sustainability, though, can be judged in an economic sense. As stated before, between 60% and 80% of costs incurred during software development are related to maintenance, with about 50% of that being invested into understanding existing source code [Ch01], [Bo14]. A reduction of the time required to locate features can therefore justify the investment into the exploration of feature location techniques, as it reduces the cost of maintenance. It also allows developers to spend more time on tasks like the implementation of new features or software quality control.

1.6 Methodology

The thesis is separated into two objectives that were presented previously in chapter 1.4. To fulfill those objectives and to be able to find answers for the research questions introduced in chapter 1.3, the work conducted can be split into three steps: research, implementation, and evaluation.

• Research
  o Collection and investigation of available resources and tools that can support the implementation and evaluation.
• Implementation
  o Design and implementation of an importer tool for data from Git repositories and Jira issue tracking systems with a focus on expandability and reusability.
  o Exploration of multiple approaches to feature location, including different concepts for data preparation and result validation as well as different libraries for topic modeling.
  o Conducting preliminary evaluations to determine the suitability of approaches for being included in the final tool.
• Evaluation
  o Optimization of the parameters of two topic models, one for LDA and one for the Pachinko Allocation.
  o Analysis of performance and topic detection accuracy based on goldset validation.

The applied methods are discussed in detail in chapter 3.


1.7 Stakeholders

Since the application of feature location can provide benefits in many aspects of software maintenance, potential stakeholders can be found in most companies that work on the modernization and extension of software products. Examples of those companies include IT consultancies that work with previously unknown, external systems, as well as their clients who often deal with legacy applications. Further stakeholders are researchers in the fields of software architecture and quality management who may find this work and its results useful for their studies.

1.8 Delimitations

This thesis provides an overview of topic modeling techniques for historical feature location while also going into detail on selected subjects. A noteworthy delimitation exists for the execution of the evaluation and specifically for the process of hyperparameter tuning. Due to the high execution times of Pachinko Allocation models, a full nested cross-validation process, as is common in a machine learning context, is not performed. Instead, individual tuning of selected hyperparameters must suffice for the scope of this work. It is further not required that the toolkit created during the implementation phase of this thesis be a production-ready application. Instead, a prototypical implementation is desired.

1.9 Outline

This thesis is divided into seven chapters. After this introduction, the second chapter describes the background and work related to the topic of this thesis. This includes a breakdown of feature location into its underlying steps and parts as well as summaries of research papers that have already applied topic modeling to historical codebase data to locate features.

The third chapter focuses on the methods applied during both the implementation and evaluation phases. Those methods include approaches for data preparation, training, and optimization of topic models, as well as validation techniques for the results of feature location.

The implementation in chapter 4 describes the toolkit on which further analyses are based. It is split into two parts, one documenting the importer tool and the other describing the characteristics of the application responsible for executing feature location. Following a modular approach, the toolkit supports two topic models: the Latent Dirichlet Allocation and the Pachinko Allocation.

Chapter 5 focuses on evaluating the performance of the toolkit as well as the accuracy of feature location search results provided by the two supported topic models. Further, the object of evaluation, an open-source Java application called ZooKeeper, is introduced and discussed here.


Finally, in chapter 6, the results previously presented are discussed. Based on those discussions, the research questions are answered and threats to the validity of the results are reviewed. The last chapter serves as a conclusion to the thesis and summarizes the results as well as the work that was done in order to fulfill the thesis objectives.


2 Background and Related Work

The scope of this thesis requires the application of topic modeling-based feature location techniques to documents that describe said features in textual form. Those documents consist of natural language text and changes made to the source code. This chapter will break down the high-level idea of feature location into granular steps that describe the state of the art in this field of research. Further, it will explain the function of those steps as well as their purpose in the context of the thesis.

Starting with an elaboration on what feature location is and how it can be used by developers and researchers to understand the structure of software projects, the principles that it is based on will be discussed. They include information retrieval, topic modeling with Dirichlet distributions, as well as text mining and cleaning.

2.1 Feature Location

Feature location is the application of natural language processing with the specific goal of locating artifacts in text documents that match a search query. In this section, feature location will be summarized in the context of locating features in source code while also touching on available tools that can perform this task. Furthermore, a listing of resources that can assist during the evaluation of feature location results is presented [FB19].

2.1.1 Definition and Taxonomy

The main purpose of refactoring is to remove technical debt from the codebase. It can be defined as a noun and as a verb respectively:

Refactoring (noun): a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behaviour.

Refactoring (verb): to restructure software by applying a series of refactorings without changing its observable behaviour.

In order to be able to execute refactorings and work with source code in general, developers need to gain a solid understanding of the codebase at hand. Feature location can be the first step of this process, as it is the act of identifying the source code entity or entities that implement a feature [CKK15]. It can be executed based on four types of analyses: dynamic, static, historical, and textual [CKK15], [Di13]. Loosely defined it can be described as a search engine for features where the results are methods, classes, or source files. According to a study on feature location in source code [Di13], dynamic analyses are conducted at a system's runtime, observing features during the execution of an application. The study finds

Investigating topic modeling techniques for historical feature location. Page 6 Chapter 2 Background and Related Work that they can be classified for gathering the execution trace of the application while the feature in question is invoked and then either compare them to traces where the feature was not invoked or perform frequency-based analyses. Since, given correct scenarios, features can be mapped by this, the study finds that they are a popular choice that only comes with the disadvantage of causing a significant overhead on the execution time. Static analyses are described by the same study as processes that examine structural information like dependencies in the code or the data flow in the application. While this tactic may be closely related to what a developer does when he tries to understand code manually, the study identified a high probability for false positives when this is automated. The textual approaches surveyed by the researchers of the study follow the idea of aggregating identifiers and comments which are connected to textual descriptions of domain knowledge. This text can then be used to search for a feature with a query in text form. To facilitate the implementation of such an analysis the study identified three techniques: pattern matching, in- formation retrieval, and natural language processing, which will be further discussed in chapter 2.2. Lastly, a feature location technique described as historical by the study relies on the mining of data from version control systems and thereby gathering relevant lines of code or artifacts re- lated to a feature. Recent approaches in research usually combine at least two of the techniques in an attempt to factor out the disadvantages of one with the advantages of another [Di13]. 
The combination this thesis will focus on is a historical analysis that also makes use of textual approaches to feature location through topic modeling which is, as discussed in chapter 2.3, an information retrieval technique.

2.1.2 Tools for Feature Location

Previous research has already produced some proofs-of-concept and production-ready tools for applying feature location to projects. This chapter provides a brief overview. For the two most relevant techniques, the historical and the textual analysis, the tools CVSSearch [CR01] and Hipikat [Cu05] enable historical feature location [Di13]. Others for textual feature location are mostly integrations of text search functions into integrated development environments like Eclipse; examples are Google Eclipse Search [Po06] and IRiSS [Po05] [Di13]. Beyond these, tools like TraceGraph [Lu94], STRADA [EBG07], and Featureous [OJ10] implement forms of dynamic feature location, and Ripples [CR01] and Suade [WR07] implement static feature location techniques.


2.1.3 Datasets for Benchmarking

Besides tools that enable feature location in a certain way, some researchers have also published datasets that are supposed to assist further research by providing a goldset of query-result mappings which can be used to evaluate the results of new techniques. Both [CKK15], [CDK20]¹ and [Ma18]² provide such datasets. Due to their relation to this thesis, their work is revisited in chapter 2.6.

2.2 Text Mining

As stated in the previous chapter, three techniques facilitate textual analysis for feature location: pattern matching, information retrieval, and natural language processing [Di13]. Those will be introduced in this chapter. In the technology survey [Di13], the authors describe pattern matching as a mere textual search using utilities such as grep³. It is described as relatively robust but not very precise since the chances of query terms matching words found in source code are relatively low. Further technologies commonly used include regular expressions, which match patterns consisting of one or more character literals, operators, or constructs [Mi21]. Regular expressions will be used as a way of pre-filtering data during the mining of text in this thesis. One example can be seen in the combination of Listing 1 and Listing 2.

(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])

Listing 1: Regular expression approximating the detection of Java method names.

/**
 * Add the specified scheme:auth information to this connection.
 */
public void addAuthInfo(String scheme, byte[] auth) {
            ↑ Method ↑
    cnxn.addAuthInfo(scheme, auth);
}

Listing 2: Example of pattern matching using a regular expression to detect a method name.
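Applied in code, the pattern from Listing 1 recovers the method name from the snippet in Listing 2. The following is a minimal sketch (Python is assumed here, as it is the language of the thesis toolkit; the demonstration itself is not part of that toolkit):

```python
import re

# Regex from Listing 1: approximates the detection of Java method names.
METHOD_RE = re.compile(
    r"(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])"
)

# Java snippet from Listing 2.
snippet = """
/**
 * Add the specified scheme:auth information to this connection.
 */
public void addAuthInfo(String scheme, byte[] auth) {
    cnxn.addAuthInfo(scheme, auth);
}
"""

# findall returns the single capturing group, i.e. the method name.
# The call inside the body does not match, as the pattern requires a
# modifier (or whitespace) followed by a return type before the name.
names = METHOD_RE.findall(snippet)
print(names)  # ['addAuthInfo']
```

Note that the pattern is only an approximation: it keys on modifiers, a return type, and a parameter list, so unusual formatting can still produce false negatives or positives.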

1 Available on GitHub: https://github.com/cscorley/changeset-feature-location/
2 Available at: https://lero.ie/research/datasets/feature_location/comparison
3 http://www.gnu.org/software/grep/ (accessed and verified on 10/05/2021)


The survey [Di13] further describes information retrieval techniques as statistical methods used to find code relevant to a feature based on identifiers and comments similar to a query. Technology-wise, the Latent Dirichlet Allocation (LDA) [BNJ03] is among the ones referenced by the survey. As a form of topic modeling that is based on Dirichlet distributions, LDA will be evaluated next to the related Pachinko Allocation [LM06] in this thesis. For its relevance in determining which parts of the source code match a query, Dirichlet distributions as well as LDA and the Pachinko Allocation are described further in chapters 2.4 and 2.5, after the analysis of a general survey of the state of the art in topic modeling in chapter 2.3.

Natural language processing (NLP) is a broad approach that can be used to analyze the parts of speech used in source code and a search query to match them more precisely than is possible through pattern matching, yet it is also more expensive [Di13]. Techniques associated with NLP, like the pre-filtering of search queries through the tokenization of sentences into words and the removal of stop words, will be used in this thesis.
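The tokenization and stop-word removal mentioned above can be illustrated in a few lines. The sketch below is hypothetical: the function name and the (deliberately tiny) stop-word list are invented for illustration and are not taken from the thesis toolkit:

```python
import re

# Illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "in", "for", "and", "this"}

def preprocess(query: str) -> list[str]:
    """Lowercase a query, split it on non-word characters, and drop stop words."""
    tokens = re.split(r"\W+", query.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("Add the auth info to this connection"))
# ['add', 'auth', 'info', 'connection']
```

The surviving tokens are the ones a topic model would actually match against its vocabulary; articles and prepositions carry no topical signal and would only add noise.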

2.3 Topic Modeling

Topic modeling as an information retrieval technique will play a major role in the execution of this thesis project. Therefore, the state of the art in topic modeling will be summarized in this chapter. Being a technique for unsupervised learning with a specialization in detecting topics in text documents, topic modeling has great potential within today's internet technologies [BB07]. It can, for example, be used to classify digitized documents in an archive which would otherwise require manual analysis to put them into groups for easy browsing. To implement this, multiple types of models exist. With the focus on LDA and the Pachinko Allocation, those will be explained in subsequent chapters.

Essentially, topic modeling can be described as probabilistic clustering for text documents. On a lower level, this means that the model can connect words with similar meanings into topic groups with the same context and distinguish between uses of words with multiple meanings [AA15]. A topic is therefore a distribution over a fixed vocabulary. In an example of clustering articles about different sciences, one cluster would contain words that are often used in genetics with a high probability while another one might contain words about evolutionary biology [Bl12]. Figure 1 shows this way of working with articles graphically, yet it is to be emphasized that the algorithms of LDA and the Pachinko Allocation have no information about the subjects of the topics they create, as there is no labeling involved [Bl12]. The resulting model can be used, for example, in a tool for the automated grouping of documents or as a search engine for user queries. The latter will be the use case applied in this thesis.
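The notion of a topic as a distribution over a fixed vocabulary, and its use as a search engine, can be made concrete with a toy example. The two topics and their word probabilities below are invented for illustration; a trained LDA or Pachinko Allocation model would learn such distributions from a corpus, and, as noted above, would leave them unlabeled:

```python
import math

# Invented word distributions for two unlabeled topics (illustrative only).
topics = {
    "topic_0": {"gene": 0.04, "dna": 0.03, "genome": 0.02, "data": 0.01},
    "topic_1": {"evolve": 0.03, "species": 0.03, "organism": 0.02, "data": 0.01},
}

def log_score(query_terms, topic):
    """Log-probability of the query terms under a topic; unseen words get a small floor."""
    return sum(math.log(topic.get(w, 1e-6)) for w in query_terms)

# A query is matched to the topic whose distribution best explains it.
query = ["dna", "genome"]
best = max(topics, key=lambda t: log_score(query, topics[t]))
print(best)  # topic_0
```

This is the essence of the search-engine use case: documents associated with the best-scoring topic become the search results for the query.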

Investigating topic modeling techniques for historical feature location. Page 9 Chapter 2 Background and Related Work

Figure 1: Illustration of topic modeling applied to an article using LDA [Bl12].

Besides the aforementioned Pachinko Allocation Model, research has produced many alternatives to the LDA model, which is the conceptual basis for a couple of them [LM06]. LDA is in turn based on the Dirichlet distributions discussed in the following chapter 2.4 [BNJ03]. Training of topic models is, in general, applied in batches, adding multiple sets of data to the model and then finalizing it, making it impossible to add more documents later on. This applies to LDA and the Pachinko Allocation Model as well, yet there is a move towards the application of online topic modeling. With those online variants, which are based on the classic models, documents can be appended to the corpus whilst having a usable model at the same time [BB07]. While this would be a useful addition to the tool developed in this thesis, allowing it to provide an always up-to-date search engine for features, implementations of such topic models in libraries are sparse and those that are publicly available are limited to LDA. Theories for those online variants are discussed and compared in papers like [BB07].

In contrast to clustering, topic modeling techniques do not generate distinct results assigning each document to one cluster (or topic), but to many. This is comparable to fuzzy clustering, a subcategory of classic clustering algorithms [Pr17]. In topic models like LDA and the Pachinko Allocation, this association with multiple clusters is presented in the form of Dirichlet distributions, which will be the topic of the next subchapter.


2.4 Dirichlet Distributions

Acquiring an in-depth understanding of Dirichlet distributions requires an extensive amount of knowledge in statistics. Such knowledge is not required to follow the argumentation and implementation presented in this thesis, yet a high-level understanding of what the Dirichlet distribution describes is useful and will be summarized in this chapter. To achieve this, the explanations of [FKG10] will be mapped onto the example of a three-sided die.

A six-sided die with each of the numbers "1", "2", and "3" appearing twice will be called a three-sided die and can be considered a probability mass function, just like an ordinary die [FKG10]. Such dice, manufactured manually, will not be uniformly weighted and therefore have a randomness of outcomes that is not considered fair. It is likely that a loaded die is produced, resulting in an unfair distribution of probabilities. The probability mass function θ for three-sided dice can be written as a vector in ℝ³ as shown below, where θ_fair symbolizes the (practically impossible) fair die and θ_loaded symbolizes an unfair die with which the value "3" is the most likely and "1" the least likely:

θ_fair = (1/3, 1/3, 1/3)
θ_loaded = (0.9/3, 1/3, 1.1/3)

Given a bag of 100 such handmade dice, the randomness of the probability mass function θ = (θ_1, θ_2, …, θ_k), in this case with k = 3, can be modeled with the Dirichlet distribution. The given prerequisites are that the outcomes of each side are greater than zero and in sum equal one:

θ_i ≥ 0 for i = 1, 2, …, k
Σ_{i=1}^{k} θ_i = 1

It is also required that a parameter α = (α_1, α_2, …, α_k) is defined with α_i > 0 for i = 1, 2, …, k.

With this, each probability mass function θ can be visualized as a point on a triangle in a three-dimensional Euclidean space. Now it is the function of the Dirichlet distribution to provide a probability distribution over this space. For this, the density function below is used:

p(θ | α) = (Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i)) · Π_{i=1}^{k} θ_i^(α_i − 1)

Combining these then results in density plots which have a different density center depending on the parameter α that was used.


Figure 2: Density plots for α = (1, 1, 1), α = (0.1, 0.1, 0.1), α = (10, 10, 10), and α = (2, 5, 15) [blue = low, red = high] [FKG10].

For a uniform α = (c, c, c) with c > 1, the density plot is as shown in Figure 2 in the bottom-left corner: there is a density center in the middle of the triangle. Interpreted through the example of the three-sided dice, this means that the majority of outcomes is close to θ_fair, suggesting an advanced and precise production technique. For a similarly uniform α with the special case c = 1, the top-left corner of Figure 2 applies. Here all outcomes are equally likely, producing all kinds of fair and unfair dice. An again uniform α with 0 < c < 1 produces dice that are less likely to be fair than unfair, with the amount of fairness decreasing as c approaches 0. Finally, for α being a vector that is not uniform, the center moves towards a corner of the triangle, which means that the dice produced will be loaded in a certain direction.

This Dirichlet distribution will play a major part in the setup of both the Latent Dirichlet Allocation and the Pachinko Allocation. Also, the prior α will be relevant as one of the hyperparameters that can be used to influence the accuracy of the models. Yet in current research, this is an uncommon practice, with nearly all researchers choosing simple symmetrical Dirichlet priors [SS18].
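The effect of the prior α can be illustrated with a short simulation. The following sketch (using NumPy, not part of the thesis tooling) samples three-sided "dice" from Dirichlet distributions with a large and a small uniform α and compares how far they land from the fair die on average:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mean_distance_to_fair(alpha, n_samples=2000):
    """Sample three-sided "dice" from Dirichlet(alpha) and return their
    average L1 distance to the fair die theta_fair = (1/3, 1/3, 1/3)."""
    samples = rng.dirichlet(alpha, size=n_samples)  # each row sums to 1
    return np.abs(samples - 1 / 3).sum(axis=1).mean()

# A large uniform alpha concentrates the density around the fair die
# (precise production), a small one pushes it toward the corners (loaded dice).
spread_precise = mean_distance_to_fair((10, 10, 10))
spread_sloppy = mean_distance_to_fair((0.1, 0.1, 0.1))
```

Here `spread_precise` comes out much smaller than `spread_sloppy`, mirroring the interpretation of the density plots above.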

2.5 Latent Dirichlet Allocation and Pachinko Allocation

Both the Latent Dirichlet Allocation and the Pachinko Allocation are composed of Dirichlet distributions. Since both will be actively used to create topic models in this thesis, current research into those techniques will be summarized in this chapter.

The Latent Dirichlet Allocation was first introduced in 2003 and has since been used in many research papers that applied topic modeling in all kinds of scenarios [BNJ03]. Some of them will be referenced later in this thesis for their relation to the topic. According to the initial paper, the predecessors of LDA include the tf-idf scheme by [SM83], Latent Semantic Indexing (LSI) [De90], and probabilistic Latent Semantic Indexing (pLSI) as an extension of LSI [Ho99], [BNJ03]. The reason LDA was introduced in addition to those solutions was to provide a solution that works under the so-called "bag-of-words assumption", which considers the order of words in a document as irrelevant [Bl12], [BNJ03]. tf-idf is simply based on counting words and terms in documents, from which it creates a term-by-document matrix, effectively reducing documents of variable length to fixed-length lists of numbers [BNJ03]. LSI and pLSI extend on this but come with problems of their own, as discussed in [BNJ03].

Figure 3: Visual abstraction of the relationship between documents and words in LDA.


Finally, LDA defines hidden topics to capture latent semantics in text documents [Li15]. With LDA, each document is represented by a distribution over unobserved (latent) topics, with each topic being described by a distribution over words [Li15], [BNJ03]. In this explanation, the fact that the topics are unobserved means that they do not carry labels that would describe them, but rather that they are a collection of the proportions of their contents. With this, the order of words is irrelevant, which is further provable through the so-called De Finetti's representation theorem [BNJ03]. A visual abstraction of this relationship can be seen in Figure 3.

While introducing their new model, the authors of the Pachinko Allocation compare their idea with the Latent Dirichlet Allocation through a visualization shown in Figure 4 that is comparable to the one in Figure 3. The directed graphs of both LDA and the Pachinko Allocation can be seen consisting of a node for sampling documents (r), interior nodes representing the topics made from Dirichlet distributions (S), and leaves representing words (V) [LM06]. The difference lies in the number of interior layers, of which the Pachinko Allocation has two compared to one. This is so that the Pachinko Allocation can detect correlations between topics, for which LDA has no means of identification [LM06]. While even more layers are possible, this so-called four-level Pachinko Allocation Model was the one described in the initial paper [LM06]. In summary, the motivation for the Pachinko Allocation was to create accurate models that can discover large numbers of fine-grained topics and the correlations between those topics.

The paper defining the Pachinko Allocation further measures it against other models, including a comparison of classification accuracy with the Latent Dirichlet Allocation. This is visualized in Table 1. The results show that the Pachinko Allocation has consistently better accuracy than LDA.

Figure 4: Model structures of LDA and a four-level Pachinko Allocation [LM06].
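The document-topic and topic-word structure that LDA assumes can be illustrated with a minimal sketch of its generative process. Vocabulary, topic count, and parameter values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Made-up vocabulary and sizes, chosen only to illustrate the structure.
vocab = ["commit", "merge", "branch", "render", "pixel", "color"]
n_topics, doc_len = 2, 50
alpha, eta = 0.5, 0.5

# Each topic is a distribution over the fixed vocabulary: phi_k ~ Dirichlet(eta).
phi = rng.dirichlet([eta] * len(vocab), size=n_topics)

# Each document is a distribution over the latent topics: theta ~ Dirichlet(alpha).
theta = rng.dirichlet([alpha] * n_topics)

# Generate a bag of words: draw a topic per word, then a word from that topic.
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)            # latent, unlabeled topic assignment
    words.append(str(rng.choice(vocab, p=phi[z])))
```

Note that the topic index `z` never carries a label, matching the point above that the topics themselves are unobserved.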


class Number of docs LDA PAM

graphics 234 83.95 86.83

os 239 81.59 84.10

pc 245 83.67 88.16

mac 239 86.61 89.54

windows.x 243 88.07 92.20

total 1209 84.70 87.34

Table 1: Document classification accuracies [LM06].

Tests conducted in the paper further show that the Pachinko Allocation can handle a higher number of topics than LDA, all while maintaining a higher likelihood. The same applies to a larger amount of training data, making it an alternative to consider when working with topic modeling. The corresponding graphs are shown in Figure 5.

Figure 5: Comparison between LDA, Pachinko Allocation (PAM), and two other techniques. Left: Pachinko performs better with a higher number of topics; Right: Pachinko performs better with more training data [LM06].

2.6 Related Work

Previous research related to this thesis in the field of feature location in source code has been studied. It will regularly serve as a basis for argumentation in this thesis. Most notably, two groups of papers were influential: [Co15], [CDK20] as well as [CEB15], [CEB17]. They will be discussed at the end of this chapter.


Further, [Ma18] has been working in the same field through the preparation of a benchmark for ArgoUML⁴, specifically for ArgoUML SPL⁵ (a version of the software that is being re-engineered into a Software Product Line). For this, the location of features was manually added as annotations to the source code with pre-processor directives. The paper does not apply any form of automated topic modeling itself though, as it is just proposing a way to evaluate feature location techniques.

The aforementioned papers [CEB15], [CEB17] go beyond the theoretical planning of an analysis by executing feature location on eight subject systems and 600 features. With this, they also introduce a feature location technique called ACIR (Aggregate Changeset descriptions for Information Retrieval), which works on a similar search corpus as the technique used in this thesis. The conclusions drawn from the analysis are summarized as listed below [CEB17]:

- A granularity on method level reduces the search effort.
- The use of more recent changesets improves search efficiency.
- Aggregation of recent changesets grouped by change request decreases effectiveness.
- Naive, text-classification-based filtering of "management" changesets also decreases the effectiveness.

Other than [Ma18], with the work of [CEB17] an automatic process was established that maps features to the correct places in the source code. The following approaches for this mapping, based on the bug identification number (bugID) found in an issue tracking system and information available in the version control system, are discussed in the paper [CEB17]:

- Matching the bugIDs to IDs found in changeset descriptions.
- Matching the bugIDs to IDs found in change requests.
- Matching the bugIDs to IDs found in patches added to change requests.

After an evaluation, the idea of matching with the changeset descriptions was adopted for the creation of a search corpus, which agrees with the path chosen in this thesis. The feature location search, on the other hand, differs from the technique used in this thesis, as it uses the open-source search software Apache Lucene as an information retrieval technique. The papers' relevance for this thesis is therefore mostly limited to their data gathering and search corpus generation techniques.

Among the papers mentioned, the ones by [CKK15], [CDK20] have been the most helpful as an argumentation basis for this thesis. The papers describe in detail how feature location in

4 ArgoUML: https://argouml-tigris-org.github.io/

5 ArgoUML SPL: https://github.com/marcusvnac/argouml-spl

source code can be applied whilst using an online version of the Latent Dirichlet Allocation on a total of 14 open-source Java projects. Among other results, the research questions asked firstly lead to a comparison between changeset- and snapshot-based feature location. Further, an analysis of the so-called time invariance assumption and of the advantages of different kinds of corpus compositions is performed.

The comparison between changeset- and snapshot-based feature location relates to the data that the corpus generation is executed on. The major difference between them is that for snapshot-based techniques the documents are structural program elements, while in changeset-based approaches the documents are the changesets [CDK20]. This means the granularity of change between two consecutive documents is much smaller for changesets. For the application of LDA on this corpus, the consequence is that frequently changed parts of the source code will turn up more often in the search results. For the application of the aforementioned online version of LDA, this means that the topic model does not have to be retrained from scratch, as new commits can simply be added. Limited by the availability of implementations for online variants of the Pachinko Allocation, the advantage of re-trainability through online LDA was put aside in favor of a uniform and comparable approach in this thesis.

Whilst introducing changesets for the training of topic models, [CDK20] identified the so-called time invariance assumption as a problematic part of the snapshot-based approach. The assumption states that benchmarks based on snapshots may include misleading results, as the source code artifacts in a snapshot that are mapped to the description of a maintenance task may have been changed again after the task was finished but before the snapshot was taken. Here the granularity of the changeset approach is a considerable advantage.

In terms of corpus compositions, [CDK20] considered four categories of data, besides the description of issues similarly mapped to the commits as in [CEB17]:

- Added source code lines (A)
- Context source code lines (C)
- Commit messages (M)
- Removed source code lines (R)

While [CDK20] argues that the inclusion of added lines (A) and the commit message (M) is intuitively necessary, the question is asked whether the other two are relevant. By testing out all combinations on the subject systems, the paper arrives at the result that the optimal combination is (A, C, M). Yet for their research they note that they ended up using (A, R, C).
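Assembling a training document from such components can be sketched as follows; the changeset record and its field names are hypothetical and not the actual data format of [CDK20]:

```python
# Hypothetical changeset record; the keys mirror the four corpus
# categories above but are illustrative, not the format of [CDK20].
changeset = {
    "A": ["public int count()", "return items.size();"],   # added lines
    "C": ["public class Inventory {"],                      # context lines
    "M": ["Fix item counter returning zero"],               # commit message
    "R": ["return 0;"],                                     # removed lines
}

def compose_document(changeset, parts=("A", "C", "M")):
    """Concatenate the selected corpus components into one training document."""
    tokens = []
    for part in parts:
        for line in changeset.get(part, []):
            tokens.extend(line.split())
    return tokens

doc_acm = compose_document(changeset, ("A", "C", "M"))   # reported optimum
doc_arc = compose_document(changeset, ("A", "R", "C"))   # combination actually used
```

Varying the `parts` tuple corresponds directly to testing the different corpus compositions discussed above.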


3 Methods

This chapter will elaborate on the methodologies applied during the research phase of the thesis. It will describe through which ideas the solution was reached, how the tests are designed, and how the evaluation of results is structured.

3.1 Overview

The basis for the research phase is the corpus used to train the topic models. This process is visualized for the example of a class-level granularity in Figure 6. It essentially performs a merging of information from two sources: the history of the source code and the issue descriptions written in natural language.

Figure 6: Corpus generation from issue descriptions and changesets on class level.

Each changeset consists of multiple commits that are, in the case of Git being used, aggregated into requests that are then applied to the main branch of the repository. When using issue tracking systems like Jira, those changesets are associated with an issue description through an ID that is written inside the message field of the merge request. From this, the documents that serve as input for the topic modeling are created. Detailed steps for obtaining the data and preparing it for usage are elaborated in chapter 3.2.
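The linking step can be sketched with a small regular expression; the Jira-style issue-key pattern and the commit messages below are illustrative assumptions, not the exact format used by the thesis tooling:

```python
import re

# Assumes Jira-style issue keys such as "PROJ-123" inside merge-commit
# messages; both the pattern and the example messages are made up.
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")

def extract_issue_ids(message):
    """Return all issue keys referenced in a changeset message."""
    return ISSUE_KEY.findall(message)

messages = [
    "PROJ-4825: Separate MapredWork into MapWork and ReduceWork (merge)",
    "Refactor build scripts",               # no linked issue description
    "Follow-up for PROJ-4825 and PROJ-5003",
]
links = {msg: extract_issue_ids(msg) for msg in messages}
```

Changesets whose messages yield no key (like the second example) simply remain unlinked to any issue description.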


Finally, the pre-processed textual data included in a document consists of:

- Added lines of code
- Unchanged lines of code
- Issue description

This information is further linked to the class that it was extracted from, so that the search engine based on the topic modeling can display the class as a result. Methods used to train and optimize the topic models are further explained in chapter 3.3, while the ones applied to create a search engine based on the models are the topic of chapter 3.4.

3.2 Data Preparation

During the preparation and execution of topic modeling-based feature location, multiple steps are taken to read and transform data. The methods applied during those steps are explained in the following subchapters.

3.2.1 Data Mining

Implementation details on the data mining part of the tool, which is responsible for the import of version control and issue tracking data, are explained in chapter 4.3. Further, the extraction of issue IDs from changesets and the subsequent linking of source code artifacts to issue descriptions are also the topic of those subchapters in chapter 4. All those steps rely heavily on textual filtering based on regular expressions that match structured text with patterns and extract the required tokens.

3.2.2 Text Cleaning

Methods applied in terms of text cleaning fall into the category of removing stop words from both English text and Java source code, as well as a removal of the most common words as part of the LDA and Pachinko Allocation training. While the latter is a simple parameter that can be set in the implementation of the topic modeling techniques used in this thesis, the removal of stop words is a static part of the analysis that is applied on multiple occasions and datapoints. While predefined lists of words exist to filter stop words out of English text, a custom list was used for this thesis. This list originates from the data in the Corley goldset and was chosen to improve the comparability of results with the papers [CKK15], [CDK20] later on in the evaluation. In a similar fashion, the stop words for the Java programming language also originate from the Corley goldset and include common identifiers like "abstract", "interface", or "true" and "false".

Investigating topic modeling techniques for historical feature location. Page 19 Chapter 3 Methods

To complete the text cleaning, other symbolic characters, digits, and whitespaces are also removed from all text inputs, which include issue descriptions, source code artifacts, as well as search queries.
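A minimal sketch of this cleaning pipeline could look as follows; the short stop-word lists here merely stand in for the Corley goldset lists used in the thesis:

```python
import re

# Illustrative stop word lists; the thesis uses lists derived from the
# Corley goldset instead.
ENGLISH_STOPS = {"the", "a", "is", "to", "of"}
JAVA_STOPS = {"abstract", "interface", "true", "false", "public", "return"}

def clean(text):
    """Strip symbols and digits, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove symbolic characters and digits
    tokens = text.lower().split()              # tokenization into words
    return [t for t in tokens if t not in ENGLISH_STOPS | JAVA_STOPS]

tokens = clean("public abstract int getCount() { return count2; }")
# tokens == ["int", "getcount", "count"]
```

The same function can be applied uniformly to issue descriptions, source code artifacts, and search queries.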

3.3 Topic Modeling

Topic modeling with LDA and the Pachinko Allocation is a process that requires the optimization of parameters in order to recognize connections between documents and the words contained in them. This chapter will summarize the available parameters and elaborate on the methods used during optimization.

3.3.1 Topic Model Parameters

Both topic modeling techniques, LDA and the Pachinko Allocation, have a similar setup as described in chapter 2.5. The output, as well as the parameters and the options for tuning the results through the variation of those parameters, are similar for both techniques. This chapter will summarize the relevant settings implemented in the library used to execute the topic modeling techniques before introducing the method of hyperparameter tuning.

The number of clusters is the most obvious parameter through which the model can directly be influenced to return better results. Here the available parameters differ between LDA and the Pachinko Allocation: LDA has the option to vary one layer of clusters (k), while the Pachinko Allocation has, in its most common form, two layers (k_1, k_2).

The prior parameters are given through the Dirichlet distributions, which make up the connections between nodes of both LDA and PA. To create clusters at the extremes of the space, a vector with numbers 0 < x < 1 must be chosen to get distinct cluster concentrations. Yet in previous research, this option has rarely been used [SS18]. In implementations of LDA and the Pachinko Allocation, there are two such parameters: α and η (alpha and eta, shown in Figure 7). The parameter α is responsible for the concentration of topics and the parameter η is responsible for the concentration of words within the topics [SS18]. Since the Pachinko Allocation has two layers of topics, α is supplemented by α_sub for the second layer.

Further, the removal of words that are too common or too uncommon can be facilitated by the interfaces of LDA and Pachinko Allocation implementations. The output of both techniques, when using a document as the input into the model, is a topic distribution. Methods applied to this in order to create a search engine are discussed in chapter 3.4 under the Search Query section.


Figure 7: LDA represented as a graphical model with hidden parameters and the observed parameters α and η [SS18].

3.3.2 Hyperparameter Tuning

Hyperparameter tuning is the practice of varying a model's parameters to achieve better results. In principle, all parameters mentioned in the previous section can be used for this, for example by performing a grid search or simply by testing out random values [CCC10]. Commonly in topic modeling-based research, the variation of the topic number is the one that most attention is given to, while the others are usually just chosen based on best practices and previous experience [SS18].

The method applied in this thesis goes beyond this common approach, yet also starts by focusing on finding the number of clusters that is best for the given task of locating features in the source code. Once a good number of clusters has been found, this configuration is used to tune further parameters. This way, the considerably high run time of tuning multiple parameters at once is avoided. Parameters tuned aside from the number of clusters are the previously discussed Dirichlet parameters alpha and eta, as well as the number of iterations over the training data and the burn-in factor for optimizing initial parameters. Following best practices, the logarithmic likelihood is used as an indicator of optimization progress, with a high logarithmic likelihood representing a good performance of the model at hand [SS18], [BM14].
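The staged tuning procedure can be sketched as below. The scoring function is a stand-in, as a real run would train an LDA or Pachinko Allocation model with the given settings and return its logarithmic likelihood; the grids and the shape of the stand-in function are made up:

```python
import itertools

def score(k, alpha, eta):
    """Stand-in for training a model and measuring its logarithmic
    likelihood; illustrative only (peaks near k=120, alpha=0.1, eta=0.01)."""
    return -((k - 120) ** 2) / 500 - (alpha - 0.1) ** 2 - (eta - 0.01) ** 2

# Stage 1: sweep only the number of topics with fixed default priors.
k_grid = [50, 100, 150, 200, 300]
best_k = max(k_grid, key=lambda k: score(k, alpha=0.1, eta=0.01))

# Stage 2: keep the best k fixed and grid-search the Dirichlet priors,
# avoiding the run time of varying all parameters at once.
best_alpha, best_eta = max(
    itertools.product([0.01, 0.1, 0.5], [0.001, 0.01, 0.1]),
    key=lambda priors: score(best_k, *priors),
)
```

Splitting the search into two stages reduces the number of trained models from the full grid product to the sum of the two grid sizes.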

3.4 Feature Location

Previously explained methods have already teased those applied in the context of feature location. Accordingly, this chapter summarizes those methods. It elaborates on tactics for finding matches between topic distributions in order to match a search query to its related documents, on metrics for measuring the accuracy of said matches, and on goldset-based validation in general.


3.4.1 Search Query

Given a search query, the model inference is applied through the tomotopy library. The result is a topic distribution with either one or two levels, depending on whether the model is an LDA model or a Pachinko Allocation model. With this document-specific topic distribution, the association to other documents can be calculated either by determining the most likely topic of the query and sorting the other documents accordingly, or by calculating the distance between topic distributions. For the latter, the mean deviation can be used as described by the formula below:

(1/n) · Σ_{i=1}^{n} |x_i − y_i|

In this formula, x_i is the probability of a document to be in topic i and y_i is the probability of the document associated with the search query to be in the same topic. Once calculated for all existing documents, the list of possible matches can be sorted by this score.
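A sketch of this ranking step, using hypothetical topic distributions for a query and three candidate classes, could look as follows:

```python
def topic_distance(x, y):
    """Mean absolute deviation between two topic distributions:
    (1/n) * sum over i of |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical inferred topic distributions for a query and three classes.
query = [0.7, 0.2, 0.1]
documents = {
    "Parser.java": [0.6, 0.3, 0.1],
    "Renderer.java": [0.1, 0.1, 0.8],
    "Lexer.java": [0.65, 0.25, 0.1],
}

# Sort the candidate documents by their distance to the query distribution.
ranking = sorted(documents, key=lambda d: topic_distance(query, documents[d]))
# ranking == ["Lexer.java", "Parser.java", "Renderer.java"]
```

The document with the smallest deviation from the query distribution appears first in the result list.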

3.4.2 Performance Metrics

For evaluating the effectiveness of feature location results, previous papers use the mean reciprocal rank (MRR) metric, which is applied to the output of a search [CKK15], [CDK20]. It essentially calculates a score based on the position of the first relevant element in a list of search results. Its formula is shown below [CKK15]:

MRR = (1/|Q|) · Σ_{i=1}^{|Q|} (1 / rank_i)

For example, if the first relevant results for three queries appear at ranks 1, 2, and 4, the MRR is (1 + 1/2 + 1/4) / 3 ≈ 0.58. One major property of the mean reciprocal rank is its preference for highly relevant results over marginally relevant results. This will be an issue that is discussed in chapter 6.2. Arguments for focusing on this metric include that it is a preferred solution in previous research when there are only a few or a single relevant document per search query, as is the case with the project and goldset used for investigation in this thesis [CDK20].

Further performance metrics used to evaluate and discuss results are the mean and median rank as well as the Wilcoxon signed-rank test. While the mean and the median rank are simple metrics to directly judge the outcome, the outcome of a Wilcoxon signed-rank test requires some explanation. This is because it does not judge the results themselves but their significance in a comparison between two techniques, or between two matched pairs of datasets in general. Its T-value can be calculated with the following formula:


T⁻ = Σ_{i=1}^{N⁻} R_i⁻   and   T⁺ = Σ_{i=1}^{N⁺} R_i⁺   and   T = min(T⁻, T⁺)

In this formula, R_i⁻ and R_i⁺ are the signed ranks of the differences between the results of B and A, and N⁻ and N⁺ are the corresponding numbers of differences ≠ 0 [Ré08]. The signed ranks can be computed by ordering the differences by their absolute value in increasing order and then assigning rank values [CMS09]. In the case of ties between values, the average rank is assigned to all of them [CMS09].
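The ranking procedure and the T statistic, together with the normal approximation discussed in the remainder of this chapter, can be sketched as follows (an illustrative implementation, not the thesis tooling; the example differences are made up):

```python
import math

def wilcoxon_signed_rank(differences):
    """Compute T, Z, and the two-sided p-value for paired differences,
    following the normal approximation used in this chapter (sketch)."""
    diffs = [d for d in differences if d != 0]      # keep only differences != 0
    n = len(diffs)
    # Order the absolute differences increasingly and assign ranks,
    # giving tied values the average rank of their group.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t = min(t_minus, t_plus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt((2 * n + 1) * mu / 6)
    z = (mu - t - 0.5) / sigma                       # continuity-corrected Z-score
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |Z|
    p = 2 * (1 - cdf)
    return t, z, p

# Hypothetical rank differences between two techniques B and A on ten queries.
t, z, p = wilcoxon_signed_rank([5, 3, -1, 4, 2, 6, -2, 7, 8, 9])
```

For this made-up sample, Z exceeds 1.96, so the difference would be considered significant at α = 0.05.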

Given T, the significance of the difference between B and A can either be determined by comparison with an existing probability table or by manual calculation of the Z-score for the configuration at hand [Ré08]. A manual calculation can be performed with the formulas below [Ré08]:

Z = (μ − T − 0.5) / σ   with   μ = n(n + 1) / 4   and   σ = √((2n + 1)μ / 6)

Now, for a Z-score larger than 1.96, the difference is significant at a significance level of α = 0.05. This value is known from other tests like Student's t-test [CMS09], [Ré08]. Z-scores that are smaller would mean that the difference between B and A is not significant [Ré08].

If the values of the ranks are not too different, the Z-score can be used in the context of a normal distribution to calculate the p-value [Ra10b]. With this normal distribution, the Z-score can be found on the x-axis and associated p-values can be read from the y-axis [Ra10a]. To find the cumulative probability of the Z-score being outside the area in which we assume that differences are insignificant (p < α), the cumulative distribution function can be calculated for both sides of the normal distribution, resulting in the desired probability (p-value):

p = 2 · (1 − Φ(|Z|))

In this formula, the function Φ is the cumulative distribution function of the standard normal distribution. With the p-value, a direct comparison to the significance level α can be made. If p is smaller than α, the difference is significant. Otherwise, such a conclusion about the significance is not supported by the data.

3.4.3 Goldset-based Validation

Aside from performance metrics, the method of validation used in this thesis is based on goldsets provided by previous research. This means the data used as input for the training of topic models is not split into a train and a test dataset, as is common in machine learning; instead, all the data that is verifiable through the goldset is used. The goldset then consists of test data that was provided together with labels, or in a search engine context together with expected results, so that the performance metrics can be calculated to compare approaches with each other. In the context of this thesis, two viable goldsets are available, of which one will be chosen in later chapters to enable the validation [Co15], [Le18].


4 Implementation

This chapter of the master thesis is dedicated to the implementation of the methods described in chapter 3. It furthermore describes the architecture of the tools that result from the combination of those methods and the environment in which they were tested.

4.1 Goals and Constraints

The implementation is required to fulfill two high-level purposes: the import of data from the relevant data sources and the execution and evaluation of feature location on top of this aggregated data.

Goals identified for the import functionality are derived from the idea that the data aggregation should be automated and reusable in a stable way. The implementation that provides the feature location will, on the other hand, be used to experiment with different kinds of topic modeling techniques and potentially frameworks and is, therefore, more volatile. It is expected to change regularly during development, evaluation, and validation, and may never reach a final state. Because of this, it was decided to split the project into two separate applications that share a database for the exchange of data. Specific goals therefore distinguish between the importer application and the feature-location application. The aforementioned goal of reusability for the importer can be separated into the following concrete requirements:

(R1): The importer must gather data from the changesets of a version control system and issue descriptions from an issue tracking system.

(R2): The adapters integrating the version control and issue tracking systems into the application must be designed in a modular way in order to make it possible to extend the set of supported systems later on.

(R3): The changesets must contain information on which parts of files were changed on different levels of granularity, for example on class or method level. Therefore, a modular system enabling the extension of supported programming languages needs to be implemented.

(R4): Features and changesets must be saved in a way that links between them can be established after the successful import.

(R5): All data must be made available to other applications through a data storage that is accessible with industry-standard techniques.


The feature-location application must be seen more as a prototype or evaluation tool than a production-ready application. The applicable set of requirements for conducting feature location on any codebase is listed below:

(R6): The feature-location application must be able to read data stored by the importer.

(R7): Text must be pre-processed following best practices to create a document for each changeset. Those documents must include the feature descriptions as well as the modified source code itself.

(R8): The execution of the training process for a topic model must use available libraries and persist trained models as well as corpus lookup tables for later use.

(R9): The evaluation of user queries against trained models must be implemented through a command-line interface with results printable on the console or to output files.

(R10): Goldsets from at least one previous research paper must be evaluated against newly trained models, and the results generated from this must be automatically persisted as a list of search results for each goldset query.

(R11): The results of evaluated goldset queries must be validated against the expected results of the goldset by calculating a score value that makes a comparison of success possible between models.

The development of the applications is constrained by factors related to dependencies on external data sources and technical dependencies on third-party tools. The feature-location application is dependent on having access to a corpus of feature descriptions that are mapped to classes or methods contained in changesets. This kind of data can be found in source code version control systems like Git or Subversion and issue tracking applications like GitHub or Jira. Due to time constraints though, the importer can only implement interfaces to a few of those systems. Details can be found in the solution strategy of chapter 4.3.1.

The feature-location application depends on libraries providing functionality to perform topic modeling. Between Java and Python, which both have comparable libraries available, Python was selected as the programming language. Details are discussed in subchapter 4.2.

4.2 General Solution Strategy

For the development of both the importer and the feature-location application decisions con- cerning the general development strategy had to be made upfront before the implementation could begin.


A crucial part of this is the decision on which programming language is to be used. This largely depends on the evaluation of libraries supporting the methods discussed in chapter 3. The feature-location application requires special libraries for the execution of topic modeling techniques like the Latent Dirichlet Allocation and the Pachinko Allocation described in chapter 2.5. Since, as previously mentioned in the corresponding chapters, there are only a handful of libraries supporting the Pachinko Allocation, the choice of programming languages is severely limited. A comparison of popular topic modeling libraries is listed in Table 2.

Library     Supported Topic Models                  Interface languages

Tomotopy    ✔ LDA (multiple variations)             Python 3 (v0.11.1)⁶
            ✔ Pachinko Allocation
            ➖ Others

MALLET      ✔ LDA (multiple variations)             Java (v2.0.8)⁷
            ✔ Pachinko Allocation (hierarchical)
            ➖ Others

Gensim      ✔ LDA                                   Python 3 (v4.0.0)⁸
            ❌ Pachinko Allocation
            ➖ Others

Top2Vec     ❌ LDA (multiple variations)             Python 3 (v1.0.24)⁹
            ❌ Pachinko Allocation (hierarchical)
            ➖ Top2Vec algorithm

Table 2: Overview of Topic Modeling Libraries.

Gensim and Top2Vec are not suitable for the task of comparing LDA to the Pachinko Allocation since their support for these techniques is limited or not available at all. Yet both are often referred to in literature and online tutorials [CKK15], [SS18]. The two options that can be considered suitable are Tomotopy and MALLET, since both of them support at least one variant of LDA and Pachinko Allocation. To arrive at a decision, arguments for and against those libraries

6 Tomotopy v.0.11.1, documentation available at https://bab2min.github.io/tomotopy/v0.11.1/en/
7 MALLET v.2.0.8, available at http://mallet.cs.umass.edu/topics.php
8 Gensim v.4.0.0, documentation available at https://radimrehurek.com/gensim_4.0.0/
9 Top2Vec v.1.0.24, available at https://github.com/ddangelov/Top2Vec/tree/1.0.24

can be based on the degree of detail provided in their documentation and on features of the programming language that they provide an interface to.

The MALLET library provides a quick-start guide laying out the basic use of the command-line interface. Since it is written in Java, it also provides a Java API which is documented through an automated documentation tool. The full list of Topic Modeling techniques supported by MALLET is listed below:

- LDA
- Parallel LDA
- DMR LDA
- Hierarchical LDA
- Labeled LDA
- Polylingual Topic Model
- Hierarchical Pachinko Allocation Model (PAM)
- Weighted Topic Model
- LDA with integrated phrase discovery
- Embeddings using skip-gram with negative sampling

Tomotopy does not come with a command-line interface but instead provides an API for Python 3. While this documentation was found to be incomplete in some minor instances, its important functions and parameters are sufficiently described. The full list of supported techniques is listed below.

- Latent Dirichlet Allocation
- Labeled LDA
- Partially Labeled LDA
- Supervised LDA
- Dirichlet Multinomial Regression
- Generalized Dirichlet Multinomial Regression
- Hierarchical Dirichlet Process
- Hierarchical LDA
- Multi-Grain LDA
- Pachinko Allocation
- Hierarchical PA
- Correlated Topic Model
- Dynamic Topic Model
- Pseudo-document-based Topic Model


Since Tomotopy has better documentation and supports more variations of the relevant Topic Modeling techniques, it was chosen as the starting point for the feature-location application, thereby requiring Python 3 as the programming language for the project. The importer application has no direct requirement for library functionality that is not built into most common programming languages. The choice of an implementation language therefore follows the one made for the feature-location application. As both applications are based on Python 3, a monorepo was chosen as the development environment, allowing the project to be contained in one common version control system and leaving the option for code sharing open. This repository is publicly hosted on GitHub10 and loosely follows the GitHub Flow branching strategy.

4.3 Importer Application

This chapter provides implementation details of the importer application. After an introduction, the context, scope, and strategy of the application are discussed in subchapter 4.3.1, followed by a detailed description of the architecture of the tool in subchapter 4.3.2.

4.3.1 Context, Scope, and Solution Strategy

Following the prerequisites of previous chapters, the task of the importer application is data aggregation from version control systems and issue trackers (R1). For the scope of this thesis, it was decided to put the focus on one version control system and one issue tracker as data sources, with the option to include more in the future. In order to be able to use the previously mentioned goldsets for the validation of feature location results, those data sources had to match the ones used by at least one software project that was analyzed in previous research and has results saved in a goldset. As mentioned in chapter 2.1.3, two such goldsets have been identified: "Modeling Changeset Topics for feature location" [Co15] based on papers [CKK15], [CDK20] and "Empirical Assessment of Baseline Feature Location Techniques" [Le18] based on the paper [Ma18]. The table in appendix A.1 gives an overview of the projects contained in those goldsets and contains information on which project uses which version control and issue tracking system. It can therefore be used as a basis for choosing one of each of those systems for integration into the importer. A summary of the relevant information can be found in Table 3 below. The data shows that most projects from the two goldsets primarily use Git as a version control system. Choosing Git as a data source comes with the advantage that it is a decentralized system and therefore does not require a constant connection to specific hosting services. Instead, the

10 Repository of the source code on GitHub: https://github.com/l-schulte/cdbs_feature_location

Investigating topic modeling techniques for historical feature location. Page 29 Chapter 4 Implementation

VCS         Hosting       Issue Tracking
Git 18      GitHub 17     Jira 12
SVN 2       SF 1          GitHub 5
            Custom 2      Bugzilla 2
                          SF 1

Table 3: Composition of projects in the goldsets.

well-documented and commonly used Git command-line interface can be used to load a copy of the repository from any hosted Git instance and read the required data. The approach is therefore not limited to GitHub or any other hosting provider and can, for example, also work with GitLab or a custom Git-compatible solution. On this basis, it was decided that the solution strategy should include an import of Git repositories through the Git CLI.

The choice of Jira as the first issue tracking solution to integrate into the importer was again made based on the data in Table 3 and Appendix A.1. Since 12 of the 18 projects that use Git as a version control system track issues on a Jira instance, it is the choice that gives the tool the highest flexibility, theoretically allowing the evaluation phase of this thesis to validate against more than half of the available goldsets.

To be able to run the import multiple times without requiring the manual setup of a runtime environment with compatible installations of Git, Python, and Python libraries, it was further decided to containerize the importer application with Docker.

For this thesis, it is not in the scope of the importer to periodically update the data imported from Git, yet persistent storage for the data is required that allows other applications to access it through industry-standard solutions (R5). This storage must be available outside of the environment of the importer, since other applications may run on dedicated machines and should not have to provide and run an importer instance themselves. In order to implement this strategy, a globally available database instance must be set up. Technology-wise, the use of a relational database was evaluated against the use of a non-relational database like MongoDB.
MongoDB was chosen since the data structures that need to be stored are strongly coupled: for example, changes to specific files can be stored in the same document as the file information itself. To ensure availability from outside the local network, a free tier of MongoDB Atlas was chosen as the hosting provider for the database instance. To further allow the import of multiple repositories into the same MongoDB instance, each repository is stored in its own database.
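The one-database-per-repository scheme can be sketched as follows. The naming function below is purely illustrative (the actual importer may derive database names differently); it only demonstrates the design decision of isolating each imported repository.

```python
import re

def database_name(repo_url):
    """Derive a MongoDB database name from a repository URL so that each
    imported repository lands in its own database.

    Illustrative scheme only -- the real importer may name databases
    differently. MongoDB database names may not contain characters such
    as '/', '\\', '.', or spaces, so anything unexpected is replaced.
    """
    name = repo_url.rstrip('/').rsplit('/', 1)[-1]   # last URL segment
    name = re.sub(r'\.git$', '', name)               # drop a trailing .git
    return re.sub(r'[^A-Za-z0-9_-]', '_', name)
```

With pymongo, such a name would simply be used as `client[database_name(url)]` to obtain the per-repository database handle.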


Figure 8: Importer application context.

The overall context of the importer can be visualized as shown in Figure 8. It consists of three modules, of which one is the central system module (dark green) and two have a specific responsibility and are accessed by the system module (light green). Further internal and external systems (white) that are used by those modules are also displayed. In Table 4 those components are listed again with a description of their purpose.

Component Description

Importer App The starting point of the Python application. The main app is responsible for initiating processes and for passing on data between them and the data storage.

MongoDB A globally available instance of the non-relational database MongoDB hosted on MongoDB Atlas.

Git Import Python module of the importer application responsible for handling Git. It is the task of this module to trigger commands on the Git CLI that pull data from the Git server, as well as to initiate commands reading the Git CLI output. It thereby crawls and interprets the repository's commit history.

Git CLI An instance of the Git CLI installed on the host operating system. It is accessed by the Git Import Python module.

Git Server External component hosted on a computer that is available publicly on the internet or locally on a network.

Jira Import Python module of the importer application responsible for handling the import of issues from a Jira instance, which it accesses through the corre- sponding REST-API.

Jira Server External component hosted on a computer that is available publicly on the internet or locally on a network.

Table 4: Description of components in the context of the importer application.


4.3.2 Building Block View

In accordance with the solution strategy of the previous sub-chapter and the thereby defined logical structure of the importer application, the implementation specifics will be discussed in this chapter. As displayed in Figure 9, the building blocks of the application are split into three high-level modules: the main app.py, git, and issues.

Figure 9: UML diagram of the building blocks in the importer application.

In the following sections, the tasks and functionality of those modules will be explained in a technical context.

4.3.2.1 Main app

The main application with its app.py file is responsible for the order in which the import process visualized in Figure 10 is executed. It is also responsible for accessing, merging, and saving data into the database. This allows the other modules to act independently from a specific type of data storage.

Figure 10: Import process.

The first two steps trigger the functionality of the Git module which is going to be the subject of chapter 4.3.2.2. First, the git log is crawled to obtain information on which files have been

changed and when. Second, those changed files are investigated one by one, reading the number of lines changed and associating them with the classes and methods they belong to. The third step triggers the functionality of the issues module. Here, the issues that are connected to previously crawled commits are imported from the project's issue tracking system. Lastly, the data is saved, storing all commits in one collection, all files and the information on their changes in another, and all loaded issues in a third one. In total, the database diagram can be drawn as visible in Figure 11 below.

Figure 11: Database diagram.
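The step sequence orchestrated by app.py can be sketched as a small pipeline function. The callable names are illustrative, not the actual app.py API; the point is that the steps are injected, which keeps the modules independent of the concrete data storage, as described above.

```python
def run_import(crawl_log, crawl_diffs, crawl_issues, save):
    """Sketch of the import process from Figure 10. The four steps are
    injected as callables so that, as in the real app.py, the modules
    stay independent of the concrete data storage."""
    commits = crawl_log()                 # step 1: crawl the git log
    diffs = crawl_diffs(commits)          # step 2: crawl git diffs per commit
    features = crawl_issues(commits)      # step 3: crawl connected issues
    save(commits, diffs, features)        # step 4: persist everything
    return commits, diffs, features
```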

4.3.2.2 Git module

The Git module is in charge of aggregating relevant data from a Git repository that is available either on a local network or on the internet. In the context of the main application and after making the repository available offline, this aggregation process can be split into the two steps already mentioned in the previous chapter: crawling the Git log for commit information and crawling the Git diff of those commits for method- and class-level artifacts. The implementation of this functionality extends a fork of existing source code¹¹ created previously as part of a research project [Sc21] conducted by the author of this thesis. Within the module, there are three components: the module entry point git.py, an adapter to the Git CLI cli.py, and an interpreter for the CLI output interpreter.py.

11 Source code available on GitHub: https://github.com/l-schulte/cdbs_hist_kau

git log --numstat --no-merges --date=unix --after=2020-01-01 --before=2021-01-01

Listing 3: Git log command.

commit 3123090b6cf44e853eceae42854d49e78e81be2a
Date: 1601069990

Changed default value of "search and store files relative to bibtex file" to true (#6928)

* Fixes #6863

1   0    CHANGELOG.md
1   1    src/jabref/gui/JabRefGUI.java
1   1    src/jabref/gui/SendAsEMailAction.java
1   1    src/jabref/gui/desktop/JabRefDesktop.java
2   1    src/jabref/gui/exporter/ExportToClipboardAction.java
3   3    src/jabref/gui/fieldeditors/LinkedFileViewModel.java
5   1    src/jabref/gui/fieldeditors/LinkedFilesEditor.java
0   77   src/jabref/gui/filelist/FileListEntry.java
2   2    src/jabref/gui/{filelist => linkedfile}/AttachFileAction.java
...

Listing 4: Git log output of a commit to the JabRef repository (shortened).

To gain access to information about the commits, the module reads the Git log through functions in the cli.py file and interprets it with the interpreter.py scripts. Running the Git log command as displayed in Listing 3 returns the information shown in Listing 4. Since the output format is optimized to be read by humans and not machines, it needs to be parsed in order to be useful for further analysis. From this, the following data points can be extracted through the interpreter scripts using regular expressions:

- A list of commits containing:
  o Unique ID of the commit
  o Name and email of the author
  o Date of the commit
  o Commit message
  o Issue ID extracted from the commit message (see chapter 4.3.2.3)
- A list of all changed files containing:
  o Unique ID of the commit
  o Issue ID extracted from the commit message (see chapter 4.3.2.3)
  o Path of the file
  o Previous path of the file
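A minimal sketch of such a regex-based interpreter for the `git log --numstat` output could look like this. The field names and the issue-ID pattern are assumptions for illustration; the real interpreter.py extracts more fields (author, email, old paths) and uses the Jira-specific extraction described in chapter 4.3.2.3.

```python
import re

COMMIT_RE = re.compile(r'^commit ([0-9a-f]{40})')
NUMSTAT_RE = re.compile(r'^(\d+|-)\t(\d+|-)\t(.+)$')   # added, deleted, path
ISSUE_RE = re.compile(r'#(\d+)')                       # hypothetical issue-ID pattern

def parse_git_log(output):
    """Parse `git log --numstat --date=unix` output into commit dicts (sketch)."""
    commits, current = [], None
    for line in output.splitlines():
        m = COMMIT_RE.match(line)
        if m:  # a new commit starts
            current = {'commit_id': m.group(1), 'date': None,
                       'message': [], 'files': []}
            commits.append(current)
            continue
        if current is None:
            continue
        if line.startswith('Date:'):
            current['date'] = int(line.split(':', 1)[1].strip())
            continue
        m = NUMSTAT_RE.match(line)
        if m:  # a numstat line: lines added / deleted / file path
            added, deleted, path = m.groups()
            current['files'].append({
                'path': path,
                'added': None if added == '-' else int(added),
                'deleted': None if deleted == '-' else int(deleted)})
        elif line.strip():  # everything else is part of the message
            current['message'].append(line.strip())
    for c in commits:
        c['message'] = ' '.join(c['message'])
        ids = ISSUE_RE.findall(c['message'])
        c['feature_id'] = ids[0] if ids else None
    return commits
```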

git diff 3123090b6cf44^ 3123090b6cf44 --histogram -U1000

Listing 5: Git diff command for a commit to the JabRef repository.

...
diff --git a/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java b/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
index 07a25bf42..995718ce7 100644
--- a/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
+++ b/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
@@ -1,510 +1,510 @@
 package org.jabref.gui.fieldeditors;
-import org.jabref.gui.filelist.LinkedFileEditDialogView;
 import org.jabref.gui.icon.IconTheme;
 import org.jabref.gui.icon.JabRefIcon;
+import org.jabref.gui.linkedfile.LinkedFileEditDialogView;

 public class LinkedFileViewModel extends AbstractViewModel {
 ...
     public Observable[] getObservables() {
         List observables = new ArrayList<>(...);
         observables.add(downloadOngoing);
         observables.add(downloadProgress);
         observables.add(isAutomaticallyFound);
-        return observables.toArray(new Observable[observables.size()]);
+        return observables.toArray(new Observable[0]);
     }
 ...
 }
...

Listing 6: Git diff output of a commit to the JabRef repository (shortened).

To gain access to information describing what was changed in the files, the Git diff command shown in Listing 5 must be executed through the cli.py scripts and again analyzed by the interpreter.py.

The command compares the files changed in one commit (3123090b6cf44) against the same files from before the changes of the commit were applied, giving an overview of which lines were changed and what the change included. This previous state of the code in Git can be addressed by the current commit ID with a caret (^) appended (3123090b6cf44^). Further, the command includes the --histogram option, which activates the change detection mechanism that was found to work best for detecting changes granularly, and the -U1000 option, which deactivates the skipping of parts of the source code that were not changed. Again, the information, of which an extract is shown in Listing 6, is only available in a form that requires parsing. Since commits can contain multiple files, the first step taken by the interpreter while iterating the lines of the output is to look for the start of a new file. Once

a new one is recognized, all source code lines are added to a chunk that is analyzed further. This further analysis is then split into two parts, each iterating the code lines: one counting the changed lines per method and the other counting the changed lines per class. The task of counting changes on a method and class basis is of course dependent on the language used. Here the interpreter offers room for extension, since the regular expressions used for detecting the beginning of a class or method can be set based on the file ending. Currently, only Java files with the .java extension are supported, and the regular expressions displayed in Listing 7 are only an approximation of complete coverage of all possible method definitions. Also, edge cases like code lines containing a string that matches a method or class definition may cause the detection of false-positive matches.

'java': {
    'method_name': r'^(\+|\-| )( *|\t*)(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])',
    'class_name': r'^(\+|\-| )( *|\t*)(?:public|protected|private) *(?:static)? *(?:abstract)? *(?:final)? class (\w+) (?:\w| )*{',
    'brackets_open': r'{',
    'brackets_close': r'}'
}

Listing 7: Extract from the regular expressions used for method and class detection by the interpreter.py.
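Using the method_name expression from Listing 7, the per-method line counting could be sketched as follows. This is a simplified reconstruction, not the actual interpreter.py: brace counting decides where a method ends, and the same caveats about strings and edge cases apply.

```python
import re

# method_name expression taken from Listing 7
METHOD_RE = re.compile(
    r'^(\+|\-| )( *|\t*)(?:public|protected|private|static|\s)'
    r' +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])')

def count_changes_per_method(diff_lines):
    """Count added/removed lines per Java method in a unified-diff chunk.

    Simplified sketch of the interpreter's method pass: a method starts
    where METHOD_RE matches and ends when the brace depth returns to
    zero. Braces inside string literals are ignored here, mirroring the
    false-positive caveat noted in the text above.
    """
    counts, current, depth, opened = {}, None, 0, False
    for line in diff_lines:
        if current is None:
            m = METHOD_RE.match(line)
            if m:
                current, depth, opened = m.group(3), 0, False
                counts.setdefault(current, 0)
        if current is not None:
            if line.startswith(('+', '-')):
                counts[current] += 1          # a changed line inside the method
            depth += line.count('{') - line.count('}')
            if depth > 0:
                opened = True
            if opened and depth <= 0:         # closing brace of the method
                current = None
    return counts
```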

The result of this task is a dictionary with the file path as the key and an entry for each file which contains the following information:

- Old path
- List of classes where each contains:
  o The number of lines changed in the class
  o All added source code lines
  o All removed source code lines
  o All source code lines that remained unchanged
- List of methods where each contains:
  o The number of lines changed in the method
  o All added source code lines
  o All removed source code lines
  o All source code lines that remained unchanged

Later this dictionary is combined with the list of changed files from the commits, allowing access to a file, all its changes, and the feature ID associated with the change from one document.


4.3.2.3 Issues module

The issues module is responsible for aggregating issues from issue-tracking systems and providing them back to the main app. In Figure 10 this step is labeled issues crawl. At this point, it consists of an issues.py file as the entry point of the module and an adapter for accessing data from a Jira instance implemented in the jira.py file. In order to establish a connection to a Jira instance over HTTP, the jira.py file needs to obtain a token from that instance while creating a new session. Afterward, single issues can be accessed through a URL following the schema shown in Listing 8 below, where the domain and the parameters project and issue_id need to be changed accordingly.

http://issues.sample.com/{project}/rest/api/2/issue/{issue_id}

Listing 8: Sample URL for accessing an issue from a Jira instance.

During the crawling process, all commits are iterated, and each distinct issue is downloaded from the server. The issue descriptions are then put into a dictionary with the issue ID as the key and the description as the value. Due to the modularity of the solution, with the issues.py file being the access point in front of the Jira-specific adapter, other issue tracking systems can be added later. This also applies to the feature ID extraction from commit messages, which is a method implemented in jira.py that is used by the Git module's interpreter.py through issues.py.
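The crawl described above can be sketched as follows, assuming the URL schema from Listing 8 and the standard Jira JSON layout (description under 'fields'). Function names are illustrative, and the session/token handling from jira.py is omitted.

```python
import json
import urllib.request

def issue_url(project, issue_id, base='http://issues.sample.com'):
    """Build the issue URL following the schema from Listing 8."""
    return f'{base}/{project}/rest/api/2/issue/{issue_id}'

def fetch_issue(project, issue_id):
    """Download a single issue; assumes the standard Jira JSON layout
    where the description sits under 'fields'. Authentication (the
    session token mentioned above) is omitted in this sketch."""
    with urllib.request.urlopen(issue_url(project, issue_id)) as resp:
        data = json.load(resp)
    return data.get('fields', {}).get('description', '')

def crawl_issues(commits, fetch=fetch_issue):
    """Iterate all commits and download each distinct issue exactly once,
    returning {issue_id: {'description': ...}} as described above."""
    features = {}
    for commit in commits:
        fid = commit.get('feature_id')
        if fid and fid not in features:
            features[fid] = {'description': fetch(commit['project'], fid)}
    return features
```

Injecting `fetch` keeps the iteration logic testable without a live Jira instance and mirrors the adapter role of jira.py behind issues.py.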

4.3.2.4 Data finalization

The import process is finalized by merging the aggregated data. So far, four separate data pools have been created, as summarized in Table 5. The commits and features data pools are directly stored in the corresponding collections of the MongoDB database. The changes and diffs, on the other hand, need to be merged and post-processed. First, the diffs are added to the changes based on the file path. After that, renamed files are traced by matching the old file names of already saved changes to the new file name of the currently assessed change. Finally, the import is finished, and the data can be read by other applications fulfilling the specifications of the database diagram in Figure 11 at the beginning of this chapter.
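The merge and rename-tracing step can be sketched like this. The data shapes loosely follow Table 5 but are simplified; the function name and the grouping by path are illustrative assumptions.

```python
def merge_diffs_into_changes(changes, diffs):
    """Attach per-file diff data to change records and trace renames (sketch).

    `changes` are file-level change records (with optional 'old_path'),
    `diffs` maps current file paths to class/method statistics. When a
    change carries an old_path, the records previously stored under that
    old path are re-linked to the new path.
    """
    by_path = {}
    for change in changes:
        diff = diffs.get(change['path'], {})
        change['classes'] = diff.get('classes', {})
        change['methods'] = diff.get('methods', {})
        old = change.get('old_path')
        if old and old in by_path:
            # rename: keep the older records reachable under the new path
            by_path[change['path']] = by_path.pop(old) + [change]
        else:
            by_path.setdefault(change['path'], []).append(change)
    return by_path
```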


Data pool   Description

Commits     A list of commits structured as JSON documents:
            [{ 'commit_id': string, 'author': string, 'email': string,
               'date': date, 'comment': string, 'feature_id': string }]

Changes     A list of changes on a file level structured as JSON documents:
            [{ 'commit_id': string, 'feature_id': string, 'path': string,
               'old_path': string }]

Diffs       A dictionary of differential changes on a source code level,
            structured as JSON documents with file names as keys¹²:
            { '?path': {
                'old_path': string,
                'classes': { '?class_name': { 'cnt': int, '+': int, '-': int }, ... },
                'methods': { '?method_name': { 'cnt': int, '+': int, '-': int }, ... }
            } }

Features    A dictionary of issue descriptions structured as a JSON document
            with issue IDs as keys¹²:
            { '?feature_id': {
                'description': string,
                'type': { 'self': string, 'id': string, ... }
            } }

Table 5: Overview of data produced by importer steps.

12 In this graphic the keys preceded by a question mark are variable/generated depending on the input data.


4.4 Feature Location

This chapter provides implementation details on the feature-location application. It follows a similar structure to the previous chapter and is separated into subchapters 4.4.1 and 4.4.2, containing a discussion of context, scope, and solution strategy, and the documentation of the tool's architecture.

4.4.1 Context, Scope, and Solution Strategy

As part of the general solution strategy in chapter 4.2, the feature-location application is responsible for the execution and comparison of topic modeling-based feature location. Because of this, it needs to run three different tasks that will be labeled train, evaluate, and validate in this chapter. The functionality of those tasks follows the requirements (R8), (R10), and (R11). The choice of implementing the tasks as a Python 3 command-line application can also be derived from chapter 4.2. The separate tasks are to be executable independently, storing their output in a format that can, if necessary, be read by the others.

Figure 12: Tasks of the feature location process.

In contrast to the importer application, the development of those tasks did not follow a predefined path. Instead, technologies and methods were exchanged or removed during implementation and testing to adapt to the topic modeling results that were observed. The solution strategy may therefore list options that were to be explored and implemented but did not make it into the final application or are only available through specific commands. Figure 13 shows the application context of the feature-location application. It consists of the main system (dark green), an external MongoDB instance (white), and four modules (green) that contain extensions designed to be exchangeable, since they implement access to specific libraries or algorithms for which alternatives may be explored as well. The train task, with its implementation in the training module, naturally uses the database filled by the importer application, described in chapter 4.3, as a data source, fulfilling requirement (R6). Based on what is available from the database, strategies must be found for both data pre-processing (R7) and the general corpus generation to use it as an input for the execution of the topic modeling.


Figure 13: Feature-location application context.

While the database technically provides three groups of text sources that describe changes, only two of them are used in the implementation of the train task: issue descriptions and those source code lines that have been changed. They are the basis for the corpus generation shown in Figure 15 and for the feature descriptions. Commit messages as a third group of text sources, on the other hand, are not used, since they are not considered to provide information on high-level features. This conforms to methods used in other papers investigating changeset-based topic modeling, like [HGH09], [Hi15] and [CEB15], [CEB17] or [CKK15], [CDK20] and [Ma18]. The pre-processing then consists of the tokenization of those descriptions as well as the filtering of stop words. For this, two options need to be evaluated: the use of the English stop words from the NLTK Python library and an alternative approach using custom stop words for both English and the Java programming language found in the implementation of other research [Co15], [CKK15].
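The pre-processing step can be sketched as follows. The stop-word sets below are small illustrative samples only, not the NLTK list or the custom lists from [Co15], [CKK15]; the tokenizer is likewise a minimal assumption.

```python
import re

# Small illustrative stop-word sets; the thesis evaluates NLTK's English
# list against custom English + Java lists from [Co15], [CKK15].
ENGLISH_STOPS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'for'}
JAVA_STOPS = {'public', 'private', 'void', 'return', 'new', 'import', 'class'}

TOKEN_RE = re.compile(r'[A-Za-z]+')

def preprocess(text, stops=ENGLISH_STOPS | JAVA_STOPS):
    """Tokenize a feature description or changed code lines into lowercase
    words and drop stop words and one-letter tokens."""
    return [t for t in (w.lower() for w in TOKEN_RE.findall(text))
            if t not in stops and len(t) > 1]
```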

The corpus generation must then be applied to a number of documents that, in the form of files and their change history, are read from the database. Here the user must be able to select the parameters required for the training of the available models through the command-line interface of the application. As discussed before, the tomotopy library must be used for the implementation and support at least the LDA model and the Pachinko Allocation model. The results of the training that is applied to this document-based corpus need to be saved for later phases. They must consist of a model file that will be used for inferring new documents

and a lookup table that contains each document that the model has been trained on, together with its associated topic distribution. Based on the saved models and lookup tables for both LDA and the Pachinko Allocation, the evaluation phase is required to create search results for either the queries from a goldset or single user queries read from the command-line interface. For this, the inference functions of the tomotopy library must be used. Once a document has been inferred, a strategy for finding relevant results matching the topic distribution is applied. In order to do this, it was decided to explore two options: a simple lookup of the cluster that commonly has the highest probability and a rating of the distance between probability distributions. Said search results must be saved as JSON documents for later analysis.
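The two matching strategies can be sketched as follows. The Jensen-Shannon distance is used here as one plausible "rating of the distance between probability distributions" (an assumption; the text above does not fix a particular metric), and the inferred query distribution would in practice come from tomotopy's inference functions.

```python
import math

def top_topic(dist):
    """Strategy 1: the single cluster with the highest probability."""
    return max(range(len(dist)), key=lambda i: dist[i])

def jsd(p, q):
    """Strategy 2: Jensen-Shannon distance between two topic distributions
    (square root of the Jensen-Shannon divergence, natural log base)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def rank_documents(query_dist, lookup):
    """Rank corpus documents ({name: topic distribution}, i.e. the lookup
    table) by their distance to the inferred query distribution."""
    return sorted(lookup, key=lambda name: jsd(query_dist, lookup[name]))
```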

Component Description

Feature-Location App  The starting point of the Python application. The main app is responsible for starting one or multiple of the available tasks (train, evaluate, validate). It also contains code that can train both LDA and Pachinko Allocation with a varying number of clusters. It provides a command-line interface for starting the application.

MongoDB A globally available instance of the non-relational database MongoDB hosted on MongoDB Atlas. It is filled by the importer application and contains data on changes made to the repositories as well as features from the issue tracking systems. On that MongoDB instance, each im- ported repository has its own database.

Models  Python module containing the model-specific as well as the common implementation logic of LDA and Pachinko Allocation using the tomotopy library.

Training Python module that provides the relevant functions for initiating the training process. For this, it is being accessed by the main application. It currently offers an implementation for training with tomotopy, but the library-specific logic is exchangeable.

Evaluation  Python module that is responsible for evaluating the goldset queries by inferring them against a trained model. Besides generic evaluation functionality, it also acts as a wrapper that supports tomotopy models.

Validation Python module for validating previously evaluated goldset queries against the anticipated results from the same goldset.

Table 6: Description of components in the context of the feature-location application.


From the search results, the validation task must calculate a metric that can be used to compare the success of evaluations. For this, the mean reciprocal rank is used, since it is a common indicator for search engine hit rates [CDK20], [CMS09].
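The mean reciprocal rank over a set of goldset queries can be computed as:

```python
def reciprocal_rank(expected, ranked_results):
    """1/rank of the first expected item in the ranked result list, 0 if absent."""
    for rank, item in enumerate(ranked_results, start=1):
        if item in expected:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over all goldset queries [CDK20], [CMS09].
    `queries` is a list of (expected_set, ranked_result_list) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(e, r) for e, r in queries) / len(queries)
```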

4.4.2 Building Block View

The implementation specifics described in this chapter are derived from the solution strategy above as well as the other previous chapters. As shown in the diagram in Figure 14, the implementation is split into six high-level components that were already introduced in chapter 4.4.1: app.py, models, data, training, evaluation, and validation.

Figure 14: UML diagram of the building blocks in the feature-location application.

The app.py file provides a command-line interface for accessing the most common features of the application. When executing the script, the input parameters listed in Table 7 are available. In the following sections, the implementation specifics of those blocks are described on a high level, split into the three tasks identified in the solution strategy of the previous chapter: train, evaluate, and validate.


Parameter Description Options

-t, --train Choosing the models that should be trained. lda, pa

--lda_k1 Number of clusters if LDA is being trained. Integer

--pa_k1 Number of first level clusters for training Pachinko. Integer

--pa_k2 Number of second-level clusters for training Pachinko. Integer

-b, --base Choosing either file or class-based training. class, file

-e, --eval Choosing the models that should be evaluated. lda, pa

-q, --query Search query for which a topic must be found. String

-i, --input Query and goldset directory as an alternative to a custom query. Path

-p, --pages Number of pages to display in the evaluation result. Integer

-d, --determ Evaluation method used to determine which documents match the search query. ml, dist

-v, --validate Choosing the models that should be validated. lda, pa

Table 7: Command line options of the app.py file.
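For illustration, the options of Table 7 could be declared with Python's argparse roughly as follows. This is a hypothetical sketch, not the actual app.py; the flag names follow Table 7, while the defaults chosen here are assumptions (only the `dist` default for `--determ` is stated in the text).

```python
# Hypothetical sketch of the app.py command-line interface from Table 7.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Topic-modeling-based feature location")
    p.add_argument("-t", "--train", nargs="*", choices=["lda", "pa"],
                   help="models that should be trained")
    p.add_argument("--lda_k1", type=int, help="number of clusters for LDA")
    p.add_argument("--pa_k1", type=int, help="first-level clusters for Pachinko")
    p.add_argument("--pa_k2", type=int, help="second-level clusters for Pachinko")
    p.add_argument("-b", "--base", choices=["class", "file"], default="class",
                   help="class- or file-based training (default is an assumption)")
    p.add_argument("-e", "--eval", nargs="*", choices=["lda", "pa"],
                   help="models that should be evaluated")
    p.add_argument("-q", "--query", help="search query for which a topic must be found")
    p.add_argument("-i", "--input", help="query and goldset directory")
    p.add_argument("-p", "--pages", type=int, default=1,
                   help="pages to display in the evaluation result")
    p.add_argument("-d", "--determ", choices=["ml", "dist"], default="dist",
                   help="method that determines matching documents")
    p.add_argument("-v", "--validate", nargs="*", choices=["lda", "pa"],
                   help="models that should be validated")
    return p

# Parsing a sample argument list that mirrors a typical invocation
args = build_parser().parse_args(
    ["-t", "lda", "pa", "--lda_k1", "500", "--pa_k1", "200", "--pa_k2", "300"]
)
```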

4.4.2.1 Training

The training process starts with loading the documents from the MongoDB database and formatting them to fit either the class-based or the file-based topic modeling. Similarly, the issue descriptions are loaded from the database and stored within the runtime context of the application. This is handled by the app.py file.

The actual process of training LDA and Pachinko Allocation models with the tomotopy library is generic and implemented in the tomotopy.py file of the training module. It receives a model for either LDA or Pachinko Allocation, generated from the models module by app.py through the training.py script, which acts as an adapter. Due to the implementation of training.py as an adapter, the code specific to tomotopy can be exchanged easily, given that fitting models have previously been instantiated.

Figure 15 displays the process from a high-level view. Starting with the corpus generation based on documents, feature descriptions are generated from the concatenation of issue descriptions and titles as well as file diffs, including the lines added and those that remained the same. Those feature descriptions are then tokenized into single words and filtered to no longer contain stop words. This corpus is then used to train the models, resulting in an output of a model file and a lookup table as a CSV file. In this implementation, the lookup table contains at least all documents with their path and name as well as each document's probability distribution for each topic. A screenshot of an example can be found in Appendix A.2.
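The corpus-generation steps described above (concatenating issue text with the kept diff lines, then tokenizing and removing stop words) can be sketched in plain Python. The stop-word list and the diff-line convention used here ('+' for added, ' ' for unchanged lines) are simplifying assumptions, not the thesis implementation:

```python
# Minimal sketch of the corpus-generation step: build a feature description
# from issue title, issue description, and the added/unchanged diff lines,
# then tokenize and filter stop words.
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for"}  # assumed list

def build_feature_description(title: str, description: str, diff: str) -> str:
    # keep added ('+') and unchanged (' ') lines, skip removed ('-') lines
    kept = [line[1:] for line in diff.splitlines()
            if line.startswith("+") or line.startswith(" ")]
    return " ".join([title, description, *kept])

def tokenize(text: str) -> list[str]:
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())
    return [w for w in words if w not in STOP_WORDS]

doc = build_feature_description(
    "Fix login timeout",
    "The login to the quorum times out.",
    "+ int timeout = 30;\n- int timeout = 10;\n  Login login;",
)
tokens = tokenize(doc)
```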

Figure 15: Flow chart of the train task.

4.4.2.2 Evaluation

Based on the model file previously generated in the train task and a query, the probability distribution of topics for that query can be inferred. Using the lookup table that was also generated in the previous step, search results are created. This is done by matching the documents in the lookup table to the topic distribution. A graphical overview is provided in Figure 16.

As required by the solution strategy, two options exist for creating those matches. The first one is a simple descending sort of the documents in the lookup table by the most likely topic of the query. The second one is a comparison of the deviation between all topics of a document and the query, as described in chapter 3.4.1. Both are implemented and available via the command-line interface, with the latter being the default option.
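The two matching options can be sketched as follows. The topic distributions are toy values, and the sum of absolute differences merely stands in for the deviation measure of chapter 3.4.1, which may be defined differently:

```python
# Sketch of the two ranking options: sort by the query's most likely topic,
# or rank by deviation between whole topic distributions.
def rank_by_top_topic(lookup, query_dist):
    top = max(range(len(query_dist)), key=query_dist.__getitem__)
    return sorted(lookup, key=lambda doc: lookup[doc][top], reverse=True)

def rank_by_deviation(lookup, query_dist):
    def deviation(doc):
        return sum(abs(d - q) for d, q in zip(lookup[doc], query_dist))
    return sorted(lookup, key=deviation)  # smaller deviation = better match

# Toy lookup table: document name -> topic distribution
lookup = {
    "Login": [0.7, 0.2, 0.1],
    "ZooKeeperSaslClient": [0.1, 0.8, 0.1],
    "QuorumPeer": [0.3, 0.3, 0.4],
}
query = [0.6, 0.3, 0.1]
```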


Figure 16: Flow chart of the evaluation task.

In case an input directory was chosen instead of a custom query, the process is run for the content of all text files in that directory. The output is then persisted as text files, stored in a subdirectory of the input called either ‘lda’ or ‘pa’, depending on which model type was evaluated. An example of such a search result file can be seen in Listing 9 below.

{ "log_ll": -158.93582153320312, "res": [ { "_id": "6062e8fea09c5e09e16fe3e5", "path": ".../Login.java -> Login", "name": "Login" }, { "_id": "6062e8fea09c5e09e16fe3e6", "path": ".../client/ZooKeeperSaslClient.java -> ZooKeeperSaslClient", "name": "ZooKeeperSaslClient" }, ... }

Listing 9: Beginning of a class-based search result for the zookeeper repository.

4.4.2.3 Validation

Once an evaluation was performed on queries contained in a goldset, the validation must generate a score that can be used to determine how effectively the topic modeling technique performs in feature location. It requires the definition of an input directory through the command-line interface and uses the data stored there to compare previously generated search results for goldset queries to the anticipated results of those goldsets. The flow is shown in Figure 17. The method implemented for this is the mean reciprocal rank metric. Its formula was already introduced in chapter 3.4.2. Its result describes, averaged over the whole collection of search results, the reciprocal of the rank at which the first correct result can be found.
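The metric the validate task computes can be sketched in a few lines; the example ranks are illustrative:

```python
# Mean reciprocal rank: for each goldset query, take the rank of the first
# correct search result and average the reciprocals of those ranks.
def mean_reciprocal_rank(first_correct_ranks):
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

# Example: first correct result on positions 1, 2, and 4
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```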


Figure 17: Flow chart of the validate task.


5 Evaluation

The Evaluation chapter is split into three parts: a description of the setup used for the experiments, an analysis of how the execution of feature location was performed, and an experiment that aims to provide a basis for comparing the performance of LDA with that of the Pachinko Allocation.

5.1 Setup

This chapter describes the general setup of the experiments and fills in those blanks that have previously been left open so that the tools in chapter 4 remain flexible and usable in multiple scenarios. The description starts with an evaluation of the general structures of data used as in- and output of the feature location process and then continues with an overview of the software system that has been imported by the tool.

5.1.1 General Data Structures

For the execution of a comparison between the Latent Dirichlet Allocation model and the Pachinko Allocation model, the following data sources have been identified as relevant and chosen to enable the implementation and evaluation in this thesis:
- Git version control systems
- Jira issue tracking systems
- Corley goldset
The first two data sources were chosen for technical and popularity reasons that are explained in chapter 4 and will be summarized at the end of this chapter. The choice of the goldset, on the other hand, must be evaluated in this chapter since the arguments for and against the available options are of a more conceptual nature.
The two available datasets are each provided in association with research papers on the topic of feature location. They are available under their respective links on the website of the Lero Institute13 [Ma18] and the GitHub repository of the papers' main author Christopher Corley14 [CKK15], [CDK20]. In the following chapters, they will be referred to by their respective sources.
The Lero dataset contains data on 11 open-source projects, including popular ones like ArgoUML, the Eclipse IDE, and JabRef. An overview of all projects can be found in Appendix A.1. While the number of projects available would be more than sufficient to base answers to

13 Goldset from [Ma18]: https://lero.ie/research/datasets/feature_location/comparison

14 Goldset from [CKK15], [CDK20]: https://github.com/cscorley/changeset-feature-location/

the research question based on them, the structure of this dataset is not optimal for the scope in which this thesis applies data pre-processing. It consists of three files per project, as listed in Table 8 below:

Name Description

CorpusRaw A text file that contains a list of lines, each made up of a method that has been pre-filtered to remove stop words.

Queries A text file that contains a list of lines, each containing a feature search query that has been pre-filtered to remove stop words.

AnswerMatrix CSV file that contains numbers in a matrix. Each line corresponds to the line in the Queries-file that has the same index. Each number in a line corresponds to a method in the CorpusRaw-file. The matrix therefore serves as a link between queries and those methods that are considered the location of the feature described.

Table 8: Structure of the Lero goldset.

The problem lies in the fact that the goldset is already heavily pre-processed, with little information available on how to reproduce these inputs for other projects or techniques.

The Corley goldset on the other hand is set up for more flexible use while its choice of projects is at the same time similar to the Lero goldset as shown in Appendix A.1. The relevant files and folders with their structure are explained in Table 9 below:

Name Description

queries A folder containing a number of text files that contain either an issue description (LongDescription[??].txt) or the title of an issue (ShortDescription[??].txt) with all files containing a number in their name.

goldsets A folder containing a folder called “class” and/or a folder called “method”. Inside those folders, there is a number of files that are linked to the aforementioned queries through their filename. The content of the files is a list of classes or methods which are the expected feature location result of the query.

Table 9: Structure of the Corley goldset.
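A hypothetical loader for this layout might look as follows. The exact file-name pattern (`LongDescription<ID>.txt`) and the one-class-per-line goldset format are assumptions based on the description above, not a documented specification:

```python
# Hypothetical loader for the Corley goldset layout of Table 9: pair each
# long query description with the expected classes from goldsets/class.
from pathlib import Path

def load_goldset(root: str) -> dict:
    root_path = Path(root)
    pairs = {}
    for query_file in (root_path / "queries").glob("LongDescription*.txt"):
        issue_id = query_file.stem.removeprefix("LongDescription")
        gold_file = root_path / "goldsets" / "class" / f"{issue_id}.txt"
        if gold_file.exists():
            pairs[issue_id] = (
                query_file.read_text().strip(),       # the search query text
                gold_file.read_text().splitlines(),   # expected feature locations
            )
    return pairs
```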


Further, the Corley dataset contains snapshots and temporal files as well as search results generated by the authors of the papers [CKK15], [CDK20]. Due to its structure, the Corley goldset is better suited to explore the full process of feature location for different topic modeling techniques. It is therefore the preferred choice going forward.
The data available from the Git version control system and the Jira issue tracker is gathered by the importer tool described in chapter 4.3 and includes the high-level data points listed in Table 10. It will in this form serve as an input into the evaluation-relevant topic modeling-based feature location.

Name Description

commits A list of all commits made to the repository that should be analyzed. The information available includes the date of the commit, a commit message, other metadata, and a list of files that were changed in the commit. Due to the pattern by which Jira issues are mapped to Git commits, the ID of the issue attached to the commit can also be found here.

features A dictionary containing the issue ID and full description as well as some metadata.

files Each file in the repository that has been changed in a previously defined timeframe, without duplications and with tracking of file renaming. Further, all issue IDs connected to the file are listed.

Table 10: Structure of Git and Jira data.

After the import of data and the iteration through the three steps of the feature location application train, evaluate, and validate (see chapter 4.4), an output of search results and their validation against the goldset, expressed by a metric, is to be expected. While the metrics are shown in the command-line interface, the search results that were generated in the evaluation step are saved in JSON files on the computer’s hard drive. In order to keep them organized, it is possible to select an input/output folder through the command-line interface of the program, in which folders called “lda” and “pachinko” will be created so that the output of different techniques is stored separately. After execution, those folders will contain a number of files matching the number of goldset search queries; besides serving the replication of results, they can then be used for manual validation.


5.1.2 Target System

Apache Zookeeper, an open-source server project that provides functionality to maintain reliable and distributed cloud applications, was used as the main research object. It was chosen from the list of systems that are available in the Corley goldset, which is summarized in Appendix A.1. As stated there, Zookeeper is hosted on GitHub15 and uses a Jira instance16 for issue tracking purposes. In conformance with the Corley goldset, the timeframe from the start of the project until the release of version 3.4.5 (published on the 19th of November 2012) was imported. At this date, the project consists mainly of Java files with a total of just under 80,000 lines. Further, 954 commits were made to the main branch and a total of 693 Jira issues were assigned to commits.

[Bar chart: lines of code per language (Java, HTML, XML, C++, C, C/C++, JavaScript); Java dominates with just under 80,000 lines.]

Figure 18: Programming language composition of the Zookeeper repository by lines of code.

After the import into the database that is later used for training, exactly 1200 files exist in the database of which 474 are Java files that will be included in the feature location process.

5.2 Feature Location Technique Analysis

The topic of this chapter is the analysis of the feature location process. The primary focus is on the task of training models for LDA and the Pachinko Allocation. Subchapters describe the performance of the tasks as well as the tuning of hyperparameters to optimize results. Finally, obstacles encountered in the process are addressed. In general, the feature location process can be started through the command-line interface with the command in Listing 10, or a variation of it.

15 Zookeeper repository at release version 3.4.5: https://github.com/apache/zookeeper/tree/release-3.4.5

16 Zookeeper Jira instance: https://issues.apache.org/jira/projects/ZOOKEEPER/issues

python .\app.py -t lda pa -e lda pa -v lda pa --lda_k1 500 --pa_k1 200 --pa_k2 300 -i .\data\zookeeper\

Listing 10: Start command for training, evaluation, and validation of LDA and PA models.

5.2.1 Execution time

As the tomotopy library for Python 3 is used to implement the feature location process, the performance of the tool depends largely on the performance of the library. While a comparison of the performance of LDA and the Pachinko Allocation is difficult, as the number of topics in a one-layered model cannot simply be compared to the number of topics in a two-layered model, the general performance can still be derived from the experience gained by working with both.

The configurations whose execution times are shown below perform comparably when validated against the goldset, yet the training of LDA is faster than the training of the Pachinko Allocation by a factor of 4.9.

[Bar chart: training time of PA (00:03:16) and LDA (00:00:40).]

Figure 19: Training time of an LDA and a Pachinko Allocation model; LDA 350 topics; PA 250 + 50 topics; 100 iterations; AMD Ryzen 2700X (8-Core) @ 3.70GHz – Win 10.

Running multiple analyses on different systems showed that the library profits largely from parallel computing on multiple cores while training the Pachinko Allocation model, yet the training of the LDA model does not seem to scale in the same way by just having more cores. Instead, the conclusion from the graph in Figure 20 is that there is an overhead for LDA when using a machine with 8 cores (16 logical cores) over one with a similar CPU with only 4 cores (8 logical cores), while the Pachinko Allocation, on the other hand, improves its performance. As noted before, the Pachinko Allocation remains slower than LDA. It is also noteworthy that the requirements for RAM are negligible, as both training processes use less than 300 MB of memory.


[Bar charts: LDA training times (left) between 00:00:33 and 00:00:53 across the three systems; PA training times (right) of 00:02:58, 00:08:57, and 00:15:04.]

Figure 20: Performance of LDA and PA training on different computer systems. LDA (left): 500 topics, 100 iterations, and a burn-in of 10; PA (right): 100 topics and 100 sub-topics, 100 iterations, and a burn-in of 10. Each CPU supports hyperthreading with 2 logical cores per physical core.

Unlike the training task, the evaluation of search queries does not differ significantly between LDA and Pachinko Allocation models, as shown by the graph in Figure 21. It displays the runtime of evaluations for 160 queries from the Corley goldset.

[Bar chart: evaluation time of PA (00:07:19) and LDA (00:07:13).]

Figure 21: Evaluation time of an LDA and a Pachinko Allocation model; LDA 350 topics; PA 250 + 50 topics; 100 iterations; AMD Ryzen 2700X (8-Core) @ 3.70GHz – Win 10.


5.2.2 Improvements through Hyperparameter Tuning

Due to the long runtime of the Pachinko Allocation, hyperparameter tuning is applied distinctly, modifying only one parameter at a time and then choosing a good value before proceeding to the tuning of the next parameter. Of the following subchapters, chapter 5.2.2.1 describes this process for the number of topics used in the models. Chapter 5.2.2.2 then continues with grid searches for good iteration parameters and finally, chapter 5.2.2.3 investigates enhancements possible through the variation of Dirichlet parameters.

5.2.2.1 Topic Number Grid Search

First, a grid search for LDA was performed by iterating through a wide-ranging number of topics between 10 and 800 with a step size of 10. The graph of the logarithmic likelihood shown in Figure 22 indicates a stagnation of improvements between 30 and 100 topics and then a slow descent after more than 400 topics. While the authors of the Corley goldset use an LDA model with 500 topics, the results from the tuning of the number of topics for LDA in this thesis indicate that configurations with between 170 and 400 topics perform best [Co15].
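The grid-search loop itself can be sketched independently of the model library. The scoring callback below is a fabricated stand-in for training an LDA model with `k` topics and reading its log-likelihood per word; only the search ranges match the text:

```python
# Skeleton of the topic-number grid search: try each candidate k, score it,
# and keep the configuration with the highest score.
def grid_search_topics(score, k_values):
    results = {k: score(k) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results

# Fabricated stand-in score with an artificial optimum at 300 topics;
# in the thesis this would be the trained model's log-likelihood per word.
fake_score = lambda k: -6.0 - ((k - 300) / 400) ** 2

best_k, results = grid_search_topics(fake_score, range(10, 810, 10))
```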

[Line plot: logarithmic likelihood (-7.2 to -6) over the number of topics (0 to 800).]

Figure 22: MRR and log-likelihood observed during hyperparameter tuning of an LDA model with 100 iterations and a burn-in of 10.

Due to the low runtime, the parameter tuning for the number of topics in LDA models was suitable to be performed with fine granularity, yet the time-intensive tuning of hyperparameters with the Pachinko Allocation model requires a more coarsely granular approach. The number of super-topics was tuned with values between 20 and 400 and the number of sub-topics with values between 20 and 300, with an overall step size of 50. From this, the graph in Figure 23 was drawn to give an overview of where to look for good configurations.


[Elevation diagram: logarithmic likelihood (-12 to -8.5) over the number of super-topics (20 to 400) and sub-topics (20 to 300).]

Figure 23: MRR and log-likelihood observed during hyperparameter tuning of a PA model with 100 iterations and a burn-in of 10.

The graph shows the logarithmic likelihood on an elevation diagram, with lighter colors indicating a higher logarithmic likelihood. It indicates that the more topics there are, the better the performance that can be expected. Unlike the respective graph for LDA, the evaluation performed for the Pachinko Allocation does not indicate a descent of the likelihood metric, at least not in this configuration. Due to the aforementioned high execution time of the training of Pachinko Allocation models, further configurations with higher numbers of topics were not explored in detail. Instead, to keep the execution time reasonably low for the tuning of other parameters in the following subchapters, the model used will consist of 100 super-topics and 100 sub-topics.

5.2.2.2 Variation of Iteration Parameters

The following parameters will be tuned in this chapter:
- Iterations over the dataset.
- Burn-in iterations for optimizing parameters.
As a base configuration, a fixed number of topics will be used to create the models. For LDA this means 350 topics, while for the Pachinko Allocation 100 super- and 100 sub-topics will be used. Tuning the number of iterations over the dataset for both LDA and the Pachinko Allocation resulted in the data shown in Figure 24. While the logarithmic likelihood as an indicator of model improvement stagnates after 50 iterations with the LDA model, the data shows that the


[Line plot: logarithmic likelihood (-13 to -5) over the number of iterations (0 to 1000) for LDA and PA.]

Figure 24: LDA and PA average logarithmic likelihood while tuning the number of iterations. Burn-in factor 10; LDA 350 topics; PA 100 + 100 topics.

Pachinko Allocation keeps improving almost steadily after an initial jump between 10 and 50 iterations. Weighing the execution time against the improvement, the number of iterations for PA should be located between 100 and 300 for this dataset. For further analyses, 100 iterations are going to be used for both techniques to profit from the faster performance.
The results gathered while tuning the LDA burn-in factor, as displayed in Figure 25, are counter-intuitive: the optimum found while iterating through values between 1 and 1000 indicates that performance is best with no or just a minimal burn-in. On the other hand, a similar tuning for the Pachinko Allocation resulted in numbers that conform with the intuitive expectation: there is a maximum around a burn-in factor of 100, and performance slowly decreases after this point.

[Line plot: logarithmic likelihood (-12 to -5) over the burn-in factor (0 to 1000) for LDA and PA.]

Figure 25: LDA and PA average logarithmic likelihood while tuning the burn-in factor. 100 iterations; LDA 350 topics; PA 100 + 100 topics.


The recommended configuration derived from those results is a burn-in factor of 0 for LDA, and a factor of 100 for the Pachinko Allocation respectively.

5.2.2.3 Variation of Dirichlet Parameters

As visualized in the graph of Figure 26, the Dirichlet parameters alpha (α) and eta (η) were tuned symmetrically in separate configurations for LDA and the Pachinko Allocation.
For the alpha and eta parameters of both techniques, the graph shows that the optimum values are low, around the default configuration of 0.01. There is only one exception: the sub-alpha parameter of the Pachinko Allocation. This parameter, which is related to the second topic layer of the Pachinko Allocation model, has its optimum closer to the value 1. This observation is somewhat unexpected, as a (sub-)alpha parameter close to the value 1 means that the Dirichlet distribution is not specific and therefore associations between groups of topics and their associated words are almost evenly split.

[Line plot: logarithmic likelihood (-14 to -6) over the parameter value (0 to 10) for lda alpha, lda eta, pa alpha, pa subalpha, and pa eta.]

Figure 26: Tuning of alpha (α) and eta (η) parameters of both LDA and PA: LDA 350 topics; PA 100 + 100 topics; other parameters according to previous chapters.

While there is also the option to tune the Dirichlet parameters asymmetrically, through vectors that have the same length as the number of topics they correspond to, this was not attempted in this evaluation. The reason for this is that it would require time and calculation power beyond the scope of this thesis.


5.2.3 Obstacles in the Feature Location Process

During the evaluation phase of this thesis, the tool for performing feature location based on topic modeling was put to the test by running it with various configurations. During the time spent with the evaluation, it became apparent that there is an issue in the tool's main library tomotopy that affects the determinism of the training process. This results in a limited reproducibility of the results presented. Even though tomotopy offers the option to set a seed value in order to reproduce results while training models for both techniques, models created with the same parameter configuration ended up performing differently from each other. They show a slight variance in the logarithmic likelihood score, and their results can vary significantly when evaluating and validating against the goldset. An example in the form of the output during training, evaluation, and validation for two seemingly identical LDA training processes and their results is shown in Listing 11.

> python app.py -t lda --lda_k1 100 -e lda -v lda -i .\tmp\
--- train ---
Iteration: 10  Log-likelihood: -6.812546821579827
Iteration: 20  Log-likelihood: -6.552810377639443
Iteration: 30  Log-likelihood: -6.466817423457372
Iteration: 40  Log-likelihood: -6.423062776760207
Iteration: 50  Log-likelihood: -6.398546375060162
Iteration: 60  Log-likelihood: -6.387198130807414
Iteration: 70  Log-likelihood: -6.376373306876543
Iteration: 80  Log-likelihood: -6.372046251722799
Iteration: 90  Log-likelihood: -6.366875461321159
Iteration: 100 Log-likelihood: -6.364717015332481
LDA ll per word -6.364717015332481
--- evaluate ---
| | # | 159 Elapsed Time: 0:01:16
--- validate ---
LDA MRR: 0.18178203688584976
Time Lapsed = 0:1:59.99697017669678

> python app.py -t lda --lda_k1 100 -e lda -v lda -i .\tmp\
--- train ---
Iteration: 10  Log-likelihood: -6.722635893883447
Iteration: 20  Log-likelihood: -6.5114688372249985
Iteration: 30  Log-likelihood: -6.43296017583458
Iteration: 40  Log-likelihood: -6.398586748371089
Iteration: 50  Log-likelihood: -6.387941462968303
Iteration: 60  Log-likelihood: -6.379946728737874
Iteration: 70  Log-likelihood: -6.378963118261652
Iteration: 80  Log-likelihood: -6.377343502242806
Iteration: 90  Log-likelihood: -6.3754312534110555
Iteration: 100 Log-likelihood: -6.374021941608899
LDA ll per word -6.374021941608899
--- evaluate ---
| | # | 159 Elapsed Time: 0:01:14
--- validate ---
LDA MRR: 0.1561320680286651
Time Lapsed = 0:1:54.462520122528076

Listing 11: Training, evaluation, and validation of an LDA model that performs differently even though the same parameters and seed value have been used.


As a reaction to this, the seed value was not used in the experiments, so that confusion about the assumed, yet nonexistent, determinism is prevented.

5.3 Search Result Performance

Based on the insights into the feature location process gained from the previous chapter, the performance of two good models, one from the Latent Dirichlet Allocation and one from the Pachinko Allocation, is evaluated in this chapter, with a focus on providing metrics for the final discussion in chapter 6.

5.3.1 Objects of Evaluation

Two well-performing models encountered during the evaluation of the feature location process have been chosen for the performance comparison. Their configuration is listed in Table 11. During training, they performed with a logarithmic likelihood of -6.3451 for the LDA model and -10.2163 for the Pachinko Allocation model.

Technique Topics Iterations Burn-in Alpha Eta

LDA 350 100 10 0.01 0.01

PA 250 + 50 100 100 0.01 (sub-alpha 0.01) 0.01

Table 11: Configurations of LDA and PA used for evaluating the search result performance.

LDA PA
Word Probability in topic Word Probability in subtopic
code 0.0939 proposal 0.0183
deprecated 0.0641 quorumpacket 0.0158
keeperexception 0.0286 type 0.0138
link 0.0218 follower 0.0133
3.1.0 0.0203 qp 0.0094

Table 12: Word distribution associated with one of the topics/subtopics of the LDA and Pachinko Allocation models used for the performance evaluation.


To give some insight into how the topic models performed under the abstraction of the tool, Table 12 shows the top 5 words from one of the topics of each model. They are paired with their respective probability from the Dirichlet distribution that describes their association with the topic. From an evaluation standpoint, this table shows that a more comprehensive pre-filtering of words may be useful to further improve the results presented in the next chapter.

5.3.2 Performance

To finally evaluate the performance of feature location applied through the techniques presented in this thesis, the previously introduced models were validated against the Corley goldset of queries and expected search results. Following the work of previous research, the focus lies on the mean reciprocal rank (MRR) metric, as it expresses the effectiveness of a search engine [CMS09], [CDK20]. Further metrics calculated from the output of the validation are the mean and median rank, as well as the average absolute deviation from the mean. While the MRR of LDA is higher (better) than the corresponding value for the Pachinko Allocation, the other metrics that are based on the rank seem to favor the newer technique.
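The rank-based metrics mentioned above can be computed from the list of first-correct-result ranks. The rank values in this sketch are illustrative, not the thesis data:

```python
# Sketch of the rank-based metrics: MRR, mean rank, average absolute
# deviation from the mean, and median rank over first-correct-result ranks.
from statistics import mean, median

def rank_metrics(ranks):
    m = mean(ranks)
    return {
        "MRR": mean(1.0 / r for r in ranks),
        "mean_rank": m,
        "avg_abs_deviation": mean(abs(r - m) for r in ranks),
        "median_rank": median(ranks),
    }

metrics = rank_metrics([1, 4, 18, 67, 250])  # illustrative ranks
```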

A Z-score and a p-value calculated with Wilcoxon’s signed-rank test for the rank further accredit the results a statistically relevant difference, assuming a commonly used significance level of α = 0.05 [CMS09].

Metric LDA PA

MRR 0.2068 0.1662

Mean rank 111 51

Avg. abs. deviation 102.6 59.1

Median rank 67 18

Z-score of the rank 5.78

p-value of the rank < 0.01

Z-score of the reciprocal rank 1.35

p-value of the reciprocal rank 0.18

Table 13: Metrics calculated during the evaluation of the search result performance.
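Such a test can be reproduced with SciPy's implementation of Wilcoxon's signed-rank test; the paired rank lists below are illustrative, not the thesis data:

```python
# Sketch of the significance test: Wilcoxon's signed-rank test on the paired
# first-correct-result ranks of both models (illustrative values).
from scipy.stats import wilcoxon

lda_ranks = [3, 120, 45, 200, 67, 15, 88, 140, 9, 310]
pa_ranks = [7, 40, 12, 60, 20, 18, 30, 55, 4, 90]

statistic, p_value = wilcoxon(lda_ranks, pa_ranks)
# a p_value below alpha (0.05) would indicate a significant rank difference
```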


As described in chapter 3.4.2, the p-value needs to be lower than the selected α for the difference to be significant [Ré08]. The p-value for the reciprocal rank, on the other hand, does not support such an assumption under the same significance level. Results of the calculations are shown in Table 13, with intermediate values for Wilcoxon’s signed-rank test provided in Appendix A.4.
In addition to metrics, the distribution of search result ranks describing the position of the first correct result was extracted from the validation process and visualized for later discussion. The results are shown in Figure 27 and displayed through a histogram with 5 bins and an overflow bin for ranks higher than 20. It is visible that the LDA results include more ranks that exceed the overflow value and are therefore sorted into the last bin. It is also apparent that the Pachinko Allocation generated more results in the two bins describing the lower search result ranks. The full list of all ranks and reciprocal ranks for the two evaluated models can be found in Appendix A.3.

Figure 27: Histogram of the distribution of search result ranks extracted from the validation process of feature location results based on the LDA and Pachinko Allocation models.


6 Discussion

This chapter is dedicated to the discussion of the studies presented in the previous chapter. Those discussions elaborate on the suitability of the feature location technique introduced through the toolkit developed as part of this thesis, as well as on the performance of feature location results derived from the process facilitated by the tool. With this, the research questions defined in chapter 1.3 will be answered.

6.1 Discussion of the Feature Location Technique (RQ1)

(RQ1) How can different feature location techniques be applied to codebase history data while optimizing results for the comparison of different approaches as well as for reproducibility?

This thesis introduced its feature location approach based on a modular integration of two topic modeling techniques. It works on a corpus joined together from a software system's codebase history obtained from git and issue descriptions from Jira instances.

Addressing the first part of the research question, the comparison of different approaches on a model level is supported through the modularity of the implementation. The tool makes it easy to exchange the tomotopy-based models and, with some further modifications, also allows the whole topic modeling library to be exchanged. In its current implementation, the training, evaluation, and validation of both LDA and Pachinko Allocation models are available and executable from the command line interface. Through the command line, it is further possible to dynamically configure most of the settings for each model. This enables the user to operate the tool without manually changing settings in source or configuration files and simplifies the automated execution of different scenarios.

With the approach of integrating the tomotopy library for topic modeling, the execution time of the training and query evaluation process can be considered sufficient for daily use, yet between the two implemented techniques the time required for training differs significantly. Pachinko Allocation models require an execution time 4.9 times higher than that of LDA models and are therefore at a clear disadvantage in a direct comparison. Still, the overall runtime is acceptable at less than four minutes on a desktop computer with current hardware.
Evaluating the internals of the models investigated in the previous chapter, it became apparent that a more sophisticated pre-filtering of words from the source code may be a good starting point for further improvements of the topic prediction accuracy. This is because the words that are common in the models often include implementation syntax, which is unlikely to be searched for directly, as the focus of the tool is natural language search queries [Co15]. At this point, further investigations and tests are required.
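As a purely hypothetical illustration of such a pre-filter (not the toolkit's actual implementation), a simple pass could drop language keywords and trivial tokens before training:

```python
import re

# Hypothetical stop list; a real filter would cover all Java keywords.
JAVA_KEYWORDS = {"public", "private", "static", "final", "void",
                 "class", "return", "new", "int", "boolean"}

def prefilter(tokens):
    """Keep tokens that look like natural-language or domain terms."""
    return [t for t in tokens
            if t.lower() not in JAVA_KEYWORDS      # implementation syntax
            and len(t) > 2                         # trivial tokens
            and not re.fullmatch(r"[0-9_]+", t)]   # numbers, separators

prefilter(["public", "void", "leaderElection", "int", "42", "session"])
# → ["leaderElection", "session"]
```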


In terms of reproducibility, the application provides multiple outputs that are stored and can be used to reproduce results. Among those are the models generated during training, the training data used, the association of training data with topics, and, after the evaluation of goldset queries, also the lists of search results. It can therefore be argued that the application provides a solution through which research in the field of topic modeling-based feature location can be continued. There is only one exception: as discussed in chapter 5.3, the reproducibility of the model training task is limited since the provided seed parameter of the tomotopy library does not seem to stop randomization. Therefore, the training process creates models that differ slightly from each other even if the same configuration is used.

To finally answer the research question, the feature location technique introduced and its implementation can be summarized as a viable solution. Its characteristics are its modularity, its simplified access through the command-line interface for users and automation, as well as its reproducibility through the separation of the whole process into steps and the frequent output of reusable datasets.

6.2 Discussion of the Search Result Performance (RQ2)

(RQ2) How does the Pachinko Allocation Model perform on the task of feature location compared to the more popular Latent Dirichlet Allocation Model?

As established in chapter 3.4.2, the mean reciprocal rank is the first metric to judge search results in the setting of feature location based on the Corley goldset. Applied to the ZooKeeper codebase through two selected models, one being an LDA model and one being a Pachinko Allocation model, the results present a preference for LDA by about 0.04 points on the scale between 1 (best) and 0 (worst). Although preferred by other research papers, the MRR metric is not intuitively comprehensible, and simpler metrics draw a different picture: both the average rank of the first correct result in a list of search results and the median of first correct results are considerably lower (better) with the Pachinko Allocation. It is debatable whether they have relevance next to MRR, as other research in the field of feature location does not present them [CKK15], [CDK20], [CEB17], [CEB15]. The main argument for including them in the evaluation of results is that MRR gives a high relevance to queries that have a correct result on the first rank, while results on the second and third rank already have significantly lower importance to the metric [CMS09]. For example, if two sets of four search queries, a = (1, 1, 1, 10) and b = (2, 2, 2, 2), were to be evaluated, where each entry is the rank of the first correct result, the MRR values would be MRR(a) = 0.775 and MRR(b) = 0.5. Purely relying on the MRR value, an evaluation would clearly favor the results of a, while the average rank, with avg(a) = 3.25 and avg(b) = 2, clearly favors b.
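The worked example above can be verified with two small helper functions (illustrative only, not part of the toolkit):

```python
def mrr(ranks):
    """Mean reciprocal rank over the first correct result of each query."""
    return sum(1 / r for r in ranks) / len(ranks)

def avg(ranks):
    """Plain average rank of the first correct result."""
    return sum(ranks) / len(ranks)

a = [1, 1, 1, 10]  # perfect for three queries, poor for one
b = [2, 2, 2, 2]   # consistently second place

mrr(a), mrr(b)  # → (0.775, 0.5): MRR clearly favors a
avg(a), avg(b)  # → (3.25, 2.0): the average rank favors b
```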

Considering that the last result of a is far down on the search result page while all results of b are visible to a hypothetical user at first glance, the MRR metric seems to misrepresent the actual usefulness of the search engine. A comparison to the click-through rate on web search engines can be drawn here. A study investigating the click behavior of users searching on a web search engine shows that while the first result is the most clicked entry, the second and third are also of importance [HWD11]. Yet results at positions nine and ten were rarely clicked during the study, as Figure 28 shows. The research further shows that users inspect around five results before clicking on one, further strengthening the argument that the relevance of the first result may be overestimated. This is displayed in Figure 29.

Figure 28: Frequencies and percentages of cursor hovers and clicks occurring on the search results. Percentages reflect the proportion of hover or click events over all 10 results [HWD11].

Figure 29: Mean number of search results hovered over before users clicked on a result (above and below that result). Result clicks are red circles, result hovers are blue lines [HWD11].

Following the argumentation above, a solidly good yet not perfect search engine may be of more use to a user than one that performs perfectly for some queries and badly for others. Further, according to Wilcoxon's signed-rank test performed on the rank, the difference between LDA and the Pachinko Allocation is statistically significant, which would make the latter the preferred choice.


As a result of this contradiction, though, it is not clear whether the Pachinko Allocation can play out its advantage of finding correlations between topics over the topic-word correlation approach followed by LDA. Assuming the MRR metric is the more relevant one, the results presented can be interpreted as an indication that features in the source code of the ZooKeeper project are distinct enough to have little overlap between them. Therefore, they would not require additional connections between topics for the task of feature location. Under that assumption, Wilcoxon's signed-rank test for the reciprocal rank value indicates that there is no statistical difference between the performance of the techniques. This would mean that LDA and the Pachinko Allocation perform similarly.

In general, though, the performance of both techniques in generating search results is not yet ready for production use. To be applied in a scenario where developers depend on automatic feature location, the results need to be improved further so that at least the majority of correct results can be found at first glance. This would require a median rank of the first correct result of at most 10. With the LDA model at a median of 67 and the Pachinko Allocation model at 18, this is currently out of reach for both models evaluated in this thesis.

6.3 Threats to Validity

The study presented in this thesis has limitations that may impact its validity. In section 6.3.1 threats concerning the chosen target system and the generalization involved are discussed, while in section 6.3.2 the focus lies on factors within the chosen approach and implementation.

6.3.1 External Validity

When selecting the target system for the analysis in this thesis, the choice was limited to those projects that have previously been investigated by other research and which are available in the goldset used. Further, time constraints only allowed for the detailed analysis of one target system. This poses a threat to external validity, since it cannot be assumed that the application chosen is representative of software projects in general. It is therefore not possible to assume that the results achieved here are also valid for other projects. Instead, the results require validation through the application of similar analyses to other projects. A mitigating factor is that the chosen application, ZooKeeper, is maintained by the Apache Foundation, a major player in open-source software development.

6.3.2 Internal Validity

As experienced in chapter 5.2.3, the reliance on a third-party library for critical parts can pose threats to validity. Internally, such a library can have implementation errors that may not be apparent and that influence the accuracy of results as well as the speed of the calculations. While the main library, tomotopy, is open source, the implementation of its algorithms was not verified by the author. Instead, it is assumed that they are implemented correctly and perform with decent speed and accuracy.


7 Conclusions and Future Work

This thesis investigated two topic modeling techniques for their suitability in the context of feature location tasks. For this purpose, a toolkit, consisting of an importer for codebase history data and a prototypical evaluation tool capable of performing feature location, was created. As laid out in chapter 4, the implementation meets the requirements set out for the tool, and it was therefore used as a basis for the evaluation that provides one part of the data used to answer the research questions. The other part is provided through the Corley goldset, a dataset that contains search queries and their expected results, which was used to validate results from the feature location process. Due to the modularity of the implementation, the toolkit is not limited to use in the context of this thesis but allows for the extension and support of further topic modeling techniques.

To put the toolkit to a test and to fulfill the research objectives of this thesis, multiple aspects of feature location and topic modeling were investigated in chapter 5, based on the evaluation tool and the performance of its underlying techniques. Accordingly, a full feature location process for two topic models, one Latent Dirichlet Allocation model and one Pachinko Allocation model, was conducted on a selected software system and its codebase. This software system is the ZooKeeper project of the Apache Foundation. It is hosted on GitHub and therefore accessible through the Git CLI, has a Jira issue tracking system, and is under active development. Further, it also has a sufficient dataset in the Corley goldset, meeting the requirements of the toolkit's implementation.
For the topic modeling involved, LDA and the Pachinko Allocation were chosen mainly for the following reasons: LDA is the model commonly used in previous research when applying topic modeling, while the Pachinko Allocation is an extension to LDA promising better recognition of connections not only between topics and words but also amongst two or more topics. The results of the investigation of both techniques and their comparison have been discussed in chapter 6. In summary, and by the metrics commonly used in other research, LDA has been identified as being at an advantage over the Pachinko Allocation, yet there are also arguments against the established technique in the context of feature location.

Drawbacks and limitations of the work identified during the tool's evaluation are related to reproducibility and the readiness of results for a production environment. Concerning the latter, it was not the purpose of this thesis to create a production-ready tool. The reproducibility of results, on the other hand, was one of the objectives of this thesis, yet it was not achieved in aspects of the training process. The reason for this has been identified as a non-functioning seed value in the tomotopy library, which causes trained models to turn out differently even when the same settings are applied. Further, in terms of result accuracy, one conclusion from the discussion in chapter 6 is that neither LDA nor the Pachinko Allocation performs at a level that would be acceptable in a production application on which developers rely in their daily work. Instead, there is still work to be done, either by improving the topic modeling techniques or by enhancing the data that they are based on, for example through improved pre-filtering of input texts. Ideas for this further work are discussed in the following section.

All in all, the research and implementation work done in this thesis provides the toolkit and knowledge required for follow-up research as well as further evaluations. It therefore completed its investigation of topic modeling techniques for feature location successfully.

7.1 Future Work

Future work includes the toolkit-based application of feature location to the codebases of further software systems. A starting point can be those that have goldsets available from previous research. Firstly, the ones maintained by the Apache Foundation are of interest as they follow a development strategy similar to ZooKeeper's. Yet the tool can be extended to support different systems as well. In terms of enhancements to the toolkit, the currently applied pre-filtering of text needs to be revisited and enhanced to filter out more of the structural keywords imported from the source code of the systems analyzed. Finally, the variation of topic modeling techniques evaluated is of interest for further work, as there are multiple other models supported by the tomotopy library that could easily be added to the toolkit.

Aside from required improvements in the accuracy and validation of results, further work can include the integration of the toolkit into integrated development environments (IDEs) and the general workflow of software developers. Such integration into an IDE could, for example, be facilitated through an add-on or an extension of the IDE's general search function. Further use of the feature location techniques in the developer's workflow may include a dedicated application. This application may, through visualization of the codebase, provide an overview suitable for planning architectural tasks such as new implementations or the removal of no longer required features or services in the codebase. One option for the supporting visualization may be a variation of circle packing hotspots [To15]. In terms of the time-intensive training process, strategies for re-training and model version control need to be developed. This may include the design of a server-side model repository and the integration of training into continuous integration processes.


8 Bibliography

[AA15] Alghamdi, R.; Alfalqi, K.: A Survey of Topic Modeling in Text Mining. In International Journal of Advanced Computer Science and Applications, 2015, 6.

[BB07] Banerjee, A.; Basu, S.: Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning. In (Apté, C. V. Ed.): Proceedings of the Seventh SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, Philadelphia, 2007; pp. 431–436.

[Bl12] Blei, D. M.: Probabilistic topic models. In Communications of the ACM, 2012, 55; pp. 77–84.

[BM14] Buntine, W. L.; Mishra, S.: Experiments with non-parametric topic models. In (Macskassy, S. et al. Eds.): Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, 2014; pp. 881–890.

[BNJ03] Blei, D. M.; Ng, A. Y.; Jordan, M. I.: Latent Dirichlet Allocation. In J. Mach. Learn. Res., 2003, 3; pp. 993–1022.

[Bo14] Bourque, P. Ed.: Guide to the software engineering body of knowledge. Version 3.0; SWEBOK; a project of the IEEE Computer Society. IEEE Computer Society, Los Alamitos, Calif., 2014.

[CCC10] Chih-wei, H.; Chih-chung, C.; Chih-jen, L.: A practical guide to support vector classification, 2010.

[CDK20] Corley, C. S.; Damevski, K.; Kraft, N. A.: Changeset-Based Topic Modeling of Software Repositories. In IEEE Transactions on Software Engineering, 2020, 46; pp. 1068–1080.

[CEB15] Chochlov, M.; English, M.; Buckley, J.: Using changeset descriptions as a data source to assist feature location. 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM), 2015; pp. 51–60.

[CEB17] Chochlov, M.; English, M.; Buckley, J.: A historical, textual analysis approach to feature location. In Information and Software Technology, 2017, 88; pp. 110–126.

[Ch01] Chang, S. K.: Handbook of software engineering & knowledge engineering. World Scientific, River Edge, N.J., London, 2001.


[CKK15] Corley, C. S.; Kashuda, K. L.; Kraft, N. A.: Modeling changeset topics for feature location. In: 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, Germany, 29 September – 1 October 2015. IEEE, Piscataway, NJ, 2015; pp. 71–80.

[CMS09] Croft, W.; Metzler, D.; Strohman, T.: Search Engines - Information Retrieval in Practice, 2009.

[Co15] Corley, C. S.: cscorley/changeset-feature-location. GitHub. https://github.com/cscorley/changeset-feature-location/, accessed 23 Apr 2021.

[CR01] Chen, K.; Rajich, V.: RIPPLES: tool for change in legacy software. International Conference on Software Maintenance. IEEE Computer Society, 2001; pp. 230–239.

[Cu05] Cubranic, D. et al.: Hipikat: a project memory for software development. In IEEE Transactions on Software Engineering, 2005, 31; pp. 446–465.

[De90] Deerwester, S. et al.: Indexing by latent semantic analysis. In Journal of the American Society for Information Science, 1990, 41; pp. 391–407.

[Di13] Dit, B. et al.: Feature location in source code: a taxonomy and survey. In Journal of Software: Evolution and Process, 2013, 25; pp. 53–95.

[EBG07] Egyed, A.; Binder, G.; Grunbacher, P.: STRADA: A Tool for Scenario-Based Feature-to-Code Trace Detection and Analysis. 29th International Conference on Software Engineering (ICSE 2007 companion volume), 20–26 May 2007, Minneapolis, Minnesota. IEEE Computer Society, Los Alamitos, CA, 2007; pp. 41–42.

[FB19] Fowler, M.; Beck, K.: Refactoring. Improving the design of existing code. Addison-Wesley, Boston, 2019.

[FKG10] Frigyik, B. A.; Kapila, A.; Gupta, M. R.: Introduction to the Dirichlet Distribution and Related Processes. University of Washington, 2010.

[HGH09] Hindle, A.; Godfrey, M. W.; Holt, R. C.: What's hot and what's not: Windowed developer topic analysis. In: International Conference on Software Maintenance, 2009; pp. 339–348.

[Hi15] Hindle, A. et al.: Do Topics Make Sense to Managers and Developers? In Empirical Software Engineering, 2015, 20; pp. 479–515.

[Ho99] Hofmann, T.: Probabilistic latent semantic indexing. In (Hearst, M.; Gey, F. E.; Tong, R. Eds.): Proceedings of SIGIR '99, 22nd international conference on research and development in information retrieval. ACM Press, New York, NY, USA, 1999; pp. 50–57.


[HWD11] Huang, J.; White, R. W.; Dumais, S.: No clicks, no problem. In (Tan, D. et al. Eds.): 29th Annual CHI conference on human factors in computing systems. Association for Computing Machinery, New York, 2011; p. 1225.

[Jo15] Jordan, H. et al.: Manually Locating Features in Industrial Source Code: The Search Actions of Software Nomads. 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, 2015; pp. 174–177.

[Ko06] Ko, A. J. et al.: An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. In IEEE Transactions on Software Engineering, 2006, 32; pp. 971–987.

[Le18] Lero: Empirical Assessment of Baseline Feature Location Techniques. https://lero.ie/research/datasets/feature_location/comparison, accessed 23 Apr 2021.

[Li15] Li, X. et al.: Group topic model: organizing topics into groups. In Information Retrieval Journal, 2015, 18; pp. 1–25.

[LM06] Li, W.; McCallum, A.: Pachinko allocation. In (Cohen, W. W.; Moore, A. W. Eds.): Proceedings of the 23rd international conference on machine learning (ICML 2006). ACM Press, New York, NY, USA, 2006; pp. 577–584.

[Lu94] Lukoit et al.: TraceGraph: immediate visual location of software features. In (Müller, H. A.; Georges, M. Eds.): Software maintenance, 11th International conference papers. IEEE Computer Society Press, 1994; pp. 33–39.

[Ma18] Martinez, J. et al.: Feature location benchmark with argoUML SPL. In (Berger, T. Ed.): Proceedings of the 22nd International Systems and Software Product Line Conference - Volume 1. ACM, New York, NY, 2018; pp. 257–263.

[Mi21] Microsoft: Regular Expression Language - Quick Reference. https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference, accessed 10 May 2021.

[OJ10] Olszak, A.; Jørgensen, B. N.: Featureous: A Tool for Feature-Centric Analysis of Java Software. In (Antoniol, G. Ed.): 2010 IEEE 18th International Conference on Program Comprehension (ICPC 2010), Braga, Portugal, 30 June – 2 July 2010. IEEE, Piscataway, NJ, 2010; pp. 44–45.

[Po05] Poshyvanyk, D. et al.: IRiSS - A Source Code Exploration Tool, 2005.


[Po06] Poshyvanyk, D. et al.: Source Code Exploration with Google. 22nd IEEE International Conference on Software Maintenance (ICSM'06), Philadelphia, Pennsylvania, 24–27 September 2006. IEEE, 2006; pp. 334–338.

[Pr17] Provalis Research: Topic modeling vs. cluster analysis: What's the difference?! In Provalis Research, 2017.

[Ra10a] Rasch, B.: Quantitative Methoden. Einführung in die Statistik - Band 1. Springer, Berlin, 2010.

[Ra10b] Rasch, B.: Quantitative Methoden. Einführung in die Statistik - Band 2. Springer, Berlin, 2010.

[Ré08] Rédei, G. P. Ed.: Encyclopedia of genetics, genomics, proteomics and informatics. Springer, London, 2008.

[Sc21] Schulte, L. M.: Analyzing Dependencies between Software Architectural Degradation and Code Complexity Trends. Research Project, Karlstad, Sweden, 2021.

[SM83] Salton, G.; McGill, M. J.: Introduction to modern information retrieval. McGraw-Hill, New York, London, 1983.

[SS18] Syed, S.; Spruit, M.: Selecting Priors for Latent Dirichlet Allocation. 2018 IEEE 12th International Conference on Semantic Computing (ICSC), 2018; pp. 194–202.

[To15] Tornhill, A.: Your code as a crime scene. Use forensic techniques to arrest defects, bottlenecks, and bad design in your programs. The Pragmatic Bookshelf, Dallas, 2015.

[WR07] Warr, F. W.; Robillard, M. P.: Suade: Topology-Based Searches for Software Investigation. 29th International Conference on Software Engineering (ICSE 2007 proceedings), 20–26 May 2007, Minneapolis, Minnesota. IEEE Computer Society, Los Alamitos, CA, 2007; pp. 780–783.


9 Appendix

A.1 Goldset Comparison

Project        Lero  Corley  VCS / Issue Tracking
ArgoUML        ✔     ✔       Prev.: SVN (Tigris) / Tigris (not available); 2019: Git (GitHub) / GitHub
Bookkeeper     ❌    ✔       Git (GitHub) / Jira
CommonsMath    ✔     ❌      Git (GitHub) / Jira
CommonsLang    ✔     ❌      Git (GitHub) / Jira
Derby          ✔     ✔       Git (GitHub) / Jira
Eclipse        ✔     ❌      Git (eclipse.org) / Bugzilla
Hibernate      ❌    ✔       Git (GitHub) / Jira
iBatis         ✔     ❌      Git (GitHub) / GitHub
JabRef         ✔     ✔       Prev.: Git (SourceForge) / SourceForge; 2015: Git (GitHub) / GitHub
jEdit          ✔     ✔       SVN (SourceForge) / SourceForge
Lucene         ❌    ✔       Git (GitHub) / Jira
Mahout         ❌    ✔       Git (GitHub) / Jira
MuCommander    ✔     ✔       Git (GitHub) / GitHub
Mylyn          ✔     ❌      Git (eclipse.org) / Bugzilla
OpenJPA        ❌    ✔       Git (GitHub) / Jira
Pig            ❌    ✔       Git (GitHub) / Jira
Rhino          ✔     ❌      Git (GitHub) / GitHub
Solr           ❌    ✔       Git (GitHub) / Jira
Tika           ❌    ✔       Git (GitHub) / Jira
Zookeeper      ❌    ✔       Git (GitHub) / Jira


A.2 Lookup Table


A.3 Ranks and Reciprocal Ranks calculated from the discussed Models

Query      LDA rank  LDA reciprocal rank  PA rank  PA reciprocal rank
query 1    10   0.1          23   0.043478
query 2    10   0.1          4    0.25
query 3    1    1            27   0.037037
query 4    1    1            3    0.333333
query 5    116  0.00862069   11   0.090909
query 6    10   0.1          4    0.25
query 7    152  0.006578947  126  0.007937
query 8    193  0.005181347  5    0.2
query 9    1    1            2    0.5
query 10   2    0.5          18   0.055556
query 11   1    1            1    1
query 12   341  0.002932551  55   0.018182
query 13   210  0.004761905  10   0.1
query 14   244  0.004098361  22   0.045455
query 15   10   0.1          4    0.25
query 16   211  0.004739336  9    0.111111
query 17   189  0.005291005  29   0.034483
query 18   7    0.142857143  31   0.032258
query 19   54   0.018518519  126  0.007937
query 20   69   0.014492754  10   0.1
query 21   251  0.003984064  16   0.0625
query 22   191  0.005235602  29   0.034483
query 23   7    0.142857143  18   0.055556
query 24   64   0.015625     11   0.090909
query 25   379  0.002638522  120  0.008333
query 26   16   0.0625       31   0.032258
query 27   260  0.003846154  100  0.01
query 28   14   0.071428571  33   0.030303
query 29   122  0.008196721  1    1
query 30   146  0.006849315  7    0.142857
query 31   431  0.002320186  389  0.002571
query 32   437  0.00228833   294  0.003401
query 33   139  0.007194245  7    0.142857
query 34   1    1            4    0.25
query 35   291  0.003436426  23   0.043478
query 36   1    1            10   0.1
query 37   24   0.041666667  33   0.030303
query 38   1    1            1    1
query 39   10   0.1          6    0.166667
query 40   101  0.00990099   3    0.333333
query 41   192  0.005208333  175  0.005714
query 42   10   0.1          31   0.032258
query 43   99   0.01010101   3    0.333333
query 44   1    1            1    1
query 45   2    0.5          18   0.055556
query 46   3    0.333333333  20   0.05
query 47   225  0.004444444  22   0.045455
query 48   353  0.002832861  19   0.052632
query 49   141  0.007092199  35   0.028571
query 50   2    0.5          2    0.5
query 51   11   0.090909091  31   0.032258
query 52   162  0.00617284   126  0.007937
query 53   10   0.1          4    0.25
query 54   85   0.011764706  17   0.058824
query 55   15   0.066666667  5    0.2
query 56   309  0.003236246  36   0.027778
query 57   385  0.002597403  65   0.015385
query 58   9    0.111111111  23   0.043478
query 59   126  0.007936508  8    0.125


query 60   2    0.5          18   0.055556
query 61   360  0.002777778  36   0.027778
query 62   92   0.010869565  3    0.333333
query 63   4    0.25         21   0.047619
query 64   1    1            27   0.037037
query 65   10   0.1          22   0.045455
query 66   303  0.00330033   210  0.004762
query 67   63   0.015873016  327  0.003058
query 68   165  0.006060606  351  0.002849
query 69   1    1            24   0.041667
query 70   8    0.125        1    1
query 71   351  0.002849003  296  0.003378
query 72   1    1            27   0.037037
query 73   109  0.009174312  4    0.25
query 74   436  0.002293578  392  0.002551
query 75   75   0.013333333  13   0.076923
query 76   110  0.009090909  12   0.083333
query 77   9    0.111111111  23   0.043478
query 78   46   0.02173913   12   0.083333
query 79   99   0.01010101   11   0.090909
query 80   116  0.00862069   1    1
query 81   10   0.1          22   0.045455
query 82   9    0.111111111  4    0.25
query 83   1    1            27   0.037037
query 84   1    1            4    0.25
query 85   112  0.008928571  11   0.090909
query 86   11   0.090909091  5    0.2
query 87   140  0.007142857  124  0.008065
query 88   112  0.008928571  5    0.2
query 89   1    1            2    0.5
query 90   7    0.142857143  18   0.055556
query 91   1    1            1    1
query 92   336  0.00297619   46   0.021739
query 93   190  0.005263158  9    0.111111
query 94   277  0.003610108  23   0.043478
query 95   10   0.1          4    0.25
query 96   177  0.005649718  8    0.125
query 97   214  0.004672897  29   0.034483
query 98   8    0.125        31   0.032258
query 99   51   0.019607843  126  0.007937
query 100  48   0.020833333  10   0.1
query 101  197  0.005076142  14   0.071429
query 102  316  0.003164557  28   0.035714
query 103  2    0.5          17   0.058824
query 104  53   0.018867925  14   0.071429
query 105  388  0.00257732   120  0.008333
query 106  12   0.083333333  22   0.045455
query 107  307  0.003257329  100  0.01
query 108  23   0.043478261  33   0.030303
query 109  126  0.007936508  1    1
query 110  131  0.007633588  7    0.142857
query 111  422  0.002369668  394  0.002538
query 112  451  0.002217295  294  0.003401
query 113  185  0.005405405  8    0.125
query 114  1    1            4    0.25
query 115  204  0.004901961  23   0.043478
query 116  1    1            10   0.1
query 117  24   0.041666667  33   0.030303
query 118  1    1            1    1
query 119  13   0.076923077  5    0.2
query 120  92   0.010869565  3    0.333333
query 121  194  0.005154639  166  0.006024
query 122  8    0.125        31   0.032258
query 123  92   0.010869565  3    0.333333
query 124  1    1            1    1


query 125  3    0.333333333  18   0.055556
query 126  3    0.333333333  20   0.05
query 127  192  0.005208333  22   0.045455
query 128  71   0.014084507  19   0.052632
query 129  191  0.005235602  38   0.026316
query 130  1    1            2    0.5
query 131  8    0.125        33   0.030303
query 132  112  0.008928571  125  0.008
query 133  9    0.111111111  4    0.25
query 134  35   0.028571429  17   0.058824
query 135  10   0.1          5    0.2
query 136  310  0.003225806  38   0.026316
query 137  314  0.003184713  65   0.015385
query 138  10   0.1          23   0.043478
query 139  137  0.00729927   7    0.142857
query 140  3    0.333333333  18   0.055556
query 141  474  0.002109705  36   0.027778
query 142  115  0.008695652  3    0.333333
query 143  3    0.333333333  21   0.047619
query 144  1    1            27   0.037037
query 145  11   0.090909091  22   0.045455
query 146  392  0.00255102   172  0.005814
query 147  80   0.0125       313  0.003195
query 148  156  0.006410256  349  0.002865
query 149  1    1            5    0.2
query 150  3    0.333333333  1    1
query 151  351  0.002849003  296  0.003378
query 152  1    1            28   0.035714
query 153  131  0.007633588  4    0.25
query 154  311  0.003215434  389  0.002571
query 155  165  0.006060606  13   0.076923
query 156  112  0.008928571  11   0.090909
query 157  9    0.111111111  22   0.045455
query 158  38   0.026315789  12   0.083333
query 159  109  0.009174312  11   0.090909
query 160  126  0.007936508  1    1

A.4 Intermediate Values of Wilcoxon's Signed-Rank Test

Variable                           Rank          Reciprocal Rank
T (smaller sum of signed ranks)    2715.5        5151.5
N (sample size)                    153           153
μ                                  5890.5        5890.5
σ                                  548.9965847   548.9965847
Z                                  5.782367484   1.34518141
φ(|Z|)                             1.00E+00      0.91
p                                  7.36566E-09   0.178566658
