
Investigating topic modeling techniques for historical feature location.

Lukas Schulte

Faculty of Health, Science and Technology
Master thesis in Second Cycle, 30 hp (ECTS)
Supervisor: Dr. Sebastian Herold, Karlstad University
Examiner: Dr. Muhammad Ovais Ahmad
Karlstad, June 28th, 2021

I Abstract


Software maintenance and the understanding of where in the source code features are implemented are two strongly coupled tasks that make up a large portion of the effort spent on developing applications. The concept of feature location investigated in this thesis can serve as a supporting factor in those tasks, as it facilitates the automation of otherwise manual searches for source code artifacts. Challenges in this subject area include the aggregation and composition of a training corpus for models from historical codebase data as well as the integration and optimization of qualified topic modeling techniques. Building on previous research, this thesis provides a comparison of two different techniques and introduces a toolkit that can be used to reproduce and extend the results discussed. Specifically, this thesis pursues a changeset-based approach to feature location and applies it to a large open-source Java project. The project is used to optimize and evaluate the performance of Latent Dirichlet Allocation models and Pachinko Allocation models, as well as to compare the accuracy of the two models with each other. As discussed at the end of the thesis, the results do not indicate a clear favorite between the models. Instead, the outcome of the comparison depends on the metric and viewpoint from which it is assessed.

Keywords: feature location, topic modeling, changesets, latent Dirichlet allocation, Pachinko allocation, mining software repositories, source code comprehension


II Acknowledgments

First, I would like to acknowledge the work that the responsible instances at Karlstad University and the University of Applied Sciences Osnabrück have put into their partnership within the ERASMUS program, which made my thesis possible. Further, I would like to give special thanks to the supervisor of my thesis, Dr. Sebastian Herold, for his support and help. In the same way, I would like to thank Dr. Muhammad Ovais Ahmad for taking the role of examiner for my work. Finally, I would like to express my gratitude to my family and friends who provided support and distraction during the time I worked on this thesis.


III Table of Contents

1 INTRODUCTION ...... 1
1.1 BACKGROUND ...... 1
1.2 PROBLEM DESCRIPTION ...... 1
1.3 THESIS GOAL ...... 2
1.4 THESIS OBJECTIVE ...... 2
1.5 ETHICS AND SUSTAINABILITY ...... 2
1.6 METHODOLOGY ...... 3
1.7 STAKEHOLDERS ...... 4
1.8 DELIMITATIONS ...... 4
1.9 OUTLINE ...... 4
2 BACKGROUND AND RELATED WORK ...... 6
2.1 FEATURE LOCATION ...... 6
2.1.1 Definition and Taxonomy ...... 6
2.1.2 Tools for Feature Location ...... 7
2.1.3 Datasets for Benchmarking ...... 8
2.2 TEXT MINING ...... 8
2.3 TOPIC MODELING ...... 9
2.4 DIRICHLET DISTRIBUTIONS ...... 11
2.5 LATENT DIRICHLET ALLOCATION AND PACHINKO ALLOCATION ...... 13
2.6 RELATED WORK ...... 15
3 METHODS ...... 18
3.1 OVERVIEW ...... 18
3.2 DATA PREPARATION ...... 19
3.2.1 Data Mining ...... 19
3.2.2 Text Cleaning ...... 19
3.3 TOPIC MODELING ...... 20
3.3.1 Parameters ...... 20
3.3.2 Hyperparameter Tuning ...... 21
3.4 FEATURE LOCATION ...... 21
3.4.1 Search Query ...... 22
3.4.2 Performance Metrics ...... 22
3.4.3 Goldset-based Validation ...... 23
4 IMPLEMENTATION ...... 25
4.1 GOALS AND CONSTRAINTS ...... 25
4.2 GENERAL SOLUTION STRATEGY ...... 26
4.3 IMPORTER APPLICATION ...... 29
4.3.1 Context, Scope, and Solution Strategy ...... 29
4.3.2 Building Block View ...... 32
4.4 FEATURE LOCATION ...... 39
4.4.1 Context, Scope, and Solution Strategy ...... 39
4.4.2 Building Block View ...... 42
5 EVALUATION ...... 47
5.1 SETUP ...... 47


5.1.1 General Data Structures ...... 47
5.1.2 Target System ...... 50
5.2 FEATURE LOCATION TECHNIQUE ANALYSIS ...... 50
5.2.1 Execution Time ...... 51
5.2.2 Improvements through Hyperparameter Tuning ...... 53
5.2.3 Obstacles in the Feature Location Process ...... 57
5.3 SEARCH RESULT PERFORMANCE ...... 58
5.3.1 Objects of Evaluation ...... 58
5.3.2 Performance ...... 59
6 DISCUSSION ...... 61
6.1 DISCUSSION OF THE FEATURE LOCATION TECHNIQUE (RQ1) ...... 61
6.2 DISCUSSION OF THE SEARCH RESULT PERFORMANCE (RQ2) ...... 62
6.3 THREATS TO VALIDITY ...... 64
6.3.1 External Validity ...... 64
6.3.2 Internal Validity ...... 64
7 CONCLUSIONS AND FUTURE WORK ...... 66
7.1 FUTURE WORK ...... 67
8 BIBLIOGRAPHY ...... 68
9 APPENDIX ...... 72
A.1 GOLDSET COMPARISON ...... 72
A.2 LOOKUP TABLE ...... 73
A.3 RANKS AND RECIPROCAL RANKS CALCULATED FROM THE DISCUSSED MODELS ...... 74
A.4 INTERMEDIATE VALUES OF WILCOXON'S SIGNED-RANK TEST ...... 76


IV List of Figures

FIGURE 1: ILLUSTRATION OF TOPIC MODELING APPLIED TO AN ARTICLE USING LDA [BL12] ...... 10
FIGURE 2: DENSITY PLOTS [BLUE = LOW, RED = HIGH] [FKG10] ...... 12
FIGURE 3: VISUAL ABSTRACTION OF THE RELATIONSHIP BETWEEN DOCUMENTS AND WORDS IN LDA ...... 13
FIGURE 4: MODEL STRUCTURES OF LDA AND A FOUR-LEVEL PACHINKO ALLOCATION [LM06] ...... 14
FIGURE 5: COMPARISON BETWEEN LDA, PACHINKO ALLOCATION (PAM), AND TWO OTHER TECHNIQUES: LEFT: PACHINKO PERFORMS BETTER WITH A HIGHER NUMBER OF TOPICS; RIGHT: PACHINKO PERFORMS BETTER WITH MORE TRAINING DATA [LM06] ...... 15
FIGURE 6: CORPUS GENERATION FROM ISSUE DESCRIPTIONS AND CHANGESETS ON CLASS LEVEL ...... 18
FIGURE 7: LDA REPRESENTED AS A GRAPHICAL MODEL WITH THE HIDDEN PARAMETERS α AND η AND THE OBSERVED PARAMETERS [SS18] ...... 21
FIGURE 8: IMPORTER APPLICATION CONTEXT ...... 31
FIGURE 9: UML DIAGRAM OF THE BUILDING BLOCKS IN THE IMPORTER APPLICATION ...... 32
FIGURE 10: IMPORT PROCESS ...... 32
FIGURE 11: DATABASE DIAGRAM ...... 33
FIGURE 12: TASKS OF THE FEATURE LOCATION PROCESS ...... 39
FIGURE 13: FEATURE-LOCATION APPLICATION CONTEXT ...... 40
FIGURE 14: UML DIAGRAM OF THE BUILDING BLOCKS IN THE FEATURE-LOCATION APPLICATION ...... 42
FIGURE 15: FLOW CHART OF THE TRAIN TASK ...... 44
FIGURE 16: FLOW CHART OF THE EVALUATION TASK ...... 45
FIGURE 17: FLOW CHART OF THE VALIDATE TASK ...... 46
FIGURE 18: PROGRAMMING LANGUAGE COMPOSITION OF THE ZOOKEEPER REPOSITORY BY LINES OF CODE ...... 50
FIGURE 19: TRAINING TIME OF AN LDA AND A PACHINKO ALLOCATION MODEL; LDA 350 TOPICS; PA 250 + 50 TOPICS; 100 ITERATIONS; AMD RYZEN 2700X (8-CORE) @ 3.70GHZ – WIN 10 ...... 51
FIGURE 20: PERFORMANCE OF LDA AND PA TRAINING ON DIFFERENT COMPUTER SYSTEMS. LDA (LEFT): 500 TOPICS, 100 ITERATIONS, AND A BURN-IN OF 10; PA (RIGHT): 100 TOPICS AND 100 SUB-TOPICS, 100 ITERATIONS, AND A BURN-IN OF 10. EACH CPU SUPPORTS HYPERTHREADING FOR 2 LOGICAL CORES ...... 52
FIGURE 21: EVALUATION TIME OF AN LDA AND A PACHINKO ALLOCATION MODEL; LDA 350 TOPICS; PA 250 + 50 TOPICS; 100 ITERATIONS; AMD RYZEN 2700X (8-CORE) @ 3.70GHZ – WIN 10 ...... 52
FIGURE 22: MRR AND LOG-LIKELIHOOD OBSERVED DURING HYPERPARAMETER TUNING OF AN LDA MODEL WITH 100 ITERATIONS AND A BURN-IN OF 10 ...... 53
FIGURE 23: MRR AND LOG-LIKELIHOOD OBSERVED DURING HYPERPARAMETER TUNING OF A PA MODEL WITH 100 ITERATIONS AND A BURN-IN OF 10 ...... 54
FIGURE 24: LDA AND PA AVERAGE LOGARITHMIC LIKELIHOOD WHILE TUNING THE NUMBER OF ITERATIONS. BURN-IN FACTOR 10; LDA 350 TOPICS; PA 100 + 100 TOPICS ...... 55
FIGURE 25: LDA AND PA AVERAGE LOGARITHMIC LIKELIHOOD WHILE TUNING THE BURN-IN FACTOR. 100 ITERATIONS; LDA 350 TOPICS; PA 100 + 100 TOPICS ...... 55
FIGURE 26: TUNING OF ALPHA (α) AND ETA (η) PARAMETERS OF BOTH LDA AND PA: LDA 350 TOPICS; PA 100 + 100 TOPICS; OTHER PARAMETERS ACCORDING TO PREVIOUS CHAPTERS ...... 56
FIGURE 27: HISTOGRAM OF THE DISTRIBUTION OF SEARCH RESULT RANKS EXTRACTED FROM THE VALIDATION PROCESS OF FEATURE LOCATION RESULTS BASED ON THE LDA AND PACHINKO ALLOCATION MODELS ...... 60
FIGURE 28: FREQUENCIES AND PERCENTAGES OF CURSOR HOVERS AND CLICKS OCCURRING ON THE SEARCH RESULTS. PERCENTAGES REFLECT THE PROPORTION OF HOVER OR CLICK EVENTS OVER ALL 10 RESULTS [HWD11] ...... 63


FIGURE 29: MEAN NUMBER OF SEARCH RESULTS HOVERED OVER BEFORE USERS CLICKED ON A RESULT (ABOVE AND BELOW THAT RESULT). RESULT CLICKS ARE RED CIRCLES, RESULT HOVERS ARE BLUE LINES [HWD11]...... 63


V List of Tables

TABLE 1: ACCURACIES [LM06] ...... 15
TABLE 2: OVERVIEW OF TOPIC MODELING LIBRARIES ...... 27
TABLE 3: COMPOSITION OF PROJECTS IN THE GOLDSETS ...... 30
TABLE 4: DESCRIPTION OF COMPONENTS IN THE CONTEXT OF THE IMPORTER APPLICATION ...... 31
TABLE 5: OVERVIEW OF DATA PRODUCED BY IMPORTER STEPS ...... 38
TABLE 6: DESCRIPTION OF COMPONENTS IN THE CONTEXT OF THE FEATURE-LOCATION APPLICATION ...... 41
TABLE 7: COMMAND LINE OPTIONS OF THE APP.PY FILE ...... 43
TABLE 8: STRUCTURE OF THE LERO GOLDSET ...... 48
TABLE 9: STRUCTURE OF THE CORLEY GOLDSET ...... 48
TABLE 10: STRUCTURE OF GIT AND JIRA DATA ...... 49
TABLE 11: CONFIGURATIONS OF LDA AND PA USED FOR EVALUATING THE SEARCH RESULT PERFORMANCE ...... 58
TABLE 12: WORD DISTRIBUTION ASSOCIATED WITH ONE OF THE TOPICS/SUBTOPICS OF THE LDA AND PACHINKO ALLOCATION MODELS USED FOR THE PERFORMANCE EVALUATION ...... 58
TABLE 13: METRICS CALCULATED DURING THE EVALUATION OF THE SEARCH RESULT PERFORMANCE ...... 59


VI Table of Listings

LISTING 1: REGULAR EXPRESSION APPROXIMATING THE DETECTION OF JAVA METHOD NAMES ...... 8
LISTING 2: EXAMPLE OF PATTERN MATCHING USING A REGULAR EXPRESSION TO DETECT A METHOD NAME ...... 8
LISTING 3: GIT LOG COMMAND ...... 34
LISTING 4: GIT LOG OUTPUT OF A COMMIT TO THE JABREF REPOSITORY (SHORTENED) ...... 34
LISTING 5: GIT COMMAND FOR A COMMIT TO THE JABREF REPOSITORY ...... 35
LISTING 6: GIT DIFF OUTPUT OF A COMMIT TO THE JABREF REPOSITORY (SHORTENED) ...... 35
LISTING 7: EXTRACT FROM THE REGULAR EXPRESSIONS USED FOR METHOD AND CLASS DETECTION BY THE INTERPRETER.PY ...... 36
LISTING 8: SAMPLE URL FOR ACCESSING AN ISSUE FROM A JIRA INSTANCE ...... 37
LISTING 9: BEGINNING OF A CLASS-BASED SEARCH RESULT FOR THE ZOOKEEPER REPOSITORY ...... 45
LISTING 10: START COMMAND FOR TRAINING, EVALUATION, AND VALIDATION OF LDA AND PA MODELS ...... 51
LISTING 11: TRAINING, EVALUATION, AND VALIDATION OF AN LDA MODEL THAT PERFORMS DIFFERENTLY EVEN THOUGH THE SAME PARAMETERS AND SEED VALUE HAVE BEEN USED ...... 57


VII List of Abbreviations

FL     Feature Location
IDE    Integrated Development Environment
IR     Information Retrieval
ITS    Issue Tracking System
LDA    Latent Dirichlet Allocation
MRR    Mean Reciprocal Rank
NLP    Natural Language Processing
PA     Pachinko Allocation
RegEx  Regular Expression
VCS    Version Control System


1 Introduction

1.1 Background

In software development, adding new features to a system is not the developer task that takes up most of the time and budget. Rather, 60% to 80% of the budget is spent on software maintenance, a broad activity that includes work done after a feature or a whole software system becomes operational [Ch01]. Since further research shows that about 50% of the time spent on maintenance is invested into understanding existing code, tools that support this process can have a significant effect on the productivity of developers [Bo14]. Specifically, research in the field has shown that as much as 88% of manual searches for a feature in an unfamiliar project return no relevant results [Ko06].

So far, manual searches in the source code, supported by tools like grep and the text editor search, have been used by developers in the task of understanding source code [Jo15]. In such tasks, and when required to find a solution to a maintenance problem, those tools already allow the developer to use wildcards and pipes to connect searches based on the code in the codebase [Jo15]. Beyond this, however, new techniques that integrate further existing information into the data pool are being explored in current research. One relevant field is feature location as a search engine for source code, as it can help developers find relevant parts of the source code by simply entering a natural language search query into the engine. Techniques used under the hood to enable such functionality can include source code analysis, information retrieval, or natural language processing.

1.2 Problem Description

For developers unfamiliar with a codebase, the process of locating features based purely on textual searches in the source code has shown a low success rate [Ko06]. Therefore, advanced techniques for historical feature location have previously been explored as a possible solution. The concept was introduced and debated in previous research and relies on topic modeling fed with a codebase's historic data and descriptions collected from issue tracking systems [CKK15], [CDK20], [CEB15], [CEB17]. Together, the data can form feature descriptions on which topic modeling can create clusters used to match search queries to search results. Yet, there are no ready-to-use tools that can do this reliably. Even those prototypes provided by researchers do not yet generate results that could rival the modern search engines popular in non-coding contexts. Further, those prototypes so far struggle to provide reproducible results and benchmarks upon which investigations in the field can be extended [Ma18].


For those reasons, the question of whether newer topic modeling techniques bring an advantage over the commonly used Latent Dirichlet Allocation (LDA) must be discussed and a solution for reproducible results and benchmarks needs to be provided.

1.3 Thesis Goal

The goal of this thesis and its research is based on data that describes a software system's evolution. Such data includes the source code history as tracked by version control systems like Git as well as issue descriptions from issue tracking systems like Jira. In the context of this thesis, this kind of data will be called codebase history data. With this, it is the goal of this thesis to answer the following research questions:

(RQ1) How can different feature location techniques be applied to codebase history data while optimizing results for the comparison of different approaches as well as for reproducibility?

(RQ2) How does the Pachinko Allocation Model perform on the task of feature location compared to the more popular Latent Dirichlet Allocation Model?

1.4 Thesis Objective

The thesis is structured following two objectives that mirror the previously defined research questions. While the first objective focuses on a solution that can provide a basis for comparison between topic modeling techniques in the context of feature location, the second one deals specifically with the comparison of the two selected techniques, the Latent Dirichlet Allocation and the Pachinko Allocation.

(O1) Design and implementation of a toolkit that supports the application of feature location to historical codebase data. It is part of this objective to achieve flexibility through modularization for the exchangeability of specific technologies as well as reproducibility of results through machine- and human-readable outputs.

(O2) Comparison of the Pachinko Allocation with the Latent Dirichlet Allocation in terms of performance and accuracy when applied in the context of feature location. This objective contains the requirement to investigate the effect that varying configurations can have on the performance and accuracy of the models. Further, a direct comparison between two optimized models in terms of locating features in historic codebase data is part of this objective.

1.5 Ethics and Sustainability

Technologies like machine learning, clustering, or topic modeling use generalization as a basis to provide their functionality. In contexts where persons or their work are involved, this generalization can raise ethical concerns. The data imported from the version control system and issue tracker includes such data, but those data points that could allow a link between authors and code are not processed or presented by the tools. Further, the manually conducted analyses of the data presented in this thesis do not contain information on such links either. Therefore, the subject of this thesis is believed not to raise ethical concerns.

Its sustainability, though, can be judged in an economic sense. As stated before, between 60% and 80% of costs incurred during software development are related to maintenance, with about 50% of that being invested into understanding existing source code [Ch01], [Bo14]. A reduction of the time required to locate features can therefore justify the investment into the exploration of feature location techniques, as it reduces the cost of maintenance. It also allows developers to spend more time on tasks like the implementation of new features or software quality control.

1.6 Methodology

The thesis is separated into two objectives that were presented previously in chapter 1.4. To fulfill those objectives and to be able to find answers for the research questions introduced in chapter 1.3, the work conducted can be split into three steps: research, implementation, and evaluation.

• Research
  o Collection and investigation of available resources and tools that can support the implementation and evaluation.
• Implementation
  o Design and implementation of an importer tool for data from Git repositories and Jira issue tracking systems with a focus on expandability and reusability.
  o Exploration of multiple approaches to feature location, including different concepts for data preparation and result validation as well as different libraries for topic modeling.
  o Conducting preliminary evaluations to determine the suitability of approaches for being included in the final tool.
• Evaluation
  o Optimization of the parameters of two topic models, one for LDA and one for the Pachinko Allocation.
  o Analysis of performance and topic detection accuracy based on goldset validation.

The applied methods are discussed in detail in chapter 3.


1.7 Stakeholders

Since the application of feature location can provide benefits in many aspects of software maintenance, potential stakeholders can be found in most companies that work on the modernization and extension of software products. Examples of those companies include IT consultancies that work with previously unknown, external systems, as well as their clients who often deal with legacy applications. Further stakeholders are researchers in the fields of software architecture and quality management who may find this work and its results useful for their studies.

1.8 Delimitations

This thesis provides an overview of topic modeling techniques for historical feature location while also going into detail on selected subjects. A noteworthy delimitation exists for the execution of the evaluation and specifically for the process of hyperparameter tuning. Due to the high execution times of Pachinko Allocation models, a full nested cross-validation process, as is common in a machine learning context, is not performed. Instead, individual tuning of selected hyperparameters must suffice for the scope of this work. It is further not required that the toolkit created during the implementation phase of this thesis be a production-ready application. Instead, a prototypical implementation is desired.

1.9 Outline

This thesis is divided into seven chapters. After this introduction, the second chapter describes the background and work related to the topic of this thesis. This includes a breakdown of feature location into its underlying steps and parts as well as summaries of research papers that have already applied topic modeling to historical codebase data to locate features.

The third chapter focuses on the methods applied during both the implementation and evaluation phases. Those methods include approaches for data preparation, training, and optimization of topic models, as well as validation techniques for the results of feature location.

The implementation in chapter 4 describes the toolkit on which further analyses are based. It is split into two parts, one documenting the importer tool and the other describing the characteristics of the application responsible for executing feature location. Following a modular approach, the toolkit supports two topic models: the Latent Dirichlet Allocation and the Pachinko Allocation.

Chapter 5 focuses on evaluating the performance of the toolkit as well as the accuracy of feature location search results provided by the two supported topic models. Further, the object of evaluation, an open-source Java application called ZooKeeper, is introduced and discussed here.


Finally, in chapter 6, the results previously presented are discussed. Based on those discussions, the research questions are answered and threats to the validity of the results are reviewed. The last chapter serves as a conclusion to the thesis and summarizes the results as well as the work that was done in order to fulfill the thesis objectives.


2 Background and Related Work

The scope of this thesis requires the application of topic modeling-based feature location techniques to documents that describe said features in textual form. Those documents consist of natural language text and changes made to the source code. This chapter will break down the high-level idea of feature location into granular steps that describe the state of the art in this field of research. Further, it will explain the function of those steps as well as their purpose in the context of the thesis.

Starting with an elaboration on what feature location is and how it can be used by developers and researchers to understand the structure of software projects, the principles that it is based on will be discussed. They include information retrieval, topic modeling with Dirichlet distributions, as well as text mining and cleaning.

2.1 Feature Location

Feature location is the application of natural language processing with the specific goal of locating artifacts in text documents that match a search query. In this section, feature location will be summarized in the context of locating features in source code while also touching on available tools that can perform this task. Furthermore, a listing of resources that can assist during the evaluation of feature location results is presented [FB19].

2.1.1 Definition and Taxonomy

The main purpose of refactoring is to remove technical debt from the codebase. It can be defined as a noun and as a verb respectively:

Refactoring (noun): a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behaviour.

Refactoring (verb): to restructure software by applying a series of refactorings without changing its observable behaviour.

In order to be able to execute refactorings and work with source code in general, developers need to gain a solid understanding of the codebase at hand. Feature location can be the first step of this process, as it is the act of identifying the source code entity or entities that implement a feature [CKK15]. It can be executed based on four types of analyses: dynamic, static, historical, and textual [CKK15], [Di13]. Loosely defined it can be described as a search engine for features where the results are methods, classes, or source files. According to a study on feature location in source code [Di13], dynamic analyses are conducted at a system's runtime, observing features during the execution of an application. The study finds

Investigating topic modeling techniques for historical feature location. Page 6 Chapter 2 Background and Related Work that they can be classified for gathering the execution trace of the application while the feature in question is invoked and then either compare them to traces where the feature was not invoked or perform frequency-based analyses. Since, given correct scenarios, features can be mapped by this, the study finds that they are a popular choice that only comes with the disadvantage of causing a significant overhead on the execution time. Static analyses are described by the same study as processes that examine structural information like dependencies in the code or the data flow in the application. While this tactic may be closely related to what a developer does when he tries to understand code manually, the study identified a high probability for false positives when this is automated. The textual approaches surveyed by the researchers of the study follow the idea of aggregating identifiers and comments which are connected to textual descriptions of domain knowledge. This text can then be used to search for a feature with a query in text form. To facilitate the implementation of such an analysis the study identified three techniques: pattern matching, in- formation retrieval, and natural language processing, which will be further discussed in chapter 2.2. Lastly, a feature location technique described as historical by the study relies on the mining of data from version control systems and thereby gathering relevant lines of code or artifacts re- lated to a feature. Recent approaches in research usually combine at least two of the techniques in an attempt to factor out the disadvantages of one with the advantages of another [Di13]. 
The combination this thesis will focus on is a historical analysis that also makes use of textual approaches to feature location through topic modeling which is, as discussed in chapter 2.3, an information retrieval technique.

2.1.2 Tools for Feature Location

Previous research has already produced some proofs-of-concept and production-ready tools for applying feature location to projects. This chapter provides a brief overview. For the two most relevant techniques, the historical and the textual analysis, the tools CVSSearch [CR01] and Hipikat [Cu05] enable historical feature location [Di13]. Others for textual feature location are mostly integrations of text search functions into integrated development environments like Eclipse; examples are Google Eclipse Search [Po06] and IRiSS [Po05] [Di13]. Beyond these, tools like TraceGraph [Lu94], STRADA [EBG07], and Featureous [OJ10] implement forms of dynamic feature location, and Ripples [CR01] and Suade [WR07] implement static feature location techniques.


2.1.3 Datasets for Benchmarking

Besides tools that enable feature location in a certain way, some researchers have also published datasets that are supposed to assist further research by providing a goldset of query-result mappings which can be used to evaluate the results of new techniques. Both [CKK15], [CDK20]¹ and [Ma18]² provide such datasets. Due to their relation to this thesis, their work is revisited in chapter 2.6.

2.2 Text Mining

As stated in the previous chapter, three techniques facilitate textual analysis for feature location: pattern matching, information retrieval, and natural language processing [Di13]. Those will be introduced in this chapter. In the technology survey [Di13], the authors describe pattern matching as a mere textual search using utilities such as grep³. It is described as relatively robust but not very precise since the chances of query terms matching words found in source code are relatively low. Further technologies commonly used include regular expressions, which match patterns consisting of one or more character literals, operators, or constructs [Mi21]. Regular expressions will be used as a way of pre-filtering data during the mining of text in this thesis. One example can be seen in the combination of Listing 1 and Listing 2.

(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])

Listing 1: Regular expression approximating the detection of Java method names.

/**
 * Add the specified scheme:auth information to this connection.
 */
public void addAuthInfo(String scheme, byte[] auth) {
            ↑ Method ↑
    cnxn.addAuthInfo(scheme, auth);
}

Listing 2: Example of pattern matching using a regular expression to detect a method name.
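Applied in code, the pattern from Listing 1 recovers the method name from the snippet in Listing 2. The following is a minimal sketch (Python is assumed here, as it is the language of the thesis toolkit; the demonstration itself is not part of that toolkit):

```python
import re

# Regex from Listing 1: approximates the detection of Java method names.
METHOD_RE = re.compile(
    r"(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])"
)

# Java snippet from Listing 2.
snippet = """
/**
 * Add the specified scheme:auth information to this connection.
 */
public void addAuthInfo(String scheme, byte[] auth) {
    cnxn.addAuthInfo(scheme, auth);
}
"""

# findall returns the single capturing group, i.e. the method name.
# The call inside the body does not match, as the pattern requires a
# modifier (or whitespace) followed by a return type before the name.
names = METHOD_RE.findall(snippet)
print(names)  # ['addAuthInfo']
```

Note that the pattern is only an approximation: it keys on modifiers, a return type, and a parameter list, so unusual formatting can still produce false negatives or positives.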

1 Available on GitHub: https://github.com/cscorley/changeset-feature-location/
2 Available at: https://lero.ie/research/datasets/feature_location/comparison
3 http://www.gnu.org/software/grep/ (accessed and verified on 10/05/2021)


The survey [Di13] further describes information retrieval techniques as statistical methods used to find code relevant to a feature based on identifiers and comments similar to a query. Technology-wise, the Latent Dirichlet Allocation (LDA) [BNJ03] is among the ones referenced by the survey. As a form of topic modeling that is based on Dirichlet distributions, LDA will be evaluated next to the related Pachinko Allocation [LM06] in this thesis. For its relevance in determining which parts of the source code match a query, Dirichlet distributions as well as LDA and the Pachinko Allocation are described further in chapters 2.4 and 2.5, after the analysis of a general survey of the state of the art in topic modeling in chapter 2.3.

Natural language processing (NLP) is a broad approach that can be used to analyze the parts of speech used in source code and a search query to match them more precisely than is possible through pattern matching, yet it is also more expensive [Di13]. Techniques associated with NLP, like the pre-filtering of search queries through the tokenization of sentences into words and the removal of stop words, will be used in this thesis.
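The tokenization and stop-word removal mentioned above can be illustrated in a few lines. The sketch below is hypothetical: the function name and the (deliberately tiny) stop-word list are invented for illustration and are not taken from the thesis toolkit:

```python
import re

# Illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "in", "for", "and", "this"}

def preprocess(query: str) -> list[str]:
    """Lowercase a query, split it on non-word characters, and drop stop words."""
    tokens = re.split(r"\W+", query.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("Add the auth info to this connection"))
# ['add', 'auth', 'info', 'connection']
```

The surviving tokens are the ones a topic model would actually match against its vocabulary; articles and prepositions carry no topical signal and would only add noise.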

2.3 Topic Modeling

Topic modeling as an information retrieval technique will play a major role in the execution of this thesis project. Therefore, the state of the art in topic modeling will be summarized in this chapter. Being a technique for unsupervised learning with a specialization in detecting topics in text documents, topic modeling has great potential within today's internet technologies [BB07]. It can, for example, be used to classify digitized documents in an archive which would otherwise require manual analysis to put them into groups for easy browsing. To implement this, multiple types of models exist. With the focus on LDA and the Pachinko Allocation, those will be explained in subsequent chapters.

Essentially, topic modeling can be described as probabilistic clustering for text documents. On a lower level, this means that the model can connect words with similar meanings into topic groups with the same context and distinguish between uses of words with multiple meanings [AA15]. A topic is therefore a distribution over a fixed vocabulary. In an example of clustering articles about different sciences, one cluster would contain words that are often used in genetics with a high probability while another one might contain words about evolutionary biology [Bl12]. Figure 1 shows this way of working with articles graphically, yet it is to be emphasized that the algorithms of LDA and the Pachinko Allocation have no information about the subjects of the topics they create, as there is no labeling involved [Bl12]. The resulting model can be used, for example, in a tool for the automated grouping of documents or as a search engine for user queries. The latter will be the use case applied in this thesis.
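The notion of a topic as a distribution over a fixed vocabulary, and its use as a search engine, can be made concrete with a toy example. The two topics and their word probabilities below are invented for illustration; a trained LDA or Pachinko Allocation model would learn such distributions from a corpus, and, as noted above, would leave them unlabeled:

```python
import math

# Invented word distributions for two unlabeled topics (illustrative only).
topics = {
    "topic_0": {"gene": 0.04, "dna": 0.03, "genome": 0.02, "data": 0.01},
    "topic_1": {"evolve": 0.03, "species": 0.03, "organism": 0.02, "data": 0.01},
}

def log_score(query_terms, topic):
    """Log-probability of the query terms under a topic; unseen words get a small floor."""
    return sum(math.log(topic.get(w, 1e-6)) for w in query_terms)

# A query is matched to the topic whose distribution best explains it.
query = ["dna", "genome"]
best = max(topics, key=lambda t: log_score(query, topics[t]))
print(best)  # topic_0
```

This is the essence of the search-engine use case: documents associated with the best-scoring topic become the search results for the query.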

Investigating topic modeling techniques for historical feature location. Page 9 Chapter 2 Background and Related Work

Figure 1: Illustration of topic modeling applied to an article using LDA [Bl12].

Besides the aforementioned Pachinko Allocation Model, research has produced many alternatives to the LDA model, which is the conceptual basis for a couple of them [LM06]. LDA is in turn based on the Dirichlet distributions discussed in the following chapter 2.4 [BNJ03]. Training of topic models is, in general, applied in batches, adding multiple sets of data to the model and then finalizing it, making it impossible to add more documents later on. This applies to LDA and the Pachinko Allocation Model as well, yet there is a move towards the application of online topic modeling. With those online variants, which are based on the classic models, documents can be appended to the corpus whilst having a usable model at the same time [BB07]. While this would be a useful addition to the tool developed in this thesis, allowing it to provide an always up-to-date search engine for features, implementations of such topic models in libraries are sparse and those that are publicly available are limited to LDA. Theories for those online variants are discussed and compared in papers like [BB07].

In contrast to clustering, topic modeling techniques do not generate distinct results assigning each document to one cluster (or topic), but to many. This is comparable to fuzzy clustering, a subcategory of classic clustering algorithms [Pr17]. In topic models like LDA and the Pachinko Allocation, this association with multiple clusters is presented in the form of Dirichlet distributions, which will be the topic of the next subchapter.


2.4 Dirichlet Distributions

Acquiring an in-depth understanding of Dirichlet distributions requires an extensive amount of knowledge in statistics. Such knowledge is not required to follow the argumentation and implementation presented in this thesis, yet a high-level understanding of what the Dirichlet distribution describes is useful and will be summarized in this chapter. To achieve this, the explanations of [FKG10] will be mapped onto the example of a three-sided die.

A six-sided die with each of the numbers "1", "2", and "3" appearing twice will be called a three-sided die and can be considered a probability mass function, just like an ordinary die [FKG10]. Such dice, manufactured manually, will not be uniformly weighted and therefore have a randomness of outcomes that is not considered fair. It is likely that a loaded die is produced, resulting in an unfair distribution of probabilities. The probability mass function θ for three-sided dice can be written as a vector in ℝ³ as shown below, where θ_fair symbolizes the (practically impossible) fair die and θ_loaded symbolizes an unfair die with which the value "3" is the most likely and "1" the least likely:

θ_fair = (1/3, 1/3, 1/3)
θ_loaded = (0.9/3, 1/3, 1.1/3)

Given a bag of 100 such handmade dice, the randomness of the probability mass function θ = (θ_1, θ_2, …, θ_k), in this case with k = 3, can be modeled with the Dirichlet distribution. The given prerequisites are that the outcomes of each side are greater than zero and in sum equal one:

θ_i ≥ 0 for i = 1, 2, …, k
Σ_{i=1}^{k} θ_i = 1

It is also required that a parameter α = (α_1, α_2, …, α_k) is defined with α_i > 0 for i = 1, 2, …, k.

With this, each probability mass function θ can be visualized as a point on a triangle in a three-dimensional Euclidean space. Now it is the function of the Dirichlet distribution to provide a probability distribution over this space. For this, the density function below is used:

p(θ | α) = (Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i)) · Π_{i=1}^{k} θ_i^(α_i − 1)

Combining these then results in density plots which have a different density center depending on the parameter α that was used.


Figure 2: Density plots for α = (1, 1, 1), α = (0.1, 0.1, 0.1), α = (10, 10, 10), and α = (2, 5, 15) [blue = low, red = high] [FKG10].

For a uniform α = (c, c, c) with c > 1, the density plot is as shown in Figure 2 in the bottom-left corner: there is a density center in the middle of the triangle. Interpreted through the example of the three-sided dice, this means that the majority of outcomes is close to θ_fair, suggesting an advanced and precise production technique. For a similarly uniform α with the special case c = 1, the top-left corner of Figure 2 applies. Here all outcomes are equally likely, producing all kinds of fair and unfair dice. An again uniform α with 0 < c < 1 produces dice that are less likely to be fair than unfair, with the amount of fairness decreasing as c approaches 0. Finally, for α being a vector that is not uniform, the center moves towards a corner of the triangle, which means that the dice produced will be loaded in a certain direction.

This Dirichlet distribution will play a major part in the setup of both the Latent Dirichlet Allocation and the Pachinko Allocation. Also, the prior α will be relevant as one of the hyperparameters that can be used to influence the accuracy of the models. Yet in current research, this is an uncommon practice, with nearly all researchers choosing simple symmetrical Dirichlet priors [SS18].
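The effect of the prior α can be illustrated with a short simulation. The following sketch (using NumPy, not part of the thesis tooling) samples three-sided "dice" from Dirichlet distributions with a large and a small uniform α and compares how far they land from the fair die on average:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mean_distance_to_fair(alpha, n_samples=2000):
    """Sample three-sided "dice" from Dirichlet(alpha) and return their
    average L1 distance to the fair die theta_fair = (1/3, 1/3, 1/3)."""
    samples = rng.dirichlet(alpha, size=n_samples)  # each row sums to 1
    return np.abs(samples - 1 / 3).sum(axis=1).mean()

# A large uniform alpha concentrates the density around the fair die
# (precise production), a small one pushes it toward the corners (loaded dice).
spread_precise = mean_distance_to_fair((10, 10, 10))
spread_sloppy = mean_distance_to_fair((0.1, 0.1, 0.1))
```

Here `spread_precise` comes out much smaller than `spread_sloppy`, mirroring the interpretation of the density plots above.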

2.5 Latent Dirichlet Allocation and Pachinko Allocation

Both the Latent Dirichlet Allocation and the Pachinko Allocation are composed of Dirichlet distributions. Since both will be actively used to create topic models in this thesis, current research into those techniques will be summarized in this chapter.

The Latent Dirichlet Allocation was first introduced in 2003 and has since been used in many research papers that applied topic modeling in all kinds of scenarios [BNJ03]. Some of them will be referenced later in this thesis for their relation to the topic. According to the initial paper, the predecessors of LDA include the tf-idf scheme by [SM83], Latent Semantic Indexing (LSI) [De90], and probabilistic Latent Semantic Indexing (pLSI) as an extension of LSI [Ho99], [BNJ03]. The reason LDA was introduced in addition to those solutions was to provide a solution that works under the so-called "bag-of-words assumption", which considers the order of words in a document as irrelevant [Bl12], [BNJ03]. tf-idf is simply based on counting words and terms in documents, from which it creates a term-by-document matrix, effectively reducing documents of variable length to fixed-length lists of numbers [BNJ03]. LSI and pLSI extend on this but come with problems of their own, as discussed in [BNJ03].

Figure 3: Visual abstraction of the relationship between documents and words in LDA.


Finally, LDA defines hidden topics to capture latent semantics in text documents [Li15]. With LDA, each document is represented by a distribution over unobserved (latent) topics, with each topic being described by a distribution over words [Li15], [BNJ03]. In this explanation, the fact that the topics are unobserved means that they do not carry labels that would describe them, but rather that they are a collection of the proportions of their contents. With this, the order of words is irrelevant, which is further provable through the so-called De Finetti's representation theorem [BNJ03]. A visual abstraction of this relationship can be seen in Figure 3.

While introducing their new model, the authors of the Pachinko Allocation compare their idea with the Latent Dirichlet Allocation through a visualization shown in Figure 4 that is comparable to the one in Figure 3. The directed graphs of both LDA and the Pachinko Allocation can be seen consisting of a node for sampling documents (r), interior nodes representing the topics made from Dirichlet distributions (S), and leaves representing words (V) [LM06]. The difference lies in the number of interior layers, of which the Pachinko Allocation has two compared to one. This is so that the Pachinko Allocation can detect correlations between topics, for which LDA has no means of identification [LM06]. While even more layers are possible, this so-called four-level Pachinko Allocation Model was the one described in the initial paper [LM06]. In summary, the motivation for the Pachinko Allocation was to create accurate models that can discover large numbers of fine-grained topics and the correlations between those topics.

The paper defining the Pachinko Allocation further measures it against other models, including a comparison of classification accuracy with the Latent Dirichlet Allocation. This is visualized in Table 1. The results show that the Pachinko Allocation has consistently better accuracy than LDA.

Figure 4: Model structures of LDA and a four-level Pachinko Allocation [LM06].
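The document-topic and topic-word structure that LDA assumes can be illustrated with a minimal sketch of its generative process. Vocabulary, topic count, and parameter values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Made-up vocabulary and sizes, chosen only to illustrate the structure.
vocab = ["commit", "merge", "branch", "render", "pixel", "color"]
n_topics, doc_len = 2, 50
alpha, eta = 0.5, 0.5

# Each topic is a distribution over the fixed vocabulary: phi_k ~ Dirichlet(eta).
phi = rng.dirichlet([eta] * len(vocab), size=n_topics)

# Each document is a distribution over the latent topics: theta ~ Dirichlet(alpha).
theta = rng.dirichlet([alpha] * n_topics)

# Generate a bag of words: draw a topic per word, then a word from that topic.
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)            # latent, unlabeled topic assignment
    words.append(str(rng.choice(vocab, p=phi[z])))
```

Note that the topic index `z` never carries a label, matching the point above that the topics themselves are unobserved.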


class Number of docs LDA PAM

graphics 234 83.95 86.83

os 239 81.59 84.10

pc 245 83.67 88.16

mac 239 86.61 89.54

windows.x 243 88.07 92.20

total 1209 84.70 87.34

Table 1: Document classification accuracies [LM06].

Tests conducted in the paper further show that the Pachinko Allocation can handle a higher number of topics than LDA, all while maintaining a higher likelihood. The same applies to a larger amount of training data, making it an alternative to consider when working with topic modeling. The corresponding graphs are shown in Figure 5.

Figure 5: Comparison between LDA, Pachinko Allocation (PAM), and two other techniques. Left: Pachinko performs better with a higher number of topics; Right: Pachinko performs better with more training data [LM06].

2.6 Related Work

Previous research related to this thesis in the field of feature location in source code has been studied. It will regularly serve as a basis for argumentation in this thesis. Most notably, two groups of papers were influential: [Co15], [CDK20] as well as [CEB15], [CEB17]. They will be discussed at the end of this chapter.


Further, [Ma18] has been working in the same field through the preparation of a benchmark for ArgoUML⁴, specifically for ArgoUML SPL⁵ (a version of the software that is being re-engineered into a Software Product Line). For this, the location of features was manually added as annotations to the source code with pre-processor directives. The paper does not apply any form of automated topic modeling itself though, as it is just proposing a way to evaluate feature location techniques.

The aforementioned papers [CEB15], [CEB17] go beyond the theoretical planning of an analysis by executing feature location on eight subject systems and 600 features. With this, they also introduce a feature location technique called ACIR (Aggregate Changeset descriptions for Information Retrieval), which works on a similar search corpus as the technique used in this thesis. The conclusions drawn from the analysis are summarized as listed below [CEB17]:

- A granularity on method level reduces the search effort.
- The use of more recent changesets improves search efficiency.
- Aggregation of recent changesets grouped by change request decreases effectiveness.
- Naive, text-classification-based filtering of "management" changesets also decreases the effectiveness.

Other than [Ma18], with the work of [CEB17] an automatic process was established that maps features to the correct places in the source code. The following approaches for this mapping, based on the bug identification number (bugID) found in an issue tracking system and information available in the version control system, are discussed in the paper [CEB17]:

- Matching the bugIDs to IDs found in changeset descriptions.
- Matching the bugIDs to IDs found in change requests.
- Matching the bugIDs to IDs found in patches added to change requests.

After an evaluation, the idea of matching with the changeset descriptions was adopted for the creation of a search corpus, which agrees with the path chosen in this thesis. The feature location search, on the other hand, differs from the technique used in this thesis, as it uses the open-source search software Apache Lucene as an information retrieval technique. The papers' relevance for this thesis is therefore mostly limited to their data gathering and search corpus generation techniques.

Among the papers mentioned, the ones by [CKK15], [CDK20] have been the most helpful as an argumentation basis for this thesis. The papers describe in detail how feature location in

4 ArgoUML: https://argouml-tigris-org.github.io/

5 ArgoUML SPL: https://github.com/marcusvnac/argouml-spl

source code can be applied whilst using an online version of the Latent Dirichlet Allocation on a total of 14 open-source Java projects. Among other results, the research questions asked firstly lead to a comparison between changeset- and snapshot-based feature location. Further, an analysis of the so-called time invariance assumption and of the advantages of different kinds of corpus compositions is performed.

The comparison between changeset- and snapshot-based feature location relates to the data that the corpus generation is executed on. The major difference between them is that for snapshot-based techniques the documents are structural program elements, while in changeset-based approaches the documents are the changesets [CDK20]. This means the granularity of change between two consecutive documents is much smaller for changesets. For the application of LDA on this corpus, the consequence is that frequently changed parts of the source code will turn up more often in the search results. For the application of the aforementioned online version of LDA, this means that the topic model does not have to be retrained from scratch, as new commits can simply be added. Limited by the availability of implementations for online variants of the Pachinko Allocation, the advantage of re-trainability through online LDA was put aside in favor of a uniform and comparable approach in this thesis.

Whilst introducing changesets for the training of topic models, [CDK20] identified the so-called time invariance assumption as a problematic part of the snapshot-based approach. The assumption states that benchmarks based on snapshots may include misleading results, as the source code artifacts in a snapshot that are mapped to the description of a maintenance task may have been changed again after the task was finished but before the snapshot was taken. Here the granularity of the changeset approach is a considerable advantage.

In terms of corpus compositions, [CDK20] considered four categories of data, besides the description of issues similarly mapped to the commits as in [CEB17]:

- Added source code lines (A)
- Context source code lines (C)
- Commit messages (M)
- Removed source code lines (R)

While [CDK20] argues that the inclusion of added lines (A) and the commit message (M) is intuitively necessary, the question is asked whether the other two are relevant. By testing out all combinations on the subject systems, the paper arrives at the result that the optimal combination is (A, C, M). Yet for their research they note that they ended up using (A, R, C).
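Assembling a training document from such components can be sketched as follows; the changeset record and its field names are hypothetical and not the actual data format of [CDK20]:

```python
# Hypothetical changeset record; the keys mirror the four corpus
# categories above but are illustrative, not the format of [CDK20].
changeset = {
    "A": ["public int count()", "return items.size();"],   # added lines
    "C": ["public class Inventory {"],                      # context lines
    "M": ["Fix item counter returning zero"],               # commit message
    "R": ["return 0;"],                                     # removed lines
}

def compose_document(changeset, parts=("A", "C", "M")):
    """Concatenate the selected corpus components into one training document."""
    tokens = []
    for part in parts:
        for line in changeset.get(part, []):
            tokens.extend(line.split())
    return tokens

doc_acm = compose_document(changeset, ("A", "C", "M"))   # reported optimum
doc_arc = compose_document(changeset, ("A", "R", "C"))   # combination actually used
```

Varying the `parts` tuple corresponds directly to testing the different corpus compositions discussed above.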


3 Methods

This chapter will elaborate on the methodologies applied during the research phase of the thesis. It will describe through which ideas the solution was reached, how the tests are designed, and how the evaluation of results is structured.

3.1 Overview

The basis for the research phase is the corpus used to train the topic models. This process is visualized for the example of a class-level granularity in Figure 6. It essentially performs a merging of information from two sources: the history of the source code and the issue descriptions written in natural language.

Figure 6: Corpus generation from issue descriptions and changesets on class level.

Each changeset consists of multiple commits that are, in the case of Git being used, aggregated into requests that are then applied to the main branch of the repository. When using issue tracking systems like Jira, those changesets are associated with an issue description through an ID that is written inside the message field of the merge request. From this, the documents that serve as input for the topic modeling are created. Detailed steps for obtaining the data and preparing it for usage are elaborated in chapter 3.2.
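The linking step can be sketched with a small regular expression; the Jira-style issue-key pattern and the commit messages below are illustrative assumptions, not the exact format used by the thesis tooling:

```python
import re

# Assumes Jira-style issue keys such as "PROJ-123" inside merge-commit
# messages; both the pattern and the example messages are made up.
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")

def extract_issue_ids(message):
    """Return all issue keys referenced in a changeset message."""
    return ISSUE_KEY.findall(message)

messages = [
    "PROJ-4825: Separate MapredWork into MapWork and ReduceWork (merge)",
    "Refactor build scripts",               # no linked issue description
    "Follow-up for PROJ-4825 and PROJ-5003",
]
links = {msg: extract_issue_ids(msg) for msg in messages}
```

Changesets whose messages yield no key (like the second example) simply remain unlinked to any issue description.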


Finally, the pre-processed textual data included in a document consists of:

- Added lines of code
- Unchanged lines of code
- Issue description

This information is further linked to the class that it was extracted from, so that the search engine based on the topic modeling can display the class as a result. Methods used to train and optimize the topic models are further explained in chapter 3.3, while the ones applied to create a search engine based on the models are the topic of chapter 3.4.

3.2 Data Preparation

During the preparation and execution of topic modeling-based feature location, multiple steps are taken to read and transform data. The methods applied during those steps are explained in the following subchapters.

3.2.1 Data Mining

Implementation details on the data mining part of the tool, which is responsible for the import of version control and issue tracking data, are explained in chapter 4.3. Further, the extraction of issue IDs from changesets and the subsequent linking of source code artifacts to issue descriptions are also the topic of those subchapters in chapter 4. All those steps rely heavily on textual filtering based on regular expressions that match structured text with patterns and extract the required tokens.

3.2.2 Text Cleaning

Methods applied in terms of text cleaning fall into the category of removing stop words from both English text and Java source code, as well as a removal of the most common words as part of the LDA and Pachinko Allocation training. While the latter is a simple parameter that can be set in the implementation of the topic modeling techniques used in this thesis, the removal of stop words is a static part of the analysis that is applied on multiple occasions and datapoints. While predefined lists of words exist to filter stop words out of English text, a custom list was used for this thesis. This list originates from the data in the Corley goldset and was chosen to improve the comparability of results with the papers [CKK15], [CDK20] later on in the evaluation. In a similar fashion, the stop words for the Java programming language also originate from the Corley goldset and include common identifiers like "abstract", "interface", or "true" and "false".

Investigating topic modeling techniques for historical feature location. Page 19 Chapter 3 Methods

To complete the text cleaning, other symbolic characters, digits, and whitespaces are also removed from all text inputs, which include issue descriptions, source code artifacts, as well as search queries.
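A minimal sketch of this cleaning pipeline could look as follows; the short stop-word lists here merely stand in for the Corley goldset lists used in the thesis:

```python
import re

# Illustrative stop word lists; the thesis uses lists derived from the
# Corley goldset instead.
ENGLISH_STOPS = {"the", "a", "is", "to", "of"}
JAVA_STOPS = {"abstract", "interface", "true", "false", "public", "return"}

def clean(text):
    """Strip symbols and digits, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # remove symbolic characters and digits
    tokens = text.lower().split()              # tokenization into words
    return [t for t in tokens if t not in ENGLISH_STOPS | JAVA_STOPS]

tokens = clean("public abstract int getCount() { return count2; }")
# tokens == ["int", "getcount", "count"]
```

The same function can be applied uniformly to issue descriptions, source code artifacts, and search queries.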

3.3 Topic Modeling

Topic modeling with LDA and the Pachinko Allocation is a process that requires the optimization of parameters in order to recognize connections between documents and the words contained in them. This chapter will summarize the available parameters and elaborate on the methods used during optimization.

3.3.1 Topic Model Parameters

Both topic modeling techniques, LDA and the Pachinko Allocation, have a similar setup as described in chapter 2.5. The output, as well as the parameters and the options for tuning the results through the variation of those parameters, are similar for both techniques. This chapter will summarize the relevant settings implemented in the library used to execute the topic modeling techniques before introducing the method of hyperparameter tuning.

The number of clusters is the most obvious parameter through which the model can directly be influenced to return better results. Here the available parameters differ between LDA and the Pachinko Allocation: LDA has the option to vary one layer of clusters (k), while the Pachinko Allocation has, in its most common form, two layers (k_1, k_2).

The prior parameters are given through the Dirichlet distributions, which make up the connections between nodes of both LDA and PA. To create clusters at the extremes of the space, a vector with numbers 0 < x < 1 must be chosen to get distinct cluster concentrations. Yet in previous research, this option has rarely been used [SS18]. In implementations of LDA and the Pachinko Allocation, there are two such parameters: α and η (alpha and eta, shown in Figure 7). The parameter α is responsible for the concentration of topics and the parameter η is responsible for the concentration of words within the topics [SS18]. Since the Pachinko Allocation has two layers of topics, α is supplemented by α_sub for the second layer.

Further, the removal of words that are too common or too uncommon can be facilitated by the interfaces of LDA and Pachinko Allocation implementations. The output of both techniques, when using a document as the input into the model, is a topic distribution. Methods applied to this in order to create a search engine are discussed in chapter 3.4 under the Search Query section.


Figure 7: LDA represented as a graphical model with hidden parameters and the observed parameters α and η [SS18].

3.3.2 Hyperparameter Tuning

Hyperparameter tuning is the practice of varying a model's parameters to achieve better results. In principle, all parameters mentioned in the previous section can be used for this, for example by performing a grid search or simply by testing out random values [CCC10]. Commonly in topic modeling-based research, the variation of the topic number is the one that most attention is given to, while the others are usually just chosen based on best practices and previous experience [SS18].

The method applied in this thesis goes beyond this common approach, yet also starts by focusing on finding the number of clusters that is best for the given task of locating features in the source code. Once a good number of clusters has been found, this configuration is used to tune further parameters. This way, the considerably high run time of tuning multiple parameters at once is avoided. Parameters tuned aside from the number of clusters are the previously discussed Dirichlet parameters alpha and eta, as well as the number of iterations over the training data and the burn-in factor for optimizing initial parameters. Following best practices, the logarithmic likelihood is used as an indicator of optimization progress, with a high logarithmic likelihood representing a good performance of the model at hand [SS18], [BM14].
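The staged tuning procedure can be sketched as below. The scoring function is a stand-in, as a real run would train an LDA or Pachinko Allocation model with the given settings and return its logarithmic likelihood; the grids and the shape of the stand-in function are made up:

```python
import itertools

def score(k, alpha, eta):
    """Stand-in for training a model and measuring its logarithmic
    likelihood; illustrative only (peaks near k=120, alpha=0.1, eta=0.01)."""
    return -((k - 120) ** 2) / 500 - (alpha - 0.1) ** 2 - (eta - 0.01) ** 2

# Stage 1: sweep only the number of topics with fixed default priors.
k_grid = [50, 100, 150, 200, 300]
best_k = max(k_grid, key=lambda k: score(k, alpha=0.1, eta=0.01))

# Stage 2: keep the best k fixed and grid-search the Dirichlet priors,
# avoiding the run time of varying all parameters at once.
best_alpha, best_eta = max(
    itertools.product([0.01, 0.1, 0.5], [0.001, 0.01, 0.1]),
    key=lambda priors: score(best_k, *priors),
)
```

Splitting the search into two stages reduces the number of trained models from the full grid product to the sum of the two grid sizes.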

3.4 Feature Location

Previously explained methods have already teased those applied in the context of feature location. Accordingly, this chapter summarizes those methods. It elaborates on tactics for finding matches between topic distributions in order to match a search query to its related documents, on metrics for measuring the accuracy of said matches, and on goldset-based validation in general.


3.4.1 Search Query

Given a search query, the model inference is applied through the tomotopy library. The result is a topic distribution with either one or two levels, depending on whether the model is an LDA model or a Pachinko Allocation model. With this document-specific topic distribution, the association to other documents can be calculated either by determining the most likely topic of the query and sorting the other documents accordingly, or by calculating the distance between topic distributions. For the latter, the mean deviation can be used as described by the formula below:

(1/n) · Σ_{i=1}^{n} |x_i − y_i|

In this formula, x_i is the probability of a document to be in topic i and y_i is the probability of the document associated with the search query to be in the same topic. Once calculated for all existing documents, the list of possible matches can be sorted by this score.
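A sketch of this ranking step, using hypothetical topic distributions for a query and three candidate classes, could look as follows:

```python
def topic_distance(x, y):
    """Mean absolute deviation between two topic distributions:
    (1/n) * sum over i of |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical inferred topic distributions for a query and three classes.
query = [0.7, 0.2, 0.1]
documents = {
    "Parser.java": [0.6, 0.3, 0.1],
    "Renderer.java": [0.1, 0.1, 0.8],
    "Lexer.java": [0.65, 0.25, 0.1],
}

# Sort the candidate documents by their distance to the query distribution.
ranking = sorted(documents, key=lambda d: topic_distance(query, documents[d]))
# ranking == ["Lexer.java", "Parser.java", "Renderer.java"]
```

The document with the smallest deviation from the query distribution appears first in the result list.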

3.4.2 Performance Metrics

For evaluating the effectiveness of feature location results, previous papers use the mean reciprocal rank (MRR) metric, which is applied to the output of a search [CKK15], [CDK20]. It essentially calculates a score based on the position of the first relevant element in a list of search results. Its formula is shown below [CKK15]:

MRR = (1/|Q|) · Σ_{i=1}^{|Q|} (1 / rank_i)

For example, if the first relevant results for three queries appear at ranks 1, 2, and 4, the MRR is (1 + 1/2 + 1/4) / 3 ≈ 0.58. One major property of the mean reciprocal rank is its preference for highly relevant results over marginally relevant results. This will be an issue that is discussed in chapter 6.2. Arguments for focusing on this metric include that it is a preferred solution in previous research when there are only a few or a single relevant document per search query, as is the case with the project and goldset used for investigation in this thesis [CDK20].

Further performance metrics used to evaluate and discuss results are the mean and median rank as well as the Wilcoxon signed-rank test. While the mean and the median rank are simple metrics to directly judge the outcome, the outcome of a Wilcoxon signed-rank test requires some explanation. This is because it does not judge the results themselves but their significance in a comparison between two techniques, or between two matched pairs of datasets in general. Its T-value can be calculated with the following formula:


T⁻ = Σ_{i=1}^{N⁻} R_i⁻   and   T⁺ = Σ_{i=1}^{N⁺} R_i⁺   and   T = min(T⁻, T⁺)

In this formula, R_i⁻ and R_i⁺ are the signed ranks of the differences between the results of B and A, and N⁻ and N⁺ are the corresponding numbers of differences ≠ 0 [Ré08]. The signed ranks can be computed by ordering the differences by their absolute value in increasing order and then assigning rank values [CMS09]. In the case of ties between values, the average rank is assigned to all of them [CMS09].
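The ranking procedure and the T statistic, together with the normal approximation discussed in the remainder of this chapter, can be sketched as follows (an illustrative implementation, not the thesis tooling; the example differences are made up):

```python
import math

def wilcoxon_signed_rank(differences):
    """Compute T, Z, and the two-sided p-value for paired differences,
    following the normal approximation used in this chapter (sketch)."""
    diffs = [d for d in differences if d != 0]      # keep only differences != 0
    n = len(diffs)
    # Order the absolute differences increasingly and assign ranks,
    # giving tied values the average rank of their group.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t = min(t_minus, t_plus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt((2 * n + 1) * mu / 6)
    z = (mu - t - 0.5) / sigma                       # continuity-corrected Z-score
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |Z|
    p = 2 * (1 - cdf)
    return t, z, p

# Hypothetical rank differences between two techniques B and A on ten queries.
t, z, p = wilcoxon_signed_rank([5, 3, -1, 4, 2, 6, -2, 7, 8, 9])
```

For this made-up sample, Z exceeds 1.96, so the difference would be considered significant at α = 0.05.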

Given T, the significance of the difference between B and A can either be determined by comparison with an existing probability table or by manual calculation of the Z-score for the configuration at hand [Ré08]. A manual calculation can be performed with the formulas below [Ré08]:

Z = (μ − T − 0.5) / σ   with   μ = n(n + 1) / 4   and   σ = √((2n + 1)μ / 6)

Now, for a Z-score larger than 1.96, the difference is significant at a significance level of α = 0.05. This value is known from other tests like Student's t-test [CMS09], [Ré08]. Z-scores that are smaller would mean that the difference between B and A is not significant [Ré08].

If the values of the ranks are not too different, the Z-score can be used in the context of a normal distribution to calculate the p-value [Ra10b]. With this normal distribution, the Z-score can be found on the x-axis and associated p-values can be read from the y-axis [Ra10a]. To find the cumulative probability of the Z-score being outside the area in which we assume that differences are insignificant (p < α), the cumulative distribution function can be calculated for both sides of the normal distribution, resulting in the desired probability (p-value):

p = 2 · (1 − Φ(|Z|))

In this formula, the function Φ is the cumulative distribution function of the standard normal distribution. With the p-value, a direct comparison to the significance level α can be made. If p is smaller than α, the difference is significant. Otherwise, such a conclusion about the significance is not supported by the data.

3.4.3 Goldset-based Validation

Aside from performance metrics, the method of validation used in this thesis is based on goldsets provided by previous research. This means the data used as input for the training of topic models is not split into a train and a test dataset, as is common in machine learning; instead, all the data that is verifiable through the goldset is used. The goldset then consists of test data that was provided together with labels, or in a search engine context together with expected results, so that the performance metrics can be calculated to compare approaches with each other. In the context of this thesis, two viable goldsets are available, of which one will be chosen in later chapters to enable the validation [Co15], [Le18].


4 Implementation

This chapter of the master thesis is dedicated to the implementation of the methods described in chapter 3. It furthermore describes the architecture of the tools that result from the combination of those methods and the environment in which they were tested.

4.1 Goals and Constraints

The implementation is required to fulfill two high-level purposes: the import of data from the relevant data sources and the execution and evaluation of feature location on top of this aggregated data.

Goals identified for the import functionality are derived from the idea that the data aggregation should be automated and reusable in a stable way. The implementation that provides the feature location will, on the other hand, be used to experiment with different kinds of topic modeling techniques and potentially frameworks and is, therefore, more volatile. It is expected to change regularly during development, evaluation, and validation, and may never reach a final state. Because of this, it was decided to split the project into two separate applications that share a database for the exchange of data. Specific goals therefore distinguish between the importer application and the feature-location application. The aforementioned goal of reusability for the importer can be separated into the following concrete requirements:

(R1): The importer must gather data from the changesets of a version control system and issue descriptions from an issue tracking system.

(R2): The adapters integrating the version control and issue tracking systems into the application must be designed in a modular way in order to make it possible to extend the set of supported systems later on.

(R3): The changesets must contain information on which parts of files were changed on different levels of granularity, for example on class or method level. Therefore, a modular system enabling the extension of supported programming languages needs to be implemented.

(R4): Features and changesets must be saved in a way that links between them can be established after the successful import.

(R5): All data must be made available to other applications through a data storage that is accessible with industry-standard techniques.


The feature-location application must be seen more as a prototype or evaluation tool than a production-ready application. The applicable set of requirements for conducting feature location on any codebase is listed below:

(R6): The feature-location application must be able to read data stored by the importer.

(R7): Text must be pre-processed following best practices to create a document for each changeset. Those documents must include the feature descriptions as well as the modified source code itself.

(R8): The execution of the training process for a topic model must use available libraries and persist trained models as well as corpus lookup tables for later use.

(R9): The evaluation of user queries against trained models must be implemented through a command-line interface with results printable on the console or to output files.

(R10): Goldsets from at least one previous research paper must be evaluated against newly trained models, and the results generated from this must be automatically persisted as a list of search results for each goldset query.

(R11): The results of evaluated goldset queries must be validated against the expected results of the goldset by calculating a score value that makes a comparison of success possible between models.

The development of the applications is constrained by factors related to dependencies on external data sources and technical dependencies on third-party tools. The feature-location application is dependent on having access to a corpus of feature descriptions that are mapped to classes or methods contained in changesets. This kind of data can be found in source code version control systems like Git or Subversion and issue tracking applications like GitHub or Jira. Due to time constraints though, the importer can only implement interfaces to a few of those systems. Details can be found in the solution strategy of chapter 4.3.1.

The feature-location application depends on libraries providing functionality to perform topic modeling. Between Java and Python, which both have comparable libraries available, Python was selected as the programming language. Details are discussed in subchapter 4.2.

4.2 General Solution Strategy

For the development of both the importer and the feature-location application decisions con- cerning the general development strategy had to be made upfront before the implementation could begin.


A crucial part of this is the decision on which programming language is to be used. This largely depends on the evaluation of libraries supporting the methods discussed in chapter 3. The feature-location application requires special libraries for the execution of topic modeling techniques like the Latent Dirichlet Allocation and the Pachinko Allocation described in chapter 2.5. Since, as previously mentioned in the corresponding chapters, there are only a handful of libraries supporting the Pachinko Allocation, the choice of programming languages is severely limited. A comparison of popular topic modeling libraries is listed in Table 2.

Library     Supported Topic Models                  Interface languages

Tomotopy    ✔ LDA (multiple variations)             Python 3 (v0.11.1)⁶
            ✔ Pachinko Allocation
            ➖ Others

MALLET      ✔ LDA (multiple variations)             Java (v2.0.8)⁷
            ✔ Pachinko Allocation (hierarchical)
            ➖ Others

Gensim      ✔ LDA                                   Python 3 (v4.0.0)⁸
            ❌ Pachinko Allocation
            ➖ Others

Top2Vec     ❌ LDA (multiple variations)             Python 3 (v1.0.24)⁹
            ❌ Pachinko Allocation (hierarchical)
            ➖ Top2Vec algorithm

Table 2: Overview of Topic Modeling Libraries.

Gensim and Top2Vec are not suitable for the task of comparing LDA to the Pachinko Allocation since their support for these techniques is limited or not available at all. Yet both are often referred to in literature and online tutorials [CKK15], [SS18]. The two options that can be considered suitable are Tomotopy and MALLET, since both of them support at least one variant of LDA and Pachinko Allocation. To arrive at a decision, arguments for and against those libraries

6 Tomotopy v.0.11.1, documentation available at https://bab2min.github.io/tomotopy/v0.11.1/en/
7 MALLET v.2.0.8, available at http://mallet.cs.umass.edu/topics.php
8 Gensim v.4.0.0, documentation available at https://radimrehurek.com/gensim_4.0.0/
9 Top2Vec v.1.0.24, available at https://github.com/ddangelov/Top2Vec/tree/1.0.24

can be based on the degree of detail provided in their documentation and on features of the programming language that they provide an interface to.

The MALLET library provides a quick-start guide laying out the basic use of the command-line interface. Since it is written in Java, it also provides a Java API which is documented through an automated documentation tool. The full list of Topic Modeling techniques supported by MALLET is listed below:

- LDA
- Parallel LDA
- DMR LDA
- Hierarchical LDA
- Labeled LDA
- Polylingual Topic Model
- Hierarchical Pachinko Allocation Model (PAM)
- Weighted Topic Model
- LDA with integrated phrase discovery
- Embeddings using skip-gram with negative sampling

Tomotopy does not come with a command-line interface but instead provides an API for Python 3. While this documentation was found to be incomplete in some minor instances, its important functions and parameters are sufficiently described. The full list of supported techniques is listed below.

- Latent Dirichlet Allocation
- Labeled LDA
- Partially Labeled LDA
- Supervised LDA
- Dirichlet Multinomial Regression
- Generalized Dirichlet Multinomial Regression
- Hierarchical Dirichlet Process
- Hierarchical LDA
- Multi-Grain LDA
- Pachinko Allocation
- Hierarchical PA
- Correlated Topic Model
- Dynamic Topic Model
- Pseudo-document-based Topic Model


Since Tomotopy has better documentation and supports more variations of the relevant Topic Modeling techniques, it was chosen as the starting point for the feature-location application, thereby requiring Python 3 as the programming language for the project. The importer application has no direct requirement for library functionality that is not built into most common programming languages. The choice of an implementation language therefore follows the one made for the feature-location application. As both applications are based on Python 3, a monorepo was chosen as the development environment, allowing the project to be contained in one common version control system and leaving the option for code sharing open. This repository is publicly hosted on GitHub10 and loosely follows the GitHub Flow branching strategy.

4.3 Importer Application

This chapter provides implementation details of the importer application. After an introduction, the context, scope, and strategy of the application are discussed in subchapter 4.3.1, followed by a detailed description of the architecture of the tool in subchapter 4.3.2.

4.3.1 Context, Scope, and Solution Strategy

Following the prerequisites of previous chapters, the task of the importer application is data aggregation from version control systems and issue trackers (R1). For the scope of this thesis, it was decided to put the focus on one version control system and one issue tracker as data sources, with the option to include more in the future. In order to be able to use the previously mentioned goldsets for the validation of feature location results, those data sources had to match the ones used by at least one software project that was analyzed in previous research and has results saved in a goldset. As mentioned in chapter 2.1.3, two such goldsets have been identified: "Modeling Changeset Topics for feature location" [Co15] based on papers [CKK15], [CDK20] and "Empirical Assessment of Baseline Feature Location Techniques" [Le18] based on the paper [Ma18]. The table in appendix A.1 gives an overview of the projects contained in those goldsets and contains information on which project uses which version control and issue tracking system. It can therefore be used as a basis for choosing one of each of those systems for integration into the importer. A summary of the relevant information can be found in Table 3 below. The data shows that most projects from the two goldsets primarily use Git as a version control system. Choosing Git as a data source comes with the advantage that it is a decentralized system and therefore does not require a constant connection to specific hosting services. Instead, the

10 Repository of the source code on GitHub: https://github.com/l-schulte/cdbs_feature_location

Investigating topic modeling techniques for historical feature location. Page 29 Chapter 4 Implementation

VCS         Hosting       Issue Tracking
Git 18      GitHub 17     Jira 12
SVN 2       SF 1          GitHub 5
            Custom 2      Bugzilla 2
                          SF 1

Table 3: Composition of projects in the goldsets.

well-documented and commonly used Git command-line interface can be used to load a copy of the repository from any hosted Git instance and read the required data. The approach is therefore not limited to GitHub or any other hosting provider and can, for example, also work with GitLab or a custom Git-compatible solution. On this basis, it was decided that the solution strategy should include an import of Git repositories through the Git CLI.

The choice of Jira as the first issue tracking solution to integrate into the importer was again made based on the data in Table 3 and Appendix A.1. Since 12 of the 18 projects that use Git as a version control system track issues on a Jira instance, it is the choice that gives the tool the highest flexibility, theoretically allowing the evaluation phase of this thesis to validate against more than half of the available goldsets.

To be able to run the import multiple times without requiring the manual setup of a runtime environment with compatible installations of Git, Python, and Python libraries, it was further decided to containerize the importer application with Docker.

For this thesis, it is not in the scope of the importer to periodically update the data imported from Git, yet persistent storage for the data is required that allows other applications to access it through industry-standard solutions (R5). This storage must be available outside of the environment of the importer, since other applications may run on dedicated machines and should not have to provide and run an importer instance themselves. In order to implement this strategy, a globally available database instance must be set up. Technology-wise, the use of a relational database was evaluated against the use of a non-relational database like MongoDB.
MongoDB was chosen since the data structures that need to be stored are strongly coupled: for example, changes to specific files can be stored in the same document as the file information itself. To ensure availability from outside the local network, a free tier of MongoDB Atlas was chosen as the hosting provider for the database instance. To further allow the import of multiple repositories into the same MongoDB instance, each repository is stored in its own database.
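The one-database-per-repository scheme can be sketched as follows. The naming function below is purely illustrative (the actual importer may derive database names differently); it only demonstrates the design decision of isolating each imported repository.

```python
import re

def database_name(repo_url):
    """Derive a MongoDB database name from a repository URL so that each
    imported repository lands in its own database.

    Illustrative scheme only -- the real importer may name databases
    differently. MongoDB database names may not contain characters such
    as '/', '\\', '.', or spaces, so anything unexpected is replaced.
    """
    name = repo_url.rstrip('/').rsplit('/', 1)[-1]   # last URL segment
    name = re.sub(r'\.git$', '', name)               # drop a trailing .git
    return re.sub(r'[^A-Za-z0-9_-]', '_', name)
```

With pymongo, such a name would simply be used as `client[database_name(url)]` to obtain the per-repository database handle.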


Figure 8: Importer application context.

The overall context of the importer can be visualized as shown in Figure 8. It consists of three modules, of which one is the central system module (dark green) and two have a specific responsibility and are accessed by the system module (light green). Further internal and external systems (white) that are used by those modules are also displayed. In Table 4 those components are listed again with a description of their purpose.

Component Description

Importer App The starting point of the Python application. The main app is responsible for initiating processes and for passing on data between them and the data storage.

MongoDB A globally available instance of the non-relational database MongoDB hosted on MongoDB Atlas.

Git Import Python module of the importer application responsible for handling Git. It is the task of this module to trigger commands on the Git CLI that pull data from the Git server, as well as to initiate commands reading the Git CLI output. It thereby crawls and interprets the repository's commit history.

Git CLI An instance of the Git CLI installed on the host operating system. It is accessed by the Git Import Python module.

Git Server External component hosted on a computer that is available publicly on the internet or locally on a network.

Jira Import Python module of the importer application responsible for handling the import of issues from a Jira instance, which it accesses through the corre- sponding REST-API.

Jira Server External component hosted on a computer that is available publicly on the internet or locally on a network.

Table 4: Description of components in the context of the importer application.


4.3.2 Building Block View

In accordance with the solution strategy of the previous sub-chapter and the thereby defined logical structure of the importer application, the implementation specifics will be discussed in this chapter. As displayed in Figure 9, the building blocks of the application are split into three high-level modules: the main app.py, git, and issues.

Figure 9: UML diagram of the building blocks in the importer application.

In the following sections, the tasks and functionality of those modules will be explained in a technical context.

4.3.2.1 Main app

The main application with its app.py file is responsible for the order in which the import process visualized in Figure 10 is executed. It is also responsible for accessing, merging, and saving data into the database. This allows the other modules to act independently from a specific type of data storage.

Figure 10: Import process.

The first two steps trigger the functionality of the Git module which is going to be the subject of chapter 4.3.2.2. First, the git log is crawled to obtain information on which files have been

changed and when. Second, those changed files are investigated one by one, reading the number of lines changed and associating them with the classes and methods they belong to. The third step triggers the functionality of the issues module. Here, the issues that are connected to previously crawled commits are imported from the project's issue tracking system. Lastly, the data is saved, storing all commits in one collection, all files and the information on their changes in another, and all loaded issues in a third one. In total, the database diagram can be drawn as visible in Figure 11 below.

Figure 11: Database diagram.
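The step sequence orchestrated by app.py can be sketched as a small pipeline function. The callable names are illustrative, not the actual app.py API; the point is that the steps are injected, which keeps the modules independent of the concrete data storage, as described above.

```python
def run_import(crawl_log, crawl_diffs, crawl_issues, save):
    """Sketch of the import process from Figure 10. The four steps are
    injected as callables so that, as in the real app.py, the modules
    stay independent of the concrete data storage."""
    commits = crawl_log()                 # step 1: crawl the git log
    diffs = crawl_diffs(commits)          # step 2: crawl git diffs per commit
    features = crawl_issues(commits)      # step 3: crawl connected issues
    save(commits, diffs, features)        # step 4: persist everything
    return commits, diffs, features
```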

4.3.2.2 Git module

The Git module is in charge of aggregating relevant data from a Git repository that is available either on a local network or on the internet. In the context of the main application and after making the repository available offline, this aggregation process can be split into the two steps already mentioned in the previous chapter: crawling the Git log for commit information and crawling the Git diff of those commits for method- and class-level artifacts. The implementation of this functionality extends a fork of existing source code¹¹ created previously as part of a research project [Sc21] conducted by the author of this thesis. Within the module, there are three components: the module entry point git.py, an adapter to the Git CLI cli.py, and an interpreter for the CLI output interpreter.py.

11 Source code available on GitHub: https://github.com/l-schulte/cdbs_hist_kau

git log --numstat --no-merges --date=unix --after=2020-01-01 --before=2021-01-01

Listing 3: Git log command.

commit 3123090b6cf44e853eceae42854d49e78e81be2a
Date: 1601069990

Changed default value of "search and store files relative to bibtex file" to true (#6928)

* Fixes #6863

1   0    CHANGELOG.md
1   1    src/jabref/gui/JabRefGUI.java
1   1    src/jabref/gui/SendAsEMailAction.java
1   1    src/jabref/gui/desktop/JabRefDesktop.java
2   1    src/jabref/gui/exporter/ExportToClipboardAction.java
3   3    src/jabref/gui/fieldeditors/LinkedFileViewModel.java
5   1    src/jabref/gui/fieldeditors/LinkedFilesEditor.java
0   77   src/jabref/gui/filelist/FileListEntry.java
2   2    src/jabref/gui/{filelist => linkedfile}/AttachFileAction.java
...

Listing 4: Git log output of a commit to the JabRef repository (shortened).

To gain access to information about the commits, the module reads the Git log through functions in the cli.py file and interprets it with the interpreter.py scripts. Running the Git log command as displayed in Listing 3 returns the information shown in Listing 4. Since the output format is optimized to be read by humans and not machines, it needs to be parsed in order to be useful for further analysis. From this, the following data points can be extracted through the interpreter scripts using regular expressions:

- A list of commits containing:
  o Unique ID of the commit
  o Name and email of the author
  o Date of the commit
  o Commit message
  o Issue ID extracted from the commit message (see chapter 4.3.2.3)
- A list of all changed files containing:
  o Unique ID of the commit
  o Issue ID extracted from the commit message (see chapter 4.3.2.3)
  o Path of the file
  o Previous path of the file
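A minimal sketch of such a regex-based interpreter for the `git log --numstat` output could look like this. The field names and the issue-ID pattern are assumptions for illustration; the real interpreter.py extracts more fields (author, email, old paths) and uses the Jira-specific extraction described in chapter 4.3.2.3.

```python
import re

COMMIT_RE = re.compile(r'^commit ([0-9a-f]{40})')
NUMSTAT_RE = re.compile(r'^(\d+|-)\t(\d+|-)\t(.+)$')   # added, deleted, path
ISSUE_RE = re.compile(r'#(\d+)')                       # hypothetical issue-ID pattern

def parse_git_log(output):
    """Parse `git log --numstat --date=unix` output into commit dicts (sketch)."""
    commits, current = [], None
    for line in output.splitlines():
        m = COMMIT_RE.match(line)
        if m:  # a new commit starts
            current = {'commit_id': m.group(1), 'date': None,
                       'message': [], 'files': []}
            commits.append(current)
            continue
        if current is None:
            continue
        if line.startswith('Date:'):
            current['date'] = int(line.split(':', 1)[1].strip())
            continue
        m = NUMSTAT_RE.match(line)
        if m:  # a numstat line: lines added / deleted / file path
            added, deleted, path = m.groups()
            current['files'].append({
                'path': path,
                'added': None if added == '-' else int(added),
                'deleted': None if deleted == '-' else int(deleted)})
        elif line.strip():  # everything else is part of the message
            current['message'].append(line.strip())
    for c in commits:
        c['message'] = ' '.join(c['message'])
        ids = ISSUE_RE.findall(c['message'])
        c['feature_id'] = ids[0] if ids else None
    return commits
```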

git diff 3123090b6cf44^ 3123090b6cf44 --histogram -U1000

Listing 5: Git diff command for a commit to the JabRef repository.

...
diff --git a/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java b/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
index 07a25bf42..995718ce7 100644
--- a/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
+++ b/src/main/java/org/jabref/gui/fieldeditors/LinkedFileViewModel.java
@@ -1,510 +1,510 @@
 package org.jabref.gui.fieldeditors;
-import org.jabref.gui.filelist.LinkedFileEditDialogView;
 import org.jabref.gui.icon.IconTheme;
 import org.jabref.gui.icon.JabRefIcon;
+import org.jabref.gui.linkedfile.LinkedFileEditDialogView;

 public class LinkedFileViewModel extends AbstractViewModel {
 ...
     public Observable[] getObservables() {
         List observables = new ArrayList<>(...);
         observables.add(downloadOngoing);
         observables.add(downloadProgress);
         observables.add(isAutomaticallyFound);
-        return observables.toArray(new Observable[observables.size()]);
+        return observables.toArray(new Observable[0]);
     }
 ...
 }
...

Listing 6: Git diff output of a commit to the JabRef repository (shortened).

To gain access to information describing what was changed in the files, the Git diff command shown in Listing 5 must be executed through the cli.py scripts and again analyzed by the interpreter.py.

The command compares the files changed in one commit (3123090b6cf44) against the same files from before the changes of the commit were applied, giving an overview of which lines were changed and what the change included. This previous state of the code in Git can be addressed by the current commit ID with a caret (^) appended (3123090b6cf44^). Further, the command includes the --histogram option, which activates the change detection mechanism that was found to work best for detecting changes granularly, and the -U1000 option, which deactivates the skipping of parts of the source code that were not changed. Again, the information, of which an extract is shown in Listing 6, is only available in a form that requires parsing. Since commits can contain multiple files, the first step taken by the interpreter while iterating the lines of the output is to look for the start of a new file. Once

a new one is recognized, all source code lines are added to a chunk that is analyzed further. This further analysis is then split into two parts, each iterating the code lines: one counting the changed lines per method and the other counting the changed lines per class. The task of counting changes on a method and class basis is of course dependent on the language used. Here the interpreter offers room for extension, since the regular expressions used for detecting the beginning of a class or method can be set based on the file ending. Currently, only Java files with the .java extension are supported, and the regular expressions displayed in Listing 7 are only an approximation of complete coverage of all possible method definitions. Also, edge cases like code lines containing a string that matches a method or class definition may cause the detection of false-positive matches.

'java': {
    'method_name': r'^(\+|\-| )( *|\t*)(?:public|protected|private|static|\s) +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])',
    'class_name': r'^(\+|\-| )( *|\t*)(?:public|protected|private) *(?:static)? *(?:abstract)? *(?:final)? class (\w+) (?:\w| )*{',
    'brackets_open': r'{',
    'brackets_close': r'}'
}

Listing 7: Extract from the regular expressions used for method and class detection by the interpreter.py.
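Using the method_name expression from Listing 7, the per-method line counting could be sketched as follows. This is a simplified reconstruction, not the actual interpreter.py: brace counting decides where a method ends, and the same caveats about strings and edge cases apply.

```python
import re

# method_name expression taken from Listing 7
METHOD_RE = re.compile(
    r'^(\+|\-| )( *|\t*)(?:public|protected|private|static|\s)'
    r' +[\w\<\>\[\]]+\s+(\w+) *\([^\)]*\) *(?:\{?|[^;])')

def count_changes_per_method(diff_lines):
    """Count added/removed lines per Java method in a unified-diff chunk.

    Simplified sketch of the interpreter's method pass: a method starts
    where METHOD_RE matches and ends when the brace depth returns to
    zero. Braces inside string literals are ignored here, mirroring the
    false-positive caveat noted in the text above.
    """
    counts, current, depth, opened = {}, None, 0, False
    for line in diff_lines:
        if current is None:
            m = METHOD_RE.match(line)
            if m:
                current, depth, opened = m.group(3), 0, False
                counts.setdefault(current, 0)
        if current is not None:
            if line.startswith(('+', '-')):
                counts[current] += 1          # a changed line inside the method
            depth += line.count('{') - line.count('}')
            if depth > 0:
                opened = True
            if opened and depth <= 0:         # closing brace of the method
                current = None
    return counts
```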

The result of this task is a dictionary with the file path as the key and an entry for each file which contains the following information:

- Old path
- List of classes where each contains:
  o The number of lines changed in the class
  o All added source code lines
  o All removed source code lines
  o All source code lines that remained unchanged
- List of methods where each contains:
  o The number of lines changed in the method
  o All added source code lines
  o All removed source code lines
  o All source code lines that remained unchanged

Later this dictionary is combined with the list of changed files from the commits, allowing access to a file, all its changes, and the feature ID associated with the change from one document.


4.3.2.3 Issues module

The issues module is responsible for aggregating issues from issue-tracking systems and providing them back to the main app. In Figure 10 this step is labeled issues crawl. At this point, it consists of an issues.py file as the entry point of the module and an adapter for accessing data from a Jira instance implemented in the jira.py file. In order to establish a connection to a Jira instance over HTTP, the jira.py file needs to obtain a token from that instance while creating a new session. Afterward, single issues can be accessed through a URL following the schema shown in Listing 8 below, where the domain and the parameters project and issue_id need to be changed accordingly.

http://issues.sample.com/{project}/rest/api/2/issue/{issue_id}

Listing 8: Sample URL for accessing an issue from a Jira instance.

During the crawling process, all commits are iterated, and each distinct issue is downloaded from the server. The issue descriptions are then put into a dictionary with the issue ID as the key and the description as the value. Due to the modularity of the solution, with the issues.py file being the access point in front of the Jira-specific adapter, other issue tracking systems can be added later. This also applies to the feature ID extraction from commit messages, which is a method implemented in jira.py that is used by the Git module's interpreter.py through issues.py.
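The crawl described above can be sketched as follows, assuming the URL schema from Listing 8 and the standard Jira JSON layout (description under 'fields'). Function names are illustrative, and the session/token handling from jira.py is omitted.

```python
import json
import urllib.request

def issue_url(project, issue_id, base='http://issues.sample.com'):
    """Build the issue URL following the schema from Listing 8."""
    return f'{base}/{project}/rest/api/2/issue/{issue_id}'

def fetch_issue(project, issue_id):
    """Download a single issue; assumes the standard Jira JSON layout
    where the description sits under 'fields'. Authentication (the
    session token mentioned above) is omitted in this sketch."""
    with urllib.request.urlopen(issue_url(project, issue_id)) as resp:
        data = json.load(resp)
    return data.get('fields', {}).get('description', '')

def crawl_issues(commits, fetch=fetch_issue):
    """Iterate all commits and download each distinct issue exactly once,
    returning {issue_id: {'description': ...}} as described above."""
    features = {}
    for commit in commits:
        fid = commit.get('feature_id')
        if fid and fid not in features:
            features[fid] = {'description': fetch(commit['project'], fid)}
    return features
```

Injecting `fetch` keeps the iteration logic testable without a live Jira instance and mirrors the adapter role of jira.py behind issues.py.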

4.3.2.4 Data finalization

The import process is finalized by merging the aggregated data. So far, four separate data pools have been created, as summarized in Table 5. The commits and features data pools are directly stored in the corresponding collections of the MongoDB database. The changes and diffs, on the other hand, need to be merged and post-processed. First, the diffs are added to the changes based on the file path. After that, renamed files are traced by matching the old file names of already saved changes to the new file name of the currently assessed change. Finally, the import is finished, and the data can be read by other applications fulfilling the specifications of the database diagram in Figure 11 at the beginning of this chapter.
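The merge and rename-tracing step can be sketched like this. The data shapes loosely follow Table 5 but are simplified; the function name and the grouping by path are illustrative assumptions.

```python
def merge_diffs_into_changes(changes, diffs):
    """Attach per-file diff data to change records and trace renames (sketch).

    `changes` are file-level change records (with optional 'old_path'),
    `diffs` maps current file paths to class/method statistics. When a
    change carries an old_path, the records previously stored under that
    old path are re-linked to the new path.
    """
    by_path = {}
    for change in changes:
        diff = diffs.get(change['path'], {})
        change['classes'] = diff.get('classes', {})
        change['methods'] = diff.get('methods', {})
        old = change.get('old_path')
        if old and old in by_path:
            # rename: keep the older records reachable under the new path
            by_path[change['path']] = by_path.pop(old) + [change]
        else:
            by_path.setdefault(change['path'], []).append(change)
    return by_path
```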


Data pool   Description

Commits     A list of commits structured as JSON documents:
            [{ 'commit_id': string, 'author': string, 'email': string,
               'date': date, 'comment': string, 'feature_id': string }]

Changes     A list of changes on a file level structured as JSON documents:
            [{ 'commit_id': string, 'feature_id': string, 'path': string,
               'old_path': string }]

Diffs       A dictionary of differential changes on a source code level,
            structured as JSON documents with file names as keys¹²:
            { '?path': {
                'old_path': string,
                'classes': { '?class_name': { 'cnt': int, '+': int, '-': int }, ... },
                'methods': { '?method_name': { 'cnt': int, '+': int, '-': int }, ... }
            } }

Features    A dictionary of issue descriptions structured as a JSON document
            with issue IDs as keys¹²:
            { '?feature_id': {
                'description': string,
                'type': { 'self': string, 'id': string, ... }
            } }

Table 5: Overview of data produced by importer steps.

12 In this graphic the keys preceded by a question mark are variable/generated depending on the input data.


4.4 Feature Location

This chapter provides implementation details on the feature-location application. It follows a similar structure to the previous chapter and is separated into subchapters 4.4.1 and 4.4.2, containing a discussion of context, scope, and solution strategy, and the documentation of the tool's architecture.

4.4.1 Context, Scope, and Solution Strategy

As part of the general solution strategy in chapter 4.2, the feature-location application is responsible for the execution and comparison of topic modeling-based feature location. Because of this, it needs to run three different tasks that will be labeled train, evaluate, and validate in this chapter. The functionality of those tasks follows the requirements (R8), (R10), and (R11). The choice of implementing the tasks as a Python 3 command-line application can also be derived from chapter 4.2. The separate tasks are to be executable independently, storing their output in a format that can, if necessary, be read by the others.

Figure 12: Tasks of the feature location process.

In contrast to the importer application, the development of those tasks did not follow a predefined path. Instead, technologies and methods were exchanged or removed during implementation and testing to adapt to the topic modeling results that were observed. The solution strategy may therefore list options that were to be explored and implemented but did not make it into the final application or are only available through specific commands. Figure 13 shows the application context of the feature-location application. It consists of the main system (dark green), an external MongoDB instance (white), and four modules (green) that contain extensions designed to be exchangeable, since they implement access to specific libraries or algorithms for which alternatives may be explored as well. The train task, with its implementation in the training module, naturally uses the database filled by the importer application, described in chapter 4.3, as a data source, fulfilling requirement (R6). Based on what is available from the database, strategies must be found for both data pre-processing (R7) and the general corpus generation to use it as an input for the execution of the topic modeling.


Figure 13: Feature-location application context.

While the database technically provides three groups of text sources that describe changes, only two of them are used in the implementation of the train task: issue descriptions and those source code lines that have been changed. They are the basis for the corpus generation shown in Figure 15 and for the feature descriptions. Commit messages as a third group of text sources, on the other hand, are not used, since they are not considered to provide information on high-level features. This conforms to methods used in other papers investigating changeset-based topic modeling, like [HGH09], [Hi15] and [CEB15], [CEB17] or [CKK15], [CDK20] and [Ma18]. The pre-processing then consists of the tokenization of those descriptions as well as the filtering of stop words. For this, two options need to be evaluated: the use of the English stop words from the NLTK Python library and an alternative approach using custom stop words for both English and the Java programming language found in the implementation of other research [Co15], [CKK15].
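The pre-processing step can be sketched as follows. The stop-word sets below are small illustrative samples only, not the NLTK list or the custom lists from [Co15], [CKK15]; the tokenizer is likewise a minimal assumption.

```python
import re

# Small illustrative stop-word sets; the thesis evaluates NLTK's English
# list against custom English + Java lists from [Co15], [CKK15].
ENGLISH_STOPS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'for'}
JAVA_STOPS = {'public', 'private', 'void', 'return', 'new', 'import', 'class'}

TOKEN_RE = re.compile(r'[A-Za-z]+')

def preprocess(text, stops=ENGLISH_STOPS | JAVA_STOPS):
    """Tokenize a feature description or changed code lines into lowercase
    words and drop stop words and one-letter tokens."""
    return [t for t in (w.lower() for w in TOKEN_RE.findall(text))
            if t not in stops and len(t) > 1]
```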

The corpus generation must then be applied to a number of documents that, in the form of files and their change history, are read from the database. Here the user must be able to select the parameters required for the training of the available models through the command-line interface of the application. As discussed before, the tomotopy library must be used for the implementation and support at least the LDA model and the Pachinko Allocation model. The results of the training that is applied to this document-based corpus need to be saved for later phases. They must consist of a model file that will be used for inferring new documents

and a lookup table that contains each document that the model has been trained on, together with its associated topic distribution. Based on the saved models and lookup tables for both LDA and the Pachinko Allocation, the evaluation phase is required to create search results for either the queries from a goldset or single user queries read from the command-line interface. For this, the inference functions of the tomotopy library must be used. Once a document has been inferred, a strategy for finding relevant results matching the topic distribution is applied. In order to do this, it was decided to explore two options: a simple lookup of the cluster that commonly has the highest probability and a rating of the distance between probability distributions. Said search results must be saved as JSON documents for later analysis.
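The two matching strategies can be sketched as follows. The Jensen-Shannon distance is used here as one plausible "rating of the distance between probability distributions" (an assumption; the text above does not fix a particular metric), and the inferred query distribution would in practice come from tomotopy's inference functions.

```python
import math

def top_topic(dist):
    """Strategy 1: the single cluster with the highest probability."""
    return max(range(len(dist)), key=lambda i: dist[i])

def jsd(p, q):
    """Strategy 2: Jensen-Shannon distance between two topic distributions
    (square root of the Jensen-Shannon divergence, natural log base)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def rank_documents(query_dist, lookup):
    """Rank corpus documents ({name: topic distribution}, i.e. the lookup
    table) by their distance to the inferred query distribution."""
    return sorted(lookup, key=lambda name: jsd(query_dist, lookup[name]))
```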

Component Description

Feature-Location App  The starting point of the Python application. The main app is responsible for starting one or multiple of the available tasks (train, evaluate, validate). It also contains code that can train both LDA and Pachinko Allocation with a varying number of clusters. It provides a command-line interface for starting the application.

MongoDB A globally available instance of the non-relational database MongoDB hosted on MongoDB Atlas. It is filled by the importer application and contains data on changes made to the repositories as well as features from the issue tracking systems. On that MongoDB instance, each im- ported repository has its own database.

Models  Python module containing the model-specific as well as the common implementation logic of LDA and Pachinko Allocation using the tomotopy library.

Training Python module that provides the relevant functions for initiating the training process. For this, it is being accessed by the main application. It currently offers an implementation for training with tomotopy, but the library-specific logic is exchangeable.

Evaluation  Python module that is responsible for evaluating the goldset queries by inferring them against a trained model. Besides generic evaluation functionality, it also acts as a wrapper that supports tomotopy models.

Validation Python module for validating previously evaluated goldset queries against the anticipated results from the same goldset.

Table 6: Description of components in the context of the feature-location application.


From the search results, the validation task must calculate a metric that can be used to compare the success of evaluations. For this, the mean reciprocal rank is used, since it is a common indicator for search engine hit rates [CDK20], [CMS09].
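The mean reciprocal rank over a set of goldset queries can be computed as:

```python
def reciprocal_rank(expected, ranked_results):
    """1/rank of the first expected item in the ranked result list, 0 if absent."""
    for rank, item in enumerate(ranked_results, start=1):
        if item in expected:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over all goldset queries [CDK20], [CMS09].
    `queries` is a list of (expected_set, ranked_result_list) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(e, r) for e, r in queries) / len(queries)
```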

4.4.2 Building Block View

The implementation specifics described in this chapter are derived from the solution strategy above as well as the other previous chapters. As shown in the diagram in Figure 14, the implementation is split into six high-level components that were already introduced in chapter 4.4.1: app.py, models, data, training, evaluation, and validation.

Figure 14: UML diagram of the building blocks in the feature-location application.

The app.py file provides a command-line interface for accessing the most common features of the application. When executing the script, the input parameters listed in Table 7 are available. In the following sections, the implementation specifics of those blocks are described on a high level, split into the three tasks identified in the solution strategy of the previous chapter: train, evaluate, and validate.


Parameter Description Options

-t, --train Choosing the models that should be trained. lda, pa

--lda_k1 Number of clusters if LDA is being trained. Integer

--pa_k1 Number of first level clusters for training Pachinko. Integer

--pa_k2 Number of second-level clusters for training Pachinko. Integer

-b, --base Choosing either file or class-based training. class, file

-e, --eval Choosing the models that should be evaluated. lda, pa

-q, --query Search query for which a topic must be found. String

-i, --input Query and goldset directory as an alternative to a custom query. Path

-p, --pages Number of pages to display in the evaluation result. Integer

-d, --determ Evaluation method used to determine which documents match the search query. ml, dist

-v, --validate Choosing the models that should be validated. lda, pa

Table 7: Command line options of the app.py file.
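For illustration, the options of Table 7 could be declared with Python's argparse roughly as follows. This is a hypothetical sketch, not the actual app.py; the flag names follow Table 7, while the defaults chosen here are assumptions (only the `dist` default for `--determ` is stated in the text).

```python
# Hypothetical sketch of the app.py command-line interface from Table 7.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Topic-modeling-based feature location")
    p.add_argument("-t", "--train", nargs="*", choices=["lda", "pa"],
                   help="models that should be trained")
    p.add_argument("--lda_k1", type=int, help="number of clusters for LDA")
    p.add_argument("--pa_k1", type=int, help="first-level clusters for Pachinko")
    p.add_argument("--pa_k2", type=int, help="second-level clusters for Pachinko")
    p.add_argument("-b", "--base", choices=["class", "file"], default="class",
                   help="class- or file-based training (default is an assumption)")
    p.add_argument("-e", "--eval", nargs="*", choices=["lda", "pa"],
                   help="models that should be evaluated")
    p.add_argument("-q", "--query", help="search query for which a topic must be found")
    p.add_argument("-i", "--input", help="query and goldset directory")
    p.add_argument("-p", "--pages", type=int, default=1,
                   help="pages to display in the evaluation result")
    p.add_argument("-d", "--determ", choices=["ml", "dist"], default="dist",
                   help="method that determines matching documents")
    p.add_argument("-v", "--validate", nargs="*", choices=["lda", "pa"],
                   help="models that should be validated")
    return p

# Parsing a sample argument list that mirrors a typical invocation
args = build_parser().parse_args(
    ["-t", "lda", "pa", "--lda_k1", "500", "--pa_k1", "200", "--pa_k2", "300"]
)
```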

4.4.2.1 Training

The training process starts with loading the documents from the MongoDB database and formatting them to fit either the class-based or the file-based topic modeling. Similarly, the issue descriptions are loaded from the database and stored within the runtime context of the application. This is handled by the app.py file.

The actual process of training LDA and Pachinko Allocation models with the tomotopy library is generic and implemented in the tomotopy.py file of the training module. It receives a model for either LDA or Pachinko Allocation, generated from the models module by app.py through the training.py script, which acts as an adapter. Due to the implementation of training.py as an adapter, the code specific to tomotopy can be exchanged easily, given that fitting models have previously been instantiated.

Figure 15 displays the process from a high-level view. Starting with the corpus generation based on documents, feature descriptions are generated from the concatenation of issue descriptions and titles as well as file diffs, including the lines added and those that remained the same. Those feature descriptions are then tokenized into single words and filtered to no longer contain stop words. This corpus is then used to train the models, resulting in an output of a model file and a lookup table as a CSV file. In this implementation, the lookup table contains at least all documents with their path and name as well as each document's probability distribution for each topic. A screenshot of an example can be found in Appendix A.2.
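The corpus-generation steps described above (concatenating issue text with the kept diff lines, then tokenizing and removing stop words) can be sketched in plain Python. The stop-word list and the diff-line convention used here ('+' for added, ' ' for unchanged lines) are simplifying assumptions, not the thesis implementation:

```python
# Minimal sketch of the corpus-generation step: build a feature description
# from issue title, issue description, and the added/unchanged diff lines,
# then tokenize and filter stop words.
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for"}  # assumed list

def build_feature_description(title: str, description: str, diff: str) -> str:
    # keep added ('+') and unchanged (' ') lines, skip removed ('-') lines
    kept = [line[1:] for line in diff.splitlines()
            if line.startswith("+") or line.startswith(" ")]
    return " ".join([title, description, *kept])

def tokenize(text: str) -> list[str]:
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())
    return [w for w in words if w not in STOP_WORDS]

doc = build_feature_description(
    "Fix login timeout",
    "The login to the quorum times out.",
    "+ int timeout = 30;\n- int timeout = 10;\n  Login login;",
)
tokens = tokenize(doc)
```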

Figure 15: Flow chart of the train task.

4.4.2.2 Evaluation

Based on the model file previously generated in the train task and a query, the probability distribution of topics for that query can be inferred. Using the lookup table that was also generated in the previous step, search results are created. This is done by matching the documents in the lookup table to the topic distribution. A graphical overview is provided in Figure 16.

As required by the solution strategy, two options exist for creating those matches. The first one is a simple descending sort of the documents in the lookup table by the most likely topic of the query. The second one is a comparison of the deviation between all topics of a document and the query, as described in chapter 3.4.1. Both are implemented and available via the command-line interface, with the latter being the default option.
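The two matching options can be sketched as follows. The topic distributions are toy values, and the sum of absolute differences merely stands in for the deviation measure of chapter 3.4.1, which may be defined differently:

```python
# Sketch of the two ranking options: sort by the query's most likely topic,
# or rank by deviation between whole topic distributions.
def rank_by_top_topic(lookup, query_dist):
    top = max(range(len(query_dist)), key=query_dist.__getitem__)
    return sorted(lookup, key=lambda doc: lookup[doc][top], reverse=True)

def rank_by_deviation(lookup, query_dist):
    def deviation(doc):
        return sum(abs(d - q) for d, q in zip(lookup[doc], query_dist))
    return sorted(lookup, key=deviation)  # smaller deviation = better match

# Toy lookup table: document name -> topic distribution
lookup = {
    "Login": [0.7, 0.2, 0.1],
    "ZooKeeperSaslClient": [0.1, 0.8, 0.1],
    "QuorumPeer": [0.3, 0.3, 0.4],
}
query = [0.6, 0.3, 0.1]
```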


Figure 16: Flow chart of the evaluation task.

In case an input directory was chosen instead of a custom query, the process is run for the content of all text files in that directory. The output is then persisted as text files, stored in a subdirectory of the input called either ‘lda’ or ‘pa’, depending on which model type was evaluated. An example of such a search result file can be seen in Listing 9 below.

{ "log_ll": -158.93582153320312, "res": [ { "_id": "6062e8fea09c5e09e16fe3e5", "path": ".../Login.java -> Login", "name": "Login" }, { "_id": "6062e8fea09c5e09e16fe3e6", "path": ".../client/ZooKeeperSaslClient.java -> ZooKeeperSaslClient", "name": "ZooKeeperSaslClient" }, ... }

Listing 9: Beginning of a class-based search result for the zookeeper repository.

4.4.2.3 Validation

Once an evaluation was performed on queries contained in a goldset, the validation must generate a score that can be used to determine how effectively the topic modeling technique performs in feature location. It requires the definition of an input directory through the command-line interface and uses the data stored there to compare previously generated search results for goldset queries to the anticipated results of those goldsets. The flow is shown in Figure 17. The method implemented for this is the mean reciprocal rank metric. Its formula was already introduced in chapter 3.4.2. Its result describes, averaged over the whole collection of search results, the reciprocal of the rank at which the first correct result can be found.
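The metric the validate task computes can be sketched in a few lines; the example ranks are illustrative:

```python
# Mean reciprocal rank: for each goldset query, take the rank of the first
# correct search result and average the reciprocals of those ranks.
def mean_reciprocal_rank(first_correct_ranks):
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

# Example: first correct result on positions 1, 2, and 4
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```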


Figure 17: Flow chart of the validate task.


5 Evaluation

The Evaluation chapter is split into three parts: a description of the setup used for the experiments, an analysis of how the execution of feature location was performed, and an experiment that aims to provide a basis for comparing the performance of LDA with that of the Pachinko Allocation.

5.1 Setup

This chapter describes the general setup of the experiments and fills in those blanks that have previously been left open so that the tools in chapter 4 remain flexible and usable in multiple scenarios. The description starts with an evaluation of the general structures of data used as in- and output of the feature location process and then continues with an overview of the software system that has been imported by the tool.

5.1.1 General Data Structures

For the execution of a comparison between the Latent Dirichlet Allocation model and the Pachinko Allocation model, the following data sources have been identified as relevant and chosen to enable the implementation and evaluation in this thesis:
- Git version control systems
- Jira issue tracking systems
- Corley goldset
The first two data sources were chosen for technical and popularity reasons that are explained in chapter 4 and will be summarized at the end of this chapter. The choice of the goldset, on the other hand, must be evaluated in this chapter since the arguments for and against the available options are of a more conceptual nature.
The two available datasets are each provided in association with research papers on the topic of feature location. They are available under their respective links on the website of the Lero Institute13 [Ma18] and the GitHub repository of the papers' main author Christopher Corley14 [CKK15], [CDK20]. In the following chapters, they will be referred to by their respective sources.
The Lero dataset contains data on 11 open-source projects, including popular ones like ArgoUML, the Eclipse IDE, and JabRef. An overview of all projects can be found in Appendix A.1. While the number of projects available would be more than sufficient to base answers to

13 Goldset from [Ma18]: https://lero.ie/research/datasets/feature_location/comparison

14 Goldset from [CKK15], [CDK20]: https://github.com/cscorley/changeset-feature-location/

the research question based on them, the structure of this dataset is not optimal for the scope in which this thesis applies data pre-processing. It consists of three files per project, as listed in Table 8 below:

Name Description

CorpusRaw A text file that contains a list of lines, each made up of a method that has been pre-filtered to remove stop words.

Queries A text file that contains a list of lines, each containing a feature search query that has been pre-filtered to remove stop words.

AnswerMatrix CSV file that contains numbers in a matrix. Each line corresponds to the line in the Queries-file that has the same index. Each number in a line corresponds to a method in the CorpusRaw-file. The matrix therefore serves as a link between queries and those methods that are considered the location of the feature described.

Table 8: Structure of the Lero goldset.

The problem lies in the fact that the goldset is already heavily pre-processed, with little information available on how to reproduce these inputs for other projects or techniques.

The Corley goldset on the other hand is set up for more flexible use while its choice of projects is at the same time similar to the Lero goldset as shown in Appendix A.1. The relevant files and folders with their structure are explained in Table 9 below:

Name Description

queries A folder containing a number of text files that contain either an issue description (LongDescription[??].txt) or the title of an issue (ShortDescription[??].txt) with all files containing a number in their name.

goldsets A folder containing a folder called “class” and/or a folder called “method”. Inside those folders, there is a number of files that are linked to the aforementioned queries through their filename. The content of the files is a list of classes or methods which are the expected feature location result of the query.

Table 9: Structure of the Corley goldset.
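A hypothetical loader for this layout might look as follows. The exact file-name pattern (`LongDescription<ID>.txt`) and the one-class-per-line goldset format are assumptions based on the description above, not a documented specification:

```python
# Hypothetical loader for the Corley goldset layout of Table 9: pair each
# long query description with the expected classes from goldsets/class.
from pathlib import Path

def load_goldset(root: str) -> dict:
    root_path = Path(root)
    pairs = {}
    for query_file in (root_path / "queries").glob("LongDescription*.txt"):
        issue_id = query_file.stem.removeprefix("LongDescription")
        gold_file = root_path / "goldsets" / "class" / f"{issue_id}.txt"
        if gold_file.exists():
            pairs[issue_id] = (
                query_file.read_text().strip(),       # the search query text
                gold_file.read_text().splitlines(),   # expected feature locations
            )
    return pairs
```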


Further, the Corley dataset contains snapshots and temporal files as well as search results generated by the authors of the papers [CKK15], [CDK20]. Due to its structure, the Corley goldset is better suited to explore the full process of feature location for different topic modeling techniques. It is therefore the preferred choice going forward.
The data available from the Git version control system and the Jira issue tracker is gathered by the importer tool described in chapter 4.3 and includes the high-level data points listed in Table 10. It will in this form serve as an input into the evaluation-relevant topic modeling-based feature location.

Name Description

commits A list of all commits made to the repository that should be analyzed. The information available includes the date of the commit, a commit message, other metadata, and a list of files that were changed in the commit. Due to the pattern by which Jira issues are mapped to Git commits, the ID of the issue attached to the commit can also be found here.

features A dictionary containing the issue ID and full description as well as some metadata.

files Each file in the repository that has been changed in a previously defined timeframe, without duplications and with tracking of file renaming. Further, all issue IDs connected to the file are listed.

Table 10: Structure of Git and Jira data.

After the import of data and the iteration through the three steps of the feature location application train, evaluate, and validate (see chapter 4.4), an output of search results and their validation against the goldset, expressed by a metric, is to be expected. While the metrics are shown in the command-line interface, the search results that were generated in the evaluation step are saved in JSON files on the computer’s hard drive. In order to keep them organized, it is possible to select an input/output folder through the command-line interface of the program, in which folders called “lda” and “pachinko” will be created so that the output of different techniques is stored separately. After execution, those folders will contain a number of files matching the number of goldset search queries; besides serving the replication of results, they can then be used for manual validation.


5.1.2 Target System

Apache Zookeeper, an open-source server project that provides functionality to maintain reliable and distributed cloud applications, was used as the main research object. It was chosen from the list of systems that are available in the Corley goldset, which is summarized in Appendix A.1. As stated there, Zookeeper is hosted on GitHub15 and uses a Jira instance16 for issue tracking purposes. In conformance with the Corley goldset, the timeframe from the start of the project until the release of version 3.4.5 (published on the 19th of November 2012) was imported. At this date, the project consists mainly of Java files with a total of just under 80,000 lines. Further, 954 commits were made to the main branch and a total of 693 Jira issues were assigned to commits.

[Bar chart: lines of code per language (Java, HTML, XML, C++, C, C/C++, JavaScript); Java dominates with just under 80,000 lines.]

Figure 18: Programming language composition of the Zookeeper repository by lines of code.

After the import into the database that is later used for training, exactly 1200 files exist in the database of which 474 are Java files that will be included in the feature location process.

5.2 Feature Location Technique Analysis

The topic of this chapter is the analysis of the feature location process. The primary focus is on the task of training models for LDA and the Pachinko Allocation. Subchapters describe the performance of the tasks as well as the tuning of hyperparameters to optimize results. Finally, obstacles encountered in the process are addressed. In general, the feature location process can be started through the command-line interface with the command in Listing 10, or a variation of it.

15 Zookeeper repository at release version 3.4.5: https://github.com/apache/zookeeper/tree/release-3.4.5

16 Zookeeper Jira instance: https://issues.apache.org/jira/projects/ZOOKEEPER/issues

python .\app.py -t lda pa -e lda pa -v lda pa --lda_k1 500 --pa_k1 200 --pa_k2 300 -i .\data\zookeeper\

Listing 10: Start command for training, evaluation, and validation of LDA and PA models.

5.2.1 Execution time

As the tomotopy library for Python 3 is used to implement the feature location process, the performance of the tool depends largely on the performance of the library. While a comparison of the performance of LDA and the Pachinko Allocation is difficult, as the number of topics in a one-layered model cannot simply be compared to the number of topics in a two-layered model, the general performance can still be derived from the experience gained by working with both.

The configurations whose execution times are shown below perform comparably when validated against the goldset, yet the training of LDA is faster than the training of the Pachinko Allocation by a factor of 4.9.

[Bar chart: training time of PA (00:03:16) and LDA (00:00:40).]

Figure 19: Training time of an LDA and a Pachinko Allocation model; LDA 350 topics; PA 250 + 50 topics; 100 iterations; AMD Ryzen 2700X (8-Core) @ 3.70GHz – Win 10.

Running multiple analyses on different systems showed that the library profits largely from parallel computing on multiple cores while training the Pachinko Allocation model, yet the training of the LDA model does not seem to scale in the same way by just having more cores. Instead, the conclusion from the graph in Figure 20 is that there is an overhead for LDA when using a machine with 8 cores (16 logical cores) over one with a similar CPU with only 4 cores (8 logical cores), while the Pachinko Allocation, on the other hand, improves its performance. As noted before, the Pachinko Allocation remains slower than LDA. It is also noteworthy that the requirements for RAM are negligible, as both training processes use less than 300 MB of memory.


[Bar charts: LDA training times (left) between 00:00:33 and 00:00:53 across the three systems; PA training times (right) of 00:02:58, 00:08:57, and 00:15:04.]

Figure 20: Performance of LDA and PA training on different computer systems. LDA (left): 500 topics, 100 iterations, and a burn-in of 10; PA (right): 100 topics and 100 sub-topics, 100 iterations, and a burn-in of 10. Each CPU supports hyperthreading with 2 logical cores per physical core.

Unlike the training task, the evaluation of search queries does not differ significantly between LDA and Pachinko Allocation models, as shown by the graph in Figure 21. It displays the runtime of evaluations for 160 queries from the Corley goldset.

[Bar chart: evaluation time of PA (00:07:19) and LDA (00:07:13).]

Figure 21: Evaluation time of an LDA and a Pachinko Allocation model; LDA 350 topics; PA 250 + 50 topics; 100 iterations; AMD Ryzen 2700X (8-Core) @ 3.70GHz – Win 10.


5.2.2 Improvements through Hyperparameter Tuning

Due to the long runtime of the Pachinko Allocation, hyperparameter tuning is applied distinctly, modifying only one parameter at a time and then choosing a good value before proceeding to the tuning of the next parameter. Of the following subchapters, chapter 5.2.2.1 describes this process for the number of topics used in the models. Chapter 5.2.2.2 then continues with grid searches for good iteration parameters and finally, chapter 5.2.2.3 investigates enhancements possible through the variation of Dirichlet parameters.

5.2.2.1 Topic Number Grid Search

First, a grid search for LDA was performed by iterating through a wide-ranging number of topics between 10 and 800 with a step size of 10. The graph of the logarithmic likelihood shown in Figure 22 indicates a stagnation of improvements between 30 and 100 topics and then a slow descent after more than 400 topics. While the authors of the Corley goldset use an LDA model with 500 topics, the results from the tuning of the number of topics for LDA in this thesis indicate that configurations with between 170 and 400 topics perform best [Co15].
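The grid-search loop itself can be sketched independently of the model library. The scoring callback below is a fabricated stand-in for training an LDA model with `k` topics and reading its log-likelihood per word; only the search ranges match the text:

```python
# Skeleton of the topic-number grid search: try each candidate k, score it,
# and keep the configuration with the highest score.
def grid_search_topics(score, k_values):
    results = {k: score(k) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results

# Fabricated stand-in score with an artificial optimum at 300 topics;
# in the thesis this would be the trained model's log-likelihood per word.
fake_score = lambda k: -6.0 - ((k - 300) / 400) ** 2

best_k, results = grid_search_topics(fake_score, range(10, 810, 10))
```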

[Line plot: logarithmic likelihood (-7.2 to -6) over the number of topics (0 to 800).]

Figure 22: MRR and log-likelihood observed during hyperparameter tuning of an LDA model with 100 iterations and a burn-in of 10.

Due to the low runtime, the parameter tuning for the number of topics in LDA models was suitable to be performed with fine granularity, yet the time-intensive tuning of hyperparameters with the Pachinko Allocation model requires a more coarsely granular approach. The number of super-topics was tuned with values between 20 and 400 and the number of sub-topics with values between 20 and 300, with an overall step size of 50. From this, the graph in Figure 23 was drawn to give an overview of where to look for good configurations.


[Elevation diagram: logarithmic likelihood (-12 to -8.5) over the number of super-topics (20 to 400) and sub-topics (20 to 300).]

Figure 23: MRR and log-likelihood observed during hyperparameter tuning of a PA model with 100 iterations and a burn-in of 10.

The graph shows the logarithmic likelihood on an elevation diagram, with lighter colors indicating a higher logarithmic likelihood. It indicates that the more topics there are, the better the performance that can be expected. Unlike the respective graph for LDA, the evaluation performed for the Pachinko Allocation does not indicate a descent of the likelihood metric, at least not in this configuration. Due to the aforementioned high execution time of the training of Pachinko Allocation models, further configurations with higher numbers of topics were not explored in detail. Instead, to keep the execution time reasonably low for the tuning of other parameters in the following subchapters, the model used will consist of 100 super-topics and 100 sub-topics.

5.2.2.2 Variation of Iteration Parameters

The following parameters will be tuned in this chapter:
- Iterations over the dataset.
- Burn-in iterations for optimizing parameters.
As a base configuration, a fixed number of topics will be used to create the models. For LDA this means 350 topics, while for the Pachinko Allocation 100 super- and 100 sub-topics will be used. Tuning the number of iterations over the dataset for both LDA and the Pachinko Allocation resulted in the data shown in Figure 24. While the logarithmic likelihood as an indicator of model improvement stagnates after 50 iterations with the LDA model, the data shows that the


[Line plot: logarithmic likelihood (-13 to -5) over the number of iterations (0 to 1000) for LDA and PA.]

Figure 24: LDA and PA average logarithmic likelihood while tuning the number of iterations. Burn-in factor 10; LDA 350 topics; PA 100 + 100 topics.

Pachinko Allocation keeps improving almost steadily after an initial jump between 10 and 50 iterations. Weighing the execution time against the improvement, the number of iterations for PA should be located between 100 and 300 for this dataset. For further analyses, 100 iterations are going to be used for both techniques to profit from the faster performance.
The results gathered while tuning the LDA burn-in factor, as displayed in Figure 25, are counter-intuitive: the optimum found while iterating through values between 1 and 1000 indicates that performance is best with no or just a minimal burn-in. On the other hand, a similar tuning for the Pachinko Allocation resulted in numbers that conform with the intuitive expectation: there is a maximum around a burn-in factor of 100, and performance slowly decreases after this point.

[Line plot: logarithmic likelihood (-12 to -5) over the burn-in factor (0 to 1000) for LDA and PA.]

Figure 25: LDA and PA average logarithmic likelihood while tuning the burn-in factor. 100 iterations; LDA 350 topics; PA 100 + 100 topics.


The recommended configuration derived from those results is a burn-in factor of 0 for LDA, and a factor of 100 for the Pachinko Allocation respectively.

5.2.2.3 Variation of Dirichlet Parameters

As visualized in the graph of Figure 26, the Dirichlet parameters alpha (α) and eta (η) were tuned symmetrically in separate configurations for LDA and the Pachinko Allocation.
For the alpha and eta parameters of both techniques, the graph shows that the optimum values are low, around the default configuration of 0.01. There is only one exception: the sub-alpha parameter of the Pachinko Allocation. This parameter, which is related to the second topic layer of the Pachinko Allocation model, has its optimum closer to the value 1. This observation is somewhat unexpected, as a (sub-)alpha parameter close to the value 1 means that the Dirichlet distribution is not specific and therefore associations between groups of topics and their associated words are almost evenly split.

[Line plot: logarithmic likelihood (-14 to -6) over the parameter value (0 to 10) for lda alpha, lda eta, pa alpha, pa subalpha, and pa eta.]

Figure 26: Tuning of alpha (α) and eta (η) parameters of both LDA and PA: LDA 350 topics; PA 100 + 100 topics; other parameters according to previous chapters.

While there is also the option to tune the Dirichlet parameters asymmetrically, through vectors that have the same length as the number of topics they correspond to, this was not attempted in this evaluation. The reason for this is that it would require time and calculation power beyond the scope of this thesis.


5.2.3 Obstacles in the Feature Location Process

During the evaluation phase of this thesis, the tool for performing feature location based on topic modeling was put to the test by running it with various configurations. During the time spent with the evaluation, it became apparent that there is an issue in the tool's main library tomotopy that affects the determinism of the training process. This results in a limited reproducibility of the results presented. Even though tomotopy offers the option to set a seed value in order to reproduce results while training models for both techniques, models created with the same parameter configuration ended up performing differently from each other. They show a slight variance in the logarithmic likelihood score, and their results can vary significantly when evaluating and validating against the goldset. An example in the form of the output during training, evaluation, and validation for two seemingly identical LDA training processes and their results is shown in Listing 11.

> python app.py -t lda --lda_k1 100 -e lda -v lda -i .\tmp\
--- train ---
Iteration: 10  Log-likelihood: -6.812546821579827
Iteration: 20  Log-likelihood: -6.552810377639443
Iteration: 30  Log-likelihood: -6.466817423457372
Iteration: 40  Log-likelihood: -6.423062776760207
Iteration: 50  Log-likelihood: -6.398546375060162
Iteration: 60  Log-likelihood: -6.387198130807414
Iteration: 70  Log-likelihood: -6.376373306876543
Iteration: 80  Log-likelihood: -6.372046251722799
Iteration: 90  Log-likelihood: -6.366875461321159
Iteration: 100 Log-likelihood: -6.364717015332481
LDA ll per word -6.364717015332481
--- evaluate ---
| | # | 159 Elapsed Time: 0:01:16
--- validate ---
LDA MRR: 0.18178203688584976
Time Lapsed = 0:1:59.99697017669678

> python app.py -t lda --lda_k1 100 -e lda -v lda -i .\tmp\
--- train ---
Iteration: 10  Log-likelihood: -6.722635893883447
Iteration: 20  Log-likelihood: -6.5114688372249985
Iteration: 30  Log-likelihood: -6.43296017583458
Iteration: 40  Log-likelihood: -6.398586748371089
Iteration: 50  Log-likelihood: -6.387941462968303
Iteration: 60  Log-likelihood: -6.379946728737874
Iteration: 70  Log-likelihood: -6.378963118261652
Iteration: 80  Log-likelihood: -6.377343502242806
Iteration: 90  Log-likelihood: -6.3754312534110555
Iteration: 100 Log-likelihood: -6.374021941608899
LDA ll per word -6.374021941608899
--- evaluate ---
| | # | 159 Elapsed Time: 0:01:14
--- validate ---
LDA MRR: 0.1561320680286651
Time Lapsed = 0:1:54.462520122528076

Listing 11: Training, evaluation, and validation of an LDA model that performs differently even though the same parameters and seed value have been used.


As a reaction to this, the seed value was not used in the experiments, so that confusion about the assumed, yet nonexistent, determinism is prevented.

5.3 Search Result Performance

Based on the insights into the feature location process gained from the previous chapter, the performance of two good models, one from the Latent Dirichlet Allocation and one from the Pachinko Allocation, is evaluated in this chapter, with a focus on providing metrics for the final discussion in chapter 6.

5.3.1 Objects of Evaluation

Two well-performing models encountered during the evaluation of the feature location process have been chosen for the performance comparison. Their configuration is listed in Table 11. During training, they performed with a logarithmic likelihood of -6.3451 for the LDA model and -10.2163 for the Pachinko Allocation model.

Technique Topics Iterations Burn-in Alpha Eta

LDA 350 100 10 0.01 0.01

PA 250 + 50 100 100 0.01 (sub-alpha 0.01) 0.01

Table 11: Configurations of LDA and PA used for evaluating the search result performance.

LDA PA
Word Probability in topic Word Probability in subtopic
code 0.0939 proposal 0.0183
deprecated 0.0641 quorumpacket 0.0158
keeperexception 0.0286 type 0.0138
link 0.0218 follower 0.0133
3.1.0 0.0203 qp 0.0094

Table 12: Word distribution associated with one of the topics/subtopics of the LDA and Pachinko Allocation models used for the performance evaluation.


To give some insight into how the topic models performed under the abstraction of the tool, Table 12 shows the top 5 words from one of the topics of each model. They are paired with their respective probability from the Dirichlet distribution that describes their association with the topic. From an evaluation standpoint, this table shows that a more comprehensive pre-filtering of words may be useful to further improve the results presented in the next chapter.

5.3.2 Performance

To finally evaluate the performance of feature location applied through the techniques presented in this thesis, the previously introduced models were validated against the Corley goldset of queries and expected search results. Following the work of previous research, the focus lies on the mean reciprocal rank (MRR) metric, as it expresses the effectiveness of a search engine [CMS09], [CDK20]. Further metrics calculated from the output of the validation are the mean and median rank, as well as the average absolute deviation from the mean. While the MRR of LDA is higher (better) than the corresponding value for the Pachinko Allocation, the other metrics that are based on the rank seem to favor the newer technique.
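The rank-based metrics mentioned above can be computed from the list of first-correct-result ranks. The rank values in this sketch are illustrative, not the thesis data:

```python
# Sketch of the rank-based metrics: MRR, mean rank, average absolute
# deviation from the mean, and median rank over first-correct-result ranks.
from statistics import mean, median

def rank_metrics(ranks):
    m = mean(ranks)
    return {
        "MRR": mean(1.0 / r for r in ranks),
        "mean_rank": m,
        "avg_abs_deviation": mean(abs(r - m) for r in ranks),
        "median_rank": median(ranks),
    }

metrics = rank_metrics([1, 4, 18, 67, 250])  # illustrative ranks
```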

A Z-score and a p-value calculated with Wilcoxon’s signed-rank test for the rank further accredit the results a statistically relevant difference, assuming a commonly used significance level of α = 0.05 [CMS09].

Metric LDA PA

MRR 0.2068 0.1662

Mean rank 111 51

Avg. abs. deviation 102.6 59.1

Median rank 67 18

Z-score of the rank 5.78

p-value of the rank < 0.01

Z-score of the reciprocal rank 1.35

p-value of the reciprocal rank 0.18

Table 13: Metrics calculated during the evaluation of the search result performance.
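Such a test can be reproduced with SciPy's implementation of Wilcoxon's signed-rank test; the paired rank lists below are illustrative, not the thesis data:

```python
# Sketch of the significance test: Wilcoxon's signed-rank test on the paired
# first-correct-result ranks of both models (illustrative values).
from scipy.stats import wilcoxon

lda_ranks = [3, 120, 45, 200, 67, 15, 88, 140, 9, 310]
pa_ranks = [7, 40, 12, 60, 20, 18, 30, 55, 4, 90]

statistic, p_value = wilcoxon(lda_ranks, pa_ranks)
# a p_value below alpha (0.05) would indicate a significant rank difference
```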


As described in chapter 3.4.2, the p-value needs to be lower than the selected α for the difference to be significant [Ré08]. The p-value for the reciprocal rank, on the other hand, does not support such an assumption under the same significance level. Results of the calculations are shown in Table 13, with intermediate values for Wilcoxon’s signed-rank test provided in Appendix A.4.
In addition to metrics, the distribution of search result ranks describing the position of the first correct result was extracted from the validation process and visualized for later discussion. The results are shown in Figure 27 and displayed through a histogram with 5 bins and an overflow bin for ranks higher than 20. It is visible that the LDA results include more ranks that exceed the overflow value and are therefore sorted into the last bin. It is also apparent that the Pachinko Allocation generated more results in the two bins describing the lower search result ranks. The full list of all ranks and reciprocal ranks for the two evaluated models can be found in Appendix A.3.

Figure 27: Histogram of the distribution of search result ranks extracted from the validation process of feature location results based on the LDA and Pachinko Allocation models.


6 Discussion

This chapter is dedicated to the discussion of the studies presented in the previous chapter. Those discussions elaborate on the suitability of the feature location technique introduced through the toolkit developed as part of this thesis, as well as on the performance of feature location results derived from the process facilitated by the tool. With this, the research questions defined in chapter 1.3 will be answered.

6.1 Discussion of the Feature Location Technique (RQ1)

(RQ1) How can different feature location techniques be applied to codebase history data while optimizing results for the comparison of different approaches as well as for reproducibility?

This thesis introduced its feature location approach based on a modular integration of two topic modeling techniques. It works on a corpus joined together from a software system's codebase history obtained from git and issue descriptions from Jira instances.

Addressing the first part of the research question, the comparison of different approaches on a model level is supported through the modularity of the implementation. The tool makes it easy to exchange the tomotopy-based models and, with some further modifications, also allows the whole topic modeling library to be exchanged. In its current implementation, the training, evaluation, and validation of both LDA and Pachinko Allocation models are available and executable from the command line interface. Through the command line, it is further possible to dynamically configure most of the settings for each model. This enables the user to operate the tool without manually changing settings in source or configuration files and simplifies the automated execution of different scenarios.

With the approach of integrating the tomotopy library for topic modeling, the execution time of the training and query evaluation process can be considered sufficient for daily use, yet between the two implemented techniques the time required for training differs significantly. Pachinko Allocation models require an execution time 4.9 times higher than that of LDA models and are therefore at a clear disadvantage in a direct comparison. Still, the overall runtime is acceptable at less than four minutes on a desktop computer with current hardware.
Evaluating the internals of the models investigated in the previous chapter, it became apparent that a more sophisticated pre-filtering of words from the source code may be a good starting point for further improvements of the topic prediction accuracy. This is because the words that are common in the models often include implementation syntax, which is unlikely to be searched for directly, as the focus of the tool is natural language search queries [Co15]. At this point, further investigations and tests are required.
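As a purely hypothetical illustration of such a pre-filter (not the toolkit's actual implementation), a simple pass could drop language keywords and trivial tokens before training:

```python
import re

# Hypothetical stop list; a real filter would cover all Java keywords.
JAVA_KEYWORDS = {"public", "private", "static", "final", "void",
                 "class", "return", "new", "int", "boolean"}

def prefilter(tokens):
    """Keep tokens that look like natural-language or domain terms."""
    return [t for t in tokens
            if t.lower() not in JAVA_KEYWORDS      # implementation syntax
            and len(t) > 2                         # trivial tokens
            and not re.fullmatch(r"[0-9_]+", t)]   # numbers, separators

prefilter(["public", "void", "leaderElection", "int", "42", "session"])
# → ["leaderElection", "session"]
```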


In terms of reproducibility, the application provides multiple outputs that are stored and can be used to reproduce results. Among those are the models generated during training, the training data used, the association of training data with topics, and, after the evaluation of goldset queries, also the lists of search results. It can therefore be argued that the application provides a solution through which research in the field of topic modeling-based feature location can be continued. There is only one exception: as discussed in chapter 5.3, the reproducibility of the model training task is limited since the provided seed parameter of the tomotopy library does not seem to stop randomization. Therefore, the training process creates models that differ slightly from each other even if the same configuration is used.

To finally answer the research question, the feature location technique introduced and its implementation can be summarized as a viable solution. Its characteristics are its modularity, its simplified access through the command-line interface for users and automation, as well as its reproducibility through the separation of the whole process into steps and the frequent output of reusable datasets.

6.2 Discussion of the Search Result Performance (RQ2)

(RQ2) How does the Pachinko Allocation Model perform on the task of feature location compared to the more popular Latent Dirichlet Allocation Model?

As established in chapter 3.4.2, the mean reciprocal rank is the first metric to judge search results in the setting of feature location based on the Corley goldset. Applied to the ZooKeeper codebase through two selected models, one being an LDA model and one being a Pachinko Allocation model, the results present a preference for LDA by about 0.04 points on the scale between 1 (best) and 0 (worst). Although preferred by other research papers, the MRR metric is not intuitively comprehensible, and simpler metrics draw a different picture: both the average rank of the first correct result in a list of search results and the median of first correct results are considerably lower (better) with the Pachinko Allocation. It is debatable whether they have relevance next to MRR, as other research in the field of feature location does not present them [CKK15], [CDK20], [CEB17], [CEB15]. The main argument for including them in the evaluation of results is that MRR gives a high relevance to queries that have a correct result on the first rank, while results on the second and third rank already have significantly lower importance to the metric [CMS09]. For example, if two sets of four search queries, a = (1, 1, 1, 10) and b = (2, 2, 2, 2), were to be evaluated, where each entry is the rank of the first correct result, the MRR values would be MRR(a) = 0.775 and MRR(b) = 0.5. Purely relying on the MRR value, an evaluation would clearly favor the results of a, while the average rank, with avg(a) = 3.25 and avg(b) = 2, clearly favors b.
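The worked example above can be verified with two small helper functions (illustrative only, not part of the toolkit):

```python
def mrr(ranks):
    """Mean reciprocal rank over the first correct result of each query."""
    return sum(1 / r for r in ranks) / len(ranks)

def avg(ranks):
    """Plain average rank of the first correct result."""
    return sum(ranks) / len(ranks)

a = [1, 1, 1, 10]  # perfect for three queries, poor for one
b = [2, 2, 2, 2]   # consistently second place

mrr(a), mrr(b)  # → (0.775, 0.5): MRR clearly favors a
avg(a), avg(b)  # → (3.25, 2.0): the average rank favors b
```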

Considering that the last result of a is far down on the search result page while all results of b are visible to a hypothetical user at first glance, the MRR metric seems to misrepresent the actual usefulness of the search engine. A comparison to the click-through rate on web search engines can be drawn here. A study investigating the click behavior of users searching on a web search engine shows that while the first result is the most clicked entry, the second and third are also of importance [HWD11]. Yet results at positions nine and ten were rarely clicked during the study, as Figure 28 shows. The research further shows that users inspect around five results before clicking on one, further strengthening the argument that the relevance of the first result may be overestimated. This is displayed in Figure 29.

Figure 28: Frequencies and percentages of cursor hovers and clicks occurring on the search results. Percentages reflect the proportion of hover or click events over all 10 results [HWD11].

Figure 29: Mean number of search results hovered over before users clicked on a result (above and below that result). Result clicks are red circles, result hovers are blue lines [HWD11].

Following the argumentation above, a solidly good yet not perfect search engine may be of more use to a user than one that performs perfectly for some queries and badly for others. Further, according to Wilcoxon's signed-rank test performed on the rank, the difference between LDA and the Pachinko Allocation is statistically significant, which would make the latter the preferred choice.


As a result of this contradiction, though, it is not clear whether the Pachinko Allocation can play out its advantage of finding correlations between topics over the topic-word correlation approach followed by LDA. Assuming the MRR metric is the more relevant one, the results presented can be interpreted as an indication that features in the source code of the ZooKeeper project are distinct enough to have little overlap between them. Therefore, they would not require additional connections between topics for the task of feature location. Under that assumption, Wilcoxon's signed-rank test for the reciprocal rank value indicates that there is no statistical difference between the performance of the techniques. This would mean that LDA and the Pachinko Allocation perform similarly.

In general, though, the performance of both techniques in generating search results is not yet ready for production use. To be applied in a scenario where developers depend on automatic feature location, the results need to be improved further so that at least the majority of correct results can be found at first glance. This would require a median rank of the first correct result of at most 10. With the LDA model at a median of 67 and the Pachinko Allocation model at 18, this is currently out of reach for both models evaluated in this thesis.

6.3 Threats to Validity

The study presented in this thesis has limitations that may impact its validity. In section 6.3.1 threats concerning the chosen target system and the generalization involved are discussed, while in section 6.3.2 the focus lies on factors within the chosen approach and implementation.

6.3.1 External Validity

When selecting the target system for the analysis in this thesis, the choice was limited to those projects that have previously been investigated by other research and which are available in the goldset used. Further, time constraints only allowed for the detailed analysis of one target system. This poses a threat to external validity, since it cannot be assumed that the application chosen is representative of software projects in general. It is therefore not possible to assume that the results achieved here are also valid for other projects. Instead, the results require validation through the application of similar analyses to other projects. A mitigating factor is that the chosen application, ZooKeeper, is maintained by the Apache Foundation, a major player in open-source software development.

6.3.2 Internal Validity

As experienced in chapter 5.2.3, the reliance on a third-party library for critical parts can pose threats to validity. Internally, such a library can have implementation errors that may not be apparent and that influence the accuracy of results as well as the speed of the calculations. While the main library, tomotopy, is open source, the implementation of its algorithms was not verified by the author. Instead, it is assumed that they are implemented correctly and perform with decent speed and accuracy.


7 Conclusions and Future Work

This thesis investigated two topic modeling techniques for their suitability in the context of feature location tasks. For this purpose, a toolkit, consisting of an importer for codebase history data and a prototypical evaluation tool capable of performing feature location, was created. As laid out in chapter 4, the implementation meets the requirements set out for the tool, and it was therefore used as a basis for the evaluation that provides one part of the data used to answer the research questions. The other part is provided through the Corley goldset, a dataset that contains search queries and their expected results, which was used to validate results from the feature location process. Due to the modularity of the implementation, the toolkit is not limited to use in the context of this thesis but allows for the extension and support of further topic modeling techniques.

To put the toolkit to a test and to fulfill the research objectives of this thesis, multiple aspects of feature location and topic modeling were investigated in chapter 5, based on the evaluation tool and the performance of its underlying techniques. Accordingly, a full feature location process for two topic models, one Latent Dirichlet Allocation model and one Pachinko Allocation model, was conducted on a selected software system and its codebase. This software system is the ZooKeeper project of the Apache Foundation. It is hosted on GitHub and therefore accessible through the Git CLI, has a Jira issue tracking system, and is under active development. Further, it also has a sufficient dataset in the Corley goldset, meeting the requirements of the toolkit's implementation.
For the topic modeling involved, LDA and the Pachinko Allocation were chosen mainly for the following reasons: LDA is the model commonly used in previous research when applying topic modeling, while the Pachinko Allocation is an extension to LDA promising better recognition of connections not only between topics and words but also amongst two or more topics. The results of the investigation of both techniques and their comparison have been discussed in chapter 6. In summary, and by the metrics commonly used in other research, LDA has been identified as being at an advantage over the Pachinko Allocation, yet there are also arguments against the established technique in the context of feature location.

Drawbacks and limitations of the work identified during the tool's evaluation are related to reproducibility and the readiness of results for a production environment. Concerning the latter, it was not the purpose of this thesis to create a production-ready tool. The reproducibility of results, on the other hand, was one of the objectives of this thesis, yet it was not achieved in aspects of the training process. The reason for this has been identified as a non-functioning seed value in the tomotopy library, which causes trained models to turn out differently even when the same settings are applied. Further, in terms of result accuracy, one conclusion from the discussion in chapter 6 is that neither LDA nor the Pachinko Allocation performs at a level that would be acceptable in a production application on which developers rely in their daily work. Instead, there is still work to be done, either by improving the topic modeling techniques or by enhancing the data that they are based on, for example through improved pre-filtering of input texts. Ideas for this further work are discussed in the following section.

All in all, the research and implementation work done in this thesis provides the toolkit and knowledge required for follow-up research as well as further evaluations. It therefore completed its investigation of topic modeling techniques for feature location successfully.

7.1 Future Work

Future work includes the toolkit-based application of feature location to the codebases of further software systems. A starting point can be those that have goldsets available from previous research. Firstly, the ones maintained by the Apache Foundation are of interest as they follow a development strategy similar to ZooKeeper's. Yet the tool can be extended to support different systems as well. In terms of enhancements to the toolkit, the currently applied pre-filtering of text needs to be revisited and enhanced to filter out more of the structural keywords imported from the source code of the systems analyzed. Finally, the variation of topic modeling techniques evaluated is of interest for further work, as there are multiple other models supported by the tomotopy library that could easily be added to the toolkit.

Aside from required improvements in the accuracy and validation of results, further work can include the integration of the toolkit into integrated development environments (IDEs) and the general workflow of software developers. Such integration into an IDE could, for example, be facilitated through an add-on or an extension of the IDE's general search function. Further use of the feature location techniques in the developer's workflow may include a dedicated application. This application may, through visualization of the codebase, provide an overview suitable for planning architectural tasks such as new implementations or the removal of no longer required features or services in the codebase. One option for the supporting visualization may be a variation of circle packing hotspots [To15]. In terms of the time-intensive training process, strategies for re-training and model version control need to be developed. This may include the design of a server-side model repository and the integration of training into continuous integration processes.


8 Bibliography

[AA15] Alghamdi, R.; Alfalqi, K.: A Survey of Topic Modeling in Text Mining. In International Journal of Advanced Computer Science and Applications, 2015, 6.

[BB07] Banerjee, A.; Basu, S.: Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning. In (Apté, C. V. Ed.): Proceedings of the Seventh SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, Philadelphia, 2007; pp. 431–436.

[Bl12] Blei, D. M.: Probabilistic topic models. In Communications of the ACM, 2012, 55; pp. 77–84.

[BM14] Buntine, W. L.; Mishra, S.: Experiments with non-parametric topic models. In (Macskassy, S. et al. Eds.): Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, 2014; pp. 881–890.

[BNJ03] Blei, D. M.; Ng, A. Y.; Jordan, M. I.: Latent Dirichlet Allocation. In J. Mach. Learn. Res., 2003, 3; pp. 993–1022.

[Bo14] Bourque, P. Ed.: Guide to the software engineering body of knowledge. Version 3.0; SWEBOK; a project of the IEEE Computer Society. IEEE Computer Society, Los Alamitos, Calif., 2014.

[CCC10] Chih-wei, H.; Chih-chung, C.; Chih-jen, L.: A practical guide to support vector classification, 2010.

[CDK20] Corley, C. S.; Damevski, K.; Kraft, N. A.: Changeset-Based Topic Modeling of Software Repositories. In IEEE Transactions on Software Engineering, 2020, 46; pp. 1068–1080.

[CEB15] Chochlov, M.; English, M.; Buckley, J.: Using changeset descriptions as a data source to assist feature location. 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM), 2015; pp. 51–60.

[CEB17] Chochlov, M.; English, M.; Buckley, J.: A historical, textual analysis approach to feature location. In Information and Software Technology, 2017, 88; pp. 110–126.

[Ch01] Chang, S. K.: Handbook of software engineering & knowledge engineering. World Scientific, River Edge, N.J., London, 2001.


[CKK15] Corley, C. S.; Kashuda, K. L.; Kraft, N. A.: Modeling changeset topics for feature location. In: 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, Germany, 29 September – 1 October 2015. IEEE, Piscataway, NJ, 2015; pp. 71–80.

[CMS09] Croft, W.; Metzler, D.; Strohman, T.: Search Engines - Information Retrieval in Practice, 2009.

[Co15] Corley, C. S.: cscorley/changeset-feature-location. GitHub. https://github.com/cscorley/changeset-feature-location/, accessed 23 Apr 2021.

[CR01] Chen, K.; Rajich, V.: RIPPLES: tool for change in legacy software. International Conference on Software Maintenance. IEEE Computer Society, 2001; pp. 230–239.

[Cu05] Cubranic, D. et al.: Hipikat: a project memory for software development. In IEEE Transactions on Software Engineering, 2005, 31; pp. 446–465.

[De90] Deerwester, S. et al.: Indexing by latent semantic analysis. In Journal of the American Society for Information Science, 1990, 41; pp. 391–407.

[Di13] Dit, B. et al.: Feature location in source code: a taxonomy and survey. In Journal of Software: Evolution and Process, 2013, 25; pp. 53–95.

[EBG07] Egyed, A.; Binder, G.; Grunbacher, P.: STRADA: A Tool for Scenario-Based Feature-to-Code Trace Detection and Analysis. 29th International Conference on Software Engineering (ICSE 2007 companion volume), 20–26 May 2007, Minneapolis, Minnesota. IEEE Computer Society, Los Alamitos, CA, 2007; pp. 41–42.

[FB19] Fowler, M.; Beck, K.: Refactoring. Improving the design of existing code. Addison-Wesley, Boston, 2019.

[FKG10] Frigyik, B. A.; Kapila, A.; Gupta, M. R.: Introduction to the Dirichlet Distribution and Related Processes. University of Washington, 2010.

[HGH09] Hindle, A.; Godfrey, M. W.; Holt, R. C.: What's hot and what's not: Windowed developer topic analysis. In: International Conference on Software Maintenance, 2009; pp. 339–348.

[Hi15] Hindle, A. et al.: Do Topics Make Sense to Managers and Developers? In Empirical Software Engineering, 2015, 20; pp. 479–515.

[Ho99] Hofmann, T.: Probabilistic latent semantic indexing. In (Hearst, M.; Gey, F. E.; Tong, R. Eds.): Proceedings of SIGIR '99, 22nd international conference on research and development in information retrieval. ACM Press, New York, NY, USA, 1999; pp. 50–57.


[HWD11] Huang, J.; White, R. W.; Dumais, S.: No clicks, no problem. In (Tan, D. et al. Eds.): 29th Annual CHI conference on human factors in computing systems. Association for Computing Machinery, New York, 2011; p. 1225.

[Jo15] Jordan, H. et al.: Manually Locating Features in Industrial Source Code: The Search Actions of Software Nomads. 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, 2015; pp. 174–177.

[Ko06] Ko, A. J. et al.: An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. In IEEE Transactions on Software Engineering, 2006, 32; pp. 971–987.

[Le18] Lero: Empirical Assessment of Baseline Feature Location Techniques. https://lero.ie/research/datasets/feature_location/comparison, accessed 23 Apr 2021.

[Li15] Li, X. et al.: Group topic model: organizing topics into groups. In Information Retrieval Journal, 2015, 18; pp. 1–25.

[LM06] Li, W.; McCallum, A.: Pachinko allocation. In (Cohen, W. W.; Moore, A. W. Eds.): Proceedings of the 23rd international conference on machine learning (ICML 2006). ACM Press, New York, NY, USA, 2006; pp. 577–584.

[Lu94] Lukoit et al.: TraceGraph: immediate visual location of software features. In (Müller, H. A.; Georges, M. Eds.): Software maintenance, 11th International conference papers. IEEE Computer Society Press, 1994; pp. 33–39.

[Ma18] Martinez, J. et al.: Feature location benchmark with argoUML SPL. In (Berger, T. Ed.): Proceedings of the 22nd International Systems and Software Product Line Conference - Volume 1. ACM, New York, NY, 2018; pp. 257–263.

[Mi21] Microsoft: Regular Expression Language - Quick Reference. https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference, accessed 10 May 2021.

[OJ10] Olszak, A.; Jørgensen, B. N.: Featureous: A Tool for Feature-Centric Analysis of Java Software. In (Antoniol, G. Ed.): 2010 IEEE 18th International Conference on Program Comprehension (ICPC 2010), Braga, Portugal, 30 June – 2 July 2010. IEEE, Piscataway, NJ, 2010; pp. 44–45.

[Po05] Poshyvanyk, D. et al.: IRiSS - A Source Code Exploration Tool, 2005.


[Po06] Poshyvanyk, D. et al.: Source Code Exploration with Google. 22nd IEEE International Conference on Software Maintenance (ICSM'06), Philadelphia, Pennsylvania, 24–27 September 2006. IEEE, 2006; pp. 334–338.

[Pr17] Provalis Research: Topic modeling vs. cluster analysis: What's the difference?! In Provalis Research, 2017.

[Ra10a] Rasch, B.: Quantitative Methoden. Einführung in die Statistik - Band 1. Springer, Berlin, 2010.

[Ra10b] Rasch, B.: Quantitative Methoden. Einführung in die Statistik - Band 2. Springer, Berlin, 2010.

[Ré08] Rédei, G. P. Ed.: Encyclopedia of genetics, genomics, proteomics and informatics. Springer, London, 2008.

[Sc21] Schulte, L. M.: Analyzing Dependencies between Software Architectural Degradation and Code Complexity Trends. Research Project, Karlstad, Sweden, 2021.

[SM83] Salton, G.; McGill, M. J.: Introduction to modern information retrieval. McGraw-Hill, New York, London, 1983.

[SS18] Syed, S.; Spruit, M.: Selecting Priors for Latent Dirichlet Allocation. 2018 IEEE 12th International Conference on Semantic Computing (ICSC), 2018; pp. 194–202.

[To15] Tornhill, A.: Your code as a crime scene. Use forensic techniques to arrest defects, bottlenecks, and bad design in your programs. The Pragmatic Bookshelf, Dallas, 2015.

[WR07] Warr, F. W.; Robillard, M. P.: Suade: Topology-Based Searches for Software Investigation. 29th International Conference on Software Engineering (ICSE 2007 proceedings), 20–26 May 2007, Minneapolis, Minnesota. IEEE Computer Society, Los Alamitos, CA, 2007; pp. 780–783.


9 Appendix

A.1 Goldset Comparison

Project        Lero  Corley  VCS / Issue Tracking
ArgoUML        ✔     ✔       Prev.: SVN (Tigris) / Tigris (not available); 2019: Git (GitHub) / GitHub
Bookkeeper     ❌    ✔       Git (GitHub) / Jira
CommonsMath    ✔     ❌      Git (GitHub) / Jira
CommonsLang    ✔     ❌      Git (GitHub) / Jira
Derby          ✔     ✔       Git (GitHub) / Jira
Eclipse        ✔     ❌      Git (eclipse.org) / Bugzilla
Hibernate      ❌    ✔       Git (GitHub) / Jira
iBatis         ✔     ❌      Git (GitHub) / GitHub
JabRef         ✔     ✔       Prev.: Git (SourceForge) / SourceForge; 2015: Git (GitHub) / GitHub
jEdit          ✔     ✔       SVN (SourceForge) / SourceForge
Lucene         ❌    ✔       Git (GitHub) / Jira
Mahout         ❌    ✔       Git (GitHub) / Jira
MuCommander    ✔     ✔       Git (GitHub) / GitHub
Mylyn          ✔     ❌      Git (eclipse.org) / Bugzilla
OpenJPA        ❌    ✔       Git (GitHub) / Jira
Pig            ❌    ✔       Git (GitHub) / Jira
Rhino          ✔     ❌      Git (GitHub) / GitHub
Solr           ❌    ✔       Git (GitHub) / Jira
Tika           ❌    ✔       Git (GitHub) / Jira
Zookeeper      ❌    ✔       Git (GitHub) / Jira


A.2 Lookup Table


A.3 Ranks and Reciprocal Ranks calculated from the discussed Models

Query      LDA rank  LDA reciprocal rank  PA rank  PA reciprocal rank
query 1    10   0.1          23   0.043478
query 2    10   0.1          4    0.25
query 3    1    1            27   0.037037
query 4    1    1            3    0.333333
query 5    116  0.00862069   11   0.090909
query 6    10   0.1          4    0.25
query 7    152  0.006578947  126  0.007937
query 8    193  0.005181347  5    0.2
query 9    1    1            2    0.5
query 10   2    0.5          18   0.055556
query 11   1    1            1    1
query 12   341  0.002932551  55   0.018182
query 13   210  0.004761905  10   0.1
query 14   244  0.004098361  22   0.045455
query 15   10   0.1          4    0.25
query 16   211  0.004739336  9    0.111111
query 17   189  0.005291005  29   0.034483
query 18   7    0.142857143  31   0.032258
query 19   54   0.018518519  126  0.007937
query 20   69   0.014492754  10   0.1
query 21   251  0.003984064  16   0.0625
query 22   191  0.005235602  29   0.034483
query 23   7    0.142857143  18   0.055556
query 24   64   0.015625     11   0.090909
query 25   379  0.002638522  120  0.008333
query 26   16   0.0625       31   0.032258
query 27   260  0.003846154  100  0.01
query 28   14   0.071428571  33   0.030303
query 29   122  0.008196721  1    1
query 30   146  0.006849315  7    0.142857
query 31   431  0.002320186  389  0.002571
query 32   437  0.00228833   294  0.003401
query 33   139  0.007194245  7    0.142857
query 34   1    1            4    0.25
query 35   291  0.003436426  23   0.043478
query 36   1    1            10   0.1
query 37   24   0.041666667  33   0.030303
query 38   1    1            1    1
query 39   10   0.1          6    0.166667
query 40   101  0.00990099   3    0.333333
query 41   192  0.005208333  175  0.005714
query 42   10   0.1          31   0.032258
query 43   99   0.01010101   3    0.333333
query 44   1    1            1    1
query 45   2    0.5          18   0.055556
query 46   3    0.333333333  20   0.05
query 47   225  0.004444444  22   0.045455
query 48   353  0.002832861  19   0.052632
query 49   141  0.007092199  35   0.028571
query 50   2    0.5          2    0.5
query 51   11   0.090909091  31   0.032258
query 52   162  0.00617284   126  0.007937
query 53   10   0.1          4    0.25
query 54   85   0.011764706  17   0.058824
query 55   15   0.066666667  5    0.2
query 56   309  0.003236246  36   0.027778
query 57   385  0.002597403  65   0.015385
query 58   9    0.111111111  23   0.043478
query 59   126  0.007936508  8    0.125


query 60   2    0.5          18   0.055556
query 61   360  0.002777778  36   0.027778
query 62   92   0.010869565  3    0.333333
query 63   4    0.25         21   0.047619
query 64   1    1            27   0.037037
query 65   10   0.1          22   0.045455
query 66   303  0.00330033   210  0.004762
query 67   63   0.015873016  327  0.003058
query 68   165  0.006060606  351  0.002849
query 69   1    1            24   0.041667
query 70   8    0.125        1    1
query 71   351  0.002849003  296  0.003378
query 72   1    1            27   0.037037
query 73   109  0.009174312  4    0.25
query 74   436  0.002293578  392  0.002551
query 75   75   0.013333333  13   0.076923
query 76   110  0.009090909  12   0.083333
query 77   9    0.111111111  23   0.043478
query 78   46   0.02173913   12   0.083333
query 79   99   0.01010101   11   0.090909
query 80   116  0.00862069   1    1
query 81   10   0.1          22   0.045455
query 82   9    0.111111111  4    0.25
query 83   1    1            27   0.037037
query 84   1    1            4    0.25
query 85   112  0.008928571  11   0.090909
query 86   11   0.090909091  5    0.2
query 87   140  0.007142857  124  0.008065
query 88   112  0.008928571  5    0.2
query 89   1    1            2    0.5
query 90   7    0.142857143  18   0.055556
query 91   1    1            1    1
query 92   336  0.00297619   46   0.021739
query 93   190  0.005263158  9    0.111111
query 94   277  0.003610108  23   0.043478
query 95   10   0.1          4    0.25
query 96   177  0.005649718  8    0.125
query 97   214  0.004672897  29   0.034483
query 98   8    0.125        31   0.032258
query 99   51   0.019607843  126  0.007937
query 100  48   0.020833333  10   0.1
query 101  197  0.005076142  14   0.071429
query 102  316  0.003164557  28   0.035714
query 103  2    0.5          17   0.058824
query 104  53   0.018867925  14   0.071429
query 105  388  0.00257732   120  0.008333
query 106  12   0.083333333  22   0.045455
query 107  307  0.003257329  100  0.01
query 108  23   0.043478261  33   0.030303
query 109  126  0.007936508  1    1
query 110  131  0.007633588  7    0.142857
query 111  422  0.002369668  394  0.002538
query 112  451  0.002217295  294  0.003401
query 113  185  0.005405405  8    0.125
query 114  1    1            4    0.25
query 115  204  0.004901961  23   0.043478
query 116  1    1            10   0.1
query 117  24   0.041666667  33   0.030303
query 118  1    1            1    1
query 119  13   0.076923077  5    0.2
query 120  92   0.010869565  3    0.333333
query 121  194  0.005154639  166  0.006024
query 122  8    0.125        31   0.032258
query 123  92   0.010869565  3    0.333333
query 124  1    1            1    1


query 125  3    0.333333333  18   0.055556
query 126  3    0.333333333  20   0.05
query 127  192  0.005208333  22   0.045455
query 128  71   0.014084507  19   0.052632
query 129  191  0.005235602  38   0.026316
query 130  1    1            2    0.5
query 131  8    0.125        33   0.030303
query 132  112  0.008928571  125  0.008
query 133  9    0.111111111  4    0.25
query 134  35   0.028571429  17   0.058824
query 135  10   0.1          5    0.2
query 136  310  0.003225806  38   0.026316
query 137  314  0.003184713  65   0.015385
query 138  10   0.1          23   0.043478
query 139  137  0.00729927   7    0.142857
query 140  3    0.333333333  18   0.055556
query 141  474  0.002109705  36   0.027778
query 142  115  0.008695652  3    0.333333
query 143  3    0.333333333  21   0.047619
query 144  1    1            27   0.037037
query 145  11   0.090909091  22   0.045455
query 146  392  0.00255102   172  0.005814
query 147  80   0.0125       313  0.003195
query 148  156  0.006410256  349  0.002865
query 149  1    1            5    0.2
query 150  3    0.333333333  1    1
query 151  351  0.002849003  296  0.003378
query 152  1    1            28   0.035714
query 153  131  0.007633588  4    0.25
query 154  311  0.003215434  389  0.002571
query 155  165  0.006060606  13   0.076923
query 156  112  0.008928571  11   0.090909
query 157  9    0.111111111  22   0.045455
query 158  38   0.026315789  12   0.083333
query 159  109  0.009174312  11   0.090909
query 160  126  0.007936508  1    1

A.4 Intermediate Values of Wilcoxon's Signed-Rank Test

Variable                           Rank          Reciprocal Rank
T (smaller sum of signed ranks)    2715.5        5151.5
N (sample size)                    153           153
μ                                  5890.5        5890.5
σ                                  548.9965847   548.9965847
Z                                  5.782367484   1.34518141
φ(|Z|)                             1.00E+00      0.91
p                                  7.36566E-09   0.178566658
