
Masaryk University Faculty of Informatics

Optimal Recommendations for Source Code Reviews

Master’s Thesis

Jakub Lipčák

Brno, Spring 2017


Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jakub Lipčák

Advisor: Bruno Rossi, PhD


Acknowledgement

I would like to thank my advisor, Bruno Rossi, PhD., for all his help, support and guidance, which was very valuable and professional. I would also like to express my gratitude to my family for supporting me in my studies.

Abstract

Software code reviews are an important part of the development process, leading to better quality and reduced overall costs. However, finding appropriate code reviewers is a complex and time-consuming task. In this thesis we analyze several Code Reviewer Recommendation Algorithms designed to find appropriate code reviewers for newly opened pull requests. We re-implemented two of the major approaches (ReviewBot and RevFinder) and proposed several modifications which improved the Top-5 accuracy of the RevFinder algorithm by 12.90% on average. Furthermore, we propose a novel code reviewer recommendation approach based on the Naive Bayes technique that takes into consideration the most important features of previously reviewed pull requests. Experiments using 35 239 pull requests from three open source projects show that our approach provides 88.59% accuracy on average considering a top-10 recommendation, which is 12.35% better than the average accuracy provided by the RevFinder algorithm. Finally, we analyze the reproducibility problems of existing research in this area observed during the work on this thesis and propose some recommendations for future research.

Keywords

Source Code Review, Code Reviewer Recommendation System, Pull Request, Distributed Development, Machine Learning, Gerrit


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Goals
  1.3 Thesis Structure

2 Code Reviews
  2.1 What Do We Know about Code Reviews?
    2.1.1 Code Review Process
    2.1.2 Objectives of Code Reviews
    2.1.3 Factors Influencing Code Reviews
  2.2 Code Review Tools
    2.2.1 Gerrit
    2.2.2 GitHub
    2.2.3 Crucible
    2.2.4 Collaborator

3 Code Reviewer Recommendation Algorithms
  3.1 Definition
  3.2 Evaluation Metrics
    3.2.1 Top-k Accuracy
    3.2.2 Mean Reciprocal Rank
    3.2.3 Metrics from the Information Retrieval Domain
  3.3 Existing Recommendation Algorithms
    3.3.1 Traditional Approaches
    3.3.2 Cross-project and Technology Experience
    3.3.3 Machine Learning
    3.3.4 Social Relations
    3.3.5 Features Summary
  3.4 Proposed Recommendation Algorithm
    3.4.1 Naive Bayes Classification
    3.4.2 Feature Extraction
    3.4.3 Reviewers Recommendation

4 Design and Implementation
  4.1 Implemented Algorithms
    4.1.1 ReviewBot Implementation
    4.1.2 RevFinder Implementation
    4.1.3 Naive Bayes Reviewer Recommendation
  4.2 Implementation Details
    4.2.1 Design and Technologies
    4.2.2 Communication Interface

5 Empirical Evaluation
  5.1 Datasets
    5.1.1 Description of Data Collection
    5.1.2 Structure of Data Collection
  5.2 Experimental Setup
  5.3 Results
    5.3.1 Baseline Approach Accuracy
    5.3.2 Proposed Approach Accuracy
    5.3.3 Comparison of Solutions
    5.3.4 Discussion
    5.3.5 Threats to Validity

6 Reproducibility of Reviewer Recommendation Algorithms
  6.1 Reproducible Research in Software Engineering
  6.2 Reproducibility Problems
    6.2.1 Source Code
    6.2.2 Data Sets
    6.2.3 Metrics
    6.2.4 Reproducibility Summary

7 Conclusion

Bibliography

A Google Chrome Extension
B GitHub Repository
C Data Model and Datasets
D Configuration and Deployment
  D.1 Configuration
  D.2 Deployment

1 Introduction

As distributed collaboration in the development of open source software projects gains more and more popularity, there is a need for tools able to support this kind of development process [1]. The pull-based model has been widely adopted across several open source projects in recent years as a method allowing external contributors to propose changes to the code base of the projects. Software code reviews are a crucial part of this model. They are considered one of the most effective ways of improving the overall quality of the source code [2]. The outcomes of code reviews depend strongly on the code reviewers, and it has turned out that finding appropriate code reviewers for pull requests in distributed environments is often a non-trivial problem [3].

It has been found beneficial to have effective tools that automate the recommendation of software code reviewers for newly created pull requests [1]. Reviewer assignment is a necessary step before reviewing the code changes, and its automation can speed up the whole process of integrating new functionality into the main development branch as well as increase overall software quality [4]. A lot of research has already been done in this area and there are several existing approaches dealing with this task [4, 5, 6]. However, it is strongly believed that there are still areas worthy of further research in order to increase the efficiency and relevance of the results provided by current Code Reviewer Recommendation Algorithms [1].

Software code reviews are expensive: a good understanding of large code changes by code reviewers is a time-consuming process, and finding appropriate code reviewers can also be very labor-intensive for developers. Code Reviewer Recommendation Algorithms should reduce the costs of this process. The automated recommendation of code reviewers would reduce the effort necessary to find appropriate code reviewers, and relevant results have the potential to reduce the time spent by code reviewers on understanding large code changes [1].

1.1 Problem Statement

Thongtanunam et al. [4] examined comments from more than 1 400 representative review samples of four open source projects. They discovered that 4%–30% of reviews have a code reviewer assignment problem, and that it takes approximately 12 days longer to approve a code change affected by such a problem. This decreases the effectiveness of the pull-based model. In an ideal scenario, code reviewers would be assigned to pull requests immediately after their creation, but this is currently not common. The pull-based model thus has the potential to be more effective, and enhancements in this area are the main motivation for this thesis.

1.2 Goals

The problem stated in Section 1.1 can be addressed by better automated recommendation tools. We have identified two goals worth researching in this area.

The main goal of this thesis is to evaluate several methods dealing with the recommendation of the most appropriate code reviewers for software code changes. It should provide new approaches and ideas capable of outperforming some of the existing algorithms. The implemented prototype will be able to recommend relevant software code reviewers and will be evaluated on a set of projects. Furthermore, the implemented prototype should be usable in a real environment.

The second goal of this thesis is related to the reproducibility and replicability problems of past research about Code Reviewer Recommendation Algorithms. This thesis should report on such problems in existing research in this area and provide some recommendations for avoiding them in the future. It should also follow reproducible research guidelines.

1.3 Thesis Structure

This thesis is divided into the following chapters:

Chapter 2: Code Reviews gives an overview of the code review process, its objectives, the factors influencing it and existing code review tools.

Chapter 3: Code Reviewer Recommendation Algorithms describes the most relevant algorithms in this area and common metrics for their evaluation. It also describes our own proposal based on the Naive Bayes technique, as we believe that this approach can achieve better results than some of the existing algorithms.

Chapter 4: Design and Implementation provides an overview of the implementation, describing the implemented algorithms and the technologies used.

Chapter 5: Empirical Evaluation presents the results of our empirical evaluation and their discussion. It also describes the datasets used for testing the algorithms and provides answers to the research questions.

Chapter 6: Reproducibility of Reviewer Recommendation Algorithms deals with reproducibility problems in Software Engineering and especially in the area of Code Reviewer Recommendation Algorithms.

Chapter 7: Conclusion summarizes the thesis, its contribution, the fulfillment of the goals and ideas for future work.


2 Code Reviews

This chapter describes the code review process, the factors influencing it and the tools supporting it. The code review process was formalized in 1976 by M. E. Fagan. It was a highly structured process based on line-by-line group reviews in the form of Inspections [7]. Code review practices have changed rapidly in recent years [8]. The modern code review process has become informal (in contrast to Fagan's definition), tool-based [9], more lightweight, continuous and asynchronous [10].

2.1 What Do We Know about Code Reviews?

Code reviews are understood as a process executed to improve the overall quality of software: an inspection of a code change by independent developers. Baum et al. [8] interviewed software engineers from 19 companies and summarized their common idea of code reviews into the following definition:

Definition 1. Code Review is a software quality assurance activity with the following properties:

∙ The main checking is done by one or several humans.

∙ At least one of these humans is not the code’s author.

∙ The checking is performed mainly by viewing and reading source code.

∙ It is performed after implementation or as an interruption of implementation.

According to the definition from SWEBOK Guide [11], software quality assurance can be understood as “a set of activities that define and assess the adequacy of software processes to provide evidence that establishes confidence that the software processes are appropriate and produce software products of suitable quality for their intended purposes” [11].


2.1.1 Code Review Process

There have been several studies dealing with the code review process in companies. It was found that code review processes in companies are nowadays similar to those used by open source projects [2]. A large number of companies perform almost no code reviews, and the processes used in companies that do perform code reviews vary considerably [2, 12]. The exact steps performed during the code review process can vary from project to project. Review processes can be distinguished, for example, by whether meetings (in person or electronic) take place [11]. Other differences can be categorized according to rigorousness and formality. Karl Wiegers identified six types of code reviews based on the degree of their formality [13], which can be seen in Figure 2.1. While Inspections are the most systematic type of code review, Ad hoc reviews are completely informal and unplanned.

Figure 2.1: Peer review formality spectrum [13].

Figure 2.2 describes a practical example of the Gerrit code review process in the Android Open Source Project. The first step is the Pre-Review phase, during which an author writes a patch and a committer submits it for review (committers can be authors as well). In the Review phase, code reviewers perform code reviews. Any contributor can perform a code review, and reviewers are typically invited to review the patches. However, only code reviewers with the Verification or Approval authority can decide whether the patch will be accepted or not. Verifiers handle the


building and testing of patches. This task is usually done by experienced developers or by automated tools. Approvers decide whether the patch is acceptable for the source code; it has to follow the best practices and rules established by the project. Once the patch is approved, it will be merged into the repository by submitters. In reality, the interaction between the roles in the system can be much more complex [14, 15].

Figure 2.2: Simplified Version of the Code Review Process [14].

2.1.2 Objectives of Code Reviews

The code review process has several objectives. Code reviews can reduce costs and improve overall software quality. The main objective is to reveal implementation errors and bugs before they are integrated into the system. Code reviews should also reveal violations of best practices and help maintain the consistency of the code. Team members learn from each other by doing code reviews. This improves their understanding of the code base of the projects as well as their

coding skills. They can learn new ways of solving problems by reading the code of other developers. Managers gain better insight into the products thanks to code reviews and can identify possible improvement opportunities in advance. In open source projects, code reviews play a necessary role, serving as a method for ensuring the quality of code from external contributors [16]. Code reviews have many positive effects which lead to better code quality [2]. On the other hand, code reviewing is a time-consuming process, as there is a lot of human effort involved [17]. Project managers might be concerned that code reviews slow projects down, whereas testing would be faster. Code reviews could also lead to negative moods in the team, because some team members might fear public criticism caused by other team members reviewing their source code [16].

2.1.3 Factors Influencing Code Reviews

Tools that support code reviews are needed in order to reduce the significant human effort involved and to make this process more efficient [1, 5]. Several factors influence the code review process, are important for code reviews and would benefit from tool support [1]:

1. Choosing the most appropriate code reviewer: The larger the team is, the more difficult it is to find an appropriate code reviewer. It is usually easy to find a suitable code reviewer in a small team.

2. Reducing the size of the changeset that has to be reviewed: Given a large changeset, code reviewers try to filter out non-relevant files [1]. This is an error-prone task. Tool support for automatic distinction between relevant and less relevant fragments of source code could bring benefits.

3. Helping the code reviewer understand the change: Several improvements that could help code reviewers understand the changes have been identified. Hyperlinking support known from IDEs, visualization of changes and a logical ordering of changed files are missing in today's code review tools, and further research in this area is needed.


4. Decreasing the need to understand the change: A solid understanding of the changes by code reviewers is necessary for in-depth code reviews. Future research in the area of static code analysis could possibly come up with algorithms that would decrease the need for code reviewers to understand the changes. Algorithms predicting the outcome of a code review in advance could also be beneficial. One study dealing with this task is mentioned in Subsection 3.3.3 [18].

2.2 Code Review Tools

This section describes some of the existing tools supporting the code review process in open source and private projects. They are all integrated with the most popular version control systems.

2.2.1 Gerrit

Gerrit1 is a popular team collaboration web-based tool closely integrated with the Git2 version control system. It provides a gateway between developers and Git. Every commit has to be reviewed before it can be accepted. Figure 2.3 describes the interaction of developers and code reviewers with a Git repository where Gerrit is used. Developers can only push to the Pending Changes store. Once a change has been reviewed, it will be submitted to the Authoritative Repository. The code review process can be done via the web interface of Gerrit. It provides several transparent views with the possibility to add inline comments on reviewed lines. These comments can be seen by anybody and can be discussed by other team members later. Reviewing is completed by entering a code review label and a final message. The label has five possible values, and the selected value determines what happens next with the change request. Gerrit also allows automated tools to analyze, label and comment on pending change requests [19].

1. https://www.gerritcodereview.com/ 2. https://git-scm.com/


Figure 2.3: Integration of Gerrit and Git [19].

2.2.2 GitHub

GitHub3 is a widespread web-based hosting service used by millions of users. It is based on the Git version control system and provides several useful features for collaborative work, thanks to which it became the most popular platform for open source projects. One of them is the pull request functionality. The recommended workflow for new project features begins with the creation of a feature branch. New commits are pushed into the feature branch and, once the new functionality is ready, a pull request should be created. The creator of the pull request has to choose the base branch into which the new functionality should be merged later. The overview page of the pull request contains all the necessary information about the pull request and about all differences between the feature branch and the base branch. Other contributors can comment on

3. https://github.com/


the changes and mention other relevant people, who will be notified about it. The pull request can be approved by the repository administrator or by collaborators with write access to the repository. Once the pull request is approved and no merge conflicts are present, GitHub can merge the changes to the base branch automatically [20]. The information about social interaction between users in Git repositories can be useful for Code Reviewer Recommendation Algorithms. One approach that uses this information for reviewer recommendation is mentioned in Subsection 3.3.4.

2.2.3 Crucible

Crucible4 is a web-based collaborative code review tool developed by Atlassian5. It supports not only Git repositories but can also be used with the SVN, Mercurial, CVS or Perforce version control systems. It is an enterprise tool providing a rich set of useful features such as iterative reviews, pre-commit reviews, charts and reports, inline discussions, integration with the JIRA software and many others. Despite its enterprise focus, free licenses are offered to qualified open source projects.

2.2.4 Collaborator

Collaborator6 is another commercial code review tool, which provides functionality for sophisticated code review workflows. It can be used to review documents throughout all stages of the development process, such as user stories and requirements in the design stage, source code and test cases in the development and testing stages, and deployment scripts in the deployment stage. Collaborator supports integration with GitHub, JIRA and the most popular Integrated Development Environments. It also contains tools for Audit Management and Reporting, whose outcomes can continuously improve the review process. All these features are well suited to an enterprise environment.

4. https://www.atlassian.com/software/crucible 5. https://www.atlassian.com/ 6. https://smartbear.com/product/collaborator/overview/


3 Code Reviewer Recommendation Algorithms

This chapter provides an overview of algorithms developed for the recommendation of software code reviewers. Section 3.1 formally defines Code Reviewer Recommendation Algorithms in the context of this thesis. Section 3.2 presents the most common metrics used to evaluate the accuracy of Code Reviewer Recommendation Algorithms. Section 3.3 describes several existing recommendation algorithms and Section 3.4 presents our proposed approach based on the Naive Bayes technique.

3.1 Definition

Definition 2 formally describes what is meant by the term Code Reviewer Recommendation Algorithm in this thesis.

Definition 2. Code Reviewer Recommendation Algorithm for a review request rq given a set S of all previous reviews is defined as:

\[ CRR_S(rq) = (rev_1, rev_2, \dots, rev_n) \]

where the function CRR_S represents the recommendation algorithm and (rev_1, rev_2, ..., rev_n) is an ordered tuple T returned by the algorithm: rev_1 is considered the most relevant code reviewer candidate by the algorithm, rev_2 the second most relevant candidate, and so on. For every candidate rev_x ∈ T there exists at least one review r ∈ S that was reviewed by rev_x.

3.2 Evaluation Metrics

This section presents five of the most common metrics used to evaluate the accuracy of existing Code Reviewer Recommendation Algorithms.


3.2.1 Top-k Accuracy

Top-k accuracy is the percentage of reviews for which at least one code reviewer was correctly recommended within the top k positions of the result list. It is defined by the following formula:

\[ \text{Top-}k\text{ accuracy} = \frac{\sum_{r \in R} \mathit{isCorrect}(r, k)}{|R|} \tag{3.1} \]

where:

∙ R is a set of reviews.

∙ isCorrect(r, k) is a function returning 1 if at least one code reviewer recommended within the top k positions approved the review r; otherwise it returns 0.

We chose k to be 1, 3, 5 and 10 in our tests.
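For illustration, the sketch below computes Top-k accuracy exactly as defined by Equation 3.1; the EvaluatedReview interface is a hypothetical stand-in for our review entity and is not part of the implementation described later.

    import java.util.List;
    import java.util.Set;

    // Minimal view of an evaluated review; a stand-in for the real review entity.
    interface EvaluatedReview {
        List<String> recommendedReviewers(); // sorted, most relevant first
        Set<String> actualReviewers();       // reviewers who actually approved the review
    }

    final class TopKAccuracy {
        // Fraction of reviews with at least one correct reviewer within the top k positions.
        static double topK(List<EvaluatedReview> reviews, int k) {
            if (reviews.isEmpty()) {
                return 0.0;
            }
            int hits = 0;
            for (EvaluatedReview review : reviews) {
                List<String> recommended = review.recommendedReviewers();
                int limit = Math.min(k, recommended.size());
                for (int i = 0; i < limit; i++) {
                    if (review.actualReviewers().contains(recommended.get(i))) {
                        hits++;   // isCorrect(r, k) = 1; count the review only once
                        break;
                    }
                }
            }
            return (double) hits / reviews.size();
        }
    }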

3.2.2 Mean Reciprocal Rank

Mean Reciprocal Rank (MRR) describes the average position of the actual code reviewer in the list returned by the recommendation algorithm. It is defined by the following formula:

\[ \text{Mean Reciprocal Rank} = \frac{1}{|R|} \sum_{r \in R} \frac{1}{\mathit{rank}(r, \mathit{recommend}(r))} \tag{3.2} \]

where:

∙ R is a set of reviews.

∙ recommend(r) is a function returning a sorted list of code reviewers recommended for review r.

∙ rank(r, l) is a function returning the rank of the code reviewer who approved review r in the sorted list l. The value of 1/rank(r, l) is taken to be 0 if there is no such code reviewer in the list l.
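A matching sketch for Equation 3.2, reusing the hypothetical EvaluatedReview interface from the previous example; reviews whose actual reviewer never appears in the recommendation list contribute 0 to the sum.

    import java.util.List;

    final class MeanReciprocalRankMetric {
        // 1 / (1-based position) of the first correctly recommended reviewer, or 0 if none.
        static double reciprocalRank(EvaluatedReview review) {
            List<String> recommended = review.recommendedReviewers();
            for (int i = 0; i < recommended.size(); i++) {
                if (review.actualReviewers().contains(recommended.get(i))) {
                    return 1.0 / (i + 1);
                }
            }
            return 0.0;
        }

        static double meanReciprocalRank(List<EvaluatedReview> reviews) {
            double sum = 0.0;
            for (EvaluatedReview review : reviews) {
                sum += reciprocalRank(review);
            }
            return reviews.isEmpty() ? 0.0 : sum / reviews.size();
        }
    }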


3.2.3 Metrics from the Information Retrieval Domain

Precision, Recall and F-Measure are metrics from the information retrieval domain and they are sometimes used to evaluate the accuracy of Code Reviewer Recommendation Algorithms [21, 22]. These metrics are computed for different sizes of the top-k recommended code reviewers, with the k value usually ranging from 1 to 10. The formulas of these metrics are listed below [22]:

\[ \text{Precision} = \frac{|\mathit{Rec\_Reviewers} \cap \mathit{Actual\_Reviewers}|}{|\mathit{Rec\_Reviewers}|} \tag{3.3} \]

\[ \text{Recall} = \frac{|\mathit{Rec\_Reviewers} \cap \mathit{Actual\_Reviewers}|}{|\mathit{Actual\_Reviewers}|} \tag{3.4} \]

\[ \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.5} \]

where:

∙ Rec_Reviewers is a set of recommended code reviewers.

∙ Actual_Reviewers is a set of actual code reviewers.

These metrics are usually used together. As they evaluate the intersection of the top-k recommended code reviewers with all actual code reviewers, they might not be appropriate for datasets containing reviews where the set of actual code reviewers is smaller than the k value [4].
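The same metrics can be computed with a few set operations, as in the sketch below (a hedged illustration of Equations 3.3-3.5 rather than code from our prototype):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class RetrievalMetrics {
        // Returns {precision, recall, f-measure} for a top-k recommendation.
        static double[] evaluate(List<String> recommended, Set<String> actual, int k) {
            List<String> topK = recommended.subList(0, Math.min(k, recommended.size()));
            if (topK.isEmpty() || actual.isEmpty()) {
                return new double[] { 0.0, 0.0, 0.0 };
            }
            Set<String> correct = new HashSet<>(topK);
            correct.retainAll(actual);                                 // Rec_Reviewers ∩ Actual_Reviewers
            double precision = (double) correct.size() / topK.size();
            double recall = (double) correct.size() / actual.size();
            double fMeasure = precision + recall == 0.0
                    ? 0.0
                    : 2 * precision * recall / (precision + recall);
            return new double[] { precision, recall, fMeasure };
        }
    }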

3.3 Existing Recommendation Algorithms

We divided the most relevant existing recommendation algorithms into four groups based on the features they process and the techniques they use to recommend code reviewers (Traditional Approaches, Cross-project and Technology Experience, Machine Learning, Social Relations). The accuracy of these algorithms was mostly evaluated using the Top-k Accuracy and Mean Reciprocal Rank (MRR) metrics.


3.3.1 Traditional Approaches

Traditional recommendation approaches process historical project review data and use imperative algorithms to find the most relevant code reviewers.

3.3.1.1 ReviewBot

ReviewBot is a technique proposed by Balachandran [5]. It is a code reviewer recommendation approach based on the assumption that lines of code changed in a pull request should be reviewed by the same code reviewers who had previously reviewed or modified the same lines of code. The ReviewBot algorithm has two phases:

1) Computing Line Change History: The ReviewBot algorithm iterates over all lines modified in the pull request and computes the line change history for all of them. It assigns points to previous reviews related to the same lines; the more recent the review, the more points it is assigned. The number of assigned points may vary by file type, for example, .xml files can be prioritized above .properties files.

2) Reviewer Ranking: Previous reviews were assigned points in the first step. In the second step, these points are transferred to code reviewer candidates: each code reviewer and submitter of the reviews in the line change history receives the corresponding points. The result of this step is a list of reviewer candidates sorted by their points. Reviewer candidates with the most points are considered the most relevant code reviewers to review the pull request.

We tried to improve this algorithm; however, we did not find any significant improvements for this approach. The ReviewBot algorithm has several problems. One of them concerns newly created files: these files do not have any line change history and thus cannot be correctly processed using this approach. Another problem is that most of the lines in an average project are only changed once [21]. The accuracy of the results returned by the ReviewBot algorithm is therefore limited.

3.3.1.2 RevFinder

RevFinder [4] is an approach based on the location of files included in pull requests. The idea of this approach is that files located in


similar file paths contain similar functionality and therefore should be reviewed by similar code reviewers. The RevFinder algorithm can be described by Figure 3.1.

Figure 3.1: Graphical description of the RevFinder algorithm [22].

The first part of the RevFinder approach is the Code Reviewers Ranking Algorithm. It compares the file paths included in a new pull request with all previously reviewed file paths. It uses the following four string comparison techniques to examine the similarity between new and historical file paths: Longest Common Prefix, Longest Common Suffix, Longest Common Substring and Longest Common Subsequence. Code reviewer candidates are assigned points in this step: the more similar the file paths are, the higher the number of points assigned to the code reviewers who previously reviewed them. The second important part of the RevFinder approach is the combination technique. The results of the four string comparison techniques are combined using the Borda count combination method. A sorted list of candidates with their scores is then returned as the output of the RevFinder algorithm. This approach was tested on more than 42 000 reviews of four open source projects and it was 4 times more accurate than the ReviewBot algorithm.


Although this algorithm was able to correctly recommend 79% of reviews considering a top-10 recommendation, there is still some space for improvement. RevFinder does not consider retired code reviewers, so its accuracy might suffer from recommending code reviewers who no longer work on the project. It is also not able to recommend code reviewers for new files for which no file path similarity is found. Another improvement over the presented RevFinder algorithm can be achieved by extending the dataset with information about the project name. We implemented and tested some of these changes; our results are discussed in detail in Section 5.3.
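To make the string comparison step more concrete, the sketch below shows one of the four techniques, the Longest Common Prefix, measured in path components; the example paths are invented, and the scoring and Borda count combination of the full RevFinder algorithm are omitted.

    final class FilePathSimilarity {
        // Number of leading path components shared by two file paths
        // (the Longest Common Prefix measure used by RevFinder).
        static int longestCommonPrefix(String path1, String path2) {
            String[] first = path1.split("/");
            String[] second = path2.split("/");
            int common = 0;
            while (common < first.length && common < second.length
                    && first[common].equals(second[common])) {
                common++;
            }
            return common;
        }

        public static void main(String[] args) {
            // The paths share the four components src, main, java and ui, so this prints 4.
            System.out.println(longestCommonPrefix(
                    "src/main/java/ui/LoginView.java",
                    "src/main/java/ui/LogoutView.java"));
        }
    }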

3.3.2 Cross-project and Technology Experience

Approaches based on cross-project and technology experience use information about the developers' expertise to recommend appropriate code reviewers.

3.3.2.1 CORRECT

CORRECT (Code Reviewer Recommendation based on Cross-project and Technology experience) [21] is another code reviewer recommendation technique. The baseline idea is: "If a past pull request uses similar external libraries or similar specialized technologies to the current pull request, then the past request is relevant to the current request, and thus, its reviewers are also potential candidates for the code review of the current request" [21]. The proposed approach iterates over all files of a newly created pull request and analyzes the libraries imported by these files. The developers who have the most experience with the attached libraries are considered the most relevant code reviewer candidates. The approach was tested on several private projects and was compared against the RevFinder algorithm. CORRECT outperformed RevFinder on these projects, with a Top-5 accuracy more than 11% better and an MRR value higher by 0.02. This approach would be problematic for projects that do not have many external dependencies, as developers' expertise with external libraries would then hardly be helpful. Another threat to the usage


of this algorithm is insufficient information about developers' expertise, which is not always easily available.

3.3.3 Machine Learning

Algorithms in this group use different Machine Learning techniques for the recommendation of code reviewers.

3.3.3.1 CoreDevRec

Automatic Core Member Recommendation for Contribution Evaluation [6] is an approach based on Machine Learning. Its overall process can be described by Figure 3.2.

Figure 3.2: CoreDevRec algorithm [6].

The CoreDevRec algorithm has two phases. The Model Building Phase builds a prediction model from historical pull requests using three feature types for the prediction:

1. Path Features are extracted from the file paths. File path similarities are identified using the Support Vector Machine algorithm, unlike the four string comparison techniques used in the RevFinder algorithm.

2. Relationship Features are extracted from social relationship information between GitHub developers.


3. Activeness Features are extracted from six features chosen to calculate the actual activeness of developers in order to recommend active and available code reviewers.

The Prediction Phase uses Machine Learning techniques to recommend code reviewers. Each pull request is represented as a weighted vector and each feature is an element of that vector. Several Machine Learning techniques were tried for the Prediction Phase; the Support Vector Machine method had the best results and was hence chosen for CoreDevRec. The CoreDevRec algorithm was tested on five open source GitHub projects. It was compared against RevFinder and outperformed it in all these projects, with the MRR value better by 0.21 on average. Especially the Top-1 accuracy of CoreDevRec was significantly better in comparison with the RevFinder algorithm, with an average gain of 99.5%.

3.3.3.2 Predicting Reviewers and Acceptance of Patches

Improving Code Review by Predicting Reviewers and Acceptance of Patches was proposed by Jeong et al. [18]. This approach is also based on Machine Learning, with the prediction model built from the following features:

1. The Patch meta-data feature includes information such as patch size, patch writer and patch file names.

2. The Patch content feature is extracted from keywords included in the patch, with the aim of recognizing the quality of the patch.

3. Bug report information is also considered a feature, consisting of information such as bug priority, bug severity, time after opening and bug reporter.

The Bayesian Network technique was used to predict reviewers and patch acceptance from the prediction model. The approach was tested on the Firefox and Mozilla Core projects. It showed solid results; however, no comparisons with other algorithms are available. Many of the extracted features were focused more on acceptance prediction than on code reviewer recommendation. Acceptance prediction is also a very important and challenging task, because predicting the


outcome of a review can help code reviewers with prioritization of patches and it can also give early feedback to developers.

3.3.4 Social Relations

Social relations between developers are another aspect analyzed by Code Reviewer Recommendation Algorithms.

3.3.4.1 Comment Network

Yu et al. [22] proposed a Comment Network (CN) based code reviewer recommendation approach that analyzes social relations between contributors and developers. They compared their approach against traditional recommendation algorithms and were able to achieve similar performance. Moreover, they were able to improve the performance of traditional approaches by combining them with their CN-based recommendation algorithm. CN-based recommendation is based on the idea that the interests of developers can be extracted from their commenting interaction. Developers sharing common interests with the originator of a pull request are considered to be appropriate code reviewers. The common interest is considered project specific, so it is built for each project separately. To conserve space, we recommend reading [22] for a detailed description of how the Comment Network is built and how code reviewers are recommended using this algorithm. Traditional recommendation approaches analyze the code reviewer's expertise, whereas the CN-based approach is based on common interest. As these are two different dimensions, the recommendation process can be improved by their integration. Both approaches are regarded as equally important and their combination is used to recommend code reviewers. The mixed approach was tested on a comprehensive dataset of 84 popular GitHub projects and the results showed that the CN-based approach combined with traditional recommendation approaches can achieve significant improvements when compared with pure traditional algorithms. As the results of the evaluation of the CN-based approach presented in this research are comprehensive, we recommend reading Chapter 4 and Chapter 5 of [22], where exact numbers and charts can be found.


3.3.5 Features Summary

Table 3.1 presents a brief summary of all features processed by the Code Reviewer Recommendation Algorithms mentioned in this chapter.

Feature                   Comm. Net. [22]  CoreDevRec [6]  CORRECT [21]  Pred. Rev. [18]  ReviewBot [5]  RevFinder [4]  Total
File paths                                 x                                                             x              2
Social interactions       x                x                                                                            2
Line change history                                                                       x                             1
Reviewer expertise                                         x                                                            1
Activeness of reviewers                    x                                                                            1
Patch meta-data                                                          x                                              1
Patch content                                                            x                                              1
Bug report information                                                   x                                              1
Total                     1                3               1             3                1              1

Table 3.1: Summary of features.

3.4 Proposed Recommendation Algorithm

In Section 3.3, we described several Code Reviewer Recommendation Algorithms and discussed their weaknesses. The most common weaknesses were caused by the inability to recommend code reviewers for newly created files, the recommendation of retired code reviewers, the omission of important features or the unavailability of essential data. During our studies, we discovered several interesting findings that encouraged us to formulate our own recommendation algorithm avoiding the above-mentioned weaknesses. We chose the Naive Bayes technique to recommend code reviewers, as it was already used by other relevant Code Reviewer Recommendation Algorithms [6, 18].


3.4.1 Naive Bayes Classification

Our proposed approach uses Naive Bayes, a Machine Learning technique based on conditional probabilities, to recommend code reviewers. It is the simplest form of a Bayesian Network, assuming conditional independence between the features chosen for the classification. A Naive Bayes model is easy to build and able to achieve surprisingly good results, although feature independence is rarely true in real-world applications [23]. Classification in Machine Learning is the problem of identifying which of a set of categories a new observation belongs to. It is a fundamental issue in Machine Learning and Data Mining [23]. The goal of the classification problem can be defined as follows [24]:

Definition 3. Classification: Given a set of training data points along with associated training labels, determine the class label for an unla- beled test instance.

The Naive Bayes Classifier, or simply Naive Bayes, is a function that assigns a class label to an observed example based on the application of Bayes' theorem.

Definition 4. Bayes’ theorem is defined via the relation [25]:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3.6} \]

where:

∙ P(A|B) is known as the posterior probability and it assesses the probability of observing event A given that B is true.

∙ P(B|A) is called the likelihood function and it reflects the probability of event B given that A is true.

∙ P(A) is known as the prior probability of A and P(B) as the marginal probability of B; they reflect the probability of observing events A or B independently of each other.

The Naive Bayes Classifier reduces its complexity by making the assumption that all attributes are conditionally independent given the value of the class variable [26].


Definition 5. Conditional Independence [26]: Given three sets of random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z. That is:

\[ (\forall i, j, k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k) \tag{3.7} \]

The Naive Bayes Classifier is constructed from a set of training examples using the definition below.

Definition 6. Naive Bayes Classifier [23, 26]: Let C represent the classification variable. Given a set of attribute values (x_1, x_2, ..., x_n), where x_i is the value of attribute X_i, the goal of the Naive Bayes Classifier is to find the value c ∈ C that maximizes P(c | x_1, x_2, ..., x_n). According to Bayes' theorem, the probability of an example E = (x_1, x_2, ..., x_n) being of class c is calculated as:

\[ P(c \mid E) = \frac{P(c)\, P(E \mid c)}{P(E)} \tag{3.8} \]

where:

∙ P(E) is the product of the prior probabilities of all independent attribute values from example E. Since P(E) does not depend on c, it can be ignored and Equation 3.8 can be simplified by only using the numerator.

∙ P(c) assesses the probability of occurrence of event c in the training set.

∙ P(E|c) can be calculated using Equation 3.9. This equation is dramatically simplified thanks to the assumed conditional independence; it is rewritten using repeated applications of Equations 3.6 and 3.7.

\[ P(E \mid c) = \prod_{i=1}^{n} P(x_i \mid c) \tag{3.9} \]
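Applied to our recommendation problem (the features are introduced in Section 3.4.2), the class variable is the code reviewer and the attributes of an example are a modified file path, the project name and the owner of the change request. For every file f, the score of a reviewer candidate rev is therefore proportional to:

\[ P(\mathit{rev} \mid f, \mathit{proj}, \mathit{owner}) \propto P(\mathit{rev}) \cdot P(f \mid \mathit{rev}) \cdot P(\mathit{proj} \mid \mathit{rev}) \cdot P(\mathit{owner} \mid \mathit{rev}) \]

As noted for P(E) above, the denominator P(f, proj, owner) is the same for all candidates and can be ignored when only the ranking of reviewers matters.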


3.4.2 Feature Extraction

We decided to choose three types of features which are important for the choice of a code reviewer:

1. File paths: this is an important feature, as was successfully demonstrated by the RevFinder algorithm.

2. Project name information was chosen as another feature. This information correlates closely with the final code reviewer, which was shown by our improvement of the RevFinder algorithm when using this information in the recommendation process.

3. Owner of the change request is the last feature of our model. Reviewers of change requests need to understand the changes and they often have knowledge and expertise similar to the owners'. This fact, as well as the analysis of our dataset, led us to the assumption that change requests created by the same developers are often reviewed by the same code reviewers. Therefore, we have chosen the Owner information as the third feature.

Figure 3.3: Features of our Naive Bayes model.

Figure 3.3 graphically describes our prediction model. It is built by computing the probabilities of the three features from data in the database. After all the probabilities are calculated, two further modifications of the prediction model are made in order to achieve better results:


1. In the first step we add one more value to every feature of our feature set. This value represents an unknown value and it is used to classify reviews containing a File path, a Project name or an Owner which was not present in our training set. Classification of such reviews would not be possible otherwise.

2. The second modification step of our prediction model is called Probability smoothing [27]. It is used to assign a non-zero probability to unseen events: events that are plausible in reality but were not found in the training data. The exact behavior of Probability smoothing is configurable by the smoothing variable. Probability smoothing is done for every feature as follows. The value of the smoothing variable is divided by the number of values of this feature with non-zero probability. The result is then subtracted from all of these probabilities, and the value of the smoothing variable is equally added to all values with zero probability. This ensures the possibility to evaluate some events as feasible although they never happened in the past, and it is necessary for relevant results of our reviewer prediction. We chose the smoothing variable to be 0.01 for our tests (a small sketch of this step follows below).
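The following sketch illustrates the smoothing step for the probability values of a single feature. It assumes one particular reading of the description above, namely that the smoothing mass is split evenly among the zero-probability values so that the distribution still sums to one; the real implementation may distribute it differently.

    final class ProbabilitySmoothing {
        // Moves a small amount of probability mass (the smoothing variable, e.g. 0.01)
        // from the seen values of a feature to its unseen values.
        static double[] smooth(double[] probabilities, double smoothing) {
            int nonZero = 0;
            int zero = 0;
            for (double p : probabilities) {
                if (p > 0.0) nonZero++; else zero++;
            }
            if (zero == 0 || nonZero == 0) {
                return probabilities.clone();                 // nothing to redistribute
            }
            double[] smoothed = new double[probabilities.length];
            for (int i = 0; i < probabilities.length; i++) {
                smoothed[i] = probabilities[i] > 0.0
                        ? probabilities[i] - smoothing / nonZero  // take mass from seen values
                        : smoothing / zero;                       // give it to unseen values
            }
            return smoothed;
        }
    }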

3.4.3 Reviewers Recommendation

The pseudo-code of our Code Reviewer Recommendation Algorithm is shown in Figure 3.4. The algorithm takes a new review (Rn) as input. This review has to contain information about all modified file paths, the project name and the owner of the review. A sorted list of recommended code reviewer candidates (C) is returned as the output. The prediction model has to be built at the beginning from all previously closed reviews (Lines 7 and 8). This is a time-consuming process. It does not have to be done before every recommendation and could be recomputed daily, weekly or monthly in real projects, depending on the amount of data and the frequency with which new reviews are created. Lines 10 to 20 describe the main loop where code reviewer candidates are recommended. It iterates over all files modified in the review. Feature probabilities are computed separately for every file, and therefore code reviewers are recommended separately too. Unlike the

other algorithms, a Naive Bayes recommendation with a precomputed model can be calculated very quickly. Lines 16 to 19 calculate the scores achieved by code reviewers. The percentage assigned by the Naive Bayes calculation is not important; we only consider the order of the code reviewers. Every code reviewer is assigned a whole number based on their position in the recommendation list. Thanks to Probability smoothing, all code reviewers who reviewed at least one change request in the past appear in the recommendation list. The score calculation is done for every file and the achieved scores are added together. Then, retired reviewers are moved to the end of the result list by the removeRetiredReviewers function (Line 21). This function iterates over all code reviewers and moves down those who have not done any code reviews in the last n months; we chose n to be 12 in our tests. Finally (Line 22), the code reviewers are sorted by their scores and a sorted list of reviewer candidates is returned as the result.

1  Code-Reviewers Ranking Algorithm
2  Input:
3      Rn : A new review
4  Output:
5      C : Sorted list of code reviewer candidates
6  Method:
7  pastReviews ← A list of all previously closed reviews
8  bayesRec ← buildModel(pastReviews)
9  Filesn ← getFiles(Rn)
10 for fn ∈ Filesn do
11     # Get sorted list of recommended reviewers for every
12     # file to be reviewed
13     reviewers ← bayesRec.recommend(fn, Rn.owner, Rn.projectName)
14     # Assign points to reviewers
15     scoreCounter ← 0
16     for r ∈ reviewers do
17         scoreCounter++
18         C[r].score ← C[r].score + reviewers.length − scoreCounter
19     end for
20 end for
21 C ← removeRetiredReviewers(C)
22 return C.sortBy(score)

Figure 3.4: Naive Bayes-based Code Reviewer Recommendation Algorithm.


4 Design and Implementation

A project developed in the Java programming language was implemented as the practical part of this thesis. This chapter describes its functionality and the technologies used. We decided to design the prototype as a Spring Boot1 application, as we wanted to develop an application usable for real projects which use the Gerrit system for code reviews. The prototype is usable in a real environment and has also been used to test the theoretical part of the thesis.

4.1 Implemented Algorithms

The core of the project and of this thesis are three Code Reviewer Recommendation Algorithms (ReviewBot, RevFinder, Naive Bayes-based reviewer recommendation). They all implement the ReviewerRecommendation interface prescribing two methods:

List recommend(PullRequest pullRequest);
void buildModel();

The recommend method takes an object of the PullRequest class as input and returns a sorted list of recommended code reviewers. The buildModel method is used to build a model that will be used for the recommendation; this is necessary for techniques based on Machine Learning. In our case, only the algorithm using Naive Bayes for the recommendation of code reviewers contains an implementation of the buildModel method.
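For illustration, the interface and a typical invocation might look as sketched below; the Reviewer element type, the placeholder entity fields and the client class are assumptions for this example, not the exact classes of our prototype.

    import java.util.List;

    // Placeholder entities standing in for the application's data model.
    class Reviewer { String name; }
    class PullRequest { /* changed file paths, project name, owner, ... */ }

    interface ReviewerRecommendation {
        void buildModel();                                  // precompute a model, if the technique needs one
        List<Reviewer> recommend(PullRequest pullRequest);  // sorted, most relevant reviewer first
    }

    class RecommendationClient {
        List<Reviewer> topFive(ReviewerRecommendation algorithm, PullRequest pullRequest) {
            algorithm.buildModel();                         // only meaningful for the Naive Bayes implementation
            List<Reviewer> candidates = algorithm.recommend(pullRequest);
            return candidates.subList(0, Math.min(5, candidates.size()));
        }
    }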

4.1.1 ReviewBot Implementation

ReviewBot is a recommendation algorithm based on the line change history. Information about added, deleted and inserted lines in each pull request has to be extracted from the Git repository of the project for which code reviewers are recommended. We used the JGit2 library for this purpose.

1. https://projects.spring.io/spring-boot/ 2. https://eclipse.org/jgit/


Another crucial task is the mapping between commits and pull requests, where we had to identify which commits belong to which pull request. We used Change-Ids for this. Information about each Change-Id can be found in the footer of commit messages. A Change-Id is a unique identifier of a Gerrit change request; see [28] for more information about Change-Ids and their mapping to commits. All the Git-related functionality we needed was implemented within the GitBrowser class. When the Change-Ids of historical commits are found, the information about their code reviewers has to be retrieved. This information can be retrieved from the Gerrit server of the project via its REST API. We implemented the GerritBrowser class, which contains methods for communication with the Gerrit API of projects using the gerrit-rest-java-client3 library. The ReviewBot class implements the ReviewBot algorithm using the classes and libraries mentioned above; the implementation exactly follows the specification [5]. We tried several modifications in order to improve its performance, but we were not able to find any significant improvements due to the limitations of this algorithm already mentioned in Chapter 3. As this algorithm needs to work with the Git repository of the project for which code reviewers are recommended, it expects this repository to be cloned in the repos folder of the application before running the recommendation process.
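As a hedged illustration of the commit-to-change mapping, the sketch below walks the history of a cloned repository with JGit and reads the Change-Id footer of every commit; the repository path is a placeholder and error handling is reduced to a generic exception.

    import java.io.File;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.revwalk.RevCommit;

    final class ChangeIdIndex {
        // Maps Gerrit Change-Ids (taken from commit message footers) to their commits.
        static Map<String, RevCommit> buildIndex(String repositoryPath) throws Exception {
            Map<String, RevCommit> index = new HashMap<>();
            try (Git git = Git.open(new File(repositoryPath))) {
                for (RevCommit commit : git.log().call()) {
                    List<String> changeIds = commit.getFooterLines("Change-Id");
                    for (String changeId : changeIds) {
                        index.put(changeId, commit);
                    }
                }
            }
            return index;
        }
    }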

4.1.2 RevFinder Implementation

The RevFinder class contains the implementation of the RevFinder algorithm. Unlike ReviewBot, this algorithm does not need to work with Git or communicate with the Gerrit system; it only uses data from the database. The specification of the RevFinder algorithm is described in Chapter 3. The configuration of our implementation can be set in the application.properties file, as we implemented a few optional modifications in comparison with the specification [4]. The modified algorithm achieved better results than the original implementation and their comparison is

3. https://github.com/uwolfer/gerrit-rest-java-client

discussed in Section 5.3. Our modifications of the original RevFinder algorithm consist of two changes:

∙ Consideration of projects' names: Every project in our dataset consists of several subprojects. We noticed that pull requests in the same subproject are often reviewed by similar code reviewers. The original implementation omits this information and considers only file paths. We decided to use this information in the recommendation process: we added the name of the subproject, without slashes, at the beginning of every file path. The removal of slashes is especially important, as they are often present in the names of the subprojects and the RevFinder algorithm considers them as file path separators.

∙ Consideration of retired code reviewers: We decided to identify retired code reviewers. In our implementation, all code reviewers who have not done any code review over the last twelve months are considered retired. All the retired code reviewers are moved down to the end of the result list, thus appearing below active code reviewers. The exact number of months for retirement can be set in the application.properties file.

We have decided to name the modified version of this algorithm RevFinder+.

4.1.3 Naive Bayes Reviewer Recommendation

Our own recommendation approach is described in Chapter 3. It is implemented in the BayesRec class. We used the Jayes4 library for all the computations involving Naive Bayes. The most complex part of this class is the buildModel method, where all the probabilities necessary for the Naive Bayes computations are calculated.

4. https://github.com/kutschkem/Jayes

4.2 Implementation Details

This section describes implementation details of our application, such as its design, the technologies used and the interface for communication with the application.

4.2.1 Design and Technologies

The application is designed as a Spring Boot project. The project's build is managed by the Apache Maven5 tool. We used the MySQL6 relational database as data storage and the Hibernate framework7 for object-relational mapping. The data model used by the application is described in Appendix C. The whole configuration of the project is set up in the application.properties file.

4.2.2 Communication Interface

As we wanted to implement an application usable in a real environment, we had to implement an interface for communication with the outside world. We decided to implement one simple REST8 endpoint for this purpose. When the application is running, the endpoint is available at the address: http://base-server-url/api/reviewers-recommendation

An HTTP9 GET request can be sent to this address. It expects two parameters. The first HTTP parameter, gerritChangeNumber, is required: our application will get information about the pull request with the specified Gerrit Change number via a Gerrit REST API call to the URL of the Gerrit server of the project specified in the application.properties. The second parameter is optional and can be used to specify the recommendation method. It is called recommendationMethod and has three

5. https://maven.apache.org/ 6. https://www.mysql.com/ 7. http://hibernate.org/ 8. https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_ style.htm 9. https://tools.ietf.org/html/rfc2616

32 4. Design and Implementation allowed values: REVIEWBOT, REVFINDER or BAYES. RevFinder rec- ommendation is used as the default method when no recommendation method is specified. The response sent to the client isa JSON docu- ment containing a sorted list of recommended code reviewers with information about their id, email, name and profile picture in the Gerrit system. For approaches that require building a model in advance (the Naive Bayes approach in our case), the precalculated model will be used for the recommendation via REST API. The first model will be calculated at the server’s deploy time. Every subsequent actualized model will be calculated asynchronously with the frequency specified in the application.properties file.


5 Empirical Evaluation

We performed an empirical evaluation to test the accuracy of RevFinder, RevFinder+ and our Naive Bayes-based approach. This chapter contains a description of the datasets used and presents the results of our empirical evaluation. Lastly, we discuss the research questions and analyze possible threats to the validity of our study.

5.1 Datasets

We used the data collection of three open source projects (Android, Qt, OpenStack) for our empirical evaluation. These projects were chosen because they are large and active, with carefully maintained code review systems, which makes them suitable for a realistic evaluation of Code Reviewer Recommendation Algorithms [4]. This section describes the structure and source of this data collection.

5.1.1 Description of Data Collection

Our data collection was taken from the repository1 of RevFinder's authors. It was used to evaluate the RevFinder [4] algorithm and is publicly available in JSON format. We extended this dataset with information about the owners of reviews, which was necessary for the evaluation of our Naive Bayes-based approach. This information was obtained from the Gerrit system, as all these projects use Gerrit for code reviews. However, we had to reduce the size of the dataset due to missing pull requests in Gerrit: we were not able to find 0.79% of the data from the original dataset in Gerrit. In the end, our data collection contains 35 239 pull requests of three open source projects. The evaluated projects are described below:

1. Android 2 is an operating system for mobile devices developed by Google.

1. https://github.com/patanamon/revfinder 2. https://source.android.com/


2. Qt 3 is a cross-platform application framework developed by The Qt Company and the Qt Project.

3. OpenStack 4 is software for creating private and public clouds. It is supported by some of the biggest companies in software development and by thousands of community members.

5.1.2 Structure of Data Collection

Table 5.1 describes the structure of the data collection used for the empirical evaluation.

                           Android       OpenStack     Qt
Period - from              24 Oct 2008   18 Jul 2011   17 May 2011
Period - to                26 Jan 2012   30 May 2012   25 May 2012
Reviews                    5 029         6 545         23 665
Code reviewers             93            82            200
Owners                     346           324           444
Subprojects                111           35            57
Files                      26 768        11 409        77 767
Avg. reviewers per review  1.06          1.44          1.07
Avg. files per review      8.35          5.93          10.63
Avg. separators per file   5.06          4.53          3.75

Table 5.1: Dataset structure.

3. https://www.qt.io/ 4. https://www.openstack.org/

5.2 Experimental Setup

The assignment of code reviewers to pull requests is time dependent. Therefore, we had to use a setup which ensures that only data from past pull requests is used to recommend code reviewers for future pull requests. For testing the Naive Bayes approach, we chose an 11-fold validation inspired by [29] and [18]. It divides the dataset into 11 equally sized folds. The experiments are run in 10 iterations using different folds as training and test sets, as can be seen in Figure 5.1, and the results of all iterations are then averaged. The RevFinder algorithm always considers all pull requests from the past for the recommendation, without the need to build a model for different training folds. As we wanted to evaluate the accuracy of both algorithms on the same set of pull requests, we only executed the RevFinder recommendation for pull requests contained in folds 2-11.

Figure 5.1: Eleven folds experimental setup [29].
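The sketch below shows one way to organize such a time-ordered split. We assume the cumulative variant, in which iteration i builds the model from all earlier folds and tests on the next one; the RevFinder baseline simply ignores the training set and always uses all past reviews.

    import java.util.ArrayList;
    import java.util.List;

    final class ElevenFoldSetup {

        interface Evaluation<T> {
            void evaluate(List<T> trainingSet, List<T> testSet);
        }

        // Iteration i trains on folds 1..i and tests on fold i+1 (folds are ordered by time),
        // giving 10 iterations for 11 folds.
        static <T> void run(List<List<T>> folds, Evaluation<T> evaluation) {
            for (int test = 1; test < folds.size(); test++) {
                List<T> trainingSet = new ArrayList<>();
                for (int train = 0; train < test; train++) {
                    trainingSet.addAll(folds.get(train));   // all earlier folds
                }
                evaluation.evaluate(trainingSet, folds.get(test));
            }
        }
    }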

5.3 Results

This section describes the results of our empirical evaluation. We evaluated the accuracy of the original RevFinder algorithm, the modified RevFinder algorithm and the Naive Bayes-based approach.


5.3.1 Baseline Approach Accuracy

Table 5.2 presents the accuracy of the RevFinder algorithm.

System       Top-1     Top-3     Top-5     Top-10    MRR
Android      47.29%    72.49%    81.17%    89.50%    0.617
OpenStack    37.49%    65.61%    77.04%    87.66%    0.544
Qt           18.21%    33.21%    40.65%    51.55%    0.297
Average      34.33%    57.10%    66.29%    76.24%    0.486

Table 5.2: RevFinder recommendation results.

5.3.2 Proposed Approach Accuracy

Table 5.3 presents the accuracy of the RevFinder+ algorithm.

System       Top-1     Top-3     Top-5     Top-10    MRR
Android      50.99%    77.49%    84.69%    92.25%    0.654
OpenStack    42.02%    72.87%    83.11%    92.20%    0.597
Qt           22.71%    41.88%    51.42%    66.63%    0.367
Average      38.57%    64.08%    73.07%    83.69%    0.539

Table 5.3: RevFinder+ recommendation results.

Table 5.4 presents the accuracy of our Naive Bayes-based approach.

System       Top-1     Top-3     Top-5     Top-10    MRR
Android      53.65%    80.78%    86.89%    91.79%    0.679
OpenStack    39.93%    70.90%    80.60%    89.69%    0.574
Qt           37.37%    65.67%    75.33%    84.30%    0.540
Average      43.65%    72.45%    80.94%    88.59%    0.597

Table 5.4: Naive Bayes recommendation results.


5.3.3 Comparison of Solutions

The tested solutions were compared by calculating their improvement over the original RevFinder algorithm. The improvement of the results of approach (1) compared with approach (2) is calculated using the following formula:

Improvement = \frac{\mathit{Result}(1) - \mathit{Result}(2)}{\mathit{Result}(2)} \times 100 \qquad (5.1)

where an improvement value above 0 means that approach (1) outperformed approach (2), and vice versa in the case of a negative improvement value.
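For example, the Top-5 accuracy of RevFinder+ on Android is 84.69% (Table 5.3) compared with 81.17% for RevFinder (Table 5.2), which gives an improvement of (84.69 − 81.17) / 81.17 × 100 ≈ 4.34%, the value reported for Android in Table 5.5 below.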

Table 5.5 presents the percentage improvement of RevFinder+ over the original RevFinder algorithm.

System       Top-1     Top-3     Top-5     Top-10    MRR
Android      7.82%     6.89%     4.34%     3.07%     5.99%
OpenStack    12.08%    11.06%    7.88%     5.18%     9.74%
Qt           24.71%    26.10%    26.49%    29.25%    23.56%
Average      14.87%    14.69%    12.90%    12.50%    13.10%

Table 5.5: Improvement of RevFinder+ over RevFinder.

Table 5.6 presents the percentage improvement of the Naive Bayes-based approach over the original RevFinder algorithm.

System       Top-1     Top-3     Top-5     Top-10    MRR
Android      13.45%    11.44%    7.05%     2.56%     10.05%
OpenStack    6.51%     8.06%     4.62%     2.32%     5.51%
Qt           105.22%   97.74%    85.31%    63.53%    81.81%
Average      41.72%    39.08%    32.33%    22.80%    32.46%

Table 5.6: Improvement of Naive Bayes approach over RevFinder.


5.3.4 Discussion

In this section we discuss the results of our empirical evaluation. We compared the accuracy of two algorithms with the RevFinder algorithm, which was chosen as a baseline not just in our thesis but in other relevant studies as well [6, 21, 22]. We will use MRR and Top-5 accuracy as the main metrics for the interpretation of our results. A Top-1 recommendation is rarely used in practice, because more than one code reviewer is usually recommended [5, 30]. On the other hand, recommending 10 code reviewers is often excessive and not useful. To achieve the goal defined in Chapter 1, we ask the following four research questions:

(RQ1) Does RevFinder+ provide better accuracy of recommended code reviewers than the original RevFinder algorithm?

The RevFinder+ algorithm achieved better accuracy than the original RevFinder algorithm in terms of all measured metrics. RevFinder+ outperformed RevFinder with the Top-5 accuracy improved by 12.90% and the MRR value improved by 13.10% on average.

(RQ2) Does the Naive Bayes-based approach provide better accuracy of recommended code reviewers than the original RevFinder algorithm?

The recommendation based on Naive Bayes also provided better results than the RevFinder algorithm in terms of all measured metrics. The Naive Bayes-based approach improved the Top-5 accuracy by 32.33% and the MRR value by 32.46% on average. The improvement on the Qt project was especially significant, with the Top-5 accuracy improved by 85.31% and the MRR value improved by 81.81%.


(RQ3) Why did the RevFinder+ outperform the original RevFinder algorithm?

Our modifications of the RevFinder algorithm consist of two changes: consideration of retired reviewers and consideration of information about project names. Adding information about the project name caused the largest improvements over the original RevFinder algorithm. This information is very important for the choice of the final reviewer and should always be processed by Code Reviewer Recommendation Algorithms for projects consisting of several subprojects.

(RQ4) Why did the Naive Bayes-based approach outperform the original RevFinder algorithm?

Similarly to RQ3, the improvement achieved by our Naive Bayes-based approach compared to the RevFinder algorithm is caused by the extended feature set. Apart from file paths, we also processed information about project names and owners of pull requests. Besides extending the feature set, the improvement could also have been caused by choosing the Naive Bayes technique to process these features and by moving retired reviewers down in the result list.

As we wanted to know which of our features are the most important for the recommendation by the Naive Bayes-based approach, we ran the tests in a configuration that processes the features separately. The averaged results have shown that the information about owners of pull requests is the most important feature in our case. Information about the project name was less important and information about file paths modified in pull requests was the least important. In any case, all the processed features were useful in the recommendation process and they achieved the best results together. It is important to note that the importance of features is dataset-dependent and it can differ depending on the project.

We also analyzed the cause of the massive improvement in the accuracy of the Naive Bayes-based approach on the Qt project, as it was significantly higher than for the other projects of our dataset. The better results could have been caused by a high correlation between owners of the pull requests and their code reviewers in this project. The Naive Bayes-based approach provided an MRR value of 0.476 by only processing the owner feature itself, whereas the RevFinder algorithm does not process this feature at all. Another reason for the improvement could be the flat package structure of the Qt project. We see that the average count of separators in file paths of this project is only 3.75, which is the lowest of the studied projects. As RevFinder calculates the similarities of file paths, this fact might be the reason for the weaker accuracy of code reviewers recommended by the RevFinder algorithm for the Qt project.
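To illustrate how categorical features such as the owner, the project name and file path components can be combined by a Naive Bayes classifier, a simplified sketch in Java follows. It treats every feature value as a token (e.g. "owner=alice", "project=platform/build", "path=src"), applies add-one (Laplace) smoothing, and is only an illustration under these assumptions: the class and method names are hypothetical and it is not the exact formulation implemented in our application (see Sections 3.4 and 4.1.3).

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative Naive Bayes reviewer scoring over categorical pull request features;
// not the thesis's exact implementation.
public class ReviewerScorer {

    private final Map<String, Integer> reviewsPerReviewer = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalReviews = 0;

    // Record one past review: the reviewer together with the feature tokens of its pull request.
    public void train(String reviewer, List<String> tokens) {
        totalReviews++;
        reviewsPerReviewer.merge(reviewer, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(reviewer, r -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Score every known reviewer for a new pull request; a higher score means a better candidate.
    public Map<String, Double> score(List<String> tokens) {
        Map<String, Double> scores = new HashMap<>();
        for (String reviewer : reviewsPerReviewer.keySet()) {
            int reviews = reviewsPerReviewer.get(reviewer);
            double logScore = Math.log((double) reviews / totalReviews); // prior P(reviewer)
            Map<String, Integer> counts = tokenCounts.get(reviewer);
            int totalTokens = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                // add-one (Laplace) smoothing so unseen feature values do not zero out the score
                logScore += Math.log((count + 1.0) / (totalTokens + vocabulary.size()));
            }
            scores.put(reviewer, logScore);
        }
        return scores;
    }
}

With such log-scores, disabling a feature simply means leaving its tokens out, which is how the per-feature runs described above can be emulated.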

5.3.5 Threats to Validity

We identified the following threats to the validity of our study:

∙ External validity: Threats to external validity are related to the generalizability of our study [31]. Our empirical evaluation was executed on three open source projects, which use the Gerrit system for code reviews. We cannot claim that the same results would be achieved for other projects. Smaller open source projects, commercial projects or projects using other systems for code reviews might show different results.

∙ Internal validity: Threats to internal validity are related to experimental errors and biases [31]. As we use the project’s name as one of the features, we assume usage of these algorithms on large systems consisting of several projects. Such systems were therefore chosen for our empirical evaluation. Nevertheless, Code Reviewer Recommendation Algorithms are primarily expected to be used for large projects, where code reviewer assignment problems are more likely to be present [6].

∙ Construct validity: Threats to construct validity are related to the suitability of our evaluation metrics [31]. We used Top-k Accuracy and Mean Reciprocal Rank for the empirical evaluation, which are widely used metrics in the relevant literature and should not cause threats to construct validity. In our experiments we evaluate whether one of the top k recommended reviewers actually reviewed the pull request. However, we do not know whether this reviewer was in fact the most appropriate candidate, and an analysis of this question could discover new findings.


In a real environment we would have to consider some other aspects as well. The most active code reviewers would probably be assigned to a huge number of pull requests, which could lead to the overburdening of certain reviewers. Therefore, workload balancing should also be taken into account [4].


6 Reproducibility of Reviewer Recommendation Algorithms

This thesis analyzes the effectiveness of existing Code Reviewer Recommendation Algorithms. We described and compared the existing approaches and re-implemented some of them with the goal of analyzing the possibilities for their improvement. We encountered several problems during our work caused by incompleteness, inconsistencies and other deficiencies in the research papers describing these algorithms. This encouraged us to introduce several recommendations to improve future research in this area. We had to face the following problems:

1. Public unavailability of source code of implemented algorithms: The source code of the proposed algorithms is rarely publicly available, even though it was produced during the research.

2. Reproducibility problems: The description of several proposed algorithms is too brief. It is often very complicated or even impossible to re-implement the proposed solution.

3. Ambiguity of constants: Some of the algorithms are configurable. It is not always clear which configuration values were used in the studies to test the algorithms.

4. Public unavailability of datasets used: The projects used for testing of recommendation algorithms are almost always publicly available. However, the exact subset of data processed in the studies is usually not described, and therefore testing with the same dataset cannot be done by anybody else.

5. Different evaluation metrics: The authors of the proposed algorithms sometimes use different metrics to evaluate their results. This makes it more complicated to compare the results of existing algorithms with each other.

In this chapter, we propose some recommendations for further research in the area of Code Reviewer Recommendation Algorithms in order to avoid problems such as these in the future.

6.1 Reproducible Research in Software Engineering

The concept of reproducible research is a mechanism to address the problems of unreproducible research. Reproducible research can be defined as research whose outputs can be reproduced from the published materials by other researchers [32]. Gentleman and Lang [33] introduced the term compendium. It refers to three aspects that are the product of Software Engineering research, all of which should be made public. A compendium contains a paper (the textual description of the problem), the data used by the study and the computer code that processes the data. Research that does not contain all of these is often unreproducible. The results of such research cannot be verified, reused or extended by anybody else. The concept of reproducible research is related to Literate Programming, presented by Donald E. Knuth. Literate Programming is based on the following idea: “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do” [34].

6.2 Reproducibility Problems

In Chapter 3, we analyzed six relevant Code Reviewer Recommendation Algorithms. Our findings about Code Reviewer Recommendation Algorithms led us to formulate several recommendations for further research in this area. We believe that long-term compliance with these recommendations would lead to improvements in the research results in this area.

6.2.1 Source Code

None of the analyzed research provided the source code produced during the research. A description of the produced algorithms is always provided in the paper, but the algorithm is not always reproducible from the description. This is the worst case scenario in terms of reproducibility. The unavailability of datasets can be solved by using other datasets. Nevertheless, the irreproducibility of proposed algorithms causes a serious decline in the quality of the presented research. These findings led us to formulate the first recommendation.

Recommendation 1: The source code of algorithms produced during your research should be published. If there is a reason not to publish it, a detailed description of the implemented algorithms should be provided. This description should make it possible for a third party to re-implement the proposed algorithms with the same configuration values.

6.2.2 Data Sets

Only one of the analyzed pieces of research was actually reproducible. The others did not provide the exact dataset. Public availability of datasets would be beneficial for several reasons. The results of the research could be verified by other researchers thanks to the availability of the datasets. Other researchers could reuse these datasets for their own research or compare their algorithms with the original one on the same dataset without the need to re-implement the original algorithm. These facts led us to formulate the second recommendation.

Recommendation 2: If possible, the datasets which were used for the evaluation of your research should be published. The form of the published data is not exactly specified. Published files should either contain the exact dataset necessary to replicate the research or at least the information that would be sufficient to mine the exact dataset from public sources.

6.2.3 Metrics

We have noticed that some of the analyzed pieces of research use different metrics to evaluate their results, concretely: Top-k accuracy, Mean Reciprocal Rank, Precision, Recall and F-Measure. The usage of different metrics in different papers is undesirable for several reasons. The individual pieces of research are hard to compare with each other. It is not always possible to reproduce them, which leads to a situation where we do not know which algorithm is able to produce the best results at the moment. If possible, we recommend using all of the most common metrics. That is why we recommend the following:

Recommendation 3: The Top-k accuracy and Mean Reciprocal Rank metrics should always be used to evaluate your research in this area. We recommend the k values 1, 3, 5 and 10 for the Top-k accuracy metric. Precision, Recall and F-Measure metrics should also be used when they are appropriate for the evaluated dataset (more information about their appropriateness can be found in Subsection 3.2.3).
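For clarity, a minimal sketch of the two recommended metrics follows. It assumes that an algorithm produces a ranked list of reviewer names and that the reviewers who actually reviewed the pull request are known; the class and method names are illustrative only.

import java.util.List;
import java.util.Set;

// Minimal sketch of the recommended evaluation metrics.
public class RecommendationMetrics {

    // Top-k accuracy: a recommendation counts as a hit if at least one of the
    // first k recommended reviewers actually reviewed the pull request.
    public static boolean isTopKHit(List<String> recommendations, Set<String> actualReviewers, int k) {
        return recommendations.stream().limit(k).anyMatch(actualReviewers::contains);
    }

    // Reciprocal rank: 1 / rank of the first correct reviewer, 0 if none was recommended.
    public static double reciprocalRank(List<String> recommendations, Set<String> actualReviewers) {
        for (int i = 0; i < recommendations.size(); i++) {
            if (actualReviewers.contains(recommendations.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }
}

MRR is then simply the average of reciprocalRank over all evaluated pull requests, and Top-k accuracy is the fraction of pull requests for which isTopKHit returns true.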

6.2.4 Reproducibility Summary

Table 6.1 contains a summary of the availability of source code and datasets as well as a summary of the metrics used in the existing papers about Code Reviewer Recommendation Algorithms. All the papers contained in the summary were already mentioned in Chapter 3.

(Table body: rows ReviewBot [5], RevFinder [4], CORRECT [21], CoreDevRec [6], Pred. Rev. [18], Comm. Net. [22]; columns Source code available, Dataset available, and the metrics Top-k, MRR, Precision, Recall, F-Measure used by each paper.)

Table 6.1: Reproducibility summary of existing papers.

7 Conclusion

In this thesis, we analyzed the code review process and the existing Code Reviewer Recommendation Algorithms. We discussed their strengths and weaknesses and we identified potential room for their improvement. These findings encouraged us to propose RevFinder+ and to formulate our own recommendation algorithm based on the Naive Bayes technique.

We re-implemented the ReviewBot and the RevFinder algorithms and we proposed several modifications of the RevFinder algorithm which are able to increase its recommendation accuracy. In order to evaluate the modified algorithm, we conducted tests on 35 239 pull requests of three open source projects. The results show that RevFinder+ performed better than RevFinder, with the Top-5 accuracy improved by 12.90% and MRR improved by 13.10% on average. We also proposed a novel approach based on the Naive Bayes technique. In comparison with RevFinder, our approach improves the Top-5 accuracy by 32.33% and MRR by 32.46% on average. The aforementioned algorithms were implemented within a Spring Boot application which can be deployed in a real environment.

Finally, we decided to sum up the reproducibility problems of existing research that we had to face during our analysis. We formulated several recommendations which address these problems and we discussed ways to avoid them. Following these rules could possibly lead to more relevant research results in this area in the future.

The main deliverables of this thesis are two proposed recommendation systems: a modified RevFinder algorithm called RevFinder+ and a novel Naive Bayes-based technique. Comparisons with one of the state-of-the-art algorithms (RevFinder) indicated very promising results for our proposals. We therefore believe that our findings could improve the assignment of pull requests to code reviewers in the future. Deployment of our system in a real environment could reduce the time spent on finding reviewers, which would speed up the overall code review process, leading to increased effectiveness of the Pull-based development model. The goals and objectives stated in Chapter 1 were thus met by the thesis.


Nevertheless, there are still several areas worthy of future work. It would be interesting to analyze the impact of other features on the accuracy of the algorithms, such as patch content, patch size or the actual activeness of reviewers, which should also be considered with regard to the potential overburdening of selected reviewers. As proposed in [22], mixing our approaches with the CN-based recommendation could theoretically achieve better performance and would be worthy of further analysis.

Bibliography

1. T. Baum and K. Schneider, "On the Need for a New Generation of Code Review Tools", in Product-Focused Software Process Improvement, 2016, pp. 301–308.
2. T. Baum, O. Liskin, K. Niklas and K. Schneider, "Factors Influencing Code Review Processes in Industry", in Proceedings of the ACM SIGSOFT 24th International Symposium on the Foundations of Software Engineering, 2016, pp. 85–96.
3. V. Mashayekhi, J. Drake, W.-T. Tsai, and J. Riedl, "Distributed, collaborative software inspection", IEEE Software, vol. 10, no. 5, 1993, pp. 66–75.
4. P. Thongtanunam, Ch. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K. Matsumoto, "Who Should Review My Code?", in Software Analysis, Evolution and Reengineering, IEEE, 2015, pp. 141–150.
5. V. Balachandran, "Reducing Human Effort and Improving Quality in Peer Code Reviews using Automatic Static Analysis and Reviewer Recommendation", in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 931–940.
6. J. Jiang, J.-H. He and X.-Y. Chen, "CoreDevRec: Automatic Core Member Recommendation for Contribution Evaluation", in Comput. Sci. Technol. (2015), 2015, pp. 998–1016.
7. M. E. Fagan, "Design and code inspections to reduce errors in program development", in IBM Systems Journal, vol. 15, no. 3, 1976, pp. 182–211.
8. T. Baum, O. Liskin, K. Niklas and K. Schneider, "A Faceted Classification Scheme for Change-Based Industrial Code Review Processes", in Software Quality, Reliability and Security, IEEE, 2016.
9. A. Bacchelli and Ch. Bird, "Expectations, outcomes, and challenges of modern code review", in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 712–721.
10. P. C. Rigby and Ch. Bird, "Convergent contemporary software peer review practices", in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 2013, pp. 202–212.


11. P. Bourque and R. E. Fairley, "Guide to the Software Engineering Body of Knowledge, Version 3.0", IEEE Computer Society, 2014, www.swebok.org.
12. L. Harjumaa, I. Tervonen and A. Huttunen, "Peer Reviews in Real Life - Motivators and Demotivators", in Proceedings of the Fifth International Conference on Quality Software, 2005, pp. 29–36.
13. K. Wiegers, "Peer Reviews in Software: A Practical Guide". Addison-Wesley Professional, 2002. ISBN 978-0201734850.
14. K. Hamasaki, R. Gaikovina Kula, N. Yoshida, A. E. C. Cruz, K. Fujiwara and H. Iida, "Who Does What during a Code Review? Datasets of OSS Peer Review Repositories", in Proceedings of the 10th Working Conference on Mining Software Repositories, 2013, pp. 49–52.
15. X. Yang, N. Yoshida, R. G. Kula and H. Iida, "Peer Review Social Network (PeRSoN) in Open Source Projects", in IEICE Transactions on Information and Systems, 2016, pp. 661–670.
16. Software Peer Reviews: An Executive Overview [online]. Karl Wiegers [visited on 2017-03-18]. Available from: http://www2.smartbear.com/rs/smartbear/images/(Karl%20Wiegers)%20Software-Peer-Reviews-An-Executive-Overview-KW_final.pdf.
17. P. C. Rigby and M.-A. Storey, "Understanding broadcast based peer review on open source software projects", in Proceedings of the 2011 International Conference on Software Engineering, 2011, pp. 541–550.
18. G. Jeong, S. Kim, T. Zimmermann and K. Yi, "Improving Code Review by Predicting Reviewers and Acceptance of Patches", in ROSAEC MEMO 2009-006, 2009.
19. Gerrit Code Review for Git [online]. Gerrit [visited on 2017-03-11]. Available from: https://gerrit-documentation.storage.googleapis.com/Documentation/2.13.6/index.html.
20. About pull requests [online]. GitHub [visited on 2017-03-10]. Available from: https://help.github.com/articles/about-pull-requests.
21. M. M. Rahman, Ch. K. Roy and J. A. Collins, "CORRECT: Code Reviewer Recommendation in GitHub Based on Cross-Project and Technology Experience", in Proceedings of the 2016 International Conference on Software Engineering, 2016, pp. 222–231.


22. Y. Yu, H. Wang, G. Yin and T. Wang, "Reviewer Recommendation for Pull-Requests in GitHub: What Can We Learn from Code Review and Bug Assignment?", in Information and Software Technology, Volume 74, 2016, pp. 204–218.
23. The Optimality of Naive Bayes [online]. Harry Zhang [visited on 2017-04-10]. Available from: http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.
24. Ch. Aggarwal, "Data Classification: Algorithms and Applications". Chapman & Hall/CRC, 2014. ISBN 978-1466586741.
25. Bayesian probability theory [online]. B. A. Olshausen [visited on 2017-04-14]. Available from: http://redwood.berkeley.edu/bruno/npb163/bayes.pdf.
26. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression [online]. Tom M. Mitchell [visited on 2017-04-13]. Available from: https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf.
27. Probability smoothing [online]. Djoerd Hiemstra, University of Twente [visited on 2017-03-14]. Available from: https://pdfs.semanticscholar.org/945f/1c4efe98eb51a1943f60c0cd92df965cf8cf.pdf.
28. Gerrit Code Review - Change-Ids [online]. Eclipse [visited on 2017-03-25]. Available from: https://git.eclipse.org/r/Documentation/user-changeid.html#_description.
29. N. Bettenburg, R. Premraj, T. Zimmermann and S. Kim, "Duplicate Bug Reports Considered Harmful...Really?", in Proceedings of the International Conference on Software Maintenance, IEEE, 2008.
30. X. Xia, D. Lo, X. Wang, and X. Yang, "Who should review this change?: Putting text and file location analyses together for more accurate recommendations", in Software Maintenance and Evolution, IEEE, 2015, pp. 261–270.
31. T. Yuan, D. Lo and J. Lawall, "Automated Construction of a Software-Specific Word Similarity Database", in Software Maintenance, Reengineering and Reverse Engineering, IEEE, 2014, pp. 44–53.
32. L. Madeyski and B. Kitchenham, "Would wider adoption of reproducible research be beneficial for empirical software engineering research?", in Journal of Intelligent and Fuzzy Systems 32(2), 2017, pp. 1509–1521.


33. R. Gentleman and D. T. Lang, "Statistical Analyses and Reproducible Research", in Bioconductor Project Working Papers, 2004.
34. D. Knuth, "Literate Programming", in The Computer Journal, 1984, pp. 97–111.

A Google Chrome Extension

To be able to use our system easily, we implemented a small extension for the Google Chrome1 web browser. This extension can be used by developers when they want to ask our system to recommend suitable reviewers for their Gerrit pull request. The reviewer recommendation process can be invoked by clicking on the extension on the Gerrit web page of the pull request in question. The extension first retrieves the Gerrit Change number from the URL opened in the browser. In the second step, it sends a GET request to our server. The response is displayed in the extension, so the developer can immediately see the recommended reviewers and add them as reviewers on the pull request's page. We believe that such a system would be very easy to use and would make the whole process of finding suitable reviewers much more effective. A picture of our Google Chrome extension can be seen in Figure A.1.

1. https://www.google.com/chrome/


Figure A.1: Example of a recommendation of code reviewers through our Google Chrome extension. The list on the right side presents the recommended reviewers for the pull request, which can be found at the following link: https://android-review.googlesource.com/c/31583/

B GitHub Repository

The project is available in our GitHub repository1. The repository has the following structure:

∙ the root folder of the repository contains the source files of the Spring Boot application.

∙ the data folder contains the datasets.

∙ the chrome_extension folder contains the source files of the imple- mented Google Chrome extension.

∙ the repos folder should contain the repositories of projects for which the ReviewBot algorithm will be used as the recommendation system.

1. https://github.com/XLipcak/rev-rec


C Data Model and Datasets

The Entity Relationship Diagram in Figure C.1 describes our database schema. We stored the data in a MySQL relational database. The datasets used for the tests are publicly available in the data folder1 of our repository, in JSON format as well as in the form of SQL import scripts. Although our data model allows storing the data of different projects in the same schema, we stored the data of each project in a separate schema in order to make testing and data manipulation simpler and faster.

Figure C.1: Entity Relationship Diagram.

1. https://github.com/XLipcak/rev-rec/tree/master/data


D Configuration and Deployment

This chapter describes the steps necessary to configure and run the application. The build of the application is managed with the Apache Maven tool.

D.1 Configuration

The configuration settings can be set in the application.properties file. The database connection has to be set up correctly before running the application. The following properties are specific to our project and should also be specified (an example configuration is sketched after the list):

∙ recommendation.project (String): name of the project for which reviewers will be recommended.

∙ recommendation.retired (Boolean): set whether retired reviewers should be disadvantaged.

∙ recommendation.retired.interval (Integer): set how many months of inactivity are required for reviewers to be considered as retired.

∙ recommendation.revfinder.projectname (Boolean): set whether the RevFinder algorithm should consider project names.

∙ recommendation.jobs.buildModel.cron (Cron format): set the scheduling of a model’s recalculation.
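A hypothetical example of such a configuration is shown below. The database settings use the standard Spring Boot datasource properties; all values are illustrative only and have to be adapted to the concrete environment and project.

# example application.properties (illustrative values only)
spring.datasource.url=jdbc:mysql://localhost:3306/revrec
spring.datasource.username=revrec
spring.datasource.password=secret

recommendation.project=android
recommendation.retired=true
recommendation.retired.interval=12
recommendation.revfinder.projectname=true
recommendation.jobs.buildModel.cron=0 0 2 * * *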

The class InitialLoader was written in order to demonstrate the functionality of the application. It implements the CommandLineRunner interface, which ensures that the content of the run method will always be executed after the deployment of the application. This class contains some commented-out lines with several testing examples. As the application uses a relational database, the database has to be configured correctly and should contain some data before executing this functionality.
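As a rough sketch of what such a class looks like: the @Component annotation and the CommandLineRunner interface are standard Spring Boot, whereas the recommendation call in the comment is only a hypothetical placeholder, not the actual content of InitialLoader.

import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

// Runs automatically after the Spring application context has started.
@Component
public class InitialLoader implements CommandLineRunner {

    @Override
    public void run(String... args) {
        // Testing examples would be invoked here, e.g. (hypothetical placeholder):
        // reviewerRecommendationService.recommendReviewers(changeNumber);
    }
}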

D.2 Deployment

The application requires the Apache Maven tool to be installed and the JAVA_HOME environment variable to be set and pointing to a JDK1 installation (at least version 8). The following command can be used to quickly compile and run the application from its root folder:

mvn spring-boot:run

A detailed description of the steps necessary to run the application is written in the README.md file of the repository.

1. http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
