Test-Driven Code Review: an Empirical Study

Davide Spadini (1,2), Fabio Palomba (3), Tobias Baum (4), Stefan Hanenberg (5), Magiel Bruntink (1), and Alberto Bacchelli (3)
(1) Software Improvement Group, The Netherlands; (2) Delft University of Technology, The Netherlands; (3) University of Zurich, Switzerland; (4) Leibniz Universität Hannover, Germany; (5) Universität Duisburg-Essen, Germany

Abstract: Test-Driven Code Review (TDR) is a code review practice in which a reviewer inspects a patch by examining the changed test code before the changed production code. Although this practice has been mentioned positively by practitioners in informal literature and interviews, there is no systematic knowledge on its effects, prevalence, problems, and advantages. In this paper, we aim at empirically understanding whether this practice has an effect on code review effectiveness and how developers perceive TDR. We conduct (i) a controlled experiment with 93 developers that perform more than 150 reviews, and (ii) 9 semi-structured interviews and a survey with 103 respondents to gather information on how TDR is perceived. Key results from the experiment show that developers adopting TDR find the same proportion of defects in production code, but more in test code, at the expense of fewer maintainability issues in production code. Furthermore, we found that most developers prefer to review production code as they deem it more important and tests should follow from it. Moreover, widespread poor test code quality and no tool support hinder the adoption of TDR.

I. INTRODUCTION

Peer code review is a well-established and widely adopted practice aimed at maintaining and promoting software quality [3]. Contemporary code review, also known as Change-based Code Review [6] or Modern Code Review (MCR) [12], is a lightweight process that is (1) informal, (2) tool-based, (3) asynchronous, and (4) focused on inspecting newly proposed code changes rather than the whole codebase [43]. Specifically, in a code review, developers other than the code change author manually inspect the patch to find as many issues as possible and provide feedback that needs to be addressed before accepting the code in production [6].

The academic research community is conducting empirical studies to better understand the code review process [44], [43], [3], [26], [45], as well as to obtain empirical evidence on aspects and practices that are related to more efficient and effective reviews [51], [34].

A code review practice that has only been touched upon in academic literature [49], but was described in gray literature almost ten years ago [58], is that of test-driven code review (TDR, henceforth). By following TDR, a reviewer inspects a patch by examining the changed test code before the changed production code.

To motivate TDR, P. Zembrod, Senior Software Engineer in Test at Google, explained in the Google Blog [58]: “When I look at new code or a code change, I ask: What is this about? What is it supposed to do? Questions that tests often have a good answer for. They expose interfaces and state use cases”. Among the comments, S. Freeman, one of the originators of Mock objects [48] and TDD [9], remarked that he had covered similar ground [21]. Recently, in a popular online forum for programmers, another article supported TDR (collecting more than 1,200 likes): “By looking at the requirements and checking them against the test cases, the developer can have a pretty good understanding of what the implementation should be like, what functionality it covers and if the developer omitted any use cases.” Interviewed developers reported preferring to review test code first to better understand the code change before looking for defects in production [49].

These are compelling arguments in favor of TDR, yet we have no systematic knowledge on this practice: whether TDR is effective in finding defects during code review, how frequently it is used, and what its potential problems and advantages are besides review effectiveness. This knowledge can provide insights for both practitioners and researchers. Developers and project stakeholders can use empirical evidence about TDR effects, problems, and advantages to make informed decisions about when to adopt it. Researchers can focus their attention on the novel aspects of TDR and the challenges reviewers face to inform future research.

In this paper, our goal is to obtain a deeper understanding of TDR. We do this by conducting an empirical study set up in two phases: an experiment, followed by an investigation of developers’ practices and perceptions.

In the first phase, we study the effects of TDR in terms of the proportion of defects and maintainability issues found in a review. To this aim, we devise and analyze the results of an online experiment in which 92 developers (77 with at least two years of professional development experience) complete 154 reviews, using TDR or two alternative strategies (i.e., production first or only production); also, two external developers rated the quality of their review comments. In the second phase, we investigate problems, advantages, and frequency of adoption of TDR, valuable aspects that could not be investigated in the experiment. To this aim, we conduct nine interviews with experiment participants and deploy an online survey with 103 respondents.

Key findings of our study include: with TDR, the proportion of functional defects (bugs henceforth) found in production code and maintainability issues (issues henceforth) found in test code does not change. However, TDR leads to the discovery of more bugs in test code, at the expense of fewer issues found in production code. The external raters judged the quality of the review comments as comparable across all review strategies. Furthermore, most developers seem to be reluctant to devote much attention to tests, as they deem production code more important; moreover, applying TDR is problematic, due to widespread poor test quality (reducing TDR’s applicability) and no tool support (not easing TDR).

II. RELATED WORK

To some extent, TDR can be considered as an evolution of classical reading techniques [4], as it shares the general idea to guide code inspectors with software artifacts (i.e., test cases) and help them with the code review task.

Scenario-based inspections. Among reading techniques, Porter & Votta [39] defined the scenario-based approach, based on scenarios that provide inspectors with more specific instructions than a typical checklist and focus on a wider variety of defects. They discovered that such a technique is significantly more useful for requirements inspectors. Later on, Porter et al. [40], [38] and Miller et al. [35] replicated the original study, confirming the results. Other studies by Fusaro et al. [22] and Sandahl et al. [46] reported contradictory results, however without providing explanations on the circumstances leading scenario-based code inspection to fail. An important advance in this field was then provided by Basili et al. [5], who revisited the original scenario-based approach as a technique that needs to be specialized for the specific issues to be analyzed. They also defined a new scenario-based technique called perspective-based reading: the basic idea is that different aspects of the source code should be inspected by inspectors having different skills [5]. All in all, the papers mentioned above provided evidence of the usefulness of reading techniques; their similarities with TDR give an interesting rationale on why TDR could bring benefits.

Ordering of code changes. Research on the ordering of code changes is also related to TDR. In particular, Baum et al. argued that an optimal ordering of code changes would help reviewers by reducing the cognitive load and improving the alignment with their cognitive processes [8], even though they made no explicit reference to ordering tests. This may give theoretical value to the TDR practice. Code ordering and its relation to understanding, yet without explicit reference to tests or reviews, has also been the subject of studies [24], [10].

Reviewing test code. Many articles on classical inspection (e.g., [31], [56]) underline the importance of reviewing tests; [...] code first so as to get a better understanding of a code change before looking for defects in production code. Our work builds upon the research on reviewing test code, by investigating how reviewing test code can(not) be beneficial for the whole reviewing process.

III. METHODOLOGY

In this section we describe the research questions and the methodology we follow to conduct our study.

A. Research Questions

The overall goal of this paper is to obtain a deeper understanding of Test-Driven Code Review. This study has two parts that we structure in two research questions. In the first part, we start by designing and running an experiment to investigate the effects of TDR on code review effectiveness. We measure the effectiveness as the ability to find bugs and maintainability issues during a code review (i.e., the main reported goal of code review [3]). This allows us to establish whether it is possible to empirically measure any significant difference in this aspect using TDR. Hence, our first research question:

RQ1. Does the order of presenting test code to the reviewer influence code review effectiveness?

In the second part of the study, we investigate the prominence of TDR and the developers’ perception toward this practice, also focusing on problems and advantages that could not be measured through the aforementioned experiment. Our aim is to obtain a more complete view on TDR. To do so, we turn to the developers, conducting semi-structured interviews and deploying an online survey. Hence, our second research question:

RQ2. How do developers perceive the practice of Test-Driven Code Review?

B. Method – RQ1: Design Overview

Figure 1 depicts an overview of the overall flow of our experiment. We follow a partially counter-balanced repeated measures design [17], augmented with some additional phases.

1) We use a browser-based tool to conduct the experiment and answer RQ1.
