
WhoDo: Automating reviewer suggestions at scale

Sumit Asthana, Microsoft Research India
B. Ashok, Microsoft Research India
Chetan Bansal, Microsoft Research India
Ranjita Bhagwan, Microsoft Research India
Christian Bird, Microsoft Research Redmond
Rahul Kumar, Microsoft Research India
Chandra Maddila, Microsoft Research India
Sonu Mehta, Microsoft Research India

ABSTRACT
Today's software development is distributed and involves continuous changes for new features, and yet the development cycle has to be fast and agile. An important component of enabling this agility is selecting the right reviewers for every code-change, the smallest unit of the development cycle. Modern tool-based code review has proven to be an effective way to achieve appropriate code review of software changes. However, the selection of reviewers in these systems is at best manual. As software and development teams scale, this poses the challenge of selecting the right reviewers, which in turn determines software quality over time. While previous work has suggested automatic approaches to code reviewer recommendation, it has been limited to retrospective analysis. We not only deploy a reviewer suggestion algorithm, WhoDo, and evaluate its effect, but also incorporate load balancing into it to address one of its major shortcomings: recommending experienced developers very frequently. We evaluate the effect of this hybrid recommendation + load balancing system on five repositories within Microsoft. Our results are based around various aspects of a commit and how code review affects them. We attempt to quantitatively answer questions that play a vital role in effective code review through our data, and substantiate the answers through qualitative feedback from partner repositories.

CCS CONCEPTS
• Human-centered computing → Empirical studies in collaborative and social computing; • Software and its engineering → Software configuration management and version control systems; tools; Programming teams.

KEYWORDS
software-engineering, recommendation, code-review

ACM Reference Format:
Sumit Asthana, B. Ashok, Chetan Bansal, Ranjita Bhagwan, Christian Bird, Rahul Kumar, Chandra Maddila, and Sonu Mehta. 2019. WhoDo: Automating reviewer suggestions at scale. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE 2019, 26–30 August, 2019, Tallinn, Estonia
© 2019 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Large software projects have continuously evolving code-bases and an ever-changing set of developers. Making the development process smooth and fast while maintaining code quality is vital to any software development effort. As one approach for meeting this challenge, code review [1, 2] is widely accepted as an effective tool for subjecting code to scrutiny by peers and maintaining quality. Modern code review [3], characterized by lightweight tool-based reviews of source code changes, is in broad use across both commercial and open source software projects [17]. This form of code review provides developers with an effective workflow to review code changes and improve code, and the process has been studied in depth in the research community [5, 6, 11, 13, 17–20, 23].

One topic that has received much attention over the past five years is the challenge of recommending the most appropriate reviewers for a software change. Bacchelli and Bird [3] found that when the reviewing developer had a deep understanding of the code being reviewed, the feedback was "more likely to find subtle defects ... more conceptual (better ideas, approaches) instead of superficial (naming, mechanical style, etc.)". Kononenko et al. [12] found that selecting the right reviewers impacts quality. Thus, many have proposed and evaluated approaches for identifying the best reviewer for a code review [4, 8, 14, 15, 22, 24, 26] (see Section 2 for a more in-depth description of related work). At Microsoft, many development teams have voiced a desire for help in identifying those developers that have the understanding and expertise needed to review a given software change.

The large and growing size of the software repositories at many software companies (including Microsoft, the company involved in the evaluation of our approach) has created the need for an automated way to suggest reviewers [3]. One common approach that several projects have used is a manually defined set of groups that collectively identify experts in an area of code. These groups are used in conjunction with rules which trigger the addition of groups whenever files in a pre-defined part of the system change (e.g., add participants in the Network Protocol Review group whenever a file is changed in /src/networking/protocols/TCPIP/*/). This ensures that the appropriate group of experts is informed of the file change and can review it. However, such solutions are hard to scale, suffer from becoming stale quickly, and may miss the right reviewers even when rules and groups are manually kept up to date.
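To make the rule mechanism concrete, such rules can be thought of as a small table mapping path patterns to expert groups. The sketch below is purely illustrative: the rule table, group names, and helper function are our own, not any actual Microsoft tooling.

```python
from fnmatch import fnmatch

# Hypothetical rule table (path pattern -> expert group), in the spirit of the
# Network Protocol Review example above; not an actual rule set.
REVIEW_RULES = {
    "/src/networking/protocols/TCPIP/*": "Network Protocol Review",
    "/src/storage/*": "Storage Review",
}

def groups_for_change(changed_files):
    """Return the reviewer groups whose path rules match any changed file."""
    groups = set()
    for path in changed_files:
        for pattern, group in REVIEW_RULES.items():
            if fnmatch(path, pattern):
                groups.add(group)
    return groups

# A change under the TCP/IP protocol tree triggers the protocol review group.
print(groups_for_change(["/src/networking/protocols/TCPIP/tcp_input.c"]))
```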

Motivated by the need for help in identifying the most appropriate reviewers and the difficulty of manually tracking expertise, we have developed and deployed an automatic code reviewer recommendation system at Microsoft called WhoDo. In this paper, we report our experience and initial evaluation of this deployment on five software repositories. We leverage the success of previous work such as Zanjani et al. [26], which demonstrated that considering the past history of code, in terms of authorship and reviewership with respect to the current change, is an effective way to recommend peer reviewers for a code change. Based on the positive metrics of our evaluation and favorable user feedback, we have proceeded to deploy WhoDo onto many additional repositories across Microsoft. Currently, it runs on 123 repositories and this number continues to grow rapidly.

As discussed in Section 2, reviewer recommendation has been the subject of much research. However, it is not clear which of these systems have been deployed in practice, and the evaluation of such recommendation systems has consisted primarily of historical comparison, determining how well the recommendations match what actually happened and who participated in the review. While such an offline evaluation [9] is useful, it may not provide an accurate picture of the impact of the recommendation system when used in practice. For example, there is an (often implicit) assumption that those who participated in the review were in fact the best people to review the change and that those who were not invited were not appropriate reviewers. To address this concern, in this paper we report on the deployment of our reviewer recommender to a set of software projects at Microsoft, a large and diverse software company. Our evaluation measures impact quantitatively, examining user involvement and time to completion of reviews, as well as qualitatively, through a user study of the developers that used our system.

In addition, we find that the reviewer recommendation systems proposed in the literature do not take into account reviewer load, the distribution of reviews across available reviewers. In practice, we found that reviewer recommendation systems will often assign a large proportion of reviews to a small set of experienced developers (e.g., 20% of the developers in a project are assigned 80% of the reviews). To address this, we present the first (to our knowledge) approach that incorporates load balancing into live reviewer recommendations, along with an evaluation of the positive impact of load balancing.

We describe our efforts from building this model to deploying it and scaling it to repositories across Microsoft. In this paper, we:
• Provide a description of a straightforward, interpretable automatic code reviewer recommendation system that works in practice.
• Describe the problem of load balancing and present an approach to address the issue of unbalanced recommendations.
• Provide an evaluation of the recommendation system in live production.
• Present the results of a user study of software developers that used the recommendation system.

2 RELATED WORK
Tool-based code review is the adopted standard in both OSS and proprietary software systems [17], and many tools exist that enable developers to look at software changes effectively and review them. Reviewboard (https://www.reviewboard.org/), Gerrit (https://www.gerritcodereview.com/) and Phabricator (https://secure.phabricator.com/), the popular open source code review tools, and Microsoft's internal code review interface share some common characteristics:
• Each code change has a review associated with it, and almost all code changes (depending on team policies) have to go through the review process.
• Each review shows the associated changes in a standard diff format.
• Each review can be reviewed by any number of developers. Currently, reviewers are added by the author of the change or through manually defined rules.
• Each reviewer can leave comments on the change at any particular location of the diff, pinpointing errors or asking for clarifications.
• The code author addresses these comments in subsequent iterations/revisions, and this continues until all comments are resolved or one of the reviewers signs off.

One potential bottleneck of the above workflow is the addition of code reviewers by authors. This is a manual activity, and several social factors [7] such as developer relations and code knowledge play a role in it. This system, while effective, is also biased towards interpersonal relations, which can lead to incorrect assignment in large repositories where groups of teams are changing different code areas. A change affecting an unknown area might be signed off by a teammate without appropriate expertise in the related area. To mitigate this effect, the code review interface has provisions for defining manual rules on files that, if changed, trigger the addition of certain group aliases who own those files. One of the principal motivations of our system is to extend this kind of ownership definition beyond manual rules and track it automatically over time.

We are certainly not the first to attack the challenge of building a system that automatically recommends developers for code review. Balachandran [4] was the first to introduce a system for recommending reviewers. His tool, Review Bot, recommends reviewers for a review by determining which source code lines were changed and identifying those who had reviewed the previous changes that modified or added those same lines in the past.

RevFinder [21, 22], subsequently proposed by Thongtanunam et al., is based on the idea that those who have reviewed a file in the past are qualified to review that file in the future. The intuition behind RevFinder is that files that have similar paths are similar and should be managed and reviewed by similarly experienced code reviewers.

Tie, an approach introduced by Xia et al. [24], builds on RevFinder by using file path similarity metrics and incorporating a text mining component that examines the code review request and compares it to the text in previously completed code review requests.

cHRev, developed by Zanjani et al. [26], uses historical information, including how frequently and recently a developer provided code review comments about a file, in addition to the total number of review comments made for a file, to rank the best developers to review a change to that file in the future. Our model is based on participation in past reviews for a file, but also incorporates the history of developers that made changes to the file, as we found that in some cases the primary author of a file never actually provided code review comments about it. In these cases, the primary author should still be recommended to review changes.

Jeong et al. [10] developed a method for recommending reviewers for source code changes based on a variety of features, including the size of the change (both in lines of code and files affected), the identity of the author of the change, the names of the files changed, source code features such as counts of various keywords and braces, and even bug report information such as bug severity and priority if the change was intended to correct a related bug.

Ouni et al. [14] even incorporated socio-technical collaboration graphs of developers into reviewer recommendation, based on the notion that "the socio-technical factor related to the relationship between reviewer contributors is a crucial aspect that affects the review quality". Their work was inspired by Yang et al.'s finding that developers' communication social networks are a strong predictor of their activity and collaboration with regard to code review (which they term "peer review") [25].

Rahman et al. [15, 16] determined developers' expertise by looking across multiple projects. Their approach attempts to determine whether developers had used particular libraries or technologies in other projects that would make them appropriate reviewers for a change in a particular project. They gathered data from GitHub (http://www.github.com) to determine the history of changes that developers had made and what technologies/libraries they had used in the past.

Our report differs from this body of work in three primary ways. First, as our goal is to recommend reviewers quickly for repositories that may be very large, we use a simplified approach that uses only one data source: the history of PRs in a project. We do not rely on bug databases, an ecosystem of software projects, a history of developer communication, or features of source code that would require some level of code analysis. These add complexity, require additional data sources (that may or may not exist), and take additional time. Second, while the existing systems mentioned used a retrospective, historical approach to evaluate the performance of their approaches, we deploy our reviewer recommendation system and evaluate it based on the results of it being used in practice. Third, our experience is that reviewer recommendation systems often suggest a small set of developers to participate in most reviews, leading to a very skewed assignment of reviews. As a result, we are the first to both identify this problem and address it by incorporating reviewer load balancing into our reviewer recommendation system.

3 SYSTEM DESIGN
In this section, we first describe the WhoDo reviewer recommendation algorithm in detail, with specific emphasis on the scoring function used to prioritize reviewers. Next, we describe how we augment the algorithm to balance load across all reviewers.

3.1 Scoring Function
WhoDo's scoring function creates a ranked order of developers as potential reviewers for a pull-request using commit and review histories. Developers who have in the past either committed changes to the files in the pull-request, or have reviewed files in the pull-request, are more likely to be added as reviewers. We say that a developer has reviewed a pull-request if and only if they have either signed off on the pull-request or left at least one comment.

Given a pull-request, the score for each reviewer is:

\[
Score(r) = C_1 \sum_{f \in F} n_{change}(r, f) \cdot \frac{1}{t_{change}(r, f)}
         + C_2 \sum_{d \in D} n_{change}(r, d) \cdot \frac{1}{t_{change}(r, d)}
         + C_3 \sum_{f \in F} n_{review}(r, f) \cdot \frac{1}{t_{review}(r, f)}
         + C_4 \sum_{d \in D} n_{review}(r, d) \cdot \frac{1}{t_{review}(r, d)}
\]

where $r$ is the reviewer, $F$ is the set of all files in the pull-request, and $D$ is the set of all last-level parent directories that are changed in the pull-request. $n_{change}(r, f)$ is the number of times reviewer $r$ has committed changes to file $f$ in the past. $n_{review}(r, f)$ is the number of times reviewer $r$ has reviewed file $f$. Similarly, $n_{change}(r, d)$ is the number of times reviewer $r$ has committed changes within directory $d$, and $n_{review}(r, d)$ is the number of times reviewer $r$ has reviewed files in directory $d$. Hence, the larger the number of times a reviewer has interacted with a file (or the last-level directory in which the file sits), the higher the reviewer's score.

$t_{change}(r, f)$ is the number of days since reviewer $r$ changed file $f$, and $t_{review}(r, f)$ is the number of days since the reviewer reviewed file $f$. Similarly, $t_{change}(r, d)$ is the number of days since reviewer $r$ changed directory $d$, and $t_{review}(r, d)$ is the number of days since the reviewer reviewed files in directory $d$. These terms are in the denominator. Hence, reviewers with more recent interactions with the files and directories of the pull-request will have lower values for these terms, and WhoDo will rank them higher.

$C_1$, $C_2$, $C_3$ and $C_4$ are constant coefficients. An administrator deploying WhoDo can manipulate these four coefficients to weigh authorship over reviewership, or vice-versa. Giving more weight to code authorship loops in more junior developers, since junior developers tend to write code more than review code. In our deployments of WhoDo, we wanted to give equal weightage to both authorship and reviewership and therefore set all coefficients to 1.0.

Finally, WhoDo picks the reviewers with the top k scores and adds them as reviewers to the pull-request.
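As an illustration, the scoring function and top-k selection translate directly into code. The following is a minimal sketch under our own assumptions about how the history is stored (a (count, days-since-last-activity) pair per reviewer and path); it is not WhoDo's production implementation.

```python
import os

def score(reviewer, files, history, c1=1.0, c2=1.0, c3=1.0, c4=1.0):
    """Compute Score(r) for one candidate reviewer of a pull-request.

    Assumed layout: history[kind][(reviewer, path)] is a
    (count, days_since_last_activity) pair, where kind is "change" or "review".
    """
    dirs = {os.path.dirname(f) for f in files}  # last-level parent directories

    def term(kind, paths):
        total = 0.0
        for path in paths:
            n, t_days = history[kind].get((reviewer, path), (0, 0))
            if n:
                # More interactions, and more recent ones, raise the score.
                # The one-day floor (our assumption) avoids dividing by zero
                # for same-day activity.
                total += n / max(t_days, 1)
        return total

    return (c1 * term("change", files) + c2 * term("change", dirs) +
            c3 * term("review", files) + c4 * term("review", dirs))

def recommend(candidates, files, history, k=3):
    """Rank candidates by score and return the k highest-scoring ones."""
    ranked = sorted(candidates, key=lambda r: score(r, files, history), reverse=True)
    return ranked[:k]
```

The coefficient defaults of 1.0 mirror the equal weighting of authorship and reviewership used in our deployments.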

3.2 Load Balancing
The function we described in Section 3.1 does not attempt to balance load across developers. Consequently, WhoDo may assign a disproportionately high load to a few active and knowledgeable developers who have reviewed and committed to multiple parts of the code-base regularly. In this section, we show how WhoDo addresses this issue.

We modify the score based on the load of the reviewer to mitigate the effect of such unbalanced recommendations. We term this ScoreLoadBalanced:

\[
ScoreLoadBalanced(r) = \frac{Score(r)}{Load(r)}, \qquad
Load(r) = e^{\theta \cdot TotalOpenReviews}
\]

TotalOpenReviews is the total number of incomplete pull-requests to which the developer has been added as a reviewer. By using this metric, we aim to capture the current review workload of the user. θ is a parameter between 0 and 1 that controls the amount of load balancing: the higher the value of θ, the more aggressive the load balancing. We use the exponential function to smoothen our assignments. Figure 1 shows the value of Load(reviewer) for different values of θ. Choosing a θ of 0.5 will decrease our original reviewer score by a load value of 7.3 if the reviewer has more than 4 active reviews, while choosing a θ of 0.2 will cause a penalty of only 3.32. This gives the deployers of WhoDo an intuition on how to choose the appropriate value of θ for their repositories.

Figure 1: The load as a function of different values of the parameter θ.
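For concreteness, the load-balanced score can be layered on top of the Section 3.1 score as in the sketch below. The function names are ours, and the printed values simply illustrate how the penalty grows with θ for a fixed backlog of open reviews; the exact figures quoted in the text depend on the backlog assumed.

```python
import math

def load(total_open_reviews, theta):
    """Load(r): exponential penalty in the reviewer's open (incomplete) reviews."""
    return math.exp(theta * total_open_reviews)

def score_load_balanced(base_score, total_open_reviews, theta=0.5):
    """ScoreLoadBalanced(r) = Score(r) / Load(r)."""
    return base_score / load(total_open_reviews, theta)

# With theta = 0.5, a reviewer holding 4 open reviews is penalised by
# e^2 ~ 7.4, while theta = 0.2 penalises the same backlog far less (~2.2).
for theta in (0.2, 0.5):
    print(theta, round(load(4, theta), 2))
```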
To evaluate this load balancing strategy, we simulated WhoDo, both with and without load balancing, on one of our organization's repositories, LargeRepo (more details of LargeRepo are given in Section 6.2). We ran this experiment for a period of two months. We set θ in the load balancing algorithm to 0.5, because we wanted to start levying a penalty when the average load crosses a limit of 5 open reviews. Figure 2 shows a comparison of WhoDo with and without load balancing. We plot a metric that we call suggestion frequency for each reviewer, which is the number of times WhoDo suggested the reviewer in the evaluation period. The graph shows the suggestion frequencies for all reviewers; the x-axis shows reviewers sorted in decreasing order of suggestion frequency for WhoDo without load balancing.

The graph suggests that, with load balancing, WhoDo achieves a better distribution of reviews across all reviewers. The average suggestion frequency without load balancing is 10.10, and the standard deviation is 10.17. With load balancing, the average suggestion frequency is 9.34 and the standard deviation is reduced to 5.67. In Section 6.4, we show additional results on how WhoDo improves reviewer assignments after deploying the load balancing algorithm on LargeRepo. It is to be noted, however, that load balancing comes at the expense of reduced expertise on code reviews, since the algorithm is not choosing the highest-scoring set of reviewers but the highest-scoring reviewers with minimum load.

Figure 2: Effect of aggressive load balancing. X-axis shows developers, leftmost are more senior. Y-axis depicts the number of times the developer was suggested by the algorithm.

4 DISCUSSION
We identify that evaluation of the reviewer recommendation problem is hard because we know the people who did the review, but we do not know the people who did not do the review yet were eligible to do so. For this reason, we did not start off with a machine learning approach, as it requires ground truth in the form of positive and negative labels. It would be wrong to treat all the people who did not do the review as negative samples.

Developers are more comfortable with reviewers within their immediate organization, so they tend to add reviewers from their own organization even if they are not the right reviewers. That said, there is some value in adding "close" reviewers, because closer reviewers may be more familiar with the change. This follows from the fact that feature development often happens in groups of small teams, and the corresponding team members are often familiar with the change. This factor is difficult to account for in our evaluation.

Besides, the scoring function we defined is highly interpretable because of its simple summation of contributions. During initial deployment, a recurring and very important task was responding to developers' requests for the interpretation of suggestions, which also helped us identify various shortcomings of the system and understand the problem at a deeper level.

5 DEPLOYMENT
Figure 3: Pipeline for the WhoDo recommendation service.

Figure 3 provides an overview of the deployment of the WhoDo reviewer recommendation service. The first stage is aggregating the past history of developers that the scoring model defined above can use. We maintain a comprehensive set of repository data, such as files changed, review information, and changes, which is updated incrementally every four hours. As part of this data, we also record the number of changes and reviews by each developer to every file path they work on in the repository, and the last time of such activity. Recording such information makes it very easy for the scoring model to query this information directly and aggregate it across all candidate reviewers.
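As a concrete illustration of the history store described above, the per-developer, per-path counts and last-activity dates could be kept in a simple keyed structure that is updated incrementally and queried for the days-since values used by the scoring function. The layout below is our own assumption, not the actual WhoDo data store.

```python
from datetime import date

# Assumed layout: kind -> {(developer, path): (count, last_activity_date)}.
history = {"change": {}, "review": {}}

def record_activity(kind, developer, path, day):
    """Incrementally bump the count and refresh the last-activity date."""
    count, last = history[kind].get((developer, path), (0, None))
    latest = day if last is None else max(last, day)
    history[kind][(developer, path)] = (count + 1, latest)

def days_since(kind, developer, path, today):
    """Days since the developer last changed/reviewed the path (None if never)."""
    entry = history[kind].get((developer, path))
    return (today - entry[1]).days if entry else None

# Example: one commit by alice and one review by bob on the same file.
record_activity("change", "alice", "/src/networking/tcp.c", date(2019, 1, 10))
record_activity("review", "bob", "/src/networking/tcp.c", date(2019, 1, 12))
print(days_since("review", "bob", "/src/networking/tcp.c", date(2019, 2, 5)))  # 24
```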
The standard tool for code review inside Microsoft is Azure DevOps, which provides an interface similar to open source tools like Gerrit and GitHub, with a page for each pull-request. The page contains all the information that code reviewers might need for the pull-request, and functionality for adding comments, adding/removing reviewers, etc. Our service is hooked into the Azure DevOps service and gets triggered to show a reviewer recommendation whenever a new pull-request is created or an existing one is modified.

5.1 Showing recommendations
We had three choices on how to show the recommendations to developers:
• Suggest reviewers as a comment on the pull request.
• Add optional reviewers to the pull request.
• Add required reviewers to the pull request.
The first case, commenting, is non-invasive: it is the pull request author's choice to add a suggested developer as a reviewer. In the last two cases, suggested developers get a notification of their addition to the pull request. Optional reviewers are not required to sign off on the pull request for it to be deemed complete, whereas required reviewers need to sign off on the pull request for it to be considered complete. We did not consider the first option, as it requires vigilance on the part of code authors to actively act on our suggestions, whereas the last option would have meant an incorrect reviewer needing to sign off on the pull request. Therefore, we considered adding our suggestions as optional reviewers a healthy trade-off between passive and active addition.

6 EVALUATION
We have deployed WhoDo on 123 repositories within Microsoft. Our evaluation focuses on five of these repositories. We now provide a summary of the different experiments we used to evaluate WhoDo. In Section 6.2, we provide statistics that characterize these repositories. Next, we describe WhoDo's performance on these repositories, without load balancing, in Section 6.3. Eventually, we deployed the WhoDo load-balancing algorithm on the LargeRepo repository; we describe WhoDo's load-balancing performance in Section 6.4. To understand how to improve WhoDo, we performed a user study, the results of which we summarize in Section 6.5. Finally, to see whether our results compare well with previous work, we performed a retrospective analysis of WhoDo on the same five repositories, asking how WhoDo's reviewer suggestions, had it been deployed earlier, would have compared to the then-current manual process of adding reviewers. We describe this in Section 6.6.

We first describe the metrics used for both parts of our evaluation. Next, we describe the results.

6.1 Metrics
As noted earlier, the ultimate goals of WhoDo are to improve code quality and increase deployment agility. Hence, we use the following metrics to evaluate it (a sketch of how they can be computed follows the list).
• Hit rate: WhoDo's goal is to choose the right set of reviewers so that the pull-request owner does not have to add any reviewers manually. The WhoDo hit rate is the fraction of PRs in a repository for which at least one of the reviewers suggested by WhoDo reviewed the PR. In other words, WhoDo is successful in choosing the right reviewer(s), who review and complete the PR. A higher hit rate signifies that the service was able to successfully identify a reviewer, and therefore offload that task from the author.
• Average number of reviews per PR: We evaluate whether, post-deployment, WhoDo achieves a larger number of completed reviews per PR on average than before it was deployed. Having this larger number through an automated service will hopefully improve code quality, as more expert reviewers are examining the code.
• Average PR completion time: We evaluate whether, post-deployment, WhoDo reduced the average PR completion time for a repository. A reduced PR completion time also reduces the time to deploy a code change into production, thereby improving deployment agility. WhoDo achieves this by choosing the right set of reviewers as soon as the PR owner creates the PR, rather than waiting for the owner to manually determine the right set of reviewers.
• Average per-reviewer active load: This metric helps evaluate the efficacy of the WhoDo load-balancing algorithm. We define per-reviewer active load as the average number of open reviews that a reviewer has on any given day. We then average this across all reviewers to get the average per-reviewer active load.
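The sketch below shows how these four metrics could be computed from per-PR records; the field names (suggested, reviewed_by, created, completed) are assumptions made for illustration.

```python
def hit_rate(prs):
    """Fraction of PRs where at least one WhoDo suggestion reviewed the PR."""
    hits = sum(1 for pr in prs if set(pr["suggested"]) & set(pr["reviewed_by"]))
    return hits / len(prs)

def avg_reviews_per_pr(prs):
    """Average number of completed reviews per PR."""
    return sum(len(pr["reviewed_by"]) for pr in prs) / len(prs)

def avg_completion_hours(prs):
    """Average time from PR creation to completion, in hours."""
    return sum((pr["completed"] - pr["created"]).total_seconds() / 3600
               for pr in prs) / len(prs)

def avg_active_load(per_reviewer_daily_open):
    """Mean over reviewers of their average number of open reviews per day."""
    return sum(per_reviewer_daily_open) / len(per_reviewer_daily_open)
```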

6.2 Repository Characteristics
Table 1 characterizes the five repositories which we use to evaluate WhoDo. The LargeRepo repository is the largest, with roughly 120 active developers and more than 8 PRs per day on average. MediumRepo has a similar number of active developers, but fewer PRs per day. The other three repositories are smaller, with roughly 10-20 active developers and about 2 PRs per day. The table also shows the date of deployment of the service; the service remains active to date on all repositories. Developer coverage for each repository reflects the percentage of the total number of files in the repository that a developer has touched on average. Note that developer coverage is higher for the smaller repositories. Cut-off date refers to the date up to which we take the pull-requests for all our analyses.

Table 1: WhoDo has been deployed on 5 repositories of varying sizes. This table summarizes the characteristics of these repositories, and for how long WhoDo has been active on them.
| Repository | Active developers | Average PRs per day | Developer Coverage | Total Files | Total pull-requests | Deployed WhoDo | Cut-off date |
| SmallRepo1 | 20 | 2.1 | 11.49% | 2218 | 252 | 10/10/2018 | 5/2/2019 |
| MediumRepo | 113 | 2.68 | 1.27% | 6440 | 526 | 11/12/2018 | 5/2/2019 |
| SmallRepo2 | 12 | 2.43 | 18.09% | 383 | 196 | 10/10/2018 | 5/2/2019 |
| SmallRepo3 | 13 | 1.31 | 22.81% | 482 | 220 | 10/10/2018 | 5/2/2019 |
| LargeRepo | 116 | 8.4 | 2.49% | 7750 | 793 | 24/10/2018 | 5/2/2019 |

6.3 WhoDo Performance
We now state our findings, using the metrics described in Section 6.1, from WhoDo's five deployments. Table 2 summarizes our results.

Table 2: Aggregate PR statistics before and after the deployment of the WhoDo service on select repositories.
| Repository | No. PRs processed (before / after) | PR completion time in hours (before / after / %impr) | Average number of reviews (before / after / %impr) | WhoDo hit rate |
| SmallRepo1 | 168 / 258 | 56.06 / 44.07 / 21.39 | 1.478 / 1.66 / 12.11 | 68.35% |
| MediumRepo | 3268 / 589 | 74.88 / 61.59 / 17.75 | 1.74 / 1.65 / 5.17 | 58.16% |
| SmallRepo2 | 370 / 245 | 19.73 / 16.55 / 16.12 | 1.25 / 1.45 / 16.00 | 78.18% |
| SmallRepo3 | 487 / 222 | 21.27 / 18.50 / 13.02 | 1.25 / 1.47 / 17.6 | 80.18% |
| LargeRepo | 1787 / 474 | 74.26 / 63.71 / 14.20 | 2.65 / 3.86 / 45.66 | 72.54% |

6.3.1 PR Completion Time. We find that, for the smaller repositories SmallRepo1, MediumRepo, SmallRepo2, and SmallRepo3, the PR completion time improves significantly, between 13.03% and 21.39%. For the larger LargeRepo repository, without load balancing, the improvement is about 14%.

We had received complaints of overloaded reviewers from the larger repository, and further investigation revealed a fundamental difference between reviewer expertise in smaller versus larger repositories. In smaller repositories, developers have expertise in a larger fraction of the code-base; the developer coverage metric in Table 1 shows this. As a result, the fraction of suitable reviewers for a pull-request is larger for a small repository than for a large repository. In contrast, in larger repositories most reviewers are experts on only a small set of code-paths within the code-base. However, there do exist some senior, experienced reviewers whose expertise extends to a larger set of code-paths in larger repositories too; this is evident from the coverage numbers in Table 1. Therefore, when we deployed WhoDo without load balancing on the large repository, these senior reviewers were assigned a disproportionately large set of reviews, causing a backlog and hence affecting overall PR completion time. We addressed this issue by deploying load balancing on the LargeRepo repository, after which PR completion time showed further improvement. The results are in Section 6.4.

6.3.2 Average Number of Reviews. All repositories saw a significant increase in the average number of reviews per PR. For LargeRepo, though, the increase is significantly larger, i.e., 45.66%. We found that though WhoDo was automatically adding suitable reviewers, developers were manually adding additional reviewers who they worked closely with. Consequently, each pull-request was being examined by a larger set of expert reviewers than before. Based on this finding, a future goal of WhoDo is to capture author-reviewer affinity in the model.

6.3.3 Hit Rate. Finally, we evaluated the hit rate that WhoDo provided. We explain the hit rate and its implication using MediumRepo as an example. Note, from Table 2, that MediumRepo had the lowest hit rate, 58.16%. This means that for 41.84% of all 589 PRs, i.e., 246 PRs, none of the reviewers suggested by WhoDo interacted with the PR, and the owner manually added at least one other reviewer that WhoDo did not suggest. PR owners adding reviewers manually has the effect of reducing the hit rate on all five repositories. To understand why owners add reviewers manually, we performed the user study described in Section 6.5.

6.4 Load Balancing
We implemented load balancing in WhoDo and deployed it on the LargeRepo repository on 20th December, 2018. We have collected data on WhoDo's performance with load balancing for 45 days, and data on WhoDo's performance without load balancing for 56 days. Table 3 summarizes the results of this experiment.

Note that after we deployed load balancing, the PR completion time reduced significantly, from 70.46 hours to 57.92 hours. While WhoDo without load balancing on the LargeRepo repository did not have a significant effect on PR completion times (shown in Table 2), WhoDo with load balancing decreased it by 17.79%. The average number of reviews per PR also increased, from 3.86 to 4.05. In addition, we see that the active load on reviewers goes down significantly, from 0.663 to 0.359, a decrease of 45.85%. This analysis shows that larger repositories such as LargeRepo benefit significantly from using the load-balancing algorithm. Table 4 shows the quartiles, median, and mean of pull-request completion times for the LargeRepo repository. We note that the majority of our drop in pull-request completion times arises from the higher end, i.e., pull-requests which take longer. We argue that this is mostly because the majority of pull-requests, which comprise changes to one or two files, would not be affected by the increase in the number of reviews. It is only the pull-requests which are major changes and require a thorough review that benefit from the larger pool of available reviewers.

Table 3: Evaluation of load balancing on the LargeRepo repository.
| Measure | Before load balancing | After load balancing | % improvement | % improvement over baseline |
| No. of days of deployment | 56 | 54 | - | - |
| No. of PRs | 474 | 415 | - | - |
| PR completion time (hours) | 70.46 | 57.92 | 12.54 | 17.79 |
| Average reviewers per PR | 3.86 | 4.05 | 4.92 | 52.83 |
| Average per-reviewer active load | 0.663 | 0.359 | 45.85 | - |

Table 4: Statistics of PR completion times on the LargeRepo repository. All values in hours.
| | Q1 | Median | Q3 | Mean |
| Before load balancing | 2.80 | 22.71 | 90.65 | 70.46 |
| After load balancing | 2.63 | 20.60 | 72.82 | 57.92 |

6.5 User Study
As shown in Table 2, WhoDo achieves good hit rates. However, there are still cases in which, in spite of WhoDo's automatic reviewer additions, several PR owners manually added other reviewers. The objective of this user study was to understand why this was happening.

We conducted the user study by sending email to 75 PR owners spread across all 5 repositories. We sent only one email per developer, to avoid spamming them, and selected for investigation only those PRs where none of our 3 recommendations proved useful. In the email, we asked them their reason for adding the reviewers. We provided them the following options:

(1) Were the reviewers WhoDo added not relevant?
(2) Were the reviewers WhoDo added from a different team?
(3) While WhoDo's suggestions were valid, did you add reviewers because they were available to review promptly?
(4) Do you know the recommended reviewers or have you worked with them before?
(5) Was there any other reason for adding reviewers?

We received a total of 28 responses. The results are summarized in Figure 4. In most cases, the developers chose the third option, i.e., while WhoDo's suggestions were valid, they knew of someone who could review their PR almost immediately. The response statistics split into two categories.

Figure 4: Email responses on test repositories.

As is clear from Figure 4, there is a clear difference between the responses for smaller and larger repositories. A large portion of responses for SmallRepo1 state that the recommendations were valid but someone else did the review because they were available at the time. This feedback suggests that WhoDo is making many more valid reviewer suggestions than the hit rate indicates, but also that WhoDo's scoring function needs to capture the current availability of reviewers as well. We leave this for future work.

Developers on LargeRepo had different feedback: rather than adding reviewers from their own team, WhoDo added reviewers from other teams. Here are two example responses:

• "Reviewers, while not irrelevant, were not really required to review the specific changes." Touching a config file which is used by a lot of components led our model to suggest the people who touched that file most often, even though they were from a different team.
• "WhoDo keeps adding people from different teams and they didn't help review at all." This is a drawback of our scoring system and of all the solutions before it, since they are inherently similar. We intend to incorporate team information to fix this problem in our next iteration.

From this user study, we learn that to improve WhoDo's hit rate we have to include two factors. First, the model should include the affinity between authors and reviewers, such as team information, so that we can suggest reviewers who are already familiar with the author. Second, we should include information on the availability of the reviewer, for instance by observing their schedule or calendar.

6.6 Retrospective Analysis
We also performed a retrospective analysis of WhoDo on all five repositories. The goal of this analysis is to see how well WhoDo's suggestions match the reviewers whom the developers chose manually. This analysis is akin to that performed in previous work [26], which is similar to our work in that it measures prior activity of developers. Their approach differs in that they look solely at comments, and they count them, whereas we also look at sign-offs, as we noted that, especially for small changes, sometimes a reviewer signs off on a change without making any comments.

We note that this retrospective analysis is an important part of our study because, before deploying the scoring model or any subsequent changes to it, it helps establish a baseline level of accuracy for the system and acts as a sanity check.

Table 5 describes the results of this analysis. Similar to the previous work [26], we use the following metrics in this analysis, so that we can compare against past work:

\[
precision@m = \frac{|RR(p) \cap AR(p)|}{|RR(p)|}, \qquad
recall@m = \frac{|RR(p) \cap AR(p)|}{|AR(p)|}, \qquad
F\text{-}score@m = 2 \cdot \frac{precision@m \cdot recall@m}{precision@m + recall@m}
\]

where $RR(p)$ and $AR(p)$ are the recommended reviewers and the actual reviewers who contributed to the review process of the code change $p$, respectively, and $m$ is the size of the recommendation list. In our case, we recommend 3 developers on pull-requests, so we only report @3 numbers for comparison with previous work.

Additionally, we define PR Hits as the fraction of all pull-requests where we were able to make at least one positive suggestion, as an indication of the usefulness of the recommendations. WhoDo obtains very high values for PR Hits on all repositories, the lowest value being 68.85% and the highest being 96.57%. This means that in 96.57% of all PRs in the SmallRepo3 repository, one of the reviewers that WhoDo suggested did in fact complete the review.

Table 5: Results of the retrospective evaluation of the algorithm on select repositories.
| Repository | Total pull-requests | PR Hits (P@1) | Total suggestions | Total reviewers | Total useful suggestions | Precision | Recall | F1-Score |
| SmallRepo1 | 190 | 170 (89.47%) | 560 | 556 | 266 | 0.48 | 0.48 | 0.48 |
| MediumRepo | 289 | 199 (68.85%) | 820 | 926 | 300 | 0.37 | 0.32 | 0.34 |
| SmallRepo2 | 166 | 123 (74.09%) | 466 | 270 | 169 | 0.36 | 0.63 | 0.46 |
| SmallRepo3 | 263 | 254 (96.57%) | 788 | 628 | 400 | 0.51 | 0.64 | 0.57 |
| LargeRepo | 234 | 191 (81.62%) | 700 | 1120 | 330 | 0.47 | 0.30 | 0.36 |
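For concreteness, the @3 metrics and PR Hits could be computed as in the sketch below. The per-PR reviewer sets and the totals-based aggregation are our assumptions; the aggregation shown is consistent with the total counts reported in Table 5.

```python
def retrospective_metrics(prs):
    """Precision@3, recall@3 and F-score@3 from repository-level totals.

    `prs` is a list of (recommended, actual) reviewer sets per pull-request,
    with at most 3 recommendations each. Aggregating from totals (useful
    suggestions over all suggestions / all actual reviewers) matches the
    aggregate counts in Table 5; averaging per-PR values is an alternative.
    """
    useful = sum(len(set(rec) & set(act)) for rec, act in prs)
    suggested = sum(len(rec) for rec, _ in prs)
    actual = sum(len(act) for _, act in prs)
    precision = useful / suggested
    recall = useful / actual
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pr_hits(prs):
    """Fraction of PRs with at least one suggestion that actually reviewed."""
    return sum(1 for rec, act in prs if set(rec) & set(act)) / len(prs)
```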
Overall, while the precision and recall numbers do not seem high, they are comparable to previous systems. For instance, cHRev [26] reported numbers for four repositories with an average precision of 0.39, an average recall of 0.69, and an average F1-score of 0.5. WhoDo obtains an average precision of 0.44 and an average recall of 0.47, for an average F1-score of 0.44. These numbers provide a ballpark for performance, but they are not an exact comparison because the systems were evaluated on different software projects with different characteristics (e.g., different numbers and distributions of developers and their activity). The reason why these numbers are not higher could also be explained by our user study: it found that often there are multiple reviewers who are qualified to review the PR, but the owner of the change chose the developer who was most readily available. We are investigating ways to infer developer availability and use it to augment the recommended reviewer ranking.

7 CONCLUSION
In this paper we reported our experiences and results from implementing and deploying a code reviewer recommendation system. Our system, which identifies potential code reviewers based on their prior experience with the files and directories involved in a code review, has been deployed on five software projects at Microsoft and performs well in helping the right developers review the right changes. Of note, we discovered that reviewer recommendation systems may suffer from reviewer load imbalance, and we have mitigated this issue through a load balancing mechanism. Use of load balancing leads to a less skewed review assignment and a decrease in review latency. As far as we know, this is the first study to report on a deployed code reviewer recommendation system at scale. We also presented a comprehensive evaluation that included a set of goal-driven metrics, a user study of developers that used the system, and a retrospective analysis similar to that used in other reviewer recommendation studies. We have also identified the additional factors of author-reviewer affinity and reviewer availability when recommending reviewers, and we plan to investigate these in the future. We believe that others implementing reviewer recommendation in their own software organizations can benefit from the ideas and findings in this paper.

REFERENCES
[1] A. Frank Ackerman, Lynne S. Buchwald, and Frank H. Lewski. 1989. Software Inspections: An Effective Verification Process. IEEE Softw. 6, 3 (May 1989), 31–36. https://doi.org/10.1109/52.28121
[2] A. Frank Ackerman, Priscilla J. Fowler, and Robert G. Ebenau. 1984. Software Inspections and the Industrial Production of Software. In Proc. of a Symposium on Software Validation: Inspection-testing-verification-alternatives. Elsevier North-Holland, Inc., New York, NY, USA, 13–40. http://dl.acm.org/citation.cfm?id=3541.3543
[3] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, David Notkin, Betty H. C. Cheng, and Klaus Pohl (Eds.). IEEE Computer Society, 712–721. https://doi.org/10.1109/ICSE.2013.6606617
[4] Vipin Balachandran. 2013. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 931–940.
[5] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. 2012. The Secret Life of Patches: A Firefox Case Study. In 19th Working Conference on Reverse Engineering, WCRE 2012, Kingston, ON, Canada, October 15-18, 2012. IEEE Computer Society, 447–455. https://doi.org/10.1109/WCRE.2012.54

[6] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. 2013. The influence of non-technical factors on code review. In 20th Working Conference on Reverse Engineering, WCRE 2013, Koblenz, Germany, October 14-17, 2013, Ralf Lämmel, Rocco Oliveto, and Romain Robbes (Eds.). IEEE Computer Society, 122–131. https://doi.org/10.1109/WCRE.2013.6671287
[7] Amiangshu Bosu, Jeffrey C. Carver, Christian Bird, Jonathan D. Orbeck, and Christopher Chockley. 2017. Process Aspects and Social Dynamics of Contemporary Code Review: Insights from Open Source Development and Industrial Practice at Microsoft. IEEE Trans. Software Eng. 43, 1 (2017), 56–75. https://doi.org/10.1109/TSE.2016.2576451
[8] Christoph Hannebauer, Michael Patalas, Sebastian Stünkel, and Volker Gruhn. 2016. Automatically recommending code reviewers based on their expertise: An empirical comparison. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 99–110.
[9] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[10] Gaeul Jeong, Sunghun Kim, Thomas Zimmermann, and Kwangkeun Yi. 2009. Improving code review by predicting reviewers and acceptance of patches. Research on software analysis for error-free computing center Tech-Memo (ROSAEC MEMO 2009-006) (2009), 1–18.
[11] Yujuan Jiang, Bram Adams, and Daniel M. Germán. 2013. Will my patch make it? And how fast?: Case study on the Linux kernel. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, San Francisco, CA, USA, May 18-19, 2013, Thomas Zimmermann, Massimiliano Di Penta, and Sunghun Kim (Eds.). IEEE Computer Society, 101–110. https://doi.org/10.1109/MSR.2013.6624016
[12] Oleksii Kononenko, Olga Baysal, Latifa Guerrouj, Yaxin Cao, and Michael W. Godfrey. 2015. Investigating code review quality: Do people and participation matter?. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 111–120.
[13] Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E. Hassan. 2014. The impact of code review coverage and code review participation on software quality: a case study of the Qt, VTK, and ITK projects. In 11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, Premkumar T. Devanbu, Sung Kim, and Martin Pinzger (Eds.). ACM, 192–201. https://doi.org/10.1145/2597073.2597076
[14] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-based peer reviewers recommendation in modern code review. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 367–377.
[15] Mohammad Masudur Rahman, Chanchal K. Roy, and Jason A. Collins. 2016. CORRECT: Code reviewer recommendation in GitHub based on cross-project and technology experience. In 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C). IEEE, 222–231.
[16] M. M. Rahman, C. K. Roy, J. Redl, and J. Collins. 2016. CORRECT: Code Reviewer Recommendation at GitHub for Vendasta Technologies. In Proc. ASE. 792–797.
[17] Peter C. Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE '13, Saint Petersburg, Russian Federation, August 18-26, 2013, Bertrand Meyer, Luciano Baresi, and Mira Mezini (Eds.). ACM, 202–212. https://doi.org/10.1145/2491411.2491444
[18] Peter C. Rigby, Brendan Cleary, Frédéric Painchaud, Margaret-Anne D. Storey, and Daniel M. Germán. 2012. Contemporary Peer Review in Action: Lessons from Open Source Development. IEEE Software 29, 6 (2012), 56–61. https://doi.org/10.1109/MS.2012.24
[19] Peter C. Rigby, Daniel M. Germán, and Margaret-Anne D. Storey. 2008. Open source software peer review practices: a case study of the Apache server. In 30th International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 10-18, 2008, Wilhelm Schäfer, Matthew B. Dwyer, and Volker Gruhn (Eds.). ACM, 541–550. https://doi.org/10.1145/1368088.1368162
[20] Peter C. Rigby and Margaret-Anne D. Storey. 2011. Understanding broadcast based peer review on open source software projects. In Proceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, May 21-28, 2011, Richard N. Taylor, Harald C. Gall, and Nenad Medvidovic (Eds.). ACM, 541–550. https://doi.org/10.1145/1985793.1985867
[21] Patanamon Thongtanunam, Raula Gaikovina Kula, Ana Erika Camargo Cruz, Norihiro Yoshida, and Hajimu Iida. 2014. Improving code review effectiveness through reviewer recommendations. In Proceedings of the 7th International Workshop on Cooperative and Human Aspects of Software Engineering. ACM, 119–122.
[22] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 141–150.
[23] Peter Weißgerber, Daniel Neu, and Stephan Diehl. 2008. Small patches get in!. In Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR 2008 (Co-located with ICSE), Leipzig, Germany, May 10-11, 2008, Ahmed E. Hassan, Michele Lanza, and Michael W. Godfrey (Eds.). ACM, 67–76. https://doi.org/10.1145/1370750.1370767
[24] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who should review this change?: Putting text and file location analyses together for more accurate recommendations. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 261–270.
[25] Xin Yang, Norihiro Yoshida, Raula Gaikovina Kula, and Hajimu Iida. 2016. Peer review social network (PeRSoN) in open source projects. IEICE Transactions on Information and Systems 99, 3 (2016), 661–670.
[26] Motahareh Bahrami Zanjani, Huzefa H. Kagdi, and Christian Bird. 2016. Automatically Recommending Peer Reviewers in Modern Code Review. IEEE Trans. Software Eng. 42, 6 (2016), 530–543. https://doi.org/10.1109/TSE.2015.2500238