1 GiterCom - A Dataset of Open Source Developer 59 2 60 3 Communications in Giter 61 4 62 5 Esteban Parra Ashley Ellis Sonia Haiduc 63 6 [email protected] [email protected] [email protected] 64 7 Florida State University Florida State University Florida State University 65 8 Tallahassee, Florida Tallahassee, Florida Tallahassee, Florida 66 9 67 10 Abstract 1 Introduction 68 11 Team communication is essential for the development of modern Modern, complex open source software systems often require large 69 12 software systems. For distributed software development teams, such teams in order to be developed. The teams are usually geograph- 70 13 as those found in many open source projects, this communication ically distributed across diferent locations, countries and even 71 14 usually takes place using electronic tools. Among these, modern continents. In order to collaborate, communicate, and coordinate, 72 15 chat platforms such as Gitter are becoming the de facto choice these teams make use of electronic tools such as instant messag- 73 16 for many software projects due to their advanced features geared ing, email, etc. [14, 15, 18, 20]. Recently, modern messaging and 74 17 towards software development and efective team communication. collaboration platforms such as Gitter4 and Slack5 have revolution- 75 18 Gitter channels contain numerous messages exchanged by devel- ized team communications and project coordination by providing 76 19 opers regarding the state of the project, issues and features of the a user-friendly way of managing and organizing conversations, 77 20 system, team logistics, etc. These messages can contain important facilitating knowledge sharing, and by integrating with external 78 21 information to researchers studying open source software systems, software development tools such as GitHub, Asana, and Jira[19]. 79 22 developers new to a particular project and trying to get familiar Given their features and the support for software development, 80 23 with the software, etc. Therefore, uncovering what developers are many open source projects have adopted Gitter and as their 81 24 communicating about through Gitter is an essential frst step to- preferred communication means [15]. In particular, Gitter is cur- 82 25 wards successfully understanding and leveraging this information. rently the most popular platform in open source 83 26 84 We present a new dataset, called GitterCom, which aims to enable development teams [15]. It also presents some advantages over 27 research in this direction and represents the largest manually la- Slack, such as: 85 28 86 beled and curated dataset of Gitter developer messages. The dataset : in Slack, communities are con- 29 • Open access to communications 87 is comprised of 10,000 messages collected from 10 Gitter communi- trolled by the administrators, whereas in Gitter, access to the user- 30 88 ties associated with the development of open source software. Each generated data is public. In particular, public messages and user- 31 89 message was manually annotated and verifed by two of the authors, generated content in Gitter are subject to the Creative Commons 32 90 capturing the purpose of the communication expressed by the mes- license: Attribution + Non-Commercial + ShareAlike (BY-NC-SA)6 33 91 sage. While the dataset has not yet been used in any publication, : in Slack communities, only the latest 34 • Free access to historical data 92 we discuss how it can enable interesting research opportunities. 10,000 messages are accessible without paying. Since most public 35 93 Slack channels use the free tier [13], their historical data is unavail- 36 94 CCS Concepts able. Conversely, messages posted to public Gitter channels are 37 95 preserved and accessible indefnitely in chat room logs. 38 • Software and its engineering → Collaboration in software 96 39 development; Open source model; Documentation; Despite its advantages over Slack, its greater popularity among 97 40 open source developers, and the availability of tens of thousands 98 41 Keywords of message exchanges between developers of open source soft- 99 42 datasets, chat, social media, team communication ware, there have been no papers so far investigating developer 100 43 communications in Gitter. Rather, existing works analyzing devel- 101 ACM Reference Format: oper communications in modern instant messaging platforms have 44 Esteban Parra, Ashley Ellis, and Sonia Haiduc. 2020. GitterCom - A Dataset 102 45 so far focused solely on Slack [11–13, 16]. 103 of Open Source Developer Communications in Gitter. In 17th International We argue that Gitter developer communications are an untapped 46 Conference on Mining Software Repositories (MSR ’20), October 5–6, 2020, 104 47 Seoul, Republic of Korea. ACM, New York, NY, USA, 5 pages. https://doi.org/ information resource that could be leveraged by researchers in 105 48 10.1145/3379597.3387494 order to get a deeper understanding about the nature of developer 106 49 communications in open source software. With this paper, we aim to 107 Permission to make digital or hard copies of all or part of this work for personal or 50 encourage research in this direction by introducing GitterCom, the 108 classroom use is granted without fee provided that copies are not made or distributed 51 for proft or commercial advantage and that copies bear this notice and the full citation frst manually labeled dataset of Gitter instant message histories in 109 52 on the frst page. Copyrights for components of this work owned by others than ACM open source systems. GitterCom contains 10,000 messages labeled 110 must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, based on the communication purpose they express. This dataset is 53 to post on servers or to redistribute to lists, requires prior specifc permission and/or a 111 54 fee. Request permissions from [email protected]. the largest manually labeled dataset of developer instant messages; 112 55 MSR ’20, October 5–6, 2020, Seoul, Republic of Korea 113 © 2020 Association for Computing Machinery. 4https://gitter.im/ 56 114 ACM ISBN 978-1-4503-7517-7/20/05...$15.00 5https://slack.com/ 57 https://doi.org/10.1145/3379597.3387494 6https://creativecommons.org/licenses/by-nc-sa/3.0/us/ 115 58 1 116 MSR ’20, October 5–6, 2020, Seoul, Republic of Korea Esteban Parra, Ashley Ellis, and Sonia Haiduc

117 Table 1: Distribution of messages per purpose category 175 118 1 2 3 176 Category Cucumber Freezing ImageJ Jhipster JSPM MJS Sklearn THW UIKit Xenko Overall (%) 177 119 Communication 325 794 490 446 635 695 506 583 321 480 5275 (52.75%) 178 120 Customer 442 0 150 239 0 0 4 145 451 0 1431 (14.31%) 179 121 180 support 181 122 Dev-Ops 198 183 308 269 305 235 464 240 190 383 2775 (27.75%) 182 123 Discovery and 13 1 10 9 7 2 3 15 5 32 97 (0.97%) 183 184 124 news 185 125 Fun 0 2 0 0 0 39 0 0 1 0 42 (0.42%) 186 126 187 Networking and 0 0 1 3 0 3 0 0 0 32 39 (0.39%) 188 127 social activities 189 128 Participation in 4 2 9 15 21 13 13 10 12 54 153 (1.53%) 190 129 191 Communities of 192 130 Practice 193 131 194 Team Collabora- 18 18 32 19 32 13 10 7 20 19 188 (1.88%) 195 132 tion 196 133 197 134 198 199 135 200 136 the only other manually labeled dataset available is comprised of sources of information; networking and social activities, where de- 201 137 500 developer Slack messages in one software company [20]. velopers interact with other developers who share similar interests 202 The rest of the paper is structured as follows: section 2 presents or jobs; and , which are messages sharing gifs and memes or 203 138 fun 204 139 an overview of the dataset, section 3 outlines the data collection meant for participating in gaming activities. 205 206 140 process, section 4 discusses potential research directions using this The second purpose relates to Team-wide activities and in- 207 141 dataset, section 5 presents limitations and future improvements cludes messages aimed towards carrying out software development 208 142 that could be made to the data set and lastly, section 6 concludes activities related to the system being developed. Messages within 209 210 143 the paper. this purpose can be further divided into four categories: communi- 211 144 cation messages in which the developers engage in activities such 212 145 2 Dataset Description as communication with teammates (e.g., members of a distributed 213 team) during meetings and note-taking, communication with other 214 146 GitterCom includes data about 10,000 messages collected from 215 147 10 open source software development Gitter communities (1,000 stakeholders, or discussing non-work topics; team collaboration 216 217 148 messages in which the developers engage in activities such as team messages per community). Each message was manually labeled with 218 149 information about the purpose of the communication it expresses, management, fle, and code sharing; Dev-Ops messages in which 219 150 based on the categories identifed by Lin . [16]. the developers engage in activities such as communicating updates 220 et al 221 151 regarding the status of the project ( . ., development operation GitterCom is available in both CSV and Microsoft Excel Open e g 222 152 XML Spreadsheet (XLSX) fle formats online7. In the CSV fle, each notifcations about recent changes to the system, commits, bug 223 153 line is a data record. Each record contains the information for a fxes, pushes to the repository, merges), software deployments, and 224 team Q&As; and fnally, messages in which the 225 154 single message and consists of seven information felds, separated customer support 226 155 by comma and using quotes as the text delimiter. In particular, each developers assist new or existing users of the system on how to 227 228 156 perform certain tasks, identify bugs, and troubleshoot errors. row contains: (i) the channel/system the message belongs to, (ii) 229 157 a unique messageID, (iii) the date and time at which the message The last purpose is represented by Community support mes- 230 158 was posted, (iv) the author of the message, (v) the content of the sages, where developers participate in communities of practice or 231 232 159 special interest groups. These messages are characterized by devel- message in plain text, (vi) the corresponding purpose category 233 160 (manual label), and (vii) the purpose subcategory (manual label). opers aiming to keep up with specifc frameworks/communities, to 234 161 Next, we present brief descriptions of the diferent purposes, learn about new tools and frameworks for developing applications, 235 or to brainstorm ideas with other people in the community. 236 162 their categories and subcategories we used to manually label the 237 Table 1 shows the number of messages per category in Gitter- 163 messages in GitterCom. These were frst identifed by Lin et al.[16], 238 239 164 who surveyed software developers about their use of Slack. Com, for each of the 10 open source systems/communities we considered and the overall distribution of messages associated with 240 165 The frst purpose, called Personal benefts, includes messages 241 166 in which the developer’s main purpose is to fulfll personal needs. each category across all the communities in GitterCom. 242 243 167 Based on the hierarchy presented above, we notice that the major- Messages within this purpose can be further divided into three cat- 244 ity of Gitter messages in GitterCom belong to Team-wide purposes. 168 egories: discovery and aggregation of news and information, where 245 169 developers post reliable, interesting, and relevant blogs or other Table 1 shows that the distribution of messages varies signifcantly 246 across categories. In particular, 83% of the messages are meant to 247 170 248 4 171 MarionneteJS support activities directly associated with the development of the 249 5 SciKit-Learn 250 172 6 system. On the other hand, 14.31% of the messages are related to TheHollyWafe 251 7 173 https://fgshare.com/s/9b3df36e22a8a8f77169 community support and engagement with communities of practice, 252 174 2 253 254 GiterCom - A Dataset of Open Source Developer Communications in Giter MSR ’20, October 5–6, 2020, Seoul, Republic of Korea

255 Table 2: Subset of Gitter communities included in GitterCom for GitterCom was to manually curate and label a subset of the mes- 313 256 sages, based on the purposes/intents identifed by Lin et al.[16] (as 314 315 257 Community Users Messages Application domain described in Section 2). We therefore selected the frst ten channels, 316 258 Marionette [6] 3014 181108 Javascript framework out of 100 qualifying channels which met the following criteria: (i) 317 259 jspm [5] 1103 27245 Package manager they are linked to an active GitHub repository (commits have been 318 319 260 scikit-learn [7] 3188 9844 Machine Learning made within the past year), (ii) they are used as a communication 320 261 Xenko3d [10] 103 2890 Game engine tool for the active development of an open-source software system, 321 322 262 (iii) they cover diferent application domains, (iv) they have been FreezingMoon [2] 109 207925 Video game 323 263 UIkit [9] 2155 41265 Front-end framework active in the past year, and (v) they contain at least 1,000 messages. 324 264 jHipster [4] 2575 39418 Application generator Table 2 shows the details of the selected systems/channels. 325 326 265 Cucumber [1] 337 2030 Testing framework From each of the ten selected channels we then collected the 327 266 Imagej [3] 209 8149 Image processing 1,000 most recent consecutive messages up to April 1, 2019, for a 328 267 total of 10,000 messages. The frst two authors then carried out a 329 TheHolyWafe [8] 196 15046 VoIP communication 330 268 coding procedure to label these messages, using the categories and 331 269 subcategories identifed by Lin et al.[16] as labels. More specifcally, 332 333 270 each message was assigned a category describing the main purpose and only 2.69% of the messages are linked to personal benefts. 334 271 of the message and a subcategory describing the specifc activity 335 272 Moreover, 53% of the messages involve communication between the message relates to. If a message did not provide any meaningful 336 337 273 the developers and stakeholders, 28% of the messages communicate information by itself (e.g., a single emoji, "ok", "great", ""), it was 338 274 updates regarding the status of the system, and 15% of the messages classifed as "Uninformative". After the individual coding, the two 339 275 involve customer support. authors met, discussed, and resolved any coding conficts. The mes- 340 341 276 sages for which a classifcation of "Uninformative" was agreed upon 342 277 3 Data Collection were discarded and replaced by an equal number of messages from 343 344 278 This section presents in detail the data collection and curation pro- the same channel. Then, the coding process was applied on these 345 279 cedure we used to create the GitterCom dataset. First, we gathered new messages. This procedure was repeated until 1,000 messages 346 8 280 the list of all the communities listed in Gitter’s Explore interface were obtained for each channel, all having a label other than "Un- 347 348 281 on April 1, 2019. We then excluded the channels in which the con- informative". Across all channels, a total of 1,061 messages were 349 282 versations were not in English, resulting in a list of 139 Gitter labeled as "Uninformative" during the labeling process. 350 9 283 communities. Afterwards, using the Gitter API , we extracted all During the coding process, when the content of a message was 351 352 284 of the messages in the main channels of these communities, from insufcient to determine a category, we used the list of contributors 353 285 their inception until April 1, 2019. This data collection resulted in a to the system’s repository as a source of additional information 354 355 286 set of 2,939,335 messages across all 139 channels. that could give an insight into the nature of the message. One ex- 356 287 To extract the raw data for GitterCom, we used a custom python ample of such ambiguous messages were questions which could be 357 288 script, which uses pycurl to connect to Gitter’s REST API and interpreted as either a customer asking about the system (Customer 358 359 289 obtain all the messages’ raw text and their corresponding metadata. Support) or a developer of the system asking about a part of the 360 290 Any code snippets included in the messages are also extracted as system they are unfamiliar with (Team Q&A). In this particular 361 291 raw text. Afterwards, to facilitate the labeling process, we ran a case, if a question was made by a contributor to the system, it was 362 363 292 custom script in Java to convert the messages from the JSON format classifed as Team Q&A, and Customer Support otherwise. 364 293 provided by Gitter’s API to CSV format. The data collection scripts The manual coding procedure took the two authors about 100 365 366 294 and instructions on their usage are found in our replication package hours per person to complete (200 hours total), spread across three 367 295 [17]. weeks. After completing the manual labeling, we obtained Gitter- 368 296 The 139 channels collected as raw data vary in three main ways: Com, a dataset comprised of 10,000 Gitter messages, 1,000 per Gitter 369 370 297 by membership - the channels contain between 100 and 17,000 mem- channel, classifed according to their purpose. 371 298 bers per channel, by level of activity - the smallest channel contains 372 299 21 messages, whereas the largest channel contains over 423,000 373 374 300 messages, and by type - channels can be made for the development 4 Potential Research Applications 375 301 of a particular software system, where the developers communicate Previous studies have investigated the growing use of alternative 376 377 302 with each other and with the system’s stakeholders, or made for communication means by developers [15, 16, 20]. The results of 378 303 building communities of practice in which the members’ discussion these studies show the rise of instant messaging tools and the im- 379 304 revolves around particular topics, frameworks, or programming pact they have on reshaping team dynamics and the communication 380 381 305 languages, but does not involve discussion about the active devel- landscape in increasingly distributed software development envi- 382 306 opment of a system. ronments. Future studies could make use of GitterCom to study the 383 307 While we make the entire data we extracted for all the 139 chan- relationship between open source development activity and commu- 384 10 385 308 nels available for download to other researchers , our main goal nication trends. In particular, GitterCom enables further research to 386 309 387 8 analyze and understand patterns in developer communications and https://gitter.im/explore 388 310 9 to address important questions such as: How do software teams https://developer.gitter.im 389 10 311 https://fgshare.com/s/3fd5af0b869b8fd010bb use tools like Gitter to communicate among themselves and with 390 312 3 391 392 MSR ’20, October 5–6, 2020, Seoul, Republic of Korea Esteban Parra, Ashley Ellis, and Sonia Haiduc

393 451 394 452 453 395 454 396 455 397 456 457 398 458 399 459 460 400 461 401 462 402 463 464 403 465 404 466 405 467 468 406 469 407 470 471 408 472 409 473 410 474 475 411 476 412 477 413 478 Figure 1: GitterCom sample 479 414 480 415 481 482 416 other stakeholders? How do team dynamics refect in team com- the messages exchanged by developers in a project. Improvements 483 417 munications? Do developers exchange diferent types of messages that would help increase the generalizability of the results of future 484 418 485 at diferent times in the software life cycle? Do developers new studies analyzing this dataset include the expansion of the labeled 486 419 to a project post diferent types of messages than the more senior data in GitterCom to include more messages from more projects. 487 420 488 developers? For this purpose we also release the raw, unlabeled data extracted 421 489 GitterCom could also be used as a training dataset for machine by our crawler script, containing over 2 million messages from 139 490 422 learning approaches for automatically classifying new developer open source projects at https://fgshare.com/s/3fd5af0b869b8fd010bb. 491 423 492 messages based on their purpose. This could, in turn, be useful We therefore hope other researchers will join our efort and will 493 424 to automatically organize messages into threads or to create sum- select more of this raw data to label and contribute to GitterCom. 494 425 maries of developer conversations based on their purpose, such that 495 426 496 developers that were away for a while or newcomers to a project 497 427 could quickly catch up on important conversations they missed. 498 428 6 Conclusions 499 Another avenue for future work would be to use GitterCom in 429 The rapid adoption of instant messaging tools in open source devel- 500 order to perform large scale replications of previous studies that 501 430 opment communities indicates a strong need to study the nature of analyzed developer communications in Slack [11, 12, 20], but used 502 431 these developer communications and their implications for open 503 much smaller or restricted datasets ( . ., communications in student 504 432 e g source software. However, such analysis is not possible without projects or a particular software company). These replications on 505 433 data to explore. GitterCom could help corroborate previous fndings or uncover 506 434 We introduced GitterCom, the largest manually labeled and cu- 507 new information about how developers communicate through in- 508 435 rated dataset of Gitter developer messages. It comprises 10,000 stant messaging tools. One example of such work that could beneft 509 436 messages and their corresponding purpose labels across multiple 510 from a large scale replication is work on the identifcation of mes- 437 open source Gitter channels, corresponding to systems covering 511 sages that contain rationale for the decisions made by developers 512 438 a wide range of application domains. We believe that our dataset throughout the software life cycle [12]. Thus far, work on rationale 513 439 provides immense opportunities for researchers to perform large 514 has been limited to analyzing the chat messages of three student 515 440 scale empirical research and further analysis on developer discus- teams working on a multi-project capstone course. sions, communication with stakeholders, and team dynamics in 516 441 517 442 open source systems. Our hope is nevertheless that the initial data 518 519 443 5 Limitations and Future Improvements set in this paper will spur interest for the continuing collection and analysis of developer instant communications. 520 444 Although GitterCom is the largest data set of curated and manually 521 445 , it still encompasses a small subset 522 labeled developer instant messages 523 446 of all the existing Gitter developer communications. Therefore, one 524 447 limitation to GitterCom could be that the collected projects are 7 Acknowledgments 525 526 448 not representative of all open-source projects and that the most Sonia Haiduc and Esteban Parra were supported in part by the 527 449 recent 1,000 messages for a project are not representative of all National Science Foundation grants CCF-1846142 and CCF-1644285. 528 450 4 529 530 GiterCom - A Dataset of Open Source Developer Communications in Giter MSR ’20, October 5–6, 2020, Seoul, Republic of Korea

531 References Conference on Mining Software Repositories (MSR’15). IEEE, Florence, Italy, 422– 589 425. 532 [1] [n.d.]. Cucumber. https://github.com/cucumber/cucumber 590 [15] Verena Käfer, Daniel Graziotin, Ivan Bogicevic, Stefan Wagner, and Jasmin Ra- 591 533 [2] [n.d.]. FreezingMoon. https://github.com/FreezingMoon madani. 2018. Communication in Open-Source Projects-End of the E-mail Era?. [3] [n.d.]. Imagej. https://github.com/imagej/imagej 592 534 In Proceedings of the 40th IEEE/ACM International Conference on Software Engi- 593 [4] [n.d.]. jHipster. https://github.com/jhipster/jhipster/ . IEEE, Gothenburg, Sweden, 242–243. 535 neering(ICSE’18) 594 [5] [n.d.]. jspm. https://github.com/jspm [16] Bin Lin, Alexey Zagalsky, Margaret-Anne Storey, and Alexander Serebrenik. [6] [n.d.]. Marionette. https://github.com/marionettejs/backbone.marionette 595 536 2016. Why Developers Are Slacking Of: Understanding How Software Teams 596 [7] [n.d.]. scikit-learn. https://github.com/scikit-learn/scikit-learn Use Slack. In 537 [8] [n.d.]. TheHolyWafe. https://github.com/TheHolyWafe Proceedings of the 19th ACM Conference on Computer Supported 597 Cooperative Work and Social Computing (CSCW’16 ). ACM, 333–336. 598 538 [9] [n.d.]. UIkit. https://github.com/uikit/uikit [17] Esteban Parra. 2020. GitterCom, dataset. https://fgshare.com/s/ [10] [n.d.]. Xenko3d. https://gitter.im/xenko3d/xenko 599 539 9b3df36e22a8a8f77169 600 [11] Rana Alkadhi, Jan Ole Johanssen, Emitza Guzman, and Bernd Bruegge. 2017. [18] M. Storey, A. Zagalsky, F. F. Filho, L. Singer, and D. M. German. 2017. How Social 540 REACT: An Approach for Capturing Rationale in Chat Messages. In 601 Proceedings and Communication Channels Shape and Challenge a Participatory Culture in 602 541 of the 11th ACM/IEEE International Symposium on Empirical Software Engineering Software Development. 43, 2 (Feb. . IEEE, Toronto, ON, Canada, 175–180. IEEE Transactions on Software Engineering 603 542 and Measurement (ESEM’17) 2017), 185–204. 604 [12] R. Alkadhi, T. Lata, E. Guzmany, and B. Bruegge. 2017. Rationale in Development [19] Margaret-Anne Storey, Leif Singer, Brendan Cleary, Fernando Figueira Filho, 543 605 Chat Messages: An Exploratory Study. In Proceedings of the 14th IEEE/ACM and Alexey Zagalsky. 2014. The (R) Evolution of Social Media in Software . 436–446. 606 544 International Conference on Mining Software Repositories (MSR’17) Engineering. In Proceedings of the 36th ACM/IEEE International Conference in 607 [13] Preetha Chatterjee, Kostadin Damevski, Lori Pollock, Vinay Augustine, and . ACM, Hyderabad, 545 Nicholas A Kraft. 2019. Exploratory Study of Slack Q&A Chats as a Mining Source Software Engineering, Future of Software Engineering (FOSE’14) 608 India, 100–116. 609 546 for Software Engineering Tools. In Proceedings of the 16th IEEE International [20] Viktoria Stray, Nils Brede Moe, and Mehdi Noroozi. 2019. Slack Me if You Can!: . IEEE, Montreal, Canada, 610 547 Conference on Mining Software Repositories (MSR’19) Using Enterprise Social Networking Tools in Virtual Agile Teams. In Proceedings 611 490–501. . 548 [14] Shaiful Alam Chowdhury and Abram Hindle. 2015. Mining StackOverfow to of the 14th International Conference on Global Software Engineering (ICGSE’19) 612 IEEE, Montreal, Quebec, Canada, 101–111. 613 549 Filter out Of-topic IRC Discussion. In Proceedings of the 12th IEEE Working 614 550 615 551 616 617 552 618 553 619 620 554 621 555 622 556 623 624 557 625 558 626 559 627 628 560 629 561 630 631 562 632 563 633 564 634 635 565 636 566 637 567 638 639 568 640 569 641 642 570 643 571 644 572 645 646 573 647 574 648 575 649 650 576 651 577 652 653 578 654 579 655 580 656 657 581 658 582 659 583 660 661 584 662 585 663 664 586 665 587 666 588 5 667 668