Summer of Code & Hackathons for Community Building in Scientific
Total Page:16
File Type:pdf, Size:1020Kb
Community Code Engagements: Summer of Code & Hackathons for Community Building in Scientific Software Erik H. Trainer, Chalalai Chaihirunkarn, Arun Kalyanasundaram, James D. Herbsleb Institute for Software Research Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213 {etrainer, cchaihir, arunkaly, jdh}@cs.cmu.edu ABSTRACT the software’s future may not be addressed, e.g.: in 4 years’ time, Community code engagements — short-term, intensive software will the software still be available? Will it work? Will there be development events — are used by some scientific communities pool of participants with the right set of technical skills who can to create new software features and promote community building. respond to bug reports and feature requests? Successful But there is as yet little empirical support for their effectiveness. communities find ways to get code contributions from their This paper presents a qualitative study of two types of community members and to incorporate successive generations of newcomers code engagements: Google Summer of Code (GSoC) and after the original developers leave [30]. hackathons. We investigated the range of outcomes these There is an additional twist for software that scientists write. engagements produce and the underlying practices that lead to Although scientists are directly funded to produce new these outcomes. In GSoC, the vision and experience of core knowledge, they spend significant time searching for, using, and members of the community influence project selection, and the developing software that enables those results. The sustainability intensive mentoring process facilitates creation of strong ties. of scientific software – the ability to maintain the software in a Most GSoC projects result in stable features. The agenda setting state where scientists can understand, replicate, and extend prior phase of hackathons reveals high priority issues perceived by the reported results that depend on that software – has sometimes community. Social events among the relatively large numbers of been an afterthought because scientists are rewarded for the participants over brief engagements tend to create weak ties. Most publications they write, not the software they create and support hackathons result in prototypes rather than finished tools. We [25, 26]. This software, however, is a critical link in the chain of discuss themes and tradeoffs that suggest directions for future evidence establishing new scientific knowledge, and thus other empirical work around designing community code engagements. scientists need to be able to run this software in order to understand and replicate this new knowledge, and apply it to new Categories and Subject Descriptors problems. H.5.3 [Information Interfaces and Presentation (e.g., HCI)]: Group and Organization Interfaces – computer supported A few scientific communities in the life sciences have begun cooperative work, organizational design. experimenting with short-term focused community engagements, such as Google Summer of Code (GSoC) and hackathons. Keywords Although there is reason to believe from previous research on Community code engagements, Google Summer of Code (GSoC), online communities (e.g., [30]) that these engagements may hackathons, scientific software. enhance the sustainability of scientific software, there is as yet little empirical support. Moreover, evidence about when various 1. INTRODUCTION types of engagements are likely to succeed is scant. How do you go from a small number of people with a common interest to a full-fledged community? The active body of research In this paper, we aim to understand the range of outcomes these on this problem (e.g., [17, 30, 31]) is a testament to the role of engagements produce and the underlying practices that lead to community building in collaborative work practices. these outcomes. We hope to highlight concrete engagement design issues that community leaders and funding agencies might Active communities are essential to the sustainability of software. consider in order to optimize the outcomes they desire. Without a community around the code distribution, key issues of 2. BACKGROUND This is a preprint version of a paper to be published in the 2014 ACM 2.1 Sustainability of Scientific Software Conference on Supporting Group Work. Software is of vital importance to science. The role of software in data analysis, simulation, and visualization is widely Please do not quote or distribute. acknowledged [9, 27, 38]. A 2005 NSF Workshop Report [5] clarified the importance of software in cyberinfrastructure, which is the “infrastructure based upon distributed computer, information, and communication technology”[2]. Much scientific software, however, is not infrastructural. For instance, there are many “workbench” applications for end user scientists (e.g., Dan Gezelter’s directory [34] lists almost 500 programs). This list does not even include the myriad scripts and data conversion utilities scientists write to translate data into intermediate forms required by tools in the later stages of workflows [26]. Although NSF common interest, purpose, or goal. Practitioners who learn from sponsored workshops have repeatedly called attention to each other to develop themselves personally and professionally cyberinfrastructure maintenance [5], there is less clarity into how (e.g., a less experienced scientist developer works with a more scientific software more generally can be sustained over time, experienced developer to develop features of increasing difficulty) even though it is a crucial part of scientific research, development, constitute a community of practice [33], whereas people who and delivery [44]. share information with others but who are not necessarily Informal evidence suggests that scientific software is increasingly practitioners themselves (e.g., an end user scientist answers a a key problem of interest to individual researchers and research question about the tool’s installation procedure on the mailing list) institutions. For example, the First Workshop on Sustainable constitute a community of interest [24]. In a community of Software for Science [51] was recently held collocated with an practice, learning is always situated in practice. Prior studies of annual conference on High Performance Computing. A simple open-source software communities have used situated learning to head count revealed that one third of the workshop participants describe the socialization and sustained participation of only attended the conference for the workshop. As further newcomers [17]. This suggests that community of practice is the evidence, the Water Science Software Institute has developed a type of community important for the sustainability of scientific model for software development specifically aimed to support the software. maintenance of scientific software [6]. The literature on communities suggests that in order to be Scientific software exists in a variety of states from “as persistent successful, communities must address two primary challenges: as the next grant supporting maintenance” to “supported by a very receiving contributions and attracting newcomers. small community of volunteers” [44]. But if that software has 2.3.1 Contributions gained widespread use outside of the lab and served a valuable Communities need contributions from participants ([30], p. 2- role in assisting other scientists in making new discoveries, it 5). In software development, contributions might comprise a should be able to be refined and extended for use by other working base of source code that serves people significantly better scientists who can use it to produce new knowledge. than the competition. For example, BLAST, the widely used 2.2 The Promise of Open-Source open-source tool for comparing biological sequence information [1] achieved such early success because it was so much faster than In discussions of software sustainability, the open-source software previous algorithms available at the time. Assuming a newcomer model is invariably held up as a promising approach. For instance, is at least interested in modifying or contributing to the software, position papers from the 2009 NSF funded workshop on the barrier to entry must be low to modify and extend the existing “Cyberinfrastructure Software Sustainability and Reusability” contributions to meet their own needs. Because most contributors [44] led to the report’s recommendation that cyberinfrastructure to open-source software projects leave after their personal needs software should be released under an open-source license. are met [41], it is important to attract enough newcomers in order Directly applying the open-source model to scientific software to find those who will continue to contribute and act as stewards development, however, neglects crucial differences between open- of the code base. source scientific software and open-source software in general. The primary difference is in the incentive structure for 2.3.2 Attracting Newcomers contribution [25, 26]. For open-source developers, reputation in To survive over the long-term, communities must find ways to the open-source community is a primary motivation, where the attract new generations of members to replace the ones who number of “followers” a developer has is a symbol of social status leave ([30], p. 179). For example, in recent years leaders of the [13]. Scientists who write software, in contrast, operate in a online encyclopedia Wikipedia have expressed discontent over