A Behavioral Approach to Understanding the Git Experience
Proceedings of the 54th Hawaii International Conference on System Sciences | 2021

Genevieve Milliken, New York University
Sarah Nguyen, University of Washington
Vicky Steeves, New York University

URI: https://hdl.handle.net/10125/71493
978-0-9981331-4-0
(CC BY-NC-ND 4.0)

Abstract

The Investigating and Archiving the Scholarly Git Experience (IASGE) project is a multi-track study focused on understanding the uses of Git by students, faculty, and staff working in academic research institutions as well as the ways source code repositories and their associated contextual ephemera can be better preserved. This research, in turn, has implications regarding how to support Git in the scholarly process, how version control systems contribute to reproducibility, and how Library and Information Science (LIS) professionals can support Git through instruction and sustainability efforts. In this paper, we focus on a subset of our larger project and take a deep look at what code hosting platforms offer researchers in terms of productivity and collaboration. For this portion, a survey, focus groups, and user experience interviews were conducted to gain an understanding of how and why scholarly researchers use Version Control Systems (VCS) as well as some of the pain points in learning and using VCS for daily work.

1. Introduction

An important part of the scholarly record is unstable. While there are many initiatives to incentivize, support, and publish research outputs such as data and electronic notebooks/field notes, these are noticeably lacking for code and software. This is compounded by the fact that preservation and access workflows are less systematized for code and software curation than for other types of research outputs (e.g. A/V media, manuscripts). Such a lack is particularly problematic because a multitude of disciplines, from the sciences to the humanities, write or use code in their academic research. Further, many researchers use Version Control Systems (VCS) such as Git as well as hosting platforms like GitHub to collaborate, version, and publish code openly.

Many of the most popular Git Hosting Platforms (GHP), however, do not have a long-term preservation plan and make no guarantees about the long-term availability of work on their platforms. This is particularly problematic because the most commonly used GHPs are commercial companies and are not in the business of preservation. Further, these companies are not immune to closures and such cessations have, in fact, already occurred and include Google Code in 2016 and Gitorious in 2014 as well as many other open source code forges [1]. In 2020, BitBucket also suspended support for Mercurial repositories, further forcing developers to “go mainstream” and use Git and GitHub in their workflow [2].

In addition to the dependency and precarity of these platforms, how and why scholarly researchers use Git and GHP during their research process remains opaque and understudied. These platforms allow for sharing, collaboration, and even scholarly interactions such as peer review and annotations of code, though they come with steep technical and social learning curves. A contributing factor to this may be that these platforms were created for software development and not academia, making the learning, use, and support of Git idiosyncratic at the project, department, and institutional levels.

To date, there has been no large-scale or in-depth study of how researchers use these tools. To fill this gap, the Investigating and Archiving the Scholarly Git Experience (IASGE) project seeks to understand scholars’ engagement with Git as a VCS and GHP as a mechanism for scholarly communication and dissemination. Through a behavioral studies approach, this paper takes a deep look at what code hosting platforms offer to researchers in terms of productivity and collaboration and uses a survey, focus groups, and user experience interviews to gain an understanding of how and why scholarly researchers use VCS as well as some of the pain points involved in learning and using VCS. Our main findings are that a wide variety of researchers, from the humanities to sciences, use Git, but often lack a mental model for understanding it. This situation results in a poor understanding of the tool and suggests that support for learning and adopting Git, as well as training in best practices, is needed.

2. Background

There is a need in software development to track changes over time, and this role is fulfilled by VCS. Also known as revision control systems or software management systems, VCS allow users to track changes, compare different versions, and merge changes to a codebase collaboratively over time. Version control is done on a repository, which is a directory of source code files with an underlying data structure that contains the metadata for all tracked files, with a historical record of all files. Depending on whether the VCS in use is centralized (e.g. Subversion, CVS) or distributed (e.g. Git, Mercurial), the repository may contain varying amounts of information. Distributed VCS became more popular because each mirror of the repository (e.g. on each developer’s computer) is considered “equal,” and cannot be automatically overwritten by another, allowing users more flexibility and the ability to work without fear of corrupting anyone else’s progress.

Git, the VCS explored in this paper, was created as a result of a professional conflict during development of the Linux kernel. Previously, the open source project had been using BitKeeper (a proprietary distributed VCS) for source code management [3], [4]. However, the copyright holder of BitKeeper rescinded access to the software, believing that some were trying to reverse engineer BitKeeper’s protocols, and in response, Git was created.

Since then, Git has emerged as the most popular VCS [5], due in large part to its scalability and support of non-linear code development as well as a ‘toolkit’ design that facilitates large collaborative projects with many merges. Web-based platforms called “forges”, now more broadly known as “source code hosting platforms,” are important for understanding VCS. In Free and Open Source Software (FOSS), a “forge” is a web-based collaborative software development platform that includes not only support for repository management, but also services and integration such as mailing-lists, wikis, bug tracking, and code review. The first third-party hosting platform for code repositories was Helix TeamHub in 1995, which worked for Perforce, an early VCS. More platforms have grown to include what the authors will refer to as “Git Hosting Platforms” (GHP) that include elements of forges, but foreground or exclusively support Git (e.g. GitHub). Just as not all rectangles are squares, not all forges are GHP, but all GHP are forges.

While VCS have been used in software development since 1975, they have only recently become popular in academic institutions. This project is interested in how Git is used in academia and how GHPs are used as platforms for scholarly engagement. The population we are interested in studying is involved with a range of scholarly endeavors (e.g., graduate studies, grant-funded labs, research in areas of expertise, etc.) and includes students, faculty, instructors, researchers, and staff who develop and maintain software and code.

A key component of this study is to put version-controlled research software into a wider context and highlight its place as a scholarly research artifact alongside other non-manuscript outputs such as lab notebooks, datasets, executable papers, data visualizations, and digital humanities projects. Recently, data has received the most attention of these computational outputs due to the mandate that federally-funded research be made freely available [6] and the application of the FAIR principles for research code [7]. Calls for data sharing [8], professional organizations such as the Data Curation Network [9], and professional roles such as Data Librarians [10] have also helped to further foreground the importance of data. Data repositories and generalist repositories have also facilitated the preservation of and access to datasets and have made them more accessible to the public. Unlike data, however, research software has received far less attention and has fewer institutionalized roles dedicated to the maintenance and curation of these complex digital objects. Software is particularly difficult to reproduce and preserve due to its multiple files and directories, its use of packages and libraries, and its hardware and operating system dependencies, as well as the documentation and metadata required for others to use it.

Our interest, however, is not solely with the source code but also the contextual information around it. GHP are also quite complex and come equipped with a set of ancillary tools within each repository that help developers and academics alike organize, communicate within, and disseminate their projects. These include features such as wikis, issues (which often act as discussion threads), and pull requests. We will refer to these tools and the scholarly interactions that take place within them as “scholarly ephemera.” These contextual materials are evidence for understanding the history of a repository, how one repository might relate to another, how members of each repository branch out and form networks, and how this information can be used to track derivatives of current work. Currently, scholarly ephemera is not being archived with the same breadth as source code.

In summation, our work described below is centered on the following premise: that scholarship is being produced on GHPs and GHPs are geared towards software development, rather than academia. This results in a steep learning curve, a weak academic support system, and a lack of essential scholarly features.

To explore this, our primary research question for this paper is: when learning and adopting Git and GHP

and GHP [14]. In order to create and refine our qualitative codes, transcripts were uploaded to Taguette, a free and open source qualitative analysis tool, for individual “closed” coding [15].