Coding Together at Scale: Github As a Collaborative Social Network
Total Page:16
File Type:pdf, Size:1020Kb
Coding Together at Scale: GitHub as a Collaborative Social Network Antonio Lima, Luca Rossi and Mirco Musolesi School of Computer Science University of Birmingham, UK fa.lima,l.rossi,[email protected] Abstract ever, this service has much emphasis on its social features, GitHub is the most popular repository for open source as summarized in its motto “GitHub: social coding”. In fact, code (Finley 2011). It has more than 3.5 million users, GitHub is not simply offering a code hosting service, like as the company declared in April 2013, and more than its competitors had been doing for a long time, but also an 10 million repositories, as of December 2013. It has a easy-to-use and cheap (or even free in its basic version) on- publicly accessible API and, since March 2012, it also line tool for collaborative software development and many publishes a stream of all the events occurring on public features supporting the community of developers. For all projects. Interactions among GitHub users are of a com- these reasons, GitHub has successfully lowered the barrier plex nature and take place in different forms. Devel- to collaboration in open source. The importance of this col- opers create and fork repositories, push code, approve laboration platform seems to be increasing, as its founder code pushed by others, bookmark their favorite projects has plans to extend the use cases beyond software develop- and follow other developers to keep track of their activ- ment (Lunden 2013). At the same time, most of the data con- ities. 5 In this paper we present a characterization of GitHub, cerning collaboration on public software repositories can be as both a social network and a collaborative platform. accessed and analyzed. This represents a unique opportunity To the best of our knowledge, this is the first quantita- to study aspects of human behavior related to collaboration tive study about the interactions happening on GitHub. at scale. We analyze the logs from the service over 18 months GitHub is based on the Git revision control system6. In (between March 11, 2012 and September 11, 2013), GitHub a user can create code repositories and push code describing 183.54 million events and we obtain in- to them. Every repository has a list of collaborators; they formation about 2.19 million users and 5.68 million can make changes to the content of the repository and they repositories, both growing linearly in time. We show review the contributions that are submitted to the repository, that the distributions of the number of contributors per project, watchers per project and followers per user accepting or discarding them. In this sense, they are not the show a power-law-like shape. We analyze social ties only people collaborating on the project. In fact, every per- and repository-mediated collaboration patterns, and we son that wishes to contribute to a project, without being a 7 observe a remarkably low level of reciprocity of the so- collaborator, can fork it . This action makes a duplicate of cial connections. We also measure the activity of each the repository, allowing developers to work independently, user in terms of authored events and we observe that committing changes only to their own fork. When develop- very active users do not necessarily have a large number ers complete a certain task (e.g., a new feature or a bug fix), of followers. Finally, we provide a geographic charac- they can send the changes to the original repository, through terization of the centers of activity and we investigate a so-called pull request. Then, a collaborator of the original arXiv:1407.2535v1 [cs.SI] 9 Jul 2014 how distance influences collaboration. repository reviews the changes contained in the pull request and decides whether to accept it in the original repository (in 1 Introduction the Git jargon, merge it to the parent repository), or refuse it, In recent years, GitHub1, a hosting platform for software optionally motivating his or her choice. Once the new code projects, has gained much popularity among a large number is accepted in the original repository, its author becomes one of software developers around the world. This platform of- fers version control hosting, as other platforms have done in 5GitHub also offers fee-based private repositories. Since it is the past (e.g., SourceForge2, Assembla3, BitBucket4). How- not possible to access any information about private repositories, our analysis will focus on public repositories. Copyright © 2014, Association for the Advancement of Artificial 6http://git-scm.com/ Intelligence (www.aaai.org). All rights reserved. 7 1 In the open-source context, the term forking had a negative https://github.com/ connotation in the past, i.e., it has often been used to refer to groups 2 https://sourceforge.net/ of developers separating for different views on a project and split- 3https://www.assembla.com/ ting their forces on independent projects. Instead, in GitHub, fork- 4https://bitbucket.org/ ing is a normal part of the process of contributing to a project. of the contributors of the project. In addition to that, GitHub general, open-source projects have been the subject of sev- users can follow other users, to be notified of their actions. eral studies specifically aimed at uncovering the social struc- The website is not used only for collaboration, but also as a ture that emerges from the interactions between develop- resource to find quality software. Users can star interesting ers (Valverde and Sole´ 2007; Bird et al. 2008) and at analyz- repositories that they want to bookmark for later reference. ing the individual contributions to specific projects (Hindle, Other features are also available (e.g., issue tracking, down- German, and Holt 2008). loads, gists, and so on) but we will not consider them in this Recently, given its increasing popularity, there has been a work. surge in interest in GitHub and its underlying social dynam- In this paper, to the best of our knowledge, we present the ics. Some projects are currently undergoing with the specific first in-depth quantitative analysis of GitHub, as a unique aim of providing easy-to-use and efficient tools for access- example of large-scale real-world collaboration platform ing data from GitHub, especially in real-time. For example, mainly used for software projects. The contributions we (Gousios and Spinellis 2012) discusses a system to gather make in this paper can be summarized as follows: streams and data from GitHub in a scalable fashion to over- • We conduct basic structural analyses and we show that the come the limitations imposed by the GitHub API, specifi- distributions of the number of contributors and watchers cally directed at researchers. per project and followers per user show a power-law-like In (Dabbish et al. 2012) an in-depth qualitative user study shape. is conducted on a small group of GitHub users, aimed • We analyze social ties and repository-mediated collabora- at understanding the motivations that are the basis of on- tion patterns. We find a very low reciprocity of the social line collaboration and the consequences of using a trans- ties, which is remarkably different from the findings of parent large-scale tool on the practice of software develop- studies of other types of social networks. ment. They find that people use GitHub for several rea- sons: to learn how to code better, to follow popular de- • We study the depth and width of the trees corresponding velopers, to find new interesting projects, and to promote to forked repositories and we observe that in GitHub col- themselves and their work. They also find that actions laboration on forks seems to happen on a limited number in GitHub, such as following a user, committing changes of key projects. and accepting/rejecting code, are heavily influenced by spe- • We investigate the correlation between the activity of cific characteristics of the interactions happening in the users and their popularity in the network and we observe system. Some other studies have investigated the geo- that very active users do not necessarily have a large num- graphical distribution of collaborations (Heller et al. 2011; ber of followers. Shrestha, Zhu, and Miller 2013). An example is that of • We provide a geographic characterization of activities and Heller et al. (Heller et al. 2011), who use visualization tech- collaborations. We find that users tend to interact with niques to identify patterns of collaboration, including geo- people they are close to and that repositories with a low graphic characteristics of the interactions between cities and number of collaborators tend to have them concentrated influence among them. around a few specific geographic locations, rather than With respect to this body of work, to the best of our scattered around the world. Finally, we observe a simi- knowledge, our paper presents the first systematic quanti- larity between the geographical distributions of following tative analysis of the interactions in GitHub. We believe and contributions ties. that our quantitative methodology complements the existing qualitative work by providing insights about global patterns The paper is organized as follows. We first discuss re- of interactions that are not possible to obtain by means of lated work in this area. Then, we present the data collec- small-scale and interview-based studies. tion methodology and we describe the characteristics of the dataset. We study the networks representing the interactions between entities, extracted from the dataset. We conclude 3 Description of the Dataset the article by discussing our key findings and outlining our The full list of public events that have happened on GitHub future work. is available on the GitHub Archive website8. In this paper, we analyze events that happened on GitHub over a period 2 Related Work of 18 months, between March 11, 2012 and September 11, Several researchers from different communities have been 2013, retrieved from that archive.