Towards Content-Driven Reputation for Collaborative Code Repositories

University of Pennsylvania ScholarlyCommons Departmental Papers (CIS) Department of Computer & Information Science 8-28-2012 Towards Content-Driven Reputation for Collaborative Code Repositories Andrew G. West University of Pennsylvania, [email protected] Insup Lee University of Pennsylvania, [email protected] Follow this and additional works at: https://repository.upenn.edu/cis_papers Part of the Databases and Information Systems Commons, Graphics and Human Computer Interfaces Commons, and the Software Engineering Commons Recommended Citation Andrew G. West and Insup Lee, "Towards Content-Driven Reputation for Collaborative Code Repositories", 8th International Symposium on Wikis and Open Collaboration (WikiSym '12) , 13.1-13.4. August 2012. http://dx.doi.org/10.1145/2462932.2462950 Eighth International Symposium on Wikis and Open Collaboration, Linz, Austria, August 2012. This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_papers/750 For more information, please contact [email protected]. Towards Content-Driven Reputation for Collaborative Code Repositories Abstract As evidenced by SourceForge and GitHub, code repositories now integrate Web 2.0 functionality that enables global participation with minimal barriers-to-entry. To prevent detrimental contributions enabled by crowdsourcing, reputation is one proposed solution. Fortunately this is an issue that has been addressed in analogous version control systems such as the *wiki* for natural language content. The WikiTrust algorithm ("content-driven reputation"), while developed and evaluated in wiki environments operates under a possibly shared collaborative assumption: actions that "survive" subsequent edits are reflective of good authorship. In this paper we examine WikiTrust's ability to measure author quality in collaborative code development. We first define a mapping omfr repositories to wiki environments and use it to evaluate a production SVN repository with 92,000 updates. Analysis is particularly attentive to reputation loss events and attempts to establish ground truth using commit comments and bug tracking. A proof-of-concept evaluation suggests the technique is promising (about two-thirds of reputation loss is justified) with false positives identifying areas for future refinement. quallyE as important, these false positives exemplify differences in content evolution and the cooperative process between wikis and code repositories. Keywords WikiTrust, wikis, code repository, SVN, reputation, trust management, content persistence, code quality, software engineering Disciplines Databases and Information Systems | Graphics and Human Computer Interfaces | Software Engineering Comments Eighth International Symposium on Wikis and Open Collaboration, Linz, Austria, August 2012. This conference paper is available at ScholarlyCommons: https://repository.upenn.edu/cis_papers/750 Towards Content-driven Reputation for Collaborative Code Repositories Andrew G. West Insup Lee University of Pennsylvania University of Pennsylvania Philadelphia, PA, USA Philadelphia, PA, USA [email protected] [email protected] ABSTRACT trusted parties. More novel is opening these repositories to a As evidenced by SourceForge and GitHub, code repositories global and anonymous set of participants. This trend can be now integrate Web 2.0 functionality that enables global par- seen with online repository hosts such as SourceForge and ticipation with minimal barriers-to-entry. To prevent detri- GitHub. Consider also vehicleforge.mil [5], a DARPA mental contributions enabled by crowdsourcing, reputation led effort to crowdsource a next-generation military vehicle. is one proposed solution. Fortunately this is an issue that While such broad collaboration is time and cost effective it has been addressed in analogous version control systems can also invite inefficient, buggy, or malicious contributions. such as the wiki for natural language content. The Wik- This not an issue unique to code repositories. Consider iTrust algorithm (“content-driven reputation”), while devel- English Wikipedia which operates on a VCS (the wiki plat- oped and evaluated in wiki environments operates under a form) where ≈7% of edits are unconstructive (“vandalism”). possibly shared collaborative assumption: actions that “sur- To this end, “content-driven reputation” was developed and vive” subsequent edits are reflective of good authorship. encoded as the WikiTrust [2, 3] algorithm. WikiTrust’s as- In this paper we examine WikiTrust’s ability to measure sumption is that actions which persist across subsequent re- author quality in collaborative code development. We first visions are indicative of reputable authors (see Sec. 2). define a mapping from repositories to wiki environments and WikiTrust’s hypothesis has been shown effective for nat- use it to evaluate a production SVN repository with 92,000 ural language content on wikis, however, one can imagine updates. Analysis is particularly attentive to reputation loss it might hold true in many collaborative settings, includ- events and attempts to establish ground truth using commit ing source code repositories. To investigate this we define comments and bug tracking. A proof-of-concept evaluation a mapping between repository software and wiki platforms. suggests the technique is promising (about two-thirds of rep- Then, we apply WikiTrust over a case study SVN repository utation loss is justified) with false positives identifying ar- containing some 92,000 updates (Sec. 3). eas for future refinement. Equally as important, these false WikiTrust plots user reputations across repository history. positives exemplify differences in content evolution and the Our analysis scrutinizes reputation loss on these graphs, cooperative process between wikis and code repositories. gleaming additional information from bug tracking and commit comments. Our goal is to determine whether the reputation penalty was justified, and we estimate those decrements Categories and Subject Descriptors were warranted in about two-thirds of cases (Sec. 4). False H.5.3 [Group and Organization Interfaces]: collabora- positives are given particular attention since they suggest tive computing, computer-supported cooperative work how WikiTrust might be refined for use in code repositories. Fortunately, we are able to identify several redundant, Keywords detectable, and correctable sources of error (Sec. 5). Should these suggestions produce a sufficiently accurate WikiTrust, wikis, code repository, SVN, reputation, trust and robust reputation those values can be integrated in ways management, content persistence, code quality. that remove administrative burden, heighten end user trust, and further promote “openness” in development. For exam- 1. INTRODUCTION ple, additional scrutiny could be given to contributors with Version control systems (VCS), particularly those aimed low reputation or the most productive users rewarded. towards source code development (e.g., CVS, SVN, Git) While the development of a refined version remains a work have long supported local collaboration among known and in progress the shortcomings of our straightforward Wik- iTrust application are valuable in and of themselves. Er- roneous reputation events exemplify differences in the content evolution of wikis and code repositories. Collaborative semantics are not uniform across cooperative settings and research should be attune to these subtleties. 2. BACKGROUND & RELATED WORK c ACM, 2012. This is the author’s version of the work. It is posted here with permission for your personal use. Not for redistribution. The definitive Discussion begins by reviewing the WikiTrust algorithm version was published in: WikiSym ’12, August 27–29, Linz, Austria. (Sec. 2.1) before examining related literature (Sec. 2.2). 2.1 WikiTrust Algorithm Assume a VCS environment consisting of a set of docu- ments {d0,d1 ...dn}. Each document d has a version history ver hist(dx) = [dx.0,dx.1 ...dx.m] where dx.0 is an empty document and dx.m is the most recent version. The transi- tion dx.y → dx.y+1 is termed an edit/revision and each has Figure 2: Repository revision model. an author associated with it, ax.y+1. The WikiTrust algorithm [2, 3] begins by initializing authors with a default reputation. Whenever a new document 3.1 Repository to Wiki Model version dx.y+1 is created, its relationship with the previous n versions is investigated, possibly updating the reputations We now describe the transformation of a code repository into a wiki representation. Generic discussion suffices given of the n authors {ax.y−n ...ax.y}. These authors’ reputa- e.g., tions are updated according to how much of their actions that instantiations of both repositories ( CVS, SVN, etc. e.g., etc. “persist” to the new version per a metric of edit survival. Git, ) and wikis ( Mediawiki, PmWiki, ) have Edit survival uses an edit distance computation to quan- an expected set of functionality. Our writing assumes some tify the similarity of an author’s (prior) version to the cur- familiarity with both paradigms (see also Fig. 2). rent one. This captures not only the survival of novel au- Our aim is to “replay” the history of a repository into thored content (i.e., text survival) but also reorganization the wiki format. Simple repository actions (add, modify, and content removal. The size of a reputation delta is pro- and delete) have a straightforward mapping that requires no portional to the degree of change and weighted by the rep- description. Elsewhere, minor

Towards Content-Driven Reputation for Collaborative Code Repositories

Initiativen Roger Cloes Infobrief

Reliability of Wikipedia 1 Reliability of Wikipedia

Checkuser and Editing Patterns

The Case of Wikipedia Jansn

Wikipedia's Labor Squeeze and Its Consequences Eric Goldman Santa Clara University School of Law, [email protected]

A Cultural and Political Economy of Web 2.0

Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features

Detecting Wikipedia Vandalism Using Wikitrust*

Wikipedia Data Analysis

2013 Social Media Guidebook

Emerging Technologies in Law Libraries: It's Everyone's Business

A Systematic Review of Scholarly Research on Wikipedia