Crowdsourcing the Discovery of Software Repositories in an Educational Environment Yuxing Ma Tapajit Dey University of Tennessee University of Tennessee Knoxville, Tennessee Knoxville, Tennessee
[email protected] [email protected] Jared M. Smith Nathan Wilder Audris Mockus University of Tennessee University of Tennessee University of Tennessee Knoxville, Tennessee Knoxville, Tennessee Knoxville, Tennessee
[email protected] [email protected] [email protected] ABSTRACT vice and a large number of repositories are being hosted on In software repository mining, it's important to have a broad online forges. For example, the number and size of reposi- representation of projects. In particular, it may be of inter- tories hosted on GitHub roughly doubled over the last year. est to know what proportion of projects are public. Dis- A census of public software projects is highly desirable for covering public projects can be easily parallelized but not reasons similar to those of a population census [6]. Discov- so easy to automate due to a variety of data sources. We ering the projects even on a known forge, may be a nontriv- evaluate the research and educational potential of crowd- ial task, however. To accomplish it, we ran an experiment sourcing such research activity in an educational setting. to distribute the data gathering over a group of students Students were instructed on three ways of discovering the in order to provide a practical example of data discovery. projects and assigned a task to discover the list of public Such approach is promising because it has the potential for projects from top 45 forges with each student assigned to faster speed, lower cost, more flexibility, and better diversity, one forge.