Diversity in Software Engineering Research Meiyappan Nagappan Thomas Zimmermann Christian Bird Software Analysis and Intelligence Lab Microsoft Research Microsoft Research Queen’s University, Kingston, Canada Redmond, WA, USA Redmond, WA, USA
[email protected] [email protected] [email protected] ABSTRACT et al. [2] examined 1,000 projects. Another example is the study One of the goals of software engineering research is to achieve gen- by Gabel and Su that examined 6,000 projects [3]. But if care isn’t erality: Are the phenomena found in a few projects reflective of taken when selecting which projects to analyze, then increasing the others? Will a technique perform as well on projects other than the sample size does not actually contribute to the goal of increased projects it is evaluated on? While it is common sense to select a generality. More is not necessarily better. sample that is representative of a population, the importance of di- As an example, consider a researcher who wants to investigate a versity is often overlooked, yet as important. In this paper, we com- hypothesis about say distributed development on a large number of bine ideas from representativeness and diversity and introduce a projects in an effort to demonstrate generality. The researcher goes measure called sample coverage, defined as the percentage of pro- to the json.org website and randomly selects twenty projects, all of jects in a population that are similar to the given sample. We intro- them JSON parsers. Because of the narrow range of functionality duce algorithms to compute the sample coverage for a given set of of the projects in the sample, any findings will not be very repre- projects and to select the projects that increase the coverage the sentative; we would learn about JSON parsers, but little about other most.