
2010 Asia Pacific Software Engineering Conference

The Qualitas Corpus: A Curated Collection of Code for Empirical Studies

Ewan Tempero∗, Craig Anslow§, Jens Dietrich†, Ted Han∗, Jing Li∗, Markus Lumpe‡, Hayden Melton∗, James Noble§
∗ Department of Computer Science, The University of Auckland, Auckland, New Zealand. [email protected]
† Massey University, School of Engineering and Advanced Technology, Palmerston North, New Zealand. [email protected]
‡ Faculty of Information & Communication Technologies, Swinburne University of Technology, Hawthorn, Australia. [email protected]
§ School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand. [email protected]

Abstract—In order to increase our ability to use measurement to support software development practice we need to do more analysis of code. However, empirical studies of code are expensive and their results are difficult to compare. We describe the Qualitas Corpus, a large curated collection of open source Java systems. The corpus reduces the cost of performing large empirical studies of code and supports comparison of measurements of the same artifacts. We discuss its design, organisation, and issues associated with its development.

Keywords—Empirical studies; curated code corpus; experimental infrastructure

I. INTRODUCTION

Measurement is fundamental to engineering, however its use in engineering software has been limited. While many software metrics have been proposed (e.g. [1]), few are regularly used in industry to support decision making. A key reason for this is that our understanding of the relationship between measurements we know how to make and quality attributes that we care about, such as modifiability, understandability, extensibility, reusability, and testability, is poor. This is particularly true with respect to theories regarding characteristics of software structure such as encapsulation, inheritance, coupling, and cohesion. Traditional engineering disciplines have had hundreds or thousands of years of experience of comparing measurements with quality outcomes, but central to this experience is the taking and sharing of measurements and outcomes. In contrast, there have been few useful measurements of code. In this paper we describe the Qualitas Corpus, infrastructure that supports taking and sharing measurements of code artifacts.

Barriers to measuring code and understanding what the measurements mean include access to code to measure and the tools to do the measurement. The advent of open source software (OSS) has meant significantly more code is now accessible for measurement than in the past. This has led to an increase in interest in empirical studies of code. However, there is still a non-trivial cost to gathering the artifacts from enough OSS projects to make a study useful. One of the main goals of the Qualitas Corpus is to substantially reduce the cost of performing large empirical studies of code.

However, just measuring code is not enough. We need models explaining the relationship between the measurements and the quality attributes, and we need experiments to validate those models. Validation does not come through a single experiment — experiments must be replicated. Replication requires at least understanding of the relationship between the artifacts used in the different experiments. In some forms of experiments, we want to use the same artifacts so as to be able to compare results in a meaningful way. This means we need to know in detail what artifacts are used in any experiment, meaning an ad hoc collection of code whose contents are unknown is not sufficient. What is needed is a curated collection of code artifacts. A second goal of the Qualitas Corpus is to support comparison of measurements of the same artifacts, that is, to provide a reference corpus for empirical studies of code.

The contributions of this paper are:
• We present arguments for the provision of a reference corpus of code for empirical studies of code.
• We identify the issues regarding performing replication of studies that analyse Java code.
• We describe the Qualitas Corpus, a curated collection of Java code that reduces the cost and increases the replicability of empirical studies.

The rest of the paper is organised as follows. In the next section we present the motivation for our work, which includes inspiration from the use of corpora in applied linguistics and the limited empirical studies of code that have been performed. We also discuss the use of reference collections in other areas of software engineering and in computer science, and discuss the need for a curated collection of code. In section III we discuss the challenges faced when doing empirical studies of code, and from that, determine the requirements of a curated corpus. Section IV presents the details of the Qualitas Corpus, its current organisation, immediate future plans, and the rationale of the decisions we have taken. Section V evaluates the Qualitas Corpus. Finally we present our conclusions in section VI.

II. MOTIVATION AND RELATED WORK

The use of a standard collection of artifacts to support study in an area is not new, neither in general nor in software engineering. One area is that of applied linguistics, where standard corpora are the basis for much of the research being done. Hunston [2] opens her book with “It is no exaggeration to say that corpora, and the study of corpora, have revolutionised the study of language, and of the applications of language, over the last few decades.” Ironically, it is the availability of software systems supporting language corpora that has enabled this form of research, whereas researchers examining code artifacts have been slow to adopt this idea. While the goals of applied linguistics research are not exactly the same as ours, the similarities are close enough to warrant examining how corpora are used in that field. Their use of corpora is a major motivation for the Qualitas Corpus. We will discuss language corpora in more detail in section III.

A. Empirical studies of Code

To answer the question of whether a code corpus is necessary, we sample past empirical studies of code. By “empirical study of code” we mean a study in which the artifacts under investigation consist of source code, there are multiple, unrelated artifacts, and the artifacts were developed independently of the study. This rules out, for example, studies that included the creation of the code artifacts, such as those by Briand et al. [3] or Lewis et al. [4], and studies of one system, such as that by Barry [5].

Empirical studies of code have been performed for at least four decades. As with many other things, Knuth was one of the first to carry out empirical studies to understand what code that is actually written looks like [6]. He presented a static analysis of over 400 FORTRAN programs, totalling about 250,000 cards, and dynamic analysis of about 25 programs. He chose programs that could “run to completion” from job submissions to Stanford’s Computation Center, various subroutine libraries and scientific packages, contributions from IBM, and personal programs. His main motivation was compiler design, with the concern that compilers may not optimise for the typical case as no-one knew what the typical case was. The programs used were not identified.

In another early example, Chevance and Heidet studied 50 COBOL programs, also looking at how language features are used [7]. The programs were also not identified and no details were given of size.

Open source software has existed for several decades, with systems such as Unix, emacs, and TeX. Their use in empirical studies is relatively recent. For example, Miller et al. [8] studied about 90 Unix applications (including emacs, TeX, LaTeX, yacc) to determine how they responded to input. Frakes and Pole [9] used Unix tools as the basis for a study on methods for searching for reusable components.

During the 1990s the number of accessible systems increased, particularly those written in C++, and consequently the number of studies increased. Chidamber and Kemerer applied their metrics to two systems, one with 634 C++ classes, the other with 1459 Smalltalk classes [1]. No further information on the systems was given.

Bieman and Zhao studied inheritance in 19 C++ systems, ranging from 7 classes to 922 classes in size, with 2744 classes in total [10]. They identified the systems studied, but did not identify the versions for all systems.

Harrison et al. applied two coupling metrics to five collections of C++ code, consisting of 96, 197, 113, 61, and 12 classes respectively [11]. They identified the systems involved but not the versions studied.

Chidamber et al. studied three systems, one with 45 C++ classes, one with 27 Objective C classes, and one identifying 25 classes in design documents [12]. They were required to restrict information about the systems studied for commercial reasons.

By the end of the millennium, repositories supporting open source development such as Sourceforge, as well as the increase in effectiveness of Internet search systems, meant a large number of systems were accessible. This affected both the number of studies done, and often their size. A representative set of examples includes one with 3 fairly large Java systems [13], a study of 14 Java systems [14], and a study of 35 systems, from several languages including Java, C++, Self, and Smalltalk [15].

Two particularly large studies were by Succi et al. [16] and Collberg et al. [17]. Succi et al. studied 100 Java and 100 C++ applications. The Java applications ranged from 28 to 936 classes in size (median 83.5) and the C++ applications ranged from 30 to 2520 classes (median 59). The actual applications were not identified. Collberg et al. analysed 1132 Java jar files collected from the Internet. According to their statistics they analysed a total of 102,688 classes and 12,188 interfaces. No information was given as to what applications were analysed.

The studies described above suggest that there is interest in doing studies that involve analysing code, and the ability to do such studies has significantly advanced our knowledge about the characteristics of code structure. There are several issues with these studies however. The first is that none of these studies use the same set of systems, making it difficult to compare or combine results. Another is that because full details of the systems analysed are not provided, we are limited in our ability to replicate them.

A third issue is that it is not clear that even the authors are fully aware of what they have studied, which we discuss further below. Finally, while the authors have gone to some effort to gather the artifacts needed for their study, few others are able to benefit from that effort, meaning each new study requires duplicated effort. The Qualitas Corpus addresses these issues.

B. Infrastructure for empirical studies

Of course the use of standard collections of artifacts to support research in computer science and software engineering is not new. The use of benchmarks for various forms of performance testing and comparison is very mature. One recent example is the DaCapo benchmark suite by Blackburn et al. [18], which consists of a set of open source, real world Java applications with non-trivial memory loads. Another example of research infrastructure is the New Zealand Digital Library project, which provides the technology for the creation of digital libraries and is publicly available so that others can use it [19].

There are also some examples in Software Engineering. One is the Software-artifact Infrastructure Repository (SIR) [20]. The explicit goal of SIR is to support controlled experimentation in software testing techniques. SIR provides a curated set of artifacts, including the code, test suites, and fault data. SIR represents the kind of support the Qualitas Corpus is intended to provide. We discuss SIR’s motivation in section III.

Bajracharya et al. describe Sourcerer, which provides infrastructure to support code search [21]. At the time of publication, the Sourcerer database held 1500 real-world open source projects, a total of 254,049 Java classes, gathered from Sourceforge. Their goals are different to ours, but it does give an indication as to what is available.

Finally, we must mention the Purdue Benchmark Suite. This was described by Grothoff et al. in support of their work on confined types [22]. It consisted of 33 Java systems, 5 with more than 200 classes, and a total of 46,165 classes. At the time it was probably the largest organised collection of Java code, and it was the starting point for our work.

C. The need for curation

If two studies that analyse code give conflicting reports of some phenomena, one obvious possible explanation is that the studies were applied to different samples. If the two studies claimed to be analysing the same set of systems, we might suspect error somewhere, although it could just be that the specific versions analysed were different. In fact, even if we limit our sample to be from open source Java systems, there is still room for variation even within specific versions, as we will now discuss.

In an ideal world, it would be sufficient for a researcher to just analyse what was provided on the system’s download website. However, it is not that simple. Open source Java systems come in both deployable (“binary”) and source versions of the code. While we are interested in analysing the source code, in some cases it is easier to analyse the binary version. However, it is frequently the case that what is distributed in the source version is not the same as what is in the binary version. The source often includes “infrastructure” code, such as that used for testing, code demonstrating aspects of the system, and code that supports the installation, building, or other management tasks of the code. Such code may not be representative of the deployed code, and so could bias the results of the study.

In some cases, this extra code can be a significant proportion of what is available. For example, jFin_DateMath version R1-0.0 has 109 top-level non-test classes and 38 JUnit test classes. If the goal of a study is to characterise how inheritance is used, then the JUnit classes (which extend TestCase) could bias the result. Another example is fitjava version 1.1, which has 37 top-level classes and, in addition, 22 example classes. If there are many example classes, which are typically quite simple, then they would bias the results in a study to characterise some aspect of the complexity of the system design.

Another issue is identifying the infrastructure code. Different systems organise their source code in different ways. In many cases, the source code is organised as different source directories, one for the system source, one for the test infrastructure, one for examples, and so on. However there are many other organisations. For example, gt2 version 2.2-rc3 has nearly 90 different source directories, of which only about 40 contain source code that is distributed in binary form.

The presence of infrastructure code means that a decision has to be made as to what exactly to analyse. Without careful investigation, researchers may not even be aware that the infrastructure code exists and that a decision needs to be made. If this decision is not reported, then it impacts other researchers’ ability to replicate the study. It may be possible to avoid this problem by just analysing the binary form of the system, as this can be expected to represent how the system was built. Unfortunately, some systems do include infrastructure code in the deployed form.

Another complication is third-party libraries. Since such software is usually not under the control of the developers of the system, including it in the analysis would be misleading in terms of understanding what decisions have been made by developers. Some systems include these libraries in their distribution and some do not. Also, different systems can use the same libraries. This means that third-party library use must be identified, and where appropriate, excluded from the analysis, to avoid bias due to double counting.

Identifying third-party libraries is not easy. Some systems are deployed as many archive (jar) files, meaning it is quite time-consuming to determine which are third-party libraries and which are not. For example, compiere version 250d has 114 archive files in its distribution.

Complicating the identification of third-party libraries is the fact that some systems have such libraries packaged along with the system code, that is, the library binary code has been unpacked and then repacked with the binary system code. This means excluding library code is not just a matter of leaving out the relevant archive file.

Some systems are careful to identify what third-party systems are included in the distribution (… for example). However, usually this is in a simple text document that must be processed by a human, and so some judgement is needed.

Another means to determine what to analyse might be to look at the code that appears in both source and binary form. Since there is no need for third-party source to be distributed, we might reasonably expect it would only appear in binary form. However, this is not the case. Some systems do in fact distribute what appears to be original source of third-party libraries (for example, compiere version 250d has a copy of the Apache Element Construction Set (http://jakarta.apache.org/ecs) that differs only in one class, and that only by a few lines). Also, some systems provide their own implementations of some third-party libraries, further complicating what is system code and what is not.

In conclusion, to study the code from a collection of systems it is not sufficient to just analyse the downloaded code, whether it is binary or the original source. Decisions need to be made regarding exactly what is going to be analysed. If these decisions are not reported, then the results may be difficult to analyse (or even fully evaluate). If the decisions are reported, then anyone wanting to replicate the study has, as well as having to recreate the collection, the additional burden of accurately recreating the decisions.

If the collection is curated, that is, the contents are organised and clearly identified, then the issues described above can be more easily managed. This is the purpose of the Qualitas Corpus.

III. DESIGNING A CORPUS

In discussing the need for the Software-artifact Infrastructure Repository (SIR), Do et al. identified five challenges that need to be addressed to support controlled experimentation: supporting replicability across experiments; supporting aggregation of findings; reducing the cost of controlled experiments; obtaining sample representativeness; and isolating the effects of individual factors [20]. Their conclusion was that these challenges could be addressed to one degree or other by creating a collection of relevant artifacts.

When collecting artifacts, the target of those artifacts must be kept in mind. Researchers use the artifacts in SIR to determine the effectiveness of techniques and tools for testing software, that is, the artifacts themselves are not the objects of study. Similarly, benchmarks are also a collection of artifacts where they are not the object of study, but provide input to systems whose performance is the object of study. While any collection of code may be used for a variety of purposes, our interest is in the code itself, and so we refer to our collection as a corpus.

Corpora are now commonly used in linguistics and there are many used in that area, such as the International Corpus of English [23]. The development of standard corpora for various kinds of linguistics work is an area of research in itself. Hunston says the main argument for using a corpus is that it provides a reliable guide to what language is like, more reliable than the intuition of native speakers [2, p20]. This applies to programming languages as well. While both research and trade literature contain many claims about the use of programming language features, code corpora could be used to provide evidence for such claims.

Hunston lists four aspects that should be considered when designing a corpus: size, content, representativeness, and permanence. Regarding size, she makes the point that it is possible to have too much information, making it difficult to process it in any useful way, but that generally linguistics researchers will take as much data as is available. For the Qualitas Corpus, our intent is to make it as big as is practical, given our goal of supporting replication.

According to Hunston, the content of a corpus primarily depends on the purpose it is used for, and there are usually questions specific to a purpose that must be addressed in the design of the corpus. However, the design of a corpus is also impacted by what is available, and pragmatic issues such as whether the corpus creators have permission from the authors and publishers to make the contents available. The primary purpose that has guided the design of the Qualitas Corpus has been to support studies involving static analysis of code. The choice of contents is due to the large number of open source Java systems that are available.

The representativeness of a corpus is important for making statements about the population it is a sample of, that is, the generalisability of any conclusions based on its study. Hunston describes a number of issues that impact the design of the corpus, but notes that the real question is how the representativeness of the corpus should be taken into account when interpreting results. The Qualitas Corpus supports this assessment by providing full details of where its entries came from, as well as metadata on such things as the domain of an entry.

Finally, Hunston notes that a corpus needs to be regularly updated in order to remain representative of current usage, and so its design must support that.

IV. THE QUALITAS CORPUS

The current release is 20100719. It has 100 systems, 23 of which have multiple versions, with 495 versions in total. The full distribution is 9.42GiB in size, which is 32.8GiB once installed. It contains the source and binary forms of each system version as distributed by the developers (section IV-B). The 100 systems had to meet certain criteria (section IV-C). These criteria were developed for the first external release, one consequence of which is that some systems that were considered part of the corpus previously now are not, as they do not meet the criteria (section IV-I). There are questions regarding what things are in the corpus (section IV-E). The next release is scheduled for the end of October 2010 (section IV-J).

As discussed previously, the main goals for the corpus are that it reduces the costs of studies and supports replication of studies. These goals have impacted the criteria for inclusion and the corpus organisation.

A. Organisation

The corpus consists of a collection of systems, each of which consists of a set of versions. Each version consists of the original distribution (compressed) and two “unpacked” forms, bin and src. The unpacked forms are provided in order to reduce the costs of performing studies. The bin form contains the binary system as it was intended to be used, that is, Java bytecode. The src form contains everything in the source distribution. If the binary and source forms are distributed as a single archive file, then it is unpacked in src and the relevant files are copied into bin. There is also a metadata directory that contains detailed information about the contents of the version, and a file .properties that contains information on specific attributes of the version (section IV-D).

The original distribution is provided exactly as downloaded from the system’s download site. This serves several purposes. First, it means we can distribute the corpus without creating the bin and src forms, as they can be automatically created from the distributed forms, thus reducing the size of the corpus distribution. Second, it allows any user of the corpus to verify that the bin and src forms match what was distributed, or even create their own form of the corpus. Third, many distributions contain artifacts other than the code in the system, such as test and build infrastructure, and we want to keep these in case someone wishes to analyse them as well.

We use a standard naming convention to identify systems and versions. A system is identified by a string that cannot contain any occurrence of “-”. A version is identified by <name>-<version>, where <name> is the system name and <version> is some system-specific version identifier. Where possible, we use the names used by the original distribution. So far, the only case where we have not been able to do this is when the system name contains “-”, which we typically replace with “_”.

Figure 1 shows an example of the distribution for ant. There are 19 versions of ant, from ant-1.1 to ant-1.8.0. The original distribution of ant-1.8.0 consists of apache-ant-1.8.0-bin.zip, containing the deployable form of ant, which is unpacked in bin, and apache-ant-1.8.0-src.zip, containing the source code, unpacked in src.

ant/
  ant-1.1/
  (other versions omitted)
  ant-1.8.0/
    .properties
    bin/          (contents omitted)
    compressed/
      apache-ant-1.8.0-bin.zip
      apache-ant-1.8.0-src.zip
    metadata/
      contents.txt
    src/          (contents omitted)

Figure 1. Organisation of the Qualitas Corpus.

ant antlr aoi argouml aspectJ axion azureus c_jdbc checkstyle cobertura colt columba compiere derby displaytag drawswf drjava eclipse_SDK emma exoportal findbugs fitjava fitlibraryforfitnesse freecol freecs galleon ganttproject gt2 heritrix hibernate hsqldb htmlunit informa ireport itext ivatagroupware jFin_DateMath jag james jasml jasperreports javacc jchempaint jedit jena jext jfreechart jgraph jgraphpad jgrapht jgroups jhotdraw jmeter jmoney joggplayer jparse jpf jrat jre jrefactory jruby jsXe jspwiki jtopen jung junit log4j lucene marauroa megamek mvnforum myfaces_core nakedobjects nekohtml openjms oscache picocontainer pmd poi pooka proguard quartz quickserver quilt roller sablecc sandmark springframework squirrel_sql struts sunflow tomcat trove velocity webmail weka xalan xerces xmojo

Figure 2. Systems in the Qualitas Corpus.
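To make this layout concrete, the following is a minimal sketch of how an installed release could be traversed, assuming only the directory conventions just described (one directory per system, one subdirectory per version, each version holding compressed, bin, src, metadata, and a .properties file). The installation root path and the CorpusWalker class are ours, not part of the corpus tooling.

import java.io.File;

// A minimal sketch (not part of the corpus tooling) of traversing an
// installed corpus release, assuming only the layout of section IV-A.
// The installation root path is hypothetical.
public class CorpusWalker {
    public static void main(String[] args) {
        File root = new File("/corpora/QualitasCorpus-20100719");
        File[] systems = root.listFiles(File::isDirectory);
        if (systems == null) return; // nothing installed at this path
        for (File system : systems) {
            File[] versions = system.listFiles(File::isDirectory);
            if (versions == null) continue;
            for (File version : versions) {
                // Each version directory holds compressed/, bin/, src/,
                // metadata/, and a .properties file.
                File properties = new File(version, ".properties");
                System.out.printf("%s %s (.properties %s)%n",
                        system.getName(), version.getName(),
                        properties.exists() ? "present" : "missing");
            }
        }
    }
}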

B. Contents

Figure 2 lists the systems that are currently represented in the corpus. Figure 3 gives an idea of how big the systems are, listing the latest version of each system in the current release in order of number of top-level types (that is, classes, interfaces, enums, and annotations). Note that the y-axis is on a log scale. Table I shows the representativeness of the corpus in terms of domains represented and the number of systems in each domain.

For the most part, the systems in the corpus are open source and so the corpus can contain their distributions, especially as what is in the corpus is exactly what was downloaded from the system download site. One exception to this is jre. The license agreements for the binary and source distributions appear to not allow their inclusion in the corpus. Since jre is an interesting system to analyse, we consider it part of the corpus; however, corpus users must download what they need from the Java distribution site. What is provided by the corpus for jre is metadata similar to that for other systems.

[Plot omitted: the latest version of each system, ordered along the x-axis by number of top-level types; the y-axis shows top-level types on a log scale from 10 to 100000.]
Figure 3. Distribution of sizes of systems (y is log scale).

Table I. Domains represented in the corpus.

Domain                   No.
3D/graphics/media          6
IDE                        4
SDK                        7
database                   7
diagram/visualisation      9
games                      3
middleware                15
parsers/generators/make    8
programming language       2
testing                   12
tool                      27

C. Criteria for inclusion

Currently, the criteria for a system to be included in a release of the corpus are as follows:

1) In the previous release. We do not want to remove things from a release that were in a previous release. This allows people to have the latest release and yet still be able to reproduce studies based on previous releases. While we intend to continue to distribute previous releases, we assume most people would prefer not to have to juggle multiple versions of the corpus.

2) Written in Java. The choice of Java is due to both the amount of open source code available (far more than C# at the moment, although perhaps not as much as C++) and the relative ease with which it can be analysed (unlike, for example, C++). Should the opportunity arise, other languages will be added, but doing so is not a priority at the moment.

3) Distributes both source and binary forms. One advantage with Java is that its “compiled” form is also fairly easy to analyse, easier in fact than the source code (section IV-E), however there are slight differences between the source and binary forms. Having both forms means that analysis results from the binary form can be manually checked against the source.
In order for it to make sense to have both source and binary forms, the binary form must really be the binary form of the source. It is expensive (in time) to download source and then compile it, as every project has a different build technology (e.g. ant, bat files, eclipse infrastructure) that takes significant effort to understand. We have made the decision to simply take what is distributed by the developers, and assume that the binary form is from the source that is distributed. For this reason, we only include systems that do actually distribute both forms in a clearly identifiable way.
This rules out, for example, systems whose source is only available through a source control system. While in theory it should be possible to extract the source relevant to a given binary release, being confident that we can extract exactly the right versions of each file is sufficiently hard that we just avoid the problem at the moment. In the future we hope to relax this, at least for systems where the relevant source version is clearly labelled.

4) Distributes binary forms as a set of jar files. The binary form of systems included in the corpus must be bundled as .jar files, that is, not .war, .ear, etc., and not unbundled .class files. This is solely due to the expectations of our tools for managing the corpus and doing analysis using the corpus. This criterion will probably be the first to completely go away.

5) Available to anyone independent of the corpus. This criterion is intended to avoid ephemeral systems that crop up from time to time, or systems that are only known to us and cannot be acquired by other researchers. This allows the possibility of others independently checking the decisions we have made. This is the hardest criterion to meet, as we cannot be sure when development will stop on some system. Some systems we used (and analysed) before the first external release of the corpus have suffered this fate, and so are not in the corpus. In fact we already have the situation where the version of a system we have in the corpus is now apparently no longer available, as the developers only appear to keep (or make available at least) the most recent versions. Due to criterion 1, we have chosen to keep these, even though they may not now be available to everyone.

6) Identifiable contents. As discussed in section II-C, it is not always easy to determine what the contents of a system are. If there is uncertainty regarding the contents of a system, we do not include it. For example, the binary form of netbeans has 400+ jar files. Trying to determine what is relevant and what is not has proven to be a challenge that we are still struggling with, and so it is not in the corpus (yet).

These criteria were developed to simplify the management of the corpus. Eventually we hope some of them will be relaxed (e.g. 2 and 4) or will have less impact (e.g. 6).

D. Metadata

As part of the curation process we gather metadata about each system version, and we will continue to improve what metadata is provided (section IV-J). The corpus provides this metadata in part to resolve the issues discussed in section II-C. Ideally we would like to have the exact specification of what the developers consider to be “in” the system, however it is a very time consuming process to get such information and it is not clear that even the developers would necessarily agree amongst themselves. Instead, we follow these two principles:

• Do not include something in a given system if it could also appear in some other system in the corpus. This will avoid (or at least reduce) double-counting of code measurements that are done over the entire corpus.

• Make some decision about what is in a system and document it. This means that even if the decision is not necessarily the best, others trying to reproduce a given analysis will know what actually was analysed.

One place where metadata is kept is in a .properties file (see Figure 1). This file is formatted so that it can be easily managed using java.util.Properties. For example, the decision we have made regarding what is identified as being in a given version of a system is recorded in the sourcepackages field of the .properties file. This is a space-separated list of prefixes of packages of Java types. Any type whose fully-qualified name has one of the listed package prefixes as a prefix of the name is considered a type that was developed for the system, and everything else is considered as being a library type. For example, for azureus-3.0.3.4, the sourcepackages value is “org.gudy com.aelitis”, indicating that types such as com.aelitis.azureus.core.AzureusCore and org.gudy.azureus2.core3.util.FileUtil are considered part of that version of azureus, whereas org.pf.file.FileUtil (which is distributed with azureus) would not be.

Other metadata we keep in .properties includes the release date of the version, notes regarding the system and individual versions, domain information, and where the system distribution came from. The latter allows users of the corpus to check corpus contents for themselves.

The most significant development in the latest release has been the addition of significantly more metadata. We have improved the domain identification to use a more rigorous classification system (as shown in Table I). We now also list, for every .java file in src and every .class file found in an archive in bin, the actual location of the file, plus information regarding how the Java type these files correspond to is classified in the corpus.

Figure 4 shows an example of the data provided. It shows three entries for ant-1.8.0 (out of 2786). The first and third entries show that there are both .class (column 2) and .java files (column 3) corresponding to the Java types org.apache.tools.zip.ZipEntry and org.apache.tools.zip.ZipExtraField. The middle entry, for org.apache.tools.zip.ZipEntryTest, does not have data in column 2, indicating that while there is source code for it, it is not part of the ant deployment. Column 4 indicates whether the entry corresponds to a type identified as being in the system (that is, matches the sourcepackages value), with 0 indicating it does. Column 5 provides a summary of what forms the type exists in within the corpus (0 meaning it is in both src and bin, 1 for bin only, and 2 for src only). The next column indicates whether or not the entry is for a type that is considered “distributed”. Such types should also occur in bin, so this information can be used to identify non-public types — types that are declared in files with different names. Such types would be recorded as being not distributed but in bin. The remaining columns show whether types are public or non-public, the number of physical lines of code, and the number of non-commented non-blank lines.

The information shown in Figure 4 is provided in a tab-separated file, along with scripts that do basic analysis and which can be extended by users of the corpus.
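As an illustration, here is a minimal sketch of the sourcepackages prefix test just described, reading the .properties file with java.util.Properties as the text suggests. Only the key name and its space-separated format come from the corpus documentation above; the file path, class, and method names are ours and hypothetical.

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

// A sketch (ours) of the prefix test of section IV-D: a type belongs to
// the system if its fully-qualified name starts with one of the package
// prefixes listed in the sourcepackages field.
public class SourcePackagesCheck {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Hypothetical path to an installed version's metadata.
        try (FileReader in = new FileReader("azureus/azureus-3.0.3.4/.properties")) {
            props.load(in);
        }
        // e.g. "org.gudy com.aelitis" for azureus-3.0.3.4
        String[] prefixes = props.getProperty("sourcepackages", "").split("\\s+");

        // true: developed for the system
        System.out.println(isSystemType("com.aelitis.azureus.core.AzureusCore", prefixes));
        // false: a library type distributed with the system
        System.out.println(isSystemType("org.pf.file.FileUtil", prefixes));
    }

    // Everything not matching a listed prefix is considered a library type.
    static boolean isSystemType(String fqName, String[] prefixes) {
        for (String prefix : prefixes) {
            if (!prefix.isEmpty() && fqName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}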

...
org.[..].ZipEntry       apache-[..]/ant.jar  apache-ant-1.8.0/[..]/ZipEntry.java       0  0  0  0  435  195
org.[..].ZipEntryTest                        apache-ant-1.8.0/[..]/ZipEntryTest.java   0  2  0  0  208  144
org.[..].ZipExtraField  apache-[..]/ant.jar  apache-ant-1.8.0/[..]/ZipExtraField.java  0  0  0  0   85   11
...

Figure 4. Metadata for system version content details for ant-1.8.0. Some names have been elided for space.
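To show how this tab-separated data might be consumed, the following sketch tallies the non-commented non-blank line counts of the types that belong to a system and appear in both src and bin. The column positions follow the description in section IV-D (column 4: in-system flag; column 5: forms summary; last column: non-commented non-blank lines); the path and class name are ours, and the actual file may differ in detail.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// A sketch (ours) of summing NCLOC over a version's content metadata,
// assuming the column layout described for Figure 4.
public class ContentsSummary {
    public static void main(String[] args) throws IOException {
        long ncloc = 0;
        // Hypothetical path to an installed version's metadata file.
        try (BufferedReader in = new BufferedReader(
                new FileReader("ant/ant-1.8.0/metadata/contents.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1); // keep empty fields
                if (cols.length < 9) {
                    continue; // skip anything that is not a full entry
                }
                boolean inSystem = "0".equals(cols[3]);     // matches sourcepackages
                boolean inBothForms = "0".equals(cols[4]);  // in both src and bin
                if (inSystem && inBothForms) {
                    ncloc += Long.parseLong(cols[8]);
                }
            }
        }
        System.out.println("System NCLOC (types in both src and bin): " + ncloc);
    }
}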

E. Issues

Given the goal of replication of studies, the biggest challenge we have faced is clearly identifying the entities, as discussed in section II-C. There are, however, other issues we face. One is that systems change their name, such as the system that used to be called azureus now being called vuze. This creates the problem of either changing the corpus entry’s name, meaning corpus users would have to be aware of this change when comparing studies done on different releases of the corpus, or maintaining the old name in the corpus. We have chosen the latter approach.

Another issue is what to do when systems stop being supported or otherwise become unavailable. One example of this issue is jgraph, which is no longer open source. Since we keep the original distribution as part of the corpus, there should be no problem with simply keeping such systems in the corpus. While we target systems we hope will be long-lived for inclusion in the corpus, we cannot guarantee that the systems will in fact continue to exist. Already there are a number of systems in the corpus that no longer appear to be actively developed (e.g., fitjava, jasml, jparse — see section IV-J). For now we will just note the status of such systems.

F. Content Management

Following criterion 1, a new release of the corpus contains all the versions of systems in the previous release. There are however some changes between releases. If there are errors in a previous release (e.g. missing or wrong metadata, misnamed systems or versions, problems with installation) then we will fix them, while providing enough information to allow people to determine how much the changes may affect attempts to reproduce previous studies.

We have developed processes over time to support the management of the corpus. The two main processes are for making a new entry of a version of a system into the corpus, and creating a distribution for release. In the early days, these were all manual, but now, with each new release, scripts are being developed to automate more parts of the process.

G. Distributing the Corpus

To install the corpus, one acquires a distribution for a particular release. The release indicates the decision point as to what is in the corpus and so is used for identification in studies (section IV-H). A given distribution of a release provides support for particular kinds of studies. For example, one distribution contains just the most recent version of each system in the corpus. For those interested in just “breadth” studies, this distribution is simpler to deal with (and much smaller to download). As the corpus grows in size we anticipate other distributions will be provided.

Releases are identified by their date of release (in ISO 8601 format). The full distribution uses the release date, whereas any other distribution will use the release date annotated to indicate which distribution it is. For example, the current release is 20100719 and the distribution containing only the most recent versions of systems is 20100719r.

H. Using the corpus

The corpus is designed to be used in a specific way. A properly-installed distribution has the structure described in section IV-A. If every study is performed on the complete contents of a given release, using the metadata provided in the corpus to identify the contents of a system (in particular sourcepackages, section IV-D), then the results of those studies can be compared with good confidence that the comparison is meaningful. Furthermore, what is actually studied can be described succinctly by just indicating the release (and if necessary, particular distribution) used.

There is, however, no restriction on how the corpus can be used. It has been quite common, for example, to use a subset of its contents in studies. In such cases, in addition to identifying the release, we recommend that either what has been included be identified by listing the system versions used, or what has been left out be similarly identified. If systems not in the corpus are also used in a study, then not only do the system versions need to be identified, but also some discussion regarding how the issues described in section II-C have been resolved, and, ideally, some indication as to how others can acquire the same system code distributions.

I. History

The Qualitas Corpus was initially conceived and developed by one of us (Melton) for Ph.D. research during 2005. Many of the systems were chosen because they had been used in other studies (e.g., [22], [14], [15]), although not all were still available. In its first published use (the work was done in 2005 but published later) there were 21 systems in the corpus [24].

The original corpus was used and added to by members of the University of Auckland group over the next three years, growing from the 21 systems initially. It was made available for external release in January of 2008, containing 88 systems, 21 of them with multiple versions, a total of 214 entries. As noted earlier, some of the systems that were originally in the corpus and used in studies before its release did not meet the criteria used for the external distributions. By the end of 2008, there were 100 systems in the corpus. Since then, development of the corpus has focused on improving the quality of the corpus, in particular the metadata.

As the corpus has developed it has undergone some changes. The main changes have been in terms of the metadata that is maintained, however there has also been a change in terminology. Initially, the terminology used was that the corpus contained “versions” of “applications”, however “application” implied something that functioned independently. This created confusion for such things as jgraph or springframework, which are not useful by themselves. We now use “versions” of “systems”.

J. Future Plans

Our plans for the future of the corpus include growing it in size and representativeness (section V), making it easier to use for studies, and providing more “value add” in terms of metadata. As noted earlier, the next release is planned for late October 2010. The main goals for this release are to add new systems and to add the latest version of each of the existing systems.

One consequence of those outside the University of Auckland group using the corpus has been suggestions for systems to add. These will be the main candidates for new systems to be added. We will mainly consider large systems for this release. In the past such systems have typically been very expensive to process, however the scripts that produce the metadata described above will reduce that cost, making it easier to grow the corpus this way. This should allow us to, for example, include systems with complex structures such as netbeans.

Another consequence of people using the corpus is the need to perform studies different to what we originally envisaged. One example of this is that some studies need to have a complete deployable version of a system (e.g. for dynamic analysis). As we originally were only thinking of doing static analysis, we did not by default include third-party libraries in the corpus. We have now begun developing the infrastructure to provide versions that are deployable.

As there are more users of the corpus, more information (such as measurements from metrics) about the systems in the corpus is being gathered. We would like to include some of these measurements as part of the metadata in the future.

V. DISCUSSION

The Qualitas Corpus has been in use now for 5 years, and has been made externally available for just over 2 years. There have been over 30 publications describing studies based on its use (see the website for details [25]). Increasingly, the publications are by researchers not connected to the original development group. It is in use by about 15 research groups spread across 9 countries. It is being used for Ph.D., Masters, and undergraduate research. Some of the users have started contributing to the development of the corpus, as evidenced by the author list of this paper.

Looking at how the corpus has been used, primarily it has been used to reduce the cost of developing experiments. It is difficult to determine the cost of the development of the corpus since early on it was done as an adjunct to research, rather than the main goal. However it is certainly more than 1000 hours and could easily be double that. Any user of the corpus directly benefits from this effort. Some users have in fact used the corpus merely as a starting point and added other systems of interest to them. In some cases, those other systems have been commercial systems, allowing relatively cheap comparison between commercial and open source code.

There has been less use of the ability to replicate experiments or compare results across experiments. Given that the corpus has only been available relatively recently, this is perhaps not surprising. Once other measurements and metadata become part of the corpus itself, we hope this will change.

As Do et al. note, use of infrastructure such as the Qualitas Corpus can both be of benefit and introduce problems [20]. They note that misuse by users who have not followed directions carefully can be a problem, as we have also experienced. An example of where that can be a problem with the corpus is not using the sourcepackages metadata to identify system contents, meaning it is not clear which entities have been studied.

The main issue with the corpus is its representativeness. For now, it contains only open source Java systems. This issue is faced by any empirical study, but any users of the corpus must address it when discussing their results.

Hunston observes that there are limitations on the use of corpora [2]. While the points she raises (other than representativeness) do not directly relate to the Qualitas Corpus, they do raise an issue that does apply. The code in the corpus shows us what a software developer wrote, but what it cannot tell us is the intent of the developer.

VI. CONCLUSIONS

In order to increase our ability to use measurement of code to support software development practice we need to do more measurement of code in research. We have argued that this requires large, curated corpora with which to conduct code analysis empirical studies. We have discussed the issues associated with developing such corpora and how these might impact their design.

In this paper we have presented the Qualitas Corpus, a curated collection of open-source Java systems. This corpus significantly reduces the cost of empirical studies of code by reducing the time needed to find, collect, and organise the necessary code sets to the time needed to download the corpus. The metadata provided with the corpus provides an explicit record of decisions regarding what is being studied. This means that studies conducted with the corpus are easily replicated, and the results from different kinds of studies are more likely to be able to be sensibly compared.

The Qualitas Corpus is the largest curated corpus for code analysis studies, with the current version having 495 code sets, representing 100 unique systems. The next release will significantly increase that. The corpus has been successful, in that it is now being used by groups outside its original creators, and the number and size of code analysis studies has significantly increased since it has become available. We hope that it will further encourage replication and sharing of experimental results. The corpus will continue to be expanded in content and in provision of metadata, in particular its representativeness.

REFERENCES

[1] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object oriented design,” IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, 1994.
[2] S. Hunston, Ed., Corpora in Applied Linguistics. Cambridge University Press, 2002.
[3] L. C. Briand, J. Wüst, J. W. Daly, and D. V. Porter, “Exploring the relationships between design measures and software quality in object-oriented systems,” Journal of Systems and Software, vol. 51, no. 3, pp. 245–273, 2000.
[4] J. A. Lewis, S. M. Henry, D. G. Kafura, and R. S. Schulman, “An empirical study of the object-oriented paradigm and software reuse,” in Conference proceedings on Object-oriented programming systems, languages, and applications, 1991, pp. 184–196.
[5] B. M. Barry, “Prototyping a real-time embedded system in Smalltalk,” in Object-Oriented Programming Systems, Languages and Applications, Oct. 1989, pp. 255–265.
[6] D. E. Knuth, “An empirical study of FORTRAN programs,” Software–Practice and Experience, vol. 1, no. 2, pp. 105–133, 1971.
[7] R. J. Chevance and T. Heidet, “Static profile and dynamic behavior of COBOL programs,” SIGPLAN Notices, vol. 13, no. 4, pp. 44–57, Apr. 1978.
[8] B. P. Miller, L. Fredriksen, and B. So, “An empirical study of the reliability of UNIX utilities,” Communications of the ACM, vol. 33, no. 12, pp. 32–44, Dec. 1990.
[9] W. Frakes and T. Pole, “An empirical study of representation methods for reusable software components,” IEEE Transactions on Software Engineering, vol. 20, pp. 617–630, 1994.
[10] J. M. Bieman and J. X. Zhao, “Reuse through inheritance: a quantitative study of C++ software,” in Proceedings of the 1995 Symposium on Software Reusability. New York, NY, USA: ACM, 1995, pp. 47–52.
[11] R. Harrison, S. Counsell, and R. Nithi, “Coupling metrics for object-oriented design,” in IEEE International Symposium on Software Metrics, 1998, p. 150.
[12] S. Chidamber, D. Darcy, and C. Kemerer, “Managerial use of metrics for object-oriented software: an exploratory analysis,” IEEE Trans. Software Engineering, vol. 24, no. 8, pp. 629–639, Aug. 1998.
[13] R. Wheeldon and S. Counsell, “Power law distributions in class relationships,” in Third IEEE International Workshop on Source Code Analysis and Manipulation (SCAM03), 2003.
[14] J. Y. Gil and I. Maman, “Micro patterns in Java code,” in OOPSLA ’05: Proceedings of the 20th ACM SIGPLAN conference on Object oriented programming systems languages and applications. ACM Press, 2005, pp. 97–116.
[15] A. Potanin, J. Noble, M. Frean, and R. Biddle, “Scale-free geometry in OO programs,” Commun. ACM, vol. 48, no. 5, pp. 99–103, 2005.
[16] G. Succi, W. Pedrycz, S. Djokic, P. Zuliani, and B. Russo, “An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite,” Empirical Softw. Engg., vol. 10, no. 1, pp. 81–104, 2005.
[17] C. Collberg, G. Myles, and M. Stepp, “An empirical study of Java bytecode programs,” Softw. Pract. Exper., vol. 37, no. 6, pp. 581–641, 2007.
[18] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The DaCapo benchmarks: Java benchmarking development and analysis,” in Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, Portland, Oregon, Oct. 2006, pp. 169–190.
[19] I. Witten, S. Cunningham, and M. Apperley, “The New Zealand Digital Library project,” New Zealand Libraries, vol. 48, no. 8, pp. 146–152, 1996.
[20] H. Do, S. Elbaum, and G. Rothermel, “Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact,” Empirical Softw. Engg., vol. 10, no. 4, pp. 405–435, 2005.
[21] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes, “Sourcerer: a search engine for open source code supporting structure-based search,” in Companion to OOPSLA 2006, 2006, pp. 681–682.
[22] C. Grothoff, J. Palsberg, and J. Vitek, “Encapsulating objects with confined types,” in OOPSLA ’01: Proceedings of the 16th ACM SIGPLAN conference on Object oriented programming, systems, languages, and applications. New York, NY, USA: ACM Press, 2001, pp. 241–255.
[23] “International Corpus of English,” http://ice-corpora.net/ice, accessed 28 May 2010.
[24] H. Melton and E. Tempero, “The CRSS metric for package design quality,” in Australasian Computer Science Conference. Ballarat, Australia: Australian Computer Science Communications, Jan. 2007, pp. 201–210, published as CRPIT 62.
[25] Qualitas Research Group, “Qualitas Corpus Website,” http://www.cs.auckland.ac.nz/~ewan/corpus, Aug. 2010.
