Using the Lexicon from Source Code to Determine Application Domains
Andrea Capiluppi∗, Nemitari Ajienka†, Nour Ali∗, Mahir Arzoky∗, Steve Counsell∗, Giuseppe Destefanis∗, Alina Miron∗, Bhaveet Nagaria∗, Rumyana Neykova∗, Martin Shepperd∗, Stephen Swift∗, Allan Tucker∗
∗Department of Computer Science, Brunel University London, UK
†Department of Computer Science, Edge Hill University, UK

Abstract

Context: The vast majority of software engineering research is independent of the application domain: techniques and tools usage is reported without any domain context. This has not always been the case: early in the computing era, there were totally separate application domains (for example, scientific and data processing) and the research focus was often application-specific.

Objective: This paper claims that software systems should be separated and analysed into domain clusters. We propose a code-based approach to identify the application domain of a software system, via its lexicon. We compare its precision with the plain textual description attached to the same system.

Method: Using a sample of 50 Java projects, we obtained i) the description of each project (e.g., its ReadMe file), ii) the lexicon extracted from its source code, and iii) a list of its main topics extracted with the LDA information retrieval technique. We assigned a random subset of these data items to different researchers (i.e., ‘experts’), and asked them to assign each item to one (or more) application domain. We then evaluated the precision and accuracy of the three techniques.

Results: Using the agreement levels between experts, we observed that the ‘baseline’ dataset (i.e., the ReadMe files) obtained the highest average in terms of agreement between experts, but we also observed that the three techniques had the same mode and median agreement levels. Additionally, in the cases where no agreement was reached for the baseline dataset, the two other techniques provided sufficient support.

Conclusions: We conclude that using the corpora or the topics from source code can be an adequate substitute for the plain description when assigning a software system to an application domain.

Index Terms: Application Domains, Source Code, Java, Latent Dirichlet Allocation, Expert Judgement, Machine Learning

I. INTRODUCTION

Although the diversity and context of software systems have received some attention in the past [26], [10], contemporary research in the computing field is almost entirely application-independent. This has not always been the case: early in the computing era, there were totally separate application domains (for example, scientific and data processing) and the research focus was often application-specific [15].

In the context of empirical software engineering research, while the main goal of empirical papers is to achieve the generality of the results, the domain, context and uniqueness of a software system have not been considered very often by researchers. The most common rationale for doing so is to analyze projects having different application domains to decrease threats due to the generalizability of the results.

As in the example reported in [21], the extensive study of all JSON parsers available would find similarities between them or common patterns. That type of study would focus on one particular language (JSON), one specific domain (parsers) and inevitably draw limited conclusions. On the other hand, considering the “parsers” domain (but without focusing on one single language) would show the common characteristics of developing that type of system, irrespective of the language. The thrust of this paper stems from the work of several prominent researchers who called on the empirical software engineering community to ‘go deeper, not wider’ [24] and to practise ‘minding the mine, mining the mind’ [18]. As a matter of fact, there is increasing evidence that empirical research on software systems might yield domain-specific results, for instance when clustering the studied systems by the domains that they implement ([11], [27], [23]).

There are two main challenges to domain-based empirical software engineering. First, there is currently no commonly agreed taxonomy of application domains [15]. Several attempts have been made with varying success: past research on domains has focused on creating a domain taxonomy in a top-down fashion [1], i.e., starting with a seed taxonomy and refining it with various techniques (e.g., via expert judgement) [12]. The second challenge is assigning a software system to a domain: again, expert judgements have been applied more often than automatic assignments in past literature [4], but it has become clear that the chosen domains were selected as either too fine-grained or too large, thus defying the purpose of the categorisation.

In this paper, we argue that the extraction of topics that emerge from the source code can help experts assign systems to domains, instead of (or alongside) reading the documentation that accompanies a software system [25]. Researchers and practitioners can assign a system to a domain by focusing only on a limited number of core topics that emerge more strongly (e.g., with larger weight) from the lexicon of source code.

For this purpose we collected the description of 50 Java software systems, alongside the lexicon of their source code and a list of their most relevant topics. We asked 10 researchers to assign a unique subset of these data files to the most appropriate domain, and triangulated their expert opinions to discuss the following research questions:
1) RQ1: is the lexicon an acceptable substitute for the description of a software system, for the purpose of assigning it to a domain?
2) RQ2: similarly, are the topics (as extracted by LDA) acceptable substitutes?

This paper is organised as follows: Section II deals with related work and illustrates how our paper advances the state of the art. Section III describes the empirical approach that we adopted to extract the data sources and to assign them in subsets to researchers for expert judgement. Section IV summarises the results of the expert judgement task, while Section V discusses the findings and their relevance for practitioners. Section VI details the threats to validity and Section VII finally concludes.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

II. RELATED WORK

This work is an extension of a previously published paper [6] where we posited that keywords, corpora and topics could significantly help in establishing the provenance of a software system, given a list of pre-defined application domains. We now evaluate the plain description of a system (e.g., in the form of a ReadMe file) against the corpora and the topics extracted from the source code. The current paper should therefore be considered as the enactment of the future work proposed in [6]. Prior research has shown that the number and size of open-source projects are growing exponentially and open-source projects are becoming more diverse by expanding into different domains [9], [17]. In view of this and to reduce the effort required in manual categorisation of software projects, Tian et al. [25] proposed a new technique based on text mining to categorise software projects irrespective of the programming language used in their development. Their contribution is based on the analysis of the documentation that accompanies the software project, rather than the source code, as we propose in our paper.

In general, distinct results have been observed when more attention is paid to the categorisation of analysed software projects. [...] exceptions in catch blocks. Concretely, content management systems consistently have more exception handling code and throw more custom exceptions, as opposed to editors/viewers, which have less error handling code and mainly use standard exceptions instead.

Fayad and Smidt [11] explored software frameworks and classified frameworks based on related application domains, e.g., operating system and communication frameworks and user interface frameworks. The authors emphasised that, in contrast to earlier OO reuse techniques based on class libraries, frameworks are targeted at particular business units (such as data processing or cellular communications) and application domains (such as user interfaces or real-time avionics). They also highlight that the next generation of OO application frameworks targets application domains; on the other hand, application developers in more complex domains such as telecommunications and distributed medical imaging have traditionally lacked standard “off-the-shelf” frameworks, and as a result developers in such domains largely implement, test and maintain software systems from scratch. These findings further show the need to treat projects by domain for more distinct empirical results or observations. The need for a means of identifying which domain a project belongs to is also highlighted, since OSS developers, for example, can contribute frameworks for projects in the telecommunications domain upon identifying such projects and their required functionality.

In past research, software projects have been assigned to application domains by glancing at the source code or its general description (e.g., the ReadMe file, or the project documentation), creating categories, and finally assigning a project to a category. The research by Borges et al. [4] follows that approach: the dataset contains 5,000 GitHub projects (including 520 Java projects). There are two main issues with this approach: the first is that there is hardly any consistency in how a project might get documented by its developers, meaning that the approach in [4] becomes non-