A FrameNet-based Approach for Annotating Natural Language Descriptions of Software Requirements

Waad Alhoshan, Riza Batista-Navarro, Liping Zhao
School of Computer Science, University of Manchester, United Kingdom
[email protected] {riza.batista, liping.zhao}@manchester.ac.uk

Abstract

As most software requirements are written in natural language, they are unstructured and do not adhere to any formalism. Processing them automatically, within the context of software requirements engineering tasks, thus becomes difficult for machines. As a step towards adding structure to requirements documents, we exploited frames in FrameNet and applied them to the semantic annotation of software descriptions. This was carried out through an approach based on automated lexical unit matching, manual validation and harmonisation. As a result, we produced a novel corpus of requirements documents containing software descriptions which have been assigned a total of 242 unique semantic frames. Our evaluation of the resulting annotations shows substantial agreement between our two annotators, encouraging us to pursue finer-grained semantic annotation as part of future work.

Keywords: Semantic Frames, FrameNet, Corpus Annotation, Software Requirements, Requirements Engineering

1. Introduction

Software requirements play a pivotal role in all system design phases. Requirements are generally written in natural language, and therefore are unstructured (Ferrari et al., 2017a). This, however, presents a challenge to Requirements Engineering (RE) tasks, e.g., requirements analysis, which often necessitate the organisation and management of requirements in a systematic manner (Dick et al., 2017). While certain RE tasks (e.g., modelling) could benefit from automated analysis, this can only be facilitated if some structure is applied to the otherwise unstructured natural language requirements contained in software descriptions (Ferrari et al., 2017b).

One way by which we can add structure to software descriptions written in natural language is by attaching machine-readable semantic metadata that captures meaning. In documents from the general and scientific domains, this often corresponds to named entities, e.g., proper names of persons, places, diseases or chemical compounds. Software descriptions, however, do not allude to such proper names as often and instead mention generic if not abstract concepts (e.g., account creation, file deletion) and the participants involved (e.g., user, system).

As shown in early work by Belkhouche and Kozma (1993) and Rolland and Priox (1992), capturing meaning contained in requirements can be approached by using semantic frames: coherent structured representations of concepts (Petruck, 1997). These representations are based on the theory of frame semantics proposed by Fillmore (1977), whose work formed the basis of FrameNet, an online computational lexicon that catalogues detailed information on semantic frames¹ (Baker et al., 1998). For every frame it contains, FrameNet specifies the following: frame title, definition, frame elements (i.e., participants) and lexical units, i.e., words that evoke the frame. The concept of creation, for example, is encoded in FrameNet as a frame entitled Creating, with frame elements pertaining to Creator, Created_entity and Beneficiary (among many others). Importantly, lexical units that signify the concept are also provided, each of which is represented as a combination of its lemmatised form and part-of-speech (POS) tag (e.g., assemble.v, create.v, where v stands for verb). Such a frame can then be applied to a piece of text (such as in Example 1) to represent, in a structured manner, the idea of creation that is being conveyed. Containing over 1,200 such frames, FrameNet has become an invaluable resource to the NLP research community.

Example 1:
[The system]_{Creator} [generates]_{Creating_lexical unit} [records of user activities]_{Created_entity} [each time]_{Frequency} [the user logs into the system]_{Cause}.

Recent studies in RE have explored the application of FrameNet frames to software requirements acquisition and analysis. For example, Jha and Mahmoud (2017) employed semantic frames (automatically extracted by the SEMAFOR semantic role labeller²) as features in training machine learning-based models for categorising user reviews of mobile applications. Meanwhile, Kundi and Chitchyan (2017) proposed a technique for gathering requirements that employed FrameNet frames as the basis of linguistic patterns for generating use cases at the early stages of RE. They specifically made use of the Agriculture frame to demonstrate their approach.

We consider FrameNet a rich repository of semantic metadata that can be added to requirements documents in order to give them structure. In this work, we seek to employ FrameNet as the basis of a scheme for capturing the meaning of software descriptions. To this end, we adopt FrameNet semantic frames in annotating software requirements in a corpus of documents written in natural language. To the best of our knowledge, our work is the first attempt to investigate FrameNet as a means for annotating meaning within requirements documents. In this way, we are enriching them with semantic metadata and hence incorporating structure into them. As a result, we have produced and made publicly available a resource for the perusal of other members of the research community: the FrameNet-annotated FN-REQ³ corpus of natural language requirements documents.

The rest of this paper is organised as follows. Section 2 describes our methods for collecting software requirements documents and annotating them based on the semantic frames contained in FrameNet. In Section 3, we present and analyse the results of our annotation. Lastly, we present our conclusions and plans for future work in Section 4.

¹ https://framenet.icsi.berkeley.edu
² http://www.cs.cmu.edu/~ark/SEMAFOR/

2. Methodology

In this section, we present the methods we carried out in order to construct a corpus of documents containing sentences of software requirements, and to subsequently annotate them according to FrameNet.

2.1 Document Selection

Our goal is to gather a document set consisting of different types of software requirements. As a preliminary step, we formed a Google search query containing keywords such as "software description", "natural language requirements" and "software requirements specification". Furthermore, we employed snowball sampling and found additional requirements from various sources such as web blogs, research articles (together with their corresponding datasets), lecture materials and industrial/commercial documents. This step resulted in the collection of 34 requirements documents varying in length. The NLTK⁴ tool for sentence boundary detection was then applied to the 34 documents. After manually verifying the results, a total of 1,148 sentences⁵ were obtained (corresponding to 21,012 tokens).

2.2 Annotation Procedure

The annotation was carried out in a semi-automatic manner. This was facilitated by the two main steps described as follows.

2.2.1 Evoking Frames by Lexical Unit Matching

With the intention of making the annotation process more efficient, we developed a simple method for automatically matching words in the software descriptions in our corpus [...] (e.g., "further"), inclusion (e.g., "inclusive"), exclusion (e.g., "excluding"), contradiction (e.g., "nevertheless"), causation (e.g., "because of") and purpose (e.g., "in order"). The selection of these types was informed by our observations on the linguistic styles often used in writing software requirements. Through this process, we were able to evoke candidate semantic frames that denote the meaning of the requirements in our documents.

2.2.2 Validation

Deciding which FrameNet semantic frames capture the meaning expressed in software descriptions was performed manually in order to maximise accuracy. For this task, we employed two annotators. The first annotator (Annotator A) is a requirements engineer with five years of experience in the IT industry. The second annotator (Annotator B) is one of the authors of this paper and is a PhD candidate whose study is focussed on the use of NLP techniques to support RE tasks.

Provided with the candidate frames obtained in the previous step, the annotators were asked to confirm whether or not each frame captures the meaning of a given software description. This validation process was carried out in accordance with the guidelines we developed, which drew inspiration from the FrameNet annotation scheme proposed by Baker (2017). Over a four-week period, both annotators were trained in applying these guidelines to the annotation of a set of software descriptions from documents other than those in our corpus. Afterwards, the entire corpus of 34 documents, together with the candidate semantic frames retrieved in the previous step, was presented to each of Annotators A and B for annotation.

We provide Table 1 to show an example of the details that are presented to an annotator and the kind of judgement that they are expected to provide. The top row of the table contains a sample software description. The first column (LU) lists the lexical units matched by the method described in Section 2.2.1. The second and third columns (Start and End) indicate the location of the corresponding lexical unit in terms of character offsets, which is useful information in cases where a lexical unit appears multiple times within a description. The fourth column (Retrieved Frames) lists the titles of the frames linked