Semantic Annotation of Papers: Interface & Enrichment Tool (SAPIENT)

Maria Liakata, Claire Q and Larisa N. Soldatova
Department of Computer Science, University of Wales, Aberystwyth, SY23 3DB, UK
[email protected], [email protected], [email protected]

Abstract

In this paper we introduce a web application (SAPIENT) for sentence-based annotation of full papers with semantic information. SAPIENT enables experts to annotate scientific papers sentence by sentence and also to link related sentences together, thus forming spans of interesting regions, which can facilitate text mining applications. As part of the system, we developed an XML-aware sentence splitter (SSSplit) which preserves XML markup and identifies sentences through the addition of in-line markup. SAPIENT has been used in a systematic study for the annotation of scientific papers with concepts representing the Core Information about Scientific Papers (CISP), to create a corpus of 225 annotated papers.

1 Introduction

Given the rapid growth in the quantity of scientific literature, particularly in the Biosciences, there is an increasing need to work with full papers rather than abstracts, both to identify their key contributions and to provide some automated assistance to researchers (Karamanis et al., 2008; Medlock and Briscoe, 2007). Initiatives like OTMI1, which aim to make full papers available to researchers for text mining purposes, are further evidence that relying solely on abstracts presents important limitations for such tasks. A recent study on whether information retrieval from full text is more effective than searching abstracts alone (Lin, 2009) showed that the former is indeed the case. Its experimental results suggested that span-level analysis is a promising strategy for taking advantage of full papers, where spans are defined as paragraphs of text assessed by humans and deemed to be relevant to one of 36 pre-defined topics. Therefore, when working with full papers, it is important to be able to identify and annotate spans of text. In previous research, sentence-based annotation has been used to identify text regions with scientific content of interest to the user (Wilbur et al., 2006; Shatkay et al., 2008) or zones of different rhetorical status (AZ) (Teufel and Moens, 2002). Sentences are the structural units of paragraphs and can be more flexible than paragraphs for text mining purposes other than information retrieval.

Current general purpose systems for linguistic annotation such as Callisto2 allow the creation of a simple annotation schema, that is, a tag set augmented with simple (e.g. string) attributes for each tag. Knowtator (Ogren, 2006) is a plug-in of the knowledge representation tool Protégé3, which works as a general purpose text annotation tool and has the advantage that it can work with complex ontology-derived schemas. However, these systems are not particularly suited to sentence by sentence annotation of full papers, as one would need to highlight entire sentences manually. Also, these systems work mainly with plain text, so they do not necessarily interpret the structural information already available in the paper, which can be crucial to annotation decisions for the type of high-level annotation mentioned above.

1 http://opentextmining.org/wiki/Main_Page
2 http://callisto.mitre.org/manual/use.html
3 http://protege.stanford.edu/

The OSCAR3 (Corbett et al., 2007) tool for the recognition and annotation of chemical named entities fully displays the underlying paper information in XML, but is not suited to sentence by sentence annotation.

To address the above issues, we present a system (SAPIENT) for sentence by sentence annotation of scientific papers which supports ontology-motivated concepts representing the Core Information about Scientific Papers (CISP) (Soldatova and Liakata, 2007). An important aspect of the system is that although annotation is sentence-based, the system caters for identifiers which link together sentences pertaining to the same concept. This way, spans of interest or key regions are formed. SAPIENT also incorporates OSCAR3 capability for the automatic recognition of chemical named entities and runs within a browser, which makes it platform independent. SAPIENT takes as input full scientific papers in XML, splits them into individual sentences, displays them and allows the user to annotate each sentence with one of 11 CISP concepts, as well as link the sentence to other sentences referring to the same instance of the concept selected. The system is especially suitable for so-called multi-dimensional annotation (Shatkay et al., 2008) or ontology-motivated annotation, where a label originates from a class with properties. SAPIENT is currently being employed by 16 Chemistry experts to develop a corpus of scientific papers (the ART Corpus) annotated with Core Information about Scientific Papers (CISP), covering topics in Physical Chemistry and Biochemistry.

2 SAPIENT System Description

We chose to implement SAPIENT as a web application, so as to make it platform independent and easier to incorporate as part of an online workflow. We have used state of the art web technologies to develop SAPIENT, namely Java, Javascript (with Asynchronous JavaScript and XML (AJAX) functionality), XSLT, CSS and XML. The system has a client-server architecture (see Figure 1), with the functionality for annotation contained in Javascript, which runs client-side in the browser. This is inspired by, but in contrast with, OSCAR3 (Corbett et al., 2007), which also allows manual annotation alongside the automated annotation of chemical named entities, but where each minor edit is saved to the server, to a file. We chose to make more of the functionality client-side in order to reduce the number of server requests, which could become problematic if the system became widely distributed.

Figure 1: Architecture of the SAPIENT system. (From the diagram: the browser sends the uploaded paper to the server, where it is saved as source.xml and split into sentences with SSSplit; the result is saved as mode2.xml, rendered as dynamic HTML via an XSL stylesheet, and on "Save" the CISP and OSCAR annotations are written back to mode2.xml.)

SAPIENT has been designed to take as input full papers in XML, conforming to the SciXML schema (Rupp et al., 2006) (see Section 3).

To view or annotate a paper, a user must first upload it. The index page of SAPIENT shows a list of papers already uploaded (available as links) and an interface for uploading more papers (see Figure 2). Once the user selects a link to a paper, the paper is split into sentences using the XML-aware sentence splitter SSSplit, which we have developed (see Section 4) and which is included in the server-side Java. The resultant XML file is stored alongside the original upload.
Sentence splitting involves detecting the boundaries of sentences and, in this context, marking them with inline <s> tags added to the original XML. The <s> tags contain an id attribute enumerating the sentences.
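To make the effect of this step concrete, the following is a minimal sketch of our own (not an excerpt from SAPIENT or SSSplit) of what a paragraph might look like before and after sentence splitting; the <s> element and its id attribute follow the description above, while the <REF> citation tag is an invented placeholder for whatever inline markup the input happens to contain.

```java
// Hypothetical illustration of the sentence-splitting output described above,
// not code from SAPIENT: inline <s> tags with an enumerating id are added
// around each detected sentence, while existing markup (here a SciXML <P>
// paragraph and an invented inline <REF> citation tag) is preserved.
public class SplitOutputExample {
    public static void main(String[] args) {
        String before =
            "<P>We synthesised the compound in two steps <REF>1</REF>. "
          + "The yield was 78%.</P>";

        String after =
            "<P><s id=\"1\">We synthesised the compound in two steps "
          + "<REF>1</REF>.</s> <s id=\"2\">The yield was 78%.</s></P>";

        System.out.println(before);
        System.out.println(after);
    }
}
```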

Figure 2: Index page of the SAPIENT system.

After sentence splitting, the new XML file containing sentence boundaries marked by <s>...</s> tags is parsed by XSLT into HTML, so that it displays in the browser. In the HTML interface dynamically generated in this way, Javascript annotation drop-downs are available for each sentence. The user can perform annotations by selecting items from the drop-downs, and all the corresponding annotation information is stored in Javascript until a request to save is made by the user.

The Javascript drop-downs allow annotation at two levels (Figure 3), enabling a sentence to have a semantic label (type) with properties (subtypes) and an identifier (conceptID).

In the current implementation of SAPIENT, the type drop-down value corresponds to the selection of one out of 11 general scientific concepts (Liakata and Soldatova, 2008), namely 'Background', 'Conclusion', 'Experiment', 'Goal of the Investigation', 'Hypothesis', 'Method', 'Model', 'Motivation', 'Object of the Investigation', 'Observation' and 'Result'. These labels originate from a set of meta-data, the Core Information about Scientific Papers (CISP) (Soldatova and Liakata, 2007), which was constructed using an ontology methodology, based on the ontology of experiments EXPO (Soldatova and King, 2006). Because these labels map to ontology classes, they can also have properties. For example, 'Method' has the properties 'New'/'Old' and 'Advantage'/'Disadvantage'. These properties are dependent on the type selected and are expressed in terms of the subtype drop-down.

The third drop-down, conceptID, allows a user to assign a concept identifier to a sentence. That is, concept identifiers designate and link together instances of the same semantic concept, spread across different sentences, which can be in different parts of the paper. For example, the second result ("Res2") can be referred to by 1 sentence in the abstract, 5 sentences in the Discussion and 2 sentences in the Conclusion sections.

The distinction between sentence identifiers and concept identifiers is an important characteristic of the system. It means that the system does not necessarily assume a '1-1' correspondence between a sentence and a concept, but rather that concepts can be represented by spans of often disjoint text. Therefore, SAPIENT indirectly allows the annotation of discourse segments beyond the sentence level and also keeps track of co-referring sentences.
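As an illustration of how concept identifiers give rise to such spans, the following minimal Java sketch (our own illustration, assuming a simple Sentence record rather than SAPIENT's actual data structures) groups annotated sentences by their conceptID, so that, for instance, all sentences labelled "Res2" are recovered as one possibly discontinuous span.

```java
import java.util.*;

// Hypothetical sketch, not SAPIENT code: recovering concept "spans" by
// grouping together the sentences that share a conceptID, as described above.
public class SpanExample {
    // Minimal stand-in for an annotated sentence (sentence id, CISP type, conceptID).
    record Sentence(String id, String type, String conceptId) {}

    public static void main(String[] args) {
        List<Sentence> sentences = List.of(
            new Sentence("3",  "Result", "Res2"),   // e.g. in the abstract
            new Sentence("54", "Result", "Res2"),   // e.g. in the Discussion
            new Sentence("71", "Result", "Res2"));  // e.g. in the Conclusion

        // Group by conceptID: each value is a (possibly disjoint) span of sentences.
        Map<String, List<Sentence>> spans = new LinkedHashMap<>();
        for (Sentence s : sentences) {
            spans.computeIfAbsent(s.conceptId(), k -> new ArrayList<>()).add(s);
        }
        System.out.println(spans.get("Res2")); // all sentences referring to the same result
    }
}
```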

2.1 SAPIENT Usability

Even though SAPIENT has been primarily designed to work with CISP concepts, it can be used to annotate papers according to any sentence-based annotation scheme. The changes required can easily be made by modifying the XSL sheet which dynamically generates HTML from the XML and organises the structure of the drop-down menus. Automated noun-phrase based annotation from existing ontologies is available to SAPIENT users through OSCAR3 (Corbett et al., 2007), since SAPIENT incorporates OSCAR3 functionality for chemical named entity recognition. The latter is implemented as a link which, when selected, calls the OSCAR3 workflow (integrated in the system) to automatically recognise chemical named entities (NEs) (see Figure 5).

When all annotations (both sentence-based and chemical NEs) are saved to the server, a new version of the XML file is produced, which contains in-line annotation marking the sentences, as well as extra in-line annotation for the semantic concepts and the NEs, embedded within the corresponding tags. These annotation tags are compliant with the SciXML schema (Rupp et al., 2006) and, in the case of sentence-based annotations, are of the form <s id="..." type="..." conceptID="..." novelty="..." advantage="..."> (see Figure 4). The attribute type stands for the CISP concept selected for the sentence in question. The conceptID attribute is an enumerator of the particular concept which the sentence refers to. For example, two different sentences will have different sentence ids, but if they refer to the same concept (e.g. the same "Conclusion") they will be assigned the same conceptID (e.g. "Con3"). The attributes novelty and advantage are properties of the concepts assigned to a sentence and depend on the concept selection. They take boolean values, or the dummy value "None" if the properties are not defined for a particular concept. For example, these attributes are relevant when the concept selected is a 'Method', in which case the method can be "New"/"Old" and/or have an "Advantage"/"Disadvantage". The novelty and advantage attributes co-exist in the annotation (as can be seen in Figure 4), but they are not set by the system at the same time. For instance, if a sentence refers to a new method, it will be given the type 'Method' and the subtype "New"; this sets the novelty attribute in the underlying XML to "Yes" and leaves the advantage attribute set to the default "None". The sentence will also be given a conceptID, e.g. "Met1". If another sentence refers to an advantage of this method, then the new sentence will be assigned the type 'Method', the subtype "Advantage" (which sets the underlying advantage attribute to "Yes") and the same conceptID "Met1". The novelty attribute value is then inherited from the novelty attribute value of the first coreferring sentence, which in this case is "New".
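To make the interplay of subtype, novelty, advantage and conceptID easier to follow, here is a small Java sketch of the logic just described. It is our own reading of the behaviour, not SAPIENT's implementation: the paper only spells out the "New" and "Advantage" cases, so the handling of "Old" and "Disadvantage" below is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the attribute-setting behaviour described above,
// not code from SAPIENT.
public class AttributeExample {
    static final class Annotation {
        String type;
        String conceptId;
        String novelty = "None";    // dummy default when the property is not defined
        String advantage = "None";
    }

    // Derive the underlying XML attributes from a drop-down selection, inheriting
    // novelty from the first earlier sentence that carries the same conceptID.
    static Annotation annotate(String type, String subtype, String conceptId,
                               Map<String, Annotation> firstByConceptId) {
        Annotation a = new Annotation();
        a.type = type;
        a.conceptId = conceptId;
        if (subtype.equals("New") || subtype.equals("Old")) {
            a.novelty = subtype.equals("New") ? "Yes" : "No";   // "Old" case is assumed
        } else if (subtype.equals("Advantage") || subtype.equals("Disadvantage")) {
            a.advantage = subtype.equals("Advantage") ? "Yes" : "No"; // "Disadvantage" case is assumed
            Annotation first = firstByConceptId.get(conceptId);
            if (first != null) {
                a.novelty = first.novelty;  // inherited from the first co-referring sentence
            }
        }
        firstByConceptId.putIfAbsent(conceptId, a);
        return a;
    }

    public static void main(String[] args) {
        Map<String, Annotation> firstByConceptId = new HashMap<>();
        Annotation s1 = annotate("Method", "New", "Met1", firstByConceptId);       // novelty=Yes, advantage=None
        Annotation s2 = annotate("Method", "Advantage", "Met1", firstByConceptId); // advantage=Yes, novelty inherited
        System.out.println(s1.novelty + "/" + s1.advantage + "  " + s2.novelty + "/" + s2.advantage);
    }
}
```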
3 Input: Paper in XML

SAPIENT currently accepts as input papers in XML, especially ones compliant with the SciXML schema (Rupp et al., 2006). SciXML is ideally suited for this purpose as it was developed for representing the logical structure of scientific research papers. Tags used in the schema serve the purpose of paper identification (e.g. <TITLE>, <AUTHOR>), defining sections of the paper (e.g. <DIV>, <HEADER>), text sections with specific function and formatting (e.g. <ABSTRACT>, <EQUATION>), paragraph tags <P>, references, tables, figures and footnotes, lists and the bibliography. SAPIENT operates only on the <TITLE>, <ABSTRACT>, <BODY> and <P> tags, leaving out any list elements following the body, such as acknowledgements, figures or references at the end of the paper. This is because we make the assumption that only the abstract and the body contain sentences with semantic content of any importance to the research carried out in the paper. This would have been different if SAPIENT annotated figures as well, but such provision is not currently made. Tags such as those marking citations in the text are included within the sentence boundaries.

Even though SAPIENT was developed with the SciXML schema in mind, it will work with any well-formed XML document that has <PAPER> as the root node and which also contains an <ABSTRACT> and a <BODY> node. Therefore, it is relatively easy to adapt SAPIENT to other XML schemas.

Figure 3: Example of SAPIENT annotation through selection from a drop-down menu.
Figure 4: Behind the scenes: example XML fragment of a paper annotated using SAPIENT.
Figure 5: Incorporation of OSCAR3 annotations in SAPIENT, after selecting the link "Auto Annotate".

4 SSSplit: Sapient Sentence Splitting

4.1 Sentence Matching

The reason for developing our own sentence splitter was that the sentence splitters widely available could not handle XML properly. The XML markup contains useful information about the document structure and formatting in the form of inline tags, which is important for determining the logical structure of the paper. The latter is worth preserving for our purposes, since it can influence the annotation of individual sentences. XML markup (e.g. section, paragraph or formatting tags) needs to be combined carefully with the tags designating sentence boundaries (<s>), so that the resulting document is well-formed XML. Current sentence splitters ignore XML markup, which means that any document formatting/information would have to be removed in order to use them. RASP (Briscoe et al., 2006), the sentence splitter used in the Sciborg project4 at the University of Cambridge, can deal with XML but has to be compiled for different operating systems, which would compromise the platform independence of SAPIENT. A recent MPhil thesis (Owusu, 2008) has also developed an XML-aware sentence splitter, but the code is in Microsoft C#.Net and therefore not platform independent.

We have written the XML-aware sentence splitter SSSplit in the platform-independent Java language (version 1.6), based on and extending open source Perl code for handling plain text5. In order to make our sentence splitter XML-aware, we translated the Perl regular expression rules into Java and modified them to make them compatible with the SciXML (Rupp et al., 2006) schema. We then further improved the rules by training on a set of 14 papers in SciXML. This involved displaying the papers, checking whether the XML was well formed and making corrections accordingly. We would observe cases of oversplit and undersplit sentences and amend the rules while keeping them as general as possible. The rules in SSSplit were evaluated by comparing the system output against a gold standard of 41 papers, where sentence boundaries had been provided by human experts (see Section 4.2). The sentence splitter is integrated within the SAPIENT system but is also available as a separate package ("SSSplit"). This should enable any future work to easily incorporate or extend it. It is currently trained for splitting papers in SciXML, but can be easily ported to any other kind of XML, as discussed in Section 3.

4 http://www.cl.cam.ac.uk/research/nl/sciborg/www/
5 http://search.cpan.org/~tgrose/HTML-Summary-0.017/
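As a rough illustration of the kind of regular-expression rule involved, the following Java sketch (our own simplification, not the actual SSSplit rules, which handle abbreviations, nested markup and other cases) splits a paragraph's text at sentence-final punctuation followed by whitespace and a capital letter, and wraps each piece in an <s> element with an enumerating id, leaving the enclosing markup untouched.

```java
import java.util.regex.Pattern;

// Hypothetical, heavily simplified sketch of an XML-aware splitting rule;
// not code taken from SSSplit.
public class TinySplitter {
    // A sentence boundary: '.', '!' or '?' followed by whitespace and a capital letter.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+(?=[A-Z])");

    /** Wraps each detected sentence of a paragraph's text in <s id="..."> tags. */
    static String split(String paragraphText, int firstId) {
        StringBuilder out = new StringBuilder();
        int id = firstId;
        for (String sentence : BOUNDARY.split(paragraphText)) {
            out.append("<s id=\"").append(id++).append("\">")
               .append(sentence)
               .append("</s> ");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String p = "We synthesised compound 1. It was then analysed by NMR.";
        System.out.println("<P>" + split(p, 1) + "</P>");
        // <P><s id="1">We synthesised compound 1.</s> <s id="2">It was then analysed by NMR.</s></P>
    }
}
```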
Re- by SSSplit, we used a Perl script which compared call is the proportion of true positives over all the the sentence boundaries (start and end) generated relevant start and end tags in the hand-annotated pa- by SSSplit, to sentence tags in a set of 41 papers pers, giving a measure of the number of boundaries (SciXML files) annotated manually by human ex- actually found. F-Measure combines Precision and perts. If both the start and end of a sentence matched Recall to give a more balanced view on the system up in the generated and manual versions, we consid- performance. ered this a true positive result. In the case where a In comparison with RASP and the XML-Aware sentence did not match in the two versions, we first splitter of (Owusu, 2008), SSSplit performed well, searched for a matching end in our generated set of though it did not outperform these systems. Their sentences and then in the hand annotated version. If highest result for precision was 0.996 (vs 0.964 for the ‘true’ end of the sentence (as defined by the man- SSSplit) and for recall 0.990 (vs 0.994 for SSSplit). ual annotation) was found in later sentences in the We can explain their higher results somewhat by SSSplit version, this meant that the system had split their use of n-fold cross-validation on 36 out of the a sentence too early, or “oversplit”. This we consid- same 41 papers that we used, which can allow in- ered to be a false positive, since we had detected a formation from the test set to leak into the training sentence boundary where in reality there was none. data. We did not perform n-fold cross-validation, as This would result in the following sentence being this would have involved going through each of the matched at the end only, which also counts as a false papers and removing any potential influence on our positive. In the case where the end of the SSSplit regular expression rules of the sentences included sentence was found in a later sentence, within the within, which is a non-trivial process. Our test data set of ‘true’ sentences, it meant that our sentence was completely unseen, which meant that our eval-</p><p>198 Training Testing 41 papers, annotated by the 16 experts in non- (1979 sentences) (5002 sentences) overlapping groups of 3, shows significant agree- Precision 0.961 0.964 ment between annotators, given the difficulty of Recall 0.995 0.994 the task (an average kappa co-efficient of 0.55 per F-measure 0.96875 0.978 group). The details of this work are beyond the scope of the current paper, but the preliminary re- Table 2: Comparison of SSSplit on the training and test- sults underline the usability of both the CISP meta- ing papers. The training set consisted of 14 papers (1979 data and SAPIENT. In the future, we plan to further sentences) and the testing set of 41 papers (5002 sen- evaluate the ART Corpus by incorporating existing tences). machine <a href="/tags/Learning/" rel="tag">learning</a> algorithms into SAPIENT and au- tomating the generation of CISP meta-data. This uation is stricter, avoiding any influence from the would make SAPIENT a very useful tool and would training data. indeed add a lot more value to the meta-data, since In addition to the comparison between SSSplit training and paying annotators is a costly process and the other two XML-aware sentence splitters, we and manually annotating papers is incredibly time also performed a comparison between our training consuming. and testing sets, depicted in Table 2. 
In comparison with RASP and the XML-aware splitter of Owusu (2008), SSSplit performed well, though it did not outperform these systems. Their highest result for precision was 0.996 (vs 0.964 for SSSplit) and for recall 0.990 (vs 0.994 for SSSplit). We can explain their higher results somewhat by their use of n-fold cross-validation on 36 of the same 41 papers that we used, which can allow information from the test set to leak into the training data. We did not perform n-fold cross-validation, as this would have involved going through each of the papers and removing any potential influence of the sentences included within on our regular expression rules, which is a non-trivial process. Our test data was completely unseen, which meant that our evaluation is stricter, avoiding any influence from the training data.

In addition to the comparison between SSSplit and the other two XML-aware sentence splitters, we also performed a comparison between our training and testing sets, depicted in Table 2.

            Training   Testing
Precision   0.961      0.964
Recall      0.995      0.994
F-measure   0.96875    0.978

Table 2: Comparison of SSSplit on the training and testing papers. The training set consisted of 14 papers (1979 sentences) and the testing set of 41 papers (5002 sentences).

As can be seen in Table 2, recall was only slightly better on the training set than on the test set, but precision was worse on the training set, presumably because of lack of attention being paid to the oversplitting in a particular paper ("b103844n"). This shows that we have not overfitted to the training set in developing our splitter. Our recall is particularly high, indicating that our splitter makes very few false negative errors. We can attribute many of the false positive errors to the somewhat small set of abbreviations considered, resulting in oversplit sentences. We would like to incorporate a more sophisticated approach to abbreviations in the future.

5 Performing CISP Annotations

Within the context of the ART project (Soldatova et al., 2007), SAPIENT has been used by 16 Chemistry experts to annotate 265 papers from RSC Publishing journals, covering topics in Physical Chemistry and Biochemistry. The experts have been annotating the papers sentence by sentence, assigning each sentence one of the 11 core scientific concepts and linking together sentences across a paper which refer to the same instance of a concept. The aim is to create a corpus of annotated papers (the ART Corpus) with regions of scientific interest identified by CISP concepts ("Result", "Conclusion", "Observation", "Method" and so on).

A preliminary evaluation of the experts' agreement on the ART Corpus, based on a sample of 41 papers annotated by the 16 experts in non-overlapping groups of 3, shows significant agreement between annotators, given the difficulty of the task (an average kappa coefficient of 0.55 per group). The details of this work are beyond the scope of the current paper, but the preliminary results underline the usability of both the CISP meta-data and SAPIENT. In the future, we plan to further evaluate the ART Corpus by incorporating existing machine learning algorithms into SAPIENT and automating the generation of CISP meta-data. This would add a lot more value to the meta-data and make SAPIENT an even more useful tool, since training and paying annotators is a costly process and manually annotating papers is incredibly time-consuming.
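For reference, the kappa coefficient reported above compares the observed agreement P_o between annotators with the agreement P_e expected by chance; in its standard form (the paper does not state which multi-annotator variant was used for the groups of three) it is

    kappa = (P_o - P_e) / (1 - P_e)

so a value of 0.55 indicates agreement well above chance on this difficult, many-category labelling task.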
6 Conclusion and Future Work

We have presented SAPIENT, a web-based tool for the annotation of full papers, sentence by sentence, with semantic information. We have also discussed how these annotations result in the indirect definition of regions of interest within the paper. The system has already been tested in a systematic study and has been employed for the creation of a corpus of papers annotated with CISP concepts (the ART Corpus). In the future we plan to extend SAPIENT so that the system can itself suggest annotation labels to users. We also plan to target the needs of particular users, such as authors of papers, reviewers and editors. SAPIENT, SSSplit and their documentation are available for download from http://www.aber.ac.uk/compsci/Research/bio/art/sapient/.

Acknowledgments

We would like to thank Peter Corbett, Amanda Clare, Jem Rowland and Andrew Sparkes for reading and commenting on earlier versions of this paper. We would also like to thank the anonymous reviewers for their useful comments. This work was part of the ART Project (http://www.aber.ac.uk/compsci/Research/bio/art/), funded by the U.K. Higher Education Joint Information Systems Committee (JISC).

References

E. Briscoe, J. Carroll and R. Watson. 2006. The Second Release of the RASP System. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia.

P. Corbett, P. Batchelor and S. Teufel. 2007. Annotation of Chemical Named Entities. Proc. BioNLP.

Nikiforos Karamanis, Ruth Seal, Ian Lewin, Peter McQuilton, Andreas Vlachos, Caroline Gasperin, Rachel Drysdale and Ted Briscoe. 2008. Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics, 9:193.

Maria Liakata and Larisa N. Soldatova. 2008. Guidelines for the annotation of General Scientific Concepts. JISC Project Report, http://ie-repository.jisc.ac.uk/.

Jimmy Lin. 2009. Is Searching Full Text More Effective Than Searching Abstracts? BMC Bioinformatics, 10:46.

Ben Medlock and Ted Briscoe. 2007. Weakly supervised learning for hedge classification in scientific literature. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.

P. Ogren. 2006. Knowtator: a Protégé plug-in for annotated corpus construction. Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Companion Volume: Demonstrations, New York, June 2006.

Lawrence Owusu. 2008. XML-Aware Sentence Splitter. MPhil thesis, Cambridge, UK.

CJ Rupp, Ann Copestake, Simone Teufel and Ben Waldron. 2006. Flexible Interfaces in the Application of Language Technology to an eScience Corpus. Proceedings of the UK e-Science Programme All Hands Meeting 2006 (AHM2006), Nottingham, UK.

Hagit Shatkay, Fengxia Pan, Andrey Rzhetsky and W. John Wilbur. 2008. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24(18):2086–2093.

Larisa N. Soldatova and Maria Liakata. 2007. An ontology methodology and CISP – the proposed Core Information about Scientific Papers. JISC Project Report, http://ie-repository.jisc.ac.uk/137/.

L. Soldatova, C. Batchelor, M. Liakata, H. Fielding, S. Lewis and R. King. 2007. ART: An ontology based tool for the translation of papers into Semantic Web format. Proceedings of the SIG/ISMB07 Ontology Workshop, pages 33–36.

Larisa N. Soldatova and Ross D. King. 2006. An Ontology of Scientific Experiments. Journal of the Royal Society Interface, 3:795–803.

S. Teufel and M. Moens. 2002. Summarizing Scientific Articles – Experiments with Relevance and Rhetorical Status. Computational Linguistics, 28(4).

W. Wilbur, A. Rzhetsky and H. Shatkay. 2006. New Directions in Biomedical Text Annotations: Definitions, Guidelines and Corpus Construction. BMC Bioinformatics, 7:356.