Development and Analysis of NLP Pipelines in Argo

Rafal Rak, Andrew Rowley, Jacob Carter, and Sophia Ananiadou
National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology
131 Princess St, M1 7DN, Manchester, UK
{rafal.rak,andrew.rowley,jacob.carter,sophia.ananiadou}@manchester.ac.uk

Abstract

Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

1 Introduction

Building NLP applications usually involves a series of individual tasks. For instance, the extraction of relationships between named entities in text is preceded by text segmentation, part-of-speech recognition, the recognition of named entities, and dependency parsing. Currently, the availability of such atomic processing components is no longer an issue; the problem lies in ensuring their compatibility, as combining components coming from multiple repositories, written in different programming languages, requiring different installation procedures, and having incompatible input/output formats can be a source of frustration and poses a real challenge for developers.

Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004) is a framework that tackles the problem of interoperability of processing components. Originally developed by IBM, it is currently an Apache Software Foundation open-source project[1] that is also registered at the Organization for the Advancement of Structured Information Standards (OASIS)[2]. UIMA has been gaining much interest from industry and academia alike for the past decade. Notable repositories of UIMA-compliant tools include the U-Compare component library[3], DKPro (Gurevych et al., 2007), cTAKES (Savova et al., 2010), the BioNLP-UIMA Component Repository (Baumgartner et al., 2008), and JULIE Lab's UIMA Component Repository (JCoRe) (Hahn et al., 2008).

In this work we demonstrate Argo[4], a Web-based (remotely accessed) workbench for collaborative development of text-processing workflows. We focus primarily on the process of development and analysis of both individual processing components and workflows composed of such components.

The next section demonstrates general features of Argo and lays out several technical details about

[1] http://uima.apache.org
[2] http://www.oasis-open.org/committees/uima
[3] http://nactem.ac.uk/ucompare/
[4] http://argo.nactem.ac.uk

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 115–120, Sofia, Bulgaria, August 4–9 2013. © 2013 Association for Computational Linguistics

UIMA that will ease the understanding of the remaining sections. Sections 3–5 discuss selected features that are useful in the development and analysis of components and workflows. Section 6 mentions related efforts, and Section 7 concludes the paper.
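As a toy illustration of the interoperability problem that UIMA addresses, namely components agreeing on shared data structures and interfaces, consider the following minimal pipeline sketch. It is purely illustrative: the class and field names are invented for this paper's narrative and are not the UIMA API (which is Java-based).

```python
from dataclasses import dataclass, field

@dataclass
class MiniCas:
    """Toy stand-in for a shared data container: the document text plus
    annotations accumulated by successive components."""
    text: str
    annotations: list = field(default_factory=list)

class SentenceSplitter:
    """Toy component: marks sentence spans ending in a full stop."""
    def process(self, cas):
        start = 0
        for i, ch in enumerate(cas.text):
            if ch == ".":
                cas.annotations.append(("Sentence", start, i + 1))
                start = i + 2
        return cas

class TokenAnnotator:
    """Toy component: marks whitespace-delimited token spans."""
    def process(self, cas):
        pos = 0
        for word in cas.text.split():
            begin = cas.text.index(word, pos)
            cas.annotations.append(("Token", begin, begin + len(word)))
            pos = begin + len(word)
        return cas

def run_pipeline(cas, components):
    # Interoperability in miniature: every component reads and extends
    # the same container through the same process() interface.
    for component in components:
        cas = component.process(cas)
    return cas

cas = run_pipeline(MiniCas("Argo runs workflows. UIMA helps."),
                   [SentenceSplitter(), TokenAnnotator()])
print(len(cas.annotations))  # 2 sentences + 5 tokens = 7
```

Because both components agree on the container and the interface, they compose freely; incompatible formats between real-world tools are exactly what breaks this composition.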

2 Overview of Argo

Argo comes equipped with an ever-growing library of atomic processing components that can be put together by users to form meaningful pipelines or workflows. The processing components range from simple data serialisers to complex text analytics and include text segmentation, part-of-speech tagging, parsing, named entity recognition, and discourse analysis.

Users interact with the workbench through a graphical user interface (GUI) that is accessible entirely through a Web browser. Figure 1 shows two views of the interface: the main, resource management window (Figure 1(a)) and the workflow diagramming window (Figure 1(b)). The main window provides access to documents, workflows, and processes, separated in easily accessible panels.

Figure 1: Screenshots of the Argo Web browser interface: (a) workflow management view; (b) workflow diagram editor view.

The Documents panel lists primarily user-owned files that are uploaded (through the GUI) by users into their respective personal spaces on the remote host. Documents may also be generated as a result of executing workflows (e.g., XML files containing annotations), in which case they are available for users to download.

The Workflows panel lists users' workflows, i.e., the user-defined arrangements of processing components together with their settings. Users compose workflows through a flexible, graphical diagramming editor by connecting the components (represented as blocks) with lines signifying the flow of data between components (see Figure 1(b)). The most common arrangement is to form a pipeline, i.e., each participating component has at most one incoming and at most one outgoing connection; however, the system also supports multiple branching and merging points in a workflow. An example is shown in Figure 2, discussed further in the text. For ease of use, components are categorised into readers, analytics, and consumers, indicating the role they play in a workflow. Readers are responsible for delivering data for processing and have only an outgoing port (represented as a green triangle). The role of analytics is to modify incoming data structures and pass them on to the following components in a workflow, and thus they have both incoming and outgoing ports. Finally, consumers are responsible for serialising or visualising (selected or all) annotations in the data structures without modifying them, and so they have only an incoming port.

The Processes panel lists resources that are created automatically when workflows are submitted for execution by users. Users may follow the progress of executing workflows (processes) as well as manage the execution from this panel. The processing of workflows is carried out on remote servers, and thus frees users from using their own processing resources.

2.1 Argo and UIMA

Argo supports and is based upon UIMA and thus can run any UIMA-compliant processing component. Each such component defines or imports type systems and modifies common annotation structures (CAS). A type system is the representation of a data model that is shared between components, whereas a CAS is the container of data whose structure complies with the type system. A CAS stores feature structures, e.g., a token with its text boundaries and a part-of-speech tag. Feature structures may, and often do, refer to a subject of annotation (Sofa), a structure that (in text-processing applications) stores the text. UIMA comes with built-in data types including primitive types (boolean, integer, string, etc.), arrays, and lists, as well as several complex types, e.g., Annotation, which holds a reference to the Sofa the annotation is asserted about, and two features, begin and end, for marking the boundaries of a span of text. A developer is free to extend any of the complex types.

2.2 Architecture

Although the Apache UIMA project provides an implementation of the UIMA framework, Argo incorporates home-grown solutions, especially in terms of the management of workflow processing. This includes features such as workflow branching and merging points, user-interactive components (see Section 4), as well as distributed processing.

The primary processing is carried out on a multi-core server. Additionally, in order to increase computing throughput, we have incorporated cloud computing capabilities into Argo, which is designed to work with various cloud computing providers. As a proof of concept, the current implementation uses HTCondor, an open-source, high-throughput computing software framework. Currently, Argo is capable of switching the processing of workflows to a local cluster of over 3,000 processor cores. Further extensions to use the Microsoft Azure[5] and Amazon EC2[6] cloud platforms are also planned.

The Argo platform is available entirely via RESTful Web services (Fielding and Taylor, 2002), and therefore it is possible to gain access to all or selected features of Argo by implementing a compliant client. In fact, the "native" Web interface shown in Figure 1 is an example of such a client.

3 Distributed Development

Argo includes a Generic Listener component that permits the execution of a UIMA component running externally to the Argo system. It is primarily intended to be used during the development of processing components, as it allows a developer to rapidly make any necessary changes whilst continuing to make use of the existing components available within Argo, which may otherwise be unavailable if developing on the developer's local system. Any component that a user wishes to deploy on the Argo system has to undergo a verification process, which could lead to a slower development lifecycle without the availability of this component.

Generic Listener operates in a reverse manner to a traditional Web service: rather than Argo connecting to the developer's component, the component connects to Argo. This behaviour was deliberately chosen to avoid network-related issues, such as firewall port blocking, which could become a source of frustration to developers.

When a workflow containing a Generic Listener is executed within Argo, it will continue as normal until the point at which the Generic Listener receives its first CAS object. Argo will prompt the user with a unique URL, which must be supplied to the client component run by the user, allowing it to connect to the Argo workflow and continue its execution.

A skeleton Java project has been provided to assist in the production of such components. It contains a Maven structure, Eclipse IDE project files, and the required libraries, in addition to a number of shell scripts to simplify the running of the component. The project provides both command-line interface (CLI) and GUI runner applications that take, as arguments, the name of the class of the locally developed component and the URL provided by Argo upon each run of a workflow containing the remote component.

An example of a workflow with a Generic Listener is shown in Figure 2. The workflow is designed for the analysis and evaluation of a solution (in this case, the automatic extraction of biological events) that is being developed locally by the user. The reader (BioNLP ST Data Reader) provides text documents together with gold (i.e., manually created) event annotations prepared for the BioNLP Shared Task[7]. The annotations are selectively removed with the Annotation Remover, and the remaining data is sent on to the Generic Listener component and, consequently, on to the developer's machine. The developer's task is to connect to Argo, retrieve CASes from the running workflow, and for each CAS recreate the removed annotations as faithfully as possible.

[5] http://www.windowsazure.com
[6] http://aws.amazon.com/ec2
[7] http://2013.bionlp-st.org/
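The CAS objects exchanged between Argo and a locally running component follow the UIMA data model described in Section 2.1: feature structures such as Annotation reference a Sofa and carry begin/end offsets, and developers may extend the built-in complex types. A minimal Python sketch of that model follows; the class names merely mirror the concepts, since Apache UIMA's actual API is Java.

```python
from dataclasses import dataclass

@dataclass
class Sofa:
    """Subject of annotation: in text processing, simply the document text."""
    text: str

@dataclass
class Annotation:
    """Mirrors UIMA's built-in Annotation type: a reference to the Sofa the
    annotation is asserted about, plus begin and end offsets."""
    sofa: Sofa
    begin: int
    end: int

    def covered_text(self):
        # The span of the Sofa text delimited by begin/end.
        return self.sofa.text[self.begin:self.end]

@dataclass
class Token(Annotation):
    """A developer-defined extension of a complex type, as UIMA permits."""
    pos: str = "UNK"

sofa = Sofa("Interleukin-2 activates T cells.")
token = Token(sofa, 0, 13, pos="NN")
print(token.covered_text())  # "Interleukin-2"
```

Recreating the removed annotations in a retrieved CAS then amounts to instantiating such feature structures with the appropriate offsets and features.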

Figure 3: Example of an annotated fragment of a document visualised with the Brat BioNLP ST Comparator component. The component highlights (in red and green) differences between two sources of annotations.

Figure 2: Example of a workflow for the development, analysis, and evaluation of a user-developed solution for the BioNLP Shared Task.

The developer can then track the performance of their solution by observing standard information extraction measures (precision, recall, etc.) computed by the Reference Evaluator component, which compares the original, gold annotations (coming from the reader) against the developer's annotations (coming from the Generic Listener) and saves these measures for each document/CAS into a tabular-format file. Moreover, the differences can be tracked visually through the interactive Brat BioNLP ST Comparator component, discussed in the next section.

4 Annotation Analysis and Manipulation

Traditionally, NLP pipelines (including those in existing UIMA-supporting platforms), once set up, are executed without human involvement. One of the novelties in Argo is the introduction of user-interactive components, a special type of analytic that, if present in a workflow, causes the execution of the workflow to pause. Argo resumes the execution only after receiving input from a user. This feature allows for manual intervention in the otherwise automatic processing by, e.g., manipulating automatically created annotations. Examples of user-interactive components include Annotation Editor and Brat BioNLP ST Comparator.

The Brat BioNLP ST Comparator component expects two incoming connections from components processing the same subject of annotation. As a result, using brat visualisation (Stenetorp et al., 2012), it will show annotation structures by laying them out above the text and mark differences between the two inputs by colour-coding missing or additional annotations in each input. A sample visualisation coming from the workflow in Figure 2 is shown in Figure 3. Since in this particular workflow the Brat BioNLP ST Comparator receives gold annotations (from the BioNLP ST Data Reader) as one of its inputs, the highlighted differences are, in fact, false positives and false negatives.

Annotation Editor is another example of a user-interactive component; it allows the user to add, delete, or modify annotations. Figure 4 shows the editor in action. The user has the option to create a span-of-text annotation by selecting a text fragment and assigning an annotation type. More complex annotation types, such as tokens with part-of-speech tags or annotations that do not refer to the text (meta-annotations), can be created or modified using an expandable tree-like structure (shown on the right-hand side of the figure), which makes it possible to create any annotation structure permissible by a given type system.

Figure 4: Example of manual annotation with the user-interactive Annotation Editor component.
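The exact-match span comparison underlying measures such as those reported by the Reference Evaluator can be sketched as below. This is illustrative logic only, not Argo's implementation; annotations are simplified to (begin, end, type) tuples.

```python
def span_scores(gold, predicted):
    """Compare two annotation sets given as (begin, end, type) tuples and
    return precision, recall, and F1 under exact matching."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # annotations recreated exactly
    fp = len(predicted - gold)   # spurious annotations (false positives)
    fn = len(gold - predicted)   # missed annotations (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One gold annotation recreated exactly; one with the wrong type; one missed;
# one spurious prediction.
gold = [(0, 4, "Protein"), (10, 14, "Protein"), (20, 28, "CellType")]
pred = [(0, 4, "Protein"), (10, 14, "CellType"), (30, 35, "Protein")]
p, r, f = span_scores(gold, pred)
print(round(p, 2), round(r, 2))  # 0.33 0.33
```

A per-document table of such scores is what the tabular output file described above would contain.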

(a) Select query

(b) Results (fragment):

neAText  neACat    neBText  neBCat   count
Ki-67    Protein   p53      Protein  85
DC       CellType  p53      Protein  61
DC       CellType  KCOT     Protein  47

(c) Insert query

Figure 5: Example of (a) a SPARQL query that returns biological interactions; (b) a fragment of the retrieved results; and (c) a SPARQL query that creates new UIMA feature structures. Namespaces and data types are omitted for brevity.

5 Querying Serialised Data

Argo comes with several (de)serialisation components for reading and storing collections of data, such as a generic reader of text (Document Reader) or readers and writers of CASes in XMI format (CAS Reader and CAS Writer). Among the most useful in terms of annotation analysis, however, are the RDF Writer component and its counterpart, RDF Reader. RDF Writer serialises data into RDF files and supports several RDF formats, such as RDF/XML, Turtle, and N-Triples. The resulting RDF graph consists of both the data model (type system) and the data itself (CAS) and thus constitutes a self-contained knowledge base. RDF Writer has an option to create a graph for each CAS or a single graph for an entire collection. Such a knowledge base can be queried with languages such as SPARQL[8], an official W3C Recommendation.

Figure 5 shows an example of a SPARQL query that is performed on the output of an RDF Writer in the workflow shown in Figure 1(b). This workflow results in several types of annotations, including the boundaries of sentences, tokens with part-of-speech tags and lemmas, and chunks, as well as biological entities such as DNA, RNA, cell line, and cell type. The SPARQL query is meant to retrieve pairs of seemingly interacting biological entities ranked according to their occurrence in the entire collection. An interaction here is (naïvely) defined as the co-occurrence of two entities in the same sentence. The query includes patterns for retrieving the boundaries of sentences (syn:Sentence) and of two biological entities (sem:NamedEntity) and then filters the cross-product of those by ensuring that the two entities are enclosed in a sentence.

[8] http://www.w3.org/TR/2013/REC-sparql11-overview-20130321/
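The same naive co-occurrence logic can be mimicked over plain span offsets, as in the sketch below. This is a stand-in written for illustration only; in Argo the equivalent pattern matching is performed over the RDF graph by the SPARQL query of Figure 5(a).

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(sentences, entities):
    """Rank entity pairs by co-occurrence within a sentence.
    sentences: list of (begin, end) spans;
    entities: list of (begin, end, text, category) tuples."""
    counts = Counter()
    for s_begin, s_end in sentences:
        # Entities enclosed in this sentence, as in the SPARQL FILTER.
        inside = [e for e in entities if e[0] >= s_begin and e[1] <= s_end]
        for a, b in combinations(inside, 2):
            counts[(a[2], a[3], b[2], b[3])] += 1
    return counts.most_common()

# Invented toy offsets for two sentences and four entity mentions.
sentences = [(0, 40), (41, 80)]
entities = [(0, 5, "Ki-67", "Protein"), (10, 13, "p53", "Protein"),
            (50, 52, "DC", "CellType"), (60, 63, "p53", "Protein")]
for pair, n in cooccurring_pairs(sentences, entities):
    print(pair, n)
```

Each returned row corresponds to one line of the results fragment in Figure 5(b): the two entity texts, their categories, and the co-occurrence count.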

As a result, the query returns a list of biological entity pairs accompanied by their categories and their numbers of appearances, as shown in Figure 5(b). Note that the query itself does not list the four biological categories; instead, it requests their common semantic ancestor, sem:NamedEntity. This is one of the advantages of using semantically enabled languages such as SPARQL.

SPARQL also supports graph manipulation. Suppose a user is interested in placing the retrieved biological entity interactions from our running example into the UIMA structure Relationship, which simply defines a pair of references to other structures of any type. This can be accomplished, without resorting to programming, by issuing the SPARQL insert query shown in Figure 5(c). The query will create triple statements compliant with the definition of Relationship. The resulting modified RDF graph can then be read back into Argo by the RDF Reader component, which will convert the new RDF graph back into a CAS.

6 Related Work

Other notable examples of NLP platforms that provide graphical interfaces for managing workflows include GATE (Cunningham et al., 2002) and U-Compare (Kano et al., 2010). GATE is a standalone suite of text processing and annotation tools and comes with its own programming interface. In contrast, U-Compare, similarly to Argo, uses UIMA as its base interoperability framework. The key features of Argo that distinguish it from U-Compare are the Web availability of the platform, the primarily remote processing of workflows, a multi-user, collaborative architecture, and the availability of user-interactive components.

7 Conclusions

Argo emerges as a one-stop solution for developing and processing NLP tasks. Moreover, the presented annotation viewer and editor, performance evaluator, and RDF (de)serialisers are indispensable for the analysis of the processing tasks at hand. Together with the distributed development support for developers wishing to create their own components or run their own tools with the help of the resources available in Argo, the workbench becomes a powerful development and analytical NLP tool.

Acknowledgments

This work was partially funded by the MRC Text Mining and Screening grant (MR/J005037/1).

References

W. A. Baumgartner, K. B. Cohen, and L. Hunter. 2008. An open-source framework for large-scale, flexible evaluation of biomedical text mining systems. Journal of Biomedical Discovery and Collaboration, 3:1+.

H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.

D. Ferrucci and A. Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348.

R. T. Fielding and R. N. Taylor. 2002. Principled design of the modern Web architecture. ACM Transactions on Internet Technology, 2(2):115–150, May.

I. Gurevych, M. Mühlhäuser, C. Müller, J. Steimle, M. Weimer, and T. Zesch. 2007. Darmstadt knowledge processing repository based on UIMA. In Proceedings of the First Workshop on Unstructured Information Management Architecture, Tübingen, Germany.

U. Hahn, E. Buyko, R. Landefeld, M. Mühlhausen, M. Poprat, K. Tomanek, and J. Wermter. 2008. An overview of JCoRe, the JULIE Lab UIMA component repository. In Language Resources and Evaluation Workshop "Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP", pages 1–8.

Y. Kano, R. Dorado, L. McCrohon, S. Ananiadou, and J. Tsujii. 2010. U-Compare: An integrated language resource evaluation platform including a comprehensive UIMA resource library. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, pages 428–434.

G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii. 2012. brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107, Avignon, France.
