Development and Analysis of NLP Pipelines in Argo
Rafal Rak, Andrew Rowley, Jacob Carter, and Sophia Ananiadou
National Centre for Text Mining, School of Computer Science, University of Manchester
Manchester Institute of Biotechnology, 131 Princess St, M1 7DN, Manchester, UK
{rafal.rak,andrew.rowley,jacob.carter,sophia.ananiadou}@manchester.ac.uk

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 115-120, Sofia, Bulgaria, August 4-9, 2013. (c) 2013 Association for Computational Linguistics

Abstract

Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus to take advantage of both the available library of components (without the need to install them locally) and the analytical tools.

1 Introduction

Building NLP applications usually involves a series of individual tasks. For instance, the extraction of relationships between named entities in text is preceded by text segmentation, part-of-speech tagging, named entity recognition, and dependency parsing. Currently, the availability of such atomic processing components is no longer an issue; the problem lies in ensuring their compatibility, as combining components coming from multiple repositories, written in different programming languages, requiring different installation procedures, and having incompatible input/output formats can be a source of frustration and poses a real challenge for developers.

The Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004) is a framework that tackles the problem of interoperability of processing components. Originally developed by IBM, it is currently an Apache Software Foundation open-source project (http://uima.apache.org) that is also registered with the Organization for the Advancement of Structured Information Standards (OASIS) (http://www.oasis-open.org/committees/uima).

UIMA has been gaining much interest from industry and academia alike for the past decade. Notable repositories of UIMA-compliant tools include the U-Compare component library (http://nactem.ac.uk/ucompare/), DKPro (Gurevych et al., 2007), cTAKES (Savova et al., 2010), the BioNLP-UIMA Component Repository (Baumgartner et al., 2008), and JULIE Lab's UIMA Component Repository (JCoRe) (Hahn et al., 2008).

In this work we demonstrate Argo (http://argo.nactem.ac.uk), a Web-based (remotely accessed) workbench for the collaborative development of text-processing workflows. We focus primarily on the process of development and analysis of both individual processing components and workflows composed of such components.

The next section demonstrates general features of Argo and lays out several technical details about UIMA that will ease the understanding of the remaining sections.
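The interoperability that UIMA provides rests on two ingredients: a shared container for annotated data and a common processing interface that every component implements. The following self-contained Java sketch is a deliberately simplified illustration of that idea, not the actual UIMA API; all names in it (Cas, Component, Tokeniser, CapitalisedEntityTagger) are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for UIMA's CAS: the document text plus annotations.
class Cas {
    final String text;
    final List<String[]> annotations = new ArrayList<>(); // {type, begin, end}
    Cas(String text) { this.text = text; }
    void annotate(String type, int begin, int end) {
        annotations.add(new String[]{type, String.valueOf(begin), String.valueOf(end)});
    }
}

// The common interface every component agrees on.
interface Component {
    void process(Cas cas);
}

// A whitespace "tokeniser" component.
class Tokeniser implements Component {
    public void process(Cas cas) {
        String t = cas.text;
        int start = 0;
        for (int i = 0; i <= t.length(); i++) {
            if (i == t.length() || t.charAt(i) == ' ') {
                if (i > start) cas.annotate("Token", start, i);
                start = i + 1;
            }
        }
    }
}

// A toy "entity recogniser" that marks capitalised tokens.
class CapitalisedEntityTagger implements Component {
    public void process(Cas cas) {
        List<String[]> tokens = new ArrayList<>(cas.annotations);
        for (String[] a : tokens) {
            int b = Integer.parseInt(a[1]), e = Integer.parseInt(a[2]);
            if (Character.isUpperCase(cas.text.charAt(b))) cas.annotate("Entity", b, e);
        }
    }
}

public class Pipeline {
    // A pipeline: each component enriches the same CAS in turn.
    public static Cas run(String text, Component... components) {
        Cas cas = new Cas(text);
        for (Component c : components) c.process(cas);
        return cas;
    }
    public static void main(String[] args) {
        Cas cas = run("Argo runs UIMA workflows", new Tokeniser(), new CapitalisedEntityTagger());
        System.out.println(cas.annotations.size()); // prints 6 (4 tokens + 2 entities)
    }
}
```

Because each component depends only on the shared Cas and Component abstractions, either one could be swapped for an independently developed implementation, which is the essence of the interoperability argument.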
Sections 3-5 discuss selected features that are useful in the development and analysis of components and workflows. Section 6 mentions related efforts, and Section 7 concludes the paper.

2 Overview of Argo

Argo comes equipped with an ever-growing library of atomic processing components that can be put together by users to form meaningful pipelines or workflows. The processing components range from simple data serialisers to complex text analytics and include text segmentation, part-of-speech tagging, parsing, named entity recognition, and discourse analysis.

Users interact with the workbench through a graphical user interface (GUI) that is accessible entirely through a Web browser. Figure 1 shows two views of the interface: the main, resource management window (Figure 1(a)) and the workflow diagramming window (Figure 1(b)). The main window provides access to documents, workflows, and processes, separated into easily accessible panels.

[Figure 1: Screenshots of the Argo Web browser interface: (a) workflow management view; (b) workflow diagram editor view.]

The Documents panel lists primarily user-owned files that are uploaded (through the GUI) by users into their respective personal spaces on the remote host. Documents may also be generated as a result of executing workflows (e.g., XML files containing annotations), in which case they are available for users to download.

The Workflows panel lists users' workflows, i.e., the user-defined arrangements of processing components together with their settings. Users compose workflows through a flexible, graphical diagramming editor by connecting the components (represented as blocks) with lines signifying the flow of data between them (see Figure 1(b)). The most common arrangement is a pipeline, i.e., each participating component has at most one incoming and at most one outgoing connection; however, the system also supports multiple branching and merging points in a workflow. An example is shown in Figure 2, discussed further in the text. For ease of use, components are categorised into readers, analytics, and consumers, indicating the role they play in a workflow. Readers are responsible for delivering data for processing and have only an outgoing port (represented as a green triangle). The role of analytics is to modify incoming data structures and pass them on to the following components in a workflow, and thus they have both incoming and outgoing ports. Finally, consumers are responsible for serialising or visualising (selected or all) annotations in the data structures without modifying them, and so they have only an incoming port.

The Processes panel lists resources that are created automatically when workflows are submitted for execution by users. Users may follow the progress of the executing workflows (processes) as well as manage the execution from this panel. The processing of workflows is carried out on remote servers, which frees users from committing their own processing resources.

2.1 Argo and UIMA

Argo supports and is based upon UIMA and can thus run any UIMA-compliant processing component. Each such component defines or imports type systems and modifies common analysis structures (CAS). A type system is the representation of a data model that is shared between components, whereas a CAS is a container of data whose structure complies with the type system. A CAS stores feature structures, e.g., a token with its text boundaries and a part-of-speech tag. Feature structures may, and often do, refer to a subject of annotation (Sofa), a structure that (in text-processing applications) stores the text. UIMA comes with built-in data types including primitive types (boolean, integer, string, etc.), arrays, and lists, as well as several complex types, e.g., Annotation, which holds a reference to the Sofa the annotation is asserted about and two features, begin and end, for marking the boundaries of a span of text. A developer is free to extend any of the complex types.

2.2 Architecture

Although the Apache UIMA project provides an implementation of the UIMA framework, Argo incorporates home-grown solutions, especially in terms of the management of workflow processing. This includes features such as workflow branching and merging points, user-interactive components (see Section 4), as well as distributed processing.

The primary processing is carried out on a multi-core server. Additionally, in order to increase computing throughput, we have incorporated cloud computing capabilities into Argo, which is designed to work with various cloud computing providers. As a proof of concept, the current implementation uses HTCondor, an open-source, high-throughput computing software.

The Generic Listener component is primarily intended to be used during the development of processing components, as it allows a developer to rapidly make any necessary changes whilst continuing to make use of the existing components available within Argo, which may otherwise be unavailable if developing on the developer's local system. Any component that a user wishes to deploy on the Argo system has to undergo a verification process, which could lead to a slower development lifecycle without the availability of this component.

Generic Listener operates in a reverse manner to a traditional Web service: rather than Argo connecting to the developer's component, the component connects to Argo. This behaviour was deliberately chosen to avoid network-related issues, such as firewall port blocking, which could become a source of frustration to developers.

When a workflow containing a Generic Listener is executed within Argo, it will continue as normal until the point at which the Generic Listener receives its first CAS object. Argo will then prompt the user with a unique URL, which must be supplied to the client component run by the user, allowing it to connect to the Argo workflow and continue its execution.

A skeleton Java project has been provided to assist in the production of such components. It contains a Maven structure, Eclipse IDE project files, and required libraries, in addition to a number of shell scripts to simplify the running of the component.