Text Annotation with OpenNLP and UIMA

Graham Wilcock University of Helsinki [email protected]

Abstract

The tutorial presents a practical overview of automatic linguistic annotation of texts using freely available open source tools.

1 OpenNLP

Text annotation typically involves tasks at several linguistic levels, such as sentence boundary detection, tokenization, part-of-speech tagging, phrase chunking, syntactic parsing, named entity recognition, coreference resolution, and semantic role labelling. Most of these tasks can be done with appropriate combinations of OpenNLP tools (http://opennlp.sourceforge.net).

Practical examples will show annotations of a short English text. OpenNLP outputs annotations in a simple plain text format.

The OpenNLP tools do a good job of creating annotations automatically, but a number of issues arise. Although the OpenNLP tools themselves are open source Java and platform-independent, the annotation pipelines (where the output of one component is input to the next component) are created by Linux shell scripts and Windows .bat files that are platform-dependent and error-prone. Apache Ant can be used to gain platform-independence, but Ant requires technical skills.

2 WordFreak

OpenNLP tools can also be used in WordFreak (http://wordfreak.sourceforge.net) as plugins. WordFreak provides an attractive, easy-to-use GUI for linguistic annotations. It is open source Java and platform-independent, and is convenient for manually correcting annotations made by the OpenNLP tools. However, WordFreak creates annotations in its own specific XML stand-off annotation format.

This raises the issue of interoperability. How can annotations be interchanged between tools that use different annotation formats? This can be done by XSLT transformations, for example WordFreak XML format can be transformed by XSLT to OpenNLP plain text annotation format. However, writing such XSLT stylesheets requires specific technical skills.

3 UIMA

UIMA (Unstructured Information Management Architecture) provides solutions to many of the above issues. UIMA is open-source Java (http://incubator.apache.org/uima). It aims to support interoperability and scalability.

In UIMA, annotators run in analysis engines. New annotators are written in Java, and existing annotation tools such as the OpenNLP tools are converted to UIMA annotators by Java wrappers. Pipelines of annotators run in aggregate analysis engines. Pipelines can be configured by writing XML descriptors (similar in some ways to Ant tasks), or by means of an easy-to-use graphical configuration tool in the GUI (Figure 1).

UIMA supports interoperability at the level of annotation formats by adopting XML Metadata Interchange (XMI), which has been proposed as an interchange standard. Instead of having its own specific XML annotation format, the UIMA annotation format is XMI.

UIMA also supports interoperability at the level of annotation tools by means of a type system that defines annotation types and their features. Types are used to check that output from one component is the right type for input to the next component.

Practical examples will show how to configure and use pipelines of OpenNLP tools in UIMA, and how to view the annotations in UIMA (Figure 2).

References

Graham Wilcock. 2009. Introduction to Linguistic Annotation and Text Analytics. Morgan and Claypool.

Figure 1: Configuring an OpenNLP annotation pipeline in UIMA

Figure 2: Viewing annotations by OpenNLP Parser in UIMA
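
The idea in Section 3 that types are used to check that the output of one component is the right input for the next can be illustrated with a small toy model. The sketch below is plain Java and does not use the UIMA API. The class TypedPipelineSketch, the Step class, the check method, and the type names (Sentence, Token, PosTag) are all hypothetical names chosen for illustration; a real UIMA pipeline declares its types in a type system descriptor instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a type-checked annotation pipeline (illustrative only, not the UIMA API). */
public class TypedPipelineSketch {

    /** A hypothetical annotator that names the annotation types it needs and produces. */
    public static final class Step {
        final String name;
        final Set<String> inputs;
        final Set<String> outputs;

        public Step(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = new LinkedHashSet<>(inputs);
            this.outputs = new LinkedHashSet<>(outputs);
        }
    }

    /**
     * Walks the steps in order and returns one message per unmet input type.
     * An empty list means every input is produced by some earlier step.
     */
    public static List<String> check(List<Step> pipeline) {
        Set<String> available = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (Step step : pipeline) {
            for (String needed : step.inputs) {
                if (!available.contains(needed)) {
                    problems.add(step.name + " needs " + needed);
                }
            }
            // Types produced by this step become available to later steps.
            available.addAll(step.outputs);
        }
        return problems;
    }

    public static void main(String[] args) {
        Step sentences = new Step("SentenceDetector",
                Collections.<String>emptyList(), Arrays.asList("Sentence"));
        Step tokens = new Step("Tokenizer",
                Arrays.asList("Sentence"), Arrays.asList("Token"));
        Step tags = new Step("PosTagger",
                Arrays.asList("Token"), Arrays.asList("PosTag"));

        // Correct order: each step finds the types it needs.
        System.out.println(check(Arrays.asList(sentences, tokens, tags)));
        // Tagger placed before the tokenizer: its Token input is missing.
        System.out.println(check(Arrays.asList(sentences, tags, tokens)));
    }
}
```

Run as written, the first call prints an empty list and the second prints [PosTagger needs Token]. This is the kind of ordering mistake that is easy to make in a shell script or .bat pipeline and that a declared type system can catch before any text is processed.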