Unstructured Information Management Architecture (UIMA)

Graham Wilcock University of Helsinki A Quick Finnish Lesson

 Finnish = English swim

 Uimahalli: Swimming hall

 Uimakurssi: Swimming course

 Uimamestari: Swimming master Overview

 What is UIMA?

 A framework for NLP tasks and tools  Part-of-Speech Tagging  Full Parsing  Shallow Parsing  More about UIMA What is UIMA?

 An acronym: Unstructured Information Management Architecture

 A framework for NLP tasks and tools

 Originally from IBM (IBM UIMA)

 Now open source (Apache UIMA)

 What is a framework?

Overview

 What is UIMA?  Part-of-Speech Tagging  With OpenNLP  With OpenNLP and UIMA  Full Parsing  Shallow Parsing  More about UIMA What is OpenNLP?

 A set of NLP tools

 Open source Java

 University of Pennsylvania (Tom Morton)

 http://opennlp.sourceforge.net

 English, Spanish, German, Thai

 What is a toolkit? Annotation Tools

 OpenNLP sentence detector

 OpenNLP tokenizer

 OpenNLP POS tagger

 OpenNLP chunker

 OpenNLP parser

 OpenNLP name finder

 OpenNLP coreferencer Annotation Models

 OpenNLP English sentence detector model

 OpenNLP English tokenizer model

 OpenNLP English POS tagger model

 OpenNLP English chunker model

 OpenNLP English parser models

 OpenNLP English named entity models

 OpenNLP English coreferencer models POS Tagging with OpenNLP export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0 export CLASSPATH=.:\ $OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\ $OPENNLP_HOME/lib/maxent-2.4.0.jar:\ $OPENNLP_HOME/lib/trove.jar java opennlp.tools.lang.english.SentenceDetector \ $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz | java opennlp.tools.lang.english.Tokenizer \ $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz | java opennlp.tools.lang.english.PosTagger -d \ $OPENNLP_HOME/models/english/parser/tagdict \ $OPENNLP_HOME/models/english/parser/tag.bin.gz

OpenNLP POS Tagger Format

My/PRP$ mistress/NN '/POS eyes/NNS are/VBP nothing/NN like/IN the/DT sun/NN ,/, Coral/NNP is/VBZ far/RB more/RBR red/JJ than/IN her/PRP$ lips/NNS red/JJ ./. If/IN snow/NN be/VB white/JJ ,/, why/WRB then/RB her/PRP$ breasts/NNS are/VBP dun/VBN ,/, If/IN hairs/NNS be/VB wires/NNS ,/, black/JJ wires/NNS grow/VB on/IN her/PRP$ head/NN ./. POS Tagging with UIMA

 Run OpenNLP POS tagger in UIMA

 Also needs OpenNLP sentence detector and OpenNLP tokenizer

 Each tool needs a UIMA wrapper

 These wrappers provided by UIMA

 Tools run in primitive analysis engines

 Combined in aggregate analysis engine

OpenNLP Script (to compare) export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0 export CLASSPATH=.:\ $OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\ $OPENNLP_HOME/lib/maxent-2.4.0.jar:\ $OPENNLP_HOME/lib/trove.jar java opennlp.tools.lang.english.SentenceDetector \ $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz | java opennlp.tools.lang.english.Tokenizer \ $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz | java opennlp.tools.lang.english.PosTagger -d \ $OPENNLP_HOME/models/english/parser/tagdict \ $OPENNLP_HOME/models/english/parser/tag.bin.gz

Overview

 What is UIMA?  Part-of-Speech Tagging  Full Parsing  With OpenNLP  With OpenNLP and UIMA  Shallow Parsing  More about UIMA Full Parsing with OpenNLP export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0 export CLASSPATH=.:\ $OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\ $OPENNLP_HOME/lib/maxent-2.4.0.jar:\ $OPENNLP_HOME/lib/trove.jar java opennlp.tools.lang.english.SentenceDetector \ $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz | java opennlp.tools.lang.english.Tokenizer \ $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz | java opennlp.tools.lang.english.TreebankParser -d \ $OPENNLP_HOME/models/english/parser OpenNLP Parser Format

(TOP (S (S (NP (NP (PRP$ My) (NN mistress) (POS ')) (NNS eyes)) (VP (VBP are) (NP (NP (NN nothing)) (PP (IN like) (NP (DT the) (NN sun)))))) (, ,) (NP (NNP Coral)) (VP (VBZ is) (ADJP (ADJP (ADVP (RB far) (RBR more)) (JJ red)) (PP (IN than) (NP (NP (PRP$ her) (NNS lips)) (ADJP (JJ red)))))) (. .))) (TOP (SBARQ (SBAR (IN If) (S (NP (NN snow)) (VP (VB be) (ADJP (JJ white))))) (, ,) (WHADVP (WRB why)) (SQ (SBAR (RB then) (S (S (NP (PRP$ her) (NNS breasts)) (VP (VBP are) (VP (VBN dun)))) (, ,) (S (SBAR (IN If) (S (NP (NNS hairs)) (VP (VB be) (NP (NNS wires))))) (, ,) (NP (JJ black) (NNS wires)) (VP (VBP grow) (PP (IN on) (NP (PRP$ her) (NN head)))))))) (. .))) Full Parsing with UIMA

 Run OpenNLP parser in UIMA

 Also needs OpenNLP sentence detector and OpenNLP tokenizer

 Wrappers provided by UIMA

 Add parser to aggregate analysis engine

Toolkits vs. Frameworks

 Toolkit

 The tools support a defined API

 Your application calls the tools

 Framework

 Your components must support the API defined by the framework

 The framework calls your components

 Hollywood? ”Don’t call us, we’ll call you” Overview

 What is UIMA?  Part-of-Speech Tagging  Full Parsing  Shallow Parsing  Chunking with OpenNLP  Chunking with OpenNLP and UIMA  More about UIMA Chunking with OpenNLP

(Same CLASSPATH as tagging and parsing) java opennlp.tools.lang.english.SentenceDetector \ $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz | java opennlp.tools.lang.english.Tokenizer \ $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz | java opennlp.tools.lang.english.PosTagger -d \ $OPENNLP_HOME/models/english/parser/tagdict \ $OPENNLP_HOME/models/english/parser/tag.bin.gz | java opennlp.tools.lang.english.TreebankChunker \ $OPENNLP_HOME/models/english/chunker/EnglishChunk.bin.gz OpenNLP Chunker Format

 CoNLL-2000 ”IOB” format

 Inside chunk (I-NP, I-VP, I-PP)

 Outside chunk (O)

 Begin chunk (B-NP, B-VP, B-PP)

 Chunk labels attached to tokens OpenNLP Chunker Format

Token - posTag - chunkLabel a DT B-NP far RB I-NP more RBR I-NP pleasing JJ I-NP sound NN I-NP . . O Chunking with UIMA

 Run OpenNLP chunker in UIMA

 Also needs OpenNLP sentence detector, tokenizer and POS tagger

 Write a Java wrapper

 No wrapper provided for chunker

 Similar to wrapper for POS tagger

 Add chunker to aggregate analysis engine

Editing the Type System

 UIMA type systems

 Types, subtypes and inheritance

 Features appropriate for type

 OpenNLPExampleTypes.xml

 Edit Token type

 Has existing posTag feature

 Add new chunkLabel feature

Defining Capabilities

 ”Capabilities” of each component

 Specify input and output types

 Supports interoperability

 Enables check for correct types

 Important in pipeline of components

Colouring the Chunks

 UIMA Annotation Viewer

 All Token annotations are same colour

 Need different colours for NP, VP, PP

 Write a Chunk Marker annotator

 Read chunkLabel features (B-NP, I-NP)

 Write new NP, VP, PP annotations

 NP, VP, PP types already defined for parser

Overview

 What is UIMA?  Part-of-Speech Tagging  Full Parsing  Shallow Parsing  More about UIMA

 UIMA and standards

 UIMA and community Other UIMA Annotators

 UIMA is not just OpenNLP!

 Write your own annotators

 Regular expression annotators

 Email addresses, URLs, phone numbers

 Dictionary lookup annotators

 Use existing name lists (GATE gazetteers) UIMA and Standards

 OASIS

 Organization for the Advancement of Structured Information Standards

 OASIS UIMA Technical Committee UIMA and Standards: GUI

 GATE uses its own GUI

 WordFreak uses its own GUI

 UIMA uses Eclipse

 An existing, widely-used, open source GUI

 Eclipse Modeling Framework (EMF) UIMA and Standards: XML

 GATE uses its own XML format

 WordFreak uses its own XML format

 UIMA uses XMI

 XML Metadata Interchange

 OMG standard

 Modelled by EMF in Eclipse XML Metadata Interchange

 UIMA open source software

 Apache Software Foundation

 UIMA component repositories

 CMU (Carnegie-Mellon University)

 JULIE Lab (Jena University)

 PEAR format

 Processing Engine ARchive UIMA and IBM

 IBM supports Apache UIMA

 UIMA Innovation Awards

 IBM LRWB

 LanguageWare Resource Workbench

 Free download for evaluation

 Non- create annotators

 Install annotators in UIMA with PEAR UIMA and Community: RASP

 RASP Parser

 Robust Accurate Statistical Parsing

 Sussex and Cambridge universities  RASP4UIMA

 UIMA wrappers for RASP tools

 DigitalPebble (www.digitalpebble.com)

 Install RASP tools in UIMA with PEAR UIMA and Community: LuCas

 Apache Lucene

 Widely-used high-performance full-text indexing and search library  LuCas - Lucene CAS Indexer

 Stores UIMA CAS data in Lucene index

 Developed at JULIE Lab (Jena)

 Currently in UIMA sandbox

 Presentation at UIMA Workshop today Overview

 What is UIMA?  Part-of-Speech Tagging  Full Parsing  Shallow Parsing  More about UIMA