Unstructured Information Management Architecture (UIMA)
Graham Wilcock
University of Helsinki

A Quick Finnish Lesson
Finnish uima = English swim
Uimahalli: Swimming hall
Uimakurssi: Swimming course
Uimamestari: Swimming master

Overview
What is UIMA?
  A framework for NLP tasks and tools
Part-of-Speech Tagging
Full Parsing
Shallow Parsing
More about UIMA

What is UIMA?
An acronym: Unstructured Information Management Architecture
A framework for NLP tasks and tools
Originally from IBM (IBM UIMA)
Now open source (Apache UIMA)
What is a framework?
Overview
What is UIMA?
Part-of-Speech Tagging
  With OpenNLP
  With OpenNLP and UIMA
Full Parsing
Shallow Parsing
More about UIMA

What is OpenNLP?
A set of NLP tools
Open source Java
University of Pennsylvania (Tom Morton)
http://opennlp.sourceforge.net
English, Spanish, German, Thai
What is a toolkit?

Annotation Tools
OpenNLP sentence detector
OpenNLP tokenizer
OpenNLP POS tagger
OpenNLP chunker
OpenNLP parser
OpenNLP name finder
OpenNLP coreferencer

Annotation Models
OpenNLP English sentence detector model
OpenNLP English tokenizer model
OpenNLP English POS tagger model
OpenNLP English chunker model
OpenNLP English parser models
OpenNLP English named entity models
OpenNLP English coreferencer models

POS Tagging with OpenNLP
export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0
export CLASSPATH=.:\
$OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\
$OPENNLP_HOME/lib/maxent-2.4.0.jar:\
$OPENNLP_HOME/lib/trove.jar

java opennlp.tools.lang.english.SentenceDetector \
  $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz |
java opennlp.tools.lang.english.Tokenizer \
  $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.PosTagger -d \
  $OPENNLP_HOME/models/english/parser/tagdict \
  $OPENNLP_HOME/models/english/parser/tag.bin.gz

OpenNLP POS Tagger Format
My/PRP$ mistress/NN '/POS eyes/NNS are/VBP nothing/NN like/IN the/DT sun/NN ,/, Coral/NNP is/VBZ far/RB more/RBR red/JJ than/IN her/PRP$ lips/NNS red/JJ ./.
If/IN snow/NN be/VB white/JJ ,/, why/WRB then/RB her/PRP$ breasts/NNS are/VBP dun/VBN ,/, If/IN hairs/NNS be/VB wires/NNS ,/, black/JJ wires/NNS grow/VB on/IN her/PRP$ head/NN ./.

POS Tagging with UIMA
Run OpenNLP POS tagger in UIMA
Also needs OpenNLP sentence detector and OpenNLP tokenizer
Each tool needs a UIMA wrapper
These wrappers provided by UIMA
Tools run in primitive analysis engines
Combined in aggregate analysis engine
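
The relation between primitive and aggregate analysis engines can be sketched in plain Java. This is only an illustration of the composition idea: `Document`, `Engine` and `Aggregate` are hypothetical names invented for this sketch, not the UIMA API, and the three "tools" merely record that they ran instead of doing real analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch only: all names are hypothetical and are NOT the UIMA API.
class AggregateSketch {

    /** Shared analysis data that every engine reads from and writes to. */
    static class Document {
        final String text;
        final List<String> log = new ArrayList<>();   // which steps have run, in order

        Document(String text) {
            this.text = text;
        }
    }

    /** A primitive engine wraps one tool and performs one analysis step. */
    interface Engine {
        void process(Document doc);
    }

    /** An aggregate engine is itself an engine that delegates to its parts in order. */
    static class Aggregate implements Engine {
        private final List<Engine> parts;

        Aggregate(Engine... parts) {
            this.parts = Arrays.asList(parts);
        }

        @Override
        public void process(Document doc) {
            for (Engine part : parts) {
                part.process(doc);
            }
        }
    }

    /** Builds the three-step tagging pipeline and runs it on the given text. */
    static List<String> run(String text) {
        // Stand-ins for the wrapped tools: each one only records that it ran.
        Engine sentenceDetector = doc -> doc.log.add("sentences");
        Engine tokenizer = doc -> doc.log.add("tokens");
        Engine posTagger = doc -> doc.log.add("posTags");

        Engine pipeline = new Aggregate(sentenceDetector, tokenizer, posTagger);
        Document doc = new Document(text);
        pipeline.process(doc);
        return doc.log;
    }

    public static void main(String[] args) {
        System.out.println(run("If snow be white, why then her breasts are dun."));
    }
}
```

Because the aggregate implements the same interface as its parts, it could itself be placed inside a larger aggregate; the sketch leaves that out.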

OpenNLP Script (to compare)
export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0
export CLASSPATH=.:\
$OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\
$OPENNLP_HOME/lib/maxent-2.4.0.jar:\
$OPENNLP_HOME/lib/trove.jar

java opennlp.tools.lang.english.SentenceDetector \
  $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz |
java opennlp.tools.lang.english.Tokenizer \
  $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.PosTagger -d \
  $OPENNLP_HOME/models/english/parser/tagdict \
  $OPENNLP_HOME/models/english/parser/tag.bin.gz

Overview
What is UIMA?
Part-of-Speech Tagging
Full Parsing
  With OpenNLP
  With OpenNLP and UIMA
Shallow Parsing
More about UIMA

Full Parsing with OpenNLP
export OPENNLP_HOME=~gwilcock/Tools/opennlp-1.3.0
export CLASSPATH=.:\
$OPENNLP_HOME/lib/opennlp-tools-1.3.0.jar:\
$OPENNLP_HOME/lib/maxent-2.4.0.jar:\
$OPENNLP_HOME/lib/trove.jar

java opennlp.tools.lang.english.SentenceDetector \
  $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz |
java opennlp.tools.lang.english.Tokenizer \
  $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.TreebankParser -d \
  $OPENNLP_HOME/models/english/parser

OpenNLP Parser Format
(TOP (S (S (NP (NP (PRP$ My) (NN mistress) (POS ')) (NNS eyes)) (VP (VBP are) (NP (NP (NN nothing)) (PP (IN like) (NP (DT the) (NN sun)))))) (, ,) (NP (NNP Coral)) (VP (VBZ is) (ADJP (ADJP (ADVP (RB far) (RBR more)) (JJ red)) (PP (IN than) (NP (NP (PRP$ her) (NNS lips)) (ADJP (JJ red)))))) (. .)))
(TOP (SBARQ (SBAR (IN If) (S (NP (NN snow)) (VP (VB be) (ADJP (JJ white))))) (, ,) (WHADVP (WRB why)) (SQ (SBAR (RB then) (S (S (NP (PRP$ her) (NNS breasts)) (VP (VBP are) (VP (VBN dun)))) (, ,) (S (SBAR (IN If) (S (NP (NNS hairs)) (VP (VB be) (NP (NNS wires))))) (, ,) (NP (JJ black) (NNS wires)) (VP (VBP grow) (PP (IN on) (NP (PRP$ her) (NN head)))))))) (. .)))

Full Parsing with UIMA
Run OpenNLP parser in UIMA
Also needs OpenNLP sentence detector and OpenNLP tokenizer
Wrappers provided by UIMA
Add parser to aggregate analysis engine
Toolkits vs. Frameworks
Toolkit
The tools support a defined API
Your application calls the tools
Framework
Your components must support the API defined by the framework
The framework calls your components
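
The difference in control flow can be sketched in a few lines of Java. All names here (`WhitespaceTokenizer`, `Component`, `MiniFramework`) are hypothetical and chosen for illustration; they are neither the OpenNLP nor the UIMA API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is neither the OpenNLP nor the UIMA API.
class ToolkitVsFramework {

    /** Toolkit style: a tool exposes an API and waits to be called. */
    static class WhitespaceTokenizer {
        String[] tokenize(String text) {
            return text.trim().split("\\s+");
        }
    }

    /** Framework style: the framework defines the interface components must implement. */
    interface Component {
        void process(List<String> results, String text);
    }

    /** The framework owns the control flow and decides when each component runs. */
    static class MiniFramework {
        private final List<Component> components = new ArrayList<>();

        void register(Component component) {
            components.add(component);
        }

        List<String> run(String text) {
            List<String> results = new ArrayList<>();
            for (Component component : components) {
                component.process(results, text);   // the framework calls the component
            }
            return results;
        }
    }

    /** Toolkit usage: the application calls the tool directly. */
    static int countTokensToolkitStyle(String text) {
        return new WhitespaceTokenizer().tokenize(text).length;
    }

    /** Framework usage: the application registers a component and hands over control. */
    static List<String> runFrameworkStyle(String text) {
        MiniFramework framework = new MiniFramework();
        framework.register((results, t) ->
                results.add("tokens=" + new WhitespaceTokenizer().tokenize(t).length));
        return framework.run(text);
    }

    public static void main(String[] args) {
        System.out.println(countTokensToolkitStyle("If snow be white"));
        System.out.println(runFrameworkStyle("If snow be white"));
    }
}
```

In the toolkit case the application decides when tokenization happens; in the framework case the same tool is wrapped in a component and the framework decides.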
Hollywood? “Don’t call us, we’ll call you”

Overview
What is UIMA?
Part-of-Speech Tagging
Full Parsing
Shallow Parsing
  Chunking with OpenNLP
  Chunking with OpenNLP and UIMA
More about UIMA

Chunking with OpenNLP
(Same CLASSPATH as tagging and parsing)
java opennlp.tools.lang.english.SentenceDetector \
  $OPENNLP_HOME/models/english/sentdetect/EnglishSD.bin.gz |
java opennlp.tools.lang.english.Tokenizer \
  $OPENNLP_HOME/models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.PosTagger -d \
  $OPENNLP_HOME/models/english/parser/tagdict \
  $OPENNLP_HOME/models/english/parser/tag.bin.gz |
java opennlp.tools.lang.english.TreebankChunker \
  $OPENNLP_HOME/models/english/chunker/EnglishChunk.bin.gz

OpenNLP Chunker Format
CoNLL-2000 “IOB” format
Inside chunk (I-NP, I-VP, I-PP)
Outside chunk (O)
Begin chunk (B-NP, B-VP, B-PP)
Chunk labels attached to tokens

OpenNLP Chunker Format
Token - posTag - chunkLabel
a DT B-NP
far RB I-NP
more RBR I-NP
pleasing JJ I-NP
sound NN I-NP
. . O

Chunking with UIMA
Run OpenNLP chunker in UIMA
Also needs OpenNLP sentence detector, tokenizer and POS tagger
Write a Java wrapper
No wrapper provided for chunker
Similar to wrapper for POS tagger
Add chunker to aggregate analysis engine
Editing the Type System
UIMA type systems
Types, subtypes and inheritance
Features appropriate for type
OpenNLPExampleTypes.xml
Edit Token type
Has existing posTag feature
Add new chunkLabel feature
Defining Capabilities
“Capabilities” of each component
Specify input and output types
Supports interoperability
Enables check for correct types
Important in pipeline of components
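
The kind of check that declared capabilities make possible can be sketched as follows. This is a plain-Java illustration under assumptions: the `Component` class, the `check` method and the type labels such as `Token:posTag` are hypothetical stand-ins, not UIMA descriptor syntax or UIMA classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: class names and type labels are hypothetical, not UIMA's own.
class CapabilityCheck {

    /** A component together with the types it declares as input and output. */
    static class Component {
        final String name;
        final List<String> inputs;
        final List<String> outputs;

        Component(String name, List<String> inputs, List<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    static final Component SENTENCE_DETECTOR = new Component("SentenceDetector",
            Collections.<String>emptyList(), Arrays.asList("Sentence"));
    static final Component TOKENIZER = new Component("Tokenizer",
            Arrays.asList("Sentence"), Arrays.asList("Token"));
    static final Component POS_TAGGER = new Component("PosTagger",
            Arrays.asList("Sentence", "Token"), Arrays.asList("Token:posTag"));
    static final Component CHUNKER = new Component("Chunker",
            Arrays.asList("Token", "Token:posTag"), Arrays.asList("Token:chunkLabel"));

    /** Returns one message per declared input that no earlier component produces. */
    static List<String> check(List<Component> pipeline) {
        Set<String> available = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (Component component : pipeline) {
            for (String needed : component.inputs) {
                if (!available.contains(needed)) {
                    problems.add(component.name + " needs " + needed);
                }
            }
            available.addAll(component.outputs);
        }
        return problems;
    }

    static List<Component> fullPipeline() {
        return Arrays.asList(SENTENCE_DETECTOR, TOKENIZER, POS_TAGGER, CHUNKER);
    }

    /** A misconfigured pipeline: the chunker runs without a POS tagger before it. */
    static List<Component> pipelineWithoutTagger() {
        return Arrays.asList(SENTENCE_DETECTOR, TOKENIZER, CHUNKER);
    }

    public static void main(String[] args) {
        System.out.println(check(fullPipeline()));
        System.out.println(check(pipelineWithoutTagger()));
    }
}
```

The check only compares declarations; it cannot tell whether a component really produces what it declares.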
Colouring the Chunks
UIMA Annotation Viewer
All Token annotations are same colour
Need different colours for NP, VP, PP
Write a Chunk Marker annotator
Read chunkLabel features (B-NP, I-NP)
Write new NP, VP, PP annotations
NP, VP, PP types already defined for parser
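
The core of such a Chunk Marker, turning per-token IOB labels into chunk spans, might look roughly like this. It is a plain-Java sketch under assumptions: `ChunkMarkerSketch`, `Chunk` and `mark` are hypothetical names, and spans are given as token indices, whereas a real annotator would read the chunkLabel feature from Token annotations and create NP, VP and PP annotations with character offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch; names are hypothetical and this is not the UIMA annotator API.
class ChunkMarkerSketch {

    /** A chunk covering tokens firstToken..lastToken (inclusive token indices). */
    static class Chunk {
        final String type;
        final int firstToken;
        final int lastToken;

        Chunk(String type, int firstToken, int lastToken) {
            this.type = type;
            this.firstToken = firstToken;
            this.lastToken = lastToken;
        }

        @Override
        public String toString() {
            return type + "[" + firstToken + ".." + lastToken + "]";
        }
    }

    /** Groups IOB labels (B-NP, I-NP, O, ...) into typed chunk spans. */
    static List<Chunk> mark(String[] chunkLabels) {
        List<Chunk> chunks = new ArrayList<>();
        String openType = null;   // type of the chunk currently being built, if any
        int openStart = -1;       // index of its first token

        for (int i = 0; i < chunkLabels.length; i++) {
            String label = chunkLabels[i];
            boolean continuesOpenChunk = openType != null
                    && label.startsWith("I-")
                    && label.substring(2).equals(openType);
            if (continuesOpenChunk) {
                continue;
            }
            // Any open chunk ends just before this token.
            if (openType != null) {
                chunks.add(new Chunk(openType, openStart, i - 1));
                openType = null;
            }
            // B-X starts a chunk; an I-X with no matching open chunk is treated as a start too.
            if (label.startsWith("B-") || label.startsWith("I-")) {
                openType = label.substring(2);
                openStart = i;
            }
        }
        if (openType != null) {
            chunks.add(new Chunk(openType, openStart, chunkLabels.length - 1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Labels for: a far more pleasing sound .
        String[] labels = {"B-NP", "I-NP", "I-NP", "I-NP", "I-NP", "O"};
        System.out.println(mark(labels));
    }
}
```

Each resulting chunk carries its own type, which is what allows a viewer to colour NP, VP and PP differently.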
Overview
What is UIMA?
Part-of-Speech Tagging
Full Parsing
Shallow Parsing
More about UIMA
  UIMA and standards
  UIMA and community

Other UIMA Annotators
UIMA is not just OpenNLP!
Write your own annotators
Regular expression annotators
Email addresses, URLs, phone numbers
Dictionary lookup annotators
Use existing name lists (GATE gazetteers)

UIMA and Standards
OASIS
Organization for the Advancement of Structured Information Standards
OASIS UIMA Technical Committee

UIMA and Standards: GUI
GATE uses its own GUI
WordFreak uses its own GUI
UIMA uses Eclipse
An existing, widely-used, open source GUI
Eclipse Modeling Framework (EMF)

UIMA and Standards: XML
GATE uses its own XML format
WordFreak uses its own XML format
UIMA uses XMI
XML Metadata Interchange
OMG standard
Modelled by EMF in Eclipse

XML Metadata Interchange
UIMA open source software
Apache Software Foundation
UIMA component repositories
CMU (Carnegie Mellon University)
JULIE Lab (Jena University)
PEAR format
Processing Engine ARchive

UIMA and IBM
IBM supports Apache UIMA
UIMA Innovation Awards
IBM LRWB
LanguageWare Resource Workbench
Free download for evaluation
Non-programmers create annotators
Install annotators in UIMA with PEAR

UIMA and Community: RASP
RASP Parser
Robust Accurate Statistical Parsing
Sussex and Cambridge universities
RASP4UIMA
UIMA wrappers for RASP tools
DigitalPebble (www.digitalpebble.com)
Install RASP tools in UIMA with PEAR

UIMA and Community: LuCas
Apache Lucene
Widely-used high-performance full-text indexing and search library
LuCas - Lucene CAS Indexer
Stores UIMA CAS data in Lucene index
Developed at JULIE Lab (Jena)
Currently in UIMA sandbox
Presentation at UIMA Workshop today

Overview
What is UIMA?
Part-of-Speech Tagging
Full Parsing
Shallow Parsing
More about UIMA