Argo: a platform for interoperable and customisable

Sophia Ananiadou
National Centre for Text Mining, School of Computer Science

Overview

• Sharing tools, resources and text mining workflows

• Challenges

• Interoperable infrastructure for processing and annotation

Ananiadou Open AIRE-COAR Conference

NaCTeM

• 1st publicly funded national text mining centre (www.nactem.ac.uk)
• Location: Manchester Institute of Biotechnology
• Phase I - Biology (2004-2008)
• Phase II - Biology, Medicine, Social Sciences (2008-2011)
• Phase III - Biology, Medicine, Humanities, Social Sciences; fully sustainable centre (2011- )

Challenges

• Text types: newswire, scientific literature, full papers/abstracts, Twitter, patents, clinical records (EMR), textbooks, monographs, online forums, …
• Domains: finance/business, health, biology, social sciences, humanities, …
• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
• Tasks: translation, information extraction, semantic search, question answering, sentiment analysis, summarization, knowledge discovery, …
• Technology: sentence splitter, paragraph splitter, NP chunkers, C-parser, D-parser, semantic parser, NE recognizers, relation recognizers, …
• TM modules and TM workflows: shared!
• Language Technology: diversity of languages, diversity of contexts, diversity of applications

Metadata

• Text types: newswire, scientific literature, full papers/abstracts, Twitter, patents, clinical records (EMR), textbooks, monographs, online forums, …
• Domains: finance/business, health, biology, social sciences, humanities, …
• Resource-rich languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
• Tasks: translation, information extraction, semantic search, question answering, sentiment analysis, summarization, knowledge discovery, …
• Language Technology
• Big Text, Big Ontology, Big Data
• Linguistic resources, knowledge resources
• Cloud computing, crowd sourcing
• OPEN SCIENCE

Requirements from TM infrastructure

• Modularity of TM modules

• Interoperability among TM modules and resources

• Generic across different languages, domains, and text types – Adaptability

Interoperability and Adaptability

Figure labels: Dependency Parser, POS Tagger, Named Entity (modules); Dictionaries, Ontologies, (Annotated) Text, Rule Writing (resources in resource-rich TM INFRASTRUCTURES!); Adaptation across languages (Greek, English, French, German, Japanese), text types and domains.

Example: extracting proteins, annotations

• Annotated corpora: GENIA, PennBioIE, AIMed, GENETAG
• Problem: inconsistency and incompatibility of type definitions and texts

The problem with incompatibility

• Difficult to evaluate NERs

• Why are results so different among different corpora and NERs?
• Which NER is best for my task?

          Corpus C              Corpus D
NER A     93%                   63%
NER B     36%                   90%
Verdict   A is better than B.   B is better than A.
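The ranking flip above can arise when corpora follow different annotation conventions. The following is a minimal sketch, assuming two hypothetical gold standards for the same sentence that disagree on whether the head noun "protein" belongs to the mention. The sentence, spans and tagger outputs are invented for illustration and are not taken from GENIA, PennBioIE, AIMed or GENETAG.

```python
def f1(predicted, gold):
    """Exact-match F1 between two sets of (start, end) character spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "The p53 protein binds the MDM2 protein."

# Hypothetical convention of corpus C: the mention excludes the head noun.
gold_c = {(4, 7), (26, 30)}      # "p53", "MDM2"
# Hypothetical convention of corpus D: the mention includes "protein".
gold_d = {(4, 15), (26, 38)}     # "p53 protein", "MDM2 protein"

# Two hypothetical taggers, each trained under one of the conventions.
ner_a = {(4, 7), (26, 30)}
ner_b = {(4, 15), (26, 38)}

for name, gold in (("Corpus C", gold_c), ("Corpus D", gold_d)):
    print(name, "A:", f1(ner_a, gold), "B:", f1(ner_b, gold))
```

Neither tagger is wrong about where the proteins are; each is simply penalised by the other corpus's boundary definition, which is why scores obtained on different corpora are hard to compare.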

Text mining workflows

• A pipeline that executes particular tools and resources in order
• Example: semantic search
  – Components: PoS tagger, chunking, parsing, dictionary lookup, NE extraction, semantic query
• Various versions (language- or domain-specific) of basic components are needed for different applications and tasks
• Different workflows can be created, compared and evaluated by the ability to seamlessly “mix and match” various versions of components
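The "mix and match" idea can be sketched in a few lines of Python. This is an illustrative toy, not how Argo or U-Compare implement workflows: the component functions, the dictionary-based document record and the two-word lexicon are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List

# A component reads and extends a shared document record (a plain dict here).
Component = Callable[[Dict], Dict]


def whitespace_tokenizer(doc: Dict) -> Dict:
    """Tokenizer version 1: split on whitespace only."""
    doc["tokens"] = doc["text"].split()
    return doc


def punctuation_tokenizer(doc: Dict) -> Dict:
    """Tokenizer version 2: also separate punctuation from words."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def dictionary_lookup(doc: Dict) -> Dict:
    """Mark tokens found in a (hypothetical) two-entry dictionary."""
    lexicon = {"p53", "MDM2"}
    doc["entities"] = [token for token in doc["tokens"] if token in lexicon]
    return doc


def run_workflow(components: List[Component], text: str) -> Dict:
    """Execute the components in order on a fresh document record."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


# Two workflows that differ only in the tokenizer version.
workflow_a = [whitespace_tokenizer, dictionary_lookup]
workflow_b = [punctuation_tokenizer, dictionary_lookup]

result_a = run_workflow(workflow_a, "p53 binds MDM2.")
result_b = run_workflow(workflow_b, "p53 binds MDM2.")
print(result_a["entities"], result_b["entities"])
```

Swapping a single component changes the downstream result (workflow A misses "MDM2" because the full stop stays attached to the token), which is the kind of difference that side-by-side workflow comparison is meant to expose. The swap only works because both tokenizers write to the same "tokens" field, i.e. they share a data representation.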

Text mining workflows

Interoperability: common data representation and types

Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.

Common Type System

• A common type system is required for complete interoperability
  – A single common type system is almost impossible to impose on all developers
• Solution: maintain local type systems and bridge them via a sharable type system

Figure: Local Type System A <- bridging -> U-Compare Sharable Type System <- bridging -> Local Type System B
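The bridging idea can be illustrated with a small sketch. It assumes, for simplicity, that a bridge is a flat mapping from local type names to shared type names; all type names below are hypothetical, and a real UIMA type system is hierarchical and richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    begin: int
    end: int
    type_name: str


# Hypothetical bridges from two local type systems to a shared one.
LOCAL_A_TO_SHARED = {
    "a.Protein": "shared.NamedEntity",
    "a.Tok": "shared.Token",
}
LOCAL_B_TO_SHARED = {
    "b.GeneOrProtein": "shared.NamedEntity",
    "b.Word": "shared.Token",
}


def to_shared(annotation: Annotation, bridge: dict) -> Annotation:
    """Re-type a local annotation via a bridge; unknown types raise KeyError."""
    return Annotation(annotation.begin, annotation.end, bridge[annotation.type_name])


# The same text span, annotated by tools from two different developers.
from_a = to_shared(Annotation(4, 7, "a.Protein"), LOCAL_A_TO_SHARED)
from_b = to_shared(Annotation(4, 7, "b.GeneOrProtein"), LOCAL_B_TO_SHARED)
print(from_a == from_b)
```

Each developer keeps their own types and maintains one bridge to the shared system, so N tools need N bridges rather than a converter for every pair of tools.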

Figure: U-Compare Type System, with types grouped into document level, syntactic level and semantic level.

U-Compare: Evaluate and Compare TM Workflows

Figure: components from the library (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) are combined into Workflow A, Workflow B and Workflow C, which are compared by F-Score A, F-Score B and F-Score C.

Figure labels (example components): UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer, GENIA Tagger, Stepp Tagger, OpenNLP, ABNER, MedT-NER, GENIA Tagger as NER, UIMA SD, OpenNLP SD, GENIA SD

• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA

• http://argo.nactem.ac.uk
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing

Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.

Figure labels: Processing Components, Workflow Diagramming, UIMA Compliance, Remote Processing, Manual Editing, Structured Data; users: Workflow Designer, Developers, Annotator/Curator

Processing Components

• Approaching 100 components (U-Compare)
  – Additional 50 will be added soon
• META-NET
• Developed or co-developed by NaCTeM
  – Planned: make the library open to others to contribute
• Generic Listener component
  – Developers can plug their own locally run UIMA component into a workflow in Argo

Remote Processing

• Single machine execution
  – In-house high-performance machines
• Distributed processing
  – HTCondor
  – VMware vCloud (EBI) EUPMC
  – Planned: EC2, Azure, …

Workflows

• Users create workflows as block diagrams
• Workflows can be shared among users
  – Read only
  – Planned: read & write
  – Planned: downloadable workflows
• Workflows can be deployed as web services
  – Plain text (input only), XMI, RDF, BioC

Workflows view

Workflow Editor

Sample Use Cases

1. Recognition of chemical entities (chemical NER)
2. Semi-automatic curation of metabolic pathways
3. Evaluation of inter-annotator agreement
4. Information extraction as a Web service

Use Case 1: Chemical NER

• Supplies the gold standard corpus
• Removes gold standard annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus

Chemical Entity Recogniser

• Chemical model evaluated at the BioCreative IV CHEMDNER challenge
• The challenge
  – Data: 10,000 manually annotated PubMed abstracts
  – Task: automatically recognise names of chemical entities in text

Chemical Entity Recogniser

• Our solution
  – Ranked unique mentions: ranked 1st out of 18 groups
  – All mentions: ranked 3rd out of 19 groups

Subtask                  Precision %   Recall %   F-score %
Ranked unique mentions   91            85         88
All mentions             93            81         87
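As a quick consistency check on the table, the sketch below recomputes the F-score from the reported precision and recall. It assumes the reported F-score is the balanced F1, the harmonic mean of precision and recall. The inputs are the rounded percentages from the table, so the recomputed values are only approximate.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %) as given in the table.
reported = {
    "Ranked unique mentions": (91, 85, 88),
    "All mentions": (93, 81, 87),
}

for subtask, (precision, recall, reported_f) in reported.items():
    computed = f_score(precision, recall)
    print(f"{subtask}: computed {computed:.1f}, reported {reported_f}")
```

Both recomputed values round to the reported figures, so the three columns are mutually consistent under the F1 assumption.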

Use Case 2: Semi-automatic Curation – Metabolic Pathways

• Search for relevant documents
• NER for chemicals, genes, process indicators
• Linking to ontologies: CTD, ChEBI, UniProt
• Manual correction of automatic annotations
• Save results in various formats, e.g., RDF for querying and incorporation into databases

Manual Annotation Editor

• Create, modify or delete annotations
• Edit details of annotations
• Open a graphical interface to link annotations to ontologies
• Create new annotations by selecting text

Filtering and converting annotations

Manual Annotation Editor: linking to ontologies

• Automatic pre-selection can be modified by the user
• Details show the ontology entry webpage

Use Case 3: Information extraction as a Web service

• Web service-enabled reader
• Web service-enabled writer

Language Universal

• Reusable modules
• Generic TM modules: Competence
• Annotated text, corpora: Performance

• Standards of data representation and types for resources: Competence
• Dictionaries, thesauri, ontologies: Performance
