Structuring Medical Records with Apache Stanbol

Rafa Haro, Senior Software Engineer, Athento

• Committer, PMC Member @ Apache Stanbol, Apache ManifoldCF

• Topics: Document Analysis, NLP, Machine Learning, Semantic Technologies, ECM

Antonio Pérez Morales, Senior Software Engineer, Ixxus

• Committer @ Apache Stanbol, Apache ManifoldCF

• Topics: ECM, Semantic Search, ETL, Machine Learning

Apache Stanbol provides a set of reusable components for semantic content management. It extends existing CMSs with a number of semantic services.

[Figure: Traditional Semantic CMS. Software Architecture for Semantically Enabled CM and ECM systems]

Apache Stanbol Story

• Started within the FP7 European Project IKS (Interactive Knowledge Stack, 2009-2012)

• IKS project brought together an Open Source Community for Defining and Building Platforms in the Semantic CMS Space

• Incubated in November 2010

• Successfully promoted within CMS and ECM industry through IKS Early Adopters Program

• Graduated to Top-Level Apache Project in October 2012

What is a Semantic CMS?

Traditional CMS                               Semantic CMS
Atomic Unit: Document                         Atomic Unit: Entity
Properties as meta-data (key-value schemas)   Semantic meta-data (RDF)
Keyword Search                                Semantic Search
Document Management                           Knowledge Management
Document Types                                Entity Management
Document Workflow                             Ontologies

Source: What Apache Stanbol Can Do for You? Fabian Christ. ApacheCon Europe 2012

Key Points

• Designed to bring Semantic Technologies to existing CMS

• Non-intrusive set of RESTful ‘Semantic’ Services

• Extremely Modular: Use only the modules you need

• Main Features:
  • Multilingual Content Enhancement: Structure Content through Semantic Metadata
  • Knowledge Bases Management
  • Knowledge Models and Reasoning
  • Semantic Indexing and Search

Stanbol Components

• Stanbol components provide:
  • RESTful API
  • Java APIs and OSGi services

• Stanbol components do NOT depend on each other

• However, they can be easily combined

[Figure: Service-Oriented View. VIE - User Interface Layer (VIE Widgets); Apache Stanbol Service Layer (Enhancer, EntityHub, Ontology Manager, Reasoners, Rules, ContentHub, FactStore); Apache Stanbol Component Layer (Stanbol Enhancement Engines, CMS Adapter). Source: www.iks-project.eu, Copyright IKS Consortium]

Stanbol Components (II)

• Enhancer: Extracts Knowledge from unstructured parsed content

• EntityHub: Manage Domain Entities and Topics (Knowledge Bases)

• ContentHub: Semantic Indexing / Search over your - semantic enhanced - Content

• CMS Adapter: Sync. your CMS with Apache Stanbol (JCR/CMIS)

• Ontology Manager: Manage your formal Domain Knowledge

• Reasoners & Rules: Apply Domain Knowledge to improve / validate extracted Information. Refactor / refine knowledge to align it to public schemas such as schema.org

Built on Top of Apache…

• […] as OSGi environment
• […] launchers and OSGi Tools
• […] for building
• Apache Clerezza as RDF Framework
• […] as TripleStore
• […] for Knowledge Bases Management
• […] for converting input
• Apache OpenNLP for NLP Processing

Integration Scenarios

• Stand-Alone Server (Stanbol Launchers)

• Web Application (Servlet-Container)

• Embedded within an OSGi environment

Source: What Apache Stanbol Can Do for You? Fabian Christ. ApacheCon Europe 2012

Project Current Status

[Figure: project timeline]

• Apache Stanbol Incubation (Nov 2010)
• Apache Stanbol 0.9.0-incubating (Aug 2012)
• Apache Stanbol Graduation (October 2012)
• IKS Project Ending (Dec 2012)
• 0.12.0 (March 2014)
• 1.0.0 (October 2016)

[Figure: Contributions (commits) to Trunk Since Incubation]

Project Current Status (II)

• 22 PMC Members (Last Addition Jul 2016)

• 26 Committers (Last Addition May 2015)

• 3-5 active committers in the last 2 years

• [email protected]: 228 subscribers

• Activity has been gradually decreasing

• 3 major releases

Source: Apache Stanbol Committee Report Helper (https://reporter.apache.org/?stanbol)

Stanbol Enhancer

[Figures: Stanbol Enhancer diagrams, slides (I) to (III); the only surviving label is "RDF"]

Stanbol Enhancement Chains

• Define how Content is processed by the Enhancer through an ExecutionPlan

• Different Implementations:
  • ListChain: engines are executed sequentially, in the configured order. Parallel execution of engines is not supported
  • WeightedChain: the ExecutionPlan is calculated using the engines' order metadata. Parallel execution of engines is allowed

• API:
  • /enhancer: executes the default chain
  • /enhancer/chain/{chain-name}: executes a concrete named chain
  • /enhancer/engine/{engine-name}: executes a concrete named engine

Current Enhancement Engines

• Preprocessing
  • Tika Engine
    • content type detection
    • text extraction from several document formats
    • metadata extraction from several document formats

• Natural Language Processing
  • Language Detection (different implementations)
  • Sentence Detection (OpenNLP, SmartCN, REST)
  • Tokenizer (OpenNLP, SmartCN, REST)
  • POS Tagging (OpenNLP, REST)
  • Chunking (OpenNLP, REST)
  • NER (OpenNLP, OpenCalais, REST)

• Entity Linking
  • Named Entity Linking
  • EntityHub Linking Engine
  • FST (Lucene Finite State Transducer) Linking Engine
  • Entity Co-mention
  • Commercial Engines (OpenCalais, Zemanta, CELI…)

• Sentiment Analysis

• Disambiguation
  • DBPedia Spotlight
  • Solr MLT based

• PostProcessing:
  • Dereferencing

Stanbol EntityHub

Stanbol EntityHub (II)

• Manage Multiple Entity Sources (Knowledge Bases)

• Allows Fast Entity-Lookup using Apache Solr

• Referenced Site (Remote LD + Local Caches) vs. Managed Site (Entity CRUD API over manually configured Sites)

• API:
  • Query for Entities (used by Entity Linking Engines)

curl -X POST -d "name=lyon&limit=10" \
     http://localhost:8080/entityhub/site/dbpedia/find
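The same lookup can be issued from Java. The following is a minimal sketch, assuming Java 11+ and the endpoint and parameters shown in the curl call; the class and method names (EntityHubFindRequest, formBody, build) are hypothetical helpers, not part of Stanbol. It only builds the request, so it runs without a Stanbol server.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not a Stanbol class): builds the POST request that the curl call sends.
final class EntityHubFindRequest {

    /** Encodes parameters as an application/x-www-form-urlencoded body (what curl -d sends). */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds, but does not send, the request for the /find endpoint of a site. */
    static HttpRequest build(String baseUrl, String site, String name, int limit) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", name);
        params.put("limit", Integer.toString(limit));
        return HttpRequest.newBuilder(URI.create(baseUrl + "/entityhub/site/" + site + "/find"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8080", "dbpedia", "lyon", 10);
        System.out.println(request.method() + " " + request.uri());
        // To execute it against a running Stanbol instance:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

URL-encoding the parameters matters for entity names containing spaces or accented characters, which are common in Spanish and Galician labels.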

  • CRUD for Managed Sites
  • LDPath support for:
    • Graph Path Retrieval (Used for dereferencing), e.g. friend-names = foaf:knows/foaf:name
    • Schema Translation, e.g. schema:name = rdfs:label[@en];
    • Simple Reasoning

Use Case: Hexin Project - Structuring Medical Records

• R&D Project for Sergas (Galician Public Health Office)

• Clinical Data Analysis Platform for supporting:
  • Clinical Assistance
  • Epidemiology studies
  • Medical Research

• Big Data approach for analyzing both structured historical clinical data and unstructured medical records

• Medical Records are written in Spanish and Galician

Hexin: Architecture

[Figure: Hexin architecture. Labels shown: Data Source, ETL, BIG DATA (HDFS + HIVE), Cassandra, BI, URX, Rules, Structured Events, Semantic Events, Event Detection, New Case Process, Reference Cases Detection Process, Patient Validation, Analysis. Example event: PatientId, Date, Symptoms: Cough, Unrest, Fever>38]

Hexin: Semantic Tagging

Hexin: Objective

“Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD” (English: "Patient diabetic since the age of 5 and with moderate COPD, GOLD grade 2")

Hexin: Solution Design

• Structure Medical Records using Apache Stanbol Enhancer

• Custom Ontology:
  • Symptoms
  • Diseases
  • Diagnosis Tests
  • Family and Personal History

• Custom Enhancement Chain:
  • Language Detection > NLP > Entity Linking > Negation Detection > Fact Extraction

Hexin: Ontology

Hexin: Ontology Indexing

• To support the Entity Linking process against the Hexin Ontology, an EntityHub site must be created

• 2 options:
  • ManagedSite: full CRUD storage <-> DYNAMIC
  • ReferencedSite: READ-ONLY remote site + local index

• Stanbol EntityHub Indexing Tool:
  • RDF -> JenaTDB -> Solr Index
  • Configure Custom Namespaces, Mappings and Properties, e.g. hexin:* and hexin:label > rdfs:label
  • Generates an OSGi Bundle with the Yard and YardSite default configurations
  • Copy the index to the Stanbol /datafiles folder and install the bundle using the Apache Felix OSGi Web Console

Hexin: Enhancement Chain

[Figure: chain diagram. Engines shown: Lang. Detect., OpenNLP-Sent., OpenNLP-Token, OpenNLP-POS, OpenNLP-Chunker, Hexin Linking, Fact Extract., Negex]
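A minimal sketch of what sequential chain execution means, assuming the engines run one after another in the order given on the Solution Design slide (Language Detection > NLP > Entity Linking > Negation Detection > Fact Extraction). The Engine interface and the map-based content item below are hypothetical stand-ins for illustration; they are not Stanbol's EnhancementEngine or ContentItem APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SequentialChainSketch {

    /** Hypothetical stand-in for an enhancement engine; NOT Stanbol's EnhancementEngine. */
    interface Engine {
        String name();
        void enhance(Map<String, Object> item);
    }

    /** Creates a dummy engine that only records its own name under the "trace" key. */
    static Engine tracing(String name) {
        return new Engine() {
            @Override
            public String name() {
                return name;
            }

            @Override
            @SuppressWarnings("unchecked")
            public void enhance(Map<String, Object> item) {
                ((List<String>) item.computeIfAbsent("trace", k -> new ArrayList<String>())).add(name);
            }
        };
    }

    /** Engine labels taken from the slide; the ordering is an assumption (see lead-in). */
    static List<Engine> hexinChain() {
        String[] names = {"Lang. Detect.", "OpenNLP-Sent.", "OpenNLP-Token", "OpenNLP-POS",
                "OpenNLP-Chunker", "Hexin Linking", "Negex", "Fact Extract."};
        List<Engine> chain = new ArrayList<>();
        for (String name : names) {
            chain.add(tracing(name));
        }
        return chain;
    }

    /** Runs every engine in list order on the same item, so later engines see earlier results. */
    static Map<String, Object> run(List<Engine> chain, String text) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("text", text);
        for (Engine engine : chain) {
            engine.enhance(item);
        }
        return item;
    }

    public static void main(String[] args) {
        System.out.println(run(hexinChain(), "Paciente diabético desde los 5 años").get("trace"));
    }
}
```

The ordering constraint is the point of the sketch: negation detection and fact extraction both need the entity annotations produced by the linking step, which in turn needs the NLP annotations.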

Custom Hexin Engine. Implemented for the project

Entity Linking Engine. Available in Stanbol with a Custom Configuration for this use case

NLP Engines. Available in Stanbol. Default Configuration

Pre-Processing Engine. Available in Stanbol

[Figures: Hexin: Linking and Hexin: Linking (II)]

Hexin: Custom Engines

@Component
@Service
public class MyEngine implements EnhancementEngine {

    @Activate
    public void activate(ComponentContext c) {
        // initialize, configure, ...
    }

    public int canEnhance(ContentItem item) {
        if (...item matches our expectations...) {
            return ENHANCE_SYNCHRONOUS;
        } else {
            return CANNOT_ENHANCE;
        }
    }

    public void computeEnhancements(ContentItem item) {
        // run the engine and add results to item's
        // RDF graph based on the item's InputStream
    }
}

Build and deployment notes from the slide diagram:

• Maven build produces an OSGi bundle
• maven-bundle-plugin adds OSGi metadata (MANIFEST.MF)
• maven-scr-plugin adds services metadata
• MyEngine Service is registered by OSGi
• Install in Stanbol, no restart needed

NLP at Apache Stanbol

NLP at Apache Stanbol (II)

• Browsable Map with Spans

• Spans sorted by Natural Order

• Iterator based API that allows concurrent Modifications

• Annotations supported at Spans Level

• Span Types:
  • Token
  • Chunk
  • Sentence
  • Text Section
  • Analyzed Text

• POS Annotation
  • PosTag: tag (e.g. NE), lexical category (e.g. Noun)

• Phrase Annotation (chunks)
  • PhraseTag: tag (e.g. NP), lexical-category (e.g. NounPhrase)

• Sentiment Annotation
  • SentimentTag: Double

[Figure: sample sentence "Stanbol is an Amazing Tool" annotated with Token, Chunk and Sentence spans]

Hexin Custom Engine: Negex

• Context/Negex: Algorithm for Negation Detection

• Based on Trigger Terms + Regex

public abstract class AbstractNegexDetector implements NegexDetector {

    @Override
    public Set detectNegations(String language, Graph metadata, AnalysedText at)
            throws NegexException {}

protected abstract boolean isNegated(String language, String concept, String sentence);

}
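A much simplified sketch of a trigger-term check in the spirit of NegEx, not the project's implementation. The trigger lists are short illustrative English samples (an assumption; a real lexicon is larger and language specific, and the Hexin records are in Spanish and Galician), and the language parameter of the slide's isNegated signature is omitted for brevity.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleNegexSketch {

    // Illustrative sample triggers only (assumption), one list per trigger type.
    static final List<String> PSEUDO = List.of("no increase");
    static final List<String> PRE = List.of("absence of", "no", "denies", "without");
    static final List<String> POST = List.of("unlikely");
    static final List<String> TERMINATION = List.of("but", "however");

    /** Builds a whole-word alternation pattern over the given terms. */
    private static Pattern anyOf(List<String> terms) {
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    /** Returns true if the first mention of the concept falls inside a negation scope. */
    static boolean isNegated(String concept, String sentence) {
        String target = concept.toLowerCase(Locale.ROOT);
        StringBuilder masked = new StringBuilder(sentence.toLowerCase(Locale.ROOT));
        // Blank out pseudo-negations so that "no increase" is not read as the trigger "no".
        Matcher pseudo = anyOf(PSEUDO).matcher(masked.toString());
        while (pseudo.find()) {
            for (int i = pseudo.start(); i < pseudo.end(); i++) {
                masked.setCharAt(i, '_');
            }
        }
        String text = masked.toString();
        int conceptStart = text.indexOf(target);
        if (conceptStart < 0) {
            return false;
        }
        int conceptEnd = conceptStart + target.length();
        // The scope is bounded by the nearest termination terms around the concept.
        int scopeStart = 0;
        int scopeEnd = text.length();
        Matcher termination = anyOf(TERMINATION).matcher(text);
        while (termination.find()) {
            if (termination.end() <= conceptStart) {
                scopeStart = termination.end();
            } else if (termination.start() >= conceptEnd) {
                scopeEnd = termination.start();
                break;
            }
        }
        boolean pre = anyOf(PRE).matcher(text.substring(scopeStart, conceptStart)).find();
        boolean post = anyOf(POST).matcher(text.substring(conceptEnd, scopeEnd)).find();
        return pre || post;
    }

    public static void main(String[] args) {
        System.out.println(isNegated("fever", "No fever but cough persists"));
        System.out.println(isNegated("cough", "No fever but cough persists"));
    }
}
```

In "No fever but cough persists", the termination term "but" closes the scope of "no", so only "fever" is reported as negated.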

Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. Oct 2001;34(5):301-310.

Hexin Custom Engine: Negex (II)

• Trigger Types:
  • Pre-condition Negation terms (e.g. absence of)
  • Pseudo Negation terms (e.g. no increase)
  • Pre-condition possibility phrase (e.g. rule him out)
  • Post-condition negation terms (e.g. unlikely)
  • Termination terms (e.g. but, however)

• Implementation available under 2.0

• Engine Implementation Challenges:
  • Entity Annotations as Targets
  • The relationships between AnalyzedText and EntityAnnotations are currently obfuscated
  • GLUE CODE for locating Entity Annotation Spans by using the START - END properties of Text Annotations
  • Once the sentence containing an Entity Annotation is located, it is used as context, along with the Entity surface form (mention), for applying the algorithm
  • Negation is returned as a Custom Property of the TextAnnotation (negated = True or False)

Hexin Custom Engine: Fact Extraction

“Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD”

Hexin Custom Engine: Fact Extraction (II)

• In-Context Entity Fact Extraction

• Facts returned as Entity RDF Metadata like the rest of Entity Properties

• Different Implementations of Context (all extracted from AnalyzedText structure)
  • Sentence Context (default and usually enough)
  • Window of Text Context
  • Paragraph Context

• Rule Based Approach:
  • Regex over RAW Text or POS tags Sequence
  • ENTITY reserved word -> OR expression for all ENTITY labels

Hexin Custom Engine: Fact Extraction (III)

• Supported Expressions:
  • diabetes|diabético|DM desde los N años
  • diabetes|diabético|DM a los N años
  • Debut diabetes|diabético|DM a los N años

Hexin Custom Engine: Fact Extraction (IV)

• POS based Rules:

Diabetes  diagnosed  when  he   was  5   years  old
NNS       VB         WRB   PRP  VBD  CD  NNS    JJ

ENTITY \s VB * VB[be] (CD) years old

or simply

ENTITY \s VB * VB[be] (CD)
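The raw-text variant of the rule-based approach can be sketched as follows, assuming a rule template in which the reserved word ENTITY expands to an OR expression over the entity labels and N stands for the number to capture, as in the supported Spanish expressions. The class and method names (OnsetAgeRules, compile, onsetAge) are hypothetical, and this is not the project's engine code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnsetAgeRules {

    /** Expands a template: ENTITY -> alternation of all labels, N -> captured number. */
    static Pattern compile(String template, List<String> entityLabels) {
        List<String> parts = new ArrayList<>();
        for (String token : template.trim().split("\\s+")) {
            if (token.equals("ENTITY")) {
                List<String> quoted = new ArrayList<>();
                for (String label : entityLabels) {
                    quoted.add(Pattern.quote(label));
                }
                parts.add("(?:" + String.join("|", quoted) + ")");
            } else if (token.equals("N")) {
                parts.add("(\\d+)");
            } else {
                parts.add(Pattern.quote(token));
            }
        }
        // Lookarounds keep short labels such as "DM" from matching inside longer words.
        return Pattern.compile("(?<!\\p{L})" + String.join("\\s+", parts) + "(?!\\p{L})",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Applies the templates in order to the context and returns the first captured age. */
    static OptionalInt onsetAge(String context, List<String> templates, List<String> labels) {
        for (String template : templates) {
            Matcher matcher = compile(template, labels).matcher(context);
            if (matcher.find()) {
                return OptionalInt.of(Integer.parseInt(matcher.group(1)));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        List<String> labels = List.of("diabetes", "diabético", "DM");
        List<String> templates = List.of("ENTITY desde los N años", "ENTITY a los N años");
        String sentence = "Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD";
        System.out.println(onsetAge(sentence, templates, labels));
    }
}
```

The sentence context is passed in as a plain string here; in the engine described above it would come from the AnalyzedText sentence that contains the entity mention.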

Thanks for your attention!