Evidence Finder: a semantic search tool for the PMC Corpus of Biomedical Research Papers

C.J. Rupp
Now: National Centre for Spatial Humanities Project History
www.nactem.ac.uk

2/6/2013 C.J. Rupp

Outline

• What is UKPMC?
• Text Mining for Biomedicine
• What is PubMed Central?
• What Does UKPMC Add?
• Medie: a Point of Comparison
• Lean Fact Extraction
• What is Evidence Finder?
• Complementary Search
• Fact Summary

The UKPMC Team at NaCTeM

• C.J. Rupp: Parsing, Relation Extraction, Indexing
• Chikashi Nobata: Named Entity Recognition (NER)
• Bill Black: Project Manager
• Prof. Sophia Ananiadou: Director
• Jock McNaught: Deputy Director
• Matt Machin: Web Application, Interfaces, GWT
• Jacob Carter: Databases
• C.J. & Bill: Design

What is UKPMC?

• A repository of 2.4 million full-text journal articles in biomedicine and health science
• Available for free on the web, with no access restrictions
• Launched in January 2007, funded by the 8 largest funders of medical research in the UK
• Delivered by a consortium: the British Library, EBI and Manchester University
• This is the UK portal to the PubMed Central repository
• Now extended Europe-wide

Text Mining for Biomedicine

• There's a lot of work on Text Mining for Biomedicine
  • This field has money
  • But it also has one of the best problems
• The rate of biomedical publication has soared
  • To inhuman proportions
  • So it's appealing to look for machine assistance
• The selling points are:
  • Handle information overload, and
  • Avoid overlooking information.

Data Deluge

[Figure: three charts. Medline, total articles per year; Medline, new articles per year; EMBL Database, total entries per year.]

What is PubMed Central?

PubMed Central (PMC) is the U.S. National Institutes of Health (NIH) digital archive of biomedical and life sciences journal literature.

• Around 2 million full-text, published articles
• Contrast with PubMed: c. 22 million abstracts
• Many PMC articles are Open Access
• Mixed-format corpus: XML, PDF, OCR-ed

What does UKPMC add?

• There are two main areas where UKPMC offers an extended service:
  1. Additional literature, including UK-specific documents, such as NHS guidelines
  2. A range of text mining services
• This is where NaCTeM comes in

Our Mission

• Provide a more intelligent search tool for UKPMC
• Showcase text mining technologies
• Use existing resources, specifically:
  • Enju: deep syntactic parser
  • BioLexicon: domain lexicon
  • NER tools: for genes, diseases, etc.

Enju Parser

A syntactic parser for English.

• With a wide-coverage probabilistic HPSG grammar
• An efficient parsing algorithm
• Trained on biomedical text (PubMed abstracts)
• Which provides phrase structures and predicate-argument structures.

The BioLexicon

A Lexical Database for Biomedicine

• 2.2M entries (mainly biomedical terms)
• 658 domain-relevant verbs
• Syntactic subcategorisation frames specified for all verbs (1760 frames)
  • Collected automatically from a dependency-parsed corpus of 6M tokens on the topic of E. coli
  • Frames include strongly selected modifiers, reflecting the importance of location, time, manner, etc. in the description of biomedical facts
• Also, semantic frames specified for 168 verbs (856 frames)

NER (Named Entity Recognition)

Dictionary-based NER for significant classes of entity:

• Genes and Proteins
• Drugs and Diseases
• Metabolites

Including NeMine, trained for gene/protein disambiguation.
Dictionaries include UMLS, DrugBank, HMDB.

Medie: a Point of Comparison

• There was an existing system with a similar specification:
  • Defined on PubMed abstracts
  • With a powerful query language
    • GCL (Generalised Concordance Lists) based on Region Algebra
  • Using a tabular format for queries

Medie [image]

Medie: Result [image]

Tabular Format [image]

Formal Query [image]

Notes on Medie

• While it seems fairly intuitive:
  • Medie stores a lot of information from the Enju parse
  • So there's expressive power under the hood
  • But the average user doesn't get to use it
  • Also, non-linguists may be put off by explicit grammatical terminology in the interface

UKPMC Engagement

• We did some focus group studies
• These showed a marked preference for a simple interface (predictably?)
• How do you get as close as possible to a Google-style interface
• And still show off your deep linguistic analysis?

Design Constraints

• Intuitive interface
• Tailor the information stored to the requirements of the functionality
• Make best use of our own specialised resources
• Provide a simple web service to link with keyword and metadata searches

Lean Fact Extraction

• We extract a database of facts that may provide answers to queries
• We rely on specialised linguistic and domain knowledge to underwrite the quality of the fact entries
• Facts should be seen as units of evidence
  • Validity is the authors' problem
  • Ours is relevance

What is a Fact?

Each entry in the fact database is the conjunction of:

• A named entity (NE), according to the NER
• Occurring within an argument (or modifier) position, according to the Enju analysis
• That is designated as domain relevant in the BioLexicon

That's a recipe!
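The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DOMAIN_VERBS` and `NAMED_ENTITIES` are hypothetical stand-ins for the BioLexicon and the NER output, and the `(verb, role, phrase)` triples are a made-up simplification of a parser's predicate-argument structures. None of these names reflect the actual Enju, BioLexicon or NaCTeM interfaces.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the three resources; not the real APIs.
DOMAIN_VERBS = {"result", "activate", "inhibit"}  # verbs assumed to be marked domain relevant
NAMED_ENTITIES = {"ciprofloxacin", "PAE"}         # strings assumed to be tagged by the NER


@dataclass(frozen=True)
class Fact:
    doc_id: str
    verb: str
    role: str    # e.g. "ARG1", "ARG2" or "MOD"
    entity: str


def extract_facts(doc_id, predicate_args):
    """Keep a (verb, role, phrase) triple only if the verb is domain
    relevant and the phrase in that position contains a named entity."""
    facts = []
    for verb, role, phrase in predicate_args:
        if verb not in DOMAIN_VERBS:
            continue
        for token in phrase.split():
            if token in NAMED_ENTITIES:
                facts.append(Fact(doc_id, verb, role, token))
    return facts


# Triples as a parser might return them for one sentence (assumed input format).
parsed = [
    ("result", "ARG1", "treatment with ciprofloxacin"),
    ("result", "ARG2", "a cure rate"),
    ("assess", "ARG2", "ciprofloxacin resistance"),
]
facts = extract_facts("PMC2845863", parsed)
```

Only the first triple survives: the second contains no named entity, and the third has a verb outside the assumed domain list.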

Explanation

• The BioLexicon extends our scope with predicted modifiers, as well as arguments
• We take phrases containing NEs to generalise and improve yield
• The parse assigns syntactic roles
• We also handle some negation
  • Mainly explicit negation on the verb.

A Simplified Fact Table

Document ID   Verb     Arg1            Arg2   Sentence
PMC2845863    result   ciprofloxacin   -      Treatment wi..
PMC2817234    result   ciprofloxacin   PAE    Treatment of..
PMC2738812    result   ciprofloxacin   -      the combin…
PMC2847397    result   ciprofloxacin   -      An in vivo ex..

In practice, the tables are populated with identifiers in any field that can be normalised or cross-referenced.

In particular, NEs are mapped to a canonical identifier in the database and a canonical written form in generated questions.
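As a rough illustration of this normalisation, the sketch below stores canonical identifiers in a small in-memory SQLite table and maps them back to a written form for display. The schema, the identifiers `NE1` and `NE2`, and the capitalised surface variant are all assumptions made for the example; the slides do not specify the real database layout or identifier scheme.

```python
import sqlite3

# Placeholder identifiers and surface variants, invented for this sketch.
CANONICAL_ID = {"ciprofloxacin": "NE1", "Ciprofloxacin": "NE1", "PAE": "NE2"}
CANONICAL_FORM = {"NE1": "ciprofloxacin", "NE2": "PAE"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact (doc_id TEXT, verb TEXT, arg1 TEXT, arg2 TEXT, sentence TEXT)"
)


def store_fact(doc_id, verb, arg1, arg2, sentence):
    """Store canonical identifiers rather than surface strings; arg2 may be None."""
    conn.execute(
        "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
        (doc_id, verb, CANONICAL_ID[arg1], CANONICAL_ID.get(arg2), sentence),
    )


# Two rows from the simplified table (snippets truncated as on the slide).
# The second row uses a capitalised variant to show that it normalises to the same id.
store_fact("PMC2845863", "result", "ciprofloxacin", None, "Treatment wi..")
store_fact("PMC2817234", "result", "Ciprofloxacin", "PAE", "Treatment of..")


def docs_for(surface_form):
    """Look up documents by any known surface form of an entity."""
    rows = conn.execute(
        "SELECT doc_id FROM fact WHERE arg1 = ? ORDER BY doc_id",
        (CANONICAL_ID[surface_form],),
    )
    return [doc_id for (doc_id,) in rows]


matching_docs = docs_for("ciprofloxacin")
display_name = CANONICAL_FORM[CANONICAL_ID["Ciprofloxacin"]]
```

Because both rows carry the same identifier in `arg1`, a lookup by either surface form returns both documents, and generated text can always use the single canonical written form.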

(PAE, here, represents another NE in an (oblique) object position. Otherwise, it’s just text.)

Sentence Snippets

• The database also contains the sentence where each fact was found
• As well as the document ID, to coordinate with other UKPMC services, e.g. metadata
• Because of copyright issues (with the HTML web pages), we were not given access to present results in situ, with highlighting and links in the text

Some Sentence Snippets (about Ciprofloxacin)

• Treatment with ciprofloxacin, ceftriaxone or pivmecillinam resulted in a cure rate of >99% while assessing clinical failure, bacteriological failure and bacteriological relapse.
• Treatment of the malaria parasites with ciprofloxacin, an inhibitor of the bacterial DNA gyrase, and other antibiotics including chloramphenicol, clindamycin, tetracycline and rifampicin resulted in the arrest of growth in the second asexual cycle, while the parasites in the current cell cycle appeared relatively unaffected (Geary et al. 1988; McFadden & Roos 1999; Surolia et al. 2004; Ramya et al. 2007).
• the combination of ciprofloxacin and 5-FU resulted in a synergistic prolongation of the postantibiotic effect (PAE) in comparison with the PAE induced by the drugs alone.
• An in vivo exposure to ciprofloxacin resulted in predominately efflux-mediated resistant mutants, suggesting that efflux plays a central role in emergence of fluoroquinolone resistance.

We Have all the Answers

• Well, actually we don't!
• But we have all the answers we are prepared to offer
• How do we provide these to the user, in response to a relevant query?
• This must be coordinated with searches based on:
  • A keyword in the text or (literary) metadata

What is Evidence Finder?

• The Concept:
  • This is a complementary search tool for UKPMC.
  • To search the repository from a different perspective.
• We retrieve documents,
  • But we search on evidence, rather than publication history or keywords.
• We provide a structured answer using generated questions

More than a Keyword!

• Evidence Finder extends a keyword search
  • Search on a keyword produces a potentially large set of possible answers from the fact database
  • Generating questions around the relations in those facts can structure the result into smaller answer sets: the Jeopardy® solution!?
  • And help the user refine their query:
    • “This is what you could have asked”
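One way to picture how a keyword's facts break down into smaller answer sets is to group them by the relation and by the role the keyword plays in it. The sketch below does this over a toy list of facts; the row layout and all entity and document names are placeholders, not the actual fact database.

```python
from collections import defaultdict

# Toy fact rows (verb, arg1, arg2, doc_id); all values are placeholders.
FACTS = [
    ("activate", "Entity1", "Entity2", "doc1"),
    ("activate", "Entity1", "Entity3", "doc2"),
    ("inhibit", "Entity4", "Entity1", "doc3"),
    ("activate", "Entity2", "Entity5", "doc4"),
]


def answer_sets(keyword, facts):
    """Group the documents for a keyword by (verb, role the keyword fills)."""
    groups = defaultdict(set)
    for verb, arg1, arg2, doc_id in facts:
        if arg1 == keyword:
            groups[(verb, "ARG1")].add(doc_id)
        if arg2 == keyword:
            groups[(verb, "ARG2")].add(doc_id)
    return dict(groups)


sets_for_entity1 = answer_sets("Entity1", FACTS)
```

Each key in the result corresponds to one question that could be offered to the user, and each value is the set of documents that would answer it.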

Generating questions

Surface variants:
• Entity1 activates Entity2
• Entity2 is activated by Entity1
• Entity1 cooperate to activate Entity2
• Entity1 play key roles by activating Entity2

Shared predicate-argument structure:
activate (ARG1: Entity1, ARG2: Entity2)

We deal with syntactic variability by deep semantic parsing

Turning these into questions suggests how they can be accessed in a search application
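A minimal sketch of the question-generation step, assuming a fact has already been reduced to a verb, a role and an entity. The question templates and the naive inflection rule are illustrative guesses; the slides do not give the wording Evidence Finder actually uses.

```python
def third_person(verb):
    """Naive third-person inflection; only meant for regular verbs."""
    if verb.endswith(("s", "x", "z", "ch", "sh")):
        return verb + "es"
    return verb + "s"


def generate_question(verb, role, entity):
    """Ask about the argument slot that the known entity does not fill."""
    if role == "ARG1":
        return f"What does {entity} {verb}?"
    if role == "ARG2":
        return f"What {third_person(verb)} {entity}?"
    raise ValueError(f"unsupported role: {role}")


q1 = generate_question("activate", "ARG1", "Entity1")
q2 = generate_question("activate", "ARG2", "Entity2")
```

Because active, passive and embedded variants all reduce to the same verb and roles, one pair of templates per verb covers every surface form.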

Complementary Search [image]

Complementary Search: Evidence Finder Result [image]

What to expect from Evidence Finder

• Suggests questions for you
• Clicking on a question will return sets of documents with evidence snippets
• Shows where answers may be in the text
• Answers should immediately show you if you want to look at the whole document
• Helps you look at similar facts in other documents

Evidence Finder: Result [image]

Evidence Finder: Result [image, with callouts: Generated Questions, Document Metadata, Evidence Sentences]

Fact Summary [image]

“More Like This” Query [image]

What is Evidence Finder?

• The Implementation:
  • A set of Web Services by NaCTeM:
    1. Suggested questions corresponding to a search term
    2. Paged ‘answers’ to a question: document metadata from the EBI web service, extended with matching analysed sentences
    3. All the analysed factual sentences in a document, each with a “more like this” query attached
• The Platform:
  • Java supported by Eclipse, using Google Web Toolkit (GWT)
  • Web Service running under Apache Tomcat
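The paging of answers in the second service can be illustrated with a small helper. This is a generic Python sketch rather than the service's own Java code; the function name, the 1-based page numbering and the default page size are all assumptions.

```python
def page(answers, page_number, page_size=10):
    """Return one page of answers (1-based page number) and the total page count."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    total_pages = (len(answers) + page_size - 1) // page_size  # ceiling division
    start = (page_number - 1) * page_size
    return answers[start:start + page_size], total_pages


# 25 placeholder answers split into pages of 10.
answers = [f"answer{i}" for i in range(25)]
last_page, total = page(answers, 3)
```

Requesting a page past the end simply yields an empty list, which a client can treat as "no more results".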

UKPMC Evidence Finder

[Figure: architecture diagram with two halves, Indexing and Searching. Labels: New doc set (XML), Converter, Fact extractor, Enju parser, NER, Store facts, Fact DB, Query from user, Web interface, EVF Web Interface, User, Search, Retrieved facts, Consolidate NER data, NER for UKPMC, Document Data from Europe PMC Web Service.]

Statistics and observations

• 2.4M articles fully parsed
• 67.36 million indexed facts
  • Representing 1.7 million documents
• Relies on NEs indexed by NaCTeM
• Search results ranked by date, newest first.
  • Other rankings possible
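Ranking by date, and swapping in a different ranking, amounts to changing the sort key. In the sketch below the result records, their field names and the alternative ranking by number of evidence sentences are invented for illustration only.

```python
from datetime import date

# Placeholder results; ids, dates and counts are invented for illustration.
results = [
    {"doc_id": "docA", "published": date(2009, 5, 1), "evidence_sentences": 3},
    {"doc_id": "docB", "published": date(2012, 11, 20), "evidence_sentences": 1},
    {"doc_id": "docC", "published": date(2011, 2, 14), "evidence_sentences": 5},
]

# Ranking described on the slide: newest first.
newest_first = sorted(results, key=lambda r: r["published"], reverse=True)

# One possible alternative (an assumption): most evidence sentences first.
most_evidence_first = sorted(
    results, key=lambda r: r["evidence_sentences"], reverse=True
)
```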

What is Evidence Finder for?

• An Evidence-based search:
  • Starts from the bottom
  • Locates specific statements
  • It may find unexpected or overlooked facts
  • It may find trivial and boring facts
  • It's not an antidote to literature or Google search
  • It may not be able to handle complex queries (yet).

Extensions?

• Structure within phrases
  • Select NEs with the “Head” line
• More negation operators
  • “lack of”, “fail to”, “avoid”
• More normalisation
  • e.g. acronym resolution
• Relation sets from other domains
  • Refine the medical verb dictionary

Thanks

For your patience and stamina

Services to try:

http://labs.europepmc.org/evf

http://www.nactem.ac.uk/MEDIE/
