Evidence Finder: a semantic search tool for the PMC Corpus of Biomedical Research Papers
C.J. Rupp
National Centre for Text Mining, University of Manchester, www.nactem.ac.uk
Now: Spatial Humanities Project, History
2/6/2013 C.J. Rupp 1 Outline
What is UKPMC?
Text Mining for Biomedicine
What is PubMed Central?
What Does UKPMC add?
Medie: a point of Comparison
Lean Fact Extraction
What is Evidence Finder?
Complementary Search
Fact Summary
The UKPMC Team at NaCTeM
C.J. Rupp: Parsing, Relation Extraction, Indexing
Chikashi Nobata: Named Entity Recognition (NER)
Bill Black: Project Manager
Prof. Sophia Ananiadou: Director
Jock McNaught: Deputy Director
Matt Machin: Web Application, Interfaces, GWIT
Jacob Carter: Databases
C.J. & Bill: Design
What is UKPMC?
• A repository of 2.4 million full text journal articles in Biomedicine and Health Science
• Available for free on the web with no access restrictions
• Launched in January 2007, funded by the 8 largest funders of medical research in the UK
• Delivered by a consortium: The British Library, EBI and Manchester University
• This is the UK portal on the PubMed Central repository
• Now extended Europe-wide.
Text Mining for Biomedicine
• There's a lot of work on Text Mining for Biomedicine
  This field has money
  But it also has one of the best problems
• The rate of Biomedical publication has soared
  To inhuman proportions
  So it's appealing to look for machine assistance
• The selling points are:
  Handle information overload, and
  Avoid overlooking information.
Data Deluge
[Charts: Medline, total articles per year; Medline, new articles per year; EMBL Database, total entries per year]
What is PubMed Central?
PubMed Central (PMC) is the U.S. National Institutes of Health (NIH) digital archive of biomedical and life sciences journal literature.
Around 2 million full-text, published articles
Contrast with PubMed: c. 22 million abstracts
Many PMC articles are Open Access
Mixed format corpus: XML, PDF, OCR-ed
What does UKPMC add?
There are two main areas where UKPMC offers an extended service:
1. Additional literature, including UK-specific documents, such as NHS guidelines
2. A range of text mining services
This is where NaCTeM comes in
Our Mission
Provide a more Intelligent Search tool for UKPMC
Showcase Text Mining Technologies
Use existing Resources, specifically:
Enju: deep syntactic parser
BioLexicon: domain lexicon
NER tools: for genes, diseases, etc.
Enju Parser
A syntactic parser for English.
With a wide-coverage probabilistic HPSG grammar
An efficient parsing algorithm
Trained on Biomedical text (PubMed abstracts)
Which provides phrase structures and predicate-argument structures.
The BioLexicon
A Lexical Database for Biomedicine
2.2 M entries (mainly biomedical terms)
658 domain-relevant verbs
Syntactic subcategorisation frames specified for all verbs (1760 frames)
Collected automatically from a dependency-parsed corpus of 6M tokens on the topic of E. coli
Includes strongly selected modifiers, reflecting the importance of location, time, manner, etc. in the description of biomedical facts
Also, Semantic frames specified for 168 verbs (856 frames)
NER (Named Entity Recognition)
Dictionary-Based NER for significant classes of entity:
Genes and Proteins
Drugs and Diseases
Metabolites
Including NeMine, trained for gene/protein disambiguation
Dictionaries include UMLS, DrugBank, HMDB
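As a rough illustration of dictionary-based NER, here is a minimal sketch using greedy longest-match lookup. The dictionary entries, class labels, and identifiers are invented placeholders, not real UMLS, DrugBank, or HMDB content, and this is not the NaCTeM implementation.

```python
import re

# Hypothetical toy dictionary: lower-cased surface form -> (entity class, ID).
# The IDs are invented placeholders, not real database identifiers.
DICTIONARY = {
    "ciprofloxacin": ("Drug", "DRUG:0001"),
    "dna gyrase": ("Protein", "PROT:0001"),
    "malaria": ("Disease", "DIS:0001"),
}
MAX_TERM_TOKENS = max(len(term.split()) for term in DICTIONARY)


def tag_entities(sentence):
    """Return (surface text, class, ID) triples using greedy longest match."""
    tokens = re.findall(r"\w+", sentence)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first so multi-word terms win.
        for n in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            entry = DICTIONARY.get(span.lower())
            if entry:
                found.append((span, *entry))
                i += n
                break
        else:
            # No dictionary term starts at this token; move on.
            i += 1
    return found
```

Calling `tag_entities` on one of the ciprofloxacin sentences would return the disease, drug, and protein mentions in text order. A real system would also need disambiguation, which is the role the slide assigns to NeMine.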
Medie: a Point of Comparison
There was an existing system with a similar specification:
Defined on PubMed Abstracts
With a powerful query language
GCL (Generalised Concordance Lists) based on Region Algebra
Using a tabular format for queries
Medie
Medie: Result
Tabular Format
Formal Query
Notes on Medie
While it seems fairly intuitive
Medie stores a lot of information from the Enju parse
So there's expressive power under the hood
But the average user doesn't get to use it
Also non-linguists may be put off by explicit grammatical terminology in the interface
UKPMC Engagement
We did some focus group studies
These showed a marked preference for a simple interface (predictably?)
How do you get as close as possible to a Google-style interface
And still show off your deep linguistic analysis?
Design Constraints
Intuitive interface
Tailor the information stored to the requirements of the functionality
Make best use of our own specialised resources
Provide a simple web service to link with keyword and metadata searches
Lean Fact Extraction
We extract a database of facts that may provide answers to queries
We rely on specialised linguistic and domain knowledge to underwrite the quality of the fact entries
Facts should be seen as units of evidence
Validity is the authors' problem
Ours is relevance
What is a Fact?
Each entry in the fact database is the conjunction of:
A named entity (NE), according to the NER
Occurring within an argument (or modifier) position, according to the Enju analysis
That is designated as domain relevant in the BioLexicon
That's a recipe!
Explanation
The BioLexicon extends our scope with predicted modifiers, as well as arguments
We take phrases containing NEs to generalise and improve yield
The parse assigns syntactic roles
We also handle some negation
Mainly explicit negation on the verb.
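The recipe can be sketched as a three-way conjunction. The input shape, the verb set, and the entity set below are hypothetical stand-ins for Enju, BioLexicon, and NER output; the real formats are not shown in these slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the resources; not the real data formats.
DOMAIN_VERBS = {"activate", "result", "inhibit"}   # "BioLexicon" verbs
KNOWN_ENTITIES = {"ciprofloxacin", "PAE"}          # "NER" hits


@dataclass(frozen=True)
class Fact:
    verb: str
    role: str       # e.g. "ARG1", "ARG2", or "MOD"
    entity: str
    negated: bool   # explicit negation on the verb


def find_entity(phrase: str) -> Optional[str]:
    """Return the first known entity mentioned anywhere inside the phrase."""
    for token in phrase.replace(",", " ").split():
        if token in KNOWN_ENTITIES:
            return token
    return None


def extract_facts(predicates):
    """Apply the recipe to parser-like output.

    `predicates` is a list of dicts with keys "verb", "negated", and
    "args" (a role -> phrase mapping); this shape is an assumption.
    """
    facts = []
    for pred in predicates:
        if pred["verb"] not in DOMAIN_VERBS:        # domain-relevant verb
            continue
        for role, phrase in pred["args"].items():   # argument or modifier slot
            entity = find_entity(phrase)            # phrase containing an NE
            if entity is not None:
                facts.append(Fact(pred["verb"], role, entity, pred["negated"]))
    return facts
```

Matching an entity anywhere inside the argument phrase, rather than requiring the phrase to be the entity, mirrors the point about taking phrases containing NEs to improve yield.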
A Simplified Fact Table
Document ID   Verb    Arg1           Arg2  Sentence
PMC2845863    result  ciprofloxacin  -     Treatment wi..
PMC2817234    result  ciprofloxacin  PAE   Treatment of..
PMC2738812    result  ciprofloxacin  -     the combin…
PMC2847397    result  ciprofloxacin  -     An in vivo ex..
In practice, tables are populated with identifiers, in fields that may be normalised or cross-referenced.
In particular, NEs are mapped to a canonical identifier in the database and a canonical written form in generated questions.
(PAE, here, represents another NE in an (oblique) object position. Otherwise, it’s just text.)
Sentence Snippets
The database also contains the sentence where each fact was found
As well as the document ID, to coordinate with other UKPMC services, e.g. metadata
Because of copyright issues (with the HTML webpages)
We were not given access to present results in situ, with highlighting and links in the text
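A minimal sketch of how such a fact store might look, using SQLite. The table and column names are assumptions made for illustration; the slides do not describe the actual schema. Entities are stored once with a canonical form and referenced by identifier, and each fact row carries its document ID and sentence snippet.

```python
import sqlite3


def build_fact_db():
    """Create an in-memory fact store; table and column names are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (
            entity_id      INTEGER PRIMARY KEY,
            canonical_form TEXT NOT NULL
        );
        CREATE TABLE fact (
            doc_id   TEXT NOT NULL,     -- links to other UKPMC services
            verb     TEXT NOT NULL,
            arg1_id  INTEGER REFERENCES entity(entity_id),
            arg2_id  INTEGER REFERENCES entity(entity_id),
            sentence TEXT NOT NULL      -- snippet shown as evidence
        );
        """
    )
    conn.execute("INSERT INTO entity VALUES (1, 'ciprofloxacin')")
    # Two rows taken from the simplified fact table, snippets left truncated.
    conn.executemany(
        "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
        [
            ("PMC2845863", "result", 1, None, "Treatment wi.."),
            ("PMC2847397", "result", 1, None, "An in vivo ex.."),
        ],
    )
    return conn


def facts_for_entity(conn, canonical_form):
    """Return (doc_id, verb, sentence) rows whose Arg1 or Arg2 is the entity."""
    query = """
        SELECT f.doc_id, f.verb, f.sentence
        FROM fact AS f
        JOIN entity AS e ON e.entity_id IN (f.arg1_id, f.arg2_id)
        WHERE e.canonical_form = ?
        ORDER BY f.doc_id
    """
    return conn.execute(query, (canonical_form,)).fetchall()
```

Keeping the document ID on every row is what would let results be joined with metadata from other services without storing the full article text.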
Some Sentence Snippets (about Ciprofloxacin)
Treatment with ciprofloxacin, ceftriaxone or pivmecillinam resulted in a cure rate of >99% while assessing clinical failure, bacteriological failure and bacteriological relapse.
Treatment of the malaria parasites with ciprofloxacin, an inhibitor of the bacterial DNA gyrase, and other antibiotics including chloramphenicol, clindamycin, tetracycline and rifampicin resulted in the arrest of growth in the second asexual cycle, while the parasites in the current cell cycle appeared relatively unaffected (Geary et al. 1988; McFadden & Roos 1999; Surolia et al. 2004; Ramya et al. 2007).
the combination of ciprofloxacin and 5-FU resulted in a synergistic prolongation of the postantibiotic effect (PAE) in comparison with the PAE induced by the drugs alone.
An in vivo exposure to ciprofloxacin resulted in predominately efflux-mediated resistant mutants, suggesting that efflux plays a central role in emergence of fluoroquinolone resistance.
We Have all the Answers
Well actually we don't!
But we have all the answers we are prepared to offer
How do we provide these to the user, in response to a relevant query?
This must be coordinated with searches based on:
A keyword in the text or (literary) metadata
What is Evidence Finder?
The Concept:
This is a complementary search tool for UKPMC. To search the repository from a different perspective.
We retrieve documents,
But we search on evidence, rather than publication history, or keywords.
We provide a structured answer using generated questions
More than a Keyword!
Evidence Finder extends a keyword search
Search on a keyword produces a potentially large set of possible answers from the fact database
Generating questions around the relations in those facts can structure the result into smaller answer sets: the Jeopardy® solution!?
And help the user refine their query:
• “This is what you could have asked”
Generating questions
Verb: activate (ARG1: Entity1, ARG2: Entity2)
Surface variants mapped to the same relation:
  Entity1 activates Entity2
  Entity2 is activated by Entity1
  Entity1 cooperate to activate Entity2
  Entity1 play key roles by activating Entity2
We deal with syntactic variability by deep semantic parsing
Turning these into questions suggests how they can be accessed in a search application
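A minimal sketch of turning normalised facts into suggested questions for a keyword. The facts, document IDs, question templates, and the naive "+s" verb inflection are all invented for illustration; the slides do not show how Evidence Finder phrases its questions.

```python
from collections import defaultdict

# Hypothetical normalised facts: (doc_id, verb, arg1, arg2). All values invented.
FACTS = [
    ("DOC1", "activate", "geneA", "geneB"),
    ("DOC2", "activate", "geneA", "geneC"),
    ("DOC3", "inhibit", "geneB", "geneA"),
    ("DOC4", "activate", "geneC", "geneA"),
]


def suggest_questions(keyword, facts):
    """Group facts mentioning `keyword` into question -> set of doc IDs."""
    questions = defaultdict(set)
    for doc_id, verb, arg1, arg2 in facts:
        if keyword == arg1:
            # Keyword is the agent: ask about what it acts on.
            questions[f"What does {keyword} {verb}?"].add(doc_id)
        if keyword == arg2:
            # Keyword is the target: ask about what acts on it.
            # Naive third-person inflection, adequate only for regular verbs.
            questions[f"What {verb}s {keyword}?"].add(doc_id)
    return dict(questions)
```

Because active, passive, and embedded variants have already been reduced to the same verb and argument roles, each question gathers documents regardless of how the sentence was phrased.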
Complementary Search
Evidence Finder Result
What to expect from Evidence Finder
Suggests questions for you
Clicking on a question will return sets of documents with evidence snippets
Shows where answers may be in the text
Answers should immediately show you if you want to look at the whole document
Helps you look at similar facts in other documents
Evidence Finder: Result
[Screenshot labels: Generated Questions, Document Metadata, Evidence Sentences]
Fact Summary
“More Like This” Query
What is Evidence Finder?
The Implementation:
• Web Services by NaCTeM:
1. Suggested questions corresponding to a search term
2. Paged ‘answers’ to a question: document metadata from the EBI WS, extended with matching analyzed sentences
3. All the analyzed factual sentences in a doc., each with a “more like this” query attached
• The Platform:
• Java supported by Eclipse, using Google Web Toolkit (GWIT)
• Web Service running under Apache Tomcat
UKPMC Evidence Finder
[Architecture diagram with two paths. Indexing: new doc set, XML converter, Enju parser, NER for UKPMC, fact extractor, store in Fact DB. Searching: query from user, web interface, EVF web interface, search Fact DB, consolidate with document data from the Europe PMC Web Service, retrieved facts returned to the user.]
Statistics and observations
2.4M articles fully parsed
67.36 million indexed facts
Representing 1.7 million documents
Relies on NEs indexed by NaCTeM
Search results ranked by date, newest first.
Other rankings possible
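Ranking and paging of answers could be sketched as follows. The answer records, field names, and document IDs are hypothetical; the default ordering is by date, newest first, and another ranking can be swapped in through `rank_key`.

```python
from datetime import date


def by_date(answer):
    """Default ranking key: publication date."""
    return answer["published"]


def page_of_answers(answers, page, page_size=10, rank_key=by_date):
    """Return one zero-indexed page of answers, highest-ranked first."""
    ranked = sorted(answers, key=rank_key, reverse=True)
    start = page * page_size
    return ranked[start:start + page_size]


# Invented example records, for illustration only.
ANSWERS = [
    {"doc_id": "DOC1", "published": date(2009, 5, 1), "evidence_count": 3},
    {"doc_id": "DOC2", "published": date(2012, 1, 15), "evidence_count": 1},
    {"doc_id": "DOC3", "published": date(2010, 8, 30), "evidence_count": 7},
]
```

With `page_size=2`, page 0 holds DOC2 and DOC3 and page 1 holds DOC1. Passing `rank_key=lambda a: a["evidence_count"]` would instead rank by the number of matching evidence sentences, one of the other rankings that might be tried.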
What is Evidence Finder for?
An Evidence-based search:
Starts from the bottom
Locates specific statements
It may find unexpected or overlooked facts
It may find trivial and boring facts
It's not an antidote to literature or Google search
It may not be able to handle complex queries (yet).
Extensions?
Structure within phrases
Select NEs with the “Head” line
More negation operators
• “lack of”, “fail to”, “avoid”
More normalisation
e.g. Acronym resolution
Relation sets from other domains
Refine the medical verb dictionary
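As one possible reading of the negation extension listed above, lexical cues could be spotted with simple patterns. This is a speculative sketch, not a description of planned work: the regular expressions and the handling of inflected forms are assumptions.

```python
import re

# Cue name -> pattern. Inflected forms ("fails to", "avoided") are an assumption.
NEGATION_CUES = {
    "lack of": re.compile(r"\black of\b", re.IGNORECASE),
    "fail to": re.compile(r"\bfail(?:s|ed|ing)? to\b", re.IGNORECASE),
    "avoid": re.compile(r"\bavoid(?:s|ed|ing)?\b", re.IGNORECASE),
}


def negation_cues(sentence):
    """Return the names of the negation cues found in the sentence."""
    return [name for name, pattern in NEGATION_CUES.items()
            if pattern.search(sentence)]
```

Surface matching like this ignores scope, so in practice a cue would still need to be tied to the verb it modifies through the parse.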
Thanks
For your patience and stamina
Services to try:
http://labs.europepmc.org/evf
http://www.nactem.ac.uk/MEDIE/