Comparing a Knowledge-Driven Approach to A

Xu and Wang BMC Bioinformatics 2015, 16(Suppl 5):S6 http://www.biomedcentral.com/1471-2105/16/S5/S6 PROCEEDINGS Open Access Comparing a knowledge-driven approach to a supervised machine learning approach in large- scale extraction of drug-side effect relationships from free-text biomedical literature Rong Xu1*, QuanQiu Wang2 From 10th International Symposium on Bioinformatics Research and Applications (ISBRA-14) Zhangjiajie, China. 28-30 June 2014 Abstract Background: Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for both drug target discovery and drug repositioning. However, a comprehensive drug-SE association knowledge base does not exist. In this study, we present a novel knowledge-driven (KD) approach to effectively extract a large number of drug-SE pairs from published biomedical literature. Data and methods: For the text corpus, we used 21,354,075 MEDLINE records (119,085,682 sentences). First, we used known drug-SE associations derived from FDA drug labels as prior knowledge to automatically find SE-related sentences and abstracts. We then extracted a total of 49,575 drug-SE pairs from MEDLINE sentences and 180,454 pairs from abstracts. Results: On average, the KD approach has achieved a precision of 0.335, a recall of 0.509, and an F1 of 0.392, which is significantly better than a SVM-based machine learning approach (precision: 0.135, recall: 0.900, F1: 0.233) with a 73.0% increase in F1 score. Through integrative analysis, we demonstrate that the higher-level phenotypic drug-SE relationships reflects lower-level genetic, genomic, and chemical drug mechanisms. In addition, we show that the extracted drug-SE pairs can be directly used in drug repositioning. Conclusion: In summary, we automatically constructed a large-scale higher-level drug phenotype relationship knowledge, which can have great potential in computational drug discovery. Introduction tasks. Current drug phenotype-driven systems approaches It has been increasingly recognized that similar side effects rely exclusively on drug-SE associations extracted from of seemingly unrelated drugs can be caused by their com- FDA drug labels. However, there exists a large amount of mon off-targets and that drugs with similar side effects are additional drug-SE relationship knowledge in the large likely to share molecular targets [1]. Therefore, systems body of published biomedical literature. In this study, approaches to studying side effect relationships among we present a novel knowledge-driven approach to auto- drugs and integration of this drug phenotypic data with matically extract a large number of drug-SE pairs from drug-related genetic, genomic, proteomic, and chemical 21 million published biomedical abstracts. We systema- data will facilitate drug target discovery and drug reposi- tically analyzed extracted drug-SE pairs in combination tioning. The availability of acomprehensivedrug-side with drug-related gene targets, metabolism, pathways, effect (SE) relationship knowledge base is critical for these gene expression and chemical structure data. We show that these extracted drug-SE pairs have great potential in drug discovery. 1Case Western Reserve University, Cleveland OH 44106, USA Full list of author information is available at the end of the article © 2015 Xu and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Xu and Wang BMC Bioinformatics 2015, 16(Suppl 5):S6 Page 2 of 8 http://www.biomedcentral.com/1471-2105/16/S5/S6 Background text classifier, it is highly dynamic and effective: it can Systems approaches to studying the phenotypic relation- easily incorporate any changing prior knowledge and ships among drugs can facilitate rapid drug target discov- quickly extracted drug-SE pairs from the whole MEDLINE ery and drug repositioning. Computational approaches to (21,354,075 abstracts and 119,085,682 sentences). predicting drug targets have often been based on chemical similarity measures and docking strategies [7,16]. Approach Similarly, many computational strategies for drug reposi- Our study is based on the two key observations: (1) tioning have been explored [6]. The majority of these multiple side effects for a drug are often reported in the approaches leverage on known drug properties such as same sentences or abstracts; and (2) if a sentence con- chemical similarity [7], molecular activity similarity [12], tains a known drug-SE pair, then this sentence is likely molecular docking [8], and gene expression profile simi- to be SE-relevant. Other pairs in this SE-related sen- larity [13]. In a seminal paper, Campillos et al. used phe- tence are likely to be drug-SE pairs. For example, the notypic side-effect similarities among drugs to predict sentence “At the final irinotecan dose of 50 mg/m(2), new targets for drugs [1]. However, their analysis was grade 3 or higher toxicity included diarrhea (26%), neu- limited to drug-SE relationships derived solely from the tropenia (21%), nausea (18%), fatigue (16%), anorexia FDA drug labels. In one of our recent studies, we show (13%), and thrombosis/embolism (13%)“ (PMID that much of the drug-SE association knowledge from 19139178) contains a known drug-SE pair “irinotecan- biomedical literature has not been captured in FDA drug diarrhea.” Based on this fact, we know that this sentence labels yet [17]. is SE-related and that the other five pairs in this sen- Currently, more than 21 million biomedical records are tence are likely to be drug-SE pairs. On the other hand, available on MEDLINE. While many biomedical relation- the following sentence “Weekly docetaxel, cisplatin, and ship extraction tasks have focused on extracting relation- irinotecan (TPC): results of a multicenter phase II trial ships between drugs, diseases, proteins, or genes [2,18,19], in patients with metastatic esophagogastric cancer “con- extracting drug-SE relationships from MEDLINE has been tain no known drug-SE pair, therefore no pair will be less explored. Recently, Gurulingappa et al. trained and extracted from this sentences even though it contains tested a supervised machine learning classifier to classify three drug-disease pairs. In this study, we used all drug-condition pairs in a set of 2972 manually annotated known drug-SE pairs derived from FDA drug labels as case reports [4]. That study focused on a limited set of prior knowledge to find SE-related MEDLINE sentences drugs and side effects and case reports. It is unclear how and abstracts, from which many additional drug-SE their approach can be effectively scaled up to the whole pairs that have not included in FDA drug labels are MEDLINE in building a large-scale drug-SE relationship then extracted. We compared the KD approach to a knowledge base. Recently, we developed an approach in support vector machine (SVM)-based approach. boosting drug safety signal detection from FDA Adverse Event Reporting System (FAERS) using evidence from Data and methods MEDLINE [20]. We developed an automatic approach to The entire experimental process consists of the follow- extract anticancer drug-specific side effects from MED- ing steps: (1) build a local MEDLINE search engine; LINE by developing specific filtering and ranking schemes (2) develop, evaluate and compare the KD approach to a [21]. We developed a pattern-based learning approach to SVM-based approach; (3) extract drug-SE pairs from accurately extract drug-SE pairs from MEDLINE sen- MEDLINE; and (4) systematically analyze the correlation tences [22]. We combined automatic table classification between drug-associated side effects and drug gene tar- and relationship extraction in extracting anticancer drug- gets, metabolism genes, chemical similarity, and disease side effect pairs from full-text articles [23]. In this study, indications. we present a knowledge-driven (KD) text-classification- based approach to extract drug-SE pairs from MEDLINE Build a local MEDLINE search engine sentences. Different from our previous studies where we We downloaded a total of 21,354,075 MEDLINE cita- extracted drug-SE pairs from unclassified sentence, here tions (119,085,682 sentences) published between 1965 we classified sentences into drug-SE-related and -unre- and 2012 from the U.S. National Library of Medicine lated before relationship extraction. Our approach is also http://mbr.nlm.nih.gov/Download/index.shtml. Each sen- different from other text classification-based approaches tence was syntactically parsed with Stanford Parser [9] that often trained text classifiers using annotated training using the Amazon Cloud computing service (a total of datasets to find drug-SE-related sentences [4], instead, we 3,500 instance-hours with High-CPU Extra Large implicitly classified MEDLINE sentences using known Instance used). We used the publicly available informa- drug-SE pairs. Since our study did not explicitly train a tion retrieval library Lucene http://lucene.apache.org to Xu and Wang BMC Bioinformatics 2015, 16(Suppl 5):S6 Page 3 of 8 http://www.biomedcentral.com/1471-2105/16/S5/S6 create a local MEDLINE search engine with indices cre- split its drug-SE pairs into two equal parts: training ated on both sentences and their corresponding parse dataset and testing dataset. For the KD approach, we trees. We have recently used this local search engine in first retrieved all sentences that contain at least one of our recent tasks of extracting disease-manifestation rela- the 10 drugs and at least one SE term from the SE lexi- tionships [19] and anticancer drug-SE pairs from MED- con (4,199 SE terms). We then classified these sentences LINE [21].

Comparing a Knowledge-Driven Approach to A

High-Throughput Analysis Suggests Differences in Journal False

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Event Extraction on Pubmed Scale Filip Ginter1*, Jari Björne1,2, Sampo Pyysalo3 from Workshop on Advances in Bio Text Mining Ghent, Belgium

A Comparison of Feature Selection and Classification Methods in DNA

Use and Mis-Use of Supplementary Material in Science Publications Mihai Pop1* and Steven L

Viewed: 8 Were Accepted As Full Papers, 4 As Poster Large Investment of Effort Needed to Develop Sufficiently Papers and One As a Demo Paper

Mining Locus Tags in Pubmed Central to Improve Microbial Gene Annotation Chris J Stubben* and Jean F Challacombe

Analyzing the Field of Bioinformatics with the Multi-Faceted Topic Modeling Technique Go Eun Heo1, Keun Young Kang1, Min Song1* and Jeong-Hoon Lee2

Semantic Web Applications and Tools for the Life Sciences: SWAT4LS 2010

Paperbot: Open-Source Web-Based Search and Metadata Organization of Scientific Literature Patricia Maraver1,2, Rubén Armañanzas1,2, Todd A

BMC Bioinformatics Biomed Central

Knowledge Discovery of Drug Data on the Example of Adverse Reaction Prediction Pinar Yildirim1*, Ljiljana Majnarić2, Ozgur Ilyas Ekmekci1, Andreas Holzinger3