Methods in Literature-Based Drug Discovery
Total Page:16
File Type:pdf, Size:1020Kb
METHODS IN LITERATURE-BASED DRUG DISCOVERY Nancy C. Baker A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the School of Information and Library Science Chapel Hill 2010 Approved by: Bradley M. Hemminger Stephanie W. Haas Javed Mostafa Diane Pozefsky Alexander Tropsha ©2010 Nancy C. Baker ALL RIGHTS RESERVED ii ABSTRACT NANCY C. BAKER: Methods in Literature-based Drug Discovery (Under the direction of Bradley M. Hemminger) This dissertation work implemented two literature-based methods for predicting new therapeutic uses for drugs, or drug reprofiling (also known as drug repositioning or drug repurposing). Both methods used data stored in ChemoText, a repository of MeSH terms extracted from Medline records and created and designed to support drug discovery algorithms. The first method was an implementation of Swanson’s ABC paradigm that used explicit connections between disease, protein, and chemical annotations to find implicit connections between drugs and disease that could be potential new therapeutic drug treatments. The validation approach implemented in the ABC study divided the corpus into two segments based on a year cutoff. The data in the earlier or baseline period was used to create the hypotheses, and the later period data was used to validate the hypotheses. Ranking approaches were used to put the likeliest drug reprofiling candidates near the top of the hypothesis set. The approaches were successful at reproducing Swanson’s link between magnesium and migraine and at identifying other significant reprofiled drugs. iii The second literature-based discovery method used the patterns in side effect annotations to predict drug molecular activity, specifically 5-HT6 binding and dopamine antagonism. Following a study design adopted from QSAR experiments, side effect information for chemicals with known activity was input as binary vectors into classification algorithms. Models were trained on this data to predict the molecular activity. When the best validated models were applied to a large set of chemicals in a virtual screening step, they successfully identified known 5-HT6 binders and dopamine antagonists based solely on side effect profiles. Both studies addressed research areas relevant to current drug discovery, and both studies incorporated rigorous validation steps. For these reasons, the text mining methods presented here, in addition to the ChemoText repository, have the potential to be adopted in the computational drug discovery laboratory and integrated into existing toolsets. iv ACKNOWLEDGEMENTS I wish to thank my family for their steadfast support throughout this long research and writing process. My husband, Joshua, has provided support of every kind and helped to keep me balanced. My children Madeline, Natasha, Lily, Ella, and Verity have stood by me, even though they did not understand exactly what in the world I was doing. To my brothers and sisters Mende, Margaret, Melissa, Curry, Jim, and Tyler, I send a thank-you as well. They always asked the right questions and never mentioned the birthday cards I forgot to send over the last year. To all my friends – you know who you are – thank you for being there. I want to thank my mother-in-law, Thelma S. Baker, Ph.D. She got her degree at age 54 and has been a great example for me. Her other children, my sisters-in-law (and their respective spouses), have been a great source of support and encouragement. I could not have completed or even started this work without the support of my committee members: Brad Hemminger, Stephanie Haas, Javed Mostafa, Diane Pozefsky, and Alex Tropsha. Brad Hemminger and Stephanie Haas provided both leadership and structure and the essential encouragement along the way. Diane Pozefsky provided a rigorous review of the work and posed insightful questions. Thanks also to the faculty and staff of SILS, including Lara Bailey, who managed to get me registered each semester on time. I would like to thank my fellow SILS students, who over the years have provided kindness, help, encouragement, and great conversations. v I would like to give a special thanks to Alex Tropsha for his enthusiasm and support from the very beginning. And thanks as well to the members of Dr. Tropsha’s lab. The exciting work and collegial environment in the lab has been an inspiration. I would also like to recognize the sources of financial support that made this work possible. The one-year fellowship from the UNC Program in Bioinformatics and Computational Biology got me started. The NIH grant P20-HG003898 (through Alex Tropsha) has been a major source of support since then. The UNC School of Information and Library Science has also provided funding through research positions and the Lester Asheim Scholarship. Thank you, finally, to my mother and father. My mother taught me how to work hard, but she was (and still is) a great example of how to enjoy life. Because of my father, a pharmacist, I sent many hours dusting shelves full of ointments, elixirs, and pills, wondering how such things could have been discovered. vi TABLE OF CONTENTS LIST OF TABLES ................................................................................................................... xi LIST OF FIGURES ............................................................................................................... xiv LIST OF ABBREVIATIONS ................................................................................................. xv 1. RESEARCH GOALS AND BACKGROUND .................................................................... 1 1.1 Research questions and their significance ........................................................... 1 1.1.1 Motivation .................................................................................................... 2 1.1.2 Pilot Study .................................................................................................... 4 1.2 Background ......................................................................................................... 5 1.2.1 Chemical and biomedical literature .............................................................. 6 1.2.2 Information Extraction ............................................................................... 16 1.2.3 Text Mining ................................................................................................ 27 1.3 Conclusion ........................................................................................................ 40 2. PILOT STUDY ................................................................................................................... 41 2.1 Introduction ....................................................................................................... 41 2.2 Construction of ChemoText .............................................................................. 41 2.2.1 Corpus and Theory ..................................................................................... 41 vii 2.2.2 Analysis and Design .................................................................................. 44 2.2.3 Construction............................................................................................... 51 2.3 Drug Discovery Application ............................................................................ 52 2.3.1 Methods ...................................................................................................... 53 2.3.2 Results ........................................................................................................ 54 2.3.2 Evaluation of pilot study and next steps ..................................................... 63 3. EXTENDED IMPLEMENTATION OF SWANSON’S ABC METHODS ..................... 65 3.1 Introduction ....................................................................................................... 65 3.2 Overall Design.................................................................................................. 66 3.3 Methods ............................................................................................................ 67 3.3.1 Calculation of precision and recall ............................................................ 71 3.3.2 Calculation of ranking variables ................................................................ 71 3.4 Results .............................................................................................................. 75 3.5 Discussion ......................................................................................................... 83 3.5.1 Cystic Fibrosis ........................................................................................... 86 3.5.2 Psoriasis ..................................................................................................... 96 3.5.3 Migraine................................................................................................... 105 3.6 Conclusion ...................................................................................................... 110 4. PREDICTING DRUG MOLECULAR ACTIVITY FROM SIDE EFFECTS ................ 119 4.1 Introduction and Background .......................................................................... 119 viii 4.1.1 Previous Work .......................................................................................... 120 4.1.2 Data sources ............................................................................................