©2011 Mark E. Sharp ALL RIGHTS RESERVED
Total Page:16
File Type:pdf, Size:1020Kb
©2011 Mark E. Sharp ALL RIGHTS RESERVED DIMENSIONS OF DRUG INFORMATION By MARK E. SHARP A Dissertation submitted to the Graduate School-New Brunswick Rutgers, The State University of New Jersey in partial fulfillment of the requirements for the degree of Doctor of Philosophy Graduate Program in Communication, Information and Library Studies written under the direction of Professor Nicholas Belkin and approved by ________________________ ________________________ ________________________ ________________________ New Brunswick, New Jersey January 2011 i ABSTRACT OF THE DISSERTATION Dimensions of Drug Information by MARK E. SHARP Dissertation Director Professor Nicholas Belkin The high number, heterogeneity, and inadequate integration of drug information resources constitute barriers to many drug information usage scenarios. In the biomedical domain there is a rich legacy of knowledge representation in ontology-like structures that allows us to connect this problem both to the very mature field of library and information science classification research and the very new field of ontology matching/merging (OM). We argue for a broad view of OM that makes room not only for the "pre-formal" phase/type of multi-ontology integration exemplified by RxNorm and the UMLS Metathesaurus, but also for an even earlier phase/type when "What is there?" in a domain has to deal with implicit and poorly structured "ontologies" that barely qualify as such. Such is the case in the drug domain. We introduce dimensions of drug information as an approach to early, pre-formal OM in the drug domain that draws inspiration and incorporates principles from facet analysis, domain analysis, and Semantic Web research on linked data and mashups. By surveying 23 publically available drug information resources, we identified 39 dimensions relevant to four drug (sub)domains - pharmacy, chemistry, biology, and clinical medicine - and mapped them to the resources An arbitrary four-domain, monohierarchical classification of the dimensions produced, by extension, a reasonable four- domain resource classification. Correspondence analysis and hierarchical cluster analysis also produced evidence of its partial validity. Detailed analysis of information on nine parent drug compounds from 15 resources refined this high-level dimensional mapping and identified hundreds of subdimensions which could be expressed as a six-level hierarchy. Based on these ii dimensions, we integrated this information in an experimental database and showed that it was useful (1) as a training set for automating the normalization of additional raw data from the same 15 sources, bringing the important goal of building an integrated, comprehensive (all drugs) database within reach, and (2) for satisfying a variety of use cases, some quite complex, derived from published literature representing the user types corresponding to our domain focus. iii Acknowledgements I would like to thank my dissertation director, Nicholas Belkin, for attracting me to the program, teaching me the fundamentals of library and information science, being the perfect role model for a successful researcher and committed educator, shepherding me through the dissertation process, many stimulating insights and discussions, and eleven years of unswerving support, understanding, and friendship. Special thanks also to the other members of my dissertation committee, Tefko Saracevic, Marija Dalbello, and Olivier Bodenreider, for their many contributions to this work, and to Dr. Bodenreider for sponsoring my guest researchship at the U.S. National Library of Medicine where much of it was done. I would also like to thank the members of my qualifying examination committee, Nina Wacholder, Paul Kantor, Alex Borgida, and Anselm Spoerri, for teaching me about natural language processing, information retrieval, ontologies, and information visualization, and for sponsoring my many pre-dissertation research projects, all of which contributed to this work. In addition, I would like to thank Craig Scott, Marie Radford, Xiaojun Yuan, and Lorraine Gray for facilitating my re-entry into the Ph.D. program after an absence, and Joan Chabrak for her support, encouragement, and friendship. I would also like to thank my employer, Merck & Co., Inc., for paying my tuition throughout my doctoral studies, and my many colleagues at Merck for their encouragement and interest, especially my managers Ann Jenkins, Judy Labovitz, Doris Schlichter, David Henderson, and Karen Marakoff, and my information science mentors, Patrick Perrin and Eric Minch. Finally, I would like to thank my family for their inspiration and support, especially my wife, Celia Sharp. iv Table of Contents Abstract............................................................................................................................................ii Acknowledgements.........................................................................................................................iv List of Figures.................................................................................................................................ix List of Tables ...................................................................................................................................x Chapter 1. Introduction ...................................................................................................................1 1.1 Problem Statement................................................................................................................1 1.2 Rationale...............................................................................................................................2 1.3 Research Questions and Strategy .........................................................................................4 Chapter 2. Literature Review..........................................................................................................7 2.1 Classification ........................................................................................................................7 2.2 Facet Analysis ......................................................................................................................9 2.3 Domain Analysis ................................................................................................................12 2.4 Ontologies...........................................................................................................................14 2.5 Bio-ontologies ....................................................................................................................16 2.6 Drug Ontologies .................................................................................................................17 2.7 Ontology Matching/Merging (OM)....................................................................................22 2.8 Semantic Web.....................................................................................................................25 2.9 Linked Data ........................................................................................................................26 2.10 Mashups............................................................................................................................27 2.11 Pharmaceutical Discovery Research.................................................................................30 2.12 Drug Information User and Resource Research ...............................................................36 2.13 Basis for Current Work.....................................................................................................37 Chapter 3. Methods.......................................................................................................................39 3.1 Q1: What are the Dimensions of Drug Information? .........................................................39 3.1.1 User warrant / domain focus. ......................................................................................39 v 3.1.2 Literary warrant...........................................................................................................39 3.1.3 Resource survey. .........................................................................................................40 3.1.4 Experimental database.................................................................................................44 3.2 Q2: Do Dimensions Lead to Valid Groupings of Resources?............................................51 3.2.1 Face validity................................................................................................................51 3.2.2 Correspondence analysis.............................................................................................51 3.2.3 Cluster analysis. ..........................................................................................................51 3.3 Q3: Can Dimensions Facilitate Integration/OM Tasks?.....................................................52 3.3.1 Classifying sources......................................................................................................52 3.3.2 Selecting sources appropriate to a given information need.........................................52 3.3.3 Pooling data from different sources. ...........................................................................52 3.3.3.1 Data reduction......................................................................................................52 3.3.3.2 Automatic