Linked Data in Drug Discovery

Linked Data Editor: Carole Goble • [email protected]

Michel Dumontier • Carleton University
David J. Wild • Indiana University

Drug discovery presents many challenges, but several linked data initiatives are under way to address the huge increase in the amount of data available from chemistry, biology, and drug discovery in the past two decades.

“Information is cheap. Understanding is expensive.” This simple but astute insight was recently made by Karl Fast in regard to the general problem of how we can best use the vast amounts of data now available in the world. However, in drug discovery, both information and understanding are expensive.

Drug discovery involves finding therapies (usually chemical compounds) that elicit certain desirable responses in the body without creating unacceptable, adverse side effects. Until the 1960s, this process was entirely empirical. It often involved examining plants and other natural substances with reported beneficial effects to identify in them the chemical compounds responsible for their effects. The most widely used and arguably most valuable drugs today are derived from this process: painkillers, such as aspirin, and antibiotics are two clear examples.

From the 1960s onward, increased understanding of the body’s molecular mechanisms, as well as the diseases that affect it, enabled a more mechanistic understanding of how drugs act, and began the era of “rational” drug discovery. This is still the prevailing paradigm and generally involves identifying a protein (target) involved in a disease state (for example, a protein involved in replicating cells might be implicated in cancer and thus form a potential drug target). This approach has had success stories (such as HIV drugs), but some apparently successful drugs have crashed out of the market due to unforeseen, rare side effects.

The 1990s saw an explosion, primarily in the pharmaceutical industry, in innovative (and expensive) experimental techniques that produce large amounts of data about chemical compounds, protein targets, genes, biological pathways and cells, and their role in how the body functions and in disease states. In the past decade, initiatives such as the Molecular Libraries Initiative [1], the US Environmental Protection Agency’s ToxCast program (www.epa.gov/ncct/toxcast/), and the Human Genome Project have brought this technology (and the resulting data deluge) into the public sphere. This effort represents a third phase of drug discovery that’s still ongoing and constitutes a rational investigation scaled up by orders of magnitude. The promise is that producing this data will result in new breakthroughs in drug therapies, particularly for significant diseases such as cardiovascular disease, cancer, and diabetes. Whereas previously a few hundred compounds might be tested for activity against a protein target, now hundreds of thousands can be tested.

However, the rational approach’s limitations remain: it focuses only narrowly on how a compound acts on a particular protein target and doesn’t consider the compound’s wider impact on the body, or the unexpected cascade effects of interfering with these targets. Thus, the recurring problems are those of efficacy (making drugs that work in the test tube work in the body) and safety (anticipating undesirable side effects).

These experimental techniques have resulted in large, siloed data repositories, both in the public sphere and internal within companies, in which the silos map to the particular experiment types.
For example, large public data sets such as PubChem Bioassay and ChEMBL pertain to how chemical compounds act on protein targets. Others, such as UniProt, pertain to the function and biological pathways of genes (and their protein targets); yet more pertain to drug side effects, gene regulation, clinical findings, and so on. (Table A in the Web appendix at http://doi.ieeecomputersociety.org/10.1109/MIC.2012.122 gives more information about the datasets and repositories mentioned in this article.)

In aggregate, these datasets offer a much more sophisticated understanding of the complex network of actions drugs have on the body. The Semantic Web is a critical enabling technology for this field. By semantically annotating and linking data, researchers can search, explore, and mine the large complex relationships of entities important to drug discovery (drugs, chemicals, proteins, genes, pathways, cells, diseases, and side effects) in an integrated fashion. Many researchers are exploring the possibilities inherent to such integration, including new scientific areas such as systems chemical biology [2,3].

Current Initiatives

Various initiatives are under way that use linked data in drug discovery, both in academia and in the pharmaceutical industry, and sometimes crossing both (such as the EU Innovative Medicines Initiative’s OpenPHACTS project; www.openphacts.org). Uptake of linked data in the pharmaceutical industry is ongoing but is currently at an early stage. Most companies’ information systems currently center on large relational databases, and companies routinely link public datasets into these databases (for instance, by creating separate tables for public datasets, sometimes cross-linked with internal data). However, these aren’t generally semantically annotated, which is required to integrate and make efficient use of the data [4].

Still, such integration has fueled numerous tools that can find data paths across datasets [5]. Several pharmaceutical companies are exploring true linked data and semantic methods, mainly using prototype RDF triple stores and semantic searching using commercial tools such as TopBraid (www.topquadrant.com), IO Informatics Sentient (www.io-informatics.com), and Franz Allegrograph (www.franz.com). However, such methods aren’t yet mainstream, and centralized repositories remain in relational format, generally not well linked to the “outside world.”

One of the earliest public collaborative efforts to develop ideas, standards, and projects around the use of linked data for pharmaceutical and clinical research arose from the W3C’s Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG). The HCLSIG is an open forum that puts executives, scientists, researchers, programmers, and policymakers together to work toward developing standards and demonstrate Semantic-Web-enabled biomedical solutions that support translational research. The HCLSIG has worked on several problems, including those pertaining to capturing scientific discourse, integrating life science and clinical data, providing guidelines for publishing proteomics and genomics data, and undertaking biomedical research.

One early HCLSIG demonstration involved exploring hypotheses related to Alzheimer’s disease over an integrated store of knowledge [6]. This effort focused on pathological information from SenseLab, neuronal circuitry from CoCoDa, receptor-ligand data from the PDBSP Ki database, biomolecular interactions from the Biomolecular Interaction Network Database (BIND), gene information from NCBI Gene, antibodies from the Antibody Directory, pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome, and terminology from NCI Metathesaurus and a growing collection of open biomedical ontologies (OBOs). A big part of this effort was delineating the methodology to marshal a heterogeneous collection of source data (flat-files, XML files, SQL databases, and so on) into RDF and dealing with the complexity that arises from mashing together disparate data sources with different types and relations. In the end, researchers can query the knowledge base for information that relates to hypotheses and clinical guidelines, molecular targets, antibodies, and mouse models, all of which contribute to advancing science and improving healthcare.

Delineating on and off drug targets and understanding drugs’ impact on a dynamical network is increasingly important in drug discovery. Recent work shows how researchers developed and used an HIV-focused mashup of biological resources (Affymetrix array, RefSeq, Gene, Online Mendelian Inheritance in Man [OMIM], the HIV-1 Human Protein Interaction Database [HHPID], PubMed, Medical Subject Headings [MeSH], and Gene Ontology) to identify a protein interaction network that emerges from significantly expressed genes during a time-course microarray in the first hours of an HIV infection of primary human macrophages that had or had not been treated with interferon, an antiviral product [7]. The paper demonstrates that the difference between the interferon-treated and non-treated networks identifies interferon-responsive elements, while an analysis of MeSH-enriched terms from associated publications suggests molecular and [...]

[...] KEGG pathways with adverse effects recorded in Drugbank. Chem2Bio2RDF is now part of the Linked Open Data cloud. Recent work has shown how researchers [...]

[...] for drug discovery. Anika Oellrich and colleagues report on identifying causal genes for an underlying disease by comparing the similarity of phenotypes arising from mouse models with those of the [...]

Published by the IEEE Computer Society. IEEE Internet Computing, November/December 2012. 1089-7801/12/$31.00 © 2012 IEEE
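The article describes marshaling heterogeneous source data (flat files, XML files, SQL databases) into RDF and then querying the resulting knowledge base. As a rough illustration only, the following Python sketch converts a tiny flat file into subject-predicate-object triples, serializes them as N-Triples, and answers one pattern query. Everything specific here is an assumption made for the example: the `http://example.org/drugdata/` namespace, the `targets` and `hasSideEffect` predicates, the helper names, and the placeholder compounds and proteins. None of it is taken from the HCLSIG demonstration or from any dataset named in the article, and a real pipeline would use an RDF library and a triple store rather than Python sets.

```python
import csv
import io

# Hypothetical namespace and predicates, invented for this sketch.
BASE = "http://example.org/drugdata/"
TARGETS = BASE + "targets"
HAS_SIDE_EFFECT = BASE + "hasSideEffect"

# Stand-in for one flat-file source; all names are placeholders, not real data.
FLAT_FILE = """compound,target,side_effect
compoundA,proteinX,effect one
compoundA,proteinY,
compoundB,proteinX,effect two
"""


def uri(kind, name):
    """Mint a URI for an entity from its kind and local name."""
    return f"{BASE}{kind}/{name.strip().replace(' ', '_')}"


def rows_to_triples(text):
    """Turn each flat-file row into (subject, predicate, object) triples."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = uri("compound", row["compound"])
        if row["target"]:
            triples.add((subject, TARGETS, uri("protein", row["target"])))
        if row["side_effect"]:
            triples.add((subject, HAS_SIDE_EFFECT, uri("effect", row["side_effect"])))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    triples = rows_to_triples(FLAT_FILE)
    print(to_ntriples(triples))
    # Pattern query: which compounds act on proteinX?
    for subject, _, _ in match(triples, p=TARGETS, o=uri("protein", "proteinX")):
        print(subject)
```

Because every source is reduced to the same triple shape, records from a second source (for example, a pathway table keyed on the same protein URIs) could be added to the same set and reached by the same `match` call, which is the kind of cross-dataset linking the article attributes to semantic annotation.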
