Exploring the dark genome: implications for precision medicine

Oprea, Tudor I.

Published in: Mammalian Genome

DOI: 10.1007/s00335-019-09809-0

Publication date: 2019

Document version Peer reviewed version

Citation for published version (APA): Oprea, T. I. (2019). Exploring the dark genome: implications for precision medicine. Mammalian Genome, 30(7-8), 192-200. https://doi.org/10.1007/s00335-019-09809-0


Mamm Genome. Author manuscript; available in PMC 2020 August 01. Published in final edited form as: Mamm Genome. 2019 August; 30(7-8): 192–200. doi:10.1007/s00335-019-09809-0.

Exploring the Dark Genome - Implications for Precision Medicine

Tudor I. Oprea 1,2,3,4

1. Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM, USA.
2. UNM Comprehensive Cancer Center, Albuquerque, NM, USA.
3. Department of Rheumatology and Inflammation Research, Institute of Medicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden.
4. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

Abstract

The increase in the number of both patients and healthcare practitioners who grew up using the Internet and computers (so-called “digital natives”) is likely to impact the practice of precision medicine, and requires novel platforms for data integration and mining, as well as contextualized information retrieval. The “Illuminating the Druggable Genome Knowledge Management Center” (IDG KMC) quantifies data availability from a wide range of chemical, biological and clinical resources, and has developed platforms that can be used to navigate understudied proteins (the “dark genome”) and their potential contribution to specific pathologies. Using the “Target Importance and Novelty Explorer” (TIN-X), we highlight the role of LRRC10 (a dark gene) in dilated cardiomyopathy. Combining mouse and human phenotype data leads to increased strength of evidence, which is discussed for 4 additional dark genes: SLX4IP and its role in glucose metabolism, the role of HSF2BP in coronary artery disease, the involvement of ELFN1 in attention deficit hyperactivity disorder, and the role of VPS13D in mouse neural tube development and its confirmed role in childhood onset movement disorders. The workflow and tools described here are aimed at guiding further experimental research, particularly within the context of precision medicine.

Conflict of Interest: Dr. Oprea was a former full time employee at AstraZeneca (1996–2002). He has received honoraria, or consulted for, Abbott, AstraZeneca, Chiron, Genentech, Infinity Pharmaceuticals, Merz Pharmaceuticals, Merck Darmstadt, Mitsubishi Tanabe, Novartis, Ono Pharmaceuticals, Pfizer, Roche, Sanofi, and Wyeth. His spouse was a full-time employee of AstraZeneca (2002–2014) and is a full time employee of Genentech Inc.

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

Navigating the Target Landscape by Illuminating the Druggable Genome

Data wrangling, distilling unrelated data elements into contextualized knowledge and the overall ability to rapidly process digital information are increasing demands placed on current healthcare professionals. At the same time, the number of “digital native” patients (i.e., patients who started to use computers/tablets/smart phones from an early age) is increasing. Indeed, when it comes to healthcare issues, more and more patients, especially “digital natives”, seek information via social media and web-based platforms and healthcare databases. It therefore seems necessary for healthcare professionals to embrace novel analytic technologies to integrate multi-faceted big data and translate them into patient benefits (Bezemer et al. 2019). This integration process faces unique technical, semantic, and ethical challenges (Seneviratne, Kahn, and Hernandez-Boussard 2019), challenges that could be overcome by emerging computational technologies rooted in phenotypic ontologies drawing from a multi-species context (Robinson, Mungall, and Haendel 2015). Therefore, there is an imperative to improve and streamline “big data” analytics technologies in the context of precision medicine. As both healthcare practitioner and patient categories are increasingly “digital native”, the patient-doctor relationship is likely to fundamentally change given the unfettered Internet access to healthcare information.

Biomedical advances have fostered the emergence of data science, a relatively novel discipline focused on new algorithms for data analytics and visualization (Berger and Schneck 2019), which ultimately warrants an informatics-oriented formalization of study design, interoperability and model development (Prosperi et al. 2018), specifically in the context of precision medicine (National Research Council et al. 2012). Defined as “prevention and treatment strategies that take individual variability into account” (Collins and Varmus 2015), precision medicine is widely regarded as the future of clinical practice, and is poised to take advantage of the large-scale integration of the contextualized knowledge emerging from genomic, phenomic and patient-centric databases.

Here, we briefly discuss the Illuminating the Druggable Genome Knowledge Management Center, IDG KMC (Oprea, Bologa, et al. 2018) project, and how its platform can be used to extract relevant data elements for precision medicine. IDG KMC extracts and processes expression and functional data related to proteins and genes, molecular probes such as small molecules, antibodies and approved drugs, small molecule bioactivities, genome-wide association studies, disease associations and drug indications and off-label uses (among other data types) into the Target Central Repository Database, TCRD (Nguyen et al. 2017); see Figure 1. TCRD organizes and structures information, thus mapping the current target landscape. Key elements from genomic and proteomic sources, in addition to literature, patents, clinical trials, drug labels and other information, are standardized, linked and associated with disease information, and exposed via the Pharos portal (https://pharos.nih.gov/index), and through a limited REST API TCRD (https://bit.ly/31lMT17). By bridging clinical, biological, chemical and genomic data, the IDG KMC integrates information and knowledge for over 20,000 protein-encoding genes by combining data science, informatics and computational biology to prioritize targets for further experimental evaluation and analysis by the broader scientific community.

Quantifying Data Availability

The TCRD/Pharos platform is well suited to address questions about the “druggable genome”, which Hopkins and Groom defined (Hopkins and Groom 2002) as the set of protein-encoding genes that can be therapeutically modulated by orally formulated drugs, using Lipinski’s “rule of five” criteria, which defined four physico-chemical property parameter sets enriched in orally available drugs (Lipinski et al. 1997). The intersection between disease-modifying genes and the “druggable genome” was estimated to be between 600 and 1,500 targets (Hopkins and Groom 2002). Indeed, the known druggable genome, categorized as Tclin, or targets via which approved drugs act, includes 602 human proteins (Santos et al. 2017), a number that increased by thirteen via drugs approved in 2018 (Ursu, Glick, and Oprea 2019). A second set of well-studied proteins, Tchem, encompasses proteins that lack mode-of-action associations with approved drugs, and are known to bind small molecules with high potency. Current thresholds are ≤ 30 nM for kinases, ≤ 100 nM for G-protein coupled and nuclear receptors, ≤ 10 μM for ion channels, and ≤ 1 μM for other target families (Oprea, Bologa, et al. 2018). Bioactivity values were extracted from ChEMBL, DrugCentral and the Guide to Pharmacology (Southan et al. 2016). The third Target Development Level (TDL) category, Tbio, includes proteins that have a confirmed Mendelian disease phenotype in OMIM (Amberger et al. 2009) or have Gene Ontology (Ashburner et al. 2000) “leaf” (lowest level) term annotations based on experimental evidence; or meet two of the following three conditions: a fractional publication count (Pafilis et al. 2013) above 5, three or more Gene “Reference Into Function” (GeneRIF) annotations (https://bit.ly/2WDE1oL), or 50 or more commercial antibodies, as counted in the Antibodypedia database (Kiermer 2008). The fourth TDL category, Tdark, also referred to as the “ignorome” (Pandey et al. 2014) or the “dark genome”, encompasses one in three human proteins that were manually curated at the primary sequence level in UniProt (UniProt Consortium 2015), yet do not meet any of the criteria for Tclin, Tchem or Tbio. Additional information concerning this category may be available from genome-wide association studies (GWAS), tissue and subcellular compartment location, dysregulation, IMPC mouse phenotype availability, etc. (see also Figure 2).
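The four TDL categories described above amount to a cascade of rules, which can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the TCRD implementation: the `TargetEvidence` record, its field names and the family labels are hypothetical, and the thresholds are taken from the text as quoted here.

```python
from dataclasses import dataclass
from typing import Optional

# Potency cutoffs in nM, as quoted in the text (10 uM = 10,000 nM; 1 uM = 1,000 nM).
# The family labels used as keys are illustrative, not TCRD identifiers.
POTENCY_CUTOFF_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,
}
DEFAULT_CUTOFF_NM = 1_000.0  # "other target families"


@dataclass
class TargetEvidence:
    """Hypothetical per-protein evidence summary (field names are assumptions)."""
    family: str = "other"
    has_approved_drug_moa: bool = False      # mode-of-action link to an approved drug
    best_potency_nm: Optional[float] = None  # most potent small-molecule activity
    has_omim_phenotype: bool = False         # confirmed Mendelian phenotype in OMIM
    has_experimental_go_leaf: bool = False   # experimental GO "leaf" annotation
    fractional_pub_count: float = 0.0
    gene_rif_count: int = 0
    antibody_count: int = 0


def classify_tdl(t: TargetEvidence) -> str:
    """Assign Tclin, Tchem, Tbio or Tdark, checking the categories in that order."""
    if t.has_approved_drug_moa:
        return "Tclin"
    cutoff = POTENCY_CUTOFF_NM.get(t.family, DEFAULT_CUTOFF_NM)
    if t.best_potency_nm is not None and t.best_potency_nm <= cutoff:
        return "Tchem"
    if t.has_omim_phenotype or t.has_experimental_go_leaf:
        return "Tbio"
    # Tbio also applies when at least two of the three bibliometric conditions hold.
    conditions_met = sum([
        t.fractional_pub_count > 5,
        t.gene_rif_count >= 3,
        t.antibody_count >= 50,
    ])
    return "Tbio" if conditions_met >= 2 else "Tdark"
```

For instance, `classify_tdl(TargetEvidence(family="kinase", best_potency_nm=25.0))` falls under the 30 nM kinase cutoff and is labeled Tchem, whereas a protein with no evidence beyond its sequence is labeled Tdark. Checking the categories in order reflects the description of Tdark as the set of proteins that meet none of the other criteria.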

Several target categories and their distribution according to TDLs are summarized in Table 1. Examining this distribution reveals several imbalances: most olfactory GPCRs are understudied (Tdark), and the majority of the transcription factors, solute carrier transporters, transferases, phosphatases and small GTPases lack chemical matter and drugs (Tbio and Tdark, respectively). Yet other target categories, such as G-protein coupled and nuclear receptors, ion channels, kinases, carbonic anhydrases and phosphodiesterases, are very well described, being primarily annotated as Tclin and Tchem, respectively.

To quantify the availability of data, information and knowledge, we illustrate the degree of data availability associated with proteins for 18 different types of data. Representing publications and patents, gene and protein expression data, availability of three-dimensional structures from X-ray, mouse phenotype associations and GWAS, protein-disease information as well as chemical bioactivity and drug data, these categories highlight the many types of experimental data that can be associated with any given protein (Figure 2). Given the varying degrees of data (and knowledge) generated on individual proteins, this view of total data type availability grouped by TDL may explain the knowledge deficit separating understudied (Tdark) proteins from the other categories. There are over 500 Tdark proteins that have similar data profiles to those of a Tchem or Tclin protein.


Exploring the Dark Genome

As outlined earlier, “dark” proteins (labeled as Tdark) are described at the primary sequence level and curated in UniProt (UniProt Consortium 2015), but relatively few studies have been published, by comparison to the other 3 TDL categories. As Figure 2 suggests, in addition to tissue location and expression data (almost always available), there may be data for orthology, inferred function via homology, regulation, disease associations, etc. It is safe to state that Tdark contains most unexplored therapeutic opportunities of the “druggable genome” (Oprea, Bologa, et al. 2018).

Evaluating target “druggability”, i.e., the ability of a protein/gene to be therapeutically modulated by medicines, is a complex process. For small molecule drug discovery, information concerning the presence of a putative binding site requires expertise in structural biology, preferably combined with computational and medicinal chemistry, all of which examine whether a protein can bind small molecules with high affinity and specificity (Hajduk, Huth, and Tse 2005), or protein ligandability (Surade and Blundell 2012). When developing therapeutic monoclonal antibodies (mAbs), a different set of informatics and data wrangling needs (Mould and Meibohm 2016) arises: in-depth understanding of safety, pharmacokinetics, and pharmacodynamics in order to avoid side effects such as the “cytokine storm” (Suntharalingam et al. 2006), in combination with biophysical, cell and tissue based evaluations of target-antibody interactions (Abbott, Damschroder, and Lowe 2014), often focused on epitope mapping (Clementi et al. 2013). mAbs are restricted in their therapeutic applications, as the targeted proteins need to be secreted or exposed on the cell surface. Antisense oligonucleotides (ASOs) and small interfering RNAs (siRNAs), which lower the expression levels of specifically targeted proteins by promoting degradation of the corresponding messenger RNAs (mRNAs), can also be highly specific with respect to target interactions, just like antibodies. Unlike mAbs, ASOs and siRNAs are not restricted to surface or circulating proteins, but similar safety and efficacy issues need to be addressed. All these categories of therapeutic intervention share the need for careful evaluation in the context of pathways and transient/permanent protein interactions (Nooren and Thornton 2003), as well as cellular networks (Wu, Ma, and Tan 2016). There is a growing body of literature for protein “druggability” prediction based on systems biology and machine learning (Kandoi, Acencio, and Lemke 2015). These models, however, tend to be “over-optimistic due to the oversimplified formulation of the drug-target prediction problem as a binary problem” (Kandoi, Acencio, and Lemke 2015). Thus, the first and foremost issue in need of resolution when addressing target selection for drug discovery is the strength of evidence regarding the association of a putative target and the disease of interest.

This is particularly relevant when exploring the dark genome in the context of precision medicine, as illustrated by the PCSK9 example. Mutations in PCSK9 (proprotein convertase subtilisin/kexin type 9) were identified (Abifadel et al. 2003) as one of the causes responsible for monogenic hypercholesterolemia (Rader, Cohen, and Hobbs 2003) in 2003, prior to which PCSK9 would have met the Tdark criteria. PCSK9 promotes the degradation of low-density lipoprotein receptor (LDLR) in intracellular acidic compartments (Poirier et al. 2008). Based on information retrieved from DrugCentral (http://drugcentral.org/), monoclonal antibodies such as evolocumab and alirocumab prevent circulating PCSK9 from binding to the LDLR, thus blocking PCSK9-mediated LDLR degradation and permitting LDLR to recycle back to the liver cell surface. The net result is an increase in the number of LDLRs available to clear LDL from the blood, thereby lowering LDL-cholesterol levels. Both evolocumab (as Repatha) and alirocumab (as Praluent) were approved in 2015 for hypercholesterolemia, which now places PCSK9 into the Tclin category.

One informatics-based platform that can assist with protein-disease ranking and visualization is Target Importance and Novelty Explorer, TIN-X (Cannon et al. 2017). TIN-X derives its information from PubMed (http://pubmed.gov), using Named Entity Recognition (NER) of gene/protein (Szklarczyk et al. 2019) and disease names (Pletscher-Frankild et al. 2015), via two bibliometric concepts: Novelty, which estimates the scarcity of publications about a protein target (see Eq. 1); and Importance, which estimates the strength of the association between that protein target and a specific disease (see Eq. 2). Here, Tk and Dk are the numbers of targets and diseases in abstract k, respectively; the summation runs over all publications that mention Target (i) and, for Importance, also Disease (j). TIN-X uses fractional counts to reflect strength of association: when a paper mentions three targets, each protein receives a one-third fractional count. Similar counts are applied for diseases. The web-based interface for TIN-X (https://newdrugtargets.org/) supports Disease browsing by category based on the Disease Ontology (Kibbe et al. 2015) hierarchy, as well as Target browsing. The system includes associations for the entire human proteome. Filtering by TDL, direct links to PubMed abstracts for each association, URL sharing of specific visualizations as well as data exporting are enabled.

$$N_i = \frac{1}{\sum_{k} \frac{1}{T_k}} \qquad \text{(Eq. 1)}$$

$$I_{ij} = \sum_{k} \frac{1}{T_k \cdot D_k} \qquad \text{(Eq. 2)}$$
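To make the fractional counting behind Eq. 1 and Eq. 2 concrete, the following Python sketch computes Novelty and Importance from a list of abstracts that have already been tagged with target and disease names. It is a minimal illustration and not the TIN-X code: the function name `tinx_scores` and the input layout (one pair of sets per abstract) are assumptions, and the three abstracts in the usage note are invented toy data.

```python
from collections import defaultdict


def tinx_scores(abstracts):
    """Compute Novelty (Eq. 1) and Importance (Eq. 2) from tagged abstracts.

    `abstracts` is an iterable of (targets, diseases) pairs, each a set of
    names found in one abstract k. This input layout is an assumption.
    """
    target_weight = defaultdict(float)  # sum over k of 1/T_k for each target i
    importance = defaultdict(float)     # sum over k of 1/(T_k * D_k) for each (i, j)
    for targets, diseases in abstracts:
        if not targets:
            continue  # abstracts without tagged targets contribute nothing
        t_k = len(targets)
        d_k = len(diseases)
        for target in targets:
            target_weight[target] += 1.0 / t_k
            for disease in diseases:
                importance[(target, disease)] += 1.0 / (t_k * d_k)
    # Novelty is the reciprocal of the fractional publication count.
    novelty = {target: 1.0 / w for target, w in target_weight.items()}
    return novelty, dict(importance)
```

With three toy abstracts, `[({"LRRC10"}, {"DCM"}), ({"LRRC10", "TTN", "MYH7"}, {"DCM"}), ({"TTN"}, {"DCM", "muscular dystrophy"})]`, the fractional count for LRRC10 is 1 + 1/3, so its Novelty is 0.75 and its Importance for DCM is 1 + 1/3. A target mentioned only once, alongside two others (MYH7 here), has a Novelty of 3, which illustrates how scarcely mentioned proteins score as more novel.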

For example, the process of exploring a “dark” protein and its strength of association with dilated cardiomyopathy (DCM) is illustrated in Figure 3, which highlights LRRC10, "Leucine-rich repeat-containing protein 10", a protein that is linked to six PubMed abstracts relevant to dilated cardiomyopathy. Interrogating IMPC data revealed no unusual heart-related phenotypes, be it heart weight, electrocardiographic signal delays or retinal blood vessel abnormalities (https://bit.ly/2XE5AL3). Whole exome sequencing carried out on a 6-week-old DCM patient and on her symptom-free parents, which identified a homozygous recessive variant of LRRC10 (I195T), revealed LRRC10 as an auxiliary subunit to cardiac L-type calcium channels (Woon et al. 2018). The I195T variant of LRRC10 has a very different channel gating function, which may explain its involvement in dilated cardiomyopathy. A Google Patents search (which includes PubMed) retrieved no patents filed on targeting LRRC10 (as a gene/protein) in cardiomyopathy (as of April 2019). Deleting the I195T LRRC10 variant via specific ASOs is likely to restore normal L-type channel function. It seems, therefore, reasonable to address this LRRC10 variant therapeutically in DCM patients.


Dark Genes with Significant IMPC Data

Another avenue of Tdark exploration with respect to precision medicine is the association of human diseases with animal phenotypes, using data collected by the IMPC. There are currently 3,485 human proteins/genes in TCRD (version 5.4) annotated with mouse phenotype data from IMPC. For comparison, genes annotated with data from the GWAS catalog (MacArthur et al. 2017), the DISEASES database (Pletscher-Frankild et al. 2015) and OrphaNet (Rath et al. 2012), a resource for rare diseases, are summarized in Table 2.

Of the 551 IMPC-annotated Tdark proteins, 396 (71.9%) are annotated with GWAS or human disease information from the DISEASES database or from OrphaNet for any disease/phenotype (see Supplementary File “TCRD_TdarkSummary” for details). Of these 396, 36 have annotations that match the observed mouse phenotype with human diseases (i.e., DISEASES or OrphaNet), with another 19 matching mouse phenotype with (human) GWA studies (55, or 10%, total; see complete list in the Supplementary File, “Confirmed_Phenotype_Association”). This apparent discrepancy, namely that only 55 out of 551 IMPC-annotated genes have matching human phenotypes, is likely due to two factors. First, the list is comprised of dark genes only, that is, of understudied genes/proteins. Second, a number of significant evolutionary differences (e.g., metabolic rate, life cycle, size, diet, etc.) between mouse and human biology are likely to cause significantly different responses in mice compared to humans, with respect to experimental interventions (Perlman 2016). While differences may prevail, here we focus on 4 understudied proteins where similar phenotypes were observed in mice and men.

Of the 19 genes that have matching GWAS and IMPC annotations, 12 match directly, with another 7 marked as “possible”. For example, Protein SLX4IP (SLX4IP) is annotated with the GWAS entry rs6131100, “Fasting blood glucose adjusted for BMI (body mass index)” (Southam et al. 2017), and with mouse phenotype MP:0005560, “decreased circulating glucose level”. While the significance of this direct match is currently unknown, the role of this protein in glucose metabolism warrants further investigation. Heat shock factor 2-binding protein (HSF2BP) is annotated with the GWAS entry rs60787346, “Coronary artery disease” (van der Harst and Verweij 2018), and with mouse phenotype MP:0001556, “increased circulating HDL cholesterol level”. High levels of HDL cholesterol, which are considered cardioprotective (Rye and Barter 2014), are observed in mice lacking the HSF2BP gene. It is possible that the HSF2BP gene plays a role in decreasing HDL cholesterol levels, thus playing a pathophysiological role in coronary artery disease. Currently, this association and its therapeutic potential have not been investigated.

Of the 36 genes that have matching DISEASES database (i.e., from text mining) and IMPC annotations, 24 are direct matches, whereas another 12 are marked as “possible”. For example, Protein ELFN1 (ELFN1) is associated with attention deficit hyperactivity disorder, ADHD (Tomioka et al. 2014) in the DISEASES database, and with mouse phenotypes MP:0001399, “hyperactivity” and MP:0002757, “decreased vertical activity”. ELFN1 mutations cause hyperactivity in mice (Dolan and Mitchell 2013), and are associated with epilepsy and ADHD in humans (Tomioka et al. 2014). This direct match suggests a potential therapeutic role for this protein in ADHD (strengthened by IMPC data) and, possibly, epilepsy. The vacuolar protein sorting-associated protein 13D (VPS13D) is annotated with chorea-acanthocytosis (Velayos-Baeza et al. 2004), spastic ataxia (Gauthier et al. 2018) and paraplegia (Seong et al. 2018), among other human diseases, and with the mouse phenotype MP:0003720, “abnormal neural tube closure”. VPS13D promotes mitochondrial clearance by mitochondrial autophagy and by positively regulating mitochondrial fission (Anding et al. 2018). The absence of this gene causes abnormal neural tube closure in VPS13D knockout mice. It is, therefore, possible that the VPS13D gene plays a pathophysiological role in childhood onset movement disorders such as spastic ataxia and paraplegia. Currently, there are no reports concerning the therapeutic role of this protein, and interest in VPS13D appears to be low (Tdark). Given the association between VPS13D and several (rare) neurological disorders, VPS13D should be considered as a therapeutic target for these disorders.
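The cross-species matching discussed in this section can be sketched as a lookup of (mouse phenotype, human trait) pairs that a curator has judged to correspond. The Python below is a hedged illustration only: the function names `matching_evidence` and `percent`, the dictionary layout and the `related` pair set are assumptions, and the text does not state how “direct” and “possible” matches were actually assigned.

```python
def matching_evidence(mouse_phenotypes, human_traits, related):
    """Return genes whose mouse phenotype pairs with one of their human traits.

    mouse_phenotypes: dict gene -> set of mouse phenotype IDs (e.g. from IMPC)
    human_traits:     dict gene -> set of human trait or disease names
    related:          set of (phenotype ID, trait) pairs judged to correspond;
                      assumed to come from manual curation
    """
    hits = {}
    for gene, mp_terms in mouse_phenotypes.items():
        pairs = sorted(
            (mp, trait)
            for mp in mp_terms
            for trait in human_traits.get(gene, set())
            if (mp, trait) in related
        )
        if pairs:
            hits[gene] = pairs
    return hits


def percent(part, total, digits=1):
    """Share of `part` in `total`, as a rounded percentage."""
    return round(100.0 * part / total, digits)


# Toy input built from two examples in the text; GENE_X is a made-up placeholder.
mouse = {
    "SLX4IP": {"MP:0005560"},
    "HSF2BP": {"MP:0001556"},
    "GENE_X": {"MP:0001399"},
}
human = {
    "SLX4IP": {"Fasting blood glucose adjusted for BMI"},
    "HSF2BP": {"Coronary artery disease"},
    "GENE_X": {"Coronary artery disease"},
}
related = {
    ("MP:0005560", "Fasting blood glucose adjusted for BMI"),
    ("MP:0001556", "Coronary artery disease"),
}
confirmed = matching_evidence(mouse, human, related)  # keys: SLX4IP, HSF2BP
```

The same helper reproduces the proportions quoted above: `percent(396, 551)` gives 71.9 and `percent(55, 551, digits=0)` gives 10.0. Keeping the curated pair set separate from the annotation tables makes it explicit that a shared gene is not enough; the mouse and human observations must also describe a comparable phenotype.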

Summary

Scientific progress is often measured by accumulation of knowledge, as summarized by Edward A. Feigenbaum’s Knowledge Principle: “A system exhibits intelligent understanding and action at a high level of competence primarily because of the specific knowledge that it can bring to bear: The concepts, facts, representations, methods, models, metaphors, and heuristics about its domain of endeavor” (Lenat and Feigenbaum 1991). A survey of the literature for a wide variety of newly sequenced target families indicates that the process of druggable target selection is rather conservative, with limited progress observed (Edwards et al. 2011). Indeed, target selection in drug discovery is a complex process (Knowles and Gromo 2003), one that needs to find the appropriate balance between investors and other financial stakeholders, patients and doctors (consumers), regulators and political factors (potential funders), as well as the research community in general.

Despite the launch of the IDG Consortium (Rodgers et al. 2018) and the OpenTargets platform (Koscielny et al. 2017), navigating the dark genome and its funding remain controversial (Stoeger et al. 2018; Oprea, Jan, et al. 2018). The IDG KMC addresses this from a data and knowledge integration perspective, by capturing key data elements relating proteins and genes to diseases, pathways, chemicals, bioactivities, drug discovery and clinical databases and documents, supported by innovative knowledge management tools such as TIN-X and user interfaces via Pharos. Within the IDG KMC workflow, the use of these tools and databases can be focused on the dark genome (Tdark) category and its potential benefit for precision medicine. In this context, we addressed the association of human disease and IMPC phenotypes (551 protein/genes, as detailed in Supplementary Material). From a list of 55 genes with confirmed associations (provided in the Supplementary Material), we highlighted four that have matching GWAS/DISEASES evidence and mouse phenotype observations, despite the apparent knowledge deficit. Focusing on Tdark genes/proteins that have clear human disease and corresponding mouse phenotype associations could therefore be a rational strategy for identifying novel therapeutic targets. The IDG platform and associated tools are aimed at encouraging critical thinking and guiding additional experimental and clinical research, particularly within the context of therapeutic target identification and validation for precision medicine.


Supplementary Material

Refer to Web version on PubMed Central for supplementary material.

Acknowledgments

This work was supported by NIH grants U54CA189205, U24CA224370 (for IDG KMC) and U24TR002278 (for IDG RDOC).

References

Abbott W. Mark, Damschroder Melissa M., and Lowe David C. 2014 “Current Approaches to Fine Mapping of Antigen-Antibody Interactions.” Immunology 142 (4): 526–35. [PubMed: 24635566]

Abifadel Marianne, Varret Mathilde, Rabès Jean-Pierre, Allard Delphine, Ouguerram Khadija, Devillers Martine, Cruaud Corinne, et al. 2003 “Mutations in PCSK9 Cause Autosomal Dominant Hypercholesterolemia.” Nature Genetics 34 (2): 154–56. [PubMed: 12730697]

Amberger Joanna, Bocchini Carol A., Scott Alan F., and Hamosh Ada. 2009 “McKusick’s Online Mendelian Inheritance in Man (OMIM).” Nucleic Acids Research 37 (Database issue): D793–96. [PubMed: 18842627]

Anding Allyson L., Wang Chunxin, Chang Tsun-Kai, Sliter Danielle A., Powers Christine M., Hofmann Kay, Youle Richard J., and Baehrecke Eric H. 2018 “Vps13D Encodes a Ubiquitin-Binding Protein That Is Required for the Regulation of Mitochondrial Size and Clearance.” Current Biology: CB 28 (2): 287–95.e6. [PubMed: 29307555]

Ashburner Michael, Ball Catherine A., Blake Judith A., Botstein David, Butler Heather, Cherry J. Michael, Davis Allan P., et al. 2000 “Gene Ontology: Tool for the Unification of Biology.” Nature Genetics 25 (1): 25–29. [PubMed: 10802651]

Berger Kavita M., and Schneck Phyllis A. 2019 “National and Transnational Security Implications of Asymmetric Access to and Use of Biological Data.” Frontiers in Bioengineering and Biotechnology 7 (February): 21. [PubMed: 30859099]

Berman Helen M., Westbrook John, Feng Zukang, Gilliland Gary, Bhat T. N., Weissig Helge, Shindyalov Ilya N., and Bourne Philip E. 2000 “The Protein Data Bank.” Nucleic Acids Research 28 (1): 235–42. [PubMed: 10592235]

Bezemer Tim, de Groot Mark Ch, Blasse Enja, Ten Berg Maarten J., Kappen Teus H., Bredenoord Annelien L., van Solinge Wouter W., Hoefer Imo E., and Haitjema Saskia. 2019 “A Human(e) Factor in Clinical Decision Support Systems.” Journal of Medical Internet Research 21 (3): e11732. [PubMed: 30888324]

Cannon Daniel C., Yang Jeremy J., Mathias Stephen L., Ursu Oleg, Mani Subramani, Waller Anna, Schürer Stephan C., et al. 2017 “TIN-X: Target Importance and Novelty Explorer.” Bioinformatics, 4. doi:10.1093/bioinformatics/btx200.

Clementi Nicola, Mancini Nicasio, Castelli Matteo, Clementi Massimo, and Burioni Roberto. 2013 “Characterization of Epitopes Recognized by Monoclonal Antibodies: Experimental Approaches Supported by Freely Accessible Bioinformatic Tools.” Drug Discovery Today 18 (9-10): 464–71. [PubMed: 23178804]

Collins Francis S., and Varmus Harold. 2015 “A New Initiative on Precision Medicine.” The New England Journal of Medicine 372 (9): 793–95. [PubMed: 25635347]

Koscielny Gautier, Yaikhom Gagarine, Iyer Vivek, Meehan Terry F., Morgan Hugh, Atienza-Herrero Julian, et al. 2014 “The International Mouse Phenotyping Consortium Web Portal, a Unified Point of Access for Knockout Mice and Related Phenotyping Data.” Nucleic Acids Research 42 (Database issue): D802–809. [PubMed: 24194600]

Dolan Jackie, and Mitchell Kevin J. 2013 “Mutation of Elfn1 in Mice Causes Seizures and Hyperactivity.” PloS One 8 (11): e80491. [PubMed: 24312227]

Edwards Aled M., Isserlin Ruth, Bader Gary D., Frye Stephen V., Willson Timothy M., and Yu Frank H. 2011 “Too Many Roads Not Taken.” Nature 470 (7333): 163–65. [PubMed: 21307913]


Gaulton Anna, Hersey Anne, Nowotka Michał, Bento A. Patrícia, Chambers Jon, Mendez David, Mutowo Prudence, et al. 2017 “The ChEMBL Database in 2017.” Nucleic Acids Research 45 (D1): D945–54. [PubMed: 27899562]

Gauthier Julie, Meijer Inge A., Lessel Davor, Mencacci Niccolò E., Krainc Dimitri, Hempel Maja, Tsiakas Konstantinos, et al. 2018 “Recessive Mutations in VPS13D Cause Childhood Onset Movement Disorders.” Annals of Neurology 83 (6): 1089–95. [PubMed: 29518281]

Hajduk Philip J., Huth Jeffrey R., and Tse Christin. 2005 “Predicting Protein Druggability.” Drug Discovery Today 10 (23-24): 1675–82. [PubMed: 16376828]

Harst Pim van der, and Verweij Niek. 2018 “Identification of 64 Novel Genetic Loci Provides an Expanded View on the Genetic Architecture of Coronary Artery Disease.” Circulation Research 122 (3): 433–43. [PubMed: 29212778]

Hopkins Andrew L., and Groom Colin R. 2002 “The Druggable Genome.” Nature Reviews. Drug Discovery 1 (9): 727–30. [PubMed: 12209152]

Kandoi Gaurav, Acencio Marcio L., and Lemke Ney. 2015 “Prediction of Druggable Proteins Using Machine Learning and Systems Biology: A Mini-Review.” Frontiers in Physiology 6 (December): 366. [PubMed: 26696900]

Kibbe Warren A., Arze Cesar, Felix Victor, Mitraka Elvira, Bolton Evan, Fu Gang, Mungall Christopher J., et al. 2015 “Disease Ontology 2015 Update: An Expanded and Updated Database of Human Diseases for Linking Biomedical Knowledge through Disease Data.” Nucleic Acids Research 43 (Database issue): D1071–78. [PubMed: 25348409]

Kiermer Veronique. 2008 “Antibodypedia.” Nature Methods 5 (10): 860–61.

Knowles Jonathan, and Gromo Gianni. 2003 “Target Selection in Drug Discovery.” Nature Reviews. Drug Discovery 2 (1): 63–69. [PubMed: 12509760]

Koscielny Gautier, An Peter, Carvalho-Silva Denise, Cham Jennifer A., Fumis Luca, Gasparyan Rippa, Hasan Samiul, et al. 2017 “Open Targets: A Platform for Therapeutic Target Identification and Validation.” Nucleic Acids Research 45 (D1): D985–94. [PubMed: 27899665]

Lenat Douglas B., and Feigenbaum Edward A. 1991 “On the Thresholds of Knowledge.” Artificial Intelligence 47: 185–250.

Lin Yu, Mehta Saurabh, Küçük-McGinty Hande, Turner John Paul, Vidovic Dusica, Forlin Michele, Koleti Amar, et al. 2017 “Drug Target Ontology to Classify and Integrate Drug Discovery Data.” Journal of Biomedical Semantics 8 (1): 50. [PubMed: 29122012]

Lipinski Christopher A., Lombardo Franco, Dominy Beryl W., and Feeney Paul J. 1997 “Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings.” Advanced Drug Delivery Reviews 23 (1-3): 3–25.

MacArthur Jacqueline, Bowler Emily, Cerezo Maria, Gil Laurent, Hall Peggy, Hastings Emma, Junkins Heather, et al. 2017 “The New NHGRI-EBI Catalog of Published Genome-Wide Association Studies (GWAS Catalog).” Nucleic Acids Research 45 (D1): D896–901. [PubMed: 27899670]

McMurry Julie A., Köhler Sebastian, Washington Nicole L., Balhoff James P., Borromeo Charles, Brush Matthew, Carbon Seth, et al. 2016 “Navigating the Phenotype Frontier: The Monarch Initiative.” Genetics 203 (4): 1491–95. [PubMed: 27516611]

Mould Diane R., and Meibohm Bernd. 2016 “Drug Development of Therapeutic Monoclonal Antibodies.” BioDrugs: Clinical Immunotherapeutics, Biopharmaceuticals and Gene Therapy 30 (4): 275–93.

National Research Council, Division on Earth and Life Studies, Board on Life Sciences, and Committee on A Framework for Developing a New Taxonomy of Disease. 2012 Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. National Academies Press.

Nguyen Dac-Trung, Mathias Stephen, Bologa Cristian, Brunak Soren, Fernandez Nicolas, Gaulton Anna, Hersey Anne, et al. 2017 “Pharos: Collating Protein Information to Shed Light on the Druggable Genome.” Nucleic Acids Research 45 (D1): D995–1002. [PubMed: 27903890]

Nooren Irene M. A., and Thornton Janet M. 2003 “Diversity of Protein-Protein Interactions.” The EMBO Journal 22 (14): 3486–92. [PubMed: 12853464]


Oprea Tudor I., Bologa Cristian G., Brunak Søren, Campbell Allen, Gan Gregory N., Gaulton Anna, Gomez Shawn M., et al. 2018 “Unexplored Therapeutic Opportunities in the Human Genome.” Nature Reviews. Drug Discovery 17 (5): 377.
Oprea Tudor I., Jan Lily, Johnson Gary L., Roth Bryan L., Ma’ayan Avi, Schürer Stephan, Shoichet Brian K., Sklar Larry A., and McManus Michael T.. 2018 “Far Away from the Lamppost.” PLoS Biology 16 (12): e3000067. [PubMed: 30532236]
Pafilis Evangelos, Frankild Sune P., Fanini Lucia, Faulwetter Sarah, Pavloudi Christina, Vasileiadou Aikaterini, Arvanitidis Christos, and Jensen Lars Juhl. 2013 “The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.” PloS One 8 (6): e65390. [PubMed: 23823062]
Pandey Ashutosh K., Lu Lu, Wang Xusheng, Homayouni Ramin, and Williams Robert W.. 2014 “Functionally Enigmatic Genes: A Case Study of the Brain Ignorome.” PloS One 9 (2): e88889. [PubMed: 24523945]
Pletscher-Frankild Sune, Pallejà Albert, Tsafou Kalliopi, Binder Janos X., and Jensen Lars Juhl. 2015 “DISEASES: Text Mining and Data Integration of Disease-Gene Associations.” Methods 74 (March): 83–89. [PubMed: 25484339]
Poirier Steve, Mayer Gaetan, Benjannet Suzanne, Bergeron Eric, Marcinkiewicz Jadwiga, Nassoury Nasha, Mayer Harald, Nimpf Johannes, Prat Annik, and Seidah Nabil G.. 2008 “The Proprotein Convertase PCSK9 Induces the Degradation of Low Density Lipoprotein Receptor (LDLR) and Its Closest Family Members VLDLR and ApoER2.” The Journal of Biological Chemistry 283 (4): 2363–72. [PubMed: 18039658]
Prosperi Mattia, Min Jae S., Bian Jiang, and Modave François. 2018 “Big Data Hurdles in Precision Medicine and Precision Public Health.” BMC Medical Informatics and Decision Making 18 (1): 139. [PubMed: 30594159]
Perlman Robert L. 2016 Mouse models of human disease: An evolutionary perspective. Evol. Med. Public Health 2016(1):170–176. [PubMed: 27121451]
Rader Daniel J., Cohen Jonathan, and Hobbs Helen H.. 2003 “Monogenic Hypercholesterolemia: New Insights in Pathogenesis and Treatment.” The Journal of Clinical Investigation 111 (12): 1795–1803. [PubMed: 12813012]
Rath Ana, Olry Annie, Dhombres Ferdinand, Brandt Maja Miličić, Urbero Bruno, and Ayme Segolene. 2012 “Representation of Rare Diseases in Health Information Systems: The Orphanet Approach to Serve a Wide Range of End Users.” Human Mutation 33 (5): 803–8. [PubMed: 22422702]
Robinson Peter N., Mungall Christopher J., and Haendel Melissa. 2015 “Capturing Phenotypes for Precision Medicine.” Cold Spring Harbor Molecular Case Studies 1 (1): a000372. [PubMed: 27148566]
Rodgers Griffin, Austin Christopher, Anderson James, Pawlyk Aaron, Colvis Christine, Margolis Ronald, and Baker Jenna. 2018 “Glimmers in Illuminating the Druggable Genome.” Nature Reviews. Drug Discovery 17 (5): 301–2.
Rouillard Andrew D., Gundersen Gregory W., Fernandez Nicolas F., Wang Zichen, Monteiro Caroline D., McDermott Michael G., and Ma’ayan Avi. 2016 “The Harmonizome: A Collection of Processed Datasets Gathered to Serve and Mine Knowledge about Genes and Proteins.” Database: The Journal of Biological Databases and Curation 2016 (July). 10.1093/database/baw100.
Rye Kerry-Anne, and Barter Philip J.. 2014 “Cardioprotective Functions of HDLs.” Journal of Lipid Research 55 (2): 168–79. [PubMed: 23812558]
Santos Rita, Ursu Oleg, Gaulton Anna, Bento A. Patrícia, Donadi Ramesh S., Bologa Cristian G., Karlsson Anneli, et al. 2017 “A Comprehensive Map of Molecular Drug Targets.” Nature Reviews. Drug Discovery 16 (1): 19–34. [PubMed: 27910877]
Seneviratne Martin G., Kahn Michael G., and Hernandez-Boussard Tina. 2019 “Merging Heterogeneous Clinical Data to Enable Knowledge Discovery.” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 24: 439–43. [PubMed: 30864344]
Seong Eunju, Insolera Ryan, Dulovic Marija, Kamsteeg Erik-Jan, Trinh Joanne, Brüggemann Norbert, Sandford Erin, et al. 2018 “Mutations in VPS13D Lead to a New Recessive Ataxia with Spasticity and Mitochondrial Defects.” Annals of Neurology 83 (6): 1075–88. [PubMed: 29604224]


Southam Lorraine, Gilly Arthur, Süveges Dániel, Farmaki Aliki-Eleni, Schwartzentruber Jeremy, Tachmazidou Ioanna, Matchan Angela, et al. 2017 “Whole Genome Sequencing and Imputation in Isolated Populations Identify Genetic Associations with Medically-Relevant Complex Traits.” Nature Communications 8 (May): 15606.
Southan Christopher, Sharman Joanna L., Benson Helen E., Faccenda Elena, Pawson Adam J., Alexander Stephen P. H., Buneman O. Peter, et al. 2016 “The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: Towards Curated Quantitative Interactions between 1300 Protein Targets and 6000 Ligands.” Nucleic Acids Research 44 (D1): D1054–68. [PubMed: 26464438]
Stoeger Thomas, Gerlach Martin, Morimoto Richard I., and Nunes Amaral Luís A.. 2018 “Large-Scale Investigation of the Reasons Why Potentially Important Genes Are Ignored.” PLoS Biology 16 (9): e2006643. [PubMed: 30226837]
Suntharalingam Ganesh, Perry Meghan R., Ward Stephen, Brett Stephen J., Castello-Cortes Andrew, Brunner Michael D., and Panoskaltsis Nicki. 2006 “Cytokine Storm in a Phase 1 Trial of the Anti-CD28 Monoclonal Antibody TGN1412.” The New England Journal of Medicine 355 (10): 1018–28. [PubMed: 16908486]
Surade Sachin, and Blundell Tom L.. 2012 “Structural Biology and Drug Discovery of Difficult Targets: The Limits of Ligandability.” Chemistry & Biology 19 (1): 42–50. [PubMed: 22284353]
Szklarczyk Damian, Gable Annika L., Lyon David, Junge Alexander, Wyder Stefan, Huerta-Cepas Jaime, Simonovic Milan, et al. 2019 “STRING v11: Protein-Protein Association Networks with Increased Coverage, Supporting Functional Discovery in Genome-Wide Experimental Datasets.” Nucleic Acids Research 47 (D1): D607–13. [PubMed: 30476243]
“Target Importance and Novelty Explorer (TIN-X).” 2014 TIN-X. December 14, 2014 http://newdrugtargets.org/.
Tomioka Naoko H., Yasuda Hiroki, Miyamoto Hiroyuki, Hatayama Minoru, Morimura Naoko, Matsumoto Yoshifumi, Suzuki Toshimitsu, et al. 2014 “Elfn1 Recruits Presynaptic mGluR7 in Trans and Its Loss Results in Seizures.” Nature Communications 5 (July): 4501.
UniProt Consortium. 2015 “UniProt: A Hub for Protein Information.” Nucleic Acids Research 43 (Database issue): D204–12. [PubMed: 25348405]
Ursu Oleg, Glick Meir, and Oprea Tudor. 2019 “Novel Drug Targets in 2018.” Nature Reviews. Drug Discovery March, 10.1038/d41573-019-00052-5.
Ursu Oleg, Holmes Jayme, Bologa Cristian G., Yang Jeremy J., Mathias Stephen L., Stathias Vasileios, Nguyen Dac-Trung, Schürer Stephan, and Oprea Tudor. 2019 “DrugCentral 2018: An Update.” Nucleic Acids Research 47 (D1): D963–70. [PubMed: 30371892]
Ursu Oleg, Holmes Jayme, Knockel Jeffrey, Bologa Cristian G., Yang Jeremy J., Mathias Stephen L., Nelson Stuart J., and Oprea Tudor I.. 2017 “DrugCentral: Online Drug Compendium.” Nucleic Acids Research 45 (D1): D932–39.
Velayos-Baeza Antonio, Vettori Andrea, Copley Richard R., Dobson-Stone Carol, and Monaco AP. 2004 “Analysis of the Human VPS13 Gene Family.” Genomics 84 (3): 536–49. [PubMed: 15498460]
Watkins Xavier, Garcia Leyla J., Pundir Sangya, Martin Maria J., and UniProt Consortium. 2017 “ProtVista: Visualization of Protein Sequence Annotations.” Bioinformatics 33 (13): 2040–41. [PubMed: 28334231]
Woon Marites T., Long Pamela A., Reilly Louise, Evans Jared M., Keefe Alexis M., Lea Martin R., Beglinger Carl J., et al. 2018 “Pediatric Dilated Cardiomyopathy-Associated LRRC10 (Leucine-Rich Repeat-Containing 10) Variant Reveals LRRC10 as an Auxiliary Subunit of Cardiac L-Type Ca2+ Channels.” Journal of the American Heart Association 7 (3), 10.1161/JAHA.117.006428.
Wu Fan, Ma Cong, and Tan Cheemeng. 2016 “Network Motifs Modulate Druggability of Cellular Targets.” Scientific Reports 6 (November): 36626. [PubMed: 27824147]


Figure 1. Workflow of the IDG KMC. The 3 horizontal lanes (left) summarize specific areas of data and knowledge integration within IDG KMC, leading to TCRD (“Target Central”). Organized by degree of automation, these areas include the Harmonizome (Rouillard et al. 2016), STRING (Szklarczyk et al. 2019), the Monarch (McMurry et al. 2016) Disease Ontology (MONDO), the DISEASES database (Pletscher-Frankild et al. 2015), and chemical bioactivity data extraction from patents in ChEMBL (Gaulton et al. 2017), in addition to IDG-specific activities such as the Drug Target Ontology, DTO (Lin et al. 2017) and DrugCentral (Ursu et al. 2019, 2017). The knowledge access portal, Pharos, is used to query TCRD and to highlight lists of proteins/genes for potential prioritization. Targets on the prioritization list are typically forwarded to experimental centers such as the International Mouse Phenotyping Consortium, IMPC (Koscielny et al. 2014), the IDG Data and Resource Generation Centers, DRGCs (Rodgers et al. 2018), and other consortia, with new experimental data from IDG being deposited directly into TCRD/Pharos. Some of the visualization features embedded into Pharos include 3D protein structures from the Protein Data Bank, PDB (Berman et al. 2000) and protein sequence features from ProtVista (Watkins et al. 2017). IDG RDOC coordinates tasks across centers, with a focus on outreach and training. More information about the IDG Consortium (Rodgers et al. 2018) is available at the IDG website (https://druggablegenome.net/).


Figure 2. Quantification of the data availability for individual proteins, grouped by TDL. The data availability score is the count of available data types associated with each protein. Median and standard deviation values are as follows: 7±3.23 (Tdark), 11±1.9 (Tbio), 14±1.56 (Tchem) and 14±1.65 (Tclin), respectively. See text for details.
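
As an illustration of how a score of this kind can be tallied, the following minimal Python sketch counts the data types available for each protein and reports the median and standard deviation per TDL. The protein identifiers, the data-type names and the choice of the sample standard deviation are assumptions made for this example; they are not taken from TCRD.

```python
from collections import defaultdict
from statistics import median, stdev

# Hypothetical toy input: protein -> (TDL, set of data types with at least one record).
# Identifiers and data-type names are illustrative only.
proteins = {
    "P1": ("Tdark", {"expression", "antibody"}),
    "P2": ("Tdark", {"expression", "antibody", "GWAS", "orthologs"}),
    "P3": ("Tdark", {"expression", "antibody", "GWAS", "orthologs", "pathways", "PPI"}),
    "P4": ("Tclin", {"expression", "antibody", "drugs"}),
    "P5": ("Tclin", {"expression", "GWAS", "drugs"}),
}

def availability_scores(proteins):
    """Group the per-protein count of available data types by TDL."""
    scores = defaultdict(list)
    for tdl, data_types in proteins.values():
        scores[tdl].append(len(data_types))
    return scores

def summarize(scores):
    """Return {TDL: (median, sample standard deviation)} of the scores."""
    return {tdl: (median(vals), stdev(vals)) for tdl, vals in scores.items()}

summary = summarize(availability_scores(proteins))
for tdl, (med, sd) in summary.items():
    print(f"{tdl}: {med}±{sd:.2f}")
```

Each TDL needs at least two proteins for the sample standard deviation to be defined; `statistics.pstdev` could be substituted if the population form is intended.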


Figure 3. Using TIN-X to explore dark genes. The left-hand panel shows the disease ontology hierarchy, which can be used to explore diseases that share similar causes/symptoms. The proteins that are associated with the left-panel highlighted disease, as processed using text mining, are plotted on the Importance-Novelty axes (logarithmic scale). Proteins with stronger associations are in the upper part of the plot, while proteins with a higher number of publications are on the left side of the plot. Protein-disease associations of immediate interest are usually placed on the upper right boundary of the TIN-X plot, representing the “Pareto frontier” of non-dominated solutions to the multi-objective optimization maximizing both Importance and Novelty (Cannon et al. 2017). See text for additional details.
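
The notion of a Pareto frontier of non-dominated solutions can be made concrete with a small Python sketch. It assumes each protein-disease association has already been reduced to a pair of Importance and Novelty scores; the protein labels and score values are invented for illustration, and this is not the TIN-X implementation. A point is kept when no other point is at least as good on both axes and strictly better on one.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other point.

    points: list of (name, importance, novelty); both scores are maximized.
    """
    frontier = []
    for name, imp, nov in points:
        dominated = any(
            (imp2 >= imp and nov2 >= nov) and (imp2 > imp or nov2 > nov)
            for _, imp2, nov2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (protein, Importance, Novelty) triples.
associations = [
    ("A", 10.0, 0.01),
    ("B", 5.0, 0.10),
    ("C", 1.0, 1.00),
    ("D", 4.0, 0.05),  # dominated by B
    ("E", 0.5, 0.50),  # dominated by C
]

print(pareto_frontier(associations))
```

Because the logarithm is monotonic, plotting on logarithmic axes does not change which points are non-dominated.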


Table 1.

Distribution of TDL categories by protein family for “druggable” targets.

Target Class                                 Category        Tclin   Tchem    Tbio   Tdark
G-protein coupled receptors (non-olfactory)  GPCR               96     142     120      50
olfactory G-protein coupled receptors        GPCR                -       -       8     413
Ion Channels                                 ion channel       126      85     106      24
ATP-binding cassette Transporters            transporter         3       7      32       5
SLC transporters                             transporter        15      65     218      89
Transcription Factors                        TF                  -      36     926     476
Nuclear hormone receptors                    TF                 18      19      11       -
Kinases                                      Enzyme             52     373     178      31
Transferases                                 Enzyme              2      21     221      73
Phosphatase                                  Enzyme              1      54     205      52
Peptidase                                    Enzyme             20     110     141      62
small GTP-ases                               Enzyme              -       7     118      23
GTP-ases                                     Enzyme              -       -      83      29
Dehydrogenases/reductases                    Enzyme             11      21      77      18
Hydrolases                                   Enzyme              3      26      57      24
ATP-ases (excluding ABC Transporters)        Enzyme              9       6      56      18
RNA Polymerases                              Enzyme              -       1      27       3
RNase family                                 Enzyme              -       1      21       9
Carbonic Anhydrases                          Enzyme             12       -       3       -
Cyclic nucleotide Phosphodiesterases         Enzyme             14       1       -       -
Cytochrome P450s                             Enzyme              9       8       9       -
Sulfatase                                    Enzyme              -       1      14       2
Other                                                          222     614   8,814   5,187
Total                                                          613   1,598  11,445   6,588


Table 2.

Phenotype/disease association by TDL category.

Source                           Tclin   Tchem    Tbio   Tdark   Total
IMPC phenotypes (significant)      133     433   2,368     551   3,485
GWAS annotations                   421   1,017   6,472   2,340  10,250
DISEASES database                  612   1,593  11,289   5,298  18,792
Orphanet                           316     521   2,918     126   3,881
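
To put the Table 2 counts on a common scale, the short Python sketch below computes the share of Tdark proteins among all proteins annotated by each source and checks that the four TDL counts add up to the reported total. The counts are copied from Table 2; the function and variable names are hypothetical.

```python
# Counts copied from Table 2: (Tclin, Tchem, Tbio, Tdark, Total) per source.
table2 = {
    "IMPC phenotypes (significant)": (133, 433, 2368, 551, 3485),
    "GWAS annotations": (421, 1017, 6472, 2340, 10250),
    "DISEASES database": (612, 1593, 11289, 5298, 18792),
    "Orphanet": (316, 521, 2918, 126, 3881),
}

def tdark_share(row):
    """Return Tdark / Total after checking that the TDL counts sum to Total."""
    *by_tdl, total = row
    if sum(by_tdl) != total:
        raise ValueError("TDL counts do not add up to the reported total")
    return by_tdl[3] / total

shares = {source: tdark_share(row) for source, row in table2.items()}
for source, share in shares.items():
    print(f"{source}: {share:.1%} Tdark")
```

With these counts, the Tdark share is largest for the DISEASES database and smallest for Orphanet.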
