Exploring the Dark Genome Implications for Precision Medicine Oprea, Tudor I
Published in: Mammalian Genome
DOI: 10.1007/s00335-019-09809-0
Publication date: 2019
Document version: Peer reviewed version

Citation for published version (APA): Oprea, T. I. (2019). Exploring the dark genome: implications for precision medicine. Mammalian Genome, 30(7-8), 192-200. https://doi.org/10.1007/s00335-019-09809-0

Download date: 28 Sep 2021

HHS Public Access. Author manuscript; available in PMC 2020 August 01. Published in final edited form as: Mamm Genome. 2019 August; 30(7-8): 192–200. doi:10.1007/s00335-019-09809-0.

Exploring the Dark Genome - Implications for Precision Medicine

Tudor I. Oprea (1,2,3,4)

1. Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM, USA.
2. UNM Comprehensive Cancer Center, Albuquerque, NM, USA.
3. Department of Rheumatology and Inflammation Research, Institute of Medicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden.
4. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.

Abstract

The increase in the number of both patients and healthcare practitioners who grew up using the Internet and computers (so-called “digital natives”) is likely to impact the practice of precision medicine, and requires novel platforms for data integration and mining, as well as contextualized information retrieval. The “Illuminating the Druggable Genome Knowledge Management Center” (IDG KMC) quantifies data availability from a wide range of chemical, biological and clinical resources, and has developed platforms that can be used to navigate understudied proteins (the “dark genome”), and their potential contribution to specific pathologies.
The “Target Importance and Novelty Explorer” (TIN-X) highlights the role of LRRC10 (a dark gene) in dilated cardiomyopathy. Combining mouse and human phenotype data leads to increased strength of evidence, which is discussed for four additional dark genes: SLX4IP and its role in glucose metabolism, the role of HSF2BP in coronary artery disease, the involvement of ELFN1 in attention deficit hyperactivity disorder, and the role of VPS13D in mouse neural tube development and its confirmed role in childhood onset movement disorders. The workflow and tools described here are aimed at guiding further experimental research, particularly within the context of precision medicine.

Conflict of Interest: Dr. Oprea is a former full-time employee of AstraZeneca (1996–2002). He has received honoraria from, or consulted for, Abbott, AstraZeneca, Chiron, Genentech, Infinity Pharmaceuticals, Merz Pharmaceuticals, Merck Darmstadt, Mitsubishi Tanabe, Novartis, Ono Pharmaceuticals, Pfizer, Roche, Sanofi, and Wyeth. His spouse was a full-time employee of AstraZeneca (2002–2014) and is a full-time employee of Genentech Inc.

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

Navigating the Target Landscape by Illuminating the Druggable Genome

Data wrangling, distilling unrelated data elements into contextualized knowledge and the overall ability to rapidly process digital information are increasing demands placed on current healthcare professionals. At the same time, the number of “digital native” patients (i.e., patients who started to use computers, tablets or smartphones from an early age) is increasing.
Indeed, when it comes to healthcare issues, more and more patients, especially “digital natives”, seek information via social media, web-based platforms and healthcare databases. It therefore seems necessary for healthcare professionals to embrace novel analytic technologies to integrate multi-faceted big data and translate them into patient benefits (Bezemer et al. 2019). This integration process faces unique technical, semantic, and ethical challenges (Seneviratne, Kahn, and Hernandez-Boussard 2019), challenges that could be overcome by emerging computational technologies rooted in phenotypic ontologies drawing from a multi-species context (Robinson, Mungall, and Haendel 2015). Therefore, there is an imperative to improve and streamline “big data” analytics technologies in the context of precision medicine. As both healthcare practitioners and patients are increasingly “digital native”, the patient-doctor relationship is likely to change fundamentally, given the unfettered Internet access to healthcare information.

Biomedical advances have fostered the emergence of data science, a relatively new discipline focused on novel algorithms for data analytics and visualization (Berger and Schneck 2019), which ultimately warrants an informatics-oriented formalization of study design, interoperability and model development (Prosperi et al. 2018), specifically in the context of precision medicine (National Research Council et al. 2012). Defined as “prevention and treatment strategies that take individual variability into account” (Collins and Varmus 2015), precision medicine is widely regarded as the future of clinical practice, and is poised to take advantage of the large-scale integration of the contextualized knowledge emerging from genomic, phenomic and patient-centric databases.

Here, we briefly discuss the Illuminating the Druggable Genome Knowledge Management Center, IDG KMC (Oprea, Bologa, et al.
2018) project, and how its platform can be used to extract relevant data elements for precision medicine. IDG KMC extracts and processes expression and functional data related to proteins and genes; molecular probes such as small molecules, antibodies and approved drugs; small molecule bioactivities; genome-wide association studies; disease associations; and drug indications and off-label uses (among other data types) into the Target Central Repository Database, TCRD (Nguyen et al. 2017); see Figure 1. TCRD organizes and structures information, thus mapping the current target landscape. Key elements from genomic and proteomic sources, in addition to literature, patents, clinical trials, drug labels and other information, are standardized, linked and associated with disease information, and exposed via the Pharos portal (https://pharos.nih.gov/index) and through a limited TCRD REST API (https://bit.ly/31lMT17). Bridging clinical, biological, chemical and genomic data, the IDG KMC integrates information and knowledge for over 20,000 protein-encoding genes, combining data science, informatics and computational biology to prioritize targets for further experimental evaluation and analysis by the broader scientific community.

Quantifying Data Availability

The TCRD/Pharos platform is well suited to address questions about the “druggable genome”, which Hopkins and Groom (2002) defined as the set of protein-encoding genes that can be therapeutically modulated by orally formulated drugs. Their definition uses Lipinski’s “rule of five” criteria, which defined four physico-chemical property parameter sets enriched in orally available drugs (Lipinski et al. 1997). The intersection between disease-modifying genes and the “druggable genome” was estimated to be between 600 and 1,500 targets (Hopkins and Groom 2002).
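As a brief illustration of the “rule of five” mentioned above, the sketch below counts how many of the four commonly cited cutoffs a compound exceeds: molecular weight over 500, calculated log P over 5, more than 5 hydrogen-bond donors, and more than 10 hydrogen-bond acceptors. It is a minimal sketch and not the method used by Hopkins and Groom or by the IDG KMC. The names `Descriptors`, `rule_of_five_violations` and `passes_rule_of_five` are hypothetical, the descriptor values are assumed to be precomputed elsewhere, and tolerating one violation by default is a common convention adopted here as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    mol_weight: float      # molecular weight, in daltons
    clogp: float           # calculated log P (octanol/water)
    h_bond_donors: int     # number of OH and NH groups
    h_bond_acceptors: int  # number of N and O atoms


def rule_of_five_violations(d: Descriptors) -> List[str]:
    """Return the names of the rule-of-five cutoffs that the compound exceeds."""
    checks = [
        ("mol_weight > 500", d.mol_weight > 500),
        ("clogp > 5", d.clogp > 5),
        ("h_bond_donors > 5", d.h_bond_donors > 5),
        ("h_bond_acceptors > 10", d.h_bond_acceptors > 10),
    ]
    return [name for name, violated in checks if violated]


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumption: a compound passes if it exceeds at most `max_violations` cutoffs."""
    return len(rule_of_five_violations(d)) <= max_violations


if __name__ == "__main__":
    # Illustrative descriptor values only; they do not describe real compounds.
    examples = {
        "small_molecule": Descriptors(180.2, 1.2, 1, 4),
        "large_macrocycle": Descriptors(1202.6, 7.5, 6, 23),
    }
    for name, desc in examples.items():
        print(name, rule_of_five_violations(desc), passes_rule_of_five(desc))
```

Keeping the violation list separate from the pass/fail decision lets the tolerance be adjusted without touching the cutoffs themselves.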
Indeed, the known druggable genome, categorized as Tclin (targets through which approved drugs act), includes 602 human proteins (Santos et al. 2017), a number that increased by thirteen with the drugs approved in 2018 (Ursu, Glick, and Oprea 2019).

A second set of well-studied proteins, Tchem, encompasses proteins that lack mode-of-action associations with approved drugs but are known to bind small molecules with high potency. Current thresholds are ≤ 30 nM for kinases, ≤ 100 nM for G-protein coupled receptors and nuclear receptors, ≤ 10 μM for ion channels, and ≤ 1 μM for other target families (Oprea, Bologa, et al. 2018). Bioactivity values were extracted from ChEMBL, DrugCentral and the Guide to Pharmacology (Southan et al. 2016).

The third Target Development Level (TDL) category, Tbio, includes proteins that have a confirmed Mendelian disease phenotype in OMIM (Amberger et al. 2009), or have Gene Ontology (Ashburner et al. 2000) “leaf” (lowest level) term annotations based on experimental evidence, or meet two of the following three conditions: a fractional publication count (Pafilis et al. 2013) above 5; three or more Gene RIF (“Reference Into Function”) annotations (https://bit.ly/2WDE1oL); or 50 or more commercial antibodies, as counted in the Antibodypedia database (Kiermer 2008).

The fourth TDL category, Tdark, also referred to as the “ignorome” (Pandey et al. 2014) or the “dark genome”, encompasses one in three human proteins that were manually curated at the primary sequence level in UniProt (UniProt Consortium 2015), yet do not meet any of the criteria for Tclin, Tchem or Tbio. Additional information concerning this category may be available from genome-wide association studies (GWAS), tissue and subcellular compartment location, dysregulation, IMPC mouse