<<

Unexplored therapeutic opportunities in the

Tudor I. Oprea1,2,3,4*, Cristian G. Bologa1, Søren Brunak4, Allen Campbell5, Gregory N. Gan2, Anna Gaulton6, Shawn M. Gomez7,8, Rajarshi Guha9, Anne Hersey6, Jayme Holmes1, Ajit Jadhav9, Lars Juhl Jensen4, Gary L. Johnson8, Anneli Karlson6,20, Andrew R. Leach6, Avi Ma’ayan10, Anna Malovannaya11, Subramani Mani1, Stephen L. Mathias1, Michael T. McManus12, Terrence F. Meehan6, Christian von Mering13, Daniel Muthas14, Dac-Trung Nguyen9, John P. Overington6,21, George Papadatos6,22, Jun Qin11, Christian Reich15, Bryan L. Roth8, Stephan C. Schürer16, Anton Simeonov9, Larry A. Sklar2,17,18, Noel Southall9, Susumu Tomita19, Ilinca Tudose6,23, Oleg Ursu1, Dušica Vidovic´16, Anna Waller17, David Westergaard4, Jeremy J. Yang1 and Gergely Zahoránszky-Köhalmi1,24 Abstract | A large proportion of biomedical research and the development of therapeutics is focused on a small fraction of the human genome. In a strategic effort to map the knowledge gaps around encoded by the human genome and to promote the exploration of currently understudied, but potentially druggable, proteins, the US National Institutes of Health launched the Illuminating the Druggable Genome (IDG) initiative in 2014. In this article, we discuss how the systematic collection and processing of a wide array of genomic, proteomic, chemical and disease-related resource data by the IDG Knowledge Management Center have enabled the development of evidence-based criteria for tracking the target development level (TDL) of human proteins, which indicates a substantial knowledge deficit for approximately one out of three proteins in the human proteome. We then present spotlights on the TDL categories as well as key target classes, including G -coupled receptors, protein kinases and ion channels, which illustrate the nature of the unexplored opportunities for biomedical research and therapeutic development.

Drug Target selection and prioritization are common goals for For the purposes of this article, we define knowledge Externally administered, academic and commercial drug research organizations. as the consensus of information aggregated from dif- possibly endogenous but While motivations differ, in all cases, the target selection ferent sources and information as structured data, with mostly xenobiotic, substances task is fundamentally one of resource allocation in the a contextual layer that supports a broad range of data that are administered to face of incomplete information. Consequently, target analytics. Data have quantity, quality and dimensionality patients in order to influence the outcome of a disease, selection strategies (and metric-based approaches to (for example, genomic knowledge is defined in relation 1 syndrome or condition. assess their success) remain complex and are hindered to associations with distinct entities such as molecular by multiple bottlenecks. Some bottlenecks pertain to the probes and disease concepts). Data, like facts, may also data themselves, such as disjointed, disparate data and have an expiration date (Supplementary Box S1), and metadata standards, data recording errors and accessi- thus knowledge is subject to change. Yet, within a given bility issues; overcoming these issues will require human time frame, knowledge provides context for interpre- and computational efforts and coordination across mul- tation and integration of emergent data, information

*e-mail: tiple communities. Another set of bottlenecks pertains to and models. [email protected] the scientists involved. These include a tendency to focus Data-driven drug discovery strategies rely on the on a small subset of well-known genes2 and the tendency integration of proprietary and internal data with doi:10.1038/nrd.2018.14 Published online 23 Feb 2018; to avoid riskier research paths, driven by poor research third-party resources — both public databases, such corrected online 23 Mar 2018 funding climates3. as PubMed, PubChem4, ChEMBL5 and The Cancer 6 Drug targets Genome Atlas (TCGA ), and commercial databases, Illuminating the Druggable Genome. “The reluctance 2 Molecular entities present in such as Integrity. This integration requires fusion and to work on the unknown” (REF. ) is inherent to the living systems that, upon reconciliation of heterogeneous and sometimes con- scientific endeavour, partly due to our subconscious interaction with therapeutic flicting data sources and types. Although many of tendency to choose research subjects more likely to agents or their by-products, 8 result in modified biological these resources are already partially interlinked, data confirm what we already know or believe . In a deliber- responses that lead to heterogeneity, complexity and incompleteness, as well ate, strategic attempt to map the knowledge gaps around therapeutic outcomes. The as contextual information and metadata capture, pose potential drug targets and to prompt exploration of cur- interaction between a drug and substantial barriers to reliable systematic analyses of all rently understudied but potentially druggable proteins, its target leads, directly or data required to address biomedical research questions, the US National Institutes of Health (NIH) launched the indirectly, to observable clinical 1 outcomes. such as target prioritization in drug discovery . Illuminating the Druggable Genome (IDG) initiative in With the increasing scale and variety of data genera- 2014. As part of this broad, multimillion-dollar initiative, Druggable genome tion, collection and curation in the biomedical sciences, the IDG Knowledge Management Center (KMC) aims Originally defined by Hopkins there is an unmet need for in‑depth, accurate and truth- to systematize general and specific biomedical knowl- and Groom as the set of that encode proteins that ful integration of multiple scientific domains across edge by processing a wide array of genomic, proteomic, could be modulated by an disciplines. Once successful, these data and knowledge chemical and disease-related resources (BOX 1), with the orally administered small integration efforts enable us to ask both global and explicit goal of supporting target hypothesis genera- molecule, as estimated by fundamental questions about genes, proteins and the tion and subsequent knowledge creation, especially for Lipinski’s ‘rule of five’ processes they are involved in. Integrated resources also genes and proteins that are not well studied. guidelines. allow us to address aspects of reproducibility7 via con- In this article, we first define objective, evidence-based cordance of similar data types from unrelated sources criteria for tracking target development levels (TDLs) for and deficits in our knowledge of biological systems human proteins, using multiple sets of current knowl- and their function. More generally, data integration edge. We discuss the data collected by the KMC on facilitates our ability to quantify knowledge using an TDLs, which show the existence of a substantial knowl- evidence-based approach. edge deficit concerning a large portion of the human proteome (one out of three proteins). Reflecting the goal of illuminating the druggable genome, we then present spotlights on the TDL categories, as well as on key target Author addresses classes, including G protein-coupled receptors (GPCRs), protein kinases and ion channels. 1Department of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, NM, USA. Knowledge-based protein classification 2UNM Comprehensive Cancer Center, Albuquerque, NM, USA. 3Department of Rheumatology and Inflammation Research, Institute of Medicine, Target development levels. Most current protein clas- Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden. sification schemes are based on structural and func- 4Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical tional criteria. For any given protein, it is also possible Sciences, University of Copenhagen, Copenhagen, Denmark. to identify associated and chemical or biologic 5IQVIA, Plymouth Meeting, PA, USA. modulators, and many types of experimental data can be 6European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), associated with the protein, including publications, pat- Wellcome Genome Campus, Hinxton, Cambridge, UK. ents, expression data and experimental or modelled 7 Joint Department of Biomedical Engineering, University of North Carolina at Chapel 3D structures. Hill and North Carolina State University, Chapel Hill, NC, USA. For target prioritization and therapeutic development, 8Department of , University of North Carolina School of Medicine, Chapel Hill, NC, USA. it is useful to understand the quantity and diversity of 9National Center for Advancing Translational Sciences (NCATS), National Institutes of data that are available for a given protein and to assign Health (NIH), Rockville, MD, USA. a qualitative knowledge metric that characterizes the 10Icahn School of Medicine at Mount Sinai, New York, NY, USA. degree to which a target is comparatively well studied or 11Baylor College of Medicine, Houston, TX, USA. unstudied. To address this, we developed the TDL clas- 12University of California, San Francisco, CA, USA. sification scheme, which categorizes proteins into four 13Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland. groups — Tclin, Tchem, Tbio and Tdark — with respect to the 14 Respiratory, Inflammation and Autoimmunity Diseases, Innovative Medicines and depth of investigation from a clinical, chemical and bio- Early Development Biotech Unit, AstraZeneca R&D Gothenburg, Mölndal, Sweden. logical standpoint (FIG. 1; TABLE 1). Except for T , TDL 15IQVIA, Cambridge, MA, USA. clin assignments were performed without human curation. 16Department of Molecular and Cellular Pharmacology, Miller School of Medicine, University of Miami, Miami, FL, USA. Formal definitions for the TDL categories are as follows. • 17Center for Molecular Discovery, University of New Mexico Cancer Center, Tclin (clinic) proteins are drug targets linked to at least University of New Mexico, Albuquerque, NM, USA. one approved drug (that is, an active pharmaceuti- 18Department of Pathology, University of New Mexico, Albuquerque, NM, USA. cal ingredient) by mechanism of action (MoA) (this 19Yale School of Medicine, Yale University, New Haven, CT, USA. criterion supersedes any of the other parameters). Present addresses: 20SciBite Limited, BioData Innovation Centre, Wellcome Genome Classification into this TDL category was achieved Campus, Hinxton, Cambridge, UK. through exhaustive manual querying of primary lit- 21 Medicines Discovery Catapult, Alderley Edge, UK. erature and drug labels for MoA assignments with 22 GlaxoSmithKline, Stevenage, UK. respect to molecular (protein) targets9; drug targets 23Google Germany GmbH, München, Germany. annotated as MoA-related proteins are categorized as 24NIH-NCATS, Rockville, MD, USA. Tclin (see further discussion below). • Tchem (chemistry) proteins lack MoA-based links to counts per protein. ‘GO terms’ examines the distribution approved drugs but are known to bind to small mole- of GO12 annotation counts per protein using data from cules with high potency. The interactions between pro- UniProt 15. ‘R01 grants’ examines the distribution of text- teins and small molecules (and sometimes approved mined R01 grant counts detected for each protein using drugs) are usually studied in the context of a disease NIH RePORTER data (see below for further discus- and often arise from medicinal chemistry efforts. For sion). ‘Patents’ examines the distribution of text-mined 16 inclusion in the Tchem category, we required the bio- granted patents for each protein using SureChEMBL activity of at least one small molecule to be above a data. Finally, the data availability score summarizes specific cut-off chosen to include about 90% of the experimental information density per protein obtained bioactivity values of drugs with a confirmed MoA from Harmonizome17 data — a resource developed for a target from that protein family (Supplementary independently for the KMC that provides an abstract Figure S2). Currently chosen thresholds are ≤30 nM representation of the many types of data associated for kinases, ≤100 nM for GPCRs and nuclear recep- with all human genes and proteins (BOX 1). tors, ≤10 μM for ion channels and ≤1 μM for other Whereas the first three data types were used to assign

target families. Bioactivity values were extracted from the TDL category for proteins in the Tbio and Tdark catego- ChEMBL5 and DrugCentral10. ries, the other four data types — derived from separate text

• Tbio (biology) refers to those proteins that have a con- corpora and repositories — provide independent valida- firmed Mendelian disease phenotype in the Online tion of our criteria for categorization overall. Distribution Mendelian Inheritance in Man (OMIM) database11 trends within TDL categories are consistently repro- (that is, at least two publications), have duced across all data types in FIG. 2a and have statisti- (GO)12 leaf term annotations based on experimental cally significant differences (Supplementary Table S4).

evidence or meet two of the following three condi- Tdark proteins have the least amount of data associated tions: a fractional PubMed publications count13 above with them regardless of source. five; three or more National Center for Biotechnology Increasing amounts of data are observed for proteins

Information (NCBI) Gene Reference Into Function when progressing through the categories from Tdark to

(RIF) annotations; or 50 or more commercial anti- Tbio, Tchem and Tclin. For example, Tdark proteins tend not bodies, counted from data made available by the to be the object of study for many funded NIH R01 grants 14 Antibodypedia database . Tbio assignments imply and are significantly less discussed in patents compared that these proteins are not MoA-related drug targets with proteins in other TDLs. Statistical significance

(these are Tclin proteins). However, it does not follow breaks down when comparing Tclin and Tchem, but because

that these proteins lack associations with bioactive successful clinical trials are required for the Tchem-to-Tclin molecules, including approved small-molecule drugs progression, this evidence may not be well captured by the and biologics. It does, however, imply that given cur- four data types highlighted in Supplementary Table S4. rent levels of evidence, associated bioactivity values However, this is less surprising from a knowledge man-

and clinical observations did not meet Tchem or Tclin agement perspective, since on average, the biochemis- criteria, respectively. try and pharmacology of a protein are likely to be well

• Tdark (dark genome) refers to the remaining proteins studied upon reaching the Tchem development stage. It is

that have been manually curated at the primary important to note that the Tchem stage can be completely sequence level in UniProt15 yet do not meet any of bypassed for targets of therapeutic antibodies and

the criteria for Tclin, Tchem or Tbio. Even for this cate- other biologics. gory, evidence may be available concerning genome- In summary, all the data, information and knowl- wide association studies (GWAS), tissue location, edge aggregated and processed within the IDG KMC dysregulation, inferred function via homology, etc. archive (partially illustrated in FIG. 2a) confirm the

Many proteins in the Tdark category are not context- existence of a knowledge deficit about many proteins, less sequences. However, these are proteins for which some of which could have therapeutic relevance. The there is the least current knowledge and a low number bias towards well-described proteins2 is confirmed not of specific molecular probes available, and some rep- only with respect to publications but also with respect resent unexplored opportunities within the druggable to patents, NIH funding patterns, GWAS and mouse human genome. While evidence that approved drugs phenotype data (data not shown), availability of molec-

interact with some Tdark proteins may be available, the ular probes such as antibodies and small molecules, and 18 above criteria were observed for all Tdark assignments even queries in the STRING database (see below and (Supplementary Table S3). FIG. 2b). Because of this bias, one out of three human

proteins (Tdark) have been largely unstudied. Although The knowledge deficit. FIGURE 2a summarizes the vary- the NIH acknowledged that illumination should directly ing degree of available data (represented using a normal- target understudied proteins, scientists engaged in target ized count of occurrence) for seven different data types selection are likely to remain risk-averse and perhaps associated with individual targets and grouped by TDL. systematically less inclined to study Tdark proteins. The first three groups illustrate category differences for Our classification provides overall insight into the three TDL defining criteria discussed above — namely, current illumination levels and sizes the opportunity the fractional count of protein and/or gene mentions in for drug targets from well-established and precedented PubMed abstracts, NCBI Gene RIF counts and antibody druggable protein families. The natural progression is for Mode of action proteins of potential therapeutic interest to migrate from and medicinal chemistry assessments of druggability Referred to as ‘mechanism of Tdark to Tclin over time, and TDL monitors knowledge largely focus on forecasting whether a target protein can action’ when the molecular accumulation using multiple types of clinical, chem- bind to drug-like small molecules with high affinity and interactions are well ical and biological evidence, while providing an easily specificity19. However, druggability literature rarely men- understood; describes the way in which drugs exert their interpretable ranking scheme. We argue that proteins in tions biologics, antibodies and other protein therapeu- intended therapeutic action, Tdark and Tbio are understudied and more in need of illu- tics, radiotherapy (Supplementary Box S5), gene therapy resulting in the intended mination, and we discuss approaches for achieving this or stem cells. In this section, we discuss Tclin and Tchem therapeutic outcome. later in the article, after first overviewing knowledge on proteins in the context of small-molecule drug discovery.

proteins in the Tclin and Tchem categories.

Tclin proteins. Ideally, unequivocal Tclin assignment (that Spotlight on Tclin and Tchem is, identification of molecular drug targets) would require Evaluating protein target druggability — the ability of several layers of evidence: a full matrix of in vitro bio- a protein to be therapeutically modulated by medicines activity for all prodrugs, drugs and active metabolites — can involve complex assessments of a range of protein (active ingredients) assayed against all relevant human characteristics. Structural biology and computational and non-human (for example, bacterial and viral) targets

Box 1 | Overview of the Illuminating the Druggable Genome Knowledge Management Center Knowledge management implies the ability to structure data into browsing of all TCRD data. Features include search filters to reduce lists information88 while combining low-volume, high-quality data, such as of targets, query-saving capability for sharing, and dossier functionality thorough analyses of experimental data (for example, high-resolution to collate data during searching or browsing. Pharos provides an X‑ray crystallographic structures) or evidence-based systematic reviews extensive REST API to support programmatic access and inclusion in (for example, the Cochrane Collaboration), with high-volume (and perhaps pipelining tools. lower quality) data such as genome-wide association studies (GWAS) or Harmonizome high-throughput screening data sets. As the overall scientific process Given the wide variety of experimental data that is generated on requires the archiving, evaluation and re‑interpretation of sometimes individual proteins, it is useful to characterize the total availability of data conflicting data, the Illuminating the Druggable Genome Knowledge types around individual targets. This Harmonizome is a resource Management Center (IDG KMC) faces similar challenges. Consensus developed for KMC17 that contains a collection of processed data sets emerges based on repeated independent experiments, robustness of the from 70 major online resources, abstracted and organized into ~72 million results (for example, modified reagents or conditions, or model organisms), functional associations between genes and proteins and their attributes. increased domain expertise and qualitative judgement. To this end, the Such attributes could be physical relationships with other biomolecules, IDG KMC automates algorithmic processing of structured data by expression in cell lines and tissues, genetic associations with knockout extracting and processing expression and functional data related to mouse or human phenotypes or changes in expression after drug proteins and genes, molecular probes such as small molecules and treatment. antibodies, small-molecule bioactivities, GWAS, disease associations These associations are stored in a relational database along with rich and launched drug information (among other data types) into the Target metadata for genes and proteins, their attributes and the original sources. Central Resource Database (TCRD)89. TCRD content is presented via To report overall levels of knowledge for each target, the Harmonizome Pharos, a multimodal web interface89 (see below). computes a cumulative probability of a protein occurring within a given TCRD–Pharos is not unique in providing integrated content: data set. With appropriate normalization, this results in an association ChEMBL, DrugBank90 and UniProt91 are excellent examples of drug score for a protein–data source pair, with values ranging from 0 to 1. discovery integration systems, for example, for chemical structure and When a source has no data associated with a target, its score is set to 0. drug bioactivity data and protein and disease information, largely focused Currently, 110 individual data sources (including supplementary files from on a specific knowledge domain. CiViC92 combines multiple resources with publications and public repositories of omics data) are made available a specific goal, for example, to enable clinical interpretation of gene through the Harmonizome, resulting in a 110‑element vector variants. The only resource that parallels the scope of IDG KMC is representation for each target. From this vector, we compute the data OpenTargets93, a consortium focused on disease-specific target validation availability score as the sum of the 110 association scores. efforts. The KMC collates evidence about all human proteins from multiple The Harmonizome is available through a web portal, a web service and domains, supporting research on understudied proteins and new biology, a mobile app for querying, browsing and downloading all data. The and includes the following resources. Harmonizome visualizes gene–gene and attribute–attribute similarity Target Central Resource Database networks for all processed data sets. TCRD is the central open-access data repository for the IDG KMC and is the DrugCentral primary data source for the IDG KMC project-wide web portal Pharos89. This online compendium provides chemical, pharmacological and TCRD integrates 55 heterogeneous data sets, with over 85 million gene regulatory information for active pharmaceutical ingredients and/or protein attributes. Special emphasis is placed on four families that and pharmaceutical products by linking chemical entities, multiple drug were of interest to the pilot phase of the IDG programme: G protein-coupled identification codes, drug mode of action and pharmacological action at receptors, ion channels, kinases and nuclear receptors (TABLE 1). The focus the target level, and pharmaceutical formulation and product-specific on this fraction of the proteome is justified by historical evidence, which information, as well as indications, contraindications and off-label indicates that these four protein families are among the most consistently indications10. DrugCentral links 4,509 active ingredients to 93,084 successful druggable target classes (see also TABLE 2). TCRD is available pharmaceutical products and is available under the CC‑BY‑SA 4.0 licence. under the CC‑BY‑SA 4.0 licence. Programmatic access to TCRD is also available via a REST application programme interface (API). Drug Target Ontology This is an interactive framework to integrate, navigate and analyse drug Pharos discovery data, based on formalized and standardized classifications and Access to TCRD content is via the web portal Pharos89, which is a Java annotations of human proteins94, available under the CC‑BY‑SA 4.0 platform that supports efficient and intuitive search queries and licence. (such as the half-maximal inhibitory concentration (IC50), in humanized animal models (although human data are effector concentration for half-maximum response (EC50), preferable); and phenotypic confirmation supported by inhibitory constant (Ki) and the dissociation constant pharmacodynamic data. In animal disease models lacking

(Kd)); on–off rate constants and other kinetic measure- the gene or genes responsible for the MoA of the drug, the ments performed at appropriately relevant concentra- drug should lack therapeutic effect. Meeting these criteria tions in the tissue or tissues relevant for that particular would be needed in order to attribute the desired clinical disease context, preferably with matching in vivo data outcome to a specific drug target interaction mechanism.

a 60%

16% Tbio 55% 3%

21% 8% 35%

T

chem 7% 28% 3% GPCR Transporter

Nuclear receptor Kinase T 3% 31% Ion channel Transcription factor clin Enzyme Epigenetic Other 4% 24% 7% 2% 8% 12% 14% 7% 35% Tdark

71%

b 100

90

80

70

60

50

40

30

Percentage of proteins in family of proteins Percentage 20

10

0

GPCR Kinase Other Enzyme

Ion channel Transporter Orphan GPCR (epigenetic) Nuclear receptor Tbio Tchem Tclin Epigenetic proteins Transcription factorTranscription factor

Figure 1 | Target development level categories applied to the human proteome. a | Percentages of the whole proteome are shown in the inner ring. Percentages of each target development level (TDL) category for selected major protein families are shown in the outer ring, with the Tclin category expanded. Inner ring coloursNature are Reviews as follows: | Drug Tdark Discovery, black; Tbio, red; Tchem, green; and Tclin, blue. b | TDL distribution across protein families, coloured by TDL category. Data show 3,644 proteins that have a confirmed disease association according to the Online Mendelian Inheritance in Man (OMIM) database. The enzyme category excludes kinases, which are considered separately. GPCR, G protein-coupled receptor. Table 1 | Current distribution of TDL categories by protein family for the druggable genome

Target class All Tclin Tchem Tbio Tdark GPCRs (non-olfactory) 406 96 113 145 52 Olfactory GPCRs 421 0 0 8 413 Kinases 634 50 390 163 31 Ion channels 355 126 44 150 35 Nuclear receptors 48 18 19 11 0 Transporters 473 26 46 287 114 Transcription factors 1,400 0 27 866 507 Epigenetic proteinsa 280 12 53 178 37 Enzymesb 4,146 186 493 2,607 860 Others 11,957 87 217 6,671 4,982 Total 20,120 601 1,402 11,086 7,031 GPCR, G protein-coupled receptor; TDL, target development level. aIncludes 40 transcription factors not already counted with those in the transcription factor category. bExcludes kinases.

Because the above criteria are difficult to implement example, some nuclear receptors lack an (endogenous) by automation, a previous analysis carefully curated ligand-binding domain or do not appear to be amenable MoA data from approved drug labels as well as primary to small-molecule perturbation. Another factor is that literature, based on a rigorous definition of a drug tar- not all proteins can (or will) alter the course of disease via get9. This ongoing process, performed in parallel by three therapeutic intervention, perhaps in some cases owing to teams, is anticipated to improve our ability to link drug our lack of understanding of the underlying pathology. responses to genetic variation and to help us understand Kubinyi pointed out that single proteins combine the molecular basis of clinical efficacy, safety and adverse in vivo in ways that could lead to many more drug events. The interplay between target tissue expression target combinations across multiple pathways — that under disease-specific conditions and the local concen- is, a ‘druggable proteome’ (REF. 22) — and there is now tration of the drug or its active metabolites at the rele- experimental evidence that alternative splicing, post-­ vant disease site is often difficult to ascertain, which is translational modification and heterogeneous oligomers why we attributed a higher weight of evidence to data produce functional isoforms with different interaction derived from multiple drugs belonging to the same ther- profiles, which may further result in increased diver- apeutic class. Indeed, we anticipate that efficacy target sity of the proteome23. It is also important to note that annotations will become more precise as our capability for many drugs, the precise MoA and contributing to colocalize target, disease and drug increases. molecular targets remain cryptic, especially when poly­

From this analysis, Tclin currently consists of ~600 pharmacology (the simultaneous modulation of multiple protein targets9, which is at the lower end of the orig- targets by drugs) occurs. Shedding light on this would inal estimate of between 600 and 1,500 targets for the require data completeness24, namely, experiments across intersection between proteins in the druggable genome all proteins, in relevant physiological conditions, for all and disease-modifying genes by Hopkins and Groom20 approved drugs. This remains a resource-intensive and 25 (note, however, that Tclin includes targets of biologic costly task, which was partially accomplished by the drugs as well as small molecules, while the estimate was NIH Molecular Libraries Initiative26. for small-molecule drug targets only20). So far, proteins in the Tclin category thus represent only a small fraction Tchem proteins. Assignment to Tchem is based on com- (3%) of the human proteome (FIG. 1a). From a com- pound activity thresholds originating from binding mercial perspective, it is also noteworthy that most of experiments for small molecules (Supplementary the global revenues of the pharmaceutical industry are Figure S2). Selectivity, though important both in vivo derived from drugs that target a relatively small number and in vitro, could not be factored in for all Tchem targets (BOX 2; TABLE 2) of the proteins in the Tclin category . The (Supplementary Box S6 and Supplementary Table S7). majority (259 or 79.7%) of these targets are single pro- Because, by definition, Tclin attribution requires support- teins, whereas 39 (12%) are complex multiprotein tar- ing evidence for the MoA, many proteins known to inter- gets. Only 25 targets (8.3%) are comprised of multiple act with approved drugs, even with high affinity, remain proteins for non-selective drugs; these include the mus- in the Tchem category. Additional bioactivity data from, carinic, α-adrenergic and oestrogen receptors, as well as for example, patent literature and papers currently not cyclooxygenases and histone deacetylases. indexed in ChEMBL may progress more targets to Tchem. Among the factors contributing to the small frac- Many compounds that have reported activity against tion of each major protein family in Tclin so far, one Tchem targets are also candidate drugs undergoing clini- factor is that not all members of a protein family have cal trials. Based on an in‑depth analysis of drug-compatible­ or ligandable21 binding sites; for data combined with data from ChEMBL, PubMed, the a 1.00 As noted above, target druggability is frequently esti- mated based on the ability to bind to small molecules20, and expectations of druggability typically diminish as 0.75 the size of the binding pocket increases, with affinity and selectivity being major concerns. The challenge is 0.50 even greater when the binding pocket is shallow and highly exposed to a solvent or when the therapeutic strategy involves disrupting the interaction between the 0.25 targeted protein and other proteins. One approach to

Normalized counts per score evaluating druggability is to focus on protein domains mined from the InterPro database29 and then to prior- 0.00 itize proteins that contain domains known to interact with approved drugs20 or bioactive small molecules30. Patents Others have explored target druggability by evaluating Gene RIFs R01 grants side-effect similarity for known drugs31 or by perform- Antibody count Gene Ontology 32 PubMed abstracts Harmonizomeavailability data score ing combined chemical and target similarity queries , followed by experimental confirmation of novel drug targets derived from clinical observation or computa- b tion. It is possible that induced binding sites in proteins in which a druggable pocket is not initially found may enable them to be targeted with drug-like small mole- cules, but identifying these binding sites with structural 1000 approaches is likely to be challenging, and phenotypic screens may be more useful. An emerging approach harnessing so‑called PROTACs (proteolysis-targeting chimaeras) may help substantially in addressing the issue of undruggability33, at least for Access counts 10 proteins that are capable of selectively binding a small-­ molecule ligand, although not necessarily at a typical binding site34. Essentially, this strategy harnesses the endo­ genous ubiquitin–­proteasome system to promote targeted Link Name degradation of desired proteins following binding of the PROTAC35; the mechanism for ternary ligase–PROTAC– Figure 2 | Patterns of target development level distribution across different data: target complex activation has been recently elucidated36. visualizing the knowledge deficit. a | The three criteriaNature used inReviews establishing | Drug the Discovery target development level are to the left, and their independent validation by four other data This technology may also be subcellular location-specific, types are to the right. For PubMed abstracts, Gene Reference Into Function (RIF) which could be an additional advantage in some (but not annotations, antibodies, Gene Ontology, R01 grants and patents, the score for each other) cases. However, the oral bioavailability of PROTAC target is the count of those entities associated with the target, normalized between 0 molecules may be constrained by their size. and 1. The values for the Harmonizome data availability score were computed differently, as described in the main text. See FIG. 1 for colour codes and Supplementary Table S4 for Spotlight on Tbio and Tdark further details. b | Patterns of scientific curiosity: STRING database access counts by A critical effort in addressing the knowledge deficit about target development level (January–December 2016). Tbio and Tdark proteins is being undertaken by the Monarch Initiative37, which relies on informatics methods to iden- tify phenotypically relevant disease models in research IUPHAR Guide to Pharmacology27 and ChemIDplus, we and diagnostic contexts based on integrated model organ-

mapped 144 Tchem proteins to 356 clinical (phase I–III ism and clinical research data. One of the main sources trial) candidates, for a total of 701 unique target–clinical for the Monarch Initiative is phenotype data from the candidate pairs. For 175 (25%) of these pairs, therapeutic International Mouse Phenotype Consortium (IMPC), indication data extracted from ChEMBL highlight the which was set up to generate and pheno­typically char- different distribution among protein families (TABLE 3). acterize mouse knockout lines. Their recent analysis of Most targeted proteins are kinases (93 unique enzymes), 1,751 unique gene knockouts found that human disease followed by GPCRs (31 unique receptors) and ion chan- genes are enriched for essential genes38. nels (13), with seven targets from other families, which The IDG KMC incorporates gene-centric mouse phe- is similar to the prior observation that most clinical notype data and maps these data to the respective human candidates target the most druggable target families28. orthologues. IDG coordinated with IMPC production Analysis of the target–clinical candidate subset in which centres to prioritize production of knockout mouse anticipated therapeutic indications are available shows strains for druggable genes. As of November 2017, 568 that most of the kinase-targeting clinical candidates are new knockout strains had been produced: 166 GPCRs, aimed at oncology applications, whereas GPCRs and ion 141 ion channels, 238 kinases and 23 nuclear receptors channel-targeting clinical candidates are aimed at central (see Supplementary Table S8). When ignoring olfactory nervous system disorders. GPCRs, these represent a little more than one-third of Box 2 | Financial spotlight on the human proteome Analyses of drug sales focus on pharmaceutical products95 and the The commercial outcomes of target selection and product-led validation companies authorized to market them96. Here, we ask a target-centric can also be analysed with respect to research funding. NIH RePORTER data question — what are the most financially valuable therapeutic targets — (see the NIH ExPORTER website) were processed using the same by exploring IMS Health (now known as IQVIA) data from their MIDAS™ text-mining methods described earlier for FIG. 2. During the same period, platform. We used IMS Health MIDAS drug sales data from 75 countries 2011–2015, the NIH funded 42,924 R01 grant applications, at a total cost of covering Europe, North America, Australia and Japan, aggregated for the $32 billion. These projects discuss up to 7,851 human proteins (see 2011–2015 period. MIDAS tracks products from most therapeutic classes, Supplementary Table S10). For example, R01 grants associated with estimating product volumes, trends and market share through retail and oestrogen receptors were awarded $101 million, compared with $50 billion non-retail channels. We chose quinquennial aggregation over annual sales in sales earned by 18 drugs acting through oestrogen receptors during that data as it diminishes the importance of factors less relevant to this analysis, same period (TABLE 2). Some targets, having over 30 drugs each, are also such as fluctuations in currency exchange rates. Because active top-earning and well funded, for example, the μ‑opioid (OPRM1) and the ingredients lose patent coverage and become generic, annual sales figures (NR3C1) receptors. Other top-earning targets with over can abruptly drop from one year to the next. 30 drugs each, such as the β2 adrenergic receptor and cyclooxygenase 1, After excluding traditional medicines, including botanicals and are not as well funded. We found no relationship between MIDAS global animal products, the MIDAS set comprised 51,095 unique pharmaceutical drug target sales and NIH R01 funding during the 2011–2015 period, even products, including small molecules and biologics. As most anti-infective when factoring in the number of APIs per target. Overall, $4.2 billion was and drugs target non-human proteins (with the notable awarded to study 496 Tclin proteins, representing 13% of the R01 budget and exception of maraviroc, which targets the host (human) CC-chemokine 6% of all R01‑funded proteins. Another 615 proteins (485 Tbio and 67 Tdark) receptor type 5 (CCR5)), we removed these drugs because their targets are had just one funded R01 project dedicated to their study during 2011–2015, outside of the scope of this analysis. The remainder were mapped to 1,182 and 8,857 proteins were not associated with any NIH funding for this 10 active pharmaceutical ingredients (APIs) from DrugCentral , which were time frame. AT1 receptor, angiotensin II type 1 receptor; COX2, first normalized by the number of APIs per pharmaceutical formulation, cyclooxygenase 2; DPP4, dipeptidyl peptidase 4; HMG-CoA, then by the number of manually curated mechanism of action (MoA) 3‑hydroxy‑3‑methylglutaryl-CoA. 9 targets per API. Thus, we used 581 Tclin proteins and 1,096 APIs, a subset of the 893 human and pathogenic biomolecules through which 1,578 previously analysed approved drugs act9. By linking global drug sales data to drug targets, we sought to assess a snapshot of their commercial value and to evaluate the market value of human MoA targets. The top 20 MoA targets ranked by aggregated sales data, together with and muscarinic β2 Proton National Institutes of Health (NIH) R01 funding for receptors pump the same period, are shown in TABLE 2. The entire

V03 S03 set covers 325 drug targets, comprising S02 S01

A01 A02 581 Tclin proteins, totalling over US$3,417 Glucocorticoid R06 A03 receptor A04 Insulin R05 A05 billion in global drug sales V A06 receptor Monoamine R03 S (Supplementary Table S10). These data transporters A07 R02 A A08 DPP4 indicate that the cytokine tumour A10 R01 R necrosis factor (TNF) is the most A11 N07 A - Alimentary tract and metabolism valuable target, and cytokines are B - Blood and blood-forming organs A16 N06 the only target class comprised C - Cardiovascular system B01 Coagulation entirely of biologic drugs in this Dopamine N05 D - Dermatologicals factor X D2 receptor G - Genitourinary system B B02 analysis for the 2011–2015 period. N04 H - Hormonal preparations B03 N G protein-coupled receptors Voltage- N03 J - Anti-infectives L - Antineoplastics and immunomodulators C01 (GPCRs) are the most valuable gated N02 class of druggable targets, with calcium M - Musculoskeletal system C02 N01 N - Nervous system channels C03 total aggregated sales nearing R - Respiratory system μ-Opioid $917 billion over the 5‑year receptor M05 S - Sensory organs C04 period. This spotlight covers 72 V - Various C M04 M C05 of the 108 druggable GPCRs M03 GPCR Enzyme C07 97 reviewed elsewhere . Kinases C08 M02 Nuclear receptor Transporter ($263 billion, with 45 drugs acting COX2 Ion channel Kinase C09 AT1 M01 receptor on 43 targets) and cytokines Cytokine Other C10 L L04 ($242 billion, with 17 drugs acting on D04 TNF L03 12 targets) are the only two target J D D05 classes with an extremely active ratio L02 D06 L01 H D07 of ongoing versus completed projects, G D10 J05 D11 10 G02 HMG-CoA G03 particularly for emerging mechanism– H05 G04 reductase H04 50 98 H03

H02 indication pairs . Finally, combining H01 financial data with targets organized by family Glucocorticoid 100 receptor 150 ons) and Anatomical, Therapeutic and Chemical (ATC) kinases billi 200 classification system level 2 codes shows that the (US$ CD20 top revenue categories are antineoplastics and Progesterone 250 immunomodulators, followed by nervous system receptor targets (see box figure; a larger version is available as Supplementary Figure S11; see also Supplementary Table S10).

Nature Reviews | Drug Discovery Table 2 | Financial (sales and NIH funding) activity for the top 20 mechanism-of-action targets Target Target Top two drugs Number MIDAS NIH R01 Target Protein class of APIs global sales, funding, type count 2011–2015 2011–2015 per target per target (US$) (US$) TNF Cytokine Adalimumab and 5 163 billion 165 million Single 1 etanercept INSR Kinase Insulin glargine 7 144 billion 2 million Single 1 and insulin aspart NR3C1 NHR Fluticasone 36 143 billion 52 million Single 1 propionate and budesonide HMGCR Enzyme Rosuvastatin and 8 123 billion 1 million Single 1 atorvastatin H+/K+-ATPase Transporter Esomeprazole 10 118 billion 5 million Complex 2 and omeprazole AGTR1 GPCR Valsartan and 9 100 billion 17 million Single 1 olmesartan medoxomil ADRB2 GPCR Salmeterol and 36 90 billion 8 million Single 1 salbutamol OPRM1 GPCR Oxycodone and 34 88 billion 51 million Single 1 fentanyl COX2 Enzyme Paracetamol and 40 84 billion 8 million Single 1 diclofenac DRD2 GPCR Aripiprazole and 48 75 billion 17 million Single 1 quetiapine Muscarinic GPCR Tiotropium 40 64 billion 65 million Multiple 5 acetylcholine bromide and receptors solifenacin SLC6A4 Transporter Duloxetine and 26 59 billion 46 million Single 1 escitalopram HTR2A GPCR Aripiprazole and 27 58 billion 13 million Single 1 quetiapine L‑Type Ion Amlodipine and 23 57 billion 21 million Complex 3 calcium channel nifedipine channels SLC6A2 Transporter Duloxetine and 36 56 billion 5 million Single 1 VEGFA Cytokine Bevacizumab 4 55 billion 162 million Single 1 and ranibizumab HRH1 GPCR Olopatadine and 56 54 billion 1 million Single 1 cetirizine Interferon α–β Membrane Interferon β‑1a 5 51 billion 7 million Multiple 2 receptor receptor and interferon β‑1b Voltage-gated Ion Lidocaine and 39 51 billion 40 million Multiple, 10 sodium channel lamotrigine complex channels Oestrogen NHR Ethinyloestradiol 18 50 billion 101 million Multiple 2 receptors and oestradiol Targets are ranked by aggregated API sales data. The number of drugs is the number of MoA-target-associated APIs used in this data set. Target type: single, one protein; multiple, more than one target; complex, more than one protein per target.

Protein count, number of proteins associated with that target. ADRB2, β2-adrenoceptor; AGTR1, AT1 receptor; API, active pharmaceutical ingredient; HMGCR, hydroxymethylglutaryl-CoA reductase; DRD2, dopamine D2 receptor; GPCR, G protein-coupled receptor; HRH1, histamine H1 receptor; HTR2A, serotonin 5‑HT2A receptor; INSR, insulin receptor; MoA, mechanism of action; NHR, nuclear hormone receptor; NIH, National Institutes of Health; NR3C1, glucocorticoid receptor; OPRM1, μ-opioid receptor; COX2, cyclooxygenase 2; SLC6A2, sodium-dependent noradrenaline transporter; SLC6A4, sodium-dependent serotonin transporter; TNF, tumour necrosis factor; VEGFA, vascular endothelial growth factor A. the druggable genes in these protein families. Phenotype L1000 is a set of 978 landmark genes that have been data are available for 80% of these strains, with abnor- selected for their ability to predict a large portion of the malities detected in numerous biological systems, total variability seen in large sets of microarray exper- including those related to development, immune func- iments. The proportion of Tdark proteins in the L1000 tion, metabolism and behaviour. These IMPC strains set (79 of 978; 8%) is substantially smaller than would provide evidence for biological systems that may be be anticipated based simply on the a priori distribution affected when a drug targets a gene with little-known of Tdark proteins (which make up 35% of the proteome), function. whereas Tbio targets (671 of 978; 69%) are more common

Of the 119 Tdark genes (51 GPCRs, 36 ion channels than expected, as these make up 53% of the proteome. and 32 kinases) submitted by IDG to IMPC, 45 mouse The proportions of Tclin proteins (41 of 978; 4%) and lines were produced, with 41 phenotypes observed. Tchem proteins in the L1000 set (187 of 978; 19%) are also

For example, knockouts of the Tdark kinase gene Alpk3 higher than expected, as these make up 3% and 7% of the have increased embryonic and perinatal lethality, with proteome, respectively. The L1000 TDL distribution data the surviving adults displaying severe heart defects support the existence of a knowledge deficit.

(see Further information). Of 482 Tbio genes submitted To some extent, the data on Tbio and Tdark suggest by IDG (135 GPCRs, 133 ion channels, 200 kinases and a causality dilemma: are Tdark proteins underfunded 14 nuclear receptors), 184 mouse lines were produced, because there is no scientific interest in this category, or with 145 phenotypes observed. For example, knock- is the lack of knowledge perpetuated by lack of funding? outs of the Tbio GPCR gene Adgrd1 display reproductive Although our data do not allow us to establish a causal defects, such as female infertility, and skeleton pheno- relationship, we suggest that the absence of high-quality, type defects, such as decreased bone mineral density (see well-characterized molecular probes is a root cause for Further information). this situation. Lack of tools leads to lack of interest, and Among 2,788 genes phenotyped in mice at the IMPC, lack of interest diminishes the probability of such tools 953 have at least one significant behavioural, neuro­logical being developed. A bibliometric evaluation by Edwards or other nervous system-related phenotype observation. and colleagues2 examined how many newly sequenced Target Central Resource Database (TCRD) data from the proteins from several protein families were the subject of GWAS Catalog39, OMIM11 and text-mined DISEASES13 new studies 10 years after the completion of the human databases confirmed human disease phenotypes for 191 genome sequencing project. This analysis concluded (20%) of these 953 genes, ranging from neurological (for that the process of druggable target selection is con- example, seizure disorders) to cognitive (for example, servative and incremental and that limited progress tauopathy) and psychotic affective disorders. Because has been observed with respect to understanding newly only 9 of the 953 genes lack confirmed expression in any discovered proteins. of the 34 neuro-related tissues tracked by IDG KMC (for “If you don’t know very much to begin with, don’t example, GTEx40,41, HPA42 and HPM43), these data sug- expect to learn a lot quickly” (REF. 44). Anecdotal evi- gest that the remaining 80% of this set have the potential dence (summarized in TABLE 4) suggests that it is pos- to be associated with human neurobehavioural pheno- sible for proteins to migrate from Tdark to Tclin within types, paving the way for new research avenues in this 12–20 years. Data on the six protein targets highlighted direction (see Supplementary Table S8). Production of indicate that proteins for which little information was

IMPC strains is set to continue for several more years, available two decades ago (effectively Tdark) became and so further knockout strains for druggable genes and attractive from a drug discovery perspective follow- their phenotype data are anticipated. ing key papers, namely, deorphanization and protein–­

To further explore the characteristics of Tdark and Tbio disease association studies. Five of these six targets are proteins, we analysed their distribution in the L1000 gene modulated by at least one approved drug, which places set, as annotated in TCRD (Supplementary Table S3). them in the Tclin category.

Table 3 | Summary of clinical candidates (phase I–III) with activity against Tchem proteins Disease category GPCRs Ion channels Kinases Other

Cancer – – 41 (35) 1 (1) Central nervous system 7 (14) 8 (5) – – disorders Inflammation and immune 5 (5) 1 (1) 5 (5) – disorders Respiratory disorders 3 (7) 1 (1) 1 (1) – Metabolic disorders 3 (2) – – – Other 3 (3) – 7 (6) 1 (1) Unmapped 28 (58) 11 (23) 87 (175) 5 (14) Numbers in brackets indicate the number of unique clinical candidates. GPCRs, G protein-coupled receptors. Table 4 | Examples of successful attempts of targeting the dark genome Gene name Relevant Study type and Citation API name Therapeutic Market study (year) reference counta indication approval (year) LEPR 1995 Receptor 4,100 Metreleptin Lipodystrophy 2014 deorphanization99 SMO 1998 Protein–disease 1,195 Vismodegib Basal cell 2012 association study100 carcinoma S1PR1 1998 Receptor 968 Fingolimod Multiple sclerosis 2010 deorphanization101 HCRTR1 and 1998 Receptor 4,608 Suvorexant Insomnia 2014 HCRTR2 deorphanization102 PCSK9 2003 Protein–disease 1,840 Evolocumab Hyper- 2015 association study103 cholesterolaemia GHSR 1999 Receptor 8,248 Anamorelinb Cachexia Successful phase deorphanization104 III clinical trial105 API, active pharmaceutical ingredient; GHSR, ghrelin receptor; HCTR1, orexin receptor type 1; LEPR, leptin receptor; PCSK9, proprotein convertase subtilisin/kexin type 9; S1PR1, sphingosine 1‑phosphate receptor 1; SMO, smoothened homologue. aCitation count for the ‘relevant study’ reference, according to Google Scholar, as of 28 December 2017. bThe European Medicines Agency refused marketing authorization for anamorelin in September 2017.

Successful ‘promotions’ across classes, such as those in As many as 3,644 proteins have significant disease TABLE 4, are currently rare. We expect the rate of knowl- (confirmed OMIM11 phenotype) associations. Given edge accumulation for Tdark proteins to be low, at least their TDL assignments (335 Tclin, 543 Tchem and 2,766 initially. Well-studied proteins require multiple layers of Tbio), we examined the distribution of the TDL in rela- management for diverse, rich sets of data, with informa- tion to druggable protein family categories (FIG. 1b). It tion and knowledge stemming from corpora such as bio- appears that Tbio–disease associations are quite rare for medical literature, patents and clinical trials. A paucity of druggable families such as nuclear receptors, ion chan- data and lack of information for understudied proteins nels and GPCRs, as these families are more likely to be

(Tbio and Tdark) affect both knowledge management and in the Tclin or Tchem category. Instead, Tbio assignments the decision-making process with respect to experimen- are quite frequent for transcription factors, epigenetic tal planning, what research questions need to be asked targets, transporters and unassigned protein families. (and in what order) and which methods may be better The exception among druggable families are olfactory suited for each task. For example, we examined access GPCRs, which appear to attract less interest from drug counts for human proteins in STRING18 during 2016 discovery programmes, despite some of these GPCRs (FIG. 2b). ‘Counts by name’ represents users that access being linked to metabolism and ageing47. the STRING website and type in a gene symbol. ‘Counts Concerted efforts focused on an entire target by link’ represents users accessing the network for a gene class have sometimes led to new drugs. For example, in STRING by linking to it from another resource (for GlaxoSmithKline (GSK) had a comprehensive pro- example, GeneCards45 or UniProt15). Whereas ‘Counts gramme aimed at finding new ligands and characteriz- by link’ shows a more comprehensive method to access ing the biology of nuclear receptors48. New insights into 49 the entire proteome, it also suggests that Tdark proteins acid metabolism and xenobiotic transcription of have a lower probability of being recognized (input) by cytochrome P450s50, mediated by nuclear receptors, gene name. These data show a pattern similar to that were described. A receptor (FXR) agonist has FIG. 2a observed in : Tdark proteins are less likely to be reached the market since this programme started: obet- the subject of scientific curiosity, which is a reflection icholic acid (Ocaliva), which was discovered by GSK of funding patterns and an overall lack of information in collaboration with the University of Perugia51 and and molecular probes. Indeed, the paucity of antibodies subsequently developed by Intercept Pharmaceuticals. and small molecules (criteria that help define Tbio and Currently, several FXR agonists are in clinical develop- 52 Tchem, respectively) that could be used to interrogate Tdark ment . Choosing the appropriate proteins as drug targets proteins diminishes our ability to subject Tdark proteins remains a complex process, where scientific factors need to scientific inquiry. to be balanced against commercial factors (such as com- Genomic and proteomic responses following radia- pany investors and medical insurance companies) and tion therapy are also understudied. One in vitro study46 societal factors (such as physicians and patients), as well suggested that as many as one-third of the 10,174 genes as legal factors (such as the requirements of regulatory examined in immortalized B cells following ionizing agencies)1. radiation are radioresponsive (GSE26835 column46 in Supplementary Table S3). Of the 447 genes with signif- Spotlight on G protein-coupled receptors icant fold changes in the GSE26835 set, only 26 are Tclin GPCRs are membrane-bound, cell-surface receptors that and 61 are Tchem, whereas the majority (268 Tbio, 92 Tdark) transduce signals via interactions with hetero­trimeric G are understudied (see also Supplementary Box S5). proteins, arrestins and other cellular transducers53,54. Alterations in GPCR signalling are implicated in the -­like behaviour compared with wild-type pathogenesis and treatment of neuropsychiatric55, mice70, whereas the association71 between schizophrenia 56 57,58 59 72,73 immunological , gastrointestinal , cardiac , renal, and Fzd3 mutants (Tbio) remains controversial . hormonal, infectious60 and many other disorders53,54. GPCRs represent the largest family of druggable targets Spotlight on protein kinases 9 in the human genome , with between 20% and 30% of Tclin and Tchem kinases. The ~600‑member human approved drugs acting on them53,54. kinome (Supplementary Table S3) is made up of pro- The number of publications per GPCR and the num- tein kinases, in addition to metabolic and lipid kinases, ber of chemicals associated with that GPCR in ChEMBL and is highly druggable using both competitive and were examined61 to determine which of the druggable, allosteric small-molecule inhibitors. However, the func- non-olfactory GPCRs are understudied: less than 100 tions of about one-third of the kinases in this family are citations and less than ten ChEMBL compounds define poorly defined or unknown. The 634 human kinases understudied, uninterrogated GPCRs. The number of were categorized as follows: Tclin, N = 50; Tchem, N = 390; (TABLE 1) publications, similar to the fractional PubMed publi- Tbio, N = 163; and Tdark, N = 31 . Tclin kinases cations count13, does not take into account large-scale are not exclusively protein kinases, and the number of (many proteins per paper) analyses. Counting ChEMBL FDA-approved small-molecule kinase inhibitors varies 74 compounds, a quantitative criterion similar to Tchem depending on inclusion criteria. Wu and colleagues assignments, does not consider bioactivity values. found 38 small-molecule protein kinase inhibi- However, this independent analysis validates the more tors. Based on DrugCentral, we found 50 approved general TDL criteria with respect to GPCR biological kinase inhibitors, including 40 small-molecule pro- functions and corresponding chemical matter. tein kinase inhibitors, of which 32 are FDA-approved, one FDA-approved protein kinase activator (ingenol

Tclin and Tchem G protein-coupled receptors. Currently mebutate) and the phosphoinositide 3‑kinase subunit-δ 827 GPCRs — including 421 olfactory GPCRs — are (PIK3CD) small-molecule inhibitor idelalisib, which is tracked by IDG KMC; of these, 96 are Tclin and 113 also FDA-approved. An additional seven FDA-approved are Tchem (none of which are olfactory). Slightly more antibodies target the receptor tyrosine kinases human than half of the non-olfactory GPCRs have annotated epidermal growth factor receptor 2 (HER2; also known as drugs and small molecules targeting them53,54,61; see ERBB2), epidermal growth factor receptor (EGFR), vas- also TABLE 1. A recent analysis indicates, however, that a cular endothelial growth factor receptor 2 (VEGFR2) and handful of GPCRs — mainly biogenic amine, muscarinic platelet-derived­ growth factor receptor-α (PDGFRα), and opioid receptors — represent the most abundantly and there is also an FDA-approved HER2‑targeting targeted receptors for FDA-approved medications53,54. antibody–drug conjugate, trastuzumab emtansine (see GPCRs also represent important off-targets for kinase Supplementary Table S9). inhibitors62,63, ion channel modulators64, anti-­infectives65 53,54 and other classes of drug-like molecules . As with Tbio and Tdark kinases. A number of Tbio and Tdark kinases other druggable target classes, off-target actions within are known to interact with FDA-approved multikinase the GPCR class can be associated with severe and life-­ inhibitors. According to data in DrugCentral, sorafenib threatening side effects. For example, valvular heart inhibits 114 kinases, of which only 9 are associated MoA- disease is associated with anorectic agents, such as fen- related targets, whereas sunitinib inhibits 263 kinases, of fluramine, and antimigraine , such as ergot- which 9 are MoA-related targets. Given the current state 66 amine, via serotonin 5‑HT2B receptor agonism . Recent of kinase inhibitor chemistry, it is very likely that Tbio and successes in structure-guided and cheminformatics-­ Tdark kinases can be effectively therapeutically targeted driven drug discovery show promise for creating safer with highly selective small-­molecule inhibitors. Some of and more effective medications targeting GPCRs. the characteristics shared by understudied Tbio and Tdark kinases include poorly defined integration of the kinase

Tbio and Tdark G protein-coupled receptors. Although 52 in signalling networks, poorly defined function and reg- non-olfactory GPCRs are categorized as Tdark, the avail- ulation, lack of activation-loop phospho-antibodies or ability of new screening platforms to discover chemical immunohistochemistry-grade antibodies, and lack of matter for these GPCRs has begun the process of illumi- selective chemical tools for functional characterization. nation64,67,68. Of 62 GPCRs for which significant pheno- Primary tools for knockout and/or altered expression type calls have been reported by IMPC (Supplementary are RNA interference (RNAi) and CRISPR–Cas9, and

Table S8), 24 are Tbio and 7 are Tdark; of these, 15 are asso- cDNAs for overexpression; kinase knockout or altered ciated with neurological and behavioural phenotypes. expression rarely provides readily assayable phenotypes Including olfactory GPCRs, 618 proteins are classified (for example, growth, migration, apoptosis or in vivo as Tdark or Tbio; of these, 126 non-olfactory GPCRs and function in mouse organ physiology). Currently, the 51 olfactory GPCRs have significant associations with IMPC has targeted 238 kinases with 114 knockouts hav- human diseases via OMIM, GWAS and text mining. ing a significant phenotype; of the latter, 22 are of current Whereas the majority of these associations (nearly 59%) interest for phase 2 of the IDG programme. 69 stem from text mining , 48 GPCRs have confirmed Many Tchem, Tbio or Tdark kinases are altered in expres- TABLE 5 associations from at least two information channels. For sion or mutated in TCGA. shows ten Tbio or Tdark example, Adgrb2-mutant mice (Tbio) showed significant kinases whose amplification is observed in the TCGA database, together with their RNA expression in tri- ion channels an attractive target class for drug develop- ple-negative breast cancer (TNBC) cells. These under- ment. Ion channels are mostly heteromeric complexes studied kinases are frequently altered in breast cancer. that require optimal interactions with ligands at specific The potential increased expression of many kinases in locations. Currently, 355 ion channel pore-forming and primary human tumours suggests these understudied auxiliary subunits are tracked by IDG KMC (TABLE 1; kinases have important functions for the tumour cell Supplementary Table S3). About 100 ion channel modula- phenotype that have not been characterized to date. tors, including auxiliary subunits, are reported, but to our These represent unexplored kinases with possible knowledge, a systematic list of cell type-specific auxiliary therapeutic utility. subunits for all ion channels is not available.

The potential therapeutic importance of Tbio and Tdark kinases in the kinome is highlighted by a recent clinical Tclin and Tchem ion channels. Many drugs are known to study that assessed the response to trametinib, a MEK1 bind to ion channels. There are 217 drugs annotated in 75 10 and MEK2 inhibitor, in TNBC patients . Pretreatment DrugCentral as acting through 125 (Tclin) ion channels needle biopsies and surgical tumour resections follow- for the MoA9. The number of drugs increases to 497 ing 7‑day trametinib treatment were used for RNA when querying how many drugs are known to interact sequencing (RNA-seq) to analyse tumour transcriptomic with ion channels outside of the MoA-related constraint. changes in response to the drug. Pretreatment biopsies Some of these interactions are likely to be responsible for matched to post-treatment surgical specimens showed side effects such as cardiac toxicity. An accurate under- overall concordance of the transcriptional kinase standing of MoA and side-effect assignment at the target response to trametinib, with FRK (Tchem) exhibiting the (molecular) level is required if we are to improve upon highest mean increase and cytoplasmic BMX (also Tchem) available drugs. For example, the anaesthetic ketamine, exhibiting the highest mean decrease among patients which has been postulated to act as a noncompetitive in response to a 7‑day drug treatment. Among the N-methyl-d-aspartate (NMDA) antagonist79, has been kinases transcriptionally altered in the TNBC tumours used off-label as an antidepressant80. However, in‑depth were several understudied kinases, including MRCK-γ analysis of the antidepressant effects of ketamine found (also known as CDC42BPG), PRKACB, STK32B and that its active metabolite (2R,6R)-hydroxynorketamine leukocyte tyrosine kinase receptor (LTK). These find- (HNK) does not block the NMDA receptor. Instead, ings demonstrate that in TNBC tumours in patients, HNK displays sustained activation of α-amino-3-­ members of the understudied Tbio and Tdark kinome are hydroxy-5-methyl-4-isoxazole propionic acid (AMPA) 81 co‑regulated transcriptionally with kinases from the Tclin receptors and lacks ketamine-related side effects . and Tchem category, in a dynamic adaptive response to This may pave the way for the development of novel, targeted inhibition. rapid-acting . It is therefore conceivable

that some ion channels currently categorized as Tclin or

Spotlight on ion channels Tchem are in need of further illumination with respect to Ion channels mediate signalling within cells, between MoA and drug specificity. Indeed, the low bioactivity cells, and between cells and their environment. Defects in cut-off criterion for ion channels (≤10 μM) in Tchem (see ion channels underlie many major disorders in humans, also Supplementary Figure S2) may need revision, given also known as channelopathies, including neuronal that older drugs continue to reveal unexpected modes disorders76, diabetes77 and heart failure78. This makes of action.

Table 5 | Understudied kinases that are frequently altered in breast cancer Gene name Protein name TDL SUM159 Alteration Kinase family averagea frequencyb (%)

TRIB1 Tribbles homologue 1 Tbio 342 24 CAMK

RPS6KC1 Ribosomal protein S6 kinase δ1 Tdark 463 23 Other

UHMK1 Serine/threonine-protein kinase Kist Tbio 1,308 23 Other

NRBP2 Nuclear receptor-binding protein 2 Tdark 350 21 Other

PIP5K1A Phosphatidylinositol 4‑phosphate 5‑kinase type Tbio 2,246 19 Metabolic 1‑α

CDK12 Cyclin-dependent kinase 12 Tbio 3,148 14 CMGC

MAP3K1 Mitogen-activated protein kinase kinase kinase 1 Tbio 455 11 STE

STRADA STE20‑related kinase adapter protein-α Tbio 464 9 STE

BCKDK (3‑methyl‑2‑oxobutanoate dehydrogenase Tbio 1,087 6 Atypical (lipoamide)) kinase, mitochondrial

EEF2K Eukaryotic elongation factor 2 kinase Tbio 1,176 6 Atypical aExpression levels in SUM159PT claudin-low triple-negative breast cancer (TNBC) cells. bAlteration frequency in breast cancer from The Cancer Genome Atlas (TCGA). TDL, target development level. Tdark ion channels. A relatively small number of ion This implicitly requires the construction of a knowledge (TABLE 1) channels (31) are categorized as Tdark . Part of base objectively and in an unbiased manner, asserting the difficulty in illuminating dark ion channels is the what is currently believed to be true (a process that is replication of physiological context and expression of explored metaphorically in a classic book by Italo Calvino, proteins in the appropriate heteromeric, pore-forming ‘The Castle of Crossed Destinies’; see Supplementary Box functional complexes. Currently, there are no scalable S1). The IDG KMC enables us to quantitatively demon- systems available to study the localization of functional strate the existence of a knowledge deficit with respect to complexes. Moreover, most ion channels have paralogues dark and understudied proteins, which underscores the that function redundantly. Gene redundancy increases need for basic science and its major role in illuminating the difficulty of revealing phenotype and precise localiza- gene functions and roles in human disease. The TDL clas- tion, both important elements for understanding physio­ sification scheme provides a convenient way to partition logical functions other than ion channel activity. This human targets that highlights the focus (or lack thereof) considerably delays our progress in understanding ion of science and drug discovery efforts on different targets. channel function in vivo and their role in human health. Through the use of the TDL groupings, we can highlight Unlike GPCRs or kinases, neither pore-forming­ subunits knowledge accumulation, as well as deficits, for a vari- nor auxiliary subunits share characteristic motifs. Lack ety of target families, with a common theme being that of specific protein sequence motifs makes it difficult to while much is known, there remains a large fraction of the flag candidate genes for further study, even with compu- proteome that is understudied. The IDG KMC, by collat- tational assertions. There could be other ion channels, ing and linking a plethora of disparate and diverse data which perhaps should be categorized as Tdark.dark, to reflect sources and data types, aims to shed light on these dark our complete lack of knowledge, even by computational regions with the hope that researchers will be empowered means, regarding these proteins. to use the data and knowledge presented by the KMC to The list of ion channel pore-forming subunits, as well jumpstart research programmes on these targets. as auxiliary subunits, continues to grow. For example, Confirmed associations with a specific disease, or leucine-rich repeat-containing protein 8 (LRRC8) heter- receptor deorphanization (TABLE 4), remain major incen- 82 omers form volume-regulated anion channels (VRACs), tives to allocate resources and further study of Tdark and ORAI proteins assemble to form83 calcium-­release- proteins. As mentioned above, the only other deliber- activated calcium channels (CRACs), whereas anoc- ate targeted effort to study Tdark proteins in addition to tamins are olfactory calcium-activated chloride channels IDG is the IMPC. As of March 2017, mouse lines cor- (CACCs)84. Currently, LRRC8B, LRRC8C and LRRC8D responding to 4,165 human genes have been produced, subunits are classified as Tdark, with the exception being with phenotypes available, 2,788 of which have resulted the subunit LRRC8A (Tbio); ORAI1 is annotated as Tchem, in statistically significant phenotype calls. Of these 2,788 wheras the ORAI2 and ORAI3 proteins are annotated as proteins, 827 (436 Tdark) are not associated with any NIH-

Tbio. With the exception of anoctamin 1 (Tclin), all other funded grants between 2000 and 2015 (Supplementary anoctamins are labelled Tbio. These, and all other Tdark Table S8). By contrast, only 120 of the 1,961 proteins with proteins that lack computational assertions, are in need significant IMPC phenotype calls and that are associated of systematic genomic-scale studies. with NIH funding are Tdark. It was Edgar Allan Poe who once said, “the enormous multiplication of books in every Conclusions branch of knowledge is one of the greatest evils of this Modern medicine often employs artificial distinctions age, since it presents one of the most serious obstacles to in terms of what and how biological systems are stud- the acquisition of correct information”. Poe’s 19th cen- ied: segregated by organ (for example, ophthalmology tury line of thought is remarkably apt in the context of and cardiology) or by disease (for example, oncology and current KMC activities, since the “acquisition of correct infectious diseases), medical specialty separations carry information” remains the largest challenge. over into the research arena, both in academia and indus- Another challenge relates to an area of knowledge try. This distinction breaks down in nature, as we are largely neglected in the scientific literature: the large-scale likely to observe the interplay between the same genes capture of negative results. Due to confirmation bias8, and pathways regardless of organ, albeit in a context-­ scientists have a tendency to primarily publish success- specific manner. These artificial divisions can prevent sci- ful accounts of research. Although there are attempts to entists from achieving a translational, integrative view of overcome this problem86, we are not aware of the exist- gene and protein function. We suspect this to be another ence of an unbiased, easy mechanism to capture negative reason why funding to study Tdark proteins is scarce: for results. The aphorism “absence of evidence is not evidence functionally enigmatic proteins, or the ‘ignorome’ (REF. 85), of absence” illustrates practical limitations of knowledge anticipating which organ, disease or phenotype is relevant management systems: does lack of evidence imply that may be far from trivial. To address this limitation, the the study was conducted, but nothing was found, or does NIH launched a series of high-risk programmes via it imply (more often) that the measurement was not car- the Common Fund resource, aimed to catalyse trans-­ ried out? Proper archiving of negative results (for example, disciplinary research. The IDG is one such Common “protein P is not expressed in cell type CT” or “gene muta- Fund programme. The IDG programme’s ostensible tion Gm does not play a role in disease D”) would benefit goal is to encourage and track the illumination of rela- the community at large and would improve our specific tively understudied and unstudied parts of the genome. knowledge about proteins. However, such non-positive facts fit poorly to current publishing and citation para- Tdark proteins in particular, experimentalists would like digms. One possibility for archiving such statements could to know what experiment to do next, what phenotypic be nanopublications87, which would be amenable to large- changes should be examined first and which pathway is scale integration into systems such as TCRD–Pharos. relevant in a specific disease. These, and other similar Finally, a key challenge faced by IDG KMC, and questions, have yet to find a computer-driven, reliable perhaps by other data analysts working in drug dis- answer. To paraphrase William Gibson, “The truth is covery, is that of reliable predictions: when examining already here — it’s just not very evenly distributed”.

1. Knowles, J. & Gromo, G. Target selection in drug 28. Waring, M. J. et al. An analysis of the attrition of drug 52. Hambruch, E., Kinzel, O. & Kremoser, C. discovery. Nat. Rev. Drug Discov. 2, 63–69 (2003). candidates from four major pharmaceutical On the pharmacology of 2. Edwards, A. M. et al. Too many roads not taken. companies. Nat. Rev. Drug Discov. 14, 475–486 agonists: give me an ‘A’, like in ‘acid’. Nucl. Recep. Res. Nature 470, 163–165 (2011). (2015). 3, 101207 (2016). 3. Alberts, B., Kirschner, M. W., Tilghman, S. & 29. Hunter, S. et al. InterPro in 2011: new developments 53. Wacker, D., Stevens, R. C. & Roth, B. L. How ligands Varmus, H. Rescuing US biomedical research from its in the family and domain prediction database. Nucleic illuminate GPCR molecular pharmacology. Cell 170, systemic flaws. Proc. Natl Acad. Sci. USA 111, Acids Res. 40, D306–312 (2012). 414–427 (2017). 5773–5777 (2014). 30. Kruger, F. A., Gaulton, A., Nowotka, M. & 54. Roth, B. L., Irwin, J. J. & Shoichet, B. K. Discovery of 4. Kim, S. et al. PubChem Substance and Compound Overington, J. P. PPDMs‑a resource for mapping new GPCR ligands to illuminate new biology. databases. Nucleic Acids Res. 44, D1202–D1213 small molecule bioactivities from ChEMBL to Pfam‑A Nat. Chem. Biol. 13, 1143–1151 (2017). (2016). protein domains. Bioinformatics 31, 776–778 55. Roth, B. L., Sheffler, D. J. & Kroeze, W. K. Magic 5. Gaulton, A. et al. The ChEMBL database in 2017. (2015). shotguns versus magic bullets: selectively non- Nucleic Acids Res. 45, D945–D954 (2017). 31. Campillos, M., Kuhn, M., Gavin, A.‑C., Jensen, L. J. & selective drugs for mood disorders and 6. Tomczak, K., Czerwin´ska, P. & Wiznerowicz, M. The Bork, P. Drug target identification using side-effect schizophrenia. Nat. Rev. Drug Discov. 3, 353–359 Cancer Genome Atlas (TCGA): an immeasurable source similarity. Science 321, 263–266 (2008). (2004). of knowledge. Contemp. Oncol. 19, A68–A77 (2015). 32. Keiser, M. J. et al. Predicting new molecular targets 56. Hernandez, P. A. et al. Mutations in the chemokine 7. Munafò, M. R. et al. A manifesto for reproducible for known drugs. Nature 462, 175–181 (2009). receptor gene CXCR4 are associated with WHIM science. Nat. Hum. Behav. 1, 0021 (2017). 33. Huang, X. & Dixit, V. M. Drugging the undruggables: syndrome, a combined immunodeficiency disease. 8. Nickerson, R. S. Confirmation bias: a ubiquitous exploring the ubiquitin system for drug development. Nat. Genet. 34, 70–74 (2003). phenomenon in many guises. Rev. Gen. Psychol. 2, Cell Res. 26, 484–498 (2016). 57. Sternini, C. Receptors and transmission in the brain- 175–220 (1998). 34. Lai, A. C. & Crews, C. M. Induced protein degradation: gut axis: potential for novel therapies. III. Mu‑opioid 9. Santos, R. et al. A comprehensive map of molecular an emerging drug discovery paradigm. Nat. Rev. Drug receptors in the enteric nervous system. drug targets. Nat. Rev. Drug Discov. 16, 19–34 (2017). Discov. 16, 101–114 (2017). Am. J. Physiol. Gastrointest. Physiol. 281, 10. Ursu, O. et al. DrugCentral: online drug compendium. 35. Sakamoto, K. M. et al. Protacs: chimeric molecules G8–15 (2001). Nucleic Acids Res. 45, D932–D939 (2017). that target proteins to the Skp1–Cullin–F box 58. Sternini, C. Taste receptors in the gastrointestinal 11. Amberger, J., Bocchini, C. A., Scott, A. F. & complex for ubiquitination and degradation. tract. IV. Functional implications of bitter taste Hamosh, A. McKusick’s Online Mendelian Inheritance Proc. Natl Acad. Sci. 98, 8554–8559 (2001). receptors in gastrointestinal chemosensing. in Man (OMIM). Nucleic Acids Res. 37, D793–D796 36. Gadd, M. S. et al. Structural basis of PROTAC Am. J. Physiol. Gastrointest. Liver Physiol. 292, (2009). cooperative recognition for selective protein G457–461 (2007). 12. Ashburner, M. et al. Gene ontology: tool for the degradation. Nat. Chem. Biol. 13, 514–521 (2017). 59. Rockman, H. A., Koch, W. J. & Lefkowitz, R. J. Seven- unification of biology. Nat. Genet. 25, 25–29 (2000). 37. Mungall, C. J. et al. The Monarch Initiative: transmembrane-spanning receptors and heart 13. Pletscher-Frankild, S., Pallejà, A., Tsafou, K., an integrative data and analytic platform connecting function. Nature 415, 206–212 (2002). Binder, J. X. & Jensen, L. J. Diseases: text mining and phenotypes to genotypes across species. Nucleic 60. Elphick, G. F. et al. The human polyomavirus, JCV, uses data integration of disease-gene associations. Acids Res. 45, D712–D722 (2017). serotonin receptors to infect cells. Science 306, Methods 74, 83–89 (2015). 38. Dickinson, M. E. et al. High-throughput discovery of 1380–1383 (2004). 14. Kiermer, V. Antibodypedia. Nat. Methods 5, 860–861 novel developmental phenotypes. Nature 537, 61. Roth, B. L. & Kroeze, W. K. Integrated approaches (2008). 508–514 (2016). for genome-wide interrogation of the druggable 15. UniProt Consortium. UniProt: a hub for protein 39. MacArthur, J. et al. The new NHGRI-EBI Catalog of non-olfactory G protein-coupled receptor information. Nucleic Acids Res. 43, D204–D212 (2015). published genome-wide association studies (GWAS superfamily. J. Biol. Chem. 290, 19471–19477 16. Papadatos, G. et al. SureChEMBL: a large-scale, Catalog). Nucleic Acids Res. 45, D896–D901 (2015). chemically annotated patent document database. (2017). 62. Elkins, J. M. et al. Comprehensive characterization of Nucleic Acids Res. 44, D1220–1228 (2016). 40. GTEx Consortium. The Genotype-Tissue Expression the Published Kinase Inhibitor Set. Nat. Biotechnol. 17. Rouillard, A. D. et al. The harmonizome: a collection of (GTEx) project. Nat. Genet. 45, 580–585 (2013). 34, 95–103 (2016). processed datasets gathered to serve and mine 41. GTEx Consortium et al. Genetic effects on gene 63. Lin, X. et al. Life beyond kinases: structure-based knowledge about genes and proteins. Database expression across human tissues. Nature 550, discovery of sorafenib as nanomolar antagonist of 2016, baw100 (2016). 204–213 (2017). 5‑HT receptors. J. Med. Chem. 55, 5749–5759 18. Szklarczyk, D. et al. STRING v10: protein-protein 42. Uhlén, M. et al. Proteomics. Tissue-based map of the (2012). interaction networks, integrated over the tree of life. human proteome. Science 347, 1260419 (2015). 64. Huang, X.‑P. et al. Allosteric ligands for the Nucleic Acids Res. 43, D447–D452 (2015). 43. Kim, M.‑S. et al. A draft map of the human proteome. pharmacologically dark receptors GPR68 and GPR65. 19. Hajduk, P. J., Huth, J. R. & Tse, C. Predicting protein Nature 509, 575–581 (2014). Nature 527, 477–483 (2015). druggability. Drug Discov. Today 10, 1675–1682 44. Lenat, D. B. & Feigenbaum, E. A. On the thresholds of 65. Chan, J. D. et al. The praziquantel is (2005). knowledge. Artif. Intell. 47, 185–250 (1991). a human serotoninergic G‑protein-coupled receptor 20. Hopkins, A. L. & Groom, C. R. The druggable genome. 45. Fishilevich, S. et al. Genic insights from integrated ligand. Nat. Commun. 8, 1910 (2017). Nat. Rev. Drug Discov. 1, 727–730 (2002). human proteomics in GeneCards. Database 2016, 66. Roth, B. L. Drugs and valvular heart disease. N. Engl. 21. Surade, S. & Blundell, T. L. Structural biology and baw030 (2016). J. Med. 356, 6–9 (2007). drug discovery of difficult targets: the limits of 46. Smirnov, D. A. et al. Genetic variation in radiation- 67. Kroeze, W. K. et al. PRESTO-Tango as an open-source ligandability. Chem. Biol. 19, 42–50 (2012). induced cell death. Genome Res. 22, 332–339 resource for interrogation of the druggable human 22. Kubinyi, H. Drug research: myths, hype and reality. (2012). GPCRome. Nat. Struct. Mol. Biol. 22, 362–369 Nat. Rev. Drug Discov. 2, 665–668 (2003). 47. Garrison, J. L. & Knight, Z. A. Linking smell to (2015). 23. Yang, X. et al. Widespread expansion of protein metabolism and aging. Science 358, 718–719 68. Lansu, K. et al. In silico design of novel probes for the interaction capabilities by alternative splicing. (2017). atypical opioid receptor MRGPRX2. Nat. Chem. Biol. Cell 164, 805–817 (2016). 48. Kliewer, S. A., Lehmann, J. M. & Willson, T. M. Orphan 13, 529–536 (2017). 24. Mestres, J., Gregori-Puigjané, E., Valverde, S. & nuclear receptors: shifting endocrinology into reverse. 69. Pafilis, E. et al. The SPECIES and ORGANISMS Solé, R. V. Data completeness—the Achilles heel of Science 284, 757–760 (1999). Resources for Fast and Accurate Identification of drug-target networks. Nat. Biotechnol. 26, 983–984 49. Willson, T. M., Jones, S. A., Moore, J. T. & Taxonomic Names in Text. PLoS ONE 8, e65390 (2008). Kliewer, S. A. Chemical genomics: functional analysis (2013). 25. Schreiber, S. L. et al. Advancing biological of orphan nuclear receptors in the regulation of bile 70. Okajima, D., Kudo, G. & Yokota, H. Antidepressant- understanding and therapeutics discovery with small- acid metabolism. Med. Res. Rev. 21, 513–522 like behavior in brain-specific angiogenesis inhibitor molecule probes. Cell 161, 1252–1265 (2015). (2001). 2‑deficient mice. J. Physiol. Sci. 61, 47–54 (2011). 26. Austin, C. P., Brady, L. S., Insel, T. R. & Collins, F. S. 50. Moore, L. B. et al. Orphan nuclear receptors 71. Katsu, T. et al. The human frizzled‑3 (FZD3) gene on NIH molecular libraries initiative. Science 306, constitutive androstane receptor and pregnane X 8p21, a receptor gene for Wnt ligands, 1138–1139 (2004). receptor share xenobiotic and steroid ligands. J. Biol. is associated with the susceptibility to schizophrenia. 27. Southan, C. et al. The IUPHAR/BPS guide to Chem. 275, 15122–15127 (2000). Neurosci. Lett. 353, 53–56 (2003). pharmacology in 2016: towards curated quantitative 51. Pellicciari, R. et al. 6alpha‑ethyl-chenodeoxycholic acid 72. Wei, J. & Hemmings, G. P. Lack of a genetic interactions between 1300 protein targets and 6000 (6‑ECDCA), a potent and selective FXR agonist association between the frizzled‑3 gene and ligands. Nucleic Acids Res. 44, D1054–D1068 endowed with anticholestatic activity. J. Med. Chem. schizophrenia in a British population. Neurosci. Lett. (2016). 45, 3569–3572 (2002). 366, 336–338 (2004). 73. Jeong, S. H., Joo, E. J., Ahn, Y. M., Lee, K. Y. & 92. Griffith, M. et al. CIViC is a community knowledgebase and G.P.); and by Novo Nordisk Foundation Denmark grant Kim, Y. S. Investigation of genetic association between for expert crowdsourcing the clinical interpretation of NNF14CC0001 (S.B., L.J.L. and D.W). R.G., A.J., D.T.N., A.S., human Frizzled homolog 3 gene (FZD3) and variants in cancer. Nat. Genet. 49, 170–174 (2017). N.S., and G.Z.K. were supported by the Intramural Research schizophrenia: results in a Korean population and 93. Koscielny, G. et al. Open Targets: a platform for Program, National Center for Advancing Translational evidence from meta-analysis. Psychiatry Res. 143, therapeutic target identification and validation. Sciences (NCATS) and by U54 CA189205. Dedicated to 1–11 (2006). Nucleic Acids Res. 45, D985–D994 (2017). Francisc Schneider (1933–2017). 74. Wu, P., Nielsen, T. E. & Clausen, M. H. Small-molecule 94. Lin, Y. et al. Drug target ontology to classify and kinase inhibitors: an analysis of FDA-approved drugs. integrate drug discovery data. J. Biomed. Semant. 8, Competing interests Drug Discov. Today 21, 5–10 (2016). 50 (2017). The authors declare competing interests: see Web version 75. Zawistowski, J. S. et al. Enhancer remodeling during 95. Maggon, K. Best-selling human medicines 2002– for details. adaptive bypass to MEK inhibition is attenuated by 2004. Drug Discov. Today 10, 739–742 (2005). pharmacologic targeting of the P‑TEFb complex. 96. Stebbins, S. The world’s 15 top selling drugs. 24/7 Wall Publisher’s note Cancer Discov. 7, 302–321 (2017). St. http://247wallst.com/special-report/2016/04/26/ Springer Nature remains neutral with regard to jurisdictional 76. Kullmann, D. M. The neuronal channelopathies. top-selling-drugs-in-the-world/ (2016). claims in published maps and institutional affiliations. Brain 125, 1177–1195 (2002). 97. Hauser, A. S., Attwood, M. M., Rask-Andersen, M., 77. Gloyn, A. L. et al. Large-scale association studies of Schiöth, H. B. & Gloriam, D. E. Trends in GPCR drug variants in genes encoding the pancreatic beta-cell discovery: new agents, targets and indications. RELATED LINKS KATP channel subunits Kir6.2 (KCNJ11) and SUR1 Nat. Rev. Drug Discov. 16, 829–842 (2017). Anamorelin: https://en.wikipedia.org/wiki/Anamorelin (ABCC8) confirm that the KCNJ11 E23K variant is 98. Shih, H.‑P., Zhang, X. & Aronov, A. M. Drug discovery ATC codes: https://www.whocc.no/atc_ddd_index/ associated with type 2 diabetes. Diabetes 52, effectiveness from the standpoint of therapeutic Channelopathy: https://en.wikipedia.org/wiki/ 568–572 (2003). mechanisms and indications. Nat. Rev. Drug Discov. Channelopathy 78. Marbán, E. Cardiac channelopathies. Nature 415, 17, 19–33 (2018). ChemIDplus: https://chem.nlm.nih.gov/chemidplus/ 213–218 (2002). 99. Tartaglia, L. A. et al. Identification and expression ClinicalTrials.gov: https://clinicaltrials.gov/ 79. Berman, R. M. et al. Antidepressant effects of cloning of a leptin receptor, OB‑R. Cell 83, Cochrane Collaboration: http://www.cochrane.org/ ketamine in depressed patients. Biol. Psychiatry 47, 1263–1271 (1995). DrugCentral: http://drugcentral.org/ 351–354 (2000). 100. Xie, J. et al. Activating Smoothened mutations in Drug Target Ontology: http://drugtargetontology.org/ 80. Kirby, T. Ketamine for depression: the highs and lows. sporadic basal-cell carcinoma. Nature 391, 90–92 Edgar Allan Poe quotes: https://quoteinvestigator. Lancet Psychiatry 2, 783–784 (2015). (1998). com/2017/07/12/many-books/ 81. Zanos, P. et al. NMDAR inhibition-independent 101. Lee, M. J. et al. Sphingosine‑1‑phosphate as a ligand Evidence of absence: https://en.wikipedia.org/wiki/ antidepressant actions of ketamine metabolites. for the G protein-coupled receptor EDG‑1. Science Evidence_of_absence Nature 533, 481–486 (2016). 279, 1552–1555 (1998). Harmonizome: http://amp.pharm.mssm.edu/Harmonizome/ 82. Pedersen, S. F., Klausen, T. K. & Nilius, B. 102. Sakurai, T. et al. Orexins and orexin receptors: Illuminating the Druggable Genome: https://commonfund. The identification of a volume-regulated anion a family of hypothalamic and G protein- nih.gov/idg Acta Physiol. 213 channel: an amazing Odyssey. , coupled receptors that regulate feeding behavior. IMPC (International Mouse Phenotyping Consortium): Cell 92 868–881 (2015). , 573–585 (1998). http://www.mousephenotype.org/ 83. Niemeyer, B. A. Changing calcium: CRAC channel 103. Abifadel, M. et al. Mutations in PCSK9 cause IMPC information about Adgrd1: http://www. (STIM and Orai) expression, splicing, and autosomal dominant hypercholesterolemia. mousephenotype.org/data/genes/MGI:3041203 posttranslational modifiers. Am. J. Physiol. Cell Nat. Genet. 34, 154–156 (2003). IMPC information about Alpk3: http://www. Physiol. 310, C701–709 (2016). 104. Kojima, M. et al. Ghrelin is a growth-hormone- mousephenotype.org/data/genes/MGI:2151224 84. Dauner, K., Lissmann, J., Jeridi, S., Frings, S. & releasing acylated peptide from stomach. Nature 402, Integrity | Available from Clarivate Analytics at: https:// Möhrlen, F. Expression patterns of anoctamin 1 and 656–660 (1999). clarivate.com/products/integrity/ anoctamin 2 chloride channels in the mammalian 105. Temel, J. S. et al. Anamorelin in patients with non- L1000: https://www.ncbi.nlm.nih.gov/geo/query/acc. nose. Cell Tissue Res. 347, 327–341 (2012). small-cell lung cancer and cachexia (ROMANA 1 and cgi?acc=GPL20573 85. Pandey, A. K., Lu, L., Wang, X., Homayouni, R. & ROMANA 2): results from two randomised, double- MIDAS Platform | IQVIA analytics platform for industry- Williams, R. W. Functionally enigmatic genes: blind, phase 3 trials. Lancet Oncol. 17, 519–531 leading sales and medical data available at: https://www. a case study of the brain ignorome. PLoS ONE 9, (2016). iqvia.com/solutions/commercialization/geographies/midas e88889 (2014). Monarch Initiative: http://www.monarchinitiative.org/ 86. Pfeffer, C. & Olsen, B. R. Editorial: Journal of negative Acknowledgements NIH Common Fund: https://commonfund.nih.gov/ results in biomedicine. J. Negat. Results Biomed. 1, 2 This work was supported by US National Institutes of Health NIH ExPORTER: https://exporter.nih.gov/ (2002). (NIH) grants U54 CA189205 and U24 224370 (Illuminating NIH RePORTER: https://projectreporter.nih.gov/reporter.cfm 87. Groth, P., Gibson, A. & Velterop, J. The anatomy of the Druggable Genome Knowledge Management Center (IDG Pharos: https://pharos.nih.gov a nanopublication. Inf. Serv. Use 30, 51–56 (2010). KMC)) at the University of New Mexico, Novo Nordisk PubMed: https://pubmed.gov 88. Agarwal, P. & Searls, D. B. Can literature analysis Foundation Center for Protein Research, European STRING: https://string-db.org/ identify innovation drivers in drug discovery? Nat. Rev. Bioinformatics Institute (EBI) and University of Miami, U54 The Cancer Genome Atlas: https://cancergenome.nih.gov/ Drug Discov. 8, 865–878 (2009). CA189201 and U24 CA224260 (A.M., Mount Sinai), P30 Target Central Resource Database: http://juniper.health. 89. Nguyen, D.‑T. et al. Pharos: Collating protein CA118100 (T.I.O., G.N.G. and L.A.S., UNM) and UL1 unm.edu/tcrd/ information to shed light on the druggable genome. TR001449 (T.I.O. and L.A.S.), UM1 HG006370 (International William Gibson Wikiquotes: https://en.wikiquote.org/wiki/ Nucleic Acids Res. 45, D995–D1002 (2017). Mouse Phenotyping Consortium, T.F.M. and I.T.), U01 William_Gibson 90. Wishart, D. S. et al. DrugBank 5.0: a major update to MH104974 (B.L.R.), U01 MH104984 (S.T.), U01 MH105028 the DrugBank database for 2018. Nucleic Acids Res. (M.T.M.), U01 MH105026 (J.Q. and A.M., Baylor) and U01 SUPPLEMENTARY INFORMATION 46, D1074–D1082 (2017). MH104999, R01 CA177993 and U24 DK116204 (S.G. and See online article: S1 (box) | S2 (figure) | S3 (table) | S4 (table) | 91. The UniProt Consortium. UniProt: the universal G.L.J.) and by the European Molecular Biology Laboratory S5 (box) | S6 (box) | S7 (table) | S8 (table) | S9 (table) | S10 protein knowledgebase. Nucleic Acids Res. 45, (EMBL) and Wellcome Trust Strategic Awards WT086151/ (table) | S11 ( figure) D158–D169 (2017). Z/08/Z and WT104104/Z/14/Z (A.G., A.H., A.R.L., A.K., J.P.O.,