R&D Solutions | Drug Discovery & Development

From Molecule to Phenotype: Predictive Modeling in Data-Driven Drug Discovery

Summary

Despite growing scientific insight and technological investment, attempts to develop novel therapies still show limited success. Traditional drug screening approaches oversimplify the complexity of living cells by focusing on a one-to-one substance–target relationship. Modern data-driven drug design uses all available layers of information to create predictive models that select compounds likely to have the desired effect on the phenotype. High-quality, carefully curated data are essential for this approach.

The last 30 years have brought enormous progress in biological science. From recombinant protein technologies through completion of the human genome to the discovery of epigenetics, our knowledge has expanded in an unprecedented manner. However, insights in basic research have failed to spur an increase in novel therapies. Despite growing technological investment, the number of new drugs approved each year has not changed much since the 1980s (Figure 1). This effect applies to all therapy approaches. First-in-class drugs, which use a novel mechanism of action, often suffer from side effects. Follower drugs, which build on a therapy already established in the clinic, often provide no therapeutic improvement. Many drugs look promising in biochemical assays but fail in the clinical phase because of low efficacy or unexpected side effects. Existing knowledge of biological processes is not yet fully implemented in drug design, ignoring the gap between individual molecular interactions and the complexity of whole organisms.

Today, most drugs are designed with a one-to-one substance–target relationship in mind. This approach requires extensive characterization of the target protein (1). An example is the frequently used structure-based drug design, which calls for precise target structure information, down to the localization of amino acid side chains and their interaction with the compound. Getting to this level of detail for a protein target is cumbersome and time-consuming, with the effect that developers tend to focus on already characterized targets rather than exploring new ones. For example, according to a Forbes study, 200 of 1,000 active oncology programs in 2015 targeted only eight proteins (2).

[Figure 1 appears here: dark blue bars show new molecular entities approved per year (1980–2014); an orange line shows annual R&D expenditure in billion US dollars; pale blue arrows mark the arrival of recombinant DNA, DNA sequence libraries, data mining, and the completed human genome.]

Figure 1. Progress in biological science has not led to an increase in newly approved drugs. Dark blue: the number of new molecular entities (NMEs; and, starting in 2004, new biological entities) approved by the U.S. FDA each year from 1980 to 2014. Orange line: the annual R&D expenditure reported by US pharma company members. Pale blue: the time points when technologies supporting target-oriented approaches to drug development became available (arrows). The outlier peak in NME approvals observed in 1996 stems from the review of backlogged FDA submissions after an additional 600 new drug reviewers and support staff were hired, funded by the Prescription Drug User Fee Act. Data extracted from the U.S. FDA website, DiMasi et al. (1991), and Statista (3–5).

Rising to the Phenotype Level

Even if detailed information on every possible target were available, it would still not show the entire picture of a phenotype. Approaches that reduce drug design to an isolated interaction of two partners ignore the dynamics that arise from interactions between the target and other proteins within a network, the interaction of multiple networks within the cell, the interaction of cells within the tissue, and the interaction of organs within the body. Therapeutic approaches can fail because of this disconnect of scope. For example, the lung cancer therapy drug gefitinib was shown to be less effective in vivo because neighboring healthy fibroblasts changed the two- and three-dimensional arrangement of the tumor cells, attenuating compound uptake (6).

Another important aspect is the promiscuity of both compounds and targets. Even validated, FDA-approved drugs were shown to bind to six different targets on average (7), demonstrating the need for better-targeted compounds. But simply screening more molecules does not help. As Dr. John Manchester, computational chemist at AstraZeneca, points out: “During the 1990s, screening became very efficient with investment in miniaturization and high-throughput technology. The hope was that if you screened enough compounds, you would certainly find drug leads. That didn’t happen.”

Can Computational Chemistry Close the Gap?

In contrast to in vitro experiments, which are limited in scope and throughput, data-driven methods enable zooming out from individual molecular interactions to the complete organism. Computational chemistry has provided various approaches to turn vast amounts of data into relevant predictions of drug efficacy. From in silico screening of drug candidates, through network simulations, to virtual screening based on chemical genomics, many new options are emerging.

Modeling at the Single Molecule Level

Modeling at the single molecule level helps select the most promising candidates very early in the drug development process, before actual experiments are even performed. If protein structure information is available, virtual compound libraries can be screened in silico. Compound properties such as size, shape and charge distribution make it possible to predict not only target binding kinetics, but also potential toxicity and pharmacodynamics. Even without protein structure information, screening of virtual compound libraries can narrow down the number of compound candidates. Once set up, this process can screen millions of compounds in a relatively short time. Even though these models are statistical, they support the selection of an enriched set of promising candidates for in vitro bioactivity assays.

As part of his research at AstraZeneca, Dr. Manchester has applied modeling at the single molecule level to the search for new antibiotics. In his models, he considers both the behavior of compounds in the human body (such as toxicity and clearance rates) and the mechanisms that import drugs into the bacterial cell. Compounds must meet all requirements to be considered for cell-based assays: low toxicity, good bioavailability and feasible import mechanisms. He believes that “drug discovery will benefit from resisting the temptation to develop and follow rules that oversimplify what we know about structure and function of compounds. We need to summarize less and instead apply our complete knowledge towards predicting compound behavior; in silico models enable that.”
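As an illustration of this kind of property-based pre-filtering, the sketch below applies simple, loosely Lipinski-like cut-offs to a small virtual library. The thresholds, field names and compounds are invented for demonstration; real pipelines use much richer descriptor sets and trained statistical models.

```python
# Illustrative pre-filter for a virtual compound library: keep only
# candidates whose computed properties fall in a drug-like range.
# Thresholds are loosely Lipinski-like and purely for demonstration.

def druglike(compound):
    """Return True if a compound's properties pass simple drug-likeness cuts."""
    return (compound["mw"] <= 500        # molecular weight
            and compound["logp"] <= 5    # lipophilicity
            and compound["hbd"] <= 5     # hydrogen-bond donors
            and compound["hba"] <= 10)   # hydrogen-bond acceptors

# Hypothetical library entries: name plus precomputed descriptors.
library = [
    {"name": "cmpd-001", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"name": "cmpd-002", "mw": 612.8, "logp": 6.3, "hbd": 4, "hba": 9},
    {"name": "cmpd-003", "mw": 287.3, "logp": 1.4, "hbd": 1, "hba": 4},
]

hits = [c["name"] for c in library if druglike(c)]
print(hits)  # → ['cmpd-001', 'cmpd-003']; cmpd-002 fails the MW and logP cuts
```

In practice this kind of filter is only the cheapest first pass; the enriched set it produces is what then goes into binding, toxicity and pharmacokinetics models.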

Figure 2. Complexity of interaction networks of G-protein coupled receptors (GPCRs) and their ligands. Each line represents a bioactivity of 10 µM for a specific GPCR–ligand interaction based on experimental data or as calculated by chemical genomics-based virtual screening. The node color indicates the classes that compounds and GPCRs belong to (blue, amines; red, peptides; yellow, prostanoids; green, nucleotides). The links colored from green through yellow to red indicate increasing confidence in the GPCR–ligand interaction. Graphic used with permission from Brown and Okuno, 2012 (11).

Modeling at the Network Level

Drug development often aims to modify a single protein in the hope of changing a phenotype, but this phenotype is the sum of many interconnected network interactions (Figure 2). Dr. Jonny Wray, computational neuroscientist and Head of Discovery Informatics at e-Therapeutics in Oxford, UK, builds network models that comprise 500–1,500 proteins, corresponding to about 10% of the proteins in a cell. He employs stochastic optimization algorithms to find a subset of proteins with the largest expected impact on phenotype that can be used as drug targets. With this model, he can screen his virtual library of ten million compounds—of which half have known bioactivity data and half are predicted by machine learning—for interactions with the selected proteins. This virtual compound screening helps his team select a set of about a thousand compound candidates that are tested directly in phenotypic assays.

Sidebar: With the development of molecular biology and recombinant protein technologies in the 1980s, phenotypic assays were almost entirely abandoned. Today, more and more phenotypic assays are again performed in drug screening, as they provide information about drug efficacy at a more complex level, such as cytotoxicity, pharmacokinetics and off-target effects. Cell cultures or small model organisms like worms, zebrafish or flies are used, but also co-cultures of, for example, tumor and healthy fibroblast cells to simulate the tumor environment in the patient body. Once promising compounds are selected, the interacting target is identified by molecular methods, such as ligand-based chromatography or phage display. The combined molecular and phenotypic information feeds into lead optimization.

Building these models requires detailed information on disease mechanism and progression, diagnosis, signaling pathways, and protein–protein interactions, aligned in large training set databases. However, Dr. Wray explains that “current data are still limiting, noisy, and do not provide enough knowledge to simulate an entire cell. While we can look at only 10% of cellular proteins at once, we do so using different assumptions to verify results.” The outcome looks promising: after five years of building up the informatics side of drug discovery at e-Therapeutics, 10–20% of candidates sent to phenotypic testing demonstrate desired activity profiles, and two projects are already in the lead optimization stage.

Further examples of network-based approaches include the analysis of two lung cancer networks stimulated by epidermal growth factor receptor (EGFR) and insulin-like growth factor 1 receptor (IGF1R) proteins. Fortunato Bianconi and his colleagues have built a mechanistic model for these two interconnected pathways that comprised 15,000 kinase and kinase inhibitor pairs. Predictions of kinase activation patterns calculated by these models were verified by analyzing gene expression patterns in clinical tumor samples linked to patient disease-free survival rates (8).

Gene expression studies themselves gain new insights through predictive modeling. Instead of looking at differential expression of individual genes, models enable visualization of clusters with correlated expression patterns. Genes with a high degree of connectivity (so-called hub genes) can be identified and utilized as biomarkers or therapy targets. Integration of steering mechanisms like epigenetics or regulatory RNA completes the picture. Dr. Steven Horvath and his team applied this approach to lung cancer tissues and were able to identify five new hub genes (9).

[Figure 3 appears here: a schematic of the exploratory space of target-oriented approaches, with dimensions for target promiscuity, compound promiscuity, compound–compound interactions, target–target interactions (biological networks), and interactions with the microenvironment.]
Figure 3. The exploratory space available to medicinal chemistry includes compounds and targets that interact in multiple dimensions. While the most commonly explored dimension is target promiscuity (which compounds bind to a target?), other factors are equally valuable for exploration. These include the interaction dimensions encompassing compound promiscuity (which targets does a compound bind?), compound–compound interactions, and interactions among targets as well as other molecules present in the microenvironment of a target.

Modeling Multiple Compound–Target Interactions: Chemical Genomics-Based Virtual Screening

Systems-based approaches take data-driven drug design another step further by considering the complexity of compound–compound interactions, target–target interactions and compound promiscuity (Figure 3). Hiroaki Yabuuchi and his colleagues (10) proposed a novel approach called chemical genomics-based virtual screening (Figure 4, center panel) that does not depend on protein structure information. Instead, bioactivity data from known protein–compound interactions are integrated with compound chemical properties and protein dipeptide sequences to determine unknown target–compound interactions. The authors created a training set of 317 pharmacologically relevant G-protein coupled receptors (GPCRs) and 866 ligands, with 5,207 compound–protein interactions that were fed into a pattern-recognition machine-learning system (Figure 2). The resulting algorithms were employed to screen virtual compound libraries. To verify their method in a retrospective virtual screening approach, they selected ligands for the well-characterized beta2-adrenergic receptor (ADRB2) and compared them with compounds obtained from structure-based design. Their predictive modeling approach not only yielded higher compound hit rates, including some with novel scaffolds; they also discovered that a number of commercially available ADRB2 ligands had unexpected affinity to other GPCRs in the training set, indicating possible side effects. Their analysis showed that certain recurring substructures, such as the tertiary amines and sulfur-containing heterocycles typically seen in antidepressants, exhibit particularly non-selective binding to GPCRs, an effect they called polypharmacology. Chemical genomics-based virtual screening can be employed to predict polypharmacological interactions and thereby reduce side effects.
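The core of CGBVS, representing each compound–protein pair as one concatenated “interaction vector” and learning from known pairs, can be sketched with a toy nearest-neighbor classifier. All descriptors and records below are invented; the published method feeds thousands of interactions into a pattern-recognition machine-learning system, but the data flow is the same.

```python
# Minimal sketch of chemical genomics-based virtual screening (CGBVS):
# concatenate compound descriptors and protein descriptors into one
# interaction vector, then classify unseen pairs against known pairs.
# Toy 1-NN/k-NN stands in for the real trained model; data are invented.

from math import dist  # Euclidean distance (Python 3.8+)

# Toy training set: (compound descriptors, protein descriptors, interacts?)
# Descriptors are hypothetical stand-ins for (MW/100, logP, #OH) and
# dipeptide-composition features of the protein sequence.
TRAIN = [
    ((2.52, 0.7, 0.4), (0.72, 0.51, 0.47), 1),
    ((2.85, 0.8, 0.5), (0.60, 0.43, 0.48), 1),
    ((3.20, 0.1, 0.2), (0.72, 0.51, 0.47), 0),
    ((4.36, 0.2, 0.1), (0.81, 0.53, 0.64), 0),
]

def interaction_vector(compound, protein):
    """Concatenate compound and protein descriptors into one feature vector."""
    return tuple(compound) + tuple(protein)

def predict(compound, protein, k=3):
    """Majority vote of the k nearest known interaction vectors."""
    query = interaction_vector(compound, protein)
    neighbors = sorted(
        TRAIN, key=lambda rec: dist(query, interaction_vector(rec[0], rec[1])))
    votes = [label for _, _, label in neighbors[:k]]
    return int(sum(votes) * 2 > len(votes))

# A pair close to the known binders is predicted to interact.
print(predict((2.6, 0.75, 0.45), (0.70, 0.50, 0.47)))  # → 1
```

Because the protein is part of the feature vector, the same model scores one compound against many targets, which is exactly what makes promiscuity (and hence potential side effects) visible.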

Sidebar: Ligand-based virtual screening (LBVS) uses data on chemical properties of compounds known to interact with a defined protein target to screen virtual compound libraries. Structure-based virtual screening (SBVS) uses data on the three-dimensional structure of a protein target obtained by biochemical methods; virtual compound libraries are screened to identify candidates that fit into the binding pockets of the target. Chemical genomics-based virtual screening (CGBVS) combines target amino acid sequence data, chemical and physicochemical properties of compounds, and information about known interaction pairs in a training set for machine learning; the resulting models predict ligand–protein interactions for prioritization in bioassay experiments.

[Figure 4 appears here: three workflow panels.
Ligand-based virtual screening (left): (i) collection of known ligands; (ii) descriptor calculation (MW, logP, #OH, ...); (iii) labeling of activities; (iv) construction of a prediction model; (v) prediction of test sets.
Chemical genomics-based virtual screening (center): (i) collection of compound–protein interaction (CPI) data; (ii) descriptor calculation for compounds (MW, logP, #OH, ...) and proteins (AA, AC, AD, ...); (iii) representation of interaction vectors; (iv) construction of a prediction model; (v) prediction of test sets.
Structure-based virtual screening (right): (i) preparation of a 3D structure; (ii) identification of a binding pocket; (iii) binding energy calculation.]

Figure 4. Chemical genomics-based virtual screening (CGBVS) covers the entire chemical and biological space in data-driven drug design. CGBVS (center panel) represents a shift in exploring the interface between chemistry and biology: it does not require the three-dimensional protein structure needed in traditional SBVS (right panel), nor is it limited in scope to a single protein as in LBVS (left panel). Graphic redrawn with permission from Brown and Okuno, 2012 (11).


Implementing Data-Driven Drug Design

As diverse as these approaches are, they all share one common feature: the absolute requirement for high-quality data. In this context, quality has many meanings. First, data should provide comprehensive information on compound and target properties. As Dr. Wray puts it: “We need compound bioactivity values, assay and measurement type, and associated references. Basically anything that serves as evidence of compound–target interaction.” Second, it is crucial to integrate the data into existing databases and align them with complementary data from multiple sources. For a seamless import into existing databases, data should come from carefully designed experiments and must be clean, standardized and normalized. Dr. Wray explains: “Each source has its own structure and format, so data must be transformed to meet our needs for machine learning and statistical pattern recognition. [...] Now our biggest issue is cleaning up source data, and this is no simple task. Things as simple as spelling mistakes can impair our pipeline.” Most importantly, data must contain information about the experimental methods and scientific designs that were employed to generate them; without this knowledge, it is impossible to make good use of any data. Third, data obtained by mechanistic models must be experimentally validated to check whether model assumptions are correct or need to be adjusted. Analyses of patient samples and in vitro assays are therefore indispensable. Finally, every database needs to be carefully maintained and regularly updated. For a maximum effect on informed decisions, all relevant experimental data should be documented and shared with members of the team involved in the drug development program. Eventually, data-driven methodologies will also change the way people work together.
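The kind of source-data cleanup Dr. Wray describes can be sketched as a normalization step: convert heterogeneous units to a common scale and map spelling variants to canonical names before anything reaches the modeling pipeline. All field names, alias entries and values below are hypothetical.

```python
# Hedged sketch of bioactivity-record cleanup: records arrive from
# multiple sources in inconsistent units, with stray whitespace and
# spelling variants, and must be normalized before model training.

UNIT_TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

# Canonical target names keyed by lower-cased common variants.
TARGET_ALIASES = {
    "adrb2": "ADRB2",
    "beta2-adrenergic receptor": "ADRB2",
    "beta2 adrenergic reseptor": "ADRB2",  # spelling mistake seen in a source
}

def clean_record(raw):
    """Normalize one raw assay record to canonical units and names."""
    value_nm = raw["value"] * UNIT_TO_NM[raw["unit"]]
    target = TARGET_ALIASES.get(raw["target"].strip().lower(), raw["target"])
    return {
        "target": target,
        "ic50_nM": value_nm,
        "assay": raw["assay"].strip().lower(),
        "source": raw["source"],  # keep provenance for later auditing
    }

raw = {"target": "Beta2 adrenergic reseptor", "value": 0.5, "unit": "uM",
       "assay": " Radioligand Binding ", "source": "vendor-db"}
print(clean_record(raw))
```

Keeping the source field alongside the cleaned values preserves the evidence trail the text calls for: every normalized number can still be traced back to the experiment and database it came from.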

The Future of Systems-Based Drug Development

With extremely fast progress in computational power and graphical processing, and the generation of more and more data, computational models will supply increasingly detailed information that feeds into drug development decision processes. High-performance graphics hardware originally developed for computer games makes it possible to visualize binding dynamics of compound–target pairs within networks at atomic resolution. From these models, compounds with long residence times can be selected, allowing lower therapeutic doses and thereby fewer side effects. Expanding protein networks to cover a larger proportion of signaling pathways within a cell, together with the integration of “-omics” data, will complete the picture of cellular processes and enable a preview of the impact of a particular compound on cell phenotype. Comparing network patterns of tumor and healthy cells may reveal common tumor development patterns independent of tissue type. Previously unknown connections between diseases caused by the same underlying mechanism may be identified, leading to alternative therapy approaches. For example, diabetes was traditionally classified as a metabolic disorder, but is now increasingly regarded in the context of an autoimmune disease affecting pancreatic cells. In data-driven drug design, compound potency and behavior can be optimized and applied in a graded rather than an all-or-none fashion. New compound classes can be discovered, and compounds can be combined in a synergistic approach to increase efficacy. Ultimately, the goal is to reduce development costs by informing decisions about drug candidates in the early research phase. Getting to the stage where all these possibilities enabled by predictive modeling unfold takes time and financial investment. Still, closing the gap between molecules and phenotypes is essential for future drug design.
To stay competitive in pharma, each player needs to consider an overhaul of the methodologies and techniques used to process, visualize, analyze, and interpret large amounts of highly complex data. As Dr. Wray puts it: “There is a clear signal in the drug industry that network pharmacology is landing on fertile ground. We understand that compounds often affect more than one target and that future drug development must accommodate and even capitalize on that promiscuity.”

References

1. Lee, J.A. and Berg, E.L. (2013) Neoclassic drug discovery: the case for lead generation using phenotypic and functional approaches. J. Biomol. Screen., 18, 1143–1155.
2. Booth, B. (2012) Cancer drug targets: the march of the lemmings. Forbes Magazine. http://www.forbes.com/sites/brucebooth/2012/06/07/cancer-drug-targets-the-march-of-the-lemmings/ (Accessed March 24, 2015).
3. U.S. Food and Drug Administration. Summary of NDA Approvals & Receipts, 1938 to the present. http://www.fda.gov/AboutFDA/WhatWeDo/History/ProductRegulation/SummaryofNDAApprovalsReceipts1938tothepresent/default.htm
4. DiMasi, J.A., Hansen, R.W., Grabowski, H.G. and Lasagna, L. (1991) Cost of innovation in the pharmaceutical industry. J. Health Economics, 10, 107–142.
5. Statista. Spending of the U.S. pharmaceutical industry on research and development at home and abroad from 1990 to 2014 (in million U.S. dollars). http://www.statista.com/statistics/265090/us-pharmaceutical-industry-spending-onresearch-and-development/ (Accessed May 8, 2015).
6. Yong, X., Wang, P., Jiang, T., Yu, W., Shang, Y., Zhang, P. and Li, Q. (2014) Fibroblasts weaken the anti-tumor effect of gefitinib on co-cultured non-small cell lung cancer cells. Chin. Med. J., 127, 2091–2096.
7. Mestres, J., Gregori-Puigjané, E., Valverde, S. and Solé, R.V. (2009) The topology of drug–target interaction networks: implicit dependence on drug properties and target families. Mol. Biosyst., 5(9), 1051–1057.
8. Bianconi, F., Baldelli, E., Ludovini, V., Crinò, L., Flacco, A. and Valigi, P. (2012) Computational model of EGFR and IGF1R pathways in lung cancer: a systems biology approach to translational oncology. Biotechnology Advances, 30, 142–153.
9. Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4, Article 17.
10. Yabuuchi, H., Niijima, S., Takematsu, H., Ida, T., Hirokawa, T., Hara, T., Ogawa, T., Minowa, Y., Tsujimoto, G. and Okuno, Y. (2011) Analysis of multiple compound–protein interactions reveals novel bioactive molecules. Mol. Syst. Biol., 7, 472.
11. Brown, J.B. and Okuno, Y. (2012) Systems biology and systems chemistry: new directions for drug discovery. Chemistry & Biology, 19, 23–28.

R&D Solutions for Pharma & Life Sciences

Elsevier’s R&D Solutions is a portfolio of tools that integrate data, analytics and technology capabilities to help pharmaceutical and life sciences customers reach their goals with less risk and more efficiency.

For more information about R&D Solutions for Pharma & Life Sciences, visit elsevier.com/rd-solutions/pharma-and-life-sciences

Elsevier offices ASIA AND AUSTRALIA Tel: + 65 6349 0222

JAPAN Tel: + 81 3 5561 5034

KOREA AND TAIWAN Tel: +82 2 6714 3000

EUROPE, MIDDLE EAST AND AFRICA Tel: +31 20 485 3767

NORTH AMERICA, CENTRAL AMERICA AND CANADA Tel: +1 888 615 4500

SOUTH AMERICA Tel: +55 21 3970 9300

Copyright © 2019 Elsevier B.V. March 2019