<<

Property Biased-Diversity Guided Explorations of Chemical Spaces

by

Chetan Raj Rupakheti

Graduate Program in Computational Biology and Bioinformatics Duke University

Date:______Approved:

______David Beratan, Supervisor

______Weitao Yang

______Qiu Wang

______Patrick Charbonneau

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University 2015

ABSTRACT

Property Biased-Diversity Guided Explorations of Chemical Spaces

by

Chetan Raj Rupakheti

Graduate Program in Computational Biology and Bioinformatics Duke University

Date:______Approved:

______David Beratan, Supervisor

______Weitao Yang

______Qiu Wang

______Patrick Charbonneau

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate Program in Computational Biology and Bioinformatics in the Graduate School of Duke University 2015

Copyright by Chetan Raj Rupakheti 2015

Abstract

Discovering functionally useful structures and materials by exploring the vastness of chemical space is an exciting undertaking. If done efficiently, one can discover structures that can have therapeutic value (such as drug like organic ) or technological value (such as organic light emitting diodes). While mining of chemical space has the potential to generate libraries of functional structures and materials, one can also easily be lost in its vastness (~1060 theoretically possible small organic

~500 Da molecular weight or less). We have developed a strategy that allows efficient explorations of vast chemical spaces to generate libraries of functional organic molecules. The method, at its core, applies physical properties and structural diversity biased sampling of chemical space to search for new structures. We demonstrate the soundness and efficiency of this approach by searching through the known and enumerated databases to discover diverse organic molecules with optimum electronic and biophysical properties, and we also compare it to various existing approaches used for molecular search and property optimization. We also show a practical application of this approach by designing libraries of chromophores that emit light in the blue region of the spectrum as well as potential leads for protein and RNA binding.

iv

Dedication

To my family, mentors, and friends.

v

Contents

Abstract ...... iv

List of Tables ...... xi

List of Figures ...... xii

Acknowledgements ...... xix

Chapter 1. Background ...... 1

1.1 Chemical space of small organic molecules ...... 1

1.1.1 Motivation for exploration ...... 1

1.2 Experimental Approaches for chemical space exploration ...... 3

1.2.1 Fragment based ...... 3

1.2.2 Diversity oriented drug design ...... 5

1.3 Computational approaches for chemical space exploration ...... 6

1.3.1 Generated databases (GDB) of enumerated small organic molecules ...... 6

1.3.1.1 Methodologies and Principle to Virtually Screen Drug-Like Molecular Libraries ...... 9

1.3.2 Existing Complementary Computational Approaches for Material Designs ... 10

1.3.2.1 AFLOW Material Database ...... 11

1.3.2.2 Harvard Clean Energy Project (CEP) ...... 12

1.3.3 Diversity-oriented stochastic search of chemical Spaces ...... 15

1.4 Thesis contribution: Property-biased exploration of diverse regions of chemical space ...... 17

1.5 Outlines ...... 19

vi

Chapter 2. A strategy to discover diverse optimal molecules in the small molecule universe...... 21

2.1 Introduction ...... 21

2.2 Methods ...... 22

2.2.1 ACSESS algorithm to search molecular space ...... 25

2.2.2 Software Used for Property-Optimizing ACSESS ...... 27

2.2.3 Optimization within GBD9 property landscape ...... 27

2.2.4 Optimization in the NKp fitness landscape model ...... 28

2.3 Results and Discussions ...... 30

2.3.1 ACSESS exploration of a NKp model space ...... 30

2.3.2 Comparing property-optimizing ACSESS with simple genetic algorithm searches ...... 31

2.3.3 ACSESS explorations in GDB-9 ...... 33

2.3.4 Comparisons of property-optimizing ACSESS and genetic algorithms methods for property optimization ...... 34

2.3.5 Self-organizing maps of GDB-9 and of sampled solutions ...... 37

2.4 Conclusions ...... 41

Chapter 3. Diverse optimal molecular libraries for organic light emitting diodes...... 43

3.1 Introduction ...... 43

3.1.1 OLED structures ...... 44

3.2 Method ...... 46

3.2.1 Using property-optimizing ACSESS to design fluorescent emitters ...... 46

vii

3.2.2 Quantum chemical computation of singlet excitation energies and oscillator strengths ...... 49

3.2.3 Design of fluorescent emitters that utilizes only singlet excitation state ...... 50

3.2.4 Designing the TADF library by optimizing the singlet - triplet gap ...... 50

3.3 Results and Discussions ...... 53

3.3.1 Validating the Semiempirical Excitation Energies ...... 53

3.3.2 Property-biased ACSESS-based library of singlet emitters ...... 55

3.3.3 Validating the ZINDO/S Excitation Energy Calculations with TDDFT for the ACSESS-designed Library ...... 59

2.3.4 Property-optimized ACSESS Library of TADF Emitters ...... 60

3.4 Conclusions ...... 64

Chapter 4. Efficiently charting astronomically large chemical spaces to search drug leads ...... 66

4.1 Introduction ...... 66

4.2 Methods ...... 68

4.2.1 Benchmarking Methods for the COX-2 Receptor ...... 69

4.2.2 Comparing Property-Optimized ACSESS with Simple Genetic Algorithms ... 70

4.2.2 Optimization of docking scores within GDB17 ...... 71

4.2.3 Optimization of docking scores within SMU-21 ...... 72

4.3 Results ...... 72

4.3.1 Comparing enrichment of binders for different docking methods ...... 72

4.3.2 Docking score distribution of GDB9 compounds ...... 74

4.3.3 Performance of various library design strategies within GDB9 ...... 76

viii

4.3.4 Performance PO-ACSESS within GDB17 ...... 80

4.3.5 Performance of PO-ACSESS within the SMU-21 containing up to 21 heavy atoms ...... 82

4.4 Conclusions ...... 86

Chapter 5. Theory assisted explorations of the small molecule universe to discover new inhibitors of coactivator-associated arginine methyltransferase1 (CARM1)...... 88

5.1 Introduction ...... 88

5.2 Method ...... 90

5.2.1 Docking Based Scoring ...... 92

5.2.2 MMGBSA Based Rescoring ...... 92

5.2.3 Validating MMGBSA Using Theory and Experiment ...... 94

5.3 Results and Discussion ...... 95

5.3.1 Benchmarking Docking and MMGBSA ...... 95

5.3.2 Libraries Design Using Property-Optimizing ACSESS (Docking Scoring Function) ...... 98

5.3.3 Library Design Using Improved Scoring Function ...... 100

5.3.4 Preliminary Results on MD Based Validation of Designs ...... 101

5.4 Conclusions and Future Directions ...... 105

Chapter 6. Property-Optimizing ACSESS extension and application for designing small organic molecules targeting RNA Structures...... 107

6.1 Introduction ...... 107

6.2 Method ...... 109

6.2.1 RDOCK based docking of small molecules to RNA ...... 111

ix

6.2.2 Steps to Diversify the DMA Scaffold ...... 112

6.2.3 Conformational Selection using MD and experimental RDC ...... 113

6.2.4 Experimental Testing of the Predicted Structures ...... 114

6.3 Results and Discussions ...... 114

6.4 Conclusions and Future Directions ...... 119

References ...... 120

Biography ...... 138

x

List of Tables

Table 1: Statistics of the generated 195-member library ...... 56

Table 2: Average error between TDDFT and ZINDO/S computed excitation energies and oscillator strengths for the maximally diverse library of 45 compounds...... 60

Table 3: Breakdown of energetic contribution (in kcal/mole) towards protein-ligand binding ...... 103

Table 4: Comparing hydrogen bond occupancy across 10 ns trajectory between compound 0 and compound 1 ...... 104

xi

List of Figures

Figure 1: The structural diversity of various representative compounds found in different databases is indicated. The figure is taken with permission from ref [5]...... 2

Figure 2: Shows a cartoon of fragment-based design approach. The protein with two binding pockets is shown in grey rectangle. Fragments (F1 and F2) linked by the linker group (shown in black) are interacting with the protein...... 4

Figure 3: Steps used to enumerate organic molecules starting from graphs of feasible topologies...... 7

Figure 4: Shows the ACSESS steps involved in generating a library of structurally diverse molecule for screening purposes...... 16

Figure 5: The property-optimizing ACSESS procedure uses property filter and diversity- biased sampling to construct a diverse library with properties above a cutoff value...... 27

Figure 6: Distribution of molecular property values for (A) the enumerated NKp landscape with N=19, K=9, p=0.9 and (B) the dipole moments of all molecules in GDB-9...... 30

Figure 7: (A) Diversity biased sampling and (B) fitness-biased sampling. Diversity is first maximized to generate solutions that span a given chemical space. Here, the nearest- neighbor diversity (eq. 4) is defined as the average hamming distance (eq 3) to the nearest neighbor. The diverse solutions seed the property-optimizing ACSESS method and the fitness of the diverse set is improved iteratively. The global optimum in our enumerated NKp model space was found within 30 iterations of our optimization (data shown are averaged over ten different runs)...... 31

Figure 8: Comparison of property-optimizing ACSESS outcomes to the simple genetic algorithm (SGA) optimization for generating multiple solutions with the largest NKp fitness value. SGA iteratively samples the fittest solutions and improves upon them. (A) indicates that SGA finds the fittest solutions (global maximum in the constructed NKp model) after 30 iterations. (B) indicates that property-optimizing ACSESS runs escapes local optima more effectively than do the SGA runs, i.e. the success rate of finding largest fitness solution of property-optimizing ACSESS is 100% compared to 60% for SGA (from 10 runs of each method). (C) indicates that the property-optimizing ACSESS runs also performs better than SGA to sample diverse fittest solutions as it samples a larger number of fittest solutions compared to SGA. (D) indicates that, on average, the

xii

fitness of solutions generated by property-optimizing ACSESS is larger than that of SGA...... 33

Figure 9: Progress of ACSESS runs that sample GDB-9. The plots show the mean of the dipole moments from multiple-runs, and the error bars represent one standard deviation from the mean of either nearest-neighbor distance or dipole moment. (A) Initially, a maximally diverse set (measured by nearest-neighbor distance) is generated. (B) The maximally diverse set is used to maximize their dipole moments. The minimum dipole moment found after 60 iterations is the desired dipole moment threshold (i.e., the top 10% fittest molecules)...... 34

Figure 10: ACSESS and SGAs runs that maximize the dipole moments (fitness) of diverse molecules. The two plots tracks the average fitness of libraries generated by each design algorithm (color coded differently) and the error bars represent one standard deviation from the mean for multiple runs. (A) compares the fitness of the library optimized using ACSESS with three other genetic algorithms (colored coded differently). (B) compares the diversity of the fit libraries generated by ACSESS and SGAs...... 35

Figure 11: SOM for the enumerated and sampled GDB-9. (A) shows the full GDB-9 with color-bar representing average dipole moment per neuron from blue (low) to red (high). (B) shows the ACSESS designed diverse library (without maximizing the dipole moment). The filling of the plot in B indicates that the diverse library from ACSESS spans the GDB-9 space. The color bar in B indicates the number of molecules assigned to each neuron. The white space in both maps indicates regions where no compounds are assigned...... 38

Figure 12: Comparison of the diversity found in an ACSESS library and in an SGA-elitist library to the fully enumerated GDB-9 library. The color bars in A and B indicate the number of compounds per neuron. (A) shows the more fit molecules (molecules above the fitness cutoff) generated by ACSESS, (B) shows the more fit molecules (molecules above the fitness cutoff) generated by SGA (elitist), and (C) shows the fittest regions of GDB-9 (≥5.5D dipole moment) where the color bar represents the number of compounds per neuron. The oval indicates molecules in GDB-9 that are discovered using ACSESS but missed by SGAs...... 40

Figure 13: Light emission mechanism from an OLED device. Holes (h+) are supplied by the anode and (e-) by the cathode...... 44

xiii

Figure 14: ACSESS extension for fluorescent-molecule design based on emission wavelength, oscillator strength, and singlet-triplet excitation energy gap...... 48

Figure 15: (A) Comparison of vertical electronic excitation energies for the lowest singlet transitions of the validation set (average error ~0.7 eV). (B) Comparison of vertical electronic excitation energies for the lowest triplet transition of the validation set (average error ~0.9 eV)...... 54

Figure 16: ACSESS generated potential blue-emitting organic structures. Shown are some representative structures of the 195 compound library with average singlet excitation energy and oscillator strength (3.88 eV and 0.8, respectively). The blue ovals indicated the chromophore unit of the molecules...... 55

Figure 17: Figures show the HOMO and LUMO of a representative structure sampled using ACSESS. (A) Shows HOMO and (B) shows LUMO of the representative molecule...... 57

Figure 18: The x- and y-axes represent position in a SOM grid. (A) The ACSESS generated blue emitting compounds are all shown in a self-organizing map (SOM) where each blue dot represents a compound. (B) Projection of known compounds in the SOM trained on ACSESS generated SOM...... 58

Figure 19: Comparison of vertical electronic excitation energies for the lowest singlet transitions of the maximally diverse set (average error of ~0.13 eV between ZINDO/S and TDDFT). The excitation energies are in eV units...... 60

Figure 20: Singlet-triplet energy gap optimization derived from the property-biased ACSESS calculation. We reach the goal of shrinking ΔES1-T1 to ~0.3 eV within 35 iterations...... 62

Figure 21: Oscillator strength optimization using the property-optimizing ACSESS method. We increased the computed oscillator strength to ~ 0.6 in the 35 iterations...... 62

Figure 22: Singlet excitation energy at each iteration of the property-optimizing ACSESS calculation. While optimizing the S1-T1 energy gap and the oscillator strength, all of our compounds’ excitation energies remain in the blue region (2.9-3.8 eV is the range for the sampled molecules)...... 63

xiv

Figure 23: Molecules generated with the property-optimizing ACSESS method with semiempirical computations of the average S1-T1 gap of ~0.2 eV, average excitation energy of ~3.3 eV, and average oscillator strength of ~0.6...... 64

Figure 24: Property-Optimizing ACSESS framework for lead design. Molecules are constructed using mutations and crossovers,110 chemical stability and synthesizability constraints will cause modification or discarding of structures that are unsuitable.110 Suitable molecules are scored for docking to the target, and a maximally diverse set of docking score qualified structures is selected.110 ...... 69

Figure 25: Enrichment curves compare how well known binders are ranked above known decoys for the COX-2 system for different docking methods. The area under the curve (AUC) provides a metric for the performance of each docking method, indicating how well a docking method ranks a randomly chosen binder above a randomly chosen decoy. The x-axis shows the true positive rate and the y-axis shows the false positive rate for each docking score threshold...... 73

Figure 26: Vina docking score distribution of the strongest binding pose for all compounds of GDB9 screened for COX-2 binding...... 74

Figure 27: SOM constructed from “lead-like” members of GDB9 (Vina score ≤ -6.0). The color bar represents the average Vina score for each grid point. The x and y axes represent positions in the two dimensional grid. The white regions indicate areas where no compounds were mapped. All non-white regions correspond to region of the GDB9 space that contained lead-like molecules for COX2 binding...... 76

Figure 28: (a): Fitness comparison of PO-ACSESS and SGAs. The x-axis shows the number of generations run, and the y-axis shows the negative of the Vina score (here, higher scores indicates more promising leads). (b) Diversity comparison of PO-ACSESS vs. SGAs. The y-axis represents nearest neighbor diversity. We can notice that the diversity of lead-like molecules sampled using property-biased ACSESS is by far higher compared to other SGAs...... 77

Figure 29: (a) Projection of SGA-elitist generated library on the trained SOM. The color bar represents the number of lead-like molecules in GDB9 per grid point. (b) Projection of property-biased ACSESS on the trained SOM. The color bar represents the number of lead-like molecules of GDB9 per grid point...... 79

Figure 30: Optimization of diverse molecules with naproxen-like docking scores from GDB17. The plot shows average, minimum, and maximum scores in each iteration of

xv

PO-ACSESS analysis (from 5 runs). The error bars show the standard error for the statistics (mean, min, and max). The x-axis shows the iteration number for the algorithm and the y-axis indicates the negative Vina Score...... 81

Figure 31: Comparison of sampling based on PO-ACSESS analysis for identifying lead- like molecules in GDB17 and randomly sampling of a known drug-like subset of GDB17 (consisting of ~1 million molecules). Both methods were used to screen 6300 molecules to find those with the 260 most favorable Vina scores...... 82

Figure 32: (a) Optimization of naproxen-like diverse molecules within SMU-21. The plot shows average, minimum, and maximum score in each iteration of property-biased ACSESS (from 3 different runs). The error bar shows the standard error for each statistics (mean, min, and max). The x-axis shows the iteration of algorithm and y-axis shows the -1 x Vina Score. (b) Shown optimization of nearest neighbor diversity85 of lead-like compounds using original ACSESS110 within SMU-21. The x-axis shows the iteration of original ACSESS and the y-axis shows the nearest neighbor diversity (data averaged over 5 different runs)...... 84

Figure 33: Comparison of the average docking scores of maximally diverse set generated using ACSESS with that of PO-ACSESS method. Simply considering diversity without fitness bias leads to poor lead-like compounds...... 85

Figure 34: The cofactor or cofactor mimics binding site of the CARM1 protein (PDB ID 2Y1W). A scaffold for a potential CARM1 inhibitor binding to the cofactor-binding pocket is shown. The R1, R2, and R3 indicate the positions in the scaffold where functional groups can be added to design potential inhibitors...... 89

Figure 35: The extended property-optimizing ACSESS framework for CARM1 inhibitors design. Compared the COX2 test study, this protocol uses multiple docking poses and rescores each docking pose using MMGBSA method to compute protein-ligand interaction scores...... 91

Figure 36: The AutoDock vina scores (y-axis) for the tested actives (blue dots) and inactives (red dots) ligands of CARM1...... 96

Figure 37: The MMGBSA scores (y-axis) for the tested actives (blue dots) and inactives(red dots) ligands of CARM1...... 97

Figure 38: The docking scores optimization by PO-ACSESS. The x-axis indicates the iteration number of PO-ACSESS and y-axis indicates the negative of the AutoDock score

xvi

(now, more positive the better). The black, blue, and red lines indicate the worst scoring compound, the average score of compounds in the library, and the best scored compound in the library respectively in each iteration...... 99

Figure 39: The hydrogen-bonding network of PO-ACSESS designed compound (shown in red and labeled as compound 97 ) with CARM1. The compounds interacts with similar residues that SAH interacts with (Val 108, Glu 80, Gln 58) as well as shows interaction with residues not accessible to SAH (Ser 137 and Arg 6)...... 100

Figure 40: The MMGBSA scores optimization by PO-ACSESS. The x-axis indicates the iteration number of PO-ACSESS and the y-axis indicates the negative of the MMGBSA score (now, more positive the better). The black, blue, and red lines indicate the worst scoring compound, the average score of compounds in the library, and the best scored compound in the library respectively in each iteration...... 101

Figure 41: The left figure (A) shows the hydrogen bonding interaction (red dots) between compound 0 and protein. The right figure (B) shows the hydrogen bonding interaction (red dots) between compound 1 and protein. We can notice less number of hydrogen bonds between compound 0 and protein compared to compound 1 and protein...... 103

Figure 42: (A) and (B) show the structures of two representative designed ligands from PO-ACSESS that uses MMGBSA for rescoring. Compound (A) had MMGBSA score of - 95.33 kcal/com and (B) had MMGBSA score of -93.61 kcal/mol...... 105

Figure 43: A cartoon representation of TAR RNA with its loop, stem, and bulge regions...... 108

Figure 44: Structure of the dimethyl amiloride (DMA) scaffold with two functional group diversification positions (R1 and R2)...... 110

Figure 45: The average (blue), worst (green), and best (red) scores of the library in each iteration from the first round of design using PO-ACSESS. The x-axis shows the iteration of PO-ACSESS and the y-axis shows the negative of the binding energy between ligand- TAR RNA (RDOCK score)...... 115

Figure 46: The structures of four top scoring compounds from the first round. The average RDOCK score for these compounds was ~ -43 kcal/mol. The R2 position is occupied by the chlorine (Cl) atom in these designs...... 116

xvii

Figure 47: Structures of the 4 top scores compounds from each libraries from the second round. The average score of the compounds were ~ -53 kcal/mol...... 117

Figure 48: The average (blue), worst (green), and best (red) scores of the library in each iteration from the second round of designs using PO-ACSESS. The x-axis shows the iteration of PO-ACSESS and the y-axis shows the negative of the binding energy between ligand-TAR RNA (RDOCK score)...... 118

xviii

Acknowledgements

I would like to thank my family, mentors, and friends for their relentless support, guidance, love, and compassion throughout my academic career. Thank you for making this learning experience a joyful and a satisfying adventure.

xix

Chapter 1. Background

This chapter will define chemical space, motivate the need for chemical space exploration, present various approaches used for exploration, and discuss the contribution of this thesis towards advancing theoretical chemical space explorations.

1.1 Chemical space of small organic molecules

The chemical space of theoretically possible drug-like molecules with molecular weights ≤ 500 Da and containing C, N, O, and S atoms types consists of ~1060 compounds.1 This astronomical size makes the searching for useful compounds a daunting as well as an appealing task. The undiscovered molecules could potentially provide drugs or energy efficient materials and positively impact life on earth. Synthetic chemistry over the last century has produced ~60 millions compounds,2 which is a tiny fraction of the chemical space. The need for new structures is severely felt in the realm of medicine as well as in organic electronics. To address the need, there have been various theoretical and computational efforts to design and discover new molecular entities.

1.1.1 Motivation for exploration

New molecular structures (small organic molecules) can be used as probes for studying or manipulating the function of biochemically important protein targets and for organic electronic devices as well. The lack of productive small molecule leads to probe important targets is plaguing biomedicine. Similarly, most electronic devices use expensive heavy-metal containing inorganic complex for displays.3 New organic

1

structures can address the limitations in medicine and organic electronics and research is

ongoing to discover such useful molecules.4 In biomedicine, the lack of structural

diversity among small molecules in existing commercial libraries is one of the

limitations for targeting interesting protein targets.5 Virtual or experimental high-

throughput screening campaigns that use commercial libraries to screen for binders of

non-traditional protein targets (such as the ones involved in protein-protein interaction)

are at a risk of failing. These commercial libraries are biased towards more traditional

drug targets such as G-protein coupled receptors, nuclear receptors, and voltage- and

ligand-gates ion channels.6 The molecules targeting these families of receptors are

structurally simple molecules containing lots of sp2 carbons compared to natural

products or natural product derivatives (Figure 1).5 grand challenge commentary

Typical compounds in Examples of bioactive Examples of NP- 8/(! +ɥ!.,/.4-"2ɥ(-ɥ 2ƨCVL: Low Structural7 ,/+#2ɥ.$ɥ compounds: (. !3(5e compounds: Intermediate7 ,/+#2ɥ.$ɥ derived-derived drugs: drugs: High been disclosed as a preclinical candidate for low structural complexity intermediate structural complexity high structural complexity treating malaria, acting at the P-type cation Complexity! Structural Complexity! Structural Complexity! Cl transporter ATPase-4 (ref. 10). O AcO O OH Developing libraries of these kinds of Ph O OH HN compounds is an attractive and achievable HN O O Ph NH O N OH O task, yet the current driving force in synthetic O O Ph O H chemistry is focused more on exploring 1 HO O O OH new chemical reactivity. Hence populating Robotnikinin (6) O O Shh inhibitor Ph novel chemical space is at best a desirable axol (4) by-product of understanding chemical OH S reactivity and not the primary objective of SS H O Bn most research. Tis considerably tilts the NH N N N balance, with methodology development Et O N O H O O H OH gaining prominence at the expense of library 2 OMe production. Rebalancing this activity to O N 7 Cl upregulate library production with an eye Bcl-2 inhibitor H Cl H toward biological applications will have N O O O H benefcial outcomes. In addition, screening Cl OMe O O OH centers and experienced medicinal chemists N F should educate synthetic chemists about the S O OH OMe importance of considering the physiochemical N Cl H N O H properties of the resulting library Rapamycin (5) 11–13 3  ƖƎƙɥǒ8) products . Te suitability of a synthetic -3(, + 1( +ɥ !3(5(3y pathway for rapid analog generation should also be considered during the design stage to Increasing structural complexity facilitate downstream optimization. Specialized academic centers such as the Figure 1 | Representative classes of compounds used in high-throughput screening. Centers for Chemical Methodology and Figure 1: The structural diversity of various representativeLibrary Development compounds (CMLDs) have made found important inroads into developing relevant MLSMRin different to accept compounds databases from several is indicatedcenters would. leadThe to broader figure participation. is taken syntheticwith methodologiespermission and producing from ref [5]. sources, including academic labs. Tese In addition, a major impediment to success small- to medium-sized libraries. Of the libraries of compounds are then screened in is that most synthetic organic labs lack approximately 16,800 non-CVL compounds a broad range of assays run in the Molecular the infrastructure needed for submitting 2 in the 2010 MLSMR collection, ~6,000 Libraries Probe Production Center Networks compounds to external parties. For example, compounds originated from CMLDs (Fig. (MLPCNs) and Molecular Libraries Screening many synthetic labs do not have ready access 2a). Despite these substantial contributions, Center Networks (MLSCNs; http://mli. to the HPLCs necessary for high-throughput larger collections of new compounds nih.gov/mli/). So far, more than 16,000 analysis or semipreparative HPLCs required with increased complexity are still needed © 2010 Nature America, Inc. All rights reserved. All rights Inc. America, Nature © 2010 compounds have been submitted to the for purifcation of libraries of compounds. from academic institutions to ofset the MLSMR from various academic labs. In More sophisticated instruments, such as disproportionate representation of CVLs in focused public screening centers, libraries automated weighing stations and liquid existing compound collections. Compounds can be screened for specifc disease areas; one handlers, are beyond the reach of most with greater sp3 centers and higher complexity example is the Psychoactive Drug Screening academic labs. Establishing centralized have higher success rates at various stages Program (PDSP; http://pdsp.med.unc.edu/). analytical service stations to purify and of the process14,15. Te need In a public-private partnership mode, Eli Lilly manage chemical libraries from academic for compounds with increased structural has announced an initiative to carry out a institutions would be benefcial. complexity can be appreciated by measuring limited evaluation of the therapeutic potential For scientists who are already committed complexity through Fsp3 analysis (Fig. 2b). of compounds synthesized in university and to contributing to screening centers, a Tis simple analysis reveals that chemical biotechnology laboratories. major goal is to create chemical libraries complexity increases in going from CVLs to Despite the promise of such initiatives, that populate the large area of chemical libraries from academia to natural products. several challenges currently limit broader space between CVLs and natural products. Such evidence underscores the need for participation. More outreach eforts are Compounds of intermediate complexity are additional contributions from the academic needed to make synthetic chemists aware accessible through such strategies as diversity- synthesis community, especially those of the existence and requirements of these oriented synthesis and have already shown focused on the total synthesis of complex screening operations. Funding institutions promise in targeting challenging systems natural products and chemical libraries to should also develop policies or incentives to (Fig. 1)6,7. For example, robotnikinin (6) further supplement the MLSMR or related promote these collaborations. Currently, all binds directly to the extracellular protein compound collections. Te Broad Institute investigators funded by the NIH are required Sonic hedgehog (Shh) and inhibits the Shh has invested in synthesizing compounds with to submit their peer-reviewed research signaling pathway, which has been implicated increased structural complexity, and ~75,000 publications to the US National Library of in a wide range of cancers and developmental diversity-oriented synthesis compounds Medicine’s PubMed Central. A similar policy disorders8. Compound 7 has been identifed have been synthesized so far. We are actively initiative by major funding agencies requiring as a Bcl-2 inhibitor, which operates by collaborating with scientists within and submission of natural products or value- disrupting a protein-protein interaction9. outside the Broad Institute to screen added synthetic intermediates to screening Very recently, spiroindolone NITD609 (8) has these compounds.

862 NATURE CHEMICAL BIOLOGY | VOL 6 | DECEMBER 2010 | www.nature.com/naturechemicalbiology

One of the consequences of reliance on traditional molecular libraries for new drugs discovery projects is the almost linearly growing attrition rate of potential drugs over the past two decades.7 The screening campaign’s reliance on the commercial vendor libraries having limited structural diversity can lead to failure in multiple steps of drug development projects.7 Existing limitations in structurally diverse molecules have increased interest in a broader exploration of chemical spaces with the hope of generating structurally diverse small organic molecules. In the following sections, various experimental and theoretical exploration strategies to design libraries of structurally diverse small molecules are discussed.

1.2 Experimental Approaches for chemical space exploration

Two main experimental approaches, fragment based drug design and diversity oriented drug design, have gained traction recently.

1.2.1 Fragment based drug design

Fragment based drug discovery constructs small-molecules to target a protein structure by combining low-molecular mass (~300 Da) fragment molecules.8 Libraries of thousands of molecules can be constructed using this approach because it combines several thousands fragments (Figure 2). The fragments by themselves may have low affinity for a protein target, but linking multiple fragment can improve the potency of the molecules.9 In this approach, the binding site of the protein (the active site) is identified and fragments are screened that bind to each individual pocket using

3

experimental techniques such as NMR.10 Once appropriate fragments are screened, linker groups that connect small molecule fragments are designed.9,10 For this approach, the binding site in the protein needs to be known along with appropriate sensitive assays that can detect the weakly-binding fragments. 9,11 Along with these requirements, an appropriate-sized linker group is needed to combine the two fragments that may be binding to the two separate pockets in a protein. This strategy has shown great promise to deliver compounds where traditional drug discovery approaches have struggled. This approach has been applied successfully to design inhibitors, such as AT7519, targeting cyclin-dependent kinases (CDK) and is currently undergoing phase 1 clinical trials.12

Another success of this approach has been to design the protein-protein interaction inhibitor, ABT-263, targeting Bcl-XL. This molecule is in clinical trials for the treatment of lung cancer and other malignancies.12

Linker#

F1# F2#

Protein#

Figure 2: Shows a cartoon of fragment-based design approach. The protein with two binding pockets is shown in grey rectangle. Fragments (F1 and F2) linked by the linker group (shown in black) are interacting with the protein. 4

1.2.2 Diversity oriented drug design

The diversity oriented synthesis (DOS) of molecules aims to efficiently generate a collection of structurally diverse molecules. The compounds contains higher substitutional, stereochemical, and scaffold diversity compared to commercial vendor libraries.13 Various approaches for diversity oriented synthesis, such as build-couple- pair, exist that have shown promises to deliver functionally useful molecules.13,14 DOS starts with a multiple building blocks (molecular fragments). The initial reactants are coupled in various combinations to form several intermediate compounds. The intermediate molecules are then paired using suitable reactions to arrive at the final compounds that contain unique scaffolds and appendages and are sterochemically diverse.14 An advantage of DOS over fragment-based approaches is that it does not need information about the protein target. The synthesis yields a variety of structurally diverse compounds that can be screened using high-throughput biochemical assays.

Since the screening collection is generic and applicable to multiple targets, it may not work as well as fragment-based approaches for some targets. Despite these limitations, this approach has been successfully applied to generate active compounds targeting proteins whose binding sites have not been characterized. Some examples of a designed molecules is uretupamines (Kd ~ 7.5 µM), which are function-selective suppressor of the yeast signaling protein Ure2p that were discovered by screening a library generated using diversity-oriented synthesis.15 Another success story of this approach is the design

5

of selective inhibitors targeting histone deacetylases (HDACs). Despite the availability of protein structural information, structure-based design of selective HDAC inhibitors has proven challenging. Schreiber and coworkers leveraged this structural information in combination with DOS to synthesize a library of 7,200 dioxane-containing natural product-like molecules that were targeted to HDACs (EC50 = 34 µM).16,17

1.3 Computational approaches for chemical space exploration

Several computational approaches exist for chemical space exploration. They can broadly be divided into enumerative18–20 and stochastic approaches.21–23

1.3.1 Generated databases (GDB) of enumerated small organic molecules

Atom and bonds connectivity information of a molecule can be described as a molecular graph (a SMILES string).24 The enumeration approach of the Reymond group at the University of Bonn starts with a molecular graph of various stable topologies

(excluding topologies that are heavily strained). Hydrocarbon are introduced to each nodes on the graph followed by different bond saturations at the edges. In the next step, carbons are substituted by other heavy atoms (N, O, and S) at some suitable nodes.

Some of the resulting molecules that do not satisfy the stability and synthetic feasibility filters are discarded (Figure 3).19 Applying this enumeration strategy, a database containing 166×109 molecules was generated and required over 11 CPU-years.19 The

GDB-17 database contains an impressive number of molecules in the area of known drugs. For example, millions of isomers of fifteen typical marketed drugs of 14 to 17 6

atoms can be identified in GDB-17. The isomers include obvious variations of the parent structure by tiny functional group changes such as “methyl-walk” analogs.19

Graphs'

Introduce0 Hydrocarbons'

Skeletons:0Change0C0to0 N0and0O'

Molecules'

Figure 3: Steps used to enumerate organic molecules starting from graphs of feasible topologies.

The vast size of the small molecule universe limits the applicability of enumerative approaches. Enumeration of molecules in itself is not useful unless the molecules are screened to identify their function. But the screening of all the molecules, even at the scale of GDB-17, is computationally demanding and prohibits the exhaustive enumeration approaches for larger structures (> 17 heavy atoms). GDB-17 contains only a small fraction of the possible drug-like molecules with molecular weight of ≤500 Da.

There are still ~ 1043 theoretically possible drug-like molecules (> 17 heavy atoms) missing from the GDB databases.

7

An extension of GDB based enumeration is the ‘SPACESHIP’ algorithm.25 It allows the enumeration of all feasible molecules that lies in between a start and a final molecule (the target). The algorithm applies exhaustive set of mutations to the starting molecule, which changes its bond and atom types. The trajectory from the starting molecule to the final molecule is directed by selecting mutants that have the highest structural similarity (in terms of a Tanimoto similarity coefficient26) to the target molecules. This approach was used to generate analogs of AMPA (starting molecule), an agonist to the glutamate receptor. CNQX molecule, an antagonist to the kainate receptor, was used as a target molecule. Many generated molecules lying in the path between these two known molecules showed activity (represented by favorable docking scores) to the glutamate receptor.25

Apart from enumeration, many rational structure-based computational approaches exist that prune chemical space in the search for useful molecules. Such methods include the branch-and-bound and dead-end-elimination (DEE) methods.27–29

These methods avoid enumeration and can be efficient in searching chemical space.

Some success has been achieved by applying these approaches to the design of drug- resistant HIV-protease inhibitors.30,31 In order to prune the chemical space, the DEE methods make rigid assumptions such as searching over limited number of functional groups and sometimes unrealistically discretizing the conformation of both protein and ligand.20,32,33 Since these approaches limit the search to predefined functional groups,

8

they may not provide a coverage of chemical space that the GDB or publicly available databases (such as ZINC and PubChem) provide. If searching over predefined functional groups fails, the choices of functional groups have to be increased during branch and bound searches with the additional increase in computational cost. The worst-case cost of enumerating molecules can be exponential in the number of functional groups and can potentially render these approaches impractical. To achieve broader coverage of chemical space, much more efficient approaches are needed and are discussed in the following sections.

1.3.1.1 Methodologies and Principle to Virtually Screen Drug-Like Molecular Libraries

The library design methods are strongly intertwined with various screening methodologies. As mentioned earlier, simply generating molecular structures does not answer the need for functional materials until they are tested for their functions using existing methods. The screening methods range from cheap empirical

QSAR approaches,34,35 knowledge-based and semi-empirical molecular docking approaches,36 to computational expensive molecular dynamics simulations,37 and qm mm methods.38,39 In addition, several techniques to enhance the conformational sampling of the small molecules (ligand), macromolecules (protein/RNA), and complexes also exist.40–42 The usage of these existing screening methods are dictated by the size of the molecular library. If the library is large and the computational resources modest, the initial filtering of molecules can be done using computationally cheaper 9

methods such as QSAR and docking. The more expensive methods can then be applied on selected promising molecular candidates.

1.3.2 Existing Complementary Computational Approaches for Material Designs

Chemical space is actively being explored to discover materials beyond drug-like molecules. Efficient materials, such as the ones used for displays in personal electronics, are needed for our day-to-day needs. Material innovations may provide the keys to address some of the most pressing societal challenges, such as global climate change and our future energy supply.43 A significant amount of trial-and-error is still involved in the design of effective materials.44 Decades of research are required to identify suitable materials for technological applications and for commercialization. A principal reason for the long discovery process is that material design is a complex, multi-dimensional optimization problem, and the data is lacking to make an informed choice about which material to focus on and what experiments to perform.44 With a view towards accelerating the pace of materials discovery and for effectively sharing the discovered materials with a broader scientific community, the Materials Genome Initiative was launched in 2011 in the United States.44 It provides a platform for collaboration between materials scientists (experimentalists and theorists) and computer scientists to deploy proven computational methodologies to predict, screen, and optimize materials at an unprecedented scale and rate.44,45 Various computational frameworks have been

10

developed as a part of this initiative. Some of the emerging approaches will be discussed here.

1.3.2.1 AFLOW Material Database

As a project under the Material Genome Initiative, AFLOW (Automatic-FLOW for Materials Discovery) is a method for high-throughput computational materials design.46,47 The database automates calculations using abinitio electronic structure methods such as VASP and Quantum Espresso. The method can calculate physical properties, such as relaxed geometries, electronic and phonon band structures, magnetic properties, and thermodynamic properties. Associated with AFLOW is the AFLOWLIB database that is a repository for ~700,000 experimental material compounds and with over 58×106 calculated properties (growing every day).46,47 A user can access the database via a web interface and query physical properties of materials of interest or seek structures of materials in the database that match the property of interest. A new direction is also underway that applies technique to search and find similar compounds within AFLOWLIB and other large inorganic database.48 This cheminformatics based approach relies on generating structural and electronic fingerprints of inorganic compounds, similar to the fingerprints used for organic molecules, which can easily be used for the querying, similarity searching, and QSAR generations.48 The AFLOWLIB database and AFLOW method has been used to search for novel nano-grained sintered thermoelectric materials. Thermoelectric devices convert

11

thermal to electrical energy, and vice versa, with no moving parts. They are reliable and portable, and are used in applications such as solid-state refrigerators and energy harvesting devices. However, applications are limited by the low efficiency of the currently available commercial thermoelectric compounds. Sintered nano-powered composites are proposed as a cost effective and efficient solution. Screening the database using the AFLOW approach, a number of candidate materials with properties of interest were found.49

In addition to including experimental data in the AFLOWLIB database, the computational designs of new compounds is also being carried out by data mining the experimental crystal structure database.50 In this approach, the substitution of an ion or an element in a compound is proposed based on a model trained on experimental crystal structures to predict analogous compounds. The method provides the substitution probability function that quantifies if a given substitution from a known compound is likely to lead to another stable compound.50 Compared to this exploration of inorganic material space, a similar broader exploration of organic material space is also being carried out as a part of Harvard clean energy project.

1.3.2.2 Harvard Clean Energy Project (CEP)

The CEP is a virtual high-throughput discovery and design effort for generating candidate structures suitable for harvesting renewable energy from the sun and for other organic electronic applications.51 The goal of the project is to move beyond the

12

traditional design approach where the design of new materials is largely based on empirical intuition or experience with a certain family of chemical compounds. An experimental group can only explore a small number of structures per year due to the lengthy synthesis and characterization process. As a way to accelerate the organic material discovery process, the CEP initiative uses a theory-driven high-throughput screening approach to identify potential high-performance material candidate on a large scale, similar in spirit to the material genome initiative.51

The research areas of the CEP have been to design organic materials that can function as organic photovoltaics,52,53 organic-based flow batteries,54 and blue organic light-emitting diodes.51 For all of these applications, the chemical space exploration is based on fragments collected from the literature and modifications to the known fragments. These fragments are combined in various positions to generate a full set of molecules. For example, for the organic photovoltaic application, 26 known fragments were combined to generate a library of molecules consisting of ~107 molecules.52,53 After generating the molecules, their HOMO and LUMO energies were computed via density function theory (DFT). A total of ~15×106 DFT calculations were carried out to generate a library of ~3.5×106 promising candidate molecules.52,53 This computation required a specialized distributed computing framework, IBM world community grid, and 3×104

CPU years were expended.55 The exploration of chemical space localized to the 26 fragments took a massive amount of computing resources. If one wants to expand

13

further (since the chemical space grows factorially with the number of fragments) the computational resource quickly becomes a bottleneck. In the similar design spirit, is the work from Hutchison et al56 for designing organic photovoltaics and the work by Levine et al22 to design blue organic light emitting diodes. Both groups use similar known fragments for chemical space exploration, but instead of using an exhaustive combinatorial search, they have used genetic algorithms to combine effective fragments to design a library on a smaller scale compared to the CEP projects.

A new direction to improve the efficiency of virtual screening of organic materials to design large molecular libraries is to apply machine-learning algorithms to initially filter the unproductive molecules. Machine-learning techniques such as kernel ridge regressions have been applied to estimate atomization energies57 and to improve the accuracy of semi-empirical calculations.58 Within the CEP virtual screening framework, machine-learning algorithm based on neural network was used to estimate the HOMO, LUMO, and power conversion efficiency (PCE) of organic molecules.55 The model that was built based on neural network performed impressively, with accuracy similar to the DFT approach (0.99 Pearson R coefficient) for computing HOMO, LUMO energies, and PCE. The group also showed that the model trained on only ~30% of the

CEP database (~4.5×106 molecules) was able to predict the three physical properties of the overall database (~15×106 molecules) with accuracy similar to that of the DFT method.55 This kind of empirical approach to predict physical properties can accelerate

14

the pace of materials discovery, but caution should be taken before blindly using the approach for any kind of dataset.

Machine learning models are interpolative rather than extrapolative, which means that the performance of the trained models becomes weaker (larger errors in prediction) if they are used to predict the properties of molecules that are structurally different compared to the molecules on which the model was trained.55 But combining these empirical approaches along with the rigorous electronic structure and computation methods and experiments has the potential to significantly improve the efficiency of new materials discovery. Also, if a diverse set of molecular libraries is created that can span chemical space (beyond the scope defined by a few fragments as in the CEP project), the prediction of machine-learning methods can be further improved, and the methods can be more broadly used to compute electronic properties of structurally diverse molecules.

1.3.3 Diversity-oriented stochastic search of chemical Spaces

Although the goal of enumerative approaches (fragment and atom based) to generate an exhaustive list of possible molecules is commendable, the astronomical size of the chemical space renders such approaches impractical. Stochastic approaches that seek to capture the structural diversity of large chemical spaces can generate compounds that can be tested using experimental or theoretical screening methods. The Beratan and

Yang labs have devised a chemical space exploration method called the Algorithm for

15

Chemical Space Exploration with Stochastic Search (ACSESS)21 that allows computationally feasible surveys of unexplored regions of the small-molecule universe without exhaustive enumeration. This approach generates general-purpose libraries that can be screened using computation or experiment to identify lead-like molecules. The method uses a genetic algorithm framework that maximizes the nearest-neighbor distance (diversity) of molecules. Starting with a known compound (for example a benzene molecule), a series of structural mutations that modify atoms, bonds, and fragments are applied to the starting molecule to generate structurally different molecules. Synthetically unfeasible or unstable molecules are deleted. To reduce the structural redundancy among the generated molecules, a diverse subset is chosen. In each iteration of the ACSESS method, the nearest-neighbor distance (diversity) of the library increases (Figure 4).

Starting(Molecule/

Structural(Modification/

Synthesizability(and( Stability(Based(Filtering/

Selecting(a(Maximally( Diverse(Subset/

Figure 4: Shows the ACSESS steps involved in generating a library of structurally diverse molecule for screening purposes.

16

The ACSESS generated representative library of possible drug like structures, referred to as the small molecule universe (SMU), was shown to span the space covered by existing molecular libraries such as ZINC and PubChem.21 In addition, compounds that are in the SMU are missing in the publicly available molecular databases. The SMU provides a set of molecules (~9×106 molecules) that can be screened against biological targets with the hope of discovering potential drug or material leads.21 A limitation of this approach is that it samples molecules based completely on diversity. The diversity is dependent on a set of empirically derived descriptors (a Cartesian coordinate system) that represent the structural (connectivity information) and simple physico-chemical properties59. In essence, the problem of maximizing diversity solved by ACSESS is gridding the given empirical coordinate system and selecting representative molecules at each grid point. The strategy of selecting molecules based entirely on diversity can miss lots of molecules that have important physical properties.

1.4 Thesis contribution: Property-biased exploration of diverse regions of chemical space

While chemical space mapping and exploration using SMU and GDB databases are itself novel undertakings, they does not guarantee the discovery of useful structures.

Some success in identifying valuable structures was accomplished using the enumerated

GDB-11 and GDB-1360 libraries, but screening of every member of a large enumerated library is very demanding. The Harvard clean energy project has made extraordinary

17

use of grid computing to create a library of ~3.5 million molecules potentially useful for organic electronics. An efficient and cost effective strategy is still needed that allows one to go beyond the known fragment space that the CEP project explores. There is a pressing need for computational approaches that bias molecular searches to efficiently generate useful libraries of compounds drawn from the vastness of molecular space. As a part of thesis research, I have established a method within the ACSESS framework to mine chemical space for collecting diverse compounds that possess physical properties biased towards targeted values. This chemical space sampling strategy for a structurally diverse collection of targeted molecules is called Property-Optimizing ACSESS (PO-

ACSESS) strategy. The method will be discussed in detail in chapter 2. The method uses the same descriptors and coordinate system as ACSESS to compute diversity (defined as the average nearest neighbor distance). To speed the property assessment of sampled molecules, the method relies heavily on the semi-empirical calculations. As will be discussed in the upcoming chapters, the semiempirical method computed properties might have significant errors. The errors in such computations can affect the design of

PO-ACSESS generated structures. Recent studies on the use of machine learning approaches for molecular properties by Aspuru-Guzik et al51,55 and von Lilienfeld et al57,58,61 are indicate schemes to improve the accuracy of semiempirical approaches for large-scale chemical space explorations. These techniques can be utilized in the PO-

ACSESS framework in the future. Despite the present limitations, our approach is

18

already capable of efficiently generating libraries of appealing molecules for various targeted applications which will be discussed in chapters 3-6. With incorporation of the improvements being realized from the machine-learning during property assessment, the PO-ACSESS can contribute more effectively towards the design of diverse materials for a wide range applications.

1.5 Outlines

Chapter 2: Describes the PO-ACSESS method, and provides a test study of this approach to optimize a simple molecular property (molecular dipole moment) and to explore optimization within a model NKp fitness landscape.

Chapter 3: Describes the new extension of PO-ACSESS to design organic chromophores that emit light is the blue region of the electromagnetic spectrum. The chapter also discusses further extensions of PO-ACSESS that can predict more efficient blue emitters having small singlet and triplet energies gap and can potentially undergo thermally assisted delayed florescence property (TADF).

Chapter 4: Presents a new extension within the PO-ACSESS framework that allows the design of structurally diverse small organic molecular libraries with computational binding affinity toward a protein target. This chapter also presents a test study of the PO-ACSESS framework to sample leads within a large chemical space with up to 21 heavy atom (SMU-21) and compares the efficiency of generating targeted leads

19

of this method and various other competing approaches, as well as, the original ACSESS approach.

Chapter 5: Is a follow up study to chapter 4 to design diverse lead-like molecules that target the CARM1 protein. CARM1 is an important biological target implicated in disease states such as cancer. In the presence of only a few known leads for this target,

PO-ACSESS found diverse leads targeting CARM1 showing computational binding activities similar to a known and crystallized binder.

Chapter 6: Applies the PO-ACSESS framework to predict leads targeting TAR

RNA. TAR RNA is important in the life cycle of the HIV-1 virus. Inhibition of TAR binding to its protein partner TAT via a small molecule drug can potentially have therapeutic value. PO-ACSESS was used to generate libraries of diverse small molecules targeting TAR RNA and having computational binding scores comparable or better than some known leads.

20

Chapter 2. A strategy to discover diverse optimal molecules in the small molecule universe.

This chapter is reproduced with permission from Rupakheti, C.; Virshup, A.;

Yang, W.; Beratan, D. N. A Strategy to Discover Diverse Optimal Molecules in the Small

Molecule Universe. J. Chem. Inf. Model. 2015, 1–31. Copyright 2015 American Chemical

Society.

2.1 Introduction

From the background chapter, we could realize that the rate of molecular discovery has not kept pace with the demand for molecular species that address compelling challenges,5 and one hopes to develop theoretical approaches to accelerate the pace of progress. The size of the small molecule universe makes its exploration and mining very challenging indeed. We have devised a chemical space exploration method called Algorithm for Chemical Space Exploration with Stochastic Search (ACSESS) 21 that allows computationally feasible surveying of unexplored regions of the small molecule universe without exhaustive enumeration.

While chemical space mapping and exploration is itself a novel undertaking, it does not guarantee the discovery of useful structures. Some success in identifying valuable structures was accomplished using the enumerated GDB-11 and GDB-1360 libraries, but screening of every member of a large enumerated library is very demanding. Hence, there is a pressing need for computational approaches that bias

21

molecular searches to produce useful libraries of compounds drawn from the vastness of molecular space. Here, we describe a method within the ACSESS framework to mine chemical space for collections of diverse compounds possessing optimum value (defined within a threshold from the global optimum) of a physical property of interest. The performance of this approach is tested on a GDB-9 enumerated space. As an example, we show that the property-optimizing ACSESS procedure can be used to construct libraries of diverse, high dipole moment molecules (within GDB-9) without enumerating all molecules within the GDB-9 space. We also evaluate the performance of the property- optimizing ACSESS method using the NKp fitness landscape. The NKp model landscape has been used to test the performance of various evolutionary algorithms.62,63

This model tunes the “ruggedness” of the property landscape by changing the length of a bit string (specified by the parameter N) and the numbers of local hills and valleys

(specified by the parameters K and p).62,63 To our knowledge, this is the first time that the

NKp landscape model was used to compare strategies for chemical space search. These studies demonstrate that property-optimizing ACSESS maximizes the diversity of useful molecules more efficiently than popular simple genetic algorithms. The approach presented here can likely be extended to other molecular design challenges in drug discovery and materials design.

2.2 Methods

We begin with a definition of terms and review of ACSESS.

22

Chemical space is defined as an N-dimensional Cartesian space in which compounds can be mapped using cheminformatics descriptors. Descriptors describe physical, chemical, and topological properties of the compounds. In our analysis, each compound is mapped using a set of 40 autocorrelation descriptors (which defines our molecular space).59 For each molecule, we computed the Moreau-Broto autocorrelation descriptor.59 This descriptor encodes correlations of atomic properties in a molecule as a function of the topological distance between atoms in the molecule. Both the molecular structure and the physico-chemical properties of a molecule can be encoded in this way, which is successfully used to construct structure-activity relationships.64 The autocorrelation descriptor is:

AC(d, p) p p (d d) (1) = ∑ i jδ ij − i<= j where, # %1, if dij = d δ(dij − d) = $ &%0, otherwise

dij is the shortest bond distance between atoms i and j

pi, pj are the descriptors of atoms i and j respectively

Examples of properties p include atomic number, Gastiger-Marsili partial charge,

59 atomic polarizability, topological steric index, and unity (i.e., pi =1 for all i). Values of d that represents the topological distance (bond distance), ranges from 0 to 7.59 The choice of the range for d is based on previous study.21 The use of the above five listed atomic

23

properties and topological distance results in a 40-dimensional descriptor. Here, the molecular descriptors of the generated molecules are mean centered and normalized to have unit variance.

The Chemical space distance between two compounds is defined as the Euclidean distance between compounds based on their descriptors:

N 2 D = d d (2) ij ∑( ik − jk ) k=1

XOR(i, j) Dham = (3) ij N

M 1 2 Dmin = min(Dij ) (4) ∑ i≠ j M i=1 where,

th dik and djk are k descriptor of the molecules i and j

N is the length of the descriptor vector

Dij is the Euclidean distance between the molecules i and j

ham Dij is the Hamming distance between two bit strings i and j of length N

XOR(i,j) is the count of bit positions that differ in the two strings

24

Dmin is the distance of the nearest molecule j from molecule i, i.e., the nearest-neighbor distance of the molecule i

M is the number of molecules in the library

The molecular fitness of any structure is a real valued molecular property. The magnitude of the molecular dipole moment, or a value drawn from the NKp model, is used in our analysis.

Next we introduce the property-optimizing ACSESS strategy to find optimal structures. We then describe property calculations within the enumerated (GDB-9) chemical universe and the enumerated binary fitness model (NKp landscape). Finally, we describe searching for optimal structures within both fitness landscapes using property- optimizing ACSESS strategy.

2.2.1 ACSESS algorithm to search molecular space

ACSESS samples from a chemical space by iteratively optimizing (maximizing) the nearest-neighbor distance (diversity) of a subset of compounds in the space of all possible compounds.21 There are four main steps in the property-optimizing ACSESS calculation described here: (1) initialize a library, (2) breed new compounds, (3) remove compounds that do not have a property value above a given threshold, and (4) select a maximally diverse subset of property qualified structures.21 ACSESS can be seeded with a collection of compounds or with a single molecule.21 The initial library is modified by making mutations and crossovers.21 Among the generated compounds, a diverse subset 25

is chosen by applying either maximin or cell-based partitioning algorithms.21 Maximin algorithm maximizes the nearest-neighbor distance between compounds. It does so by applying iterative approach where, to start with, a compound from a library is randomly selected. It then selects the next compound that is furthest from the initial compound in terms of chemical space distance (eq. 2). The next compound from the library it selects will be the furthest from both of the initially selected compounds. This process is repeated until a certain desired diverse library size is obtained.65 On the other hand, cell- based partitioning algorithm effectively partitions the chemical space into discrete multi- dimensional grids and picks molecule that falls in each grid.66 The advantage of cell- based partitioning is that it scales favorably for large library size.

Earlier implementations of ACSESS did not include an approach to property optimization. As modified here, the modified ACSESS now iteratively selects a maximally diverse set of solutions with a property values above a threshold value. In order to iteratively increase the property value, we grow the property value cutoff linearly with each iteration until a desired property value threshold is reached. Initially the threshold is set to a low value to ensure that the population does not collapse to zero size because of the fitness constraint. A schematic of this algorithm is shown in Figure 5.

26

Figure 5: The property-optimizing ACSESS procedure uses property filter and diversity-biased sampling to construct a diverse library with properties above a cutoff value.

2.2.2 Software Used for Property-Optimizing ACSESS

Property-optimizing ACSESS made use of OpenEye OEChemTK67 for molecule generation, OpenEye MolPropTK67 for filtering, and OpenEye OMEGA TK67 for conformation generation.

2.2.3 Optimization within GBD9 property landscape

We computed the Boltzmann-averaged dipole moment of small molecules in

GDB-9 and pursued effective strategies to identify molecules in this set with large dipole moment. Compounds among the 300,000 compounds of GDB-9 with nine of fewer

27

hetero atoms (allowed atom types include C, N, O, S, and Cl) defined the search space.68

For each molecule, the dipole moment (D) was computed using:

E e−β i ∑µi D i∈C (5) = E , ∑e−β i i∈C

where µi is the dipole moment of a single conformation, β was computed using

Boltzmann constant (kB) 3.1668×10-6 Hartree energy per Kelvin and temperature (T)

298K, i belonging to a set of conformations C and Ei is the internal energy of conformation i. Conformations for each molecule, including stereoisomers, were generated using OMEGA and flipper tools.67,69 Each conformation was energy minimized, and the total dipole moment was computed using AM1 calculation as implemented in the Gaussian 09 package.70 Dipole moments were stored in a database and retrieved during ACSESS exploration of the chemical space.71 ACSESS was used to create a maximally diverse set of compounds with dipole moment above a threshold value of 5.5 Debye (the 90th percentile of GDB-9 dipole moments).

2.2.4 Optimization in the NKp fitness landscape model

In addition to the GDB-9 molecular space, we were curious to test our approach on known binary fitness landscape. We chose NKp model since it has been widely used to test performances of genetic algorithms.72,73 The NKp fitness landscape maps a binary string to a fitness value from 0 to 1 range.72,73 The fitness of a binary string is simply the summation of fitness contribution of each cell scaled by the length of the string (eq. 6). In 28

this model, we should imagine a binary string and its NKp fitness value as a molecule and the molecule’s property value respectively (as in the case of GDB-9 molecules and their dipole moments). A binary string is associated with three parameters: (1) its length

N, (2) the number K of associations each bit makes to other bits in a string (ranges from 0 to N-1), and (3) the fitness contribution or weight p (also ranges from 0 to 1) of each bit position. By varying parameters K and p, we constructed a fitness landscape with multiple optima (to mimic the multiple optima case in the GDB-9 space; details presented in the supporting information).

The fitness (Φ) of a bit string (g) in the NKp fitness landscape model is:72

1 N (g) (g) (6) Φ = ∑φi N i=1 where g ∈ Q N and Q = 0 or 1; N = length of g

φi ∈ [0,1] , where φi is drawn randomly from [0,1]

We computed the fitness of all possible bit strings of 19 bits using NKp model

(eq. 6). We then applied property-optimizing ACSESS to create a maximally diverse set of bit strings having a NKp value equal to a threshold value of ~0.3 (the global maximum within our enumerated NKp model). In this case, the property-optimizing

ACSESS was modified to work with the binary strings. The mutations was defined as the bit flip, and crossover was defined as combining two fragments of binary strings by cutting them both at a randomly chosen position. The maximin algorithm used

29

hamming distance (eq. 3) measure while computing the nearest-neighbor distances (eq.

4).

2.3 Results and Discussions

We enumerated all possible 19-bit patterns and computed their fitness using

N=19, K=9, and p= 0.9 (total of 524288 bit strings which is similar to the number of molecules in the GDB-9 space. We also computed the averaged dipole moments of all molecules in GDB-9. The distributions are shown in Figure 6.

A B

200,000 90,000

180,000 Max = 0.2950 80,000 Number of Max’s = 19 Max = 9.604 D 160,000 Min = 0 Number of Max’s = 1 70,000 Min = 0 D Median = 0.0442 Median = 2.297 D 140,000 60,000 120,000 50,000 100,000 40,000 80,000 Number of Compounds Number of Compounds 30,000 60,000

20,000 40,000

20,000 10,000

0 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 1 2 3 4 5 6 7 8 9 10 NKp Fitness Averaged Dipole Moment

Figure 6: Distribution of molecular property values for (A) the enumerated NKp landscape with N=19, K=9, p=0.9 and (B) the dipole moments of all molecules in GDB-9.

2.3.1 ACSESS exploration of a NKp model space

We used property-optimizing ACSESS method to construct a diverse set of strings in the N=19, K=9, and p=0.9 fitness landscape. First, a maximally diverse set of binary strings was generated without using fitness-based selection (Figure 7A). The maximally diverse set was used to seed the property-optimizing ACSESS method that

30

maximizes fitness (Figure 7B). As shown in Figure 7B, property-optimizing ACSESS method finds the fittest solutions after 30 iterations without exhaustively enumerating and evaluating all possible bit strings.

A B 0.35 0.22

0.3 0.2 0.25

0.18 0.2

0.15 0.16 NKp Fitness 0.1 0.14

Nearest Neighbor Diversity 0.05

0.12 0

0.1 -0.05 0 5 10 15 20 25 30 35 40 45 50 0 10 20 30 40 50 Generations Generations

Figure 7: (A) Diversity biased sampling and (B) fitness-biased sampling. Diversity is first maximized to generate solutions that span a given chemical space. Here, the nearest-neighbor diversity (eq. 4) is defined as the average hamming distance (eq 3) to the nearest neighbor. The diverse solutions seed the property- optimizing ACSESS method and the fitness of the diverse set is improved iteratively. The global optimum in our enumerated NKp model space was found within 30 iterations of our optimization (data shown are averaged over ten different runs).

2.3.2 Comparing property-optimizing ACSESS with simple genetic algorithm searches

To judge the performance of property-optimizing ACSESS method relative to a simple genetic algorithm (SGA) that uses elitism (selecting the fittest molecules at each iteration),74 we implemented a SGA as an extension of ACSESS for molecular library design. We ensured fair comparison by 1) applying NKp based fitness filters to SGA and

ACSESS in every generation and 2) initializing both algorithms with the same diverse

31

subset sampled using ACSESS (Figure 7A). With the parameters and simulation steps exactly the same as in ACSESS, the only difference between ACSESS and SGA is that, at each iteration, SGA selects an equal number of the fittest solutions without considering the diversity among the solutions. Figure 8A indicates that the SGA finds the best solution (with the largest NKp fitness value). However, SGA tends to be trapped in local extrema more often than ACSESS (Figure 8B, 40% of SGA runs failed to find the optimum structures as opposed to 0% for ACSESS). On average, the solutions that SGA finds are sub-optimal compared to those found using ACSESS (Figure 8D). Also, among the fittest solutions (solutions having the largest NKp fitness function value, 19 total possibilities binary strings), the number of fittest solutions found by SGA is smaller (on average ~3 fittest solutions) compared to the number of fittest solutions found by

ACSESS (on average ~15 fittest solutions) (Figure 8C).

32

A B

0.35

0.3 1 0.9 0.25 Average Max 0.8 0.2 Min 0.7 0.6 Fitness 0.15 Success ACSESS 0.5 Rate SGA 0.1 0.4 0.3 0.05 0.2 0.1 0 0 10 20 30 40 50 Generations 0

C D

18 0.298 16 0.296 14 0.294 12 0.292 10 ACSESS Average SGA Subset Size Fitness 8 SGA 0.29 ACSESS 6 0.288 4 2 0.286 0 0.284

Figure 8: Comparison of property-optimizing ACSESS outcomes to the simple genetic algorithm (SGA) optimization for generating multiple solutions with the largest NKp fitness value. SGA iteratively samples the fittest solutions and improves upon them. (A) indicates that SGA finds the fittest solutions (global maximum in the constructed NKp model) after 30 iterations. (B) indicates that property-optimizing ACSESS runs escapes local optima more effectively than do the SGA runs, i.e. the success rate of finding largest fitness solution of property-optimizing ACSESS is 100% compared to 60% for SGA (from 10 runs of each method). (C) indicates that the property-optimizing ACSESS runs also performs better than SGA to sample diverse fittest solutions as it samples a larger number of fittest solutions compared to SGA. (D) indicates that, on average, the fitness of solutions generated by property- optimizing ACSESS is larger than that of SGA.

2.3.3 ACSESS explorations in GDB-9

Property-optimizing ACSESS was used to sample diverse molecules in GDB-9 with average dipole moments ≥ 5.5D (the top 10% molecules). As in the NKp model

33

space exploration, ACSESS first generated a maximally diverse set of compounds, without fitness optimization, spanning the GDB-9 universe (Figure 9A). The maximally diverse set was used to seed the next step where the fitness of the diverse set was improved iteratively. Figure 9B indicates that ACSESS finds diverse molecular structures with dipole moments above the 90th percentile of compounds in the GDB-9 universe without exhaustively enumerating and evaluating each molecule.

A B

9

16 8 14 Average Minimum 12 7 Maximum 10 6 8

6 5 Averaged Dipole Moment Nearest Neighbor Diversity 4 4 2

0 3 0 20 40 60 80 100 120 0 20 40 60 80 100 Generations Generations

Figure 9: Progress of ACSESS runs that sample GDB-9. The plots show the mean of the dipole moments from multiple-runs, and the error bars represent one standard deviation from the mean of either nearest-neighbor distance or dipole moment. (A) Initially, a maximally diverse set (measured by nearest-neighbor distance) is generated. (B) The maximally diverse set is used to maximize their dipole moments. The minimum dipole moment found after 60 iterations is the desired dipole moment threshold (i.e., the top 10% fittest molecules).

2.3.4 Comparisons of property-optimizing ACSESS and genetic algorithms methods for property optimization

We compared the performance of ACSESS to standard genetic algorithms (SGAs) to judge the relative performance for property biased library design. We ensured fair 34

comparison by 1) applying dipole moment filtering in every generation for SGAs and property-optimizing ACSESS and 2) initializing all algorithms with the same diverse subset of GDB-9 generated using ACSESS (Figure 9A). SGAs are used mainly for fitness optimization, but they can also maintain the diversity of solutions using techniques such as roulette wheel selection and tournament selection.74,75 We now explore how well

ACSESS optimizes the fitness and diversity of the library by comparing to these standard approaches. We compare the performance of ACSESS with SGAs where each method starts with the same diverse set constructed using ACSESS (Figure 9A).

A B Comparing Fitness of Solutions for Different Design Strategies 7

6.5 14 6 12 5.5 GA-Tournament GA−RouletteWheel 10 5 ACSESS GA-Elitism GA−Elitism Nearest 8 4.5 GA−Tournament Neighbor GA-RouleeWheel Diversity 4 6 ACSESS Averaged Dipole Moment 3.5 4 Enumerated GDB9 3 2

2.5 0 20 40 60 80 100 120 0 Generations

Figure 10: ACSESS and SGAs runs that maximize the dipole moments (fitness) of diverse molecules. The two plots tracks the average fitness of libraries generated by each design algorithm (color coded differently) and the error bars represent one standard deviation from the mean for multiple runs. (A) compares the fitness of the library optimized using ACSESS with three other genetic algorithms (colored coded differently). (B) compares the diversity of the fit libraries generated by ACSESS and SGAs.

We find on average that ACSESS generates molecules with similar or more favorable dipole moments (fitness) compared to SGAs, but ACSESS generates a more

35

diverse set of fit molecules (dipole moment >= 5.5D) compared to SGAs. More specifically, ACSESS generates molecules with higher dipole moments (fitness) than

SGA with roulette wheel selection, but ACSESS generates solutions of similar fitness to

SGA with tournament selection (Figure 10A). SGA with elitism (selecting the fittest solutions in every generation) performs marginally better than ACSESS. However,

Figure 10B indicates that the diversity (measured using nearest-neighbor Euclidean distance defined in eq. 4) of the molecular library for ACSESS is much larger than is found with the SGAs. In fact, the nearest-neighbor Euclidean distance (Euclidean distance of ~10) of the library generated by ACSESS is similar to the nearest-neighbor

Euclidean distance (Euclidean distance of ~12) that is found for the enumerated GDB-9 molecules with dipole moments ≥ 5.5D. These results indicate that the diversity of the

ACSESS generated library is similar to the diversity of the enumerated GBD-9 universe that contains only the compounds above a fitness cutoff. These findings are similar to those from the model NKp landscape, where ACSESS generated multiple global optima without becoming trapped in local optima. In contrast, the SGAs became trapped in the local optima in 40% of the runs (Figure 10B) and produced far fewer fittest solutions

(Figure 10C). These calculations indicate that the diversity enforcement in ACSESS yields sampling of different high fitness regions of the GDB-9 space favorably compared to SGAs. It is important to note that, while ACSESS generates large diversity solutions,

36

fitness is still comparable to and better, in some cases, compared to those found with popular simple genetic algorithms (Figures 8D and 10A).

2.3.5 Self-organizing maps of GDB-9 and of sampled solutions

Additional insight into performance of property-optimizing ACSESS and SGAs is obtained by visualizing the sampled molecules with respect to the enumerated molecular space. For this purpose, we projected the 40-dimension chemical space into two dimensions using a self-organizing map (SOM). SOMs are widely used in cheminformatics.76 They have the useful property of representing high-dimensional neighborhood relationships.76 A 100 x 100 torroidal grid was used where each grid point is a neuron.77,78 The SOM in this case was trained using the enumerated GDB-9. During the training, each neuron is randomly assigned a chemical space coordinate, and the neurons are trained by presenting a descriptor vector of each molecule.77,78 Neurons compete with each other for the presented descriptor vector and adjust their coordinates based on the descriptor vector closest to them.76 In two-dimensional representation, the structurally similar molecules clumped together in a neuron or closer neurons and the different molecules further apart and assigned to different neurons.

37

Averaged Dipole Moment per neuron

Enumerated GDB-9 in 2D Grid

A

Number of compounds per neuron B Diverse GDB9 Library

Figure 11: SOM for the enumerated and sampled GDB-9. (A) shows the full GDB-9 with color-bar representing average dipole moment per neuron from blue (low) to red (high). (B) shows the ACSESS designed diverse library (without maximizing the dipole moment). The filling of the plot in B indicates that the diverse library from ACSESS spans the GDB-9 space. The color bar in B indicates the number of molecules assigned to each neuron. The white space in both maps indicates regions where no compounds are assigned.

38

We projected the enumerated GDB-9 on the SOM and also the maximally diverse subset of GDB-9 sampled using ACSESS without maximizing dipole moments (Figures

11A and 11B). We can notice comparing Figures 11A and 11B that by maximizing diversity of GDB-9 molecules, we are able to sample various regions of the enumerated

GDB-9. We also projected, on the trained SOM, the libraries designed using property- optimizing ACSESS (Figure 9A) and SGA (using the elitist selection scheme, Figure 9B) and compared those with the high-activity regions (dipole moment ≥ 5.5 D) within GDB-

9 (Figure 12C).

39

A B

ACSESS Sampled GDB-9 in 2D SGA Sampled GDB-9 in 2D Grid

GDB9 ≥ 5.5D C

Figure 12: Comparison of the diversity found in an ACSESS library and in an SGA-elitist library to the fully enumerated GDB-9 library. The color bars in A and B indicate the number of compounds per neuron. (A) shows the more fit molecules (molecules above the fitness cutoff) generated by ACSESS, (B) shows the more fit molecules (molecules above the fitness cutoff) generated by SGA (elitist), and (C) shows the fittest regions of GDB-9 (≥5.5D dipole moment) where the color bar represents the number of compounds per neuron. The oval indicates molecules in GDB-9 that are discovered using ACSESS but missed by SGAs.

Figure 12 indicates that the high activity regions of GDB-9 are identified in

ACSESS analysis (Figure 12A in oval) but overlooked by SGA-elitist analysis (Figure

12B oval). The medium activity regions (yellow colored regions in Figure 12A)

40

populated by the ACSESS library that lie above the black oval (Figure 12A) are absent in

SGA library (Figure 12B). The underperformance of SGA explains why libraries generated by SGAs have lower diversity compared to the ACSESS generated libraries.

While ACSESS is able to explore different activity islands of GDB-9, the SGAs fail to do so.

2.4 Conclusions

In this study, we showed that property-optimizing ACSESS explorations navigate large chemical spaces to find compounds with favorable targeted property values. Property-optimizing ACSESS not only samples diverse regions of chemical space, but also samples useful regions without having to enumerate and test every single molecule. In fact, only ~ 30,000 fitness evaluations were carried out to locate the different optimal regions of GDB-9 (which contains ~300,000 molecules). We explored property-optimizing ACSESS performance in an NKp landscape and in the GDB-9 for molecular dipole moment. In these studies, Property-optimizing ACSESS performs favorably in terms of discovering diverse fit solutions compared to standard genetic algorithms, and property-optimizing ACSESS performs much more efficiently than exhaustive enumeration.

Property-optimizing ACSESS can also be used to sample astronomically large chemical spaces, and this represents a future aim. Since property-optimizing ACSESS library contains multiple favorable molecules, one can choose from among the diverse

41

available alternatives. This approach may assist in reducing attrition in molecular design challenges. The remaining chapters show the extensions and applications of property- optimizing ACSESS to sample larger chemical spaces to generate libraries of small organic molecules targeted towards organic light emitting diodes (OLED) and binders of proteins and RNA structure

42

Chapter 3. Diverse optimal molecular libraries for organic light emitting diodes.

This chapter is reproduced with permission from JCTC, submitted for publication. Unpublished work copyright 2015 American Chemical Society.

3.1 Introduction

Chapter two described about the property-optimizing ACSESS (PO-ACSESS) approach to generate diverse optimal molecular libraries. The property optimization of structurally diverse molecules was shown using simple molecular property such as molecular dipole moment and on a NKp fitness landscape model. In this chapter, an extension of this approach will be presented where PO-ACSESS will be extended and applied towards designing libraries of compound blue emitting organic chromophores having an optimal singlet-triple excitation energy gap.

Organic light emitting diodes (OLEDs) have applications ranging from lighting to electronic displays.79,80 Advantages of OLEDs over traditional technologies include their energy efficiency and compactness.79,80 OLEDs convert electrical charge into light by the radiative recombination of -hole pairs.81 The performance of blue OLEDs is lower than that for emitters of other colors.82 Unlike fluorescence based materials that emit from singlet exciton states with efficiency limited to ~ 25% (excitons exist in 25% singlet and 75% triplet states, assuming electrons and holes are injected with random spin orientation)3, phosphorescent materials can funnel both singlet and triplet excitons

43

into emissive states, thus endowing OLEDs with higher efficiencies.83 Known phosphorescent OLED materials are based on iridium and platinum complexes.84 The aim of this study is to identify prospects for high efficiency organic blue-emitting molecules. Our approach is based on designing property-optimized diverse molecular libraries85 with favorable fluorescence properties.

3.1.1 OLED structures

OLEDs can be single or multi-layered,79,80 and property optimization can focus on the individual layers or the interfaces. A typical OLED structure is shown in Figure

1340:

e- h+ h+e -

25% 75% S1 T1 Intersystem Crossing

Fluorescence Phosphorescence Anode Cathode

Figure 13: Light emission mechanism from an OLED device. Holes (h+) are supplied by the anode and electrons (e-) by the cathode.

Electrons and holes pair in the emissive layer of an OLED to form excitons that can exist either in singlet (S1) or triplet states (T1). Since the injected charges are not spin

44

correlated, 75% of the excited states produced are triplets. The excitons can decay radiatively, via fluorescence or phosphorescence. Intersystem crossing (from T1 to S1) can be facilitated by heavy metals (such as iridium) and is enhanced if the singlet and triplet states are close in energy.80

First generation blue emitting OLEDs harvest luminescence from singlet-excited states. Second generation OLED materials (PHOLEDs) harvest luminescence from triplet excited states with transition metal complexes that mix singlet and triplet states via spin- orbit coupling. The exciton spin state may be, αα, ββ, αβ+βα, or αβ-βα, where α and β represent electron spins. The first three states are triplets and the last is a singlet. In

OLEDs based purely on fluorescence, the radiative decay of triplet excitons (75%) is spin forbidden, and only the singlet excitons (25%) produce appreciable luminescence. 3,86

Organometallic LEDs that emit from triplet states are coupled by spin-orbit interactions to higher lying singlets and to the ground state, enhancing the emission. However, the requisite metals (such as iridium) are costly. 3,86 Third generation OLEDs are based on fluorophores that produce thermally activated delayed fluorescence (TADF). TADF emission can be achieved when the energy gap between triplet and singlet states is sufficiently low that excitons in triplet states can transition into singlet states

(intersystem crossing, Figure 13) by thermal activation. TADF thus circumvents the emission quantum yield limitations defined by spin statistics. Two properties are required to realize efficient TADF: a small energy gap between the lowest energy singlet

45

and triplet excited states and a large transition dipole moment for fluorescence. 3,86

Recent experiments showed impressive results for TADF with internal quantum yields

(ratio of radiative electron-hole recombination to the total radiative and nonradiative, recombination)87 and external quantum yields (ratio of the number of photons emitted from the LED to the number of electrons passing through the device)87 for singlets and triplets up to 100% and 25%, respectively.84 There are few purely organic compounds with high-efficiency TADF, and molecular design is based on the formation of a charge transfer intermediate state (with a donor-acceptor-donor motif).3,86 Developing strategies to expand the TADF mechanism to organic structures beyond the known donor- acceptor-donor framework is appealing and could be facilitated by comprehensive chemical space explorations. Here, we extend the strategy for diverse optimal molecular library design based on the property-optimizing ACSESS strategy88 to design libraries of

1) blue emitting fluorescent structures that emit dominantly from singlet states and 2) higher efficiency blue emitters utilizing TADF emission.

3.2 Method

3.2.1 Using property-optimizing ACSESS to design fluorescent emitters

We briefly summarize the approach for diverse optimal molecular library construction using the ACSESS approach. The details of the ACSESS algorithm appear in Refs [14] and [49]. The previous implementations of ACSESS aimed to maximize library diversity of chemically stable organic molecules21 or to maximize the diversity of 46

molecules while optimizing a ground state molecular property.88 Here, we extend the

ACSESS approach to discover diverse chromophores with targeted fluorescent properties (such as blue emission and low singlet-triplet excitation energy gap).

As in earlier ACSESS calculations, we use a 40-dimensional autocorrelation descriptor to define a molecule’s coordinates in chemical space.59 We coupled the

ACSESS code with the Gaussian0970 and CNDO89 programs to compute molecular excitation energies and oscillator strengths. As part of the property assessment, the objective function of ACSESS was extended to include: 1) the lowest singlet excitation energy, 2) the lowest triplet excitation energy, 3) the gap between the singlet (S1) and the triplet (T1) excitation energies, and 4) the transition’s oscillator strength. The molecules accepted in the ACSESS library are required to satisfy multiple criteria, in contrast to our earlier single property biased library design. As extended here, ACSESS samples from chemical space by iteratively maximizing the nearest-neighbor distance (diversity) of a subset of compounds in the space of all possible fluorescent emitters.21

There are four main steps in the extended property-optimizing ACSESS calculation: (1) initialize a library with known blue emitters, (2) breed new compounds,

(3) remove compounds that are not stable or do not have electronic properties that satisfy the multiple-property thresholds, and (4) select a maximally-diverse subset of property-qualified structures. To optimize the electronic property values iteratively, we

47

tighten the property value cutoffs linearly with each iteration step, until a desired threshold is reached for each property. Initially, the threshold is set to a low value

(based on the electronic property of interest) to ensure that the population does not collapse to zero because of the fitness constraint. The flowchart of the multi-objective property-biased ACSESS scheme used here is shown in the Figure 14.

Initialize Library with Known Blue Emitter

Modify Molecular Structures Properties to Optimize

1) Validate excitation Filter based on Chemical Stability and energy within blue Electronic Properties 2) Validate oscillator strength value

Select a Maximally Diverse Set 3) If required, Validate S1- T1 gap

No Final Iteration?

Yes

Done

Figure 14: ACSESS extension for fluorescent-molecule design based on emission wavelength, oscillator strength, and singlet-triplet excitation energy gap.

Beginning with a few known organic blue emitters,90,91 the goal of the molecular design was to generate a library of diverse structures with singlet excitation energies in the 2.5-4.0 eV range, with an optimal singlet (S1)-triplet (T1) energy gap (ΔES1-T1) at most

48

~0.3 eV (we chose ~0.3 eV following previous published design structures22,86 and considering the errors involved in the semiempirical calculations).

3.2.2 Quantum chemical computation of singlet excitation energies and oscillator strengths

During the ACSESS library generation, the molecular property values are computed on the fly. We use semiempirical quantum chemistry methods to reduce the computational costs. The ground state geometries are optimized using the semiempirical

AM1 approach.92 Vertical electronic excitation energies and oscillator strength are computed for the lowest singlet transitions using ZINDO/S calculations.93 The reliability of a semiempirical approach was assessed by comparing the excitation energies for a set of published organic chromophores94,95 with time-dependent density functional theory

(TDDFT).96,97 We randomly selected a set of polycyclic, aromatic hydrocarbons to compare the TDDFT and semiempirical excitation energies. 94,95 TDDFT and semiempirical calculations were performed using the Gaussian09 package.70 The geometry optimizations were performed using the CAM-B3LYP98 functional with the 6-

31G* basis set.99 Frequency analysis was carried out for all molecules in the validation set to ensure that the optimized structures were local energy minima. The optimized geometries were used in TDDFT excited states calculations with the same functional and basis sets that were used for geometry optimization.

49

3.2.3 Design of fluorescent emitters that utilizes only singlet excitation state

Within this extended property-biased ACSESS framework, we used chemical stability and synthetic feasibility criteria, applied previously to generate GDB libraries,19,68 and we computed the electronic excitation energies and oscillator strengths of molecules at each iteration of the ACSESS algorithm. We focused on generating molecules with singlet excitation energies in the 2.5-4.0 eV range (with ≥ 0.2 oscillator strength). Semiempirical AM1 and ZINDO/S methods were used to compute the excitation energies and oscillator strengths of the candidate molecules. The molecular structures were generated using OpenEye67 and OMEGA libraries,69 and optimized at the AM1 level. As a final step, to assess the reliability of the semiempirical calculations, the property values for the optimized ACSESS library were further analyzed using DFT and TDDFT computations: the molecular structure was re-optimized at the CAM-

B3LYP/6-31G* level of theory and the lowest vertical singlet excitation energies and oscillator strengths were computed at the TD-CAM-B3LYP/6-31G* level based on the

DFT optimized geometry.

3.2.4 Designing the TADF library by optimizing the singlet - triplet gap

Triplet excitons are not useful, per se, in OLED materials because of their long phosphorescence lifetimes. The TADF mechanism funnels T1 states to nearby S1 states, and thus enhances emission intensity.3 In the condensed phase at room temperature, the

50

initially excited molecules typically relax to S1 or T1 excited states on sub-picosecond time scales.100 If S1 is near the potential minimum of T1, the small S1-T1 energy gap will enhance the intersystem crossing from T1 to S1.3 Thus, a key feature of TADF structures is a small S1-T1 energy gap near the T1 equilibrium geometry. Previous TADF designs focused on molecules with S1 and T1 excited states that have charge-transfer character favoring degeneracy of the S1 and T1 states.3,22,84,86 Here, we do not constrain our search to species with charge transfer excited state character, and we explore the chemical space in a property-biased manner to favor near degenerate singlet and triplet excited states.3

We used the library optimized by ACSESS in the previous step (fluorescent emitters with 25% quantum yield) and minimized the singlet and triplet excitation energy gap, requiring the energy gap (ΔES1-T1) ~0.3 eV. For the design, we also used chemical stability and synthetic feasibility filters that were used previously to generate the small molecule universe (SMU) of compounds.21 ACSESS was extended to compute the energies of the generated molecules at their minima on the lowest triplet state potential energy surface. In summary, the multi-objective design criteria were: 1) vertical singlet excitation energy in the 2.4 - 4.1 eV range (blue region); 2) oscillator strength ≥ 0.4; 3) ΔES1-T1 ≤ 0.3 eV at the triplet excited state geometry minimum.

Our benchmark calculations indicate that the singlet-triplet gap derived from semiempirical methods is offset from those found in the TDDFT analysis (see results

51

section). The optimized fluorescence properties obtained here are thus reliable only within the uncertainties associated with the excited state energy calculations. We chose a reasonably low ΔES1-T1 value of 0.3 eV and achieved this objective by discarding compounds that do not match our criteria 1-3 at each iteration. To ensure that the number of promising candidates in each iteration pool does not collapse to zero, we gradually tightened our convergence criteria. First, we constrained the singlet excited state energies of these compounds within the 2.4 - 4.1 eV range. Then, we linearly increased the oscillator strength cutoff and, finally, we linearly decreased the singlet- triplet energy gap requirement until we reached the targeted values of all three criteria.

More specifically, we used the following steps to optimize the library of candidate TADF structures:

1) Ground singlet state (S0) optimization using the DFTB Hamiltonian

2) ZINDO/S calculation of the lowest singlet excitation energy and oscillator

strength based on the S0 optimized geometry

3) Lowest triplet state (T1) geometry optimization using the DFTB Hamiltonian

ZINDO/S calculation of the lowest singlet and triplet excitation energies using the optimized T1 geometry.

52

3.3 Results and Discussions

3.3.1 Validating the Semiempirical Excitation Energies

A set of 44 published organic chromophores94,95 was used to compare computed singlet and triplet excitation energies from ZINDO/S and TDDFT methods. Figure 15 shows the comparison of the computed vertical electronic excitation energies of the lowest singlet excited state based on the ZINDO/S and TDDFT methods. The average difference between ZINDO/S and TDDFT computed vertical electronic excitation energies is ~0.7 eV (Figure 15A). The average difference between the two methods for computing the lowest triplet excited state energy is ~0.9 eV (Figure 15B). This study suggests that using semiempirical calculations to design the singlet emitters is feasible, since our criterion for blue emission ranges from 2.4 to 4.1 eV. However, for the TADF designs involving small singlet-triplet excitation energy gaps (of less than 0.3 eV), the semiempirical calculations are not sufficiently accurate and will require improvement.

We could introduce more reliable electronic structure methods within ACSESS to improve the accuracies at a higher computational cost. Alternatively, applying machine learning algorithms (neural network and support vector machine) may be useful to correct the error of semiempirical calculations.57

53

A

B

Figure 15: (A) Comparison of vertical electronic excitation energies for the lowest singlet transitions of the validation set (average error ~0.7 eV). (B) Comparison of vertical electronic excitation energies for the lowest triplet transition of the validation set (average error ~0.9 eV).

54

3.3.2 Property-biased ACSESS-based library of singlet emitters

We used property-optimizing ACSESS calculations (Figure 14) to construct a library of diverse structures with property values targeted for blue emission and substantial oscillator strengths with the chemical stability and synthetic feasibility described above68 to generate a diverse library of 195 fluorescent structures that fit the property constraints. The structures of 4 representative molecules of the generated library are shown in Figure 16. The statistics of the predicted lowest singlet excitation energies and oscillator strengths for the 195 generated compounds are summarized in

Table 3.

Figure 16: ACSESS generated potential blue-emitting organic structures. Shown are some representative structures of the 195 compound library with average singlet excitation energy and oscillator strength (3.88 eV and 0.8, respectively). The blue ovals indicated the chromophore unit of the molecules. 55

Table 1: Statistics of the generated 195-member library

Singlet Energy(eV) Oscillator Strength

max 3.99 0.99

min 3.46 0.2

Average 3.88 0.8

Table 1 shows that the predicted singlet excitation energy, on average, is 3.88 eV

(320 nm) with an average oscillator strength of 0.8. To compare our generated structures with known structures, we collected 48 published organic blue emitters.101–105 All of the property-optimizing ACSESS generated molecules are structurally unique compared to these published blue emitters. None of the molecules published or patented were found to have the same chromophores (blue ovals, Figure 16) as in our generated structures, base on a SciFinder search. Also, the known and registered molecules within SciFinder have only ~65% structural match with the molecules in our generated library. As such, the molecules in our library most likely have not been investigated before for their fluorescence properties. The lowest singlet excited states for structures containing this unique chromophore are dominated by π à π* transitions of the noted motifs. The

HOMO and LUMO for one of the representative structures are shown in Figures 17A and 17B, respectively. To visualize the difference between the ACSESS generated

56

compounds and the published structures, we projected the published blue emitters onto the self-organizing map trained on the 195 property biased ACSESS generated compounds (Figure 18B). We noticed that ACSESS sampled compounds cover chemical space broadly compared to the known compounds from Figures 18A and 18B. The known compounds are localized to small regions and occupy only a small portion of the

ACSESS sampled chemical space. The white areas in Figure 18B indicate regions where no published compounds were mapped to the SOM trained on ACSESS sampled compounds. Comparing Figures 18A and 18B, we find that the property-biased ACSESS generated compounds sample regions of chemical space not populated by published compounds.

A B

Figure 17: Figures show the HOMO and LUMO of a representative structure sampled using ACSESS. (A) Shows HOMO and (B) shows LUMO of the representative molecule.

57 A

B Number of Compounds Per Grid

Figure 18: The x- and y-axes represent position in a SOM grid. (A) The ACSESS generated blue emitting compounds are all shown in a self-organizing map (SOM) where each blue dot represents a compound. (B) Projection of known compounds in the SOM trained on ACSESS generated SOM.

58

3.3.3 Validating the ZINDO/S Excitation Energy Calculations with TDDFT for the ACSESS-designed Library

To test the validity of the computed transition energies of the designed structures, we compared the results of the ZINDO/S computations with those of TDDFT analysis for 45 maximally diverse structures (a representative subset of the pool of

ACSESS generated molecules)65 among the 195 generated structures. The structures were optimized; their ground state and excitation energies, as well as their transition dipole moment were computed using the CAM-B3LYP functional with a 6-31G* basis set.

TDDFT computed excitation energies agree with the ZINDO/S computed energies for 43 of the 45 molecules to lie in the blue emission region (2.4-4.1 eV). Table 2 shows the average differences for the lowest singlet excitation energies and oscillator strengths found with the two methods. Figure 19 compares ZINDO and TDDFT computed singlet excitation energies. Since the average difference for the computed singlet excitation energy between TDDFT and ZINDO/S is 0.13 eV, ZINDO/S reproduces the DFT excitation energies to the level of accuracy required to remain in the blue region of the spectrum. The oscillator strength computed using the semiempirical approach has a mean error of 0.44 compared to the TDDFT computed values. The diverse library has an average TDDFT computed oscillator strength of ~0.4 (Table 2).

59

Figure 19: Comparison of vertical electronic excitation energies for the lowest singlet transitions of the maximally diverse set (average error of ~0.13 eV between ZINDO/S and TDDFT). The excitation energies are in eV units.

Table 2: Average error between TDDFT and ZINDO/S computed excitation energies and oscillator strengths for the maximally diverse library of 45 compounds.

Singlet Energy (eV) Oscillator Strength

ZINDO 3.874 0.81

DFT 3.8505 0.384

Average Error 0.13 0.44

3.3.4 Property-optimized ACSESS Library of TADF Emitters

We used the extended property-biased ACSESS method to design a library of novel TADF emitters (Figure 14) where we initialized the library with property-

60

optimizing ACSESS sampled blue fluorescent emitters (Figure 16). The initial library of

195 compounds had an average ΔES1-T1 of ~1.8 eV. We ran ACSESS for 50 iterations where ~90 molecules were generated per iteration that satisfied the required ΔES1-T1 constraint. Each generated molecule was subjected to geometry optimization and singlet and triplet excitation energy calculation. We reached our goal of optimizing the average excitation energy gap of the molecules to below 0.3 eV in 35 iterations (Figure 20). At the same time, we optimized the average oscillator strength to 0.6 (Figure 21) and the vertical singlet energy in the blue region (2.4 - 4.1 eV) of the spectrum (Figure 22). The full run was completed in 5 days on a 16 core CPU. The majority of the computational time (~80%) was spent on the semiempirical electronic structure computations.

61

Figure 20: Singlet-triplet energy gap optimization derived from the property- biased ACSESS calculation. We reach the goal of shrinking ΔES1-T1 to ~0.3 eV within 35 iterations.

Figure 21: Oscillator strength optimization using the property-optimizing ACSESS method. We increased the computed oscillator strength to ~ 0.6 in the 35 iterations.

62

Figure 22: Singlet excitation energy at each iteration of the property- optimizing ACSESS calculation. While optimizing the S1-T1 energy gap and the oscillator strength, all of our compounds’ excitation energies remain in the blue region (2.9-3.8 eV is the range for the sampled molecules).

Figure 20 indicates that we can minimize the singlet-triplet energy gap iteratively to 0.2 eV on average. The advantage of designing near-degenerate singlet and triplet excited-state energies is that it favors intersystem crossing and improves the overall

OLED quantum yield. Figures 21 and 22 indicate that, while optimizing the gap, we can also discover compounds that emit in the blue region (2.9-3.8 eV) with strong oscillator strength. The optimization of the key properties involved in the TADF design may expand the diversity and number of synthetic targets. Some of the molecules with favorable computed semiempirical properties are shown in Figure 23.

63

HN O

H2N N N

H O N N N N N

H2N

N N N N N N H H NH N NH2 O N 2 O

Figure 23: Molecules generated with the property-optimizing ACSESS method with semiempirical computations of the average S1-T1 gap of ~0.2 eV, average excitation energy of ~3.3 eV, and average oscillator strength of ~0.6.

3.4 Conclusions

We used the extended property-optimizing ACSESS method to construct a library of structurally diverse organic compounds that are predicted to emit blue light

(2.4-4.1 eV). The library contains 195 molecules, most of which (~70%) are novel structures that do not appear in the literature. The average computed light emission wavelength is 484 nm. The accuracy of the predicted emission wavelength was validated using TDDFT analysis. We found that the semiempirical values are offset on the average from the TDDFT computed values of the validation set by 0.7 eV and 0.9 eV for the singlet and the triplet excitation energies, respectively. As a demonstration of our extended multi-objective optimization approach, we optimized the singlet-triplet energy gap by utilizing only semiempirical computed oscillator strengths, singlet-triplet energy

64

gaps, and maintaining the emission in the 2.4-4.1 eV range. We found ~30 structures with a computed ΔES1-T1 ~0.3 eV that are predicted to emit in the blue and to have reasonably high oscillator strengths. Previous computational designs for TADF structures relied on the D-A-D motif,29 while our designs are not restricted to a specific motif. In this semiempirical-based design, around 30 molecules are predicted to have favorable TADF properties in the blue. Further improvements in the reliability of this approach are expected to be possible using higher-level electronic structure methods that could be implemented on more advanced computer hardware.

With the success of predicting the blue emitters by searching larger chemical space, we have also tested PO-ACSESS approach to search for lead-like libraries targeting various proteins. A test study of lead-like libraries design will be presented in the following chapters.

65

Chapter 4. Efficiently charting astronomically large chemical spaces to search drug leads

4.1 Introduction

The rate of discovery of new molecular entities that address urgent biomedical needs has not kept pace with demands.5 Experimental and computational efforts of many kinds are in progress that may help to accelerate drug discovery.19,21,106,107

Diversity-oriented and natural product inspired synthesis are being used in an attempt to address previously “undruggable” targets.108,109 Computational efforts associated with library design in support of drug discovery thus far have involved enumeration approach (building libraries up to 17 heavy atoms),19 and small molecules universe

(SMU) exploration using the ACSESS method. Recent ACSESS studies have introduced physical property biasing as the library is constructed.110 These innovative approaches have helped us identify new molecular opportunities, but they produce lead-like molecules without biasing towards enhancing binding affinity for a receptor target.

Such, unbiased libraries might prove to be successful for a specific target but the general utility of a library based purely on diversity is not guaranteed to be productive for all targets. Moreover, the enumeration and evaluation of small molecules containing only

17 heavy atoms is daunting. In order to mine the larger molecular spaces for pharmaceutical applications without becoming overwhelmed by the vast number of possibilities, we extended the ACSESS method to include property-optimizing (PO-

ACSESS) to direct the molecular search towards structurally-diverse molecules with a 66

desired physical property. Prior studies have included the ground state molecular dipole moment and organic light emitting diode properties. Here, we extend the PO-

ACSESS framework to develop drug-like leads for a specific protein target and demonstrate that we can efficiently generate a more structurally diverse collection of high scoring lead compounds compared to the enumeration19 or the unbiased ACSESS framework110 for experimental testing. For this study, we choose the well-characterized

COX-2 protein as a drug design target. We evaluate AutoDock Vina as a tool to screen and rank ligands for binding to this target. First, we show that Vina computed protein- ligand interaction scores may be used to enrich the known binders of COX-2 more accurately than similar open-source docking software. Second, we determine the performance of PO-ACSESS analysis that optimizes Vina scores of diverse molecules in

GDB9 and compare it with competing stochastic search and optimization approaches.

Finally, we show that we can prioritize superior scoring molecules for experimental testing much more efficiently using PO-ACSESS than enumeration within GDB17 or simple use of ACSESS in the small molecule universe (SMU)21 containing up to 21 heavy atoms (SMU-21). We show that PO-ACSESS not only prioritizes the screening collection within the known GDB database but can also predict high scoring drug-like leads for a given chemical space much larger in scope than the GDB databases.

67

4.2 Methods

The definition of chemical space, chemical space distance, and diversity were presented in chapter one. In our earlier PO-ACSESS study, we optimized the ensemble averaged ground state molecular dipole moment. A second study has focused on optimizing the properties of organic light emitting diodes.111 Here, we present the extension of PO-ACSESS to generate drug-like leads using chemical synthesis accessibiliy and stability filters110 while maximizing the Vina docking score and library diversity. The docking scores found are comparable to the known binder naproxen

(C14H14O3) binding to COX2 (PDB ID: 3NT1, protein-ligand complex). As with earlier

PO-ACSESS applications,85 lead generation method has four main steps: 1) initializing compounds with some known molecule (e.g., benzene), 2) breeding new molecules using mutations and crossovers to change atom and bond types, 3) screening the molecules from the previous step, and 4) selecting a diverse set of property-qualified structures. We iterate these four steps until we reach convergence in our objective function or we exhaust the allowed number of ACSESS cycles. In each iteration, we grow the protein-ligand binding score linearly, and we discard molecules with scores below a cutoff. A flowchart of PO-ACSESS with docking score based selection is shown in Figure 24.

68

Initialize Library with a Known Molecule

Modify Molecular Structures

Filter based on Chemical Stability and Protein-Ligand Binding Score

Select a Maximally Diverse Set

No Final Iteration?

Yes

Done

Figure 24: Property-Optimizing ACSESS framework for lead design. Molecules are constructed using mutations and crossovers,110 chemical stability and synthesizability constraints will cause modification or discarding of structures that are unsuitable.110 Suitable molecules are scored for docking to the target, and a maximally diverse set of docking score qualified structures is selected.110

4.2.1 Benchmarking Docking Methods for the COX-2 Receptor

Before applying standard docking software, we validated the methods by examining a set of known binders and decoys obtained from the DUD database.112 The

COX-2 protein target crystal structure was obtained from the protein databank (PDB ID

3NT1).113 The protein was set up docking using Autodock tools.114 The grid box for

69

docking calculation contained the active site and was centered on the crystallized ligand.

The smiles string for each compound was converted to a 3D structure using the OMEGA tool.69 Hydrogen atoms were added to the molecules in a neutral pH environment using

QuacPac tool115–117 and their Gasteiger partial charges were computed. The molecules were then docked using 4 different methods: 1) AutoDock, with flexible protein residues and ligand,118 2) Rigid AutoDock, with fixed protein and flexible ligands,,119 3) AutoDock

Vina, with flexible protein residues and ligands,120 4) AutoDock Vina, with fixed protein and flexible ligands.120 For the flexible docking where both protein and ligand are flexible, only two residues (Arg120 and Tyr355) in the binding pocket were made flexible. The highest scoring (representing highest affinity) docking pose for each ligand was tabulated. To understand how the docking methods performed, we plotted the ROC curve (Figure 25) and studied the enrichment of known binders over decoys for each method.

4.2.2 Comparing Property-Optimized ACSESS with Simple Genetic Algorithms

We studied the relative merits of the PO-ACSESS method by comparing with simple genetic algorithms within the enumerated GDB9 chemical universe. GDB9 contains the enumerated set of all compounds with as many as 9 heavy atoms selected from C, N, O, Cl, and S (~300,000 molecules).68 We screened this chemical space using the AutoDock Vina method. To avoid recalculations of docking scores, the docking scores associated with each molecule were recorded in a database and were looked up 70

during property optimization runs. The preparation for protein, ligands, and grid generation steps were exactly same to the benchmark calculations. As in an earlier study,85 we compared PO-ACSESS analysis with simple genetic algorithms employing roulette-wheel, tournament, and elitist selections.74,75 We insured fair comparisons by running the methods for the same number of cycles and constraining each method to undergo the same number of fitness evaluations. Each method was initialized to optimize the docking scores of the same initial set of diverse molecules with in GDB9 generated using ACSESS.110 Each method was evaluated based on the optimized docking scores of the generated library as well as the average nearest-neighbor distance

(diversity)85 of the library.

4.2.2 Optimization of docking scores within GDB17

To explore if the performance of the property-optimized method is dependent on the size of the search space, we used PO-ACSESS with the docking extension to search for molecules with docking scores greater than or equal to that of naproxen (≤-8.2 vina score) within GDB17. We present a comparison of the PO-ACSESS method with random sampling of known drug-like leads in GDB17. We randomly sampled 6,300 molecules from the lead-like GDB17, and subjected them to protein docking using Vina, and computed the average docking score. We also ran PO-ACSESS to search, within GDB17, a diverse set of molecules with a docking score comparable to naproxen (≤-8.2 vina score) and generated a library of 6,300 high docking scored molecules. Finally, we

71

compared the average docking scores of molecules generated using PO-ACSESS and random sampling within lead-like GDB17.

4.2.3 Optimization of docking scores within SMU-21

We wanted to explore if PO-ACSESS can explore chemical spaces beyond the scope of the existing GDB databases (>GDB17). We set the goal of exploring SMU-21

(small molecule universe containing molecules with ≤21 heavy atoms including C, N, O, and S) to design a library of diverse molecules with docking score comparable to naproxen (≤-8.2 vina score). Within SMU-21, we compared the library of compounds generated using ACSESS110 and PO-ACSESS.85 ACSESS was used to create a maximally diverse subset of SMU-21 containing the same number of molecules as PO-ACSESS generated. We compared the average Vina docking scores of the ACSESS and PB-

ACSESS libraries.

4.3 Results

4.3.1 Comparing enrichment of binders for different docking methods

We used four different docking methods to screen known decoys and binders of the

COX-2 receptor. The binders and decoys were obtained from the Directory of Useful Decoys

(DUD).112 Flexible docking methods permit a few protein side chains and the ligand to be flexible, while “rigid docking” methods either permit flexibility of only the ligand or treat both the protein and ligand as rigid objects. The task faced by the docking algorithms is to search for ligand conformations (and protein side chain conformations in flexible docking) that minimize the conformational energy computed using an empirical or semiempirical energy function. We

72

selected all of the 300 known binders and randomly selected 300 known decoys of COX-2 from the DUD. Each compound was docked and the score of the strongest binding pose was recorded.

The enrichment curve for the strongest binding mode is shown in Figure 25.

1 Flexible AutoDock (AUC=0.61) 0.9 Rigid AutoDock (AUC=0.43) Flexible Vina (AUC=0.48) 0.8 Rigid Vina (AUC = 0.62)

0.7

0.6

0.5

0.4 True Positive Rate 0.3

0.2

0.1

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate

Figure 25: Enrichment curves compare how well known binders are ranked above known decoys for the COX-2 system for different docking methods. The area under the curve (AUC) provides a metric for the performance of each docking method, indicating how well a docking method ranks a randomly chosen binder above a randomly chosen decoy. The x-axis shows the true positive rate and the y-axis shows the false positive rate for each docking score threshold.

Figure 25 indicates that the rigid Vina method (flexible ligand / rigid protein) has a performance comparable to the flexible AutoDock method (treats a few side chains of the protein and the ligand as flexible). This comparative study finds that Vina performs marginally better than flexible AutoDock for a lower computational cost. The cost increases with the flexibility of the ligand and side chains of the protein. From this 73

study, we found that the rigid Vina method (Fig. 25, AUC= 0.62) could be applied to search for potential leads by screening larger molecular databases.

4.3.2 Docking score distribution of GDB9 compounds

We screened each compound in GDB9 using Vina and computed the binding score of their strongest binding ranked pose. The distribution of the Vina scores for ~300,000 compounds of GDB9 is shown in Figure 26.

Figure 26: Vina docking score distribution of the strongest binding pose for all compounds of GDB9 screened for COX-2 binding.

Figure 26 shows the Vina score distribution for all compounds in the GDB9 database. Fewer than 2% of the GDB9 compounds had Vina scores below -6.0. Figure 26 indicates the fraction of GDB9 compounds that is plausibly lead-like for COX2 binding.

In addition, we also wanted to understand the clustering of molecules within GDB9 and know the parts of GDB9 that have strongest binders (activity regions) via visualizing these compounds within the framework of a self-organizing map (SOM).76 For PO-

74

ACSESS runs, we have used a 40-dimensional autocorrelation function based description of the molecules.21,85 We cannot visualize the clustering patterns of the molecules in this high dimension and need a lower dimensional representation. For the purpose, we trained a 100×100 torroidal grid SOM76,77 using the “lead-like” GDB9 (~3000 molecules). During the training, at first, each grid point is randomly assigned a chemical space coordinate, and later the grid points learn their actual coordinates by finding a descriptor vector with in the training set that is closest to it.76,77 The two-dimensional map that results contains structurally similar molecules clumped together in or near a grid point, while the dissimilar molecule are assigned to distant grid points.76,77 Figure 27 shows the “lead-like” GDB9 (~3000 compounds with vina score ≤ -6.0) on the trained

SOM. Since each grid point could contain more than one molecule, the color bar indicates the average Vina docking score of molecules per grid point. This map of lead- like compounds drawn from GDB9 shows the region (molecules with Vina score ≤ -6.0) that we wish to sample from and explore using PO-ACSESS method. We compared the fitness (average docking score) and average nearest-neighbor distance (diversity) of libraries generated using our approach compared to simple genetic algorithms.

75

Figure 27: SOM constructed from “lead-like” members of GDB9 (Vina score ≤ - 6.0). The color bar represents the average Vina score for each grid point. The x and y axes represent positions in the two dimensional grid. The white regions indicate areas where no compounds were mapped. All non-white regions correspond to region of the GDB9 space that contained lead-like molecules for COX2 binding.

4.3.3 Performance of various library design strategies within GDB9

We compared the fitness and diversity of PO-ACSESS libraries aimed at finding lead-like molecules with popular genetic algorithms. We insured that each method experiences the same number of fitness evaluations and is initialized with the same starting library (a maximally diverse subset of GDB9). We allowed each algorithm to run for 100 generations. Figure 28 (a) shows the fitness comparison and Figure 28 (b) shows the nearest-neighbor diversity comparison for PO-ACSESS and the simple genetic algorithms.

76 A

7

6.5

6

GA-Elitist 5.5 ACSESS GA-Roulette GA-Tourne -1*AutoDock Vina Score 5

0 20 40 60 80 100 Generations

B

25 GA-Roulee 20 GA-Tournament Nearest 15 Neighbor GA-Elitist Diversity 10 ACSESS 5 0 Enumeration

Figure 28: (a): Fitness comparison of PO-ACSESS and SGAs. The x-axis shows the number of generations run, and the y-axis shows the negative of the Vina score (here, higher scores indicates more promising leads). (b) Diversity comparison of PO- ACSESS vs. SGAs. The y-axis represents nearest neighbor diversity. We can notice that the diversity of lead-like molecules sampled using property-biased ACSESS is by far higher compared to other SGAs.

Comparing Figures 28(a) and 28(b) indicates that PO-ACSESS runs generate molecules with comparable fitness to those found with SGAs. The SGA with elitist

77

selection criteria outperforms other SGA approaches and performs marginally better than PO-ACSESS. But PO-ACSESS generated libraries of lead-like molecules having superior diversity compared to those found with competing approaches. In fact, the diversity of the PO-ACSESS library is comparable to the diversity of the enumerated lead-like GDB9. The diversity maintaining characteristic of our approach would appear to be a great asset for drug design and discovery, as a diversity of structural choices would increase the chances to discovering successful leads in the face of unfavorable pharmacodynamics or pharmacokinetics for members of the set.

We can visualize the high diversity of our collection compared to SGAs by projecting the generated libraries onto the trained SOM (Figure 4). Figure 29(a) shows the projection of the library generated using elitist SGA Figure, and 29(b) shows the projection of property-biased ACSESS generated library on the trained SOM.

78

A

B

Figure 29: (a) Projection of SGA-elitist generated library on the trained SOM. The color bar represents the number of lead-like molecules in GDB9 per grid point. (b) Projection of property-biased ACSESS on the trained SOM. The color bar represents the number of lead-like molecules of GDB9 per grid point. 79

Comparing Figures 29(a) and 29(b) indicates that the regions of lead-like GDB9 molecular space sampled using PO-ACSESS analysis is not sampled using GA-Elitist sampling. The larger diversity found in PO-ACSESS indicates that we have structurally diverse choices of lead-like molecules compared to those found in other SGA-based design approaches.

4.3.4 Performance PO-ACSESS within GDB17

GDB17 contains known binders of COX2. We chose naproxen (with 17 heavy atoms) as an example of the kind of molecule that we may hope to discover in our library creation. We re- docked naproxen in the active site of COX2 and computed its minimum-energy docking conformations (strongest binding pose). The computed strongest binding pose of naproxen had a

Vina score of -8.2. Next, we applied PO-ACSESS searching within the GDB17 to collect a diverse set of molecules with docking scores comparable to that found for naproxen. We used similar filters that were used to establish a drug-like subset of GDB17 (drug-like subset of

GDB17 contains ~106 molecules).19 We initialized our run with a benzene molecule and allowed

PO-ACSESS to generate a diverse library of molecules with docking scores found for naproxen or more favorable. The docking score optimization for diverse molecules within GDB17 with high docking scores is shown in Figure 30 (data averaged over from 5 different PO-ACSESS runs).

80

10

9

8

7 Mean Vina Score 6 Minimum Vina Score Maximum Vina Score 5 -1*Vina Score 4

3

2

1 0 10 20 30 40 50 60 Iteration

Figure 30: Optimization of diverse molecules with naproxen-like docking scores from GDB17. The plot shows average, minimum, and maximum scores in each iteration of PO-ACSESS analysis (from 5 runs). The error bars show the standard error for the statistics (mean, min, and max). The x-axis shows the iteration number for the algorithm and the y-axis indicates the negative Vina Score.

Figure 30 indicates that we can generate diverse naproxen like binding scores for molecules in GDB17 within 35 iterations of PO-ACSESS. For each property-biased

ACSESS run, we performed 6300 fitness evaluations to generate a library of 260 molecules. This number of fitness function evaluations is a tiny fraction compared to the size of GDB17 (166 billion molecules). Interestingly, we also screened a drug-like subset of GDB17 (containing ~1 million molecules) against the COX2 receptor by randomly selecting 6300 molecules and computed their Vina score for COX2 binding (Figure 31).

We found that using random subsets of drug-like GDB17 contain molecules with average Vina score (~ -8.1 Vina score) less than the average Vina score of the library 81

designed using PO-ACSESS (~ -8.6 Vina score). Interestingly, on average, the random subsets also had Vina docking scores poorer than that of the known binder naproxen. Of course, one could screen all lead-like molecules of GDB17 to find molecules with naproxen-like scores, but this endeavor would be very costly (~1 million compounds to screen) compared to applying the PO-ACSESS approach that screening only 6300 molecules to find candidates with naproxen-like scores.

-7.4 -7.6 -7.8 Random Selection -8 Vina Score -8.2 Property-Biased -8.4 ACSESS -8.6 -8.8

Figure 31: Comparison of sampling based on PO-ACSESS analysis for identifying lead-like molecules in GDB17 and randomly sampling of a known drug- like subset of GDB17 (consisting of ~1 million molecules). Both methods were used to screen 6300 molecules to find those with the 260 most favorable Vina scores.

4.3.5 Performance of PO-ACSESS within the SMU-21 containing up to 21 heavy atoms

The capacity of exhaustive enumeration approaches is limited to GDB17 (166 x

109 molecules), which contains only a fraction of a known drug like chemical space (~1060 possible molecules). We explored the ability of PO-ACSESS analysis to go beyond the scope of GDB17 to generate a diverse lead-like subset with docking scores comparable to or more favorable than the known drug naproxen. While enumerated databases larger

82

than GDB17 do not exist, we can still apply the ACSESS approach110 to generate a maximally diverse library of drug-like leads from the entire small molecular universe.

We are interested in comparing the ACSESS implementation to generate a lead-like library compared to the lead-like library generated using PO-ACSESS library. This comparison also helps us to understand the advantage of biasing based on a combination of fitness value and molecular diversity (PO-ACSESS) over biasing based on only molecular diversity (ACSESS). Here, PO-ACSESS was allowed to undergo 6300 fitness evaluations to search diverse molecules with naproxen-like scores from SMU-21.

As well, ACSESS was used to create a maximally diverse subset containing 6300 molecules from the SMU-21. To evaluate the fitness of the library generated from these two methods, we compared the fitness (average Vina docking score) based on the top

250 molecules (the library size generated from the PO-ACSESS approach, on average).

The fitness optimization within SMU-21 using PO-ACSESS is shown in Figure 32 (a) and the diversity optimization using ACSESS is shown in Figure 32(b).

83

A

10

9

8

7 Mean Vina Score 6 Minimum Vina Score Maximum Vina Score

-1*Vina Score 5

4

3

2 0 10 20 30 40 50 60 Iteration

B

7

6

5

4

3 Nearest Neighbor Diversity 2

1

0 0 100 200 300 400 500 600 700 800 900 1000 Iteration Figure 32: (a) Optimization of naproxen-like diverse molecules within SMU- 21. The plot shows average, minimum, and maximum score in each iteration of property-biased ACSESS (from 3 different runs). The error bar shows the standard error for each statistics (mean, min, and max). The x-axis shows the iteration of algorithm and y-axis shows the -1 x Vina Score. (b) Shown optimization of nearest neighbor diversity85 of lead-like compounds using original ACSESS110 within SMU- 21. The x-axis shows the iteration of original ACSESS and the y-axis shows the nearest neighbor diversity (data averaged over 5 different runs).

84

We ran the original ACSESS algorithm for 1000 iterations to generate a maximally-diverse subset of 6300 molecules (Figure 32b), and used the diverse set to screen against COX2. A library containing the strongest computed binding 250 molecules were chosen (from the screened 6300 molecules) to compare with a library generated using the PO-ACSESS method. We find that the average docking score for the

PO-ACSESS library is more favorable than the docking score of naproxen (Figure 32A).

In fact, the worst scoring compound generated by the PO-ACSESS approach has a Vina score of -8.2, which is comparable to naproxen (Figure 32A). Figure 33 compares the

ACSESS generated maximally diverse library to the PO-ACSESS library.

0 Property-Biased -2 ACSESS: Diverse Optimized Set -4 Vina Score -6 ACSESS-Maximally Diverse Set -8

-10

Figure 33: Comparison of the average docking scores of maximally diverse set generated using ACSESS with that of PO-ACSESS method. Simply considering diversity without fitness bias leads to poor lead-like compounds.

Figure 32a shows that the PO-ACSESS can be applied to design lead-like molecular libraries for chemical spaces beyond the scope of GDB databases. Figure 33 shows the mean docking score of a maximally diverse subset of SMU-21 (~ -6.6 Vina score), and the mean docking score of a diverse lead-like molecules generated using PO-

85

ACSESS (~ -8.6 Vina score). Figure 33 indicates that simply considering diversity without fitness biasing produces libraries with far unfavorable average property scores.

From the previous comparisons within GDB9 (Figs 28 and 29), we also found that simply considering fitness optimization without maintaining molecular diversity also leads to poor diversity of leads like compounds or get optimization methods stuck in lower scoring leads like molecules. The various test studies here, indicate that a proper balance of fitness and diversity during theoretical chemical space exploration is necessary to prioritize favorable leads-like molecules for experimental testing.

4.4 Conclusions

The vast amount of theoretically possible organic lead-like small molecules

(estimated to be 1060 molecules) makes the design of lead-like libraries for a particular protein target very challenging. We have shown that the PO-ACSESS can efficiently generate libraries of lead-like compounds in a small number of optimization cycles. We compared the performance of a property-biased approach to searching for lead-like molecules based on competing design approaches in astronomically large chemical spaces, including SMU-21. Our test cases indicate that the diversity- and property-biased multi-objective optimization scheme employed in the PO-ACSESS scheme enables the navigation of diverse region of vast chemical spaces to generate diverse lead-like molecule with favorable assessment scores in a very effective manner. Naturally, the utility of constructed libraries will depend on the accuracy of the property estimator.

86

While the Vina metric used here performs reasonably well for the COX2 inhibitor systems (Figure 25), it would be desirable to use more accurate methods and PO-

ACSESS approaches could in fact be coupled to more reliable binding free energies.

In the next chapter, we implement a better scoring approach compared to

AutoDock Vina and apply PO-ACSESS approach to design library of experimentally testable leads targeting CARM1 protein, which is an important biological target implicated in disease states such as cancer.

87

Chapter 5. Theory assisted explorations of the small molecule universe to discover new inhibitors of coactivator-associated arginine methyltransferase1 (CARM1).

5.1 Introduction

A case study of property-optimizing ACSESS (PO-ACSESS) presented in the preceding chapter targeting COX2 indicated that we could potentially target other crystallized protein structures to design libraries of experimentally testable lead like compounds. With this motivation, we have applied PO-ACSESS towards targeting

CARM1 protein. The details of the study will be described in this chapter.

CARM1 is one of the members of protein arginine N-methyltransferase (PRMT) that catalyzes the transfer of methyl groups from its cofactor S-adenosylmethionine

(SAM) to the side chain of specific arginine residues of the substrate proteins.121 The methylation of arginine is a common post-translational modification and also a well known epigenetic modification that can modulate the interactions between substrates of

PRMTs and their interaction partners. CARM1 interacts with different classes of substrate. For instance, it interacts with RNA binding and RNA splicing proteins to influence gene regulation that leads to cell growth and differentiation.122–124 CARM1 also interacts with chromatin remodeling protein such as histone H3 protein. It is known that nuclear hormone receptor complex or transcription factors such as p53, β-catenin or NF-

κB recruit CARM1 to methylate specific arginine residues in the N-terminal tails of

88

histone H3 protein.122,124,125 The direct consequence of these methylation events is the enhancement of gene transcription. However, the deregulation in the interaction between CARM1 and histone H3 protein can lead to disease states such as prostrate/breast and colorectal cancer.125 Since it is known that the cofactor SAM is required for the methyl transferase function of CARM1, one of the therapeutic strategies to regulate CARM1 and substrate interactions can be to use small organic molecules that mimic cofactor and competes with the cofactor binding site as shown in the Figure 34.

R3 X R1 X

R2

Cofactor/ Mimics Binding Site

Figure 34: The cofactor or cofactor mimics binding site of the CARM1 protein (PDB ID 2Y1W). A scaffold for a potential CARM1 inhibitor binding to the cofactor- binding pocket is shown. The R1, R2, and R3 indicate the positions in the scaffold where functional groups can be added to design potential inhibitors.

The scaffold shown in Figure 34 has 3 functional group attachment positions

(design sites). The scaffold contains only 9 heavy atoms, and assuming that we want to design a lead-like inhibitor library targeting CARM1 (with molecular weight ~500

89

Daltons), there is still an astronomically large chemical space to be searched for designing libraries of inhibitors. Since we require searching over a large chemical space to predict diverse functioning inhibitors, we can apply property-optimizing ACSESS method85,111 for these designs. In the method sections, I will present the extension of property-optimizing ACSESS for CARM1 inhibitor designs, the design goals, and the designed library.

5.2 Method

The scaffold shown in Figure 5.1 was provided by the Wang lab at Duke

University. The goal of ACSESS design was to decorate the R1,R2, and R3 positions in the scaffold with functional groups by navigating through the chemical space of ~1060 possible molecules. The extended version of the property-optimizing ACSESS is shown in the flowchart in Figure 35.

90

1) Docking to generate poses Virtually Screening 2) Rescoring each Molecules pose using MM- GBSA 3) Selecting the best pose scored using MM- GBSA !

Figure 35: The extended property-optimizing ACSESS framework for CARM1 inhibitors design. Compared the COX2 test study, this protocol uses multiple docking poses and rescores each docking pose using MMGBSA method to compute protein- ligand interaction scores.

The goal of ACSESS was to design library of inhibitors with comparable or better

MMGBSA126–128 score compared to the known and crystallized ligands. To set a target score for ACSESS, MMGBSA binding score of SAH was computed. The SAH structure was docked to the CARM1 binding site and each docking pose bound to the CARM1 was minimized in an implicit solvent. The minimized complex was then used to compute the MMGBSA score. SAH had MMGBSA score ~-70kcal/mol. Now, before applying docking based posed generation and MMGBSA based rescoring of each docking pose within ACSESS, benchmarking of MMGBSA against docking scoring function was done using data of known actives and inactives of CARM1 (data provided from the Wang lab at Duke University).

91

5.2.1 Docking Based Scoring

As discussed in chapter 4, docking software approximates the protein-ligand interaction score using a combination molecular mechanics terms (bond, angle, torsion) and non-covalent terms (electrostatics and van der Waals interactions).120 Some docking scoring functions also have a solvation term to account for desolvation of the protein, ligand, and the protein-ligand complex. The docking tool used in this work was

AutoDock Vina since it is one of the few freely available open source docking programs and is known for its efficiency in screening large libraries.120 Both protein and ligand were parameterized before docking calculation. The steps involve computing the partial charges of both ligand and protein, and assigning other forcefield parameters to the atoms. AutoDock Tools is an auxiliary program of AutoDock software that can be used for parameterization. For the benchmark study, each ligand (tested actives and inactives) were docked to CARM1 cofactor binding site.

5.2.2 MMGBSA Based Rescoring

MMGBSA method combines molecular mechanics terms to compute the bonded and non-bonded interactions between the protein and its ligand. 126–128 It computes the electrostatic contribution to the solvation energy via generalized Born (GB)129 model and non-polar contribution is computed using surface area approximation. The binding energy of the ligand to a protein is expressed as follows.126

92

ΔGbind = Gcomplex − (Grec +Glig ) (1)

ΔGbind = ΔH −TΔS ≈ ΔEMM + ΔGsolvation −TΔS (2)

ΔEMM = ΔEinternal + ΔEelectrostatics + ΔEvdw (3)

ΔG = ΔG + ΔG (4) solvation GB SurfaceArea

Here, ΔGbind refers to the binding free, Gcomplex, Grec, and Glig refer to free energy of protein-ligand complex, free energy of an unbound receptor, and free energy of an unbound ligand. Eq (1) can be re-written in terms of enthalpic (ΔH) and entropic (ΔS) contributions to the binding free energy as shown in eq (2). The enthalpic contribution to binding can be re-written in the molecular mechanics terms and free energy of solvation.

The molecular mechanics terms describes the energy contribution from bond, angle, torsional terms (ΔEinternal) and the non-covalent interactions terms (ΔEelectrostatics and ΔEvdw).

The polar contribution to the free energy of solvation can be approximated using generalized Born model (ΔGGB) and the non-polar contribution can be described using the surface area term (ΔGSurfaceArea). The entropic contribution to the binding free energy

(ΔS) can be computed using quasi-harmonic analysis or normal mode analysis.

For the MMGBSA calculation, ideally, the average of the energy terms is obtained from snapshots generated from molecular dynamics (MD) trajectory. However, such calculation for large sized libraries (such as those generated within ACSESS iterations) can be cost prohibitive. An alternative to running MD, is to generate an

93

ensemble of bound ligand conformations from docking approaches, minimize each pose by allowing the protein to relax along with the ligand, and rescore the relaxed conformations using MMGBSA model.126 This later approach was used for designing library in property-optimizing ACSESS. The MMGBSA is the fitness function within

ACSESS calculation. ACSESS iteratively minimizes the MMGBSA computed energies of diverse ligands to designs favorable binders of CARM1.

5.2.3 Validating MMGBSA Using Theory and Experiment

The sampling of bound protein-ligand conformations form simple docking and

MMGBSA rescoring represent a few local minima in the energy landscape and may not be representative of the average protein-ligand binding free energy. To test how closely

MMGBSA approximated the binding free energy, we ran ~10 ns of MD of several tested protein-ligand complexes. We also applied MD based sampling to the ligands designed from property-optimizing ACSESS. During the ligand parameterization step of MD, the partial charges of the ligands were obtained using RESP procedure.117,130 During the

RESP procedure, each ligand was geometry optimized in their ground state using

TDDFT method that used 6-31G* basis set using Gaussian09 package.70 The partial charge for the ligands was obtained by applying a fitting procedure (RESP) implemented in antechamber131 that best reproduces the electrostatic potential obtained from the TDDFT calculation. As a preparation step for MD, the protein-ligand complexes were solvated in a 4 × 4 × 4 Å3 water box. The complexes were energy

94

minimized for 500 cycles and then 10ns of MD was run to obtain the ensemble of protein-ligand complexes under NTP conditions. The ensemble of each protein-ligand complex was used to compute the averaged MMGBSA score from MMPBSA.py script provided in Amber14.

Some of the ligands, from the first round of PO-ACSESS design that used only docking as a scoring function, designed using property-optimizing ACSESS were also synthesized and currently undergoing bio-chemical assays. The second round design that includes the library of ligands predicted by using property-optimizing ACSESS

(MMGBSA used as scoring function) will be synthesized in near future and will be followed by biochemical assay.

5.3 Results and Discussion

5.3.1 Benchmarking Docking and MMGBSA

Before applying MMGBSA for the property-biased ACSESS calculations, a enrichment comparison of docking software (autodock vina) with the prescribed

MMGBSA based rescoring method was done using the 8 tested actives and 44 tested inactives of CARM1 (data provided by Wang lab). During the docking, the ligand was modeled as flexible but the protein was held rigid. The best binding mode (the docked conformation that is minimum in binding energy) was chosen as representative of the binding free energy. A plot showing the docking score distribution for the tested dataset

95

is shown in Figure 36 and a plot showing the MMGBSA score distribution of the same tested dataset is shown in Figure 37.

15 14 13 ACTIVES 12 11 INACTIVES 10 9 8 7 6 5 4 3 2 1 0 Docking Score −1 −2 −3 −4 −5 −6 −7 −8 −9 −10 0 10 20 30 40 50

Figure 36: The AutoDock vina scores (y-axis) for the tested actives (blue dots) and inactives (red dots) ligands of CARM1.

96

−20

−30

−40

−50

−60

GBSA Score −70

Actives −80 Inactives

−90

−100 0 5 10 15 20 25 30 35 40 45 50

Figure 37: The MMGBSA scores (y-axis) for the tested actives (blue dots) and inactives(red dots) ligands of CARM1.

Comparing Figures 36 and 37, we can see that MMGBSA based rescoring differentiates the actives better compared to AutoDocking Vina based scoring.

AutoDock Vina seems to assign similar scores for both actives and inactives (higher false positive rates) where as MMGBSA clearly differentiates 4 out of 9 known actives from the tested inactives. This benchmark study revealed that we could use MMGBSA based rescoring over docking for improving the accuracy of predicting protein-ligand binding free energy (ΔGbind). However, further refinement of ΔGbind prediction may be possible.

One of the potential directions can be to consider an empirical term for entropy that would penalize highly flexible ligand. One could also potentially use multiple protein conformations (obtained from MD simulations and described in Chapter 6) and average the MMGBSA scores from across the multiple receptor conformations. A similar

97

benchmark study as presented in this section can reveal the merits and demerits of the proposed approaches.

5.3.2 Libraries Design Using Property-Optimizing ACSESS (Docking Scoring Function)

We applied property-optimizing ACSESS (PO-ACSESS) to design a library of potential inhibitors targeting CARM1. As a first step we simply used docking as a scoring function to generate the libraries. The goal of PO-ACSESS in this step was to generate libraries of compounds comparable in docking score to SAH (-9.5 kcal/mol), a crystallized and known cofactor mimic of CARM1. The design was based on an imidazole scaffold provided by the Wang group (Figure 34). PO-ACSESS came up with the guesses of functional groups that could go in the R1, R2, and R3 positions such that the computational scores of the compounds were comparable to SAH. Figure 38 shows the progression of PO-ACSESS to optimize the docking score.

98

12 11 10 9 8 7 Average Best 6 Worse 5

− 1*AutoDock Score 4 3 2 1

0 0 10 20 30 40 50 Iterations

Figure 38: The docking scores optimization by PO-ACSESS. The x-axis indicates the iteration number of PO-ACSESS and y-axis indicates the negative of the AutoDock score (now, more positive the better). The black, blue, and red lines indicate the worst scoring compound, the average score of compounds in the library, and the best scored compound in the library respectively in each iteration.

Figure 38 shows that within 30 iterations of PO-ACSESS (by about ~4000 fitness function evaluations), we have generated a library of hundreds of testable compounds showing computational binding activity against CARM1. This feature demonstrates the efficiency of PO-ACSESS to search large chemical spaces and generate experimentally testable molecular libraries. One of the representative compounds showing hydrogen- bonding interaction with CARM1 is shown in Figure 39. With further improvement of the scoring function, this approach is likely to generate better compounds that can have therapeutic value.

99

Figure 39: The hydrogen-bonding network of PO-ACSESS designed compound (shown in red and labeled as compound 97 ) with CARM1. The compounds interacts with similar residues that SAH interacts with (Val 108, Glu 80, Gln 58) as well as shows interaction with residues not accessible to SAH (Ser 137 and Arg 6).

5.3.3 Library Design Using Improved Scoring Function

Figures 36 and 37 show that the MMGBSA based scoring function actually provided better enrichment of active compounds compared to the AutoDock Vina based scoring function. Also, an initial experimental testing of a set of 4 docking screened compounds showed that docking generates significant false positive results. After realizing the inaccuracy of docking and the need for improvement, the MMGBSA based rescoring of several docking poses was implemented within PO-ACSESS framework and applied to generate the library of inhibitors targeting CARM1. Figure 40 shows the progression of PO-ACSESS to optimize the MMGBSA score.

100

100

90

80

70

60 Average Best 50 Worst 40 − 1*MM GBSA Score

30

20

10 0 5 10 15 20 25 30 35 40 45 50 Iterations

Figure 40: The MMGBSA scores optimization by PO-ACSESS. The x-axis indicates the iteration number of PO-ACSESS and the y-axis indicates the negative of the MMGBSA score (now, more positive the better). The black, blue, and red lines indicate the worst scoring compound, the average score of compounds in the library, and the best scored compound in the library respectively in each iteration.

The libraries of compounds generated using MMGBSA as scoring function were expectedly more diverse compared to the docking based compounds. Since the protein is allowed to relax via energy minimization, the binding pocket can accommodate larger ligands. These compounds give the opportunity to think about novel synthesis steps that could be utilized towards similar designs. Some these compounds will be synthesized and tested for their activity in near future.

5.3.4 Preliminary Results on MD Based Validation of Designs

MD based sampling was done for a few CARM1-ligand complexes. The protein- ligand complexes were simulated for ~10 ns. Quasi harmonic analysis132,133 was

101

performed for approximating the entropic contribution to the binding (eq 2). We compared binding free energy (ΔGbinding) from the minimization based MMGBSA (as applied in PO-ACSESS) and the MD based GBSA. We found some of the PO-ACSESS predicted ligands (Figure 41) were indeed binding favorably to CARM1 and had a binding score comparable to that of SAH. We compared contributions of each type of non-covalent interactions between PO-ACSESS designed compounds (referred as

Compound 0 and Compound1) and SAH. We found compound 1 have higher computed binding free energy compared to compound 0. Compound 1 differs in structure compared to compound 0 in three atom positions. The result of this structural difference was most noticeable in the electrostatic interactions (table 3). We found that compared to the compound 0 (com0), compound 1 (com1) had ~20 kcal/mol favorable electrostatic interaction (2nd column in table 3). To further analyze any structural hints towards favorable electrostatic interaction, we compared a hydrogen-bonding (h-bond) network each compound makes with the CARM1 protein. We found a larger number h-bonds between compound 1 and the protein compared to compound 0. Also, as shown in table

4, we found that across the 10ns trajectory, the occupancy of various hydrogen bonds

(hydrogen bonds preserved in various snapshots in the trajectory) for compound 1 was higher compared to compound 0. These analyses indicated that compound 1 has higher number of frequently occurring hydrogen bond network compared to compound 0.

102

Table 3: Breakdown of energetic contribution (in kcal/mole) towards protein-ligand binding

Compounds VDW EEL ESolvation EEnthalpy EEntropy ΔGbinding

SAH -55.72±3.51 -94.48±6.52 78.73±4.31 -71.47±4.17 -30.27 -41.2±4.17

Com0 -68.43±6.74 -97.60±20.9 107.25±13.28 -58.78± 9.99 -42.65 -16.13± 9.99 Com1 -74.04±4.24 -117.03±12.45 113.39±8.99 -77.68±6.42 -35.11 -42.56±6.42

Table 3 shows breakdown of various interactions between protein and ligand.

VDW indicates van der Waals interaction, EEL indicates electrostatics interaction,

ESolvation indicates desolvation penalty computed using generalized Born model,

EEnthalpy indicates enthalpic contribution towards binding (VDW+EEL-ESolvation),

EEntopy indicates entropic penalty using quasi-harmonic approximation, and ΔGbinding indicates binding free energy (EEnthalpy-EEntropy) for each compound.

A B

Figure 41: The left Figure (A) shows the hydrogen bonding interaction (red dots) between compound 0 and protein. The right Figure (B) shows the hydrogen bonding interaction (red dots) between compound 1 and protein. We can notice less number of hydrogen bonds between compound 0 and protein compared to compound 1 and protein.

103

Table 4: Comparing hydrogen bond occupancy across 10 ns trajectory between compound 0 and compound 1 Donor:Acceptor Compound0 Compound1 Occupancy Occupancy UNL345-Side-O2 : GLU109-Side-OE1 1.35% 20.97% ARG34-Side-NH1 : UNL345-Side-O3 5.89% 27.70% UNL345-Side-N1 : GLU80-Side-OE1 34.50% 21.26% GLN25-Side-NE2 : UNL345-Side-O4 2.11% 22.98% UNL345-Side-O2 : GLU109-Side-OE2 1.11% 57.57% UNL345-Side-N1 : GLU80-Side-OE2 24.63% 0.29%

Table 4 shows h-bond occupancy between compound 0 and compound 1 with

CARM1 protein. H-bond occupancy refers to percentage of snapshots in the trajectory where a particular h-bond was preserved between a ligand and the protein. From the table 4, we can notice that almost all of the h-bond interactions between compound 1 and protein is well preserved compared to compound 0.

The preliminary results are based on sampling the CARM1-ligand complex for

~10 ns. While this result is promising, this short time scale simulation may not be adequate enough to capture the range of conformational space sampled by the protein- ligand complex. A long time scale (~µs) would be appropriate for the large systems

(~25000 atoms) and better estimate of the energies and entropies can be obtained, which is our future aim. Also, the simulation only included the top scored ligand conformation obtained from docking. We plan on using multiple docking conformations for the

ΔGbinding estimation within the framework of replica exchange molecular dynamics.

104

A B

OH O S H3C N N O OH HN HN CH3 OHCH3 N N O H C 3 O CH 3H3C CH3H3C HO

H2N H2N

Figure 42: (A) and (B) show the structures of two representative designed ligands from PO-ACSESS that uses MMGBSA for rescoring. Compound (A) had MMGBSA score of -95.33 kcal/com and (B) had MMGBSA score of -93.61 kcal/mol.

5.4 Conclusions and Future Directions

From this study, we used PO-ACSESS method to generate experimentally testable inhibitors targeting CARM1. We were encouraged that some of the top scoring hits (from docking based screening within PO-ACSESS) could be synthesized indicating that indeed we can come up with synthetically feasible designs. The accuracy of scoring function is necessary for the predicted designs to also show experimental activity. For this aspect, we have already implemented MMGBSA based rescoring as opposed to simply relying on docking based crude scoring function. From a short time scale MD simulations, we noticed the some designed ligand showed favorable binding (average

MMGBSA binding energy from the MD generated ensemble) to CARM1. We will perform a longer time scale MD simulation (~µs) and apply replica-exchange MD methods to further validate our short time scale preliminary MD simulation. In addition, as future refinements of our screening model that models flexibility of the protein, we can also use ensemble averaged MMGBSA score as a fitness function in PO-ACSESS scheme by using pre-computed multiple receptor (CARM1) conformations obtained 105

from MD. The docking will be done on the MD generated snapshots. This approach has shown promise in targeting RNA via small molecules and will be discussed in the next chapter.134,135 These refinements may improve the synthetic feasibility and experimental activity of the computationally predicted structures.

PO-ACSESS applications towards COX2 and CARM1 proteins were exciting undertakings. The success in generating experimentally testable leads intrigued us to further expand the application of this leads generating approach. Prof. Hargrove’s lab at

Duke University is interested in expanding their lead-like libraries targeting therapeutically important RNA structures. In the next chapter, a study is reported in collaboration with Prof. Hargrove’s lab to design lead like libraries targeting RNA structures.

106

Chapter 6. Property-Optimizing ACSESS extension and application for designing small organic molecules targeting RNA Structures.

6.1 Introduction

The computational prediction of leads targeting the COX2 and CARM1 proteins motivated us to use our property-optimizing ACSESS (PO-ACSESS) approach for RNA targets. Targeting RNA structures requires a proven docking approach that has shown success in screening compounds against RNA. In this chapter, further extension of PO-

ACSESS towards targeting RNA structures and its application towards designing experimentally testable leads against TAR RNA will be presented.

Ribonucleic acid (RNA) is a hetero polymer of nucleotides that is required for protein synthesis in cells. RNAs govern a variety of biological functions such as gene expression, catalyzing biological reactions, and sensing and communicating responses to cellular signals.136–143 Similar to DNA and proteins, RNA can also assumes a three dimensional structure with a unique fold.144–148 Patterns of stem, loop, and bulges (Figure

42) are observed in these structures and these patterns are used as a scaffold by the substrates of RNA to interact with it.146,149,150

107 Downloaded from rnajournal.cshlp.org on August 31, 2015 - Published by Cold Spring Harbor Laboratory Press

Evolution of translation

A second example comes from the binding of an HIV Rev peptide to RRE, Loop$ another hairpin-like RNA element from the env gene of HIV. Binding of the 17- mer peptide induces conformational changes in the RRE RNA that include Bulge$ formation of two noncanonical purine– purine base pairs and stabilization of Stem$ two Watson–Crick pairs that are not ob- served in the free RNA (Battiste et al. 1996). This complex is of additional in-

terest, in that the ␣-helical structure of Figure 43: A cartoon representationFIGURE 2. The of TAR Tat– RNATAR with interaction: its loop, Anstem, example and bulge of ligand-induced structural rearrangement the REV peptide is formed only upon regions. of RNA. (A) Secondary structure and NMR solution structures of (B) free TAR RNA and (C) a complex of argininamide bound to TAR RNA (Puglisi et al. 1992), rendered as space-filling binding to the RNA (Tan and Frankel Proteins are known tovan interact der Waalswith RNA surfaces. in these The structural UCU patterns bulge loop (loop, is shown in dark blue, and the bound arginin- 1994). Thus, the structures of both the amide ligand in orange. The structure of TAR RNA in the argininamide complex is the same RNA and the peptide are mutually in- bulge, and stem) and to influenceas in a the variety TAR of–peptide cellular complexfunctions. (PuglisiOne of such et al. interaction 1992), but the Kd for the latter is five to six orders of magnitude lower (Tao and Frankel 1992). duced by complex formation. In a fur- is between trans-activator of transcription protein (TAT) and trans-activation response ther example, the BIV Tat–TAR interac- tion, the Tat peptide was observed to element (TAR) RNA.149,150 Thequence HIV TAR of is known nine for amino transactivation acids (Calnan of the virus et al. 1991). Model change from a completely unfolded state to a ␤-hairpin promoter and for virus replication.peptides It is known from that this the region TAT protein of Tat binds are to bulge able to recognize and upon complex formation, while the TAR RNA rearranged bind specifically to the TAR RNA with nanomolar Kds. to form a base triple (Puglisi et al. 1995). The HIV Rev–RRE region of the TAR and also mRemarkably,akes some non-specific even ainteraction single argininamide with the loop region was of shown to bind and BIV Tat–TAR findings raise the intriguing possibility that the evolution of protein folding could have been boot- TAR.149,150 Another protein, Cyclinspecifically T1, binds toto the TAR loop (Tao region and of TAR. Frankel TAR provides 1992), a although with greatly decreased affinity. The NMR structures of the free strapped by protein–RNA interactions. For example, fixing scaffold for these proteins (TATand and argininamide-complexed Cyclin T1) to interact with each other. TAR The RNA binding (Puglisi et al. of marginally stable ␤-sheet structures by interaction with 1992) revealed an extensive ligand-induced structural rear- RNA could have preceded the evolution of more stable between TAT-TAR and Cyclin T1 is required for HIV viral replication along with rangement (Fig. 2), which includes unstacking of the bases domains. controlling other cellular processes.in the150 bulge Studies loop,have also coaxial shown stackingthat the binding of the of TA twoT helical stems, A fourth example is provided by the structure of a pep- and formation of an A-U-U base triple. NMR studies indi- tide derived from the HTLV-1 Rex protein bound to an in and Cyclin T1 to TAR is highlycate cooperative. that the Cyclin structure T1 does of not the bind TAR to TAR RNA RNA in in the the peptide–RNA vitro selected RNA aptamer (Jiang et al. 1999a). Binding of absence of TAT protein, and complexTAT binding is to essentially TAR although identical detectable toin experiments that observed is for the ar- the 15-mer peptide to the RNA results not only in forma- gininamide complex, although the binding affinity of the tion of three base triples but also stabilizes the orientations peptide is five 108 to six orders of magnitude greater than that of three double-helical stems. Thus, in addition to promot- of the single amino acid derivative (Tao and Frankel 1992). ing formation of unusual tertiary structural features or non- The Tat–TAR interaction provides one reason for the canonical base pairs, a peptide can even influence or stabi- evolution of polypeptide polymerization. Although the lize the relative geometry of separate RNA structural ele- simple monomer argininamide clearly makes all of the mo- ments, and so direct the overall three-dimensional shape of lecular interactions with TAR RNA that are needed to sta- an RNA. bilize its folded structure, it does so only at very high con- centrations that are unlikely to be found in an RNA world setting. By polymerization of arginine with additional INFLUENCE OF PROTEINS ON RNA FUNCTION amino acids, which apparently contribute mainly nonspe- cific electrostatic interactions, the same rearrangement is The idea that simple peptides could expand the functional achieved at nanomolar concentrations of peptide. Thus, one capabilities of RNA is supported by the fact that most of the rationale for the emergence of a peptidyl transferase activity functional RNAs found in present-day cells depend on pro- is to amplify the affinities of small-molecule effectors. A teins for their activity. Ribosomes themselves, although fun- second, related advantage would be to increase the speci- damentally ribozymes in nature, still require proteins to ficity of the interaction, by building in additional amino fold their RNAs into biologically active conformations and acids whose presence in the folded complex would be com- to optimize the speed and accuracy of their functions (Gar- patible only with a subset of possible structures, for steric or rett et al. 2000; CSHSQB 2001). Like the peptides discussed other reasons. One can then imagine a third advantage, as above, ribosomal proteins have been observed to stabilize in a hypothetical case where the crucial nucleating interac- unusual tertiary structural elements and noncanonical base tions themselves (e.g., those provided by the argininamide pairs (Stern et al. 1989), and are often found at multihelix example) require a dipeptide, tripeptide, and so on. junctions, where they have in some cases been shown to fix

www.rnajournal.org 1835

very inefficient in the absence of Cyclin T1 protein.150 Since both proteins are needed to interact with TAR for HIV replication, targeting the bulge, loop and stem regions of TAR using small molecules can potentially stop the replication. Since both proteins use loop, bulge, or parts of stem to bind to the TAR RNA, if a small molecule effectively binds to

TAR RNA in the same regions, they can potentially block the interaction of proteins to

TAR.151 In addition, in the recent NMR studies done by our collaborators, it is seen that the small molecules can also potentially stabilize certain non-TAT binding regions on

TAR by bringing conformational changes in TAR. The induced conformational changes can also inhibit the binding of TAT to the TAR RNA. The repertoire of such experimentally testable small molecules is less,151–153 so we hope that PO-ACSESS can design testable leads targeting TAR and expand the numbers of known leads.

6.2 Method

For the PO-ACSESS based design of small molecules targeting TAR, we chose a dimethyl amiloride (DMA) scaffold (Figure 43). The scaffold has a basic character, which favors the electrostatic interaction with the negatively charged backbone of the

TAR RNA.151 The scaffold has two designs positions indicated by R1 and R2 in Figure 43.

PO-ACSESS was given the task to diversify the R1 and R2 functional group positions and predict structures that show computational binding activity (-40 kcal/mol) similar or better to known binders of TAR RNA.

109

H H O N

R2 N H N N

R1 H H N N N

H H

Figure 44: Structure of the dimethyl amiloride (DMA) scaffold with two functional group diversification positions (R1 and R2).

For the purpose of diversifying the R1 and R2 positions, we extended and applied PO-ACSESS code. The code was extended to utilize RDOCK program154 for computing binding scores of a ligand to the TAR RNA. The steps involved in the PO-

ACSESS calculations were as follows: 1) initializing the library with the DMA scaffold, 2) diversifying the scaffold with structural mutations in the R1 and R2 positions, 3) applying synthesizability and stability filters and screening the proposed structures using RDOCK, 4) and selecting a maximally diverse subset of the property qualified structures. For these designs, only C, N, O, S, and Cl atoms types were allowed along with drug-like filters. This filter incorporates both GDB19 and SMU110 filters and also has additional constraints to improve the stability and synthesizability on molecules that were built using the suggestions from our experimental collaborators. We iterate over the 4 steps until we reach convergence in terms of property value we desire (in this case

-40kcal/mol).

110

6.2.1 RDOCK based docking of small molecules to RNA

Like many other docking programs, RDOCK computes the binding score of a ligand to a protein or nucleotide.154 RDOCK is widely used for docking a ligand to RNA and has been benchmarked for screening ligands against RNA targets.155 There are two main components in RDOCK based ligand docking: 1) the conformational search algorithm, and 2) the scoring function.

The conformational search algorithm in RDOCK utilizes a suite of stochastic search and deterministic search approaches to predict a binding mode of a ligand to its protein or nucleotide. The conformational search process begins with two cycles of genetic algorithms until convergence in best-scored binding mode is reached (with in a

0.1 RDOCK score threshold). The genetic algorithms search over the translational, rotational, and dihedral degrees of freedom to generate initial guess of the binding mode. The second step in the conformational search process applies a simulated annealing algorithm. During this step, with the hope of further optimizing the score, the search space is limited to a small translational motion (within 0.1Å), rotational motion

(within 10°), and dihedral angle (within 10°). Finally, a simplex minimization is done to further refine the simulated annealing generated docked poses.154

Each of the conformational search algorithms seeks to minimize the RDOCK scoring function (Stotal). As shown in equation 6.1, the scoring function (Stotal) is a summation of the protein/nucleotide and ligand intermolecular interaction (Sinter), ligand

111

intramolecular term (Sintra), and external non-physical restraint term (Srestraint) that can be used to bias the conformational search in various useful ways (such as adding NMR experiments based restraint or pharmacophore based restraint).154

S total = S inter + S intra + S site + S restraint (6.1)

Each term in eq. 6.1, except for Srestraint, is a weighted combination of covalent (rotational and dihedral terms) and non-covalent (van der Waals, electrostatics, and solvation) interactions.154

6.2.2 Steps to Diversify the DMA Scaffold

In contrast to our other scaffold-based designs (Chapter 5), the DMA scaffold was diversified one functional group position at a time. This approach is useful to determine the contribution each added functional group to binding to the TAR RNA. In the first round of design, the R1 position was allowed to change, but a bulky chlorine atom was chosen as a placeholder for the R2 position. Four best-scored molecules from the first round were used to seed the design runs for the second round. For the second round, the chlorine atom in the R2 position was deleted and diversified with new functional groups selected using PO-ACSESS procedure. The R2 chlorine modification was done with a hope to further increase the computational binding scores of the designed library. In both rounds of design, only the compounds that had comparable or better RDOCK score to some known binders of TAR were selected.

112

The design site of interest for us was the bulge, loop, and the stem region of the

TAR RNA. Any compound that showed activity in any regions of the TAR RNA was considered as a potential lead. The TAR RNA is known to be very flexible. Hence docking to a single static structure can potentially lead to false positive or false negative compounds. To take the flexibility of RNA structure into account, the docking of a predicted ligand was done to a molecular dynamics (MD) generated ensemble of TAR

RNA (20 different conformations of TAR).

6.2.3 Conformational Selection using MD and experimental RDC

A constant temperature (using Nose-Hoover thermostat), explicit solvent

(TIP3P), and 80 ns MD generated ~80,000 snapshots of TAR RNA.134 The 20 membered docking ensembles were generated using a select-and-sample strategy (SAS).134,135,156

Twenty snapshots were picked at random from this ensemble and their theoretical residual dipolar coupling (RDC) computed. The computed RDCs were compared to the experimental RDCs if they were within a small RMSD cutoff (~1.7 Hz). If the snapshot were within the small cutoff, they were kept as a representative structure, else the snapshots were discarded and another snapshot were tested from the pool of all snapshots. 134,135,156 This iterative selection process generated 20 membered ensembles of

TAR RNA structures that were used for screening. The best-scored molecule from screening to the whole ensemble was considered as a viable lead. The molecular

113

dynamics and RDC based 20 member ensemble was provided to us by Prof. Hargrove’s lab.

6.2.4 Experimental Testing of the Predicted Structures

The four best scored molecules from the second design runs were selected for experimental testing. Experimental testing steps involved the synthesis and biochemical assay of the predicted structures. The starting material for the synthesis of the compounds was C5-dimethyl amilioride ester. The starting compound underwent the

Sonogashira coupling reaction,157 which yielded C6-propargyl amine containing ester analog. This intermediate underwent guanidinylation or acetylation reaction to form the final product (similar to the compound shown in Figure 46C). The activity of the compounds against TAR RNA will be tested using Isothermal calorimetry assay.158–160

6.3 Results and Discussions

For the first round of our scaffold based design, chlorine was placed at the R2 position. The choice of chlorine was based on experience from previous experimental designs by our synthetic collaborator. The first round of design was seeded with three previously known compounds that had average binding score (enthalpic contribution) ~

-41 kcal/mol. The compounds were diversified in their R1 position to generate libraries of ~100 of compounds using PO-ACSESS. The average binding score of the library after the first round of design was ~45 kcal/mol (Figure 44). The best-scored compound from the first round had binding score ~ -52 kcal/mol (Figure 44).

114

52

50 Average Best 48 Worst 46

44

42

40

38 − 1*RDOCK Score (kcal/mol)

36

34 0 10 20 30 40 50 Iteration

Figure 45: The average (blue), worst (green), and best (red) scores of the library in each iteration from the first round of design using PO-ACSESS. The x-axis shows the iteration of PO-ACSESS and the y-axis shows the negative of the binding energy between ligand-TAR RNA (RDOCK score).

Four compounds, shown in Figure 43, were chosen from the first round and diversified in their R2 position.

115

A B

O O NH2 NH2 N Cl N Cl NH2 N NH2 N

N NH N NH H2N

C D

O NH2 O NH2 N Cl NH N Cl 2 N NH2 N

N NH H2N N NH H2N

HN

Figure 46: The structures of four top scoring compounds from the first round. The average RDOCK score for these compounds was ~ -43 kcal/mol. The R2 position is occupied by the chlorine (Cl) atom in these designs.

In the second round of design, the chlorine atom in the R2 position was replaced by diverse functional groups sampled using PO-ACSESS. The second round of designs yielded four different sets of libraries where each library contained 100s of molecules.

Some of the top scoring compounds from each of the four libraries is shown in Figure 46.

116

A B

NH2 O O NH2

O NH N C NH2 NH2 N N C NH2 N

N NH H2N N NH H2N

D C

NH2

O O NH NH2

O C NH N C 2 N H NH2 N N C NH2 N

N NH H N N NH 2 H2N

HN

Figure 47: Structures of the 4 top scores compounds from each libraries from the second round. The average score of the compounds were ~ -53 kcal/mol.

Excitingly, as we had hoped, the second round of design produced compounds with higher docking scores compared to the first round of designs. The average RDOCK score of the second round compounds was ~ -48 kcal/mol (Figure 47). For comparison, the average score of the first round (Figure 45), was ~ -45 kcal/mol. The worst scoring compound in the second round had score of ~ -45kcal/mol, and the best-scored compound had score of ~58 kcal/mol (Figure 47).

117

60

Average 55 Best Worst

50

45

− 1*RDOCK Score (kcal/mol) 40

35 0 10 20 30 40 50 Iteration

Figure 48: The average (blue), worst (green), and best (red) scores of the library in each iteration from the second round of designs using PO-ACSESS. The x-axis shows the iteration of PO-ACSESS and the y-axis shows the negative of the binding energy between ligand-TAR RNA (RDOCK score).

These results indicate that substitution to the R2 position is likely to increase the binding activity of the compounds and this can be tested using experimental synthesis and biochemical assays. For the experimental verification, four compounds have been synthesized. The success of synthesis indicated that the designs are indeed synthetically feasible. The quality of the RDOCK based screening will be further tested using the isothermal calorimetry assay.

118

6.4 Conclusions and Future Directions

In this study, PO-ACSESS was applied towards a new direction of optimizing the diversity and docking scores of molecules targeting TAR RNA. Previous studies

(Chapters 2-5) had shown applicability of this type of multi-objective optimization towards targeting electronic properties or simple protein-ligand docking score. This study reveals that the repertoire of lead-like and experimentally testable molecules targeting TAR RNA can be increased using PO-ACSESS approach. In particular, it is exciting for the experimental collaborators to recognize that not only the ‘R1’ position in

DMA can improve the computed binding to the TAR RNA but also the ‘R2’ position can potentially be exploited. The four novel compounds that PO-ACSESS generated could be synthesized. This finding further supports the utility of PO-ACSESS to generate experimentally testable lead-like molecular libraries that can potentially have therapeutic values.

As a follow up to this study, various directions can be taken. As a check to the

RDOCK computed scores, binding free energy using explicit solvent and long time scale

MD can be done. Our collaborators are already planning to pursue isothermal calorimerty assay of some of the top scored designs (Figure 46). In addition, the success of the DMA based designs has prompted our collaborators to try PO-ACSESS based designs on oxazolidinone scaffold with a hope of further increasing the diversity of small molecules targeting TAR RNA.

119

References

(1) Bohacek, R. S.; McMartin, C.; Guida, W. C. The Art and Practice of Structure- Based Drug Design: A Molecular Modeling Perspective. Med. Res. Rev. 1996, 16, 3– 50.

(2) Lipkus, A. H.; Yuan, Q.; Lucas, K. A.; Funk, S. A.; Bartelt, W. F.; Schenck, R. J.; Trippe, A. J. Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry. J. Org. Chem. 2008, 73, 4443–4451.

(3) Reineke, S. Organic Light-Emitting Diodes: Phosphorescence Meets Its Match. Nat. Photonics 2014, 8, 269–270.

(4) Uoyama, H.; Goushi, K.; Shizu, K.; Nomura, H.; Adachi, C. Highly Efficient Organic Light-Emitting Diodes from Delayed Fluorescence. Nature 2012, 492, 234– 238.

(5) Dandapani, S.; Marcaurelle, L. a. Grand Challenge Commentary: Accessing New Chemical Space for “Undruggable” Targets. Nat. Chem. Biol. 2010, 6, 861–863.

(6) Hopkins, A. L.; Groom, C. R. The Druggable Genome. Nat. Rev. Drug Discov. 2002, 1, 727–730.

(7) Pammolli, F.; Magazzini, L.; Riccaboni, M. The Productivity Crisis in Pharmaceutical R&D. Nat. Rev. Drug Discov. 2011, 10, 428–438.

(8) Scott, D. E.; Coyne, A. G.; Hudson, S. a; Abell, C. Fragment-Based Approaches in Drug Discovery and Chemical Biology. Biochemistry 2012, 51, 4990–5003.

(9) Carr, R. a E.; Congreve, M.; Murray, C. W.; Rees, D. C. Fragment-Based Lead Discovery: Leads by Design. Drug Discov. Today 2005, 10, 987–992.

(10) Harner, M. J.; Frank, A. O.; Fesik, S. W. Fragment-Based Drug Discovery Using NMR Spectroscopy. J. Biomol. NMR 2013, 56, 65–75.

120

(11) Winter, A.; Higueruelo, A. P.; Marsh, M.; Sigurdardottir, A.; Pitt, W. R.; Blundell, T. L. Biophysical and Computational Fragment-Based Approaches to Targeting Protein-Protein Interactions: Applications in Structure-Guided Drug Discovery. Q. Rev. Biophys. 2012, 45, 383–426.

(12) Murray, C. W.; Rees, D. C. The Rise of Fragment-Based Drug Discovery. Nat. Chem. 2009, 1, 187–192.

(13) Morton, D.; Leach, S.; Cordier, C.; Warriner, S.; Nelson, A. Synthesis of Natural- Product-like Molecules with over Eighty Distinct Scaffolds. Angew. Chemie - Int. Ed. 2009, 48, 104–109.

(14) Schreiber, S. L. NEWS & VIEWS Molecular Diversity by Design. 2009, 457, 153– 154.

(15) Kuruvilla, F. G.; Shamji, A. F.; Sternson, S. M.; Hergenrother, P. J.; Schreiber, S. L. Dissecting Glucose Signalling with Diversity-Oriented Synthesis and Small- Molecule Microarrays. Nature 2002, 416, 653–657.

(16) Wong, J. C.; Hong, R.; Schreiber, S. L. Structural Biasing Elements for in-Cell Histone Deacetylase Paralog Selectivity. J. Am. Chem. Soc. 2003, 125, 5586–5587.

(17) Sternson, S. M.; Wong, J. C.; Grozinger, C. M.; Schreiber, S. L. Synthesis of 7200 Small Molecules Based on a Substructural Analysis of the Histone Deacetylase Inhibitors Trichostatin and Trapoxin. Org. Lett. 2001, 3, 4239–4242.

(18) Blum, L. C.; van Deursen, R.; Reymond, J.-L. Visualisation and Subsets of the Chemical Universe Database GDB-13 for Virtual Screening. J. Comput. Aided. Mol. Des. 2011, 25, 637–647.

(19) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.

121

(20) Altman, M. D.; Ali, A.; Reddy, G. S. K. K.; Nalam, M. N. L.; Anjum, S. G.; Cao, H.; Chellappan, S.; Kairys, V.; Fernandes, M. X.; Gilson, M. K.; Schiffer, C. a; Rana, T. M.; Tidor, B. HIV-1 Protease Inhibitors from Inverse Design in the Substrate Envelope Exhibit Subnanomolar Binding to Drug-Resistant Variants. J. Am. Chem. Soc. 2008, 130, 6099–6113.

(21) Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-like Compounds. J. Am. Chem. Soc. 2013, 135, 7296–7303.

(22) Shu, Y.; Levine, B. G. Simulated Evolution of Fluorophores for Light Emitting Diodes. J. Chem. Phys. 2015, 142, 104104.

(23) Chellappan, S.; Kiran Kumar Reddy, G. S.; Ali, A.; Nalam, M. N. L.; Anjum, S. G.; Cao, H.; Kairys, V.; Fernandes, M. X.; Altman, M. D.; Tidor, B.; Rana, T. M.; Schiffer, C. a; Gilson, M. K. Design of Mutation-Resistant HIV Protease Inhibitors with the Substrate Envelope Hypothesis. Chem. Biol. Drug Des. 2007, 69, 298–313.

(24) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.

(25) Reymond, J.-L.; van Deursen, R.; Blum, L. C.; Ruddigkeit, L. Chemical Space as a Source for New Drugs. Medchemcomm 2010, 1, 30.

(26) Bajusz, D.; Rácz, A.; Héberger, K. Why Is Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations? J. Cheminform. 2015, 7, 20.

(27) Lasters, I.; De Maeyer, M.; Desmet, J. Enhanced Dead-End Elimination in the Search for the Global Minimum Energy Conformation of a Collection of Protein Side Chains. Protein Eng. 1995, 8, 815–822.

(28) Leach, a R.; Lemon, a P. Exploring the Conformational Space of Protein Side Chains Using Dead-End Elimination and the A* Algorithm. Proteins 1998, 33, 227– 239.

122

(29) Balamurugan, D.; Yang, W.; Beratan, D. N. Exploring Chemical Space with Discrete, Gradient, and Hybrid Optimization Methods. J. Chem. Phys. 2008, 129.

(30) Nalam, M. N. L.; Ali, A.; Altman, M. D.; Reddy, G. S. K. K.; Chellappan, S.; Kairys, V.; Ozen, A.; Cao, H.; Gilson, M. K.; Tidor, B.; Rana, T. M.; Schiffer, C. a. Evaluating the Substrate-Envelope Hypothesis: Structural Analysis of Novel HIV- 1 Protease Inhibitors Designed to Be Robust against Drug Resistance. J. Virol. 2010, 84, 5368–5378.

(31) Nalam, M. N. L.; Ali, A.; Reddy, G. S. K. K.; Cao, H.; Anjum, S. G.; Altman, M. D.; Yilmaz, N. K.; Tidor, B.; Rana, T. M.; Schiffer, C. a. Substrate Envelope-Designed Potent HIV-1 Protease Inhibitors to Avoid Drug Resistance. Chem. Biol. 2013, 20, 1116–1124.

(32) Aspects, M.; Maeyer, M. De; Desmet, J.; Lasters, I. The Dead-End Elimination Theorem : 143.

(33) Pierce, N. a; Spriet, J. a; Desmet, J.; Mayo, S. L. Conformational Splitting: A More Powerful Criterion for Dead-End Elimination. J. Comput. Chem. 2000, 21, 999–1009.

(34) Gasteiger, J. Handbook of Cheminformatics: Descriptors for Chemical Compounds; Gasteiger, J., Ed.; Wiley-VCH Verlag GmbH, 2008.

(35) Rose, J. R. Handbook of Cheminformatics: Machine Learning Techniques in Chemistry; Gasteiger, J., Ed.; Wiley-VCH Verlag GmbH, 2008.

(36) Young, D. C. Computational Drug Design: A Guide for Computational and Medicinal Chemists; 2009.

(37) Rapaport, D. C. C. The Art of Molecular Dynamics Simulation; 2004; Vol. 2.

(38) Friesner, R. a; Guallar, V. Ab Initio Quantum Chemical and Mixed Quantum Mechanics/molecular Mechanics (QM/MM) Methods for Studying Enzymatic Catalysis. Annu. Rev. Phys. Chem. 2005, 56, 389–427.

123

(39) Gleeson, M. P.; Gleeson, D. QM/MM Calculations in Drug Discovery: A Useful Method for Studying Binding Phenomena? J. Chem. Inf. Model. 2009, 49, 670–677.

(40) Abrams, C.; Bussi, G. Enhanced Sampling in Molecular Dynamics Using Metadynamics, Replica-Exchange, and Temperature-Acceleration. Entropy 2014, 16, 163–199.

(41) Roe, D. R.; Bergonzo, C.; Cheatham, T. E. Evaluation of Enhanced Sampling Provided by Accelerated Molecular Dynamics with Hamiltonian Replica Exchange Methods. J. Phys. Chem. B 2014, 118, 3543–3552.

(42) Valsson, O.; Parrinello, M. Variational Approach to Enhanced Sampling and Free Energy Calculations. Phys. Rev. Lett. 2014, 113, 090601.

(43) Moskowitz Sanford L. The Advanced Materials Revolution: Technology and Economic Growth in the Age of Globalization; John Wiley & Sons Inc, 2008.

(44) Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. a. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013, 1, 011002.

(45) Kalil Tom And Wadia, C. Materials Genome Initiative for Global Competitiveness. Genome 2011, 1–18.

(46) Taylor, R. H.; Rose, F.; Toher, C.; Levy, O.; Nardelli, M. B.; Curtarolo, S. A RESTful API for Exchanging Materials Data in the AFLOWLIB.org Consortium. 2014.

(47) Curtarolo, S.; Setyawan, W.; Wang, S.; Xue, J.; Yang, K.; Taylor, R. H.; Nelson, L. J.; Hart, G. L. W.; Sanvito, S.; Buongiorno-Nardelli, M.; Mingo, N.; Levy, O. AFLOWLIB.ORG: A Distributed Materials Properties Repository from High- Throughput Ab Initio Calculations. Comput. Mater. Sci. 2012, 58, 227–235.

(48) Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; Tropsha, A.; Curtarolo, S. Materials Cartography: Representing and Mining Materials Space 124

Using Structural and Electronic Fingerprints. Chem. Mater. 2015, 27, 735–743.

(49) Wang, S.; Wang, Z.; Setyawan, W.; Mingo, N.; Curtarolo, S. Assessing the Thermoelectric Properties of Sintered Compounds via High-Throughput Ab-Initio Calculations. Phys. Rev. X 2011, 1, 021012.

(50) Hautier, G.; Fischer, C.; Ehrlacher, V.; Jain, A.; Ceder, G. Data Mined Ionic Substitutions for the Discovery of New Compounds. Inorg. Chem. 2011, 50, 656– 663.

(51) Pyzer-Knapp, E. O.; Suh, C.; Gómez-Bombarelli, R.; Aguilera-Iparraguirre, J.; Aspuru-Guzik, A. What Is High-Throughput Virtual Screening? A Perspective from Organic Materials Discovery. Annu. Rev. Mater. Res. 2014, 45, 150306094755008.

(52) Hachmann, J.; Olivares-Amaya, R.; Jinich, A.; Appleton, A. L.; Blood-Forsythe, M. a.; Seress, L. R.; Román-Salgado, C.; Trepte, K.; Atahan-Evrenk, S.; Er, S.; Shrestha, S.; Mondal, R.; Sokolov, A.; Bao, Z.; Aspuru-Guzik, A. Lead Candidates for High- Performance Organic Photovoltaics from High-Throughput Quantum Chemistry – the Harvard Clean Energy Project. Energy Environ. Sci. 2014, 7, 698.

(53) Hachmann, J.; Olivares-Amaya, R.; Atahan-Evrenk, S.; Amador-Bedolla, C.; Sánchez-Carrera, R. S.; Gold-Parker, A.; Vogt, L.; Brockway, A. M.; Aspuru-Guzik, A. The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. J. Phys. Chem. Lett. 2011, 2, 2241–2251.

(54) Er, S.; Suh, C.; Marshak, M. P.; Aspuru-Guzik, A. Computational Design of Molecules for an All-Quinone Redox Flow Battery. Chem. Sci. 2015, 6, 885–893.

(55) Pyzer-Knapp, E. O.; Li, K.; Aspuru-Guzik, A. Learning from the Harvard Clean Energy Project: The Use of Neural Networks to Accelerate Materials Discovery. Adv. Funct. Mater. 2015.

(56) Noel, O.; Casey, C.; Geoffrey, H. Computational Design and Selection of Optimal Organic Photovoltaic Materials. J. Phys. Chem. C 2011, 115, 16200–16210.

125

(57) Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; Von Lilienfeld, O. A.; Tkatchenko, A.; Müller, K. R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404–3419.

(58) Dral, P. O.; von Lilienfeld, O. A.; Thiel, W. Machine Learning of Parameters for Accurate Semiempirical Quantum Chemical Calculations. J. Chem. Theory Comput. 2015, 11, 150402110823008.

(59) Moreau, J. L.; Broto, P. Autocorrelation of Molecular Structures: Application to SAR Studies. Nouv. J. Chim. 1980, 4, 764.

(60) Blum, L. C.; van Deursen, R.; Bertrand, S.; Mayer, M.; Bürgi, J. J.; Bertrand, D.; Reymond, J.-L. Discovery of α7-Nicotinic Receptor Ligands by Virtual Screening of the Chemical Universe Database GDB-13. J. Chem. Inf. Model. 2011, 51, 3105– 3112.

(61) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.

(62) Kauffman, S. a; Weinberger, E. D. The NK Model of Rugged Fitness Landscapes and Its Application to Maturation of the Immune Response. J. Theor. Biol. 1989, 141, 211–245.

(63) Stadler, P. F. Landscapes and Their Correlation Functions. J. Math. Chem. 1996, 20, 1–45.

(64) Bauknecht, H.; Zell, a; Bayer, H.; Levi, P.; Wagener, M.; Sadowski, J.; Gasteiger, J. Locating Biologically Active Compounds in Medium-Sized Heterogeneous Datasets by Topological Autocorrelation Vectors: Dopamine and Benzodiazepine Agonists. J. Chem. Inf. Comput. Sci. 1996, 36, 1205–1213.

(65) Schmuker, M.; Givehchi, A.; Schneider, G. Impact of Different Software Implementations on the Performance of the Maxmin Method for Diverse Subset 126

Selection. Mol. Divers. 2004, 8, 421–425.

(66) Xue, L.; Stahura, F. L.; Bajorath, J. Cell-Based Partitioning. Methods Mol. Biol. 2004, 275, 279–290.

(67) Software, O. scientific. OpenEye Scientific Software. http://www.eyesopen.com, 2014, -.

(68) Blum, L. C.; Reymond, J.-L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131, 8732–8733.

(69) Hawkins, P. C. D.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T. Conformer Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572–584.

(70) Frisch, M. J.; G. W. Trucks; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; X. Li, H. P. H.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, J. A.; Jr., J. E. P.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, J. M.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, Ö.; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Bro, and D. J. F. Gaussian 09, Revision A.02. Gaussian 09, Revision A.02, 2009.

(71) Computing, D. The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing; 2010; Vol. 44.

127

(72) Barnett, L. Ruggedness and Neutrality - The NKp Family of Fitness Landscapes. In Proceedings of the 6th International Conference on Artificial Life; Adami, C.; Belew, R.; Kitano, H.; Taylor, E., Eds.; Cambridge, MA, 1998; pp. 18–27.

(73) Geard, N.; Wiles, J.; Hallinan, J.; Tonkes, B.; Skellett, B. A Comparison of Neutral Landscapes - NK, NKp and NKq. Proc. 2002 Congr. Evol. Comput. CEC’02 (Cat. No.02TH8600) 1, 205–210.

(74) Eiben, A. E.; Smith, J. E. Introduction to Evolutionary Computing: With 28 Tables; G. Rozenberg; Th.Bäck; Eiben, A. E.; J.N.Kok; Spaink, H. P., Eds.; Springer Berlin Heidelberg: Heidelberg, Berlin, 2003.

(75) Mitchell, M. An Introduction to Genetic Algorithms (Complex Adaptive Systems); The MIT Press: Cambridge, 1998.

(76) Sadowski, J.; Wagener, M.; Gasteiger, J. Assessing Similarity and Diversity of Combinatorial Libraries by Spatial Autocorrelation Functions and Neural Networks. Angew. Chem. Int. Ed. Engl. 1995, 34, 2674–2677.

(77) Brown, N.; McKay, B.; Gasteiger, J. The de Novo Design of Median Molecules within a Property Range of Interest. J. Comput. Aided. Mol. Des. 2004, 18, 761–771.

(78) van Deursen, R.; Reymond, J.-L. Chemical Space Travel. ChemMedChem 2007, 2, 636–640.

(79) Thejokalyani, N.; Dhoble, S. J. Novel Approaches for Energy Efficient Solid State Lighting by RGB Organic Light Emitting Diodes – A Review. Renew. Sustain. Energy Rev. 2014, 32, 448–467.

(80) Thejo Kalyani, N.; Dhoble, S. J. Organic Light Emitting Diodes: Energy Saving Lighting technology—A Review. Renew. Sustain. Energy Rev. 2012, 16, 2696–2723.

(81) Friend, R.; Gymer, R.; Holmes, A.; Burroughes, J.; Marks, R.; Taliani, C.; Bradley, D.; Dos Santos, D.; Brédas, J.; Lögdlund, M.; Salaneck, R. Electroluminescence in

128

Conjugated Polymers. Nature 1999, 397, 121–128.

(82) Shen, J. Y.; Lee, C. Y.; Huang, T.-H.; Lin, J. T.; Tao, Y.-T.; Chien, C.-H.; Tsai, C. High Tg Blue Emitting Materials for Electroluminescent Devices. Journal of Materials Chemistry, 2005, 15, 2455.

(83) Rothberg, L. J.; Lovinger, A. J. Status of and Prospects for Organic Electroluminescence. Journal of Materials Research, 1996, 11, 3174–3187.

(84) Goushi, K.; Yoshida, K.; Sato, K.; Adachi, C. Organic Light-Emitting Diodes Employing Efficient Reverse Intersystem Crossing for Triplet-to-Singlet State Conversion. Nature Photonics, 2012, 6, 253–258.

(85) Rupakheti, C.; Virshup, A.; Yang, W.; Beratan, D. N. A Strategy to Discover Diverse Optimal Molecules in the Small Molecule Universe. J. Chem. Inf. Model. 2015, 1–31.

(86) Zhang, Q.; Li, B.; Huang, S.; Nomura, H.; Tanaka, H.; Adachi, C. Efficient Blue Organic Light-Emitting Diodes Employing Thermally Activated Delayed Fluorescence. Nat. Photonics 2014, 8, 326–332.

(87) Zhao, H.; Liu, G.; Zhang, J.; Arif, R. A.; Tansu, N. Analysis of Internal Quantum Ef Fi Ciency and Current Injection Ef Fi Ciency in III-Nitride Light-Emitting Diodes. 2013, 9, 212–225.

(88) Rupakheti, C.; Virshup, A.; Yang, W.; Beratan, D. N. Strategy To Discover Diverse Optimal Molecules in the Small Molecule Universe. J. Chem. Inf. Model. 2015, 150219080027003.

(89) Pople, J. A.; Santry, D. P.; Segal, G. A. Approximate Self-Consistent Molecular Orbital Theory. I. Invariant Procedures. J. Chem. Phys. 1965, 43, S129–S135.

(90) Han, C.; Zhao, F.; Zhang, Z.; Zhu, L.; Xu, H.; Li, J.; Ma, D.; Yan, P. Constructing Low-Triplet-Energy Hosts for Highly Efficient Blue PHOLEDs: Controlling Charge and Exciton Capture in Doping Systems. Chem. Mater. 2013, 25, 4966–4976.

129

(91) Cui, L.-S.; Dong, S.-C.; Liu, Y.; Xu, M.-F.; Li, Q.; Jiang, Z.-Q.; Liao, L.-S. Meta- Linked Spirobifluorene/phosphine Oxide Hybrids as Host Materials for Deep Blue Phosphorescent Organic Light-Emitting Diodes. Org. Electron. 2013, 14, 1924– 1930.

(92) Dewar, M. J. S.; Thiel, W. Ground States of Molecules. 38. The MNDO Method. Approximations and Parameters. J. Am. Chem. Soc. 1977, 99, 4899–4907.

(93) Ridley, J.; Zerner, M. An Intermediate Neglect of Differential Overlap Technique for Spectroscopy: Pyrrole and the Azines. Theor. Chim. Acta 1973, 32, 111–134.

(94) Parac, M.; Grimme, S. A TDDFT Study of the Lowest Excitation Energies of Polycyclic Aromatic Hydrocarbons. Chem. Phys. 2003, 292, 11–21.

(95) Schreiber, M.; Silva-Junior, M. R.; Sauer, S. P. A.; Thiel, W. Benchmarks for Electronically Excited States: CASPT2, CC2, CCSD, and CC3. J. Chem. Phys. 2008, 128.

(96) Runge, E.; Gross, E. K. U. Density-Functional Theory for Time-Dependent Systems. Phys. Rev. Lett. 1984, 52, 997–1000.

(97) Casida, M. E. Time-Dependent Density-Functional Theory for Molecules and Molecular Solids. Journal of Molecular Structure: THEOCHEM, 2009, 914, 3–18.

(98) Yanai, T.; Tew, D. P.; Handy, N. C. A New Hybrid Exchange-Correlation Functional Using the Coulomb-Attenuating Method (CAM-B3LYP). Chem. Phys. Lett. 2004, 393, 51–57.

(99) Hariharan, P. C.; Pople, J. A. The Influence of Polarization Functions on Molecular Orbital Hydrogenation Energies. Theor. Chim. Acta 1973, 28, 213–222.

(100) Burghardt, I.; Hughes, K. H.; Martinazzo, R.; Tamura, H.; Gindensperger, E.; Köppel, H.; Cederbaum, L. S. Conical Intersections Coupled to an Environment. In Conical Intersections; 2011; pp. 301–346.

130

(101) Lee, C. W.; Lee, J. Y. High Quantum Efficiency in Blue Phosphorescent Organic Light Emitting Diodes Using Ortho-Substituted High Triplet Energy Host Materials. Org. Electron. 2013, 14, 1602–1607.

(102) Wu, S.; Aonuma, M.; Zhang, Q.; Huang, S.; Nakagawa, T.; Kuwabara, K.; Adachi, C. High-Efficiency Deep-Blue Organic Light-Emitting Diodes Based on a Thermally Activated Delayed Fluorescence Emitter. J. Mater. Chem. C 2014, 2, 421– 424.

(103) Kim, B. S.; Lee, J. Y. Engineering of Mixed Host for High External Quantum Efficiency above 25% in Green Thermally Activated Delayed Fluorescence Device. Adv. Funct. Mater. 2014, 24, 3970–3977.

(104) Teiber, M.; Giebeler, S.; Lessing, T.; Muller, T. J. J. Efficient Pseudo-Five- Component Coupling-Fiesselmann Synthesis of Luminescent Oligothiophenes and Their Modification. Org. Biomol. Chem. 2013, 11, 3541–3552.

(105) Lin, N.; Qiao, J.; Duan, L.; Wang, L.; Qiu, Y. Molecular Understanding of the Chemical Stability of Organic Materials for OLEDs: A Comparative Study on Sulfonyl, Phosphine-Oxide, and Carbonyl-Containing Host Materials. J. Phys. Chem. C 2014, 118, 7569–7578.

(106) Galloway, W. R. J. D.; Isidro-Llobet, A.; Spring, D. R. Diversity-Oriented Synthesis as a Tool for the Discovery of Novel Biologically Active Small Molecules. Nat. Commun. 2010, 1, 80.

(107) Newman, D. J.; Cragg, G. M. Natural Products as Sources of New Drugs over the Last 25 Years. J. Nat. Prod. 2007, 70, 461–477.

(108) Tan, D. S. Diversity-Oriented Synthesis: Exploring the Intersections between Chemistry and Biology. Nat. Chem. Biol. 2005, 1, 74–84.

(109) Harvey, a. Strategies for Discovering Drugs from Previously Unexplored Natural Products. Drug Discov. Today 2000, 5, 294–300.

(110) Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic 131

Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-like Compounds. J. Am. Chem. Soc. 2013, 135, 7296–7303.

(111) Rupakheti, C.; Al-Saadon, R.; Zhang, P.; Virshup, A. M.; Yang, W.; Beratan, D. N. Diverse Optimal Molecular Libraries for Organic Light Emitting Diodes. 2015, 1– 27.

(112) Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801.

(113) Duggan, K. C.; Walters, M. J.; Musee, J.; Harp, J. M.; Kiefer, J. R.; Oates, J. A.; Marnett, L. J. Molecular Basis for Cyclooxygenase Inhibition by the Non-Steroidal Anti-Inflammatory Drug Naproxen. J. Biol. Chem. 2010, 285, 34950–34959.

(114) Gordon, J. C.; Myers, J. B.; Folta, T.; Shoja, V.; Heath, L. S.; Onufriev, A. H++: A Server for Estimating pKas and Adding Missing Hydrogens to Macromolecules. Nucleic Acids Res. 2005, 33.

(115) Halgren, T. A. Merck Molecular Force Field. I. Basis, Form, Scope, Parameterization, and Performance of MMFF94. J. Comput. Chem. 1996, 17, 490– 519.

(116) Jakalian, A.; Jack, D. B.; Bayly, C. I. Fast, Efficient Generation of High-Quality Atomic Charges. AM1-BCC Model: II. Parameterization and Validation. J. Comput. Chem. 2002, 23, 1623–1641.

(117) Cieplak, P.; Cornell, W. D.; Bayly, C.; Kollman, P. A. Application of the Multimolecule and Multiconformational RESP Methodology to Biopolymers: Charge Derivation for DNA, RNA, and Proteins. J. Comput. Chem. 1995, 16, 1357– 1377.

(118) Morris, G. M.; Ruth, H.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J. Software News and Updates AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 2009, 30, 2785–2791.

132

(119) Morris, G. M.; Goodsell, D. S.; Huey, R.; Olson, A. J. Distributed Automated Docking of Flexible Ligands to Proteins: Parallel Applications of AutoDock 2.4. J. Comput. Aided. Mol. Des. 1996, 10, 293–304.

(120) Trott, O.; Olson, A. J. Software News and Update AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31, 455–461.

(121) Sack, J. S.; Thieffine, S.; Bandiera, T.; Fasolini, M.; Duke, G. J.; Jayaraman, L.; Kish, K. F.; Klei, H. E.; Purandare, A. V; Rosettani, P.; Troiani, S.; Xie, D.; Bertrand, J. A. Structural Basis for CARM1 Inhibition by Indole and Pyrazole Inhibitors. 2011, 339, 331–339.

(122) Bauer, U. M.; Daujat, S.; Nielsen, S. J.; Nightingale, K.; Kouzarides, T. Methylation at Arginine 17 of Histone H3 Is Linked to Gene Activation. EMBO Rep. 2002, 3, 39–44.

(123) Cheng, D.; Côté, J.; Shaaban, S.; Bedford, M. T. The Arginine Methyltransferase CARM1 Regulates the Coupling of Transcription and mRNA Processing. Mol. Cell 2007, 25, 71–83.

(124) Yue, W. W.; Hassler, M.; Roe, S. M.; Thompson-vale, V.; Pearl, L. H. Insights into Histone Code Syntax from Structural and Biochemical Studies of CARM1. 2007, 26, 4402–4412.

(125) Kim, Y.-R.; Lee, B. K.; Park, R.-Y.; Nguyen, N. T. X.; Bae, J. A.; Kwon, D. D.; Jung, C. Differential CARM1 Expression in Prostate and Colorectal Cancers. BMC Cancer 2010, 10, 197.

(126) Rastelli, G.; Rio, A. D. E. L.; Degliesposti, G.; Sgobba, M.; Campi, V.; Farmaceutiche, S. Fast and Accurate Predictions of Binding Free Energies Using MM-PBSA and MM-GBSA. 2009.

(127) Thompson, D. C.; Humblet, C.; Joseph-McCarthy, D. Investigation of MM-PBSA Rescoring of Docking Poses. J. Chem. Inf. Model. 2008, 48, 1081–1091.

133

(128) Hou, T.; Wang, J.; Li, Y.; Wang, W. Assessing the Performance of the MM/PBSA and MM/GBSA Methods. 1. The Accuracy of Binding Free Energy Calculations Based on Molecular Dynamics Simulations. J. Chem. Inf. Model. 2011, 51, 69–82.

(129) Bashford, D.; Case, D. a. Generalized Born Models of Macromolecular Solvation Effects. Annu. Rev. Phys. Chem. 2000, 51, 129–152.

(130) Wang, J.; Cieplak, P.; Kollman, P. a. How Well Does a Restrained Electrostatic Potential (RESP) Model Perform in Calculating Conformational Energies of Organic and Biological Molecules? J. Comput. Chem. 2000, 21, 1049–1074.

(131) Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. The Amber Biomolecular Simulation Programs. J. Comput. Chem. 2005, 26, 1668–1688.

(132) Chang, C. E.; Chen, W.; Gilson, M. K. Evaluating the Accuracy of the Quasiharmonic Approximation. J. Chem. Theory Comput. 2005, 1, 1017–1028.

(133) Lee, M. S.; Olson, M. a. Calculation of Absolute Protein-Ligand Binding Affinity Using Path and Endpoint Approaches. Biophys. J. 2006, 90, 864–877.

(134) Frank, A. T.; Stelzer, A. C.; Al, H. M.; Andricioaei, I. Constructing RNA Dynamical Ensembles by Combining MD and Motionally Decoupled NMR RDCs: New Insights into RNA Dynamics and Adaptive Ligand Recognition. Nucleic Acids Res. 2009, 37, 3670–3679.

(135) Stelzer, A. C.; Kratz, J. D.; Zhang, Q.; Al-Hashimi, H. M. RNA Dynamics by Design: Biasing Ensembles towards the Ligand Bound State. Angew. Chemie - Int. Ed. 2010, 49, 5731–5733.

(136) Ozsolak, F.; Milos, P. M. RNA Sequencing: Advances, Challenges and Opportunities. Nat. Rev. Genet. 2011, 12, 87–98.

(137) Brueckner, F.; Armache, K. J.; Cheung, A.; Damsma, G. E.; Kettenberger, H.; Lehmann, E.; Sydow, J.; Cramer, P. Structure-Function Studies of the RNA Polymerase II Elongation Complex. Acta Crystallogr. Sect. D Biol. Crystallogr. 2009, 134

65, 112–120.

(138) Amaral, P. P.; Dinger, M. E.; Mercer, T. R.; Mattick, J. S. The Eukaryotic Genome as an RNA Machine. Science 2008, 319, 1787–1789.

(139) Carmell, M. A.; Hannon, G. J. RNase III Enzymes and the Initiation of Gene Silencing. Nat. Struct. Mol. Biol. 2004, 11, 214–218.

(140) Wu, H.; Xu, H.; Miraglia, L. J.; Crooke, S. T. Human RNase III Is a 160-kDa Protein Involved in Preribosomal RNA Processing. J. Biol. Chem. 2000, 275, 36957–36965.

(141) Mattijssen, S.; Hinson, E. R.; Onnekink, C.; Hermanns, P.; Zabel, B.; Cresswell, P.; Pruijn, G. J. M. Viperin mRNA Is a Novel Target for the Human RNase MRP/RNase P Endoribonuclease. Cell. Mol. Life Sci. 2011, 68, 2469–2480.

(142) Ablasser, A.; Bauernfeind, F.; Hartmann, G.; Latz, E.; Fitzgerald, K. A.; Hornung, V. RIG-I-Dependent Sensing of poly(dA:dT) through the Induction of an RNA Polymerase III-Transcribed RNA Intermediate. Nat. Immunol. 2009, 10, 1065–1072.

(143) Lucks, J. B.; Qi, L.; Mutalik, V. K.; Wang, D.; Arkin, A. P. Versatile RNA-Sensing Transcriptional Regulators for Engineering Genetic Networks. Proc. Natl. Acad. Sci. U. S. A. 2011, 108, 8617–8622.

(144) Tinoco, I.; Bustamante, C. How RNA Folds. J. Mol. Biol. 1999, 293, 271–281.

(145) Li, P. T. X.; Vieregg, J.; Tinoco, I. How RNA Unfolds and Refolds. Annu. Rev. Biochem. 2008, 77, 77–100.

(146) Ferré-D’Amaré, A. R.; Doudna, J. A. RNA Folds: Insights from Recent Crystal Structures. Annu. Rev. Biophys. Biomol. Struct. 1999, 28, 57–73.

(147) Noller, H. F. RNA Structure: Reading the Ribosome. Science 2005, 309, 1508–1514.

(148) Conn, G. L.; Draper, D. E. RNA Structure. Curr. Opin. Struct. Biol. 1998, 8, 278–285.

135

(149) Roy, S.; Delling, U.; Chen, C. H. A Bulge Structure in HIV-I TAR RNA Is Required for Tat Binding and Tat- Mediated Trans-Activation. Genes Dev. 1990, 1365–1373.

(150) Richter, S.; Ping, Y.-H.; Rana, T. M. TAR RNA Loop: A Scaffold for the Assembly of a Regulatory Switch in HIV Replication. Proc. Natl. Acad. Sci. U. S. A. 2002, 99, 7928–7933.

(151) Guan, L.; Disney, M. D. Recent Advances in Developing Small Molecules Targeting RNA. ACS Chemical Biology, 2012, 7, 73–86.

(152) Gallego, J.; Varani, G. Targeting RNA with Small-Molecule Drugs: Therapeutic Promise and Chemical Challenges. Acc. Chem. Res. 2001, 34, 836–843.

(153) Tor, Y. Targeting RNA with Small Molecules. ChemBioChem, 2003, 4, 998–1007.

(154) Ruiz-Carmona, S.; Alvarez-Garcia, D.; Foloppe, N.; Garmendia-Doval, a. B.; Juhos, S.; Schmidtke, P.; Barril, X.; Hubbard, R. E.; Morley, S. D. rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLoS Comput. Biol. 2014, 10, 1–7.

(155) Morley, S. D.; Afshar, M. Validation of an Empirical RNA-Ligand Scoring Function for Fast Flexible Docking Using RiboDock®. J. Comput. Aided. Mol. Des. 2004, 18, 189–208.

(156) Stelzer, A. C.; Frank, A. T.; Kratz, J. D.; Swanson, M. D.; Gonzalez-Hernandez, M. J.; Lee, J.; Andricioaei, I.; Markovitz, D. M.; Al-Hashimi, H. M. Discovery of Selective Bioactive Small Molecules by Targeting an RNA Dynamic Ensemble. Nat. Chem. Biol. 2011, 7, 553–559.

(157) Chinchilla, R.; Nájera, C. The Sonogashira Reaction: A Booming Methodology in Synthetic Organic Chemistry. Chemical Reviews, 2007, 107, 874–922.

(158) Jelesarov, I.; Bosshard, H. R. Isothermal Titration Calorimetry and Differential Scanning Calorimetry as Complementary Tools to Investigate the Energetics of Biomolecular Recognition. Journal of Molecular Recognition. 1999, pp. 3–18.

136

(159) Leavitt, S.; Freire, E. Direct Measurement of Protein Binding Energetics by Isothermal Titration Calorimetry. Curr. Opin. Struct. Biol. 2001, 11, 560–566.

(160) Doyle, M. Characterization of Binding Interactions by Isothermal Titration Calorimetry. Curr. Opin. Biotechnol. 1997, 8, 31–35.

137

Biography

Chetan Rupakheti was born in the beautiful Himalayan nation of Nepal. His birth city is Biratnagar, the industrial town in the southern plains of Nepal. He moved to the capital of Nepal, Kathmandu to purse his schooling. He was passionate about genetics, physics, and mathematics since school. A lover of multiple disciplines, he chose to major in applied mathematics, biology, and minor chemistry at Hampden-Sydney

College in Virginia. During undergraduate studies, he was awarded the Madison honors scholarship and inducted into the Phi Beta Kappa honors society. Continuing the interdisciplinary interest, he joined the CBB program at Duke University and the

Beratan lab where he has worked on developing strategies to survey large chemical spaces for various targeted applications.

Publications

Rupakheti, C. et al. A Strategy to Discover Diverse Optimal Molecules in the Small Molecule Universe. J. Chem. Inf. Model. , 2015, 55(3), pp 529-537 (ACS Editors Choice).

Rupakheti, C. et al. A Strategy to develop Property-Biased Diversity- Oriented Molecular Libraries: Applications to Organic Light Emitting Diodes. (In submission at JCTC)

Rupakheti, C. et al. Efficiently charting astronomically large chemical spaces to search drug leads. (In preparation)

138